## What is Pandas?
  
Pandas is a software library written for the Python programming language
for data **manipulation and analysis**.

* Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas.
* Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
* The two primary components of pandas are the **Series** and the **DataFrame**.

## Loan dataset

https://www.kaggle.com/animeshparikshya/loan-dataset

## Importing required libraries

In [113]:
import numpy as np
print('numpy version : ', np.__version__)

import pandas as pd
print('pandas version : ', pd.__version__)

import warnings
warnings.filterwarnings('ignore')

numpy version :  1.24.4
pandas version :  1.5.3


## 1. Pandas.apply()

* 1.1 **dataframe['column_name'].apply(function):** applies a Python or NumPy function to every value of a particular column.
* 1.2 **dataframe.apply(function):** applies a Python or NumPy function along an axis of an entire dataframe (to each column with `axis=0`, to each row with `axis=1`).

In [114]:
data = pd.read_csv("dataset/loan.csv")
data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


### Use .apply() function on a particular column

In [115]:
data['ApplicantIncome']

0      5849
1      4583
2      3000
3      2583
4      6000
       ... 
609    2900
610    4106
611    8072
612    7583
613    4583
Name: ApplicantIncome, Length: 614, dtype: int64

In [116]:
def check_applicant_income(income):
  
    if income<3000:
        return "Low"
    elif income>= 3000 and income<6000:
        return "Normal"
    else:
        return "High"

In [117]:
data['ApplicantIncome'] = data['ApplicantIncome'].apply(check_applicant_income)
data['ApplicantIncome']

0      Normal
1      Normal
2      Normal
3         Low
4        High
        ...  
609       Low
610    Normal
611      High
612      High
613    Normal
Name: ApplicantIncome, Length: 614, dtype: object
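
For simple threshold rules like the one above, a vectorized binning call can stand in for `.apply()`. The following is a minimal sketch, assuming the same three bands as `check_applicant_income`; the small `incomes` Series is a hypothetical sample built from the first five values shown earlier, not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the first five ApplicantIncome values shown above.
incomes = pd.Series([5849, 4583, 3000, 2583, 6000], name="ApplicantIncome")

# right=False makes every bin closed on the left:
# [-inf, 3000) -> Low, [3000, 6000) -> Normal, [6000, inf) -> High
income_band = pd.cut(
    incomes,
    bins=[-np.inf, 3000, 6000, np.inf],
    labels=["Low", "Normal", "High"],
    right=False,
)

print(income_band.tolist())
# ['Normal', 'Normal', 'Normal', 'Low', 'High']
```

The result is a categorical Series rather than plain strings, which keeps the band order (Low < Normal < High) available for sorting.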

In [118]:
data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,Normal,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,Normal,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,Normal,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,Low,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,High,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,Low,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,Normal,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,High,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,High,0.0,187.0,360.0,1.0,Urban,Y


### Use .apply() function on an entire dataframe

In [119]:
data = {'Column_1':[1, 2, 3],
             'Column_2':[4, 5, 6],
             'Column_3':[7, 8, 9]}

df = pd.DataFrame(data)
df

Unnamed: 0,Column_1,Column_2,Column_3
0,1,4,7
1,2,5,8
2,3,6,9


In [120]:
df['Total'] = df.apply(np.sum, axis = 1)
df

Unnamed: 0,Column_1,Column_2,Column_3,Total
0,1,4,7,12
1,2,5,8,15
2,3,6,9,18
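
The `axis` argument decides what the function receives. As a small sketch on the same three-column frame (the lambda functions are illustrative choices, not part of the original notebook): with `axis=0` each column is passed in, with `axis=1` each row.

```python
import pandas as pd

df = pd.DataFrame({
    "Column_1": [1, 2, 3],
    "Column_2": [4, 5, 6],
    "Column_3": [7, 8, 9],
})

# axis=0: the function is called once per column
column_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function is called once per row
row_totals = df.apply(lambda row: row.sum(), axis=1)

print(column_range.tolist())  # [2, 2, 2]
print(row_totals.tolist())    # [12, 15, 18]
```

For a plain row sum, the built-in `df.sum(axis=1)` gives the same values without calling a Python function per row.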


## 2. Pandas dataframe.aggregate()

* 2.1 **dataframe.aggregate(['sum', 'min', 'max']):** applies one or more aggregation functions along an axis of the dataframe (per column with `axis=0`, the default; per row with `axis=1`).

In [121]:
data = {'Column_1':[1, 2, 3],
             'Column_2':[4, 5, 6],
             'Column_3':[7, 8, 9]}

df = pd.DataFrame(data)
df

Unnamed: 0,Column_1,Column_2,Column_3
0,1,4,7
1,2,5,8
2,3,6,9


In [122]:
df.aggregate(['sum', 'min', 'max'])

Unnamed: 0,Column_1,Column_2,Column_3
sum,6,15,24
min,1,4,7
max,3,6,9


In [123]:
df.aggregate(['sum', 'min', 'max'], axis=0)

Unnamed: 0,Column_1,Column_2,Column_3
sum,6,15,24
min,1,4,7
max,3,6,9


In [124]:
df.aggregate(['sum', 'min', 'max'], axis=1)

Unnamed: 0,sum,min,max
0,12,1,7
1,15,2,8
2,18,3,9
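
`aggregate()` (alias `agg()`) also accepts a dictionary that maps column names to aggregation functions, so each column can get its own aggregation. A minimal sketch on the same frame; the choice of columns and functions here is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Column_1": [1, 2, 3],
    "Column_2": [4, 5, 6],
    "Column_3": [7, 8, 9],
})

# One aggregation per selected column; columns not listed are left out
per_column = df.agg({"Column_1": "sum", "Column_2": "max"})

print(per_column["Column_1"])  # 6
print(per_column["Column_2"])  # 6
```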


## 3. Pandas dataframe.mean()

* 3.1 **dataframe.mean():** returns the mean along the given axis (`axis=0` for the mean of each column, `axis=1` for the mean of each row).

In [125]:
data = {'Column_1':[1, 2, 3],
             'Column_2':[4, 5, 6],
             'Column_3':[7, 8, 9]}

df = pd.DataFrame(data)
df

Unnamed: 0,Column_1,Column_2,Column_3
0,1,4,7
1,2,5,8
2,3,6,9


In [126]:
df.mean(axis = 0)

Column_1    2.0
Column_2    5.0
Column_3    8.0
dtype: float64

In [127]:
df.mean(axis = 1)

0    4.0
1    5.0
2    6.0
dtype: float64

In [128]:
df['Column_1'].mean()

2.0
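
The loan dataset contains missing values (for example in `LoanAmount`), so it helps to know that `mean()` skips `NaN` by default. A short sketch with a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
amounts = pd.Series([1.0, np.nan, 3.0])

mean_skipping_nan = amounts.mean()              # NaN is ignored: (1 + 3) / 2
mean_with_nan = amounts.mean(skipna=False)      # NaN propagates to the result

print(mean_skipping_nan)  # 2.0
print(mean_with_nan)      # nan
```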

## 4. Pandas dataframe.mad()

* 4.1 **dataframe.mad():** returns the **Mean Absolute Deviation (MAD)** of the values along the requested axis (per column by default). Note that `mad()` was deprecated in pandas 1.5 and removed in pandas 2.0, so the calls below run on the pandas 1.5.3 used here but not on pandas 2.x.


### Mean Absolute Deviation (MAD)
* The **Mean Absolute Deviation (MAD)** of a dataset is the average distance between each data point and the mean. It gives us an idea of the variability in the dataset.
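
The definition above can be computed directly, which also works on pandas versions that no longer provide `mad()`. This is a minimal sketch of the formula, not a drop-in replacement for every option `mad()` accepted:

```python
import pandas as pd

df = pd.DataFrame({
    "Column_1": [1, 2, 3],
    "Column_2": [4, 5, 6],
    "Column_3": [7, 8, 9],
})

# Distance of every value from its column mean, then the average of those distances
mad_manual = (df - df.mean()).abs().mean()

print(mad_manual)
```

For `Column_1` the mean is 2, the absolute deviations are 1, 0 and 1, and their average is 2/3 ≈ 0.666667.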

In [129]:
data = {'Column_1':[1, 2, 3],
             'Column_2':[4, 5, 6],
             'Column_3':[7, 8, 9]}

df = pd.DataFrame(data)
df

Unnamed: 0,Column_1,Column_2,Column_3
0,1,4,7
1,2,5,8
2,3,6,9


In [130]:
df.mad()

Column_1    0.666667
Column_2    0.666667
Column_3    0.666667
dtype: float64

In [131]:
df['Column_1'].mad()

0.6666666666666666

## 5. Pandas dataframe.sem()

* 5.1 **dataframe.sem():** returns the standard error of the mean along the requested axis (per column by default).


### Standard Error (SE) 

* **Pandas dataframe.sem()** returns the unbiased standard error of the mean over the requested axis. By default it is normalized by N-1 (`ddof=1`).
* The **standard error (SE)** of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the parameter or the statistic is the mean, it is called the standard error of the mean (SEM).
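
As a sanity check on the definition, the SEM can be reproduced as the sample standard deviation divided by the square root of the number of observations. A small sketch, assuming the default `ddof=1`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column_1": [1, 2, 3],
    "Column_2": [4, 5, 6],
    "Column_3": [7, 8, 9],
})

# Sample standard deviation (ddof=1) divided by sqrt(n), computed per column
sem_manual = df.std(ddof=1) / np.sqrt(len(df))

print(sem_manual)
print(np.allclose(sem_manual, df.sem()))  # True
```

Each column has a sample standard deviation of 1 and three observations, so the SEM is 1/√3 ≈ 0.57735.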

In [132]:
data = {'Column_1':[1, 2, 3],
             'Column_2':[4, 5, 6],
             'Column_3':[7, 8, 9]}

df = pd.DataFrame(data)
df

Unnamed: 0,Column_1,Column_2,Column_3
0,1,4,7
1,2,5,8
2,3,6,9


In [133]:
df.sem()

Column_1    0.57735
Column_2    0.57735
Column_3    0.57735
dtype: float64

In [134]:
df.sem(axis=1)

0    1.732051
1    1.732051
2    1.732051
dtype: float64

In [135]:
df['Column_1'].sem()

0.5773502691896258

## 6. Pandas Series.value_counts()

* 6.1 **Series.value_counts():** returns a Series containing the count of each unique value, sorted from most to least frequent. Missing values are excluded by default.

In [136]:
data = pd.read_csv("dataset/loan.csv")
data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [137]:
data['Gender'].value_counts()

Male      489
Female    112
Name: Gender, dtype: int64

In [138]:
data['Education'].value_counts()

Graduate        480
Not Graduate    134
Name: Education, dtype: int64

In [139]:
data['Self_Employed'].value_counts()

No     500
Yes     82
Name: Self_Employed, dtype: int64

In [140]:
data['ApplicantIncome'].value_counts()

2500    9
4583    6
6000    6
2600    6
3333    5
       ..
3244    1
4408    1
3917    1
3992    1
7583    1
Name: ApplicantIncome, Length: 505, dtype: int64
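
The `Gender` counts above add up to 601 rather than 614 because missing values are dropped by default. The sketch below uses a small hypothetical Series to show the `dropna` and `normalize` options:

```python
import pandas as pd

# Hypothetical sample with one missing value
gender = pd.Series(["Male", "Female", "Male", None, "Male"])

counts = gender.value_counts()                   # missing value excluded
with_missing = gender.value_counts(dropna=False) # missing value counted too
shares = gender.value_counts(normalize=True)     # proportions instead of counts

print(counts["Male"], counts["Female"])  # 3 1
print(with_missing.sum())                # 5
print(shares["Male"])                    # 0.75
```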

## Quick Recap

### 1. Pandas.apply()

* 1.1 **dataframe['column_name'].apply(function):** applies a Python or NumPy function to every value of a particular column.
* 1.2 **dataframe.apply(function):** applies a Python or NumPy function along an axis of an entire dataframe (to each column with `axis=0`, to each row with `axis=1`).
    
    
### 2. Pandas dataframe.aggregate()

* 2.1 **dataframe.aggregate(['sum', 'min', 'max']):** applies one or more aggregation functions along an axis of the dataframe (per column with `axis=0`, the default; per row with `axis=1`).


### 3. Pandas dataframe.mean()

* 3.1 **dataframe.mean():** returns the mean along the given axis (`axis=0` for the mean of each column, `axis=1` for the mean of each row).


### 4. Pandas dataframe.mad()

* 4.1 **dataframe.mad():** returns the **Mean Absolute Deviation (MAD)** of the values along the requested axis (deprecated in pandas 1.5 and removed in pandas 2.0).


#### Mean Absolute Deviation (MAD)
* The **Mean Absolute Deviation (MAD)** of a dataset is the average distance between each data point and the mean. It gives us an idea of the variability in the dataset.
    

### 5. Pandas dataframe.sem()

* 5.1 **dataframe.sem():** returns the standard error of the mean along the requested axis (per column by default).


#### Standard Error (SE) 

* **Pandas dataframe.sem()** returns the unbiased standard error of the mean over the requested axis. By default it is normalized by N-1 (`ddof=1`).
* The **standard error (SE)** of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the parameter or the statistic is the mean, it is called the standard error of the mean (SEM).

### 6. Pandas Series.value_counts()

* 6.1 **Series.value_counts():** returns a Series containing the count of each unique value, sorted from most to least frequent. Missing values are excluded by default.