## Analytics Vidhya - 12 Data Wrangling Techniques

https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/?

----

In [99]:
import numpy as np
import pandas as pd
data = pd.read_csv('Loan_Prediction.csv', index_col = 'Loan_ID')
data.head()

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


## 1 – Boolean Indexing in Pandas

How do you filter the values of one column based on conditions on other columns of a Pandas DataFrame? For instance, suppose we want a list of all females who are not graduates and got a loan. Boolean indexing handles this: each condition produces a True/False mask, and `&` combines the masks. You can use the following code:

In [19]:
data.loc[(data['Education'] != 'Graduate') & (data['Gender'] == 'Female') & (data['Loan_Status'] == 'Y'), ['Gender', 'Education', 'Loan_Status']]

Unnamed: 0,Gender,Education,Loan_Status
50,Female,Not Graduate,Y
197,Female,Not Graduate,Y
205,Female,Not Graduate,Y
279,Female,Not Graduate,Y
403,Female,Not Graduate,Y
407,Female,Not Graduate,Y
439,Female,Not Graduate,Y
463,Female,Not Graduate,Y
468,Female,Not Graduate,Y
480,Female,Not Graduate,Y


In [30]:
data[(data['Education'] != 'Graduate') & (data['Gender'] == 'Female') & (data['Loan_Status'] == 'Y')].shape

(14, 12)
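
To make the mechanics visible, here is a minimal sketch on a hypothetical four-row frame (the rows and the `A1`..`A4` labels are invented for illustration; only the column names come from the loan dataset). It builds each condition as its own boolean Series before combining them.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the loan dataset.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "Y", "N"],
    },
    index=["A1", "A2", "A3", "A4"],
)

# Each comparison yields a boolean Series aligned on the index.
not_graduate = toy["Education"] != "Graduate"
female = toy["Gender"] == "Female"
approved = toy["Loan_Status"] == "Y"

# & combines the masks element-wise; .loc keeps the rows where the result is True.
mask = not_graduate & female & approved
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)
```

The parentheses in the one-line version matter because `&` binds more tightly than `!=` and `==` in Python; naming the masks first sidesteps that.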

## 2 – Apply Function in Pandas

Apply is one of the most commonly used Pandas functions for manipulating a dataframe and creating new variables. It passes each row or column of the dataframe to a function and returns the result. The function can be built-in or user-defined. For instance, here it is used to count the missing values in each column and in each row.

In [34]:
# Create the function to find number of missing values in the passed series :
def num_missing(x):
    return sum(x.isnull())

# Applying per column :
print("Missing Values per column")
print(data.apply(num_missing, axis = 0))

#Applying per row :
print("Missing Values per row")
data.apply(num_missing, axis = 1).head()


Missing Values per column
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
Missing Values per row


Loan_ID
LP001002    1
LP001003    0
LP001005    0
LP001006    0
LP001008    0
dtype: int64
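
For this particular task, the same counts can also be obtained without `apply`. The sketch below, on a hypothetical three-row frame, compares the `apply` route with the vectorized `isnull().sum()` route; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy = pd.DataFrame(
    {
        "LoanAmount": [np.nan, 128.0, 66.0],
        "Credit_History": [1.0, np.nan, np.nan],
        "Property_Area": ["Urban", "Rural", "Urban"],
    }
)

def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return x.isnull().sum()

per_column_apply = toy.apply(num_missing, axis=0)

# Vectorized equivalents: sum the boolean mask down columns or across rows.
per_column_direct = toy.isnull().sum()
per_row_direct = toy.isnull().sum(axis=1)

print(per_column_direct)
print(per_row_direct)
```

`apply` remains the more general tool, since it accepts any function of a row or column.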

## 3 – Imputing missing values using Pandas

‘fillna()’ fills missing values in one go. It is typically used to replace missing values with the overall mean, mode, or median of the column. Let’s impute the ‘Gender’, ‘Married’ and ‘Self_Employed’ columns with their respective modes.

In [38]:
# Determine the mode with pandas (recent SciPy releases no longer accept string data in scipy.stats.mode)
data['Gender'].mode()[0]

'Male'

In [100]:
# Impute the values; assigning back avoids relying on inplace changes to a column selection
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])
data['Married'] = data['Married'].fillna(data['Married'].mode()[0])
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Gender                0
Married               0
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
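
When several columns need different fill values, `fillna` also accepts a dict mapping column names to values. The following is a small sketch on hypothetical data (the rows are invented; using the median for ‘Loan_Amount_Term’ is an assumption for illustration, not something the steps above do).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame(
    {
        "Gender": ["Male", None, "Female", "Male"],
        "Married": ["Yes", "Yes", None, "No"],
        "Loan_Amount_Term": [360.0, np.nan, 180.0, 360.0],
    }
)

# One fill value per column: mode for categorical columns, median for the numeric one.
fill_values = {
    "Gender": toy["Gender"].mode()[0],
    "Married": toy["Married"].mode()[0],
    "Loan_Amount_Term": toy["Loan_Amount_Term"].median(),
}

# fillna with a dict fills each listed column with its own value and returns a new frame.
filled = toy.fillna(value=fill_values)
print(filled)
```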




## 4 – Pivot Table in Pandas

Pandas can create MS Excel style pivot tables. In this dataset, the key column ‘LoanAmount’ has missing values. We can impute them using the mean amount of each ‘Gender’, ‘Married’ and ‘Self_Employed’ group. The mean ‘LoanAmount’ of each group can be determined as:

In [94]:
#Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc="mean")

print(impute_grps)

                              LoanAmount
Gender Married Self_Employed            
Female No      No             114.691176
               Yes            125.800000
       Yes     No             134.222222
               Yes            282.250000
Male   No      No             129.936937
               Yes            180.588235
       Yes     No             153.882736
               Yes            169.395833


In [78]:
# Display a flat copy; impute_grps itself keeps its MultiIndex, which the lookup in section 5 relies on
impute_grps.reset_index()

Unnamed: 0,Gender,Married,Self_Employed,LoanAmount
0,Female,No,No,114.691176
1,Female,No,Yes,125.8
2,Female,Yes,No,134.222222
3,Female,Yes,Yes,282.25
4,Male,No,No,129.936937
5,Male,No,Yes,180.588235
6,Male,Yes,No,153.882736
7,Male,Yes,Yes,169.395833


We used pivot_table for imputation above; let's look at some other ways of using it:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html

https://stackoverflow.com/questions/9588331/simple-cross-tabulation-in-pandas

In [47]:
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
df

Unnamed: 0,A,B,C,D,E
0,foo,one,small,1,2
1,foo,one,large,2,4
2,foo,one,large,2,5
3,foo,two,small,3,5
4,foo,two,small,3,6
5,bar,one,large,4,6
6,bar,one,small,5,8
7,bar,two,small,6,9
8,bar,two,large,7,9


In [52]:
table = pd.pivot_table(df, index=['A', 'B'], values='D',
                    columns=['C'], aggfunc="sum", fill_value=0)
table

A,B,large,small
bar,one,4,5
bar,two,7,6
foo,one,4,1
foo,two,0,6


In [60]:
table = pd.pivot_table(df, index=['A', 'B'], columns='C', aggfunc=len)
table

A,B,D large,D small,E large,E small
bar,one,1.0,1.0,1.0,1.0
bar,two,1.0,1.0,1.0,1.0
foo,one,2.0,1.0,2.0,1.0
foo,two,,2.0,,2.0


In [48]:
table = pd.pivot_table(df, index=['A', 'C'], values=['D', 'E'],
                    aggfunc={'D': "sum",
                             'E': ["min", "max", "mean"]})

table

A,C,D sum,E max,E mean,E min
bar,large,11,9.0,7.5,6.0
bar,small,11,9.0,8.5,8.0
foo,large,4,5.0,4.5,4.0
foo,small,7,6.0,4.333333,2.0
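
One way to think about these tables is as a group-by followed by a reshape. The sketch below rebuilds the first pivot (sum of `D` by `A`, `B` and `C`) with `groupby` and `unstack` on the same example frame; this describes the result, not how pandas implements `pivot_table` internally.

```python
import pandas as pd

# Same example frame as in the cells above.
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

# Pivot: rows are (A, B), columns are the values of C, cells are sums of D.
pivoted = pd.pivot_table(df, index=["A", "B"], columns="C",
                         values="D", aggfunc="sum", fill_value=0)

# Same numbers via groupby: aggregate first, then move C from the index into the columns.
grouped = df.groupby(["A", "B", "C"])["D"].sum().unstack("C", fill_value=0)

print(pivoted)
print(grouped)
```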


## 5 – Multi-Indexing in Pandas Dataframe

If you look at the output of step #4, it has an unusual property: each index entry is a combination of 3 values (‘Gender’, ‘Married’ and ‘Self_Employed’). This is called Multi-Indexing. It helps in looking up a group's value quickly by that combination of keys.

Continuing the example from #4, we have the mean for each group, but the missing values have not been imputed yet.
This can be done using the pandas techniques covered so far.
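
As a small illustration of a MultiIndex lookup, the sketch below builds a group-mean table from hypothetical rows (all values invented; each group has a single row, so the mean equals that row's value) and reads one group's mean with a tuple key.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the loan dataset.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Female", "Male", "Male"],
        "Married": ["No", "Yes", "No", "Yes"],
        "Self_Employed": ["No", "No", "No", "Yes"],
        "LoanAmount": [110.0, 130.0, 125.0, 150.0],
    }
)

# The three index columns become the three levels of a MultiIndex.
means = toy.pivot_table(values=["LoanAmount"],
                        index=["Gender", "Married", "Self_Employed"],
                        aggfunc="mean")

# A tuple with one entry per level selects a single group.
key = ("Male", "Yes", "Yes")
as_series = means.loc[key]                 # Series indexed by column name
value = means.loc[key].values[0]           # plain number taken from that Series
scalar = means.loc[key, "LoanAmount"]      # same number, selecting the column directly

print(means.index.nlevels, value, scalar)
```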

In [80]:
print(data.apply(num_missing, axis=0))

Gender                0
Married               0
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64


In [None]:
#iterate only through rows with missing LoanAmount
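for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    # Build the MultiIndex key (Gender, Married, Self_Employed) for this row
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    # Look up the group mean and write it back as a plain number
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]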

In [102]:
#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Gender                0
Married               0
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64


In [82]:
data.loc[data['LoanAmount'].isnull(),:].head()

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
LP001106,Male,Yes,0,Graduate,No,2275,2067.0,,360.0,1.0,Urban,Y
LP001213,Male,Yes,1,Graduate,No,4945,0.0,,360.0,0.0,Rural,N
LP001266,Male,Yes,1,Graduate,Yes,2395,0.0,,360.0,1.0,Semiurban,Y
LP001326,Male,No,0,Graduate,No,6782,0.0,,360.0,,Urban,N


Note:

- A MultiIndex requires a tuple to identify a group of index values in a pandas loc statement. This is why the loop builds a tuple from the row's ‘Gender’, ‘Married’ and ‘Self_Employed’ values.

- The .values[0] suffix is required because, by default, the lookup returns a Series whose index does not match that of the dataframe. In this case, a direct assignment gives an error.
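
The row-by-row loop can also be written without iteration. The sketch below is an alternative, not the method used above: `groupby(...).transform("mean")` returns each row's group mean aligned to the original index, and `fillna` then uses it only where ‘LoanAmount’ is missing. The toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names mirror the loan dataset.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Female", "Female"],
        "Married": ["Yes", "Yes", "Yes", "No", "No"],
        "Self_Employed": ["No", "No", "No", "No", "No"],
        "LoanAmount": [100.0, 200.0, np.nan, 80.0, np.nan],
    }
)

group_cols = ["Gender", "Married", "Self_Employed"]

# transform("mean") broadcasts each group's mean (missing values skipped) back to every row.
group_mean = toy.groupby(group_cols)["LoanAmount"].transform("mean")

# Fill only the missing entries; existing values are left untouched.
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_mean)
print(toy)
```

Because the result of `transform` shares the frame's index, no tuple keys or `.values[0]` are needed here.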

## 6 – Pandas Crosstab

This function is used to get an initial “feel” (view) of the data. With it, we can check some basic hypotheses. For instance, in this case, “Credit_History” is expected to affect the loan status significantly. This can be tested using cross-tabulation as shown below:

In [103]:
pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True)

Credit_History,N,Y,All
0.0,82,7,89
1.0,97,378,475
All,179,385,564


We can get the same result from pivot_table as well. **Just make sure the values argument is a numeric column with no missing values.**

In [109]:
table = pd.pivot_table(data, index=['Credit_History'], columns=['Loan_Status'], values='ApplicantIncome', aggfunc=len, margins=True)
table

Credit_History,N,Y,All
0.0,82,7,89
1.0,97,378,475
All,179,385,564


These are absolute numbers, but percentages are often more intuitive for quick insights. We can convert each row to percentages of its total using the Pandas apply function:

In [119]:
def percConvert(ser):
    # The last entry of each row is the 'All' total; .iloc selects it by position
    return ser/float(ser.iloc[-1]) * 100

In [120]:
pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True).apply(percConvert, axis=1)

Credit_History,N,Y,All
0.0,92.134831,7.865169,100.0
1.0,20.421053,79.578947,100.0
All,31.737589,68.262411,100.0
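
Row percentages can also be requested from `crosstab` directly through its `normalize` parameter. The sketch below uses six hypothetical applicants (values invented for illustration) and leaves out margins to keep the output small.

```python
import pandas as pd

# Hypothetical toy data: six applicants with invented values.
credit = pd.Series([1.0, 1.0, 1.0, 1.0, 0.0, 0.0], name="Credit_History")
status = pd.Series(["Y", "Y", "Y", "N", "N", "Y"], name="Loan_Status")

# normalize="index" divides each row by its row total; * 100 turns fractions into percentages.
row_pct = pd.crosstab(credit, status, normalize="index") * 100
print(row_pct)
```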


In [121]:
table = pd.pivot_table(data, index=['Credit_History'], columns=['Loan_Status'], values='ApplicantIncome', aggfunc=len, margins=True)
print(table)
table.apply(percConvert, axis=1)

Loan_Status       N    Y  All
Credit_History               
0.0              82    7   89
1.0              97  378  475
All             179  385  564


Credit_History,N,Y,All
0.0,92.134831,7.865169,100.0
1.0,20.421053,79.578947,100.0
All,31.737589,68.262411,100.0


Now it is evident that people with a credit history have a much higher chance of getting a loan: about 80% of people with a credit history got a loan, compared to only about 8% of those without one.

But that’s not all; the table tells an interesting story. Since having a credit history is so important, what if I predict loan status to be Y for applicants with a credit history and N otherwise? Surprisingly, we’ll be right 82+378=460 times out of 614, which is a whopping 75%! (The crosstab covers only the 564 rows with a known ‘Credit_History’; the 614 in the denominator also includes the 50 rows where it is missing.)

I won’t blame you if you’re wondering why the hell do we need statistical models. But trust me, increasing the accuracy by even 0.001% beyond this mark is a challenging task.
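
The arithmetic behind that baseline can be checked in a few lines. The counts are taken from the crosstab above and the total of 614 from the text; the variable names are only illustrative.

```python
# Counts from the crosstab: correct "N" predictions for Credit_History 0.0
# and correct "Y" predictions for Credit_History 1.0.
correct_rejected = 82
correct_approved = 378

# Total number of applicants quoted in the text.
total_applicants = 614

correct = correct_rejected + correct_approved
accuracy = correct / total_applicants

print(f"{correct} correct out of {total_applicants}: {accuracy:.1%}")
```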

----
Exploring Cross Tab functionality further

> https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html