## Problem Statement

Dream Housing Finance Inc. specializes in home loans across different market segments: rural, urban and semi-urban. Their loan eligibility process is based on the customer details provided in an online application form. To create a targeted marketing campaign for the different segments, they have asked for a comprehensive analysis of the data collected so far.

## About the Dataset
The dataset has details of 614 customers with the following 13 features:

|Feature|Description|
|-----|-----|
|Loan_ID|Unique Loan ID|
|Gender|Male/Female|
|Married|Applicant Married (Y/N)|
|Dependents|Number of dependents|
|Education|Graduate/Not Graduate|
|Self_Employed|Self employed (Y/N)|
|ApplicantIncome|Income of the applicant|
|CoapplicantIncome|Income of the co-applicant|
|LoanAmount|Loan amount in thousands|
|Loan_Amount_Term|Term of loan in months|
|Credit_History|Credit history meets guidelines|
|Property_Area|Urban/Semi-Urban/Rural|
|Loan_Status|Loan approved (Y/N)|



Our major work for this project involves data analysis using Pandas. 

## Why solve this project?

After completing this project, you will have a better grip on working with pandas. In this project you will apply the following concepts.

 
- Dataframe slicing 
- Dataframe aggregation 
- Pivot table operations

In [1]:
# Import packages
import numpy as np
import pandas as pd
from scipy.stats import mode 

## Task 1
**Let's check which variable is categorical and which one is numerical so that you will get a basic idea about the features of the bank dataset.**

#### Instructions :

- Create dataframe `bank` by passing the `path` of the file


- Create the variable `'categorical_var'` using `bank.select_dtypes(include='object')` to collect all categorical columns.


- print `'categorical_var'`


- Create the variable `'numerical_var'` using `bank.select_dtypes(include='number')` to collect all numerical columns.


- print `'numerical_var'`
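
The dtype-based split described above can be tried on a small frame first. The sketch below is only an illustration: `toy` and its values are made up and are not part of the loan dataset, and the `Gender` column is given the `object` dtype explicitly so that the selection behaves the same regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for `bank`; the values are made up.
toy = pd.DataFrame({
    # dtype=object is set explicitly so include="object" matches this column.
    "Gender": pd.Series(["Male", "Female", "Male"], dtype="object"),
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [None, 128.0, 66.0],
})

# Columns holding Python objects (here: strings) are treated as categorical.
categorical_var = toy.select_dtypes(include="object")

# "number" matches every numeric dtype, both integers and floats.
numerical_var = toy.select_dtypes(include="number")

print(list(categorical_var.columns))
print(list(numerical_var.columns))
```

A column with missing entries, such as `LoanAmount` here, is still numeric: pandas stores the gap as `NaN` in a float column.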

In [2]:
path = '../data/data.csv'
bank = pd.read_csv(path)
bank.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 43.2+ KB


In [4]:
bank.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


In [5]:
bank.describe(exclude='number')

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,Property_Area,Loan_Status
count,614,601,611,599,614,582,614,614
unique,614,2,2,4,2,2,3,2
top,LP002784,Male,Yes,0,Graduate,No,Semiurban,Y
freq,1,489,398,345,480,500,233,422


In [6]:
bank.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [7]:
# Count of missing values per column, plus the same count as a percentage of all rows
null_counts = bank.isnull().sum()
nulls = pd.DataFrame({'count': null_counts, 'percentage': (null_counts / len(bank)) * 100})
nulls

Unnamed: 0,count,percentage
Loan_ID,0,0.0
Gender,13,2.117264
Married,3,0.488599
Dependents,15,2.442997
Education,0,0.0
Self_Employed,32,5.211726
ApplicantIncome,0,0.0
CoapplicantIncome,0,0.0
LoanAmount,22,3.583062
Loan_Amount_Term,14,2.28013


In [8]:
numerical_var = bank.select_dtypes(include='number')
numerical_var

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
0,5849,0.0,,360.0,1.0
1,4583,1508.0,128.0,360.0,1.0
2,3000,0.0,66.0,360.0,1.0
3,2583,2358.0,120.0,360.0,1.0
4,6000,0.0,141.0,360.0,1.0
...,...,...,...,...,...
609,2900,0.0,71.0,360.0,1.0
610,4106,0.0,40.0,180.0,1.0
611,8072,240.0,253.0,360.0,1.0
612,7583,0.0,187.0,360.0,1.0


In [9]:
categorical_var = bank.select_dtypes(exclude='number')
categorical_var

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,Urban,Y
4,LP001008,Male,No,0,Graduate,No,Urban,Y
...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,Urban,Y


## Task 2

**Sometimes customers forget to fill in all the details or they don't want to share other details. Because of that, some of the fields in the dataset will have missing values. Now you have to check which columns have missing values and also check the count of missing values each column has. If you get the columns that have missing values, try to fill them.**


#### Instructions :

- From the dataframe `bank`, drop the column `Loan_ID` to create a new dataframe `banks`

- To see the null values, use `"isnull().sum()"` function and print it.

- Calculate `mode` for the dataframe `banks` and store in `bank_mode`

- Fill the missing (NaN) values of `banks` with `bank_mode` and store the cleaned dataframe back in `banks`.

-  Check if all the missing values `(NaN)` are filled.
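
As a rough illustration of these steps, the sketch below runs mode imputation on a made-up frame. The names `toy`, `banks_toy` and `toy_mode` and all values are hypothetical; only the sequence of pandas calls mirrors the instructions.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3", "LP4"],
    "Gender": ["Male", None, "Male", "Female"],
    "LoanAmount": [120.0, None, 120.0, 66.0],
})

# Drop the identifier column, which carries no information to impute from.
banks_toy = toy.drop(columns="Loan_ID")

# mode() returns a DataFrame because a column can have several modes;
# .iloc[0] keeps the first mode of each column as a Series.
toy_mode = banks_toy.mode().iloc[0]

# fillna with a Series fills each column with the value under its own name.
banks_toy = banks_toy.fillna(toy_mode)

print(banks_toy.isnull().sum())
```

Taking `.iloc[0]` matters: passing the full `mode()` frame to `fillna` would align on row labels as well, so only rows whose index also appears in the mode frame would be filled.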

In [10]:
banks = bank.copy()
banks.drop('Loan_ID', axis=1, inplace=True)
banks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             601 non-null    object 
 1   Married            611 non-null    object 
 2   Dependents         599 non-null    object 
 3   Education          614 non-null    object 
 4   Self_Employed      582 non-null    object 
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         592 non-null    float64
 8   Loan_Amount_Term   600 non-null    float64
 9   Credit_History     564 non-null    float64
 10  Property_Area      614 non-null    object 
 11  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 40.8+ KB


In [11]:
banks.isnull().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [12]:
bank_mode = banks.mode().iloc[0]
bank_mode

Gender                    Male
Married                    Yes
Dependents                   0
Education             Graduate
Self_Employed               No
ApplicantIncome           2500
CoapplicantIncome            0
LoanAmount                 120
Loan_Amount_Term           360
Credit_History               1
Property_Area        Semiurban
Loan_Status                  Y
Name: 0, dtype: object

In [13]:
banks.fillna(bank_mode, inplace=True)
banks.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

## Task 3

**Now let's check the loan amount of an average person based on  `'Gender', 'Married', 'Self_Employed' `.  This will give a basic idea of the average loan amount of a person.**


#### Instructions :

- We will use previously created dataframe `banks` for this task.
- Generate a pivot table with index as `'Gender', 'Married', 'Self_Employed'` and values as `'LoanAmount'`,  using `mean aggregation`


- Store the result in a variable called `'avg_loan_amount'`
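
To see what such a pivot table looks like before running it on the full data, here is a minimal sketch on a made-up frame. `toy` and `avg_toy` are hypothetical names, and `aggfunc="mean"` is written out even though it is the default of `pd.pivot_table`.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Married": ["Yes", "Yes", "No", "No"],
    "Self_Employed": ["No", "No", "No", "Yes"],
    "LoanAmount": [100.0, 140.0, 80.0, 120.0],
})

# One row per observed (Gender, Married, Self_Employed) combination,
# holding the mean LoanAmount of that group.
avg_toy = pd.pivot_table(
    toy,
    index=["Gender", "Married", "Self_Employed"],
    values="LoanAmount",
    aggfunc="mean",
)

print(avg_toy)
```

Only combinations that occur in the data get a row, so this toy result has three rows rather than the eight possible combinations.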




In [14]:
# aggfunc='mean' is the default; it is spelled out to match the instructions
avg_loan_amount = pd.pivot_table(banks, index=['Gender', 'Married', 'Self_Employed'], values='LoanAmount', aggfunc='mean')
avg_loan_amount

Gender,Married,Self_Employed,LoanAmount
Female,No,No,114.768116
Female,No,Yes,125.272727
Female,Yes,No,133.714286
Female,Yes,Yes,282.25
Male,No,No,129.508621
Male,No,Yes,180.588235
Male,Yes,No,152.60815
Male,Yes,Yes,167.42


## Task 4

**Now let's check the percentage of loans approved based on a person's employment type.**


#### Instructions:

- We will use the previously created dataframe `banks` for this task.

- Create variable `'loan_approved_se'` and store the count of results where `Self_Employed` == `Yes` and `Loan_Status` == `Y`.

- Create variable `'loan_approved_nse'` and store the count of results where `Self_Employed` == `No` and `Loan_Status` == `Y`.

- `Loan_Status` count is given as `614`.

- Calculate the percentage of loan approval for self-employed people and store result in variable `'percentage_se'`. 

- Calculate the percentage of loan approval for people who are not self-employed and store the result in variable `'percentage_nse'`.
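
The counting and percentage steps can be sketched with boolean masks on a made-up frame. The frame `toy` and its five rows are hypothetical; on the real data the total would be the 614 rows mentioned above.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Self_Employed": ["Yes", "No", "No", "No", "Yes"],
    "Loan_Status": ["Y", "Y", "N", "Y", "N"],
})

approved = toy["Loan_Status"] == "Y"
self_employed = toy["Self_Employed"] == "Yes"
not_self_employed = toy["Self_Employed"] == "No"

# Summing a boolean mask counts its True entries.
loan_approved_se = int((self_employed & approved).sum())
loan_approved_nse = int((not_self_employed & approved).sum())

# Both percentages use the total number of applicants as the denominator.
total = len(toy)
percentage_se = loan_approved_se * 100 / total
percentage_nse = loan_approved_nse * 100 / total

print(percentage_se, percentage_nse)
```

Because both figures divide by the total number of applicants, they describe shares of the whole dataset rather than approval rates within each employment group.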



In [15]:
loan_approved_se = len(banks[(banks['Self_Employed']=='Yes') & (banks['Loan_Status']=='Y')])
loan_approved_se

56

In [16]:
loan_approved_nse = len(banks[(banks['Self_Employed']=='No') & (banks['Loan_Status']=='Y')])
loan_approved_nse

366

In [17]:
total_loan_status = len(banks['Loan_Status'])
total_loan_status

614

In [18]:
# Multiply by 100 so the shares are expressed as percentages
percentage_se = loan_approved_se * 100 / total_loan_status
percentage_nse = loan_approved_nse * 100 / total_loan_status
print(round(percentage_se, 2))
print(round(percentage_nse, 2))

9.12
59.61


## Task 5

**A government audit is coming up soon, so the company wants to identify the applicants with a long loan amount term.**

#### Instructions:

- Use the `apply()` function to convert `Loan_Amount_Term` from months to years and store the result in a variable `'loan_term'`.

- Find the number of applicants whose loan amount term is greater than or equal to 25 years and store that count in a variable called `'big_loan_term'`.
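
A small sketch of the conversion and the count, using a made-up Series in place of the real column (`term_months` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical loan terms in months; the values are made up for illustration.
term_months = pd.Series([360.0, 180.0, 300.0, 480.0], name="Loan_Amount_Term")

# apply() calls the function once per element: months -> years.
loan_term = term_months.apply(lambda months: months / 12)

# The comparison yields a boolean Series; its sum is the number of True values.
big_loan_term = int((loan_term >= 25).sum())

print(big_loan_term)
```

For a plain arithmetic conversion like this one, `term_months / 12` gives the same values without `apply()`; the task uses `apply()` to practise the function.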



In [19]:
loan_term = banks['Loan_Amount_Term'].apply(lambda x: x/12)
big_loan_term = (loan_term>=25).sum()
big_loan_term

554

## Task 6

**Now let's check the average applicant income and the average credit history for each loan status.**


#### Instructions :

- Groupby the `'banks'` dataframe by `Loan_Status` and store the result in a variable called `'loan_groupby'`

- Subset `'loan_groupby'` to include only  `['ApplicantIncome', 'Credit_History']` and store the subsetted dataframe back in `'loan_groupby'`

- Then find the `mean` of `'loan_groupby'` and store the result in a new variable `'mean_values'`
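
The three steps (group, subset, aggregate) can be sketched on a made-up frame as follows. The frame `toy` and its values are hypothetical; the variable names follow the instructions.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Loan_Status": ["Y", "N", "Y", "N"],
    "ApplicantIncome": [4000, 3000, 6000, 5000],
    "Credit_History": [1.0, 0.0, 1.0, 1.0],
    "LoanAmount": [120.0, 66.0, 141.0, 100.0],
})

# Step 1: group the rows by loan status (nothing is computed yet).
loan_groupby = toy.groupby("Loan_Status")

# Step 2: keep only the two columns of interest within each group.
loan_groupby = loan_groupby[["ApplicantIncome", "Credit_History"]]

# Step 3: aggregate each group with the mean.
mean_values = loan_groupby.mean()

print(mean_values)
```

The grouping key becomes the index of `mean_values`, so a single group's averages can be read with `mean_values.loc["Y"]`.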




In [20]:
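# Group by loan status and keep only the two columns of interest
loan_groupby = banks.groupby('Loan_Status')[['ApplicantIncome', 'Credit_History']]
# Mean per group, rounded to two decimals for display
mean_values = loan_groupby.mean().round(2)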
mean_values

Loan_Status,ApplicantIncome,Credit_History
N,5446.08,0.57
Y,5384.07,0.98
