 # <span style="color:#ff5f27;">🏦 Loan Approval </span>


This notebook is adapted from:
https://www.kaggle.com/code/faressayah/lending-club-loan-defaulters-prediction 

# <span style="color:#ff5f27;">📑 Introduction </span>

> `LendingClub` is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. `LendingClub` is the world's largest peer-to-peer lending platform.

> Solving this case study will give us an idea about how real business problems are solved using EDA and Machine Learning. In this case study, we will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.

# <span style="color:#ff5f27;">📝 Business Understanding </span>



> You work for the `LendingClub` company which specialises in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

> - If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company
> - If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company

> The data given contains information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of the loan, lending (to risky applicants) at a higher interest rate, etc.

> When a person applies for a loan, there are two types of decisions that could be taken by the company:
> 1. `Loan accepted`: If the company approves the loan, there are 3 possible scenarios described below:
>     - `Fully paid`: Applicant has fully paid the loan (the principal and the interest)
>     - `Current`: Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as 'defaulted'.
>     - `Charged-off`: Applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan
> 2. `Loan rejected`: The company rejected the loan (because the candidate does not meet their requirements etc.). Since the loan was rejected, there is no transactional history of those applicants with the company, so this data is not available to the company (and thus is not in this dataset)

# <span style="color:#ff5f27;"> 🎯 Business Objectives</span>

> - `LendingClub` is the largest online loan marketplace, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily access lower interest rate loans through a fast online interface. 
> - As with most other lending companies, lending to ‘`risky`’ applicants is the largest source of financial loss (called `credit loss`). The credit loss is the amount of money lost by the lender when the borrower refuses to pay or runs away with the money owed. In other words, borrowers who default cause the largest amount of loss to the lenders. In this case, the customers labelled as '`charged-off`' are the '`defaulters`'. 
> - If one is able to identify these risky loan applicants, then such loans can be reduced, thereby cutting down the amount of credit loss. Identification of such applicants using EDA and machine learning is the aim of this case study. 
> - In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment. 
> - To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).

# <span style="color:#ff5f27;"> 💾 Data Description</span>
----
Here is the information on this particular data set:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>sub_grade</td>
      <td>LC assigned loan subgrade</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_title</td>
      <td>The job title supplied by the Borrower when applying for the loan.*</td>
    </tr>
    <tr>
      <th>7</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>8</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>9</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>10</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>11</th>
      <td>issue_d</td>
      <td>The month which the loan was funded</td>
    </tr>
    <tr>
      <th>12</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>13</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>14</th>
      <td>title</td>
      <td>The loan title provided by the borrower</td>
    </tr>
    <tr>
      <th>15</th>
      <td>zip_code</td>
      <td>The first 3 numbers of the zip code provided by the borrower in the loan application.</td>
    </tr>
    <tr>
      <th>16</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>17</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>18</th>
      <td>earliest_cr_line</td>
      <td>The month the borrower's earliest reported credit line was opened</td>
    </tr>
    <tr>
      <th>19</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>20</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>21</th>
      <td>revol_bal</td>
      <td>Total credit revolving balance</td>
    </tr>
    <tr>
      <th>22</th>
      <td>revol_util</td>
      <td>Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.</td>
    </tr>
    <tr>
      <th>23</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>24</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
    <tr>
      <th>25</th>
      <td>application_type</td>
      <td>Indicates whether the loan is an individual application or a joint application with two co-borrowers</td>
    </tr>
    <tr>
      <th>26</th>
      <td>mort_acc</td>
      <td>Number of mortgage accounts.</td>
    </tr>
    <tr>
      <th>27</th>
      <td>pub_rec_bankruptcies</td>
      <td>Number of public record bankruptcies</td>
    </tr>
  </tbody>
</table>

---

In [None]:
!pip install -q hvplot

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats 
import matplotlib.pyplot as plt
import hvplot.pandas

pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

In [None]:
loans_df = pd.read_parquet("https://repo.hops.works/dev/jdowling/loans.parquet")
loans_df.head()

In [None]:
loans_df.info()

In [None]:
loans_df.describe()

In [None]:
applicants_df = pd.read_parquet("https://repo.hops.works/dev/jdowling/applicants.parquet")
applicants_df.head()

In [None]:
applicants_df.info()

In [None]:
applicants_df.describe()

# <span style="color:#ff5f27;">🔍 Exploratory Data Analysis</span>

> **OVERALL GOAL:** 
> - Get an understanding of which variables are important, view summary statistics, and visualize the data

## ✔️ `loan_status`

> Current status of the loan

In [None]:
loans_df['loan_status'].value_counts().hvplot.bar(
    title="Loan Status Counts", xlabel='Loan Status', ylabel='Count', 
    width=500, height=350
)

In [None]:
plt.figure(figsize=(12, 8))

# Identify and exclude non-numeric columns
numeric_columns = loans_df.select_dtypes(include=[np.number]).columns

sns.heatmap(loans_df[numeric_columns].corr(), annot=True, cmap='viridis')

### 📌 Notice
> We noticed an almost perfect correlation between the "`loan_amnt`" and "`installment`" features. We'll explore these features further: print out their descriptions and draw a scatterplot of one against the other. 

> - Does this relationship make sense to you? 
> - Do we think there is duplicate information here?
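
One likely reason for the strong correlation: for a fixed-rate, fully amortising loan, the monthly payment follows the standard annuity formula, so it scales linearly with the amount borrowed. The sketch below assumes that `int_rate` is an annual percentage and `term` a number of monthly payments; the helper `monthly_installment` is hypothetical and not part of the notebook.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Standard annuity payment for a fixed-rate, fully amortising loan."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        # No interest: the principal is simply split evenly.
        return principal / n_months
    return principal * r / (1 - (1 + r) ** -n_months)


# Illustrative values only: 10,000 borrowed at 12% a year over 36 months.
payment = monthly_installment(10_000, 12.0, 36)
print(round(payment, 2))

# With rate and term fixed, doubling the principal doubles the payment.
print(round(monthly_installment(20_000, 12.0, 36) / payment, 6))
```

If the dataset follows this formula, `installment` carries little information beyond `loan_amnt`, `int_rate` and `term`.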

## ✔️ `loan_amnt` & `installment`

> - `installment`: The monthly payment owed by the borrower if the loan originates.
> - `loan_amnt`: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

In [None]:
installment = loans_df.hvplot.hist(
    y='installment', by='loan_status', subplots=False, 
    width=350, height=400, bins=50, alpha=0.4, 
    title="Installment by Loan Status", 
    xlabel='Installment', ylabel='Counts', legend='top'
)

loan_amnt = loans_df.hvplot.hist(
    y='loan_amnt', by='loan_status', subplots=False, 
    width=350, height=400, bins=30, alpha=0.4, 
    title="Loan Amount by Loan Status", 
    xlabel='Loan Amount', ylabel='Counts', legend='top'
)

installment + loan_amnt

In [None]:
loan_amnt_box = loans_df.hvplot.box(
    y='loan_amnt', subplots=True, by='loan_status', width=300, height=350, 
    title="Loan Status by Loan Amount ", xlabel='Loan Status', ylabel='Loan Amount'
)

installment_box = loans_df.hvplot.box(
    y='installment', subplots=True, by='loan_status', width=300, height=350, 
    title="Loan Status by Installment", xlabel='Loan Status', ylabel='Installment'
)

# loan_amnt_box + installment_box

In [None]:
loans_df.groupby(by='loan_status')['loan_amnt'].describe()

## ✔️ `grade` & `sub_grade`

> - `grade`: LC assigned loan grade
> - `sub_grade`: LC assigned loan subgrade

Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans. 

What are the unique possible `grade` & `sub_grade`?

In [None]:
print(f"GRADE unique: {loans_df.grade.unique()}")
print(f"SUB_GRADE unique: {loans_df.sub_grade.unique()}")

In [None]:
fully_paid = loans_df.loc[loans_df['loan_status']=='Fully Paid', 'grade'].value_counts().hvplot.bar() 
charged_off = loans_df.loc[loans_df['loan_status']=='Charged Off', 'grade'].value_counts().hvplot.bar() 

(fully_paid * charged_off).opts(
    title="Loan Status by Grade", xlabel='Grades', ylabel='Count',
    width=500, height=450, legend_cols=2, legend_position='top_right', xrotation=90
)

In [None]:
fully_paid = loans_df.loc[loans_df['loan_status']=='Fully Paid', 'sub_grade'].value_counts().hvplot.bar() 
charged_off = loans_df.loc[loans_df['loan_status']=='Charged Off', 'sub_grade'].value_counts().hvplot.bar() 

(fully_paid * charged_off).opts(
    title="Loan Status by Sub-Grade", xlabel='Sub-Grades', ylabel='Count',
    width=500, height=400, legend_cols=2, legend_position='top_right', xrotation=90
)

In [None]:
# data.hvplot.bar()

In [None]:
plt.figure(figsize=(15, 10))

plt.subplot(2, 2, 1)
grade = sorted(loans_df.grade.unique().tolist())
sns.countplot(x='grade', data=loans_df, hue='loan_status', order=grade)

plt.subplot(2, 2, 2)
sub_grade = sorted(loans_df.sub_grade.unique().tolist())
g = sns.countplot(x='sub_grade', data=loans_df, hue='loan_status', order=sub_grade)
g.set_xticklabels(g.get_xticklabels(), rotation=90);

It looks like `F` and `G` grade loans don't get paid back that often. Isolate those and recreate the countplot just for their subgrades.
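
Raw counts can mislead here because the grades differ a lot in size, so a per-grade charge-off share is a useful cross-check. This is a minimal sketch on made-up data: the `toy` frame is hypothetical, and on the real data the same call would take `loans_df`.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "F", "F", "G"],
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off", "Fully Paid",
                    "Charged Off", "Charged Off", "Fully Paid", "Charged Off"],
})

# Row-normalised crosstab: each row sums to 1, so the 'Charged Off' column
# is the share of loans in that grade that were charged off.
rates = pd.crosstab(toy["grade"], toy["loan_status"], normalize="index")
print(rates["Charged Off"].sort_values(ascending=False))
```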

In [None]:
df = loans_df[(loans_df.grade == 'F') | (loans_df.grade == 'G')]

plt.figure(figsize=(15, 10))

plt.subplot(2, 2, 1)
grade = sorted(df.grade.unique().tolist())
sns.countplot(x='grade', data=df, hue='loan_status', order=grade)

plt.subplot(2, 2, 2)
sub_grade = sorted(df.sub_grade.unique().tolist())
sns.countplot(x='sub_grade', data=df, hue='loan_status', order=sub_grade)

## ✔️ `term`, `home_ownership`, `verification_status` & `purpose`

> - `term`: The number of payments on the loan. Values are in months and can be either 36 or 60.
> - `home_ownership`: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
> - `verification_status`: Indicates if income was verified by LC, not verified, or if the income source was verified
> - `purpose`: A category provided by the borrower for the loan request.

In [None]:
applicants_df['home_ownership'].value_counts()

In [None]:
# fully_paid = data.loc[data['loan_status']=='Fully Paid', 'home_ownership'].value_counts().hvplot.bar() 
# charged_off = data.loc[data['loan_status']=='Charged Off', 'home_ownership'].value_counts().hvplot.bar()

# home_ownership_count = (fully_paid * charged_off).opts(
#     title="Loan Status by Grade", xlabel='Home Ownership', ylabel='Count',
#     width=350, height=350, legend_cols=2, legend_position='top_right'
# ).opts(xrotation=90)

# home_ownership = data.home_ownership.value_counts().hvplot.bar(
#     title="Loan Status by Grade", xlabel='Home Ownership', ylabel='Count', 
#     width=350, height=350, legend='top'
# ).opts(xrotation=90)

# (home_ownership_count + home_ownership)

In [None]:
applicants_df.loc[(applicants_df.home_ownership == 'ANY') | 
                  (applicants_df.home_ownership == 'NONE'), 'home_ownership'] = 'OTHER'  
applicants_df.home_ownership.value_counts()

In [None]:
# applicants_df.loc[applicants_df['home_ownership']=='OTHER', 'loan_status'].value_counts()

joined_df = loans_df.merge(applicants_df, on="id")
joined_df

In [None]:
plt.figure(figsize=(15, 20))

plt.subplot(4, 2, 1)
sns.countplot(x='term', data=joined_df, hue='loan_status')

plt.subplot(4, 2, 2)
sns.countplot(x='home_ownership', data=joined_df, hue='loan_status')

plt.subplot(4, 2, 3)
sns.countplot(x='verification_status', data=joined_df, hue='loan_status')

plt.subplot(4, 2, 4)
g = sns.countplot(x='purpose', data=joined_df, hue='loan_status')
g.set_xticklabels(g.get_xticklabels(), rotation=90);

## ✔️ `int_rate` & `annual_inc`

> - `int_rate`: Interest Rate on the loan
> - `annual_inc`: The self-reported annual income provided by the borrower during registration

In [None]:
int_rate = joined_df.hvplot.hist(
    y='int_rate', by='loan_status', alpha=0.3, width=350, height=400,
    title="Loan Status by Interest Rate", xlabel='Interest Rate', ylabel='Loans Counts', 
    legend='top'
)

annual_inc = joined_df.hvplot.hist(
    y='annual_inc', by='loan_status', bins=50, alpha=0.3, width=350, height=400,
    title="Loan Status by Annual Income", xlabel='Annual Income', ylabel='Loans Counts', 
    legend='top'
).opts(xrotation=45)

int_rate + annual_inc

In [None]:
joined_df[joined_df.annual_inc <= 250000].hvplot.hist(
    y='annual_inc', by='loan_status', bins=50, alpha=0.3, width=500, height=400,
    title="Loan Status by Annual Income (<= 250000/Year)", 
    xlabel='Annual Income', ylabel='Loans Counts', legend='top'
).opts(xrotation=45)

In [None]:
print((applicants_df[applicants_df.annual_inc >= 250000].shape[0] / applicants_df.shape[0]) * 100)
print((applicants_df[applicants_df.annual_inc >= 1000000].shape[0] / applicants_df.shape[0]) * 100)

In [None]:
joined_df.loc[joined_df.annual_inc >= 1000000, 'loan_status'].value_counts()

In [None]:
joined_df.loc[joined_df.annual_inc >= 250000, 'loan_status'].value_counts()

- It seems that loans with a high interest rate are more likely to be unpaid.
- Only 75 or fewer borrowers have an annual income of more than 1 million, and 4077 have an annual income of 250,000 or more (the other threshold checked above).
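
The interest-rate observation can be checked numerically by bucketing the rate and comparing the charged-off share per bucket. This is a sketch on made-up numbers; the `toy` frame and the bin edges are arbitrary assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "int_rate": [6.0, 7.5, 9.0, 11.0, 13.5, 14.0, 18.0, 21.0, 24.0],
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid",
                    "Fully Paid", "Charged Off", "Fully Paid",
                    "Charged Off", "Charged Off", "Fully Paid"],
})

# Bucket the rates; the bin edges are arbitrary choices for this sketch.
toy["rate_band"] = pd.cut(toy["int_rate"], bins=[0, 10, 15, 30],
                          labels=["<=10%", "10-15%", ">15%"])

# Mean of a boolean flag = share of charged-off loans in each band.
charged_off_share = (
    toy["loan_status"].eq("Charged Off")
    .groupby(toy["rate_band"], observed=True)
    .mean()
)
print(charged_off_share)
```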

## ✔️ `emp_title` & `emp_length`

> - `emp_title`: The job title supplied by the Borrower when applying for the loan.
> - `emp_length`: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

In [None]:
print(applicants_df.emp_title.isna().sum())
print(applicants_df.emp_title.nunique())

In [None]:
applicants_df['emp_title'].value_counts()[:20]

In [None]:
plt.figure(figsize=(15, 12))

plt.subplot(2, 2, 1)
order = ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years', 
          '6 years', '7 years', '8 years', '9 years', '10+ years',]
g = sns.countplot(x='emp_length', data=joined_df, hue='loan_status', order=order)
g.set_xticklabels(g.get_xticklabels(), rotation=90);

plt.subplot(2, 2, 2)
top_titles = joined_df.emp_title.value_counts()[:30]  # labels and counts from the same frame
plt.barh(top_titles.index, top_titles)
plt.title("The 30 most common job titles among borrowers granted a loan")
plt.tight_layout()

## ✔️ `issue_d`, `earliest_cr_line`

> - `issue_d`: The month which the loan was funded
> - `earliest_cr_line`: The month the borrower's earliest reported credit line was opened

In [None]:
# data.hvplot.line(x='issue_d', y='loan_status')

In [None]:
applicants_df['earliest_cr_line'].value_counts()

In [None]:
loans_df['issue_d'] = pd.to_datetime(loans_df['issue_d'])
applicants_df['earliest_cr_line'] = pd.to_datetime(applicants_df['earliest_cr_line'])

In [None]:
# fully_paid = joined_df.loc[joined_df['loan_status']=='Fully Paid', 'issue_d'].hvplot.hist(bins=35) 
# charged_off = joined_df.loc[joined_df['loan_status']=='Charged Off', 'issue_d'].hvplot.hist(bins=35)

# # fully_paid * charged_off
# loan_issue_date = (fully_paid * charged_off).opts(
#     title="Loan Status by Loan Issue Date", xlabel='Loan Issue Date', ylabel='Count',
#     width=350, height=350, legend_cols=2, legend_position='top_right'
# ).opts(xrotation=45)

# fully_paid = loans_df.loc[loans_df['loan_status']=='Fully Paid', 'earliest_cr_line'].hvplot.hist(bins=35) 
# charged_off = loans_df.loc[loans_df['loan_status']=='Charged Off', 'earliest_cr_line'].hvplot.hist(bins=35)

# earliest_cr_line = (fully_paid * charged_off).opts(
#     title="Loan Status by earliest_cr_line", xlabel='earliest_cr_line', ylabel='Count',
#     width=350, height=350, legend_cols=2, legend_position='top_right'
# ).opts(xrotation=45)

# loan_issue_date + earliest_cr_line

## ✔️ `title`

> - `title`: The loan title provided by the borrower

In [None]:
loans_df.title.isna().sum()

In [None]:
loans_df['title'] = loans_df.title.str.lower()

In [None]:
loans_df.title.value_counts()[:10]

`title` will be removed because we have the `purpose` column, which is generated from it.

## ✔️ `dti`, `open_acc`, `revol_bal`, `revol_util`, & `total_acc`

> - `dti`: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
> - `open_acc`: The number of open credit lines in the borrower's credit file.
> - `revol_bal`: Total credit revolving balance
> - `revol_util`: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
> - `total_acc`: The total number of credit lines currently in the borrower's credit file
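
To make the two ratios concrete, here is the arithmetic for a single hypothetical borrower. All numbers are invented, and expressing both ratios as percentages is an assumption about the units used in the dataset.

```python
# Hypothetical borrower; numbers chosen only to illustrate the arithmetic.
monthly_debt_payments = 1_500.0    # excludes mortgage and the requested loan
monthly_income = 5_000.0           # self-reported
revolving_balance = 4_000.0        # what `revol_bal` would hold
revolving_credit_limit = 10_000.0  # total available revolving credit

# Both ratios are expressed as percentages here (an assumption).
dti = monthly_debt_payments / monthly_income * 100
revol_util = revolving_balance / revolving_credit_limit * 100

print(f"dti = {dti:.1f}, revol_util = {revol_util:.1f}")
```

Under this reading, a `revol_util` above 100 would mean the balance exceeds the available revolving credit, which is why the very high values inspected later in this section stand out.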

In [None]:
applicants_df.dti.value_counts()

In [None]:
dti = joined_df.hvplot.hist(
    y='dti', bins=50, width=350, height=350, 
    title="dti Distribution", xlabel='dti', ylabel='Count'
)

sub_dti = joined_df[joined_df['dti']<=50].hvplot.hist(
    y='dti', bins=50, width=350, height=350, 
    title="dti (<=50) Distribution", xlabel='dti', ylabel='Count', shared_axes=False
)

dti + sub_dti

In [None]:
print(applicants_df[applicants_df['dti']>=40].shape)

In [None]:
joined_df.loc[joined_df['dti']>=50, 'loan_status'].value_counts()

In [None]:
dti = joined_df[joined_df['dti']<=50].hvplot.hist(
    y='dti', by='loan_status', bins=50, width=300, height=350, 
    title="dti (<=50) Distribution", xlabel='dti', ylabel='Count', 
    alpha=0.3, legend='top'
)

title="Loan Status by The number of open credit lines"

open_acc = joined_df.hvplot.hist(
    y='open_acc', by='loan_status', bins=50, width=300, height=350, 
    title=title, xlabel='The number of open credit lines', ylabel='Count', 
    alpha=0.4, legend='top'
)

title="Loan Status by The total number of credit lines"

total_acc = joined_df.hvplot.hist(
    y='total_acc', by='loan_status', bins=50, width=300, height=350, 
    title=title, xlabel='The total number of credit lines', ylabel='Count', 
    alpha=0.4, legend='top'
)

dti + open_acc + total_acc

In [None]:
print(applicants_df.shape)
print(applicants_df[applicants_df.open_acc > 40].shape)

In [None]:
print(applicants_df.shape)
print(applicants_df[applicants_df.total_acc > 80].shape)

In [None]:
print(applicants_df.shape)
print(applicants_df[applicants_df.revol_util > 120].shape)

In [None]:
title="Loan Status by Revolving line utilization rate"

revol_util = joined_df.hvplot.hist(
    y='revol_util', by='loan_status', bins=50, width=350, height=400, 
    title=title, xlabel='Revolving line utilization rate', ylabel='Count', 
    alpha=0.4, legend='top'
)

title="Loan Status by Revolving line utilization rate (<120)"


sub_revol_util = joined_df[joined_df.revol_util < 120].hvplot.hist(
    y='revol_util', by='loan_status', bins=50, width=350, height=400, 
    title=title, xlabel='Revolving line utilization rate', ylabel='Count', 
    shared_axes=False, alpha=0.4, legend='top'
)

revol_util + sub_revol_util

In [None]:
applicants_df[applicants_df.revol_util > 200]

In [None]:
print(applicants_df.shape)
print(applicants_df[applicants_df.revol_bal > 250000].shape)

In [None]:
title = "Loan Status by Total credit revolving balance"

revol_bal = joined_df.hvplot.hist(
    y='revol_bal', by='loan_status', bins=50, width=350, height=400, 
    title=title, xlabel='Total credit revolving balance', ylabel='Count', 
    alpha=0.4, legend='top'
)

title = "Loan Status by Total credit revolving balance (<250000)"

sub_revol_bal = joined_df[joined_df['revol_bal']<=250000].hvplot.hist(
    y='revol_bal', by='loan_status', bins=50, width=350, height=400, 
    title=title, xlabel='Total credit revolving balance', ylabel='Count', 
    alpha=0.4, legend='top', shared_axes=False
).opts(xrotation=45)

revol_bal + sub_revol_bal

In [None]:
joined_df.loc[joined_df.revol_bal > 250000, 'loan_status'].value_counts()

- It seems that the smaller the `dti` the more likely that the loan will not be paid.
- Only `217` borrowers have more than `40` open credit lines.
- Only `266` borrowers have more than `80` credit lines in their credit file.

## ✔️ `pub_rec`, `initial_list_status`, `application_type`, `mort_acc`, & `pub_rec_bankruptcies`

> - `pub_rec`: Number of derogatory public records
> - `initial_list_status`: The initial listing status of the loan. Possible values are – W, F
> - `application_type`: Indicates whether the loan is an individual application or a joint application with two co-borrowers
> - `mort_acc`: Number of mortgage accounts
> - `pub_rec_bankruptcies`: Number of public record bankruptcies

In [None]:
xlabel = 'Number of derogatory public records'
title = "Loan Status by Number of derogatory public records"

fully_paid = joined_df.loc[joined_df['loan_status']=='Fully Paid', 'pub_rec'].value_counts().hvplot.bar() 
charged_off = joined_df.loc[joined_df['loan_status']=='Charged Off', 'pub_rec'].value_counts().hvplot.bar()

(fully_paid * charged_off).opts(
    title=title, xlabel=xlabel, ylabel='Count',
    width=400, height=400, legend_cols=2, legend_position='top_right'
)

In [None]:
xlabel = "The initial listing status of the loan"
title = "Loan Status by The initial listing status of the loan"

fully_paid = joined_df.loc[joined_df['loan_status']=='Fully Paid', 'initial_list_status'].value_counts().hvplot.bar() 
charged_off = joined_df.loc[joined_df['loan_status']=='Charged Off', 'initial_list_status'].value_counts().hvplot.bar()

(fully_paid * charged_off).opts(
    title=title, xlabel=xlabel, ylabel='Count',
    width=400, height=400, legend_cols=2, legend_position='top_right'
)

In [None]:
fully_paid = joined_df.loc[joined_df['loan_status']=='Fully Paid', 'application_type'].value_counts().hvplot.bar() 
charged_off = joined_df.loc[joined_df['loan_status']=='Charged Off', 'application_type'].value_counts().hvplot.bar()

(fully_paid * charged_off).opts(
    title="Loan Status by Application Type", xlabel="Application Type", ylabel='Count',
    width=400, height=400, legend_cols=2, legend_position='top_right'
)

In [None]:
xlabel = "Number of public record bankruptcies"
title = "Loan Status by The Number of public record bankruptcies"

fully_paid = joined_df.loc[joined_df['loan_status']=='Fully Paid', 'pub_rec_bankruptcies'].value_counts().hvplot.bar() 
charged_off = joined_df.loc[joined_df['loan_status']=='Charged Off', 'pub_rec_bankruptcies'].value_counts().hvplot.bar()

(fully_paid * charged_off).opts(
    title=title, xlabel=xlabel, ylabel='Count',
    width=400, height=400, legend_cols=2, legend_position='top_right'
)

In [None]:
def pub_rec(number):
    if number == 0.0:
        return 0
    else:
        return 1
    
def mort_acc(number):
    if number == 0.0:
        return 0
    elif number >= 1.0:
        return 1
    else:
        return number
    
def pub_rec_bankruptcies(number):
    if number == 0.0:
        return 0
    elif number >= 1.0:
        return 1
    else:
        return number

In [None]:
# joined_df was merged earlier, so binarise it as well;
# otherwise the count plots below would still show the raw counts.
for frame in (applicants_df, joined_df):
    frame['pub_rec'] = frame.pub_rec.apply(pub_rec)
    frame['mort_acc'] = frame.mort_acc.apply(mort_acc)
    frame['pub_rec_bankruptcies'] = frame.pub_rec_bankruptcies.apply(pub_rec_bankruptcies)

In [None]:
plt.figure(figsize=(12, 30))

plt.subplot(6, 2, 1)
sns.countplot(x='pub_rec', data=joined_df, hue='loan_status')

plt.subplot(6, 2, 2)
sns.countplot(x='initial_list_status', data=joined_df, hue='loan_status')

plt.subplot(6, 2, 3)
sns.countplot(x='application_type', data=joined_df, hue='loan_status')

plt.subplot(6, 2, 4)
sns.countplot(x='mort_acc', data=joined_df, hue='loan_status')

plt.subplot(6, 2, 5)
sns.countplot(x='pub_rec_bankruptcies', data=joined_df, hue='loan_status')

## <span style="color:#ff5f27;">📈 How do numeric features correlate with the target variable?</span>

In [None]:
joined_df['loan_status'] = joined_df.loan_status.map({'Fully Paid':1, 'Charged Off':0})

# Identify and exclude non-numeric columns
numeric_columns = joined_df.select_dtypes(include=[np.number]).columns

In [None]:
joined_df[numeric_columns].corr()['loan_status'].drop('loan_status').sort_values().hvplot.barh(
    width=600, height=400, 
    title="Correlation between Loan status and Numeric Features", 
    ylabel='Correlation', xlabel='Numerical Features', 
)

****
## <span style="color:#ff5f27;">📝 Conclusion:</span>
We notice that there are broadly two types of features: 
1. Features related to the applicant (demographic variables such as occupation, employment details etc.) 
2. Features related to loan characteristics (amount of loan, interest rate, purpose of loan etc.) 
****

# <span style="color:#ff5f27;">🔄 Data Cleanup Rules identification</span>

**Section Goals:** 
> - Remove or fill any missing data. 
> - Remove unnecessary or repetitive features. 
> - Convert categorical string features to dummy variables.
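
As a small illustration of the last goal, the sketch below one-hot encodes a categorical column with `pd.get_dummies`. The `toy` frame is made up, and `drop_first=True` is one possible design choice rather than something the notebook prescribes.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "loan_amnt": [10_000, 5_000, 20_000],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
})

# One indicator column per category; drop_first removes the first category
# (alphabetically 'MORTGAGE'), which is implied when the others are all 0.
encoded = pd.get_dummies(toy, columns=["home_ownership"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level avoids perfectly collinear columns, which matters for linear models and is harmless for tree-based ones.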

In [None]:
# Missing values
for column in joined_df.columns:
    if joined_df[column].isna().sum() != 0:
        missing = joined_df[column].isna().sum()
        portion = (missing / joined_df.shape[0]) * 100
        print(f"'{column}': number of missing values '{missing}' ==> '{portion:.3f}%'")

### `emp_title`

In [None]:
applicants_df.emp_title.nunique()

Realistically there are too many unique job titles to try to convert this to a dummy variable feature. Let's remove that emp_title column.

### `emp_length`

In [None]:
applicants_df.emp_length.unique()

In [None]:
for year in joined_df.emp_length.unique():
    print(f"{year} years in this position:")
    print(f"{joined_df[joined_df.emp_length == year].loan_status.value_counts(normalize=True)}")
    print('==========================================')

Charge-off rates are extremely similar across all employment lengths, so we are going to drop the `emp_length` column.

### `title`

In [None]:
joined_df.title.value_counts().head()

In [None]:
joined_df.purpose.value_counts().head()

The title column is simply a string subcategory/description of the purpose column. So we should drop the title column.

### `mort_acc`

There are many ways we could deal with this missing data. We could build a simple model, such as a linear model, to fill it in; we could fill it in based on the mean of the other columns; or we could bin the column into categories and treat NaN as its own category. There is no 100% correct approach! 

Let's review the other columns to see which correlates most highly with `mort_acc`.

In [None]:
joined_df.mort_acc.value_counts()

In [None]:
joined_df.mort_acc.isna().sum()

In [None]:
joined_df[numeric_columns].corr()['mort_acc'].drop('mort_acc').sort_values().hvplot.barh()

It looks like the `total_acc` feature correlates with `mort_acc`, which makes sense! Let's try a `fillna()`-style approach: group the dataframe by `total_acc`, calculate the mean `mort_acc` for each `total_acc` value, and use that mean to fill in the missing entries.
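
The same group-mean fill can also be written without a row-wise `apply`. This is a minimal sketch on a made-up `toy` frame, offered as an alternative formulation and not as the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "total_acc": [10, 10, 10, 10, 20, 20, 20, 20],
    "mort_acc": [0.0, 1.0, 0.0, np.nan, 2.0, 3.0, 3.0, np.nan],
})

# Mean mort_acc for each total_acc value, broadcast back to every row
# (NaNs are skipped when the mean is computed).
group_mean = toy.groupby("total_acc")["mort_acc"].transform("mean")

# Fill only the missing entries, rounding the group mean to a whole number.
toy["mort_acc"] = toy["mort_acc"].fillna(group_mean.round())
print(toy)
```

The cells that follow implement the same idea with a lookup table and `apply`.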

In [None]:
total_acc_avg = joined_df.groupby('total_acc')['mort_acc'].mean()

In [None]:
def fill_mort_acc(total_acc, mort_acc):
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc].round()
    else:
        return mort_acc

In [None]:
joined_df['mort_acc'] = joined_df.apply(
    lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']),
    axis=1,
)

### `revol_util` & `pub_rec_bankruptcies`
These two features have missing data points, but they account for less than 0.5% of the total data. So we are going to remove the rows that are missing those values in those columns with dropna().
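
A minimal sketch of that step on a made-up `toy` frame: `dropna(subset=...)` removes only the rows that are missing one of the two named columns, and the share of rows lost is checked first. On the real data the same calls would presumably be applied to `joined_df`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "revol_util": [45.2, np.nan, 80.1, 12.0],
    "pub_rec_bankruptcies": [0.0, 0.0, np.nan, 1.0],
    "loan_amnt": [10_000, 5_000, 20_000, 7_500],
})

# Share of rows that would be lost, checked before dropping anything.
cols = ["revol_util", "pub_rec_bankruptcies"]
lost_share = toy[cols].isna().any(axis=1).mean()

# Drop only rows missing one of the two columns; NaNs elsewhere are untouched.
cleaned = toy.dropna(subset=cols)
print(f"dropped {lost_share:.0%} of rows, {len(cleaned)} remain")
```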

In [None]:
for column in joined_df.columns:
    if joined_df[column].isna().sum() != 0:
        missing = joined_df[column].isna().sum()
        portion = (missing / joined_df.shape[0]) * 100
        print(f"'{column}': number of missing values '{missing}' ==> '{portion:.3f}%'")

In [None]:
joined_df.term.unique()

In [None]:
joined_df.address.head()

---