# LendingClub Loan Repayment Prediction

We will use publicly available data from LendingClub.com to build a model that predicts whether or not a borrower will repay their loan in full. The data includes information about the borrower's credit score, debt-to-income ratio, and other factors.

The columns in the dataset represent the following:

* credit.policy: Whether the borrower meets LendingClub.com's credit underwriting criteria.
* purpose: The purpose of the loan (e.g., credit card debt consolidation, educational expenses, etc.).
* int.rate: The interest rate of the loan.
* installment: The monthly payment amount.
* log.annual.inc: The natural log of the borrower's annual income.
* dti: The debt-to-income ratio.
* fico: The borrower's FICO credit score.
* days.with.cr.line: The length of time the borrower has had a credit line.
* revol.bal: The borrower's revolving balance.
* revol.util: The borrower's revolving line utilization rate.
* inq.last.6mths: The number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* pub.rec: The number of derogatory public records (bankruptcy filings, tax liens, or judgments).
* not.fully.paid: Whether the borrower repaid their loan in full (0) or not (1).
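
Because `log.annual.inc` stores the natural log of income, it can be mapped back to dollars with the exponential function. The snippet below is a minimal sketch using made-up values, not rows from the dataset:

```python
import numpy as np

# Hypothetical log.annual.inc values (not taken from the dataset)
log_annual_inc = np.array([10.0, 11.0, 12.0])

# exp() inverts the natural log, recovering annual income in dollars
annual_inc = np.exp(log_annual_inc)

print(np.round(annual_inc, 2))
```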

The data is from 2007-2010, so it may not be representative of the current lending environment. However, it is a valuable resource for understanding the factors that affect a borrower's likelihood of repayment.

# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Shared plot styling
sns.set_style('whitegrid')
palette = 'muted'

%matplotlib inline

# Get the Data

In [None]:
loan = pd.read_csv('/kaggle/input/loan-data/loan_data.csv')

In [None]:
loan.head()

In [None]:
loan.info()

In [None]:
loan.describe()

# Exploratory Data Analysis

**Create a histogram that shows the distribution of FICO scores for two groups of borrowers:**

those who met LendingClub.com's credit underwriting criteria (credit.policy = 1) and those who did not (credit.policy = 0).

In [None]:
plt.figure(figsize=(10,6))
# Legend labels use the actual column name, credit.policy
loan[loan['credit.policy']==1]['fico'].hist(bins=35, color='blue',
                                            label='credit.policy = 1',
                                            alpha=0.6)
loan[loan['credit.policy']==0]['fico'].hist(bins=35, color='orangered',
                                            label='credit.policy = 0',
                                            alpha=0.6)
plt.legend()
plt.xlabel('FICO Score')

* A majority of the borrowers in the dataset meet LendingClub.com's credit underwriting criteria.
* Borrowers with lower FICO scores are more likely to fall outside LendingClub.com's credit underwriting criteria.

**Create a histogram that shows the distribution of FICO scores for two groups of borrowers:**

those who did not fully repay their loan (not.fully.paid = 1) and those who did fully repay their loan (not.fully.paid = 0).

In [None]:
plt.figure(figsize=(10,6))
loan[loan['not.fully.paid']==0]['fico'].hist(bins=35, color='orangered',
                                             label='not.fully.paid = 0',
                                             alpha=0.6)
loan[loan['not.fully.paid']==1]['fico'].hist(bins=35, color='blue',
                                             label='not.fully.paid = 1',
                                             alpha=0.6)
plt.legend()
plt.xlabel('FICO Score')

* A majority of the borrowers in the dataset repaid their loans in full.
* People with lower FICO scores are more likely to default on their loans.
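
One way to check this kind of observation numerically is to compute the share of not-fully-paid loans within FICO bands. The sketch below assumes a hand-built toy frame (the values and the two band edges are hypothetical); the same pattern could be applied to `loan`:

```python
import pandas as pd

# Hypothetical toy data, constructed by hand for illustration only
toy = pd.DataFrame({
    'fico': [620, 640, 660, 680, 700, 720, 740, 760, 780, 800],
    'not.fully.paid': [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Split FICO scores into bands, then take the mean of the 0/1 target,
# which equals the share of loans in each band that were not fully paid
bands = pd.cut(toy['fico'], bins=[600, 700, 800])
default_rate = toy.groupby(bands, observed=True)['not.fully.paid'].mean()

print(default_rate)
```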

**Create a countplot showing the counts of loans by purpose, with the color hue defined by not.fully.paid**

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x='purpose', hue='not.fully.paid', data=loan, palette=palette)

Debt consolidation was the most common reason for loans.

**Create a joint plot to visualize the relationship between FICO score and interest rate.**

In [None]:
# No hue variable is used here, so a palette argument would have no effect
sns.jointplot(x='fico', y='int.rate', data=loan)

People with higher FICO scores are more likely to be offered lower interest rates on loans.

**Create an lmplot to see if there is a difference in the trend between the not.fully.paid and credit.policy variables**

In [None]:
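# lmplot is a figure-level function that creates its own figure,
# so a preceding plt.figure() call would only produce an empty extra figure
sns.lmplot(y='int.rate', x='fico', data=loan, hue='credit.policy',
           col='not.fully.paid', palette=palette)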

The lmplot shows that the trend between FICO score and interest rate is similar whether or not the borrower met LendingClub.com's credit underwriting criteria (credit.policy=1 or credit.policy=0), and whether or not the borrower repaid their loan in full (not.fully.paid=0 or not.fully.paid=1).

# Converting Categorical Data into Dummy Variables

In [None]:
loan.head()

The purpose column is categorical data.

In [None]:
cat_feats = ['purpose']

In [None]:
final_data = pd.get_dummies(loan, columns=cat_feats, drop_first=True)
final_data
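
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical three-row frame (the rows are illustrative, not taken from the dataset). With three categories, only two dummy columns are produced, and the dropped category acts as the baseline:

```python
import pandas as pd

# Hypothetical three-row frame; the rows are illustrative only
toy = pd.DataFrame({
    'fico': [700, 650, 720],
    'purpose': ['credit_card', 'debt_consolidation', 'educational'],
})

# drop_first=True removes the first category in sorted order,
# so 'credit_card' becomes the implicit baseline
encoded = pd.get_dummies(toy, columns=['purpose'], drop_first=True)

print(encoded.columns.tolist())
```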

# Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Take features and target from the same encoded frame
X = final_data.drop('not.fully.paid', axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=101)
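
The split above does not stratify on the target. Since most loans in this data are fully paid, one optional variation is to pass `stratify` so both splits keep the same class ratio. This is a sketch on hypothetical toy arrays (`X_toy`, `y_toy`), not the notebook's own method:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 zeros and 20 ones
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```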

# Training a Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
dtree.fit(X_train, y_train)

# Predictions and Evaluation of Decision Tree

In [None]:
pred = dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(classification_report(y_test, pred))

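Overall, the classification report suggests that the decision tree model does only a fair job of predicting whether or not a borrower will repay their loan in full. Because most loans are fully paid, the precision and recall for the not-fully-paid class (1) are more informative than overall accuracy, and they leave clear room for improvement.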

In [None]:
print(confusion_matrix(y_test, pred))

Reading the matrix with rows as actual classes and columns as predicted classes, the model correctly predicted that 1980 borrowers would repay their loan in full and incorrectly predicted that 331 borrowers would repay their loan in full.

The model also incorrectly predicted that 451 borrowers would not repay their loan in full and correctly predicted that 112 borrowers would not repay their loan in full.
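
When reading these numbers, it helps to recall how scikit-learn lays out the matrix: rows are the actual classes and columns are the predicted classes. A minimal sketch with hypothetical labels (`y_true` and `y_hat` are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = not fully paid, 0 = fully paid
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```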

# Training a Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=300)

In [None]:
rfc.fit(X_train, y_train)

# Predictions and Evaluation of Random Forest

In [None]:
rfc_pred = rfc.predict(X_test)

In [None]:
print(classification_report(y_test, rfc_pred))

The report shows that about 85% of the loans the model predicted as fully paid were in fact fully paid (precision for class 0). However, the model identified only about 2% of the loans that were not fully paid (recall for class 1). This is because the dataset is imbalanced, with the majority of loans being fully paid, so the model predicts the fully paid class almost every time.
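
One common way to counter this kind of imbalance is to weight classes inversely to their frequency. The sketch below is an optional variation, not part of the original notebook; it uses hypothetical toy data (`X_toy`, `y_toy`) to show how scikit-learn's `'balanced'` weights are derived and how they can be requested from the classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 85 fully paid (0), 15 not fully paid (1)
y_toy = np.array([0] * 85 + [1] * 15)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))

# 'balanced' weights are n_samples / (n_classes * class_count),
# so the rarer class receives the larger weight
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_toy)
print(weights)

# The same weighting can be requested directly from the classifier
rfc_weighted = RandomForestClassifier(
    n_estimators=50, class_weight='balanced', random_state=101)
rfc_weighted.fit(X_toy, y_toy)
```

Whether this helps on the LendingClub data would need to be checked against the classification report on the test set.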

In [None]:
print(confusion_matrix(y_test, rfc_pred))

The model correctly classified 2421 borrowers as having repaid their loans and 9 borrowers as not having repaid their loans. 

The model also misclassified 434 borrowers as having repaid their loans and 9 borrowers as not having repaid their loans.
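
The headline metrics can be recomputed by hand from the four counts reported above. This sketch treats not fully paid (1) as the positive class; the variable names are chosen here for illustration:

```python
# Counts as reported in the text above; positive class = not fully paid (1)
tn = 2421   # repaid, predicted repaid
tp = 9      # not repaid, predicted not repaid
fn = 434    # not repaid, but predicted repaid
fp = 9      # repaid, but predicted not repaid

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_not_paid = tp / (tp + fn)      # share of non-repaying loans caught
precision_not_paid = tp / (tp + fp)   # share of flagged loans that were right

print(round(accuracy, 3), round(recall_not_paid, 3), round(precision_not_paid, 3))
```

The high accuracy alongside the very low recall illustrates why accuracy alone can be misleading on imbalanced data.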

# Conclusion

On the not-fully-paid class, the decision tree model has a higher recall than the random forest model: it correctly flagged 112 borrowers who did not repay, compared with 9 for the random forest. This means that the decision tree model is better at catching borrowers who will not repay their loan in full, although it also raises more false alarms. However, the random forest model has a higher accuracy than the decision tree model. Because most loans are fully paid, that accuracy largely reflects predicting the majority class rather than skill at identifying non-repayment.

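Overall, neither model identifies non-repaying borrowers reliably. The random forest model is the better choice if overall accuracy is the priority. If you are concerned about the risk of incorrectly predicting that a loan will be repaid in full, the decision tree is the safer of the two, since the random forest misclassified 434 non-repaying borrowers as having repaid their loans.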