## Credit Modelling using Machine Learning

In this project, we predict whether a borrower will pay off their loan on time using a **machine learning model**.<br/> We use financial lending data from Lending Club, a marketplace for personal loans that matches <br/>borrowers seeking a loan with investors looking to lend money and make a return.

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier

In [6]:
# !pip install -U -q kaggle
# !mkdir -p ~/.kaggle
# !echo '{"username":"jizabayo","key":"7a921bcf29dd2a15c59273bd616bafdf"}' > ~/.kaggle/kaggle.json
# !chmod 600 ~/.kaggle/kaggle.json

# !kaggle datasets download -d wendykan/lending-club-loan-data

## Data cleaning

In [None]:
loans_2007 = pd.read_csv('loans_2007.csv')
print(loans_2007.iloc[0])
print(len(loans_2007.columns))  # number of columns

We now identify the columns to drop because they are unlikely to be good **features** for the modelling process.<br/>
According to Dataquest, we should pay attention to features that:
<ol>

  <li>leak information from the future (after the loan has already been funded), which might cause our model to overfit.</li>
  <li>don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club).</li>
  <li>are formatted poorly and need to be cleaned up.</li>
  <li>require more data or a lot of processing to turn into a useful feature.</li>
  <li>contain redundant information.</li>
</ol>

For instance, funded_amnt is only known once the loan has started to be funded, so including it would leak information from the future.<br/>
Thus, we remove several columns using the drop function.

In [None]:
loans = loans_2007.drop(columns=['id', 'member_id', 'funded_amnt', 'funded_amnt_inv', 'grade', 'sub_grade', 
                                 'emp_title', 'issue_d'])

In [None]:
# Continue from `loans` so the columns dropped in the previous cell stay dropped.
loans = loans.drop(columns=['zip_code', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp'])

According to Dataquest, the out_prncp and total_pymnt columns describe properties of the loan 
after it has been fully funded and has started to be paid off. 
This information isn't available to an investor before the loan is fully funded, so we don't want to include it in our model.

In [10]:
# Continue from `loans` rather than restarting from loans_2007.
loans = loans.drop(columns=['total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
                            'last_pymnt_d', 'last_pymnt_amnt'])
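
The column removal above can also be written as a single call driven by named lists, which makes the reason for each drop explicit. This is a minimal sketch on a made-up two-row frame; the list names `identifier_cols` and `leakage_cols` are hypothetical, and only a few of the real columns are shown.

```python
import pandas as pd

# Toy stand-in for loans_2007; the real file has many more columns.
toy = pd.DataFrame({
    "id": [1, 2],
    "loan_amnt": [5000, 2500],
    "funded_amnt": [5000, 2500],
    "total_pymnt": [5861.07, 1008.71],
    "loan_status": ["Fully Paid", "Charged Off"],
})

# Hypothetical grouping of the columns to drop, by reason.
identifier_cols = ["id", "member_id"]
leakage_cols = ["funded_amnt", "total_pymnt", "recoveries"]

# errors="ignore" skips names that are not present in this toy frame.
cleaned = toy.drop(columns=identifier_cols + leakage_cols, errors="ignore")
print(list(cleaned.columns))
```

Keeping the lists separate makes it easier to revisit one category of columns later without touching the others.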

After removing the features that are not helpful for our modelling, we go on to decide which column to use as the target (dependent variable).

## Exploratory data analysis

In [None]:
print(loans['loan_status'].value_counts())  # loan_status is our dependent variable

We can treat this problem as a binary classification because we are interested in knowing which clients will pay off the loan 
and which will not.<br/>
We also have a class imbalance: the "Fully Paid" status occurs much more often than "Charged Off", which can bias our model towards the majority class.
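
One way to quantify the imbalance is to look at class shares rather than raw counts. The sketch below uses a made-up status column (the counts are illustrative, not taken from the Lending Club data) and restricts it to the two outcomes of interest before computing proportions.

```python
import pandas as pd

# Made-up statuses for illustration only.
status = pd.Series(
    ["Fully Paid"] * 6 + ["Charged Off"] * 2 + ["Current"] * 2,
    name="loan_status",
)

# Raw counts per status.
counts = status.value_counts()

# Keep only the two final outcomes, then compute each class's share.
binary = status[status.isin(["Fully Paid", "Charged Off"])]
binary_shares = binary.value_counts(normalize=True)

print(counts)
print(binary_shares)
```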

In [None]:
# Keep only the two final outcomes and encode loan_status as a 2-class variable.
# We work on `loans` so the column drops made during data cleaning are preserved.
loans = loans[(loans['loan_status'] == "Fully Paid") | (loans['loan_status'] == "Charged Off")]

mapping_dict = {
    "loan_status": {
        "Fully Paid": 1,
        "Charged Off": 0
    }
}
loans = loans.replace(mapping_dict)

In [None]:
# Drop columns that contain only one unique non-null value.
drop_columns = []
for column in loans.columns:
    non_null = loans[column].dropna()
    unique_non_null = non_null.unique()
    if len(unique_non_null) == 1:
        drop_columns.append(column)
loans = loans.drop(drop_columns, axis=1)
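
The same check can be expressed without an explicit loop, since `DataFrame.nunique()` ignores missing values by default. This is a small sketch on a made-up frame, not a replacement for the cell above.

```python
import pandas as pd

# Toy frame: two columns carry a single non-null value, one varies.
toy = pd.DataFrame({
    "pymnt_plan": ["n", "n", "n"],
    "tax_liens": [0.0, None, 0.0],
    "loan_amnt": [5000, 2500, 2400],
})

# nunique() counts distinct non-null values per column (dropna=True by default).
unique_counts = toy.nunique()
single_value_cols = unique_counts[unique_counts == 1].index.tolist()

reduced = toy.drop(columns=single_value_cols)
print(single_value_cols)
print(list(reduced.columns))
```

A column with only one distinct value cannot help a model separate the two classes, which is why it is safe to remove.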

In [None]:
#Handling missing values
#loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts[null_counts>0])

In [None]:
# Drop pub_rec_bankruptcies, then drop any rows that still contain NaNs
loans = loans.drop(columns=['pub_rec_bankruptcies'])
loans = loans.dropna()
print(loans.dtypes.value_counts())

In [None]:
# Select the text (object) columns that still need to be converted
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])

## Feature engineering

In [None]:
# Identify the categorical features to be converted into numerical ones
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for column in cols:
    print(loans[column].value_counts())

In [None]:
#converting features to numerical
mapping_dictionary = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dictionary)
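
To see what these conversions do, here is a minimal sketch on a made-up two-row frame. It uses `Series.map` for the employment-length mapping instead of `DataFrame.replace`; that is a swapped-in alternative which returns a numeric column directly and turns any unmapped label into NaN.

```python
import pandas as pd

# Toy values in the same string formats as the columns above.
toy = pd.DataFrame({
    "int_rate": ["10.65%", "7.90%"],
    "emp_length": ["10+ years", "< 1 year"],
})

# Strip the trailing percent sign, then cast the text to float.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype("float")

# Map each employment-length label to a number of years.
emp_length_years = {"10+ years": 10, "< 1 year": 0}
toy["emp_length"] = toy["emp_length"].map(emp_length_years)

print(toy)
```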

In [None]:
#Using dummy variables to get numerical values
# Returns a new Dataframe containing 1 column for each dummy variable.
dummy_df = pd.get_dummies(loans[["home_ownership", "verification_status",  "purpose", "term"]])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(["home_ownership", "verification_status",  "purpose", "term"], axis=1)
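
The effect of `pd.get_dummies` is easiest to see on a tiny example. The sketch below encodes one made-up categorical column and joins the result back to the numeric column.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "loan_amnt": [5000, 2500, 2400],
})

# One indicator column is created per category, named <column>_<category>.
dummies = pd.get_dummies(toy[["term"]])

# Replace the original text column with its indicator columns.
encoded = pd.concat([toy.drop(columns=["term"]), dummies], axis=1)
print(list(encoded.columns))
```

Each row has exactly one active indicator per original column, so no ordering is implied between categories.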

## Modeling and making predictions

In [None]:
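# Separate the feature columns from the target column.
features = loans.drop(columns=["loan_status"])
target = loans["loan_status"]
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)
# Reuse the target's index so the boolean filters below align row by row.
predictions = pd.Series(predictions, index=target.index)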

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives.
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)
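
The rate calculation can be wrapped in a small helper that compares labels by position, which sidesteps index mismatches between the predictions and the filtered DataFrame. This is a sketch under the assumption that 1 is the positive class ("Fully Paid"); the function name `classification_rates` and the toy labels are hypothetical.

```python
import pandas as pd


def classification_rates(y_true, y_pred):
    """Return (tpr, fpr) for binary labels where 1 is the positive class."""
    # Convert to plain arrays so the comparison is positional, not index-based.
    y_true = pd.Series(y_true).to_numpy()
    y_pred = pd.Series(y_pred).to_numpy()

    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())

    return tp / (tp + fn), fp / (fp + tn)


# Toy labels with a non-contiguous index, as left behind by row filtering.
actual = pd.Series([1, 1, 1, 0, 0, 1], index=[3, 7, 8, 10, 12, 15])
predicted = [1, 1, 0, 1, 0, 1]

tpr, fpr = classification_rates(actual, predicted)
print(tpr, fpr)
```

In this setting a high true positive rate means most good loans are identified, while the false positive rate is the share of charged-off loans that the model would wrongly approve.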