# Credit Modelling

In this project, we focus on credit modelling, which is a common data science problem. For this purpose, we will work with data collected from the Lending Club.

Lending Club is a marketplace that connects borrowers and investors. Investors lend money at interest rates ranging from 5% to 30%, and borrowers repay their loans in monthly installments over a term of 36 or 60 months.

From the investor's point of view, we want to avoid borrowers who will default on their loans.

**Objective:** Create a machine learning model to allow a conservative investor to avoid people that will default on their loans.

This project is separated into 3 parts:
* Data Cleaning
* Feature Engineering
* Modelling

# Data Cleaning

**Techniques used:**
* Pandas

## Introduction to the dataset

In [2]:
import pandas as pd

loans_2007 = pd.read_csv('loans_2007.csv')
# drop_duplicates() returns a new DataFrame, so the result must be assigned back
loans_2007 = loans_2007.drop_duplicates()

print(loans_2007.iloc[0])
print(loans_2007.shape[1])

id                                1077501
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                       



### Drop irrelevant columns for modelling

In [3]:
cols = ['id', 'member_id', 'funded_amnt', 'funded_amnt_inv', 'grade', 'sub_grade', 'emp_title', 'issue_d']

loans_2007 = loans_2007.drop(cols, axis = 1)

In [4]:
cols2 = ['zip_code', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp']

loans_2007 = loans_2007.drop(cols2, axis = 1)

In [5]:
cols3 = ['total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt']

loans_2007 = loans_2007.drop(cols3, axis = 1)

print(loans_2007.iloc[0])
print(loans_2007.shape[1])

loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               J

As observed, removing irrelevant features for modelling reduced the dataset to 32 columns.
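
The three `drop` calls above can also be written as a single pass. The following is a minimal sketch on a toy DataFrame, not the actual dataset: the name `toy` and its values are made up for illustration, and only the column names are borrowed from the cells above. It also shows how to confirm the column count before and after dropping.

```python
import pandas as pd

# Toy frame standing in for loans_2007 (illustrative values, not real data).
toy = pd.DataFrame({
    "id": [1, 2],
    "loan_amnt": [5000, 2500],
    "total_pymnt": [5861.07, 1008.71],
    "loan_status": ["Fully Paid", "Charged Off"],
})

# Columns judged irrelevant for modelling in this toy example.
to_drop = ["id", "total_pymnt"]

before = toy.shape[1]
toy = toy.drop(columns=to_drop)  # equivalent to drop(to_drop, axis=1)
after = toy.shape[1]

print(before, after)
print(list(toy.columns))
```

Comparing `shape[1]` before and after is a cheap check that every intended column was actually removed.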

### Convert target column to numerical

In [6]:
loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]

status_replace = {
    'loan_status':{
        'Fully Paid' : 1,
        'Charged Off': 0
    }
}

loans_2007 = loans_2007.replace(status_replace)
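
As an alternative to the filter-then-`replace` approach above, the same step can be sketched with `isin` and `map`. This is one possible variant under assumptions, not the method used in this notebook: the frame `toy` and its rows are hypothetical, and the `Current` status is included only to show a row being filtered out.

```python
import pandas as pd

# Toy frame standing in for loans_2007 (illustrative rows only).
toy = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"]}
)

# Same encoding as in the cell above: 1 = repaid, 0 = defaulted.
mapping = {"Fully Paid": 1, "Charged Off": 0}

# Keep only rows whose status is one of the two target classes.
toy = toy[toy["loan_status"].isin(list(mapping))].copy()

# Convert the text labels to integers.
toy["loan_status"] = toy["loan_status"].map(mapping)

# Count how many rows fall into each class.
counts = toy["loan_status"].value_counts()
print(counts)
```

Checking `value_counts()` right after the conversion shows how balanced the two classes are, which matters later when choosing an evaluation metric.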

### Remove columns containing a single unique value

A column that holds only one distinct value (ignoring missing values) carries no information for a model, so these columns are dropped.

In [7]:
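drop_columns = []

for col in loans_2007.columns:
    # Distinct values in this column, ignoring missing ones
    non_null_unique = loans_2007[col].dropna().unique()

    # A single distinct value means the column is constant
    if len(non_null_unique) == 1:
        drop_columns.append(col)

loans_2007 = loans_2007.drop(drop_columns, axis = 1)
print(drop_columns)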

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']


Nine columns were removed from the dataset because each contained only one unique value. The dataset is now clean and ready for feature engineering.
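
The loop used for this step can also be expressed with `DataFrame.nunique`, which ignores missing values by default. The following is a minimal sketch on a toy DataFrame: the name `toy` and its values are invented for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for loans_2007 (illustrative values only).
toy = pd.DataFrame({
    "pymnt_plan": ["n", "n", None],                   # one value plus a missing entry
    "term": ["36 months", "60 months", "36 months"],  # two distinct values
    "policy_code": [1, 1, 1],                         # constant
})

# nunique() counts distinct non-null values per column (dropna=True by default).
n_unique = toy.nunique()

# Columns with exactly one distinct non-null value.
single_valued = n_unique[n_unique == 1].index.tolist()

toy = toy.drop(columns=single_valued)

print(single_valued)
print(list(toy.columns))
```

Because missing values are ignored, a column such as `pymnt_plan` here is still treated as constant even though it contains a null entry, which matches the `dropna().unique()` logic of the loop.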