# CASE STUDY with CREDIT SCORING

## 1. PROBLEM

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether or not to grant a loan. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

- **Input**: Historical data of 250,000 borrowers.
- **Output**: SeriousDlqin2yrs.
- **Goal**: Build a model that borrowers can use to help make the best financial decisions.

Reference: [Kaggle Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

## 2. Exploratory Data Analysis (EDA)

### 2.1 Variable descriptions

|Variables|Descriptions|
|-|-|
|**SeriousDlqin2yrs**| Person experienced 90 days past due delinquency or worse|
|**RevolvingUtilizationOfUnsecuredLines**| Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits|
|**age**| Age of borrower in years|
|**NumberOfTime30-59DaysPastDueNotWorse**| Number of times borrower has been 30-59 days past due but no worse in the last 2 years|
|**DebtRatio**| Monthly debt payments, alimony, and living costs divided by monthly gross income|
|**MonthlyIncome**| Monthly income|
|**NumberOfOpenCreditLinesAndLoans**| Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards)|
|**NumberOfTimes90DaysLate**| Number of times borrower has been 90 days or more past due|
|**NumberRealEstateLoansOrLines**| Number of mortgage and real estate loans, including home equity lines of credit|
|**NumberOfTime60-89DaysPastDueNotWorse**| Number of times borrower has been 60-89 days past due but no worse in the last 2 years|
|**NumberOfDependents**| Number of dependents in family excluding themselves (spouse, children etc.)|

Random Forest:
- Handles different types of features well: numerical and categorical

### 2.2 Statistics

**Q1: Import necessary libraries: Pandas, Numpy, Matplotlib, Seaborn**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Q2: Load data using pd.read_csv()**

In [None]:
df_train=pd.read_csv('cs-training.csv')
df_test=pd.read_csv('cs-test.csv')

**Q3: Get the first 5 rows of train set**

In [None]:
df_train.head()

**Q4: Get the number of rows and columns of the train set**

In [None]:
df_train.shape

**Q5: Describe the distribution of train set**

In [None]:
df_train.describe()

**Q6: Get information of train set by df.info()**

In [None]:
df_train.info()

**Q7: Get the percentage of missing values per column of the train set**
(Nulls arise either from problems during data collection or from the data itself.)

In [None]:
df_train.isna().sum() * 100 / len(df_train)

### 2.3 Visualization

**Q8: Target distribution on the train set via bar chart**

In [None]:
# countplot

plt.figure(figsize=(10,7))

ax = sns.countplot(x = df_train.SeriousDlqin2yrs, palette='Set1')

# Label each bar with its own height; the xy position must be a tuple, not a set
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + 0.35, p.get_height() + 0.3))

plt.show()


In [None]:
plt.pie(df_train.SeriousDlqin2yrs.value_counts(), 
        labels = ['Good credit', 'Bad credit'],
        autopct = '%.2f%%', 
        explode=[0, 0.1])

# The target classes are imbalanced, so accuracy alone is misleading;
# evaluate with Precision, Recall, F1 score, AUC, ...
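
The note about precision, recall, F1 and AUC can be illustrated with a small sketch. It uses made-up labels (a hypothetical 93/7 split, not the real data) and a "model" that always predicts the majority class, to show that accuracy looks good while no bad borrower is caught.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

# Hypothetical labels: 930 good borrowers (0) and 70 bad borrowers (1)
y_true = np.array([0] * 930 + [1] * 70)

# A useless model that always predicts the majority class
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)  # constant score for every borrower

accuracy = accuracy_score(y_true, y_pred)                # high, despite being useless
recall = recall_score(y_true, y_pred, zero_division=0)   # share of bad borrowers found
f1 = f1_score(y_true, y_pred, zero_division=0)
auc = roc_auc_score(y_true, y_score)                     # a constant score cannot rank

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.2f}")
```

Accuracy rewards the majority class here, which is why ranking metrics such as AUC are the more informative choice for this target.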


**Q9: Correlation of features and target**

In [None]:
plt.figure(figsize= (10, 7))
sns.heatmap(df_train.corr(), annot = True, linewidths=0.1,linecolor = 'grey')

**Q10: Describe feature distributions and correlations using histogram and pairplot charts**

In [None]:
df_train.hist(figsize=(10,10))

**Q11: Explore each feature with target**
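
One possible way to approach this question is to compare the default rate across the values of a feature, and to summarise a continuous feature per target class. The sketch below runs on a synthetic frame named `demo` (an assumption so the cell runs without the CSV files); only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train (assumption: real data would be used instead)
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "age": rng.integers(21, 90, size=n),
    "NumberOfTimes90DaysLate": rng.poisson(0.3, size=n),
})

# Synthetic target: more late payments means a higher default probability
p_default = np.clip(0.05 + 0.25 * demo["NumberOfTimes90DaysLate"].to_numpy(), 0, 1)
demo["SeriousDlqin2yrs"] = rng.binomial(1, p_default)

# Default rate and group size for each value of a discrete feature
rate_by_late90 = (demo.groupby("NumberOfTimes90DaysLate")["SeriousDlqin2yrs"]
                      .agg(["mean", "count"]))

# Summary statistics of a continuous feature for each target class
age_by_target = demo.groupby("SeriousDlqin2yrs")["age"].describe()

print(rate_by_late90)
print(age_by_target)
```

On the real data, the same two `groupby` calls can be repeated for each column of `df_train`.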

## 3. MODEL

**Q12: Handle outliers in dataset**

- Percentile
- Decile
- Quantile
- Quartile
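
The four terms above are different cuts of the same idea. A minimal sketch on toy values (the array `values` is illustrative, not from the dataset) shows how they relate and how quartiles give the 1.5 × IQR fences used for outlier detection.

```python
import numpy as np

values = np.arange(1, 102)  # 1..101, chosen so the cut points are whole numbers

# Quartiles are the 25th, 50th and 75th percentiles
q1, q3 = np.percentile(values, [25, 75])

# Deciles are the 10th, 20th, ..., 90th percentiles
deciles = np.percentile(values, np.arange(10, 100, 10))

# A quantile is the same cut expressed as a fraction in [0, 1]
same_q1 = np.quantile(values, 0.25)

# Interquartile range and the usual 1.5 * IQR outlier fences
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(q1, q3, iqr, lower, upper)
```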

In [None]:
from collections import Counter

def detect_outliers(df, n, features):
    """Return the index labels of rows that are outliers in more than n features."""
    outlier_indices = []

    # Iterate over features
    for col in features:

        # 1st quartile (25%); nanpercentile ignores missing values,
        # whereas np.percentile returns NaN for a column that contains NaN
        Q1 = np.nanpercentile(df[col], 25)

        # 3rd quartile (75%)
        Q3 = np.nanpercentile(df[col], 75)

        # Interquartile range (IQR)
        IQR = Q3 - Q1

        # Outlier step: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
        outlier_step = 1.5 * IQR

        # Indices of the outliers for this feature
        outlier_list_col = df[(df[col] < Q1 - outlier_step) |
                              (df[col] > Q3 + outlier_step)].index

        outlier_indices.extend(outlier_list_col)

    # Select records that are outliers in more than n features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [k for k, v in outlier_indices.items() if v > n]

    return multiple_outliers


In [None]:
df_train.columns

In [None]:
Outlier_to_drop = detect_outliers(df_train, 2, ['RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
       'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
       'NumberOfDependents'])

len(Outlier_to_drop)*100/len(df_train)

In [None]:
df_train = df_train.drop(Outlier_to_drop, axis = 0)

In [None]:
len(df_train)

**Q13: Merge train and test dataset**

In [None]:
dataset = pd.concat([df_train, df_test])
len(dataset), len(df_train) + len(df_test)

In [None]:
dataset.columns

**Q14: Rename columns to shorter aliases**

In [None]:
dataset=dataset.rename(columns={'SeriousDlqin2yrs':'Target',
                       'RevolvingUtilizationOfUnsecuredLines':'UnsecuredLines',
                       'NumberOfTime30-59DaysPastDueNotWorse':'Late3059',
                        'NumberOfOpenCreditLinesAndLoans':'OpenCredit',
                       'NumberOfTimes90DaysLate':'Late90',
                       'NumberRealEstateLoansOrLines':'PropLines',
                       'NumberOfTime60-89DaysPastDueNotWorse':'Late6089',
                        'NumberOfDependents':'Deps'})
dataset.head(5)

**Q15: Building binary/dummy variables**

In [None]:
pd.qcut(dataset.UnsecuredLines.values,5).codes

In [None]:
dataset.UnsecuredLines=pd.qcut(dataset.UnsecuredLines.values,5).codes

In [None]:
g=sns.catplot(x='UnsecuredLines', y='Target', data=dataset, kind='bar')
plt.show()
# Find the relationship between the variable and the target, while reducing the number of variables.
# Split the data into 3-10 groups to look for a correlation.
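
The notebook bins some features with `pd.qcut` and others with `pd.cut`. A small sketch on a made-up skewed series (the values are illustrative only) shows the difference: `pd.cut` uses equal-width bins, so a single extreme value pushes almost everything into the first bin, while `pd.qcut` uses quantiles, so each bin holds roughly the same number of rows.

```python
import pandas as pd

# Toy, heavily skewed feature (assumption: stands in for a column such as DebtRatio)
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Equal-width bins: the range 1..1000 is split into 5 intervals of equal length
equal_width = pd.cut(skewed.values, 5).codes

# Equal-frequency bins: the cut points are the 20th, 40th, 60th and 80th percentiles
equal_freq = pd.qcut(skewed.values, 5).codes

print(list(equal_width))  # → [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
print(list(equal_freq))   # → [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

For skewed features, equal-width bins can leave most groups empty, which is worth checking before one-hot encoding them.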

In [None]:
dataset.age = pd.qcut(dataset.age.values, 5).codes

In [None]:
g = sns.catplot(x = 'age', y = 'Target', data = dataset, kind = 'bar')
plt.show()

In [None]:
g = sns.catplot(x = 'Late3059', y = 'Target', data = dataset, kind = 'bar')
plt.show()


In [None]:
dataset.Late3059 = [x if x <6 else 6 for x in dataset.Late3059]
g = sns.catplot(x = 'Late3059', y = 'Target', data = dataset, kind = 'bar')
plt.show()

In [None]:
dataset.DebtRatio = pd.cut(dataset.DebtRatio.values, 5).codes
dataset.MonthlyIncome = dataset.MonthlyIncome.fillna(dataset.MonthlyIncome.median()) # fill options: a constant, mean, median, or mode
dataset.MonthlyIncome = pd.cut(dataset.MonthlyIncome.values, 5).codes
dataset.OpenCredit = pd.cut(dataset.OpenCredit.values, 5).codes
dataset.Late90 = [x if x < 5 else 5 for x in dataset.Late90]
dataset.PropLines = [x if x < 6 else 6 for x in dataset.PropLines]
dataset.Late6089 = [x if x < 6 else 6 for x in dataset.Late6089]
dataset.Deps = dataset.Deps.fillna(dataset.Deps.median())
dataset.Deps = [x if x < 4 else 4 for x in dataset.Deps]

In [None]:
# One-hot encode every binned/capped feature; each column name is used as its prefix
dummy_cols = ['UnsecuredLines', 'age', 'Late3059', 'DebtRatio', 'MonthlyIncome',
              'OpenCredit', 'Late90', 'PropLines', 'Late6089', 'Deps']
dataset = pd.get_dummies(dataset, columns = dummy_cols, prefix = dummy_cols)

**Q16: Train test split**
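
One possible split, sketched on synthetic data. It assumes that rows coming from `cs-test.csv` have a missing `Target` after the merge in Q13, so the labelled rows can be recovered with `notna()`. The frame `demo`, the feature names `f0`..`f3` and the 80/20 ratio are placeholders, not part of the original notebook. If the CSV index was read in as an extra column (for example `Unnamed: 0`), it would likely need dropping before this step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged `dataset` (assumption)
rng = np.random.default_rng(42)
n_train, n_test = 400, 100
demo = pd.DataFrame(rng.integers(0, 2, size=(n_train + n_test, 4)),
                    columns=["f0", "f1", "f2", "f3"])
# Labelled rows first, then unlabelled rows with a missing target
demo["Target"] = np.r_[rng.binomial(1, 0.1, n_train), np.full(n_test, np.nan)]

# Separate labelled rows (for training) from unlabelled rows (for submission)
labelled = demo[demo["Target"].notna()]
unlabelled = demo[demo["Target"].isna()].drop(columns="Target")

X = labelled.drop(columns="Target")
y = labelled["Target"].astype(int)

# Stratify so both parts keep the same share of bad borrowers
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_valid.shape, unlabelled.shape)
```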

**Q17: Train prediction model using Random Forest Classifier**
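
A minimal training sketch, assuming scikit-learn's `RandomForestClassifier`. The data comes from `make_classification` with a roughly 90/10 class split as a stand-in for the encoded features, and the hyperparameters (`n_estimators=100`, `class_weight="balanced"`) are illustrative choices, not values from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared features and target (assumption)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)

# Probability of the positive class (financial distress) for each validation row
valid_proba = clf.predict_proba(X_valid)[:, 1]
print(valid_proba[:5])
```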

**Q18: Get feature importance from classifier**
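
A fitted random forest exposes impurity-based importances through `feature_importances_`. The sketch below pairs them with column names and sorts them; the data and the names `feature_0`..`feature_7` are synthetic placeholders, and on the real data the names would come from the training frame's columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (assumption)
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort so the strongest features come first
importances = (pd.Series(clf.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances)
```

With one-hot encoded bins, each dummy column gets its own importance, so summing by original feature prefix may give a clearer picture.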

**Q19: Retrain with better parameters**
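
One common way to look for better parameters is a cross-validated grid search. The sketch assumes scikit-learn's `GridSearchCV` scored by ROC AUC; the grid values are small illustrative choices on synthetic data, not tuned settings from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (assumption)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# Hypothetical search space: 2 x 2 x 2 = 8 candidate settings
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 6],
    "min_samples_leaf": [1, 10],
}

# 3-fold cross-validation, keeping the setting with the best mean ROC AUC
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best_params = search.best_params_
best_model = search.best_estimator_  # refitted on all of X, y with best_params
print(best_params, round(search.best_score_, 3))
```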

**Q20: Predict and evaluate the model performance**
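
A sketch of the evaluation step on held-out data. Because the task is to predict a probability for an imbalanced target, it reports ROC AUC on the predicted probabilities alongside a confusion matrix and per-class precision and recall. The data is synthetic and the 0.5 decision threshold is an assumption that would normally be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Probabilities for ranking metrics, hard labels for threshold-based metrics
proba = clf.predict_proba(X_valid)[:, 1]
pred = (proba >= 0.5).astype(int)  # 0.5 is an assumed threshold

auc = roc_auc_score(y_valid, proba)
cm = confusion_matrix(y_valid, pred)  # rows: true class, columns: predicted class
report = classification_report(y_valid, pred, zero_division=0)

print(f"AUC: {auc:.3f}")
print(cm)
print(report)
```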