## Home Credit Default Risk
##### Can you predict how capable each applicant is of repaying a loan?
#### Overview 
This project was inspired by the fact that many people who deserve a loan do not get one and end up in the hands of untrustworthy lenders.
This project is a competition from Kaggle. Below is the link: [Kaggle | Home Credit Default Risk Competition](https://www.kaggle.com/c/home-credit-default-risk)


Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

![about homecredit](https://storage.googleapis.com/kaggle-media/competitions/home-credit/about-us-home-credit.jpg)       [Source : Kaggle](https://storage.googleapis.com/kaggle-media/competitions/home-credit/about-us-home-credit.jpg)

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.


## Problem Statement.
### Can you predict how capable each applicant is of repaying a loan?
- My analysis predicts how capable each applicant is of repaying a loan.

### Datasets and Inputs.
The dataset for this project has been provided by Kaggle. <br>
Data description is below :
There are 7 different sources of data:
+ **application_train/application_test**: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET column, where **`0`** means the loan was repaid (**`Repayers`**) and **`1`** means the loan was not repaid (**`Defaulters`**).
+ **bureau**: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
+ **bureau_balance**: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
+ **previous_application**: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
+ **POS_CASH_BALANCE:** monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
+ **credit_card_balance**: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
+ **installments_payments**: payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.

For more information on what each table represents, please read the [PROPOSAL]('/Users/bhetey/version_control/machine-learning/projects/capstone/proposal.pdf') or the [Kaggle](https://www.kaggle.com/c/home-credit-default-risk) competition page. <br>
- Below is a diagram of how the tables are connected. 
![Data structure](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)

### Loading Data 

In [None]:
from __future__ import division
import pandas as pd # this is to import the pandas module
import numpy as np # importing the numpy module 
import os # file system management 
import zipfile # module to read ZIP archive files.
from glob import glob 
import seaborn as sns
import matplotlib.pyplot as plt

# Figures inline and set visualization style
%matplotlib inline
sns.set()

#print(os.listdir("../input/*.csv"))
filenames = glob("../input/*.csv")
filenames

In [None]:
# reading the data with pandas 
def reading_csv_file(filename):
    return pd.read_csv(filename)

In [None]:
app_train = reading_csv_file(filenames[2])
print ('Training data shape :{}'.format(app_train.shape))
app_train.head()
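
The order of the list returned by `glob` is not guaranteed, so positions such as `filenames[2]` can point at a different file on another machine. A minimal sketch of one alternative is to key each table by its file name. The helper `load_tables` and the temporary demo directory are hypothetical and only there so the example runs without the Kaggle files.

```python
import os
import tempfile

import pandas as pd


def load_tables(input_dir):
    """Read every CSV in input_dir into a dict keyed by file name without '.csv'."""
    tables = {}
    for name in sorted(os.listdir(input_dir)):
        if name.endswith(".csv"):
            key = name[:-len(".csv")]
            tables[key] = pd.read_csv(os.path.join(input_dir, name))
    return tables


# Tiny stand-in directory so the sketch runs without the Kaggle files.
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]}).to_csv(
    os.path.join(demo_dir, "application_train.csv"), index=False)
pd.DataFrame({"SK_ID_CURR": [3]}).to_csv(
    os.path.join(demo_dir, "application_test.csv"), index=False)

tables = load_tables(demo_dir)
print(sorted(tables))
print(tables["application_train"].shape)
```

With the real data, `load_tables("../input/")["application_train"]` would then return the training table regardless of file order.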


The training data has 307511 rows (*each one represents a separate loan*) and 122 features (columns), including the TARGET (what is to be predicted).

In [None]:
y = app_train.TARGET # y is going to be our target variable
y.head()

In [None]:
app_test = reading_csv_file(filenames[7])
print ('Testing data shape :{}'.format(app_test.shape))
app_test.head(5)

The testing set does not have the TARGET variable.

In [None]:
# training and testing sets do not have the same shape (the test set has no TARGET column)
app_train.shape == app_test.shape 

### Data conversion in pandas DataFrame

In [None]:
# read the remaining CSV files into pandas DataFrames (the indices depend on the order returned by glob)
pos_cash_balance = reading_csv_file(filenames[0])
bureau_balance = reading_csv_file(filenames[1])
previous_application = reading_csv_file(filenames[3])
installments_payments = reading_csv_file(filenames[4])
credit_card_balance = reading_csv_file(filenames[5])
bureau = reading_csv_file(filenames[8])

The HomeCredit columns description file gives the details of each feature in the dataset.

In [None]:
# this is done for indexing of the joint data later 
data = (os.listdir("../input/"))
data.remove('sample_submission.csv')
data

### Visual Exploratory Data Analysis (EDA)

In [None]:
print (app_train['TARGET'].value_counts())
app_train.head(5)

+ ### How many people repay loans : 
**Takeaway:** In the plot below, **`1`** stands for **Defaulters** and **`0`** for **Repayers**. Most applicants pay back their loans, so the two classes are far from equal in size. This is known as the [Imbalanced Class Problem](http://www.chioka.in/class-imbalance-problem/).

In [None]:
sns.countplot(x='TARGET',hue='TARGET', data=app_train)
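
The bars show counts; the imbalance is easier to quantify as shares. The sketch below uses a made-up series in place of `app_train['TARGET']` (92 repayers and 8 defaulters are assumed numbers, not the real distribution).

```python
import pandas as pd

# Hypothetical stand-in for app_train['TARGET']: 92 repaid (0), 8 defaulted (1).
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts()
shares = target.value_counts(normalize=True)  # fractions instead of counts
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(shares)
print("Repayers per defaulter: {:.1f}".format(imbalance_ratio))
```

On the real data, the same two calls on `app_train['TARGET']` give the actual class shares.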

+ ### What is the family status of the applicant:
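The plot shows that married applicants account for more repaid loans than any other family status. Since these are counts, this partly reflects how many married applicants there are.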

In [None]:
plt.figure(figsize=(10,3))
sns.countplot(x = "NAME_FAMILY_STATUS", hue = "TARGET", data = app_train)
fam_stat_target = pd.crosstab(app_train['NAME_FAMILY_STATUS'], app_train['TARGET'])
kind_of_applicant = ("Repayers", "Defaulters")
fam_stat_target.columns = kind_of_applicant
fam_stat_target
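
Counts alone do not say which group defaults more often, because the groups differ in size. A minimal sketch of comparing rates instead, using `normalize="index"` in `pd.crosstab`; the rows in `toy` are made up and only the column names match the real data.

```python
import pandas as pd

# Made-up rows standing in for app_train; the real columns have the same names.
toy = pd.DataFrame({
    "NAME_FAMILY_STATUS": ["Married"] * 6 + ["Single / not married"] * 4,
    "TARGET": [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize='index' turns each row of the table into shares that sum to 1.
rates = pd.crosstab(toy["NAME_FAMILY_STATUS"], toy["TARGET"], normalize="index")
rates.columns = ["Repayers", "Defaulters"]
print(rates)
```

In this toy table the married group has more repaid loans in absolute terms and also the lower default rate; on the real data the two views need not agree.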

+ ### Which income type and family status default the most:   
**Takeaway:** Among the defaulters, the largest group is married applicants with the `Working` income type.

In [None]:
plt.figure(figsize=(10, 5))  # create the figure
# keep only the rows where TARGET == 1 (defaulters)
sns.countplot(x="NAME_FAMILY_STATUS",
              hue="NAME_INCOME_TYPE",
              data=app_train.loc[app_train['TARGET'] == 1])
plt.show()

Checking for missing values.

In [None]:
# function to summarize the missing values in a DataFrame
def check_missing_values(df):
    # number of missing values per column
    total_miss_values = df.isnull().sum()

    # percentage of missing values per column
    miss_val_percent = total_miss_values / len(df) * 100

    # table of the counts and their percentages
    miss_val_percent_tab = pd.concat([total_miss_values, miss_val_percent], axis=1)
    miss_val_percent_tab.columns = ['Missing values', 'Total missing values in %']

    # keep only columns with missing values, sorted in descending order
    miss_val_percent_tab = miss_val_percent_tab[
        miss_val_percent_tab['Missing values'] != 0
    ].sort_values('Total missing values in %', ascending=False).round(1)

    # display information
    print ('The selected dataframe has {} columns.\n'.format(df.shape[1]))
    print ('There are {} columns with missing values in the dataset'.format(miss_val_percent_tab.shape[0]))

    return miss_val_percent_tab

In [None]:
missing_value = check_missing_values(app_train)
missing_value.head(10)

**Takeaway:** For some machine learning models, we have to deal with the missing values by imputing them or by dropping either the rows or the columns with the highest percentage of missing values. However, dropping means we might lose useful data, and we cannot know ahead of time whether removing it will help or harm the analysis until we experiment. 
Algorithms like **XGBoost** can handle missing data without imputation. [It automatically learns how to deal with missing data points.](https://machinelearningmastery.com/data-preparation-gradient-boosting-xgboost-python/)
+ [Additional reading](https://arxiv.org/abs/1603.02754)

Dealing with features 

**The dataset has 3 data types:** two numeric (`int64` and `float64`) and one non-numeric (e.g. text), which pandas calls _object_.

Numeric variables can be discrete or continuous. 
Non-numeric variables are [variables containing label values rather than numeric values.](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) They are sometimes called [nominal](https://en.wikipedia.org/wiki/Nominal_category).

In [None]:
app_train.dtypes.value_counts() # shows how many columns there are of each data type 

The cells below look at the columns of type object. Since we want to use them in a model, we will need to one-hot encode them. 

However, the choice of encoding is a judgment call. It depends on how many distinct categories each variable has. 

One of the major problems with categorical data is that only a few machine learning algorithms can work with it directly. Most need the data to be encoded into numeric variables first. 

How to convert categorical data into numerical data: 
+ **Integer Encoding** _where integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship._ 


+ **One-Hot Encoding** _where the integer encoded variable is removed and a new binary variable is added for each unique integer value._

[Read more](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/)
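
The two encodings described above can be sketched with pandas as follows. The column `EDUCATION` and its values are made up for illustration; `pd.factorize` assigns codes in order of first appearance, which is only a meaningful order if the categories happen to appear that way.

```python
import pandas as pd

# Hypothetical categorical column; the values are made up for illustration.
df = pd.DataFrame({"EDUCATION": ["Secondary", "Higher", "Secondary", "Lower"]})

# Integer encoding: each category gets one integer code (order of appearance).
codes, categories = pd.factorize(df["EDUCATION"])
df["EDUCATION_CODE"] = codes

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["EDUCATION"], prefix="EDUCATION").astype(int)

print(df)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no artificial ordering between categories is introduced.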

In [None]:
print ('Total number of object-type columns is : ', app_train.select_dtypes('object').shape[1])

Checking the number of unique classes in each object column 

In [None]:
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

**`.describe()`** generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

DAYS_BIRTH is recorded in days, so it is converted to years here. The values are negative because they are recorded relative to the current loan application. 

In [None]:
(app_train['DAYS_BIRTH']/-365).describe()

Looking at the result above, everything seems okay. There are no obvious outliers in the ages. 

**DAYS_EMPLOYED:** How many days before the application the person started current employment.

This is also relative to the current loan application. 

In [None]:
(app_train['DAYS_EMPLOYED']/365).describe()

In [None]:
(app_train['DAYS_EMPLOYED']/365).plot.hist(title = 'DAYS EMPLOYED (IN YEARS)');
plt.xlabel('YEARS EMPLOYED RELATIVE TO APPLICATION')

Looking at the image above, 1000 years does not seem right. 
We will use imputation to solve this. 
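
Before imputing, the implausible values have to be marked as missing. A minimal sketch, assuming the roughly 1000-year bar comes from a single placeholder value of 365243 days (365243 / 365 is about 1000); the series below is made up, and the flag column is an optional extra rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up values; 365243 is assumed to be the placeholder behind the ~1000-year bar.
days_employed = pd.Series([-1200, -350, 365243, -4000, 365243], name="DAYS_EMPLOYED")

# Keep a flag so a model can still see which rows carried the placeholder.
is_anomalous = days_employed == 365243

# Replace the placeholder with NaN so a later imputation step can fill it.
cleaned = days_employed.replace(365243, np.nan)

print(is_anomalous.sum())
print((cleaned / -365).describe())
```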

### Checking correlation of the data.
It helps to show possible relationship within our data. 
This article helps in interpreting [correlation](http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf), [How to interpret a Correlation Coefficient](https://www.dummies.com/education/math/statistics/how-to-interpret-a-correlation-coefficient-r/)

In [None]:
# correlations are only defined for numeric columns, so select those first
data_corr = app_train.select_dtypes('number').corr()['TARGET'].sort_values()
print ('These are samples of negative correlations : \n',data_corr.head(20))

In [None]:
print ('These are samples of positive correlations : \n', data_corr.tail(20))
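
To make the numbers above easier to interpret, here is a small sketch that computes the Pearson correlation coefficient from its definition and compares it with what pandas returns. The `toy` frame is made up; `FEATURE` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up feature and binary target, only to illustrate the calculation.
toy = pd.DataFrame({
    "FEATURE": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "TARGET": [0, 0, 0, 1, 0, 1],
})

# Pearson r: covariance of the centered values divided by the product of their spreads.
x = toy["FEATURE"] - toy["FEATURE"].mean()
y_centered = toy["TARGET"] - toy["TARGET"].mean()
r_manual = (x * y_centered).sum() / np.sqrt((x ** 2).sum() * (y_centered ** 2).sum())

r_pandas = toy["FEATURE"].corr(toy["TARGET"])
print(r_manual, r_pandas)
```

A positive value means that larger feature values go together with more defaults (`TARGET` = 1); a negative value means the opposite.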

#### ONE-HOT ENCODING 
Let's **one-hot encode** the categorical variables. <br>
The cell below uses `get_dummies` from pandas; `OneHotEncoder` from the scikit-learn library is an alternative.

In [None]:
app_train.select_dtypes('object').columns

In [None]:
len (app_train.columns) == len(app_train.select_dtypes('object').columns)

In [None]:
from sklearn.preprocessing import OneHotEncoder

one_hot_encoded_app_train = pd.get_dummies(app_train)
one_hot_encoded_app_test = pd.get_dummies(app_test)

print ('Shape of the training set after one hot encoding {}'.format(one_hot_encoded_app_train.shape))
print ('Shape of the test set after one hot encoding {}'.format(one_hot_encoded_app_test.shape))

In [None]:
app_train.shape == one_hot_encoded_app_train.shape

In [None]:
app_train.shape

In [None]:
app_test.shape

Looking at the shapes above, it is obvious that **One-Hot Encoding** has added extra features to the original ones, and it has not added the same ones to both sets, leaving our data unaligned. 

We need the same features in both the training and testing data for our machine learning model to work; otherwise we will get an error when running the algorithm. 

**STEPS TAKEN:** 
+ I decided to remove any column that is present in the training set but not in the testing set. 
+ The **y**, which is our **TARGET**, is removed by this step as well because it only exists in the training set, so it is added back afterwards.

In [None]:
#https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.align.html
one_hot_encoded_app_train, one_hot_encoded_app_test = one_hot_encoded_app_train.align(one_hot_encoded_app_test,
                                                                                      join='inner', axis=1)

print ('Shape of the training set after alignment {}'.format(one_hot_encoded_app_train.shape))
print ('Shape of the test set after alignment {}'.format(one_hot_encoded_app_test.shape))

In [None]:
one_hot_encoded_app_train['TARGET'] = y # adding it back to the data 

In [None]:
# dropping the target to get our X 
X = one_hot_encoded_app_train.drop(['TARGET'], axis=1)
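
The effect of the alignment step is easier to see on a toy example. In the sketch below the frames and the `GENDER` values are made up: one category appears only in the training data, so one-hot encoding produces a column that the test set lacks.

```python
import pandas as pd

# Made-up frames: the category 'Unknown' appears only in the training data.
train = pd.DataFrame({"GENDER": ["F", "M", "Unknown"], "TARGET": [0, 1, 0]})
test = pd.DataFrame({"GENDER": ["M", "F"]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# Save the labels first: the inner join drops TARGET because test lacks it.
labels = train_encoded["TARGET"]
train_aligned, test_aligned = train_encoded.align(test_encoded, join="inner", axis=1)

print(list(train_encoded.columns))
print(list(train_aligned.columns))
```

After the call, both frames contain only the columns they share, which is why `TARGET` has to be saved beforehand and added back.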

**Removing missing values**

In [None]:
# function to drop the columns with missing values 
def dropping_missing_columns(input_set):
    """Remove the columns that have missing values.
    input_set is the set passed to the function,
    either the training set or the test set.
    Note: the columns to drop are looked up in the global training
    features X, so both sets lose the same columns.
    """
    columns_with_missing_values = [
        col for col in input_set.columns if X[col].isnull().any()
    ]
    return input_set.drop(columns_with_missing_values, axis = 1)

In [None]:
#assigned a variable to data after dropping the missing values 
after_removing_missing_values = dropping_missing_columns(X)
test_removing_missing_values = dropping_missing_columns(one_hot_encoded_app_test)

In [None]:
print ('The shape of training set after removing missing values : {}'.format(after_removing_missing_values.shape))
print ('The shape of testing set after removing missing values :{}'.format(test_removing_missing_values.shape))

After removing the columns that contain **NaN**, the data is reduced to **181 columns**.

**Imputing NaN values**

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

# making a copy of the data before imputing 
train_tobe_imputed = one_hot_encoded_app_train.copy() 
test_tobe_imputed = one_hot_encoded_app_test.copy()
new_y = y.copy() # a copy of our target 

# dropping the target column before imputing 
new_X = train_tobe_imputed.drop(['TARGET'], axis=1)

# fit the imputer (column means) on the training set, then apply it to both sets
imputer = SimpleImputer()
transformed_X = imputer.fit_transform(new_X)
transformed_test_X = imputer.transform(test_tobe_imputed)

print ('Transformed training set :{}'.format(transformed_X.shape))
print ('Transformed testing set :{}'.format(transformed_test_X.shape))
print ('The data is back to the same shape we had after the one-hot encoding and alignment')
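
What mean imputation does can be sketched with plain pandas. The frames below are made up. The point of the example is that the fill values are learned from the training set only and then reused for the test set, so both sets are treated consistently.

```python
import numpy as np
import pandas as pd

# Made-up numeric frames standing in for the encoded training and test sets.
train = pd.DataFrame({"INCOME": [100.0, np.nan, 300.0], "AGE": [30.0, 40.0, np.nan]})
test = pd.DataFrame({"INCOME": [np.nan, 250.0], "AGE": [np.nan, 50.0]})

# Column means are learned from the training set only...
train_means = train.mean()

# ...and then used to fill the gaps in both sets.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_filled)
print(test_filled)
```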

## EVALUATION METRICS 

There are different kinds of evaluation metrics we can use since this is a **classification problem.** <br>
Below are some of the metrics: 

+ Classification accuracy 
+ Logarithmic Loss 
+ Area Under ROC Curve 
+ Confusion Matrix 
+ Classification report

[**Read more about them**](https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/)
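
With imbalanced classes, accuracy and area under the ROC curve can tell very different stories. The sketch below uses made-up labels (92 repayers, 8 defaulters) and a hypothetical helper `roc_auc` that computes the area as the chance that a randomly chosen defaulter gets a higher score than a randomly chosen repayer.

```python
import numpy as np


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive with every negative; ties count as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


# Made-up labels: 92 repayers (0) and 8 defaulters (1).
y_true = np.array([0] * 92 + [1] * 8)

# A "model" that always predicts 0 is often right but never finds a defaulter.
always_zero = np.zeros_like(y_true)
accuracy = (always_zero == y_true).mean()
auc = roc_auc(y_true, always_zero)

print(accuracy, auc)
```

The constant prediction reaches 92% accuracy on these labels while its AUC is 0.5, the same as random guessing.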

### CLASSIFICATION MODELS 
Classification depends on whether the variable we are trying to predict is **binary or non-binary**.

**Binary variables** are variables where the outcome is one of two values, such as 1 or 0, True or False. 

**Non-binary variables** are variables where the outcome is one of several categories, for example predicting whether the color of a person's dress will be `Yellow, Brown or Blue`.

#### Binary Classification Model : 
+ Logistic regression 
+ Decision Trees 
+ Support Vector Machine (SVM) : _good for anomaly detection especially in large feature sets_

#### Non-Binary Classification Model: 
+ Adaboost 
+ Random Forest 
+ Decision Tree
+ Neural Networks 

#### When choosing an algorithm, consider: 
+ Accuracy 
+ Training time 
+ Linearity 
+ Number of parameters 
+ Number of features 

[The Machine Learning Algorithm Cheat Sheet](https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice)

However, I am going to try each of them and compare their performance.

### Logistic Regression Model 
This is my first model. <br>
`C` is the inverse of the regularization strength. It is used to control overfitting, and a small value tends to reduce overfitting. 
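
One common way to write the L2-penalized logistic regression objective makes the role of `C` visible. This is a sketch of the standard textbook formulation; the exact scaling used inside scikit-learn can differ between versions. The symbols $w$ (weights), $b$ (intercept), $x_i$ (feature vector of loan $i$), $y_i$ (label) and $n$ (number of loans) are introduced here for illustration.

$$
\min_{w,\,b}\ \frac{1}{2} w^{\top} w + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right), \qquad y_i \in \{-1, +1\}
$$

The first term penalizes large weights and the second term measures the fit to the data. A small `C` shrinks the influence of the data-fit term, so the penalty dominates and the weights stay small. In this notation the labels are recoded as $-1$ and $+1$, whereas TARGET in the dataset is stored as 0 and 1.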

Logistic regression using the data set with dropped missing values. 
#### Model Evaluation using a validation set

In [None]:
# Import statements 
from sklearn.metrics import classification_report
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn import metrics

# split the data into a training part and a validation part
X_train, X_test, y_train, y_test = train_test_split(
    after_removing_missing_values, y, test_size=0.25, random_state=42)

# instantiate a logistic regression model with strong regularization
model = LogisticRegression(C = 0.0001)

# fit on the training part and report the accuracy on that same part
model.fit(X_train, y_train)
print ('This accuracy seems good {} but need to check further for the prediction and on testing set'.format(model.score(X_train, y_train)))

In [None]:
predicted = model.predict(X_test)
predicting_with_test = model.predict(test_removing_missing_values)
predicting_probability = model.predict_proba(X_test)
matrix = confusion_matrix(y_test, predicted)
scoring = accuracy_score(y_test, predicted)
r_score = r2_score(y_test, predicted)
report = classification_report(y_test, predicted)
print (report)
print (matrix)
print ('Accuracy of the model :{}'.format(scoring))
print ('R2 Score for the prediction :{}'.format(r_score))
print(metrics.roc_auc_score(y_test, predicting_probability[:, 1]))

#### Model Evaluation Using Cross-Validation
Now let's try 10-fold cross-validation, to see if the accuracy holds up more rigorously.

In [None]:
# evaluate the model using 10-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LogisticRegression(), 
                         after_removing_missing_values, y, scoring='accuracy', cv=10)
print (scores)
print (scores.mean())
print (predicting_with_test)
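
Because the classes are imbalanced, it can be informative to cross-validate on ROC AUC in addition to accuracy. A minimal sketch on synthetic data; `X_demo`, `y_demo` and the choice of 5 stratified folds are assumptions made for the example and are not the settings used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data, only so the sketch runs without the Kaggle files.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 0] + rng.normal(size=500) > 1.5).astype(int)

# Stratified folds keep the share of positives similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                             scoring="roc_auc", cv=folds)

print(auc_scores)
print(auc_scores.mean())
```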

In [None]:
my_submission = pd.DataFrame({'SK_ID_CURR': one_hot_encoded_app_test.SK_ID_CURR, 'TARGET': predicting_with_test})
my_submission.to_csv('homecredit.csv', index=False)

In [None]:
my_submission
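
The competition scores submissions by area under the ROC curve, which ranks predicted probabilities, while `model.predict` returns only 0/1 labels. A minimal sketch of building a submission from `predict_proba` instead; the data, the IDs and the names `clf` and `submission_sketch` are synthetic stand-ins, not the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the training features, labels, and test features.
rng = np.random.RandomState(0)
X_fit = rng.normal(size=(200, 3))
y_fit = (X_fit[:, 0] > 0.8).astype(int)
X_new = rng.normal(size=(5, 3))
new_ids = np.arange(100001, 100006)  # hypothetical SK_ID_CURR values

clf = LogisticRegression().fit(X_fit, y_fit)

# Column 1 of predict_proba is the estimated probability of class 1 (default).
default_probability = clf.predict_proba(X_new)[:, 1]

submission_sketch = pd.DataFrame({"SK_ID_CURR": new_ids, "TARGET": default_probability})
print(submission_sketch)
```

In the notebook, the equivalent would be to pass `model.predict_proba(test_removing_missing_values)[:, 1]` as the `TARGET` column.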