In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as pp
%matplotlib inline

In [2]:
df=pd.read_csv("C:/Users/CampusUser/Anaconda3/train.csv")

For a predictive model on the data set, 
Skicit-Learn (sklearn) is the most commonly used library in Python for this purpose and sklearn requires all inputs to be numeric.
So before converting all our categorical variables into numeric by encoding the categories, let's take care of the missing values in the dataset 

In [3]:
df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
df['Married'].fillna(df['Married'].mode()[0], inplace=True)
df['Dependents'].fillna(df['Dependents'].mode()[0], inplace=True)
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0], inplace=True)
df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True)
df['Self_Employed'].fillna('No',inplace=True)

In [4]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes 

Loan_ID               object
Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
Loan_Status            int64
Unnamed: 13          float64
dtype: object

After importing the required modules, we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores.

In [5]:
#Importing models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

  from numpy.core.umath_tests import inner1d


In [6]:
#Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and accessing performance:
def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])
  
  #Make predictions on training set:
  predictions = model.predict(data[predictors])
  
  #Print accuracy
  accuracy = metrics.accuracy_score(predictions,data[outcome])
  print ("Accuracy : %s" % "{0:.3%}".format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  kf = KFold(data.shape[0], n_folds=5)
  error = []
  for train, test in kf:
    # Filter training data
    train_predictors = (data[predictors].iloc[train,:])
    
    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train]
    
    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)
    
    #Record error from each cross-validation run
    error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
 
  print ("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

  #Fit the model again so that it can be refered outside the function:
  model.fit(data[predictors],data[outcome]) 

Logistic Regression: 
We can easily make some intuitive hypothesis to set the ball rolling. 
The chances of getting a loan will be higher for:
* Applicants having a credit history (remember we observed this in exploration?)
* Applicants with higher applicant and co-applicant incomes
* Applicants with higher education level
* Properties in urban areas with high growth perspectives

In [7]:
out_var = 'Loan_Status'
model = LogisticRegression()
pred_var = ['Credit_History']
classification_model(model, df,pred_var,out_var)

Accuracy : 80.945%
Cross-Validation Score : 80.942%


In [8]:
#We can try different combination of variables:
pred_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df,pred_var,out_var)

Accuracy : 80.945%
Cross-Validation Score : 80.942%


Generally the accuracy is expected to increase on adding variables. But, here the accuracy and cross-validation score are not getting impacted by less important variables

In [9]:
# Decision Tree: It is known to provide higher accuracy than logistic regression model.
model = DecisionTreeClassifier()
pred_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,pred_var,out_var)

Accuracy : 80.945%
Cross-Validation Score : 80.942%


In [10]:
# The model based on categorical variables is unable to have an impact because Credit History is dominating over them
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].fillna(df['LoanAmount_log'].mean())
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])

In [11]:
#We can try different combination of variables:
pred_var = ['Credit_History','Loan_Amount_Term','LoanAmount_log']
classification_model(model, df,pred_var,out_var)

Accuracy : 81.270%
Cross-Validation Score : 79.964%


Here we observed that although the accuracy went up on adding variables, the cross-validation error went down. This is the result of model over-fitting the data.

Random Forest:
Random forest is another algorithm for solving the classification problem.
An advantage with Random Forest is that we can make it work with all the features and it returns a feature importance matrix which can be used to select features.

In [12]:
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
        'LoanAmount_log','TotalIncome_log']
classification_model(model, df,predictor_var,outcome_var)

Accuracy : 99.837%
Cross-Validation Score : 76.218%


In [14]:
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=pred_var).sort_values(ascending=False)
print (featimp)

TotalIncome_log     0.448355
Credit_History      0.280170
Dependents          0.064225
Loan_Amount_Term    0.055405
Property_Area       0.051871
Education           0.027904
Gender              0.025259
Self_Employed       0.023764
Married             0.023046
dtype: float64


In [15]:
model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
pred_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Dependents','Property_Area']
classification_model(model, df,pred_var,out_var)

Accuracy : 82.410%
Cross-Validation Score : 80.293%


Although accuracy reduced, but the cross-validation score is improving showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization. we have reached a cross-validation accuracy only slightly better than the original logistic regression model.