# Task 10: Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance deals in all kinds of home loans and operates across urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To support this, it has provided a dataset for identifying the customer segments that are eligible for a loan, so that those customers can be targeted specifically.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- ### Build models
- ### Create Parameter Grid
- ### Run GridSearchCV
- ### Choose the best model with the best hyperparameter
- ### Give the best accuracy
- ### Also, benchmark the best accuracy that you could get for every classification algorithm asked above
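The steps above can be sketched as a single loop over (estimator, parameter grid) pairs. This is a minimal sketch, not the notebook's own code: it assumes synthetic data from `make_classification` as a stand-in for the loan data, and the names `candidates`, `results`, and `best_overall` as well as the small grids are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the loan data is loaded further down).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# One (estimator, parameter grid) pair per algorithm; the grids are illustrative.
candidates = {
    "DT": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}

results = []
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validated grid search, scored by accuracy.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_demo, y_demo)
    results.append({
        "Algorithm": name,
        "Accuracy": search.best_score_,
        "Hyperparameters": search.best_params_,
    })

# Table 1 corresponds to `results`; Table 2 is the single best row.
best_overall = max(results, key=lambda row: row["Accuracy"])
```

Keeping every algorithm in one dictionary means adding another model only requires one more entry, and both tables fall out of the same `results` list.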

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

| Algorithm | Accuracy | Hyperparameters |
|---|---|---|
| DT | | |
| KNN | | |
| LR | | |
| SVM | | |
| RF | | |
| Any other | | |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|---|---|---|



### Submission
- Submit the notebook with all code executed and outputs saved
- Submit a document containing the two tables above

In [1]:
import pandas as pd 
import numpy as np                      
import seaborn as sns                 
import matplotlib.pyplot as plt       
%matplotlib inline 
import warnings     

In [2]:
#read train dataset
train = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv")
train.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
train.shape

(614, 13)

In [4]:
#features of train
train.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [5]:
train.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [6]:
# distribution of the target variable
train['Loan_Status'].value_counts()

Y    422
N    192
Name: Loan_Status, dtype: int64

In [7]:
# proportion of each class in the target variable
train['Loan_Status'].value_counts(normalize=True)

Y    0.687296
N    0.312704
Name: Loan_Status, dtype: float64
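These proportions also give a reference point for the accuracies reported later: a classifier that always predicts the majority class `Y` would already be right about 68.7% of the time on this data. A minimal sketch of that arithmetic, using the counts printed above (the names `counts` and `majority_baseline` are illustrative):

```python
# Class counts taken from the value_counts() output above.
counts = {"Y": 422, "N": 192}

total = sum(counts.values())

# Accuracy of always predicting the most frequent class.
majority_baseline = max(counts.values()) / total
print(f"{majority_baseline:.6f}")
```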

In [8]:
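# preprocessing categorical features
# note: after this replace, Dependents is still an object column that mixes
# the strings '0', '1', '2' with the integer 3
train['Dependents'].replace('3+',3,inplace=True)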
train['Loan_Status'].replace('Y',1,inplace=True)
train['Loan_Status'].replace('N',0,inplace=True)
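An alternative way to do the same recoding, shown here only as a sketch on a tiny made-up frame (`demo` is hypothetical, not the loan data): assigning the result back to the column avoids `inplace=True` on a column selection, and `pd.to_numeric` leaves `Dependents` as a single numeric dtype instead of a mix of strings and integers.

```python
import pandas as pd

# Tiny made-up frame standing in for the relevant columns of `train`.
demo = pd.DataFrame({
    "Dependents": ["0", "3+", None, "2"],
    "Loan_Status": ["Y", "N", "Y", "N"],
})

# Recode '3+' as '3', then convert the whole column to numbers (missing stays NaN).
demo["Dependents"] = pd.to_numeric(demo["Dependents"].replace("3+", "3"))

# Map the target to 1/0 in one step.
demo["Loan_Status"] = demo["Loan_Status"].map({"Y": 1, "N": 0})
```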

In [9]:
#missing values
train.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [10]:
#Filling the missing values
#numerical variables with mode or median
#categorical variables with mode

train['Gender'].fillna(train['Gender'].mode()[0],inplace=True)
train['Married'].fillna(train['Married'].mode()[0],inplace=True)
train['Dependents'].fillna(train['Dependents'].mode()[0],inplace=True)
train['Self_Employed'].fillna(train['Self_Employed'].mode()[0],inplace=True)
train['Credit_History'].fillna(train['Credit_History'].mode()[0],inplace=True)
train['Loan_Amount_Term'].fillna(train['Loan_Amount_Term'].mode()[0], inplace=True)
train['LoanAmount'].fillna(train['LoanAmount'].median(), inplace=True)
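The same mode/median strategy can also be written as one dictionary of fill values computed once and reused. The sketch below uses two small made-up frames (`train_demo` and `test_demo` are hypothetical) and assumes you want a held-out frame filled with statistics learned from the training frame, which is a design choice rather than something this notebook does.

```python
import pandas as pd

# Made-up stand-ins for a training frame and a held-out frame.
train_demo = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "LoanAmount": [100.0, None, 120.0, 200.0],
})
test_demo = pd.DataFrame({
    "Gender": [None, "Female"],
    "LoanAmount": [None, 90.0],
})

# Mode for the categorical column, median for the numeric one,
# both computed from the training frame only.
fill_values = {
    "Gender": train_demo["Gender"].mode()[0],
    "LoanAmount": train_demo["LoanAmount"].median(),
}

train_filled = train_demo.fillna(fill_values)
test_filled = test_demo.fillna(fill_values)
```

Passing a dictionary to `DataFrame.fillna` fills each column with its own value and returns a new frame, so no `inplace=True` is needed.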

In [11]:
train.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [12]:
train = train.drop('Loan_ID',axis=1)

In [13]:
from sklearn import preprocessing

In [14]:
label_encoder = preprocessing.LabelEncoder()

In [15]:
#LabelEncoding
train['Gender_Clean']=label_encoder.fit_transform(train['Gender'])
train['Married_Clean']=label_encoder.fit_transform(train['Married'])
train['Education_Clean']=label_encoder.fit_transform(train['Education'])
train['Self_Employed_Clean']=label_encoder.fit_transform(train['Self_Employed'])
train['Property_Area_Clean']=label_encoder.fit_transform(train['Property_Area'])
train['Loan_Status_Clean']=label_encoder.fit_transform(train['Loan_Status'])

In [16]:
train.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status',
       'Gender_Clean', 'Married_Clean', 'Education_Clean',
       'Self_Employed_Clean', 'Property_Area_Clean', 'Loan_Status_Clean'],
      dtype='object')

In [17]:
train = train[['Dependents', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area_Clean', 'Gender_Clean', 'Education_Clean',
       'Self_Employed_Clean','Married_Clean','Loan_Status_Clean']]
train

Unnamed: 0,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_Clean,Gender_Clean,Education_Clean,Self_Employed_Clean,Married_Clean,Loan_Status_Clean
0,0,5849,0.0,128.0,360.0,1.0,2,1,0,0,0,1
1,1,4583,1508.0,128.0,360.0,1.0,0,1,0,0,1,0
2,0,3000,0.0,66.0,360.0,1.0,2,1,0,1,1,1
3,0,2583,2358.0,120.0,360.0,1.0,2,1,1,0,1,1
4,0,6000,0.0,141.0,360.0,1.0,2,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
609,0,2900,0.0,71.0,360.0,1.0,0,0,0,0,0,1
610,3,4106,0.0,40.0,180.0,1.0,0,1,0,0,1,1
611,1,8072,240.0,253.0,360.0,1.0,2,1,0,0,1,1
612,2,7583,0.0,187.0,360.0,1.0,2,1,0,0,1,1


In [18]:
#read test dataset
test = pd.read_csv('https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv')
test.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


In [19]:
test.shape

(367, 12)

In [20]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            367 non-null    object 
 1   Gender             356 non-null    object 
 2   Married            367 non-null    object 
 3   Dependents         357 non-null    object 
 4   Education          367 non-null    object 
 5   Self_Employed      344 non-null    object 
 6   ApplicantIncome    367 non-null    int64  
 7   CoapplicantIncome  367 non-null    int64  
 8   LoanAmount         362 non-null    float64
 9   Loan_Amount_Term   361 non-null    float64
 10  Credit_History     338 non-null    float64
 11  Property_Area      367 non-null    object 
dtypes: float64(3), int64(2), object(7)
memory usage: 34.5+ KB


In [21]:
test.isnull().sum()

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64

In [22]:
test['Dependents'].replace('3+',3,inplace=True)

In [23]:
#filling missing values of test dataset
test['Gender'].fillna(test['Gender'].mode()[0],inplace=True)
test['Dependents'].fillna(test['Dependents'].mode()[0],inplace=True)
test['Self_Employed'].fillna(test['Self_Employed'].mode()[0],inplace=True)
test['Credit_History'].fillna(test['Credit_History'].mode()[0],inplace=True)
test['Loan_Amount_Term'].fillna(test['Loan_Amount_Term'].mode()[0], inplace=True)
test['LoanAmount'].fillna(test['LoanAmount'].median(), inplace=True)

In [24]:
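#label encoding
# note: the encoder is re-fitted on the test data, so the codes match the
# train codes only if each column has the same set of categories in both files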
test['Property_Area_Clean']= label_encoder.fit_transform(test['Property_Area'])
test['Gender_Clean']= label_encoder.fit_transform(test['Gender'])
test['Education_Clean']= label_encoder.fit_transform(test['Education']) 
test['Self_Employed_Clean']= label_encoder.fit_transform(test['Self_Employed'])
test['Married_Clean']=label_encoder.fit_transform(test['Married'])
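If a category were missing from one of the two files, re-fitting the encoder would silently assign different codes to the same label. One way to guard against that, sketched here on made-up frames (`train_cat` and `test_cat` are hypothetical), is to fit a single `OrdinalEncoder` on the training columns and only call `transform` on the other frame.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical columns; the held-out frame lacks some categories.
train_cat = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Gender": ["Male", "Female", "Male"],
})
test_cat = pd.DataFrame({
    "Property_Area": ["Urban", "Urban"],
    "Gender": ["Male", "Male"],
})

encoder = OrdinalEncoder()

# Learn the category-to-code mapping from the training frame only.
train_codes = encoder.fit_transform(train_cat)

# Reuse that mapping: 'Urban' stays 2 and 'Male' stays 1,
# even though 'Rural', 'Semiurban' and 'Female' never appear here.
test_codes = encoder.transform(test_cat)
```

Re-fitting on `test_cat` alone would instead encode both `Urban` and `Male` as 0.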

In [25]:
test = test[['Dependents', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area_Clean', 'Gender_Clean', 'Education_Clean',
       'Self_Employed_Clean','Married_Clean']]
test.head()

Unnamed: 0,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_Clean,Gender_Clean,Education_Clean,Self_Employed_Clean,Married_Clean
0,0,5720,0,110.0,360.0,1.0,2,1,0,0,1
1,1,3076,1500,126.0,360.0,1.0,2,1,0,0,1
2,2,5000,1800,208.0,360.0,1.0,2,1,0,0,1
3,2,2340,2546,100.0,360.0,1.0,2,1,0,0,1
4,0,3276,0,78.0,360.0,1.0,2,1,1,0,0


In [26]:
X = train.drop(['Loan_Status_Clean'], axis = 1)
y = train['Loan_Status_Clean']

In [27]:
#Scaling the data
from sklearn.preprocessing import MinMaxScaler

In [28]:
min_max=MinMaxScaler()

In [29]:
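# note: only these five numeric columns are kept as features from here on;
# Dependents and the *_Clean columns are not passed to the models
X =min_max.fit_transform(X[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])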

In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
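Here the scaler is fitted on all rows before the split, so the held-out rows influence the scaling ranges. A common alternative, sketched below on synthetic data and not taken from this notebook, is to put the scaler inside a `Pipeline` so that it is re-fitted on the training folds only. The names `pipe` and `search`, the `stratify` argument, and the small grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the five numeric loan features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

# The scaler is a pipeline step, so each CV fold fits it on its own training part.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# Accuracy of the refitted best pipeline on rows it has never seen.
held_out_accuracy = search.score(X_te, y_te)
```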

In [31]:
from sklearn.model_selection import GridSearchCV

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### Random Forest Model

In [33]:
#set the parameters for GridSearchCV
grid_params_rf = {'n_estimators': [10, 50, 100],
                  'max_depth': [1, 5, 10],
                  'min_samples_split': [2, 5, 10],
                  'min_samples_leaf': [1, 2, 4],
                  'criterion': ['gini', 'entropy']}

In [34]:
rf = RandomForestClassifier(random_state=0)

rf = GridSearchCV(rf, grid_params_rf, cv=5, scoring='accuracy', n_jobs=-1)

rf.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 5, 10], 'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [10, 50, 100]},
             scoring='accuracy')

In [35]:
print(f' Best parameter: {rf.best_params_}\n Best Estimator: {rf.best_estimator_}\n Best Score: {rf.best_score_}')

 Best parameter: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
 Best Estimator: RandomForestClassifier(criterion='entropy', max_depth=5, min_samples_split=5,
                       random_state=0)
 Best Score: 0.8044526901669758


In [36]:
best_rf = rf.best_estimator_

y_pred_rf = best_rf.predict(X_test)

rf_score = accuracy_score(y_test, y_pred_rf)
rf_score


0.8292682926829268

### Logistic Regression Model

In [37]:
from sklearn.linear_model import LogisticRegression

In [38]:
grid_params_lr = { "C": [0.001, 0.01, 0.1, 1,50,100], 
              'penalty': ['l2', 'l1'], 
              'solver': ['liblinear']}

lr = LogisticRegression(random_state=0)

lr = GridSearchCV(lr, grid_params_lr, cv=5, scoring='accuracy')

lr.fit(X_train,y_train)

print(f' Best parameter: {lr.best_params_}\n Best Estimator: {lr.best_estimator_}\n Best Score: {lr.best_score_}')

 Best parameter: {'C': 50, 'penalty': 'l1', 'solver': 'liblinear'}
 Best Estimator: LogisticRegression(C=50, penalty='l1', random_state=0, solver='liblinear')
 Best Score: 0.806472892187178


In [39]:
best_lr = lr.best_estimator_
pred_lr = best_lr.predict(X_test)
lr_score = accuracy_score(y_test, pred_lr)
lr_score

0.8373983739837398

### Decision Tree Model

In [40]:
from sklearn.tree import DecisionTreeClassifier

In [41]:
grid_params_dt = { "max_depth": [3, 5, 7, 9, 11, 13], 
                  'min_samples_split': [2, 5, 10],
                  'min_samples_leaf': [1, 2, 4],
                  # note: 'auto' behaved like 'sqrt' for this classifier and was removed in scikit-learn 1.3
                  'max_features': ['auto', 'sqrt'],
                  'criterion': ['gini', 'entropy']}

dt = DecisionTreeClassifier(random_state=0)

dt = GridSearchCV(dt, grid_params_dt, cv=5, scoring='accuracy',n_jobs=-1)

dt.fit(X_train,y_train)

print(f' Best parameter: {dt.best_params_}\n Best Estimator: {dt.best_estimator_}\n Best Score: {dt.best_score_}')

 Best parameter: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2}
 Best Estimator: DecisionTreeClassifier(criterion='entropy', max_depth=3, max_features='auto',
                       random_state=0)
 Best Score: 0.8024118738404453


In [43]:
best_dt = dt.best_estimator_
pred_dt = best_dt.predict(X_test)
dt_score = accuracy_score(y_test, pred_dt)
dt_score

0.8048780487804879

### SVM Model

In [44]:
from sklearn.svm import SVC

In [45]:
grid_params_svm = { "C": [0.001, 0.01, 0.1, 1, 10,50,100], 
                      'kernel': ['linear','rbf']}

svm = SVC(random_state=0)

svm = GridSearchCV(svm, grid_params_svm, cv=5, scoring='accuracy',n_jobs=-1)

svm.fit(X_train,y_train)

print(f' Best parameter: {svm.best_params_}\n Best Estimator: {svm.best_estimator_}\n Best Score: {svm.best_score_}')

 Best parameter: {'C': 0.1, 'kernel': 'linear'}
 Best Estimator: SVC(C=0.1, kernel='linear', random_state=0)
 Best Score: 0.8044320758606472


In [46]:
best_svm = svm.best_estimator_
pred_svm = best_svm.predict(X_test)
svm_score = accuracy_score(y_test, pred_svm)
svm_score

0.8292682926829268

### KNN Model

In [47]:
from sklearn.neighbors import KNeighborsClassifier

In [48]:
grid_params_knn = { 'n_neighbors': [3, 5, 7, 9],
                    'weights': ['uniform', 'distance'],
                    'p': [1, 2]}

knn = KNeighborsClassifier()

knn = GridSearchCV(knn, grid_params_knn, cv=5, scoring='accuracy')

knn.fit(X_train,y_train)

print(f' Best parameter: {knn.best_params_}\n Best Estimator: {knn.best_estimator_}\n Best Score: {knn.best_score_}')

 Best parameter: {'n_neighbors': 7, 'p': 1, 'weights': 'uniform'}
 Best Estimator: KNeighborsClassifier(n_neighbors=7, p=1)
 Best Score: 0.7983714698000413



In [49]:
best_knn = knn.best_estimator_
pred_knn = best_knn.predict(X_test)
knn_score = accuracy_score(y_test, pred_knn)
knn_score


0.8048780487804879

In [70]:
pd.options.display.max_colwidth = None
scores = pd.DataFrame({'Models': ['Random Forest','Logistic Regression','Decision Tree','SVM','KNN'], 
                       'Best parameters': [rf.best_params_,lr.best_params_,dt.best_params_,svm.best_params_,knn.best_params_],
                       'Scores': [rf_score,lr_score,dt_score,svm_score,knn_score]})
scores.sort_values(by='Scores', ascending=False)

| | Models | Best parameters | Scores |
|---|---|---|---|
| 1 | Logistic Regression | {'C': 50, 'penalty': 'l1', 'solver': 'liblinear'} | 0.837398 |
| 0 | Random Forest | {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100} | 0.829268 |
| 3 | SVM | {'C': 0.1, 'kernel': 'linear'} | 0.829268 |
| 2 | Decision Tree | {'criterion': 'entropy', 'max_depth': 3, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2} | 0.804878 |
| 4 | KNN | {'n_neighbors': 7, 'p': 1, 'weights': 'uniform'} | 0.804878 |


In [69]:
pd.options.display.max_colwidth = None
best_scores = pd.DataFrame({'Models': ['Random Forest','Logistic Regression','Decision Tree','SVM','KNN'], 
                       'Best parameters': [rf.best_params_,lr.best_params_,dt.best_params_,svm.best_params_,knn.best_params_],
                       'Best Scores': [rf.best_score_,lr.best_score_,dt.best_score_,svm.best_score_,knn.best_score_]})
best_scores.sort_values(by='Best Scores', ascending=False)

| | Models | Best parameters | Best Scores |
|---|---|---|---|
| 1 | Logistic Regression | {'C': 50, 'penalty': 'l1', 'solver': 'liblinear'} | 0.806473 |
| 0 | Random Forest | {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100} | 0.804453 |
| 3 | SVM | {'C': 0.1, 'kernel': 'linear'} | 0.804432 |
| 2 | Decision Tree | {'criterion': 'entropy', 'max_depth': 3, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2} | 0.802412 |
| 4 | KNN | {'n_neighbors': 7, 'p': 1, 'weights': 'uniform'} | 0.798371 |


In [53]:
!pip install python-docx



In [54]:
import docx

In [64]:
doc = docx.Document()

In [65]:
table = doc.add_table(rows=1, cols=len(scores.columns))
header_cells = table.rows[0].cells
for i, column_title in enumerate(scores.columns):
    header_cells[i].text = column_title
for i, row in enumerate(scores.values):
    row_cells = table.add_row().cells
    for j, cell_value in enumerate(row):
        row_cells[j].text = str(cell_value)

In [66]:
doc.save('table.docx')

In [67]:
table2 = doc.add_table(rows=1, cols=len(best_scores.columns))
header_cells = table2.rows[0].cells
for i, column_title in enumerate(best_scores.columns):
    header_cells[i].text = column_title
for i, row in enumerate(best_scores.values):
    row_cells = table2.add_row().cells
    for j, cell_value in enumerate(row):
        row_cells[j].text = str(cell_value)

In [68]:
# note: this is the same Document object as above, so this file holds both tables
doc.save('table2.docx')