# Task 10: Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate the process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- ### Build models
- ### Create Parameter Grid
- ### Run GridSearchCV
- ### Choose the best model with the best hyperparameter
- ### Give the best accuracy
- ### Also, benchmark the best accuracy that you could get for every classification algorithm asked above
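
The steps listed above can be sketched in a few lines. This is only a minimal illustration, assuming synthetic data from `make_classification` and a deliberately tiny parameter grid; the names `X_demo`, `y_demo`, and the grid values are placeholders and not part of the task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the real task uses the loan dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# 1. Build the model
model = DecisionTreeClassifier(random_state=42)

# 2. Create the parameter grid (illustrative values only)
param_grid = {"max_depth": [2, 4, 8], "min_samples_split": [2, 10]}

# 3. Run GridSearchCV with 5-fold cross-validation and accuracy scoring
search = GridSearchCV(model, param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

# 4. Best hyperparameters, 5. best mean cross-validated accuracy
print(search.best_params_)
print(round(search.best_score_, 4))
```

Repeating the same pattern for each algorithm gives one best row per algorithm, and the highest of those rows is the overall best.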

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
| DT        |          |                 |
| KNN       |          |                 |
| LR        |          |                 |
| SVM       |          |                 |
| RF        |          |                 |
| Any other |          |                 |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|



### Submission
- Submit the notebook with all code executed and the outputs saved
- Submit a document containing the two tables above

### Importing required libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from pprint import pprint

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, log_loss

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV



### Defining models and their hyperparameters in dictionary

In [57]:
RANDOM_STATE = 42
random_model_dict = {
    "decision_tree": {
        "model": DecisionTreeClassifier,
        "random_hyperparams": {
            "criterion": ["gini", "entropy", "log_loss"],
            "splitter": ["best", "random"],
            "max_depth": list(range(5, 200)),
            "min_samples_split":list(range(2, 50)),
            "max_features": [2, 4, 6, 10] + ["sqrt", "log2"]    # 'auto' is deprecated in latest sklearn
        }
    },
    "logistic_regression": {
        "model": LogisticRegression,
        "random_hyperparams": {
            "penalty": ["l2", "none"],   # 'l1' and 'elasticnet' penalties are not supported by all solvers
            "C": list(np.arange(0.001, 1, 0.01)),
            "solver": ["newton-cg", "lbfgs", "sag", "saga"],
            "max_iter": list(range(5, 500)),
        }
    },
    "knn": {
        "model": KNeighborsClassifier,
        "random_hyperparams": {
            "n_neighbors": list(range(2, 100)),
            "weights": ["uniform", "distance"],
            "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
            "p": list(range(1, 100))
        }
    },
    "svc": {
        "model": SVC,
        "random_hyperparams": {
            "C": list(np.arange(0.001, 1, 0.01)),
            "kernel": ["linear", "poly", "rbf", "sigmoid"],
            "degree": list(range(1, 100)),
            "gamma": ["scale", "auto"],
            "max_iter": list(range(5, 1000)) + [-1]
        }
    },
    "random_forest": {
        "model": RandomForestClassifier,
        "random_hyperparams": {
            "n_estimators": list(range(100, 1000)),
            "criterion": ["gini", "entropy", "log_loss"],
            "max_depth": list(range(3, 100)) + [None],
            "max_leaf_nodes": list(range(10, 100)),
            "max_features": [2, 4, 6, 10] + ["sqrt", "log2", None],    # 'auto' is deprecated in latest sklearn
            "bootstrap": [True, False]
        }
    },
    "xgb": {
        "model": XGBClassifier,
        "random_hyperparams": {
            "n_estimators": list(range(10, 1000)),
            "max_depth": list(range(3, 100)) + [None],
            "learning_rate": list(np.arange(0.001, 1, 0.01)),
            "eval_metric": [accuracy_score, mean_absolute_error, mean_squared_error, log_loss],
        }
    }
}

### Setting up training and test datasets

In [34]:
# Reading train dataset
train_df = pd.read_csv("loan_train.csv")
train_df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [35]:
# Reading test dataset
test_df = pd.read_csv("loan_test.csv")
test_df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban
...,...,...,...,...,...,...,...,...,...,...,...,...
362,LP002971,Male,Yes,3+,Not Graduate,Yes,4009,1777,113.0,360.0,1.0,Urban
363,LP002975,Male,Yes,0,Graduate,No,4158,709,115.0,360.0,1.0,Urban
364,LP002980,Male,No,0,Graduate,No,3250,1993,126.0,360.0,,Semiurban
365,LP002986,Male,Yes,0,Graduate,No,5000,2393,158.0,360.0,1.0,Rural


In [36]:
# Checking info
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [37]:
# Getting numerical feature description
train_df.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


In [38]:
# Checking for null values
train_df.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [39]:
for col in train_df.columns:
    print("*"*10, col, "*"*10, "\n")
    print(train_df[col].value_counts())
    print("\n")

********** Loan_ID ********** 

LP001002    1
LP002328    1
LP002305    1
LP002308    1
LP002314    1
           ..
LP001692    1
LP001693    1
LP001698    1
LP001699    1
LP002990    1
Name: Loan_ID, Length: 614, dtype: int64


********** Gender ********** 

Male      489
Female    112
Name: Gender, dtype: int64


********** Married ********** 

Yes    398
No     213
Name: Married, dtype: int64


********** Dependents ********** 

0     345
1     102
2     101
3+     51
Name: Dependents, dtype: int64


********** Education ********** 

Graduate        480
Not Graduate    134
Name: Education, dtype: int64


********** Self_Employed ********** 

No     500
Yes     82
Name: Self_Employed, dtype: int64


********** ApplicantIncome ********** 

2500    9
4583    6
6000    6
2600    6
3333    5
       ..
3244    1
4408    1
3917    1
3992    1
7583    1
Name: ApplicantIncome, Length: 505, dtype: int64


********** CoapplicantIncome ********** 

0.0       273
2500.0      5
2083.0      5
1666

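#### The following logical assumptions can be made about this dataset
- If an applicant has co-applicant income, the applicant is assumed to be married.
- If an applicant has no co-applicant income, the applicant is assumed to be unmarried.

Both assumptions are used only to fill in missing values of `Married`.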


In [40]:
# Based on above analysis, let's create preprocessing function
def preprocess_dataset(data, test=False):
    # Dropping the Loan_ID column as it's not relevant here
    data = data.drop("Loan_ID", axis=1)
    
    # Assumption: if applicant has no co-applicant income, he/she is not married.
    data.loc[(data["CoapplicantIncome"] == 0) & (data["Married"].isna()), "Married"] = "No"
    
    # Assumption: if applicant has co-applicant income, he/she is married.
    data.loc[(data["CoapplicantIncome"] != 0) & (data["Married"].isna()), "Married"] = "Yes"
    
    # Dividing in X and y
    if not test:
        X_unprocessed = data.drop("Loan_Status", axis=1)
        y_unprocessed = data["Loan_Status"]
    else:
        X_unprocessed = data
    
    # Getting categorical and numerical columns
    cat_cols = X_unprocessed.select_dtypes(include="object").columns.tolist()
    num_cols = X_unprocessed.select_dtypes(exclude="object").columns.tolist()
    
    # The dataset is small, so SimpleImputer is preferred; there is too little data for IterativeImputer to work with
    cat_imputer = SimpleImputer(strategy="most_frequent")
    num_imputer = SimpleImputer(strategy="median")  # Using median strategy to handle outliers
    
    # Creating categorical transformer using Pipeline
    cat_transformer = Pipeline(steps=[
        ("cat_imputer", cat_imputer),
        ("encoder", OneHotEncoder())
    ])
    
    # Creating Numerical transformer using Pipeline
    num_transformer = Pipeline(steps=[
        ("num_imputer", num_imputer),
        ("scaler", MinMaxScaler())
    ])
    
    # Creating the ColumnTransformer from the two transformers above
    preprocesser = ColumnTransformer(transformers=[
        ("cat", cat_transformer, cat_cols),
        ("num", num_transformer, num_cols)
    ])
    
    # Fitting dataset to get processed data
    X = preprocesser.fit_transform(X_unprocessed)
    if not test:
        y = y_unprocessed.replace({"Y": 1, "N": 0})
        return X, y
    else:
        return X

In [41]:
X, y = preprocess_dataset(train_df)
X_test = preprocess_dataset(test_df, test=True)
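
Because `preprocess_dataset` calls `fit_transform` on whatever frame it receives, the test set above is imputed, encoded, and scaled with statistics learned from the test set itself rather than from the training data. A common alternative is to fit the transformer once on the training frame and only call `transform` on the test frame. The sketch below illustrates the difference on toy data; the frames `toy_train` and `toy_test`, the `handle_unknown="ignore"` option, and `sparse_threshold=0` (used only to force dense output) are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frames standing in for the loan data (assumption)
toy_train = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "LoanAmount": [100.0, 200.0, np.nan, 300.0],
})
toy_test = pd.DataFrame({
    "Property_Area": ["Urban", "Urban"],  # fewer categories than in training
    "LoanAmount": [150.0, 400.0],         # 400 lies outside the training range
})

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", Pipeline(steps=[
            ("cat_imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Property_Area"]),
        ("num", Pipeline(steps=[
            ("num_imputer", SimpleImputer(strategy="median")),
            ("scaler", MinMaxScaler()),
        ]), ["LoanAmount"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

# Fit on the training frame only, then reuse the fitted transformer
X_toy_train = preprocessor.fit_transform(toy_train)
X_toy_test = preprocessor.transform(toy_test)

# Refitting a fresh copy on the test frame alone changes the feature layout
X_toy_refit = clone(preprocessor).fit_transform(toy_test)

print(X_toy_train.shape, X_toy_test.shape, X_toy_refit.shape)
```

In this toy case the refitted output has fewer columns than the training matrix, because the test frame contains only one category. Even when the column count happens to match, the scaling and imputation values would still differ from those the models were trained on.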

In [42]:
# Make the train and validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=RANDOM_STATE)

# Making sure all arrays are stored as float64
X_train = X_train.astype("float64")
X_val = X_val.astype("float64")
y_train = y_train.astype("float64")
y_val = y_val.astype("float64")

### Let's first test every model in the dictionary with default parameters

The accuracy scores are also stored in a DataFrame for later comparison.

In [43]:
default_score = []

In [44]:
for model_key, model_val in random_model_dict.items():  # only the "model" entry is used here
    if model_key != "knn":
        model = model_val["model"](random_state=RANDOM_STATE)
    else:
        model = model_val["model"]()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    acc_score = accuracy_score(y_true=y_val, y_pred=y_pred)
    default_score.append({"model": model_key, "accuracy_score": acc_score})





In [45]:
default_score_df = pd.DataFrame.from_dict(default_score).sort_values("accuracy_score", ascending=False)
default_score_df

Unnamed: 0,model,accuracy_score
1,logistic_regression,0.783784
3,svc,0.783784
5,xgb,0.772973
4,random_forest,0.751351
0,decision_tree,0.686486
2,knn,0.648649


**As observed above, the highest accuracy with default parameters is 78.38%, achieved by two models: logistic regression and SVC.**

### Now let's run RandomizedSearchCV over all parameters listed in the dictionary

This will run for a while, which is a handy way to turn your PC into a heater in this cold weather!

In [46]:
random_score = []
best_random_estimator = {}

In [53]:
for model_key, model_val in random_model_dict.items():
    print("*"*10, f"Fitting for {model_key}", "*"*10, "\n")
    if model_key != "knn":
        model = model_val["model"](random_state=RANDOM_STATE)
    else:
        model = model_val["model"]()
    
    random_cv = RandomizedSearchCV(
        n_iter=100,
        estimator=model,
        param_distributions=model_val["random_hyperparams"],
        verbose=5,
        n_jobs=-2  # Use all CPU cores but one
    )
    random_cv.fit(X_train, y_train)
    random_cv_output = random_cv.cv_results_
    best_random_estimator[model_key] = random_cv.best_estimator_
    
    for el in zip(random_cv_output["mean_test_score"], random_cv_output["params"], random_cv_output["mean_fit_time"]):
        random_score.append({
            "model": model_key,
            "accuracy_score": el[0],
            "params": el[1],
            "fit_time": el[2]
        })
        
    print("\n")

********** Fitting for decision_tree ********** 

Fitting 5 folds for each of 100 candidates, totalling 500 fits


********** Fitting for logistic_regression ********** 

Fitting 5 folds for each of 100 candidates, totalling 500 fits


********** Fitting for knn ********** 

Fitting 5 folds for each of 100 candidates, totalling 500 fits


********** Fitting for svc ********** 

Fitting 5 folds for each of 100 candidates, totalling 500 fits


********** Fitting for random_forest ********** 

Fitting 5 folds for each of 100 candidates, totalling 500 fits


********** Fitting for xgb ********** 

Fitting 5 folds for each of 100 candidates, totalling 500 fits






In [54]:
random_score_df = pd.DataFrame.from_dict(random_score).sort_values("accuracy_score", ascending=False)
random_score_df

Unnamed: 0,model,accuracy_score,params,fit_time
508,random_forest,0.825144,"{'n_estimators': 298, 'max_leaf_nodes': 78, 'm...",0.383983
83,random_forest,0.825144,"{'n_estimators': 310, 'max_leaf_nodes': 82, 'm...",0.431598
1040,random_forest,0.822791,"{'n_estimators': 808, 'max_leaf_nodes': 77, 'm...",1.069383
971,svc,0.820465,"{'max_iter': 84, 'kernel': 'rbf', 'gamma': 'au...",0.003340
945,svc,0.820465,"{'max_iter': 742, 'kernel': 'linear', 'gamma':...",0.003125
...,...,...,...,...
481,svc,0.454911,"{'max_iter': 962, 'kernel': 'poly', 'gamma': '...",0.006787
901,svc,0.436443,"{'max_iter': 841, 'kernel': 'poly', 'gamma': '...",0.000000
391,svc,0.417319,"{'max_iter': 133, 'kernel': 'poly', 'gamma': '...",0.000000
477,svc,0.415486,"{'max_iter': 948, 'kernel': 'poly', 'gamma': '...",0.006761


In [55]:
for model_key, model_df in random_score_df.groupby("model"):
    print("*"*10, f"Exporting DF for {model_key}", "*"*10, "\n")
    sampled_df = model_df.sort_values("accuracy_score", ascending=False)
    sampled_df = sampled_df.drop("model", axis=1)
    sampled_df.to_excel(f"{model_key}_randomcv.xlsx", index=False)

********** Exporting DF for decision_tree ********** 

********** Exporting DF for knn ********** 

********** Exporting DF for logistic_regression ********** 

********** Exporting DF for random_forest ********** 

********** Exporting DF for svc ********** 

********** Exporting DF for xgb ********** 



### Now let's build a GridSearchCV parameter grid around the best parameters found by RandomizedSearchCV

In [58]:
grid_model_dict = {
    "decision_tree": {
        "model": DecisionTreeClassifier,
        "grid_hyperparams": {
            "criterion": ["gini", "entropy", "log_loss"],
            "splitter": ["best", "random"],
            "max_depth": [5, 25, 40, 50, 100, 125, 150, 175],
            "min_samples_split": list(range(40, 51)),
            "max_features": [6, 8, 10, 12, 14, 16] + ["log2"]    # 'auto' is deprecated in latest sklearn
        }
    },
    "logistic_regression": {
        "model": LogisticRegression,
        "grid_hyperparams": {
            "penalty": ["l2", "none"],   # 'l1' and 'elasticnet' penalties are not supported by all solvers
            "C": list(np.arange(0.1, 1, 0.05)),
            "solver": ["newton-cg", "lbfgs", "sag", "saga"],
            "max_iter": [50, 75, 100, 150, 200, 300],
        }
    },
    "knn": {
        "model": KNeighborsClassifier,
        "grid_hyperparams": {
            "n_neighbors": list(range(25, 40, 2)),
            "weights": ["distance"],
            "algorithm": ["ball_tree", "kd_tree"],
            "p": list(range(30, 90, 5))
        }
    },
    "svc": {
        "model": SVC,
        "grid_hyperparams": {
            "C": list(np.arange(0.1, 1, 0.1)),
            "kernel": ["linear", "rbf"],
            "degree": list(range(30, 100, 10)),
            "gamma": ["scale", "auto"],
            "max_iter": list(range(80, 500, 50))
        }
    },
    "random_forest": {
        "model": RandomForestClassifier,
        "grid_hyperparams": {
            "n_estimators":  list(range(300, 800, 50)),
            "criterion": ["gini", "entropy", "log_loss"],
            "max_depth": list(range(65, 100, 5)),
            "max_leaf_nodes": list(range(70, 80, 2)),
            "max_features": [6, 10] + ["log2"],    # 'auto' is deprecated in latest sklearn
            "bootstrap": [True, False]
        }
    },
    "xgb": {
        "model": XGBClassifier,
        "grid_hyperparams": {
            "n_estimators": list(range(300, 800, 50)),
            "max_depth": list(range(20, 100, 5)),
            "learning_rate": list(np.arange(0.1, 1, 0.1)),
            "eval_metric": [mean_absolute_error],
        }
    }
}

In [59]:
grid_score = []
best_estimator = {}

In [61]:
for model_key, model_val in grid_model_dict.items():
    print("*"*10, f"Fitting for {model_key}", "*"*10, "\n")
    if model_key != "knn":
        model = model_val["model"](random_state=RANDOM_STATE)
    else:
        model = model_val["model"]()
    
    grid = GridSearchCV(
        estimator=model,
        param_grid=model_val["grid_hyperparams"],
        verbose=5,
        n_jobs=-2  # Use all CPU cores but one
    )
    grid.fit(X_train, y_train)
    grid_output = grid.cv_results_
    best_estimator[model_key] = grid.best_estimator_
    
    for el in zip(grid_output["mean_test_score"], grid_output["params"], grid_output["mean_fit_time"]):
        grid_score.append({
            "model": model_key,
            "accuracy_score": el[0],
            "params": el[1],
            "fit_time": el[2]
        })
        
    print("\n")
    

********** Fitting for decision_tree ********** 

Fitting 5 folds for each of 3696 candidates, totalling 18480 fits


********** Fitting for logistic_regression ********** 

Fitting 5 folds for each of 864 candidates, totalling 4320 fits


********** Fitting for knn ********** 

Fitting 5 folds for each of 192 candidates, totalling 960 fits


********** Fitting for svc ********** 

Fitting 5 folds for each of 2268 candidates, totalling 11340 fits






********** Fitting for random_forest ********** 

Fitting 5 folds for each of 6300 candidates, totalling 31500 fits


********** Fitting for xgb ********** 

Fitting 5 folds for each of 1440 candidates, totalling 7200 fits
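
The candidate counts in the log above follow directly from the grid sizes: an exhaustive grid search tries every combination, so the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A small sketch of that arithmetic for the decision tree grid (the helper `count_fits` is a hypothetical name introduced here for illustration):

```python
from math import prod

# Decision tree grid copied from grid_model_dict
decision_tree_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "splitter": ["best", "random"],
    "max_depth": [5, 25, 40, 50, 100, 125, 150, 175],
    "min_samples_split": list(range(40, 51)),
    "max_features": [6, 8, 10, 12, 14, 16] + ["log2"],
}


def count_fits(param_grid, cv=5):
    """Return (candidates, total fits) for an exhaustive grid search."""
    candidates = prod(len(values) for values in param_grid.values())
    return candidates, candidates * cv


print(count_fits(decision_tree_grid))  # (3696, 18480)
```

Running this check before starting a search gives a quick estimate of how long a grid will take, since adding a single extra value to one list multiplies the total number of fits.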








In [62]:
grid_score_df = pd.DataFrame.from_dict(grid_score).sort_values("accuracy_score", ascending=False)
grid_score_df

Unnamed: 0,model,accuracy_score,params,fit_time
10420,random_forest,0.827469,"{'bootstrap': False, 'criterion': 'gini', 'max...",0.428124
10751,random_forest,0.827469,"{'bootstrap': False, 'criterion': 'gini', 'max...",0.525084
11052,random_forest,0.827469,"{'bootstrap': False, 'criterion': 'gini', 'max...",0.603526
10451,random_forest,0.827469,"{'bootstrap': False, 'criterion': 'gini', 'max...",0.500080
11201,random_forest,0.827469,"{'bootstrap': False, 'criterion': 'gini', 'max...",0.503125
...,...,...,...,...
4963,svc,0.703967,"{'C': 0.1, 'degree': 80, 'gamma': 'auto', 'ker...",0.002939
4964,svc,0.703967,"{'C': 0.1, 'degree': 80, 'gamma': 'auto', 'ker...",0.003125
4965,svc,0.703967,"{'C': 0.1, 'degree': 80, 'gamma': 'auto', 'ker...",0.007450
4966,svc,0.703967,"{'C': 0.1, 'degree': 80, 'gamma': 'auto', 'ker...",0.006155
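
Note that `accuracy_score` in this table is the mean cross-validated accuracy on the training split (`mean_test_score`), not the accuracy on the held-out `X_val`. One way to check how well a tuned model generalizes is to score `best_estimator_` on the validation split as well. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo`, and the small KNN grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), split 70/30 like the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
search.fit(X_tr, y_tr)

# Mean accuracy over the 5 cross-validation folds of the training split
cv_accuracy = search.best_score_

# Accuracy of the refitted best model on data it has never seen
val_accuracy = accuracy_score(y_va, search.best_estimator_.predict(X_va))

print(f"CV accuracy:         {cv_accuracy:.4f}")
print(f"Validation accuracy: {val_accuracy:.4f}")
```

In the notebook, the same comparison could be made by looping over `best_estimator` and scoring each entry against `X_val` and `y_val`.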


In [63]:
for model_key, model_df in grid_score_df.groupby("model"):
    print("*"*10, f"Exporting DF for {model_key}", "*"*10, "\n")
    sampled_df = model_df.sort_values("accuracy_score", ascending=False)
    sampled_df = sampled_df.drop("model", axis=1)
    sampled_df.to_excel(f"{model_key}_gridcv.xlsx", index=False)

********** Exporting DF for decision_tree ********** 

********** Exporting DF for knn ********** 

********** Exporting DF for logistic_regression ********** 

********** Exporting DF for random_forest ********** 

********** Exporting DF for svc ********** 

********** Exporting DF for xgb ********** 
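
The two tables requested in the task can be derived from the list of score records collected during the search (`grid_score` in this notebook). A minimal sketch follows, assuming each record has the same `model`, `accuracy_score`, and `params` keys used above. The helper `best_per_model` is a hypothetical name, and the rows in `demo_rows` are made up for illustration; they are not results from this notebook.

```python
def best_per_model(score_rows):
    """Keep the highest-accuracy row for each model, best model first."""
    best = {}
    for row in score_rows:
        current = best.get(row["model"])
        if current is None or row["accuracy_score"] > current["accuracy_score"]:
            best[row["model"]] = row
    return sorted(best.values(), key=lambda r: r["accuracy_score"], reverse=True)


# Made-up rows for illustration only
demo_rows = [
    {"model": "knn", "accuracy_score": 0.70, "params": {"n_neighbors": 25}},
    {"model": "knn", "accuracy_score": 0.72, "params": {"n_neighbors": 27}},
    {"model": "svc", "accuracy_score": 0.80, "params": {"kernel": "rbf"}},
]

table_1 = best_per_model(demo_rows)  # Table 1: best row per algorithm
table_2 = table_1[0]                 # Table 2: best row overall

print("Algorithm | Accuracy | Hyperparameters")
for row in table_1:
    print(f"{row['model']} | {row['accuracy_score']:.4f} | {row['params']}")
```

Calling `best_per_model(grid_score)` on the real records would give the rows for Table 1, and its first element would be the row for Table 2.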

