After EDA on the training data, we will build the model with the following features:<br><br>
<b>Categorical Features:</b><br>
    Marital Status,<br>
    Number of Dependents,<br>
    Education (Whether Graduate or Not),<br>
    Credit History Meet Flag,<br>
    Property Area,<br>
    Loan Amount Term (bucket of 5 years)<br><br>
   
<b>Numerical Features:</b><br>
    Applicant Income<br>
    Loan Amount<br>

In [90]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from scipy.sparse import hstack

First, we load the training data and split it into train and test sets for model validation.

In [79]:
data = pd.read_csv("preprocessed_trained_data.csv")

In [80]:
data = data.drop(['Loan_ID', 'Gender', 'Self_Employed', 'CoapplicantIncome'], axis=1)

In [81]:
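# Cast the discrete columns to strings and prefix them with 'd' so that
# CountVectorizer (which drops single-character tokens by default) keeps them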
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].astype(str)
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].map('d{}'.format)
data['Dependents'] = data['Dependents'].astype(str)
data['Dependents'] = data['Dependents'].map('d{}'.format)
data['Credit_History'] = data['Credit_History'].astype(str)
data['Credit_History'] = data['Credit_History'].map('d{}'.format)

In [82]:
X = data.drop('Loan_Status', axis=1)
X.head()
y = data['Loan_Status'].values

In [83]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
print("*** Shape of datasets ***")
print("X_train: ",X_train.shape)
print("y_train: ",y_train.shape)
print("X_test: ",X_test.shape)
print("y_test: ",y_test.shape)


*** Shape of datasets ***
X_train:  (429, 8)
y_train:  (429,)
X_test:  (185, 8)
y_test:  (185,)
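
As a rough illustration of what `stratify=y` does, the sketch below splits toy labels of the same total size (614 rows) and checks that both splits keep a similar share of the positive class. The class counts and the `random_state` value are assumptions made for the example; the notebook's own call does not set `random_state`, so its split changes from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 614 rows with an assumed class imbalance (not read from the CSV).
y_toy = np.array(["Y"] * 422 + ["N"] * 192)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# random_state is added here only to make the example repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

train_share = (y_tr == "Y").mean()
test_share = (y_te == "Y").mean()
print(len(y_tr), len(y_te))
print(round(train_share, 3), round(test_share, 3))
```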


Let's prepare the data for model input.

### Categorical features
Marital Status,<br>
Number of Dependents,<br>
Education (Whether Graduate or Not),<br>
Credit History Meet Flag,<br>
Property Area,<br>
Loan Amount Term (bucket of 5 years)<br>

In [87]:
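# CountVectorizer doubles as a categorical encoder here:
# fit on the training split only, then transform both splits
vectorizer = CountVectorizer()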

vectorizer.fit(X_train['Married'].values)
X_train_married_ohe = vectorizer.transform(X_train['Married'].values)
X_test_married_ohe = vectorizer.transform(X_test['Married'].values)
print("\nAfter One Hot Encoding of Married:\n")
print("X_train shape: ",X_train_married_ohe.shape)
print("X_test shape: ",X_test_married_ohe.shape)
print("Features: ", vectorizer.get_feature_names_out())

vectorizer.fit(X_train['Dependents'].values)
X_train_dependents_ohe = vectorizer.transform(X_train['Dependents'].values)
X_test_dependents_ohe = vectorizer.transform(X_test['Dependents'].values)
print("\nAfter One Hot Encoding of Dependents:\n")
print("X_train shape: ",X_train_dependents_ohe.shape)
print("X_test shape: ",X_test_dependents_ohe.shape)
print("Features: ", vectorizer.get_feature_names_out())

vectorizer.fit(X_train['Education'].values)
X_train_education_ohe = vectorizer.transform(X_train['Education'].values)
X_test_education_ohe = vectorizer.transform(X_test['Education'].values)
print("\nAfter One Hot Encoding of Education:\n")
print("X_train shape: ",X_train_education_ohe.shape)
print("X_test shape: ",X_test_education_ohe.shape)
print("Features: ", vectorizer.get_feature_names_out())

vectorizer.fit(X_train['Credit_History'].values)
X_train_credithistory_ohe = vectorizer.transform(X_train['Credit_History'].values)
X_test_credithistory_ohe = vectorizer.transform(X_test['Credit_History'].values)
print("\nAfter One Hot Encoding of Credit History:\n")
print("X_train shape: ",X_train_credithistory_ohe.shape)
print("X_test shape: ",X_test_credithistory_ohe.shape)
print("Features: ", vectorizer.get_feature_names_out())

vectorizer.fit(X_train['Property_Area'].values)
X_train_propertyarea_ohe = vectorizer.transform(X_train['Property_Area'].values)
X_test_propertyarea_ohe = vectorizer.transform(X_test['Property_Area'].values)
print("\nAfter One Hot Encoding of Property Area:\n")
print("X_train shape: ",X_train_propertyarea_ohe.shape)
print("X_test shape: ",X_test_propertyarea_ohe.shape)
print("Features: ", vectorizer.get_feature_names_out())

vectorizer.fit(X_train['Loan_Amount_Term'].values)
X_train_loanterm_ohe = vectorizer.transform(X_train['Loan_Amount_Term'].values)
X_test_loanterm_ohe = vectorizer.transform(X_test['Loan_Amount_Term'].values)
print("\nAfter One Hot Encoding of loan Amount term:\n")
print("X_train shape: ",X_train_loanterm_ohe.shape)
print("X_test shape: ",X_test_loanterm_ohe.shape)
print("Features: ", vectorizer.get_feature_names_out())


After One Hot Encoding of Married:

X_train shape:  (429, 2)
X_test shape:  (185, 2)
Features:  ['no' 'yes']

After One Hot Encoding of Dependents:

X_train shape:  (429, 4)
X_test shape:  (185, 4)
Features:  ['d0' 'd1' 'd2' 'd3']

After One Hot Encoding of Education:

X_train shape:  (429, 2)
X_test shape:  (185, 2)
Features:  ['graduate' 'not']

After One Hot Encoding of Credit History:

X_train shape:  (429, 2)
X_test shape:  (185, 2)
Features:  ['d0' 'd1']

After One Hot Encoding of Property Area:

X_train shape:  (429, 3)
X_test shape:  (185, 3)
Features:  ['rural' 'semiurban' 'urban']

After One Hot Encoding of loan Amount term:

X_train shape:  (429, 9)
X_test shape:  (185, 9)
Features:  ['d120' 'd180' 'd240' 'd300' 'd36' 'd360' 'd480' 'd60' 'd84']


### Numerical features

Applicant Income<br>
Loan Amount

In [89]:
# Fit the scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler()

scaler.fit(X_train['ApplicantIncome'].values.reshape(-1,1))
X_train_income_norm = scaler.transform(X_train['ApplicantIncome'].values.reshape(-1,1))
X_test_income_norm = scaler.transform(X_test['ApplicantIncome'].values.reshape(-1,1))
print("\nAfter normalizing Applicant Income:\n")
print("X_train shape: ",X_train_income_norm.shape)
print("X_test shape: ",X_test_income_norm.shape)


scaler.fit(X_train['LoanAmount'].values.reshape(-1,1))
X_train_loanamount_norm = scaler.transform(X_train['LoanAmount'].values.reshape(-1,1))
X_test_loanamount_norm = scaler.transform(X_test['LoanAmount'].values.reshape(-1,1))
print("\nAfter normalizing Loan Amount:\n")
print("X_train shape: ",X_train_loanamount_norm.shape)
print("X_test shape: ",X_test_loanamount_norm.shape)


After normalizing Applicant Income:

X_train shape:  (429, 1)
X_test shape:  (185, 1)

After normalizing Loan Amount:

X_train shape:  (429, 1)
X_test shape:  (185, 1)
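
Because the scaler is fitted on the training split only, test values are not guaranteed to land in [0, 1]. The sketch below shows this with made-up income figures (an assumption for illustration, not values from the dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up incomes: the test split contains values below and above the train range.
train_income = np.array([1000.0, 3000.0, 5000.0]).reshape(-1, 1)
test_income = np.array([500.0, 4000.0, 9000.0]).reshape(-1, 1)

# min and max are learned from the training data only.
scaler = MinMaxScaler().fit(train_income)
train_scaled = scaler.transform(train_income)
test_scaled = scaler.transform(test_income)

print(train_scaled.ravel())  # stays within [0, 1]
print(test_scaled.ravel())   # can fall outside [0, 1]
```

This is expected behaviour rather than a bug. A decision tree splits on thresholds, so it is not sensitive to this kind of monotonic rescaling.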


### Stack all features

In [92]:
X_train_stacked = hstack((X_train_married_ohe, X_train_dependents_ohe, X_train_education_ohe, X_train_credithistory_ohe, X_train_propertyarea_ohe, X_train_loanterm_ohe, X_train_income_norm,X_train_loanamount_norm)).tocsr()
X_test_stacked = hstack((X_test_married_ohe, X_test_dependents_ohe, X_test_education_ohe, X_test_credithistory_ohe, X_test_propertyarea_ohe, X_test_loanterm_ohe, X_test_income_norm,X_test_loanamount_norm)).tocsr()
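
Going by the shapes printed earlier, the stacked matrices should have 2 + 4 + 2 + 2 + 3 + 9 + 1 + 1 = 24 columns. The sketch below shows, on toy blocks (assumed values), how `scipy.sparse.hstack` joins a sparse encoded block and a dense scaled block column-wise.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: a sparse block like CountVectorizer output
# and a dense block like MinMaxScaler output.
ohe_block = csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))
num_block = np.array([[0.2], [0.5], [1.0]])

# Blocks are joined left to right; all blocks must have the same number of rows.
stacked = hstack((ohe_block, num_block)).tocsr()
print(stacked.shape)
print(stacked.toarray())

# Column count expected for the notebook's matrices, from the printed shapes.
expected_columns = sum([2, 4, 2, 2, 3, 9, 1, 1])
print(expected_columns)
```

A quick check such as `X_train_stacked.shape[1] == X_test_stacked.shape[1]` can confirm that train and test ended up with the same feature layout.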

### Apply Decision Tree

In [94]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [97]:
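# Fit a decision tree with default hyperparameters and predict on the held-out set
DT = DecisionTreeClassifier()
DT.fit(X_train_stacked, y_train)
y_predict = DT.predict(X_test_stacked)

print(classification_report(y_test, y_predict))
# accuracy_score expects (y_true, y_pred)
DT_SC = accuracy_score(y_test, y_predict)
print(f"{round(DT_SC*100,2)}% Accurate")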

              precision    recall  f1-score   support

           N       0.50      0.45      0.47        58
           Y       0.76      0.80      0.78       127

    accuracy                           0.69       185
   macro avg       0.63      0.62      0.62       185
weighted avg       0.68      0.69      0.68       185

68.65% Accurate
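
To put the accuracy in context, it helps to compare it with a majority-class baseline. The sketch below rebuilds labels from the `support` column of the report (58 `N`, 127 `Y`) and scores a `DummyClassifier` that always predicts the most frequent class. Fitting on these reconstructed labels is a shortcut for illustration; in the notebook the baseline would be fitted on `y_train`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts copied from the "support" column of the classification report.
y_test_like = np.array(["N"] * 58 + ["Y"] * 127)
X_dummy = np.zeros((len(y_test_like), 1))  # features are ignored by this baseline

# Always predicts the most frequent class ("Y" here).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_test_like)
baseline_acc = accuracy_score(y_test_like, baseline.predict(X_dummy))
print(f"{round(baseline_acc * 100, 2)}% Accurate")  # 68.65% Accurate
```

On this split the baseline reaches 127/185, the same 68.65% as the default decision tree, so the tree's accuracy alone does not show an improvement over always predicting `Y`. The per-class precision and recall in the report are the more informative numbers here.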
