# Kampus Merdeka 6: IBM & Skilvul
# Data Science Phase Challenge

### Feivel Jethro Ezhekiel | Group 14

# Problem Definition
## Background
House prices rise day by day, yet people of all ages, from young adults to the elderly, still want to own a home. Many want a house that is in the city, close to public facilities, and luxurious, yet still affordable. For this reason, banks offer `Kredit Pemilikan Rumah (KPR)`, a home loan that lets the customer own the house first and pay for it in installments. Even so, the bank still has to assess the eligibility of every customer who applies for a KPR, because many applicants may not actually be able to keep up with the installments. When this assessment is done by people, it is limited by time, effort, and above all the attention available for `checking` and `evaluating` these customers. An AI technology is therefore needed that can predict and assign an _eligibility rate_ to each customer.
## Research Objective
The objective of this research is to determine the *best* ML ***algorithm*** for building an ***AI Eligibility Home Loan Detection*** technology for banks, so that the data verification process is faster and more accurate.
## Problem Statement
Age, salary, the number of people who depend on the customer, and gender affect the ***eligibility rate***.
## Data to be used
- Loan-test.csv
        <br>**Source**      : https://www.kaggle.com/datasets/vikasukani/loan-eligible-dataset?select=loan-test.csv
        <br>**Description**   : Dataset used to test the model and make predictions. It contains Loan_ID, Gender (Male/Female), Married (Yes/No), Dependents, Education (Graduate/Not Graduate), Self_Employed (Yes/No), ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area (Urban/Semi-Urban/Rural), Loan_Status (Y/N). It has 367 unique entries.
        
- Loan-train.csv
        <br>**Source**      : https://www.kaggle.com/datasets/vikasukani/loan-eligible-dataset?select=loan-train.csv
        <br>**Description**   : Dataset used to train the model. It contains Loan_ID, Gender (Male/Female), Married (Yes/No), Dependents, Education (Graduate/Not Graduate), Self_Employed (Yes/No), ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area (Urban/Semi-Urban/Rural), Loan_Status (Y/N). It has 614 unique entries.
        
## Method
This research uses supervised learning models with the following algorithms:
* LinearSVC
* RandomForest Classifier
* KNeighbors Classifier
* Logistic Regression
* Decision Tree
* XGBoostClassifier
* CatBoostClassifier

# Preparation | Persiapan
## Import Libraries

Link : https://seaborn.pydata.org/installing.html

The basic invocation of pip will install seaborn and, if necessary, its mandatory dependencies. It is possible to include optional dependencies that give access to a few advanced features:

!pip install seaborn[stats]

In [None]:
# command to install Seaborn - Seaborn is a library for making statistical graphics in Python
%pip install seaborn[stats]

In [None]:
# Initiate libraries that will be used in the process of creating Machine Learning models

import pandas as pd # Pandas is a software library written for the Python programming language for data manipulation and analysis
import numpy as np # NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these array
import matplotlib.pyplot as plt # Basic visualization in python
import seaborn as sns # Advanced visualization
from sklearn.preprocessing import OneHotEncoder # Encodes categorical data to numerical ones
import re # Regular expressions for cleaning and modifying string data
import imblearn # Imblearn library is specifically designed to deal with imbalanced datasets
from imblearn.under_sampling import RandomUnderSampler # Balancing data for imbalance in the process
from collections import Counter # is a sub-class that is used to count hashable objects. It implicitly creates a hash table of an iterable when invoked
from sklearn.model_selection import train_test_split # a powerful tool in Scikit-learn's arsenal, primarily used to divide datasets into training and testing subsets

## Get Data | Mendapatkan Data

In [None]:
# Source : https://saturncloud.io/blog/how-to-merge-two-csvs-using-pandas-in-python/

# Take training data with a CSV extension into a df variable or Data Frame on the local machine
df = pd.read_csv("loan-train.csv")

# Alternatively, read the CSV straight from a URL
# df = pd.read_csv(<link>)

# Take test data with a CSV extension into the df_validate variable whose function is to test the model at the end of the process
df_validate = pd.read_csv('loan-test.csv')

In [None]:
# To see the contents of the train data on the variable 'df'
df

In [None]:
# To see the contents of the validate data in the variable 'df_validate'
df_validate

It can be seen that ```df``` has several features arranged into columns, namely ```Loan_ID```, ```Gender```, ```Married```, ```Dependents```, ```Education```, ```Self_Employed```, ```ApplicantIncome```, ```CoapplicantIncome```, ```LoanAmount```, ```Loan_Amount_Term```, ```Credit_History```, ```Property_Area```, and ```Loan_Status```.

* Loan_ID: Customer ID<br>
* Gender: Customer's gender<br>
* Married: Customer's marital status (Yes/No)<br>
* Dependents: Number of dependents<br>
* Education: Last education (Graduate/Not Graduate)<br>
* Self_Employed: Whether the customer is self-employed (Yes/No)<br>
* ApplicantIncome: Customer's income<br>
* CoapplicantIncome: Income of the customer's partner<br>
* LoanAmount: Loan amount (in thousands)<br>
* Loan_Amount_Term: Loan term in months<br>
* Credit_History: Credit history meets guidelines (1/0)<br>
* Property_Area: Area where the property is located (Urban/Semi-Urban/Rural)<br>
* Loan_Status: Loan approved (Y/N)<br>

## Explore Data (EDA) | Eksplorasi Data

At this stage we carry out exploration to see and analyze the data we have so that we know the next steps we will take before starting to create a Machine Learning Model.

### Purpose

The main goal of EDA is to gain insight into data and its underlying structure. EDA helps analysts identify patterns, relationships, and outliers in data, which can help in making more informed decisions. EDA can also help identify missing data, errors, and data inconsistencies. By performing EDA, analysts can gain a better understanding of the data, thereby providing more accurate and reliable results.

### Importance of EDA for Data Analysis

EDA is very important in data analysis because it helps identify patterns, relationships, and anomalies in data. It also helps identify missing data, errors, and data inconsistencies, which can have a significant impact on the analysis. Without EDA, analysts may miss important insights, which can lead to incorrect conclusions and poor decision making. EDA is also important for preparing data for further analysis, such as predictive modeling, machine learning, and statistical inference.

reference : https://www.linkedin.com/pulse/power-exploratory-data-analysis-eda-science-basics-best-soni/

In [None]:
def dataframe_info(df):
    # df.shape gives the dimensions of the dataframe
    print(f"Data shape : {df.shape}")
    # df.shape[0] is the number of rows, df.shape[1] is the number of columns (features)
    print(f"The data have {df.shape[0]} rows and {df.shape[1]} columns")
    print()
    print("Overall information of the data:")
    print()
    print(f"Total number of missing values: {df.isnull().sum().sum()}")
    print()
    # .info() prints the column names, dtypes and non-null counts itself and returns None, so it is not wrapped in print()
    df.info()
    # Number of missing values per column
    print("\n", df.isnull().sum())
    print("===================================================================")

# Call the function once per dataframe
dataframe_info(df)
dataframe_info(df_validate)

We can see that the training data (df) has 13 features: 4 of type float, 1 of type integer, and 8 of type object, which adds up to 13.
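The dtype tally above can also be computed rather than read off the `.info()` output. The following is a minimal sketch on a hypothetical four-column sample (the rows and the name `sample` are made up for illustration, not taken from the dataset):

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical sample with a few of the dataset's column names
sample = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [None, 128.0, 66.0],
    "Loan_Status": ["Y", "N", "Y"],
})

# Tally the columns by dtype family
kinds = {"float": 0, "integer": 0, "other": 0}
for col in sample.columns:
    if is_integer_dtype(sample[col]):
        kinds["integer"] += 1
    elif is_float_dtype(sample[col]):
        kinds["float"] += 1
    else:
        kinds["other"] += 1

print(kinds)
```

Note that a numeric column containing a missing value is stored as float, which is one reason several count-like features show up as float in the real data.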

In [None]:
# Copy the original dataframes into new variables, df_train and df_test
# Working on copies keeps the originals intact and gives us more flexibility while cleaning
df_train = df.copy()
df_test = df_validate.copy()

In [None]:
# .describe() summarizes the values in each numeric feature (count, mean, std, min, quartiles, max)
df_train.describe()

In [None]:
df_test.describe()

Of the 13 features, only 5 are detected as numeric, even though 1 more feature, ```Dependents```, could also be made numeric.

Data Visualization Function

In [None]:
# Visualize data so that it can be seen well so that it can be communicated with other stakeholders

def static_viz(df):
    # To set the grid theme on Seaborn
    sns.set_theme(style='darkgrid')
    # To set the canvas size in Seaborn
    fig, ax = plt.subplots(nrows=4, ncols=3, figsize=(30,30)) # subplots indicate that there are plots combined in the same canvas which consists of 4 rows and 3 columns
    sns.histplot(data = df, x = "Gender", color="olive", ax=ax[0, 0])
    sns.histplot(data = df, x = "Married", color="yellowgreen", ax=ax[0, 1])
    sns.histplot(data = df, x = "Dependents", color="lightcoral", ax=ax[0, 2])
    sns.histplot(data = df, x = "Education", color="slategray", ax=ax[1, 0])
    sns.histplot(data = df, x = "Self_Employed", color="violet", ax=ax[1, 1])
    sns.histplot(data = df, x = "ApplicantIncome", kde=True, color="salmon", ax=ax[1, 2])
    sns.histplot(data = df, x = "CoapplicantIncome", kde=True, color="lightskyblue", ax=ax[2, 0])
    sns.histplot(data = df, x = "LoanAmount", kde=True, color="sandybrown", ax=ax[2, 1])
    sns.histplot(data = df, x = "Loan_Amount_Term", color="paleturquoise", ax=ax[2, 2])
    sns.histplot(data = df, x = "Credit_History", color="tan", ax=ax[3, 0])
    sns.histplot(data = df, x = "Property_Area", color="palevioletred", ax=ax[3, 1])
    sns.histplot(data = df, x = "Loan_Status", color="mediumslateblue", ax=ax[3, 2])

static_viz(df_train)

In [None]:
def dynamic_Viz(df_train, nrows, ncols):
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(ncols*5, nrows*5)) 
    
    plots = [
        ("Gender", "olive"), ("Married", "yellowgreen"), ("Dependents", "lightcoral"),
        ("Education", "slategray"), ("Self_Employed", "violet"), ("ApplicantIncome", "salmon"),
        ("CoapplicantIncome", "lightskyblue"), ("LoanAmount", "sandybrown"),
        ("Loan_Amount_Term", "paleturquoise"), ("Credit_History", "tan"),
        ("Property_Area", "palevioletred"), ("Loan_Status", "mediumslateblue")
    ]
    
    for i, (feature, color) in enumerate(plots):
        row = i // ncols
        col = i % ncols
        sns.histplot(data=df_train, x=feature, color=color, ax=ax[row, col])

    plt.tight_layout()
    plt.show()

dynamic_Viz(df_train, 4, 3)

Across the 12 features we can see an imbalance in the number of observations per category in ```Gender```, ```Married```, ```Dependents```, ```Education```, ```Self_Employed```, and ```Credit_History```, which impacts ```Loan_Status```. The features ```ApplicantIncome```, ```CoapplicantIncome```, and ```LoanAmount``` are not normally distributed: they are *right-skewed*, with most values concentrated on the left and a long tail to the right.

This will greatly influence the final results of our Training Model later. Therefore, we need to clean data first.
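The shape of a distribution can also be checked numerically instead of only by eye. As a minimal sketch on made-up income values (the series `incomes` is hypothetical), the sign of pandas' `skew()` indicates the direction of the tail:

```python
import pandas as pd

# Hypothetical incomes: most values are small, one is very large
incomes = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 5000, 20000])

# Positive skewness means a long tail to the right, negative means a long tail to the left
skewness = incomes.skew()

# With a long right tail, the mean is pulled above the median
print(skewness > 0, incomes.mean(), incomes.median())
```

The same call can be applied to any numeric column of `df_train` to compare it with the histogram.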

In [None]:
# Compute the correlation matrix to see the relationship between features

sns.heatmap(df_train.corr())

**We cannot see the correlation matrix yet because the data still contains features that cannot be converted to numeric.**

**We will compute it only after data cleansing.**

## Clean Data(data train) | Membersihkan Data

At this stage we clean the data. The cleaning consists of manipulation, transformation, and normalizing the distribution.

1. Manipulation

    Manipulation here means removing data that has no *value*, that is, data equal to ```NaN```. We can also fill in a specific *value* such as the mean, median, or mode of the data.

2. Transformation

    Transformation here means changing the data from one form to another, according to the needs of the Machine Learning model itself.

3. Distribution

    This step aims to make the data normally distributed so that the modeling can run **ideally**. It also includes removing the outliers found in the dataframe.
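The three steps can be sketched on a small made-up table. The data, the median fill, and the 1.5 × IQR outlier rule are illustrative assumptions, not necessarily the exact choices made later in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing category, a missing number and an outlier
raw = pd.DataFrame({
    "Married": ["Yes", "No", None, "Yes", "No", "Yes"],
    "LoanAmount": [120.0, 110.0, 115.0, np.nan, 130.0, 700.0],
})

# 1. Manipulation: drop rows without a Married value, fill LoanAmount with the median
clean = raw.dropna(subset=["Married"]).copy()
clean["LoanAmount"] = clean["LoanAmount"].fillna(clean["LoanAmount"].median())

# 2. Transformation: turn Yes/No into 1/0
clean["Married"] = (clean["Married"] == "Yes").astype(int)

# 3. Distribution: keep only rows within 1.5 * IQR of the quartiles
q1, q3 = clean["LoanAmount"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = clean[clean["LoanAmount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(clean)
```

In this sample the row with a missing `Married` value is dropped, the missing amount is filled with the median, and the 700 outlier is removed.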

In [None]:
# .isnull().sum() counts the number of NaN values in each feature of df_train
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

### TO DO LIST

Based on the data exploration above, we can do the following things

#### 1. Loan ID
Loan ID is a unique customer identifier that differs for every customer, so it carries no pattern and would only confuse the *Machine Learning* model. Therefore, it is better to delete this feature.
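Whether a column is a pure identifier can be checked before dropping it. A minimal sketch, using a hypothetical four-row sample (the IDs are made up):

```python
import pandas as pd

# Hypothetical sample: every Loan_ID appears exactly once
sample = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005", "LP001006"],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# A column whose values are all distinct identifies rows but describes nothing about them
is_identifier = sample["Loan_ID"].is_unique

# Drop it only when it really is a per-row identifier
features = sample.drop(columns=["Loan_ID"]) if is_identifier else sample
print(is_identifier, list(features.columns))
```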

In [None]:
# Remove the Loan_ID column with .drop() because it carries no pattern the ML model can learn from
df_train = df_train.drop('Loan_ID', axis=1) # axis=1 indicates columns
df_train

In [None]:
# We could do it with the data test also
df_test = df_test.drop('Loan_ID', axis=1)
df_test

#### 2. Gender
The Gender feature was simply removed because it is not significant to loan eligibility. This is explained in a paper entitled ```Gender Bias and Credit Access``` which can be accessed at the following link:

https://drive.google.com/file/d/1NaboUUhEJALucTLw87R7_sQYhBPmCl88/view?usp=drive_link

In [None]:
# Removed the Gender column because it has no correlation with loan eligibility
df_train = df_train.drop('Gender', axis=1) # or you can also use df_train.drop(columns=['Gender'])
df_train

In [None]:
df_test = df_test.drop('Gender', axis=1) # or you can also use df_test.drop(columns=['Gender'])
df_test

#### 3. Married 
There are 3 rows with missing values. These rows can be deleted because removing them has no significant impact on the remaining 611 rows. The feature is then converted to binary form using ```OneHotEncoder```.

Based on the following article: https://www.rocketlawyer.com/family-and-personal/family-matters/marriage/legal-guide/how-marital-status-affects-credit-card-and-loan-applications. It is stated that marital status influences whether a customer is eligible to borrow money from the bank. Therefore, these *features* will be used.

In [None]:
# The syntax below aims to view empty column data in the Married feature so that we can see its contents.
df_train[df_train.Married.isna()]

In [None]:
df_test[df_test.Married.isna()]

Only 3 of the 614 rows in df_train have an empty Married value, so removing them will not have a significant impact on the ML model. Therefore, we can delete them.

In [None]:
# Delete the rows whose Married value is NaN because they are too few to be significant

# dropna with subset=['Married'] deletes only the rows whose 'Married' value is NaN
df_train.dropna(subset=['Married'], inplace=True)

# .isnull().sum() is a function to count the number of empty data in a dataframe
df_train.isnull().sum()

Because the `Married` feature is in the form of an object or categorical in the form Yes / No, we can transform the data from *object*, to *Boolean* with the OneHotEncoder method, or change the category into 2 binary values, namely 0 and 1.

In [None]:
dum = df['Married'].head()
dum = pd.get_dummies(dum)
dum

For fast data cleaning and EDA, it makes a lot of sense to use pandas get dummies. However, if I plan to convert a categorical column into multiple binary columns for machine learning, it is better to use OneHotEncoder().

https://albertum.medium.com/preprocessing-onehotencoder-vs-pandas-get-dummies-3de1f3d77dcc

In [None]:
# Converting to binary values using OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output= False).set_output(transform='pandas')

In [None]:
# encoder.fit_transform(X) learns the unique categories in the column and returns one binary (0/1) column per category
Married_ohe = encoder.fit_transform(df_train[['Married']]).astype(int)
Married_ohe

In [None]:
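# Reuse the categories learned from the training data instead of re-fitting the encoder on the test data
Married_ohe_test = encoder.transform(df_test[['Married']]).astype(int)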
Married_ohe_test

In [None]:
# Combine Married_ohe with df_train and replace the old Married column with a new Married column holding binary values
df_train = pd.concat([df_train, Married_ohe], axis=1).drop(columns=['Married'])
# Married = 1; Not Married = 0
df_train = df_train.rename(columns={"Married_Yes": "Married"})
df_train = df_train.drop(columns=['Married_No'])
df_train

In [None]:
# Combine Married_ohe_test with df_test and replace the old Married column with a new Married column holding binary values
df_test = pd.concat([df_test, Married_ohe_test], axis=1).drop(columns=['Married'])
# Married = 1; Not Married = 0
df_test = df_test.rename(columns={"Married_Yes": "Married"})
df_test = df_test.drop(columns=['Married_No'])
df_test

In [None]:
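# Or we can create a function that combines the two steps above

def BinaryEncoder(data, features):
    # Encode each Yes/No object column as a single 0/1 column with the original name
    for feature in features:
        if data[feature].dtype == 'object':
            dummies = pd.get_dummies(data[feature], prefix=feature).astype(int)
            data = pd.concat([data, dummies], axis=1).drop(columns=[feature])
            # Keep the "_Yes" column under the original name and drop the redundant "_No" column
            data = data.rename(columns={f"{feature}_Yes": feature})
            data = data.drop(columns=[f"{feature}_No"], errors="ignore")
    return data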

#### 4. Dependents (Object)
This feature has 15 missing values. Although that is more than in Gender, we can simply drop those rows, because we cannot reasonably fill them with the median or mode while the distribution is uneven. We then remove the ```+``` sign and convert the data from ```object``` to *numeric values*.

In this article: https://www.homeloanexperts.com.au/how-much-can-i-borrow/dependents-on-borrowing-power/

It is stated that banks pay attention and consider customers who want to borrow money whether they have children or not

In [None]:
df_train.info()

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

In [None]:
# View the rows that have NaN values in the Dependents column
df_train[df_train.Dependents.isna()]

In [None]:
df_test[df_test.Dependents.isna()]

In [None]:
fig, ax2 = plt.subplots(1, 3, figsize=(20,15))

sns.boxplot(data=df_train, x="Dependents", y="LoanAmount", ax=ax2[0])
sns.boxplot(data=df_train, x="Dependents", y="ApplicantIncome", ax=ax2[1])
sns.boxplot(data=df_train, x="Dependents", y="CoapplicantIncome", ax=ax2[2])

There is a pattern in the LoanAmount and CoapplicantIncome features where the more dependents, the higher the LoanAmount and the lower the CoapplicantIncome. However, we can carry out further analysis after data cleansing

In [None]:
sns.histplot(data=df_train, x="Dependents")

In [None]:
# The rows with NaN in Dependents have values close to the average and are split roughly 50:50 on the final result, so they do not significantly affect the Machine Learning model and we can remove them
df_train.dropna(subset=['Dependents'], inplace=True)

In [None]:
df_test.dropna(subset=['Dependents'], inplace=True)

In [None]:
# Convert 3+ to 3 and the object dtype to integer
df_train['Dependents'].tail()

It can be seen that the row with index 610 contains a ```+``` sign
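As a minimal sketch of the conversion on made-up values (the series `dependents` is hypothetical), pandas' vectorized string methods can strip the sign and cast in one expression. This is one possible form, not necessarily the one used in the cells below:

```python
import pandas as pd

# Hypothetical Dependents values as they appear in the raw data
dependents = pd.Series(["0", "1", "2", "3+"])

# regex=False treats "+" as a literal character rather than a regex quantifier
as_int = dependents.str.replace("+", "", regex=False).astype(int)
print(as_int.tolist())
```

Note that this maps "3+" to 3, so customers with more than three dependents are treated the same as those with exactly three.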

In [None]:
# Remove the + sign from the strings: https://saturncloud.io/blog/how-to-remove-special-characters-in-pandas-dataframe/
# Then convert the strings in Dependents to integers
df_train['Dependents'] = df_train['Dependents'].apply(lambda x: re.sub(r'\+', '', x)).astype(int)

In [None]:
df_test['Dependents'] = df_test['Dependents'].apply(lambda x: re.sub(r'\+', '', x)).astype(int)

In [None]:
# Make sure the + sign has disappeared in the 610th data and also that the data in object form has changed to numeric form so that later the correlation matrix can be carried out
df_train["Dependents"].tail()

In [None]:
df_test["Dependents"].tail()

#### 5. Education

In *features* ```Education``` there are no empty data rows, i.e. they are all filled. However, the form is still in object form, so we can convert it to a binary value using ```OneHotEncoder```

In this article: https://www.linkedin.com/pulse/hidden-ways-your-education-level-affects-finances-lana-bandoim/

It was found that the level of education influences Loan Eligibility

In [None]:
df_train['Education']

In [None]:
df_test['Education']

In [None]:
# Change the data form in the Education feature to binary values
Education_ohe = encoder.fit_transform(df_train[['Education']]).astype(int)
Education_ohe

In [None]:
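# Reuse the categories learned from the training data instead of re-fitting the encoder on the test data
Education_ohe_test = encoder.transform(df_test[['Education']]).astype(int)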
Education_ohe_test

In [None]:
# See the initial form before using OneHotEncoder
df_train.head()

In [None]:
# Combining Education_ohe with df_train
df_train = pd.concat([df_train, Education_ohe], axis=1).drop(columns=['Education'])
# Graduate = 1; Not Graduate = 0
df_train = df_train.rename(columns={"Education_Graduate": "Education"})
df_train = df_train.drop(columns=["Education_Not Graduate"])

In [None]:
# Combining Education_ohe_test with df_test
df_test = pd.concat([df_test, Education_ohe_test], axis=1).drop(columns=['Education'])
# Graduate = 1; Not Graduate = 0
df_test = df_test.rename(columns={"Education_Graduate": "Education"})
df_test = df_test.drop(columns=["Education_Not Graduate"])

In [None]:
# See the final form after applying OneHotEncoder
df_train.head()

In [None]:
df_test.head()

#### 6. Self Employed

This feature has 32 missing values. After handling them, we can convert the feature to binary values using ```OneHotEncoder```.

In several articles it is said that those who are self-employed can take out loans but with several restrictions. However, because here we assume it is general, we will assume it does have an effect

In [None]:
# Summarize the rows that have NaN values in the Self_Employed column
df_train[df_train.Self_Employed.isna()].describe()

In [None]:
fig, ax3 = plt.subplots(figsize=(5,10))

sns.boxplot(data=df_train, x="Self_Employed", y="ApplicantIncome", hue="Loan_Status", ax=ax3)

Because most of the data is likely to be accepted by the loan eligibility process and most of it is also Self_Employed, we could fill the NaN values with the mode of the ```Self_Employed``` feature, namely ```1```. However, ```Loan_Status``` is ```imbalanced```, a condition where the values ```1``` and ```0``` are not equal in number, which would affect the final results of the Machine Learning model, so we simply delete these rows instead.

In [None]:
# Fill with the most frequent value in that column: https://www.makeuseof.com/fill-missing-data-with-pandas/
# df_train['Self_Employed'].fillna(df['Self_Employed'].mode()[0], inplace=True)

In [None]:
df_train.dropna(subset=["Self_Employed"], inplace=True)
df_train[df_train.Self_Employed.isna()]

In [None]:
df_test.dropna(subset=["Self_Employed"], inplace=True)
df_test[df_test.Self_Employed.isna()]

In [None]:
# Change the data form in the Self_Employed feature to binary values
Self_Employed_ohe = encoder.fit_transform(df_train[['Self_Employed']]).astype(int)
Self_Employed_ohe

In [None]:
# Convert the Self_Employed feature of the test data to binary values, reusing the categories learned from the training data
Self_Employed_ohe_test = encoder.transform(df_test[['Self_Employed']]).astype(int)
Self_Employed_ohe_test

In [None]:
df_train.head()

In [None]:
# Combining Self_Employed_ohe with df_train
df_train = pd.concat([df_train, Self_Employed_ohe], axis=1).drop(columns=['Self_Employed'])
# Self_Employed_Yes = 1; Self_Employed_No = 0
df_train = df_train.rename(columns={"Self_Employed_Yes": "Self_Employed"})
df_train = df_train.drop(columns=['Self_Employed_No'])

In [None]:
# Combining Self_Employed_ohe_test with df_test
df_test = pd.concat([df_test, Self_Employed_ohe_test], axis=1).drop(columns=['Self_Employed'])
# Self_Employed_Yes = 1; Self_Employed_No = 0
df_test = df_test.rename(columns={"Self_Employed_Yes": "Self_Employed"})
df_test = df_test.drop(columns=['Self_Employed_No'])

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_train["Self_Employed"].info()

#### 7. Applicant Income dan Coapplicant Income
It is better to combine these two features so that the pattern is easier to see, because in the end it is one household that takes out the loan

There seems to be no need to question this feature regarding its effect on Loan Status because without income it is impossible to borrow

In [None]:
# Combining ApplicantIncome with CoapplicantIncome
df_train['Income'] = df_train['ApplicantIncome'] + df_train['CoapplicantIncome']
df_train = df_train.drop('ApplicantIncome', axis = 1)
df_train = df_train.drop('CoapplicantIncome', axis = 1)
df_train

In [None]:
# Combining ApplicantIncome with CoapplicantIncome
df_test['Income'] = df_test['ApplicantIncome'] + df_test['CoapplicantIncome']
df_test = df_test.drop('ApplicantIncome', axis = 1)
df_test = df_test.drop('CoapplicantIncome', axis = 1)
df_test

In [None]:
# Converts float to integer
df_train['Income'] = df_train['Income'].astype(int)
df_train

In [None]:
# Converts float to integer
df_test['Income'] = df_test['Income'].astype(int)
df_test

#### 8. Loan Amount
This feature has 22 missing values. We need to analyze it further because it plays an important role in the Machine Learning model. We also need to append three zeros to each value, because according to the description the amounts are given in thousands.

This feature also has the same effect as Income

In [None]:
df_train[df_train.LoanAmount.isna()].describe()

In [None]:
df_train[df_train.LoanAmount.isna()]

Because `LoanAmount` is important enough to be used in the Machine Learning models, we need to fill in its NaN values. However, we only keep the NaN rows whose `Loan_Status` is N, so we delete the NaN rows whose `Loan_Status` is Y and reduce their number.

In [None]:
df_temp = df_train[df_train.LoanAmount.isna()]
# Boolean masks are combined with & (not `and`) and applied with square brackets
df_temp = df_temp[(df_temp["Loan_Status"] == 'Y') & df_temp["LoanAmount"].isna()]
df_temp

It turns out that we need `Loan_Status` as a boolean or numeric value, so we jump ahead to step [12. Loan Status](#12.-Loan-Status)

In [None]:
# Create a new variable holding the rows that have a NaN LoanAmount and a Loan_Status of 0
df_temp = df_train[df_train.LoanAmount.isna()]
# Parentheses are required because & binds more tightly than ==
df_temp = df_temp[df_temp['LoanAmount'].isna() & (df_temp['Loan_Status'] == 0)]
df_temp
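When two conditions are combined in a pandas filter, each comparison should sit in its own parentheses, because Python evaluates `&` before `==`. The following is a minimal sketch on a made-up frame (the name `loans` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical loans: two rows are missing LoanAmount, one of them was rejected (0)
loans = pd.DataFrame({
    "LoanAmount": [np.nan, 120.0, np.nan, 95.0],
    "Loan_Status": [0, 0, 1, 1],
})

# Each condition is a boolean Series; & combines them element by element
missing_and_rejected = loans[loans["LoanAmount"].isna() & (loans["Loan_Status"] == 0)]
print(missing_and_rejected.index.tolist())
```

Only the first row satisfies both conditions, so it is the only row kept.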

In [None]:
# Drop the rows with NaN LoanAmount from the dataframe
df_train.dropna(subset=['LoanAmount'], inplace = True)
df_train[df_train.LoanAmount.isnull()]

In [None]:
# df_train.drop(df_train.tail(9).index, inplace=True)

In [None]:
# Append df_temp from above back onto df_train
df_train = pd.concat([df_train, df_temp])
df_train.tail(9)

In [None]:
# Fill the NaN values from df_temp, which should fall in the range between 9 and 177 because 570 is an outlier that occurs only once

# We use the polynomial method because it often (though not always) gives more accurate values
df_train["LoanAmount"] = df_train["LoanAmount"].interpolate(limit_direction='both', method="polynomial", order=2)
df_train.tail(10)

# reference : https://www.numpyninja.com/post/interpolation-using-pandas#:~:text=Interpolation%20is%20one%20such%20method,series%20while%20pre%2Dprocessing%20data.
# https://stackoverflow.com/questions/63632541/order-of-spline-interpolation-for-pandas-dataframe
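To make the behaviour of interpolation concrete, here is a minimal sketch on made-up amounts. It uses `method="linear"` rather than the polynomial method above, purely to keep the example free of the SciPy dependency, and it shows a median fill next to it for comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with gaps
amounts = pd.Series([100.0, np.nan, 140.0, np.nan, np.nan, 200.0])

# Interpolation fills each gap from the neighbouring rows, in row order
linear = amounts.interpolate(method="linear")

# A median fill ignores row order and uses one value for every gap
median_filled = amounts.fillna(amounts.median())

print(linear.tolist())
print(median_filled.tolist())
```

Because interpolation depends on neighbouring rows, its result changes if the rows are reordered, whereas the median fill does not. Which behaviour is preferable for this dataset is a judgment call.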

In [None]:
# The test set does not need further adjustment, so we can simply drop its NaN rows
df_test.dropna(subset=['LoanAmount'], inplace=True)
df_test.isnull().sum()

In [None]:
df_test

In [None]:
# Convert the LoanAmount data type to integer
df_train['LoanAmount'] = df_train['LoanAmount'].astype(int)
df_train.info()

In [None]:
# Convert the LoanAmount data type to integer
df_test['LoanAmount'] = df_test['LoanAmount'].astype(int)
df_test.info()

In [None]:
# Append three zeros so the value is in thousands and matches the original format
df_train["LoanAmount"] = df_train["LoanAmount"].mul(1000)
df_train.head()

In [None]:
# Append three zeros so the value is in thousands and matches the original format
df_test["LoanAmount"] = df_test["LoanAmount"].mul(1000)
df_test.head()

#### 9. Loan Amount Term
In this feature we can change the form to an integer.

This feature has an effect on Loan Status, especially on the customer's ability or length of time to repay

In [None]:
df_train["Loan_Amount_Term"].value_counts()

In [None]:
df_train[df_train.Loan_Amount_Term.isnull()]

In [None]:
df_test[df_test.Loan_Amount_Term.isnull()]

In [None]:
# Drop the rows where Loan_Amount_Term is NaN: there are only 12 such rows, half approved and half not, so removing them has no significant effect
df_train.dropna(subset='Loan_Amount_Term', inplace=True)

In [None]:
df_test.dropna(subset='Loan_Amount_Term', inplace=True)
df_test.tail()

In [None]:
df_train['Loan_Amount_Term'] = df_train['Loan_Amount_Term'].astype(int)

In [None]:
df_test['Loan_Amount_Term'] = df_test['Loan_Amount_Term'].astype(int)

In [None]:
df_train["Loan_Amount_Term"].info()

In [None]:
df_train[df_train["Loan_Amount_Term"] == 84]

These rows do not make sense: even if we multiply Income by Loan_Amount_Term, the total would not be enough to pay off the debt. We can therefore delete all of them except index 585, which has Loan_Status 0
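This plausibility check can be written as a column comparison instead of being done row by row. The sketch below uses made-up numbers and assumes that `Income` is a monthly figure, that `Loan_Amount_Term` is in months, and that interest and living costs are ignored. The column names `Total_Income_Over_Term` and `Covers_Loan` are hypothetical:

```python
import pandas as pd

# Hypothetical loans, with LoanAmount already expressed in full currency units
loans = pd.DataFrame({
    "Income": [3000, 6000, 2500],
    "Loan_Amount_Term": [84, 84, 360],
    "LoanAmount": [300000, 150000, 120000],
})

# Total income earned over the whole loan term
loans["Total_Income_Over_Term"] = loans["Income"] * loans["Loan_Amount_Term"]

# Flag rows where even the entire income over the term would not cover the loan
loans["Covers_Loan"] = loans["Total_Income_Over_Term"] >= loans["LoanAmount"]
print(loans)
```

Rows where `Covers_Loan` is False but the loan was approved are the ones that look implausible under these assumptions.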

In [None]:
# Delete the rows at specific indices
df_train = df_train.drop([313, 495, 575])

In [None]:
df_train[df_train["Loan_Amount_Term"] == 84]

In [None]:
df_train[df_train["Loan_Amount_Term"] == 120]

This still makes sense: if we multiply Income by Loan_Amount_Term, the customer is still able to pay off the debt in full

In [None]:
df_train[df_train["Loan_Amount_Term"] == 60]

These rows do not make sense: multiplying Income by Loan_Amount_Term, it is implausible that the customer could still pay off the debt

In [None]:
df_train = df_train.drop(df_train[df_train["Loan_Amount_Term"] == 60].index)
df_train[df_train["Loan_Amount_Term"] == 60]

In [None]:
df_train[df_train["Loan_Amount_Term"] == 36]

Here too, multiplying Income by Loan_Amount_Term, it is implausible that the customer could still pay off the debt. But because Loan_Status is 0, we leave these rows alone

In [None]:
df_train[df_train["Loan_Amount_Term"] == 240]

Row 84 still makes sense; the rest do not when we multiply Income by Loan_Amount_Term. Because the row at index 591 has Loan_Status 0, we only delete row 16

In [None]:
df_train = df_train.drop([16, 84])

In [None]:
df_train[df_train["Loan_Amount_Term"] == 240]

In [None]:
df_train[df_train["Loan_Amount_Term"] == 12]

These rows do not make sense: if we multiply Income by Loan_Amount_Term, the customer would not be able to pay off the debt

In [None]:
df_train = df_train.drop(df_train[df_train["Loan_Amount_Term"] == 12].index)
df_train[df_train["Loan_Amount_Term"] == 12]

#### 10. Credit History
Removed some data containing NaN because Credit History has a significant role

Credit History has a significant impact on loan Status because that is where the bank can carry out background checking. It will be difficult to see a customer's ability to pay if there is no Credit History

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

In [None]:
df_test.dropna(subset=['Credit_History'], inplace=True)
df_test.isnull().sum()

In [None]:
df_test.info()

In [None]:
df_train.describe()

In [None]:
df_train[df_train["Credit_History"].isnull()]

In [None]:
sns.boxplot(data=df_train, x="Credit_History", y="Income", hue="Loan_Status")

In [None]:
df_train["Credit_History"].value_counts()

To bring the number of 0 values closer to the number of 1 values, we fill the NaN entries in Credit History with 0.

In [None]:
df_train["Credit_History"] = df_train["Credit_History"].fillna(0)
df_train["Credit_History"].value_counts()

In [None]:
# df_train = df_train.drop([index_number])

In [None]:
df_train['Credit_History'] = df_train['Credit_History'].astype(int)

In [None]:
df_test['Credit_History'] = df_test['Credit_History'].astype(int)

#### 11. Property Area
This feature only needs to be one-hot encoded with `OneHotEncoder`.

This feature also affects Loan Status because if the customer does not pay, the Bank can confiscate the property the customer owns

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
# Convert the Property_Area feature into binary (one-hot) columns
Property_Area_ohe = encoder.fit_transform(df_train[['Property_Area']]).astype(int)
Property_Area_ohe

In [None]:
# Convert the Property_Area feature of the test data into binary (one-hot) columns
Property_Area_ohe_test = encoder.fit_transform(df_test[['Property_Area']]).astype(int)
Property_Area_ohe_test

In [None]:
# Concatenate Property_Area_ohe with df_train and drop the original column
df_train = pd.concat([df_train, Property_Area_ohe], axis=1).drop(columns=['Property_Area'])
df_train.head()

In [None]:
# Concatenate Property_Area_ohe_test with df_test and drop the original column
df_test = pd.concat([df_test, Property_Area_ohe_test], axis=1).drop(columns=['Property_Area'])
df_test.head()
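One-hot encoding is most reliable when the category list is learned once from the training data and then reused for the test data, so both frames end up with the same columns. A dependency-free sketch of that idea follows. The helper names `fit_categories` and `one_hot` and the sample values are illustrative assumptions, not the notebook's `encoder`.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training values."""
    return sorted(set(values))

def one_hot(values, categories, prefix):
    """Encode each value as a dict of 0/1 columns, one column per known category."""
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]

# Hypothetical sample values for illustration
train_area = ["Urban", "Rural", "Semiurban", "Urban"]
test_area = ["Rural", "Urban"]

categories = fit_categories(train_area)  # learned from the training data only
train_ohe = one_hot(train_area, categories, "Property_Area")
test_ohe = one_hot(test_area, categories, "Property_Area")
print(test_ohe[0])
```

Reusing `categories` for the test data keeps the column order identical even if the test set happens to miss one of the areas.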

#### 12. Loan Status

We just need to convert it to binary values.

This feature is very important because it is the target: we will predict its value from the other features.

In [None]:
# One-hot encode Loan_Status
Loan_Status_ohe = encoder.fit_transform(df_train[['Loan_Status']]).astype(int)
Loan_Status_ohe

In [None]:
# Loan_Status condition before OneHotEncoding
df_train

In [None]:
# Combine Loan_Status_ohe with df_train
df_train = pd.concat([df_train, Loan_Status_ohe], axis=1).drop(columns=['Loan_Status'])
# Loan Status Yes = 1; Loan Status No = 0
df_train = df_train.rename(columns={"Loan_Status_Y": "Loan_Status"})
df_train = df_train.drop(columns=['Loan_Status_N'])
# Loan_Status condition after OneHotEncoding
df_train

In [None]:
df_train.info()

With this, we can return to the [previous](####-8.-Loan-Amount) step

### ------------------------------------------------------------------------------------

#### #Standardization

In [None]:
fig, ax4 = plt.subplots(ncols=3, figsize=(20,10))
sns.histplot(df_train['Income'], bins=30, kde=True, ax=ax4[0])
sns.histplot(df_train['LoanAmount'], bins=30, kde=True, ax=ax4[1])
sns.histplot(df_train['Loan_Amount_Term'], bins=30, kde=True, ax=ax4[2])

In [None]:
# Deleting data that has a Loan Amount of more than 350000
df_train = df_train.drop(df_train[df_train["LoanAmount"] > 350000].index)

In [None]:
fig, ax4 = plt.subplots(ncols=3, figsize=(20,10))
sns.histplot(df_train['Income'], bins=30, kde=True, ax=ax4[0])
sns.histplot(df_train['LoanAmount'], bins=30, kde=True, ax=ax4[1])
sns.histplot(df_train['Loan_Amount_Term'], bins=30, kde=True, ax=ax4[2])

It can be seen that the LoanAmount feature is now approximately normally distributed.

In [None]:
# Remove Income outliers so the feature can be standardized
df_train = df_train.drop(df_train[df_train["Income"] > 20000].index)
fig, ax4 = plt.subplots(ncols=2, figsize=(20,10))
sns.histplot(df_train['Income'], bins=30, kde=True, ax=ax4[0])
sns.histplot(df_train['LoanAmount'], bins=30, kde=True, ax=ax4[1])

It can be seen that removing the Income outliers also affects the LoanAmount distribution.

#### 13. There is still an Imbalance between Yes and No in Loan Status

In [None]:
df_train['Loan_Status'].value_counts()

In fact, the majority class has more than twice as many rows as the minority class, so we need to apply *undersampling*.

In [None]:
# Install Imbalanced-learn library
%pip install imbalanced-learn

In [None]:
import imblearn

print(imblearn.__version__)

Several undersampling methods are available:
* Near Miss Undersampling
    * NearMiss-1: keeps majority class examples with the minimum average distance to the three closest minority class examples.
    * NearMiss-2: keeps majority class examples with the minimum average distance to the three furthest minority class examples.
    * NearMiss-3: keeps majority class examples with the minimum distance to each minority class example.
* Condensed Nearest Neighbor (CNN) Rule Undersampling -> builds a consistent subset of the sample set, that is, a subset which, when used as a stored reference set for the NN rule, correctly classifies all of the remaining points in the sample set.
* Tomek Links for Undersampling -> a refinement of CNN. Because CNN chooses samples randomly, it a) retains unnecessary samples and b) occasionally retains internal rather than boundary samples. A Tomek link is a pair of nearest neighbors from opposite classes; removing the majority class member of each pair cleans up the class boundary.
* Random Under Sampler
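Of the methods listed, random undersampling is the simplest: majority class rows are dropped at random until the classes are balanced. The sketch below illustrates the idea with the standard library only. It is not imbalanced-learn's implementation, and the function name `random_under_sample` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def random_under_sample(X, y, seed=42):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    indices_by_class = {}
    for idx, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(idx)
    n_min = min(len(idxs) for idxs in indices_by_class.values())
    kept = []
    for idxs in indices_by_class.values():
        kept.extend(rng.sample(idxs, n_min))  # sample without replacement
    kept.sort()  # keep the original row order
    return [X[i] for i in kept], [y[i] for i in kept]

# Hypothetical toy data: 6 eligible (1) versus 2 not eligible (0)
X_toy = [[i] for i in range(8)]
y_toy = [1, 1, 1, 0, 1, 1, 0, 1]

X_res, y_res = random_under_sample(X_toy, y_toy)
print(Counter(y_toy), Counter(y_res))
```

Fixing the seed makes the resampled set reproducible, which helps when comparing models trained on it.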

In [None]:
# Split the data X and y data from the dataframe
y = df_train['Loan_Status']
X = df_train.drop('Loan_Status', axis=1)

In [None]:
X.head()

In [None]:
y.head()

In [None]:
y.value_counts()

In [None]:
sns.countplot(x=y, data=X)
plt.title('Number of Eligible and not Eligible')
plt.show()

In [None]:
rus = RandomUnderSampler()
rus_X_train, rus_y_train = rus.fit_resample(X, y)

In [None]:
print("Before sampling class distribution: -", Counter(y))
print("After sampling class distribution: -", Counter(rus_y_train))

In [None]:
rus_y_train.info()

In [None]:
rus_X_train.info()

In [None]:
old_skew = rus_X_train.skew().sort_values(ascending=False)
old_skew

Based on this article: (https://www.kaggle.com/code/aimack/how-to-handle-skewed-distribution)

* Positive values mean the distribution is skewed right
* Negative values mean the distribution is skewed left
* 0 means the distribution is symmetric (as a normal distribution is)
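To make the sign convention concrete, here is a small sketch of the Fisher-Pearson skewness coefficient, the third central moment divided by the 1.5th power of the second. This is an illustrative formula and not necessarily the exact one behind `DataFrame.skew()`: pandas additionally applies a small-sample bias correction, so its numbers can differ slightly while the sign stays the same. The sample lists are made up.

```python
def skewness(values):
    """Fisher-Pearson coefficient of skewness: m3 / m2**1.5 (biased moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up samples for illustration
right_skewed = [1, 1, 2, 2, 3, 10]         # long tail to the right
left_skewed = [-v for v in right_skewed]   # mirrored, long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed), skewness(left_skewed), skewness(symmetric))
```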

In [None]:
fig, ax5 = plt.subplots(ncols=2, figsize=(10,5))
sns.histplot(rus_X_train['Income'], bins=50, kde=True, ax=ax5[0])
sns.histplot(rus_X_train['LoanAmount'], bins=50, kde=True, ax=ax5[1])

In [None]:
df_train.head()

In [None]:
rus_X_train.head()

In [None]:
rus_y_train.head()

In [None]:
df_train = rus_X_train
df_train.head()

In [None]:
df_train = pd.concat([df_train, rus_y_train], axis=1)
df_train.head()

It looks like we do not need to continue with Loan_Amount_Term, because its values are spread too far apart.

#### #Normalization


There are several ways to normalize data:
1. Simple Feature Scaling<br>
    ***df['length'] = df['length']/df['length'].max()***
2. Min-Max<br>
    ***df['length'] = (df['length']-df['length'].min())/(df['length'].max()-df['length'].min())***
3. Z-Score<br>
    ***df['length'] = (df['length']-df['length'].mean())/df['length'].std()***
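The three formulas can be checked on a tiny list before applying them to a DataFrame column. The sketch below uses plain Python, with hypothetical income values and function names chosen for illustration. The z-score uses the sample standard deviation, which matches the default of pandas' `.std()`.

```python
import statistics

def simple_feature_scaling(values):
    """Divide every value by the maximum, so the largest value becomes 1."""
    max_value = max(values)
    return [v / max_value for v in values]

def min_max(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the sample standard deviation (ddof=1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

incomes = [2000, 4000, 6000, 10000]  # hypothetical values
sfs = simple_feature_scaling(incomes)
mm = min_max(incomes)
z = z_score(incomes)
print(sfs)
print(mm)
print(z)
```

Simple feature scaling and min-max both cap the maximum at 1, but only min-max maps the minimum to 0. The z-score has no fixed range and is centered on 0.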

In [None]:
# 1. Simple Feature Scaling
df_income_SFS = df_train.copy()
column1 = 'Income'
column2 = 'LoanAmount'
column3 = 'Dependents'
column4 = 'Loan_Amount_Term'
df_income_SFS[column1] = df_income_SFS[column1]/df_income_SFS[column1].max()
df_income_SFS[column2] = df_income_SFS[column2]/df_income_SFS[column2].max()
df_income_SFS[column3] = df_income_SFS[column3]/df_income_SFS[column3].max()
df_income_SFS[column4] = df_income_SFS[column4]/df_income_SFS[column4].max()

In [None]:
# 2. Min-Max
df_income_mM = df_train.copy()
df_income_mM[column1] = (df_income_mM[column1] - df_income_mM[column1].min())/(df_income_mM[column1].max()-df_income_mM[column1].min())
df_income_mM[column2] = (df_income_mM[column2] - df_income_mM[column2].min())/(df_income_mM[column2].max()-df_income_mM[column2].min())

In [None]:
# 3. Z-Score
df_income_Z = df_train.copy()
df_income_Z[column1] = (df_income_Z[column1] - df_income_Z[column1].mean())/df_income_Z[column1].std()
df_income_Z[column2] = (df_income_Z[column2] - df_income_Z[column2].mean())/df_income_Z[column2].std()

In [None]:
fig, ax5 = plt.subplots(ncols=2, nrows=3, figsize=(20,10))

sns.histplot(df_income_SFS[column1], bins=50, kde=True, ax=ax5[0, 0])
sns.histplot(df_income_SFS[column2], bins=50, kde=True, ax=ax5[0, 1])

sns.histplot(df_income_mM[column1], bins=50, kde=True, ax=ax5[1, 0])
sns.histplot(df_income_mM[column2], bins=50, kde=True, ax=ax5[1, 1])

sns.histplot(df_income_Z[column1], bins=50, kde=True, ax=ax5[2, 0])
sns.histplot(df_income_Z[column2], bins=50, kde=True, ax=ax5[2, 1])

In [None]:
# Inspect the unique values of every feature
for col in rus_X_train.select_dtypes(include=['object', 'bool', 'float64', 'int64', 'int32']).columns:
    print(col)
    print(rus_X_train[col].unique())
    print()

Apart from manual formulas, we can also do this with Sklearn

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_income_normalized = scaler.fit_transform(df_train[['Income']])
df_loan_amount_normalized = scaler.fit_transform(df_train[['LoanAmount']])

fig, ax6 = plt.subplots(ncols=2, figsize=(10,5))

sns.histplot(df_income_normalized, bins=50, kde=True, ax=ax6[0])
sns.histplot(df_loan_amount_normalized, bins=50, kde=True, ax=ax6[1])


In [None]:
#Move data that has been normalized and standardized to df_train again
df_train.head()

In [None]:
# column1 = 'Income'
# column2 = 'LoanAmount'
# column3 = 'Dependents'
# column4 = 'Loan_Amount_Term'
df_train["Income"] = df_income_SFS[column1]
df_train["LoanAmount"] = df_income_SFS[column2]
df_train["Dependents"] = df_income_SFS[column3]
df_train["Loan_Amount_Term"] = df_income_SFS[column4]

df_train.head()

#### Correlation Matrix

Next, we can compute the correlation matrix

In [None]:
plt.figure(figsize = (10, 10))
sns.heatmap(df_train.corr(), annot=True)

Loan Status is directly correlated with Loan Status and Income is directly correlated with LoanAmount

# Model Training

For the case chosen here, several estimators are suitable according to the following estimator selection map:

In [None]:
from IPython import display
display.Image("./ml_map.png")

* LinearSVC
* RandomForestClassifier
* KNeighborsClassifier
* Logistic Regression
* Decision Tree

**Extra Experiments**
* XGBoost
* CatBoostClassifier

In [None]:
def train_and_evaluate(seed, estimators, data, clf):
    # Set random seed
    np.random.seed(seed)
    
    # Divide the data into features (X) and target (y)
    y = data["Loan_Status"]
    X = data.drop("Loan_Status", axis=1)
    
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
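    # Initialize the estimator (this overrides the `clf` argument, which is only a label)
    clf = estimators()
    
    # Fit the model
    clf.fit(X_train, y_train)
    
    # Evaluate the fitted estimator: mean accuracy on the held-out split
    score = clf.score(X_test, y_test)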
    
    return score

## 1. LinearSVC

Linear Support Vector Classification

The main differences between `LinearSVC` and `SVC` are as follows:
* By default, `LinearSVC` minimizes the squared hinge loss while `SVC` minimizes the regular hinge loss. It is possible to manually specify the string 'hinge' for the `loss` parameter in `LinearSVC`.

* `LinearSVC` uses One-vs-All (also known as One-vs-Rest) multiclass reduction whereas `SVC` uses One-vs-One multiclass reduction. For multi-class classification problems, `SVC` therefore fits N*(N - 1)/2 models, where N is the number of classes, while `LinearSVC` fits only N models. If the classification problem is binary, only one model is fit in both scenarios. The `multi_class` and `decision_function_shape` parameters have nothing in common: `decision_function_shape` is an aggregator that transforms the results of the decision function into a convenient shape of (n_samples, n_classes), whereas `multi_class` is an algorithmic approach to building the solution.

* The underlying estimator of `LinearSVC` is liblinear, which does in fact penalize the intercept. `SVC` uses the libsvm estimator, which does not. The liblinear estimator is optimized for the (special) linear case and thus converges more quickly on large amounts of data than libsvm. That is why `LinearSVC` takes less time to solve the problem.
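Two of the differences above can be written out in a few lines: the hinge versus squared hinge loss, and the number of binary models each multiclass reduction needs. This is a plain-Python sketch of the formulas only, not scikit-learn code, and the function names are assumptions made for illustration.

```python
def hinge(y_true, score):
    """Regular hinge loss for a label in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y_true * score)

def squared_hinge(y_true, score):
    """Squared hinge loss: penalizes large margin violations more strongly."""
    return hinge(y_true, score) ** 2

def n_models_one_vs_one(n_classes):
    """One binary model per pair of classes."""
    return n_classes * (n_classes - 1) // 2

def n_models_one_vs_rest(n_classes):
    """One binary model per class; a binary problem needs only one model."""
    return 1 if n_classes == 2 else n_classes

print(hinge(1, 0.5), squared_hinge(1, 0.5))     # correct side, but inside the margin
print(hinge(1, -1.0), squared_hinge(1, -1.0))   # misclassified point
print(n_models_one_vs_one(4), n_models_one_vs_rest(4))
```

For the binary Loan_Status target, both reductions collapse to a single model, so the practical difference here comes from the loss and the solver.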

🔗: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

In [None]:
from sklearn.svm import LinearSVC

# Set a random seed so the random split is reproducible on every run
np.random.seed(42)

# Split the data into features (X), used as model inputs, and the target (y), which the model predicts
y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

# Split the data into a training part and a test part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Initialize LinearSVC
clf_LSVC = LinearSVC()

# Fit the model
clf_LSVC.fit(X_train, y_train)

# Evaluating LinearSVC
score_LSVC = clf_LSVC.score(X_test, y_test)
print(f"Mean Accuracy : {score_LSVC}")

In [None]:
clf = "clf_LSVC"

accuracy = train_and_evaluate(42, LinearSVC, df_train, clf)
print(f"Mean Accuracy: {accuracy}")

Using LinearSVC with default settings (no hyperparameter tuning and no cross-validation) yields a mean accuracy of 0.65625.

## 2. RandomForestClassifier vs HistGradientBoostingClassifier

* A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the data set and uses averaging to improve prediction accuracy and control overfitting.

* `HistGradientBoostingClassifier` is a histogram-based gradient boosting estimator. Exact gradient boosting methods do not scale very well on datasets with a large number of samples, which is the case the histogram-based variant is designed for.

Trees in the forest use the best split strategy, which is equivalent to passing splitter="best" to the underlying DecisionTreeClassifier. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default); otherwise the entire data set is used to build each tree.

🔗 : https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = "clf_RFC"

accuracy = train_and_evaluate(42, RandomForestClassifier, df_train, clf)
print(f"Mean Accuracy: {accuracy}")
score_RFC = accuracy

Using RandomForestClassifier with default settings (no hyperparameter tuning and no cross-validation) yields a mean accuracy of 0.640625.

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier

clf="clf_HGBC"

accuracy = train_and_evaluate(42, HistGradientBoostingClassifier, df_train, clf)
print(f"Mean Accuracy: {accuracy}")

HistGradientBoostingClassifier is often compared with RandomForestClassifier. Its histogram-based training is designed to scale to large numbers of samples, so on a small data set like this one that advantage matters less.

## 3. KNeighbors Classification

Classifier implementing the k-nearest neighbors vote.

`Neighbors-based classification` is a type of instance-based learning or non-generalizing learning: it does not attempt to build a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class that has the most representatives among its nearest neighbors.
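The majority vote can be sketched in a few lines of plain Python. This is an illustration of the idea only, with a brute-force distance search and made-up points. It is not how `KNeighborsClassifier` is implemented internally, and `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up, already scaled points: class 0 near the origin, class 1 near (1, 1)
X_toy = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.9, 0.8), (0.8, 0.9)]
y_toy = [0, 0, 0, 1, 1]

pred_low = knn_predict(X_toy, y_toy, query=(0.2, 0.2))
pred_high = knn_predict(X_toy, y_toy, query=(0.85, 0.85))
print(pred_low, pred_high)
```

Because the vote depends on raw distances, features on very different scales can dominate the result, which is one reason the notebook normalizes the numeric columns first.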

In [None]:
from sklearn.neighbors import KNeighborsClassifier

clf = "clf_KNC"

accuracy = train_and_evaluate(42, KNeighborsClassifier, df_train, clf)
print(f"Mean Accuracy: {accuracy}")
score_KNC = accuracy

Using KNeighborsClassifier with default settings (no hyperparameter tuning and no cross-validation) yields a mean accuracy of 0.5625.

## 4. Logistic Regression

Logistic regression is implemented in `LogisticRegression`. Despite its name, it is a linear model for classification rather than regression in terms of scikit-learn/ML nomenclature.

Logistic regression is also known in the literature as logit regression, maximum entropy classification (MaxEnt) or log-linear classifier. In this model, probabilities that describe the possible outcomes of a single trial are modeled using a logistic function.
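As a small illustration of how a logistic function turns a linear score into a probability, consider the sketch below. The weights and bias are hypothetical values chosen by hand, not coefficients fitted by `LogisticRegression`.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample under a linear logit."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

weights, bias = [2.0, -1.0], 0.0  # hypothetical values, not fitted coefficients
p = predict_proba([1.0, 2.0], weights, bias)  # logit is 2*1 - 1*2 = 0
label = int(p >= 0.5)
print(p, label)
```

A logit of 0 sits exactly on the decision boundary, so the predicted probability is 0.5. Larger logits push the probability toward 1 and smaller ones toward 0.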

🔗 : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

In [None]:
from sklearn.linear_model import LogisticRegression

clf = "clf_LR"

accuracy = train_and_evaluate(42, LogisticRegression, df_train, clf)
print(f"Mean Accuracy: {accuracy}")
score_LR = accuracy

Using LogisticRegression with default settings (no hyperparameter tuning and no cross-validation) yields a mean accuracy of 0.625.

## 5. Decision Tree

Decision Trees (DT) are non-parametric `supervised-learning` methods used for classification and regression.

The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from data features. A tree can be seen as a piecewise constant approximation.


***Advantages of decision trees include:***
* Simple to understand and interpret. Trees can be visualized.
* Requires little data preparation. Other techniques often require data normalization, dummy variables to be created, and empty values to be removed. Some tree and algorithm combinations support missing values.
* The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
* Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables at this time. Other techniques are usually specialized in analyzing data sets that have only one type of variable.
* Able to handle multi-output problems.
* Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic. In contrast, in a black box model (for example, an artificial neural network), results may be more difficult to interpret.
* Possible to validate a model using statistical tests. This makes it possible to account for the reliability of the model.
* Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

***Disadvantages of decision trees include:***
* Decision tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
* Decision trees can be unstable because small variations in the data might produce a completely different tree. This problem is mitigated by using decision trees within an ensemble.
* Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.
* The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. As a result, practical decision tree learning algorithms are based on heuristic algorithms such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
* There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity, or multiplexer problems.
* Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set before fitting the decision tree.
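The greedy, locally optimal decision made at each node can be illustrated with a single split chosen by Gini impurity. This is a simplified sketch for one numeric feature, written under the assumption that impurity is measured with the Gini criterion. The helper names and the toy values are made up, and it is not scikit-learn's tree builder.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Greedy search for the threshold with the lowest weighted Gini impurity."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # splitting at the maximum leaves one side empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up toy values for illustration
credit_history = [0, 0, 0, 1, 1, 1]
loan_status = [0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(credit_history, loan_status)
print(threshold, impurity)
```

A full tree repeats this search for every feature at every node, which is why the result is locally but not necessarily globally optimal.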

In [None]:
from sklearn import tree

clf = "clf_tree"

accuracy = train_and_evaluate(42, tree.DecisionTreeClassifier, df_train, clf)
print(f"Mean Accuracy: {accuracy}")
score_tree = accuracy

Using DecisionTreeClassifier with default settings (no hyperparameter tuning and no cross-validation) yields a mean accuracy of 0.640625.

In [None]:
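# Visualize our decision tree model
from sklearn import tree
import graphviz

# `clf` is only a string label at this point, so fit a tree on the split created earlier
clf_tree_viz = tree.DecisionTreeClassifier().fit(X_train, y_train)
dot_data = tree.export_graphviz(clf_tree_viz, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("Loan_Eligibility")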

In [None]:
# from IPython.display import IFrame
# filepath = "Loan_Eligibility.pdf"
# IFrame(filepath, width=120, height=1920)

import os
path = 'Loan_Eligibility.pdf'
os.system(path)

## 6. XGBoost

`XGBoost` is an optimized distributed gradient boosting library designed to be highly `efficient`, `flexible`, and `portable`. It implements machine learning algorithms under the Gradient Boosting framework.

`XGBoost` provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate manner. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems exceeding billions of examples.

The installation, initialization, and optimization processes are contained in the following documentation:
https://xgboost.readthedocs.io/en/stable/install.html#python

In [None]:
from xgboost import XGBClassifier

clf = "clf_XGB"

accuracy = train_and_evaluate(42, XGBClassifier, df_train, clf)
print(f"Mean Accuracy: {accuracy}")
score_XGB = accuracy

Using XGBClassifier with default settings (no hyperparameter tuning and no cross-validation) yields a mean accuracy of 0.65625.

## 7. CatboostClassifier

Training and deploying models for classification problems. Provides compatibility with scikit-learn tools.

CatBoost is a supervised machine learning method used by the Train Using AutoML tool and uses decision trees for classification and regression. As the name suggests, CatBoost has two main features, it works with categorical data (Cat) and uses gradient boosting (Boost).

🔗 : https://catboost.ai/en/docs/features/visualization_jupyter-notebook

In [None]:
from catboost import CatBoostClassifier

clf = "clf_CBC"

accuracy = train_and_evaluate(42, CatBoostClassifier, df_train, clf)
print(f"Mean Accuracy: {accuracy}")
score_CBC = accuracy

Using CatBoostClassifier with default settings (no hyperparameter tuning and no cross-validation) yields a mean accuracy of 64.0625%. A nice feature of CatBoostClassifier is that its training can be followed through an interactive visualization in Jupyter Notebook.

# Evaluating a Machine Learning Model

There are 3 ways to evaluate Scikit-learn models:

    1. Estimator's built-in `score()` method
    2. The `scoring` parameter
    3. Problem-specific metric functions

🔗 : https://scikit-learn.org/stable/modules/model_evaluation.html

## Evaluation with the `score` method

### • Score

This was already done above during model training. The score has a maximum value of 1; for classification accuracy the minimum is 0.

In 'regression', `score` returns the coefficient of determination (R²), while in 'classification' it returns the mean accuracy.

## Evaluation with the `scoring` parameter

### • Cross Validation

Cross-validation splits the data into `cv` folds. Each fold is used once as the validation set while the model is trained on the remaining folds, and the resulting scores are averaged.

🔗 : https://scikit-learn.org/stable/modules/cross_validation.html
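To show what splitting into folds means, here is a minimal sketch of plain, unshuffled k-fold index generation. It is an illustration only: for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which this sketch does not reproduce. The function name `k_fold_indices` is an assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first (n_samples % k) folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Every sample lands in exactly one validation fold, so the averaged score uses all of the data for validation without ever validating on rows the model was trained on in that round.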

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

np.random.seed(42)

y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Linear SVC
clf_LSVC = LinearSVC()
clf_LSVC.fit(X_train, y_train)
cross_val_score_LSVC = cross_val_score(clf_LSVC, X, y, cv=15)
LSVC_cv = np.mean(cross_val_score_LSVC)

# Random Forest Classifier
clf_RFC = RandomForestClassifier()
clf_RFC.fit(X_train, y_train)
cross_val_score_RFC = cross_val_score(clf_RFC, X, y, cv=15)
RFC_cv = np.mean(cross_val_score_RFC)

# KNeighbours Classification
clf_KNC = KNeighborsClassifier()
clf_KNC.fit(X_train, y_train)
cross_val_score_KNC = cross_val_score(clf_KNC, X, y, cv=15)
KNC_cv = np.mean(cross_val_score_KNC)

# Logistic Regression
clf_LR = LogisticRegression()
clf_LR.fit(X_train, y_train)
cross_val_score_LR = cross_val_score(clf_LR, X, y, cv=15)
LR_cv = np.mean(cross_val_score_LR)

# Decision Tree
clf_tree = DecisionTreeClassifier()
clf_tree.fit(X_train, y_train)
cross_val_score_tree = cross_val_score(clf_tree, X, y, cv=15)
tree_cv = np.mean(cross_val_score_tree)

# XGBoost
clf_XGB = XGBClassifier()
clf_XGB.fit(X_train, y_train)
cross_val_score_XGB = cross_val_score(clf_XGB, X, y, cv=15)
XGB_cv = np.mean(cross_val_score_XGB)

# CatBoost
clf_CBC = CatBoostClassifier()
clf_CBC.fit(X_train, y_train)
cross_val_score_CBC = cross_val_score(clf_CBC, X, y, cv=15)
CBC_cv = np.mean(cross_val_score_CBC)

# Compare the single-split score with the mean cross-validation score of each model
results = {
    "LinearSVC": (score_LSVC, LSVC_cv),
    "RandomForestClassifier": (score_RFC, RFC_cv),
    "KNeighbours": (score_KNC, KNC_cv),
    "Logistic Regression": (score_LR, LR_cv),
    "Decision Tree": (score_tree, tree_cv),
    "XGBoost": (score_XGB, XGB_cv),
    "CatBoostClassifier": (score_CBC, CBC_cv),
}

for i, (name, (score_before, score_cv)) in enumerate(results.items()):
    if i > 0:
        print(" ")
    print(name)
    print(f"Before Cross-Validation : {score_before*100:.2f}%")
    print(f"After Cross-Validation  : {score_cv*100:.2f}%")

In [None]:
# import numpy as np
# from sklearn.model_selection import train_test_split, cross_val_score
# from sklearn.svm import LinearSVC
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier
# from xgboost import XGBClassifier
# from catboost import CatBoostClassifier

# # Function to train classifier, calculate cross-validation scores, and print results
# def train_and_print(clf, X_train, y_train, X, y, cv):
#     clf.fit(X_train, y_train)
#     cross_val_scores = cross_val_score(clf, X, y, cv=cv)
#     cv_score = np.mean(cross_val_scores)
#     print(f"Before Cross-Validation: {clf.score(X_train, y_train) * 100:.2f}%")
#     print(f"After Cross-Validation: {cv_score * 100:.2f}%")
#     print()

# # Set random seed
# np.random.seed(42)

# # Split data
# y = df_train["Loan_Status"]
# X = df_train.drop("Loan_Status", axis=1)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# # Train and print results for each classifier
# classifiers = {
#     "LinearSVC": LinearSVC(),
#     "RandomForestClassifier": RandomForestClassifier(),
#     "KNeighbours": KNeighborsClassifier(),
#     "Logistic Regression": LogisticRegression(),
#     "Decision Tree": DecisionTreeClassifier(),
#     "XGBoost": XGBClassifier(),
#     "CatBoostClassifier": CatBoostClassifier()
# }

# for name, clf in classifiers.items():
#     print(name)
#     train_and_print(clf, X_train, y_train, X, y, cv=15)

## Evaluation with problem-specific metrics

### Evaluation using Classification Model Evaluation Metrics

1. Accuracy
2. Receiver Operating Characteristic (ROC) curve / Area Under the Curve (AUC)<br>
    Comparison of a model's true positive rate(TPR) versus false positive rate(FPR)
    * True Positive =  model predicts 1 when truth is 1
    * False Positive = model predicts 1 when truth is 0
    * True Negative =  model predicts 0 when truth is 0
    * False Negative = model predicts 0 when truth is 1
3. Confusion Matrix
4. Classification report
    * Precision
    * Recall
    * F1_Score
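The classification report metrics all derive from the four counts defined above. A short sketch with made-up counts follows. The function name `classification_metrics` is hypothetical, and the numbers are not results from this notebook.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # how many predicted 1s are truly 1
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many true 1s were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration
metrics = classification_metrics(tp=30, fp=10, tn=20, fn=4)
print(metrics)
```

For loan eligibility, precision and recall answer different questions: precision reflects how often an approved applicant was truly eligible, while recall reflects how many eligible applicants were approved.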

### 1. Accuracy

In [None]:
# Mean cross-validation accuracy of each model
cv_accuracies = {
    "LinearSVC": LSVC_cv,
    "RandomForestClassifier": RFC_cv,
    "KNeighbours": KNC_cv,
    "Logistic Regression": LR_cv,
    "Decision Tree": tree_cv,
    "XGBoost": XGB_cv,
    "CatBoostClassifier": CBC_cv,
}

for i, (name, cv_accuracy) in enumerate(cv_accuracies.items()):
    if i > 0:
        print(" ")
    print(name)
    print(f"Loan Eligibility Cross-Validation Accuracy: {cv_accuracy*100:.2f}%")

### 2. AUC-ROC Curve

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how well the model is capable of distinguishing between classes.

🔗 : https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5#:~:text=AUC%20%2D%20ROC%20curve%20is%20a,capable%20of%20distinguishing%20between%20classes.
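To see where the points of a ROC curve come from, the sketch below lowers the decision threshold one score at a time, records the false and true positive rates, and integrates the curve with the trapezoidal rule. It is a simplified illustration with made-up labels and scores, not a replacement for `sklearn.metrics.roc_curve`, and the helper names are assumptions.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists obtained by lowering the threshold one score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(y_score), reverse=True):
        predicted = [s >= threshold for s in y_score]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr))
    )

# Made-up labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, y_score)
area = auc(fpr, tpr)
print(fpr, tpr, area)
```

An area of 0.5 corresponds to the diagonal guessing line, and an area of 1.0 means the scores separate the two classes perfectly.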

In [None]:
from sklearn.metrics import roc_curve

# LinearSVC has no predict_proba method (it does not produce probability estimates), so it is left out here

# Make predictions with probabilities
# Random Forest Classifier
y_probs_RFC = clf_RFC.predict_proba(X_test)
y_probs_positive_RFC = y_probs_RFC[:,1]
# Calculate fpr, tpr, and thresholds
fpr_RFC, tpr_RFC, thresholds_RFC = roc_curve(y_test, y_probs_positive_RFC)


# KNeighbours Classifier
y_probs_KNC = clf_KNC.predict_proba(X_test)
y_probs_positive_KNC = y_probs_KNC[:, 1]
# Calculate fpr, tpr, and thresholds
fpr_KNC, tpr_KNC, thresholds_KNC = roc_curve(y_test, y_probs_positive_KNC)


# Logistic Regression
y_probs_LR = clf_LR.predict_proba(X_test)
y_probs_positive_LR = y_probs_LR[:, 1]
# Calculate fpr, tpr, and thresholds
fpr_LR, tpr_LR, thresholds_LR = roc_curve(y_test, y_probs_positive_LR)


# Decision Tree
y_probs_tree = clf_tree.predict_proba(X_test)
y_probs_positive_tree = y_probs_tree[:, 1]
# Calculate fpr, tpr, and thresholds
fpr_tree, tpr_tree, thresholds_tree = roc_curve(y_test, y_probs_positive_tree)


# XGBoost
y_probs_XGB = clf_XGB.predict_proba(X_test)
y_probs_positive_XGB = y_probs_XGB[:, 1]
# Calculate fpr, tpr, and thresholds
fpr_XGB, tpr_XGB, thresholds_XGB = roc_curve(y_test, y_probs_positive_XGB)


# CatBoost Classifier
y_probs_CBC = clf_CBC.predict_proba(X_test)
y_probs_positive_CBC = y_probs_CBC[:, 1]
# Calculate fpr, tpr, and thresholds
fpr_CBC, tpr_CBC, thresholds_CBC = roc_curve(y_test, y_probs_positive_CBC)


# Visualize the data

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Define a function to visualize the ROC curves
def plot_roc_curve(fpr_list, tpr_list, labels):
    """
    Plots multiple ROC curves given lists of false positive rates (fpr) and true positive rates (tpr).
    """
    # Create a canvas for multiple plots
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
    axes = axes.ravel()  # flatten the 2D array of axes into a 1D array

    # Plot every ROC curve
    # zip() pairs up the matching elements of each iterable into tuples
    for fpr, tpr, label, ax in zip(fpr_list, tpr_list, labels, axes):
        # Plot roc curve
        ax.plot(fpr, tpr, color="orange", label="ROC")
        # Plot line with no predictive power (baseline)
        ax.plot([0, 1], [0, 1], color="darkblue", linestyle="--", label="Guessing")

        # Customize the plot
        ax.set_xlabel("False positive rate (fpr)")
        ax.set_ylabel("True positive rate (tpr)")
        ax.set_title(f"ROC Curve for {label}")
        ax.legend()

    plt.tight_layout()
    plt.show()

# Calculate ROC curves
classifiers = {
    "Random Forest Classifier": clf_RFC,
    "KNeighbours Classifier": clf_KNC,
    "Logistic Regression": clf_LR,
    "Decision Tree": clf_tree,
    "XGBoost": clf_XGB,
    "CatBoost Classifier": clf_CBC
}

fpr_list = []
tpr_list = []
labels = []

for label, clf in classifiers.items():
    # Make predictions with probabilities
    y_probs = clf.predict_proba(X_test)
    y_probs_positive = y_probs[:, 1]
    # Calculate fpr, tpr, and thresholds
    fpr, tpr, thresholds = roc_curve(y_test, y_probs_positive)
    fpr_list.append(fpr)
    tpr_list.append(tpr)
    labels.append(label)

# Plot ROC curves
plot_roc_curve(fpr_list, tpr_list, labels)


### 3. Confusion Matrix

A quick way to compare the labels a model predicts with the true labels.
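
As a small illustration of how to read the matrix, the sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and not taken from the loan data). For binary labels, scikit-learn puts the true labels on the rows and the predicted labels on the columns, so `ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only (1 = approved, 0 = rejected)
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

The diagonal (TN and TP) holds the correct predictions; the off-diagonal cells (FP and FN) hold the two kinds of mistakes.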

In [None]:
# The manual way
from sklearn.metrics import confusion_matrix

y_preds_CBC = clf_CBC.predict(X_test)

confusion_matrix(y_test, y_preds_CBC)

In [None]:
# Seaborn can also plot this
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_preds_CBC = clf_CBC.predict(X_test)

# set the font scale
sns.set(font_scale=1.5)

# Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds_CBC)

# Plot it using seaborn
sns.heatmap(conf_mat)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def display_confusion_matrices(estimators, X_list, y_list):
    """
    Display confusion matrices for multiple classifiers in subplots.

    Parameters:
        - estimators: List of classifier estimators.
        - X_list: List of feature matrices.
        - y_list: List of label vectors.
    """
    num_estimators = len(estimators)
    num_rows = (num_estimators + 1) // 2  # Adjust for odd number of classifiers
    num_cols = 2

    # squeeze=False keeps axes 2-D, so axes[row, col] also works with a single row
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, 6*num_rows), squeeze=False)

    for idx, (estimator, X, y) in enumerate(zip(estimators, X_list, y_list)):
        row = idx // num_cols
        col = idx % num_cols

        # Make predictions
        y_pred = estimator.predict(X)

        # Calculate confusion matrix
        cm = confusion_matrix(y, y_pred)

        # Display confusion matrix
        ax = axes[row, col]
        disp = ConfusionMatrixDisplay(confusion_matrix=cm)
        disp.plot(ax=ax)
        ax.set_title(f'Confusion Matrix - {type(estimator).__name__}')

    plt.tight_layout()
    plt.show()

# Example usage
estimators = [clf_LSVC, clf_RFC, clf_KNC, clf_LR, clf_tree, clf_XGB, clf_CBC]
X_list = [X] * len(estimators)  # Assuming same features for all classifiers
y_list = [y] * len(estimators)  # Assuming same labels for all classifiers

display_confusion_matrices(estimators, X_list, y_list)


### 4. Classification Report

In [None]:
from sklearn.metrics import classification_report

def generate_classification_reports(estimators, X, y):
    """
    Generate classification reports for multiple classifiers.

    Parameters:
        - estimators: List of classifier estimators.
        - X: Feature matrix.
        - y: Label vector.
    """
    for estimator in estimators:
        # Make predictions
        y_pred = estimator.predict(X)
        # Generate classification report
        report = classification_report(y, y_pred)
        # Print classification report
        print(f"Classification Report - {type(estimator).__name__}:\n{report}\n")

# Run the classification report

X = df_train.drop("Loan_Status", axis=1)
y = df_train["Loan_Status"]

estimators = [clf_LSVC, clf_RFC, clf_KNC, clf_LR, clf_tree, clf_XGB, clf_CBC]

generate_classification_reports(estimators, X, y)


# Improving Model

We need to improve the performance of our ML models through `hyperparameter` *tuning*. This means changing the settings of each model until it performs as well as possible.

`Hyperparameters` vs `Parameters`

* Parameters = the patterns a model finds in the data
* Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns
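
To make the distinction concrete, here is a minimal sketch on synthetic data (the names `X_demo`, `y_demo`, and `model_demo` are assumptions for illustration and are not part of this notebook's pipeline). The values passed to the constructor are hyperparameters; the coefficients that appear after `fit()` are parameters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only (not the loan dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)

# Hyperparameters: chosen by us before training
model_demo = LogisticRegression(C=0.5, max_iter=200)
print("Hyperparameter C:", model_demo.get_params()["C"])

# Parameters: learned from the data during fit()
model_demo.fit(X_demo, y_demo)
print("Learned coefficients:", model_demo.coef_)
print("Learned intercept:", model_demo.intercept_)
```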

## Manual Tuning

### 1. LinearSVC

In [None]:
from sklearn.svm import LinearSVC

np.random.seed(42)

y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # Split the data into 80% training and 20% test

clf_LSVC = LinearSVC(C=20,
                    max_iter=1000,
                    verbose=25)

clf_LSVC.fit(X_train, y_train)

score = clf_LSVC.score(X_test, y_test)
print(f"Mean Accuracy : {score*100:.2f}%")

LinearSVC improved after we changed C to 20, reaching an accuracy of 70.31% (0.703125). Note that `verbose` only controls how much the solver logs, so setting it to 25 does not affect accuracy.
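
As a quick check of which settings can influence accuracy, the sketch below fits two LinearSVC models on synthetic data that differ only in `verbose` (the data and the names `quiet` and `chatty` are assumptions for illustration). With the same `random_state`, both should produce identical predictions, which suggests the gain comes from `C` rather than from `verbose`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data, for illustration only (not the loan dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Two models that differ only in how much the solver logs
quiet = LinearSVC(C=20, max_iter=1000, random_state=42, verbose=0).fit(X_demo, y_demo)
chatty = LinearSVC(C=20, max_iter=1000, random_state=42, verbose=1).fit(X_demo, y_demo)

# Compare the predictions of both models on the same data
same_predictions = np.array_equal(quiet.predict(X_demo), chatty.predict(X_demo))
print("Identical predictions:", same_predictions)
```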

In [None]:
# List the model's hyperparameters
clf_LSVC.get_params()

### 2. RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Set the random seed
np.random.seed(42)  # Seed NumPy's random generator so results are reproducible across runs

# Separate the data into features (X), used as model inputs, and the target (y) to be predicted
y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

# Instantiate RandomForestClassifier
clf_RFC = RandomForestClassifier(criterion="gini",
                                max_depth=15,
                                max_leaf_nodes=50,
                                n_estimators=20)

# Fit the model
clf_RFC.fit(X_train, y_train)

# Evaluate RandomForestClassifier
score = clf_RFC.score(X_test, y_test)
print(f"Mean Accuracy : {score * 100:.2f}%")

RandomForestClassifier improved after setting max_depth, max_leaf_nodes, and n_estimators, reaching an accuracy of 73.44%.

In [None]:
clf_RFC.get_params()

### 3. KNeighbours Classification


In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Set the random seed
np.random.seed(42)  # Seed NumPy's random generator so results are reproducible across runs

# Separate the data into features (X), used as model inputs, and the target (y) to be predicted
y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

# Instantiate KNeighborsClassifier
clf_KNC = KNeighborsClassifier(n_neighbors=20,
                                p=1)

# Fit the model
clf_KNC.fit(X_train, y_train)

# Evaluate KNeighborsClassifier
score = clf_KNC.score(X_test, y_test)
print(f"Mean Accuracy : {score*100:.2f}%")

KNeighborsClassifier improved after setting n_neighbors and p, reaching an accuracy of 67.10%.

In [None]:
clf_KNC.get_params()

### 4. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# Set the random seed
np.random.seed(42)

# Separate the data into features (X), used as model inputs, and the target (y) to be predicted
X = df_train.drop("Loan_Status", axis=1)
y = df_train["Loan_Status"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

# Instantiate LogisticRegression
clf_LR = LogisticRegression(solver="saga",
                            penalty="elasticnet",
                            l1_ratio=0.3,
                            n_jobs=20,
                            max_iter=100,
                            random_state=7)

# Fit the model
clf_LR.fit(X_train, y_train)

# Evaluate the model
score = clf_LR.score(X_test, y_test)
print(f"Mean Accuracy : {score*100:.2f}%")

Hyperparameter tuning did not have a significant impact on the final Logistic Regression result.

In [None]:
clf_LR.get_params()

### 5. Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Set the random seed
np.random.seed(42)

# Separate the data into features (X), used as model inputs, and the target (y) to be predicted
X = df_train.drop("Loan_Status", axis=1)
y = df_train["Loan_Status"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

# Instantiate the decision tree
clf_tree = DecisionTreeClassifier(criterion='entropy', 
                                splitter='best',
                                max_depth=2, 
                                max_features=4,
                                random_state=42
                                )

# Fit the model
clf_tree.fit(X_train, y_train)

# Evaluate the model
score = clf_tree.score(X_test, y_test)
print(f"Mean Accuracy : {score*100:.2f}%")

DecisionTreeClassifier improved after setting criterion, splitter, max_depth, and max_features, reaching an accuracy of 67.19%.

In [None]:
clf_tree.get_params()

### 6. XGBoost

In [None]:
from xgboost import XGBClassifier

# Set the random seed
np.random.seed(42)

# Separate the data into features (X), used as model inputs, and the target (y) to be predicted
X = df_train.drop("Loan_Status", axis=1)
y = df_train["Loan_Status"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

# Instantiate XGBClassifier
clf_XGB = XGBClassifier(booster="gbtree",
                        device="gpu",
                        max_depth=100,
                        n_estimators=100,
                        n_jobs=20,
                        num_parallel_tree=10,
                        tree_method="auto")

# Fit the model
clf_XGB.fit(X_train, y_train)

# Evaluate the model
score = clf_XGB.score(X_test, y_test)
print(f"Mean Accuracy : {score*100:.2f}%")

XGBClassifier improved with the hyperparameters set above (including max_depth, n_estimators, and num_parallel_tree), raising accuracy from 59.38% to 64.06%.

In [None]:
clf_XGB.get_params()

### 7. CatBoostClassifier

In [None]:
from catboost import CatBoostClassifier, Pool

# Set the random seed
np.random.seed(42)

# Separate the data into features (X), used as model inputs, and the target (y) to be predicted
X = df_train.drop("Loan_Status", axis=1)
y = df_train["Loan_Status"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

# Instantiate CatBoostClassifier
clf_CBC = CatBoostClassifier(max_depth=3, verbose=None)

# Fit the model
clf_CBC.fit(X_train, y_train, plot=True)

# Evaluate the model
score = clf_CBC.score(X_test, y_test)
print(f"Mean Accuracy : {score*100:.2f}%")

CatBoostClassifier improved after setting max_depth to 3, raising accuracy from 60.94% to 67.19%.

In [None]:
# CatBoostClassifier(
    # iterations=None,
    # learning_rate=None,
    # depth=None,
    # l2_leaf_reg=None,
    # model_size_reg=None,...
    # max_depth=None,
    # n_estimators=None,

## We want to create train, validation, and test sets

* Train set for training purpose
* Validation set for tuning hyperparameters
* Test set for model evaluation purpose
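
One common way to obtain these three sets is to call `train_test_split` twice. The sketch below is a minimal example on stand-in data (the arrays and the names `X_tr`, `X_val`, and `X_te` are assumptions for illustration); it uses the same 70/15/15 proportions as the manual slicing in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 100 rows, for illustration only
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Step 1: keep 70% for training, set 30% aside
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Step 2: split the 30% in half to get validation and test sets
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_tr), len(X_val), len(X_te))
```

Because the two calls partition the rows, no sample appears in more than one of the three sets.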

### Evaluation Function

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# First of all we create function to evaluate our models

def evaluate_preds(y_true, y_preds):
    """
    Performs evaluation comparison on y_true labels vs. y_preds labels on a classification.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2),
                    "precision": round(precision, 2),
                    "recall": round(recall, 2),
                    "f1": round(f1, 2)}
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision * 100:.2f}%")
    print(f"Recall: {recall * 100:.2f}%")
    print(f"F1 Score: {f1 * 100:.2f}%")

    return metric_dict
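
To show what the four metrics in `evaluate_preds` measure, the following sketch computes them by hand from the counts of true/false positives and negatives and compares them with scikit-learn. The labels are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels, for illustration only (1 = eligible, 0 = not eligible)
y_true_demo = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]

# Count the four outcomes by hand
pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(pairs)      # share of all predictions that are correct
precision = tp / (tp + fp)             # share of predicted positives that are correct
recall = tp / (tp + fn)                # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# Print the hand-computed value next to scikit-learn's value
print(accuracy, accuracy_score(y_true_demo, y_pred_demo))
print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
print(f1, f1_score(y_true_demo, y_pred_demo))
```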

### Train, Validation, and Test sets

In [None]:
# We want to create train, validation, and test sets
# Train set for training purpose
# Validation set for tuning hyperparameters
# Test set for model evaluation purpose

np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split the data into train, validation, & test sets
train_split = round(0.7 * len(df_train_shuffled)) # 70% of the data
valid_split = round(train_split + 0.15 * len(df_train_shuffled)) # 15% of the data
X_train, y_train = X[:train_split], y[:train_split]
X_valid, y_valid = X[train_split:valid_split], y[train_split:valid_split]
X_test, y_test = X[valid_split:], y[valid_split:]

len(X_train), len(X_valid), len(X_test)

In [None]:
clf_LSVC.fit(X_train, y_train)
clf_RFC.fit(X_train, y_train)
clf_KNC.fit(X_train, y_train)
clf_LR.fit(X_train, y_train)
clf_tree.fit(X_train, y_train)
clf_XGB.fit(X_train, y_train)
clf_CBC.fit(X_train, y_train)

# Make baseline predictions
y_preds_LSVC = clf_LSVC.predict(X_valid)
y_preds_RFC = clf_RFC.predict(X_valid)
y_preds_KNC = clf_KNC.predict(X_valid)
y_preds_LR = clf_LR.predict(X_valid)
y_preds_tree = clf_tree.predict(X_valid)
y_preds_XGB = clf_XGB.predict(X_valid)
y_preds_CBC = clf_CBC.predict(X_valid)

# Evaluate the classifier on validation set
print(" ")
print("LinearSVC Classifier")
baseline_metrics_LSVC = evaluate_preds(y_valid, y_preds_LSVC)
print(" ")
print("Random Forest Classifier")
baseline_metrics_RFC = evaluate_preds(y_valid, y_preds_RFC)
print(" ")
print("KNeighbours Classifier")
baseline_metrics_KNC = evaluate_preds(y_valid, y_preds_KNC)
print(" ")
print("Logistic Regression Classifier")
baseline_metrics_LR = evaluate_preds(y_valid, y_preds_LR)
print(" ")
print("Decision Tree Classifier")
baseline_metrics_tree = evaluate_preds(y_valid, y_preds_tree)
print(" ")
print("XGBoost Classifier")
baseline_metrics_XGB = evaluate_preds(y_valid, y_preds_XGB)
print(" ")
print("CatBoost Classifier")
baseline_metrics_CBC = evaluate_preds(y_valid, y_preds_CBC)


Tuning by hand, trying values one at a time, costs a lot of time and effort. scikit-learn therefore provides tools that search for the best hyperparameters automatically: `GridSearchCV` and `RandomizedSearchCV`.

## RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

grid = {
    "n_estimators": [10, 100, 200, 500, 1000, 1200],
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["sqrt", None],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4]
}

np.random.seed(42)

# Split into X & y
X = df_train_shuffled.drop("Loan_Status", axis=1)
y = df_train_shuffled["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf_RFC = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf_RFC = RandomizedSearchCV(estimator=clf_RFC,
                                param_distributions=grid,
                                n_iter=50, # Number of models to try
                                cv=5,
                                verbose=2)

# Fit the RandomizedSearchCV version of clf
rs_clf_RFC.fit(X_train, y_train);

In [None]:
rs_clf_RFC.best_params_

### From the result above, we get the best parameters without having to try them one by one.

In [None]:
# Make predictions with the best hyperparameters
rs_y_preds = rs_clf_RFC.predict(X_test)

# Evaluate the predictions
rs_metrics = evaluate_preds(y_test, rs_y_preds)

The output above shows that the score improves as we tune the hyperparameters with RandomizedSearchCV.
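
A claim of improvement is easier to check when both numbers come from the same cross-validation procedure. The sketch below is a minimal example on synthetic data (the data, the small `demo_grid`, and the variable names are assumptions for illustration); it compares the cross-validated accuracy of an untuned model with the `best_score_` found by the search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic data, for illustration only (not the loan dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Cross-validated accuracy of an untuned model
untuned = RandomForestClassifier(n_estimators=20, random_state=42)
default_cv = cross_val_score(untuned, X_demo, y_demo, cv=3).mean()

# Small, hypothetical search space to keep the sketch fast
demo_grid = {
    "n_estimators": [10, 20, 50],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=demo_grid,
                            n_iter=5,  # number of sampled combinations
                            cv=3,
                            random_state=42)
search.fit(X_demo, y_demo)

print(f"Untuned CV accuracy: {default_cv:.3f}")
print(f"Best CV accuracy from the search: {search.best_score_:.3f}")
print(search.best_params_)
```

The search is not guaranteed to beat the untuned model on every dataset, which is why the comparison is worth printing.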

## GridSearchCV

In [None]:
grid_2 = {
     'n_estimators': [500, 1000, 1200],
     'max_depth': [None, 5],
     'max_features': ['sqrt'],
     'min_samples_split': [6],
     'min_samples_leaf': [1, 2, 4]
}

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(42)

# Split into X & y
X = df_train.drop("Loan_Status", axis=1)
y = df_train["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf_RFC = RandomForestClassifier(n_jobs=1)

# Setup GridSearchCV
gs_clf_RFC = GridSearchCV(estimator=clf_RFC,
                                param_grid=grid_2,
                                cv=5,
                                verbose=2)

# Fit the GridSearchCV version of clf
gs_clf_RFC.fit(X_train, y_train);

In [None]:
gs_clf_RFC.best_params_

In [None]:
gs_y_preds = gs_clf_RFC.predict(X_test)

# Evaluate the predictions
gs_metrics = evaluate_preds(y_test, gs_y_preds)

Compare the metrics of the different models

In [None]:
compare_metrics = pd.DataFrame({
    "baseline": baseline_metrics_RFC,
    "random search": rs_metrics,
    "grid search": gs_metrics
})

compare_metrics.plot.bar(figsize=(10,8));

In [None]:
import pandas as pd
def compare(baseline_metrics, rs_metrics, gs_metrics, title):
    compare_metrics = pd.DataFrame({
        "baseline": baseline_metrics,
        "random search": rs_metrics,
        "grid search": gs_metrics
    })
    ax = compare_metrics.plot.bar(figsize=(10,8))
    ax.set_title(title)
    return ax

### Conclusion

* Use the train set to train the model.
* Use the validation set to validate the model.
* Once you are satisfied with the model, use the test set to measure its performance.

The difference between RandomizedSearchCV (RSCV) and GridSearchCV (GSCV) is that RSCV samples hyperparameter combinations at random, whereas GSCV tries every possible combination.

So we can search with RSCV first and, once it has narrowed the hyperparameters down to a smaller range, run GSCV on that range.
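
The cost difference between the two strategies can be counted before running anything. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on the two RandomForest grids defined earlier in this notebook (the names `wide_grid` and `narrow_grid` are introduced here for illustration).

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# The wide search space used with RandomizedSearchCV above
wide_grid = {
    "n_estimators": [10, 100, 200, 500, 1000, 1200],
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["sqrt", None],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4],
}
# The narrowed search space used with GridSearchCV above
narrow_grid = {
    "n_estimators": [500, 1000, 1200],
    "max_depth": [None, 5],
    "max_features": ["sqrt"],
    "min_samples_split": [6],
    "min_samples_leaf": [1, 2, 4],
}

n_wide = len(ParameterGrid(wide_grid))          # every combination in the wide grid
n_sampled = len(list(ParameterSampler(wide_grid, n_iter=50, random_state=42)))
n_narrow = len(ParameterGrid(narrow_grid))      # every combination in the narrow grid

print(n_wide, n_sampled, n_narrow)  # 540 50 18
```

With `cv=5`, an exhaustive search over the wide grid would need 2,700 fits, the random search needs 250, and the follow-up grid search over the narrowed space needs only 90.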

## Optimized Models

Split the dataset into training and test sets. The three-way training/validation/test split is kept below, commented out.

In [None]:
np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# # Split the data into train, validation, & test sets
# train_split = round(0.7 * len(df_train_shuffled)) # 70% of the data
# valid_split = round(train_split + 0.15 * len(df_train_shuffled)) # 15% of the data
# X_train, y_train =X[:train_split], y[:train_split] # Fit the model
# X_valid, y_valid = X[train_split:valid_split], y[train_split:valid_split] # Tune the hyperparameter
# X_test, y_test = X[valid_split:], y[valid_split:] # Evaluate using metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

len(X_train), len(X_test)

In [None]:
def baseline_train_evaluate(X_train, y_train, X_test, y_test, classifier, estimators):
    """
    To predict using baseline models 
    """
    np.random.seed(42)
    # Train/Fit the models
    classifier.fit(X_train, y_train)
    # Create the baseline predictions
    y_preds_test = classifier.predict(X_test)
    print(" ")
    print("Baseline", estimators, "classifier performance: ")
    baseline_test = evaluate_preds(y_test, y_preds_test)

    return baseline_test

### 1. Improved LinearSVC

#### Baseline LinearSVC

In [None]:
from sklearn.svm import LinearSVC

np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate LinearSVC
clf_LSVC = LinearSVC(dual=True, random_state=42)

baseline_LSVC = baseline_train_evaluate(X_train, y_train, X_test, y_test, clf_LSVC, estimators = "LinearSVC")

In [None]:
clf_LSVC.get_params()

In [None]:
grid_LSVC ={
    "C": [0.1, 1, 10, 100, 1000],
    "max_iter": [100, 200, 300, 400, 500, 700, 800, 1000],
    "dual":["auto"],
    "loss":["hinge", "squared_hinge"],
}

#### rs_LinearSVC

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setup RandomizedSearchCV to search the best parameters
rs_clf_LSVC = RandomizedSearchCV(estimator=clf_LSVC,
                                param_distributions=grid_LSVC,
                                n_iter=50, # Number of models to try
                                cv=5,
                                verbose=2)

rs_clf_LSVC.fit(X_train, y_train);

In [None]:
rs_clf_LSVC.best_params_

In [None]:
# Evaluating the model
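# Make predictions with the best hyperparameters on the held-out test set
# (X_valid is left over from an earlier split and is not held out from this X_train)
rs_y_preds_LSVC = rs_clf_LSVC.predict(X_test)

# Evaluate the predictions
rs_metrics_LSVC = evaluate_preds(y_test, rs_y_preds_LSVC)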

#### gs_LinearSVC

In [None]:
grid_LSVC_2 ={
    "C": [0.1, 1, 10],
    "max_iter": [100, 200],
    "dual":["auto"],
    "loss":["hinge", "squared_hinge"],
}

In [None]:
# Setup GridSearchCV
gs_clf_LSVC = GridSearchCV(estimator=clf_LSVC,
                            param_grid=grid_LSVC_2,
                            cv=5,
                            verbose=2)

# Fit the GridSearchCV version of clf
gs_clf_LSVC.fit(X_train, y_train);

In [None]:
gs_clf_LSVC.best_params_

In [None]:
# Evaluating the model
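# Make predictions with the best hyperparameters on the held-out test set
# (X_valid is left over from an earlier split and is not held out from this X_train)
gs_y_preds_LSVC = gs_clf_LSVC.predict(X_test)

# Evaluate the predictions
gs_metrics_LSVC = evaluate_preds(y_test, gs_y_preds_LSVC)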

#### Compare

In [None]:
compare(baseline_LSVC, rs_metrics_LSVC, gs_metrics_LSVC, title="Comparison of LinearSVC")

### 2. Improved RandomForestClassifier

#### Baseline RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf_RFC = RandomForestClassifier(n_jobs=-1)

baseline_RFC = baseline_train_evaluate(X_train, y_train, X_test, y_test, clf_RFC, estimators = "RandomForest Classifier")

In [None]:
clf_RFC.get_params()

In [None]:
grid_RFC ={
    "n_estimators": [10, 100, 200, 500, 1000, 1200],
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["sqrt", None],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4],
    "criterion":["gini", "entropy", "log_loss"]
}

#### rs_RandomForestClassifier

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setup RandomizedSearchCV to search the best parameters
rs_clf_RFC = RandomizedSearchCV(estimator=clf_RFC,
                                param_distributions=grid_RFC,
                                n_iter=50, # Number of models to try
                                cv=5,
                                verbose=2)

rs_clf_RFC.fit(X_train, y_train);

In [None]:
rs_clf_RFC.best_params_

In [None]:
# Evaluating the model
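# Make predictions with the best hyperparameters on the held-out test set
# (X_valid is left over from an earlier split and is not held out from this X_train)
rs_y_preds_RFC = rs_clf_RFC.predict(X_test)

# Evaluate the predictions
rs_metrics_RFC = evaluate_preds(y_test, rs_y_preds_RFC)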

#### gs_RandomForestClassifier

In [None]:
grid_RFC_2 ={
    'n_estimators': [500, 1000, 1200],
     'max_depth': [None, 5],
     'max_features': ['sqrt'],
     'min_samples_split': [6],
     'min_samples_leaf': [1, 2, 4]
}

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(42)

# Split into X & y
X = df_train.drop("Loan_Status", axis=1)
y = df_train["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf_RFC = RandomForestClassifier(n_jobs=1)

# Setup GridSearchCV
gs_clf_RFC = GridSearchCV(estimator=clf_RFC,
                                param_grid=grid_RFC_2,  # use the grid defined for this section
                                cv=5,
                                verbose=2)

# Fit the GridSearchCV version of clf
gs_clf_RFC.fit(X_train, y_train);

In [None]:
gs_clf_RFC.best_params_

In [None]:
# Evaluating the model
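# Make predictions with the best hyperparameters on the held-out test set
# (X_valid is left over from an earlier split and is not held out from this X_train)
gs_y_preds_RFC = gs_clf_RFC.predict(X_test)

# Evaluate the predictions
gs_metrics_RFC = evaluate_preds(y_test, gs_y_preds_RFC)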

#### Compare

In [None]:
compare(baseline_RFC, rs_metrics_RFC, gs_metrics_RFC, title="Comparison of RandomForest Classifier")

### 3. Improved KNeighbors Classification

#### Baseline KNeighbours

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Set the random seed
np.random.seed(42)  # Seed NumPy's random generator so results are reproducible across runs

# Separate the data into features (X), used as model inputs, and the target (y) to be predicted
y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

# Instantiate KNeighborsClassifier
clf_KNC = KNeighborsClassifier(n_jobs=-1)
# Fit the model
clf_KNC.fit(X_train, y_train)

baseline_KNC = baseline_train_evaluate(X_train, y_train, X_test, y_test, clf_KNC, estimators = "KNeighbours Classifier")

In [None]:
clf_KNC.get_params()

In [None]:
grid_KNC = {
    'n_neighbors': [1, 5, 10, 90],
    'leaf_size': [10, 20, 30 ,40, 50, 100, 1200],
    'p': [1,2],
    'weights': ['uniform', 'distance'],
    'metric': ['minkowski', 'chebyshev'],
}

#### rs_KNC

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setup RandomizedSearchCV to search the best parameters
rs_clf_KNC = RandomizedSearchCV(estimator=clf_KNC,
                                param_distributions=grid_KNC,
                                n_iter=10, # Number of models to try
                                cv=5,
                                verbose=2)

rs_clf_KNC.fit(X_train, y_train);

In [None]:
rs_clf_KNC.best_params_

In [None]:
# Evaluating the model
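# Make predictions with the best hyperparameters on the held-out test set
# (X_valid is left over from an earlier split and is not held out from this X_train)
rs_y_preds_KNC = rs_clf_KNC.predict(X_test)

# Evaluate the predictions
rs_metrics_KNC = evaluate_preds(y_test, rs_y_preds_KNC)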

#### gs_KNC

In [None]:
grid_KNC_2 = {
    'n_neighbors': [90],
    'leaf_size': [2, 10, 30],
    'p': [1,2],
    'weights': ['distance'],
    'metric': ['minkowski'],
}

In [None]:
np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate KNeighborsClassifier
clf_KNC = KNeighborsClassifier(n_jobs=1)

# Setup GridSearchCV
gs_clf_KNC = GridSearchCV(estimator=clf_KNC,
                            param_grid=grid_KNC_2,
                            cv=5,
                            verbose=2)

# Fit the GridSearchCV version of clf
gs_clf_KNC.fit(X_train, y_train);

In [None]:
gs_clf_KNC.best_params_

In [None]:
# Evaluating the model
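# Make predictions with the best hyperparameters on the held-out test set
# (X_valid is left over from an earlier split and is not held out from this X_train)
gs_y_preds_KNC = gs_clf_KNC.predict(X_test)

# Evaluate the predictions
gs_metrics_KNC = evaluate_preds(y_test, gs_y_preds_KNC)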

#### Compare

In [None]:
compare(baseline_KNC, rs_metrics_KNC, gs_metrics_KNC, title="Comparison of KNeighbours Classifier")

### 4. Improved Logistic Regression

#### Baseline Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop("Loan_Status", axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

clf_LR = LogisticRegression()

clf_LR.fit(X_train, y_train)
# Create the baseline predictions
y_preds_test = clf_LR.predict(X_test)
print("Baseline Logistic Regression classifier performance: ")
baseline_LR = evaluate_preds(y_test, y_preds_test)

In [None]:
clf_LR.get_params()

#### rs_Logistic Regression

In [None]:
grid_LR = {
    "C": [1.0, 1.5, 2.0, 2.5, 3.0],
    "max_iter": [1, 5, 10, 100, 200, 500, 1000, 1200],
    "solver": ["newton-cholesky", "liblinear"],
    "n_jobs":[-1]
}

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setup RandomizedSearchCV to search the best parameters
rs_clf_LR = RandomizedSearchCV(estimator=clf_LR,
                                param_distributions=grid_LR,
                                n_iter=80, # Number of models to try
                                cv=5,
                                verbose=2)

rs_clf_LR.fit(X_train, y_train);

In [None]:
rs_clf_LR.best_params_

In [None]:
# Evaluating the model
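# Make predictions with the best hyperparameters on the held-out test set
# (X_valid is left over from an earlier split and is not held out from this X_train)
rs_y_preds_LR = rs_clf_LR.predict(X_test)

# Evaluate the predictions
rs_metrics_LR = evaluate_preds(y_test, rs_y_preds_LR)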

#### gs_Logistic Regression

In [None]:
grid_LR_2 = {
    "C": [3.0, 4.0, 5.0],
    "max_iter": [1, 2, 3, 4, 5],
    "solver": ["newton-cholesky"],
    "n_jobs":[-1]
}

In [None]:
np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Setup GridSearchCV
gs_clf_LR = GridSearchCV(estimator=clf_LR,
                            param_grid=grid_LR_2,
                            cv=5,
                            verbose=2)

# Fit the GridSearchCV version of clf
gs_clf_LR.fit(X_train, y_train);

In [None]:
gs_clf_LR.best_params_

In [None]:
# Evaluating the model
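# Make predictions with the best hyperparameters on the held-out test set
# (X_valid is left over from an earlier split and is not held out from this X_train)
gs_y_preds_LR = gs_clf_LR.predict(X_test)

# Evaluate the predictions
gs_metrics_LR = evaluate_preds(y_test, gs_y_preds_LR)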

#### Compare

In [None]:
compare(baseline_LR, rs_metrics_LR, gs_metrics_LR, title="Comparison of Logistic Regression Classifier")

### 5. Improved Decision Tree

#### Baseline Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
# Set the random seed
np.random.seed(42)  # Seed NumPy's random generator so results are reproducible across runs

# Separate the data into features (X), used as model inputs, and the target (y) to be predicted
y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80% training, 20% test

# Instantiate DecisionTreeClassifier
clf_tree = DecisionTreeClassifier()
# Fit the model
clf_tree.fit(X_train, y_train)

baseline_tree = baseline_train_evaluate(X_train, y_train, X_test, y_test, clf_tree, estimators = "Decision Tree Classifier")

In [None]:
clf_tree.get_params()

#### rs_Decision Tree

In [None]:
grid_tree = {
    "splitter" : ["best", "random"],
    "min_samples_split":[2, 10, 100, 200, 500, 1200],
    "max_features":["sqrt", "log2", None],
    "max_leaf_nodes" : [None]
}

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setup RandomizedSearchCV to search the best parameters
rs_clf_tree = RandomizedSearchCV(estimator=clf_tree,
                                param_distributions=grid_tree,
                                n_iter=10, # Number of models to try
                                cv=5,
                                verbose=2)

rs_clf_tree.fit(X_train, y_train);
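
The randomized search above tries only `n_iter=10` of the combinations in `grid_tree`. The following is a minimal sketch of that idea using only the standard library; it is not scikit-learn's implementation. When every parameter is given as a list, `RandomizedSearchCV` samples candidates without replacement, which `random.sample` imitates here. The names `candidates` and `sampled` are illustrative.

```python
import random
from itertools import product

# Same search space as grid_tree in the cell above
grid_tree = {
    "splitter": ["best", "random"],
    "min_samples_split": [2, 10, 100, 200, 500, 1200],
    "max_features": ["sqrt", "log2", None],
    "max_leaf_nodes": [None],
}

# Enumerate every combination in the grid (2 * 6 * 3 * 1 of them)
keys = list(grid_tree)
candidates = [dict(zip(keys, combo)) for combo in product(*grid_tree.values())]

# Draw 10 distinct candidates, mirroring n_iter=10
random.seed(42)
sampled = random.sample(candidates, k=10)

print(f"{len(sampled)} of {len(candidates)} combinations are evaluated")
```

Because fewer than a third of the combinations are evaluated, the best setting found by the randomized search may differ from the best setting in the full grid.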

In [None]:
rs_clf_tree.best_params_

In [None]:
# Evaluating the model
# Make predictions with the best hyperparameters
rs_y_preds_tree = rs_clf_tree.predict(X_valid)

# Evaluate the predictions
rs_metrics_tree = evaluate_preds(y_valid, rs_y_preds_tree)

#### gs_Decision Tree

In [None]:
grid_tree_2 = {
    "splitter" : ["best"],
    "min_samples_split":[100, 200,500, 600, 1000, 2000],
    "max_features":["sqrt", "log2", None],
    "max_leaf_nodes":[None]
}
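
The grid above contains `min_samples_split` values as large as 2000. The sketch below uses synthetic data (not the loan dataset) to show what happens if such a value exceeds the number of training rows: the root node can never be split, so the tree collapses to a single leaf that predicts the majority class. Whether this applies here depends on the size of `X_train`, which this section does not show.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 400 rows, fewer than the largest min_samples_split below
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

# Default-like setting: nodes may be split as long as they hold at least 2 samples
tree_small = DecisionTreeClassifier(min_samples_split=2, random_state=42).fit(X_demo, y_demo)

# Threshold larger than the dataset: even the root node is too small to split
tree_huge = DecisionTreeClassifier(min_samples_split=2000, random_state=42).fit(X_demo, y_demo)

print("leaves with min_samples_split=2:   ", tree_small.get_n_leaves())
print("leaves with min_samples_split=2000:", tree_huge.get_n_leaves())
```

Candidates that collapse to one leaf all produce the same cross-validation score, so they take up search time without adding information.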

In [None]:
np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Setup GridSearchCV
gs_clf_tree = GridSearchCV(estimator=clf_tree,
                            param_grid=grid_tree_2,
                            cv=5,
                            verbose=2)

# Fit the GridSearchCV version of clf
gs_clf_tree.fit(X_train, y_train);

In [None]:
gs_clf_tree.best_params_

In [None]:
# Evaluating the model
# Make predictions with the best hyperparameters
gs_y_preds_tree = gs_clf_tree.predict(X_valid)

# Evaluate the predictions
gs_metrics_tree = evaluate_preds(y_valid, gs_y_preds_tree)

#### Compare

In [None]:
compare(baseline_tree, rs_metrics_tree, gs_metrics_tree, title="Comparison of Decision Tree Classifier")

### 6. Improved XGBoost

#### Baseline XGBoost

In [None]:
from xgboost import XGBClassifier
# Set the random seed
np.random.seed(42) # Seeds NumPy's random number generator so the same random values are produced on every run

# Split the data into features (X), used as model inputs, and the target (y) to be predicted
y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) # Splits the data into two parts: 80% training and 20% test

# Instantiate XGBClassifier
clf_XGB = XGBClassifier()
# Fit the model
clf_XGB.fit(X_train, y_train)

baseline_XGB = baseline_train_evaluate(X_train, y_train, X_test, y_test, clf_XGB, estimators = "XGBoost Classifier")

In [None]:
clf_XGB.get_params()

#### rs_XGBoost

In [None]:
grid_XGB = {
    "learning_rate": [0.1, 0.01, 0.001],
    "n_estimators": [10, 100, 500, 800, 1000],
    "max_depth": [2, 3, 5, 7, 10, 100],
    "min_child_weight": [5, 10, 20, 40, 80, 100],
    "subsample": [0.5, 0.7, 1.0],
    "colsample_bytree": [0.1, 0.01, 0.001],
    "objective": ["binary:logistic"],
    "scale_pos_weight":[1],
    "device":["cpu"]
}

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setup RandomizedSearchCV to search the best parameters
rs_clf_XGB = RandomizedSearchCV(estimator=clf_XGB,
                                param_distributions=grid_XGB,
                                n_iter=50, # Number of models to try
                                cv=5,
                                verbose=2)

rs_clf_XGB.fit(X_train, y_train);

In [None]:
rs_clf_XGB.best_params_

In [None]:
# Evaluating the model
# Make predictions with the best hyperparameters
rs_y_preds_XGB = rs_clf_XGB.predict(X_valid)

# Evaluate the predictions
rs_metrics_XGB = evaluate_preds(y_valid, rs_y_preds_XGB)

#### gs_XGBoost

In [None]:
grid_XGB_2 = {
    "learning_rate": [0.01, 0.001],
    "n_estimators": [500, 800, 1000, 1200, 1500],
    "max_depth": [5, 7],
    "min_child_weight": [10, 20, 40, 80],
    "subsample": [1.0],
    "colsample_bytree": [0.001],
    "objective": ["binary:logistic"],
    "scale_pos_weight":[1],
    "device":["cpu"]
}

In [None]:
np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Setup GridSearchCV
gs_clf_XGB = GridSearchCV(estimator=clf_XGB,
                            param_grid=grid_XGB_2,
                            cv=5,
                            verbose=2)

# Fit the GridSearchCV version of clf
gs_clf_XGB.fit(X_train, y_train);
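
Unlike the randomized search, `GridSearchCV` fits every combination in `grid_XGB_2` once per fold. The sketch below estimates the resulting cost using only the standard library. It assumes the default `refit=True`, which adds one final fit on the whole training set.

```python
from math import prod

# Same search space as grid_XGB_2 above
grid_XGB_2 = {
    "learning_rate": [0.01, 0.001],
    "n_estimators": [500, 800, 1000, 1200, 1500],
    "max_depth": [5, 7],
    "min_child_weight": [10, 20, 40, 80],
    "subsample": [1.0],
    "colsample_bytree": [0.001],
    "objective": ["binary:logistic"],
    "scale_pos_weight": [1],
    "device": ["cpu"],
}

cv = 5  # number of folds passed to GridSearchCV

# The number of candidates is the product of the list lengths
n_candidates = prod(len(values) for values in grid_XGB_2.values())
cv_fits = n_candidates * cv
total_fits = cv_fits + 1  # plus one refit of the best candidate (refit=True)

print(f"{n_candidates} candidates x {cv} folds = {cv_fits} fits, {total_fits} including the refit")
```

Each added value in any list multiplies the number of fits, which matters for a grid whose `n_estimators` values reach 1500.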

In [None]:
gs_clf_XGB.best_params_

In [None]:
# Evaluating the model
# Make predictions with the best hyperparameters
gs_y_preds_XGB = gs_clf_XGB.predict(X_valid)

# Evaluate the predictions
gs_metrics_XGB = evaluate_preds(y_valid, gs_y_preds_XGB)

#### Compare

In [None]:
compare(baseline_XGB, rs_metrics_XGB, gs_metrics_XGB, title="Comparison of XGBoost Classifier")

### 7. Improved CatBoostClassifier

#### Baseline CatBoostClassifier

In [None]:
from catboost import CatBoostClassifier
# Set the random seed
np.random.seed(42) # Seeds NumPy's random number generator so the same random values are produced on every run

# Split the data into features (X), used as model inputs, and the target (y) to be predicted
y = df_train["Loan_Status"]
X = df_train.drop("Loan_Status", axis=1)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) # Splits the data into two parts: 80% training and 20% test

# Instantiate CatBoostClassifier
clf_CBC = CatBoostClassifier(verbose=0)
# Fit the model
clf_CBC.fit(X_train, y_train)

baseline_CBC = baseline_train_evaluate(X_train, y_train, X_test, y_test, clf_CBC, estimators = "CatBoost Classifier")

In [None]:
params = CatBoostClassifier.__doc__

print(params)

In [None]:
grid_CBC = {
    "iterations": [500, 700, 900, 1200],
    "learning_rate": [0.03, 0.5, 0.7, 1.0],
    "depth":[1, 3, 5, 7],
    "leaf_estimation_iterations": [10],
    "leaf_estimation_method": ["Newton", "Gradient"],
    "thread_count":[-1],
    "verbose":[0],
    "task_type":["CPU"]
}

#### rs_CatBoostClassifier

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setup RandomizedSearchCV to search the best parameters
rs_clf_CBC = RandomizedSearchCV(estimator=clf_CBC,
                                param_distributions=grid_CBC,
                                n_iter=10, # Number of models to try
                                cv=5,
                                verbose=2)

rs_clf_CBC.fit(X_train, y_train);

In [None]:
rs_clf_CBC.best_params_

In [None]:
# Evaluating the model
# Make predictions with the best hyperparameters
rs_y_preds_CBC = rs_clf_CBC.predict(X_valid)

# Evaluate the predictions
rs_metrics_CBC = evaluate_preds(y_valid, rs_y_preds_CBC)

#### gs_CatBoostClassifier

In [None]:
grid_CBC_2 ={
    "iterations": [500, 600, 700],
    "learning_rate": [0.03, 0.5],
    "depth":[1, 3],
    "leaf_estimation_iterations": [10],
    "leaf_estimation_method": ["Newton", "Gradient"],
    "thread_count":[-1],
    "verbose":[0],
    "task_type":["CPU"]
}

In [None]:
np.random.seed(42)

# Shuffle the data
df_train_shuffled = df_train.sample(frac=1) # 1 means 100% of the data gets shuffled

# Split into X & y
X = df_train_shuffled.drop('Loan_Status', axis = 1)
y = df_train_shuffled["Loan_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Setup GridSearchCV
gs_clf_CBC = GridSearchCV(estimator=clf_CBC,
                            param_grid=grid_CBC_2,
                            cv=5,
                            verbose=2)

# Fit the GridSearchCV version of clf
gs_clf_CBC.fit(X_train, y_train);

In [None]:
gs_clf_CBC.best_params_

In [None]:
# Evaluating the model
# Make predictions with the best hyperparameters
gs_y_preds_CBC = gs_clf_CBC.predict(X_valid)

# Evaluate the predictions
gs_metrics_CBC = evaluate_preds(y_valid, gs_y_preds_CBC)

#### Compare

In [None]:
compare(baseline_CBC, rs_metrics_CBC, gs_metrics_CBC, title="Comparison of CatBoost Classifier")

# Use the model to predict the new data

In [None]:
df_test.info()

# Use this model with the newest data (df_test)

## `gs_clf_(estimators).predict(df_test)`
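
The call pattern above can be sketched end to end on synthetic data. This is an illustration under assumptions, not the notebook's pipeline: the column names `feature_a` and `feature_b` and the variable `new_data` are hypothetical stand-ins for the training features and `df_test`. A fitted `GridSearchCV` (with the default `refit=True`) forwards `predict` to its best estimator, and the new data needs the same feature columns as the training data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)

# Synthetic stand-in for df_train: two numeric features and a binary target
train = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
train["Loan_Status"] = (train["feature_a"] > 0).astype(int)

X = train.drop("Loan_Status", axis=1)
y = train["Loan_Status"]

# Small grid search; refit=True (the default) retrains the best model on all of X
search = GridSearchCV(LogisticRegression(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

# Synthetic stand-in for df_test: unlabeled rows with the same feature columns
new_data = pd.DataFrame({
    "feature_b": rng.normal(size=5),
    "feature_a": rng.normal(size=5),
})

# Align the column order with the training features before predicting
preds = search.predict(new_data[X.columns])
print(preds)
```

If `df_test` carries columns that were not used during training, they would have to be dropped or encoded the same way as in `df_train` before calling `predict`.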

# Model Selection

In [None]:
import pandas as pd

# Collect the grid-search evaluation metrics of each tuned model
model_metrics = {
    "LinearSVC"               : gs_metrics_LSVC,
    "Random Forest Classifier": gs_metrics_RFC,
    "KNeighbors Classifier"   : gs_metrics_KNC,
    "Logistic Regression"     : gs_metrics_LR,
    "Decision Tree"           : gs_metrics_tree,
    "XGBoost"                 : gs_metrics_XGB,
    "CatBoost Classifier"     : gs_metrics_CBC,
    # Add more models and their evaluation metrics as needed
}

# Convert the dictionary into a DataFrame (one row per model)
metrics_df = pd.DataFrame.from_dict(model_metrics, orient='index')

# Round the evaluation metrics to two decimal places
metrics_df = metrics_df.round(2)

# Print the DataFrame
print(metrics_df)


# Conclusion

After exploring, analyzing, and manipulating the data, then training, tuning, and evaluating 7 different models, we compared them on the classification evaluation metrics (accuracy, precision, recall, and F1 score) and selected the KNeighbors Classifier as the estimator for the model. Its metric values are reasonably high, and the gaps between its individual metrics are small.

In conclusion, KNeighbors is a suitable model for predicting the eligibility of customers applying for a bank loan because of its high accuracy. It can be further optimized and applied to real-world applications in the banking domain.
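
The two selection criteria used here, high metric values and a small gap between metrics, can be expressed as a mean and a spread per model. The sketch below assumes each metrics dictionary holds accuracy, precision, recall, and F1 under the keys shown. The model names and numbers are placeholders for illustration only and are not results from this notebook.

```python
# Placeholder values for illustration only; NOT results from this notebook
model_metrics = {
    "Model A": {"accuracy": 0.80, "precision": 0.78, "recall": 0.95, "f1": 0.86},
    "Model B": {"accuracy": 0.79, "precision": 0.80, "recall": 0.82, "f1": 0.81},
}

def summarize(metrics):
    """Return the mean of the metrics and the gap between the best and worst one."""
    values = list(metrics.values())
    return {
        "mean": sum(values) / len(values),
        "spread": max(values) - min(values),
    }

summary = {name: summarize(metrics) for name, metrics in model_metrics.items()}

# List the models from the most balanced (smallest spread) to the least balanced
for name, stats in sorted(summary.items(), key=lambda item: item[1]["spread"]):
    print(f"{name}: mean={stats['mean']:.2f}, spread={stats['spread']:.2f}")
```

A model with a slightly lower mean but a much smaller spread may still be preferable when no single metric, such as recall, should dominate the decision.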