<div style="text-align: center;">
  <h1>Home Loan Approval EDA and Prediction</h1>
</div>


* This EDA and prediction project aims to explore the factors that influence home loan approval and to develop a model that can predict the likelihood of approval for a given borrower. 
* The project uses a dataset of home loan approval to train a machine learning model. 
* Once the model is trained, it can be used to predict the likelihood of approval for new loan applications.
<br>
## Here's a brief explanation of each column:

1. Loan_ID: A unique identifier for each loan application.

1. Gender: The gender of the loan applicant (e.g., Male, Female).

1. Married: Marital status of the applicant (e.g., Yes, No).

1. Dependents: The number of dependents of the applicant (e.g., 0, 1, 2, 3+).

1. Education: The educational background of the applicant (e.g., Graduate, Not Graduate).

1. Self_Employed: Indicates whether the applicant is self-employed (e.g., Yes, No).

1. ApplicantIncome: The income of the primary applicant.

1. CoapplicantIncome: The income of the co-applicant (if any).

1. LoanAmount: The loan amount requested by the applicant.

1. Loan_Amount_Term: The term or duration of the loan in months.

1. Credit_History: A binary variable indicating the credit history of the applicant (e.g., 1 for good credit history, 0 for bad credit history).

1. Property_Area: The area or location of the property for which the loan is requested (e.g., Urban, Rural, Semiurban).

1. Loan_Status: The target variable indicating whether the loan was approved or not (e.g., Y for Yes, N for No).

Github : https://github.com/kinba09/Home_Loan_Approval | Streamlit : will update | API : will update 


# Importing the required libraries

In [None]:
pip install scikit-learn==1.3.0

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

plt.style.use('seaborn-v0_8-pastel')
pd.set_option('display.max_columns', 200)
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

import warnings 
warnings.filterwarnings('ignore')

In [None]:
print(sklearn.__version__)

<h3>Setting Matplotlib Style:</h3>

* `plt.style.use('seaborn-v0_8-pastel')`: Sets the Matplotlib style to 'seaborn-v0_8-pastel', which is a predefined style for creating pastel-colored plots with seaborn-like aesthetics.

<h3>pandas Configuration:</h3>

* `pd.set_option('display.max_columns', 200)`: Configures pandas to display up to 200 columns when you print DataFrames. This setting ensures that you can see a large number of columns in your DataFrames when needed.

<h3>Importing Specific Functions:</h3>

* `from sklearn.model_selection import train_test_split`: Imports the train_test_split function from scikit-learn, which is used for splitting a dataset into training and testing sets.

* `from imblearn.over_sampling import RandomOverSampler`: Imports the RandomOverSampler class from the imbalanced-learn library, which is used for oversampling to balance imbalanced datasets in machine learning.

<h3>Suppressing Warnings:</h3>

* `warnings.filterwarnings('ignore')`: Suppresses all warnings for the rest of the session so that warning messages do not clutter the output. Note that this can also hide useful deprecation notices.
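
If the goal is to silence warnings only around one specific call, the standard library's `warnings.catch_warnings()` context manager is one option. The sketch below is an illustration, not part of the original notebook; `noisy` is a hypothetical function that exists only to emit a warning.

```python
import warnings

def noisy():
    # Hypothetical helper that emits a warning, for illustration only
    warnings.warn("this is deprecated", DeprecationWarning)
    return 42

# Warnings are ignored only inside this `with` block
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

# In a fresh context with the "always" filter, the same warning is recorded again
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()

print(result, len(caught))
```

The filter settings are restored when each `with` block exits, so warnings elsewhere in the notebook stay visible.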

# Importing and Reading the data

In [None]:
df = pd.read_csv('/kaggle/input/home-loan-approval/loan_sanction_train.csv')
#df = pd.read_csv('your_file.csv')

In [None]:
df.shape

In [None]:
df.head(10)

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.describe()

`df.shape`: This line of code returns the shape of the DataFrame as a `(rows, columns)` tuple.

`df.head(10)`: This code displays the first 10 rows of the DataFrame. It is useful for quickly inspecting the beginning of the dataset to get an overview of its contents.

`df.columns`: This line of code retrieves the column names (or column labels) of the DataFrame df. 

`df.dtypes`: This code returns the data types of each column in the DataFrame df. It provides information about whether each column contains integers, floats, strings, or other data types.

`df.describe()`: This line of code generates summary statistics for the numerical columns in the DataFrame df. It provides information such as the count, mean, standard deviation, minimum, and maximum values for each numerical column.

# Data Preparation

In [None]:
df = df[[#'Loan_ID', 
        'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status']].copy()

- `df[['Gender', 'Married', ... 'Loan_Status']]` selects a subset of columns from the DataFrame `df`. It retains only the columns listed within the double square brackets.

- `.copy()` creates a new DataFrame that is a deep copy of the selected columns. This is often done to ensure that modifications to the new DataFrame do not affect the original DataFrame `df`.


### Checking for null values

In [None]:
df.isna().sum()

In [None]:
(df.isnull().sum()/(len(df)))*100

In [None]:
df = df.dropna(axis=0)

In [None]:
df.isna().sum()

- `df.isna().sum()`: This code calculates the number of missing values (NaN) for each column in the DataFrame `df`. It returns a Series with the count of missing values for each column.

- `(df.isnull().sum() / len(df)) * 100`: This code calculates the percentage of missing values for each column in the DataFrame `df`. It divides the count of missing values by the total number of rows (`len(df)`) and multiplies by 100 to get the percentage.

- `df = df.dropna(axis=0)`: This code removes rows with missing values (NaN) from the DataFrame `df`. The `axis=0` argument specifies that rows containing any missing values should be dropped.

- `df.isna().sum()`: After removing rows with missing values, this code calculates the updated number of missing values for each column in the DataFrame `df`. It returns a Series with the count of missing values for each column.
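
The same null-check steps can be tried on a tiny made-up DataFrame to see what each call returns. The data below is invented for illustration and is not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the loan dataset
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, 95.0, np.nan, 150.0],
    "Credit_History": [1.0, 1.0, 0.0, 1.0],
})

missing_counts = toy.isna().sum()      # number of missing values per column
missing_pct = toy.isna().mean() * 100  # equivalent to sum() / len(toy) * 100
cleaned = toy.dropna(axis=0)           # keep only rows with no missing values

print(missing_counts)
print(missing_pct)
print(cleaned.shape)
```

Because `dropna(axis=0)` removes a row if any single column is missing, it is worth comparing `cleaned.shape` with the original shape to see how much data is lost.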


In [None]:
df.loc[df.duplicated()]

In [None]:
df=df.reset_index()

- `df.loc[df.duplicated()]`: This code identifies and returns rows in the DataFrame `df` that are duplicates. It checks for duplicated rows based on all columns by default. In this dataset, there aren't any duplicates.

- `df = df.reset_index()`: This code resets the index of the DataFrame `df`. A new default integer index is assigned, and the old index is moved to a new column named `index`. This is useful here because dropping rows left gaps in the index. The extra `index` column is discarded later, when the columns are re-selected. Passing `drop=True` would discard the old index immediately.
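
To make the effect of `reset_index` concrete, here is a small sketch on invented data that compares the default behavior with `drop=True`.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3, 4]})
toy = toy[toy["a"] != 2]  # filtering leaves a gap in the index: 0, 2, 3

kept = toy.reset_index()              # old index becomes a column named "index"
dropped = toy.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
print(list(dropped.index))
```

Either form gives a clean 0..n-1 index; the only difference is whether the old row labels are kept as a column.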



In [None]:
df.head(10)  #to check the reset index

In [None]:
df.info()

In [None]:
df["LoanAmount"]=df["LoanAmount"].astype(int)
df["Credit_History"]=df["Credit_History"].astype(int)
df["Loan_Amount_Term"]=df["Loan_Amount_Term"].astype(int)
df["CoapplicantIncome"]=df["CoapplicantIncome"].astype(int)


- `df.info()`: This code displays concise information about the DataFrame `df`. It includes details such as the number of non-null values, data types of columns, and memory usage. It's a useful way to get an overview of the DataFrame's structure.

- `df["LoanAmount"] = df["LoanAmount"].astype(int)`: This code converts the data type of the "LoanAmount" column in the DataFrame `df` to integers. It is useful when you want to work with the "LoanAmount" values as whole numbers.


In [None]:
df.info() #checking

## Calculating Total income 

In [None]:
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

In [None]:
df["Total_Income"].mean() #average total income

In [None]:
df['Loan_Term_years'] = (df['Loan_Amount_Term']/12).astype(int)

> I computed the total income by adding the applicant's income and the co-applicant's income. A single combined feature is simpler to compare and to model than the two individual incomes. I also converted the loan term from months to years.
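
One detail of the months-to-years conversion is that `.astype(int)` truncates rather than rounds, so a term that is not a whole number of years loses its fractional part. A minimal sketch on made-up values:

```python
import pandas as pd

# Invented values; the 18-month term is there to show truncation
toy = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 4000],
    "CoapplicantIncome": [0, 1500, 2000],
    "Loan_Amount_Term": [360, 180, 18],
})

toy["Total_Income"] = toy["ApplicantIncome"] + toy["CoapplicantIncome"]
toy["Loan_Term_years"] = (toy["Loan_Amount_Term"] / 12).astype(int)  # 1.5 -> 1

print(toy[["Total_Income", "Loan_Term_years"]])
```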

In [None]:
df.head()

In [None]:
df = df[[#'Loan_ID', 
        'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       #'Loan_Amount_Term',
    'Credit_History', 'Property_Area', 'Total_Income', 'Loan_Term_years', 'Loan_Status']].copy()

#removed the unwanted columns and reordered them 

In [None]:
df.head() #checking

## Feature Understanding

In [None]:
Features = ['Gender','Married','Dependents','Education','Self_Employed','Credit_History','Property_Area','Loan_Term_years']
for i in Features:
    print(df[i].value_counts(normalize=True))
    print("---------------------------------")


- `Features = ['Gender','Married','Dependents','Education','Self_Employed','Credit_History','Property_Area','Loan_Term_years']`: This line defines a list named `Features` containing the names of *categorical columns* (features) from the DataFrame that you want to analyze and print value counts for.

  - `print(df[i].value_counts(normalize=True))`: Within the loop, this line calculates and prints the value counts for the current feature (`i`) in the DataFrame `df`. The `normalize=True` parameter normalizes the counts to represent proportions (fractions between 0 and 1) rather than raw counts, making it easier to compare the distribution of each feature.

  - `print("---------------------------------")`: After printing the value counts for the current feature, this line adds a separator (a line of dashes) to visually distinguish the results for different features.


> It's evident that the majority of the categorical features exhibit significant class imbalance. This observation highlights the unequal distribution of categories within these features, which can have implications for modeling and analysis
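
As a quick illustration of what `normalize=True` changes, the sketch below compares raw counts with proportions on an invented column.

```python
import pandas as pd

# Invented values for illustration
s = pd.Series(["Male", "Male", "Male", "Female"], name="Gender")

counts = s.value_counts()                # raw counts per category
shares = s.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```

Multiplying `shares` by 100 gives percentages, which can be easier to read when judging imbalance.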


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.histplot(data=df,ax=axes[0], x="LoanAmount", kde=True)
axes[0].set_xlabel('Loan Amount')
axes[0].set_title('Histogram and KDE Plot For LoanAmount')

sns.histplot(data=df, ax=axes[1], x="Total_Income", kde=True)
axes[1].set_xlabel('Total_Income')
axes[1].set_title('Histogram and KDE Plot For Total_Income')

- `fig, axes = plt.subplots(1, 2, figsize=(12, 4))`: This line creates a figure (`fig`) with two subplots (`axes`) arranged side by side in a single row. The `figsize` parameter specifies the dimensions of the figure.

- `sns.histplot(data=df, ax=axes[0], x="LoanAmount", kde=True)`: In the first subplot (`axes[0]`), this code uses Seaborn's `histplot` function to create a histogram and kernel density estimate (KDE) plot for the "LoanAmount" column from the DataFrame `df`. It visualizes the distribution of loan amounts with a histogram and overlays a KDE curve. The `x` parameter specifies the data to be plotted, and `kde=True` adds the KDE curve.

- `axes[0].set_xlabel('Loan Amount')` and `axes[0].set_title('Histogram and KDE Plot For LoanAmount')`: These lines set the x-axis label and subplot title for the first plot, providing clarity and context.

- `sns.histplot(data=df, ax=axes[1], x="Total_Income", kde=True)`: In the second subplot (`axes[1]`), this code creates a similar histogram and KDE plot, but this time for the "Total_Income" column from the DataFrame `df`.

- `axes[1].set_xlabel('Total_Income')` and `axes[1].set_title('Histogram and KDE Plot For Total_Income')`: These lines set the x-axis label and subplot title for the second plot, providing clear labeling and context.

> It's evident that both the loan amount and total income distributions are right-skewed. In such distributions the mean is typically greater than the median, and the median greater than the mode. Most observations sit at the lower end of the range, while a small number of very large values stretch the tail to the right.
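
The relationship between mean and median in a right-skewed sample can be checked numerically. The numbers below are made up to mimic a few modest incomes plus one very large one.

```python
import numpy as np

# Made-up incomes: most values are modest, one is very large
incomes = np.array([2500, 3000, 3200, 4000, 4500, 5000, 40000], dtype=float)

mean = incomes.mean()
median = np.median(incomes)

# A single large value pulls the mean well above the median
print(f"mean={mean:.1f}, median={median:.1f}")
```

On the real data, `df["Total_Income"].mean()` and `df["Total_Income"].median()` would give the same kind of comparison.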


In [None]:
fig, axes = plt.subplots(1,2, figsize=(12,4))

sns.boxplot(data = df,ax=axes[0], x="LoanAmount")
axes[0].set_title('BoxPlot LoanAmount')

sns.boxplot(data = df,ax=axes[1], x="Total_Income")
axes[1].set_title('Boxplot Total_Income')

- `sns.boxplot(data=df, ax=axes[0], x="LoanAmount")`: In the first subplot (`axes[0]`), this code uses Seaborn's `boxplot` function to create a boxplot for the "LoanAmount" column from the DataFrame `df`. A boxplot is a graphical representation that shows the distribution of data and identifies potential outliers.

- `sns.boxplot(data=df, ax=axes[1], x="Total_Income")`: In the second subplot (`axes[1]`), this code creates a similar boxplot, but this time for the "Total_Income" column from the DataFrame `df`.

- `axes[1].set_title('Boxplot Total_Income')`: This line sets the title for the second subplot, describing the content of the plot.

> From these box plots, it's evident that both 'LoanAmount' and 'Total_Income' exhibit a considerable number of outliers. These outliers are data points that fall significantly beyond the whiskers of the box plots, indicating the presence of extreme values in these variables
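
The points drawn beyond the whiskers follow the common 1.5 × IQR rule, which is the default whisker setting in Matplotlib and Seaborn boxplots. A minimal sketch of that rule on invented loan amounts:

```python
import numpy as np

# Invented loan amounts with one extreme value
values = np.array([100, 110, 120, 128, 135, 150, 160, 600], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are the ones a boxplot draws as individual points
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Counting such values per column gives a number to go with the visual impression from the plots.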

In [None]:
df['Total_Income'].max()

In [None]:
df = df[df['Total_Income'] != 81000]


> I noticed that there was a particularly high outlier at $81,000 in the 'Total_Income' variable. As it appeared to be an extreme value, I decided to remove only that specific outlier from the dataset.

In [None]:
fig, axes = plt.subplots(nrows = 2,ncols =2, figsize=(12,4)) # constrained_layout=True, used this for overlapping
plt.subplots_adjust( hspace=0.8)

sns.boxplot(data = df,ax=axes[0,0], x="LoanAmount")
axes[0,0].set_title('BoxPlot LoanAmount')

sns.boxplot(data = df,ax=axes[0,1], x="Total_Income")
axes[0,1].set_title('Boxplot Total_Income')

sns.violinplot(data=df, ax = axes[1,0], x ="LoanAmount")
axes[1,0].set_title('ViolinPlot LoanAmount')

sns.violinplot(data=df, ax = axes[1,1], x ="Total_Income")
axes[1,1].set_title('ViolinPlot Total_Income')


- `fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 4))`: This line creates a 2x2 grid of subplots (`axes`) within a single figure (`fig`) with a specified size. The grid layout allows for four plots to be displayed in a structured manner.

- `plt.subplots_adjust(hspace=0.8)`: This line adjusts the vertical spacing (`hspace`) between the subplots, ensuring sufficient space between them for clarity.

- `axes[0, 1].set_title('Boxplot Total_Income')`: This line sets the title for the top-right subplot.

- `sns.violinplot(data=df, ax=axes[1, 0], x="LoanAmount")`: In the bottom-left subplot (`axes[1, 0]`), this code uses Seaborn's `violinplot` function to create a violin plot for the "LoanAmount" column, providing insights into the distribution and density of data points.

- `axes[1, 0].set_title('ViolinPlot LoanAmount')`: This line sets the title for the bottom-left subplot.

- `sns.violinplot(data=df, ax=axes[1, 1], x="Total_Income")`: In the bottom-right subplot (`axes[1, 1]`), a similar violin plot is created, but this time for the "Total_Income" column from the DataFrame `df`.

- `axes[1, 1].set_title('ViolinPlot Total_Income')`: This line sets the title for the bottom-right subplot, which plots "Total_Income".



In [None]:
sns.scatterplot(data=df, x="LoanAmount", y="Total_Income", hue="Loan_Status")

In [None]:
data_list = ["Gender","Married","Education","Dependents", "Self_Employed","Credit_History","Property_Area","Loan_Term_years"]
total_plots = len(data_list)

nrows, ncols = 4, 2


fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(10, 16)) 
for i in range(total_plots):
    row = i // ncols
    col = i % ncols
    ax = axes[row, col]

    sns.scatterplot(data=df, x="LoanAmount", y="Total_Income",ax=ax, hue=data_list[i])

    ax.set_title(f'Plot {i+1}')
    ax.set_xlabel('Loan Amount')
    ax.set_ylabel('Total Income')

# Adjust spacing between subplots
plt.tight_layout()

# Display the plots
plt.show()



- `data_list`: This list contains the names of the categorical variables we want to analyze with scatterplots.

- `total_plots`: This variable stores the total number of plots to be created, which is determined by the length of `data_list`.

- `nrows, ncols`: These variables specify the number of rows and columns for the subplot grid.

- `fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(10, 16))`: This line creates a grid of subplots (`axes`) with the specified number of rows and columns. The `figsize` parameter determines the overall size of the figure.

- The `for` loop iterates through the categorical variables in `data_list`:
  - `row` and `col` are calculated to determine the current position in the subplot grid.
  - `ax` represents the current subplot where the scatterplot will be drawn.
  - `sns.scatterplot()` is used to create a scatterplot of "LoanAmount" against "Total_Income," with points colored by the categorical variable specified in `data_list[i]`.
  - `ax.set_title()`, `ax.set_xlabel()`, and `ax.set_ylabel()` set the title, x-axis label, and y-axis label for each subplot.
  
- `plt.tight_layout()`: This line adjusts the spacing between subplots to ensure they are well-organized and do not overlap.

- `plt.show()`: Finally, this command displays the scatterplot grid, showing how "LoanAmount" and "Total_Income" relate to each categorical variable in `data_list`.




In [None]:
sns.jointplot(data=df, x="LoanAmount", y="Total_Income", kind="hist")

In [None]:
sns.swarmplot(data=df, x="Loan_Status", y="Total_Income",  hue="Credit_History")

- The joint plot displays the relationship between two variables, "LoanAmount" (on the x-axis) and "Total_Income" (on the y-axis), with the chosen kind of plot being a histogram.

- `sns.swarmplot(data=df, x="Loan_Status", y="Total_Income",  hue="Credit_History")`: This line creates a swarm plot using Seaborn.

A swarm plot is a categorical scatter plot that displays individual data points along a categorical axis, in this case, "Loan_Status" on the x-axis and "Total_Income" on the y-axis. Additionally, it uses the "Credit_History" variable to color-code the data points, providing information about a third categorical variable.

A swarm plot is useful for visualizing the distribution of data points within different categories and understanding the relationships between multiple variables. In this case, it can help identify any patterns or trends related to loan status, total income, and credit history.

> In the swarm plot, a clear pattern emerges: applicants whose loans were approved mostly have a good credit history (`Credit_History` = 1), while many of those who were not approved have a bad one (`Credit_History` = 0). This underscores the strong influence of credit history on loan approval outcomes.

In [None]:
df1 = df[[#'Loan_ID', 
        'Gender', 'Married', 'Dependents', 'Education',
        'Self_Employed', #'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
        #'Loan_Amount_Term','Credit_History','Total_Income', 
        'Property_Area',  'Loan_Term_years', 'Loan_Status']].copy()

for col in df1.columns:
    print(f'{col}: {df1[col].unique()}')
    print(" ") #new line

  - `print(f'{col}: {df1[col].unique()}')`: It prints the name of the column followed by the unique values present in that column. This provides insight into the distinct values within each selected column.


In [None]:
df['Gender'] = df['Gender'].replace({'Male': 1, 'Female': 0})
df['Married'] = df['Married'].replace({'Yes': 1, 'No': 0})
df['Dependents'] = df['Dependents'].replace({'1': 1, '0': 0, '2': 2, '3+' : 3})
df['Education'] = df['Education'].replace({'Graduate': 1, 'Not Graduate': 0})
df['Self_Employed'] = df['Self_Employed'].replace({'No': 1, 'Yes': 0})
df['Property_Area'] = df['Property_Area'].replace({'Rural': 1, 'Urban': 0, 'Semiurban': 2})
df['Loan_Status'] = df['Loan_Status'].replace({'N': 1, 'Y': 0})


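> The code above converts the categorical variables to numeric codes using the `replace` method in pandas. Note that `Self_Employed` maps 'No' to 1 and `Loan_Status` maps 'N' (not approved) to 1, so the positive class in the metrics further down is a rejected loan.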
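
An alternative way to write this kind of encoding is `Series.map` with an explicit dictionary, followed by a check that nothing was left unmapped. This is a sketch on invented rows, not the notebook's own method.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "Dependents": ["0", "3+", "1"],
})

encoded = toy.copy()
encoded["Married"] = toy["Married"].map({"Yes": 1, "No": 0})
encoded["Dependents"] = toy["Dependents"].map({"0": 0, "1": 1, "2": 2, "3+": 3})

# map() yields NaN for any value missing from the dictionary, so this check
# catches typos or unexpected categories early
unmapped = int(encoded.isna().sum().sum())
print(encoded)
print("unmapped values:", unmapped)
```

Unlike `replace`, which leaves unknown values as they are, `map` turns them into `NaN`, so an incomplete dictionary is easy to spot.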


In [None]:
df = df[[#'Loan_ID', 
        'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', #'ApplicantIncome', 'CoapplicantIncome', 
        'LoanAmount',
       #'Loan_Amount_Term',
    'Credit_History', 'Property_Area', 'Total_Income', 'Loan_Term_years', 'Loan_Status']].copy()

In [None]:
df.head(10) #check the change

In [None]:
df.info() #Checking whether everything has changed to int

In [None]:
X = df.drop('Loan_Status',axis = 1)
Y = df['Loan_Status']

In [None]:
ros = RandomOverSampler()
X, Y = ros.fit_resample(X, Y)

- `X = df.drop('Loan_Status', axis=1)`: In this line, a new DataFrame `X` is created by dropping the 'Loan_Status' column from the original DataFrame `df`. This step separates the features (independent variables) from the target variable ('Loan_Status').

- `Y = df['Loan_Status']`: Here, a new Series `Y` is created, containing only the 'Loan_Status' column from the original DataFrame `df`. This variable represents the target or dependent variable that we want to predict.

- `ros = RandomOverSampler()`: An instance of the `RandomOverSampler` class is created. This class is part of the imbalanced-learn library and is used to address class imbalance by oversampling the minority class.

- `X, Y = ros.fit_resample(X, Y)`: This is where the resampling takes place. The `fit_resample` method of the `RandomOverSampler` class is used to balance the class distribution. It oversamples the minority class (where 'Loan_Status' is 1) by randomly duplicating existing minority-class rows (sampling with replacement) until it matches the number of samples in the majority class (where 'Loan_Status' is 0). No synthetic rows are generated; that is what techniques such as SMOTE do. As a result, both classes are balanced in terms of the number of samples.

 > Resampling can help when a dataset is imbalanced, because it gives the model enough examples of both classes during training. One caveat: here the oversampling is applied before the train/validation split, so copies of the same row can end up in both sets, and the validation scores may look better than they would on unseen data. Oversampling only the training split avoids this.
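
When oversampling happens before the split, copies of the same row can land in both the training and validation sets. One way to avoid that is to split first and balance only the training part. The sketch below uses invented data and plain pandas sampling with replacement as a stand-in for `RandomOverSampler`; the `_toy`, `X_tr`, and `train_bal` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, imbalanced toy data (7 rows of class 0, 3 rows of class 1)
toy = pd.DataFrame({
    "Total_Income": [3000, 4200, 5100, 6100, 2500, 8000, 3900, 4700, 5600, 7200],
    "Loan_Status":  [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})
X_toy, y_toy = toy[["Total_Income"]], toy["Loan_Status"]

# 1) Split first, keeping the class ratio in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# 2) Oversample the minority class in the training part only
train = X_tr.assign(Loan_Status=y_tr)
counts = train["Loan_Status"].value_counts()
minority = counts.idxmin()
extra = train[train["Loan_Status"] == minority].sample(
    n=int(counts.max() - counts.min()), replace=True, random_state=0
)
train_bal = pd.concat([train, extra])

print(train_bal["Loan_Status"].value_counts())
print(len(X_va))
```

With imbalanced-learn, step 2 would be a call to `fit_resample` on `X_tr` and `y_tr` instead of on the full data. The validation rows stay untouched, so the scores reflect data the model has not seen in any form.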


In [None]:
sns.countplot(x=Y) #checking after balancing

# ML Modelling 

In [None]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, shuffle=True)

print("Size of X_train DataFrame: ", X_train.shape)
print("Size of X_val DataFrame: ", X_val.shape)
print("Size of Y_train DataFrame: ", Y_train.shape)
print("Size of Y_val DataFrame: ", Y_val.shape)

#splitting the current X and Y into training and validation sets

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, confusion_matrix

def show_scores(model):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_val)
    scores = {"Training F1 Score": f1_score(Y_train, train_preds),
              "Validation F1 Score": f1_score(Y_val, val_preds),
              "Training Precision Score": precision_score(Y_train, train_preds),
              "Validation Precision Score": precision_score(Y_val, val_preds),
              "Training Recall Score": recall_score(Y_train, train_preds),
              "Validation Recall Score": recall_score(Y_val, val_preds),
              "Training Accuracy Score": accuracy_score(Y_train, train_preds),
              "Validation Accuracy Score": accuracy_score(Y_val, val_preds),
              "Training Confusion Matrix": confusion_matrix(Y_train, train_preds),
              "Validation Confusion Matrix": confusion_matrix(Y_val, val_preds)}

    # Iterate through the dictionary and print each score with a newline character
    for score_name, score_value in scores.items():
        print(f"{score_name}: {score_value}\n")

    #return scores


The `show_scores` function is designed to assess the performance of a machine learning model by calculating several classification metrics for both the training and validation datasets. Here's a summary of what the code does:

- `train_preds = model.predict(X_train)`: This line generates predictions for the training dataset (`X_train`) using the provided machine learning model (`model`).

- `val_preds = model.predict(X_val)`: Similarly, this line generates predictions for the validation dataset (`X_val`) using the same model.

- `scores = {...}`: In this dictionary, various classification metrics are computed and stored. These metrics include F1 Score, Precision Score, Recall Score, Accuracy Score, and Confusion Matrix for both the training and validation datasets.

- The metrics are calculated using functions from the `sklearn.metrics` module, such as `f1_score`, `precision_score`, `recall_score`, `accuracy_score`, and `confusion_matrix`. These metrics are common tools for evaluating the performance of classification models.

- The function then iterates through the `scores` dictionary and prints each metric along with its corresponding value. It adds a newline character (`\n`) after printing each metric to separate them.
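
To see how these metrics relate to each other, the following sketch computes them on a small hand-made set of labels and derives precision and recall from the confusion matrix. The labels are invented, with 1 as the positive class.

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score
)

# Invented labels; 1 is treated as the positive class
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(tn, fp, fn, tp)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
```

A large gap between training and validation values of the same metric is usually the first thing to look for in the `show_scores` output, since it points to overfitting.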





In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, Y_train)

print("Baseline Random Forest model scores\n")
show_scores(rf_clf)
# print("Baseline Random Forest model scores")

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

xt_clf = ExtraTreesClassifier()
xt_clf.fit(X_train, Y_train)

print("Baseline Extra Trees Classifier scores\n")
show_scores(xt_clf)

In [None]:
from sklearn.ensemble import VotingClassifier

estimators = []
estimators.append(('RF', RandomForestClassifier()))
estimators.append(('XT', ExtraTreesClassifier()))

vt_clf = VotingClassifier(estimators = estimators, voting='soft', verbose = True)
vt_clf.fit(X_train, Y_train)


print("Baseline Voting Classifier scores\n")
show_scores(vt_clf)

# GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

# Define your Random Forest classifier
rf_clf = RandomForestClassifier()

# Define a dictionary of hyperparameter values to search
param_grid = {
    'n_estimators': [100, 200],            # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],            # Maximum depth of each tree
    'min_samples_split': [2, 5],           # Minimum samples required to split a node
    'min_samples_leaf': [1, 2],             # Minimum samples required in a leaf node
    'max_features': ['sqrt', 'log2'],          # Number of features to consider for the best split ('auto' was removed in scikit-learn 1.3)
    'bootstrap': [True],                # Whether to bootstrap samples when building trees
    'criterion': ['gini', 'entropy']          
}

# Create GridSearchCV instance
rf_grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, 
                           cv=5, scoring='accuracy', verbose=2, n_jobs=-1)

# Fit the GridSearchCV to your training data
rf_grid_search.fit(X_train, Y_train)

# Print the best hyperparameters found
best_params = rf_grid_search.best_params_
print("Best Hyperparameters:")
print(best_params)

# Get the best model from the grid search
best_rf_model = rf_grid_search.best_estimator_

show_scores(best_rf_model)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(bootstrap = True, criterion= 'entropy', max_depth= 30, max_features= 'sqrt', min_samples_leaf= 1, min_samples_split= 2, n_estimators= 200)
rf_clf.fit(X_train, Y_train)

print("Tuned Random Forest model scores\n")
show_scores(rf_clf)
# print("Baseline Random Forest model scores")

These hyperparameters are the combination that GridSearchCV selected in one of my runs. After configuring the classifier with them, it is trained on the training data (`X_train` and `Y_train`).

> It's worth noting that since I haven't fixed a random state, there might be slight variations in results across different runs. However, the model I'm presenting here represents one of the most efficient and effective variations I've identified during my GridSearchCV optimization efforts. 
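
If reproducible results are wanted, one option is to pass a fixed `random_state` to both the split and the model. The sketch below demonstrates the idea on synthetic data from `make_classification`; the `fit_and_score` helper and the seed values are hypothetical and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the loan dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

def fit_and_score(seed):
    """Split, fit, and score with every source of randomness tied to `seed`."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_va, y_va)

first = fit_and_score(0)
second = fit_and_score(0)
print(first, second)  # the two runs use the same seed, so the scores match
```

The same idea extends to `RandomOverSampler` and to the estimator passed to `GridSearchCV`, both of which accept a `random_state` argument.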


In [None]:
#uncomment this if you want to save the above model in a pickle file

# import pickle
# filename = 'model.pkl'
# pickle.dump(rf_clf, open(filename, 'wb'))

# Feature Importances

In [None]:
feature_importances = rf_clf.feature_importances_

In [None]:
print(feature_importances)

In [None]:

plt.bar(X.columns, feature_importances)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Random Forest Feature Importance')
plt.xticks(rotation=40)
plt.tight_layout()
plt.show()

#plotting the model's Feature importance

> The feature importances plot reveals that within this model, certain features hold substantial significance. Specifically, it highlights the pivotal roles played by "Total Income," "Loan Amount," and "Credit History." These features are instrumental in shaping the model's predictions, signifying their critical importance in assessing loan approval outcomes.






In conclusion, the developed machine learning model represents a valuable tool for predicting loan approval outcomes. Through a comprehensive analysis of various input parameters such as gender, marital status, education, income, and credit history, the model has demonstrated its capability to make informed decisions regarding loan eligibility.



## Areas for Improvement:

1. **Detailed Exploratory Data Analysis (EDA):** To enhance the model's performance, conducting a more comprehensive EDA is crucial. This includes exploring data distributions, correlations, and relationships between variables in greater detail.

2. **Incorporate Additional Data Insights:** Seek to extract deeper insights from the data, such as identifying patterns or trends that might be crucial in predicting loan approval outcomes. Utilize data visualization techniques and statistical analysis to uncover hidden information.

3. **Data Transformation for Skewed Data:** Addressing right-skewed data distributions can improve model accuracy. Consider applying techniques like log transformations to make the data more symmetrical and suitable for modeling.

4. **Outlier Handling with Winsorization:** Implement Winsorization to manage outliers in the dataset. This technique caps extreme values to minimize their impact on the model's performance while preserving valuable information.

5. **Imputation Methods:** Evaluate different imputation methods, such as mean or median imputation, for handling missing data effectively. Select the method that aligns best with the dataset's characteristics.

6. **Experiment with Various Models:** Explore a range of machine learning models beyond the Random Forest Classifier used in the initial model. Evaluate models like Gradient Boosting, Support Vector Machines, or Neural Networks to identify which one performs optimally for loan approval prediction.

These improvements aim to create a more robust and accurate predictive model for loan approval. By enhancing data exploration, addressing data quality issues, and experimenting with different modeling approaches, the model's performance can be significantly enhanced.
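
As a rough sketch of points 3 to 5, the snippet below applies median imputation, percentile-based capping (a simple form of Winsorization), and a log transform to a small invented DataFrame. The 5th/95th percentile limits and the new column names are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
import pandas as pd

# Invented values with one missing entry and one extreme income
toy = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, np.nan, 150.0, 700.0],
    "Total_Income": [3000.0, 4500.0, 5200.0, 6100.0, 81000.0],
})

# Median imputation: fill the gap with the median of the observed values
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].median())

# Winsorization: cap values at the 5th and 95th percentiles (assumed limits)
low, high = toy["Total_Income"].quantile([0.05, 0.95])
toy["Total_Income_wins"] = toy["Total_Income"].clip(lower=low, upper=high)

# Log transform: log1p compresses the long right tail
toy["Total_Income_log"] = np.log1p(toy["Total_Income"])

print(toy)
```

Unlike dropping rows, all three steps keep every observation, which matters for a dataset of this size.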



