<a href="https://www.kaggle.com/code/manishkr1754/loan-eligibility-prediction?scriptVersionId=142889708" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Loan Eligibility Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

A finance company needs to predict accurately which customers are eligible for a loan. To make informed lending decisions and limit risk, it needs a robust predictive model that accounts for the relevant attributes of each applicant. Such a model can help the company streamline its lending process, reduce the likelihood of default, and keep loan approvals fair and transparent, which in turn improves the efficiency and profitability of its operations.

This is a **classification machine learning problem**. The goal is to use machine learning **to determine whether an applicant is eligible for a loan** based on factors such as credit history, income, employment status and other relevant variables.

## 2) Understanding Data
---
The project uses Loan Prediction Data which contains several variables (independent variables) like **Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others** and one outcome variable (dependent variable) called **Loan Status** for each individual.

- **Loan Status:** The outcome variable with two possible values:
  - Y: Indicates that the individual is eligible for Loan.
  - N: Indicates that the individual is not eligible for Loan.

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import numpy as np
import pandas as pd

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## 4) Data Eyeballing
---

### Loading Data

In [None]:
loan_data = pd.read_csv('Datasets/Day5_Loan_Prediction_Data.csv') 

In [None]:
loan_data

In [None]:
print('The size of Dataframe is: ', loan_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
loan_data.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in loan_data.columns if loan_data[feature].dtype != 'O']
categorical_features = [feature for feature in loan_data.columns if loan_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))
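
The dtype comparison above relies on text columns being stored with the `object` dtype. `DataFrame.select_dtypes` is an alternative that selects columns by kind of dtype. A minimal sketch on a hypothetical toy frame (column names are borrowed from the dataset, values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [5849, 4583],
    "LoanAmount": [128.0, 66.0],
})

# Select columns by kind of dtype instead of comparing against 'O'
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```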

In [None]:
print('Missing values per column (count and percentage):')
print('-'*100)
total = loan_data.isnull().sum().sort_values(ascending=False)
percent = (loan_data.isnull().mean() * 100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
loan_data.describe()

In [None]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
loan_data.describe(include='object')

In [None]:
loan_data['Loan_Status'].value_counts()

## 5) Data Cleaning and Preprocessing
---

### Dropping the missing values

In [None]:
loan_data = loan_data.dropna()

In [None]:
print('Missing values per column after dropping rows (count and percentage):')
print('-'*100)
total = loan_data.isnull().sum().sort_values(ascending=False)
percent = (loan_data.isnull().mean() * 100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
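
`dropna()` discards every row that has at least one missing field, so it is worth checking how many records are lost. A minimal sketch on a hypothetical toy frame (column names are borrowed from the dataset, values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, 66.0, np.nan, 120.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

rows_before = len(toy)
toy_clean = toy.dropna()  # drops any row with at least one missing value
rows_after = len(toy_clean)

print(f"Rows before: {rows_before}, after: {rows_after}, "
      f"dropped: {rows_before - rows_after}")
```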

### Label Encoding

In [None]:
loan_data.replace({"Loan_Status":{'N':0,'Y':1}},inplace=True)

In [None]:
loan_data.head()

### Processing `Dependents` column

In [None]:
loan_data['Dependents'].value_counts()

In [None]:
# Replacing the value of 3+ to 4
loan_data = loan_data.replace(to_replace='3+', value=4)

In [None]:
loan_data['Dependents'].value_counts()
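
Note that the replacement above swaps the string `'3+'` for the integer `4`, while the other values stay as they were read from the CSV. The sketch below assumes the column was read as strings and shows one way to make it uniformly numeric; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the Dependents column, stored as strings
dependents = pd.Series(["0", "1", "2", "3+", "0"], dtype=object, name="Dependents")

# Same replacement as above: '3+' becomes the integer 4
replaced = dependents.replace(to_replace="3+", value=4)
print(replaced.dtype)  # object: strings mixed with one integer

# An explicit cast makes every value an integer
as_int = replaced.astype(int)
print(as_int.tolist())
```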

### Converting categorical columns to numerical values

In [None]:
# Map each categorical column to integer codes
loan_data.replace({'Married': {'No': 0, 'Yes': 1}, 'Gender': {'Male': 1, 'Female': 0},
                   'Self_Employed': {'No': 0, 'Yes': 1}, 'Education': {'Graduate': 1, 'Not Graduate': 0},
                   'Property_Area': {'Rural': 0, 'Semiurban': 1, 'Urban': 2}}, inplace=True)

In [None]:
loan_data.head()
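
A nested-dictionary `replace` leaves any value that is missing from the mapping unchanged, so an unexpected category can go unnoticed. As an alternative sketch, `Series.map` returns `NaN` for unmapped values, which makes them easy to detect. The toy frame and the `mappings` name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({
    "Married": ["No", "Yes", "Yes"],
    "Property_Area": ["Rural", "Urban", "Semiurban"],
})

mappings = {
    "Married": {"No": 0, "Yes": 1},
    "Property_Area": {"Rural": 0, "Semiurban": 1, "Urban": 2},
}

encoded = toy.copy()
for column, mapping in mappings.items():
    # Values absent from the mapping become NaN instead of passing through
    encoded[column] = toy[column].map(mapping)

# Fail loudly if any category had no mapping
assert not encoded.isnull().any().any(), "found a category with no mapping"
print(encoded)
```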

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the data and labels
X = loan_data.drop(columns=['Loan_ID', 'Loan_Status'])  # Feature matrix
y = loan_data['Loan_Status'] # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X
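
Fitting the scaler on the full feature matrix before splitting lets test-set statistics influence the training features. One common way to avoid this is to fit the scaler on the training split only, for example through a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan features; the `_demo` names are illustrative and not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features; not the real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when transforming the test rows
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_tr, y_tr)

print("Test accuracy:", pipeline.score(X_te, y_te))
```

With this arrangement the split would come before standardization, and the same pipeline object can be passed to cross-validation utilities.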

### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
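
The `stratify=y` argument keeps the class proportions of the target roughly equal in the training and test splits. A minimal sketch with hypothetical labels (70% eligible, 30% not eligible) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 eligible (1), 30 not eligible (0)
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# With stratification, the share of class 1 is preserved in both splits
print("Share of class 1 overall:", y_demo.mean())
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```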

### Model Comparison : Training & Evaluation

In [None]:
# Model classes to compare, each trained with its default hyperparameters
models = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for model in models:
    # Fit on the training split and predict on the held-out test split
    classifier = model().fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    # Precision, recall and F1 are computed for the positive class (1 = eligible)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred))
    recall_scores.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))

In [None]:
classification_metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Decision Tree", "Random Forest"],
    "Accuracy": accuracy_scores,
    "Precision": precision_scores,
    "Recall": recall_scores,
    "F1 Score": f1_scores
})

classification_metrics_df.set_index('Model', inplace=True)
classification_metrics_df
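
The `confusion_matrix` function imported earlier can show where each model's errors come from. A minimal sketch with hypothetical labels (not results from this dataset):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration (1 = eligible, 0 = not eligible)
y_true_demo = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 1, 1]

# For binary labels 0 and 1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Here a false positive is an ineligible applicant predicted as eligible, and a false negative is an eligible applicant predicted as ineligible.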

### Inference
In the context of Loan Eligibility Prediction:

1. **Logistic Regression** has the highest recall (0.97), so it misses very few eligible applicants. Its lower precision (0.75) means that some ineligible applicants are also predicted as eligible; the resulting F1 score is 0.85.

2. **SVM** keeps recall high (0.95) at the same rounded precision (0.75), so it behaves much like Logistic Regression: few false negatives at the cost of some false positives.

3. **Decision Tree** has the lowest accuracy (0.65) among the models. It provides good precision (0.75) but struggles with recall (0.73) leading to a moderate F1 score (0.74).

4. **Random Forest** strikes a balance between precision (0.77) and recall (0.89) resulting in a reasonable F1 score (0.83) and overall accuracy (0.74).

In summary, Logistic Regression and SVM achieve the highest recall at a precision of about 0.75, Random Forest gives up some recall for slightly higher precision (0.77), and Decision Tree trails the other models on recall, F1 score and accuracy.
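
The F1 scores quoted above follow from the precision and recall values, since F1 is their harmonic mean. A small sketch that recomputes them from the rounded figures (so the last digit could differ from values computed on unrounded metrics); the `f1_from` helper is hypothetical:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) pairs quoted in the inference above
reported = {
    "Logistic Regression": (0.75, 0.97),
    "Decision Tree": (0.75, 0.73),
    "Random Forest": (0.77, 0.89),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_from(precision, recall):.2f}")
```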


**`Note:`** Choosing the most suitable model depends on the specific objectives of the finance company. If minimizing false negatives (rejecting applicants who are actually eligible) is crucial, Logistic Regression or SVM may be preferred. If a balance between precision and recall is desired, Random Forest offers a reasonable compromise. Further model evaluation and fine-tuning may be necessary to optimize performance for the specific business goals.
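
As one possible form of the fine-tuning mentioned in the note, `RandomizedSearchCV` (imported earlier but not used) can sample hyperparameter combinations and score each with cross-validation. The sketch below runs on synthetic data; the parameter grid, `n_iter`, `cv` and the `f1` scoring choice are assumptions for illustration, not settings from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the loan features; not the real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative search space (assumed values, not tuned for the loan data)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=45),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations
    scoring="f1",      # optimize F1 for the positive class
    cv=3,              # 3-fold cross-validation
    random_state=45,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

The scoring metric should match the business goal: `recall` if missed eligible applicants matter most, `precision` if wrongly approved applicants matter most.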