
# PROBLEM STATEMENT

This Loan Status Prediction dataset contains records of applicants who previously applied for a property loan.

The bank decides whether to grant a loan based on factors such as applicant income, loan amount, credit history, and co-applicant income.

Our goal is to build a machine learning model that predicts whether an applicant's loan will be approved or rejected.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,cross_val_score,RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import joblib  # lightweight pipelining, caching and parallel computing; used below to save and load the model and scaler


# Working of Libraries

1. **Pandas**: Used for data cleaning, data transformation, and data analysis. It reads and writes datasets in formats such as CSV, Excel, SQL databases, and JSON.

2. **NumPy**:
   - **N-dimensional arrays**: Efficient storage and manipulation of large datasets.
   - **Mathematical functions**: Functions for linear algebra, statistical operations, and random number generation.
   - **Performance**: Optimized numerical computation, which makes it suitable for scientific computing.

3. **Scikit-learn (sklearn)**:
   - **Model selection**: Tools such as `train_test_split`, `GridSearchCV`, and `RandomizedSearchCV` help in selecting the best model and hyperparameters.
   - **Preprocessing**: Functions for scaling features (e.g., `StandardScaler`), encoding categorical variables, and imputing missing values.
   - **Model evaluation**: Metrics such as accuracy, precision, and recall, plus cross-validation techniques to assess model performance.

4. **StandardScaler**: A scikit-learn utility that standardizes features by removing the mean and scaling to unit variance. This matters when features have different scales, because many machine learning algorithms are sensitive to feature scale.

5. **Joblib**: A library for lightweight pipelining in Python.
   - **Caching results**: It can cache the output of functions to avoid recomputation.
   - **Parallel computing**: Facilitates running tasks concurrently across multiple CPU cores.
   - **Efficient object persistence**: Saves and loads large NumPy arrays efficiently.

6. **svm (Support Vector Machine)**:
This scikit-learn module implements Support Vector Machines for classification tasks. SVMs are powerful classifiers that can learn both linear and non-linear decision boundaries.

7. **DecisionTreeClassifier**:
This classifier uses a decision tree structure to make predictions based on feature values. It is intuitive and interpretable but prone to overfitting.

8. **RandomForestClassifier**:
An ensemble method that constructs multiple decision trees during training and outputs the mode of their predictions. It generally offers better accuracy than an individual decision tree.

9. **GradientBoostingClassifier**:
Another ensemble method that builds trees sequentially; each new tree corrects errors made by the previously trained trees. It is effective but can overfit if not tuned properly.

# READING DATASET

In [2]:
df = pd.read_csv('loan_data.csv')

## EXPLANATION OF FEATURES

1) **Loan_ID**: A unique loan ID.

2) **Gender**: Either male or female.

3) **Married**: Whether the applicant is married (Yes) or not married (No).

4) **Dependents**: Number of persons depending on the client.

5) **Education**: Applicant education (Graduate or Not Graduate).

6) **Self_Employed**: Self-employed (Yes/No).

7) **ApplicantIncome**: Applicant income.

8) **CoapplicantIncome**: Co-applicant income.

9) **LoanAmount**: Loan amount in thousands.

10) **Loan_Amount_Term**: Terms of the loan in months.

11) **Credit_History**: Credit history meets guidelines.

12) **Property_Area**: Whether the applicant lives in an Urban, Semiurban, or Rural area.

13) **Loan_Status**: Loan approved (Y/N).

In [3]:
df.head()#first 5 rows 

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y


In [4]:
df.shape  # number of rows and columns: here 381 rows and 13 columns

(381, 13)

In [5]:
df.tail()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
376,LP002953,Male,Yes,3+,Graduate,No,5703,0.0,128.0,360.0,1.0,Urban,Y
377,LP002974,Male,Yes,0,Graduate,No,3232,1950.0,108.0,360.0,1.0,Rural,Y
378,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
379,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
380,LP002990,Female,No,0,Graduate,Yes,4583,0.0,133.0,360.0,0.0,Semiurban,N


In [6]:
df.info()#to check missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381 entries, 0 to 380
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            381 non-null    object 
 1   Gender             376 non-null    object 
 2   Married            381 non-null    object 
 3   Dependents         373 non-null    object 
 4   Education          381 non-null    object 
 5   Self_Employed      360 non-null    object 
 6   ApplicantIncome    381 non-null    int64  
 7   CoapplicantIncome  381 non-null    float64
 8   LoanAmount         381 non-null    float64
 9   Loan_Amount_Term   370 non-null    float64
 10  Credit_History     351 non-null    float64
 11  Property_Area      381 non-null    object 
 12  Loan_Status        381 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 38.8+ KB


In [7]:
#Handling missing values
df.isnull().sum()  # e.g. Gender has 5 missing values; the other columns with missing values are listed the same way

Loan_ID               0
Gender                5
Married               0
Dependents            8
Education             0
Self_Employed        21
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     11
Credit_History       30
Property_Area         0
Loan_Status           0
dtype: int64

In [8]:
#Percentage of missing value
df.isnull().mean()*100

Loan_ID              0.000000
Gender               1.312336
Married              0.000000
Dependents           2.099738
Education            0.000000
Self_Employed        5.511811
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           0.000000
Loan_Amount_Term     2.887139
Credit_History       7.874016
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

In [9]:
# Drop unnecessary column
df = df.drop('Loan_ID',axis=1) #axis=1 denotes column you want to drop

In [10]:
# Columns with a small share of missing values (see the output of df.isnull().mean()*100):
# simply drop the affected rows. Columns with a larger share are filled with the mode further below.

df = df.dropna(subset = ['Gender', 'Dependents', 'Loan_Amount_Term'])

In [11]:
df.shape

(358, 12)
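
The rule used here (drop rows for columns with only a few gaps, fill the rest with the mode) can be sketched on a toy frame. The 5% cut-off and the toy values are assumptions made for illustration; in this notebook the columns are picked by hand.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'a' has 1 gap in 25 rows (4%), 'b' has 5 gaps (20%).
toy = pd.DataFrame({
    "a": ["x"] * 24 + [np.nan],
    "b": ["No"] * 15 + ["Yes"] * 5 + [np.nan] * 5,
})

threshold = 5.0  # percent; an assumed cut-off, not taken from the notebook
missing_pct = toy.isnull().mean() * 100

drop_cols = [c for c in toy.columns if 0 < missing_pct[c] < threshold]
fill_cols = [c for c in toy.columns if missing_pct[c] >= threshold]

# Drop the few rows that have gaps in the sparsely-missing columns...
toy = toy.dropna(subset=drop_cols)
# ...and fill the remaining gaps with each column's most frequent value.
for col in fill_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy.isnull().sum())
```

Dropping rows keeps the data free of invented values but shrinks the dataset, which is why it is usually reserved for columns where only a few rows are affected.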

In [12]:
#handling missing value
df.isnull().sum()

Gender                0
Married               0
Dependents            0
Education             0
Self_Employed        20
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term      0
Credit_History       30
Property_Area         0
Loan_Status           0
dtype: int64

In [13]:
df['Self_Employed'].unique()  # categorical column, so a mean is not defined; use the mode instead

array(['No', 'Yes', nan], dtype=object)

In [14]:
df['Self_Employed'].mode()[0]#replacing all missing values in this column with No

'No'

In [15]:
df['Credit_History'].unique()

array([ 1., nan,  0.])

In [16]:
df['Credit_History'].mode()[0]  # the mean (0.8597560975609756) is not a valid value for this 0/1 column, so use the mode

1.0

In [17]:
# assign the result back instead of calling fillna(..., inplace=True) on a column selection,
# which operates on a copy and will stop working in pandas 3.0
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])

In [18]:
df.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 358 entries, 0 to 380
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             358 non-null    object 
 1   Married            358 non-null    object 
 2   Dependents         358 non-null    object 
 3   Education          358 non-null    object 
 4   Self_Employed      358 non-null    object 
 5   ApplicantIncome    358 non-null    int64  
 6   CoapplicantIncome  358 non-null    float64
 7   LoanAmount         358 non-null    float64
 8   Loan_Amount_Term   358 non-null    float64
 9   Credit_History     358 non-null    float64
 10  Property_Area      358 non-null    object 
 11  Loan_Status        358 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 36.4+ KB


In [20]:
df['Gender'].unique()

array(['Male', 'Female'], dtype=object)

In [21]:
df['Gender'].value_counts()

Gender
Male      278
Female     80
Name: count, dtype: int64

In [22]:
df['Dependents'].unique()

array(['1', '0', '2', '3+'], dtype=object)

In [23]:
# treat '3+' as 4 dependents; assign back rather than using inplace=True on a column selection
df['Dependents'] = df['Dependents'].replace('3+', '4')

In [24]:
df['Married'].unique()

array(['Yes', 'No'], dtype=object)

In [25]:
df['Education'].unique()

array(['Graduate', 'Not Graduate'], dtype=object)

In [26]:
#converting categorical values to numerical
encoding = {
    'Gender' : {'Male':1, 'Female':0},
    'Married': {'Yes':1, 'No':0},
    'Dependents':{'0':0, '1':1, '2':2, '4':4},
    'Education' : {'Graduate':1, 'Not Graduate':0},
    'Self_Employed' : {'Yes':1, 'No':0},
    'Property_Area' : {'Rural':0, 'Semiurban':2, 'Urban':1},
    # Loan_Status holds 'Y'/'N' (not 'Yes'/'No') and is left as strings;
    # scikit-learn classifiers accept string labels
}

In [27]:
df.replace(encoding,inplace=True)
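
`DataFrame.replace` silently skips mapping keys that do not occur in a column, so a mismatched mapping goes unnoticed. One way to catch this, sketched here on a toy Series rather than on `df`, is `Series.map`, which turns every unmatched value into `NaN`. The names `wrong_map` and `right_map` are hypothetical.

```python
import pandas as pd

# Toy column using the same labels as Loan_Status in this dataset.
status = pd.Series(["Y", "N", "Y", "Y"], name="Loan_Status")

wrong_map = {"Yes": 1, "No": 0}  # keys never occur in the column
right_map = {"Y": 1, "N": 0}     # keys match the actual labels

# Series.map returns NaN for every value without a matching key,
# so counting NaNs reveals a mapping that does not fit the data.
unmatched = status.map(wrong_map).isnull().sum()
encoded = status.map(right_map)

print(f"Unmatched values with wrong_map: {unmatched}")
print(encoded.tolist())
```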



In [28]:
#categorical to numerical o/p
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 358 entries, 0 to 380
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             358 non-null    int64  
 1   Married            358 non-null    int64  
 2   Dependents         358 non-null    int64  
 3   Education          358 non-null    int64  
 4   Self_Employed      358 non-null    int64  
 5   ApplicantIncome    358 non-null    int64  
 6   CoapplicantIncome  358 non-null    float64
 7   LoanAmount         358 non-null    float64
 8   Loan_Amount_Term   358 non-null    float64
 9   Credit_History     358 non-null    float64
 10  Property_Area      358 non-null    int64  
 11  Loan_Status        358 non-null    object 
dtypes: float64(4), int64(7), object(1)
memory usage: 36.4+ KB


In [29]:
# split the dataset into independent variables (features, X) and the dependent variable (target, Y)
X = df.drop('Loan_Status', axis=1)
Y = df['Loan_Status']

In [30]:
df['Loan_Status'].value_counts()

Loan_Status
Y    261
N     97
Name: count, dtype: int64
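
The classes are imbalanced, which matters when reading accuracy figures later on. As a rough reference point, the sketch below computes the accuracy of a classifier that always predicts the majority class, using the counts printed above.

```python
# Class counts copied from the value_counts() output above.
approved, rejected = 261, 97

# Accuracy of a classifier that always predicts the majority class ('Y').
baseline_accuracy = approved / (approved + rejected)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model is only useful to the extent that it beats this baseline.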

In [31]:
df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,N
1,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,1,Y
2,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,1,Y
3,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,1,Y
4,1,1,0,0,0,2333,1516.0,95.0,360.0,1.0,1,Y


In [32]:
# ApplicantIncome, CoapplicantIncome, LoanAmount and Loan_Amount_Term are on much larger
# scales than the 0/1-encoded columns, so standardize them with StandardScaler
num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']
scaler = StandardScaler()
# each column is rescaled to mean 0 and unit variance; the values are not limited to [-1, 1]
X[num_cols] = scaler.fit_transform(X[num_cols])

In [33]:
X.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,1,1,1,1,0,0.71163,0.092069,0.80598,0.285826,1.0,0
1,1,1,0,1,1,-0.398856,-0.539332,-1.350425,0.285826,1.0,1
2,1,1,0,0,0,-0.691384,0.447965,0.527735,0.285826,1.0,1
3,1,0,0,1,0,1.705666,-0.539332,1.25813,0.285826,1.0,1
4,1,1,0,0,0,-0.866761,0.095418,-0.341784,0.285826,1.0,1
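
What `StandardScaler` computes can be checked by hand: subtract the column mean and divide by the column standard deviation. The input values below are hypothetical and chosen only to show that standardized values can fall outside the range -1 to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical income-like values, chosen only for illustration.
values = np.array([[2000.0], [3000.0], [4000.0], [11000.0]])

scaled = StandardScaler().fit_transform(values)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```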


# Creating our Prediction Model

In [34]:
#Model Validation function
def evaluate_model(model):
    X_train, X_test, Y_train, Y_test  = train_test_split(X, Y, test_size = 0.2, random_state = 42)
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    accuracy = accuracy_score(Y_test, Y_pred)
    cross_val = cross_val_score(model, X, Y, cv=5)  # 5-fold cross-validation on the full dataset
    avg_cross_val = np.mean(cross_val)
    print(f"{model.__class__.__name__} - Accuracy : {accuracy: .2f} , Cross-Val-Score: {avg_cross_val: .2f}")
    return avg_cross_val

In [35]:
# a set literal, so the models are evaluated in arbitrary order
models = {
    LogisticRegression(),
    svm.SVC(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(), 
}

In [36]:
model_score = {model.__class__.__name__:evaluate_model(model) for model in models}

GradientBoostingClassifier - Accuracy :  0.85 , Cross-Val-Score:  0.82
DecisionTreeClassifier - Accuracy :  0.85 , Cross-Val-Score:  0.79
LogisticRegression - Accuracy :  0.85 , Cross-Val-Score:  0.84
SVC - Accuracy :  0.85 , Cross-Val-Score:  0.83
RandomForestClassifier - Accuracy :  0.82 , Cross-Val-Score:  0.84
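
`evaluate_model` reports accuracy on a single 80/20 split and the mean accuracy over 5 folds. Because the scaler was fitted on all rows before splitting, the held-out rows contribute to the scaling statistics. A common alternative, sketched here on synthetic data (`X_demo` and `y_demo` are stand-ins, not the loan data), is to put the scaler inside a pipeline so that it is re-fitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and Y (assumption: the real data is not loaded here).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The pipeline fits the scaler on the training part of each fold only,
# then applies it to that fold's validation part.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(f"Mean CV accuracy: {scores.mean():.2f}")
```

On a dataset of this size the difference is often small, but the pipeline version gives an estimate that does not depend on information from the validation rows.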


# Tuning our model

Hyperparameter tuning is done with `RandomizedSearchCV`:

- `param_grid`: a dictionary (or a list of dictionaries) that defines the hyperparameter values to sample from.
- `n_iter = 20`: 20 different hyperparameter combinations are sampled and evaluated (fewer if the grid contains fewer than 20 combinations).
- `verbose = True`: prints progress information during the tuning process.
- `tuner.best_estimator_`: the model refitted with the best parameters found during tuning.

In [37]:
def tune_model(model,param_grid):
    tuner = RandomizedSearchCV(model, param_grid, cv = 5, n_iter =20, verbose = True, random_state = 42)
    tuner.fit(X, Y)
    print(f"Best Score for {model.__class__.__name__}: {tuner.best_score_:.2f}")
    print(f"Best Parameter for {model.__class__.__name__}: {tuner.best_params_}")
    return tuner.best_estimator_
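
A minimal sketch of what the tuner does, run on synthetic data so it is self-contained. `X_demo`, `y_demo`, and the smaller `n_iter=5` are assumptions made to keep the example quick; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

param_grid = {"C": np.logspace(-4, 4, 20), "solver": ["liblinear"]}
tuner = RandomizedSearchCV(
    LogisticRegression(), param_grid, cv=5, n_iter=5, random_state=42
)
tuner.fit(X_demo, y_demo)

# cv_results_ holds one entry per sampled hyperparameter combination.
n_tried = len(tuner.cv_results_["params"])
# best_estimator_ is refitted on all of X_demo with the best parameters.
best_model = tuner.best_estimator_

print(f"Combinations tried: {n_tried}")
print(f"Best parameters: {tuner.best_params_}")
```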

- **`C`**: This parameter is the inverse of regularization strength; smaller values specify stronger regularization. The `np.logspace(-4, 4, 20)` function generates 20 values logarithmically spaced between $10^{-4}$ and $10^{4}$. This range allows the model to explore a wide spectrum of regularization strengths.
  
- **`solver`**: The `"liblinear"` solver is suitable for small datasets and supports L1 regularization, making it a common choice for logistic regression.
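
To make the spacing concrete, the short sketch below inspects the values that `np.logspace(-4, 4, 20)` produces. Each value is a constant factor larger than the previous one, so small and large regularization strengths are covered equally densely on a log scale.

```python
import numpy as np

c_values = np.logspace(-4, 4, 20)

first, last = c_values[0], c_values[-1]
# Consecutive values differ by a constant factor of 10 ** (8 / 19).
ratios = c_values[1:] / c_values[:-1]

print(f"first={first:g}, last={last:g}, step factor={ratios[0]:.3f}")
print(c_values[10])  # one of the candidate values for C
```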


In [38]:
# specify our parameters
log_reg_grid = {'C': np.logspace(-4,4,20), "solver":["liblinear"]}
svc_grid = {'C':[0.25,0.50,0.75,1], "kernel":["linear"]}

rf_grid = {
    'n_estimators': np.arange(10, 1000, 10), #describes no. of trees in the forest
    'max_features': ['log2', 'sqrt'], 
    'max_depth': [None, 3, 5, 10, 20, 30],# None:- nodes are expanded until all leaves are pure or <min_samples_split samples.
    'min_samples_split': [2, 5, 20, 50, 100],
    'min_samples_leaf': [1, 2, 5, 10]
}
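
It helps to know how large each search space is compared with `n_iter = 20`. The sketch below counts the combinations with scikit-learn's `ParameterGrid`; the two grids are copied from the cell above.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

svc_grid = {"C": [0.25, 0.50, 0.75, 1], "kernel": ["linear"]}
rf_grid = {
    "n_estimators": np.arange(10, 1000, 10),
    "max_features": ["log2", "sqrt"],
    "max_depth": [None, 3, 5, 10, 20, 30],
    "min_samples_split": [2, 5, 20, 50, 100],
    "min_samples_leaf": [1, 2, 5, 10],
}

# len(ParameterGrid(...)) is the number of distinct hyperparameter combinations.
n_svc = len(ParameterGrid(svc_grid))
n_rf = len(ParameterGrid(rf_grid))

print(f"SVC combinations: {n_svc}")
print(f"Random forest combinations: {n_rf}")
```

With 20 sampled candidates, the random forest search covers only a small fraction of its grid, while the SVC grid is smaller than `n_iter` and is therefore searched exhaustively.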

In [39]:
best_log_reg = tune_model(LogisticRegression(), log_reg_grid)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Score for LogisticRegression: 0.84
Best Parameter for LogisticRegression: {'solver': 'liblinear', 'C': 1.623776739188721}


In [40]:
best_svc_reg = tune_model(svm.SVC(), svc_grid)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best Score for SVC: 0.84
Best Parameter for SVC: {'kernel': 'linear', 'C': 0.25}


In [41]:
best_rf = tune_model(RandomForestClassifier(), rf_grid)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Score for RandomForestClassifier: 0.84
Best Parameter for RandomForestClassifier: {'n_estimators': 930, 'min_samples_split': 50, 'min_samples_leaf': 10, 'max_features': 'sqrt', 'max_depth': 30}


In [42]:
final_model = best_rf

In [43]:
joblib.dump(final_model, 'loan_status_predictor.pkl')

['loan_status_predictor.pkl']

In [44]:
# Prediction system
sample_data = pd.DataFrame({
    'Gender' : [1],
    'Married' : [1],
    'Dependents' : [1],
    'Education' : [1],
    'Self_Employed' : [0],
    'ApplicantIncome' : [4583],
    'CoapplicantIncome' : [1508.0],
    'LoanAmount' : [128.0],
    'Loan_Amount_Term' : [360.0],
    'Credit_History' : [1.0],
    'Property_Area' : [0]
  
})

sample_data[num_cols] = scaler.transform(sample_data[num_cols])
loaded_model = joblib.load('loan_status_predictor.pkl')
prediction = loaded_model.predict(sample_data)

# Loan_Status was never converted to 0/1, so the model predicts 'Y' or 'N'
result = "Loan Approved" if prediction[0] == 'Y' else "Loan Not Approved"
print(f"\n Prediction Result:{result}")


 Prediction Result:Loan Not Approved


In [45]:
joblib.dump(scaler, 'vector.pkl')

['vector.pkl']
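
Saving the scaler and the model as two files works, but both must be loaded and applied in the right order at prediction time. A possible alternative, sketched here on synthetic data, is to bundle both steps into one scikit-learn pipeline and persist that single object. The file name `demo_pipeline.pkl` and the data are hypothetical, and this sketch scales every column, whereas the notebook scales only `num_cols`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaler and classifier travel together as one object.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
pipe.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline scales and predicts in one call on raw inputs.
same_predictions = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```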

Resource: the dataset was taken from the following link:
https://www.kaggle.com/datasets/bhavikjikadara/loan-status-prediction?resource=download
            