## Loan Approval Prediction 

The goal of this project is to automate the loan eligibility decision based on the details customers provide. The columns are: loan ID (unique identifier), gender, marital status, number of dependents, education level, self-employment status, applicant income, co-applicant income, loan amount, loan term, credit history, property area type, and loan status (the target column). The data was downloaded from Kaggle: https://www.kaggle.com/datasets/altruistdelhite04/loan-prediction-problem-dataset

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.appName('IMMLLogReg').getOrCreate()

In [0]:
# File location and type
file_location = "/FileStore/tables/loan.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df.limit(10))

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0.0,,360,1,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360,1,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360,1,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360,1,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360,1,Urban,Y
LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360,1,Urban,Y
LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360,1,Urban,Y
LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360,0,Semiurban,N
LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360,1,Urban,Y
LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360,1,Semiurban,N


**Data Pre-processing**

In [0]:
import pyspark.sql.functions as f

In [0]:
# Convert the Loan_Status column ('Y'/'N') to boolean: 'Y' -> true, 'N' -> false.
# The new name differs from the original only by case, so it replaces Loan_Status.

df = df.withColumn('loan_status', f.col('Loan_Status') == 'Y')

display(df.limit(10))

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,loan_status
LP001002,Male,No,0,Graduate,No,5849,0.0,,360,1,Urban,True
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360,1,Rural,False
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360,1,Urban,True
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360,1,Urban,True
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360,1,Urban,True
LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360,1,Urban,True
LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360,1,Urban,True
LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360,0,Semiurban,False
LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360,1,Urban,True
LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360,1,Semiurban,False


In [0]:
# Cast the boolean loan_status column to an integer (true -> 1, false -> 0)

df=df.withColumn('loan_status_int',df.loan_status.cast('integer'))
display(df.limit(10))

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,loan_status,loan_status_int
LP001002,Male,No,0,Graduate,No,5849,0.0,,360,1,Urban,True,1
LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360,1,Rural,False,0
LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360,1,Urban,True,1
LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360,1,Urban,True,1
LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360,1,Urban,True,1
LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360,1,Urban,True,1
LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360,1,Urban,True,1
LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360,0,Semiurban,False,0
LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360,1,Urban,True,1
LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360,1,Semiurban,False,0


In [0]:
# Select the feature columns and the integer label (Loan_ID is dropped)
data = df.select(['Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome','LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'loan_status_int'])

In [0]:
data.printSchema()

root
 |-- Gender: string (nullable = true)
 |-- Married: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- Education: string (nullable = true)
 |-- Self_Employed: string (nullable = true)
 |-- ApplicantIncome: integer (nullable = true)
 |-- CoapplicantIncome: double (nullable = true)
 |-- LoanAmount: integer (nullable = true)
 |-- Loan_Amount_Term: integer (nullable = true)
 |-- Credit_History: integer (nullable = true)
 |-- Property_Area: string (nullable = true)
 |-- loan_status_int: integer (nullable = true)



In [0]:
# Removing the rows that contain NULL values 
data=data.dropna()

In [0]:
# Compare the number of approved and not-approved applications (counted on df, i.e. before rows with NULLs are dropped)

loanApprovedDf = df.filter("loan_status_int=1")
nonLoanApprovedDf = df.filter("loan_status_int=0")

print(loanApprovedDf.count(), ":", nonLoanApprovedDf.count())

422 : 192


In [0]:
# Create a view or table

temp_table_name = "loan_data"
 
df.createOrReplaceTempView(temp_table_name)

In [0]:
display(spark.sql("select count(loan_status_int) as no_of_approved_loan, loan_status_int from loan_data group by loan_status_int"))

no_of_approved_loan,loan_status_int
422,1
192,0


From the counts above, about 69% of the loans are approved (422 of 614) and about 31% are not (192 of 614), so the dataset is imbalanced towards the approved class.
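The class shares can be checked with a few lines of arithmetic. The sketch below uses only the two counts reported by the query; the variable names are illustrative. The majority-class baseline it prints is the accuracy a model would reach by always predicting approval on the full dataset, which is a useful reference point for the accuracy figures reported later.

```python
# Class counts reported by the query above.
approved, not_approved = 422, 192
total = approved + not_approved

# Fraction of loans that were approved.
approved_share = approved / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(approved, not_approved) / total

print(f"approved share: {approved_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```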

In [0]:
# Data Splitting : 70-30 train test split

train_data,test_data=data.randomSplit([0.7,0.3])
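The split above is not seeded, so each run produces different train and test sets and therefore different metrics. `randomSplit` accepts an optional `seed` argument for this purpose. The plain-Python sketch below only illustrates the idea of a seeded random split; `seeded_split` is a hypothetical helper, not part of PySpark, and like `randomSplit` it yields approximately (not exactly) a 70-30 split.

```python
import random

def seeded_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test using a seeded generator (hypothetical helper)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row independently lands in train with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

print(len(train_a), len(test_a))   # roughly 700 and 300
print(train_a == train_b)          # the same seed reproduces the same split
```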

### Model 1: Logistic Regression Model

In [0]:
# Import the required libraries

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler,StringIndexer ,OneHotEncoder
from pyspark.ml import Pipeline

In [0]:
# Using StringIndexer to convert the categorical columns to hold numerical data

gender_indexer = StringIndexer(inputCol='Gender',outputCol='gender_by_index',handleInvalid='keep')
married_indexer = StringIndexer(inputCol='Married',outputCol='married_index',handleInvalid='keep')
education_indexer = StringIndexer(inputCol='Education',outputCol='education_index',handleInvalid='keep')
self_employed_indexer = StringIndexer(inputCol='Self_Employed',outputCol='self_employed_index',handleInvalid='keep')
property_area_indexer = StringIndexer(inputCol='Property_Area',outputCol='property_area_index',handleInvalid='keep')

In [0]:
# Use OneHotEncoder to turn the indexed columns into one-hot vectors that the Logistic Regression model can consume

data_encoder = OneHotEncoder(
    inputCols=['gender_by_index', 'married_index', 'education_index',
               'self_employed_index', 'property_area_index'],
    outputCols=['gender_by_vec', 'married_vec', 'education_vec',
                'self_employed_vec', 'property_area_vec'],
    handleInvalid='keep')
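To make the two encoding steps concrete, here is a simplified plain-Python sketch of what indexing and one-hot encoding do to a categorical column. It uses the `Property_Area` values from the ten rows displayed at the top of the notebook. The helper names are hypothetical, and the sketch deliberately ignores details of the Spark stages: Spark emits sparse vectors, drops the last category by default, and adds extra slots when `handleInvalid='keep'` is set, so the vector lengths in the notebook output differ from the ones here.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a dense list with 1.0 at position `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Property_Area values from the ten rows shown at the top of the notebook.
areas = ["Urban", "Rural", "Urban", "Urban", "Urban",
         "Urban", "Urban", "Semiurban", "Urban", "Semiurban"]

mapping = fit_string_index(areas)
encoded = [one_hot(mapping[a], len(mapping)) for a in areas]

print(mapping)
print(encoded[1])  # the one-hot vector for "Rural"
```

The index assigned to a label depends on the label frequencies in the data the indexer is fitted on, so the indices produced from the training split need not match the ones from this ten-row sample.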

In [0]:
# Use VectorAssembler to combine the encoded and numeric columns into a single feature vector
# (Dependents and Loan_Amount_Term are not included)

assembler = VectorAssembler(inputCols=['gender_by_vec','married_vec','education_vec','self_employed_vec','property_area_vec','ApplicantIncome','CoapplicantIncome','LoanAmount','Credit_History'],
                            outputCol="features")

In [0]:
# Create an object for the Logistic Regression model

lr_model = LogisticRegression(labelCol='loan_status_int')

In [0]:
# Use a Pipeline to run the indexers, encoder, assembler and model in sequence

pipe = Pipeline(stages=[gender_indexer,married_indexer,education_indexer,self_employed_indexer,
                        property_area_indexer, data_encoder,assembler,lr_model])

In [0]:
fit_model=pipe.fit(train_data)

In [0]:
# Store the results in a dataframe

predicted_results = fit_model.transform(test_data)

In [0]:
predicted_results.select(['loan_status_int','prediction']).show()

+---------------+----------+
|loan_status_int|prediction|
+---------------+----------+
|              1|       1.0|
|              0|       1.0|
|              0|       1.0|
|              1|       1.0|
|              1|       1.0|
|              0|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              0|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              0|       1.0|
|              0|       1.0|
+---------------+----------+
only showing top 20 rows



**Model Evaluation**

**_Area Under the Receiver Operating Characteristic (ROC) Curve_**

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [0]:
AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='loan_status_int',metricName='areaUnderROC')

In [0]:
AUC = AUC_evaluator.evaluate(predicted_results)

In [0]:
print("The Area Under The Curve is {}".format(AUC))

The Area Under The Curve is 0.7354639969195225


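The area under the ROC curve is about 0.735. Because the evaluator was given the hard 0/1 `prediction` column rather than the `rawPrediction` scores, this value describes a single operating point, (TPR + TNR) / 2, rather than the full curve. It is above the 0.5 of a random classifier, which indicates moderate discriminative ability.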
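The evaluator above reads the hard 0/1 `prediction` column, whereas `BinaryClassificationEvaluator` defaults to the `rawPrediction` scores. The plain-Python sketch below shows why the choice matters. It uses made-up labels and scores (not taken from the notebook) and a hypothetical `roc_auc` helper: computed on continuous scores, the AUC summarises the whole ROC curve, while computed on thresholded predictions it collapses to the average of the true-positive and true-negative rates.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data only: labels and model scores are assumptions.
labels = [1, 1, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.6, 0.55, 0.7, 0.4, 0.2, 0.45]

# Threshold the scores at 0.5 to obtain hard predictions.
hard_preds = [1.0 if s >= 0.5 else 0.0 for s in scores]

auc_scores = roc_auc(labels, scores)
auc_hard = roc_auc(labels, hard_preds)

# True-positive and true-negative rates of the hard predictions.
tpr = sum(p == 1.0 for y, p in zip(labels, hard_preds) if y == 1) / labels.count(1)
tnr = sum(p == 0.0 for y, p in zip(labels, hard_preds) if y == 0) / labels.count(0)

print("AUC from scores:", auc_scores)
print("AUC from hard predictions:", auc_hard)
print("(TPR + TNR) / 2:", (tpr + tnr) / 2)
```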

**_Area Under the Precision-Recall (PR) Curve_**

Next, we compute the area under the Precision-Recall (PR) curve, because on skewed datasets the PR curve gives a more informative picture of a classifier's performance than the ROC curve.

In [0]:
PR_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='loan_status_int',metricName='areaUnderPR')

In [0]:
PR = PR_evaluator.evaluate(predicted_results)

In [0]:
print("The Area Under the Precision-Recall Curve is {}".format(PR))

The Area Under the Precision-Recall Curve is 0.8050474882871661


An area under the PR curve of about 0.805, again computed from the hard `prediction` column, suggests the model performs moderately well at identifying approved loans.

**_Accuracy_**

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="loan_status_int", predictionCol="prediction", metricName="accuracy")

In [0]:
accuracy = ACC_evaluator.evaluate(predicted_results)

In [0]:
print("The Accuracy of the Logistic Regression model is {}".format(accuracy))

The Accuracy of the Logistic Regression model is 0.8258064516129032


The accuracy of the Logistic Regression model is about 82.6%. For comparison, always predicting approval would be correct for about 69% of the full dataset (422 of 614 rows), so the model improves on that baseline.
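Accuracy alone does not show which kind of error a model makes. As a rough illustration, the sketch below builds a confusion matrix from the 20 label and prediction pairs printed for the Logistic Regression model earlier. These are only the first 20 rows of the test set, so the figures describe that sample and not the overall accuracy reported above; the variable names are illustrative.

```python
# The 20 (label, prediction) pairs displayed for the Logistic Regression model.
labels = [1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
predictions = [1] * 20  # every displayed row was predicted as approved

pairs = list(zip(labels, predictions))
tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # approved, predicted approved
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # not approved, predicted approved
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # not approved, predicted not approved
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # approved, predicted not approved

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this sample every rejected application is misclassified as approved, which is the kind of behaviour a single accuracy number can hide on an imbalanced dataset.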

### Model 2: Decision Tree

In [0]:
# Import the required libraries

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler,StringIndexer
from pyspark.ml import Pipeline

In [0]:

# Create an object for the Decision Tree model
# Set maxBins to at least the number of categories in any single categorical feature

dt_model = DecisionTreeClassifier(labelCol='loan_status_int',maxBins=5000)

In [0]:

# The Pipeline runs the indexers, encoder, assembler and model in sequence. It also ensures the test data is pre-processed in the same way as the training data


pipedt = Pipeline(stages=[gender_indexer,married_indexer,education_indexer,self_employed_indexer,
                        property_area_indexer, data_encoder,assembler,dt_model])

In [0]:
fit_model_dt=pipedt.fit(train_data)

In [0]:
# Store the results in a dataframe

results_dt = fit_model_dt.transform(test_data)

In [0]:
results_dt.select(['loan_status_int','prediction']).show()

+---------------+----------+
|loan_status_int|prediction|
+---------------+----------+
|              1|       1.0|
|              1|       1.0|
|              0|       1.0|
|              0|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              0|       0.0|
|              0|       0.0|
|              1|       1.0|
|              0|       0.0|
|              1|       1.0|
|              1|       0.0|
|              0|       1.0|
|              1|       1.0|
|              0|       1.0|
|              0|       0.0|
+---------------+----------+
only showing top 20 rows



**Model Evaluation**

**_Accuracy_**

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="loan_status_int", predictionCol="prediction", metricName="accuracy")

In [0]:
accuracy_dt = ACC_evaluator.evaluate(results_dt)

In [0]:
print("The accuracy of the decision tree classifier is {}".format(accuracy_dt))

The accuracy of the decision tree classifier is 0.7419354838709677


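The accuracy of the decision tree model is about 74.2%, which is moderate: it is lower than the Logistic Regression model's 82.6% and only modestly above the roughly 69% share of approved loans in the full dataset.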

### Model 3: Linear Support Vector Classifier

In [0]:
# Import the required libraries

from pyspark.ml.classification import LinearSVC
from pyspark.ml.feature import VectorAssembler,StringIndexer,StandardScaler
from pyspark.ml import Pipeline

In [0]:
# Create an object for the Linear SVC model

svc_model1 = LinearSVC(labelCol='loan_status_int')

In [0]:
pipesvc = Pipeline(stages=[gender_indexer,married_indexer,education_indexer,self_employed_indexer,
                        property_area_indexer, data_encoder,assembler,svc_model1])

In [0]:
fit_model_svc=pipesvc.fit(train_data)

In [0]:
# Store the results in a dataframe

results_SVC = fit_model_svc.transform(test_data)
display(results_SVC)

Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,loan_status_int,gender_by_index,married_index,education_index,self_employed_index,property_area_index,gender_by_vec,married_vec,education_vec,self_employed_vec,property_area_vec,features,rawPrediction,prediction
Female,No,0,Graduate,No,645,3683.0,113,480,1,Rural,1,1.0,1.0,0.0,0.0,2.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 14, 16, 17, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 645.0, 3683.0, 113.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.000000064768571, 1.000000064768571))",1.0
Female,No,0,Graduate,No,1811,1666.0,54,360,1,Urban,1,1.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 13, 16, 17, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1811.0, 1666.0, 54.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0000001547696793, 1.0000001547696793))",1.0
Female,No,0,Graduate,No,2378,0.0,9,360,1,Urban,0,1.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 13, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 2378.0, 9.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0000002122897227, 1.0000002122897227))",1.0
Female,No,0,Graduate,No,2400,1863.0,104,360,0,Urban,0,1.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 13, 16, 17, 18), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 2400.0, 1863.0, 104.0))","Map(vectorType -> dense, length -> 2, values -> List(1.0000018028759048, -1.0000018028759048))",0.0
Female,No,0,Graduate,No,2500,0.0,67,360,1,Urban,1,1.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 13, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 2500.0, 67.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0000001326850239, 1.0000001326850239))",1.0
Female,No,0,Graduate,No,3086,0.0,120,360,1,Semiurban,1,1.0,1.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 12, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 3086.0, 120.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0000001114766373, 1.0000001114766373))",1.0
Female,No,0,Graduate,No,3762,1666.0,135,360,1,Rural,1,1.0,1.0,0.0,0.0,2.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 14, 16, 17, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 3762.0, 1666.0, 135.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.000000036360217, 1.000000036360217))",1.0
Female,No,0,Graduate,No,3846,0.0,111,360,1,Semiurban,1,1.0,1.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 12, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 3846.0, 111.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0000001262048421, 1.0000001262048421))",1.0
Female,No,0,Graduate,No,4160,0.0,71,360,1,Semiurban,1,1.0,1.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 12, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 4160.0, 71.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0000001823189613, 1.0000001823189613))",1.0
Female,No,0,Graduate,No,4166,0.0,44,360,1,Semiurban,1,1.0,1.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 12, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 4166.0, 44.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0000002195678408, 1.0000002195678408))",1.0


In [0]:
results_SVC.select(['loan_status_int','prediction']).show()

+---------------+----------+
|loan_status_int|prediction|
+---------------+----------+
|              1|       1.0|
|              1|       1.0|
|              0|       1.0|
|              0|       0.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              0|       0.0|
|              0|       0.0|
|              1|       1.0|
|              0|       0.0|
|              1|       1.0|
|              1|       1.0|
|              0|       1.0|
|              1|       1.0|
|              0|       1.0|
|              0|       1.0|
+---------------+----------+
only showing top 20 rows



**Model Evaluation**

**_Area Under the Receiver Operating Characteristic (ROC) Curve_**

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [0]:
AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='loan_status_int',metricName='areaUnderROC')

In [0]:
AUC = AUC_evaluator.evaluate(results_SVC)

In [0]:
print("The area under the curve is {}".format(AUC))

The area under the curve is 0.7252599152868695


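The area under the ROC curve is about 0.725. As with the Logistic Regression model, the evaluator reads the hard `prediction` column, so the value reflects a single operating point rather than the full curve; it indicates moderate discriminative ability.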

**_Area Under the Precision-Recall (PR) Curve_**

In [0]:
PR_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='loan_status_int',metricName='areaUnderPR')

In [0]:
PR = PR_evaluator.evaluate(results_SVC)

In [0]:
print("The area under the PR curve is {}".format(PR))

The area under the PR curve is 0.7989044430919051


The area under the PR curve is about 0.799, close to the Logistic Regression model's 0.805, which suggests the model performs moderately well at identifying approved loans.

**_Accuracy_**

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="loan_status_int", predictionCol="prediction", metricName="accuracy")

In [0]:
accuracy = ACC_evaluator.evaluate(results_SVC)

In [0]:
print("The accuracy of the model is {}".format(accuracy))

The accuracy of the model is 0.8193548387096774


The accuracy of the linear SVC model is about 81.9%, close to the Logistic Regression model's 82.6%, so the two models perform similarly on this test split.

### Model 4: Gradient Boosting Classifier

In [0]:
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import Normalizer

In [0]:
gbt1 = GBTClassifier(labelCol="loan_status_int", featuresCol="features", maxIter=10)

In [0]:
pipeGB = Pipeline(stages=[gender_indexer,married_indexer,education_indexer,self_employed_indexer,
                        property_area_indexer, data_encoder,assembler,gbt1])

In [0]:
modelGB = pipeGB.fit(train_data)


In [0]:
# Store the results in a dataframe

results_GB = modelGB.transform(test_data)
display(results_GB)

Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,loan_status_int,gender_by_index,married_index,education_index,self_employed_index,property_area_index,gender_by_vec,married_vec,education_vec,self_employed_vec,property_area_vec,features,rawPrediction,probability,prediction
Female,No,0,Graduate,No,645,3683.0,113,480,1,Rural,1,1.0,1.0,0.0,0.0,2.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 14, 16, 17, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 645.0, 3683.0, 113.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-0.7160039871217743, 0.7160039871217743))","Map(vectorType -> dense, length -> 2, values -> List(0.19278601137863371, 0.8072139886213663))",1.0
Female,No,0,Graduate,No,1811,1666.0,54,360,1,Urban,1,1.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 13, 16, 17, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1811.0, 1666.0, 54.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0055883423546499, 1.0055883423546499))","Map(vectorType -> dense, length -> 2, values -> List(0.11803442715666443, 0.8819655728433355))",1.0
Female,No,0,Graduate,No,2378,0.0,9,360,1,Urban,0,1.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 13, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 2378.0, 9.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(0.3029631865824023, -0.3029631865824023))","Map(vectorType -> dense, length -> 2, values -> List(0.6470109936593877, 0.35298900634061225))",0.0
Female,No,0,Graduate,No,2400,1863.0,104,360,0,Urban,0,1.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 13, 16, 17, 18), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 2400.0, 1863.0, 104.0))","Map(vectorType -> dense, length -> 2, values -> List(-0.21516666602792744, 0.21516666602792744))","Map(vectorType -> dense, length -> 2, values -> List(0.39404673776086757, 0.6059532622391324))",1.0
Female,No,0,Graduate,No,2500,0.0,67,360,1,Urban,1,1.0,1.0,0.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 13, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 2500.0, 67.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-0.7100930962386077, 0.7100930962386077))","Map(vectorType -> dense, length -> 2, values -> List(0.19463239614469113, 0.8053676038553088))",1.0
Female,No,0,Graduate,No,3086,0.0,120,360,1,Semiurban,1,1.0,1.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 12, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 3086.0, 120.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-0.8669361211013981, 0.8669361211013981))","Map(vectorType -> dense, length -> 2, values -> List(0.15009294738646112, 0.8499070526135388))",1.0
Female,No,0,Graduate,No,3762,1666.0,135,360,1,Rural,1,1.0,1.0,0.0,0.0,2.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 14, 16, 17, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 3762.0, 1666.0, 135.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-0.21906368107060561, 0.21906368107060561))","Map(vectorType -> dense, length -> 2, values -> List(0.39218727171771767, 0.6078127282822823))",1.0
Female,No,0,Graduate,No,3846,0.0,111,360,1,Semiurban,1,1.0,1.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 12, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 3846.0, 111.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-0.8032184415629815, 0.8032184415629815))","Map(vectorType -> dense, length -> 2, values -> List(0.16708389334949655, 0.8329161066505034))",1.0
Female,No,0,Graduate,No,4160,0.0,71,360,1,Semiurban,1,1.0,1.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 12, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 4160.0, 71.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-1.0460381510652694, 1.0460381510652694))","Map(vectorType -> dense, length -> 2, values -> List(0.10986935146326202, 0.890130648536738))",1.0
Female,No,0,Graduate,No,4166,0.0,44,360,1,Semiurban,1,1.0,1.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 20, indices -> List(1, 4, 6, 9, 12, 16, 18, 19), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 4166.0, 44.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-0.9941448904634109, 0.9941448904634109))","Map(vectorType -> dense, length -> 2, values -> List(0.12043791288824329, 0.8795620871117567))",1.0


In [0]:
results_GB.select(['loan_status_int','prediction']).show()

+---------------+----------+
|loan_status_int|prediction|
+---------------+----------+
|              1|       1.0|
|              1|       1.0|
|              0|       0.0|
|              0|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              1|       1.0|
|              0|       0.0|
|              0|       0.0|
|              1|       1.0|
|              0|       1.0|
|              1|       1.0|
|              1|       0.0|
|              0|       1.0|
|              1|       1.0|
|              0|       0.0|
|              0|       0.0|
+---------------+----------+
only showing top 20 rows



**Model Evaluation**

**_Accuracy_**

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="loan_status_int", predictionCol="prediction", metricName="accuracy")

In [0]:
accuracy_GB = ACC_evaluator.evaluate(results_GB)

In [0]:
print("The accuracy of the model is {}".format(accuracy_GB))

The accuracy of the model is 0.7354838709677419


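The accuracy of the model is approximately 73.5%, which is moderate. The model is therefore usable for loan prediction, although its accuracy is lower than the values reported below for Logistic Regression (82.5%) and Linear SVC (81.9%).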
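
As a quick illustration of what the accuracy metric measures, the sketch below recomputes it in plain Python for the 20 label/prediction pairs displayed above. This is only a sample of the test set, so the value differs from the 73.5% reported for the full results; the names `pairs` and `accuracy` are illustrative and not part of the notebook.

```python
# (loan_status_int, prediction) pairs copied from the 20 rows shown by
# results_GB.select(['loan_status_int','prediction']).show()
pairs = [
    (1, 1), (1, 1), (0, 0), (0, 1), (1, 1),
    (1, 1), (1, 1), (1, 1), (1, 1), (1, 1),
    (0, 0), (0, 0), (1, 1), (0, 1), (1, 1),
    (1, 0), (0, 1), (1, 1), (0, 0), (0, 0),
]


def accuracy(pairs):
    """Return the fraction of rows where the prediction equals the label."""
    correct = sum(1 for label, pred in pairs if label == pred)
    return correct / len(pairs)


sample_accuracy = accuracy(pairs)
print("Accuracy on the 20 displayed rows: {:.2f}".format(sample_accuracy))
```

On these 20 rows, 16 predictions match the label, so the sample accuracy is 0.80. The evaluator above applies the same ratio to every row of `results_GB`.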

The purpose of this project is to automate the loan approval process by analyzing customer details. We considered the following attributes: 'Gender', 'Married', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History', and 'Property_Area' to predict whether a loan should be approved (1) or not approved (0). This is a binary classification problem with loan status as the label. The analysis was done with four different classification models. The evaluation metrics for each model are as follows:

**Logistic Regression**- 
AUC-ROC : 73.5%, AUC-PR: 80.5%, Accuracy: 82.5% 

**Decision Tree**- 
Accuracy: 74% 

**Linear SVC**- 
AUC-ROC: 72.5%, AUC-PR: 79.8%, Accuracy: 81.9% 

**Gradient Boosting classifier**- 
Accuracy: 73.5%

Among the four models used to predict loan status, **Logistic Regression** performed best in terms of accuracy (82.5%). Its AUC-ROC of 73.5% means that the model ranks a randomly chosen approved applicant above a randomly chosen non-approved applicant about 73.5% of the time, which is acceptable for classifying the loan status of applicants. Because the classes are imbalanced, we also used the Precision-Recall curve, since AUC-PR focuses mainly on the positive class (loan_status = 1). The AUC-PR came out to be 80.5%, which summarizes how well the model balances precision and recall for approved loans across decision thresholds. Precision is the fraction of positive predictions that actually belong to the positive class; in this scenario, it is the share of applicants predicted as approved who were actually approved. Recall is the share of actually approved applicants that the model identifies as approved.
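
To make precision and recall concrete, here is a small plain-Python sketch that tallies a confusion matrix. It uses the 20 gradient-boosting predictions displayed earlier purely as sample data (they are not Logistic Regression outputs), and the helper name `confusion_counts` is hypothetical.

```python
from collections import Counter

# Labels and predictions copied from the 20 rows shown for results_GB.
labels = [1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
preds = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]


def confusion_counts(labels, preds):
    """Count true/false positives and negatives, treating 1 as the positive class."""
    counts = Counter(zip(labels, preds))
    return {
        "tp": counts[(1, 1)],  # approved, predicted approved
        "fp": counts[(0, 1)],  # not approved, predicted approved
        "fn": counts[(1, 0)],  # approved, predicted not approved
        "tn": counts[(0, 0)],  # not approved, predicted not approved
    }


c = confusion_counts(labels, preds)
precision = c["tp"] / (c["tp"] + c["fp"])
recall = c["tp"] / (c["tp"] + c["fn"])

print(c)
print("Precision: {:.3f}, Recall: {:.3f}".format(precision, recall))
```

For this sample, 11 of the 14 applicants predicted as approved were actually approved (precision of about 0.786), and 11 of the 12 actually approved applicants were identified (recall of about 0.917).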