## Building a Classification Model
The dataset contains information about customers who have applied for new
bank loans and whether they will default. We will build a binary
classification model that predicts default, so that its output can inform
whether a particular customer should be granted a loan. The following core
steps are used to build a classification model:
1. Load the dataset.
2. Perform exploratory data analysis.
3. Perform required data transformations.
4. Split data into train and test subsets.
5. Train and evaluate the baseline model on train data.
6. Perform hyperparameter tuning.
7. Build a final model with the best parameters.

In [4]:
!pip install pyspark



In [5]:
#import SparkSession
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('binary_class').getOrCreate()

## Step 1: Load the Dataset

In [6]:
#read the dataset
df=spark.read.csv('classification_data.csv',inferSchema=True,header=True)

## Step 2: Explore the DataFrame

In [7]:
#check the shape of the data
print((df.count(),len(df.columns)))

(46751, 12)


In [8]:
#printSchema
df.printSchema()

root
 |-- loan_id: string (nullable = true)
 |-- loan_purpose: string (nullable = true)
 |-- is_first_loan: integer (nullable = true)
 |-- total_credit_card_limit: integer (nullable = true)
 |-- avg_percentage_credit_card_limit_used_last_year: double (nullable = true)
 |-- saving_amount: integer (nullable = true)
 |-- checking_amount: integer (nullable = true)
 |-- is_employed: integer (nullable = true)
 |-- yearly_salary: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- dependent_number: integer (nullable = true)
 |-- label: integer (nullable = true)



In [9]:
#list the column names in the dataset
df.columns

['loan_id',
 'loan_purpose',
 'is_first_loan',
 'total_credit_card_limit',
 'avg_percentage_credit_card_limit_used_last_year',
 'saving_amount',
 'checking_amount',
 'is_employed',
 'yearly_salary',
 'age',
 'dependent_number',
 'label']

In [10]:
#view the dataset
df.show(5)

+-------+------------+-------------+-----------------------+-----------------------------------------------+-------------+---------------+-----------+-------------+---+----------------+-----+
|loan_id|loan_purpose|is_first_loan|total_credit_card_limit|avg_percentage_credit_card_limit_used_last_year|saving_amount|checking_amount|is_employed|yearly_salary|age|dependent_number|label|
+-------+------------+-------------+-----------------------+-----------------------------------------------+-------------+---------------+-----------+-------------+---+----------------+-----+
|    A_1|    personal|            1|                   7900|                                            0.8|         1103|           6393|          1|        16400| 42|               4|    0|
|    A_2|    personal|            0|                   3300|                                           0.29|         2588|            832|          1|        75500| 56|               1|    0|
|    A_3|    personal|            0|    

In [11]:
#summary statistics for every column
df.describe().show()

+-------+-------+------------+------------------+-----------------------+-----------------------------------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+
|summary|loan_id|loan_purpose|     is_first_loan|total_credit_card_limit|avg_percentage_credit_card_limit_used_last_year|     saving_amount|   checking_amount|       is_employed|     yearly_salary|               age|  dependent_number|              label|
+-------+-------+------------+------------------+-----------------------+-----------------------------------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+
|  count|  46751|       46751|             46751|                  46751|                                          46751|             46751|             46751|             46751|             46751|             46751|             467

In [12]:
df.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|16201|
|    0|30550|
+-----+-----+
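
The label counts show a moderately imbalanced target. As a quick check, the arithmetic below (plain Python, using only the two counts printed above) gives the share of each class and the accuracy a model would reach by always predicting the majority class. Any model trained later should beat that figure.

```python
# Counts copied from the groupBy('label') output above.
label_counts = {1: 16201, 0: 30550}

total = sum(label_counts.values())
class_share = {label: n / total for label, n in label_counts.items()}

# Accuracy of a trivial model that always predicts the most common label.
majority_baseline = max(label_counts.values()) / total

print(total)
print({label: round(share, 4) for label, share in class_share.items()})
print(round(majority_baseline, 4))
```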



In [13]:
df.groupBy('loan_purpose').count().show()

+------------+-----+
|loan_purpose|count|
+------------+-----+
|      others| 6763|
|   emergency| 7562|
|    property|11388|
|  operations|10580|
|    personal|10458|
+------------+-----+



## Step 3: Data Transformation

In [14]:
#import required libraries
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

In [15]:
loan_purpose_indexer = StringIndexer(inputCol="loan_purpose", outputCol="loan_index").fit(df)
df = loan_purpose_indexer.transform(df)
loan_encoder = OneHotEncoder(inputCol="loan_index", outputCol="loan_purpose_vec")
ohe = loan_encoder.fit(df)
df = ohe.transform(df)

In [16]:
df.select(['loan_purpose','loan_index','loan_purpose_vec']).show(3,False)

+------------+----------+----------------+
|loan_purpose|loan_index|loan_purpose_vec|
+------------+----------+----------------+
|personal    |2.0       |(4,[2],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
|personal    |2.0       |(4,[2],[1.0])   |
+------------+----------+----------------+
only showing top 3 rows
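
The output above follows from two defaults that are easy to miss: `StringIndexer` assigns indices by descending label frequency, and `OneHotEncoder` drops the last category, so five loan purposes become a vector of length 4. The plain-Python sketch below mimics that behavior using the counts from Step 2. It illustrates the defaults rather than Spark's implementation, and `encode` is a hypothetical helper.

```python
# Counts copied from the groupBy('loan_purpose') output in Step 2.
purpose_counts = {
    "others": 6763,
    "emergency": 7562,
    "property": 11388,
    "operations": 10580,
    "personal": 10458,
}

# The most frequent category gets index 0 (StringIndexer's default ordering).
ordered = sorted(purpose_counts, key=purpose_counts.get, reverse=True)
index_of = {purpose: i for i, purpose in enumerate(ordered)}

def encode(purpose):
    """Dense one-hot vector; the last category is dropped (all zeros)."""
    vec = [0.0] * (len(ordered) - 1)
    if index_of[purpose] < len(vec):
        vec[index_of[purpose]] = 1.0
    return vec

print(index_of["personal"], encode("personal"))
print(index_of["others"], encode("others"))
```

Here `personal` maps to index 2, matching `(4,[2],[1.0])`, and the least frequent category, `others`, is represented by the all-zero vector.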



In [18]:
df_assembler = VectorAssembler(inputCols=['is_first_loan',
 'total_credit_card_limit',
 'avg_percentage_credit_card_limit_used_last_year',
 'saving_amount',
 'checking_amount',
 'is_employed',
 'yearly_salary',
 'age',
 'dependent_number',
 'loan_purpose_vec'], outputCol="features")
df = df_assembler.transform(df)

In [19]:
df.printSchema()

root
 |-- loan_id: string (nullable = true)
 |-- loan_purpose: string (nullable = true)
 |-- is_first_loan: integer (nullable = true)
 |-- total_credit_card_limit: integer (nullable = true)
 |-- avg_percentage_credit_card_limit_used_last_year: double (nullable = true)
 |-- saving_amount: integer (nullable = true)
 |-- checking_amount: integer (nullable = true)
 |-- is_employed: integer (nullable = true)
 |-- yearly_salary: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- dependent_number: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- loan_index: double (nullable = false)
 |-- loan_purpose_vec: vector (nullable = true)
 |-- features: vector (nullable = true)



In [20]:
df.select(['features','label']).show(10,False)

+--------------------------------------------------------------------+-----+
|features                                                            |label|
+--------------------------------------------------------------------+-----+
|[1.0,7900.0,0.8,1103.0,6393.0,1.0,16400.0,42.0,4.0,0.0,0.0,1.0,0.0] |0    |
|[0.0,3300.0,0.29,2588.0,832.0,1.0,75500.0,56.0,1.0,0.0,0.0,1.0,0.0] |0    |
|[0.0,7600.0,0.9,1651.0,8868.0,1.0,59000.0,46.0,1.0,0.0,0.0,1.0,0.0] |0    |
|[1.0,3400.0,0.38,1269.0,6863.0,1.0,26000.0,55.0,8.0,0.0,0.0,1.0,0.0]|0    |
|[0.0,2600.0,0.89,1310.0,3423.0,1.0,9700.0,41.0,4.0,0.0,0.0,0.0,1.0] |1    |
|[0.0,7600.0,0.51,1040.0,2406.0,1.0,22900.0,52.0,0.0,0.0,1.0,0.0,0.0]|0    |
|[1.0,6900.0,0.82,2408.0,5556.0,1.0,34800.0,48.0,4.0,0.0,1.0,0.0,0.0]|0    |
|[0.0,5700.0,0.56,1933.0,4139.0,1.0,32500.0,64.0,2.0,0.0,0.0,1.0,0.0]|0    |
|[1.0,3400.0,0.95,3866.0,4131.0,1.0,13300.0,23.0,3.0,0.0,0.0,1.0,0.0]|0    |
|[0.0,2900.0,0.91,88.0,2725.0,1.0,21100.0,52.0,1.0,0.0,0.0,1.0,0.0]  |1    |
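
Each `features` vector has 13 slots: the nine numeric columns in the order passed to `VectorAssembler`, followed by the four one-hot slots of `loan_purpose_vec`. The sketch below labels the first row shown above. The names of the one-hot slots are an assumption based on the default frequency ordering of `StringIndexer` and the counts from Step 2.

```python
numeric_cols = [
    "is_first_loan", "total_credit_card_limit",
    "avg_percentage_credit_card_limit_used_last_year",
    "saving_amount", "checking_amount", "is_employed",
    "yearly_salary", "age", "dependent_number",
]
# Assumed order: descending frequency, with the last category ('others') dropped.
one_hot_slots = ["loan_purpose=property", "loan_purpose=operations",
                 "loan_purpose=personal", "loan_purpose=emergency"]
slot_names = numeric_cols + one_hot_slots

# First row of the output above.
first_row = [1.0, 7900.0, 0.8, 1103.0, 6393.0, 1.0, 16400.0, 42.0, 4.0,
             0.0, 0.0, 1.0, 0.0]

named = dict(zip(slot_names, first_row))
for name, value in named.items():
    print(f"{name:50s} {value}")
```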

In [21]:
#select data for building model
model_df=df.select(['features','label'])

## Step 4: Splitting into Train and Test Data

In [22]:
#split the data 75/25; no seed is passed, so the counts below change from run to run
training_df,test_df=model_df.randomSplit([0.75,0.25])

In [23]:
training_df.count()

35220

In [24]:
training_df.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|12210|
|    0|23010|
+-----+-----+



In [25]:
test_df.count()

11531

In [26]:
test_df.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1| 3991|
|    0| 7540|
+-----+-----+
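
`randomSplit` samples rows at random, so the split is only approximately 75/25 and is not stratified. The counts above can be used to check both the realized split ratio and whether the share of positive labels is similar in the two subsets.

```python
# Counts copied from the outputs above.
train = {1: 12210, 0: 23010}
test = {1: 3991, 0: 7540}

train_total = sum(train.values())
test_total = sum(test.values())

# Fraction of all rows that landed in the training subset.
train_fraction = train_total / (train_total + test_total)

# Share of label 1 in each subset.
train_pos_share = train[1] / train_total
test_pos_share = test[1] / test_total

print(round(train_fraction, 4))
print(round(train_pos_share, 4), round(test_pos_share, 4))
```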



## Step 5: Model Training

In [27]:
from pyspark.ml.classification import LogisticRegression

In [28]:
log_reg=LogisticRegression().fit(training_df)

In [29]:
#Training Results
lr_summary=log_reg.summary

In [30]:
lr_summary.accuracy

0.8930437251561613

In [31]:
lr_summary.areaUnderROC

0.9587531112954859

In [32]:
print(lr_summary.precisionByLabel)

[0.9229788543544204, 0.8384510542772389]


In [33]:
print(lr_summary.recallByLabel)

[0.9124293785310734, 0.8565110565110565]


In [34]:
predictions = log_reg.transform(test_df)
predictions.show(10)

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(13,[0,1,2,3,4,7]...|    1|[-1.8687430454116...|[0.13368722965516...|       1.0|
|(13,[0,1,2,3,4,7]...|    1|[-1.5788895027286...|[0.17095281301686...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-4.5068263916770...|[0.01091301257048...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-4.5738339150110...|[0.01021294374383...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-5.0093080550314...|[0.00663125372936...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-6.1284369026440...|[0.00217524382831...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-6.6157066449128...|[0.00133737717234...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-5.7699977740943...|[0.00311006179819...|       1.0|
|(13,[0,1,2,3,4,7,...|    1|[-5.1895860057044...|[0.00554341334450...|       1.0|
|(13,[0,1,2,3,4,
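
For a binary logistic regression, the first entry of `probability` is the logistic (sigmoid) function of the first entry of `rawPrediction`, and `prediction` is 1.0 when the probability of class 1 exceeds the default threshold of 0.5. The sketch below checks this for the first two rows. The inputs are the truncated values as displayed, so agreement is only expected to the digits shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (rawPrediction[0], probability[0]) as displayed (truncated) above.
rows = [
    (-1.8687430454116, 0.13368722965516),
    (-1.5788895027286, 0.17095281301686),
]

for raw0, shown_p0 in rows:
    p0 = sigmoid(raw0)   # probability of class 0
    p1 = 1.0 - p0        # probability of class 1
    prediction = 1.0 if p1 > 0.5 else 0.0
    print(round(p0, 6), round(shown_p0, 6), prediction)
```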

In [35]:
lr_predictions = log_reg.transform(test_df)


In [36]:
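#evaluate() returns a summary object with test-set metrics; it overwrites the predictions DataFrame assigned in the previous cell
lr_predictions = log_reg.evaluate(test_df)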

In [37]:
lr_predictions.accuracy

0.8968866533691787

In [38]:
lr_predictions.weightedPrecision

0.8974765531093142
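
`weightedPrecision` is the per-label precision averaged with weights equal to each label's share of the true labels. The arithmetic below reproduces the value above from the test label counts in Step 4 and the `precisionByLabel` output of this evaluation.

```python
# Test-set label counts from Step 4.
label_counts = {0: 7540, 1: 3991}
# precisionByLabel from log_reg.evaluate(test_df), indexed by label.
precision_by_label = {0: 0.9258414912163069, 1: 0.8438880706921944}

total = sum(label_counts.values())
weighted_precision = sum(
    precision_by_label[label] * count / total
    for label, count in label_counts.items()
)
print(round(weighted_precision, 6))
```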

In [39]:
lr_predictions.recallByLabel

[0.9156498673740053, 0.86143823603107]

In [40]:
print(lr_predictions.precisionByLabel)

[0.9258414912163069, 0.8438880706921944]


In [41]:
lr_predictions.areaUnderROC

0.9594828084675924
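
The per-label metrics can be tied back to a confusion matrix. The sketch below reconstructs the counts implied by the test label counts from Step 4 and the reported recall and precision for label 1. The rounded counts are an inference from those outputs, not something the notebook prints.

```python
actual_pos, actual_neg = 3991, 7540   # test label counts (Step 4)
recall_pos = 0.86143823603107         # recallByLabel[1]
precision_pos = 0.8438880706921944    # precisionByLabel[1]

tp = round(recall_pos * actual_pos)   # true positives
pred_pos = round(tp / precision_pos)  # rows predicted as 1
fp = pred_pos - tp                    # false positives
fn = actual_pos - tp                  # false negatives
tn = actual_neg - fp                  # true negatives

accuracy = (tp + tn) / (actual_pos + actual_neg)
f1_pos = 2 * tp / (2 * tp + fp + fn)

print(tp, fp, fn, tn)
print(round(accuracy, 6), round(f1_pos, 4))
```

The accuracy and the label-0 recall recomputed from these counts agree with the values reported above, which supports the reconstruction.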

## Step 6: Hyperparameter Tuning

In [42]:
from pyspark.ml.classification import RandomForestClassifier
#baseline random forest with default parameters, before any tuning
rf = RandomForestClassifier()
rf_model = rf.fit(training_df)


In [43]:
model_predictions = rf_model.transform(test_df)


In [44]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()

rf = RandomForestClassifier()
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [5,10,20,25,30])
             .addGrid(rf.maxBins, [20,30,40 ])
             .addGrid(rf.numTrees, [5, 20,50])
             .build())
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
cv_model = cv.fit(training_df)
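
Cross-validated grid search is expensive because every parameter combination is fitted once per fold. The sketch below enumerates the grid defined above with `itertools.product` to count the model fits. It only mirrors the grid values and does not call Spark.

```python
from itertools import product

# Values copied from the ParamGridBuilder call above.
grid_values = {
    "maxDepth": [5, 10, 20, 25, 30],
    "maxBins": [20, 30, 40],
    "numTrees": [5, 20, 50],
}
num_folds = 5

combinations = [dict(zip(grid_values, combo))
                for combo in product(*grid_values.values())]

# One fit per combination per fold, plus a final refit of the best
# combination on the full training set.
total_fits = len(combinations) * num_folds + 1

print(len(combinations), total_fits)
```

With 45 combinations and 5 folds, this grid requires 226 random forest fits, several of them with deep trees, so trimming the grid is a reasonable first step if the cell runs too long.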

## Step 7: Best Model

In [45]:
best_rf_model = cv_model.bestModel

In [46]:
# Generate predictions on the test set
brf_predictions = best_rf_model.transform(test_df)

In [47]:
#count true positives, actual positives, and predicted positives
true_pos=brf_predictions.filter(brf_predictions['label']==1).filter(brf_predictions['prediction']==1).count()
actual_pos=brf_predictions.filter(brf_predictions['label']==1).count()
pred_pos=brf_predictions.filter(brf_predictions['prediction']==1).count()

In [48]:
#Recall on test data
float(true_pos)/(actual_pos)

0.9087947882736156

In [49]:
#Precision on test data
float(true_pos)/(pred_pos)

0.8548197030403016
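
Precision and recall can be combined into an F1 score. The helper below is a small sketch (the function name is hypothetical) that takes the three counts computed above. The example counts are inferred from the reported ratios and the 3991 positive test rows from Step 4, so treat them as an assumption rather than printed output.

```python
def precision_recall_f1(true_pos, actual_pos, pred_pos):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Assumed counts, inferred from the ratios reported above.
precision, recall, f1 = precision_recall_f1(true_pos=3627,
                                            actual_pos=3991,
                                            pred_pos=4243)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```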