## Classification - Random Forest

<strong> Classification - Random Forest </strong>
<ul style="list-style-type:square">
  <li>Features: age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome</li>
  <li>Target: Outcome [yes/no]</li>
  <li>Model: Random Forest</li>
  <li>Goal: Predict whether a bank client will subscribe to a term deposit.</li>
</ul>

<strong>Attribute Information:</strong>

<strong>Input variables:</strong>

<strong>Bank client data:</strong>

<li>1 - age (numeric) </li>
<li>2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')</li>
<li>3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)</li>
<li>4 - education (categorical:'primary','secondary','tertiary','unknown')</li>
<li>5 - default: has credit in default? (categorical: 'no','yes','unknown')</li>
<li>6 - balance: balance in the client's current account</li>
<li>7 - housing: has housing loan? (categorical: 'no','yes','unknown')</li>
<li>8 - loan: has personal loan? (categorical: 'no','yes','unknown')</li>

<strong>Related with the last contact of the current campaign:</strong>

<li>9 - contact: contact communication type (categorical: 'cellular','telephone','unknown')</li>
<li>10 - day: last contact day of the month (numeric)</li>
<li>11 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')</li>
<li>12 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.</li>

<strong>Other attributes:</strong>

<li>13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)</li>
<li>14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)</li>
<li>15 - previous: number of contacts performed before this campaign and for this client (numeric)</li>
<li>16 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','other','success','unknown')</li>

<strong>Output variable (desired target)</strong>
<li>17 - y: has the client subscribed to a term deposit? (binary: 'yes','no')</li>

Source : https://archive.ics.uci.edu/ml/datasets/bank+marketing

In [1]:
# Spark Session
from pyspark.sql import SparkSession

spSession = SparkSession.builder.master('local').appName("RandomForest").getOrCreate()
sc = spSession.sparkContext  # SparkContext used below to read the text file

In [2]:
# Libraries 
import math
from pyspark.sql import Row 
from pyspark.ml.linalg import Vectors # labeled Point
from pyspark.ml.feature import StringIndexer  # For numerical labeling
from pyspark.ml.feature import PCA  # Principal Component analysis - Dimensionality reduction
from pyspark.ml.classification import RandomForestClassifier # Model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator #Evaluate

In [3]:
# Import csv file and store it in cache
bankRDD = sc.textFile("bank.csv")
bankRDD.cache()

bank.csv MapPartitionsRDD[1] at textFile at <unknown>:0

In [4]:
bankRDD.take(5)

['"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"',
 '30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"',
 '33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"yes"',
 '35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"yes"',
 '30;"management";"married";"tertiary";"no";1476;"yes";"yes";"unknown";3;"jun";199;4;-1;0;"unknown";"yes"']

In [5]:
# Remove the header line (compare by equality; `in` on strings is a substring test)
header = bankRDD.first()
bankRDD2 = bankRDD.filter(lambda x: x != header)

In [6]:
bankRDD2.take(3)

['30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"',
 '33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"yes"',
 '35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"yes"']
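The header filter keeps every line that differs from the first line of the file. As a rough illustration of the same idea without Spark, here is a minimal sketch on a plain Python list. The helper name `drop_header` is hypothetical, and the sample lines are shortened copies of the rows shown above.

```python
def drop_header(lines):
    """Return all lines except those equal to the first line (the header)."""
    if not lines:
        return []
    header = lines[0]
    # Equality keeps the test exact; `in` between two strings would be a substring check.
    return [line for line in lines if line != header]


sample = [
    '"age";"job";"marital"',
    '30;"unemployed";"married"',
    '33;"services";"married"',
]
rows = drop_header(sample)
print(rows)
```

With an RDD the list comprehension becomes a `filter` call, but the comparison is the same.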

## Data Cleaning

In [7]:
# Transform string values into numeric values (labels)

def ConvertToNumeric(inputStr):
    # Remove the double quotation marks (\") and split on the semicolon separator
    # Escape sequence - https://www.techopedia.com/definition/822/escape-sequence-c
    attList = inputStr.replace("\"","").split(";")
    
    # Set numeric labels
    age = float(attList[0])
    
    single = 1.0 if attList[2] == "single" else 0.0
    married = 1.0 if attList[2] == "married" else 0.0
    divorced = 1.0 if attList[2] == "divorced" else 0.0
    
    primary = 1.0 if attList[3] == "primary" else 0.0
    secondary = 1.0 if attList[3] == "secondary" else 0.0
    tertiary = 1.0 if attList[3] == "tertiary" else 0.0
    
    default = 0.0 if attList[4] == "no" else 1.0
    
    balance = float(attList[5])
    
    loan = 0.0 if attList[7] == "no" else 1.0
    
    outcome = 0.0 if attList[16] == "no" else 1.0
    
    # Create rows 
    
    row = Row(OUTCOME = outcome, AGE = age, SINGLE = single, MARRIED = married , DIVORCED = divorced , 
             PRIMARY = primary, SECONDARY = secondary , TERTIARY = tertiary, DEFAULT = default , BALANCE = balance,
             LOAN = loan)  
    
    return row                
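
To make the encoding easier to follow, here is a minimal plain-Python sketch of the same parsing step applied to the first data row shown earlier. It returns a dictionary instead of a Spark `Row`, and `parse_line` is a hypothetical name used only for this illustration. A value such as 'unknown' leaves all indicators of its group at 0.0.

```python
def parse_line(line):
    """Split one semicolon-separated record and one-hot encode marital and education."""
    fields = line.replace('"', "").split(";")
    marital, education = fields[2], fields[3]
    return {
        "AGE": float(fields[0]),
        "SINGLE": float(marital == "single"),
        "MARRIED": float(marital == "married"),
        "DIVORCED": float(marital == "divorced"),
        "PRIMARY": float(education == "primary"),
        "SECONDARY": float(education == "secondary"),
        "TERTIARY": float(education == "tertiary"),
        "BALANCE": float(fields[5]),
        # Anything other than "no" is treated as a positive outcome.
        "OUTCOME": float(fields[16] != "no"),
    }


line = ('30;"unemployed";"married";"primary";"no";1787;"no";"no";'
        '"cellular";19;"oct";79;1;-1;0;"unknown";"no"')
record = parse_line(line)
print(record)
```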

In [8]:
bankRDD3 = bankRDD2.map(ConvertToNumeric)
bankRDD3.take(3)  # take() fetches only the first rows instead of collecting the whole RDD

[Row(AGE=30.0, BALANCE=1787.0, DEFAULT=0.0, DIVORCED=0.0, LOAN=0.0, MARRIED=1.0, OUTCOME=0.0, PRIMARY=1.0, SECONDARY=0.0, SINGLE=0.0, TERTIARY=0.0),
 Row(AGE=33.0, BALANCE=4789.0, DEFAULT=0.0, DIVORCED=0.0, LOAN=1.0, MARRIED=1.0, OUTCOME=1.0, PRIMARY=0.0, SECONDARY=1.0, SINGLE=0.0, TERTIARY=0.0),
 Row(AGE=35.0, BALANCE=1350.0, DEFAULT=0.0, DIVORCED=0.0, LOAN=0.0, MARRIED=0.0, OUTCOME=1.0, PRIMARY=0.0, SECONDARY=0.0, SINGLE=1.0, TERTIARY=1.0)]

## Data Exploration

In [9]:
# Create a dataframe and show some statistics
bankDF = spSession.createDataFrame(bankRDD3)


In [11]:
bankDF.show()

+----+-------+-------+--------+----+-------+-------+-------+---------+------+--------+
| AGE|BALANCE|DEFAULT|DIVORCED|LOAN|MARRIED|OUTCOME|PRIMARY|SECONDARY|SINGLE|TERTIARY|
+----+-------+-------+--------+----+-------+-------+-------+---------+------+--------+
|30.0| 1787.0|    0.0|     0.0| 0.0|    1.0|    0.0|    1.0|      0.0|   0.0|     0.0|
|33.0| 4789.0|    0.0|     0.0| 1.0|    1.0|    1.0|    0.0|      1.0|   0.0|     0.0|
|35.0| 1350.0|    0.0|     0.0| 0.0|    0.0|    1.0|    0.0|      0.0|   1.0|     1.0|
|30.0| 1476.0|    0.0|     0.0| 1.0|    1.0|    1.0|    0.0|      0.0|   0.0|     1.0|
|59.0|    0.0|    0.0|     0.0| 0.0|    1.0|    0.0|    0.0|      1.0|   0.0|     0.0|
|35.0|  747.0|    0.0|     0.0| 0.0|    0.0|    1.0|    0.0|      0.0|   1.0|     1.0|
|36.0|  307.0|    0.0|     0.0| 0.0|    1.0|    1.0|    0.0|      0.0|   0.0|     1.0|
|39.0|  147.0|    0.0|     0.0| 0.0|    1.0|    0.0|    0.0|      1.0|   0.0|     0.0|
|41.0|  221.0|    0.0|     0.0| 0.0|    1.0

In [12]:
# Correlation of every non-string column with the target
for col_name, col_type in bankDF.dtypes:
    if col_type != "string":
        print("Correlation between outcome and", col_name, bankDF.stat.corr('OUTCOME', col_name))

Correlation between outcome and AGE -0.1823210432736525
Correlation between outcome and BALANCE 0.03657486611997681
Correlation between outcome and DEFAULT -0.04536965206737378
Correlation between outcome and DIVORCED -0.07812659940926987
Correlation between outcome and LOAN -0.030420586112717318
Correlation between outcome and MARRIED -0.3753241299133561
Correlation between outcome and OUTCOME 1.0
Correlation between outcome and PRIMARY -0.12561548832677982
Correlation between outcome and SECONDARY 0.026392774894072973
Correlation between outcome and SINGLE 0.46323284934360515
Correlation between outcome and TERTIARY 0.08494840766635618
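
`DataFrame.stat.corr` computes the Pearson correlation coefficient. As a sketch of what that number means, the function below computes it in plain Python. The input lists are toy values chosen for illustration, not rows from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values: a perfectly increasing and a perfectly decreasing relationship.
up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(up, down)
```

Values near 0, such as the ones reported for BALANCE or LOAN, indicate almost no linear relationship with the outcome.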


## Data Preprocessing

In [17]:
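# Build (label, feature vector) pairs
# NOTE: row["OUTCOME"] is also placed in the feature vector, so the label leaks into the features.
# Drop it from the vector for a realistic model (the outputs below were produced with it included).
def transformVar(row):
    obj = (row["OUTCOME"], Vectors.dense([row["AGE"], row["BALANCE"], row["DEFAULT"], row["DIVORCED"], row["LOAN"],
                                          row["MARRIED"], row["OUTCOME"], row["PRIMARY"], row["SECONDARY"],
                                          row["TERTIARY"], row["SINGLE"]]))
    return obj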

In [22]:
# Convert dataframe into rdd to use the map function
bankRDD4 = bankDF.rdd.map(transformVar)
bankRDD4.take(5)

[(0.0,
  DenseVector([30.0, 1787.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])),
 (1.0,
  DenseVector([33.0, 4789.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0])),
 (1.0,
  DenseVector([35.0, 1350.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0])),
 (1.0,
  DenseVector([30.0, 1476.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])),
 (0.0, DenseVector([59.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]))]
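
The feature vectors printed above have 11 entries, and one of them is `OUTCOME` itself, so the label also appears among the features. One way to guard against this kind of leakage is to build the feature list from the column names and exclude the target explicitly. The sketch below does this on a plain dictionary holding the first row; `to_labeled_pair` and `TARGET` are hypothetical names, not part of the notebook.

```python
TARGET = "OUTCOME"


def to_labeled_pair(row, feature_names):
    """Return (label, feature list); refuse feature lists that contain the target."""
    if TARGET in feature_names:
        raise ValueError("target column must not be used as a feature")
    return row[TARGET], [row[name] for name in feature_names]


row = {"AGE": 30.0, "BALANCE": 1787.0, "DEFAULT": 0.0, "DIVORCED": 0.0,
       "LOAN": 0.0, "MARRIED": 1.0, "OUTCOME": 0.0, "PRIMARY": 1.0,
       "SECONDARY": 0.0, "SINGLE": 0.0, "TERTIARY": 0.0}

# Every column except the target becomes a feature.
feature_names = [name for name in row if name != TARGET]
label, features = to_labeled_pair(row, feature_names)
print(label, features)
```

With the target excluded, the vector has 10 entries instead of 11.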

In [26]:
# View the data as a DataFrame
bankDF = spSession.createDataFrame(bankRDD4, ["label", "features"])
bankDF.select("label", "features").show(10)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|[30.0,1787.0,0.0,...|
|  1.0|[33.0,4789.0,0.0,...|
|  1.0|[35.0,1350.0,0.0,...|
|  1.0|[30.0,1476.0,0.0,...|
|  0.0|[59.0,0.0,0.0,0.0...|
|  1.0|[35.0,747.0,0.0,0...|
|  1.0|[36.0,307.0,0.0,0...|
|  0.0|[39.0,147.0,0.0,0...|
|  0.0|[41.0,221.0,0.0,0...|
|  1.0|[43.0,-88.0,0.0,0...|
+-----+--------------------+
only showing top 10 rows



## Machine Learning

Let's use Principal Component Analysis (PCA) to reduce the 11-dimensional feature vector to 3 components.


In [27]:
# apply PCA
#Number of dimensions = 3, input = features and output = pcaFeatures
bankPCA = PCA(k = 3 , inputCol="features", outputCol="pcaFeatures")
pcaModel = bankPCA.fit(bankDF)
pcaResult = pcaModel.transform(bankDF).select("label", "pcaFeatures")
pcaResult.show(truncate = False)


+-----+------------------------------------------------------------+
|label|pcaFeatures                                                 |
+-----+------------------------------------------------------------+
|0.0  |[-1787.0188971485131,28.86106203527629,-0.07804878148604315]|
|1.0  |[-4789.020184400351,29.9127382479866,0.05513699055786786]   |
|1.0  |[-1350.022220519092,34.09087337817735,1.8836485302341996]   |
|1.0  |[-1476.0189590708485,29.041324405591645,0.6524643962591163] |
|0.0  |[-0.037889185293237065,58.9873520580651,0.23505788947191042]|
|1.0  |[-747.0223451357992,34.47800179608814,1.897358503632201]    |
|1.0  |[-307.0230764865881,35.78949185789179,0.8368850167598223]   |
|0.0  |[-147.02501215769811,38.89953006341547,-0.14905606782256242]|
|0.0  |[-221.02629852878886,40.85201049112231,0.46645650809549283] |
|1.0  |[87.97237948177352,43.051985386677245,0.636049715179434]    |
|0.0  |[-9374.023105294744,32.97575906496965,-0.35884367071370504] |
|0.0  |[-264.0275573080166,42.8231
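
In the output above, the first component is almost exactly the negated BALANCE and the second is close to AGE. PCA follows the directions of largest variance, so unscaled columns with large ranges dominate the result. A common remedy is to standardize each column first; in PySpark this is usually done with `pyspark.ml.feature.StandardScaler` before the PCA stage. The plain-Python sketch below only illustrates the arithmetic on the AGE and BALANCE values of the first five rows. It centers the data and uses the population standard deviation, which is an assumption of this sketch and may differ from a library's defaults.

```python
import statistics


def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


ages = [30.0, 33.0, 35.0, 30.0, 59.0]
balances = [1787.0, 4789.0, 1350.0, 1476.0, 0.0]

scaled_age = standardize(ages)
scaled_balance = standardize(balances)
print(scaled_age)
print(scaled_balance)
```

After scaling, both columns have comparable spread, so neither dominates the principal components simply because of its units.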

In [28]:
# Index the label column (the random forest is trained on "Indexed")
string_Indexer = StringIndexer(inputCol= "label", outputCol= "Indexed")
si_model = string_Indexer.fit(pcaResult)
final_obj = si_model.transform(pcaResult)
final_obj.take(5)


[Row(label=0.0, pcaFeatures=DenseVector([-1787.0189, 28.8611, -0.078]), Indexed=0.0),
 Row(label=1.0, pcaFeatures=DenseVector([-4789.0202, 29.9127, 0.0551]), Indexed=1.0),
 Row(label=1.0, pcaFeatures=DenseVector([-1350.0222, 34.0909, 1.8836]), Indexed=1.0),
 Row(label=1.0, pcaFeatures=DenseVector([-1476.019, 29.0413, 0.6525]), Indexed=1.0),
 Row(label=0.0, pcaFeatures=DenseVector([-0.0379, 58.9874, 0.2351]), Indexed=0.0)]

In [29]:
# Split test and training dataset
(training_set, test_set) = final_obj.randomSplit([0.7, 0.3])

In [30]:
training_set.count()

385

In [31]:
test_set.count()

156
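
The two counts add up to 541 rows, split roughly 71% / 29% rather than exactly 70% / 30%. That is expected: `randomSplit` assigns rows at random, so the weights are only met approximately, and without a `seed` argument the split changes between runs. The plain-Python sketch below imitates this behavior under the assumption that each row is assigned independently; `random_split` is a hypothetical helper and the rows are stand-in integers.

```python
import random


def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test independently; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(541))  # stand-in for the 541 rows of the DataFrame
train, test = random_split(rows)
print(len(train), len(test))
```

In PySpark the equivalent of fixing the seed is passing it as the second argument, for example `final_obj.randomSplit([0.7, 0.3], seed=42)`.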

In [32]:
# Model
rfClassifier = RandomForestClassifier(labelCol= "Indexed", featuresCol= "pcaFeatures")
model = rfClassifier.fit(training_set)

In [34]:
# Test the model
predictions = model.transform(test_set)
predictions.select("prediction", "Indexed", "label", "pcaFeatures").take(5)

[Row(prediction=0.0, Indexed=0.0, label=0.0, pcaFeatures=DenseVector([-8104.0336, 49.7859, -0.0254])),
 Row(prediction=0.0, Indexed=0.0, label=0.0, pcaFeatures=DenseVector([-7190.0255, 37.3723, 1.0348])),
 Row(prediction=1.0, Indexed=0.0, label=0.0, pcaFeatures=DenseVector([-3096.0186, 27.9799, 0.7021])),
 Row(prediction=0.0, Indexed=0.0, label=0.0, pcaFeatures=DenseVector([-2843.0225, 34.1644, 0.4403])),
 Row(prediction=0.0, Indexed=0.0, label=0.0, pcaFeatures=DenseVector([-2693.02, 30.2673, -0.3402]))]

In [37]:
# Model Accuracy
Acc = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol= "Indexed", metricName="accuracy")
Acc.evaluate(predictions)

0.7692307692307693

In [38]:
# Confusion matrix: number of test rows for each (actual, predicted) pair
predictions.groupBy("Indexed", "prediction").count().show()

+-------+----------+-----+
|Indexed|prediction|count|
+-------+----------+-----+
|    1.0|       1.0|   45|
|    0.0|       1.0|   16|
|    1.0|       0.0|   20|
|    0.0|       0.0|   75|
+-------+----------+-----+
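
The counts in this table are enough to recompute the accuracy and to derive precision and recall for the positive class. The sketch below does the arithmetic in plain Python using the four counts shown above; treating class 1.0 as the positive class is an assumption of this example.

```python
# (Indexed, prediction) -> count, copied from the table above
counts = {
    (1.0, 1.0): 45,  # true positives
    (0.0, 1.0): 16,  # false positives
    (1.0, 0.0): 20,  # false negatives
    (0.0, 0.0): 75,  # true negatives
}

tp = counts[(1.0, 1.0)]
fp = counts[(0.0, 1.0)]
fn = counts[(1.0, 0.0)]
tn = counts[(0.0, 0.0)]
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(total, accuracy, precision, recall, f1)
```

The accuracy computed this way matches the value reported by the evaluator, and the recall shows that 20 of the 65 positive test rows were missed.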

