## Nick Curci
### Homework

#### Due Date: Thursday 4/16 at the start of the class.

- Please try to improve the areaUnderROC value for this deposit prediction model. Document the changes you made, the best model/algorithm you found, and the new areaUnderROC value.

- Please export the completed notebook as HTML, zip it, and submit it on Blackboard by the due date.

## Predict whether a bank customer will subscribe to a term deposit
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (Yes/No) to a term deposit. The dataset can be downloaded from [Kaggle](http://www.kaggle.com/rouseguy/bankbalanced).

[Attribute Information](http://archive.ics.uci.edu/ml/datasets/bank+marketing)

Input variables:
##### bank client data:
1. - age (numeric)
2. - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. - default: has credit in default? (categorical: 'no','yes','unknown')
6. - housing: has housing loan? (categorical: 'no','yes','unknown')
7. - loan: has personal loan? (categorical: 'no','yes','unknown')
##### related to the last contact of the current campaign:
8. - contact: contact communication type (categorical: 'cellular','telephone')
9. - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
##### other attributes:
12. - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. - previous: number of contacts performed before this campaign and for this client (numeric)
15. - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
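
The note on `duration` above says it should be discarded for a realistic predictive model. A minimal sketch of how that could be expressed, assuming the R-style formula syntax accepted by `RFormula` (which supports `-` to remove a term). `build_formula` is a hypothetical helper introduced for illustration, not part of Spark:

```python
def build_formula(label, exclude=()):
    """Build an R-style formula that uses every column except those in `exclude`."""
    formula = f"{label} ~ ."
    for column in exclude:
        # " - column" removes that column from the set of predictors
        formula += f" - {column}"
    return formula


realistic_formula = build_formula("deposit", exclude=["duration"])
print(realistic_formula)  # deposit ~ . - duration

# Assumed usage inside the notebook (requires a Spark session):
# from pyspark.ml.feature import RFormula
# bank_rf = RFormula(formula=realistic_formula)
```

Dropping `duration` will likely lower areaUnderROC, since the note describes it as highly predictive, but the resulting score better reflects what is known before a call is made.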

In [0]:
%fs ls /mnt/classdata/bank

path,name,size
dbfs:/mnt/classdata/bank/bank.csv,bank.csv,918960


In [0]:
#read bank data
bank = spark.read.csv('/mnt/classdata/bank/bank.csv', header = True, inferSchema = True)
bank.printSchema()

In [0]:
display(bank)

age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes
42,management,single,tertiary,no,0,yes,yes,unknown,5,may,562,2,-1,0,unknown,yes
56,management,married,tertiary,no,830,yes,yes,unknown,6,may,1201,1,-1,0,unknown,yes
60,retired,divorced,secondary,no,545,yes,no,unknown,6,may,1030,1,-1,0,unknown,yes
37,technician,married,secondary,no,1,yes,no,unknown,6,may,608,1,-1,0,unknown,yes
28,services,single,secondary,no,5090,yes,no,unknown,6,may,1297,3,-1,0,unknown,yes


In [0]:
bank.describe('age', 'balance','duration', 'pdays').show()

### Feature Engineering with Transformers

[RFormula](https://spark.apache.org/docs/latest/ml-features#rformula)

In [0]:
from pyspark.ml.feature import RFormula

bank_rf = RFormula(formula="deposit ~ .")

#### Prepare (train and transform) our data frame

In [0]:
fittedRF = bank_rf.fit(bank)

preparedDF = fittedRF.transform(bank)

preparedDF.select('deposit', 'label', 'features').show(10, False)


Split our data into training and test sets

# FIRST CHANGE, CHANGED THE SPLITS TO 90-10

In [0]:
(train, test) = preparedDF.randomSplit([0.9, 0.1], seed=100)


In [0]:
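# Print both counts; a notebook cell only displays the value of its last expression
print("train rows:", train.count())
print("test rows:", test.count())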
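
A seeded random split assigns each row to a part by an independent random draw, so the resulting sizes are only approximately 90/10, while the same seed reproduces the same assignment. The following plain-Python sketch illustrates that idea; it is not Spark's implementation, and `random_split` is a hypothetical function:

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one part by a seeded random draw (sizes are approximate)."""
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.9, 1.0]
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches any floating-point rounding gap
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


train_rows, test_rows = random_split(range(1000), [0.9, 0.1], seed=100)
print(len(train_rows), len(test_rows))
```

With a 90/10 split the test set is small, so metrics computed on it vary more from seed to seed than they would with a larger held-out share.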

#### Instantiate LogisticRegression and set the label and feature columns

In [0]:
from pyspark.ml.classification import LogisticRegression
# Equivalent to LogisticRegression(labelCol="label", featuresCol="features"); these are the defaults
lr = LogisticRegression()

### Train the logistic regression model on the training dataset

In [0]:
lrModel = lr.fit(train)

### Use the model to make predictions

In [0]:
lrPrediction=lrModel.transform(test)
lrPrediction.select("label", "prediction").show(10, False)

Calculate accuracy

In [0]:
print("prediction accuracy is: ", lrPrediction.where("prediction==label").count()/lrPrediction.count())

tp=lrPrediction.where("label=1 and prediction=1").count()
fp=lrPrediction.where("label=0 and prediction=1").count()
tn=lrPrediction.where("label=0 and prediction=0").count()
fn=lrPrediction.where("label=1 and prediction=0").count()

print("true positive is: ", tp)

print("false positive is: ", fp)

print("true negative is: ", tn)

print("false negative is ", fn)

print("precision is ", tp/(tp+fp)) 

print("recall is ", tp/(tp+fn))


In [0]:
featureIndex = preparedDF.schema["features"].metadata["ml_attr"]["attrs"]

# Each attribute's 'idx' is its position in the feature vector, so use it
# (not the position within the numeric/binary list) to look up the coefficient.
# Print numeric features
for attr in featureIndex.get("numeric", []):
  print("feature", attr['idx'], " ", attr['name'], ': ', lrModel.coefficients[attr['idx']])

# Print binary features
for attr in featureIndex.get("binary", []):
  print("feature", attr['idx'], " ", attr['name'], ': ', lrModel.coefficients[attr['idx']])
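
The attribute metadata groups features by type, and each entry carries the feature's position in the assembled vector as `idx`. A small sketch with made-up metadata and coefficients (all names and values below are hypothetical) shows how pairing by `idx` lines up each name with its coefficient:

```python
# Hypothetical metadata in the shape found under
# schema["features"].metadata["ml_attr"]["attrs"]
attrs = {
    "numeric": [{"idx": 0, "name": "age"}, {"idx": 3, "name": "balance"}],
    "binary": [{"idx": 1, "name": "job_management"}, {"idx": 2, "name": "job_retired"}],
}
# Hypothetical coefficients, one per feature-vector position
coefficients = [0.01, -0.20, 0.45, 0.0003]


def coefficients_by_name(attrs, coefficients):
    """Return (idx, name, coefficient) tuples ordered by feature-vector position."""
    rows = []
    for group in attrs.values():
        for attr in group:
            rows.append((attr["idx"], attr["name"], coefficients[attr["idx"]]))
    return sorted(rows)


for idx, name, coef in coefficients_by_name(attrs, coefficients):
    print(idx, name, coef)
```

Indexing the coefficients by list position instead would pair `balance` (second numeric attribute, vector position 3) with the coefficient at position 1, which belongs to a different feature.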

Evaluate model

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
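# areaUnderROC ranks rows by a continuous score, so use rawPrediction
# rather than the hard 0/1 prediction column
evaluator = BinaryClassificationEvaluator()\
  .setMetricName("areaUnderROC")\
  .setRawPredictionCol("rawPrediction")\
  .setLabelCol("label")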

# NEW PREDICTION VALUE

In [0]:
evaluator.evaluate(lrPrediction)
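
The areaUnderROC value depends on which column the evaluator ranks rows by. The sketch below, in plain Python with made-up labels and scores (not Spark's implementation), computes AUC as the probability that a random positive outranks a random negative, and compares continuous scores against thresholded 0/1 predictions for the same rows:

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]            # hypothetical model scores
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]  # thresholded 0/1 predictions

auc_scores = auc(labels, scores)
auc_hard = auc(labels, hard)
print(auc_scores, auc_hard)
```

Thresholding discards the ordering within each predicted class, so the AUC on hard predictions reduces to the average of the true positive rate and the true negative rate at that single threshold. Values computed from different score columns are therefore not directly comparable.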

In [0]:
# ROC for test data
display(lrModel, test, "ROC")

False Positive Rate,True Positive Rate,Threshold
0.0,0.0,0.9993367043299846
0.0,0.0222222222222222,0.9993367043299846
0.0,0.0444444444444444,0.994009686813798
0.0,0.0666666666666666,0.9890170455012351
0.0,0.0888888888888888,0.9886112695254684
0.0,0.1111111111111111,0.988292869056298
0.0,0.1333333333333333,0.9866091853935158
0.0,0.1555555555555555,0.9841169044607524
0.0,0.1777777777777777,0.9838073646782458
0.0,0.2,0.9522294441090032


In [0]:
display(lrModel.summary.roc)

FPR,TPR
0.0,0.0
0.0013305455236647,0.019521410579345
0.001900779319521,0.0398824517212426
0.0032313248431857,0.0594038623005877
0.0049420262307546,0.0785054575986566
0.0064626496863714,0.0978169605373635
0.0079832731419882,0.1171284634760705
0.0096939745295571,0.1362300587741393
0.011404675917126,0.1553316540722082
0.0127352214407907,0.1748530646515533


In [0]:
display(lrModel.summary.pr)

recall,precision
0.0,0.93
0.019521410579345,0.93
0.0398824517212426,0.95
0.0594038623005877,0.9433333333333334
0.0785054575986566,0.935
0.0978169605373635,0.932
0.1171284634760705,0.93
0.1362300587741393,0.9271428571428572
0.1553316540722082,0.925
0.1748530646515533,0.9255555555555556


## Decision Trees

You can read more about [Decision Trees](http://spark.apache.org/docs/latest/mllib-decision-tree.html) in the Spark MLlib Programming Guide.
The decision tree algorithm is popular because it handles categorical
data and works out of the box with multiclass classification tasks.

# SECOND CHANGE, USED MAX DEPTH OF 6 FOR THE DECISION TREE

In [0]:
from pyspark.ml.classification import DecisionTreeClassifier

# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=6)

# Train model with Training Data
dtModel = dt.fit(train)
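
To give a sense of what `maxDepth` controls: a binary tree of depth d (root at depth 0) has at most 2^d leaves and 2^(d+1) - 1 nodes. A small sketch comparing depth 3 with the depth 6 used above; `tree_capacity` is a hypothetical helper:

```python
def tree_capacity(max_depth):
    """Upper bounds on leaves and total nodes for a binary tree with the root at depth 0."""
    max_leaves = 2 ** max_depth
    max_nodes = 2 ** (max_depth + 1) - 1
    return max_leaves, max_nodes


for depth in (3, 6):
    leaves, nodes = tree_capacity(depth)
    print(f"maxDepth={depth}: up to {leaves} leaves, {nodes} nodes")
```

A deeper tree can fit the training data more closely, which may help or may overfit, so the effect is best judged on the held-out test set.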

In [0]:
display(dtModel)

treeNode
"{""index"":49,""featureType"":""continuous"",""prediction"":null,""threshold"":205.5,""categories"":null,""feature"":35,""overflow"":false}"
"{""index"":21,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[1.0],""feature"":41,""overflow"":false}"
"{""index"":11,""featureType"":""continuous"",""prediction"":null,""threshold"":125.5,""categories"":null,""feature"":35,""overflow"":false}"
"{""index"":5,""featureType"":""continuous"",""prediction"":null,""threshold"":73.5,""categories"":null,""feature"":35,""overflow"":false}"
"{""index"":1,""featureType"":""continuous"",""prediction"":null,""threshold"":182.5,""categories"":null,""feature"":37,""overflow"":false}"
"{""index"":0,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":3,""featureType"":""continuous"",""prediction"":null,""threshold"":57.5,""categories"":null,""feature"":35,""overflow"":false}"
"{""index"":2,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":4,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":9,""featureType"":""continuous"",""prediction"":null,""threshold"":6.5,""categories"":null,""feature"":38,""overflow"":false}"


In [0]:
featureIndex=preparedDF.schema["features"].metadata["ml_attr"]["attrs"]

print(featureIndex)

In [0]:
# Make predictions on test data using the Transformer.transform() method.
dtPrediction = dtModel.transform(test)

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Evaluate model
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(dtPrediction)

## Random Forest

Random Forests use an ensemble of trees to improve model accuracy.
You can read more about [Random Forest] in the [classification and regression] section of the MLlib Programming Guide.

[classification and regression]: https://spark.apache.org/docs/latest/ml-classification-regression.html
[Random Forest]: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forests
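
As a simplified illustration of how an ensemble combines its trees, the sketch below takes a majority vote over per-tree class predictions. This is an assumption-level sketch: Spark's implementation may combine trees differently (for example by averaging class probabilities), and the vote values are made up:

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


votes = [1, 0, 1, 1, 0]  # hypothetical predictions from five trees for one row
forest_prediction = majority_vote(votes)
print(forest_prediction)  # 1
```

Because each tree is trained on a different random sample of rows and features, individual errors tend to differ, and the combined vote is usually more stable than any single tree.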

In [0]:
from pyspark.ml.classification import RandomForestClassifier

# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Train model with Training Data
rfModel = rf.fit(train)

In [0]:
# Make predictions on test data using the Transformer.transform() method.
rfPrediction = rfModel.transform(test)

We will evaluate our Random Forest model with BinaryClassificationEvaluator.

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(rfPrediction)

In [0]:
print("prediction accuracy is: ", rfPrediction.where("prediction==label").count()/rfPrediction.count())

# Count from the random forest predictions (not the logistic regression ones)
tp=rfPrediction.where("label=1 and prediction=1").count()
fp=rfPrediction.where("label=0 and prediction=1").count()
tn=rfPrediction.where("label=0 and prediction=0").count()
fn=rfPrediction.where("label=1 and prediction=0").count()

print("true positive is: ", tp)

print("false positive is: ", fp)

print("true negative is: ", tn)

print("false negative is ", fn)

print("precision is ", tp/(tp+fp)) 

print("recall is ", tp/(tp+fn))


### Gradient-Boosted Trees

In [0]:
from pyspark.ml.classification import GBTClassifier
gbtClassifier=GBTClassifier()

gbtModel=gbtClassifier.fit(train)

# Make predictions on test data using the Transformer.transform() method.
gbtPrediction = gbtModel.transform(test)

# Evaluate Model
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(gbtPrediction)

In [0]:
print("prediction accuracy is: ", gbtPrediction.where("prediction==label").count()/gbtPrediction.count())

# Count from the gradient-boosted trees predictions (not the logistic regression ones)
tp=gbtPrediction.where("label=1 and prediction=1").count()
fp=gbtPrediction.where("label=0 and prediction=1").count()
tn=gbtPrediction.where("label=0 and prediction=0").count()
fn=gbtPrediction.where("label=1 and prediction=0").count()

print("true positive is: ", tp)

print("false positive is: ", fp)

print("true negative is: ", tn)

print("false negative is ", fn)

print("precision is ", tp/(tp+fp))

print("recall is ", tp/(tp+fn))

### Make Predictions

Because the Gradient-Boosted Trees model gives us the best areaUnderROC value, we will use it for deployment and to generate predictions on new data. In this example, we simulate this by generating predictions on the entire dataset. Note that the entire dataset includes the training rows, so metrics computed on it are optimistic.

In [0]:
# Generate predictions for entire dataset
finalPredictions = gbtModel.transform(preparedDF)

In [0]:
# Evaluate best model
evaluator.evaluate(finalPredictions)

In [0]:
# create a SQL view
finalPredictions.createOrReplaceTempView("finalPredictions")

In an operational environment, analysts may use a similar machine learning pipeline to obtain predictions on new data, organize it into a table and use it for analysis or lead targeting.

In [0]:
%sql
select case when label=1 and prediction=1 then "true positive"
            when label=1 and prediction=0 then "false negative"
            when label=0 and prediction=0 then "true negative"
            when label=0 and prediction=1 then "false positive"
            else "N/A"
            End as status, count(*) as NumberofRecords
from finalPredictions 
group by status

status,NumberofRecords
true positive,4664
true negative,4938
false negative,625
false positive,935
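
The counts in the table above are enough to derive accuracy, precision, and recall by hand. A minimal sketch using those four reported numbers; `confusion_metrics` is a hypothetical helper that also guards against division by zero:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Counts taken from the query result above
metrics = confusion_metrics(tp=4664, fp=935, tn=4938, fn=625)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

These figures describe predictions on the whole dataset, training rows included, so they should be read as an upper bound rather than as held-out performance.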


# END
## The training and testing datasets were changed to a 9:1 split, and the areaUnderROC value increased by over 0.1. The decision tree's maximum depth was raised from the original 3 to 6, and this increase gave us a better model. I think the decision tree classifier is the best of the classifiers because it allows direct control over the depth of the search.