## Amazon Review Cell Phone and Accessories

## Dataset Info

The selected dataset is the "Cell Phones and Accessories" category of the Amazon review data:
https://nijianmo.github.io/amazon/index.html

## Databricks File System (DBFS) Connection

In [4]:
# File location and type.
file_location = "/FileStore/tables/Cell_Phones_and_Accessories_5.json"
file_type = "json"

# CSV-style options; they are not needed for JSON input
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types (such as JSON), these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

# Displaying dataframe 'df'.
display(df)

asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
120401325X,"List(0, 0)",4.0,They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again,"05 21, 2014",A30TL5EWN6DFXT,christina,Looks Good,1400630400
120401325X,"List(0, 0)",5.0,These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :),"01 14, 2014",ASY55RVNIL0UD,emily l.,Really great product.,1389657600
120401325X,"List(0, 0)",5.0,These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!,"06 26, 2014",A2TMXE2AFO7ONB,Erica,LOVE LOVE LOVE,1403740800
120401325X,"List(4, 4)",4.0,"Item arrived in great time and was in perfect condition. However, I ordered these buttons because they were a great deal and included a FREE screen protector. I never received one. Though its not a big deal, it would've been nice to get it since they claim it comes with one.","10 21, 2013",AWJ0WZQYMYFQ4,JM,Cute!,1382313600
120401325X,"List(2, 3)",5.0,"awesome! stays on, and looks great. can be used on multiple apple products. especially having nails, it helps to have an elevated key.","02 3, 2013",ATX7CZYFXI1KW,patrice m rogoza,leopard home button sticker for iphone 4s,1359849600
120401325X,"List(1, 2)",3.0,These make using the home button easy. My daughter and I both like them. I would purchase them again. Well worth the price.,"10 12, 2013",APX47D16JOP7H,RLH,Cute,1381536000
120401325X,"List(0, 0)",5.0,Came just as described.. It doesn't come unstuck and its cute! People ask where I got them from & it's great when driving.,"08 22, 2013",A1JVVYYO7G56DS,Tyler Evans,best thing ever..,1377129600
3998899561,"List(1, 2)",1.0,it worked for the first week then it only charge my phone to 20%. it is a waste of money.,"11 21, 2013",A6FGO4TBZ3QFZ,Abdullah Albyati,not a good Idea,1384992000
3998899561,"List(2, 3)",5.0,"Good case, solid build. Protects phone all around with good access to buttons. Battery charges with full battery lasts me a full day. I usually leave my house around 7am and return at 10pm. I'm glad that it lasts from start to end. 5/5","09 25, 2013",A2JWEDW5FSVB0F,Adam,Solid Case,1380067200
3998899561,"List(1, 1)",5.0,"This is a fantastic case. Very stylish and protects my phone. Easy access to all buttons and features, without any loss of phone reception. But most importantly, it double power, just as promised. Great buy","04 3, 2014",A8AJS1DW7L3JJ,Agata Majchrzak,Perfect Case,1396483200


## Count of DataFrame.

In [6]:
# Counting number of rows in dataframe 'df'.
original_count = df.count()
original_count

# Printing Count
print("Total Rows = %d" % original_count)

## Printing the Schema of a Dataframe(df)

In [8]:
# Displaying datatypes in a dataframe 'df'. 
df.printSchema()

## Removing NA, NULL, NaN Values and Dropping Unwanted columns.

In [10]:
import pyspark.sql.functions as F

# Dropping the columns that are not needed for modelling.
step1 = df.drop('asin','helpful','reviewTime','reviewerID','reviewerName','unixReviewTime')

# For each remaining column, keep the value unless it is one of the placeholders
# "?", "NULL", "NA" or "NaN"; placeholder values become null.
step1 = [F.when(~F.col(x).isin("?","NULL", "NA", "NaN"), F.col(x)).alias(x)  for x in step1.columns] 

# Dropping every row that contains a null in any selected column; the result is saved as dfmodel.
dfmodel = df.select(*step1).dropna(how='any')
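
To make the intent of the `F.when(...)` and `dropna` combination concrete, here is a plain-Python sketch of the same rule on a few made-up rows. The sample rows and the helper name `clean_rows` are illustrative assumptions and are not part of the notebook; this is not Spark code.

```python
# Plain-Python illustration of the cleaning rule (hypothetical helper, not Spark code).
PLACEHOLDERS = {"?", "NULL", "NA", "NaN"}

def clean_rows(rows):
    """Replace placeholder strings with None, then drop rows containing None."""
    cleaned = []
    for row in rows:
        # Mirrors F.when(~col.isin(...), col): keep the value unless it is a placeholder.
        nulled = {k: (None if v in PLACEHOLDERS else v) for k, v in row.items()}
        # Mirrors dropna(how='any'): discard the row if any field is None.
        if all(v is not None for v in nulled.values()):
            cleaned.append(nulled)
    return cleaned

# Made-up sample rows for illustration only.
rows = [
    {"overall": 5.0, "reviewText": "Great case", "summary": "Love it"},
    {"overall": 4.0, "reviewText": "NA", "summary": "ok"},
    {"overall": 1.0, "reviewText": "Broke fast", "summary": None},
]
kept = clean_rows(rows)
print(len(kept), "of", len(rows), "rows kept")
```

Only the first row survives: the second contains a placeholder string and the third already contains a missing value.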

## Count of Deleted Rows.

In [12]:
after_count = dfmodel.count()
tot_del = original_count - after_count

# Printing Count
print("Deleted Rows = %d" % tot_del)

## Removing Unwanted symbols.

In [14]:
import re

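# Compiled patterns: characters to delete, and separators to replace with a space
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

# User-defined function; it expects an iterable of strings.
def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    
    return reviews

# Calling the function on a small sample of review texts collected to the driver
# (a DataFrame cannot be passed directly, because the function iterates over strings).
sample_reviews = [row["reviewText"] for row in dfmodel.select("reviewText").limit(5).collect()]
reviews_train_clean = preprocess_reviews(sample_reviews)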

## Converting sentence into Tokens.

In [16]:
from pyspark.ml.feature import RegexTokenizer

# Tokenizer
tokenizer = (RegexTokenizer()
            .setInputCol("reviewText")
            .setOutputCol("tokens")
            .setPattern("\\W+"))

tokenizedDF = tokenizer.transform(dfmodel)

# Displaying Dataframe
display(tokenizedDF.limit(5))

overall,reviewText,summary,tokens
4.0,They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again,Looks Good,"List(they, look, good, and, stick, good, i, just, don, t, like, the, rounded, shape, because, i, was, always, bumping, it, and, siri, kept, popping, up, and, it, was, irritating, i, just, won, t, buy, a, product, like, this, again)"
5.0,These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :),Really great product.,"List(these, stickers, work, like, the, review, says, they, do, they, stick, on, great, and, they, stay, on, the, phone, they, are, super, stylish, and, i, can, share, them, with, my, sister)"
5.0,These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!,LOVE LOVE LOVE,"List(these, are, awesome, and, make, my, phone, look, so, stylish, i, have, only, used, one, so, far, and, have, had, it, on, for, almost, a, year, can, you, believe, that, one, year, great, quality)"
4.0,"Item arrived in great time and was in perfect condition. However, I ordered these buttons because they were a great deal and included a FREE screen protector. I never received one. Though its not a big deal, it would've been nice to get it since they claim it comes with one.",Cute!,"List(item, arrived, in, great, time, and, was, in, perfect, condition, however, i, ordered, these, buttons, because, they, were, a, great, deal, and, included, a, free, screen, protector, i, never, received, one, though, its, not, a, big, deal, it, would, ve, been, nice, to, get, it, since, they, claim, it, comes, with, one)"
5.0,"awesome! stays on, and looks great. can be used on multiple apple products. especially having nails, it helps to have an elevated key.",leopard home button sticker for iphone 4s,"List(awesome, stays, on, and, looks, great, can, be, used, on, multiple, apple, products, especially, having, nails, it, helps, to, have, an, elevated, key)"
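
The token lists above can be reproduced outside Spark with a short plain-Python sketch. It assumes the tokenizer's usual behaviour of lowercasing the text and treating the `\W+` pattern as the gap between tokens; the function name `regex_tokenize` is a hypothetical stand-in, not a Spark API.

```python
import re

def regex_tokenize(text, pattern=r"\W+"):
    """Lowercase the text and split it on runs of non-word characters."""
    # Empty strings appear when the text starts or ends with a separator, so drop them.
    return [tok for tok in re.split(pattern, text.lower()) if tok]

tokens = regex_tokenize("I just don't like the rounded shape!")
print(tokens)
```

Because the apostrophe is a non-word character, `don't` is split into `don` and `t`, which matches the `tokens` column shown in the output.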


## Removing StopWords from Tokens

In [18]:
from pyspark.ml.feature import StopWordsRemover

#StopwordsRemover
remover = (StopWordsRemover()
          .setInputCol("tokens")
          .setOutputCol("stopWordFree"))

removedStopWordsDF = remover.transform(tokenizedDF)

# Displaying Dataframe
display(removedStopWordsDF.limit(5))

overall,reviewText,summary,tokens,stopWordFree
4.0,They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again,Looks Good,"List(they, look, good, and, stick, good, i, just, don, t, like, the, rounded, shape, because, i, was, always, bumping, it, and, siri, kept, popping, up, and, it, was, irritating, i, just, won, t, buy, a, product, like, this, again)","List(look, good, stick, good, like, rounded, shape, always, bumping, siri, kept, popping, irritating, won, buy, product, like)"
5.0,These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :),Really great product.,"List(these, stickers, work, like, the, review, says, they, do, they, stick, on, great, and, they, stay, on, the, phone, they, are, super, stylish, and, i, can, share, them, with, my, sister)","List(stickers, work, like, review, says, stick, great, stay, phone, super, stylish, share, sister)"
5.0,These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!,LOVE LOVE LOVE,"List(these, are, awesome, and, make, my, phone, look, so, stylish, i, have, only, used, one, so, far, and, have, had, it, on, for, almost, a, year, can, you, believe, that, one, year, great, quality)","List(awesome, make, phone, look, stylish, used, one, far, almost, year, believe, one, year, great, quality)"
4.0,"Item arrived in great time and was in perfect condition. However, I ordered these buttons because they were a great deal and included a FREE screen protector. I never received one. Though its not a big deal, it would've been nice to get it since they claim it comes with one.",Cute!,"List(item, arrived, in, great, time, and, was, in, perfect, condition, however, i, ordered, these, buttons, because, they, were, a, great, deal, and, included, a, free, screen, protector, i, never, received, one, though, its, not, a, big, deal, it, would, ve, been, nice, to, get, it, since, they, claim, it, comes, with, one)","List(item, arrived, great, time, perfect, condition, however, ordered, buttons, great, deal, included, free, screen, protector, never, received, one, though, big, deal, ve, nice, get, since, claim, comes, one)"
5.0,"awesome! stays on, and looks great. can be used on multiple apple products. especially having nails, it helps to have an elevated key.",leopard home button sticker for iphone 4s,"List(awesome, stays, on, and, looks, great, can, be, used, on, multiple, apple, products, especially, having, nails, it, helps, to, have, an, elevated, key)","List(awesome, stays, looks, great, used, multiple, apple, products, especially, nails, helps, elevated, key)"
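
As a rough illustration of what the stop-word step does to a token list, the sketch below filters tokens against a small set. The set `STOP_WORDS` here is a tiny hypothetical list chosen for the example; Spark's default English list is much longer.

```python
# A tiny, hypothetical stop-word list used only for this illustration.
STOP_WORDS = {"they", "and", "i", "just", "don", "t", "the", "because", "was", "it"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["they", "look", "good", "and", "stick", "good", "i", "just", "don", "t", "like"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Token order and repeated words are preserved; only the listed stop words are removed.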


## Converting Stop-Word-Free Tokens into Vectors for Applying Machine Learning Models.

In [20]:
from pyspark.ml.feature import CountVectorizer

counts = (CountVectorizer()
          .setInputCol("stopWordFree")
          .setOutputCol("features")
          .setVocabSize(2000))

cModel = counts.fit(removedStopWordsDF)
countModel = cModel.transform(removedStopWordsDF)

# Displaying Dataframe
display(countModel.limit(5))

overall,reviewText,summary,tokens,stopWordFree,features
4.0,They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again,Looks Good,"List(they, look, good, and, stick, good, i, just, don, t, like, the, rounded, shape, because, i, was, always, bumping, it, and, siri, kept, popping, up, and, it, was, irritating, i, just, won, t, buy, a, product, like, this, again)","List(look, good, stick, good, like, rounded, shape, always, bumping, siri, kept, popping, irritating, won, buy, product, like)","List(0, 2000, List(3, 7, 14, 57, 85, 155, 222, 521, 695, 942, 1981), List(2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
5.0,These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :),Really great product.,"List(these, stickers, work, like, the, review, says, they, do, they, stick, on, great, and, they, stay, on, the, phone, they, are, super, stylish, and, i, can, share, them, with, my, sister)","List(stickers, work, like, review, says, stick, great, stay, phone, super, stylish, share, sister)","List(0, 2000, List(0, 3, 4, 30, 117, 318, 403, 521, 527, 886, 1593, 1690, 1839), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
5.0,These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!,LOVE LOVE LOVE,"List(these, are, awesome, and, make, my, phone, look, so, stylish, i, have, only, used, one, so, far, and, have, had, it, on, for, almost, a, year, can, you, believe, that, one, year, great, quality)","List(awesome, make, phone, look, stylish, used, one, far, almost, year, believe, one, year, great, quality)","List(0, 2000, List(0, 2, 4, 27, 39, 73, 85, 106, 206, 287, 323, 630, 886), List(1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0))"
4.0,"Item arrived in great time and was in perfect condition. However, I ordered these buttons because they were a great deal and included a FREE screen protector. I never received one. Though its not a big deal, it would've been nice to get it since they claim it comes with one.",Cute!,"List(item, arrived, in, great, time, and, was, in, perfect, condition, however, i, ordered, these, buttons, because, they, were, a, great, deal, and, included, a, free, screen, protector, i, never, received, one, though, its, not, a, big, deal, it, would, ve, been, nice, to, get, it, since, they, claim, it, comes, with, one)","List(item, arrived, great, time, perfect, condition, however, ordered, buttons, great, deal, included, free, screen, protector, never, received, one, though, big, deal, ve, nice, get, since, claim, comes, one)","List(0, 2000, List(2, 4, 6, 11, 16, 22, 34, 40, 82, 90, 103, 112, 113, 125, 130, 148, 163, 184, 200, 231, 280, 293, 398, 1273, 1729), List(2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0))"
5.0,"awesome! stays on, and looks great. can be used on multiple apple products. especially having nails, it helps to have an elevated key.",leopard home button sticker for iphone 4s,"List(awesome, stays, on, and, looks, great, can, be, used, on, multiple, apple, products, especially, having, nails, it, helps, to, have, an, elevated, key)","List(awesome, stays, looks, great, used, multiple, apple, products, especially, nails, helps, elevated, key)","List(0, 2000, List(4, 39, 56, 145, 283, 287, 309, 551, 683, 745, 1040), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
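
Each entry in the `features` column is a sparse count vector: in the display above, the leading `0` appears to mark a sparse vector, `2000` is the vector size (the vocabulary size), and the two inner lists hold term indices and term counts. The plain-Python sketch below shows the idea on three made-up documents. The helper names and the alphabetical tie-break are assumptions for illustration, not Spark's implementation.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size):
    """Rank terms by total count across all documents and keep the top vocab_size."""
    totals = Counter(tok for doc in docs for tok in doc)
    # Sort by descending count; ties are broken alphabetically (an assumption made
    # here so that the result is deterministic).
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: idx for idx, (term, _) in enumerate(ranked[:vocab_size])}

def to_sparse(doc, vocab):
    """Return (size, indices, values) with indices in ascending order."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    indices = sorted(counts)
    return len(vocab), indices, [float(counts[i]) for i in indices]

# Made-up token lists for illustration only.
docs = [
    ["look", "good", "stick", "good", "like"],
    ["stickers", "work", "like", "great"],
    ["great", "quality", "good"],
]
vocab = fit_vocabulary(docs, vocab_size=4)
size, indices, values = to_sparse(docs[0], vocab)
print(vocab)
print(size, indices, values)
```

Terms that fall outside the vocabulary (here `stick`) are simply ignored, which is also why a vocabulary size of 2000 drops rare words from the notebook's feature vectors.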


## Creating a User-Defined Function to Binarize the Rating (the Dependent Variable)

In [22]:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

# Mapping star ratings to a binary label: 1-3 stars -> 0.0 (negative), 4-5 stars -> 1.0 (positive)
def overall(rating):
  if(rating == 1.0 or rating == 2.0 or rating == 3.0):
    return 0.0
  if(rating == 4.0 or rating == 5.0):
    return 1.0
  else:
    return(rating)

# Registering the function as a UDF and adding the "label" column
etype_udf = udf(overall,DoubleType())
datamodel = countModel.withColumn("label", etype_udf("overall"))

# Displaying Dataframe
display(datamodel.limit(5))

overall,reviewText,summary,tokens,stopWordFree,features,label
4.0,They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again,Looks Good,"List(they, look, good, and, stick, good, i, just, don, t, like, the, rounded, shape, because, i, was, always, bumping, it, and, siri, kept, popping, up, and, it, was, irritating, i, just, won, t, buy, a, product, like, this, again)","List(look, good, stick, good, like, rounded, shape, always, bumping, siri, kept, popping, irritating, won, buy, product, like)","List(0, 2000, List(3, 7, 14, 57, 85, 155, 222, 521, 695, 942, 1981), List(2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",1.0
5.0,These stickers work like the review says they do. They stick on great and they stay on the phone. They are super stylish and I can share them with my sister. :),Really great product.,"List(these, stickers, work, like, the, review, says, they, do, they, stick, on, great, and, they, stay, on, the, phone, they, are, super, stylish, and, i, can, share, them, with, my, sister)","List(stickers, work, like, review, says, stick, great, stay, phone, super, stylish, share, sister)","List(0, 2000, List(0, 3, 4, 30, 117, 318, 403, 521, 527, 886, 1593, 1690, 1839), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",1.0
5.0,These are awesome and make my phone look so stylish! I have only used one so far and have had it on for almost a year! CAN YOU BELIEVE THAT! ONE YEAR!! Great quality!,LOVE LOVE LOVE,"List(these, are, awesome, and, make, my, phone, look, so, stylish, i, have, only, used, one, so, far, and, have, had, it, on, for, almost, a, year, can, you, believe, that, one, year, great, quality)","List(awesome, make, phone, look, stylish, used, one, far, almost, year, believe, one, year, great, quality)","List(0, 2000, List(0, 2, 4, 27, 39, 73, 85, 106, 206, 287, 323, 630, 886), List(1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0))",1.0
4.0,"Item arrived in great time and was in perfect condition. However, I ordered these buttons because they were a great deal and included a FREE screen protector. I never received one. Though its not a big deal, it would've been nice to get it since they claim it comes with one.",Cute!,"List(item, arrived, in, great, time, and, was, in, perfect, condition, however, i, ordered, these, buttons, because, they, were, a, great, deal, and, included, a, free, screen, protector, i, never, received, one, though, its, not, a, big, deal, it, would, ve, been, nice, to, get, it, since, they, claim, it, comes, with, one)","List(item, arrived, great, time, perfect, condition, however, ordered, buttons, great, deal, included, free, screen, protector, never, received, one, though, big, deal, ve, nice, get, since, claim, comes, one)","List(0, 2000, List(2, 4, 6, 11, 16, 22, 34, 40, 82, 90, 103, 112, 113, 125, 130, 148, 163, 184, 200, 231, 280, 293, 398, 1273, 1729), List(2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0))",1.0
5.0,"awesome! stays on, and looks great. can be used on multiple apple products. especially having nails, it helps to have an elevated key.",leopard home button sticker for iphone 4s,"List(awesome, stays, on, and, looks, great, can, be, used, on, multiple, apple, products, especially, having, nails, it, helps, to, have, an, elevated, key)","List(awesome, stays, looks, great, used, multiple, apple, products, especially, nails, helps, elevated, key)","List(0, 2000, List(4, 39, 56, 145, 283, 287, 309, 551, 683, 745, 1040), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",1.0


## Selecting Columns for Applying Machine Learning Models and Splitting the Dataset into Train and Test Sets.

In [24]:
# Selecting the label and features columns and splitting them 80/20
(trainDF, testDF) = datamodel.select("label","features").randomSplit((0.80, 0.20), seed=1234)

#caching the dataframe.
trainDF.cache()
testDF.cache()
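
A weighted random split of this kind assigns each row by an independent random draw, so the resulting sizes are only approximately 80/20, and fixing the seed makes the split repeatable. The plain-Python sketch below illustrates that behaviour; `random_split` is a hypothetical helper, and its random number stream is not the one Spark uses.

```python
import random

def random_split(rows, train_weight=0.8, seed=1234):
    """Assign each row to train or test by an independent seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows)
print(len(train), len(test))
```

Running the function twice with the same seed yields the same partition, which is what makes later model comparisons reproducible.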

### Applying Machine Learning Models.

## 1.1) Logistic Regression (Single Run)

In [27]:
from pyspark.ml.classification import LogisticRegression

# Creating a variable for LogisticRegression()
lr = LogisticRegression()

# Fit the model to trainDF
lrModel = lr.fit(trainDF)

# Print the coefficients, intercept and training accuracy for Logistic Regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
print("Training accuracy: " + str(lrModel.summary.accuracy))

# Predictions on testDF
result = lrModel.transform(testDF)
result.select("prediction", "label", "features").show(5)

## ROC Curve

In [29]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

#Binary Classification.
evaluator = BinaryClassificationEvaluator()

# Displaying the ROC curve
display(lrModel, trainDF, "ROC")


False Positive Rate,True Positive Rate,Threshold
0.0,0.0,0.9999973484040092
0.0,0.0140845070422535,0.9999973484040092
0.0,0.028169014084507,0.999995998361904
0.0,0.0422535211267605,0.9994250803375244
0.0,0.056338028169014,0.9993739544518422
0.0,0.0704225352112676,0.9990475483923152
0.0,0.0845070422535211,0.9973382268092492
0.0,0.0985915492957746,0.9971877513626782
0.0,0.1126760563380281,0.9957955381238656
0.0,0.1267605633802817,0.995392026767288
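
Each row of the table above is one point on the ROC curve: the false positive rate and true positive rate obtained when every score at or above the threshold is predicted positive. The plain-Python sketch below computes such points for a handful of made-up scores; `roc_points` is a hypothetical helper, not the routine Databricks uses.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr, threshold) triples, one per distinct score, highest threshold first."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos, thr))
    return points

# Made-up scores and labels for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
for fpr, tpr, thr in points:
    print(fpr, tpr, thr)
```

As in the notebook's output, the curve starts near the origin at the highest threshold and moves toward (1, 1) as the threshold is lowered.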


## Evaluator Result

In [31]:
# Printing the value of AUC (area under ROC is the evaluator's default metric)
print("AUC: %(result)s" % {"result": evaluator.evaluate(result)})

## Displaying the LogisticRegression Model on the Training Data (Fitted Values vs. Residuals).

In [33]:
display(lrModel, trainDF)

fitted values,residuals
0.8772686838901046,-0.7062559057325607
2.255405949999579,-0.9051158222772088
0.4562147198955094,-0.6121158192472247
-0.0808511635118658,-0.4797982126964661
-1.3336313724479285,-0.2085593280208302
-1.2937169530641277,-0.2152243424412974
-1.2309296145483426,-0.2260187628592568
-0.1044941697543211,-0.4739002019504425
1.0667457851252509,-0.7439775617000499
-0.4062876404207889,-0.3998026084921883


## Binary Classification Evaluator

In [35]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Let's use the run-of-the-mill evaluator
evaluator = BinaryClassificationEvaluator(labelCol='label')

# We have only two choices: area under ROC and PR curves :-(
auroc = evaluator.evaluate(result, {evaluator.metricName: "areaUnderROC"})
auprc = evaluator.evaluate(result, {evaluator.metricName: "areaUnderPR"})
print("Area under ROC Curve: {:.4f}".format(auroc))
print("Area under PR Curve: {:.4f}".format(auprc))
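
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on made-up data. It illustrates the metric's meaning; Spark computes the value from the curve itself, so this is not its implementation, and `auc_roc` is a hypothetical helper.

```python
def auc_roc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores and labels for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
auc = auc_roc(scores, labels)
print("Area under ROC Curve: {:.4f}".format(auc))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is useful context when comparing the models in the following sections.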

## Confusion Matrix

In [37]:
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType

#important: need to cast to float type, and order by prediction, else it won't work
preds_and_labels = result.select(['prediction','label']).withColumn('label', F.col('label').cast(FloatType())).orderBy('prediction')

#select only prediction and label columns
preds_and_labels = preds_and_labels.select(['prediction','label'])
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))

#Printing Confusion Matrix.
print(metrics.confusionMatrix().toArray())
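
The printed matrix has actual classes in rows and predicted classes in columns, both in ascending label order, so for labels 0.0 and 1.0 it reads `[[TN, FP], [FN, TP]]`. The plain-Python sketch below builds such a matrix from made-up predictions and derives accuracy, precision and recall from it; the helper name and the data are assumptions for illustration.

```python
def confusion_matrix(predictions, labels, classes=(0.0, 1.0)):
    """Rows are actual classes, columns are predicted classes, both in ascending order."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for pred, actual in zip(predictions, labels):
        matrix[index[actual]][index[pred]] += 1
    return matrix

# Made-up predictions and labels for illustration only.
predictions = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
labels = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]

cm = confusion_matrix(predictions, labels)
tn, fp = cm[0]
fn, tp = cm[1]
accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(cm)
print(accuracy, precision, recall)
```

Because most reviews in this dataset are rated 4 or 5, precision and recall on the minority class are worth checking alongside accuracy.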

## 1.2) Logistic Regression (Multiple Run)

In [39]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

#Estimator
logr = LogisticRegression(featuresCol='features', labelCol='label')

#Hyper-parameter tuning using Grid Search
param_grid = ParamGridBuilder().\
      addGrid(logr.regParam, [0, 0.1, 0.2, 0.6, 1]).\
      addGrid(logr.elasticNetParam, [0, 0.1, 0.2, 0.6, 1]).\
      addGrid(logr.maxIter, [5,10,20,60,100]).\
      build()

#Evaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

#Cross Validation
cv = CrossValidator(estimator=logr, evaluator=evaluator, estimatorParamMaps=param_grid, numFolds=3)
cv_model = cv.fit(trainDF)  

#selecting Columns
print("------------Showing Columns------------")
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']
pred_training_cv = cv_model.transform(trainDF)
pred_training_cv.select(show_columns).show(5, truncate=False)

# Prediction on Testing
print("------------PREDICTION ON TESTING------------")
pred_test_cv = cv_model.transform(testDF)
pred_test_cv.select(show_columns).show(5, truncate=False)

# Printing the intercept and coefficients of the best model
print("------------PRINTING COEFFICIENTS------------")
print('Intercept: ' + str(cv_model.bestModel.intercept) + "\n" 'coefficients: ' + str(cv_model.bestModel.coefficients))

print('Logistic Regression', "\n",'The best RegParam is: ', cv_model.bestModel._java_obj.getRegParam(), "\n",'The best ElasticNetParam is:', cv_model.bestModel._java_obj.getElasticNetParam(), "\n",'The best Iteration is:',cv_model.bestModel._java_obj.getMaxIter() , "\n", 'Area under ROC is:', cv_model.bestModel.summary.areaUnderROC)

print("------------PRINTING AVERAGE METRICS------------")
cv_model.avgMetrics
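
Grid search with cross-validation gets expensive quickly, so it helps to count the model fits before running the cell. The sketch below enumerates the same three value lists used in the grid above with `itertools.product`; the variable names are illustrative.

```python
from itertools import product

# The same candidate values as in the parameter grid above.
reg_params = [0, 0.1, 0.2, 0.6, 1]
elastic_net_params = [0, 0.1, 0.2, 0.6, 1]
max_iters = [5, 10, 20, 60, 100]
num_folds = 3

# Every combination of the three parameters is one candidate model.
grid = list(product(reg_params, elastic_net_params, max_iters))
fits_during_cv = len(grid) * num_folds
print(len(grid), "parameter combinations")
print(fits_during_cv, "model fits during cross-validation")
```

The list printed by `avgMetrics` has one entry per parameter combination, averaged over the folds, so its length should equal the number of combinations counted here.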

## 2.1) Support Vector Machine(Single Run)

In [41]:
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create the LinearSVC estimator
lsvc = LinearSVC(maxIter=10, regParam=0.1)

# Fit the model
lsvcModel = lsvc.fit(trainDF)

# Print the coefficients and intercept for linear SVC
print("Coefficients: " + str(lsvcModel.coefficients))
print("Intercept: " + str(lsvcModel.intercept))

lsvcresult = lsvcModel.transform(testDF)
lsvcresult.select("prediction","label","features").show(10)

# Compute the area under the ROC curve on the test set (the evaluator's default metric)
evaluator = BinaryClassificationEvaluator()
print("evaluation: %(result)s" % {"result": evaluator.evaluate(lsvcresult)})

# Let's use the run-of-the-mill evaluator
svmevaluator = BinaryClassificationEvaluator()

# We have only two choices: area under ROC and PR curves :-(
svmauroc = svmevaluator.evaluate(lsvcresult, {svmevaluator.metricName: "areaUnderROC"})
svmauprc = svmevaluator.evaluate(lsvcresult, {svmevaluator.metricName: "areaUnderPR"})
print("Area under ROC Curve: {:.4f}".format(svmauroc))
print("Area under PR Curve: {:.4f}".format(svmauprc))

## 2.2) Support Vector Machine(Multiple Run)

In [43]:
from pyspark.ml.classification import LinearSVC
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

#Estimator
lsvm = LinearSVC(featuresCol='features', labelCol='label')

#GRID VECTOR
param_grid_svm = ParamGridBuilder().\
      addGrid(lsvm.regParam, [0, 0.1, 0.2, 0.5, 1]).\
      addGrid(lsvm.maxIter, [5,10,20,50,100]).\
      build()

#Evaluator
svmevaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

#Cross Validation
cv_svm = CrossValidator(estimator=lsvm, evaluator=svmevaluator, estimatorParamMaps=param_grid_svm, numFolds=3)
cv_svm_model = cv_svm.fit(trainDF)

#selecting Columns
print("------------Showing Columns------------")
show_columns = ['features', 'label', 'prediction', 'rawPrediction']
pred_training_svm = cv_svm_model.transform(trainDF)
pred_training_svm.select(show_columns).show(5, truncate=False)

# Prediction on Testing
print("------------PREDICTION ON TESTING------------")
pred_test_svm = cv_svm_model.transform(testDF)
pred_test_svm.select(show_columns).show(5, truncate=False)

print('Support Vector Machine', "\n",'The best RegParam is: ', cv_svm_model.bestModel._java_obj.getRegParam(),  "\n",'The best Iteration is:',cv_svm_model.bestModel._java_obj.getMaxIter() , "\n", 'Area under ROC is:', svmevaluator.evaluate(pred_test_svm, {svmevaluator.metricName: "areaUnderROC"}))

print("------------PRINTING AVERAGE METRICS------------")
cv_svm_model.avgMetrics

## 3.1) Naive Bayes(Single Run)

In [45]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
nbmodel = nb.fit(trainDF)

# select example rows to display.
nbresult = nbmodel.transform(testDF)
nbresult.select("prediction","label","features").show(10)

# compute the area under the ROC curve on the test set (the evaluator's default metric, not accuracy)
nbevaluator = BinaryClassificationEvaluator()
nbdefault = nbevaluator.evaluate(nbresult)
print("evaluations: %(nbresult)s" % {"nbresult": nbdefault})

# Let's use the run-of-the-mill evaluator
nbevaluator = BinaryClassificationEvaluator()

# We have only two choices: area under ROC and PR curves :-(
nbauroc = nbevaluator.evaluate(nbresult, {nbevaluator.metricName: "areaUnderROC"})
nbauprc = nbevaluator.evaluate(nbresult, {nbevaluator.metricName: "areaUnderPR"})
print("Area under ROC Curve: {:.4f}".format(nbauroc))
print("Area under PR Curve: {:.4f}".format(nbauprc))
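
The `smoothing` parameter controls additive (Laplace) smoothing, which keeps a word that never appeared in a class from forcing that class's probability to zero. The sketch below is a textbook multinomial Naive Bayes in plain Python on made-up documents. The function names and data are assumptions for illustration, and details such as how the class prior is smoothed may differ from Spark's implementation.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, vocab, smoothing=1.0):
    """Return {label: (log_prior, {term: log_likelihood})} using additive smoothing."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        term_counts[y].update(t for t in doc if t in vocab)
    model = {}
    for y, n_docs in class_counts.items():
        total = sum(term_counts[y].values())
        log_prior = math.log(n_docs / len(labels))
        # P(term | class) = (count + smoothing) / (total + smoothing * |vocab|)
        log_likelihood = {
            t: math.log((term_counts[y][t] + smoothing) / (total + smoothing * len(vocab)))
            for t in vocab
        }
        model[y] = (log_prior, log_likelihood)
    return model

def predict(model, doc):
    """Pick the label with the highest log posterior; unknown terms are ignored."""
    def score(y):
        log_prior, log_likelihood = model[y]
        return log_prior + sum(log_likelihood[t] for t in doc if t in log_likelihood)
    return max(model, key=score)

# Made-up training documents for illustration only.
docs = [["great", "case"], ["great", "quality"], ["waste", "money"], ["broke", "waste"]]
labels = [1.0, 1.0, 0.0, 0.0]
vocab = sorted({t for d in docs for t in d})
model = train_multinomial_nb(docs, labels, vocab, smoothing=1.0)
print(predict(model, ["great", "case"]))
print(predict(model, ["waste"]))
```

With `smoothing=1.0`, the word `waste` still gets a small non-zero probability under the positive class even though it never occurs there.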

## 3.2) Naive Bayes(Multiple Run)

In [47]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

#ESTIMATOR
nb = NaiveBayes(featuresCol='features', labelCol='label')

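#GRID VECTOR
# Note: the "bernoulli" model type expects 0/1 feature values, so count features above 1 may be rejected.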
param_grid_nb = ParamGridBuilder().\
      addGrid(nb.smoothing, [0.0,1.0,2.0,4.0,6.0,8.0]).\
      addGrid(nb.modelType, ["multinomial", "bernoulli"]).\
      build()

#Evaluator
nbevaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

#CROSS VALIDATION
cv_nb = CrossValidator(estimator = nb, evaluator = nbevaluator, estimatorParamMaps = param_grid_nb, numFolds=3)
cv_nb_model = cv_nb.fit(trainDF)  # fitting the training data with the cross-validation model

#selecting Columns
print("------------Showing Columns------------")
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']
pred_training_nb = cv_nb_model.transform(trainDF)
pred_training_nb.select(show_columns).show(5, truncate=False)

# Prediction on Testing
print("------------PREDICTION ON TESTING------------")
pred_test_nb = cv_nb_model.transform(testDF)
pred_test_nb.select(show_columns).show(5, truncate=False)

print('Naive Bayes ',"\n",'The best smoothing is: ', cv_nb_model.bestModel._java_obj.getSmoothing(), "\n",'The best model type is:', cv_nb_model.bestModel._java_obj.getModelType(), "\n", 'Area under ROC is:', nbevaluator.evaluate(pred_test_nb, {nbevaluator.metricName: "areaUnderROC"}))

print("------------PRINTING AVERAGE METRICS------------")
cv_nb_model.avgMetrics

## 4.1) Random Forest(Single Run)

In [49]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create a RandomForest classifier.
rf = RandomForestClassifier(numTrees=10)

# Train the model.
rfmodel = rf.fit(trainDF)

# Make predictions.
rfresult = rfmodel.transform(testDF)

# Select example rows to display.
rfresult.select("prediction","label","features").show(10)

# Compute the area under the ROC curve on the test set (the evaluator's default metric)
rfevaluator = BinaryClassificationEvaluator()
print("evaluations: %(rfresult)s" % {"rfresult": rfevaluator.evaluate(rfresult)})

# We have only two choices: area under ROC and PR curves :-(
rfauroc = rfevaluator.evaluate(rfresult, {rfevaluator.metricName: "areaUnderROC"})
rfauprc = rfevaluator.evaluate(rfresult, {rfevaluator.metricName: "areaUnderPR"})
print("Area under ROC Curve: {:.4f}".format(rfauroc))
print("Area under PR Curve: {:.4f}".format(rfauprc))

## 4.2) Random Forest(Multiple Run)

In [51]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator

#ESTIMATOR
rf = RandomForestClassifier(featuresCol='features', labelCol='label')

#GRID VECTOR
param_grid_rf = ParamGridBuilder().\
      addGrid(rf.impurity,['gini']).\
      addGrid(rf.maxDepth, [2, 3, 4]).\
      addGrid(rf.minInfoGain, [0.0, 0.1, 0.2, 0.3]).\
      addGrid(rf.numTrees,[20,40,60,80,100]).\
      build()

#Evaluator
rfevaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

#CROSS VALIDATION
cv_rf = CrossValidator(estimator=rf, evaluator=rfevaluator, estimatorParamMaps=param_grid_rf, numFolds=3)
cv_rf_model = cv_rf.fit(trainDF)  # fitting the training data with the cross-validation model

#selecting Columns
print("------------Showing Columns------------")
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']
pred_training_rf = cv_rf_model.transform(trainDF)
pred_training_rf.select(show_columns).show(5, truncate=False)

# Prediction on Testing
print("------------PREDICTION ON TESTING------------")
pred_test_rf = cv_rf_model.transform(testDF)
pred_test_rf.select(show_columns).show(5, truncate=False)

best_rf = cv_rf_model.bestModel._java_obj
print('Random forest ', "\n",
      'The best Max Depth is: ', best_rf.getMaxDepth(), "\n",
      'The best min Info gain is:', best_rf.getMinInfoGain(), "\n",
      'Area under ROC is:', rfevaluator.evaluate(pred_test_rf, {rfevaluator.metricName: "areaUnderROC"}))

print("------------PRINTING AVERAGE METRICS------------")
cv_rf_model.avgMetrics
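
The cost of the cross-validated search grows with the product of the grid sizes. As a rough sketch, the snippet below counts the combinations in the random forest grid and the number of model fits that a 3-fold `CrossValidator` performs. The dictionary simply mirrors the values passed to `addGrid`, and the variable names are illustrative.

```python
from itertools import product

# Same value lists as the param_grid_rf cell.
grid = {
    "impurity": ["gini"],
    "maxDepth": [2, 3, 4],
    "minInfoGain": [0.0, 0.1, 0.2, 0.3],
    "numTrees": [20, 40, 60, 80, 100],
}
num_folds = 3

# Every combination of one value per parameter is one candidate model.
combinations = list(product(*grid.values()))
n_combinations = len(combinations)        # 1 * 3 * 4 * 5 = 60

# Each candidate is trained once per fold; the winner is then refit on all of trainDF.
n_fits = n_combinations * num_folds + 1   # 60 * 3 + 1 = 181

print(n_combinations, n_fits)
```

`avgMetrics` holds one value per combination, in the same order as the grid, so it has 60 entries here.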

## 5.1) Gradient Boost(Single Run)

In [53]:
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define a GBT classifier with default settings.
gb = GBTClassifier()

# Train the model on the training split.
gbmodel = gb.fit(trainDF)

# Make predictions.
gbresult = gbmodel.transform(testDF)

# Select example rows to display.
gbresult.select("prediction","label","features").show(5)

# Evaluate on the test set. With no arguments, the evaluator's default metric is areaUnderROC.
gbevaluator = BinaryClassificationEvaluator()

print("evaluations: %(gbresult)s" % {"gbresult": gbevaluator.evaluate(gbresult)})

# BinaryClassificationEvaluator offers only two metrics: areaUnderROC and areaUnderPR.
gbauroc = gbevaluator.evaluate(gbresult, {gbevaluator.metricName: "areaUnderROC"})
gbauprc = gbevaluator.evaluate(gbresult, {gbevaluator.metricName: "areaUnderPR"})
print("Area under ROC Curve: {:.4f}".format(gbauroc))
print("Area under PR Curve: {:.4f}".format(gbauprc))


## 5.2) Gradient Boost(Multiple Run)

In [55]:
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator

# ESTIMATOR
gbt = GBTClassifier(featuresCol='features', labelCol='label')


#GRID VECTOR
param_grid_gbt = ParamGridBuilder().\
    addGrid(gbt.maxDepth, [2, 3, 4]).\
    addGrid(gbt.minInfoGain, [0.0, 0.1, 0.2]).\
    addGrid(gbt.stepSize, [0.02, 0.05, 0.1]).\
    addGrid(gbt.maxIter,[20,40,60,80,100]).\
    build()

#Evaluator
gbtevaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

#CROSS VALIDATION
cv_gbt = CrossValidator(estimator=gbt, evaluator=gbtevaluator, estimatorParamMaps=param_grid_gbt)
cv_gbt_model = cv_gbt.fit(trainDF)  # fitting data to my cross validation model

show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']
pred_training_gbt = cv_gbt_model.transform(trainDF)
pred_training_gbt.select(show_columns).show(5, truncate=False)

pred_test_gbt = cv_gbt_model.transform(testDF)
pred_test_gbt.select(show_columns).show(5, truncate=False)


best_gbt = cv_gbt_model.bestModel._java_obj
print('Gradient Boosting ', "\n",
      'The best Max Depth is: ', best_gbt.getMaxDepth(), "\n",
      'The best min Info gain is:', best_gbt.getMinInfoGain(), "\n",
      'step size: ', best_gbt.getStepSize(), "\n",
      'Area under ROC is:', gbtevaluator.evaluate(pred_test_gbt, {gbtevaluator.metricName: "areaUnderROC"}))

cv_gbt_model.avgMetrics
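
The bare `avgMetrics` list is hard to read because it does not say which parameter combination produced which score. One possible way to pair the two is sketched below. The helper name `rank_param_maps` is hypothetical; it assumes the parameter maps and the metrics are in the same order, which is how `CrossValidator` reports them.

```python
def rank_param_maps(param_maps, avg_metrics, larger_is_better=True):
    """Pair each parameter map with its cross-validated metric and sort best-first.

    param_maps:  list of dicts, e.g. the result of cv_model.getEstimatorParamMaps().
    avg_metrics: list of floats, e.g. cv_model.avgMetrics (same order as param_maps).
    """
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")

    rows = []
    for param_map, metric in zip(param_maps, avg_metrics):
        # Spark Param objects expose .name; fall back to str() for plain keys.
        readable = {getattr(p, "name", str(p)): v for p, v in param_map.items()}
        rows.append((metric, readable))

    rows.sort(key=lambda row: row[0], reverse=larger_is_better)
    return rows

# Intended use in this notebook (not executed here):
# for metric, params in rank_param_maps(cv_gbt_model.getEstimatorParamMaps(),
#                                       cv_gbt_model.avgMetrics)[:5]:
#     print(round(metric, 4), params)
```

Both area-under-curve metrics are "larger is better", so the default sort order fits the evaluators used in this notebook.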

## ALL MODELS AREA UNDER ROC ON TRAINING DATA SET

In [57]:
print('Models and their Performance',"\n")
print('Logistic Regression',evaluator.evaluate(pred_training_cv, {evaluator.metricName: "areaUnderROC"}))
print('Support Vector Machine',svmevaluator.evaluate(pred_training_svm, {svmevaluator.metricName: "areaUnderROC"}))
print('Naive Bayes', nbevaluator.evaluate(pred_training_nb, {nbevaluator.metricName: "areaUnderROC"}))
print('Random forest', rfevaluator.evaluate(pred_training_rf, {rfevaluator.metricName: "areaUnderROC"}))
print('Gradient Boost', gbtevaluator.evaluate(pred_training_gbt, {gbtevaluator.metricName: "areaUnderROC"}))

## ALL MODELS AREA UNDER ROC ON TEST DATA SET

In [59]:
print('Models and their Performance',"\n")
print('Logistic Regression',evaluator.evaluate(pred_test_cv, {evaluator.metricName: "areaUnderROC"}))
print('Support Vector Machine',svmevaluator.evaluate(pred_test_svm, {svmevaluator.metricName: "areaUnderROC"}))
print('Naive Bayes', nbevaluator.evaluate(pred_test_nb, {nbevaluator.metricName: "areaUnderROC"}))
print('Random forest', rfevaluator.evaluate(pred_test_rf, {rfevaluator.metricName: "areaUnderROC"}))
print('Gradient Boost', gbtevaluator.evaluate(pred_test_gbt, {gbtevaluator.metricName: "areaUnderROC"}))

## ALL MODELS ROC V/S PR ON TRAINING DATA SET

In [61]:
print('Models and their Performance',"\n")
print('Logistic Regression: ROC: ',evaluator.evaluate(pred_training_cv, {evaluator.metricName: "areaUnderROC"}), ', PR: ',evaluator.evaluate(pred_training_cv, {evaluator.metricName: "areaUnderPR"}))
print('Support Vector Machine: ROC: ',svmevaluator.evaluate(pred_training_svm, {svmevaluator.metricName: "areaUnderROC"}), ', PR: ',svmevaluator.evaluate(pred_training_svm, {svmevaluator.metricName: "areaUnderPR"}))
print('Naive Bayes: ROC: ', nbevaluator.evaluate(pred_training_nb, {nbevaluator.metricName: "areaUnderROC"}),', PR: ' , nbevaluator.evaluate(pred_training_nb, {nbevaluator.metricName: "areaUnderPR"}))
print('Random forest: ROC: ', rfevaluator.evaluate(pred_training_rf, {rfevaluator.metricName: "areaUnderROC"}), ', PR: ', rfevaluator.evaluate(pred_training_rf, {rfevaluator.metricName: "areaUnderPR"}))
print('Gradient Boost: ROC: ', gbtevaluator.evaluate(pred_training_gbt, {gbtevaluator.metricName: "areaUnderROC"}),', PR: ' , gbtevaluator.evaluate(pred_training_gbt ,{gbtevaluator.metricName: "areaUnderPR"}))
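
The summary cells repeat the same `evaluate` call once per model and metric. A small loop can produce the same numbers with less duplication. The sketch below is one way to do it; `auc_report` and `format_report` are hypothetical helper names, and the only assumption about each evaluator is that it exposes `metricName` and `evaluate(predictions, params)` the way `BinaryClassificationEvaluator` does.

```python
def auc_report(models, metrics=("areaUnderROC", "areaUnderPR")):
    """Evaluate every model on every metric.

    models: dict mapping a display name to an (evaluator, predictions) pair.
    Returns {name: {metric: value}}.
    """
    report = {}
    for name, (evaluator, predictions) in models.items():
        report[name] = {
            metric: evaluator.evaluate(predictions, {evaluator.metricName: metric})
            for metric in metrics
        }
    return report


def format_report(report):
    """Render one line per model, e.g. 'Naive Bayes: areaUnderROC=0.8000, ...'."""
    lines = []
    for name, scores in report.items():
        parts = ", ".join("{}={:.4f}".format(m, v) for m, v in scores.items())
        lines.append("{}: {}".format(name, parts))
    return lines

# Intended use in this notebook (not executed here):
# training = {
#     "Logistic Regression": (evaluator, pred_training_cv),
#     "Support Vector Machine": (svmevaluator, pred_training_svm),
#     "Naive Bayes": (nbevaluator, pred_training_nb),
#     "Random forest": (rfevaluator, pred_training_rf),
#     "Gradient Boost": (gbtevaluator, pred_training_gbt),
# }
# print("\n".join(format_report(auc_report(training))))
```

Swapping the `pred_training_*` DataFrames for the `pred_test_*` ones gives the test-set comparison with the same call.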