Set up the Spark environment

In [1]:
%%init_spark
launcher.master="yarn"
launcher.num_executors=6
launcher.executor_cores=2
launcher.executor_memory='6000m'
launcher.packages=["com.github.master:spark-stemming_2.10:0.2.0"]

### 1. Data Exploration
Read in the Yelp review data, inspect the schema, and view a few sample rows

In [2]:
//Read the JSON file and load it into a dataframe. The JSON reader infers the schema on its own; "header" and "inferschema" are CSV options and are not needed here
val rev=spark.read.option("header","true").option("inferschema", "true").json("/hadoop-user/data/review.json")
rev.cache()
rev.printSchema()
rev.show(3)
rev.count
rev.take(1).foreach(println)

Intitializing Scala interpreter ...

Spark Web UI available at http://bd-hm:8088/proxy/application_1573932525645_0001
SparkContext available as 'sc' (version = 2.4.4, master = yarn, app id = application_1573932525645_0001)
SparkSession available as 'spark'


root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: long (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)

+--------------------+----+----------+-----+--------------------+-----+--------------------+------+--------------------+
|         business_id|cool|      date|funny|           review_id|stars|                text|useful|             user_id|
+--------------------+----+----------+-----+--------------------+-----+--------------------+------+--------------------+
|0W4lkclzZThpx3V65...|   0|2016-05-28|    0|v0i_UHJMo_hPBq9bx...|    5|Love the staff, l...|     0|bv2nCi5Qv5vroFiqK...|
|AEx2SYEUJmTxVVB18...|   0|2016-05-28|    0|vkVSCC7xljjrAI4UG...|    5|Super simple plac...|     0|bv2nCi5Qv5vroFiqK...|
|VR6GpWIda3SfvPC-l...|   0|2016-05-28|    0|n6QzIUObkY

rev: org.apache.spark.sql.DataFrame = [business_id: string, cool: bigint ... 7 more fields]


Find the distribution of the "stars" variable

In [3]:
val rev2=rev.toDF()
rev2.createOrReplaceTempView("review")
spark.sql("select stars,count(review_id) from review group by stars").show()

+-----+----------------+
|stars|count(review_id)|
+-----+----------------+
|    5|         2253348|
|    1|          731363|
|    3|          615481|
|    2|          438161|
|    4|         1223316|
+-----+----------------+



rev2: org.apache.spark.sql.DataFrame = [business_id: string, cool: bigint ... 7 more fields]


### 2. Feature Engineering
Turn "stars" values of 4 and 5 into a rating of 1, and "stars" values of 1, 2, and 3 into a rating of 0

In [4]:
import org.apache.spark.sql.functions.{when,_}
import spark.sqlContext.implicits._
val rev3=rev2.withColumn("rating", expr("case when stars = 4 then 1 " +
                                       "when stars = 5 then 1 " +
                                       "else 0 end "))
rev3.createOrReplaceTempView("review")
rev3.show(2)

+--------------------+----+----------+-----+--------------------+-----+--------------------+------+--------------------+------+
|         business_id|cool|      date|funny|           review_id|stars|                text|useful|             user_id|rating|
+--------------------+----+----------+-----+--------------------+-----+--------------------+------+--------------------+------+
|0W4lkclzZThpx3V65...|   0|2016-05-28|    0|v0i_UHJMo_hPBq9bx...|    5|Love the staff, l...|     0|bv2nCi5Qv5vroFiqK...|     1|
|AEx2SYEUJmTxVVB18...|   0|2016-05-28|    0|vkVSCC7xljjrAI4UG...|    5|Super simple plac...|     0|bv2nCi5Qv5vroFiqK...|     1|
+--------------------+----+----------+-----+--------------------+-----+--------------------+------+--------------------+------+
only showing top 2 rows



import org.apache.spark.sql.functions.{when, _}
import spark.sqlContext.implicits._
rev3: org.apache.spark.sql.DataFrame = [business_id: string, cool: bigint ... 8 more fields]


Distribution of rating before downsampling

In [5]:
spark.sql("select rating,count(review_id) from review group by rating").show()

+------+----------------+
|rating|count(review_id)|
+------+----------------+
|     1|         3476664|
|     0|         1785005|
+------+----------------+



The rating attribute is not balanced, with a rating of 1 being nearly twice as likely as a rating of 0.

Now we downsample the majority class (rating 1) to the size of the minority class, and keep only about 10% of the balanced data to make the dataset size manageable.

In [5]:
import org.apache.spark.sql.DataFrameStatFunctions
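//Keep 10% of the rating=0 rows and 10% * (1785005/3476664) of the rating=1 rows, so that both classes end up roughly the same size
val frac= Map(0 -> .1, 1 -> 1785005.0/34766640.0)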
val rev4 = rev3.stat.sampleBy("rating", frac, 111)
rev4.cache().groupBy("rating").count().show()

+------+------+
|rating| count|
+------+------+
|     1|178580|
|     0|178418|
+------+------+



import org.apache.spark.sql.DataFrameStatFunctions
frac: scala.collection.immutable.Map[Int,Double] = Map(0 -> 0.1, 1 -> 0.05134246507571626)
rev4: org.apache.spark.sql.DataFrame = [business_id: string, cool: bigint ... 8 more fields]
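
The sampling fractions above can also be derived from the class counts instead of being typed in by hand. The following is a minimal sketch in plain Scala (no Spark needed); the object name `BalancedSampling` and the helper `fractions` are illustrative and not part of the notebook.

```scala
object BalancedSampling {
  // Per-class sampling fractions that shrink every class to roughly
  // `baseFraction` of the smallest class's size.
  def fractions(counts: Map[Int, Long], baseFraction: Double): Map[Int, Double] = {
    val smallest = counts.values.min.toDouble
    counts.map { case (label, n) => label -> baseFraction * smallest / n }
  }

  // Class counts reported by the "Distribution of rating" query above.
  val counts: Map[Int, Long] = Map(1 -> 3476664L, 0 -> 1785005L)
  val frac: Map[Int, Double] = fractions(counts, 0.1)

  // Expected number of rows kept per class. The actual counts differ slightly
  // because each row is kept or dropped by an independent random draw.
  val expected: Map[Int, Double] = counts.map { case (label, n) => label -> n * frac(label) }
}
```

Both classes have an expected size of about 178,500 rows, which is close to the 178,580 and 178,418 rows that `sampleBy` returned.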


Retain only rating and text values from the dataset, and remove any rows with missing text.

In [6]:
rev4.createOrReplaceTempView("rev4")
val rating_data2 = spark.sql("select rating, text from rev4").toDF()
val rating_data= rating_data2.filter("text is not null and trim(text) != ''").select($"text".alias("text_field"), $"rating")

rating_data2: org.apache.spark.sql.DataFrame = [rating: int, text: string]
rating_data: org.apache.spark.sql.DataFrame = [text_field: string, rating: int]
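
The intent of the filter, keeping only rows whose text is neither null nor blank, can be stated as a plain Scala predicate. This is only an analogue for illustration; `TextFilter` and `hasText` are hypothetical names, and the sample strings are made up.

```scala
object TextFilter {
  // True only when the text is present (not null) and contains
  // at least one non-whitespace character.
  def hasText(text: String): Boolean =
    Option(text).exists(_.trim.nonEmpty)

  // Made-up sample values: one real review, one blank, one empty, one missing.
  val samples: Seq[String] = Seq("Great tacos", "   ", "", null)
  val kept: Seq[String] = samples.filter(hasText)
}
```

In SQL the null check has to be written as `text is not null`, because any comparison with `null` using `=` or `!=` evaluates to null rather than true or false.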


Create a pipeline to extract TF-IDF vectors from the data. The pipeline will tokenize the text, remove punctuation, remove stop words, stem the words, count them into bag-of-words vectors, and turn those into TF-IDF vectors.

In [7]:
import org.apache.spark.ml.feature._
import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.feature.Stemmer

val tokenizer = new RegexTokenizer().setMinTokenLength(3).setToLowercase(true).setInputCol("text_field").setOutputCol("text_words")

//Defining a udf that replaces every punctuation character with a space in each token
def removePunc(words:Seq[String]):Seq[String]={
 words.map(_.replaceAll("\\p{Punct}"," "))
}

//val removePuncUDF=udf(removePunc(_:Seq[String]))
spark.udf.register("removePuncUDF",removePunc(_:Seq[String]) )

//use the removePuncUDF to remove all punctuation 
val puncRemover = new SQLTransformer().setStatement("SELECT removePuncUDF(text_words) as text_field, rating from __THIS__ ")
val stopWordRemover=new StopWordsRemover().setInputCol("text_field").setOutputCol("filtered_text")
val stemmer = new Stemmer().setInputCol("filtered_text").setOutputCol("stemmed_text")
val vectorizer = new CountVectorizer().setMinDF(100).setInputCol("stemmed_text").setOutputCol("text_BOW")
val tfidf = new IDF().setInputCol("text_BOW").setOutputCol("text_TFIDF")

import org.apache.spark.ml.feature._
import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.feature.Stemmer
tokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_7b9db5ab3be0
removePunc: (words: Seq[String])Seq[String]
puncRemover: org.apache.spark.ml.feature.SQLTransformer = sql_938dbbbde055
stopWordRemover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_86a87e0a016e
stemmer: org.apache.spark.mllib.feature.Stemmer = stemmer_e5c362935948
vectorizer: org.apache.spark.ml.feature.CountVectorizer = cntVec_1c8cef4c6415
tfidf: org.apache.spark.ml.feature.IDF = idf_8c30f17438df
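
To make the last two pipeline stages concrete, here is a small plain-Scala sketch of TF-IDF on a toy corpus. It assumes raw term counts as the term frequency and the smoothed inverse document frequency log((m + 1) / (df + 1)) that the Spark ML documentation describes for `IDF`. The object name, helper names, and the three example documents are hypothetical, and the `minDF` cutoff of `CountVectorizer` is not modeled.

```scala
object TfIdfSketch {
  // Smoothed inverse document frequency: log((m + 1) / (df + 1)),
  // where m is the number of documents and df the number containing the term.
  def idf(numDocs: Int, docFreq: Int): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  // TF-IDF weight per term and document, with raw counts as term frequency.
  def tfidf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    docs.map { doc =>
      doc.groupBy(identity).map { case (t, occ) =>
        t -> occ.size * idf(docs.size, docFreq(t))
      }
    }
  }

  // Made-up, already stemmed documents.
  val docs: Seq[Seq[String]] = Seq(
    Seq("great", "food", "great"),
    Seq("bad", "food"),
    Seq("great", "servic"))
  val weights: Seq[Map[String, Double]] = tfidf(docs)
}
```

A term that occurs in every document gets an IDF of zero under this formula, so very common words contribute nothing to the feature vector even if the stop-word remover misses them.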


### 3. Machine Learning Pipelines

In [8]:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._


##### Logistic Regression

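Implement Logistic Regression with 3-fold cross validation. Note that the CrossValidator here wraps only the classifier, so the `tfidf.minDocFreq` entries in the grid should have no effect on the IDF stage, which is fitted once before cross validation starts; tuning that parameter would require cross-validating the whole pipeline. The same holds for the other models in this notebook.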

In [11]:
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setLabelCol("rating").setFeaturesCol("text_TFIDF")
val paramGrid =new ParamGridBuilder()
             .addGrid(lr.regParam, Array(0.01, 0.5, 2.0))
             .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
             .addGrid(tfidf.minDocFreq, Array(5,10))
             .build()
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("rating").setMetricName("areaUnderROC")
val cv = new CrossValidator().setEstimator(lr).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)

val pipeline = new Pipeline().setStages(Array(tokenizer,puncRemover,stopWordRemover, stemmer, vectorizer, tfidf,cv))

val Array(training,testing)=rating_data.randomSplit(Array(0.8,0.2),111)
val pipelineModel = pipeline.fit(training)
val predictions = pipelineModel.transform(testing)

predictions.select("rating", "prediction", "probability", "stemmed_text").show(5)

val AUC = evaluator.evaluate(predictions)
println(s"AUC for LR on test data = $AUC")

+------+----------+--------------------+--------------------+
|rating|prediction|         probability|        stemmed_text|
+------+----------+--------------------+--------------------+
|     0|       1.0|[0.16141907685102...|[ ovr , rate,    ...|
|     0|       0.0|[0.91705661215902...|[ 16 50, buger, r...|
|     0|       1.0|[0.37502600527578...|[beer, asid,  whi...|
|     0|       1.0|[0.41504192095022...|[ 33, resort, fee...|
|     0|       0.0|[0.99983369576327...|[shop, wasn t, vi...|
+------+----------+--------------------+--------------------+
only showing top 5 rows

AUC for LR on test data = 0.9440929929897601


import org.apache.spark.ml.classification.LogisticRegression
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_e26d13b3450f
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	logreg_e26d13b3450f-elasticNetParam: 0.0,
	idf_a072585a67a4-minDocFreq: 5,
	logreg_e26d13b3450f-regParam: 0.01
}, {
	logreg_e26d13b3450f-elasticNetParam: 0.0,
	idf_a072585a67a4-minDocFreq: 5,
	logreg_e26d13b3450f-regParam: 0.5
}, {
	logreg_e26d13b3450f-elasticNetParam: 0.0,
	idf_a072585a67a4-minDocFreq: 5,
	logreg_e26d13b3450f-regParam: 2.0
}, {
	logreg_e26d13b3450f-elasticNetParam: 0.0,
	idf_a072585a67a4-minDocFreq: 10,
	logreg_e26d13b3450f-regParam: 0.01
}, {
	logreg_e26d13b3450f-elasticNetParam: 0.0,
	idf_a072585a67a4-minDocFreq: 10,
	logreg_e26d13b3450f-regParam: 0.5
}, {
	logreg_e...
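
The evaluator above reports the area under the ROC curve. One way to read that number is as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that pairwise definition in plain Scala. It is an illustration only, not how `BinaryClassificationEvaluator` is implemented, and the scores are made up.

```scala
object AucSketch {
  // AUC as the fraction of (positive, negative) pairs in which the positive
  // example has the higher score; ties count as one half.
  def auc(scored: Seq[(Double, Int)]): Double = {
    val pos = scored.collect { case (s, 1) => s }
    val neg = scored.collect { case (s, 0) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size.toLong * neg.size)
  }

  // Hypothetical (score for class 1, true rating) pairs, not notebook output.
  val example: Seq[(Double, Int)] =
    Seq((0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.2, 0))
  val value: Double = auc(example)
}
```

In the toy data, five of the six positive/negative pairs are ordered correctly, so the AUC is 5/6. The pairwise loop is quadratic in the number of rows and is meant for small examples only.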

##### Random Forest

Implement Random Forest with 3-fold cross validation. 

In [12]:
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

val rf = new RandomForestClassifier().setLabelCol("rating").setFeaturesCol("text_TFIDF")
val paramGrid =new ParamGridBuilder()
             .addGrid(rf.numTrees, Array(5,10,15))
             .addGrid(tfidf.minDocFreq, Array(5,10))
             .build()
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("rating").setMetricName("areaUnderROC")
val cv = new CrossValidator().setEstimator(rf).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)

val pipeline = new Pipeline().setStages(Array(tokenizer,puncRemover,stopWordRemover, stemmer, vectorizer, tfidf, cv))

val Array(training,testing)=rating_data.randomSplit(Array(0.8,0.2),111)
val pipelineModel = pipeline.fit(training)
val predictions = pipelineModel.transform(testing)

predictions.select("rating", "prediction", "probability", "stemmed_text").show(5)

val AUC = evaluator.evaluate(predictions)
println(s"AUC for RF on test data = $AUC")

+------+----------+--------------------+--------------------+
|rating|prediction|         probability|        stemmed_text|
+------+----------+--------------------+--------------------+
|     0|       1.0|[0.47019936708999...|[ ovr , rate,    ...|
|     0|       0.0|[0.53130756207492...|[ 16 50, buger, r...|
|     0|       0.0|[0.50456762529794...|[beer, asid,  whi...|
|     0|       1.0|[0.47151702017081...|[ 33, resort, fee...|
|     0|       0.0|[0.54541499699992...|[shop, wasn t, vi...|
+------+----------+--------------------+--------------------+
only showing top 5 rows

AUC for RF on test data = 0.7886449721370802


import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
rf: org.apache.spark.ml.classification.RandomForestClassifier = rfc_5854619fdaba
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	idf_a072585a67a4-minDocFreq: 5,
	rfc_5854619fdaba-numTrees: 5
}, {
	idf_a072585a67a4-minDocFreq: 10,
	rfc_5854619fdaba-numTrees: 5
}, {
	idf_a072585a67a4-minDocFreq: 5,
	rfc_5854619fdaba-numTrees: 10
}, {
	idf_a072585a67a4-minDocFreq: 10,
	rfc_5854619fdaba-numTrees: 10
}, {
	idf_a072585a67a4-minDocFreq: 5,
	rfc_5854619fdaba-numTrees: 15
}, {
	idf_a072585a67a4-minDocFreq: 10,
	rfc_5854619fdaba-numTrees: 15
})
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_b97143e4f40f
cv: org.apache.spark.ml.tuning.CrossValidator...
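
The cost of each cross-validated search grows with the product of the grid sizes and the number of folds. The plain-Scala sketch below rebuilds the logistic regression and random forest grids as simple maps and counts the model fits, assuming one fit per parameter map and fold plus one final refit of the best setting on the full training set. `GridCost`, `grid`, and `fits` are illustrative names.

```scala
object GridCost {
  // Cartesian product of the candidate values for each named hyperparameter.
  def grid(axes: Seq[(String, Seq[Double])]): Seq[Map[String, Double]] =
    axes.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
      for (m <- acc; v <- values) yield m + (name -> v)
    }

  // Same candidate values as the logistic regression grid above.
  val lrGrid: Seq[Map[String, Double]] = grid(Seq(
    "regParam" -> Seq(0.01, 0.5, 2.0),
    "elasticNetParam" -> Seq(0.0, 0.5, 1.0),
    "minDocFreq" -> Seq(5.0, 10.0)))

  // Same candidate values as the random forest grid above.
  val rfGrid: Seq[Map[String, Double]] = grid(Seq(
    "numTrees" -> Seq(5.0, 10.0, 15.0),
    "minDocFreq" -> Seq(5.0, 10.0)))

  val folds = 3

  // One fit per (parameter map, fold), plus a final refit with the best map.
  def fits(g: Seq[Map[String, Double]]): Int = g.size * folds + 1
}
```

With these grids, logistic regression needs 18 parameter maps and 55 fits, while random forest needs 6 parameter maps and 19 fits.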

##### Gradient-Boosted Trees (GBT) Classification

Implement gradient-boosted tree (GBT) classification with 3-fold cross validation.

In [10]:
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}

val gbt = new GBTClassifier().setLabelCol("rating").setFeaturesCol("text_TFIDF")
val paramGrid =new ParamGridBuilder()
             .addGrid(gbt.maxDepth, Array(2,5))
             .addGrid(gbt.maxIter, Array(5, 10))
             .addGrid(tfidf.minDocFreq, Array(5,10))
             .build()
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("rating").setMetricName("areaUnderROC")
val cv = new CrossValidator().setEstimator(gbt).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)

val pipeline = new Pipeline().setStages(Array(tokenizer,puncRemover,stopWordRemover, stemmer, vectorizer, tfidf,cv))

val Array(training,testing)=rating_data.randomSplit(Array(0.8,0.2),111)
val pipelineModel = pipeline.fit(training)
val predictions = pipelineModel.transform(testing)


predictions.select("rating", "prediction", "probability", "stemmed_text").show(5)

val AUC = evaluator.evaluate(predictions)
println(s"AUC for GBT on test data = $AUC")

+------+----------+--------------------+--------------------+
|rating|prediction|         probability|        stemmed_text|
+------+----------+--------------------+--------------------+
|     0|       0.0|[0.59825391532020...|[ ovr , rate,    ...|
|     0|       0.0|[0.62839190655241...|[ 16 50, buger, r...|
|     0|       1.0|[0.22131837077660...|[beer, asid,  whi...|
|     0|       0.0|[0.56818077137962...|[ 33, resort, fee...|
|     0|       0.0|[0.80144694198160...|[shop, wasn t, vi...|
+------+----------+--------------------+--------------------+
only showing top 5 rows

AUC for GBT on test data = 0.8300411449788886


import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
gbt: org.apache.spark.ml.classification.GBTClassifier = gbtc_6209bebe4ae4
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	gbtc_6209bebe4ae4-maxDepth: 2,
	gbtc_6209bebe4ae4-maxIter: 5,
	idf_5bc1d685964d-minDocFreq: 5
}, {
	gbtc_6209bebe4ae4-maxDepth: 5,
	gbtc_6209bebe4ae4-maxIter: 5,
	idf_5bc1d685964d-minDocFreq: 5
}, {
	gbtc_6209bebe4ae4-maxDepth: 2,
	gbtc_6209bebe4ae4-maxIter: 10,
	idf_5bc1d685964d-minDocFreq: 5
}, {
	gbtc_6209bebe4ae4-maxDepth: 5,
	gbtc_6209bebe4ae4-maxIter: 10,
	idf_5bc1d685964d-minDocFreq: 5
}, {
	gbtc_6209bebe4ae4-maxDepth: 2,
	gbtc_6209bebe4ae4-maxIter: 5,
	idf_5bc1d685964d-minDocFreq: 10
}, {
	gbtc_6209bebe4ae4-maxDepth: 5,
	gbtc_6209bebe4ae4-maxIter: 5,
	idf_5bc1d68...

### 4. Adding More Features

Read in the user data file and inspect the layout of the table

In [9]:
val user=spark.read.option("header","true").option("inferschema", "true").json("/hadoop-user/data/user.json")
user.cache()
user.printSchema()
user.show(3)

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- fans: long (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: str

user: org.apache.spark.sql.DataFrame = [average_stars: double, compliment_cool: bigint ... 20 more fields]


Select only relevant columns from rating and user datasets

In [10]:
user.createOrReplaceTempView("user")
val user_data = spark.sql("select average_stars, user_id from user").toDF()
val rating_data = spark.sql("select rating, text, user_id from rev4").toDF()
user_data.createOrReplaceTempView("user_data")
rating_data.createOrReplaceTempView("rating_data")

user_data: org.apache.spark.sql.DataFrame = [average_stars: double, user_id: string]
rating_data: org.apache.spark.sql.DataFrame = [rating: int, text: string ... 1 more field]


Join datasets together on user_id

In [11]:
val merged_data = spark.sql("select * from rating_data left join user_data on rating_data.user_id=user_data.user_id")
merged_data.show(3)

+------+--------------------+--------------------+-------------+--------------------+
|rating|                text|             user_id|average_stars|             user_id|
+------+--------------------+--------------------+-------------+--------------------+
|     0|We got our buffet...|-3i9bhfvrM3F1wsC9...|         4.06|-3i9bhfvrM3F1wsC9...|
|     1|Un endroit sympat...|-7JSlmBJKUQwREG_y...|          4.5|-7JSlmBJKUQwREG_y...|
|     1|I was unsure goin...|-7JSlmBJKUQwREG_y...|          4.5|-7JSlmBJKUQwREG_y...|
+------+--------------------+--------------------+-------------+--------------------+
only showing top 3 rows



merged_data: org.apache.spark.sql.DataFrame = [rating: int, text: string ... 3 more fields]


Calculate correlation between rating and user average_stars.

In [14]:
merged_data.toDF().stat.corr("rating", "average_stars")

res9: Double = 0.4569367111513306


With a Pearson correlation of about 0.46, average_stars and rating are moderately correlated, so average_stars could be a useful predictor in our machine learning models.
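
For reference, the Pearson correlation that `stat.corr` returns by default can be written out in a few lines of plain Scala. This is a sketch on made-up toy values, not the notebook's data; `PearsonSketch` and its members are hypothetical names.

```scala
object PearsonSketch {
  // Pearson correlation: covariance divided by the product of the
  // standard deviations (the 1/n factors cancel, so plain sums suffice).
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two equally sized samples")
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }

  // Made-up toy values: a binary rating and a user's average star rating.
  val rating: Seq[Double] = Seq(0.0, 1.0, 1.0, 0.0, 1.0)
  val averageStars: Seq[Double] = Seq(2.0, 4.5, 4.0, 3.0, 3.5)
  val r: Double = pearson(rating, averageStars)
}
```

Because rating is binary, this value is a point-biserial correlation: it measures how far apart the mean average_stars of the two rating groups are, relative to the overall spread.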

Select only the relevant columns from the merged dataset, and drop rows with missing or empty text.

In [12]:
val rating_data= merged_data.filter("text is not null and trim(text) != ''").select($"text".alias("text_field"), $"average_stars", $"rating")
rating_data.take(1).foreach(println)

[We got our buffet lunch comped from our few days of getting drilled at the tables.  The buffet price was reasonable anyway, around $13.

It was rather small as far as Vegas buffets go, but they have all the basics: fresh carved prime rib, a few mexican and chinese options, fresh salads, plus lots of the standard comfort foods.  Nothing I tried was gross, but nothing fabulous either.  Decent, cheap buffet food.  Solid three stars.

Only bad thing, we had to stand at the desk and wait about 10 minutes before getting seated... and there were numerous empty tables in all sections.  The desk people were visibly bored and disinterested with their job, life, everything and everybody.,4.06,0]


rating_data: org.apache.spark.sql.DataFrame = [text_field: string, average_stars: double ... 1 more field]


Create the TF-IDF pipeline again, this time carrying the average_stars column through the punctuation-removal step

In [13]:
val tokenizer = new RegexTokenizer().setMinTokenLength(3).setToLowercase(true).setInputCol("text_field").setOutputCol("text_words")

//Defining a udf that replaces every punctuation character with a space in each token
def removePunc(words:Seq[String]):Seq[String]={
 words.map(_.replaceAll("\\p{Punct}"," "))
}

//val removePuncUDF=udf(removePunc(_:Seq[String]))
spark.udf.register("removePuncUDF",removePunc(_:Seq[String]) )

//use the removePuncUDF to remove all punctuation
val puncRemover = new SQLTransformer().setStatement("SELECT removePuncUDF(text_words) as text_field, average_stars, rating from __THIS__ ")
val stopWordRemover=new StopWordsRemover().setInputCol("text_field").setOutputCol("filtered_text")
val stemmer = new Stemmer().setInputCol("filtered_text").setOutputCol("stemmed_text")
val vectorizer = new CountVectorizer().setMinDF(100).setInputCol("stemmed_text").setOutputCol("text_BOW")
val tfidf = new IDF().setInputCol("text_BOW").setOutputCol("text_TFIDF")

tokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_2b61ddd8a5d5
removePunc: (words: Seq[String])Seq[String]
puncRemover: org.apache.spark.ml.feature.SQLTransformer = sql_013b54f232d6
stopWordRemover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_4168e9929dbe
stemmer: org.apache.spark.mllib.feature.Stemmer = stemmer_3f7371ec2a9f
vectorizer: org.apache.spark.ml.feature.CountVectorizer = cntVec_57365845bd86
tfidf: org.apache.spark.ml.feature.IDF = idf_21c89933cb6f
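
Tokens such as `wasn t` and `2pm 6pm` in the prediction tables come from the fact that `removePunc` replaces punctuation with a space rather than deleting it. The plain-Scala sketch below reproduces that behavior on a made-up sentence. The `tokenize` helper only approximates the `RegexTokenizer` settings used here (lowercasing, splitting on whitespace, minimum token length 3) and is an assumption for illustration.

```scala
object PunctuationDemo {
  // Rough stand-in for the tokenizer: lowercase, split on whitespace,
  // drop tokens shorter than 3 characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").toSeq.filter(_.length >= 3)

  // Same transformation as removePunc: each punctuation character becomes a space.
  def removePunc(words: Seq[String]): Seq[String] =
    words.map(_.replaceAll("\\p{Punct}", " "))

  // Made-up example sentence.
  val tokens: Seq[String] =
    removePunc(tokenize("The shop wasn't open at 2pm-6pm, sadly."))
}
```

The result contains `"wasn t"`, `"2pm 6pm "`, and `"sadly "`, with embedded and trailing spaces. One consequence is that the stop-word remover sees `wasn t` rather than `wasn't`, so contractions may pass through unfiltered. Replacing punctuation with an empty string, or trimming tokens afterwards, would be one way to change that, at the cost of different features and different results than those shown here.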


Turn average_stars into a vector, standardize the values for average_stars and merge together TFIDF data with average_stars data

In [14]:
val vectorizer_numeric=new VectorAssembler().setInputCols(Array("average_stars")).setOutputCol("numeric_features")
val standardizer=new StandardScaler().setWithMean(true).setInputCol("numeric_features").setOutputCol("numeric_features_vector")
val vectorizer_assemb=new VectorAssembler().setInputCols(Array("numeric_features_vector","text_TFIDF")).setOutputCol("features")

vectorizer_numeric: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_eff6df66b748
standardizer: org.apache.spark.ml.feature.StandardScaler = stdScal_611869f96a6b
vectorizer_assemb: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_583ab2a97d04
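
The standardization step rescales average_stars to zero mean and unit standard deviation so that it sits on a scale comparable to the other features. The sketch below shows the computation in plain Scala, assuming the corrected sample standard deviation (n - 1 in the denominator) that the Spark documentation describes for `StandardScaler`. The input values are made up, and the names are hypothetical.

```scala
object StandardizeSketch {
  // z-score: subtract the mean, divide by the sample standard deviation.
  // A constant column (standard deviation 0) is mapped to all zeros.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    require(xs.size > 1, "need at least two values")
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
    val std = math.sqrt(variance)
    xs.map(x => if (std == 0.0) 0.0 else (x - mean) / std)
  }

  // Made-up average_stars values covering the 1 to 5 range.
  val averageStars: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
  val z: Seq[Double] = standardize(averageStars)
}
```

After scaling, the middle value maps to 0 and the extremes to roughly -1.26 and +1.26. Centering is harmless here because only the single dense average_stars column is scaled; the sparse TF-IDF vector is appended afterwards and left untouched.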


##### Logistic Regression

Implement Logistic Regression with 3-fold cross validation.

In [15]:
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setLabelCol("rating").setFeaturesCol("features")
val paramGrid =new ParamGridBuilder()
             .addGrid(lr.regParam, Array(0.01, 0.5, 2.0))
             .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
             .addGrid(tfidf.minDocFreq, Array(5,10))
             .build()
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("rating").setMetricName("areaUnderROC")
val cv = new CrossValidator().setEstimator(lr).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)

val pipeline = new Pipeline().setStages(Array(tokenizer, puncRemover,stopWordRemover, stemmer, vectorizer, tfidf, vectorizer_numeric,standardizer, vectorizer_assemb,cv))

val Array(training,testing)=rating_data.randomSplit(Array(0.8,0.2),111)
val pipelineModel = pipeline.fit(training)
val predictions = pipelineModel.transform(testing)

predictions.select("rating", "average_stars","prediction", "probability", "stemmed_text").show(5)

val AUC = evaluator.evaluate(predictions)
println(s"AUC for LR on test data = $AUC")

+------+-------------+----------+--------------------+--------------------+
|rating|average_stars|prediction|         probability|        stemmed_text|
+------+-------------+----------+--------------------+--------------------+
|     0|          1.0|       0.0|[0.99873762830949...|[     , must, rea...|
|     0|         4.14|       1.0|[0.11885833053694...|[star, rating , i...|
|     1|         3.46|       0.0|[0.75733271509246...|[2pm 6pm, happi, ...|
|     0|         1.86|       0.0|[0.85852915361655...|[half, hour, hole...|
|     1|         3.25|       1.0|[0.00607100661132...|[coupl, week, ago...|
+------+-------------+----------+--------------------+--------------------+
only showing top 5 rows

AUC for LR on test data = 0.9510706639881208


import org.apache.spark.ml.classification.LogisticRegression
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_d2ecbf0bcf75
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	logreg_d2ecbf0bcf75-elasticNetParam: 0.0,
	idf_21c89933cb6f-minDocFreq: 5,
	logreg_d2ecbf0bcf75-regParam: 0.01
}, {
	logreg_d2ecbf0bcf75-elasticNetParam: 0.0,
	idf_21c89933cb6f-minDocFreq: 5,
	logreg_d2ecbf0bcf75-regParam: 0.5
}, {
	logreg_d2ecbf0bcf75-elasticNetParam: 0.0,
	idf_21c89933cb6f-minDocFreq: 5,
	logreg_d2ecbf0bcf75-regParam: 2.0
}, {
	logreg_d2ecbf0bcf75-elasticNetParam: 0.5,
	idf_21c89933cb6f-minDocFreq: 5,
	logreg_d2ecbf0bcf75-regParam: 0.01
}, {
	logreg_d2ecbf0bcf75-elasticNetParam: 0.5,
	idf_21c89933cb6f-minDocFreq: 5,
	logreg_d2ecbf0bcf75-regParam: 0.5
}, {
	logreg_d2e...

##### Random Forest

Implement Random Forest with 3-fold cross validation.

In [16]:
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

val rf = new RandomForestClassifier().setLabelCol("rating").setFeaturesCol("features")
val paramGrid =new ParamGridBuilder()
             .addGrid(rf.numTrees, Array(5,10,15))
             .addGrid(tfidf.minDocFreq, Array(5,10))
             .build()
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("rating").setMetricName("areaUnderROC")
val cv = new CrossValidator().setEstimator(rf).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)

val pipeline = new Pipeline().setStages(Array(tokenizer, puncRemover,stopWordRemover, stemmer, vectorizer, tfidf, vectorizer_numeric,standardizer, vectorizer_assemb,cv))

val Array(training,testing)=rating_data.randomSplit(Array(0.8,0.2),111)
val pipelineModel = pipeline.fit(training)
val predictions = pipelineModel.transform(testing)

predictions.select("rating", "average_stars","prediction", "probability", "stemmed_text").show(5)

val AUC = evaluator.evaluate(predictions)
println(s"AUC for RF on test data = $AUC")

+------+-------------+----------+--------------------+--------------------+
|rating|average_stars|prediction|         probability|        stemmed_text|
+------+-------------+----------+--------------------+--------------------+
|     0|          1.0|       0.0|[0.52121782935173...|[     , must, rea...|
|     0|         4.14|       1.0|[0.48076607185631...|[star, rating , i...|
|     1|         3.46|       0.0|[0.51212080904525...|[2pm 6pm, happi, ...|
|     0|         1.86|       0.0|[0.50059466255467...|[half, hour, hole...|
|     1|         3.25|       1.0|[0.48686552871443...|[coupl, week, ago...|
+------+-------------+----------+--------------------+--------------------+
only showing top 5 rows

AUC for RF on test data = 0.8463778067077127


import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
rf: org.apache.spark.ml.classification.RandomForestClassifier = rfc_d08e1cefa986
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	idf_21c89933cb6f-minDocFreq: 5,
	rfc_d08e1cefa986-numTrees: 5
}, {
	idf_21c89933cb6f-minDocFreq: 5,
	rfc_d08e1cefa986-numTrees: 10
}, {
	idf_21c89933cb6f-minDocFreq: 5,
	rfc_d08e1cefa986-numTrees: 15
}, {
	idf_21c89933cb6f-minDocFreq: 10,
	rfc_d08e1cefa986-numTrees: 5
}, {
	idf_21c89933cb6f-minDocFreq: 10,
	rfc_d08e1cefa986-numTrees: 10
}, {
	idf_21c89933cb6f-minDocFreq: 10,
	rfc_d08e1cefa986-numTrees: 15
})
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_a72356c55fec
cv: org.apache.spark.ml.tuning.CrossValidator...

##### Gradient-Boosted Trees (GBT) Classification

Implement gradient-boosted tree (GBT) classification with 3-fold cross validation.

In [18]:
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}

val gbt = new GBTClassifier().setLabelCol("rating").setFeaturesCol("features")
val paramGrid =new ParamGridBuilder()
             .addGrid(gbt.maxDepth, Array(2,5))
             .addGrid(gbt.maxIter, Array(5, 10))
             .addGrid(tfidf.minDocFreq, Array(5,10))
             .build()
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("rating").setMetricName("areaUnderROC")
val cv = new CrossValidator().setEstimator(gbt).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)

val pipeline = new Pipeline().setStages(Array(tokenizer, puncRemover,stopWordRemover, stemmer, vectorizer, tfidf, vectorizer_numeric,standardizer, vectorizer_assemb,cv))

val Array(training,testing)=rating_data.randomSplit(Array(0.8,0.2),111)
val pipelineModel = pipeline.fit(training)
val predictions = pipelineModel.transform(testing)

predictions.select("rating", "average_stars","prediction", "probability", "stemmed_text").show(5)

val AUC = evaluator.evaluate(predictions)
println(s"AUC for GBT on test data = $AUC")

+------+-------------+----------+--------------------+--------------------+
|rating|average_stars|prediction|         probability|        stemmed_text|
+------+-------------+----------+--------------------+--------------------+
|     0|          1.0|       0.0|[0.92146300349403...|[     , must, rea...|
|     0|         4.14|       1.0|[0.40821797604501...|[star, rating , i...|
|     1|         3.46|       0.0|[0.66313501555915...|[2pm 6pm, happi, ...|
|     0|         1.86|       0.0|[0.84490323437291...|[half, hour, hole...|
|     1|         3.25|       1.0|[0.20281641208289...|[coupl, week, ago...|
+------+-------------+----------+--------------------+--------------------+
only showing top 5 rows

AUC for GBT on test data = 0.8592435347342001


import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
gbt: org.apache.spark.ml.classification.GBTClassifier = gbtc_c56c749ecfc0
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
	gbtc_c56c749ecfc0-maxDepth: 2,
	gbtc_c56c749ecfc0-maxIter: 5,
	idf_21c89933cb6f-minDocFreq: 5
}, {
	gbtc_c56c749ecfc0-maxDepth: 2,
	gbtc_c56c749ecfc0-maxIter: 10,
	idf_21c89933cb6f-minDocFreq: 5
}, {
	gbtc_c56c749ecfc0-maxDepth: 2,
	gbtc_c56c749ecfc0-maxIter: 5,
	idf_21c89933cb6f-minDocFreq: 10
}, {
	gbtc_c56c749ecfc0-maxDepth: 2,
	gbtc_c56c749ecfc0-maxIter: 10,
	idf_21c89933cb6f-minDocFreq: 10
}, {
	gbtc_c56c749ecfc0-maxDepth: 5,
	gbtc_c56c749ecfc0-maxIter: 5,
	idf_21c89933cb6f-minDocFreq: 5
}, {
	gbtc_c56c749ecfc0-maxDepth: 5,
	gbtc_c56c749ecfc0-maxIter: 10,
	idf_21c89...

For all three models, adding average_stars improved the test AUC: from about 0.944 to 0.951 for logistic regression, from 0.789 to 0.846 for random forest, and from 0.830 to 0.859 for gradient-boosted trees. The random forest model gained the most, nearly six percentage points. It would therefore appear that the rating tendencies of individual users affect how they grade individual restaurants.
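
The comparison can be tabulated with a few lines of plain Scala. The AUC values below are copied from the cell outputs in this notebook and rounded to four decimals; the object and value names are illustrative.

```scala
object AucComparison {
  // Test-set AUC per model, text features only (sections above).
  val textOnly: Map[String, Double] =
    Map("LR" -> 0.9441, "RF" -> 0.7886, "GBT" -> 0.8300)

  // Test-set AUC per model after adding average_stars.
  val withAverageStars: Map[String, Double] =
    Map("LR" -> 0.9511, "RF" -> 0.8464, "GBT" -> 0.8592)

  // Absolute AUC gain per model, in percentage points.
  val gainPoints: Map[String, Double] =
    textOnly.map { case (model, before) =>
      model -> (withAverageStars(model) - before) * 100
    }

  // Model that benefited most from the extra feature.
  val largestGain: String = gainPoints.maxBy(_._2)._1
}
```

Even after the gain, logistic regression remains the strongest of the three models on this test split.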