# Part 3 Modeling and Evaluation

#### This notebook contains modeling results on:

- #### BOW(bag-of-words) 
    - Logistic regression + GridSearch
    - Naive Bayes
    - TFIDF + Logistic regressioon
    - TFIDF + Naive Bayes
- #### Bigram 
    - TFIDF + Logistic regression
    - TFIDF + Naive Bayes
    - TFIDF + Decision Tree
- #### WordEmbeddings 
    - Logistic regression
    - Decision Tree
    - Random Forest
    
#### AreaUnderROC was selected as evaluation metric.

<br></br>

In [2]:
from pyspark.ml.classification import LogisticRegression, NaiveBayes, DecisionTreeClassifier,  RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

#### Load processed Word Vectors

In [3]:
df = spark.read.parquet('s3://dse230-project-data1/text_df.parquet')

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

We keep all the useful columns in one ``Dataframe``, that way for different classifiers, we can call the same Dataframe with required ``featuresCol``

In [3]:
df.printSchema()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

root
 |-- is_helpful: integer (nullable = true)
 |-- unigrams: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- bow: vector (nullable = true)
 |-- bow_tfidf: vector (nullable = true)
 |-- bigrams: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- bigram_tfidf: vector (nullable = true)
 |-- word2vec: vector (nullable = true)

#### Train/Test Split

In [4]:
(train, test) = df.randomSplit([0.7, 0.3], seed = 168)
print(f"Training Dataset Size: {train.count()}")
print(f"Test Dataset Size: {test.count()}")

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Training Dataset Size: 614742
Test Dataset Size: 263819

<br></br>

### Baseline Model: Logistic Regression on BOW (GridSearch)

In [5]:
# lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0, featuresCol='bow', labelCol='is_helpful')
# # create paramter grid
# paramGrid = ParamGridBuilder()\
#                .addGrid(lr.regParam, [0.1, 0.3, 0.5])\
#                .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.2])\
#                .build()

# # set evaluator
# evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='is_helpful')

# # create cross-validator
# crossval = CrossValidator(estimator=lr, \
#                           estimatorParamMaps=paramGrid, \
#                           evaluator=evaluator, \
#                           numFolds=5) 
 
# # run cross-validator    
# cvModel = crossval.fit(train)
# prediction = cvModel.transform(test)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

#### Note: running this cell causes an error related to the `thread cell_monitor`,then break down the whole notebook, however it still gives the result for the best parameters.

<img src="https://i.ibb.co/qNZK5bG/Screen-Shot-2021-06-04-at-11-06-07.png" width="1000">

<img src="https://i.ibb.co/Gvh68SR/Screen-Shot-2021-06-04-at-11-06-27.png" width="1000">

#### helper function to run different classifiers

In [6]:
class text_classifier:
    
    def __init__(self, clf, how, tfidf=True, w2v=False):
        self.clf = clf
        self.how = how
        self.tfidf = tfidf
        self.w2v = w2v
        
        
    def fit(self, train_data):
        if self.clf == 'lr':
            if self.how == 'bow' and self.w2v == False:
                if self.tfidf == False:
                    col = 'bow'
                else:
                    col = 'bow_tfidf'
            elif self.how == 'bigram' and self.w2v == False:
                col = 'bow_tfidf'
            else:
                col = 'word2vec'
            
            lr = LogisticRegression(maxIter=20, regParam=0.1, elasticNetParam=0, featuresCol=col, labelCol='is_helpful')
            self.model = lr.fit(train_data)
            
        elif self.clf == 'nb':
            if self.how == 'bow' and self.w2v == False:
                if self.tfidf == False:
                    col = 'bow'
                else:
                    col = 'bow_tfidf'
            elif self.how == 'bigram' and self.w2v == False:
                col = 'bow_tfidf'
            else:
                print("Input word2vec data contains negative values, consider normalize to (0, 1) before fitting naive bayes model.")
            
            nb = NaiveBayes(smoothing=1.0, modelType="multinomial", featuresCol=col, labelCol='is_helpful')
            self.model = nb.fit(train_data)
            
        elif self.clf == 'dt':
            if self.how == 'bow' and self.w2v == False:
                if self.tfidf == False:
                    col = 'bow'
                else:
                    col = 'bow_tfidf'
            elif self.how == 'bigram' and self.w2v == False:
                col = 'bow_tfidf'
            else:
                col = 'word2vec'
            
            dt = DecisionTreeClassifier(featuresCol=col, labelCol='is_helpful', maxDepth=5)
            self.model = dt.fit(train_data)
            
        elif self.clf == 'rf':
            if self.how == 'bow' and self.w2v == False:
                if self.tfidf == False:
                    col = 'bow'
                else:
                    col = 'bow_tfidf'
            elif self.how == 'bigram' and self.w2v == False:
                col = 'bow_tfidf'
            else:
                col = 'word2vec'
            
            rf = RandomForestClassifier(featuresCol='bow', labelCol='is_helpful', maxDepth=4, numTrees=100)
            self.model = rf.fit(train_data)
            
        return self
    
    
    def predict(self, test_data):
        pred = self.model.transform(test_data)
        return pred
    
    
    def accuracy(self, test_data):
        evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='is_helpful')
        pred = self.predict(test_data)
        acc = evaluator.evaluate(pred)
        return acc

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

## BOW(bag-of-words) models

In [9]:
for clf in ['lr', 'nb']:
    a = text_classifier(clf=clf, how='bow', tfidf=False)
    a.fit(train)
    a_pred = a.predict(test)
    #a_pred.write.parquet(f's3://dse230-project-data1/bigram_{clf}.parquet')
    print(f"BOW + {clf} AreaUnderROC: {a.accuracy(test)}")

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

<text_classifier object at 0x7fefdd4f63d0>
BOW + lr AreaUnderROC: 0.6163422086896138
<text_classifier object at 0x7fefdd9ab190>
BOW + nb AreaUnderROC: 0.5831755784095843

## BOW(bag-of-words) + TFIDF models

In [11]:
for clf in ['lr', 'nb']:
    a = text_classifier(clf=clf, how='bow', tfidf=True)
    a.fit(train)
    a_pred = a.predict(test)
    #a_pred.write.parquet(f's3://dse230-project-data1/bigram_{clf}.parquet')
    print(f"BOW + {clf} AreaUnderROC: {a.accuracy(test)}")

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

<text_classifier object at 0x7fefdd9ad1d0>
BOW + lr AreaUnderROC: 0.6163422086896138
<text_classifier object at 0x7fefdd4eb1d0>
BOW + nb AreaUnderROC: 0.5673705062090807

#### Comment
Since the 'BOW' representation is very sparse, we considered not to train this model with tree-based algorithms, because those algothrms normaly performs not that well on large sparse matrix. Our baseline model trained on bag-of-words with logistic regression has a AreaUnderROC 0.616. After adding tfidf term, the result of logistic regression remains about the same.

## Bigram + TFIDF models

In [7]:
for clf in ['lr', 'nb', 'dt']:
    a = text_classifier(clf=clf, how='bigram', tfidf=True)
    a.fit(train)
    a_pred = a.predict(test)
    #a_pred.write.parquet(f's3://dse230-project-data1/bigram_{clf}.parquet')
    print(f"Bigram + TFIDF + {clf} AreaUnderROC: {a.accuracy(test)}")

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

<text_classifier object at 0x7fefdf43fc90>
Bigram + TFIDF + lr AreaUnderROC: 0.6163422086896138
<text_classifier object at 0x7fefdd514610>
Bigram + TFIDF + nb AreaUnderROC: 0.5673705062090807
<text_classifier object at 0x7fefdd5237d0>
Bigram + TFIDF + dt AreaUnderROC: 0.581415152238933

#### Comment

As we can see, bigram with logistic regression model still output same result as baseline model, in this case, bigram doesn't bring us much improvement.

## Word Embeddings model
- word embedding + logistic regression
- word embedding + random forest

In [8]:
for clf in ['lr', 'rf']:
    a = text_classifier(clf=clf, how='bow', w2v=True)
    a.fit(train)
    a_pred = a.predict(test)
    #a_pred.write.parquet(f's3://dse230-project-data1/bigram_{clf}.parquet')
    print(f"WordEmbeddings + {clf} AreaUnderROC: {a.accuracy(test)}")

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

<text_classifier object at 0x7fefdf43fe10>
WordEmbeddings + lr AreaUnderROC: 0.5854188459674912
<text_classifier object at 0x7fefdf43f5d0>
WordEmbeddings + rf AreaUnderROC: 0.5256573328294689

#### Comment
The word embedding model we trained doesn't improve the performance, potential reason is when training word embedding vector, we chose a relatively small vector size(100) which limited capability of catching more information between words.

<br></br>

Then, we will use the best model we have so far to predict Airbnb reviews.

In [5]:
lr = LogisticRegression(maxIter=20, regParam=0.1, elasticNetParam=0, featuresCol='bow', labelCol='is_helpful')
lr_model = lr.fit(train)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

#### Load processed airbnb bag-of-words model

In [6]:
df2 = spark.read.parquet('s3://dse230-project-data1/text_df2.parquet/')

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [7]:
df2.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+---------+--------------------+
|            unigrams|       id|                 bow|
+--------------------+---------+--------------------+
|[respons, help, h...|214904211|(10000,[1,2,10,12...|
|[hous, perfect, c...|218048729|(10000,[0,1,2,10,...|
|[famili, babi, sp...|219935056|(10000,[7,14,20,2...|
|[great, hous, hil...|222829306|(10000,[0,2,4,18,...|
|[airbnb, exactli,...|227447978|(10000,[10,12,23,...|
|[group, wonder, s...|229828619|(10000,[0,2,3,4,5...|
|[realli, enjoy, s...|231585560|(10000,[0,2,6,7,1...|
|[thoroughli, enjo...|233713124|(10000,[0,2,8,9,1...|
|[place, real, get...|236677318|(10000,[0,1,4,16,...|
|[open, concept, h...|239212816|(10000,[2,6,7,11,...|
|[host, cancel, re...|243379080|(10000,[10,32,106...|
|[mahi, great, hos...|249212151|(10000,[0,2,10,21...|
|[hous, beauti, pi...|251934875|(10000,[13,17,21,...|
|[thank, everyth, ...|262128150|(10000,[0,7,8,19,...|
|[home, truli, gor...|264815252|(10000,[0,7,8,20,...|
|[mahi, hous, beau...|267497

In [8]:
airbnb_pred = lr_model.transform(df2)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [9]:
airbnb_pred.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------+---------+--------------------+--------------------+--------------------+----------+
|            unigrams|       id|                 bow|       rawPrediction|         probability|prediction|
+--------------------+---------+--------------------+--------------------+--------------------+----------+
|[respons, help, h...|214904211|(10000,[1,2,10,12...|[0.81398590285483...|[0.69295822502460...|       0.0|
|[hous, perfect, c...|218048729|(10000,[0,1,2,10,...|[1.01549148112327...|[0.73409346479673...|       0.0|
|[famili, babi, sp...|219935056|(10000,[7,14,20,2...|[1.00857204595898...|[0.73274060336478...|       0.0|
|[great, hous, hil...|222829306|(10000,[0,2,4,18,...|[0.74740953740144...|[0.67861398854576...|       0.0|
|[airbnb, exactli,...|227447978|(10000,[10,12,23,...|[0.93143632022713...|[0.71736659285155...|       0.0|
|[group, wonder, s...|229828619|(10000,[0,2,3,4,5...|[-0.2542136445969...|[0.43678665074399...|       1.0|
|[realli, enjoy, s...|231585560|(1000

In [11]:
result = airbnb_pred.select(['id', 'prediction'])

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [15]:
result.count()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

595381

In [14]:
# save predicted results
#result.coalesce(1).write.csv('s3://dse230-project-data1/result.csv/', mode='overwrite')

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…