<strong>Requirement:</strong>
Build a review filter. Use the NLP tools and a new classifier, Naive Bayes, to predict whether a reviewText is like (overall >= 4), not_like (overall <= 2), or neutral (2 < overall < 4).

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('nlp_musical').getOrCreate()

In [4]:
data = spark.read.json('Musical_Instruments_5.json')

In [5]:
data.show(3)

+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|      asin| helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|             summary|unixReviewTime|
+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|1384719342|  [0, 0]|    5.0|Not much to write...|02 28, 2014|A2IBPI20UZIR0U|cassandra tu "Yea...|                good|    1393545600|
|1384719342|[13, 14]|    5.0|The product does ...|03 16, 2013|A14VAT5EAX3D9S|                Jake|                Jake|    1363392000|
|1384719342|  [1, 1]|    5.0|The primary job o...|08 28, 2013|A195EZSQDW3E21|Rick Bennette "Ri...|It Does The Job Well|    1377648000|
+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
only showing top 3 rows



In [6]:
from pyspark.sql.functions import *

In [7]:
data = data.withColumn('class', when(data.overall >=4, 'like')
                       .when(data.overall <=2, 'not_like')
                       .otherwise('neutral'))
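
As a quick sanity check of the thresholds in the cell above, the same mapping can be written as a plain Python function. This is only a sketch: `rating_to_class` is a hypothetical helper that is not part of the notebook, and it mirrors the `when`/`otherwise` chain rather than replacing it.

```python
def rating_to_class(overall: float) -> str:
    """Mirror the when/otherwise chain: >= 4 like, <= 2 not_like, else neutral."""
    if overall >= 4:
        return 'like'
    if overall <= 2:
        return 'not_like'
    return 'neutral'


# Boundary ratings should land where the requirement says they do.
for rating in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(rating, rating_to_class(rating))
```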

In [8]:
data = data.select('reviewText', 'overall', 'class')

## Clean and Prepare the Data

**Create a new length feature:**

In [9]:
from pyspark.sql.functions import length

In [10]:
data = data.withColumn('length', length(data['reviewText']))

In [11]:
data.show(5)

+--------------------+-------+-----+------+
|          reviewText|overall|class|length|
+--------------------+-------+-----+------+
|Not much to write...|    5.0| like|   268|
|The product does ...|    5.0| like|   544|
|The primary job o...|    5.0| like|   436|
|Nice windscreen p...|    5.0| like|   206|
|This pop filter i...|    5.0| like|   159|
+--------------------+-------+-----+------+
only showing top 5 rows



In [12]:
# Mean length per class: 'like' reviews are clearly shorter, while 'neutral' and 'not_like' are nearly identical
data.groupby('class').mean().show()

+--------+------------------+-----------------+
|   class|      avg(overall)|      avg(length)|
+--------+------------------+-----------------+
|not_like|1.5353319057815846|579.2055674518201|
| neutral|               3.0|579.2111398963731|
|    like|4.7690090888938155|473.1188206606074|
+--------+------------------+-----------------+



## Feature Transformations

In [13]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.ml.feature import CountVectorizer, IDF, StringIndexer
tokenizer = Tokenizer(inputCol = 'reviewText', outputCol = 'token_text')
stopremove = StopWordsRemover(inputCol = 'token_text', outputCol = 'stop_tokens')
count_vec = CountVectorizer(inputCol = 'stop_tokens', outputCol = 'c_vec')
idf = IDF(inputCol = 'c_vec', outputCol = 'tf_idf')
class_to_num = StringIndexer(inputCol = 'class', outputCol = 'label')

In [14]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vector

In [15]:
clean_up = VectorAssembler(inputCols = ['tf_idf', 'length'], outputCol = 'features')
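
To make the stages defined above more concrete, here is a rough pure-Python sketch of what tokenizing, stop-word removal, counting, and IDF weighting compute on a toy corpus. The corpus and the stop-word list are made up for illustration, and the IDF uses the smoothed form ln((m + 1) / (df + 1)), which is the formula given in the Spark documentation; treat the rest as an approximation of the pipeline, not its implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
docs = [
    "The cable works great",
    "The cable broke",
    "Great strings and a great price",
]
stop_words = {"the", "and", "a"}

# Tokenizer step: lowercase and split on whitespace; then drop stop words.
tokens = [[w for w in d.lower().split() if w not in stop_words] for d in docs]

# CountVectorizer step: term frequency per document.
term_counts = [Counter(t) for t in tokens]

# IDF step: idf = ln((m + 1) / (df + 1)), with m documents and
# df the number of documents that contain the term.
m = len(docs)
doc_freq = Counter(term for t in tokens for term in set(t))
idf = {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

# TF-IDF per document, paired with the raw character length as an extra
# feature (the role VectorAssembler plays in the notebook).
features = [
    ({term: tf * idf[term] for term, tf in counts.items()}, len(doc))
    for counts, doc in zip(term_counts, docs)
]
for f in features:
    print(f)
```

Terms that occur in many documents (here `cable` and `great`) get a smaller IDF than terms that occur in only one.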

## The Model
We'll use Naive Bayes, but feel free to experiment with other classifiers!

In [16]:
from pyspark.ml.classification import NaiveBayes

In [17]:
# Use defaults
nb = NaiveBayes()

## Pipeline

In [18]:
from pyspark.ml import Pipeline

In [19]:
data_prep_pipe = Pipeline(stages = [class_to_num, tokenizer,
                                    stopremove, count_vec,
                                    idf, clean_up])

In [20]:
cleaner = data_prep_pipe.fit(data)

In [21]:
clean_data = cleaner.transform(data)

## Training and Evaluation!

In [22]:
clean_data = clean_data.select(['label', 'features'])

In [23]:
clean_data.show(10) # 0: like, 1: neutral, 2: not_like

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(51949,[3,12,14,3...|
|  0.0|(51949,[2,3,12,16...|
|  0.0|(51949,[11,19,44,...|
|  0.0|(51949,[18,37,57,...|
|  0.0|(51949,[2,122,132...|
|  0.0|(51949,[0,5,15,21...|
|  0.0|(51949,[5,16,29,1...|
|  1.0|(51949,[1,3,4,8,1...|
|  0.0|(51949,[0,3,12,33...|
|  0.0|(51949,[1,6,15,52...|
+-----+--------------------+
only showing top 10 rows



In [24]:
training, testing = clean_data.randomSplit([0.7, 0.3])

In [25]:
training.groupby('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 6346|
|  1.0|  538|
|  2.0|  338|
+-----+-----+



In [26]:
testing.groupby('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 2676|
|  1.0|  234|
|  2.0|  129|
+-----+-----+



In [27]:
predictor = nb.fit(training)

In [28]:
data.printSchema()

root
 |-- reviewText: string (nullable = true)
 |-- overall: double (nullable = true)
 |-- class: string (nullable = false)
 |-- length: integer (nullable = true)



In [29]:
test_results = predictor.transform(testing)

In [30]:
test_results.show(10)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(51949,[0,1,2,3,4...|[-11979.857561335...|[1.0,5.6454643749...|       0.0|
|  0.0|(51949,[0,1,2,3,4...|[-4675.7072524771...|[1.0,5.2495428198...|       0.0|
|  0.0|(51949,[0,1,2,3,4...|[-37635.488680969...|[1.59017540787312...|       1.0|
|  0.0|(51949,[0,1,2,3,4...|[-2473.5167646879...|[1.0,1.1335949862...|       0.0|
|  0.0|(51949,[0,1,2,3,4...|[-19030.000568201...|[4.78649391243439...|       1.0|
|  0.0|(51949,[0,1,2,3,4...|[-10805.876284528...|[1.0,3.9312112275...|       0.0|
|  0.0|(51949,[0,1,2,3,4...|[-22287.133862823...|[2.89739333812114...|       1.0|
|  0.0|(51949,[0,1,2,3,4...|[-12508.387129341...|[1.0,1.8476477860...|       0.0|
|  0.0|(51949,[0,1,2,3,4...|[-3930.8732605327...|[0.99999999956642...|       0.0|
|  0.0|(51949,[0

In [31]:
# Create a confusion matrix
test_results.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  2.0|       0.0|   54|
|  1.0|       1.0|   73|
|  0.0|       1.0|  540|
|  1.0|       0.0|  136|
|  2.0|       2.0|   34|
|  2.0|       1.0|   41|
|  1.0|       2.0|   25|
|  0.0|       0.0| 1972|
|  0.0|       2.0|  164|
+-----+----------+-----+



In [32]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [33]:
# The evaluator's default metricName is 'f1' (weighted F1), so this is not plain accuracy
acc_eval = MulticlassClassificationEvaluator(metricName = 'f1')
acc = acc_eval.evaluate(test_results)
print('Weighted F1 of model: {}'.format(acc))

Weighted F1 of model: 0.7386979737665652


- Not a very good result: the score is about 0.74, and the confusion matrix shows that most neutral and not_like reviews are misclassified.
- Possible fixes: try other classification models, or engineer additional features.
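
The evaluator above is created with default settings, and in PySpark the default `metricName` of `MulticlassClassificationEvaluator` is `f1` (weighted F1), not accuracy. The sketch below recomputes both metrics by hand from the Naive Bayes confusion matrix printed earlier, so the two can be compared; the counts are copied from that output and nothing else is assumed.

```python
# (label, prediction) -> count, copied from the Naive Bayes confusion matrix.
confusion = {
    (2, 0): 54, (1, 1): 73, (0, 1): 540,
    (1, 0): 136, (2, 2): 34, (2, 1): 41,
    (1, 2): 25, (0, 0): 1972, (0, 2): 164,
}
labels = sorted({lab for lab, _ in confusion})
total = sum(confusion.values())

# Accuracy: correctly classified rows over all rows.
accuracy = sum(confusion.get((c, c), 0) for c in labels) / total

# Weighted F1: per-class F1, averaged with true-label frequencies as weights.
weighted_f1 = 0.0
for c in labels:
    tp = confusion.get((c, c), 0)
    actual = sum(n for (lab, _), n in confusion.items() if lab == c)
    predicted = sum(n for (_, pred), n in confusion.items() if pred == c)
    f1 = 2 * tp / (actual + predicted) if actual + predicted else 0.0
    weighted_f1 += f1 * actual / total

print('accuracy:    {:.4f}'.format(accuracy))
print('weighted F1: {:.4f}'.format(weighted_f1))
```

The weighted F1 reproduces the reported 0.7387, while plain accuracy on the same matrix is lower, at roughly 0.684.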

## Use LogisticRegression/RandomForest

### Logistic Regression

In [34]:
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression

In [35]:
lg = LogisticRegression(maxIter = 20, regParam = 0.3, elasticNetParam = 0)

In [36]:
predictor_1 = lg.fit(training)

In [37]:
test_results_1 = predictor_1.transform(testing)

In [38]:
# Create a confusion matrix
test_results_1.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  2.0|       0.0|  124|
|  0.0|       1.0|    5|
|  1.0|       0.0|  234|
|  2.0|       2.0|    2|
|  2.0|       1.0|    3|
|  0.0|       0.0| 2668|
|  0.0|       2.0|    3|
+-----+----------+-----+



In [39]:
# Default metricName is 'f1' (weighted F1), not accuracy
acc_eval = MulticlassClassificationEvaluator(metricName = 'f1')
acc_1 = acc_eval.evaluate(test_results_1)
print('Weighted F1 of model: {}'.format(acc_1))

Weighted F1 of model: 0.8252989953949459


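- Higher score (about 0.83), but not a better model: it predicts like (0.0) for almost every review, so neutral and not_like reviews are almost never detected.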
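
Per-class recall makes this failure visible in a way a single aggregate score does not. The sketch below derives it from the logistic regression confusion matrix above; the counts are copied from that output, and the label-to-name mapping follows the `0: like, 1: neutral, 2: not_like` comment earlier in the notebook.

```python
# (label, prediction) -> count, copied from the logistic regression matrix.
confusion = {
    (2, 0): 124, (0, 1): 5, (1, 0): 234,
    (2, 2): 2, (2, 1): 3, (0, 0): 2668, (0, 2): 3,
}
class_names = {0: 'like', 1: 'neutral', 2: 'not_like'}

# Recall per class: correctly predicted rows over all rows with that label.
recall = {}
for c, name in class_names.items():
    actual = sum(n for (lab, _), n in confusion.items() if lab == c)
    recall[name] = confusion.get((c, c), 0) / actual

for name, value in recall.items():
    print('{:<9} recall: {:.3f}'.format(name, value))
```

Recall is close to 1 for like, exactly 0 for neutral, and under 2% for not_like.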

### Random Forest

In [40]:
rfc = RandomForestClassifier(labelCol = 'label',
                             featuresCol = 'features',
                             numTrees = 500,
                             maxDepth = 5,
                             maxBins = 64)

In [41]:
predictor_2 = rfc.fit(training)

In [42]:
test_results_2 = predictor_2.transform(testing)

In [43]:
# Create a confusion matrix
test_results_2.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  2.0|       0.0|  129|
|  1.0|       0.0|  234|
|  0.0|       0.0| 2676|
+-----+----------+-----+



In [44]:
test_results_2.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0| 3039|
+----------+-----+



In [45]:
# Default metricName is 'f1' (weighted F1), not accuracy
acc_eval = MulticlassClassificationEvaluator(metricName = 'f1')
acc_2 = acc_eval.evaluate(test_results_2)
print('Weighted F1 of model: {}'.format(acc_2))

Weighted F1 of model: 0.8246226872183917


- Higher score than Naive Bayes, but a useless model: the forest predicts like (0.0) for every one of the 3039 test reviews.
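
A constant prediction scores well here only because the like class dominates the test set. As an illustration, the sketch below compares plain accuracy with balanced accuracy (macro-averaged recall) for the random forest result; balanced accuracy is a metric brought in for this comparison and is not computed anywhere in the notebook.

```python
# Test-set label counts from the notebook; the forest predicts 0.0 for every row.
label_counts = {0: 2676, 1: 234, 2: 129}
predicted_class = 0

total = sum(label_counts.values())

# Plain accuracy rewards the constant prediction because class 0 dominates.
accuracy = label_counts[predicted_class] / total

# Balanced accuracy treats every class equally: recall is 1 for the
# predicted class and 0 for the other two.
recalls = [1.0 if c == predicted_class else 0.0 for c in label_counts]
balanced_accuracy = sum(recalls) / len(recalls)

print('accuracy:          {:.4f}'.format(accuracy))
print('balanced accuracy: {:.4f}'.format(balanced_accuracy))
```

Accuracy comes out near 0.88, while balanced accuracy is 1/3, which is what a three-class model scores when it ignores its input.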

## Need to resample data

In [46]:
# Split the training set by class (0: like, 1: neutral, 2: not_like)
like_df = training.filter(col('label') == 0)
neutral_df = training.filter(col('label') == 1)
not_like_df = training.filter(col('label') == 2)
# Integer oversampling factors for the two minority classes
ratio_1 = int(like_df.count()/neutral_df.count())
ratio_2 = int(like_df.count()/not_like_df.count())
print('ratio like/neutral: {}'.format(ratio_1))
print('ratio like/not_like: {}'.format(ratio_2))

ratio like/neutral: 11
ratio like/not_like: 18


In [47]:
# resample neutral
a1 = range(ratio_1)
# duplicate the minority rows
oversampled_neutral_df = neutral_df.withColumn("dummy", explode(array([lit(x) for x in a1]))).drop('dummy')
# combine both oversampled minority rows and previous majority rows
combined_df = like_df.unionAll(oversampled_neutral_df)
combined_df.show(10)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(51949,[0],[1.025...|
|  0.0|(51949,[0],[1.025...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
+-----+--------------------+
only showing top 10 rows



In [48]:
combined_df.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 6346|
|  1.0| 5918|
+-----+-----+
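
The neutral count above follows directly from integer replication of the training rows. The sketch below redoes that arithmetic in plain Python, using the training-set class sizes reported earlier; the `sample_rows` list is a made-up stand-in that shows how one output row per array element duplicates each input row.

```python
# Training-set class sizes reported earlier in the notebook.
n_like, n_neutral, n_not_like = 6346, 538, 338

# int() truncates, so each minority row is copied a whole number of times.
ratio_neutral = int(n_like / n_neutral)
ratio_not_like = int(n_like / n_not_like)

# Replicating each row `ratio` times is the effect of explode(array(...)):
# one output row per array element. Here the factor is 3 for brevity.
sample_rows = ['row_a', 'row_b']
replicated = [row for row in sample_rows for _ in range(3)]

print(ratio_neutral, n_neutral * ratio_neutral)
print(ratio_not_like, n_not_like * ratio_not_like)
print(replicated)
```

Because the ratio is truncated, the oversampled classes end up slightly smaller than the majority class instead of matching it exactly.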



In [49]:
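# resample not_like
a2 = range(ratio_2)
# duplicate the minority rows
oversampled_notlike_df = not_like_df.withColumn("dummy", explode(array([lit(x) for x in a2]))).drop('dummy')
# combine the oversampled not_like rows with the majority rows.
# Caution: this rebuilds combined_df from like_df, so the oversampled neutral
# rows from the previous cell are dropped (see the label counts below).
# Use combined_df.unionAll(oversampled_notlike_df) to keep all three classes.
combined_df = like_df.unionAll(oversampled_notlike_df)
combined_df.show(10)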

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(51949,[0],[1.025...|
|  0.0|(51949,[0],[1.025...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
|  0.0|(51949,[0,1,2,3,4...|
+-----+--------------------+
only showing top 10 rows



In [50]:
combined_df.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 6346|
|  2.0| 6084|
+-----+-----+



### Naive Bayes

In [51]:
predictor_3 = nb.fit(combined_df)

In [52]:
test_results_3 = predictor_3.transform(testing)

In [53]:
test_results_3.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 2676|
|  1.0|  234|
|  2.0|  129|
+-----+-----+



In [54]:
# Default metricName is 'f1' (weighted F1), not accuracy
acc_eval = MulticlassClassificationEvaluator(metricName = 'f1')
acc_3 = acc_eval.evaluate(test_results_3)
print('Weighted F1 of model: {}'.format(acc_3))

Weighted F1 of model: 0.8239115779815323


### Logistic Regression

In [55]:
predictor_4 = lg.fit(combined_df)

In [56]:
test_results_4 = predictor_4.transform(testing)

In [57]:
test_results_4.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  2.0|       0.0|  106|
|  1.0|       0.0|  230|
|  2.0|       2.0|   23|
|  1.0|       2.0|    4|
|  0.0|       0.0| 2668|
|  0.0|       2.0|    8|
+-----+----------+-----+



In [58]:
# Default metricName is 'f1' (weighted F1), not accuracy
acc_eval = MulticlassClassificationEvaluator(metricName = 'f1')
acc_4 = acc_eval.evaluate(test_results_4)
print('Weighted F1 of model: {}'.format(acc_4))

Weighted F1 of model: 0.8391297536016666


### Random Forest

In [59]:
predictor_5 = rfc.fit(combined_df)

In [60]:
test_results_5 = predictor_5.transform(testing)

In [61]:
test_results_5.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 2676|
|  1.0|  234|
|  2.0|  129|
+-----+-----+



In [62]:
# Create a confusion matrix
test_results_5.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  2.0|       0.0|  102|
|  1.0|       0.0|  219|
|  2.0|       2.0|   27|
|  1.0|       2.0|   15|
|  0.0|       0.0| 2655|
|  0.0|       2.0|   21|
+-----+----------+-----+



In [63]:
# Default metricName is 'f1' (weighted F1), not accuracy
acc_eval = MulticlassClassificationEvaluator(metricName = 'f1')
acc_5 = acc_eval.evaluate(test_results_5)
print('Weighted F1 of model: {}'.format(acc_5))

Weighted F1 of model: 0.8392095041530172


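- Higher score and somewhat better not_like detection, but still not very good: combined_df contains only like and not_like rows, so the neutral class is never predicted.
- Next step: find another way to improve performance.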
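
One candidate for such a solution is class weighting instead of row duplication. The sketch below computes inverse-frequency weights from the training-set class sizes reported earlier, using the common "balanced" heuristic total / (n_classes * class_count). The heuristic is an assumption brought in for illustration; Spark's `LogisticRegression` does accept a `weightCol` parameter, but wiring these weights into the notebook's pipeline is not shown and has not been tested here.

```python
# Training-set class sizes reported earlier in the notebook.
class_counts = {0: 6346, 1: 538, 2: 338}

total = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: weight = total / (n_classes * class_count), so every
# class contributes the same total weight to the training loss.
class_weights = {c: total / (n_classes * n) for c, n in class_counts.items()}

for c, w in class_weights.items():
    print('label {}: weight {:.3f}'.format(c, w))
```

Unlike the duplication approach, weighting leaves the number of training rows unchanged and covers all three classes in a single step.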