## Business Understanding

- Goal: classify tweets into 5 sentiment groups (Extremely Positive, Positive, Neutral, Negative, Extremely Negative).

## Set Environments

In [1]:
import findspark
findspark.init()

In [2]:
# import libraries
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

import pandas as pd
import numpy as np
import pandas_profiling as pp

import matplotlib
matplotlib.use('Qt5Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import scipy
from datetime import datetime
from pyspark.sql.functions import *
from pyspark.sql import types 
from pyspark.sql.types import *
from pyspark.ml.feature import *

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler, StringIndexer, Tokenizer, StopWordsRemover, CountVectorizer, IDF, HashingTF
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.classification import DecisionTreeClassifier, DecisionTreeClassificationModel
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.classification import GBTClassifier, GBTClassificationModel
from pyspark.ml.classification import NaiveBayes, NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator, RegressionEvaluator
from pyspark.ml import Pipeline

import warnings
warnings.filterwarnings("ignore")

In [3]:
sc = SparkContext()

In [4]:
spark = SparkSession.builder.appName('corona_tweet').getOrCreate()

## Loading Dataset

In [5]:
train = spark.read.csv(r'covid_text_classification\Corona_NLP_train.csv', header = True, multiLine=True, inferSchema=True, escape='"', dateFormat='dd-MM-yyyy')
test = spark.read.csv(r'covid_text_classification\Corona_NLP_test.csv', header = True, multiLine=True, inferSchema=True, escape='"', dateFormat='dd-MM-yyyy')

## Data Understanding
- The tweets were pulled from Twitter and tagged manually.
- The dataset contains 6 features:
    - UserName: user ID.
    - ScreenName: the user's screen ID.
    - Location: location the tweet was posted from.
    - TweetAt: date the tweet was posted.
    - OriginalTweet: tweet content.
    - Sentiment: sentiment label of the tweet (one of the 5 groups listed above).

In [6]:
train.show(3)

+--------+----------+---------+----------+--------------------+---------+
|UserName|ScreenName| Location|   TweetAt|       OriginalTweet|Sentiment|
+--------+----------+---------+----------+--------------------+---------+
|    3799|     48751|   London|16-03-2020|@MeNyrbie @Phil_G...|  Neutral|
|    3800|     48752|       UK|16-03-2020|advice Talk to yo...| Positive|
|    3801|     48753|Vagabonds|16-03-2020|Coronavirus Austr...| Positive|
+--------+----------+---------+----------+--------------------+---------+
only showing top 3 rows



In [7]:
train.printSchema()

root
 |-- UserName: integer (nullable = true)
 |-- ScreenName: integer (nullable = true)
 |-- Location: string (nullable = true)
 |-- TweetAt: string (nullable = true)
 |-- OriginalTweet: string (nullable = true)
 |-- Sentiment: string (nullable = true)



In [8]:
test.show(3)

+--------+----------+-----------+----------+--------------------+------------------+
|UserName|ScreenName|   Location|   TweetAt|       OriginalTweet|         Sentiment|
+--------+----------+-----------+----------+--------------------+------------------+
|       1|     44953|        NYC|02-03-2020|TRENDING: New Yor...|Extremely Negative|
|       2|     44954|Seattle, WA|02-03-2020|When I couldn't f...|          Positive|
|       3|     44955|       null|02-03-2020|Find out how you ...|Extremely Positive|
+--------+----------+-----------+----------+--------------------+------------------+
only showing top 3 rows



In [9]:
test.printSchema()

root
 |-- UserName: integer (nullable = true)
 |-- ScreenName: integer (nullable = true)
 |-- Location: string (nullable = true)
 |-- TweetAt: string (nullable = true)
 |-- OriginalTweet: string (nullable = true)
 |-- Sentiment: string (nullable = true)



In [12]:
print('Total row & column of the dataset:', train.count(), 'rows and', len(train.columns), 'columns')

Total row & column of the dataset: 41157 rows and 6 columns


In [13]:
for row in train.take(5):
    print(row)
    print('\n')

Row(UserName=3799, ScreenName=48751, Location='London', TweetAt='16-03-2020', OriginalTweet='@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8', Sentiment='Neutral')


Row(UserName=3800, ScreenName=48752, Location='UK', TweetAt='16-03-2020', OriginalTweet='advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order', Sentiment='Positive')


Row(UserName=3801, ScreenName=48753, Location='Vagabonds', TweetAt='16-03-2020', OriginalTweet='Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P', Sentiment='Positive')


Row(UserName=3802, ScreenName=48754, Location=None, TweetAt='16-03-2020', OriginalTweet="My food stock is not the only one which is empty...\r\n\r\nPLEASE, don't p

In [14]:
print('Total row & column of the dataset:', test.count(), 'rows and', len(test.columns), 'columns')

Total row & column of the dataset: 3798 rows and 6 columns


In [15]:
for row in test.take(5):
    print(row)
    print('\n')

Row(UserName=1, ScreenName=44953, Location='NYC', TweetAt='02-03-2020', OriginalTweet='TRENDING: New Yorkers encounter empty supermarket shelves (pictured, Wegmans in Brooklyn), sold-out online grocers (FoodKick, MaxDelivery) as #coronavirus-fearing shoppers stock up https://t.co/Gr76pcrLWh https://t.co/ivMKMsqdT1', Sentiment='Extremely Negative')


Row(UserName=2, ScreenName=44954, Location='Seattle, WA', TweetAt='02-03-2020', OriginalTweet="When I couldn't find hand sanitizer at Fred Meyer, I turned to #Amazon. But $114.97 for a 2 pack of Purell??!!Check out how  #coronavirus concerns are driving up prices. https://t.co/ygbipBflMY", Sentiment='Positive')


Row(UserName=3, ScreenName=44955, Location=None, TweetAt='02-03-2020', OriginalTweet='Find out how you can protect yourself and loved ones from #coronavirus. ?', Sentiment='Extremely Positive')


Row(UserName=4, ScreenName=44956, Location='Chicagoland', TweetAt='02-03-2020', OriginalTweet='#Panic buying hits #NewYork City as anxiou

The original tweet combines several components: words, special characters (@, /, etc.) and web links.

## Data Preparation

#### Train Dataset

In [16]:
#nan checking
train.select([count(when(isnan(c), c)).alias(c) for c in train.columns]).toPandas().T

Unnamed: 0,0
UserName,0
ScreenName,0
Location,0
TweetAt,0
OriginalTweet,0
Sentiment,0


There are no NaN values in the train dataset.

In [17]:
#null checking
train.select([count(when(col(c).isNull(), c)).alias(c) for c in train.columns]).toPandas().T

Unnamed: 0,0
UserName,0
ScreenName,0
Location,8590
TweetAt,0
OriginalTweet,0
Sentiment,0


Only Location has null values (8,590 rows). OriginalTweet and Sentiment, the two columns used for modeling, have none.

In [18]:
#Duplicate value
print('Total of duplicate row:', train.count() - train.distinct().count())

Total of duplicate row: 0


There are no duplicate rows either.

Filter the dataset anyway to make sure OriginalTweet and Sentiment contain no null values.

In [19]:
train = train.filter(train.OriginalTweet.isNotNull() & train.Sentiment.isNotNull())

In [20]:
train.select([count(when(col(c).isNull(), c)).alias(c) for c in train.columns]).toPandas().T

Unnamed: 0,0
UserName,0
ScreenName,0
Location,8590
TweetAt,0
OriginalTweet,0
Sentiment,0


Same result: only Location still has null values.

In [21]:
#Duplicate value
print('Total of duplicate row:', train.count() - train.distinct().count())

Total of duplicate row: 0


In [22]:
train.select('Sentiment').toPandas().value_counts()

Sentiment         
Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
dtype: int64

Positive and Negative tweets dominate the train dataset, with about 11k and 10k rows. Neutral has only 7.7k tweets, so the output classes are imbalanced.

In [23]:
print('Total row after delete invalid value from Sentiment:', train.count(), 'rows')

Total row after delete invalid value from Sentiment: 41157 rows


- In this case, I will merge Extremely Positive into Positive and Extremely Negative into Negative.
- It is hard to tell whether the tweets were rated on the same scale, and that would cause conflicts when using the tf-idf method.
- The Sentiment column then has 3 classes: Positive, Negative and Neutral.
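
As a quick sanity check on the merge described above, the expected class sizes can be worked out from the five-class counts printed earlier. This is a minimal plain-Python sketch, not part of the Spark job; the `merge_map` dictionary and the variable names are illustrative assumptions.

```python
from collections import Counter

# Five-class counts copied from the value_counts() output above.
five_class_counts = {
    "Positive": 11422,
    "Negative": 9917,
    "Neutral": 7713,
    "Extremely Positive": 6624,
    "Extremely Negative": 5481,
}

# Hypothetical mapping that mirrors the merge rule:
# each extreme class folds into its base class.
merge_map = {
    "Extremely Positive": "Positive",
    "Positive": "Positive",
    "Extremely Negative": "Negative",
    "Negative": "Negative",
    "Neutral": "Neutral",
}

merged_counts = Counter()
for label, n in five_class_counts.items():
    merged_counts[merge_map[label]] += n

print(dict(merged_counts))
```

If the merge is applied as described, Positive should come to 18046, Negative to 15398 and Neutral to 7713, with the total unchanged at 41157.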

In [26]:
train = train.withColumn('Sentiment', when(train.Sentiment == 'Extremely Positive', 'Positive')
                         .when(train.Sentiment == 'Positive', 'Positive')
                         .when(train.Sentiment == 'Extremely Negative', 'Negative')
                         .when(train.Sentiment == 'Negative', 'Negative')
                         .otherwise('Neutral'))

In [27]:
train.select('Sentiment').toPandas().value_counts()

Sentiment
Positive     18046
Negative     15398
Neutral       7713
dtype: int64

- The output looks better now, but some imbalance remains between the Sentiment classes.
- Positive is the largest class with 18k tweets, followed by Negative with 15k.

#### Test dataset

I will check the test dataset with the same method as the train dataset.

In [28]:
#nan checking
test.select([count(when(isnan(c), c)).alias(c) for c in test.columns]).toPandas().T

Unnamed: 0,0
UserName,0
ScreenName,0
Location,0
TweetAt,0
OriginalTweet,0
Sentiment,0


There are no NaN values.

In [29]:
#null checking
test.select([count(when(col(c).isNull(), c)).alias(c) for c in test.columns]).toPandas().T

Unnamed: 0,0
UserName,0
ScreenName,0
Location,834
TweetAt,0
OriginalTweet,0
Sentiment,0


There are no null values in the OriginalTweet feature; only Location has nulls (834 rows).

In [30]:
#Duplicate value
print('Total of duplicate row:', test.count() - test.distinct().count())

Total of duplicate row: 0


There are no duplicate rows.

Filter the dataset anyway to make sure OriginalTweet and Sentiment are not null.

In [31]:
#Filter the dataset - delete Null row.
test = test.filter(test.OriginalTweet.isNotNull() & test.Sentiment.isNotNull())

In [32]:
test.select([count(when(col(c).isNull(), c)).alias(c) for c in test.columns]).toPandas().T

Unnamed: 0,0
UserName,0
ScreenName,0
Location,834
TweetAt,0
OriginalTweet,0
Sentiment,0


OriginalTweet and Sentiment have no null values. Location still has 834, but it is not used as an input.

In [33]:
#Duplicate value
print('Total of duplicate row:', test.count() - test.distinct().count())

Total of duplicate row: 0


In [34]:
test.select('Sentiment').toPandas().value_counts()

Sentiment         
Negative              1041
Positive               947
Neutral                619
Extremely Positive     599
Extremely Negative     592
dtype: int64

The sentiments are also imbalanced here, but only slightly, which is acceptable.

In [35]:
print('Total row after delete invalid value from Sentiment:', test.count(), 'rows')

Total row after delete invalid value from Sentiment: 3798 rows


Merge Extremely Positive/Negative into Positive/Negative.

In [38]:
test = test.withColumn('Sentiment', when(test.Sentiment == 'Extremely Positive', 'Positive')
                       .when(test.Sentiment == 'Positive', 'Positive')
                       .when(test.Sentiment == 'Extremely Negative', 'Negative')
                       .when(test.Sentiment == 'Negative', 'Negative')
                       .otherwise('Neutral'))

In [39]:
test.select('Sentiment').toPandas().value_counts()

Sentiment
Negative     1633
Positive     1546
Neutral       619
dtype: int64

Now the test dataset looks better.

### Clean the text value

Because OriginalTweet is a mix of links, text, symbols, etc., the tweets need cleaning.

In [40]:
# Raw strings keep the regex backslashes intact.
# delete links
train = train.withColumn('OriginalTweet_re', regexp_replace(lower(col('OriginalTweet')), r'https?://\S+|www\.\S+', ""))
# delete symbols (note: the range A-z also covers [ \ ] ^ _ `, so underscores are kept)
train = train.withColumn('OriginalTweet_re', regexp_replace(lower(col('OriginalTweet_re')), r'[^a-zA-z]', " "))
# delete digits
train = train.withColumn('OriginalTweet_re', regexp_replace(lower(col('OriginalTweet_re')), r'\d+', ""))
# collapse runs of white space into one space
train = train.withColumn('OriginalTweet_re', regexp_replace(lower(col('OriginalTweet_re')), r'\s+', " "))

In [41]:
train.select('OriginalTweet_re').show(3,truncate = False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|OriginalTweet_re                                                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| menyrbie phil_gahan chrisitv and and                                                                                                                                                                                                        |
|advice talk to your neighbours family t

The tweets look cleaner than the originals now. Next, I will do the same with the test dataset.

In [42]:
# delete links
test = test.withColumn('OriginalTweet_re', regexp_replace(lower(col('OriginalTweet')), r'https?://\S+|www\.\S+', " "))
# delete symbols (\w keeps letters, digits and underscores; digits are removed in the next step)
test = test.withColumn('OriginalTweet_re', regexp_replace(lower(col('OriginalTweet_re')), r'[^\w]', " "))
# delete digits
test = test.withColumn('OriginalTweet_re', regexp_replace(lower(col('OriginalTweet_re')), r'\d+', " "))
# collapse runs of white space into one space
test = test.withColumn('OriginalTweet_re', regexp_replace(lower(col('OriginalTweet_re')), r'\s+', " "))

In [43]:
test.select('OriginalTweet_re').show(3,truncate = False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|OriginalTweet_re                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|trending new yorkers encounter empty supermarket shelves pictured wegmans in brooklyn sold out online grocers foodkick maxdelivery as coronavirus fearing shoppers stock up |
|when i couldn t find hand sanitizer at fred meyer i turned to amazon but for a pack of purell check out how coronavirus concerns are driving up prices                      |
|find out how you can protect yourself and loved ones from coronavirus                                                       

It's better now.
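
One detail of the symbol pattern used for the train set is worth checking: in the class `[^a-zA-z]`, the range `A-z` also spans the ASCII characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), so underscores survive the cleaning. The sketch below reproduces the four cleaning steps with Python's `re` module instead of Spark's `regexp_replace`; the `clean` helper is hypothetical, and the sample string is the first train tweet shown earlier.

```python
import re

def clean(text, symbol_pattern):
    """Apply the four cleaning steps with a configurable symbol pattern."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # delete links
    text = re.sub(symbol_pattern, ' ', text)           # delete symbols
    text = re.sub(r'\d+', '', text)                    # delete digits
    return re.sub(r'\s+', ' ', text)                   # collapse white space

sample = ('@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa '
          'and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8')

as_written = clean(sample, r'[^a-zA-z]')  # pattern used for the train set
strict = clean(sample, r'[^a-zA-Z]')      # letters only

print(repr(as_written))
print(repr(strict))
```

With the pattern as written, `phil_gahan` stays a single token, which matches the first cleaned train row. With `[^a-zA-Z]` it would be split into `phil` and `gahan`. Which behavior is preferable depends on whether user handles should count as single tokens.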

## Data Transformation

- The output has to be indexed before it goes into a prediction model, so I will use StringIndexer.
- The key feature for deciding which Sentiment a tweet belongs to is the tweet's content, so I will use OriginalTweet as the only input.
- I also have to transform it with the tf-idf method.

In [78]:
#convert Sentiment value (word) to index
indexer_output = StringIndexer(inputCol = 'Sentiment', outputCol = 'Sentiment_idx')
#tokenizer for OriginalTweet
tokenizer = Tokenizer(inputCol = 'OriginalTweet_re', outputCol = 'OriginalTweet_token')
# OriginalTweet token -> stopword
stopword = StopWordsRemover(inputCol = 'OriginalTweet_token', outputCol = 'OriginalTweet_stopword')
#OriginalTweet stopword -> Count vectorizer (tf)
count_vec = CountVectorizer(inputCol = 'OriginalTweet_stopword', outputCol = 'OriginalTweet_countvec', maxDF = 0.7)
#OriginalTweet tf -> idf (find the important word)
idf = IDF(inputCol = 'OriginalTweet_countvec', outputCol = 'OriginalTweet_idf', minDocFreq = 12)

assembler = VectorAssembler(inputCols = ['OriginalTweet_idf'], outputCol = 'features')

I set maxDF = 0.7 to drop terms that appear in more than 70% of the tweets. This removes very common words that StopWordsRemover does not catch.
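
To make the effect of maxDF and the IDF weighting concrete, here is a minimal plain-Python sketch on a toy corpus. It is not Spark code: the documents are made up, the variable names are illustrative, and the smoothed formula log((N + 1) / (df + 1)) is the one described in Spark's IDF documentation. minDocFreq is left out to keep the example small.

```python
import math
from collections import Counter

# Toy tokenized corpus (made up for illustration).
docs = [
    ["covid", "panic", "buying"],
    ["covid", "stock", "empty"],
    ["covid", "stay", "safe"],
    ["panic", "stock"],
]
n_docs = len(docs)

# Document frequency: number of documents each term appears in.
doc_freq = Counter(term for doc in docs for term in set(doc))

# maxDF as a fraction: drop terms that appear in more than 70% of documents.
max_df = 0.7
vocab = sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)

# Smoothed inverse document frequency for the remaining terms.
idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocab}

for term in vocab:
    print(f"{term}: df={doc_freq[term]}, idf={idf[term]:.3f}")
```

Here `covid` appears in 3 of 4 documents (75%), so it is dropped from the vocabulary, while rarer terms such as `buying` receive a higher IDF weight than `panic` or `stock`.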

In [79]:
pipeline = Pipeline(stages = [indexer_output, tokenizer, stopword, count_vec, idf, assembler])

In [80]:
data_train = pipeline.fit(train).transform(train)

In [81]:
data_train.select(['Sentiment', 'Sentiment_idx']).distinct().show()

+---------+-------------+
|Sentiment|Sentiment_idx|
+---------+-------------+
| Positive|          0.0|
| Negative|          1.0|
|  Neutral|          2.0|
+---------+-------------+



- The Sentiment values are converted to indices as follows:
    - Positive - 0.0
    - Negative - 1.0
    - Neutral - 2.0

In [82]:
data_train = data_train.select(['Sentiment_idx', 'features'])

In [83]:
data_train.show(3)

+-------------+--------------------+
|Sentiment_idx|            features|
+-------------+--------------------+
|          2.0|(52581,[2],[1.582...|
|          0.0|(52581,[11,12,104...|
|          0.0|(52581,[0,1,11,60...|
+-------------+--------------------+
only showing top 3 rows



In [84]:
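# Note: this fits a second pipeline on the test set, so its label indices and
# vocabulary are learned from test rather than reused from train.
data_test = pipeline.fit(test).transform(test)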

In [85]:
data_test = data_test.select(['Sentiment_idx', 'features'])

In [86]:
data_test.show(3)

+-------------+--------------------+
|Sentiment_idx|            features|
+-------------+--------------------+
|          0.0|(11361,[1,7,11,13...|
|          1.0|(11361,[1,14,48,9...|
|          1.0|(11361,[1,173,342...|
+-------------+--------------------+
only showing top 3 rows
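
Because the pipeline is fitted a second time on the test set, the test features have a different vocabulary size (11361 versus 52581 for train). StringIndexer, which by default gives index 0.0 to the most frequent label, may also assign different indices. The sketch below imitates that frequency ordering in plain Python using the class counts printed earlier; `frequency_index` is a hypothetical helper, not a Spark function.

```python
def frequency_index(counts):
    """Assign 0.0, 1.0, ... to labels in descending order of frequency."""
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Three-class counts copied from the value_counts() outputs above.
train_counts = {"Positive": 18046, "Negative": 15398, "Neutral": 7713}
test_counts = {"Negative": 1633, "Positive": 1546, "Neutral": 619}

train_index = frequency_index(train_counts)
test_index = frequency_index(test_counts)

print("train:", train_index)
print("test: ", test_index)
```

With these counts, Positive gets 0.0 in train but Negative gets 0.0 in test, which is consistent with the first test row (originally Extremely Negative) showing Sentiment_idx 0.0. One way to keep both mappings and the vocabulary aligned, assuming the standard Pipeline API, would be to fit once with `fitted = pipeline.fit(train)` and then call `fitted.transform(test)`. The recorded results that follow were produced with separately fitted pipelines and should be read with that in mind.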



## Modeling and Evaluating

### Decision tree model

In [87]:
def decisiontree_model(train_set, test_set, label):
    #build the decision tree model
    tree = DecisionTreeClassifier(labelCol = label, featuresCol = 'features')
    
    #Fit the model with train dataset
    tree_model = tree.fit(train_set)
    
    #Predict with test dataset
    tree_pred = tree_model.transform(test_set)
    #tree_prednlabel = tree_pred.select(['prediction', label]).withColumn(label, col(label).cast('float')).orderBy('prediction')

    #Build evaluators to measure the performance (the binary evaluator is only meaningful with 2 classes)
    #accuracy_evaluation = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', predictionCol = 'prediction',metricName = 'accuracy')
    bi_evaluator = BinaryClassificationEvaluator(labelCol = label, rawPredictionCol = 'prediction')
    multi_evaluator = MulticlassClassificationEvaluator(labelCol = label, predictionCol = 'prediction')
    #Evaluate
    tree_acc = multi_evaluator.evaluate(tree_pred, {multi_evaluator.metricName: "accuracy"})
    
    tree_auc = bi_evaluator.evaluate(tree_pred, {bi_evaluator.metricName: "areaUnderROC"})

    #save model
    #tree_model.save('treemodel_rating')
    
    #Show the result
    print('Accuracy Score:')
    print('-'*80)
    print('Decision Tree accuracy: {0:2.2f}%'.format(tree_acc*100))

    print('\n')
    print('AUC Score:')
    print('-'*80)
    print('Decision Tree AUC: {0:2.2f}%'.format(tree_auc*100))
    
    print('\n')
    print('Confusion matrix')
    print('-'*80)
    print('Decision tree')
    tree_pred.groupby(label, 'prediction').count().show()

In [88]:
decisiontree_model(data_train, data_test, 'Sentiment_idx')

Accuracy Score:
--------------------------------------------------------------------------------
Decision Tree accuracy: 44.94%


AUC Score:
--------------------------------------------------------------------------------
Decision Tree AUC: 51.30%


Confusion matrix
--------------------------------------------------------------------------------
Decision tree
+-------------+----------+-----+
|Sentiment_idx|prediction|count|
+-------------+----------+-----+
|          2.0|       0.0|  579|
|          1.0|       1.0|  251|
|          0.0|       1.0|  177|
|          1.0|       0.0| 1295|
|          2.0|       1.0|   40|
|          0.0|       0.0| 1456|
+-------------+----------+-----+



- The model's accuracy in this case is not good.
    - Correct predictions: about 1.7k.
    - Wrong predictions: about 2.1k.
    - In this model, negative tweets seem to be easily mistaken for positive sentiment.
- The accuracy is below 50%, which is not good.
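
The counts in the bullets above can be recomputed directly from the confusion matrix. This is a minimal plain-Python sketch of the arithmetic; the dictionary layout and variable names are illustrative assumptions.

```python
# (label, prediction) -> count, copied from the decision tree output above.
confusion = {
    (2.0, 0.0): 579,
    (1.0, 1.0): 251,
    (0.0, 1.0): 177,
    (1.0, 0.0): 1295,
    (2.0, 1.0): 40,
    (0.0, 0.0): 1456,
}

total = sum(confusion.values())
# A prediction is correct when the label and the prediction agree.
correct = sum(n for (label, pred), n in confusion.items() if label == pred)
wrong = total - correct
accuracy = correct / total

# Classes that the model actually predicted at least once.
predicted_classes = sorted({pred for (_, pred) in confusion})

print(f"correct={correct}, wrong={wrong}, accuracy={accuracy:.2%}")
print("predicted classes:", predicted_classes)
```

This gives 1707 correct and 2091 wrong predictions out of 3798, matching the reported 44.94% accuracy. It also shows that class 2.0 is never predicted, so every row labeled 2.0 counts as an error.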

### Random Forest Model

In [89]:
def forest_model(train_set, test_set, label):
    #split dataset into train and test set
    #train_set, test_set = dataset.randomSplit([0.8, 0.2])
    #build the random forest model
    forest = RandomForestClassifier(labelCol = label, featuresCol = 'features')
    
    #Fit the model with train dataset
    forest_model = forest.fit(train_set)
    
    #Predict with test dataset
    forest_pred = forest_model.transform(test_set)
    #tree_prednlabel = tree_pred.select(['prediction', label]).withColumn(label, col(label).cast('float')).orderBy('prediction')

    #Build evaluators to measure the performance (the binary evaluator is only meaningful with 2 classes)
    #accuracy_evaluation = MulticlassClassificationEvaluator(labelCol = 'PrivateIndex', predictionCol = 'prediction',metricName = 'accuracy')
    bi_evaluator = BinaryClassificationEvaluator(labelCol = label, rawPredictionCol = 'prediction')
    multi_evaluator = MulticlassClassificationEvaluator(labelCol = label, predictionCol = 'prediction')
    #Evaluate
    forest_acc = multi_evaluator.evaluate(forest_pred, {multi_evaluator.metricName: "accuracy"})
    
    forest_auc = bi_evaluator.evaluate(forest_pred, {bi_evaluator.metricName: "areaUnderROC"})

    #save model
    #tree_model.save('treemodel_rating')
    
    #Show the result
    print('Accuracy Score:')
    print('-'*80)
    print('Random Forest accuracy: {0:2.2f}%'.format(forest_acc*100))

    print('\n')
    print('AUC Score:')
    print('-'*80)
    print('Random Forest AUC: {0:2.2f}%'.format(forest_auc*100))
    
    print('\n')
    print('Confusion matrix')
    print('-'*80)
    print('Random Forest')
    forest_pred.groupby(label, 'prediction').count().show()

In [90]:
forest_model(data_train, data_test, 'Sentiment_idx')

Accuracy Score:
--------------------------------------------------------------------------------
Random Forest accuracy: 42.63%


AUC Score:
--------------------------------------------------------------------------------
Random Forest AUC: 49.46%


Confusion matrix
--------------------------------------------------------------------------------
Random Forest
+-------------+----------+-----+
|Sentiment_idx|prediction|count|
+-------------+----------+-----+
|          2.0|       0.0|  617|
|          1.0|       1.0|   21|
|          0.0|       1.0|   35|
|          1.0|       0.0| 1525|
|          2.0|       1.0|    2|
|          0.0|       0.0| 1598|
+-------------+----------+-----+



- The forest model does not do well either: many negative tweets are still wrongly predicted as positive.
- The accuracy is still below 50%.

Now I will try with data that has only 2 sentiments, positive and negative.

In [91]:
data_train1 = data_train.select(['Sentiment_idx', 'features'])

In [95]:
data_train1 = data_train1.filter(col('Sentiment_idx') != 2.0)

In [96]:
data_train1.count()

33444

In [99]:
data_test1 = data_test.select(['Sentiment_idx', 'features'])

In [100]:
data_test1 = data_test1.filter(col('Sentiment_idx') != 2.0)

In [101]:
decisiontree_model(data_train1, data_test1, 'Sentiment_idx')

Accuracy Score:
--------------------------------------------------------------------------------
Decision Tree accuracy: 53.63%


AUC Score:
--------------------------------------------------------------------------------
Decision Tree AUC: 52.64%


Confusion matrix
--------------------------------------------------------------------------------
Decision tree
+-------------+----------+-----+
|Sentiment_idx|prediction|count|
+-------------+----------+-----+
|          1.0|       1.0|  252|
|          0.0|       1.0|  180|
|          1.0|       0.0| 1294|
|          0.0|       0.0| 1453|
+-------------+----------+-----+



The result still shows that Negative tweets (1.0) are easily predicted as Positive (0.0). The variety of human language makes it hard for the model to predict correctly.
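
The evaluator in this notebook is given the hard `prediction` column rather than a score column. With only 0/1 predictions the ROC curve has a single interior point, and the area under it reduces to the mean of the true positive rate and the true negative rate. The sketch below recomputes both metrics from the two-class confusion matrix; it illustrates the arithmetic only, not how Spark computes the metric internally, and the variable names are assumptions.

```python
# (label, prediction) -> count, copied from the two-class decision tree output above.
confusion = {
    (1.0, 1.0): 252,
    (0.0, 1.0): 180,
    (1.0, 0.0): 1294,
    (0.0, 0.0): 1453,
}

# Treat label 1.0 as the positive class, as a binary evaluator would.
tp = confusion[(1.0, 1.0)]
fn = confusion[(1.0, 0.0)]
tn = confusion[(0.0, 0.0)]
fp = confusion[(0.0, 1.0)]

tpr = tp / (tp + fn)  # share of label 1.0 rows predicted as 1.0
tnr = tn / (tn + fp)  # share of label 0.0 rows predicted as 0.0

# Area under a ROC curve with one operating point (trapezoid rule).
auc_from_hard_labels = (tpr + tnr) / 2
accuracy = (tp + tn) / (tp + fn + tn + fp)

print(f"TPR={tpr:.4f}, TNR={tnr:.4f}")
print(f"AUC={auc_from_hard_labels:.2%}, accuracy={accuracy:.2%}")
```

The result agrees with the reported 52.64% AUC and 53.63% accuracy. Passing a score column such as `rawPrediction` or `probability` to the evaluator would give a curve with more than one operating point.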

#### Conclusion:
- In this case, the tf-idf approach is not a good method for predicting Sentiment. Language is highly varied, so an older method like tf-idf is not suitable in some cases.
- The evidence is that a user may speak well (positively) of a problem and then add some negative ideas. The difficulty is knowing what share of a single tweet its positive and negative sentences account for.
- Most Neutral tweets (2.0) are predicted as Positive (0.0), which suggests people speak well of something. Likewise, Negative tweets (1.0) predicted as Positive (0.0) account for the largest share of the errors.
- Both tree models perform below 50%, but that does not mean the models themselves are bad. A different approach is needed, such as another text transformation method like Spark NLP.