## Part 1: Training a Naive Bayes Model of Twitter Sentiment

Our approach to Twitter sentiment prediction is to first train a Naive Bayes model on a labeled dataset from Kaggle.

We then save the trained model so that it can be loaded and used later in a streaming application (in a different notebook).

## Step 1: Download and Explore data

1\. First, download the 1.6 million tweet dataset from `http://idsdl.csom.umn.edu/c/share/msba6330/twitter1.6m.zip`

The original Kaggle dataset is available at [https://www.kaggle.com/datasets/kazanova/sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140)

It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

It contains the following 6 fields:

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive); no tweet in this dataset is labeled 2.
- id: the id of the tweet (`2087`)
- date: the date of the tweet (`Sat May 16 23:58:44 UTC 2009`)
- flag: the query (`lyx`). If there is no query, this value is `NO_QUERY`.
- user: the user that tweeted (`robotickilldozr`)
- text: the text of the tweet (`Lyx is cool`)
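
To make the six-field layout concrete, here is a minimal sketch that parses one row with Python's standard `csv` module. The sample row is taken from the downloaded file; the `FIELDS` list and the `parse_row` helper are names made up for this illustration and are not part of the lab code.

```python
import csv
import io

# Field names in file order, as listed above (the CSV file itself has no header row).
FIELDS = ["target", "id", "date", "flag", "user", "text"]


def parse_row(line):
    """Parse one quoted CSV line into a dict keyed by field name (hypothetical helper)."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(FIELDS, values))


sample = ('"0","1467811372","Mon Apr 06 22:20:00 PDT 2009",'
          '"NO_QUERY","joy_wolf","@Kwesidei not the whole crew "')
row = parse_row(sample)
print(row["target"], row["flag"], row["user"])
```

Every value comes back as a string here; in the lab, the schema string passed to Spark is what turns `target` and `id` into numeric columns.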

In [None]:
%%bash
rm -f twitter1.6m.zip
wget -nv http://idsdl.csom.umn.edu/c/share/msba6330/twitter1.6m.zip
yes | unzip twitter1.6m.zip

Archive:  twitter1.6m.zip
  inflating: training.1600000.processed.noemoticon.csv  
2022-11-05 22:32:57 URL:http://idsdl.csom.umn.edu/c/share/msba6330/twitter1.6m.zip [84855679/84855679] -> "twitter1.6m.zip" [1]


2\. Explore the data format

- display the first 10 rows of the data in the training dataset

In [None]:
%%bash
head training.1600000.processed.noemoticon.csv

"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"
"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"
"0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"
"0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire "
"0","1467811193","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","Karoli","@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. "
"0","1467811372","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","joy_wolf","@Kwesidei not the whole crew "
"0","1467811592","Mon Apr 06 22:20:03 PDT 2009","NO

3\. Read the data into a DataFrame `data` using the schema string `target integer,id long,date string,flag string,user string,text string`

Then verify the result by showing 10 rows and the schema.

In [None]:
schema = "target integer, id long, date string, flag string, user string, text string"
data = spark.read.csv("file:/databricks/driver/training.1600000.processed.noemoticon.csv", schema = schema)

In [None]:
data.limit(10).display()

target,id,date,flag,user,text
0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."
0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
0,1467811592,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,mybirch,Need a hug
0,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,coZZ,"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?"
0,1467811795,Mon Apr 06 22:20:05 PDT 2009,NO_QUERY,2Hood4Hollywood,@Tatiana_K nope they didn't have it
0,1467812025,Mon Apr 06 22:20:09 PDT 2009,NO_QUERY,mimismo,@twittera que me muera ?


In [None]:
data.printSchema()

root
 |-- target: integer (nullable = true)
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- flag: string (nullable = true)
 |-- user: string (nullable = true)
 |-- text: string (nullable = true)



## Step 2: Do Some Data Cleaning and Target Variable transformation

Please note that our training data has a different format from our testing data, which comes from the Twitter stream. Our testing data will have these fields:

- `text`: tweet text
- `time`: timestamp

In the following, we want to:

- transform the `date` column into a timestamp column `time`
- transform the `target` variable into a binary column (tip: use `Binarizer`, but cast the column to double first, because `Binarizer` only works with double/float columns)
- drop the irrelevant columns.

The resulting DataFrame, called `train` in the code below (with `data_clean` and `data_labeled` as intermediate steps), will have these columns:
- `label`: a 0-1 label column derived from `target`
- `text`
- `time`
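
As a rough illustration of the date-to-timestamp step, the plain-Python sketch below converts one raw `date` string to UTC. It is not the Spark code used in the lab. The `TZ_OFFSETS` table and the `parse_tweet_date` helper are assumptions made for this example, and the offsets are hard-coded only for the abbreviations listed.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a hand-written offset table covering only these abbreviations.
TZ_OFFSETS = {
    "PDT": timedelta(hours=-7),
    "PST": timedelta(hours=-8),
    "UTC": timedelta(0),
}


def parse_tweet_date(raw):
    """Convert e.g. 'Mon Apr 06 22:19:45 PDT 2009' to an aware UTC datetime."""
    # The weekday is redundant, so it is dropped (the Spark code skips it with substring).
    _, month, day, clock, tz_abbr, year = raw.split()
    # %b expects English month abbreviations (true under the default C locale).
    naive = datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")
    local = naive.replace(tzinfo=timezone(TZ_OFFSETS[tz_abbr]))
    return local.astimezone(timezone.utc)


print(parse_tweet_date("Mon Apr 06 22:19:45 PDT 2009").isoformat())
```

For the first row of the dataset this gives 05:19:45 UTC on 2009-04-07.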

In [None]:
from pyspark.sql.functions import col, to_timestamp, substring
# substring(date, 5, 24) skips the leading weekday ("Mon ") so the rest matches the timestamp pattern.
data_clean = data.withColumn("time", to_timestamp(substring(col("date"),5,24), "MMM dd HH:mm:ss zzz yyyy")).drop("id", "flag", "user", "date").cache()

In [None]:
data_clean.limit(10).display()

target,text,time
0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000
0,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000
0,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000
0,my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000
0,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000
0,@Kwesidei not the whole crew,2009-04-07T05:20:00.000+0000
0,Need a hug,2009-04-07T05:20:03.000+0000
0,"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",2009-04-07T05:20:03.000+0000
0,@Tatiana_K nope they didn't have it,2009-04-07T05:20:05.000+0000
0,@twittera que me muera ?,2009-04-07T05:20:09.000+0000


In [None]:
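from pyspark.ml.feature import (Binarizer, CountVectorizer, RegexTokenizer,
                                SQLTransformer, StopWordsRemover)
from pyspark.ml import Pipeline
# Binarizer needs a double column, so cast the integer target first.
st_cast = SQLTransformer(statement="select *, cast(target as double) as target_double from __THIS__")
# Values above the threshold (target = 4) become 1.0; the rest (target = 0) become 0.0.
binarizer = Binarizer(threshold=2.0, inputCol="target_double", outputCol="label")
target_pipeline = Pipeline(stages=[st_cast, binarizer])
data_labeled = target_pipeline.fit(data_clean).transform(data_clean)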

In [None]:
data_labeled.limit(10).display()

target,text,time,target_double,label
0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0,0.0
0,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000,0.0,0.0
0,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000,0.0,0.0
0,my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000,0.0,0.0
0,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000,0.0,0.0
0,@Kwesidei not the whole crew,2009-04-07T05:20:00.000+0000,0.0,0.0
0,Need a hug,2009-04-07T05:20:03.000+0000,0.0,0.0
0,"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",2009-04-07T05:20:03.000+0000,0.0,0.0
0,@Tatiana_K nope they didn't have it,2009-04-07T05:20:05.000+0000,0.0,0.0
0,@twittera que me muera ?,2009-04-07T05:20:09.000+0000,0.0,0.0
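
The thresholding rule behind the `label` column can be written out in a few lines of plain Python. This is only a sketch of the rule (values strictly greater than the threshold map to 1.0, everything else to 0.0); the `binarize` function is a made-up name, not a Spark API.

```python
def binarize(value, threshold=2.0):
    """Return 1.0 if value is strictly greater than threshold, otherwise 0.0."""
    return 1.0 if value > threshold else 0.0


# The dataset only contains targets 0 and 4; 2 is included to show the boundary case.
for target in (0, 2, 4):
    print(target, "->", binarize(float(target)))
```

With a threshold of 2.0, negative tweets (0) get label 0.0 and positive tweets (4) get label 1.0.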


In [None]:
train = data_labeled.drop("target", "target_double")



## Step 3: Define and fit an ML pipeline containing data preprocessing and model training

Specifically, we need to remove some unwanted string components (such as URLs), remove stop words, and vectorize the words. We will try to do all of this with PySpark (a combination of Spark SQL and Spark MLlib).

- Eliminate URLs, @user mentions, and `#` (remove just the symbol but keep the hashtag text). We will do this with `SQLTransformer`.
  - Please use `regexp_replace` from Spark SQL.
  - `'http\\\S+'` --> `''`: remove URLs (note: there are three backslashes because the text passes through three interpreters, namely the Python string parser, the Spark SQL parser, and the regex engine; the final form should be `http\S+`)
  - `'@\\\w+'` --> `''`: remove @user
  - `'#'` --> `''`: remove hashtag symbols
- Tokenize the tweet, using a `RegexTokenizer` with `\\W+` as the token pattern.
- Remove stop words, using `StopWordsRemover` (note that you need to load the stop words first).
- Turn words into numerical features using `CountVectorizer`, keeping only words that appear in at least 20 documents (`minDF=20`) so that rare words are dropped.
- Use `NaiveBayes` to predict the sentiment, with a `smoothing` coefficient of `1.0` and a `modelType` of `multinomial`.
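
The cleaning, tokenizing, and stop-word steps listed above can be sketched in plain Python with the `re` module, which is handy for checking what the patterns do to a single tweet. This is an illustration under assumptions, not the Spark pipeline: `STOP_WORDS` is a tiny stand-in for Spark's much longer default English list, and `clean`, `tokenize`, and `remove_stop_words` are made-up helper names.

```python
import re

# Assumption: a tiny stand-in stop-word list, just large enough for this example.
STOP_WORDS = {"a", "do", "it", "of", "s", "that", "to", "you"}


def clean(text):
    """Lower-case the tweet, then strip URLs, @mentions, and '#' symbols."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text, flags=re.ASCII)  # remove URLs
    text = re.sub(r"@\w+", "", text, flags=re.ASCII)     # remove @user
    return text.replace("#", "")                         # keep the hashtag text


def tokenize(text):
    """Split on runs of non-word characters and drop empty tokens."""
    return [tok for tok in re.split(r"\W+", text, flags=re.ASCII) if tok]


def remove_stop_words(words):
    return [w for w in words if w not in STOP_WORDS]


tweet = ("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  "
         "You shoulda got David Carr of Third Day to do it. ;D")
print(remove_stop_words(tokenize(clean(tweet))))
```

Raw strings (`r"..."`) are used here so that each pattern needs only one backslash; inside the SQL statement passed through Python, extra escaping is required as noted above.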

Please also define:

- an evaluator `e` for the accuracy metric.
- a pipeline `pipeline` with these stages: SQL transformer (for removing unwanted patterns), tokenizer, stopword remover, count vectorizer, naiveBayes.
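
To give some intuition for the Naive Bayes stage and its `smoothing` coefficient, here is a minimal from-scratch sketch of multinomial Naive Bayes with additive smoothing on a toy corpus. It is not Spark's implementation, which may differ in details such as how the class prior is handled. The function names and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, smoothing=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for doc in docs for w in doc})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for doc in class_docs for w in doc)
        # Additive smoothing: every vocabulary word gets `smoothing` extra counts.
        total = sum(counts.values()) + smoothing * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + smoothing) / total) for w in vocab}
    return log_prior, log_likelihood


def predict(words, log_prior, log_likelihood):
    """Return the class with the highest log score; unknown words are ignored."""
    scores = {}
    for c in log_prior:
        known = [w for w in words if w in log_likelihood[c]]
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in known)
    return max(scores, key=scores.get)


# Toy training data (made up): 0.0 = negative, 1.0 = positive.
docs = [["sad", "miss"], ["hate", "sad"], ["love", "good"], ["good", "day"]]
labels = [0.0, 0.0, 1.0, 1.0]
log_prior, log_likelihood = train_multinomial_nb(docs, labels, smoothing=1.0)
print(predict(["sad"], log_prior, log_likelihood))
print(predict(["good", "love"], log_prior, log_likelihood))
```

Without smoothing, a word never seen in a class would get probability zero and wipe out the whole score for that class; with `smoothing=1.0`, `love` still gets a small non-zero probability under the negative class.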

In [None]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
st = SQLTransformer(statement="select *, regexp_replace(regexp_replace(regexp_replace(lower(text), 'http\\\S+',''),'@\\\w+',''),'#','') as text_cleaned from __THIS__")
# Split the cleaned text on runs of non-word characters.
tokenizer = RegexTokenizer(inputCol='text_cleaned', outputCol='words', pattern="\\W+")
eng_stopwords = StopWordsRemover.loadDefaultStopWords(language="english")
swr = StopWordsRemover(inputCol='words', outputCol='words_filtered', stopWords=eng_stopwords)
# minDF=20: keep only words that appear in at least 20 tweets.
cv = CountVectorizer(inputCol='words_filtered', outputCol='features', minDF=20)
nb = NaiveBayes(smoothing=1.0, modelType='multinomial')

In [None]:
# Smoke-test the pipeline on a 1000-row sample before fitting it on the full data.
train_sample = train.limit(1000)
pipeline = Pipeline(stages = [st, tokenizer, swr, cv, nb])
pipelineModel = pipeline.fit(train_sample)
df = pipelineModel.transform(train_sample)
df.limit(20).display()

text,time,label,text_cleaned,words,words_filtered,features,rawPrediction,probability,prediction
"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0,"- awww, that's a bummer. you shoulda got david carr of third day to do it. ;d","List(awww, that, s, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, d)","List(awww, bummer, shoulda, got, david, carr, third, day, d)","Map(vectorType -> sparse, length -> 44, indices -> List(10, 16), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 1, values -> List(-7.397564902249693))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000,0.0,is upset that he can't update his facebook by texting it... and might cry as a result school today also. blah!,"List(is, upset, that, he, can, t, update, his, facebook, by, texting, it, and, might, cry, as, a, result, school, today, also, blah)","List(upset, update, facebook, texting, might, cry, result, school, today, also, blah)","Map(vectorType -> sparse, length -> 44, indices -> List(17), values -> List(1.0))","Map(vectorType -> dense, length -> 1, values -> List(-3.761811923604968))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000,0.0,i dived many times for the ball. managed to save 50% the rest go out of bounds,"List(i, dived, many, times, for, the, ball, managed, to, save, 50, the, rest, go, out, of, bounds)","List(dived, many, times, ball, managed, save, 50, rest, go, bounds)","Map(vectorType -> sparse, length -> 44, indices -> List(9), values -> List(1.0))","Map(vectorType -> dense, length -> 1, values -> List(-3.638197967637791))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000,0.0,my whole body feels itchy and like its on fire,"List(my, whole, body, feels, itchy, and, like, its, on, fire)","List(whole, body, feels, itchy, like, fire)","Map(vectorType -> sparse, length -> 44, indices -> List(3), values -> List(1.0))","Map(vectorType -> dense, length -> 1, values -> List(-3.4104140367670794))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000,0.0,"no, it's not behaving at all. i'm mad. why am i here? because i can't see you all over there.","List(no, it, s, not, behaving, at, all, i, m, mad, why, am, i, here, because, i, can, t, see, you, all, over, there)","List(behaving, m, mad, see)","Map(vectorType -> sparse, length -> 44, indices -> List(0, 27), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 1, values -> List(-6.673461079948418))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
@Kwesidei not the whole crew,2009-04-07T05:20:00.000+0000,0.0,not the whole crew,"List(not, the, whole, crew)","List(whole, crew)","Map(vectorType -> sparse, length -> 44, indices -> List(), values -> List())","Map(vectorType -> dense, length -> 1, values -> List(0.0))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
Need a hug,2009-04-07T05:20:03.000+0000,0.0,need a hug,"List(need, a, hug)","List(need, hug)","Map(vectorType -> sparse, length -> 44, indices -> List(34), values -> List(1.0))","Map(vectorType -> dense, length -> 1, values -> List(-4.103561217327025))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",2009-04-07T05:20:03.000+0000,0.0,"hey long time no see! yes.. rains a bit ,only a bit lol , i'm fine thanks , how's you ?","List(hey, long, time, no, see, yes, rains, a, bit, only, a, bit, lol, i, m, fine, thanks, how, s, you)","List(hey, long, time, see, yes, rains, bit, bit, lol, m, fine, thanks)","Map(vectorType -> sparse, length -> 44, indices -> List(0, 6, 27), values -> List(1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 1, values -> List(-10.222711561569714))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
@Tatiana_K nope they didn't have it,2009-04-07T05:20:05.000+0000,0.0,nope they didn't have it,"List(nope, they, didn, t, have, it)","List(nope, didn)","Map(vectorType -> sparse, length -> 44, indices -> List(40), values -> List(1.0))","Map(vectorType -> dense, length -> 1, values -> List(-4.221344252983409))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0
@twittera que me muera ?,2009-04-07T05:20:09.000+0000,0.0,que me muera ?,"List(que, me, muera)","List(que, muera)","Map(vectorType -> sparse, length -> 44, indices -> List(), values -> List())","Map(vectorType -> dense, length -> 1, values -> List(0.0))","Map(vectorType -> dense, length -> 1, values -> List(1.0))",0.0


In [None]:
pipelineModel.stages[3].vocabulary

Out[14]: ['m',
 'get',
 'sad',
 'like',
 'work',
 'quot',
 'time',
 'still',
 'sleep',
 'go',
 'day',
 'going',
 'back',
 'really',
 'one',
 'know',
 'got',
 'today',
 'miss',
 'good',
 'much',
 'want',
 'im',
 'night',
 'oh',
 'sorry',
 'tomorrow',
 'see',
 'home',
 'feel',
 'think',
 've',
 'u',
 'bad',
 'need',
 'hate',
 'new',
 '2',
 'well',
 '3',
 'didn',
 'sick',
 'wish',
 'love']
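
The vocabulary above comes from the count-vectorizer stage. The sketch below shows, in plain Python, the idea of a document-frequency cutoff followed by counting words per document. It is an illustration rather than Spark's implementation: `build_vocabulary` and `vectorize` are made-up names, the toy corpus is invented, and the alphabetical tie-break between equally frequent words is an assumption.

```python
from collections import Counter


def build_vocabulary(docs, min_df=20):
    """Keep words found in at least min_df documents, most frequent first."""
    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the word
    for words in docs:
        term_freq.update(words)
        doc_freq.update(set(words))
    kept = [w for w in term_freq if doc_freq[w] >= min_df]
    # Assumption: ties in frequency are broken alphabetically.
    return sorted(kept, key=lambda w: (-term_freq[w], w))


def vectorize(words, vocabulary):
    """Return a sparse {index: count} mapping; out-of-vocabulary words are dropped."""
    index = {w: i for i, w in enumerate(vocabulary)}
    counts = Counter(index[w] for w in words if w in index)
    return dict(sorted(counts.items()))


docs = [["need", "hug"], ["need", "sleep", "sleep"], ["sleep", "work"], ["work", "need"]]
vocabulary = build_vocabulary(docs, min_df=2)
print(vocabulary)
print(vectorize(["sleep", "sleep", "hug"], vocabulary))
```

In this toy corpus `hug` appears in only one document, so a cutoff of 2 drops it, which is the same effect `minDF=20` has on rare words in the tweet data.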

## Step 4: Train the pipeline

- Save the resulting model as `pipelineModel`.
- Use the model to transform the training data `train`.
- Display sample results.

> Note: the training may take several minutes

In [None]:
pipeline = Pipeline(stages = [st, tokenizer, swr, cv, nb])
pipelineModel = pipeline.fit(train)
df = pipelineModel.transform(train)
df.limit(100).display()

text,time,label,text_cleaned,words,words_filtered,features,rawPrediction,probability,prediction
"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0,"- awww, that's a bummer. you shoulda got david carr of third day to do it. ;d","List(awww, that, s, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, d)","List(awww, bummer, shoulda, got, david, carr, third, day, d)","Map(vectorType -> sparse, length -> 22148, indices -> List(2, 11, 72, 349, 737, 1074, 1788, 3377, 9562), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-73.10315816766634, -75.76010317369364))","Map(vectorType -> dense, length -> 2, values -> List(0.9344377541357022, 0.06556224586429775))",0.0
is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000,0.0,is upset that he can't update his facebook by texting it... and might cry as a result school today also. blah!,"List(is, upset, that, he, can, t, update, his, facebook, by, texting, it, and, might, cry, as, a, result, school, today, also, blah)","List(upset, update, facebook, texting, might, cry, result, school, today, also, blah)","Map(vectorType -> sparse, length -> 22148, indices -> List(7, 70, 174, 197, 425, 429, 440, 682, 1018, 1918, 2240), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-84.59444310092665, -89.98160775323247))","Map(vectorType -> dense, length -> 2, values -> List(0.9954459081643332, 0.004554091835666661))",0.0
@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000,0.0,i dived many times for the ball. managed to save 50% the rest go out of bounds,"List(i, dived, many, times, for, the, ball, managed, to, save, 50, the, rest, go, out, of, bounds)","List(dived, many, times, ball, managed, save, 50, rest, go, bounds)","Map(vectorType -> sparse, length -> 22148, indices -> List(5, 216, 256, 370, 800, 982, 1171, 1577), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-63.04531360734018, -63.7797837673144))","Map(vectorType -> dense, length -> 2, values -> List(0.6757854508682208, 0.32421454913177916))",0.0
my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000,0.0,my whole body feels itchy and like its on fire,"List(my, whole, body, feels, itchy, and, like, its, on, fire)","List(whole, body, feels, itchy, like, fire)","Map(vectorType -> sparse, length -> 22148, indices -> List(4, 331, 381, 705, 1043, 2813), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-47.009948334394494, -50.54773661405896))","Map(vectorType -> dense, length -> 2, values -> List(0.9717440469195154, 0.02825595308048461))",0.0
"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000,0.0,"no, it's not behaving at all. i'm mad. why am i here? because i can't see you all over there.","List(no, it, s, not, behaving, at, all, i, m, mad, why, am, i, here, because, i, can, t, see, you, all, over, there)","List(behaving, m, mad, see)","Map(vectorType -> sparse, length -> 22148, indices -> List(0, 21, 493, 10189), values -> List(1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-30.130604678083436, -31.07548249143339))","Map(vectorType -> dense, length -> 2, values -> List(0.7200838991455994, 0.27991610085440055))",0.0
@Kwesidei not the whole crew,2009-04-07T05:20:00.000+0000,0.0,not the whole crew,"List(not, the, whole, crew)","List(whole, crew)","Map(vectorType -> sparse, length -> 22148, indices -> List(331, 2083), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-17.98391091046203, -17.91489039377489))","Map(vectorType -> dense, length -> 2, values -> List(0.4827517176108538, 0.5172482823891462))",1.0
Need a hug,2009-04-07T05:20:03.000+0000,0.0,need a hug,"List(need, a, hug)","List(need, hug)","Map(vectorType -> sparse, length -> 22148, indices -> List(35, 815), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-14.601725469945839, -15.435554208423282))","Map(vectorType -> dense, length -> 2, values -> List(0.6971638872858877, 0.3028361127141123))",0.0
"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",2009-04-07T05:20:03.000+0000,0.0,"hey long time no see! yes.. rains a bit ,only a bit lol , i'm fine thanks , how's you ?","List(hey, long, time, no, see, yes, rains, a, bit, only, a, bit, lol, i, m, fine, thanks, how, s, you)","List(hey, long, time, see, yes, rains, bit, bit, lol, m, fine, thanks)","Map(vectorType -> sparse, length -> 22148, indices -> List(0, 12, 13, 21, 31, 76, 78, 88, 162, 423, 2559), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-80.42879077895951, -76.07687148586122))","Map(vectorType -> dense, length -> 2, values -> List(0.012718227357656941, 0.9872817726423432))",1.0
@Tatiana_K nope they didn't have it,2009-04-07T05:20:05.000+0000,0.0,nope they didn't have it,"List(nope, they, didn, t, have, it)","List(nope, didn)","Map(vectorType -> sparse, length -> 22148, indices -> List(69, 691), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-14.795096699692145, -16.118580610190293))","Map(vectorType -> dense, length -> 2, values -> List(0.7897607544738096, 0.21023924552619036))",0.0
@twittera que me muera ?,2009-04-07T05:20:09.000+0000,0.0,que me muera ?,"List(que, me, muera)","List(que, muera)","Map(vectorType -> sparse, length -> 22148, indices -> List(2372), values -> List(1.0))","Map(vectorType -> dense, length -> 2, values -> List(-10.504200888860087, -10.589274685072796))","Map(vectorType -> dense, length -> 2, values -> List(0.5212556307070653, 0.4787443692929346))",0.0
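
The `rawPrediction` and `probability` columns in the output above appear to be related by a normalized exponential (softmax): the raw values behave like per-class log scores, and the probabilities are those scores exponentiated and rescaled to sum to 1. The sketch below checks this on two rows copied from the output; `normalize_log_scores` is a made-up helper name.

```python
import math


def normalize_log_scores(raw):
    """Turn per-class log scores into probabilities that sum to 1."""
    top = max(raw)  # subtract the max first for numerical stability
    exps = [math.exp(r - top) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


# rawPrediction values copied from the first and sixth rows of the output.
first_row = normalize_log_scores([-73.10315816766634, -75.76010317369364])
sixth_row = normalize_log_scores([-17.98391091046203, -17.91489039377489])

# The predicted class is the index of the largest probability.
first_prediction = float(first_row.index(max(first_row)))
sixth_prediction = float(sixth_row.index(max(sixth_row)))
print(first_row, first_prediction)
print(sixth_row, sixth_prediction)
```

The sixth row ("not the whole crew") is a near tie that tips slightly toward class 1, which is why it shows up as a misclassified negative tweet.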


## Step 5: Obtain the accuracy of the model using the evaluator `e`

In [None]:
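# Uses the default column names `label` and `prediction`.
e = MulticlassClassificationEvaluator(metricName='accuracy')
# df holds predictions on the training data, so this is training accuracy.
e.evaluate(df)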

Out[16]: 0.77388125
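
Accuracy is simply the share of rows whose `prediction` equals the `label`. As a small worked example, the sketch below computes it for the ten rows displayed in the Step 4 output only (not the full dataset); the `accuracy` function is a made-up helper, not the Spark evaluator.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction matches the label."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equally long")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Labels and predictions of the ten rows shown in the Step 4 output.
labels = [0.0] * 10
predictions = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
print(accuracy(labels, predictions))
```

Eight of those ten rows are classified correctly. Keep in mind that the figure reported above is measured on the same data the model was trained on, so it is likely an optimistic estimate of performance on new tweets.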

## Step 6: Save the model on DBFS at `/FileStore/twitter_nbpipeline/`

- Then explore the saved model using `%fs` commands.

In [None]:
pipelineModel.write().overwrite().save("/FileStore/twitter_nbpipeline/")

In [None]:
%fs ls /FileStore/twitter_nbpipeline/stages

path,name,size,modificationTime
dbfs:/FileStore/twitter_nbpipeline/stages/0_SQLTransformer_9725fc062340/,0_SQLTransformer_9725fc062340/,0,0
dbfs:/FileStore/twitter_nbpipeline/stages/1_RegexTokenizer_b69ced09a92f/,1_RegexTokenizer_b69ced09a92f/,0,0
dbfs:/FileStore/twitter_nbpipeline/stages/2_StopWordsRemover_c7e93dd28d81/,2_StopWordsRemover_c7e93dd28d81/,0,0
dbfs:/FileStore/twitter_nbpipeline/stages/3_CountVectorizer_4ea34488b929/,3_CountVectorizer_4ea34488b929/,0,0
dbfs:/FileStore/twitter_nbpipeline/stages/4_NaiveBayes_347517a19d55/,4_NaiveBayes_347517a19d55/,0,0


## Step 7: Test the saved model

- Load the saved pipeline model as `pipelineModel2`.
- Use it to transform a small sample (e.g. 1000 rows) of the training data.
- View the results.

In [None]:
from pyspark.ml import PipelineModel
pipelineModel2 = PipelineModel.load("/FileStore/twitter_nbpipeline/")

In [None]:
test = train.limit(1000)

In [None]:
test_predicted = pipelineModel2.transform(test)
test_predicted.display()

text,time,label,text_cleaned,words,words_filtered,features,rawPrediction,probability,prediction
"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0,"- awww, that's a bummer. you shoulda got david carr of third day to do it. ;d","List(awww, that, s, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, d)","List(awww, bummer, shoulda, got, david, carr, third, day, d)","Map(vectorType -> sparse, length -> 22148, indices -> List(2, 11, 72, 349, 737, 1074, 1788, 3377, 9562), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-73.10315816766634, -75.76010317369364))","Map(vectorType -> dense, length -> 2, values -> List(0.9344377541357022, 0.06556224586429775))",0.0
is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000,0.0,is upset that he can't update his facebook by texting it... and might cry as a result school today also. blah!,"List(is, upset, that, he, can, t, update, his, facebook, by, texting, it, and, might, cry, as, a, result, school, today, also, blah)","List(upset, update, facebook, texting, might, cry, result, school, today, also, blah)","Map(vectorType -> sparse, length -> 22148, indices -> List(7, 70, 174, 197, 425, 429, 440, 682, 1018, 1918, 2240), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-84.59444310092665, -89.98160775323247))","Map(vectorType -> dense, length -> 2, values -> List(0.9954459081643332, 0.004554091835666661))",0.0
@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000,0.0,i dived many times for the ball. managed to save 50% the rest go out of bounds,"List(i, dived, many, times, for, the, ball, managed, to, save, 50, the, rest, go, out, of, bounds)","List(dived, many, times, ball, managed, save, 50, rest, go, bounds)","Map(vectorType -> sparse, length -> 22148, indices -> List(5, 216, 256, 370, 800, 982, 1171, 1577), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-63.04531360734018, -63.7797837673144))","Map(vectorType -> dense, length -> 2, values -> List(0.6757854508682208, 0.32421454913177916))",0.0
my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000,0.0,my whole body feels itchy and like its on fire,"List(my, whole, body, feels, itchy, and, like, its, on, fire)","List(whole, body, feels, itchy, like, fire)","Map(vectorType -> sparse, length -> 22148, indices -> List(4, 331, 381, 705, 1043, 2813), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-47.009948334394494, -50.54773661405896))","Map(vectorType -> dense, length -> 2, values -> List(0.9717440469195154, 0.02825595308048461))",0.0
"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000,0.0,"no, it's not behaving at all. i'm mad. why am i here? because i can't see you all over there.","List(no, it, s, not, behaving, at, all, i, m, mad, why, am, i, here, because, i, can, t, see, you, all, over, there)","List(behaving, m, mad, see)","Map(vectorType -> sparse, length -> 22148, indices -> List(0, 21, 493, 10189), values -> List(1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-30.130604678083436, -31.07548249143339))","Map(vectorType -> dense, length -> 2, values -> List(0.7200838991455994, 0.27991610085440055))",0.0
@Kwesidei not the whole crew,2009-04-07T05:20:00.000+0000,0.0,not the whole crew,"List(not, the, whole, crew)","List(whole, crew)","Map(vectorType -> sparse, length -> 22148, indices -> List(331, 2083), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-17.98391091046203, -17.91489039377489))","Map(vectorType -> dense, length -> 2, values -> List(0.4827517176108538, 0.5172482823891462))",1.0
Need a hug,2009-04-07T05:20:03.000+0000,0.0,need a hug,"List(need, a, hug)","List(need, hug)","Map(vectorType -> sparse, length -> 22148, indices -> List(35, 815), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-14.601725469945839, -15.435554208423282))","Map(vectorType -> dense, length -> 2, values -> List(0.6971638872858877, 0.3028361127141123))",0.0
"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",2009-04-07T05:20:03.000+0000,0.0,"hey long time no see! yes.. rains a bit ,only a bit lol , i'm fine thanks , how's you ?","List(hey, long, time, no, see, yes, rains, a, bit, only, a, bit, lol, i, m, fine, thanks, how, s, you)","List(hey, long, time, see, yes, rains, bit, bit, lol, m, fine, thanks)","Map(vectorType -> sparse, length -> 22148, indices -> List(0, 12, 13, 21, 31, 76, 78, 88, 162, 423, 2559), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-80.42879077895951, -76.07687148586122))","Map(vectorType -> dense, length -> 2, values -> List(0.012718227357656941, 0.9872817726423432))",1.0
@Tatiana_K nope they didn't have it,2009-04-07T05:20:05.000+0000,0.0,nope they didn't have it,"List(nope, they, didn, t, have, it)","List(nope, didn)","Map(vectorType -> sparse, length -> 22148, indices -> List(69, 691), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-14.795096699692145, -16.118580610190293))","Map(vectorType -> dense, length -> 2, values -> List(0.7897607544738096, 0.21023924552619036))",0.0
@twittera que me muera ?,2009-04-07T05:20:09.000+0000,0.0,que me muera ?,"List(que, me, muera)","List(que, muera)","Map(vectorType -> sparse, length -> 22148, indices -> List(2372), values -> List(1.0))","Map(vectorType -> dense, length -> 2, values -> List(-10.504200888860087, -10.589274685072796))","Map(vectorType -> dense, length -> 2, values -> List(0.5212556307070653, 0.4787443692929346))",0.0


## Next steps

Now we have trained and persisted the model. We can use it for streaming sentiment prediction in the next part of the lab.