<h1>Machine Learning Task</h1>

<h1>Load data from Elasticsearch</h1>

Read the data that was written to Elasticsearch during the ETL process.

In [0]:
# imports
import string
import numpy as np

import pyspark.sql.functions as F
from pyspark.sql.types import *

from pyspark.ml.linalg import SparseVector, DenseVector
from pyspark.ml.feature import IDF, Tokenizer,CountVectorizer, StopWordsRemover, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import RegressionEvaluator, BinaryClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from elasticsearch import Elasticsearch, helpers
import requests

In [0]:
ES_OUR_HOST = 'dds2019s-1010.eastus.cloudapp.azure.com'
index1 = "test3"
index2 = "test5"

In [0]:
es = Elasticsearch([{'host': ES_OUR_HOST}], timeout=60000)

# load the February data into df_old

if not es.indices.exists(index1):
    raise Exception("Index doesn't exist!")

df_old =  spark.read\
            .format("org.elasticsearch.spark.sql")\
            .option("es.nodes.wan.only","true")\
            .option("es.port","9200")\
            .option("es.nodes",ES_OUR_HOST)\
            .option("pushdown", "true")\
            .load(index1)

In [0]:
# load the June data into df_new

if not es.indices.exists(index2):
    raise Exception("Index doesn't exist!")

df_new =  spark.read\
            .format("org.elasticsearch.spark.sql")\
            .option("es.nodes.wan.only","true")\
            .option("es.port","9200")\
            .option("es.nodes",ES_OUR_HOST)\
            .option("pushdown", "true")\
            .load(index2)

In [0]:
df_old = df_old.dropDuplicates(["tweet_id"]).withColumn("label_text", F.lit("old")).withColumn("label", F.lit(0))
df_new = df_new.dropDuplicates(["tweet_id"]).withColumn("label_text", F.lit("new")).withColumn("label", F.lit(1))

df = df_old.union(df_new)

df = df.where(df.lang == 'en') # keep English tweets only, since we classify based on the 'text' field

We remove tweets with duplicate text because the data contains re-tweets, and we want to test our classifier on text it has not seen during training.
<br>Otherwise the measured accuracy might be higher, but it would not reflect the classifier's real predictive ability.

In [0]:
df = df.dropDuplicates(['text'])

<h1>Text Pre Processing</h1>

Each tweet goes through the following steps:
  * Create a list of words from the text.
  * Remove punctuation and standard English stop words.
  * Build a TF-IDF representation (with a vocabulary of 5k terms).
  * Filter out empty tweets.
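The first two steps can be sketched in plain Python on a single tweet. This is a minimal illustration, not the Spark pipeline itself: `preprocess` and the short `STOP_WORDS` set are hypothetical names introduced here (Spark's `StopWordsRemover` uses a much longer default list), and `str.split` only approximates the `Tokenizer`. The `rt` rule mirrors the notebook's filter: the `rt` token and the handle that follows it are both dropped.

```python
import string

# Tiny illustrative stop-word list (an assumption for this sketch only;
# Spark's StopWordsRemover ships a much longer default English list).
STOP_WORDS = {"a", "about", "all", "and", "at", "but", "do", "is", "it",
              "of", "the", "we", "what", "you"}

def remove_punc(word):
    # Strip every ASCII punctuation character from the token.
    return word.translate(str.maketrans("", "", string.punctuation))

def preprocess(text):
    # Lowercase and split on whitespace (approximates Spark's Tokenizer).
    tokens = text.lower().split()
    cleaned = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "rt":
            i += 2  # skip "rt" and the retweeted user's handle
            continue
        w = remove_punc(tokens[i])
        if w:  # drop tokens that were pure punctuation
            cleaned.append(w)
        i += 1
    # Stop-word removal happens after punctuation removal, as in the notebook.
    return [w for w in cleaned if w not in STOP_WORDS]

sample = ("RT @AJEnglish: You've all heard about the #coronavirus - "
          "but what do we really know about it?")
tokens = preprocess(sample)
print(tokens)
```

Note that hashtags lose their `#` and contractions such as `you've` collapse to `youve`, which is also visible in the `filtered` column of the results further down.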

In [0]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsData = tokenizer.transform(df)

def remove_punc(word):
  return word.translate(str.maketrans('', '', string.punctuation))

def words_filter(l):
  to_return = []
  i = 0
  while i < len(l):
    word = l[i]
    if word=='rt':
      i = i+2
      continue
    w = remove_punc(word)   
    if len(w)!=0:
      to_return.append(w)
    i = i + 1
  return to_return

words_filter_udf = F.udf(words_filter, ArrayType(StringType()))

wordsData = wordsData.withColumn("punc_free", words_filter_udf(F.col("words")))

remover = StopWordsRemover(inputCol="punc_free", outputCol="filtered")
wordsData = remover.transform(wordsData)

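@F.udf(returnType=IntegerType())
def length(l):
  # number of non-zero entries in a feature vector (0 means an empty tweet)
  return int(l.numNonzeros())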

cv = CountVectorizer(inputCol="filtered", outputCol="tf", vocabSize=5000)

In [0]:
# split into train and test sets
train, test = wordsData.randomSplit([0.75, 0.25])

In [0]:
# create vocabulary from train
cv_model = cv.fit(train)
vocab = cv_model.vocabulary

# tf-idf transformation for train & test
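def tfidf_tr(df):
  # term counts using the vocabulary learned from train
  tf = cv_model.transform(df)
  tf.cache()
  # note: IDF is fit on the DataFrame passed in, so train and test each get their own IDF weights
  idf = IDF(inputCol="tf", outputCol="features").fit(tf)
  tfidf = idf.transform(tf)

  # drop tweets whose feature vector is empty (no word from the vocabulary)
  tfidf = tfidf.withColumn("len", length(F.col("features")))
  tfidf = tfidf.where(tfidf.len!=0).drop('len')
  return tfidf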

train_tfidf = tfidf_tr(train)
test_tfidf = tfidf_tr(test)
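To make the TF-IDF weighting concrete, here is a small plain-Python sketch of the smoothed formula that Spark's `IDF` documents, `idf(t) = log((m + 1) / (df(t) + 1))`, where `m` is the number of documents and `df(t)` the number of documents containing term `t`. The helper names `idf_weights` and `tfidf` and the toy documents are assumptions made for illustration. The sketch fits the weights on one set of documents and reuses them for a new document, which is one common way to keep train and test features on the same scale.

```python
import math

def idf_weights(docs):
    """Return {term: log((m + 1) / (df + 1))} for a list of token lists."""
    m = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t: math.log((m + 1) / (d + 1)) for t, d in df.items()}

def tfidf(doc, idf):
    """Weight raw term counts by IDF; terms outside the vocabulary are ignored."""
    counts = {}
    for term in doc:
        if term in idf:
            counts[term] = counts.get(term, 0) + 1
    return {t: c * idf[t] for t, c in counts.items()}

# Toy "train" documents (hypothetical data).
train_docs = [
    ["coronavirus", "outbreak"],
    ["coronavirus", "masks"],
    ["coronavirus", "wuhan"],
]
idf = idf_weights(train_docs)

# A new document is weighted with the IDF learned above.
vec = tfidf(["coronavirus", "coronavirus", "masks", "unseen"], idf)
print(vec)
```

A term that appears in every document gets weight `log(1) = 0`, which is why very common words contribute little to the feature vector even when they occur several times in a tweet.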

<h2>Task definition</h2>

As seen in the Data Analysis notebook, tweets from February differ from tweets from June.
<br>This difference is also reflected in the text (Q3).
<br>We therefore defined our learning task as classifying the time period (February or June) from the tweet's text.

<h1>Cross-validation on train set</h1>

The procedure:
  * Split the train set into 4 folds.
  * For each choice of a single fold as the inner test set (with the rest as train):
    * Fit a LogisticRegression model for each pair of parameters (every combination of regParam and threshold)
    * Evaluate the area under the ROC curve (AUC)
  * Pick the pair of parameters with the best AUC averaged over the folds
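The steps above can be sketched without Spark. This is a simplified illustration under stated assumptions: `k_fold_indices`, `grid_search_cv` and `toy_score` are hypothetical names, the folds are contiguous (Spark's `CrossValidator` assigns rows to folds randomly), and `toy_score` merely stands in for "fit the model on the train folds and compute AUC on the held-out fold".

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search_cv(n, k, grid, score_fn):
    """Average score_fn over the k folds for every parameter combination."""
    folds = k_fold_indices(n, k)
    names = list(grid)
    results = {}
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i, test_idx in enumerate(folds):
            # every fold except fold i is used for training
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(score_fn(train_idx, test_idx, params))
        results[values] = sum(scores) / k
    best = max(results, key=results.get)
    return dict(zip(names, best)), results

def toy_score(train_idx, test_idx, params):
    # Hypothetical stand-in for "fit LogisticRegression, evaluate AUC".
    return 1.0 - abs(params["regParam"] - 0.2) - abs(params["threshold"] - 0.5)

grid = {"regParam": [0, 0.2, 0.4], "threshold": [0.4, 0.5, 0.6]}
best_params, all_scores = grid_search_cv(n=10, k=4, grid=grid, score_fn=toy_score)
print(best_params)
```

With the grid used in this notebook (6 values of regParam, 7 of threshold, 4 folds) the same loop would fit 6 × 7 × 4 = 168 models.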

In [0]:
# create model
lr = LogisticRegression()

# create pipe for cv
pipeline = Pipeline().setStages([lr])

params = ParamGridBuilder().addGrid(lr.regParam, [0, 0.2, 0.4, 0.6, 0.8, 1])\
                           .addGrid(lr.threshold, [0.35,0.4,0.45,0.5,0.55,0.6,0.65]).build()

# default metric is AUC; it is computed here from the hard "prediction" column,
# so the result depends on the threshold
evaluator = BinaryClassificationEvaluator()\
  .setRawPredictionCol("prediction")\
  .setLabelCol("label")

# CV (Model Selection)
cross_val = CrossValidator()\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)\
  .setEstimatorParamMaps(params)\
  .setNumFolds(4)

# Run cross-validation, and choose the best set of parameters.
cvModel = cross_val.fit(train_tfidf)

<h1>Classification (run model on test set)</h1>

In [0]:
# best params (according to cv results), looked up by Param rather than by position
best_map = cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)]
best_regParam = best_map[lr.regParam]
best_threshold = best_map[lr.threshold]

In [0]:
print(best_regParam, best_threshold)

In [0]:
# create model (using the best param)
lr = LogisticRegression(regParam=best_regParam, threshold=best_threshold)

# Fit the model
lrModel = lr.fit(train_tfidf)

In [0]:
res_tr = lrModel.transform(test_tfidf) # predict
display(res_tr)

date,lang,location,month,source,text,tweet_id,user_followers_count,user_friends_count,user_id,user_listed_count,label_text,label,words,punc_free,filtered,tf,features,rawPrediction,probability,prediction
2020-02-07T05:01:07.000+0000,en,"List(Seattle, United States, Washington)",2,Buffer,"Medical workers take the temperature of a woman at Queen Elizabeth Hospital, following the coronavirus outbreak. Ph… https://t.co/NGNPegURf9",1225645641759514624,166,258,1218404702183800834,11,old,0,"List(medical, workers, take, the, temperature, of, a, woman, at, queen, elizabeth, hospital,, following, the, coronavirus, outbreak., ph…, https://t.co/ngnpegurf9)","List(medical, workers, take, the, temperature, of, a, woman, at, queen, elizabeth, hospital, following, the, coronavirus, outbreak, ph…, httpstcongnpegurf9)","List(medical, workers, take, temperature, woman, queen, elizabeth, hospital, following, coronavirus, outbreak, ph…, httpstcongnpegurf9)","List(0, 5000, List(0, 8, 38, 98, 121, 204, 362, 384, 1855), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(0, 8, 38, 98, 121, 204, 362, 384, 1855), List(0.5380151026077924, 2.9137619965925565, 4.128206100785787, 4.583362055564429, 4.7187666925706315, 5.263493868012304, 5.574432802628352, 5.660375232429077, 7.309033858016458))","List(1, 2, List(), List(1.8204880198735984, -1.8204880198735984))","List(1, 2, List(), List(0.8606246752487178, 0.13937532475128228))",0.0
2020-02-07T05:22:01.000+0000,en,"List(null, Philippines, null)",2,Twitter Web App,"I just read your article. Very well said. I'm so sick of the hate-mongering western politicians, and their propagan… https://t.co/9MSe12Ovdd",1225650902201094146,67,148,968495142167396353,1,old,0,"List(i, just, read, your, article., very, well, said., i'm, so, sick, of, the, hate-mongering, western, politicians,, and, their, propagan…, https://t.co/9mse12ovdd)","List(i, just, read, your, article, very, well, said, im, so, sick, of, the, hatemongering, western, politicians, and, their, propagan…, httpstco9mse12ovdd)","List(read, article, well, said, im, sick, hatemongering, western, politicians, propagan…, httpstco9mse12ovdd)","List(0, 5000, List(100, 134, 149, 186, 306, 403, 1826, 2777), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(100, 134, 149, 186, 306, 403, 1826, 2777), List(4.503655312953686, 4.824127208228458, 4.738969399888151, 4.941910243884841, 5.363123708961146, 5.551175940464085, 6.881589843189519, 7.619188786320298))","List(1, 2, List(), List(0.560926070763465, -0.560926070763465))","List(1, 2, List(), List(0.6366667880983282, 0.36333321190167167))",1.0
2020-02-07T05:26:32.000+0000,en,"List(null, United Kingdom, null)",2,dlvr.it,China’s Coronavirus Whistleblower Is Now Memorialized on Ethereum https://t.co/2LLyS7sDAr #News #BlockchainTechnology #Coronavirus,1225652038664482819,8823,9711,1059910799458754561,39,old,0,"List(china’s, coronavirus, whistleblower, is, now, memorialized, on, ethereum, https://t.co/2llys7sdar, #news, #blockchaintechnology, #coronavirus)","List(china’s, coronavirus, whistleblower, is, now, memorialized, on, ethereum, httpstco2llys7sdar, news, blockchaintechnology, coronavirus)","List(china’s, coronavirus, whistleblower, memorialized, ethereum, httpstco2llys7sdar, news, blockchaintechnology, coronavirus)","List(0, 5000, List(0, 14, 26, 112, 884, 937), List(2.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(0, 14, 26, 112, 884, 937), List(1.0760302052155848, 3.402358458914908, 3.802475960696477, 4.69407407998026, 6.255883943425106, 6.303511992414361))","List(1, 2, List(), List(3.0525246838037656, -3.0525246838037656))","List(1, 2, List(), List(0.9548913992573365, 0.045108600742663645))",0.0
2020-02-07T05:32:11.000+0000,en,"List(null, Uganda, Kampala)",2,Twitter for Android,@owishemwe @KagutaMuseveni The coronavirus outbreak has really showed us Ugandans how ill-prepared our country is f… https://t.co/z91Zn6wcJT,1225653460663582721,909,689,1454534238,2,old,0,"List(@owishemwe, @kagutamuseveni, the, coronavirus, outbreak, has, really, showed, us, ugandans, how, ill-prepared, our, country, is, f…, https://t.co/z91zn6wcjt)","List(owishemwe, kagutamuseveni, the, coronavirus, outbreak, has, really, showed, us, ugandans, how, illprepared, our, country, is, f…, httpstcoz91zn6wcjt)","List(owishemwe, kagutamuseveni, coronavirus, outbreak, really, showed, us, ugandans, illprepared, country, f…, httpstcoz91zn6wcjt)","List(0, 5000, List(0, 8, 16, 111, 157, 449, 1393, 3003, 3031, 4197), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(0, 8, 16, 111, 157, 449, 1393, 3003, 3031, 4197), List(0.5380151026077924, 2.9137619965925565, 3.497936771178273, 4.614406677246389, 4.89312007971541, 5.586267260275355, 6.615886677456513, 7.126712301222504, 7.532177409330668, 7.619188786320298))","List(1, 2, List(), List(3.0875535375317584, -3.0875535375317584))","List(1, 2, List(), List(0.9563764111159043, 0.043623588884095676))",0.0
2020-02-07T05:40:30.000+0000,en,,2,Twitter for Android,"RT @rajfortyseven: #China #Wuhan #nCoV #nCoV2019 #LiWenliang reports clearly indicate special #virology tests performed probably at #P4Lab,…",1225655553315102720,45,338,1201778901342539776,0,old,0,"List(rt, @rajfortyseven:, #china, #wuhan, #ncov, #ncov2019, #liwenliang, reports, clearly, indicate, special, #virology, tests, performed, probably, at, #p4lab,…)","List(china, wuhan, ncov, ncov2019, liwenliang, reports, clearly, indicate, special, virology, tests, performed, probably, at, p4lab…)","List(china, wuhan, ncov, ncov2019, liwenliang, reports, clearly, indicate, special, virology, tests, performed, probably, p4lab…)","List(0, 5000, List(1, 2, 118, 229, 448, 488, 548, 871, 919, 1748, 1830), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(1, 2, 118, 229, 448, 488, 548, 871, 919, 1748, 1830), List(2.2527880526681505, 2.323602418283493, 4.655791893409243, 5.1728969726600775, 5.7130189659144985, 5.889949674073577, 5.635057424444787, 6.232894425200407, 6.105061053690522, 7.183870715062453, 6.972561621395245))","List(1, 2, List(), List(3.1372537015449815, -3.1372537015449815))","List(1, 2, List(), List(0.9584035339414149, 0.04159646605858509))",0.0
2020-02-07T05:49:08.000+0000,en,,2,Twitter for iPhone,He died “of coronavirus” https://t.co/MjRJvL4pu8,1225657728443371520,8503,5122,234180888,24,old,0,"List(he, died, “of, coronavirus”, https://t.co/mjrjvl4pu8)","List(he, died, “of, coronavirus”, httpstcomjrjvl4pu8)","List(died, “of, coronavirus”, httpstcomjrjvl4pu8)","List(0, 5000, List(45, 2243), List(1.0, 1.0))","List(0, 5000, List(45, 2243), List(4.185201581835152, 7.126712301222504))","List(1, 2, List(), List(1.145149453844211, -1.145149453844211))","List(1, 2, List(), List(0.7586238299852942, 0.24137617001470565))",0.0
2020-02-07T05:53:36.000+0000,en,,2,Twitter for iPhone,Coronavirus got nothing on these masks 🔥 https://t.co/gNvoddQCg7,1225658851799658497,106,183,3058854792,1,old,0,"List(coronavirus, got, nothing, on, these, masks, 🔥, https://t.co/gnvoddqcg7)","List(coronavirus, got, nothing, on, these, masks, 🔥, httpstcognvoddqcg7)","List(coronavirus, got, nothing, masks, 🔥, httpstcognvoddqcg7)","List(0, 5000, List(0, 74, 128, 560), List(1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(0, 74, 128, 560), List(0.5380151026077924, 4.5240226157781205, 4.905096270762125, 5.754404182077353))","List(1, 2, List(), List(1.1099458423186044, -1.1099458423186044))","List(1, 2, List(), List(0.7521190146358144, 0.24788098536418565))",0.0
2020-02-07T05:54:19.000+0000,en,"List(null, Uganda, Kampala)",2,Twitter for iPhone,RT @Sambannz: Regime apologists are caught between a hard place and a rock on the issue of the Ugandans in Wuhan. As if they want to show…,1225659030112063490,2113,912,65859751,46,old,0,"List(rt, @sambannz:, regime, apologists, are, caught, between, a, hard, place, and, a, rock, on, the, issue, of, the, ugandans, in, wuhan., , as, if, they, want, to, show…)","List(regime, apologists, are, caught, between, a, hard, place, and, a, rock, on, the, issue, of, the, ugandans, in, wuhan, as, if, they, want, to, show…)","List(regime, apologists, caught, hard, place, rock, issue, ugandans, wuhan, want, show…)","List(0, 5000, List(2, 136, 512, 516, 542, 603, 1393, 2676), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(2, 136, 512, 516, 542, 603, 1393, 2676), List(2.323602418283493, 4.744084500554922, 5.782977554521409, 5.906210194945357, 5.754404182077353, 5.842696789223031, 6.615886677456513, 6.972561621395245))","List(1, 2, List(), List(1.7405363295430092, -1.7405363295430092))","List(1, 2, List(), List(0.8507551764750425, 0.14924482352495738))",0.0
2020-02-07T06:01:02.000+0000,en,,2,Twitter for iPad,RT @AJEnglish: You've all heard about the #coronavirus - but what do we really know about it? #AJStartHere explains https://t.co/eYNuSgLgsX,1225660721989468160,7,31,1394499685,1,old,0,"List(rt, @ajenglish:, you've, all, heard, about, the, #coronavirus, -, but, what, do, we, really, know, about, it?, #ajstarthere, explains, https://t.co/eynusglgsx)","List(youve, all, heard, about, the, coronavirus, but, what, do, we, really, know, about, it, ajstarthere, explains, httpstcoeynusglgsx)","List(youve, heard, coronavirus, really, know, ajstarthere, explains, httpstcoeynusglgsx)","List(0, 5000, List(0, 36, 157, 706, 2083, 2246), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 5000, List(0, 36, 157, 706, 2083, 2246), List(0.5380151026077924, 4.076912806398237, 4.89312007971541, 6.145883048210778, 7.244495336878887, 7.4521347016571315))","List(1, 2, List(), List(0.8492650719982993, -0.8492650719982993))","List(1, 2, List(), List(0.7004129518324279, 0.2995870481675721))",0.0
2020-02-07T06:26:09.000+0000,en,"List(null, Canada, British Columbia)",2,Twitter Web App,@Fuplaayz Is there a higher quality meme of the coronavirus multiplier?,1225667041270517761,1781,1535,15474479,18,old,0,"List(@fuplaayz, is, there, a, higher, quality, meme, of, the, coronavirus, multiplier?)","List(fuplaayz, is, there, a, higher, quality, meme, of, the, coronavirus, multiplier)","List(fuplaayz, higher, quality, meme, coronavirus, multiplier)","List(0, 5000, List(0, 1063, 2803), List(1.0, 1.0, 1.0))","List(0, 5000, List(0, 1063, 2803), List(0.5380151026077924, 6.353522412989022, 7.81985948178245))","List(1, 2, List(), List(0.6701084582658972, -0.6701084582658972))","List(1, 2, List(), List(0.6615274443977348, 0.3384725556022652))",0.0


<h1>Evaluating Results</h1>

In [0]:
res_ev = lrModel.evaluate(test_tfidf) # evaluate

accuracy = res_ev.accuracy
falsePositiveRate = res_ev.weightedFalsePositiveRate
truePositiveRate = res_ev.weightedTruePositiveRate
fMeasure = res_ev.weightedFMeasure()
precision = res_ev.weightedPrecision
recall = res_ev.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

Visualizing accuracy per label

In [0]:
def acc(a,b):
  if a==b:
    return 'True Prediction'
  return 'Wrong Prediction'

acc_udf = F.udf(acc, StringType())

# use a separate name so the acc() function above is not shadowed
acc_df = res_tr.withColumn('final_res', acc_udf(F.col('prediction'), F.col('label')))
display(acc_df.groupby('final_res', 'label_text').count())

final_res,label_text,count
True Prediction,old,13442
Wrong Prediction,new,692
True Prediction,new,6911
Wrong Prediction,old,1314
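The counts in the table above are enough to recompute overall and per-label accuracy by hand. The sketch below does that arithmetic in plain Python; the `counts` dictionary simply copies the four displayed rows, and `per_label_accuracy` is a helper name introduced here for illustration.

```python
# Counts copied from the displayed table: (final_res, label_text) -> count
counts = {
    ("True Prediction", "old"): 13442,
    ("Wrong Prediction", "old"): 1314,
    ("True Prediction", "new"): 6911,
    ("Wrong Prediction", "new"): 692,
}

def per_label_accuracy(counts, label):
    """Fraction of tweets with the given true label that were predicted correctly."""
    correct = counts[("True Prediction", label)]
    wrong = counts[("Wrong Prediction", label)]
    return correct / (correct + wrong)

total = sum(counts.values())
correct_total = counts[("True Prediction", "old")] + counts[("True Prediction", "new")]

overall = correct_total / total
acc_old = per_label_accuracy(counts, "old")
acc_new = per_label_accuracy(counts, "new")

print(f"tweets={total} overall={overall:.4f} old={acc_old:.4f} new={acc_new:.4f}")
```

Both labels come out at roughly 91%, so on this test set the classifier does not appear to favour the larger "old" class.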


<h1>Convergence</h1>

For different portions of the dataset (ratios in the (0,1] range), train the model above and evaluate its accuracy.

In [0]:
schema = train_tfidf.schema
n_train = train_tfidf.count()
n_test = test_tfidf.count()
resSchema = StructType([ StructField("ratio", DoubleType(), True), StructField("accuracy", DoubleType(), True) ])
k_values = [x for x in np.arange(0.05,1.01,0.05)] + [0.001*(2**x) for x in range(1,10)]

In [0]:
res = []

for k in k_values:
  cur_train = spark.createDataFrame(train_tfidf.rdd.takeSample(False, int(n_train*k), seed= 42), schema=schema)
  cur_test = spark.createDataFrame(test_tfidf.rdd.takeSample(False, int(n_test*k), seed= 42), schema=schema)
  cur_lr = LogisticRegression(regParam=best_regParam, threshold=best_threshold)
  cur_lrModel = cur_lr.fit(cur_train)
  cur_res_tr = cur_lrModel.transform(cur_test)
  cur_res_ev = cur_lrModel.evaluate(cur_test)
  cur_accuracy = cur_res_ev.accuracy
  res.append([float(k), cur_accuracy])

In [0]:
res_df = spark.createDataFrame(res, resSchema)
display(res_df.sort('ratio'))

ratio,accuracy
0.002,0.7727272727272727
0.004,0.7303370786516854
0.008,0.8258426966292135
0.016,0.8263305322128851
0.032,0.8041958041958042
0.05,0.832737030411449
0.064,0.843466107617051
0.1,0.8667262969588551
0.128,0.8731656184486373
0.15,0.881633870005963


As seen above, accuracy shows no significant improvement once the data portion exceeds k ≈ 0.4, which indicates that the model has converged.
<br>We therefore consider our dataset large enough for training the model.
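One way to make "no significant improvement" checkable is to look at the accuracy change between consecutive ratios. The sketch below is an illustration only: `marginal_gains`, `converged_from` and the tolerance of 0.01 are assumptions introduced here, and `curve` contains just the ten rows displayed above (rounded to four decimals), so it says nothing about ratios beyond 0.15. Running the same check on the full `res` list would be needed to examine the k ≈ 0.4 claim.

```python
# (ratio, accuracy) rows copied from the displayed output, rounded to 4 decimals
curve = [
    (0.002, 0.7727), (0.004, 0.7303), (0.008, 0.8258), (0.016, 0.8263),
    (0.032, 0.8042), (0.05, 0.8327), (0.064, 0.8435), (0.1, 0.8667),
    (0.128, 0.8732), (0.15, 0.8816),
]

def marginal_gains(curve):
    """Accuracy change between each pair of consecutive ratios."""
    pts = sorted(curve)
    return [(r1, a1 - a0) for (r0, a0), (r1, a1) in zip(pts, pts[1:])]

def converged_from(curve, tol):
    """Smallest ratio after which every later accuracy change is below tol.

    Returns None if even the last step changes accuracy by tol or more.
    """
    pts = sorted(curve)
    for i in range(len(pts) - 1):
        later = pts[i:]
        if all(abs(b[1] - a[1]) < tol for a, b in zip(later, later[1:])):
            return pts[i][0]
    return None

for ratio, gain in marginal_gains(curve):
    print(f"up to ratio {ratio}: accuracy change {gain:+.4f}")

print(converged_from(curve, tol=0.01))
```

The tolerance is a judgment call: a stricter value such as 0.001 reports no convergence at all on these ten rows.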