# Exercise 2

In this exercise we use PySpark to build a binary classifier that decides whether a given tweet is about KPOP or another topic, using a supervised machine learning technique, the Support Vector Machine (SVM).

For parts marked with **[CODE CHANGE REQUIRED]**, you need to modify or complete the code before execution.
For parts without **[CODE CHANGE REQUIRED]**, you can just run the given code.

The task here is to build a classifier that distinguishes KPOP tweets from all other tweets.

For example, the following tweet falls into the Korean Pop category because it appears to be about someone from Korea:
```text
crazy cool jae s lee's pic of street singer reflected in raindrops tuesday on 2nd ave  
```
On the other hand, the following tweet is not relevant to KPOP:
```text
accident closes jae valley rd drivers advised to avoid area seek alternate routes
```
To achieve the goal, we need to develop a classifier using a supervised machine learning technique. In this example, we use a Support Vector Machine (SVM) as the classification algorithm. At a high level, we need to "train" the model with some manually labelled data and then test the trained model.

The SVM expects each input record to be represented as a label (yes or no, i.e. 1 or 0) accompanied by a feature vector. The feature vector is a vector of values that ideally distinguishes one entry from another. In this setting, the features have to be chosen by the programmer.



## Uploading the data

**[CODE CHANGE REQUIRED]** 
Modify the following bash cell according to your environment and upload the data.

If the cell below takes too long to run, Zeppelin may kill it, e.g.

```text
Paragraph received a SIGTERM
ExitValue: 143
```

You may copy, paste and run the commands in a terminal (via ssh).

However, due to a bug in Hadoop version 3.3.x, you may still see the following warning, which is harmless.

```text
2021-11-03 14:39:57,306 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1252)
	at java.lang.Thread.join(Thread.java:1326)
	at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:986)
	at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:640)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:810)
```

In [2]:
%sh
export PATH=$PATH:/home/ec2-user/hadoop/bin/

namenode=ip-172-31-86-18 # TODO:change me

hdfs dfs -rm -r hdfs://$namenode:9000/lab12/ex2/
hdfs dfs -mkdir -p hdfs://$namenode:9000/lab12/ex2/
hdfs dfs -put /home/ec2-user/git/50043-labs/lab12/data/ex2/label_data hdfs://$namenode:9000/lab12/ex2/


## Importing and Setup

**[CODE CHANGE REQUIRED]**

Let's import all the required libraries and set the Hadoop file system name node IP.

We make use of `numpy`, a Python library for numeric computation.
If Python complains about `numpy not found`, go to the terminal and run the following on every data node in the cluster:

```bash
$ sudo pip3 install numpy sets
```

Alternatively, you can use flintrock to issue the above command to all the nodes in your cluster:

```bash
$ flintrock run-command my_test_cluster 'sudo pip3 install sets numpy'
```

In [4]:
%pyspark


import re
import sets, math
import numpy # make sure numpy is installed on all datanodes using the command pip3 install numpy

from pyspark.sql import SQLContext, SparkSession  # SparkSession is used below to obtain the SparkContext
from pyspark.mllib import *
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.evaluation import BinaryClassificationMetrics

sparkSession = SparkSession.builder.appName("SVM notebook").getOrCreate()
sc = sparkSession.sparkContext

hdfs_nn = "ip-172-31-86-18" # TODO: fixme




## Loading the data
We load the data from HDFS. The `.sample(False,0.1)` call samples the input dataset. If you are running on a full cluster, feel free to remove the sampling.

The first argument is a boolean flag called `withReplacement`. When it is `True`, the same element may appear more than once in the sample.
The second argument is the fraction of elements to keep in the sample. `0.1` means we expect about 10% of the entire data set in the sample. You might set it to a lower ratio if it takes too long to run on a t2.micro.
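
To make the two arguments concrete, here is a minimal plain-Python sketch (no Spark required) of sampling without replacement at a given fraction. It assumes each element is kept independently with probability `fraction`, which is why the sample size is only approximately 10% of the input. The helper name `sample_without_replacement` and the seed value are illustrative and not part of the PySpark API.

```python
import random

def sample_without_replacement(items, fraction, seed=None):
    """Keep each item independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

tweets = ["tweet %d" % i for i in range(10000)]
sampled = sample_without_replacement(tweets, 0.1, seed=42)

# The size is close to, but usually not exactly, 10% of the input.
print(len(sampled))
# No element can appear twice, since each one is considered only once.
print(len(sampled) == len(set(sampled)))
```

Lowering `fraction` shrinks the expected sample size proportionally, which is the knob to turn when the notebook runs too slowly.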



In [6]:
%pyspark
def remove_punct(tweet):
    return re.sub('[\'".,!#]','',tweet)

posTXT = sc.textFile("hdfs://%s:9000/lab12/ex2/label_data/Kpop/*.txt" % hdfs_nn).sample(False,0.1).map(remove_punct)
negTXT = sc.textFile("hdfs://%s:9000/lab12/ex2/label_data/othertweet/*.txt" % hdfs_nn).sample(False,0.1).map(remove_punct)



 

## Exercise 2.1 Build a language model using TFIDF

In Natural Language Processing, we often model a language using the bag-of-words model. The idea is to represent text as vectors.

One simple and effective method is Term Frequency-Inverse Document Frequency (TF-IDF).

$$
\mathrm{TFIDF}(w) = \mathrm{TF}(w) \cdot \log\left(\frac{\mathit{NDoc}}{\mathrm{DF}(w)}\right)
$$

where *NDoc* is the number of documents.


* TF is the word count. For instance, consider the following text data.
```text
apple smart phones made by apple
android smart phones made by others
```
We assume that each line is a document, hence there are two documents here.

* The term frequency is
```text
apple, 2
android, 1
phones, 2
smart, 2
made, 2
by, 2
others, 1
```
The term frequency is simply the word count, i.e. the number of occurrences of a word across all documents.

* The document frequency is 

```text
apple, 1
android, 1
phones, 2
smart, 2
made, 2
by, 2
others, 1
```

The document frequency is the number of documents in which a word appears.


* IDF is the total number of documents/records divided by the number of documents/records containing the word, with a (natural) logarithm applied to the quotient. The IDF for the above example is
```text
apple, log(2/1)
android, log(2/1)
phones, log(2/2)
smart, log(2/2)
made, log(2/2)
by, log(2/2)
others, log(2/1)
```
that is
```text
apple, 0.693
android, 0.693
phones, 0
smart, 0
made, 0
by, 0
others, 0.693
```

* TF-IDF is obtained by multiplying the TF with the IDF.
```text
apple, 1.386
android, 0.693
phones, 0
smart, 0
made, 0
by, 0
others, 0.693
```
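
The worked example above can be reproduced in a few lines of plain Python. This is only a sketch for checking the numbers on the two sample documents, not the PySpark solution asked for below. The variable names are illustrative, and the natural logarithm is assumed, as in the IDF values listed above.

```python
import math
from collections import Counter

docs = [
    "apple smart phones made by apple".split(),
    "android smart phones made by others".split(),
]

# TF: total number of occurrences of each word across all documents.
tf_counts = Counter(w for doc in docs for w in doc)
# DF: number of documents containing each word (duplicates within a document removed).
df_counts = Counter(w for doc in docs for w in set(doc))

n_doc = len(docs)
tfidf_scores = {w: tf_counts[w] * math.log(n_doc / df_counts[w]) for w in tf_counts}

# Print the words from highest to lowest score.
for word, score in sorted(tfidf_scores.items(), key=lambda p: -p[1]):
    print(word, round(score, 3))
```

Words that occur in every document, such as `phones`, get a score of zero, which is why they carry no weight as features.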



### Define `tf`
**[CODE CHANGE REQUIRED]** 
Complete the following snippet to define `tf`


<style>
    div.hidecode + pre {display: none}
</style>
<script>
doclick=function(e) {
    e.nextSibling.nextSibling.style.display="block";
}
</script>

<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```text
tf is the same as word count
```

In [9]:
%pyspark
def tf(terms): 
    '''
    input
    terms :  a RDD of lists of terms (words)
    output
    a RDD of pairs i.e. (word, tf_score)
    '''
    # TODO
    return None


### Sample answer 



<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```python
def tf(terms): 
    '''
    input
    terms :  a RDD of lists of terms (words)
    output
    a RDD of pairs i.e. (word, tf_score)
    '''
    # ANSWER
    return terms.flatMap(lambda seq: map(lambda w:(w,1), seq)).reduceByKey(lambda x,y:x + y)

```



### Test Case for `tf`

Run the following cell; you should see (the order of the pairs may vary)

```
[('apple', 2), ('by', 2), ('android', 1), ('smart', 2), ('made', 2), ('phones', 2), ('others', 1)]
```



In [12]:
%pyspark


def one_grams(s):
    return s.split()


test_terms = [one_grams("apple smart phones made by apple"), one_grams("android smart phones made by others")]
test_tf = tf(sc.parallelize(test_terms))
test_tf.collect()

### Define `df`

**[CODE CHANGE REQUIRED]** 

Complete the following snippet to define `df`


<style>
    div.hidecode + pre {display: none}
</style>
<script>
doclick=function(e) {
    e.nextSibling.nextSibling.style.display="block";
}
</script>

<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```text
df differs slightly from tf. Instead of emitting (word,1) for every word in a tweet directly, we should first remove duplicate words within the same tweet.
```

In [14]:
%pyspark
def df(terms): 
    '''
    input
    terms :  a RDD of lists of terms (words)
    output
    a RDD of pairs i.e. (word, df_score)
    '''
    # TODO
    return None


### Test Case for `df`

Run the following cell; you should see

```
[('apple', 1), ('by', 2), ('android', 1), ('smart', 2), ('made', 2), ('phones', 2), ('others', 1)]
```


In [16]:
%pyspark
test_terms = [one_grams("apple smart phones made by apple"), one_grams("android smart phones made by others")]
test_df = df(sc.parallelize(test_terms))
test_df.collect()

### Sample answer



<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```python
def df(terms): 
    '''
    input
    terms :  a RDD of lists of terms (words)
    output
    a RDD of pairs i.e. (word, df_score)
    '''
    # ANSWER
    return terms.flatMap(lambda seq: list(set(map(lambda w:(w,1), seq)))).reduceByKey(lambda x,y:x + y)
```



### Define `tfidf`

**[CODE CHANGE REQUIRED]** 

Complete the following snippet to define `tfidf`


<style>
    div.hidecode + pre {display: none}
</style>
<script>
doclick=function(e) {
    e.nextSibling.nextSibling.style.display="block";
}
</script>

<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```text
Let r be an RDD. r.count() returns the size of r.
Let r1, r2 be RDDs of key-value pairs. r1.join(r2) joins two RDDs by keys.
```
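
As a rough illustration of the shape of a join result, the following plain-Python sketch mimics an inner join on two small lists of key-value pairs. The helper `join_pairs` is hypothetical and assumes that keys are unique on each side. With PySpark's `r1.join(r2)`, each output element likewise has the form `(key, (value1, value2))`.

```python
def join_pairs(left, right):
    """Inner-join two lists of (key, value) pairs, assuming keys are unique per list."""
    right_lookup = dict(right)
    # Keep only keys present on both sides and pair up their values.
    return [(k, (v, right_lookup[k])) for k, v in left if k in right_lookup]

tf_pairs = [("apple", 2), ("android", 1), ("phones", 2)]
df_pairs = [("apple", 1), ("phones", 2), ("others", 1)]

joined = join_pairs(tf_pairs, df_pairs)
print(joined)  # [('apple', (2, 1)), ('phones', (2, 2))]
```

Keys that appear on only one side (`android`, `others`) are dropped, as in an inner join.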

In [19]:
%pyspark
def tfidf(terms): 
    '''
    input
    terms:  a RDD of lists of terms (words)
    output
    a RDD of pairs i.e. (words, tfidf_score) sorted by tfidf_score in descending order.
    '''
    # TODO
    return None
    



### Sample answer



<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```python
def tfidf(terms): 
    '''
    input
    terms:  a RDD of lists of terms (words)
    output
    a RDD of pairs i.e. (words, tfidf_score) sorted by tfidf_score in descending order.
    '''
    # ANSWER
    dCount = terms.count()
    tfreq = tf(terms)
    dfreq = df(terms)
    return tfreq.join(dfreq).map(lambda p :(p[0], p[1][0] * math.log(dCount/p[1][1]))).sortBy( lambda p : - p[1])
```





### Test case for `tfidf`

Run the following cell; you should see

```
[('apple', 1.3862943611198906), ('android', 0.6931471805599453), ('others', 0.6931471805599453), ('by', 0.0), ('smart', 0.0), ('made', 0.0), ('phones', 0.0)]
```


In [22]:
%pyspark
test_terms = [one_grams("apple smart phones made by apple"), one_grams("android smart phones made by others")]
test_tfidf = tfidf(sc.parallelize(test_terms))
test_tfidf.collect()

## Exercise 2.2 Defining the Label points

Recall that each labeled point is a decimal value (the label) paired with a feature vector.

* For all positive tweets (KPop tweets) the label is `1`, and for all negative tweets the label is `0`.
* For the vector part, we build it from the tweet message and the top 150 TF-IDF terms.


In [24]:
%pyspark
# You don't need to modify this cell
def buildTopTFIDF(tweets,tokenizer):
    '''
    input
    tweets: an RDD of texts
    tokenizer: a function that turns a string into a list of tokens
    
    output
    a list containing top 150 tfidf terms
    '''
    terms = tweets.map(tokenizer)
    # Build a real list: in Python 3, map() returns a one-shot iterator
    return [p[0] for p in tfidf(terms).take(150)]
    

### Tokenizer
**[CODE CHANGE REQUIRED]** 
We've been using single-word tokens for the test cases. However, a multi-word tokenizer can sometimes improve performance by taking neighboring words into account.
Define a `two_grams` tokenizer.



In [26]:
%pyspark
from functools import reduce

def two_grams(str):
   '''
    input
     str : a string
    output
     a list of strings (each string contains two consecutive words separated by a space)
   '''
   return None # TODO: fixme 



### Sample answer



<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```python
def to_ngrams(str, n):
    words = str.split()
    tokens = [words] * (n-1) # replicate the list of words for n-1 times
    dropped = map(lambda p: p[0][p[1]:], zip(tokens, range(1,n)))
    return reduce(lambda acc,ts:map(lambda p : p[0] + " " + p[1], zip(acc,ts)), dropped, words)

def two_grams(str):
    return to_ngrams(str, 2)


```
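
For comparison, the same n-grams can be produced with list slicing and `zip`. This is only an alternative sketch, not the reference solution, and the name `ngrams_by_zip` is made up for illustration.

```python
def ngrams_by_zip(text, n):
    """Return the list of n-grams (n consecutive words joined by a space)."""
    words = text.split()
    # shifted[i] is the word list with its first i words dropped.
    shifted = [words[i:] for i in range(n)]
    # zip stops at the shortest list, so only complete n-grams are produced.
    return [" ".join(group) for group in zip(*shifted)]

print(ngrams_by_zip("I love yoo jae suk", 2))
# ['I love', 'love yoo', 'yoo jae', 'jae suk']
print(ngrams_by_zip("I love yoo jae suk", 3))
# ['I love yoo', 'love yoo jae', 'yoo jae suk']
```

Because it returns a list rather than a lazy iterator, the result can be reused or printed without wrapping it in `list(...)`.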



### Test Case for `two_grams`

Run the following cell; you should see

```text
['The virus', 'virus that', 'that causes', 'causes COVID-19', 'COVID-19 is', 'is mainly', 'mainly transmitted', 'transmitted through', 'through droplets']
```


In [29]:
%pyspark

s = "The virus that causes COVID-19 is mainly  transmitted through droplets"
list(two_grams(s))


The following cells build the top 150 TF-IDF terms from the data that we loaded; you don't need to change anything. They might take a while to run (about 25 minutes on my t2.micro cluster).


In [31]:
%pyspark
topTFIDF =  buildTopTFIDF(posTXT + negTXT,two_grams)

In [32]:
%pyspark
type(topTFIDF)

 

## Defining `computeLP`
**[CODE CHANGE REQUIRED]** 
Complete the following snippet.

Concretely, the `computeLP` function takes a label (`1.0` or `0.0`), the tweet text, a tokenizer that turns the text into a sequence of strings (e.g. 2-grams or 3-grams), and a list of the top-N TF-IDF terms.

For each TF-IDF term `t`, say the i-th of the top-N terms, if `t` is in the sequence of strings, we put a `1.0` at the i-th position of the output vector; otherwise we put `0.0`.
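
The encoding described above can be sketched in plain Python on a toy vocabulary. The three `top_terms` below are made-up stand-ins for the real top-150 list, and `presence_vector` is a hypothetical helper that leaves out the `LabeledPoint` wrapping.

```python
def presence_vector(tokens, top_terms):
    """Return a list with 1.0 where the i-th top term occurs in tokens, else 0.0."""
    token_set = set(tokens)  # set membership tests are fast
    return [1.0 if t in token_set else 0.0 for t in top_terms]

top_terms = ["love you", "yoo jae", "jae suk"]         # stand-in for the top TF-IDF terms
tokens = ["I love", "love yoo", "yoo jae", "jae suk"]  # 2-grams of "I love yoo jae suk"

print(presence_vector(tokens, top_terms))  # [0.0, 1.0, 1.0]
```

The output always has the same length as `top_terms`, regardless of how long the tweet is.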

<style>
    div.hidecode + pre {display: none}
</style>
<script>
doclick=function(e) {
    e.nextSibling.nextSibling.style.display="block";
}
</script>

<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>

```text
Convert all the words in the input text into a set instead of a list.
The output vector should be of the same dimension as topTerms (AKA top 150 TFIDF).
```

In [34]:
%pyspark


def computeLP(label,text,tokenizer,topTerms):
    '''
    input
    label : label 1 or 0
    text : the text (String type)
    tokenizer : the tokenizer
    topTerms: the top TFIDF terms
    
    output:
    a label point.
    '''
    seqSet = set(tokenizer(text))
    scores = [0.0] * 150 # TODO: fixme
    return LabeledPoint(label, Vectors.dense(scores))

### Sample answer


<div class="hidecode" onclick="doclick(this);">[Show Hint]</div>


```python
def computeLP(label,text,tokenizer,topTerms):
    '''
    input
    label : label 1 or 0
    text : the text (String type)
    tokenizer : the tokenizer
    topTerms: the top TFIDF terms
    
    output:
    a label point.
    '''
    seqSet = set(tokenizer(text))
    # ANSWER
    scores = [1.0 if t in seqSet else 0.0 for t in topTerms]  # a list; a lazy map object is not usable as vector data in Python 3
    return LabeledPoint(label, Vectors.dense(scores))
```

### Test Case for `computeLP`

Run the following cell; you should see

```
LabeledPoint(1.0, [0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
```



In [37]:
%pyspark
computeLP(1.0, "I love yoo jae suk", two_grams, topTFIDF)

## Training the model

Let's train our model. The code is written for you; you don't need to change anything.


In [39]:
%pyspark

posLP = posTXT.map( lambda twt: computeLP(1.0, twt, two_grams, topTFIDF) )
negLP = negTXT.map( lambda twt: computeLP(0.0, twt, two_grams, topTFIDF) )

data = negLP + posLP


# Split data into training (60%) and test (40%).

splits = data.randomSplit([0.6,0.4],seed = 11)  # Python 3 has no 'L' suffix for integer literals
training = splits[0].cache()
test = splits[1]

# Run training algorithm to build the model
num_iteration = 100
model = SVMWithSGD.train(training,num_iteration)

# This will take about 20 mins on a 4-core Intel i7 processor at 3.8GHz with hyperthreading


## Exercise 2.3 Evaluating the model

We apply the trained model to our test data and evaluate its performance, measured here as the area under the ROC curve. The score should be around 0.84 (84%).
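
To build intuition for the metric printed by the cell below, the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following plain-Python sketch computes that pairwise estimate on made-up scores. The function name `area_under_roc` is illustrative, and the value reported by `BinaryClassificationMetrics` may differ slightly because it is computed from the ROC curve itself.

```python
def area_under_roc(score_and_labels):
    """Pairwise estimate of AUROC from (score, label) pairs with labels 1.0 / 0.0."""
    pos = [s for s, y in score_and_labels if y == 1.0]
    neg = [s for s, y in score_and_labels if y == 0.0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up scores: three positive and two negative examples.
pairs = [(0.9, 1.0), (0.8, 1.0), (0.5, 1.0), (0.6, 0.0), (0.3, 0.0)]
print(area_under_roc(pairs))  # 5 of 6 pairs ranked correctly, about 0.833
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.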




In [41]:

%pyspark 
model.clearThreshold()
# Compute raw scores on the test set
score_and_labels = test.map( lambda point: (float(model.predict(point.features)), point.label) )


# Get the evaluation metrics
metrics = BinaryClassificationMetrics(score_and_labels)
au_roc = metrics.areaUnderROC

print("Area under ROC = %s" % str(au_roc))

In [42]:
%pyspark
sc.stop()

## Cleaning up
**[CODE CHANGE REQUIRED]** 
Modify the following cell to clean up the HDFS directory.


In [44]:
%sh
export PATH=$PATH:/home/ec2-user/hadoop/bin/

namenode=ip-172-31-86-18 # TODO:change me

hdfs dfs -rm -r hdfs://$namenode:9000/lab12/ex2/


# End of Exercise 2
