# Week 13: Example Sentences Classification
In the last two assignments, we used deep learning for classification. Deep learning gives good results, but the results are hard to explain (i.e., we don't know why a phrase is classified as a "good" phrase). In this assignment you will do classification again, this time with a traditional machine learning method, which makes it easier to see *why* the model decides as it does. The task is to classify example sentences (e.g., "This song has a brilliant piano intro." for the word "brilliant").

[Here](https://drive.google.com/drive/folders/1ij20ecLlI1Zh5CdMAa91SXQzmpKfdBdJ?usp=sharing) are the two files needed for this task: **train.jsonl** and **test.jsonl**. Each line in these files records one example sentence with its label. There are two labels: ***positive*** means it is a *good* example sentence; ***negative*** means it is a *bad* example sentence.

## Read Data
We use a pandas DataFrame to hold the data in this assignment.

In [19]:
import json
import pandas as pd

In [20]:
def read_data_to_df(path):
    labels = []
    sentences = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            line = json.loads(line)
            sentences.append(line['text'])
            labels.append(line['label'])
    return pd.DataFrame({'sent':sentences,'label':labels})  
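
As an alternative to the manual loop, pandas can parse JSON Lines files directly. The sketch below assumes each line has `text` and `label` keys, as in `read_data_to_df`; the helper name `read_jsonl_with_pandas` and the tiny demo file are hypothetical and only illustrate the call.

```python
import json
import os
import tempfile

import pandas as pd


def read_jsonl_with_pandas(path):
    # lines=True tells pandas to parse one JSON object per line (JSON Lines)
    df = pd.read_json(path, lines=True)
    # Rename to match the column names used in this notebook
    return df.rename(columns={'text': 'sent'})[['sent', 'label']]


# Tiny demo file, written only to show the call
rows = [
    {'text': 'My children threw a birthday party for me.', 'label': 'positive'},
    {'text': 'I know I did.', 'label': 'negative'},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.jsonl')
    with open(demo_path, 'w', encoding='utf-8') as f:
        for row in rows:
            f.write(json.dumps(row) + '\n')
    demo = read_jsonl_with_pandas(demo_path)

print(demo)
```

Either approach should yield the same two-column dataframe; the explicit loop is easier to adapt if some lines need special handling.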

In [21]:
train = read_data_to_df('data/train.jsonl')
print(train.head())
test = read_data_to_df('data/test.jsonl')
print(test.head())

                                                sent     label
0         My children threw a birthday party for me.  positive
1  Marketing on social networking sites is a fast...  positive
2                You pillock, look what you've done!  positive
3      He scored three touchdowns early in the game.  positive
4          His abrupt manner makes me uncomfortable.  positive
                                                sent     label
0  I've just spotted Mark - he's over there, near...  positive
1  After repricing, the bonds yield a much higher...  positive
2             I admire her staunchness and fidelity.  positive
3  The party's leader is in favour of the treaty ...  positive
4  About 20 companies are working on treatments f...  positive


In [22]:
print('train.shape:', train.shape)
print('test.shape: ', test.shape)

train.shape: (243948, 2)
test.shape:  (30520, 2)


## Extract Features
Traditional machine learning needs data scientists to observe the data and find useful signals in it.

Here is an example:

In [23]:
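import re

# Fraction of positive sentences with more than 3 punctuation characters
# (anything that is neither a word character nor whitespace)
spec = 0
count = 0
for i in train[train['label'] == "positive"]["sent"]:
    if len(re.sub(r'[\w\s]', '', i)) > 3:
        spec += 1
    count += 1

print(spec/count)

# Same fraction for negative sentences
spec = 0
count = 0
for i in train[train['label'] == "negative"]["sent"]:
    if len(re.sub(r'[\w\s]', '', i)) > 3:
        spec += 1
    count += 1

print(spec/count)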

0.058014915590512126
0.3834226149596325


In this experiment, 38% of bad example sentences contain more than 3 punctuation characters, compared with only 5.8% of good example sentences. This looks like a useful feature for distinguishing good from bad example sentences, so we add it.
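
The same per-label comparison can be wrapped in a small helper so that each new candidate feature needs only one call. This is a sketch on a toy dataframe; `rate_by_label` and `many_punct` are hypothetical names, and the demo sentences are made up.

```python
import re

import pandas as pd


def rate_by_label(df, predicate, text_col='sent', label_col='label'):
    """Return, for each label, the fraction of sentences where predicate(sentence) is true."""
    flags = df[text_col].map(predicate).astype(bool)
    return flags.groupby(df[label_col]).mean()


def many_punct(sentence, threshold=3):
    # Count characters that are neither word characters nor whitespace
    return len(re.sub(r'[\w\s]', '', sentence)) > threshold


# Made-up sentences, only to show the call
demo = pd.DataFrame({
    'sent': ['A plain sentence.', 'See (a), (b), and (c)!', 'Short one.', 'Wait... what?!'],
    'label': ['positive', 'negative', 'positive', 'negative'],
})
rates = rate_by_label(demo, many_punct)
print(rates)
```

A large gap between the two rates suggests the predicate is worth keeping as a feature; a small gap suggests it adds little.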

In [24]:
train["more than three punc"] = [1 if len(re.sub(r'[\w\s]', '', i)) > 3 else 0 for i in train["sent"]]
test["more than three punc"] = [1 if len(re.sub(r'[\w\s]', '', i)) > 3 else 0 for i in test["sent"]]

In [25]:
train.head()

Unnamed: 0,sent,label,more than three punc
0,My children threw a birthday party for me.,positive,0
1,Marketing on social networking sites is a fast...,positive,0
2,"You pillock, look what you've done!",positive,0
3,He scored three touchdowns early in the game.,positive,0
4,His abrupt manner makes me uncomfortable.,positive,0


<font color="red">**[ TODO ]**</font> Please observe the data, extract at least three features, and add them to the dataframe.

In [26]:
# length
# avg_len_pos = 63.217
# avg_len_neg = 93.2698
spec_pos, count_pos, spec_neg, count_neg = 0, 0, 0, 0
len_pos = []
len_neg = []
for i in train[train['label'] == "positive"]["sent"]:
    len_pos.append(len(i))

for i in train[train['label'] == 'negative']['sent']:
    len_neg.append(len(i))

avg_len_pos = sum(len_pos)/len(len_pos)
avg_len_neg = sum(len_neg)/len(len_neg)

for i in train[train['label'] == "positive"]["sent"]:
    if len(i) > 110:
        spec_pos += 1
    count_pos += 1

for i in train[train['label'] == 'negative']['sent']:
    if len(i) > 110:
        spec_neg += 1
    count_neg += 1

print(spec_pos/count_pos)
print(spec_neg/count_neg)

0.043896476590317264
0.26250324313599804


In [27]:
train["length > 110"] = [1 if len(i) > 110 else 0 for i in train["sent"]]
test["length > 110"] = [1 if len(i) > 110 else 0 for i in test["sent"]]
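
List comprehensions like the ones above can also be written with pandas string methods, which keeps each feature to one line. The sketch below runs on a toy dataframe; `add_basic_features` and the column names `long_sent`, `has_paren`, and `many_caps` are hypothetical, and the thresholds are illustrative values to tune, not recommendations.

```python
import pandas as pd


def add_basic_features(df, text_col='sent'):
    """Return a copy of df with a few 0/1 feature columns (hypothetical names)."""
    out = df.copy()
    text = out[text_col]
    # Character length above a threshold
    out['long_sent'] = (text.str.len() > 110).astype(int)
    # Literal "(" anywhere in the sentence (regex=False treats "(" as plain text)
    out['has_paren'] = text.str.contains('(', regex=False).astype(int)
    # More than 3 ASCII capital letters
    out['many_caps'] = (text.str.count(r'[A-Z]') > 3).astype(int)
    return out


# Made-up sentences, only to show the call
demo = pd.DataFrame({'sent': [
    'He scored three touchdowns early in the game.',
    'The NASA/ESA report (2019) cited the USA.',
    'word ' * 30,
]})
features = add_basic_features(demo)
print(features[['long_sent', 'has_paren', 'many_caps']])
```

Applying the same function to both `train` and `test` would help keep the two dataframes' feature definitions identical.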

In [28]:
train.head()

Unnamed: 0,sent,label,more than three punc,length > 110
0,My children threw a birthday party for me.,positive,0,0
1,Marketing on social networking sites is a fast...,positive,0,0
2,"You pillock, look what you've done!",positive,0,0
3,He scored three touchdowns early in the game.,positive,0,0
4,His abrupt manner makes me uncomfortable.,positive,0,0


In [29]:
# exist "("
import re
# for positive
spec = 0
count=0
for i in train[train['label'] == "positive"]["sent"]:
    if "(" in i:
        spec+=1
    count+=1
print(spec/count)

# for negative
spec = 0
count=0
for i in train[train['label'] == "negative"]["sent"]:
    if "(" in i :
        spec+=1
    count+=1
print(spec/count)

0.012524135976333457
0.2662576499855013


In [30]:
train["exist_left_bracket"] = [1 if "(" in i  else 0 for i in train["sent"]]
test["exist_left_bracket"] = [1 if "(" in i else 0 for i in test["sent"]]

In [31]:
train.head()

Unnamed: 0,sent,label,more than three punc,length > 110,exist_left_bracket
0,My children threw a birthday party for me.,positive,0,0,0
1,Marketing on social networking sites is a fast...,positive,0,0,0
2,"You pillock, look what you've done!",positive,0,0,0
3,He scored three touchdowns early in the game.,positive,0,0,0
4,His abrupt manner makes me uncomfortable.,positive,0,0,0


In [32]:
# token count: the rates below use > 18; the feature column in the next cell uses > 19
import re
# for positive
spec = 0
count=0
for i in train[train['label'] == "positive"]["sent"]:
    if len(i.split()) > 18:
        spec+=1
    count+=1
print(spec/count)

# for negative
spec = 0
count=0
for i in train[train['label'] == "negative"]["sent"]:
    if len(i.split()) > 18 :
        spec+=1
    count+=1
print(spec/count)

0.05045969070521337
0.2811608137600537


In [53]:
train["len(token)>19"] = [1 if len(i.split()) > 19  else 0 for i in train["sent"]]
test["len(token)>19"] = [1 if len(i.split()) > 19 else 0 for i in test["sent"]]

In [34]:
# len(token) < 4
spec = 0
count=0
for i in train[train['label'] == "positive"]["sent"]:
    if len(i.split()) < 4:
        spec+=1
    count+=1
print(spec/count)

# for negative
spec = 0
count=0
for i in train[train['label'] == "negative"]["sent"]:
    if len(i.split()) < 4 :
        spec+=1
    count+=1
print(spec/count)

0.003790898301181556
0.025792469819757947


In [35]:
train["len(token)<4"] = [1 if len(i.split()) < 4 else 0 for i in train["sent"]]
test["len(token)<4"] = [1 if len(i.split()) < 4 else 0 for i in test["sent"]]

In [36]:
# number of cap
spec = 0
count=0
for i in train[train['label'] == "positive"]["sent"]:
    if len(re.findall('([A-Z])', i)) > 3:
        spec+=1
    count+=1
print(spec/count)

# for negative
spec = 0
count=0
for i in train[train['label'] == "negative"]["sent"]:
    if len(re.findall('([A-Z])', i)) > 3 :
        spec+=1
    count+=1
print(spec/count)

0.03454323218366371
0.3743876196144865


In [38]:
train["len(cap)>3"] = [1 if len(re.findall('([A-Z])', i)) > 3 else 0 for i in train["sent"]]
test["len(cap)>3"] = [1 if len(re.findall('([A-Z])', i)) > 3  else 0 for i in test["sent"]]

In [39]:
# exist will (substring match, so words such as "willing" also count)
import re
# for positive
spec = 0
count=0
for i in train[train['label'] == "positive"]["sent"]:
    r = re.findall('will', i)
    if r:
        spec+=1
    count+=1
print(spec/count)

# for negative
spec = 0
count=0
for i in train[train['label'] == "negative"]["sent"]:
    r = re.findall('will', i)
    if r:
        spec+=1
    count+=1       
print(spec/count)

0.03651839648544756
0.020809486745112404


In [40]:
train["exist_will"] = [1 if re.findall('will', i) else 0 for i in train["sent"]]
test["exist_will"] = [1 if re.findall('will', i) else 0 for i in test["sent"]]

In [41]:
# exist not (substring match, so words such as "nothing" also count)
import re
# for positive
spec = 0
count=0
for i in train[train['label'] == "positive"]["sent"]:
    r = re.findall('not', i)
    if r:
        spec+=1
    count+=1
print(spec/count)

# for negative
spec = 0
count=0
for i in train[train['label'] == "negative"]["sent"]:
    r = re.findall('not', i)
    if r:
        spec+=1
    count+=1       
print(spec/count)

0.04156702272767533
0.07595806052836408


In [50]:
train["exist_not"] = [1 if re.findall('not', i) else 0 for i in train["sent"]]
test["exist_not"] = [1 if re.findall('not', i) else 0 for i in test["sent"]]

In [57]:
# number of 0-9
spec = 0
count=0
for i in train[train['label'] == "positive"]["sent"]:
    if len(re.findall('([0-9])', i)) > 3:
        spec+=1
    count+=1
print(spec/count)

# for negative
spec = 0
count=0
for i in train[train['label'] == "negative"]["sent"]:
    if len(re.findall('([0-9])', i)) > 3 :
        spec+=1
    count+=1
print(spec/count)

0.02391454535792103
0.21844237901195


In [59]:
train["len(digital)>3"] = [1 if len(re.findall('([0-9])', i)) > 3 else 0 for i in train["sent"]]
test["len(digital)>3"] = [1 if len(re.findall('([0-9])', i)) > 3  else 0 for i in test["sent"]]

In [71]:
month = ['January', 'February', "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

In [84]:
# exist month
spec = 0
count=0
for i in train[train['label'] == "positive"]["sent"]:
    if re.findall(r"(?=("+'|'.join(month)+r"))", i):
        spec += 1
    count+=1
print(spec/count)

# for negative
spec = 0
count=0
for i in train[train['label'] == "negative"]["sent"]:
    if re.findall(r"(?=("+'|'.join(month)+r"))", i):
        spec+=1
    count+=1
print(spec/count)

0.006722644417282245
0.08239091616684217
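
The `selected_features` list used in the Train section includes an `exist_month` column, but the cell that creates it does not appear in this notebook export. A minimal sketch of how such a column could be built, assuming it mirrors the check above; `add_exist_month` and its `whole_word` option are hypothetical, and the demo sentences are made up.

```python
import re

import pandas as pd

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']


def add_exist_month(df, text_col='sent', whole_word=False):
    """Return a copy of df with a 0/1 `exist_month` column."""
    body = '|'.join(MONTHS)
    # whole_word=False mirrors the substring check above (so "Maybe" counts as "May");
    # whole_word=True is a stricter variant that is NOT what the cells above compute.
    pattern = re.compile(r'\b(?:' + body + r')\b' if whole_word else body)
    out = df.copy()
    out['exist_month'] = [1 if pattern.search(s) else 0 for s in out[text_col]]
    return out


# Made-up sentences, only to show the call
demo = pd.DataFrame({'sent': [
    'They have a standing order every January.',
    'Maybe we can talk later.',
    'He scored three touchdowns.',
]})
loose = add_exist_month(demo)
strict = add_exist_month(demo, whole_word=True)
print(loose['exist_month'].tolist(), strict['exist_month'].tolist())
```

The two variants differ only on sentences where a month name is embedded in a longer word, so the rates printed above apply to the substring version.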


In [82]:
# Sanity check of the lookahead pattern: no word in s occurs in x, so the result is []
s = ['lydia', 'is', 'mimo']
x = ' sticky poop'

print(re.findall(r"(?=("+'|'.join(s)+r"))", x))


[]


## Train
Now it's time to evaluate whether the selected features are useful for classification. We train a [Bernoulli Naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes) on the training data.

In [42]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()

<font color="red">**[ TODO ]**</font> Please adjust the `selected_features` list and train the model.

***Don't*** use any other model in this assignment.

In [102]:
selected_features = ["more than three punc", "length > 110", "exist_left_bracket",  "len(token)>19", 
                    "len(token)<4", "len(cap)>3", "len(digital)>3","exist_month", 
                    "exist_will", "exist_not"]


In [103]:
y_pred = bnb.fit(train[selected_features], train['label']).predict(test[selected_features])
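
Because the goal of this assignment is to understand *why* a sentence is classified as it is, it can help to look inside the fitted model. The sketch below fits `BernoulliNB` on a toy dataset and reads its `feature_log_prob_` attribute, which holds log P(feature = 1 | class) for each class. The toy data and the feature names `has_paren` and `has_will` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Toy binary features, only to show how to read the fitted model
X = pd.DataFrame({
    'has_paren': [0, 0, 0, 1, 1, 1, 0, 1],
    'has_will':  [1, 0, 1, 0, 0, 1, 1, 0],
})
y = ['positive', 'positive', 'positive', 'negative',
     'negative', 'negative', 'positive', 'negative']

model = BernoulliNB().fit(X, y)

# feature_log_prob_[c, j] = log P(feature j == 1 | class c);
# exponentiate to read the values as (smoothed) probabilities
probs = pd.DataFrame(np.exp(model.feature_log_prob_),
                     index=model.classes_, columns=X.columns)
print(probs)

# A sentence with a parenthesis and no "will"
new_row = pd.DataFrame({'has_paren': [1], 'has_will': [0]})
prediction = model.predict(new_row)[0]
print(prediction)
```

A feature whose probability differs strongly between the two rows of `probs` pulls predictions toward one class; a feature with similar values in both rows contributes little.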

## Test
Test the model and get the accuracy of the prediction on testing data. 

There are four baselines for this task:
```
1. simple baseline: 0.65
2. medium baseline: 0.72
3. strong baseline: 0.8
4. boss baseline: 0.85
```
The more baselines you pass, the higher your grade.

*Hint: If the result isn't ideal, print the wrongly predicted examples and re-observe them, focusing only on those errors, to find additional features.*

In [104]:
print((test['label'] == y_pred).sum()/len(test))

0.7779816513761468
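
Accuracy alone does not show which kind of error dominates. A sketch using `accuracy_score` and `confusion_matrix` from `sklearn.metrics` on made-up labels; in the notebook, `test['label']` and `y_pred` would take the place of the toy lists `y_true` and `y_hat`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, only to show the calls
y_true = ['positive', 'positive', 'negative', 'negative', 'positive', 'negative']
y_hat = ['positive', 'negative', 'negative', 'positive', 'positive', 'negative']

acc = accuracy_score(y_true, y_hat)

# Rows are true labels, columns are predicted labels, both in the order given by `labels`
labels = ['positive', 'negative']
cm = confusion_matrix(y_true, y_hat, labels=labels)

print(acc)
print(cm)
```

With this label order, `cm[0, 1]` counts good sentences predicted as bad (false negatives) and `cm[1, 0]` counts bad sentences predicted as good (false positives).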


In [97]:
wrong_pred = test[test['label'] != y_pred]
FN = wrong_pred[wrong_pred['label'] == 'positive']
FP = wrong_pred[wrong_pred['label'] == 'negative']

FN.iloc[0:3]['sent']

0     I've just spotted Mark - he's over there, near...
10    Andrew spends all his spare time playing with ...
15    They have a standing order every January to su...
Name: sent, dtype: object

In [98]:
FP

Unnamed: 0,sent,label,more than three punc,length > 110,exist_left_bracket,len(token)>18,len(token)<4,len(cap)>3,exist_will,exist_notl,exist_not,len(token)>19,len(digtali)>3,len(digital)>3,exist_month
16410,Really that's all I had to say.,negative,0,0,0,0,0,0,0,0,0,0,0,0,0
16413,This will put a signature containing your user...,negative,0,0,0,1,0,0,1,0,0,0,0,0,0
16415,I know I did.,negative,0,0,0,0,0,0,0,0,0,0,0,0,0
16418,Chinese mathematicians were the first to use n...,negative,0,0,0,0,0,0,0,0,0,0,0,0,0
16423,I've deleted several because they weren't simp...,negative,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30511,It is located at a ford on the Castletown River.,negative,0,0,0,0,0,0,0,0,0,0,0,0,0
30512,The brain reacts by making the sympathetic ner...,negative,0,0,0,0,0,0,0,0,0,0,0,0,0
30514,With Sandstorm at his side he foes on a perilo...,negative,0,0,0,0,0,0,0,0,0,0,0,0,0
30518,Any regular editor can take out timetables fro...,negative,0,0,0,0,0,0,0,0,0,0,0,0,0


## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=1031097651) to reserve a demo time.  
Scores are given only after a TA reviews your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to eeclass. You only need to hand in this ipynb file, renamed to XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.