# Week 13: Example Sentences Classification
In the last two assignments, we used deep learning methods for classification. Deep learning gives good results, but the results are hard to explain (i.e., we don't know why a phrase is classified as a "good" phrase). In this assignment, you will do classification again, this time with a traditional machine learning method, which makes it easier to see *why* a prediction was made. You will implement this on example sentences (e.g., "This song has a brilliant piano intro." for the word "brilliant").

[Here](https://drive.google.com/drive/folders/1ij20ecLlI1Zh5CdMAa91SXQzmpKfdBdJ?usp=sharing) are the two files needed for this task: **train.jsonl** and **test.jsonl**. In these files, each line records one example sentence with its label. There are two labels: ***positive*** means a *good* example sentence; ***negative*** means a *bad* example sentence.

## Read Data
We use a pandas DataFrame to hold the data in this assignment.

In [1]:
import json
import pandas as pd

In [2]:
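def read_data_to_df(path):
    """Read a JSON Lines file into a DataFrame with 'sent' and 'label' columns."""
    labels = []
    sentences = []
    # Read as UTF-8 explicitly so the result does not depend on the platform default.
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            sentences.append(record['text'])
            labels.append(record['label'])
    return pd.DataFrame({'sent': sentences, 'label': labels})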

In [4]:
train = read_data_to_df('data/train.jsonl')
print(train.head())
test = read_data_to_df('data/test.jsonl')
print(test.head())

                                                sent     label
0         My children threw a birthday party for me.  positive
1  Marketing on social networking sites is a fast...  positive
2                You pillock, look what you've done!  positive
3      He scored three touchdowns early in the game.  positive
4          His abrupt manner makes me uncomfortable.  positive
                                                sent     label
0  I've just spotted Mark - he's over there, near...  positive
1  After repricing, the bonds yield a much higher...  positive
2             I admire her staunchness and fidelity.  positive
3  The party's leader is in favour of the treaty ...  positive
4  About 20 companies are working on treatments f...  positive
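
As an alternative sketch, pandas can parse JSON Lines files directly through `pd.read_json(..., lines=True)`. The snippet below assumes the same `text` and `label` keys as the reader above. The helper name `read_jsonl_to_df` and the two records in the temporary file are made up for illustration only.

```python
import json
import os
import tempfile

import pandas as pd


def read_jsonl_to_df(path):
    # lines=True tells pandas to treat every line as one JSON object.
    df = pd.read_json(path, lines=True)
    # Rename 'text' to 'sent' to match the column names used in this notebook.
    return df.rename(columns={"text": "sent"})[["sent", "label"]]


# Hypothetical two-line file, written only to demonstrate the call.
records = [
    {"text": "The plane went into a dive.", "label": "positive"},
    {"text": "See [1] (ref.) for details, pp. 3-4!", "label": "negative"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
    tmp_path = f.name

demo = read_jsonl_to_df(tmp_path)
os.remove(tmp_path)
print(demo)
```

The explicit loop in `read_data_to_df` is easier to adapt when records need per-line cleaning; the one-call version is shorter when the file is already well formed.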


## Extract Features
Traditional machine learning requires data scientists to observe the data and find useful signals in it.

Here is an example:

In [9]:
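import re
# Share of positive sentences with more than 3 punctuation characters
spec = 0
count = 0
for i in train[train['label'] == "positive"]["sent"]:
    # Strip word characters and whitespace; what remains is punctuation.
    # The raw string avoids invalid-escape warnings for \w and \s.
    if len(re.sub(r'[\w\s]', '', i)) > 3:
        spec += 1
    count += 1
print(spec / count)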

0.058014915590512126


In [8]:
# Same measurement for negative sentences
spec = 0
count = 0
for i in train[train['label'] == "negative"]["sent"]:
    if len(re.sub(r'[\w\s]', '', i)) > 3:
        spec += 1
    count += 1
print(spec / count)

0.3834226149596325


After some experimentation, we found that 38% of bad example sentences contain more than 3 punctuation marks, whereas only 5.8% of good example sentences do. This looks like a useful feature for distinguishing good from bad example sentences, so we add it.
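
The same comparison can be sketched without explicit loops by counting punctuation with `Series.str.count` and averaging a 0/1 flag per label. The four sentences below are a toy stand-in for the training data, and the names `toy`, `punc_count`, and `ratio` are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Made-up miniature dataset standing in for train.jsonl.
toy = pd.DataFrame({
    "sent": [
        "The plane went into a dive.",
        "We were upgraded and flew business class.",
        "Paul beat me by three games to two (= he won three and I won two).",
        'See <ref name=":0" /> for details.',
    ],
    "label": ["positive", "positive", "negative", "negative"],
})

# Count characters that are neither word characters nor whitespace.
punc_count = toy["sent"].str.count(r"[^\w\s]")
toy["more than three punc"] = (punc_count > 3).astype(int)

# The mean of a 0/1 column is the share of rows where the feature is set.
ratio = toy.groupby("label")["more than three punc"].mean()
print(ratio)
```

A large gap between the two per-label ratios is what suggests a feature may help the classifier; a gap close to zero suggests it carries little signal.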

In [10]:
train["more than three punc"] = [1 if len(re.sub(r'[\w\s]', '', i)) > 3 else 0 for i in train["sent"]]
test["more than three punc"] = [1 if len(re.sub(r'[\w\s]', '', i)) > 3 else 0 for i in test["sent"]]

In [11]:
train.head()

                                                sent     label  more than three punc
0         My children threw a birthday party for me.  positive                     0
1  Marketing on social networking sites is a fast...  positive                     0
2                You pillock, look what you've done!  positive                     0
3      He scored three touchdowns early in the game.  positive                     0
4          His abrupt manner makes me uncomfortable.  positive                     0


<font color="red">**[ TODO ]**</font> Please observe the data and extract at least three features and add them to the dataframe.

In [99]:
def test_data(f, train):
    """Return [pos_ratio, neg_ratio]: the share of positive and negative
    sentences for which the predicate f(sentence) is true."""
    ratios = []
    for label in ("positive", "negative"):
        sents = train[train['label'] == label]["sent"]
        ratios.append(sum(1 for s in sents if f(s)) / sents.count())
    return ratios
    

In [168]:
# Find average length
len_sum = 0

neg = train[train['label'] == "negative"]["sent"]

for i in neg:
    len_sum += len(i)
neg_avg_len = len_sum / neg.count()
neg_count = neg.count()
print(neg_avg_len)
print(neg_count)

len_sum = 0

pos = train[train['label'] == "positive"]["sent"]

for i in pos:
    len_sum += len(i)
pos_avg_len = len_sum / pos.count()
pos_count = pos.count()
print(pos_avg_len)
print(pos_count)

# Calculate standard deviation

neg_sd = 0
for s in neg:
    neg_sd += (len(s) - neg_avg_len) ** 2
neg_sd /= neg_count
neg_sd = neg_sd ** (1/2)

print(neg_sd)


pos_sd = 0
for s in pos:
    pos_sd += (len(s) - pos_avg_len) ** 2
pos_sd /= pos_count
pos_sd = pos_sd ** (1/2)

print(pos_sd)

sd_range_ratio = 2.5

len_high = pos_avg_len + sd_range_ratio * pos_sd
len_low = pos_avg_len - sd_range_ratio * pos_sd
print(len_high, len_low)

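def if_len_in_range(s):
    # True when the length lies within sd_range_ratio standard deviations
    # of the mean length of positive sentences.
    return len_low <= len(s) <= len_high

def if_len_not_in_range(s):
    return not if_len_in_range(s)

print(test_data(if_len_in_range, train))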


93.26986706957862
131046
63.21707321393775
112902
64.79810916236774
24.068649119044714
135.4230205710719 -8.988874143196384
[0.990168464686188, 0.8436426903530058]
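
The length statistics above can also be sketched with vectorized pandas calls. This version uses `Series.std(ddof=0)` for the population standard deviation, which is the same quantity the manual loop computes (the sum of squared deviations divided by the count). The toy sentences and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy data: positive lengths are 4 and 6, negative lengths are 2 and 20.
toy = pd.DataFrame({
    "sent": ["abcd", "abcdef", "ab", "a" * 20],
    "label": ["positive", "positive", "negative", "negative"],
})

lengths = toy["sent"].str.len()
pos_len = lengths[toy["label"] == "positive"]

pos_mean = pos_len.mean()
# ddof=0 divides by N, giving the population standard deviation.
pos_sd = pos_len.std(ddof=0)

sd_range_ratio = 2.5
len_low = pos_mean - sd_range_ratio * pos_sd
len_high = pos_mean + sd_range_ratio * pos_sd

# between() is inclusive on both ends; negate it to flag out-of-range lengths.
toy["len_not_in_range"] = (~lengths.between(len_low, len_high)).astype(int)
print(len_low, len_high)
print(toy[["label", "len_not_in_range"]])
```

Note that pandas defaults to `ddof=1` (the sample standard deviation), so the argument has to be set explicitly to reproduce the loop-based figure.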


In [169]:
train["len_not_in_range"] = [1 if if_len_not_in_range(s) else 0 for s in train["sent"]]
test["len_not_in_range"] = [1 if if_len_not_in_range(s) else 0 for s in test["sent"]]

In [146]:
def if_contains_noneASCII(s):
    none_ASCII = re.sub(r'[\u0000-\u007F]+', '', s)
    limit = 1
    if(len(none_ASCII) > limit):
        return True
    return False
    
print(test_data(if_contains_noneASCII, train))

[0.04452534056084038, 0.07149397921340599]


In [147]:
train["contains_noneASCII"] = [1 if if_contains_noneASCII(s) else 0 for s in train["sent"]]
test["contains_noneASCII"] = [1 if if_contains_noneASCII(s) else 0 for s in test["sent"]]

In [103]:
def if_many_uppercase(s):
    upper = re.sub(r'[^A-Z]', '', s)
    limit = 4
    if(len(upper) > limit):
        return True 

    return False

print(test_data(if_many_uppercase, train))

[0.01467644505854635, 0.2866321749614639]


In [104]:
train["many_uppercase"] = [1 if if_many_uppercase(s) else 0 for s in train["sent"]]
test["many_uppercase"] = [1 if if_many_uppercase(s) else 0 for s in test["sent"]]

In [157]:
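def avg_numbers(train):
    # Average number of digit characters per negative sentence.
    neg = train[train['label'] == "negative"]["sent"]
    len_sum = 0
    for s in neg:
        numbers = re.sub(r'[^0-9]', '', s)
        len_sum += len(numbers)
    # Divide by the negative count, the subset that was summed. The value
    # recorded below was computed with the positive count as the denominator.
    return len_sum / neg.count()

print(avg_numbers(train))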

2.0240739756603072


In [162]:
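def if_many_numbers(s):
    # True when the sentence contains more than `limit` digit characters.
    digits = re.sub(r'[^0-9]', '', s)
    limit = 1
    return len(digits) > limit

# Evaluate the digit feature itself. The ratios recorded below are identical to
# the if_many_uppercase results because the original run passed that function here.
print(test_data(if_many_numbers, train))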

[0.01467644505854635, 0.2866321749614639]


In [163]:
train["many_numbers"] = [1 if if_many_numbers(s) else 0 for s in train["sent"]]
test["many_numbers"] = [1 if if_many_numbers(s) else 0 for s in test["sent"]]

In [210]:
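def avg_word_len(train):
    # Average length of the words longer than 4 letters in negative sentences.
    len_sum = 0
    word_count = 0
    for s in train[train["label"] == "negative"]['sent']:
        # Split on anything that is not an ASCII letter. The class [A-z] would
        # also treat [ \ ] ^ _ ` as letters, so both ranges are spelled out.
        words = re.split(r'[^A-Za-z]', s)
        for w in words:
            if len(w) > 4:
                word_count += 1
                len_sum += len(w)
    return len_sum / word_count

avg_word_len(train)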

6.980457773776895

In [176]:
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
print(train[train['label'] == 'negative']['sent'].sample(n=10, random_state=None))

203018                                                                                                                                                                                                           Here’s hoping “Touch” continues to reach it.
154870                                                                                                                                                                  In 1784 he became curate of Selborne for the last time, remaining so until his death.
131383    Like liberal feminists, socialist or Marxist feminists acknowledge that men are necessary as part of the movement for change.<ref name=":0" /> Whatever the theory, socialist and Marxist countries have never had women in major government posts.
145256                                                                                                                                              Also, in psychology, psychiatry and sociology, it may mean someone who has a big influence

In [178]:
print(train[train['label'] == 'positive']['sent'].sample(n=10, random_state=None))

59273                                                                             The plane went into a dive.
5817                                                    To be parted from him even for two days made her sad.
28607     The phone group plans to flog its new handsets for £30 apiece to people signing one-year contracts.
45596            Much of the investment was directed to Greater London, the West Midlands and the North West.
120926                 It would take a worldwide depression to drastically reduce the current demand for oil.
21344                                             She revolutionized fashion reporting with her breezy style.
42841                                                               We were upgraded and flew business class.
16445                                 The minister opened the door and sallied forth to face the angry crowd.
23572                                        The side wall of the old house was braced with a wooden support.
59381     

In [144]:
train.loc[91313]

sent                    Paul beat me by three games to two (= he won three and I won two).
label                                                                             negative
more than three punc                                                                     1
contains_noneASCII                                                                       0
many_uppercase                                                                           0
len_not_in_range                                                                         0
Name: 91313, dtype: object

## Train
Now it's time to evaluate whether the selected features are useful for classification. We use a [Bernoulli Naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes) and fit it on the training data.

In [96]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()

<font color="red">**[ TODO ]**</font> Please adjust the `selected_features` list and train the model.

***Don't*** use any other model in this assignment.

In [164]:
selected_features = ["more than three punc",
                     "many_uppercase",
                     "contains_noneASCII",
                     "len_not_in_range",
                     "many_numbers"]

In [170]:
y_pred = bnb.fit(train[selected_features], train['label']).predict(test[selected_features])
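
Since the goal of this assignment is to understand *why* a sentence is classified a certain way, it can help to look at what the fitted model learned. The sketch below fits `BernoulliNB` on a made-up six-row feature table and reads `feature_log_prob_`, which scikit-learn documents as the log probability of each feature given a class. The toy values and the names `X`, `y`, `model`, and `probs` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Made-up binary features for six sentences.
X = pd.DataFrame({
    "many_uppercase": [0, 0, 0, 1, 1, 0],
    "more than three punc": [0, 0, 1, 1, 1, 1],
})
y = ["positive", "positive", "positive", "negative", "negative", "negative"]

model = BernoulliNB().fit(X, y)

# feature_log_prob_[c, j] is log P(feature j = 1 | class c).
# Exponentiate to read the entries as (smoothed) probabilities.
probs = pd.DataFrame(
    np.exp(model.feature_log_prob_),
    index=model.classes_,
    columns=X.columns,
)
print(probs)
```

A feature whose probability differs strongly between the `negative` and `positive` rows is one the model relies on; features with nearly equal rows contribute little to the decision. With the default `alpha=1.0`, each entry is `(count + 1) / (n_class + 2)` rather than the raw ratio.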

## Test
Test the model and compute the accuracy of its predictions on the test data.

There are four baselines for this task:
```
1. simple baseline: 0.65
2. medium baseline: 0.72
3. strong baseline: 0.8
4. boss baseline: 0.85
```
The more baselines you pass, the higher your grade.
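
A small helper can report which baselines a given accuracy clears. This is only a sketch: it assumes that "passing" means the accuracy is greater than or equal to the threshold, which the assignment text does not state explicitly, and the names `BASELINES` and `passed_baselines` are hypothetical.

```python
# Thresholds copied from the baseline list above.
BASELINES = {"simple": 0.65, "medium": 0.72, "strong": 0.8, "boss": 0.85}


def passed_baselines(accuracy):
    # Assumption: a baseline counts as passed when accuracy >= threshold.
    return [name for name, threshold in BASELINES.items() if accuracy >= threshold]


# Example with an illustrative accuracy value.
print(passed_baselines(0.74))
```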

*Hint*: if the result isn't ideal, print the misclassified examples and study only those to find additional features.

In [171]:
print((test['label'] == y_pred).sum()/len(test))

0.7428243774574049
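
Following the hint above, one way to inspect errors is to put the predictions next to the true labels and keep only the rows where they disagree. The sketch below uses a made-up four-row test set and prediction array; `toy_test`, `toy_pred`, `result`, and `wrong` are illustrative names, and in the notebook the same steps would apply to `test` and `y_pred`.

```python
import numpy as np
import pandas as pd

# Made-up test rows and predictions, for illustration only.
toy_test = pd.DataFrame({
    "sent": [
        "The plane went into a dive.",
        "We were upgraded and flew business class.",
        "In 1784 he became curate of Selborne for the last time.",
        'See <ref name=":0" /> for details.',
    ],
    "label": ["positive", "positive", "negative", "negative"],
})
toy_pred = np.array(["positive", "positive", "positive", "negative"])

result = toy_test.assign(predicted=toy_pred)

# Accuracy is the share of rows where the prediction equals the label.
accuracy = (result["label"] == result["predicted"]).mean()

# Rows the model got wrong: candidates for finding new features.
wrong = result[result["label"] != result["predicted"]]

# Confusion table: true labels as rows, predictions as columns.
confusion = pd.crosstab(result["label"], result["predicted"])

print(accuracy)
print(wrong)
print(confusion)
```

Reading the `wrong` rows grouped by their true label shows which kind of error dominates, for example bad sentences that look clean on every current feature.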


## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=1031097651) to reserve a demo time.  
Scores are given only after a TA reviews your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to eeclass. You only need to hand in this ipynb file, renamed to XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submission will not be allowed**.