# Week 13: Example Sentences Classification
In the last two assignments, we used deep learning for classification. Deep learning can give good results, but those results are hard to explain (i.e., we don't know why a phrase is classified as a "good" phrase). Hence, in this assignment, we want you to do classification again, this time with a traditional machine learning method, so that you can better understand *why* a sentence receives its label. We want you to implement this on example sentences (e.g., "This song has a brilliant piano intro." for the word "brilliant").

[Here](https://drive.google.com/drive/folders/1ij20ecLlI1Zh5CdMAa91SXQzmpKfdBdJ?usp=sharing) are the two files needed for this task: **train.jsonl** and **test.jsonl**. In these files, each line records one example sentence with its label. There are two types of labels: ***positive*** means it is a *good* example sentence; ***negative*** means it is a *bad* example sentence.
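
As a rough illustration of the file format, the sketch below parses two JSON Lines records with the standard library. The field names `text` and `label` are the ones read in the next section; the two records themselves are made-up stand-ins, not actual lines from the files.

```python
import json

# Two illustrative records in JSON Lines form (one JSON object per line).
# These records are assumptions for the example, not copied from train.jsonl.
sample = (
    '{"text": "This song has a brilliant piano intro.", "label": "positive"}\n'
    '{"text": "Tests were done.", "label": "negative"}\n'
)

# Parse each non-empty line independently, as a JSON Lines reader would.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]
for record in records:
    print(record["label"], "|", record["text"])
```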

## Read Data
We use a pandas DataFrame to hold the data in this assignment.

In [1]:
import json
import pandas as pd



In [2]:
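def read_data_to_df(path):
    """Read a JSON Lines file into a DataFrame with columns 'sent' and 'label'."""
    labels = []
    sentences = []
    # The sentences contain non-ASCII characters, so read the file explicitly as UTF-8.
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            sentences.append(record['text'])
            labels.append(record['label'])
    return pd.DataFrame({'sent': sentences, 'label': labels})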

In [3]:
train = read_data_to_df('data/train.jsonl')
print(train.head())
test = read_data_to_df('data/test.jsonl')
print(test.head())

                                                sent     label
0         My children threw a birthday party for me.  positive
1  Marketing on social networking sites is a fast...  positive
2                You pillock, look what you've done!  positive
3      He scored three touchdowns early in the game.  positive
4          His abrupt manner makes me uncomfortable.  positive
                                                sent     label
0  I've just spotted Mark - he's over there, near...  positive
1  After repricing, the bonds yield a much higher...  positive
2             I admire her staunchness and fidelity.  positive
3  The party's leader is in favour of the treaty ...  positive
4  About 20 companies are working on treatments f...  positive


## Extract Features
Traditional machine learning needs data scientists to observe the data and find useful information in it.

Here is an example:

In [4]:
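import re
# Fraction of positive sentences with more than 3 punctuation characters,
# i.e. characters that are neither word characters nor whitespace.
spec = 0
count = 0
for i in train[train['label'] == "positive"]["sent"]:
    # Raw string avoids the invalid-escape warning for \w and \s.
    if len(re.sub(r'[\w\s]', '', i)) > 3:
        spec += 1
    count += 1
print(spec / count)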

0.058014915590512126


In [5]:
# Same ratio for the negative sentences.
spec = 0
count = 0
for i in train[train['label'] == "negative"]["sent"]:
    if len(re.sub(r'[\w\s]', '', i)) > 3:
        spec += 1
    count += 1
print(spec / count)

0.3834226149596325


After some experimentation, we found that 38% of the bad example sentences contain more than 3 punctuation characters, compared with only 5.8% of the good example sentences. This looks like a useful feature for distinguishing good and bad example sentences, so we add it.
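
The punctuation check can be wrapped in a small predicate, sketched below. The helper names `count_punctuation` and `has_many_punctuation` are hypothetical; the regular expression is the one used in the cells above, and both sentences appear elsewhere in this notebook.

```python
import re

def count_punctuation(sentence):
    """Count characters that are neither word characters nor whitespace."""
    return len(re.sub(r'[\w\s]', '', sentence))

def has_many_punctuation(sentence, limit=3):
    """True when the sentence has more than `limit` such characters."""
    return count_punctuation(sentence) > limit

examples = [
    "My children threw a birthday party for me.",
    "Paul beat me by three games to two (= he won three and I won two).",
]
for s in examples:
    print(count_punctuation(s), has_many_punctuation(s), s)
```

Note that apostrophes, hyphens and brackets all count as punctuation under this definition, since none of them match `\w` or `\s`.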

In [6]:
train.head()

Unnamed: 0,sent,label
0,My children threw a birthday party for me.,positive
1,Marketing on social networking sites is a fast...,positive
2,"You pillock, look what you've done!",positive
3,He scored three touchdowns early in the game.,positive
4,His abrupt manner makes me uncomfortable.,positive


<font color="red">**[ TODO ]**</font> Please observe the data and extract at least three features and add them to the dataframe.

In [7]:
def test_data(f, train, *args):
    """Return [pos_ratio, neg_ratio]: the fraction of positive and of negative
    sentences in `train` for which the predicate `f` is true.

    If extra arguments are given, only the first one is passed on to `f`.
    """
    ratios = []
    for label in ("positive", "negative"):
        sents = train[train['label'] == label]["sent"]
        count = sum(1 for s in sents if f(s, *args[:1]))
        ratios.append(count / sents.count())
    return ratios
    

In [8]:
def if_more_than_three_punc(s):
    # Count characters that are neither word characters nor whitespace.
    return len(re.sub(r'[\w\s]', '', s)) > 3

test_data(if_more_than_three_punc, test)

[0.05512257333144396, 0.3833963184200902]

In [9]:
# Find average length
len_sum = 0

neg = train[train['label'] == "negative"]["sent"]

for i in neg:
    len_sum += len(i)
neg_avg_len = len_sum / neg.count()
neg_count = neg.count()
print(neg_avg_len)
print(neg_count)

len_sum = 0

pos = train[train['label'] == "positive"]["sent"]

for i in pos:
    len_sum += len(i)
pos_avg_len = len_sum / pos.count()
pos_count = pos.count()
print(pos_avg_len)
print(pos_count)

# Calculate standard deviation

neg_sd = 0
for s in neg:
    neg_sd += (len(s) - neg_avg_len) ** 2
neg_sd /= neg_count
neg_sd = neg_sd ** (1/2)

print(neg_sd)


pos_sd = 0
for s in pos:
    pos_sd += (len(s) - pos_avg_len) ** 2
pos_sd /= pos_count
pos_sd = pos_sd ** (1/2)

print(pos_sd)

sd_range_ratio = 1.66

def if_len_not_in_range(s, sd_range_ratio):
    
    len_high = pos_avg_len + sd_range_ratio * pos_sd
    len_low = pos_avg_len - sd_range_ratio * pos_sd
    if(len(s) > len_high or len(s) < len_low):
        return True
    return False

print(test_data(if_len_not_in_range, train, sd_range_ratio))

93.26986706957862
131046
63.21707321393775
112902
64.79810916236774
24.068649119044714
[0.0779348461497582, 0.33739297651206446]
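
To make the length rule concrete, the sketch below plugs the positive-class mean and standard deviation printed above into the same formula. With `sd_range_ratio = 1.66`, the accepted range is roughly 23 to 103 characters. The helper name `len_not_in_range` is hypothetical and the constants are copied from the output above.

```python
# Values printed by the cell above for the positive training sentences.
pos_avg_len = 63.21707321393775
pos_sd = 24.068649119044714
sd_range_ratio = 1.66

# Accepted length range: mean plus or minus sd_range_ratio standard deviations.
len_low = pos_avg_len - sd_range_ratio * pos_sd
len_high = pos_avg_len + sd_range_ratio * pos_sd
print(round(len_low, 2), round(len_high, 2))

def len_not_in_range(sentence):
    """True when the character length falls outside [len_low, len_high]."""
    return not (len_low <= len(sentence) <= len_high)

for s in ["Tests were done.", "His abrupt manner makes me uncomfortable."]:
    print(len(s), len_not_in_range(s), s)
```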


In [10]:
def if_contains_noneASCII(s):
    none_ASCII = re.sub(r'[\u0000-\u007F]+', '', s)
    limit = 0
    if(len(none_ASCII) > limit):
        return True
    return False
    
print(test_data(if_contains_noneASCII, test))

[0.04966699730763781, 0.07314397171766426]


In [11]:
if_contains_noneASCII("Béziers was the first place to be attacked.")

True
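
The same check can be written without a regular expression, as sketched below with `str.isascii()` (available from Python 3.7). The function name `contains_non_ascii` is hypothetical, and the second and third sentences are shortened variants of a sampled sentence, used only to show that a typographic apostrophe (’) counts as non-ASCII while a plain one (') does not.

```python
def contains_non_ascii(sentence):
    """True when at least one character lies outside the ASCII range."""
    return not sentence.isascii()

print(contains_non_ascii("Béziers was the first place to be attacked."))
print(contains_non_ascii("She doesn’t want to rely on a rich man."))
print(contains_non_ascii("She doesn't want to rely on a rich man."))
```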

In [12]:
def if_many_uppercase(s):
    upper = re.sub(r'[^A-Z]', '', s)
    limit = 1
    if(len(upper) > limit):
        return True 

    return False

print(test_data(if_many_uppercase, test))

[0.18775683718293892, 0.6251980982567353]


In [13]:
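def avg_numbers(train):
    # Total number of digit characters in the negative sentences.
    len_sum = 0
    for s in train[train['label'] == "negative"]["sent"]:
        numbers = re.sub(r'[^0-9]', '', s)
        len_sum += len(numbers)
    # Note: the digits are summed over negative sentences but divided by the
    # number of positive sentences, so this is not a per-sentence average
    # for either class.
    return len_sum / train[train['label'] == "positive"]["sent"].count()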
        
print(avg_numbers(train))

2.0240739756603072


In [14]:
def if_many_numbers(s):
    # Keep only the digit characters and count them.
    digits = re.sub(r'[^0-9]', '', s)
    limit = 3
    return len(digits) > limit

print(test_data(if_many_numbers, train))

[0.02391454535792103, 0.21844237901195]


In [15]:
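def avg_word_len(train):
    # Average length of the "long" words (more than 4 characters) in negative sentences.
    len_sum = 0
    word_count = 0
    for s in train[train["label"] == "negative"]['sent']:
        # Note: the range A-z also covers the ASCII characters [ \ ] ^ _ and `,
        # so those are treated as part of a word; use [^A-Za-z] for letters only.
        words = re.split(r'[^A-z]', s)
        for w in words:
            if len(w) > 4:
                word_count += 1
                len_sum += len(w)
    return len_sum / word_count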

avg_word_len(train)

6.980457773776895

In [16]:
sents = test[test['label'] == 'positive']['sent']

pos_first_words = {}
def embed_first_words(s, pos_first_words):
    first = s.split()[0]
    if(first in pos_first_words):
        pos_first_words[first] += 1
    else:
        pos_first_words[first] = 1

a = sents.apply(embed_first_words, args=(pos_first_words,))

In [17]:
threshold = 0
pos_first_words_df = pd.DataFrame.from_dict(pos_first_words, orient='index')
pos_first_words_df = pos_first_words_df[pos_first_words_df[0] > threshold]
pos_first_words_df.count()

pos = pos_first_words_df.to_dict()

pos = pos[0]

def if_not_have_first_word(s, pos):
    if((s.split())[0] in pos):
        return False
    return True

tup = test_data(if_not_have_first_word, test, pos)
print(tup, tup[1]-tup[0])


[0.0, 0.24509325856394001] 0.24509325856394001
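
In the cells above, the first-word vocabulary is collected from the positive sentences of `test` and then evaluated on `test`, so every positive test sentence matches by construction (hence the 0.0 in the output). The sketch below shows a variant that builds the vocabulary from training sentences only and then applies it to unseen sentences. The helper names and the toy sentence lists are assumptions for illustration.

```python
from collections import Counter

def first_word(sentence):
    """First whitespace-separated token, or '' for an empty sentence."""
    tokens = sentence.split()
    return tokens[0] if tokens else ""

def build_first_word_vocab(sentences, threshold=0):
    """Keep first words that occur more than `threshold` times."""
    counts = Counter(first_word(s) for s in sentences)
    return {w for w, c in counts.items() if c > threshold and w}

def not_have_first_word(sentence, vocab):
    """True when the sentence starts with a word not seen in the vocabulary."""
    return first_word(sentence) not in vocab

# Toy stand-ins for the positive training sentences (assumed for illustration).
train_positive = [
    "He scored three touchdowns early in the game.",
    "He let me drive his new car last night.",
    "Our plane came down in a field.",
]
vocab = build_first_word_vocab(train_positive)

# Unseen sentences: the vocabulary was built without looking at them.
held_out = ["He was late again.", "Approximately 20,000 people live there."]
print([not_have_first_word(s, vocab) for s in held_out])
```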


In [18]:
"""
max_dif = 0
max_t = 0
for t in range(0, 100):
    pos_first_words_df = pd.DataFrame.from_dict(pos_first_words, orient='index')
    pos_first_words_df = pos_first_words_df[pos_first_words_df[0] > t]
    pos_first_words_df.count()

    pos = pos_first_words_df.to_dict()

    pos = pos[0]
    tup = test_data(if_not_have_first_word, test, pos)
    print(tup, tup[1]-tup[0])
    if tup[1]-tup[0] > max_dif:
        max_dif = tup[1]-tup[0]
        max_t = t
"""

In [19]:
sents = test[test['label'] == 'positive']['sent']

pos_last_words = {}

def embed_last_words(s, pos_last_words):
    pos_last_words[s.split()[-1]] = 1

a = sents.apply(embed_last_words, args=(pos_last_words,))

In [20]:
len(pos_last_words)

6372

In [21]:
def if_not_have_last_word(s, pos_last_words):
    if((s.split())[-1] in pos_last_words):
        return False
    return True

In [22]:
excluded_first_words = ['And']
def if_first_word_excluded(s, excluded_first_words):
    if((s.split())[0] in excluded_first_words):
        return True
    return False

print(test_data(if_first_word_excluded, train, excluded_first_words))

[0.00033657508281518487, 0.0018237870671367305]


In [23]:
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
print(test[test['label'] == 'negative']['sent'].sample(n=10, random_state=None))

18296                                           I’d say 40% are constructive, 40% are trivial, and 20% are unconstructive.
21440                                     She doesn’t want to rely on a rich man, but decides to marry a poor man, Freddy.
2538        I hoped that going to college might broaden my horizons (= increase the range of my knowledge and experience).
5804     The detectives searched the house from top to bottom (= all over it), but they found no sign of the stolen goods.
28913                            In March 2011, 2011 Tōhoku earthquake and tsunami occurred in wide area including Sendai.
26238                                                                                        They do this in various ways.
29761                                    They are guest stars in the episode as the characters they voice act in the show.
18344                                                                              Approximately 20,000 people live there.
28617           

In [24]:
print(train[train['label'] == 'positive']['sent'].sample(n=10, random_state=None))

41271                                               He let me drive his new car last night - it goes like a dream.
85544    You've been putting off making that phone call for days - I think it's about time you grasped the nettle!
1092                                                            Lady Tavistock was the chatelaine of Woburn Abbey.
29094                                                                          I'm not sure where this bowl lives.
91008                                    The regime in December issued new currency and wiped out private savings.
34061                                                                              Our plane came down in a field.
64980                                                                                             She's almost 30.
23450                                                                        The salary was $40,000, plus a bonus.
43406                                                    To win a seat, a candid

In [25]:
train.loc[91313]

sent     Paul beat me by three games to two (= he won three and I won two).
label                                                              negative
Name: 91313, dtype: object

## Train
Now it's time to evaluate whether the selected features are useful for classification. We use a [Bernoulli Naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes) and fit it on the training data.

<font color="red">**[ TODO ]**</font> Please adjust the `selected_features` list and train the model.

***Don't*** use any other model in this assignment.

In [26]:
train["len_not_in_range"] = [1 if if_len_not_in_range(s, sd_range_ratio) else 0 for s in train["sent"]]
test["len_not_in_range"] = [1 if if_len_not_in_range(s, sd_range_ratio) else 0 for s in test["sent"]]

train["contains_noneASCII"] = [1 if if_contains_noneASCII(s) else 0 for s in train["sent"]]
test["contains_noneASCII"] = [1 if if_contains_noneASCII(s) else 0 for s in test["sent"]]

train["many_uppercase"] = [1 if if_many_uppercase(s) else 0 for s in train["sent"]]
test["many_uppercase"] = [1 if if_many_uppercase(s) else 0 for s in test["sent"]]

train["many_numbers"] = [1 if if_many_numbers(s) else 0 for s in train["sent"]]
test["many_numbers"] = [1 if if_many_numbers(s) else 0 for s in test["sent"]]

train["more than three punc"] = [1 if if_more_than_three_punc(s) else 0 for s in train["sent"]]
test["more than three punc"] = [1 if if_more_than_three_punc(s) else 0 for s in test["sent"]]

train["not_have_first_word"] = [1 if if_not_have_first_word(s, pos) else 0 for s in train["sent"]]
test["not_have_first_word"] = [1 if if_not_have_first_word(s, pos) else 0 for s in test["sent"]]

train["not_have_last_word"] = [1 if if_not_have_last_word(s, pos_last_words) else 0 for s in train["sent"]]
test["not_have_last_word"] = [1 if if_not_have_last_word(s, pos_last_words) else 0 for s in test["sent"]]

In [27]:
selected_features = ["more than three punc",
                     "many_uppercase",
                     "contains_noneASCII",
                     "len_not_in_range",
                     "many_numbers",
                     "not_have_first_word",
                     "not_have_last_word",
                    ]

In [28]:
"""
# Find best sd_range_ratio
# (needs `import numpy as np` and the `bnb` model defined in the next cell)

steps = np.arange(1.6, 1.8, step=0.01)

max_srr = steps[0]
max_ratio = 0

for srr in steps:
    
    train["len_not_in_range"] = [1 if if_len_not_in_range(s, srr) else 0 for s in train["sent"]]
    test["len_not_in_range"] = [1 if if_len_not_in_range(s, srr) else 0 for s in test["sent"]]

    y_pred = bnb.fit(train[selected_features], train['label']).predict(test[selected_features])
    
    ratio = (test['label'] == y_pred).sum()/len(test)
    if(ratio > max_ratio):
        max_ratio = ratio
        max_srr = srr

print(max_ratio, max_srr)
"""


In [29]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()

y_pred = bnb.fit(train[selected_features], train['label']).predict(test[selected_features])
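
To see *why* a Bernoulli Naive Bayes model prefers one label, it can help to look at the per-feature terms it adds up. The sketch below is a small pure-Python re-implementation with Laplace smoothing (`alpha = 1`), not scikit-learn's code; the function names, the two feature columns and the toy rows are all assumptions for illustration.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed P(feature = 1 | class)."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        # Smoothed fraction of rows in class c where each feature equals 1.
        probs = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
        model[c] = (math.log(n / len(y)), probs)
    return model

def explain(model, x):
    """Per-class log prior, per-feature log-likelihood terms, and their total."""
    report = {}
    for c, (log_prior, probs) in model.items():
        terms = [math.log(p if v else 1 - p) for p, v in zip(probs, x)]
        report[c] = (log_prior, terms, log_prior + sum(terms))
    return report

def predict(model, x):
    """Return the class with the highest total log score."""
    report = explain(model, x)
    return max(report, key=lambda c: report[c][2])

# Toy data: columns are [many punctuation, many uppercase] (illustrative only).
X = [[0, 0], [0, 0], [0, 1], [1, 1], [1, 0], [1, 1]]
y = ["positive", "positive", "positive", "negative", "negative", "negative"]
model = fit_bernoulli_nb(X, y)

for label, (log_prior, terms, total) in explain(model, [1, 1]).items():
    print(label, round(log_prior, 3), [round(t, 3) for t in terms], round(total, 3))
print(predict(model, [1, 1]))
```

Each entry in `terms` shows how much a single 0/1 feature pushes the sentence toward or away from a class, which is the kind of explanation a deep model does not readily provide.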

## Test
Test the model and compute the accuracy of its predictions on the testing data.

There are four baselines for this task:
```
1. simple baseline: 0.65
2. medium baseline: 0.72
3. strong baseline: 0.8
4. boss baseline: 0.85
```
The more baselines you pass, the higher your grade.

*Hint: if the result isn't ideal, you can print the wrongly predicted examples and re-observe the data, focusing only on those examples, to extract further features.

In [30]:
print((test['label'] == y_pred).sum()/len(test))

0.8215596330275229
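
Accuracy alone does not show which class the errors come from. A minimal sketch of a per-class breakdown follows, using only the standard library; the helper names and the toy label lists (standing in for `test['label']` and `y_pred`) are assumptions for illustration.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels standing in for test['label'] and y_pred (illustrative only).
y_true = ["positive", "positive", "negative", "negative", "negative"]
y_hat = ["positive", "negative", "negative", "negative", "positive"]

print(accuracy(y_true, y_hat))
for (true_label, predicted), n in sorted(confusion_counts(y_true, y_hat).items()):
    print(f"true={true_label:8} predicted={predicted:8} count={n}")
```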


In [31]:
d = test[test['label'] != y_pred]

In [32]:
d[d['label'] == 'positive'].sample(n=10)

Unnamed: 0,sent,label,len_not_in_range,contains_noneASCII,many_uppercase,many_numbers,more than three punc,not_have_first_word,not_have_last_word
2576,"No more cake for me, thanks, I'm full.",positive,0,0,1,0,1,0,0
13348,Is he qualified to be the UN’s standard-bearer for democracy and freedom?,positive,0,1,1,0,0,0,0
14623,"We had several hours to wait for our train, so we left our bags in a (luggage) locker, and went to look around the town.",positive,1,0,0,0,1,0,0
13151,He'd have driven straight into me if I hadn't seen him first - the dozy idiot!,positive,0,0,1,0,1,0,0
0,"I've just spotted Mark - he's over there, near the entrance.",positive,0,0,1,0,1,0,0
6238,"Yes, Captain.",positive,1,0,1,0,0,0,0
3602,"Bruce's pet name for Noelle is ""Otter,"" which can be embarrassing when other people hear it.",positive,0,0,1,0,1,0,0
1064,You don't fool me - I wasn't born yesterday.,positive,0,0,1,0,1,0,0
5233,"The bank offers a 20-year, fixed-rate second mortgage option at an amount up to 4% of the appraised value of the home.",positive,1,0,0,0,1,0,0
2042,"If it's okay by/with you, I'll come over tomorrow instead.",positive,0,0,1,0,1,0,0


In [33]:
d[d['label'] == 'negative'].sample(n=10)

Unnamed: 0,sent,label,len_not_in_range,contains_noneASCII,many_uppercase,many_numbers,more than three punc,not_have_first_word,not_have_last_word
16574,The word as translated into English.,negative,0,0,1,0,0,0,0
24727,Several individuals have been seen with a white lining around the mouth and chin.,negative,0,0,0,0,0,0,0
17447,People take part in a traditional mame-maki ceremony.,negative,0,0,0,0,0,0,0
27114,Many also operate round the Manchester area.,negative,0,0,1,0,0,0,0
26907,"There is a public nude beach close to New York City called Gunnison in Sandy Hook, New Jersey.",negative,0,0,1,0,0,0,0
19899,Tests were done.,negative,1,0,0,0,0,0,0
22874,"While this happens, Lois Griffin hits Stewie Griffin and Stewie wants Lois to hit him more.",negative,0,0,1,0,0,0,0
17427,Jason fell in love with the daughter of the Golden Fleece's owner.,negative,0,0,1,0,0,0,0
17484,The speed that it must travel to get away is called escape velocity.,negative,0,0,0,0,0,0,1
16630,Developers have named each after famous biologists.,negative,0,0,0,0,0,0,1


## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=1031097651) to reserve a demo time.  
The score is only given after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to eeclass. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.