# Week 08: Phrase Classification
This week's assignment asks you to distinguish good from bad phrases containing the word "**earn**" (e.g., earn money). You will use word2vec, the method covered today, along the way.

The assignment uses the following data:
* train.tsv: Phrases with labels, used to train and validate the classification model. There are only two labels: 1 means *good*; 0 means *bad*.
* test.tsv: Same format as train.tsv. It is used to test your model.
* GoogleNews-vectors-negative300.bin.gz: a pre-trained word2vec model from Google ([source](https://code.google.com/archive/p/word2vec/))

## Requirements
* pandas
* tensorflow
* scikit-learn (imported as `sklearn`)
* gensim

## Read Data
The data are stored in pandas DataFrames.

In [1]:
import numpy as np
import tensorflow as tf

tf.random.set_seed(1234)

In [2]:
import pandas as pd

def loadData(path):
    """Read a tab-separated file of `phrase<TAB>label` lines into a DataFrame."""
    ngram = []
    _class = []
    with open(path) as f:
        for line in f:
            line = line.strip("\n").split("\t")
            ngram.append(line[0])
            _class.append(int(line[1]))
    return pd.DataFrame({"phrase": ngram, "class": _class})

train = loadData("train.tsv")
test = loadData("test.tsv")
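
To make the expected file layout concrete, here is a minimal sketch that parses the same `phrase<TAB>label` format with the standard `csv` module. The two sample rows and the helper name `parse_rows` are made up for illustration; they are not taken from train.tsv.

```python
import csv
import io

# Hypothetical sample in the same layout as train.tsv: phrase<TAB>label.
sample_tsv = "earn money\t1\nearn to\t0\n"

def parse_rows(handle):
    """Return (phrase, label) pairs from a tab-separated file object."""
    # QUOTE_NONE keeps quote characters inside phrases as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], int(row[1])) for row in reader if row]

rows = parse_rows(io.StringIO(sample_tsv))
print(rows)
```

Each row becomes a `(phrase, label)` tuple, which is the same information `loadData` stores in the `phrase` and `class` columns.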

## Load word2vec Model
<font color="red">**[ TODO ]**</font> Please load [GoogleNews-vectors-negative300.bin.gz](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g) model and check the embedding of the word `language`.

* The `gensim` package is a good choice.

In [3]:
from gensim.models.keyedvectors import KeyedVectors

# Assumes the downloaded .bin.gz archive has been decompressed into the working directory.
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(type(kv))
#print(kv['language'])

<class 'gensim.models.keyedvectors.KeyedVectors'>


In [4]:
embeddings_index = {}
for word, vector in zip(list(kv.index_to_key), kv.vectors):
    coefs = np.asarray(vector, dtype='float32')
    embeddings_index[word] = coefs

In [5]:
display(embeddings_index['language'])

array([ 2.30712891e-02,  1.68457031e-02,  1.54296875e-01,  1.27929688e-01,
       -2.67578125e-01,  3.51562500e-02,  1.19140625e-01,  2.48046875e-01,
        1.93359375e-01, -7.95898438e-02,  1.46484375e-01, -1.43554688e-01,
       -3.04687500e-01,  3.46679688e-02, -1.85546875e-02,  1.06933594e-01,
       -1.52343750e-01,  2.89062500e-01,  2.35595703e-02, -3.80859375e-01,
        1.09863281e-01,  4.41406250e-01,  3.75976562e-02, -1.22680664e-02,
        1.62353516e-02, -2.24609375e-01,  7.61718750e-02, -3.12500000e-02,
       -2.16064453e-02,  1.49414062e-01, -4.02832031e-02, -4.46777344e-02,
       -1.72851562e-01,  3.32031250e-02,  1.50390625e-01, -5.05371094e-02,
        2.72216797e-02,  3.00781250e-01, -1.33789062e-01, -7.56835938e-02,
        1.93359375e-01, -1.98242188e-01, -1.27563477e-02,  4.19921875e-01,
       -2.19726562e-01,  1.44531250e-01, -3.93066406e-02,  1.94335938e-01,
       -3.12500000e-01,  1.84570312e-01,  1.48773193e-04, -1.67968750e-01,
       -7.37304688e-02, -

## Preprocessing
Preprocess the two TSV files here.

#### adjust the ratio of the two classes of training data
In the training data, the ratio of good phrases to bad phrases is about one to thirty. Such an imbalance makes it hard to train a satisfactory classifier, so we need to adjust the ratio. Removing bad phrases (undersampling) and adding good phrases (oversampling) are both common approaches.

<font color="red">**[ TODO ]**</font> Please adjust the ratio of good phrases to bad phrases in whatever way you think is best, and output the number of phrases in each class for the demo.

You need to explain why you chose this ratio and how you achieved it.

In [6]:
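# Print the number of training examples in each class
from sklearn.utils import shuffle

train_0 = train.loc[train['class'] == 0]  # bad phrases
train_1 = train.loc[train['class'] == 1]  # good phrases

train_0_len = len(train_0)
train_1_len = len(train_1)
print("before:", train_0_len, "bad,", train_1_len, "good")

# Undersample the bad phrases so that good:bad becomes 1:4
desire_ratio = 1/4
desire_n = int(train_1_len / desire_ratio)

train_0_sample = train_0.sample(n=desire_n, random_state=42)

train_shuffle = shuffle(pd.concat([train_1, train_0_sample]), random_state=42)
print("after:", len(train_0_sample), "bad,", train_1_len, "good")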

In [7]:
train = train_shuffle
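
The undersampling idea used above can be sketched without pandas. This is only an illustration on made-up rows: the helper name `undersample` and the toy data are assumptions, and the sampled rows will differ from what `DataFrame.sample` picks.

```python
import random

def undersample(rows, bad_per_good=4, seed=42):
    """Keep every good row and a random subset of bad rows (bad_per_good per good row)."""
    good = [r for r in rows if r[1] == 1]
    bad = [r for r in rows if r[1] == 0]
    # Never ask for more bad rows than exist.
    n_bad = min(len(bad), bad_per_good * len(good))
    rng = random.Random(seed)
    balanced = good + rng.sample(bad, n_bad)
    rng.shuffle(balanced)  # mix the classes so batches are not ordered by label
    return balanced

# Toy data with a 1:30 ratio, mirroring the imbalance described above.
toy = [("good %d" % i, 1) for i in range(2)] + [("bad %d" % i, 0) for i in range(60)]
balanced = undersample(toy)
counts = {label: sum(1 for _, l in balanced if l == label) for label in (0, 1)}
print(counts)  # {0: 8, 1: 2}
```

Undersampling discards data but keeps every good phrase unchanged; oversampling would keep all bad phrases at the cost of repeating good ones.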

#### number words
Assign each word a unique integer index.

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
tok = Tokenizer(filters='')
tok.fit_on_texts(pd.concat([train, test], ignore_index=True)['phrase'])
vocab_size = len(tok.word_index) + 1

#### convert phrases into numbers
Because the model cannot read words, we have to convert each phrase into a sequence of numbers.

The numbers must be the same ones assigned in the previous step.

In [9]:
train_encoded_phrase = tok.texts_to_sequences(train['phrase'])
test_encoded_phrase = tok.texts_to_sequences(test['phrase'])
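
The numbering and encoding steps can be pictured with a small pure-Python sketch. It only roughly mimics the Keras `Tokenizer` (lowercasing, whitespace splitting, indices assigned from 1 in order of decreasing frequency); the helper names and the toy phrases are assumptions for illustration.

```python
from collections import Counter

def build_word_index(phrases):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for phrase in phrases:
        counts.update(phrase.lower().split())
    # most_common() sorts by count; ties keep the order in which words were first seen.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(phrase, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [word_index[w] for w in phrase.lower().split() if w in word_index]

toy_phrases = ["earn money", "earn a living", "to earn money"]
word_index = build_word_index(toy_phrases)
toy_vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding

print(word_index)
print(encode("to earn a living", word_index))
```

Fitting on both the training and test phrases, as the notebook does, guarantees that no test word is dropped during encoding.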

#### padding
Make all phrases the same length. The longest phrases in the two TSV files have five tokens, so every phrase with fewer than five tokens is padded to length five by adding 0s at the front.

In [10]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_ngram = 5
x_train = pad_sequences(train_encoded_phrase, maxlen=max_ngram, padding='pre')
x_test = pad_sequences(test_encoded_phrase, maxlen=max_ngram, padding='pre')
print(x_train[:10])

[[   0    6    1    3  481]
 [  11   27  125    2    1]
 [  28   27    1   22   15]
 [   0    1  260   95   23]
 [   1    7  610   19  143]
 [ 535 1125    4    1   92]
 [   0  482   88    1   29]
 [   0    1   10  120   12]
 [   0    0    2    1   33]
 [   0  381   34    1    3]]
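
As a rough illustration of what `padding='pre'` does, here is a minimal pure-Python sketch. The helper name `pad_pre` is hypothetical, and the truncation rule (keep the last `maxlen` tokens) is an assumption that matches the `pad_sequences` default.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad a sequence with `value` up to maxlen; keep the last maxlen items if longer."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([6, 1, 3, 481], 5))       # [0, 6, 1, 3, 481]
print(pad_pre([2, 1, 33], 5))           # [0, 0, 2, 1, 33]
print(pad_pre([1, 2, 3, 4, 5, 6], 5))   # [2, 3, 4, 5, 6]
```

Padding at the front keeps the real tokens at the end of the sequence, which is where a recurrent layer reads its final state.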


#### one-hot encode the labels

In [11]:
from tensorflow.keras.utils import to_categorical
y_train = to_categorical(train['class'])
y_test = to_categorical(test['class'])
print(y_train[:5])
print(y_train.shape)

[[1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]
(30525, 2)
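
One-hot encoding turns label 0 into `[1, 0]` and label 1 into `[0, 1]`. A minimal sketch of the idea, with a hypothetical helper `one_hot` standing in for `to_categorical`:

```python
def one_hot(labels, num_classes=2):
    """Return one row per label with a 1.0 in the column of that label."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

print(one_hot([0, 1, 0, 0, 0]))
```

With this layout, column 0 is the score for *bad* and column 1 is the score for *good*.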


#### split training data into train and validation

In [12]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.20, random_state=42)
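
A hold-out split is a shuffle followed by a cut. The sketch below shows the idea in plain Python; the helper name `split_train_val` is hypothetical, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

def split_train_val(items, val_fraction=0.2, seed=42):
    """Shuffle the indices, then hold out the first val_fraction of them for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(items) * val_fraction)
    val = [items[i] for i in indices[:n_val]]
    train = [items[i] for i in indices[n_val:]]
    return train, val

toy_items = list(range(10))
toy_train, toy_val = split_train_val(toy_items)
print(len(toy_train), len(toy_val))  # 8 2
```

The validation rows never take part in weight updates, so their loss and accuracy indicate how well the model generalizes.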

#### creating the embedding matrix
The classification model uses the embedding matrix. It should be a list of lists (or a 2-D array): each row is the embedding vector of one word, and row *i* must hold the vector of the word whose *tokenizer* index is *i*. The tokenizer stores this word-to-index mapping in a dictionary, which you can inspect with `tok.word_index.items()`.

<font color="red">**[ TODO ]**</font> Make the embedding matrix. If you don't need it for your classification model, you can skip it. We won't check it during the demo. 

In [13]:
vector_dimension = 300
# Row 0 stays all zeros because index 0 is reserved for padding.
embedding_matrix = np.zeros((vocab_size, vector_dimension))

for word, index in tok.word_index.items():
    # Words missing from the pre-trained vocabulary keep an all-zero row.
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

In [14]:
embedding_matrix.shape

(5596, 300)
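
The row-alignment rule can be checked on a toy example. In this sketch the three-dimensional vectors, the word `zzz`, and the helper `build_embedding_matrix` are all made up; it also returns the words that had no pre-trained vector, which is a quick way to see how much of the vocabulary is covered. Note that the tokenizer lowercases words by default while the GoogleNews vocabulary is case-sensitive, so some lookups may miss for that reason.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 and unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)
        else:
            matrix[idx] = list(vec)
    return matrix, missing

toy_index = {"earn": 1, "money": 2, "zzz": 3}
toy_vectors = {"earn": [0.1, 0.2, 0.3], "money": [0.4, 0.5, 0.6]}
toy_matrix, toy_missing = build_embedding_matrix(toy_index, toy_vectors, dim=3)

print(toy_matrix)
print(toy_missing)  # ['zzz']
```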

## Classification

#### build model
<font color="red">**[ TODO ]**</font> Please build your classification model with ***keras*** here. 

You **must** use the pre-trained word2vec model to represent the words of phrases.

In [15]:
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.models.Sequential([
        # Embedding layer initialised with the pre-trained word2vec vectors
        layers.Embedding(input_dim=embedding_matrix.shape[0],
                         output_dim=300,
                         input_length=5,
                         weights=[embedding_matrix]),
        layers.LSTM(4),
        # Two outputs, one per class, matching the one-hot labels
        layers.Dense(units=2, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.binary_crossentropy,
              metrics=['accuracy'])

# Pass the validation split so validation loss and accuracy are reported each epoch
model.fit(x_train, y_train, batch_size=32, epochs=3,
          validation_data=(x_val, y_val))

model.summary()

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 5, 300)            1678800   
                                                                 
 lstm (LSTM)                 (None, 4)                 4880      
                                                                 
 dense (Dense)               (None, 2)                 10        
                                                                 
Total params: 1,683,690
Trainable params: 1,683,690
Non-trainable params: 0
_________________________________________________________________
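
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard counting rules: an embedding has one vector per vocabulary entry, an LSTM has four gates that each own input weights, recurrent weights, and a bias, and a dense layer has one weight per input-output pair plus one bias per output.

```python
vocab_size, embed_dim, lstm_units, num_classes = 5596, 300, 4, 2

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with (input + recurrent + bias) weights per unit.
lstm_params = 4 * (embed_dim + lstm_units + 1) * lstm_units

# Dense: weights plus biases.
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 1678800 4880 10 1683690
```

Almost all parameters sit in the embedding layer, and the summary lists them as trainable, so the pre-trained vectors are fine-tuned during training.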


#### train
Train the classification model here.

<font color="red">**[ TODO ]**</font> Adjust the hyperparameters to optimize the validation accuracy and validation loss.

* The higher the validation accuracy, the better; the lower the validation loss, the better.
* The **number of epochs** and the **batch size** are the most important hyperparameters.
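
One way to reason about these two hyperparameters is to count weight updates: each batch triggers one update, so a smaller batch size or more epochs means more updates. The sketch below assumes 30,525 balanced rows with 20% held out for validation; the helper name `updates_per_run` is hypothetical.

```python
import math

# Assumed sizes: 30,525 balanced rows, one fifth of them held out for validation.
n_total = 30525
n_val = n_total // 5
n_train = n_total - n_val

def updates_per_run(n_samples, batch_size, epochs):
    """Return (steps per epoch, total weight updates) for one training run."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

for batch_size in (16, 32, 64):
    steps, total = updates_per_run(n_train, batch_size, epochs=3)
    print(batch_size, steps, total)
```

Under these assumptions, a batch size of 32 with 3 epochs gives 764 steps per epoch and 2,292 updates in total.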

#### test

<font color="red">**[ TODO ]**</font> Test your model on test.tsv and output the accuracy. Your accuracy needs to beat the baseline: **0.97**.

In [16]:
# evaluate() returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
loss, accuracy = model.evaluate(x_test, y_test)
print(accuracy)

0.9894999861717224
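
Accuracy on imbalanced data should be read against what a trivial classifier would score. As a rough sketch, assuming the test set has the same 1:30 ratio of good to bad phrases as the original training data (an assumption, since the test distribution is not stated here), always predicting *bad* would already reach about 0.968:

```python
def majority_baseline(n_good, n_bad):
    """Accuracy of a classifier that always predicts the more frequent class."""
    return max(n_good, n_bad) / (n_good + n_bad)

# Assumed ratio of 1 good phrase per 30 bad phrases.
baseline = majority_baseline(1, 30)
print(round(baseline, 3))  # 0.968
```

Under that assumption, the margin above the baseline is a more informative number than the raw accuracy.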


## Show wrong prediction results
Inspecting the wrong predictions may help you improve your model.

<font color="red">**[ TODO ]**</font> Show the wrong prediction results like this: 

<img src="https://imgur.com/BOTMyZH.jpg" width=30%><br>

In [17]:
def flip_label(row):
    # For a misclassified row, the predicted label is the opposite of the true class.
    return 0 if row['class'] == 1 else 1

In [18]:
# Round each output to 0 or 1 so it can be compared with the one-hot labels.
prediction = np.round(model.predict(x_test))

# A row is wrong if either of its two outputs differs from the label.
is_wrong = np.any(prediction != y_test, axis=1)

wrong_predictions = test[is_wrong]

# 'labeled' is the opposite of the true class, i.e. the label the model is taken to have assigned.
labeled = wrong_predictions.apply(flip_label, axis=1).rename('labeled')

wrong_predictions = pd.concat([wrong_predictions, labeled.to_frame()], axis=1)

#display(wrong_predictions)
display(wrong_predictions[wrong_predictions['class'] == 1])
display(wrong_predictions[wrong_predictions['class'] == 0])

| | phrase | class | labeled |
|---|---|---|---|
| 49 | earn a reprimand from her | 1 | 0 |
| 88 | earn 10 points per dollar | 1 | 0 |
| 270 | earn instead of sharing it | 1 | 0 |
| 298 | earn money fast earn money | 1 | 0 |
| 475 | earn your confidence and respect | 1 | 0 |
| 541 | earn money from earn money | 1 | 0 |
| 602 | earn when everybody wants them | 1 | 0 |
| 765 | earn up to two more | 1 | 0 |
| 801 | earn a Division I scholarship | 1 | 0 |
| 960 | earn from home job money | 1 | 0 |


| | phrase | class | labeled |
|---|---|---|---|
| 197 | earn your Masters Degree | 0 | 1 |
| 227 | % of them earn more | 0 | 1 |
| 313 | to earn $ 134 million | 0 | 1 |
| 590 | them earn a better living | 0 | 1 |
| 615 | make them earn their money | 0 | 1 |
| 1458 | money earn money | 0 | 1 |
| 1512 | earn Free Gear with Bonus | 0 | 1 |
| 1521 | sites that earn a revenue | 0 | 1 |
| 1572 | earn $ 1800 | 0 | 1 |
| 1577 | earn $ 275,000 | 0 | 1 |
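
Rounding two sigmoid outputs can produce rows such as `[1, 1]` or `[0, 0]` that name no single class. An alternative, sketched here on made-up probabilities, is to take the index of the larger output as the prediction and split the errors by true class. The helper names `argmax` and `find_errors` and the toy numbers are assumptions for illustration.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

def find_errors(probabilities, true_labels):
    """Return (indices of good phrases predicted bad, indices of bad phrases predicted good)."""
    false_negatives, false_positives = [], []
    for i, (probs, truth) in enumerate(zip(probabilities, true_labels)):
        if argmax(probs) == truth:
            continue
        (false_negatives if truth == 1 else false_positives).append(i)
    return false_negatives, false_positives

# Toy outputs: rows 1 and 2 would round to [1, 1] and [0, 0] respectively.
toy_probs = [[0.9, 0.2], [0.6, 0.7], [0.4, 0.3], [0.1, 0.8]]
toy_truth = [0, 0, 1, 1]
fn, fp = find_errors(toy_probs, toy_truth)
print(fn, fp)  # [2] [1]
```

The two index lists correspond to the two tables above: good phrases predicted as bad, and bad phrases predicted as good.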


## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=807282025) to reserve a demo time.  
Scores are given only after TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to eeclass. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.

## Learning Resources
[Deep Learning with Python](https://tanthiamhuat.files.wordpress.com/2018/03/deeplearningwithpython.pdf)

[Classification on IMDB](https://keras.io/examples/nlp/bidirectional_lstm_imdb/)