# Week 09: Word Sense Disambiguation

This week, we introduced a hot topic in Natural Language Processing: *Word Sense Disambiguation (WSD)*.  
Many words in natural languages have ambiguous meanings. For example, the word *[party](https://dictionary.cambridge.org/dictionary/english/party)* can refer to 1) a social gathering (派對), 2) a political organization (政黨), or 3) an entity in law (當事人；⋯⋯方).  
As humans, we can distinguish these meanings easily, but can a machine do the same? This is what WSD aims for.  

## Introduction

### tl; dr
You have to
1. preprocess the data,
2. (stage 1) generate a small training dataset from the given collocation seeds,
3. (stage 1) train a weak model on that small dataset,
4. (stage 2) use the weak model to generate more labeled data,
5. (stage 2) train your final model, and
6. evaluate your model on the testing data (requirement: accuracy > 0.7).

### Concept

In [Lesk's assumption](https://en.wikipedia.org/wiki/Lesk_algorithm), each word has only one sense when it appears in the same collocation.  
For example, if *party* shows up with the word *court* (法庭), most likely the sense of this *party* is the 3rd one: an entity in law (當事人；⋯⋯方).  
However, we are not implementing Lesk's algorithm this week. Instead, we will combine his assumption with [Yarowsky's](https://en.wikipedia.org/wiki/Yarowsky_algorithm) *bootstrap technique* .  

You are given some pre-defined collocations of the word *party*, also called *seeds*, along with the sense each collocation belongs to.  
With the given seeds, you can generate a small set of labeled data by rule, and with this small set you can train a small model with limited accuracy.  
This first classifier might not perform well on the whole dataset, but it is already good enough to generate more, reasonably reliable labeled data. With the newly labeled training data, we can then train a second, more robust sense-classification model for the real WSD task.  
This process of training on a small dataset, generating more data, and then improving the model itself is called *[bootstrapping](https://www.mastersindatascience.org/learning/introduction-to-machine-learning-algorithms/bootstrapping/)*.  
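
The loop described above can be sketched in a few lines. This is only a toy illustration, not the assignment's method: the "model" is just a per-sense word counter, and `toy_sentences`, `toy_seeds`, `train`, and `predict` are hypothetical names made up for this example.

```python
from collections import Counter

# Hypothetical toy data; the real assignment uses Wikipedia sentences.
toy_sentences = [
    "the court ruled against the party",
    "the party appealed to the judge in court",
    "the judge dismissed the party",
    "a social party with music",
    "the birthday party had music and cake",
]
toy_seeds = {1: ["social"], 3: ["court"]}

def train(labeled):
    """Count how often each word appears under each sense."""
    counts = {}
    for sentence, sense in labeled:
        counts.setdefault(sense, Counter()).update(sentence.split())
    return counts

def predict(counts, sentence, ignore=("the", "a", "party")):
    """Return (best_sense, score), where score is the word-count overlap."""
    words = [w for w in sentence.split() if w not in ignore]
    scores = {s: sum(c[w] for w in words) for s, c in counts.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Stage 1: label by rule from the seeds, then train a weak model.
stage1 = [(s, sense) for s in toy_sentences
          for sense, words in toy_seeds.items()
          if any(w in s.split() for w in words)]
weak_model = train(stage1)

# Stage 2: let the weak model label the sentences the seeds missed,
# keeping only those it has some evidence for.
already_labeled = {s for s, _ in stage1}
stage2 = list(stage1)
for s in toy_sentences:
    if s not in already_labeled:
        sense, score = predict(weak_model, s)
        if score > 0:
            stage2.append((s, sense))
final_model = train(stage2)

print(len(stage1), len(stage2))   # 3 5
```

The seeds label only three of the five sentences, but the weak model picks up *judge* and *music* from them, which is enough to label the remaining two.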


<a name="I.-Data-preparation"></a>
## I. Data preparation

First things first. To make natural language understandable for machines, we have to transform sentences into embeddings.  
So here are four things to do:

1. load data
2. preprocess the sentences
3. transform sentences into embeddings
4. pad the sentences to the same length

To make the task simple and easy to understand, we will only work on a single word *party* .  
Three senses of *party* are defined below with their corresponding `sense id`s. 

In [1]:
SENSE = {
    1: 'a social event at which a group of people meet to talk, eat, drink, dance, etc.', # 派對
    2: 'an organization of people with particular political beliefs', # 政黨
    3: 'a single entity which can be identified as one for the purposes of the law' # （法庭）當事人；⋯⋯方
}

### 1. Load data

The data is a set of sentences containing the word *party*, all extracted from Wikipedia. Each sentence is guaranteed to be unique. 

In [2]:
import os
import re

In [3]:
with open(os.path.join('data', 'party.train.txt'), 'r', encoding='utf8') as f:
    data = f.read().strip().split('\n')

# this dict maps sentence_id to the sentence itself
pure_data = { sent_id: text for sent_id, text in [line.split('\t', 1) 
                                                 for line in data] }

Let's see what the data looks like.

In [4]:
for sent_id, sentence in pure_data.items():
    if int(sent_id) > 1003: break
        
    print(f'{sent_id}: {sentence}')

1001: A naked party, also known as nude party, is a party where the participants are required to be nude.
1002: The town center bears the hallmarks of a typical migration-accepting Turkish rural town, with traditional structures coexisting with a collection of concrete apartment blocks providing public housing, as well as amenities such as basic shopping and fast-food restaurants, and essential infrastructure but little in the way of culture except for cinemas and large rooms hired out for wedding parties.
1003: Elections Alberta oversees the creation of political parties and riding associations, compiles election statistics on ridings, and collects financial statements from party candidates and riding associations.


In [5]:
# a look up table from sentence to id
id_mapper = {v: k for k, v in pure_data.items()}
# a table for id to embedding; we will deal with this later
processed_data = {}

We define two samples here to validate the preprocessing as we code.

In [6]:
samples = [
    'Adnan Al-Hakim (died May 26, 1990) was the leader of the Najjadeh Party, an Arab nationalist party in Lebanon, for more than 30 years.',
    'A block party or street party is a party in which many members of a single community congregate, either to observe an event of some importance or simply for mutual enjoyment.'
]

### 2. Preprocess the sentences 

<font color="red">[TODO]</font> Define your preprocessing function to transform a sentence into tokens here.  

\-

<small>
*hint: If you can't get a high accuracy in the final result, you may want to come back and modify your preprocessing here.<br/>
*hint: Think about what words are useful and what are useless when distinguishing a sense.
</small>

In [7]:
def preprocess(text):
    # [ TODO ]
    tokens = re.findall(r"[\w']+",text)
    return tokens

In [8]:
sent_tokens = [preprocess(sent) for sent in samples]
sent_tokens[0][:5]

['Adnan', 'Al', 'Hakim', 'died', 'May']
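
As one possible direction for the hints above, here is a minimal sketch that lowercases the text and drops function words and the target word itself. The stopword list is a tiny hand-picked assumption, and `preprocess_filtered` is a hypothetical name, not part of the assignment. Keep in mind that the GoogleNews word2vec vocabulary is case-sensitive, so lowercasing changes which vectors get looked up later.

```python
import re

# A tiny hand-picked stopword list for illustration only (an assumption);
# a real list would be much longer.
STOPWORDS = {'a', 'an', 'the', 'is', 'was', 'of', 'in', 'or', 'to', 'for', 'and'}

def preprocess_filtered(text, targets=('party', 'parties')):
    """Lowercase, tokenize, and drop stopwords and the target word itself."""
    tokens = re.findall(r"[\w']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and t not in targets]

print(preprocess_filtered('A party is a social gathering.'))
# ['social', 'gathering']
```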

### 3. Transform sentences into embeddings

For simplicity, we are still using word2vec here, so you can copy-paste your code from the previous week.  
This is not required; you don't have to use word2vec if you want to train an embedding model along with the classifier.  

<small>\*Download w2v: [Google Code Archive](https://code.google.com/archive/p/word2vec/#Pretrained-word-and-phrase-vectors)</small>

In [9]:
import numpy as np
from gensim.models import KeyedVectors

In [10]:
w2v = KeyedVectors.load_word2vec_format(
        os.path.join('data', 'GoogleNews-vectors-negative300.bin'), 
        binary = True
      )

In [11]:
def to_embedding(tokens):
    # [ TODO ]
    # look up each token in word2vec, skipping out-of-vocabulary words
    result = [w2v[word] for word in tokens if word in w2v]
    return np.array(result)


In [12]:
embeddings = [to_embedding(tokens) for tokens in sent_tokens]
embeddings[0]

array([[-0.140625  ,  0.20703125, -0.12988281, ...,  0.03076172,
         0.07080078,  0.484375  ],
       [-0.0390625 ,  0.24804688,  0.00540161, ...,  0.2265625 ,
         0.02404785, -0.01477051],
       [ 0.05541992,  0.31640625,  0.27929688, ..., -0.06689453,
         0.34179688,  0.27929688],
       ...,
       [ 0.06396484, -0.25585938, -0.08447266, ...,  0.02746582,
         0.06494141,  0.06201172],
       [-0.07666016, -0.10400391, -0.00175476, ..., -0.01965332,
        -0.03442383,  0.0007515 ],
       [-0.12695312,  0.20898438, -0.10644531, ...,  0.13476562,
         0.01879883, -0.1484375 ]], dtype=float32)

### 4. Pad the sentences to the same length

The input size of the model is fixed. However, sentence lengths vary.  
An intuitive solution is to stuff dummy values into the arrays until they all share the same size; this is called *padding*.  

<small>*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences">tf.keras.preprocessing.sequence.pad_sequences</a></small>
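
To see what padding does on a small scale, here is a sketch using `np.pad` on a toy matrix. `pad_left` is a hypothetical helper name, and the 4-dimensional "embeddings" are made up so the result is easy to read.

```python
import numpy as np

def pad_left(matrix, width):
    """Left-pad a (tokens, dim) matrix with zero rows up to `width` rows."""
    missing = max(width - len(matrix), 0)
    # pad widths are given as ((rows before, rows after), (cols before, cols after))
    return np.pad(matrix, ((missing, 0), (0, 0)))

short = np.ones((2, 4))      # a 2-token "sentence" with 4-dim embeddings
padded = pad_left(short, 5)
print(padded.shape)          # (5, 4)
```

The zero rows go in front, so the real tokens end up at the end of the sequence.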

In [13]:
# if you prefer numpy
import numpy as np
# or if you prefer tensorflow
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [14]:
def add_padding(embeddings, padding_width = None):
    # [ TODO ]
    # Pad all embeddings to padding_width, or detect it automatically when it's not given
    # ps. tensorflow's `pad_sequences` can detect that for you
    if padding_width is None:
        padding_width = max(len(emb) for emb in embeddings)

    # start from all zeros, then copy each sentence into the right-hand side
    padded = np.zeros((len(embeddings), padding_width, 300))
    for i, emb in enumerate(embeddings):
        emb = emb[-padding_width:]      # keep only the last tokens if the sentence is too long
        if len(emb) > 0:                # sentences with no known words stay all-zero
            padded[i, -len(emb):] = emb
    return padded

In [None]:
print(embeddings[0].shape)
print(embeddings[1].shape)

In [15]:
emb_padded = add_padding(embeddings)
emb_padded[0]

# print(emb_padded[0].shape)
# print(emb_padded[1].shape)
print(emb_padded.shape)

(2, 26, 300)


You should see that the embedding of the shorter sentence has been padded with zero vectors, so both are the same length now.

In [16]:
# record the width for the future use.
PADDING_WIDTH = emb_padded[0].shape[0]
print(PADDING_WIDTH)

26


### 5. all-in-one

Define a function to setup the pipeline, and transform all sentences into embeddings!  

<small>\*Your embedding shape might not be the same as ours because your preprocessing may differ. </small>

In [17]:
def process_text(sentences, padding = None):
    result = [ preprocess(sentence) for sentence in sentences ]
    result = [ to_embedding(sentence) for sentence in result ]
    result = add_padding(result, padding)
    return result

In [18]:
X = process_text(pure_data.values())

In [None]:
X[0] # should be an embedding with padding

In [19]:
# X.shape # should be (637, *, 300), * depends on your preprocessing
print(X.shape)

(637, 84, 300)


Let's use a dictionary to store all embeddings with their sentence_id.

In [20]:
processed_data = { 
    sent_id: embedding for sent_id, embedding in zip(pure_data, X) 
}

In [21]:
print(pure_data['1001'])
processed_data['1001']

A naked party, also known as nude party, is a party where the participants are required to be nude.


array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.12890625, -0.18261719,  0.10351562, ..., -0.07714844,
        -0.11572266, -0.02832031],
       [-0.22851562, -0.08837891,  0.12792969, ..., -0.21289062,
         0.18847656, -0.14550781],
       [ 0.21582031, -0.12207031,  0.09765625, ..., -0.06201172,
        -0.17089844,  0.02563477]])

## II. First stage

After preprocessing the training data, now we are going to train our first-stage model!  

According to the method described at the beginning, we can train a simple model on a smaller dataset, and this dataset can be generated by rule from seeds.  

### Steps

1. Prepare the training data
2. Encode labels
3. Split training and testing dataset
4. Build classifier
5. Train

### 1. Prepare the training data

Given the seed collocations, you can add a sentence to the training data, with a label, if that sentence contains one of the collocations.  
For example, we can say <i>"A party is a **social** gathering."</i> should be the first sense, because it contains the keyword *social*. Hence, your training data will have this sentence with its label `1`.  

Don't worry about the false-positive cases for now.  
If the seed is generally good enough, the model will learn to ignore those wrong data by itself. (though yeah, you can get better results if you deal with it beforehand)

In [22]:
SEEDS = {
    1: ['social', 'events'],
    2: ['system', 'coalition'],
    3: ['court', 'law']
}
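
One avoidable source of false positives is plain substring matching: `'law' in sentence` is also true for a sentence containing *outlawed*. Below is a minimal sketch that matches seeds against whole tokens instead; `label_by_seed` is a hypothetical helper, and the lowercasing is an assumption you may not want.

```python
import re

def label_by_seed(sentence, seeds):
    """Return the sense ids whose seed words occur as whole tokens."""
    tokens = set(re.findall(r"[\w']+", sentence.lower()))
    return [sense for sense, words in seeds.items()
            if any(word in tokens for word in words)]

seeds = {1: ['social', 'events'], 2: ['system', 'coalition'], 3: ['court', 'law']}
print(label_by_seed('A party is a social gathering.', seeds))    # [1]
print(label_by_seed('The party was outlawed in 1950.', seeds))   # []
```

A sentence can match seeds of more than one sense, in which case the returned list has several entries and you have to decide whether to keep or drop it.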

In [290]:
# Label the whole dataset and check the ratio between senses
value_list = list(SEEDS.values())
key_list = list(SEEDS.keys())
total_indice, total_X, total_Y = [], [], []
for sent_id, sentence in pure_data.items():
    for i in range(len(value_list)):
        for word in value_list[i]:
            if word in sentence:
                label = key_list[i]
                total_indice.append(sent_id)
                total_X.append(processed_data[sent_id])
                total_Y.append(label)
                if label != 2:
                    total_indice.append(sent_id)
                    total_X.append(processed_data[sent_id])
                    total_Y.append(label)

total_X = np.array(total_X)
total_Y = np.array(total_Y)

In [291]:
print(total_X.shape)
print(total_Y.shape)

occur3 = np.count_nonzero(total_Y ==3)
occur2 = np.count_nonzero(total_Y ==2)
occur1 = np.count_nonzero(total_Y ==1)
print(occur1)
print(occur2)
print(occur3)

(348, 84, 300)
(348,)
86
136
126


<font color="red">[TODO]</font> Get the initial training data from the given seeds.  

In [292]:
# [TODO]
from random import sample, seed
# seed(10)
indice, first_X, first_Y = [], [], [] # sentence id of selected samples, selected sentences, detected labels
value_list = list(SEEDS.values())
key_list = list(SEEDS.keys())
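# random.sample needs a sequence (Python 3.11+ rejects dict views), so convert the items to a list
for sent_id, sentence in sample(list(pure_data.items()), 402):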

    for i in range(len(value_list)):
        for word in value_list[i]:
            if word in sentence:
                label = key_list[i]
                indice.append(sent_id)
                first_X.append(processed_data[sent_id])
                first_Y.append(label)
                if label != 2:
                    indice.append(sent_id)
                    first_X.append(processed_data[sent_id])
                    first_Y.append(label)
    
                

Examine training data.  
The labels might not be 100% correct, but it should look reasonable.  

In [293]:
for i in range(5):
    print(pure_data[indice[i]])
    print(f' -> {first_Y[i]}: {SENSE[first_Y[i]]}')
    print()

Kosovo has a multi-party system, with numerous parties and the system of proportional representation and guaranteed minority representation means that no one party is likely to have a parliamentary majority.
 -> 2: an organization of people with particular political beliefs

The great losers of the election were Labour Party, People's Party for Freedom and Democracy and Democrats 66, the coalition parties of the 'purple' cabinets.
 -> 2: an organization of people with particular political beliefs

A dot com party (often known as an internet party or more generally, a launch party) is a social and business networking party hosted by an Internet-related business, typically for promotional purposes or to celebrate a corporate event such as a product launch, venture funding round, or corporate acquisition.
 -> 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

A dot com party (often known as an internet party or more generally, a launch party) is a social a

Transform X and Y into numpy array for future use.

In [294]:
first_X = np.array(first_X)
first_Y = np.array(first_Y)
print(first_X.shape)
print(first_Y.shape)

(231, 84, 300)
(231,)


In [295]:
occur3 = np.count_nonzero(first_Y ==3)
occur2 = np.count_nonzero(first_Y ==2)
occur1 = np.count_nonzero(first_Y ==1)
print(occur1)
print(occur2)
print(occur3)

60
91
80


### 2. Encode labels

The labels are currently categorical: `1`, `2`, and `3`. However, it's hard to teach a machine to produce this kind of answer directly.  
Most of the time, a machine learning model outputs a *numeric probability*, like `0.329`, rather than a categorical result.  
That's why we want to encode each label as floating-point values between 0 and 1, so that the model can output the probability of each answer.  

Here we suggest you use the one-hot encoding, which is suitable for categorical classification.  
So the label `2` will look like
```
 Sense 1, Sense 2, Sense 3
[      0,       1,       0]
```

*<small><a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/">Why One-Hot Encode Data in Machine Learning?</a></small>
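
If you prefer plain numpy, one common idiom is to index the rows of an identity matrix. This is only a sketch with made-up labels; the variable names are hypothetical.

```python
import numpy as np

labels = np.array([1, 2, 3, 2])           # hypothetical sense ids, starting from 1
one_hot_labels = np.eye(3)[labels - 1]    # shift to 0-based, then pick identity rows
print(one_hot_labels)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```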

In [None]:
# if you prefer tensorflow
from tensorflow import one_hot
# or if you don't like tensorflow
from sklearn.preprocessing import OneHotEncoder

<font color="red">[TODO]</font> one-hot encode `first_Y`

<small>
*<a href="https://www.tensorflow.org/api_docs/python/tf/one_hot">tf.one_hot</a><br/>
*<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">sklearn.preprocessing.OneHotEncoder</a>
</small>

In [296]:
# [TODO]
# one_hot numbers the classes from 0, so the depth would have to be 4
# one_hot_Y = one_hot(first_Y, 4)
onehot_dict = {
    1: np.array([1, 0, 0]),
    2: np.array([0, 1, 0]),
    3: np.array([0, 0, 1])
}

onehot_Y = np.zeros((len(first_Y), 3))
for i in range(len(first_Y)):
    onehot_Y[i] = onehot_dict[first_Y[i]]

first_Y = onehot_Y

In [297]:
first_Y.shape


(231, 3)

In [244]:
first_Y[:5]

array([[1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

### 3. Prepare training and validation set

Split the dataset into a training set and a validation set.  
We split because you don't want the model to see, while it is still learning, the data you will use to test it.

Machines are very smart; sometimes they just *memorize* the answers rather than *learn* them. Even if the model yields a perfect accuracy on the data it was trained on, it might still fail miserably when facing the cruel, real world. *(heh)*  
That's why we need a validation set. We reserve a portion of the data that the model never learns from, and use it to validate whether the model has really learned something.

<small>*<a href="https://tarangshah.com/blog/2017-12-03/train-validation-and-test-sets/">Train, Validation and Test Sets</a></small>
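
If you'd rather not use sklearn, a shuffled split can be sketched with numpy alone. `shuffle_split` and its `seed` argument are hypothetical names for this example; the important part is that X and Y are indexed with the same permutation so sentences stay aligned with their labels.

```python
import numpy as np

def shuffle_split(X, Y, val_ratio=0.2, seed=0):
    """Shuffle X and Y together, then hold out `val_ratio` of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))       # one shared random order for X and Y
    n_val = int(len(X) * val_ratio)
    train_idx, val_idx = order[n_val:], order[:n_val]
    return X[train_idx], X[val_idx], Y[train_idx], Y[val_idx]

X_demo = np.arange(20).reshape(10, 2)     # toy data: row i is [2i, 2i+1]
Y_demo = np.arange(10)                    # toy labels: row i has label i
X_tr, X_v, Y_tr, Y_v = shuffle_split(X_demo, Y_demo)
print(X_tr.shape, X_v.shape)              # (8, 2) (2, 2)
```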

In [28]:
# if you prefer sklearn
from sklearn.model_selection import train_test_split
# or if you don't like sklearn. **Remember to shuffle your data before splitting.**
import numpy as np

In [None]:
print(len(first_X))
print(len(first_Y))

In [298]:
X_train, X_val, Y_train, Y_val = train_test_split(
    first_X, first_Y,
    test_size = 0.2,   # [TODO] How much data you want to use as the validation set
    shuffle = True
)

In [271]:
print(X_train.shape, X_val.shape, Y_train.shape, Y_val.shape)

(125, 84, 300) (32, 84, 300) (125, 3) (32, 3)


### 4. Build your multi-class classifier 

Now the data is all prepared.  
Let's build a model to learn from it!  

Note that, unlike last week, your output dimension should be the number of categories, rather than `2`.  

\-

<small>
*Although tensorflow is used below, you can always change it to any other framework you are familiar with. <br/>
*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers">tf.keras.layers</a>
</small>

In [33]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Flatten, BatchNormalization, Embedding#, and all the other layers you may use

In [272]:
_, PADDING_WIDTH, EMBEDDING_DIM = X_train.shape
OUTPUT_CATEGORY = len(SENSE)

print(PADDING_WIDTH, EMBEDDING_DIM, OUTPUT_CATEGORY)

84 300 3


In [130]:
print(X_train.shape)
print(Y_train.shape)

(160, 84, 300)
(160, 3)


<font color="red">[TODO]</font> Build a classifier

In [305]:
model_1 = Sequential()

# [TODO]
# model_1.add(Embedding(vocab_size, 300, input_length = PADDING_WIDTH, embeddings_initializer=keras.initializers.constant(embedding_matrix)))
# model_1.add(LSTM(128, return_sequences=True))
# model_1.add(Dropout(0.3))
model_1.add(LSTM(32, return_sequences=True))
# model_1.add(Dropout(0.41))
model_1.add(Flatten())

# model_1.add(Dropout(0.3))
model_1.add(BatchNormalization())
model_1.add(Dense(256, activation='relu'))
model_1.add(Dropout(0.1))
model_1.add(Dense(3, activation='softmax'))

model_1.build(input_shape=X_train.shape)
print(model_1.summary())

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_18 (LSTM)               (184, 84, 32)             42624     
_________________________________________________________________
flatten_16 (Flatten)         (184, 2688)               0         
_________________________________________________________________
batch_normalization_16 (Batc (184, 2688)               10752     
_________________________________________________________________
dense_30 (Dense)             (184, 256)                688384    
_________________________________________________________________
dropout_30 (Dropout)         (184, 256)                0         
_________________________________________________________________
dense_31 (Dense)             (184, 3)                  771       
Total params: 742,531
Trainable params: 737,155
Non-trainable params: 5,376
___________________________________________

Time to choose the optimizer and loss function.  

A loss function is an equation that measures how wrong your model's answers are (the lower the better), while the optimizer tells the model how to improve itself.  
But seriously, we are not asking you to fine-tune these parameters. That is for a Machine Learning class, not an NLP class, so if you are not able to pass the baseline, go check your preprocessing first. Something might have gone wrong there.  

\-

<small>
*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile">tf.keras.model#compile</a> <br/>
*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/optimizers">tf.keras.optimizers</a> <br/>
*<a href="https://www.tensorflow.org/api_docs/python/tf/keras/losses">tf.keras.losses</a>
</small>

<font color="red">[TODO]</font> Compile your model

In [306]:
# [TODO]
model_1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


### 5. Train 

Time to train your model!  

You should always prevent the model from overfitting, so take validation accuracy into consideration and choose your epoch number wisely.  

<small>*<a href="https://www.ibm.com/cloud/learn/overfitting">What is Overfitting?</a></small>
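
One simple way to pick the epoch number is to look at where the validation accuracy peaks. The sketch below uses made-up numbers; with Keras, the list would typically come from `history.history['val_accuracy']` once `fit` has run (the exact key name depends on your metric and Keras version, so treat it as an assumption). `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_accuracy):
    """Return the 1-based epoch with the highest validation accuracy."""
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1

# Hypothetical validation accuracies, one per epoch
val_accuracy = [0.52, 0.61, 0.70, 0.74, 0.73, 0.71, 0.69]
print(best_epoch(val_accuracy))   # 4
```

If the validation accuracy keeps dropping after the peak while the training accuracy keeps rising, the model is most likely overfitting.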

<font color="red">[TODO]</font> Train and tune your model

In [307]:
history = model_1.fit(
    X_train, Y_train, 
    validation_data=(X_val, Y_val),
    epochs = 7,          # [TODO] how many iterations you want to run
    # initial_epoch = ?    # set this if you're continuing previous training
)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


In [311]:
# example of continued training

history = model_1.fit(
    X_train, Y_train, 
    validation_data=(X_val, Y_val),
    epochs = 20,          # how many iterations you want to run
    initial_epoch = 7     # set this if you're continuing previous training
)

Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [316]:
history = model_1.fit(
    X_train, Y_train, 
    validation_data=(X_val, Y_val),
    epochs = 25,          # how many iterations you want to run
    initial_epoch = 20     # set this if you're continuing previous training
)

Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [320]:
history = model_1.fit(
    X_train, Y_train, 
    validation_data=(X_val, Y_val),
    epochs = 30,          # how many iterations you want to run
    initial_epoch = 25     # set this if you're continuing previous training
)

Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


### 6. Examine your model

Let's see how well your model does.  

In [281]:
testcases = [
    # 1
    'A block party or street party is a party in which many members of a single community congregate, either to observe an event of some importance or simply for mutual enjoyment.',
    'A party is a social gathering.',
    # 2
    'Ukraine has a multi-party system, with numerous parties in which often not a single party has a chance of gaining power alone, and parties must work with each other to form coalition governments.',
    'Serbia has a multi-party system, with numerous parties in which no one party often has a chance of gaining power alone, and parties must work with each other to form coalition governments.',
    # 3
    'In a civil lawsuit, a nominal party is one named as a party on the record of an action, but having no interest in the action.',
]

In [321]:
# you must specify the padding width here, since the input size of model should always be the same
test_X = process_text(testcases, padding = PADDING_WIDTH)

In [318]:
predictions = model_1.predict(test_X)

In [286]:
predictions

array([[0.33769137, 0.40841702, 0.25389162],
       [0.3401947 , 0.4371507 , 0.22265463],
       [0.05900172, 0.8824855 , 0.05851285],
       [0.05026186, 0.89369535, 0.05604285],
       [0.22716464, 0.35813114, 0.41470426]], dtype=float32)

#### What does the result mean?

As you can see, a list of floats is generated for each sentence, and since we used one-hot encoding when preparing the training data, each number is the score of the corresponding category.  
```
 Sense 1, Sense 2, Sense 3
[   0.89,    0.04,    0.07]
```
You can treat these values as the probability of each column, that is, of each category (the softmax output sums to 1). Hence, the predicted label is the one with the highest probability, which is Sense 1 for this sample.  

Now let's get all the predicted labels from these probabilities.  

In [319]:
for idx, result in enumerate(predictions):
    predict_id = result.argmax() # select the index of the maximum value
    sense_id = predict_id + 1    # sense_id starts from 1
    print(testcases[idx])
    print(f'-> Sense {sense_id} (prob={result[predict_id]:.2f}): {SENSE[sense_id]}')
    print()

A block party or street party is a party in which many members of a single community congregate, either to observe an event of some importance or simply for mutual enjoyment.
-> Sense 3 (prob=0.37): a single entity which can be identified as one for the purposes of the law

A party is a social gathering.
-> Sense 1 (prob=0.55): a social event at which a group of people meet to talk, eat, drink, dance, etc.

Ukraine has a multi-party system, with numerous parties in which often not a single party has a chance of gaining power alone, and parties must work with each other to form coalition governments.
-> Sense 2 (prob=0.87): an organization of people with particular political beliefs

Serbia has a multi-party system, with numerous parties in which no one party often has a chance of gaining power alone, and parties must work with each other to form coalition governments.
-> Sense 2 (prob=0.87): an organization of people with particular political beliefs

In a civil lawsuit, a nominal part

Again, the label might not be 100% correct, but it should look reasonable somehow.  

## III. Second stage

The previous model might not be enough for real-world use; another model with better ability is needed.  

<small>*Most of this section is the same as the previous one, so you can reuse your code from above.</small>

### 1. Prepare the training data 

The model from the previous section is weak, yet it still has learned some valuable knowledge.  
Let's ask that model to label more training data for us!

In [407]:
# Get the probabilities for the whole dataset
predictions = model_1.predict(np.array(list(processed_data.values())))

In [358]:
predictions[0]

array([0.48507053, 0.26981223, 0.24511726], dtype=float32)


<font color="red">[TODO]</font> Get the labels of all data, and keep only those labels with high probabilities.
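
Filtering by confidence can be sketched with numpy as below. The probabilities are made up, and `select_confident` is a hypothetical helper, not something the assignment requires.

```python
import numpy as np

def select_confident(probabilities, threshold):
    """Return (row indices, sense ids) for rows whose top probability >= threshold."""
    probabilities = np.asarray(probabilities)
    confident = probabilities.max(axis=1) >= threshold
    indices = np.flatnonzero(confident)
    sense_ids = probabilities[indices].argmax(axis=1) + 1   # sense ids start from 1
    return indices, sense_ids

# Hypothetical model outputs for four sentences
probs = [[0.49, 0.27, 0.24],
         [0.06, 0.88, 0.06],
         [0.23, 0.36, 0.41],
         [0.05, 0.05, 0.90]]
indices, sense_ids = select_confident(probs, threshold=0.8)
print(indices, sense_ids)   # [1 3] [2 3]
```

A higher threshold gives cleaner labels but fewer samples, so it is a trade-off you will have to tune.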

In [408]:
THRESHOLD = 0.  # you may want to change this :)
from random import sample, seed
indice, second_X, second_Y = [], [], [] # sentence id of selected samples, selected sentences, detected labels

list_zip = list(zip(processed_data, predictions))
list_zip_sample = sample(list_zip, 402)
# list_zip_sample = list_zip

for sent_id, result in list_zip_sample:
    # skip predictions the first-stage model is not confident about
    if result.max() < THRESHOLD:
        continue

    label = np.argmax(result) + 1   # sense ids start from 1
    # sense 3 is added twice to oversample it
    repeat = 2 if label == 3 else 1
    for _ in range(repeat):
        indice.append(sent_id)
        second_X.append(processed_data[sent_id])
        second_Y.append(label)
    

Observe the selected data size and the quality of labels.  
You might want to go back and modify your preprocessing, first model, or the threshold until you get better training data.

In [360]:
for i in range(5):
    print(pure_data[indice[i]])
    print(f' -> {second_Y[i]}: {SENSE[second_Y[i]]}')
    print()

A naked party, also known as nude party, is a party where the participants are required to be nude.
 -> 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

The town center bears the hallmarks of a typical migration-accepting Turkish rural town, with traditional structures coexisting with a collection of concrete apartment blocks providing public housing, as well as amenities such as basic shopping and fast-food restaurants, and essential infrastructure but little in the way of culture except for cinemas and large rooms hired out for wedding parties.
 -> 1: a social event at which a group of people meet to talk, eat, drink, dance, etc.

Elections Alberta oversees the creation of political parties and riding associations, compiles election statistics on ridings, and collects financial statements from party candidates and riding associations.
 -> 2: an organization of people with particular political beliefs

A group of characters can join together to form 

In [409]:
second_X = np.array(second_X)
second_Y = np.array(second_Y)
second_X.shape

(479, 84, 300)

In [410]:
print(second_X.shape)
print(second_Y.shape)

(479, 84, 300)
(479,)


In [411]:
occur3 = np.count_nonzero(second_Y ==3)
occur2 = np.count_nonzero(second_Y ==2)
occur1 = np.count_nonzero(second_Y ==1)
print(occur1)
print(occur2)
print(occur3)

124
201
154


### 2. Encode labels 

<font color="red">[TODO]</font> one-hot encode `second_Y`

In [412]:
# [TODO]
# one_hot numbers the classes from 0, so the depth would have to be 4
# one_hot_Y = one_hot(first_Y, 4)

onehot2_Y = np.zeros((len(second_Y), 3))
for i in range(len(second_Y)):
    onehot2_Y[i] = onehot_dict[second_Y[i]]

second_Y = onehot2_Y

In [None]:
second_Y[:3]

In [369]:
print(second_X.shape)
print(second_Y.shape)

(762, 84, 300)
(762, 3)


### 3. Prepare training and validation set

In [413]:
X_train, X_val, Y_train, Y_val = train_test_split(
    second_X, second_Y,
    test_size = 0.2,    # [TODO] How much data you want to use as the validation set
    shuffle = True
)

In [414]:
print(X_train.shape)
print(Y_train.shape)

(383, 84, 300)
(383, 3)


### 4. Build model

In [61]:
# the number comes from previous setting
print(PADDING_WIDTH, EMBEDDING_DIM, OUTPUT_CATEGORY)

84 300 3


<font color="red">[TODO]</font> Build your second model

<small>*This model can be different from the previous one.</small>

In [415]:
model_2 = Sequential()

# [TODO]

# model_2.add(LSTM(128, return_sequences=True))
# model_2.add(Dropout(0.52))
model_2.add(LSTM(64, return_sequences=True))
model_2.add(Dropout(0.41))
model_2.add(Flatten())

model_2.add(Dropout(0.3))
model_2.add(BatchNormalization())
model_2.add(Dense(256, activation='relu'))
model_2.add(Dropout(0.1))
model_2.add(Dense(3, activation='softmax'))

model_2.build(input_shape=X_train.shape)

print(model_2.summary())

Model: "sequential_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_26 (LSTM)               (383, 84, 64)             93440     
_________________________________________________________________
dropout_46 (Dropout)         (383, 84, 64)             0         
_________________________________________________________________
flatten_21 (Flatten)         (383, 5376)               0         
_________________________________________________________________
dropout_47 (Dropout)         (383, 5376)               0         
_________________________________________________________________
batch_normalization_21 (Batc (383, 5376)               21504     
_________________________________________________________________
dense_40 (Dense)             (383, 256)                1376512   
_________________________________________________________________
dropout_48 (Dropout)         (383, 256)              
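
The parameter counts in the summary can be checked by hand. The sketch below applies the usual formulas for LSTM, batch normalization, and dense layers to the sizes printed above (84 time steps, 300-dimensional embeddings, 64 LSTM units, 256 dense units); the constant names are chosen for this example only.

```python
TIMESTEPS, EMBEDDING_DIM = 84, 300
LSTM_UNITS, DENSE_UNITS = 64, 256

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((EMBEDDING_DIM + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# Flatten concatenates the LSTM output of every time step
flat_dim = TIMESTEPS * LSTM_UNITS

# BatchNormalization: gamma, beta, moving mean, moving variance per feature
bn_params = 4 * flat_dim

# Dense: one weight per (input, unit) pair plus one bias per unit
dense_params = flat_dim * DENSE_UNITS + DENSE_UNITS

print(lstm_params, flat_dim, bn_params, dense_params)
# 93440 5376 21504 1376512
```

Most of the parameters sit in the dense layer that follows `Flatten`, which is worth keeping in mind when training on only a few hundred sentences.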

<font color="red">[TODO]</font> Compile your model

In [416]:
# [TODO]
model_2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

### 5. Train model

<font color="red">[TODO]</font> Train it!

In [417]:
history = model_2.fit(
    X_train, Y_train, 
    validation_data=(X_val, Y_val),
    epochs = 10        # [TODO] how many iterations you want to run
    # initial_epoch = ?  # set this if you're continuing previous training
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [418]:
history = model_2.fit(
    X_train, Y_train, 
    validation_data=(X_val, Y_val),
    epochs = 15,        # [TODO] how many iterations you want to run
    initial_epoch = 10  # set this if you're continuing previous training
)

Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [419]:
history = model_2.fit(
    X_train, Y_train, 
    validation_data=(X_val, Y_val),
    epochs = 18,        # [TODO] how many iterations you want to run
    initial_epoch = 15  # set this if you're continuing previous training
)

Epoch 16/18
Epoch 17/18
Epoch 18/18
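
In Keras, when `initial_epoch` is set, `epochs` is the index of the final epoch rather than the number of additional epochs, so each call above runs `epochs - initial_epoch` epochs. A small sketch of that bookkeeping follows; the helper `epochs_run` is hypothetical and not part of Keras.

```python
def epochs_run(initial_epoch: int, epochs: int) -> list[int]:
    """Return the 1-based epoch numbers that a fit() call would display."""
    return list(range(initial_epoch + 1, epochs + 1))

# The three fit() calls above, as (initial_epoch, epochs)
calls = [(0, 10), (10, 15), (15, 18)]
schedule = [epochs_run(start, end) for start, end in calls]

for (start, end), shown in zip(calls, schedule):
    print(f'initial_epoch={start}, epochs={end} -> '
          f'{len(shown)} epochs: {shown[0]}..{shown[-1]}')

total = sum(len(shown) for shown in schedule)
print(total)  # 18
```

Each call also reassigns `history`, so only the metrics of the last call remain unless the earlier ones are saved separately.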


### 6. Examine the result

In [420]:
testcases = [
    # 1
    'Green Beer Day (GBD) is a day-long party, where celebrants drink beer dyed green with artificial coloring or natural processes.',
    'When the siblings grew up, they held parties and introduced the tradition to friends while in college, and the tradition began to spread.',
    # 2
    'Politicians from the two main parties tend to win elections when not confronted by strong challengers from their own party (in which cases their traditional opponents tend to win).',
    'After the general election on 22 March 1992, five parties (Rassadorn, Justice Unity, Social Action, Thai Citizen, Chart Thai) designated Suchinda as the prime minister.',
    # 3
    'Typically, a party has the right to object in court to a line of questioning or at the introduction of a particular piece of evidence.',
    'In the practice of law, judicial estoppel (also known as estoppel by inconsistent positions) is an estoppel that precludes a party from taking a position in a case that is contrary to a position it has taken in earlier legal proceedings.'
]

In [434]:
# you must specify the padding width! 
test_X = process_text(testcases, padding = PADDING_WIDTH)

In [435]:
predictions = model_2.predict(test_X)

In [436]:
print(predictions)

[[0.92654544 0.06100381 0.01245083]
 [0.12303197 0.8424626  0.03450544]
 [0.03212543 0.89569116 0.07218342]
 [0.14029951 0.8258109  0.03388954]
 [0.09980756 0.60266584 0.29752657]
 [0.07414062 0.0920529  0.83380646]]


In [423]:
for idx, result in enumerate(predictions):
    predict_id = result.argmax()
    sense_id = predict_id + 1    # sense_id starts from 1
    print(testcases[idx])
    print(f'-> Sense {sense_id} (prob={result[predict_id]:.2f}): {SENSE[sense_id]}')
    print()

Green Beer Day (GBD) is a day-long party, where celebrants drink beer dyed green with artificial coloring or natural processes.
-> Sense 1 (prob=0.93): a social event at which a group of people meet to talk, eat, drink, dance, etc.

When the siblings grew up, they held parties and introduced the tradition to friends while in college, and the tradition began to spread.
-> Sense 2 (prob=0.84): an organization of people with particular political beliefs

Politicians from the two main parties tend to win elections when not confronted by strong challengers from their own party (in which cases their traditional opponents tend to win).
-> Sense 2 (prob=0.90): an organization of people with particular political beliefs

After the general election on 22 March 1992, five parties (Rassadorn, Justice Unity, Social Action, Thai Citizen, Chart Thai) designated Suchinda as the prime minister.
-> Sense 2 (prob=0.83): an organization of people with particular political beliefs

Typically, a party has t

Yet again, the labels might not be 100% correct, but they should still look reasonable.
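
One way to judge whether the labels look reasonable is to inspect the winning probability alongside each label. The sketch below reuses the probabilities printed above (rounded) and the senses noted in the `testcases` comments; the `0.7` threshold is an arbitrary assumption for illustration, not part of the assignment.

```python
# Probabilities copied from the predictions printed above (rounded)
predictions = [
    [0.927, 0.061, 0.012],
    [0.123, 0.842, 0.035],
    [0.032, 0.896, 0.072],
    [0.140, 0.826, 0.034],
    [0.100, 0.603, 0.298],
    [0.074, 0.092, 0.834],
]
expected = [1, 1, 2, 2, 3, 3]   # senses given in the testcases comments
THRESHOLD = 0.7                 # assumed cutoff for a "confident" prediction

labels = []
uncertain = []
for idx, probs in enumerate(predictions):
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    labels.append(best + 1)                               # sense IDs start at 1
    if probs[best] < THRESHOLD:
        uncertain.append(idx)

correct = sum(p == e for p, e in zip(labels, expected))
print(labels)     # [1, 2, 2, 2, 2, 3]
print(uncertain)  # [4]
print(correct)    # 4
```

Here the second and fifth sentences are mislabeled as sense 2, and only the fifth one falls below the threshold, so a high probability alone does not guarantee a correct label.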

## IV. Evaluation

We have our model built! It's time to see how good it is on the testing dataset.  
Get the predictions from the final model and examine the results.  

In [437]:
with open(os.path.join('data', 'party.test.txt'), 'r', encoding='utf8') as f:
    data = f.read().strip().split('\n')

# this dict maps sentence_id to the sentence itself
test_data = { sent_id: text for sent_id, text in [line.split('\t', 1) 
                                                 for line in data] }

In [433]:
for idx, (sent_id, sentence) in enumerate(test_data.items()):
    if idx > 3: break
        
    print(f'{sent_id}: {sentence}')

1638: Patent ambiguity is that ambiguity which is apparent on the face of an instrument to any one perusing it, even if unacquainted with the circumstances of the parties.
1639: Smith played at parties, juke joints, and fish fries.
1640: Turkey has a multi-party system, with two or three strong parties and often a fourth party that is electorally successful.
1641: The Christian Liberation Movement ( or simply MCL) is a Cuban dissident party advocating political change in Cuba.
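
The dictionary comprehension above relies on `split('\t', 1)` cutting each line at the first tab only, so any later tabs stay inside the sentence. A minimal sketch with two sentences from the output above; the tab-separated layout of `party.test.txt` is inferred from the parsing code, and `raw` is a hypothetical stand-in for the file content.

```python
# Two sample lines in the "<sent_id>\t<sentence>" layout assumed by the parsing code
raw = (
    '1639\tSmith played at parties, juke joints, and fish fries.\n'
    '1641\tThe Christian Liberation Movement ( or simply MCL) is a Cuban '
    'dissident party advocating political change in Cuba.\n'
)

lines = raw.strip().split('\n')

# maxsplit=1 splits at the first tab only; the IDs stay strings
sample = dict(line.split('\t', 1) for line in lines)

print(list(sample))               # ['1639', '1641']
print(type(next(iter(sample))))   # <class 'str'>
```

Because the IDs are read as strings, they can be used directly as JSON keys later on.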


<font color="red">[TODO]</font> Get the labels of testing data.  

Try to preserve the sentence ID, because you will need it when requesting your accuracy.  
Recommended format of `final_predictions` : 
```
{ sent_id: sense_id }
```

In [438]:
final_predictions = {}

# [TODO]
test_final_X = process_text(test_data.values(), padding = PADDING_WIDTH)
predictions = model_2.predict(test_final_X)

for sent_id, result in zip(test_data, predictions):
    label = np.argmax(result)+1
    final_predictions[sent_id] = int(label)


In [440]:
for idx, (sent_id, pred) in enumerate(final_predictions.items()):
    if idx > 5: break
        
    print(test_data[sent_id])
    print(f'-> Sense {pred}: {SENSE[pred]}')
    print()

Patent ambiguity is that ambiguity which is apparent on the face of an instrument to any one perusing it, even if unacquainted with the circumstances of the parties.
-> Sense 3: a single entity which can be identified as one for the purposes of the law

Smith played at parties, juke joints, and fish fries.
-> Sense 2: an organization of people with particular political beliefs

Turkey has a multi-party system, with two or three strong parties and often a fourth party that is electorally successful.
-> Sense 2: an organization of people with particular political beliefs

The Christian Liberation Movement ( or simply MCL) is a Cuban dissident party advocating political change in Cuba.
-> Sense 2: an organization of people with particular political beliefs

Greens Party () was a green liberal party in Turkey.
-> Sense 2: an organization of people with particular political beliefs

Under the Constitution of North Korea, all citizens 17 and older, regardless of party affiliation, political 

### Get your accuracy

Send your predictions in JSON format to our server, and we will calculate the accuracy for you.  
The format should be 
```
{ sentence_id: sense_id }
```
For example:
```
{
    1001: 1,
    1002: 1,
    ...
}
```
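
JSON object keys are always strings, so the integer keys in the example are illustrative. A minimal sketch of the round trip, using made-up IDs:

```python
import json

# Made-up predictions; integer keys are converted to strings by json.dumps
final_predictions = {1001: 1, 1002: 3}

payload = json.dumps(final_predictions)
print(payload)   # {"1001": 1, "1002": 3}

# Decoding gives string keys back, while the sense IDs stay integers
decoded = json.loads(payload)
print(decoded)   # {'1001': 1, '1002': 3}
```

The values must be plain Python `int`s: NumPy integers raise a `TypeError` in `json.dumps`, which is why the prediction loop casts with `int(label)`.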

In [183]:
import json
import requests

In [441]:
data = json.dumps(final_predictions)
ret = requests.post('http://jedi.nlplab.cc:4500/check', 
                    json = { 'data': data }
                   )

In [442]:
if not ret.ok:
    print('Something wrong :o')
print(ret.json())

{'accuracy': 0.7285714285714285, 'comment': ['Well done!']}
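
The gold labels for the test set stay on the server, but the same kind of score can be estimated locally on data with known labels, such as the validation split. A minimal sketch of plain accuracy follows; `gold` and `pred` are hypothetical dictionaries in the `{ sent_id: sense_id }` format, and the server's exact scoring may differ.

```python
def accuracy(gold: dict, pred: dict) -> float:
    """Fraction of gold sentences predicted correctly; missing IDs count as wrong."""
    if not gold:
        return 0.0
    hits = sum(1 for sent_id, sense in gold.items() if pred.get(sent_id) == sense)
    return hits / len(gold)

# Hypothetical labels for illustration only
gold = {'1': 1, '2': 2, '3': 3, '4': 2}
pred = {'1': 1, '2': 2, '3': 2, '4': 2}

score = accuracy(gold, pred)
print(score)  # 0.75
```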


**REQUIREMENT**  
**Your accuracy should be <u>higher than 0.70</u> to get the full points.**

But do note that your assignment is mostly scored on your implementation, not just on the accuracy.  
So even if you brute-force our server and get 100% accuracy, you still won't get the points if your code doesn't make sense to the TA.

## TA's note

Congratulations! You've finished the assignment this week.  
Don't forget to <b>[make an appointment with the TA](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=1902646609) to demo/explain your implementation <u>before <font color="red">11/18 15:30</font></u></b>.  
Also make sure you submit your {student_id}.ipynb to [eeclass](https://eeclass.nthu.edu.tw/course/homework/4615).

Please note that <font color="red">we will announce our final project on 11/18</font>. Again, **we strongly suggest you join and listen** .  
We will have 2 Ph.D. students introduce the selected topics in class and give you some guidelines about how to approach your project.  
Also, we will have a team-matching session at the end of the class, in which you may want to participate to find teammates.