# Session 9: Assignment

## Emotion classification problem for Shopee comments

**Install `fasttext` for Pretrained Word Embedding**

In [None]:
!pip install "git+https://github.com/facebookresearch/fastText.git"

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ML-intensive/data/sentiment_data.csv")
df.head()

Print a few compliments and a few critical comments

In [None]:
print("Critical- LABEL = 1")
for text in df[df["label"] == 1]["text"].values[:10]:
  print(text)
print()
print("="*30)
print()
print("Compliment - LABEL = 0")
for text in df[df["label"] == 0]["text"].values[:10]:
  print(text)

The comments above have already been word-segmented: the syllables of a multi-syllable word are joined with `_`.

However, the dataset has not been cleaned yet (emoji, special characters, ...).

We will remove special characters such as `: , = ...`, but we must keep the `_` character, because deleting it would undo the word segmentation.

In [None]:
import re

def simple_preprocessing(text):
    # Remove emojis
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # emoticons
                               "\U0001F300-\U0001F5FF"  # symbols & pictographs
                               "\U0001F680-\U0001F6FF"  # transport & map symbols
                               "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "\U00002702-\U000027B0"  # Dingbats
                               "\U000024C2-\U0001F251"  # Enclosed characters
                               "]+", flags=re.UNICODE)

    text = emoji_pattern.sub(r'', text)
    # Remove special characters but keep the underscore (_), which \w already matches
    text = re.sub(r'[^\w\s_]', '', text)

    '''
    Deleting special characters can leave extra spaces,
    for example "huhu : ( (  buồn quá" becomes "huhu     buồn quá".
    We split the text on whitespace and join it again to fix this.
    '''

    text = " ".join(text.split())
    text = text.strip().lower()
    return text

df["text"] = df["text"].apply(simple_preprocessing)
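As a quick sanity check of the cleaning rule, the sketch below applies the same character filter and whitespace collapse to a made-up comment. The helper name `strip_special` and the sample string are illustrative only and do not come from the dataset. It shows that `_` survives the filter because `\w` already matches it.

```python
import re

def strip_special(text):
    # Same character filter as simple_preprocessing: keep word characters
    # (letters, digits, underscore) and whitespace, drop everything else
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse the runs of spaces left behind by the removed characters
    return " ".join(text.split()).lower()

# Made-up sample comment, for illustration only
sample = "Giao_hang nhanh :) !!! Shop = OK"
print(strip_special(sample))
```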

In [None]:
# Review results after preprocessing
print("Critical- LABEL = 1")
for text in df[df["label"] == 1]["text"].values[:10]:
  print(text)
print()
print("="*30)
print()
print("Compliment - LABEL = 0")
for text in df[df["label"] == 0]["text"].values[:10]:
  print(text)

### Train Test Split

We will split the dataset into 3 sets, in a 60% / 20% / 20% ratio
- Train
- Validation
- Test

Note: the Tokenizer and the Embedding may only be fitted on the Train set. As a result, the following cases can occur when the model is evaluated on the Validation and Test sets:
- The Validation and Test sets contain words that never appear in the Train set
- The Validation and Test sets contain documents that are much longer or much shorter than those in the Train set

These differences lower the model's performance, which is why you need a high-quality dataset for the model to work well in a real application
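To make the first case concrete, here is a minimal sketch that measures how many word occurrences in a held-out set never appear in the training set. The function name `oov_rate` and the toy sentences are assumptions for illustration and are not part of the assignment.

```python
def oov_rate(train_texts, other_texts):
    """Fraction of word occurrences in other_texts that never appear in train_texts."""
    train_vocab = set()
    for text in train_texts:
        train_vocab.update(text.split())

    total = 0
    unseen = 0
    for text in other_texts:
        for word in text.split():
            total += 1
            if word not in train_vocab:
                unseen += 1
    # Avoid dividing by zero when other_texts contains no words
    return unseen / total if total else 0.0

# Toy data, made up for illustration
toy_train = ["hang dep", "giao_hang nhanh", "hang xau"]
toy_val = ["hang dep qua", "shop te"]
print(oov_rate(toy_train, toy_val))
```

On the real data, the same function could be called as `oov_rate(x_train, x_val)` once the split below has been made.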


In [None]:
from sklearn.model_selection import train_test_split

sentences, labels = df["text"].values, df["label"].values
x_train, x_val, y_train, y_val = train_test_split(
    sentences,
    labels,
    test_size=0.4,
    shuffle=True,
    random_state=42,
    stratify=labels
)

x_val, x_test, y_val, y_test = train_test_split(
    x_val,
    y_val,
    test_size=0.5,
    shuffle=True,
    random_state=42,
    stratify=y_val
)

print("Train Set")
print(x_train.shape, y_train.shape)
print("Validation Set")
print(x_val.shape, y_val.shape)
print("Test Set")
print(x_test.shape, y_test.shape)

### Tokenizer

In this Assignment, we will use the `TextVectorization` layer of `tensorflow.keras` to make tokenization part of the model (in the Lab article, we tokenized the text first and then fed it into the model)

Before initializing the Tokenizer, we need to compute the number of unique words and the lengths of the documents in the Train set.

**Count unique words**

In [None]:
word_set = set()
for text in x_train:
  # set.update ignores duplicates, so no membership check is needed
  word_set.update(text.split())

VOCAB_SIZE = len(word_set)
print(VOCAB_SIZE)

**Compute the minimum, maximum, and average length of each text**

In [None]:
count_word = []
for text in x_train:
  words = text.split()
  count_word.append(len(words))

min(count_word), max(count_word), sum(count_word)/len(count_word)

There is a problem: the `min` value is zero, which means some texts are empty after preprocessing.

In [None]:
for text in x_train:
  if text == "":
    print(True)

Delete the empty texts from `x_train` and the corresponding labels from `y_train`

In [None]:
new_train_text = []
new_train_label = []

for text, label in zip(x_train, y_train):
  if text != "":
    new_train_text.append(text)
    new_train_label.append(label)

x_train = np.array(new_train_text)
y_train = np.array(new_train_label)

In [None]:
count_word = []
for text in x_train:
  words = text.split()
  count_word.append(len(words))

min(count_word), max(count_word), sum(count_word)/ len(count_word)

Do the same with test and validation sets

In [None]:
new_test_text = []
new_test_label = []

for text, label in zip(x_test, y_test):
  if text != "":
    new_test_text.append(text)
    new_test_label.append(label)

x_test = np.array(new_test_text)
y_test = np.array(new_test_label)

In [None]:
new_val_text = []
new_val_label = []

for text, label in zip(x_val, y_val):
  if text != "":
    new_val_text.append(text)
    new_val_label.append(label)

x_val = np.array(new_val_text)
y_val = np.array(new_val_label)
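The three filtering cells above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do it with a NumPy boolean mask. The name `drop_empty` and the toy arrays are hypothetical.

```python
import numpy as np

def drop_empty(texts, labels):
    """Return texts and labels without the rows whose text is the empty string."""
    texts = np.asarray(texts)
    labels = np.asarray(labels)
    keep = texts != ""  # boolean mask, True where the text is non-empty
    return texts[keep], labels[keep]

# Toy data, made up for illustration
toy_x = np.array(["hang dep", "", "giao cham"], dtype=object)
toy_y = np.array([0, 1, 1])
toy_x, toy_y = drop_empty(toy_x, toy_y)
print(toy_x.shape, toy_y.shape)
```

With the notebook's variables this would read `x_train, y_train = drop_empty(x_train, y_train)`, and likewise for the validation and test sets.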

In fact, the `TextVectorization` layer computes the vocabulary itself when `adapt` is called, and reserves 2 extra entries: one for padding and one for out-of-vocabulary (`[UNK]`) words

In [None]:
from tensorflow.keras.layers import TextVectorization

# We counted 9930 unique words; use 10k here and let the layer report the actual size
VOCAB_SIZE = 10000
MAX_LENGTH = 50 # average length is 17

tokenizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize=None, # preprocessing already done above
    output_mode="int", # output a list of token ids (integers representing words in the vocabulary)
    output_sequence_length=MAX_LENGTH # pad or truncate to MAX_LENGTH
)

# Fit on x_train
tokenizer.adapt(x_train)
print(tokenizer.vocabulary_size())
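To see where the 2 extra vocabulary entries come from, here is a pure-Python sketch of what an integer-mode vectorizer does conceptually. It is an illustration of the idea, not the Keras implementation: the function names, the toy sentences, and the tie order among equally frequent words are all assumptions of this sketch.

```python
from collections import Counter

def build_vocab(texts, max_tokens=None):
    """Index 0 is padding (''), index 1 is the out-of-vocabulary token '[UNK]'."""
    counts = Counter(word for text in texts for word in text.split())
    # Most frequent words first; the tie order is an arbitrary choice of this sketch
    words = [word for word, _ in counts.most_common()]
    vocab = ["", "[UNK]"] + words
    return vocab[:max_tokens] if max_tokens else vocab

def vectorize(text, vocab, max_length):
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()]  # unknown words map to 1
    ids = ids[:max_length]                               # truncate long texts
    return ids + [0] * (max_length - len(ids))           # pad short texts with 0

# Toy data, made up for illustration
toy_vocab = build_vocab(["hang dep", "hang xau"])
print(len(toy_vocab))  # 3 unique words + 2 reserved entries
print(vectorize("hang rat dep", toy_vocab, max_length=5))
```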

Update the variable `VOCAB_SIZE`

In [None]:
VOCAB_SIZE = tokenizer.vocabulary_size() # 9932

### Pretrained Word Embedding

In [None]:
import fasttext

ft = fasttext.load_model("/content/drive/MyDrive/Colab Notebooks/ML-intensive/data/cc.vi.50.bin")

This pretrained Word Embedding represents each word as a 50-dimensional vector

In [None]:
text = "đẹp_trai"
embedding = ft[text]
print(embedding)
print(embedding.shape)

#### TODO 1

We will implement the algorithm described in the Pre-Class article to build an embedding vector for each word in the vocabulary:
- Initialize an empty list `embeddings`
- Loop through each word in the vocabulary
  - Get the list of words to iterate over with `tokenizer.get_vocabulary(include_special_tokens=True)`
  - Use `tqdm` to display a progress bar: `tqdm(tokenizer.get_vocabulary(include_special_tokens=True))`
  - If the word exists in the Pretrained Embedding (use `in ft` to check), add its vector to `embeddings`
  - If it does not exist, randomly initialize a vector of 50 features with `np.random.uniform`, ranging from `-0.05` to `0.05`, and add it to `embeddings`
- Convert `embeddings` into a `numpy array` and print its shape to check.

In [None]:
# YOUR SOLUTION
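As a hint for the structure of the loop, the sketch below follows the steps above on toy data. It is not the reference solution: a plain dictionary stands in for the fastText model `ft`, a short list stands in for `tokenizer.get_vocabulary(include_special_tokens=True)`, and the name `build_embedding_matrix` is hypothetical. The `tqdm` progress bar is left out to keep the sketch dependency-free.

```python
import numpy as np

EMBEDDING_DIM = 50

def build_embedding_matrix(vocabulary, pretrained, dim=EMBEDDING_DIM):
    """pretrained: any object supporting `word in pretrained` and `pretrained[word]`."""
    embeddings = []
    for word in vocabulary:
        if word in pretrained:
            # Word is known: reuse its pretrained vector
            embeddings.append(np.asarray(pretrained[word], dtype="float32"))
        else:
            # Word is missing: small random vector in [-0.05, 0.05)
            vector = np.random.uniform(-0.05, 0.05, size=dim)
            embeddings.append(vector.astype("float32"))
    return np.array(embeddings)

# Stand-ins for the tokenizer vocabulary and the fastText model (assumptions)
toy_vocabulary = ["", "[UNK]", "hang", "dep_trai"]
toy_pretrained = {
    "hang": np.ones(EMBEDDING_DIM),
    "dep_trai": np.zeros(EMBEDDING_DIM),
}

toy_embeddings = build_embedding_matrix(toy_vocabulary, toy_pretrained)
print(toy_embeddings.shape)
```

The number of rows equals the vocabulary size, which is what the `Embedding` layer expects as its first argument.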

### Simple Recurrent Neural Network

In [None]:
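from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Input, SimpleRNN, LSTM, GRU
from tensorflow.keras.initializers import Constant
from tensorflow import string


'''
Pipeline:
- input layer receives documents (shape=1, dtype=str)
- tokenizer
- embedding
- RNN & MLP
'''

# Embedding layer initialized from the `embeddings` matrix built in TODO 1
embedding_layer = Embedding(
    VOCAB_SIZE,
    50,
    embeddings_initializer=Constant(embeddings),
    name="embedding"
)

model = Sequential()

model.add(Input(shape=(1,), dtype=string))
model.add(tokenizer)
model.add(embedding_layer)
model.add(SimpleRNN(32))

model.add(Dense(16, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.summary()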

In [None]:
model.compile(optimizer='adam',
              loss="binary_crossentropy",
              metrics=['accuracy'])

model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=5
)

In [None]:
model.evaluate(x_test, y_test)
y_pred_test = model.predict(x_test) >= 0.5

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(confusion_matrix(y_test, y_pred_test), annot=True, fmt="d")
plt.show()
print(classification_report(y_test, y_pred_test))

#### TODO 2

Now it's your turn to use more complex models:
1. Stacked Bidirectional RNN: usually only 2 to 3 layers should be stacked
2. Replace the RNN in model 1 with LSTM or GRU and train again

**Note: you need to reinitialize the 'Embedding' layer from the 'embeddings' variable when creating a new model (as the current 'Embedding' layer has already been trained)**

There is one technique that helps train the model better when using a pretrained model:
- Freeze the pretrained layer (in this tutorial, the `Embedding` layer), i.e. do not update the weights of this layer during training
- Train the model
- Unfreeze the pretrained layer and continue training with a small `learning_rate`.

We will learn about the above technique later, but you can try it

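```python
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant
from tensorflow.keras.optimizers import Adam

# Initialize embedding
embedding_layer = Embedding(
    VOCAB_SIZE,
    50,
    embeddings_initializer=Constant(embeddings),
    name="embedding"
)

# Freeze weights
embedding_layer.trainable = False

# Create the model
# Start training process
...
# Unfreeze weights
embedding_layer.trainable = True

# Re-compile the model (compile only, do not recreate it) with a smaller learning rate.
# `smaller_lr` is a placeholder: pick a value below the one used in the first phase.
# compile() needs the loss and metrics again, not only the optimizer.
model.compile(
    optimizer=Adam(learning_rate=smaller_lr),
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
model.fit(...)
```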