# ECE 689, Spring 2025
## Homework 3

## Full name:

## Question 1: Transformer for translation

Here, we implement a transformer for neural machine translation (NMT), for example translating "Hello world" to "Salut le monde". Follow these steps:
1. Load and prepare the data. We provide "en-fr.txt". Each line of this file contains an English phrase, the equivalent French phrase, and an attribution identifying where the translation came from. The en-fr.txt used in problem 3 can also be found at: https://github.com/jeffprosise/Applied-Machine-Learning/tree/main/Chapter%2013/Data
2. Build and train a model. Implement a transformer from scratch in PyTorch. We provide an existing implementation in Keras for reference. You might also find https://github.com/gordicaleksa/pytorch-original-transformer useful.

For deliverables, plot your training and validation accuracy. The x-axis should be the epoch and the y-axis should be the translation accuracy.

For reference, the code at https://github.com/jeffprosise/Applied-Machine-Learning/blob/main/Chapter%2013/Neural%20Machine%20Translation%20(Transformer).ipynb achieves 85% accuracy after 14 epochs. You do not have to match this performance to get full marks; just demonstrate understanding and provide functional code.
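
The text does not define "translation accuracy" precisely. One reasonable reading, and it is an assumption, is token-level accuracy computed only over non-padding target positions. The sketch below shows that computation in PyTorch; the function name `masked_token_accuracy` and the choice of 0 as the padding id are illustrative, not part of the provided code.

```python
import torch

PAD_ID = 0  # assumption: index 0 is the padding id


def masked_token_accuracy(logits, labels, pad_id=PAD_ID):
    """Fraction of non-padding target tokens that are predicted correctly.

    logits: (batch, seq_len, vocab_size) raw decoder scores
    labels: (batch, seq_len) integer token ids
    """
    predictions = logits.argmax(dim=-1)
    mask = labels != pad_id                      # True where a real token is expected
    correct = (predictions == labels) & mask
    return correct.sum().item() / max(mask.sum().item(), 1)


# Tiny hand-built example: 1 sentence, 4 positions, vocabulary of 5 tokens.
labels = torch.tensor([[3, 4, 2, 0]])            # last position is padding
logits = torch.zeros(1, 4, 5)
logits[0, 0, 3] = 1.0   # correct
logits[0, 1, 1] = 1.0   # wrong (target is 4)
logits[0, 2, 2] = 1.0   # correct
logits[0, 3, 0] = 1.0   # padding position, ignored
accuracy = masked_token_accuracy(logits, labels)
print(f'Masked accuracy: {accuracy:.3f}')
```

Averaging this value over all batches in an epoch, separately for the training and validation splits, gives the two curves requested in the deliverable.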

In [None]:
"""Clean the text by removing punctuation symbols and numbers, converting
characters to lowercase, and replacing Unicode characters with their ASCII
equivalents. For the French samples, insert [start] and [end] tokens at the
beginning and end of each phrase."""
import pandas as pd
import re
from unicodedata import normalize

df = pd.read_csv('Data/en-fr.txt', names=['en', 'fr', 'attr'], usecols=['en', 'fr'], sep='\t')
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)
df.head()

def clean_text(text):
    """ Normalize the text to its canonical form, "NFD" means "Normalization Form D", 
    which decomposes characters into their base characters and combining diacritical marks
    """
    text = normalize('NFD', text.lower())
    # Remove all non-alphabetic characters
    text = re.sub('[^A-Za-z ]+', '', text)
    return text

def clean_and_prepare_text(text):
    text = '[start] ' + clean_text(text) + ' [end]'
    return text

df['en'] = df['en'].apply(clean_text)
df['fr'] = df['fr'].apply(clean_and_prepare_text)
df.head()

| | en | fr |
|---|---|---|
| 0 | youre very clever | [start] vous etes fort ingenieuse [end] |
| 1 | are there kids | [start] y atil des enfants [end] |
| 2 | come in | [start] entrez [end] |
| 3 | wheres boston | [start] ou est boston [end] |
| 4 | you see what i mean | [start] vous voyez ce que je veux dire [end] |
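
To see what the cleaning step does to individual phrases, the following stand-alone sketch repeats `clean_text` from the cell above and applies it to two made-up inputs. It illustrates two side effects worth knowing about: apostrophes and hyphens are deleted rather than replaced by spaces (which is why "y a-t-il" appears as "y atil" in the output above), and a ligature such as "œ" has no canonical decomposition under NFD, so it is dropped entirely.

```python
import re
from unicodedata import normalize


def clean_text(text):
    # Same steps as the cell above: lowercase, decompose accents, drop non-letters.
    text = normalize('NFD', text.lower())
    return re.sub('[^A-Za-z ]+', '', text)


accented = clean_text("Où est l'hôtel ?")
ligature = clean_text("Mon cœur")
print(repr(accented))   # 'ou est lhotel '
print(repr(ligature))   # 'mon cur'
```

The trailing space in the first result is harmless here because the later `split()` calls ignore it.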


In [5]:
"""The next step is to scan the phrases and determine the maximum length of the
English phrases and then of the French phrases. These lengths will determine
the lengths of the sequences input to and output from the model"""
en = df['en']
fr = df['fr']

en_max_len = max(len(line.split()) for line in en)
fr_max_len = max(len(line.split()) for line in fr)
sequence_len = max(en_max_len, fr_max_len)

print(f'Max phrase length (English): {en_max_len}')
print(f'Max phrase length (French): {fr_max_len}')
print(f'Sequence length: {sequence_len}')

Max phrase length (English): 7
Max phrase length (French): 16
Sequence length: 16


In [7]:
"""Now fit one Tokenizer to the English phrases and another Tokenizer to their
French equivalents, and generate padded sequences for all the phrases"""
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

en_tokenizer = Tokenizer()
en_tokenizer.fit_on_texts(en)
en_sequences = en_tokenizer.texts_to_sequences(en)
en_x = pad_sequences(en_sequences, maxlen=sequence_len, padding='post')

fr_tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@\\^_`{|}~\t\n')
fr_tokenizer.fit_on_texts(fr)
fr_sequences = fr_tokenizer.texts_to_sequences(fr)
fr_y = pad_sequences(fr_sequences, maxlen=sequence_len + 1, padding='post')
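
Note that the French `Tokenizer` is given a `filters` string that omits the square brackets, so the `[start]` and `[end]` tokens survive tokenization. Since the model itself must be written in PyTorch, it may be convenient to avoid the TensorFlow dependency for preprocessing as well. The sketch below is a minimal pure-Python stand-in, not a drop-in replacement for the Keras utilities: the helper names `build_word_index` and `texts_to_padded` are hypothetical, ids are assigned by descending frequency with 0 reserved for padding, and over-long inputs are truncated at the end (Keras truncates at the start by default, which does not matter when `maxlen` is the maximum phrase length).

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer id, most frequent first; id 0 is reserved for padding."""
    counts = Counter(word for line in texts for word in line.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, maxlen):
    """Convert phrases to id lists, skipping unknown words, then right-pad with zeros."""
    padded = []
    for line in texts:
        ids = [word_index[w] for w in line.split() if w in word_index][:maxlen]
        padded.append(ids + [0] * (maxlen - len(ids)))
    return padded


sample = ['[start] entrez [end]', '[start] ou est boston [end]']
word_index = build_word_index(sample)
padded = texts_to_padded(sample, word_index, maxlen=6)
print(word_index)
print(padded)
```

The resulting nested list can be turned into a tensor with `torch.tensor(padded, dtype=torch.long)` before being fed to an embedding layer.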

In [8]:
"""Compute the vocabulary sizes from the Tokenizer instances"""
en_vocab_size = len(en_tokenizer.word_index) + 1
fr_vocab_size = len(fr_tokenizer.word_index) + 1

print(f'Vocabulary size (English): {en_vocab_size}')
print(f'Vocabulary size (French): {fr_vocab_size}')

Vocabulary size (English): 6033
Vocabulary size (French): 12197


In [9]:
"""Finally, create the features and the labels the model will be trained with.
The features are the padded English sequences and the padded French sequences
minus the [end] tokens. The labels are the padded French sequences minus the
[start] tokens. Package the features in a dictionary so they can be input to a
model that accepts multiple inputs."""
inputs = { 'encoder_input': en_x, 'decoder_input': fr_y[:, :-1] }
outputs = fr_y[:, 1:]
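
The two slices implement teacher forcing: at position t the decoder sees the target tokens up to t and is trained to predict the token at t + 1. The toy example below uses plain lists and made-up token ids to show the shift. Strictly speaking, `fr_y[:, :-1]` drops the last column, which is the `[end]` token only for the longest phrase; for shorter phrases it drops a padding zero and `[end]` stays in the decoder input. Padding `fr_y` to `sequence_len + 1` is what keeps both slices at length `sequence_len`.

```python
START, END, PAD = 1, 2, 0  # hypothetical ids for [start], [end], and padding

# Two padded French sequences of length sequence_len + 1 (here 6).
fr_y = [
    [START, 7, 8, 9, 4, END],        # longest phrase: fills every slot
    [START, 5, END, PAD, PAD, PAD],  # shorter phrase: padded on the right
]

decoder_input = [row[:-1] for row in fr_y]  # list equivalent of fr_y[:, :-1]
labels = [row[1:] for row in fr_y]          # list equivalent of fr_y[:, 1:]

for inp, out in zip(decoder_input, labels):
    print(inp, '->', out)
```

Because the labels of shorter phrases end in padding, the loss and accuracy in a PyTorch implementation would typically skip positions whose label is 0.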

Now, define and train the transformer in PyTorch. The Keras code below is provided as an example, **but note that you must write your solution in PyTorch**.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense, Dropout
from keras_nlp.layers import TokenAndPositionEmbedding, TransformerEncoder
from keras_nlp.layers import TransformerDecoder
from tensorflow.keras.callbacks import EarlyStopping
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

np.random.seed(42)
tf.random.set_seed(42)

num_heads = 8
embed_dim = 256

encoder_input = Input(shape=(None,), dtype='int64', name='encoder_input')
x = TokenAndPositionEmbedding(en_vocab_size, sequence_len, embed_dim)(encoder_input)
encoder_output = TransformerEncoder(embed_dim, num_heads)(x)
encoded_seq_input = Input(shape=(None, embed_dim))

decoder_input = Input(shape=(None,), dtype='int64', name='decoder_input')
x = TokenAndPositionEmbedding(fr_vocab_size, sequence_len, embed_dim, mask_zero=True)(decoder_input)
x = TransformerDecoder(embed_dim, num_heads)(x, encoded_seq_input)
x = Dropout(0.4)(x)

decoder_output = Dense(fr_vocab_size, activation='softmax')(x)
decoder = Model([decoder_input, encoded_seq_input], decoder_output)
decoder_output = decoder([decoder_input, encoder_output])

model = Model([encoder_input, decoder_input], decoder_output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary(line_length=120)

callback = EarlyStopping(monitor='val_accuracy', patience=3, restore_best_weights=True)
hist = model.fit(inputs, outputs, epochs=50, validation_split=0.2, callbacks=[callback])

acc = hist.history['accuracy']
val = hist.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, '-', label='Training accuracy')
plt.plot(epochs, val, ':', label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()
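
The `TransformerEncoder` and `TransformerDecoder` layers in the Keras code hide the attention computation, which a from-scratch PyTorch version has to spell out. As a starting point, here is a minimal sketch of scaled dot-product attention with a causal mask of the kind used in decoder self-attention. The function names are illustrative, and the shapes assume the same `num_heads = 8` and `embed_dim = 256` as above, giving 32 dimensions per head. It is one building block under those assumptions, not a complete model.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, head_dim).

    mask: boolean tensor broadcastable to (batch, heads, seq_len, seq_len),
    where True means the query position may attend to the key position.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


def causal_mask(seq_len):
    """Lower-triangular mask so that position t only attends to positions <= t."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


torch.manual_seed(0)
batch, heads, seq_len, head_dim = 2, 8, 5, 32    # 8 heads * 32 dims = 256
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)
out, weights = scaled_dot_product_attention(q, k, v, causal_mask(seq_len))
print(out.shape, weights.shape)
```

A full multi-head layer would add learned linear projections for the queries, keys, and values before this step and an output projection after it; a padding mask can be combined with the causal mask using a logical AND.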

## Question 2: BERT for sentiment analysis

For the last problem, we are going to learn how to use the Hugging Face libraries to train a simple BERT classifier for sentiment analysis.

We will use the IMDB dataset. You can load the dataset from Hugging Face with the following code:

```
from datasets import load_dataset
imdb = load_dataset("imdb")
```
To access BERT, use
```
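from transformers import BertForSequenceClassification

# Load pre-trained BERT with a sequence-classification head.
# label_dict is assumed to map each class label to an integer id,
# so len(label_dict) is 2 for the binary IMDB labels.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)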
```
To reduce training complexity, you can choose to freeze the weights of the pretrained BERT model and train only the classifier. The classifier should have at least 3 layers.
You might find https://huggingface.co/blog/sentiment-analysis-python and https://github.com/baotramduong/Twitter-Sentiment-Analysis-with-Deep-Learning-using-BERT/blob/main/Notebook.ipynb helpful.
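
The freezing pattern itself is short. The sketch below shows it with a hypothetical stand-in encoder (a single linear layer) so that it runs without downloading any weights; with `BertForSequenceClassification`, the pretrained encoder is exposed as `model.bert` and the head as `model.classifier`. The layer widths, dropout rate, and learning rate are arbitrary choices for illustration.

```python
import torch
from torch import nn

HIDDEN_SIZE = 768   # hidden size of bert-base-uncased
NUM_LABELS = 2      # IMDB reviews are labeled negative or positive

# Hypothetical stand-in for the pretrained encoder so the sketch runs offline.
encoder = nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE)

# Freeze: no gradients are computed for the encoder weights.
for param in encoder.parameters():
    param.requires_grad = False

# A three-layer classifier head (layer sizes are arbitrary choices).
classifier = nn.Sequential(
    nn.Linear(HIDDEN_SIZE, 256), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, NUM_LABELS),
)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

pooled = encoder(torch.randn(4, HIDDEN_SIZE))    # stand-in for BERT's pooled output
logits = classifier(pooled)
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 1, 0]))
loss.backward()
optimizer.step()
```

With the real model, one option is to apply the same loop to `model.bert.parameters()` and assign a head like this one to `model.classifier`, since the model passes its pooled output through that attribute.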

