<a href="https://colab.research.google.com/github/ktxdev/Assignment-4/blob/main/scripts/assignment_4_part_III.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part III
Using the previous two tutorials, please answer the following using an encoder-decoder approach and, for comparison, an LSTM approach.

Please create a transformer-based classifier for English name classification into male or female.

There are several datasets of names for male or female classification. In subsequent iterations, this could be expanded to include more classifications.

Below is the source from NLTK, which only has male and female available but could be used for the purposes of this assignment.

```
>>> import nltk
>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> male_names = names.words('male.txt')
>>> female_names = names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]
```
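
The overlap shown above means some names appear under both labels. The following is a minimal sketch of one way to find and optionally set aside those names, assuming tiny made-up stand-in lists rather than the real NLTK corpus; the filtering step is a hypothetical cleanup and is not part of the assignment code below.

```python
# Tiny stand-ins for names.words('male.txt') and names.words('female.txt').
# These values are illustrative assumptions, not the real corpus contents.
male_names = ["Abbey", "Alex", "Brian", "Sean"]
female_names = ["Abbey", "Alex", "Melisa", "Zoe"]

# A set intersection finds the shared names without rescanning
# the female list once per male name.
ambiguous_set = set(male_names) & set(female_names)
ambiguous = sorted(ambiguous_set)

# Labelled pairs using the encoding used later in this notebook
# (0 = male, 1 = female).
labelled = [(n, 0) for n in male_names] + [(n, 1) for n in female_names]

# Hypothetical cleanup: drop every name that carries both labels.
unambiguous = [(n, y) for n, y in labelled if n not in ambiguous_set]

print(ambiguous)
print(unambiguous)
```

Whether to drop, keep, or relabel such names is a design choice; keeping them, as the code below does, means the classifier sees the same input with conflicting targets.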

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)

In [2]:
### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
import nltk
import torch
import random
import pandas as pd

from datasets import Dataset
from transformers import BartTokenizer, BartForSequenceClassification, Trainer, TrainingArguments

nltk.download('names')

# Data Preparation
names = nltk.corpus.names

names_list = [(name, fileid.split('.')[0]) for fileid in names.fileids() for name in names.words(fileid)]
# Shuffle the names list
random.shuffle(names_list)
# Create names dataset
names_df = pd.DataFrame(names_list, columns=['name', 'label']).reset_index(drop=True)
# Convert labels to numeric values: 0 = male, 1 = female
names_df['label'] = names_df['label'].map({'male': 0, 'female': 1})

## 1. Using an encoder-decoder approach with the BART Transformer
# Creating a hugging face dataset
names_dataset = Dataset.from_pandas(names_df)

class GenderTransformerModel():
  def __init__(self):
    # Create tokenizer and model instance
    self.model_name = "facebook/bart-base"
    self.tokenizer = BartTokenizer.from_pretrained(self.model_name)
    self.model = BartForSequenceClassification.from_pretrained(self.model_name, num_labels = 2) # Set labels to 2 since we have male and female only

    self.training_args = TrainingArguments(
      output_dir="./results",
      num_train_epochs=3,
      per_device_train_batch_size=2,
      warmup_steps=500,
      weight_decay=0.01,
      logging_dir='./logs'
    )

  def tokenize(self, data):
    """
    Takes a dataset and tokenizes the name in the dataset
    """
    return self.tokenizer(data['name'], truncation=True, padding="max_length", max_length=64)

  def train(self, dataset):
    # Tokenize the dataset
    tokenized_dataset = dataset.map(self.tokenize, batched=True)

    # Split the dataset
    train_test_split = tokenized_dataset.shuffle(seed=42).train_test_split(test_size=0.2)
    train_dataset = train_test_split['train']
    test_dataset = train_test_split['test']

    # Instantiate a Trainer
    trainer = Trainer(
        model=self.model,
        args=self.training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

    # Train the model
    trainer.train()

  def predict(self, name):
    """
    Makes prediction given the name
    """
    # Tokenize name
    inputs = self.tokenizer(name, return_tensors='pt')
    # Set device to GPU if available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Move the model to the device selected
    self.model.to(device)
    # Move input tensors to the same device
    inputs = {key: value.to(device) for key, value in inputs.items()}
    # Make prediction in eval mode (disables dropout) without tracking gradients
    self.model.eval()
    with torch.no_grad():
      outputs = self.model(**inputs)
    predicted_gender = torch.argmax(outputs.logits, dim=1).item()
    return "Male" if predicted_gender == 0 else "Female"

# Initialize the model
tf_model = GenderTransformerModel()
# Train model
tf_model.train(names_dataset)

## 2. LSTM Approach
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


class GenderLSTMModel():
  def __init__(self, tokenizer, max_len):
    self.max_len = max_len
    self.tokenizer = tokenizer
    self.model = Sequential([
        Embedding(input_dim=len(self.tokenizer.word_index) + 1, output_dim=32, input_length=max_len),
        LSTM(50),
        Dense(1, activation='sigmoid')
    ])

    self.model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

  def train(self, X_train_seq, X_test_seq, y_train, y_test):
    # Padding sequences
    X_train_padded = pad_sequences(X_train_seq, maxlen=self.max_len, padding='post')
    X_test_padded = pad_sequences(X_test_seq, maxlen=self.max_len, padding='post')

    # Train the model
    history = self.model.fit(X_train_padded, y_train, epochs=10, batch_size=2, validation_split=0.2)

    return self.model.evaluate(X_test_padded, y_test)

  def predict(self, name):
    name_seq = self.tokenizer.texts_to_sequences([name])
    name_padded_seq = pad_sequences(name_seq, maxlen=self.max_len, padding='post')
    prediction = self.model.predict(name_padded_seq)
    return "Male" if prediction[0][0] < 0.5 else "Female"


# Tokenization
tokenizer = Tokenizer(char_level=True)  # Character-level tokenization
tokenizer.fit_on_texts(names_df['name'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(names_df['name'], names_df['label'], test_size=0.2, random_state=42)

# Convert text to sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Compute max length
max_len = max(len(seq) for seq in X_train_seq)

# Initialize the LSTM Model
lstm_model = GenderLSTMModel(tokenizer, max_len)
# Train the model
lstm_model.train(X_train_seq, X_test_seq, y_train, y_test)

# Make predictions
test_names = ['Sean', 'Melisa', 'Andile', 'Prince']

print("Transformer Model Predictions:\n")
for name in test_names:
  print(f"{name} is a {tf_model.predict(name)}")

print("\n\n")

print("LSTM Model Predictions:\n")
for name in test_names:
  print(f"{name} is a {lstm_model.predict(name)}")
### END CODE HERE ###

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.

Some weights of BartForSequenceClassification were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/7944 [00:00<?, ? examples/s]

| Step | Training Loss |
|------|---------------|
| 500  | 0.7901 |
| 1000 | 0.7908 |
| 1500 | 0.735  |
| 2000 | 0.7591 |
| 2500 | 0.7238 |
| 3000 | 0.7304 |
| 3500 | 0.6283 |
| 4000 | 0.5444 |
| 4500 | 0.6042 |
| 5000 | 0.6193 |
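
As a rough reference point for reading the training-loss values logged above: a two-class model that assigns probability 0.5 to each class has a cross-entropy of ln 2 ≈ 0.693 per example. The sketch below computes that value and lists which logged losses fall below it. The helper name `cross_entropy` is an illustrative assumption, and the comparison is only a sanity check, not an evaluation of the model.

```python
import math

def cross_entropy(p_true_class):
    """Cross-entropy for one example, given the probability assigned to its true class."""
    return -math.log(p_true_class)

# Loss of a predictor that always outputs 0.5 for both classes
chance_loss = cross_entropy(0.5)

# Training-loss values copied from the log above
logged_losses = [0.7901, 0.7908, 0.735, 0.7591, 0.7238,
                 0.7304, 0.6283, 0.5444, 0.6042, 0.6193]

# Logged values that are lower than the 0.5/0.5 reference
below_chance = [loss for loss in logged_losses if loss < chance_loss]

print(round(chance_loss, 4))
print(below_chance)
```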


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Transformer Model Predictions:

Sean is a Male
Melisa is a Female
Andile is a Female
Prince is a Male



LSTM Model Predictions:

Sean is a Male
Melisa is a Female
Andile is a Female
Prince is a Male
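
The two models reach the labels printed above through different decision rules: the BART classifier takes the argmax over two logits, while the LSTM thresholds a single sigmoid output at 0.5. The sketch below restates both rules in plain Python, assuming made-up logit values and hypothetical helper names; it does not call either trained model.

```python
import math

def label_from_logits(logits):
    """Argmax over [male_logit, female_logit], mirroring the BART predict method."""
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return "Male" if predicted == 0 else "Female"

def label_from_sigmoid(prob_female, threshold=0.5):
    """Threshold one sigmoid output, mirroring the LSTM predict method."""
    return "Male" if prob_female < threshold else "Female"

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for illustration only
example_logits = [0.3, 1.2]
prob_female = softmax(example_logits)[1]

# With two classes, argmax over logits and a 0.5 threshold on the
# softmax probability of class 1 pick the same label.
print(label_from_logits(example_logits), label_from_sigmoid(prob_female))
```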


# References
1. https://arxiv.org/pdf/2102.03692.pdf
2. https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/exercise/13-attention.html
3. https://towardsdatascience.com/deep-learning-gender-from-name-lstm-recurrent-neural-networks-448d64553044
4. https://www.nltk.org/book/ch02.html#sec-lexical-resources