<a href="https://colab.research.google.com/github/Natural-Language-Processing-YU/M3_Assignment/blob/main/scripts/m3_assignment_part_III.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part III
Using the previous two tutorials, please answer the following using an encorder-decoder approach and an LSTM compared approach. 

Please create a transformer-based classifier for English name classification into male or female.

There are several datasets for name for male or female classification. In subseuqent iterations, this could be expanded to included more classifications. 

Below is the source from NLTK, which only has male and female available but could be used for the purposes of this assignment. 

```
names = nltk.corpus.names
names.fileids()
['female.txt', 'male.txt']
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]
```

1. **Data Preparation:** Download and preprocess a dataset of names, labeling them by gender.
2. **Model Training:** Fine-tune a pre-trained DistilBERT model on the labeled names dataset for the task of gender classification.
3. **Evaluation:** Configure training arguments, execute the training process, and compute metrics such as accuracy, precision, recall, and F1-score to evaluate the model's performance.
4. **Results Interpretation:** Analyze the training outcomes using the computed metrics and log outputs.


In [38]:
import random
import torch
import nltk
from nltk.corpus import names
from sklearn.model_selection import train_test_split
from transformers import (DistilBertTokenizerFast, DistilBertForSequenceClassification, 
                          Trainer, TrainingArguments)
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Download required NLTK resources
nltk.download('names')

def load_and_prepare_data():
    """Load names data from NLTK and prepare labels."""
    male_names = [(name, 0) for name in names.words('male.txt')]
    female_names = [(name, 1) for name in names.words('female.txt')]
    all_names = male_names + female_names
    random.shuffle(all_names)
    return zip(*all_names)  # Unzips into two lists

def create_datasets(tokenizer, names_train, names_test, labels_train, labels_test):
    """Tokenize names and create custom dataset objects."""
    train_encodings = tokenizer(list(names_train), truncation=True, padding=True)
    test_encodings = tokenizer(list(names_test), truncation=True, padding=True)
    train_dataset = NameDataset(train_encodings, labels_train)
    test_dataset = NameDataset(test_encodings, labels_test)
    return train_dataset, test_dataset

class NameDataset(torch.utils.data.Dataset):
    """Custom PyTorch dataset for name classification."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

def setup_trainer(tokenizer):
    """Initialize tokenizer, model, and training arguments."""
    model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
    training_args = TrainingArguments(
        output_dir='./results',         
        num_train_epochs=3,             
        per_device_train_batch_size=32, 
        per_device_eval_batch_size=64,  
        warmup_steps=100,               
        weight_decay=0.01,              
        logging_dir='./logs',           
        logging_steps=10,
        evaluation_strategy="epoch",    
        save_strategy="epoch"           
    )
    return model, training_args

def compute_metrics(pred):
    """Calculate metrics for evaluating the model."""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}

# Main execution
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
names_list, labels = load_and_prepare_data()
names_train, names_test, labels_train, labels_test = train_test_split(names_list, labels, test_size=0.2)
train_dataset, test_dataset = create_datasets(tokenizer, names_train, names_test, labels_train, labels_test)
model, training_args = setup_trainer(tokenizer)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset,
                  eval_dataset=test_dataset, compute_metrics=compute_metrics)
trainer.train()


[nltk_data] Downloading package names to /home/azureuser/nltk_data...
[nltk_data]   Package names is already up-to-date!
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3872,0.302647,0.876023,0.90262,0.894221,0.911178
2,0.1448,0.361893,0.874764,0.901534,0.894014,0.909182
3,0.1616,0.365745,0.870359,0.896794,0.900402,0.893214


Attempted to log scalar metric loss:
0.6973
Attempted to log scalar metric grad_norm:
0.726940929889679
Attempted to log scalar metric learning_rate:
5e-06
Attempted to log scalar metric epoch:
0.05
Attempted to log scalar metric loss:
0.6764
Attempted to log scalar metric grad_norm:
0.8579685688018799
Attempted to log scalar metric learning_rate:
1e-05
Attempted to log scalar metric epoch:
0.1
Attempted to log scalar metric loss:
0.6563
Attempted to log scalar metric grad_norm:
1.3368971347808838
Attempted to log scalar metric learning_rate:
1.5e-05
Attempted to log scalar metric epoch:
0.15
Attempted to log scalar metric loss:
0.5623
Attempted to log scalar metric grad_norm:
1.7434629201889038
Attempted to log scalar metric learning_rate:
2e-05
Attempted to log scalar metric epoch:
0.2
Attempted to log scalar metric loss:
0.4707
Attempted to log scalar metric grad_norm:
2.725975275039673
Attempted to log scalar metric learning_rate:
2.5e-05
Attempted to log scalar metric epoch:
0.25


TrainOutput(global_step=597, training_loss=0.2764041194164973, metrics={'train_runtime': 27.201, 'train_samples_per_second': 700.892, 'train_steps_per_second': 21.948, 'total_flos': 34528196655540.0, 'train_loss': 0.2764041194164973, 'epoch': 3.0})

In [39]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def print_evaluation_results(trainer):
    # Evaluate the model and store the results
    evaluation_results = trainer.evaluate()

    # Print evaluation results
    for metric_name, metric_value in evaluation_results.items():
        print(f"{metric_name}: {metric_value}")

# Execute the function to print evaluation results
print_evaluation_results(trainer)


Attempted to log scalar metric eval_loss:
0.36574485898017883
Attempted to log scalar metric eval_accuracy:
0.8703587161736942
Attempted to log scalar metric eval_f1:
0.8967935871743486
Attempted to log scalar metric eval_precision:
0.9004024144869215
Attempted to log scalar metric eval_recall:
0.8932135728542914
Attempted to log scalar metric eval_runtime:
0.1566
Attempted to log scalar metric eval_samples_per_second:
10148.725
Attempted to log scalar metric eval_steps_per_second:
159.672
Attempted to log scalar metric epoch:
3.0
eval_loss: 0.36574485898017883
eval_accuracy: 0.8703587161736942
eval_f1: 0.8967935871743486
eval_precision: 0.9004024144869215
eval_recall: 0.8932135728542914
eval_runtime: 0.1566
eval_samples_per_second: 10148.725
eval_steps_per_second: 159.672
epoch: 3.0


### Lets create predictions

In [40]:
def predict_name(name):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Tokenize the input name
    encodings = tokenizer(name, truncation=True, padding=True, return_tensors="pt")

    
    encodings = {key: val.to(device) for key, val in encodings.items()}

    # Predict using the trained model
    with torch.no_grad():
        outputs = model(**encodings)

    # Get the prediction probabilities and apply softmax
    logits = outputs.logits
    probs = torch.nn.functional.softmax(logits, dim=-1)

    # Extract probabilities for each class
    male_prob, female_prob = probs[0, 0].item(), probs[0, 1].item()

    # Determine predicted class based on the higher probability
    predicted_class = 'Male' if male_prob > female_prob else 'Female'

    # Print results
    print(f"Name: {name}")
    print(f"Predicted Gender: {predicted_class}")
    print(f"Probability (Male): {male_prob:.4f}")
    print(f"Probability (Female): {female_prob:.4f}")


predict_name("Alex")
predict_name("Emma")


Name: Alex
Predicted Gender: Male
Probability (Male): 0.6109
Probability (Female): 0.3891
Name: Emma
Predicted Gender: Female
Probability (Male): 0.0096
Probability (Female): 0.9904


### LSTM based approach

In [35]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import nltk
from nltk.corpus import names


nltk.download('names')

def preprocess_names_data(male_names, female_names):
    names = male_names + female_names
    names = [str(name) for name in names if isinstance(name, (str, tuple))]
    return names

def tokenize_and_pad(names):
    tokenizer = Tokenizer(char_level=True)
    tokenizer.fit_on_texts(names)
    sequences = tokenizer.texts_to_sequences(names)
    data = pad_sequences(sequences, maxlen=10)
    vocab_size = len(tokenizer.word_index) + 1
    return data, vocab_size

def prepare_data(names_data, labels):
    train_data, val_data, train_labels, val_labels = train_test_split(names_data, labels, test_size=0.2, random_state=42)
    return train_data, val_data, train_labels, val_labels

def create_loaders(train_data, val_data, train_labels, val_labels, batch_size=32):
    train_dataset = NamesDataset(train_data, train_labels)
    val_dataset = NamesDataset(val_data, val_labels)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader

def train_model(model, train_loader, optimizer):
    model.train()
    total_loss = 0
    correct_preds = 0
    for x, y in train_loader:
        optimizer.zero_grad()
        output = model(x)
        loss = nn.BCEWithLogitsLoss()(output.squeeze(), y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        predictions = torch.round(torch.sigmoid(output)).squeeze()
        correct_preds += (predictions == y).sum().item()
    accuracy = correct_preds / len(train_loader.dataset)
    return total_loss / len(train_loader), accuracy

def validate_model(model, val_loader):
    model.eval()
    total_loss = 0
    correct_preds = 0
    with torch.no_grad():
        for x, y in val_loader:
            output = model(x)
            loss = nn.BCEWithLogitsLoss()(output.squeeze(), y)
            total_loss += loss.item()
            predictions = torch.round(torch.sigmoid(output)).squeeze()
            correct_preds += (predictions == y).sum().item()
    accuracy = correct_preds / len(val_loader.dataset)
    return total_loss / len(val_loader), accuracy

def train_and_validate(model, train_loader, val_loader, optimizer, num_epochs=10):
    for epoch in range(num_epochs):
        train_loss, train_acc = train_model(model, train_loader, optimizer)
        val_loss, val_acc = validate_model(model, val_loader)
        print(f"Epoch: {epoch+1}, Training Loss: {train_loss}, Training Accuracy: {train_acc}, Validation Loss: {val_loss}, Validation Accuracy: {val_acc}")

# Custom Dataset class
class NamesDataset(Dataset):
    def __init__(self, data, labels):
        self.data = torch.tensor(data, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

# Model definition
class NameClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(NameClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        x = self.fc(x[:, -1, :])
        return x


def main(male_names, female_names):
    # Preprocess data
    names = preprocess_names_data(male_names, female_names)
    labels = np.array([0]*len(male_names) + [1]*len(female_names))

    # Tokenize and pad data
    data, vocab_size = tokenize_and_pad(names)

    # Prepare data
    train_data, val_data, train_labels, val_labels = prepare_data(data, labels)

    # Create loaders
    train_loader, val_loader = create_loaders(train_data, val_data, train_labels, val_labels)

    # Initialize model
    model = NameClassifier(vocab_size, embedding_dim=50, hidden_dim=100)
    optimizer = Adam(model.parameters())

    # Train and validate model
    train_and_validate(model, train_loader, val_loader, optimizer)

male_names = [(name, 0) for name in names.words('male.txt')]
female_names = [(name, 1) for name in names.words('female.txt')]
main(male_names, female_names)


[nltk_data] Downloading package names to /home/azureuser/nltk_data...
[nltk_data]   Package names is already up-to-date!


Epoch: 1, Training Loss: 0.05401957650252398, Training Accuracy: 0.9776553894571204, Validation Loss: 0.0006162651634076611, Validation Accuracy: 1.0
Epoch: 2, Training Loss: 0.0003651852564314081, Training Accuracy: 1.0, Validation Loss: 0.00022718803375028075, Validation Accuracy: 1.0
Epoch: 3, Training Loss: 0.00016512479491229988, Training Accuracy: 1.0, Validation Loss: 0.00012463565944926814, Validation Accuracy: 1.0
Epoch: 4, Training Loss: 9.860007105372968e-05, Training Accuracy: 1.0, Validation Loss: 8.06317522074096e-05, Validation Accuracy: 1.0
Epoch: 5, Training Loss: 6.669550723803987e-05, Training Accuracy: 1.0, Validation Loss: 5.694219456927385e-05, Validation Accuracy: 1.0
Epoch: 6, Training Loss: 4.842155949381091e-05, Training Accuracy: 1.0, Validation Loss: 4.2483297002036124e-05, Validation Accuracy: 1.0
Epoch: 7, Training Loss: 3.679467561450289e-05, Training Accuracy: 1.0, Validation Loss: 3.288896754384041e-05, Validation Accuracy: 1.0
Epoch: 8, Training Loss: 

### Lets generate inferences

In [44]:

def preprocess_input_names(input_names, tokenizer):
    sequences = tokenizer.texts_to_sequences(input_names)
    padded_sequences = pad_sequences(sequences, maxlen=10)
    return padded_sequences


def generate_inferences(model, input_names, tokenizer):
    # Preprocess input names
    input_data = preprocess_input_names(input_names, tokenizer)
    
    # Convert input data to PyTorch tensor
    input_tensor = torch.tensor(input_data, dtype=torch.long)
    
    # Set the model to evaluation mode
    model.eval()
    
    # Forward pass
    with torch.no_grad():
        outputs = model(input_tensor)
        predictions = torch.round(torch.sigmoid(outputs)).squeeze().numpy()
    
    # Convert predictions to gender labels
    gender_labels = ["Male" if pred == 0 else "Female" for pred in predictions]
    
    return list(zip(input_names, gender_labels))


input_names = ["John", "Emma", "Michael"]
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(input_names)


gender_names = generate_inferences(model, input_names, tokenizer)
for name, gender in gender_names:
    print(f"Name: {name}, Gender: {gender}")


Name: John, Gender: Male
Name: Emma, Gender: Female
Name: Michael, Gender: Male


# References
1. https://arxiv.org/pdf/2102.03692.pdf
2. https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/exercise/13-attention.html
3. https://towardsdatascience.com/deep-learning-gender-from-name-lstm-recurrent-neural-networks-448d64553044
4. https://www.nltk.org/book/ch02.html#sec-lexical-resources