
# Part III
Using the previous two tutorials, please answer the following with an encoder-decoder approach and, for comparison, an LSTM-based approach.

Please create a transformer-based classifier that labels English names as male or female.

Several datasets exist for classifying names as male or female. In subsequent iterations, the task could be expanded to include more classes.

Below is the names corpus from NLTK, which provides only male and female lists but is sufficient for the purposes of this assignment.

```
>>> import nltk
>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> male_names = names.words('male.txt')
>>> female_names = names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]
```
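The overlap shown above matters for a classifier, because a name that appears in both files would enter the training data once with each label. One possible way to handle this, sketched below under the assumption that ambiguous names should simply be set aside, is to split the two lists with set operations. The helper name `split_ambiguous` and the tiny sample lists are illustrative and are not part of NLTK or of the assignment.

```python
def split_ambiguous(male_names, female_names):
    """Return (male_only, female_only, ambiguous) as sorted lists."""
    male, female = set(male_names), set(female_names)
    ambiguous = male & female  # names present in both lists
    return (
        sorted(male - ambiguous),
        sorted(female - ambiguous),
        sorted(ambiguous),
    )


# Tiny illustrative lists (not the NLTK corpus), so the sketch runs offline.
male_sample = ["Adrian", "Alex", "Bob", "Carl"]
female_sample = ["Adrian", "Alex", "Alice", "Dana"]

male_only, female_only, ambiguous = split_ambiguous(male_sample, female_sample)
print(ambiguous)    # ['Adrian', 'Alex']
print(male_only)    # ['Bob', 'Carl']
print(female_only)  # ['Alice', 'Dana']
```

Using sets also avoids the repeated list scans of the `w in female_names` comprehension. Whether to drop ambiguous names, keep them with both labels, or treat them as a third class is a modeling decision that this sketch does not settle.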

```python
import nltk
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Data: label male names 0 and female names 1
nltk.download('names', quiet=True)  # fetch the corpus if it is not installed yet
corpus = nltk.corpus.names
male_names = corpus.words('male.txt')
female_names = corpus.words('female.txt')
data = [(name, 0) for name in male_names] + [(name, 1) for name in female_names]

# Data Preprocessing
names, labels = zip(*data)
labels = np.array(labels)  # Keras expects array-like labels, not a tuple
vocab = sorted(set(' '.join(names)))  # sorted for a reproducible character order
# Index 0 is reserved for padding, so real characters start at 1
char_to_index = {char: idx for idx, char in enumerate(vocab, start=1)}
vocab_size = len(vocab) + 1  # +1 for the padding index
max_seq_length = max(len(name) for name in names)
# Names have different lengths, so keep a list of lists rather than a NumPy array
data_encoded = [[char_to_index[char] for char in name] for name in names]
data_padded = tf.keras.preprocessing.sequence.pad_sequences(data_encoded, maxlen=max_seq_length)

# Splitting Data
train_data, val_data, train_labels, val_labels = train_test_split(
    data_padded, labels, test_size=0.2, random_state=42)

# Transformer-Based Classifier
# Note: as written, this model is an embedding + average-pooling baseline;
# it contains no self-attention layer.
transformer_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
transformer_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
transformer_model.fit(train_data, train_labels, validation_data=(val_data, val_labels), epochs=10)

# LSTM-Based Classifier
lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lstm_model.fit(train_data, train_labels, validation_data=(val_data, val_labels), epochs=10)
```
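The model labeled transformer-based above stacks an embedding, average pooling, and dense layers, so it never applies self-attention. A minimal sketch of what a single Transformer encoder block could look like for this task follows. It is an assumption about how the assignment might be completed, not the author's method: the names `TokenAndPositionEmbedding` and `build_transformer_classifier` and all hyperparameters (`d_model`, `num_heads`, `ff_dim`) are illustrative, and padding positions are not masked, to keep the example short.

```python
import tensorflow as tf


class TokenAndPositionEmbedding(tf.keras.layers.Layer):
    """Sum of a character embedding and a learned position embedding."""

    def __init__(self, max_len, vocab_size, d_model, **kwargs):
        super().__init__(**kwargs)
        self.max_len = max_len
        self.token_emb = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_emb = tf.keras.layers.Embedding(max_len, d_model)

    def call(self, x):
        positions = tf.range(self.max_len)  # 0 .. max_len - 1
        # (batch, max_len, d_model) + (max_len, d_model) broadcasts over batch
        return self.token_emb(x) + self.pos_emb(positions)


def build_transformer_classifier(vocab_size, max_len, d_model=64,
                                 num_heads=2, ff_dim=128):
    """Binary classifier with one encoder block (hypothetical sizes)."""
    inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
    x = TokenAndPositionEmbedding(max_len, vocab_size, d_model)(inputs)

    # Self-attention sub-layer with residual connection and layer norm
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads)(x, x)
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + attn)

    # Position-wise feed-forward sub-layer, again with residual + norm
    ff = tf.keras.layers.Dense(ff_dim, activation="relu")(x)
    ff = tf.keras.layers.Dense(d_model)(ff)
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + ff)

    # Pool over character positions and predict a single probability
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Such a model could be built with `build_transformer_classifier(len(vocab) + 1, max_seq_length)` and trained with the same `fit` call as the LSTM, with the labels passed as a NumPy array, so that both are compared on the same validation split. Only the encoder side is used here because classification needs one label per name and not a generated sequence.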


# References
1. https://arxiv.org/pdf/2102.03692.pdf
2. https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/exercise/13-attention.html
3. https://towardsdatascience.com/deep-learning-gender-from-name-lstm-recurrent-neural-networks-448d64553044
4. https://www.nltk.org/book/ch02.html#sec-lexical-resources