# Assignment 2: Classification of the Gender from the First Name

In this assignment, the task is to design and implement a recurrent neural network, specifically an Elman network, using PyTorch.
With this network, we classify first names as male or female.
Here, we only train and evaluate the network on the training data; no separate validation or test set is required.

## Input Data

The data that we use is available online in the UCI data repository: https://archive.ics.uci.edu/dataset/591/gender+by+name

The dataset contains names and their gender.
Names are represented by a list of characters; here, we use only lower-case letters and the `-` character.

Please run the following code cell to download and extract the data.

In [1]:
import os
import zipfile
import urllib.request
# get data:
dataurl = "https://archive.ics.uci.edu/static/public/591/gender+by+name.zip"
name = "gender+by+name.zip"
datafile = "name_gender_dataset.csv"

# Skip downloading if the file already exists
if not os.path.exists(datafile):
    # Download the file
    urllib.request.urlretrieve(dataurl, name)
    print(f"Downloaded {name} successfully.")

    # Extract the zip file
    with zipfile.ZipFile(name, 'r') as zip_ref:
        zip_ref.extractall()
    print(f"Extracted {name} successfully.")

assert os.path.exists(datafile)

Downloaded gender+by+name.zip successfully.
Extracted gender+by+name.zip successfully.


## Data Processing

We extract all names and all target values.
The names are stored in a Python list `inputs` with `N=146751` elements, where we filter out any names containing characters other than the 26 lower-case letters and `-`, so that the final set of characters contains 27 elements.
Similarly, the gender values are provided in a Python list `targets` with `N` elements, where male names are represented by target $t^{[n]}=0$ and female names by $t^{[n]}=1$.

Please run the code below to extract input and target data and print some statistics.

In [3]:
import csv

unique_chars = {chr(ord("a") + i) : i for i in range(26)}
unique_chars["-"] = 26

# read data and targets, remove outliers
inputs, targets = [], []
# open with an explicit encoding (assuming the CSV is UTF-8): the platform
# default (e.g. gbk) can fail with a UnicodeDecodeError on this file
with open(datafile, encoding="utf-8", newline="") as f:
    file = csv.reader(f, delimiter=',')
    # skip header
    next(file)
    for splits in file:
        t = splits[0].lower()
        # skip names that contain characters outside of our alphabet
        if any(c not in unique_chars for c in t): continue
        inputs.append(t)
        targets.append(0 if splits[1] == "M" else 1)

# print out some statistics
N = len(inputs)
females = int(sum(targets))
print(f"Collected N={N} names, of which {females} are female and {N-females} are male")

# print out 10 examples and their target values
print(f"The first ten inputs are: {inputs[:10]}")
print(f"The first ten targets are: {targets[:10]}")

## Task (c): Character Encoding

To make use of the data, each unique character needs to be transformed into a one-hot vector encoding $\vec x \in \{0,1\}^D$ with $\sum\limits_{d=1}^D x_d = 1$, where $D$ is the input dimensionality.

Implement a function (or a reasonable data structure) that turns a character into a one-hot vector representation.
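As an illustration of the one-hot encoding described above, the following minimal sketch encodes a whole name at once with `torch.nn.functional.one_hot`. The helper name `encode_name` is hypothetical and not part of the assignment template; the character mapping is repeated here only to keep the example self-contained.

```python
import torch

# same character-to-index mapping as in the data processing cell
unique_chars = {chr(ord("a") + i): i for i in range(26)}
unique_chars["-"] = 26

def encode_name(name):
    """Hypothetical helper: turn a name into a (len(name), D) one-hot tensor."""
    # look up the index of each character
    indices = torch.tensor([unique_chars[c] for c in name])
    # one_hot returns integer values, so convert to float for use as network input
    return torch.nn.functional.one_hot(indices, num_classes=len(unique_chars)).float()

x = encode_name("anna")
print(x.shape)       # one row per character, one column per symbol
print(x.sum(dim=1))  # every row contains exactly one 1
```

Each row of `x` satisfies the constraint $\sum_d x_d = 1$, and the position of the 1 is the index stored in `unique_chars`.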

In [None]:
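import torch

# Obtain the input dimension: 26 letters plus the `-` character
D = len(unique_chars)

# ALTERNATIVE 1:
# function to encode a single character as a one-hot vector
def encoding(c):
    x = torch.zeros(D)
    x[unique_chars[c]] = 1
    return x

# ALTERNATIVE 2:
# data structure that defines the encoding: row i of the identity matrix
# is the one-hot vector of the character with index i
encoding_table = torch.eye(D)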

## Task (d): Dataset Implementation

To make use of the data during training, we need to implement a `torch.utils.data.Dataset` that provides the encoded data, and the target values.
As usual, you need to implement three functions, which are the constructor `__init__`, the index function `__getitem__` and the number of samples in this dataset via `__len__`.

Since we want to make use of a batch, each sample needs to be of a fixed sequence length $S$, which you need to define reasonably.
For samples that are shorter than this sequence length, an appropriate padding needs to be implemented.

Implement the dataset below.
Make sure that the `__getitem__` function returns the encoded name as an element of $\mathbb R^{S\times D}$, possibly including an appropriate padding.
Depending on the loss function that you select below, you need to adapt the target $t$ accordingly.
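The padding step can be sketched as follows. This is only one possible choice, assuming zero vectors are used as padding and placed in front of the name, so that the last sequence element is always the last character. The helper name `pad_left` and the example length `S=6` are illustrative assumptions, not prescribed by the assignment.

```python
import torch
import torch.nn.functional as F

def pad_left(encoded, S):
    """Hypothetical helper: bring an (L, D) tensor to the fixed shape (S, D)."""
    # names longer than S keep only their last S characters
    encoded = encoded[max(0, encoded.shape[0] - S):]
    missing = S - encoded.shape[0]
    # the pad tuple is (left, right) for the last dimension, then (front, back)
    # for the second-to-last one, so this prepends `missing` all-zero rows
    return F.pad(encoded, (0, 0, missing, 0))

# "anna" as one-hot rows (a=0, n=13), shape (4, 27)
short = torch.eye(27)[[0, 13, 13, 0]]
padded = pad_left(short, S=6)
print(padded.shape)
```

Padding at the end instead would also give fixed-size samples, but then the network keeps updating its hidden state on padding elements after the name has ended.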

In [None]:
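# Define the sequence length: the longest name in the dataset
S = max(len(name) for name in inputs)

# create dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, X, T, S):
        # call base class constructor
        super(Dataset, self).__init__()
        # store the names, the targets and the fixed sequence length
        self.dataset = X
        self.targets = T
        self.S = S

    def __getitem__(self, index):
        # encode sample at the given index with padding
        name = self.dataset[index][-self.S:]
        encoded = torch.zeros(self.S, D)
        # pad with zero vectors in front, so that the last sequence element
        # is always the last character of the name
        offset = self.S - len(name)
        for s, c in enumerate(name):
            encoded[offset + s] = encoding(c)
        # CrossEntropyLoss expects the class index as an integer (long) tensor
        target = torch.tensor(self.targets[index], dtype=torch.long)
        # return both the encoded name and the target value
        return encoded, target

    def __len__(self):
        # return the number of samples in this dataset
        return len(self.dataset)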

## Task (e): Elman Network Implementation

The Elman network is a sequence processing network that we select to process our encoded names.
Particularly, the network is defined by two layers, which are computed as follows:

$$\vec h^{\{s\}} = g\left(\mathbf W^{(1)} \vec x^{\{s\}} + \mathbf W^{(r)} \vec h^{\{s-1\}}\right)$$
$$\vec z^{\{s\}} = \mathbf W^{(2)} \vec h^{\{s\}}$$

where $\mathbf W^{(1)}$, $\mathbf W^{(r)}$, and $\mathbf W^{(2)}$ are learnable matrices, $g(\cdot)$ is an appropriate activation function as selected in Task (a), and $\vec h^{\{s\}}$ and $\vec z^{\{s\}}$ are, respectively, the hidden representation and the logit output of the network for sequence element $\vec x^{\{s\}}$.

Finally, you need to implement the Elman network defined by the above equations as a `torch.nn.Module`.
Implement the network by making use of the weight matrices (or reasonable representations thereof) in the dimensionalities defined in Task (a).
Please be aware that inputs will be given in batches: $\mathcal X\in\mathbb R^{B\times S\times D}$.
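To make the recurrence above concrete, the following sketch unrolls the hidden-state update by hand and compares it with PyTorch's built-in `torch.nn.RNN`, which implements the same Elman-style update. It assumes $g = \tanh$ and no bias terms (the activation selected in Task (a) may differ), and the sizes `B, S, D, H` are illustrative values only.

```python
import torch

torch.manual_seed(0)
B, S, D, H = 2, 5, 27, 8  # illustrative sizes, not taken from the assignment

# built-in Elman RNN without biases, with g = tanh
rnn = torch.nn.RNN(input_size=D, hidden_size=H, nonlinearity="tanh",
                   bias=False, batch_first=True)
W1 = rnn.weight_ih_l0  # shape (H, D), plays the role of W^(1)
Wr = rnn.weight_hh_l0  # shape (H, H), plays the role of W^(r)

x = torch.randn(B, S, D)

with torch.no_grad():
    # manual recurrence: h^{s} = g(W^(1) x^{s} + W^(r) h^{s-1}), starting from h = 0
    h = torch.zeros(B, H)
    for s in range(S):
        h = torch.tanh(x[:, s, :] @ W1.T + h @ Wr.T)

    # built-in computation; h_n is the final hidden state with shape (1, B, H)
    _, h_n = rnn(x)

matches = torch.allclose(h, h_n[0], atol=1e-5)
print(matches)
```

The logits $\vec z^{\{s\}}$ would then be obtained by applying a further linear layer to the hidden state; for classifying a whole name, only the hidden state after the last sequence element is needed.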

In [None]:
# implement the Elman network
class ElmanNet(torch.nn.Module):
    def __init__(self, D, H, C):
        # call super class constructor
        super(ElmanNet, self).__init__()
        # remember the hidden dimension to initialize the hidden state in forward
        self.H = H
        # instantiate all required elements in appropriate dimensionalities
        self.W1 = torch.nn.Linear(D, H)
        self.Wr = torch.nn.Linear(H, H)
        self.W2 = torch.nn.Linear(H, C)
        self.activation = torch.nn.ReLU()

    def forward(self, x):
        # obtain the shape of the sample
        B, S, D = x.shape
        # initialize the hidden state with zeros, on the same device as the input
        h = torch.zeros(B, self.H, device=x.device)
        # iterate through the sequence
        for s in range(S):
            # update the hidden representation with the current sequence element
            h = self.activation(self.W1(x[:, s, :]) + self.Wr(h))
        # only the logits of the last sequence element are needed for classification
        return self.W2(h)

## Task (f): Network Training

Finally, we want to train our Elman network on the provided data.
We need to instantiate the Elman network, possibly passing the required parameters.
Also, an appropriate loss function needs to be selected -- here we can ignore the imbalanced nature of the dataset.
Instantiate the data loader with reasonable parameters, and an optimizer used to train the network.
Train the network for 10 epochs, and compute the training set accuracy during the epoch.
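The choice of loss function determines the shape and type of logits and targets. As a small sketch with made-up logits (not results from the assignment), `torch.nn.CrossEntropyLoss` takes logits of shape $B\times 2$ and integer class indices, while `torch.nn.BCEWithLogitsLoss` takes one logit per sample and float targets. For two classes, both give the same value when the single logit is the difference of the two class logits.

```python
import torch

# made-up logits for a batch of B=3 names and C=2 classes
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [-0.3, 0.2]])
targets = torch.tensor([0, 1, 0])  # class indices (dtype long)

# variant 1: two logits per sample, integer targets
ce = torch.nn.CrossEntropyLoss()(logits, targets)

# variant 2: one logit per sample, float targets in {0., 1.}
single_logit = logits[:, 1] - logits[:, 0]
bce = torch.nn.BCEWithLogitsLoss()(single_logit, targets.float())

# accuracy: fraction of samples whose largest logit matches the target
predictions = torch.argmax(logits, dim=1)
accuracy = (predictions == targets).float().mean().item()
print(ce.item(), bce.item(), accuracy)
```

Here the third sample is misclassified, so the accuracy on this toy batch is 2/3.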

In [None]:
# network dimensions: H is an assumed hidden size (Task (a) may prescribe
# a different value), C = 2 logits (male, female) to match CrossEntropyLoss
H, C = 64, 2

# instantiate the network
network = ElmanNet(D, H, C)

# instantiate the loss
loss = torch.nn.CrossEntropyLoss()

# instantiate an optimizer
optimizer = torch.optim.Adam(network.parameters(), lr=0.001)

# instantiate data loader
dataset = Dataset(inputs, targets, S)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# iterate through training set only
for epoch in range(10):
    # count the correctly classified samples of this epoch
    correct = 0
    # iterate through the training dataset
    for x, t in dataloader:
        # train the network on the current batch
        optimizer.zero_grad()
        Z = network(x)
        J = loss(Z, t)
        J.backward()
        optimizer.step()
        # accumulate the number of correct predictions
        correct += torch.sum(torch.argmax(Z, dim=1) == t).item()

    # compute and print the training set accuracy
    accuracy = correct / len(dataset)
    print("Epoch", epoch, "accuracy", accuracy)