# Swahili Character-Level Language Model Using PyTorch LSTMs

This notebook is nearly identical to "Anything Goes (Kwere)". The pretrain-train relationship is just inverted: pre-train on Kwere, fine-tune on Swahili.

### Results
I've had a hard time training on the Swahili data. The data seems fairly "messy", with a lot of numbers and other odd characters cluttering it. Since the dataset is so large, I thought a lower learning rate would be ideal, but lower learning rates performed significantly worse.

### Parameters
Dictionary containing all parameters for ease of tuning. These will be logged to the neptune logger below.

**To add test data, enter the test file name in the `test_data` parameter.**

Select Parameter Descriptions:
 - `experiment_name`: identifier to be used in logging
 - `tags`: also for logging and filtering trials
 - `seq_len`: length of character lists fed to the model
 - `num_layers`: LSTM layers
 - `carry_hidden_state`: whether or not to perpetuate the hidden state between sequences
 - `pretrain_lr`: the learning rate to use while pretraining
 - `kwere_percentage`: the percentage of the Kwere data to pretrain with

In [260]:
PARAMS = {
    'experiment_name': "Swahili",
    'tags': ["swahili", "anything goes"],
    'epochs': 25,
    'hidden_size': 512,
    'seq_len': 100,
    'num_layers': 4,
    'dropout': 0.2,
    'lr': 0.01,
    'carry_hidden_state': False,
    'val_split': 0.3,
    'swahili_train': "./sw-train.txt",
    'pretrain_epochs': 5,
    'pretrain_lr': 0.001,
    'kwere_percentage': 0, 
    'kwere': "./cwe-train.txt",
    'test_data': "./sw-test.txt"
}

### Logging
For this project I used a logging library / UI called [Neptune.ai](https://neptune.ai/) to track all runs and their respective hyperparameters. Since the API key for this is only in my local `bash_profile`, **this cell will throw an error**, but I'll conditionalize all the logging in the notebook so errors won't be thrown beyond this cell.

To view the logs from my runs, visit the project url [here](https://ui.neptune.ai/gregrolwes/Bantu-Language-Modeling/experiments?viewId=standard-view&sortBy=%5B%22timeOfCreation%22%5D&sortDirection=%5B%22descending%22%5D&sortFieldType=%5B%22native%22%5D&sortFieldAggregationMode=%5B%22auto%22%5D&trashed=false&suggestionsEnabled=false&lbViewUnpacked=true&tags=%5B%22swahili%22%2C%22anything%20goes%22%5D). Note that runs are tagged with the language they target and with whether they use the "Anything Goes" or "From Scratch" implementation.

In [261]:
is_logging = False

import neptune

neptune.init('gregrolwes/Bantu-Language-Modeling')

neptune.create_experiment(
            name=PARAMS['experiment_name'],
            tags=PARAMS['tags'],
            params=PARAMS
        )

# reach this if the above logger initialization passes
is_logging = True

https://ui.neptune.ai/gregrolwes/Bantu-Language-Modeling/e/BAN-39


### Imports

In [262]:
import torch
import torch.nn as nn
import math

### Random Seed
Make the experiment reproducible.

In [263]:
SEED = 42
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### GPU Support

In [264]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Dataset Class
The `Dataset` class generates a list of all unique characters found in the supplied data, the total number of characters, the number of unique characters, a mapping from each character to its ID, a mapping from character IDs back to characters for making outputs readable, and a data tensor of every character converted to its ID.

The `Dataset` will also generate a `~` character to be used in place of any characters unknown to the model (i.e. anything not in the training set). See the `clean_data` function below.

Inputs:
 - `raw_data`: `string` of all characters from the provided data in order
 - `device`: `torch.device` of either `cuda` or `cpu`

In [265]:
class Dataset():
    def __init__(self, raw_data: str, device: torch.device):
        self.chars = set(raw_data)
        self.chars.add('~')
        self.data_size, self.vocab_size = len(raw_data), len(self.chars)
        print("{} characters, {} unique".format(self.data_size, self.vocab_size))
        
        # sort so character IDs are identical across runs and across datasets that share a vocabulary
        ordered_chars = sorted(self.chars)
        self.char_to_idx = { char: idx for idx, char in enumerate(ordered_chars) }
        self.idx_to_char = { idx: char for idx, char in enumerate(ordered_chars) }
        
        self.data = torch.tensor([self.char_to_idx[char] for char in list(raw_data)]).unsqueeze(1).to(device)
    
    def __len__(self):
        return self.data_size
    
    def __getitem__(self, index):
        return self.data[index]
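
As a small illustration of the mappings described above, the following sketch builds the same two dictionaries for a toy string in plain Python, without tensors. The sample string `toy_text` is hypothetical, and the characters are sorted here so the IDs are reproducible.

```python
# Toy illustration of the character <-> ID mappings (plain Python, no tensors).
toy_text = "habari"  # hypothetical sample, not taken from the dataset

# Unique characters plus the '~' placeholder for unknown characters.
chars = set(toy_text)
chars.add('~')

# Sorted so that every run assigns the same ID to the same character.
ordered = sorted(chars)
char_to_idx = {char: idx for idx, char in enumerate(ordered)}
idx_to_char = {idx: char for idx, char in enumerate(ordered)}

# Encode the text as IDs, then decode it back to check the round trip.
encoded = [char_to_idx[char] for char in toy_text]
decoded = "".join(idx_to_char[idx] for idx in encoded)

print("{} characters, {} unique".format(len(toy_text), len(chars)))
print(encoded)  # [2, 0, 1, 0, 4, 3]
print(decoded)  # habari
```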

### Data Cleaning
The `clean_data` function replaces any unknown characters in the provided data with the designated unknown character `~`. I'm essentially forfeiting these characters if they ever appear in the testing data, since I likely couldn't get them correct anyway considering the model did not see them during training (unless they appear in the Kwere data, but see my explanation below for that decision).

Inputs:
 - `raw_data`: `string` of raw data read directly from file
 - `known_chars`: collection of known characters (here the `set` built by the training `Dataset`). Every character not in this collection will be replaced.

In [266]:
def clean_data(raw_data: str, known_chars: set) -> str:
    cleaned = ""
    for char in raw_data:
        if char not in known_chars:
            cleaned += "~"
        else:
            cleaned += char
    return cleaned
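
A quick way to sanity-check the replacement behaviour is a toy call like the one below. This is a self-contained sketch: `replace_unknown` is a hypothetical stand-in that mirrors `clean_data`, and both the vocabulary and the sample text are made up.

```python
def replace_unknown(raw_data: str, known_chars: set) -> str:
    """Hypothetical stand-in for clean_data: unknown characters become '~'."""
    return "".join(char if char in known_chars else "~" for char in raw_data)

known = set("abcdefghijklmnopqrstuvwxyz ")  # made-up vocabulary
sample = "habari 2020!"                     # made-up text with unseen characters

cleaned = replace_unknown(sample, known)
print(cleaned)  # habari ~~~~~
```

The cleaned text always has the same length as the input, so character positions stay aligned with the original file.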

### Data Loading
Load the Swahili training data and split based on the provided ratio. Then load the percentage of the Kwere data requested (see `PARAMS`). Finally, if a test file is provided in `PARAMS`, load the test data.

The validation, Kwere, and test data are all cleaned of unknown characters. I chose to exclude any characters found in the Kwere data but not found in the Swahili training data for the sake of staying as true to the Swahili language as possible (in the event Kwere uses a character that Swahili does not).

I am also only training on a subset of the Swahili data for the sake of time.

In [267]:
print("Loading Swahili training data:", end="\n\t")
raw_swahili = open(PARAMS['swahili_train'], 'r').read()[:3000000]
swahili_train_size, swahili_val_size = int(len(raw_swahili)*(1-PARAMS['val_split'])), int(len(raw_swahili)*PARAMS['val_split'])

swahili_train = Dataset(raw_swahili[:swahili_train_size], device)

print("Loading Swahili validation data:", end="\n\t")
cleaned_swahili_val_data = clean_data(raw_swahili[swahili_train_size:], swahili_train.chars)
swahili_val = Dataset(cleaned_swahili_val_data, device)

if PARAMS['kwere_percentage'] > 0:
    print("Loading Kwere data:", end="\n\t")
    raw_kwere = open(PARAMS['kwere'], 'r').read()
    kwere_size = int(len(raw_kwere) * PARAMS['kwere_percentage'])

    cleaned_kwere_data = clean_data(raw_kwere[:kwere_size], swahili_train.chars)
    kwere = Dataset(cleaned_kwere_data, device)


if len(PARAMS['test_data']) > 0:
    print("Loading testing data:", end="\n\t")
    raw_test = open(PARAMS['test_data'], 'r').read()

    cleaned_test_data = clean_data(raw_test, swahili_train.chars)
    test_data = Dataset(cleaned_test_data, device)

Loading Swahili training data:
	2100000 characters, 49 unique
Loading Swahili validation data:
	900000 characters, 49 unique
Loading testing data:
	3451383 characters, 49 unique


### Model Declaration
The model is very similar to those I've used in past challenges: a multilayer LSTM with dropout. I've also added the ability to input a hidden state so the state can be carried between sequences.

In [268]:
class RNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, num_layers, dropout):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(input_size, input_size)
        self.lstm = nn.LSTM(
            input_size=input_size, 
            hidden_size=hidden_size, 
            num_layers=num_layers,
            dropout = dropout if num_layers > 1 else 0
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, output_size)
        
    def forward(self, input_seq, hidden_state):
        embedding = self.embedding(input_seq)
        output, hidden_state = self.lstm(embedding, hidden_state)
        output = self.fc(self.dropout(output))
        return output, (hidden_state[0].detach(), hidden_state[1].detach())

### Loss Function
As defined in the challenge requirements, I'm using a cross entropy loss customized to use log base 2 rather than the typical natural log used in PyTorch.

I've also added an assertion that every probability distribution sums to 1.0 within a tolerance of 1/10,000.

In [269]:
def cross_entropy_loss(outputs, targets):
    batch_size = outputs.shape[0]
    outputs = nn.functional.softmax(outputs, dim=-1)
    
    for prob_dist_sum in torch.sum(outputs, dim=1):
        assert(abs(prob_dist_sum - 1) < 0.0001), "The sum of all probabilities for a character should be 1.0, but got {}".format(prob_dist_sum)
    
    outputs = torch.log2(outputs)
    outputs = outputs[range(batch_size), targets]
    
    return -torch.mean(outputs)
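
Since $\log_2 x = \ln x / \ln 2$, a loss measured in bits is simply the natural-log loss divided by $\ln 2$. The plain-Python sketch below works through this for a single made-up probability distribution; it is only an illustration of the arithmetic, not the notebook's implementation.

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]  # made-up predicted distribution over 4 characters
target = 1                          # index of the correct character

# Same sanity check as the assertion above: the distribution must sum to 1.
assert abs(sum(probs) - 1.0) < 0.0001

loss_bits = -math.log2(probs[target])  # cross entropy in bits
loss_nats = -math.log(probs[target])   # cross entropy in nats (natural log)

print(loss_bits)                 # 2.0
print(loss_nats / math.log(2))   # the same value, converted from nats to bits
```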

### Model Instantiation
Based on `PARAMS` and the determined `vocab_size` of the train data.

In [270]:
rnn = RNN(
    swahili_train.vocab_size, 
    swahili_train.vocab_size, 
    PARAMS['hidden_size'], 
    PARAMS['num_layers'],
    PARAMS['dropout'],
)

### Optimizer
Using an Adam optimizer, learning rate set in `PARAMS`.

In [271]:
loss_fn = cross_entropy_loss
optimizer = torch.optim.Adam(rnn.parameters(), lr=PARAMS['lr'])

### Learning Rate Modifier
The `set_lr` function is meant to modify the learning rate between pretraining and fine-tuning to avoid overfitting on the Swahili data.

In [272]:
def set_lr(optimizer: torch.optim.Optimizer, lr: float):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

### Model to GPU

In [273]:
rnn.to(device)

RNN(
  (embedding): Embedding(49, 49)
  (lstm): LSTM(49, 512, num_layers=4, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=49, bias=True)
)

Function to check for NaNs, used in debugging.

In [274]:
def has_nan(t: torch.Tensor) -> bool:
    if torch.sum(torch.isnan(t)) > 0:
        return True
    return False

### Train Function
Standard train function taking `seq_len` characters at a time. For each character in the series, the LSTM will predict the next character based on the previous character and the character history (represented by the hidden state). 

The hidden state is optionally carried between sequences, so events like a sequence ending mid-word should have no negative effect. I've made this optional because while it could help with cutoff words, I think restarting the hidden state every sequence could also be beneficial as a sort of dropout, in the event a particularly difficult sequence causes the hidden state to be thrown off.

In [275]:
def train(model, criterion, optimizer, data, seq_len):
    ptr = 0
    n = 0
    running_loss = 0
    hidden_state = None
    
    model.train()

    while ptr + seq_len + 1 < len(data):
        input_seq = data[ptr:ptr+seq_len].to(device)
        target_seq = data[ptr+1:ptr+seq_len+1].to(device)

        if hidden_state is not None:
            if has_nan(hidden_state[0]) or has_nan(hidden_state[1]):
                hidden_state = None
        output, hidden_state = model(input_seq, hidden_state if PARAMS['carry_hidden_state'] else None)

        try:
            loss = criterion(torch.squeeze(output), torch.squeeze(target_seq))
            assert(not torch.isnan(loss)), "The loss shouldn't be nan"
            running_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            n += 1
        except AssertionError as err:
            print("An assertion failed, skipping for now but this shouldn't happen often:\n\t{}".format(err))

        ptr += seq_len
        
    return running_loss/n
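
To make the windowing concrete, the sketch below steps through a short made-up string with the same pointer arithmetic as the loop above, collecting each input window and its target (the input shifted by one character). `windows` is a hypothetical helper used only for illustration.

```python
def windows(data: str, seq_len: int):
    """Yield (input, target) pairs; each target is its input shifted by one character."""
    ptr = 0
    while ptr + seq_len + 1 < len(data):
        yield data[ptr:ptr + seq_len], data[ptr + 1:ptr + seq_len + 1]
        ptr += seq_len

pairs = list(windows("jambo dunia", seq_len=4))
for input_seq, target_seq in pairs:
    print(repr(input_seq), "->", repr(target_seq))
# 'jamb' -> 'ambo'
# 'o du' -> ' dun'
```

Note that the windows do not overlap, and any tail shorter than a full window (here `"nia"`) is never used as input.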

### Test Function
Standard test function with optional carried hidden state.

In [276]:
@torch.no_grad()  # evaluation only: skip building the autograd graph
def test(model, criterion, data, seq_len):
    ptr = 0
    n = 0
    running_loss = 0
    hidden_state = None
    
    model.eval()

    while ptr + seq_len + 1 < len(data):
        input_seq = data[ptr:ptr+seq_len]
        target_seq = data[ptr+1:ptr+seq_len+1]

        if hidden_state is not None:
            if has_nan(hidden_state[0]) or has_nan(hidden_state[1]):
                hidden_state = None
        output, hidden_state = model(input_seq, hidden_state if PARAMS['carry_hidden_state'] else None)

        try:
            loss = criterion(torch.squeeze(output), torch.squeeze(target_seq))
            assert(not torch.isnan(loss)), "The loss shouldn't be nan"
            running_loss += loss.item()
            
            n += 1
        except AssertionError as err:
            print("An assertion failed, skipping for now but this shouldn't happen often:\n\t{}".format(err))

        ptr += seq_len
        
    return running_loss/n

### Loss Function Verification
Based on the equation for cross entropy, a randomly initialized model's loss should on average be $\log_2(\text{vocab\_size})$, since such a model assigns roughly equal probability to every character.

This number is also the baseline for verifying that the model is learning: a model whose loss is below this value has learned something from the data.

In [257]:
print("Vocab size is {}, so cross entropy with no training should be approximately {}".format(swahili_train.vocab_size, math.log(swahili_train.vocab_size, 2)))
print("Untrained loss:", end=" ")
print(test(rnn, loss_fn, swahili_val, PARAMS['seq_len']))

Vocab size is 49, so cross entropy with no training should be approximately 5.614709844115208
Untrained loss: 5.613650491827722
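
The baseline follows from a uniform prediction: if each of the `vocab_size` characters gets probability `1 / vocab_size`, the loss for any target is $-\log_2(1/\text{vocab\_size}) = \log_2(\text{vocab\_size})$. A minimal check in plain Python, using the vocabulary size of 49 reported above:

```python
import math

vocab_size = 49                   # vocabulary size printed by the data-loading cell
uniform_prob = 1.0 / vocab_size   # a uniform guess gives every character this probability

# Cross entropy in bits when the correct character is assigned uniform_prob.
baseline_bits = -math.log2(uniform_prob)
print(baseline_bits)  # approximately 5.61
```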


### Pretrain

In [277]:
if PARAMS['kwere_percentage'] > 0:
    set_lr(optimizer, PARAMS['pretrain_lr'])
    
    for epoch in range(0, PARAMS['pretrain_epochs']):
        print("-"*3 + " Pretrain Epoch {} ".format(epoch+1) + "-"*17)

        print("\tPretrain Loss:", end=" ")
        pretrain_loss = train(rnn, loss_fn, optimizer, kwere, PARAMS['seq_len'])
        print(pretrain_loss)
        
        if is_logging:
            neptune.log_metric("Pretrain Loss", pretrain_loss)

### Training

In [None]:
set_lr(optimizer, PARAMS['lr'])

for epoch in range(0, PARAMS['epochs']):
    print("-"*3 + " Epoch {} ".format(epoch+1) + "-"*25)
    
    print("\tTraining Loss:", end=" ")
    train_loss = train(rnn, loss_fn, optimizer, swahili_train, PARAMS['seq_len'])
    print(train_loss)
    if is_logging:
        neptune.log_metric("Train Loss", train_loss)
    
    print("\tValidation Loss:", end=" ")
    val_loss = test(rnn, loss_fn, swahili_val, PARAMS['seq_len'])
    print(val_loss)
    if is_logging:
        neptune.log_metric("Validation Loss", val_loss)

--- Epoch 1 -------------------------
	Training Loss: 4.267908933599924
	Validation Loss: 4.332783218065439
--- Epoch 2 -------------------------
	Training Loss: 4.228882867873604
	Validation Loss: 4.2758675821755565
--- Epoch 3 -------------------------
	Training Loss: 4.206150354252037
	Validation Loss: 

### Testing

In [None]:
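# check by name: test_data only exists if a test file was supplied in PARAMS
if 'test_data' in globals():
    print("Testing Loss:", end=" ")
    # evaluate only; the test set must never be passed to train()
    test_loss = test(rnn, loss_fn, test_data, PARAMS['seq_len'])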
    print(test_loss)