# Kwere Character-Level Vanilla N-Gram Language Model

**I started with this implementation, but its performance is suboptimal. I have left it here for documentation only. Please see the other "From Scratch" implementations for better performance.**

### Parameters
Dictionary containing all parameters for ease of tuning. These will be logged to the neptune logger below.

**To add test data, enter the test file name in the `test_data` parameter.**

In [1]:
PARAMS = {
    'experiment_name': "Kwere",
    'tags': ["kwere", "from scratch"],
    'n': 5,
    'train_iterations': 5,
    'carry_hidden_state': False,
    'val_split': 0.3,
    'kwere_train': "./cwe-train.txt",
    'pretrain_iterations': 5,
    'pretrain_percentage': 0.05, 
    'swahili': "./sw-train.txt",
    'test_data': "./cwe-test.txt"
}

`math` is the only import. Its `log` function is used to compute the cross-entropy loss.

In [2]:
import math

### Dataset Class
The `Dataset` class builds the set of unique characters found in the supplied data, the total number of characters, the number of unique characters, a mapping from each character to its ID, a mapping from each character ID back to its character (for making outputs readable), and a list holding every character of the data converted to its ID.

The `Dataset` will also generate a `~` character to be used in place of any characters unknown to the model (i.e. anything not in the training set). See the `clean_data` function below.

Inputs:
 - `raw_data`: `string` of all characters from the provided data in order

In [3]:
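class Dataset():
    def __init__(self, raw_data: str):
        # Unique characters, plus '~' as the placeholder for unknown characters.
        self.chars = set(raw_data)
        self.chars.add('~')
        self.data_size, self.vocab_size = len(raw_data), len(self.chars)
        print("{} characters, {} unique".format(self.data_size, self.vocab_size))
        
        # Sort before enumerating so that datasets with the same character set
        # always assign the same ID to each character (set order is arbitrary).
        ordered_chars = sorted(self.chars)
        self.char_to_idx = { char: idx for idx, char in enumerate(ordered_chars) }
        self.idx_to_char = { idx: char for idx, char in enumerate(ordered_chars) }
        
        # The whole text as a list of character IDs.
        self.data = [self.char_to_idx[char] for char in raw_data]
    
    def __len__(self):
        return self.data_size
    
    def __getitem__(self, index):
        # Supports both integer indices and slices.
        return self.data[index]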
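
As a small illustration of the mappings described above, the sketch below builds them for a toy string. It is standalone and does not use the `Dataset` class; the variable names and the sorted ordering are assumptions made for this example only.

```python
# Toy illustration of character <-> ID mappings (names are illustrative only).
raw_data = "banana"

# Unique characters plus the '~' unknown marker, sorted for a stable order.
chars = sorted(set(raw_data) | {"~"})

char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for char, idx in char_to_idx.items()}

# Encode the text as IDs, then decode it back to check the round trip.
encoded = [char_to_idx[char] for char in raw_data]
decoded = "".join(idx_to_char[idx] for idx in encoded)

print(chars)
print(encoded)
print(decoded)
```

The round trip through `idx_to_char` is what makes model outputs readable again after they have been handled as IDs.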

### Data Cleaning
The `clean_data` function replaces any unknown characters in the provided data with the designated unknown character, `~`. I'm essentially forfeiting these characters if they ever appear in the testing data, since I likely couldn't predict them correctly anyway, given that the model did not see them during training (unless they appear in the Swahili data, but see my explanation below for that decision).

Inputs:
 - `raw_data`: `string` of raw data read directly from file
 - `known_chars`: collection of characters (here, a `set` of `string`) to keep in the data. Everything not in this collection will be replaced.

In [4]:
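def clean_data(raw_data: str, known_chars: set) -> str:
    # Replace every character outside known_chars with the unknown marker '~'.
    # Joining once avoids repeated string concatenation on large inputs.
    return "".join(char if char in known_chars else "~" for char in raw_data)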
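
To make the replacement behaviour concrete, here is a minimal standalone sketch on a toy input. The function name `replace_unknown`, its `unknown` parameter, and the sample text are hypothetical and chosen only for this illustration.

```python
def replace_unknown(text: str, known_chars: set, unknown: str = "~") -> str:
    """Replace every character outside known_chars with the unknown marker."""
    return "".join(char if char in known_chars else unknown for char in text)


# Pretend the training data only contained these characters.
known = set("abdehir ")

cleaned = replace_unknown("habari, dunia!", known)
print(cleaned)
```

The cleaned text keeps its original length, so character positions still line up with the raw data.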

### Data Loading
Load the Kwere training data and split it into training and validation sets based on the provided ratio (`val_split`). Then load the requested percentage of the Swahili data for pretraining (see `pretrain_percentage` in `PARAMS`). Finally, if a test file is provided in `PARAMS`, load the test data.

The validation, Swahili, and test data are all cleaned of unknown characters. I chose to exclude any characters found in the Swahili data but not found in the Kwere training data for the sake of staying as true to the Kwere language as possible (in the event Swahili uses a character that Kwere does not).

In [5]:
print("Loading Kwere training data:", end="\n\t")
with open(PARAMS['kwere_train'], 'r') as f:
    raw_kwere = f.read()
# The first (1 - val_split) of the text is used for training, the rest for validation.
kwere_train_size = int(len(raw_kwere)*(1-PARAMS['val_split']))

train_data = Dataset(raw_kwere[:kwere_train_size])

print("Loading Kwere validation data:", end="\n\t")
cleaned_kwere_val_data = clean_data(raw_kwere[kwere_train_size:], train_data.chars)
val_data = Dataset(cleaned_kwere_val_data)


if PARAMS['pretrain_percentage'] > 0:
    print("Loading Swahili data:", end="\n\t")
    with open(PARAMS['swahili'], 'r') as f:
        raw_swahili = f.read()
    # Only the first pretrain_percentage of the Swahili text is used.
    swahili_size = int(len(raw_swahili) * PARAMS['pretrain_percentage'])

    cleaned_swahili_data = clean_data(raw_swahili[:swahili_size], train_data.chars)
    pretrain_data = Dataset(cleaned_swahili_data)


if len(PARAMS['test_data']) > 0:
    print("Loading Testing data:", end="\n\t")
    with open(PARAMS['test_data'], 'r') as f:
        raw_test = f.read()

    cleaned_test_data = clean_data(raw_test, train_data.chars)
    test_data = Dataset(cleaned_test_data)

Loading Kwere training data:
	422402 characters, 32 unique
Loading Kwere validation data:
	181030 characters, 32 unique
Loading Swahili data:
	1963053 characters, 32 unique
Loading Testing data:
	61717 characters, 32 unique
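
The training and validation sizes printed above follow directly from the split arithmetic. The sketch below reproduces it on a placeholder string of the same total length (422402 + 181030 = 603432 characters). The helper name `split_text` is hypothetical, and the placeholder string stands in for the real Kwere file.

```python
def split_text(text: str, val_split: float):
    """Split text so that the last val_split fraction is held out for validation."""
    train_size = int(len(text) * (1 - val_split))
    return text[:train_size], text[train_size:]


# Placeholder text with the same length as the Kwere data reported above.
train_text, val_text = split_text("x" * 603432, 0.3)
print(len(train_text), len(val_text))
```

Because the split is positional, the validation set is the tail of the file rather than a random sample.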


In [6]:
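def init_matrix(vocab, n: int) -> dict:
    # Build a nested dict of depth n+1: n levels of context characters,
    # then a final level holding a zero count for each possible next character.
    if n > 0:
        return {i:init_matrix(vocab, n-1) for i in vocab}
    else:
        return {i:0 for i in vocab}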

In [7]:
def increment_count(char: int, sequence: list, count_matrix: dict) -> dict:
    # Walk down the nested dict along the context `sequence`, then
    # increment the count of `char` at the innermost level.
    if len(sequence) == 0:
        count_matrix[char] += 1
    else:
        count_matrix[sequence[0]] = increment_count(char, sequence[1:], count_matrix[sequence[0]])
    return count_matrix

In [8]:
def iterate_counts(data: Dataset, n: int, count_matrix: dict):
    # For every position idx >= n, count data[idx] following the n preceding characters.
    for idx in range(n, len(data)):
        sequence = data[idx-n:idx]
        
        count_matrix = increment_count(data[idx], sequence, count_matrix)
    return count_matrix
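
The nested dictionary built by `init_matrix` allocates a slot for every possible context, so its size grows as `vocab_size ** (n + 1)`. As one possible alternative (not what this notebook uses), the counts can be stored sparsely, keyed by the context tuple, so that only contexts actually observed take up memory. The function name `count_ngrams` and the toy ID sequence are assumptions for illustration.

```python
from collections import Counter, defaultdict


def count_ngrams(ids: list, n: int) -> dict:
    """Map each observed n-character context (a tuple) to a Counter of next characters."""
    counts = defaultdict(Counter)
    for idx in range(n, len(ids)):
        context = tuple(ids[idx - n:idx])
        counts[context][ids[idx]] += 1
    return counts


# Toy sequence of character IDs, counted with a context length of n = 2.
ids = [0, 1, 0, 1, 2, 0, 1, 0]
counts = count_ngrams(ids, 2)
print(counts[(0, 1)])
```

With this layout, a context that was never seen simply has no entry, so smoothing would need to supply the zero counts at lookup time.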

In [9]:
count_matrix = init_matrix(train_data.idx_to_char.keys(), PARAMS['n'])

In [10]:
# pretrain_data only exists when pretraining is enabled in PARAMS.
if PARAMS['pretrain_percentage'] > 0:
    print("Fitting on pretrain data...")
    count_matrix = iterate_counts(pretrain_data, PARAMS['n'], count_matrix)
print("Fitting on train data...")
count_matrix = iterate_counts(train_data, PARAMS['n'], count_matrix)

Fitting on pretrain data...
Fitting on train data...


In [11]:
def probabilities_from_counts(counts: dict):
    # Add-one (Laplace) smoothing: every character gets a count of at least 1.
    counts = {key: count + 1 for key, count in counts.items()}
    
    # Compute the total once rather than once per key.
    total = sum(counts.values())
    probabilities = {key: count / total for key, count in counts.items()}
    prob_sum = sum(probabilities.values())
    assert abs(prob_sum - 1) < 0.0001, "Probabilities should sum to 1.0 but got {}".format(prob_sum)
    
    return probabilities
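
A small worked example may help show what add-one smoothing does to a count table. The sketch below is standalone; the function name `add_one_probabilities` and the toy counts are illustrative assumptions.

```python
def add_one_probabilities(counts: dict) -> dict:
    """Add one to every count, then normalize so the values sum to 1."""
    smoothed = {key: count + 1 for key, count in counts.items()}
    total = sum(smoothed.values())
    return {key: count / total for key, count in smoothed.items()}


# 'c' was never observed after this context, yet it keeps a non-zero probability.
probs = add_one_probabilities({"a": 3, "b": 1, "c": 0})
print(probs)
```

The smoothed counts are 4, 2, and 1 out of a total of 7, so the unseen character `c` receives probability 1/7 instead of 0, which keeps its loss finite.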

In [12]:
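def get_probabilities_for_sequence(sequence: list, count_matrix: dict):
    # Follow the context one character at a time down the nested dict,
    # then convert the innermost counts into smoothed probabilities.
    if len(sequence) == 0:
        return probabilities_from_counts(count_matrix)
    else:
        return get_probabilities_for_sequence(sequence[1:], count_matrix[sequence[0]])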

In [13]:
def calc_loss(target_prob):
    # Cross-entropy of a single prediction, in bits (log base 2).
    return -math.log(target_prob, 2)

In [14]:
def eval(data: Dataset, n: int, count_matrix: dict):
    # Note: this name shadows Python's built-in eval().
    # Returns the average cross-entropy (bits per character) over data.
    print("Evaluating...")
    
    counter = 0
    running_loss = 0
    
    for idx in range(n, len(data)):
        sequence = data[idx-n:idx]

        probabilities: dict = get_probabilities_for_sequence(sequence, count_matrix)
        target: int = data[idx]
        target_prob: float = probabilities[target]
        
        running_loss += calc_loss(target_prob)
        counter += 1
        
    return running_loss / counter
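
The value returned by the evaluation loop is the mean of `-log2(p)` over the probabilities assigned to the true characters, i.e. bits per character. The standalone sketch below computes that average for a few made-up probabilities. The function name `average_bits` and the inputs are assumptions for illustration.

```python
import math


def average_bits(target_probs: list) -> float:
    """Mean of -log2(p) over the probabilities assigned to the true characters."""
    return sum(-math.log2(p) for p in target_probs) / len(target_probs)


# Probabilities 1/2, 1/4, and 1/8 cost 1, 2, and 3 bits respectively.
example = average_bits([0.5, 0.25, 0.125])
print(example)

# A model that guesses uniformly over a 32-character vocabulary.
uniform = average_bits([1 / 32] * 10)
print(uniform)
```

A uniform guess over 32 characters costs 5 bits per character, which gives a rough baseline for reading the reported losses.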

In [15]:
# eval returns only the average loss, so there is no accuracy to unpack.
train_loss = eval(train_data, PARAMS['n'], count_matrix)
print("Train Loss: ", train_loss)

val_loss = eval(val_data, PARAMS['n'], count_matrix)
print("Validation Loss: ", val_loss)

Evaluating...
Train Loss:  2.2392918598291423
Evaluating...
Validation Loss:  3.418253588269885


In [None]:
# Only evaluate on the test set if it was loaded above.
if 'test_data' in globals():
    test_loss = eval(test_data, PARAMS['n'], count_matrix)
    print("Test Loss: ", test_loss)