In [6]:
import numpy as np

# Word Generation With N-Grams

An N-gram model defines a probability distribution over the strings of a given language. For example, in French, the string "xyloqsjdfkljsdfqmskfdl" has a smaller probability of being a valid word than "xylophone". <big>How much smaller?</big>

### 1. The trainable model

Our model predicts the likelihood of a token given the $n$ preceding tokens. Tokens can be letters or words, **but an end token is needed.** Python strings have no such token, so we have to append it to the end of each string ourselves. We also add $n$ end-tokens at the beginning of each string, and start analyzing the string at its first non-end token.

Let $\mathcal{T}$ be the set of tokens. Here we build an *updateable* model that estimates the $|\mathcal{T}|^{n+1}$ parameters of the model. The parameters are the probabilities that an $n$-gram (a sequence of $n$ tokens) is followed by a given token, for all tokens and all $n$-grams. Let $f:\mathcal{T}\times \mathcal{T}^{n} \to [0,1]$ be defined as $$\forall t \in \mathcal{T}, w \in \mathcal{T}^n, f(t,w) := P(t | w).$$ Note that $$f(t,w)\approx \frac{\# w.t}{\# w},$$ where $\#$ stands for "the count of", and $.$ stands for string concatenation, aka "followed by". This model stores the parameters as:

- The raw counts of each $(n+1)$-gram, stored as a $(|\mathcal{T}|,|\mathcal{T}|,...,|\mathcal{T}|)$-shaped array ($n+1$ axes).
- The raw counts of each token, stored as a $(|\mathcal{T}|)$-shaped array.

These two elements make it possible to reconstruct a $|\mathcal{T}|\times |\mathcal{T}|^n$ matrix containing the associated probabilities. This model will be updated with new strings.
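
To make the link between the stored counts and $f$ concrete, here is a minimal sketch on a toy alphabet. It is only an illustration under assumptions: the axis order (the $n$ context tokens first, the predicted token last), the `index` lookup and the variable names are not part of the model above, and $\# w$ is obtained here by summing the $(n+1)$-gram counts over the last axis.

```python
import numpy as np

# Toy setup (assumed for illustration): n = 2 and a three-token alphabet,
# where the last entry plays the role of the end token.
n = 2
alphabet = ["a", "b", "\x00"]
index = {tok: i for i, tok in enumerate(alphabet)}

# One axis per position of the (n+1)-gram: shape (3, 3, 3).
counts = np.zeros([len(alphabet)] * (n + 1))

# Pretend we observed "ab" followed by "a" three times and by "b" once.
counts[index["a"], index["b"], index["a"]] = 3
counts[index["a"], index["b"], index["b"]] = 1

# #w is the sum of #(w.t) over all tokens t, i.e. a sum over the last axis.
context_counts = counts.sum(axis=-1, keepdims=True)

# f(t, w) = #(w.t) / #w, left at 0 where the context w was never seen.
probs = np.divide(counts, context_counts,
                  out=np.zeros_like(counts), where=context_counts > 0)

p_a_given_ab = probs[index["a"], index["b"], index["a"]]
print(p_a_given_ab)  # 0.75
```

Leaving unseen contexts at 0 is one choice among several; smoothing would be another, and is not covered here.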


In [4]:
class Ngram:


    def __init__(self, n: int, tokenizer, end_token, alphabet: list):
        self.tokenizer = tokenizer
        self.n = n

        # All possible tokens as an array, including the end token.
        self.alphabet: np.ndarray = np.array(alphabet + [end_token])
        self.raw_count_of_each_token = np.zeros(shape=self.alphabet.shape)

        self.ngrams: np.ndarray = ...  # initialize all possible n-grams by using the alphabet

        self.raw_count_of_each_n_plus_one_gram = np.zeros([len(alphabet) + 1] * (n + 1))

    def update_from_file(self, path: str):
        ...  # extract the list, and use update_from_list

    def update_from_list(self, l: list):
        ...  # for each word in the list,
             # for each token in the padded word, starting at index n
             # (the first token after the n padding end tokens),
             # update the respective raw counts.

    def probability(self, s: str):
        prod = 1  # the product starts at 1
        # tokenize s.
        # for each token t in s, starting at index n (after the padding)
        # and including the ending token,
        # given the preceding n-gram (:= w),
        # prod *= f(t, w)

        return prod
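
One possible way to fill in the counting and probability steps, written as standalone functions rather than as the `Ngram` methods themselves. This is a sketch under assumptions: by-character tokenization, a hypothetical `pad` helper that adds the end tokens, and a probability of 0 for contexts that were never seen.

```python
import numpy as np

def pad(tokens, n, end_token):
    """Add n leading end tokens and one trailing end token (hypothetical helper)."""
    return [end_token] * n + list(tokens) + [end_token]

def count_ngrams(words, n, alphabet, end_token):
    """Count every (n+1)-gram in the padded words; axis order is (w_1, ..., w_n, t)."""
    tokens = list(alphabet) + [end_token]
    index = {tok: i for i, tok in enumerate(tokens)}
    counts = np.zeros([len(tokens)] * (n + 1))
    for word in words:
        ids = [index[tok] for tok in pad(word, n, end_token)]
        # Slide a window of n + 1 tokens: n tokens of context, then the predicted token.
        for i in range(n, len(ids)):
            counts[tuple(ids[i - n:i + 1])] += 1
    return counts, index

def probability(word, counts, index, n, end_token):
    """Product of f(t, w) over the word's tokens, including the final end token."""
    ids = [index[tok] for tok in pad(word, n, end_token)]
    prod = 1.0
    for i in range(n, len(ids)):
        context = tuple(ids[i - n:i])
        total = counts[context].sum()  # this is #w
        if total == 0:
            return 0.0  # unseen context: no estimate available, report 0
        prod *= counts[context + (ids[i],)] / total
    return prod

counts, index = count_ngrams(["ab", "ab", "aa"], n=1, alphabet="ab", end_token="\x00")
print(probability("ab", counts, index, n=1, end_token="\x00"))  # 0.5
```

With these three training words and $n=1$, the value for "ab" is $1 \times \frac{2}{4} \times 1 = 0.5$: every word starts with "a", "a" is followed by "b" in two of its four occurrences, and "b" is always followed by the end token.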



The following code should now work:

In [11]:
N = 3
END_TOKEN = '\x00'
PATH_TO_STRINGS_CSV = ...
ALPHABET = list("abcdefghijklmnopqrstuvwxyz")

def t(s: str):
    # Example tokenizer: split by character.
    # A finished string must end with an END_TOKEN. It must also be padded with
    # n end tokens at the beginning so that words shorter than n still work.
    return list(s)

# Generate a new model from the constructor, choosing n and a tokenizer
# (it must also be possible to `load` a model from csv).
model_fr = Ngram(n=N, tokenizer=t, end_token=END_TOKEN, alphabet=ALPHABET)

model_fr.update_from_list(["chaussettes", "chien"])  # train the model using a list of strings
model_fr.update_from_file(PATH_TO_STRINGS_CSV)  # train the model using a csv file containing correct strings


In [27]:
model_fr.probability("chaussette")

1

We can now implement storing and loading. Note that the following code is less important than what comes before. If you are implementing this, your code works, and you don't feel like implementing stores and loads, just go to part 2.

In [13]:
PATH_TO_MODEL_CSV = ...

def store(self, path: str):
    ...  # the file must contain:
    # - the raw_count_of_each_n_plus_one_gram
    # - the raw_count_of_each_token
    # - n
    # loaded models will ASSUME that tokenization is by letter


def load_from(path:str)->Ngram:
    ...

Ngram.store = store
Ngram.load_from = load_from
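
A minimal sketch of what storing and loading could look like, assuming a three-row CSV layout (first `n`, then the per-token counts, then the flattened $(n+1)$-gram counts). The function names `store_counts` and `load_counts` and the layout itself are hypothetical; they operate on the raw arrays rather than on an `Ngram` instance.

```python
import csv
import numpy as np

def store_counts(path, n, token_counts, ngram_counts):
    """Write n, the per-token counts and the flattened (n+1)-gram counts as three CSV rows."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([n])
        writer.writerow(token_counts.tolist())
        writer.writerow(ngram_counts.ravel().tolist())

def load_counts(path):
    """Read the three rows back and restore the (n+1)-dimensional shape."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    n = int(rows[0][0])
    token_counts = np.array([float(x) for x in rows[1]])
    size = len(token_counts)  # |T|, end token included
    flat = np.array([float(x) for x in rows[2]])
    ngram_counts = flat.reshape([size] * (n + 1))
    return n, token_counts, ngram_counts
```

The alphabet size is recovered from the length of the per-token row, which is why the $(n+1)$-gram counts can be stored flat. The alphabet itself is not stored in this layout, in line with the assumption that loaded models use by-letter tokenization.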

The following should now be possible:

In [14]:
model_fr.store(PATH_TO_MODEL_CSV)  # store the model in a csv file
...  # do stuff with the model
model_fr = Ngram.load_from(PATH_TO_MODEL_CSV)  # restore the model to its stored state from the csv file



The end product of this first part is a model that gives the probability of any given string.

### 2. Word generation

Time to generate new words! To do so, we need strings of tokens whose beginning is padded with $n$ end-tokens. We then look up the most likely token to follow in the table obtained before.

First, let us define a function that finds the most likely following token, given an $n$-gram.
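
As a rough sketch of such a function, assuming the $(n+1)$-gram counts are indexed with the $n$ context tokens first and the predicted token last. The name `most_likely_next`, its signature and the toy counts are illustrative only. Since $\# w$ is the same for every candidate token, taking the argmax over the raw counts gives the same answer as taking it over the probabilities.

```python
import numpy as np

def most_likely_next(context, counts, tokens):
    """Return the token with the highest count after `context` (a sequence of n tokens)."""
    index = {tok: i for i, tok in enumerate(tokens)}
    row = counts[tuple(index[tok] for tok in context)]  # #(w.t) for every token t
    if row.sum() == 0:
        return None  # context never observed: nothing to rank
    return tokens[int(np.argmax(row))]

# Toy counts for n = 1 over the tokens "a", "b" and an end token.
tokens = ["a", "b", "\x00"]
counts = np.array([[1.0, 2.0, 1.0],   # after "a"
                   [0.0, 0.0, 2.0],   # after "b"
                   [3.0, 0.0, 0.0]])  # after the end token (start of a word)

print(most_likely_next(["\x00"], counts, tokens))  # a
print(most_likely_next(["a"], counts, tokens))     # b
```

Ties are broken by `np.argmax`, which returns the first maximal index.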

In [None]:
def