# Language Generation
Generating sentences with conditional probabilities

Author: Pierre Nugues

## Reading a Corpus
A utility function that lists the files in a folder whose names end with a given suffix

In [None]:
import os
def get_files(dir, suffix):
    """
    Returns the names of the files in a folder that end with suffix
    :param dir: path to the folder
    :param suffix: file name ending, for instance 'txt'
    :return: the list of file names
    """
    files = []
    for file in os.listdir(dir):
        if file.endswith(suffix):
            files.append(file)
    return files

## Tokenizer
An elementary tokenizer

In [None]:
import regex as re
def tokenize(text):
    """
    Uses the letters to break the text into words.
    Returns a list of strings, one per word
    """
    words = re.findall(r'\p{L}+', text)
    return words
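
To see what such a tokenizer returns, here is a small sketch on a made-up sentence. It assumes the standard-library `re` module instead of `regex`, with the class `[^\W\d_]+` as an approximation of `\p{L}+`. The name `tokenize_stdlib` and the sample sentence are illustrative and not part of the notebook.

```python
import re

def tokenize_stdlib(text):
    # [^\W\d_] keeps word characters that are neither digits nor underscores
    return re.findall(r'[^\W\d_]+', text)

# Made-up sample sentence, lowercased as in the notebook
sample = 'Nils följde 14 vildgäss över Skåne.'
tokens = tokenize_stdlib(sample.lower())
print(tokens)
```

The digits and the punctuation are dropped, and letters outside ASCII such as `ä` and `ö` stay inside their words.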

## Reading the Files
We read a corpus of novels. The active `folder` points to the Selma corpus; the Dickens corpus is the commented-out alternative.

In [None]:
folder = '/Users/pierre/Documents/Cours/EDAN20/corpus/Selma/'
#folder = '/Users/pierre/Documents/Cours/EDAN20/corpus/Dickens/'
files = get_files(folder, 'txt')
files

We tokenize the texts

In [None]:
words = []
for file in files:
    # Close each file once it has been read
    with open(os.path.join(folder, file)) as f:
        text = f.read().lower().strip()
    words += tokenize(text)
words[:10]

## N-gram functions

In [None]:
def count_unigrams(words):
    frequency = {}
    for word in words:
        if word in frequency:
            frequency[word] += 1
        else:
            frequency[word] = 1
    return frequency

In [None]:
def count_bigrams(words):
    bigrams = [tuple(words[idx:idx + 2])
               for idx in range(len(words) - 1)]
    frequencies = {}
    for bigram in bigrams:
        if bigram in frequencies:
            frequencies[bigram] += 1
        else:
            frequencies[bigram] = 1
    return frequencies
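
As a quick check of what these counts look like, the sketch below applies the same idea to a toy word list. It assumes `collections.Counter` as a shortcut for the explicit dictionaries; the names `toy_words`, `toy_unigrams`, and `toy_bigrams` are illustrative.

```python
from collections import Counter

# A toy token list standing in for the corpus
toy_words = ['the', 'cat', 'saw', 'the', 'dog', 'and', 'the', 'cat']

# Unigram counts: one entry per distinct word
toy_unigrams = Counter(toy_words)
# Bigram counts: pair each word with its successor
toy_bigrams = Counter(zip(toy_words, toy_words[1:]))

print(toy_unigrams['the'])
print(toy_bigrams[('the', 'cat')])
print(toy_bigrams[('the', 'dog')])
```

A list of N words yields N - 1 bigrams, here 7 for 8 words.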

### We count the unigrams and bigrams

In [None]:
unigrams = count_unigrams(words)

In [None]:
unigrams['nils']

In [None]:
bigrams = count_bigrams(words)

## Conditional Probabilities

Given a bigram, $w_n, w_{n+1}$, we compute $P(w_{n+1}|w_n)$. We estimate it from the counts as $\frac{count(w_n, w_{n+1})}{count(w_n)}$.

In [None]:
probs = {k: v/unigrams[k[0]] for k, v in bigrams.items()}
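
To make the formula concrete, here is a minimal sketch on a toy word list, assuming `collections.Counter` for the counts. The names `toy_words` and `toy_probs` are illustrative.

```python
from collections import Counter

toy_words = ['the', 'cat', 'saw', 'the', 'dog', 'and', 'the', 'cat']
toy_unigrams = Counter(toy_words)
toy_bigrams = Counter(zip(toy_words, toy_words[1:]))

# P(next | word) = count(word, next) / count(word)
toy_probs = {bigram: count / toy_unigrams[bigram[0]]
             for bigram, count in toy_bigrams.items()}

# 'the' occurs 3 times: twice before 'cat', once before 'dog'
print(toy_probs[('the', 'cat')], toy_probs[('the', 'dog')])

# Sum of the probabilities of all the successors of 'cat'
cat_total = sum(p for (first, _), p in toy_probs.items() if first == 'cat')
print(cat_total)
```

The successors of `the` sum to 1, but those of `cat` sum to 0.5: the last token of the list is counted as a unigram without starting any bigram. In a large corpus this affects only the final word and is usually negligible.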

### Extracting the conditional probabilities of a word

In [None]:
def cond_prob(word):
    cprob = sorted([(k, v) for k, v in probs.items() if k[0] == word],
                    key=lambda tup: tup[1], reverse=True)
    return cprob
cond_prob('nils')

### Drawing samples from a multinomial distribution

Understanding the `np.random.multinomial` function

In [None]:
import numpy as np
np.random.seed(0)
for i in range(10):
    print(np.random.multinomial(1, [0.3, 0.5, 0.2]))

Over a large number of draws, the relative frequencies of the three outcomes approach 0.3, 0.5, and 0.2

In [None]:
draws = []
for i in range(100000):
    draws.append(np.random.multinomial(1, [0.3, 0.5, 0.2]))
np.sum(draws, axis=0)
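
The same totals can be obtained with a single call by passing the number of draws as the first argument. The sketch below assumes NumPy's `default_rng` generator interface rather than the global `np.random` functions; the names `rng`, `n_draws`, and `counts` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# One call with n = n_draws returns the count of each outcome directly
counts = rng.multinomial(n_draws, [0.3, 0.5, 0.2])
print(counts)
print(counts / n_draws)
```

Dividing the counts by `n_draws` gives relative frequencies that can be compared with the three probabilities.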

### And finally, generating a sequence

In [None]:
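word = 'nils'
print(word, end=' ')
for _ in range(100):
    cprob = cond_prob(word)
    if not cprob:
        # No bigram starts with this word: stop the generation
        break
    distribution = [prob for _bigram, prob in cprob]
    # Draw one successor according to the conditional probabilities
    bigram = cprob[np.argmax(np.random.multinomial(1, distribution))]
    word = bigram[0][1]
    print(word, end=' ')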
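
The generation loop can also be sketched end to end on a toy word list. This version is a variant under stated assumptions, not the notebook's method: it swaps `np.random.multinomial` for the standard-library `random.choices`, which accepts unnormalized counts as weights, and it groups the successors by first word so that each step avoids scanning all the bigrams. The names `toy_words`, `successors`, and `generate` are illustrative.

```python
import random
from collections import Counter, defaultdict

toy_words = ['the', 'cat', 'saw', 'the', 'dog', 'and', 'the', 'cat']

# Group the bigram counts by their first word
successors = defaultdict(Counter)
for first, second in zip(toy_words, toy_words[1:]):
    successors[first][second] += 1

def generate(start, length, seed=0):
    """Generates up to `length` words after `start` by sampling successors."""
    rng = random.Random(seed)
    sequence = [start]
    word = start
    for _ in range(length):
        if word not in successors:
            break  # dead end: this word is never followed by another one
        candidates = list(successors[word])
        weights = [successors[word][c] for c in candidates]
        # The weights are raw counts; random.choices normalizes them
        word = rng.choices(candidates, weights=weights, k=1)[0]
        sequence.append(word)
    return sequence

generated = generate('the', 10)
print(' '.join(generated))
```

Every consecutive pair in the output is a bigram seen in the word list, and a start word with no known successor returns a one-word sequence.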