# Language Generation
Generating sentences with conditional probabilities

Author: Pierre Nugues

## Reading a Corpus
Utility function to list the files in a folder whose names end with a given suffix

In [14]:
import os
def get_files(dir, suffix):
    """
    Returns all the files in a folder ending with suffix
    :param dir: the path of the folder
    :param suffix: the ending of the file names to keep, for instance 'txt'
    :return: the list of file names
    """
    files = []
    for file in os.listdir(dir):
        if file.endswith(suffix):
            files.append(file)
    return files

## Tokenizer
An elementary tokenizer

In [15]:
import regex as re
def tokenize(text):
    """
    Uses the letters to break the text into words.
    Returns a list of strings, one per word
    """
    words = re.findall(r'\p{L}+', text)
    return words
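
To see what this tokenizer keeps and discards, here is a small sketch on a made-up sentence. It assumes the standard-library `re` module and the pattern `[^\W\d_]+` as a stand-in for `\p{L}+`, which needs the third-party `regex` module; the two patterns are not guaranteed to agree on every Unicode character. The function name `tokenize_stdlib` and the sample sentence are hypothetical.

```python
import re

def tokenize_stdlib(text):
    """Return the runs of letters in text as a list of strings."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore
    return re.findall(r'[^\W\d_]+', text)

# Hypothetical sample sentence, not taken from the corpus
sample = "Master Copperfield's 2 aunts, said Mr. Dick!"
tokens = tokenize_stdlib(sample.lower())
print(tokens)
# ['master', 'copperfield', 's', 'aunts', 'said', 'mr', 'dick']
```

Digits and punctuation are dropped, and the apostrophe splits the possessive into two tokens, so a lone `s` can appear as a word.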

## Reading the Files
We read a corpus of novels by Dickens

In [16]:
#folder = '/Users/pierre/Documents/Cours/EDAN20/corpus/Selma/'
folder = '/Users/pierre/Documents/Cours/EDAN20/corpus/Dickens/'
files = get_files(folder, 'txt')
files

['Hard Times.txt',
 'Oliver Twist.txt',
 'Great Expectations.txt',
 'The Old Curiosity Shop.txt',
 'A Tale of Two Cities.txt',
 'Dombey and Son.txt',
 'The Pickwick Papers.txt',
 'Bleak House.txt',
 'Our Mutual Friend.txt',
 'The Mystery of Edwin Drood.txt',
 'Nicholas Nickleby.txt',
 'David Copperfield.txt',
 'Little Dorrit.txt',
 'A Christmas Carol in Prose.txt']

We tokenize the texts

In [17]:
words = []
for file in files:
    # The with statement closes each file once it has been read
    with open(os.path.join(folder, file)) as f:
        text = f.read().lower().strip()
    words += tokenize(text)
words[:10]

['hard',
 'times',
 'and',
 'reprinted',
 'pieces',
 'by',
 'charles',
 'dickens',
 'with',
 'illustrations']

## N-gram functions

In [18]:
def count_unigrams(words):
    frequency = {}
    for word in words:
        if word in frequency:
            frequency[word] += 1
        else:
            frequency[word] = 1
    return frequency

In [19]:
def count_bigrams(words):
    bigrams = [tuple(words[idx:idx + 2])
               for idx in range(len(words) - 1)]
    frequencies = {}
    for bigram in bigrams:
        if bigram in frequencies:
            frequencies[bigram] += 1
        else:
            frequencies[bigram] = 1
    return frequencies
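
The two counting functions can also be written with `collections.Counter` from the standard library. This is a sketch of an equivalent formulation on a toy word list, not the code used in the rest of the notebook; the names `count_unigrams_counter`, `count_bigrams_counter`, and `toy_words` are assumptions.

```python
from collections import Counter

def count_unigrams_counter(words):
    """Count the occurrences of each word."""
    return Counter(words)

def count_bigrams_counter(words):
    """Count the occurrences of each pair of adjacent words."""
    # zip pairs each word with its successor
    return Counter(zip(words, words[1:]))

# Hypothetical toy data
toy_words = ['the', 'master', 'said', 'the', 'master', 'was', 'the', 'boy']
toy_unigrams = count_unigrams_counter(toy_words)
toy_bigrams = count_bigrams_counter(toy_words)
print(toy_unigrams['the'])              # 3
print(toy_bigrams[('the', 'master')])   # 2
```

A `Counter` behaves like a dictionary, so it could be used wherever the frequency dictionaries are expected.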

### We count the unigrams and bigrams

In [20]:
unigrams = count_unigrams(words)

In [21]:
unigrams['master']

1158

In [22]:
bigrams = count_bigrams(words)

## Conditional Probabilities

Given a bigram $w_n, w_{n+1}$, we compute the conditional probability $P(w_{n+1}|w_n)$. We estimate it with the relative frequency $\frac{count(w_n, w_{n+1})}{count(w_n)}$.

In [23]:
probs = {k: v/unigrams[k[0]] for k, v in bigrams.items()}
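
As a check on the formula, here is a small worked example on a toy word list (hypothetical data, not the Dickens corpus). The probabilities of the words following a given word sum to one, except for the very last word of the corpus, whose final occurrence has no successor.

```python
from collections import Counter

# Hypothetical toy data
toy_words = ['the', 'master', 'said', 'the', 'master', 'was', 'the', 'boy']
toy_unigrams = Counter(toy_words)
toy_bigrams = Counter(zip(toy_words, toy_words[1:]))

# P(w2 | w1) = count(w1, w2) / count(w1)
toy_probs = {bigram: count / toy_unigrams[bigram[0]]
             for bigram, count in toy_bigrams.items()}

# 'the' occurs 3 times and is followed by 'master' twice
print(toy_probs[('the', 'master')])
# 'master' occurs twice and is followed by 'said' once
print(toy_probs[('master', 'said')])

# The probabilities of all the successors of 'the' add up to one
total_after_the = sum(p for (w1, _), p in toy_probs.items() if w1 == 'the')
print(total_after_the)
```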

### Extracting the conditional probabilities of a word

In [24]:
def cond_prob(word):
    cprob = sorted([(k, v) for k, v in probs.items() if k[0] == word],
                    key=lambda tup: tup[1], reverse=True)
    return cprob
cond_prob('master')

[(('master', 'copperfield'), 0.10449050086355786),
 (('master', 's'), 0.09930915371329879),
 (('master', 'of'), 0.06303972366148532),
 (('master', 'and'), 0.05958549222797927),
 (('master', 'bates'), 0.04317789291882556),
 (('master', 'said'), 0.023316062176165803),
 (('master', 'i'), 0.023316062176165803),
 (('master', 'bitherstone'), 0.02072538860103627),
 (('master', 'bardell'), 0.018134715025906734),
 (('master', 'was'), 0.016407599309153715),
 (('master', 'in'), 0.015544041450777202),
 (('master', 'at'), 0.012089810017271158),
 (('master', 'paul'), 0.012089810017271158),
 (('master', 'to'), 0.010362694300518135),
 (('master', 'davy'), 0.010362694300518135),
 (('master', 'the'), 0.009499136442141624),
 (('master', 'you'), 0.008635578583765112),
 (('master', 'crummleses'), 0.008635578583765112),
 (('master', 'he'), 0.007772020725388601),
 (('master', 'with'), 0.007772020725388601),
 (('master', 'who'), 0.007772020725388601),
 (('master', 'that'), 0.007772020725388601),
 (('master', 
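
`cond_prob` scans every bigram each time it is called. One possible alternative, sketched here on toy probabilities, groups the bigrams by their first word once, so that each lookup only touches the relevant successors. The names `build_successor_index` and `toy_probs`, and the probability values, are illustrative assumptions.

```python
from collections import defaultdict

def build_successor_index(probs):
    """Group (successor, probability) pairs by the first word of each bigram."""
    index = defaultdict(list)
    for (first, second), prob in probs.items():
        index[first].append((second, prob))
    # Sort each successor list by decreasing probability
    for successors in index.values():
        successors.sort(key=lambda pair: pair[1], reverse=True)
    return index

# Hypothetical probabilities, not computed from the corpus
toy_probs = {('master', 'said'): 0.25, ('master', 'copperfield'): 0.5,
             ('master', 'of'): 0.25, ('said', 'the'): 1.0}
successor_index = build_successor_index(toy_probs)
print(successor_index['master'])
# [('copperfield', 0.5), ('said', 0.25), ('of', 0.25)]
```

The index is built in one pass over the bigrams; the trade-off is the extra memory for the grouped lists.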

### Drawing samples from a multinomial distribution

Understanding the `np.random.multinomial` function: with a single trial, each draw returns a one-hot vector that marks the selected outcome

In [25]:
import numpy as np
np.random.seed(0)
for i in range(10):
    print(np.random.multinomial(1, [0.3, 0.5, 0.2]))

[0 0 1]
[0 1 0]
[0 1 0]
[0 0 1]
[1 0 0]
[0 0 1]
[0 1 0]
[1 0 0]
[0 1 0]
[0 0 1]


Over a large number of draws, the proportions of the outcomes approach the probabilities 0.3, 0.5, and 0.2

In [26]:
draws = []
for i in range(100000):
    draws.append(np.random.multinomial(1, [0.3, 0.5, 0.2]))
np.sum(draws, axis=0)

array([30071, 49842, 20087])
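
Each draw with one trial is a one-hot vector, so `np.argmax` recovers the index of the selected outcome. The sketch below illustrates this step; it uses NumPy's `default_rng` generator rather than the global `np.random.seed` state, which is a choice made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
pvals = [0.3, 0.5, 0.2]

# One trial: exactly one entry of the result is 1
one_hot = rng.multinomial(1, pvals)
# Position of that entry, i.e. the index of the outcome drawn
index = int(np.argmax(one_hot))
print(one_hot, index)

# rng.choice draws an index directly with the same probabilities
direct_index = int(rng.choice(len(pvals), p=pvals))
print(direct_index)
```

Both approaches select an index according to `pvals`; the one-hot form is convenient when the draws are summed, as in the counts over many draws.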

### And finally, generating a sequence

In [29]:
word = 'master'
print(word, end=' ')
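for _ in range(100):
    cprob = cond_prob(word)
    # Probabilities of the words that can follow the current word
    distribution = [prob for _, prob in cprob]
    # Draw one bigram according to these probabilities
    bigram = cprob[np.argmax(np.random.multinomial(1, distribution))]
    word = bigram[0][1]
    print(word, end=' ')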

master that made me a hearty of the next demands the misanthrope in the contents of him to know what s thumbs and noiselessly turned upon a highly celebrated day who had done to the chimney indicated in that the child anything in a boy the person of fetching her and taking the contrary i should hazard to leave it is there was interrupted mr dolloby rolled back as i suppose that covered bed in his breast why then sat at the parents boldly remarked that what does my clothes baskets washing his side of these losses maybe you my heart 
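
The generation loop can be sketched end to end on a toy word list. This version swaps `np.random.multinomial` for `random.choices` from the standard library, which accepts unnormalized weights, and stops early if a word has no recorded successor. The toy data and the name `generate` are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy data
toy_words = ['the', 'master', 'said', 'the', 'master', 'was', 'the', 'boy']

# successors[w1][w2] = count of the bigram (w1, w2)
successors = defaultdict(Counter)
for w1, w2 in zip(toy_words, toy_words[1:]):
    successors[w1][w2] += 1

def generate(start, length, rng):
    """Generate up to length words after start by sampling successors."""
    sequence = [start]
    word = start
    for _ in range(length):
        followers = successors.get(word)
        if not followers:
            break  # dead end: the word was never followed by anything
        candidates = list(followers)
        weights = list(followers.values())
        # The bigram counts act as weights; no normalization is needed
        word = rng.choices(candidates, weights=weights)[0]
        sequence.append(word)
    return sequence

generated = generate('the', 10, random.Random(0))
print(' '.join(generated))
```

Because the weights are the raw bigram counts for the current word, the sampling follows the same conditional distribution as the normalized probabilities.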