In [None]:
%load_ext run_and_test

# Background

We will work with a file that lists the members of a set $\mathit{words}$ of words, all made up of nothing but the 26 capital letters. In $\mathit{words}$, 8 words contain the 3-gram (sequence of 3 symbols) ZEL:

* 4 of those words contain ZELL (namely, DALZELL, GAZELLE, ROZELLE and ZELLERBACH);
* 1 of those words contains ZELT (namely, TZELTAL);
* 3 of those words end in ZEL (namely, DITZEL, HAZEL and RAPUNZEL).

Hence

* there is a probability of $\frac{1}{2}$ that ZEL in a word in $\mathit{words}$ be followed by L;
* there is a probability of $\frac{1}{8}$ that ZEL in a word in $\mathit{words}$ be followed by T;
* there is a probability of $\frac{3}{8}$ that ZEL in a word in $\mathit{words}$ actually ends the word.
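
The three probabilities above can be reproduced with a short sketch. It uses only the 8 words quoted in the example, not the full word list. The marker `'#'` and the helper name `followers` are assumptions made for illustration. The helper counts occurrences of the 3-gram, which matches counting words here because ZEL occurs exactly once in each of the 8 words.

```python
from collections import Counter

# The 8 words quoted in the example (not the whole dictionary).
words = ['DALZELL', 'GAZELLE', 'ROZELLE', 'ZELLERBACH',
         'TZELTAL', 'DITZEL', 'HAZEL', 'RAPUNZEL']
END = '#'  # assumed end of word marker


def followers(words, gram):
    """Count the symbols that follow each occurrence of gram (hypothetical helper)."""
    counts = Counter()
    for word in words:
        marked = word + END  # append the end of word marker
        start = marked.find(gram)
        while start != -1:
            counts[marked[start + len(gram)]] += 1
            start = marked.find(gram, start + 1)
    return counts


counts = followers(words, 'ZEL')
total = sum(counts.values())
probabilities = {symbol: counts[symbol] / total for symbol in counts}
print(probabilities)
```

The printed dictionary maps `'L'` to 0.5, `'T'` to 0.125 and `'#'` to 0.375.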

The same computations can be done for all possible $3$-grams, as well as for all possible $m$-grams with $m<3$ but requiring that they start a word as opposed to letting them occur anywhere in a word. Imagine that an "end of word" marker is appended to every word, so saying that a word $w$ ends in a sequence of letters $\sigma$ is equivalent to saying that $\sigma$ is followed by the end of word marker in $w$. One can then randomly generate a word $w$ as follows: 

* Randomly generate a letter, say $c_1$, following the probability that a word in $\mathit{words}$ starts with $c_1$.
* Randomly generate a letter or the end of word marker, say $c_2$, following the probability that a word in $\mathit{words}$ that starts with $c_1$ actually starts with $c_1c_2$. In case $c_2$ is the end of word marker then $w = c_1$.
* Otherwise, randomly generate a letter or the end of word marker, say $c_3$, following the probability that a word in $\mathit{words}$ that starts with $c_1c_2$ actually starts with $c_1c_2c_3$. In case $c_3$ is the end of word marker then $w = c_1c_2$.
* Otherwise, randomly generate a letter or the end of word marker, say $c_4$, following the probability that a word in $\mathit{words}$ that contains $c_1c_2c_3$ actually contains $c_1c_2c_3c_4$. In case $c_4$ is the end of word marker then $w = c_1c_2c_3$.
* Otherwise, randomly generate a letter or the end of word marker, say $c_5$, following the probability that a word in $\mathit{words}$ that contains $c_2c_3c_4$ actually contains $c_2c_3c_4c_5$. In case $c_5$ is the end of word marker then $w = c_1c_2c_3c_4$.
* Otherwise, randomly generate a letter or the end of word marker, say $c_6$, following the probability that a word in $\mathit{words}$ that contains $c_3c_4c_5$ actually contains $c_3c_4c_5c_6$. In case $c_6$ is the end of word marker then $w = c_1c_2c_3c_4c_5$.
* ...

The randomly generated word $w$ could in principle be infinite, but only with probability zero. It has been generated following a kind of random process known as a Markov chain.
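
At each step of the procedure, the next symbol depends only on a bounded context: the whole prefix while fewer than 3 letters have been generated, and the last 3 letters afterwards. A minimal sketch of that rule follows, with 3 generalised to $n$. The function name `context` is hypothetical, and using the empty string as the context for the first letter is an assumption.

```python
def context(generated, n):
    """Return the letters that determine the next symbol (hypothetical helper).

    While fewer than n letters have been generated, the whole prefix is
    the context; afterwards only the last n letters matter.
    """
    return generated if len(generated) < n else generated[-n:]


print(context('', 3))       # context for the first letter (assumed to be '')
print(context('ZE', 3))     # a word prefix shorter than 3
print(context('GAZEL', 3))  # the last 3 letters
```

With $n = 3$, the prefix `'ZE'` is its own context, whereas `'GAZEL'` is reduced to `'ZEL'`.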

# Task

Write a program `markov_chains.py` that meets the following requirements.

* The program prompts the user for a strictly positive integer $n$, a positive integer $\mathit{nb\_of\_words}$ and an integer $\mathit{for\_seed}$, meant to represent the $n$ in $n$-gram, a number of words to be randomly generated as previously described, and a value to feed the `seed()` function of the `random` module to have control over the program outputs, respectively.
* The program reads the contents of a file named `dictionary.txt`, assumed to be stored in the working directory, that contains uppercase words, one per line. Let $\mathit{words}$ keep denoting the set of those words.
* The program defines a function `computed_markov_chain(words, n)` that given as arguments $\mathit{words}$ and $n$, returns a dictionary $\mathit{markov\_chain}$ whose keys are all $n$-grams that occur anywhere in at least one word in $\mathit{words}$ and all $m$-grams for $m<n$ that occur at the beginning of at least one word in $\mathit{words}$, and such that for a given key $\sigma$, the value of $\sigma$ is the list $L$ defined as follows. Let $c_1$, ..., $c_k$ be all the symbols $c$, either letters or the end of word marker, such that:
    * either $\sigma$ is of length $n$ and at least one word in $\mathit{words}$ contains $\sigma$ followed by $c$,
    * or $\sigma$ is of length less than $n$ and at least one word in $\mathit{words}$ starts with $\sigma$ followed by $c$.

  The symbols $c_1$, ..., $c_k$ are listed in lexicographic order, with the end of word marker taken to be greater than all letters.

  Then $L=[(c_1,p_1),\dots,(c_k,p_k)]$ where:
  
    * if $\sigma$ is of length $n$ then $p_1$ is the probability that a word in $\mathit{words}$ that contains $\sigma$ be followed by $c_1$, whereas if $\sigma$ is of length smaller than $n$ then $p_1$ is the probability that a word in $\mathit{words}$ that starts with $\sigma$ be followed by $c_1$.
    * if $\sigma$ is of length $n$ then $p_2$ is the probability that a word in $\mathit{words}$ that contains $\sigma$ be followed by $c_1$ or $c_2$, whereas if $\sigma$ is of length smaller than $n$ then $p_2$ is the probability that a word in $\mathit{words}$ that starts with $\sigma$ be followed by $c_1$ or $c_2$.
    * ...
    * if $\sigma$ is of length $n$ then $p_k$ is the probability that a word in $\mathit{words}$ that contains $\sigma$ be followed by one of $c_1$, $c_2$, ..., $c_k$, whereas if $\sigma$ is of length smaller than $n$ then $p_k$ is the probability that a word in $\mathit{words}$ that starts with $\sigma$ be followed by one of $c_1$, $c_2$, ..., $c_k$; that probability is 1.
    
  For instance, `'ZEL'` should be one of $\mathit{markov\_chain}$'s keys, and the associated value should be `[('L',0.5), ('T',0.625), ('#',1)]` if `'#'` was chosen to denote the end of word marker.
* The program defines a function `generate_word(words, markov_chain, n)`, which is called after `computed_markov_chain()` has been called with $\mathit{words}$ and $n$ as arguments, returning the dictionary $\mathit{markov\_chain}$ previously described, and after the `seed()` function has been called with the third user input as argument. When `generate_word()` is called with $\mathit{nb\_of\_words}$, $\mathit{markov\_chain}$ and $n$ as arguments, it randomly generates $\mathit{nb\_of\_words}$ words as previously described, proceeding in a way that we explain with an example, assuming that $n$ is equal to 3. Suppose that `generate_word()` has generated at least 3 letters of one of the $\mathit{nb\_of\_words}$ words to generate, say $w$, and that the last 3 generated letters are Z, E and L. Then `generate_word()` calls `random()` to generate a number $r$ with $0\leq r<1$.
    * If $r$ is smaller than 0.5, namely, the probability that ZEL in a word in $\mathit{words}$ be followed by L, then L is the next letter generated for $w$, which is not fully determined yet.
    * Otherwise, if $r$ is smaller than 0.625, namely, the probability that ZEL in a word in $\mathit{words}$ be followed by either L or T, then T is the next letter generated for $w$, which is not fully determined yet.
    * Otherwise, $r$ is smaller than 1, namely, the probability that ZEL in a word in $\mathit{words}$ be followed by either L or T or nothing, and $w$ is fully determined, consisting of all letters previously generated for $w$.
    
  Eventually, `generate_word()` prints out the $\mathit{nb\_of\_words}$ many words it has generated, one word per line, indicating for each word whether it has been invented because it is not a member of $\mathit{words}$, or whether it has been rediscovered because it belongs to $\mathit{words}$.
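
The cumulative list for ZEL and the threshold test on $r$ can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: the names `cumulative` and `next_symbol` are hypothetical, `'#'` is assumed to be the end of word marker, and the follower counts for ZEL are entered by hand rather than computed from `dictionary.txt`. Note that `'#'` sorts before the capital letters in ASCII, so an explicit sort key is needed to place it last.

```python
from itertools import accumulate

END = '#'  # assumed end of word marker


def cumulative(counts):
    """Turn follower counts into a list of (symbol, cumulative probability)."""
    # Letters in alphabetical order, end of word marker last.
    symbols = sorted(counts, key=lambda c: (c == END, c))
    total = sum(counts.values())
    running = accumulate(counts[c] for c in symbols)
    return [(c, partial / total) for c, partial in zip(symbols, running)]


def next_symbol(distribution, r):
    """Return the first symbol whose cumulative probability exceeds r."""
    for symbol, p in distribution:
        if r < p:
            return symbol
    return distribution[-1][0]  # guard against rounding in the last entry


# Follower counts for ZEL, taken from the Background example.
zel = cumulative({'L': 4, 'T': 1, END: 3})
print(zel)
for r in (0.3, 0.5, 0.6, 0.9):
    print(r, next_symbol(zel, r))
```

In the actual program, $r$ would come from `random()` after `seed()` has been called, and a returned end of word marker would terminate the current word.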

# Tests

## Generating 10 words using 1-grams

In [None]:
%%run_and_test -i'1\n10\n0\n' python3 markov_chains.py

'What n to use for the n-grams? ', '1\n',
'How many words to generate? ', '10\n',
'What integer for the seed? ', '0\n',
'''
Invented S\n
Invented GHINTHIN\n
Invented JES\n
Invented D\n
Invented WOULS\n
Invented PLANOUTITETIASS\n
Invented PANVEMSINE\n
Invented STALIVAPRMPOWINGULLEDERONCRY\n
Invented S\n
Invented TINSMPS\n
'''

## Generating 10 words using 2-grams

In [None]:
%%run_and_test -i'2\n10\n20\n' python3 markov_chains.py

'What n to use for the n-grams? ', '2\n',
'How many words to generate? ', '10\n',
'What integer for the seed? ', '20\n',
'''
Invented TROWEN\n
Invented SPEN\n
Invented BEWILIZES\n
Invented SNARACESS\n
Invented CABLUEMARECTORDULKE\n
Invented ST\n
Invented MED\n
Invented LE\n
Invented SCRASKTRISCOLIMATEIGNECALLABLED\n
Invented BELED\n
'''

## Generating 20 words using 3-grams

In [None]:
%%run_and_test -i'3\n20\n40\n' python3 markov_chains.py

'What n to use for the n-grams? ', '3\n',
'How many words to generate? ', '20\n',
'What integer for the seed? ', '40\n',
'''
Rediscovered INCH\n
Invented PARTIATE\n
Invented NITERE\n
Rediscovered BLUSH\n
Invented SHONED\n
Invented SMYTHED\n
Invented NOE\n
Invented BRIC\n
Invented UNTERS\n
Invented ABBINGS\n
Invented CAPTURIFIERS\n
Invented RADOPPING\n
Invented STUBA\n
Invented EXEL\n
Invented HIC\n
Invented NEIT\n
Invented AUTHFUL\n
Rediscovered MICA\n
Invented CENATIONTRACE\n
Rediscovered WASHING\n
'''

## Generating 20 words using 4-grams

In [None]:
%%run_and_test -i'4\n20\n75\n' python3 markov_chains.py

'What n to use for the n-grams? ', '4\n',
'How many words to generate? ', '20\n',
'What integer for the seed? ', '75\n',
'''
Invented HEATLY\n
Rediscovered INTER\n
Rediscovered EDGE\n
Invented ETHANKLINING\n
Invented INSECTED\n
Rediscovered DIE\n
Rediscovered DILATION\n
Rediscovered HALVING\n
Invented REDISTICS\n
Invented ITALITY\n
Invented SPECTIVATOR\n
Rediscovered FULLY\n
Invented HORS\n
Invented HERED\n
Invented INDUCTIVELINOTYPICALLY\n
Rediscovered SKILL\n
Rediscovered RELATIVES\n
Invented ARCHESTNUT\n
Rediscovered WASH\n
Rediscovered SMOKABLE\n
'''

## Generating 20 words using 5-grams

In [None]:
%%run_and_test -i'5\n20\n100\n' python3 markov_chains.py

'What n to use for the n-grams? ', '5\n',
'How many words to generate? ', '20\n',
'What integer for the seed? ', '100\n',
'''
Invented BIRTHPLACEABLE\n
Rediscovered RECEIVING\n
Rediscovered TRANSACTION\n
Invented CONVERTISERS\n
Rediscovered MILQUETOAST\n
Rediscovered TRIGGER\n
Rediscovered SERVED\n
Invented MALIGNMENTS\n
Rediscovered LOCATE\n
Invented NOONTIMELESSLY\n
Invented BATISTICALLY\n
Rediscovered CLUNG\n
Rediscovered GEOFFREY\n
Rediscovered DUTIES\n
Rediscovered BENEFICIALLY\n
Rediscovered APPRENTICED\n
Rediscovered FLAIR\n
Invented MARRIERS\n
Rediscovered DRIFTING\n
Invented BANACHROME\n
'''