# STAGE 1/6
### Objectives:
To prepare the corpus for use in this project, we need to take the following steps:

* Open and read the corpus from the provided file corpus.txt. The filename should be specified as user input. Note that the file is written in UTF-8 encoding, and the file should be in the same folder as your Python script.
* Break the corpus into individual words. To create a Markov model, we use the simplest form of tokenization: tokens are separated by whitespace characters such as spaces, tabulation, and newline characters. Punctuation marks should be left untouched since later on, they will play an important role in signaling where a sentence should end.
* Acquire and print the following information about the corpus under the section of the output called Corpus statistics:

— the number of all tokens;

— the number of all unique tokens, that is, the number of tokens without repetition.

Each of the above should be in a new line.
* Take an integer as user input and print the token with the corresponding index. Repeat this process until the string `exit` is entered. Also, make sure that the input is actually an integer that falls within the range of the corpus. If it is not, print an error message and request a new input. Error messages should name the type of error (Type Error, Index Error, etc.).


Each token should be printed in a new line.
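
The tokenization and statistics steps above can be sketched without a corpus file. This is a minimal illustration, not the reference solution: `sample_text` is a made-up string standing in for the contents of corpus.txt, and the helper names `tokenize` and `corpus_statistics` are hypothetical. For ordinary text, `str.split()` with no arguments splits on runs of spaces, tabs, and newlines, which matches the whitespace rule described above.

```python
def tokenize(text):
    """Split text on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def corpus_statistics(tokens):
    """Return the two report lines: total tokens and unique tokens."""
    return [f"All tokens: {len(tokens)}", f"Unique tokens: {len(set(tokens))}"]


# Hypothetical sample text standing in for the contents of corpus.txt.
sample_text = "Winter is coming.\nWinter is\there. Is it?"
tokens = tokenize(sample_text)

print("Corpus statistics")
print(*corpus_statistics(tokens), sep="\n")
```

Note that `is` and `Is` count as different tokens, and `coming.` keeps its period, because nothing is normalized at this stage.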

In [5]:
from nltk.tokenize import WhitespaceTokenizer
# from nltk.tokenize import regexp_tokenize

with open(input(), 'r', encoding='utf-8') as f:  # expects the filename, e.g. corpus.txt
    text = f.read()  # the with block closes the file automatically
# corpus = regexp_tokenize(text, r"[^\s]+")

tk = WhitespaceTokenizer()
corpus = tk.tokenize(text)
print(f'Corpus statistics\nAll tokens: {len(corpus)}\nUnique tokens: {len(set(corpus))}')

i = input()  # Type number
while i != 'exit':
    try:
        print(corpus[int(i)])
    except IndexError:
        print('IndexError. Please input an integer that is in the range of the corpus.')
    except ValueError:  # int() raises ValueError for input that is not an integer
        print('TypeError. Please input an integer.')
    i = input()

corpus.txt
Corpus statistics
All tokens: 287968
Unique tokens: 21262
56789
I
exit


# STAGE 2/6

### Objectives:
* Transform the corpus into a collection of bigrams. The results should contain all the possible bigrams from the corpus, which means that:
— Every token from the corpus should be a head of a bigram with the exception of the last token which cannot become a head since nothing follows it;
— Every token from the corpus should be a tail of a bigram with the exception of the first token which cannot possibly be the tail of a bigram because nothing precedes it.
* Output the number of all bigrams in the corpus.
* Take an integer as user input and print the bigram with the corresponding index. Repeat this process until the string `exit` is entered. Also, make sure that the input is actually an integer that falls within the range of the collection of bigrams. If it is not, print an error message and request a new input. Error messages should name the type of error (Type Error, Index Error, etc.). Each bigram should have the format `Head: [head] Tail: [tail]` and should be printed on a new line.


You should only print the output of the current stage and not the previous one, but like in the previous stage, the name of the file that contains the corpus should be given as user input.
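
The bigram construction and the index lookup described above can be sketched on a tiny token list. This is an illustration under assumptions: the sample tokens are made up, the helper names `make_bigrams` and `describe_bigram` are hypothetical, and the exact wording of the error messages is a guess at what the task expects.

```python
def make_bigrams(tokens):
    """Pair every token with its successor: n tokens give n - 1 bigrams."""
    return list(zip(tokens, tokens[1:]))


def describe_bigram(bigrams, raw_index):
    """Return the formatted bigram for raw_index, or an error message."""
    try:
        head, tail = bigrams[int(raw_index)]
    except ValueError:
        # int() raises ValueError for text that is not an integer.
        return "Type Error. Please input an integer."
    except IndexError:
        return "Index Error. Please input an integer that is in the range of the bigrams."
    return f"Head: {head} Tail: {tail}"


# Hypothetical sample tokens; the real corpus comes from corpus.txt.
tokens = "Winter is coming. Winter is here.".split()
bigrams = make_bigrams(tokens)

print(f"Number of bigrams: {len(bigrams)}")
for raw in ["0", "4", "99", "abc"]:
    print(describe_bigram(bigrams, raw))
```

Python lists accept negative indices, so an input of `-1` returns the last bigram rather than an error. Whether that counts as "in range" depends on how strictly the requirement is read.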

In [4]:
from nltk.tokenize import WhitespaceTokenizer
# from nltk.tokenize import regexp_tokenize

with open(input(), 'r', encoding='utf-8') as f:  # for local runs, replace input() with 'corpus.txt'
    text = f.read()  # the with block closes the file automatically
# corpus = regexp_tokenize(text, r"[^\s]+")

tk = WhitespaceTokenizer()
corpus = tk.tokenize(text)
# print(f'Corpus statistics\nAll tokens: {len(corpus)}\nUnique tokens: {len(set(corpus))}')

bigrams = [[corpus[i], corpus[i + 1]] for i in range(len(corpus) - 1)]
print(f'Number of bigrams: {len(bigrams)}')


i = input()  # Type number
while i != 'exit':
    try:
        head, tail = bigrams[int(i)]
        print(f'Head: {head} Tail: {tail}')
    except IndexError:
        print('IndexError. Please input an integer that is in the range of the bigrams.')
    except ValueError:  # int() raises ValueError for input that is not an integer
        print('TypeError. Please input an integer.')
    i = input()

corpus.txt
Number of bigrams: 287967
45
Head: life. Tail: How
46
Head: How Tail: close
exit


# STAGE 3/6

### Description:
This is the final step where we will work on creating a Markov chain model. We will use the data prepared in the first two stages and transform it into a model. This model will contain probabilistic information that will tell us what the next word in a chain might be.

We already have a list of all bigrams from the corpus. As we discussed earlier, this can already be used to make some naive predictions. There is a problem, though: right now our data contains a lot of repetition. As we saw in the first stage, the total number of tokens (287968) is about 13.5 times the number of unique tokens (21262). The list of bigrams is repetitive in a similar way, although the exact ratio will differ. Some bigrams might be very common, others relatively rare. At the moment, we have no way of telling which are which.

To resolve this problem, we will make a simplified version of a Markov chain model.
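
One way to picture such a simplified model is a dictionary that maps each head to a counter of its tails. The sketch below is an assumption about the shape of the model rather than the required implementation: the sample tokens are made up, and `build_model` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def build_model(tokens):
    """Map each head to a Counter of the tails that follow it."""
    model = defaultdict(Counter)
    for head, tail in zip(tokens, tokens[1:]):
        model[head][tail] += 1
    return model


# Hypothetical sample tokens standing in for the real corpus.
tokens = "the queen is here and the queen is kind and the queen has gone".split()
model = build_model(tokens)

# most_common() lists the tails from the most frequent to the least frequent.
for tail, count in model["queen"].most_common():
    print(f"Tail: {tail} Count: {count}")
```

Because a `defaultdict` creates an empty entry whenever a missing key is indexed, user input should be checked with `in` before it is looked up. The last token of the corpus (`gone` here) never becomes a head unless it also occurs earlier.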

In [3]:
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer

with open(input(), 'r', encoding='utf-8') as f:  # for local runs, replace input() with 'corpus.txt'
    text = f.read()  # the with block closes the file automatically

tk = WhitespaceTokenizer()
corpus = tk.tokenize(text)

bigrams = [[corpus[i], corpus[i + 1]] for i in range(len(corpus) - 1)]

heads_dict = {}

for head, tail in bigrams:
    heads_dict.setdefault(head, []).append(tail)

word = input()  # Type your word
while word != 'exit':
    try:
        freq_counter = Counter(heads_dict[word])
        print(f'Head: {word}', *[f'Tail: {key} Count: {value}' for key, value in freq_counter.items()], sep='\n')
    except KeyError:
        print(f'Head: {word}', 'The requested word is not in the model. Please input another word.', sep='\n')
    finally:
        word = input()

corpus.txt
queen
Head: queen
Tail: and Count: 14
Tail: has Count: 5
Tail: someday. Count: 1
Tail: today. Count: 1
Tail: of Count: 8
Tail: one Count: 1
Tail: over Count: 1
Tail: you Count: 2
Tail: doesn't Count: 2
Tail: waiting. Count: 1
Tail: herself Count: 1
Tail: regent. Count: 1
Tail: I'm Count: 2
Tail: when Count: 2
Tail: or Count: 2
Tail: at Count: 1
Tail: said Count: 1
Tail: how Count: 1
Tail: detained Count: 1
Tail: get Count: 1
Tail: a Count: 2
Tail: to Count: 9
Tail: would Count: 2
Tail: is Count: 5
Tail: whose Count: 1
Tail: with Count: 1
Tail: mother Count: 2
Tail: who Count: 3
Tail: ordered Count: 1
Tail: trusts Count: 1
Tail: I Count: 1
Tail: Can't Count: 1
Tail: You Count: 1
Tail: because Count: 3
Tail: tried Count: 1
Tail: recognizes Count: 1
Tail: chose Count: 2
Tail: She's Count: 2
Tail: Margaery's Count: 1
Tail: will Count: 3
Tail: loves Count: 1
Tail: returns. Count: 1
Tail: does Count: 2
Tail: insists Count: 1
Tail: before? Count: 1
Tail: mother's Count: 1
Tail: mot

# STAGE 4/6

### Objectives:
* Choose a random word from the corpus that will serve as the first word of the chain.
* The second word should be predicted by looking up the first word of the chain in the model and choosing the most probable next word from the set of possible follow-ups. Right now, an entry contains all the possible tails that might follow the selected head along with their corresponding repetition counts. Using the repetition counts, you will be able to choose the most probable option.
* The second step should be repeated until the length of the chain is 10 words, but this time, the current last word of the chain should be used to look up another probable word to continue the sentence.

Using the algorithm described above, generate chains consisting of 10 tokens, join the resulting tokens together, and print them as a pseudo-sentence. Keep in mind that a pseudo-sentence can consist of multiple actual sentences, so having punctuation marks inside pseudo-sentences is completely valid.
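
"Choosing the most probable next word" can be read in two ways, and the sketch below shows both. It is only an illustration: the counts are a few values copied from the `queen` entry printed in Stage 3, and the fixed seed is an assumption made to keep the sketch repeatable.

```python
import random
from collections import Counter

# A few tail counts for one head, in the shape the model stores them.
tails = Counter({"and": 14, "to": 9, "of": 8, "someday.": 1})

# Option 1: always take the most frequent tail (deterministic).
most_probable = tails.most_common(1)[0][0]

# Option 2: sample a tail with probability proportional to its count.
rng = random.Random(0)  # seeded only to make the sketch repeatable
sampled = rng.choices(list(tails.keys()), weights=list(tails.values()), k=1)[0]

print(most_probable)
print(sampled)
```

The first option always continues `queen` with `and`, so chains that start from the same word are identical. The code cell for this stage samples with `random.choices`, which corresponds to the second option and produces more varied text.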

In [83]:
import random
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer

with open(input(), 'r', encoding='utf-8') as f:  # for local runs, replace input() with 'corpus.txt'
    text = f.read()  # the with block closes the file automatically

tk = WhitespaceTokenizer()
corpus = tk.tokenize(text)

bigrams = [[corpus[i], corpus[i + 1]] for i in range(len(corpus) - 1)]


heads_dict = {}

for head, tail in bigrams:
    heads_dict.setdefault(head, []).append(tail)

    
chains = []
for _ in range(10):
    # Start each chain from a random token of the corpus.
    current = random.choice(corpus)
    sentence = [current]

    for _ in range(9):
        # Count how often each tail follows the current head.
        freq_counter = Counter(heads_dict[current])
        # Sample the next token with probability proportional to its count.
        current = random.choices(list(freq_counter.keys()), list(freq_counter.values()))[0]
        sentence.append(current)

    chains.append(sentence)

for chain in chains:
    print(*chain)


corpus.txt
certain about that prerogative. I will pardon these fanatics of
Come here. Are you into the brink of shoes. But
on you, but Children do without. She came to celebrate.
the edge of Light demands this spread if you going
my friend... Voices carry weapons and band of his sword
only here a cook. But the free rides off me!
Welcome. Stop. I told help to be here at the
we? Just don't want me to your sister and I
hope I wonder, Ser Meryn. Go to him anyway? You
and warm. The Iron Bank wants to King's Landing, I'll


# STAGE 5/6

### Objectives:
* Make the algorithm more realistic by generating pseudo-sentences instead of just random text.
The sentences that are being generated should:
— always start with capitalized words ("This is beautiful.", "You are a great programmer!", etc.);
— not start with a word that ends with a sentence-ending punctuation mark ("Okay.", "Nice.", "Good.", "Look!", "Jon!", etc.);
— always end with a sentence-ending punctuation mark like ., !, or ?;
— not be shorter than 5 tokens.
* Generate and print exactly 10 pseudo-sentences that meet these criteria. A pseudo-sentence should end when the first sentence-ending punctuation mark is encountered after the minimal sentence length (5 tokens) is reached.
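
The criteria above can be expressed as a start-token check plus a stopping condition. The sketch below is a minimal illustration under stated assumptions: the toy corpus is made up and built so that every token has at least one successor, the names `is_valid_start` and `generate_sentence` are hypothetical, and `str.isupper()` is used for the capitalization check, which also accepts non-ASCII capitals.

```python
import random
from collections import Counter, defaultdict

SENTENCE_END = (".", "!", "?")
MIN_LENGTH = 5


def is_valid_start(token):
    """Capitalized first letter and no sentence-ending mark at the end."""
    return token[0].isupper() and not token.endswith(SENTENCE_END)


def generate_sentence(model, starts, rng):
    """Extend a chain until it has MIN_LENGTH tokens and ends a sentence."""
    chain = [rng.choice(starts)]
    while len(chain) < MIN_LENGTH or not chain[-1].endswith(SENTENCE_END):
        tails = model[chain[-1]]
        # Weighted choice: frequent tails are picked more often.
        chain.append(rng.choices(list(tails), weights=list(tails.values()))[0])
    return chain


# Hypothetical toy corpus; every token in it has at least one successor.
tokens = ("The night is dark. The night is long and full of terrors. "
          "The night is dark.").split()
model = defaultdict(Counter)
for head, tail in zip(tokens, tokens[1:]):
    model[head][tail] += 1

starts = [token for token in model if is_valid_start(token)]
rng = random.Random(0)  # seeded only to make the sketch repeatable
print(" ".join(generate_sentence(model, starts, rng)))
```

On a real corpus the chain can reach a token with no recorded successor (the final token, if it occurs nowhere else), so a production version would need to handle that case, for example by restarting the sentence.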

In [75]:
import random
import re
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer
from string import ascii_uppercase

def read_file(file_):
    with open(file_, 'r', encoding='utf-8') as f:
        raw_text = f.read()  # the with block closes the file automatically
    return raw_text

name_file = 'corpus.txt'

tk = WhitespaceTokenizer()
corpus = tk.tokenize(read_file(name_file))

bigrams = [[corpus[i], corpus[i + 1]] for i in range(len(corpus) - 1)]

heads_dict = {}

for head, tail in bigrams:
    heads_dict.setdefault(head, []).append(tail)

def pop_wei(heads_dict, beginning):
    freq_counter = Counter(heads_dict[beginning])
    population = []
    weights = []

    for key, value in freq_counter.items():
        population.append(key)
        weights.append(value)

    return [population, weights]

def print_sentences(text):
    for el in text:
        print(*el)

def upper_cased(corpus):
    return list(filter(lambda x: x and x[0] in ascii_uppercase and not re.findall(r'[.!?]', x[-1]), corpus))


text = []

for i in range(10):
    sentence = []
    beginning = random.choice(upper_cased(corpus))
    sentence.append(beginning)

    # Continue until the chain has at least 5 tokens and the last one ends a sentence.
    while len(sentence) < 5 or not re.search(r'[.!?]$', beginning):
        population, weights = pop_wei(heads_dict, beginning)
        beginning = random.choices(population, weights)[0]
        sentence.append(beginning)

    text.append(sentence)

print_sentences(text)


I don't want to Oldtown, then we'll have sneered at you.
I'm Jojen told me stand behind small matters to kill.
I knew about being rude.
I come to her? I admired her.
Kingdoms, I said-- Turn around my friend.
There's nothing you were said he took them.
So how to admit fear.
The dark-haired one. Next I want Balon Greyjoy because she's doing.
What's happening? The maesters whose skeleton sits on you.
I love once. No, my lady.


# STAGE 6/6

Right now, the model is based on bigrams, that is, we only consider one word when trying to predict the next word in the chain.
The algorithm should be extended so that it can use not only bigrams but also trigrams. 
### This change implies the following tasks:
* The list of bigrams should be transformed into a list of trigrams. It should still consist of heads and tails, but now, heads should consist of two space-separated tokens concatenated into a single string. The tails should still consist of one token. For example: head — winter is, tail — coming.

* The model should be trained based on the list of trigrams. The model creation requires no modifications since trigrams still consist of a head and a tail.

* The beginning of the chain should be a randomly chosen head from the model, not just any word from the corpus.

* When predicting the next word, the model should be fed the concatenation of the last two tokens of the chain separated by a space.
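
The trigram tasks above can be sketched on a small token list. This is an illustration rather than the required solution: the sample tokens are made up, and the helper names `build_trigram_model` and `next_head` are hypothetical.

```python
from collections import Counter, defaultdict


def build_trigram_model(tokens):
    """Map each two-token head (joined by a space) to a Counter of tails."""
    model = defaultdict(Counter)
    for first, second, tail in zip(tokens, tokens[1:], tokens[2:]):
        model[f"{first} {second}"][tail] += 1
    return model


def next_head(chain):
    """Join the last two tokens of the chain to form the lookup key."""
    return " ".join(chain[-2:])


# Hypothetical sample tokens standing in for the real corpus.
tokens = "winter is coming and winter is here and winter is coming".split()
model = build_trigram_model(tokens)

chain = ["winter", "is"]
print(next_head(chain), dict(model[next_head(chain)]))
```

A corpus of n tokens yields n - 2 trigrams, and the model-building loop is the same as for bigrams except that the head is now a two-token string.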

In [81]:
import random
import re
from collections import Counter
from nltk.tokenize import WhitespaceTokenizer
from string import ascii_uppercase

def read_file(name_file):
    with open(name_file, 'r', encoding='utf-8') as f:
        raw_text = f.read()  # the with block closes the file automatically
    return raw_text

# name_file = input()
name_file = 'corpus.txt'

tk = WhitespaceTokenizer()
corpus = tk.tokenize(read_file(name_file))

trigrams = [[corpus[i] + ' ' + corpus[i + 1], corpus[i + 2]] for i in range(len(corpus) - 2)]

heads_dict = {}

for head, tail in trigrams:
    heads_dict.setdefault(head, []).append(tail)

def pop_wei(heads_dict, beginning):
    freq_counter = Counter(heads_dict[beginning])
    population = []
    weights = []

    for key, value in freq_counter.items():
        population.append(key)
        weights.append(value)

    return [population, weights]

def print_sentences(text):
    for el in text:
        print(*el)

def upper_cased(corpus):
    return list(filter(lambda x: x and x[0] in ascii_uppercase and not re.findall(r'[.!?]', x), corpus))


text = []

for i in range(10):
    sentence = []
    beginning = random.choice(upper_cased(list(heads_dict.keys())))
    sentence.append(beginning)

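    # The loop runs one token past the sentence end: it stops once the token
    # before the newest one ends with '.', '!' or '?'.
    while len(sentence) < 5 or not re.findall(r'[.!?] ', beginning):
        population, weights = pop_wei(heads_dict, beginning)
        sentence.append(random.choices(population, weights)[0])
        # The next head is the last two tokens; sentence[0] already holds two tokens.
        beginning = sentence[-2].split(' ')[-1] + ' ' + sentence[-1]

    text.append(sentence[:-1])  # drop the extra token generated past the sentence end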

print_sentences(text)


A legendary fighter. A brilliant stylist who invented half the country will rally to their own.
They would have been searching for you.
Fine leather, ornamentation, detailing, and time Time most of them did this to her.
You, me-- we. Once I kill you with your wildling lovers.
Soon they won't just rule over the city who doesn't like the silly little boy when I watched the witch burn.
Kept me in the Night's Watch.
The Halfhand believed our only chance It's a compliment, my lady.
Save your lies for court.
He'd kill us all. It's not my people.
Please, someone stop him! My lord, this man murdered him.
