# Introduction
Use `NLTK` or `spaCy`, two of the main NLP libraries, to process text. In particular, you can build a very simple bigram language model as follows:
- Choose a piece of text as your training data.
- Split the training data into sentences, then tokenize each sentence into words.
- Count the frequency of each word and store it in a dictionary.
- Count the frequency of each bigram and store it in a second dictionary. A bigram is a pair of consecutive words.
- Now you can predict the next word of a sentence as the word $ w $ that maximizes the conditional probability: $$ \hat{w} = \arg\max_w \Pr(w \mid w_{last}), \qquad \Pr(w \mid w_{last}) = \frac{\text{Frequency of Bigram } (w_{last}, w)}{\text{Frequency of Word } w_{last}}, $$ where $ w_{last} $ is the last word in the given sentence.
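
The steps above can be sketched on a toy corpus without any NLP library. This is only a minimal illustration under stated assumptions: the three sentences are made up, tokenization is plain whitespace splitting rather than `NLTK` or `spaCy`, and `predict_next` is a hypothetical helper name.

```python
from collections import Counter

# Toy corpus (made up for illustration); each string is one sentence.
sentences = ["this is gisma", "this is a model", "it is a model"]

word_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    words = sentence.split()
    word_counts.update(words)
    # zip pairs each word with its successor, giving the bigrams.
    bigram_counts.update(zip(words, words[1:]))


def predict_next(last_word):
    """Return the w maximizing count(last_word, w) / count(last_word), or None."""
    candidates = {
        w: count / word_counts[last_word]
        for (first, w), count in bigram_counts.items()
        if first == last_word
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# "is" occurs 3 times and ("is", "a") occurs twice, so Pr(a | is) = 2/3.
print(bigram_counts[("is", "a")] / word_counts["is"])
print(predict_next("is"))  # a
```

A word that never starts a bigram, such as one that only appears at the end of a sentence, has no candidates, so the sketch returns `None` for it.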

In [None]:
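text = "Use NLTK or spaCy, as the main NLP libraries, to process some texts. In particular, you can design a very simple language model as follows:"

# Word frequencies: word -> count
d = {}
d["Gisma"] = 4

# Bigram frequencies: (first word, second word) -> count
dd = {}
dd[("This", "is")] = 3

# Example: the sentence "This is Gisma" contains two bigrams:
#   ("This", "is")
#   ("is", "Gisma")
#
# To continue "This is ...", look up the counts for the last word "is":
#   count("is")            -> 5
#   count(("is", "Gisma")) -> 2
#   Pr(Gisma | is) = 2 / 5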

In [6]:
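import nltk

# sent_tokenize and word_tokenize need the Punkt tokenizer data, downloaded once
# with nltk.download("punkt") (or "punkt_tab", depending on the NLTK version).

text = "Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:"
vocabulary = {}  # token -> number of occurrences
sentences = nltk.tokenize.sent_tokenize(text)  # split the text into sentences
for sentence in sentences:
    # Tokens include punctuation marks such as '.' and ','.
    words = nltk.tokenize.word_tokenize(sentence)
    for w in words:
        if w not in vocabulary:
            vocabulary[w] = 0
        vocabulary[w] += 1

vocabulary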

{'Tokenizers': 1,
 'divide': 1,
 'strings': 1,
 'into': 1,
 'lists': 1,
 'of': 1,
 'substrings': 1,
 '.': 1,
 'For': 1,
 'example': 1,
 ',': 1,
 'tokenizers': 1,
 'can': 1,
 'be': 1,
 'used': 1,
 'to': 1,
 'find': 1,
 'the': 1,
 'words': 1,
 'and': 1,
 'punctuation': 1,
 'in': 1,
 'a': 1,
 'string': 1,
 ':': 1}

In [4]:
import nltk


class SimpleLanguageModeling():

    def __init__(self):
        self.vocabulary = {}
        self.bigrams = {}

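    def fit(self, text):
        """Count word and bigram frequencies in `text`, sentence by sentence."""
        sentences = nltk.tokenize.sent_tokenize(text)
        for sentence in sentences:
            words = nltk.tokenize.word_tokenize(sentence)
            # Unigram counts: token -> number of occurrences.
            for w in words:
                if w not in self.vocabulary:
                    self.vocabulary[w] = 0
                self.vocabulary[w] += 1
            # Bigram counts: (token, next token) -> number of occurrences.
            # Bigrams never span a sentence boundary.
            for b in nltk.bigrams(words):
                if b not in self.bigrams:
                    self.bigrams[b] = 0
                self.bigrams[b] += 1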
        
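    def predict(self, text):
        """Return the most likely word after the last word of `text`, or "" if none is known."""
        words = nltk.tokenize.word_tokenize(text)
        if not words:
            # Empty input has no last word to condition on.
            return ""
        last_word = words[-1]
        if last_word not in self.vocabulary:
            return ""
        max_probability = 0
        best_word = ""
        for w in self.vocabulary:
            if (last_word, w) in self.bigrams:
                # Pr(w | last_word) = count(last_word, w) / count(last_word)
                p = self.bigrams[(last_word, w)] / self.vocabulary[last_word]
                # The strict '>' keeps the first word encountered when several tie.
                if p > max_probability:
                    max_probability = p
                    best_word = w
        return best_word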


    
training_text = """Training the network is essentially finding a minimum of this multidimensional "loss" or "cost" function. It's done iteratively over many training runs, incrementally changing the network's state. In practice, that entails making many small adjustments to the network's weights based on the outputs that are computed for a random set of input examples, each time starting with the weights that control the output layer and moving backward through the network. (Only the connections to a single neuron in each layer are shown here, for simplicity.) This backpropagation process is repeated over many random sets of training examples until the loss function is minimized, and the network then provides the best results it can for any new input."""
model = SimpleLanguageModeling()
model.fit(training_text)
model.predict("the network is")

'essentially'
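
`predict` returns only the single best word, and on ties it returns whichever candidate it meets first. To inspect every candidate, one option is a small helper that returns the whole conditional distribution for a given last word. The sketch below is an illustration, not part of the model above: `next_word_distribution` is a hypothetical name, and the counts are written by hand rather than taken from the training text.

```python
def next_word_distribution(vocabulary, bigrams, last_word):
    """Map each observed successor w to count(last_word, w) / count(last_word)."""
    total = vocabulary.get(last_word, 0)
    if total == 0:
        # Unknown word: nothing to condition on.
        return {}
    return {
        w: count / total
        for (first, w), count in bigrams.items()
        if first == last_word
    }


# Hypothetical counts, written by hand for illustration.
vocabulary = {"the": 4, "network": 3, "loss": 1, "is": 2}
bigrams = {("the", "network"): 3, ("the", "loss"): 1, ("network", "is"): 2}

distribution = next_word_distribution(vocabulary, bigrams, "the")
print(distribution)  # {'network': 0.75, 'loss': 0.25}
```

With the class above, the same helper could be called as `next_word_distribution(model.vocabulary, model.bigrams, "is")`. Because the denominator counts every occurrence of the last word, including occurrences at the end of a sentence, the values may sum to less than 1.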