# NLP-Various Implementations | N-gram Language Models

**Overview:** In this part of the project, I implemented N-gram language models. A model generates sentences by predicting the next word from the preceding words, using a probability distribution learned from a corpus of training text. I trained eight distinct N-gram models (bigram and trigram models, each with smoothing parameter k = 1 and k = 0.01, on both the original and the lowercase version of the corpus) and evaluated them by measuring their perplexity on held-out test text. Overall, this N-gram approach allows the algorithm to capture patterns and dependencies between words at different scales of context.

## 1. Import all the necessary modules

**Briefly:** The ```nltk``` library provides tools for natural language processing, including ```ngrams``` for extracting n-grams from a sequence of tokens. The ```math``` library provides mathematical functions, and the ```random``` library provides tools for generating random numbers. The ```treebank``` corpus from nltk is used to train and test the language model, and the ```collections``` library provides useful data structures such as ```defaultdict``` and ```Counter```. Finally, the ```PrettyTable``` class from the ```prettytable``` library displays data in a table format.

In [1]:
import nltk
import math
import random
from nltk.corpus import treebank
from nltk.util import ngrams
from collections import defaultdict, Counter
from prettytable import PrettyTable

## 2. Download Treebank Corpus from NLTK

**Download the Treebank corpus from the Natural Language Toolkit (NLTK) library:** the code uses the download_treebank() function to download the ```treebank corpus``` from ```nltk``` library. This is necessary because the Treebank is not included in the default nltk package and needs to be downloaded separately. If the package is already up-to-date, you will see the message "Package treebank is already up-to-date!" printed to the console.

In [2]:
def download_treebank():
    nltk.download("treebank")
    
download_treebank()

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\natalia\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!


## 3. Split the Corpus

**Divide the corpus into two subsets, train_corpus and test_corpus:** The list of file identifiers is first obtained by calling the fileids() method on the treebank corpus in nltk. Inside split_corpus(), the first 170 files are assigned to the local variable train_corpus and the remaining 29 files to the local variable test_corpus.

The split_corpus() function returns the two subsets, which are then stored in the module-level variables train_corpus and test_corpus, respectively.

In [3]:
def split_corpus():
    corpus = treebank.fileids()
    train_corpus = corpus[:170] # 170 news files in train_corpus
    test_corpus = corpus[170:] # 29 news files in test_corpus
    return train_corpus, test_corpus

train_corpus, test_corpus = split_corpus()

## 4. Set the Capitalization

**Create two versions of the corpus, one with all words in lowercase and the other with the original capitalization intact:** The original train_corpus and test_corpus are lists of strings, where each string is a file identifier. To obtain the actual sentences in these files, we use the sents() method on the treebank corpus in nltk. Then, we convert each word to lowercase by iterating over each sentence and each word with a list comprehension. This ensures that a language model trained on the lowercase version of the corpus treats uppercase and lowercase forms of a word as the same token.

The edit_corpus() function returns these two versions of the corpus for both the train and test subsets: corpus_0, which contains the original sentences in the files, and corpus_1, which contains the sentences with lowercase words.

In [4]:
def edit_corpus(corpus):
    corpus_0 = treebank.sents(corpus)
    corpus_1 = [[word.lower() for word in sent] for sent in corpus_0]
    return corpus_0, corpus_1

train_corpus_0, train_corpus_1 = edit_corpus(train_corpus)
test_corpus_0, test_corpus_1 = edit_corpus(test_corpus)
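
To illustrate what lowercasing changes, here is a minimal sketch on a made-up corpus. The two sentences are hypothetical and not taken from the Treebank; only the list comprehension is the same as in edit_corpus().

```python
from collections import Counter

# Hypothetical toy corpus in the same shape as treebank.sents(): a list of token lists
toy_corpus = [["The", "market", "rose", "."],
              ["Analysts", "said", "the", "market", "fell", "."]]

# Same list comprehension as in edit_corpus()
toy_lower = [[word.lower() for word in sent] for sent in toy_corpus]

counts_original = Counter(word for sent in toy_corpus for word in sent)
counts_lower = Counter(word for sent in toy_lower for word in sent)

print(counts_original["The"], counts_original["the"])  # counted as two different words
print(counts_lower["the"])                             # counted as one word
print(len(counts_original), len(counts_lower))         # number of distinct words
```

Lowercasing merges the two spellings into one entry, which shrinks the set of distinct words and pools their counts.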

## 5. Create the Vocabulary

**Define two versions of the vocabulary, one with words from the original training corpus and one with words from its lowercase version:** The create_vocab() function creates the set of unique words that occur in the training corpus at least min_freq times. To obtain the words in the corpus, we use a generator expression that iterates over each sentence in the corpus and then over each word in the sentence. The Counter() class then counts the frequency of each word and the items() method returns the word-count pairs. The resulting set of words is returned by the function.

Variables a and b represent the smoothing parameters k=1 and k=0.01 used in add-k smoothing (k=1 is the special case known as Laplace smoothing). Variables bi and tri represent the order of the n-gram models used in the training process: a bigram model with n=2 and a trigram model with n=3. Variable min_freq is the minimum frequency threshold that create_vocab() uses to filter out words with few occurrences in the training corpus. Finally, ns is the number of sentences to be generated with each trained n-gram model.

The create_vocab() function is called twice with the two versions of the train corpus (corpus_0 and corpus_1) and a minimum frequency of 3. The resulting sets of words are stored in the variables vocab_0 and vocab_1, respectively.

In [5]:
def create_vocab(train_corpus, min_freq):
    return {word for word, count in Counter(word for sent in train_corpus for word in sent).items() if count >= min_freq}

a, b, bi, tri, min_freq, ns = 1, 0.01, 2, 3, 3, 3
vocab_0 = create_vocab(train_corpus_0, min_freq)
vocab_1 = create_vocab(train_corpus_1, min_freq)
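
The effect of the frequency threshold can be checked on a small example. The sketch below repeats the logic of create_vocab() so that it runs on its own; the toy corpus is hypothetical.

```python
from collections import Counter

def create_vocab(train_corpus, min_freq):
    # Same logic as create_vocab() above: keep words seen at least min_freq times
    counts = Counter(word for sent in train_corpus for word in sent)
    return {word for word, count in counts.items() if count >= min_freq}

# Hypothetical toy corpus: "the" occurs 3 times, "cat" and "sat" twice, "dog" and "ran" once
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

vocab_min2 = create_vocab(toy_corpus, min_freq=2)
vocab_min3 = create_vocab(toy_corpus, min_freq=3)
print(sorted(vocab_min2))
print(sorted(vocab_min3))
```

Words that fall below the threshold are not dropped from the corpus; they are later replaced by the out-of-vocabulary label during preprocessing.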

## 6. Preprocess the corpus to extract n-grams

**6.1: Pad the corpus' sentences with start/end symbols and extract n-grams out of it:** The preprocess() function takes in the corpus, the vocabulary, the start symbol, the end symbol, the label for out-of-vocabulary words, and an integer n that specifies the order of the n-grams to be extracted. The function first adds one start and one end symbol to each sentence in the corpus. It then calls ngrams() with left and right padding enabled, which adds a further n-1 start symbols and n-1 end symbols around the sentence before creating the n-grams of order n. The resulting n-grams are filtered to remove any that contain more than one start symbol or more than one end symbol. Finally, the function replaces any words in the n-grams that are not in the vocabulary with the specified label using the replace_tokens() function.

**6.2: Modify n-grams by replacing out-of-vocabulary words with the unknown label:** The replace_tokens() function takes in the list of n-grams, the vocabulary, the start symbol, the end symbol, and the label for out-of-vocabulary words. It replaces any words in the n-grams that are not in the vocabulary or the start/end symbols with this specified label. The function returns a new list of n-grams with the replaced tokens.

In [6]:
def preprocess(corpus, vocab, start, end, label, n):
    ngrams_list = []
    sents = [[start] + sent + [end] for sent in corpus]
    for sent in sents:
        padded_sent = ngrams(tuple(sent), n, pad_left=True, pad_right=True, left_pad_symbol=start, right_pad_symbol=end) # pads the sentence with start and end symbols, and creates ngrams of order n
        sent_ngrams = [ngram for ngram in padded_sent if ngram.count(start) < 2 and ngram.count(end) < 2]
        ngrams_list.extend(sent_ngrams)
    ngrams_list = replace_tokens(ngrams_list, vocab, start, end, label)
    return ngrams_list

def replace_tokens(ngrams, vocab, start, end, label):
    return [tuple(label if word not in vocab and word not in {start, end} else word for word in ngram) for ngram in ngrams]
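
The combined effect of padding, filtering, and token replacement is easiest to see on a single sentence. The sketch below is a pure-Python stand-in for the ngrams() call, written under the assumption that padding adds n-1 symbols on each side; the helper name extract_ngrams and the example sentence are hypothetical.

```python
def extract_ngrams(sent, n, start="<BOS>", end="<EOS>"):
    """Stand-in for the padding and filtering steps of preprocess()."""
    tokens = [start] + sent + [end]
    # Mimic ngrams(..., pad_left=True, pad_right=True): n-1 extra symbols per side
    padded = [start] * (n - 1) + tokens + [end] * (n - 1)
    windows = [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]
    # Drop windows that contain two or more start symbols or two or more end symbols
    return [w for w in windows if w.count(start) < 2 and w.count(end) < 2]

def replace_tokens(ngrams, vocab, start, end, label):
    # Same logic as replace_tokens() above
    return [tuple(label if word not in vocab and word not in {start, end} else word
                  for word in ngram) for ngram in ngrams]

sentence = ["the", "zebra", "sat"]
vocab = {"the", "sat"}  # "zebra" is out of vocabulary

bigrams = replace_tokens(extract_ngrams(sentence, 2), vocab, "<BOS>", "<EOS>", "<UNK>")
trigrams = replace_tokens(extract_ngrams(sentence, 3), vocab, "<BOS>", "<EOS>", "<UNK>")
print(bigrams)
print(trigrams)
```

After filtering, the surviving n-grams are exactly the sliding windows over the sentence with one start and one end symbol. One consequence for the trigram case is that the first word of a sentence only ever appears as context and is never itself predicted.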

## 7. Build and Evaluate the N-Gram Language Model

**7.1: Train the n-gram model:** The train() function takes in the smoothing parameter k, the vocabulary, and the list of n-grams, and returns a k-smoothed n-gram model together with a fallback probability for each prefix. The function counts every n-gram and every prefix (the first n-1 words of an n-gram), and then calculates the smoothed probability of each observed last word given its prefix. The resulting model is returned as a defaultdict of Counters that maps each prefix to the probabilities of the words seen after it. The second return value, prefix_probs, holds for each prefix the smoothed probability assigned to any word that was never seen after that prefix.

**7.2: Evaluate the n-gram model by measuring its perplexity:** The evaluate() function takes in the n-gram model, the prefix probabilities, the list of n-grams to be evaluated (extracted from the testing corpus), and the vocabulary. The function calculates the perplexity of the n-grams from the probabilities given by the model: an n-gram seen in training uses its stored probability, an unseen word after a known prefix uses the prefix probability, and an unknown prefix falls back to the uniform probability 1/len(vocab). The lower the perplexity, the better the model performs on the given n-grams.

**7.3: Build and apply the n-gram model:** The build_and_apply() function takes in the order of n-grams n, the smoothing parameter k, the list of training n-grams, the vocabulary, the list of test n-grams, and a boolean value lowercase indicating whether the n-grams are extracted from the original or the lowercase version of the training/testing corpus. The function trains the n-gram model using the train() function, evaluates the model using the evaluate() function, and prints out the perplexity results in a pretty table. Finally, the trained n-gram model is returned.
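
Written as a formula, the estimate computed in train() corresponds to add-k smoothing. The notation is introduced here for illustration: h is the prefix of n-1 words, w is the predicted word, c(·) is a count in the training n-grams, and |V| is len(vocab).

$$
P_k(w \mid h) = \frac{c(h, w) + k}{c(h) + k\,|V|}
$$

For a word never seen after h, c(h, w) = 0 gives k / (c(h) + k|V|), which is the value stored in prefix_probs. For a prefix never seen in training, c(h) = 0 gives 1/|V|, which is the fallback used in evaluate(). Reading the code, len(vocab) does not count the out-of-vocabulary label or the end symbol, although both can be predicted, so the smoothed probabilities for a given prefix sum to slightly more than one.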

In [7]:
def train(k, vocab, ngrams):
    ngram_model, ngram_counts = defaultdict(Counter), Counter(ngrams)
    prefix_counts, prefix_probs = defaultdict(int), defaultdict(float)
    for ngram, count in ngram_counts.items():
        prefix_counts[ngram[:-1]] += count
    for prefix, count in prefix_counts.items():
        prefix_probs[prefix] = k / (count + k * len(vocab))
    for ngram, count in ngram_counts.items():
        ngram_model[ngram[:-1]][ngram[-1]] = (count + k) / (prefix_counts[ngram[:-1]] + k * len(vocab))
    return ngram_model, prefix_probs

def evaluate(ngram_model, prefix_probs, ngrams, vocab):
    total_log_prob = 0.0
    for ngram in ngrams:
        prefix, word = ngram[:-1], ngram[-1]
        if prefix in ngram_model:
            # Known prefix: stored probability, or the fallback for a word unseen after it
            prob = ngram_model[prefix].get(word, prefix_probs[prefix])
        else:
            prob = 1 / len(vocab)  # unknown prefix: uniform fallback
        total_log_prob += math.log(prob)
    perplexity = math.exp(-(total_log_prob / len(ngrams)))
    return perplexity

def build_and_apply(n, k, train_ngrams, vocab, test_ngrams, lowercase):
    ngram_model, prefix_probs = train(k, vocab, train_ngrams)
    perplexity = evaluate(ngram_model, prefix_probs, test_ngrams, vocab)
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["Model", "k", "Lowercase", "Perplexity"]])
    pt.add_row(["Bigram" if n == 2 else "Trigram", k, lowercase, perplexity])
    print(pt)
    return ngram_model
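
The training and evaluation steps can be traced by hand on a toy example. The following is a minimal sketch, not the notebook's own code: the bigrams are made up, the helper names are hypothetical, and vocab_size is set by hand (here it counts the end symbol, unlike len(vocab) above).

```python
import math
from collections import Counter

def add_k_prob(ngram_counts, prefix_counts, vocab_size, ngram, k):
    """Add-k estimate (c(h, w) + k) / (c(h) + k * |V|); a Counter returns 0 for unseen keys."""
    return (ngram_counts[ngram] + k) / (prefix_counts[ngram[:-1]] + k * vocab_size)

def perplexity(train_ngrams, test_ngrams, vocab_size, k):
    ngram_counts = Counter(train_ngrams)
    prefix_counts = Counter(ngram[:-1] for ngram in train_ngrams)
    log_prob = sum(math.log(add_k_prob(ngram_counts, prefix_counts, vocab_size, ngram, k))
                   for ngram in test_ngrams)
    return math.exp(-log_prob / len(test_ngrams))

# Hypothetical toy bigrams, not taken from the Treebank corpus
train_bigrams = [("<BOS>", "the"), ("the", "cat"), ("cat", "sat"), ("sat", "<EOS>"),
                 ("<BOS>", "the"), ("the", "dog"), ("dog", "sat"), ("sat", "<EOS>")]
test_bigrams = [("<BOS>", "the"), ("the", "cat"), ("cat", "<EOS>")]
vocab_size = 5  # the, cat, dog, sat and <EOS>

# With k = 1 the three test probabilities are 3/7, 2/7 and 1/6
pp_k1 = perplexity(train_bigrams, test_bigrams, vocab_size, k=1)
pp_k001 = perplexity(train_bigrams, test_bigrams, vocab_size, k=0.01)
print(pp_k1, pp_k001)
```

In this toy case the unseen test bigram ("cat", "<EOS>") makes k = 0.01 score worse than k = 1, because a small k leaves very little probability for unseen events. Which value of k works better depends on how much of the test data was seen in training.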

## 8. Generate sentences using the N-Gram Language Model

**Generate sentences using the n-gram model:** The generate_sentences() function takes in the trained n-gram model, the order n of the n-grams used in the model, the start/end symbols, the vocabulary and the number of sentences to generate. Since the model predicts a word from the n-1 words before it, generation cannot begin without an initial prefix (a p-gram, where p = n-1). The function therefore collects all prefixes in the model that begin with the start symbol and randomly selects one of them. It then generates each subsequent word by sampling, in proportion to the model's probabilities, among the words that were seen after the last p words of the sentence, restricted to in-vocabulary words and the end symbol (so the out-of-vocabulary label is never sampled). If the model has no entry for these p words, or if no candidate remains after this restriction, the function appends the end symbol and stops generating that sentence. Finally, it prints each generated sentence to the console.

In [8]:
def generate_sentences(model, n, start, end, vocab, num_sents):
    # Prefixes ((n-1)-grams) that begin with the start symbol; the list is the same for every sentence
    start_prefixes = [prefix for prefix in model if prefix[0] == start]
    for i in range(num_sents):
        sentence = list(random.choice(start_prefixes))
        while sentence[-1] != end:
            prefix = tuple(sentence[-(n - 1):])  # the last n-1 words of the sentence
            # Keep only in-vocabulary words and the end symbol (this excludes the unknown label)
            candidates = [(c, p) for c, p in model.get(prefix, {}).items() if c in vocab or c == end]
            if not candidates:
                # Unseen prefix or no usable continuation: close the sentence early
                sentence.append(end)
                break
            choices, probabilities = zip(*candidates)
            sentence.append(random.choices(choices, weights=probabilities)[0])
        print(f"\033[1mSentence {i+1}:\033[0m " + " ".join(sentence))
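
The sampling loop can be tried without training a model. The sketch below uses a hand-written bigram model; the probabilities, the generate() helper, and its max_len safeguard are all hypothetical and only illustrate weighted sampling with random.choices().

```python
import random

def generate(model, start, end, rng, max_len=20):
    """Sample one sentence from a toy bigram model (a dict of next-word distributions)."""
    sentence = [start]
    while sentence[-1] != end and len(sentence) < max_len:
        distribution = model.get((sentence[-1],))
        if not distribution:
            break  # no known continuation for this prefix
        words, weights = zip(*distribution.items())
        # Draw one word with probability proportional to its weight
        sentence.append(rng.choices(words, weights=weights)[0])
    if sentence[-1] != end:
        sentence.append(end)  # close the sentence if generation stopped early
    return sentence

toy_model = {
    ("<BOS>",): {"the": 0.8, "a": 0.2},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("a",): {"cat": 0.5, "dog": 0.5},
    ("cat",): {"sat": 0.6, "<EOS>": 0.4},
    ("dog",): {"sat": 0.6, "<EOS>": 0.4},
    ("sat",): {"<EOS>": 1.0},
}

rng = random.Random(0)  # fixed seed so that reruns produce the same sentences
for _ in range(3):
    print(" ".join(generate(toy_model, "<BOS>", "<EOS>", rng)))
```

Because each word depends only on the previous n-1 words, locally plausible word pairs can still add up to a globally incoherent sentence, which is what the generated samples in the next section show.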

## 9. Evaluate the perplexity values and assess the quality of the generated sentences

**Perplexity:** it measures how well a model predicts the test corpus (a collection of sentences that it has not seen before), based on the probability distribution it has learned from the training corpus. Lower perplexity values indicate better performance.
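
As computed by evaluate(), perplexity is the exponential of the average negative log-probability of the test n-grams. In the notation below (introduced here for illustration), N is the number of test n-grams, w_i is the last word of the i-th n-gram and h_i is its prefix:

$$
\mathrm{PP} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid h_i)\right)
$$

Equivalently, perplexity is the inverse of the geometric mean of the predicted probabilities. A model that assigned the uniform probability 1/M to every word would have a perplexity of exactly M, so a perplexity of 400 can be read as the model being about as uncertain as a uniform choice among 400 words.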

**9.1: Bigram Model with k-smoothing (k = 1), trained on the original training corpus (lowercase = 0):** 
This bigram model achieved a perplexity value of 383.50. This value is considered good, indicating that the model is reasonably accurate in predicting the test corpus.

In [9]:
bigram_model_a0 = build_and_apply(bi, a, preprocess(train_corpus_0, vocab_0, "<BOS>", "<EOS>", "<UNK>", bi), vocab_0, preprocess(test_corpus_0, vocab_0, "<BOS>", "<EOS>", "<UNK>", bi), False)

+--------+---+-----------+--------------------+
| Model  | k | Lowercase |     Perplexity     |
+--------+---+-----------+--------------------+
| Bigram | 1 |   False   | 383.50361532871557 |
+--------+---+-----------+--------------------+


The generated sentences from this model exhibit some level of coherence, but there are also noticeable issues with grammar and syntax. The model appears to struggle with producing sentences that follow a logical structure, and many of the sentences lack context or meaningful content.

The first sentence generated by the model is grammatically incorrect and lacks coherence. It appears to be a jumble of words that do not form a coherent idea or message. The second sentence is more coherent than the first, but it still lacks proper grammar and syntax. It appears to suggest that fast-food restaurants are somehow preventing homelessness, which is a nonsensical idea. The third sentence generated by the model is also lacking in coherence and meaningful content. It appears to be a string of words that do not form a coherent sentence or convey a clear message.

In [27]:
generate_sentences(bigram_model_a0, bi, "<BOS>", "<EOS>", vocab_0, ns)

Sentence 1: <BOS> At the school , in Moscow found that Southeast Asian nations runs the purchase plans or change in cataract <EOS>
Sentence 2: <BOS> The fast-food restaurants , were in order preventing homelessness . <EOS>
Sentence 3: <BOS> The March 16 to a savings now on the authors of dry fibers and Commerce Department economists do *T*-2 : <EOS>


**9.2: Bigram Model with k-smoothing (k = 1), trained on the lowercase training corpus (lowercase = 1):** This bigram model achieved a perplexity value of 383.95. The perplexity value is considered good, indicating that the model is reasonably accurate in predicting the test corpus.

The perplexity value of this model is slightly higher than the perplexity value of the bigram model with k-smoothing (k = 1) trained on the original training corpus (lowercase = 0). This suggests that the lowercase version of the training corpus may not have a significant impact on the performance of the bigram model with k-smoothing.

In [11]:
bigram_model_a1 = build_and_apply(bi, a, preprocess(train_corpus_1, vocab_1, "<BOS>", "<EOS>", "<UNK>", bi), vocab_1, preprocess(test_corpus_1, vocab_1, "<BOS>", "<EOS>", "<UNK>", bi), True)

+--------+---+-----------+-------------------+
| Model  | k | Lowercase |     Perplexity    |
+--------+---+-----------+-------------------+
| Bigram | 1 |    True   | 383.9460197558427 |
+--------+---+-----------+-------------------+


The generated sentences from this model exhibit noticeable issues with coherence, grammar, and syntax. The model appears to struggle with producing sentences that follow a logical structure and convey a clear message.

The first sentence generated by the model is grammatically incorrect and lacks coherence. It appears to be a jumble of words that do not form a coherent idea or message. The second sentence is more coherent than the first, but it still lacks proper grammar and syntax. It seems to suggest that some firms do not want federal funds, but it is not clear why this is the case or what the sentence is trying to convey. The third sentence generated by the model is also lacking in coherence and meaningful content. It appears to be a string of words that do not form a coherent sentence or convey a clear message.

In [39]:
generate_sentences(bigram_model_a1, bi, "<BOS>", "<EOS>", vocab_1, ns)

Sentence 1: <BOS> * to get across much faster and learning materials are far there is without him to yield below are chicago corp . <EOS>
Sentence 2: <BOS> the firms are n't want *-1 over federal funds currently owns and executives , or roederer cristal in hopes *-1 when the unconstitutional . <EOS>
Sentence 3: <BOS> they found *-2 given the nose on the deals , 52 years , these preparation tests and it would have been reported * do *?* . <EOS>


**9.3: Bigram Model with k-smoothing (k = 0.01), trained on the original training corpus (lowercase = 0):** This bigram model achieved a perplexity value of 137.81, the lowest of the eight models, indicating that it predicts the test corpus better than any of the others.

The perplexity value of this model is significantly lower than those of the two bigram models with k-smoothing (k = 1), trained on the original and lowercase versions of the training corpus. This indicates that the use of k-smoothing with a smaller k value can improve the performance of the bigram model.
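
A short calculation suggests why a smaller k can help. The sketch below applies the add-k formula to a single hypothetical case; the counts and the vocabulary size of 3000 are made-up numbers, not values measured on the Treebank.

```python
def add_k(count_ngram, count_prefix, vocab_size, k):
    """Add-k estimate (c(h, w) + k) / (c(h) + k * |V|)."""
    return (count_ngram + k) / (count_prefix + k * vocab_size)

vocab_size = 3000  # hypothetical vocabulary size, not the actual len(vocab_0)

# A word that followed its prefix in both of the prefix's 2 training occurrences
p_k1 = add_k(2, 2, vocab_size, k=1)       # (2 + 1) / (2 + 3000)
p_k001 = add_k(2, 2, vocab_size, k=0.01)  # (2 + 0.01) / (2 + 30)
print(p_k1, p_k001)
```

With k = 1 the term k·|V| dominates the denominator, so even a continuation that was always observed receives a probability of about 0.001. With k = 0.01 the same continuation keeps a probability of about 0.06, which lowers perplexity whenever the test data repeats patterns seen in training.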

In [13]:
# Bigram Model with k = 0.01 Smoothing, where 0: lowercase = False
bigram_model_b0 = build_and_apply(bi, b, preprocess(train_corpus_0, vocab_0, "<BOS>", "<EOS>", "<UNK>", bi), vocab_0, preprocess(test_corpus_0, vocab_0, "<BOS>", "<EOS>", "<UNK>", bi), False)

+--------+------+-----------+--------------------+
| Model  |  k   | Lowercase |     Perplexity     |
+--------+------+-----------+--------------------+
| Bigram | 0.01 |   False   | 137.81108464477174 |
+--------+------+-----------+--------------------+


The generated sentences from this model exhibit some level of coherence, but there are also noticeable issues with grammar and syntax. The model appears to struggle with producing sentences that follow a logical structure, and many of the sentences lack context or meaningful content.

The first sentence generated by the model is lacking in coherence and meaningful content. It appears to be a string of words that do not form a coherent sentence or convey a clear message. The second sentence is also lacking in coherence and meaningful content. It appears to suggest that there may be some sort of deception or falsehood in major magazines, but it does not provide any evidence or context for this claim. The third sentence generated by the model is more coherent than the first two, but it still lacks proper grammar and syntax. It appears to suggest that government leaders were involved in some sort of trading activity, but it is unclear what this activity was or what its significance might be.

In [52]:
generate_sentences(bigram_model_b0, bi, "<BOS>", "<EOS>", vocab_0, ns)

Sentence 1: <BOS> The company , in Standard & Poor 's responsibilities , Heritage would be the company were barred the Mitsubishi Estate Co. , an appeal is the risks is n't Buick spokeswoman says *T*-1 , factory goods more slowly and would take his team , advanced 2.50 *U* , when trading hours . <EOS>
Sentence 2: <BOS> Market Index went over 14 major magazine is lying ? <EOS>
Sentence 3: <BOS> The government leaders in moderate trading at the Supreme Court last year from a 12 points on the Senate stands *T*-1 . <EOS>


**9.4: Bigram Model with k-smoothing (k = 0.01), trained on the lowercase training corpus (lowercase = 1):** This bigram model achieved a perplexity value of 143.79, the second-lowest of the eight models, indicating that it predicts the test corpus well relative to the others.

The perplexity value of this model is slightly higher than the perplexity value of the bigram model with k-smoothing (k = 0.01) trained on the original training corpus (lowercase = 0). This suggests that using the lowercase version of the training corpus may not have a significant impact on the performance of the bigram model with k-smoothing.

In [15]:
bigram_model_b1 = build_and_apply(bi, b, preprocess(train_corpus_1, vocab_1, "<BOS>", "<EOS>", "<UNK>", bi), vocab_1, preprocess(test_corpus_1, vocab_1, "<BOS>", "<EOS>", "<UNK>", bi), True)

+--------+------+-----------+--------------------+
| Model  |  k   | Lowercase |     Perplexity     |
+--------+------+-----------+--------------------+
| Bigram | 0.01 |    True   | 143.78868465313255 |
+--------+------+-----------+--------------------+


The generated sentences from this model exhibit significant issues with coherence, grammar, and syntax. The model appears to struggle with producing sentences that follow a logical structure, and many of the sentences lack context or meaningful content.

The first sentence generated by the model is difficult to understand and lacks coherence. It appears to suggest that someone named Mr. McGovern is quoted on a lower figure for program trading, but the sentence structure is confusing and nonsensical. The second sentence is short and straightforward, but it lacks any meaningful context or information. The third sentence is grammatically incorrect and lacks coherence. It appears to be a string of words that do not form a coherent sentence or convey a clear message.

In [54]:
generate_sentences(bigram_model_b1, bi, "<BOS>", "<EOS>", vocab_1, ns)

Sentence 1: <BOS> under mr. mcgovern was quoted him land under a lower figures for program trading `` why do not *-2 john phelan said *t*-1 . <EOS>
Sentence 2: <BOS> several local radio stations that so-called weak . <EOS>
Sentence 3: <BOS> the board membership on the executive power in dividends on trade publishing executive committee , the reorganization process . -rrb- by a competitive world series bonds are prepared *-1 . <EOS>


**9.5: Trigram Model with k-smoothing (k = 1), trained on the original training corpus (lowercase = 0):** This trigram model achieved a perplexity value of 1504.61. The perplexity value is considered poor, indicating that the model is not very accurate in predicting the test corpus.

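The perplexity value of this model is significantly higher than the perplexity values of all the previous bigram models. This suggests that, with this smoothing scheme and a training corpus of this size, trigrams are less effective than bigrams: most trigram contexts occur only a few times, so add-one smoothing moves a large share of the probability mass away from the words that were actually observed.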

In [17]:
trigram_model_a0 = build_and_apply(tri, a, preprocess(train_corpus_0, vocab_0, "<BOS>", "<EOS>", "<UNK>", tri), vocab_0, preprocess(test_corpus_0, vocab_0, "<BOS>", "<EOS>", "<UNK>", tri), False)

+---------+---+-----------+--------------------+
|  Model  | k | Lowercase |     Perplexity     |
+---------+---+-----------+--------------------+
| Trigram | 1 |   False   | 1504.6100128907715 |
+---------+---+-----------+--------------------+


The generated sentences from this model exhibit a low level of coherence and meaningful content. There are noticeable issues with grammar and syntax, and many of the sentences lack a clear structure or logical flow.

The first sentence generated by the model is confusing and lacks coherence. It appears to be a jumble of words and phrases that do not form a coherent idea or message. The second sentence is grammatically correct, but it lacks context and meaningful content. It seems to be a fragment of a larger idea, but it does not convey any clear message. The third sentence is short and to the point, but it is also lacking in meaningful content and context. It does not provide any information or convey any clear message.

In [56]:
generate_sentences(trigram_model_a0, tri, "<BOS>", "<EOS>", vocab_0, ns)

Sentence 1: <BOS> About 20,000 sets of Learning Materials should n't be reached *-1 for $ 23 million *U* of which *T*-3 has been a slowing U.S. economy , and print it in my newspaper ? <EOS>
Sentence 2: <BOS> Williams , will join the committee , said 0 *T*-1 . '' <EOS>
Sentence 3: <BOS> Terms were n't disclosed *-1 . <EOS>


**9.6: Trigram Model with k-smoothing (k = 1), trained on the lowercase training corpus (lowercase = 1):** This trigram model achieved a perplexity value of 1470.53. The perplexity value is considered poor, indicating that the model is not very accurate in predicting the test corpus.

The perplexity value of this model is slightly lower than the perplexity value of the trigram model with k-smoothing (k = 1) trained on the original training corpus (lowercase = 0). This suggests that using the lowercase version of the training corpus may have a slight impact on the performance of the trigram model with k-smoothing.

In [19]:
trigram_model_a1 = build_and_apply(tri, a, preprocess(train_corpus_1, vocab_1, "<BOS>", "<EOS>", "<UNK>", tri), vocab_1, preprocess(test_corpus_1, vocab_1, "<BOS>", "<EOS>", "<UNK>", tri), True)

+---------+---+-----------+--------------------+
|  Model  | k | Lowercase |     Perplexity     |
+---------+---+-----------+--------------------+
| Trigram | 1 |    True   | 1470.5319718904307 |
+---------+---+-----------+--------------------+


The generated sentences from this model exhibit some level of coherence, but there are also noticeable issues with grammar and syntax. The model appears to struggle with producing sentences that follow a logical structure, and many of the sentences lack context or meaningful content.

The first sentence generated by the model is relatively coherent and appears to suggest that an investor named Harold Simmons offered securities worth $2 billion to Japanese institutions. However, it lacks context and meaningful content as it doesn't provide any additional information about the situation. The second sentence generated by the model is somewhat coherent, but it also has issues with grammar and syntax. It appears to suggest that there were some fall ballot issues that set a precedent for a bad year, and it could benefit agriculture. However, it lacks proper punctuation and context to convey a clear message. The third sentence generated by the model is grammatically correct but lacks meaningful content and coherence. It appears to suggest that typically these companies will have won, but it doesn't provide any context or information about the companies or the situation.

In [61]:
generate_sentences(trigram_model_a1, tri, "<BOS>", "<EOS>", vocab_1, ns)

Sentence 1: <BOS> investor harold simmons , offered $ 2 billion *u* of securities by japanese institutions . <EOS>
Sentence 2: <BOS> fall ballot issues set a precedent for a bad year was the opportunity * to benefit agriculture , '' he says *t*-1 joseph <EOS>
Sentence 3: <BOS> typically , these companies will have won . <EOS>


**9.7: Trigram Model with k-smoothing (k = 0.01), trained on the original training corpus (lowercase = 0):** This trigram model achieved a perplexity value of 463.80, which is considered to be good, indicating that the model is reasonably accurate in predicting the test corpus.

The perplexity value of this model is significantly lower than the perplexity value of the trigram model with k-smoothing (k = 1) trained on the original training corpus (1504.61). This indicates that the use of k-smoothing with a smaller k value can improve the performance of the trigram model.

In [21]:
trigram_model_b0 = build_and_apply(tri, b, preprocess(train_corpus_0, vocab_0, "<BOS>", "<EOS>", "<UNK>", tri), vocab_0, preprocess(test_corpus_0, vocab_0, "<BOS>", "<EOS>", "<UNK>", tri), False)

+---------+------+-----------+--------------------+
|  Model  |  k   | Lowercase |     Perplexity     |
+---------+------+-----------+--------------------+
| Trigram | 0.01 |   False   | 463.80467915524156 |
+---------+------+-----------+--------------------+


The generated sentences from this model exhibit a low level of coherence and meaningful content. There are noticeable issues with grammar and syntax, and many of the sentences lack a clear structure or logical flow.

The first sentence generated by the model is confusing and lacks coherence. It appears to suggest that Williams shares have some sort of note related to winning the program-trading issue, but the sentence is jumbled and doesn't convey any clear message. The second sentence is grammatically correct, but it lacks context and meaningful content. It appears to suggest that losses from the computer sector have affected a mutual fund, but it doesn't provide any further information about the situation or its significance. The third sentence is also confusing and lacks meaningful content and coherence. It appears to suggest that discussions with Moody's may have influenced Dunkin's actions, but it doesn't provide any clear context or information about what those actions might be or why they matter.

In [68]:
generate_sentences(trigram_model_b0, tri, "<BOS>", "<EOS>", vocab_0, ns)

Sentence 1: <BOS> Williams shares , notes 0 *T*-2 to be winning the program-trading issue is heating up on <EOS>
Sentence 2: <BOS> Travelers estimated that losses from the computer sector , our primary market , this exclusive club has taken in a mutual fund . <EOS>
Sentence 3: <BOS> Dunkin' is based *-1 upon discussions with a Moody 's said 0 it does anything for the Soviets might still face legal barriers to electronic fund raising . <EOS>


**9.8: Trigram Model with k-smoothing (k = 0.01), trained on the lowercase training corpus (lowercase = 1):** This trigram model achieved a perplexity value of 461.77, which is considered to be good, indicating that the model is reasonably accurate in predicting the test corpus.

The perplexity value of this model is slightly lower than the perplexity value of the trigram model with k-smoothing (k = 0.01) trained on the original training corpus (463.80). This suggests that using the lowercase version of the training corpus may have a slight impact on the performance of the trigram model with k-smoothing.

In [23]:
trigram_model_b1 = build_and_apply(tri, b, preprocess(train_corpus_1, vocab_1, "<BOS>", "<EOS>", "<UNK>", tri), vocab_1, preprocess(test_corpus_1, vocab_1, "<BOS>", "<EOS>", "<UNK>", tri), True)

+---------+------+-----------+------------------+
|  Model  |  k   | Lowercase |    Perplexity    |
+---------+------+-----------+------------------+
| Trigram | 0.01 |    True   | 461.769817591675 |
+---------+------+-----------+------------------+


The generated sentences from this model exhibit some level of coherence, but there are also noticeable issues with grammar and syntax. The model appears to struggle with producing sentences that follow a logical structure, and many of the sentences lack context or meaningful content.

The first sentence generated by the model is short and lacks meaningful content or context. It appears to suggest that net income has totaled a certain amount, but it does not provide any information about the company or the situation. The second sentence generated by the model lacks coherence and meaningful content. While it appears to suggest that Wall Street Journal reporters have found tests to be higher in some way, the lack of context and proper punctuation makes it difficult to understand. The third sentence is more coherent, but it still lacks meaningful content and context. It appears to suggest that there is an increase in demand for tramp classic due to the safety of chemicals, but it does not provide any additional information or context to make it clear.

In [80]:
generate_sentences(trigram_model_b1, tri, "<BOS>", "<EOS>", vocab_1, ns)

Sentence 1: <BOS> net income totaled $ <EOS>
Sentence 2: <BOS> wall street journal reporters across the country where tests mean as much as 1\/8 point higher . <EOS>
Sentence 3: <BOS> serial bonds are priced *-1 * to remain fully invested yet have jumped *-1 to impose on light trucks and vans the safety of chemicals ; the 30-day simple yield of the tramp as the most in demand : classic <EOS>
