# N-grams as Tokens for Phrases

In this first project work we will use N-grams as tokens to create phrases in English and evaluate how much sense these phrases make.

### Data preparation

One of the most important parts of this project is preparing N-grams so that they can be used as tokens later on. There are two approaches:
- using an existing dictionary of N-grams, such as Google Ngram;
- creating a new one from a large corpus of text.
We will be using the second approach since it would be easier to train our model on the topics we are interested in.

Another choice that has to be made is how many items do we want to consider for the N-grams? We will create a dictionary of **3 words** using NLTK library corpora. The main disadvantage of NLTK is that its corpora is quite limited in size and in application. But they are already cleaned and tokenized. This saves time on preprocessing.

In [1]:
pip install numpy==1.19.5 --user

Note: you may need to restart the kernel to use updated packages.


In [2]:
import random
from collections import defaultdict, Counter
from nltk.util import ngrams
import nltk
from nltk.corpus import brown, gutenberg, reuters, inaugural, movie_reviews

In [3]:
# punkt is a Sentence Tokenizer
nltk.download('punkt')
# Gutenberg Corpus includes public domain literary texts from authors like Shakespeare and Jane Austen
nltk.download('gutenberg')

# more corpora that can be used to create the dictionary
nltk.download('brown')
nltk.download('reuters')
nltk.download('inaugural')
nltk.download('movie_reviews')

# loading the corpus
all_words = brown.words()
    
# generating bigrams (n=2) or trigrams (n=3)
n = 3
trigrams = list(ngrams(all_words, n))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mo4al\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\mo4al\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\mo4al\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\mo4al\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\mo4al\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\mo4al\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


If we want to work on a large-scale NLP project that requires substantial data, we might want to switch to a larger corpus like Common Crawl (hundreds of TB, but need to preprocess and clean), OpenWebText (about 40GB), or Wikipedia Dump (about 25GB).
We would need to:
- download the corpus;
- preprocess removing tags, non article sections (if using Wikipedia);
- tokenize sentences into words

A different code should be executed if we want to use an available N-gram dataset, for example Google Ngram Viewer.
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html

### Building the model
We want to create an N-gram language model using the data prepared in the first step. It is also possible to build a frequency-based model, where the likelihood of an n-gram appearing is proportional to its frequency in the training data.


In [4]:
def create_trigram_model(trigrams):
    # getting the frequency of each n-gram
    ngram_freq = Counter(trigrams)

    # creating a dictionary where each (n-1)-gram maps to possible next words
    model = defaultdict(list)

    # for trigrams, the prefix will be the first two words, and the next word will be the third
    for (w1, w2, w3) in trigrams:
        model[(w1, w2)].append(w3)

    return ngram_freq, model

### Generating phrases
We consider an initial word from which our phrase would start (in the dictionary it would be preceeded by special tokens to indicate their probable position at the beginning of the phrase), then we predict the next word by sampling from the probability distribution of possible next words given the previous n-1 words. We should continue this process until a phrase is of a certain length or a stop condition is met.
The choice of the next word can be done using one of these approaches:
- **greedy**: always pick the most probable word;
- **random**: sample words based on probability in the corpus.

In [5]:
def generate_random_text(model, start_words, length=20):
    text = list(start_words)
    current_words = tuple(start_words)
    
    for _ in range(length):
        if current_words in model:
            
            # Choose the next word based on its frequency distribution
            next_word = random.choice(model[current_words])
            text.append(next_word)
            current_words = tuple(text[-(n-1):])
        else:
            break  # Stop if we no next word is available
    
    return ' '.join(text)

def generate_greedy_text(model, start_words, length=20):
    text = list(start_words)
    current_words = tuple(start_words)
    
    for _ in range(length):
        if current_words in model:
            
            # Select the most frequent next word
            next_word = Counter(model[current_words]).most_common(1)[0][0]  # Get the most common next word
            text.append(next_word)
            current_words = tuple(text[-(n-1):])
        else:
            break  # Stop if no next word is available
    
    return ' '.join(text)

In [28]:
trigram_freq, model_ngram = create_trigram_model(trigrams)

start_words = ('the', 'name')  # Start with any bigram (n-1) of your choice
generated_random_text = generate_random_text(model_ngram, start_words, length=10)
print(generated_random_text)

generated_greedy_text = generate_greedy_text(model_ngram, start_words, length=10)
print(generated_greedy_text)

the name of his materials themselves cannot sustain interest in playmates of
the name of the United States , and the other hand ,


### Evaluating phrases
After generating a phrase we pass it to a LLM to evaluate its coherence and fluency, the result can be taken as is or can be used to further increase the model's performance. Some of the metrics to evaluate the phrases can be:
- **perplexity** - calculate the perplexity, the lower the perplexity the better;
- **semantic coherence** - use the LLM to assign a coherence score based on how logical or meaningful the phrase is;
- **classifying phrases as coherent or not** - fine-tune a classification head on the LLM to check whether phrases make sense.


We can use **GPT-2** to be the evaluator of the generated phrase

In [7]:
pip install transformers torch sentencepiece torcheval

Note: you may need to restart the kernel to use updated packages.


In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import sentencepiece as spm
from torcheval.metrics.text import Perplexity

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

Using **T5** and **HappyTextToText** to check and correct the phrase's grammar produced by the N-gram model.

In [19]:
pip install happytransformer




In [29]:
from happytransformer import HappyTextToText, TTSettings
happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")

10/23/2024 14:31:28 - INFO - happytransformer.happy_transformer -   Using device: cpu


In [30]:
def correct_grammar(input_phrase):
    args = TTSettings(num_beams=5, min_length=1)

    formatted_input = f"grammar: {input_phrase}"
    result = happy_tt.generate_text(formatted_input, args=args)
    
    return result.text

input_phrase = generated_random_text
print(input_phrase)
corrected_result = correct_grammar(input_phrase)
print(corrected_result)

10/23/2024 14:31:34 - INFO - happytransformer.happy_transformer -   Moving model to cpu
10/23/2024 14:31:34 - INFO - happytransformer.happy_transformer -   Initializing a pipeline


the name of his materials themselves cannot sustain interest in playmates of
The name of his materials itself cannot sustain interest in playmates.


In [31]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')
model.to(torch_device)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

In [25]:
def compute_phrase_likelihood(phrase):
    input_ids = tokenizer(phrase, return_tensors="pt").input_ids
    
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
    
    return -loss.item()

phrase = generated_random_text

Initial Phrase: the name of Luke Murrin . His sudden unannounced appearance at the
Likelihood Score at the start: -5.517093181610107


In [26]:
likelihood_score_initial = compute_phrase_likelihood(generated_)
print("Initial Phrase:", phrase)
print("Likelihood Score at the start:", likelihood_score_initial)

likelihood_score_corrected = compute_phrase_likelihood(corrected_result)
print("Corrected Phrase:", corrected_result)
print("Likelihood Score:", likelihood_score_corrected)

Corrected Phrase: The name of Luke Murrin . His sudden unannounced appearance at the event.
Likelihood Score: -5.50302791595459
