<a href="https://colab.research.google.com/github/misticorion/language-modelling/blob/main/LanguageModelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# N-gram Model


## Trigram Model using NLTK

In [25]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [57]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
from nltk import trigrams
from collections import Counter, defaultdict

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [58]:
model = defaultdict(lambda: defaultdict(lambda: 0))

In [59]:
# Count how often each word follows each pair of preceding words
for sentence in brown.sents():
    for word_1, word_2, word_3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(word_1, word_2)][word_3] += 1

# model = {(word_1, word_2): {word_3: count, ...}}
print(model["I", "like"]["to"]) # "to" follows "I like" 14 times
print(model[None, None]["I"]) # 1375 sentences start with "I"

14
1375
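
The `pad_left=True, pad_right=True` arguments are what make `model[None, None]` meaningful: each sentence is wrapped in `None` markers, so the first real word follows the context `(None, None)` and the sentence ends with two `None`s. The sketch below reproduces that padding in plain Python on a made-up three-word sentence. `padded_trigrams` is a hypothetical helper written for illustration, not an NLTK function.

```python
def padded_trigrams(tokens):
    """Yield trigrams with two None pads on each side (illustrative helper)."""
    padded = [None, None] + list(tokens) + [None, None]
    for i in range(len(padded) - 2):
        yield tuple(padded[i:i + 3])

# Made-up example sentence, not taken from the Brown corpus.
sentence = ["I", "like", "tea"]
grams = list(padded_trigrams(sentence))
for gram in grams:
    print(gram)
```

The first trigram pairs the start-of-sentence context with the first word, and the last trigram ends in two pads, which is the stopping condition the generation loop relies on.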


In [60]:
# Convert count to probabilities
for two_adj_words in model:
    total_count = float(sum(model[two_adj_words].values()))

    for third_word in model[two_adj_words]:
        model[two_adj_words][third_word] /= total_count
# model = {(word_1, word_2): {word_3: probability, ...}}
print(model["I", "like"]["to"]) # P("to" | "I like")
print(model[None, None]["I"]) # P("I" | start of sentence)

0.5384615384615384
0.023979769794209977
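
The normalization above is the maximum-likelihood estimate P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2). As a sanity check, the probabilities for each context should sum to 1. The following is a minimal sketch on a toy corpus invented for illustration (so it runs without downloading Brown); it also keeps the counts and probabilities in separate tables, which is a design choice of this sketch rather than of the notebook.

```python
from collections import defaultdict
import math

# Toy corpus made up for illustration, not taken from Brown.
corpus = [
    ["I", "like", "tea"],
    ["I", "like", "coffee"],
    ["I", "like", "tea"],
]

# Count trigram continuations, padding each sentence with None markers.
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    padded = [None, None] + sent + [None, None]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        counts[(w1, w2)][w3] += 1

# Build a separate probability table instead of overwriting the counts.
probs = {
    context: {w: c / sum(followers.values()) for w, c in followers.items()}
    for context, followers in counts.items()
}

# Every context's distribution should sum to 1 (up to rounding).
assert all(math.isclose(sum(d.values()), 1.0) for d in probs.values())

print(probs[("I", "like")])
```

Here "tea" follows "I like" in two of three sentences, so its estimated probability is 2/3.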


To generate a sentence from two words:

In [56]:
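import random

input_text = ['I', 'like']
completed = False

while not completed:
    r = random.random()          # threshold for inverse-CDF sampling
    accumulator = 0.0
    context = tuple(input_text[-2:])
    print(input_text[-2:])
    if context not in model:     # unseen context: nothing to sample from
        break
    # Walk the cumulative distribution until it passes the threshold r
    for word in model[context].keys():
        accumulator += model[context][word]

        if accumulator >= r:
            input_text.append(word)
            break

    # Two consecutive None pads mark the end of the sentence
    if input_text[-2:] == [None, None]:
        completed = True

print(' '.join([text for text in input_text if text]))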

['I', 'like']
['like', 'his']
['his', 'associates']
['associates', 'here']
['here', 'for']
['for', 'ten']
['ten', 'minutes']
['minutes', 'before']
['before', 'nine']
['nine', "o'clock"]
["o'clock", 'in']
['in', 'the']
['the', 'crib']
['crib', "''"]
["''", '?']
['?', '?']
['?', None]
I like his associates here for ten minutes before nine o'clock in the crib '' ? ?
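
The loop above samples the next word by accumulating probabilities until they exceed a random threshold. An equivalent and shorter option is `random.choices`, which draws from a weighted distribution directly. The sketch below assumes a tiny hand-written model with hypothetical probabilities; `generate` and `toy_model` are names introduced here for illustration. It also stops early when the context was never seen, and caps the output length.

```python
import random

# Hand-written toy model (hypothetical values): context -> {next word: probability}.
toy_model = {
    ("I", "like"): {"tea": 0.5, "coffee": 0.5},
    ("like", "tea"): {None: 1.0},
    ("like", "coffee"): {None: 1.0},
    ("tea", None): {None: 1.0},
    ("coffee", None): {None: 1.0},
}

def generate(model, seed_words, max_words=50, rng=random):
    """Sample words until two None pads appear, the context is unseen, or the cap is hit."""
    words = list(seed_words)
    while words[-2:] != [None, None] and len(words) < max_words:
        followers = model.get(tuple(words[-2:]))
        if not followers:
            break  # unseen context: stop instead of looping forever
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(w for w in words if w)

print(generate(toy_model, ["I", "like"]))
```

Using `model.get(...)` rather than indexing avoids inserting empty entries into a `defaultdict` when an unseen context is looked up.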


# PyTorch Models

REQUIREMENTS

In [None]:
!pip install pytorch-transformers
!pip install --upgrade urllib3==1.25.4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## OpenAI GPT-2

### Next word generation using OpenAI GPT-2.

Import Libraries

In [11]:
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

Load pre-trained model tokenizer.

In [12]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

Encode a text input

In [13]:
input_text = "The weather is "
indexed_tokens = tokenizer.encode(input_text)
print(indexed_tokens)

[383, 6193, 318]


Convert indexed tokens to a PyTorch tensor

In [14]:
tokens_tensor = torch.tensor([indexed_tokens])
print(tokens_tensor)

tensor([[ 383, 6193,  318]])


Load the pre-trained model

In [15]:
gpt2model = GPT2LMHeadModel.from_pretrained('gpt2')

Set the model to evaluation mode to deactivate the dropout modules

In [16]:
gpt2model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

Run the model to get next-token scores (logits) at every input position

In [17]:
with torch.no_grad():
  outputs_tensor = gpt2model(tokens_tensor)
  predictions = outputs_tensor[0]

Get the predicted next sub-word

In [18]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()
print(predicted_index)


922
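
`predictions[0, -1, :]` holds one unnormalized score (logit) per vocabulary entry for the position after the last input token, and `argmax` picks the highest one. To read those scores as probabilities, apply a softmax. The sketch below shows the idea in plain Python on hypothetical logits for a made-up four-token vocabulary; the values are not GPT-2 outputs.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (max is subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits, chosen only for illustration.
vocab = [" good", " bad", " nice", " cold"]
logits = [2.0, 0.5, 1.0, -1.0]

probs = softmax(logits)
best = max(range(len(logits)), key=logits.__getitem__)  # greedy choice (argmax)

for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token!r}: {p:.3f}")
```

Softmax preserves the ordering of the logits, so the greedy choice is the same whether it is taken before or after normalization. With PyTorch, `torch.softmax(predictions[0, -1, :], dim=-1)` performs the same normalization on the real scores.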


In [19]:
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
print(predicted_text)

 The weather is good
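
The cells above predict a single next token. Repeating the same step, appending each prediction to the input, yields a longer continuation (greedy decoding). The sketch below shows only the loop structure, using a hypothetical stand-in scorer over a four-token vocabulary so that it runs without downloading a model. With the notebook's objects, the scorer would presumably wrap something like `gpt2model(torch.tensor([tokens]))[0][0, -1, :]`; that wiring is an assumption and is not shown here.

```python
def greedy_decode(next_token_scores, prefix, max_new_tokens=5, stop_token=None):
    """Repeatedly append the highest-scoring next token to the sequence."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == stop_token:
            break
        tokens.append(next_token)
    return tokens

def toy_scores(tokens):
    # Hypothetical scorer: always favors (last token + 1) modulo 4.
    scores = [0.0] * 4
    scores[(tokens[-1] + 1) % 4] = 1.0
    return scores

print(greedy_decode(toy_scores, [0], max_new_tokens=3))
```

Greedy decoding is deterministic and tends to repeat itself on real models; sampling from the softmax distribution is a common alternative.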
