
In [None]:
from nltk.corpus import reuters
from nltk import bigrams, trigrams, word_tokenize
from collections import Counter, defaultdict
from tqdm import tqdm
import pickle

In [None]:
import nltk
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
sents = reuters.sents()
' '.join(sents[0])

"ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said ."

In [None]:
# A small example to see what trigrams look like.
temp = ['Hello everyone my name is Hardik and am from India','Hello there very nice to meet you']

for sent in temp:
  # Padding adds None on both sides, so the first and last words also appear in every trigram position.
  n_grams = trigrams(word_tokenize(sent), pad_right=True, pad_left=True)
  for grams in n_grams:
    print(grams)
  print('')


(None, None, 'Hello')
(None, 'Hello', 'everyone')
('Hello', 'everyone', 'my')
('everyone', 'my', 'name')
('my', 'name', 'is')
('name', 'is', 'Hardik')
('is', 'Hardik', 'and')
('Hardik', 'and', 'am')
('and', 'am', 'from')
('am', 'from', 'India')
('from', 'India', None)
('India', None, None)

(None, None, 'Hello')
(None, 'Hello', 'there')
('Hello', 'there', 'very')
('there', 'very', 'nice')
('very', 'nice', 'to')
('nice', 'to', 'meet')
('to', 'meet', 'you')
('meet', 'you', None)
('you', None, None)
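
To make the padding concrete, here is a minimal plain-Python sketch that produces the same kind of padded trigrams without NLTK. The helper name `padded_trigrams` is hypothetical and only for illustration; it assumes two `None` pads on each side, which is what the output above shows.

```python
def padded_trigrams(tokens, pad=None):
    """Return trigrams over tokens, with two pad symbols on each side."""
    padded = [pad, pad] + list(tokens) + [pad, pad]
    # Slide a window of three over the padded sequence
    return list(zip(padded, padded[1:], padded[2:]))

for gram in padded_trigrams(["Hello", "there"]):
    print(gram)
```

Because of the padding, every word appears once as the first, once as the middle, and once as the last element of a trigram.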



**Why are we using defaultdict instead of a simple dict?** <br> 
Usually, a Python dictionary raises a KeyError if you try to get an item with a key that is not currently in the dictionary. A defaultdict, in contrast, simply creates any item that you try to access (provided, of course, that it does not exist yet). To create such a "default" item, it calls the function object that you pass to the constructor (more precisely, an arbitrary "callable" object, which includes function and type objects). For example, defaultdict(int) creates default items using int(), which returns the integer object 0, and defaultdict(list) creates them using list(), which returns a new empty list object. In this notebook the callable is a lambda, so every unseen word pair starts with an empty inner defaultdict whose counts default to 0.
(Adapted from a Stack Overflow answer.)
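
The behaviour described above can be sketched with a few toy dictionaries. The variable names (`plain`, `counts`, `groups`, `nested`) are made up for this illustration.

```python
from collections import defaultdict

plain = {}
try:
    plain["missing"]
except KeyError:
    print("a plain dict raises KeyError for a missing key")

counts = defaultdict(int)    # missing keys start at int() == 0
counts["hello"] += 1

groups = defaultdict(list)   # missing keys start at list() == []
groups["greetings"].append("hello")

# Nested version, similar in shape to the model built in this notebook
nested = defaultdict(lambda: defaultdict(int))
nested[("Hello", "everyone")]["my"] += 1

print(dict(counts), dict(groups), dict(nested[("Hello", "everyone")]))
```

The nested form is what lets `model[(w1, w2)][w3] += 1` work without first checking whether the pair or the following word has been seen.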

In [None]:
# Now that we know what trigrams are, let's build a language model from them.

model = defaultdict(lambda: defaultdict(lambda: 0))

for sentence in tqdm(temp):
    for w1, w2, w3 in trigrams(word_tokenize(sentence), pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

100%|██████████| 2/2 [00:00<00:00, 1969.16it/s]


In [None]:
for key in model:
  print(key)
  print(model[key])
  print(' ')

(None, None)
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec9d8>, {'Hello': 2})
 
(None, 'Hello')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec840>, {'everyone': 1, 'there': 1})
 
('Hello', 'everyone')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec950>, {'my': 1})
 
('everyone', 'my')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec8c8>, {'name': 1})
 
('my', 'name')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec7b8>, {'is': 1})
 
('name', 'is')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec730>, {'Hardik': 1})
 
('is', 'Hardik')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec6a8>, {'and': 1})
 
('Hardik', 'and')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec620>, {'am': 1})
 
('and', 'am')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec598>, {'from': 1})
 
('am', 'from')
defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fed791ec510>,

In [None]:
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
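
The division above amounts to estimating P(w3 | w1, w2) as count(w1, w2, w3) / count(w1, w2). As a small sketch on toy counts (the names `toy_counts` and `toy_probs` are hypothetical), the same normalization can be written so that it builds a new dictionary instead of overwriting the counts:

```python
from collections import defaultdict

toy_counts = defaultdict(lambda: defaultdict(int))
toy_counts[(None, "Hello")]["everyone"] += 1
toy_counts[(None, "Hello")]["there"] += 1
toy_counts[("Hello", "everyone")]["my"] += 1

# Divide each count by the total for its context (w1, w2)
toy_probs = {
    context: {
        word: count / sum(followers.values())
        for word, count in followers.items()
    }
    for context, followers in toy_counts.items()
}

print(toy_probs[(None, "Hello")])
```

Keeping the raw counts around is a design choice: it makes it possible to add more sentences later and normalize again.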

In [None]:
model

defaultdict(<function __main__.<lambda>>,
            {('Hardik',
              'and'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>>, {'am': 1.0}),
             ('Hello',
              'everyone'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>>, {'my': 1.0}),
             ('Hello',
              'there'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>>, {'very': 1.0}),
             ('India',
              None): defaultdict(<function __main__.<lambda>.<locals>.<lambda>>, {None: 1.0}),
             ('am',
              'from'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>>, {'India': 1.0}),
             ('and',
              'am'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>>, {'from': 1.0}),
             ('everyone',
              'my'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>>, {'name': 1.0}),
             ('from',
              'India'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>>, {None:

Our language model is now ready for the two sample sentences. Let's see how we can use it to make predictions:


In [None]:
dict(model[('Hello','everyone')])

{'my': 1.0}

In [None]:
dict(model[('my','name')])

{'is': 1.0}

We can see that the model has learnt from the sentences we provided.
It is able to predict the next word for contexts such as 'Hello everyone' -> 'my' and 'my name' -> 'is'.

But it cannot predict anything for word combinations that do not occur in the corpus:

In [None]:
dict(model[('the','weather')]) # unseen context, so it predicts nothing

{}
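
One side effect worth knowing: indexing a defaultdict with an unseen key stores a new empty entry for that key. A minimal sketch on a toy model (the names here are hypothetical) shows the difference between indexing and `.get`, which looks up without inserting:

```python
from collections import defaultdict

toy_model = defaultdict(lambda: defaultdict(int))
toy_model[("my", "name")]["is"] = 1.0

before = len(toy_model)
_ = toy_model[("the", "weather")]        # indexing inserts an empty entry
after_index = len(toy_model)

missing = toy_model.get(("some", "context"), {})  # .get does not insert
after_get = len(toy_model)

print(before, after_index, after_get)
```

For read-only queries on a large model, `.get` keeps the dictionary from filling up with empty contexts.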

## Now let's build this model on a large corpus, so that we get more meaningful predictions

In [None]:
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for sentence in tqdm(reuters.sents()):
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
 
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

100%|██████████| 54716/54716 [00:12<00:00, 4260.13it/s]


In [None]:
ans = dict(model['I','am'])
sorted(ans.items(), key=lambda x: x[1], reverse=True)

[('sure', 0.15384615384615385),
 ('not', 0.1076923076923077),
 ('confident', 0.07692307692307693),
 ('convinced', 0.06153846153846154),
 ('concerned', 0.046153846153846156),
 ('afraid', 0.046153846153846156),
 ('deeply', 0.046153846153846156),
 ('committed', 0.03076923076923077),
 ('of', 0.03076923076923077),
 ('speculating', 0.03076923076923077),
 ('optimistic', 0.03076923076923077),
 ('encouraged', 0.015384615384615385),
 ('more', 0.015384615384615385),
 ('talking', 0.015384615384615385),
 ('pleased', 0.015384615384615385),
 (',', 0.015384615384615385),
 ('happy', 0.015384615384615385),
 ('for', 0.015384615384615385),
 ('against', 0.015384615384615385),
 ('very', 0.015384615384615385),
 ('cautiously', 0.015384615384615385),
 ('sceptical', 0.015384615384615385),
 ('hopeful', 0.015384615384615385),
 ('now', 0.015384615384615385),
 ('unable', 0.015384615384615385),
 ('expecting', 0.015384615384615385),
 ('astonished', 0.015384615384615385),
 ('joining', 0.015384615384615385),
 ('on', 0.

In [None]:
ans = dict(model[None,'the'])
sorted(ans.items(), key=lambda x: x[1], reverse=True)

[('loan', 0.18181818181818182),
 ('second', 0.09090909090909091),
 ('price', 0.09090909090909091),
 ('increase', 0.09090909090909091),
 ('reorganization', 0.09090909090909091),
 ('restatement', 0.09090909090909091),
 ('quake', 0.09090909090909091),
 ('proposed', 0.09090909090909091),
 ('acqustion', 0.09090909090909091),
 ('company', 0.09090909090909091)]
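
When only the few most likely continuations are of interest, sorting the whole distribution is not needed. A small sketch, assuming a hypothetical helper `top_k` and made-up toy probabilities:

```python
import heapq

def top_k(distribution, k=3):
    """Return the k (word, probability) pairs with the highest probability."""
    return heapq.nlargest(k, distribution.items(), key=lambda item: item[1])

# Toy values for illustration only, not taken from the Reuters model
toy_distribution = {"sure": 0.4, "not": 0.3, "confident": 0.2, "afraid": 0.1}
print(top_k(toy_distribution, k=2))
```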

In [None]:
# Generating text by sampling from the model

import random

# Starting words
text = ["today", "the"]
sentence_finished = False

while not sentence_finished:
    # Draw a random number in [0, 1) and walk the cumulative distribution
    r = random.random()
    accumulator = 0.0
    context = tuple(text[-2:])

    # An unseen context has no continuations; stop instead of looping forever
    followers = model.get(context)
    if not followers:
        break

    for word, prob in followers.items():
        accumulator += prob
        # Pick the first word whose cumulative probability reaches r
        if accumulator >= r:
            break
    # If rounding kept the total just below r, the last word seen is used
    text.append(word)

    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))

today the overseas operations and might decide to respect pledges on monetary policies behind them , delegates said .
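
The same weighted sampling can also be sketched with `random.choices` from the standard library, which takes the weights directly. The function name `generate`, its parameters, and the toy model below are assumptions made for this illustration; each toy context has a single continuation so the result is predictable.

```python
import random

def generate(model, start, max_words=50, rng=random):
    """Sample words from a trigram model until the end padding or max_words."""
    text = list(start)
    while len(text) < max_words:
        followers = model.get(tuple(text[-2:]))
        if not followers:          # unseen context: nothing to sample from
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        text.append(rng.choices(words, weights=weights, k=1)[0])
        if text[-2:] == [None, None]:
            break
    return " ".join(t for t in text if t)

# Toy model for illustration only
toy_model = {
    ("today", "the"): {"market": 1.0},
    ("the", "market"): {"rose": 1.0},
    ("market", "rose"): {None: 1.0},
    ("rose", None): {None: 1.0},
}
print(generate(toy_model, ["today", "the"]))
```

The `max_words` limit is a safety net so that generation always stops, even if the end padding is never sampled.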


## Saving the dict

In [None]:
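import pickle

# Fails: the inner defaultdicts keep lambda default factories, which pickle cannot serialize.
with open("save.p", "wb") as f:
    pickle.dump(dict(model), f)  # try to save it into a file named save.p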

AttributeError: ignored
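
The error comes from the lambda functions stored as default factories, which the standard `pickle` module cannot serialize. One possible workaround, sketched here on a toy model with hypothetical names, is to convert every level to a plain dict before pickling:

```python
import pickle
from collections import defaultdict

toy_model = defaultdict(lambda: defaultdict(lambda: 0))
toy_model[("I", "am")]["sure"] = 1.0

# dict(toy_model) only converts the outer level; the inner values are
# still defaultdicts holding a lambda, so pickling them fails.
try:
    pickle.dumps(dict(toy_model))
    pickling_failed = False
except (AttributeError, pickle.PicklingError):
    pickling_failed = True

# Convert both levels to plain dicts, then pickle as usual
plain_model = {context: dict(followers) for context, followers in toy_model.items()}
restored = pickle.loads(pickle.dumps(plain_model))
print(pickling_failed, restored)
```

The restored object is an ordinary nested dict, so unseen contexts raise KeyError unless they are looked up with `.get`. Writing to a file works the same way with `pickle.dump` and `pickle.load`.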

In [None]:
d = dict(model)

In [None]:
dict(d[('I','am')])

{',': 0.015384615384615385,
 'afraid': 0.046153846153846156,
 'against': 0.015384615384615385,
 'astonished': 0.015384615384615385,
 'cautiously': 0.015384615384615385,
 'committed': 0.03076923076923077,
 'concerned': 0.046153846153846156,
 'confident': 0.07692307692307693,
 'convinced': 0.06153846153846154,
 'deeply': 0.046153846153846156,
 'encouraged': 0.015384615384615385,
 'expecting': 0.015384615384615385,
 'for': 0.015384615384615385,
 'happy': 0.015384615384615385,
 'hopeful': 0.015384615384615385,
 'inclined': 0.015384615384615385,
 'joining': 0.015384615384615385,
 'looking': 0.015384615384615385,
 'more': 0.015384615384615385,
 'not': 0.1076923076923077,
 'now': 0.015384615384615385,
 'of': 0.03076923076923077,
 'on': 0.015384615384615385,
 'optimistic': 0.03076923076923077,
 'pleased': 0.015384615384615385,
 'really': 0.015384615384615385,
 'referring': 0.015384615384615385,
 'sceptical': 0.015384615384615385,
 'speculating': 0.03076923076923077,
 'sure': 0.1538461538461538

In [None]:
!pip install dill



In [None]:
import dill  # not part of the default Anaconda install; use "conda install dill" or the pip command above

# dill can serialize the lambda default factories that pickle rejects
with open("model.pickle", "wb") as dill_file:
    dill.dump(model, dill_file)

In [None]:
with open("Q.pickle", "wb") as dill_file:
    dill.dump(model, dill_file)

In [None]:
# Load the model back from the file written above
with open('/content/model.pickle', 'rb') as file:
    loaded_model = dill.load(file)