## Reuters data set

A brief overview of the dataset:

- It contains documents belonging to 90 different categories.

- There are 7769 training documents in total.

- These documents contain 54716 sentences.

### Steps to build the next word recommender system

1. Loading and exploring the dataset
2. Text cleaning
3. Create vocabulary
4. Creating N-grams of the sentences
5. Building an N-gram language model
6. Predicting the next word using N-gram Language Model

#### Loading and exploring the dataset

In [43]:
import pandas as pd
import numpy as np
import re
pd.set_option('display.max_colwidth', None)


df= pd.read_csv('sample_reuters_dataset.csv')

In [44]:
df.head()

Unnamed: 0,sentence_number,sentence_text
0,0,"ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said ."
1,1,They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .
2,2,"But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain ."
3,3,"The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost ."
4,4,Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes .


In [45]:
#number of sentences in the loaded sample
len(df)

10000

#### Text Cleaning

In [46]:
def clean(text):
    #remove every character except letters, the apostrophe and spaces
    text= re.sub("[^a-zA-Z' ]", "", text)
    
    #convert to lowercase
    text= text.lower()
    
    return text
    
df['clean_sentence']= df['sentence_text'].apply(clean)

In [47]:
df.head()

Unnamed: 0,sentence_number,sentence_text,clean_sentence
0,0,"ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .",asian exporters fear damage from u s japan rift mounting trade friction between the u s and japan has raised fears among many of asia ' s exporting nations that the row could inflict far reaching economic damage businessmen and officials said
1,1,They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .,they told reuter correspondents in asian capitals a u s move against japan might boost protectionist sentiment in the u s and lead to curbs on american imports of their products
2,2,"But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain .",but some exporters said that while the conflict would hurt them in the long run in the short term tokyo ' s loss might be their gain
3,3,"The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost .",the u s has said it will impose mln dlrs of tariffs on imports of japanese electronics goods on april in retaliation for japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost
4,4,Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes .,unofficial japanese estimates put the impact of the tariffs at billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes
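One detail of `clean` is worth noting: characters are deleted rather than replaced, so spaced-out punctuation such as `U . S .` leaves runs of consecutive spaces in the result. This does not affect the later steps, because `str.split()` with no argument treats any run of whitespace as a single separator. The sketch below illustrates this on a short string; `clean_and_normalize` is a hypothetical variant (not part of the notebook) that also collapses the extra spaces.

```python
import re

def clean(text):
    # same logic as the notebook's clean(): keep letters, apostrophes and spaces
    text = re.sub("[^a-zA-Z' ]", "", text)
    return text.lower()

def clean_and_normalize(text):
    # hypothetical variant: also collapse runs of spaces and trim both ends
    return " ".join(clean(text).split())

sample = "The U . S . Has said it will impose 300 mln dlrs"
cleaned = clean(sample)
normalized = clean_and_normalize(sample)

# both versions yield the same tokens once split() is applied
print(cleaned.split() == normalized.split())  # True
print(normalized)  # the u s has said it will impose mln dlrs
```

Either version can feed the vocabulary and N-gram steps, since both are tokenized with `split()`.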


#### Create vocabulary

In [53]:
word_dict= {}
for i in range(len(df)):
    text= df.loc[i,'clean_sentence'].split()
    
    for token in text:
        if token in word_dict:
            word_dict[token]+=1
        else:
            word_dict[token]=1
    
    
word_dict

{'asian': 13,
 'exporters': 52,
 'fear': 8,
 'damage': 29,
 'from': 1369,
 'u': 1117,
 's': 2864,
 'japan': 441,
 'rift': 1,
 'mounting': 5,
 'trade': 549,
 'friction': 8,
 'between': 191,
 'the': 12496,
 'and': 4599,
 'has': 974,
 'raised': 70,
 'fears': 13,
 'among': 44,
 'many': 54,
 'of': 6671,
 'asia': 14,
 "'": 2094,
 'exporting': 12,
 'nations': 71,
 'that': 1376,
 'row': 3,
 'could': 291,
 'inflict': 1,
 'far': 55,
 'reaching': 7,
 'economic': 244,
 'businessmen': 15,
 'officials': 190,
 'said': 4649,
 'they': 518,
 'told': 237,
 'reuter': 27,
 'correspondents': 3,
 'in': 5070,
 'capitals': 3,
 'a': 4412,
 'move': 101,
 'against': 270,
 'might': 59,
 'boost': 45,
 'protectionist': 22,
 'sentiment': 10,
 'lead': 96,
 'to': 6337,
 'curbs': 12,
 'on': 1643,
 'american': 126,
 'imports': 242,
 'their': 230,
 'products': 200,
 'but': 650,
 'some': 278,
 'while': 164,
 'conflict': 3,
 'would': 926,
 'hurt': 11,
 'them': 58,
 'long': 119,
 'run': 21,
 'short': 87,
 'term': 120,
 'toky

In [54]:
words_df= pd.DataFrame({'word': list(word_dict.keys()), 'count': list(word_dict.values())})
words_df= words_df.sort_values(by= 'count', ascending=False)
words_df.reset_index(inplace=True, drop=True)

In [55]:
words_df.head() #most frequent words

Unnamed: 0,word,count
0,the,12496
1,of,6671
2,to,6337
3,in,5070
4,said,4649


In [56]:
words_df.tail() #least frequent words

Unnamed: 0,word,count
12575,shareholdrs,1
12576,disc,1
12577,asylums,1
12578,benel,1
12579,sb,1


In [57]:
len(words_df)

12580

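The vocabulary contains 12580 distinct tokens, and the N-gram model will be built on them. Note that this count includes the standalone apostrophe `'` and single letters such as `u` and `s`, which are left over from the spaced-out punctuation in the source text (for example `U . S .` and `Asia ' s`).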
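The counting loop used for the vocabulary can also be written with `collections.Counter` from the standard library. The sketch below is an alternative, not the notebook's code; the `sentences` list is made up and stands in for the `clean_sentence` column so that the block runs on its own.

```python
from collections import Counter

# made-up cleaned sentences standing in for df['clean_sentence']
sentences = [
    "the tariffs hit the exporters",
    "the exporters fear tariffs",
]

# count every token across all sentences
word_counts = Counter()
for sentence in sentences:
    word_counts.update(sentence.split())

# most_common(n) returns the n most frequent (word, count) pairs
print(word_counts.most_common(1))  # [('the', 3)]
print(len(word_counts))            # 5 distinct words in the toy data
```

On the real data, `pd.DataFrame(word_counts.most_common(), columns=['word', 'count'])` should give a frame sorted by descending count, much like `words_df`, although the order of tied words may differ.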

#### Creating N-grams

In [58]:
def create_unigram(sentence):
    #creating tokens from the sentence
    tokens= sentence.split()
    #empty list to store the unigrams
    unigram_list= []
    #number of unigrams= no. of tokens in the sentence
    for i in range(len(tokens)):
        #appending each unigram in the list
        unigram_list.append(tokens[i:i+1])
        
    #returning the unigram list for a sentence
    return unigram_list

In [62]:
# creating unigrams for all the sentences in the dataset 
final_unigram = []
# for each sentence
for i in range(len(df)):
    # using the defined unigram function to create unigrams
    final_unigram.append(create_unigram(df['clean_sentence'][i]))

# adding the unigrams as a separate column in the dataset
df['unigram'] = final_unigram

In [65]:
def create_bigram(sentence):
    #creating tokens from the sentence
    tokens= sentence.split()
    #empty list to store the bigrams
    bigram_list= []
    #number of bigrams= no. of tokens in the sentence-1
    for i in range(len(tokens)-1):
        #appending each bigram (two consecutive tokens) to the list
        bigram_list.append(tokens[i:i+2])

    #returning the bigram list for a sentence
    return bigram_list

# creating bigrams for all the sentences in the dataset 
final_bigram = []
# for each sentence
for i in range(len(df)):
    # using the defined bigram function to create bigrams
    final_bigram.append(create_bigram(df['clean_sentence'][i]))

# adding the bigrams as a separate column in the dataset
df['bigram'] = final_bigram

In [66]:
def create_trigram(sentence):
    #creating tokens from the sentence
    tokens= sentence.split()
    #empty list to store the trigrams
    trigram_list= []
    #number of trigrams= no. of tokens in the sentence-2
    for i in range(len(tokens)-2):
        #appending each trigram (three consecutive tokens) to the list
        trigram_list.append(tokens[i:i+3])

    #returning the trigram list for a sentence
    return trigram_list

# creating trigrams for all the sentences in the dataset 
final_trigram = []
# for each sentence
for i in range(len(df)):
    # using the defined trigram function to create trigrams
    final_trigram.append(create_trigram(df['clean_sentence'][i]))

# adding the trigrams as a separate column in the dataset
df['trigram'] = final_trigram

In [67]:
df.head()

Unnamed: 0,sentence_number,sentence_text,clean_sentence,unigram,bigram,trigram
0,0,"ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .",asian exporters fear damage from u s japan rift mounting trade friction between the u s and japan has raised fears among many of asia ' s exporting nations that the row could inflict far reaching economic damage businessmen and officials said,"[[asian], [exporters], [fear], [damage], [from], [u], [s], [japan], [rift], [mounting], [trade], [friction], [between], [the], [u], [s], [and], [japan], [has], [raised], [fears], [among], [many], [of], [asia], ['], [s], [exporting], [nations], [that], [the], [row], [could], [inflict], [far], [reaching], [economic], [damage], [businessmen], [and], [officials], [said]]","[[asian, exporters], [exporters, fear], [fear, damage], [damage, from], [from, u], [u, s], [s, japan], [japan, rift], [rift, mounting], [mounting, trade], [trade, friction], [friction, between], [between, the], [the, u], [u, s], [s, and], [and, japan], [japan, has], [has, raised], [raised, fears], [fears, among], [among, many], [many, of], [of, asia], [asia, '], [', s], [s, exporting], [exporting, nations], [nations, that], [that, the], [the, row], [row, could], [could, inflict], [inflict, far], [far, reaching], [reaching, economic], [economic, damage], [damage, businessmen], [businessmen, and], [and, officials], [officials, said]]","[[asian, exporters, fear], [exporters, fear, damage], [fear, damage, from], [damage, from, u], [from, u, s], [u, s, japan], [s, japan, rift], [japan, rift, mounting], [rift, mounting, trade], [mounting, trade, friction], [trade, friction, between], [friction, between, the], [between, the, u], [the, u, s], [u, s, and], [s, and, japan], [and, japan, has], [japan, has, raised], [has, raised, fears], [raised, fears, among], [fears, among, many], [among, many, of], [many, of, asia], [of, asia, '], [asia, ', s], [', s, exporting], [s, exporting, nations], [exporting, nations, that], [nations, that, the], [that, the, row], [the, row, could], [row, could, inflict], [could, inflict, far], [inflict, far, reaching], [far, reaching, economic], [reaching, economic, damage], [economic, damage, businessmen], [damage, businessmen, and], [businessmen, and, officials], [and, officials, said]]"
1,1,They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .,they told reuter correspondents in asian capitals a u s move against japan might boost protectionist sentiment in the u s and lead to curbs on american imports of their products,"[[they], [told], [reuter], [correspondents], [in], [asian], [capitals], [a], [u], [s], [move], [against], [japan], [might], [boost], [protectionist], [sentiment], [in], [the], [u], [s], [and], [lead], [to], [curbs], [on], [american], [imports], [of], [their], [products]]","[[they, told], [told, reuter], [reuter, correspondents], [correspondents, in], [in, asian], [asian, capitals], [capitals, a], [a, u], [u, s], [s, move], [move, against], [against, japan], [japan, might], [might, boost], [boost, protectionist], [protectionist, sentiment], [sentiment, in], [in, the], [the, u], [u, s], [s, and], [and, lead], [lead, to], [to, curbs], [curbs, on], [on, american], [american, imports], [imports, of], [of, their], [their, products]]","[[they, told, reuter], [told, reuter, correspondents], [reuter, correspondents, in], [correspondents, in, asian], [in, asian, capitals], [asian, capitals, a], [capitals, a, u], [a, u, s], [u, s, move], [s, move, against], [move, against, japan], [against, japan, might], [japan, might, boost], [might, boost, protectionist], [boost, protectionist, sentiment], [protectionist, sentiment, in], [sentiment, in, the], [in, the, u], [the, u, s], [u, s, and], [s, and, lead], [and, lead, to], [lead, to, curbs], [to, curbs, on], [curbs, on, american], [on, american, imports], [american, imports, of], [imports, of, their], [of, their, products]]"
2,2,"But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain .",but some exporters said that while the conflict would hurt them in the long run in the short term tokyo ' s loss might be their gain,"[[but], [some], [exporters], [said], [that], [while], [the], [conflict], [would], [hurt], [them], [in], [the], [long], [run], [in], [the], [short], [term], [tokyo], ['], [s], [loss], [might], [be], [their], [gain]]","[[but, some], [some, exporters], [exporters, said], [said, that], [that, while], [while, the], [the, conflict], [conflict, would], [would, hurt], [hurt, them], [them, in], [in, the], [the, long], [long, run], [run, in], [in, the], [the, short], [short, term], [term, tokyo], [tokyo, '], [', s], [s, loss], [loss, might], [might, be], [be, their], [their, gain]]","[[but, some, exporters], [some, exporters, said], [exporters, said, that], [said, that, while], [that, while, the], [while, the, conflict], [the, conflict, would], [conflict, would, hurt], [would, hurt, them], [hurt, them, in], [them, in, the], [in, the, long], [the, long, run], [long, run, in], [run, in, the], [in, the, short], [the, short, term], [short, term, tokyo], [term, tokyo, '], [tokyo, ', s], [', s, loss], [s, loss, might], [loss, might, be], [might, be, their], [be, their, gain]]"
3,3,"The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost .",the u s has said it will impose mln dlrs of tariffs on imports of japanese electronics goods on april in retaliation for japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost,"[[the], [u], [s], [has], [said], [it], [will], [impose], [mln], [dlrs], [of], [tariffs], [on], [imports], [of], [japanese], [electronics], [goods], [on], [april], [in], [retaliation], [for], [japan], ['], [s], [alleged], [failure], [to], [stick], [to], [a], [pact], [not], [to], [sell], [semiconductors], [on], [world], [markets], [at], [below], [cost]]","[[the, u], [u, s], [s, has], [has, said], [said, it], [it, will], [will, impose], [impose, mln], [mln, dlrs], [dlrs, of], [of, tariffs], [tariffs, on], [on, imports], [imports, of], [of, japanese], [japanese, electronics], [electronics, goods], [goods, on], [on, april], [april, in], [in, retaliation], [retaliation, for], [for, japan], [japan, '], [', s], [s, alleged], [alleged, failure], [failure, to], [to, stick], [stick, to], [to, a], [a, pact], [pact, not], [not, to], [to, sell], [sell, semiconductors], [semiconductors, on], [on, world], [world, markets], [markets, at], [at, below], [below, cost]]","[[the, u, s], [u, s, has], [s, has, said], [has, said, it], [said, it, will], [it, will, impose], [will, impose, mln], [impose, mln, dlrs], [mln, dlrs, of], [dlrs, of, tariffs], [of, tariffs, on], [tariffs, on, imports], [on, imports, of], [imports, of, japanese], [of, japanese, electronics], [japanese, electronics, goods], [electronics, goods, on], [goods, on, april], [on, april, in], [april, in, retaliation], [in, retaliation, for], [retaliation, for, japan], [for, japan, '], [japan, ', s], [', s, alleged], [s, alleged, failure], [alleged, failure, to], [failure, to, stick], [to, stick, to], [stick, to, a], [to, a, pact], [a, pact, not], [pact, not, to], [not, to, sell], [to, sell, semiconductors], [sell, semiconductors, on], [semiconductors, on, world], [on, world, markets], [world, markets, at], [markets, at, below], [at, below, cost]]"
4,4,Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes .,unofficial japanese estimates put the impact of the tariffs at billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes,"[[unofficial], [japanese], [estimates], [put], [the], [impact], [of], [the], [tariffs], [at], [billion], [dlrs], [and], [spokesmen], [for], [major], [electronics], [firms], [said], [they], [would], [virtually], [halt], [exports], [of], [products], [hit], [by], [the], [new], [taxes]]","[[unofficial, japanese], [japanese, estimates], [estimates, put], [put, the], [the, impact], [impact, of], [of, the], [the, tariffs], [tariffs, at], [at, billion], [billion, dlrs], [dlrs, and], [and, spokesmen], [spokesmen, for], [for, major], [major, electronics], [electronics, firms], [firms, said], [said, they], [they, would], [would, virtually], [virtually, halt], [halt, exports], [exports, of], [of, products], [products, hit], [hit, by], [by, the], [the, new], [new, taxes]]","[[unofficial, japanese, estimates], [japanese, estimates, put], [estimates, put, the], [put, the, impact], [the, impact, of], [impact, of, the], [of, the, tariffs], [the, tariffs, at], [tariffs, at, billion], [at, billion, dlrs], [billion, dlrs, and], [dlrs, and, spokesmen], [and, spokesmen, for], [spokesmen, for, major], [for, major, electronics], [major, electronics, firms], [electronics, firms, said], [firms, said, they], [said, they, would], [they, would, virtually], [would, virtually, halt], [virtually, halt, exports], [halt, exports, of], [exports, of, products], [of, products, hit], [products, hit, by], [hit, by, the], [by, the, new], [the, new, taxes]]"
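The three functions `create_unigram`, `create_bigram` and `create_trigram` differ only in the window size, so they could be folded into one helper. The sketch below shows one way to do that; `create_ngrams` and its parameter `n` are hypothetical names, not part of the notebook.

```python
def create_ngrams(sentence, n):
    # hypothetical generalization of create_unigram/create_bigram/create_trigram
    tokens = sentence.split()
    # a sentence with k tokens yields k-n+1 n-grams (none when k < n)
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

sentence = "tokyo ' s loss might be their gain"

print(create_ngrams(sentence, 1)[:3])  # first three unigrams
print(create_ngrams(sentence, 2)[:3])  # first three bigrams
print(create_ngrams(sentence, 3)[:3])  # first three trigrams
```

With this helper, the three columns could be filled with, for example, `df['clean_sentence'].apply(lambda s: create_ngrams(s, 2))`.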


#### Tri-gram language model

In [70]:
from collections import Counter, defaultdict

#placeholder for the model: maps a (w1, w2) context to {w3: count}
model= defaultdict(lambda: defaultdict(lambda: 0))

#count frequency of co-occurrence:
for i in range(len(df)):
    #for each trigram:
    for w1,w2,w3 in create_trigram(df['clean_sentence'][i]):
        
        #count of occurrences of w3 given w1 and w2
        model[(w1,w2)][w3]+=1

#### Converting to probabilistic model

In [73]:
for w1_w2 in model:
    total_count= float(sum(model[w1_w2].values())) #total number of trigrams that start with this (w1, w2) context
    for w3 in model[w1_w2]: #for each word that followed w1_w2
        model[w1_w2][w3]/= total_count

In [76]:
dict(model['imports','of'])

{'their': 0.03333333333333333,
 'japanese': 0.06666666666666667,
 'u': 0.06666666666666667,
 'raw': 0.03333333333333333,
 'farm': 0.06666666666666667,
 'essential': 0.03333333333333333,
 'rice': 0.03333333333333333,
 'soft': 0.03333333333333333,
 'maize': 0.03333333333333333,
 'microwave': 0.03333333333333333,
 'tea': 0.03333333333333333,
 'brazilian': 0.1,
 'gifts': 0.03333333333333333,
 'some': 0.03333333333333333,
 'canadian': 0.03333333333333333,
 'ec': 0.03333333333333333,
 'billion': 0.1,
 'textiles': 0.03333333333333333,
 'apparel': 0.03333333333333333,
 'all': 0.03333333333333333,
 'soybeans': 0.03333333333333333,
 'agricultural': 0.03333333333333333,
 'machinery': 0.03333333333333333}

This is a tri-gram model: given two consecutive words (w1, w2), it gives the probability of each word (w3) that followed them in the data.
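Written as a formula, the normalization step computes a relative frequency (the maximum-likelihood estimate), where $C(\cdot)$ denotes the trigram counts stored in `model` before the division:

$$
P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{\sum_{w} C(w_1, w_2, w)}
$$

For the context (`imports`, `of`), the smallest probability shown is $0.0333\ldots = 1/30$, and all listed values are multiples of it that sum to 1. Assuming the rarest words were seen exactly once, this corresponds to 30 occurrences of the context, so that for example

$$
P(\text{brazilian} \mid \text{imports}, \text{of}) = \frac{3}{30} = 0.1
$$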

In [79]:
#word with max probability
import operator
max(dict(model['imports','of']).items(), key=operator.itemgetter(1))[0]

'brazilian'
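Two caveats apply to the lookup above. First, `'brazilian'` and `'billion'` both have probability 0.1; `max` returns the first maximal item it encounters, which is why `'brazilian'` is reported. Second, indexing the `defaultdict` with a context that never occurred creates an empty entry, and `max` over an empty sequence raises a `ValueError`. The sketch below shows one way to wrap the prediction step so that it returns the top `k` candidates and an empty list for unseen contexts. The names `build_trigram_model` and `predict_next_words` and the toy sentences are assumptions made for illustration, not part of the notebook.

```python
from collections import defaultdict

def build_trigram_model(sentences):
    # count how often w3 follows the context (w1, w2)
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.split()
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    # normalize the counts into conditional probabilities (plain dicts)
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {w: c / total for w, c in followers.items()}
    return model

def predict_next_words(model, w1, w2, k=3):
    # hypothetical helper: up to k (word, probability) pairs, best first;
    # .get() avoids inserting an entry for a context that was never seen
    followers = model.get((w1, w2), {})
    return sorted(followers.items(), key=lambda item: item[1], reverse=True)[:k]

# made-up sentences, used only to keep the example self-contained
toy_sentences = [
    "imports of rice rose",
    "imports of rice fell",
    "imports of tea rose",
]
toy_model = build_trigram_model(toy_sentences)

print(predict_next_words(toy_model, "imports", "of"))  # 'rice' first, then 'tea'
print(predict_next_words(toy_model, "exports", "of"))  # [] for an unseen context
```

Returning several candidates instead of a single word fits a recommender setting, where the user can pick from a short list; how to handle unseen contexts (for instance by backing off to bigrams) is left open here.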