## Word Prediction using N-Grams

#### Overview: 

- The goal of this project is to create an algorithm that can sift through __n-grams__ to predict which word is most likely to be typed next. Given the word a user has just typed, the __N-gram__ Word Prediction Algorithm looks up the words that most often followed it in the training text and suggests the most frequent one in real time. The larger and more varied the training corpus (ideally a very large collection of books), the more reliable the predictions become.

- Assume the training data shows that "data" occurs 198 times, "data entry" 12 times and "data streams" 10 times. The maximum likelihood estimate (MLE) of each next word is the count of the two-word sequence divided by the count of the first word:

The probability of "data entry" is:

$$ P_{mle}(entry|data) = \frac{12}{198} \approx 0.06 = 6\%$$

The probability of "data streams" is:

$$ P_{mle}(streams|data) = \frac{10}{198} \approx 0.05 = 5\%$$

If the user types "data", the model predicts that "entry" is the most likely next word, because its estimated probability (about 6%) is higher than that of "streams" (about 5%).
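The arithmetic above can be reproduced in a few lines of Python. This is only a minimal sketch: the counts are the ones assumed in the example, and the names `unigram_counts`, `bigram_counts` and `mle_probability` are illustrative, not part of the notebook.

```python
# Counts taken from the worked example; in practice they come from a corpus.
unigram_counts = {"data": 198}
bigram_counts = {("data", "entry"): 12, ("data", "streams"): 10}

def mle_probability(prev_word, next_word):
    """P_mle(next_word | prev_word) = count(prev_word next_word) / count(prev_word)."""
    return bigram_counts.get((prev_word, next_word), 0) / unigram_counts[prev_word]

p_entry = mle_probability("data", "entry")
p_streams = mle_probability("data", "streams")
print(round(p_entry, 4), round(p_streams, 4))  # 0.0606 0.0505
```

Rounded to two decimals these are the 6% and 5% quoted above.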

#### Generate 2-grams

- using the corpus 
    - corpus = `the cat is red the cat is green the cat is blue the dog is brown`
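One compact way to list the 2-grams of this corpus is to pair each token with its successor using `zip`. The sketch below assumes simple whitespace tokenization, and the name `bigrams` is illustrative.

```python
corpus = "the cat is red the cat is green the cat is blue the dog is brown"
tokens = corpus.split()

# Pair every token with the token that follows it.
bigrams = list(zip(tokens, tokens[1:]))

print(len(bigrams))   # 15 (one fewer than the 16 tokens)
print(bigrams[:4])    # [('the', 'cat'), ('cat', 'is'), ('is', 'red'), ('red', 'the')]
```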

#### Prediction

Take a word and predict the next word using the bigram model.

#### Additional exercise
- Install the PyPDF4 library into the Anaconda env (needed for reading a PDF file).
- Download the "MachineLearning Book.pdf", read its content into NumPy or pandas data structures, learn a bigram model, and predict the next word for the tokens 'data', 'machine', 'artificial', 'learners'.
- Analyze the output for reasonable correctness, and consider whether a larger corpus would give better predictions.


In [1]:
from collections import defaultdict

In [2]:
corpus = "the cat is red the cat is green the cat is blue the dog is brown"

In [4]:
tokens = corpus.split()

In [5]:
tokens

['the',
 'cat',
 'is',
 'red',
 'the',
 'cat',
 'is',
 'green',
 'the',
 'cat',
 'is',
 'blue',
 'the',
 'dog',
 'is',
 'brown']

#### build tokens dictionary 
with next word list for each token

In [6]:
previous_word = ""
token_dict    = defaultdict(list)

In [7]:
for current_word in tokens:
    if previous_word != "":
        token_dict[previous_word].append(current_word)
        
    previous_word = current_word

In [9]:
token_dict

defaultdict(list,
            {'the': ['cat', 'cat', 'cat', 'dog'],
             'cat': ['is', 'is', 'is'],
             'is': ['red', 'green', 'blue', 'brown'],
             'red': ['the'],
             'green': ['the'],
             'blue': ['the'],
             'dog': ['is']})

In [10]:
corpus

'the cat is red the cat is green the cat is blue the dog is brown'

#### compute probability
of each observed next word for each word in the dictionary. 

In [11]:
for key in token_dict.keys():
    
    next_words   = token_dict[key]
    
    unique_words = set(next_words) # removes duplicates
    
    nb_words     = len(next_words)
    
    probabilities_token = {}
    
    for unique_word in unique_words:
        probabilities_token[unique_word] = float(next_words.count(unique_word)) / nb_words
        
    token_dict[key] = probabilities_token

In [12]:
token_dict

defaultdict(list,
            {'the': {'dog': 0.25, 'cat': 0.75},
             'cat': {'is': 1.0},
             'is': {'brown': 0.25, 'red': 0.25, 'blue': 0.25, 'green': 0.25},
             'red': {'the': 1.0},
             'green': {'the': 1.0},
             'blue': {'the': 1.0},
             'dog': {'is': 1.0}})
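The two steps above (collecting followers, then turning counts into probabilities) can also be combined with `collections.Counter`, which avoids calling `list.count` once per unique word. This is an alternative sketch, not the notebook's own code; `build_bigram_model` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Map each word to {next_word: relative frequency of next_word after it}."""
    followers = defaultdict(Counter)
    for prev_word, next_word in zip(tokens, tokens[1:]):
        followers[prev_word][next_word] += 1

    model = {}
    for prev_word, counts in followers.items():
        total = sum(counts.values())
        model[prev_word] = {word: count / total for word, count in counts.items()}
    return model

corpus = "the cat is red the cat is green the cat is blue the dog is brown"
model = build_bigram_model(corpus.split())
print(model["the"])  # {'cat': 0.75, 'dog': 0.25}
```

Returning a plain `dict` also means that looking up an unseen word does not silently create an empty entry, which a `defaultdict` would do.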

#### predicting next word

In [27]:
token_ask = 'the'

In [28]:
if token_ask in token_dict:
    next_words_prob = token_dict[token_ask]
    print(next_words_prob)
    print({k: v for k, v in sorted(next_words_prob.items(), key=lambda item: item[1], reverse=True)})

{'dog': 0.25, 'cat': 0.75}
{'cat': 0.75, 'dog': 0.25}
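The lookup-and-sort step can be wrapped in a small function that returns the top `k` candidates and handles words never seen in training. This is a sketch under assumptions: `predict_next` and `toy_model` are illustrative names, and `toy_model` simply copies two entries from the probabilities printed earlier.

```python
def predict_next(model, word, k=1):
    """Return up to k (next_word, probability) pairs, most probable first."""
    candidates = model.get(word, {})  # unseen word -> no candidates
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

toy_model = {"the": {"dog": 0.25, "cat": 0.75}, "cat": {"is": 1.0}}

print(predict_next(toy_model, "the"))        # [('cat', 0.75)]
print(predict_next(toy_model, "the", k=2))   # [('cat', 0.75), ('dog', 0.25)]
print(predict_next(toy_model, "brown"))      # []
```

An empty list for "brown" reflects a limitation of a plain bigram model: it cannot predict anything after a word that never had a successor in the training text.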


## read a PDF file

In [29]:
!pip install PyPDF4



In [12]:
import PyPDF4
import re

In [25]:
#pdf_file_location = r'D:\MYLEARN\DATASETS\PDFs\meetingminutes.pdf'
pdf_file_location = r'D:\MYLEARN\2-ANALYTICS-DataScience\01-TECH DOCS\20 - Text - NLP\Natural Language Processing with TensorFlow by Thushan Ganegedara (z-lib.org).pdf'

In [26]:
pdfFileObj = open(pdf_file_location, 'rb')

In [27]:
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)

In [28]:
pdfReader.numPages, pdfReader.isEncrypted, pdfReader.getDocumentInfo()

ValueError: invalid literal for int() with base 10: b"'A`jCyZI\xa0w\x90M\x05Y\x9e\xbcEM\xf2\xe0\xcaU\x7f&\x8b\x16\x80t\xfbj\x96a"

In [21]:
pageObj = pdfReader.getPage(45)

In [22]:
## loop thru all the pages
cur_page = 0

allText = ""

while cur_page < pdfReader.numPages:
    pageObj  = pdfReader.getPage(cur_page)
    pageText = pageObj.extractText()
    
    cur_page += 1
    
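    # keep only ASCII letters and whitespace (A-Z, not A-z, which would also keep [ \ ] ^ _ `)
    pattern  = r'[^a-zA-Z\s]'
    pageText = re.sub(pattern, '', pageText)
    
    # replace runs of newlines / carriage returns with a single space
    pageText = re.sub(r'[\r\n]+', ' ', pageText)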
    
    pageText = pageText.lower()
    
    allText = allText + pageText

ValueError: invalid literal for int() with base 10: b'\xa2\x87\xf2;\x19\xdb\xb6I6\xfdh\xbc.\x19\x0egW,\xf4V\xc5\x0e\x8e.\x92\xf5\xa0'
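The cleaning done inside the loop can be tried on its own, without a PDF, using only the `re` module. The sketch below is an assumption-laden variant rather than the notebook's exact code: `clean_page_text` is a hypothetical helper, `pages` stands in for the strings that `extractText()` would return, it collapses all whitespace (not only newlines), and it joins pages with a space so the last word of one page does not fuse with the first word of the next.

```python
import re

def clean_page_text(page_text):
    """Keep ASCII letters only, collapse whitespace, and lowercase."""
    text = re.sub(r"[^a-zA-Z\s]", "", page_text)  # drop digits, punctuation, symbols
    text = re.sub(r"\s+", " ", text)              # newlines, tabs, repeated spaces -> one space
    return text.lower().strip()

# Stand-in for the per-page strings extracted from a PDF (hypothetical data).
pages = ["Machine Learning\r\nin Python: 2nd_edition [draft]", "Chapter 1\nIntroduction"]

all_text = " ".join(clean_page_text(page) for page in pages)
print(all_text)  # machine learning in python ndedition draft chapter introduction
```

Note that the character class `A-z` spans ASCII codes 65 to 122 and therefore also keeps ``[ \ ] ^ _ ` ``, which is why `A-Z` is used here.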

In [37]:
len(allText)

584280

In [38]:
allText[:100]

'machine learning   in python machine learning   in python essential techniques for   predictive anal'

In [39]:
tokens = allText.split()

In [40]:
len(tokens)

88157

In [41]:
previous_word = ""
token_dict    = defaultdict(list)

In [42]:
for current_word in tokens:
    if previous_word != "":
        token_dict[previous_word].append(current_word)
        
    previous_word = current_word

#### compute probability
of each observed next word for each word in the dictionary. 

In [43]:
for key in token_dict.keys():
    
    next_words   = token_dict[key]
    
    unique_words = set(next_words) # removes duplicates
    
    nb_words     = len(next_words)
    
    probabilities_token = {}
    
    for unique_word in unique_words:
        probabilities_token[unique_word] = float(next_words.count(unique_word)) / nb_words
        
    token_dict[key] = probabilities_token

In [44]:
type(token_dict)

collections.defaultdict

In [47]:
ctr = 0
for i, v in token_dict.items():
    print('\n', i, v)
    ctr +=1
    
    if ctr> 15:
        break


 machine {'competitions': 0.008403361344537815, 'is': 0.008403361344537815, 'learning': 0.8739495798319328, 'httpsstatwebstanfordedujhfftptrebstpdf': 0.025210084033613446, 'intelligence': 0.008403361344537815, 'learners': 0.008403361344537815, 'learn': 0.025210084033613446, 'linear': 0.008403361344537815, 'that': 0.008403361344537815, 'annals': 0.008403361344537815, 'and': 0.008403361344537815, 'approach': 0.008403361344537815}

 learning {'experience': 0.008130081300813009, 'algorithms': 0.17886178861788618, 'models': 0.016260162601626018, 'model': 0.016260162601626018, 'svms': 0.008130081300813009, 'problem': 0.08130081300813008, 'binary': 0.008130081300813009, 'practice': 0.008130081300813009, 'its': 0.008130081300813009, 'at': 0.008130081300813009, 'skills': 0.008130081300813009, 'comparison': 0.008130081300813009, 'the': 0.008130081300813009, 'to': 0.016260162601626018, 'is': 0.024390243902439025, 'for': 0.008130081300813009, 'algorithm': 0.06504065040650407, 'this': 0.0081300813

#### predicting next word

In [50]:
token_ask = 'machine'

In [51]:
if token_ask in token_dict:
    
    next_words_prob = token_dict[token_ask]
    
    #print(next_words_prob)
    
    print({k: v for k, v in sorted(next_words_prob.items(), key=lambda item: item[1], reverse=True)})

{'learning': 0.8739495798319328, 'httpsstatwebstanfordedujhfftptrebstpdf': 0.025210084033613446, 'learn': 0.025210084033613446, 'competitions': 0.008403361344537815, 'is': 0.008403361344537815, 'intelligence': 0.008403361344537815, 'learners': 0.008403361344537815, 'linear': 0.008403361344537815, 'that': 0.008403361344537815, 'annals': 0.008403361344537815, 'and': 0.008403361344537815, 'approach': 0.008403361344537815}
