# <font color = 'red'>**Next Word Prediction using NLTK**</font>

## <font color = 'saffron'>**Importing Libraries**</font>

In [1]:
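import nltk
from nltk.corpus import reuters
from nltk import bigrams, ConditionalFreqDist
from nltk.tokenize import word_tokenize, sent_tokenize
# Downloads every NLTK data package, which is large; this notebook itself
# only reads the Reuters corpus.
nltk.download('all')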

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   ...

True

## <font color = 'saffron'>**Loading the Dataset**</font>

In [2]:
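# Already covered by download('all') above; kept so this cell also works on its own.
nltk.download("reuters")
# Each sentence is a list of word tokens.
corpus = reuters.sents()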

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [3]:
print(corpus[1])

['They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U', '.', 'S', '.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'U', '.', 'S', '.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.']


In [4]:
len(corpus)

54716

## <font color = 'saffron'>**Creating Bigrams**</font>

In [5]:
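# Lowercase every token and flatten all sentences into one list.
# Note: because of the flattening, some bigrams span a sentence boundary.
words = [word.lower() for s in corpus for word in s]
bigrams_list = list(bigrams(words))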

In [6]:
print(bigrams_list[:10])

[('asian', 'exporters'), ('exporters', 'fear'), ('fear', 'damage'), ('damage', 'from'), ('from', 'u'), ('u', '.'), ('.', 's'), ('s', '.-'), ('.-', 'japan'), ('japan', 'rift')]
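
Because `words` flattens every sentence into a single list, some pairs straddle a sentence boundary (the last token of one sentence followed by the first token of the next). One way to avoid that is to build the pairs sentence by sentence. The sketch below is a minimal illustration, not the notebook's method: `toy_sents` is a made-up stand-in for `reuters.sents()` and `sentence_bigrams` is a hypothetical helper. For a single list, `zip(tokens, tokens[1:])` yields the same adjacent pairs as `nltk.bigrams`.

```python
# Toy stand-in for reuters.sents(); the sentences are invented for illustration.
toy_sents = [
    ["Oil", "prices", "rose", "."],
    ["The", "dollar", "fell", "."],
]


def sentence_bigrams(sents):
    """Yield lowercase bigrams without pairing words across sentences."""
    for sent in sents:
        tokens = [w.lower() for w in sent]
        # Adjacent pairs within this sentence only.
        yield from zip(tokens, tokens[1:])


pairs = list(sentence_bigrams(toy_sents))
print(pairs)
```

The pair `('.', 'the')`, which the flattened approach would produce here, does not appear in `pairs`.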


## <font color = 'saffron'>**Creating Conditional Frequency Distribution**</font>

In [7]:
cfd = ConditionalFreqDist(bigrams_list)

In [8]:
cfd['the']

FreqDist({'company': 3126, 'u': 2264, 'dollar': 984, 'bank': 960, 'first': 839, 'government': 787, 'year': 720, 'united': 682, 'new': 678, 'market': 590, ...})
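
The raw counts above can be turned into conditional probability estimates by dividing each count by the total number of bigrams that start with the same word. The sketch below shows the arithmetic with plain `collections` on a made-up token list; `toy_words`, `toy_cfd`, and `next_word_probs` are illustrative names, not part of the notebook. On the real `cfd`, `cfd['the'].freq('company')` should give the same kind of ratio.

```python
from collections import Counter, defaultdict

# Invented token stream standing in for the lowercased corpus words.
toy_words = ["the", "company", "said", "the", "bank", "said",
             "the", "company", "fell"]

# Word -> Counter of the words that follow it.
toy_cfd = defaultdict(Counter)
for w1, w2 in zip(toy_words, toy_words[1:]):
    toy_cfd[w1][w2] += 1


def next_word_probs(cfd, word):
    """Return {follower: count / total} for `word`, or {} if it was never seen."""
    counts = cfd.get(word)  # .get avoids creating an empty entry for unseen words
    if not counts:
        return {}
    total = sum(counts.values())
    return {follower: n / total for follower, n in counts.items()}


probs = next_word_probs(toy_cfd, "the")
print(probs)
```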

## <font color = 'saffron'>**Predicting Next Word**</font>

In [9]:
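def predict_next_word(input_word):
    """Return the most frequent word that follows input_word in the corpus."""
    input_word = input_word.lower()  # the bigrams were built from lowercased tokens
    if input_word in cfd:
        # max() returns the follower with the highest count
        return cfd[input_word].max()
    else:
        return "Word not found in corpus"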
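
`max()` returns only the single most frequent follower. Since `FreqDist` is a `Counter` subclass, `most_common(k)` can give several ranked candidates instead. The following is a sketch under stated assumptions: `predict_top_k` is a hypothetical helper, and `toy_cfd` is a made-up table used in place of the real `cfd`. It returns an empty list for unseen words, so a caller cannot mistake an error message for a prediction.

```python
from collections import Counter, defaultdict

# Invented counts standing in for the real conditional frequency table.
toy_cfd = defaultdict(Counter)
toy_cfd["the"].update({"company": 3, "dollar": 2, "bank": 1})


def predict_top_k(cfd, input_word, k=3):
    """Return up to k most frequent followers of input_word, best first."""
    input_word = input_word.lower()
    if input_word not in cfd:
        return []  # unseen word: no candidates
    return [word for word, _count in cfd[input_word].most_common(k)]


print(predict_top_k(toy_cfd, "The", k=2))
```

The same function should accept the notebook's `cfd` directly, for example `predict_top_k(cfd, "the", k=5)`.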

In [10]:
input_word = "the"
next_word = predict_next_word(input_word)
print(f"The next word after '{input_word}' could be: {next_word}")

The next word after 'the' could be: company
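
Single-word prediction can be chained to extend a phrase by feeding each predicted word back in as the next input. The sketch below shows one simple (greedy) way to do that; `generate` and `toy_cfd` are illustrative names and the counts are made up. Greedy chaining can fall into loops on a real corpus, so the number of added words is capped.

```python
from collections import Counter, defaultdict

# Invented bigram counts standing in for the real conditional frequency table.
toy_cfd = defaultdict(Counter)
for w1, w2, n in [("the", "company", 3), ("the", "bank", 1),
                  ("company", "said", 2), ("said", "it", 1)]:
    toy_cfd[w1][w2] = n


def generate(cfd, start, max_words=5):
    """Greedily append the most frequent follower, up to max_words times."""
    words = [start.lower()]
    for _ in range(max_words):
        followers = cfd.get(words[-1])
        if not followers:
            break  # the last word has no recorded followers
        words.append(followers.most_common(1)[0][0])
    return words


print(" ".join(generate(toy_cfd, "The")))
```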
