<p style="font-family:Roboto; font-size: 28px; color: magenta"> Python for NLP: Developing an Automatic Text Filler using N-Grams</p>

In [None]:
'''The N-Grams model is one of the most widely used sentence-to-vector models because it captures the context 
shared by N consecutive words in a sentence'''
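
To make "N consecutive words" concrete, here is a minimal sketch that slides a window of size n over a sentence. The helper name `word_ngrams` and the whitespace tokenization are assumptions made for illustration; they are not part of the notebook.

```python
def word_ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    tokens = sentence.split()  # assumption: plain whitespace tokenization
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "tennis is a racket sport"
bigrams = word_ngrams(sample, 2)
trigrams = word_ngrams(sample, 3)
print(bigrams)
print(trigrams)
```

A sentence of k words yields k - n + 1 n-grams, so the five-word sample gives four bigrams and three trigrams.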

In [1]:
'''
 An automatic text filler is a very useful application, widely used by Google and by different smartphones: 
 the user enters some text, and the application automatically populates or suggests the remaining text
'''


<p style="font-family:Roboto; font-size: 22px; color: cyan; text-decoration-line: overline; "> Problems with TF-IDF and Bag of Words Approach</p>

In [None]:
'''In the bag of words and TF-IDF approaches, words are treated individually and every single word is converted 
into its numeric counterpart, so the context in which the word appears is not retained'''
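
The loss of context can be illustrated with a small sketch. The two sentences and the helper names `bag_of_words` and `bigrams` are made up for this example; the point is only that two sentences with opposite meanings can share one bag of words while their bigrams differ.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each word, ignoring the order in which words appear."""
    return Counter(sentence.split())

def bigrams(sentence):
    """Return the set of adjacent word pairs in the sentence."""
    tokens = sentence.split()
    return set(zip(tokens, tokens[1:]))

a = "the player beat the champion"
b = "the champion beat the player"

same_bag = bag_of_words(a) == bag_of_words(b)
same_bigrams = bigrams(a) == bigrams(b)
print(same_bag, same_bigrams)
```

Here `same_bag` is true and `same_bigrams` is false: the bigram `("player", "beat")` appears only in the first sentence, so the N-Grams view keeps information that the bag of words discards.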

<p style="font-family:Roboto; font-size: 28px; color: magenta"> N-Grams from Scratch in Python</p>

In [None]:
'''
We will create two types of N-Grams models in this section: a character N-Grams model and a word N-Grams model.
'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part 1: Character N-Grams Model</p>

In [2]:
import nltk
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re

In [3]:
'''We will be using the BeautifulSoup4 library to parse the data from Wikipedia'''
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Tennis')
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')
article_paragraphs = article_html.find_all('p')
article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()

In [4]:
'''Next, we remove everything from our dataset except letters, periods, and spaces:'''
article_text = re.sub(r'[^A-Za-z. ]', '', article_text)
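
A quick sketch of what this cleaning step does to a sample string. The sample text and the helper name `clean` are assumptions for illustration; the regular expression is the one used above.

```python
import re

def clean(text):
    """Lowercase the text and keep only letters, periods, and spaces."""
    return re.sub(r'[^A-Za-z. ]', '', text.lower())

sample = "Tennis is played in 2-player (singles) matches.\nOut-of-bounds shots lose the point."
cleaned = clean(sample)
print(cleaned)
```

Hyphens and newlines are deleted rather than replaced by spaces, so "Out-of-bounds" becomes "outofbounds" and the two sentences are joined without a space after the period. This is one reason fused words show up in the generated text.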

In [7]:
'''We will be creating a character trigram model'''
ngrams = {}
chars = 3

for i in range(len(article_text)-chars):
    seq = article_text[i:i+chars]
    print(seq)
    # Check whether the trigram already exists in the ngrams dictionary.
    # If it doesn't, add it as a key with an empty list as its value.
    if seq not in ngrams:
        ngrams[seq] = []
    # Append the character that follows the trigram to the trigram's list.
    ngrams[seq].append(article_text[i+chars])


ten
enn
nni
nis
is 
s i
 is
is 
s a
 a 
a r
 ra
rac
ack
cke
ket
et 
t s
 sp
spo
por
ort
rt 
t t
 th
tha
hat
at 
t i
 is
is 
s p
 pl
pla
lay
aye
yed
ed 
d e
 ei
eit
ith
the
her
er 
r i
 in
ind
ndi
div
ivi
vid
idu
dua
ual
all
lly
ly 
y a
 ag
aga
gai
ain
ins
nst
st 
t a
 a 
a s
 si
sin
ing
ngl
gle
le 
e o
 op
opp
ppo
pon
one
nen
ent
nt 
t s
 si
sin
ing
ngl
gle
les
es 
s o
 or
or 
r b
 be
bet
etw
twe
wee
een
en 
n t
 tw
two
wo 
o t
 te
tea
eam
ams
ms 
s o
 of
of 
f t
 tw
two
wo 
o p
 pl
pla
lay
aye
yer
ers
rs 
s e
 ea
eac
ach
ch 
h d
 do
dou
oub
ubl
ble
les
es.
s. 
. e
 ea
eac
ach
ch 
h p
 pl
pla
lay
aye
yer
er 
r u
 us
use
ses
es 
s a
 a 
a t
 te
ten
enn
nni
nis
is 
s r
 ra
rac
ack
cke
ket
et 
t s
 st
str
tru
run
ung
ng 
g w
 wi
wit
ith
th 
h a
 a 
a c
 co
cor
ord
rd 
d t
 to
to 
o s
 st
str
tri
rik
ike
ke 
e a
 a 
a h
 ho
hol
oll
llo
low
ow 
w r
 ru
rub
ubb
bbe
ber
er 
r b
 ba
bal
all
ll 
l c
 co
cov
ove
ver
ere
red
ed 
d w
 wi
wit
ith
th 
h f
 fe
fel
elt
lt 
t o
 ov
ove
ver
er 
r o
 or
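
The resulting dictionary is easier to inspect on a tiny corpus. The following is a minimal sketch of the same idea, assuming a made-up sample string and a hypothetical helper `build_char_ngrams` that uses `collections.defaultdict` in place of the explicit key check.

```python
from collections import defaultdict

def build_char_ngrams(text, n):
    """Map each n-character sequence to the list of characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - n):
        model[text[i:i + n]].append(text[i + n])
    return model

model = build_char_ngrams("tennis is a tennis game", 3)
print(model["ten"])
print(model["is "])
```

The trigram "ten" occurs twice and is followed by "n" both times, while "is " occurs three times with three different successors. Repeated successors are kept in the list on purpose, because they record how often each continuation was seen.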


In [9]:
'''Let's now try to generate text using the first three characters of our corpus as input. 
The first three characters of our corpus are "ten"'''
curr_sequence = article_text[0:chars]
output = curr_sequence
for i in range(200):
    if curr_sequence not in ngrams.keys():
        break
    possible_chars = ngrams[curr_sequence]
    next_char = possible_chars[random.randrange(len(possible_chars))]
    output += next_char
    curr_sequence = output[len(output)-chars:len(output)]

print(output)

tennis the same will and shand ear the most commoned many to the ally the a terminuoused double at tennis norts in sere signal of the width to reatestofter where of the graf in . player matched ratione s
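
Because each successor list keeps duplicates, picking a random element from it is a frequency-weighted draw: a character seen three times after a context is three times as likely to be chosen as one seen once. The sketch below makes those implied probabilities explicit for a hypothetical successor list (the list contents are invented for illustration).

```python
import random
from collections import Counter

# Hypothetical successor list for one context, as stored in the ngrams dictionary.
successors = ['n', 'n', 'a', 'n']

counts = Counter(successors)
total = sum(counts.values())
probabilities = {char: count / total for char, count in counts.items()}
print(probabilities)

# An equivalent draw that samples from the counts instead of the raw list.
rng = random.Random(0)
next_char = rng.choices(list(counts), weights=list(counts.values()), k=1)[0]
print(next_char)
```

Storing counts instead of raw lists uses less memory on a large corpus, at the cost of slightly more code at sampling time.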


In [10]:
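'''The output doesn't make much sense in this case. Increasing the value of the chars variable to 4 gives each prediction one more character of context, so we rebuild the dictionary with four-character sequences'''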
ngrams = {}
chars = 4

for i in range(len(article_text)-chars):
    seq = article_text[i:i+chars]
    print(seq)
    if seq not in ngrams.keys():
        ngrams[seq] = []
    ngrams[seq].append(article_text[i+chars])

tenn
enni
nnis
nis 
is i
s is
 is 
is a
s a 
 a r
a ra
 rac
rack
acke
cket
ket 
et s
t sp
 spo
spor
port
ort 
rt t
t th
 tha
that
hat 
at i
t is
 is 
is p
s pl
 pla
play
laye
ayed
yed 
ed e
d ei
 eit
eith
ithe
ther
her 
er i
r in
 ind
indi
ndiv
divi
ivid
vidu
idua
dual
uall
ally
lly 
ly a
y ag
 aga
agai
gain
ains
inst
nst 
st a
t a 
 a s
a si
 sin
sing
ingl
ngle
gle 
le o
e op
 opp
oppo
ppon
pone
onen
nent
ent 
nt s
t si
 sin
sing
ingl
ngle
gles
les 
es o
s or
 or 
or b
r be
 bet
betw
etwe
twee
ween
een 
en t
n tw
 two
two 
wo t
o te
 tea
team
eams
ams 
ms o
s of
 of 
of t
f tw
 two
two 
wo p
o pl
 pla
play
laye
ayer
yers
ers 
rs e
s ea
 eac
each
ach 
ch d
h do
 dou
doub
oubl
uble
bles
les.
es. 
s. e
. ea
 eac
each
ach 
ch p
h pl
 pla
play
laye
ayer
yer 
er u
r us
 use
uses
ses 
es a
s a 
 a t
a te
 ten
tenn
enni
nnis
nis 
is r
s ra
 rac
rack
acke
cket
ket 
et s
t st
 str
stru
trun
rung
ung 
ng w
g wi
 wit
with
ith 
th a
h a 
 a c
a co
 cor
cord
ord 
rd t
d to
 to 
to s
o st
 str
stri


In [None]:
# We first store the first four characters of the corpus, i.e. "tenn", in the curr_sequence variable
curr_sequence = article_text[0:chars]
output = curr_sequence
# We will generate up to 200 characters of text, so we initialize a loop that iterates 200 times
for i in range(200):
    if curr_sequence not in ngrams.keys():
        break
    possible_chars = ngrams[curr_sequence]
    next_char = possible_chars[random.randrange(len(possible_chars))]
    output += next_char
    curr_sequence = output[len(output)-chars:len(output)]

print(output)

tennis plummer doubles and has on historically at leaming the dubai took overhead outofbounced for most courts it out or times at the only found the th centially twentyfirst sportswriter events to winning
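
One way to see why a longer context produces more word-like output is to count how many distinct characters can follow each context. The sketch below is an illustration on a small made-up sample text, not a measurement on the Wikipedia article; `average_branching` is a hypothetical helper.

```python
from collections import defaultdict

def average_branching(text, n):
    """Average number of distinct next characters per n-character context."""
    successors = defaultdict(set)
    for i in range(len(text) - n):
        successors[text[i:i + n]].add(text[i + n])
    return sum(len(s) for s in successors.values()) / len(successors)

sample_text = ("tennis is a racket sport. tennis is played on a court. "
               "each player uses a tennis racket. ")

branching = {n: average_branching(sample_text, n) for n in range(1, 5)}
for n, value in branching.items():
    print(n, round(value, 2))
```

On this sample, four-character contexts leave fewer choices per step than single characters, which is why the text looks more coherent. The trade-off is that with very long contexts the model mostly reproduces the corpus instead of generating new text.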


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part 2: Word N-Grams Model</p>

In [None]:
'''Let's first create a dictionary that contains word trigrams as keys and the list of words that occur 
after the trigrams as values'''
ngrams = {}
words = 3

words_tokens = nltk.word_tokenize(article_text)
for i in range(len(words_tokens)-words):
    seq = ' '.join(words_tokens[i:i+words])
    print(seq)
    # After that, we check if the word trigram exists in the ngrams dictionary.
    # If the trigram doesn't already exist, we insert it into the ngrams dictionary as a key with an empty list
    if seq not in ngrams:
        ngrams[seq] = []
    ngrams[seq].append(words_tokens[i+words])

tennis is a
is a racket
a racket sport
racket sport that
sport that is
that is played
is played either
played either individually
either individually against
individually against a
against a single
a single opponent
single opponent singles
opponent singles or
singles or between
or between two
between two teams
two teams of
teams of two
of two players
two players each
players each doubles
each doubles .
doubles . each
. each player
each player uses
player uses a
uses a tennis
a tennis racket
tennis racket strung
racket strung with
strung with a
with a cord
a cord to
cord to strike
to strike a
strike a hollow
a hollow rubber
hollow rubber ball
rubber ball covered
ball covered with
covered with felt
with felt over
felt over or
over or around
or around a
around a net
a net and
net and into
and into the
into the opponents
the opponents court
opponents court .
court . the
. the object
the object is
object is to
is to manoeuvre
to manoeuvre the
manoeuvre the ball
the ball in
ball in such
in s
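
As an alternative sketch, the keys can be tuples of tokens instead of space-joined strings, which avoids joining and re-splitting text later on. The helper name `build_word_ngrams`, the tiny corpus, and the whitespace tokenization (used here in place of `nltk.word_tokenize` so the example has no dependencies) are all assumptions.

```python
from collections import defaultdict

def build_word_ngrams(tokens, n):
    """Map each tuple of n consecutive tokens to the list of tokens that follow it."""
    model = defaultdict(list)
    for i in range(len(tokens) - n):
        model[tuple(tokens[i:i + n])].append(tokens[i + n])
    return model

tokens = "tennis is a racket sport . tennis is a game .".split()
word_model = build_word_ngrams(tokens, 3)
print(word_model[("tennis", "is", "a")])
```

The trigram "tennis is a" appears twice in this corpus, so its list holds both observed continuations.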

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part 3: Creating an Automatic Text Filler Using the Word Trigrams We Just Created</p>

In [None]:
 
'''We initialize the curr_sequence variable with the first trigram in the corpus'''
curr_sequence = ' '.join(words_tokens[0:words])
output = curr_sequence
# The first trigram is "tennis is a". We will generate 50 words using the first trigram as the input
for i in range(50):
    if curr_sequence not in ngrams.keys():
        break
    possible_words = ngrams[curr_sequence]
    # From the list of possible words, one word is chosen at random and appended to the end of the output
    next_word = possible_words[random.randrange(len(possible_words))]
    output += ' ' + next_word
    seq_words = nltk.word_tokenize(output)
    # Finally, the curr_sequence variable is updated with the last three words of the output,
    # which form the next trigram to look up in the dictionary.
    curr_sequence = ' '.join(seq_words[len(seq_words)-words:len(seq_words)])

print(output)

tennis is a racket sport that is played either individually against a single opponent singles or between two teams of two players each doubles . each player uses a tennis racket is strung with two different strings for the mains the vertical strings and the crosses the horizontal strings . this is where
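
The loop above samples one random continuation. A text filler of the kind described at the start usually proposes the most likely next words for whatever the user has typed. The following is a minimal sketch of that idea under stated assumptions: the helpers `build_word_ngrams` and `suggest`, the tiny corpus, the whitespace tokenization, and the use of word bigrams as context are all chosen for illustration.

```python
from collections import Counter, defaultdict

def build_word_ngrams(tokens, n):
    """Count which token follows each tuple of n consecutive tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n):
        model[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return model

def suggest(model, typed_text, n, k=3):
    """Return up to k of the most frequent continuations of the last n typed words."""
    context = tuple(typed_text.lower().split()[-n:])
    if context not in model:
        return []  # unseen context: nothing to suggest
    return [word for word, _ in model[context].most_common(k)]

corpus = ("tennis is a racket sport . tennis is a racket game . "
          "tennis is a popular sport .").split()
filler = build_word_ngrams(corpus, 2)

suggestions = suggest(filler, "Tennis is a", 2)
print(suggestions)
print(suggest(filler, "football is", 2))
```

Only the last n typed words are used as context, so a longer input such as "Tennis is a" is reduced to ("is", "a") before the lookup. An unseen context returns an empty list; a fuller implementation might back off to a shorter context instead.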
