<a href="https://colab.research.google.com/github/kleczekr/tolkenizer/blob/master/cooking_with_clusters_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import nltk.data
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
from urllib import request

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
url_alice = 'https://www.gutenberg.org/files/11/11-0.txt'
url_moby = 'https://www.gutenberg.org/files/2701/2701-0.txt'
# opening the txt files
response_alice = request.urlopen(url_alice)
response_moby = request.urlopen(url_moby)
# reading the files into raw variables as strings
raw_alice = response_alice.read().decode('utf8')
raw_moby = response_moby.read().decode('utf8')
# Split the raw files into lists of sentences
tokenized_alice = sent_tokenize(raw_alice)
tokenized_moby = sent_tokenize(raw_moby)
# drop the leading sentences that hold the front matter (e.g. the table of contents)
tokenized_alice = tokenized_alice[14:]
tokenized_moby = tokenized_moby[275:]
# join the lists
tokenized_joint = tokenized_alice + tokenized_moby
# split each sentence on whitespace into a list of tokens (punctuation stays attached)
word_split_joint = [sentence.split() for sentence in tokenized_joint]

In [3]:
from gensim.models import Word2Vec

In [4]:
# min_count=1 keeps every token in the vocabulary, including words that occur only once
model = Word2Vec(word_split_joint, min_count=1)

In [5]:
# summarize the trained model
print(model)

Word2Vec(vocab=36062, size=100, alpha=0.025)


In [6]:
# summarize vocabulary
# (model.wv.vocab is the gensim 3.x API; gensim 4 renamed it to model.wv.key_to_index)
words = list(model.wv.vocab)
# print random sample
from random import sample

for word in sample(words, 50):
  print(word)

peace
PACIFIC,
traveller.
Arthur
flitting
demonstrations
divine
Its
wasn’t;
ME,”
event—in
reiterated
moments.
_Requin_.
sufferable
Mab.
probable,
obtains
strange,
“_Something_”
Perish?
groping
‘Let
intermixed
rehearsing—singing,
sciences,
shares
coin
flung
Spanishly
heads—namely,
prize.
Wonder
describes
to,
astir
Spermacetti
relief,
sail-needles
in
miasmas,
Or
escape—blow
voices,
worships
computer
moss-bearded
going.
such,
ungrateful;


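This sample is unsatisfactory. Besides stopwords (`in`, `Or`, `Its`), many tokens carry uppercase letters, attached punctuation (`traveller.`, `probable,`), and what look like encoding errors, such as curly quotes and dashes fused to words (`‘Let`, `ME,”`). As a result, the same word can appear in the vocabulary under several spellings. The word list needs cleaning.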
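
To make the problem concrete, one option is to count how many tokens fall into each category of noise. The following is a minimal sketch, not part of the original notebook: it uses a handful of tokens copied from the sample above in place of the full `words` list, and the helper name `noise_report` is hypothetical.

```python
import string

def noise_report(tokens):
    """Count tokens showing each kind of noise (hypothetical helper)."""
    return {
        # tokens containing at least one uppercase character
        "uppercase": sum(1 for t in tokens if any(c.isupper() for c in t)),
        # tokens containing ASCII punctuation such as . , ; ?
        "ascii_punct": sum(1 for t in tokens if any(c in string.punctuation for c in t)),
        # tokens containing non-ASCII characters such as curly quotes
        "non_ascii": sum(1 for t in tokens if not t.isascii()),
        # tokens that are already lowercase and purely alphabetic
        "clean": sum(1 for t in tokens if t.isalpha() and t.islower()),
    }

# a few tokens taken from the printed sample; the real input would be `words`
sample_tokens = ["peace", "PACIFIC,", "traveller.", "Arthur", "ME,\u201d", "\u2018Let", "in"]
report = noise_report(sample_tokens)
print(report)
```

Running the same function on the full vocabulary would show which cleanup step matters most for this corpus.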



In [7]:
# import string
# table = str.maketrans('', '', string.punctuation)
# words = [w.translate(table) for w in words]
# for word in sample(words, 50):
#   print(word)

In [8]:
# words = [word.lower() for word in words]
# for word in sample(words, 50):
#   print(word)

In [9]:
# another approach, using Python's built-in str.isalpha():
# remove all tokens that are not alphabetic
# words = [word for word in words if word.isalpha()]
# for word in sample(words, 50):
#   print(word)

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
# full cleanup pipeline:
import string
from nltk.corpus import stopwords

# convert to lower case
words = [word.lower() for word in words]
# remove ASCII punctuation from each word
# (string.punctuation does not cover curly quotes or em dashes)
table = str.maketrans('', '', string.punctuation)
words = [word.translate(table) for word in words]
# drop any token that still contains a non-alphabetic character
words = [word for word in words if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
for word in sample(words, 50):
  print(word)

gliding
cupbearers
danger
cathedrals
lock
palest
undulations
requiring
appear
alive
thoughtfulness
crumpled
ebook
waning
quivering
shoreless
savagery
upward
seaport
sandhills
harvest
punctured
indirect
quietly
stars
glue
gladly
askance
corroded
intervals
twinkling
exhaust
amputation
interregnum
unwonted
matters
ribs
perilous
emboldened
stunning
workman
enviable
war
wisest
shored
entirely
ere
balloon
obviously
stockings
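
Note that `string.punctuation` contains only ASCII characters, so tokens with curly quotes or long dashes (`wasn’t;`, `‘Let`) survive the translation step and are then discarded whole by `isalpha()`. Hyphenated words lose their hyphen and fuse into a single token. A possible alternative, sketched here with a hypothetical helper `split_on_punctuation` that is not part of the notebook, treats every character in a Unicode punctuation category as a separator:

```python
import unicodedata

def split_on_punctuation(token):
    """Lowercase a token and split it wherever a Unicode punctuation mark occurs."""
    chars = [
        # Unicode categories starting with "P" cover quotes, dashes, commas, etc.
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in token.lower()
    ]
    # keep only the purely alphabetic pieces
    return [piece for piece in "".join(chars).split() if piece.isalpha()]

# tokens from the first sample, with non-ASCII characters written as escapes
for token in ["ME,\u201d", "\u2018Let", "event\u2014in", "sail-needles", "wasn\u2019t;"]:
    print(token, "->", split_on_punctuation(token))
```

One trade-off of this approach is that contractions are split at the apostrophe, leaving fragments such as `t` that would need a stopword or length filter afterwards.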


In [12]:
# stem each word with the Porter stemmer (stems such as 'abruptli' are not dictionary words)
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
words = [porter.stem(word) for word in words]
for word in sample(words, 50):
  print(word)

even
notion
spark
moment
abruptli
execut
insular
grandiloqu
vigil
valparaiso
engraft
heaveninsult
therebi
profess
tripoint
chief
shambl
vessel
vers
squall
assur
squall
cleveland
contain
job
iron
appropri
rejoin
cring
therebi
halfhiss
slumber
gape
expos
dim
unreason
legisl
care
pardon
ugli
element
hearken
high
process
medicin
cat
blanch
paw
gibraltar
globe
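
The cleanup above runs on the vocabulary list extracted from the model, so the Word2Vec model itself was still trained on the raw tokens. The stemmed list also contains repeated entries (`squall` and `therebi` each appear twice in the sample). A hedged sketch of applying the same steps to the tokenized sentences before training follows. `clean_sentence` and its parameters are hypothetical, and the tiny stopword set and identity stemmer are stand-ins for NLTK's `stopwords.words('english')` and `porter.stem`.

```python
import string

# translation table that deletes ASCII punctuation, as in the pipeline above
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_sentence(tokens, stop_words=frozenset(), stem=lambda w: w):
    """Lowercase, strip punctuation, filter, and stem one tokenized sentence."""
    cleaned = []
    for token in tokens:
        word = token.lower().translate(PUNCT_TABLE)
        if word.isalpha() and word not in stop_words:
            cleaned.append(stem(word))
    return cleaned

# stand-ins for the notebook's NLTK stopword set and tokenized sentences
demo_stop_words = {"the", "in", "a"}
demo_sentences = [["The", "whale", "swam", "in", "the", "sea."],
                  ["A", "Whale,", "a", "sea!"]]

cleaned_sentences = [clean_sentence(s, demo_stop_words) for s in demo_sentences]
# drop sentences that end up empty after filtering
cleaned_sentences = [s for s in cleaned_sentences if s]
# distinct words across the cleaned corpus, without duplicates
vocabulary = sorted({word for s in cleaned_sentences for word in s})
print(cleaned_sentences)
print(vocabulary)
```

In the notebook, the same function could be called with `stop_words=stop_words` and `stem=porter.stem`, and the cleaned sentences passed to `Word2Vec` in place of `word_split_joint`.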
