#### Activity 4.01: Text Preprocessing of the 'Alice in Wonderland' Text

In this activity, you will apply all the preprocessing steps you've learned so far to a much larger, real text. We'll work with the text of Alice in Wonderland, which we store in the alice_raw variable.

In [1]:
import nltk

In [2]:
alice_raw = nltk.corpus.gutenberg.raw('carroll-alice.txt')

In [3]:
# first few characters of alice_raw
alice_raw[:800]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit"

In [4]:
# Change the raw text to lowercase
alice_raw = alice_raw.lower()

In [5]:
from nltk import tokenize

In [6]:
# tokenize sentences
alice_sents = tokenize.sent_tokenize(alice_raw)

In [7]:
# tokenize words
alice_words = [tokenize.word_tokenize(sent) for sent in alice_sents]

In [8]:
# Import punctuation from the string module and the stop words from NLTK.
from string import punctuation
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
# Create a variable holding the NLTK English stop words
stop_nltk = stopwords.words('english')

In [10]:
# Punctuation list
stop_punct = list(punctuation)

In [11]:
# Create a master list of tokens to remove by combining the punctuation characters and the NLTK stop words
stop_final = stop_punct + stop_nltk

In [12]:
# Define a function to drop these tokens from any input sentence (tokenized).
def drop_stop(input_token):
    return [token for token in input_token if token not in stop_final]

In [13]:
# Remove redundant tokens by applying the drop_stop function to the tokenized sentences
alice_no_stop = [drop_stop(sent) for sent in alice_words]

In [14]:
# print first cleaned up sentence
print(alice_no_stop[0])

['alice', "'s", 'adventures', 'wonderland', 'lewis', 'carroll', '1865', 'chapter', 'i.', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped', 'book', 'sister', 'reading', 'pictures', 'conversations', "'and", 'use', 'book', 'thought', 'alice', "'without", 'pictures', 'conversation']
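
Notice that tokens such as 'and with a leading quote survive the filter in the output above. The check in drop_stop is an exact match against the list, so a token is dropped only when it equals a stop word or a single punctuation character. The following is a minimal sketch of that behavior; toy_stop_words and drop_stop_exact are hypothetical names, and the short stop list is an assumption standing in for the full NLTK list.

```python
from string import punctuation

# Hypothetical, abbreviated stop list standing in for stopwords.words('english')
toy_stop_words = ["and", "of", "the", "was", "a", "is", "what"]

# A set gives fast membership tests; the filtering result is the same as with a list
stop_lookup = set(toy_stop_words) | set(punctuation)

def drop_stop_exact(tokens):
    """Keep tokens that do not exactly match a stop word or a punctuation character."""
    return [token for token in tokens if token not in stop_lookup]

tokens = ["'and", "what", "is", "the", "use", "of", "a", "book", ",", "'"]
cleaned = drop_stop_exact(tokens)
print(cleaned)
```

The token 'and is kept because it is neither the stop word and nor the bare quote character; stripping such leading quotes would need an extra cleaning step before the filter.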


In [15]:
# Use the PorterStemmer algorithm from NLTK to perform stemming on the result.
from nltk.stem import PorterStemmer
stemmer_p = PorterStemmer()

In [16]:
# Apply the stemmer to the first sentence in alice_no_stop
print([stemmer_p.stem(token) for token in alice_no_stop[0]])

['alic', "'s", 'adventur', 'wonderland', 'lewi', 'carrol', '1865', 'chapter', 'i.', 'rabbit-hol', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth', 'twice', 'peep', 'book', 'sister', 'read', 'pictur', 'convers', "'and", 'use', 'book', 'thought', 'alic', "'without", 'pictur', 'convers']


In [17]:
# Apply the stemmer to all sentences in the data using nested list comprehension
alice_words_stem = [[stemmer_p.stem(token) for token in sent] for sent in alice_no_stop]

In [18]:
# print the result
print(alice_words_stem[:5])

[['alic', "'s", 'adventur', 'wonderland', 'lewi', 'carrol', '1865', 'chapter', 'i.', 'rabbit-hol', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth', 'twice', 'peep', 'book', 'sister', 'read', 'pictur', 'convers', "'and", 'use', 'book', 'thought', 'alic', "'without", 'pictur', 'convers'], ['consid', 'mind', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepi', 'stupid', 'whether', 'pleasur', 'make', 'daisy-chain', 'would', 'worth', 'troubl', 'get', 'pick', 'daisi', 'suddenli', 'white', 'rabbit', 'pink', 'eye', 'ran', 'close'], ['noth', 'remark', 'alic', 'think', 'much', 'way', 'hear', 'rabbit', 'say', "'oh", 'dear'], ['oh', 'dear'], ['shall', 'late']]


In this activity, we lowercased the text, tokenized it into sentences and words, removed punctuation and stop words, and then used the Porter stemming algorithm to stem the remaining tokens. Stemming works on individual terms, so it must be applied after tokenizing into terms. It reduced some terms to stems that are not valid English words, such as 'alic' and 'pictur'.
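
To see why stemming has to come after tokenization, here is a minimal sketch using a toy suffix stripper. The toy_stem function is hypothetical and deliberately crude; it is not the Porter algorithm, and it only illustrates that a stemmer looks at one string at a time.

```python
def toy_stem(token):
    """Strip one common suffix. A crude stand-in for a real stemmer, NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Tokenized sentences: the stemmer is applied to every token in every sentence
sentences = [["alice", "pictures", "conversations"], ["considering", "daisies"]]
stemmed = [[toy_stem(token) for token in sent] for sent in sentences]
print(stemmed)

# An untokenized string is treated as one long term, so only its ending changes
untokenized = toy_stem("alice was considering pictures")
print(untokenized)
```

As with the Porter output above, some results (pictur, daisi) are not valid English words, and the untokenized string shows that words in the middle of a sentence are left unstemmed.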


#### Activity 4.02: Text Representation for Alice in Wonderland

Import word2vec from Gensim and train your word embeddings with default parameters.

In [19]:
print(alice_no_stop[:3])

[['alice', "'s", 'adventures', 'wonderland', 'lewis', 'carroll', '1865', 'chapter', 'i.', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped', 'book', 'sister', 'reading', 'pictures', 'conversations', "'and", 'use', 'book', 'thought', 'alice', "'without", 'pictures', 'conversation'], ['considering', 'mind', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepy', 'stupid', 'whether', 'pleasure', 'making', 'daisy-chain', 'would', 'worth', 'trouble', 'getting', 'picking', 'daisies', 'suddenly', 'white', 'rabbit', 'pink', 'eyes', 'ran', 'close'], ['nothing', 'remarkable', 'alice', 'think', 'much', 'way', 'hear', 'rabbit', 'say', "'oh", 'dear']]


In [20]:
from gensim.models import word2vec

In [21]:
model_default = word2vec.Word2Vec(
    alice_no_stop
)

In [22]:
# Find the terms most similar to rabbit.
model_default.wv.most_similar(
    'rabbit',
    topn=5
)

[('alice', 0.9988503456115723),
 ("'s", 0.9987548589706421),
 ('--', 0.9987461566925049),
 ('one', 0.9987258315086365),
 ('little', 0.9986881613731384)]

In [23]:
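# Retrain the word vectors with 2-dimensional embeddings.
# Note: vector_size sets the embedding dimensionality; the context window is controlled by the window parameter.
model_default = word2vec.Word2Vec(
    alice_no_stop,
    vector_size=2 # embedding dimensionality (pass window=2 to set a window size of 2)
)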

In [24]:
# Find the terms most similar to rabbit.
model_default.wv.most_similar(
    'rabbit',
    topn=5
)

[('slates', 0.999992311000824),
 ('little', 0.9999845027923584),
 ('``', 0.9999681711196899),
 ('hot', 0.9999629855155945),
 ('liked', 0.9999596476554871)]

In [25]:
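# Retrain the word vectors using the Skip-gram method with 5-dimensional embeddings.
# Note: the context window keeps its default value here; pass window=5 to set it explicitly.
model_default = word2vec.Word2Vec(
    alice_no_stop,
    sg=1, # skip-gram
    vector_size=5 # embedding dimensionality, not window size
)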

In [26]:
# Find the terms most similar to rabbit.
model_default.wv.most_similar(
    'rabbit',
    topn=5
)

[("'we", 0.9996872544288635),
 ('sitting', 0.9994874596595764),
 ('leave', 0.9975512027740479),
 ('piece', 0.9971781373023987),
 ('sleepy', 0.9971364736557007)]

In [27]:
# Find the representation for the phrase white rabbit by averaging the vectors for white and rabbit.
rabbit = model_default.wv['rabbit'] # extract vector term rabbit
white = model_default.wv['white'] # extract vector term white

# Create a vector as the element-wise average of the two vectors, (white + rabbit)/2. This is our vector for the entire phrase "white rabbit"
white_rabbit = (white + rabbit)/2

In [28]:
# Find the representation for mad hatter by averaging the vectors for mad and hatter.
mad = model_default.wv['mad'] # extract vector term mad
hatter = model_default.wv['hatter'] # extract vector term hatter

# Create a vector as the element-wise average of the two vectors, (mad + hatter)/2. This is our vector for the entire phrase "mad hatter"
mad_hatter = (mad + hatter)/2

In [30]:
# Using the cosine_similarities() method in the model, find the cosine similarity between the two phrases 'white rabbit' & 'mad hatter'
model_default.wv.cosine_similarities(white_rabbit, [mad_hatter])

array([0.9700395], dtype=float32)

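The result is a cosine similarity of about 0.97, which is close to the maximum of 1. This means that the model considers the phrases "white rabbit" and "mad hatter" to be very similar in meaning. Keep in mind that with 5-dimensional vectors trained on a single book, most term pairs score close to 1 (as the most_similar outputs above show), so a high value here is weak evidence on its own.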
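
The phrase comparison itself is simple arithmetic: average the two term vectors element-wise, then take the cosine of the angle between the two averages. Here is a minimal pure-Python sketch of that calculation. The 3-dimensional vectors are made up for illustration and are not taken from any trained model, and the helper names average and cosine are hypothetical.

```python
import math

def average(u, v):
    """Element-wise average of two equal-length vectors."""
    return [(a + b) / 2 for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (assumption: real embeddings would come from a trained model)
white = [0.9, 0.1, 0.0]
rabbit = [0.7, 0.3, 0.2]
mad = [0.1, 0.8, 0.4]
hatter = [0.3, 0.6, 0.6]

white_rabbit = average(white, rabbit)
mad_hatter = average(mad, hatter)
print(cosine(white_rabbit, mad_hatter))
```

Vectors pointing in the same direction score 1, perpendicular vectors score 0, and the score ignores vector length, which is why it is a common choice for comparing embeddings.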

In [31]:
# Load pre-trained GloVe embeddings of size 100D.
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B/glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.w2vformat.txt'
glove2word2vec(
    glove_input_file,
    word2vec_output_file
)

(400000, 100)

In [32]:
# glove model
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format(
    'glove.6B.100d.w2vformat.txt',
    binary=False
)

In [33]:
# Find representations for white rabbit and mad hatter using glove model
# white rabbit
rabbit = glove_model['rabbit'] # extract vector term rabbit
white = glove_model['white'] # extract vector term white

# Create a vector as the element-wise average of the two vectors, (white + rabbit)/2. This is our vector for the entire phrase "white rabbit"
white_rabbit = (white + rabbit)/2

# mad hatter
mad = glove_model['mad'] # extract vector term mad
hatter = glove_model['hatter'] # extract vector term hatter

# Create a vector as the element-wise average of the two vectors, (mad + hatter)/2. This is our vector for the entire phrase "mad hatter"
mad_hatter = (mad + hatter)/2

In [35]:
# Find the cosine similarity between the two phrases. Has the cosine similarity changed?
glove_model.cosine_similarities(white_rabbit, [mad_hatter])

array([0.45145565], dtype=float32)

The result is a cosine similarity of about 0.45, which is still positive but much closer to 0. This means that the GloVe model considers the phrases "white rabbit" and "mad hatter" to be less similar in meaning than our Alice-trained model did. A likely reason is that GloVe was trained on a large general-purpose corpus, where these terms do not appear together the way they do in the book.

As a result of this activity, we now have our own word vectors trained on "Alice's Adventures in Wonderland", with representations for the terms in the text that occur often enough to be kept in the vocabulary (by default, Word2Vec drops terms that appear fewer than five times).