In this activity, you will apply all the preprocessing steps you have learned so far to a much larger, real text. We'll work with the text of Alice in Wonderland, which we store in the alice_raw variable.

In [1]:
import nltk

In [2]:
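# Load the raw text (the 'gutenberg' corpus must be installed, e.g. via nltk.download('gutenberg'))
alice_raw = nltk.corpus.gutenberg.raw('carroll-alice.txt')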

In [3]:
# First 800 characters of alice_raw
alice_raw[:800]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit"

In [4]:
# Change the raw text to lowercase
alice_raw = alice_raw.lower()

In [5]:
from nltk import tokenize

In [6]:
# Tokenize into sentences (sent_tokenize relies on NLTK's Punkt tokenizer data being installed)
alice_sents = tokenize.sent_tokenize(alice_raw)

In [8]:
# Tokenize each sentence into words, giving one list of tokens per sentence
alice_words = [tokenize.word_tokenize(sent) for sent in alice_sents]
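
The two tokenizing steps produce a nested structure: a list of sentences, each of which is a list of tokens. The sketch below illustrates only that shape. It uses a naive regular-expression splitter as a stand-in, which is an assumption for illustration and not the algorithm behind sent_tokenize or word_tokenize; the names sample_text, sample_sents and sample_words are hypothetical.

```python
import re

# Hypothetical sample text, lowercased as in the activity
sample_text = "Alice was tired. She saw a rabbit!".lower()

# Naive stand-in for sentence tokenization: split after ., ! or ?
sample_sents = re.split(r"(?<=[.!?])\s+", sample_text)

# Naive stand-in for word tokenization: runs of word characters,
# or single punctuation characters
sample_words = [re.findall(r"\w+|[^\w\s]", sent) for sent in sample_sents]

print(sample_sents)
print(sample_words)
# → ['alice was tired.', 'she saw a rabbit!']
# → [['alice', 'was', 'tired', '.'], ['she', 'saw', 'a', 'rabbit', '!']]
```

alice_words has the same list-of-lists shape, which is why the later steps loop over sentences first and over tokens second.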

In [9]:
# Import punctuation from the string module and the stop words from NLTK.
from string import punctuation
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
# Create a variable holding the NLTK English stop words
stop_nltk = stopwords.words('english')

In [11]:
# Punctuation list
stop_punct = list(punctuation)
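
Calling list() on the punctuation string yields one entry per character, so the resulting list only matches tokens that are exactly one punctuation character long. A small check, using nothing beyond the standard library, makes this visible:

```python
from string import punctuation

stop_punct = list(punctuation)

print(len(stop_punct))      # number of single punctuation characters
print("," in stop_punct)    # a lone comma token matches
print("--" in stop_punct)   # a two-character token does not match
# → 32
# → True
# → False
```

Tokens made of several punctuation characters, or words with a quote still attached, are therefore not removed by this list alone.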

In [12]:
# Create a master list of tokens to remove by combining the punctuation list and the NLTK stop words (no contextual stop words are added here)
stop_final = stop_punct + stop_nltk

In [13]:
# Define a function to drop these tokens from any input sentence (tokenized).
def drop_stop(input_token):
    return [token for token in input_token if token not in stop_final]

In [14]:
# Remove redundant tokens by applying the drop_stop function to each tokenized sentence
alice_no_stop = [drop_stop(sent) for sent in alice_words]
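
Because the filter checks every token against the stop list, a set may be a better container than a list on a text of this size: membership tests on a set do not scan the whole collection. The sketch below shows the same filtering idea with a set. The stop words in sample_stop_words are a small hand-picked subset used only for illustration (an assumption, not the full NLTK list), and drop_stop_set is a hypothetical name.

```python
from string import punctuation

# Hypothetical, hand-picked subset standing in for stopwords.words('english')
sample_stop_words = ["the", "of", "was", "to", "by", "her", "on", "and"]

# A set supports fast membership checks; a list is scanned item by item
stop_set = set(punctuation) | set(sample_stop_words)

def drop_stop_set(input_tokens):
    """Return the tokens that are not in stop_set, preserving their order."""
    return [token for token in input_tokens if token not in stop_set]

sample_sentence = ["alice", "was", "beginning", "to", "get", "very", "tired",
                   ",", "of", "sitting", "by", "her", "sister", "on", "the", "bank"]

filtered = drop_stop_set(sample_sentence)
print(filtered)
# → ['alice', 'beginning', 'get', 'very', 'tired', 'sitting', 'sister', 'bank']
```

Here 'very' survives only because it is missing from the small sample list. The filtering logic is otherwise the same as in drop_stop.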

In [15]:
# print first cleaned up sentence
print(alice_no_stop[0])

['alice', "'s", 'adventures', 'wonderland', 'lewis', 'carroll', '1865', 'chapter', 'i.', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped', 'book', 'sister', 'reading', 'pictures', 'conversations', "'and", 'use', 'book', 'thought', 'alice', "'without", 'pictures', 'conversation']
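
Note that tokens such as "'and" and "'without" are still present: the leading quote is attached to the word, so the token matches neither the punctuation list nor the stop word 'and'. One possible remedy, sketched here as an assumption rather than as part of the original activity, is to strip punctuation from both ends of each token before filtering. The helper name strip_punct is hypothetical.

```python
from string import punctuation

# Tokens copied from the printed sentence above
tokens = ["'and", "use", "book", "thought", "alice", "'without", "pictures"]

def strip_punct(token):
    """Remove punctuation characters from both ends of a token."""
    return token.strip(punctuation)

cleaned = [strip_punct(token) for token in tokens]
# Drop tokens that were nothing but punctuation and are now empty
cleaned = [token for token in cleaned if token]

print(cleaned)
# → ['and', 'use', 'book', 'thought', 'alice', 'without', 'pictures']
```

If a step like this were used, it would need to run before drop_stop so that the uncovered 'and' can then be removed as a stop word.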


In [16]:
# Use the PorterStemmer algorithm from NLTK to perform stemming on the result.
from nltk.stem import PorterStemmer
stemmer_p = PorterStemmer()

In [18]:
# Apply the stemmer to the first sentence in alice_no_stop
print([stemmer_p.stem(token) for token in alice_no_stop[0]])

['alic', "'s", 'adventur', 'wonderland', 'lewi', 'carrol', '1865', 'chapter', 'i.', 'rabbit-hol', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth', 'twice', 'peep', 'book', 'sister', 'read', 'pictur', 'convers', "'and", 'use', 'book', 'thought', 'alic', "'without", 'pictur', 'convers']


In [19]:
# Apply the stemmer to every sentence in the data using a nested list comprehension
alice_words_stem = [[stemmer_p.stem(token) for token in sent] for sent in alice_no_stop]
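
The nested list comprehension reads outside-in: the outer part walks over sentences and the inner part walks over the tokens of one sentence. The sketch below spells out the equivalent explicit loops. To stay self-contained it uses toy_stem, a hypothetical stand-in that only strips a trailing 's'; it is not the Porter algorithm.

```python
def toy_stem(token):
    """Hypothetical stand-in for stemmer_p.stem: strips one trailing 's'."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

sentences = [["pictures", "conversations"], ["oh", "dear"]]

# Nested list comprehension, as in the cell above
nested = [[toy_stem(token) for token in sent] for sent in sentences]

# Equivalent explicit loops
looped = []
for sent in sentences:
    stemmed_sent = []
    for token in sent:
        stemmed_sent.append(toy_stem(token))
    looped.append(stemmed_sent)

print(nested)
print(nested == looped)
# → [['picture', 'conversation'], ['oh', 'dear']]
# → True
```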

In [23]:
# Print the first five stemmed sentences
print(alice_words_stem[:5])

[['alic', "'s", 'adventur', 'wonderland', 'lewi', 'carrol', '1865', 'chapter', 'i.', 'rabbit-hol', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth', 'twice', 'peep', 'book', 'sister', 'read', 'pictur', 'convers', "'and", 'use', 'book', 'thought', 'alic', "'without", 'pictur', 'convers'], ['consid', 'mind', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepi', 'stupid', 'whether', 'pleasur', 'make', 'daisy-chain', 'would', 'worth', 'troubl', 'get', 'pick', 'daisi', 'suddenli', 'white', 'rabbit', 'pink', 'eye', 'ran', 'close'], ['noth', 'remark', 'alic', 'think', 'much', 'way', 'hear', 'rabbit', 'say', "'oh", 'dear'], ['oh', 'dear'], ['shall', 'late']]


In [21]:
# In this activity, we used the Porter stemming algorithm to stem the terms in our tokenized data. Stemming works on individual terms, so it must be applied after the text has been tokenized into terms. Stemming reduced many terms to a base form that is not necessarily a valid English word (for example, 'alice' became 'alic' and 'pictures' became 'pictur').
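
One practical effect of stemming is that different surface forms collapse onto the same stem. The sketch below groups a few (token, stem) pairs copied from the printed outputs above and reports the stems that more than one distinct token maps to. The variable names are hypothetical, and the pairs are only a small sample of the first sentence.

```python
from collections import defaultdict

# (token, stem) pairs copied from the printed outputs above
pairs = [
    ("pictures", "pictur"),
    ("conversations", "convers"),
    ("book", "book"),
    ("conversation", "convers"),
    ("beginning", "begin"),
    ("sitting", "sit"),
]

# Collect the distinct original tokens behind each stem
stem_to_tokens = defaultdict(set)
for token, stem in pairs:
    stem_to_tokens[stem].add(token)

# Keep only the stems shared by more than one distinct token
merged = {stem: sorted(tokens)
          for stem, tokens in stem_to_tokens.items()
          if len(tokens) > 1}

print(merged)
# → {'convers': ['conversation', 'conversations']}
```

In this sample, 'conversation' and 'conversations' are treated as the same term after stemming, even though 'convers' is not itself an English word.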
