# Text pre-processing using NLTK

In this notebook I work through the "Implementing Text Pre-processing Using NLTK" exercise from the [Intro to NLP course by Shivam Bansal](https://courses.analyticsvidhya.com/courses/Intro-to-NLP).

In [5]:
import nltk

In [62]:
# download the punkt models (for tokenizing) and the wordnet corpus (for lemmatizing)
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/pdrew/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/pdrew/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [22]:
text = 'Last night Oliver was sick. He walked around the bedroom a lot and tried to get our attention\
        but I was asleep and Lina did not understand he was feeling sick because he was not displaying\
        his usual tells.'

In [23]:
# separate out the sentences
sents = nltk.tokenize.sent_tokenize(text)

print('n sentences:', len(sents), '\n')

for i, sent in enumerate(sents):
    print('sentence', i, ':', sent)

n sentences: 2 

sentence 0 : Last night Oliver was sick.
sentence 1 : He walked around the bedroom a lot and tried to get our attention        but I was asleep and Lina did not understand he was feeling sick because he was not displaying        his usual tells.
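Sentence 1 contains long runs of spaces because a backslash continuation inside a string literal keeps the indentation of the line that follows. Here is a minimal sketch of one way to collapse that whitespace before tokenizing. The names `raw` and `clean` are illustrative and do not appear in the notebook.

```python
# A backslash continuation inside a string literal keeps the next line's indentation.
raw = 'tried to get our attention\
        but I was asleep'

# str.split() with no arguments splits on any run of whitespace,
# so re-joining the pieces with single spaces collapses the gaps.
clean = ' '.join(raw.split())

print(repr(raw))
print(repr(clean))
```

An alternative is to wrap several adjacent string literals in parentheses, which Python concatenates without adding any whitespace.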


In [68]:
# now let's tokenize words instead of sentences
words = nltk.tokenize.word_tokenize(text)

print('the first 10 words are:', words[:10])

the first 10 words are: ['Last', 'night', 'Oliver', 'was', 'sick', '.', 'He', 'walked', 'around', 'the']
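Note that the full stop comes out as its own token. To get a feel for what that involves, here is a rough regular-expression sketch that splits words from punctuation. It is not NLTK's algorithm, which handles many more cases; `simple_word_tokenize` is a hypothetical helper written for this illustration.

```python
import re

def simple_word_tokenize(text):
    """Hypothetical helper: split text into words and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize('Last night Oliver was sick. He walked around')
print(tokens)
```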


# Stemming
Now let's try removing the affixes from the words, leaving just the stems.

In [69]:
# make a stemmer object
stemmer = nltk.stem.PorterStemmer()

print(words)

['Last', 'night', 'Oliver', 'was', 'sick', '.', 'He', 'walked', 'around', 'the', 'bedroom', 'a', 'lot', 'and', 'tried', 'to', 'get', 'our', 'attention', 'but', 'I', 'was', 'asleep', 'and', 'Lina', 'did', 'not', 'understand', 'he', 'was', 'feeling', 'sick', 'because', 'he', 'was', 'not', 'displaying', 'his', 'usual', 'tells', '.']


In [70]:
# let's try stemming each of the words from the story above

# stem each token; note the argument is the single word, not the whole list
singles = [stemmer.stem(word) for word in words]
    
print(singles)

['last', 'night', 'oliv', 'wa', 'sick', '.', 'He', 'walk', 'around', 'the', 'bedroom', 'a', 'lot', 'and', 'tri', 'to', 'get', 'our', 'attent', 'but', 'I', 'wa', 'asleep', 'and', 'lina', 'did', 'not', 'understand', 'he', 'wa', 'feel', 'sick', 'becaus', 'he', 'wa', 'not', 'display', 'hi', 'usual', 'tell', '.']


Some of these stems are no longer dictionary words: for example, 'Oliver' became 'oliv' and 'was' became 'wa'. The Porter stemmer strips suffixes by rule without checking the result against a dictionary, so it is a crude tool when the normalized text needs to stay readable. Lemmatization, which maps each word to its dictionary form, is usually a better choice for that.
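To see why rule-based suffix stripping produces non-words, here is a toy stemmer. It is a sketch under loud assumptions: the suffix list and the `toy_stem` function are invented for illustration and are far simpler than the actual Porter rules.

```python
# Hypothetical suffix list, checked in order; this is NOT the Porter rule set.
SUFFIXES = ('ing', 'ed', 'er', 's')

def toy_stem(word, min_stem_len=2):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        # Only strip when the suffix matches and enough of the word remains.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

for w in ['walked', 'feeling', 'tells', 'Oliver', 'was', 'sick']:
    print(w, '->', toy_stem(w))
```

Because nothing checks the result against a vocabulary, 'was' loses its final s just as 'tells' does.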

In [73]:
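# make a lemmatizer object (uses the wordnet corpus downloaded above)
lem = nltk.stem.WordNetLemmatizer()

# lemmatize each token; with no pos argument every word is treated as a noun
singles = [lem.lemmatize(word) for word in words]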
    
print(singles)

['Last', 'night', 'Oliver', 'wa', 'sick', '.', 'He', 'walked', 'around', 'the', 'bedroom', 'a', 'lot', 'and', 'tried', 'to', 'get', 'our', 'attention', 'but', 'I', 'wa', 'asleep', 'and', 'Lina', 'did', 'not', 'understand', 'he', 'wa', 'feeling', 'sick', 'because', 'he', 'wa', 'not', 'displaying', 'his', 'usual', 'tell', '.']


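Lemmatization did better: 'Oliver', 'attention' and 'because' survive intact. It still fails on 'was', though, and leaves verb forms such as 'walked' and 'feeling' unchanged. By default lemmatize treats every word as a noun, so 'was' loses its final s as if it were a plural; passing pos='v' for verbs should map 'was' to 'be'.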
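The role of the part of speech can be sketched with a miniature dictionary-based lemmatizer. It is loosely modeled on the general idea of checking irregular forms first and then trying suffix rules against a vocabulary. All three tables and `toy_lemmatize` are made up for this illustration and are not NLTK's data or API. With NLTK itself, the corresponding step would be passing a `pos` argument to `lemmatize`.

```python
# Made-up miniature tables; a real lemmatizer reads these from WordNet.
EXCEPTIONS = {
    'n': {},
    'v': {'was': 'be', 'did': 'do'},
}
SUFFIX_RULES = {
    'n': [('s', '')],
    'v': [('s', ''), ('ed', ''), ('ing', '')],
}
VOCAB = {
    'n': {'wa', 'tell', 'night', 'feeling'},
    'v': {'be', 'do', 'walk', 'feel', 'tell', 'display'},
}

def toy_lemmatize(word, pos='n'):
    """Return a lemma for `word` under the given part of speech, else `word` unchanged."""
    # 1. Irregular forms are looked up directly.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # 2. A word that is already a dictionary form is kept.
    if word in VOCAB[pos]:
        return word
    # 3. Otherwise try each suffix rule; accept the first candidate found in the vocabulary.
    for old, new in SUFFIX_RULES[pos]:
        if word.endswith(old):
            candidate = word[:-len(old)] + new
            if candidate in VOCAB[pos]:
                return candidate
    return word

for w in ['was', 'walked', 'feeling']:
    print(w, '| noun:', toy_lemmatize(w), '| verb:', toy_lemmatize(w, pos='v'))
```

Under the noun default the toy reproduces the pattern seen above ('wa', 'walked', 'feeling'), while tagging the same words as verbs gives 'be', 'walk' and 'feel'.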