# Text pre-processing using NLTK

In this notebook I work through the "Implementing Text Pre-processing Using NLTK" exercise from the [intro to NLP course by Shivam Bansal](https://courses.analyticsvidhya.com/courses/Intro-to-NLP).

In [96]:
import nltk
import pandas as pd
import re

In [76]:
# Download some packages we will need later
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/pdrew/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/pdrew/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pdrew/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [22]:
text = 'Last night Oliver was sick. He walked around the bedroom a lot and tried to get our attention\
        but I was asleep and Lina did not understand he was feeling sick because he was not displaying\
        his usual tells.'

In [23]:
# separate out the sentences
sents = nltk.tokenize.sent_tokenize(text)

print('n sentences:', len(sents), '\n')

for i, sent in enumerate(sents):
    print('sentence', i, ":", sent)

n sentences: 2 

sentence 0 : Last night Oliver was sick.
sentence 1 : He walked around the bedroom a lot and tried to get our attention        but I was asleep and Lina did not understand he was feeling sick because he was not displaying        his usual tells.
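
The wide gaps inside sentence 1 come from the backslash line continuations in the `text` string: a backslash inside a quoted string joins the lines but keeps the indentation of the following line as part of the string. A minimal sketch of one way to avoid this, using adjacent string literals inside parentheses (the variable names `with_gaps` and `without_gaps` are only for illustration):

```python
# Backslash continuation inside a quoted string keeps the next line's indentation
with_gaps = 'tried to get our attention\
        but I was asleep'

# Adjacent string literals inside parentheses are joined with no extra whitespace
without_gaps = ('tried to get our attention '
                'but I was asleep')

print(repr(with_gaps))
print(repr(without_gaps))

# Splitting on whitespace and re-joining also removes the gaps after the fact
print(' '.join(with_gaps.split()) == without_gaps)
```

The tokenizers above are not affected by the extra spaces, so this only matters when the sentences are printed or compared as strings.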


In [68]:
# now let's tokenize words instead of sentences
words = nltk.tokenize.word_tokenize(text)

print('the first 10 words are:', words[:10])

the first 10 words are: ['Last', 'night', 'Oliver', 'was', 'sick', '.', 'He', 'walked', 'around', 'the']


# Stemming
Now let's try removing the affixes from words, leaving just the stems.

In [69]:
# make a stemmer object
stemmer = nltk.stem.PorterStemmer()

print(words)

['Last', 'night', 'Oliver', 'was', 'sick', '.', 'He', 'walked', 'around', 'the', 'bedroom', 'a', 'lot', 'and', 'tried', 'to', 'get', 'our', 'attention', 'but', 'I', 'was', 'asleep', 'and', 'Lina', 'did', 'not', 'understand', 'he', 'was', 'feeling', 'sick', 'because', 'he', 'was', 'not', 'displaying', 'his', 'usual', 'tells', '.']


In [70]:
# let's try stemming each of the words from the story above
singles = [stemmer.stem(word) for word in words]

print(singles)

['last', 'night', 'oliv', 'wa', 'sick', '.', 'He', 'walk', 'around', 'the', 'bedroom', 'a', 'lot', 'and', 'tri', 'to', 'get', 'our', 'attent', 'but', 'I', 'wa', 'asleep', 'and', 'lina', 'did', 'not', 'understand', 'he', 'wa', 'feel', 'sick', 'becaus', 'he', 'wa', 'not', 'display', 'hi', 'usual', 'tell', '.']


We see that some of these stems are no longer dictionary words: for example, 'Oliver' became 'oliv' and 'was' became 'wa'. A stemmer only strips suffixes by rule, so it is a poor choice when the normalized text needs to consist of real words. Lemmatization, which maps each word to its dictionary form, is a better fit for that.
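
To see why rule-based suffix stripping produces non-words, here is a deliberately naive sketch. It is not the Porter algorithm that `nltk.stem.PorterStemmer` implements; `naive_stem` and its suffix list are hypothetical and chosen only to illustrate the idea.

```python
def naive_stem(word, suffixes=('ing', 'ed', 'es', 's')):
    # Toy stemmer: lower-case the word and strip the first matching suffix.
    # This is an illustration only, not the Porter algorithm.
    word = word.lower()
    for suffix in suffixes:
        # keep at least two characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word


for w in ['walked', 'feeling', 'tells', 'was', 'his']:
    print(w, '->', naive_stem(w))
```

The rule has no notion of which words exist, so 'was' and 'his' lose their final 's' in the same way that 'tells' does.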

In [73]:
lem = nltk.stem.WordNetLemmatizer()

singles = [lem.lemmatize(word) for word in words]

print(singles)

['Last', 'night', 'Oliver', 'wa', 'sick', '.', 'He', 'walked', 'around', 'the', 'bedroom', 'a', 'lot', 'and', 'tried', 'to', 'get', 'our', 'attention', 'but', 'I', 'wa', 'asleep', 'and', 'Lina', 'did', 'not', 'understand', 'he', 'wa', 'feeling', 'sick', 'because', 'he', 'wa', 'not', 'displaying', 'his', 'usual', 'tell', '.']


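Lemmatization did better, but it still fails on the word 'was'. By default `lemmatize` treats every word as a noun unless a part of speech is passed through its `pos` argument, so 'was' is handled like a plural noun, and verb forms such as 'walked' and 'tried' are left unchanged.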

In [84]:
# nltk can also be used to categorize the parts of speech for each word in our story.
pos = nltk.pos_tag(words)
print('first 10 parts of speech:', pos[:10])

first 10 parts of speech: [('Last', 'JJ'), ('night', 'NN'), ('Oliver', 'NNP'), ('was', 'VBD'), ('sick', 'JJ'), ('.', '.'), ('He', 'PRP'), ('walked', 'VBD'), ('around', 'IN'), ('the', 'DT')]
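
These tags can be passed back to the lemmatizer through its `pos` argument, which defaults to noun. The sketch below assumes the usual convention of mapping Penn Treebank tags to WordNet's single-letter codes by their first letter; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, not part of NLTK.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag (as returned by nltk.pos_tag) to the
    # single-letter part-of-speech code accepted by the `pos` argument.
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's default


def lemmatize_tagged(tagged_words, lemmatize):
    # `lemmatize` is any callable taking (word, pos),
    # for example the `lem.lemmatize` method from the cell above.
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_words]


# A few (word, tag) pairs copied from the pos_tag output above
tagged = [('Oliver', 'NNP'), ('was', 'VBD'), ('sick', 'JJ'), ('walked', 'VBD')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

In this notebook the call would be `lemmatize_tagged(pos, lem.lemmatize)`. With the verb code supplied, the WordNet lemmatizer should reduce 'was' to 'be' rather than 'wa'.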


# Machine Learning Model for Text Classification
Now let's build a basic ML model for text classification, in this case detecting hate speech in tweets.

In [104]:
dF = pd.read_csv('data/final_dataset_basicmlmodel.csv')
dF.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


This dataframe already contains the target for our model: the `label` column holds the hate-speech classification, where 0 means no hate speech and 1 means hate speech. The first step is to clean the tweets to reduce noise.

In [107]:
def clean_tweets(text):
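    # replace everything except ASCII letters and apostrophes with a space
    # (this also removes digits and punctuation)
    text = re.sub(r'[^a-zA-Z\']', ' ', text)
    
    # remove any remaining non-ASCII characters
    # (the previous step already replaces them, so this is only a safeguard)
    text = re.sub(r'[^\x00-\x7F]+', '', text)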
    
    # enforce lower case
    text = text.lower()
    
    return text

In [108]:
dF['clean_tweet'] = dF.tweet.apply(clean_tweets)

dF.head()

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...,user user thanks for lyft credit i can't us...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide society now motivation
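
Because `clean_tweets` substitutes a space for every removed character, the cleaned strings can contain leading, trailing, and repeated spaces. The sketch below repeats the function so that it runs on its own and adds an optional whitespace-collapsing step; `collapse_whitespace` is a hypothetical helper that is not part of the original exercise, and the sample tweet is made up.

```python
import re


def clean_tweets(text):
    # same steps as the notebook's function: keep letters and apostrophes,
    # drop any non-ASCII leftovers, then lower-case
    text = re.sub(r"[^a-zA-Z']", ' ', text)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    return text.lower()


def collapse_whitespace(text):
    # hypothetical extra step: squeeze runs of whitespace into single spaces
    # and trim both ends
    return ' '.join(text.split())


sample = "@user thanks for the #lyft credit!! 100%"  # made-up example tweet
cleaned = clean_tweets(sample)

print(repr(cleaned))
print(repr(collapse_whitespace(cleaned)))
```

Whether the extra step matters depends on what comes next: tokenizers that split on whitespace ignore the gaps, while exact string comparisons do not.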
