In this exercise, we will remove stop words from the data, and also apply everything we have learned so far. We'll start by performing tokenization (sentences and words); then, we'll perform case normalization, followed by punctuation and stop word removal.

In [1]:
# Download the punkt tokenizer models and the stop word lists used below
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from string import punctuation
from nltk import tokenize

[nltk_data] Downloading package punkt to /Users/LNonyane/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We'll keep the code concise this time. We'll be defining and manipulating the raw_txt variable. Let's get started:

Define the raw_txt variable so that it contains the text "Welcome to the world of deep learning for NLP! We're in this together, and we'll learn together. NLP is amazing, and deep learning makes it even more fun. Let's learn!":

In [2]:
raw_txt = """
Welcome to the world of deep learning for NLP! We're in this together, and we'll learn together. NLP is amazing, and deep learning makes it even more fun. Let's learn!
"""

Use the sent_tokenize() function to separate the raw text into individual sentences and store the result in a variable. Convert the string to lowercase with the lower() method before tokenizing:

In [3]:
txt_sents = tokenize.sent_tokenize(raw_txt.lower()) # calling sentence tokenizer on normalized raw_txt

Using a list comprehension, apply the word_tokenize() function to separate each sentence into its constituent words:

In [4]:
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents] # separate words to constituent parts for each sentence in txt_sents.

Convert punctuation, which we imported from the string module earlier, into a list:

In [5]:
stop_punct = list(punctuation) # ! . ? [and many more]

Load the built-in English stop words from NLTK (stopwords was imported earlier) and save them in a variable:

In [6]:
stop_nltk = stopwords.words('english') # they, my, mine, etc

Create a combined list that contains the punctuation marks as well as the NLTK stop words, so that we can remove both in one go:

In [7]:
stop_final = stop_punct + stop_nltk

Define a function that removes stop words and punctuation from an input sentence, provided as a collection of tokens:

In [8]:
def drop_stop(input_tokens):
    return [token for token in input_tokens if token not in stop_final]

Remove redundant tokens by applying the function to the tokenized sentences, and store the result in a variable:

In [9]:
txt_words_nostop = [drop_stop(sent) for sent in txt_words] # run drop_stop function on each sentence in txt_words

Print the first cleaned-up sentence from the data:

In [10]:
print(txt_words_nostop[0])

['welcome', 'world', 'deep', 'learning', 'nlp']


In this exercise, we performed all the cleanup steps we've learned about so far. This time around, we combined certain steps and made the code more concise. These are some very common steps that we should apply when dealing with text data. You could try to further optimize and modularize by defining a function that returns the result after all the processing steps. We encourage you to try it out.

In [11]:
def drop_with_results_per_step(input_tokens):
    # Filter once, print the cleaned sentence, then return the same list
    kept_tokens = [token for token in input_tokens if token not in stop_final]
    print(kept_tokens)
    return kept_tokens

In [12]:
[drop_with_results_per_step(sent) for sent in txt_words]

['welcome', 'world', 'deep', 'learning', 'nlp']
["'re", 'together', "'ll", 'learn', 'together']
['nlp', 'amazing', 'deep', 'learning', 'makes', 'even', 'fun']
['let', "'s", 'learn']


[['welcome', 'world', 'deep', 'learning', 'nlp'],
 ["'re", 'together', "'ll", 'learn', 'together'],
 ['nlp', 'amazing', 'deep', 'learning', 'makes', 'even', 'fun'],
 ['let', "'s", 'learn']]
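
The suggestion above, a single function that runs every cleanup step, could look something like the following sketch. So that it runs without NLTK's downloaded data, it relies on two hypothetical regular-expression helpers (simple_sent_tokenize and simple_word_tokenize) and a small hand-written stop list (DEMO_STOP_WORDS). None of these names come from NLTK; they are only rough stand-ins for sent_tokenize(), word_tokenize(), and stopwords.words('english').

```python
import re
from string import punctuation

# Assumption: a tiny hand-written stop list, standing in for stopwords.words('english').
DEMO_STOP_WORDS = {"to", "the", "of", "for", "in", "this", "and", "is", "it", "more"}
DEMO_STOP_TOKENS = DEMO_STOP_WORDS | set(punctuation)


def simple_sent_tokenize(text):
    """Rough stand-in for sent_tokenize(): split after '.', '!' or '?' plus whitespace."""
    return [sent for sent in re.split(r"(?<=[.!?])\s+", text.strip()) if sent]


def simple_word_tokenize(sentence):
    """Rough stand-in for word_tokenize(): runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def clean_text(raw_text, stop_tokens,
               sent_tokenizer=simple_sent_tokenize,
               word_tokenizer=simple_word_tokenize):
    """Lowercase, split into sentences and words, then drop stop words and punctuation."""
    sentences = sent_tokenizer(raw_text.lower())
    return [
        [token for token in word_tokenizer(sent) if token not in stop_tokens]
        for sent in sentences
    ]


demo_txt = ("Welcome to the world of deep learning for NLP! "
            "NLP is amazing, and deep learning makes it even more fun.")
cleaned = clean_text(demo_txt, DEMO_STOP_TOKENS)
print(cleaned)
```

In the notebook itself, calling clean_text(raw_txt, stop_final, tokenize.sent_tokenize, tokenize.word_tokenize) should follow the same steps that produced txt_words_nostop, since the tokenizers are passed in as arguments rather than hard-coded.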

So far, the cleanup steps have removed tokens that weren't very useful for our analysis. But there are a few more things we could do to make our data even better: we can use our understanding of the language to combine tokens, identify tokens that have practically the same meaning, and remove further redundancy. Two popular approaches are stemming and lemmatization.

#### Exercise 4.02

In this exercise, we will continue with data preprocessing. We removed the stop words and punctuation in the previous exercise. Now, we will use the Porter stemming algorithm to stem the tokens. Since we'll be using the txt_words_nostop variable we created previously, let's continue with the same Jupyter Notebook we created in Exercise 4.01, Tokenizing, Case Normalization, Punctuation, and Stop Word Removal.

In [13]:
# Import PorterStemmer from NLTK
from nltk.stem import PorterStemmer

In [14]:
# Instantiate the stemmer
stemmer_p = PorterStemmer()

In [15]:
# Apply the stemmer to the first sentence in txt_words_nostop
print([stemmer_p.stem(token) for token in txt_words_nostop[0]])

['welcom', 'world', 'deep', 'learn', 'nlp']


In [16]:
# Apply the stemmer to all the sentences in the data using nested list comprehension
txt_words_stem = [[stemmer_p.stem(token) for token in sent] for sent in txt_words_nostop]
# run porter stem for each word in the sentences in txt_words_nostop

In [17]:
# print the output
txt_words_stem

[['welcom', 'world', 'deep', 'learn', 'nlp'],
 ["'re", 'togeth', "'ll", 'learn', 'togeth'],
 ['nlp', 'amaz', 'deep', 'learn', 'make', 'even', 'fun'],
 ['let', "'s", 'learn']]

It looks like plenty of modifications have been made by the stemmer. Many of the words aren't valid anymore but are still recognizable, and that's okay.

In this exercise, we used the Porter stemming algorithm to stem the terms of our tokenized data. Stemming works on individual terms, so it must be applied after the text has been tokenized into terms. Stemming reduced some terms to base forms that aren't necessarily valid English words.
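
One way to see the redundancy that stemming removes is to compare the number of distinct tokens before and after. The sketch below does not call the stemmer; it copies the two token lists from the outputs printed earlier in this notebook (the names tokens_before and tokens_after, and the vocabulary() helper, are made up for illustration).

```python
# Token lists copied from the notebook outputs: after stop word removal, and after stemming.
tokens_before = [
    ["welcome", "world", "deep", "learning", "nlp"],
    ["'re", "together", "'ll", "learn", "together"],
    ["nlp", "amazing", "deep", "learning", "makes", "even", "fun"],
    ["let", "'s", "learn"],
]
tokens_after = [
    ["welcom", "world", "deep", "learn", "nlp"],
    ["'re", "togeth", "'ll", "learn", "togeth"],
    ["nlp", "amaz", "deep", "learn", "make", "even", "fun"],
    ["let", "'s", "learn"],
]


def vocabulary(sentences):
    """Return the set of distinct tokens across all sentences."""
    return {token for sent in sentences for token in sent}


vocab_before = vocabulary(tokens_before)
vocab_after = vocabulary(tokens_after)
print(len(vocab_before), len(vocab_after))  # 15 14
```

The vocabulary shrinks by one because 'learning' and 'learn' now share the stem 'learn'. On a larger corpus, the reduction would typically be more noticeable.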

#### End of exercises

#### Term Frequencies

In [18]:
txt_sents

['\nwelcome to the world of deep learning for nlp!',
 "we're in this together, and we'll learn together.",
 'nlp is amazing, and deep learning makes it even more fun.',
 "let's learn!"]

In [19]:
# Import the CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer # can work with raw text as well as tokens

In [20]:
# Instantiate the vectorizer and provide the vocabulary size. This picks the top n terms from the data for creating the matrix (Document Term Matrix)
vectorizer = CountVectorizer(max_features=5)

In [21]:
# Teach the vectorizer the vocabulary - top n terms - and print them out.
vectorizer.fit(txt_sents)
vectorizer.vocabulary_

{'deep': 1, 'we': 4, 'together': 3, 'and': 0, 'learn': 2}

In [22]:
"""
Now, let's apply the vectorizer to the data to create the DTM. 
A minor detail: the result from a vectorizer is a sparse matrix. 
To view it, we'll convert it into an array
"""
txt_dtm = vectorizer.fit_transform(txt_sents)
txt_dtm.toarray()

array([[0, 1, 0, 0, 0],
       [1, 0, 1, 2, 2],
       [1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0]])
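
The second row of the matrix counts 'we' twice, even though that sentence only contains "we're" and "we'll". This comes from the vectorizer's own tokenization. As a rough illustration, the sketch below applies the regular expression that scikit-learn documents as CountVectorizer's default token_pattern (tokens of two or more word characters) to that sentence, using only the standard library. The vocabulary order is copied from the vocabulary_ output above, and the variable names are chosen for this example.

```python
import re
from collections import Counter

# Documented default token_pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = "we're in this together, and we'll learn together."
tokens = re.findall(TOKEN_PATTERN, sentence)
print(tokens)
# ['we', 're', 'in', 'this', 'together', 'and', 'we', 'll', 'learn', 'together']

# Columns in the order given by vocabulary_: and=0, deep=1, learn=2, together=3, we=4
vocabulary = ["and", "deep", "learn", "together", "we"]
counts = Counter(tokens)
row = [counts[term] for term in vocabulary]
print(row)  # [1, 0, 1, 2, 2]
```

The apostrophe is not a word character, so each contraction is split in two and both halves starting with 'we' are counted.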

In [23]:
"""
Notice that the vectorizer tokenizes the sentence as well. 
If you don't want that and want to use preprocessed tokens instead (txt_words_stem), 
you simply need to pass a dummy tokenizer and preprocessor to CountVectorizer.
First, we create a function that does nothing and simply returns the tokenized sentence/document
"""
def do_nothing(doc):
    return doc

In [24]:
# Now, we'll instantiate the vectorizer to use this function as the preprocessor and tokenizer
vectorizer = CountVectorizer(
    max_features=5,
    preprocessor=do_nothing,
    tokenizer=do_nothing
)

In [25]:
"""
Here, we're fitting and transforming the data in one step using the fit_transform() method of the vectorizer, 
and then we view the result. The method identifies the unique terms as the vocabulary when fitting on the data, 
then counts and returns the occurrences of each term for each document when transforming. 
Let's see it in action.
"""

txt_dtm = vectorizer.fit_transform(txt_words_stem)
txt_dtm.toarray()

array([[0, 1, 1, 1, 0],
       [1, 0, 1, 0, 2],
       [0, 1, 1, 1, 0],
       [0, 0, 1, 0, 0]])

In [26]:
vectorizer.vocabulary_

{'deep': 1, 'learn': 2, 'nlp': 3, 'togeth': 4, "'ll": 0}

In [27]:
txt_words_stem

[['welcom', 'world', 'deep', 'learn', 'nlp'],
 ["'re", 'togeth', "'ll", 'learn', 'togeth'],
 ['nlp', 'amaz', 'deep', 'learn', 'make', 'even', 'fun'],
 ['let', "'s", 'learn']]
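
To make the counting step concrete, the matrix above can be rebuilt by hand from the stemmed tokens. This is only a sketch of the transform step: the vocabulary and its column order are copied from the vocabulary_ output rather than learned, and count_row() is a hypothetical helper written for this example.

```python
from collections import Counter

# Stemmed tokens, copied from the notebook output above.
txt_words_stem = [
    ["welcom", "world", "deep", "learn", "nlp"],
    ["'re", "togeth", "'ll", "learn", "togeth"],
    ["nlp", "amaz", "deep", "learn", "make", "even", "fun"],
    ["let", "'s", "learn"],
]

# Column order follows the indices in vocabulary_: 'll=0, deep=1, learn=2, nlp=3, togeth=4
vocabulary = ["'ll", "deep", "learn", "nlp", "togeth"]


def count_row(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one tokenized document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


manual_dtm = [count_row(sent, vocabulary) for sent in txt_words_stem]
for row in manual_dtm:
    print(row)
```

Each printed row should match the corresponding row of txt_dtm.toarray(); tokens outside the five-term vocabulary are simply ignored.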

#### Exercise 4.04: Document-Term Matrix with TF-IDF

In [28]:
txt_sents

['\nwelcome to the world of deep learning for nlp!',
 "we're in this together, and we'll learn together.",
 'nlp is amazing, and deep learning makes it even more fun.',
 "let's learn!"]

In [29]:
# Import the TfidfVectorizer utility from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
# Instantiate the vectorizer with a vocabulary size of 5
vectorizer_tfidf = TfidfVectorizer(max_features=5)

In [31]:
# Fit the vectorizer on the raw data of txt_sents:
vectorizer_tfidf.fit(txt_sents)

TfidfVectorizer(max_features=5)

In [32]:
# Print out the vocabulary learned by the vectorizer
vectorizer_tfidf.vocabulary_

{'deep': 1, 'we': 4, 'together': 3, 'and': 0, 'learn': 2}

In [33]:
# Transform the data using the trained vectorizer
txt_tfidf = vectorizer_tfidf.transform(txt_sents)

In [34]:
# Print the resulting DTM
txt_tfidf.toarray()

array([[0.        , 1.        , 0.        , 0.        , 0.        ],
       [0.25932364, 0.        , 0.25932364, 0.65783832, 0.65783832],
       [0.70710678, 0.70710678, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ]])

In [35]:
"""
We also need to see the IDF for each of the terms in the vocabulary to check if the factor is indeed working as we expect it to. 
Print out the IDF values for the terms using the idf_ attribute
"""

vectorizer_tfidf.idf_

array([1.51082562, 1.51082562, 1.51082562, 1.91629073, 1.91629073])

In this exercise, we saw how to represent text using the TF-IDF approach. We also saw how the approach down-weights more frequent terms: the IDF values are lower (about 1.51) for 'and', 'deep', and 'learn', which each appear in two sentences, than for 'together' and 'we' (about 1.92), which each appear in only one. We ended up with a DTM containing the TF-IDF values for the terms.
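
The numbers above can be checked by hand. The sketch below assumes TfidfVectorizer's default settings as documented by scikit-learn: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. It starts from the count matrix printed in the Term Frequencies section and uses only the standard library; the names count_dtm and tfidf_row are chosen for this example.

```python
import math

# Count DTM for txt_sents, copied from the CountVectorizer output earlier in the notebook.
# Columns follow the vocabulary indices: and, deep, learn, together, we.
count_dtm = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 2, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
n_docs = len(count_dtm)
n_terms = len(count_dtm[0])

# Document frequency: the number of documents in which each term appears.
doc_freq = [sum(1 for row in count_dtm if row[col] > 0) for col in range(n_terms)]

# Smoothed IDF (assumed default, smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(counts):
    """Weight raw counts by IDF, then scale the row to unit Euclidean length."""
    weighted = [count * weight for count, weight in zip(counts, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm for value in weighted]


tfidf_dtm = [tfidf_row(row) for row in count_dtm]
print([round(value, 8) for value in idf])
print([round(value, 8) for value in tfidf_dtm[1]])
```

Under those assumptions, the printed IDF values and the second row should agree with idf_ and txt_tfidf.toarray() to the displayed precision. In the second row, 'together' and 'we' end up with the largest weights because they occur twice in that sentence and in no other.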