In this exercise, we will replicate the preceding example. The target terms are nlp, deep, and learn. We will create a one-hot encoded feature for these terms using our own function and store the result in a NumPy array.

Again, we'll be using the txt_words_nostop variable we created in Exercise 4.01, Tokenizing, Case Normalization, Punctuation, and Stop Word Removal, so you will need to continue this exercise in the same Jupyter Notebook.

#### exercise 4.01

In this exercise, we will remove stop words from the data, and also apply everything we have learned so far. We'll start by performing tokenization (sentences and words); then, we'll perform case normalization, followed by punctuation and stop word removal.

In [1]:
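# Download the Punkt tokenizer models and the stop word lists
# (newer NLTK releases may additionally need nltk.download('punkt_tab'))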
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from string import punctuation
from nltk import tokenize

[nltk_data] Downloading package punkt to /Users/LNonyane/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We'll keep the code concise this time. We'll be defining and manipulating the raw_txt variable. Let's get started:

Define the raw_txt variable so that it contains the text "Welcome to the world of deep learning for NLP! We're in this together, and we'll learn together. NLP is amazing, and deep learning makes it even more fun. Let's learn!":

In [2]:
raw_txt = """
Welcome to the world of deep learning for NLP! We're in this together, and we'll learn together. NLP is amazing, and deep learning makes it even more fun. Let's learn!
"""

Use the sent_tokenize() method to separate the raw text into individual sentences and store the result in a variable. Use the lower() method to convert the string into lowercase before tokenizing:

In [3]:
txt_sents = tokenize.sent_tokenize(raw_txt.lower()) # calling sentence tokenizer on normalized raw_txt

Using list comprehension, apply the word_tokenize() method to separate each sentence into its constituent words:

In [4]:
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents] # split each sentence in txt_sents into word tokens

Convert punctuation, which we imported from the string module earlier, into a list:

In [5]:
stop_punct = list(punctuation) # ! . ? [and many more]

Load the built-in stop words for English from NLTK and save them in a variable:

In [6]:
stop_nltk = stopwords.words('english') # they, my, mine, etc

Create a combined list that contains the punctuation marks as well as the NLTK stop words, so that we can remove both in one go:

In [7]:
stop_final = stop_punct + stop_nltk

Define a function that removes stop words and punctuation from an input sentence, provided as a collection of tokens:

In [8]:
def drop_stop(input_tokens):
    return [token for token in input_tokens if token not in stop_final]
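
Because stop_final is a list, every membership check in drop_stop scans it from the start. A minimal sketch of the same filter backed by a set is shown here. The shortened stop_demo list and the drop_stop_fast name are illustrative assumptions, not part of the exercise.

```python
from string import punctuation

# Hypothetical, shortened stop list standing in for stop_final
stop_demo = list(punctuation) + ["to", "the", "of", "for"]
stop_demo_set = set(stop_demo)  # set membership tests are O(1) on average

def drop_stop_fast(input_tokens, stop_set=stop_demo_set):
    """Return the tokens that are not in stop_set, preserving their order."""
    return [token for token in input_tokens if token not in stop_set]

tokens = ["welcome", "to", "the", "world", "of", "deep", "learning", "for", "nlp", "!"]
cleaned = drop_stop_fast(tokens)
print(cleaned)
```

For a text this small the difference is negligible; the set only starts to matter when the corpus or the stop list grows.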

Remove redundant tokens by applying the function to the tokenized sentences and store the result in a variable:

In [9]:
txt_words_nostop = [drop_stop(sent) for sent in txt_words] # run drop_stop function on each sentence in txt_words

Print the first cleaned-up sentence from the data:

In [10]:
print(txt_words_nostop[0])

['welcome', 'world', 'deep', 'learning', 'nlp']


In this exercise, we performed all the cleanup steps we've learned about so far. This time around, we combined certain steps and made the code more concise. These are some very common steps that we should apply when dealing with text data. You could try to further optimize and modularize by defining a function that returns the result after all the processing steps. We encourage you to try it out.

In [11]:
def drop_with_results_per_step(input_tokens):
    # Filter once, print the intermediate result, then return it
    kept_tokens = [token for token in input_tokens if token not in stop_final]
    print(kept_tokens)
    return kept_tokens

In [12]:
[drop_with_results_per_step(sent) for sent in txt_words]

['welcome', 'world', 'deep', 'learning', 'nlp']
["'re", 'together', "'ll", 'learn', 'together']
['nlp', 'amazing', 'deep', 'learning', 'makes', 'even', 'fun']
['let', "'s", 'learn']


[['welcome', 'world', 'deep', 'learning', 'nlp'],
 ["'re", 'together', "'ll", 'learn', 'together'],
 ['nlp', 'amazing', 'deep', 'learning', 'makes', 'even', 'fun'],
 ['let', "'s", 'learn']]
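
Taking the modularization suggestion one step further, the sketch below wraps lowercasing, sentence splitting, word splitting, and stop token removal in a single function. The clean_text name, its parameters, and the two regex-based tokenizers are assumptions made for illustration: the simple tokenizers are stand-ins so that the example runs without NLTK downloads, and they do not reproduce NLTK's tokenization.

```python
import re
from string import punctuation

def clean_text(raw_txt, stop_tokens, sent_tokenizer, word_tokenizer):
    """Lowercase the text, split it into sentences and words, drop stop tokens."""
    stop_set = set(stop_tokens)
    sentences = sent_tokenizer(raw_txt.lower())
    return [
        [token for token in word_tokenizer(sent) if token not in stop_set]
        for sent in sentences
    ]

# Stand-in tokenizers (assumptions, much cruder than NLTK's)
def simple_sents(text):
    # Split after ., ! or ? when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_words(sent):
    # Runs of word characters, or single punctuation characters
    return re.findall(r"\w+|[^\w\s]", sent)

demo_stop = list(punctuation) + ["is"]  # hypothetical short stop list
result = clean_text("Deep learning is fun. NLP is amazing!",
                    demo_stop, simple_sents, simple_words)
print(result)
```

In the notebook, the same function could be called with tokenize.sent_tokenize, tokenize.word_tokenize, and stop_final in place of the stand-ins.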

So far, the cleanup steps have removed tokens that weren't very useful for our analysis. But there is more we can do to improve the data: we can use our understanding of the language to combine tokens, identify tokens that have practically the same meaning, and remove further redundancy. Two popular approaches are stemming and lemmatization.

#### exercise 4.02

In this exercise, we will continue with data preprocessing. We removed the stop words and punctuation in the previous exercise. Now, we will use the Porter stemming algorithm to stem the tokens. Since we'll be using the txt_words_nostop variable we created previously, let's continue with the same Jupyter Notebook we created in Exercise 4.01, Tokenizing, Case Normalization, Punctuation, and Stop Word Removal.

In [13]:
# Import PorterStemmer from NLTK
from nltk.stem import PorterStemmer

In [14]:
# Instantiate the stemmer
stemmer_p = PorterStemmer()

In [15]:
# Apply the stemmer to the first sentence in txt_words_nostop
print([stemmer_p.stem(token) for token in txt_words_nostop[0]])

['welcom', 'world', 'deep', 'learn', 'nlp']


In [16]:
# Apply the stemmer to all the sentences in the data using nested list comprehension
txt_words_stem = [[stemmer_p.stem(token) for token in sent] for sent in txt_words_nostop]
# run porter stem for each word in the sentences in txt_words_nostop

In [17]:
# print the output
txt_words_stem

[['welcom', 'world', 'deep', 'learn', 'nlp'],
 ["'re", 'togeth', "'ll", 'learn', 'togeth'],
 ['nlp', 'amaz', 'deep', 'learn', 'make', 'even', 'fun'],
 ['let', "'s", 'learn']]

It looks like plenty of modifications have been made by the stemmer. Many of the words aren't valid anymore but are still recognizable, and that's okay.

In this exercise, we used the Porter stemming algorithm to stem the terms of our tokenized data. Stemming works on individual terms, so it must be applied after the text has been tokenized into terms. Stemming reduced some terms to base forms that aren't necessarily valid English words.
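
To see why stemming reduces redundancy, it can help to group surface forms by the stem they collapse to. The following is a small sketch of that idea; the word list and the stem_groups name are illustrative assumptions rather than data from the exercise.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical word list: several inflections of "learn" plus two other terms
words = ["learn", "learning", "learned", "learns", "together", "amazing"]

# Map each stem to the original words that produced it
stem_groups = defaultdict(list)
for word in words:
    stem_groups[stemmer.stem(word)].append(word)

for stem, originals in stem_groups.items():
    print(stem, "<-", originals)
```

All four inflections of learn end up under a single stem, while together and amazing map to the truncated forms togeth and amaz seen in the output above.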

#### exercise 4.03

In [18]:
# Print out the txt_words_nostop variable to see what we're working with
print(txt_words_nostop)

[['welcome', 'world', 'deep', 'learning', 'nlp'], ["'re", 'together', "'ll", 'learn', 'together'], ['nlp', 'amazing', 'deep', 'learning', 'makes', 'even', 'fun'], ['let', "'s", 'learn']]


In [19]:
# Define a list with the target terms, that is, "nlp", "deep", "learn"
target_terms = ['nlp', 'deep', 'learn']

In [20]:
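def get_onehot(sent):
    # 1 if the target term appears as an exact token in the sentence, else 0
    # (exact match only, so the token 'learning' does not count as 'learn')
    return [1 if term in sent else 0 for term in target_terms]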

In [21]:
# Apply the function to each sentence in our text and store the result in a variable
one_hot_mat = [get_onehot(sent) for sent in txt_words_nostop]

In [22]:
# Import NumPy, create an array from the result, and display it
import numpy as np
np.array(one_hot_mat)

array([[1, 1, 0],
       [0, 0, 1],
       [1, 1, 0],
       [0, 0, 1]])
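
The learn column is 0 for the first and third sentences because the comparison is on whole tokens, and those sentences contain learning rather than learn. One possible way to capture both forms, sketched below, is to build the one-hot rows from the stemmed tokens of Exercise 4.02 instead. The token lists are copied from the outputs shown earlier; the terms parameter and the onehot_raw and onehot_stem names are assumptions added for illustration.

```python
import numpy as np

target_terms = ["nlp", "deep", "learn"]

# Token lists copied from the outputs shown earlier in this notebook
txt_words_nostop = [
    ["welcome", "world", "deep", "learning", "nlp"],
    ["'re", "together", "'ll", "learn", "together"],
    ["nlp", "amazing", "deep", "learning", "makes", "even", "fun"],
    ["let", "'s", "learn"],
]
txt_words_stem = [
    ["welcom", "world", "deep", "learn", "nlp"],
    ["'re", "togeth", "'ll", "learn", "togeth"],
    ["nlp", "amaz", "deep", "learn", "make", "even", "fun"],
    ["let", "'s", "learn"],
]

def get_onehot(sent, terms=target_terms):
    """Return 1 for each target term present in the sentence, else 0."""
    sent_set = set(sent)
    return [1 if term in sent_set else 0 for term in terms]

onehot_raw = np.array([get_onehot(sent) for sent in txt_words_nostop])
onehot_stem = np.array([get_onehot(sent) for sent in txt_words_stem])
print(onehot_raw)
print(onehot_stem)
```

With the stemmed tokens, the learn column becomes 1 in every sentence, since learning has been reduced to learn. Whether that merging is desirable depends on the task.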