# Text Pre-Processing and Tokenization

In this module, we'll learn how to split the text of a document into "tokens" and how to pre-process that text to exclude things like punctuation and stop words (e.g., "the", "an", and "and").

In [None]:
import re
import requests
import pandas as pd

To illustrate, we'll use a small snippet from Steve Jobs' 2005 Stanford Commencement Address:

In [None]:
text = (
    "I was lucky — I found what I loved to do early in life. Woz and I started Apple in my parents' garage when I was 20. We worked hard, and in 10 years Apple had grown from just the two of us in a garage into a $2 billion company with over 4,000 employees. We had just released our finest creation — the Macintosh — a year earlier, and I had just turned 30. And then I got fired. How can you get fired from a company you started? Well, as Apple grew we hired someone who I thought was very talented to run the company with me, and for the first year or so things went well. But then our visions of the future began to diverge and eventually we had a falling out. When we did, our Board of Directors sided with him. So at 30 I was out. And very publicly out. What had been the focus of my entire adult life was gone, and it was devastating. I really didn't know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down — that I had dropped the baton as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and I even thought about running away from the valley. But something slowly began to dawn on me — I still loved what I did. The turn of events at Apple had not changed that one bit. I had been rejected, but I was still in love. And so I decided to start over. I didn't see it then, but it turned out that getting fired from Apple was the best thing that could have ever happened to me. The heaviness of being successful was replaced by the lightness of being a beginner again, less sure about everything. It freed me to enter one of the most creative periods of my life. During the next five years, I started a company named NeXT, another company named Pixar, and fell in love with an amazing woman who would become my wife. "
    "Pixar went on to create the world's first computer animated feature film, Toy Story, and is now the most successful animation studio in the world. In a remarkable turn of events, Apple bought NeXT, I returned to Apple, and the technology we developed at NeXT is at the heart of Apple's current renaissance. And Laurene and I have a wonderful family together. I'm pretty sure none of this would have happened if I hadn't been fired from Apple. It was awful tasting medicine, but I guess the patient needed it. Sometimes life hits you in the head with a brick. Don't lose faith. I'm convinced that the only thing that kept me going was that I loved what I did. You've got to find what you love. And that is as true for your work as it is for your lovers. Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it. And, like any great relationship, it just gets better and better as the years roll on. So keep looking until you find it. Don't settle."
)

#### Tokenization

Tokenization is the process of splitting a text string into tokens (i.e., smaller chunks).

Examples of tokenization include:

1. Obtaining all words in a sentence
2. Obtaining all sentences in a document
3. Obtaining regex patterns in a document
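
To build intuition before turning to a library, the three kinds of tokens above can be roughly approximated with Python's built-in **re** module. This is only a sketch: the sample string and the patterns are illustrative assumptions, and dedicated tokenizers handle many edge cases (abbreviations, contractions, and so on) that these patterns do not.

```python
import re

# Short sample string (illustrative)
sample = "We worked hard. In 10 years Apple had grown. And then I got fired!"

# 1. Words: runs of letters, digits, or apostrophes
sample_words = re.findall(r"[A-Za-z0-9']+", sample)

# 2. Sentences: split on whitespace that follows ., ?, or !
sample_sentences = re.split(r"(?<=[.?!])\s+", sample)

# 3. Regex pattern: here, every run of digits
sample_numbers = re.findall(r"\d+", sample)

print(sample_words)
print(sample_sentences)
print(sample_numbers)
```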

We tokenize text for several reasons. The most important is that we need a way to summarize the information in our text. By splitting the text into smaller chunks, we can create summary measures that capture the content of the text (e.g., tone).
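
As a concrete illustration of a summary measure, the sketch below computes a simple tone score: the number of positive words minus the number of negative words, divided by the total number of tokens. The word lists, the function name `tone_score`, and the sample sentence are all hypothetical and chosen only for illustration; applied work would typically rely on an established sentiment dictionary.

```python
import re

# Hypothetical word lists, for illustration only
POSITIVE = {"lucky", "loved", "best", "wonderful", "great"}
NEGATIVE = {"fired", "failure", "awful", "devastating"}

def tone_score(text):
    """Return (positive - negative) / total tokens, or 0.0 for empty text."""
    # Rough word tokens: lower-cased runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(tone_score("I was lucky. I loved my work, but then I got fired."))
```

A score above zero suggests more positive than negative words under these particular lists; different lists would give different scores.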

We will use the **nltk** (natural language toolkit) module to perform tokenization. 

The **nltk** module has several built-in tokenizers:

1. **word_tokenize**: tokenize a document into words
2. **sent_tokenize**: tokenize a document into sentences
3. **regexp_tokenize**: tokenize a string or document based on a regular expression

We can import these tokenizers as follows:

In [None]:
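# Uncomment on first use to download the tokenizer data
# (recent nltk releases may ask for 'punkt_tab' instead of 'punkt')
#import nltk
#nltk.download('punkt')
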
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import regexp_tokenize

Now, let's play with these tokenizers to see what they do.

#### Word Tokens

In [None]:
words = word_tokenize(text)
words

#### Sentence Tokens

In [None]:
sentences = sent_tokenize(text)
sentences

#### Regex Tokens

This code obtains all words that start with a capital letter and are at least two letters long (so the one-letter word "I" is skipped).

In [None]:
cap_words = regexp_tokenize(text, '[A-Z][a-zA-Z]+')
cap_words

***Note:*** With its default settings, **regexp_tokenize** behaves much like the **re.findall** function, so the following code returns the same tokens here.

In [None]:
cap_words = re.findall('[A-Z][a-zA-Z]+', text)
cap_words

#### Counting Tokens

We can use the **Counter** class to count how many times each token appears in our text. We import **Counter** from the **collections** module as follows:

In [None]:
from collections import Counter

We can then obtain counts for each token using the following code:

In [None]:
tokens = word_tokenize(text)
counts = Counter(tokens)
print(counts)

We can use the **most_common** method to obtain the tokens with the highest counts, in descending order of count. The code below prints the 10 most common tokens.

In [None]:
print(counts.most_common(10))
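
The short sketch below repeats the same **Counter** calls on a hand-made token list (the list is an illustrative assumption), which makes the results easy to verify by eye. It also shows that a **Counter** returns 0 for a token it has never seen rather than raising an error.

```python
from collections import Counter

sample_tokens = ["the", "garage", "the", "apple", "garage", "the"]
sample_counts = Counter(sample_tokens)

# Two most frequent tokens as (token, count) pairs, highest count first
print(sample_counts.most_common(2))   # [('the', 3), ('garage', 2)]

# Look up a single token; missing tokens count as 0
print(sample_counts["apple"])         # 1
print(sample_counts["pixar"])         # 0

# Total number of tokens and number of distinct tokens
print(sum(sample_counts.values()), len(sample_counts))  # 6 3
```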

#### Text Preprocessing

The example above shows that the most common tokens tend to be punctuation marks and words like "the" and "of", even though these tokens say little about the theme of the underlying text. We can perform several text preprocessing steps to make our tokens more meaningful. Here are a few common ones:

1. Convert all text to lower case to avoid separately counting "Event" and "event", for example.
2. Shorten words to their root stems to avoid separately counting "computer" and "computers", for example. 
3. Remove stop words (e.g., "and", "the", "an", etc.)
4. Remove punctuation (e.g., ".", "?", "-", etc.)

The **nltk** module has a built-in list of stop words. We can access this list using the following code:

In [None]:
#import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')
stopwords_list

The **nltk** module also provides a class called **PorterStemmer** whose **stem** method reduces a word to its root stem.

In [None]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

words = ['caring', 'helpful', 'kindness', 'jumps']
for word in words:
    stemmed_word = porter.stem(word)
    print(stemmed_word)

Now, let's apply these filters to our text and then check the most common tokens.

In [None]:
# Obtain word tokens
tokens = word_tokenize(text)

# Convert tokens to lower case
tokens = [t.lower() for t in tokens]

# Keep tokens that are alphabetic (i.e., remove numbers and punctuation)
tokens = [t for t in tokens if t.isalpha()]

# Remove stop words (build the set once so each lookup is fast)
stop_set = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_set]

# Stem all tokens
tokens = [porter.stem(t) for t in tokens]

# Obtain token counts
counts = Counter(tokens)

In [None]:
print(counts.most_common(5))
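
Because the same steps are needed again in the exercise, it can be convenient to wrap them in a function. The sketch below is one possible arrangement, not part of **nltk**: the name `preprocess_tokens` and its parameters are hypothetical. The stop word collection and the stemming function are passed in as arguments, so in this notebook one would call it with `stopwords.words('english')` and `porter.stem`; the toy inputs here only keep the example self-contained.

```python
from collections import Counter

def preprocess_tokens(tokens, stop_words, stem=None):
    """Lower-case, keep alphabetic tokens, drop stop words, optionally stem."""
    stop_set = set(stop_words)  # convert once; set lookups are fast
    cleaned = []
    for t in tokens:
        t = t.lower()
        # Skip numbers, punctuation, and stop words
        if not t.isalpha() or t in stop_set:
            continue
        # Apply the stemming function only if one was supplied
        cleaned.append(stem(t) if stem else t)
    return cleaned

# Toy inputs (assumptions for illustration)
toy_tokens = ["The", "garage", ",", "the", "Garage", "and", "20", "years", "."]
toy_stop_words = ["the", "and"]

toy_counts = Counter(preprocess_tokens(toy_tokens, toy_stop_words))
print(toy_counts.most_common())
```

Converting the stop words to a set once avoids re-scanning the full list for every token.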

#### Convert Counted Tokens Dictionary into a Pandas DataFrame

To make our **counts** easier to work with, we can convert them into a pandas DataFrame using the **pd.DataFrame.from_dict** function. We use **from_dict** because a **Counter** is a kind of dictionary in which the tokens are the keys and the counts are the values. The **orient='index'** option creates the DataFrame with one row per dictionary key; **reset_index** then moves the tokens into an ordinary column.

In [None]:
df = pd.DataFrame.from_dict(counts, orient='index').reset_index()
df = df.rename(columns={"index": "token", 0: "count"})
df = df.sort_values(by="count", ascending=False)
df.head(20)
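
An alternative sketch, under the assumption that only the sorted table is needed: **most_common** called with no argument returns every (token, count) pair already sorted by count, so the pairs can be passed straight to the **pd.DataFrame** constructor. The toy counts and the name `df_alt` are illustrative.

```python
from collections import Counter
import pandas as pd

# Toy token counts (illustrative)
toy_counts = Counter(["appl", "love", "appl", "fire", "love", "appl"])

# most_common() with no argument returns all pairs, highest count first
df_alt = pd.DataFrame(toy_counts.most_common(), columns=["token", "count"])
print(df_alt)
```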

#### Exercise

1. Use the **requests.get** function to obtain the html source code of the April 30, 2020 earnings announcement 8-K for Apple available at the following link: https://www.sec.gov/Archives/edgar/data/320193/000032019320000050/a8-kexhibit991q2202032.htm
2. Import the **html_to_text** function from the **MyFunctions** module. The **html_to_text** function takes a variable containing html source code in string format as an input, strips out all html code from this variable, and returns a string variable containing only the text of the html source code. Use the **html_to_text** function to convert the earnings announcement 8-K html data into a text format.
3. Obtain a list of all tokens in the 8-K using the **word_tokenize** function.
4. Perform the following text-preprocessing steps:
    1. Convert all tokens to lower case using the **lower** method.
    2. Remove tokens that appear in the **stopwords** English stop word list.
    3. Remove tokens that are punctuation.
    4. Shorten all tokens to their root stems using the **PorterStemmer** class.
5. Obtain all remaining token counts and place them in a new **pandas** DataFrame. Sort the DataFrame in descending order by count and print out the first 10 rows. What are the most common words used in Apple's 8-K?

#### Solution for # 1

#### Solution for # 2

#### Solution for # 3

#### Solution for # 4

#### Solution for # 5