# Tutorial: Text Preprocessing and Dictionary-Based Analysis

Preprocessing is the starting point of most text analyses. Many text analysis workflows include a built-in preprocessing step that cleans and prepares the text data for analysis. However, it is still a good idea to understand what is going on under the hood. In this tutorial, we will walk through several text preprocessing steps and then use a dictionary-based approach to analyze the text data.

## Step 1. Text Preprocessing Step-By-Step

### Sentence Segmentation

The first step in text preprocessing is to segment the text into sentences. This is important because many text analysis techniques operate at the sentence level. In Python, we can use the `nltk` library to segment text into sentences. NLTK is a powerful library for natural language processing that provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet. However, it is very low-level and requires a lot of code to perform simple tasks compared to some other packages we will use.

In [None]:
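import nltk

# sent_tokenize needs NLTK's Punkt data; if it is missing, run nltk.download("punkt") once
# (newer NLTK releases may ask for "punkt_tab" instead)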

text = """In this retrospective article, we outline the rationale for starting Strategic Entrepreneurship Journal. We provide evidence on the percentage of published papers in SEJ in each of 10 key themes in strategic entrepreneurship identified when the journal was founded. Evidence on progress toward goal achievement in terms of trends in submissions, desk reject and acceptances rates, and downloads, plus examples of highly cited papers and entry into key indicators such as the Financial Times list of 50 journals. We outline developments in strategic entrepreneurship and their implications for future research, notably the need to consider multiple levels of analysis and the role of context variety. Finally, we discuss some of the lessons we learned from SEJ in terms of general challenges that arise in starting a new journal. Copyright © 2017 Strategic Management Society."""
sentences = nltk.sent_tokenize(text)
for idx, sent in enumerate(sentences, 1):
    print(f"Sentence {idx}: {sent}")
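
To get a feel for what a sentence segmenter does under the hood, here is a deliberately naive sketch that splits after sentence-ending punctuation when it is followed by whitespace and a capital letter. This is not how `nltk.sent_tokenize` works (NLTK relies on a trained Punkt model), and the function name `naive_sent_split` is made up for illustration.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]

sample = "We outline the rationale. We provide evidence! Does it work? Yes."
for idx, sent in enumerate(naive_sent_split(sample), 1):
    print(f"Sentence {idx}: {sent}")

# The simple rule is fooled by abbreviations, which is why trained models are preferred
print(naive_sent_split("Dr. Smith arrived."))
# prints ['Dr.', 'Smith arrived.']
```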

### Tokenization

The next step in text preprocessing is to tokenize the text. Tokenization is the process of splitting the text into individual tokens. In Python, we can use the `nltk` library to tokenize text as well.

In [None]:
from pprint import pprint

tokens = []
for idx, sent in enumerate(sentences, 1):
    tokens.append(nltk.word_tokenize(sent))
    print(f"Sentence {idx}: ", end="")
    pprint(tokens[idx-1], compact=True, indent=4)
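
As a rough picture of what tokenization involves, the sketch below uses a single regular expression to pull out runs of word characters and individual punctuation marks. It is only an approximation under that simple rule: `nltk.word_tokenize` handles contractions, possessives, and quotation marks more carefully, and `naive_word_tokenize` is a hypothetical helper name.

```python
import re

def naive_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(naive_word_tokenize("Finally, we discuss SEJ's lessons."))
# prints ['Finally', ',', 'we', 'discuss', 'SEJ', "'", 's', 'lessons', '.']
```

Notice that the possessive "SEJ's" is broken into three pieces here, which is one reason to prefer a purpose-built tokenizer.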

### Lowercasing

A common preprocessing step is to convert all the text to lowercase. NLTK isn't needed for this step; we can use the `lower()` method of Python's built-in string type.

In [None]:
lower_tokens = []
for idx, sent in enumerate(tokens, 1):
    lower_tokens.append([word.lower() for word in sent])

    print(f"Sentence {idx}: ", end="")
    pprint(lower_tokens[idx-1], compact=True, indent=4)

### Removing Punctuation

Another common preprocessing step is to remove punctuation from the text. As with lowercasing, NLTK isn't needed for this step; we can use the string of common punctuation characters built into Python (`string.punctuation`).

In [None]:
import string

no_punctuation = []
for idx, sent in enumerate(lower_tokens, 1):
    no_punctuation.append([token for token in sent if token not in string.punctuation])

    print(f"Sentence {idx}: ", end="")
    pprint(no_punctuation[idx-1], compact=True, indent=4)
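
One subtlety worth knowing: `string.punctuation` is a plain string, so `token not in string.punctuation` is a substring test. It behaves as expected for single-character tokens, but multi-character punctuation tokens such as `...` or `--` are not substrings and would be kept. A minimal sketch of a stricter check, using a hypothetical helper `is_punctuation`, might look like this:

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

tokens = ["evidence", ",", "...", "co-opt", "--", "."]

# The substring test misses the multi-character tokens
print([tok for tok in tokens if tok not in string.punctuation])
# prints ['evidence', '...', 'co-opt', '--']

# The character-by-character test removes them but leaves hyphenated words alone
print([tok for tok in tokens if not is_punctuation(tok)])
# prints ['evidence', 'co-opt']
```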

### Removing Stopwords

Stopwords are common words that are often removed from text data because they do not carry much information. NLTK provides a list of stopwords for many languages that we can use to remove stopwords from text data.

Let's first take a look at these words:

In [None]:
# If the stopword lists are missing, run nltk.download("stopwords") once
list_of_stopwords = nltk.corpus.stopwords.words('english')
print(list_of_stopwords)

Now let's remove them:

In [None]:
# Load the stopwords once into a set instead of re-reading the list for every word
stopword_set = set(nltk.corpus.stopwords.words('english'))
no_stopwords = []

for idx, sent in enumerate(no_punctuation, 1):
    no_stopwords.append([word for word in sent if word not in stopword_set])

    print(f"Sentence {idx}: ", end="")
    pprint(no_stopwords[idx-1], compact=True, indent=4)

### Removing Numbers

Numbers are often removed from text data or replaced with their textual representation. NLTK isn't needed for this step; we can use the `isnumeric()` method of Python's built-in string type.

In [None]:
no_numbers = []

for idx, sent in enumerate(no_stopwords, 1):
    no_numbers.append([word for word in sent if not word.isnumeric()])

    print(f"Sentence {idx}: ", end="")
    pprint(no_numbers[idx-1], compact=True, indent=4)
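
It helps to know what `isnumeric()` does and does not catch. It returns `True` only when every character in the token is numeric, so tokens containing a decimal point or a thousands separator slip through. The sketch below shows this on a few hand-picked tokens; `looks_numeric` is a hypothetical helper, and whether to treat "3.5" or "1,000" as numbers is a judgment call for your own data.

```python
samples = ["2017", "10", "3.5", "1,000", "50th", "ten"]

for tok in samples:
    print(f"{tok!r}: isnumeric={tok.isnumeric()}")

def looks_numeric(token):
    """Hypothetical helper: also treat decimals and thousands separators as numbers."""
    stripped = token.replace(",", "").replace(".", "", 1)
    return stripped.isdigit()

print([tok for tok in samples if not tok.isnumeric()])
# prints ['3.5', '1,000', '50th', 'ten']
print([tok for tok in samples if not looks_numeric(tok)])
# prints ['50th', 'ten']
```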

### Removing Special Characters/Symbols

This step can be a bit of a challenge because it is not always clear which special characters or symbols should be removed, or whether symbols inside words (e.g., the hyphen in "co-opt") should be treated differently from stand-alone symbols. Here, we will remove every single-character token that is not a letter (e.g., "©") and leave symbols that sit inside a longer token untouched.

In [None]:
no_symbols = []

for idx, sent in enumerate(no_numbers, 1):
    no_symbols.append([token for token in sent if len(token)>1 or token.isalpha()])

    print(f"Sentence {idx}: ", end="")
    pprint(no_symbols[idx-1], compact=True, indent=4)
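
To make the rule concrete, here is the same condition applied to a few hand-picked tokens. The helper name `keep_symbol_filtered` is made up for this sketch, and the token list is illustrative rather than taken from the abstract.

```python
def keep_symbol_filtered(token):
    """Keep multi-character tokens and single alphabetic characters."""
    return len(token) > 1 or token.isalpha()

tokens = ["copyright", "©", "co-opt", "a", "&", "r&d"]
print([tok for tok in tokens if keep_symbol_filtered(tok)])
# prints ['copyright', 'co-opt', 'a', 'r&d']
```

Stand-alone symbols are dropped, while symbols inside longer tokens survive. A token made of several symbols in a row would also survive, so a stricter rule may be needed for noisier text.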

### Stemming/Lemmatization

We won't always use stemming/lemmatization in text analysis. However, when we do, NLTK provides several tools to help. We'll start with stemming, which tries to shorten the word to its stem, regardless of whether the stem is itself a word. The Porter Stemmer is a commonly used stemmer and is what we'll apply here.

In [None]:
from nltk.stem import PorterStemmer

stems = []
stemmer = PorterStemmer()

for idx, sent in enumerate(no_symbols, 1):
    stems.append([stemmer.stem(token) for token in sent])

    print(f"Sentence {idx}: ", end="")
    pprint(stems[idx-1], compact=True, indent=4)

Now, let's try lemmatization with the WordNet Lemmatizer from NLTK:

In [None]:
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet data; if it is missing, run nltk.download("wordnet") once

lemmatizer = WordNetLemmatizer()
lemmas = []

for idx, sent in enumerate(no_symbols, 1):
    lemmas.append([lemmatizer.lemmatize(token) for token in sent])

    print(f"Sentence {idx}: ", end="")
    pprint(lemmas[idx-1], compact=True, indent=4)
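
The difference between the two approaches can be illustrated with a toy example. The sketch below is not the Porter algorithm or the WordNet lemmatizer: the suffix list and the lookup table are invented for illustration. It only shows the general idea that a stemmer chops suffixes by rule, while a lemmatizer maps words to known base forms.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.
SUFFIXES = ["ies", "ing", "es", "ed", "s"]  # hypothetical list, checked in this order

def toy_stem(word):
    """Chop the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A lemmatizer instead looks words up in a vocabulary of known base forms.
TOY_LEMMAS = {"studies": "study", "papers": "paper", "learned": "learn"}

def toy_lemmatize(word):
    """Return the base form if the word is in the toy vocabulary, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "papers", "learned", "analysis"]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
```

The rule-based stems for "studies" and "analysis" come out as "stud" and "analysi", which are not words, while the lookup returns "study" and leaves "analysis" intact.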

There are, of course, other preprocessing steps, such as emoji replacement, but the ones presented here are the most commonly used.

## Step 2. Preprocessing Using a Pretrained Model

Several Python packages ship sophisticated natural language processing models that take care of most of these steps for you and often do a better job than the tools above. `spaCy` and `Stanza` are two that I commonly use. Let's try using spaCy to accomplish most of the above in fewer steps.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

spacy_results = []
doc = nlp(text)

for sent in doc.sents:
    filtered_tokens = [
        token.lemma_
        for token in sent
        if not (token.is_punct or token.is_stop or token.is_digit or token.is_space)
    ]
    spacy_results.append(filtered_tokens)

pprint(spacy_results, compact=True, indent=4)

A few things to notice here:

- Proper nouns are not lowercased - we can force this if we want by adding `.lower()` to `token.lemma_` in the list comprehension.
- The © symbol is still there - this is because spaCy identified it as a stand-in for the word "copyright" and so classified it as a noun. We could remove it by adding a check to the list comprehension that keeps only alphabetic tokens, as we did before.
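
Both adjustments can be sketched in a small helper. `lemma_`, `is_alpha`, and `is_stop` are attributes of spaCy tokens; the function name `clean_lemmas` and the `fake_token` stand-ins are assumptions made so that the sketch runs without a spaCy model installed. Checking `is_alpha` also drops punctuation, digits, and whitespace, so those separate checks are no longer needed here.

```python
from types import SimpleNamespace

def clean_lemmas(tokens):
    """Lowercased lemmas of tokens that are alphabetic and not stopwords."""
    return [
        token.lemma_.lower()
        for token in tokens
        if token.is_alpha and not token.is_stop
    ]

# Stand-in objects that mimic the spaCy token attributes used above
def fake_token(lemma, is_alpha=True, is_stop=False):
    return SimpleNamespace(lemma_=lemma, is_alpha=is_alpha, is_stop=is_stop)

fake_sent = [
    fake_token("Copyright"),
    fake_token("©", is_alpha=False),
    fake_token("2017", is_alpha=False),
    fake_token("Strategic"),
    fake_token("Management"),
    fake_token("Society"),
    fake_token(".", is_alpha=False),
]
print(clean_lemmas(fake_sent))
# prints ['copyright', 'strategic', 'management', 'society']
```

With a real spaCy document, the same helper could be applied as `[clean_lemmas(sent) for sent in doc.sents]`.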

## Step 3. Dictionary-Based Analysis

Dictionary-based computer-aided text analysis is fundamentally a word-counting technique. It involves counting the number of times words from a predefined dictionary appear in a text. This technique is often used to analyze the sentiment of a text, but it can also be used to analyze other aspects of text data.

Let's start by getting a sense of how often each word appears in our text. We can use the `Counter` class from the `collections` module to count them.

In [None]:
from collections import Counter

word_freq = Counter()
for sent in no_symbols:
    word_freq.update(sent)
pprint(word_freq, compact=True)

A few words occur very often, and frequency then drops off steeply. This is a common pattern in text data and is known as Zipf's Law, which states that the frequency of a word is inversely proportional to its rank in the frequency table. In other words, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. The law doesn't hold exactly in our data because the sample is so small, but the general shape is typical of text data.

Let's view the word frequencies in a more visually appealing format using a line chart.
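
To see what the law predicts, the sketch below compares observed counts with the Zipf estimate, computed as the top word's frequency divided by the rank. The helper `zipf_table` and the toy token list are assumptions made for illustration; the counts are invented, not taken from the abstract.

```python
from collections import Counter

def zipf_table(tokens, top_n=5):
    """Return (rank, word, observed, expected) rows, with expected = f(1) / rank."""
    ranked = Counter(tokens).most_common(top_n)
    if not ranked:
        return []
    top_freq = ranked[0][1]
    return [
        (rank, word, freq, top_freq / rank)
        for rank, (word, freq) in enumerate(ranked, 1)
    ]

toy_tokens = ["journal"] * 6 + ["strategic"] * 3 + ["evidence"] * 2 + ["paper"]
for rank, word, observed, expected in zipf_table(toy_tokens):
    print(f"{rank}. {word}: observed={observed}, Zipf estimate={expected:.1f}")
```

Swapping `toy_tokens` for the flattened tokens from the preprocessing steps would show how far a single abstract departs from the idealized curve.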

In [None]:
import matplotlib.pyplot as plt

sorted_word_freq = word_freq.most_common()
words, frequencies = zip(*sorted_word_freq)
plt.figure(figsize=(10, 5))
plt.plot(words, frequencies, marker='o')
plt.xticks(rotation=90)
plt.xlabel('Words')
plt.ylabel('Frequencies')
plt.title('Word Frequencies')
plt.tight_layout()
plt.grid(True)
plt.show()

Now let's use these frequencies to find the frequencies of a 'dictionary' of words that we are interested in. For this example, you will choose what words you want to include in your dictionary.

In [None]:
# Your dictionary of words here:

my_dict = ['strategic', 'entrepreneurship', 'journal', 'evidence', 'progress']
running_total = 0
for word in my_dict:
    print(f"{word}: {word_freq[word]}")
    running_total += word_freq[word]

print(f"\nTotal: {running_total}")

Those are the counts of the words in the dictionary, but raw counts are skewed by text length: a longer text will contain more dictionary words simply because it contains more words. So we often normalize these counts by the total number of words in the text.

In [None]:
total_words = 0
for sent in no_symbols:
    total_words += len(sent)

print(f"\nTotal words: {total_words}")
print(f"Normalized frequency: {running_total/total_words}")

Here we have the normalized counts of the words in the dictionary, using the total number of words in the text after stopword removal as the denominator. This is one viable approach, but it is also common to normalize by the total number of words with stopwords included. The two approaches give somewhat different values, but the interpretation is similar.
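
The two denominators can be compared side by side. The sketch below uses a hypothetical helper `dictionary_score`, a ten-word toy sentence, and a three-word toy stopword set; it assumes the dictionary itself contains no stopwords.

```python
def dictionary_score(tokens, dictionary, stopwords=frozenset()):
    """Return (raw count, share of non-stopword tokens, share of all tokens)."""
    dictionary = set(dictionary)
    raw = sum(1 for tok in tokens if tok in dictionary)
    content = [tok for tok in tokens if tok not in stopwords]
    per_content = raw / len(content) if content else 0.0
    per_all = raw / len(tokens) if tokens else 0.0
    return raw, per_content, per_all

toy_tokens = "we outline the rationale for starting the strategic entrepreneurship journal".split()
toy_stopwords = {"we", "the", "for"}
toy_dict = ["strategic", "entrepreneurship", "journal"]

raw, per_content, per_all = dictionary_score(toy_tokens, toy_dict, toy_stopwords)
print(f"Raw count: {raw}")
print(f"Normalized (stopwords removed): {per_content:.2f}")
print(f"Normalized (stopwords included): {per_all:.2f}")
```

In this toy case the score is 0.50 with stopwords removed and 0.30 with them included, so whichever denominator you choose should be applied consistently across all texts being compared.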
