If you recall from the model preparation module of this bootcamp, a typical data science workflow/pipeline looks like the diagram below:

![Data Science Workflow](assets/eda.png)

When working with text, we need to adapt the above process to the peculiarities of text data. Look at the modified diagram below:

![Data Preprocessing](assets/data_preprocessing.png)

We see that the three steps in the exploratory data analysis process stay in place. However, we modified the feature engineering step slightly. Let's go through the data science workflow/pipeline for text data step by step:

1. We need to clean our data to remove the problematic parts. In this checkpoint, we walk you through the basic steps of data cleaning for textual data.

2. After we clean up our text, we need to do some data exploration to get to know our data better. In this module, we won't dedicate a separate checkpoint to this step. This is mainly because the data exploration checkpoints of the model preparation module cover many techniques that we can also apply to text data.

3. The last step in data preprocessing is feature engineering, where we prepare the features that we'll use in the modeling phase. In this respect, text data is special and requires different methods for creating features. This step involves converting text data into numerical form, which is also known as **language modeling**. We dedicate the next three checkpoints to this step.

4. Note that the steps above are iterative, and we may need to go back to an earlier step when necessary.

5. Once we've prepared our numerical features, we can jump into the modeling phase and apply machine learning techniques to different supervised and unsupervised tasks.

In this checkpoint, we'll talk about some data cleaning methods for text data. Our focus will be on methods that are specific to text data. Hence, we skip the common methods that are also relevant to numerical data, as those techniques are covered in the model preparation module. With this checkpoint, we also start introducing you to the tools we'll be using throughout the module. Specifically, we'll make use of two NLP packages here, **NLTK** and **spaCy**:

* As we touched upon previously, NLTK is a seasoned package with rich functionality. It is highly customizable and contains many models that are useful for learning NLP but are no longer state of the art and are not optimal for production code.

* spaCy is almost the direct opposite. It doesn't offer many models; instead, it uses algorithms and methods that are considered state of the art at the moment. Note that offering only a limited number of algorithms and models creates a trade-off between flexibility and ease of use. spaCy chooses the latter, which makes it much leaner to use. Furthermore, the library itself is written in Cython (a superset of Python that compiles to C), and hence it's considerably faster. Indeed, it's among the fastest NLP libraries available.

In the rest of this module, we'll use the functionalities of NLTK and SpaCy interchangeably. Note that some functionalities are available in both of the packages but most of the time we'll use a functionality from a single package.

# The need for text preprocessing

Most of the time, we'll be given a raw dataset consisting of many documents. But in order to work on them, we need to **organize the material in a convenient way** and **clean the noisy and problematic parts** that can harm our analysis, or at least increase the computation time without any informational gain.

## Preparing the dataset

By organizing the material, we mean preparing a dataset in the form we're used to working with. Specifically, we want a tabular dataset where rows represent the observations and columns represent the features. For example, if our dataset consists of articles written by several authors and our goal is to distinguish authors by their writing styles, then we want each row to hold one article and its author, stored in two columns. However, this is just one of many possible use cases. Let's give some other examples:

* If our dataset comprises books that consist of chapters, and each chapter may be written by a different author, then we may need to represent chapters as rows.

* If our aim is to summarize the articles, then we may need to derive the paragraphs from the articles and make each paragraph a row.

* If we want to translate a language into another one, then we may need to prepare a dataset where two sentences from the two languages form a row.

So, how we build a dataset from raw material depends on the goal we want to achieve with that dataset.
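
To make the idea of "one row per unit of interest" concrete, here is a minimal sketch that turns a raw string into one row per chapter. The sample text, the `chapters_to_rows` helper, and the author name are all hypothetical and only serve the illustration; a real corpus will need its own splitting rule.

```python
import re

# Hypothetical raw material: one string holding a short two-chapter "book"
raw_book = """CHAPTER 1
It was a bright day. The birds sang.

CHAPTER 2
Rain fell all night."""

def chapters_to_rows(raw_text, author):
    """Split raw text on chapter headings and return one row (dict) per chapter."""
    # re.split with a capturing group keeps the chapter numbers in the result
    parts = re.split(r"CHAPTER (\d+)\n", raw_text)
    rows = []
    # parts[0] is whatever precedes the first heading; (number, body) pairs follow
    for number, body in zip(parts[1::2], parts[2::2]):
        rows.append({"author": author, "chapter": int(number), "text": body.strip()})
    return rows

rows = chapters_to_rows(raw_book, author="Jane Doe")
for row in rows:
    print(row)
```

Each dictionary plays the role of a row, and its keys play the role of columns, so a list like this can be handed to a tabular library later on. If the goal were summarization instead, the same pattern could split on blank lines to produce one row per paragraph.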

In the following three checkpoints, we'll prepare our dataset in the manner highlighted above. For now, we'll clean our data.

## Cleaning the dataset

Before we can put a dataset into the proper form, we first need to clean it. When it comes to text data, **cleaning** covers a lot of things, so you shouldn't take it too narrowly. Some of the tasks involved in cleaning text data are the following:

* Correcting the typos and the misspelled words.

* Dealing with the abbreviations.

* Making all characters lowercase (or uppercase).

* Removing emojis, if any exist.

* Removing the stopwords.

* Normalizing the words (aka lemmatization or stemming).

One can add several other items to the list. In this checkpoint, our focus will be on removing the stopwords and normalizing the text. However, here's a [good Kaggle kernel](https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing) if you'd like some further reading on the other steps.
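
Since we won't cover the other items in detail, here is a rough sketch of three of them (lowercasing, emoji removal, and abbreviation handling) using only the standard library. The abbreviation map and the emoji character ranges are assumptions made for this illustration; neither is complete.

```python
import re

# Hypothetical abbreviation map; a real project would need a much larger one
ABBREVIATIONS = {"btw": "by the way", "idk": "i do not know"}

# Rough emoji pattern covering a few common Unicode blocks (an assumption,
# not an exhaustive definition of what counts as an emoji)
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def basic_clean(text):
    """Lowercase the text, drop emojis, and expand known abbreviations."""
    text = text.lower()
    text = EMOJI_PATTERN.sub("", text)
    # split() also collapses the extra space left behind by a removed emoji
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)

# \U0001F600 is a grinning-face emoji
cleaned = basic_clean("BTW I loved this book \U0001F600")
print(cleaned)
```

The order of the steps matters here: lowercasing first means the abbreviation map only needs lowercase keys.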

# Text preprocessing in action

Now that we've reviewed why we need to preprocess text data, we can begin to work on a dataset using NLTK and spaCy. First, if you haven't installed NLTK and spaCy yet, you'll want to install the packages with:

```bash 
pip install nltk
```

and 

```bash 
pip install spacy
```

You also need to download an English model for spaCy from your terminal (or command prompt). Since spaCy 3.0, the old `en` shortcut is deprecated, so use the full name of the small English model:

```bash
python -m spacy download en_core_web_sm
```

Or, more conveniently, you can do the same inside a Jupyter notebook by running the following in a cell:

```bash
!python -m spacy download en_core_web_sm
```

As usual, we begin with importing the libraries we'll use in this checkpoint:

In [None]:
from collections import Counter
import nltk
import spacy
import re

In the following, we download the data we'll be using from NLTK's repository. Specifically, we'll use NLTK's **Gutenberg corpus**, which is a small sample of texts from [Project Gutenberg](https://www.gutenberg.org/). The project itself is a large collection of free e-books, most of them written in English.

Notice that installing NLTK isn't sufficient for these corpora to be available for our use. We need to download them by calling `nltk.download()` with the corpus name as its argument. In the next cell, we download this corpus.

**Note**: There's also an interactive way of downloading data from NLTK repository. In this case, you need to run the following: 
```python
nltk.download()
```
and it will launch an [interactive installer](http://www.nltk.org/data.html#interactive-installer). Using the installer, you can choose the "corpora" tab and download the Gutenberg corpus.

In [None]:
# Download the Gutenberg corpus directly by name (no interactive installer)
nltk.download("gutenberg")

# Download the small English model of spaCy
!python -m spacy download en_core_web_sm

Now that we've acquired some data, let's dig in to look at it and get ready to clean things up. We're going to work specifically with two novels from the Gutenberg corpus: **Alice's Adventures in Wonderland** by Lewis Carroll, and **Persuasion** by Jane Austen.

In [None]:
# import the data we just downloaded
from nltk.corpus import gutenberg

# grab and process the raw data
print(gutenberg.fileids())

persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# print the first 100 characters of Alice
print('\nRaw:\n', alice[0:100])

## Basic text cleaning

We first do some basic data cleaning in our data. Note that this kind of cleaning depends very much on the corpus we're working on. This is because all texts have their own peculiarities to be dealt with. As you'll see shortly, we'll remove some strings that are specific to the texts we're using.

When modifying text data, using **regular expressions** is a common practice. We're also going to use *regular expressions* (specifically [re.sub()](https://docs.python.org/3/library/re.html#re.sub), short for "substitute") to identify and remove substrings we don't want. Specifically, we'll match those substrings with a regular expression and substitute in an empty string for them.

We won't go into detail here about how regular expressions work, but you should be able to get a good sense for what's happening by reading the code. If you want more information the [Python Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) is an accessible starting point and reference, and [RegExr](http://regexr.com/) is a useful tool for visualizing and tinkering with regular expressions.

We'll start our cleaning by removing the title. We'll match all text between square brackets and replace it with an empty string:

In [None]:
# this pattern matches all text between square brackets
# (a raw string avoids Python's invalid-escape-sequence warning for "\[")
pattern = r"\[.*?\]"
persuasion = re.sub(pattern, "", persuasion)
alice = re.sub(pattern, "", alice)

# print the first 100 characters of Alice again
print("Title removed:", alice[0:100])
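
The `?` after `.*` in the pattern above makes the match non-greedy, which matters as soon as a text contains more than one bracketed span. The following sketch contrasts the two behaviors on a made-up string (not taken from the corpus):

```python
import re

# Toy string with two bracketed spans (hypothetical, not from the corpus)
sample = "[Title by Author 1865] Some text [an editor's note] more text"

# Non-greedy: each bracketed span is matched and removed separately
lazy = re.sub(r"\[.*?\]", "", sample)

# Greedy: a single match runs from the first "[" to the last "]"
greedy = re.sub(r"\[.*\]", "", sample)

print(repr(lazy))    # ' Some text  more text'
print(repr(greedy))  # ' more text'
```

The greedy version silently swallows the text between the two spans, which is exactly the kind of mistake that is hard to notice in a long novel.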

Next, we'll remove the chapter headings like `CHAPTER I`. Note that the two novels have different styles of chapter headings, so we deal with each novel separately. This is quite usual in cleaning text data. As we said before, **all texts have their own peculiarities and cleaning them requires you to know those peculiarities**.

In [None]:
# now we'll match and remove chapter headings
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)

# ok, what's it look like now?
print('Chapter headings removed:', alice[0:100])
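
To see what each of the two patterns does, and why one pattern can't serve both novels, here is a small sketch on toy excerpts. The excerpts only imitate the two heading styles; the body lines are placeholders.

```python
import re

# Toy excerpts imitating the two heading styles (placeholder body text)
persuasion_like = "Chapter 1\nFirst chapter text.\nChapter 2\nSecond chapter text."
alice_like = ("CHAPTER I. Down the Rabbit-Hole\nFirst chapter text.\n"
              "CHAPTER II. The Pool of Tears\nSecond chapter text.")

# "Chapter" plus digits: only the heading itself is removed
persuasion_clean = re.sub(r"Chapter \d+", "", persuasion_like)

# "CHAPTER" plus the rest of the line: "." stops at the newline, so the
# chapter title is removed too, while the following line is left alone
alice_clean = re.sub(r"CHAPTER .*", "", alice_like)

# The Persuasion pattern is case-sensitive and expects digits, so it
# leaves the Alice-style headings untouched
alice_wrong_pattern = re.sub(r"Chapter \d+", "", alice_like)

print(repr(persuasion_clean))
print(repr(alice_clean))
print(alice_wrong_pattern == alice_like)
```

Note that both substitutions leave empty lines behind where the headings used to be; we still need to deal with that leftover whitespace.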

If you were to read the two novels, you'd notice that there are a lot of newline characters and other kinds of extra whitespace. So, we need to clean them up:

In [None]:
# remove newlines and other extra whitespace by splitting and rejoining
persuasion = ' '.join(persuasion.split())
alice = ' '.join(alice.split())

# all done with cleanup? let's see how it looks.
print('Extra whitespace removed:', alice[0:100])
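
The split-and-rejoin idiom works because `str.split()` without arguments treats any run of whitespace as a single separator. A tiny sketch on a made-up string shows the intermediate step:

```python
# Made-up string with mixed spaces, tabs, and newlines
messy = "Alice   was\nbeginning\t to get\n\nvery tired  "

# split() with no arguments splits on any run of whitespace and
# ignores leading and trailing whitespace
pieces = messy.split()

# rejoining with a single space gives one clean line of text
tidy = " ".join(pieces)

print(pieces)
print(repr(tidy))
```

Keep in mind that this also removes paragraph breaks, so if paragraphs matter for your task, you should extract them before this step.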

Much of what you saw as data cleaning so far was just a demonstration of the kinds of problems you may encounter in a corpus. You can imagine many more than what we showed here. For example, if we were to work on a social media corpus, then we would most likely encounter many emojis and abbreviations, and dealing with them would be a major part of the data cleaning phase. Hence, **you should always be careful about what kind of corpus you have and what types of problems may occur in the text**.

Now that our text is starting to look okay, the next step is to tokenize it.

## Tokenization

As you recall from the previous checkpoint, each individual meaningful piece from a text is called a **token**, and the process of breaking up the text into these pieces is called **tokenization**. **Tokenization is an important step in text preprocessing, because most of the time we generate the numerical representations of our texts from these tokens**. Hence, breaking up the text into tokens correctly is a crucial step for the success of the next steps of any data science workflow.

Tokens are generally words and punctuation. In some NLP applications, you may see that people remove punctuation from the text as if it were a stopword. There's no single correct way of handling punctuation, and it's usually a matter of experimentation to determine the best way. In the following, we'll keep punctuation in our documents, as we'll make use of it when separating our text into sentences. However, when we analyze our data, we'll check for punctuation tokens and leave them out of the analysis, as you'll see shortly.
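
To get a feel for what "words and punctuation as tokens" means, here is a deliberately simple regex tokenizer. It is only a sketch for illustration: the `simple_tokenize` helper is hypothetical, and library tokenizers make more careful choices (for instance, about how to split contractions).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+(?:'\w+)?  matches a word, optionally with an inner apostrophe ("didn't")
    # [^\w\s]       matches any single character that is neither word nor whitespace
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = simple_tokenize("Alice didn't reply; she was thinking.")
print(tokens)

# Keeping punctuation in the token list but skipping it at analysis time
words = [token for token in tokens if re.match(r"\w", token)]
print(words)
```

The second list shows the approach we follow in this checkpoint: punctuation stays in the tokenized text and is filtered out only when we count words.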

Let's go ahead and use spaCy to parse our novels into tokens. When we call spaCy on the novel it will immediately and automatically parse it, tokenizing the string by breaking it into words and punctuation (and many other things we will explore):

In [None]:
# load the small English model (the old 'en' shortcut fails in spaCy 3)
nlp = spacy.load('en_core_web_sm')

# all the processing work is done below, so it may take a while
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

Our parsed documents are now stored in the two variables we defined. spaCy did a lot of useful work while parsing the documents. Let's see what we have after parsing:

In [None]:
# let's explore the objects we've built.
print("The alice_doc object is a {} object.".format(type(alice_doc)))
print("It is {} tokens long".format(len(alice_doc)))
print("The first three tokens are '{}'".format(alice_doc[:3]))
print("The type of each token is {}".format(type(alice_doc[0])))

We see from introspecting the spaCy objects above that we're playing around with [doc](https://spacy.io/docs/api/doc) and [token](https://spacy.io/docs/api/token) objects. These are the types that are defined by SpaCy.

Now that we've parsed our documents and gotten the tokens out of them, we can remove the stopwords.

## Removing stopwords

One of the important steps of text preprocessing is to remove the stopwords from the dataset. This is because they occur a lot in the text and most of the time they convey little meaning. So removing them has two benefits:

1. We get rid of the noisy data.
2. The size of the text diminishes and hence the computation time shortens.

Removing stopwords with SpaCy is quite easy:

In [None]:
alice_without_stopwords = [token for token in alice_doc if not token.is_stop]
persuasion_without_stopwords = [token for token in persuasion_doc if not token.is_stop]

As you can see, we just iterated over the tokens that spaCy already made available when parsing the documents and excluded every token that is a stopword.
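
Under the hood, this kind of filtering boils down to checking each token against a list of known stopwords. The sketch below mimics that with plain Python; the tiny stopword set and the sample tokens are assumptions made for this example, and the lists shipped with NLP libraries are much longer.

```python
# Tiny hand-picked stopword set (an assumption for illustration only)
STOPWORDS = {"the", "a", "an", "and", "of", "to", "was", "in", "it"}

# Made-up, already tokenized sentence
tokens = ["The", "Queen", "of", "Hearts", "was", "in", "a", "furious", "passion"]

# Compare in lowercase so that "The" is treated like "the"
content_tokens = [token for token in tokens if token.lower() not in STOPWORDS]

print(content_tokens)
print(len(tokens), "->", len(content_tokens), "tokens")
```

More than half of the tokens disappear in this toy sentence, which illustrates both benefits listed above: less noise and less text to process.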

Now, our tokens are stored in two lists that are free of stopwords. Let's pause text processing for a moment and look at how frequent each token is in our corpus:

In [None]:
# utility function to calculate how frequently words appear in the text
def word_frequencies(text):
    
    # build a list of words
    # strip out punctuation
    words = []
    for token in text:
        if not token.is_punct:
            words.append(token.text)
            
    # build and return a Counter object containing word counts
    return Counter(words)

# instantiate our list of most common words.
alice_word_freq = word_frequencies(alice_without_stopwords).most_common(10)
persuasion_word_freq = word_frequencies(persuasion_without_stopwords).most_common(10)
print('\nAlice:', alice_word_freq)
print('Persuasion:', persuasion_word_freq)

Just take a moment and think about the 10 most common words in each novel. Do you see some differences that make sense to you?

After tokenization, the natural next step in text processing is lemmatization or stemming. We prefer to go with lemmatization here, but you can also play with stemming if you like. Again, most of the time, determining whether lemmatization or stemming is the better choice is a matter of experimentation.
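
To make the difference between the two options concrete, here is a toy comparison. Both helpers are hypothetical and heavily simplified: the stemmer strips a few suffixes by rule, while the lemmatizer looks words up in a small hand-written dictionary. Real stemmers and lemmatizers are far more elaborate.

```python
def toy_stem(word):
    """Crude stemmer: strip the first matching suffix from a short list."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words are not destroyed
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary; real lemmatizers use far larger resources
TOY_LEMMAS = {"thinking": "think", "thought": "think", "thinks": "think"}

def toy_lemma(word):
    """Look the word up in the dictionary, falling back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["think", "thinks", "thinking", "thought"]:
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemma(word))
```

The rule-based stemmer handles the regular forms but leaves the irregular form "thought" unchanged, whereas the dictionary lookup maps all four words to "think". This is the usual trade-off: stemming is cheap and rule-driven, while lemmatization relies on linguistic knowledge.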

## Lemmatization

So far, we've tokenized our texts and looked at whether certain words are present and how frequently they appear. We can process these words further to remove a little more noise from our data. Consider the words "think", "thought", and "thinking". They're related. They all share the same root word: the verb "think". Most of the time, we want to focus on the fact that the act of thinking comes up a lot in the data, and not have that information split across all the different forms of "think".

To focus in like this, we can reduce each word to its root form, that is, to its lemma, and do our counts again. This time, we're building a count of *concepts* rather than just *words*:

In [None]:
# utility function to calculate how frequently lemmas appear in the text
def lemma_frequencies(text):
    
    # build a list of lemmas
    # strip out punctuation
    lemmas = []
    for token in text:
        if not token.is_punct:
            lemmas.append(token.lemma_)
            
    # build and return a Counter object containing lemma counts
    return Counter(lemmas)

# instantiate our list of most common lemmas
alice_lemma_freq = lemma_frequencies(alice_without_stopwords).most_common(10)
persuasion_lemma_freq = lemma_frequencies(persuasion_without_stopwords).most_common(10)
print('\nAlice:', alice_lemma_freq)
print('Persuasion:', persuasion_lemma_freq)

As you can see, the top ten list changed. You can try printing more of the top lemmas to catch meaningful differences between the two novels.

Alternatively, we can identify the lemmas common to one text but not the other. This may help us in understanding the differences between the two novels.

In [None]:
alice_lemma_common = [pair[0] for pair in alice_lemma_freq]
persuasion_lemma_common = [pair[0] for pair in persuasion_lemma_freq]
print('Unique to Alice:', set(alice_lemma_common) - set(persuasion_lemma_common))
print('Unique to Persuasion:', set(persuasion_lemma_common) - set(alice_lemma_common))

These are examples of how you can do data exploration on text data. When it comes to text data, the sky's the limit! So, use your imagination and find more creative ways of analyzing the two novels based on the lemmas they have.

We won't go into the details, but some syntactic properties can also help in this analysis. As you may have noticed, the most frequent lemmas include names of people. For the purpose of our analysis, we may need to eliminate them from the lists. To do this, we can use the **named entities** in the texts, which spaCy already identified during parsing (they are available through `doc.ents`). If you like, you can go ahead and inspect the named entities.

**Note**: We lemmatized our tokens to treat words with similar meanings as if they are the same. Apart from looking at lemmas, we could also perform a similar analysis by pulling out prefixes (`token.prefix_`) or suffixes (`token.suffix_`) from the tokens.

## Sentences

Before closing this checkpoint, we want to mention how to identify the sentences in a corpus. Beyond individual words, text can also be considered at the level of sentences. Using punctuation cues, we can split up text into sentences. Each sentence can then be summarized by, for example, using sentiment analysis to categorize sentences as having positive or negative sentiment. We may also be interested in how long sentences tend to be, and how many unique words make up a sentence. The sentence also provides *context* for the individual words, allowing us to draw even more information from each word.
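
As a rough illustration of what "punctuation cues" means, the sketch below splits a made-up passage after every `.`, `!`, or `?` that is followed by whitespace, and then reports the length and the number of unique words of each sentence. The `naive_sentences` helper is hypothetical and intentionally naive: it would split wrongly after abbreviations such as "Mr.", which is one reason to rely on a library for real work.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Made-up passage for illustration
text = "Alice was bored. Down, down, down she fell! Would the fall never end?"

sentences = naive_sentences(text)
for sentence in sentences:
    # lowercase so that "Down" and "down" count as the same word
    words = re.findall(r"\w+", sentence.lower())
    print(f"{len(words)} words, {len(set(words))} unique: {sentence}")
```

The second sentence has five words but only three unique ones, which is the kind of sentence-level metric mentioned above.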

We get a lot of automatic sentence-level information from spaCy. The `doc.sents` property will give us each sentence as a [span](https://spacy.io/docs/api/span) object. Let's look at some of that:

In [None]:
# initial exploration of sentences
sentences = list(alice_doc.sents)
print("Alice in Wonderland has {} sentences.".format(len(sentences)))

example_sentence = sentences[2]
print("Here is an example: \n{}\n".format(example_sentence))

In [None]:
# look at some metrics around this sentence
example_words = [token for token in example_sentence if not token.is_punct]
unique_words = {token.text for token in example_words}

print(("There are {} words in this sentence, and {} of them are"
       " unique.").format(len(example_words), len(unique_words)))

As we can see, sentence-level analysis can also be helpful in the data exploration phase.

That's all on data cleaning and text preprocessing for now. It's your turn to complete the assignments.