# Preprocessing

Like other data types, text data never comes clean. Moreover, most of our downstream methods only accept data structured in a particular way. Because of this, before we apply any computational text analysis technique, we will always need to perform some level of preprocessing on our text data. 

Different applications require different approaches to preprocessing. There is no "one size fits all" approach, and even if you use a program or module with built-in preprocessing steps, you want to know what it's doing (and not doing) so you can justify and/or change it. In some cases, it's better to do minimal preprocessing, to retain the valuable meanings stored in sentence structure and the like. In other cases, we want only a "bag of words", so more preprocessing is better. There are as many choices in preprocessing as there are in analysis (lots and lots!), so it's essential to understand the building blocks. 

For instance, for corpus linguistics (e.g., syntactic parsing) or neural network models (e.g., word embeddings), we usually want to keep sentence structure and different word forms, so we would just tokenize and lower-case words, segment sentences, and normalize text (remove URLs, etc.). In contrast, for simpler techniques like finding the most distinctive or frequent words--as well as for topic models--we would typically do all that plus remove stopwords and punctuation and stem or lemmatize words. 

To help you understand these choice points so you can assemble your own domain-appropriate preprocessing recipes, in this notebook we will cover the following basic ingredients:

- Reading in .txt and .csv files
- Tokenization
- Sentence segmentation
- Removing punctuation
- Stripping whitespace
- Text normalization
- Stop words
- Stemming/Lemmatizing
- POS tagging

This is a great starting point, but there is a lot more to learn about preprocessing. To acquaint yourself with a few more tools, check out `solutions/preprocessing_extra.ipynb` and the other resources listed in [the workshop repo](https://github.com/jhaber-zz/nlp-python-2020). To gain more practice and tools, come back for tomorrow's session!

We will do some review, but this notebook assumes you have basic familiarity with Python. If you need a beginner's introduction to coding in Python, please walk through the notebook at `solutions/intro-to-python.ipynb` *before* the workshop. 

## Reading in files

The first step is to read in the files containing the text data. The most common file types for text data are: `.txt`, `.csv`, `.json`, `.html` and `.xml`.

### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

In [5]:
import os
DATA_DIR = 'data'
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()

#### Review of Python string methods

- What type of object is `raw`?
- How many characters are in `raw`?
- Get the first 1000 characters of `raw`.
- Join together the first 200 and the last 200 characters of `raw`.

In [47]:
# your code here

### Reading in `.csv`

Python has a built-in module called `csv` for reading in csv files.

In [8]:
import csv
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = []

# Standard approach when there are no special encoding issues:
# with open(fname) as f:
import codecs
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:  # for special encoding issues: skip bytes that can't be decoded
    reader = csv.reader(f)
    tweets = list(reader)
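
If you would rather look up columns by name than by position, the same `csv` module also offers `DictReader`. Here is a minimal sketch on an in-memory sample; the two rows and the `Date` column are made up for illustration, and only the column name `Tweet_Text` comes from this notebook.

```python
import csv
import io

# Hypothetical two-row sample standing in for the real file; only the
# column name Tweet_Text is taken from the notebook.
sample = io.StringIO(
    'Date,Tweet_Text\n'
    '16-11-11,"Busy day planned, in New York."\n'
    '16-11-10,Thank you!\n'
)

reader = csv.DictReader(sample)   # the first row is consumed as the header
rows = list(reader)               # each remaining row becomes a dict keyed by column name
texts = [row['Tweet_Text'] for row in rows]
print(len(rows), texts[0])
```

Note that the quoted comma inside the first tweet does not split the field, which is the main reason to use `csv` rather than `line.split(',')`.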

#### Review of Python list methods

- What data type is `tweets`?
- How many entries are in `tweets`?
- Which entry is the header row?
- Get the first 10 entries.
- Join together the 5th and 10th elements of `tweets`.

In [9]:
# your code here

### Reading in `.csv` with `pandas`

`pandas` is a third-party library that makes working with tabular data much easier. This is the recommended way to read in a `.csv` file.

In [18]:
import os
import pandas as pd
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = pd.read_csv(fname) 

#### Review of `pandas`

- What data type is `tweets`?
- How many tweets are there?
- What happened to the header row?
- Get the first row of `tweets`.
- Get the first 5 entries in the `Tweet_Text` column.

In [None]:
# your code here

### Reading in multiple files

Often, our text data is split across multiple files in a folder. We want to read them all into a single variable. <br>`glob` is a handy package for this: it lists all files matching a pattern. We can use this to get all files in a folder. 

In [22]:
import glob
fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(fnames)
austen = ''
for fname in fnames:
    with codecs.open(fname, "r", encoding='utf-8-sig', errors='ignore') as f:
        text = f.read()
        austen += text
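
One thing to keep in mind: `glob` makes no promise about the order in which it returns files. If order matters (say, for reproducibility), sort the file names first. The sketch below shows the same read-everything pattern on a throwaway folder; the folder and the file names `a.txt` and `b.txt` are made up so that the example runs anywhere.

```python
import glob
import os
import tempfile

# Build a throwaway folder with two small files (names are made up for illustration)
tmp_dir = tempfile.mkdtemp()
for name, content in [('b.txt', 'second file\n'), ('a.txt', 'first file\n')]:
    with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
        f.write(content)

# Sort the matches so the result does not depend on the file system's ordering
paths = sorted(glob.glob(os.path.join(tmp_dir, '*.txt')))

texts = []
for path in paths:
    with open(path, encoding='utf-8') as f:
        texts.append(f.read())

combined = ''.join(texts)   # one long string, like `austen` above
print(combined)
```

Collecting the pieces in a list and joining once at the end is a common alternative to repeated `+=` on a string.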

#### Review of working with files

- What does `os.path.join()` do in this case?
- What type is `fnames` after it is first assigned a value?
- What type is `fnames` after it is assigned a second value?
- How many files are in `fnames`?
- What type is `austen`?

In [None]:
# your code here

### Challenge 

Read in all the `.csv` files in the folder `amazon`. Extract out only the `text` column from THE FIRST TWO files and store them all in a list called `reviews`. 

**Hint 1:** Not all of these files have a header row to indicate column names. But for your reference, the columns are in this order: <br>
```Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text```

**Hint 2:** You can deal with `.csv` files without header rows by calling the argument `header=None` when loading into a pandas DataFrame. This lets pandas know not to mistake the first row of data for column names. 

In [48]:
# your solution here

## Tokenization

Once we've read in the data, our next step is often to split it into words. This step is referred to as "tokenization", because each word occurrence is referred to as a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text. For example, the sentence "the dog ate the cat" has five tokens ("the", "dog", "ate", "the", "cat") but only *four* types ("the", "dog", "ate", "cat").
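
The token/type distinction from the example sentence can be checked in a couple of lines. This is just a sketch using whitespace splitting; the variable names are our own.

```python
sentence = "the dog ate the cat"

tokens = sentence.split()   # every word occurrence, in order
types = set(tokens)         # distinct words only

n_tokens = len(tokens)
n_types = len(types)
print(n_tokens, n_types)    # 5 tokens, 4 types
```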

There are many, many approaches to tokenization. The simplest approach to splitting words is by whitespace--the spacing between words. Let's see how to do that and compare it to a few other basic and accessible methods.

### Tokenizing by whitespace

In [89]:
import os
fname = 'example1.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [90]:
text

"In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way. \n"

In [91]:
text.split()[:20] # tokenize by whitespace

['In',
 'this',
 'little',
 'example,',
 "we're",
 'going',
 'to',
 'see',
 'some',
 'of',
 'the',
 'problems',
 'that',
 'regularly',
 'appear',
 'in',
 'tokenization.',
 'Tokenization',
 'may',
 'seem']

- What problems do you notice with tokenizing by whitespace? If we're just looking for a "bag of words", what other steps should we take?
- What type is `text`?

### Tokenizing with regular expressions

In [92]:
import re
word_pattern = r'\w+' # \w+ means "one or more word characters" (letters, digits, underscore): https://www.debuggex.com/cheatsheet/regex/python
tokens = re.findall(word_pattern, text)
tokens[:20]

['In',
 'this',
 'little',
 'example',
 'we',
 're',
 'going',
 'to',
 'see',
 'some',
 'of',
 'the',
 'problems',
 'that',
 'regularly',
 'appear',
 'in',
 'tokenization',
 'Tokenization',
 'may']

- What type is `tokens`?
- What type is each element of `tokens`?
- What does tokenizing with regular expressions do that tokenizing with whitespace does not?
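
If you want to keep contractions like "don't" in one piece, the pattern can be extended to allow an apostrophe inside a word. This is only one possible sketch; the pattern name and the sample sentence are our own, and it assumes straight apostrophes (`'`) rather than curly ones.

```python
import re

# A word, optionally followed by an apostrophe and more word characters
contraction_pattern = r"\w+(?:'\w+)?"

sample = "we're sure it's harder; don't, won't and would've"
contraction_tokens = re.findall(contraction_pattern, sample)
print(contraction_tokens)
```

Whether "we're" should be one token or two depends on your application, which is exactly why it pays to know what your tokenizer does.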

### Tokenizing with `nltk`

[Just a bunch of regular expressions under the hood](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py)

In [93]:
from nltk.tokenize import word_tokenize
import nltk; nltk.download('punkt')
tokens = word_tokenize(text)
tokens[:20]

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['In',
 'this',
 'little',
 'example',
 ',',
 'we',
 "'re",
 'going',
 'to',
 'see',
 'some',
 'of',
 'the',
 'problems',
 'that',
 'regularly',
 'appear',
 'in',
 'tokenization',
 '.']

- How does tokenization with NLTK compare to the previous approaches?

### Challenge

A while ago you read a bunch of Jane Austen books into a variable called `austen`. Tokenize that using a method of your choice. Find all the unique word types (you might want the `set` function). Sort the resulting set object to create a vocabulary (you might want to use the `sorted` function). 

**Extra credit:** Use what you know about types and tokens to measure lexical uniqueness--that is, how often Jane Austen reuses words. <br>*Hint:* Start by counting the number of types and the number of tokens in `austen`.

In [94]:
# your solution here

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences.

### Sentence segmentation by splitting on periods

In [95]:
text.split('.')

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 " Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way",
 ' \n']

- What does this simple approach do well? What does it miss?

### Sentence segmentation with regular expressions

Regular expressions allow us to split strings on any one of several characters, not just periods.

In [96]:
sent_boundary_pattern = r'[.?!]' # This pattern matches a single sentence-final punctuation mark: period, question mark, or exclamation point
re.split(sent_boundary_pattern, text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 ' Why is it so hard',
 " Punctuations, contractions (like don't, won't and would've) get in the way",
 ' \n']
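
Notice that splitting on the punctuation throws the punctuation away. One way around this, sketched below, is to split on the whitespace that *follows* a sentence-final mark, using a lookbehind. The pattern name and the sample strings are our own; the second sample borrows a phrase from the tweets to show that abbreviations still trip up this approach.

```python
import re

# Split on whitespace that comes right after ., ? or ! so the punctuation stays attached
sent_boundary_keep = r'(?<=[.?!])\s+'

sample = "Tokenization may seem simple. Why is it so hard? It just is!"
sentences = re.split(sent_boundary_keep, sample)
print(sentences)

# Abbreviations are still a problem: "Mrs." looks like the end of a sentence
tricky = "Melania liked Mrs. O a lot!"
tricky_sentences = re.split(sent_boundary_keep, tricky)
print(tricky_sentences)
```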

### Sentence segmentation with `nltk`

In [97]:
from nltk.tokenize import sent_tokenize
sent_tokenize(text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization.",
 "Tokenization may seem simple, but it's harder than it first appears.",
 'Why is it so hard?',
 "Punctuations, contractions (like don't, won't and would've) get in the way."]

### Challenge

The file `example2.txt` has more punctuation problems. Read it in (using the same pattern as above) and see what the problems are. Try the different approaches above and see what works best. If you're already familiar with regular expressions, modify that code to work for as many cases as you can. What does this suggest about tokenizing text that includes punctuation?
<br> (For help with regular expressions, see this [cheat sheet](https://www.debuggex.com/cheatsheet/regex/python))

In [None]:
# your solution here

## Removing punctuation

In some situations, it's best to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. Here's how we can do that.

In [98]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

- What type is `punctuation`?

In [99]:
no_punct = ''.join([char for char in text if char not in punctuation])
no_punct

'In this little example were going to see some of the problems that regularly appear in tokenization Tokenization may seem simple but its harder than it first appears Why is it so hard Punctuations contractions like dont wont and wouldve get in the way \n'
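
An equivalent way to drop punctuation is the built-in string method `translate`, which deletes every character listed in a translation table. Here is a small sketch on a made-up sample sentence:

```python
from string import punctuation

# Translation table that maps every punctuation character to "delete"
table = str.maketrans('', '', punctuation)

sample = "Why is it so hard? Contractions (like don't) get in the way."
no_punct_sample = sample.translate(table)
print(no_punct_sample)
```

Either way, note what happens to contractions: "don't" becomes "dont", which is a different word type from "do" or "not".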

## Strip whitespace

This is an extremely common step. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [100]:
string = ' Hello! '
string.strip()

'Hello!'

In [103]:
fname = 'example3.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
print(text) # This shows the whitespaces



This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.






In [106]:
stripped_text = text.strip() # This just strips whitespace from front and end--what about in between words?
print(stripped_text)

This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.


In [108]:
# To strip whitespaces both at extremes and inside a text, regular expressions are a great tool:
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text.strip() # Strip whitespaces at start & end (again)

'This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines. The Python method called "strip" only catches whitespace at the start and end of a string. But it won\'t catch it in the middle, for example, in this sentence. Once again, regular expressions will help us with this.'
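
If you'd rather not use a regular expression here, splitting on whitespace and re-joining with single spaces gives the same result, since `split()` with no arguments treats any run of whitespace as one separator. A quick sketch on a made-up messy string:

```python
import re

messy = "\n\n  help\t\tus    with this.  \n"

# split() with no arguments drops leading/trailing whitespace and
# treats any run of spaces, tabs, or newlines as a single separator
via_split = ' '.join(messy.split())

# The regular-expression approach from the cell above, for comparison
via_regex = re.sub(r'\s+', ' ', messy).strip()

print(via_split)
```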

## Text normalization

Text normalization means making our text fit some standard patterns. Lots of steps come under this wide umbrella, but the most common are:

- case folding
- removing URLs, digits, hashtags
- removing infrequent words (not done here)

#### Case folding

Case folding means dealing with upper- and lower-case characters. This is usually done by making all characters lower case.

In [109]:
fname = 'example4.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
text

'Upper and lower case characters can be annoying. Characters are the individual letters and numbers that we see on the page. Case folding is the generic term we use for dealing with upper and lower case characters. Lower case is often what people want. Title Case refers to a multi-word expression with the first character of every word in upper case. '

In [110]:
text.lower()

'upper and lower case characters can be annoying. characters are the individual letters and numbers that we see on the page. case folding is the generic term we use for dealing with upper and lower case characters. lower case is often what people want. title case refers to a multi-word expression with the first character of every word in upper case. '
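
For English text `lower` is usually all you need. For text in other languages, Python strings also have a `casefold` method, which is a more aggressive version of `lower` meant for caseless matching. A short sketch with a German example of our own choosing:

```python
word = 'Straße'   # German for "street"

lowered = word.lower()      # the ß is left as it is
folded = word.casefold()    # the ß is expanded to "ss"

print(lowered, folded)
print(folded == 'STRASSE'.casefold())   # the two spellings now match
```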

### Challenge
The `lower` method we used above is a string method, that is, it works on strings. But say you've already tokenized a text, and you want to lowercase every word in the resulting list? Take the list of tokens below and make each one lower case.

In [113]:
tokens = word_tokenize(text)
tokens[:20]

['Upper',
 'and',
 'lower',
 'case',
 'characters',
 'can',
 'be',
 'annoying',
 '.',
 'Characters',
 'are',
 'the',
 'individual',
 'letters',
 'and',
 'numbers',
 'that',
 'we',
 'see',
 'on']

In [None]:
# your solution here

### Removing URLs, digits and hashtags

We rarely care about the exact URL used in a tweet, or the exact numbers it mentions. We could remove them completely, but it's often informative to know that the text contained a URL or a number. So instead we replace each URL and each run of digits with a placeholder, which preserves the fact that something was there without clouding our measures with the actual content. It's standard to just use the strings "URL" and "DIGIT".

How do we do this? Once again, regular expressions save the day.

In [122]:
# First, get just the text for the tweets we read in earlier:
tweet_text = tweets["Tweet_Text"] # 

In [123]:
url_pattern = r'https?:\/\/.*[\r\n]*'
single_tweet = tweet_text.iloc[0]
single_tweet

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z'

In [124]:
URL_SIGN = ' URL '
re.sub(url_pattern, URL_SIGN, single_tweet)

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL '
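
One caveat about `url_pattern`: the `.*` in it is greedy, so it swallows everything from the URL to the end of the line, including any words that come after the URL. That is harmless when the URL is the last thing in the tweet, as it is here. Below is a sketch of a narrower alternative, assuming a URL ends at the first whitespace character; `narrow_url_pattern` and the sample tweet are made up for illustration.

```python
import re

url_pattern = r'https?:\/\/.*[\r\n]*'    # the pattern used in this notebook
narrow_url_pattern = r'https?://\S+'     # assumption: a URL ends at the first whitespace
URL_SIGN = ' URL '

sample_tweet = 'See https://t.co/abc for details'   # made-up example

greedy_result = re.sub(url_pattern, URL_SIGN, sample_tweet)
narrow_result = re.sub(narrow_url_pattern, URL_SIGN, sample_tweet)

print(repr(greedy_result))   # the words after the URL are gone
print(repr(narrow_result))   # the words after the URL survive
```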

Above we replaced the URL in a single tweet. Now we will replace all the URLs for each tweet in `tweet_text`.

In [125]:
url_pattern = r'https?:\/\/.*[\r\n]*'
URL_SIGN = ' URL '
list_of_url_less_tweets = []

# Use a for loop
for tweet in tweet_text:
    url_less_tweet = re.sub(url_pattern, URL_SIGN, tweet)
    list_of_url_less_tweets.append(url_less_tweet)
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment

In [126]:
## Alternative using list comprehension
list_of_url_less_tweets = [re.sub(url_pattern, URL_SIGN, tweet) for tweet in tweet_text]
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment

Now let's remove hashtags and digits.

In [127]:
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
HASHTAG_SIGN = ' HASHTAG '
digit_pattern = r'\d+' # raw string, so Python doesn't treat \d as a string escape
DIGIT_SIGN = ' DIGIT '

In [128]:
no_hashtags = [re.sub(hashtag_pattern, HASHTAG_SIGN, tweet) for tweet in tweet_text]
no_hashtags

['Today we express our deepest gratitude to all those who have served in our armed forces. HASHTAG  https://t.co/wPk7QWpK8Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz2dhrXzo4',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm. HASHTAG  HASHTAG  https://t.co/HfuJeRZ

In [129]:
no_digit = [re.sub(digit_pattern, DIGIT_SIGN, tweet) for tweet in tweet_text]
no_digit

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk DIGIT QWpK DIGIT Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy  DIGIT st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz DIGIT dhrXzo DIGIT ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at  DIGIT : DIGIT p
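
Look closely at the output above: because we applied `digit_pattern` to the raw tweets, the digits *inside* the URLs were replaced too. Order matters. A minimal sketch of chaining the three substitutions so that URLs are handled first follows; `normalize_tweet` is a hypothetical helper of our own, and the sample tweet is a made-up mix of the tweets above.

```python
import re

# Patterns and placeholders as defined in the cells above
url_pattern = r'https?:\/\/.*[\r\n]*'
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
digit_pattern = r'\d+'
URL_SIGN, HASHTAG_SIGN, DIGIT_SIGN = ' URL ', ' HASHTAG ', ' DIGIT '

def normalize_tweet(tweet):
    """Hypothetical helper: replace URLs first, then hashtags, then digits."""
    tweet = re.sub(url_pattern, URL_SIGN, tweet)
    tweet = re.sub(hashtag_pattern, HASHTAG_SIGN, tweet)
    tweet = re.sub(digit_pattern, DIGIT_SIGN, tweet)
    return ' '.join(tweet.split())   # collapse the extra spaces the placeholders add

sample_tweet = 'Happy 241st birthday! #MAGA https://t.co/Lz2dhrXzo4'   # made-up example
normalized = normalize_tweet(sample_tweet)
print(normalized)
```

Only the "241" is turned into a placeholder now, because the URL was already gone by the time the digit pattern ran.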

## Counting word frequencies (after text normalization)

We can count the frequency of each word type with `Counter`, from Python's built-in `collections` module. `Counter` takes a list of tokens and builds a special Python dictionary in which each key is a word type (like the entries of the vocabulary from the tokenization challenge) and each value is the number of times that type appears in the list. We can ask that dictionary for the most common words, or for the frequency of individual word types. 

First, clean and normalize the text:

In [None]:
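# Replace URLs tweet by tweet: url_pattern is greedy and, once the tweets are
# joined into one string, it would also swallow the tweets that follow a URL
url_less_tweets = [re.sub(url_pattern, URL_SIGN, tweet) for tweet in tweet_text]
all_tweets = ' '.join(url_less_tweets) # join the text column, not the whole DataFrame
clean = re.sub(hashtag_pattern, HASHTAG_SIGN, all_tweets)
clean = re.sub(digit_pattern, DIGIT_SIGN, clean)
tokens = word_tokenize(clean)
tokens = [token for token in tokens if token not in punctuation]
tokens[:20]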

In [None]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)
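
To see what `Counter` gives back without waiting on the whole tweet dataset, here is the same idea on the five-token sentence from the tokenization section. The variable names are our own.

```python
from collections import Counter

toy_tokens = ['the', 'dog', 'ate', 'the', 'cat']
toy_freq = Counter(toy_tokens)

top_word = toy_freq.most_common(1)   # list of (word type, count) pairs
cat_count = toy_freq['cat']          # look up a single word type
bird_count = toy_freq['bird']        # a missing word type counts as 0 instead of raising an error

print(top_word, cat_count, bird_count)
```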

### Challenge 

Below I've read the Amazon reviews from earlier into a list called `reviews`. Each element of the list is a string, representing the text of a single review. Try to:
- Tokenize each review
- Strip all whitespace
- Make all characters lower case
- Replace any URLs and digits

Then find the most common 50 words.

In [None]:
fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
fnames = glob.glob(fnames)
reviews = []
column_names = ['id', 'product_id', 'user_id', 'profile_name', 'helpfulness_num', 'helpfulness_denom',
               'score', 'time', 'summary', 'text']
for fname in fnames[:2]:
    df = pd.read_csv(fname, names=column_names)
    text = list(df['text'])
    reviews.extend(text)

In [None]:
# your solution here

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "am", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

- What other stop words do you think there are?

In [None]:
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop
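
Two practical points when filtering with a stop list: the words in NLTK's English list are lower case, so compare against lower-cased tokens, and checking membership in a `set` is faster than in a `list` when there are many tokens. The sketch below uses a tiny hand-picked stop set for illustration only; in the notebook you would use the full `stop` list instead.

```python
# Tiny hand-picked stop set, for illustration only (not NLTK's list)
toy_stop = {'the', 'a', 'i', 'am', 'of'}

toy_tokens = ['I', 'am', 'a', 'fan', 'of', 'the', 'book']

# Lower-case each token before the comparison so that "I" matches "i"
content_tokens = [token for token in toy_tokens if token.lower() not in toy_stop]
print(content_tokens)
```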

### Challenge 

Use the list `stop` of English stopwords to remove stopwords from the tokenized reviews from the previous challenge.

In [None]:
# your solution here

## Stemming/lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running, rather than the fact that it's a third person present tense verb, or progressive participle). The difference is that a stemmer chops affixes off by rule and may return something that isn't a word, while a lemmatizer maps each word to its dictionary form (its lemma).

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
stemmer.stem('grows')

In [None]:
stemmer.stem('running')

In [None]:
stemmer.stem('leaves')
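
To get a feel for the "bunch of regular expressions" idea, and for why a real stemmer needs many more rules, here is a deliberately naive suffix stripper. To be clear, `toy_stem` is our own toy and is *not* the Porter algorithm; it just removes one of three suffixes from the end of a word.

```python
import re

# Toy rule: strip a trailing "ing", "es" or "s" (NOT the Porter algorithm)
suffix_pattern = re.compile(r'(ing|es|s)$')

def toy_stem(word):
    """Remove one of a few common suffixes from the end of a word."""
    return suffix_pattern.sub('', word)

toy_results = {word: toy_stem(word) for word in ['grows', 'running', 'leaves']}
print(toy_results)
```

The toy gets "grows" right but leaves a doubled consonant on "running", which is the kind of case the extra rules in a real stemmer exist to handle.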

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
import nltk; nltk.download('wordnet') # Download resource for working with WordNet via NLTK
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [None]:
print(snowballer_stemmer.stem('running'))
print(snowballer_stemmer.stem('leaves'))

In [None]:
print(lemmatizer.lemmatize('leaves'))

### Challenge 

Use the Porter stemmer to stem each word in the tweet dataset after having removed stop words.

In [None]:
# your solution here

## POS tagging

POS tagging means assigning each token a part-of-speech (e.g. noun, verb, adjective, etc.). Again, there are many different [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK keeps its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand (although you are free to do it afterwards).

In [None]:
from nltk import pos_tag
single_review = reviews[3]
single_review

In [None]:
tokens = word_tokenize(single_review)
import nltk; nltk.download('averaged_perceptron_tagger')
tagged_review = pos_tag(tokens)
tagged_review
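
Once tokens are tagged, a common next step is to keep only certain parts of speech. `pos_tag` returns a list of `(token, tag)` tuples, and its default English tags follow the Penn Treebank convention, in which all noun tags start with `NN`. The sketch below filters a small tagged list written by hand for illustration; it is not actual tagger output.

```python
# Hand-written (token, tag) pairs in the format pos_tag returns;
# these tags are our own illustration, not real tagger output
toy_tagged = [('The', 'DT'), ('coffee', 'NN'), ('tastes', 'VBZ'), ('great', 'JJ'), ('.', '.')]

# Keep tokens whose tag starts with "NN" (NN, NNS, NNP, NNPS are the noun tags)
nouns = [token for token, tag in toy_tagged if tag.startswith('NN')]
print(nouns)
```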

### Challenge 

Below I've read in the text of Austen's _Pride and Prejudice_ into a variable called `pride`. Preprocess using the following steps:

- Strip whitespace
- Replace all numbers with '0'
- Tokenize
- Tag each token with a POS tag

Make sure you know:
- What type is the result?
- What type is each element of the result?
- What type are the elements of the elements of the result?

In [None]:
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()
pride = raw[679:684814]
pride

In [None]:
# your solution here

## Things we didn't cover
(see `solutions/preprocessing_extra.ipynb` and [this repo](https://github.com/geoffbacon/nlp-with-nltk-spacy/blob/master/03-NLTK.ipynb) for more on these)

- Reading in JSON, HTML, and XML files 
- Removing infrequent words
- Named entity recognition
- Syntactic parsing
- Information extraction
- Removing markup from HTML
- Extracting numerical features
- DTM/TF-IDF
- SpaCy