# Preprocessing

Like other kinds of data, text never comes clean. Moreover, most of our downstream methods only accept data structured in a particular way. Because of this, before we apply any computational text analysis technique, we always need to do some preprocessing, and text has its own characteristic preprocessing steps. In this notebook, we cover the core preprocessing methods in preparation for our next two weeks:

- Reading in .txt and .csv files
- Tokenization
- Sentence segmentation
- Removing punctuation
- Stripping whitespace
- Text normalization
- Stop words
- Stemming/Lemmatizing
- POS tagging


## Reading in files

The first step is to read in the files containing the data. As we discussed last week, the most common file types for text data are: `.txt`, `.csv`, `.json`, `.html` and `.xml`.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

- What type of object is `raw`?
- How many characters are in `raw`?
- What are the first 1000 characters of `raw`?

In [1]:
import os
DATA_DIR = '../data'
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()

In [77]:
type(raw)

str

In [78]:
len(raw)

704192

In [2]:
raw[:1000]

'\ufeffThe Project Gutenberg EBook of Pride and Prejudice, by Jane Austen\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\nTitle: Pride and Prejudice\n\nAuthor: Jane Austen\n\nPosting Date: August 26, 2008 [EBook #1342]\nRelease Date: June, 1998\nLast Updated: October 17, 2016\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\n*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n\n\n\n\nProduced by Anonymous Volunteers\n\n\n\n\n\nPRIDE AND PREJUDICE\n\nBy Jane Austen\n\n\n\nChapter 1\n\n\nIt is a truth universally acknowledged, that a single man in possession\nof a good fortune, must be in want of a wife.\n\nHowever little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds\nof the 
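
Notice that the output above starts with `\ufeff`, a byte order mark (BOM) that some tools write at the start of UTF-8 files. Below is a minimal sketch of one way to drop it, assuming the file really is UTF-8 with a BOM: open it with the `utf-8-sig` codec instead of `utf-8`. The temporary file and its contents are made up for illustration.

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a byte order mark (illustrative data).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write('\ufeffThe Project Gutenberg EBook'.encode('utf-8'))
    path = tmp.name

with open(path, encoding='utf-8') as f:
    with_bom = f.read()       # the BOM is kept as the first character
with open(path, encoding='utf-8-sig') as f:
    without_bom = f.read()    # the BOM is stripped while decoding

os.remove(path)
print(repr(with_bom[:4]), repr(without_bom[:3]))
```

If the file has no BOM, `utf-8-sig` behaves like plain `utf-8`, so it is usually a safe default for files of unknown origin.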

#### Reading in `.csv`

Python has a built-in module called `csv` for reading in csv files.

- What type is `tweets`?
- How many entries are in `tweets`?
- Which entry is the header row?

In [80]:
import csv
import codecs
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
# codecs.open with errors='ignore' skips bytes that are not valid UTF-8
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:
    reader = csv.reader(f)
    tweets = list(reader)

In [81]:
type(tweets)

list

In [82]:
len(tweets)

7376

In [83]:
tweets[0] # header row

['Date',
 'Time',
 'Tweet_Text',
 'Type',
 'Media_Type',
 'Hashtags',
 'Tweet_Id',
 'Tweet_Url',
 'twt_favourites_IS_THIS_LIKE_QUESTION_MARK',
 'Retweets',
 '',
 '']

In [4]:
tweets[:10]

[['Date',
  'Time',
  'Tweet_Text',
  'Type',
  'Media_Type',
  'Hashtags',
  'Tweet_Id',
  'Tweet_Url',
  'twt_favourites_IS_THIS_LIKE_QUESTION_MARK',
  'Retweets',
  '',
  ''],
 ['16-11-11',
  '15:26:37',
  'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z',
  'text',
  'photo',
  'ThankAVet',
  '7.97E+17',
  'https://twitter.com/realDonaldTrump/status/797098212599496704',
  '127213',
  '41112',
  '',
  ''],
 ['16-11-11',
  '13:33:35',
  'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
  'text',
  '',
  '',
  '7.97E+17',
  'https://twitter.com/realDonaldTrump/status/797069763801387008',
  '141527',
  '28654',
  '',
  ''],
 ['16-11-11',
  '11:14:20',
  'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
  'text',
  '',
  '',
  '7.97E+17',
  '
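
Since the first entry of the list is the header row, it is often convenient to separate it from the data. The sketch below shows two ways of doing that with the standard `csv` module. The three-column sample is hypothetical and held in memory so the example is self-contained; it is not the real tweet file.

```python
import csv
import io

# Hypothetical two-row file, held in memory so the example is self-contained.
sample = io.StringIO(
    'Date,Tweet_Text,Retweets\n'
    '16-11-11,"Busy day planned in New York.",28654\n'
    '16-11-11,"Love the fact, really.",50039\n'
)

reader = csv.reader(sample)
header = next(reader)   # first row: the column names
rows = list(reader)     # remaining rows: the data

# DictReader does the same bookkeeping and keys each row by column name.
sample.seek(0)
records = list(csv.DictReader(sample))
print(header, len(rows), records[0]['Tweet_Text'])
```

Note that every value comes back as a string, including the numeric-looking `Retweets` column.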

#### Reading in `.csv` with `pandas`

`pandas` is a third-party library that makes working with tabular data much easier. This is the recommended way to read in a `.csv` file.

- How many tweets are there?
- What happened to the header row?

In [84]:
import os
import pandas as pd
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = pd.read_csv(fname) 

In [85]:
len(tweets)

7375

In [6]:
tweets.head(3) # Header row became column names in DF

| | Date | Time | Tweet_Text | Type | Media_Type | Hashtags | Tweet_Id | Tweet_Url | twt_favourites_IS_THIS_LIKE_QUESTION_MARK | Retweets | Unnamed: 10 | Unnamed: 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16-11-11 | 15:26:37 | Today we express our deepest gratitude to all ... | text | photo | ThankAVet | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 127213 | 41112 | | |
| 1 | 16-11-11 | 13:33:35 | Busy day planned in New York. Will soon be mak... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 141527 | 28654 | | |
| 2 | 16-11-11 | 11:14:20 | Love the fact that the small groups of protest... | text | | | 7.97e+17 | https://twitter.com/realDonaldTrump/status/797... | 183729 | 50039 | | |


In [7]:
tweet_text = list(tweets['Tweet_Text'])
tweet_text[:4]

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!']

#### Reading in multiple files

Often, our text data is split across multiple files in a folder. We want to be able to read them all into a single variable.

- What type is `austen`?
- What type is `fnames` after it is first assigned a value?
- What type is `fnames` after it is assigned a second value?
- How 

In [96]:
import glob
fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
type(fnames)

str

In [97]:
fnames = glob.glob(fnames)
austen = ''
for fname in fnames:
    with codecs.open(fname, "r", encoding='utf-8-sig', errors='ignore') as f:
        text = f.read()
        austen += text

In [98]:
type(fnames)

list

In [99]:
fnames

['../data/austen/emma.txt',
 '../data/austen/sense-and-sensibility.txt',
 '../data/austen/lady-susan.txt',
 '../data/austen/persuasion.txt',
 '../data/austen/mansfield-park.txt',
 '../data/austen/northanger-abbey.txt',
 '../data/austen/love-and-freindship.txt']

In [100]:
type(austen)

str

In [101]:
austen[:10000]

'The Project Gutenberg EBook of Emma, by Jane Austen\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\nTitle: Emma\n\nAuthor: Jane Austen\n\nRelease Date: August, 1994  [Etext #158]\nPosting Date: January 21, 2010\nLast Updated: October 17, 2016\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\n*** START OF THIS PROJECT GUTENBERG EBOOK EMMA ***\n\n\n\n\nProduced by An Anonymous Volunteer\n\n\n\n\n\nEMMA\n\nBy Jane Austen\n\n\n\n\nVOLUME I\n\n\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings of\nexistence; and had lived nearly twenty-one years in the world with very\nlittle to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate,\nindul
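
The file list printed above is not in alphabetical order, because `glob` makes no promise about ordering. The following is a sketch of a variation on the loop above, under the assumption that a reproducible order matters: sort the paths, collect the texts in a list, and join once at the end. The folder and the two tiny files are created on the fly and are purely illustrative.

```python
import glob
import os
import tempfile

# Build a throwaway folder with two tiny files (hypothetical stand-ins for the novels).
with tempfile.TemporaryDirectory() as folder:
    for name, body in [('b.txt', 'second'), ('a.txt', 'first')]:
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
            f.write(body)

    # Sort the matches so the files are always read in the same order.
    paths = sorted(glob.glob(os.path.join(folder, '*.txt')))

    texts = []
    for path in paths:
        with open(path, encoding='utf-8-sig') as f:
            texts.append(f.read())

combined = '\n'.join(texts)   # one string, files separated by a newline
print(combined)
```

Keeping `texts` as a list also preserves the boundary between books, which is lost once everything is concatenated.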

### Challenge - SOLUTION

Read in all the `.csv` files in the folder `amazon`. Extract only the text column from each of the first two files and store them all in a list called `reviews`.

In [17]:
fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
fnames = glob.glob(fnames)
reviews = []
column_names = ['id', 'product_id', 'user_id', 'profile_name', 'helpfulness_num', 'helpfulness_denom',
               'score', 'time', 'summary', 'text']

In [18]:
fnames[:2]

['../data/amazon/xbv.csv', '../data/amazon/xac.csv']

In [19]:
for fname in fnames[:2]:
    df = pd.read_csv(fname, names=column_names)
    text = list(df['text'])
    reviews.extend(text)

reviews

["This smoothie is absolutely delicious! I credit this drink to getting me through my mother's death and the death of my dog one day later. I drank one of these every day during that time and I got through the stress of handling my mother's estate without any stressor illnesses. Great stuff in my book.",
 "OKAY HOW SHOULD I SAY THIS, IT DOES NOT TASTE LIKE CREAM CHEESE...<br />IT HAS A WEIRD ALMOST CHEMICAL TASTE TO IT, I EVEN STORED IT IN THE FRIG TO KEEP IT FRESH, IT DID NOT HELP...<br />THING IS I HAD USED IT BEFORE YEARS AGO AND I HAD THOUGHT IT WAS GREAT THEN...<br />MAYBE I GOT A BAD BATCH OR MAYBE THEY CHANGED HOW IT BEING MADE NOW..<br />I KNOW I WON'T BE BUYING IT AGAIN..............",
 'I buy pistachios a LOT. When I say a lot, I mean about once a month or so. And as far as I\'m concerned Keenan is the best. I get very few "bad ones" in the entire container, whether it\'s this one or the plastic bags I also buy. The saltiness is perfect, the size is perfect, and Keenan really

## Tokenization

Once we've read in the data, our next step is often to split it into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.
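
To make the token/type distinction concrete, here is a small sketch on a made-up sentence. It uses plain whitespace splitting only as a stand-in tokenizer.

```python
sentence = 'the cat chased the dog and the dog ran'

tokens = sentence.split()   # every occurrence counts as a token
types = set(tokens)         # each distinct word counts once as a type

print(len(tokens))          # number of tokens
print(len(types))           # number of types
print(tokens.count('the'))  # how many tokens belong to the type "the"
```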

#### Tokenizing by whitespace

- What problems do you notice with tokenizing by whitespace?
- What type is `text`?
- What type is `tokens`?
- What type is each element of `tokens`?

In [20]:
import os
fname = 'example1.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [21]:
text

"In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way. \n"

In [102]:
type(text)

str

In [103]:
tokens = text.split()  # split on whitespace
type(tokens)

list

In [104]:
type(tokens[0])

str

In [22]:
text.split()[:10] # punctuation stays attached to words, e.g. 'example,'

['In',
 'this',
 'little',
 'example,',
 "we're",
 'going',
 'to',
 'see',
 'some',
 'of']

#### Tokenizing with regular expressions

In [23]:
import re
word_pattern = r'\w+'
tokens = re.findall(word_pattern, text)
tokens[:10]

['In', 'this', 'little', 'example', 'we', 're', 'going', 'to', 'see', 'some']
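
The `\w+` pattern breaks contractions at the apostrophe ("we", "re"). One possible refinement, offered as a sketch rather than a complete tokenizer, is to allow an optional apostrophe-plus-letters suffix. The name `contraction_pattern` and the sample sentence are my own.

```python
import re

sample = "In this little example, we're going to see why (don't, would've) matter."

# \w+ alone stops at the apostrophe; the optional group keeps contractions whole.
contraction_pattern = r"\w+(?:'\w+)*"
tokens = re.findall(contraction_pattern, sample)
print(tokens)
```

This still drops all other punctuation, so it suits cases where only the words themselves are needed.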

#### Tokenizing with `nltk`

[Just a bunch of regular expressions under the hood](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py)

In [24]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
tokens[:10]

['In', 'this', 'little', 'example', ',', 'we', "'re", 'going', 'to', 'see']

#### Challenge - SOLUTION

A while ago you read a bunch of Jane Austen books into a variable called `austen`. Tokenize it using a method of your choice. Find all the unique word types (you might want the `set` function). Sort the resulting set object to create a vocabulary (you might want to use the `sorted` function).

In [25]:
tokens = word_tokenize(austen)
tokens[0]

'The'

In [26]:
tokens[:10]

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Emma',
 ',',
 'by',
 'Jane',
 'Austen']

In [27]:
vocab = sorted(set(tokens))
vocab[1000]

'Criticism'

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences.

#### Sentence segmentation by splitting on punctuation

In [28]:
text.split('.')

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 " Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way",
 ' \n']

We could improve on this by using regular expressions. They allow us to split a string on any of several characters at once.

In [29]:
sent_boundary_pattern = r'[.?!]'
re.split(sent_boundary_pattern, text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 ' Why is it so hard',
 " Punctuations, contractions (like don't, won't and would've) get in the way",
 ' \n']

### Challenge - SOLUTION

The file `example2.txt` has more punctuation problems. Read it in and see what the problems are. Try your best to modify the code from above to work for as many cases as you can.

In [30]:
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
re.split(sent_boundary_pattern, text) 
# Simply looking for certain characters gives us problems: the regular expression
# defined above has no notion of context.

["In this little example, we're going to see some of the problems that regularly appear in tokenization",
 " Tokenization may seem simple, but it's harder than it first appears",
 ' Why is it so hard',
 " Punctuations, contractions (like don't, won't and would've) get in the way",
 " \n\nWe can split text into sentences using punctuation, but unfortunately that's not always going to work",
 ' For example, if I wanted to tell you about Dr',
 ' Frankenstein, or Mrs',
 " Doubtfire, we'd be in trouble",
 ' What if I wanted to write about U',
 'C',
 ' Berkeley',
 ' When you think about it, URLs like www',
 'google',
 'com are troublesome too',
 ' How would we settle on a price of $10',
 '50',
 ' The main point is that these punctuation characters serve a variety of purposes in writing',
 ' Moreover, the functions they serve change depending on the domain (medical vs forum text) and language',
 '']
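
One modest improvement, sketched below on a made-up string, is to split only on whitespace that follows sentence-final punctuation. A lookbehind keeps the punctuation attached to its sentence and leaves prices and URLs intact, but it still has no notion of abbreviations, so "Dr." is split incorrectly. The name `boundary_pattern` is my own.

```python
import re

sample = ('Why is it so hard? It costs $10.50. Visit www.google.com today! '
          'Ask Dr. Frankenstein.')

# Split on whitespace only when it directly follows '.', '?' or '!'.
boundary_pattern = r'(?<=[.?!])\s+'
sentences = re.split(boundary_pattern, sample)
print(sentences)
```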

#### Sentence segmentation by `nltk`

In [31]:
from nltk.tokenize import sent_tokenize
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
sent_tokenize(text)

["In this little example, we're going to see some of the problems that regularly appear in tokenization.",
 "Tokenization may seem simple, but it's harder than it first appears.",
 'Why is it so hard?',
 "Punctuations, contractions (like don't, won't and would've) get in the way.",
 "We can split text into sentences using punctuation, but unfortunately that's not always going to work.",
 "For example, if I wanted to tell you about Dr. Frankenstein, or Mrs. Doubtfire, we'd be in trouble.",
 'What if I wanted to write about U.C.',
 'Berkeley?',
 'When you think about it, URLs like www.google.com are troublesome too.',
 'How would we settle on a price of $10.50?',
 'The main point is that these punctuation characters serve a variety of purposes in writing.',
 'Moreover, the functions they serve change depending on the domain (medical vs forum text) and language.']

## Removing punctuation

Sometimes (although admittedly less frequently than tokenizing and sentence segmentation), you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. Here's how we can do that.

- What type is `punctuation`?

In [32]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [105]:
type(punctuation)

str

In [33]:
no_punct = ''.join([ch for ch in text if ch not in punctuation])
no_punct

'In this little example were going to see some of the problems that regularly appear in tokenization Tokenization may seem simple but its harder than it first appears Why is it so hard Punctuations contractions like dont wont and wouldve get in the way \n\nWe can split text into sentences using punctuation but unfortunately thats not always going to work For example if I wanted to tell you about Dr Frankenstein or Mrs Doubtfire wed be in trouble What if I wanted to write about UC Berkeley When you think about it URLs like wwwgooglecom are troublesome too How would we settle on a price of 1050 The main point is that these punctuation characters serve a variety of purposes in writing Moreover the functions they serve change depending on the domain medical vs forum text and language'
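
An alternative to the character-by-character join is `str.translate` with a deletion table built by `str.maketrans`. This is a sketch on a made-up sentence; the result should match what the list comprehension approach produces.

```python
from string import punctuation

sample = "Hello, world! It's $10.50."

# The third argument of maketrans lists characters to delete.
table = str.maketrans('', '', punctuation)
no_punct = sample.translate(table)
print(no_punct)
```

Either way, note the side effects visible in the output above: contractions, prices, and URLs are fused into single strings.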

## Strip whitespace

This is an extremely common step. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [34]:
string = ' Hello! '
string.strip()

'Hello!'

In [35]:
fname = 'example3.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
print(text)



This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.






In [36]:
stripped_text = text.strip()
print(stripped_text)

This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches whitespace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.


In [37]:
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text.strip()

'This is a text file that has some extra whitespace at the start and end. Whitespace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines. The Python method called "strip" only catches whitespace at the start and end of a string. But it won\'t catch it in the middle, for example, in this sentence. Once again, regular expressions will help us with this.'
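
The same result can be had without regular expressions: calling `split()` with no arguments splits on any run of whitespace and discards leading and trailing whitespace, so joining the pieces with single spaces normalizes the string. A small sketch on a made-up string:

```python
messy = '  help\t\tus    with\n\nthis.  '

# split() with no arguments splits on runs of whitespace and ignores the ends.
clean = ' '.join(messy.split())
print(repr(clean))
```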

## Text normalization

Text normalization means making our text fit some standard patterns. Lots of steps come under this wide umbrella, but the most common are:

- case folding
- removing URLs, digits, hashtags
- OOV (out-of-vocabulary) handling, e.g. removing infrequent words (not done here)

#### Case folding

Case folding means dealing with upper- and lower-case characters. This is usually done by converting all characters to lower case.

In [38]:
fname = 'example4.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
text

'Upper and lower case characters can be annoying. Characters are the individual letters and numbers that we see on the page. Case folding is the generic term we use for dealing with upper and lower case characters. Lower case is often what people want. Title Case refers to a multi-word expression with the first character of every word in upper case. '

In [39]:
text.lower()

'upper and lower case characters can be annoying. characters are the individual letters and numbers that we see on the page. case folding is the generic term we use for dealing with upper and lower case characters. lower case is often what people want. title case refers to a multi-word expression with the first character of every word in upper case. '
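
For English text `lower()` is usually enough. Python strings also have a `casefold()` method, which is a more aggressive variant intended for caseless comparison. The sketch below shows where the two differ, using a German word chosen for illustration.

```python
word = 'Straße'

lowered = word.lower()      # keeps the sharp s
folded = word.casefold()    # expands the sharp s to "ss"

print(lowered, folded)
print('STRASSE'.casefold() == folded)
```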

### Challenge - SOLUTION

The `lower` method we used above is a string method; that is, it works on strings. But what if you want to lowercase every word in a list (say you've already tokenized the text)? Take the list of tokens below and make each one lower case.

In [40]:
tokens = word_tokenize(text)
lowercase_tokens = []
for token in tokens:
    lowercased_version = token.lower()
    lowercase_tokens.append(lowercased_version)
lowercase_tokens

['upper',
 'and',
 'lower',
 'case',
 'characters',
 'can',
 'be',
 'annoying',
 '.',
 'characters',
 'are',
 'the',
 'individual',
 'letters',
 'and',
 'numbers',
 'that',
 'we',
 'see',
 'on',
 'the',
 'page',
 '.',
 'case',
 'folding',
 'is',
 'the',
 'generic',
 'term',
 'we',
 'use',
 'for',
 'dealing',
 'with',
 'upper',
 'and',
 'lower',
 'case',
 'characters',
 '.',
 'lower',
 'case',
 'is',
 'often',
 'what',
 'people',
 'want',
 '.',
 'title',
 'case',
 'refers',
 'to',
 'a',
 'multi-word',
 'expression',
 'with',
 'the',
 'first',
 'character',
 'of',
 'every',
 'word',
 'in',
 'upper',
 'case',
 '.']

#### Removing URLs, digits and hashtags

We rarely care about the exact URL used in a tweet, or the exact number. We could remove them completely (think about how we'd do that), but it's often informative to know that there is a URL or a digit in the text. So we want to replace individual URLs and digits with a symbol that preserves the fact that a URL or digit was there. It's standard to just use the strings "URL" and "DIGIT".

How do we do this? Once again, regular expressions save the day.

In [41]:
url_pattern = r'https?:\/\/.*[\r\n]*'
single_tweet = tweet_text[0]
single_tweet

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z'

In [42]:
URL_SIGN = ' URL '
re.sub(url_pattern, URL_SIGN, single_tweet)

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL '
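
One caveat about the pattern above: `.*` matches everything up to the end of the line, so any words that follow a URL are replaced along with it. That is harmless here because the URL comes last. The sketch below compares it with a tighter pattern that stops at the first whitespace; `tight_url_pattern` and the sample tweet are my own and not part of the original notebook.

```python
import re

URL_SIGN = ' URL '
greedy_url_pattern = r'https?:\/\/.*[\r\n]*'   # the pattern used above
tight_url_pattern = r'https?://\S+'            # assumption: a URL ends at whitespace

tweet = 'Read https://t.co/abc now please'

greedy_result = re.sub(greedy_url_pattern, URL_SIGN, tweet)
tight_result = re.sub(tight_url_pattern, URL_SIGN, tweet)

print(repr(greedy_result))
print(repr(tight_result))
```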

Above we replaced the URL in a single tweet. Now we will replace all the URLs in all tweets in `tweet_text`.

In [43]:
url_pattern = r'https?:\/\/.*[\r\n]*'
URL_SIGN = ' URL '
list_of_url_less_tweets = []
## Using a for loop
for tweet in tweet_text:
    url_less_tweet = re.sub(url_pattern, URL_SIGN, tweet)
    list_of_url_less_tweets.append(url_less_tweet)
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment

In [44]:
## Alternative using list comprehension
list_of_url_less_tweets = [re.sub(url_pattern, URL_SIGN, tweet) for tweet in tweet_text]
list_of_url_less_tweets

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet  URL ',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!!  URL ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__  URL ',
 'RT @IvankaTrump: Such a surreal moment

Now let's remove hashtags and digits.

In [45]:
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
HASHTAG_SIGN = ' HASHTAG '
digit_pattern = r'\d+'
DIGIT_SIGN = ' DIGIT '

In [46]:
no_hashtags = [re.sub(hashtag_pattern, HASHTAG_SIGN, tweet) for tweet in tweet_text]
no_hashtags

['Today we express our deepest gratitude to all those who have served in our armed forces. HASHTAG  https://t.co/wPk7QWpK8Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz2dhrXzo4',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at 9:45pm. HASHTAG  HASHTAG  https://t.co/HfuJeRZ

In [47]:
no_digit = [re.sub(digit_pattern, DIGIT_SIGN, tweet) for tweet in tweet_text]
no_digit

['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk DIGIT QWpK DIGIT Z',
 'Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!',
 'Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!',
 'Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!',
 'A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!',
 'Happy  DIGIT st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz DIGIT dhrXzo DIGIT ',
 'Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before',
 'Watching the returns at  DIGIT : DIGIT p
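
The output above shows that digits inside URLs are replaced too, because each substitution was applied to the raw tweets independently. The following is a sketch of chaining the steps so that URLs are replaced first. The function name `normalize`, the tighter URL pattern, and the sample tweet are assumptions of mine; the hashtag pattern follows the one defined above.

```python
import re

URL_PATTERN = r'https?://\S+'              # assumption: a URL ends at whitespace
HASHTAG_PATTERN = r'(?:^|\s)[＃#](\w+)'
DIGIT_PATTERN = r'\d+'

def normalize(tweet):
    """Replace URLs, then hashtags, then digits, and tidy the whitespace."""
    tweet = re.sub(URL_PATTERN, ' URL ', tweet)        # first, so URL digits survive
    tweet = re.sub(HASHTAG_PATTERN, ' HASHTAG ', tweet)
    tweet = re.sub(DIGIT_PATTERN, ' DIGIT ', tweet)
    return ' '.join(tweet.split())                     # collapse the extra spaces

result = normalize('Happy 241st birthday! #USMC https://t.co/Lz2dhrXzo4')
print(result)
```

Because the URL is already replaced by the time digits are handled, only the number in the tweet text becomes `DIGIT`.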

### Challenge - SOLUTION

I've read in some Amazon reviews from earlier into a list called `reviews`. Each element of the list is a string, representing the text of a single review. Try to:
- Tokenize each review
- Strip all whitespace
- Make all characters lower case
- Replace any URLs and digits

Then find the most common 50 words.

In [53]:
fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
fnames = glob.glob(fnames)
reviews = []
column_names = ['id', 'product_id', 'user_id', 'profile_name', 'helpfulness_num', 'helpfulness_denom',
               'score', 'time', 'summary', 'text']
for fname in fnames[:2]:
    df = pd.read_csv(fname, names=column_names)
    text = list(df['text'])
    reviews.extend(text)

In [54]:
clean = [re.sub(url_pattern, URL_SIGN, review) for review in reviews]
clean = [re.sub(hashtag_pattern, HASHTAG_SIGN, review) for review in clean]
clean = [re.sub(digit_pattern, DIGIT_SIGN, review) for review in clean]
clean = [''.join([ch for ch in review if ch not in punctuation]) for review in clean]
clean = [review.lower() for review in clean]
clean = [review.strip() for review in clean]

tokens = [word_tokenize(review) for review in clean] 
tokens[1]


['okay',
 'how',
 'should',
 'i',
 'say',
 'this',
 'it',
 'does',
 'not',
 'taste',
 'like',
 'cream',
 'cheesebr',
 'it',
 'has',
 'a',
 'weird',
 'almost',
 'chemical',
 'taste',
 'to',
 'it',
 'i',
 'even',
 'stored',
 'it',
 'in',
 'the',
 'frig',
 'to',
 'keep',
 'it',
 'fresh',
 'it',
 'did',
 'not',
 'helpbr',
 'thing',
 'is',
 'i',
 'had',
 'used',
 'it',
 'before',
 'years',
 'ago',
 'and',
 'i',
 'had',
 'thought',
 'it',
 'was',
 'great',
 'thenbr',
 'maybe',
 'i',
 'got',
 'a',
 'bad',
 'batch',
 'or',
 'maybe',
 'they',
 'changed',
 'how',
 'it',
 'being',
 'made',
 'nowbr',
 'i',
 'know',
 'i',
 'wont',
 'be',
 'buying',
 'it',
 'again']
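
Tokens such as `'cheesebr'` and `'helpbr'` appear because the reviews contain HTML line breaks (`<br />`), and stripping punctuation leaves the letters `br` glued to the preceding word. One possible fix, sketched here on a shortened review, is to replace the tags with a space before removing punctuation. The name `br_pattern` is my own.

```python
import re
from string import punctuation

review = 'IT DOES NOT TASTE LIKE CREAM CHEESE...<br />IT HAS A WEIRD TASTE'

# Replace <br>, <br/> and <br /> with a space before punctuation is stripped.
br_pattern = r'<br\s*/?>'
no_tags = re.sub(br_pattern, ' ', review, flags=re.IGNORECASE)
no_punct = ''.join(ch for ch in no_tags if ch not in punctuation)
tokens = no_punct.lower().split()
print(tokens)
```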

In [55]:
from collections import Counter
freq = Counter(tokens[1])  # word counts for the second review only
freq.most_common(50)

[('it', 10),
 ('i', 7),
 ('how', 2),
 ('not', 2),
 ('taste', 2),
 ('a', 2),
 ('to', 2),
 ('had', 2),
 ('maybe', 2),
 ('okay', 1),
 ('should', 1),
 ('say', 1),
 ('this', 1),
 ('does', 1),
 ('like', 1),
 ('cream', 1),
 ('cheesebr', 1),
 ('has', 1),
 ('weird', 1),
 ('almost', 1),
 ('chemical', 1),
 ('even', 1),
 ('stored', 1),
 ('in', 1),
 ('the', 1),
 ('frig', 1),
 ('keep', 1),
 ('fresh', 1),
 ('did', 1),
 ('helpbr', 1),
 ('thing', 1),
 ('is', 1),
 ('used', 1),
 ('before', 1),
 ('years', 1),
 ('ago', 1),
 ('and', 1),
 ('thought', 1),
 ('was', 1),
 ('great', 1),
 ('thenbr', 1),
 ('got', 1),
 ('bad', 1),
 ('batch', 1),
 ('or', 1),
 ('they', 1),
 ('changed', 1),
 ('being', 1),
 ('made', 1),
 ('nowbr', 1)]
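
The counts above cover a single review. To get the most common words across all reviews, one option is to feed every token list into the same `Counter`. The sketch below uses a small hypothetical list of tokenized reviews in place of the real `tokens`.

```python
from collections import Counter

# Hypothetical tokenized reviews standing in for the list of token lists.
tokenized_reviews = [
    ['great', 'taste', 'great', 'price'],
    ['weird', 'taste'],
    ['great', 'snack'],
]

freq = Counter()
for review_tokens in tokenized_reviews:
    freq.update(review_tokens)   # add this review's counts to the running total

top_two = freq.most_common(2)
print(top_two)
```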

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "it", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

- What other stop words do you think there are?

In [56]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### Challenge - SOLUTION

Use the list `stop` of English stopwords to remove stopwords from our tokenized review above.

In [57]:
new_tokens = [token for token in tokens[1] if token not in stop]
new_tokens

['okay',
 'say',
 'taste',
 'like',
 'cream',
 'cheesebr',
 'weird',
 'almost',
 'chemical',
 'taste',
 'even',
 'stored',
 'frig',
 'keep',
 'fresh',
 'helpbr',
 'thing',
 'used',
 'years',
 'ago',
 'thought',
 'great',
 'thenbr',
 'maybe',
 'got',
 'bad',
 'batch',
 'maybe',
 'changed',
 'made',
 'nowbr',
 'know',
 'wont',
 'buying']
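
When filtering many documents, it can help to turn the stop word list into a `set` first, since membership tests on a set do not scan every element the way a list lookup does. This is a sketch with a small hypothetical stop list; in the notebook the set would be built from `stop`.

```python
# Small hypothetical stop list; in the notebook this would be set(stop).
stop_words = {'i', 'it', 'the', 'a', 'to', 'not', 'how', 'should', 'this', 'does'}

review_tokens = ['okay', 'how', 'should', 'i', 'say', 'this',
                 'it', 'does', 'not', 'taste']

kept = [token for token in review_tokens if token not in stop_words]
print(kept)
```

Note that "not" is treated as a stop word here, as it is in the output above, so negation is lost; whether that matters depends on the analysis.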

## Stemming/lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running) than about the fact that it's a third-person present-tense verb or a progressive participle. A stemmer strips affixes by rule and may return something that is not a word, while a lemmatizer maps a word to its dictionary form.

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.
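
To give a feel for the "bunch of regular expressions" idea, here is a sketch of just the first step (step 1a) of Porter's original algorithm, which handles plural endings. It is an illustration only: the real stemmer has several more steps and extra conditions, and NLTK's implementation may differ in details, so the results will not always match `PorterStemmer`. The names `STEP_1A_RULES` and `step_1a` are my own.

```python
import re

# Step 1a of Porter's original algorithm: plural endings, first matching rule wins.
STEP_1A_RULES = [
    (r'sses$', 'ss'),   # caresses -> caress
    (r'ies$', 'i'),     # ponies   -> poni
    (r'ss$', 'ss'),     # caress   -> caress
    (r's$', ''),        # cats     -> cat
]

def step_1a(word):
    """Apply the first matching plural rule; not a full stemmer."""
    for pattern, replacement in STEP_1A_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for w in ['caresses', 'ponies', 'caress', 'grows', 'leaves']:
    print(w, '->', step_1a(w))
```

After this step "leaves" is only reduced to "leave"; later steps of the full algorithm are what strip it further.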

In [58]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [59]:
stemmer.stem('grows')

'grow'

In [60]:
stemmer.stem('running')

'run'

In [61]:
stemmer.stem('leaves')

'leav'

In [67]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
import nltk; nltk.download('wordnet') # Download resource for working with WordNet via NLTK
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [68]:
print(snowballer_stemmer.stem('running'))
print(snowballer_stemmer.stem('leaves'))

run
leav


In [69]:
print(lemmatizer.lemmatize('leaves'))

leaf


### Challenge - SOLUTION

Use the Porter stemmer to stem each word in the tweet dataset after having removed stop words.

In [70]:
tokenized_tweets = [word_tokenize(tweet) for tweet in tweet_text]
all_stemmed = []
for tweet in tokenized_tweets:
    stemmed = [stemmer.stem(t) for t in tweet]
    all_stemmed.append(stemmed)
all_stemmed

[['today',
  'we',
  'express',
  'our',
  'deepest',
  'gratitud',
  'to',
  'all',
  'those',
  'who',
  'have',
  'serv',
  'in',
  'our',
  'arm',
  'forc',
  '.',
  '#',
  'thankavet',
  'http',
  ':',
  '//t.co/wpk7qwpk8z'],
 ['busi',
  'day',
  'plan',
  'in',
  'new',
  'york',
  '.',
  'will',
  'soon',
  'be',
  'make',
  'some',
  'veri',
  'import',
  'decis',
  'on',
  'the',
  'peopl',
  'who',
  'will',
  'be',
  'run',
  'our',
  'govern',
  '!'],
 ['love',
  'the',
  'fact',
  'that',
  'the',
  'small',
  'group',
  'of',
  'protest',
  'last',
  'night',
  'have',
  'passion',
  'for',
  'our',
  'great',
  'countri',
  '.',
  'We',
  'will',
  'all',
  'come',
  'togeth',
  'and',
  'be',
  'proud',
  '!'],
 ['just',
  'had',
  'a',
  'veri',
  'open',
  'and',
  'success',
  'presidenti',
  'elect',
  '.',
  'now',
  'profession',
  'protest',
  ',',
  'incit',
  'by',
  'the',
  'media',
  ',',
  'are',
  'protest',
  '.',
  'veri',
  'unfair',
  '!'],
 ['A',
  'f
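
The challenge asks for stop words to be removed before stemming, while the output above still contains tokens such as `we`, `our` and `the`. Here is a minimal sketch of how the two steps could be chained. The helper name `remove_stopwords_and_stem` and the toy inputs are made up; the stemming function is passed in as a parameter, so `stemmer.stem` from the cells above could be supplied together with `tokenized_tweets` and `stop`. It compares lowercased tokens against the stop list, on the assumption that the list is all lowercase.

```python
def remove_stopwords_and_stem(tokenized_docs, stop_words, stem):
    """Drop stop words from each token list, then stem what is left."""
    stop_set = set(stop_words)  # set membership tests are faster than list tests
    result = []
    for tokens in tokenized_docs:
        # Lowercase only for the comparison, so 'We' is caught by 'we'
        kept = [t for t in tokens if t.lower() not in stop_set]
        result.append([stem(t) for t in kept])
    return result

# Stand-in inputs so the sketch runs on its own (not the tweet data)
toy_docs = [['We', 'love', 'the', 'parks'], ['the', 'dogs', 'run']]
toy_stop = ['we', 'the']

def strip_plural(token):
    """Trivial stand-in for a real stemmer: drop one trailing 's'."""
    return token[:-1] if token.endswith('s') else token

toy_result = remove_stopwords_and_stem(toy_docs, toy_stop, strip_plural)
print(toy_result)
```

With the notebook's own objects, the call would look like `remove_stopwords_and_stem(tokenized_tweets, stop, stemmer.stem)`.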

## POS tagging

POS tagging means assigning each token a part of speech (e.g. noun, verb, adjective). Again, there are many different [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK keeps its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand (although you are free to do it afterwards), because the tagger uses the surrounding words as context.

In [71]:
from nltk import pos_tag
single_review = reviews[3]
single_review

"I love pistachio nuts. So I tried these since the price was good and I am always looking for good products at low prices.<br />I wasn't disappointed. Out of the two jars of nuts, there were less than 5 duds (dried out or bad tasting nuts.<br /><br />There were less than 25 nuts that were not split open so no need for a nut cracker.<br /><br />The rest were very good to excellent in taste, ease of opening shells and amount of salt on them.<br /><br />The only problem was, I couldn't stop eating them so they only lasted less than a week. 3 1/2 pounds of pistachios in less than a week. I sure will be buying them again, but will wait a couple of months and eat pistachio ice cream in the meanwhile.<br /><br />Jewel Foods now has their own pistachio ice cream that is better than Ben and Jerry's and only costs $2.99 a half gallon instead of $4.50 a pint for B&J's. Just as many WHOLE pistachios in the ice cream (unlike the ground up ones Hagen Das uses) and the ice cream itself is creamier an

In [73]:
tokens = word_tokenize(single_review)
import nltk; nltk.download('averaged_perceptron_tagger')
tagged_review = pos_tag(tokens)
tagged_review

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('I', 'PRP'),
 ('love', 'VBP'),
 ('pistachio', 'RB'),
 ('nuts', 'NNS'),
 ('.', '.'),
 ('So', 'RB'),
 ('I', 'PRP'),
 ('tried', 'VBD'),
 ('these', 'DT'),
 ('since', 'IN'),
 ('the', 'DT'),
 ('price', 'NN'),
 ('was', 'VBD'),
 ('good', 'JJ'),
 ('and', 'CC'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('always', 'RB'),
 ('looking', 'VBG'),
 ('for', 'IN'),
 ('good', 'JJ'),
 ('products', 'NNS'),
 ('at', 'IN'),
 ('low', 'JJ'),
 ('prices.', 'NN'),
 ('<', 'NN'),
 ('br', 'NN'),
 ('/', 'NNP'),
 ('>', 'NNP'),
 ('I', 'PRP'),
 ('was', 'VBD'),
 ("n't", 'RB'),
 ('disappointed', 'JJ'),
 ('.', '.'),
 ('Out', 'IN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('two', 'CD'),
 ('jars', 'NNS'),
 ('of', 'IN'),
 ('nuts', 'NNS'),
 (',', ','),
 ('there', 'EX'),
 ('were', 'VBD'),
 ('less', 'JJR'),
 ('than', 'IN'),
 ('5', 'CD'),
 ('duds', 'NNS'),
 ('(', '('),
 ('dried', 'VBN'),
 ('out', 'RP'),
 ('or', 'CC'),
 ('bad', 'JJ'),
 ('tasting', 'VBG'),
 ('nuts.', 'JJ'),
 ('<', 'NNP'),
 ('br', 'NN'),
 ('/', 'NNP'),
 ('>', 'NNP'),
 ('<', 'NNP'),
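
Since the tagger returns a list of `(token, tag)` pairs, stop words can be filtered out after tagging, and tokens can be selected by tag. Below is a minimal sketch using a few pairs copied from the output above. The stop list `mini_stop` is a made-up stand-in for the full `stop` list, and selecting nouns by the `NN` prefix assumes the Penn Treebank tag names seen in the output.

```python
# A few (token, tag) pairs copied from the tagged review above
tagged_sample = [
    ('I', 'PRP'), ('tried', 'VBD'), ('these', 'DT'), ('since', 'IN'),
    ('the', 'DT'), ('price', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
]

# Hypothetical mini stop list, standing in for the full `stop` list
mini_stop = {'i', 'these', 'since', 'the', 'was'}

# Remove stop words AFTER tagging, keeping each token paired with its tag
content_pairs = [(tok, tag) for tok, tag in tagged_sample
                 if tok.lower() not in mini_stop]

# Noun tags in this tag set all start with 'NN' (NN, NNS, NNP, NNPS)
nouns = [tok for tok, tag in content_pairs if tag.startswith('NN')]

print(content_pairs)
print(nouns)
```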


### Challenge - SOLUTION

Below I've read the text of Austen's _Pride and Prejudice_ into a variable called `pride`. Preprocess it using the following steps:

- Strip whitespace
- Replace all numbers with '0'
- Tokenize
- Tag each token with a POS tag

Make sure you know:
- What type is the result?
- What type is each element of the result?
- What type are the elements of the elements of the result?

In [74]:
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()
pride = raw[679:684814]
pride

'It is a truth universally acknowledged, that a single man in possession\nof a good fortune, must be in want of a wife.\n\nHowever little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds\nof the surrounding families, that he is considered the rightful property\nof some one or other of their daughters.\n\n“My dear Mr. Bennet,” said his lady to him one day, “have you heard that\nNetherfield Park is let at last?”\n\nMr. Bennet replied that he had not.\n\n“But it is,” returned she; “for Mrs. Long has just been here, and she\ntold me all about it.”\n\nMr. Bennet made no answer.\n\n“Do you not want to know who has taken it?” cried his wife impatiently.\n\n“_You_ want to tell me, and I have no objection to hearing it.”\n\nThis was invitation enough.\n\n“Why, my dear, you must know, Mrs. Long says that Netherfield is taken\nby a young man of large fortune from the north of England; that he came\ndown on Monday in 

In [75]:
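# `re`, `digit_pattern` and `punctuation` are assumed to come from earlier cells
pride = pride.strip()  # strip leading and trailing whitespace
pride = re.sub(digit_pattern, '0', pride)  # replace each match of digit_pattern with '0'
tokenized = word_tokenize(pride[:1000])  # just tokenize the first 1000 characters to speed things up
# Extra step beyond the list above: drop standalone punctuation tokens
tokenized = [token for token in tokenized if token not in punctuation]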
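
The number-replacement step depends on what `digit_pattern` matches. As a minimal sketch, assume it matches runs of one or more digits. The name `assumed_digit_pattern` is used here so that nothing defined earlier in the notebook is overwritten, and the sample string is taken from the Project Gutenberg header shown at the top of the notebook.

```python
import re

# Assumption: the pattern matches one or more consecutive digits
assumed_digit_pattern = r'\d+'

sample = '  Posting Date: August 26, 2008 [EBook #1342]\n'

# Strip surrounding whitespace, then collapse every run of digits to '0'
normalized = re.sub(assumed_digit_pattern, '0', sample.strip())
print(normalized)
```

Because the pattern matches whole runs, `2008` becomes a single `0` rather than `0000`. A pattern of `\d` without the `+` would replace digit by digit instead.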

In [76]:
tagged = pos_tag(tokenized)
tagged

[('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('truth', 'NN'),
 ('universally', 'RB'),
 ('acknowledged', 'VBD'),
 ('that', 'IN'),
 ('a', 'DT'),
 ('single', 'JJ'),
 ('man', 'NN'),
 ('in', 'IN'),
 ('possession', 'NN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('good', 'JJ'),
 ('fortune', 'NN'),
 ('must', 'MD'),
 ('be', 'VB'),
 ('in', 'IN'),
 ('want', 'NN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('wife', 'NN'),
 ('However', 'RB'),
 ('little', 'JJ'),
 ('known', 'VBN'),
 ('the', 'DT'),
 ('feelings', 'NNS'),
 ('or', 'CC'),
 ('views', 'NNS'),
 ('of', 'IN'),
 ('such', 'JJ'),
 ('a', 'DT'),
 ('man', 'NN'),
 ('may', 'MD'),
 ('be', 'VB'),
 ('on', 'IN'),
 ('his', 'PRP$'),
 ('first', 'JJ'),
 ('entering', 'VBG'),
 ('a', 'DT'),
 ('neighbourhood', 'NN'),
 ('this', 'DT'),
 ('truth', 'NN'),
 ('is', 'VBZ'),
 ('so', 'RB'),
 ('well', 'RB'),
 ('fixed', 'VBN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('minds', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('surrounding', 'VBG'),
 ('families', 'NNS'),
 ('that', 'IN'),
 ('he', 'PRP'),
 ('is',
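
To check the three "Make sure you know" questions, the types can be inspected directly. Here is a small sketch on the first few pairs copied from the output above, plus a tag count with `collections.Counter` as one example of what the structure makes easy. The variable names are made up for the illustration.

```python
from collections import Counter

# First few (token, tag) pairs copied from the tagged output above
tagged_sample = [
    ('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('truth', 'NN'),
    ('universally', 'RB'), ('acknowledged', 'VBD'), ('that', 'IN'),
    ('a', 'DT'), ('single', 'JJ'), ('man', 'NN'),
]

print(type(tagged_sample))        # the result itself
print(type(tagged_sample[0]))     # each element of the result
print(type(tagged_sample[0][0]))  # each element of an element

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged_sample)
print(tag_counts)
```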

## Things we didn't cover
(see `solutions/preprocessing_extra.ipynb` and [this repo](https://github.com/geoffbacon/nlp-with-nltk-spacy/blob/master/03-NLTK.ipynb) for more on these)

- Reading in JSON, HTML, and XML files
- Removing infrequent words
- Named entity recognition
- Syntactic parsing
- Information extraction
- Removing markup from HTML
- Extracting numerical features
- spaCy
- DTM/TF-IDF