# Cleaning text

Here, we're going to go over some basic text cleaning steps in Python.

In [1]:
raw_docs = ["Here are some very simple basic sentences.",
"They won't be very interesting, I'm afraid.",
"The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."]

### Tokenizing text into bags of words

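NLTK makes it easy to split documents (plain strings) into lists of word tokens, a process called tokenizing. Note that word_tokenize relies on NLTK's Punkt tokenizer data, which has to be downloaded once with nltk.download().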

In [3]:
from nltk.tokenize import word_tokenize

tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
print(tokenized_docs)
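
To see roughly what a tokenizer is doing, here is a minimal regex-based sketch. It is not how word_tokenize works (NLTK's tokenizer, for example, splits "won't" into "wo" and "n't"); the simple_tokenize helper and its pattern are assumptions made purely for illustration.

```python
import re

# Hypothetical helper, not part of NLTK: a token is either a word
# (optionally with one internal apostrophe part) or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = simple_tokenize("They won't be very interesting, I'm afraid.")
print(tokens)
```

Punctuation marks come out as their own tokens, which is why a separate punctuation-removal step is usually applied afterwards.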

### Removing punctuation

Punctuation helps a tokenizer find word and sentence boundaries, but once the text has been tokenized there is usually no reason to keep it around. There are tons of ways to remove punctuation. Since we have already learned regex, how would we do this?

In [None]:
import re
import string
regex = re.compile('[%s]' % re.escape(string.punctuation)) #see documentation here: http://docs.python.org/2/library/string.html

tokenized_docs_no_punctuation = []

for doc in tokenized_docs:
    new_doc = []
    for token in doc:
        # strip every punctuation character from the token
        new_token = regex.sub('', token)
        # tokens that were pure punctuation are now empty, so skip them
        if new_token:
            new_doc.append(new_token)
    tokenized_docs_no_punctuation.append(new_doc)
    
print(tokenized_docs_no_punctuation)
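
As one of those other ways, the same cleanup can be sketched without regex by using str.translate. This assumes Python 3 (where str.maketrans accepts a third argument listing characters to delete); strip_punctuation and the sample tokens are made up for illustration.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation characters and drop tokens that end up empty."""
    stripped = (token.translate(PUNCT_TABLE) for token in tokens)
    return [token for token in stripped if token]

sample_tokens = ['They', 'wo', "n't", 'be', ',', '_learn', '*very', '.']
cleaned_tokens = strip_punctuation(sample_tokens)
print(cleaned_tokens)
```

Note that string.punctuation includes the underscore and the asterisk, so markup-like characters such as those in the third example sentence are removed too.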

### Cleaning text of stopwords

Stopwords are very common words (such as "the", "of", and "is") that carry little meaning on their own. NLTK comes with stopword lists for many languages.

In [None]:
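from nltk.corpus import stopwords

# Build the stopword set once; a set lookup is much faster than
# re-reading the list for every word.
english_stopwords = set(stopwords.words('english'))

tokenized_docs_no_stopwords = []
for doc in tokenized_docs_no_punctuation:
    new_term_vector = []
    for word in doc:
        # NLTK's stopwords are lowercase, so compare case-insensitively
        if word.lower() not in english_stopwords:
            new_term_vector.append(word)
    tokenized_docs_no_stopwords.append(new_term_vector)

print(tokenized_docs_no_stopwords)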
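
The filtering step can also be written as a single list comprehension. The sketch below uses a tiny hand-picked stopword set (DEMO_STOPWORDS, an assumption for illustration, not NLTK's list) so that it runs without any corpus download. NLTK's English stopwords are lowercase, so the token is lowercased before the lookup.

```python
# Hypothetical mini stopword list; in practice you would use
# set(stopwords.words('english')) from NLTK instead.
DEMO_STOPWORDS = {'here', 'are', 'some', 'very', 'the', 'of', 'is', 'to', 'on'}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stop_words]

doc = ['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences']
content_words = remove_stopwords(doc, DEMO_STOPWORDS)
print(content_words)
```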


### Stemming and Lemmatizing

If you have taken linguistics, you may be familiar with morphology, the study of the internal structure of words, including how inflected forms are built from a root. If you want to reduce a word to its base form, you can apply a stemmer (which strips affixes by rule and may produce non-words) or a lemmatizer (which maps a word to its dictionary form). Here are three very popular methods ready to go right out of the NLTK box. It's up to you to see which one you want to use.

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()

preprocessed_docs = []

for doc in tokenized_docs_no_stopwords:
    final_doc = []
    for word in doc:
        final_doc.append(porter.stem(word))
        #final_doc.append(snowball.stem(word))
        #final_doc.append(wordnet.lemmatize(word)) # note that lemmatize() can also take a part of speech as an argument!
    preprocessed_docs.append(final_doc)

print(preprocessed_docs)
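
To get an intuition for why stemmers can return non-words, here is a deliberately crude suffix-stripping sketch. It is not the Porter or Snowball algorithm (those apply many more rules and conditions); crude_stem and its suffix list are assumptions invented for this illustration.

```python
# Toy suffix rules, checked in order. Real stemmers use far more rules.
SUFFIXES = ('ing', 'ed', 'es', 's')

def crude_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

for w in ['sentences', 'cleaning', 'works', 'basic']:
    print(w, '->', crude_stem(w))
```

A lemmatizer, by contrast, looks the word up in a dictionary (WordNet, in NLTK's case) and returns a real word.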

### Remember how we made a list of review texts in Text 1?

Create a new list called clean_reviews that contains the review texts after they have been tokenized, stripped of punctuation and stopwords, and stemmed or lemmatized.

In [None]:
import os
import csv

#os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text2/extra/')

with open('amazon/sociology_2010.csv', newline='') as csvfile:  # Python 3: text mode with newline='' for the csv module
    amazon_reader = csv.DictReader(csvfile, delimiter=',')
    amazon_reviews = [row['review_text'] for row in amazon_reader]
    
    #your code here!!!
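
One possible shape for a solution is sketched below, assuming Python 3. The tokenizer, stopword set, and stemmer are passed in as arguments so the sketch runs without NLTK; clean_document and the stand-in arguments in the demo call are hypothetical names, not part of any library.

```python
import re
import string

PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))

def clean_document(text, tokenize, stop_words, stem):
    """Tokenize text, strip punctuation, drop stopwords, then stem."""
    cleaned = []
    for token in tokenize(text):
        token = PUNCT_RE.sub('', token)
        # skip tokens that were pure punctuation or are stopwords
        if token and token.lower() not in stop_words:
            cleaned.append(stem(token))
    return cleaned

# Stand-ins for illustration: whitespace splitting, a two-word stopword
# set, and lowercasing in place of a real stemmer.
demo = clean_document("The reviews were great!", str.split, {'the', 'were'}, str.lower)
print(demo)
```

With NLTK available, you could pass word_tokenize, set(stopwords.words('english')), and porter.stem instead, and build clean_reviews by calling the function on each entry of amazon_reviews.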

### Removing HTML entities and tags

Recall that HTML entities (such as &quot; for a double quote) are an artifact from the pre-Unicode era. Browsers render them as the characters they stand for, but in text we want to analyze they are just noise, so we convert them back to those characters.

Here's some code that will do this for you (function courtesy of Fredrik Lundh, the author of ElementTree).

In [None]:
import re
from html.entities import name2codepoint

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.
# AUTHOR: Fredrik Lundh (ported here to Python 3: chr and html.entities
# replace unichr and htmlentitydefs)

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                else:
                    return chr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = chr(name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

test_string ="<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)"

print(test_string)
print(unescape(test_string))
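
On Python 3.4 and later, the standard library covers this case directly with html.unescape, which handles named, decimal, and hexadecimal references. A minimal sketch (the sample string is made up for illustration):

```python
import html

sample = "the &quot;Chicken Soup for the Soul&quot; series &amp; caf&#233;"

# html.unescape converts named and numeric character references
decoded = html.unescape(sample)
print(decoded)
```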

In [None]:
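from bs4 import BeautifulSoup

# nltk.clean_html() was removed in NLTK 3, which points to BeautifulSoup's
# get_text() instead (this assumes the beautifulsoup4 package is installed).
BeautifulSoup(unescape(test_string), 'html.parser').get_text() # returns a Unicode string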
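
Tag stripping can also be sketched with only the Python 3 standard library by subclassing html.parser.HTMLParser and collecting the text it reports. TextExtractor and strip_tags are hypothetical names for this illustration, and the sketch makes no attempt to handle script or style blocks specially.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content and ignores the tags around it."""

    def __init__(self):
        # convert_charrefs=True also decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(markup):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

plain = strip_tags("<p>Hello <b>world</b> &amp; friends</p>")
print(plain)
```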