The first step in working with text data is to tokenize it into its constituent words. A tokenizer breaks the prose into word tokens, which then serve as the feature space for a machine learning algorithm that classifies or clusters the text. A widely used machine learning package for Python is **`sklearn`**. **`sklearn`** has a built-in tokenizer, but how it tokenizes the words within a document is hidden from the end user. This is not ideal, as the user should dictate how the tokenization occurs. Thankfully, the built-in tokenizer can be overridden with a more sensible one. The question remains: which tokenizer is capable of dealing with text data created by users on social media?

**`nltk`** is another Python package for natural language processing. It is primarily used as a teaching tool, since its code is not optimized for speed. **`nltk`** has several types of tokenizers and can connect to the Stanford Natural Language Processing Group's tools to use their tokenizer, which is quite popular. Below we will investigate the different types of tokenizers available in **`sklearn`** and **`nltk`**, and we will choose the one to use for building the identity algorithm.

# Note
----
You will need to download Java from Oracle for this to work properly. Download both the JDK and the JRE, untar them, and place both folders in `/usr/local/java`. Then, in your `~/.bashrc` file, update your path to be:

* export JAVA_HOME=/usr/local/java/jdk1.8.0_60/bin/java
* export PATH=$PATH:/usr/local/l

In [1]:
import nltk
from nltk.tokenize import (RegexpTokenizer,
                           SpaceTokenizer,
                           TreebankWordTokenizer,
                           WhitespaceTokenizer,
                           WordPunctTokenizer,
                           stanford,
                           word_tokenize)
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In order to use the Stanford Natural Language Processing tools, we must first download the jar files from their website: http://nlp.stanford.edu/. Any of their tools can be downloaded to use as the tokenizer. We will download the **parser** package and store it in a folder called `stanford_models/parser` for this experiment. When instantiating the `StanfordTokenizer` class within **`nltk`**, you must supply it with the path to the jar file.

In [2]:
path_to_jar = 'stanford_models/parser/stanford-parser.jar'
regex_tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
space_tokenizer = SpaceTokenizer()
treebank_tokenizer = TreebankWordTokenizer()
whitespace_tokenizer = WhitespaceTokenizer()
wordpunct_tokenizer = WordPunctTokenizer()
stanford_tokenizer = stanford.StanfordTokenizer(path_to_jar=path_to_jar)

We will use the tokenizers on a labeled dataset that will ultimately be used as the training set for our identity algorithm.

In [3]:
# Collect one dict per line, then build the dataframe once at the end.
rows = []
with open('labeled_data/ME.txt', 'rb') as f:
    for line in f:
        raw = line
        # Fit a fresh CountVectorizer on this line alone to see its tokens.
        cv = CountVectorizer()
        cv.fit_transform([raw])
        rows.append({
            'raw': raw,
            'regex': regex_tokenizer.tokenize(raw),
            'space': space_tokenizer.tokenize(raw),
            'treebank': treebank_tokenizer.tokenize(raw),
            'whitespace': whitespace_tokenizer.tokenize(raw),
            'wordpunct': wordpunct_tokenizer.tokenize(raw),
            # Inlined so the imported `stanford` module is not shadowed.
            'stanford': stanford_tokenizer.tokenize(raw),
            'word': word_tokenize(raw),
            'sklearn': cv.get_feature_names(),
            'label': 'me'
        })
me_df = pd.DataFrame(rows)

Now we compare the tokenizers.

In [4]:
def compare_tokenizers(df, row):
    r"""Prints every tokenizer's output for one row of the dataframe.
    
    Parameters
    ----------
    df : pandas.DataFrame
        The input dataframe.
    row : integer
        The index label of the row to print. With the default integer
        index, labels begin at 0.
    
    Returns
    -------
    None. The row is printed to standard output.

    """
    columns = [c for c in list(df.keys()) if c != 'label']
    for column in columns:
        print column, '\n', df[column][row], '\n', '-'*80

In [5]:
compare_tokenizers(me_df, 1)

raw 
Oh my goodness! I havent heard or thought of this song in ages. Thanks!

--------------------------------------------------------------------------------
regex 
['Oh', 'my', 'goodness', '!', 'I', 'havent', 'heard', 'or', 'thought', 'of', 'this', 'song', 'in', 'ages', '.', 'Thanks', '!'] 
--------------------------------------------------------------------------------
sklearn 
[u'ages', u'goodness', u'havent', u'heard', u'in', u'my', u'of', u'oh', u'or', u'song', u'thanks', u'this', u'thought'] 
--------------------------------------------------------------------------------
space 
[u'Oh', u'my', u'goodness!', u'I', u'havent', u'heard', u'or', u'thought', u'of', u'this', u'song', u'in', u'ages.', u'Thanks!\n'] 
--------------------------------------------------------------------------------
stanford 
[u'Oh', u'my', u'goodness', u'!', u'I', u'havent', u'heard', u'or', u'thought', u'of', u'this', u'song', u'in', u'ages', u'.', u'Thanks', u'!'] 
---------------------------------------

Comparing the different tokenizers shows that there is a **major** problem with the data: it contains no contractions at all, because the apostrophes are missing (note `havent` in the raw text above). This is going to cause issues for the classifier, as it will have no idea how to handle language that has contractions in it. The only way to remedy this is to label more posts that have contractions in them, or to put the apostrophes back into the labeled data. This is an issue that we will explore further when we actually build the classifier.

Notice that all the tokenizers retain punctuation except for the one from **`sklearn`**. Some tokenizers, such as **`space`** and **`whitespace`**, attach the punctuation to the word next to it. This is due to the way these tokenizers are built: they look for whitespace between tokens and assume that each token is sandwiched between whitespace. This is fine for languages that separate words with whitespace; however, it will not work well on languages such as Chinese, which do not.
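
To make the whitespace behaviour concrete, here is a minimal sketch using only Python's standard library. It assumes that splitting on runs of whitespace with `str.split` is a fair stand-in for the **`whitespace`** tokenizer. The English string is the row shown above, and the Chinese sentence is a made-up sample.

```python
# Stand-in for a whitespace tokenizer (an assumption: plain str.split,
# not the nltk class itself).
text = "Oh my goodness! I havent heard or thought of this song in ages. Thanks!"
tokens = text.split()

# Tokens that still carry punctuation glued to a word.
glued = [t for t in tokens if not t.isalnum()]
print(tokens)
print(glued)

# Hypothetical Chinese sample: with no spaces between words, the whole
# sentence comes back as a single token.
chinese_tokens = "我喜欢这首歌".split()
print(len(chinese_tokens))
```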

All the tokenizers except for the one built into **`sklearn`** also retain word case, single-letter words, and ordering. Order is not retained for **`sklearn`** because of what the `CountVectorizer` returns: `get_feature_names()` gives the fitted vocabulary (each distinct token once, sorted alphabetically) rather than the tokens in the order they appeared in the text.
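
The loss of case, single letters, and order can be reproduced without **`sklearn`**. The sketch below is an emulation, not the library's code. It assumes the `CountVectorizer` defaults of lowercasing the input and matching tokens with the pattern `(?u)\b\w\w+\b`, and then builds a sorted set of the matches. The helper name `emulate_vocabulary` is made up for illustration.

```python
import re

# Assumed CountVectorizer default: tokens of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def emulate_vocabulary(text):
    """Return the distinct lowercase tokens, sorted alphabetically."""
    tokens = re.findall(DEFAULT_TOKEN_PATTERN, text.lower())
    return sorted(set(tokens))

post = "Well, you know that usually I do 1 or 2 posts per day"
vocab = emulate_vocabulary(post)
print(vocab)
```

For this post the result lines up with the `sklearn` row printed for row 100: `I`, `1`, and `2` are gone, and the remaining words are in alphabetical rather than sentence order.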

The Stanford tokenizer is able to pick up the periods in the data, as opposed to the **`treebank`** tokenizer. The Stanford tokenizer is also widely used and runs quickly. We will investigate a few other rows to see if it is the winner of this experiment.

In [6]:
compare_tokenizers(me_df, 2)

raw 
When I preach the gospel at a Harvest Crusade, I feel I have a solemn responsibility to give the gospel accuratelyto not distort it, to not take away from it, and to not add to it.

--------------------------------------------------------------------------------
regex 
['When', 'I', 'preach', 'the', 'gospel', 'at', 'a', 'Harvest', 'Crusade', ',', 'I', 'feel', 'I', 'have', 'a', 'solemn', 'responsibility', 'to', 'give', 'the', 'gospel', 'accuratelyto', 'not', 'distort', 'it', ',', 'to', 'not', 'take', 'away', 'from', 'it', ',', 'and', 'to', 'not', 'add', 'to', 'it', '.'] 
--------------------------------------------------------------------------------
sklearn 
[u'accuratelyto', u'add', u'and', u'at', u'away', u'crusade', u'distort', u'feel', u'from', u'give', u'gospel', u'harvest', u'have', u'it', u'not', u'preach', u'responsibility', u'solemn', u'take', u'the', u'to', u'when'] 
--------------------------------------------------------------------------------
space 
[u'When', u'I', u

In [7]:
compare_tokenizers(me_df, 100)

raw 
Well, you know that usually I do 1 or 2 posts per day

--------------------------------------------------------------------------------
regex 
['Well', ',', 'you', 'know', 'that', 'usually', 'I', 'do', '1', 'or', '2', 'posts', 'per', 'day'] 
--------------------------------------------------------------------------------
sklearn 
[u'day', u'do', u'know', u'or', u'per', u'posts', u'that', u'usually', u'well', u'you'] 
--------------------------------------------------------------------------------
space 
[u'Well,', u'you', u'know', u'that', u'usually', u'I', u'do', u'1', u'or', u'2', u'posts', u'per', u'day\n'] 
--------------------------------------------------------------------------------
stanford 
[u'Well', u',', u'you', u'know', u'that', u'usually', u'I', u'do', u'1', u'or', u'2', u'posts', u'per', u'day'] 
--------------------------------------------------------------------------------
treebank 
['Well', ',', 'you', 'know', 'that', 'usually', 'I', 'do', '1', 'or', '2', 'posts

From the samples above, it is clear that we should use the Stanford tokenizer to tokenize the text data. Below are a few examples of outlier text that is not in the training data set but that highlights why the Stanford tokenizer is the clear winner.

In [14]:
def test_tokenizers(s):
    print 'regex\t\t', regex_tokenizer.tokenize(s)
    print 'space\t\t', space_tokenizer.tokenize(s)
    print 'treebank\t', treebank_tokenizer.tokenize(s)
    print 'whitespace\t', whitespace_tokenizer.tokenize(s)
    print 'wordpunct\t', wordpunct_tokenizer.tokenize(s)
    print 'stanford\t', stanford_tokenizer.tokenize(s)
    print 'word\t\t', word_tokenize(s)
    cv = CountVectorizer()
    cv.fit_transform([s])
    print 'sklearn\t\t', cv.get_feature_names()

In [15]:
test_tokenizers("I've been working a lot lately.")

regex		['I', "'ve", 'been', 'working', 'a', 'lot', 'lately', '.']
space		[u"I've", u'been', u'working', u'a', u'lot', u'lately.']
treebank	['I', "'ve", 'been', 'working', 'a', 'lot', 'lately', '.']
whitespace	["I've", 'been', 'working', 'a', 'lot', 'lately.']
wordpunct	['I', "'", 've', 'been', 'working', 'a', 'lot', 'lately', '.']
stanford	[u'I', u"'ve", u'been', u'working', u'a', u'lot', u'lately', u'.']
word		['I', "'ve", 'been', 'working', 'a', 'lot', 'lately', '.']
sklearn		[u'been', u'lately', u'lot', u've', u'working']


Note that **`sklearn`** splits the contraction at the apostrophe and keeps `ve`, but it unfortunately removes single-letter tokens such as `I` and `a`. This can be adjusted by changing how the `CountVectorizer` finds tokens, for example through its `token_pattern` or `tokenizer` parameters.
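
As a rough illustration of the adjustment, the sketch below compares two candidate patterns with the standard `re` module. It assumes the default `token_pattern` is `(?u)\b\w\w+\b`. The relaxed pattern `(?u)\b\w+\b` is one possible replacement, chosen here for illustration only.

```python
import re

sentence = "I've been working a lot lately.".lower()

default_pattern = r"(?u)\b\w\w+\b"  # assumed default: two or more word characters
relaxed_pattern = r"(?u)\b\w+\b"    # hypothetical alternative: one or more

default_tokens = re.findall(default_pattern, sentence)
relaxed_tokens = re.findall(relaxed_pattern, sentence)

print(default_tokens)  # single letters are missing
print(relaxed_tokens)  # 'i' and 'a' are kept
```

Passing a pattern like the second one as `token_pattern` when constructing the `CountVectorizer` would be the corresponding change there.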

In [16]:
test_tokenizers("What about super-long-hyphenated words?")

regex		['What', 'about', 'super', '-long-hyphenated', 'words', '?']
space		[u'What', u'about', u'super-long-hyphenated', u'words?']
treebank	['What', 'about', 'super-long-hyphenated', 'words', '?']
whitespace	['What', 'about', 'super-long-hyphenated', 'words?']
wordpunct	['What', 'about', 'super', '-', 'long', '-', 'hyphenated', 'words', '?']
stanford	[u'What', u'about', u'super-long-hyphenated', u'words', u'?']
word		['What', 'about', 'super-long-hyphenated', 'words', '?']
sklearn		[u'about', u'hyphenated', u'long', u'super', u'what', u'words']


Looks like the Stanford Tokenizer is still winning.
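
The odd `-long-hyphenated` token in the `regex` row comes from the pattern itself. The sketch below replays the same pattern with `re.findall`, on the assumption that this mirrors how `RegexpTokenizer` applies a pattern that has no capture groups. Alternatives are tried left to right at each position, so `\w+` takes `super` and stops at the hyphen, and the catch-all `\S+` then swallows the rest of the word. The second pattern is a hypothetical tweak, not something used elsewhere in this experiment.

```python
import re

sentence = "What about super-long-hyphenated words?"

# Same pattern as the regex tokenizer defined earlier.
PATTERN = r"\w+|\$[\d\.]+|\S+"
original_tokens = re.findall(PATTERN, sentence)
print(original_tokens)

# Hypothetical tweak: let a word continue across internal hyphens
# before falling back to the catch-all \S+.
HYPHEN_PATTERN = r"\w+(?:-\w+)*|\$[\d\.]+|\S+"
hyphen_tokens = re.findall(HYPHEN_PATTERN, sentence)
print(hyphen_tokens)
```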

In [18]:
test_tokenizers(u"Em dashes—rock and roll—are parenthetical.")

regex		[u'Em', u'dashes', u'\u2014rock', u'and', u'roll', u'\u2014are', u'parenthetical', u'.']
space		[u'Em', u'dashes\u2014rock', u'and', u'roll\u2014are', u'parenthetical.']
treebank	[u'Em', u'dashes\u2014rock', u'and', u'roll\u2014are', u'parenthetical', u'.']
whitespace	[u'Em', u'dashes\u2014rock', u'and', u'roll\u2014are', u'parenthetical.']
wordpunct	[u'Em', u'dashes', u'\u2014', u'rock', u'and', u'roll', u'\u2014', u'are', u'parenthetical', u'.']
stanford	[u'Em', u'dashes', u'--', u'rock', u'and', u'roll', u'--', u'are', u'parenthetical', u'.']
word		[u'Em', u'dashes\u2014rock', u'and', u'roll\u2014are', u'parenthetical', u'.']
sklearn		[u'and', u'are', u'dashes', u'em', u'parenthetical', u'rock', u'roll']


And again, the Stanford tokenizer handles the em dashes well: it splits them out as separate tokens and normalizes each one to `--`.

In [19]:
test_tokenizers(u"How about actual (parenthetical statement) parenthesis?")

regex		[u'How', u'about', u'actual', u'(parenthetical', u'statement', u')', u'parenthesis', u'?']
space		[u'How', u'about', u'actual', u'(parenthetical', u'statement)', u'parenthesis?']
treebank	[u'How', u'about', u'actual', u'(', u'parenthetical', u'statement', u')', u'parenthesis', u'?']
whitespace	[u'How', u'about', u'actual', u'(parenthetical', u'statement)', u'parenthesis?']
wordpunct	[u'How', u'about', u'actual', u'(', u'parenthetical', u'statement', u')', u'parenthesis', u'?']
stanford	[u'How', u'about', u'actual', u'-LRB-', u'parenthetical', u'statement', u'-RRB-', u'parenthesis', u'?']
word		[u'How', u'about', u'actual', u'(', u'parenthetical', u'statement', u')', u'parenthesis', u'?']
sklearn		[u'about', u'actual', u'how', u'parenthesis', u'parenthetical', u'statement']


In [21]:
test_tokenizers(u"{foo} :) [bar] —_—")

regex		[u'{foo}', u':)', u'[bar]', u'\u2014_\u2014']
space		[u'{foo}', u':)', u'[bar]', u'\u2014_\u2014']
treebank	[u'{', u'foo', u'}', u':', u')', u'[', u'bar', u']', u'\u2014_\u2014']
whitespace	[u'{foo}', u':)', u'[bar]', u'\u2014_\u2014']
wordpunct	[u'{', u'foo', u'}', u':)', u'[', u'bar', u']', u'\u2014', u'_', u'\u2014']
stanford	[u'-LCB-', u'foo', u'-RCB-', u':-RRB-', u'-LSB-', u'bar', u'-RSB-', u'--', u'_', u'--']
word		[u'{', u'foo', u'}', u':', u')', u'[', u'bar', u']', u'\u2014_\u2014']
sklearn		[u'bar', u'foo']


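None of the tokenizers has an established way of dealing with emoticons: `:)` is either kept only because it happens to sit between spaces, split into `:` and `)`, rewritten as `:-RRB-`, or dropped entirely.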
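
One way to make the treatment of emoticons explicit is to list an emoticon alternative ahead of the word and punctuation alternatives in a regular expression. The sketch below is a minimal illustration under that assumption. The pattern and the function name `tokenize_keep_emoticons` are hypothetical, cover only a handful of simple Western-style emoticons, and are not taken from any of the tokenizers above.

```python
import re

# Hypothetical pattern: eyes, optional nose, mouth.
EMOTICON = r"[:;=][\-o']?[\)\(\]\[DPp/\\|]"
# Emoticons first, then words, then any single punctuation character.
TOKEN_RE = re.compile(EMOTICON + r"|\w+|[^\w\s]")

def tokenize_keep_emoticons(text):
    """Split text into words and punctuation, keeping simple emoticons whole."""
    return TOKEN_RE.findall(text)

brackets = tokenize_keep_emoticons("{foo} :) [bar]")
chatty = tokenize_keep_emoticons("That's gr8 ;-) LOL!")
print(brackets)
print(chatty)
```

Because the emoticon alternative is tried first, `:)` and `;-)` survive as single tokens while brackets and other punctuation are still split off.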

In [22]:
test_tokenizers(u"En-dash 2–3!")

regex		[u'En', u'-dash', u'2', u'\u20133!']
space		[u'En-dash', u'2\u20133!']
treebank	[u'En-dash', u'2\u20133', u'!']
whitespace	[u'En-dash', u'2\u20133!']
wordpunct	[u'En', u'-', u'dash', u'2', u'\u2013', u'3', u'!']
stanford	[u'En-dash', u'2', u'--', u'3', u'!']
word		[u'En-dash', u'2\u20133', u'!']
sklearn		[u'dash', u'en']


In [23]:
test_tokenizers(u"That's gr8! LOL!!!!!")

regex		[u'That', u"'s", u'gr8', u'!', u'LOL', u'!!!!!']
space		[u"That's", u'gr8!', u'LOL!!!!!']
treebank	[u'That', u"'s", u'gr8', u'!', u'LOL', u'!', u'!', u'!', u'!', u'!']
whitespace	[u"That's", u'gr8!', u'LOL!!!!!']
wordpunct	[u'That', u"'", u's', u'gr8', u'!', u'LOL', u'!!!!!']
stanford	[u'That', u"'s", u'gr8', u'!', u'LOL', u'!!!!!']
word		[u'That', u"'s", u'gr8', u'!', u'LOL', u'!', u'!', u'!', u'!', u'!']
sklearn		[u'gr8', u'lol', u'that']


In [25]:
test_tokenizers(u"reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy???????")

regex		[u'reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy', u'???????']
space		[u'reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy???????']
treebank	[u'reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy', u'?', u'?', u'?', u'?', u'?', u'?', u'?']
whitespace	[u'reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy???????']
wordpunct	[u'reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy', u'???????']
stanford	[u'reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy', u'???????']
word		[u'reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy', u'?', u'?', u'?', u'?', u'?', u'?', u'?']
sklearn		[u'reeeeeeeeaaaaaaaaaalllllllllyyyyyyyyyyyy']
