# Chapter 10 Textual Data

You may not be used to thinking about _text_, like an e-mail or a newspaper article, as data. But just as we might want to predict the price of a home or group wines into similar types, we might want to predict the sender of an e-mail or group articles into similar types. To leverage the machine learning techniques we have already learned, we will need a way to convert raw text into tabular form. This chapter introduces some principles for doing this.

# 10.1 Bag of Words and N-Grams

In data science, a text is typically called a **document**, even though a document can be anything from a text message to a full-length novel.  A collection of documents is called a **corpus**. In this chapter, we will work with a corpus of text messages, which contains both spam and non-spam ("ham") messages.

In [1]:
import pandas as pd
pd.options.display.max_rows = 10

texts = pd.read_csv(
    "https://raw.githubusercontent.com/dlsun/data-science-book/master/data/SMSSpamCollection.txt", 
    sep="\t",
    names=["label", "text"]
)
texts

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


We might, for example, want to train a classifier to predict whether or not a text message is spam. To use machine learning techniques like $k$-nearest neighbors, we have to transform each of these "documents" into a more regular representation.

A **bag of words** representation reduces a document to just the multiset of its words, ignoring grammar and word order. (A _multiset_ is like a set, except elements are allowed to appear more than once.)

So, for example, the **bag of words** representation of the string "I am Sam. Sam I am." would be `{I, I, am, am, Sam, Sam}`. In Python, it is easiest to represent multisets using dictionaries, where the keys are the (unique) words and the values are the counts. So we would represent the above bag of words as `{"I": 2, "am": 2, "Sam": 2}`.
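
To make the dictionary representation concrete, here is a minimal sketch that builds the counts by hand for the example above. The names `tokens` and `bag` are only illustrative, and the list of words is typed in directly rather than split from the string.

```python
# Words of "I am Sam. Sam I am." with the punctuation already removed
tokens = ["I", "am", "Sam", "Sam", "I", "am"]

bag = {}
for token in tokens:
    # .get() returns 0 the first time a word is seen
    bag[token] = bag.get(token, 0) + 1

print(bag)  # {'I': 2, 'am': 2, 'Sam': 2}
```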

Let's convert the text messages to a bag of words representation. To do this, we will use the `Counter` object in the `collections` module of the Python standard library. First, let's see how the `Counter` works.

In [2]:
from collections import Counter
Counter(["I", "am", "Sam", "Sam", "I", "am"])

Counter({'I': 2, 'am': 2, 'Sam': 2})

`Counter` takes in a list and returns a dictionary of counts, which is exactly the bag of words representation that we want. But to use `Counter`, we first have to convert each text into a list of words. We can do this using the string methods in pandas, such as `.str.split()`, which splits each string into a list on some separator (whitespace, by default).

In [6]:
texts["text"].str.split()

0       [Go, until, jurong, point,, crazy.., Available...
1                    [Ok, lar..., Joking, wif, u, oni...]
2       [Free, entry, in, 2, a, wkly, comp, to, win, F...
3       [U, dun, say, so, early, hor..., U, c, already...
4       [Nah, I, don't, think, he, goes, to, usf,, he,...
                              ...                        
5567    [This, is, the, 2nd, time, we, have, tried, 2,...
5568        [Will, ü, b, going, to, esplanade, fr, home?]
5569    [Pity,, *, was, in, mood, for, that., So...any...
5570    [The, guy, did, some, bitching, but, I, acted,...
5571                    [Rofl., Its, true, to, its, name]
Name: text, Length: 5572, dtype: object

There are several problems with this approach:

- **It is case-sensitive.**  The word "the" in message 5567 and the word "The" in message 5570 are technically different strings and will be treated as different words by the `Counter`.
- **There is punctuation.**  For example, in message 0, one of the words is "point,". This will be treated differently from the word "point".

We can **normalize** the text by:

- converting all of the characters to lowercase, using the `.str.lower()` method
- stripping punctuation using a regular expression. The regular expression `[^\w\s]` matches any single character that is not (`^`) a word character (`\w`: a letter, digit, or underscore) or whitespace (`\s`). That is, it matches punctuation and other symbols. We will then use the `.str.replace()` method to replace every match with the empty string, effectively removing all punctuation from the string.

By chaining these commands together, we obtain a list of words for each message, to which we can apply the `Counter` to obtain the bag of words representation.

In [7]:
words = (
    texts["text"].
    str.lower().
    # regex=True is required in pandas 2.0+, where the default is a literal match
    str.replace(r"[^\w\s]", "", regex=True).
    str.split()
)

words

0       [go, until, jurong, point, crazy, available, o...
1                          [ok, lar, joking, wif, u, oni]
2       [free, entry, in, 2, a, wkly, comp, to, win, f...
3       [u, dun, say, so, early, hor, u, c, already, t...
4       [nah, i, dont, think, he, goes, to, usf, he, l...
                              ...                        
5567    [this, is, the, 2nd, time, we, have, tried, 2,...
5568         [will, ü, b, going, to, esplanade, fr, home]
5569    [pity, was, in, mood, for, that, soany, other,...
5570    [the, guy, did, some, bitching, but, i, acted,...
5571                     [rofl, its, true, to, its, name]
Name: text, Length: 5572, dtype: object
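
To see what the two normalization steps do to a single string, here is a small sketch that uses Python's built-in `re` module instead of pandas. The sample string and the helper name `normalize` are made up for illustration.

```python
import re

def normalize(text):
    """Lowercase a string, strip punctuation, and split on whitespace."""
    text = text.lower()
    # remove every character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

tokens = normalize("Go until jurong point, crazy.. So...any other ideas?")
print(tokens)
# ['go', 'until', 'jurong', 'point', 'crazy', 'soany', 'other', 'ideas']
```

One side effect is worth noticing: because punctuation is deleted rather than replaced by a space, "So...any" collapses into the single token `soany`, just as in message 5569 above.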

In [8]:
words.apply(Counter)

0       {'go': 1, 'until': 1, 'jurong': 1, 'point': 1,...
1       {'ok': 1, 'lar': 1, 'joking': 1, 'wif': 1, 'u'...
2       {'free': 1, 'entry': 2, 'in': 1, '2': 1, 'a': ...
3       {'u': 2, 'dun': 1, 'say': 2, 'so': 1, 'early':...
4       {'nah': 1, 'i': 1, 'dont': 1, 'think': 1, 'he'...
                              ...                        
5567    {'this': 1, 'is': 2, 'the': 2, '2nd': 1, 'time...
5568    {'will': 1, 'ü': 1, 'b': 1, 'going': 1, 'to': ...
5569    {'pity': 1, 'was': 1, 'in': 1, 'mood': 1, 'for...
5570    {'the': 1, 'guy': 1, 'did': 1, 'some': 1, 'bit...
5571    {'rofl': 1, 'its': 2, 'true': 1, 'to': 1, 'nam...
Name: text, Length: 5572, dtype: object

## N-Grams

The problem with the bag of words representation is that the ordering of the words is lost. For example, the following sentences have the exact same bag of words representation, but convey different meanings:

1. The dog bit her owner.
2. Her dog bit the owner.

The first sentence has only two actors (the dog and its owner), but the second sentence has three (a woman, her dog, and the owner of something). To better capture the meaning of these two documents, we can use **bigrams** instead of individual words. A **bigram** is simply a pair of consecutive words. The "bags of bigrams" of the two sentences above are quite different:

1. {"The dog", "dog bit", "bit her", "her owner"}
2. {"Her dog", "dog bit", "bit the", "the owner"}

They share only 1 bigram (out of 4), even though they contain the same 5 words (ignoring capitalization).

Let's get the bag of bigrams representation for the words above. To generate the bigrams from the list of words, we will use the `zip` function in Python, which takes in two lists and returns an iterator over pairs (consisting of one element from each list). Wrapping the result in `list` lets us see the pairs:

In [9]:
list(zip([1, 2, 3], [4, 5, 6]))

[(1, 4), (2, 5), (3, 6)]

In [10]:
def get_bigrams(words):
    # For a list of n words, we need to line up the words as follows:
    #   words[0],   words[1]
    #   words[1],   words[2]
    #        ... ,  ...
    #   words[n-2], words[n-1]
    return zip(words[:-1], words[1:])

words.apply(get_bigrams).apply(Counter)

0       {('go', 'until'): 1, ('until', 'jurong'): 1, (...
1       {('ok', 'lar'): 1, ('lar', 'joking'): 1, ('jok...
2       {('free', 'entry'): 1, ('entry', 'in'): 1, ('i...
3       {('u', 'dun'): 1, ('dun', 'say'): 1, ('say', '...
4       {('nah', 'i'): 1, ('i', 'dont'): 1, ('dont', '...
                              ...                        
5567    {('this', 'is'): 1, ('is', 'the'): 1, ('the', ...
5568    {('will', 'ü'): 1, ('ü', 'b'): 1, ('b', 'going...
5569    {('pity', 'was'): 1, ('was', 'in'): 1, ('in', ...
5570    {('the', 'guy'): 1, ('guy', 'did'): 1, ('did',...
5571    {('rofl', 'its'): 1, ('its', 'true'): 1, ('tru...
Name: text, Length: 5572, dtype: object
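
As a quick check of the two dog sentences above, the following sketch compares their bags of words and their bags of bigrams. It assumes the sentences have been lowercased and stripped of the final period by hand; the helper `bigrams` is a list-returning variant of `get_bigrams` written only for this example.

```python
from collections import Counter

def bigrams(words):
    # pair each word with the word that follows it
    return list(zip(words[:-1], words[1:]))

s1 = "the dog bit her owner".split()
s2 = "her dog bit the owner".split()

# same words with the same counts
same_bag = Counter(s1) == Counter(s2)
print(same_bag)  # True

# bigrams that occur in both sentences
shared = set(bigrams(s1)) & set(bigrams(s2))
print(shared)  # {('dog', 'bit')}
```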

Instead of taking 2 words at a time, we could take 3, 4, or, in general, $n$ words. 
A tuple of $n$ consecutive words is called an $n$-gram, and we can convert any document to a "bag of $n$-grams" representation. 

The larger $n$ is, the more of the word order the representation preserves. But larger $n$ also makes each individual $n$-gram rarer: if $n$ is so large that hardly any $n$-gram occurs more than once, then we will not learn much from this representation.
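
The bigram code generalizes to any $n$ by zipping together $n$ shifted copies of the word list. The sketch below is one possible way to write this; the function name `get_ngrams` is an assumption for illustration and does not appear earlier in the chapter. It also counts how many distinct $n$-grams occur more than once in a tiny example, to illustrate how repeats become scarce as $n$ grows.

```python
from collections import Counter

def get_ngrams(words, n):
    """Return the list of n-grams (tuples of n consecutive words)."""
    # zip stops at the shortest slice, so this yields len(words) - n + 1 tuples
    return list(zip(*(words[i:] for i in range(n))))

words = "i am sam sam i am".split()

summary = {}
for n in [1, 2, 3]:
    counts = Counter(get_ngrams(words, n))
    repeated = sum(1 for c in counts.values() if c > 1)
    # (number of distinct n-grams, number that occur more than once)
    summary[n] = (len(counts), repeated)

print(summary)  # {1: (3, 3), 2: (4, 1), 3: (4, 0)}
```

In this toy document, every word repeats, only one bigram repeats, and no trigram repeats.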

# Exercises

**Exercise 1.** Read in the OKCupid data set (`/data301/data/okcupid/profiles.csv`). Convert the users' responses to `essay0` ("self summary") into a bag of words representation.

(_Hint:_ Test your code on the first 100 users before testing it on the entire data set.)

In [78]:
profiles_df = pd.read_csv("/data301/data/okcupid/profiles.csv")

In [79]:
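words = (
    profiles_df.essay0.
    str.lower().
    fillna("").
    # strip HTML tags first, then punctuation (regex=True is needed in pandas 2.0+)
    str.replace(r"<.*?>", "", regex=True).
    str.replace(r"[^\w\s]", "", regex=True).
    str.split()
)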

words

0        [about, me, i, would, love, to, think, that, i...
1        [i, am, a, chef, this, is, what, that, means, ...
2        [im, not, ashamed, of, much, but, writing, pub...
3           [i, work, in, a, library, and, go, to, school]
4        [hey, hows, it, going, currently, vague, on, t...
                               ...                        
59941    [vibrant, expressive, caring, optimist, i, lov...
59942    [im, nick, i, never, know, what, to, write, ab...
59943    [hello, i, enjoy, traveling, watching, movies,...
59944    [all, i, have, in, this, world, are, my, balls...
59945    [is, it, odd, that, having, a, little, enemy, ...
Name: essay0, Length: 59946, dtype: object

In [80]:
words.apply(Counter)

0        {'about': 4, 'me': 5, 'i': 8, 'would': 2, 'lov...
1        {'i': 13, 'am': 7, 'a': 5, 'chef': 1, 'this': ...
2        {'im': 3, 'not': 1, 'ashamed': 1, 'of': 10, 'm...
3        {'i': 1, 'work': 1, 'in': 1, 'a': 1, 'library'...
4        {'hey': 1, 'hows': 1, 'it': 1, 'going': 1, 'cu...
                               ...                        
59941    {'vibrant': 1, 'expressive': 1, 'caring': 1, '...
59942    {'im': 4, 'nick': 1, 'i': 4, 'never': 2, 'know...
59943    {'hello': 1, 'i': 9, 'enjoy': 2, 'traveling': ...
59944    {'all': 1, 'i': 2, 'have': 1, 'in': 1, 'this':...
59945    {'is': 2, 'it': 1, 'odd': 1, 'that': 3, 'havin...
Name: essay0, Length: 59946, dtype: object

**Exercise 2.** The text of _Green Eggs and Ham_ by Dr. Seuss can be found at `https://raw.githubusercontent.com/dlsun/data-science-book/master/data/drseuss/greeneggsandham.txt`. Read in this file and convert this "document" into a bag of trigrams (3-grams) representation. Which trigram appears most often? Some code has been provided to get you started.

In [88]:
seuss_df = pd.read_fwf("https://raw.githubusercontent.com/dlsun/data-science-book/master/data/drseuss/greeneggsandham.txt", header=None)
seuss_df

| | 0 |
|---|---|
| 0 | I am Sam |
| 1 | I am Sam |
| 2 | Sam I am |
| 3 | That Sam-I-am |
| 4 | That Sam-I-am! |
| ... | ... |
| 153 | I do so like |
| 154 | green eggs and ham! |
| 155 | Thank you! |
| 156 | Thank you, |


In [91]:
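def get_trigrams(words):
    # line up each word with the two words that follow it
    return zip(words[:-2], words[1:-1], words[2:])

words = (
    seuss_df.iloc[:, 0].
    str.lower().
    fillna("").
    # regex=True is needed in pandas 2.0+, where the default is a literal match
    str.replace(r"<.*?>", "", regex=True).
    str.replace(r"[^\w\s]", "", regex=True).
    str.split()
)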

words

0                 [i, am, sam]
1                 [i, am, sam]
2                 [sam, i, am]
3               [that, samiam]
4               [that, samiam]
                ...           
153          [i, do, so, like]
154    [green, eggs, and, ham]
155               [thank, you]
156               [thank, you]
157                   [samiam]
Name: 0, Length: 158, dtype: object

In [102]:
words.apply(get_trigrams).apply(Counter)

0                                {('i', 'am', 'sam'): 1}
1                                {('i', 'am', 'sam'): 1}
2                                {('sam', 'i', 'am'): 1}
3                                                     {}
4                                                     {}
                             ...                        
153      {('i', 'do', 'so'): 1, ('do', 'so', 'like'): 1}
154    {('green', 'eggs', 'and'): 1, ('eggs', 'and', ...
155                                                   {}
156                                                   {}
157                                                   {}
Name: 0, Length: 158, dtype: object