# Tokenization

This notebook introduces some of the basic tools and ideas for working with natural language (text), namely tokenization. Tokenization is the process of turning text into a sequence of words, with punctuation and common (stop) words removed.

## Imports

In [1]:
import types

## Tokenization

In [2]:
PUNCTUATION = '`~!@#$%^&*()_-+={[}]|\\:;"<,>.?/}\t\n'

Write a generator function, `remove_punctuation`, that removes punctuation from an iterator of words and yields the cleaned words:

* Strip the punctuation characters at the beginning and end of each word.
* Replace the character `-` by a space if found in the middle of the word and split on that white space to yield multiple words.
* If a word is all punctuation, don't yield it at all.
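
As a rough illustration of these three rules applied to one word at a time, here is a minimal sketch built on `str.strip`. The helper name `clean_word` and the shortened punctuation string `PUNCT` are hypothetical and exist only for this example; they are not part of the interface the exercise asks for.

```python
# Shortened punctuation set, assumed only for this illustration.
PUNCT = '!?.,;:-"()'

def clean_word(word, punctuation=PUNCT):
    """Return the list of cleaned pieces for a single word (hypothetical helper)."""
    # Rule 1: strip punctuation from both ends of the word.
    core = word.strip(punctuation)
    # Rule 2: an interior hyphen becomes a space, and we split on whitespace.
    pieces = core.replace('-', ' ').split()
    # Rule 3: an all-punctuation word strips down to '', so pieces is empty.
    return pieces

print(clean_word('!data;'))          # ['data']
print(clean_word('!data-science:'))  # ['data', 'science']
print(clean_word('!!'))              # []
```

A generator version would loop over the incoming words and `yield` each piece instead of returning a list.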

In [3]:
def remove_punctuation(words, punctuation=PUNCTUATION):
    """Remove punctuation from an iterator of words, yielding the results."""
    # Iterate directly so that any iterable works, not only indexable lists.
    for word in words:
        for char in punctuation:
            # A hyphen becomes a space so hyphenated words split apart.
            # Every other punctuation character is deleted outright
            # (anywhere in the word, not only at its ends).
            word = word.replace(char, ' ' if char == '-' else '')
        # split() drops empty pieces, so all-punctuation words yield nothing.
        yield from word.split()

In [4]:
assert list(remove_punctuation(['!data;']))==['data']
assert list(remove_punctuation(['!data-science:']))==['data', 'science']
assert list(remove_punctuation(['!!']))==[]
assert list(remove_punctuation(['  ']))==[]
assert list(remove_punctuation(['\n']))==[]
assert isinstance(remove_punctuation(['!!']), types.GeneratorType)


Write a generator function, `lower_words`, that makes each word in an iterator lowercase, yielding each lowercase word:

In [5]:
def lower_words(words):
    """Make each word in an iterator lowercase."""
    # YOUR CODE HERE
    return (word.lower() for word in words)

In [6]:
assert isinstance(lower_words('AAA'), types.GeneratorType)
assert list(lower_words('This IS NOT LoWerCaSe'.split(' ')))==['this', 'is', 'not', 'lowercase']

[Stop words](https://en.wikipedia.org/wiki/Stop_words) are common words in text that are typically filtered out when performing natural language processing. Typical stop words are *and*, *of*, *a*, *the*, etc.
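
As a small illustration of the idea, the sketch below filters a sentence against a tiny stop-word set. The set `STOP` and the sample sentence are made up for this example; real stop-word lists are much longer.

```python
STOP = {'and', 'of', 'a', 'the'}  # tiny illustrative set, not a real list

sentence = 'the rise and fall of a city'
# Keep only the words that are not in the stop-word set.
kept = [w for w in sentence.split() if w not in STOP]
print(kept)  # ['rise', 'fall', 'city']
```

A `set` is used here because membership tests on a set do not slow down as the stop-word list grows.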

Write a generator function, `remove_stop_words`, that removes stop words from an iterator, yielding the results:

In [7]:
def remove_stop_words(words, stop_words=None):
    """Remove the stop words from an iterator of words.
    
    stop_words can be provided as a list of words or a whitespace separated string of words.
    """
    if stop_words is None:
        stop_words = ()
    elif isinstance(stop_words, str):
        stop_words = stop_words.split()
    # A set gives constant-time membership tests for long stop-word lists.
    stop_words = set(stop_words)

    for word in words:
        if word not in stop_words:
            yield word

In [8]:
assert list(remove_stop_words('the begin to the end a of the day'.split(' '), stop_words='a the')) == \
    ['begin', 'to', 'end', 'of', 'day']
assert list(remove_stop_words('the begin to the end a of the day'.split(' '), stop_words=['a', 'the'])) == \
    ['begin', 'to', 'end', 'of', 'day']
assert list(remove_stop_words('the begin to the end a of the day'.split(' '))) == \
    ['the', 'begin', 'to', 'the', 'end', 'a', 'of', 'the', 'day']

[Tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking a string or line of text and returning a sequence of words, or *tokens*, with the following transforms applied:

* Punctuation removed
* All words lowercased
* Stop words removed
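
These transforms can be chained lazily, so that each word passes through every stage before the next word is read. The sketch below uses plain generator expressions with a made-up punctuation string and stop-word set. It only strips punctuation at the ends of words and ignores hyphen splitting, so it is a simplification of the full pipeline, and the name `simple_tokens` is hypothetical.

```python
def simple_tokens(line, stop=frozenset({'the', 'is'}), punct='.,;:!?'):
    """Return a generator of lowercase, non-stop words (simplified sketch)."""
    stripped = (w.strip(punct) for w in line.split())  # punctuation removed from word ends
    lowered = (w.lower() for w in stripped if w)       # lowercased, empty strings dropped
    return (w for w in lowered if w not in stop)       # stop words removed

print(list(simple_tokens('This, is the way; that things will end')))
# ['this', 'way', 'that', 'things', 'will', 'end']
```

Nothing is computed until the result is consumed, for example by `list(...)` or a `for` loop.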

Write a generator function, `tokenize_line`, that yields tokenized words from an input line of text.

In [9]:
def tokenize_line(line, stop_words=None, punctuation=PUNCTUATION):
    """Split a string into words, yielding them with punctuation and stop words removed."""
    words = line.split()
    words = remove_punctuation(words, punctuation)
    words = lower_words(words)
    # Delegate to the last stage so that this is itself a generator function.
    yield from remove_stop_words(words, stop_words)

In [10]:
assert isinstance(tokenize_line("This, is the way; that things will end"), types.GeneratorType)
assert list(tokenize_line("This, is the way; that things will end", stop_words=['the', 'is'])) == \
    ['this', 'way', 'that', 'things', 'will', 'end']

Write a generator function, `tokenize_lines`, that can yield the tokens in an iterator of lines of text.

In [11]:
def tokenize_lines(lines, stop_words=None, punctuation=PUNCTUATION):
    """Tokenize an iterator of lines, yielding the tokens."""
    for line in lines:
        # Yield each token of the current line before reading the next line.
        yield from tokenize_line(line, stop_words, punctuation)

In [12]:
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""

assert isinstance(tokenize_lines(wasteland.splitlines()), types.GeneratorType)

assert list(tokenize_lines(wasteland.splitlines(), stop_words='is the of and')) == \
    ['april','cruellest','month','breeding','lilacs','out','dead','land',
     'mixing','memory','desire','stirring','dull','roots','with','spring',
     'rain']

## Tokenize song lyrics

Now use all of the above functions to tokenize the song lyrics in this Kaggle-hosted dataset:

https://www.kaggle.com/mousehead/songlyrics

* You should be able to perform this in a memory-efficient manner.
* Read your stop words from the included `data/stopwords.txt` file.

Here is the dataset loaded as a Pandas `DataFrame`:

In [13]:
import pandas as pd
df = pd.read_csv("/data/songdata/songdata.csv")
df.head()
lyrics = df['text']

If we extract the `text` column, we get a Pandas `Series` of lyrics that we can iterate over. **Remember each lyric can and will have multiple lines.** Here is the total number of lyrics:
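
Because each entry in the column is one multi-line string, it has to be split into lines before being handed to a line-oriented function. A minimal sketch with a made-up two-line lyric (not taken from the dataset):

```python
lyric = "First line of a song  \nSecond line of a song\n"  # made-up sample

# splitlines() splits on line boundaries and drops the newline characters.
lines = lyric.splitlines()
print(len(lines))  # 2
print(lines[1])    # Second line of a song
```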

In [14]:
len(lyrics)

57650

Read the file `data/stopwords.txt` and tokenize the file into a list of stop words:

In [15]:
# YOUR CODE HERE
# The with-block closes the file as soon as its contents have been read.
with open('data/stopwords.txt', 'r') as file:
    paragraph = file.read()
stop_words = list(tokenize_line(paragraph))

In [16]:
assert len(stop_words)==174
assert type(stop_words)==list

In [17]:
print(stop_words)

['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 't

Now iterate through the lyrics and for each lyric:

1. Split the lyric into lines.
2. Call `tokenize_lines` on those lines, eliminating the above stop words (and punctuation).
3. Count the total number of words across all lyrics (excluding stop words) and save the result in a variable named `nwords`.

If you do all of this in a memory-efficient manner, the total memory consumption of this notebook should stay under roughly 250MB. Most of that comes from Pandas reading the dataset into memory. If you construct a full list of all the words and *then* count them, your memory consumption will be 3-4x that. This should only take a few minutes to run.
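
To make the difference concrete, here is a small sketch that counts a synthetic token stream both ways and compares peak memory with the standard library's `tracemalloc` module. The names `fake_tokens` and `peak_bytes` and the stream size `N` are hypothetical stand-ins; this does not use the lyrics dataset and the numbers it produces say nothing about the 250MB figure above.

```python
import tracemalloc

def fake_tokens(n):
    """Yield n distinct strings lazily (stand-in for a real token stream)."""
    for i in range(n):
        yield 'word{}'.format(i)

def peak_bytes(func):
    """Run func() and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    result = func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak

N = 100_000  # arbitrary size chosen for the illustration

# Lazy count: only one token is alive at any moment.
count_lazy, peak_lazy = peak_bytes(lambda: sum(1 for _ in fake_tokens(N)))
# Eager count: every token is stored in a list before counting.
count_list, peak_list = peak_bytes(lambda: len(list(fake_tokens(N))))

print(count_lazy == count_list)  # both approaches count the same tokens
print(peak_lazy < peak_list)     # the lazy count holds far less in memory at once
```

The same pattern applies to the lyrics: increment a counter while consuming the generator rather than collecting the tokens first.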

In [18]:
# YOUR CODE HERE
nwords = 0
for lyric in lyrics:
    # Consume the generator one token at a time; no list of words is ever built.
    nwords += sum(1 for _ in tokenize_lines(lyric.splitlines(), stop_words))

In [19]:
print("Total number of words: {}".format(nwords))

Total number of words: 6402086
