# Chapter 1: The Tidy Text Format

## Contrasting Tidy Text with Other Data Structures

## The unnest_tokens function

In [91]:
text = [
    "Because I could not stop for Death -",
    "He kindly stopped for me -",
    "The Carriage held but just Ourselves -",
    "and Immortality"
]

text

['Because I could not stop for Death -',
 'He kindly stopped for me -',
 'The Carriage held but just Ourselves -',
 'and Immortality']

In [92]:
import pandas as pd

In [93]:
text_df = pd.DataFrame({'text': text, 'line': range(1,5)})

text_df

   line                                    text
0     1    Because I could not stop for Death -
1     2              He kindly stopped for me -
2     3  The Carriage held but just Ourselves -
3     4                         and Immortality


Now let's convert the dataframe above into a "tidy" data structure: one token per row. We'll need to pull in some extra tools. I'm going to do it step by step below; skip to the `unnest_tokens` function if you just want to see it all together.

In [94]:
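import spacy
# spaCy 3 removed the 'en' shortcut, so load the small English model by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')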

In [95]:
# example of using spaCy to tokenize a simple string

def tokenize(sent):
    doc = nlp.tokenizer(sent)
    return [token.text for token in doc]
        
tokenize(text[0])

['Because', 'I', 'could', 'not', 'stop', 'for', 'Death', '-']

In [96]:
# The example in the book also lowercases everything
# and strips punctuation, so let's also do that:

def tokenize(sent):
    doc = nlp.tokenizer(sent)
    return [token.lower_ for token in doc if not token.is_punct]

print(tokenize(text[0]))

['because', 'i', 'could', 'not', 'stop', 'for', 'death']
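If you don't have a spaCy model installed, here is a rough stand-in that uses only the standard library. It's a sketch, not a replacement: `simple_tokenize` and `TOKEN_PATTERN` are names I made up for illustration, and the regex will not split contractions the way spaCy's tokenizer does.

```python
import re

# Hypothetical stand-in for the spaCy-based `tokenize` above.
# Matches runs of letters/digits, optionally with one internal apostrophe ("don't").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")

def simple_tokenize(sent):
    # lowercase first, then keep only the word-like pieces (punctuation is skipped)
    return TOKEN_PATTERN.findall(sent.lower())

print(simple_tokenize("Because I could not stop for Death -"))
```

For the poem lines used here it gives the same lowercase, punctuation-free tokens as the spaCy version, but on real text the two can disagree.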


In [130]:
# Now we can use our `tokenize` function in combination
# with Pandas operations to expand the dataframe above into a tidy df

# First, how do we expand into tokens?
text_df['text'].apply(tokenize)

0     [because, i, could, not, stop, for, death]
1                 [he, kindly, stopped, for, me]
2    [the, carriage, held, but, just, ourselves]
3                             [and, immortality]
Name: text, dtype: object

In [131]:
# Now we want each of those in its own row - two steps

new_df = (text_df['text'].apply(tokenize)
                         .apply(pd.Series))
new_df

         0            1        2    3     4          5      6
0  because            i    could  not  stop        for  death
1       he       kindly  stopped  for    me        NaN    NaN
2      the     carriage     held  but  just  ourselves    NaN
3      and  immortality      NaN  NaN   NaN        NaN    NaN


In [132]:
# now use `stack` to reshape into a single column

new_df = new_df.stack()

new_df

0  0        because
   1              i
   2          could
   3            not
   4           stop
   5            for
   6          death
1  0             he
   1         kindly
   2        stopped
   3            for
   4             me
2  0            the
   1       carriage
   2           held
   3            but
   4           just
   5      ourselves
3  0            and
   1    immortality
dtype: object
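To see what `stack` is doing on its own, here is a tiny sketch with made-up data. The outer index level is the original row and the inner level is the token's position within that row. The explicit `.dropna()` is my addition as a precaution: whether `stack` discards the padding cells by itself may depend on your pandas version, so treat that as an assumption rather than something the steps above rely on.

```python
import pandas as pd

# Two "lines" of different lengths; the shorter one is padded with a missing value
wide = pd.DataFrame([["he", "kindly", "stopped"],
                     ["and", "immortality", None]])

# stack() moves the column labels into an inner index level;
# dropna() makes sure the padding cell does not survive as a row
long = wide.stack().dropna()

print(long)
print(long.index.get_level_values(0).tolist())  # original row of each token
```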

In [133]:
new_df = (new_df.reset_index(level=0)
                .set_index('level_0')
                .rename(columns={0: 'word'}))

new_df

            word
level_0
0        because
0              i
0          could
0            not
0           stop
0            for
0          death
1             he
1         kindly
1        stopped


In [134]:
# Now we use a `join` to get the information from the other associated columns

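# `drop(columns=...)` replaces the positional `axis` argument, which pandas 2.0 no longer accepts
new_df = new_df.join(text_df.drop(columns='text'), how='left')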

new_df

      word  line
0  because     1
0        i     1
0    could     1
0      not     1
0     stop     1
0      for     1
0    death     1
1       he     2
1   kindly     2
1  stopped     2
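The join works because it aligns on the index: every word row still carries the index of the line it came from, so the line's other columns get repeated for each word. A minimal sketch with made-up frames (`words` and `lines` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Word rows keep the index of the line they came from (note the repeats)
words = pd.DataFrame({"word": ["because", "i", "he", "kindly"]},
                     index=[0, 0, 1, 1])

# One row of metadata per original line
lines = pd.DataFrame({"line": [1, 2]}, index=[0, 1])

# A left join on the index copies each line's metadata onto all of its words
joined = words.join(lines, how="left")
print(joined)
```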


In [136]:
new_df = new_df.reset_index(drop=True)

new_df

      word  line
0  because     1
1        i     1
2    could     1
3      not     1
4     stop     1
5      for     1
6    death     1
7       he     2
8   kindly     2
9  stopped     2


### All together now!

In [137]:
def unnest_tokens(df,                            # line-based dataframe
                  column_to_tokenize,            # name of the column with the text
                  new_token_column_name='word',  # what you want the column of words to be called
                  tokenizer_function=tokenize):  # what tokenizer to use

    return (df[column_to_tokenize]
              .apply(tokenizer_function)
              .apply(pd.Series)
              .stack()
              .reset_index(level=0)
              .set_index('level_0')
              .rename(columns={0: new_token_column_name})
              # join back onto the dataframe that was passed in, not the global `text_df`
              .join(df.drop(columns=column_to_tokenize), how='left')
              .reset_index(drop=True))

In [139]:
text_df = unnest_tokens(text_df, 'text')
text_df

      word  line
0  because     1
1        i     1
2    could     1
3      not     1
4     stop     1
5      for     1
6    death     1
7       he     2
8   kindly     2
9  stopped     2
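If you're on pandas 0.25 or later, `explode` can probably do most of this in one step, without the wide intermediate frame or the join. Here's a sketch under that assumption. `unnest_tokens_explode` and `simple_tokenize` are names I invented for this example, and the regex tokenizer is only there so the block runs without a spaCy model; pass your own tokenizer in its place.

```python
import re
import pandas as pd

def simple_tokenize(sent):
    # hypothetical stand-in tokenizer: lowercase, keep word-like runs only
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", sent.lower())

def unnest_tokens_explode(df, column_to_tokenize,
                          new_token_column_name="word",
                          tokenizer_function=simple_tokenize):
    # one list of tokens per row, aligned with df's index
    tokens = df[column_to_tokenize].apply(tokenizer_function)
    return (df.drop(columns=column_to_tokenize)
              .assign(**{new_token_column_name: tokens})
              .explode(new_token_column_name)          # one row per token
              .dropna(subset=[new_token_column_name])  # rows whose token list was empty
              .reset_index(drop=True))

poem = pd.DataFrame({
    "text": ["Because I could not stop for Death -",
             "He kindly stopped for me -",
             "The Carriage held but just Ourselves -",
             "and Immortality"],
    "line": range(1, 5),
})

tidy = unnest_tokens_explode(poem, "text")
print(tidy.head(10))
```

The columns come out in a different order than in the step-by-step version (`line` before `word`), since the token column is added last.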


## Tidying the Works of Jane Austen