# M02 Text into Data

DS 5001 Text as Data

## Purpose

We import a text using the **Clip, Chunk, and Split pattern**.

We demonstrate how to tokenize a raw text and map an OHCO (Ordered Hierarchy of Content Objects) onto the resulting dataframe of tokens.

This goes beyond what we did last week in the First Foray notebook. We capture the chapter, paragraph, and sentence structure of the text.
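
As a rough illustration of where we are headed, the sketch below builds a tiny token table whose index follows the chapter, paragraph, sentence, token hierarchy. The rows and the names `OHCO_DEMO` and `DEMO` are made up for this example and are not taken from the novel.

```python
import pandas as pd

# Hypothetical toy data: each token is keyed by its position in the hierarchy
OHCO_DEMO = ['chap_num', 'para_num', 'sent_num', 'token_num']
rows = [
    (1, 0, 0, 0, 'It'),
    (1, 0, 0, 1, 'was'),
    (1, 0, 1, 0, 'She'),
    (1, 1, 0, 0, 'Then'),
    (2, 0, 0, 0, 'Next'),
]
DEMO = pd.DataFrame(rows, columns=OHCO_DEMO + ['token_str']).set_index(OHCO_DEMO)

# All tokens in chapter 1 (the remaining three levels stay in the index)
chap1 = DEMO.loc[1]

# The tokens of chapter 1, paragraph 0, sentence 0, in order
first_sentence = DEMO.loc[(1, 0, 0), 'token_str'].tolist()
print(first_sentence)
```

Because the structure lives in the index, selecting a chapter, paragraph, or sentence is just a lookup on a prefix of the index.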

## Set Up

### Import libraries

In [None]:
import pandas as pd

### Import Config

In [None]:
data_home = "../input"
output_dir = "../working"

In [None]:
data_home, output_dir

In [None]:
text_file = f"{data_home}/gutenberg/pg105.txt"
csv_file  = f"{output_dir}/austen-persuasion.csv" # The file we will create

In [None]:
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

## Import file into a dataframe

In [None]:
# Read the file inside a context manager so the file handle is closed afterwards
with open(text_file, 'r', encoding='utf-8-sig') as f:
    LINES = pd.DataFrame(f.readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()

In [None]:
LINES.sample(20)

## Extract Title

In [None]:
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')

In [None]:
print(title)

## Clip the Cruft

In [None]:
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

In [None]:
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

In [None]:
line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

In [None]:
line_a, line_b

In [None]:
LINES = LINES.loc[line_a : line_b].copy() # Copy so later column assignments do not act on a slice

In [None]:
LINES.head(10)

In [None]:
LINES.tail(10)

## Chunk by Chapter

### Find all chapter headers

The regex depends on the source text, so inspect the text first to see how its chapter headings are actually written.
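
One way to check a candidate pattern before applying it to the whole text is to run it on a few sample lines. The sketch below uses invented sample lines and a hypothetical name `demo_pat`; the pattern itself mirrors the one used in this notebook.

```python
import pandas as pd

# Hypothetical sample lines; real headings vary from text to text
samples = pd.Series([
    'Chapter 1',
    'CHAPTER 12',
    '  Letter 4',
    'CHAPTER IV',                       # Roman numeral: not matched by \d+
    'The chapter 3 affair was closed',  # mid-line mention: not matched
])

demo_pat = r"^\s*(?:chapter|letter)\s+\d+"

# str.match anchors at the start of each string; case=False ignores case
hits = samples.str.match(demo_pat, case=False)
print(pd.DataFrame({'line_str': samples, 'is_header': hits}))
```

If a text numbers its chapters with Roman numerals or words, the `\d+` part would need to change accordingly.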

In [None]:
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

In [None]:
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [None]:
LINES.loc[chap_lines] # Use as filter for dataframe

### Assign numbers to chapters

In [None]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]

In [None]:
LINES.loc[chap_lines]

Notice that all lines that are not chapter headers have no chapter number assigned to them.

In [None]:
LINES.sample(10)

### Forward-fill chapter numbers to following text lines

`ffill()` will replace null values with the previous non-null value.
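
A small sketch of the same idea on made-up data (the frame `demo` and its contents are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical lines: only the header rows carry a chapter number
demo = pd.DataFrame({
    'line_str': ['Front matter', 'Chapter 1', 'First line',
                 'Second line', 'Chapter 2', 'Third line'],
    'chap_num': [np.nan, 1, np.nan, np.nan, 2, np.nan],
})

# Each null takes the most recent non-null value above it
demo['chap_filled'] = demo.chap_num.ffill()
print(demo)
```

The first row stays null because there is no earlier value to carry forward.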

In [None]:
LINES.chap_num = LINES.chap_num.ffill()

In [None]:
LINES.sample(10)

Notice that the lines that precede our first chapter have no chapter number, which is what we want. We need to decide whether to keep these lines as textual front matter or to dispose of them.
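
The two options can be sketched on a toy frame. This notebook drops the front matter; keeping it under a reserved chapter number 0 is only an assumption shown for comparison, and the data and names below are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical lines: two front-matter rows followed by chapter 1 text
demo = pd.DataFrame({
    'line_str': ['A title line', 'An author line', 'Opening line', 'Another line'],
    'chap_num': [np.nan, np.nan, 1, 1],
})

# Option A: dispose of everything before the first chapter
dropped = demo.dropna(subset=['chap_num'])

# Option B (assumption, not used in this notebook): keep front matter as chapter 0
kept = demo.assign(chap_num=demo.chap_num.fillna(0).astype(int))
print(kept)
```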

In [None]:
LINES.head(20)

### Clean up

In [None]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
# LINES = LINES.loc[~LINES.chap_num.isna()] # Remove everything before Chapter 1 (alternate method)
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int

In [None]:
LINES.sample(10)

### Group lines into chapters

In [None]:
OHCO[:1]

In [None]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')

In [None]:
CHAPS.head(10)

In [None]:
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()

In [None]:
CHAPS

So, now we have our text grouped by chapters.

In [None]:
CHAPS.to_csv(f"{output_dir}/pg105-CHAPS.csv", index=True)

## Split chapters into paragraphs

We use pandas' convenient `.str.split()` method with `expand=True`, followed by `.stack()`.
Note that this creates zero-based indexes.
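
To see what the two steps do, here is a minimal sketch on two invented chapter strings. The names `demo`, `wide`, and `stacked` are hypothetical. The explicit `.dropna()` is an addition for this sketch so that the padding is removed regardless of how a given pandas version's `.stack()` treats missing values.

```python
import pandas as pd

# Hypothetical chapters: the first has two paragraphs, the second has one
demo = pd.DataFrame(
    {'chap_str': ['Para one.\n\nPara two.', 'Only para.']},
    index=pd.Index([1, 2], name='chap_num'),
)

# expand=True gives one column per piece (labelled 0, 1, ...);
# shorter chapters are padded with missing values
wide = demo['chap_str'].str.split(r'\n\n+', expand=True)

# stack() moves the column labels into a new inner index level,
# giving one row per paragraph; dropna() discards the padding
stacked = wide.stack().dropna().to_frame('para_str')
stacked.index.names = ['chap_num', 'para_num']
print(stacked)
```

The paragraph numbers come from the column labels of the expanded frame, which is why they start at zero.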

In [None]:
para_pat = r'\n\n+'

In [None]:
# CHAPS['chap_str'].str.split(para_pat, expand=True).head()

In [None]:
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

In [None]:
PARAS.head()

In [None]:
PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True)
PARAS['para_str'] = PARAS['para_str'].str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs

In [None]:
PARAS.head()

In [None]:
PARAS.to_csv(f"{output_dir}/pg105-PARAS.csv", index=True)

## Split paragraphs into sentences

In [None]:
# sent_pat = r'[.?!;:"]+'
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]

In [None]:
SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')] # Remove empty sentences
SENTS.sent_str = SENTS.sent_str.str.strip() # CRUCIAL TO REMOVE BLANK TOKENS

In [None]:
SENTS.head()

In [None]:
SENTS.sample(10)

## Split sentences into tokens

In [None]:
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

In [None]:
TOKENS.index.names = OHCO[:4]

In [None]:
TOKENS

## Extract Vocabulary

In [None]:
TOKENS['term_str'] = TOKENS.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB = TOKENS.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'

In [None]:
VOCAB

## Gathering by Content Object

In [None]:
def gather(ohco_level):
    """Join tokens back into one string per chapter (1), paragraph (2), or sentence (3)."""
    level_name = OHCO[ohco_level-1].split('_')[0]
    df = TOKENS.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df

In [None]:
gather(1)

In [None]:
gather(2)

In [None]:
gather(3)

## Save work to CSV

This step is important: the resulting CSV will be used for homework.

In [None]:
TOKENS.to_csv(csv_file)
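
When the file is loaded again later, the OHCO columns need to be turned back into the index. A minimal sketch of that round trip, using a throwaway frame and a temporary file rather than the real output (the names `OHCO_DEMO`, `demo`, and `demo-tokens.csv` are hypothetical):

```python
import os
import tempfile

import pandas as pd

OHCO_DEMO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Hypothetical two-token table with the same shape as the saved data
demo = pd.DataFrame(
    {'token_str': ['Sir', 'Walter'], 'term_str': ['sir', 'walter']},
    index=pd.MultiIndex.from_tuples([(1, 0, 0, 0), (1, 0, 0, 1)], names=OHCO_DEMO),
)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo-tokens.csv')
    demo.to_csv(demo_path)  # index levels are written as ordinary columns
    # index_col rebuilds the MultiIndex from those columns
    restored = pd.read_csv(demo_path, index_col=OHCO_DEMO)

print(restored)
```

Note that `read_csv` reads empty fields back as missing values by default, so any empty `term_str` entries will come back as `NaN`.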