# Challenge: Text Into Data

```yaml
Course:   DS 5001 
Module:   02 Text Models
Topic:    Text into Data Challenge
Author:   R.C. Alvarado
Date:     14 October 2022 (revised)
```

## Purpose

We import a text using the Clip, Chunk, and Split pattern.

Demonstrate how to tokenize a raw text and map an OHCO onto the resulting dataframe of tokens.

In this notebook, we use the pattern from `M02_01` on a new text.

## Recipe

### Create TOKEN table

1. Inspect source text, taking note of where it begins and ends and the header patterns.
2. Import the source text into a dataframe of line strings.
3. Extract the title.
4. Clip the cruft by using regexes for the beginning and end of the actual text.
5. Chunk by using a regex for chapter headings, assign lines, and group.
6. Split into paragraphs using new lines.
7. Split into sentences using regex.
8. Split into tokens using regex.
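
Steps 7 and 8 might look like the following on a toy paragraph table. This is a sketch rather than the notebook's solution: the sample sentences, the names `PARAS`, `SENTS`, and `TOKENS`, and both regexes are assumptions chosen for illustration, and a real text will likely need more careful patterns.

```python
import pandas as pd

OHCO = ["chap_num", "para_num", "sent_num", "token_num"]

# Hypothetical paragraph table, indexed by the first two OHCO levels.
PARAS = pd.DataFrame(
    {"para_str": ["It was a dark night. Nobody spoke!", "Why not? She could not say."]},
    index=pd.MultiIndex.from_tuples([(1, 0), (1, 1)], names=OHCO[:2]),
)

# Step 7: split paragraphs on sentence-final punctuation (assumed pattern).
SENTS = (
    PARAS.para_str.str.split(r"[.?!;:]+", expand=True)
    .stack()
    .dropna()
    .to_frame("sent_str")
)
SENTS.index.names = OHCO[:3]
SENTS.sent_str = SENTS.sent_str.str.strip()
SENTS = SENTS[SENTS.sent_str != ""]  # drop the empty piece left after the last mark

# Step 8: split sentences on runs of non-word characters (assumed pattern).
TOKENS = (
    SENTS.sent_str.str.split(r"\W+", expand=True)
    .stack()
    .dropna()
    .to_frame("token_str")
)
TOKENS.index.names = OHCO

print(TOKENS)
```

Each split adds one level to the index, so the final table is keyed by the full OHCO.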

### Create VOCAB table

1. Get token value counts and save as a dataframe.
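
As a minimal sketch of this step, assuming a token table with a `token_str` column (the column names `token_str`, `term_str`, and `n` and the sample tokens are hypothetical):

```python
import pandas as pd

# Hypothetical token table; in the notebook this would be the TOKEN dataframe.
TOKENS = pd.DataFrame({"token_str": ["the", "family", "of", "the", "Dashwood", "the"]})

# Count each distinct token string and keep the counts as a one-column dataframe.
VOCAB = TOKENS.token_str.value_counts().to_frame("n")
VOCAB.index.name = "term_str"

print(VOCAB)
```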

## Set Up

In [2]:
import pandas as pd

### Import Config

In [3]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']

In [4]:
text_file = f"{data_home}/gutenberg/pg161.txt"
csv_file = f"{output_dir}/austen-sense-and-sensibility.csv" # The file we will create

In [5]:
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

## Import file into a dataframe

In [6]:
with open(text_file, 'r', encoding='utf-8-sig') as f: # Close the file once the lines are read
    LINES = pd.DataFrame(f.readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()

In [7]:
LINES.sample(20)

line_num,line_str
715,"letter was from this gentleman himself, and wr..."
8389,
4253,
8023,"Lucy, with a demure and settled air, seemed de..."
9601,"ceased to speak;--at last, and as if it were r..."
10422,"friends, and to her doting mother, was an idea..."
8220,every other baby of the same age; nor could he...
12547,"Lucy became as necessary to Mrs. Ferrars, as e..."
4629,"convince Lucy, by her readiness to enter on th..."
3223,blasted trees. I admire them much more if the...


## Extract Title 

In [8]:
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')

In [9]:
print(title)

Sense and Sensibility, by Jane Austen


## Clip Cruft

In [10]:
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

In [11]:
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

In [12]:
# pat_a, pat_b

(line_num
 0        False
 1        False
 2        False
 3        False
 4        False
          ...  
 13021    False
 13022    False
 13023    False
 13024    False
 13025    False
 Name: line_str, Length: 13026, dtype: bool,
 line_num
 0        False
 1        False
 2        False
 3        False
 4        False
          ...  
 13021    False
 13022    False
 13023    False
 13024    False
 13025    False
 Name: line_str, Length: 13026, dtype: bool)

In [14]:
line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

In [15]:
line_a, line_b

(20, 12666)

In [21]:
LINES = LINES.loc[line_a : line_b].copy() # Copy so that later column assignments do not raise SettingWithCopyWarning
LINES.head(10)

line_num,line_str
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,


In [22]:
LINES.tail(10)

line_num,line_str
12657,
12658,
12659,
12660,
12661,
12662,
12663,
12664,
12665,End of the Project Gutenberg EBook of Sense an...
12666,


## Chunk by chapter

### Find all chapter headers

The regex depends on the source text, so you need to inspect the text to see how its chapter headings are written.
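
One way to carry out that inspection is to try a few candidate patterns and count their matches. The sketch below is illustrative only: the sample lines and the `candidates` dictionary are hypothetical, and the Roman-numeral pattern is an assumption about how some other editions label chapters.

```python
import pandas as pd

# Hypothetical sample of lines; editions label their sections differently.
sample = pd.Series(
    ["CHAPTER 1", "Chapter IV", "LETTER 2", "chapter of accidents", "It was late."]
)

# Candidate heading patterns to compare (assumed, not taken from any one text).
candidates = {
    "arabic": r"^\s*(?:chapter|letter)\s+\d+",
    "roman": r"^\s*(?:chapter|letter)\s+[ivxlc]+\s*$",
}

# Count how many lines each pattern matches, ignoring case.
hits = {
    name: int(sample.str.match(pat, case=False).sum())
    for name, pat in candidates.items()
}

print(hits)
```

A pattern whose count equals the number of chapters listed in the table of contents is a good sign that it is neither too loose nor too strict.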

In [24]:
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

In [25]:
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [26]:
LINES.loc[chap_lines] # Use as filter for dataframe

line_num,line_str
42,CHAPTER 1
196,CHAPTER 2
399,CHAPTER 3
561,CHAPTER 4
756,CHAPTER 5
858,CHAPTER 6
986,CHAPTER 7
1112,CHAPTER 8
1244,CHAPTER 9
1448,CHAPTER 10


### Assign numbers to chapters

In [28]:
LINES.loc[chap_lines, 'chap_num'] = list(range(1, chap_lines.sum() + 1)) # Number the heading lines 1..N in order

In [30]:
LINES.loc[chap_lines]

line_num,line_str,chap_num
42,CHAPTER 1,1.0
196,CHAPTER 2,2.0
399,CHAPTER 3,3.0
561,CHAPTER 4,4.0
756,CHAPTER 5,5.0
858,CHAPTER 6,6.0
986,CHAPTER 7,7.0
1112,CHAPTER 8,8.0
1244,CHAPTER 9,9.0
1448,CHAPTER 10,10.0


In [31]:
LINES.sample(10)

line_num,line_str,chap_num
6567,"first.""",
4361,"""Certainly,"" answered Elinor, without knowing ...",
5455,,
8244,"them, it was true, must always be hers. But t...",
2666,,
189,"Margaret, the other sister, was a good-humored...",
2171,"""Oh, yes; and as like him as she can stare. I...",
10021,Their journey was safely performed. The secon...,
3889,can't think how much I longed to see you! It ...,
358,be nothing at all. They will have no carriage...,


### Forward-fill chapter numbers to following text lines

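`ffill()` replaces each null value with the most recent non-null value above it. Lines that come before the first chapter heading have nothing to fill from, so they stay null.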
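
A small illustration on made-up values (the series `s` is hypothetical) shows how the fill propagates:

```python
import pandas as pd

# Hypothetical chapter numbers: only the heading lines carry a value.
s = pd.Series([None, 1.0, None, None, 2.0, None])

# Propagate each value downward until the next non-null value.
filled = s.ffill()

print(filled)
```

The first entry remains null because there is no earlier value to copy.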

In [32]:
LINES.chap_num = LINES.chap_num.ffill()
LINES.sample(10)

line_num,line_str,chap_num
1179,not think Colonel Brandon's being thirty-five ...,8.0
3117,,17.0
3461,variety which the different state of her spiri...,19.0
9512,"much or too little, and sat deliberating over ...",40.0
8630,lips could not utter. After a pause of wonder...,37.0
11427,"""I am thankful to find that I can look with so...",46.0
11413,"arm, was authorised to walk as long as she cou...",46.0
3666,complaining of the weather.,20.0
7017,"in her speaking to him, even voluntarily speak...",32.0
3305,"before; and when their visitors left them, he ...",18.0


In [33]:
LINES.head(20)

line_num,line_str,chap_num
20,,
21,,
22,,
23,,
24,,
25,,
26,,
27,,
28,,
29,,


### Clean up

In [34]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
# LINES = LINES.loc[~LINES.chap_num.isna()] # Remove everything before Chapter 1 (alternate method)
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int

In [35]:
LINES.sample(10)

line_num,line_str,chap_num
2261,and after a ten minutes' interval of earnest t...,13
4334,,22
120,his ordinary duties. Had he married a more am...,1
10769,"""Did you tell her that you should soon return?""",44
10388,many weeks of previous indisposition which Mar...,43
6448,will be when they hear it! If I had my senses...,30
7034,"Elinor's. Long letters from her, quickly succ...",32
10145,"day or two trifled with or denied, would force...",42
3885,"but any testimony in his favour, however small...",20
1920,was in perfect unison with what she had heard ...,12


### Group lines into chapters

In [36]:
OHCO[:1]

['chap_num']
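
Grouping on `OHCO[:1]` could then be sketched as follows. This is an assumption about how the step might be done, not the notebook's own cell: the sample lines and the names `CHAPS` and `chap_str` are hypothetical. Joining with a newline keeps blank lines visible as double newlines, which can later mark paragraph breaks.

```python
import pandas as pd

OHCO = ["chap_num", "para_num", "sent_num", "token_num"]

# Hypothetical stand-in for LINES after the clean-up step.
LINES = pd.DataFrame(
    {
        "line_str": [
            "The family had long been settled in Sussex.",
            "",
            "The late owner was a single man.",
            "Mrs. John Dashwood now installed herself.",
        ],
        "chap_num": [1, 1, 1, 2],
    }
)

# Concatenate the lines of each chapter into a single string per chapter.
CHAPS = (
    LINES.groupby(OHCO[:1])["line_str"]
    .apply(lambda x: "\n".join(x))
    .to_frame("chap_str")
)

print(CHAPS)
```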

## Split chapters into paragraphs 

We use pandas' convenient `.str.split()` method with `expand=True`, followed by `.stack()`.
Note that this creates zero-based indexes, so paragraph numbers start at 0 while chapter numbers start at 1.
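
A minimal sketch of that pattern, assuming a chapter table with one string per chapter and paragraphs separated by blank lines (the names `CHAPS`, `chap_str`, `PARAS`, and `para_str`, the sample text, and the regex are hypothetical):

```python
import pandas as pd

# Hypothetical chapter table: one string per chapter.
CHAPS = pd.DataFrame(
    {"chap_str": ["First paragraph.\n\nSecond paragraph.", "Only paragraph."]},
    index=pd.Index([1, 2], name="chap_num"),
)

# expand=True spreads the pieces across columns; stack() turns those
# columns into a new, zero-based index level.
PARAS = (
    CHAPS.chap_str.str.split(r"\n\n+", expand=True)
    .stack()
    .dropna()
    .to_frame("para_str")
)
PARAS.index.names = ["chap_num", "para_num"]

# Flatten any remaining line breaks and drop empty paragraphs.
PARAS.para_str = PARAS.para_str.str.replace(r"\n", " ", regex=True).str.strip()
PARAS = PARAS[PARAS.para_str != ""]

print(PARAS)
```

The explicit `dropna()` removes the padding that `expand=True` adds for chapters with fewer paragraphs than the longest one.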

## Split paragraphs into sentences

## Split sentences into tokens

## Extract Vocabulary

## Gathering by Content Object

## Save work to CSV

This step is important: the saved file will be used for homework.
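
A hedged sketch of the round trip, using an in-memory buffer in place of `csv_file` so that it runs anywhere (the two-row `TOKENS` table is hypothetical):

```python
import io

import pandas as pd

OHCO = ["chap_num", "para_num", "sent_num", "token_num"]

# Hypothetical token table indexed by the full OHCO.
TOKENS = pd.DataFrame(
    {"token_str": ["The", "family"]},
    index=pd.MultiIndex.from_tuples([(1, 0, 0, 0), (1, 0, 0, 1)], names=OHCO),
)

buffer = io.StringIO()  # stands in for the path stored in csv_file
TOKENS.to_csv(buffer)   # the index is written by default, so the OHCO columns are kept
buffer.seek(0)

# Restore the OHCO index when reading the file back.
RELOADED = pd.read_csv(buffer, index_col=OHCO)

print(RELOADED)
```

Passing `index_col=OHCO` on reload restores the hierarchical index instead of leaving its levels as ordinary columns.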