# Challenge: Text Into Data

```yaml
Course:   DS 5001 
Module:   02 Text Models
Topic:    Text into Data Challenge
Author:   Michael Vaden
Date:     24 January 2024
```

## Purpose

We import a text using the Clip, Chunk, and Split pattern.

Demonstrate how to tokenize a raw text and map an OHCO onto the resulting dataframe of tokens.

In this notebook, we use the pattern from `M02_01` on a new text.

## Recipe

### Create TOKEN table

1. Inspect source text, taking note of where it begins and ends and the header patterns.
2. Import the source text into a dataframe of line strings.
3. Extract the title.
4. Clip the cruft by using regexes for the beginning and end of the actual text.
5. Chunk by using a regex for chapter headings, assign lines, and group.
6. Split into paragraphs using new lines.
7. Split into sentences using regex.
8. Split into tokens using regex.

### Create VOCAB table

1. Get token value counts and save as a dataframe.

## Set Up

```python
import pandas as pd
```

### Import Config

```python
import configparser
config = configparser.ConfigParser()
config.read("../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
```

```python
text_file = f"{data_home}/pg161.txt"
csv_file = f"{output_dir}/AUSTEN_JANE_SENSE_AND_SENSIBILITY-pg161.csv" # The file we will create
```

```python
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']
```

## Import file into a dataframe

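```python
# Read the file line by line; the context manager closes the file handle afterwards
with open(text_file, 'r', encoding='utf-8-sig') as f:
    LINES = pd.DataFrame(f.readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
# Replace trailing newlines and trim surrounding whitespace
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()
```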

```python
LINES.sample(20)
```

| line_num | line_str |
|---|---|
| 11053 | where I have most injured I can least forgive.... |
| 5199 | |
| 8691 | deeply interested;--and it has not been only o... |
| 9917 | not by any reproof of hers, but by his own sen... |
| 7337 | attentive. |
| 4052 | totally different circumstances. But this is ... |
| 1257 | spite of Sir John's urgent entreaties that the... |
| 6735 | nothing to do with his own time has no conscie... |
| 259 | |
| 3906 | "Oh, no; but if mama had not objected to it, I... |


## Extract Title 

```python
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')
print(title)
```

```
Sense and Sensibility, by Jane Austen
```


## Clip Cruft

```python
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

# Boolean vectors marking the start and end marker lines
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

# Keep only the lines strictly between the two markers
line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1
print(line_a, line_b)

LINES = LINES.loc[line_a : line_b]
LINES.head(10)
```

```
20 12666
```

| line_num | line_str |
|---|---|
| 20 | |
| 21 | |
| 22 | |
| 23 | |
| 24 | |
| 25 | |
| 26 | |
| 27 | |
| 28 | |
| 29 | |


## Chunk by chapter

### Find all chapter headers

The regex will depend on the source text. You need to investigate the source text to figure this out.
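
One way to investigate is to try a candidate pattern on a handful of lines before applying it to the whole dataframe. The sketch below uses only the standard `re` module; the sample lines are hypothetical and are not taken from the source file.

```python
import re

# Hypothetical sample lines, invented for illustration
sample_lines = [
    "CHAPTER 1",
    "  Chapter 12",
    "LETTER 4",
    "The chapter 3 of her life had begun.",
    "CHAPTER XII",
    "",
]

chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

# re.match anchors at the start of the string, as Series.str.match does
hits = [line for line in sample_lines
        if re.match(chap_pat, line, flags=re.IGNORECASE)]
misses = [line for line in sample_lines if line not in hits]

print("Matched:    ", hits)
print("Not matched:", misses)
```

In this sketch the heading with a Roman numeral is not matched, because the pattern requires digits. A text numbered that way would need a different pattern.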

```python
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

LINES.loc[chap_lines] # Use as filter for dataframe
```

| line_num | line_str |
|---|---|
| 42 | CHAPTER 1 |
| 196 | CHAPTER 2 |
| 399 | CHAPTER 3 |
| 561 | CHAPTER 4 |
| 756 | CHAPTER 5 |
| 858 | CHAPTER 6 |
| 986 | CHAPTER 7 |
| 1112 | CHAPTER 8 |
| 1244 | CHAPTER 9 |
| 1448 | CHAPTER 10 |


### Assign numbers to chapters

```python
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]
LINES.loc[chap_lines]
```

| line_num | line_str | chap_num |
|---|---|---|
| 42 | CHAPTER 1 | 1.0 |
| 196 | CHAPTER 2 | 2.0 |
| 399 | CHAPTER 3 | 3.0 |
| 561 | CHAPTER 4 | 4.0 |
| 756 | CHAPTER 5 | 5.0 |
| 858 | CHAPTER 6 | 6.0 |
| 986 | CHAPTER 7 | 7.0 |
| 1112 | CHAPTER 8 | 8.0 |
| 1244 | CHAPTER 9 | 9.0 |
| 1448 | CHAPTER 10 | 10.0 |


### Forward-fill chapter numbers to following text lines

`ffill()` will replace null values with the previous non-null value.
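
A minimal sketch of that behaviour on a toy frame; the rows are invented for illustration and only mimic the shape of `LINES`.

```python
import pandas as pd

nan = float("nan")

# Toy frame: at first only the heading rows carry a chapter number
toy = pd.DataFrame({
    "line_str": ["front matter", "CHAPTER 1", "first line",
                 "second line", "CHAPTER 2", "third line"],
    "chap_num": [nan, 1.0, nan, nan, 2.0, nan],
})

# Each null takes the most recent non-null value above it
toy["chap_num"] = toy["chap_num"].ffill()
print(toy)
```

The row before the first heading has nothing above it to copy, so it stays null. This is why the lines preceding Chapter 1 can be removed afterwards by dropping nulls.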

```python
LINES.chap_num = LINES.chap_num.ffill()
LINES.sample(10)
```

| line_num | line_str | chap_num |
|---|---|---|
| 5407 | time in rapture and indecision. | 26.0 |
| 11834 | fortitude. | 47.0 |
| 6557 | | 30.0 |
| 2511 | smile, "It is folly to linger in this manner. ... | 15.0 |
| 12474 | believed, one of the happiest couples in the w... | 50.0 |
| 2422 | "You are a good woman," he warmly replied. "Y... | 14.0 |
| 11794 | | 47.0 |
| 4787 | manner that made me quite uncomfortable. I fe... | 24.0 |
| 12422 | resuscitation of Edward, she had one again. | 50.0 |
| 4489 | spirits. I heard from him just before I left ... | 22.0 |


### Clean up

```python
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
# LINES = LINES.loc[~LINES.chap_num.isna()] # Remove everything before Chapter 1 (alternate method)
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int
LINES.sample(10)
```

| line_num | line_str | chap_num |
|---|---|---|
| 7682 | On Elinor its effect was very different. She ... | 34 |
| 5638 | "So my daughter Middleton told me, for it seem... | 27 |
| 11297 | by the hollow eye, the sickly skin, the postur... | 46 |
| 1089 | that event by giving up music, although by her... | 7 |
| 11218 | effusion to a soothing friend--not an applicat... | 45 |
| 2137 | | 13 |
| 733 | recommendation. To quit the neighbourhood of ... | 4 |
| 4303 | painful as it was strong, had not an immediate... | 22 |
| 427 | | 3 |
| 11748 | | 47 |


### Group lines into chapters

```python
OHCO[:1]
```

```
['chap_num']
```

```python
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')
CHAPS.head(10)
```

| chap_num | chap_str |
|---|---|
| 1 | \n\nThe family of Dashwood had long been settl... |
| 2 | \n\nMrs. John Dashwood now installed herself m... |
| 3 | \n\nMrs. Dashwood remained at Norland several ... |
| 4 | \n\n"What a pity it is, Elinor," said Marianne... |
| 5 | \n\nNo sooner was her answer dispatched, than ... |
| 6 | \n\nThe first part of their journey was perfor... |
| 7 | \n\nBarton Park was about half a mile from the... |
| 8 | \n\nMrs. Jennings was a widow with an ample jo... |
| 9 | \n\nThe Dashwoods were now settled at Barton w... |
| 10 | \n\nMarianne's preserver, as Margaret, with mo... |


```python
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS
```

| chap_num | chap_str |
|---|---|
| 1 | The family of Dashwood had long been settled i... |
| 2 | Mrs. John Dashwood now installed herself mistr... |
| 3 | Mrs. Dashwood remained at Norland several mont... |
| 4 | "What a pity it is, Elinor," said Marianne, "t... |
| 5 | No sooner was her answer dispatched, than Mrs.... |
| 6 | The first part of their journey was performed ... |
| 7 | Barton Park was about half a mile from the cot... |
| 8 | Mrs. Jennings was a widow with an ample jointu... |
| 9 | The Dashwoods were now settled at Barton with ... |
| 10 | Marianne's preserver, as Margaret, with more e... |


## Split chapters into paragraphs 

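We use the pandas `.str.split()` method with `expand=True`, which spreads the pieces of each string across columns, followed by `.stack()`, which pivots those columns into a new index level.
Note that this new index level is zero-based.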
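
The following toy example sketches the two steps separately so the intermediate wide table is visible. The strings and the name `toy` are invented for illustration. The explicit `.dropna()` is an added guard, not part of the notebook's pattern, in case the installed pandas version keeps padding cells when stacking.

```python
import pandas as pd

# Two toy "chapters" with three and one paragraphs respectively
toy = pd.Series(
    ["First para.\n\nSecond para.\n\nThird para.", "Only para."],
    index=pd.Index([1, 2], name="chap_num"),
    name="chap_str",
)

# Step 1: one column per piece; shorter rows are padded with missing values
wide = toy.str.split(r"\n\n+", expand=True)
print(wide)

# Step 2: pivot the columns into a second index level.
# Drop padding cells explicitly in case stack() keeps them.
long = wide.stack().dropna().to_frame("para_str")
long.index.names = ["chap_num", "para_num"]
print(long)
```

The column labels `0, 1, 2` of the wide table become the values of `para_num`, which is where the zero-based numbering comes from.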

```python
para_pat = r'\n\n+'
# CHAPS['chap_str'].str.split(para_pat, expand=True).head()
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]
PARAS.head()
```

| chap_num | para_num | para_str |
|---|---|---|
| 1 | 0 | The family of Dashwood had long been settled i... |
| 1 | 1 | By a former marriage, Mr. Henry Dashwood had o... |
| 1 | 2 | The old gentleman died: his will was read, and... |
| 1 | 3 | Mr. Dashwood's disappointment was, at first, s... |
| 1 | 4 | His son was sent for as soon as his danger was... |


```python
PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True)
PARAS['para_str'] = PARAS['para_str'].str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs
PARAS.head()
```

| chap_num | para_num | para_str |
|---|---|---|
| 1 | 0 | The family of Dashwood had long been settled i... |
| 1 | 1 | By a former marriage, Mr. Henry Dashwood had o... |
| 1 | 2 | The old gentleman died: his will was read, and... |
| 1 | 3 | Mr. Dashwood's disappointment was, at first, s... |
| 1 | 4 | His son was sent for as soon as his danger was... |


## Split paragraphs into sentences

```python
# sent_pat = r'[.?!;:"]+'
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]
SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')] # Remove empty sentences
SENTS.sent_str = SENTS.sent_str.str.strip() # CRUCIAL TO REMOVE BLANK TOKENS
SENTS.head()
```

| chap_num | para_num | sent_num | sent_str |
|---|---|---|---|
| 1 | 0 | 0 | The family of Dashwood had long been settled i... |
| 1 | 0 | 1 | Their estate was large, and their residence wa... |
| 1 | 0 | 2 | The late owner of this estate was a single man... |
| 1 | 0 | 3 | But her death, which happened ten years before... |
| 1 | 0 | 4 | for to supply her loss, he invited and receive... |


## Split sentences into tokens

```python
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')
TOKENS.index.names = OHCO[:4]
TOKENS
```

| chap_num | para_num | sent_num | token_num | token_str |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | The |
| 1 | 0 | 0 | 1 | family |
| 1 | 0 | 0 | 2 | of |
| 1 | 0 | 0 | 3 | Dashwood |
| 1 | 0 | 0 | 4 | had |
| ... | ... | ... | ... | ... |
| 50 | 22 | 0 | 8 | and |
| 50 | 22 | 0 | 9 | Sensibility |
| 50 | 22 | 0 | 10 | by |
| 50 | 22 | 0 | 11 | Jane |

## Extract Vocabulary

```python
# Normalize tokens into terms: drop non-word characters and underscores, then lowercase
TOKENS['term_str'] = TOKENS.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB = TOKENS.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'
VOCAB
```

| term_id | term_str | n |
|---|---|---|
| 0 | to | 4115 |
| 1 | the | 4105 |
| 2 | of | 3574 |
| 3 | and | 3490 |
| 4 | her | 2543 |
| ... | ... | ... |
| 6275 | prefer | 1 |
| 6276 | dissolving | 1 |
| 6277 | beset | 1 |
| 6278 | effectually | 1 |
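
To see what the token pattern and the term normalization do to a single string, the sketch below applies the same two regexes with the standard `re` module. The sentence is invented for illustration, and `Counter` stands in for `value_counts()` only to keep the example free of dependencies.

```python
import re
from collections import Counter

token_pat = r"[\s',-]+"

# Hypothetical sentence fragment that begins with a dash, as can happen
# when a sentence is cut at a semicolon
sentence = "--and Elinor's sister, ever hopeful, smiled"

# Split on whitespace, apostrophes, commas, and hyphens
tokens = re.split(token_pat, sentence)

# Normalize: remove non-word characters and underscores, then lowercase
terms = [re.sub(r"[\W_]+", "", t).lower() for t in tokens]

counts = Counter(terms)
print(tokens)
print(terms)
```

Two effects are worth noticing in this sketch: the possessive is split into two tokens, and a separator at the start of the string yields an empty first token. If empty terms are unwanted in a vocabulary, one option is to filter them out before counting.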


## Gathering by Content Object

```python
def gather(ohco_level):
    # Rebuild the strings of one OHCO level by joining its tokens with spaces
    level_name = OHCO[ohco_level-1].split('_')[0]
    df = TOKENS.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df
gather(1)
```

| chap_num | chap_str |
|---|---|
| 1 | The family of Dashwood had long been settled i... |
| 2 | Mrs John Dashwood now installed herself mistre... |
| 3 | Mrs Dashwood remained at Norland several month... |
| 4 | "What a pity it is Elinor " said Marianne "tha... |
| 5 | No sooner was her answer dispatched than Mrs D... |
| 6 | The first part of their journey was performed ... |
| 7 | Barton Park was about half a mile from the cot... |
| 8 | Mrs Jennings was a widow with an ample jointur... |
| 9 | The Dashwoods were now settled at Barton with ... |
| 10 | Marianne s preserver as Margaret with more ele... |


## Save work to CSV

This step is important: the saved file will be used for homework.

```python
TOKENS.to_csv(csv_file)
```
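
When the file is read back later, the OHCO columns come in as ordinary columns unless they are named as the index. The sketch below shows one way to restore the index, using a tiny stand-in table and a temporary file; `toy_tokens` and `toy_tokens.csv` are hypothetical names, not part of the notebook.

```python
import os
import tempfile

import pandas as pd

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Tiny stand-in for the TOKENS table
toy_tokens = pd.DataFrame(
    {"token_str": ["The", "family", "of"]},
    index=pd.MultiIndex.from_tuples(
        [(1, 0, 0, 0), (1, 0, 0, 1), (1, 0, 0, 2)], names=OHCO
    ),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_tokens.csv")  # hypothetical file name
    toy_tokens.to_csv(toy_path)
    # Name the OHCO columns as the index to rebuild the MultiIndex
    restored = pd.read_csv(toy_path, index_col=OHCO)

print(restored)
```

One caveat with the default settings of `read_csv`: strings such as `NA` or `null` are parsed as missing values, so a token spelled that way would not survive the round trip unless `keep_default_na=False` is passed.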