# Metadata

```{yaml}
Course:   DS 5001 
Module:   02 Text Models
Topic:    Text into Data: Importing a Text, or, Clip, Chunk, and Split -- CHALLENGE
Author:   R.C. Alvarado
Date:     14 October 2022 (revised)
```
**Purpose**:  Demonstrate how to tokenize a raw text and map an OHCO onto the resulting dataframe of tokens.

# Set Up

In [3]:
import pandas as pd

In [4]:
data_home = "../data"

In [5]:
text_file = f"{data_home}/gutenberg/pg161.txt"
csv_file = f"{data_home}/output/austen-sense-and-sensibility.csv" # The file we will create

In [6]:
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Import file into a dataframe

In [7]:
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()

In [8]:
LINES.sample(20)

line_num,line_str
6585,attention; and what followed was a positive as...
12139,ill offices. Your brother has gained my affec...
7766,"Had both the children been there, the affair m..."
1194,a woman of seven and twenty could feel for a m...
3998,Fortunately for those who pay their court thro...
10330,"the arrival of the apothecary, and to watch by..."
8849,deplorable. The interest of two thousand poun...
4889,"""I will honestly tell you of one scheme which ..."
1291,"""Is there a felicity in the world,"" said Maria..."
8169,of everybody. Their hours were therefore made ...


# Extract Title 

In [9]:
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')

In [10]:
print(title)

Sense and Sensibility, by Jane Austen


# Clip Cruft

In [11]:
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

In [12]:
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

In [13]:
line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

In [14]:
line_a, line_b

(20, 12666)

In [15]:
LINES = LINES.loc[line_a : line_b].copy() # Copy so later column assignments do not write to a slice

In [16]:
LINES.head(10)

line_num,line_str
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,


In [18]:
LINES.tail(10)

line_num,line_str
12657,
12658,
12659,
12660,
12661,
12662,
12663,
12664,
12665,End of the Project Gutenberg EBook of Sense an...
12666,
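
The clipping step can be tried on a handful of lines before running it on the full file. The sketch below uses invented lines (`toy_lines` is a hypothetical stand-in, not the Gutenberg file) together with the same two patterns as `clip_pats`.

```python
import pandas as pd

# Toy stand-in for a Project Gutenberg file (hypothetical content).
toy_lines = pd.DataFrame({'line_str': [
    "The Project Gutenberg EBook of Example, by A. Author",
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "",
    "CHAPTER 1",
    "Some body text.",
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "License text follows.",
]})
toy_lines.index.name = 'line_num'

start_pat = r"\*\*\*\s*START OF (?:THE|THIS) PROJECT"
end_pat = r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"

# str.match anchors at the start of each string and returns a boolean Series.
is_start = toy_lines.line_str.str.match(start_pat)
is_end = toy_lines.line_str.str.match(end_pat)

# Keep only what lies strictly between the two marker lines.
first = toy_lines.loc[is_start].index[0] + 1
last = toy_lines.loc[is_end].index[0] - 1
toy_body = toy_lines.loc[first:last].copy()  # .loc slices include both ends
```

Because `str.match` only matches at the start of a line, a line must begin with `***` to count as a marker. That is why the plain "End of the Project Gutenberg EBook of ..." line at 12665 survives in the tail shown above.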


# Chunk by chapter

## Find all chapter headers

The regex depends on the source text, so inspect the file first to see how its chapter headings are formatted.
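
One way to check a candidate pattern is to run it against a few heading styles by hand. The lines in `candidates` below are invented for illustration; only the pattern itself comes from this notebook.

```python
import pandas as pd

# Hypothetical lines imitating different heading conventions.
candidates = pd.Series([
    "CHAPTER 1",
    "  Chapter 12",
    "LETTER 4",
    "CHAPTER IV",                            # Roman numeral: \d+ does not match
    "In the next chapter 3 things happen",   # 'chapter' is not at the start
])

chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

# case=False makes the match case-insensitive; the result is a boolean Series.
is_header = candidates.str.match(chap_pat, case=False)
print(candidates[is_header].tolist())
```

A text that numbers its chapters with Roman numerals would need a different pattern, which is the kind of thing only a look at the source reveals.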

In [19]:
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

In [20]:
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [21]:
LINES.loc[chap_lines] # Use as filter for dataframe

line_num,line_str
42,CHAPTER 1
196,CHAPTER 2
399,CHAPTER 3
561,CHAPTER 4
756,CHAPTER 5
858,CHAPTER 6
986,CHAPTER 7
1112,CHAPTER 8
1244,CHAPTER 9
1448,CHAPTER 10


## Assign numbers to chapters

In [22]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]

In [23]:
LINES.loc[chap_lines]

line_num,line_str,chap_num
42,CHAPTER 1,1.0
196,CHAPTER 2,2.0
399,CHAPTER 3,3.0
561,CHAPTER 4,4.0
756,CHAPTER 5,5.0
858,CHAPTER 6,6.0
986,CHAPTER 7,7.0
1112,CHAPTER 8,8.0
1244,CHAPTER 9,9.0
1448,CHAPTER 10,10.0


Notice that all lines that are not chapter headers have no chapter number assigned to them.

In [24]:
LINES.sample(10)

line_num,line_str,chap_num
57,"children, the old Gentleman's days were comfor...",
3794,by asking her whether she did not like Mr. Pal...,
2347,favourite pointer at her feet.,
2177,happy; and after some consultation it was agre...,
936,object of real solicitude to him. He said muc...,
11193,young man!--and without selfishness--without e...,
1227,"that there was no immediate hurry for it, as i...",
9915,"look, however, very well bestowed, for it reli...",
8050,should be checked by Lucy's unwelcome presence...,
2798,"""We have never finished Hamlet, Marianne; our ...",


## Forward-fill chapter numbers to following text lines

`ffill()` will replace null values with the previous non-null value.
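
A small illustration of the forward fill, using a made-up six-line table (`toy` is hypothetical) in which only the heading rows carry a number:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'line_str': ['front matter', 'CHAPTER 1', 'text a', 'text b', 'CHAPTER 2', 'text c'],
    'chap_num': [np.nan, 1, np.nan, np.nan, 2, np.nan],
})

# Each null takes the most recent non-null value above it.
toy['chap_num'] = toy['chap_num'].ffill()
```

The first row stays null because nothing precedes it, so any lines before the first heading remain unnumbered.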

In [25]:
LINES.chap_num = LINES.chap_num.ffill()

In [27]:
LINES.sample(10)

line_num,line_str,chap_num
4319,because I know he has the highest opinion in t...,22.0
2349,"One evening in particular, about a week after ...",14.0
4257,,22.0
12433,"enforced the assertion, by observing that Miss...",50.0
10650,"""You did then,"" said Elinor, a little softened...",44.0
1131,"excellent match, for HE was rich, and SHE was ...",8.0
12487,"confess, it would give me great pleasure to ca...",50.0
1690,"illaudable, appeared to her not merely an unne...",11.0
1406,"""I do not believe,"" said Mrs. Dashwood, with a...",9.0
10519,"prudence required dispatch, and that her acqui...",44.0


Notice that the lines that precede our first chapter have no chapter number, which is what we want. We need to decide whether to keep these lines as textual front matter or to dispose of them.

In [28]:
LINES.head(20)

line_num,line_str,chap_num
20,,
21,,
22,,
23,,
24,,
25,,
26,,
27,,
28,,
29,,


## Clean up

In [29]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
# LINES = LINES.loc[~LINES.chap_num.isna()] # Remove everything before Chapter 1 (alternate method)
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int

In [30]:
LINES.sample(10)

line_num,line_str,chap_num
784,"John Dashwood, by this pointed invitation to h...",5
7194,"""I should have been quite disappointed if I ha...",32
10154,,42
5170,"the Miss Steeles, especially Lucy, they had ne...",25
9272,"future ennui, to provoke him to make that offe...",39
4342,,22
5800,with calmness.,28
11061,,45
5333,"come before--beg your pardon, but I have been ...",26
10026,"Cleveland was a spacious, modern-built house, ...",42


## Group lines into chapters

In [31]:
OHCO[:1]

['chap_num']

In [32]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str') 

In [33]:
CHAPS.head(10)

chap_num,chap_str
1,\n\nThe family of Dashwood had long been settl...
2,\n\nMrs. John Dashwood now installed herself m...
3,\n\nMrs. Dashwood remained at Norland several ...
4,"\n\n""What a pity it is, Elinor,"" said Marianne..."
5,"\n\nNo sooner was her answer dispatched, than ..."
6,\n\nThe first part of their journey was perfor...
7,\n\nBarton Park was about half a mile from the...
8,\n\nMrs. Jennings was a widow with an ample jo...
9,\n\nThe Dashwoods were now settled at Barton w...
10,"\n\nMarianne's preserver, as Margaret, with mo..."


In [34]:
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()

In [35]:
CHAPS

chap_num,chap_str
1,The family of Dashwood had long been settled i...
2,Mrs. John Dashwood now installed herself mistr...
3,Mrs. Dashwood remained at Norland several mont...
4,"""What a pity it is, Elinor,"" said Marianne, ""t..."
5,"No sooner was her answer dispatched, than Mrs...."
6,The first part of their journey was performed ...
7,Barton Park was about half a mile from the cot...
8,Mrs. Jennings was a widow with an ample jointu...
9,The Dashwoods were now settled at Barton with ...
10,"Marianne's preserver, as Margaret, with more e..."


So, now we have our text grouped by chapters.
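
The grouping step can be sketched on a few invented lines (`toy_lines` is hypothetical). It shows why blank lines matter: once the lines are joined with `\n`, each blank line turns into a double newline inside the chapter string.

```python
import pandas as pd

toy_lines = pd.DataFrame({
    'line_str': ['', 'First line of one,', 'continued.', '',
                 'New paragraph.', '', 'Chapter two text.'],
    'chap_num': [1, 1, 1, 1, 1, 2, 2],
})

# One string per chapter; leading and trailing newlines are stripped afterwards.
toy_chaps = (
    toy_lines.groupby('chap_num')['line_str']
    .apply(lambda x: '\n'.join(x))
    .str.strip()
    .to_frame('chap_str')
)
```

In chapter 1 the blank line between "continued." and "New paragraph." survives as `\n\n`, which can then serve as a paragraph delimiter.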

# Split chapters into paragraphs 

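We use pandas' convenient `.str.split()` method with `expand=True`, which spreads the pieces of each chapter across columns, followed by `.stack()`, which turns those columns into rows.
Note that this creates zero-based paragraph numbers.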
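
A minimal sketch of the split-then-stack idiom on two invented chapters (`toy_chaps` is hypothetical). The explicit `dropna()` is a precaution added here, on the assumption that some pandas versions keep the padding cells when stacking; where `stack()` already discards them it does nothing.

```python
import pandas as pd

toy_chaps = pd.DataFrame(
    {'chap_str': ["First para.\n\nSecond para.\n\n\nThird para.", "Only para."]},
    index=pd.Index([1, 2], name='chap_num'),
)

# Wide form: one row per chapter, one column per paragraph, padded with nulls.
wide = toy_chaps['chap_str'].str.split(r'\n\n+', expand=True)

# Long form: column labels 0, 1, 2 become the second index level.
long = wide.stack().dropna().to_frame('para_str')
long.index.names = ['chap_num', 'para_num']
```

The column labels produced by `expand=True` start at 0, which is where the zero-based `para_num` values come from.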

In [36]:
para_pat = r'\n\n+'

In [37]:
# CHAPS['chap_str'].str.split(para_pat, expand=True).head()

In [38]:
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

In [39]:
PARAS.head()

chap_num,para_num,para_str
1,0,The family of Dashwood had long been settled i...
1,1,"By a former marriage, Mr. Henry Dashwood had o..."
1,2,"The old gentleman died: his will was read, and..."
1,3,"Mr. Dashwood's disappointment was, at first, s..."
1,4,His son was sent for as soon as his danger was...


In [40]:
PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True)
PARAS['para_str'] = PARAS['para_str'].str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs

In [41]:
PARAS.head()

chap_num,para_num,para_str
1,0,The family of Dashwood had long been settled i...
1,1,"By a former marriage, Mr. Henry Dashwood had o..."
1,2,"The old gentleman died: his will was read, and..."
1,3,"Mr. Dashwood's disappointment was, at first, s..."
1,4,His son was sent for as soon as his danger was...


# Split paragraphs into sentences

In [45]:
# sent_pat = r'[.?!;:"]+'
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]

In [46]:
SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')] # Remove empty sentences
SENTS.sent_str = SENTS.sent_str.str.strip() # CRUCIAL TO REMOVE BLANK TOKENS

In [47]:
SENTS.head()

chap_num,para_num,sent_num,sent_str
1,0,0,The family of Dashwood had long been settled i...
1,0,1,"Their estate was large, and their residence wa..."
1,0,2,The late owner of this estate was a single man...
1,0,3,"But her death, which happened ten years before..."
1,0,4,"for to supply her loss, he invited and receive..."
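
Splitting on punctuation alone is a rough heuristic. The sketch below applies the same `sent_pat` to one invented paragraph (`toy_paras` is hypothetical) to show two side effects: clauses after `;` or `:` become their own "sentences", and abbreviations such as "Mrs." are cut off from what follows.

```python
import pandas as pd

toy_paras = pd.Series(
    ["Mrs. Jennings was a widow; she had two daughters. Both were married!"]
)
sent_pat = r'[.?!;:]+'

pieces = toy_paras.str.split(sent_pat, expand=True).stack().str.strip()
pieces = pieces[pieces != '']  # the text after the final '!' is an empty string
print(pieces.tolist())
# ['Mrs', 'Jennings was a widow', 'she had two daughters', 'Both were married']
```

Whether this matters depends on how sentence counts are used downstream; a sentence tokenizer that knows about abbreviations would be the usual alternative.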


# Split sentences into tokens

In [48]:
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

In [49]:
TOKENS.index.names = OHCO[:4]

In [50]:
TOKENS

chap_num,para_num,sent_num,token_num,token_str
1,0,0,0,The
1,0,0,1,family
1,0,0,2,of
1,0,0,3,Dashwood
1,0,0,4,had
...,...,...,...,...
50,22,0,8,and
50,22,0,9,Sensibility
50,22,0,10,by
50,22,0,11,Jane


# Extract Vocabulary

In [51]:
TOKENS['term_str'] = TOKENS.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB = TOKENS.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'

In [53]:
TOKENS.head()

chap_num,para_num,sent_num,token_num,token_str,term_str
1,0,0,0,The,the
1,0,0,1,family,family
1,0,0,2,of,of
1,0,0,3,Dashwood,dashwood
1,0,0,4,had,had


In [54]:
VOCAB

term_id,term_str,n
0,to,4115
1,the,4105
2,of,3574
3,and,3490
4,her,2543
...,...,...
6275,prefer,1
6276,dissolving,1
6277,beset,1
6278,effectually,1
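
The normalization from tokens to terms can be sketched on a few invented tokens (`toy_tokens` is hypothetical). One step here is an addition that the cell above does not perform: a token consisting only of punctuation, such as a lone quotation mark, normalizes to an empty string, and the sketch drops those before counting.

```python
import pandas as pd

toy_tokens = pd.Series(['The', '"What', 'pity_', 'the', 'Elinor', '"', 'THE'],
                       name='token_str')

# Remove every non-word character and underscore, then lowercase.
terms = toy_tokens.str.replace(r'[\W_]+', '', regex=True).str.lower()

# Assumption of this sketch: empty terms are noise and can be discarded.
terms = terms[terms != '']

# One row per distinct term, with its frequency in column 'n'.
vocab = terms.value_counts().rename_axis('term_str').to_frame('n')
```

Here `The`, `the`, and `THE` collapse into a single term with a count of 3.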


# Gathering by Content Object

In [55]:
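def gather(ohco_level):
    """Rejoin tokens into strings at the given OHCO depth (1 = chapter, 2 = paragraph, 3 = sentence)."""
    level_name = OHCO[ohco_level-1].split('_')[0] # e.g. 'chap_num' -> 'chap'
    df = TOKENS.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df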

In [56]:
gather(1)

chap_num,chap_str
1,The family of Dashwood had long been settled i...
2,Mrs John Dashwood now installed herself mistre...
3,Mrs Dashwood remained at Norland several month...
4,"""What a pity it is Elinor "" said Marianne ""tha..."
5,No sooner was her answer dispatched than Mrs D...
6,The first part of their journey was performed ...
7,Barton Park was about half a mile from the cot...
8,Mrs Jennings was a widow with an ample jointur...
9,The Dashwoods were now settled at Barton with ...
10,Marianne s preserver as Margaret with more ele...


In [57]:
gather(2)

chap_num,para_num,para_str
1,0,The family of Dashwood had long been settled i...
1,1,By a former marriage Mr Henry Dashwood had one...
1,2,The old gentleman died his will was read and l...
1,3,Mr Dashwood s disappointment was at first seve...
1,4,His son was sent for as soon as his danger was...
...,...,...
50,18,For Marianne however in spite of his incivilit...
50,19,Mrs Dashwood was prudent enough to remain at t...
50,20,Between Barton and Delaford there was that con...
50,21,THE END


In [58]:
gather(3)

chap_num,para_num,sent_num,sent_str
1,0,0,The family of Dashwood had long been settled i...
1,0,1,Their estate was large and their residence was...
1,0,2,The late owner of this estate was a single man...
1,0,3,But her death which happened ten years before ...
1,0,4,for to supply her loss he invited and received...
...,...,...,...
50,19,3,Jennings when Marianne was taken from them Mar...
50,20,0,Between Barton and Delaford there was that con...
50,20,1,and among the merits and the happiness of Eli...
50,21,0,THE END


# Save work to CSV

This step is important: the CSV file will be used for the homework.

In [59]:
TOKENS.to_csv(csv_file)
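
When the file is read back, the OHCO index has to be restored explicitly, since `to_csv` writes index levels as ordinary columns. A minimal round-trip sketch, using a two-row hypothetical table and an in-memory buffer in place of a file on disk:

```python
import io
import pandas as pd

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Small hypothetical token table with the same index layout as TOKENS.
toy_tokens = pd.DataFrame(
    {'token_str': ['The', 'family'], 'term_str': ['the', 'family']},
    index=pd.MultiIndex.from_tuples([(1, 0, 0, 0), (1, 0, 0, 1)], names=OHCO),
)

buffer = io.StringIO()      # stands in for csv_file
toy_tokens.to_csv(buffer)   # the four index levels become the first four columns
buffer.seek(0)

# index_col rebuilds the MultiIndex from the named columns.
restored = pd.read_csv(buffer, index_col=OHCO)
```

Note that `read_csv` by default treats empty cells and strings such as `NA` or `null` as missing values; passing `keep_default_na=False` keeps them as text if that turns out to matter for the tokens.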

# Use Library

In [62]:
import sys
sys.path.append("../lib")
from textimporter import TextImporter

In [63]:
my_text = TextImporter(src_file=text_file, ohco_pats=[('chap', chap_pat, 'm')], clip_pats=clip_pats)
my_text.import_source()
my_text.parse_tokens()
my_text.extract_vocab()

Importing  ../data/gutenberg/pg161.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone ^\s*(?:chapter|letter)\s+\d+
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by delimitter [.?!;:]+
Parsing OHCO level 3 token_num by delimitter [\s',-]+


<textimporter.TextImporter at 0x7fdb42391b80>

In [64]:
my_text.TOKENS

chap_id,para_num,sent_num,token_num,token_str,term_str
1,0,0,0,The,the
1,0,0,1,family,family
1,0,0,2,of,of
1,0,0,3,Dashwood,dashwood
1,0,0,4,had,had
...,...,...,...,...,...
50,27,0,8,and,and
50,27,0,9,Sensibility,sensibility
50,27,0,10,by,by
50,27,0,11,Jane,jane


In [65]:
my_text.VOCAB

term_str,n,n_chars,p,s,i,h
to,4115,2,0.033669,29.700851,4.892432,0.164724
the,4105,3,0.033587,29.773203,4.895943,0.164441
of,3574,2,0.029243,34.196698,5.095785,0.149014
and,3490,3,0.028555,35.019771,5.130098,0.146491
her,2543,3,0.020807,48.060952,5.586793,0.116244
...,...,...,...,...,...,...
prefer,1,6,0.000008,122219.000000,16.899109,0.000138
dissolving,1,10,0.000008,122219.000000,16.899109,0.000138
beset,1,5,0.000008,122219.000000,16.899109,0.000138
effectually,1,11,0.000008,122219.000000,16.899109,0.000138
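
The extra `VOCAB` columns are not explained here, but the printed values are consistent with `p` being relative frequency, `s` its reciprocal, `i` the base-2 logarithm of `s`, and `h` the product `p * i` (for "to": 1/0.033669 ≈ 29.70 and log2(29.70) ≈ 4.89). The sketch below reproduces those relationships on a hypothetical four-term vocabulary; it is an inference from the output, not a reading of the `textimporter` source.

```python
import numpy as np
import pandas as pd

toy_vocab = pd.DataFrame(
    {'n': [4, 2, 1, 1]},
    index=pd.Index(['to', 'the', 'of', 'and'], name='term_str'),
)

toy_vocab['n_chars'] = toy_vocab.index.str.len()   # term length in characters
toy_vocab['p'] = toy_vocab.n / toy_vocab.n.sum()   # relative frequency
toy_vocab['s'] = 1 / toy_vocab.p                   # reciprocal of p
toy_vocab['i'] = np.log2(toy_vocab.s)              # self-information in bits
toy_vocab['h'] = toy_vocab.p * toy_vocab.i         # term's share of the entropy
```

Under this reading, summing `h` over all terms gives the entropy of the term distribution in bits.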
