# Metadata

```{yaml}
Course:   DS 5001 
Module:   02 Text Models
Topic:    Text into Data: Importing a Text, or, Clip, Chunk, and Split
Author:   R.C. Alvarado
```
**Purpose**:  Demonstrate how to tokenize a raw text and map an OHCO onto the resulting dataframe of tokens.

# Set Up

In [1]:
import pandas as pd

In [2]:
data_home = "../data"

In [3]:
text_file = f"{data_home}/gutenberg/pg105.txt"
csv_file = f"{data_home}/output/austen-persuasion.csv" # The file we will create

In [4]:
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Import file into a dataframe

In [5]:
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()

In [6]:
LINES.sample(20)

line_num,line_str
2691,"manner on purpose to ask us, how can one say no?"""
3792,"to her. She gave a moment's recollection, as ..."
2546,"""I thought the Miss Musgroves had been here: M..."
3047,"from Uppercross, where she felt she had been s..."
8526,1.E.8. You may charge a reasonable fee for co...
3452,"master was a very rich gentleman, and would be..."
3577,"smiled and said, ""I am determined I will:"" he ..."
3553,"but as they drew near the Cobb, there was such..."
1217,
7898,"and Mrs Musgrove, who thought only of one sort..."


# Extract Title 

In [7]:
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')

In [8]:
print(title)

Persuasion, by Jane Austen


# Clip Cruft

In [9]:
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT", 
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

In [10]:
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

In [11]:
line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

In [12]:
line_a, line_b

(19, 8372)

In [13]:
LINES = LINES.loc[line_a : line_b].copy() # Copy so later column assignments do not act on a slice of the original

In [14]:
LINES.head(10)

line_num,line_str
19,
20,
21,
22,
23,Produced by Sharon Partridge and Martin Ward. ...
24,by Al Haines.
25,
26,
27,
28,


In [15]:
LINES.tail(10)

line_num,line_str
8363,
8364,
8365,
8366,
8367,
8368,
8369,
8370,
8371,End of the Project Gutenberg EBook of Persuasi...
8372,


# Chunk by chapter

## Find all chapter headers

The regex depends on the source text, so inspect the text to see how its chapter headings are written before settling on a pattern.
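One way to investigate is to run a candidate pattern against the lines and look at what it catches. The sketch below does this on a handful of made-up lines; `sample_lines` and `candidate_pat` are hypothetical names used only for illustration, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical lines standing in for a real source text
sample_lines = pd.DataFrame({'line_str': [
    'Chapter 1',
    'Sir Walter Elliot, of Kellynch Hall ...',
    '',
    'CHAPTER 2',
    'In chapter 3 we learn nothing new.',
    'Letter 4',
]})

candidate_pat = r"^\s*(?:chapter|letter)\s+\d+"

# str.match anchors at the start of each string, so a mid-line
# mention of "chapter 3" is not mistaken for a heading
hits = sample_lines.line_str.str.match(candidate_pat, case=False)
print(sample_lines.loc[hits])
```

If the pattern catches too much or too little, adjust it and repeat before applying it to the full text.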

In [16]:
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

In [17]:
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [18]:
LINES.loc[chap_lines] # Use as filter for dataframe

line_num,line_str
47,Chapter 1
306,Chapter 2
500,Chapter 3
786,Chapter 4
959,Chapter 5
1297,Chapter 6
1657,Chapter 7
1992,Chapter 8
2346,Chapter 9
2632,Chapter 10


## Assign numbers to chapters

In [19]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]

In [20]:
LINES.loc[chap_lines]

line_num,line_str,chap_num
47,Chapter 1,1.0
306,Chapter 2,2.0
500,Chapter 3,3.0
786,Chapter 4,4.0
959,Chapter 5,5.0
1297,Chapter 6,6.0
1657,Chapter 7,7.0
1992,Chapter 8,8.0
2346,Chapter 9,9.0
2632,Chapter 10,10.0


Notice that all lines that are not chapter headers have no chapter number assigned to them.

In [21]:
LINES.sample(10)

line_num,line_str,chap_num
4129,"The breakfast-room chimney smokes a little, I ...",
3775,would be the farther advantage of sending an a...,
290,There was only a small part of his estate that...,
5507,,
43,,
6778,with Miss Elliot and Sir Walter as long ago as...,
3375,a wish that such another woman were at Uppercr...,
1253,Anne had always thought such a style of interc...,
1153,"""Yes, I made the best of it; I always do: but...",
4448,regret in the duties and dignity of the reside...,


## Forward-fill chapter numbers to following text lines

`ffill()` will replace null values with the previous non-null value.
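To see the effect on a small scale, here is a minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `LINES`). Note that nulls appearing before the first non-null value have nothing to inherit, so they stay null.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of LINES: only header lines carry a chapter number
toy = pd.DataFrame({
    'line_str': ['front matter', 'Chapter 1', 'text a', 'text b', 'Chapter 2', 'text c'],
    'chap_num': [np.nan, 1, np.nan, np.nan, 2, np.nan],
})

# Each null takes the most recent non-null value above it
toy['chap_num'] = toy.chap_num.ffill()
print(toy)
```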

In [22]:
LINES.chap_num = LINES.chap_num.ffill()

In [23]:
LINES.sample(10)

line_num,line_str,chap_num
2763,"Mary exclaimed, ""Bless me! here is Winthrop. ...",10.0
7633,,23.0
6296,recollections of the concert were quite happy ...,21.0
786,Chapter 4,4.0
5687,thickest.,19.0
1871,"actuated, perhaps, by the same view of escapin...",7.0
2754,"the sweets of poetical despondence, and meanin...",10.0
1388,"humours and indulges them to such a degree, an...",6.0
1590,"such gloomy things.""",6.0
2162,"kind wishes, as to her son, he had probably be...",8.0


Notice that the lines that precede our first chapter have no chapter number, which is what we want. We need to decide whether to keep these lines as textual front matter or to dispose of them.
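The two choices can be sketched on a made-up frame as follows. Labeling front matter as chapter 0 is only one possible convention and an assumption of this sketch; this notebook takes the other route and drops those lines.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of LINES after forward-filling
toy = pd.DataFrame({
    'line_str': ['Produced by ...', 'Chapter 1', 'text a'],
    'chap_num': [np.nan, 1.0, 1.0],
})

# Option A: dispose of the front matter
dropped = toy.dropna(subset=['chap_num'])

# Option B (assumed convention): keep it under a placeholder chapter 0
kept = toy.assign(chap_num=toy.chap_num.fillna(0).astype(int))

print(dropped)
print(kept)
```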

In [24]:
LINES.head(20)

line_num,line_str,chap_num
19,,
20,,
21,,
22,,
23,Produced by Sharon Partridge and Martin Ward. ...,
24,by Al Haines.,
25,,
26,,
27,,
28,,


## Clean up

In [25]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
# LINES = LINES.loc[~LINES.chap_num.isna()] # Remove everything before Chapter 1 (alternate method)
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int

In [26]:
LINES.sample(10)

line_num,line_str,chap_num
4687,"""now Miss Anne was come, she could not suppose...",16
103,pardoned the youthful infatuation which made h...,1
6700,"""The language, I know, is highly disrespectful...",21
3716,day or night. And all this was said with a tr...,12
2381,and when he came back he had the pain of findi...,9
6504,"""And--were you much acquainted?""",21
2716,"It was mere lively chat, such as any young per...",10
4257,"Captain Benwick.""",14
1669,could feel secure even for a week.,7
2630,,9


## Group lines into chapters

In [27]:
OHCO[:1]

['chap_num']

In [28]:
CHAPS = LINES.groupby(OHCO[:1]).line_str.apply(lambda x: '\n'.join(x)).to_frame('chap_str') # Make big string for each chapter

In [29]:
CHAPS.head(10)

chap_num,chap_str
1,"\n\nSir Walter Elliot, of Kellynch Hall, in So..."
2,"\n\nMr Shepherd, a civil, cautious lawyer, who..."
3,"\n\n""I must take leave to observe, Sir Walter,..."
4,"\n\nHe was not Mr Wentworth, the former curate..."
5,\n\nOn the morning appointed for Admiral and M...
6,\n\nAnne had not wanted this visit to Uppercro...
7,"\n\nA very few days more, and Captain Wentwort..."
8,\n\nFrom this time Captain Wentworth and Anne ...
9,\n\nCaptain Wentworth was come to Kellynch as ...
10,\n\nOther opportunities of making her observat...


In [30]:
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()

In [31]:
CHAPS.head(10)

chap_num,chap_str
1,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
2,"Mr Shepherd, a civil, cautious lawyer, who, wh..."
3,"""I must take leave to observe, Sir Walter,"" sa..."
4,"He was not Mr Wentworth, the former curate of ..."
5,On the morning appointed for Admiral and Mrs C...
6,"Anne had not wanted this visit to Uppercross, ..."
7,"A very few days more, and Captain Wentworth wa..."
8,From this time Captain Wentworth and Anne Elli...
9,Captain Wentworth was come to Kellynch as to a...
10,Other opportunities of making her observations...


So, now we have our text grouped by chapters.

# Split chapters into paragraphs 

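We use Pandas' convenient `.str.split()` method with `expand=True`, which spreads the pieces of each chapter across columns, followed by `.stack()`, which pivots those columns into a second index level.
Note that this creates zero-based indexes, so the first paragraph of each chapter is numbered 0.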
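The mechanics are easier to see on a tiny example. The sketch below uses a made-up two-chapter frame (`toy_chaps` is hypothetical). The explicit `.dropna()` is an addition for this sketch: it removes the padding cells that `expand=True` creates for shorter chapters, in case `.stack()` does not drop them itself in the installed pandas version.

```python
import pandas as pd

# Hypothetical two-chapter frame; chapter 1 has two paragraphs, chapter 2 has one
toy_chaps = pd.DataFrame(
    {'chap_str': ['First para.\n\nSecond para.', 'Only para.']},
    index=pd.Index([1, 2], name='chap_num'),
)

# expand=True: one column per piece, padded with missing values
wide = toy_chaps.chap_str.str.split(r'\n\n+', expand=True)

# stack(): column labels 0, 1, ... become the inner index level
paras = wide.stack().dropna().to_frame('para_str')
paras.index.names = ['chap_num', 'para_num']
print(paras)
```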

In [32]:
para_pat = r'\n\n+'

In [33]:
# CHAPS['chap_str'].str.split(para_pat, expand=True).head()

In [34]:
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

In [35]:
PARAS.head()

chap_num,para_num,para_str
1,0,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
1,1,"""ELLIOT OF KELLYNCH HALL."
1,2,"""Walter Elliot, born March 1, 1760, married, J..."
1,3,Precisely such had the paragraph originally st...
1,4,Then followed the history and rise of the anci...


In [36]:
PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True)
PARAS['para_str'] = PARAS['para_str'].str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs

In [37]:
PARAS.head()

chap_num,para_num,para_str
1,0,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
1,1,"""ELLIOT OF KELLYNCH HALL."
1,2,"""Walter Elliot, born March 1, 1760, married, J..."
1,3,Precisely such had the paragraph originally st...
1,4,Then followed the history and rise of the anci...


# Split paragraphs into sentences

In [38]:
# sent_pat = r'[.?!;:"]+'
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]

In [39]:
SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')] # Remove empty sentences
SENTS.sent_str = SENTS.sent_str.str.strip() # CRUCIAL: strip edge whitespace, or the token split will produce blank tokens

In [40]:
SENTS.head()

chap_num,para_num,sent_num,sent_str
1,0,0,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
1,0,1,"there he found occupation for an idle hour, an..."
1,0,2,there his faculties were roused into admiratio...
1,0,3,"there any unwelcome sensations, arising from d..."
1,0,4,"and there, if every other leaf were powerless,..."


# Split sentences into tokens

In [41]:
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

In [42]:
TOKENS.index.names = OHCO[:4]

In [43]:
TOKENS

chap_num,para_num,sent_num,token_num,token_str
1,0,0,0,Sir
1,0,0,1,Walter
1,0,0,2,Elliot
1,0,0,3,of
1,0,0,4,Kellynch
...,...,...,...,...
24,13,0,6,of
24,13,0,7,Persuasion
24,13,0,8,by
24,13,0,9,Jane


# Extract Vocabulary

In [44]:
TOKENS['term_str'] = TOKENS.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB = TOKENS.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'

In [45]:
VOCAB

term_id,term_str,n
0,the,3330
1,to,2808
2,and,2800
3,of,2572
4,a,1595
...,...,...
5755,reins,1
5756,judiciously,1
5757,rut,1
5758,dung,1


# Gathering by Content Object

In [46]:
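def gather(ohco_level):
    """Rebuild the strings at an OHCO level (1=chapter, 2=paragraph, 3=sentence) from TOKENS."""
    level_name = OHCO[ohco_level-1].split('_')[0] # e.g. 'chap_num' -> 'chap'
    df = TOKENS.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df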

In [47]:
gather(1)

chap_num,chap_str
1,Sir Walter Elliot of Kellynch Hall in Somerset...
2,Mr Shepherd a civil cautious lawyer who whatev...
3,"""I must take leave to observe Sir Walter "" sai..."
4,He was not Mr Wentworth the former curate of M...
5,On the morning appointed for Admiral and Mrs C...
6,Anne had not wanted this visit to Uppercross t...
7,A very few days more and Captain Wentworth was...
8,From this time Captain Wentworth and Anne Elli...
9,Captain Wentworth was come to Kellynch as to a...
10,Other opportunities of making her observations...


In [48]:
gather(2)

chap_num,para_num,para_str
1,0,Sir Walter Elliot of Kellynch Hall in Somerset...
1,1,"""ELLIOT OF KELLYNCH HALL"
1,2,"""Walter Elliot born March 1 1760 married July ..."
1,3,Precisely such had the paragraph originally st...
1,4,Then followed the history and rise of the anci...
...,...,...
24,9,Anne satisfied at a very early period of Lady ...
24,10,Her recent good offices by Anne had been enoug...
24,11,Mrs Smith s enjoyments were not spoiled by thi...
24,12,Finis


In [49]:
gather(3)

chap_num,para_num,sent_num,sent_str
1,0,0,Sir Walter Elliot of Kellynch Hall in Somerset...
1,0,1,there he found occupation for an idle hour and...
1,0,2,there his faculties were roused into admiratio...
1,0,3,there any unwelcome sensations arising from do...
1,0,4,and there if every other leaf were powerless h...
...,...,...,...
24,11,4,Anne was tenderness itself and she had the ful...
24,11,5,His profession was all that could ever make he...
24,11,6,She gloried in being a sailor s wife but she m...
24,12,0,Finis


# Save work to CSV

This step is important: the CSV file written here will be used for homework.

In [50]:
TOKENS.to_csv(csv_file)
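
When the file is read back later, the OHCO levels arrive as ordinary columns unless they are restored as the index. A minimal round-trip sketch, using a made-up two-token frame and a temporary file (`toy_tokens` and `toy_csv` are hypothetical names, not files from this course):

```python
import os
import tempfile

import pandas as pd

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Hypothetical miniature of TOKENS with the four-level OHCO index
toy_tokens = pd.DataFrame(
    {'token_str': ['Sir', 'Walter'], 'term_str': ['sir', 'walter']},
    index=pd.MultiIndex.from_tuples([(1, 0, 0, 0), (1, 0, 0, 1)], names=OHCO),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_csv = os.path.join(tmp_dir, 'toy-tokens.csv')
    toy_tokens.to_csv(toy_csv)  # Index levels are written as leading columns
    restored = pd.read_csv(toy_csv, index_col=OHCO)  # Rebuild the OHCO index

print(restored)
```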

# Use Library

In [51]:
import sys
sys.path.append("../lib")
from textimporter import TextImporter

In [52]:
pg105 = TextImporter(src_file=text_file, ohco_pats=[('chap', chap_pat, 'm')], clip_pats=clip_pats)
pg105.import_source()
pg105.parse_tokens()
pg105.extract_vocab()

Importing  ../data/gutenberg/pg105.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone ^\s*(?:chapter|letter)\s+\d+
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by delimitter [.?!;:]+
Parsing OHCO level 3 token_num by delimitter [\s',-]+


<textimporter.TextImporter at 0x7f7e083c7520>

In [53]:
pg105.TOKENS

chap_id,para_num,sent_num,token_num,token_str,term_str
1,0,0,0,Sir,sir
1,0,0,1,Walter,walter
1,0,0,2,Elliot,elliot
1,0,0,3,of,of
1,0,0,4,Kellynch,kellynch
...,...,...,...,...,...
24,17,0,7,of,of
24,17,0,8,Persuasion,persuasion
24,17,0,9,by,by
24,17,0,10,Jane,jane


In [54]:
pg105.VOCAB

term_str,n,n_chars,p,s,i,h
the,3330,3,0.039221,25.496697,4.672238,0.183249
to,2808,2,0.033073,30.236467,4.918218,0.162658
and,2800,3,0.032978,30.322857,4.922334,0.162331
of,2572,2,0.030293,33.010886,5.044870,0.152824
a,1595,1,0.018786,53.231348,5.734204,0.107722
...,...,...,...,...,...,...
reins,1,5,0.000012,84904.000000,16.373545,0.000193
judiciously,1,11,0.000012,84904.000000,16.373545,0.000193
rut,1,3,0.000012,84904.000000,16.373545,0.000193
dung,1,4,0.000012,84904.000000,16.373545,0.000193
