# Text into Data

DS 5001 Text as Data

## Purpose

We import a text using the **Clip, Chunk, and Split pattern**.

We demonstrate how to tokenize a raw text and map an OHCO (Ordered Hierarchy of Content Objects) onto the resulting dataframe of tokens.

This goes beyond what we did last week in the First Foray notebook. We capture the chapter, paragraph, and sentence structure of the text.

## Set Up

### Import libraries

In [1]:
import pandas as pd

### Import Config

In [2]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']

In [3]:
data_home, output_dir

('/Users/jamessiegener/MSDS/DS5001/data',
 '/Users/jamessiegener/MSDS/DS5001/output')

In [4]:
text_file = f"{data_home}/gutenberg/pg105.txt"
csv_file  = f"{output_dir}/austen-persuasion.csv" # The file we will create

In [5]:
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

## Import file into a dataframe

In [6]:
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()

In [7]:
LINES.sample(20)

line_num,line_str
5364,
6763,"'somewhere down in the west,' to use her own w..."
6435,
2835,"time I was in company with him, I need not af..."
4875,"""Yes,"" sighed Anne, ""we shall, indeed, be know..."
3280,"repeated, with such tremulous feeling, the var..."
3414,"the window. It was a gentleman's carriage, a ..."
8566,1.F.
1526,"that related to Kellynch, and it pleased her: ..."
5178,be able to regard you as the future mistress o...


## Extract Title 

In [8]:
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')

In [9]:
print(title)

Persuasion, by Jane Austen


## Clip the Cruft

In [10]:
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

In [11]:
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

In [12]:
line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

In [13]:
line_a, line_b

(19, 8372)

In [14]:
LINES = LINES.loc[line_a : line_b]
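The clipping step can be tried on a small scale. The following is a minimal sketch on a hypothetical six-line file (the `toy` frame and its contents are invented for illustration); it reuses the same two marker patterns and keeps only the lines strictly between the markers.

```python
import pandas as pd

# Hypothetical toy lines standing in for a Project Gutenberg file
toy = pd.DataFrame({'line_str': [
    "The Project Gutenberg EBook of Example",
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "Chapter 1",
    "Some text.",
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "License boilerplate",
]})
toy.index.name = 'line_num'

clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT",
]

# Boolean masks marking the start and end marker lines
pat_a = toy.line_str.str.match(clip_pats[0])
pat_b = toy.line_str.str.match(clip_pats[1])

# Keep only what lies strictly between the two markers
line_a = toy.loc[pat_a].index[0] + 1
line_b = toy.loc[pat_b].index[0] - 1
clipped = toy.loc[line_a:line_b]
```

Because `.loc` slices by label and includes both endpoints, adding and subtracting 1 is what excludes the marker lines themselves.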

In [15]:
LINES.head(10)

line_num,line_str
19,
20,
21,
22,
23,Produced by Sharon Partridge and Martin Ward. ...
24,by Al Haines.
25,
26,
27,
28,


In [16]:
LINES.tail(10)

line_num,line_str
8363,
8364,
8365,
8366,
8367,
8368,
8369,
8370,
8371,End of the Project Gutenberg EBook of Persuasi...
8372,


## Chunk by Chapter

### Find all chapter headers

The regex will depend on the source text. You need to investigate the source text to figure this out.

In [17]:
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"
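Before applying a header pattern to the whole text, it can help to check it against a few candidate lines. This is a small sketch using invented sample lines (the `candidates` series is hypothetical), showing which shapes the pattern accepts and which it rejects.

```python
import pandas as pd

chap_pat = r"^\s*(?:chapter|letter)\s+\d+"

# Hypothetical candidate lines; only the first three look like headers
candidates = pd.Series([
    "Chapter 1",
    "  CHAPTER 12",
    "Letter 4",
    "Chapter the First",             # no digits after the keyword, so no match
    "In the next chapter 3 things",  # keyword is not at the start of the line
])

# case=False makes the keyword match regardless of capitalization
is_header = candidates.str.match(chap_pat, case=False)
```

A text that numbers its chapters with Roman numerals or words would need a different pattern, which is why inspecting the source first matters.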

In [18]:
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [19]:
LINES.loc[chap_lines] # Use as filter for dataframe

line_num,line_str
47,Chapter 1
306,Chapter 2
500,Chapter 3
786,Chapter 4
959,Chapter 5
1297,Chapter 6
1657,Chapter 7
1992,Chapter 8
2346,Chapter 9
2632,Chapter 10


### Assign numbers to chapters

In [20]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]

In [21]:
LINES.loc[chap_lines]

line_num,line_str,chap_num
47,Chapter 1,1.0
306,Chapter 2,2.0
500,Chapter 3,3.0
786,Chapter 4,4.0
959,Chapter 5,5.0
1297,Chapter 6,6.0
1657,Chapter 7,7.0
1992,Chapter 8,8.0
2346,Chapter 9,9.0
2632,Chapter 10,10.0


Notice that all lines that are not chapter headers have no chapter number assigned to them.

In [22]:
LINES.sample(10)

line_num,line_str,chap_num
7916,"""Do you think so? But I am afraid; and I shou...",
5182,You are your mother's self in countenance and ...,
1219,with me till this morning. It would have been...,
7734,,
1852,she could not but believe that in his place sh...,
1189,,
894,from what she had been made to think at ninete...,
4360,"for the present, to see his brother in Shropsh...",
3468,"""Putting all these very extraordinary circumst...",
4543,"appearance, his air of elegance and fashion, h...",


### Forward-fill chapter numbers to following text lines

`ffill()` will replace null values with the previous non-null value.

In [23]:
LINES.chap_num = LINES.chap_num.ffill()
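The effect of the forward fill is easiest to see on a handful of rows. Here is a minimal sketch with a hypothetical five-line frame (`toy` is invented for illustration): header rows carry a number, the rows after them inherit it, and rows before the first header stay null.

```python
import numpy as np
import pandas as pd

# Hypothetical lines: only the header rows have a chapter number
toy = pd.DataFrame({
    'line_str': ['Front matter', 'Chapter 1', 'First line', 'Chapter 2', 'Second line'],
    'chap_num': [np.nan, 1, np.nan, 2, np.nan],
})

# Each null takes the most recent non-null value above it
toy['chap_num'] = toy.chap_num.ffill()
```

The first row remains null because there is no earlier value to carry forward.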

In [24]:
LINES.sample(10)

line_num,line_str,chap_num
7144,she wanted to see; it was thought a good oppor...,22.0
604,"line. One day last spring, in town, I was in ...",3.0
5898,"more visits from you.""",19.0
7601,"persuaded to think might do very well,"" and a ...",23.0
7552,,22.0
6679,,21.0
1277,"advantage, their faces were rather pretty, the...",5.0
6083,"were all properly arranged, she looked round t...",20.0
7162,"in a few months, quite as soon as Louisa's. ""...",22.0
89,Vanity was the beginning and the end of Sir Wa...,1.0


Notice that the lines that precede our first chapter have no chapter number, which is what we want. We need to decide whether to keep these lines as textual front matter or to dispose of them.

In [25]:
LINES.head(20)

line_num,line_str,chap_num
19,,
20,,
21,,
22,,
23,Produced by Sharon Partridge and Martin Ward. ...,
24,by Al Haines.,
25,,
26,,
27,,
28,,


### Clean up

In [26]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
# LINES = LINES.loc[~LINES.chap_num.isna()] # Remove everything before Chapter 1 (alternate method)
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int

In [27]:
LINES.sample(10)

line_num,line_str,chap_num
8267,securing the happiness of her other child.,24
6479,"Wallis herself, which did not seem bad authori...",21
5111,"approves it, and has generally taken me when I...",17
4285,,14
7444,"""I am not yet so much changed,"" cried Anne, an...",22
5306,going on; always the last of my family to be n...,18
5645,"""Poor Frederick!"" said he at last. ""Now he mu...",18
105,"concealed his failings, and promoted his real ...",1
5424,"Mary, from the present course of events, they ...",18
7857,seen this? Can you fail to have understood my...,23


### Group lines into chapters

In [28]:
OHCO[:1]

['chap_num']

In [29]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')

In [30]:
CHAPS.head(10)

chap_num,chap_str
1,"\n\nSir Walter Elliot, of Kellynch Hall, in So..."
2,"\n\nMr Shepherd, a civil, cautious lawyer, who..."
3,"\n\n""I must take leave to observe, Sir Walter,..."
4,"\n\nHe was not Mr Wentworth, the former curate..."
5,\n\nOn the morning appointed for Admiral and M...
6,\n\nAnne had not wanted this visit to Uppercro...
7,"\n\nA very few days more, and Captain Wentwort..."
8,\n\nFrom this time Captain Wentworth and Anne ...
9,\n\nCaptain Wentworth was come to Kellynch as ...
10,\n\nOther opportunities of making her observat...


In [31]:
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()

In [32]:
CHAPS

chap_num,chap_str
1,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
2,"Mr Shepherd, a civil, cautious lawyer, who, wh..."
3,"""I must take leave to observe, Sir Walter,"" sa..."
4,"He was not Mr Wentworth, the former curate of ..."
5,On the morning appointed for Admiral and Mrs C...
6,"Anne had not wanted this visit to Uppercross, ..."
7,"A very few days more, and Captain Wentworth wa..."
8,From this time Captain Wentworth and Anne Elli...
9,Captain Wentworth was come to Kellynch as to a...
10,Other opportunities of making her observations...


So, now we have our text grouped by chapters.

## Split chapters into paragraphs 

We use Pandas' convenient `.str.split()` method with `expand=True`, which spreads the pieces across columns, followed by `.stack()`, which pivots those columns into a new index level.
Note that this creates zero-based indexes.
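To make the split-then-stack idea concrete, here is a minimal sketch on two hypothetical chapters (`toy_chaps` is invented for illustration). Depending on the pandas version, `stack()` may or may not drop the missing cells by itself, so the explicit `.dropna()` is an added safeguard and not part of the notebook's own cell.

```python
import pandas as pd

# Hypothetical chapters: the first has two paragraphs, the second has one
toy_chaps = pd.DataFrame(
    {'chap_str': ["First para.\n\nSecond para.", "Only para."]},
    index=pd.Index([1, 2], name='chap_num'),
)

# expand=True spreads the pieces across columns 0, 1, ...;
# shorter chapters are padded with missing values
wide = toy_chaps['chap_str'].str.split(r'\n\n+', expand=True)

# stack() moves the column labels into a second index level (zero-based)
paras = wide.stack().dropna().to_frame('para_str')
paras.index.names = ['chap_num', 'para_num']
```

The column labels 0, 1, ... produced by `expand=True` are what become the zero-based `para_num` values.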

In [33]:
para_pat = r'\n\n+'

In [34]:
# CHAPS['chap_str'].str.split(para_pat, expand=True).head()

In [35]:
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

In [36]:
PARAS.head()

chap_num,para_num,para_str
1,0,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
1,1,"""ELLIOT OF KELLYNCH HALL."
1,2,"""Walter Elliot, born March 1, 1760, married, J..."
1,3,Precisely such had the paragraph originally st...
1,4,Then followed the history and rise of the anci...


In [37]:
PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True)
PARAS['para_str'] = PARAS['para_str'].str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs

In [38]:
PARAS.head()

chap_num,para_num,para_str
1,0,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
1,1,"""ELLIOT OF KELLYNCH HALL."
1,2,"""Walter Elliot, born March 1, 1760, married, J..."
1,3,Precisely such had the paragraph originally st...
1,4,Then followed the history and rise of the anci...


## Split paragraphs into sentences

In [39]:
# sent_pat = r'[.?!;:"]+'
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]

In [40]:
SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')] # Remove empty sentences
SENTS.sent_str = SENTS.sent_str.str.strip() # Strip whitespace; otherwise blank tokens appear when splitting
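Why the strip matters can be seen with a single string. This sketch uses the standard library's `re.split` as a stand-in for the pandas string method, on a hypothetical sentence fragment that begins with a space, as fragments after a period typically do.

```python
import re

token_pat = r"[\s',-]+"

# Hypothetical sentence fragment left over after splitting on punctuation
sent = " there he found occupation"

# A leading separator yields an empty first token
unstripped = re.split(token_pat, sent)

# Stripping first avoids the empty token
stripped = re.split(token_pat, sent.strip())
```

Without the strip, every sentence that follows a punctuation mark would contribute an empty string at `token_num` 0.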

In [53]:
SENTS.head(10)

chap_num,para_num,sent_num,sent_str
1,0,0,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
1,0,1,"there he found occupation for an idle hour, an..."
1,0,2,there his faculties were roused into admiratio...
1,0,3,"there any unwelcome sensations, arising from d..."
1,0,4,"and there, if every other leaf were powerless,..."
1,0,5,This was the page at which the favourite volum...
1,1,0,"""ELLIOT OF KELLYNCH HALL"
1,2,0,"""Walter Elliot, born March 1, 1760, married, J..."
1,2,1,"of South Park, in the county of Gloucester, by..."
1,2,2,"Anne, born August 9, 1787"


In [42]:
SENTS.sample(10)

chap_num,para_num,sent_num,sent_str
12,9,1,""" cried Captain Wentworth, instantly, and with..."
16,6,1,"and it did not surprise her, therefore, that L..."
15,11,9,"but, at the same time, ""must lament his being ..."
21,15,2,""""
24,4,6,and if they could but keep Captain Wentworth f...
8,1,8,There must be the same immediate association o...
18,30,1,"Yes, yes we will have a snug walk together, an..."
20,44,3,"The others returned, the room filled again, be..."
6,8,8,"but I shall tell you, Miss Anne, because you m..."
12,64,0,Captain Wentworth now hurried off to get every...


## Split sentences into tokens

In [43]:
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

In [44]:
TOKENS.index.names = OHCO[:4]

In [45]:
TOKENS

chap_num,para_num,sent_num,token_num,token_str
1,0,0,0,Sir
1,0,0,1,Walter
1,0,0,2,Elliot
1,0,0,3,of
1,0,0,4,Kellynch
...,...,...,...,...
24,13,0,6,of
24,13,0,7,Persuasion
24,13,0,8,by
24,13,0,9,Jane


## Extract Vocabulary

In [46]:
TOKENS['term_str'] = TOKENS.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB = TOKENS.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'
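The normalization and counting can be checked on a few tokens. This is a sketch on hypothetical data (`tokens` is invented); it uses `.str.replace` and `rename_axis(...).reset_index(name=...)` as one possible way to get the same two columns, which is a variation on the cell above and not the notebook's exact code.

```python
import pandas as pd

# Hypothetical tokens with mixed case and stray punctuation
tokens = pd.DataFrame({'token_str': ['The', 'baronet', '"the', 'Baronetage.', 'the']})

# Normalize: drop non-word characters and underscores, then lowercase
tokens['term_str'] = tokens.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()

# Count each distinct term; the most frequent term gets term_id 0
vocab = tokens.term_str.value_counts().rename_axis('term_str').reset_index(name='n')
vocab.index.name = 'term_id'
```

Three surface forms (`The`, `"the`, `the`) collapse into the single term `the`, which is the point of separating `token_str` from `term_str`.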

In [47]:
VOCAB

term_id,term_str,n
0,the,3330
1,to,2808
2,and,2800
3,of,2572
4,a,1595
...,...,...
5755,reins,1
5756,judiciously,1
5757,rut,1
5758,dung,1


## Gathering by Content Object

In [48]:
def gather(ohco_level):
    # Rebuild strings at an OHCO level: 1 = chapter, 2 = paragraph, 3 = sentence
    level_name = OHCO[ohco_level-1].split('_')[0]
    df = TOKENS.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df
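Gathering works because the OHCO index records where every token came from. The sketch below uses a hypothetical five-token table and a variant function, `gather_from`, that takes the token table as an argument instead of reading it from the notebook's namespace; both names are invented for illustration.

```python
import pandas as pd

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Hypothetical tokens: one paragraph containing two sentences
toy_tokens = pd.DataFrame(
    {'token_str': ['Sir', 'Walter', 'Elliot', 'was', 'vain']},
    index=pd.MultiIndex.from_tuples(
        [(1, 0, 0, 0), (1, 0, 0, 1), (1, 0, 0, 2), (1, 0, 1, 0), (1, 0, 1, 1)],
        names=OHCO,
    ),
)

def gather_from(tokens, ohco_level):
    """Hypothetical variant of gather() that receives the token table explicitly."""
    level_name = OHCO[ohco_level - 1].split('_')[0]
    return (tokens.groupby(OHCO[:ohco_level])
            .token_str.apply(' '.join)
            .to_frame(f"{level_name}_str"))

sents = gather_from(toy_tokens, 3)  # group by chapter, paragraph, sentence
paras = gather_from(toy_tokens, 2)  # group by chapter, paragraph
```

Passing the table as a parameter is a design choice that makes the function reusable on other token tables; the grouping logic is otherwise the same.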

In [49]:
gather(1)

chap_num,chap_str
1,Sir Walter Elliot of Kellynch Hall in Somerset...
2,Mr Shepherd a civil cautious lawyer who whatev...
3,"""I must take leave to observe Sir Walter "" sai..."
4,He was not Mr Wentworth the former curate of M...
5,On the morning appointed for Admiral and Mrs C...
6,Anne had not wanted this visit to Uppercross t...
7,A very few days more and Captain Wentworth was...
8,From this time Captain Wentworth and Anne Elli...
9,Captain Wentworth was come to Kellynch as to a...
10,Other opportunities of making her observations...


In [50]:
gather(2)

chap_num,para_num,para_str
1,0,Sir Walter Elliot of Kellynch Hall in Somerset...
1,1,"""ELLIOT OF KELLYNCH HALL"
1,2,"""Walter Elliot born March 1 1760 married July ..."
1,3,Precisely such had the paragraph originally st...
1,4,Then followed the history and rise of the anci...
...,...,...
24,9,Anne satisfied at a very early period of Lady ...
24,10,Her recent good offices by Anne had been enoug...
24,11,Mrs Smith s enjoyments were not spoiled by thi...
24,12,Finis


In [51]:
gather(3)

chap_num,para_num,sent_num,sent_str
1,0,0,Sir Walter Elliot of Kellynch Hall in Somerset...
1,0,1,there he found occupation for an idle hour and...
1,0,2,there his faculties were roused into admiratio...
1,0,3,there any unwelcome sensations arising from do...
1,0,4,and there if every other leaf were powerless h...
...,...,...,...
24,11,4,Anne was tenderness itself and she had the ful...
24,11,5,His profession was all that could ever make he...
24,11,6,She gloried in being a sailor s wife but she m...
24,12,0,Finis


## Save work to CSV

This is important: the resulting file will be used for homework.

In [54]:
TOKENS.to_csv(csv_file)
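When the CSV is loaded again, the OHCO columns come back as ordinary columns unless they are named as the index. The following sketch shows one way to restore the index, using a hypothetical two-token table and an in-memory buffer in place of a real file.

```python
import io
import pandas as pd

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Hypothetical two-token table with the OHCO as its index
toy_tokens = pd.DataFrame(
    {'token_str': ['Sir', 'Walter']},
    index=pd.MultiIndex.from_tuples([(1, 0, 0, 0), (1, 0, 0, 1)], names=OHCO),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_tokens.to_csv(buffer)
buffer.seek(0)

# index_col=OHCO rebuilds the four-level index from the saved columns
restored = pd.read_csv(buffer, index_col=OHCO)
```

With a real file, the same `index_col=OHCO` argument would be passed along with the file path.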