# Text into Data

```yaml
Name:   Ji Hyun Kim 
Computing ID:   mqa4qu
Code Guide: M02_01_Importing-Persuasion
Purpose:    Convert a different text from raw text into a data frame of tokens while preserving its OHCO (Ordered Hierarchy of Content Objects). Then extract some statistical features from the resulting corpus.
```

## Import

In [120]:
import pandas as pd
import configparser

In [121]:
config = configparser.ConfigParser()
config.read("../../../env.ini")

data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']

data_home, output_dir

('/Users/jihyunkim/desktop/uva_docs/DS5001/data',
 '/Users/jihyunkim/desktop/uva_docs/DS5001/output')
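The `env.ini` file itself is not shown in the notebook. As a rough sketch, assuming it only needs a `[DEFAULT]` section with the two keys read above, it could look like the string below; the paths are placeholders, not the real ones.

```python
import configparser

# Hypothetical contents of env.ini; the real file is not part of this notebook.
sample_ini = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
"""

config = configparser.ConfigParser()
config.read_string(sample_ini)  # read_string parses text instead of a file on disk

data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
```

Note that `config.read(...)` returns the list of files it parsed successfully, so an empty list is a quick sign that the relative path did not resolve.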

In [122]:
text_file = f"{data_home}/gutenberg/pg161.txt"
csv_file  = f"{output_dir}/austen-sense-and-sensibility.csv"

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

In [123]:
# Read the file inside a context manager so the handle is closed afterwards
with open(text_file, 'r', encoding='utf-8-sig') as f:
    LINES = pd.DataFrame(f.readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()
# LINES.head(10)

## Tasks

In [124]:
# Extract Title
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')
print(title)

Sense and Sensibility, by Jane Austen


### Task 1: Clip the cruft
Remove Project Gutenberg's front and back matter, using the `*** START OF ...` and `*** END OF ...` marker lines to locate the boundaries of the text.

In [125]:
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

In [126]:
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

line_a, line_b

(20, 12666)

In [127]:
LINES = LINES.loc[line_a : line_b].copy()  # copy so later column assignments act on an independent frame
LINES.head(20)
LINES.tail(10)

line_num,line_str
12657,
12658,
12659,
12660,
12661,
12662,
12663,
12664,
12665,End of the Project Gutenberg EBook of Sense an...
12666,
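The tail shown above still contains the line `End of the Project Gutenberg EBook of Sense an...`, because that footer sits just before the `*** END OF` marker. The sketch below repeats the clipping on made-up toy lines and then drops such a footer line; `footer_pat` is an assumed extra pattern, not part of the original task.

```python
import pandas as pd

# Toy stand-in for the Gutenberg file; these lines are invented for illustration.
toy = pd.DataFrame({'line_str': [
    "The Project Gutenberg EBook of Some Title",
    "*** START OF THIS PROJECT GUTENBERG EBOOK SOME TITLE ***",
    "",
    "CHAPTER 1",
    "Body text.",
    "",
    "End of the Project Gutenberg EBook of Some Title",
    "",
    "*** END OF THIS PROJECT GUTENBERG EBOOK SOME TITLE ***",
    "License text.",
]})
toy.index.name = 'line_num'

start_pat = r"\*\*\*\s*START OF (?:THE|THIS) PROJECT"
end_pat = r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
# Assumption: the plain-text footer always starts with this phrase.
footer_pat = r"End of (?:the )?Project Gutenberg"

first = toy.loc[toy.line_str.str.match(start_pat)].index[0] + 1
last = toy.loc[toy.line_str.str.match(end_pat)].index[0] - 1

# .loc slicing is label-based and includes both endpoints.
clipped = toy.loc[first:last].copy()

# Drop the leftover footer line, if one is present.
clipped = clipped[~clipped.line_str.str.match(footer_pat)]
```

Without the last step, the footer would be carried into the final chapter and show up as tokens at the end of the corpus.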


### Task 2: Chunk by chapter
Chunk the text by chapter: locate the chapter headers in the data frame, assign them sequential numbers, forward-fill those numbers to the lines that follow, clean up, and then group the lines by chapter number.

#### Task 2 - Chunk by Chapters

In [128]:
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"
chap_lines = LINES.line_str.str.match(chap_pat, case=False)
LINES.loc[chap_lines]

line_num,line_str
42,CHAPTER 1
196,CHAPTER 2
399,CHAPTER 3
561,CHAPTER 4
756,CHAPTER 5
858,CHAPTER 6
986,CHAPTER 7
1112,CHAPTER 8
1244,CHAPTER 9
1448,CHAPTER 10


#### Task 2 - Assign numbers

In [129]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]
LINES.loc[chap_lines]

line_num,line_str,chap_num
42,CHAPTER 1,1.0
196,CHAPTER 2,2.0
399,CHAPTER 3,3.0
561,CHAPTER 4,4.0
756,CHAPTER 5,5.0
858,CHAPTER 6,6.0
986,CHAPTER 7,7.0
1112,CHAPTER 8,8.0
1244,CHAPTER 9,9.0
1448,CHAPTER 10,10.0


#### Task 2 - Forward-fill chapter numbers

In [130]:
LINES.chap_num = LINES.chap_num.ffill()
# LINES.sample(10)
LINES.head(50)

line_num,line_str,chap_num
20,,
21,,
22,,
23,,
24,,
25,,
26,,
27,,
28,,
29,,


#### Task 2 - Clean up

In [131]:
LINES = LINES.dropna(subset=['chap_num'])
LINES = LINES.loc[~chap_lines]
LINES.chap_num = LINES.chap_num.astype('int')
LINES.head(20)

line_num,line_str,chap_num
43,,1
44,,1
45,The family of Dashwood had long been settled i...,1
46,"was large, and their residence was at Norland ...",1
47,"their property, where, for many generations, t...",1
48,respectable a manner as to engage the general ...,1
49,surrounding acquaintance. The late owner of t...,1
50,"man, who lived to a very advanced age, and who...",1
51,"life, had a constant companion and housekeeper...",1
52,"death, which happened ten years before his own...",1


#### Task 2 - Group by Number

In [132]:
OHCO[:1]

CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')

CHAPS.head(10)

CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS.head(10)

chap_num,chap_str
1,The family of Dashwood had long been settled i...
2,Mrs. John Dashwood now installed herself mistr...
3,Mrs. Dashwood remained at Norland several mont...
4,"""What a pity it is, Elinor,"" said Marianne, ""t..."
5,"No sooner was her answer dispatched, than Mrs...."
6,The first part of their journey was performed ...
7,Barton Park was about half a mile from the cot...
8,Mrs. Jennings was a widow with an ample jointu...
9,The Dashwoods were now settled at Barton with ...
10,"Marianne's preserver, as Margaret, with more e..."
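The four chapter steps above (number the headers, forward-fill, clean up, group) can be condensed into one small, self-contained sketch. The toy lines and the lowercase names `lines` and `chaps` are made up for illustration; the header pattern is the one used above.

```python
import pandas as pd

# Toy lines, invented for illustration.
lines = pd.DataFrame({'line_str': [
    "", "CHAPTER 1", "", "First para, line one", "line two", "",
    "Second para", "CHAPTER 2", "", "Only para",
]})
lines.index.name = 'line_num'

is_header = lines.line_str.str.match(r"^\s*(?:chapter|letter)\s+\d+", case=False)

# 1. Number the header lines 1..n (other lines get NaN).
lines.loc[is_header, 'chap_num'] = list(range(1, is_header.sum() + 1))

# 2. Forward-fill so every line carries the number of the header above it.
lines['chap_num'] = lines.chap_num.ffill()

# 3. Drop lines before the first header, then the header lines themselves.
lines = lines[lines.chap_num.notna() & ~is_header].copy()
lines['chap_num'] = lines.chap_num.astype(int)

# 4. Join each chapter's lines back into one string and trim the edges.
chaps = (
    lines.groupby('chap_num')
    .line_str.apply(lambda x: '\n'.join(x))
    .str.strip()
    .to_frame('chap_str')
)
```

Joining with `'\n'` matters for the next task: blank lines inside a chapter become `'\n\n'`, which is what the paragraph pattern splits on.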


### Task 3: Split Chapters into Paragraphs
Split the resulting data frame into paragraphs using the regex provided.

In [133]:
OHCO[:2]

['chap_num', 'para_num']

In [134]:
para_pat = r'\n\n+'

PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

PARAS.head(10)

chap_num,para_num,para_str
1,0,The family of Dashwood had long been settled i...
1,1,"By a former marriage, Mr. Henry Dashwood had o..."
1,2,"The old gentleman died: his will was read, and..."
1,3,"Mr. Dashwood's disappointment was, at first, s..."
1,4,His son was sent for as soon as his danger was...
1,5,Mr. John Dashwood had not the strong feelings ...
1,6,"He was not an ill-disposed young man, unless t..."
1,7,"When he gave his promise to his father, he med..."
1,8,"No sooner was his father's funeral over, than ..."
1,9,So acutely did Mrs. Dashwood feel this ungraci...


In [135]:
PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True)
PARAS['para_str'] = PARAS['para_str'].str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')]

PARAS.head(10)

chap_num,para_num,para_str
1,0,The family of Dashwood had long been settled i...
1,1,"By a former marriage, Mr. Henry Dashwood had o..."
1,2,"The old gentleman died: his will was read, and..."
1,3,"Mr. Dashwood's disappointment was, at first, s..."
1,4,His son was sent for as soon as his danger was...
1,5,Mr. John Dashwood had not the strong feelings ...
1,6,"He was not an ill-disposed young man, unless t..."
1,7,"When he gave his promise to his father, he med..."
1,8,"No sooner was his father's funeral over, than ..."
1,9,So acutely did Mrs. Dashwood feel this ungraci...
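To make the `split(..., expand=True).stack()` idiom easier to follow, here is a minimal sketch on two invented chapters. The explicit `dropna()` is an addition of this sketch: the padding that `expand=True` introduces is handled differently by `stack()` across pandas versions, so dropping it explicitly is assumed to be the safer choice.

```python
import pandas as pd

# Two toy chapters, invented for illustration.
chaps = pd.DataFrame(
    {'chap_str': ["Para one,\nstill para one.\n\nPara two.", "Only para."]},
    index=pd.Index([1, 2], name='chap_num'),
)

# expand=True gives one column per piece; shorter rows are padded with missing values.
wide = chaps.chap_str.str.split(r'\n\n+', expand=True)

# stack() turns the columns into an inner index level; dropna() removes the padding.
paras = wide.stack().dropna().to_frame('para_str')
paras.index.names = ['chap_num', 'para_num']

# Collapse the remaining single line breaks inside each paragraph.
paras['para_str'] = paras.para_str.str.replace('\n', ' ', regex=False).str.strip()
```

The column labels produced by `expand=True` (0, 1, 2, ...) become the `para_num` level, which is why paragraph numbering starts at 0.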


### Task 4: Split paragraphs into sentences
Split the resulting data frame into sentences using the regex provided.

In [136]:
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]

SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')]
SENTS.sent_str = SENTS.sent_str.str.strip()

SENTS.head(10)
# SENTS.sample(10)

chap_num,para_num,sent_num,sent_str
1,0,0,The family of Dashwood had long been settled i...
1,0,1,"Their estate was large, and their residence wa..."
1,0,2,The late owner of this estate was a single man...
1,0,3,"But her death, which happened ten years before..."
1,0,4,"for to supply her loss, he invited and receive..."
1,0,5,"Henry Dashwood, the legal inheritor of the Nor..."
1,0,6,"In the society of his nephew and niece, and th..."
1,0,7,His attachment to them all increased
1,0,8,The constant attention of Mr
1,0,9,and Mrs
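Rows 8 and 9 above (`The constant attention of Mr` / `and Mrs`) show a side effect of the provided pattern: every period ends a sentence, including the one in an abbreviation. The sketch below contrasts the provided pattern with one possible refinement that uses negative lookbehinds; the guarded pattern and its short list of titles are assumptions for illustration, not part of the assignment.

```python
import re

text = ("The constant attention of Mr. and Mrs. Henry Dashwood pleased him. "
        "He was glad; they were kind.")

# Pattern from the notebook: split after every run of terminal punctuation.
naive = [s.strip() for s in re.split(r'[.?!;:]+', text) if s.strip()]

# Assumed refinement: skip punctuation that directly follows Mr, Mrs, or Dr.
# Each lookbehind is fixed-width, as Python's re module requires.
guarded_pat = r'(?<!\bMr)(?<!\bMrs)(?<!\bDr)[.?!;:]+'
guarded = [s.strip() for s in re.split(guarded_pat, text) if s.strip()]

print(naive)
print(guarded)
```

Since `guarded_pat` is an ordinary regex string, it could in principle be passed to `str.split` in place of `sent_pat`, at the cost of departing from the regex the task specifies.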


### Task 5: Split sentences into tokens
Split the resulting data frame into tokens using the regex provided.

In [137]:
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

TOKENS.index.names = OHCO[:4]
# TOKENS.head(10)
TOKENS

chap_num,para_num,sent_num,token_num,token_str
1,0,0,0,The
1,0,0,1,family
1,0,0,2,of
1,0,0,3,Dashwood
1,0,0,4,had
...,...,...,...,...
50,22,0,8,and
50,22,0,9,Sensibility
50,22,0,10,by
50,22,0,11,Jane
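One thing worth checking after tokenizing is whether any tokens are empty strings: a sentence that begins or ends with one of the delimiter characters (for example a leading `--`) produces an empty piece when split. The sketch below shows this on two invented sentences and filters the empties out; the filtering step is a suggested addition, not something the task requires.

```python
import pandas as pd

# Two toy sentences with a three-level OHCO index, invented for illustration.
sents = pd.Series(
    ["Marianne's preserver, as Margaret", "--and so it ended"],
    index=pd.MultiIndex.from_tuples(
        [(1, 0, 0), (1, 0, 1)], names=['chap_num', 'para_num', 'sent_num']
    ),
    name='sent_str',
)

tokens = (
    sents.str.split(r"[\s',-]+", expand=True)
    .stack()
    .dropna()  # remove padding from shorter sentences, if any
    .to_frame('token_str')
)
tokens.index.names = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Count, then drop, the empty-string tokens created by leading/trailing delimiters.
n_empty = (tokens.token_str == '').sum()
tokens = tokens[tokens.token_str != '']
```

After filtering, `token_num` keeps the original positions, so the second sentence here starts at 1 rather than 0.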
