# Text into Data

```yaml
Name:   Ji Hyun Kim 
Computing ID:   mqa4qu
Code Guide: M02_01_Importing-Persuasion
Purpose:    Convert a different text from raw text into a data frame of tokens while preserving its OHCO. Then extract some statistical features from the resulting corpus.
```

## Import

In [1]:
import pandas as pd
import configparser



In [2]:
config = configparser.ConfigParser()
config.read("../../../env.ini")

data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']

data_home, output_dir

('/Users/jihyunkim/desktop/uva_docs/DS5001/data',
 '/Users/jihyunkim/desktop/uva_docs/DS5001/output')
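The `env.ini` file itself is not shown in the notebook. As a rough sketch, assuming it contains a `[DEFAULT]` section with `data_home` and `output_dir` keys (which is what the cell above reads), the lookup works like this. The paths and the `demo_` names are placeholders, not the author's values.

```python
import configparser

# Hypothetical contents of env.ini; the real file is not shown in the notebook.
DEMO_ENV_INI = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
"""

demo_config = configparser.ConfigParser()
# read_string() parses text directly; config.read() in the notebook takes a file path.
demo_config.read_string(DEMO_ENV_INI)

demo_data_home = demo_config['DEFAULT']['data_home']
demo_output_dir = demo_config['DEFAULT']['output_dir']
print(demo_data_home, demo_output_dir)
```

Keeping paths in a config file rather than in the notebook means the same code can run on another machine by editing one file.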

In [3]:
text_file = f"{data_home}/gutenberg/pg161.txt"
csv_file  = f"{output_dir}/austen-sense-and-sensibility.csv"

OHCO = ['book_num', 'chap_num', 'para_num', 'sent_num', 'token_num']

In [4]:
with open(text_file, 'r', encoding='utf-8-sig') as f:  # close the file once it has been read
    LINES = pd.DataFrame(f.readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()

LINES['book_num'] = 0

# LINES.head(10)

## Tasks

In [5]:
# Extract Title
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')
print(title)

Sense and Sensibility, by Jane Austen


### Task 1: Clip the cruft
Remove Gutenberg's front and back matter using the lines that indicate the start and end of the project.

In [6]:
clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

In [7]:
pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

line_a, line_b

(20, 12666)
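The clipping logic can be checked on a few made-up lines without pandas. This is only a sketch of the same idea: the sample lines and the `demo_` names are invented for illustration, and `re.match` stands in for `Series.str.match` (both anchor at the start of the string).

```python
import re

demo_clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT",
]

# Made-up lines standing in for a Gutenberg file.
demo_lines = [
    "The Project Gutenberg EBook of Some Title",
    "*** START OF THIS PROJECT GUTENBERG EBOOK SOME TITLE ***",
    "CHAPTER 1",
    "Body text.",
    "*** END OF THE PROJECT GUTENBERG EBOOK SOME TITLE ***",
    "License text.",
]

# Position of the first line matching each marker pattern.
start = next(i for i, s in enumerate(demo_lines) if re.match(demo_clip_pats[0], s))
end = next(i for i, s in enumerate(demo_lines) if re.match(demo_clip_pats[1], s))

# Keep only the lines strictly between the two markers.
demo_body = demo_lines[start + 1 : end]
print(demo_body)
```

A Python list slice excludes its end position, whereas `.loc[line_a : line_b]` includes both ends, which is why the notebook subtracts 1 from the end marker's line number.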

In [8]:
LINES = LINES.loc[line_a : line_b].copy()  # copy so later column assignments do not act on a slice
LINES.head(20)
LINES.tail(10)

line_num,line_str,book_num
12657,,0
12658,,0
12659,,0
12660,,0
12661,,0
12662,,0
12663,,0
12664,,0
12665,End of the Project Gutenberg EBook of Sense an...,0
12666,,0


### Task 2: Chunk by chapter
Chunk the text by chapter: locate the chapter headers in the data frame, assign them numbers, forward-fill those numbers onto the lines below, then group by chapter number and clean up.

#### Task 2 - Chunk by Chapters

In [9]:
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"
chap_lines = LINES.line_str.str.match(chap_pat, case=False)
LINES.loc[chap_lines]

line_num,line_str,book_num
42,CHAPTER 1,0
196,CHAPTER 2,0
399,CHAPTER 3,0
561,CHAPTER 4,0
756,CHAPTER 5,0
858,CHAPTER 6,0
986,CHAPTER 7,0
1112,CHAPTER 8,0
1244,CHAPTER 9,0
1448,CHAPTER 10,0


#### Task 2 - Assign numbers

In [10]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]
LINES.loc[chap_lines]

line_num,line_str,book_num,chap_num
42,CHAPTER 1,0,1.0
196,CHAPTER 2,0,2.0
399,CHAPTER 3,0,3.0
561,CHAPTER 4,0,4.0
756,CHAPTER 5,0,5.0
858,CHAPTER 6,0,6.0
986,CHAPTER 7,0,7.0
1112,CHAPTER 8,0,8.0
1244,CHAPTER 9,0,9.0
1448,CHAPTER 10,0,10.0


#### Task 2 - Forward Filling chapter numbers

In [11]:
LINES.chap_num = LINES.chap_num.ffill()
# LINES.sample(10)
LINES.head(50)

line_num,line_str,book_num,chap_num
20,,0,
21,,0,
22,,0,
23,,0,
24,,0,
25,,0,
26,,0,
27,,0,
28,,0,
29,,0,


#### Task 2 - Clean up

In [12]:
LINES = LINES.dropna(subset=['chap_num'])
LINES = LINES.loc[~chap_lines]
LINES.chap_num = LINES.chap_num.astype('int')
LINES.head(20)

line_num,line_str,book_num,chap_num
43,,0,1
44,,0,1
45,The family of Dashwood had long been settled i...,0,1
46,"was large, and their residence was at Norland ...",0,1
47,"their property, where, for many generations, t...",0,1
48,respectable a manner as to engage the general ...,0,1
49,surrounding acquaintance. The late owner of t...,0,1
50,"man, who lived to a very advanced age, and who...",0,1
51,"life, had a constant companion and housekeeper...",0,1
52,"death, which happened ten years before his own...",0,1


#### Task 2 - Group by Number

In [13]:
OHCO[:2]

CHAPS = LINES.groupby(OHCO[:2])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')

CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS.head(10)

book_num,chap_num,chap_str
0,1,The family of Dashwood had long been settled i...
0,2,Mrs. John Dashwood now installed herself mistr...
0,3,Mrs. Dashwood remained at Norland several mont...
0,4,"""What a pity it is, Elinor,"" said Marianne, ""t..."
0,5,"No sooner was her answer dispatched, than Mrs...."
0,6,The first part of their journey was performed ...
0,7,Barton Park was about half a mile from the cot...
0,8,Mrs. Jennings was a widow with an ample jointu...
0,9,The Dashwoods were now settled at Barton with ...
0,10,"Marianne's preserver, as Margaret, with more e..."


### Task 3: Split Chapters into Paragraphs
Split the resulting data frame into paragraphs using the regex provided.

In [14]:
OHCO[:3]

['book_num', 'chap_num', 'para_num']

In [15]:
para_pat = r'\n\n+'

PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:3]

PARAS.head(10)

book_num,chap_num,para_num,para_str
0,1,0,The family of Dashwood had long been settled i...
0,1,1,"By a former marriage, Mr. Henry Dashwood had o..."
0,1,2,"The old gentleman died: his will was read, and..."
0,1,3,"Mr. Dashwood's disappointment was, at first, s..."
0,1,4,His son was sent for as soon as his danger was...
0,1,5,Mr. John Dashwood had not the strong feelings ...
0,1,6,"He was not an ill-disposed young man, unless t..."
0,1,7,"When he gave his promise to his father, he med..."
0,1,8,"No sooner was his father's funeral over, than ..."
0,1,9,So acutely did Mrs. Dashwood feel this ungraci...


In [16]:
PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True)
PARAS['para_str'] = PARAS['para_str'].str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')]

PARAS.head(10)

book_num,chap_num,para_num,para_str
0,1,0,The family of Dashwood had long been settled i...
0,1,1,"By a former marriage, Mr. Henry Dashwood had o..."
0,1,2,"The old gentleman died: his will was read, and..."
0,1,3,"Mr. Dashwood's disappointment was, at first, s..."
0,1,4,His son was sent for as soon as his danger was...
0,1,5,Mr. John Dashwood had not the strong feelings ...
0,1,6,"He was not an ill-disposed young man, unless t..."
0,1,7,"When he gave his promise to his father, he med..."
0,1,8,"No sooner was his father's funeral over, than ..."
0,1,9,So acutely did Mrs. Dashwood feel this ungraci...


### Task 4: Split paragraphs into sentences
Split resulting data frame into sentences using the regex provided.

In [17]:
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:4]

In [18]:
SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')]
SENTS.sent_str = SENTS.sent_str.str.strip()

SENTS.head(10)
# SENTS.sample(10)

book_num,chap_num,para_num,sent_num,sent_str
0,1,0,0,The family of Dashwood had long been settled i...
0,1,0,1,"Their estate was large, and their residence wa..."
0,1,0,2,The late owner of this estate was a single man...
0,1,0,3,"But her death, which happened ten years before..."
0,1,0,4,"for to supply her loss, he invited and receive..."
0,1,0,5,"Henry Dashwood, the legal inheritor of the Nor..."
0,1,0,6,"In the society of his nephew and niece, and th..."
0,1,0,7,His attachment to them all increased
0,1,0,8,The constant attention of Mr
0,1,0,9,and Mrs


### Task 4 (continued): Split sentences into tokens
Split the resulting data frame into tokens using the regex provided.

In [19]:
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

TOKENS.index.names = OHCO[:5]

# TOKENS.head(10)
TOKENS.sample(10)
TOKENS.head(10)

book_num,chap_num,para_num,sent_num,token_num,token_str
0,1,0,0,0,The
0,1,0,0,1,family
0,1,0,0,2,of
0,1,0,0,3,Dashwood
0,1,0,0,4,had
0,1,0,0,5,long
0,1,0,0,6,been
0,1,0,0,7,settled
0,1,0,0,8,in
0,1,0,0,9,Sussex


### Task 5: Combine Persuasion
Combine both Persuasion and Sense and Sensibility into a single data frame with an appropriately modified OHCO list. In other words, make sure your index includes a new index level for the book. Use the attached CSV to get the Persuasion data and then import it into your notebook as a data frame.

In [20]:
csv_file2  = f"{output_dir}/austen-persuasion.csv"
persuasion_df = pd.read_csv(csv_file2)

persuasion_df.drop('term_str', axis=1, inplace=True)
persuasion_df.insert(0, 'book_num', 1)
# persuasion_df.sample(30)

full_df = pd.concat([TOKENS, persuasion_df.set_index(OHCO)])
full_df.sort_index(inplace=True)
# full_df.sample(30)
# full_df.head(20)
full_df

book_num,chap_num,para_num,sent_num,token_num,token_str
0,1,0,0,0,The
0,1,0,0,1,family
0,1,0,0,2,of
0,1,0,0,3,Dashwood
0,1,0,0,4,had
...,...,...,...,...,...
1,24,13,0,6,of
1,24,13,0,7,Persuasion
1,24,13,0,8,by
1,24,13,0,9,Jane


### Task 6: Extract Vocabulary
From the combined data frame, extract a vocabulary, i.e. a data frame with term string as index, along with term frequency and term length as features.

In [21]:
full_df['term_str'] = full_df.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB = full_df.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'

VOCAB['term_len'] = VOCAB.term_str.str.len()
VOCAB

term_id,term_str,n,term_len
0,the,7435,3
1,to,6923,2
2,and,6290,3
3,of,6146,2
4,her,3747,3
...,...,...,...
8234,unconquerable,1,13
8235,outgrown,1,8
8236,prosperously,1,12
8237,nominal,1,7


In [22]:
total_raw_tokens = VOCAB['n'].sum()
total_raw_tokens

207786

### Task 7: 
Answer the following questions by extracting features from the corpus.
1. How many raw tokens are in the combined data frame?

   <b>There are 207786 raw tokens.</b>

2. How many distinct terms are there in the combined data frame (i.e. how big is the vocabulary)?

    The vocabulary has 8239 rows, so there are <b>8239 distinct terms</b>.

#### Gathering by Content Object

In [23]:
def gather(ohco_level):
    """Join token strings into one string per unit at the given OHCO depth (1 = book, 2 = chapter, ...)."""
    level_name = OHCO[ohco_level-1].split('_')[0]
    df = full_df.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")
    return df

In [24]:
book_level = gather(1)

count_terms_book0 = set(book_level.loc[0, 'book_str'].split())
count_terms_book1 = set(book_level.loc[1, 'book_str'].split())

more_terms = len(count_terms_book0) - len(count_terms_book1)
more_terms

665

### Task 7: 
3. How many more terms does the vocabulary of Sense and Sensibility have than that of Persuasion?

    Sense and Sensibility has <b>665</b> more terms than Persuasion.
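Note that the cell above counts distinct raw token strings per book, so "The" and "the" are counted separately. The sketch below, on made-up tokens, shows how that count can differ from a count of normalised terms. If a term-based figure is wanted, one option would be `full_df.groupby('book_num').term_str.nunique()`; its result is not shown here because it was not run against the notebook's data.

```python
import re

demo_book = ['The', 'the', 'family', 'Family', 'estate']

# Distinct raw strings: case differences count as different items.
distinct_tokens = set(demo_book)

# Distinct normalised terms, using the same rule as term_str.
distinct_terms = {re.sub(r'[\W_]+', '', t).lower() for t in demo_book}

print(len(distinct_tokens), len(distinct_terms))
```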

In [25]:
count_token_chap = full_df.groupby(['book_num', 'chap_num']).size()
avg_tokens_chap = count_token_chap.mean().round().astype(int)
avg_tokens_chap

2809

### Task 7: 
4. What is the average number of tokens, rounded to an integer, per chapter in the corpus?

    The corpus averages <b>2809</b> tokens per chapter.

In [26]:
count_token_par = full_df.groupby(['book_num', 'chap_num', 'para_num']).size()
avg_tokens_par = count_token_par.mean().round().astype(int)
avg_tokens_par

74

### Task 7: 
5. What is the average number of tokens, rounded to an integer, per paragraph in the corpus?

    The corpus averages <b>74</b> tokens per paragraph.
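Both averages follow the same recipe: group by the index levels down to the unit of interest, count rows per group, then take the mean. A minimal sketch on a made-up corpus follows; the `demo_` names are illustrative.

```python
import pandas as pd

# Made-up corpus: chapter (0, 1) has 3 tokens, (0, 2) has 2, and (1, 1) has 4.
demo_index = pd.MultiIndex.from_tuples(
    [(0, 1, i) for i in range(3)] + [(0, 2, i) for i in range(2)] + [(1, 1, i) for i in range(4)],
    names=['book_num', 'chap_num', 'token_num'],
)
demo_df = pd.DataFrame({'token_str': ['w'] * 9}, index=demo_index)

# Rows per (book, chapter) pair, then the mean of those counts.
tokens_per_chap = demo_df.groupby(['book_num', 'chap_num']).size()
avg_per_chap = int(round(tokens_per_chap.mean()))
print(tokens_per_chap.tolist(), avg_per_chap)
```

Grouping by `book_num` as well as `chap_num` matters: both books have a chapter 1, and grouping by `chap_num` alone would merge them into one group.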