# Ian Yung
### *1.21.2024*
### *m01hw.ipynb*

## Set Up

In [1]:
import pandas as pd


In [2]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
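For anyone reproducing the notebook, the cell above expects an INI file with `data_home` and `output_dir` keys under `[DEFAULT]`. The following is a minimal sketch of that layout, parsed from a string instead of from disk. The directory values are placeholders, not the author's actual paths.

```python
import configparser

# Hypothetical contents of env.ini; both paths are placeholders.
sample_ini = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
"""

config = configparser.ConfigParser()
config.read_string(sample_ini)

data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
print(data_home, output_dir)
```

Note that `config.read()` silently skips files it cannot find, so a wrong relative path shows up later as a `KeyError` on the key lookup, not as a file error.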

## Import File

In [3]:
src_file = f"{data_home}/HW1/pg42324.txt"

In [4]:
# Use a context manager so the file handle is closed after reading
with open(src_file, 'r', encoding='utf-8') as f:
    lines = f.readlines()

In [5]:
lines[:5]

['\ufeffThe Project Gutenberg EBook of Frankenstein, by Mary W. Shelley\n',
 '\n',
 'This eBook is for the use of anyone anywhere at no cost and with\n',
 'almost no restrictions whatsoever.  You may copy it, give it away or\n',
 're-use it under the terms of the Project Gutenberg License included\n']
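The first line in the output above begins with `\ufeff`, a byte-order mark (BOM) stored at the start of the file, and it later ends up attached to the first token. One possible way to drop it, not used in this notebook, is to read with the `utf-8-sig` codec. The sketch below demonstrates the difference on a small temporary stand-in file, not on the Gutenberg text.

```python
import os
import tempfile

# Write a small stand-in file that begins with a UTF-8 byte-order mark.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write('The Project Gutenberg EBook\n'.encode('utf-8-sig'))
    demo_path = tmp.name

# Plain utf-8 keeps the BOM as a leading '\ufeff' character.
with open(demo_path, 'r', encoding='utf-8') as f:
    with_bom = f.readlines()

# utf-8-sig strips a leading BOM if one is present.
with open(demo_path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.readlines()

os.remove(demo_path)
print(repr(with_bom[0][:4]), repr(without_bom[0][:4]))
```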

## Convert to Dataframe

In [6]:
text = pd.DataFrame(lines)

In [7]:
text

|  | 0 |
|---|---|
| 0 | The Project Gutenberg EBook of Frankenstein, ... |
| 1 | \n |
| 2 | This eBook is for the use of anyone anywhere a... |
| 3 | almost no restrictions whatsoever. You may co... |
| 4 | re-use it under the terms of the Project Guten... |
| ... | ... |
| 8023 | \n |
| 8024 | This Web site includes information about Proje... |
| 8025 | including how to make donations to the Project... |
| 8026 | Archive Foundation, how to help produce our ne... |


## Question 1

In [8]:
chunk_pat = '\n\n'

In [9]:
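# Read the whole file and split it into paragraph-like chunks on blank lines
with open(src_file, 'r', encoding='utf-8') as f:
    chunks = f.read().split(chunk_pat)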

In [10]:
text = pd.DataFrame(chunks, columns=['chunk_str'])
text.index.name = 'chunk_id'

In [11]:
text.chunk_str = text.chunk_str.str.replace('\n+', ' ', regex=True).str.strip()

In [12]:
text.head()

| chunk_id | chunk_str |
|---|---|
| 0 | The Project Gutenberg EBook of Frankenstein, ... |
| 1 | This eBook is for the use of anyone anywhere a... |
| 2 | Title: Frankenstein or, The Modern Prom... |
| 3 | Author: Mary W. Shelley |
| 4 | Release Date: March 13, 2013 [EBook #42324] |
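To make the chunking step easier to follow, here is a minimal sketch that applies the same split and whitespace normalization to a toy string. The string `raw` and the frame name `demo` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the raw file contents (not the Gutenberg text).
raw = "Title line\nwraps here\n\nSecond paragraph.\n\n\nThird paragraph\n"

# Split on blank lines, as the notebook does with chunk_pat = '\n\n'.
chunks = raw.split('\n\n')
demo = pd.DataFrame(chunks, columns=['chunk_str'])
demo.index.name = 'chunk_id'

# Collapse remaining newlines into spaces and trim the ends.
demo['chunk_str'] = demo.chunk_str.str.replace('\n+', ' ', regex=True).str.strip()
print(demo)
```

A run of four or more consecutive newlines would produce an empty chunk under this pattern; the sketch, like the notebook, does not filter those out.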


## Convert Chunks to Tokens

In [13]:
K = text.chunk_str.str.split(expand=True).stack().to_frame('token_str')
K.index.names = ['chunk_num','token_num']

In [14]:
K

| chunk_num | token_num | token_str |
|---|---|---|
| 0 | 0 | The |
| 0 | 1 | Project |
| 0 | 2 | Gutenberg |
| 0 | 3 | EBook |
| 0 | 4 | of |
| ... | ... | ... |
| 941 | 35 | to |
| 941 | 36 | hear |
| 941 | 37 | about |
| 941 | 38 | new |


The raw text contains 80,985 tokens.
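The token count is presumably the number of rows in `K`. As a rough illustration, the sketch below runs the same split-and-stack pattern on two toy chunks and counts the rows. The explicit `dropna()` is an addition: whether `stack()` removes the padding created by `split(expand=True)` depends on the pandas version, so the sketch drops it explicitly.

```python
import pandas as pd

# Two toy chunks of different lengths (made up for illustration).
demo = pd.DataFrame({'chunk_str': ['The Project Gutenberg EBook', 'of Frankenstein']})

# split(expand=True) gives one column per token position, padding short rows;
# stack() moves token position into the index, and dropna() removes any
# padding that survives (this depends on the pandas version).
tokens = (
    demo.chunk_str.str.split(expand=True)
    .stack()
    .dropna()
    .to_frame('token_str')
)
tokens.index.names = ['chunk_num', 'token_num']

n_tokens = len(tokens)
print(n_tokens)
```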

## Question 2

In [15]:
K['term_str'] = K.token_str.str.replace(r'\W+', '', regex=True).str.lower()

In [16]:
V = K.term_str.value_counts().to_frame('n')
V.index.name = 'term_str'
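The two cells above normalize each token into a term and then count the terms. A small sketch on made-up tokens shows what the regular expression does. It also shows one side effect worth knowing: a token made only of punctuation collapses to an empty string. Filtering those out is an optional extra step here and is not part of the notebook.

```python
import pandas as pd

# Made-up tokens, including one that is pure punctuation.
tokens = pd.DataFrame({'token_str': ['I', 'said,', '"I', 'am', '--', 'I!']})

# Strip non-word characters and lowercase, as in the cell above.
tokens['term_str'] = tokens.token_str.str.replace(r'\W+', '', regex=True).str.lower()

# '--' has become ''; drop empty terms before counting (optional step).
terms = tokens.loc[tokens.term_str != '', 'term_str']

vocab = terms.value_counts().to_frame('n')
vocab.index.name = 'term_str'
print(vocab)
```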

In [17]:
V.head(10)

| term_str | n |
|---|---|
| the | 4575 |
| and | 3120 |
| of | 2918 |
| i | 2918 |
| to | 2257 |
| my | 1819 |
| a | 1497 |
| in | 1232 |
| was | 1064 |
| that | 1060 |

"I" is the most frequent pronoun in *Frankenstein*, with 2,918 occurrences.

## Question 3

In [18]:
src_file = f"{data_home}/gutenberg/pg105.txt"
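# Read the second novel and split it into chunks with the same pattern
with open(src_file, 'r', encoding='utf-8') as f:
    chunks = f.read().split(chunk_pat)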

In [19]:
text = pd.DataFrame(chunks, columns=['chunk_str'])
text.index.name = 'chunk_id'
text.chunk_str = text.chunk_str.str.replace('\n+', ' ', regex=True).str.strip()

In [20]:
K = text.chunk_str.str.split(expand=True).stack().to_frame('token_str')
K.index.names = ['chunk_num','token_num']

In [21]:
K['term_str'] = K.token_str.str.replace(r'\W+', '', regex=True).str.lower()

In [22]:
V = K.term_str.value_counts().to_frame('n')
V.index.name = 'term_str'
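The cells in this question repeat the Question 1 and Question 2 steps on a different file. One way to avoid the duplication is to wrap the steps in a function. The helper below, `build_vocab`, is a hypothetical name and a sketch only; it takes raw text instead of a file path so it can be tried on a toy string, and it adds an explicit `dropna()` after `stack()` to stay independent of the pandas version.

```python
import pandas as pd

def build_vocab(raw_text, chunk_pat='\n\n'):
    """Hypothetical helper: raw text -> term counts, mirroring the cells above."""
    # Chunk the text and normalize whitespace inside each chunk.
    text = pd.DataFrame(raw_text.split(chunk_pat), columns=['chunk_str'])
    text.index.name = 'chunk_id'
    text['chunk_str'] = text.chunk_str.str.replace('\n+', ' ', regex=True).str.strip()

    # One row per token, indexed by chunk and position.
    K = text.chunk_str.str.split(expand=True).stack().dropna().to_frame('token_str')
    K.index.names = ['chunk_num', 'token_num']
    K['term_str'] = K.token_str.str.replace(r'\W+', '', regex=True).str.lower()

    # Term frequencies, most frequent first.
    V = K.term_str.value_counts().to_frame('n')
    V.index.name = 'term_str'
    return V

# Toy input, not taken from either novel.
toy = "She said she would go.\n\nHer sister\nagreed."
V_toy = build_vocab(toy)
print(V_toy.head())
```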

In [23]:
V.head(10)

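| term_str | n |
|---|---|
| the | 3501 |
| to | 2862 |
| and | 2851 |
| of | 2684 |
| a | 1648 |
| in | 1439 |
| was | 1336 |
| her | 1202 |
| had | 1187 |
| she | 1143 |

"She" is the most frequent subject pronoun in the Jane Austen novel, with 1,143 occurrences; the object and possessive form "her" appears slightly more often, 1,202 times.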

## Question 4

From what I know of the two authors and their work, *Frankenstein* centers on a "mad scientist" and his creation, while Jane Austen's novels tend to revolve around the lives of young aristocratic women in 19th-century British society. Given that context, it is intuitive that "she" features prominently in *Persuasion*, which is told in the third person and follows the lives of several women, while *Frankenstein*, which is narrated in the first person, relies heavily on "I".
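Raw counts from two novels of different lengths are hard to compare directly, so one possible follow-up is to express pronoun counts per 1,000 tokens. The sketch below is an illustration only: the `pronoun_rates` helper, the pronoun list, and the two toy vocabularies are all assumptions, and none of the numbers come from either novel.

```python
import pandas as pd

# Illustrative subset of pronouns; the choice of list is an assumption.
PRONOUNS = ['i', 'my', 'me', 'he', 'his', 'him', 'she', 'her', 'they', 'their']

def pronoun_rates(V, pronouns=PRONOUNS):
    """Return each pronoun's count per 1,000 tokens for a vocabulary table V."""
    total = V.n.sum()
    counts = V.n.reindex(pronouns, fill_value=0)
    return (counts / total * 1000).round(2)

# Toy vocabularies with the same shape as V in the notebook (made-up counts).
V_a = pd.DataFrame({'n': [50, 30, 20]},
                   index=pd.Index(['the', 'i', 'my'], name='term_str'))
V_b = pd.DataFrame({'n': [60, 25, 15]},
                   index=pd.Index(['the', 'she', 'her'], name='term_str'))

comparison = pd.DataFrame({'a': pronoun_rates(V_a), 'b': pronoun_rates(V_b)})

# Show only pronouns that occur in at least one of the two vocabularies.
print(comparison[(comparison > 0).any(axis=1)])
```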