# Text Analysis Assignment

## Assignment Details

The text below is a summary from [this document](https://docs.google.com/document/d/1FNRmAS_vc-2eQESHH6uZMcuDVr720rsx-aDWGH3oz_Y/edit?usp=sharing).

The main goal of the assignment is to have you practice the tools we have been using in class.

The requirements are:

- Choose one or more texts to work with.
- Either save the text files in your working directory, or have Python get them from a web address.
- If needed: convert your text from bytes to a string
- Tokenize your text
- Make it an NLTK text object so you can use nltk tools on it
- Clean the text in a way that is appropriate for the kind of analysis you want to do.
- Run some analysis
- Report findings

Important notes:

- If you plan to use functions, put your functions in a separate Python file and import it into your main file.

- Please over-comment your script. Comment every step of the way, and explain not only what you are doing in terms of code but also your analytical goal. For instance, both "I am running a for loop to remove the common words" and "I am trying to see how these two authors compare in terms of the ratio of unusual words to total words" are the kinds of comments I want to see.


## Assignment Submission

### Itinerary

I want to compare the usage of modal verbs by American and English men and women authors in 19th-century and early 20th-century literature. I wonder whether there is any difference based on nationality or gender. To do my proposed analysis, I will:

- Download a sample of raw texts from [Project Gutenberg](https://www.gutenberg.org/) using `urllib`
- Convert the texts from bytes to a string
- Tokenize the texts
- Make it an NLTK text object
- Clean the texts, including removing front matter and other ephemera from Project Gutenberg texts
- Create a list of modal verbs and filter for these words in each text
- Perhaps create a conditional frequency distribution across all texts to see if we can establish a pattern
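
The last two steps could look roughly like the sketch below. It uses `collections.Counter` on two made-up token lists so that it runs without any downloads; the modal list, the sample tokens, and the names `MODALS`, `modal_counts`, and `modal_table` are illustrative assumptions, not the final data or code. NLTK's `ConditionalFreqDist` could take the place of the nested dictionary once the real texts are tokenized.

```python
from collections import Counter

# Assumed list of modal verbs; the final list may differ.
MODALS = ["can", "could", "may", "might", "must", "shall", "should", "will", "would"]

def modal_counts(tokens):
    """Count how often each modal verb appears in a list of lowercase tokens."""
    counts = Counter(token for token in tokens if token in MODALS)
    # Report every modal, including those that never occur
    return {modal: counts[modal] for modal in MODALS}

# Made-up sample tokens standing in for two cleaned novels
samples = {
    "text_a": ["she", "could", "not", "go", "but", "she", "would", "if", "she", "could"],
    "text_b": ["you", "must", "come", "and", "you", "may", "stay"],
}

# One row of modal counts per text, similar in spirit to a conditional frequency distribution
modal_table = {name: modal_counts(tokens) for name, tokens in samples.items()}
```

Because the novels differ in length, raw counts like these would probably need to be divided by each text's total token count before comparing authors.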

### Imports

In the cell below, I'm importing the libraries/modules:

In [1]:
from urllib.request import urlopen                    # requesting and opening a file on the internet
import nltk                                           # our tool for text analysis
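nltk.download('punkt')                                # tokenizer models required by word_tokenize()
from nltk.probability import FreqDist                 # frequency counts for a single text
from nltk.probability import ConditionalFreqDist      # frequency counts grouped by a condition (e.g. per text)
from helper_funcs import lowered                      # my own helper for cleaning tokens (see helper_funcs.py)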

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kaiprenger/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Collecting the texts from Project Gutenberg

The first step is to get the raw texts from Project Gutenberg. This work is done in the cells below:

In [2]:
# Create variables for each plain text file on PG
custom_of_country_url = 'https://www.gutenberg.org/cache/epub/11052/pg11052.txt'
little_dorrit_url = 'https://www.gutenberg.org/files/963/963-0.txt'
persuasion_url = 'https://www.gutenberg.org/cache/epub/105/pg105.txt'
the_awkward_age_url = 'https://www.gutenberg.org/files/7433/7433-0.txt'


In [3]:
# Open URLs
custom_country_file = urlopen(custom_of_country_url)
little_dorrit_file = urlopen(little_dorrit_url)
persuasion_file = urlopen(persuasion_url)
the_awkward_age_file = urlopen(the_awkward_age_url)

In [4]:
# Read in the texts and assign them to variables
custom_country_raw = custom_country_file.read()
little_dorrit_raw = little_dorrit_file.read()
persuasion_raw = persuasion_file.read()
the_awkward_age_raw = the_awkward_age_file.read()
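
The open-and-read steps above repeat the same pattern for every text. A minimal sketch of one way to fold them into a helper is shown here; `fetch_raw` and `fetch_all` are hypothetical names that do not appear in this notebook, and the `with` block is an addition that closes each connection once the bytes are read.

```python
from urllib.request import urlopen

def fetch_raw(url):
    """Return the raw bytes served at `url`, closing the connection afterwards."""
    with urlopen(url) as response:
        return response.read()

def fetch_all(urls):
    """Map each name in `urls` (a dict of name -> URL) to its raw bytes."""
    return {name: fetch_raw(url) for name, url in urls.items()}
```

Calling `fetch_all({'persuasion': persuasion_url, 'little_dorrit': little_dorrit_url})` would then return a dictionary of raw bytes keyed by title.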

### Converting the raw text to strings and lists

Below, I demonstrate that while we have the texts available, they're in a format that isn't conducive to text analysis. We will convert these UTF-8 bytes to strings and then, by tokenizing, to lists of tokens.

In [5]:
# Check type for a given text
type(little_dorrit_raw)

bytes

If you were to run `little_dorrit_raw`, you'd notice that while the text is in bytes, escape sequences such as `b'\xef\xbb\xbf\r\n` are embedded in the output. Next, we will decode the raw bytes into strings to make them more usable for text analysis.

In [6]:
custom_of_country = custom_country_raw.decode()
little_dorrit = little_dorrit_raw.decode()
persuasion = persuasion_raw.decode()
awkward_age = the_awkward_age_raw.decode()

In [7]:
# check the type for a given text
type(awkward_age)

str

At this point, each text is one long string, so indexing or slicing works character by character, which is not very useful for text analysis. Below, I slice the first 50 characters of the text to show what I mean.

Side note: If you were to run `awkward_age`, you would still see a Unicode escape like `'\ufeff` (the byte order mark) at the start.

In [8]:
awkward_age[:50]

'\ufeffThe Project Gutenberg EBook of The Awkward Age, b'
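
The `\ufeff` at the start is a byte order mark (BOM). One possible way to drop it during decoding is Python's built-in `utf-8-sig` codec. The bytes below are a shortened, made-up stand-in for the start of a downloaded file, and the variable names are only for illustration.

```python
# Shortened stand-in for the first bytes of a Project Gutenberg file
sample_raw = b'\xef\xbb\xbfThe Project Gutenberg EBook'

# Plain UTF-8 decoding keeps the byte order mark as '\ufeff'
with_bom = sample_raw.decode('utf-8')

# The 'utf-8-sig' codec removes a leading byte order mark if one is present
without_bom = sample_raw.decode('utf-8-sig')

print(repr(with_bom[:4]))     # '\ufeffThe'
print(repr(without_bom[:3]))  # 'The'
```

Decoding this way would also keep the mark from sticking to the first token later on.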

Next, we'll tokenize the text to create a list of strings, roughly one per word or punctuation mark.

In [9]:
awkward_age_tokens = nltk.word_tokenize(awkward_age)
custom_of_country_tokens = nltk.word_tokenize(custom_of_country)
little_dorrit_tokens = nltk.word_tokenize(little_dorrit)
persuasion_tokens = nltk.word_tokenize(persuasion)


In [10]:
# validate that the tokens are words
awkward_age_tokens[:50]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Awkward',
 'Age',
 ',',
 'by',
 'Henry',
 'James',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.',
 'You',
 'may',
 'copy',
 'it',
 ',',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License']
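
The output shows that tokenizing is not the same as splitting on spaces: commas and periods become tokens of their own. The sketch below illustrates the difference on one sentence from the front matter. The regular expression is only a rough stand-in chosen for this illustration and is not how `nltk.word_tokenize` works internally.

```python
import re

sentence = "You may copy it, give it away or re-use it."

# Splitting on whitespace leaves punctuation glued to the words
by_spaces = sentence.split()

# Rough stand-in for a word tokenizer: words (hyphens allowed) or single punctuation marks
rough_tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

print(by_spaces)
print(rough_tokens)
```

In the first list the comma stays attached as `'it,'`, while the second list separates it into `'it'` and `','`, which is closer to the token list above.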

### Prepping the text
With these texts coming from Project Gutenberg (PG), we'll need to clean up the front matter, etc. I will do this by:

1. Inspecting the first relevant words of a novel
2. Identifying a word that is unlikely to be part of the front matter that PG adds
3. Finding the index of that word in the tokenized list for the novel
4. Slicing the list so that it starts at the first relevant sentence
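
The steps above can be sketched on a small, made-up token list. The tokens and variable names here are assumptions for illustration only; the real lists are far longer and the marker words differ per novel.

```python
# Made-up tokens: front matter, a short "novel", then closing text
toy_tokens = ['Produced', 'by', 'Someone', 'I', 'recall', 'the', 'idea', '.', 'End', 'of', 'Project']

# Step 3: index of a word believed to appear only once the novel has begun
marker_index = toy_tokens.index('recall')

# Step 4: start one token earlier so the whole first sentence is kept
start = marker_index - 1

# The end is found the same way, here using the first token of the closing text
end = toy_tokens.index('End')

novel_tokens = toy_tokens[start:end]
print(novel_tokens)  # ['I', 'recall', 'the', 'idea', '.']
```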

#### Prepping the Awkward Age

In [11]:
awkward_age_tokens.index('recall')

127

In [12]:
awkward_age_tokens[126:131]

['I', 'recall', 'with', 'perfect', 'ease']

We'll start our slice for `awkward_age_tokens` at 126. But we also need to end the slice early to remove the post-novel text and license information.

In [13]:
awkward_age_tokens[-25:]    # Inspecting the end of the text to find this out

['Archive',
 'Foundation',
 ',',
 'how',
 'to',
 'help',
 'produce',
 'our',
 'new',
 'eBooks',
 ',',
 'and',
 'how',
 'to',
 'subscribe',
 'to',
 'our',
 'email',
 'newsletter',
 'to',
 'hear',
 'about',
 'new',
 'eBooks',
 '.']

I decided to approximate where the novel ends by looking for the producer's name for this version of the text.

In [14]:
awkward_age_tokens.index('Sobol')

118

In [15]:
awkward_age_tokens[118]

'Sobol'

The method above doesn't work because `index` returns the first occurrence, and the name is also mentioned at the beginning. Let's look for the index within a slice that starts after the front matter.

In [16]:
awkward_age_wo_pref = awkward_age_tokens[126:]

In [17]:
awkward_age_wo_pref.index('Sobol')

181183

In [18]:
awkward_age_wo_pref[181183]

'Sobol'
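
As an alternative to slicing first, `list.index` also accepts a start position. A small sketch on made-up tokens (not the real novel) shows the difference; note that the result is then an index into the full list rather than into the slice.

```python
# Made-up tokens in which the producer's name appears before and after the novel
toy_tokens = ['Produced', 'by', 'Sobol', 'I', 'recall', 'it', '.', 'Produced', 'by', 'Sobol']

first_hit = toy_tokens.index('Sobol')       # first occurrence, in the front matter
later_hit = toy_tokens.index('Sobol', 3)    # first occurrence at or after position 3
in_slice = toy_tokens[3:].index('Sobol')    # position counted from the start of the slice

print(first_hit, later_hit, in_slice)  # 2 9 6
```

The slice-based position plus the slice start (6 + 3) gives the same place as the start-argument version (9).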

In [19]:
awkward_age_wo_pref[181100:181116]    # trying to nail down the index of the last word of the text

['I',
 'see',
 '.',
 'There',
 'we',
 'are',
 '.',
 'Well',
 ',',
 '”',
 'said',
 'Mr.',
 'Longdon',
 '--',
 '“',
 'to-morrow.']

In [20]:
awkward_age_tokens_sliced = awkward_age_wo_pref[:181116]

In [21]:
awkward_age_tokens_sliced[:7]    # first seven tokens of the text

['I', 'recall', 'with', 'perfect', 'ease', 'the', 'idea']

In [22]:
awkward_age_tokens_sliced[-9:]    # last nine tokens of the text

['Well', ',', '”', 'said', 'Mr.', 'Longdon', '--', '“', 'to-morrow.']

In [23]:
# Let's remove non-alphabet characters and convert all words to lower case
awkward_age_tokens_prep = lowered(awkward_age_tokens_sliced)
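
The `lowered` helper lives in `helper_funcs.py`, which is not shown here. Going only by the comment above, it might resemble the following sketch. This is a guess at its behavior, not the actual implementation, and it is named `lowered_sketch` to keep the two apart.

```python
def lowered_sketch(tokens):
    """Keep only purely alphabetic tokens and return them in lower case."""
    return [token.lower() for token in tokens if token.isalpha()]

# The last nine tokens of the sliced text, as shown above
sample = ['Well', ',', '”', 'said', 'Mr.', 'Longdon', '--', '“', 'to-morrow.']
print(lowered_sketch(sample))  # ['well', 'said', 'longdon']
```

A filter like this also drops tokens such as `'Mr.'` and `'to-morrow.'` because they contain punctuation, which may or may not matter for counting modal verbs.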

#### Prepping Custom of the Country 

In [24]:
custom_of_country_tokens.index('Undine')

127

In [25]:
custom_of_country_tokens[126:135]

["''", 'Undine', 'Spragg', '--', 'how', 'can', 'you', '?', "''"]

In [26]:
custom_of_country_token_no_pref = custom_of_country_tokens[126:]

In [41]:
custom_of_country_token_no_pref.index('Proofreaders')    # in the producer name

166205

In [40]:
custom_of_country_token_no_pref[166129]    # last word in the text

'for'

In [42]:
custom_of_country_tokens_sliced = custom_of_country_token_no_pref[:166131]

In [50]:
custom_of_country_tokens_sliced[-20:]    # last twenty items after slice

['welcome',
 'her',
 'first',
 'guests',
 'she',
 'said',
 'to',
 'herself',
 'that',
 'it',
 'was',
 'the',
 'one',
 'part',
 'she',
 'was',
 'really',
 'made',
 'for',
 '.']

In [51]:
custom_of_country_tokens_prep = lowered(custom_of_country_tokens_sliced)

#### Prepping Little Dorrit