# Text Analysis Assignment

## Assignment Details

The text below is a summary from [this document](https://docs.google.com/document/d/1FNRmAS_vc-2eQESHH6uZMcuDVr720rsx-aDWGH3oz_Y/edit?usp=sharing).

The main goal of the assignment is to have you practice the tools we have been using in class.

The requirements are:

- Choose one or more texts to work with.
- Either save the text files in your working directory, or have Python get them from a web address.
- If needed: convert your text from bytes to a string
- Tokenize your text
- Make it an NLTK text object so you can use nltk tools on it
- Clean the text in a way that is appropriate for the kind of analysis you want to do.
- Run some analysis
- Report findings

Important notes:

- If you plan to use functions, put them in a separate Python file and import it in your main file.

- Please over-comment your script. Comment every step of the way, and explain not only what you are doing in terms of code but also your analytical goal. For instance, both "I am running a for loop to remove the common words" and "I am trying to see how these two authors compare in terms of the ratio of unusual words to total words" are the kinds of comments I want to see.


## Assignment Submission

### Itinerary

I want to compare the use of modal verbs by American and English authors, both men and women, in 19th-century literature. I wonder whether usage differs by nationality or gender. To carry out this analysis, I will:

- Download a sample of raw texts from [Project Gutenberg](https://www.gutenberg.org/) using `urllib`
- Convert the texts from bytes to a string
- Tokenize the texts
- Make it an NLTK text object
- Clean the texts, including removing front matter and other ephemera from Project Gutenberg texts
- Create a list of modal verbs and filter for these words in each text
- Perhaps create a conditional frequency distribution on all texts (?) to see if we can establish a pattern
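
The plan above can be sketched end to end on a toy string. This is a minimal sketch, not the final analysis. It uses only the standard library: a regular expression stands in for `word_tokenize`, and a `Counter` stands in for an NLTK frequency distribution. The function names and the list of modals are hypothetical, and the sketch assumes that each Project Gutenberg file marks its body with lines beginning `*** START OF` and `*** END OF`.

```python
import re
from collections import Counter

# Core English modal verbs; the exact list is an assumption and may be extended.
MODALS = ["can", "could", "may", "might", "must", "shall", "should", "will", "would"]


def strip_gutenberg_boilerplate(text):
    """Keep only the lines between the '*** START OF' and '*** END OF' markers.

    If a marker is missing, that side of the text is left untouched.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    return "\n".join(lines[start:end])


def count_modals(text):
    """Return (modal counts, total token count) for one text."""
    # Lowercase so that "Would" and "would" are counted together.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tok for tok in tokens if tok in MODALS)
    return counts, len(tokens)


# A made-up stand-in for a downloaded and decoded novel.
sample = (
    "Produced by a volunteer\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "She would go if she could. He must stay, and he would not mind.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License text\n"
)

body = strip_gutenberg_boilerplate(sample)
modal_counts, total_tokens = count_modals(body)

# Report rates per 1,000 tokens so texts of different lengths compare fairly.
for modal in MODALS:
    rate = 1000 * modal_counts[modal] / total_tokens
    print(f"{modal:>6}: {modal_counts[modal]} ({rate:.1f} per 1,000 tokens)")
```

Running `count_modals` once per novel and storing the results under each title would give the per-text table that a conditional frequency distribution provides.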

### Imports

In the cell below, I'm importing the libraries/modules:

In [1]:
from urllib.request import urlopen
from nltk import word_tokenize
from nltk import Text
from nltk.probability import FreqDist
from nltk.probability import ConditionalFreqDist

### Collecting the texts from Project Gutenberg

The first step is to get the raw texts from Project Gutenberg. This work is demonstrated in the cells below:

In [2]:
# Create variables for each plain text file on PG
custom_of_country_url = 'https://www.gutenberg.org/cache/epub/11052/pg11052.txt'
little_dorrit_url = 'https://www.gutenberg.org/files/963/963-0.txt'
persuasion_url = 'https://www.gutenberg.org/cache/epub/105/pg105.txt'
the_awkward_age_url = 'https://www.gutenberg.org/files/7433/7433-0.txt'


In [3]:
# Open URLs
custom_country_file = urlopen(custom_of_country_url)
little_dorrit_file = urlopen(little_dorrit_url)
persuasion_file = urlopen(persuasion_url)
the_awkward_age_file = urlopen(the_awkward_age_url)

In [4]:
# Read in the texts and assign them to variables
custom_country_raw = custom_country_file.read()
little_dorrit_raw = little_dorrit_file.read()
persuasion_raw = persuasion_file.read()
the_awkward_age_raw = the_awkward_age_file.read()
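
The cells above repeat the same open-and-read step for each novel. One possible way to condense them, shown here as a sketch and not as the notebook's actual code, is to keep the URLs in a dictionary and read each response inside a `with` block so the connection is closed afterwards. The helper name `fetch_raw_texts` and its `opener` parameter are hypothetical; `opener` exists only so the function can be tried without network access.

```python
from urllib.request import urlopen

# Same plain-text URLs as in the cells above, keyed by a short label.
TEXT_URLS = {
    "custom_of_country": "https://www.gutenberg.org/cache/epub/11052/pg11052.txt",
    "little_dorrit": "https://www.gutenberg.org/files/963/963-0.txt",
    "persuasion": "https://www.gutenberg.org/cache/epub/105/pg105.txt",
    "the_awkward_age": "https://www.gutenberg.org/files/7433/7433-0.txt",
}


def fetch_raw_texts(urls, opener=urlopen):
    """Return a dict mapping each label to the raw bytes found at its URL."""
    raw_texts = {}
    for label, url in urls.items():
        # The with-block closes the response even if read() raises.
        with opener(url) as response:
            raw_texts[label] = response.read()
    return raw_texts
```

With network access, `raw_texts = fetch_raw_texts(TEXT_URLS)` would return the same bytes as the four `*_raw` variables, stored under one name.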

### Converting the raw text to strings and lists

Below, I demonstrate that while we have the texts available, they're in a format that isn't conducive to text analysis. We will convert these UTF-8 bytes to strings, and finally, to lists.

In [6]:
# Check type for a given text
type(little_dorrit_raw)

bytes

If you were to run `little_dorrit_raw`, you would see that while the text is in bytes, some characters appear as escaped byte sequences (e.g. `b'\xef\xbb\xbf\r\n`, where `\xef\xbb\xbf` is the UTF-8 byte order mark). Next, we will decode the raw files into strings to make them more usable for text analysis.

In [8]:
# Decode the raw bytes into strings (decode() defaults to UTF-8)
custom_of_country = custom_country_raw.decode()
little_dorrit = little_dorrit_raw.decode()
persuason = persuasion_raw.decode()
awkward_age = the_awkward_age_raw.decode()

In [9]:
# check the type for a given text
type(awkward_age)

str

If you were to run `awkward_age`, you would still see an escaped Unicode character at the start of the string: `'\ufeff`, the byte order mark.
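
The sketch below uses a short made-up byte string in place of a downloaded novel to show why the byte order mark (`'\ufeff'`) survives a plain `decode()`. It also shows one possible way to drop it, Python's standard `utf-8-sig` codec. Whether to strip the mark at decode time or during later cleaning is a design choice; this is only an illustration.

```python
# A made-up sample that starts with the UTF-8 byte order mark, like the files above.
sample_raw = b"\xef\xbb\xbfThe Project Gutenberg eBook"

# decode() defaults to UTF-8, which keeps the mark as the character '\ufeff'.
with_bom = sample_raw.decode()
print(repr(with_bom[:4]))

# The 'utf-8-sig' codec decodes the same bytes but skips a leading mark.
without_bom = sample_raw.decode("utf-8-sig")
print(repr(without_bom[:4]))
```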