This project uses the Natural Language Toolkit (NLTK) to analyze the text of *Pride and Prejudice* and *Frankenstein* and find the most common words in each. Very common words such as "the", "or", and "and" will be filtered out, since they are plentiful in any English text. Both novels were published in the 1810s but come from diametrically opposed genres, a novel of manners versus proto-science fiction, so it is interesting to see how they use language differently at a surface level.

First, the two novels are downloaded from **Project Gutenberg**. Luckily, the site already offers them in plain text, which avoids having to extract the text from HTML or some other format. The text can then be forced into all lowercase (so that "pride" and "Pride" are not treated as different words) and analyzed. The analysis will cover common words, sentence length, average word length, readability, and average syllables per word.

In [19]:
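import requests
import nltk
from collections import Counter
import string

# One-time downloads of the tokenizer models and stop word list used below
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').
nltk.download('punkt')
nltk.download('stopwords')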

In [21]:
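# Download the plain-text editions from Project Gutenberg.
Frankenstein = requests.get("https://www.gutenberg.org/cache/epub/84/pg84.txt")
Pride = requests.get("https://www.gutenberg.org/cache/epub/1342/pg1342.txt")

# Flatten line breaks, lowercase, and slice off the front matter
# (character offsets found by guessing and checking).
FrankensteinText = Frankenstein.text.replace('\r\n', ' ').lower()[1380:]
PrideText = Pride.text.replace('\r\n', ' ').lower()[35186:]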

FrankensteinSent = nltk.sent_tokenize(FrankensteinText)
PrideSent = nltk.sent_tokenize(PrideText)

PrideSent[:10]

['it is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.',
 'however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.',
 '“my dear mr. bennet,” said his lady to him one day, “have you heard that netherfield park is let at last?”  mr. bennet replied that he had not.',
 '“but it is,” returned she; “for mrs. long has just been here, and she told me all about it.”  mr. bennet made no answer.',
 '“do not you want to know who has taken it?” cried his wife, impatiently.',
 '“_you_ want to tell me, and i have no objection to hearing it.”  [illustration:  “he came down to see the place”  [_copyright 1894 by george allen._]]  this was invitation enough.',
 '“why, my dear, you must know, mrs. long says that netherfield is taken by a yo

In [25]:
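# Split each text into word tokens; punctuation is kept as separate tokens.
# (A RegexpTokenizer would need r'\w+' rather than r'\w', which matches
# one character at a time.)
FrankensteinWords = nltk.word_tokenize(FrankensteinText)
PrideWords = nltk.word_tokenize(PrideText)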

FrankensteinSent[:10]

['letter 1  _to mrs. saville, england._   st. petersburgh, dec. 11th, 17—.',
 'you will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings.',
 'i arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking.',
 'i am already far north of london, and as i walk in the streets of petersburgh, i feel a cold northern breeze play upon my cheeks, which braces my nerves and fills me with delight.',
 'do you understand this feeling?',
 'this breeze, which has travelled from the regions towards which i am advancing, gives me a foretaste of those icy climes.',
 'inspirited by this wind of promise, my daydreams become more fervent and vivid.',
 'i try in vain to be persuaded that the pole is the seat of frost and desolation; it ever presents itself to my imagination as the region of beauty and delight.',
 'there, margaret, the sun is f

The text needs to be all lowercase so that "doctor" and "Doctor" are not counted as separate words. This does mean that proper nouns can be counted together with their common-noun counterparts (cooper the job versus Cooper the surname), but this is unlikely to skew the data much. There are also many newline characters throughout the text that need to be removed, along with a lengthy section before the story begins. The first two issues were handled easily with the lower and replace methods. The introductory segment was removed by slicing the string at a character offset (1380 for *Frankenstein* and 35186 for *Pride and Prejudice*), found with a character counter and a few rounds of guessing and checking.
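
As an alternative to hand-tuned offsets, the Gutenberg header and footer can be located by searching for marker lines. The sketch below is only an illustration: it assumes the files contain `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` lines, and `strip_gutenberg_boilerplate` is a hypothetical helper, not part of NLTK or requests. It runs on a small made-up sample rather than the downloaded novels.

```python
import re

# Marker lines assumed to surround the book body in Gutenberg plain-text files.
START_RE = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE
)
END_RE = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE
)

def strip_gutenberg_boilerplate(raw_text):
    """Hypothetical helper: return the text between the START and END markers."""
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)
    begin = start.end() if start else 0            # fall back to the full text
    finish = end.start() if end else len(raw_text)
    return raw_text[begin:finish].strip()

# Made-up sample that mimics the layout of a Gutenberg file.
sample = (
    "The Project Gutenberg eBook of Frankenstein\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\r\n"
    "Letter 1\r\n"
    "You will rejoice to hear...\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\r\n"
    "Updated editions will replace the previous one.\r\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

This only removes the Gutenberg header and footer. Edition-specific front matter such as a preface or table of contents, which is likely why the *Pride and Prejudice* offset is so large, would still need to be trimmed separately.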

Then a tokenizer was used to extract the words from the text, courtesy of NLTK's regular-expression-based tokenizers. The same thing is possible with the split method and a space as the separator, but that approach is clunkier and more error-prone: punctuation stays attached to words, and doubled spaces produce empty strings.
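
The difference between splitting on spaces and matching with a regular expression can be seen on a single sentence. This is a minimal sketch using the standard library's `re.findall` as a stand-in for an NLTK tokenizer so that it runs without any downloads; the sample sentence is made up for illustration.

```python
import re

# The doubled space mimics what replacing '\r\n' with ' ' can leave behind.
sentence = "do  you understand this feeling, margaret?"

# Splitting on a single space keeps punctuation attached to words and yields
# an empty string wherever two spaces sit side by side.
split_tokens = sentence.split(" ")

# r"\w+" matches runs of word characters, so punctuation and extra spaces vanish.
word_tokens = re.findall(r"\w+", sentence)

# r"\w" without the "+" matches one character at a time.
char_tokens = re.findall(r"\w", "do you")

print(split_tokens)
print(word_tokens)
print(char_tokens)
```

Note that `\w` also matches the underscore, so a token such as "_to" would survive a `\w+` pattern unchanged.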

Based on the beginning of *Frankenstein*, there are many common words like "I", "and", and "in" that should be filtered out to properly compare the two texts. Luckily, NLTK provides a list of these stop words through nltk.corpus.stopwords. It is also evident that some tokens are strangely written, like "_to", but this should have a minimal effect, just like proper and common nouns being confused for each other.

In [None]:
# A set makes the membership tests in the comprehensions below fast.
StopWords = set(nltk.corpus.stopwords.words('english'))

StrippedFrankensteinWords = [word for word in FrankensteinWords if word not in StopWords]
StrippedPrideWords = [word for word in PrideWords if word not in StopWords]

In [None]:
StrippedFrankensteinWords[:10]

['letter',
 '1',
 '_to',
 'mrs.',
 'saville',
 ',',
 'england._',
 'st.',
 'petersburgh',
 ',']

In [None]:
StrippedPrideWords[:10]

['truth',
 'universally',
 'acknowledged',
 ',',
 'single',
 'man',
 'possession',
 'good',
 'fortune',
 'must']

As expected, "_to" is kept, along with punctuation tokens such as ",", but this shouldn't be a significant problem. Regardless, the list comprehension worked and the word lists are stripped of stop words. Now the words can be counted and compared.

In [None]:
FrankensteinCount = Counter(StrippedFrankensteinWords)
PrideCount = Counter(StrippedPrideWords)

In [None]:
FrankensteinCount.most_common(10)

[(',', 5096),
 ('.', 2797),
 (';', 972),
 ('“', 491),
 ('”', 304),
 ('!', 239),
 ('?', 220),
 ('one', 203),
 ('could', 197),
 ('would', 183)]

In [None]:
PrideCount.most_common(10)

[(',', 9638),
 ('.', 4216),
 ('“', 1840),
 ('”', 1808),
 (';', 1662),
 ('mr.', 766),
 ('’', 739),
 ('elizabeth', 633),
 ('could', 523),
 ('?', 476)]
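
Both top-ten lists are dominated by punctuation tokens. One possible way to drop them before counting is to keep only tokens that contain at least one letter. The sketch below assumes that rule is acceptable for this analysis; `has_letter` is a hypothetical helper, and the token list is a small stand-in for a list such as StrippedPrideWords.

```python
from collections import Counter

# Small stand-in token list; in the notebook this would be a list
# such as StrippedPrideWords.
tokens = ['truth', ',', 'mr.', '“', 'elizabeth', '.', 'could', 'elizabeth', '’', '1']

def has_letter(token):
    """Hypothetical helper: True if the token contains at least one letter."""
    return any(ch.isalpha() for ch in token)

# Keep tokens like 'mr.' but drop pure punctuation and bare numbers.
word_tokens = [t for t in tokens if has_letter(t)]
word_counts = Counter(word_tokens)
print(word_counts.most_common(3))
```

This rule also discards bare numbers such as "1", which may or may not be wanted depending on the comparison.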

Setting punctuation tokens aside, a preliminary analysis shows a general difference in tone and outlook. For example, *Pride and Prejudice* contains honorifics like "Mr.", "Mrs.", and "lady" amongst its most common words, as well as names such as Elizabeth, Darcy, and Bennet. *Frankenstein*, on the other hand, has general terms like "one", "father", and "man" appear more often than specific names. This is not to say there are no clear similarities: both include "could", "would", "may", "much", and "shall" in their most-used word lists. These are likely common words for the time period that happen not to be in NLTK's stop word list. There is an argument to be made that the frequency of "must", "much", "may", and "might", and of honorifics, titles, and names, supports reading *Pride and Prejudice* as a novel of manners; likewise, the prevalence of impersonal words like "one" and "man" supports reading *Frankenstein* as a philosophical proto-science fiction text (a reading furthered by the frequency of "life", "death", "mind", and "feelings").

This is interesting, but there are other aspects of the text to analyze, such as average word length, average sentence length, and readability. The unstripped version of each work will be used for this comparison, since stop words like "and" still contribute to how long sentences and words are, and therefore to readability.
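
A minimal sketch of how these surface statistics could be computed is shown below. It is an illustration under stated assumptions, not the notebook's method: `count_syllables` and `text_stats` are hypothetical helpers, syllables are estimated by counting groups of consecutive vowels (a rough heuristic), words are taken to be runs of the letters a to z, and readability is measured with the Flesch reading ease formula. The two sample sentences are taken from the tokenized output above.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def text_stats(sentences):
    """Hypothetical helper: surface statistics for a list of lowercase sentences."""
    # Treat runs of a-z as words; this splits contractions and skips accented letters.
    words = [w for s in sentences for w in re.findall(r"[a-z]+", s)]
    words_per_sentence = len(words) / len(sentences)
    letters_per_word = sum(len(w) for w in words) / len(words)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    # Flesch reading ease: higher scores indicate easier text.
    flesch = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return {
        "words_per_sentence": words_per_sentence,
        "letters_per_word": letters_per_word,
        "syllables_per_word": syllables_per_word,
        "flesch_reading_ease": flesch,
    }

stats = text_stats(["do you understand this feeling?", "this was invitation enough."])
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

In the notebook, a list such as FrankensteinSent or PrideSent could be passed in place of the two sample sentences, bearing in mind that the vowel-group heuristic will miscount some words.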
