This project uses the Natural Language Toolkit (NLTK) to analyze the text of *Pride and Prejudice* and *Frankenstein* and find their most common words. Very common words such as "the", "or", and "and" will be filtered out, since they are plentiful in any English text and say little about either novel. Both novels were published in the 1810s and come from very different genres, a novel of manners and proto-science fiction, so it is interesting to see how their use of language differs at a surface level.

First, the two novels are downloaded from **Project Gutenberg**. Luckily, the site already offers them as plain text, so there is no need to extract the text from HTML or some other format. The text can be converted to lowercase directly (so that "pride" and "Pride" are not treated as different words) and then analyzed. The analysis covers common words, sentence length, average word length, readability, and average syllables per word.

In [1]:
from bs4 import BeautifulSoup
import requests
import nltk
from collections import Counter

In [2]:
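# Download the plain-text editions from Project Gutenberg.
Frankenstein = requests.get("https://www.gutenberg.org/cache/epub/84/pg84.txt")
Pride = requests.get("https://www.gutenberg.org/cache/epub/1342/pg1342.txt")

# Join the lines, lowercase, and slice off the front matter
# (the offsets were found by trial and error).
FrankensteinText = Frankenstein.text.replace('\r\n', ' ').lower()[1380:]
PrideText = Pride.text.replace('\r\n', ' ').lower()[35186:]

# Raw string so that \w reaches the regex engine unchanged.
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

FrankensteinWords = tokenizer.tokenize(FrankensteinText)
PrideWords = tokenizer.tokenize(PrideText)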

FrankensteinWords

['letter',
 '1',
 '_to',
 'mrs',
 'saville',
 'england',
 '_',
 'st',
 'petersburgh',
 'dec',
 '11th',
 '17',
 'you',
 'will',
 'rejoice',
 'to',
 'hear',
 'that',
 'no',
 'disaster',
 'has',
 'accompanied',
 'the',
 'commencement',
 'of',
 'an',
 'enterprise',
 'which',
 'you',
 'have',
 'regarded',
 'with',
 'such',
 'evil',
 'forebodings',
 'i',
 'arrived',
 'here',
 'yesterday',
 'and',
 'my',
 'first',
 'task',
 'is',
 'to',
 'assure',
 'my',
 'dear',
 'sister',
 'of',
 'my',
 'welfare',
 'and',
 'increasing',
 'confidence',
 'in',
 'the',
 'success',
 'of',
 'my',
 'undertaking',
 'i',
 'am',
 'already',
 'far',
 'north',
 'of',
 'london',
 'and',
 'as',
 'i',
 'walk',
 'in',
 'the',
 'streets',
 'of',
 'petersburgh',
 'i',
 'feel',
 'a',
 'cold',
 'northern',
 'breeze',
 'play',
 'upon',
 'my',
 'cheeks',
 'which',
 'braces',
 'my',
 'nerves',
 'and',
 'fills',
 'me',
 'with',
 'delight',
 'do',
 'you',
 'understand',
 'this',
 'feeling',
 'this',
 'breeze',
 'which',
 'has',
 ...

The text needs to be all lowercase so that "doctor" and "Doctor" are not counted as separate words. This does mean that proper nouns may be counted together with their common-noun counterparts (cooper the job versus Cooper the surname), but this is unlikely to skew the data much. There are also many newline characters throughout the text that need to be removed, along with a lengthy section before the story begins. The first two issues were handled easily with the replace and lower methods. The introductory segment was removed by slicing the string at a character offset, which was found with a character counter and a few rounds of guessing and checking.
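The guess-and-check offsets work, but the counts further down still include "gutenberg" and "project", which suggests that the license text at the end of each file was left in. A possible alternative, sketched here under the assumption that the files contain Project Gutenberg's usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` marker lines, is to cut at those markers. The helper name and the sample text are hypothetical and only illustrate the idea.

```python
# Hypothetical helper: keep only the text between Project Gutenberg's
# "*** START OF ..." and "*** END OF ..." marker lines (assumed to exist).
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"


def strip_gutenberg_boilerplate(raw_text):
    """Return the body between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = raw_text.find(START_MARKER)
    if start != -1:
        # Skip to the end of the marker line itself.
        line_end = raw_text.find("\n", start)
        start = line_end + 1 if line_end != -1 else len(raw_text)
    else:
        start = 0
    end = raw_text.find(END_MARKER, start)
    if end == -1:
        end = len(raw_text)
    return raw_text[start:end]


# Tiny made-up sample standing in for a downloaded file.
sample = (
    "The Project Gutenberg eBook of Example\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\r\n"
    "Letter 1\r\nYou will rejoice to hear...\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\r\n"
    "License text mentioning Project Gutenberg.\r\n"
)
body = strip_gutenberg_boilerplate(sample).replace("\r\n", " ").lower().strip()
print(body)
```

This only removes the Project Gutenberg header and license. Front matter that belongs to the edition itself, such as a preface or a list of illustrations, would still need an offset or another marker.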

A tokenizer, courtesy of NLTK and a regular expression, was then used to extract the words from the text. The same could be attempted with the split method and a space as the separator, but that is clunkier and more error-prone because punctuation stays attached to the words.
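To make the difference concrete, the following sketch compares the two approaches on one sentence. It uses the standard library's `re.findall` as a stand-in for the NLTK tokenizer, on the assumption that both return the same matches for the pattern `\w+`; the sample sentence is adapted from the opening of *Frankenstein*.

```python
import re

sentence = "i arrived here yesterday, and my first task is to assure my dear sister."

# Splitting on spaces leaves punctuation glued to the words.
split_words = sentence.split(" ")

# \w+ keeps only runs of letters, digits, and underscores.
regex_words = re.findall(r"\w+", sentence)

print(split_words[3], split_words[-1])
print(regex_words[3], regex_words[-1])
```

With split, "yesterday," and "yesterday" would be counted as two different words, which is exactly the kind of error the tokenizer avoids.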

Judging by the beginning of *Frankenstein*, there are many common words such as "I", "and", and "in" that should be filtered out to compare the two texts properly. Luckily, NLTK ships a stop word list for this in its stopwords corpus. It is also evident that some tokens are oddly written, such as "_to" and "_". These come from the underscores that mark italics in the plain text, which \w also matches. This should have a minimal effect, just like proper nouns and common nouns being conflated.
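The filtering step itself is a simple membership test. The sketch below uses a small, hand-picked stop list (hypothetical, standing in for NLTK's English list) and stores it in a set, which is a common choice because lookups stay fast even when the word lists run to over a hundred thousand tokens.

```python
# Hypothetical, hand-picked stop list standing in for
# nltk.corpus.stopwords.words('english').
stop_words = {"i", "and", "in", "the", "of", "my", "to", "is", "here"}

words = ["i", "arrived", "here", "yesterday", "and", "my", "first", "task"]

# Set membership is a hash lookup, so each test stays cheap even when
# the stop list is long.
kept = [word for word in words if word not in stop_words]
print(kept)
```

Filtering against a list gives the same result; the set only changes how quickly each lookup runs.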

In [3]:
# Requires the stopwords corpus; run nltk.download('stopwords') once if it is missing.
StopWords = nltk.corpus.stopwords.words('english')

StrippedFrankensteinWords = [word for word in FrankensteinWords if word not in StopWords]
StrippedPrideWords = [word for word in PrideWords if word not in StopWords]

In [4]:
StrippedFrankensteinWords[:10]

['letter',
 '1',
 '_to',
 'mrs',
 'saville',
 'england',
 '_',
 'st',
 'petersburgh',
 'dec']

In [5]:
StrippedPrideWords[:10]

['truth',
 'universally',
 'acknowledged',
 'single',
 'man',
 'possession',
 'good',
 'fortune',
 'must',
 'want']

As expected, "_to" is kept, but this shouldn't be a significant problem. Otherwise the list comprehension worked, and the stop words have been stripped from both word lists. Now the words can be counted and compared.
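If the stray underscores ever did become a problem, one option would be a pattern that excludes them. The sketch below is an assumption about how that could look, not part of the analysis above; it again uses `re.findall` in place of the NLTK tokenizer, and the sample string is adapted from the start of *Frankenstein*.

```python
import re

text = "letter 1 _to mrs. saville, england._ st. petersburgh"

# \w+ treats "_" as a word character, so the italics markers survive.
with_underscores = re.findall(r"\w+", text)

# [^\W_]+ means "word characters except the underscore".
without_underscores = re.findall(r"[^\W_]+", text)

print(with_underscores)
print(without_underscores)
```

With the second pattern "_to" becomes "to", which the stop word filter would then remove, and the lone "_" token disappears entirely.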

In [6]:
FrankensteinCount = Counter(StrippedFrankensteinWords)
PrideCount = Counter(StrippedPrideWords)

In [7]:
FrankensteinCount.most_common()

[('one', 208),
 ('could', 198),
 ('would', 184),
 ('yet', 152),
 ('man', 137),
 ('father', 133),
 ('upon', 128),
 ('life', 116),
 ('may', 113),
 ('every', 109),
 ('first', 108),
 ('might', 108),
 ('shall', 106),
 ('eyes', 104),
 ('said', 102),
 ('time', 98),
 ('even', 96),
 ('towards', 94),
 ('saw', 94),
 ('gutenberg', 93),
 ('elizabeth', 92),
 ('night', 91),
 ('found', 89),
 ('ever', 85),
 ('mind', 85),
 ('project', 85),
 ('day', 82),
 ('heart', 81),
 ('felt', 80),
 ('death', 79),
 ('work', 78),
 ('feelings', 76),
 ('must', 74),
 ('thought', 74),
 ('dear', 72),
 ('soon', 71),
 ('friend', 71),
 ('many', 70),
 ('made', 70),
 ('never', 69),
 ('also', 68),
 ('still', 68),
 ('passed', 67),
 ('thus', 66),
 ('place', 65),
 ('miserable', 65),
 ('like', 63),
 ('heard', 62),
 ('became', 61),
 ('us', 61),
 ('sometimes', 60),
 ('love', 59),
 ('clerval', 59),
 ('little', 58),
 ('human', 58),
 ('country', 57),
 ('appeared', 57),
 ('often', 56),
 ('indeed', 56),
 ('justine', 55),
 ('misery', 54),
 ...

In [8]:
PrideCount.most_common()

[('mr', 782),
 ('elizabeth', 634),
 ('could', 524),
 ('would', 467),
 ('darcy', 418),
 ('said', 403),
 ('mrs', 343),
 ('much', 328),
 ('bennet', 327),
 ('must', 316),
 ('bingley', 308),
 ('jane', 293),
 ('miss', 284),
 ('one', 268),
 ('know', 238),
 ('well', 226),
 ('though', 226),
 ('never', 222),
 ('soon', 216),
 ('sister', 216),
 ('think', 211),
 ('may', 206),
 ('good', 202),
 ('might', 199),
 ('time', 197),
 ('wickham', 194),
 ('lady', 191),
 ('little', 187),
 ('every', 182),
 ('without', 179),
 ('collins', 179),
 ('nothing', 177),
 ('lydia', 171),
 ('make', 169),
 ('shall', 164),
 ('dear', 160),
 ('say', 158),
 ('illustration', 156),
 ('see', 153),
 ('room', 151),
 ('family', 151),
 ('man', 150),
 ('first', 147),
 ('day', 145),
 ('great', 142),
 ('mother', 137),
 ('however', 135),
 ('father', 135),
 ('two', 133),
 ('ever', 133),
 ('young', 130),
 ('made', 127),
 ('catherine', 127),
 ('give', 126),
 ('hope', 123),
 ('us', 123),
 ('many', 121),
 ('away', 121),
 ('always', 120),
 ...

A preliminary analysis shows a general difference in tone and outlook. For example, *Pride and Prejudice* has honorifics like "mr", "mrs", and "lady" among its most common words, as well as names such as Elizabeth, Darcy, and Bennet. *Frankenstein*, on the other hand, has the general terms "one", "father", and "man" ranking above any specific name. This is not to say there are no clear similarities. Both include "could", "would", "may", "might", and "shall" in their most-used word lists. These are likely common words for the time period that happen not to be in NLTK's stop word list. There is an argument to be made that the popularity of "must", "much", "may", and "might", and of honorifics, titles, and names, is consistent with *Pride and Prejudice* being a novel of manners; likewise, the prevalence of impersonal words like "one" and "man" is consistent with *Frankenstein* being a philosophical proto-science fiction text (a reading reinforced by the popularity of "life", "death", "mind", and "feelings").
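One caveat when comparing the two lists is that the raw counts also reflect the length of each novel: the top counts for *Pride and Prejudice* are several times larger than those for *Frankenstein*. A rough way to put them on the same scale is to convert counts into occurrences per 1,000 words. The sketch below illustrates this with made-up toy lists; the function name and the data are hypothetical.

```python
from collections import Counter


def rate_per_thousand(words):
    """Map each word to its occurrences per 1,000 tokens in `words`."""
    total = len(words)
    counts = Counter(words)
    return {word: 1000 * count / total for word, count in counts.items()}


# Toy token lists (made up) standing in for the two stripped word lists.
short_text = ["could", "life", "could", "death"]
long_text = ["could", "mr", "mr", "darcy", "could", "mr", "bennet", "could"]

short_rates = rate_per_thousand(short_text)
long_rates = rate_per_thousand(long_text)

# Words present in both texts, with their rates side by side.
shared = sorted(set(short_rates) & set(long_rates))
for word in shared:
    print(word, short_rates[word], long_rates[word])
```

In this toy data "could" appears more often in the longer text by raw count (3 versus 2) but less often per 1,000 words, which is the kind of reversal that normalizing can reveal.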

The word counts are interesting, but there are other aspects of the text to analyze, such as average word length, average sentence length, and readability. The unstripped version of each work will be used for this comparison, since stop words like "and" still contribute to how long the sentences and words are and to how readable the text is.
