# Exploring Frankenstein

This notebook is meant to explore Mary Shelley's _Frankenstein_. In practice, we're warming up our skills, prototyping some functions we may want to build for repeated code, and familiarizing ourselves with different text analysis tools. 

## Retrieving our text

First, we need to retrieve out text from trusty old [Project Gutenberg](https://www.gutenberg.org/).

In [1]:
from urllib.request import urlopen
import nltk

In [2]:
franken_url = 'https://www.gutenberg.org/cache/epub/42324/pg42324.txt'

In [3]:
franken_file = urlopen(franken_url)

In [4]:
type(franken_file)

http.client.HTTPResponse

In [5]:
raw = franken_file.read()    # Read in the actual file

In [6]:
franken_string = raw.decode()    # Convert from bites to a string

In [7]:
type(franken_string)    # Bingo

str

In [13]:
franken_string[:1000]

"\ufeffThe Project Gutenberg EBook of Frankenstein, by Mary W. Shelley\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Frankenstein\r\n       or, The Modern Prometheus\r\n\r\nAuthor: Mary W. Shelley\r\n\r\nRelease Date: March 13, 2013 [EBook #42324]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\r\n\r\n\r\n\r\n\r\nProduced by Greg Weeks, Mary Meehan and the Online\r\nDistributed Proofreading Team at http://www.pgdp.net\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n                             FRANKENSTEIN:\r\n\r\n                                  OR,\r\n\r\n                         THE MODERN PROMETHEUS.\r\n\r\n                          BY MARY W. SHELLEY.\r\n\r\n            AUTHOR OF THE LAST MAN, PERKIN WARBECK, 

## Tokenize, format for nltk, slice out front/back matter

Next we'll tokenize the text, and format in such a way that `nltk` can work with it. We'll also trim the text of front and back matter added by Project Gutenberg's ebook

In [9]:
franken_tokens = nltk.word_tokenize(franken_string)

In [10]:
type(franken_tokens)    # Now, we're working with a list rather than a string

list

In [12]:
franken_tokens[:20]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Frankenstein',
 ',',
 'by',
 'Mary',
 'W.',
 'Shelley',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere']

In [41]:
franken_tokens.index('MODERN')

120

In [42]:
franken_slice = franken_tokens[121:]

In [43]:
franken_slice

['PROMETHEUS',
 '.',
 'BY',
 'MARY',
 'W.',
 'SHELLEY',
 '.',
 'AUTHOR',
 'OF',
 'THE',
 'LAST',
 'MAN',
 ',',
 'PERKIN',
 'WARBECK',
 ',',
 '&',
 'c.',
 '&',
 'c.',
 '[',
 'Transcriber',
 "'s",
 'Note',
 ':',
 'This',
 'text',
 'was',
 'produced',
 'from',
 'a',
 'photo-reprint',
 'of',
 'the',
 '1831',
 'edition',
 '.',
 ']',
 'REVISED',
 ',',
 'CORRECTED',
 ',',
 'AND',
 'ILLUSTRATED',
 'WITH',
 'A',
 'NEW',
 'INTRODUCTION',
 ',',
 'BY',
 'THE',
 'AUTHOR',
 '.',
 'LONDON',
 ':',
 'HENRY',
 'COLBURN',
 'AND',
 'RICHARD',
 'BENTLEY',
 ',',
 'NEW',
 'BURLINGTON',
 'STREET',
 ':',
 'BELL',
 'AND',
 'BRADFUTE',
 ',',
 'EDINBURGH',
 ';',
 'AND',
 'CUMMING',
 ',',
 'DUBLIN',
 '.',
 '1831',
 '.',
 'INTRODUCTION',
 '.',
 'The',
 'Publishers',
 'of',
 'the',
 'Standard',
 'Novels',
 ',',
 'in',
 'selecting',
 '``',
 'Frankenstein',
 "''",
 'for',
 'one',
 'of',
 'their',
 'series',
 ',',
 'expressed',
 'a',
 'wish',
 'that',
 'I',
 'should',
 'furnish',
 'them',
 'with',
 'some',
 'account',


In [44]:
franken_slice.index('END')

90548

In [45]:
franken_slice = fraken_slice[:90576]

In [46]:
franken_slice[-1]

'END'

In [50]:
print(franken_slice[0:10], franken_slice[-1])    # Reasonable slicing PG cruft

['FRANKENSTEIN', '*', '*', '*', 'Produced', 'by', 'Greg', 'Weeks', ',', 'Mary'] END


## NLTK work