In [74]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import matplotlib.pyplot as plt
import text_processing
import utils
import tf_idf
import requests

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Oscar Wilde's Use of Literary Devices Over Time
##### Jessica Brown and Lauren Nalajala
## Background
Oscar 

In [None]:
#Add timeline image here

And the goal of this project here

## Initial Text Processing
We gathered all of our data from Project Gutenberg by scraping. First, we made a list (`ow_corpus_list`) of the URLs and titles, as strings, of the Oscar Wilde works we wished to analyze.

In [None]:
ow_corpus_list = ["https://www.gutenberg.org/files/773/773-0.txt", "Lord Arthur Savile's Crime and Other Short Stories",
"https://www.gutenberg.org/cache/epub/902/pg902.txt", "The Happy Prince and Other Short Stories",
"https://www.gutenberg.org/cache/epub/174/pg174.txt", "The Picture of Dorian Grey",
"https://www.gutenberg.org/cache/epub/42704/pg42704.txt", "Salome",
"https://www.gutenberg.org/files/873/873-0.txt", "A House of Pomegranates",
"https://www.gutenberg.org/files/875/875-0.txt", "The Ducchess of Padua",
"https://www.gutenberg.org/files/1017/1017-0.txt", "The Soul of Man Under Socialism",
"https://www.gutenberg.org/files/790/790-0.txt", "Lady Windermere's Fan",
"https://www.gutenberg.org/files/854/854-0.txt", "A Woman of No Importance",
"https://www.gutenberg.org/files/844/844-0.txt", "The Importance of Being Earnest",
"https://www.gutenberg.org/cache/epub/301/pg301.txt", "The Ballad of Reading Gaol",
"https://www.gutenberg.org/files/885/885-0.txt", "An Ideal Husband"]


To retrieve and process this data, we wrote functions in the file `text_processing.py`. First, we ran `get_data_from_book` for "Lord Arthur Savile's Crime". This function takes a URL and a title and returns a list of the words that appear in the text. It also creates a .txt file named after the title and writes the same text to it. However, this text still contains some encoding marks, so we run `remove_encoding_marks`, which removes the first word of the text (extraneous Project Gutenberg text), any "\r\n" combinations, and empty strings.

In [106]:
lasc_raw = text_processing.get_data_from_book(ow_corpus_list[0], ow_corpus_list[1])
lasc_encode = text_processing.remove_encoding_marks(lasc_raw)
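
The cleanup step described above can be sketched roughly as follows. This is not the code in `text_processing.py`: the helper name `sketch_remove_encoding_marks` and the sample word list are assumptions made for illustration, and the real function may treat line breaks differently.

```python
def sketch_remove_encoding_marks(words):
    """Drop the leading token, split on Windows line breaks, discard empties.

    Hypothetical stand-in for text_processing.remove_encoding_marks.
    """
    cleaned = []
    # Skip the first token, assumed to hold Project Gutenberg's header debris.
    for token in words[1:]:
        # A token such as "house\r\nwas" becomes two separate words.
        for piece in token.split("\r\n"):
            if piece:  # ignore empty strings left behind by the split
                cleaned.append(piece)
    return cleaned


# Made-up sample input, not taken from the real download.
sample = ["\ufeffthe", "house\r\nwas", "", "crammed", "with\r\n\r\ngeniuses."]
cleaned_sample = sketch_remove_encoding_marks(sample)
print(cleaned_sample)
```

Splitting on "\r\n" rather than deleting it keeps words that were joined only by a line break from being fused together.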



As mentioned above, the text retrieved from Project Gutenberg's website includes introductions and closing statements that are not part of the original text, as well as a table of contents. To remove these, we use `remove_extra_text`, which strips this extraneous text along with the words listed in a given .csv file of common words (in this case, the 200 most common English words). We do this by defining a "start word" and an "end word", the words at which we want the text to start and end.

In [110]:
lasc_text_only = text_processing.remove_extra_text(lasc_encode, 'chapter', '\r\n\r\n***end')
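
As a rough illustration of the start word and end word idea, here is a minimal sketch. It is not the actual `remove_extra_text`: the name `sketch_remove_extra_text` is hypothetical, the common words are passed in as a set instead of being read from a .csv file, and keeping the start word while dropping the end word is an assumption.

```python
def sketch_remove_extra_text(words, start_word, end_word, common_words):
    """Keep the words between start_word and end_word, minus common words.

    Hypothetical stand-in for text_processing.remove_extra_text.
    """
    start = words.index(start_word)     # first occurrence of the start word
    end = words.index(end_word, start)  # first end word at or after the start
    body = words[start:end]             # start word kept, end word dropped
    return [word for word in body if word not in common_words]


# Made-up sample input for illustration.
sample = ["produced", "by", "volunteers", "chapter", "the", "palmist",
          "read", "the", "hand", "***end", "of", "this", "ebook"]
trimmed_sample = sketch_remove_extra_text(sample, "chapter", "***end", {"the", "of"})
print(trimmed_sample)
```

Using the first occurrence of the start word is also an assumption; a text whose table of contents contains the start word would need a later occurrence instead.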

We will also create a dictionary of start and end words for the other texts.

In [109]:
ow_start_end_words_dict = {"Lord Arthur Savile's Crime and Other Short Stories": ('chapter', '\r\n\r\n***end'),
                           "The Happy Prince and Other Short Stories": ('chapter', '\r\n\r\n***end'),
                           "The Picture of Dorian Grey": ('chapter', '\r\n\r\n***end'),
                           "Salome": ('chapter', '\r\n\r\n***end'),
                           "A House of Pomegranates": ('chapter', '\r\n\r\n***end'),
                           "The Ducchess of Padua": ('chapter', '\r\n\r\n***end'),
                           "The Soul of Man Under Socialism": ('chapter', '\r\n\r\n***end'),
                           "Lady Windermere's Fan": ('chapter', '\r\n\r\n***end'),
                           "A Woman of No Importance": ('chapter', '\r\n\r\n***end'),
                           "The Importance of Being Earnest": ('chapter', '\r\n\r\n***end'),
                           "The Ballad of Reading Gaol": ('chapter', '\r\n\r\n***end'),
                           "An Ideal Husband": ('chapter', '\r\n\r\n***end')}

Finally, we remove extraneous punctuation marks and titles with `remove_punctuation` and `remove_titles`, respectively. We also use `remove_character_names`, which takes a tuple of character names, to remove the names of the main characters.

In [111]:
lasc_characters = ('windermere', 'arthur', 'savile', 'podgers', 'clementina', 'sybil', 'otis', 'canterville', 'washington', 'virginia', 'umney', 'simon', 'eleanor', 'murchison', 'alan', 'trevor', 'alroy', 'hughie', 'erskine', 'laura', 'merton')
lasc = text_processing.remove_character_names(text_processing.remove_titles(text_processing.remove_punctuation(lasc_text_only)), lasc_characters)

This leaves us with an edited list of the words in a given text (in this case, Lord Arthur Savile's Crime). With this data, we can analyze Wilde's word usage in this text. However, to analyze multiple texts, we must import and process each of them. To do this, we will first create `character_dict`, which maps each text to a tuple of the characters in that text. Then, we will use `initial_text_processing` on each text, which performs all of the operations shown above.

In [None]:
character_dict = {"Lord Arthur Savile's Crime and Other Short Stories": ('windermere', 'arthur', 'savile', 'podgers', 'clementina', 'sybil', 'otis', 'canterville', 'washington', 'virginia', 'umney', 'simon', 'eleanor', 'murchison', 'alan', 'trevor', 'alroy', 'hughie', 'erskine', 'laura', 'merton'),
                           "The Happy Prince and Other Short Stories": ("happy", 'prince', 'swallow', 'student', 'nightingale', 'giant', 'hans', 'miller', 'rocket'),
                           "The Picture of Dorian Grey": ('dorian', 'basil', 'wotton', 'james', 'gray', 'hallward', 'sibyl', 'vane', 'campbell', 'fermor', 'singleton', 'victoria'),
                           "Salome": ('herod', 'judea', 'tigellinus', 'salome', 'antipas', 'jokanaan', 'herodias'),
                           "A House of Pomegranates": ('young', 'dwarf', 'fisherman', 'soul', 'king', 'infanta', 'mermaid', 'star-child'),
                           "The Ducchess of Padua": ('simone', 'andrea', 'taddeo', 'moranzone', 'gesso', 'maffio', 'guido', 'bernardo', 'beatrice', 'jeppo', 'ascanio', 'ugo', 'lucia'),
                           "The Soul of Man Under Socialism": (),
                           "Lady Windermere's Fan": ('chapter', '\r\n\r\n***end'),
                           "A Woman of No Importance": ('chapter', '\r\n\r\n***end'),
                           "The Importance of Being Earnest": ('chapter', '\r\n\r\n***end'),
                           "The Ballad of Reading Gaol": ('chapter', '\r\n\r\n***end'),
                           "An Ideal Husband": ('chapter', '\r\n\r\n***end')}

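Because `ow_corpus_list` alternates URLs and titles, one way to walk it pairwise before calling the processing functions is with slices and `zip`. The sketch below uses a shortened copy of the list (named `corpus_list` here, an illustrative name) and does not call `initial_text_processing`, since its signature is not shown in this notebook.

```python
# Shortened copy of the alternating URL/title list, for illustration only.
corpus_list = [
    "https://www.gutenberg.org/cache/epub/42704/pg42704.txt", "Salome",
    "https://www.gutenberg.org/files/873/873-0.txt", "A House of Pomegranates",
]

urls = corpus_list[0::2]    # even positions hold the URLs
titles = corpus_list[1::2]  # odd positions hold the titles

# Map each title to its URL so it lines up with the title-keyed dictionaries.
title_to_url = dict(zip(titles, urls))

for title, url in title_to_url.items():
    print(title, "->", url)
```

Keying everything by title only works if the title strings match exactly across the list and the dictionaries, including capitalization and apostrophes.
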
## Word Usage
Introduce plots of word usage

In [81]:
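# `lasc` is the cleaned word list for Lord Arthur Savile's Crime built above
las_freq = utils.most_freq_words(lasc)
# Sort the (word, count) pairs by ascending count
las_freq_sorted = sorted(las_freq.items(), key=lambda x: x[1])
las_words = []
las_freqs = []
for word, freq in las_freq_sorted:
    las_words.append(word)
    las_freqs.append(freq)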

[('reception', 1),
 ('easter,', 1),
 ('house\r\nwas', 1),
 ('ministers', 1),
 ('speakerâ\x80\x99s', 1),
 ('levã©e', 1),
 ('ribands,', 1),
 ('wore\r\ntheir', 1),
 ('smartest', 1),
 ('dresses,', 1),
 ('picture-gallery', 1),
 ('the\r\nprincess', 1),
 ('sophia', 1),
 ('carlsrã¼he,', 1),
 ('tartar-looking', 1),
 ('tiny\r\nblack', 1),
 ('emeralds,', 1),
 ('her\r\nvoice,', 1),
 ('laughing', 1),
 ('immoderately', 1),
 ('medley', 1),
 ('peeresses', 1),
 ('chatted\r\naffably', 1),
 ('radicals,', 1),
 ('preachers', 1),
 ('brushed', 1),
 ('coat-tails', 1),
 ('with\r\neminent', 1),
 ('sceptics,', 1),
 ('bevy', 1),
 ('stout\r\nprima-donna', 1),
 ('royal\r\nacademicians,', 1),
 ('disguised', 1),
 ('artists,', 1),
 ('the\r\nsupper-room', 1),
 ('crammed', 1),
 ('geniuses.', 1),
 ('of\r\nlady', 1),
 ('nights,', 1),
 ('princess', 1),
 ('nearly\r\nhalf-past', 1),
 ('eleven.\r\n\r\nas', 1),
 ('picture-gallery,\r\nwhere', 1),
 ('political', 1),
 ('economist', 1),
 ('solemnly', 1),
 ('explaining', 1),
 ('the

Analyze most commonly used words here
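
For reference, a word-frequency table of the kind used above can be sketched with the standard library's `collections.Counter`. This is a swapped-in technique, not the implementation of `utils.most_freq_words`, and the word list below is made up for illustration.

```python
from collections import Counter

# Made-up word list standing in for a processed text.
sample_words = ["lady", "palmist", "hand", "lady", "hand", "lady"]

counts = Counter(sample_words)

# most_common(n) returns (word, count) pairs, highest count first.
top_words = counts.most_common(2)

# Split the pairs into parallel lists, ready to be used as bar labels and heights.
labels = [word for word, _ in top_words]
values = [count for _, count in top_words]
print(top_words)
```

The `labels` and `values` lists could then be passed to `plt.bar(labels, values)` to plot the most frequent words.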

## Numbers - some sort of catchy title here?
Numerical analysis of Wilde's writing - average length of words, average length of sentences over time
Maybe a comparison with another book?

In [None]:
# Plots of Wilde's word length and sentence length over time
# Also a scatter plot of average word/sentence length for Wilde's books compared to other authors
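
The two statistics mentioned above could be computed along these lines. This is only a sketch: the function names are hypothetical, the sample sentence data is made up, and the sentence splitter is a crude heuristic that would miscount abbreviations such as "Mr." unless they are removed first.

```python
import re


def average_word_length(words):
    """Mean number of characters per word in a list of words."""
    return sum(len(word) for word in words) / len(words)


def average_sentence_length(text):
    """Mean number of words per sentence, splitting on ., ! and ?."""
    # Split after sentence-ending punctuation followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text.strip()) if s]
    return sum(len(sentence.split()) for sentence in sentences) / len(sentences)


# Made-up sample text for illustration.
sample_text = "The gallery was crowded. Everyone talked at once! Who had arrived?"
word_avg = average_word_length(["the", "gallery", "was", "crowded"])
sentence_avg = average_sentence_length(sample_text)
print(word_avg, sentence_avg)
```

Sentence length has to be measured before punctuation is stripped, so it would need the text from an earlier stage of processing than the word lists.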

Discuss how Wilde's writing style changed over time, as well as how he compares to other authors of his time

## Polarity
Talk about how we analyzed polarity and make plots of character polarity - maybe connect them back to his life (i.e., who/what they represent)

In [None]:
#polarity plots here

## TF-IDF
Explain what TF-IDF is, the formula for it, and graph it for individual books\
Also plot TF-IDF of certain words over time 
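
As a starting point for this section, one common textbook form of TF-IDF scores a word $t$ in a text $d$ as $\mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$, where $\mathrm{tf}(t, d)$ is the count of $t$ in $d$ divided by the number of words in $d$, and $\mathrm{idf}(t) = \ln(N / \mathrm{df}(t))$ with $N$ the number of texts and $\mathrm{df}(t)$ the number of texts containing $t$. The sketch below implements that form on a toy corpus. The function name and data are hypothetical, and the `tf_idf` module imported at the top of this notebook may use a different variant (for example, a smoothed idf).

```python
import math
from collections import Counter


def sketch_tf_idf(documents):
    """Return {title: {word: score}} for a dict mapping title -> list of words."""
    n_docs = len(documents)

    # Document frequency: in how many texts does each word appear?
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    scores = {}
    for title, words in documents.items():
        counts = Counter(words)
        total = len(words)
        scores[title] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Toy corpus, made up for illustration.
toy_corpus = {
    "text_a": ["rose", "garden", "rose", "night"],
    "text_b": ["night", "prison", "wall"],
}
toy_scores = sketch_tf_idf(toy_corpus)
print(toy_scores["text_a"]["rose"], toy_scores["text_a"]["night"])
```

In this form, a word that appears in every text gets a score of zero, which is why words shared across the whole corpus drop out of the rankings.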