# Introduction
In this project, I compare modeling and clustering approaches to text classification and conclude that modeling is the better fit for this task. This notebook is part 1 of a two-notebook project on novels obtained from Project Gutenberg; here I clean the raw text of each novel. [Part 2](https://github.com/michellekli/thinkful/blob/master/bootcamp/love-stories-part2.ipynb) covers feature creation, model evaluation, and clustering evaluation.

# Data Set
The data set consists of 10 novels downloaded from Project Gutenberg. They were all found within the first few pages of the ["Books about Love stories (sorted by popularity)"](https://www.gutenberg.org/ebooks/subject/2487) list.
* Pride and Prejudice, by Jane Austen
* Villette, by Charlotte Brontë
* The Woman in White, by Wilkie Collins
* Middlemarch, by George Eliot
* Wives and Daughters, by Elizabeth Cleghorn Gaskell
* Jude the Obscure, by Thomas Hardy
* The Portrait of a Lady, by Henry James
* The Lost Girl, by D. H. Lawrence
* The Age of Innocence, by Edith Wharton
* The Voyage Out, by Virginia Woolf

In [1]:
import spacy
import numpy as np
from tqdm import tqdm_notebook
from os import listdir
from os.path import isfile, join
import codecs
import re
import pickle

nlp = spacy.load('en_core_web_sm')

In [2]:
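# key each novel by the part of its filename before the first '-'
novels = {}
# build the path with join() so it is portable and avoids the invalid '\l' escape
data_dir = join('data', 'love-stories')
for filename in listdir(data_dir):
    filepath = join(data_dir, filename)
    if isfile(filepath):
        # 'utf-8-sig' drops a leading byte-order mark if the file has one
        with codecs.open(filepath, mode='r', encoding='utf-8-sig') as infile:
            novels[filename.split('-')[0]] = infile.read()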

In [3]:
novels.keys()

dict_keys(['austen', 'bronte', 'collins', 'eliot', 'gaskell', 'hardy', 'james', 'lawrence', 'wharton', 'woolf'])

# Data Cleaning

## Austen - Pride and Prejudice

In [4]:
author = 'austen'
novels[author][:2000]

'The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Pride and Prejudice\r\n\r\nAuthor: Jane Austen\r\n\r\nPosting Date: August 26, 2008 [EBook #1342]\r\nRelease Date: June, 1998\r\nLast Updated: March 10, 2018\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\r\n\r\n\r\n\r\n\r\nProduced by Anonymous Volunteers\r\n\r\n\r\n\r\n\r\n\r\nPRIDE AND PREJUDICE\r\n\r\nBy Jane Austen\r\n\r\n\r\n\r\nChapter 1\r\n\r\n\r\nIt is a truth universally acknowledged, that a single man in possession\r\nof a good fortune, must be in want of a wife.\r\n\r\nHowever little known the feelings or views of such a man may be on his\r\nf

In [5]:
header = 701
novels[author][header:header+2000]

'Chapter 1\r\n\r\n\r\nIt is a truth universally acknowledged, that a single man in possession\r\nof a good fortune, must be in want of a wife.\r\n\r\nHowever little known the feelings or views of such a man may be on his\r\nfirst entering a neighbourhood, this truth is so well fixed in the minds\r\nof the surrounding families, that he is considered the rightful property\r\nof some one or other of their daughters.\r\n\r\n“My dear Mr. Bennet,” said his lady to him one day, “have you heard that\r\nNetherfield Park is let at last?”\r\n\r\nMr. Bennet replied that he had not.\r\n\r\n“But it is,” returned she; “for Mrs. Long has just been here, and she\r\ntold me all about it.”\r\n\r\nMr. Bennet made no answer.\r\n\r\n“Do you not want to know who has taken it?” cried his wife impatiently.\r\n\r\n“_You_ want to tell me, and I have no objection to hearing it.”\r\n\r\nThis was invitation enough.\r\n\r\n“Why, my dear, you must know, Mrs. Long says that Netherfield is taken\r\nby a young man of la
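
The header offset above is hard-coded, so it has to be rechecked whenever the source file changes. As a rough sketch of an alternative, the offset could be looked up from the `*** START OF ...` marker and the first chapter heading. `START_MARKER` and `find_body_start` are hypothetical names introduced only for this illustration, and the sample string is abridged from the output above.

```python
import re

# Matches both marker styles seen in these files, for example
# '*** START OF THIS PROJECT GUTENBERG EBOOK ... ***' and
# '***START OF THE PROJECT GUTENBERG EBOOK ...***'.
START_MARKER = re.compile(
    r'\*\*\* ?START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .*?\*\*\*')


def find_body_start(text, first_heading):
    """Return the index of the first `first_heading` after the START marker.

    Hypothetical helper: assumes the heading text does not also appear
    between the marker and the real start (for example in a table of contents).
    """
    marker = START_MARKER.search(text)
    search_from = marker.end() if marker else 0
    index = text.find(first_heading, search_from)
    if index == -1:
        raise ValueError('heading not found: {!r}'.format(first_heading))
    return index


# Abridged sample in the same layout as the Austen file.
sample = ('The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen\r\n\r\n'
          '*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\r\n\r\n'
          'Produced by Anonymous Volunteers\r\n\r\n'
          'PRIDE AND PREJUDICE\r\n\r\nBy Jane Austen\r\n\r\n'
          'Chapter 1\r\n\r\n\r\nIt is a truth universally acknowledged')
sample_header = find_body_start(sample, 'Chapter 1')
print(repr(sample[sample_header:sample_header + 20]))
```

For novels with a table of contents, such as Villette, the heading string would need to be chosen so that it only matches the real chapter heading, so the manual inspection done in this notebook is still a useful check.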

In [6]:
footer = novels[author].find('End of the Project Gutenberg EBook')
novels[author][footer:footer+500]

'End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\r\n\r\n***** This file should be named 1342-0.txt or 1342-0.zip *****\r\nThis and all associated files of various formats will be found in:\r\n        http://www.gutenberg.org/1/3/4/1342/\r\n\r\nProduced by Anonymous Volunteers\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public domain print editions m'
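
`str.find` returns `-1` when the marker is absent, and slicing with `[header:-1]` would then silently keep almost the whole license text. The files in this data set use at least two footer styles (`End of the Project Gutenberg EBook` here, `***END OF THE PROJECT GUTENBERG` for Gaskell and Hardy), so a small guard may be worth having. This is only a sketch: `FOOTER_MARKERS` and `find_footer` are hypothetical names, not part of the original notebook.

```python
# Footer markers that appear in the files used in this notebook.
FOOTER_MARKERS = (
    'End of the Project Gutenberg EBook',
    '***END OF THE PROJECT GUTENBERG',
)


def find_footer(text, markers=FOOTER_MARKERS):
    """Return the index of the first footer marker found, trying each in turn.

    Raises ValueError instead of returning -1, so a missing marker cannot
    turn into a silent text[header:-1] slice.
    """
    for marker in markers:
        index = text.find(marker)
        if index != -1:
            return index
    raise ValueError('no known footer marker found')


# Two tiny made-up strings, one per footer style.
print(find_footer('body\r\nEnd of the Project Gutenberg EBook of Sample'))
print(find_footer('body***END OF THE PROJECT GUTENBERG EBOOK SAMPLE***'))
```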

In [7]:
chapter = r'Chapter [\d]+\r\n'
re.findall(chapter, novels[author][header:header+100000])

['Chapter 1\r\n',
 'Chapter 2\r\n',
 'Chapter 3\r\n',
 'Chapter 4\r\n',
 'Chapter 5\r\n',
 'Chapter 6\r\n',
 'Chapter 7\r\n',
 'Chapter 8\r\n',
 'Chapter 9\r\n',
 'Chapter 10\r\n',
 'Chapter 11\r\n',
 'Chapter 12\r\n']

In [8]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('Chapter 1') == -1
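
The same steps (slice off the header and footer, delete the heading patterns, assert that nothing was left behind) are repeated for every novel in this notebook. A minimal sketch of how they could be wrapped in one function follows; `clean_novel` is a hypothetical helper and the sample text is made up for the demonstration.

```python
import re


def clean_novel(text, header, footer, patterns):
    """Slice text[header:footer], then delete every match of each pattern.

    Hypothetical helper; patterns are applied in the order given, so more
    specific patterns should come before more general ones.
    """
    if not 0 <= header < footer:
        # also catches footer == -1 from a failed str.find
        raise ValueError('expected 0 <= header < footer')
    body = text[header:footer]
    for pattern in patterns:
        body = re.sub(pattern, '', body)
    # the same sanity checks the notebook runs after each cleaning cell
    for pattern in patterns:
        assert re.search(pattern, body) is None
    assert 'Project Gutenberg' not in body
    return body


# Made-up miniature file in the same layout as the Austen text.
sample = ('Title page\r\n\r\nChapter 1\r\n\r\nFirst paragraph.\r\n\r\n'
          'Chapter 2\r\n\r\nSecond paragraph.\r\n\r\n'
          'End of the Project Gutenberg EBook of Sample')
sample_cleaned = clean_novel(
    sample,
    header=sample.find('Chapter 1'),
    footer=sample.find('End of the Project Gutenberg EBook'),
    patterns=[r'Chapter [\d]+\r\n'],
)
print(repr(sample_cleaned))
```

I keep the per-novel cells explicit in this notebook because each file needs its own inspection, but a helper like this would reduce the copy-and-paste once the offsets and patterns are known.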

In [9]:
novels[author][:2000]

'\r\n\r\nIt is a truth universally acknowledged, that a single man in possession\r\nof a good fortune, must be in want of a wife.\r\n\r\nHowever little known the feelings or views of such a man may be on his\r\nfirst entering a neighbourhood, this truth is so well fixed in the minds\r\nof the surrounding families, that he is considered the rightful property\r\nof some one or other of their daughters.\r\n\r\n“My dear Mr. Bennet,” said his lady to him one day, “have you heard that\r\nNetherfield Park is let at last?”\r\n\r\nMr. Bennet replied that he had not.\r\n\r\n“But it is,” returned she; “for Mrs. Long has just been here, and she\r\ntold me all about it.”\r\n\r\nMr. Bennet made no answer.\r\n\r\n“Do you not want to know who has taken it?” cried his wife impatiently.\r\n\r\n“_You_ want to tell me, and I have no objection to hearing it.”\r\n\r\nThis was invitation enough.\r\n\r\n“Why, my dear, you must know, Mrs. Long says that Netherfield is taken\r\nby a young man of large fortune f

## Brontë - Villette

In [10]:
author = 'bronte'
novels[author][:2000]

"The Project Gutenberg EBook of Villette, by Charlotte Brontë\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n\r\nTitle: Villette\r\n\r\nAuthor: Charlotte Brontë\r\n\r\nPosting Date: August 23, 2010 [EBook #9182]\r\nRelease Date: October, 2005\r\nFirst Posted: September 12, 2003\r\n[Last updated: March 2, 2016]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK VILLETTE ***\r\n\r\n\r\n\r\n\r\nProduced by Delphine Lettau, Charles Franks and Distributed Proofreaders\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nVILLETTE.\r\n\r\nBY\r\n\r\nCHARLOTTE BRONTË.\r\n\r\n\r\n\r\nCONTENTS\r\n\r\nCHAPTER\r\n\r\n       I.  BRETTON\r\n      II.  PAULINA\r\n     III.  THE PLAYMATES\r\n      IV.  MISS MARCHMONT\r\n       V.  TURNING A NEW LEAF\r\n      VI.  LONDO

In [11]:
header = 1844
novels[author][header:header+2000]

"CHAPTER I.\r\n\r\nBRETTON.\r\n\r\n\r\nMy godmother lived in a handsome house in the clean and ancient town of\r\nBretton. Her husband's family had been residents there for generations,\r\nand bore, indeed, the name of their birthplace--Bretton of Bretton:\r\nwhether by coincidence, or because some remote ancestor had been a\r\npersonage of sufficient importance to leave his name to his\r\nneighbourhood, I know not.\r\n\r\nWhen I was a girl I went to Bretton about twice a year, and well I\r\nliked the visit. The house and its inmates specially suited me. The\r\nlarge peaceful rooms, the well-arranged furniture, the clear wide\r\nwindows, the balcony outside, looking down on a fine antique street,\r\nwhere Sundays and holidays seemed always to abide--so quiet was its\r\natmosphere, so clean its pavement--these things pleased me well.\r\n\r\nOne child in a household of grown people is usually made very much of,\r\nand in a quiet way I was a good deal taken notice of by Mrs. Bretton,\r\nw

In [12]:
footer = novels[author].find('End of the Project Gutenberg EBook')
novels[author][footer:footer+500]

'End of the Project Gutenberg EBook of Villette, by Charlotte Brontë\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK VILLETTE ***\r\n\r\n***** This file should be named 9182-8.txt or 9182-8.zip *****\r\nThis and all associated files of various formats will be found in:\r\n        http://www.gutenberg.org/9/1/8/9182/\r\n\r\nProduced by Delphine Lettau, Charles Franks and Distributed Proofreaders\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public '

In [13]:
chapter = r'CHAPTER .*[\.]\r\n\r\n.*[\.]\r\n'
re.findall(chapter, novels[author][header:header+100000])

['CHAPTER I.\r\n\r\nBRETTON.\r\n',
 'CHAPTER II.\r\n\r\nPAULINA.\r\n',
 'CHAPTER III.\r\n\r\nTHE PLAYMATES.\r\n',
 'CHAPTER IV.\r\n\r\nMISS MARCHMONT.\r\n',
 'CHAPTER V.\r\n\r\nTURNING A NEW LEAF.\r\n',
 'CHAPTER VI.\r\n\r\nLONDON.\r\n']

In [14]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('CHAPTER I.') == -1

In [15]:
novels[author][:2000]

"\r\n\r\nMy godmother lived in a handsome house in the clean and ancient town of\r\nBretton. Her husband's family had been residents there for generations,\r\nand bore, indeed, the name of their birthplace--Bretton of Bretton:\r\nwhether by coincidence, or because some remote ancestor had been a\r\npersonage of sufficient importance to leave his name to his\r\nneighbourhood, I know not.\r\n\r\nWhen I was a girl I went to Bretton about twice a year, and well I\r\nliked the visit. The house and its inmates specially suited me. The\r\nlarge peaceful rooms, the well-arranged furniture, the clear wide\r\nwindows, the balcony outside, looking down on a fine antique street,\r\nwhere Sundays and holidays seemed always to abide--so quiet was its\r\natmosphere, so clean its pavement--these things pleased me well.\r\n\r\nOne child in a household of grown people is usually made very much of,\r\nand in a quiet way I was a good deal taken notice of by Mrs. Bretton,\r\nwho had been left a widow, with

## Collins - The Woman in White

In [16]:
author = 'collins'
novels[author][:2000]

"The Project Gutenberg EBook of The Woman in White, by Wilkie Collins\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n\r\nTitle: The Woman in White\r\n\r\nAuthor: Wilkie Collins\r\n\r\nPosting Date: September 13, 2008 [EBook #583]\r\nRelease Date: July, 1996\r\nLast updated: January 22, 2009\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK THE WOMAN IN WHITE ***\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nThe Woman in White\r\n\r\n\r\nby\r\n\r\nWilkie Collins\r\n\r\n\r\n\r\n\r\nCONTENTS\r\n\r\nFirst Epoch\r\n\r\n  THE STORY BEGUN BY WALTER HARTRIGHT\r\n  THE STORY CONTINUED BY VINCENT GILMORE\r\n  THE STORY CONTINUED BY MARIAN HALCOMBE\r\n\r\n\r\nSecond Epoch\r\n\r\n  THE STORY CONTINUED BY MARIAN HALCOMBE.\r\n  THE STORY CONT

In [17]:
header = 1453
novels[author][header:header+2000]

"THE STORY BEGUN BY WALTER HARTRIGHT\r\n\r\n(of Clement's Inn, Teacher of Drawing)\r\n\r\n\r\nThis is the story of what a Woman's patience can endure, and what a\r\nMan's resolution can achieve.\r\n\r\nIf the machinery of the Law could be depended on to fathom every case\r\nof suspicion, and to conduct every process of inquiry, with moderate\r\nassistance only from the lubricating influences of oil of gold, the\r\nevents which fill these pages might have claimed their share of the\r\npublic attention in a Court of Justice.\r\n\r\nBut the Law is still, in certain inevitable cases, the pre-engaged\r\nservant of the long purse; and the story is left to be told, for the\r\nfirst time, in this place.  As the Judge might once have heard it, so\r\nthe Reader shall hear it now.  No circumstance of importance, from the\r\nbeginning to the end of the disclosure, shall be related on hearsay\r\nevidence.  When the writer of these introductory lines (Walter\r\nHartright by name) happens to be more 

In [18]:
footer = novels[author].find('End of the Project Gutenberg EBook')
novels[author][footer:footer+500]

'End of the Project Gutenberg EBook of The Woman in White, by Wilkie Collins\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK THE WOMAN IN WHITE ***\r\n\r\n***** This file should be named 583.txt or 583.zip *****\r\nThis and all associated files of various formats will be found in:\r\n        http://www.gutenberg.org/5/8/583/\r\n\r\n\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public domain print editions means that no\r\none owns a United States c'

In [19]:
narrative = r'THE NARRATIVE OF .*\r\n\r\n'
re.findall(narrative, novels[author][header:])

['THE NARRATIVE OF HESTER PINHORN, COOK IN THE SERVICE OF COUNT FOSCO\r\n\r\n',
 'THE NARRATIVE OF THE DOCTOR\r\n\r\n',
 'THE NARRATIVE OF JANE GOULD\r\n\r\n',
 'THE NARRATIVE OF THE TOMBSTONE\r\n\r\n',
 'THE NARRATIVE OF WALTER HARTRIGHT\r\n\r\n']

In [20]:
story = r'THE STORY .*\r\n\r\n'
re.findall(story, novels[author][header:])

['THE STORY BEGUN BY WALTER HARTRIGHT\r\n\r\n',
 'THE STORY CONTINUED BY VINCENT GILMORE\r\n\r\n',
 'THE STORY CONTINUED BY MARIAN HALCOMBE\r\n\r\n',
 'THE STORY CONTINUED BY MARIAN HALCOMBE.\r\n\r\n',
 'THE STORY CONTINUED BY FREDERICK FAIRLIE, ESQ., OF LIMMERIDGE HOUSE[2]\r\n\r\n',
 'THE STORY CONTINUED BY ELIZA MICHELSON\r\n\r\n',
 'THE STORY CONTINUED IN SEVERAL NARRATIVES\r\n\r\n',
 'THE STORY CONTINUED BY WALTER HARTRIGHT.\r\n\r\n',
 'THE STORY CONTINUED BY MRS. CATHERICK\r\n\r\n',
 'THE STORY CONTINUED BY WALTER HARTRIGHT\r\n\r\n',
 'THE STORY CONTINUED BY ISIDOR, OTTAVIO, BALDASSARE FOSCO\r\n\r\n',
 'THE STORY CONCLUDED BY WALTER HARTRIGHT\r\n\r\n']

In [21]:
chapter = r'THE STORY .*\r\n\r\n\(.*\)\r\n'
re.findall(chapter, novels[author][header:])

["THE STORY BEGUN BY WALTER HARTRIGHT\r\n\r\n(of Clement's Inn, Teacher of Drawing)\r\n",
 'THE STORY CONTINUED BY VINCENT GILMORE\r\n\r\n(of Chancery Lane, Solicitor)\r\n',
 'THE STORY CONTINUED BY MARIAN HALCOMBE\r\n\r\n(in Extracts from her Diary)\r\n',
 'THE STORY CONTINUED BY ELIZA MICHELSON\r\n\r\n(Housekeeper at Blackwater Park)\r\n']

In [22]:
# remove header and footer info
novels[author] = novels[author][header:footer]
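# remove section titles; the more specific `chapter` pattern (title plus
# parenthetical subtitle) must run before the general `story` pattern
novels[author] = re.sub(narrative, '', novels[author])
novels[author] = re.sub(chapter, '', novels[author])
novels[author] = re.sub(story, '', novels[author])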

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('(of Chancery Lane, Solicitor)') == -1
assert novels[author].find('THE STORY CONTINUED BY MARIAN HALCOMBE') == -1
assert novels[author].find('THE NARRATIVE OF JANE GOULD') == -1
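
The order of the substitutions in this cell matters: `story` is a prefix of `chapter`, so running `story` first would strip the title line and leave the parenthetical subtitle behind. The short sketch below shows both orders on a sample abridged from the output above; the variable names `specific_first` and `general_first` are introduced only for this illustration.

```python
import re

story = r'THE STORY .*\r\n\r\n'
chapter = r'THE STORY .*\r\n\r\n\(.*\)\r\n'

# Abridged from the start of the Collins text shown earlier.
sample = ("THE STORY BEGUN BY WALTER HARTRIGHT\r\n\r\n"
          "(of Clement's Inn, Teacher of Drawing)\r\n\r\n\r\n"
          "This is the story of what a Woman's patience can endure")

# order used in the notebook: title plus subtitle first, bare titles second
specific_first = re.sub(story, '', re.sub(chapter, '', sample))

# reversed order: the title is removed first, so `chapter` no longer matches
general_first = re.sub(chapter, '', re.sub(story, '', sample))

print(repr(specific_first[:30]))
print(repr(general_first[:30]))
```

With the reversed order the subtitle `(of Clement's Inn, Teacher of Drawing)` stays in the text, which is what the `'(of Chancery Lane, Solicitor)'` assertion in the cell above guards against.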

In [23]:
novels[author][:2000]

"\r\n\r\nThis is the story of what a Woman's patience can endure, and what a\r\nMan's resolution can achieve.\r\n\r\nIf the machinery of the Law could be depended on to fathom every case\r\nof suspicion, and to conduct every process of inquiry, with moderate\r\nassistance only from the lubricating influences of oil of gold, the\r\nevents which fill these pages might have claimed their share of the\r\npublic attention in a Court of Justice.\r\n\r\nBut the Law is still, in certain inevitable cases, the pre-engaged\r\nservant of the long purse; and the story is left to be told, for the\r\nfirst time, in this place.  As the Judge might once have heard it, so\r\nthe Reader shall hear it now.  No circumstance of importance, from the\r\nbeginning to the end of the disclosure, shall be related on hearsay\r\nevidence.  When the writer of these introductory lines (Walter\r\nHartright by name) happens to be more closely connected than others\r\nwith the incidents to be recorded, he will describe 

## Eliot - Middlemarch

In [24]:
author = 'eliot'
novels[author][:2000]

'The Project Gutenberg EBook of Middlemarch, by George Eliot\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n\r\nTitle: Middlemarch\r\n\r\nAuthor: George Eliot\r\n\r\nRelease Date: May 24, 2008 [EBook #145]\r\n[Last updated: March 2, 2015]\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK MIDDLEMARCH ***\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nMiddlemarch\r\n\r\n\r\nBy\r\n\r\nGeorge Eliot\r\n\r\n\r\n\r\n\r\nNew York and Boston\r\n\r\nH. M. Caldwell Company Publishers\r\n\r\n\r\n\r\n\r\nTo my dear Husband, George Henry Lewes, in this nineteenth year of our\r\nblessed union.\r\n\r\n\r\n\r\n\r\nCONTENTS\r\n\r\nBOOK I\r\n\r\nCHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI\r\nCHAPTER VII CHAPTER VIII CHA

In [25]:
header = 5026
novels[author][header:header+2000]

'BOOK I.\r\n\r\nMISS BROOKE.\r\n\r\n\r\n\r\nCHAPTER I.\r\n\r\n    "Since I can do no good because a woman,\r\n     Reach constantly at something that is near it.\r\n          --The Maid\'s Tragedy:  BEAUMONT AND FLETCHER.\r\n\r\n\r\nMiss Brooke had that kind of beauty which seems to be thrown into\r\nrelief by poor dress.  Her hand and wrist were so finely formed that\r\nshe could wear sleeves not less bare of style than those in which the\r\nBlessed Virgin appeared to Italian painters; and her profile as well as\r\nher stature and bearing seemed to gain the more dignity from her plain\r\ngarments, which by the side of provincial fashion gave her the\r\nimpressiveness of a fine quotation from the Bible,--or from one of our\r\nelder poets,--in a paragraph of to-day\'s newspaper.  She was usually\r\nspoken of as being remarkably clever, but with the addition that her\r\nsister Celia had more common-sense. Nevertheless, Celia wore scarcely\r\nmore trimmings; and it was only to close obser

In [26]:
footer = novels[author].find('End of the Project Gutenberg EBook')
novels[author][footer:footer+500]

'End of the Project Gutenberg EBook of Middlemarch, by George Eliot\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK MIDDLEMARCH ***\r\n\r\n***** This file should be named 145.txt or 145.zip *****\r\nThis and all associated files of various formats will be found in:\r\n        http://www.gutenberg.org/1/4/145/\r\n\r\n\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public domain print editions means that no\r\none owns a United States copyright in thes'

In [27]:
book = r'BOOK .*[\.]\r\n\r\n.*\r\n'
re.findall(book, novels[author][header:])

['BOOK I.\r\n\r\nMISS BROOKE.\r\n',
 'BOOK II.\r\n\r\n\r\n',
 'BOOK III.\r\n\r\n\r\n',
 'BOOK IV.\r\n\r\n\r\n',
 'BOOK V.\r\n\r\n\r\n',
 'BOOK VI.\r\n\r\n\r\n',
 'BOOK VII.\r\n\r\n\r\n',
 'BOOK VIII.\r\n\r\n\r\n',
 'BOOK OF TOBIT: Marriage Prayer.\r\n\r\n\r\n']
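
Because the `book` pattern uses `.*` after `BOOK `, it also matches the epigraph attribution `BOOK OF TOBIT: Marriage Prayer.` in the output above. The cleaning cell that follows asserts that this line is gone, so the removal looks intentional here. If only the numbered book headings should be dropped, one option is to restrict the match to Roman numerals. This is a sketch on a made-up sample; `strict_book` is a hypothetical name, and the sample layout is an assumption rather than a quotation from the novel.

```python
import re

# pattern used in the notebook
book = r'BOOK .*[\.]\r\n\r\n.*\r\n'
# hypothetical variant: only headings such as 'BOOK II.' or 'BOOK VIII.'
strict_book = r'BOOK [IVXLCDM]+\.\r\n\r\n.*\r\n'

# Made-up sample with one book heading and one epigraph attribution.
sample = ('BOOK II.\r\n\r\n\r\n'
          'Some narrative.\r\n\r\n'
          '--BOOK OF TOBIT: Marriage Prayer.\r\n\r\n\r\n'
          'More narrative.\r\n')

broad_matches = re.findall(book, sample)
strict_matches = re.findall(strict_book, sample)
print(len(broad_matches), len(strict_matches))
```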

In [28]:
chapter = r'CHAPTER .*[\.]\r\n'
re.findall(chapter, novels[author][header:header+100000])

['CHAPTER I.\r\n',
 'CHAPTER II.\r\n',
 'CHAPTER III.\r\n',
 'CHAPTER IV.\r\n',
 'CHAPTER V.\r\n',
 'CHAPTER VI.\r\n']

In [29]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])
# remove book titles
novels[author] = re.sub(book, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('BOOK OF TOBIT: Marriage Prayer.') == -1
assert novels[author].find('CHAPTER I.') == -1

In [30]:
novels[author][:2000]

'\r\n\r\n\r\n\r\n    "Since I can do no good because a woman,\r\n     Reach constantly at something that is near it.\r\n          --The Maid\'s Tragedy:  BEAUMONT AND FLETCHER.\r\n\r\n\r\nMiss Brooke had that kind of beauty which seems to be thrown into\r\nrelief by poor dress.  Her hand and wrist were so finely formed that\r\nshe could wear sleeves not less bare of style than those in which the\r\nBlessed Virgin appeared to Italian painters; and her profile as well as\r\nher stature and bearing seemed to gain the more dignity from her plain\r\ngarments, which by the side of provincial fashion gave her the\r\nimpressiveness of a fine quotation from the Bible,--or from one of our\r\nelder poets,--in a paragraph of to-day\'s newspaper.  She was usually\r\nspoken of as being remarkably clever, but with the addition that her\r\nsister Celia had more common-sense. Nevertheless, Celia wore scarcely\r\nmore trimmings; and it was only to close observers that her dress\r\ndiffered from her sist

## Gaskell - Wives and Daughters

In [31]:
author = 'gaskell'
novels[author][:2000]

"The Project Gutenberg eBook, Wives and Daughters, by Elizabeth Cleghorn\r\nGaskell, Illustrated by George du Maurier\r\n\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\n\r\n\r\n\r\nTitle: Wives and Daughters\r\n       An Every-Day Story\r\n\r\n\r\nAuthor: Elizabeth Cleghorn Gaskell\r\n\r\n\r\nRelease Date: December 26, 2001  [eBook #4274]\r\nMost recently updated: November 4, 2011\r\n\r\nLanguage: English\r\n\r\n\r\n***START OF THE PROJECT GUTENBERG EBOOK WIVES AND DAUGHTERS***\r\n\r\n\r\nE-text prepared by Charles Aldarondo\r\nand revised by Joseph E. Loewenstein, M.D.\r\n\r\n\r\n\r\nEditorial note:\r\n\r\n      _Wives and Daughters_ was first published serially in the\r\n      _Cornhill Magazine_ from August, 1864, to January, 1866.\r\n      Elizabeth Gaskell

In [32]:
header = 5734
novels[author][header:header+2000]

'CHAPTER I.\r\n\r\nTHE DAWN OF A GALA DAY.\r\n\r\n\r\n[Illustration (untitled)]\r\n\r\nTo begin with the old rigmarole of childhood. In a country there was\r\na shire, and in that shire there was a town, and in that town there\r\nwas a house, and in that house there was a room, and in that room\r\nthere was a bed, and in that bed there lay a little girl; wide awake\r\nand longing to get up, but not daring to do so for fear of the unseen\r\npower in the next room--a certain Betty, whose slumbers must not\r\nbe disturbed until six o\'clock struck, when she wakened of herself\r\n"as sure as clockwork," and left the household very little peace\r\nafterwards. It was a June morning, and early as it was, the room was\r\nfull of sunny warmth and light.\r\n\r\nOn the drawers opposite to the little white dimity bed in which Molly\r\nGibson lay, was a primitive kind of bonnet-stand on which was hung a\r\nbonnet, carefully covered over from any chance of dust with a large\r\ncotton handkerchief, o

In [33]:
footer = novels[author].find('***END OF THE PROJECT GUTENBERG')
novels[author][footer:footer+500]

'***END OF THE PROJECT GUTENBERG EBOOK WIVES AND DAUGHTERS***\r\n\r\n\r\n******* This file should be named 4274-8.txt or 4274-8.zip *******\r\n\r\n\r\nThis and all associated files of various formats will be found in:\r\nhttp://www.gutenberg.org/dirs/4/2/7/4274\r\n\r\n\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public domain print editions means that no\r\none owns a United States copyright in these works, so the Foundation\r\n(and you!) can copy an'

In [34]:
chapter = r'CHAPTER .*[\.]\r\n\r\n.*[\.]\r\n\r\n\r\n\[Illustration .*\]\r\n'
re.findall(chapter, novels[author][header:header+200000])

['CHAPTER I.\r\n\r\nTHE DAWN OF A GALA DAY.\r\n\r\n\r\n[Illustration (untitled)]\r\n',
 "CHAPTER IV.\r\n\r\nMR. GIBSON'S NEIGHBOURS.\r\n\r\n\r\n[Illustration (untitled)]\r\n",
 'CHAPTER VII.\r\n\r\nFORESHADOWS OF LOVE PERILS.\r\n\r\n\r\n[Illustration (untitled)]\r\n']

In [35]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('CHAPTER I.') == -1

In [36]:
novels[author][:2000]

'\r\nTo begin with the old rigmarole of childhood. In a country there was\r\na shire, and in that shire there was a town, and in that town there\r\nwas a house, and in that house there was a room, and in that room\r\nthere was a bed, and in that bed there lay a little girl; wide awake\r\nand longing to get up, but not daring to do so for fear of the unseen\r\npower in the next room--a certain Betty, whose slumbers must not\r\nbe disturbed until six o\'clock struck, when she wakened of herself\r\n"as sure as clockwork," and left the household very little peace\r\nafterwards. It was a June morning, and early as it was, the room was\r\nfull of sunny warmth and light.\r\n\r\nOn the drawers opposite to the little white dimity bed in which Molly\r\nGibson lay, was a primitive kind of bonnet-stand on which was hung a\r\nbonnet, carefully covered over from any chance of dust with a large\r\ncotton handkerchief, of so heavy and serviceable a texture that if\r\nthe thing underneath it had been a

## Hardy - Jude the Obscure

In [37]:
author = 'hardy'
novels[author][:2000]

'The Project Gutenberg eBook, Jude the Obscure, by Thomas Hardy\r\n\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n\r\n\r\n\r\n\r\nTitle: Jude the Obscure\r\n\r\n\r\nAuthor: Thomas Hardy\r\n\r\n\r\n\r\nRelease Date: August, 1994  [eBook #153]\r\n[Most recently updated: September 13, 2005]\r\n\r\nLanguage: English\r\n\r\n\r\n***START OF THE PROJECT GUTENBERG EBOOK JUDE THE OBSCURE***\r\n\r\n\r\nE-text prepared by John Hamm with OmniPage Professional OCR\r\nsoftware donated to Project Gutenberg by Caere Corporation.\r\nE-text revised by Joseph E. Loewenstein, M.D.\r\n\r\n\r\n\r\nJUDE THE OBSCURE\r\n\r\nby\r\n\r\nThomas Hardy\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nCONTENTS\r\n\r\n   PART FIRST\r\n   At Marygreen\r\n\r\n   PART SECOND\r\n   At Christminster\r\n\r\n   PART THIR

In [38]:
header = 1047
novels[author][header:header+2000]

'Part First\r\n\r\nAT MARYGREEN\r\n\r\n\r\n\r\n   "Yea, many there be that have run out of their wits for\r\n    women, and become servants for their sakes.  Many also\r\n    have perished, have erred, and sinned, for women.... O\r\n    ye men, how can it be but women should be strong, seeing\r\n    they do thus?"--ESDRAS.\r\n\r\n\r\nI\r\n\r\n\r\nThe schoolmaster was leaving the village, and everybody seemed sorry.\r\nThe miller at Cresscombe lent him the small white tilted cart and\r\nhorse to carry his goods to the city of his destination, about twenty\r\nmiles off, such a vehicle proving of quite sufficient size for the\r\ndeparting teacher\'s effects.  For the schoolhouse had been partly\r\nfurnished by the managers, and the only cumbersome article possessed\r\nby the master, in addition to the packing-case of books, was a\r\ncottage piano that he had bought at an auction during the year in\r\nwhich he thought of learning instrumental music.  But the enthusiasm\r\nhaving waned he h

In [39]:
footer = novels[author].find('***END OF THE PROJECT GUTENBERG')
novels[author][footer:footer+500]

'***END OF THE PROJECT GUTENBERG EBOOK JUDE THE OBSCURE***\r\n\r\n\r\n******* This file should be named 153-8.txt or 153-8.zip *******\r\n\r\n\r\nThis and all associated files of various formats will be found in:\r\nhttp://www.gutenberg.org/dirs/1/5/153\r\n\r\n\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public domain print editions means that no\r\none owns a United States copyright in these works, so the Foundation\r\n(and you!) can copy and distri'

In [40]:
chapter = r'Part .*\r\n\r\n.*\r\n'
re.findall(chapter, novels[author][header:])

['Part First\r\n\r\nAT MARYGREEN\r\n',
 'Part Second\r\n\r\n\r\n',
 'Part Third\r\n\r\n\r\n',
 'Part Fourth\r\n\r\n\r\n',
 'Part Fifth\r\n\r\n\r\n',
 'Part Sixth\r\n\r\n\r\n']

In [41]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('Part First') == -1
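
Within each part, this file marks chapters with a bare Roman numeral on its own line (the `I` visible in the header preview above), and the `Part` pattern does not touch those lines. If they should be removed as well, a line-anchored pattern is one option. This is a sketch on a made-up sample, not the approach used in this notebook; `numeral_line` is a hypothetical name, and the pattern assumes that no line of narrative consists solely of the letters I, V, X, L, or C.

```python
import re

# A whole line made up only of Roman numeral letters, with its line ending.
# re.MULTILINE lets ^ match at the start of every line, not just the string.
numeral_line = re.compile(r'^[IVXLC]+\r\n', flags=re.MULTILINE)

# Made-up sample in the same layout as the Hardy text.
sample = ('\r\n\r\nI\r\n\r\n\r\nThe schoolmaster was leaving.\r\n'
          '\r\n\r\nII\r\n\r\n\r\nI was there.\r\n')

sample_cleaned = numeral_line.sub('', sample)
print(repr(sample_cleaned))
```

Lines that merely begin with one of those letters, such as `I was there.`, are left alone because the pattern requires the line to end right after the numeral.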

In [42]:
novels[author][:2000]

'\r\n\r\n\r\n   "Yea, many there be that have run out of their wits for\r\n    women, and become servants for their sakes.  Many also\r\n    have perished, have erred, and sinned, for women.... O\r\n    ye men, how can it be but women should be strong, seeing\r\n    they do thus?"--ESDRAS.\r\n\r\n\r\nI\r\n\r\n\r\nThe schoolmaster was leaving the village, and everybody seemed sorry.\r\nThe miller at Cresscombe lent him the small white tilted cart and\r\nhorse to carry his goods to the city of his destination, about twenty\r\nmiles off, such a vehicle proving of quite sufficient size for the\r\ndeparting teacher\'s effects.  For the schoolhouse had been partly\r\nfurnished by the managers, and the only cumbersome article possessed\r\nby the master, in addition to the packing-case of books, was a\r\ncottage piano that he had bought at an auction during the year in\r\nwhich he thought of learning instrumental music.  But the enthusiasm\r\nhaving waned he had never acquired any skill in pla

## James - The Portrait of a Lady

In [43]:
author = 'james'
novels[author][:2000]

'The Project Gutenberg EBook of The Portrait of a Lady, by Henry James\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: The Portrait of a Lady\r\n       Volume 1 (of 2)\r\n\r\nAuthor: Henry James\r\n\r\nPosting Date: December 1, 2008 [EBook #2833]\r\nRelease Date: September, 2001\r\nLast Updated: September 20, 2016\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK THE PORTRAIT OF A LADY ***\r\n\r\n\r\n\r\n\r\nProduced by Eve Sobol\r\n\r\n\r\n\r\n\r\n\r\nTHE PORTRAIT OF A LADY\r\n\r\nVOLUME I\r\n\r\n\r\nBy Henry James\r\n\r\n\r\n\r\n\r\nPREFACE\r\n\r\n“_The Portrait of a Lady_” was, like “_Roderick Hudson_,” begun in Florence,\r\nduring three months spent there in the spring of 1879. Like “Roderic

In [44]:
header = 38135
novels[author][header:header+2000]

'CHAPTER I\r\n\r\nUnder certain circumstances there are few hours in life more agreeable\r\nthan the hour dedicated to the ceremony known as afternoon tea. There\r\nare circumstances in which, whether you partake of the tea or not--some\r\npeople of course never do,--the situation is in itself delightful. Those\r\nthat I have in mind in beginning to unfold this simple history offered\r\nan admirable setting to an innocent pastime. The implements of\r\nthe little feast had been disposed upon the lawn of an old English\r\ncountry-house, in what I should call the perfect middle of a splendid\r\nsummer afternoon. Part of the afternoon had waned, but much of it was\r\nleft, and what was left was of the finest and rarest quality. Real dusk\r\nwould not arrive for many hours; but the flood of summer light had begun\r\nto ebb, the air had grown mellow, the shadows were long upon the smooth,\r\ndense turf. They lengthened slowly, however, and the scene expressed\r\nthat sense of leisure still t

In [45]:
footer = novels[author].find('End of the Project Gutenberg EBook')
novels[author][footer:footer+500]

'End of the Project Gutenberg EBook of The Portrait of a Lady, by Henry James\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK THE PORTRAIT OF A LADY ***\r\n\r\n***** This file should be named 2833-0.txt or 2833-0.zip *****\r\nThis and all associated files of various formats will be found in:\r\n        http://www.gutenberg.org/2/8/3/2833/\r\n\r\nProduced by Eve Sobol\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public domain print editions means '

In [46]:
chapter = r'CHAPTER .*\r\n'
re.findall(chapter, novels[author][header:header+100000])

['CHAPTER I\r\n',
 'CHAPTER II\r\n',
 'CHAPTER III\r\n',
 'CHAPTER IV\r\n',
 'CHAPTER V\r\n',
 'CHAPTER VI\r\n']

In [47]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('CHAPTER I') == -1

In [48]:
novels[author][:2000]

'\r\nUnder certain circumstances there are few hours in life more agreeable\r\nthan the hour dedicated to the ceremony known as afternoon tea. There\r\nare circumstances in which, whether you partake of the tea or not--some\r\npeople of course never do,--the situation is in itself delightful. Those\r\nthat I have in mind in beginning to unfold this simple history offered\r\nan admirable setting to an innocent pastime. The implements of\r\nthe little feast had been disposed upon the lawn of an old English\r\ncountry-house, in what I should call the perfect middle of a splendid\r\nsummer afternoon. Part of the afternoon had waned, but much of it was\r\nleft, and what was left was of the finest and rarest quality. Real dusk\r\nwould not arrive for many hours; but the flood of summer light had begun\r\nto ebb, the air had grown mellow, the shadows were long upon the smooth,\r\ndense turf. They lengthened slowly, however, and the scene expressed\r\nthat sense of leisure still to come which 
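Each novel goes through the same three steps: slice off the header and footer, remove chapter titles, then assert that nothing slipped through. A minimal sketch of those steps as one function follows; `clean_novel` is a hypothetical name, not something defined in this notebook. It also guards against `str.find` returning `-1`, which would otherwise silently slice off only the last character instead of the footer.

```python
import re


def clean_novel(text, header, footer_marker, chapter_pattern):
    """Slice out the body of a Gutenberg text and strip chapter titles."""
    footer = text.find(footer_marker)
    if footer == -1:
        raise ValueError('footer marker not found')
    # remove header and footer info
    body = text[header:footer]
    # remove chapter titles
    body = re.sub(chapter_pattern, '', body)
    # same sanity checks as the per-novel cells
    assert re.findall(chapter_pattern, body) == []
    assert 'Project Gutenberg' not in body
    return body


sample = ('The Project Gutenberg EBook of X\r\n'
          'CHAPTER I\r\n\r\nBody one.\r\n'
          'CHAPTER II\r\n\r\nBody two.\r\n'
          'End of the Project Gutenberg EBook of X')
cleaned = clean_novel(sample,
                      header=sample.find('CHAPTER I'),
                      footer_marker='End of the Project Gutenberg EBook',
                      chapter_pattern=r'CHAPTER .*\r\n')
print(repr(cleaned))
```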

## Lawrence - The Lost Girl

In [49]:
author = 'lawrence'
novels[author][:2000]

"The Project Gutenberg eBook, The Lost Girl, by D. H. Lawrence\r\n\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\n\r\n\r\n\r\nTitle: The Lost Girl\r\n\r\n\r\nAuthor: D. H. Lawrence\r\n\r\n\r\n\r\nRelease Date: December 3, 2007  [eBook #23727]\r\n\r\nLanguage: English\r\n\r\n\r\n***START OF THE PROJECT GUTENBERG EBOOK THE LOST GIRL***\r\n\r\n\r\nE-text prepared by Roger Frank, Roberta Staehlin, and the Project\r\nGutenberg Online Distributed Proofreading Team (http://www.pgdp.net)\r\n\r\n\r\n\r\nTHE LOST GIRL\r\n\r\nby\r\n\r\nD. H. LAWRENCE\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nNew York\r\nThomas Seltzer\r\n1921\r\n\r\nCopyright, 1921,\r\nby Thomas Seltzer, Inc.\r\nAll rights reserved\r\n\r\nFirst Printing, February, 1921\r\nSecond Printing, February, 1921\r\nThird Pri

In [50]:
header = 1869
novels[author][header:header+2000]

'CHAPTER I\r\n\r\nTHE DECLINE OF MANCHESTER HOUSE\r\n\r\n\r\nTake a mining townlet like Woodhouse, with a population of ten\r\nthousand people, and three generations behind it. This space of\r\nthree generations argues a certain well-established society. The old\r\n"County" has fled from the sight of so much disembowelled coal, to\r\nflourish on mineral rights in regions still idyllic. Remains one\r\ngreat and inaccessible magnate, the local coal owner: three\r\ngenerations old, and clambering on the bottom step of the "County,"\r\nkicking off the mass below. Rule him out.\r\n\r\nA well established society in Woodhouse, full of fine shades,\r\nranging from the dark of coal-dust to grit of stone-mason and\r\nsawdust of timber-merchant, through the lustre of lard and butter\r\nand meat, to the perfume of the chemist and the disinfectant of the\r\ndoctor, on to the serene gold-tarnish of bank-managers, cashiers for\r\nthe firm, clergymen and such-like, as far as the automobile\r\nrefulgen

In [51]:
footer = novels[author].find('***END OF THE PROJECT GUTENBERG')
novels[author][footer:footer+500]

'***END OF THE PROJECT GUTENBERG EBOOK THE LOST GIRL***\r\n\r\n\r\n******* This file should be named 23727-8.txt or 23727-8.zip *******\r\n\r\n\r\nThis and all associated files of various formats will be found in:\r\nhttp://www.gutenberg.org/dirs/2/3/7/2/23727\r\n\r\n\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public domain print editions means that no\r\none owns a United States copyright in these works, so the Foundation\r\n(and you!) can copy and'

In [52]:
chapter = r'CHAPTER .*\r\n\r\n.*\r\n'
re.findall(chapter, novels[author][header:header+100000])

['CHAPTER I\r\n\r\nTHE DECLINE OF MANCHESTER HOUSE\r\n',
 'CHAPTER II\r\n\r\nTHE RISE OF ALVINA HOUGHTON\r\n',
 'CHAPTER III\r\n\r\nTHE MATERNITY NURSE\r\n',
 'CHAPTER IV\r\n\r\nTWO WOMEN DIE\r\n']

In [53]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('CHAPTER I') == -1

In [54]:
novels[author][:2000]

'\r\n\r\nTake a mining townlet like Woodhouse, with a population of ten\r\nthousand people, and three generations behind it. This space of\r\nthree generations argues a certain well-established society. The old\r\n"County" has fled from the sight of so much disembowelled coal, to\r\nflourish on mineral rights in regions still idyllic. Remains one\r\ngreat and inaccessible magnate, the local coal owner: three\r\ngenerations old, and clambering on the bottom step of the "County,"\r\nkicking off the mass below. Rule him out.\r\n\r\nA well established society in Woodhouse, full of fine shades,\r\nranging from the dark of coal-dust to grit of stone-mason and\r\nsawdust of timber-merchant, through the lustre of lard and butter\r\nand meat, to the perfume of the chemist and the disinfectant of the\r\ndoctor, on to the serene gold-tarnish of bank-managers, cashiers for\r\nthe firm, clergymen and such-like, as far as the automobile\r\nrefulgence of the general-manager of all the collieries. Her

## Wharton - The Age of Innocence

In [55]:
author = 'wharton'
novels[author][:2000]

'The Project Gutenberg EBook of The Age of Innocence, by Edith Wharton\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n\r\nTitle: The Age of Innocence\r\n\r\nAuthor: Edith Wharton\r\n\r\nPosting Date: August 12, 2008 [EBook #541]\r\nRelease Date: May, 1996\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK THE AGE OF INNOCENCE ***\r\n\r\n\r\n\r\n\r\nProduced by Judith Boss and Charles Keller.  HTML version by Al Haines.\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nThe Age of Innocence\r\n\r\n\r\nby\r\n\r\nEdith Wharton\r\n\r\n\r\nJTABLE 6 18 1\r\n\r\nJTABLE 6 16 19\r\n\r\n\r\nBook I\r\n\r\n\r\n\r\nI.\r\n\r\nOn a January evening of the early seventies, Christine Nilsson was\r\nsinging in Faust at the Academy of Music in New York.\r\n\r\nThough th

In [56]:
header = 735
novels[author][header:header+2000]

'Book I\r\n\r\n\r\n\r\nI.\r\n\r\nOn a January evening of the early seventies, Christine Nilsson was\r\nsinging in Faust at the Academy of Music in New York.\r\n\r\nThough there was already talk of the erection, in remote metropolitan\r\ndistances "above the Forties," of a new Opera House which should\r\ncompete in costliness and splendour with those of the great European\r\ncapitals, the world of fashion was still content to reassemble every\r\nwinter in the shabby red and gold boxes of the sociable old Academy.\r\nConservatives cherished it for being small and inconvenient, and thus\r\nkeeping out the "new people" whom New York was beginning to dread and\r\nyet be drawn to; and the sentimental clung to it for its historic\r\nassociations, and the musical for its excellent acoustics, always so\r\nproblematic a quality in halls built for the hearing of music.\r\n\r\nIt was Madame Nilsson\'s first appearance that winter, and what the\r\ndaily press had already learned to describe as "an 

In [57]:
footer = novels[author].find('End of the Project Gutenberg EBook')
novels[author][footer:footer+500]

'End of the Project Gutenberg EBook of The Age of Innocence, by Edith Wharton\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK THE AGE OF INNOCENCE ***\r\n\r\n***** This file should be named 541.txt or 541.zip *****\r\nThis and all associated files of various formats will be found in:\r\n        http://www.gutenberg.org/5/4/541/\r\n\r\nProduced by Judith Boss and Charles Keller.  HTML version by Al Haines.\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works f'

In [58]:
chapter = r'Book .*\r\n'
re.findall(chapter, novels[author][header:])

['Book I\r\n',
 'Book II\r\n',
 'Book of The Age of Innocence, by Edith Wharton\r\n',
 'Book for nearly any purpose\r\n',
 'Book is for the use of anyone anywhere at no cost and with\r\n',
 'Book or online at www.gutenberg.net\r\n']
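The last four matches are not section titles: `Book .*` also matches the tail of words such as `EBook` and `eBook` in the license text. They are harmless here because the footer is sliced off before the substitution runs, but a tighter pattern makes the intent explicit. The sketch below assumes that section titles sit alone on a line and use Roman numerals. The name `strict` and the sample string are illustrative only.

```python
import re

loose = r'Book .*\r\n'
# Anchor at the start of a line and allow only a Roman numeral after 'Book '.
strict = re.compile(r'^Book [IVXLC]+\r\n', re.MULTILINE)

sample = ('Book I\r\n\r\nI.\r\n\r\nSome text.\r\n'
          'Book II\r\n\r\nXIX.\r\n\r\nMore text.\r\n'
          'End of the Project Gutenberg EBook of The Age of Innocence, '
          'by Edith Wharton\r\n')

loose_matches = re.findall(loose, sample)
strict_matches = strict.findall(sample)
print(loose_matches)
print(strict_matches)
```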

In [59]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('Book I') == -1

In [60]:
novels[author][:2000]

'\r\n\r\n\r\nI.\r\n\r\nOn a January evening of the early seventies, Christine Nilsson was\r\nsinging in Faust at the Academy of Music in New York.\r\n\r\nThough there was already talk of the erection, in remote metropolitan\r\ndistances "above the Forties," of a new Opera House which should\r\ncompete in costliness and splendour with those of the great European\r\ncapitals, the world of fashion was still content to reassemble every\r\nwinter in the shabby red and gold boxes of the sociable old Academy.\r\nConservatives cherished it for being small and inconvenient, and thus\r\nkeeping out the "new people" whom New York was beginning to dread and\r\nyet be drawn to; and the sentimental clung to it for its historic\r\nassociations, and the musical for its excellent acoustics, always so\r\nproblematic a quality in halls built for the hearing of music.\r\n\r\nIt was Madame Nilsson\'s first appearance that winter, and what the\r\ndaily press had already learned to describe as "an exceptiona
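The cleaned text still begins with `I.` because this novel numbers its chapters with bare Roman numerals, which the `Book` pattern does not cover. The notebook leaves them in place. If they should be removed too, one option is sketched below, assuming every chapter heading is a Roman numeral followed by a period on a line of its own. `ROMAN_HEADING` and `drop_roman_headings` are hypothetical names.

```python
import re

# A Roman numeral plus a period, alone on a line (assumed heading format).
ROMAN_HEADING = re.compile(r'^[IVXLC]+\.\r\n', re.MULTILINE)


def drop_roman_headings(text):
    """Remove lines that consist only of a Roman numeral and a period."""
    return ROMAN_HEADING.sub('', text)


sample = ('\r\n\r\n\r\nI.\r\n\r\nOn a January evening of the early seventies.\r\n'
          '\r\nII.\r\n\r\nThe next chapter mentions Charles I.\r\n')
cleaned = drop_roman_headings(sample)
print(repr(cleaned))
```

A numeral that ends a sentence in the middle of a line, as in the last line of the sample, is left alone because the pattern only matches at the start of a line.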

## Woolf - The Voyage Out

In [61]:
author = 'woolf'
novels[author][:2000]

"The Project Gutenberg EBook of The Voyage Out, by Virginia Woolf\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: The Voyage Out\r\n\r\nAuthor: Virginia Woolf\r\n\r\nRelease Date: January 12, 2006 [EBook #144]\r\n[Last updated: July 19, 2011]\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK THE VOYAGE OUT ***\r\n\r\n\r\n\r\n\r\nProduced by Judith Boss and David Widger\r\n\r\n\r\n\r\n\r\n\r\nTHE VOYAGE OUT (1915)\r\n\r\n\r\nby Virginia Woolf (1882-1941)\r\n\r\n\r\n\r\nChapter I\r\n\r\n\r\nAs the streets that lead from the Strand to the Embankment are very\r\nnarrow, it is better not to walk down them arm-in-arm. If you persist,\r\nlawyers' clerks will have to make flying leaps into the mud; youn

In [62]:
header = 694
novels[author][header:header+2000]

"Chapter I\r\n\r\n\r\nAs the streets that lead from the Strand to the Embankment are very\r\nnarrow, it is better not to walk down them arm-in-arm. If you persist,\r\nlawyers' clerks will have to make flying leaps into the mud; young lady\r\ntypists will have to fidget behind you. In the streets of London where\r\nbeauty goes unregarded, eccentricity must pay the penalty, and it is\r\nbetter not to be very tall, to wear a long blue cloak, or to beat the\r\nair with your left hand.\r\n\r\nOne afternoon in the beginning of October when the traffic was becoming\r\nbrisk a tall man strode along the edge of the pavement with a lady on\r\nhis arm. Angry glances struck upon their backs. The small, agitated\r\nfigures--for in comparison with this couple most people looked\r\nsmall--decorated with fountain pens, and burdened with despatch-boxes,\r\nhad appointments to keep, and drew a weekly salary, so that there\r\nwas some reason for the unfriendly stare which was bestowed upon Mr.\r\nAmbrose

In [63]:
footer = novels[author].find('End of the Project Gutenberg EBook')
novels[author][footer:footer+500]

'End of the Project Gutenberg EBook of The Voyage Out, by Virginia Woolf\r\n\r\n*** END OF THIS PROJECT GUTENBERG EBOOK THE VOYAGE OUT ***\r\n\r\n***** This file should be named 144.txt or 144.zip *****\r\nThis and all associated files of various formats will be found in:\r\n        http://www.gutenberg.org/1/4/144/\r\n\r\nProduced by Judith Boss and David Widger\r\n\r\nUpdated editions will replace the previous one--the old editions\r\nwill be renamed.\r\n\r\nCreating the works from public domain print editions means tha'

In [64]:
chapter = r'Chapter .*\r\n'
re.findall(chapter, novels[author][header:header+100000])

['Chapter I\r\n', 'Chapter II\r\n', 'Chapter III\r\n', 'Chapter IV\r\n']

In [65]:
# remove header and footer info
novels[author] = novels[author][header:footer]
# remove chapter titles
novels[author] = re.sub(chapter, '', novels[author])

assert re.findall(chapter, novels[author]) == []
assert re.findall(r'Project Gutenberg', novels[author]) == []
assert novels[author].find('Chapter I') == -1

In [66]:
novels[author][:2000]

"\r\n\r\nAs the streets that lead from the Strand to the Embankment are very\r\nnarrow, it is better not to walk down them arm-in-arm. If you persist,\r\nlawyers' clerks will have to make flying leaps into the mud; young lady\r\ntypists will have to fidget behind you. In the streets of London where\r\nbeauty goes unregarded, eccentricity must pay the penalty, and it is\r\nbetter not to be very tall, to wear a long blue cloak, or to beat the\r\nair with your left hand.\r\n\r\nOne afternoon in the beginning of October when the traffic was becoming\r\nbrisk a tall man strode along the edge of the pavement with a lady on\r\nhis arm. Angry glances struck upon their backs. The small, agitated\r\nfigures--for in comparison with this couple most people looked\r\nsmall--decorated with fountain pens, and burdened with despatch-boxes,\r\nhad appointments to keep, and drew a weekly salary, so that there\r\nwas some reason for the unfriendly stare which was bestowed upon Mr.\r\nAmbrose's height and

## Check data cleaning so far

In [67]:
for author, text in novels.items():
    print('***{}***'.format(author))
    print(text[:100])

***austen***


It is a truth universally acknowledged, that a single man in possession
of a good fortune, must
***bronte***


My godmother lived in a handsome house in the clean and ancient town of
Bretton. Her husband's 
***collins***


This is the story of what a Woman's patience can endure, and what a
Man's resolution can achiev
***eliot***




    "Since I can do no good because a woman,
     Reach constantly at something that is nea
***gaskell***

To begin with the old rigmarole of childhood. In a country there was
a shire, and in that shire t
***hardy***



   "Yea, many there be that have run out of their wits for
    women, and become servants for
***james***

Under certain circumstances there are few hours in life more agreeable
than the hour dedicated to
***lawrence***


Take a mining townlet like Woodhouse, with a population of ten
thousand people, and three gener
***wharton***



I.

On a January evening of the early seventies, Christine Nilsson was
singing in Faust at 
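***woolf***


As the streets that lead from the Strand to the Embankment are very
narrow, it is better not to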

## Uniform data cleaning

In [68]:
shortest = np.inf
for author in novels:
    # replace double hyphens (used as dashes) with a space
    novels[author] = novels[author].replace('--', ' ')
    # remove extra white space
    novels[author] = ' '.join(novels[author].split())
    novel_length = len(novels[author])
    print(novel_length)
    if novel_length < shortest:
        shortest = novel_length

681165
1087183
1344288
1768143
1467388
792791
616587
763711
574886
760716
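To make the effect of the two cleaning steps concrete, here is a small sketch applied to a fragment of the James text shown earlier. The function name `normalize` is my own. `str.split()` with no arguments splits on any run of whitespace, including `\r\n`, so joining with a single space collapses line breaks and repeated spaces in one pass.

```python
def normalize(text):
    """Replace double hyphens with a space, then collapse all whitespace."""
    text = text.replace('--', ' ')
    return ' '.join(text.split())


sample = ('\r\nare circumstances in which, whether you partake of the tea or '
          'not--some\r\npeople of course never do,--the situation')
normalized = normalize(sample)
print(normalized)
```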


In [69]:
for author, text in novels.items():
    print('***{}***'.format(author))
    print(text[:100])

***austen***
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be i
***bronte***
My godmother lived in a handsome house in the clean and ancient town of Bretton. Her husband's famil
***collins***
This is the story of what a Woman's patience can endure, and what a Man's resolution can achieve. If
***eliot***
"Since I can do no good because a woman, Reach constantly at something that is near it. The Maid's T
***gaskell***
To begin with the old rigmarole of childhood. In a country there was a shire, and in that shire ther
***hardy***
"Yea, many there be that have run out of their wits for women, and become servants for their sakes. 
***james***
Under certain circumstances there are few hours in life more agreeable than the hour dedicated to th
***lawrence***
Take a mining townlet like Woodhouse, with a population of ten thousand people, and three generation
***wharton***
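I. On a January evening of the early seventies, Christine Nilsson was singing in Faust at the Academ
***woolf***
As the streets that lead from the Strand to the Embankment are very narrow, it is better not to walk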

In [70]:
for author in tqdm_notebook(novels):
    novels[author] = nlp(novels[author][:shortest])
    print(len(novels[author]))

121225
126742
124453
121284
128619
127166
127338
132042
122668
126064



In [71]:
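# split each novel into 480 documents of 250 tokens (120,000 tokens per novel),
# which fits within the shortest parsed novel above (121,225 tokens)
docs = []
labels = []
for author in novels:
    labels += [author] * 480
    for i in range(480):
        docs.append(novels[author][i*250:(i+1)*250])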
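The numbers 480 and 250 work because 480 × 250 = 120,000 tokens, which is just under the smallest token count printed above (121,225). Slicing past the end of a sequence does not raise an error in Python, so a larger setting would quietly produce short or empty documents. The sketch below adds an explicit check; `fixed_chunks` is a hypothetical helper, shown on a plain list rather than a spaCy `Doc`.

```python
def fixed_chunks(tokens, n_chunks, chunk_len):
    """Split a sequence into n_chunks consecutive slices of chunk_len items."""
    if n_chunks * chunk_len > len(tokens):
        raise ValueError('need {} tokens, only {} available'.format(
            n_chunks * chunk_len, len(tokens)))
    return [tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


chunks = fixed_chunks(list(range(10)), n_chunks=3, chunk_len=3)
print(chunks)
```

Any leftover items at the end (here the final `9`) are dropped, mirroring how the cell above discards the tokens beyond the first 120,000.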

In [72]:
# show the sixth document of each novel alongside its label
for text, label in zip(docs[5::480], labels[::480]):
    print('***{}***'.format(label))
    print(text)

***austen***
“No more have I,” said Mr. Bennet; “and I am glad to find that you do not depend on her serving you.” Mrs. Bennet deigned not to make any reply, but, unable to contain herself, began scolding one of her daughters. “Don't keep coughing so, Kitty, for Heaven's sake! Have a little compassion on my nerves. You tear them to pieces.” “Kitty has no discretion in her coughs,” said her father; “she times them ill.” “I do not cough for my own amusement,” replied Kitty fretfully. “When is your next ball to be, Lizzy?” “To-morrow fortnight.” “Aye, so it is,” cried her mother, “and Mrs. Long does not come back till the day before; so it will be impossible for her to introduce him, for she will not know him herself.” “Then, my dear, you may have the advantage of your friend, and introduce Mr. Bingley to _her_.” “Impossible, Mr. Bennet, impossible, when I am not acquainted with him myself; how can you be so teasing?” “I honour your circumspection. A fortnight's acquaintance is certainly 

knees, and his feet were encased in thick, embroidered slippers. A beautiful collie dog lay upon the grass near his chair, watching the master’s face almost as tenderly as the master took in the still more magisterial physiognomy of the house; and a little bristling, bustling terrier bestowed a desultory attendance upon the other gentlemen. One of these was a remarkably well-made man of five-and-thirty, with a face as English as that of the old gentleman I have just sketched was something else; a noticeably handsome face, fresh-coloured, fair and frank, with firm, straight features, a lively grey eye and the rich adornment of a chestnut beard. This person had a certain fortunate, brilliant exceptional look the air of a happy temperament fertilised by a high civilisation which would have made almost any observer envy him at a venture. He was booted and spurred, as if he had dismounted from a long ride; he wore a white hat, which looked too large for him; he held his two hands behind him

# Continued in next notebook

In [73]:
# save the parsed novels to load in the next notebook
# docs and labels will be recreated there because Spans are views of the
# parent Doc and shouldn't be pickled
with open('data/love-stories-novels.pickle', 'wb') as f:
    pickle.dump(novels, f)
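For orientation, this is roughly how the saved file could be read back and the documents and labels rebuilt. It is only a sketch: `load_docs_and_labels` is a hypothetical name, the defaults of 480 and 250 are copied from this notebook, and the actual loading code lives in Part 2. I assume here that the pickle holds a mapping from author to a sliceable sequence of tokens, and that unpickling real `Doc` objects needs spaCy installed in the loading environment.

```python
import pickle


def load_docs_and_labels(path, n_docs=480, doc_len=250):
    """Load the pickled {author: tokens} mapping and rebuild docs and labels."""
    with open(path, 'rb') as f:
        novels = pickle.load(f)
    docs, labels = [], []
    for author, tokens in novels.items():
        for i in range(n_docs):
            docs.append(tokens[i * doc_len:(i + 1) * doc_len])
            labels.append(author)
    return docs, labels
```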