# Synopsis

We import a collection of nineteenth-century novels by a variety of authors.
The directory contains one file for each section (chapter) of each novel. Each filename encodes the following metadata: `author_title_chapter_genre`.

The novels are grouped into three genres: `g` for Gothic, `d` for Detective, and `nh` for unclassified (it is unclear what the abbreviation stands for).
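
To make the naming scheme concrete, here is a minimal sketch of how one filename could be unpacked into its four fields. The helper `parse_filename` and the example path are hypothetical and only illustrate the `author_title_chapter_genre` pattern; the notebook itself does this with a list comprehension further down.

```python
import os

# Hypothetical helper (not part of the notebook): split a chapter filename
# of the form author_title_chapter_genre.txt into its four metadata fields.
def parse_filename(path):
    stem = os.path.splitext(os.path.basename(path))[0]
    author, book, chapter, genre = stem.split('_')
    return {'genre': genre, 'author': author, 'book': book, 'chapter': int(chapter)}

# Made-up path that follows the naming scheme described above.
example = parse_filename('corpus/doyle_scarlet_5_d.txt')
print(example)
```

Using `os.path.basename` rather than splitting on `/` keeps the sketch independent of the path separator.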

We can use hierarchical cluster analysis (HCA) and principal component analysis (PCA) to work out which genre the unclassified sections of text belong to. We can also explore the relationships among the novels to see whether they fit neatly into their categories.

We write a corpus importer as we have done before, but this time we need to add levels to our OHCO for genre, author, and title.

Note that we are going through the "ritual" of importing the content into our standard format as if we needed all the information we are collecting, e.g. POS. For today's exercise, however, we are not going to need this information.

# Configuration

In [1]:
source_dir = 'vierthaler-stylometry/corpus'
para_pat = r'\n\n+'
token_pat = r'([\W_]+)'
db_file = '../../data/novels.db'
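
As a quick check of what the two patterns above do, the following sketch applies them to made-up strings with the standard `re` module. The sample text is an assumption for illustration only.

```python
import re

para_pat = r'\n\n+'      # one or more blank lines separate paragraphs
token_pat = r'([\W_]+)'  # runs of non-word characters and underscores

# Made-up sample: a single newline stays inside a paragraph,
# while two or more newlines in a row start a new one.
sample = "First paragraph,\nstill the first.\n\n\nSecond paragraph."
paragraphs = re.split(para_pat, sample)
print(paragraphs)

# Removing every token_pat match from a lowercased token leaves a bare term.
term = re.sub(token_pat, '', "_don't!_".lower())
print(term)
```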

In [4]:
extra_stopwords = """
us rest went least would much must long one like much say well without though yet might still upon
done every rather particular made many previous always never thy thou go first oh thee ere ye came
almost could may sometimes seem called among another also however nevertheless even way one two three
ever put
""".strip().split()

In [5]:
OHCO = ['genre', 'author', 'book', 'chapter', 'para_num', 'sent_num', 'token_num']
GENRE = OHCO[:1]
AUTHS = OHCO[:2]
BOOKS = OHCO[:3]
CHAPS = OHCO[:4]
PARAS = OHCO[:5]
SENTS = OHCO[:6]
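
Each of these slices is a prefix of the OHCO list, so it can serve as a grouping key at that level. A minimal sketch with made-up rows, using plain tuples in place of a pandas index:

```python
OHCO = ['genre', 'author', 'book', 'chapter', 'para_num', 'sent_num', 'token_num']
BOOKS = OHCO[:3]

# Made-up token rows keyed by the full OHCO tuple (illustration only).
rows = [
    (('d', 'doyle', 'scarlet', 5, 0, 0, 0), 'our'),
    (('d', 'doyle', 'scarlet', 5, 0, 0, 1), 'advertisement'),
    (('g', 'shelley', 'frankenstein', 37, 0, 0, 0), 'it'),
]

# Count tokens per book by truncating each key to the first len(BOOKS) levels.
counts = {}
for key, _token in rows:
    book_key = key[:len(BOOKS)]
    counts[book_key] = counts.get(book_key, 0) + 1

print(counts)
```

With a pandas DataFrame indexed by the full OHCO, the analogous step would be a `groupby` on one of these lists.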

# Libraries

In [6]:
# import re
# import os
import glob
import sqlite3

import pandas as pd
import numpy as np

import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('tagsets')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/rca2t/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rca2t/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /Users/rca2t/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package tagsets to /Users/rca2t/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/rca2t/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Pragmas

In [7]:
%matplotlib inline

# Process

We pause to look at the revised form of our text import function. The parsing function has been replaced with NLTK, which has improved the results of POS tagging. However, this has required some added string manipulation to produce better tokens.

## Import files

In [7]:
files = glob.glob("{}/*.txt".format(source_dir))

In [8]:
codes = [f.replace('.txt','').split('/')[-1].split('_') for f in files]

In [9]:
T = pd.DataFrame(codes, columns = ['author','book','chapter', 'genre'])
T = T[CHAPS]
T.chapter = T.chapter.astype('int')

In [10]:
T.head()

Unnamed: 0,genre,author,book,chapter
0,d,christie,secretadversary,23
1,g,austen,northangerabbey,30
2,d,doyle,scarlet,5
3,g,shelley,frankenstein,37
4,d,collins,moonstone,86


In [11]:
T['text'] = [open(f, 'r', encoding='utf-8').read() for f in files]

In [12]:
T.head()

Unnamed: 0,genre,author,book,chapter,text
0,d,christie,secretadversary,23,. A RACE AGAINST TIME\n\nAFTER ringing up Sir ...
1,g,austen,northangerabbey,30,Catherine's disposition was not naturally sede...
2,d,doyle,scarlet,5,. OUR ADVERTISEMENT BRINGS A VISITOR.\n\n\nOUR...
3,g,shelley,frankenstein,37,It was on a dreary night of November that I be...
4,d,collins,moonstone,86,"Late that evening, I was surprised at my lodgi..."


## Set OHCO Index

In [13]:
try:
    T = T.set_index(CHAPS)
    T = T.sort_index()
except KeyError:
    pass

## Create stopwords list

In [14]:
sw = nltk.corpus.stopwords.words('english') + extra_stopwords

In [15]:
sw[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

## Fix some characters to improve tokenization

In [16]:
# Pad dashes and hyphens with spaces; regex=True is required in pandas >= 2.0
T.text = T.text.str.replace(r"(—|-)", r' \g<1> ', regex=True)
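
The effect of this substitution on a single string can be sketched with `re.sub`. The sample text is made up.

```python
import re

# Made-up sample containing both a hyphen and an em dash.
text = "well-known—or so he said"

# Surround each dash or hyphen with spaces so that a later
# whitespace split yields it as a token of its own.
padded = re.sub(r"(—|-)", r" \g<1> ", text)
tokens = padded.split()
print(padded)
print(tokens)
```

Note that this also breaks hyphenated compounds such as "well-known" into separate tokens.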

## Chapters to Paragraphs

In [26]:
paras = T.text.str.split(para_pat, expand=True)\
    .stack()\
    .to_frame()\
    .rename(columns={0:'para_str'})
paras.index.names = PARAS
paras.para_str = paras.para_str.str.strip()
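# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
paras.para_str = paras.para_str.str.replace(r'\n', ' ', regex=True)
paras.para_str = paras.para_str.str.replace(r'\s+', ' ', regex=True)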
paras = paras[~paras.para_str.str.match(r'^\s*$')]
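
For a single chapter, the same steps can be sketched without pandas. The chapter string is made up, and the sketch assumes, as the pandas version appears to do, that empty paragraphs are dropped after numbering, which leaves gaps in the paragraph numbers.

```python
import re

para_pat = r'\n\n+'

# Made-up chapter text with a wrapped line and a whitespace-only paragraph.
chapter = ". A TITLE\n\nFirst line\nwrapped   here.\n\n\n   \n\nLast paragraph.\n"

paras = []
for para_num, raw in enumerate(re.split(para_pat, chapter)):
    # Trim, then collapse newlines and runs of whitespace to single spaces.
    para = re.sub(r'\s+', ' ', raw.strip())
    if para:  # drop paragraphs that are empty after cleaning
        paras.append((para_num, para))

print(paras)
```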

## Paragraphs to Sentences

In [27]:
sents = paras.para_str\
    .apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
    .stack()\
    .to_frame()\
    .rename(columns={0:'sent_str'})
sents.index.names = SENTS
del(paras)

## Sentences to Tokens with POS tagging

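We create our own tokenizer, which splits on whitespace only, so punctuation stays attached to the tokens instead of being split off as it would be by `nltk.word_tokenize`.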

In [28]:
tokenizer = RegexpTokenizer(r'\s+', gaps=True)

In [29]:
# tokens = sents.sent_str\
#     .apply(lambda x: pd.Series(nltk.pos_tag(nltk.word_tokenize(x))))\
tokens = sents.sent_str\
    .apply(lambda x: pd.Series(nltk.pos_tag(tokenizer.tokenize(x))))\
    .stack()\
    .to_frame()\
    .rename(columns={0:'pos_tuple'})
tokens.index.names = OHCO
tokens['pos'] = tokens.pos_tuple.apply(lambda x: x[1])
tokens['token_str'] = tokens.pos_tuple.apply(lambda x: x[0])
tokens = tokens.drop(columns='pos_tuple')
del(sents)

## Tag punctuation and numbers

In [30]:
tokens['punc'] = tokens.token_str.str.match(r'^[\W_]*$').astype('int')
tokens['num'] = tokens.token_str.str.match(r'^.*\d.*$').astype('int')
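
The two flags can be checked on a made-up sentence with plain `re`. Here `str.split()` stands in for the whitespace tokenizer, which is an approximation for illustration rather than the notebook's own code path.

```python
import re

# Made-up sentence; str.split() approximates a whitespace-only tokenizer.
sentence = "He paid 20 pounds - in 1887 !"
token_strs = sentence.split()

tagged = [
    {
        'token_str': t,
        # 1 if the token consists only of non-word characters or underscores
        'punc': int(re.match(r'^[\W_]*$', t) is not None),
        # 1 if the token contains at least one digit
        'num': int(re.match(r'^.*\d.*$', t) is not None),
    }
    for t in token_strs
]

punc_tokens = [row['token_str'] for row in tagged if row['punc']]
num_tokens = [row['token_str'] for row in tagged if row['num']]
print(punc_tokens, num_tokens)
```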

## Extract vocab with minimal normalization

In [31]:
WORDS = (tokens.punc == 0) & (tokens.num == 0)
tokens.loc[WORDS, 'term_str'] = tokens.token_str.str.lower()\
    .str.replace(token_pat, '', regex=True)
#     .str.replace(r'["_*.\']', '')
# Name the index and the count column explicitly so this works across pandas versions
vocab = tokens[tokens.punc == 0].term_str.value_counts()\
    .rename_axis('term_str')\
    .reset_index(name='n')
vocab = vocab.sort_values('term_str').reset_index(drop=True)
vocab.index.name = 'term_id'

## Get priors for Vocab

In [32]:
vocab['p'] = vocab.n / vocab.n.sum()
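
The counting and the prior can be sketched on a handful of made-up tokens with `collections.Counter`. The prior `p` of a term is simply its count divided by the total count over all terms.

```python
import re
from collections import Counter

token_pat = r'([\W_]+)'

# Made-up tokens, including one pure-punctuation token.
token_strs = ['The', 'cat,', 'the', 'Dog.', 'the', '-']

# Skip pure punctuation, then lowercase and strip non-word characters.
terms = [
    re.sub(token_pat, '', t.lower())
    for t in token_strs
    if not re.match(r'^[\W_]*$', t)
]

counts = Counter(terms)
total = sum(counts.values())

# Alphabetically sorted vocabulary with raw count n and prior p = n / total.
vocab = {term: {'n': n, 'p': n / total} for term, n in sorted(counts.items())}
print(vocab)
```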

## Add stems

In [33]:
stemmer = nltk.stem.porter.PorterStemmer()
vocab['port_stem'] = vocab.term_str.apply(lambda x: stemmer.stem(x))

## Define stopwords

In [34]:
stopwords = set(nltk.corpus.stopwords.words('english') + extra_stopwords)

In [35]:
sw = pd.DataFrame({'x':1}, index=stopwords)
vocab['stop'] = vocab.term_str.map(sw.x).fillna(0).astype('int')
del(sw)

## Add term_ids to Tokens 

In [36]:
tokens['term_id'] = tokens['term_str'].map(vocab.reset_index()\
    .set_index('term_str').term_id).fillna(-1).astype('int')
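
The lookup amounts to a dictionary from term string to term id, with -1 for tokens that have no term (punctuation, numbers, or anything missing from the vocabulary). A small sketch with made-up values:

```python
# Made-up sorted vocabulary; the position in the list plays the role of term_id.
vocab_terms = ['cat', 'dog', 'the']
term_to_id = {term: i for i, term in enumerate(vocab_terms)}

# None stands in for a token with no term_str, e.g. punctuation.
token_terms = ['the', None, 'cat', 'bird']

# Tokens without a matching vocabulary entry get the sentinel id -1.
term_ids = [term_to_id.get(t, -1) for t in token_terms]
print(term_ids)
```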

# Save

In [37]:
with sqlite3.connect(db_file) as db:
    T.to_sql('doc', db, if_exists='replace', index=True)
    tokens.to_sql('token', db, if_exists='replace', index=True)
    vocab.to_sql('vocab', db, if_exists='replace', index=True)
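
Once saved, the tables can be queried with ordinary SQL. The sketch below uses an in-memory database and a cut-down, made-up `token` table as a stand-in for `novels.db`; the real table has more columns, and its exact schema depends on how pandas writes the index.

```python
import sqlite3

# In-memory stand-in for novels.db with a simplified, made-up 'token' table.
db = sqlite3.connect(':memory:')
db.execute(
    "CREATE TABLE token (genre TEXT, author TEXT, book TEXT, chapter INTEGER, term_str TEXT)"
)
db.executemany(
    "INSERT INTO token VALUES (?, ?, ?, ?, ?)",
    [
        ('d', 'doyle', 'scarlet', 5, 'our'),
        ('d', 'doyle', 'scarlet', 5, 'advertisement'),
        ('g', 'shelley', 'frankenstein', 37, 'it'),
    ],
)

# Count tokens per genre.
rows = db.execute(
    "SELECT genre, COUNT(*) FROM token GROUP BY genre ORDER BY genre"
).fetchall()
db.close()
print(rows)
```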

In [29]:
# END