Project Gutenberg is a large electronic collection of over 54,000 public domain books. We will work with a subset of 3,036 books, available for download from [Shibamouli Lahiri](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html). This notebook prepares the data for analysis.

In [1]:
import os
import re
import pandas as pd
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
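# Strip the first four lines of every file in place.
# Caution: this overwrites the files and is not idempotent, so run it only once.
filenames = [name for name in os.listdir('txt') if name.endswith('.txt')]
for name in filenames:
    with open('txt/'+name, 'r+', encoding='iso-8859-1') as f:
        # Skip the first four lines, then keep everything after them
        for i in range(4):
            f.readline()
        data = f.readlines()
        # Rewind, overwrite with the remaining lines, and cut off the leftover tail
        f.seek(0)
        f.writelines(data)
        f.truncate()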
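
Because the cell above rewrites each file in place, running it a second time would remove four more lines. A minimal sketch of a non-destructive variant follows, which writes the trimmed text to a separate file instead. The function name `strip_header` and the temporary paths are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def strip_header(src_path, dst_path, n_header=4, encoding='iso-8859-1'):
    """Copy src_path to dst_path without its first n_header lines."""
    with open(src_path, 'r', encoding=encoding) as src:
        lines = src.readlines()
    body = lines[n_header:]
    with open(dst_path, 'w', encoding=encoding) as dst:
        dst.writelines(body)
    return len(body)

# Demonstration on a throwaway file rather than the real 'txt' directory
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, 'raw.txt')
    clean = os.path.join(tmp, 'clean.txt')
    with open(raw, 'w', encoding='iso-8859-1') as f:
        f.write('h1\nh2\nh3\nh4\nbody line 1\nbody line 2\n')
    kept = strip_header(raw, clean)
    with open(clean, encoding='iso-8859-1') as f:
        cleaned_text = f.read()
```

Since the source file is left untouched, the function can be re-run safely.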

In [2]:
filenames = [name for name in os.listdir('txt') if name.endswith('.txt')]
# File names have the form '<author>___<title>.txt'; keep the part before the separator
authorlist = [name.split('___', 1)[0] for name in filenames]
titledf = pd.DataFrame({'Author': authorlist, 'Name': filenames})
titledf.head()

| | Author | Name |
|---|---|---|
| 0 | Abraham Lincoln | Abraham Lincoln___Lincoln Letters.txt |
| 1 | Abraham Lincoln | Abraham Lincoln___Lincoln's First Inaugural Ad... |
| 2 | Abraham Lincoln | Abraham Lincoln___Lincoln's Gettysburg Address... |
| 3 | Abraham Lincoln | Abraham Lincoln___Lincoln's Inaugurals, Addres... |
| 4 | Abraham Lincoln | Abraham Lincoln___Lincoln's Second Inaugural A... |


Next, I remove each author's name from the text of their files, since I believe leaving it in may bias the classifier later by leaking the label.

In [9]:
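for index, row in titledf.iterrows():
    name = row['Name']
    author = row['Author']
    # Escape the name so that characters such as '.' are matched literally
    author_caseless = re.compile(re.escape(author), re.IGNORECASE)
    with open('txt/'+name, 'r+', encoding='iso-8859-1') as f:
        text = f.read()
        text = author_caseless.sub('', text)
        # Overwrite the file with the cleaned text and drop any leftover tail
        f.seek(0)
        f.write(text)
        f.truncate()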
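
To illustrate why the name should be treated as a literal string rather than a regular expression, here is a small sketch. The helper `remove_author` and the sample author `H. G. Wells` are hypothetical; whether any name in this dataset actually contains regex metacharacters is not checked here.

```python
import re

def remove_author(text, author):
    """Delete every case-insensitive, literal occurrence of author from text."""
    pattern = re.compile(re.escape(author), re.IGNORECASE)
    return pattern.sub('', text)

sample = 'By H. G. WELLS, not Hx Gy Wells.'

# Literal match: only the real name is removed
escaped_result = remove_author(sample, 'H. G. Wells')

# Unescaped pattern: each '.' matches any character, so 'Hx Gy Wells' is removed too
unescaped_result = re.sub('H. G. Wells', '', sample, flags=re.IGNORECASE)
```

With the escaped pattern only the exact name disappears, while the unescaped one also deletes the look-alike string.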

For the classifier to have predictive power, each class needs enough data. We will keep only the authors who have at least 10 texts.

In [3]:
# Map each row to its author's text count and keep rows whose author has at least 10 texts
titledf = titledf[titledf.Author.map(titledf.Author.value_counts()) >= 10]
titledf.to_csv('author_title.csv')
titledf.describe()

| | Author | Name |
|---|---|---|
| count | 2798 | 2798 |
| unique | 83 | 2798 |
| top | William Wymark Jacobs | Grant Allen___The Great Taboo.txt |
| freq | 97 | 1 |
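
The threshold filter can be checked on a toy frame. The sketch below assumes made-up authors `A`, `B`, `C` and a lower threshold of 2 so that the effect is visible with a handful of rows; the names `toy` and `min_texts` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Author': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Name': ['A___1.txt', 'A___2.txt', 'A___3.txt',
             'B___1.txt', 'C___1.txt', 'C___2.txt'],
})
min_texts = 2  # the notebook uses 10; 2 keeps the toy example small

# Number of texts per author, aligned row by row with the frame
per_row_count = toy['Author'].map(toy['Author'].value_counts())
filtered = toy[per_row_count >= min_texts]
kept_authors = sorted(filtered['Author'].unique())
```

Author `B` has a single text and is dropped, while every row for `A` and `C` is kept.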


We now vectorize the texts with TF-IDF and save both the resulting feature matrix and the vocabulary to disk, to be used when we analyze the texts.

In [4]:
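# Ignore terms found in fewer than 3 documents or in more than 95% of them
vectorizer = TfidfVectorizer(input='filename', min_df=3, max_df=.95, encoding='iso-8859-1')
# Despite its name, `tokens` is the sparse document-term TF-IDF matrix
tokens = vectorizer.fit_transform(['txt/'+fname for fname in titledf.Name])
joblib.dump(tokens, 'features.pkl')
# get_feature_names() was removed in scikit-learn 1.2; older versions can still call it instead
joblib.dump(vectorizer.get_feature_names_out().tolist(), 'feature_names.pkl')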

['feature_names.pkl']
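
To make the effect of `min_df=3` and `max_df=.95` concrete, the following is a plain-Python sketch of the document-frequency filter on four toy sentences. It is not scikit-learn's implementation: the function `surviving_terms` and the sample documents are assumptions for illustration, and only the token pattern mirrors the default documented for `TfidfVectorizer`.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in scikit-learn's default token_pattern
TOKEN = re.compile(r'\b\w\w+\b')

def surviving_terms(docs, min_df=3, max_df=0.95):
    """Return the terms whose document frequency passes both thresholds."""
    df = Counter()
    for doc in docs:
        # Count each term once per document
        df.update(set(TOKEN.findall(doc.lower())))
    max_count = max_df * len(docs)  # a float max_df is a proportion of documents
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)

docs = [
    'the whale hunts the sea',
    'the sea is calm',
    'the whale sleeps',
    'the whale sings to the sea',
]
vocabulary = surviving_terms(docs)
```

Here `the` occurs in all four documents and exceeds the 95% ceiling, the rare words fall below the minimum of three documents, and only `sea` and `whale` remain.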