# Authorship Attribution

From Chapter 5 of Getting Started with Natural Language Processing (2022, Kochmar)

The goal of this notebook is to train and evaluate a machine learning classifier that labels a sentence as written by one of two authors.

# 1. Preliminary Steps

In [11]:
# imports
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg # data for training
nltk.download('punkt') # sentence tokenizer
import random
import sklearn
from sklearn.model_selection import StratifiedShuffleSplit

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\yang0108\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yang0108\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# select data for training

gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [4]:
# shakespeare and austen each have three attributed works, so we will use
# them as our data

# will use 2 works from each author as training and pretest data; third work
# will be reserved for testing data

# create training and testing data sentences for author1 (austen)
author1_train = gutenberg.sents('austen-emma.txt') + \
                gutenberg.sents('austen-persuasion.txt')
author1_test = gutenberg.sents('austen-sense.txt')

# create training and testing data sentences for author2 (shakespeare)
author2_train = gutenberg.sents('shakespeare-caesar.txt') + \
                gutenberg.sents('shakespeare-hamlet.txt')
author2_test = gutenberg.sents('shakespeare-macbeth.txt')

# check length of data sets
print(f"Author 1 training set length: {len(author1_train)} sentences")
print(f"Author 1 testing set length: {len(author1_test)} sentences")
print(f"Author 2 training set length: {len(author2_train)} sentences")
print(f"Author 2 testing set length: {len(author2_test)} sentences")

Author 1 training set length: 11464 sentences
Author 1 testing set length: 4999 sentences
Author 2 training set length: 5269 sentences
Author 2 testing set length: 1907 sentences


# 2. Statistics comparing authors

In [7]:
# define helper function to calculate statistics
def statistics(gutenberg_data):
    
    # iterate over works given
    for work in gutenberg_data:
        
        # number of characters in a work
        num_chars = len(gutenberg.raw(work))
        
        # number of words in a work
        num_words = len(gutenberg.words(work))
        
        # number of sentences in a work
        num_sents = len(gutenberg.sents(work))
        
        # vocabulary size: number of unique lowercased words in a work
        num_vocab = len(set(w.lower() for w in gutenberg.words(work)))
        
        print(
            # average length of words
            round(num_chars/num_words),
            
            # average length of sentences
            round(num_words/num_sents),
            
            # average number of times each word is used in a text by the author
            round(num_words/num_vocab),
            
            # name of work
            work
        )

In [8]:
# make list of works to get statistics for
gutenberg_data = ['austen-emma.txt', 'austen-persuasion.txt', 
                  'austen-sense.txt', 'shakespeare-caesar.txt', 
                  'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt']

In [9]:
# run statistics function on list of works
statistics(gutenberg_data)

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt


Austen's words are slightly longer on average than Shakespeare's: about 5 characters per token versus 4 (the character count comes from the raw text, so it includes whitespace and punctuation). Her sentences are much longer, averaging 25 to 28 words depending on the text, against 12 for Shakespeare. Austen also reuses her vocabulary more: each word appears an average of 17 to 26 times per text, compared with 7 to 9 times for Shakespeare. The greater length of her works likely contributes to this difference.
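
To make the three ratios printed by `statistics` concrete, here is a minimal sketch that computes them on a tiny hand-tokenized sample instead of the Gutenberg corpus. The helper name `text_ratios` and the sample sentences are illustrative assumptions, not part of the original notebook. Unlike `statistics`, this sketch counts characters from the tokens themselves, so whitespace is excluded.

```python
def text_ratios(sentences):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item)."""
    # flatten the tokenized sentences into one list of tokens
    words = [word for sent in sentences for word in sent]
    num_chars = sum(len(word) for word in words)
    num_words = len(words)
    num_sents = len(sentences)
    # vocabulary is case-insensitive, as in the notebook's statistics function
    num_vocab = len({word.lower() for word in words})
    return (num_chars / num_words, num_words / num_sents, num_words / num_vocab)


# hypothetical sample: two short tokenized sentences
sample = [
    ["The", "cat", "sat", "."],
    ["The", "dog", "sat", "too", "."],
]

avg_word_len, avg_sent_len, avg_reuse = text_ratios(sample)
print(avg_word_len, avg_sent_len, avg_reuse)
```

The sample has 9 tokens, 2 sentences, and 6 distinct lowercased tokens, so the average sentence length is 4.5 and each vocabulary item is used 1.5 times on average.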

# 3. Split data into training and pretesting sets

In [18]:
# add all sentences in training set to list, keeping author label
all_sents = [(sent, "austen") for sent in author1_train]
all_sents += [(sent, "shakespeare") for sent in author2_train]

# check number of sentences in all_sents
print(f"Training set length: {len(all_sents)} sentences")

Training set length: 16733 sentences


In [19]:
# collect the author labels in a list, in the same order as all_sents
values = [author for (sent, author) in all_sents]

# instantiate stratified shuffle split
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)

# create empty lists for train and pretest sets
strat_train_set = []
strat_pretest_set = []

# perform split
for train_index, pretest_index in split.split(all_sents, values):
    strat_train_set = [all_sents[index] for index in train_index]
    strat_pretest_set = [all_sents[index] for index in pretest_index]
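
The idea behind the stratified split can be sketched in plain Python: group the items by label, then hold out the same fraction from each group, so both resulting sets keep the original label proportions. This is only an illustration of the principle under simplified assumptions; it is not how scikit-learn implements `StratifiedShuffleSplit`, and the function name `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_size=0.2, seed=42):
    """Split items into (train, held_out), holding out test_size of each label group."""
    rng = random.Random(seed)

    # group item indices by their label
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, held_idx = [], []
    for indices in by_label.values():
        # shuffle within the group, then cut off the held-out share
        rng.shuffle(indices)
        cut = round(len(indices) * test_size)
        held_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])

    return [items[i] for i in train_idx], [items[i] for i in held_idx]


# toy data: 70 "austen" items and 30 "shakespeare" items
toy = [(f"sentence {i}", "austen") for i in range(70)]
toy += [(f"sentence {i}", "shakespeare") for i in range(70, 100)]
toy_labels = [label for (_, label) in toy]

toy_train, toy_pretest = stratified_split(toy, toy_labels)
print(len(toy_train), len(toy_pretest))
```

With a 70/30 label mix and `test_size=0.2`, the held-out set contains 14 "austen" and 6 "shakespeare" items, so both sets keep the 70/30 ratio.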

In [20]:
# set test_set data to same structure as strat_train_set and strat_pretest_set
test_set = [(sent, "austen") for sent in author1_test]
test_set += [(sent, "shakespeare") for sent in author2_test]

In [21]:
# check the proportions of the data in sets: 
# overall (strat_train and strat_pretest), strat_train_set, strat_pretest_set, and test_set

# define helper function to calculate proportions
def cat_proportions(data, cat):
    
    # initialize count of sentences
    count = 0
    
    # iterate over sentences in data
    for item in data:
        
        # if the item's author is cat
        if item[1] == cat:
            
            # increment count
            count += 1
    
    # return proportion
    return float(count) / float(len(data))

# set authors as categories
categories = ["austen", "shakespeare"]

# create rows for table
rows = []
rows.append(["Category", "(strat_train_set and strat_pretest_set)", 
             "strat_train_set", "strat_pretest_set", "test_set"])

# populate list for rows
for cat in categories:
    rows.append([cat, 
                 f"{cat_proportions(all_sents, cat): .6f}",
                 f"{cat_proportions(strat_train_set, cat): .6f}",
                 f"{cat_proportions(strat_pretest_set, cat): .6f}",
                 f"{cat_proportions(test_set, cat): .6f}"])

# make table, adjust column widths, and populate rows
columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width = column_widths[i])
                 for i in range(0, len(row))))

 Category     (strat_train_set and strat_pretest_set)  strat_train_set  strat_pretest_set  test_set  
 austen        0.685113                                 0.685119         0.685091           0.723863 
 shakespeare   0.314887                                 0.314881         0.314909           0.276137 
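
As a cross-check on the table, the proportions can be recomputed directly from the set sizes printed in section 1. The sketch below uses `collections.Counter` as a compact alternative to the counting loop in `cat_proportions`; the helper name `label_proportions` is an assumption introduced here for illustration.

```python
from collections import Counter


def label_proportions(labels):
    """Return a dict mapping each label to its share of the list."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# label lists rebuilt from the sentence counts printed earlier in this notebook
train_labels = ["austen"] * 11464 + ["shakespeare"] * 5269
test_labels = ["austen"] * 4999 + ["shakespeare"] * 1907

train_props = label_proportions(train_labels)
test_props = label_proportions(test_labels)

print({label: round(share, 6) for label, share in train_props.items()})
print({label: round(share, 6) for label, share in test_props.items()})
```

The stratified train and pretest sets share the overall 68.5% / 31.5% split. The test set comes from separate works and was not stratified, so its proportions differ (about 72.4% / 27.6%).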


# 4. Extract words as features

In [22]:
# define function that maps a tokenized sentence to a dictionary of word-presence features
def get_features(text):
    
    # create empty dictionary
    features = {}
    
    # iterate over the tokens in the sentence
    for word in text:
        
        # set a presence flag to "True" in dictionary
        features[word] = True
    
    # return dictionary of features
    return features

# build (feature dictionary, label) pairs for strat_train_set and strat_pretest_set
train_features = [(get_features(sent), label) for (sent, label) in strat_train_set]
pretest_features = [(get_features(sent), label) for (sent, label) in strat_pretest_set]

# check the sizes of both feature sets
print(len(train_features))
print(len(pretest_features))

13386
3347
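
To see what a single feature dictionary looks like, the sketch below applies a compact restatement of `get_features` to one hand-tokenized sentence. The function is repeated only so the example runs on its own, and the sample sentence is an illustrative assumption rather than a line drawn from the corpus.

```python
def get_features(text):
    """Map a tokenized sentence to a {word: True} presence dictionary."""
    features = {}
    for word in text:
        features[word] = True
    return features


# hypothetical tokenized sentence with a repeated word and mixed case
sample_sentence = ["To", "be", ",", "or", "not", "to", "be", "."]
sample_features = get_features(sample_sentence)
print(sample_features)
```

Two properties are worth noting. The repeated token `be` appears only once, because the features record presence rather than counts. `To` and `to` are kept as separate features, because the function does not lowercase its input; punctuation tokens are features too.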


# 5. Naive Bayes

## 5.1 Train classifier