# case_9. cognition

Course materials for Computational Humanities: Critical Approaches to Literature as Data (David Bamman and Tom McEnaney, Fall 2020).

This material is based upon work supported by the National Science Foundation for the projects "CAREER: Using Fiction to Improve Real-World Information Systems" (IIS-1942591) and "SubjectiveKB: Building subjective knowledge bases by modeling viewpoints" (IIS-1813470).

Notebook is adapted from: https://github.com/dbamman/comphumF20/tree/master

This notebook explores the degree to which different characters have measurably different *registers* by training a multiclass classifier on character dialogue to predict the speaker. It works with the output of [BookNLP](https://github.com/dbamman/book-nlp), which recognizes quotations and carries out speaker attribution on them.

In [2]:
import operator
from collections import Counter
import math
from os import path
import json
import random
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from scipy import sparse
import nltk
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import accuracy_score
import gzip

np.random.seed(1)

In [4]:
# !pip install textstat


textstat is a Python library that computes simple statistics over text, such as readability scores, word and sentence counts, and syllable counts.

In [5]:
import textstat

In [7]:
# helper function: count the number of quotes in the book by different characters
# and print the top-n characters which have the most quotes.
def print_top_speakers(filename, top_n=10):

    with gzip.open(filename) as file:
        data=json.load(file)

    # map (character id, all known names) -> number of attributed quotes
    counts={}
    for character in data["characters"]:
        char_id=character["id"]
        names='; '.join([x["n"] for x in character["names"]])
        counts[(char_id, names)]=len(character["speaking"])

    sorted_x = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    print("#quotes\tchar_id\tname")
    for (charid,name),v in sorted_x[:top_n]:
        print("%s\t%s\t%s" % (v,charid, name))


Let's first just examine the characters who have the most dialogue.

In [10]:
print_top_speakers("data/harry_potter.book.gz")

#quotes	char_id	name
8014	343	Harry; Potter; Harry Potter; James; James Potter; POTTER; HARRY; HARRY POTTER; Harry James Potter; JAMES
3270	247	Ron; Weasley; Ron Weasley; WEASLEY; RON
2998	302	Hermione; Miss Granger; Hermione Granger; Granger; Miss Hermione Granger
1307	352	Dumbledore; Albus; Albus Dumbledore; DUMBLEDORE; ALBUS DUMBLEDORE
726	298	Hagrid; Rubeus Hagrid; Rubeus
723	561	Fred; George; Fred -- George; GEORGE
647	389	Snape; Severus; Severus Snape; SEVERUS SNAPE; SNAPE
442	563	Mrs. Weasley
414	200	Mr. Weasley; Ronald; Ronald Weasley; Mr. Ronald Weasley
412	489	Sirius; Black; Sirius Black; SIRIUS


In [11]:
print_top_speakers("data/lotr.book.gz")

#quotes	char_id	name
1066	216	Frodo; Mr. Frodo; Mr. Baggins; Frodo Baggins; Baggins; Mr. Frodo Baggins; Mr. FRODO
686	106	Sam; Sam Gamgee; Gamgee
675	49	Gandalf; Gandalf Greyhame
240	317	Gimli; Gloin; Gimli Gloin; Gimli son of Gloin
230	259	Legolas; Legolas Greenleaf
183	2	Merry; Merry Brandybuck; Brandybuck
174	215	Faramir; Lord Faramir; lord Faramir
146	54	Strider; Mr. Strider
121	281	Theoden; Theoden King; King
117	274	Pippin; Mr. Pippin


The function get_quote_data, defined in the next cell, iterates over the characters in the data dictionary and keeps only those whose character ID (char_id) appears in the targets dictionary, which maps IDs to the names used as class labels. For each kept character, it lowercases every quote, tokenizes it with nltk.word_tokenize, and appends the resulting token list to the quotes list. It then shuffles the quotes, asserts that the character has at least max_num of them, and extends X with the first max_num shuffled quotes and Y with the character's name repeated max_num times. Every target character therefore contributes exactly max_num quotes.

In [14]:
# helper function: retrieve max_num quotes and construct a quote-character pair.
def get_quote_data(data, targets, max_num):
            
    X=[]
    Y=[]
    
    for character in data["characters"]:
        char_id=character["id"]
        if char_id in targets:
            name=targets[char_id]
            quotes=[]
            for q in character["speaking"]:
                quote=q["w"].lower()
                tokens=nltk.word_tokenize(quote)
                quotes.append(tokens)
            
            random.shuffle(quotes)
            
            assert len(quotes) >= max_num
                
            X.extend(quotes[:max_num])
            Y.extend([name]*max_num)

    return X, Y
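
To make the expected input concrete, here is a minimal sketch of the sampling step on a toy dictionary. The field names (`characters`, `id`, `speaking`, `w`) mirror the ones used above. The toy quotes, the ids, and the helper name `sample_quotes` are invented for illustration, and `str.split` stands in for `nltk.word_tokenize` so the example has no dependencies.

```python
import random

# Toy stand-in for the BookNLP JSON; the quotes and ids are invented for illustration.
toy_data = {
    "characters": [
        {"id": 1, "speaking": [{"w": "Hello there"}, {"w": "Good morning"}, {"w": "Farewell"}]},
        {"id": 2, "speaking": [{"w": "Who goes there"}, {"w": "Stand back"}]},
        {"id": 3, "speaking": [{"w": "Ignored speaker"}]},
    ]
}

def sample_quotes(data, targets, max_num, seed=1):
    """Hypothetical helper: same idea as get_quote_data, with a whitespace tokenizer."""
    rng = random.Random(seed)
    X, Y = [], []
    for character in data["characters"]:
        if character["id"] not in targets:
            continue  # skip characters we are not classifying
        quotes = [q["w"].lower().split() for q in character["speaking"]]
        rng.shuffle(quotes)
        assert len(quotes) >= max_num, "not enough quotes for this character"
        X.extend(quotes[:max_num])
        Y.extend([targets[character["id"]]] * max_num)
    return X, Y

X, Y = sample_quotes(toy_data, {1: "A", 2: "B"}, max_num=2)
print(Y)        # every target contributes exactly max_num labels
print(len(X))   # number of sampled quotes
```

Because each target contributes the same number of quotes, the classes are balanced by construction.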

In [15]:
# helper function: call get_quote_data 
def read_data(filename, targets, max_num):

    with gzip.open(filename) as file:
        data=json.load(file)
        X, Y=get_quote_data(data, targets, max_num)
        return X, Y

The next cells build features from a list of feature functions, which are defined further below. Feel free to skip the helper functions: they set up the machine learning pipeline, which is out of scope for this class.

In [16]:
def build_features(dataX, feature_functions):
    
    """ This function featurizes the data according to the list of parameter feature_functions """
    
    data=[]
    for tokens in dataX:
        feats={}
        
        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data
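
As a small illustration of how the dictionaries returned by several feature functions are merged, the sketch below applies the same `update` logic to a single utterance. The two feature functions and the helper `featurize` are hypothetical and exist only for this example.

```python
def first_token_feature(tokens):
    # Hypothetical feature function: marks the first token of the utterance.
    return {"FIRST_%s" % tokens[0]: 1} if tokens else {}

def question_feature(tokens):
    # Hypothetical feature function: fires when the utterance contains a question mark.
    return {"HAS_QUESTION": 1} if "?" in tokens else {}

def featurize(tokens, feature_functions):
    # Same merging logic as build_features, applied to a single utterance.
    feats = {}
    for function in feature_functions:
        feats.update(function(tokens))
    return feats

feats = featurize(["where", "is", "he", "?"], [first_token_feature, question_feature])
print(feats)
```

Since `dict.update` overwrites existing keys, two feature functions that emit the same key would clobber each other. Giving each function its own prefix (as with `UNIGRAM_`, `PREP_`, and `PUNCT_` below) avoids that.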

In [17]:
def features_to_ids(data, feature_vocab):
    
    """ 
    
    This helper function converts a list of feature dictionaries (one per document) to a
    sparse matrix that we can fit in a scikit-learn model.  This is important because almost
    all feature values will be 0 for most documents (note: why?), and we don't want to store
    them all in memory.

    """
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data

In [18]:
def create_vocab(data, top_n=None):
    
    """ 
    
    This helper function maps each feature name to a unique numerical id (its column index).
    top_n limits the features to only the n most frequent features observed in the training data
    (in terms of the number of documents that contain them).
    
    """
    
    counts=Counter()
    for doc in data:
        for feat in doc:
            counts[feat]+=1

    feature_vocab={}

    for idx, (k, v) in enumerate(counts.most_common(top_n)):
        feature_vocab[k]=idx
                
    return feature_vocab
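
The two helpers above can be traced by hand on a tiny example. The sketch below uses three invented utterances, counts how many utterances contain each feature, keeps the two most frequent features, and lays out the resulting rows. It uses plain lists for display only, whereas the notebook stores the same values in a scipy sparse matrix.

```python
from collections import Counter

# Three toy featurized utterances (feature name -> value); invented for illustration.
docs = [
    {"UNIGRAM_the": 1, "UNIGRAM_ring": 1},
    {"UNIGRAM_the": 1, "UNIGRAM_shire": 1},
    {"UNIGRAM_the": 1, "UNIGRAM_ring": 1, "UNIGRAM_precious": 1},
]

# Document frequency: each feature counts once per utterance that contains it.
doc_freq = Counter(feat for doc in docs for feat in doc)

# Keep the 2 most frequent features and give each a column index.
vocab = {feat: idx for idx, (feat, _) in enumerate(doc_freq.most_common(2))}

# Dense rows for display only; features outside the vocabulary are dropped.
rows = [[doc.get(feat, 0) for feat in vocab] for doc in docs]

print(vocab)
print(rows)
```

With a realistic vocabulary of thousands of unigrams, any single utterance contains only a handful of them, so nearly every cell in a row is 0. That is why a sparse representation pays off.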

In [19]:
def pipeline(trainX, devX, trainY, devY, feature_functions):

    """ This function evaluates a list of feature functions on the training/dev data arguments """
    
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data.
    feature_vocab=create_vocab(trainX_feat, top_n=100000)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    clf = linear_model.LogisticRegression(C=1, solver='lbfgs', penalty='l2', max_iter=10000)
    clf.fit(trainX_ids, trainY)
    
    predictions=clf.predict(devX_ids)
    
    return clf, feature_vocab, predictions


The function majority_class(trainY, devY) finds the most frequent label in the training set (trainY) and predicts that label for every instance in the development set (devY). This naive approach serves as a baseline against which to compare models that use different features. If your model performs worse than the majority-class baseline, there is probably something wrong in your code.

In [20]:
def majority_class(trainY, devY):
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority_class=labelCounts.most_common(1)[0][0]
    
    return [majority_class]*len(devY)
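
A short worked example shows what the baseline accuracy looks like when classes are balanced. The labels below are invented for illustration, and the accuracy is computed by hand rather than with scikit-learn.

```python
from collections import Counter

# Toy labels, invented for illustration: a perfectly balanced three-class problem.
train_labels = ["Frodo", "Sam", "Gandalf"] * 4
dev_labels = ["Frodo", "Sam", "Gandalf"] * 2

# Predict the most frequent training label for every dev example.
majority_label = Counter(train_labels).most_common(1)[0][0]
predictions = [majority_label] * len(dev_labels)

# Fraction of dev examples whose gold label equals the majority label.
accuracy = sum(p == g for p, g in zip(predictions, dev_labels)) / len(dev_labels)
print(majority_label, round(accuracy, 3))
```

With three balanced classes the baseline is right one time in three. With five characters and the same number of quotes each, the same reasoning gives 1/5 = 0.200.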

In [21]:
def print_weights(clf, vocab, n=10):

    reverse_vocab=[None]*len(clf.coef_[0])
    for k in vocab:
        reverse_vocab[vocab[k]]=k
        
    for i, cat in enumerate(clf.classes_):
        
        weights=clf.coef_[i]

        for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
            print("%s\t%.3f\t%s" % (cat, weight, feature))
        print()

Function unigram_feature(tokens) takes a list of tokens (tokens) as input and generates a feature dictionary (feats) representing the presence of each unigram (single word) in the tokenized text.

In [22]:
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats
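
To see what "presence" means here, the sketch below builds the same kind of dictionary for an invented utterance with repeated words. The count-based variant is shown only for contrast and is not used in this notebook.

```python
from collections import Counter

tokens = ["no", ",", "no", ",", "no", "!"]

# Presence features, as in unigram_feature: repeated words collapse to a single 1.
presence = {"UNIGRAM_%s" % word: 1 for word in tokens}

# Count features, for contrast only (an alternative, not what the notebook does).
counts = {"UNIGRAM_%s" % word: n for word, n in Counter(tokens).items()}

print(presence)
print(counts)
```

Note that punctuation tokens produced by the tokenizer become unigram features as well.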

In [23]:
# load the preposition list, skipping lines that start with "#"
preps={}
with open("data/preposition_list.txt") as file:
    for line in file:
        if not line.startswith("#"):
            preps[line.rstrip()]=1
            
def preposition_feature(tokens):
    feats={}
    for word in tokens:
        if word in preps:
            feats["PREP_%s" % word]=1
    return feats

Function length_feature accepts a list of tokens as input and calculates two features related to the length of the tokens. First, it computes the total number of tokens in the utterance, which is stored in the "utterance_length" feature. Then, it calculates the average length of the words in the utterance by summing up the lengths of all the words and dividing the total by the number of tokens. This average word length is stored in the "avg_word_length" feature.

In [24]:
def length_feature(tokens):
    feats={}
    feats["utterance_length"]=len(tokens)
    
    avg_word_length=0.
    for word in tokens:
        avg_word_length+=len(word)
    # guard against empty utterances to avoid dividing by zero
    if len(tokens) > 0:
        avg_word_length/=len(tokens)
    
    feats["avg_word_length"]=avg_word_length
    
    return feats
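
The arithmetic can be checked by hand on a short, invented utterance, tokenized the way `nltk.word_tokenize` typically splits contractions and punctuation:

```python
tokens = ["i", "do", "n't", "know", "."]

utterance_length = len(tokens)                    # 5 tokens, punctuation included
total_chars = sum(len(word) for word in tokens)   # 1 + 2 + 3 + 4 + 1 = 11
avg_word_length = total_chars / utterance_length if tokens else 0.0

print(utterance_length, avg_word_length)
```

Both values are real-valued and on a different scale from the 0/1 presence features, so their learned weights are not directly comparable with the weights of the binary features.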

Function readability_feature processes a list of tokens representing a piece of text by first joining them into a single string. Utilizing the textstat package, it then computes the Flesch Reading Ease score for the text, a metric indicating the ease of readability. This score is based on factors such as sentence length and syllable count per word, with higher scores signifying easier readability.

In [25]:
def readability_feature(tokens):
    feats={}
    data=' '.join(tokens)
    feats["flesch_reading_ease"]=textstat.flesch_reading_ease(data)
    return feats
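
For reference, the standard Flesch Reading Ease formula is shown below. The exact value returned by `textstat.flesch_reading_ease` also depends on how the library splits sentences and counts syllables, so treat this as the textbook definition rather than a line-by-line account of the implementation.

```latex
\mathrm{FRE} = 206.835
  - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}}
  - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}
```

Longer sentences and more syllables per word both lower the score.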


Function punctuation_feature takes a list of tokens (tokens) as input and creates a feature dictionary (feats) representing the presence of certain punctuation marks in the tokenized text. It checks each token against a set, punct, of common punctuation marks: question marks, commas, periods, exclamation marks, semicolons, and colons.

In [26]:
def punctuation_feature(tokens):
    punct=set(["?", ",", ".", "!", ";", ":"])
    feats={}
    for word in tokens:
        if word in punct:
            feats["PUNCT_%s" % word]=1
    return feats

In [27]:
def process(targets, features, filename, max_num):

    random.seed(1) # for reproducibility

    X, Y=read_data(filename, targets, max_num=max_num)
    X=np.array(X, dtype=object)
    Y=np.array(Y, dtype=object)
    kf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
    
    preds=[]
    golds=[]
    baseline=[]
    
    for train_index, test_index in kf.split(X, Y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]
        clf, vocab, predictions=pipeline(X_train, X_test, y_train, y_test, features)
        preds.extend(predictions)
        golds.extend(y_test)
        baseline.extend(majority_class(y_train, y_test))
    
    print("Majority class: %.3f (%s)\n" % (accuracy_score(baseline, golds), len(golds)))
    print("Cross-validated accuracy: %.3f (%s)\n" % (accuracy_score(preds, golds), len(golds)))

    # print weights from last fold
    print_weights(clf, vocab, n=10)
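
The fold sizes used by `process` follow from simple arithmetic. The sketch below is a round-robin approximation of stratification, not scikit-learn's `StratifiedKFold` algorithm (it does no shuffling), and the helper name `stratified_folds` is hypothetical. It only illustrates that each fold receives the same number of quotes per character.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Round-robin sketch of stratification (not scikit-learn's algorithm):
    spread each class's examples evenly across the folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Same shape as the Lord of the Rings experiment: 5 characters x 230 quotes.
names = ["Frodo", "Sam", "Gandalf", "Gimli", "Legolas"]
labels = [name for name in names for _ in range(230)]
folds = stratified_folds(labels, n_splits=10)

print([len(fold) for fold in folds])
print(Counter(labels[i] for i in folds[0]))
```

Because `process` pools the predictions from all ten test folds, the reported accuracy covers every sampled quote exactly once, while the printed weights come from the classifier trained in the last fold only.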

First, let's train a classifier to predict the character identity based on the *content* of their dialogue (effectively learning whether different characters talk about different kinds of things).  How do the most characteristic features for each character accord with your own understanding of their language?

Each call to process prints the majority-class baseline, the cross-validated accuracy (with the number of quotes evaluated in parentheses), and the ten highest-weighted features per character from the last fold. Comparing accuracies across feature sets shows how well each set captures the differences between characters: a higher accuracy means the feature set discriminates better between speakers.

In [29]:
targets={216: "Frodo", 106: "Sam", 49: "Gandalf", 317: "Gimli", 259: "Legolas"}
process(targets, [unigram_feature], "data/lotr.book.gz", 230)

Majority class: 0.200 (1150)

Cross-validated accuracy: 0.343 (1150)

Frodo	1.387	UNIGRAM_smeagol
Frodo	1.162	UNIGRAM_n't
Frodo	0.916	UNIGRAM_mean
Frodo	0.914	UNIGRAM_sleep
Frodo	0.904	UNIGRAM_perhaps
Frodo	0.902	UNIGRAM_last
Frodo	0.898	UNIGRAM_boromir
Frodo	0.861	UNIGRAM_am
Frodo	0.839	UNIGRAM_hear
Frodo	0.832	UNIGRAM_expect

Gandalf	1.115	UNIGRAM_new
Gandalf	1.084	UNIGRAM_course
Gandalf	0.911	UNIGRAM_both
Gandalf	0.881	UNIGRAM_night
Gandalf	0.876	UNIGRAM_heard
Gandalf	0.849	UNIGRAM_end
Gandalf	0.834	UNIGRAM_wormtongue
Gandalf	0.810	UNIGRAM_man
Gandalf	0.797	UNIGRAM_understand
Gandalf	0.775	UNIGRAM_know

Gimli	1.099	UNIGRAM_twenty-one
Gimli	1.018	UNIGRAM_kheled-zaram
Gimli	0.993	UNIGRAM_indeed
Gimli	0.877	UNIGRAM_had
Gimli	0.871	UNIGRAM_middle-earth
Gimli	0.870	UNIGRAM_durin
Gimli	0.863	UNIGRAM_lost
Gimli	0.828	UNIGRAM_paths
Gimli	0.794	UNIGRAM_would
Gimli	0.794	UNIGRAM_less

Legolas	1.744	UNIGRAM_gimli
Legolas	1.329	UNIGRAM_nimrodel
Legolas	1.125	UNIGRAM_strange
Legolas	1.074	UNIGRA

In [30]:
targets={343: "Harry", 247: "Ron", 302: "Hermione", 352: "Dumbledore", 298: "Hagrid"}
process(targets, [unigram_feature], "data/harry_potter.book.gz", 500)

Majority class: 0.200 (2500)

Cross-validated accuracy: 0.406 (2500)

Dumbledore	1.598	UNIGRAM_voldemort
Dumbledore	1.450	UNIGRAM_certainly
Dumbledore	1.419	UNIGRAM_precisely
Dumbledore	1.280	UNIGRAM_very
Dumbledore	1.106	UNIGRAM_yes
Dumbledore	1.063	UNIGRAM_indeed
Dumbledore	1.053	UNIGRAM_alastor
Dumbledore	1.041	UNIGRAM_correct
Dumbledore	0.988	UNIGRAM_am
Dumbledore	0.915	UNIGRAM_means

Hagrid	2.686	UNIGRAM_'
Hagrid	2.322	UNIGRAM_yeh
Hagrid	2.220	UNIGRAM_yer
Hagrid	1.686	UNIGRAM_tha
Hagrid	1.549	UNIGRAM_eh
Hagrid	1.416	UNIGRAM_ter
Hagrid	1.358	UNIGRAM_'em
Hagrid	1.316	UNIGRAM_rubbish
Hagrid	1.145	UNIGRAM_us
Hagrid	1.010	UNIGRAM_ca

Harry	1.207	UNIGRAM_er
Harry	1.206	UNIGRAM_matter
Harry	1.043	UNIGRAM_move
Harry	0.961	UNIGRAM_thanks
Harry	0.942	UNIGRAM_hagger
Harry	0.941	UNIGRAM_mind
Harry	0.920	UNIGRAM_w-what
Harry	0.896	UNIGRAM_dobby
Harry	0.892	UNIGRAM_surprised
Harry	0.864	UNIGRAM_yeah

Hermione	1.025	UNIGRAM_learn
Hermione	0.988	UNIGRAM_`
Hermione	0.933	UNIGRAM_hidden
Hermione	0.

Next, let's train a classifier on primarily *stylistic* features (average word length, utterance length, presence of specific punctuation marks, reading difficulty).  Can we still see measurable differences between characters?

In [32]:
targets={216: "Frodo", 106: "Sam", 49: "Gandalf", 317: "Gimli", 259: "Legolas"}
process(targets, [length_feature, readability_feature, punctuation_feature], "data/lotr.book.gz", 230)

Majority class: 0.200 (1150)

Cross-validated accuracy: 0.260 (1150)

Frodo	0.353	PUNCT_?
Frodo	0.201	PUNCT_;
Frodo	0.190	PUNCT_.
Frodo	0.002	utterance_length
Frodo	-0.008	flesch_reading_ease
Frodo	-0.092	PUNCT_!
Frodo	-0.121	PUNCT_:
Frodo	-0.166	PUNCT_,
Frodo	-0.483	avg_word_length

Gandalf	0.016	PUNCT_,
Gandalf	0.006	utterance_length
Gandalf	-0.001	flesch_reading_ease
Gandalf	-0.030	PUNCT_!
Gandalf	-0.035	avg_word_length
Gandalf	-0.041	PUNCT_;
Gandalf	-0.076	PUNCT_.
Gandalf	-0.237	PUNCT_?
Gandalf	-0.267	PUNCT_:

Gimli	0.498	avg_word_length
Gimli	0.313	PUNCT_:
Gimli	0.001	flesch_reading_ease
Gimli	-0.007	utterance_length
Gimli	-0.016	PUNCT_,
Gimli	-0.152	PUNCT_.
Gimli	-0.256	PUNCT_?
Gimli	-0.296	PUNCT_!
Gimli	-0.859	PUNCT_;

Legolas	0.715	PUNCT_:
Legolas	0.374	avg_word_length
Legolas	0.116	PUNCT_;
Legolas	0.112	PUNCT_!
Legolas	0.012	flesch_reading_ease
Legolas	-0.002	utterance_length
Legolas	-0.050	PUNCT_,
Legolas	-0.365	PUNCT_.
Legolas	-0.422	PUNCT_?

Sam	0.582	PUNCT_;
Sam	0.562	PUNC

In [33]:
targets={343: "Harry", 247: "Ron", 302: "Hermione", 352: "Dumbledore", 298: "Hagrid"}
process(targets, [length_feature, readability_feature, punctuation_feature], "data/harry_potter.book.gz", 500)

Majority class: 0.200 (2500)

Cross-validated accuracy: 0.263 (2500)

Dumbledore	0.291	PUNCT_,
Dumbledore	0.158	avg_word_length
Dumbledore	0.102	PUNCT_.
Dumbledore	0.003	utterance_length
Dumbledore	-0.006	flesch_reading_ease
Dumbledore	-0.050	PUNCT_:
Dumbledore	-0.145	PUNCT_;
Dumbledore	-0.488	PUNCT_?
Dumbledore	-1.038	PUNCT_!

Hagrid	0.177	PUNCT_!
Hagrid	0.080	PUNCT_,
Hagrid	0.018	PUNCT_.
Hagrid	0.012	utterance_length
Hagrid	-0.004	flesch_reading_ease
Hagrid	-0.050	PUNCT_?
Hagrid	-0.175	PUNCT_;
Hagrid	-0.259	PUNCT_:
Hagrid	-0.559	avg_word_length

Harry	0.537	PUNCT_:
Harry	0.215	PUNCT_.
Harry	0.152	PUNCT_!
Harry	0.120	PUNCT_?
Harry	0.082	PUNCT_;
Harry	0.000	flesch_reading_ease
Harry	-0.000	utterance_length
Harry	-0.051	avg_word_length
Harry	-0.362	PUNCT_,

Hermione	0.586	PUNCT_!
Hermione	0.318	PUNCT_?
Hermione	0.271	PUNCT_;
Hermione	0.118	PUNCT_,
Hermione	0.113	avg_word_length
Hermione	0.000	flesch_reading_ease
Hermione	-0.008	utterance_length
Hermione	-0.063	PUNCT_.
Hermione	-0.122	PU

Given these results, let's brainstorm two ideas:

* How could we use these distinctive voices to build a better system for speaker attribution?
* How could we use these results to build a model for *free indirect discourse*? (i.e., where a character's voice influences the narration.)