# DH Downunder: Distant Reading

## Notebook 3: Stylistics, or Computer Magic

One of the most interesting applications of distant reading techniques is authorship attribution. In this third and final notebook for our Distant Reading course, we will consider how text analysis can be used to work out the true identity of a text's author.

Authorship attribution will also give us a chance to introduce several concepts from data science, such as vectors, distance metrics and clustering. These concepts are useful in a number of areas beyond authorship attribution.

Execute the cell below to load all the packages required for this notebook.

In [54]:
import numpy as np # for doing maths
import nltk # for natural language processing functions
import matplotlib # for displaying graphs
import random # some useful randomisation functions
import scipy # for scientific computing functions
from import_corpus import import_corpus

# So graphs render properly
%matplotlib inline

## Section 1: Load and inspect our corpus

For this notebook, I have pre-prepared a corpus for you. Execute the cell below to import all the novels into your workspace.

In [2]:
novels = import_corpus()

Wyllard's Weird, by Mary Braddon successfully imported.
Castle Rackrent, by Maria Edgeworth successfully imported.
The Hungry Stones, and Other Stories, by Rabindranath Tagore successfully imported.
The Talisman, by Sir Walter Scott successfully imported.
Ivanhoe, by Sir Walter Scott successfully imported.
Belinda, by Maria Edgeworth successfully imported.
The Absentee, by Maria Edgeworth successfully imported.
The Blithdale Romance, by Nathaniel Hawthorne successfully imported.
Old Mortality, by Sir Walter Scott successfully imported.
The Crimson Cryptogram, by Fergus Hume successfully imported.
The Heart of Mid-Lothian, by Sir Walter Scott successfully imported.
Lady Audley's Secret, by Mary Braddon successfully imported.
Waverley; or, 'tis Sixty Years Since, by ??? successfully imported.
The Disappearing Eye, by Fergus Hume successfully imported.
The Bride of Lammermoor, by Sir Walter Scott successfully imported.
The Lost Parchment, by Fergus Hume successfully imported.
The Mystery 

There is one novel in our corpus whose author is a mystery! In 1814, the novel *Waverley; or, 'tis Sixty Years Since* was published anonymously. In this session we are going to use statistical analysis to find out who ??? really was.

The whole corpus, `novels`, is a single `dict`. Execute the cell below to see how to access information from the corpus.

In [30]:
corpus_keys = '\n  -  '.join(list(novels))
novel_keys = '\n  -  '.join(list(novels['waverley']))
print(f'The keys for the novels dict are:\n  -  {corpus_keys}\n\n')

print(f'Some examples of how to find information:\n')
print(f'The title and author of Castle Rackrent:')
print(f'novels["rackrent"]["title"] = {novels["rackrent"]["title"]}')
print(f'novels["rackrent"]["author"] = {novels["rackrent"]["author"]}')
print('\nTokens 1000-1009 of The Hungry Stones:')
print(f'novels["hungry_stones"]["tokens"][1000:1010] = {novels["hungry_stones"]["tokens"][1000:1010]}')
print(f'\nFor each novel, the following information is available:\n  -  {novel_keys}')

The keys for the novels dict are:
  -  wyllards_weird
  -  rackrent
  -  hungry_stones
  -  talisman
  -  ivanhoe
  -  belinda
  -  absentee
  -  blithedale_romance
  -  old_mortality
  -  crimson_cryptogram
  -  mid_lothian
  -  lady_audley
  -  waverley
  -  disappearing_eye
  -  lammermoor
  -  lost_parchment
  -  hansom_cab
  -  australian_girl
  -  rob_roy
  -  scarlet_letter
  -  mashi
  -  seven_gables


Some examples of how to find information:

The title and author of Castle Rackrent:
novels["rackrent"]["title"] = Castle Rackrent
novels["rackrent"]["author"] = Maria Edgeworth

Tokens 1000-1009 of The Hungry Stones:
novels["hungry_stones"]["tokens"][1000:1010] = ['laden', 'with', 'an', 'oppressive', 'scent', 'from', 'the', 'spicy', 'shrubs', 'growing']

For each novel, the following information is available:
  -  title
  -  author
  -  header
  -  licence
  -  body
  -  tokens


You will notice that the body text and tokens have been put in lowercase, as discussed in our previous sessions.

## Section 2: Calculate word frequencies for each novel

One of the most interesting findings of distant reading over the last few decades is that each writer appears to have a detectable stylistic 'signature'. It seems that when we write, each of us uses extremely common words such as 'the', 'a', 'and', 'for' and 'but' in a particular ratio which is more or less unique. In one of the best-known papers in the field of stylometry, John Burrows showed that we can use this fact to help identify the authors of disputed texts:

* John Burrows, [‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship](https://doi.org/10.1093/llc/17.3.267), *Literary and Linguistic Computing* (2002) 17: 267-87.

There are many other measurements of stylistic similarity, and Burrows himself argues that 'Delta' is not sufficient on its own, but it is a good place to start the study of authorship attribution. In this notebook, you will learn exactly how to calculate Burrows' Delta and use it to find out who wrote *Waverley*.

The first step is to calculate the word frequencies for all the words in each novel. How often does each writer use each of the words in their vocabulary?

We can use Python's `set()` object and the `.count()` list method to do this. A `set()` object is a collection of *unique* values. Execute the cell below to see how a `set()` can be used to get the vocabulary of a text from a list of tokens:

In [8]:
a_list_of_tokens = ['hello','my','name','is','Han','Solo','hello','my','name','is','Luke','Skywalker']

a_set_of_tokens = set(a_list_of_tokens) # Convert the list into a set.

print(f'Here is the Set you just created from a_list_of_tokens: {a_set_of_tokens}')

Here is the Set you just created from a_list_of_tokens: {'Skywalker', 'my', 'hello', 'Luke', 'name', 'Han', 'is', 'Solo'}


The new `set` now contains exactly one copy of each word that appears in the original `list`. Notice for example that there is only one 'hello', one 'name' and one 'is'. Sets do not preserve the order of the original values, but this is immaterial for our purposes.

Notice that the new `set` is enclosed in curly braces `{}`. This is an unfortunate ambiguity in Python, since `dicts` are also enclosed in curly braces `{}`. If you are ever unsure whether you are dealing with a `set` or a `dict`, you can use the `type` function to check:

In [9]:
type(a_set_of_tokens)

set
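
The same ambiguity affects empty collections: a bare `{}` creates an empty `dict`, so an empty `set` has to be written as `set()`. Here is a short sketch of the difference (the variable names are made up for illustration):

```python
empty_braces = {}                 # a bare pair of braces is an empty dict, not a set
empty_set = set()                 # an empty set has to be created with set()
some_words = {'han', 'solo'}      # values only: this is a set
some_pairs = {'han': 'solo'}      # key: value pairs: this is a dict

for thing in (empty_braces, empty_set, some_words, some_pairs):
    print(type(thing).__name__, thing)
```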

Great, now we can quickly and easily find out what all the words are in a given novel. Now we need to work out how to count all those words. The `.count()` method gives us a simple way to do this. Execute the cell below to see how it works:

In [12]:
another_list_of_tokens = ['James','James','Morrison','Morrison','Wetherby','George','Dupree',
                         'took','great','care','of','his','mother',
                         'though','he','was','only','three',
                         'James','James','Morrison','Morrison','said','to','his','mother','said','he',
                         'you','must','never','go','down','to','the','end','of','the','town',
                         'if','you','don\'t','go','down','with','me']

n_james = another_list_of_tokens.count('James')
n_of = another_list_of_tokens.count('of')

print(f'In the opening stanza of this lovely children\'s poem, "James" appears {n_james} times, and "of" appears {n_of} times.')

In the opening stanza of this lovely children's poem, "James" appears 4 times, and "of" appears 2 times.


The one other important point is that texts can vary considerably in length. Two books may both use the word 'the' 2000 times, but if book A is 10,000 words long, and book B is 10,000,000 words long, then obviously these 2000 'the's would mean very different things!

The usual way to express word frequencies is therefore 'occurrences per 1000 words'. The formula for calculating this is $$\frac{n_x}{t} \times 1000 $$ where $n_x$ is the number of times word $x$ appears in the text and $t$ is the total words in the text.
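
To see the formula in action before trying it on a whole novel, here is a minimal worked example on a made-up list of twelve tokens. The list `toy_tokens` is invented for illustration and is not part of the corpus.

```python
# A made-up token list, just for illustration
toy_tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat',
              'and', 'the', 'dog', 'sat', 'too', 'today']

n_the = toy_tokens.count('the')   # n_x: how many times 'the' appears
t = len(toy_tokens)               # t: the total number of tokens

per_1000 = n_the / t * 1000       # apply the formula above

print(f'"the" appears {n_the} times in {t} tokens, i.e. {per_1000:.1f} times per 1000 words.')
```

Here 'the' accounts for 3 of the 12 tokens, so its relative frequency is 250 per 1000 words. Real texts give much smaller numbers, because their vocabularies are far larger.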

### Exercise 2.1: Compute the relative word frequency of 'scotland'

Complete the code in the cell below to calculate the frequency per 1000 words of the word 'scotland' in *Waverley* (remember all words have been put into lower case). You can use `len()`, `.count('word')`, `*` and `/`.

In [26]:
# YOUR CODE HERE

waverley_tokens =          # Get Waverley's tokens from the 'novels' dict
scotland_count =           # Count the number of times 'scotland' is used
total_words =              # Count the total number of words in the novel
rel_freq_scotland =        # Apply the formula above

# END OF YOUR CODE

print(f'The word "scotland" appears {rel_freq_scotland:.3f} times per 1000 words in Waverley.')

The word "scotland" appears 0.509 times per 1000 words in Waverley.


**Expected output:** The word "scotland" appears 0.509 times per 1000 words in Waverley.

### Exercise 2.2: Find the 20 most common words in *Waverley*

Now let's find the 20 most common words in the novel. You can use `set()` on the list of *Waverley*'s tokens to get the vocabulary of the novel, then you can use a `for` loop to look for each word in the novel and apply the formula above. I have provided the code that will fetch the top twenty words when you are done.

In [None]:
# YOUR CODE HERE

waverley_tokens =             # Get Waverley's tokens from the 'novels' dict
total_words =                 # Use len() to get the total words
waverley_vocab =              # Use set() to get all the unique words

results = {}                  # Initialise results dict (done for you)

for word in waverley_vocab:   # Loop over the set of unique words (done for you)
    
    word_count =              # How many times does this word appear in the novel?
    
    per_1000 =                # How many times does this word appear per 1000 words in the novel (use formula)
    
    results[word] = per_1000  # Add the result to the results dict (done for you)

# END OF YOUR CODE

# Sort the results:
top_20 = [(k, results[k]) for k in sorted(results, key=results.get, reverse=True)][0:20]

# Display them:
top_20

Can we make any guess about the author based on this information?

This method works, but it is very slow: `.count()` has to scan the entire list of tokens once for every word in the vocabulary. Many programming libraries contain fast, useful functions for just this sort of task. One of them is `FreqDist()`, which is provided by the Natural Language Toolkit (`nltk`). Its `.most_common()` method extracts the most common words in a text. Execute the cell below to see how it works:

In [38]:
from nltk import FreqDist

waverley_freqs = FreqDist(novels['waverley']['tokens'])

waverley_freqs.most_common(20)

[('the', 14164),
 ('of', 8856),
 ('and', 6718),
 ('to', 6116),
 ('a', 4788),
 ('in', 3706),
 ('his', 3458),
 ('he', 2715),
 ('was', 2423),
 ('that', 2191),
 ('with', 2055),
 ('which', 1924),
 ('i', 1879),
 ('as', 1850),
 ('it', 1598),
 ('had', 1579),
 ('for', 1497),
 ('by', 1379),
 ('but', 1200),
 ('at', 1196)]
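
To see why a table like this is quick to build, it may help to compare the two counting strategies on a tiny example. The sketch below uses `Counter` from Python's standard library, which `FreqDist` is built on. The token list is made up for illustration.

```python
from collections import Counter

toy_tokens = ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

# Slow approach: .count() rescans the whole list once for every word in the vocabulary
slow_counts = {word: toy_tokens.count(word) for word in set(toy_tokens)}

# Faster approach: Counter walks through the list once and tallies as it goes
fast_counts = Counter(toy_tokens)

print(fast_counts.most_common(2))        # the two most frequent words with their counts
print(slow_counts == dict(fast_counts))  # both approaches agree on the counts
```

On ten tokens the difference is invisible, but on a novel with tens of thousands of tokens and thousands of distinct words, rescanning the list for every word adds up quickly.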

Unfortunately `FreqDist()` gives you raw frequencies, not the relative frequencies we want. A more advanced tool is `CountVectorizer` from the scikit-learn package. It takes a `list` of strings, and outputs a "Document-Term Matrix" (DTM). A DTM is a giant table, where each row represents a novel from your corpus, and each column is a particular word:

|Novel|aaron|ab|aback|abacus|abaddon|abana|abandon|...|
|---|---|---|---|---|---|---|---|---|
|wyllards_weird|0|0|0|0|0|0|5|...|
|rackrent|0|0|0|0|0|0|0|...|
|hungry_stones|0|0|1|0|0|0|1|...|
|talisman|0|0|0|1|0|0|3|...|
|ivanhoe|2|0|0|2|0|1|6|...|
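
To make the idea of a DTM concrete, here is a hand-rolled sketch that builds one from three made-up 'texts' using only the tools introduced so far. It illustrates the concept and is not how `CountVectorizer` is implemented. The names `toy_corpus`, `vocab` and `toy_dtm` are invented for this example.

```python
# Three made-up 'texts', already lowercased and split into tokens
toy_corpus = {
    'text_a': ['the', 'cat', 'sat'],
    'text_b': ['the', 'dog', 'sat', 'and', 'the', 'cat', 'ran'],
    'text_c': ['a', 'dog', 'ran'],
}

# Step 1: the columns are the sorted vocabulary of the whole corpus
vocab = sorted({token for tokens in toy_corpus.values() for token in tokens})

# Step 2: each row counts how often each vocabulary word appears in one text
toy_dtm = [[tokens.count(word) for word in vocab] for tokens in toy_corpus.values()]

print(vocab)
for name, row in zip(toy_corpus, toy_dtm):
    print(name, row)
```

One difference worth knowing: with its default settings, `CountVectorizer` ignores one-letter tokens such as 'a' and 'i', so those words will not appear as columns in the DTM it produces.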

In [66]:
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer requires us to provide a list, where each item of the list is a text represented
# by a single string. So first we will extract all the texts into text_list, and keep track of which
# text is which using novel_list.
novel_list = []
text_list = []
for name, novel in novels.items():
    novel_list.append(name)
    text_list.append(novel['body'])
    
# Initialise the 'vectorizer', the object we can use to convert the novels into a DTM:
vectorizer = CountVectorizer()

# Apply the vectorizer to the novels. The .fit_transform method performs the two steps we performed above:
# fit = find the vocabulary of all the texts
# transform = count the instances of each vocabulary word in each text
DTM = vectorizer.fit_transform(text_list)

# NB: on scikit-learn older than 1.0, use vectorizer.get_feature_names() instead
print(f'The columns of the below matrix represent: {", ".join(vectorizer.get_feature_names_out()[63:70])}\n')
row_names = "\n".join(novel_list[0:5])
print(f'The rows represent:\n{row_names}\n')
print(DTM.toarray()[0:5,63:70])

The columns of the below matrix represent: aaron, ab, aback, abacus, abaddon, abana, abandon

The rows represent:
wyllards_weird
rackrent
hungry_stones
talisman
ivanhoe

[[0 0 0 0 0 0 5]
 [0 0 0 0 0 0 0]
 [0 0 1 0 0 0 1]
 [0 0 0 1 0 0 3]
 [2 0 0 2 0 1 6]]


Of course, these are also raw frequencies, and we would like to have the relative frequencies. This is very easy to compute using the built-in `.sum()` method. The object we have just created, `DTM`, is a [CSR sparse matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html), which has many useful built-in methods besides `.sum()`.

Since each *row* represents one novel, if we add up all the numbers in a single *row*, we will get the total words for that novel.

### Exercise 2.3: Calculate the relative frequencies.

In the cell below, use the `.sum()` method to calculate the sum for each row. Then divide the DTM by the row sums and multiply by 1000 to get the relative frequencies.

**NB:** You have to use the extra parameter `axis = 1` when you call the `.sum()` method.

In [68]:
# YOUR CODE HERE

row_sums =        # Use .sum(axis = 1) to find the row sums of the DTM
DTM =             # Divide the whole DTM by row_sums and then multiply the DTM by 1000 to get the relative frequencies

# END OF YOUR CODE

DTM = np.array(DTM) # Type conversion to deal with quirk in software

# NB: on scikit-learn older than 1.0, use vectorizer.get_feature_names() instead
print(f'The columns of the below matrix represent: {", ".join(vectorizer.get_feature_names_out()[63:69])}\n')
row_names = "\n".join(novel_list[0:5])
print(f'The rows represent:\n{row_names}\n')
print(DTM[0:5,63:69])

The columns of the below matrix represent: aaron, ab, aback, abacus, abaddon, abana

The rows represent:
wyllards_weird
rackrent
hungry_stones
talisman
ivanhoe

[[0.         0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.        ]
 [0.         0.         0.02137529 0.         0.         0.        ]
 [0.         0.         0.         0.00790883 0.         0.        ]
 [0.01058985 0.         0.         0.01058985 0.         0.00529493]]


**NB:** Due to a quirk in the software, `DTM` has now been converted into a [numpy array](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html). For our purposes this behaves much like a CSR sparse matrix, so don't worry too much about it. The main difference you will notice is the `keepdims = True` argument popping up from now on, which is not necessary when you are working with a sparse matrix.

Since we have now divided each row by the total words, and multiplied by 1000, each row now adds up to 1000:

In [74]:
DTM.sum(axis = 1, keepdims = True)

array([[1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.],
       [1000.]])
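
If you are curious what `keepdims = True` actually does, the sketch below shows it on a small made-up table of counts. The array `toy_counts` is invented for illustration. Without `keepdims`, the row totals come back as a flat list of numbers. With it, they come back as a single column, which lines up with the rows of the table when we divide.

```python
import numpy as np

# A made-up table of raw counts: 2 texts (rows) by 3 words (columns)
toy_counts = np.array([[2, 1, 1],
                       [6, 3, 1]])

row_totals_flat = toy_counts.sum(axis=1)                 # shape (2,): a flat array
row_totals_col = toy_counts.sum(axis=1, keepdims=True)   # shape (2, 1): one column

# With the (2, 1) column, each row is divided by its own total
toy_rel = toy_counts / row_totals_col * 1000

print(row_totals_flat.shape, row_totals_col.shape)
print(toy_rel)
print(toy_rel.sum(axis=1))   # each row now adds up to (approximately) 1000
```

Dividing `toy_counts` by `row_totals_flat` instead would raise an error here, because a flat array of 2 numbers cannot be lined up against rows of 3 columns.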

If we want to find the most common words, we can use `.sum(axis = 0)` to sum over the columns and find the words that are most frequent. This is a slightly complicated step, so I have written the code for you. It finds the top 20 words. If you want to change the number of words it finds, simply alter the `top_n` variable at the top of the cell.

In [162]:
# How many words do we want?
top_n = 20

# Sum over the columns. Each column is 1 word, so if we add a column up we find out how many times
# that word was used in our corpus.
col_sums = DTM.sum(axis = 0)

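# Which columns have the highest frequency?
# NB: np.argsort() returns the column indices in ascending order of frequency,
# so the indices of the most frequent words are at the end.
columns_sorted = np.argsort(col_sums)

# Look up the vocabulary once, rather than on every pass through the loop.
# NB: on scikit-learn older than 1.0, use vectorizer.get_feature_names() instead.
feature_names = vectorizer.get_feature_names_out()

# Get the top n words:
top_list = []
for i in range(1, top_n + 1):
    idx = columns_sorted[-i] # count backwards from the end of the sorted indices
    word = feature_names[idx]
    # To get the average frequency per 1000 words, divide by the number of
    # texts in the corpus (the number of rows in the DTM)
    corpus_freq = col_sums[idx] / DTM.shape[0]
    top_list.append((word, corpus_freq))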

top_list

[('the', 58.73744312645612),
 ('of', 32.85473748253985),
 ('and', 31.990543031657804),
 ('to', 29.546003263751686),
 ('in', 16.94207333177555),
 ('that', 13.214060937614386),
 ('he', 12.691499197428486),
 ('was', 12.556211982903273),
 ('his', 12.014947873994034),
 ('it', 10.599612892020495),
 ('you', 9.825144620133825),
 ('her', 9.48036223768839),
 ('with', 9.39666135840758),
 ('as', 9.313692730118921),
 ('had', 8.116062127092208),
 ('for', 7.904788635883245),
 ('she', 7.162326034085504),
 ('my', 7.023160500383226),
 ('but', 6.704019060707225),
 ('not', 6.615259064156846)]
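
The trickiest part of the cell above is `np.argsort()`, which returns positions rather than values. Here is a small sketch of the same idea on four made-up numbers. The labels in `toy_words` and the values in `toy_col_sums` are invented for illustration.

```python
import numpy as np

toy_words = ['and', 'of', 'the', 'to']               # hypothetical column labels
toy_col_sums = np.array([31.9, 32.8, 58.7, 29.5])    # hypothetical column totals

order = np.argsort(toy_col_sums)   # positions of the values, from smallest to largest
top_2 = order[::-1][:2]            # reverse the order, then keep the first two

top_words = [(toy_words[i], float(toy_col_sums[i])) for i in top_2]
print(order)
print(top_words)
```

Reversing the sorted positions with `[::-1]` is an alternative to counting backwards with negative indices, as the loop above does. Both pick out the same columns.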

## Section 3: Calculate "Burrows' Delta" for all the novels

Now we have all the tools we need to calculate Burrows' Delta.
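
As a rough guide to where we are heading, here is a minimal sketch of the calculation on made-up numbers. Following Burrows (2002), it first converts each word's frequency into a z-score, which measures how far a text sits from the corpus average for that word, and then takes the mean absolute difference between the z-scores of two texts. A smaller Delta means a more similar style. The frequencies, the text names and the helper `burrows_delta` are all invented for illustration. The sketch also assumes numpy's default (population) standard deviation, and in practice Delta is computed over the most frequent words only.

```python
import numpy as np

# Made-up per-1000-word frequencies: 4 texts (rows) by 3 common words (columns)
toy_names = ['known_a1', 'known_a2', 'known_b1', 'mystery']
toy_freqs = np.array([[60.0, 30.0, 20.0],
                      [62.0, 31.0, 19.0],
                      [50.0, 40.0, 28.0],
                      [61.0, 29.0, 21.0]])

# Step 1: z-scores. For each word (column), subtract the corpus mean and divide by
# the corpus standard deviation, so that every word carries equal weight.
# Assumption: np.std's default (population) standard deviation is used here.
z_scores = (toy_freqs - toy_freqs.mean(axis=0)) / toy_freqs.std(axis=0)

# Step 2: Delta between two texts is the mean absolute difference of their z-scores
def burrows_delta(z_a, z_b):
    return np.abs(z_a - z_b).mean()

# Compare the 'mystery' text (last row) with each of the known texts
mystery = z_scores[-1]
deltas = {name: burrows_delta(mystery, z)
          for name, z in zip(toy_names[:-1], z_scores[:-1])}

# Print the known texts from most to least similar
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f'{name}: {delta:.3f}')
```

In this toy example the mystery text sits closest to the two 'known_a' texts, which is the kind of evidence we will look for when we compare *Waverley* with the other novels in the corpus.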

In [94]:
novels['rackrent']['body'][10000:12000]

'ir walter wrote i felt that something might be attempted for my own country of the same kind as that which miss edgeworth so fortunately achieved for ireland in the memoirs of miss edgeworth there is a pretty account of her sudden burst of feeling when this passage so unexpected and so deeply felt by her was read out by one of her sisters at a time when maria lay weak and recovering from illness in edgeworthstown our host took us that day among other pleasant things for a marvellous and delightful flight on a jaunting car to see something of the country we sped through storms and sunshine by open moors and fields and then by villages and little churches by farms where the pigs were standing at the doors to be fed by pretty trim cottages the lights came and went as the mist lifted we could see the exquisite colours the green the dazzling sweet lights on the meadows playing upon the meadow sweet and elder bushes at last we came to the lovely glades of carriglass it seemed to me that we 

## Section 4: Who wrote *Waverley*?

In [4]:
novels = {
    'foo':{
        'title':'foo',
        'author':'bar'
    },
    'bar':{
        'title':'bar strikes back',
        'author': 'Emperor Barbatine'
    }
}

In [14]:
[info['author'] for novel,info in novels.items()]

['bar', 'Emperor Barbatine']