Small authorship attribution tutorial. Technique: Naive Bayes.

Authorship attribution tutorial

This tutorial shows an example use of the Naive Bayes algorithm. The system is trained on five texts by four authors: Austen, Kipling, Carroll and Grahame. It is then asked to guess the author of an additional text (which, spoiler alert, is Jane Austen's Emma).


You'll need the docopt package to run the code. If you're on PythonAnywhere, this will be installed by default. If you need to install it on your system, do:

sudo pip3 install docopt

Run the code

To run the code, type:

python3 --words data/emma.txt

Or alternatively:

python3 --chars 3 data/emma.txt

The first command computes a model over words, as we saw in class. The second uses character ngrams of the size passed after --chars (3 in the example above).
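To make the difference between the two feature types concrete, here is a minimal sketch of how word counts and character ngram counts could be extracted from a text. The function names `word_features` and `char_ngram_features` and the tokenisation rule are assumptions made for illustration; the tutorial's own code may tokenise differently.

```python
import re
from collections import Counter


def word_features(text):
    """Count lowercase word tokens (hypothetical tokenisation rule)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def char_ngram_features(text, n):
    """Count every overlapping character ngram of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    sample = "Emma Woodhouse, handsome, clever, and rich"
    print(word_features(sample).most_common(3))
    print(char_ngram_features(sample, 3).most_common(3))
```

Note that character ngrams also capture spaces and punctuation, so they pick up habits such as comma use that a word model ignores.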

The output first reports which operation the classifier is performing: computing prior probabilities, conditional probabilities, and so on. Then, for illustration, it prints the 10 features with the highest conditional probability for each class (i.e. for each author), which gives you an idea of which words or ngrams matter most for that author. Finally, it prints the log probability of each class, sorted from most to least likely. The first entry in that list is the author guessed by the system.
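The steps described above (priors, conditional probabilities, top features, sorted log scores) can be sketched as follows. This is an illustrative version under stated assumptions, not the tutorial's actual code: the names `train`, `classify` and `top_features` are hypothetical, add-one smoothing is an assumed choice, and the toy training data is invented.

```python
import math
from collections import Counter


def train(docs_by_author):
    """docs_by_author maps an author to a list of token lists, one per text."""
    total_docs = sum(len(docs) for docs in docs_by_author.values())
    vocab = {tok for docs in docs_by_author.values() for doc in docs for tok in doc}
    priors, cond = {}, {}
    for author, docs in docs_by_author.items():
        # Prior: share of the training texts written by this author.
        priors[author] = len(docs) / total_docs
        counts = Counter(tok for doc in docs for tok in doc)
        # Add-one smoothing (an assumption) so unseen features keep non-zero mass.
        denom = sum(counts.values()) + len(vocab)
        cond[author] = {tok: (counts[tok] + 1) / denom for tok in vocab}
    return priors, cond


def classify(tokens, priors, cond):
    """Return (author, log score) pairs, most likely author first."""
    scores = {}
    for author in priors:
        score = math.log(priors[author])
        for tok in tokens:
            if tok in cond[author]:  # features never seen in training are skipped
                score += math.log(cond[author][tok])
        scores[author] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def top_features(cond, author, k=10):
    """The k features with the highest conditional probability for an author."""
    return sorted(cond[author], key=cond[author].get, reverse=True)[:k]


# Invented toy data, for illustration only.
training = {
    "author_a": [["tea", "ball", "tea"]],
    "author_b": [["jungle", "wolf", "jungle"]],
}
priors, cond = train(training)
ranking = classify(["tea", "tea", "wolf"], priors, cond)
```

Here `ranking[0][0]` is the guessed author. Summing logs rather than multiplying probabilities avoids numerical underflow on long texts, which is why the scores are reported as negative log figures.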

Try it out

Change parameters and see what happens! What do you get with longer ngrams? Can you interpret the output? Does the system still perform the way you expect?


The opposite of authorship attribution is obfuscation. Imagine you are Jane Austen and you don't want your texts to identify you. What can you do to prevent this? Try out different methods and see if you can fool the system.
