Authorship attribution tutorial

This tutorial shows an example use of the Naive Bayes algorithm. The system is trained on five texts by four authors: Austen, Shelley, Carroll and Grahame. It is then required to guess the author of an additional text (which -- spoiler alert -- is Jane Austen's Emma).

Requirements

You'll need the docopt package to run the code from the terminal. If it is not already installed on your system, run the following (docopt-ng is a maintained fork that provides the same docopt module):

sudo pip3 install docopt-ng

Run the code

To run the code, type:

python3 attribution.py --words data/test/emma.txt

Or alternatively:

python3 attribution.py --chars 3 data/test/emma.txt

The first command computes a model over words, as we saw in class. The second uses character n-grams of the size passed on the command line (3 in the example above).
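
To make the difference between the two feature types concrete, here is a minimal sketch of both extractions. The function names and the whitespace tokenisation are assumptions made for illustration; the code in attribution.py and utils.py may tokenise differently.

```python
from collections import Counter


def word_features(text):
    """Count lowercased, whitespace-separated tokens (a simplifying assumption)."""
    return Counter(text.lower().split())


def char_ngram_features(text, n):
    """Count every overlapping character n-gram of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


sample = "Emma Woodhouse, handsome, clever, and rich"
print(word_features(sample).most_common(3))
print(char_ngram_features(sample, 3).most_common(3))
```

Note that character n-grams include spaces and punctuation, so they can pick up on habits such as comma usage that a word model ignores.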

The output of the code tells you which operations the classifier is currently performing: computing prior probabilities, conditional probabilities, etc. Then, for illustration, it prints the 10 features with the highest conditional probability for each class (i.e. for each author). This gives you an idea of which words / n-grams are most important for each author. Finally, you get the log probability of each class, sorted from most to least likely. The first entry in that list is the author guessed by the system.
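
The steps described above (priors, smoothed conditional probabilities, sorted log scores) can be sketched as follows. This is not the code from attribution.py: the function names, the additive (Laplace-style) smoothing, the choice to skip features never seen in training, and the two toy classes "A" and "B" are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(docs_by_class, alpha=1.0):
    """docs_by_class maps a class label to a list of documents (token lists)."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {tok for docs in docs_by_class.values() for doc in docs for tok in doc}
    log_priors, log_conds = {}, {}
    for label, docs in docs_by_class.items():
        # Prior: share of training documents that belong to this class.
        log_priors[label] = math.log(len(docs) / total_docs)
        counts = Counter(tok for doc in docs for tok in doc)
        # Additive smoothing: add alpha to every count, including zero counts.
        denom = sum(counts.values()) + alpha * len(vocab)
        log_conds[label] = {tok: math.log((counts[tok] + alpha) / denom) for tok in vocab}
    return log_priors, log_conds


def classify(tokens, log_priors, log_conds):
    """Return (label, log score) pairs sorted from most to least likely."""
    scores = {}
    for label, log_prior in log_priors.items():
        score = log_prior
        for tok in tokens:
            if tok in log_conds[label]:  # assumption: skip features unseen in training
                score += log_conds[label][tok]
        scores[label] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data, invented for this sketch.
toy = {
    "A": [["the", "ball", "was", "delightful"]],
    "B": [["the", "rabbit", "was", "late"]],
}
log_priors, log_conds = train(toy)
ranking = classify(["rabbit", "late"], log_priors, log_conds)
print(ranking)
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long texts, which is why the final figures are reported as logs.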

Exercises

First, read the code and make sure you understand what it does. If it helps you, you can add comments in the file.
Can you interpret the output? Just looking at the most frequent words/ngrams, what do you notice about the similarities and differences between authors?
Change parameters and see what happens! What do you get with longer ngrams? What happens if you modify the smoothing parameter alpha?
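
As a starting point for the alpha experiment, the sketch below isolates the smoothing step. It assumes the classifier uses additive smoothing, i.e. (count + alpha) / (total + alpha * vocabulary size); check the code to confirm this. The counts are made-up numbers chosen only to show the trend.

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Additively smoothed estimate of P(feature | class)."""
    return (count + alpha) / (total + alpha * vocab_size)


# Hypothetical class with 1000 feature occurrences and a 5000-feature vocabulary.
for alpha in (0.01, 0.1, 1.0, 10.0):
    seen = smoothed_prob(50, 1000, 5000, alpha)
    unseen = smoothed_prob(0, 1000, 5000, alpha)
    print(f"alpha={alpha:<5} seen={seen:.6f} unseen={unseen:.6f} ratio={seen / unseen:.1f}")
```

The ratio between a seen and an unseen feature shrinks as alpha grows: large values flatten the distribution, while small values make unseen features very costly.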

Write up your experiments

Write a short report of what you've done (this is just practice for the exam!). Your report should contain the following sections:

Description of the task
Your hypothesis: it can be anything you like, and you can keep it simple. For instance, you might posit that the system will work less well with very long character n-grams (it will overfit to the particular books in the training set and not generalise to the test book).
The experiments you ran: Which parameters did you change? Did you modify the system? Explain everything you did in detail.
Results: write a little table showing your results and discuss it with respect to your hypothesis. Was it confirmed or disproved?
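
One way to probe the example hypothesis about long n-grams is to measure how many of the test text's distinct n-grams also appear in the training text. The helper names and the two toy strings below are assumptions for illustration; in a real experiment you would read the training and test files instead.

```python
def char_ngrams(text, n):
    """Set of distinct character n-grams of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def coverage(train_text, test_text, n):
    """Fraction of distinct test n-grams that also occur in the training text."""
    test_grams = char_ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & char_ngrams(train_text, n)) / len(test_grams)


# Toy strings standing in for a training book and a test book.
train_text = "it is a truth universally acknowledged that a single man"
test_text = "it is a truth that a man is single"
for n in (1, 3, 5, 8):
    print(n, round(coverage(train_text, test_text, n), 2))
```

If coverage drops sharply as n grows, most long n-grams in the test book were never seen in training, so the classifier's decision rests mainly on the smoothing term.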

Open-ended project

For those who want to go further... The opposite of authorship attribution is obfuscation. Imagine you are Jane Austen and you don't want your texts to identify you. What can you do to prevent this? Try out different methods and see if you can fool the system.
