Authorship Attribution in Fan-Fictional Texts given Variable-Length Character and Word N-Grams

If you would like to use parts of the code or replicate our approach, please cite us! :)

Authors: Lukas Muttenthaler, Gordon Lucas, Janek Amann.

Authorship Attribution (AA) is the task of determining the author of a text from a set of candidates. It requires text feature representations that are chosen and validated through rigorous experiments. In the context of machine learning, AA can be regarded as a multi-class, single-label text classification problem. Its applications include plagiarism detection and forensic linguistics, as well as research in literature.

In the current study, we developed three n-gram models to identify the authors of fan-fictional texts, each with variable-length n-grams: a standard character n-gram model (2–5 grams), a distorted character n-gram model (1–3 grams), and a word n-gram model (1–3 grams), to capture not only the syntactic features but also the lexical features and content of a given text. Token weighting was performed through term frequency–inverse document frequency (tf-idf) computation. For each of the three models, we implemented a linear Support Vector Machine (SVM) classifier, and finally applied a soft voting procedure that averages the classifiers' results (i.e., an ensemble SVM). Among the three individual models, the standard character n-gram model performed best; however, combining all three classifiers' predictions yielded the best results overall. To enhance computational efficiency, we performed dimensionality reduction via Singular Value Decomposition (SVD) before fitting the SVMs on the training data.
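The pipeline above can be sketched with scikit-learn. This is a minimal illustration, not our actual implementation (see the Jupyter notebook for that): the toy corpus, labels, the `distort` masking rule, and the tiny SVD dimensionality are all placeholders, and `CalibratedClassifierCV` stands in for whichever probability-calibration scheme is used so that soft voting can average class probabilities.

```python
import re
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

def distort(text):
    # Illustrative text distortion: mask alphanumeric characters so the
    # n-grams capture punctuation and structure rather than content.
    return re.sub(r"[A-Za-z0-9]", "*", text)

# Toy corpus (placeholder for the PAN fan-fiction problems).
docs = [
    "I write like this! Short, punchy sentences.",
    "Another author, clearly; long and winding clauses...",
    "I write like this? Yes, exactly like this.",
    "Another author; clearly, one who loves semicolons...",
    "I write like this. Always have, always will!",
    "Another author, clearly -- with dashes and drama...",
]
labels = ["A", "B", "A", "B", "A", "B"]

def make_model(analyzer, ngram_range, preprocessor=None):
    # tf-idf features -> SVD -> linear SVM (calibrated for predict_proba)
    return make_pipeline(
        TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range,
                        preprocessor=preprocessor),
        TruncatedSVD(n_components=2),
        CalibratedClassifierCV(LinearSVC(), cv=2),
    )

models = [
    make_model("char", (2, 5)),                        # standard char n-grams
    make_model("char", (1, 3), preprocessor=distort),  # distorted char n-grams
    make_model("word", (1, 3)),                        # word n-grams
]
for m in models:
    m.fit(docs, labels)

# Soft voting: average the three classifiers' class probabilities.
query = ["I write like this! Short and punchy."]
probs = np.mean([m.predict_proba(query) for m in models], axis=0)
pred = models[0].classes_[probs.argmax(axis=1)][0]
```

The soft vote is just the unweighted mean of the three probability vectors; a weighted mean (favoring the standard character model) would be a natural variation.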

With a run time of approximately 180 seconds for all 20 problems, we achieved a macro F1-score of 70.5% on the development corpus and an F1-score of 69% on the competition's test corpus, significantly outperforming the PAN 2019 baseline classifier. Thus, we have shown that no single feature representation yields accurate classifications on its own; rather, it is the combination of various text representations that depicts an author's writing style most thoroughly.

The code for our approach can be found in the provided Jupyter notebook.


Cross-domain authorship attribution in fan-fictional forms of literature.





