Interference

Author Verification Using Common N-Gram Profiles of Text Documents

The formulas from the presentation are duplicated in the Class implementation notebook.

Participants:

Bystrova Olga (ovbystrova)
Okhapkia Anna (eischaire)
Ryazanskaya Galina (flying-bear)

The tasks

Links in the text lead to the notebooks where the mentioned task is done.

Objective

In the original article, the authors framed intrinsic authorship attribution as binary classification: the text was written either by the same author or by someone else. We could have simulated this structure using language background (LB) and first (native) language (FL) as "authors". However, this would not be ecologically valid, as the texts are, of course, written by different authors, and we did not have data on authorship. Thus, we changed the task to binary classification (for LB) and multiclass classification (for FL).

Pipeline

preprocessing

tokenization for word n-grams (of length n)
truncation so that all texts are of the same length (omitting the shorter texts)
train/test split (correcting for imbalanced classes!)
- on FL, native language
- on LB, speaker type

building classifiers for each parameter combination

calculation of n-gram profiles (P)
cutoff to the L most frequent n-grams
distance calculation

multiclass classification with minimal distance for each ensemble, averaging the results
- on FL, native language
- on LB, speaker type
building baselines

TF-IDF + logistic regression
TF-IDF on word bigrams + logistic regression with parameter search
word2vec + logistic regression with parameter search
word2vec + perceptron

comparing results
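The preprocessing step of the pipeline (truncation to a common length and class balancing) could be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the function names `truncate_to_mode_length` and `balance_classes` are hypothetical, texts are assumed to be already tokenized, and the common length is assumed to be the most frequent text length.

```python
import random
from collections import Counter, defaultdict


def truncate_to_mode_length(texts):
    """Cut every text to the most frequent length and drop shorter texts.

    `texts` is a list of (tokens, label) pairs (an assumed data layout).
    """
    length_counts = Counter(len(tokens) for tokens, _ in texts)
    mode_len = length_counts.most_common(1)[0][0]
    return [
        (tokens[:mode_len], label)
        for tokens, label in texts
        if len(tokens) >= mode_len
    ]


def balance_classes(texts, per_class, seed=0):
    """Sample `per_class` texts from each class; omit classes that are too small."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for tokens, label in texts:
        by_label[label].append(tokens)
    balanced = []
    for label, items in sorted(by_label.items()):
        if len(items) < per_class:
            continue  # class has too few texts, so it is left out entirely
        for tokens in rng.sample(items, per_class):
            balanced.append((tokens, label))
    return balanced
```

The train/test split would then be made on the balanced set, separately for the FL and LB labels.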

Architectural choices

We decided to use only ensemble classifiers, as they performed best in the article.
We decided to cut all the texts to the mode length and omit all texts shorter than that.
We decided that we needed to balance the classes and select the same number of texts from each class, landing on two options: 90 and 400 texts from each class. All classes with fewer data points were omitted.
Character ensembles were slow and thus were only calculated for LB.
We decided to use only the number of n-grams (L) to determine the length of a profile and to use multiclass classification with minimal distance, which does not need a threshold (θ). The parameters from the original article (the ones we included are in bold):

Parameter space

size of N-grams (n)
- from 3 to 10 for characters
- from 1 to 3 for words
size of a profile
- Number of n-grams (L) 200, 500, 1000, 1500, 2000, 2500, 3000
- Fraction of n-grams from the shortest text (f) from 0.2 to 1 (increments of 0.1)
Threshold (θ)
- if more than 1 known-author document available (θ2+)
- if only 1 known-author document available (θ1)
Ensemble size and parameters
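To illustrate how n and L enter the computation, here is a minimal sketch of a single (non-ensemble) classifier. It assumes the common n-gram (CNG) dissimilarity of Kešelj et al., which is the usual measure for n-gram profiles; the formulas in the Class notebook may differ. The function names are hypothetical, and assigning the class with the smallest mean distance to its known texts is one possible reading of "minimal distance", not necessarily the one used in the notebooks.

```python
from collections import Counter


def ngram_profile(tokens, n, L):
    """Normalized frequencies of the L most frequent n-grams.

    `tokens` may be a list of words or a string (for character n-grams).
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(L)}


def cng_dissimilarity(p1, p2):
    """CNG dissimilarity: squared relative frequency difference over the union."""
    distance = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        distance += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return distance


def classify_min_distance(tokens, known_texts, n, L):
    """Return the label whose known texts are closest on average (assumed rule).

    `known_texts` maps each label to a list of tokenized texts.
    """
    query = ngram_profile(tokens, n, L)
    mean_distance = {}
    for label, texts in known_texts.items():
        distances = [cng_dissimilarity(query, ngram_profile(t, n, L)) for t in texts]
        mean_distance[label] = sum(distances) / len(distances)
    return min(mean_distance, key=mean_distance.get)
```

An ensemble would run this for several (n, L) combinations and combine the outcomes; that step is left out here.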

Results

On test

On train (among radius distance models)

Only FL

Only LB

Discussion

One can see that in all cases the simplest baseline model (TF-IDF + logistic regression) outperforms all others. Interestingly, one of the radius distance models, namely 4-class 50-word word bigrams, outperforms the NN. Another thing to notice is that character models are outperformed by word models on train, but not on test. Generally, longer n-grams yield better results, but this rule holds more consistently on train than on test.

The question is why the radius distance model is outperformed by the baseline, the simplest of the models. One could argue that this is because the method was created specifically for intrinsic authorship attribution and is not applicable to multiclass classification.

There is another issue connected to this inapplicability. Training each radius distance model took a lot of time (up to 5 hours), while training logistic regression and even a simple NN took almost no time (under 5 minutes). This is one of the limitations of the radius distance algorithm, as its complexity, and thus its running time, is proportional to the number of distance calculations. This, in turn, is proportional to (1) the number of classes, (2) the number of texts in each class, and (3) the profile length. In the article the number of classes was 2 and the number of texts was below 50, which made the time aspect unimportant.
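The scaling argument can be made concrete with a back-of-the-envelope estimate. The sketch below is only an illustration of the proportionality stated above: the helper `distance_operations` is hypothetical, it counts work per classified text, and the example figures (4 classes with 400 texts each versus 2 classes with 50 texts each, at the same profile length) are assumed settings chosen for comparison, not measurements.

```python
def distance_operations(n_classes, texts_per_class, profile_length):
    """Rough cost of classifying one text: one profile comparison per known
    text, each touching on the order of `profile_length` n-grams."""
    return n_classes * texts_per_class * profile_length


# Assumed, illustrative settings; the profile length L is kept equal in both.
article_scale = distance_operations(n_classes=2, texts_per_class=50, profile_length=1000)
report_scale = distance_operations(n_classes=4, texts_per_class=400, profile_length=1000)

# How many times more work per classified text under these assumptions.
ratio = report_scale / article_scale
```

Under these assumptions each classified text costs 16 times more than at the article's scale, and the number of texts to classify grows as well.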

The conclusion is that the method might be well suited for intrinsic authorship attribution, but not for extrinsic authorship attribution, which is essentially the multiclass classification that we had.
