<a href="https://colab.research.google.com/github/ovbystrova/Interference/blob/master/Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Interference

## [Author Verification Using Common N-Gram Profiles of Text Documents](https://www.aclweb.org/anthology/C14-1038.pdf)
The formulas from the [presentation](https://docs.google.com/presentation/d/1BZhBRqKzosFH2LZMjeQsJ-l_2NAoIszGsNeXn3zk0Z8/edit#slide=id.p) are duplicated in the class implementation [notebook](https://github.com/ovbystrova/Interference/blob/master/Class.ipynb).


### Participants:
- Bystrova Olga [(ovbystrova)](https://github.com/ovbystrova) 
- Okhapkia Anna [(eischaire)](https://github.com/eischaire)
- Ryazanskaya Galina [(flying-bear)](https://github.com/flying-bear)


---
## The tasks
Links in the text lead to the notebooks where the corresponding task is done.
### Objective
In the original article, the authors treated intrinsic authorship attribution as a binary classification task: a text was written either by the same author or by someone else. We could have simulated this structure by treating language background (*LB*) and first (native) language (*FL*) as "authors". However, this would not be ecologically valid, as the texts were, of course, written by different authors, and we had no authorship data. Thus, we changed the task to binary classification for LB and multiclass classification for FL.

### Pipeline
1. [preprocessing](https://github.com/ovbystrova/Interference/blob/master/JSON_Files.ipynb)
  1. tokenization for word n-grams (of length n)
  2. truncation so that all texts are of the same length (omitting the shorter texts)
  3. train/test split  (correcting for imbalanced classes!)
    1. on FL, native language
    2. on LB, speaker type
2. building classifiers [for each parameter combination](https://github.com/ovbystrova/Interference/blob/master/Class.ipynb)
  1. calculation of n-gram profiles (P)
  2. cutoff of the most frequent L
  3. distance calculation
3. multiclass classification with minimal distance for each ensemble, averaging the results
    1. on FL, [native language](https://github.com/ovbystrova/Interference/blob/master/Language_Testing.ipynb)
    2. on LB, [speaker type](https://github.com/ovbystrova/Interference/blob/master/LB_Testing.ipynb)
4. building [baselines](https://github.com/ovbystrova/Interference/blob/master/Baseline.ipynb)
  1. TF-IDF + logistic regression
  2. TF-IDF on word bigrams + logistic regression with parameter search
  3. word2vec + logistic regression with parameter search
  4. word2vec + perceptron
5. [comparing results](https://github.com/ovbystrova/Interference/blob/master/Report.ipynb)
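
The profile, cutoff, distance, and minimal-distance classification steps of the pipeline can be sketched as follows. This is only an illustration, not the code from the linked notebooks: the function names are hypothetical, the distance is assumed to be the Common N-Gram (CNG) dissimilarity, and each class is assumed to be represented by a single profile.

```python
from collections import Counter


def ngram_profile(tokens, n, L):
    """Return the L most frequent n-grams of a sequence with normalized frequencies."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(L)}


def cng_dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over the union of two profiles
    (assumed form of the CNG dissimilarity)."""
    dist = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        dist += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return dist


def classify_min_distance(text_profile, class_profiles):
    """Pick the class label whose profile is closest to the text profile."""
    return min(class_profiles,
               key=lambda label: cng_dissimilarity(text_profile, class_profiles[label]))


# Toy usage with word unigrams; the texts and labels are made up.
profiles = {
    "a": ngram_profile("the cat sat on the mat".split(), n=1, L=200),
    "b": ngram_profile("a dog ran in a park".split(), n=1, L=200),
}
query = ngram_profile("the cat on the mat".split(), n=1, L=200)
print(classify_min_distance(query, profiles))
```

Passing a string instead of a token list yields character n-grams, since slicing works on both. No threshold (θ) is involved: the prediction is simply the nearest class.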

### Architectural choices
- We decided to use only ensemble classifiers, as they performed best in the article.
- We decided to truncate all texts to the modal text length and omit all texts shorter than that.
- We decided that we needed to balance the classes and select the same number of texts from each class, landing on two options: 90 and 400 from each class. All classes with fewer data points were omitted.
- Character ensembles were slow and thus were only calculated for LB.
- We decided to use only the number of n-grams (L) to determine the length of a profile and to use multiclass classification with minimal distance, which does not need a threshold (θ). The parameters from the original article are listed below (the ones we used are in bold):

#### **Parameter space**
- size of N-grams (n)
    - **from 3 to 10 for characters**
    - **from 1 to 3 for words**
- size of a profile 
    - **Number of n-grams (L) 200, 500, 1000, 1500, 2000, 2500, 3000**
    - Fraction of n-grams from the shortest text (f) from 0.2 to 1 (increments of 0.1)
- Threshold (θ)
  - if more than 1 known-author document available (θ2+)
  - if only 1 known-author document available (θ1)
- **Ensemble size and parameters**
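
As a rough illustration of the size of this parameter space, the bold entries above can be enumerated as (n-gram type, n, L) combinations. The names below are hypothetical, and how the individual classifiers are grouped into ensembles is not shown here.

```python
from itertools import product

CHAR_N = range(3, 11)   # character n-gram sizes 3..10
WORD_N = range(1, 4)    # word n-gram sizes 1..3
PROFILE_L = [200, 500, 1000, 1500, 2000, 2500, 3000]


def parameter_grid():
    """Enumerate (ngram_type, n, L) combinations for the parameters we kept."""
    grid = [("character", n, L) for n, L in product(CHAR_N, PROFILE_L)]
    grid += [("word", n, L) for n, L in product(WORD_N, PROFILE_L)]
    return grid


grid = parameter_grid()
print(len(grid))  # 8 * 7 + 3 * 7 = 77 combinations
```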

## Results
### On test
![test](https://github.com/ovbystrova/Interference/raw/master/data/on_test.png)
### On train (among radius distance models)
![train](https://github.com/ovbystrova/Interference/raw/master/data/on_train.png)
#### Only FL
![fl](https://github.com/ovbystrova/Interference/raw/master/data/fl_only.png)
#### Only LB
![lb](https://github.com/ovbystrova/Interference/raw/master/data/lb_only.png)

## Discussion
One can see that in all cases the simplest baseline model (TF-IDF + logistic regression) outperforms all others. Interestingly, the radius distance method frequently outperforms the NN on language background, as the NN shows poor results on LB. Another thing to notice is that character models are outperformed by word models on train, but not on test. Generally, longer n-grams yield better results, but this rule holds more strongly on train than on test.

The question is why the radius distance method is outperformed by the baseline, the simplest of the models. One could argue that this is because the method was created specifically for intrinsic authorship attribution and is inapplicable to multiclass classification.

There is another issue connected to this inapplicability. Training each radius distance model took a lot of time (up to 5 hours), while training logistic regression and even a simple NN took almost no time (under 5 minutes). This is one of the limitations of the radius distance algorithm, as its complexity, and thus its running time, is proportional to the number of distance calculations. This, in turn, is proportional to (1) the number of classes, (2) the number of texts in each class, and (3) the profile length. In the article, the number of classes was 2 and the number of texts was below 50, which made the time aspect unimportant.
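
To make the scaling concrete, here is a back-of-the-envelope sketch. It assumes that every query text is compared with every known text and that one distance calculation touches at most twice the profile length in n-grams; the function name and the chosen settings are illustrative, not measurements.

```python
def distance_cost(n_classes, texts_per_class, profile_length, n_queries=1):
    """Rough count of elementary n-gram comparisons under the assumptions above."""
    n_distances = n_queries * n_classes * texts_per_class
    return n_distances * 2 * profile_length


# Roughly the article's setting versus one of our larger settings (illustrative values).
article_like = distance_cost(n_classes=2, texts_per_class=50, profile_length=3000)
ours_like = distance_cost(n_classes=9, texts_per_class=400, profile_length=3000)
print(ours_like / article_like)  # 36.0
```

Under these assumptions, a single query is already 36 times more expensive in the larger setting, before multiplying by the number of texts to classify and the number of parameter combinations.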

The conclusion is that the method might be well suited for intrinsic authorship attribution, but not for extrinsic authorship attribution, which is essentially the multiclass classification that we had.


In [0]:
import pandas as pd
import numpy as np

In [2]:
!wget https://github.com/ovbystrova/Interference/raw/master/data/base_results.csv

--2020-03-29 14:45:23--  https://github.com/ovbystrova/Interference/raw/master/data/base_results.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ovbystrova/Interference/master/data/base_results.csv [following]
--2020-03-29 14:45:23--  https://raw.githubusercontent.com/ovbystrova/Interference/master/data/base_results.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277 [text/plain]
Saving to: ‘base_results.csv.2’


2020-03-29 14:45:23 (37.9 MB/s) - ‘base_results.csv.2’ saved [277/277]



In [3]:
!wget  https://github.com/ovbystrova/Interference/raw/master/data/LB_results.csv

--2020-03-29 14:45:25--  https://github.com/ovbystrova/Interference/raw/master/data/LB_results.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ovbystrova/Interference/master/data/LB_results.csv [following]
--2020-03-29 14:45:25--  https://raw.githubusercontent.com/ovbystrova/Interference/master/data/LB_results.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1241 (1.2K) [text/plain]
Saving to: ‘LB_results.csv.2’


2020-03-29 14:45:25 (239 MB/s) - ‘LB_results.csv.2’ saved [1241/1241]



In [4]:
! wget https://github.com/ovbystrova/Interference/raw/master/data/FL_results.csv

--2020-03-29 14:45:26--  https://github.com/ovbystrova/Interference/raw/master/data/FL_results.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ovbystrova/Interference/master/data/FL_results.csv [following]
--2020-03-29 14:45:26--  https://raw.githubusercontent.com/ovbystrova/Interference/master/data/FL_results.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 803 [text/plain]
Saving to: ‘FL_results.csv.2’


2020-03-29 14:45:26 (189 MB/s) - ‘FL_results.csv.2’ saved [803/803]



In [0]:
columns = ['train/test', 'class number', 'class length', 'ngram type', 'ngram size', 'accuracy', 'class']

In [6]:
lb = pd.read_csv('LB_results.csv')
lb['class length'] = 50
lb['class'] = 'language_background'
lb = lb[['Train/Test', 'Class number', 'class length', 'Profile type', 'Profile length',
       'Accuracy score', 'class']]
lb.head()

Unnamed: 0,Train/Test,Class number,class length,Profile type,Profile length,Accuracy score,class
0,train,9,50,word,1-grams,0.4875,language_background
1,train,9,50,word,2-grams,0.675,language_background
2,train,9,50,word,3-grams,0.6,language_background
3,train,9,50,character,3-grams,0.4875,language_background
4,train,9,50,character,4-grams,0.5875,language_background


In [7]:
lb.columns = columns
lb.head()

Unnamed: 0,train/test,class number,class length,ngram type,ngram size,accuracy,class
0,train,9,50,word,1-grams,0.4875,language_background
1,train,9,50,word,2-grams,0.675,language_background
2,train,9,50,word,3-grams,0.6,language_background
3,train,9,50,character,3-grams,0.4875,language_background
4,train,9,50,character,4-grams,0.5875,language_background


In [8]:
fl = pd.read_csv('FL_results.csv', index_col=0)
fl['class'] = 'native'
fl.head()

Unnamed: 0,train/test mode,class number,class length,ngram_type,ngram_size,accuracy_score,class
0,train,9,50,word,1-grams,0.397222,native
1,train,9,50,word,2-grams,0.558333,native
2,train,9,50,word,3-grams,0.516667,native
3,train,4,50,word,1-grams,0.6125,native
4,train,4,50,word,2-grams,0.69375,native


In [9]:
fl.columns = columns
fl.head()

Unnamed: 0,train/test,class number,class length,ngram type,ngram size,accuracy,class
0,train,9,50,word,1-grams,0.397222,native
1,train,9,50,word,2-grams,0.558333,native
2,train,9,50,word,3-grams,0.516667,native
3,train,4,50,word,1-grams,0.6125,native
4,train,4,50,word,2-grams,0.69375,native


In [0]:
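rows = []
ids = []  # fl rows that found a matching lb row
jds = []  # lb rows that found a matching fl row
# pair fl and lb results that share the same parameters (all columns except accuracy and class)
for i, fl_row in fl.iterrows():
  fl_params = fl_row.iloc[:-2]
  row = ['radius distance'] + fl_row.iloc[:-1].tolist()
  for j, lb_row in lb.iterrows():
    lb_params = lb_row.iloc[:-2]
    if lb_params.tolist() == fl_params.tolist():
      # build a new list so that a second match cannot extend an already stored row
      rows.append(row + [lb_row.iloc[-2]])
      ids.append(i)
      jds.append(j)

# fl results without an lb counterpart
for i, fl_row in fl.iterrows():
  if i not in ids:
    row = ['radius distance'] + fl_row.iloc[:-1].tolist() + [np.nan]
    rows.append(row)

# lb results without an fl counterpart
for j, lb_row in lb.iterrows():
  if j not in jds:
    row = ['radius distance'] + lb_row.iloc[:-2].tolist() + [np.nan] + [lb_row.iloc[-2]]
    rows.append(row)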

res = pd.DataFrame(rows, columns=['model', 'train/test', 'class number', 'class length', 'ngram type',
       'ngram size', 'fl', 'lb'])

In [11]:
res.tail()

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,fl,lb
40,radius distance,test,9,50,character,9-grams,,0.5
41,radius distance,test,9,50,character,10-grams,,0.5
42,radius distance,test,4,50,character,3-grams,,0.5
43,radius distance,test,4,50,character,4-grams,,0.5
44,radius distance,test,4,50,character,5-grams,,0.45
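
The pairing of FL and LB rows by shared parameters could also be written as an outer merge. The sketch below is an alternative, not the code used in this notebook; `merge_fl_lb` is a hypothetical helper, it assumes both frames already carry the renamed `columns`, and the row order of its result may differ from `res`.

```python
import pandas as pd


def merge_fl_lb(fl, lb):
    """Outer-join FL and LB accuracies on the shared parameter columns."""
    keys = ['train/test', 'class number', 'class length', 'ngram type', 'ngram size']
    fl_part = fl[keys + ['accuracy']].rename(columns={'accuracy': 'fl'})
    lb_part = lb[keys + ['accuracy']].rename(columns={'accuracy': 'lb'})
    merged = fl_part.merge(lb_part, on=keys, how='outer')
    merged.insert(0, 'model', 'radius distance')  # same label as in the loops above
    return merged
```

Rows present in only one of the two frames get NaN in the other accuracy column, as in the loop version.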


In [0]:
def prep(ntype):
  if ntype == 'word':
    return 'regex tokens'
  else:
    return 'None'

In [0]:
res['preprocess'] = res['ngram type'].apply(prep)
res['vectorize'] = np.nan
res = res[['model', 'train/test', 'class number', 'class length', 'ngram type',
       'ngram size', 'preprocess', 'vectorize', 'fl', 'lb']]

In [14]:
bs = pd.read_csv('base_results.csv', index_col='id')
bs['class number'] = 15
bs['class length'] = 'full'
bs['train/test'] = 'test'
bs['ngram type'] = np.nan
bs['ngram size'] = np.nan
bs = bs[['model', 'train/test', 'class number', 'class length', 'ngram type','ngram size', 'preprocess', 'vectorize', 'native', 'language_background']]
bs.columns = ['model', 'train/test', 'class number', 'class length', 'ngram type', 'ngram size', 'preprocess', 'vectorize', 'fl', 'lb']
bs.head()

id,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,fl,lb
1,logreg,test,15,full,,,,tf-idf,0.8659,0.889527
2,logreg search,test,15,full,,,regex tokens,tf-idf,0.821839,0.828863
3,logreg search,test,15,full,,,regex tokens,w2v bow mean,0.779055,0.772669
4,1 linear nn,test,15,full,,,regex tokens + lemmas,w2v bow mean,0.70638,0.757812


In [0]:
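def strip_g(s):
  # turn labels such as '2-grams' or 'bi-grams' into the integer n; non-strings (NaN) stay NaN
  if isinstance(s, str):
    s = s.replace('uni', '1').replace('bi', '2').replace('tri', '3')
    # str.strip('-grams') would strip a character set, so remove the suffix explicitly
    return int(s.replace('-grams', ''))
  return np.nan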

In [26]:
final = pd.concat([res, bs])
final = final.drop_duplicates()
final.index = range(len(final))
final['ngram size'] = final['ngram size'].apply(strip_g)
final.sort_values(['fl', 'lb'], ascending=(False, False))

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,fl,lb
45,logreg,test,15,full,,,,tf-idf,0.8659,0.889527
46,logreg search,test,15,full,,,regex tokens,tf-idf,0.821839,0.828863
47,logreg search,test,15,full,,,regex tokens,w2v bow mean,0.779055,0.772669
7,radius distance,test,4,50,word,2.0,regex tokens,,0.75,0.5
48,1 linear nn,test,15,full,,,regex tokens + lemmas,w2v bow mean,0.70638,0.757812
6,radius distance,test,4,50,word,1.0,regex tokens,,0.7,0.5
8,radius distance,test,4,50,word,3.0,regex tokens,,0.7,0.5
16,radius distance,test,4,100,word,1.0,regex tokens,,0.7,
10,radius distance,train,4,50,word,2.0,regex tokens,,0.69375,
15,radius distance,train,4,75,word,2.0,regex tokens,,0.654167,


In [27]:
nonan = final[final['fl'].notna() & final['lb'].notna()]
nonan[nonan['train/test'] == 'test'].sort_values(['fl', 'lb'], ascending=(False, False)).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,fl,lb
45,logreg,test,15,full,,,,tf-idf,0.8659,0.889527
46,logreg search,test,15,full,,,regex tokens,tf-idf,0.821839,0.828863
47,logreg search,test,15,full,,,regex tokens,w2v bow mean,0.779055,0.772669
7,radius distance,test,4,50,word,2.0,regex tokens,,0.75,0.5
48,1 linear nn,test,15,full,,,regex tokens + lemmas,w2v bow mean,0.70638,0.757812
6,radius distance,test,4,50,word,1.0,regex tokens,,0.7,0.5
8,radius distance,test,4,50,word,3.0,regex tokens,,0.7,0.5
4,radius distance,test,9,50,word,2.0,regex tokens,,0.533333,0.45
5,radius distance,test,9,50,word,3.0,regex tokens,,0.488889,0.55
3,radius distance,test,9,50,word,1.0,regex tokens,,0.422222,0.5


In [28]:
nonan[nonan['train/test'] == 'train'].sort_values(['fl', 'lb'], ascending=(False, False)).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,fl,lb
1,radius distance,train,9,50,word,2,regex tokens,,0.558333,0.675
2,radius distance,train,9,50,word,3,regex tokens,,0.516667,0.6
0,radius distance,train,9,50,word,1,regex tokens,,0.397222,0.4875


In [29]:
final_fl = final[final['fl'].notna()].drop(['lb'], axis=1)
final_fl.sort_values(['fl'], ascending=False).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,fl
45,logreg,test,15,full,,,,tf-idf,0.8659
46,logreg search,test,15,full,,,regex tokens,tf-idf,0.821839
47,logreg search,test,15,full,,,regex tokens,w2v bow mean,0.779055
7,radius distance,test,4,50,word,2.0,regex tokens,,0.75
48,1 linear nn,test,15,full,,,regex tokens + lemmas,w2v bow mean,0.70638
16,radius distance,test,4,100,word,1.0,regex tokens,,0.7
6,radius distance,test,4,50,word,1.0,regex tokens,,0.7
8,radius distance,test,4,50,word,3.0,regex tokens,,0.7
10,radius distance,train,4,50,word,2.0,regex tokens,,0.69375
15,radius distance,train,4,75,word,2.0,regex tokens,,0.654167


In [30]:
final_lb = final[final['lb'].notna()].drop(['fl'], axis=1)
final_lb.sort_values(['lb'], ascending=False).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,lb
45,logreg,test,15,full,,,,tf-idf,0.889527
46,logreg search,test,15,full,,,regex tokens,tf-idf,0.828863
47,logreg search,test,15,full,,,regex tokens,w2v bow mean,0.772669
48,1 linear nn,test,15,full,,,regex tokens + lemmas,w2v bow mean,0.757812
1,radius distance,train,9,50,word,2.0,regex tokens,,0.675
27,radius distance,train,9,50,character,10.0,,,0.675
29,radius distance,train,4,50,word,2.0,regex tokens,,0.6625
26,radius distance,train,9,50,character,9.0,,,0.65
25,radius distance,train,9,50,character,8.0,,,0.65
2,radius distance,train,9,50,word,3.0,regex tokens,,0.6


In [35]:
final_lb_ch = final_lb[final_lb['ngram type'] == 'character']
final_lb_ch[final_lb_ch['train/test'] == 'test'].sort_values(['lb'], ascending=False).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,lb
39,radius distance,test,9,50,character,8,,,0.55
34,radius distance,test,9,50,character,3,,,0.5
35,radius distance,test,9,50,character,4,,,0.5
36,radius distance,test,9,50,character,5,,,0.5
37,radius distance,test,9,50,character,6,,,0.5
38,radius distance,test,9,50,character,7,,,0.5
40,radius distance,test,9,50,character,9,,,0.5
41,radius distance,test,9,50,character,10,,,0.5
42,radius distance,test,4,50,character,3,,,0.5
43,radius distance,test,4,50,character,4,,,0.5


In [36]:
final_lb_ch[final_lb_ch['train/test'] == 'train'].sort_values(['lb'], ascending=False).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,lb
27,radius distance,train,9,50,character,10,,,0.675
25,radius distance,train,9,50,character,8,,,0.65
26,radius distance,train,9,50,character,9,,,0.65
21,radius distance,train,9,50,character,4,,,0.5875
24,radius distance,train,9,50,character,7,,,0.5875
22,radius distance,train,9,50,character,5,,,0.5625
23,radius distance,train,9,50,character,6,,,0.5625
33,radius distance,train,4,50,character,5,,,0.525
31,radius distance,train,4,50,character,3,,,0.5
32,radius distance,train,4,50,character,4,,,0.5


In [37]:
final_lb_w = final_lb[final_lb['ngram type'] == 'word']
final_lb_w[final_lb_w['train/test'] == 'test'].sort_values(['lb'], ascending=False).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,lb
5,radius distance,test,9,50,word,3,regex tokens,,0.55
3,radius distance,test,9,50,word,1,regex tokens,,0.5
6,radius distance,test,4,50,word,1,regex tokens,,0.5
7,radius distance,test,4,50,word,2,regex tokens,,0.5
8,radius distance,test,4,50,word,3,regex tokens,,0.5
4,radius distance,test,9,50,word,2,regex tokens,,0.45


In [38]:
final_lb_w[final_lb_w['train/test'] == 'train'].sort_values(['lb'], ascending=False).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,model,train/test,class number,class length,ngram type,ngram size,preprocess,vectorize,lb
1,radius distance,train,9,50,word,2,regex tokens,,0.675
29,radius distance,train,4,50,word,2,regex tokens,,0.6625
2,radius distance,train,9,50,word,3,regex tokens,,0.6
30,radius distance,train,4,50,word,3,regex tokens,,0.575
0,radius distance,train,9,50,word,1,regex tokens,,0.4875
28,radius distance,train,4,50,word,1,regex tokens,,0.4875


In [0]:
def make_av_lb(df):
  return np.sum(df.lb.values) / len(df)

def get_best_lb(df):
  return np.max(df.lb.values)

words

In [0]:
w_tr = ['train', 'word', get_best_lb(final_lb_w[final_lb_w['train/test'] == 'train']), make_av_lb(final_lb_w[final_lb_w['train/test'] == 'train'])]
w_ts = ['test', 'word', get_best_lb(final_lb_w[final_lb_w['train/test'] == 'test']), make_av_lb(final_lb_w[final_lb_w['train/test'] == 'test'])]

characters

In [0]:
ch_tr = ['train', 'character', get_best_lb(final_lb_ch[final_lb_ch['train/test'] == 'train']), make_av_lb(final_lb_ch[final_lb_ch['train/test'] == 'train'])]
ch_ts = ['test', 'character', get_best_lb(final_lb_ch[final_lb_ch['train/test'] == 'test']), make_av_lb(final_lb_ch[final_lb_ch['train/test'] == 'test'])]

In [0]:
cols = ['train/test', 'ngram type', 'max', 'average']
lb_compare = pd.DataFrame(columns = cols)

In [0]:
# DataFrame.append was removed in pandas 2.0, so build the frame from the four rows at once
lb_compare = pd.DataFrame([w_tr, w_ts, ch_tr, ch_ts], columns=cols)

In [79]:
lb_compare.sort_values(['max', 'average'], ascending=(False, False)).style.background_gradient(axis=0, cmap='Reds')

Unnamed: 0,train/test,ngram type,max,average
0,train,word,0.675,0.58125
2,train,character,0.675,0.571591
1,test,word,0.55,0.5
3,test,character,0.55,0.5
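
The same comparison could be computed in one step with a group-by. This is a sketch of an alternative, not the code that produced the table above; `compare_lb` is a hypothetical helper and it assumes a frame shaped like `final_lb`.

```python
import pandas as pd


def compare_lb(final_lb):
    """Best and mean LB accuracy per (train/test, ngram type).

    Rows without an n-gram type (the baselines) are dropped first.
    """
    radius = final_lb.dropna(subset=['ngram type'])
    summary = (radius.groupby(['train/test', 'ngram type'])['lb']
                     .agg(['max', 'mean'])
                     .rename(columns={'mean': 'average'})
                     .reset_index())
    return summary.sort_values(['max', 'average'], ascending=False)
```

This avoids the separate word and character frames and the manual row assembly.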
