# Authorship detection with the author-topic model

Links:  <br />
[Gensim](https://radimrehurek.com/gensim/index.html)  <br />
[Gensim author-topic model help page](https://radimrehurek.com/gensim/models/atmodel.html)  <br />
[Gensim author-topic model tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb)

Relevant papers:  <br />
[Blei et al. 2003](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)  (LDA) <br />
[Rosen-Zvi et al. 2004](https://mimno.infosci.cornell.edu/info6150/readings/398.pdf) (Author-topic models) <br />
[Rosen-Zvi et al. 2010](https://www.researchgate.net/profile/Michal_Rosen-Zvi/publication/220515711_Learning_author-topic_models_from_text_corpora/links/53fb31000cf27c365cf07efd.pdf) (Author-topic models extension) <br />
[Seroussi et al. 2011](http://aclweb.org/anthology/W11-0321) (Authorship attribution with LDA) <br />
[Seroussi et al. 2012](http://anthology.aclweb.org/P/P12/P12-2.pdf#page=292) (Authorship attribution with author-topic models)

![alt text](http://img.blog.csdn.net/20170417124825166?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvbGFmZWVkZmg=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)

Plan for this presentation:

- Read in Reddit data
- Preprocessing
- Build a document-to-author dictionary
- Invert that dictionary into author-to-document format
- Create a 'test set' by anonymizing the author of randomly selected posts (so that each of those documents has a unique author with no other posts)
- Run an author-topic model with Gensim
- Get the topic distribution for each author
- Get the topic distribution of the anonymized posts
- Use the Hellinger distance to find the real author closest to each anonymized test author
- Check predictive accuracy
- Compare results to SVM

## Reading in the Reddit data

The Reddit data consists of one .json file per month, each compressed as .bz2. First, we find all the .bz2 files in a folder.

In [2]:
import pandas as pd

In [3]:
import bz2
import glob
files = glob.glob("data/*.bz2")
files

For the purposes of this tutorial, we use only the first file. It is decompressed, and the JSON it contains is parsed directly into a pandas data frame.

In [None]:
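with bz2.open(files[0], 'rt') as f:
    # Parse straight from the file handle: one JSON object per line.
    # (Passing the raw text to read_json is deprecated in recent pandas.)
    df = pd.read_json(f, lines=True)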

Due to the size of the data, this can take a lot of time and RAM. Consequently, we save the data as a .csv file, which can be re-loaded more easily.

In [7]:
#write to csv
df.to_csv('data/reddit2010-06.csv')
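
If loading a whole month into memory is too expensive, one possible alternative is to stream the compressed file line by line and keep only the fields and subreddit that are needed. The following is a minimal sketch using only the standard library; the helper name `iter_comments` and the toy file are made up for illustration and are not part of the original notebook.

```python
import bz2
import json
import os
import tempfile

def iter_comments(path, subreddit, fields=("author", "body", "subreddit")):
    """Yield one small dict per comment in `subreddit`, reading line by line."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("subreddit") == subreddit:
                yield {key: record.get(key) for key in fields}

# Tiny made-up file so the sketch runs without the real Reddit dump.
toy_records = [
    {"author": "a", "body": "first post", "subreddit": "gaming", "score": 1},
    {"author": "b", "body": "other post", "subreddit": "pics", "score": 2},
    {"author": "c", "body": "third post", "subreddit": "gaming", "score": 3},
]
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.bz2")
with bz2.open(toy_path, "wt", encoding="utf-8") as f:
    for record in toy_records:
        f.write(json.dumps(record) + "\n")

rows = list(iter_comments(toy_path, "gaming"))
print(rows)
```

On the real data, something like `pd.DataFrame(list(iter_comments(files[0], 'gaming')))` should then give a much smaller data frame than parsing every comment first and filtering afterwards.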

In [3]:
#load data
df = pd.read_csv('data/reddit2010-06.csv')
#Note: this raises a warning, but setting low_memory=False as recommended crashes Jupyter



Subset the data to include only what we need.

In [4]:
#need only author body and subreddit variables
df = df[['author', 'body', 'subreddit']]

#restrict to gaming subreddit
df = df[df['subreddit']=='gaming']

#retain only posts with more than 300 characters
df = df[df['body'].str.len() > 300]

#remove [deleted] authors
df = df[df['author']!='[deleted]']

Authors who only have one post can't be predicted, so we will remove them from the dataset.

In [5]:
#count the number of posts per author (like table() in R)
author_counts = df.author.value_counts()

#remove authors who only posted once
authors = author_counts[author_counts!=1]

#get the index labels (i.e. the author names) and turn them into a list
authors = authors.index.tolist()

#subset the dataframe
df = df[df['author'].isin(authors)]

In [6]:
df.shape

(15733, 3)

In [7]:
df.head()

| | author | body | subreddit |
|---|---|---|---|
| 4 | DaimyoNoNeko | I'm like this in far less complicated setup. \... | gaming |
| 126 | AJRiddle | Definitely a good reason. I think it will attr... | gaming |
| 573 | thesearenotthehammer | The people I share with generally don't have t... | gaming |
| 607 | awj | I have the Zelda Reorchestrated version of WW ... | gaming |
| 1056 | Andrewr05 | Just a question here.\n\nWho else thinks that ... | gaming |


In [8]:
#save (and possibly load) the data again, small enough to fit on GitHub
df.to_csv('data/reddit2010-06_subset.csv')
df = pd.read_csv('data/reddit2010-06_subset.csv')

## Preprocessing

This section is largely a copy of the [Gensim tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb), with minor modifications to work with a pandas dataframe instead of the data format used in the tutorial.

In [8]:
import spacy
nlp = spacy.load('en')

In [9]:
docs = []    
for doc in nlp.pipe(df['body'], n_threads=11, batch_size=100):
    # Process document using Spacy NLP pipeline.
    
    ents = doc.ents  # Named entities.

    # Keep only alphabetic tokens (no numbers, no punctuation), drop stopwords,
    # and lemmatize what remains.
    doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

    # Remove common words from a stopword list.
    #doc = [token for token in doc if token not in STOPWORDS]

    # Add named entities, but only if they are a compound of more than one word.
    doc.extend([str(entity) for entity in ents if len(entity) > 1])
    
    docs.append(doc)

In [12]:
# Compute bigrams.
from gensim.models import Phrases
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)



In [13]:
# Create a dictionary representation of the documents, and filter out frequent and rare words.

from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

# Remove rare and common tokens.
# Filter out words that occur too frequently or too rarely.
max_freq = 0.5
min_wordcount = 20
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)

_ = dictionary[0]  # This sort of "initializes" dictionary.id2token.
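
To make the two thresholds concrete, here is a small standard-library sketch of document-frequency filtering on a toy corpus. It only illustrates the idea behind `no_below` and `no_above`; it is not Gensim's implementation (for example, it ignores the cap on vocabulary size), and the function name and toy documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return the set of tokens kept under the two document-frequency bounds."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["game", "play", "blizzard"],
    ["game", "play", "review"],
    ["game", "steam"],
    ["game", "play"],
]
# Keep tokens found in at least 2 documents and in at most 75% of them.
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

In this toy corpus 'game' occurs in every document and is dropped as too common, while 'blizzard', 'review' and 'steam' each occur once and are dropped as too rare, leaving only 'play'.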

In [14]:
# Vectorize data.

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
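
For readers unfamiliar with the bag-of-words format, the sketch below mimics what a single entry of `corpus` looks like: a list of (token id, count) pairs, with out-of-vocabulary tokens dropped. The helper `to_bow` and the toy vocabulary are assumptions for illustration, not Gensim code.

```python
from collections import Counter

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

token2id = {"game": 0, "play": 1, "review": 2}  # hypothetical toy vocabulary
bow = to_bow(["play", "game", "play", "unknown_word"], token2id)
print(bow)
```

Word order is discarded in this representation, which is one reason the model captures what a post is about rather than how it is written.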

Next we need the mapping from documents to authors, which is built differently for this dataset than in the Gensim tutorial.

In [15]:
df['docID'] = range(0,len(df['body']))

In [16]:
doc2author = pd.Series(df.author.values, index=df.docID).to_dict()

Let's inspect the dimensionality of our data.

In [18]:
print('Number of authors: %d' % df['author'].nunique())
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of authors: 3101
Number of unique tokens: 4023
Number of documents: 15733


## Training the author-topic model

An author-topic model is normally run as follows in Gensim.

In [19]:
from gensim.models import AuthorTopicModel

In [20]:
#model = AuthorTopicModel(corpus=corpus, num_topics=100, id2word=dictionary.id2token, \
#                    doc2author=doc2author, chunksize=100, passes=100, gamma_threshold=0.001, \
#                    eval_every=0, iterations=1, random_state=1)

The information about which author wrote which document is stored in the doc2author object (or, alternatively, in author2doc). To create a test set, we replace the author of each randomly sampled document with a unique numbered identifier.

In [21]:
import random
random.seed(1)
#size of the test set is a fifth of the dataset
n_test = round(len(corpus)/5)
#randomly sample indices to be 'anonymized' and tested
#(random.sample needs a sequence in recent Python versions, hence the list)
sample_indices = random.sample(list(doc2author.keys()), n_test)
#create new dictionary
doc2author_test = doc2author.copy()
#randomly replace entries
for k in (sample_indices):
    doc2author_test[k] = 'test_id_' + str(k)
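
One side effect of sampling test documents uniformly is worth keeping in mind: an author with only a few posts may lose all of them to the test set, in which case that author no longer exists in the training data and the corresponding test documents cannot be attributed correctly. The sketch below illustrates this on a made-up mapping; `toy_doc2author` and the other names are hypothetical, and nothing here was measured on the real data.

```python
import random

# Toy doc -> author mapping (hypothetical data, same shape as doc2author).
toy_doc2author = {0: "ann", 1: "ann", 2: "bob", 3: "bob", 4: "cat",
                  5: "cat", 6: "ann", 7: "bob", 8: "cat", 9: "ann"}

rng = random.Random(1)
test_ids = rng.sample(list(toy_doc2author), 4)

# Anonymize the sampled documents, in the same way as the cell above.
toy_test = dict(toy_doc2author)
for k in test_ids:
    toy_test[k] = "test_id_" + str(k)

# Authors that still have at least one non-anonymized document.
train_authors = {a for a in toy_test.values() if not a.startswith("test_id_")}

# A test document is only recoverable if its true author is still in training.
recoverable = [k for k in test_ids if toy_doc2author[k] in train_authors]
print(len(recoverable), "of", len(test_ids), "test documents are recoverable")
```

Running the same check on `doc2author` and `doc2author_test` would give an upper bound on the accuracy that any attribution method could reach with this split.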

Unfortunately, the author-topic model in Gensim doesn't seem to work correctly when given doc2author, so we re-map the dictionary to the author2doc format instead:

In [22]:
#re-map test dictionary
import collections
author2doc_test = collections.defaultdict(set)
for k, v in doc2author_test.items():
    author2doc_test[v].add(k)
author2doc_test

defaultdict(set,
            {'DaimyoNoNeko': {0, 3187},
             'AJRiddle': {1, 8574},
             'thesearenotthehammer': {2,
              7,
              11,
              18,
              26,
              32,
              39,
              41,
              45,
              355,
              361,
              366,
              368,
              382,
              390,
              402,
              415,
              419,
              422,
              438,
              442,
              503,
              688,
              709,
              741,
              800,
              1093,
              1100,
              6675},
             'awj': {3, 10301, 10961, 12092, 12547, 14873},
             'Andrewr05': {4, 836, 841, 9751, 15469},
             'Zhiroc': {5},
             'rhiesa': {6, 16, 5398, 5410, 5660, 9579},
             'thehybridfrog': {8, 3181, 3185, 7736, 8756, 9168},
             'test_id_9': {9},
             'marsvolta': {10, 14021},
      

We now fit the author-topic model again, this time with the randomly selected author names anonymized.

Parameters:

**num_topics**: The number of topics in the model. There is no 'correct' value; it depends on how many distinct topics occur in the corpus. 100 is generally a reasonable compromise for a corpus this size.  <br />
**chunksize**: The size of the mini-batches, in documents. The right value depends on the corpus: the default is 2000, which makes no sense for a corpus of only 1000 documents. Here we use 100. <br />
**passes**: The number of full passes over the corpus during training. Gensim's default is 1; this notebook uses 100. <br />
**iterations**: The maximum number of times the model loops over each document within an update. <br />
**alpha**: The prior on the author-topic distributions. Can be set to 'asymmetric'. <br />
**eta**: The prior on the topic-word distributions. Can be set to 'auto', which learns an asymmetric prior over words directly from the data.
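
To get a feel for how chunksize and passes interact, the following back-of-the-envelope sketch counts how many mini-batches a training run with the settings used in this notebook processes. It assumes, without having verified it on this model, that each chunk triggers one update step, as with Gensim's default `update_every=1`.

```python
import math

n_docs = 15733    # number of documents reported above
chunksize = 100   # value used in this notebook
passes = 100      # value used in this notebook

chunks_per_pass = math.ceil(n_docs / chunksize)
total_chunks = chunks_per_pass * passes
print(chunks_per_pass, total_chunks)  # 158 15800

# For comparison, the default chunksize of 2000 with the same number of passes.
default_chunks = math.ceil(n_docs / 2000) * passes
print(default_chunks)  # 800
```

Smaller chunks mean many more, noisier updates per pass, which is part of the reason training takes as long as it does here.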

In [23]:
%%time
model = AuthorTopicModel(corpus=corpus, num_topics=100, id2word=dictionary.id2token, \
                    author2doc=author2doc_test, chunksize=100, passes=100, gamma_threshold=0.001, \
                    eval_every=0, iterations=1, random_state=1)

CPU times: user 1h 24min, sys: 3h 48min 25s, total: 5h 12min 25s
Wall time: 36min 22s


In [24]:
# Save model. 
model.save('./results/model_presentation.atmodel')

In [18]:
#import pandas as pd
#import spacy
#from gensim.models import Phrases
#from gensim.corpora import Dictionary
#from gensim.models import AuthorTopicModel

#Load model
model = AuthorTopicModel.load('./results/model_presentation.atmodel')

### Results

In [25]:
model.top_topics(corpus)

[([(0.11524080601615801, 'blizzard'),
   (0.06348996805572428, 'people'),
   (0.06080680697483444, 'wow'),
   (0.051554455074081885, 'play'),
   (0.051000143480809439, 'like'),
   (0.041115405637865865, 'change'),
   (0.029130594405093819, 'year'),
   (0.028900516716012882, 'want'),
   (0.025712654564630271, 'email'),
   (0.024143530503241445, 'go'),
   (0.023392625382483288, 'log'),
   (0.023154980379645067, 'thing'),
   (0.022842816747554312, 'say'),
   (0.02261281654191697, 'think'),
   (0.021938031091934145, 'know'),
   (0.019360093182063731, 'account'),
   (0.018030590240599491, 'try'),
   (0.017234923254724921, 'suggest'),
   (0.01614834435283926, 'steal'),
   (0.016095084024320683, 'come')],
  -2.367644650384936),
 ([(0.087398092715305215, 'like'),
   (0.068720510645649921, 'think'),
   (0.065725920281471645, 'bad'),
   (0.061489760784434438, 'good'),
   (0.036326435645632427, 'pro'),
   (0.03333568510962269, 'know'),
   (0.03320273434547788, 'review'),
   (0.032553078000057502,

## Authorship attribution

### Hellinger Distance

$$
D(\theta_1, \theta_2) = \frac{1}{\sqrt{2}} \sqrt{\sum_{t=1}^T (\sqrt{\theta_{1,t}} - \sqrt{\theta_{2,t}})^2}
$$

where  <br />
$\theta_i$ is a T-dimensional multinomial topic distribution  <br />
$\theta_{i,t}$ is the probability of the t-th topic
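
As a quick sanity check of the formula, here is a minimal pure-Python version applied to three toy pairs of distributions. The function name and the example vectors are illustrative only; the notebook itself uses Gensim's `matutils.hellinger`.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)

# Identical distributions, distributions with no shared topic, and a partial overlap.
same = hellinger([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
disjoint = hellinger([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
partial = hellinger([0.5, 0.5, 0.0], [0.5, 0.0, 0.5])
print(same, disjoint, partial)
```

The distance is 0 for identical distributions and 1 for distributions that share no topic, so it is bounded between 0 and 1.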

To predict authors, each test document is assigned its own 'fake' author, whose topic distribution is then compared with those of the real authors via the Hellinger distance:

In [26]:
from gensim.similarities import MatrixSimilarity

The functions below compute Hellinger-based similarities. They are mostly taken from the Gensim author-topic tutorial, with some modifications to make them work for prediction.

In [27]:
import re
from gensim import matutils

# Make a list of all the author-topic distributions.
author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]

def similarity(vec1, vec2):
    '''Get similarity between two vectors'''
    dist = matutils.hellinger(matutils.sparse2full(vec1, model.num_topics), \
                              matutils.sparse2full(vec2, model.num_topics))
    sim = 1.0 / (1.0 + dist)
    return sim

def get_sims(vec):
    '''Get similarity of vector to all authors.'''
    sims = [similarity(vec, vec2) for vec2 in author_vecs]
    return sims

def get_bestmatch(name, top_n=10, smallest_author=1):
    '''
    Rank all real (non-test) authors by similarity to author `name`
    and return the name of the best match.
    '''
    
    # Get similarities.
    sims = get_sims(model.get_author_topics(name))

    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for elem in enumerate(sims):
        author_name = model.id2author[elem[0]]
        sim = elem[1]
        author_size = len(model.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))
    
    
    # Turn the similarities table into a pandas dataframe.
    df2 = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    # Remove the anonymized test authors (including `name` itself).
    df2 = df2[~df2['Author'].str.contains("test_id_")]
    # Sort and keep the top_n predictions.
    df2 = df2.sort_values('Score', ascending=False)[:top_n]
    
    bestmatch = df2.Author.iloc[0]
    
    return bestmatch

def get_table(name, top_n=10, smallest_author=1):
    '''
    Get table with similarities, author names, and author sizes.
    Return `top_n` authors as a dataframe.
    
    '''
    
    # Get similarities.
    sims = get_sims(model.get_author_topics(name))

    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for elem in enumerate(sims):
        author_name = model.id2author[elem[0]]
        sim = elem[1]
        author_size = len(model.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))
            
    df2 = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df2 = df2.sort_values('Score', ascending=False)[:top_n]
    
    return df2

### Calculate predicted values:

In [29]:
%%time
doc2author_predict = doc2author.copy()
#replace the anonymized entries with the predicted author
for k in (sample_indices):
    doc2author_predict[k] = get_bestmatch('test_id_' + str(k))

CPU times: user 18min 55s, sys: 44 ms, total: 18min 55s
Wall time: 18min 55s
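
Most of this time is probably spent looping over all authors in Python for every test document. One possible speed-up, sketched here with NumPy on made-up data, is to stack the dense author vectors into a matrix once and compute all Hellinger distances for a query in a single vectorized step. The names `hellinger_to_all`, `author_matrix`, `author_names` and `query` are hypothetical; on the real model, the matrix could presumably be built by applying `matutils.sparse2full` to each entry of `author_vecs`.

```python
import numpy as np

def hellinger_to_all(query, author_matrix):
    """Hellinger distance from one topic vector to every row of `author_matrix`."""
    diff = np.sqrt(author_matrix) - np.sqrt(query)
    return np.sqrt((diff ** 2).sum(axis=1)) / np.sqrt(2)

# Hypothetical dense topic distributions: 3 authors x 4 topics.
author_names = ["ann", "bob", "cat"]
author_matrix = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
query = np.array([0.6, 0.2, 0.1, 0.1])

dists = hellinger_to_all(query, author_matrix)
best = author_names[int(np.argmin(dists))]  # smallest distance = best match
print(best, dists.round(3))
```

Because the similarity used above is a decreasing function of the distance, picking the smallest distance selects the same author as picking the highest score. The test authors would still need to be masked out before taking the minimum.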


In [30]:
#predicted authors
pred = pd.Series(list(doc2author_predict.values()))[sample_indices]

In [31]:
#real authors
actual = pd.Series(list(doc2author.values()))[sample_indices]

### Correctly predicted:

In [32]:
(pred == actual).mean()

0.025738798856053385

:(  <br /> <br />
Only about 2.6% prediction accuracy.

Looking at some authors and texts to find out what went wrong:

In [34]:
#look at the predictions for a specific post
get_table('test_id_63')

| | Author | Score | Size |
|---|---|---|---|
| 5147 | test_id_63 | 1.0 | 1 |
| 4682 | test_id_4252 | 0.617851 | 1 |
| 3985 | test_id_15271 | 0.608137 | 1 |
| 1714 | dietsoda | 0.599374 | 1 |
| 1914 | ginja_ninja | 0.592131 | 16 |
| 669 | LeMango | 0.588433 | 3 |
| 554 | IndigoMoss | 0.588137 | 11 |
| 98 | Azeltir | 0.587495 | 7 |
| 786 | MsgGodzilla | 0.584094 | 20 |
| 6043 | twich35 | 0.583854 | 3 |
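
When reading these scores, it helps to remember that they are similarities of the form 1 / (1 + distance), not distances. Since the Hellinger distance lies between 0 and 1, the score can only range from 0.5 to 1. The small sketch below converts between the two; the helper names are made up for illustration.

```python
def distance_to_score(dist):
    """Similarity used in this notebook: sim = 1 / (1 + dist)."""
    return 1.0 / (1.0 + dist)

def score_to_distance(score):
    """Invert the similarity to recover the Hellinger distance."""
    return 1.0 / score - 1.0

# Best and worst possible scores, for distances 0 and 1.
print(distance_to_score(0.0), distance_to_score(1.0))  # 1.0 0.5

# Distance behind the highest score that is not the test author itself.
print(score_to_distance(0.617851))
```

Seen this way, even the closest candidates in the table are at a Hellinger distance of more than 0.6 from the test document, and the top real authors are all within a narrow band of each other.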


In [35]:
#look at that post
df.iloc[63]

author                                                     elt
body         My experience was that it was great in concept...
subreddit                                               gaming
docID                                                       63
Name: 12298, dtype: object

In [36]:
#look at all the posts of the real author
df[df.author=="elt"]

| | author | body | subreddit | docID |
|---|---|---|---|---|
| 12298 | elt | My experience was that it was great in concept... | gaming | 63 |
| 350187 | elt | I am jealous! I had, at different times, a 50... | gaming | 1729 |
| 1891894 | elt | I used the Clear Sky Complete mod\n\nhttp://ar... | gaming | 8628 |
| 2223104 | elt | Agreed. Hopefully the simplification is just ... | gaming | 9808 |
| 2689033 | elt | Man I wish I could help you, I'm in the same p... | gaming | 11355 |


It seems that people don't necessarily post about the same topics consistently, so a topic model captures what a post is about rather than who wrote it, which is pretty much the opposite of what we need here.