# SCC.413 Applied Data Mining
# Week 18
# Authorship attribution (with single Tweets)

## Contents
* [Introduction](#intro)
* [Preamble](#preamble)
* [Document object and processors](#doc)
* [Dataset](#data)
* [Pipeline](#pipeline)
* [Error analysis](#error)
* [Exercise](#ex)

<a name="intro"></a>
## Introduction

In previous labs you classified a collection of Tweets from a single user as one document (instance) in the classifier. Here, we instead treat each individual Tweet as a document and attempt to classify it. The task is authorship attribution, i.e. predicting which user wrote the Tweet.

<a name="preamble"></a>
## Preamble
Below are imports and helper functions from previous labs.

In [None]:
!pip install ftfy

In [None]:
import ftfy
import nltk
import json

from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import Binarizer, StandardScaler

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import re

from collections import Counter
from os import listdir, makedirs
from os.path import isfile, join, splitext, split

As in previous notebooks, you should upload all of the provided files to a Google Drive folder; you can then access these files from your Python code. See also the Files tab.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

We save the folder we are working from as a variable for easy access. You may need to edit the path to match your own.

In [None]:
working_folder = '/content/gdrive/MyDrive/413/wk19/'

The code below adds the working folder to the system path, so you can import Python files from this folder.

In [None]:
import sys
sys.path.append(working_folder)

A couple of helper functions for showing classifier results (from the first classification lab):

In [None]:
def print_cv_scores_summary(name, scores):
    print("{}: mean = {:.2f}%, sd = {:.2f}%, min = {:.2f}%, max = {:.2f}%".format(name, scores.mean()*100, scores.std()*100, scores.min()*100, scores.max()*100))
    
def confusion_matrix_heatmap(cm, index):
    cmdf = pd.DataFrame(cm, index = index, columns=index)
    dims = (10, 10)
    fig, ax = plt.subplots(figsize=dims)
    sns.heatmap(cmdf, annot=True, cmap="coolwarm", center=0)
    ax.set_ylabel('Actual')    
    ax.set_xlabel('Predicted')
    
def confusion_matrix_percent_heatmap(cm, index):
    cmdf = pd.DataFrame(cm, index = index, columns=index)
    percents = cmdf.div(cmdf.sum(axis=1), axis=0)*100
    dims = (10, 10)
    fig, ax = plt.subplots(figsize=dims)
    sns.heatmap(percents, annot=True, cmap="coolwarm", center=0, vmin=0, vmax=100)
    ax.set_ylabel('Actual')    
    ax.set_xlabel('Predicted')
    cbar = ax.collections[0].colorbar
    cbar.set_ticks([0, 25, 50, 75, 100])
    cbar.set_ticklabels(['0%', '25%', '50%', '75%', '100%'])

<a name="doc"></a>
## Document object and processors

Functions for processing text, and a `Document` class that stores the features extracted from it.

In [None]:
hashtag_re = re.compile(r"#\w+")
mention_re = re.compile(r"@\w+")
url_re = re.compile(r"(?:https?://)?(?:[-\w]+\.)+[a-zA-Z]{2,9}[-\w/#~:;.?+=&%@~]*")

def preprocess(text):
    p_text = hashtag_re.sub("[hashtag]",text)
    p_text = mention_re.sub("[mention]",p_text)
    p_text = url_re.sub("[url]",p_text)
    p_text = ftfy.fix_text(p_text)
    return p_text

tokenise_re = re.compile(r"(\[[^\]]+\]|[-'\w]+|[^\s\w\[']+)") #([]|words|other non-space)
def tokenise(text):
    return tokenise_re.findall(text)

class Document:
    def __init__(self, meta=None):
        self.meta = meta if meta is not None else {} #avoid sharing one mutable default dict between instances.
        self.tokens_fql = Counter() #empty Counter, ready to be added to with Counter.update.
        self.ht_fql = Counter()
        self.num_tokens = 0
        self.text = ""
        
    def extract_features_from_text(self, text):
        hts = hashtag_re.findall(text)
        self.ht_fql.update([ht.lower() for ht in hts])
        p_text = preprocess(text)
        tokens = tokenise(p_text)
        lower_tokens = [t.lower() for t in tokens]
        self.num_tokens += len(lower_tokens)
        self.tokens_fql.update(lower_tokens) #updating Counter counts items in list, adding to existing Counter items.
        self.text += text
        
    def extract_features_from_texts(self, texts): #texts should be iterable text lines, e.g. read in from file.
        for text in texts:
            self.extract_features_from_text(text) #must be called on self; the bare name is undefined here.
            
    def average_token_length(self):
        sum_lengths = 0
        for key, value in self.tokens_fql.items():
            sum_lengths += len(key) * value
        return sum_lengths / self.num_tokens if self.num_tokens else 0.0 #guard against documents with no tokens.
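
To see what the processing functions above do to a single Tweet, the following sketch re-declares the notebook's regular expressions and runs them on a made-up Tweet. The `ftfy.fix_text` step is left out so that the snippet runs with the standard library alone; `preprocess_no_ftfy` and the sample text are invented for illustration.

```python
import re
from collections import Counter

# Same patterns as in the notebook cell above.
hashtag_re = re.compile(r"#\w+")
mention_re = re.compile(r"@\w+")
url_re = re.compile(r"(?:https?://)?(?:[-\w]+\.)+[a-zA-Z]{2,9}[-\w/#~:;.?+=&%@~]*")
tokenise_re = re.compile(r"(\[[^\]]+\]|[-'\w]+|[^\s\w\[']+)")

def preprocess_no_ftfy(text):
    """Replace hashtags, mentions and URLs with placeholders (no ftfy step)."""
    p_text = hashtag_re.sub("[hashtag]", text)
    p_text = mention_re.sub("[mention]", p_text)
    p_text = url_re.sub("[url]", p_text)
    return p_text

# Made-up Tweet, for illustration only.
sample = "Great debate with @someone today! #Budget2019 https://example.com/a1"

# Hashtags are collected from the raw text, before they are replaced.
hashtags = [ht.lower() for ht in hashtag_re.findall(sample)]

# Tokens are taken from the preprocessed text and lowercased.
tokens = [t.lower() for t in tokenise_re.findall(preprocess_no_ftfy(sample))]
tokens_fql = Counter(tokens)

print(hashtags)
print(tokens)
```

Note that the placeholders survive tokenisation as single tokens, so the word features record that a Tweet contained a mention or URL without recording which one.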

A transformer that extracts features from each `Document` via a callable passed in as `process_method`:

In [None]:
class DocumentProcessor(BaseEstimator, TransformerMixin):
    def __init__(self, process_method):
        self.process_method = process_method
    
    def fit(self, X, y=None): #no fitting necessary, although could use this to build a vocabulary for all documents, and then limit to set (e.g. top 1000).
        return self

    def transform(self, documents):
        for document in documents:
            yield self.process_method(document)
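
The transformer above yields one feature dictionary per document, which a later step has to turn into a numeric matrix. As a rough illustration of that idea, here is a toy stand-in written in plain Python. It is not scikit-learn's implementation; `vectorise_dicts` and the sample texts are hypothetical.

```python
from collections import Counter

def vectorise_dicts(feature_dicts):
    """Toy dict vectoriser: one column per feature name, one row per document."""
    feature_dicts = list(feature_dicts)  # a generator can only be consumed once
    names = sorted({name for d in feature_dicts for name in d})
    rows = [[d.get(name, 0) for name in names] for d in feature_dicts]
    return names, rows

# A generator of Counters, mimicking what a processor's transform() yields.
docs = (Counter(text.split()) for text in ["the vote is today", "the the budget"])
names, rows = vectorise_dicts(docs)

print(names)
for row in rows:
    print(row)
```

Features missing from a document simply become zeros in its row, which is why very short documents such as single Tweets produce very sparse rows.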

In [None]:
def get_tokens_fql(document):
    return document.tokens_fql

def get_ht_fql(document):
    return document.ht_fql

def get_text_stats(document):
    ttr = len(document.tokens_fql) / document.num_tokens if document.num_tokens else 0.0 #type-token ratio; guard against documents with no tokens.
    return {'avg_token_length': document.average_token_length(), 'ttr': ttr }

def read_list(file):
    with open(file) as f:
        items = []
        lines = f.readlines()
        for line in lines:
            items.append(line.strip())
    return items

fws = read_list(working_folder + "functionwords.txt")

def get_fws_fql(document):
    fws_fql = Counter({t: document.tokens_fql[t] for t in fws}) #dict comprehension, t: fql[t] is token: freq.
    return +fws_fql
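
Two details in the cell above are easy to miss: the unary `+` on a `Counter` drops entries whose count is zero or below, and the type-token ratio (`ttr`) is the number of distinct tokens divided by the total number of tokens. A small sketch with an invented token list and a hypothetical four-word function word list:

```python
from collections import Counter

tokens = ["the", "vote", "is", "on", "the", "budget"]  # invented example
tokens_fql = Counter(tokens)

# Hypothetical stand-in for the contents of functionwords.txt.
function_words = ["the", "is", "of", "and"]

# Looking up a missing key in a Counter returns 0 without adding the key.
fws_fql = Counter({t: tokens_fql[t] for t in function_words})
print(fws_fql)   # still contains 'of' and 'and' with count 0

kept = +fws_fql  # unary plus keeps only strictly positive counts
print(kept)

ttr = len(tokens_fql) / len(tokens)  # distinct tokens / all tokens
print(round(ttr, 3))
```

Dropping the zero counts keeps the feature dictionaries small, since most function words will not appear in any single Tweet.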

<a name="data"></a>
## Dataset

Import each Tweet as a single `Document`, with the user's metadata attached. You could use other metadata to predict the user's party (or age/gender from the celebs data) from a single Tweet.

In [None]:
def import_tweets_json(folder):
    jsonfiles = [join(folder, f) for f in listdir(folder) if isfile(join(folder, f)) and f.endswith(".json")]
    for jf in jsonfiles:
        with open(jf) as f:
            data = json.load(f)
            tweets = data.pop('tweets')
            metadata = data
        print("Processing " + metadata['screen_name'])
        for tweet in tweets:
            doc = Document(meta=metadata)
            doc.extract_features_from_text(tweet['text'])
            yield doc

Here we are using a small collection of previously collected MP Twitter users. You could use users from the celebs dataset instead. Just be aware that for some datasets the `import_tweets_json` method will need updating to take `full_text` from the tweet, not `text`.
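
One way to handle both field names is a small helper that prefers `full_text` and falls back to `text`. This is a sketch only: `tweet_text` is a hypothetical name, and it assumes each tweet is a dict carrying at least one of the two keys.

```python
def tweet_text(tweet):
    """Return the Tweet body, preferring 'full_text' when the dataset provides it."""
    if 'full_text' in tweet:
        return tweet['full_text']
    return tweet['text']

# Invented records showing the two shapes a dataset might use.
print(tweet_text({'text': 'short form'}))
print(tweet_text({'text': 'short fo', 'full_text': 'short form, untruncated'}))
```

Inside `import_tweets_json`, the call `doc.extract_features_from_text(tweet['text'])` could then become `doc.extract_features_from_text(tweet_text(tweet))`.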

In [None]:
corpus = []
corpus.extend(import_tweets_json(working_folder + "mps-json-10"))

In [None]:
y = [d.meta['screen_name'] for d in corpus]
X = corpus

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state = 0, stratify=y)
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))
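
Passing `stratify=y` asks the split to keep each author's share of the Tweets roughly the same in the training and test sets. A quick way to check this is to compare label proportions; the helper name `label_proportions` and the label lists below are invented for illustration.

```python
from collections import Counter

def label_proportions(labels):
    """Map each label to its share of the list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Invented labels: 20 Tweets, split 80/20 while preserving the 3:1 ratio.
y_train_demo = ["@user_a"] * 12 + ["@user_b"] * 4
y_test_demo = ["@user_a"] * 3 + ["@user_b"] * 1

print(label_proportions(y_train_demo))
print(label_proportions(y_test_demo))
```

In the notebook, the same check would be `label_proportions(y_train)` and `label_proportions(y_test)`.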

<a name="pipeline"></a>
## Pipeline

Here is a sample pipeline to be used with grid search. It combines a feature union of hashtags, words or function words, and some text statistics with either a Naive Bayes or a logistic regression classifier.

In [None]:
model = Pipeline([
    ('union', FeatureUnion(
        transformer_list = [
            ('hts', Pipeline([
                ('processor', DocumentProcessor(process_method = get_ht_fql)),
                ('vectorizer', DictVectorizer()),
                ('binarize', Binarizer()),
            ])),
            ('word', Pipeline([
                ('processor', DocumentProcessor(process_method = None)), # to be set by grid search.
                ('vectorizer', DictVectorizer()),
                ('binarize', Binarizer()),
            ])),
            ('stats', Pipeline([
                ('processor', DocumentProcessor(process_method = get_text_stats)),
                ('vectorizer', DictVectorizer()),
            ])),
        ],
    )),
    ('selector', SelectKBest(score_func = chi2)),
    ('standardize', StandardScaler(with_mean=False)),
    ('clf', None), # to be set by grid search.
])

param_grid={
    'union__word__processor__process_method': [get_fws_fql, get_tokens_fql],
    'selector__k': [50, 100, 150, 500],
    'clf': [MultinomialNB(), LogisticRegression(solver='liblinear', random_state=0, multi_class='ovr')],
}

In [None]:
search = GridSearchCV(model, cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0), 
                      return_train_score = False, 
                      scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted'],
                      refit = 'f1_weighted',
                      param_grid = param_grid
                     )

search.fit(X_train, y_train)
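
Before running the search it can help to estimate how much work it involves. The sketch below enumerates the parameter combinations with `itertools.product`, using strings as placeholders for the functions and classifiers in `param_grid`.

```python
from itertools import product

# Same shape as param_grid above, with strings standing in for the real objects.
grid = {
    'union__word__processor__process_method': ['get_fws_fql', 'get_tokens_fql'],
    'selector__k': [50, 100, 150, 500],
    'clf': ['MultinomialNB', 'LogisticRegression'],
}
n_splits = 5  # folds in the StratifiedKFold

combinations = list(product(*grid.values()))
cv_fits = len(combinations) * n_splits

print(len(combinations), "parameter combinations")
print(cv_fits, "cross-validation fits, plus one refit of the best configuration")
```

Each extra option in the grid multiplies the total, so adding parameters one at a time keeps the run time manageable.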

In [None]:
pd.DataFrame(search.cv_results_)

In [None]:
search.best_params_

In [None]:
predictions = search.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

confusion_matrix_percent_heatmap(confusion_matrix(y_test,predictions), search.classes_)

Once evaluated, we can see that we can predict the author of a Tweet quite accurately, with some users easier to predict than others. Why might this be?
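
The percentage heatmap divides each row of the confusion matrix by its row total, so the diagonal shows, for each author, the percentage of their Tweets that were attributed correctly (per-author recall). The same arithmetic in plain Python, on an invented two-author matrix:

```python
def row_percentages(cm):
    """Convert each row of a confusion matrix to percentages of its row total."""
    result = []
    for row in cm:
        total = sum(row)
        result.append([100 * value / total if total else 0.0 for value in row])
    return result

# Invented counts: rows are actual authors, columns are predicted authors.
cm = [[40, 10],
      [5, 45]]

pct = row_percentages(cm)
recall = [pct[i][i] for i in range(len(pct))]  # diagonal = per-author recall (%)

print(pct)
print(recall)
```

Comparing the diagonal values across authors is a direct way to see which users are easier to predict.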

<a name="error"></a>
## Error analysis

We can perform some error analysis by looking at the text alongside the predictions:

In [None]:
X_test_texts = [x.text for x in X_test]

In [None]:
df = pd.DataFrame(list(zip(X_test_texts,y_test,predictions)), columns=["Tweet", "Actual", "Predicted"])

In [None]:
pd.options.display.max_colwidth = 300
df.head(10)

For example, we can see when Theresa May's tweets are predicted incorrectly: 

In [None]:
df[df['Actual'].str.match("@theresa_may") & ~df['Predicted'].str.match("@theresa_may")]

Or when predicted as a specific other person:

In [None]:
df[df['Actual'].str.match("@theresa_may") & df['Predicted'].str.match("@jeremycorbyn")]
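
Rather than inspecting one author pair at a time, it can be useful to count which confusions are most frequent and start with those. A minimal sketch using `Counter`; the helper name `confusion_pairs` and the label lists are invented for illustration.

```python
from collections import Counter

def confusion_pairs(actual, predicted):
    """Count (actual, predicted) pairs for the misclassified items only."""
    return Counter((a, p) for a, p in zip(actual, predicted) if a != p)

# Invented labels standing in for y_test and predictions.
actual = ["@a", "@a", "@b", "@b", "@c", "@a"]
predicted = ["@a", "@b", "@b", "@a", "@a", "@b"]

pairs = confusion_pairs(actual, predicted)
for (a, p), count in pairs.most_common():
    print(f"{a} predicted as {p}: {count}")
```

In the notebook, `confusion_pairs(y_test, predictions)` would give the pairs worth filtering `df` on first.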

<a name="ex"></a>
## Exercise

The classifier above works quite well, but it uses hashtags and words, which are likely to be topic- and time-dependent. Would the same features work for Tweets written months or years apart? Try to develop a new feature set that is restricted to style features only, and evaluate on this. You could also use the celebs data, or use a larger set of authors, to test your new model.
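
As a possible starting point, the sketch below computes a few surface-level style measures from the raw text of one Tweet. The function name `get_style_features` and the choice of features are assumptions made for illustration, not a recommended or complete feature set. All values are non-negative, which matters because the pipeline's `chi2` selector requires non-negative features.

```python
import re
import string

def get_style_features(text):
    """Hypothetical style-only features for one Tweet (all values >= 0)."""
    n_chars = len(text)
    if n_chars == 0:
        return {'chars': 0, 'punct_ratio': 0.0, 'upper_ratio': 0.0, 'avg_word_length': 0.0}
    words = re.findall(r"\w+", text)
    return {
        'chars': n_chars,
        # share of characters that are ASCII punctuation
        'punct_ratio': sum(c in string.punctuation for c in text) / n_chars,
        # share of characters that are uppercase letters
        'upper_ratio': sum(c.isupper() for c in text) / n_chars,
        # mean length of the word-like character runs
        'avg_word_length': sum(len(w) for w in words) / len(words) if words else 0.0,
    }

features = get_style_features("Hi THERE, ok!")
print(features)
```

To use it in the pipeline you would wrap it in a function that takes a `Document` and passes `document.text`, then supply that wrapper as the `process_method` of a `DocumentProcessor`.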