## Title
## : Revisiting and replicating authorship attribution methods

### LING4181 Supervised Reading
### Spring 2023
### Jihyeong Lee

### Table of Contents

#### Introduction
#### Literature
#### Methods
#### Discussion

### Introduction

* What is authorship attribution?
* What approaches are there?
* What is it useful for?
* What was the purpose of this project, and why replication?

### Literature
Literature review; summarize important concepts, approaches, methods
Recent developments surrounding machine learning models/large language models
Limitations

### Methods

In this section, I explain in detail the data collection procedure and the methods of analysis.
As mentioned in Section 2 (Literature), authorship attribution problems come in several settings: dichotomous attribution (one candidate), closed-pool attribution (a fixed number of candidates, say three, to one of whom the text can be attributed), infinite-pool attribution (the author can be anyone), etc. Real-world authorship attribution problems also include instances of co-authorship, which require more complex methods of analysis.
The methods employed in this essay are applicable to closed-pool, single-author problems, and the data were collected accordingly. The methods fall into two broad groups: lexical analysis and distance-based analysis. I collected the data and wrote the code myself, but the formula for each quantitative analysis largely follows chapters 2 and 3 of Savoy (2020).

I will first briefly describe the datasets and then explain what each method entails and what each block of code does.

#### Datasets
I compiled two datasets, each consisting of texts from several authors.
As an individual's writing style varies across themes (what the text is about) and channels (what platform the text was intended for), I settled on one channel, personal blogs, and two themes, gardening and true crime. Sample blogs were found through simple Google searches for the keywords "gardening" and "true crime", and for each keyword three random blogs with a sufficient amount of text were selected. In total, texts from three gardening blogs were included in the first dataset (hereafter "data1"), and texts from three true-crime blogs in the second ("data2"). Texts were collected manually by copy-pasting. A short excerpt from each author was stored separately as the test (query) text; the query text is not included in the sample text. (In this essay, "text" refers to the sample text of a single author, unless otherwise noted.)

Things worth noting about texts and collecting them:
1. I do not know any of the blog owners included in the sample. (Links to the source blogs are included in the bibliography.)
2. Author C in data1 was one of three co-writers of the blog, so I filtered out the posts by the other two and included only those by Author C.

#### Blog posts as a text
EXPLANATION HERE

Sample texts and the full code are available at ((LINK)). The code is also attached as an appendix.

#### Preparation
First, the following packages are imported:

In [4]:
import nltk
import os
import time
import random
import pandas as pd
import collections
import string
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
pd.set_option("display.max_rows", None)

nltk.download('punkt')
print("Done!")

Done!


[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Preprocessing
Each text went through preprocessing as follows:

In [5]:
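def preprocess(filename):
    # read the whole file, join its lines, and convert to lowercase
    with open(filename, 'r') as f:
        text = f.read().replace("\n", " ").lower()
    # remove ASCII punctuation, then split on whitespace into a list of words
    return text.translate(str.maketrans("","", string.punctuation)).split()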

As shown, all letters were converted to lowercase and punctuation marks were removed (more about this later); the text was then split into words. In other words, every text from this point on is a list of the words it contains rather than a single string of characters, as seen below:

In [9]:
text_a = preprocess('author_a.txt')
text_b = preprocess('author_b.txt')
text_c = preprocess('author_c.txt')

print(text_a[0:50])

['these', 'weeks', 'just', 'after', 'the', 'calendar', 'turns', 'from', 'one', 'year', 'to', 'the', 'next', 'are', 'the', 'perfect', 'time', 'to', 'think', 'about', 'your', 'goals', 'for', 'the', 'coming', 'gardening', 'season', 'on', 'this', 'week’s', 'podcast', 'i', 'discuss', 'plotting', 'out', 'plans', 'for', 'doubling', 'down', 'on', 'what', 'worked', 'well', 'in', 'the', 'garden', 'while', 'also', 'deciding', 'on', 'what', 'i', 'want', 'to', 'stop', 'doing', 'and', 'identifying', 'new', 'things', 'that', 'i’d', 'like', 'to', 'give', 'a', 'try', 'last', 'week', 'i', 'discussed', 'my', '10', 'garden', 'lessons', 'from', '2022', 'and', 'now', 'i', 'am', 'shifting', 'gears', 'from', 'looking', 'back', 'to', 'forging', 'ahead', 'there', 'are', 'things', 'that', 'i', 'have', 'experimented', 'with', 'in', 'recent', 'years', 'to', 'varying', 'degrees', 'of', 'success', 'and', 'i', 'want', 'to', 'take', 'those', 'lessons', 'and', 'move', 'forward', 'refining', 'and', 'enhancing', 'the', '
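The output above still contains tokens such as `'week’s'` and `'i’d'`: `string.punctuation` covers only ASCII punctuation, so typographic (curly) apostrophes and quotation marks survive the `translate` step. The following is a minimal sketch of one way to handle them, assuming they should be removed like their ASCII counterparts. `EXTRA_PUNCT` and `strip_punctuation` are hypothetical names that do not appear in the notebook.

```python
import string

# Assumption: these typographic marks should be treated like ASCII punctuation
# (curly single and double quotes, en and em dash, ellipsis).
EXTRA_PUNCT = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"

def strip_punctuation(raw):
    """Lowercase a raw string, remove ASCII and typographic punctuation, split into words."""
    table = str.maketrans("", "", string.punctuation + EXTRA_PUNCT)
    return raw.replace("\n", " ").lower().translate(table).split()

print(strip_punctuation("This week\u2019s podcast: things I\u2019d like to try."))
```

Removing apostrophes merges forms such as "week’s" with the plural "weeks", so whether this step helps or hurts depends on the analysis that follows.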

#### Basic data summarization
Each text has more than 15,000 words (tokens); the shortest is that of Author C in data1.

##### Tokens, types, type-token ratio

In [None]:
# number of tokens

print(len(text_a))
print(len(text_b))
print(len(text_c))

# number of types

print(len(set(text_a)))
print(len(set(text_b)))
print(len(set(text_c)))

#### Lexical analysis

Lexical analysis methods make use of surface lexical information. Several such methods have been proposed, each drawing on a different aspect of word usage.
##### Type-token ratio
Type-token ratio is a simple index for measuring lexical diversity in a given text. A low type-token ratio indicates a low degree of diversity in the author's choice of words.

In [6]:
def typetoken_ratio(text):
    return round(len(set(text))/len(text),4)

print(typetoken_ratio(text_a))
print(typetoken_ratio(text_b))
print(typetoken_ratio(text_c))
# Visualize as a table

0.1644
0.159
0.1501


# Note: the type-token ratio calculation below would be better placed later, in the discussion of the results

To compare against these values, we need the type-token ratio of the query text as well. As an example, the result for one of the query texts is shown below; more detailed results will be discussed in the Results section.

In [7]:
q1 = preprocess('q1.txt')
print(typetoken_ratio(q1))

## It would be fun to write a function here that finds the closest value, i.e. the one with the smallest absolute difference

0.2593


The type-token ratio of `q1` is closest to that of `text_a`, which is an accurate attribution, as `q1` was indeed written by Author A. The method seems to work, but the difference between the query text and its closest sample text (about 0.09) is far larger than the differences among the sample texts themselves (at most about 0.014), so it is doubtful how effective this method is in general.
This is partly because the query text is short (((HOW MANY WORDS))) compared to the sample texts. The type-token ratio is heavily influenced by text length: a text can grow indefinitely, but its vocabulary does not grow at the same rate, since the most frequent words account for the majority of the tokens we use, as we will see later.
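The comparison by absolute difference, and the effect of text length on it, can be sketched as follows. This is only an illustration on invented toy sentences, not on the blog data; `ttr`, `closest_by_ttr` and the `same_length` option are hypothetical names, and cutting each sample to the query's length is just one possible way to reduce the length effect.

```python
def ttr(tokens):
    """Type-token ratio of a list of word tokens."""
    return len(set(tokens)) / len(tokens)

def closest_by_ttr(query, samples, same_length=True):
    """Return (label, ttr) of the sample whose TTR is nearest to the query's.

    samples maps a label to a token list. With same_length=True each sample is
    cut to the query's length first, so that text length does not dominate.
    """
    q_ttr = ttr(query)
    scores = {}
    for label, tokens in samples.items():
        if same_length:
            tokens = tokens[:len(query)]
        scores[label] = ttr(tokens)
    # smallest absolute difference to the query's TTR wins
    best = min(scores, key=lambda label: abs(scores[label] - q_ttr))
    return best, scores[best]

# Toy data, invented for illustration
samples = {
    "a": "the cat sat on the mat and the dog sat on the rug".split(),
    "b": "one two three four five six seven eight nine ten eleven twelve thirteen".split(),
}
query = "the bird sat on the branch".split()

print(closest_by_ttr(query, samples))                     # samples cut to query length
print(closest_by_ttr(query, samples, same_length=False))  # full-length samples
```

On this toy input the two calls pick different candidates, which illustrates how strongly the raw type-token ratio depends on the length of the texts being compared.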

#### Simpson's D
Among the other measures of lexical diversity introduced in Savoy (2020), Simpson's D (Simpson 1949) is a useful index that is less influenced by text length. Simpson's D measures vocabulary richness as the probability of selecting the same word twice when two tokens are drawn from the text without replacement, summed over all word types. The formula is as follows:

**(1)** $$\text{Simpson's } D(T) = \displaystyle\sum_{r=1}^{n} \frac{r}{n} \cdot \frac{r-1}{n-1} \cdot \mid Voc_{r}(T) \mid$$

- T refers to the text.
- n refers to the corpus size, i.e. the number of tokens in T.
- r refers to the number of times a given word type appears in T.
- |Voc<sub>r</sub>(T)| refers to the number of word types that appear in T exactly r times (Voc<sub>r</sub>(T) being the set of those types).

When there is no diversity in word usage, in other words when only one word is used throughout the entire text ($r=n$), the formula returns 1, which is the maximum value. Hence, the closer Simpson's D is to 0, the richer the vocabulary of the given text.
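As a small worked example on a toy text (not taken from the datasets), let T consist of the three tokens *a, a, b*. Then $n=3$, one word type occurs twice and one occurs once, so the only non-zero counts are those for $r=2$ and $r=1$, each equal to 1:

$$\text{Simpson's } D(T) = \frac{1}{3} \cdot \frac{0}{2} \cdot 1 + \frac{2}{3} \cdot \frac{1}{2} \cdot 1 = \frac{1}{3}$$

This matches the interpretation as a probability: of the three possible pairs of tokens, only one (*a, a*) consists of the same word.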

The formula in **(1)** can be written in code as follows:

In [None]:
# The following function calculates Simpson's D index for a preprocessed text (a list of words)
def simpson_D(text):
    count = collections.Counter(text)                 # word type -> number of occurrences
    voc_sizes = collections.Counter(count.values())   # r -> number of types occurring exactly r times
    n = len(text)
    def VOC(r):
        return voc_sizes[r]  # 0 when no type occurs exactly r times
    # if no type occurs fewer than n times, a single type fills the whole text (r = n)
    if sum(VOC(r) for r in range(1, n)) == 0:
        return 1
    else:
        return round(sum(VOC(r) * (r**2 - r) / (n**2 - n) for r in range(1,n)),4)

The function `simpson_D(text)` takes a preprocessed text as its argument. `collections.Counter(text)` creates a dictionary-like object in which each key is a word type and each value is its number of occurrences in the text. A second `Counter`, built over those values, records how many word types occur exactly r times, and the inner function `VOC(r)` looks that number up (0 if no type occurs r times). The vertical bars in **(1)** denote the size of a set rather than an absolute value, so `VOC(r)` is already a non-negative count and needs no further treatment. The function then calculates the sum of $\frac{r}{n} \cdot \frac{r-1}{n-1} \cdot Voc_{r}$ for all $r$.

By definition, the function should return 1 when $r=n$. Without a special case, however, the code returns `0.0`, because `range(1, n)` stops at $r=n-1$ and the term for $r=n$ is never added. From a practical point of view, $r=n$ is unlikely to occur, since real-life texts contain more than one word type. But for the sake of completeness, the following lines were added,
```python
if sum(VOC(r) for r in range(1, n)) == 0:
        return 1
```
which returns 1 if `VOC(r)` is zero for every $r$ up to $n-1$, i.e. if a single word type makes up the whole text.

#### Mean word length

To make a fairer comparison for mean word length between sample texts, I took the first 15,000 words from each text of data1, since the shortest text of data1 (`text_c`) has about 15,000 words.

In [None]:
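# first 15,000 tokens of each text, so that all samples have the same length
text_a_15k = text_a[:15000]
text_b_15k = text_b[:15000]
text_c_15k = text_c[:15000]

def average(text):
    # mean number of characters per word
    return sum(len(word) for word in text) / len(text)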

#### Word length distribution
# WRITE THE CODES #
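One possible sketch of this step, assuming the distribution is meant as the share of tokens at each word length. The function name `word_length_distribution` and the `max_len` cut-off are assumptions made for illustration, and the toy sentence is not taken from the datasets.

```python
import collections

def word_length_distribution(tokens, max_len=15):
    """Relative frequency of each word length from 1 to max_len.

    Words longer than max_len are counted in the last bin (an arbitrary choice).
    """
    lengths = collections.Counter(min(len(w), max_len) for w in tokens)
    total = len(tokens)
    return {k: lengths[k] / total for k in range(1, max_len + 1)}

toy = "i want to take those lessons and move forward".split()
dist = word_length_distribution(toy, max_len=8)
for length, share in dist.items():
    print(length, round(share, 3))
```

Because the values are shares rather than raw counts, distributions from texts of different lengths can be compared directly.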

#### Lexical Density

Lexical density is the ratio of the number of lexical items (tokens that are not function words) to the text length; equivalently, it is one minus the proportion of function words.
Function words include determiners, pronouns, prepositions, auxiliary verbs, etc.
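A minimal sketch of this calculation is given below. It assumes a small hand-made list of function words, `FUNCTION_WORDS`, which is a hypothetical stand-in and far from complete; the function name `lexical_density` and the toy sentence are likewise illustrative only.

```python
# Assumption: a tiny hand-made list of English function words, for illustration only.
FUNCTION_WORDS = {
    "the", "a", "an", "this", "that", "these", "those",
    "i", "you", "he", "she", "it", "we", "they", "my", "your",
    "in", "on", "at", "to", "of", "for", "with", "from",
    "and", "or", "but",
    "is", "am", "are", "was", "were", "be", "have", "has", "had",
    "do", "does", "did", "will", "would", "can", "could",
}

def lexical_density(tokens, function_words=FUNCTION_WORDS):
    """Share of tokens that are not function words."""
    lexical = sum(1 for w in tokens if w not in function_words)
    return lexical / len(tokens)

toy = "i want to take those lessons and move forward in the garden".split()
print(lexical_density(toy))
```

A fuller list, for instance NLTK's English stop word list, could be substituted for `FUNCTION_WORDS`, although stop words and function words are not identical sets.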

Lexical density: function words 

#### Distance-based Analysis

Distance-based methods establish a profile for each candidate author to which we can compare the query text's profile.

Burrows' Delta (Savoy 2020: 34-36) is one such method: it considers the 40-150 most frequent word types, on the assumption that style is reflected in word choice. According to Savoy (2020: 34), the 150 most frequent word types cover 50-65% of all tokens in a text, with the percentage varying with the theme, genre, etc. of the text.

The following is the formula for Delta:

**(2)** $$\text{Burrows' Delta}(A_{j},Q) = \displaystyle\frac{1}{m} \cdot \sum_{i=1}^{m} \mid Zscore(t_{i,A_{j}}) - Zscore(t_{i,Q}) \mid$$

- A<sub>j</sub> is the profile of the j-th candidate author.
- Q is the query text.
- t<sub>i</sub> is the i-th word type in the MFW list, and m is the number of word types in that list.

Each t in the MFW list carries the same weight, but its impact depends on its Z score values. <br>
To get the Delta value between the query text and a sample text, a list of the most frequent word types is necessary. A relative frequency can then be calculated for each word type in each text: the number of occurrences of the word type in the text divided by the length of the text.<br>
Next, the relative frequencies of each word type are compared across the sample texts to obtain a mean and a standard deviation. These are used to compute a Z score for each word type in each text: the Z score is the relative frequency minus the mean, with the difference divided by the standard deviation. A Z score tells us where a value lies in relation to the entire sample. By comparing the Z scores of a word type in two texts, we see how differently the two texts use that word; the larger the sum of these differences, the greater the difference in word choice between the texts.
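The steps just described can be traced on a toy example. The sketch below uses invented relative frequencies for four word types and three candidate authors; none of the numbers come from the datasets, and the sample standard deviation (`ddof=1`) is an assumption chosen to match the default of pandas' `std`.

```python
import numpy as np

# Toy relative frequencies (rows: word types; columns: authors a, b, c). Invented numbers.
words = ["the", "and", "of", "i"]
profiles = np.array([
    [0.060, 0.050, 0.055],
    [0.030, 0.035, 0.025],
    [0.020, 0.030, 0.025],
    [0.010, 0.020, 0.030],
])
query = np.array([0.058, 0.031, 0.021, 0.012])  # relative frequencies in the query text

# Mean and sample standard deviation of each word type across the candidate authors
mean = profiles.mean(axis=1, keepdims=True)
sd = profiles.std(axis=1, ddof=1, keepdims=True)

# Z scores of the authors and of the query, using the same mean and sd
z_profiles = (profiles - mean) / sd
z_query = (query[:, None] - mean) / sd

# Delta for each author: mean absolute difference of Z scores over the m word types
delta = np.abs(z_profiles - z_query).mean(axis=0)
best = "abc"[int(np.argmin(delta))]
print(dict(zip("abc", delta.round(3))), "->", best)
```

The candidate with the smallest Delta is the one whose use of the most frequent words lies closest to the query text.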

The function `MFW` below returns a list of the 300 most frequent words together with their frequencies. The number 300 can be changed if necessary. Frequency here is absolute frequency, i.e. how many times the word appears in the text.
The function `MFW_100` returns the percentage of all tokens in the text that are accounted for by the MFW.

In [None]:
def MFW(text):
    freq = FreqDist(text)
    MFWlist = freq.most_common(300)
    return MFWlist

def MFW_100(text):
    return 100 * sum(i[1] for i in MFW(text)) / len(text)

The code below creates a table of most frequent words (MFW) with their absolute frequencies in the three texts. The MFW lists of the three texts differ, with some overlap, so to make them comparable I kept only the MFWs present in all three lists, which makes the table shorter than the original list.

In [None]:
def abs_table(xa, xb, xc):
    # absolute frequencies of the MFWs of xa, keeping only words that are also MFWs of xb and xc
    dict_b = dict(MFW(xb))
    dict_c = dict(MFW(xc))
    table = pd.DataFrame(MFW(xa)).rename(columns={0: 'word', 1:'a'})
    table.set_index('word',inplace=True)
    table["b"] = ""
    table["c"] = ""
    for n in MFW(xa):
        word = n[0]
        if dict_b.get(word) is not None and dict_c.get(word) is not None:
            table.loc[word,"b"] = dict_b.get(word)
            table.loc[word,"c"] = dict_c.get(word)
        else:
            table.loc[word,"b"] = np.nan
            table.loc[word,"c"] = np.nan
    table.dropna(inplace= True) # once, after the loop: drops words missing from either list
    return table

abs_table(text_a,text_b,text_c)

The table we get from `abs_table` is then turned into a relative frequency table, which takes each text's length (number of tokens) into account. Since the absolute frequency of an MFW is heavily influenced by the size of the corpus, relative frequency is more useful.

In [None]:
def rel_table(xa, xb, xc):
    table = abs_table(xa, xb, xc)
    table = table.astype(float)
    # relative frequency = absolute frequency / number of tokens in the text
    table.loc[:,"a"] = round(table["a"] / len(xa),5)
    table.loc[:,"b"] = round(table["b"] / len(xb),5)
    table.loc[:,"c"] = round(table["c"] / len(xc),5)
    # mean and sd are taken over the three author columns only
    authors = table[["a", "b", "c"]]
    table.loc[:,"mean"] = authors.mean(axis='columns')
    table.loc[:,"sd"] = authors.std(axis='columns')
    table["words"] = table.index
    return table

table = rel_table(text_a,text_b,text_c)

Lastly, the code below calculates the z-scores and, finally, the Delta score. The query text has to be preprocessed before running these lines.

In [None]:
q = preprocess('q3.txt')

def zscore_table(a,b,c):
    # note: reads the preprocessed query text from the global variable q
    dict_q = collections.Counter(q)
    table = abs_table(a, b, c)
    table = table.astype(float)
    table.loc[:,"a"] = round(table["a"] / len(a),5)
    table.loc[:,"b"] = round(table["b"] / len(b),5)
    table.loc[:,"c"] = round(table["c"] / len(c),5)
    # relative frequency in the query text; NaN if the word does not occur there
    table["q"] = [round(dict_q[word] / len(q),5) if word in dict_q else np.nan
                  for word in table.index]
    # mean and sd over the three sample texts only (not over q or the mean itself)
    authors = table[["a", "b", "c"]]
    table.loc[:,"mean"] = authors.mean(axis='columns')
    table.loc[:,"sd"] = authors.std(axis='columns')
    for col in ["a", "b", "c", "q"]: # calculates z-scores for columns a,b,c,q
        table.loc[:,"z_" + col] = (table[col] - table["mean"]) / table["sd"]
    table.dropna(inplace= True) # deletes rows that contain NaN
    return table

In [None]:
def delta(df): # calculates the Delta distance between the query text and each of texts a, b, c from a z-score table
    delta_a = round(sum(list(abs(df["z_a"]-df["z_q"]))) / len(df),5)
    delta_b = round(sum(list(abs(df["z_b"]-df["z_q"]))) / len(df),5)
    delta_c = round(sum(list(abs(df["z_c"]-df["z_q"]))) / len(df),5)
    return delta_a, delta_b, delta_c, 'are Delta distance values between the query text and text a, b, c, respectively.'
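Once the Delta values are available, the attribution itself amounts to picking the candidate with the smallest distance. The helper below is a hypothetical sketch of that last step: `closest_author` is not part of the notebook, it assumes the `z_a`, `z_b`, `z_c`, `z_q` column naming used above, and it is shown on invented z-scores.

```python
import pandas as pd

def closest_author(df, query_col="z_q", prefix="z_"):
    """Return (label, delta) for the z-score column nearest to the query column."""
    deltas = {}
    for col in df.columns:
        if col.startswith(prefix) and col != query_col:
            label = col[len(prefix):]
            # mean absolute difference of z-scores, as in formula (2)
            deltas[label] = float((df[col] - df[query_col]).abs().mean())
    best = min(deltas, key=deltas.get)
    return best, round(deltas[best], 5)

# Toy z-scores, invented for illustration
toy = pd.DataFrame(
    {
        "z_a": [1.0, 0.0, -1.0],
        "z_b": [-1.0, 1.0, 1.0],
        "z_c": [0.0, -1.0, 0.0],
        "z_q": [0.5, 0.0, -1.0],
    },
    index=["the", "and", "of"],
)
print(closest_author(toy))
```

Looping over the columns rather than naming them one by one means the same helper would work for a pool of more than three candidates.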

### Results

### Discussion
Interpret the results
Ways forward (What more could be done, what more am I interested in, what I will do next)