# Session 4: Burrows's Delta

In this session, we’ll perform a stylometric analysis on historical text data. We’ll use Burrows's Delta to build a classification model that attributes literary works to their authors based on each author's writing style.

**To-Do**  
Before the session, make sure all of the following packages are installed by running the cell below while using your Anaconda environment as the kernel.

In [1]:
!pip install pandas scikit-learn plotly nbformat matplotlib



## Step 1: Preparing the Data

We’ll be conducting the stylometric analysis using classic German literature, including works by well-known authors such as Goethe, Schiller, and others.

The following CSV file, provided via Moodle, contains 553 German belletristic texts sourced from the Deutsches Textarchiv (DTA).

> The DTA offers cross-disciplinary and cross-genre collections and corpora of German-language texts. Its core corpus of around 1,500 titles serves as the foundation for a reference corpus of Modern High German.  
> https://www.deutschestextarchiv.de

Each document was originally downloaded as an individual XML file from [https://www.deutschestextarchiv.de/download](https://www.deutschestextarchiv.de/download). The XML files were parsed and cleaned of archaic German characters. The CSV contains the processed texts we’ll be working with.

In [2]:
# Get an overview of the authors

As we know, Burrows's Delta works by comparing a given text to others. To calculate similarities and attribute authorship, we need reference material for each author.  
This approach doesn't make sense for authors who appear only once in the dataset, so we need to filter out those cases.

We need to extract three elements from the CSV and store them as lists: the texts (called `documents` in the following), the authors' surnames (called `authors`), and the titles.
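The filtering and splitting described above can be sketched as follows. This is a minimal illustration, not the session's actual code: the tiny in-memory DataFrame stands in for the Moodle CSV, and the column names `author`, `title`, and `text` are assumptions about its layout.

```python
import pandas as pd

# Hypothetical stand-in for the DTA CSV; the column names are assumptions.
df = pd.DataFrame({
    "author": ["Goethe", "Goethe", "Schiller", "Schiller", "Heine"],
    "title": ["Werther", "Faust", "Kabale und Liebe", "Wallenstein", "Buch der Lieder"],
    "text": ["text a", "text b", "text c", "text d", "text e"],
})

# Overview: number of texts per author
counts = df["author"].value_counts()
print(counts)

# Keep only authors with at least two texts, since a single text
# leaves no reference material to compare against.
keep = counts[counts >= 2].index
df = df[df["author"].isin(keep)].reset_index(drop=True)

# Store the three elements as plain Python lists
documents = df["text"].tolist()
authors = df["author"].tolist()
titles = df["title"].tolist()
```

With the real file, the DataFrame would instead come from `pd.read_csv(...)` with the path of the downloaded CSV.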

## Step 2: Extracting Word Frequencies

As you know, Burrows's Delta works by comparing distributions of word frequencies. It does so by focusing on so-called function words.

**Function words**  
> This linguistic category can broadly be defined as the small set of (typically short) words in a language (prepositions, particles, determiners, etc.) which are heavily grammaticalized and which, as opposed to nouns or verbs, often only carry little meaning in isolation (e.g., the versus cat). https://www.humanitiesdataanalysis.org/stylometry/notebook.html

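We will now create count vectors: for each document, a vector that records how often each selected token occurs. We restrict the vocabulary to the 30 most frequent tokens in the corpus, which tend to be function words. If you have a specific list of function words you want to build the model with, even better.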

Recall that Burrows's Delta assumes normalized word counts. Normalization, in this context, means dividing each document vector by its length, where length is measured using a vector norm such as the sum of the components (the L1 norm) or the Euclidean length (the L2 norm).

L1 normalization turns the absolute word frequencies in each document vector into relative frequencies by dividing them by their sum. Note that this is the sum of the document vector; it equals the total word count of the document only when the vector covers the full vocabulary. Scikit-learn’s `preprocessing` module provides a number of normalization functions and classes. We use the function `normalize()` to turn the absolute frequencies into relative frequencies:
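A minimal sketch of the L1 normalization step, using a small hand-made count matrix in place of the real one:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Toy count matrix: 3 documents x 3 tokens
counts = np.array([[2, 1, 2],
                   [1, 2, 2],
                   [1, 1, 1]], dtype=float)

# norm="l1" divides every row by the sum of its absolute values,
# so each row becomes a vector of relative frequencies.
rel_freqs = normalize(counts, norm="l1")

print(rel_freqs)
print(rel_freqs.sum(axis=1))  # every row now sums to 1
```

The first row, for example, becomes `[0.4, 0.2, 0.4]` because its counts sum to 5.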

## Step 3: Splitting Training and Test Data

Burrows's Delta doesn't involve traditional training. However, since we want to evaluate how well the method attributes texts to their authors, we need to split the dataset into two parts: one where the authors are known (training data), and one where we’ll attempt to assign authors (test data).  
Because we also know the true authors of the test data, we can later evaluate how accurately the method performs.
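The split could be done with scikit-learn's `train_test_split`, as in the sketch below. The toy feature matrix, the 50/50 split, the fixed `random_state`, and the use of `stratify` are all illustrative assumptions; stratifying simply ensures that every author appears in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 documents with 3 features each, 4 documents per author
X = np.arange(24, dtype=float).reshape(8, 3)
authors = ["Goethe"] * 4 + ["Schiller"] * 4

# stratify=authors keeps the author proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, authors, test_size=0.5, stratify=authors, random_state=42
)

print(X_train.shape, X_test.shape)
print(sorted(y_test))
```

Stratification also explains why authors with a single text had to be removed earlier: such an author could not be represented in both the training and the test part.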

## Step 4: Scaling the Vectors

In the code block below, we transform the relative word frequencies into *z-scores* using scikit-learn’s `StandardScaler` class.

**What are z-scores?**  
Z-scores are standardized values that indicate how many standard deviations a data point is from the mean of its distribution. This scaling allows us to compare features on a common scale, which is crucial for distance-based methods like Burrows's Delta.

**Formula:**  
$$
z_i = \frac{x_i - \mu}{\sigma}
$$

**Legend:**  
- $x_i$: the original value (e.g., word frequency)  
- $\mu$: the mean of the feature across all documents  
- $\sigma$: the standard deviation of the feature  
- $z_i$: the resulting z-score  

Using z-scores gives every function word a comparable weight, so that very frequent words do not dominate the distances between texts simply because their raw frequencies are larger.
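A small sketch of the scaling step with `StandardScaler`. The numbers are made up, and fitting the scaler on the training data only and then reusing it for the test data is one common choice, assumed here rather than taken from the session's code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy relative frequencies: rows are documents, columns are function words
X_train = np.array([[0.1, 0.5],
                    [0.3, 0.3],
                    [0.5, 0.1]])
X_test = np.array([[0.3, 0.3]])

scaler = StandardScaler()
Z_train = scaler.fit_transform(X_train)  # learns mean and std per column
Z_test = scaler.transform(X_test)        # reuses the training mean and std

print(Z_train)
print(Z_test)
```

After scaling, each training column has mean 0 and standard deviation 1. `StandardScaler` uses the population standard deviation (dividing by the number of documents, not by one less).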


## Step 5: Calculating the Distances

In this step, we calculate the distance from every test document to every training document. Each test document can then be attributed to the author of its nearest training document.

Cityblock (Manhattan) distance: 
$$
D(x, y) = \sum_{i=1}^{n} |x_i - y_i|
$$
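The distance computation and a nearest-neighbour attribution can be sketched with SciPy's `cdist`. The z-score matrices below are invented for illustration. Burrows's Delta is the mean of the absolute z-score differences, that is, the cityblock distance divided by the number of features $n$; since $n$ is the same for every pair, the division does not change which training document is closest.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy z-scores: 3 training documents and 2 test documents, 3 features each
Z_train = np.array([[1.0, -1.0, 0.5],
                    [-1.0, 1.0, -0.5],
                    [0.5, -0.5, 0.0]])
y_train = ["Goethe", "Schiller", "Goethe"]
Z_test = np.array([[0.9, -0.8, 0.6],
                   [-0.7, 1.1, -0.2]])

# distances[i, j] = cityblock distance between test doc i and training doc j
distances = cdist(Z_test, Z_train, metric="cityblock")

# Attribute each test document to the author of its nearest training document
nearest = distances.argmin(axis=1)
predictions = [y_train[i] for i in nearest]

print(distances)
print(predictions)
```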


## Step 6: Evaluating the Attribution

After applying Burrows's Delta to assign authors to the test texts, we can evaluate how well the method performed.

One common metric is **accuracy**, which tells us the proportion of correctly attributed texts:

**Formula:**  
$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$$

While accuracy gives a general idea of performance, it doesn't show which authors were confused with each other. For that, we use a **confusion matrix**.

**Confusion Matrix:**  
A confusion matrix is a table that shows the counts of actual vs. predicted author labels. Each row represents the true author, while each column represents the predicted author. Ideally, most values should fall along the diagonal, indicating correct attributions.

This helps us analyze:
- Which authors are often misclassified
- Whether the model struggles more with certain authors than others

We’ll now compute both metrics to assess the results of our model.
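Both metrics are available in `sklearn.metrics`. The sketch below uses invented true and predicted labels so the result can be checked by hand; in the session, `y_test` and `predictions` would come from the previous steps.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels for illustration only
y_test = ["Goethe", "Goethe", "Schiller", "Schiller", "Heine"]
predictions = ["Goethe", "Schiller", "Schiller", "Schiller", "Heine"]

# Fix the label order so rows and columns of the matrix are easy to read
labels = ["Goethe", "Heine", "Schiller"]

acc = accuracy_score(y_test, predictions)
cm = confusion_matrix(y_test, predictions, labels=labels)

print(f"Accuracy: {acc:.2f}")
print(cm)  # rows: true author, columns: predicted author
```

Here four of five texts are attributed correctly. The single off-diagonal entry sits in the Goethe row and the Schiller column, meaning one Goethe text was attributed to Schiller.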
