# 1. Preprocessing

## 1.1 What is Preprocessing?
In this notebook, we survey some snippets of code that help 'preprocess' your corpus of texts.

The preprocessing of text files is a significant first step in ensuring the reliability of all later stylometric analyses. For texts predating the invention of the printing press, one could argue that it is even slightly more important than for early modern or modern texts. Stylometry’s application to premodern texts comes with specific desiderata when compared to, for instance, authorship detection of current-day, electronically available blog posts. As a result of scribal culture, premodern texts can be tremendously varied. These variations can be captured, and in some cases they might even be desirable (e.g. when analyzing various recensions of a text or comparing writing conventions across scriptoria), but more often than not they are redundant if what you want to analyze is individual writing style.

Preprocessing (potentially) entails steps such as these:

* removing or editing irrelevant material in the text (punctuation, numerals, optical character recognition errors, titles or annotations, etc.) and applying case-folding
* ‘tokenising’ the text into meaningful units, most often word tokens, but also:
  * subwords: not necessarily linguistic yet meaningful word fragments that are picked up as relevant by a language model, e.g. un | believ | able
  * morphemes: linguistically meaningful units, e.g. un | believe | able
  * syllables (sounds, phonological units), e.g. un | be | liev | a | ble
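
As a minimal sketch of the most common case, word-level tokenisation can be approximated with a regular expression that keeps runs of letters. The function name `tokenize_words` and the example sentence are illustrative assumptions, not part of this notebook's pipeline. Subword, morpheme and syllable segmentation need language-specific tools and are not shown.

```python
import re

def tokenize_words(text):
    """Split text into lowercase word tokens (runs of Unicode letters)."""
    # [^\W\d_]+ matches word characters minus digits and underscore, i.e. letters only
    return re.findall(r"[^\W\d_]+", text.lower())

tokens = tokenize_words("Arma virumque cano, Troiae qui primus ab oris.")
print(tokens)
```

Because the pattern works on Unicode letters, accented characters stay inside their tokens, while punctuation and numerals are dropped as a side effect.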

If so desired, preprocessing can also include:

* normalization / standardization: aligning variant orthographical and editorial conventions between text versions
* disambiguation, for instance to semantically distinguish homographs (words that are spelled the same) and homonyms (words that sound and/or are spelled the same)
* stemming: recovering the base stem or morphological root of word tokens
* lemmatisation: transforming word tokens to a standard dictionary form
* PoS (part-of-speech) tagging and parsing: identifying a token’s part of speech and syntactic function
* automated scansion for prosodic units of analysis
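
To make the normalization step concrete, here is a small sketch that collapses two spelling variants onto one convention. The mapping (`v` to `u`, `j` to `i`, as often done for Latin editions) and the names `VARIANTS` and `normalize` are assumptions chosen for illustration. Which variants to align depends entirely on your corpus.

```python
# Hypothetical mapping: adapt it to the conventions found in your own corpus.
VARIANTS = str.maketrans({"v": "u", "j": "i"})

def normalize(text):
    """Case-fold the text and map variant letters onto a single convention."""
    return text.lower().translate(VARIANTS)

# Two editorial spellings of the same word end up identical
print(normalize("Iuvenis"))
print(normalize("juuenis"))
```

Stemming, lemmatisation and PoS-tagging rely on language-specific resources and cannot be reduced to a character mapping like this one.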


## 1.2 Reading and Handling File Objects

The first step is reading in your corpus of texts, so that we can start manipulating them. The steps below will be easier to follow and execute correctly if you have ensured that your file names are formatted as follows: ```author-name_text-title.txt```.
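
The sketch below shows one way to check a file name against this convention before processing. The helper `parse_filename` and the example file name are hypothetical. Unlike a plain `split('_')`, it splits on the first underscore only, so any further underscores stay in the title, and it raises an error when the underscore is missing.

```python
from pathlib import Path

def parse_filename(filename):
    """Return (author, title) from a name like 'author-name_text-title.txt'."""
    stem = Path(filename).stem                 # drop directories and the extension
    author, sep, title = stem.partition("_")   # split on the first underscore only
    if not sep or not author or not title:
        raise ValueError(f"Expected 'author_title.txt', got: {filename!r}")
    return author, title

print(parse_filename("hildegard-of-bingen_scivias.txt"))
```

Failing early on a malformed name is a design choice: it is usually easier to rename one file than to trace a mislabelled author through later analyses.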

In [None]:
# Colab uploads the files into the temporary runtime (not your Drive).
from google.colab import files
uploaded = files.upload()

Below is a rather large chunk of code in which a number of consecutive steps are introduced and combined.
First, we declare empty list containers (```authors```, ```titles```, ```texts```), where we will store our metadata and data.
We introduce our first stylometric parameter, the sample length (variable ```sample_len```).
We then go over all files you have uploaded, and extract the data (texts) and metadata (authors and titles) from the files.

In the text itself, we use Python's ```re``` module (regular expressions, or RegEx) to remove digits and punctuation (if so desired). We also apply case folding (converting uppercase to lowercase).
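
The effect of this cleaning step is easiest to see on a toy string. The sketch below applies the same kind of pattern in a standalone helper; the name `clean` and the example strings are illustrative. Note that `string.punctuation` only covers ASCII punctuation, so typographic characters such as ’ survive unless you add them to the pattern yourself.

```python
import re
from string import punctuation

def clean(text):
    """Lowercase the text and delete digits and ASCII punctuation."""
    return re.sub(r"[\d%s]" % re.escape(punctuation), "", text.lower())

# Digits, colon, apostrophe and exclamation mark are removed
print(clean("Chapter 12: The Scribe's Hand!").split())

# The typographic apostrophe is not ASCII punctuation and is kept
print(clean("Stylometry’s"))
```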

Once the data has been 'cleared' of some of the text items we could say are insignificant for stylistic analysis, we proceed by slicing up the data into discrete segments, or chunks of text.

In [None]:
import re
import glob
from string import punctuation

# Declare empty lists to fill up with our metadata and data
authors, titles, texts = [], [], []

# We declare some parameters — the 'settings' of our stylometric experiments
sample_len = 1400 # word length of text segment

# Function to clean and split text
def clean_and_split_text(text, sample_len):
    words = re.sub(r'[\d%s]' % re.escape(punctuation), '', text.lower()).split()
    return [words[i:i + sample_len] for i in range(0, len(words), sample_len)]

for filename in uploaded.keys():
    author, title = filename.split('/')[-1].split('.')[0].split('_')[:2]
    with open(filename, encoding='utf-8-sig') as file:
        text = file.read().strip()
        bulk = clean_and_split_text(text, sample_len)

        for index, sample in enumerate(bulk):
            if len(sample) == sample_len:
                authors.append(author)
                titles.append(f"{title}_{index + 1}")
                texts.append(" ".join(sample))

# Print summary to confirm things worked
print("Text processing complete!")
print(f"Number of text segments: {len(texts)}")
print(f"Number of authors: {len(set(authors))}")
print(f"Number of titles/segments: {len(titles)}")

# Optional: print a sample segment (skipped if no file yielded a full-length segment)
if texts:
    print("\nSample processed text segment:\n", texts[0][:200], "...")

## 1.3 Sampling (Text Segmentation)

Despite many permutations, sampling methods generally fall within one of these four categories: (a) discrete, (b) rolling, (c) random and (d) generative.

* **Discrete**: as above. A longer text is sliced into discrete pieces according to a predefined fixed sample size, where each sample picks up the trail where the previous one ended.
* **Rolling**: makes use of a sliding window that 'shingles' your text segments. Rolling segmentation samples the text in non-identical, partially overlapping windows instead of discrete chunks of text. It is generally considered to be a more sensitive way of linearly scanning the stylistic profile of a text, as it registers how that profile changes from first to last word.
* **Random**: sentences from a certain author’s entire oeuvre are randomly selected until a predefined sample length limit is reached (e.g. keep on randomly selecting until 1,000 words have been found). Through ever new combinations, this yields an almost inexhaustible number of new, real-world representations of the author’s lexical distribution.
* **Generative**: closely related to random sampling, but takes the idea of inexhaustible representability of a stylistic profile one step further. Text generation attempts not only to imitate the distribution by making use of extant text samples, but also to expand a corpus by generating new text. This is an interesting yet underexplored area of research for medieval texts. Some work has been done in this regard for Latin-writing late antique and medieval authors (Manjavacas et al. 2017).


### 1.3.1 Rolling Sampling

The block of code below allows you to apply a relatively easy form of sampling, that of **rolling sampling**. The `step_size` variable specifies the number of words between the starting indices of consecutive samples. For example, if `step_size=100`, each sample starts 100 words after the previous sample. It thereby determines the amount of overlap: as long as `step_size` is smaller than `sample_len`, consecutive samples share `sample_len - step_size` words.
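
To see how `sample_len` and `step_size` interact, the sketch below lists the word-index windows that rolling sampling would produce for a text of a given length. The helper `rolling_windows` and the text length of 2,000 words are assumptions for illustration. Only complete windows are kept, so a text of `n_words` words yields `(n_words - sample_len) // step_size + 1` samples, or none if it is shorter than `sample_len`.

```python
def rolling_windows(n_words, sample_len, step_size):
    """Return the (start, end) word indices of every complete rolling sample."""
    return [(start, start + sample_len)
            for start in range(0, n_words - sample_len + 1, step_size)]

# Hypothetical text of 2,000 words
windows = rolling_windows(n_words=2000, sample_len=1400, step_size=200)
print(windows)
print("number of samples:", len(windows))
print("overlap between neighbouring samples:", 1400 - 200, "words")
```

A smaller `step_size` produces more, and more strongly overlapping, samples from the same text. Setting `step_size` equal to `sample_len` gives discrete sampling.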

In [None]:
"""
Process uploaded text files into overlapping word-based samples.
Each sample is of fixed length (sample_len), taken in steps of step_size,
and stored along with its author and title metadata.
"""

import re
import numpy as np
from string import punctuation
import pandas as pd

# Declare empty lists to fill up with our metadata and data
authors = []
titles = []
texts = []

sample_len = 1400 # word length of text segment
step_size = 200 # step size

for filename in uploaded.keys():
    author, title = filename.split("/")[-1].split(".")[0].split("_")[:2]
    with open(filename, 'r', encoding='utf-8-sig') as file:
        text = file.read().lower()
        text = re.sub('[%s]' % re.escape(punctuation), '', text)
        text = re.sub(r'\d+', '', text)
        words = text.split()
        steps = np.arange(0, len(words), step_size)
        for each_begin in steps:
            sample_range = range(each_begin, each_begin + sample_len)
            sample = [words[index] for index in sample_range if index < len(words)]
            if len(sample) == sample_len:
                key = '{}-{}-{}'.format(title, str(each_begin), str(each_begin + sample_len))
                authors.append(author)
                titles.append(key)
                texts.append(" ".join(sample))

# Turn results into a dataframe
df = pd.DataFrame({
    "author": authors,
    "title": titles,
    "text": texts
})

# Peek at the data
print(df.head())

# Count samples per author
author_counts = df["author"].value_counts()
print(author_counts)

### 1.3.2 Random sampling
The code block below gives you a starting point to experiment with **random sampling**. The variable `word_limit` plays the same role as `sample_len` above: it indicates how many words you want to include in your sample. Because a sentence is only added when it still fits, `word_limit` is an upper bound and a sample can end up slightly shorter. `n_samples` sets the desired number of random samples per author.
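
Random samples differ on every run, which makes results hard to reproduce. One option, sketched below under the assumption that you want repeatable experiments, is to draw from a seeded `random.Random` instance. The function `sample_until`, the toy sentences and the seed value are all illustrative.

```python
import random

def sample_until(sentences, word_limit, rng):
    """Visit sentences in random order, keeping each one that still fits the limit."""
    chosen, total = [], 0
    # rng.sample with k=len(...) returns a shuffled copy, so no sentence repeats
    for sentence in rng.sample(sentences, k=len(sentences)):
        n_words = len(sentence.split())
        if total + n_words <= word_limit:
            chosen.append(sentence)
            total += n_words
    return chosen

toy = ["one two three.", "four five.", "six seven eight nine.", "ten."]
first = sample_until(toy, word_limit=6, rng=random.Random(42))
second = sample_until(toy, word_limit=6, rng=random.Random(42))
print(first == second)  # same seed, same sample
```

Passing the generator as an argument keeps the seed explicit. Using a different seed for each sample still gives varied samples while allowing every one of them to be regenerated later.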

In [None]:
import re
from string import punctuation
import random
import pandas as pd

# Upper word limit for each sample (in words)
word_limit = 1400
n_samples = 10  # number of randomly generated segments per author

# Count the total number of words across a list of sentences
def count_words(sentences):
    return sum(len(sentence.split()) for sentence in sentences)

# Randomly select sentences until the word limit is reached
def sample_sentences(sentences, word_limit):
    sampled_sentences = []
    total_words = 0
    remaining_sentences = sentences.copy()

    while total_words < word_limit and remaining_sentences:
        sentence = random.choice(remaining_sentences)
        sentence_word_count = len(sentence.split())
        if total_words + sentence_word_count <= word_limit:
            sampled_sentences.append(sentence)
            total_words += sentence_word_count
        remaining_sentences.remove(sentence)

    return sampled_sentences

data = {}
# Read all uploaded files and split them into sentences
for filename in uploaded.keys():
    author, title = filename.split('/')[-1].split('.')[0].split('_')[:2]
    data.setdefault(author, [])  # keep sentences from earlier files by the same author
    with open(filename, encoding='utf-8-sig') as file:
        text = file.read().strip()
        # Split sentences on . or ? while avoiding common abbreviation issues
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
        for sentence in sentences:
            data[author].append(sentence)

# Generate random samples per author and collect metadata
authors, titles, texts = [], [], []
sampled_data = {}
for author in data.keys():
    sampled_data[author] = []
for author, sentences in data.items():
    for i in range(0, n_samples):
        random_sample = sample_sentences(sentences, word_limit)
        random_sample = ' '.join(random_sample)
        title = 'sample_' + str(i+1)  # label each random sample
        authors.append(author)
        titles.append(title)
        texts.append(random_sample)

# Store results in a dataframe for inspection and analysis
df = pd.DataFrame({
    "author": authors,
    "title": titles,
    "text": texts
})

# Display a few samples (results will change each run due to random selection)
print(df.head())

# Quick overview: number of samples generated per author
author_counts = df["author"].value_counts()
print(author_counts)