This repository contains Python scripts for stylometric analysis using several techniques:
- burrows-delta-mds.py: Computes Burrows's Delta and visualises the results using Multidimensional Scaling (MDS).
- burrows-delta-dendrogram.py: Computes Burrows's Delta and visualises the results using a dendrogram based on hierarchical clustering.
- roberta-embeddings.py: Analyses stylistic similarities among literary texts using the RoBERTa model for embedding generation, dimensionality reduction via PCA, and visualisation through scatter plots.
Purpose: Computes Burrows's Delta and visualises the results using Multidimensional Scaling (MDS).
Output: A scatter plot where points represent texts, and their proximity reflects stylistic similarity.
- Most Frequent Words (MFW): Set to 100 by default.
- Saves the Burrows's Delta matrix as burrows_delta_matrix.csv.
- Saves the MDS visualisation as mds_visualisation.png.
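For orientation, Burrows's Delta between two texts is the mean absolute difference of their z-scored relative word frequencies over the most frequent words. The sketch below illustrates that definition with the standard library only. It is not the code in burrows-delta-mds.py: the function name burrows_delta, the whitespace tokenisation, and the use of the population standard deviation are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def burrows_delta(texts, mfw=100):
    """Return a dict-of-dicts Delta matrix for a {label: raw_text} mapping.

    Hypothetical helper for illustration. Tokenisation is a lowercase
    whitespace split, which the real scripts may not use. Texts are
    assumed to be non-empty.
    """
    tokens = {label: text.lower().split() for label, text in texts.items()}

    # Pick the `mfw` most frequent words across the whole corpus.
    corpus_counts = Counter(tok for toks in tokens.values() for tok in toks)
    vocab = [word for word, _ in corpus_counts.most_common(mfw)]

    # Relative frequency of each selected word in each text.
    rel = {}
    for label, toks in tokens.items():
        counts = Counter(toks)
        rel[label] = [counts[word] / len(toks) for word in vocab]

    # z-score each word across texts; words with zero spread are skipped.
    labels = list(texts)
    z = {label: [] for label in labels}
    for i in range(len(vocab)):
        column = [rel[label][i] for label in labels]
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue
        for label in labels:
            z[label].append((rel[label][i] - mu) / sigma)

    # Delta is the mean absolute difference between two z-score vectors.
    def delta(a, b):
        diffs = [abs(x - y) for x, y in zip(z[a], z[b])]
        return mean(diffs) if diffs else 0.0

    return {a: {b: delta(a, b) for b in labels} for a in labels}
```

Calling burrows_delta({"a": text_a, "b": text_b}, mfw=100) returns a symmetric nested dictionary with zeros on the diagonal, which has the same shape as the matrix the scripts write to burrows_delta_matrix.csv.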
Run the script in your terminal:
python3 burrows-delta-mds.py
Dependencies:
- nltk
- pandas
- numpy
- matplotlib
- scikit-learn
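As a rough illustration of the MDS step, the sketch below projects a precomputed distance matrix into two dimensions with scikit-learn. The toy matrix and labels are invented for the example and stand in for a real Burrows's Delta matrix. The dissimilarity argument is the long-standing spelling; newer scikit-learn releases may warn about it, so check the documentation for your installed version.

```python
import numpy as np
from sklearn.manifold import MDS

# Toy symmetric distance matrix standing in for a Burrows's Delta matrix.
delta_matrix = np.array([
    [0.0, 0.4, 1.2],
    [0.4, 0.0, 1.0],
    [1.2, 1.0, 0.0],
])
labels = ["text_a", "text_b", "text_c"]  # hypothetical labels

# dissimilarity="precomputed" tells MDS the input is already a distance matrix.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(delta_matrix)

# Each row of `coords` is the 2-D position of one text.
for label, (x, y) in zip(labels, coords):
    print(f"{label}: ({x:.3f}, {y:.3f})")
```

The resulting coordinates can be passed to a matplotlib scatter plot. Fixing random_state makes the layout reproducible between runs.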
Purpose: Computes Burrows's Delta and visualises the results using a dendrogram based on hierarchical clustering.
Output: A dendrogram where labels are colour-coded by group, extracted from the text before the first _ in each filename.
- Linkage Method: Uses average linkage, a middle ground between single and complete linkage.
- Colour-Coded Labels: Groups are derived from the filenames (e.g., group_filename.txt).
- Saves the Burrows's Delta matrix as burrows_delta_matrix.csv.
- Saves the dendrogram visualisation as dendrogram_visualisation_coloured.png.
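The group-based colouring can be pictured with the short sketch below. The helper names group_of and colour_map and the palette are hypothetical and chosen for illustration; the script itself may assign colours differently.

```python
from itertools import cycle

# Matplotlib colour names; an illustrative choice, not the script's palette.
PALETTE = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]


def group_of(filename):
    """Return the text before the first underscore in a filename.

    A filename without any underscore is returned unchanged.
    """
    return filename.split("_", 1)[0]


def colour_map(filenames):
    """Assign one palette colour per group, in order of first appearance."""
    colours = {}
    palette = cycle(PALETTE)  # reuse colours if there are more groups than entries
    for name in filenames:
        group = group_of(name)
        if group not in colours:
            colours[group] = next(palette)
    return colours
```

For example, colour_map(["group1_text1.txt", "group2_text1.txt"]) maps group1 and group2 to the first two palette entries.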
Run the script in your terminal:
python3 burrows-delta-dendrogram.py
Dependencies:
- nltk
- pandas
- numpy
- matplotlib
- scipy
Purpose: Analyses stylistic similarities among literary texts using the RoBERTa model for embedding generation, dimensionality reduction via PCA, and visualisation through scatter plots.
Both burrows-delta scripts expect a folder named corpus containing .txt files. Each file should represent a single text.
For burrows-delta-dendrogram.py, filenames should follow the format:
<group>_rest_of_filename.txt
For roberta-embeddings.py, the folder should be named lit-families and placed on your Desktop. Filenames should follow the structure:
surname_firstinitial_title.txt
corpus/
├── group1_text1.txt
├── group1_text2.txt
├── group2_text1.txt
├── group2_text2.txt
lit-families/
├── Joyce_J_Ulysses.txt
├── Woolf_V_ToTheLighthouse.txt
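A minimal sketch of how folders laid out like this could be read is shown below. The helper names load_texts and parse_author are hypothetical, UTF-8 encoding is an assumption, and the "Unknown" fallback mirrors the labelling behaviour mentioned under troubleshooting rather than reproducing the script's exact logic.

```python
from pathlib import Path


def load_texts(folder):
    """Read every .txt file in `folder` into a {filename: text} dict."""
    return {
        path.name: path.read_text(encoding="utf-8")  # encoding is an assumption
        for path in sorted(Path(folder).glob("*.txt"))
    }


def parse_author(filename):
    """Split surname_firstinitial_title.txt into (surname, initial, title).

    Returns "Unknown" for every field when the name does not have
    three underscore-separated parts.
    """
    parts = Path(filename).stem.split("_", 2)
    if len(parts) != 3:
        return ("Unknown", "Unknown", "Unknown")
    return tuple(parts)
```

With the example tree above, parse_author("Joyce_J_Ulysses.txt") yields ("Joyce", "J", "Ulysses").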
- Delta Matrix: Saved as burrows_delta_matrix.csv. A symmetric matrix of stylistic distances between texts.
- MDS Script (burrows-delta-mds.py):
  - Scatter Plot: Saved as mds_visualisation.png.
- Dendrogram Script (burrows-delta-dendrogram.py):
  - Dendrogram Plot: Saved as dendrogram_visualisation_coloured.png.
- RoBERTa Script (roberta-embeddings.py):
  - Scatter plots are displayed in separate windows but are not saved automatically.
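Since the RoBERTa plots are only displayed, keeping a copy means writing the figure to disk yourself. The sketch below uses placeholder data and a hypothetical filename, and selects the headless Agg backend only so that it runs without a display. The order matters: saving after an interactive window has been closed can produce an empty image.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for this sketch; remove to open windows
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([0.1, 0.5, 0.9], [0.3, 0.8, 0.2])  # placeholder points
ax.set_title("Example scatter plot")

# Save first, then show.
fig.savefig("plot1.png", dpi=300, bbox_inches="tight")
plt.show()
```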
Both burrows-delta scripts use the top 100 Most Frequent Words (MFW) by default. To change this, modify the mfw parameter in the compute_frequencies function:
frequency_matrix = compute_frequencies(preprocessed_texts, mfw=200)  # Example: Use 200 MFW
The dendrogram script uses average linkage by default. To change the method, update the linkage function:
linkage_matrix = linkage(condensed_matrix, method='complete')  # Use complete linkage
Available methods: 'single', 'complete', 'average', 'ward'.
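To see how the linkage method changes the clustering, the sketch below runs SciPy's linkage on a toy distance matrix and prints the resulting leaf order for several methods. The matrix and labels are invented for the example; condensed_matrix here is the upper-triangle form that squareform produces from a square, symmetric matrix with a zero diagonal.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix standing in for a Burrows's Delta matrix.
delta_matrix = np.array([
    [0.0, 0.3, 1.1, 1.2],
    [0.3, 0.0, 1.0, 1.3],
    [1.1, 1.0, 0.0, 0.4],
    [1.2, 1.3, 0.4, 0.0],
])
labels = ["a_1", "a_2", "b_1", "b_2"]  # hypothetical labels

# linkage() expects the condensed (upper-triangle) form of the matrix.
condensed_matrix = squareform(delta_matrix)

for method in ("single", "complete", "average"):
    linkage_matrix = linkage(condensed_matrix, method=method)
    # no_plot=True returns the leaf order without drawing anything.
    order = dendrogram(linkage_matrix, labels=labels, no_plot=True)["ivl"]
    print(method, order)
```

One caveat from the SciPy documentation: 'ward' is only correctly defined for Euclidean distances, so results with a Delta matrix should be read with care.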
To save the visualisations from roberta-embeddings.py, add a plt.savefig() call before each plt.show() line, such as:
plt.savefig('plot1.png')
Install required Python packages using pip:
pip install nltk pandas numpy matplotlib scipy scikit-learn sentence-transformers
- Place the respective script in any directory on your machine.
- Ensure the required folder (corpus or lit-families) exists and contains .txt files.
- Run the script using:
  python <script_name>.py
- View the generated plots or outputs.
- Model Choice: roberta-embeddings.py uses all-roberta-large-v1, a RoBERTa-based transformer model trained for sentence embeddings.
- Customisation: You can replace the model with any SentenceTransformer model. Update the following line:
  model = SentenceTransformer('all-roberta-large-v1')
- Text Preprocessing: Minimal preprocessing is applied. Add additional steps if needed for your corpus.
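The embedding and PCA steps can be pictured with the sketch below. The function embed_and_reduce is hypothetical and not taken from roberta-embeddings.py; it accepts any object with an encode method, so a different SentenceTransformer model can be swapped in without changing the rest of the code.

```python
import numpy as np
from sklearn.decomposition import PCA


def embed_and_reduce(texts, model, n_components=2):
    """Encode texts with `model.encode` and project the vectors with PCA.

    Needs at least `n_components` texts and embedding dimensions.
    """
    embeddings = np.asarray(model.encode(texts))
    return PCA(n_components=n_components).fit_transform(embeddings)


# With sentence-transformers installed, the model could be created like this:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-roberta-large-v1")
#   points = embed_and_reduce(list_of_texts, model)
```

Each row of the returned array is the 2-D position of one text. Sentence-embedding models truncate long inputs, so for full-length novels the embedding may reflect only the opening portion of each file unless the text is split into chunks first.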
- No Plots Displayed:
  - Ensure matplotlib is installed.
  - Check for errors during PCA or embedding generation.
- Incorrect Labels:
  - Verify the file naming structure. Non-conforming files are labelled as "Unknown".