Stylometry

This repository contains Python scripts for performing stylometric analysis using various techniques:

burrows-delta-mds.py: Computes Burrows's Delta and visualises the results using Multidimensional Scaling (MDS).
burrows-delta-dendrogram.py: Computes Burrows's Delta and visualises the results using a dendrogram based on hierarchical clustering.
roberta-embeddings.py: Analyses stylistic similarities among literary texts using the RoBERTa model for embedding generation, dimensionality reduction via PCA, and visualisation through scatter plots.

Scripts Overview

1. `burrows-delta-mds.py`

Purpose: Computes Burrows's Delta and visualises the results using Multidimensional Scaling (MDS).

Output: A scatter plot where points represent texts, and their proximity reflects stylistic similarity.

Features:

Most Frequent Words (MFW): Set to 100 by default.
Saves the Burrows's Delta matrix as burrows_delta_matrix.csv.
Saves the MDS visualisation as mds_visualisation.png.

Usage:

Run the script in your terminal:

python3 burrows-delta-mds.py

Dependencies:

nltk
pandas
numpy
matplotlib
scikit-learn

2. `burrows-delta-dendrogram.py`

Purpose: Computes Burrows's Delta and visualises the results using a dendrogram based on hierarchical clustering.

Output: A dendrogram where labels are colour-coded by group, extracted from the text before the first _ in each filename.

Features:

Linkage Method: Uses average linkage (default in stylometry for balanced clustering).
Colour-Coded Labels: Groups are derived from the filenames (e.g., group_filename.txt).
Saves the Burrows's Delta matrix as burrows_delta_matrix.csv.
Saves the dendrogram visualisation as dendrogram_visualisation_coloured.png.

Usage:

Run the script in your terminal:

python3 burrows-delta-dendrogram.py

Dependencies:

nltk
pandas
numpy
matplotlib
scipy

3. `roberta-embeddings.py`

Purpose: Analyses stylistic similarities among literary texts using the RoBERTa model for embedding generation, dimensionality reduction via PCA, and visualisation through scatter plots.

Input Requirements

Corpus

Both burrows-delta scripts expect a folder named corpus containing .txt files. Each file should represent a single text.

For burrows-delta-dendrogram.py, filenames should follow the format:

<group>_rest_of_filename.txt

For roberta-embeddings.py, the folder should be named lit-families on your Desktop. Filenames should follow the structure:

surname_firstinitial_title.txt

Example Folder Structure

corpus/
├── group1_text1.txt
├── group1_text2.txt
├── group2_text1.txt
├── group2_text2.txt

lit-families/
├── Joyce_J_Ulysses.txt
├── Woolf_V_ToTheLighthouse.txt

Outputs

Common Outputs

Delta Matrix: Saved as burrows_delta_matrix.csv. A symmetric matrix of stylistic distances between texts.

Script-Specific Outputs

MDS Script (burrows-delta-mds.py):
- Scatter Plot: Saved as mds_visualisation.png.
Dendrogram Script (burrows-delta-dendrogram.py):
- Dendrogram Plot: Saved as dendrogram_visualisation_coloured.png.
RoBERTa Script (roberta-embeddings.py):
- Scatter plots are displayed in separate windows but are not saved automatically.

Customisation

Adjusting Most Frequent Words (MFW)

Both burrows-delta scripts use the top 100 Most Frequent Words (MFW) by default. To change this, modify the mfw parameter in the compute_frequencies function:

frequency_matrix = compute_frequencies(preprocessed_texts, mfw=200)  # Example: Use 200 MFW

Changing Linkage Method in Dendrogram

The dendrogram script uses average linkage by default. To change the method, update the linkage function:

linkage_matrix = linkage(condensed_matrix, method='complete')  # Use complete linkage

Available methods: 'single', 'complete', 'average', 'ward'.

RoBERTa Visualisation Saving

To save the visualisations from roberta-embeddings.py, modify the plt.show() lines to include saving functionality, such as:

plt.savefig('plot1.png')

Dependencies

Install required Python packages using pip:

pip install nltk pandas numpy matplotlib scipy scikit-learn sentence-transformers

How to Run Scripts

Place the respective script in any directory on your machine.
Ensure the required folder (corpus or lit-families) exists and contains .txt files.
Run the script using:

python <script_name>.py

View the generated plots or outputs.

Notes

Model Choice: roberta-embeddings.py uses all-roberta-large-v1, a RoBERTa-based transformer model trained for sentence embeddings.
Customisation: You can replace the model with any SentenceTransformer model. Update the following line:

model = SentenceTransformer('all-roberta-large-v1')

Text Preprocessing: Minimal preprocessing is applied. Add additional steps if needed for your corpus.

Troubleshooting

No Plots Displayed:
- Ensure matplotlib is installed.
- Check for errors during PCA or embedding generation.
Incorrect Labels:
- Verify the file naming structure. Non-conforming files are labelled as "Unknown."

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
LICENSE		LICENSE
README.md		README.md
RoBERTa-embeddings.py		RoBERTa-embeddings.py
burrows-delta-dendrogram.py		burrows-delta-dendrogram.py
burrows-delta-mds.py		burrows-delta-mds.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Stylometry

Scripts Overview

1. `burrows-delta-mds.py`

Features:

Usage:

Dependencies:

2. `burrows-delta-dendrogram.py`

Features:

Usage:

Dependencies:

3. `roberta-embeddings.py`

Input Requirements

Corpus

Example Folder Structure

Outputs

Common Outputs

Script-Specific Outputs

Customisation

Adjusting Most Frequent Words (MFW)

Changing Linkage Method in Dendrogram

RoBERTa Visualisation Saving

Dependencies

How to Run Scripts

Notes

Troubleshooting

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Stylometry

Scripts Overview

1. burrows-delta-mds.py

Features:

Usage:

Dependencies:

2. burrows-delta-dendrogram.py

Features:

Usage:

Dependencies:

3. roberta-embeddings.py

Input Requirements

Corpus

Example Folder Structure

Outputs

Common Outputs

Script-Specific Outputs

Customisation

Adjusting Most Frequent Words (MFW)

Changing Linkage Method in Dendrogram

RoBERTa Visualisation Saving

Dependencies

How to Run Scripts

Notes

Troubleshooting

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. `burrows-delta-mds.py`

2. `burrows-delta-dendrogram.py`

3. `roberta-embeddings.py`

Packages