EFAuR

Embeddings For Authorship using RoBERTa

Dataset

The dataset used in this project consists of public domain books written in English, with a single, known author, sourced from Project Gutenberg.

Mirror all .txt files from a Project Gutenberg mirror
Download the metadata tarball from gutenberg.org
Extract the metadata tarball and parse all the RDF metadata files within
- Select English books labeled for public domain use with a single, known author
- Write the ID of each book, along with the corresponding author, to a CSV file
Combine all books written by the same author into a single file
- Remove Project Gutenberg headers and footers
- Remove readability formatting that may obscure the texts
Create a dataset directory
- Create subdirectories for train, test, and val
- Each subdirectory contains files, each holding a single data point for the model
- Each data point consists of text1, text2, and a label specifying same or different authorship
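
The cleaning and pairing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes the books carry the common "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" and "*** END OF ..." marker lines (not every file does), it interprets "readability formatting" as hard line wraps, and it assumes a label of 1 for same authorship and 0 for different. The function names and the character-based excerpt length are hypothetical.

```python
import random
import re

# Assumed marker format: many Project Gutenberg plain-text files delimit the
# book body with lines like these, but older files may differ.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\n",
    re.IGNORECASE,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, when they exist."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


def unwrap_lines(text):
    """Join hard-wrapped lines inside each paragraph; keep paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def make_pair(texts_by_author, same, rng, length=200):
    """Build one data point: two excerpts plus a same/different label.

    Hypothetical scheme: excerpts are random character windows, and
    label 1 means same author, 0 means different authors.
    """
    authors = list(texts_by_author)
    if same:
        a1 = a2 = rng.choice(authors)
    else:
        a1, a2 = rng.sample(authors, 2)

    def excerpt(author):
        text = texts_by_author[author]
        start = rng.randrange(max(1, len(text) - length + 1))
        return text[start:start + length]

    return {"text1": excerpt(a1), "text2": excerpt(a2), "label": int(same)}
```

For example, `make_pair(texts, same=True, rng=random.Random(0))` draws both excerpts from one author's combined file. In this sketch the two windows may overlap, so a real pipeline would likely want to sample from disjoint regions.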
