JEFP: A Corpus of Canonical and Non-Canonical Texts

Canonical works of fiction are considered to have high cultural value, are familiar to educated members of society, and are included in school curricula. Their authors are prestigious and enjoy a strong reputation. Canonization is a process influenced by various stakeholders, and the reputation of canonical texts is promoted by influential sectors of society. Nevertheless, a text's properties, the way it is read, and its readership undoubtedly play crucial roles in whether it is selected for a canon.

Conversely, non-canonical texts do not gain recognition comparable to canonical texts, and many of them may not survive, as was probably the fate of most during the pre-digitalization era.

To compare canonical with non-canonical texts, we built a corpus called the Jena Corpus of Expository and Fictional Prose (JEFP). It includes the two fictional categories described above, canonical and non-canonical, as well as one category of non-fictional texts, which allows for inter- and intra-genre comparisons.

Canonical Text

To select a text as a canonical text, we considered three criteria:

  1. The text has been recognized as a canonical text in The Western Canon: The Books and School of the Ages (Bloom, 1994). These texts have been extracted from Project Gutenberg and collected in the Corpus of Canonical Western Literature (Green, 2017).
  2. The author should have a high international reputation, measured by counting the number of pages the author has in the top 30 Wikipedia editions.
  3. The text should be long enough to facilitate structural analysis. We set a threshold of 35K tokens, including words and punctuation. This length allows us to apply various types of structural analysis, including long-range correlation analysis (fractal analysis).
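
The length criterion can be sketched as a simple check. This is a minimal illustration, not the tokenizer used to build the corpus: the regular expression below is an assumption that counts each word and each punctuation mark as one token, and the names `count_tokens` and `is_long_enough` are hypothetical.

```python
import re

MIN_TOKENS = 35_000  # threshold stated in criterion 3

# Assumption: a token is either a word (letters/digits, allowing internal
# apostrophes or hyphens) or a single punctuation character. The tokenizer
# actually used for JEFP is not specified in this README.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def count_tokens(text: str) -> int:
    """Count words and punctuation marks in a text."""
    return len(TOKEN_PATTERN.findall(text))


def is_long_enough(text: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Return True if the text meets the minimum token length."""
    return count_tokens(text) >= min_tokens
```

A different tokenizer will give somewhat different counts, so texts close to the threshold may be classified differently.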

In total, 76 canonical texts, written by 30 authors in the 19th and early 20th centuries, were incorporated into the JEFP corpus, version 2.0. The authors are:

Anne Bronte, Arnold Bennett, Bram Stoker,
Charles Dickens, Charlotte Bronte, Edgar Allan Poe,
Elizabeth Gaskell, George Eliot, Henry David Thoreau,
Henry James, Herman Melville, James Fenimore Cooper,
James Joyce, Jane Austen, Joseph Conrad,
D.H. Lawrence, Louisa May Alcott, Mark Twain,
Nathaniel Hawthorne, Oscar Wilde, Rudyard Kipling,
Sinclair Lewis, Theodore Dreiser, Thomas Carlyle,
Thomas Hardy, Walter Scott, Wilkie Collins,
Willa Cather, William Makepeace Thackeray, William Morris

The complete list of texts is published in canon-version2.0.csv.

Non-Canonical Text

The raw texts for this corpus category were extracted from Project Gutenberg. The selection aimed to make the distribution of publication years statistically indistinguishable between canonical and non-canonical texts. When the corpus was compiled in May 2020, none of these texts had a download count exceeding 40, as indicated on the Project Gutenberg website; this low cap serves to keep popular literature out of the corpus. As with the canonical texts, a minimum text length of 35K tokens was applied in the selection process.
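
The selection constraints above can be sketched as follows. This is an illustration under stated assumptions: the `Candidate` record and its field names are hypothetical, and the two-sample Kolmogorov-Smirnov statistic is only one possible way to compare the publication-year distributions; the text above does not say which statistical test was used.

```python
from bisect import bisect_right
from dataclasses import dataclass

MAX_DOWNLOADS = 40   # download-count cap stated above
MIN_TOKENS = 35_000  # minimum text length stated above


@dataclass
class Candidate:
    # Hypothetical record layout, for illustration only.
    title: str
    year: int
    downloads: int
    n_tokens: int


def is_eligible(c: Candidate) -> bool:
    """Keep texts that are rarely downloaded and long enough."""
    return c.downloads <= MAX_DOWNLOADS and c.n_tokens >= MIN_TOKENS


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs.

    Assumption: used here only as an example measure of how far apart two
    publication-year distributions are (0.0 = identical, 1.0 = disjoint).
    """
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

The statistic alone does not establish that two distributions are indistinguishable; a significance test on top of it (or a different test altogether) would be needed for that.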

The sub-corpus of non-canonical texts includes 130 texts. Their authors have a lower international reputation than the canonical authors. The plot below shows the number of articles dedicated to canonical and non-canonical authors in the top 30 Wikipedia editions, a metric used as a proxy for international reputation.

International Reputation (Source: Mohseni et al., 2020)

Although non-canonical authors are generally less renowned than their canonical counterparts, some of them are still famous and have Wikipedia pages in many languages. On closer examination, we found that their notability often stems from activities beyond literature, such as involvement in politics.

Non-Fictional Texts

For the category of non-fictional texts, 185 texts from genres such as philosophy, psychology, and sociology were selected from Project Gutenberg. They date from the same period as the two fictional categories, i.e., the 19th and early 20th centuries. The following table summarizes the JEFP corpus (information taken from Mohseni et al., 2020):

| Category | # Texts | Mean Length (tokens) |
| --- | --- | --- |
| Canonical | 76 | 199K ± 96K |
| Non-Canonical | 130 | 111K ± 56K |
| Non-Fictional | 185 | 171K ± 178K |

The complete list of all texts with metadata is published in corpus-version2.0-all-information.csv.
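
A minimal sketch for exploring the metadata file with Python's standard library. The column names of the CSV are not documented here, so the column passed to `count_by` is an assumption; inspect the header first and substitute the real name.

```python
import csv
from collections import Counter


def read_rows(lines):
    """Parse CSV lines (an open file or any iterable of strings) into dicts."""
    return list(csv.DictReader(lines))


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


def summarize(path, column):
    """Print the CSV header and the per-value counts for one column."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = read_rows(handle)
    if rows:
        print("columns:", list(rows[0].keys()))
    print(count_by(rows, column))
```

For example, `summarize("corpus-version2.0-all-information.csv", "category")` would print the number of texts per category, assuming the file has a column named `category` (a hypothetical name).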

(Information presented in this post is mostly taken from our two papers: Mohseni et al., 2021, 2022)

References

  • Mohseni, Mahdi, Volker Gast, and Christoph Redies, Fractality and Variability in Canonical and Non-Canonical English Fiction and in Non-Fictional Texts, Front. Psychol., 2021, 12, 920.
  • Mohseni, Mahdi, Christoph Redies, and Volker Gast, Approximate Entropy in Canonical and Non-Canonical Fiction, Entropy, 2022, 24(2), 277.
  • Bloom, Harold, The Western Canon: The Books and School of the Ages, Harcourt: New York, NY, USA, 1994.
  • Green, Clarence, Introducing the Corpus of the Canon of Western Literature: A Corpus for Culturomics and Stylistics, Lang. Lit., 2017, 26, 282–299.