# Poetry Sentiment Analysis: Comparative Text Classification Using Statistical and Embedding-Based Models

## Introduction

### Introduction to the Domain-Specific Area



Sentiment analysis entails the automated classification of texts based on their emotional tone. Kim and Klinger identify one definition of sentiment analysis as determining whether an **opinion** expresses "positive or negative feeling" (2019). However, "opinion mining" is merely one application of sentiment analysis, albeit an important one that focuses on detecting positive or negative attitudes ("relatively enduring, affectively colored beliefs, preferences, and predispositions towards objects or persons" in Scherer's affective typology) (Jurafsky & Martin, 2024). As Liu (2012) highlights, this is particularly useful for tasks such as the analysis of customer satisfaction for business purposes, creating movie recommendation engines, and predicting stock prices based on Twitter posts. This project, however, focuses on the multi-class text classification of lines of poetry (taken from the Google Research Datasets' poem-sentiment corpus). Lines of English poetry from Project Gutenberg have been marked with one of four possible labels: 0 for "negative sentiment" or emotion, 1 for "positive sentiment," 2 for "neutral" or "no impact," and 3 for "mixed." This project aims to compare and analyse the performance of classical, statistical approaches to this multi-class classification problem with that of a deep learning model. Unlike typical sentiment analysis applications that detect **attitudes** towards entities, poetry often expresses various complex affective states, including moods, emotions and interpersonal stance (Jurafsky & Martin, 2024).

The vast availability of text data on the Web since the early 2000s has made sentiment analysis a highly valuable research field. Liu (2012) outlines how automated sentiment analysis of online customer reviews, forum posts, and social media comments is central to much current decision-making in the business domain. It provides a more efficient and cost-effective way of gauging customer response to products than traditional methods of measuring customer satisfaction (e.g., surveys and focus groups). Additionally, computing the sentiment polarity of online posts can assist in predicting election results and sociopolitical trends. While the benefits of sentiment analysis for measuring customer satisfaction or political moods have been extensively documented, the rationale for applying it to poetic texts is less immediately apparent than these financial or political advantages. Nevertheless, there are several reasons why it can be of value.

Poetic language poses a specific challenge for any text classification task due to the prevalence of figurative language and unconventional word usage (Kim & Klinger, 2019). Metaphorical and figurative expressions are also used in customer reviews and social media posts; thus, analysing and addressing the challenges raised by the computational detection of sentiment polarity in poetic language may improve the robustness of general sentiment detection techniques. Furthermore, Kim and Klinger (2019) argue that sentiment analysis techniques are of vital importance to the growing field of the "digital humanities". They note that "the stylistic properties of texts can be defined on the basis of their emotional interest," and not merely on their linguistic characteristics. Sentiment analysis can offer important new perspectives on key research topics in literary studies, such as the evolution of the literary expression of emotions in fiction or poetry over different historical periods. Importantly, sentiment analysis of literary texts can also be used to facilitate the detection of bias towards certain geographic locations or demographic groups over time. Sheng and Uthus (2020), who compiled and annotated the dataset used in this project, did so in order to counteract the societal bias of an automated poetry collaboration tool. As an initial step in this process, they trained a BERT-like deep-learning model to identify the sentiment polarity of lines of poetry randomly selected from Project Gutenberg, achieving 84.6% accuracy.

In conclusion, the sentiment analysis of poetic language can be valuable for refining NLP models' capability to handle figurative language, assisting in the formulation and testing of theories in literary research, and diagnosing social biases in seminal cultural texts.



### Objectives

- The primary goal of this project is to evaluate the performance of traditional statistical text classification algorithms (e.g. Multinomial Naive Bayes, Decision Tree classifiers) on this [poetry sentiment polarity detection dataset](https://huggingface.co/datasets/google-research-datasets/poem_sentiment) (Sheng & Uthus, 2020) and to compare it with that of a newer approach to text classification that uses deep learning and embeddings (such as BERT-based transformer models). As such, this study will determine which kind of model is more adept at handling the nuances and subtleties of poetic language.
  
- Each "document" in this classification task consists of a line of poetry randomly sampled from Project Gutenberg. The classification task entails selecting the correct label for each input/line of poetry. The set of labels here is 0 (negative sentiment), 1 (positive sentiment), 2 (neutral sentiment), and 3 (mixed - both positive and negative - sentiment). This is a basic supervised multi-class (4 possible classes) classification task where "each input is considered in isolation from all other inputs" (Bird, Klein, & Loper, 2019).
  
- To my knowledge, as of 15 June 2024 the state-of-the-art performance on this dataset is 89% accuracy, 92% precision, 88% recall and 90% F1-score, obtained by the user [AiManatee (Aime Fernández Vallina) on HuggingFace](https://huggingface.co/AiManatee/RoBERTa_poem_sentiment) using a fine-tuned version of Facebook's large transformer-based model RoBERTa (AiManatee, 2024). This accuracy even exceeds the 84.6% that the original authors reported on the test set (Sheng & Uthus, 2020). These high figures suggest that transformer-based models perform extremely well on this dataset, which raises the question of *why* it would be a valuable contribution to compare the effectiveness of BERT-based models to that of traditional statistical models.
  
- One answer is that deep-learning transformer-based models are a sort of "black box", where exceptional results come at the cost of interpretability. This is in stark contrast to statistical models such as Naive Bayes or logistic regression, which can also be used analytically to identify which features are the strongest predictors of sentiment in poetry. As Kim and Klinger (2019) strongly emphasize, in computational literary studies "*transparency of the computational method is not a bonus; it is a crucial property*" as "in digital humanities, research is often exploratory". Unfortunately, this interpretability "is something that pretrained deep neural methods cannot (yet) provide" (Kim & Klinger, 2019). Consequently, exploring traditional text classification methods for literary texts might still be of value despite the advances made in neural networks.

- Additionally, as Nan Z. Da has pointed out in her strong critique of applying computational techniques to literary studies (2019), *"computational literary criticism is prone to fallacious overclaims or misinterpretations of statistical results because it often places itself in a position of making claims based **purely on word frequencies without regard to position, syntax, context, and semantics**. Word frequencies and the measurement of their differences over time or between works are asked to do an enormous amount of work, standing in for vastly different things".* Da argues that scholars applying NLP techniques to poetic language have so far relied excessively on the assumption that word frequency counts (whether via Bag of Words or TF-IDF encodings) are what distinguish literary genres, markers of historical periods, or the mood and emotions of a text. One of her main criticisms appears to be that these methods fail to take *context* and the relationships between words into account. However, since 2019 there have been significant advances in applying transformer-based models to text data, which *do* take into account the relationships and dependencies between words. This timeline from HuggingFace shows that just as Da was writing this article, transformer-based models were about to take off as one of the main techniques used for text classification: ![image.png](attachment:967c4bee-7ff9-4c28-a327-fb0fceb370a4.png)
Image taken from [(HuggingFace, n.d.)](https://huggingface.co/learn/nlp-course/en/chapter1/4). Consequently, comparing a traditional model based on word frequency counts with one of these contemporary deep-learning models can provide a clear rebuttal of these arguments: there are now methods that have advanced beyond mere word counts. Nonetheless, as mentioned above, this comes at the expense of the transparency which is so crucial to the analysis of patterns in literary and poetic language. Considerable research therefore remains to be done in NLP on balancing the demand for interpretability with the power of BERT-like deep-learning models.

- To achieve the primary, overarching objective stated in the first point here, it is necessary to follow the subsequent steps:
    - First, analyze the dataset to determine which text processing and feature-extraction techniques are best suited for the task at hand. Furthermore, the particularities of the dataset are crucial in determining the choice of model: for instance, the presence of imbalanced sentiment classes. Additionally, this analysis is required to determine the evaluation protocol (e.g. to use cross-validation or not, choice of metrics) used to assess the performance of the classifiers.
    - After conducting this analysis, a baseline shall be calculated using a simple model and version of the dataset, prior to branching out into the results of various experiments using different pre-processing and feature construction techniques.
    - After tabulating and comparing the results, conclusions will be drawn about the advantages and disadvantages of statistical and embedding-based classifiers, and suggestions will be made for improvement and further study.

## Dataset Description

The dataset used here is the [Google Research Datasets Poem Sentiment dataset](https://huggingface.co/datasets/google-research-datasets/poem_sentiment). It was constructed from random verses from the Gutenberg Poem Dataset by Emily Sheng and David Uthus (a developer at Google). The purpose of creating the dataset was to develop techniques mitigating societal bias for a collaborative "poetry composition system" (Sheng & Uthus, 2020). The dataset is licensed under the Creative Commons Attribution 4.0 International License, which can be found [here](https://creativecommons.org/licenses/by/4.0/deed.en) (Creative Commons, n.d.). It allows the user to copy, share, adapt and remix "the material for any purpose, even commercially" as long as attribution is given, and a link to the license is provided. The attribution is: *"Sheng, E., & Uthus, D. (2020). Investigating Societal Biases in a Poetry Composition System. arXiv. Retrieved June 15, 2024, from https://arxiv.org/abs/2011.02686"*. Additionally, the dataset was added to HuggingFace by Suraj Patil ([link to GitHub page](https://github.com/patil-suraj)).

The dataset downloaded from Hugging Face has **already been split into three parts**: the training, validation, and test set. The train set contains 892 samples, the validation set 105 samples, and the test set 104. As can be seen in the code below, each sample in the dataset consists of an 'id' field (an integer, with the count starting from 0), a 'verse_text' field which is the string of poetry that is to be classified, and finally an integer representing the sentiment polarity of the verse_text, with 0 for negative, 1 for positive, 2 for "no impact" (neutral) and 3 for "mixed" (both negative and positive).

According to Sheng and Uthus, at the time of publication (2020), there was "no existing public poetry dataset with sentiment annotations". I was also unable to locate any alternative English-language poetry dataset with sentiment scores (as of June 2024). Sheng and Uthus employed two expert annotators to label the extracts of poetic language. The Cohen's kappa inter-annotator agreement score was 0.53 when all the possible labels (including "mixed" sentiment) were included, but increased to 0.58 when these ambiguous/mixed samples were removed. Cohen's kappa measures "how often the annotators may agree with each other" (Wang, Yang, & Xia, 2019, p. 164387). A score between 0.41 and 0.60 indicates "moderate agreement" between the annotators. Additionally, Spearman's correlation for the samples in the basic three positive/neutral/negative categories was 0.67, which Sheng and Uthus state shows substantial inter-annotator agreement. The authors state that they only kept a sample if both annotators agreed on its label. Before training the BERT model, they filtered the samples to keep only those with "negative", "no impact" (neutral) and "positive" labels; the "mixed" lines of poetry were removed. Thus, the accuracy score of 84.6% was obtained with all "mixed" samples excluded from both the training and test data. As the label counts printed later in this section show, although the training set contains 49 instances of the "mixed" class, the validation and test sets do not contain any samples from this class.
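
As an aside on the agreement statistic quoted above, the following minimal sketch computes Cohen's kappa for two annotators from first principles: observed agreement corrected for the agreement expected by chance. The two label lists are invented for illustration and are not taken from Sheng and Uthus's annotations, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies, summed over labels
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy annotations using the dataset's label scheme (0, 1, 2, 3)
annotator_1 = [2, 2, 0, 1, 2, 3, 0, 2, 1, 2]
annotator_2 = [2, 2, 0, 2, 2, 0, 0, 2, 1, 1]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

With these toy labels the annotators agree on 7 of 10 items, but because both use the neutral label so often, chance agreement is already 0.35, which pulls kappa down to roughly 0.54, inside the "moderate agreement" band mentioned above.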

In [27]:
# Import the "datasets" library that allows downloading datasets from Hugging Face
#!pip install datasets # datasets library is already installed on this machine.
import datasets
from datasets import load_dataset
# Enable loading the local copy of the dataset with this function
from datasets import load_from_disk
import pandas as pd

In [9]:
# Load in the poetry sentiment dataset from Hugging Face
dataset = load_dataset("google-research-datasets/poem_sentiment")

In [20]:
# Print a summary of the different splits in the poem sentiment dataset
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})


In [22]:
# Output the first twenty samples in the train part of the dataset
print(dataset['train'][0:20])

# Output the first five samples in the validation part of the dataset
print(dataset['validation'][0:5])

# Output the first five samples in the test part of the dataset
print(dataset['test'][0:5])

{'id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], 'verse_text': ['with pale blue berries. in these peaceful shades--', 'it flows so long as falls the rain,', 'and that is why, the lonesome day,', 'when i peruse the conquered fame of heroes, and the victories of mighty generals, i do not envy the generals,', 'of inward strife for truth and liberty.', 'the red sword sealed their vows!', 'and very venus of a pipe.', 'who the man, who, called a brother.', 'and so on. then a worthless gaud or two,', 'to hide the orb of truth--and every throne', "the call's more urgent when he journeys slow.", "with the _quart d'heure_ of rabelais!", 'and match, and bend, and thorough-blend, in her colossal form and face.', 'have i played in different countries.', 'tells us that the day is ended."', 'and not alone by gold;', 'that has a charmingly bourbon air.', "sounded o'er earth and sea its blast of war,", 'chief poet on the tiber-side', 'as under a sunbeam a cloud ascends,'],

In [29]:
# Save this dataset locally inside the 'datasets' sub-directory using the dataset's 'save_to_disk' method
original_dir = './datasets/original_poem_sentiment_dataset'
# Save the entire dataset locally
dataset.save_to_disk(original_dir)


In [31]:
# Easy loading of the dataset into notebook
# dataset = load_from_disk(original_dir)
# print(dataset)

In [34]:
# Now convert the three splits into pandas dataframes for easier viewing and analysis of the dataset using the inbuilt 'to_pandas' method
train_df = dataset['train'].to_pandas() 
val_df = dataset['validation'].to_pandas()
test_df = dataset['test'].to_pandas()

In [51]:
# Display the first 10 lines of the training, val and test data
print("TRAIN DATA")
print(train_df.head(10))
print('\n**********************************************************************\n')
print("VALIDATION DATA")
print(val_df.head(10))
print('\n**********************************************************************\n')
print("TEST DATA")
print(test_df.head(10))

TRAIN DATA
   id                                         verse_text  label
0   0  with pale blue berries. in these peaceful shad...      1
1   1                it flows so long as falls the rain,      2
2   2                 and that is why, the lonesome day,      0
3   3  when i peruse the conquered fame of heroes, an...      3
4   4            of inward strife for truth and liberty.      3
5   5                   the red sword sealed their vows!      3
6   6                          and very venus of a pipe.      2
7   7                who the man, who, called a brother.      2
8   8           and so on. then a worthless gaud or two,      0
9   9         to hide the orb of truth--and every throne      2

**********************************************************************

VALIDATION DATA
   id                                         verse_text  label
0   0              to water, cloudlike on the bush afar,      2
1   1      shall yet be glad for him, and he shall bless      1
2   

In [52]:
## Check dataset for balanced/imbalanced classes using the value_counts method to count occurrences of each label
label_counts_train = train_df['label'].value_counts()
print(label_counts_train)

# Label counts for validation set
label_counts_val = val_df['label'].value_counts()
print(label_counts_val)

# Label counts for test set
label_counts_test = test_df['label'].value_counts()
print(label_counts_test)

label
2    555
0    155
1    133
3     49
Name: count, dtype: int64
label
2    69
0    19
1    17
Name: count, dtype: int64
label
2    69
0    19
1    16
Name: count, dtype: int64


## Handling the Missing "Mixed" Class Limitation

As the label counts above show, only the train set contains instances of class 3 (mixed sentiment), 49 in total.

This raises an important issue which must be addressed before proceeding: what should be done in a situation where **only** the training split contains "mixed" samples, particularly given that the "mixed" class was associated with substantially lower inter-annotator agreement? As Sheng and Uthus were interested in this dataset for a very particular use case (transforming lines of poetry expressing "negative" sentiment towards certain demographic groups into "positive" samples), the correct classification of poetic fragments expressing mixed or ambiguous emotion or mood may not have been the authors' main priority. However, when the primary objective *is* to determine the capability of different models to detect sentiment in nuanced, figurative and literary language, being able to correctly identify "mixed" sentiment is pivotal to fully evaluating the robustness of existing text classification tools.

As such, substantial consideration was given to how to proceed and develop an adequate strategy for comparing the classification algorithms while accounting for this limitation. It was necessary to explore the possible ways to navigate around this shortcoming, and the advantages and disadvantages of each solution will now be clarified and summarized.

- **Adding more real samples**: The ideal solution would be to find more samples of poetic text expressing "mixed" (both negative and positive) sentiment. Although enriching the dataset in this way would make it possible to evaluate the robustness of the models more fully, it was not possible to find and employ two expert annotators to do this within the time limits of this project.

- **Dataset "Augmentation"**: The HuggingFace user [AiManatee](https://huggingface.co/AiManatee) states that she "augmented" this dataset by adding 16 "mixed" sentiment samples to the validation and test sets, and that she tested these new instances for ["semantic consistency, diversity (cosine similarity), length variation and novelty (ensuring the augmented data introduced new, relevant vocabulary)"](https://huggingface.co/AiManatee/RoBERTa_poem_sentiment). Unfortunately, the augmented dataset was not made publicly available (AiManatee, 2024). Assessing the RoBERTa model on the augmented data, with the "mixed" class included, led to a very minor decrease in accuracy from 0.8857 to 0.8843, and a decrease in F1-score from 0.9025 to 0.8814 (largely due to a decline in precision from 0.9201 to 0.8811) (AiManatee, 2024). This suggests that including the "mixed" samples produced more false positives. A per-class breakdown of the misclassifications would be more informative, but this information was not provided.

- **Synthetic Sample Generation**: As described by Li et al. (2023), another option would be to generate synthetic text data using a large language model. However, as these authors state, "the effectiveness of the LLM-generated synthetic data in supporting model training is inconsistent across different classification tasks".

- **Dataset Filtering**: An alternative would be simply to filter the dataset by removing the 49 "mixed" samples from the training data, as the original authors did (Sheng & Uthus, 2020), and to test the classifiers only on the remaining labels. Although this would be the easiest option, it is not ideal, because mixed or ambiguous sentiment is an important kind of affective language: poetry and literature more generally frequently express emotions such as nostalgia, wistfulness, or melancholia. Thus, ignoring these samples would greatly diminish the importance and contribution of this study.

- **Re-splitting the Dataset**: One could also combine the training, validation and test sets and create new splits that would contain the "mixed" class in the validation and test sets. However, changing the splits has the major disadvantage that the results could not be compared directly with the scores currently published online, which would diminish the reproducibility and comparability of this work.

Consequently, after some deliberation, it was decided that the classification methods used in this study will be applied *both* to the original train-dev-test split, *as well as* to a new, recombined version of the dataset which will ensure that samples from the "mixed" minority class are included in the validation and test sets. By performing this analysis on both configurations of the dataset, it may be possible to evaluate the impact of including more complex and ambiguous language in the dataset, while also ensuring comparability to existing achievements by running the experiments on the same dataset split.
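
The recombination step described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: the toy frames stand in for the real `train_df`, `val_df` and `test_df` (they reuse the label counts printed earlier but contain placeholder text), the helper `make_toy_split` is hypothetical, and the 80/20 ratio and `random_state=42` are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def make_toy_split(counts, prefix):
    """Build a stand-in DataFrame with the given number of rows per label."""
    rows = [
        {"verse_text": f"{prefix} placeholder verse {label}-{i}", "label": label}
        for label, count in counts.items()
        for i in range(count)
    ]
    return pd.DataFrame(rows)


# Label counts copied from the value_counts output above; the text is placeholder only
toy_train_df = make_toy_split({2: 555, 0: 155, 1: 133, 3: 49}, "train")
toy_val_df = make_toy_split({2: 69, 0: 19, 1: 17}, "val")
toy_test_df = make_toy_split({2: 69, 0: 19, 1: 16}, "test")

# Recombine the three original splits into a single frame
combined_df = pd.concat([toy_train_df, toy_val_df, toy_test_df], ignore_index=True)

# Stratify on the label so every class, including "mixed" (3), keeps
# roughly the same proportion in the new train and test portions
new_train_df, new_test_df = train_test_split(
    combined_df,
    test_size=0.2,  # assumed ratio, not specified in the text
    stratify=combined_df["label"],
    random_state=42,  # fixed seed so the split is reproducible
)

print(new_train_df["label"].value_counts().sort_index())
print(new_test_df["label"].value_counts().sort_index())
```

Because the split is stratified, the 49 "mixed" samples are divided between the two portions in proportion to their size, so the new test portion is no longer missing that class.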

### Evaluation Methodology

#### Training a Baseline Classifier
- Text strings will be pre-processed using basic methods (i.e. word tokenization, Bag-of-Words).
- A simple classifier (e.g. Multinomial Naive Bayes) will be trained on the training set to provide a benchmark for future experiments.
- This basic classifier will be assessed using the existing validation set. Along with accuracy, the precision, recall and F1 scores will be calculated for each of the three classes present in the original validation set.
- A confusion matrix will be used to visualize performance across classes.
- It is vital to calculate recall and precision for each class in addition to accuracy, as the dataset is heavily imbalanced.
- Macro-averages for precision and recall will be computed. Macro-averaging tends to yield lower scores than micro-averaging on imbalanced data because equal weight is given to each class (including the poorly-performing ones). Macro-averages are useful when the performance on each class is "equally important": here, we want to see how well the classifier performs over *all* the sentiment polarities (EvidentlyAI, 2024). By contrast, micro-averaging can "overemphasize the performance of the majority class", which can result in misleadingly "inflated" performance scores when the algorithm performs poorly on the less-represented categories.
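
The difference between the two averaging schemes can be illustrated with a deliberately degenerate example. The sketch below uses the label counts of the original validation set (69 neutral, 19 negative, 17 positive) and a hypothetical classifier that always predicts the majority class; it is not one of the models evaluated in this project.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Validation label counts from the output above: 69 x label 2, 19 x label 0, 17 x label 1
y_true = [2] * 69 + [0] * 19 + [1] * 17
# Hypothetical degenerate classifier: always predicts the majority class (2, neutral)
y_pred = [2] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)

# Macro: compute each metric per class, then take the unweighted mean over classes.
# zero_division=0 scores precision as 0 for classes that are never predicted.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# Micro: pool all individual decisions before computing the metric
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)

print(f"accuracy:        {accuracy:.3f}")
print(f"macro precision: {macro_p:.3f}  macro recall: {macro_r:.3f}  macro F1: {macro_f1:.3f}")
print(f"micro precision: {micro_p:.3f}  micro recall: {micro_r:.3f}  micro F1: {micro_f1:.3f}")
```

Accuracy and the micro-averaged scores all equal 69/105 (about 0.66) even though two of the three classes are never predicted, whereas macro recall drops to 1/3. This is why the macro-averages are reported alongside accuracy.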

#### Pre-Processing Experiments
- Several experiments will be run using the original split, e.g. adding negation handling, removing stopwords. 
- The same metrics as mentioned above will be calculated, tabulated and compared with baseline scores.
-  The highest-performing configuration will be used on the test set.
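
As an illustration of what "negation handling" might look like, the sketch below prefixes `NOT_` to every word that follows a negation word, up to the next punctuation mark, which is one common scheme for bag-of-words sentiment classifiers. Whether the experiments use exactly this scheme is an assumption; the names `mark_negation`, `NEGATION_WORDS`, `PUNCTUATION` and `TOKEN_PATTERN`, and the choice of negation words, are hypothetical.

```python
import re

# Assumed, non-exhaustive list of negation cues
NEGATION_WORDS = {"not", "no", "never", "nor", "neither", "cannot"}
# Punctuation marks that end the scope of a negation
PUNCTUATION = set(".,;:!?")
# A token is either a run of letters/apostrophes or a single punctuation mark
TOKEN_PATTERN = re.compile(r"[a-z']+|[.,;:!?]")


def mark_negation(text):
    """Tokenize text and prefix NOT_ to words inside the scope of a negation."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    marked = []
    negating = False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False  # punctuation closes the negation scope
            marked.append(token)
        elif negating:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
            # Open a negation scope after a cue word or an "n't" contraction
            if token in NEGATION_WORDS or token.endswith("n't"):
                negating = True
    return marked


# Excerpt from training sample 3 shown earlier
example = "i do not envy the generals,"
print(mark_negation(example))
```

A function of this shape could be passed as the `tokenizer` argument of scikit-learn's `CountVectorizer`, so that "envy" and "NOT_envy" become separate features. Stop-word lists often contain negation words such as "not", so the order in which stop-word removal and negation marking are applied matters.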

#### Stratified K-Fold Cross Validation
-  Following these explorations using the original dataset train-validation-test split, I will recombine the samples into one dataset, before creating a train-test split that ensures that the "mixed" samples are represented in both sections.
-  Experiments with different pre-processing techniques will be conducted again, but using stratified k-fold cross-validation on the new train set to guarantee that "mixed" instances are present in each validation split. 
- With ordinary cross-validation, the splits might not be "representative of the overall data distribution" (Nagaraj, 2023). If the training data in one particular fold lacks any "mixed" samples, a classifier such as Naive Bayes would struggle with the "zero probability problem" (Jayaswal, 2020).
-  Moreover, with a relatively small dataset such as this one, cross-validation ensures that all of the data is used and improves the robustness of the evaluation. 
-  The pre-processing configuration leading to the highest performance will then be used to evaluate performance on the test set.
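
The stratified folds described above can be sketched with scikit-learn's `StratifiedKFold`. The labels below are placeholders that reproduce the combined class counts of the three original splits (693 neutral, 193 negative, 166 positive, 49 mixed); for simplicity the folds are drawn from all 1101 labels rather than from the new training portion only, and the choice of five folds and the seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels with the combined class counts of the original splits
labels = np.array([2] * 693 + [0] * 193 + [1] * 166 + [3] * 49)
# Stand-in feature matrix: split() only needs one row per sample
features = np.zeros((len(labels), 1))

# Five folds and the seed are illustrative choices
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mixed_per_fold = []
for fold, (train_idx, val_idx) in enumerate(skf.split(features, labels), start=1):
    # Count how many "mixed" (label 3) samples fall into this validation fold
    n_mixed = int((labels[val_idx] == 3).sum())
    mixed_per_fold.append(n_mixed)
    print(f"Fold {fold}: {len(val_idx)} validation samples, {n_mixed} mixed")
```

Each validation fold receives a share of the 49 "mixed" samples, which ordinary (unstratified) k-fold splitting does not guarantee for such a small minority class.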

#### Deep-Learning Model Evaluation
- The same protocol (first using the original split, then using stratified cross-validation) will be applied to assess the performance of a transformer-based deep-learning model.
- Stratified cross-validation will also be used here to find the best combination of hyperparameters for the re-split dataset, to avoid overfitting to the original validation set.
- The highest-performing configuration will be evaluated on the test set.
- Tables and confusion matrices will be used throughout to compare the different models, data splits and text-processing techniques.


## References

[1] Kim, E., & Klinger, R. (2019). A survey on sentiment and emotion analysis for computational literary studies. Zeitschrift für digitale Geisteswissenschaften, Herzog August Bibliothek. Retrieved June 15, 2024, from https://zfdg.de/2019_008_v1

[2] Liu, B. (2012). Sentiment analysis and opinion mining (1st ed.). Springer. https://doi.org/10.1007/978-3-031-02145-9

[3] Sheng, E., & Uthus, D. (2020). Investigating societal biases in a poetry composition system. arXiv. Retrieved June 15, 2024, from https://arxiv.org/abs/2011.02686

[4] Jurafsky, D., & Martin, J. H. (2024). Speech and language processing (3rd ed. draft, February 3, 2024). Online Draft. Stanford University. Retrieved June 15, 2024, from https://web.stanford.edu/~jurafsky/slp3/

[5] Bird, S., Klein, E., & Loper, E. (2019). Natural language processing with Python. O'Reilly Media, Inc. Retrieved June 15, 2024, from https://www.nltk.org/book/

[6] Sheng, E., & Uthus, D. (2020). Poem sentiment [Data set]. Hugging Face. https://huggingface.co/datasets/google-research-datasets/poem_sentiment

[7] AiManatee. (2024). RoBERTa_poem_sentiment [Machine learning model]. Hugging Face. Retrieved June 15, 2024, from https://huggingface.co/AiManatee/RoBERTa_poem_sentiment

[8] Da, N. Z. (2019). The computational case against computational literary studies. Critical Inquiry, 45(3), 601-639. https://doi.org/10.1086/702594

[9] Hugging Face. (n.d.). Introduction to NLP with Hugging Face. Hugging Face. Retrieved June 15, 2024, from https://huggingface.co/learn/nlp-course/en/chapter1/4

[10] Wang, J., Yang, Y., & Xia, B. (2019). A simplified Cohen’s Kappa for use in binary classification data annotation tasks. IEEE Access, 7, 164386-164397. https://doi.org/10.1109/ACCESS.2019.2953104

[11] McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica (Zagreb), 22(3), 276-282. PMID: 23092060; PMCID: PMC3900052.

[12] Creative Commons. (n.d.). Attribution 4.0 International (CC BY 4.0). Retrieved June 18, 2024, from https://creativecommons.org/licenses/by/4.0/

[13] Patil, S. [@patil-suraj]. (n.d.). Suraj Patil. GitHub. Retrieved June 18, 2024, from https://github.com/patil-suraj

[14] Li, Z., Zhu, H., Lu, Z., & Yin, M. (2023). Synthetic data generation with large language models for text classification: Potential and limitations. arXiv preprint arXiv:2310.07849. Retrieved from https://arxiv.org/abs/2310.07849

[15] Shalevska, E., & Ma, Y. (2024). The digital laureate: Examining AI-generated poetry. RATE Issues, 31. https://doi.org/10.69475/RATEI.2024.1.1

[16] Köbis, N., & Mossink, L. D. (2021). Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry. Computers in Human Behavior, 114, 106553. https://doi.org/10.1016/j.chb.2020.106553

[17] Silva, M. J. (2023, October 20). Unveiling the ethics of AI and language models in art. Ecolonical. https://ecolonical.org/ethical-ai-and-llms-in-art/

[18] Widodo, S., Brawijaya, H., & Samudi, S. (2022). Stratified K-fold cross validation optimization on machine learning for prediction. Sinkron: Jurnal Dan Penelitian Teknik Informatika, 6(4), 2407-2414. https://doi.org/10.33395/sinkron.v7i4.11792

[19] Evidently AI Team. (2024). Accuracy, precision, and recall in multi-class classification. Evidently AI. Retrieved June 18, 2024, from https://www.evidentlyai.com/classification-metrics/multi-class-metrics

[20] Nagaraj, Y. (2023, October 20). Stratified K-Fold Cross-Validation: An in-depth look 📊🧪. LinkedIn. Retrieved June 19, 2024, from https://www.linkedin.com/pulse/stratified-k-fold-cross-validation-in-depth-look-yeshwanth-n/

[21] Jayaswal, V. (2020, November 22). Laplace smoothing in Naïve Bayes algorithm: Solving the zero probability problem in Naïve Bayes algorithm. Towards Data Science. Retrieved June 19, 2024, from https://towardsdatascience.com/laplace-smoothing-in-na%C3%AFve-bayes-algorithm-9c237a8bdece