# 😡 KazSAnDRA 😀


This repository provides the dataset and pre-trained polarity and score classification models for the paper [KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes](https://arxiv.org/abs/2403.19335).

## Domains ℹ️

The source data for our dataset came from four domains:

1. an online store for Android devices that offers a diverse range of applications (hereafter Appstore),
2. an online library that serves as a source of books and audiobooks in Kazakh (hereafter Bookstore),
3. digital mapping and navigation services (hereafter Mapping),
4. online marketplaces (hereafter Market).
| Domain | ⭐️ | ⭐️⭐️ | ⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️⭐️ | Total |
|---|---:|---:|---:|---:|---:|---:|
| Appstore | 22,547 | 4,202 | 5,758 | 7,949 | 94,617 | 135,073 |
| Bookstore | 686 | 107 | 222 | 368 | 4,422 | 5,805 |
| Mapping | 959 | 270 | 369 | 525 | 6,774 | 8,897 |
| Market | 1,043 | 350 | 913 | 2,775 | 25,208 | 30,289 |
| **Total** | 25,235 | 4,929 | 7,262 | 11,617 | 131,021 | 180,064 |

## Review Variations 🔀

In Kazakhstan, people often switch between Kazakh and Russian, and there is also a trend of moving from the Cyrillic script to the Latin script. As a result, the Kazakh reviews in our dataset can take various forms: (a) purely Kazakh words written in the Kazakh Cyrillic script, (b) Kazakh words in the Latin script, (c) a mix of Cyrillic and Latin characters, (d) a mix of Russian and Kazakh words, or (e) entirely in Cyrillic with Russian characters used in place of Kazakh ones.

| | Actual review | Correct form (Kazakh) | Correct form (English) |
|---|---|---|---|
| (a) | керемет кітап | керемет кітап | a wonderful book |
| (b) | keremet | керемет | wonderful |
| (c) | jok кітап | кітап жоқ | no books |
| (d) | Осы приложениеге көп рахмет! | Осы қолданбаға көп рақмет! | Many thanks to this app! |
| (e) | Кушти! | Күшті! | Great! |

## Sentiment Classification Tasks 🕵️‍♂️

We utilised KazSAnDRA for two distinct tasks:

1. polarity classification (PC), involving the prediction of whether a review is positive or negative (see the sketch after this list):
   - reviews with original scores of 1 or 2 were classified as negative and assigned a new score of 0,
   - reviews with original scores of 4 or 5 were classified as positive and assigned a new score of 1,
   - reviews with an original score of 3 were categorised as neutral and excluded from the task.
2. score classification (SC), where the objective was to predict the score of a review on a scale from 1 to 5.
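
A minimal sketch of the PC relabelling, assuming the raw reviews sit in a pandas DataFrame; the `score` column name is an assumption, not the dataset's actual schema:

```python
import pandas as pd

def to_polarity(df: pd.DataFrame, score_col: str = "score") -> pd.DataFrame:
    """Map 1-5 star scores to binary polarity labels for the PC task."""
    df = df[df[score_col] != 3].copy()               # score 3 is neutral: excluded
    df["label"] = (df[score_col] >= 4).astype(int)   # 1-2 -> 0 (negative), 4-5 -> 1 (positive)
    return df
```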

## Data Pre-Processing 🔧

During the data pre-processing stage, the following steps were undertaken (a code sketch follows the list):

- Removal of emojis 🤓
- Lowercasing all reviews 🔠 ➙ 🔡
- Removal of punctuation marks ⁉️
- Removal of newline (`\n`), tab (`\t`), and carriage return (`\r`) characters ⇥ ↵
- Replacement of multiple spaces with a single space ␣
- Reduction of consecutive recurring characters to two instances (e.g., "кееррреемееетт" to "кеерреемеетт") 🔂
- Removal of duplicate entries (i.e., reviews sharing identical text and scores) 👯‍♂️
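
A minimal sketch of these steps in plain Python; the emoji ranges and punctuation set are approximations, not the repository's exact implementation:

```python
import re
import string

# Approximate emoji codepoint ranges; the actual pipeline may use a fuller pattern.
EMOJI = re.compile("[\U0001F000-\U0001FAFF\U00002600-\U000027BF]")

def clean_review(text: str) -> str:
    text = EMOJI.sub("", text)                                        # remove emojis
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip ASCII punctuation
    text = re.sub(r"[\n\t\r]", " ", text)                             # drop \n, \t, \r
    text = re.sub(r" {2,}", " ", text).strip()                        # collapse multiple spaces
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                        # cap character runs at two
    return text

# Duplicate removal (identical text and score), e.g. with pandas:
# df = df.drop_duplicates(subset=["text_cleaned", "label"])
```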

## Data Partitioning 🧩

To maintain consistency and facilitate reproducibility of our experimental outcomes among different research groups, we partitioned KazSAnDRA into three distinct sets: training (train), validation (valid), and testing (test), following an 80/10/10 ratio.
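
A minimal sketch of an 80/10/10 split with scikit-learn, assuming a pandas DataFrame; the repository's actual partitioning (seeding, stratification) may differ:

```python
from sklearn.model_selection import train_test_split

def split_80_10_10(df, seed: int = 42):
    """Split a DataFrame into train/valid/test sets with an 80/10/10 ratio."""
    train, rest = train_test_split(df, test_size=0.2, random_state=seed)
    valid, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, valid, test
```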

| Task | Train # | Train % | Valid # | Valid % | Test # | Test % | Total # | Total % |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| PC | 134,368 | 80 | 16,796 | 10 | 16,797 | 10 | 167,961 | 100 |
| SC | 140,126 | 80 | 17,516 | 10 | 17,516 | 10 | 175,158 | 100 |

The distribution of reviews across the three sets based on their domains and scores for the PC task:

| Domain | Train # | Train % | Valid # | Valid % | Test # | Test % |
|---|---:|---:|---:|---:|---:|---:|
| Appstore | 101,477 | 75.52 | 12,685 | 75.52 | 12,685 | 75.52 |
| Market | 22,561 | 16.79 | 2,820 | 16.79 | 2,820 | 16.79 |
| Mapping | 6,509 | 4.84 | 813 | 4.84 | 814 | 4.85 |
| Bookstore | 3,821 | 2.84 | 478 | 2.85 | 478 | 2.85 |
| **Total** | 134,368 | 100 | 16,796 | 100 | 16,797 | 100 |

| Score | Train # | Train % | Valid # | Valid % | Test # | Test % |
|---|---:|---:|---:|---:|---:|---:|
| 1 | 110,417 | 82.18 | 13,801 | 82.17 | 13,804 | 82.18 |
| 0 | 23,951 | 17.82 | 2,995 | 17.83 | 2,993 | 17.82 |
| **Total** | 134,368 | 100 | 16,796 | 100 | 16,797 | 100 |

The distribution of reviews across the three sets based on their domains and scores for the SC task:

| Domain | Train # | Train % | Valid # | Valid % | Test # | Test % |
|---|---:|---:|---:|---:|---:|---:|
| Appstore | 106,058 | 75.69 | 13,258 | 75.69 | 13,257 | 75.69 |
| Market | 23,278 | 16.61 | 2,909 | 16.61 | 2,910 | 16.61 |
| Mapping | 6,794 | 4.85 | 849 | 4.85 | 849 | 4.85 |
| Bookstore | 3,996 | 2.85 | 500 | 2.85 | 500 | 2.85 |
| **Total** | 140,126 | 100 | 17,516 | 100 | 17,516 | 100 |

| Score | Train # | Train % | Valid # | Valid % | Test # | Test % |
|---|---:|---:|---:|---:|---:|---:|
| 5 | 101,302 | 72.29 | 12,663 | 72.29 | 12,663 | 72.29 |
| 1 | 20,031 | 14.29 | 2,504 | 14.30 | 2,504 | 14.30 |
| 4 | 9,115 | 6.50 | 1,140 | 6.51 | 1,139 | 6.50 |
| 3 | 5,758 | 4.11 | 719 | 4.10 | 720 | 4.11 |
| 2 | 3,920 | 2.80 | 490 | 2.80 | 490 | 2.80 |
| **Total** | 140,126 | 100 | 17,516 | 100 | 17,516 | 100 |

## Score Resampling ♻️

To address the data imbalance in our training data, we employed random oversampling (ROS) and random undersampling (RUS). ROS balances class representation by creating new samples for the smaller classes to match the count of the majority class, whereas RUS removes samples from the larger classes to match the count of the minority class.
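
A minimal sketch of both techniques using pandas group-wise sampling; dedicated libraries such as imbalanced-learn offer the same functionality:

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Upsample every class to the size of the largest class (ROS)."""
    n_max = df[label_col].value_counts().max()
    return pd.concat(
        g.sample(n=n_max, replace=True, random_state=42)
        for _, g in df.groupby(label_col)
    ).reset_index(drop=True)

def random_undersample(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Downsample every class to the size of the smallest class (RUS)."""
    n_min = df[label_col].value_counts().min()
    return pd.concat(
        g.sample(n=n_min, replace=False, random_state=42)
        for _, g in df.groupby(label_col)
    ).reset_index(drop=True)
```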

The balanced training sets for the PC task:

| Score | Balanced (ROS) | Balanced (RUS) | Imbalanced |
|---|---:|---:|---:|
| 0 | 110,417 | 23,951 | 23,951 |
| 1 | 110,417 | 23,951 | 110,417 |

The balanced training sets for the SC task:

| Score | Balanced (ROS) | Balanced (RUS) | Imbalanced |
|---|---:|---:|---:|
| 1 | 101,302 | 3,920 | 20,031 |
| 2 | 101,302 | 3,920 | 3,920 |
| 3 | 101,302 | 3,920 | 5,758 |
| 4 | 101,302 | 3,920 | 9,115 |
| 5 | 101,302 | 3,920 | 101,302 |

## Dataset Structure 📁

The `dataset` folder contains ten ZIP files, each containing a CSV file. Files `01` to `05` are associated with PC (polarity classification), while files `06` to `10` are related to SC (score classification). To align with the enumeration used for labelling in the classifier, which starts from 0 rather than 1, labels 1–5 in the SC task were transformed into 0–4. Different training set variations are indicated by the suffixes `ib` for imbalanced data, `ros` for random oversampling, and `rus` for random undersampling. Each file includes records containing a custom review identifier (`custom_id`), the original review text (`text`), the pre-processed review text (`text_cleaned`), the corresponding review score (`label`), and the domain information (`domain`).
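
A minimal loading sketch with pandas, which can read a CSV directly from a ZIP archive; the file name below is hypothetical, so check the `dataset` folder for the actual names:

```python
import pandas as pd

# Hypothetical file name: substitute one of the actual "01"-"10" ZIP files.
df = pd.read_csv("dataset/01_pc_train_ib.zip")  # pandas reads a lone CSV inside a ZIP directly

print(df.columns.tolist())          # custom_id, text, text_cleaned, label, domain
print(df["label"].value_counts())   # class distribution
```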

## Sentiment Classification Models 🧠

For the evaluation of KazSAnDRA, we utilised four multilingual machine learning models, all incorporating the Kazakh language and accessible through the Hugging Face Transformers framework:

  1. mBERT
  2. XLM-R
  3. RemBERT
  4. mBART-50

## Experimental Setup 🔬

The models were fine-tuned on both the balanced and imbalanced training sets, while the hyperparameters were tuned on the validation set. The best-performing models were then evaluated on the test sets. Fine-tuning was executed on a single A100 GPU hosted on an NVIDIA DGX A100 machine. The initial learning rate was set to 10⁻⁵ and the weight decay rate to 10⁻³. Early stopping was triggered when the F1-score showed no improvement for three consecutive epochs. We set the batch size to 32 (mBERT, XLM-R, RemBERT) or 16 (mBART-50) and applied 800 warm-up steps.
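
A hedged sketch of this setup using the Transformers `TrainingArguments` API; it mirrors the reported hyperparameters, but `finetune_evaluate.py` in this repository is the authoritative configuration:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Pass these to a Trainer together with a compute_metrics function that returns an "f1" entry.
training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=1e-5,                 # initial learning rate of 10^-5
    weight_decay=1e-3,                  # weight decay rate of 10^-3
    per_device_train_batch_size=32,     # 16 for mBART-50
    warmup_steps=800,
    num_train_epochs=20,                # upper bound (assumption); early stopping usually ends sooner
    evaluation_strategy="epoch",        # "eval_strategy" in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # stop after 3 epochs without F1 gain
```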

| Model | PC ROS | PC RUS | PC IB | SC ROS | SC RUS | SC IB |
|---|---:|---:|---:|---:|---:|---:|
| mBERT | 4 | 7 | 6 | 8 | 10 | 11 |
| XLM-R | 5 | 7 | 5 | 4 | 9 | 16 |
| RemBERT | 4 | 5 | 5 | 6 | 6 | 9 |
| mBART-50 | 5 | 7 | 5 | 8 | 7 | 5 |

*Number of training epochs for the models*

## Performance Metrics 📏

Several conventional metrics were used to evaluate the performance of the models: accuracy (A), precision (P), recall (R), and F1-score (F1). Given the imbalanced nature of the dataset, in which all classes carry equal importance, we opted for macro-averaging: the arithmetic (i.e., unweighted) mean of the per-class F1-scores. This ensures equal treatment of all classes during evaluation and penalises the model more strongly if it performs worse on minority classes.
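
A minimal sketch of the evaluation with scikit-learn, assuming integer label arrays:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"A": accuracy_score(y_true, y_pred), "P": p, "R": r, "F1": f1}
```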

## Fine-Tuning, Evaluating, and Predicting Models 🤖

1. Download this repository and install the required packages:

```bash
git clone https://github.com/IS2AI/KazSAnDRA.git
cd KazSAnDRA/scripts
pip install -r requirements.txt
```

2. To fine-tune and evaluate a model, select the necessary arguments in `finetune_evaluate.py` and run:

```bash
python finetune_evaluate.py
```

3. To classify a review, select the necessary arguments, add a review in `predict.py`, and run (see also the sketch after these steps):

```bash
python predict.py
```
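
As an alternative to `predict.py`, a hedged inference sketch with the Transformers pipeline; the checkpoint name is hypothetical, so substitute the released model identifier from the Hugging Face Hub:

```python
from transformers import pipeline

# Hypothetical checkpoint name: replace with the released KazSAnDRA model identifier.
classifier = pipeline("text-classification", model="issai/kazsandra-pc-model")

print(classifier("керемет кітап"))  # expected: a positive (label 1) prediction
```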

## Experiment Results 📊

| Model | ROS A | ROS P | ROS R | ROS F1 | RUS A | RUS P | RUS R | RUS F1 | IB A | IB P | IB R | IB F1 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| mBERT | 0.84 | 0.74 | 0.83 | 0.77 | 0.85 | 0.76 | 0.82 | 0.78 | 0.89 | 0.82 | 0.79 | 0.80 |
| XLM-R | 0.86 | 0.76 | 0.83 | 0.79 | 0.85 | 0.75 | 0.83 | 0.78 | 0.89 | 0.81 | 0.81 | 0.81 |
| RemBERT | 0.88 | 0.79 | 0.82 | 0.81 | 0.87 | 0.78 | 0.82 | 0.80 | 0.89 | 0.81 | 0.82 | 0.81 |
| mBART-50 | 0.87 | 0.77 | 0.79 | 0.78 | 0.81 | 0.72 | 0.81 | 0.74 | 0.89 | 0.82 | 0.78 | 0.80 |

*PC results on the test sets (ROS/RUS = balanced training sets, IB = imbalanced)*

| Model | ROS A | ROS P | ROS R | ROS F1 | RUS A | RUS P | RUS R | RUS F1 | IB A | IB P | IB R | IB F1 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| mBERT | 0.67 | 0.34 | 0.36 | 0.35 | 0.63 | 0.35 | 0.39 | 0.36 | 0.77 | 0.44 | 0.36 | 0.37 |
| XLM-R | 0.58 | 0.36 | 0.42 | 0.36 | 0.66 | 0.36 | 0.41 | 0.37 | 0.77 | 0.42 | 0.37 | 0.39 |
| RemBERT | 0.73 | 0.37 | 0.36 | 0.36 | 0.62 | 0.35 | 0.40 | 0.35 | 0.76 | 0.41 | 0.38 | 0.39 |
| mBART-50 | 0.74 | 0.36 | 0.34 | 0.35 | 0.55 | 0.36 | 0.41 | 0.34 | 0.77 | 0.42 | 0.37 | 0.38 |

*SC results on the test sets (ROS/RUS = balanced training sets, IB = imbalanced)*

| actual ↓ \ predicted → | 0 | 1 | Total |
|---|---:|---:|---:|
| 0 | 2,155 | 838 | 2,993 |
| 1 | 1,036 | 12,768 | 13,804 |

*RemBERT PC confusion matrix on the test set*

| actual ↓ \ predicted → | 1 | 2 | 3 | 4 | 5 | Total |
|---|---:|---:|---:|---:|---:|---:|
| 1 | 1,379 | 145 | 132 | 64 | 784 | 2,504 |
| 2 | 182 | 55 | 56 | 25 | 172 | 490 |
| 3 | 173 | 54 | 118 | 65 | 310 | 720 |
| 4 | 110 | 39 | 90 | 169 | 731 | 1,139 |
| 5 | 564 | 59 | 165 | 297 | 11,578 | 12,663 |

*RemBERT SC confusion matrix on the test set*

| Domain | A | P | R | F1 |
|---|---:|---:|---:|---:|
| Appstore | 0.87 | 0.80 | 0.81 | 0.80 |
| Bookstore | 0.86 | 0.75 | 0.80 | 0.77 |
| Mapping | 0.92 | 0.84 | 0.88 | 0.86 |
| Market | 0.97 | 0.84 | 0.91 | 0.87 |

*RemBERT PC results by domain*

| Domain | A | P | R | F1 |
|---|---:|---:|---:|---:|
| Appstore | 0.74 | 0.41 | 0.37 | 0.38 |
| Bookstore | 0.73 | 0.34 | 0.32 | 0.32 |
| Mapping | 0.80 | 0.42 | 0.41 | 0.41 |
| Market | 0.82 | 0.43 | 0.41 | 0.42 |

*RemBERT SC results by domain*

## Acknowledgements 🙏

We sincerely thank Alma Murzagulova, Aizhan Seipanova, Meiramgul Akanova, Almas Aitzhan, Aigerim Boranbayeva, and Assel Kospabayeva, who acted as moderators during the review collection process. Their tireless efforts, diligence, and remarkable patience contributed significantly to the successful completion of this endeavour.

## Citation 🎓

If you incorporate our dataset and/or models into your work, we kindly ask that you cite our paper. Proper referencing upholds academic honesty and ensures due acknowledgement of the authors' efforts.

```bibtex
@misc{yeshpanov2024kazsandra,
      title={KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes},
      author={Rustem Yeshpanov and Huseyin Atakan Varol},
      year={2024},
      eprint={2403.19335},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
