# 😡 KazSAnDRA 😀


This repository provides the dataset and pre-trained polarity and score classification models for the paper [KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes](https://arxiv.org/abs/2403.19335).

## Domains ℹ️

The source data for our dataset came from four domains:

1. an online store for Android devices that offers a diverse range of applications (hereafter Appstore),
2. an online library that serves as a source of books and audiobooks in Kazakh (hereafter Bookstore),
3. digital mapping and navigation services (hereafter Mapping),
4. online marketplaces (hereafter Market).
| Domain | ⭐️ | ⭐️⭐️ | ⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️⭐️ | Total |
|---|---:|---:|---:|---:|---:|---:|
| Appstore | 22,547 | 4,202 | 5,758 | 7,949 | 94,617 | 135,073 |
| Bookstore | 686 | 107 | 222 | 368 | 4,422 | 5,805 |
| Mapping | 959 | 270 | 369 | 525 | 6,774 | 8,897 |
| Market | 1,043 | 350 | 913 | 2,775 | 25,208 | 30,289 |
| **Total** | 25,235 | 4,929 | 7,262 | 11,617 | 131,021 | 180,064 |

## Review Variations 🔀

In Kazakhstan, people often switch between Kazakh and Russian, and there is also a trend of moving from the Cyrillic script to the Latin script. As a result, the Kazakh reviews in our dataset can take various forms: (a) purely Kazakh words written in the Kazakh Cyrillic script, (b) Kazakh words in the Latin script, (c) a mix of Cyrillic and Latin characters, (d) a mix of Russian and Kazakh words, or (e) entirely in Cyrillic with Russian characters used in place of Kazakh ones.

| | Actual review | Correct form (Kazakh) | Correct form (English) |
|---|---|---|---|
| (a) | керемет кітап | керемет кітап | a wonderful book |
| (b) | keremet | керемет | wonderful |
| (c) | jok кітап | кітап жоқ | no books |
| (d) | Осы приложениеге көп рахмет! | Осы қолданбаға көп рақмет! | Many thanks to this app! |
| (e) | Кушти! | Күшті! | Great! |

## Sentiment Classification Tasks 🕵️‍♂️

We utilised KazSAnDRA for two distinct tasks:

1. polarity classification (PC), involving the prediction of whether a review is positive or negative (see the sketch after this list):
   - reviews with original scores of 1 or 2 were classified as negative and assigned a new score of 0,
   - reviews with original scores of 4 or 5 were classified as positive and assigned a new score of 1,
   - reviews with an original score of 3 were categorised as neutral and excluded from the task.
2. score classification (SC), where the objective was to predict the score of a review on a scale from 1 to 5.
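
A minimal sketch of the PC relabelling, assuming the raw reviews sit in a pandas DataFrame; the `score` column name is an assumption, not the dataset's actual schema:

```python
import pandas as pd

def to_polarity(df: pd.DataFrame, score_col: str = "score") -> pd.DataFrame:
    """Map 1-5 star scores to binary polarity labels for the PC task."""
    df = df[df[score_col] != 3].copy()               # score 3 is neutral: excluded
    df["label"] = (df[score_col] >= 4).astype(int)   # 1-2 -> 0 (negative), 4-5 -> 1 (positive)
    return df
```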

## Data Pre-Processing 🔧

During the data pre-processing stage, the following steps were undertaken (a code sketch follows the list):

- Removal of emojis 🤓
- Lowercasing all reviews 🔠 ➙ 🔡
- Removal of punctuation marks ⁉️
- Removal of newline (`\n`), tab (`\t`), and carriage return (`\r`) characters ⇥ ↵
- Replacement of multiple spaces with a single space ␣
- Reduction of consecutive recurring characters to two instances (e.g., "кееррреемееетт" to "кеерреемеетт") 🔂
- Removal of duplicate entries (i.e., reviews sharing identical text and scores) 👯‍♂️
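
A minimal sketch of these steps in plain Python; the emoji ranges and punctuation set are approximations, not the repository's exact implementation:

```python
import re
import string

# Approximate emoji codepoint ranges; the actual pipeline may use a fuller pattern.
EMOJI = re.compile("[\U0001F000-\U0001FAFF\U00002600-\U000027BF]")

def clean_review(text: str) -> str:
    text = EMOJI.sub("", text)                                        # remove emojis
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip ASCII punctuation
    text = re.sub(r"[\n\t\r]", " ", text)                             # drop \n, \t, \r
    text = re.sub(r" {2,}", " ", text).strip()                        # collapse multiple spaces
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                        # cap character runs at two
    return text

# Duplicate removal (identical text and score), e.g. with pandas:
# df = df.drop_duplicates(subset=["text_cleaned", "label"])
```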

## Data Partitioning 🧩

To maintain consistency and facilitate reproducibility of our experimental outcomes among different research groups, we partitioned KazSAnDRA into three distinct sets: training (train), validation (valid), and testing (test), following an 80/10/10 ratio.
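
A minimal sketch of an 80/10/10 split with scikit-learn, assuming a pandas DataFrame; the repository's actual partitioning (seeding, stratification) may differ:

```python
from sklearn.model_selection import train_test_split

def split_80_10_10(df, seed: int = 42):
    """Split a DataFrame into train/valid/test sets with an 80/10/10 ratio."""
    train, rest = train_test_split(df, test_size=0.2, random_state=seed)
    valid, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, valid, test
```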

| Task | Train # | Train % | Valid # | Valid % | Test # | Test % | Total # | Total % |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| PC | 134,368 | 80 | 16,796 | 10 | 16,797 | 10 | 167,961 | 100 |
| SC | 140,126 | 80 | 17,516 | 10 | 17,516 | 10 | 175,158 | 100 |

The distribution of reviews across the three sets based on their domains and scores for the PC task:

| Domain | Train # | Train % | Valid # | Valid % | Test # | Test % |
|---|---:|---:|---:|---:|---:|---:|
| Appstore | 101,477 | 75.52 | 12,685 | 75.52 | 12,685 | 75.52 |
| Market | 22,561 | 16.79 | 2,820 | 16.79 | 2,820 | 16.79 |
| Mapping | 6,509 | 4.84 | 813 | 4.84 | 814 | 4.85 |
| Bookstore | 3,821 | 2.84 | 478 | 2.85 | 478 | 2.85 |
| **Total** | 134,368 | 100 | 16,796 | 100 | 16,797 | 100 |

| Score | Train # | Train % | Valid # | Valid % | Test # | Test % |
|---|---:|---:|---:|---:|---:|---:|
| 1 | 110,417 | 82.18 | 13,801 | 82.17 | 13,804 | 82.18 |
| 0 | 23,951 | 17.82 | 2,995 | 17.83 | 2,993 | 17.82 |
| **Total** | 134,368 | 100 | 16,796 | 100 | 16,797 | 100 |

The distribution of reviews across the three sets based on their domains and scores for the SC task:

| Domain | Train # | Train % | Valid # | Valid % | Test # | Test % |
|---|---:|---:|---:|---:|---:|---:|
| Appstore | 106,058 | 75.69 | 13,258 | 75.69 | 13,257 | 75.69 |
| Market | 23,278 | 16.61 | 2,909 | 16.61 | 2,910 | 16.61 |
| Mapping | 6,794 | 4.85 | 849 | 4.85 | 849 | 4.85 |
| Bookstore | 3,996 | 2.85 | 500 | 2.85 | 500 | 2.85 |
| **Total** | 140,126 | 100 | 17,516 | 100 | 17,516 | 100 |

| Score | Train # | Train % | Valid # | Valid % | Test # | Test % |
|---|---:|---:|---:|---:|---:|---:|
| 5 | 101,302 | 72.29 | 12,663 | 72.29 | 12,663 | 72.29 |
| 1 | 20,031 | 14.29 | 2,504 | 14.30 | 2,504 | 14.30 |
| 4 | 9,115 | 6.50 | 1,140 | 6.51 | 1,139 | 6.50 |
| 3 | 5,758 | 4.11 | 719 | 4.10 | 720 | 4.11 |
| 2 | 3,920 | 2.80 | 490 | 2.80 | 490 | 2.80 |
| **Total** | 140,126 | 100 | 17,516 | 100 | 17,516 | 100 |

## Score Resampling ♻️

To address the data imbalance in our training data, we employed random oversampling (ROS) and random undersampling (RUS). ROS balances class representation by creating new samples for the smaller classes to match the count of the majority class, whereas RUS removes samples from the larger classes to match the count of the minority class.
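
A minimal sketch of both techniques using pandas group-wise sampling; dedicated libraries such as imbalanced-learn offer the same functionality:

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Upsample every class to the size of the largest class (ROS)."""
    n_max = df[label_col].value_counts().max()
    return pd.concat(
        g.sample(n=n_max, replace=True, random_state=42)
        for _, g in df.groupby(label_col)
    ).reset_index(drop=True)

def random_undersample(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Downsample every class to the size of the smallest class (RUS)."""
    n_min = df[label_col].value_counts().min()
    return pd.concat(
        g.sample(n=n_min, replace=False, random_state=42)
        for _, g in df.groupby(label_col)
    ).reset_index(drop=True)
```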

The balanced training sets for the PC task:

| Score | Balanced (ROS) | Balanced (RUS) | Imbalanced |
|---|---:|---:|---:|
| 0 | 110,417 | 23,951 | 23,951 |
| 1 | 110,417 | 23,951 | 110,417 |

The balanced training sets for the SC task:

| Score | Balanced (ROS) | Balanced (RUS) | Imbalanced |
|---|---:|---:|---:|
| 1 | 101,302 | 3,920 | 20,031 |
| 2 | 101,302 | 3,920 | 3,920 |
| 3 | 101,302 | 3,920 | 5,758 |
| 4 | 101,302 | 3,920 | 9,115 |
| 5 | 101,302 | 3,920 | 101,302 |

## Dataset Structure 📁

The `dataset` folder contains ten ZIP files, each containing a CSV file. Files `01` to `05` are associated with PC (polarity classification), while files `06` to `10` are related to SC (score classification). To align with the enumeration used for labelling in the classifier, which starts from 0 rather than 1, labels 1–5 in the SC task were transformed into 0–4. Different training set variations are indicated by the suffixes `ib` for imbalanced data, `ros` for random oversampling, and `rus` for random undersampling. Each file includes records containing a custom review identifier (`custom_id`), the original review text (`text`), the pre-processed review text (`text_cleaned`), the corresponding review score (`label`), and the domain information (`domain`).
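
A minimal loading sketch with pandas, which can read a CSV directly from a ZIP archive; the file name below is hypothetical, so check the `dataset` folder for the actual names:

```python
import pandas as pd

# Hypothetical file name: substitute one of the actual "01"-"10" ZIP files.
df = pd.read_csv("dataset/01_pc_train_ib.zip")  # pandas reads a lone CSV inside a ZIP directly

print(df.columns.tolist())          # custom_id, text, text_cleaned, label, domain
print(df["label"].value_counts())   # class distribution
```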

## Sentiment Classification Models 🧠

For the evaluation of KazSAnDRA, we utilised four multilingual machine learning models, all incorporating the Kazakh language and accessible through the Hugging Face Transformers framework:

  1. mBERT
  2. XLM-R
  3. RemBERT
  4. mBART-50

## Experimental Setup 🔬

The models were fine-tuned on both the balanced and imbalanced training sets, while the hyperparameters were tuned on the validation set. The best-performing models were then evaluated on the test sets. Fine-tuning was executed on a single A100 GPU hosted on an NVIDIA DGX A100 machine. The initial learning rate was set to 10⁻⁵ and the weight decay rate to 10⁻³. Early stopping was triggered when the F1-score showed no improvement for three consecutive epochs. We set the batch size to 32 (mBERT, XLM-R, RemBERT) or 16 (mBART-50) and applied 800 warm-up steps.
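
A hedged sketch of this setup using the Transformers `TrainingArguments` API; it mirrors the reported hyperparameters, but `finetune_evaluate.py` in this repository is the authoritative configuration:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Pass these to a Trainer together with a compute_metrics function that returns an "f1" entry.
training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=1e-5,                 # initial learning rate of 10^-5
    weight_decay=1e-3,                  # weight decay rate of 10^-3
    per_device_train_batch_size=32,     # 16 for mBART-50
    warmup_steps=800,
    num_train_epochs=20,                # upper bound (assumption); early stopping usually ends sooner
    evaluation_strategy="epoch",        # "eval_strategy" in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # stop after 3 epochs without F1 gain
```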

| Model | PC ROS | PC RUS | PC IB | SC ROS | SC RUS | SC IB |
|---|---:|---:|---:|---:|---:|---:|
| mBERT | 4 | 7 | 6 | 8 | 10 | 11 |
| XLM-R | 5 | 7 | 5 | 4 | 9 | 16 |
| RemBERT | 4 | 5 | 5 | 6 | 6 | 9 |
| mBART-50 | 5 | 7 | 5 | 8 | 7 | 5 |

*Number of training epochs for the models*

## Performance Metrics 📏

Several conventional metrics were used to evaluate the performance of the models: accuracy (A), precision (P), recall (R), and F1-score (F1). Given the imbalanced nature of the dataset, in which all classes carry equal importance, we opted for macro-averaging: the arithmetic (i.e., unweighted) mean of the per-class F1-scores. This ensures equal treatment of all classes during evaluation and penalises the model more strongly if it performs worse on minority classes.
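
A minimal sketch of the evaluation with scikit-learn, assuming integer label arrays:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"A": accuracy_score(y_true, y_pred), "P": p, "R": r, "F1": f1}
```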

## Fine-Tuning, Evaluating, and Predicting Models 🤖

1. Download this repository and install the required packages:

```bash
git clone https://github.com/IS2AI/KazSAnDRA.git
cd KazSAnDRA/scripts
pip install -r requirements.txt
```

2. To fine-tune and evaluate a model, select the necessary arguments in `finetune_evaluate.py` and run:

```bash
python finetune_evaluate.py
```

3. To classify a review, select the necessary arguments, add a review in `predict.py`, and run (see also the sketch after these steps):

```bash
python predict.py
```
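
As an alternative to `predict.py`, a hedged inference sketch with the Transformers pipeline; the checkpoint name is hypothetical, so substitute the released model identifier from the Hugging Face Hub:

```python
from transformers import pipeline

# Hypothetical checkpoint name: replace with the released KazSAnDRA model identifier.
classifier = pipeline("text-classification", model="issai/kazsandra-pc-model")

print(classifier("керемет кітап"))  # expected: a positive (label 1) prediction
```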

## Experiment Results 📊

| Model | ROS A | ROS P | ROS R | ROS F1 | RUS A | RUS P | RUS R | RUS F1 | IB A | IB P | IB R | IB F1 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| mBERT | 0.84 | 0.74 | 0.83 | 0.77 | 0.85 | 0.76 | 0.82 | 0.78 | 0.89 | 0.82 | 0.79 | 0.80 |
| XLM-R | 0.86 | 0.76 | 0.83 | 0.79 | 0.85 | 0.75 | 0.83 | 0.78 | 0.89 | 0.81 | 0.81 | 0.81 |
| RemBERT | 0.88 | 0.79 | 0.82 | 0.81 | 0.87 | 0.78 | 0.82 | 0.80 | 0.89 | 0.81 | 0.82 | 0.81 |
| mBART-50 | 0.87 | 0.77 | 0.79 | 0.78 | 0.81 | 0.72 | 0.81 | 0.74 | 0.89 | 0.82 | 0.78 | 0.80 |

*PC results on the test sets (ROS/RUS = balanced training sets, IB = imbalanced)*

| Model | ROS A | ROS P | ROS R | ROS F1 | RUS A | RUS P | RUS R | RUS F1 | IB A | IB P | IB R | IB F1 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| mBERT | 0.67 | 0.34 | 0.36 | 0.35 | 0.63 | 0.35 | 0.39 | 0.36 | 0.77 | 0.44 | 0.36 | 0.37 |
| XLM-R | 0.58 | 0.36 | 0.42 | 0.36 | 0.66 | 0.36 | 0.41 | 0.37 | 0.77 | 0.42 | 0.37 | 0.39 |
| RemBERT | 0.73 | 0.37 | 0.36 | 0.36 | 0.62 | 0.35 | 0.40 | 0.35 | 0.76 | 0.41 | 0.38 | 0.39 |
| mBART-50 | 0.74 | 0.36 | 0.34 | 0.35 | 0.55 | 0.36 | 0.41 | 0.34 | 0.77 | 0.42 | 0.37 | 0.38 |

*SC results on the test sets (ROS/RUS = balanced training sets, IB = imbalanced)*

| actual ↓ \ predicted → | 0 | 1 | Total |
|---|---:|---:|---:|
| 0 | 2,155 | 838 | 2,993 |
| 1 | 1,036 | 12,768 | 13,804 |

*RemBERT PC confusion matrix on the test set*

| actual ↓ \ predicted → | 1 | 2 | 3 | 4 | 5 | Total |
|---|---:|---:|---:|---:|---:|---:|
| 1 | 1,379 | 145 | 132 | 64 | 784 | 2,504 |
| 2 | 182 | 55 | 56 | 25 | 172 | 490 |
| 3 | 173 | 54 | 118 | 65 | 310 | 720 |
| 4 | 110 | 39 | 90 | 169 | 731 | 1,139 |
| 5 | 564 | 59 | 165 | 297 | 11,578 | 12,663 |

*RemBERT SC confusion matrix on the test set*

| Domain | A | P | R | F1 |
|---|---:|---:|---:|---:|
| Appstore | 0.87 | 0.80 | 0.81 | 0.80 |
| Bookstore | 0.86 | 0.75 | 0.80 | 0.77 |
| Mapping | 0.92 | 0.84 | 0.88 | 0.86 |
| Market | 0.97 | 0.84 | 0.91 | 0.87 |

*RemBERT PC results by domain*

| Domain | A | P | R | F1 |
|---|---:|---:|---:|---:|
| Appstore | 0.74 | 0.41 | 0.37 | 0.38 |
| Bookstore | 0.73 | 0.34 | 0.32 | 0.32 |
| Mapping | 0.80 | 0.42 | 0.41 | 0.41 |
| Market | 0.82 | 0.43 | 0.41 | 0.42 |

*RemBERT SC results by domain*

## Acknowledgements 🙏

We sincerely thank Alma Murzagulova, Aizhan Seipanova, Meiramgul Akanova, Almas Aitzhan, Aigerim Boranbayeva, and Assel Kospabayeva, who acted as moderators during the review collection process. Their tireless efforts, diligence, and remarkable patience contributed significantly to the successful completion of this endeavour.

## Citation 🎓

If you incorporate our dataset and/or models into your work, we kindly ask that you cite our paper. Proper referencing upholds academic honesty and ensures due acknowledgement of the authors' efforts.

```bibtex
@misc{yeshpanov2024kazsandra,
      title={KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes},
      author={Rustem Yeshpanov and Huseyin Atakan Varol},
      year={2024},
      eprint={2403.19335},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
