# Sentiment Analysis in Nepali Language

This notebook fine-tunes a BERT-derived model for sentiment analysis in Nepali. The dataset used here comes mainly from [the Nepali Sentiment Analysis repository](https://github.com/oya163/nepali-sentiment-analysis/blob/master/data/nepcls/csv/ss_ac_at_txt_unbal.csv).

## Installation

In [1]:
%%capture
!python3 -m pip install -U huggingface_hub
!python3 -m pip install -U transformers
!python3 -m pip install -U datasets evaluate
!python3 -m pip install -U accelerate
!python3 -m pip install -U seqeval
!python3 -m pip install -U wandb

In [2]:
# Wrap the text in ipython notebook
from IPython.display import HTML, display

# def set_css():
#   display(HTML('''
#   <style>
#     pre {
#         white-space: pre-wrap;
#     }
#   </style>
#   '''))
# get_ipython().events.register('pre_run_cell', set_css)

import warnings
warnings.filterwarnings("ignore")

# Data Preprocessing

## Prepare NepSA dataset

This section builds the train, validation, and test files from the raw dataset of the [Nepali Sentiment Analysis](https://raw.githubusercontent.com/oya163/nepali-sentiment-analysis/master/data/nepcls/csv/ss_ac_at_txt_unbal.csv) project.

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# filepath = '/content/drive/MyDrive/nepsa_data/nepsa'

In [5]:
# !wget https://raw.githubusercontent.com/oya163/nepali-sentiment-analysis/master/data/nepcls/csv/ss_ac_at_txt_bal.csv

In [6]:
# import matplotlib.pyplot as plt
# import os
# import pandas as pd
# import torch
# import numpy as np
# pd.set_option('display.max_colwidth', None)

In [7]:
# filepath = "/kaggle/working/ss_ac_at_txt_bal.csv"
# df = pd.read_csv(filepath,
#                    names=["Severity", "Category", "Aspect Word", "text"])

In [8]:
# df.head()

In [9]:
# df[df['Category']=='PROFANITY']

In [10]:
# df['Category'].unique()

In [11]:
# df = df[~df['Category'].isin(['FEEDBACK'])]
# df['Category'].unique()

In [12]:
# def create_label(row):
#     if row['Category'] == "GENERAL" and row['Severity'] == 0:
#         return 0
#     elif row['Category'] == "GENERAL" and row['Severity'] == 1:
#         return 1
#     elif row['Category'] == "PROFANITY":
#         return 2
#     elif row['Category'] == "VIOLENCE":
#         return 3

# df['label'] = df.apply(create_label, axis=1).astype(int)

# df = df.drop(['Severity', 'Category', 'Aspect Word'], axis=1)
# df.head()


In [13]:
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=163)

# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=163)


In [14]:
# def create_csv(X, y, filename):
#     df = pd.DataFrame()
#     df['text'] = X
#     df['label'] = y
#     df.to_csv(f"{filename}.txt", sep='\t', header=False, index=False)

# create_csv(X_train, y_train, 'train')
# create_csv(X_val, y_val, 'valid')
# create_csv(X_test, y_test, 'test')


## Load NepSA dataset

In [15]:
import os
from datasets import load_dataset

filepath="/kaggle/input/nepsa-data"
data_files = {
    "train": os.path.join(filepath, "train.txt"),
    "validation": os.path.join(filepath, "valid.txt"),
    "test": os.path.join(filepath, "test.txt"),
}

raw_datasets = load_dataset(os.path.join(filepath, "load_sa.py"), data_files=data_files)

Check the basic information on the loaded dataset

In [16]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 714
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 239
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 239
    })
})
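
The split sizes above are consistent with the two-stage split in the commented-out preparation cells (`test_size=0.2`, then `test_size=0.25` on the remainder). The sketch below redoes that arithmetic in plain Python. It assumes the held-out part is rounded up at each stage, as scikit-learn's `train_test_split` does for a fractional `test_size`; the helper name `two_stage_split_sizes` is hypothetical.

```python
import math

def two_stage_split_sizes(n_total, test_frac=0.2, val_frac_of_rest=0.25):
    """Return (train, validation, test) sizes for a two-stage hold-out split.

    Assumption: the held-out part is rounded up at each stage.
    """
    n_test = math.ceil(n_total * test_frac)
    n_rest = n_total - n_test
    n_val = math.ceil(n_rest * val_frac_of_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n_total = 714 + 239 + 239  # rows reported in the DatasetDict above
sizes = two_stage_split_sizes(n_total)
print(sizes)
```

Taking 25% of the remaining 80% yields roughly a 60/20/20 split overall.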

Check the data statistics

In [17]:
from collections import Counter

for k, v in raw_datasets.items():
    print(k, Counter(raw_datasets[k]['label']))

train Counter({2: 236, 1: 199, 3: 172, 0: 107})
validation Counter({2: 86, 1: 63, 3: 59, 0: 31})
test Counter({2: 85, 1: 61, 3: 54, 0: 39})
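
To compare the splits, it can help to turn the raw counts into class shares. The following is a small sketch that reuses the counts printed above; the helper `class_shares` is a hypothetical name, not part of the notebook.

```python
from collections import Counter

# Label counts copied from the output above.
split_counts = {
    "train": Counter({2: 236, 1: 199, 3: 172, 0: 107}),
    "validation": Counter({2: 86, 1: 63, 3: 59, 0: 31}),
    "test": Counter({2: 85, 1: 61, 3: 54, 0: 39}),
}

def class_shares(counts):
    """Fraction of examples per label, ordered by label ID."""
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

shares = {split: class_shares(c) for split, c in split_counts.items()}
for split, s in shares.items():
    print(split, {label: round(v, 3) for label, v in s.items()})
```

In every split label 2 is the largest class and label 0 the smallest, so the classes are moderately imbalanced.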


In [18]:
train_data = raw_datasets['train']
test_data = raw_datasets['test']
valid_data = raw_datasets['validation']

Check a sample text from the training dataset

In [19]:
print(train_data[10]["text"])

यो खाते रन्डि को छोरा कस्तो मानव अधिकार बादी हो ?


Check the label ID of the corresponding sample

In [20]:
print(raw_datasets["train"][10]["label"])

2


In [21]:
label_feature = raw_datasets["train"].features["label"]
label_feature.num_classes

4

## Tokenization

In [22]:
from transformers import AutoTokenizer

# model_checkpoint = "NepBERTa/NepBERTa"
# model_checkpoint = "Rajan/NepaliBERT"
# model_checkpoint = "Rajan/nepbertaTorch"
model_checkpoint = "Sakonii/distilbert-base-nepali"
# model_checkpoint = "xlm-roberta-large"
# model_checkpoint = "Sakonii/deberta-base-nepali"
# model_checkpoint = "bert-base-multilingual-uncased"
# model_checkpoint = "/kaggle/input/nepsa-model/model"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

## Data Preprocessing

In [23]:
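def preprocess_function(examples):
    # truncation=False keeps every token; inputs longer than the model's
    # maximum sequence length are not shortened here.
    return tokenizer(examples["text"], truncation=False)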

tokenized_train = train_data.map(preprocess_function, batched=True)
tokenized_test = test_data.map(preprocess_function, batched=True)
tokenized_val = valid_data.map(preprocess_function, batched=True)

# Fine Tuning

## Data Collation

In [24]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
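
The collator pads each batch to the length of its longest sequence rather than to one global length. Below is a minimal plain-Python sketch of that idea. It is an illustration only, not the `transformers` implementation, and the token IDs, the `pad_id` value, and the helper name `pad_batch` are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[5, 17, 9], [5, 4]], pad_id=1)  # made-up token IDs
print(batch)
```

With a batch size of 2, as used later in this notebook, each batch is only as long as the longer of its two texts.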

## Load pre-trained model

In [25]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=4,
    # Load from TensorFlow weights only for the NepBERTa checkpoint
    from_tf=(model_checkpoint == "NepBERTa/NepBERTa"),
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at Sakonii/distilbert-base-nepali and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
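
Because the classification head is newly initialized without label names, the model config falls back to the generic names `LABEL_0` to `LABEL_3`, which the inference cell later parses by hand. One possible alternative, sketched here as an assumption rather than as what the notebook does, is to build `id2label` and `label2id` mappings and pass them when loading the model. The class names come from the notebook's `label_map`; `parse_default_label` is a hypothetical helper.

```python
label_names = ["GENERAL POSITIVE", "GENERAL NEGATIVE", "PROFANITY", "VIOLENCE"]
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

def parse_default_label(label):
    """Map a default pipeline label such as 'LABEL_2' to its class name."""
    return id2label[int(label.split("_")[1])]

# Assumed usage (not run here): pass the mappings when loading the model, e.g.
# AutoModelForSequenceClassification.from_pretrained(
#     model_checkpoint, num_labels=4, id2label=id2label, label2id=label2id)
print(parse_default_label("LABEL_2"))
```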


## Setup Evaluation

In [26]:
import numpy as np
from sklearn.metrics import classification_report

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    report = classification_report(y_true=labels, y_pred=predictions, output_dict=True)

    accuracy = report['accuracy']
    recall = report['weighted avg']['recall']
    precision = report['weighted avg']['precision']
    f1 = report['weighted avg']['f1-score']
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1
    }
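
Note that `accuracy` and the support-weighted `recall` returned above are always the same number. Weighting each class's recall by its share of the examples sums the correct predictions over all classes and divides by the total, which is the definition of accuracy. The sketch below checks this on made-up labels; the helper name is hypothetical.

```python
from collections import Counter

def accuracy_and_weighted_recall(y_true, y_pred):
    """Compute accuracy and support-weighted recall without scikit-learn."""
    n = len(y_true)
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    accuracy = sum(correct.values()) / n
    weighted_recall = sum(
        (support[c] / n) * (correct[c] / support[c]) for c in support
    )
    return accuracy, weighted_recall

y_true = [0, 0, 1, 1, 1, 2, 3, 3]  # made-up labels
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]  # made-up predictions
acc, w_rec = accuracy_and_weighted_recall(y_true, y_pred)
print(acc, w_rec)
```

Weighted precision and F1 do not reduce to accuracy in this way, so they carry additional information.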



In [27]:
model.config.num_labels

4

## Training

In [28]:
# from google.colab import userdata
# from huggingface_hub import login, notebook_login

# login(token=userdata.get('hugging_face'))

In [29]:
from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback, IntervalStrategy

model_name = "nepsa"

args = TrainingArguments(
    model_name,
    evaluation_strategy=IntervalStrategy.STEPS,
    eval_steps = 100,
    save_total_limit = 2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=2e-5,
    num_train_epochs=6,
    weight_decay=0.01,
    push_to_hub=False,
    metric_for_best_model = 'f1',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]
)

In [30]:
from kaggle_secrets import UserSecretsClient
import wandb

user_secrets = UserSecretsClient()
wandb_secret = user_secrets.get_secret("wandb")
wandb.login(key=wandb_secret)

trainer.train()

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33moyashi163[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.16.0
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20231128_185016-f42cw4hp[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33musual-yogurt-24[0m
[34m[1mwandb[0m: ‚≠êÔ∏è View project at [34m[4mhttps://wandb.ai/oyashi163/huggingface[0m
[34m[1mwandb[0m: üöÄ View run at [34m[4mhttps://wandb.ai/oyashi163/huggingface/runs/f42cw4hp[0m
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` meth

| Step | Training Loss | Validation Loss | Accuracy | Recall   | Precision | F1       |
|------|---------------|-----------------|----------|----------|-----------|----------|
| 100  | No log        | 1.312885        | 0.351464 | 0.351464 | 0.280521  | 0.273618 |
| 200  | No log        | 1.235498        | 0.422594 | 0.422594 | 0.544112  | 0.334263 |
| 300  | No log        | 1.134891        | 0.485356 | 0.485356 | 0.588691  | 0.457831 |
| 400  | No log        | 1.036294        | 0.564854 | 0.564854 | 0.58202   | 0.565628 |
| 500  | 1.153900      | 0.999386        | 0.58159  | 0.58159  | 0.582514  | 0.578021 |
| 600  | 1.153900      | 0.990986        | 0.60251  | 0.60251  | 0.60306   | 0.597647 |
| 700  | 1.153900      | 1.108117        | 0.598326 | 0.598326 | 0.644136  | 0.571993 |
| 800  | 1.153900      | 0.945812        | 0.635983 | 0.635983 | 0.632019  | 0.630802 |
| 900  | 1.153900      | 1.150024        | 0.577406 | 0.577406 | 0.576098  | 0.571937 |
| 1000 | 0.738000      | 1.269696        | 0.606695 | 0.606695 | 0.624631  | 0.603008 |
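
The training cell registers `EarlyStoppingCallback(early_stopping_patience=2)` with `metric_for_best_model='f1'`. As a rough illustration of how a patience rule behaves, the sketch below applies a simplified version to the F1 values logged above. It is not the `transformers` implementation (which also supports an improvement threshold), and `early_stop_step` is a hypothetical helper.

```python
def early_stop_step(history, patience=2):
    """Return the step at which a patience-based rule would stop, else None.

    history: list of (step, metric) pairs where a higher metric is better.
    """
    best = float("-inf")
    bad_evals = 0
    for step, metric in history:
        if metric > best:
            best, bad_evals = metric, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step
    return None

# (step, F1) pairs copied from the log above.
f1_history = [
    (100, 0.273618), (200, 0.334263), (300, 0.457831), (400, 0.565628),
    (500, 0.578021), (600, 0.597647), (700, 0.571993), (800, 0.630802),
    (900, 0.571937), (1000, 0.603008),
]
stop_step = early_stop_step(f1_history, patience=2)
best_step = max(f1_history, key=lambda pair: pair[1])[0]
print(stop_step, best_step)
```

Under this simplified rule, the logged values would trigger a stop at step 1000, with the best F1 at step 800.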


TrainOutput(global_step=2142, training_loss=0.6809723655494099, metrics={'train_runtime': 119.0452, 'train_samples_per_second': 35.986, 'train_steps_per_second': 17.993, 'total_flos': 41809444068672.0, 'train_loss': 0.6809723655494099, 'epoch': 6.0})
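
The reported `global_step=2142` follows from the dataset size and the training arguments. The sketch below redoes the arithmetic, assuming a single device and no gradient accumulation.

```python
import math

num_train_examples = 714   # rows in the train split
per_device_batch_size = 2  # per_device_train_batch_size
num_epochs = 6             # num_train_epochs
eval_every = 100           # eval_steps

# Assumptions: a single device and no gradient accumulation.
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every
print(steps_per_epoch, total_steps, num_evaluations)
```

This gives 357 optimizer steps per epoch and 2142 steps over 6 epochs, so a full run evaluates 21 times at 100-step intervals.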

In [31]:
trainer.evaluate()

{'eval_loss': 1.8124080896377563,
 'eval_accuracy': 0.602510460251046,
 'eval_recall': 0.602510460251046,
 'eval_precision': 0.6175961341067085,
 'eval_f1': 0.6058744850376651,
 'eval_runtime': 0.8183,
 'eval_samples_per_second': 292.069,
 'eval_steps_per_second': 146.645,
 'epoch': 6.0}

## Save the model

In [32]:
saved_model_path='nepsa'
trainer.save_model(saved_model_path)

## Evaluation

In [33]:
predictions = trainer.predict(tokenized_test)

In [34]:
import pandas as pd
final_predictions = np.argmax(predictions.predictions, axis=1)

label_map = {
    0: 'GENERAL POSITIVE',
    1: 'GENERAL NEGATIVE',
    2: 'PROFANITY',
    3: 'VIOLENCE'
}

prediction_data = []
for text, gt, pt in zip(tokenized_test['text'], tokenized_test['label'], final_predictions):
    prediction_data.append([text, gt, pt])
    
prediction_df = pd.DataFrame(prediction_data, columns=['text', 'ground_truth', 'predictions'])

In [35]:
# prediction_df[prediction_df['predictions']==0]

In [36]:
from sklearn.metrics import classification_report
report = classification_report(y_true=prediction_df['ground_truth'], y_pred=prediction_df['predictions'])
print(report)

              precision    recall  f1-score   support

           0       0.89      0.62      0.73        39
           1       0.58      0.69      0.63        61
           2       0.70      0.67      0.68        85
           3       0.59      0.63      0.61        54

    accuracy                           0.66       239
   macro avg       0.69      0.65      0.66       239
weighted avg       0.67      0.66      0.66       239
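
The two average rows of the report differ only in how classes are weighted: `macro avg` treats all four classes equally, while `weighted avg` weights each class by its support. The sketch below recomputes both from the per-class rows above. Because those inputs are already rounded to two decimals, the results can differ from the report in the last digit.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
rows = {
    0: (0.89, 0.62, 0.73, 39),
    1: (0.58, 0.69, 0.63, 61),
    2: (0.70, 0.67, 0.68, 85),
    3: (0.59, 0.63, 0.61, 54),
}

total_support = sum(r[3] for r in rows.values())

def macro_avg(col):
    """Unweighted mean of one metric column over the classes."""
    return sum(r[col] for r in rows.values()) / len(rows)

def weighted_avg(col):
    """Support-weighted mean of one metric column over the classes."""
    return sum(r[col] * r[3] for r in rows.values()) / total_support

for name, col in [("precision", 0), ("recall", 1), ("f1", 2)]:
    print(f"{name}: macro={macro_avg(col):.3f} weighted={weighted_avg(col):.3f}")
```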



In [37]:
from tabulate import tabulate

metrics = ['precision', 'recall', 'f1', 'accuracy']
prediction_results = []

for key, val in predictions.metrics.items():
    if any(item in key for item in metrics):
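        # Format as a percentage with two decimals; avoids float artifacts
        # that can appear when multiplying a rounded value by 100.
        prediction_results.append([key, f"{val:.2%}"])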

print(tabulate(prediction_results, headers=['Metric', 'Score']))

Metric          Score
--------------  -------
test_accuracy   65.69%
test_recall     65.69%
test_precision  67.36%
test_f1         65.98%


## Inference

In [38]:
from transformers import pipeline

text_classifier = pipeline("text-classification", model=saved_model_path)

In [39]:
inference_sentences = ["वा जैनु जि तपाइलाइ धन्यबाद छ गगन्या चोर लाई मुख भरी जबाब दिएको मा",
                       "ओली दलाल मुर्दाबाद",
                       "यो गोविन्दे लाई देश निकाला गर्नु पर्छ",
                       "यो मुला गोबिन्द ठिक छैन",
                       "यो पुण्य गौतम जड्या हो जस्तो कस कस लाई लाग्छ ।",
                       "तपाईं कुवा मा दुबेर मरे हुन्छ ।",
                       "अनुहार हेर्दा ठमेल को भालू हो ।"]

results = text_classifier(inference_sentences)


In [40]:
label_map = {
    0: 'GENERAL POSITIVE',
    1: 'GENERAL NEGATIVE',
    2: 'PROFANITY',
    3: 'VIOLENCE'
}

prediction_results = []
for sent, result in zip(inference_sentences, results):
    pred = result['label'].split('_')[1]
    prediction_results.append([sent, pred, label_map[int(pred)]])

print(tabulate(prediction_results, headers=['Sentences', 'Labels', 'Remarks'], tablefmt='orgtbl'))


| Sentences                                                    |   Labels | Remarks          |
|--------------------------------------------------------------+----------+------------------|
| वा जैनु जि तपाइलाइ धन्यबाद छ गगन्या चोर लाई मुख भरी जबाब दिएको मा |        1 | GENERAL NEGATIVE |
| ओली दलाल मुर्दाबाद                                             |        1 | GENERAL NEGATIVE |
| यो गोविन्दे लाई देश निकाला गर्नु पर्छ                              |        1 | GENERAL NEGATIVE |
| यो मुला गोबिन्द ठिक छैन                                         |        2 | PROFANITY        |
| यो पुण्य गौतम जड्या हो जस्तो कस कस लाई लाग्छ ।                    |        2 | PROFANITY        |
| त…

## Conclusion

The classification reports below are computed on the 239-example test set, one per checkpoint. Sakonii/distilbert-base-nepali reaches the highest accuracy (0.66), while xlm-roberta-large predicts class 2 for every example.

### Sakonii/distilbert-base-nepali - 6 epochs


                  precision    recall  f1-score   support

               0       0.87      0.67      0.75        39
               1       0.56      0.70      0.62        61
               2       0.71      0.69      0.70        85
               3       0.61      0.56      0.58        54

        accuracy                           0.66       239
       macro avg       0.69      0.66      0.67       239
    weighted avg       0.68      0.66      0.66       239



### xlm-roberta-large - 6 epochs

                    precision    recall  f1-score   support

               0       0.00      0.00      0.00        39
               1       0.00      0.00      0.00        61
               2       0.36      1.00      0.52        85
               3       0.00      0.00      0.00        54

        accuracy                           0.36       239
       macro avg       0.09      0.25      0.13       239
    weighted avg       0.13      0.36      0.19       239
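
The xlm-roberta-large report has the signature of a model that always predicts the majority class: recall of 1.00 for class 2 and zeros everywhere else. The sketch below derives the scores such a constant classifier would get from the support column alone.

```python
test_support = {0: 39, 1: 61, 2: 85, 3: 54}  # support column of the report above

majority_class = max(test_support, key=test_support.get)
n_test = sum(test_support.values())

# A classifier that always predicts the majority class is right exactly
# on the examples of that class.
baseline_accuracy = test_support[majority_class] / n_test
baseline_precision = baseline_accuracy  # every prediction is the majority class
baseline_recall = 1.0                   # every majority example is recovered
baseline_f1 = (
    2 * baseline_precision * baseline_recall
    / (baseline_precision + baseline_recall)
)
print(majority_class, round(baseline_accuracy, 2), round(baseline_f1, 2))
```

The resulting accuracy of 0.36 and class-2 F1 of 0.52 match the report, so this run did no better than the majority-class baseline.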

### Rajan/NepaliBERT

                  precision    recall  f1-score   support

               0       0.64      0.64      0.64        39
               1       0.44      0.51      0.47        61
               2       0.65      0.52      0.58        85
               3       0.61      0.70      0.66        54

        accuracy                           0.58       239
       macro avg       0.59      0.59      0.59       239
    weighted avg       0.59      0.58      0.58       239
    
### bert-base-multilingual-uncased - 6 epochs
        
                  precision    recall  f1-score   support

               0       0.74      0.64      0.68        39
               1       0.48      0.52      0.50        61
               2       0.76      0.67      0.71        85
               3       0.59      0.69      0.63        54

        accuracy                           0.63       239
       macro avg       0.64      0.63      0.63       239
    weighted avg       0.64      0.63      0.64       239
    
    
### Sakonii/deberta-base-nepali - 6 epochs

                  precision    recall  f1-score   support

               0       0.83      0.64      0.72        39
               1       0.56      0.59      0.58        61
               2       0.69      0.58      0.63        85
               3       0.54      0.74      0.62        54

        accuracy                           0.63       239
       macro avg       0.66      0.64      0.64       239
    weighted avg       0.65      0.63      0.63       239
    

