# AfriSenti — Multilingual Sentiment Analysis

**Group 2 Assignment**  
Sentiment Analysis (Multilingual Tweets)

**Authors:** Ainedembe Denis, Musinguzi Benson  
**Lecturer:** Dr. Sitenda Harriet

This notebook analyzes AfriSenti, the largest sentiment analysis dataset for under-represented African languages, covering 110,000+ annotated tweets across 14 languages.


## 1. Imports & Installs


In [3]:
# Install required packages for sentiment analysis assignment
%pip install -q "datasets<4.0.0" pandas numpy
%pip install -q matplotlib seaborn scikit-learn
%pip install -q tqdm

# Deep Learning frameworks (required for XLM-RoBERTa fine-tuning and LSTM baseline)
# Note: May have compatibility issues with Python 3.13
%pip install -q torch transformers

print("Core packages installed")

# Data loading and manipulation
from datasets import load_dataset, get_dataset_config_names
import pandas as pd
import numpy as np

# Visualization (for data exploration, confusion matrix, attention visualization)
import matplotlib.pyplot as plt
import seaborn as sns

# Evaluation metrics (F1-score, accuracy, ROC-AUC, confusion matrix)
from sklearn.metrics import (
    accuracy_score, f1_score, roc_auc_score, 
    confusion_matrix, classification_report
)
from sklearn.model_selection import train_test_split

# Deep Learning - PyTorch (for fine-tuning XLM-RoBERTa/AfriBERTa and LSTM)
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from torch.nn.utils import clip_grad_norm_

# Transformers (for mBERT, XLM-RoBERTa tokenizers and models)
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification
)

# Progress bars
from tqdm import tqdm

print("\nAll required libraries imported successfully!")


Note: you may need to restart the kernel to use updated packages.
Core packages installed

All required libraries imported successfully!
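
The install cell notes that `torch` and `transformers` may have compatibility issues with Python 3.13. A minimal sketch of an up-front interpreter check follows. The helper name `python_version_warning` and its `first_risky` threshold are illustrative assumptions, not part of the original notebook, and the check does not tell you whether a compatible wheel actually exists for your platform.

```python
import sys


def python_version_warning(version_info=sys.version_info, first_risky=(3, 13)):
    """Return a warning string if the interpreter is at or above `first_risky`, else None.

    ASSUMPTION: the (3, 13) threshold simply mirrors the note in the install cell.
    """
    major_minor = tuple(version_info[:2])
    if major_minor >= first_risky:
        return (
            f"Python {major_minor[0]}.{major_minor[1]} detected: "
            "torch/transformers may have compatibility issues on this version."
        )
    return None


# Print the warning if there is one, otherwise confirm the version looks fine
print(python_version_warning() or "Python version looks fine for torch/transformers.")
```

Running this before the `%pip install` lines makes a version problem visible early instead of surfacing later as an import error.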


In [5]:
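# Download NLTK data (run once)
try:
    import nltk  # nltk is not imported in the imports cell above

    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
    nltk.download('vader_lexicon', quiet=True)
    print("NLTK data downloaded")
except Exception as exc:
    # Report the actual cause (e.g. nltk not installed) instead of hiding it
    print(f"NLTK data download skipped: {exc}")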


NLTK data download skipped (may already be installed)


## 2. Loading the Dataset


In [6]:
# List all available language configs for this dataset
configs = get_dataset_config_names("HausaNLP/AfriSenti-Twitter", trust_remote_code=True)
print("Available language configs:", configs)

# Example: load Amharic (amh) with all splits (train/validation/test)
amh_ds = load_dataset("HausaNLP/AfriSenti-Twitter", "amh", trust_remote_code=True)
print(amh_ds)

# Example: load a single split (train) only
amh_train = load_dataset("HausaNLP/AfriSenti-Twitter", "amh", split="train", trust_remote_code=True)
print(amh_train[0])


Available language configs: ['amh', 'hau', 'ibo', 'arq', 'ary', 'yor', 'por', 'twi', 'tso', 'tir', 'orm', 'pcm', 'kin', 'swa']
DatasetDict({
    train: Dataset({
        features: ['tweet', 'label'],
        num_rows: 5984
    })
    validation: Dataset({
        features: ['tweet', 'label'],
        num_rows: 1497
    })
    test: Dataset({
        features: ['tweet', 'label'],
        num_rows: 1999
    })
})
{'tweet': 'Tesfaye ለካስ ጭብል ለብሰሽ የፕሮፌሰርን ፎቶ ለጥፈክ እልም ያልክ ባዳ ነክ እፈር ትንሽ', 'label': 2}
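
The sample record above stores the sentiment as an integer (`'label': 2`). The sketch below shows one way to turn such ids into readable names. The label order used here (`positive`, `neutral`, `negative`) is an assumption, and `decode_label` is a hypothetical helper; in the notebook, the authoritative order can be checked with `amh_train.features["label"].names` if the column is a `ClassLabel`.

```python
# ASSUMPTION: label order; confirm with amh_train.features["label"].names
ASSUMED_LABEL_NAMES = ["positive", "neutral", "negative"]


def decode_label(label_id, names=ASSUMED_LABEL_NAMES):
    """Map an integer class id to its name, rejecting ids outside the known range."""
    if not 0 <= label_id < len(names):
        raise ValueError(f"Unknown label id: {label_id}")
    return names[label_id]


# Stand-in record with the same keys as the dataset rows (tweet text shortened)
example = {"tweet": "<tweet text>", "label": 2}
print(example["label"], "->", decode_label(example["label"]))
```

Keeping the mapping in one place avoids hard-coding class ids in later plotting and evaluation cells.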


## 3. Dataset Summary (size, languages, total tweets)

This corresponds to the "Dataset Summary" section: it checks the headline figures of 110k+ tweets across 14 languages by counting rows per language and split.


In [7]:
configs = get_dataset_config_names("HausaNLP/AfriSenti-Twitter", trust_remote_code=True)
print("Language configs:", configs)

summary_rows = []
for cfg in configs:
    # Load each split individually to handle cases where some splits don't exist
    counts = {}
    for split in ("train", "validation", "test"):
        try:
            ds = load_dataset("HausaNLP/AfriSenti-Twitter", cfg, split=split, trust_remote_code=True)
            counts[split] = len(ds)
        except ValueError:
            # This language has no such split; count it as empty
            counts[split] = 0

    summary_rows.append({"lang": cfg, **counts, "total": sum(counts.values())})

summary_df = pd.DataFrame(summary_rows).set_index("lang")
print(summary_df)
print("\nTotal tweets across all languages:", summary_df["total"].sum())


Language configs: ['amh', 'hau', 'ibo', 'arq', 'ary', 'yor', 'por', 'twi', 'tso', 'tir', 'orm', 'pcm', 'kin', 'swa']
      train  validation  test  total
lang                                
amh    5984        1497  1999   9480
hau   14172        2677  5303  22152
ibo   10192        1841  3682  15715
arq    1651         414   958   3023
ary    5583         494  2961   9038
yor    8522        2090  4515  15127
por    3063         767  3662   7492
twi    3481         388   949   4818
tso     804         203   254   1261
tir       0         398  2000   2398
orm       0         396  2096   2492
pcm    5121        1281  4154  10556
kin    3302         827  1026   5155
swa    1810         453   748   3011

Total tweets across all languages: 111718
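
The table shows that the languages differ widely in size and that two of them have no training split. A small sketch of how to surface both facts is given below. To stay self-contained it copies the split sizes from the output above into a plain dictionary; in the notebook the same logic could run directly on `summary_df`. The variable names are illustrative.

```python
# Split sizes copied from the summary table above: (train, validation, test)
split_counts = {
    "amh": (5984, 1497, 1999),
    "hau": (14172, 2677, 5303),
    "ibo": (10192, 1841, 3682),
    "arq": (1651, 414, 958),
    "ary": (5583, 494, 2961),
    "yor": (8522, 2090, 4515),
    "por": (3063, 767, 3662),
    "twi": (3481, 388, 949),
    "tso": (804, 203, 254),
    "tir": (0, 398, 2000),
    "orm": (0, 396, 2096),
    "pcm": (5121, 1281, 4154),
    "kin": (3302, 827, 1026),
    "swa": (1810, 453, 748),
}

# Total tweets per language and across the whole corpus
totals = {lang: sum(counts) for lang, counts in split_counts.items()}
grand_total = sum(totals.values())

# Languages without a training split (only validation and test data available)
no_train = [lang for lang, (n_train, _, _) in split_counts.items() if n_train == 0]

# Share of the corpus per language, largest first
shares = sorted(
    ((lang, n / grand_total) for lang, n in totals.items()),
    key=lambda item: item[1],
    reverse=True,
)

print("Total tweets:", grand_total)
print("No training split:", no_train)
for lang, share in shares:
    print(f"{lang}: {share:.1%}")
```

Languages without training data (`tir` and `orm` here) cannot be fine-tuned on directly with these splits, so any evaluation on them has to rely on models trained on other languages or on a different data split.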
