# Sentiment and Thematic Analysis

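This notebook runs sentiment and thematic analysis on mobile banking app reviews for three banks: Commercial Bank of Ethiopia (CBE), Bank of Abyssinia (BOA), and Dashen Bank. Sentiment is scored with a DistilBERT model fine-tuned on SST-2; themes come from TF-IDF keywords over spaCy-lemmatized text plus a manual keyword-to-theme mapping.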

**Steps:**
1. Load and preprocess cleaned data.
2. Perform sentiment analysis with DistilBERT.
3. Extract keywords and themes.
4. Save results.

**KPI:** Sentiment scores for at least 90% of reviews and at least 3 themes per bank.

In [1]:
import sys
print(sys.executable)

c:\Users\lenovo\Desktop\AIM\10_academy\Week_2_challenge\venv\Scripts\python.exe


In [4]:
import pandas as pd
import numpy as np
from transformers import pipeline
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from collections import defaultdict

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

In [5]:
# Load cleaned data
df = pd.read_csv("../Data/cleaned_reviews.csv")  # Adjust path if needed
print(f"Loaded {len(df)} reviews.")

Loaded 1180 reviews.


In [6]:
def preprocess_text(text):
    # Convert to string, lowercase, remove special characters
    text = str(text).lower()
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenization and stop-word removal with spaCy
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct])

df["processed_review"] = df["review"].apply(preprocess_text)
print("Text preprocessing completed.")

Text preprocessing completed.
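
The regex in `preprocess_text` removes punctuation before spaCy sees the text, so contractions and hyphenated words are collapsed into single tokens. A minimal sketch of that one step in isolation (the helper name `strip_punctuation` is hypothetical and not part of the notebook):

```python
import re

def strip_punctuation(text):
    # Same substitution as preprocess_text: lowercase, then drop every
    # character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", str(text).lower())

print(strip_punctuation("Can't log-in after the update!!"))
# prints: cant login after the update
```

Tokens such as `cant` may no longer be recognized by spaCy as a contraction, so it may be worth checking whether negations survive stop-word removal in `processed_review`.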


In [7]:
# Initialize DistilBERT sentiment pipeline
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

def get_sentiment(text):
    result = sentiment_analyzer(text[:512])[0]  # Slice to the first 512 characters (characters, not tokens)
    return result["label"], result["score"]

# Apply sentiment analysis
df[["sentiment_label", "sentiment_score"]] = df["review"].apply(
    lambda x: pd.Series(get_sentiment(x))
)
print(f"Sentiment analyzed for {len(df[df['sentiment_label'].notna()])} reviews.")



Sentiment analyzed for 1180 reviews.
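
`text[:512]` in `get_sentiment` cuts the review at 512 characters, while the model's limit is 512 tokens. A sketch of letting the tokenizer do the truncation instead, assuming the Hugging Face text-classification pipeline forwards `truncation` and `max_length` to its tokenizer at call time. The function name `get_sentiment_truncated` is hypothetical, and the analyzer is passed in as a parameter so the function can be tried with any callable:

```python
def get_sentiment_truncated(analyzer, text):
    # Skip missing or empty reviews instead of scoring the string "nan".
    if not isinstance(text, str) or not text.strip():
        return None, None
    # Assumption: the pipeline passes these keyword arguments to the tokenizer,
    # which then truncates at 512 tokens rather than 512 characters.
    result = analyzer(text, truncation=True, max_length=512)[0]
    return result["label"], result["score"]
```

With the notebook's objects this would be called as `get_sentiment_truncated(sentiment_analyzer, review_text)`.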


In [8]:
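# Note: sentiment_score is the model's confidence in its predicted label
# (POSITIVE or NEGATIVE), not a signed polarity, so these means sit near 1.0
# at every rating and do not show whether reviews are positive or negative.
sentiment_agg = df.groupby(["bank", "rating"]).agg({"sentiment_score": "mean"}).reset_index()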
print("\nMean Sentiment Score by Bank and Rating:")
print(sentiment_agg)


Mean Sentiment Score by Bank and Rating:
                           bank  rating  sentiment_score
0             Bank of Abyssinia       1         0.988881
1             Bank of Abyssinia       2         0.981547
2             Bank of Abyssinia       3         0.990526
3             Bank of Abyssinia       4         0.978465
4             Bank of Abyssinia       5         0.977828
5   Commercial Bank of Ethiopia       1         0.988785
6   Commercial Bank of Ethiopia       2         0.976726
7   Commercial Bank of Ethiopia       3         0.969137
8   Commercial Bank of Ethiopia       4         0.964873
9   Commercial Bank of Ethiopia       5         0.966693
10                  Dashen Bank       1         0.995097
11                  Dashen Bank       2         0.960082
12                  Dashen Bank       3         0.997640
13                  Dashen Bank       4         0.978575
14                  Dashen Bank       5         0.986615
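
The means above are all close to 1 because the pipeline returns the confidence of whichever label it predicted. A minimal sketch of folding the label into the score so that 1-star and 5-star reviews separate (the name `signed_sentiment` is hypothetical; the labels `POSITIVE` and `NEGATIVE` are the ones this SST-2 model emits):

```python
def signed_sentiment(label, score):
    # Map (label, confidence) to a polarity in [-1, 1]:
    # confident POSITIVE -> near +1, confident NEGATIVE -> near -1.
    return score if label == "POSITIVE" else -score

# Possible use with the notebook's DataFrame (not executed here):
# df["signed_score"] = [
#     signed_sentiment(label, score)
#     for label, score in zip(df["sentiment_label"], df["sentiment_score"])
# ]
```

Averaging `signed_score` by bank and rating would then be expected to fall as ratings drop, which the raw confidence cannot show.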


In [9]:
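# Extract keywords using TF-IDF.
# max_features=10 keeps only the 10 most frequent terms across all banks,
# so each bank's top 5 below is drawn from the same 10-word vocabulary.
tfidf = TfidfVectorizer(max_features=10, stop_words="english")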
tfidf_matrix = tfidf.fit_transform(df["processed_review"])
keywords = tfidf.get_feature_names_out()
print("\nTop Keywords per Bank (TF-IDF):")
for bank in df["bank"].unique():
    bank_reviews = df[df["bank"] == bank]["processed_review"]
    bank_tfidf = tfidf.transform(bank_reviews)
    avg_tfidf = np.mean(bank_tfidf.toarray(), axis=0)
    bank_keywords = [keywords[i] for i in avg_tfidf.argsort()[-5:][::-1]]
    print(f"{bank}: {bank_keywords}")


Top Keywords per Bank (TF-IDF):
Commercial Bank of Ethiopia: ['app', 'transaction', 'update', 'work', 'transfer']
Bank of Abyssinia: ['app', 'work', 'update', 'bank', 'use']
Dashen Bank: ['app', 'good', 'bank', 'banking', 'use']
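
Because the vectorizer is fitted once with a 10-term vocabulary, generic words such as `app` dominate every bank. One alternative, sketched here in plain Python rather than scikit-learn, treats each bank's reviews as a single document so that terms shared by all banks are down-weighted. The function name `distinctive_terms` and its input shape are assumptions for illustration; the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`:

```python
import math
from collections import Counter

def distinctive_terms(reviews_by_bank, top_n=5):
    # reviews_by_bank: dict mapping bank name -> list of preprocessed review strings.
    term_counts = {
        bank: Counter(" ".join(reviews).split())
        for bank, reviews in reviews_by_bank.items()
    }
    n_banks = len(term_counts)

    # Document frequency: in how many banks does each term appear?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    result = {}
    for bank, counts in term_counts.items():
        total = sum(counts.values()) or 1
        scores = {
            term: (count / total) * (math.log((1 + n_banks) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[bank] = [term for term, _ in ranked[:top_n]]
    return result
```

The input could be built from the notebook's DataFrame with `df.groupby("bank")["processed_review"].apply(list).to_dict()`.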


In [10]:
# Define themes based on keywords (manual grouping)
keyword_themes = {
    "login": "Account Access Issues",
    "error": "Account Access Issues",
    "crash": "Reliability",
    "slow": "Transaction Performance",
    "transfer": "Transaction Performance",
    "ui": "User Interface & Experience",
    "design": "User Interface & Experience",
    "support": "Customer Support",
    "help": "Customer Support",
    "feature": "Feature Requests"
}

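def assign_themes(review):
    # Collect every theme whose keyword appears as a token in the lemmatized review
    themes = set()
    doc = nlp(review)
    for token in doc:
        if token.text in keyword_themes:
            themes.add(keyword_themes[token.text])
    # Sort so the joined string is the same on every run (set order is not stable)
    return ";".join(sorted(themes)) if themes else "Other"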

df["themes"] = df["processed_review"].apply(assign_themes)
print("\nSample Themes Assigned:")
print(df[["bank", "review", "themes"]].head())


Sample Themes Assigned:
                          bank  \
0  Commercial Bank of Ethiopia   
1  Commercial Bank of Ethiopia   
2  Commercial Bank of Ethiopia   
3  Commercial Bank of Ethiopia   
4  Commercial Bank of Ethiopia   

                                              review  \
0  The CBE app has been highly unreliable in rece...   
1  this new update(Mar 19,2025) is great in fixin...   
2  Good job to the CBE team on this mobile app! I...   
3  this app has developed in a very good ways but...   
4  everytime you uninstall the app you have to re...   

                        themes  
0                        Other  
1                        Other  
2  User Interface & Experience  
3                        Other  
4                        Other  


In [11]:
# Add review_id
df["review_id"] = range(len(df))
# Save to CSV
df[["review_id", "review", "sentiment_label", "sentiment_score", "themes"]].to_csv("../Data/sentiment_thematic_results.csv", index=False)
print(f"\nSaved results for {len(df)} reviews to sentiment_thematic_results.csv.")


Saved results for 1180 reviews to sentiment_thematic_results.csv.
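
The KPI stated at the top (sentiment for at least 90% of reviews, at least 3 themes per bank) can be checked against the final table. A minimal sketch over plain records, assuming each record has `bank`, `sentiment_label`, and `themes` keys as produced by the cells above; the function name `kpi_report` and its thresholds as parameters are hypothetical:

```python
def kpi_report(records, min_coverage=0.9, min_themes=3):
    # Share of reviews that received a sentiment label.
    total = len(records)
    labelled = sum(
        1 for r in records if r.get("sentiment_label") in ("POSITIVE", "NEGATIVE")
    )
    coverage = labelled / total if total else 0.0

    # Distinct themes per bank, ignoring the "Other" fallback.
    themes_by_bank = {}
    for r in records:
        bank_themes = themes_by_bank.setdefault(r["bank"], set())
        for theme in str(r.get("themes", "")).split(";"):
            if theme and theme != "Other":
                bank_themes.add(theme)

    return {
        "sentiment_coverage": coverage,
        "coverage_ok": coverage >= min_coverage,
        "themes_per_bank": {bank: len(t) for bank, t in themes_by_bank.items()},
        "themes_ok": all(len(t) >= min_themes for t in themes_by_bank.values()),
    }
```

With the notebook's DataFrame this could be run as `kpi_report(df.to_dict("records"))`.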
