## Task 2: Sentiment and Thematic Analysis
This notebook is the execution and reporting layer for Task 2. The core logic (sentiment scoring, TF-IDF keyword extraction, LDA topic modeling, and rule-based thematic clustering) lives in the modular Python scripts under `src/`; the cells below call into those modules and display the results.



### 1. Project Setup and Dependencies
This section imports the analysis and visualization libraries and adds the project root to `sys.path` so that the `src` package can be imported.

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
# Setup and Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

# Navigate up one level to the project root directory
project_root = os.path.dirname(os.getcwd())

# Add the project root to the Python system path list
if project_root not in sys.path:
    sys.path.append(project_root)
    print(f"Added project root to sys.path: {project_root}")

# Set visualization styles
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

Added project root to sys.path: d:\DS_Courses\KAIM_10_Academy\KAIM_8\Week_2\KAIM_Week2


In [4]:
# Import the custom analysis classes from src/
from src.preprocessor import ReviewPreprocessor
from src.sentiment_analyzer import SentimentAnalyzer
from src.thematic_analyzer import ThematicAnalyzer

In [5]:
# --- Data Loading ---
DATA_FILE = '../data/processed/reviews_processed.csv'
try:
    df = pd.read_csv(DATA_FILE)
    print("Successfully loaded 'reviews_processed.csv'")
except FileNotFoundError:
    # Every later cell depends on df, so stop here instead of continuing without it
    print("ERROR: Data file not found. Please ensure data is in '../data/processed/'.")
    raise


Successfully loaded 'reviews_processed.csv'
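
A quick schema check right after loading can catch a malformed file before the analyzers run. The sketch below is illustrative only: `REQUIRED_COLUMNS` and `validate_reviews` are hypothetical names, and the column list assumes that `review_text` and `rating` (the columns printed later in this notebook) are the minimum the pipeline needs.

```python
import pandas as pd

# Assumption: the minimum set of columns the later cells read.
REQUIRED_COLUMNS = ("review_text", "rating")


def validate_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Raise early if expected columns are missing; drop rows without review text."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return df.dropna(subset=["review_text"]).reset_index(drop=True)


# Toy frame standing in for reviews_processed.csv
sample = pd.DataFrame({"review_text": ["good app", None], "rating": [5, 1]})
checked = validate_reviews(sample)
print(f"Kept {len(checked)} of {len(sample)} rows")
```

In the notebook this would be called as `df = validate_reviews(df)` immediately after the load.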


### 2. Analysis Methodology

#### 2.1 Sentiment Analysis Approach

In [6]:
# --- Sentiment Analysis  ---
sentiment_processor = SentimentAnalyzer(df=df.copy())
df_with_sentiment = sentiment_processor.analyze_textblob()

# Show output (Verification step for the report)
print("\nTextBlob Analysis Results Sample:")
print(
    df_with_sentiment[[
        "review_text", "rating", "tb_polarity", "tb_subjectivity"
    ]].head()
)

Starting sentiment analysis using TextBlob...
TextBlob analysis complete.

TextBlob Analysis Results Sample:
                                         review_text  rating  tb_polarity  \
0                   bad exprience...it is so crushed       1    -0.400000   
1  not user friendly at all it requires a huge co...       1     0.425000   
2           most of the time is not working properly       1     0.250000   
3  It keeps notifying me to disable developer opt...       1     0.068182   
4  the lag is unbelievable when you need it the m...       1     0.125000   

   tb_subjectivity  
0         0.383333  
1         0.633333  
2         0.300000  
3         0.393939  
4         0.750000  
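
In the sample above, four of the five one-star reviews receive a positive TextBlob polarity, so the score and the star rating disagree. One way to quantify this across the full dataset is to flag low-rated reviews with positive polarity. The sketch below uses the values from the sample output; `flag_mismatches` and the `low_rating` cutoff of 2 are illustrative assumptions, not part of `SentimentAnalyzer`.

```python
import pandas as pd

# Values copied from the sample output above
sample = pd.DataFrame({
    "rating": [1, 1, 1, 1, 1],
    "tb_polarity": [-0.400000, 0.425000, 0.250000, 0.068182, 0.125000],
})


def flag_mismatches(df: pd.DataFrame, low_rating: int = 2) -> pd.DataFrame:
    """Return low-rated reviews whose polarity is nonetheless positive."""
    low_but_positive = (df["rating"] <= low_rating) & (df["tb_polarity"] > 0)
    return df[low_but_positive]


mismatches = flag_mismatches(sample)
print(f"{len(mismatches)} of {len(sample)} low-rated reviews have positive polarity")
```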


In [15]:
# --- VADER Sentiment Analysis  ---

# Fall back to the raw data if the TextBlob cell above was skipped, so this cell can run on its own:
if 'df_with_sentiment' not in locals():
    df_with_sentiment = df.copy()

sentiment_processor = SentimentAnalyzer(df=df_with_sentiment)
df_with_all_sentiment = sentiment_processor.analyze_vader()

# Show output (Verification step for the report)
print("\nVADER Analysis Results Sample:")
print(
    df_with_all_sentiment[[
        "review_text", "rating", "vader_compound", "vader_sentiment"
    ]].head()
)

Starting sentiment analysis using VADER...
VADER analysis complete.

VADER Analysis Results Sample:
                                         review_text  rating  vader_compound  \
0                   bad exprience...it is so crushed       1         -0.7973   
1  not user friendly at all it requires a huge co...       1         -0.4268   
2           most of the time is not working properly       1          0.0000   
3  It keeps notifying me to disable developer opt...       1         -0.4019   
4  the lag is unbelievable when you need it the m...       1         -0.1531   

  vader_sentiment  
0        Negative  
1        Negative  
2         Neutral  
3        Negative  
4        Negative  
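
The `vader_sentiment` labels above are consistent with the thresholds commonly used with VADER: a compound score of 0.05 or higher is positive, -0.05 or lower is negative, and anything in between is neutral. The sketch below assumes that convention; whether `analyze_vader` uses exactly these cutoffs is not shown in this notebook, and `label_from_compound` is a hypothetical helper.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a label (assumed +/-0.05 cutoffs)."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Compound scores copied from the sample output above
sample_scores = [-0.7973, -0.4268, 0.0000, -0.4019, -0.1531]
labels = [label_from_compound(score) for score in sample_scores]
print(labels)
```

Applied to the five sample scores, this reproduces the labels in the table above.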


#### 2.2 Thematic Analysis and Grouping Logic

In [16]:
# --- Thematic Analysis (Keyword Extraction) ---

# 'df_with_sentiment' contains the data from the previous step.
thematic_processor = ThematicAnalyzer(df=df_with_sentiment.copy())

# This single call runs the preprocessing AND the TF-IDF vectorization
thematic_processor.extract_keywords_tfidf() 


# Verification: Show the new column
print("\nCleaned Reviews Sample (Thematic Preprocessing):")
print(thematic_processor.df[['review_text', 'cleaned_review']].head())

Preprocessing text for TF-IDF...
TF-IDF analysis complete. Found 469 features.

Cleaned Reviews Sample (Thematic Preprocessing):
                                         review_text  \
0                   bad exprience...it is so crushed   
1  not user friendly at all it requires a huge co...   
2           most of the time is not working properly   
3  It keeps notifying me to disable developer opt...   
4  the lag is unbelievable when you need it the m...   

                                      cleaned_review  
0                     bad exprience it is so crushed  
1  not user friendly at all it requires a huge co...  
2           most of the time is not working properly  
3  it keeps notifying me to disable developer opt...  
4  the lag is unbelievable when you need it the m...  
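
The `cleaned_review` column in the sample is lowercased with punctuation replaced by spaces. A minimal sketch of a cleaning step that reproduces those sample rows follows; `clean_text` is a hypothetical function, and the actual preprocessing inside `ThematicAnalyzer` may do more (or less) than this.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("bad exprience...it is so crushed"))
print(clean_text("It keeps notifying me"))
```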


In [17]:
# --- Thematic Analysis: Keyword Extraction ---

# Assuming 'df_with_sentiment' contains the data from the sentiment step.

thematic_processor = ThematicAnalyzer(df=df_with_sentiment.copy())

# This single call runs cleaning, vectorization, and mean calculation
keywords_df = thematic_processor.extract_keywords_tfidf() 

# Show output (Verification and reporting step)
print("\nTop 10 Most Significant Keywords (Mean TF-IDF Score):")
print(keywords_df.head(10))

Preprocessing text for TF-IDF...
TF-IDF analysis complete. Found 469 features.

Top 10 Most Significant Keywords (Mean TF-IDF Score):
          word     tfidf
0          app  0.098637
1         good  0.054004
2         best  0.038600
3         bank  0.037071
4          use  0.027366
5      banking  0.026982
6     good app  0.025107
7         easy  0.023530
8  application  0.022963
9       dashen  0.021473


#### Topic Modeling (LDA)

In [18]:
# --- Advanced Thematic Analysis: LDA Topic Modeling ---
# This is used for discovering themes, informing the rule-based clustering.

# Assuming 'df_with_sentiment' contains the data from the previous step.
thematic_processor = ThematicAnalyzer(df=df_with_sentiment.copy())

# Run the full LDA pipeline via the modular method
lda_results = thematic_processor.run_lda_topic_modeling(num_topics=4)

# Display the output (Reporting step)
print(lda_results)


--- Starting LDA Topic Modeling (4 Topics) ---
LDA modeling complete.
--- Topic 1 ---
good             weight=0.0740
use              weight=0.0497
easy             weight=0.0471
bank             weight=0.0467
dashen           weight=0.0297
work             weight=0.0272
super            weight=0.0213
fast             weight=0.0208
great            weight=0.0177
please           weight=0.0171

--- Topic 2 ---
bank             weight=0.0352
banking          weight=0.0328
one              weight=0.0272
dashen           weight=0.0271
user             weight=0.0237
super            weight=0.0222
ever             weight=0.0200
experience       weight=0.0191
amazing          weight=0.0170
friendly         weight=0.0164

--- Topic 3 ---
banking          weight=0.0376
best             weight=0.0376
mobile           weight=0.0329
money            weight=0.0252
time             weight=0.0219
bank             weight=0.0199
nice             weight=0.0165
even             weight=0.0164
one        

In [11]:
# --- Bank-Specific Thematic Analysis (LDA) ---

# Assuming 'df_with_sentiment' contains the final data.
thematic_processor = ThematicAnalyzer(df=df_with_sentiment.copy())

# Run the bank-specific LDA pipeline via the modular method
bank_lda_results = thematic_processor.run_lda_by_bank(n_topics=4)

# Display the output (Reporting step)
print(bank_lda_results)


=== Bank of Abyssinia (370 reviews) ===
Topic 1: app, best, easy, fast, worest, worest app, best app, disappointing
Topic 2: app, worst, banking, mobile, working, worst app, doesn, work
Topic 3: good, works, update, better, app, really, needs, work
Topic 4: bank, good, app, bad, slow, good app, application, good application

=== Commercial Bank of Ethiopia (303 reviews) ===
Topic 1: app, bank, good app, banking, good, transaction, history, mobile
Topic 2: best, good, application, app, fast, service, useful, add
Topic 3: app, use, easy, easy use, nice, life, money, problem
Topic 4: cbe, excellent, time, user, works, application, work, friendly

=== Dashen Bank (424 reviews) ===
Topic 1: use, easy, app, slow, easy use, application, transaction, dashen
Topic 2: bank, app, good, working, account, nice, option, app good
Topic 3: app, best, dashen, user, super, fast, banking, good
Topic 4: mobile, star, mobile banking, banking, worst, worst app, app, takes


In [15]:
# --- Rule-Based Theme Assignment ---
# Replaces the inline THEME_MAPPING definition and df.apply logic with one modular call.

# thematic_processor (created above) already holds a copy of df_with_sentiment.
df_final_analyzed = thematic_processor.assign_themes(top_k=1)

# Verification: Show the new 'theme' column
print("\nRule-Based Theme Assignment Sample:")
print(df_final_analyzed[['cleaned_review', 'theme']].head())

Starting rule-based thematic clustering...
Thematic assignment complete.

Rule-Based Theme Assignment Sample:
                                      cleaned_review  \
0                     bad exprience it is so crushed   
1  not user friendly at all it requires a huge co...   
2           most of the time is not working properly   
3  it keeps notifying me to disable developer opt...   
4  the lag is unbelievable when you need it the m...   

                         theme  
0                        other  
1  User Experience (UI/Design)  
2                        other  
3         App Stability & Bugs  
4                        other  
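
Rule-based assignment of this kind usually counts keyword hits per theme and keeps the best-scoring themes, falling back to `other` when nothing matches. The sketch below illustrates the idea. The theme names are taken from the output above, but the keyword lists are invented for illustration, and the function is a hypothetical stand-in for the real rules in `src/thematic_analyzer.py`.

```python
# Hypothetical keyword lists; the real mapping lives in src/thematic_analyzer.py
THEME_KEYWORDS = {
    "User Experience (UI/Design)": ["user friendly", "interface", "design"],
    "App Stability & Bugs": ["crash", "bug", "error", "freeze"],
}


def assign_themes(text: str, top_k: int = 1) -> list:
    """Return up to top_k themes ranked by keyword hits, or ['other'] if none match."""
    hits = {
        theme: sum(keyword in text for keyword in keywords)
        for theme, keywords in THEME_KEYWORDS.items()
    }
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    matched = [theme for theme, count in ranked if count > 0]
    return matched[:top_k] or ["other"]


print(assign_themes("not user friendly at all"))
print(assign_themes("bad exprience it is so crushed"))
print(assign_themes("the app crashes with an error and the design is ugly", top_k=2))
```

Plain substring matching is crude: it misses misspellings such as "crushed" for "crashed", which is one reason a large share of reviews can end up in `other`.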


In [None]:
# --- Data Loading for Reporting ---
import pandas as pd
from config import DATA_PATHS

# Load the final file created by the pipeline.py script
REPORTING_FILE = DATA_PATHS['final_results']
df_final = pd.read_csv(REPORTING_FILE)

print(f"Loaded {len(df_final)} analyzed records for visualization.")

# --- Visualization 1: Theme Sentiment Aggregation ---
theme_sentiment = (
    df_final.groupby('theme')['sentiment_score']
    .agg(['mean', 'count'])
    .sort_values('mean')
)

print("\n--- Thematic Sentiment Summary ---")
print(theme_sentiment)

# --- Visualization 2: Plotting Code ---
# Use df_final for all subsequent plotting and aggregation code.

Loaded 1195 analyzed records for visualization.

--- Thematic Sentiment Summary ---
                        mean  count
theme                              
user-sentiment     -0.417632     76
performance        -0.054321     76
core-functionality  0.139479    106
other               0.281353    768
feature-requests    0.351479     28
ui-ux               0.656111    141
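
The summary above can be turned into a chart directly. The sketch below rebuilds the table from the printed values, adds each theme's share of reviews (the `other` theme covers about 64% of the 1195 records), and draws a horizontal bar chart of mean sentiment. The colours, labels, and the `share` column are illustrative choices; in the notebook, `theme_sentiment` would come from the `groupby` above rather than being typed in.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the summary table above
theme_sentiment = pd.DataFrame(
    {
        "mean": [-0.417632, -0.054321, 0.139479, 0.281353, 0.351479, 0.656111],
        "count": [76, 76, 106, 768, 28, 141],
    },
    index=pd.Index(
        ["user-sentiment", "performance", "core-functionality",
         "other", "feature-requests", "ui-ux"],
        name="theme",
    ),
)

# Share of all reviews that falls into each theme
theme_sentiment["share"] = theme_sentiment["count"] / theme_sentiment["count"].sum()
print(theme_sentiment["share"].round(3))

# Horizontal bars: red for negative mean sentiment, green for positive
fig, ax = plt.subplots(figsize=(10, 6))
colors = ["tab:red" if value < 0 else "tab:green" for value in theme_sentiment["mean"]]
ax.barh(theme_sentiment.index, theme_sentiment["mean"], color=colors)
ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("Mean sentiment score")
ax.set_title("Mean sentiment by theme")
fig.tight_layout()
```

Because the themes differ widely in size (28 to 768 reviews), it helps to read the means alongside the counts or shares rather than on their own.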
