# 🧹 Task 1: Data Collection & Preprocessing
### Customer Experience Analytics for Fintech Apps
**Commercial Bank of Ethiopia (CBE) • Bank of Abyssinia (BOA) • Dashen Bank**  
*November 2025*

## Objective
Scrape **at least 1,200 real user reviews** (≥400 per bank) from the Google Play Store and deliver a **clean, analysis-ready dataset** with the following columns:

| Column   | Description                  |
|----------|------------------------------|
| `review` | User review text             |
| `rating` | 1–5 star rating              |
| `date`   | Review date (YYYY-MM-DD)     |
| `bank`   | CBE / BOA / DASHEN           |
| `source` | Google Play Store            |

___

### 1. Setup & Data Loading

In [1]:
import sys
import os

# Add project root (one directory above "notebooks")
sys.path.append(os.path.abspath(".."))

In [None]:
# import necessary modules and libraries
import pandas as pd
import re

from scripts.scrape_reviews import scrape_reviews_for_app
from scripts.preprocess_reviews import preprocess_pipeline

In [3]:
# package names for each application
APPS = {
    "CBE": "com.combanketh.mobilebanking",  # https://play.google.com/store/apps/details?id=com.combanketh.mobilebanking&hl=en
    "BOA": "com.boa.boaMobileBanking",  # https://play.google.com/store/apps/details?id=com.boa.boaMobileBanking&pcampaignid=web_share
    "Dashen": "com.dashen.dashensuperapp",  # https://play.google.com/store/apps/details?id=com.dashen.dashensuperapp&pcampaignid=web_share
}

### 2. Data Collection (Web Scraping)

In [4]:
all_reviews = [] 
# loop through each bank using the predefined APPS dictionary
for bank, package in APPS.items():
    data = scrape_reviews_for_app(bank, package)
    all_reviews.extend(data)

🔹 Scraping CBE...
✅ Finished CBE (600 reviews)
🔹 Scraping BOA...
✅ Finished BOA (600 reviews)
🔹 Scraping Dashen...
✅ Finished Dashen (600 reviews)
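
The notebook imports `scrape_reviews_for_app` from `scripts/scrape_reviews.py` without showing its body. The columns of the scraped data (`reviewId`, `content`, `score`, `at`, and so on) match what the `google-play-scraper` package returns, so the helper may look roughly like the sketch below. This is an assumption, not the project's actual code: the `count` default, the `lang`/`country` values, and the `fetch` parameter (added here so the function can be exercised without a network call) are all hypothetical.

```python
from typing import Callable, Optional


def scrape_reviews_for_app(
    bank: str,
    package: str,
    count: int = 600,
    fetch: Optional[Callable[[str, int], list]] = None,
) -> list:
    """Fetch up to `count` reviews for one app and tag each with its bank."""
    if fetch is None:
        # Assumed backend: google-play-scraper (pip install google-play-scraper).
        from google_play_scraper import Sort, reviews

        def fetch(pkg: str, n: int) -> list:
            result, _token = reviews(
                pkg, lang="en", country="us", sort=Sort.NEWEST, count=n
            )
            return result

    print(f"Scraping {bank}...")
    rows = [{**review, "bank": bank} for review in fetch(package, count)]
    print(f"Finished {bank} ({len(rows)} reviews)")
    return rows


# Offline usage with a stand-in fetcher (hypothetical data, no network call).
def fake_fetch(pkg: str, n: int) -> list:
    return [{"content": "Good application", "score": 4}][:n]


sample = scrape_reviews_for_app(
    "CBE", "com.combanketh.mobilebanking", fetch=fake_fetch
)
```

Tagging each row with `bank` at scrape time is what lets the later steps group and count reviews per bank without re-deriving it from the package name.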


In [6]:
df = pd.DataFrame(all_reviews)  # create a pandas DataFrame for easy manipulation and analysis
df.head()  # quick check

|   | reviewId | userName | userImage | content | score | thumbsUpCount | reviewCreatedVersion | at | replyContent | repliedAt | appVersion | bank |
|---|----------|----------|-----------|---------|-------|---------------|----------------------|----|--------------|-----------|------------|------|
| 0 | c69f051a-00f8-4144-8423-b7ebcd328d2d | Mohammed Abrahim | https://play-lh.googleusercontent.com/a-/ALV-U... | The app makes our life easier. Thank you CBE! | 5 | 0 | 5.2.1 | 2025-11-27 18:00:06 |  | NaT | 5.2.1 | CBE |
| 1 | d2995fb9-63c6-4bfc-8d3c-93a0ee9dba8f | Sulxaan Huseen | https://play-lh.googleusercontent.com/a-/ALV-U... | this app very bad 👎 | 1 | 0 |  | 2025-11-27 16:28:10 |  | NaT |  | CBE |
| 2 | f8002d06-b5c5-4ed1-9d51-a9a379304cf8 | Sayid Ahmad | https://play-lh.googleusercontent.com/a-/ALV-U... | the most advanced app. but how to stay safe? | 5 | 0 | 4.4.0 | 2025-11-27 10:03:41 |  | NaT | 4.4.0 | CBE |
| 3 | 81000db5-aa51-467e-826c-fc96160e96a8 | Hiwot Gebrie | https://play-lh.googleusercontent.com/a/ACg8oc... | Good application | 4 | 0 |  | 2025-11-27 08:59:12 |  | NaT |  | CBE |
| 4 | 3d88a334-958c-4717-9f97-c5d46359e054 | samson getachew | https://play-lh.googleusercontent.com/a/ACg8oc... | It is nice app | 5 | 1 | 5.2.1 | 2025-11-26 12:03:18 |  | NaT | 5.2.1 | CBE |


In [7]:
df.to_csv("../data/raw_bank_reviews.csv", index=False)
print("💾 Saved to data/raw_bank_reviews.csv")


💾 Saved to data/raw_bank_reviews.csv


### 3. Preprocessing

In [8]:
# Work on a copy to preserve raw data in memory
df_clean = df.copy()
# Run the full pipeline defined in scripts/preprocess_reviews.py
df_clean = preprocess_pipeline(df_clean)


Starting preprocessing pipeline...

Raw data loaded: 1,800 reviews
🔹 Removed 0 duplicate reviews.
🔹 Removed 0 empty reviews.
Removed 253 reviews with non-Latin characters (Amharic/Arabic/etc)
→ Kept 1,547 clean English reviews for accurate sentiment analysis
Date normalized → datetime64[ns]
🔹 Standardized bank names.
🔹 Selected required final columns.
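
The body of `preprocess_pipeline` lives in `scripts/preprocess_reviews.py` and is not shown here. Going only by the log lines above, its steps could be sketched as follows. Everything in this block is an assumption: the function name `preprocess_sketch`, the column renames, the duplicate key, and the non-ASCII pattern (borrowed from the validation cell later in the notebook) are illustrative stand-ins, not the project's implementation.

```python
import pandas as pd

NON_ASCII = r"[^\x00-\x7F]"  # anything outside the 7-bit ASCII range
FINAL_COLUMNS = ["review", "rating", "date", "bank", "source"]


def preprocess_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for preprocess_pipeline; mirrors the logged steps."""
    df = raw.rename(columns={"content": "review", "score": "rating", "at": "date"})
    # 1. Drop duplicate reviews (assumed key: same text for the same bank).
    df = df.drop_duplicates(subset=["review", "bank"])
    # 2. Drop empty or whitespace-only reviews.
    df = df[df["review"].fillna("").astype(str).str.strip() != ""]
    # 3. Drop reviews containing any non-ASCII character.
    df = df[~df["review"].astype(str).str.contains(NON_ASCII, regex=True)]
    # 4-5. Normalize dates, standardize bank labels, add the source column.
    df = df.assign(
        date=pd.to_datetime(df["date"]).dt.normalize(),
        bank=df["bank"].str.upper(),
        source="Google Play Store",
    )
    # 6. Keep only the required final columns.
    return df[FINAL_COLUMNS]


# Small illustrative input: one duplicate, one blank, one Amharic review.
raw = pd.DataFrame(
    [
        {"content": "Good application", "score": 4, "at": "2025-11-27 08:59:12", "bank": "Dashen"},
        {"content": "Good application", "score": 4, "at": "2025-11-27 08:59:12", "bank": "Dashen"},
        {"content": "   ", "score": 3, "at": "2025-11-27 09:00:00", "bank": "BOA"},
        {"content": "በጣም ጥሩ", "score": 5, "at": "2025-11-27 10:00:00", "bank": "CBE"},
        {"content": "It is nice app", "score": 5, "at": "2025-11-26 12:03:18", "bank": "CBE"},
    ]
)
clean = preprocess_sketch(raw)
```

Because rows are filtered rather than re-indexed, the original index labels survive, which is consistent with the `Index: 1547 entries, 0 to 1799` line in the validation output.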


### 4. Validation Checks

In [9]:
# Check total review count
print("Total reviews:", len(df_clean))
# check for normalized date
df_clean.info()

Total reviews: 1547
<class 'pandas.core.frame.DataFrame'>
Index: 1547 entries, 0 to 1799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   review  1547 non-null   object        
 1   rating  1547 non-null   int64         
 2   date    1547 non-null   datetime64[ns]
 3   bank    1547 non-null   object        
 4   source  1547 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 72.5+ KB


In [10]:
# Check distribution across banks
print(df_clean["bank"].value_counts())

bank
CBE       521
BOA       514
DASHEN    512
Name: count, dtype: int64
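
The counts above can be checked against the targets stated in the objective (at least 400 reviews per bank and 1,200 overall). The sketch below is one way to automate that comparison; the helper name `check_review_targets` and the constants are hypothetical, and the counts are copied from the output above.

```python
import pandas as pd

MIN_PER_BANK = 400  # per-bank target from the objective
MIN_TOTAL = 1200    # overall target from the objective


def check_review_targets(counts: pd.Series) -> list:
    """Return a list of problems; an empty list means all targets are met."""
    problems = []
    total = int(counts.sum())
    if total < MIN_TOTAL:
        problems.append(f"total {total} < {MIN_TOTAL}")
    for bank, n in counts.items():
        if n < MIN_PER_BANK:
            problems.append(f"{bank}: {n} < {MIN_PER_BANK}")
    return problems


# Counts copied from the value_counts() output above.
counts = pd.Series({"CBE": 521, "BOA": 514, "DASHEN": 512})
problems = check_review_targets(counts)
print(problems or "All collection targets met")
```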


In [11]:
# Check missing values (< 5% expected)
df_clean.isnull().mean() * 100

review    0.0
rating    0.0
date      0.0
bank      0.0
source    0.0
dtype: float64

In [12]:
# Confirm final columns
df_clean.columns

Index(['review', 'rating', 'date', 'bank', 'source'], dtype='object')
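
The individual checks in this section could also be bundled into a single function that fails loudly when the dataset drifts from the spec in the objective. This is a minimal sketch under stated assumptions: `validate_reviews` is a hypothetical helper, and the 5% threshold is taken from the comment in the missing-values cell.

```python
import pandas as pd

EXPECTED_COLUMNS = ["review", "rating", "date", "bank", "source"]
ALLOWED_BANKS = {"CBE", "BOA", "DASHEN"}
MAX_MISSING_PCT = 5.0  # threshold quoted in the missing-values check


def validate_reviews(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame violates the Task 1 dataset spec."""
    assert list(df.columns) == EXPECTED_COLUMNS, "unexpected columns"
    assert df["rating"].between(1, 5).all(), "rating outside 1-5"
    assert set(df["bank"]) <= ALLOWED_BANKS, "unknown bank label"
    worst = (df.isnull().mean() * 100).max()
    assert worst < MAX_MISSING_PCT, f"{worst:.1f}% missing values"


# Illustrative two-row frame in the final schema.
good = pd.DataFrame(
    {
        "review": ["Good application", "It is nice app"],
        "rating": [4, 5],
        "date": pd.to_datetime(["2025-11-27", "2025-11-26"]),
        "bank": ["CBE", "CBE"],
        "source": ["Google Play Store"] * 2,
    }
)
validate_reviews(good)  # passes silently
```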

In [16]:
# Check for remaining non-ASCII characters (e.g. Amharic or Arabic script), which could skew sentiment analysis
non_latin_left = df_clean["review"].fillna("").astype(str).str.contains(r'[^\x00-\x7F]', regex=True).sum()
print(f"Non-Latin characters still present: {non_latin_left}  → should be 0")
if non_latin_left == 0:
    print("Perfect! All non-Latin script removed.")
else:
    print("Warning: Some non-Latin text still exists!")

Non-Latin characters still present: 0  → should be 0
Perfect! All non-Latin script removed.
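
Note that the pattern `[^\x00-\x7F]` matches any character outside ASCII, not only non-Latin scripts. The sketch below illustrates the difference. The sample strings are illustrative, except the emoji review, which appears in the `df.head()` output earlier. Whether the pipeline itself uses this exact pattern is an assumption.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]")

samples = {
    "plain English": "Good application",
    "Amharic": "በጣም ጥሩ",
    "emoji": "this app very bad 👎",
    "accented Latin": "café payment failed",
}

# True means the text would be flagged by the non-ASCII check.
flagged = {label: bool(NON_ASCII.search(text)) for label, text in samples.items()}
for label, is_flagged in flagged.items():
    print(f"{label:15s} flagged={is_flagged}")
```

If the pipeline filters with the same pattern, English reviews that contain an emoji or a curly apostrophe are dropped along with Amharic and Arabic text, so the 253 removed reviews may include some English ones.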


In [17]:
# Save the cleaned dataset to data/processed
df_clean.to_csv("../data/processed/cleaned_reviews.csv", index=False)
print("💾 Saved cleaned dataset to ../data/processed/cleaned_reviews.csv")

💾 Saved cleaned dataset to ../data/processed/cleaned_reviews.csv
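
The objective asks for dates in `YYYY-MM-DD` form. One way to make that explicit when writing the file, rather than relying on how pandas renders timestamps, is the `date_format` argument of `to_csv`. The sketch below uses an in-memory buffer and a one-row illustrative frame instead of the real dataset.

```python
import io

import pandas as pd

df_demo = pd.DataFrame(
    {
        "review": ["It is nice app"],
        "rating": [5],
        "date": pd.to_datetime(["2025-11-26 12:03:18"]),
        "bank": ["CBE"],
        "source": ["Google Play Store"],
    }
)

# Write with an explicit date format so the CSV always carries YYYY-MM-DD.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False, date_format="%Y-%m-%d")
csv_text = buffer.getvalue()

# Read it back, parsing the date column into timestamps again.
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["date"])
```

Reading the file back with `parse_dates=["date"]` restores a datetime column, so downstream notebooks do not have to convert it by hand.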


In [18]:
# Cross-check on the raw data: count reviews that contain non-ASCII characters
# (this should match the 253 reviews removed during preprocessing)
non_latin_raw = df["content"].fillna("").astype(str).str.contains(r'[^\x00-\x7F]', regex=True).sum()
print(f"Raw reviews with non-Latin characters: {non_latin_raw}  → should be 253")

Raw reviews with non-Latin characters: 253  → should be 253
