# 🧹 Task 1: Data Collection & Preprocessing
### Customer Experience Analytics for Fintech Apps
**Commercial Bank of Ethiopia (CBE) • Bank of Abyssinia (BOA) • Dashen Bank**  
*November 2025*

## Objective
Scrape **at least 1,200 real user reviews** (≥400 per bank) from the Google Play Store and deliver a **clean, analysis-ready dataset** with the following columns:

| Column   | Description                  |
|----------|------------------------------|
| `review` | User review text             |
| `rating` | 1–5 star rating              |
| `date`   | Review date (YYYY-MM-DD)     |
| `bank`   | CBE / BOA / DASHEN           |
| `source` | Google Play Store            |

___

### 1. Setup & Data Loading

In [1]:
import sys
import os

# Add project root (one directory above "notebooks")
sys.path.append(os.path.abspath(".."))

In [2]:
# Import pandas and the project's scraping and preprocessing helpers
import pandas as pd
from scripts.scrape_reviews import scrape_reviews_for_app
from scripts.preprocess_reviews import preprocess_pipeline

In [3]:
# Google Play package name for each bank's app (store listing URL in the trailing comment)
APPS = {
    "CBE": "com.combanketh.mobilebanking",  # https://play.google.com/store/apps/details?id=com.combanketh.mobilebanking&hl=en
    "BOA": "com.boa.boaMobileBanking",  # https://play.google.com/store/apps/details?id=com.boa.boaMobileBanking&pcampaignid=web_share
    "Dashen": "com.dashen.dashensuperapp",  # https://play.google.com/store/apps/details?id=com.dashen.dashensuperapp&pcampaignid=web_share
}

### 2. Data Collection (Web Scraping)

In [4]:
all_reviews = [] 
# loop through each bank using the predefined APPS dictionary
for bank, package in APPS.items():
    data = scrape_reviews_for_app(bank, package)
    all_reviews.extend(data)

🔹 Scraping CBE...
✅ Finished CBE (400 reviews)
🔹 Scraping BOA...
✅ Finished BOA (400 reviews)
🔹 Scraping Dashen...
✅ Finished Dashen (400 reviews)
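
The body of `scrape_reviews_for_app` lives in `scripts/scrape_reviews.py` and is not shown in this notebook. The raw field names that appear later (`reviewId`, `content`, `score`, `at`, and so on) match what the `google-play-scraper` package returns, so the following is a minimal sketch under that assumption. The function name `scrape_reviews_sketch`, the `target` and `fetch` parameters, and the `lang`/`country` settings are hypothetical and not taken from the project's script.

```python
def scrape_reviews_sketch(bank, package, target=400, fetch=None):
    """Hypothetical stand-in for scripts.scrape_reviews.scrape_reviews_for_app."""
    if fetch is None:
        # Imported lazily so the sketch can be exercised offline with a stub.
        # Assumption: the project relies on the google-play-scraper package.
        from google_play_scraper import Sort, reviews

        def fetch(app_id, count):
            # reviews() returns (list_of_review_dicts, continuation_token)
            result, _token = reviews(
                app_id,
                lang="en",         # assumed language filter
                country="us",      # assumed store region
                sort=Sort.NEWEST,  # newest reviews first
                count=count,
            )
            return result

    print(f"Scraping {bank}...")
    # Tag every review with its bank so the three apps can share one DataFrame.
    rows = [{**row, "bank": bank} for row in fetch(package, target)]
    print(f"Finished {bank} ({len(rows)} reviews)")
    return rows
```

Passing a stub as `fetch` makes it possible to check the tagging logic without hitting the Play Store, which is convenient when the notebook is rerun offline.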


In [5]:
df = pd.DataFrame(all_reviews)  # Create a pandas DataFrame for easy manipulation and analysis
df.head()  # Quick check of the first rows

|   | reviewId | userName | userImage | content | score | thumbsUpCount | reviewCreatedVersion | at | replyContent | repliedAt | appVersion | bank |
|---|----------|----------|-----------|---------|-------|---------------|----------------------|----|--------------|-----------|------------|------|
| 0 | f8002d06-b5c5-4ed1-9d51-a9a379304cf8 | Sayid Ahmad | https://play-lh.googleusercontent.com/a-/ALV-U... | the most advanced app. but how to stay safe? | 5 | 0 | 4.4.0 | 2025-11-27 10:03:41 | | NaT | 4.4.0 | CBE |
| 1 | 81000db5-aa51-467e-826c-fc96160e96a8 | Hiwot Gebrie | https://play-lh.googleusercontent.com/a/ACg8oc... | Good application | 4 | 0 | | 2025-11-27 08:59:12 | | NaT | | CBE |
| 2 | 3d88a334-958c-4717-9f97-c5d46359e054 | samson getachew | https://play-lh.googleusercontent.com/a/ACg8oc... | It is nice app | 5 | 1 | 5.2.1 | 2025-11-26 12:03:18 | | NaT | 5.2.1 | CBE |
| 3 | 99d376ea-4824-4af9-a093-27360acc3a5c | Nejbadin Ali | https://play-lh.googleusercontent.com/a-/ALV-U... | best | 5 | 0 | 5.2.1 | 2025-11-25 20:27:20 | | NaT | 5.2.1 | CBE |
| 4 | f1861daf-a1ed-407a-9e7c-295edbb3877d | Amman Mom | https://play-lh.googleusercontent.com/a/ACg8oc... | good app | 5 | 0 | 5.2.1 | 2025-11-25 18:10:35 | | NaT | 5.2.1 | CBE |


In [6]:
df.to_csv("../data/raw_bank_reviews.csv", index=False)
print("💾 Saved to data/raw_bank_reviews.csv")


💾 Saved to data/raw_bank_reviews.csv


### 3. Preprocessing

In [7]:
# Work on a copy to preserve raw data in memory
df_clean = df.copy()
# Run the full pipeline defined in scripts/preprocess_reviews.py
df_clean = preprocess_pipeline(df_clean)


Starting preprocessing pipeline...

Raw data loaded: 1,200 reviews
🔹 Removed 0 duplicate reviews.
🔹 Removed 0 empty reviews.
Date normalized → datetime64[ns]
🔹 Standardized bank names.
🔹 Selected required final columns.
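
The log above lists the pipeline's steps, but the implementation sits in `scripts/preprocess_reviews.py`. The steps could be sketched roughly as follows. This is an illustration, not the project's code: the function name `preprocess_sketch`, deduplicating on `reviewId`, uppercasing bank names, and mapping `content`/`score`/`at` to `review`/`rating`/`date` are all assumptions inferred from the raw and final column names.

```python
import pandas as pd

FINAL_COLUMNS = ["review", "rating", "date", "bank", "source"]


def preprocess_sketch(df):
    """Hypothetical approximation of preprocess_pipeline (assumptions noted inline)."""
    out = df.copy()

    # 1. Drop duplicates. Assumption: reviewId uniquely identifies a review.
    before = len(out)
    out = out.drop_duplicates(subset="reviewId")
    print(f"Removed {before - len(out)} duplicate reviews.")

    # 2. Drop reviews whose text is missing or only whitespace.
    before = len(out)
    text = out["content"].fillna("").astype(str).str.strip()
    out = out.assign(content=text)[text != ""]
    print(f"Removed {before - len(out)} empty reviews.")

    # 3-4. Normalize dates to midnight and standardize bank names.
    # Assumption: standardizing means uppercasing (e.g. "Dashen" -> "DASHEN").
    out = out.assign(
        review=out["content"],
        rating=out["score"].astype(int),
        date=pd.to_datetime(out["at"]).dt.normalize(),
        bank=out["bank"].str.upper(),
        source="Google Play Store",
    )

    # 5. Keep only the columns required by the task.
    return out[FINAL_COLUMNS].reset_index(drop=True)
```

Building the final columns with `assign` rather than in-place assignment avoids pandas' chained-assignment warnings after the filtering step.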


### 4. Validation Checks

In [8]:
# Check total review count
print("Total reviews:", len(df_clean))
# Confirm dtypes, in particular that `date` was normalized to datetime64[ns]
df_clean.info()

Total reviews: 1200
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   review  1200 non-null   object        
 1   rating  1200 non-null   int64         
 2   date    1200 non-null   datetime64[ns]
 3   bank    1200 non-null   object        
 4   source  1200 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 47.0+ KB


In [9]:
# Check distribution across banks
print(df_clean["bank"].value_counts())

bank
CBE       400
BOA       400
DASHEN    400
Name: count, dtype: int64


In [10]:
# Check missing values (< 5% expected)
df_clean.isnull().mean() * 100

review    0.0
rating    0.0
date      0.0
bank      0.0
source    0.0
dtype: float64
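
The checks in this section are read off by eye. They could also be gathered into one function that reports any violated threshold, which helps when the notebook is rerun on fresh data. The sketch below assumes the targets stated in this notebook (1,200 reviews in total, 400 per bank, under 5% missing values per column). The function name `validate_reviews` and its parameters are hypothetical.

```python
import pandas as pd


def validate_reviews(df, min_total=1200, min_per_bank=400, max_missing_pct=5.0):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []

    # Total review count
    if len(df) < min_total:
        problems.append(f"only {len(df)} reviews, expected at least {min_total}")

    # Per-bank distribution
    for bank, n in df["bank"].value_counts().items():
        if n < min_per_bank:
            problems.append(f"{bank}: only {n} reviews, expected at least {min_per_bank}")

    # Missing values per column, as a percentage
    for col, pct in (df.isnull().mean() * 100).items():
        if pct >= max_missing_pct:
            problems.append(f"{col}: {pct:.1f}% missing, expected under {max_missing_pct}%")

    # Ratings must be within the 1-5 star range (missing ratings are handled above)
    ratings = df["rating"].dropna()
    if not ratings.between(1, 5).all():
        problems.append("rating values outside the 1-5 range")

    return problems
```

Calling `validate_reviews(df_clean)` on the dataset summarized above should return an empty list if these assumptions hold.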

In [11]:
# Confirm final columns
df_clean.columns

Index(['review', 'rating', 'date', 'bank', 'source'], dtype='object')

In [12]:
# Save the cleaned dataset to data/processed
df_clean.to_csv("../data/processed/cleaned_reviews.csv", index=False)
print("💾 Saved cleaned dataset to ../data/processed/cleaned_reviews.csv")

💾 Saved cleaned dataset to ../data/processed/cleaned_reviews.csv
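
CSV files do not store dtypes, so the `datetime64[ns]` date column comes back as plain strings unless it is parsed again on load. A minimal sketch of a loader for later tasks follows. The helper name `load_cleaned_reviews` is hypothetical, and the in-memory sample only mimics the file's layout.

```python
from io import StringIO

import pandas as pd


def load_cleaned_reviews(path_or_buffer):
    """Read the cleaned dataset and restore the date column to a datetime dtype."""
    return pd.read_csv(path_or_buffer, parse_dates=["date"])


# Tiny in-memory demo that mirrors the five-column layout of the cleaned file.
sample_csv = StringIO(
    "review,rating,date,bank,source\n"
    "good app,5,2025-11-25,CBE,Google Play Store\n"
)
sample = load_cleaned_reviews(sample_csv)
print(sample.dtypes)
```

In the notebook itself the same helper would be called with the saved path, for example `load_cleaned_reviews("../data/processed/cleaned_reviews.csv")`.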
