# Preprocessing Reviews for Mobile Banking Analysis

This notebook performs Exploratory Data Analysis (EDA) and preprocessing on raw review data scraped from the Google Play Store for Commercial Bank of Ethiopia (CBE), Bank of Abyssinia (BOA), and Dashen Bank. The goal is to clean the data by removing duplicates, handling missing values, and normalizing dates, while documenting the initial state of the dataset.

**Steps:**
1. Load raw data from `raw_reviews.csv`.
2. Perform EDA to understand the dataset.
3. Remove duplicates.
4. Handle missing data.
5. Verify and normalize dates.
6. Save the cleaned data to `cleaned_reviews.csv`.

**KPI:** Ensure 1,200+ reviews with <5% missing data.
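
The KPI can also be checked programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: the helper name `check_kpi` is hypothetical, and it assumes that "missing data" means the share of empty cells across the whole table, which is only one possible reading of the KPI.

```python
import pandas as pd


def check_kpi(df: pd.DataFrame, min_reviews: int = 1200, max_missing_pct: float = 5.0) -> dict:
    """Return the row count, overall missing-cell percentage, and pass flags.

    Assumption: missing percentage = empty cells / all cells * 100.
    """
    total_cells = df.shape[0] * df.shape[1]
    missing_cells = int(df.isnull().sum().sum())
    missing_pct = 100.0 * missing_cells / total_cells if total_cells else 0.0
    return {
        "n_reviews": len(df),
        "missing_pct": missing_pct,
        "meets_count": len(df) >= min_reviews,
        "meets_missing": missing_pct < max_missing_pct,
    }


# Toy frame for illustration only (not the real review data)
toy = pd.DataFrame({"review": ["good", None, "slow", "ok"], "rating": [5, 4, 1, 3]})
result = check_kpi(toy, min_reviews=3)
print(result)
```

Calling `check_kpi(df)` on the cleaned frame at the end of the notebook would report both thresholds in one place.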

In [2]:
# Import required libraries
import pandas as pd

In [3]:
# Load raw data
df = pd.read_csv("../Data/raw_reviews.csv")
print("Raw data loaded successfully.")

Raw data loaded successfully.


## Exploratory Data Analysis (EDA)

Let's analyze the raw dataset to understand its structure, missing values, and basic statistics before preprocessing.

In [4]:
# EDA: Dataset Shape
print("=== Exploratory Data Analysis of Raw Data ===")
print("\n1. Dataset Shape (Rows, Columns):", df.shape)

=== Exploratory Data Analysis of Raw Data ===

1. Dataset Shape (Rows, Columns): (1200, 5)


In [5]:
# EDA: Column Names and Data Types
print("\n2. Column Names and Data Types:")
print(df.dtypes)


2. Column Names and Data Types:
review    object
rating     int64
date      object
bank      object
source    object
dtype: object


In [6]:
# EDA: Missing Value Counts
print("\n3. Missing Value Counts per Column:")
print(df.isnull().sum())


3. Missing Value Counts per Column:
review    0
rating    0
date      0
bank      0
source    0
dtype: int64


In [7]:
# EDA: Basic Summary Statistics
print("\n4. Basic Summary Statistics:")
print(df.describe(include='all'))  # Include all columns (numeric and object)


4. Basic Summary Statistics:
       review       rating        date                         bank  \
count    1200  1200.000000        1200                         1200   
unique   1180          NaN         479                            3   
top      good          NaN  2025-04-21  Commercial Bank of Ethiopia   
freq        7          NaN         105                          400   
mean      NaN     3.118333         NaN                          NaN   
std       NaN     1.765955         NaN                          NaN   
min       NaN     1.000000         NaN                          NaN   
25%       NaN     1.000000         NaN                          NaN   
50%       NaN     3.000000         NaN                          NaN   
75%       NaN     5.000000         NaN                          NaN   
max       NaN     5.000000         NaN                          NaN   

             source  
count          1200  
unique            1  
top     Google Play  
freq           1200  
mean            NaN  
std             NaN  
min             NaN  
25%             NaN  
50%             NaN  
75%             NaN  
max             NaN  

In [8]:
# EDA: Sample of the Data
print("\n5. Sample of the Data (First 5 Rows):")
print(df.head())


5. Sample of the Data (First 5 Rows):
                                              review  rating        date  \
0  The CBE app has been highly unreliable in rece...       2  2025-05-25   
1  this new update(Mar 19,2025) is great in fixin...       4  2025-03-20   
2  Good job to the CBE team on this mobile app! I...       5  2025-04-04   
3  this app has developed in a very good ways but...       5  2025-05-31   
4  everytime you uninstall the app you have to re...       1  2025-06-04   

                          bank       source  
0  Commercial Bank of Ethiopia  Google Play  
1  Commercial Bank of Ethiopia  Google Play  
2  Commercial Bank of Ethiopia  Google Play  
3  Commercial Bank of Ethiopia  Google Play  
4  Commercial Bank of Ethiopia  Google Play  
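
The summary statistics describe the dataset as a whole, while the analysis concerns three separate banks. A per-bank breakdown could be sketched as follows. The toy values are invented for illustration, and the output column names `n_reviews` and `mean_rating` are assumptions, not names used elsewhere in the notebook.

```python
import pandas as pd

# Toy data for illustration only; the real frame uses the same column names
toy = pd.DataFrame({
    "bank": ["Commercial Bank of Ethiopia", "Commercial Bank of Ethiopia",
             "Bank of Abyssinia", "Dashen Bank"],
    "rating": [5, 1, 3, 4],
})

# Named aggregation: one row per bank with review count and mean rating
per_bank = toy.groupby("bank")["rating"].agg(n_reviews="count", mean_rating="mean")
print(per_bank)
```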


In [9]:
# Remove duplicates
df = df.drop_duplicates(subset=["review"], keep="first")
print(f"\nRemoved duplicates. New shape: {df.shape}")


Removed duplicates. New shape: (1180, 5)
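
Deduplicating on the `review` text alone also drops short reviews such as "good" when different users wrote them, possibly for different banks. The sketch below compares that approach with a hypothetical alternative that normalizes the text and also requires the bank to match. The temporary column `review_norm` and the toy rows are assumptions for illustration, not what the notebook does.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "review": ["good", "Good ", "good", "app crashes"],
    "bank": ["CBE", "CBE", "BOA", "BOA"],
    "rating": [5, 5, 4, 1],
})

# Text-only deduplication, as in the cell above
text_only = toy.drop_duplicates(subset=["review"], keep="first")

# Hypothetical alternative: compare trimmed, lower-cased text within each bank
normalized = toy.assign(review_norm=toy["review"].str.strip().str.lower())
composite = (
    normalized
    .drop_duplicates(subset=["review_norm", "bank"], keep="first")
    .drop(columns="review_norm")
)

print("text-only keeps rows:", list(text_only.index))
print("composite keeps rows:", list(composite.index))
```

Which key is appropriate depends on whether identical text from different banks should count as separate reviews.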


In [10]:
# Handle missing data
df = df.dropna(subset=["review", "rating", "date"])  # Drop rows with missing key fields
print(f"Dropped rows with missing key fields. New shape: {df.shape}")

# Note: after the dropna above, "review" and "rating" contain no NaN values,
# so the two fillna calls below change nothing.
df["review"] = df["review"].fillna("No review")  # Fill missing reviews with placeholder
df["rating"] = df["rating"].fillna(df["rating"].mean())  # Fill missing ratings with mean
print("Filled missing reviews with 'No review' and ratings with mean. Missing values remaining:")
print(df.isnull().sum())

Dropped rows with missing key fields. New shape: (1180, 5)
Filled missing reviews with 'No review' and ratings with mean. Missing values remaining:
review    0
rating    0
date      0
bank      0
source    0
dtype: int64


In [11]:
# Verify date format
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
print(f"\nDate format verified. Invalid dates coerced to NaT: {df['date'].isnull().sum()}")


Date format verified. Invalid dates coerced to NaT: 0
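
With `errors="coerce"`, any value that does not match the format silently becomes `NaT`. The count printed above is zero here, but if it were not, it would help to see the offending raw strings. A minimal sketch with made-up values, where the variable names are illustrative:

```python
import pandas as pd

# Toy values for illustration: one valid date, two invalid ones, one missing
raw = pd.Series(["2025-05-25", "2025-13-01", "25/05/2025", None])

parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Rows that had a value but failed to parse (excludes originally missing entries)
bad_mask = parsed.isna() & raw.notna()
print(raw[bad_mask].tolist())
```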


In [12]:
# Save cleaned data
df.to_csv("../Data/cleaned_reviews.csv", index=False)
print(f"\nProcessed {len(df)} reviews with <5% missing data.")


Processed 1180 reviews with <5% missing data.
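
CSV files do not store column types, so the parsed `date` column comes back as plain text unless the reader is told to parse it. The sketch below shows this round trip with an in-memory buffer standing in for `cleaned_reviews.csv`. The toy row is an assumption for illustration.

```python
import io

import pandas as pd

# Toy frame with a datetime column, standing in for the cleaned data
toy = pd.DataFrame({"review": ["good"], "date": pd.to_datetime(["2025-04-21"])})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                        # date is read back as text

buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["date"])  # date is parsed again

print(plain["date"].dtype, typed["date"].dtype)
```

Downstream notebooks that load the cleaned file would therefore pass `parse_dates=["date"]` or call `pd.to_datetime` again.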


## Final Notes

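- Verify that the number of reviews meets the 1,200+ target. The raw file has exactly 1,200 rows, but only 1,180 remain after deduplication, so the cleaned set falls 20 short.
- Confirm that missing data stays below 5%. The missing-value counts above are zero for every column.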
- Commit this notebook and the cleaned CSV to your Git repository.

In [None]:
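# Inspect the bank column. Note: the output below is the repr of the bound
# method `Series.all` (the call parentheses were omitted), not a check result.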

<bound method Series.all of 0       Commercial Bank of Ethiopia
1       Commercial Bank of Ethiopia
2       Commercial Bank of Ethiopia
3       Commercial Bank of Ethiopia
4       Commercial Bank of Ethiopia
                   ...             
1194                    Dashen Bank
1195                    Dashen Bank
1196                    Dashen Bank
1197                    Dashen Bank
1198                    Dashen Bank
Name: bank, Length: 1180, dtype: object>
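
The output above is what pandas displays when `.all` is referenced without parentheses: the method object itself, not a boolean. If the intent was to confirm that every row has a valid bank, which is an assumption since the cell's code was not preserved, a check could look like this sketch. The `expected_banks` set is built from the three bank names in the introduction.

```python
import pandas as pd

# Toy series for illustration; in the notebook this would be df["bank"]
bank = pd.Series(
    ["Commercial Bank of Ethiopia", "Bank of Abyssinia", "Dashen Bank"],
    name="bank",
)

not_called = bank.notna().all    # bound method object, nothing is evaluated
all_present = bank.notna().all()  # actual check: True if no value is missing

# Assumed list of valid names, taken from the notebook introduction
expected_banks = {"Commercial Bank of Ethiopia", "Bank of Abyssinia", "Dashen Bank"}
all_known = bank.isin(expected_banks).all()

print(callable(not_called), bool(all_present), bool(all_known))
```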