# PulseFit ETL Process: From Raw to Clean Data
**Objective**: Extract data from raw CSV files, perform cleaning and transformations, and load the results into a processed format suitable for BI analysis.

## Phase 1: Extraction & Initial Inspection
In this phase, we load the six raw datasets and scan their structure to gauge data quality before any cleaning.

In [88]:
import pandas as pd
import numpy as np
import os
import re

# Set display options
pd.set_option('display.max_columns', None)

# Paths
RAW_DATA_PATH = "data/raw/"
PROCESSED_DATA_PATH = "data/processed/"

### 1.1 Loading the datasets

In [89]:
members_raw = pd.read_csv(os.path.join(RAW_DATA_PATH, "members_raw.csv"))
visits_raw = pd.read_csv(os.path.join(RAW_DATA_PATH, "visits_raw.csv"))
payments_raw = pd.read_csv(os.path.join(RAW_DATA_PATH, "payments_raw.csv"))
trainers_raw = pd.read_csv(os.path.join(RAW_DATA_PATH, "trainers_raw.csv"))
classes_raw = pd.read_csv(os.path.join(RAW_DATA_PATH, "classes_raw.csv"))
bookings_raw = pd.read_csv(os.path.join(RAW_DATA_PATH, "bookings_raw.csv"))

print(f"- Members: {members_raw.shape[0]} rows")
print(f"- Visits: {visits_raw.shape[0]} rows")
print(f"- Payments: {payments_raw.shape[0]} rows")
print(f"- Trainers: {trainers_raw.shape[0]} rows")
print(f"- Classes: {classes_raw.shape[0]} rows")
print(f"- Bookings: {bookings_raw.shape[0]} rows")

- Members: 500 rows
- Visits: 1998 rows
- Payments: 1500 rows
- Trainers: 30 rows
- Classes: 60 rows
- Bookings: 1200 rows
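
The six `read_csv` calls follow one naming pattern, so they can also be written as a loop. The following is a minimal sketch rather than the notebook's own code: `load_raw_tables`, `row_counts`, and `TABLES` are hypothetical names, and it assumes every file is called `<table>_raw.csv` inside the raw-data directory.

```python
import os
import pandas as pd

# Table names assumed to match the <table>_raw.csv files used in this notebook
TABLES = ["members", "visits", "payments", "trainers", "classes", "bookings"]

def load_raw_tables(raw_dir, tables=TABLES):
    """Read <table>_raw.csv for each table name and return a dict of DataFrames."""
    frames = {}
    for name in tables:
        path = os.path.join(raw_dir, f"{name}_raw.csv")
        frames[name] = pd.read_csv(path)
    return frames

def row_counts(frames):
    """Return {table name: number of rows} for a quick size check."""
    return {name: df.shape[0] for name, df in frames.items()}
```

With the paths defined earlier, `row_counts(load_raw_tables(RAW_DATA_PATH))` would give the same per-table counts as the print statements above.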


### 1.2 Initial Data Quality Audit
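We inspect each table's structure with `DataFrame.info()` to spot missing values and incorrect data types. In the members table, for example, `join_date` is stored as `object` rather than as a datetime, and `email` and `gender` contain nulls.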

In [90]:
print("--- Members Info ---")
members_raw.info()  # info() prints its report and returns None, so it needs no print() wrapper
print("\n--- Visits Info ---")
visits_raw.info()
print("\n--- Payments Info ---")
payments_raw.info()
print("\n--- Trainers Info ---")
trainers_raw.info()
print("\n--- Classes Info ---")
classes_raw.info()
print("\n--- Bookings Info ---")
bookings_raw.info()

--- Members Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   member_id        500 non-null    object
 1   full_name        500 non-null    object
 2   email            478 non-null    object
 3   phone_number     500 non-null    object
 4   gender           496 non-null    object
 5   join_date        500 non-null    object
 6   membership_type  500 non-null    object
 7   status           500 non-null    object
 8   home_branch      500 non-null    object
 9   birth_year       500 non-null    int64 
dtypes: int64(1), object(9)
memory usage: 39.2+ KB

--- Visits Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1998 entries, 0 to 1997
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   visit_id        1998 non-null   object
 1   member_id     

### 1.3 Identifying Inconsistencies
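We scan the categorical columns (branch names, gender, membership types) in the members, trainers, and classes tables for data-entry inconsistencies such as mixed casing, underscores instead of spaces, and truncated names.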

In [91]:
print("Unique Branches (Members):", members_raw['home_branch'].unique())
print("Unique Branches (Trainers):", trainers_raw['branch'].unique())
print("Unique Branches (Classes):", classes_raw['branch'].unique())
print("Unique Genders (Members):", members_raw['gender'].unique())
print("Membership Types (Members):", members_raw['membership_type'].unique())

Unique Branches (Members): ['Nabeul Plage' 'Sousse Corniche' 'Bizerte Port' 'Sousse' 'Bizert'
 'Sfax_Center' 'Tunis Lake' 'Sfax Center' 'tunis lake' 'Nabeul']
Unique Branches (Trainers): ['Sfax_Center' 'Sfax Center' 'tunis lake' 'Nabeul' 'Bizerte Port' 'Sousse'
 'Tunis Lake' 'Bizert']
Unique Branches (Classes): ['Sousse' 'Bizert' 'Sfax_Center' 'Sfax Center' 'tunis lake' 'Bizerte Port'
 'Tunis Lake']
Unique Genders (Members): ['M' 'F' 'Femal' 'Male' nan]
Membership Types (Members): ['Student' 'VIP' '6-Month' 'Annual' 'Monthly' 'ANNUAL' '6-MONTH' 'MONTHLY'
 'STUDENT']
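
One way to surface such variants systematically is to reduce each value to a comparison key (trimmed, lower-cased, underscores replaced by spaces) and list the raw spellings that share a key. This is a sketch under assumptions: `variant_groups` is a hypothetical helper, and the sample values are copied from the output above. Key-based grouping only catches casing and separator differences; truncated names such as 'Bizert' or 'Sousse' still need an explicit mapping.

```python
from collections import defaultdict

import pandas as pd

def variant_groups(series):
    """Group distinct raw spellings by a normalized key.

    Returns only the keys that have more than one raw spelling.
    """
    groups = defaultdict(set)
    for raw in series.dropna().astype(str):
        key = raw.strip().lower().replace("_", " ")
        groups[key].add(raw)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}

# Sample values taken from the members output above
branches = pd.Series(["Tunis Lake", "tunis lake", "Sfax_Center", "Sfax Center", "Bizert", "Bizerte Port"])
print(variant_groups(branches))
```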


### 1.4 Detailed Quality Audit

In [92]:
print("     Missing Values Check (Members):")
print(members_raw.isnull().sum())

print("\n   Duplicate Members Check:")
dups = members_raw[members_raw.duplicated(subset=['full_name', 'email'], keep=False)]
print(f"Found {len(dups)} potential duplicate records.")

print("\n   Value Range Checks:")
print("Payments Amount Min/Max:", payments_raw['amount'].min(), "/", payments_raw['amount'].max())
print("Visits Check-out Missing:", visits_raw['check_out_time'].isnull().sum())

     Missing Values Check (Members):
member_id           0
full_name           0
email              22
phone_number        0
gender              4
join_date           0
membership_type     0
status              0
home_branch         0
birth_year          0
dtype: int64

   Duplicate Members Check:
Found 257 potential duplicate records.

   Value Range Checks:
Payments Amount Min/Max: -4500 / 8000
Visits Check-out Missing: 176
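
Note that `duplicated(..., keep=False)` flags every row in a duplicate group, including the one that a later `drop_duplicates(keep='first')` would keep. The 257 flagged rows are therefore an upper bound on what gets removed, not the removal count itself. The toy frame below (made-up names, not PulseFit data) illustrates the difference, and also shows that pandas treats two missing emails as equal when comparing rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full_name": ["Amal B.", "Amal B.", "Amal B.", "Karim S.", "Karim S.", "Leila T."],
    "email": ["amal@example.com", "amal@example.com", "amal@example.com",
              np.nan, np.nan, "leila@example.com"],
})
keys = ["full_name", "email"]

# Every row that belongs to a duplicate group (3 + 2 rows here)
flagged = int(toy.duplicated(subset=keys, keep=False).sum())
# Only the rows after the first occurrence in each group (2 + 1 rows here)
removable = int(toy.duplicated(subset=keys, keep="first").sum())

print(f"Flagged: {flagged}, removable: {removable}")
```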


## Phase 2: Data Cleaning & Transformation

### 2.1 Cleaning Members Data
- Normalize Branch Names
- Fix Gender Categories
- Normalize Phone Numbers
- Remove Duplicates
- Handle missing emails

In [93]:
members_clean = members_raw.copy()

# 1. Standardize Branch names
branch_map = {
    'tunis lake': 'Tunis Lake', 
    'Sfax_Center': 'Sfax Center', 
    'Sousse': 'Sousse Corniche', 
    'Nabeul': 'Nabeul Plage',
    'Bizert': 'Bizerte Port'
}
members_clean['home_branch'] = members_clean['home_branch'].replace(branch_map)

# 2. Standardize Gender
members_clean['gender'] = members_clean['gender'].replace({'Male': 'M', 'Femal': 'F'})
members_clean['gender'] = members_clean['gender'].fillna('U')

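# 3. Normalize Phone Numbers
def normalize_phone(phone):
    """Return the last 8 digits of a phone number, or 'INVALID' if fewer than 8 digits remain."""
    digits = re.sub(r'\D', '', str(phone))  # strip spaces, '+', dashes and other non-digits
    clean = digits[-8:]  # keep the trailing 8 digits, dropping any leading prefix
    return clean if len(clean) == 8 else 'INVALID'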

members_clean['phone_number'] = members_clean['phone_number'].apply(normalize_phone)

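# 4. Remove duplicates on (full_name, email), keeping the first occurrence.
#    Missing emails compare as equal here, so same-name rows without an email are merged.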
members_clean = members_clean.drop_duplicates(subset=['full_name', 'email'], keep='first')

print(f"Members dataset after cleaning: {members_clean.shape[0]} rows")

Members dataset after cleaning: 352 rows
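
The checklist for this section mentions missing emails, but the cell above leaves the `email` column as loaded (the raw table has 22 nulls). One possible policy is to keep the nulls and expose them through a boolean flag, so downstream reports can filter on it. The sketch below is an assumption, not the notebook's method: `flag_missing_emails` and the `has_email` column are hypothetical names, and the toy rows are made up.

```python
import numpy as np
import pandas as pd

def flag_missing_emails(members):
    """Return a copy with trimmed emails and a has_email boolean column."""
    out = members.copy()
    # Trim whitespace and treat empty strings as missing
    out["email"] = out["email"].str.strip().replace("", np.nan)
    out["has_email"] = out["email"].notna()
    return out

toy = pd.DataFrame({
    "full_name": ["Amal B.", "Karim S.", "Leila T."],
    "email": [" amal@example.com ", np.nan, "  "],
})
result = flag_missing_emails(toy)
print(result["has_email"].tolist())
```

Keeping the null instead of writing a placeholder string avoids creating fake addresses that could later be mistaken for real contact data.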


### 2.2 Cleaning Visits Data
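- Convert timestamps to datetime
- Normalize branch names
- Impute missing check-out times as check-in plus the median visit duration for the branch (global median as fallback)
- Exclude non-positive durations (check-out at or before check-in) when computing those medians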
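
The imputation rule for missing check-outs (check-in plus the branch's median visit duration, falling back to the global median when a branch has no usable durations) can be checked on a toy frame. The branch labels and timestamps below are made up for illustration only.

```python
import pandas as pd

visits = pd.DataFrame({
    "branch": ["A", "A", "C", "A", "B"],
    "check_in_time": pd.to_datetime([
        "2024-01-05 08:00", "2024-01-05 09:00", "2024-01-05 09:00",
        "2024-01-05 10:00", "2024-01-05 11:00",
    ]),
    "check_out_time": pd.to_datetime([
        "2024-01-05 09:00", "2024-01-05 10:30", "2024-01-05 09:30", None, None,
    ]),
})

# Durations from complete rows only: A -> 60 and 90 min, C -> 30 min
known = visits.dropna(subset=["check_out_time"]).copy()
known["duration"] = known["check_out_time"] - known["check_in_time"]
per_branch = known.groupby("branch")["duration"].median()  # A: 75 min, C: 30 min
overall = known["duration"].median()                       # 60 min

# Branch B has no known durations, so it falls back to the global median
mask = visits["check_out_time"].isnull()
fill = visits.loc[mask, "branch"].map(per_branch).fillna(overall)
visits.loc[mask, "check_out_time"] = visits.loc[mask, "check_in_time"] + fill

print(visits[["branch", "check_in_time", "check_out_time"]])
```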

In [94]:
visits_clean = visits_raw.copy()
visits_clean['check_in_time'] = pd.to_datetime(visits_clean['check_in_time'])
visits_clean['check_out_time'] = pd.to_datetime(visits_clean['check_out_time'])
visits_clean['branch'] = visits_clean['branch'].replace(branch_map)

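# Compute visit durations from rows that have a check-out time,
# keeping only positive durations (check-out after check-in)
temp_df = visits_clean.dropna(subset=['check_out_time']).copy()
temp_df['duration'] = temp_df['check_out_time'] - temp_df['check_in_time']
temp_df = temp_df[temp_df['duration'] > pd.Timedelta(0)]

# Median duration per branch, with the global median as a fallback
medians_map = temp_df.groupby('branch')['duration'].median()
global_median = temp_df['duration'].median()

# Impute missing check-outs as check-in + median duration for the branch
mask = visits_clean['check_out_time'].isnull()
imputed_duration = visits_clean.loc[mask, 'branch'].map(medians_map).fillna(global_median)
visits_clean.loc[mask, 'check_out_time'] = visits_clean.loc[mask, 'check_in_time'] + imputed_duration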

print(f"Visits dataset after cleaning: {visits_clean.shape[0]} rows")

Visits dataset after cleaning: 1998 rows
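
The row count is unchanged because visits with a check-out at or before the check-in are excluded only from the median calculation, not from `visits_clean` itself. If such rows should also be flagged or dropped, a sketch could look like the following. The helper `flag_invalid_visits` and the `is_valid_duration` column are hypothetical, and whether to drop or merely flag these rows is a policy decision the notebook does not state.

```python
import pandas as pd

def flag_invalid_visits(visits):
    """Return a copy with a boolean column that is True when check-out is after check-in."""
    out = visits.copy()
    duration = out["check_out_time"] - out["check_in_time"]
    # Rows with a missing timestamp compare as False and are flagged invalid too
    out["is_valid_duration"] = duration > pd.Timedelta(0)
    return out

toy = pd.DataFrame({
    "check_in_time": pd.to_datetime(["2024-01-05 08:00", "2024-01-05 09:00"]),
    "check_out_time": pd.to_datetime(["2024-01-05 09:15", "2024-01-05 08:30"]),
})
flagged = flag_invalid_visits(toy)
valid_only = flagged[flagged["is_valid_duration"]]
print(len(flagged), len(valid_only))
```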


### 2.3 Cleaning Payments Data
- Remove non-positive amounts
- Correct suspected 10x entry errors by dividing amounts above 1500 by 10
- Normalize membership type casing with `str.capitalize()`

In [95]:
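payments_clean = payments_raw.copy()
# Keep strictly positive amounts; .copy() avoids chained-assignment warnings on the filtered frame
payments_clean = payments_clean[payments_clean['amount'] > 0].copy()
# Amounts above 1500 are treated as 10x entry errors and scaled down
is_outlier = payments_clean['amount'] > 1500
payments_clean['amount'] = payments_clean['amount'].where(~is_outlier, payments_clean['amount'] / 10)
# Note: str.capitalize() lower-cases everything after the first character
payments_clean['membership_type'] = payments_clean['membership_type'].str.capitalize()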

print(f"Payments dataset after cleaning: {payments_clean.shape[0]} rows")

Payments dataset after cleaning: 1481 rows
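
One caveat with `str.capitalize()`: it upper-cases the first character and lower-cases the rest, so labels such as 'VIP' and '6-Month' (both seen in the members table) would become 'Vip' and '6-month'. If the payments table uses the same labels, which is an assumption here, an explicit lookup keeps the canonical spellings. `CANONICAL_TYPES` and `normalize_membership_type` are hypothetical names.

```python
import pandas as pd

# Hypothetical lookup from lower-cased labels to one canonical spelling
CANONICAL_TYPES = {
    "student": "Student",
    "vip": "VIP",
    "6-month": "6-Month",
    "annual": "Annual",
    "monthly": "Monthly",
}

def normalize_membership_type(series):
    """Map labels case-insensitively; labels not in the lookup are left unchanged."""
    lowered = series.str.strip().str.lower()
    return lowered.map(CANONICAL_TYPES).fillna(series)

raw = pd.Series(["VIP", "6-MONTH", "ANNUAL", "Student"])
print(raw.str.capitalize().tolist())            # ['Vip', '6-month', 'Annual', 'Student']
print(normalize_membership_type(raw).tolist())  # ['VIP', '6-Month', 'Annual', 'Student']
```

The same function could be applied to `members_clean['membership_type']`, where the upper-case variants found in section 1.3 are otherwise left as loaded.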


### 2.4 Cleaning Trainers, Classes & Bookings
- Normalize Branch Names
- Fill missing specializations
- Standardize dates

In [96]:
trainers_clean = trainers_raw.copy()
trainers_clean['branch'] = trainers_clean['branch'].replace(branch_map)
trainers_clean['specialization'] = trainers_clean['specialization'].fillna("General")

classes_clean = classes_raw.copy()
classes_clean['branch'] = classes_clean['branch'].replace(branch_map)

bookings_clean = bookings_raw.copy()
bookings_clean['booking_date'] = pd.to_datetime(bookings_clean['booking_date'])

print(f"Trainers clean: {trainers_clean.shape[0]} rows")
print(f"Classes clean: {classes_clean.shape[0]} rows")
print(f"Bookings clean: {bookings_clean.shape[0]} rows")

Trainers clean: 30 rows
Classes clean: 60 rows
Bookings clean: 1200 rows
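
By default, `pd.to_datetime` raises an error when it meets a value it cannot parse. If the raw files may contain malformed dates, passing `errors='coerce'` turns those values into `NaT` so they can be counted and inspected instead. A minimal sketch on made-up values follows; the helper `parse_dates_report` is hypothetical.

```python
import pandas as pd

def parse_dates_report(series):
    """Parse a column to datetime; return the parsed column and the number of unparseable values."""
    parsed = pd.to_datetime(series, errors="coerce")
    # Count values that were present in the input but failed to parse
    n_failed = int((parsed.isna() & series.notna()).sum())
    return parsed, n_failed

raw_dates = pd.Series(["2024-03-01", "2024-03-02", "not a date", None])
parsed, n_failed = parse_dates_report(raw_dates)
print(f"Unparseable values: {n_failed}")
```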


## Phase 3: Loading (Saving results)
Now we save all six cleaned PulseFit datasets to `data/processed/`.

In [97]:
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)  # create the output folder if it does not exist
members_clean.to_csv(os.path.join(PROCESSED_DATA_PATH, "members_clean.csv"), index=False)
visits_clean.to_csv(os.path.join(PROCESSED_DATA_PATH, "visits_clean.csv"), index=False)
payments_clean.to_csv(os.path.join(PROCESSED_DATA_PATH, "payments_clean.csv"), index=False)
trainers_clean.to_csv(os.path.join(PROCESSED_DATA_PATH, "trainers_clean.csv"), index=False)
classes_clean.to_csv(os.path.join(PROCESSED_DATA_PATH, "classes_clean.csv"), index=False)
bookings_clean.to_csv(os.path.join(PROCESSED_DATA_PATH, "bookings_clean.csv"), index=False)

print("ETL Complete! All 6 clean PulseFit datasets saved.")

ETL Complete! All 6 clean PulseFit datasets saved.
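
As a final sanity check, the saved files can be read back and compared against the in-memory frames. The sketch below is one way to do this, not part of the original notebook: `save_tables` and `verify_row_counts` are hypothetical helpers, and the `<name>_clean.csv` naming mirrors the files written above.

```python
import os
import pandas as pd

def save_tables(frames, out_dir):
    """Write each DataFrame to <out_dir>/<name>_clean.csv and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    paths = {}
    for name, df in frames.items():
        path = os.path.join(out_dir, f"{name}_clean.csv")
        df.to_csv(path, index=False)
        paths[name] = path
    return paths

def verify_row_counts(frames, paths):
    """Re-read each saved file and report whether its row count matches the frame in memory."""
    return {name: len(pd.read_csv(paths[name])) == len(df) for name, df in frames.items()}
```

For this notebook the call would be along the lines of `paths = save_tables({"members": members_clean, "visits": visits_clean}, PROCESSED_DATA_PATH)`, extended with the remaining four tables, followed by `verify_row_counts(...)` on the same dictionary.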
