# ETL Pipeline - Freelance Job Earnings Data Warehouse

This notebook extracts data from raw CSV and JSON sources, **fixes data quality issues**, transforms the result into a star schema, and loads it into dimension and fact tables.

## Data Cleaning Strategies:
- **Duplicate Removal** - Identify and remove duplicate records
- **Missing Value Handling** - Fill missing numeric values with the column median
- **Text Standardization** - Fix inconsistent casing and whitespace
- **Data Validation** - Check foreign-key mappings and referential integrity

## Output Tables:
- **fact_job_earnings.csv** - Fact table with job earnings metrics
- **dim_worker.csv** - Worker dimension
- **dim_platform.csv** - Platform dimension
- **dim_region.csv** - Region dimension
- **dim_project.csv** - Project dimension
- **dim_date.csv** - Date dimension

In [2]:
import pandas as pd
import json
import numpy as np
from pathlib import Path

# Suppress pandas warnings for cleaner output
pd.options.mode.chained_assignment = None

## 1. Configuration

In [3]:
# Define paths
BASE_DIR = Path.cwd().parent
RAW_DATA_DIR = BASE_DIR / 'data_raw'
CLEANED_DATA_DIR = BASE_DIR / 'data_cleaned'

# Create output directory if it doesn't exist
CLEANED_DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Raw data directory: {RAW_DATA_DIR}")
print(f"Cleaned data directory: {CLEANED_DATA_DIR}")

Raw data directory: c:\Users\louey\Desktop\Business intelligence\Flexi Income new\data_raw
Cleaned data directory: c:\Users\louey\Desktop\Business intelligence\Flexi Income new\data_cleaned


---
## 2. EXTRACT - Load Raw Data

In [4]:
# Extract jobs transactions (main facts data)
jobs_df = pd.read_csv(RAW_DATA_DIR / 'jobs_transactions.csv')
print(f"Jobs transactions: {len(jobs_df)} records")
jobs_df.head()

Jobs transactions: 15500 records


Unnamed: 0,job_id,worker_id,platform,client_region,project_type,job_category,payment_method,work_date,earnings_usd,job_completed,job_duration_days,hourly_rate,job_success_rate,client_rating,rehire_rate,marketing_spend
0,4704,1319,Upwork,Asia,Hourly,Graphic Design,PayPal,2025-04-11,4610.5,191,9,45.81,61.58,4.32,82.59,418
1,7750,876,Freelancer,Europe,Fixed,Seo,Crypto,2025-08-08,5593.28,63,27,126.52,72.36,4.4,67.29,237
2,11504,840,Fiverr,Europe,Fixed,Content Writing,Crypto,2025-09-13,8006.26,63,59,32.62,64.27,3.88,23.61,263
3,7235,1456,Upwork,Middle East,Hourly,Seo,Crypto,2025-10-18,7496.48,181,51,136.6,62.29,4.93,63.58,406
4,10270,365,Freelancer,Europe,Hourly,Seo,PayPal,2025-10-03,4077.29,68,34,113.02,64.57,3.99,71.28,72


In [5]:
# Extract workers dimension (JSON)
with open(RAW_DATA_DIR / 'workers.json', 'r') as f:
    workers_raw = json.load(f)
workers_df = pd.DataFrame(workers_raw)
print(f"Workers: {len(workers_df)} records")
workers_df.head()

Workers: 1950 records


Unnamed: 0,worker_id,experience_level,primary_skill
0,1,Beginner,Web Development
1,2,Beginner,App Development
2,3,Beginner,Web Development
3,4,Intermediate,Data Entry
4,5,Expert,Digital Marketing


In [6]:
# Extract platforms dimension (CSV)
platforms_df = pd.read_csv(RAW_DATA_DIR / 'platforms.csv')
print(f"Platforms: {len(platforms_df)} records")
platforms_df

Platforms: 5 records


Unnamed: 0,platform_id,platform_name,category,payment_cycle
0,1,Fiverr,Freelancing,Crypto
1,2,Peopleperhour,Freelancing,Bank Transfer
2,3,Upwork,Freelancing,PayPal
3,4,Toptal,Freelancing,Crypto
4,5,Freelancer,Freelancing,Crypto


In [7]:
# Extract regions dimension (JSON)
with open(RAW_DATA_DIR / 'regions.json', 'r') as f:
    regions_raw = json.load(f)
regions_df = pd.DataFrame(regions_raw)
print(f"Regions: {len(regions_df)} records")
regions_df

Regions: 7 records


Unnamed: 0,region_id,region,cost_of_living_index
0,1,Asia,112.0
1,2,Australia,120.0
2,3,Uk,102.0
3,4,Europe,109.0
4,5,Usa,97.0
5,6,Middle East,128.0
6,7,Canada,124.0


In [8]:
# Extract projects dimension (CSV)
projects_df = pd.read_csv(RAW_DATA_DIR / 'projects.csv')
print(f"Projects: {len(projects_df)} records")
projects_df.head(10)

Projects: 16 records


Unnamed: 0,project_id,project_type,job_category
0,1,Fixed,Web Development
1,2,Fixed,App Development
2,3,Hourly,Web Development
3,4,Hourly,Data Entry
4,5,Hourly,Digital Marketing
5,6,Fixed,Customer Support
6,7,Fixed,Data Entry
7,8,Fixed,Content Writing
8,9,Hourly,App Development
9,10,Hourly,Customer Support


In [9]:
# Extract date dimension (CSV)
dim_date_df = pd.read_csv(RAW_DATA_DIR / 'dates.csv')
print(f"Dates: {len(dim_date_df)} records")
dim_date_df.head()

Dates: 365 records


Unnamed: 0,date_id,full_date,year,quarter,month_number,month_name,day_of_week,is_weekend,is_holiday
0,20250101,2025-01-01,2025,1,1,January,Wednesday,0,1
1,20250102,2025-01-02,2025,1,1,January,Thursday,0,0
2,20250103,2025-01-03,2025,1,1,January,Friday,0,0
3,20250104,2025-01-04,2025,1,1,January,Saturday,1,0
4,20250105,2025-01-05,2025,1,1,January,Sunday,1,0


---
## 3. DATA PROFILING - Identify Data Quality Issues

In [10]:
print("=" * 60)
print("DATA QUALITY ASSESSMENT - Jobs Transactions")
print("=" * 60)

# Check for duplicates
duplicate_count = jobs_df.duplicated(subset=['job_id']).sum()
print(f"\n1. DUPLICATES:")
print(f"   - Duplicate job_ids found: {duplicate_count}")
print(f"   - Total rows: {len(jobs_df)}")
print(f"   - Unique job_ids: {jobs_df['job_id'].nunique()}")

# Check for missing values
print(f"\n2. MISSING VALUES:")
missing = jobs_df.isnull().sum()
missing_pct = (missing / len(jobs_df) * 100).round(2)
for col in jobs_df.columns:
    if missing[col] > 0:
        print(f"   - {col}: {missing[col]} ({missing_pct[col]}%)")

# Check for inconsistent casing
print(f"\n3. INCONSISTENT CASING:")
print(f"   - Platform values: {jobs_df['platform'].unique()}")
print(f"   - Region values: {jobs_df['client_region'].unique()}")

DATA QUALITY ASSESSMENT - Jobs Transactions

1. DUPLICATES:
   - Duplicate job_ids found: 3000
   - Total rows: 15500
   - Unique job_ids: 12500

2. MISSING VALUES:
   - earnings_usd: 1278 (8.25%)
   - hourly_rate: 1245 (8.03%)
   - job_success_rate: 1237 (7.98%)
   - client_rating: 1213 (7.83%)
   - rehire_rate: 1237 (7.98%)

3. INCONSISTENT CASING:
   - Platform values: ['Upwork' 'Freelancer' 'Fiverr' 'Toptal' 'Peopleperhour' 'UPWORK' 'uPWORK'
 'toptal' 'FIVERR' 'tOPTAL' 'FREELANCER' 'fREELANCER' 'PEOPLEPERHOUR'
 'pEOPLEPERHOUR' 'upwork' 'freelancer' 'fiverr' 'fIVERR' 'peopleperhour'
 'TOPTAL']
   - Region values: ['Asia' 'Europe' 'Middle East' 'CANADA' 'middle east' 'AUSTRALIA' 'Canada'
 'uk' 'Australia' 'Usa' 'Uk' 'USA' 'australia' 'europe' 'ASIA' 'usa'
 'EUROPE' 'UK' 'MIDDLE EAST' 'canada' 'asia']
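
The listing above shows many spellings that differ only in casing. As a rough sketch of how the extent of the problem could be quantified, the snippet below compares the number of distinct raw values with the number left after trimming and case-folding. The sample series is hypothetical and only stands in for a column such as `jobs_df['platform']`.

```python
import pandas as pd

# Hypothetical sample standing in for a column such as jobs_df['platform']
platform = pd.Series(["Upwork", "UPWORK", "uPWORK", " upwork ", "Fiverr", "fIVERR"])

# Normalize for profiling only: trim whitespace and fold case
normalized = platform.str.strip().str.casefold()

# For each normalized value, collect the raw spellings that collapse onto it
variants = platform.groupby(normalized).unique()

raw_distinct = platform.nunique()
normalized_distinct = normalized.nunique()

print(f"Distinct raw values: {raw_distinct}")
print(f"Distinct values after normalization: {normalized_distinct}")
print(variants)
```

A large gap between the two counts suggests that most of the apparent variety comes from formatting rather than from genuinely different categories.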


In [11]:
# Check for whitespace issues
print("\n4. WHITESPACE ISSUES:")
jobs_with_whitespace = jobs_df[jobs_df['job_category'].str.contains(r'^\s|\s$', regex=True, na=False)]
print(f"   - Rows with leading/trailing whitespace in job_category: {len(jobs_with_whitespace)}")

# Show sample of issues
if len(jobs_with_whitespace) > 0:
    print(f"   - Sample values: {jobs_with_whitespace['job_category'].head(3).tolist()}")


4. WHITESPACE ISSUES:
   - Rows with leading/trailing whitespace in job_category: 790
   - Sample values: ['  Graphic Design  ', '  Digital Marketing  ', '  Data Entry  ']


---
## 4. DATA CLEANING - Fix Quality Issues

### 4.1 Remove Duplicates

In [12]:
print("CLEANING: Removing duplicates...")
print(f"   Before: {len(jobs_df)} rows")

# Remove duplicates based on job_id (keep first occurrence) - use .copy() to avoid warnings
jobs_clean = jobs_df.drop_duplicates(subset=['job_id'], keep='first').copy()

print(f"   After: {len(jobs_clean)} rows")
print(f"   Removed: {len(jobs_df) - len(jobs_clean)} duplicate rows")

CLEANING: Removing duplicates...
   Before: 15500 rows
   After: 12500 rows
   Removed: 3000 duplicate rows
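
Keeping the first occurrence is safe when the duplicated rows are identical, but it silently picks one version when the copies disagree. The sketch below, on a hypothetical sample frame, shows one way to check for conflicting copies before dropping them. Whether the real data contains such conflicts is not known from this notebook.

```python
import pandas as pd

# Hypothetical sample: job 1 is an exact duplicate, job 2 has conflicting earnings
sample = pd.DataFrame({
    "job_id": [1, 1, 2, 2, 3],
    "earnings_usd": [100.0, 100.0, 250.0, 300.0, 80.0],
})

# All copies of every job_id that appears more than once
dupes = sample[sample.duplicated(subset=["job_id"], keep=False)]

# After removing fully identical rows, a job_id with more than one row left
# has copies that differ in at least one column
distinct_versions = dupes.drop_duplicates().groupby("job_id").size()
conflicting_ids = distinct_versions[distinct_versions > 1].index.tolist()

print(f"Duplicated job_ids: {dupes['job_id'].nunique()}")
print(f"job_ids with conflicting copies: {conflicting_ids}")
```

If the list of conflicting IDs is non-empty, a rule such as "keep the most complete row" may be preferable to `keep='first'`.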


### 4.2 Handle Missing Values

In [13]:
print("CLEANING: Handling missing values...")

# Strategy: Fill numeric columns with median values
numeric_cols_to_fill = ['earnings_usd', 'hourly_rate', 'client_rating', 'job_success_rate', 'rehire_rate']

for col in numeric_cols_to_fill:
    missing_before = jobs_clean[col].isna().sum()
    if missing_before > 0:
        median_val = jobs_clean[col].median()
        jobs_clean.loc[:, col] = jobs_clean[col].fillna(median_val)
        print(f"   - {col}: Filled {missing_before} missing values with median ({median_val:.2f})")

# Verify no missing values remain in critical columns
remaining_missing = jobs_clean[numeric_cols_to_fill].isna().sum().sum()
print(f"\n   Remaining missing values in numeric columns: {remaining_missing}")

CLEANING: Handling missing values...
   - earnings_usd: Filled 1041 missing values with median (4759.09)
   - hourly_rate: Filled 1015 missing values with median (82.09)
   - client_rating: Filled 982 missing values with median (4.00)
   - job_success_rate: Filled 988 missing values with median (79.74)
   - rehire_rate: Filled 1011 missing values with median (50.31)

   Remaining missing values in numeric columns: 0
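
Median imputation removes the gaps but also erases the information that a value was originally missing. One possible refinement, sketched here on a hypothetical sample, is to record an indicator column before filling. The column name `earnings_usd_was_missing` is an assumption and not part of the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing earnings values
sample = pd.DataFrame({"earnings_usd": [100.0, np.nan, 300.0, np.nan, 500.0]})

# Record which rows were missing before imputing (hypothetical column name)
sample["earnings_usd_was_missing"] = sample["earnings_usd"].isna().astype(int)

# Series.median() skips NaN by default, so it is computed on observed values only
median_val = sample["earnings_usd"].median()
sample["earnings_usd"] = sample["earnings_usd"].fillna(median_val)

print(sample)
```

Downstream logic that depends on missingness can then read the indicator instead of the imputed value.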


### 4.3 Standardize Text Fields

In [14]:
print("CLEANING: Standardizing text fields...")

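# Map known casing variants of platform names to a canonical spelling.
# Note: replace() is an exact-match lookup, so variants the mapping does not
# list (e.g. 'uPWORK', 'tOPTAL') pass through unchanged, as the output below shows.
# The platform join in section 6 lowercases both sides, so those rows still map.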
platform_mapping = {
    'fiverr': 'Fiverr', 'FIVERR': 'Fiverr', 'FiVerR': 'Fiverr',
    'upwork': 'Upwork', 'UPWORK': 'Upwork', 'UpWork': 'Upwork',
    'toptal': 'Toptal', 'TOPTAL': 'Toptal', 'TopTal': 'Toptal',
    'freelancer': 'Freelancer', 'FREELANCER': 'Freelancer', 'FreeLancer': 'Freelancer',
    'peopleperhour': 'PeoplePerHour', 'PEOPLEPERHOUR': 'PeoplePerHour', 'Peopleperhour': 'PeoplePerHour'
}

jobs_clean.loc[:, 'platform'] = jobs_clean['platform'].replace(platform_mapping)
print(f"   - Platform: Standardized to {jobs_clean['platform'].unique()}")

# Standardize region names
region_mapping = {
    'asia': 'Asia', 'ASIA': 'Asia',
    'australia': 'Australia', 'AUSTRALIA': 'Australia',
    'uk': 'UK', 'Uk': 'UK',
    'usa': 'USA', 'Usa': 'USA',
    'europe': 'Europe', 'EUROPE': 'Europe',
    'middle east': 'Middle East', 'MIDDLE EAST': 'Middle East', 'Middle east': 'Middle East',
    'canada': 'Canada', 'CANADA': 'Canada'
}

jobs_clean.loc[:, 'client_region'] = jobs_clean['client_region'].replace(region_mapping)
print(f"   - Region: Standardized to {jobs_clean['client_region'].unique()}")

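# Remove leading/trailing whitespace from all string columns.
# This runs after the mappings above, so a padded value (e.g. ' upwork ')
# would not have matched any mapping key.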
string_cols = ['platform', 'client_region', 'project_type', 'job_category', 'payment_method']
for col in string_cols:
    jobs_clean.loc[:, col] = jobs_clean[col].str.strip()
print(f"   - Whitespace: Trimmed from {string_cols}")

CLEANING: Standardizing text fields...
   - Platform: Standardized to ['Upwork' 'Freelancer' 'Fiverr' 'Toptal' 'PeoplePerHour' 'uPWORK' 'tOPTAL'
 'fREELANCER' 'pEOPLEPERHOUR' 'fIVERR']
   - Region: Standardized to ['Asia' 'Europe' 'Middle East' 'Canada' 'Australia' 'UK' 'USA']
   - Whitespace: Trimmed from ['platform', 'client_region', 'project_type', 'job_category', 'payment_method']
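
The platform output above still lists variants such as `'uPWORK'` and `'tOPTAL'`, because an exact-match mapping only covers the spellings it enumerates. A minimal sketch of a case-insensitive alternative follows: trim and case-fold first, then map the folded key to its canonical spelling. The helper name `standardize` and the sample values are hypothetical; the canonical spellings are taken from the mapping above.

```python
import pandas as pd

# Canonical spellings keyed by their case-folded form
CANONICAL_PLATFORMS = {
    "fiverr": "Fiverr",
    "upwork": "Upwork",
    "toptal": "Toptal",
    "freelancer": "Freelancer",
    "peopleperhour": "PeoplePerHour",
}

def standardize(values: pd.Series, canonical: dict) -> pd.Series:
    """Trim, case-fold, and map to a canonical spelling (hypothetical helper)."""
    trimmed = values.str.strip()
    key = trimmed.str.casefold()
    # map() yields NaN for keys not in the dict; fall back to the trimmed original
    return key.map(canonical).fillna(trimmed)

raw = pd.Series(["uPWORK", " tOPTAL ", "Peopleperhour", "fIVERR", "Guru"])
cleaned = standardize(raw, CANONICAL_PLATFORMS)
print(cleaned.tolist())
```

Unknown values such as `'Guru'` are kept (trimmed) rather than dropped, so they remain visible to later validation steps.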


In [15]:
# Summary of cleaning
print("\n" + "=" * 60)
print("DATA CLEANING SUMMARY")
print("=" * 60)
print(f"Original rows: {len(jobs_df)}")
print(f"After cleaning: {len(jobs_clean)}")
print(f"Rows removed (duplicates): {len(jobs_df) - len(jobs_clean)}")
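# Count missing values in the deduplicated rows only; counting on jobs_df
# would also include values in the 3000 duplicate rows that were dropped, not filled
print(f"Missing values filled: {jobs_df.drop_duplicates(subset=['job_id'], keep='first')[numeric_cols_to_fill].isna().sum().sum()}")
print(f"\nCleaned data ready for transformation!")


DATA CLEANING SUMMARY
Original rows: 15500
After cleaning: 12500
Rows removed (duplicates): 3000
Missing values filled: 5037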

Cleaned data ready for transformation!


---
## 5. TRANSFORM - Create Dimension Tables

### 5.1 dim_worker

In [16]:
# Transform worker dimension
dim_worker = workers_df.copy()

# Ensure proper column order as per data dictionary
dim_worker = dim_worker[['worker_id', 'experience_level', 'primary_skill']]

# Convert worker_id to string (as per data dictionary)
dim_worker['worker_id'] = dim_worker['worker_id'].astype(str)

print(f"dim_worker shape: {dim_worker.shape}")
dim_worker.head()

dim_worker shape: (1950, 3)


Unnamed: 0,worker_id,experience_level,primary_skill
0,1,Beginner,Web Development
1,2,Beginner,App Development
2,3,Beginner,Web Development
3,4,Intermediate,Data Entry
4,5,Expert,Digital Marketing


### 5.2 dim_platform

In [17]:
# Transform platform dimension
dim_platform = platforms_df.copy()

# Ensure proper column order
dim_platform = dim_platform[['platform_id', 'platform_name', 'category', 'payment_cycle']]

print(f"dim_platform shape: {dim_platform.shape}")
dim_platform

dim_platform shape: (5, 4)


Unnamed: 0,platform_id,platform_name,category,payment_cycle
0,1,Fiverr,Freelancing,Crypto
1,2,Peopleperhour,Freelancing,Bank Transfer
2,3,Upwork,Freelancing,PayPal
3,4,Toptal,Freelancing,Crypto
4,5,Freelancer,Freelancing,Crypto


### 5.3 dim_region

In [18]:
# Transform region dimension
dim_region = regions_df.copy()

# Ensure proper column order as per data dictionary
dim_region = dim_region[['region_id', 'region', 'cost_of_living_index']]

print(f"dim_region shape: {dim_region.shape}")
dim_region

dim_region shape: (7, 3)


Unnamed: 0,region_id,region,cost_of_living_index
0,1,Asia,112.0
1,2,Australia,120.0
2,3,Uk,102.0
3,4,Europe,109.0
4,5,Usa,97.0
5,6,Middle East,128.0
6,7,Canada,124.0


### 5.4 dim_project

In [19]:
# Transform project dimension
dim_project = projects_df.copy()

# Ensure proper column order as per data dictionary
dim_project = dim_project[['project_id', 'project_type', 'job_category']]

print(f"dim_project shape: {dim_project.shape}")
dim_project

dim_project shape: (16, 3)


Unnamed: 0,project_id,project_type,job_category
0,1,Fixed,Web Development
1,2,Fixed,App Development
2,3,Hourly,Web Development
3,4,Hourly,Data Entry
4,5,Hourly,Digital Marketing
5,6,Fixed,Customer Support
6,7,Fixed,Data Entry
7,8,Fixed,Content Writing
8,9,Hourly,App Development
9,10,Hourly,Customer Support


### 5.5 dim_date

In [20]:
# Transform date dimension
dim_date = dim_date_df.copy()

# Ensure proper column order as per data dictionary
dim_date = dim_date[['date_id', 'full_date', 'day_of_week', 'is_weekend', 'is_holiday', 
                      'month_name', 'month_number', 'quarter', 'year']]

print(f"dim_date shape: {dim_date.shape}")
dim_date.head()

dim_date shape: (365, 9)


Unnamed: 0,date_id,full_date,day_of_week,is_weekend,is_holiday,month_name,month_number,quarter,year
0,20250101,2025-01-01,Wednesday,0,1,January,1,1,2025
1,20250102,2025-01-02,Thursday,0,0,January,1,1,2025
2,20250103,2025-01-03,Friday,0,0,January,1,1,2025
3,20250104,2025-01-04,Saturday,1,0,January,1,1,2025
4,20250105,2025-01-05,Sunday,1,0,January,1,1,2025


---
## 6. TRANSFORM - Create Fact Table

In [21]:
# Start with cleaned jobs transactions
fact_job_earnings = jobs_clean.copy()

# Convert work_date to datetime for joining
fact_job_earnings['work_date'] = pd.to_datetime(fact_job_earnings['work_date'])

print(f"Initial fact table: {len(fact_job_earnings)} records")
fact_job_earnings.head()

Initial fact table: 12500 records


Unnamed: 0,job_id,worker_id,platform,client_region,project_type,job_category,payment_method,work_date,earnings_usd,job_completed,job_duration_days,hourly_rate,job_success_rate,client_rating,rehire_rate,marketing_spend
0,4704,1319,Upwork,Asia,Hourly,Graphic Design,PayPal,2025-04-11,4610.5,191,9,45.81,61.58,4.32,82.59,418
1,7750,876,Freelancer,Europe,Fixed,Seo,Crypto,2025-08-08,5593.28,63,27,126.52,72.36,4.4,67.29,237
2,11504,840,Fiverr,Europe,Fixed,Content Writing,Crypto,2025-09-13,8006.26,63,59,32.62,64.27,3.88,23.61,263
3,7235,1456,Upwork,Middle East,Hourly,Seo,Crypto,2025-10-18,7496.48,181,51,136.6,62.29,4.93,63.58,406
4,10270,365,Freelancer,Europe,Hourly,Seo,PayPal,2025-10-03,4077.29,68,34,113.02,64.57,3.99,71.28,72


In [22]:
# Create date_id from work_date (YYYYMMDD format)
fact_job_earnings['date_id'] = fact_job_earnings['work_date'].dt.strftime('%Y%m%d').astype(int)

fact_job_earnings[['work_date', 'date_id']].head()

Unnamed: 0,work_date,date_id
0,2025-04-11,20250411
1,2025-08-08,20250808
2,2025-09-13,20250913
3,2025-10-18,20251018
4,2025-10-03,20251003


In [23]:
# Create platform lookup for getting platform_id
# Handle case differences (e.g., 'PeoplePerHour' vs 'Peopleperhour')
platform_lookup = dim_platform[['platform_id', 'platform_name']].copy()
platform_lookup['platform_name_lower'] = platform_lookup['platform_name'].str.lower()

fact_job_earnings['platform_lower'] = fact_job_earnings['platform'].str.lower()

# Merge to get platform_id
fact_job_earnings = fact_job_earnings.merge(
    platform_lookup[['platform_id', 'platform_name_lower']],
    left_on='platform_lower',
    right_on='platform_name_lower',
    how='left'
)

# Drop temporary columns
fact_job_earnings = fact_job_earnings.drop(columns=['platform_lower', 'platform_name_lower'])

print(f"Platform ID mapping check (null count): {fact_job_earnings['platform_id'].isna().sum()}")

Platform ID mapping check (null count): 0


In [24]:
# Create region lookup for getting region_id
# Handle case differences (e.g., 'UK' vs 'Uk', 'USA' vs 'Usa')
region_lookup = dim_region[['region_id', 'region']].copy()
region_lookup['region_lower'] = region_lookup['region'].str.lower()

fact_job_earnings['client_region_lower'] = fact_job_earnings['client_region'].str.lower()

# Merge to get region_id
fact_job_earnings = fact_job_earnings.merge(
    region_lookup[['region_id', 'region_lower']],
    left_on='client_region_lower',
    right_on='region_lower',
    how='left'
)

# Drop temporary columns
fact_job_earnings = fact_job_earnings.drop(columns=['client_region_lower', 'region_lower'])

print(f"Region ID mapping check (null count): {fact_job_earnings['region_id'].isna().sum()}")

Region ID mapping check (null count): 0


In [25]:
# Create project lookup for getting project_id
# Need to match on both project_type and job_category
project_lookup = dim_project[['project_id', 'project_type', 'job_category']].copy()
project_lookup['project_type_lower'] = project_lookup['project_type'].str.lower()
project_lookup['job_category_lower'] = project_lookup['job_category'].str.lower()

fact_job_earnings['project_type_lower'] = fact_job_earnings['project_type'].str.lower()
fact_job_earnings['job_category_lower'] = fact_job_earnings['job_category'].str.lower()

# Merge to get project_id
fact_job_earnings = fact_job_earnings.merge(
    project_lookup[['project_id', 'project_type_lower', 'job_category_lower']],
    on=['project_type_lower', 'job_category_lower'],
    how='left'
)

# Drop temporary columns
fact_job_earnings = fact_job_earnings.drop(columns=['project_type_lower', 'job_category_lower'])

print(f"Project ID mapping check (null count): {fact_job_earnings['project_id'].isna().sum()}")

Project ID mapping check (null count): 0
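
The three lookups above follow the same pattern: build a lowercase join key on both sides, left-merge, and count unmatched rows. A hedged sketch of how that pattern could be wrapped in one helper is shown below. The function name `attach_key` and the demo frames are hypothetical. `validate='many_to_one'` makes pandas raise if the dimension contains duplicate join keys (which would otherwise multiply fact rows), and `indicator=True` exposes which rows found no match.

```python
import pandas as pd

def attach_key(fact, dim, fact_col, dim_col, key_col):
    """Left-join a surrogate key onto fact via a case-insensitive name match."""
    lookup = dim[[key_col, dim_col]].copy()
    lookup["_join_key"] = lookup[dim_col].str.strip().str.casefold()
    lookup = lookup.drop(columns=[dim_col])

    keyed = fact.assign(_join_key=fact[fact_col].str.strip().str.casefold())
    merged = keyed.merge(
        lookup,
        on="_join_key",
        how="left",
        validate="many_to_one",  # raises MergeError on duplicate dimension keys
        indicator=True,          # adds a _merge column: 'both' or 'left_only'
    )
    unmatched = merged.loc[merged["_merge"] == "left_only", fact_col].unique().tolist()
    return merged.drop(columns=["_join_key", "_merge"]), unmatched

# Hypothetical demo data
dim_platform_demo = pd.DataFrame({"platform_id": [1, 2], "platform_name": ["Fiverr", "Upwork"]})
fact_demo = pd.DataFrame({"job_id": [10, 11, 12], "platform": ["upwork", "FIVERR", "Guru"]})

result, unmatched = attach_key(fact_demo, dim_platform_demo, "platform", "platform_name", "platform_id")
print(result)
print(f"Unmatched values: {unmatched}")
```

Returning the unmatched values, rather than only a null count, makes it easier to see which spellings the dimension is missing.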


In [26]:
# Convert worker_id to string for consistency with dim_worker
fact_job_earnings['worker_id'] = fact_job_earnings['worker_id'].astype(str)

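# Create is_gap_day flag: 1 if earnings are zero/missing or job_completed is 0, else 0
# Note: earnings_usd was median-imputed in section 4.2, so the isna() test cannot
# fire here; in practice the flag is driven by zero earnings or zero completed jobs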
fact_job_earnings['is_gap_day'] = (
    (fact_job_earnings['earnings_usd'].isna()) | 
    (fact_job_earnings['earnings_usd'] == 0) | 
    (fact_job_earnings['job_completed'] == 0)
).astype(int)

print(f"is_gap_day distribution:\n{fact_job_earnings['is_gap_day'].value_counts()}")

is_gap_day distribution:
is_gap_day
0    11936
1      564
Name: count, dtype: int64


In [27]:
# Select and order columns for final fact table as per data dictionary
fact_columns = [
    'job_id',           # DK - Degenerate key
    'worker_id',        # FK -> dim_worker
    'platform_id',      # FK -> dim_platform
    'region_id',        # FK -> dim_region
    'project_id',       # FK -> dim_project
    'date_id',          # FK -> dim_date
    'earnings_usd',     # Metric
    'job_completed',    # Metric
    'job_duration_days',# Metric
    'hourly_rate',      # Metric
    'job_success_rate', # Metric
    'client_rating',    # Metric
    'rehire_rate',      # Metric
    'marketing_spend',  # Metric
    'is_gap_day'        # Calculated flag
]

fact_job_earnings_final = fact_job_earnings[fact_columns].copy()

# Convert FK columns to pandas' nullable integer dtype so that any unmapped
# key stays missing (<NA>) instead of forcing the column to float or object
for col in ['platform_id', 'region_id', 'project_id']:
    fact_job_earnings_final[col] = fact_job_earnings_final[col].astype('Int64')

print(f"Final fact_job_earnings shape: {fact_job_earnings_final.shape}")
fact_job_earnings_final.head(10)

Final fact_job_earnings shape: (12500, 15)


Unnamed: 0,job_id,worker_id,platform_id,region_id,project_id,date_id,earnings_usd,job_completed,job_duration_days,hourly_rate,job_success_rate,client_rating,rehire_rate,marketing_spend,is_gap_day
0,4704,1319,3,1,11,20250411,4610.5,191,9,45.81,61.58,4.32,82.59,418,0
1,7750,876,5,4,16,20250808,5593.28,63,27,126.52,72.36,4.4,67.29,237,0
2,11504,840,1,4,8,20250913,8006.26,63,59,32.62,64.27,3.88,23.61,263,0
3,7235,1456,3,6,12,20251018,7496.48,181,51,136.6,62.29,4.93,63.58,406,0
4,10270,365,5,4,12,20251003,4077.29,68,34,113.02,64.57,3.99,71.28,72,0
5,5016,1758,4,1,13,20250928,95.05,72,16,80.43,92.43,3.83,20.34,417,0
6,4443,1680,4,7,4,20251028,9366.07,268,40,76.84,81.46,3.16,51.98,258,0
7,9423,312,3,6,8,20251021,921.91,182,54,127.06,91.33,4.81,21.18,335,0
8,904,700,3,1,10,20251117,7112.12,250,38,77.98,90.34,3.45,18.24,374,0
9,2151,62,3,2,13,20250309,3030.12,128,29,110.38,62.11,4.67,86.82,90,0
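
A foreign-key column produced by a left merge becomes `float64` as soon as one key is unmatched, because NumPy integers cannot hold missing values. The short sketch below, on a hypothetical series, shows how pandas' nullable `Int64` dtype keeps whole-number keys and missing entries in the same column.

```python
import numpy as np
import pandas as pd

# Hypothetical FK column after a left merge with one unmatched row
fk = pd.Series([3.0, np.nan, 5.0])
print(fk.dtype)        # float64, because of the NaN

# Nullable integer dtype: whole numbers stay integers, NaN becomes <NA>
nullable = fk.astype("Int64")
print(nullable.dtype)
print(nullable)
```

In this run all three foreign keys mapped completely, so the distinction only matters if a later load introduces unmatched values.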


---
## 7. Data Quality Checks

In [28]:
# Check for any unmapped foreign keys
print("=== Data Quality Report ===")
print(f"\nFact Table Records: {len(fact_job_earnings_final)}")
print(f"\nNull/Unmapped Foreign Keys:")
print(f"  - platform_id: {fact_job_earnings_final['platform_id'].isna().sum()}")
print(f"  - region_id: {fact_job_earnings_final['region_id'].isna().sum()}")
print(f"  - project_id: {fact_job_earnings_final['project_id'].isna().sum()}")
print(f"\nDimension Tables:")
print(f"  - dim_worker: {len(dim_worker)} records")
print(f"  - dim_platform: {len(dim_platform)} records")
print(f"  - dim_region: {len(dim_region)} records")
print(f"  - dim_project: {len(dim_project)} records")
print(f"  - dim_date: {len(dim_date)} records")

=== Data Quality Report ===

Fact Table Records: 12500

Null/Unmapped Foreign Keys:
  - platform_id: 0
  - region_id: 0
  - project_id: 0

Dimension Tables:
  - dim_worker: 1950 records
  - dim_platform: 5 records
  - dim_region: 7 records
  - dim_project: 16 records
  - dim_date: 365 records


In [29]:
# Verify referential integrity for worker_id
workers_in_fact = set(fact_job_earnings_final['worker_id'].unique())
workers_in_dim = set(dim_worker['worker_id'].unique())

orphan_workers = workers_in_fact - workers_in_dim
print(f"Workers in fact but not in dimension: {len(orphan_workers)}")
if orphan_workers:
    print(f"  Sample orphan worker IDs: {list(orphan_workers)[:5]}")

Workers in fact but not in dimension: 0
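
The same set-difference check could be applied to every foreign key, including `date_id`, which is not verified above. The sketch below uses a hypothetical helper `find_orphans` and made-up demo frames. It assumes the key column has the same name and dtype in both tables, since comparing string IDs against integer IDs would report every row as an orphan.

```python
import pandas as pd

def find_orphans(fact: pd.DataFrame, dim: pd.DataFrame, key: str) -> set:
    """Return key values present in fact but absent from dim (hypothetical helper)."""
    return set(fact[key].dropna().unique()) - set(dim[key].unique())

# Hypothetical demo data: one fact row points at a date outside the dimension
fact_demo = pd.DataFrame({"date_id": [20250101, 20250102, 20260101]})
dim_date_demo = pd.DataFrame({"date_id": [20250101, 20250102]})

orphans = find_orphans(fact_demo, dim_date_demo, "date_id")
print(f"date_id values in fact but not in dimension: {sorted(orphans)}")
```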


---
## 8. LOAD - Save to data_cleaned folder

In [30]:
# Save dimension tables
dim_worker.to_csv(CLEANED_DATA_DIR / 'dim_worker.csv', index=False)
print(f"✓ Saved dim_worker.csv ({len(dim_worker)} records)")

dim_platform.to_csv(CLEANED_DATA_DIR / 'dim_platform.csv', index=False)
print(f"✓ Saved dim_platform.csv ({len(dim_platform)} records)")

dim_region.to_csv(CLEANED_DATA_DIR / 'dim_region.csv', index=False)
print(f"✓ Saved dim_region.csv ({len(dim_region)} records)")

dim_project.to_csv(CLEANED_DATA_DIR / 'dim_project.csv', index=False)
print(f"✓ Saved dim_project.csv ({len(dim_project)} records)")

dim_date.to_csv(CLEANED_DATA_DIR / 'dim_date.csv', index=False)
print(f"✓ Saved dim_date.csv ({len(dim_date)} records)")

✓ Saved dim_worker.csv (1950 records)
✓ Saved dim_platform.csv (5 records)
✓ Saved dim_region.csv (7 records)
✓ Saved dim_project.csv (16 records)
✓ Saved dim_date.csv (365 records)


In [31]:
# Save fact table
fact_job_earnings_final.to_csv(CLEANED_DATA_DIR / 'fact_job_earnings.csv', index=False)
print(f"✓ Saved fact_job_earnings.csv ({len(fact_job_earnings_final)} records)")

✓ Saved fact_job_earnings.csv (12500 records)
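
CSV files do not store column types, so the string conversion applied to `worker_id` earlier is not preserved on disk: a reader that relies on type inference gets integers back. The sketch below demonstrates this with an in-memory buffer and hypothetical data, and shows that passing `dtype` to `pd.read_csv` restores the intended type.

```python
import io
import pandas as pd

# Hypothetical dimension with worker_id stored as strings
dim_worker_demo = pd.DataFrame({
    "worker_id": ["1", "2", "3"],
    "experience_level": ["Beginner", "Expert", "Beginner"],
})

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
dim_worker_demo.to_csv(buffer, index=False)

# Default read: pandas infers an integer column
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: worker_id comes back as strings
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"worker_id": str})

print(default_read["worker_id"].dtype)
print(typed_read["worker_id"].tolist())
```

Consumers of the exported tables would need to apply the same `dtype` argument, or the data dictionary's string type has to be enforced at read time in some other way.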


In [32]:
# List all output files
print("\n=== ETL Pipeline Complete ===")
print(f"\nOutput files in {CLEANED_DATA_DIR}:")
for file in sorted(CLEANED_DATA_DIR.glob('*.csv')):
    size = file.stat().st_size / 1024  # KB
    print(f"  - {file.name} ({size:.1f} KB)")


=== ETL Pipeline Complete ===

Output files in c:\Users\louey\Desktop\Business intelligence\Flexi Income new\data_cleaned:
  - dim_date.csv (17.7 KB)
  - dim_platform.csv (0.2 KB)
  - dim_project.csv (0.4 KB)
  - dim_region.csv (0.1 KB)
  - dim_worker.csv (55.7 KB)
  - fact_job_earnings.csv (839.7 KB)


---
## 9. Summary Statistics

In [33]:
# Display summary statistics for the fact table
print("=== Fact Table Summary Statistics ===")
fact_job_earnings_final[['earnings_usd', 'job_completed', 'job_duration_days', 
                          'hourly_rate', 'job_success_rate', 'client_rating', 
                          'rehire_rate', 'marketing_spend']].describe()

=== Fact Table Summary Statistics ===


Unnamed: 0,earnings_usd,job_completed,job_duration_days,hourly_rate,job_success_rate,client_rating,rehire_rate,marketing_spend
count,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0,12500.0
mean,4791.83351,143.51432,30.60456,82.171001,79.851556,4.002498,50.262116,251.39656
std,2819.037637,90.088145,17.281965,37.38187,11.071702,0.555372,22.25559,145.147844
min,0.0,0.0,1.0,15.01,60.0,3.0,10.01,0.0
25%,2465.0925,65.0,16.0,51.6775,70.64,3.55,31.92,125.0
50%,4759.09,142.0,31.0,82.09,79.74,4.0,50.31,253.0
75%,7109.1575,222.0,46.0,113.065,89.1,4.46,68.6825,378.0
max,9999.44,300.0,60.0,150.0,100.0,5.0,89.99,500.0
