# 🔍 01. RAW LAYER EXPLORATION

## 🎯 OBJECTIVES:
- Understand the structure of the RAW layer
- Explore data quality issues
- Practice ingesting data from Parquet files

## 📚 CONTENTS:
1. Read Parquet files
2. Analyze data quality issues
3. Ingest into the RAW schema
4. Query RAW data
5. Analyze metadata
6. Compare RAW vs. Parquet

In [1]:
import sys
sys.path.append('../scripts')

import pandas as pd
from pathlib import Path
from datetime import datetime

from db_connector import DatabaseConnector
from etl_raw import RawLayerETL

print("✅ Libraries imported!")

✅ Libraries imported!


## 1. READ PARQUET FILES

In [2]:
print("📁 Reading Parquet files...")

raw_data_dir = Path('../raw_data')

# Read one partition of customers
partition_path = raw_data_dir / 'customers' / '2025-01-01' / 'data.parquet'

if partition_path.exists():
    df = pd.read_parquet(partition_path)
    print(f"\n✅ Read successfully: {len(df)} rows")
    print(f"\nColumns: {df.columns.tolist()}")
    print("\nFirst 5 rows:")
    display(df.head())
else:
    print("⚠️ File not found! Run: python scripts/generate_raw_data.py --test-mode")

📁 Reading Parquet files...

✅ Read successfully: 16 rows

Columns: ['customer_id', 'customer_name', 'email', 'country', 'signup_date', 'customer_segment']

First 5 rows:


| | customer_id | customer_name | email | country | signup_date | customer_segment |
|---|---|---|---|---|---|---|
| 0 | 1 | Connor West | brownjessica@example.org | Jamaica | 2025-01-01 | Premium |
| 1 | 2 | Scott Pierce | joshuawright@example.org | Australia | 2025-01-01 | Basic |
| 2 | 3 | John Lewis | novaksara@example.org | Iran | 2025-01-01 | Premium |
| 3 | 4 | Richard Gibson | martinaaron@example.com | Jamaica | 2025-01-01 | Basic |
| 4 | 5 | lauren daniels | smoore@example.org | San Marino | 2025-01-01 | Standard |
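
The cell above reads a single partition. Assuming every table follows the same `<table>/<YYYY-MM-DD>/data.parquet` layout seen in `partition_path` (an assumption, since only one path is shown here), a small helper could enumerate all partitions of a table. `list_partitions` is a hypothetical name and is not part of `etl_raw`.

```python
from datetime import date
from pathlib import Path


def list_partitions(raw_data_dir, table):
    """Return (partition_date, parquet_path) pairs sorted by date.

    Assumes the layout <raw_data_dir>/<table>/<YYYY-MM-DD>/data.parquet.
    Directories whose name is not an ISO date, or that lack data.parquet,
    are skipped.
    """
    table_dir = Path(raw_data_dir) / table
    if not table_dir.is_dir():
        return []
    partitions = []
    for child in table_dir.iterdir():
        parquet_file = child / "data.parquet"
        if not parquet_file.is_file():
            continue
        try:
            partition_date = date.fromisoformat(child.name)
        except ValueError:
            continue  # not a date-named partition directory
        partitions.append((partition_date, parquet_file))
    return sorted(partitions)
```

Each returned path can then be passed to `pd.read_parquet`, in the same way as `partition_path` above.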


## 2. ANALYZE DATA QUALITY ISSUES

In [3]:
print("🔍 DATA QUALITY ANALYSIS")
print("="*70)

# Check duplicates
print("\n1️⃣ DUPLICATES:")
dup_ids = df['customer_id'].duplicated().sum()
dup_emails = df['email'].duplicated().sum()
print(f"  Duplicate customer_ids: {dup_ids}")
print(f"  Duplicate emails: {dup_emails}")

if dup_ids > 0:
    print("\n  Example duplicates:")
    display(df[df['customer_id'].duplicated(keep=False)].sort_values('customer_id').head())

🔍 DATA QUALITY ANALYSIS

1️⃣ DUPLICATES:
  Duplicate customer_ids: 0
  Duplicate emails: 0


In [4]:
# Check nulls
print("2️⃣ NULL VALUES:")
nulls = df.isnull().sum()
print(nulls[nulls > 0])

2️⃣ NULL VALUES:
Series([], dtype: int64)


In [5]:
# Check email format
print("3️⃣ INVALID EMAILS:")
invalid_emails = df[~df['email'].str.contains('@', na=False)]
print(f"  Invalid emails: {len(invalid_emails)}")
if len(invalid_emails) > 0:
    display(invalid_emails[['customer_id', 'customer_name', 'email']].head())

3️⃣ INVALID EMAILS:
  Invalid emails: 1


| | customer_id | customer_name | email |
|---|---|---|---|
| 8 | 9 | Michael Evans | james53_at_example.com |
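
The `'@'` check above only catches addresses with no at-sign at all; a value such as `a@b` would still pass. A slightly stricter check is sketched below, using a deliberately simple pattern (not a full RFC 5322 validator) that requires a non-empty local part, exactly one `@`, and a domain containing a dot. `is_plausible_email` is a hypothetical helper.

```python
import re

# Simple shape check: local part, one "@", domain with at least one dot.
# This is intentionally loose and is not an RFC 5322 validator.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_plausible_email(value):
    """Return True if value is a string that looks like an email address."""
    return isinstance(value, str) and EMAIL_PATTERN.fullmatch(value) is not None


print(is_plausible_email("brownjessica@example.org"))  # True
print(is_plausible_email("james53_at_example.com"))    # False
```

With pandas, the helper could be applied per value, for example `df[~df['email'].map(is_plausible_email)]`, to select the failing rows. Missing values are treated as invalid because they are not strings.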


In [6]:
# Check lowercase names
print("4️⃣ LOWERCASE NAMES:")
lowercase_names = df[df['customer_name'].notna() & df['customer_name'].str.islower()]
print(f"  Lowercase names: {len(lowercase_names)}")
if len(lowercase_names) > 0:
    display(lowercase_names[['customer_id', 'customer_name']].head())

4️⃣ LOWERCASE NAMES:
  Lowercase names: 2


| | customer_id | customer_name |
|---|---|---|
| 4 | 5 | lauren daniels |
| 7 | 8 | kevin mills |


## 3. INGEST INTO THE RAW SCHEMA

In [7]:
print("📥 INGEST INTO RAW SCHEMA")
print("="*70)

db = DatabaseConnector()
etl_raw = RawLayerETL(db)

# Check current data in raw schema
current_count_query = "SELECT COUNT(*) as count FROM raw.customers"
current_count = db.read_sql(current_count_query)['count'][0]
print(f"\nCurrent rows in raw.customers: {current_count:,}")

# Ingest customers (incremental=True: only partitions not yet ingested)
print("\n🚀 Running ETL Raw Layer...")
result = etl_raw.ingest_table('customers', incremental=True)

print("\n✅ Ingest complete!")
print(f"  Partitions processed: {result['partitions_processed']}")
print(f"  Total rows ingested: {result['total_rows']:,}")

# Verify
new_count = db.read_sql(current_count_query)['count'][0]
print(f"\nNew total in raw.customers: {new_count:,}")

2025-12-20 08:56:02,469 - db_connector - INFO - Database connector initialized for data_engineer@postgres
2025-12-20 08:56:02,491 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)


📥 INGEST INTO RAW SCHEMA

Current rows in raw.customers: 11,116

🚀 Running ETL Raw Layer...


2025-12-20 08:56:03,866 - db_connector - INFO - Query executed, DataFrame shape: (365, 1)
2025-12-20 08:56:03,869 - etl_raw - INFO - Incremental mode: 0 new partitions to ingest
2025-12-20 08:56:03,874 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)



✅ Ingest complete!
  Partitions processed: 0
  Total rows ingested: 0

New total in raw.customers: 11,116
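
The log line `Incremental mode: 0 new partitions to ingest` follows a query that returned 365 rows. This suggests (an inference from the log, not from the `etl_raw` source) that incremental mode compares the partition dates already present in `raw.customers` with those available on disk. A minimal sketch of that set difference, with the hypothetical name `select_new_partitions`:

```python
def select_new_partitions(available, ingested):
    """Return partition dates present on disk but not yet in the raw table.

    available: iterable of partition dates found on disk (ISO strings here)
    ingested:  iterable of partition dates already loaded
    The result is sorted; ISO date strings sort chronologically.
    """
    return sorted(set(available) - set(ingested))


on_disk = ["2025-01-01", "2025-01-02", "2025-01-03"]
already_loaded = ["2025-01-01", "2025-01-02"]
print(select_new_partitions(on_disk, already_loaded))  # ['2025-01-03']
print(select_new_partitions(on_disk, on_disk))         # []
```

When the difference is empty, as in the run above, no partitions are processed and the row count stays at 11,116.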


## 4. QUERY RAW DATA

In [8]:
print("🔍 QUERY RAW DATA")
print("="*70)

# Query with metadata
query = """
SELECT 
    customer_id,
    customer_name,
    email,
    country,
    signup_date,
    _partition_date,
    _ingested_at
FROM raw.customers
WHERE _partition_date = '2025-01-01'
LIMIT 10
"""

raw_df = db.read_sql(query)
display(raw_df)

2025-12-20 08:56:11,665 - db_connector - INFO - Query executed, DataFrame shape: (10, 7)


🔍 QUERY RAW DATA


| | customer_id | customer_name | email | country | signup_date | _partition_date | _ingested_at |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Connor West | brownjessica@example.org | Jamaica | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 1 | 2 | Scott Pierce | joshuawright@example.org | Australia | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 2 | 3 | John Lewis | novaksara@example.org | Iran | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 3 | 4 | Richard Gibson | martinaaron@example.com | Jamaica | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 4 | 5 | lauren daniels | smoore@example.org | San Marino | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 5 | 6 | Kim Brown | brian97@example.net | Malawi | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 6 | 7 | Cynthia Wilson | yorkcasey@example.org | Tuvalu | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 7 | 8 | kevin mills | bethwilliams@example.org | Vietnam | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 8 | 9 | Michael Evans | james53_at_example.com | Germany | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
| 9 | 10 | Angel Lewis MD | smithchristine@example.net | Venezuela | 2025-01-01 | 2025-01-01 | 2025-12-20 07:50:45.349732 |
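
`_partition_date` and `_ingested_at` are not among the Parquet columns listed in section 1, so they are presumably attached during ingest. How `RawLayerETL` does this is not shown in this notebook; the sketch below is one plausible approach on plain dict rows, with `stamp_metadata` as a hypothetical name. It also sets `_source_file`, which the key takeaways list as a tracked column. Using a UTC timestamp is a choice made for this sketch only.

```python
from datetime import datetime, timezone


def stamp_metadata(rows, partition_date, source_file, ingested_at=None):
    """Return copies of rows with RAW-layer metadata columns attached.

    One timestamp is shared by the whole batch, so every row of a partition
    carries the same _ingested_at value. The input rows are not modified.
    """
    if ingested_at is None:
        ingested_at = datetime.now(timezone.utc)  # assumption: UTC clock
    return [
        {
            **row,
            "_partition_date": partition_date,
            "_source_file": source_file,
            "_ingested_at": ingested_at,
        }
        for row in rows
    ]
```

A single shared timestamp per batch would be consistent with the identical `_ingested_at` values in the ten rows above.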


## 5. METADATA ANALYSIS

In [9]:
print("📊 METADATA ANALYSIS")
print("="*70)

metadata_query = """
SELECT 
    _partition_date,
    COUNT(*) as row_count,
    MIN(_ingested_at) as first_ingested,
    MAX(_ingested_at) as last_ingested
FROM raw.customers
GROUP BY _partition_date
ORDER BY _partition_date
LIMIT 10
"""

metadata_df = db.read_sql(metadata_query)
display(metadata_df)

2025-12-20 08:56:15,339 - db_connector - INFO - Query executed, DataFrame shape: (10, 4)


📊 METADATA ANALYSIS


| | _partition_date | row_count | first_ingested | last_ingested |
|---|---|---|---|---|
| 0 | 2025-01-01 | 16 | 2025-12-20 07:50:45.349732 | 2025-12-20 07:50:45.349732 |
| 1 | 2025-01-02 | 50 | 2025-12-20 07:50:45.382492 | 2025-12-20 07:50:45.382492 |
| 2 | 2025-01-03 | 49 | 2025-12-20 07:50:45.406189 | 2025-12-20 07:50:45.406189 |
| 3 | 2025-01-04 | 34 | 2025-12-20 07:50:45.429150 | 2025-12-20 07:50:45.429150 |
| 4 | 2025-01-05 | 24 | 2025-12-20 07:50:45.448413 | 2025-12-20 07:50:45.448413 |
| 5 | 2025-01-06 | 46 | 2025-12-20 07:50:45.484814 | 2025-12-20 07:50:45.484814 |
| 6 | 2025-01-07 | 28 | 2025-12-20 07:50:45.503889 | 2025-12-20 07:50:45.503889 |
| 7 | 2025-01-08 | 27 | 2025-12-20 07:50:45.523113 | 2025-12-20 07:50:45.523113 |
| 8 | 2025-01-09 | 28 | 2025-12-20 07:50:45.543926 | 2025-12-20 07:50:45.543926 |
| 9 | 2025-01-10 | 13 | 2025-12-20 07:50:45.563226 | 2025-12-20 07:50:45.563226 |
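
In every row above, `first_ingested` equals `last_ingested`, which is consistent with each partition having been loaded in a single batch. In an append-only table, a partition loaded twice would show two different timestamps. A small check over rows shaped like `metadata_df` is sketched below; `find_reingested` is a hypothetical helper, and the rows could come from `metadata_df.to_dict('records')`.

```python
def find_reingested(metadata_rows):
    """Return partition dates whose rows carry more than one ingest timestamp."""
    return [
        row["_partition_date"]
        for row in metadata_rows
        if row["first_ingested"] != row["last_ingested"]
    ]


sample = [
    {"_partition_date": "2025-01-01",
     "first_ingested": "2025-12-20 07:50:45.349732",
     "last_ingested": "2025-12-20 07:50:45.349732"},
]
print(find_reingested(sample))  # []
```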


## 6. COMPARE RAW VS PARQUET

In [10]:
print("📊 COMPARE RAW VS PARQUET")
print("="*70)

# Count in Parquet
parquet_count = len(pd.read_parquet(partition_path))

# Count in Raw
raw_count_query = """
SELECT COUNT(*) as count 
FROM raw.customers 
WHERE _partition_date = '2025-01-01'
"""
raw_count = db.read_sql(raw_count_query)['count'][0]

print(f"Parquet file: {parquet_count:,} rows")
print(f"Raw schema:   {raw_count:,} rows")
print(f"Match: {'✅ YES' if parquet_count == raw_count else '❌ NO'}")

2025-12-20 08:56:16,144 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)


📊 COMPARE RAW VS PARQUET
Parquet file: 16 rows
Raw schema:   16 rows
Match: ✅ YES
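
The comparison above covers one partition. To extend it to every partition, the per-partition counts from both sides can be compared as dictionaries. The sketch below assumes the counts have already been collected (for example from the Parquet files and from a `GROUP BY _partition_date` query); `reconcile_counts` is a hypothetical helper.

```python
def reconcile_counts(parquet_counts, raw_counts):
    """Compare per-partition row counts from Parquet files and the raw table.

    Both arguments map partition date -> row count. Returns a dict of
    partition date -> (parquet_count, raw_count) for every mismatch; a
    partition missing on one side is reported with a count of None.
    """
    mismatches = {}
    for partition in sorted(set(parquet_counts) | set(raw_counts)):
        expected = parquet_counts.get(partition)
        actual = raw_counts.get(partition)
        if expected != actual:
            mismatches[partition] = (expected, actual)
    return mismatches


print(reconcile_counts({"2025-01-01": 16}, {"2025-01-01": 16}))  # {}
```

An empty result means every partition matches, which generalizes the single `✅ YES` above.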


# 🎓 KEY TAKEAWAYS

## ✅ RAW Layer Characteristics:
1. **Immutable**: Data is not modified after ingestion
2. **Has Issues**: May contain duplicates, nulls, and invalid formats (this partition had one invalid email and two lowercase names)
3. **Metadata**: Tracks `_ingested_at`, `_source_file`, `_partition_date`
4. **Append-Only**: Rows are only added, never deleted or updated

## 🔄 Next Step:
- Open `02_staging_transformation.ipynb` to learn how to clean the data

In [11]:
print("\n✅ Raw Layer Exploration Complete!")


✅ Raw Layer Exploration Complete!
