# Getting Started with ETL in Python

This notebook walks through the basics of ETL (extract, transform, load) in Python with pandas: loading a CSV file, exploring it, cleaning and transforming it, and saving the result.

## Setup
Make sure your virtual environment is activated before starting Jupyter!

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime
import sys

print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Exercise 1: Load Sample Data

Let's load the customer data from our sample files.

In [None]:
# Load customer data
customers_df = pd.read_csv('../data/raw/customers.csv')

# Display first few rows
print(f"Shape: {customers_df.shape}")
customers_df.head()
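
By default `pd.read_csv` infers column types, which can turn IDs into integers and leave dates as plain strings. The sketch below shows one way to set types at load time. It reads from an in-memory string rather than the real file, and the column names and values are assumptions made up for illustration, so adjust them to match `customers.csv`.

```python
import io
import pandas as pd

# Hypothetical sample standing in for customers.csv; the real columns may differ.
sample_csv = io.StringIO(
    "customer_id,first_name,last_name,email,signup_date\n"
    "1,Ada,Lovelace,ADA@EXAMPLE.COM,2023-01-15\n"
    "2,Alan,Turing,alan@example.com,2023-02-20\n"
)

# Read IDs as strings so any leading zeros survive, and parse dates while loading.
sample_df = pd.read_csv(
    sample_csv,
    dtype={"customer_id": "string"},
    parse_dates=["signup_date"],
)

print(sample_df.dtypes)
```

Whether IDs should be strings or integers depends on your data; strings are the safer choice when IDs are labels rather than quantities.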

## Exercise 2: Data Exploration

Let's explore the data to understand what we're working with: column types, missing values, and duplicates.

In [None]:
# Check data info
customers_df.info()

In [None]:
# Check for missing values
print("Missing values per column:")
customers_df.isnull().sum()

In [None]:
# Check for duplicates
print(f"Total rows: {len(customers_df)}")
print(f"Duplicate rows: {customers_df.duplicated().sum()}")
print(f"Duplicate customer_ids: {customers_df['customer_id'].duplicated().sum()}")
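
Counting duplicates tells you how many there are, but not which rows they are. As a small sketch on made-up data (the `toy_df` frame and its values are hypothetical), passing `keep=False` to `duplicated()` flags every row in a duplicate group so you can compare them side by side before deciding which one to keep.

```python
import pandas as pd

# Hypothetical toy data; customer_id 2 appears twice with different email casing.
toy_df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "B@example.com", "c@example.com"],
})

# keep=False marks every member of a duplicate group, not just the later copies.
dupes = toy_df[toy_df["customer_id"].duplicated(keep=False)]
print(dupes.sort_values("customer_id"))
```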

## Exercise 3: Data Cleaning

Now let's clean the data: remove duplicate customers, lowercase the emails, and build a full-name column.

In [None]:
# Create a copy for cleaning
clean_df = customers_df.copy()

# 1. Remove duplicates based on customer_id
clean_df = clean_df.drop_duplicates(subset=['customer_id'], keep='first')
print(f"After removing duplicates: {len(clean_df)} rows")

# 2. Standardize email to lowercase
clean_df['email'] = clean_df['email'].str.lower()

# 3. Create full_name column
clean_df['full_name'] = clean_df['first_name'] + ' ' + clean_df['last_name']

# Display results
clean_df.head()
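
One thing to watch when concatenating string columns: if either part is missing, the result is missing too. The sketch below, on hypothetical data, shows one possible way to handle that by filling gaps with empty strings and trimming stray whitespace. Whether a partial name is acceptable is a judgment call for your dataset.

```python
import pandas as pd

# Hypothetical names with extra whitespace and missing parts.
names = pd.DataFrame({
    "first_name": [" Ada ", "Alan", None],
    "last_name": ["Lovelace", None, "Hopper"],
})

# Fill missing parts with empty strings, trim each part, join, then trim the result.
first = names["first_name"].fillna("").str.strip()
last = names["last_name"].fillna("").str.strip()
names["full_name"] = (first + " " + last).str.strip()

print(names["full_name"].tolist())
```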

## Exercise 4: Data Transformation

Let's derive a few date-based columns from `signup_date`.

In [None]:
# Convert signup_date to datetime
clean_df['signup_date'] = pd.to_datetime(clean_df['signup_date'])

# Calculate days since signup
clean_df['days_since_signup'] = (datetime.now() - clean_df['signup_date']).dt.days

# Extract year and month
clean_df['signup_year'] = clean_df['signup_date'].dt.year
clean_df['signup_month'] = clean_df['signup_date'].dt.month

clean_df[['full_name', 'signup_date', 'days_since_signup', 'signup_year', 'signup_month']].head()
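
By default `pd.to_datetime` raises an error when it meets a value it cannot parse. If the raw file might contain malformed dates, a possible alternative is `errors="coerce"`, which turns those values into `NaT` so you can count and inspect them. The sketch below uses a made-up series, not the real `signup_date` column.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one malformed string, one missing value.
raw_dates = pd.Series(["2023-01-15", "not a date", None])

# errors="coerce" converts anything unparseable to NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(f"Unparseable or missing dates: {parsed.isna().sum()}")
```

Coercing hides bad input, so it is worth checking the `NaT` count before moving on.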

## Exercise 5: Save Cleaned Data

In [None]:
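# Save to processed folder (create it first if it doesn't exist)
from pathlib import Path

output_path = Path('../data/processed/customers_cleaned_notebook.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)
clean_df.to_csv(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")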
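
After saving, it can be worth reading the file back to confirm it was written as expected. The sketch below does a simple round trip in a temporary directory with made-up data, so it does not touch the project's `data` folder. The frame, file name, and checks are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame standing in for clean_df.
out_df = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "processed" / "customers_check.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_df.to_csv(out_path, index=False)
    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

print(f"Shapes match: {reloaded.shape == out_df.shape}")
```

Note that CSV does not store column types, so datetime columns come back as strings unless you pass `parse_dates` when reloading.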

## Your Turn!

Try these challenges:

1. Load the `sales_data.csv` file
2. Identify and fix the data quality issues
3. Clean the price column (remove $ signs and convert to float)
4. Convert dates to a standard format (for example, ISO 8601: `YYYY-MM-DD`)
5. Save the cleaned data
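
As a hint for challenge 3, here is a minimal sketch of price cleaning on a made-up series. The values are hypothetical and the real `sales_data.csv` may need different handling, so treat this as a starting point rather than the solution.

```python
import pandas as pd

# Hypothetical price strings with dollar signs, thousands separators, and a bad value.
prices = pd.Series(["$19.99", "$1,250.00", " 5 ", "N/A"])

# Remove "$" and "," characters, trim whitespace, then convert to numbers.
cleaned = prices.str.replace(r"[$,]", "", regex=True).str.strip()

# errors="coerce" turns anything that still isn't numeric into NaN.
price_float = pd.to_numeric(cleaned, errors="coerce")

print(price_float)
```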

Use the cells below to work on these challenges:

In [None]:
# Your code here
