# Data Cleaning and Visualization

## Learning Objectives
In this lesson, you will learn to:
1. Load and inspect messy datasets
2. Clean data by handling missing values, duplicates, and outliers
3. Filter and subset data based on conditions
4. Transform and prepare data for visualization
5. Create professional visualizations using matplotlib and pandas

---

## Part 1: Loading and Inspecting Data

Let's start by creating a sample dataset that contains common data quality issues.

In [None]:
# ========================================
# Import Required Libraries
# ========================================
# pandas (pd): The primary library for working with tabular data (like spreadsheets)
import pandas as pd

# numpy (np): Library for numerical computations and array operations
import numpy as np

# matplotlib.pyplot (plt): The main plotting library for creating visualizations
import matplotlib.pyplot as plt

# datetime modules: For working with dates and times
from datetime import datetime, timedelta

# ========================================
# Configure Settings
# ========================================
# Set random seed to 42 so everyone gets the same "random" data
# This makes our code reproducible - you'll get the same results every time
np.random.seed(42)

# Configure pandas display options for better readability
pd.set_option('display.max_columns', None)  # Show all columns (don't truncate)
pd.set_option('display.width', None)        # Use full screen width

print("✓ Libraries imported successfully!")
print("✓ Settings configured!")
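
To make the effect of the seed concrete, here is a minimal sketch that is separate from the lesson's dataset code. It resets the seed and draws the same five integers twice. The names `first_draw` and `second_draw` are illustrative only.

```python
import numpy as np

# Reset the global seed, then draw five integers in the range [0, 100)
np.random.seed(42)
first_draw = np.random.randint(0, 100, size=5)

# Resetting to the same seed replays the same sequence of draws
np.random.seed(42)
second_draw = np.random.randint(0, 100, size=5)

print(first_draw)
print(second_draw)
print("Identical:", np.array_equal(first_draw, second_draw))
```

Note that re-running a later cell on its own continues the random sequence instead of restarting it, so results match a fresh run only when the notebook is executed from the top.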

In [None]:
# ========================================
# Create Sample Dataset with Intentional Data Quality Issues
# ========================================
# We're creating a realistic dataset that has common problems you'll encounter in real data

# Set the number of records we want to generate
n_records = 500

# ----------------------------------------
# Generate Random Dates
# ----------------------------------------
# Create a starting date (January 1, 2024)
start_date = datetime(2024, 1, 1)

# Generate 500 random dates in 2024 to simulate transactions spread across the year
# Offsets 0-364 cover Jan 1 through Dec 30 (2024 is a leap year, so Dec 31 is never drawn)
dates = [start_date + timedelta(days=np.random.randint(0, 365)) for _ in range(n_records)]

# ----------------------------------------
# Build the Dataset Dictionary
# ----------------------------------------
# Create a dictionary where each key is a column name and value is a list of data
data = {
    # Transaction date for each sale
    'date': dates,
    
    # Product names - randomly chosen from 6 different products
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'], n_records),
    
    # Region names - NOTE: Intentionally inconsistent capitalization ('north' vs 'North' vs 'SOUTH')
    # This is a common data quality issue we'll need to fix!
    'region': np.random.choice(['North', 'South', 'East', 'West', 'north', 'SOUTH'], n_records),
    
    # Sales amount in dollars - random integers from $100 to $4,999 (randint excludes the upper bound)
    'sales': np.random.randint(100, 5000, n_records),
    
    # Quantity of items sold - random integers from 1 to 49
    'quantity': np.random.randint(1, 50, n_records),
    
    # Customer age - random integers from 18 to 74
    'customer_age': np.random.randint(18, 75, n_records),
    
    # Customer satisfaction score (1-5 scale, where 5 is best)
    'satisfaction': np.random.choice([1, 2, 3, 4, 5], n_records)
}

# Convert the dictionary into a pandas DataFrame (like a spreadsheet table)
df = pd.DataFrame(data)

# ----------------------------------------
# Introduce Missing Values (10% of rows)
# ----------------------------------------
# Randomly select 10% of rows; each one gets a single missing field (sales or satisfaction)
missing_indices = np.random.choice(df.index, size=int(n_records * 0.10), replace=False)

# Make half of those rows have missing sales values
df.loc[missing_indices[:len(missing_indices)//2], 'sales'] = np.nan

# Make the other half have missing satisfaction scores
df.loc[missing_indices[len(missing_indices)//2:], 'satisfaction'] = np.nan

# ----------------------------------------
# Introduce Duplicate Rows
# ----------------------------------------
# Randomly select 20 rows and duplicate them
duplicate_rows = df.sample(20)
# Add these duplicate rows to the dataframe (this creates duplicates)
df = pd.concat([df, duplicate_rows], ignore_index=True)

# ----------------------------------------
# Introduce Outliers
# ----------------------------------------
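# Select 10 random rows and give them unrealistically high sales values
# Note: this runs after duplication, so a selected row may be one copy of a duplicated pair
# (that pair then no longer matches exactly) or a row whose sales value was missing
outlier_indices = np.random.choice(df.index, size=10, replace=False)
# Set their sales to be between $50,000 and $99,999 (much higher than normal)
df.loc[outlier_indices, 'sales'] = np.random.randint(50000, 100000, len(outlier_indices))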

# ----------------------------------------
# Display Results
# ----------------------------------------
print(f"✓ Dataset created with {len(df)} records")
print(f"✓ Includes: missing values, duplicates, outliers, and inconsistent data")
print(f"\nFirst 10 rows of the dataset:")
df.head(10)

### Exercise 1.1: Initial Data Inspection

Let's examine the dataset to identify data quality issues.

In [None]:
# ========================================
# Display Basic Dataset Information
# ========================================
print("Dataset Info:")
print("=" * 50)

# .info() shows: column names, data types, non-null counts, and memory usage
df.info()

print("\nDataset Shape:")
# .shape returns a tuple: (number of rows, number of columns)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

In [None]:
# ========================================
# Check for Missing Values
# ========================================
print("Missing Values Count:")
print("=" * 50)

# .isnull() returns True for missing values, .sum() counts them
missing = df.isnull().sum()

# Calculate what percentage of each column is missing
missing_pct = (df.isnull().sum() / len(df)) * 100

# Create a summary DataFrame showing both count and percentage
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})

# Only show columns that actually have missing values
print(missing_df[missing_df['Missing Count'] > 0])

In [None]:
# ========================================
# Check for Duplicate Rows
# ========================================
# .duplicated() returns True for rows that are exact copies of earlier rows
print(f"Number of duplicate rows: {df.duplicated().sum()}")

# Show some example duplicate rows
print(f"\nSample duplicate rows:")
# keep=False marks ALL duplicates (not just the second occurrence)
df[df.duplicated(keep=False)].sort_values('date').head(10)
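
The difference between the default `keep='first'` and `keep=False` is easiest to see on a tiny frame. The sketch below uses a made-up four-row table (`toy` is an illustrative name, not part of the lesson dataset) in which rows 0 and 2 are identical.

```python
import pandas as pd

# Toy frame: rows 0 and 2 are exact copies of each other
toy = pd.DataFrame({
    "product": ["Laptop", "Phone", "Laptop", "Mouse"],
    "sales": [1200, 800, 1200, 25],
})

# Default keep='first': only the later copy (row 2) is flagged
flag_later = toy.duplicated()

# keep=False: every member of a duplicate group is flagged (rows 0 and 2)
flag_all = toy.duplicated(keep=False)

print(flag_later.tolist())  # [False, False, True, False]
print(flag_all.tolist())    # [True, False, True, False]
```

Summing the default flags counts the rows that `drop_duplicates()` would remove, while `keep=False` is the better choice for viewing both copies side by side.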

In [None]:
# ========================================
# Check for Inconsistent Values
# ========================================
# Look at the unique values in categorical columns

print("Unique values in 'region' column:")
# .value_counts() shows each unique value and how many times it appears
# Notice: 'North', 'north', and 'SOUTH' - these should be standardized!
print(df['region'].value_counts())

print("\nUnique values in 'product' column:")
print(df['product'].value_counts())

---
## Part 2: Data Cleaning

Now we'll systematically clean the dataset.

### Step 2.1: Remove Duplicates

In [None]:
# ========================================
# Remove Duplicate Rows
# ========================================
# Create a copy of the dataframe so we don't modify the original
# This is good practice - always preserve your raw data!
df_clean = df.copy()

print(f"Before removing duplicates: {len(df_clean)} records")

# .drop_duplicates() removes all rows that are exact copies
# By default, it keeps the first occurrence and removes later ones
df_clean = df_clean.drop_duplicates()

print(f"After removing duplicates: {len(df_clean)} records")
print(f"Removed {len(df) - len(df_clean)} duplicate records")

### Step 2.2: Standardize Categorical Data

In [None]:
# ========================================
# Standardize Categorical Data
# ========================================
# Fix inconsistent capitalization in the 'region' column

print("Before standardization:")
print(df_clean['region'].value_counts())

# .str.title() converts text to Title Case (First Letter Capitalized)
# This makes 'north' → 'North', 'SOUTH' → 'South', etc.
df_clean['region'] = df_clean['region'].str.title()

print("\nAfter standardization:")
print(df_clean['region'].value_counts())
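
Real exports often mix casing problems with stray whitespace. The lesson data only has casing issues, so the padded value `' East '` in this small sketch is an assumption added for illustration.

```python
import pandas as pd

# Illustrative values: mixed casing plus one entry with surrounding spaces
regions = pd.Series(["north", "SOUTH", " East ", "West"])

# .str.strip() removes surrounding whitespace; .str.title() then fixes the casing
standardized = regions.str.strip().str.title()

print(standardized.tolist())  # ['North', 'South', 'East', 'West']
print(standardized.nunique(), "unique regions")
```

Chaining the two string methods keeps the cleanup in one readable expression. Checking `value_counts()` before and after remains the simplest way to confirm that the categories collapsed as intended.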

### Step 2.3: Handle Missing Values

In [None]:
# ========================================
# Handle Missing Sales Values
# ========================================
# Strategy: Fill missing sales with the median sales for that product
# Why? Different products have different typical prices, so we use product-specific medians

print("Handling missing sales values...")
print(f"Missing sales before: {df_clean['sales'].isnull().sum()}")

# .groupby('product') groups rows by product
# .transform() applies the function to each group and returns a Series the same size as the original
# lambda x: x.fillna(x.median()) fills missing values with that group's median
df_clean['sales'] = df_clean.groupby('product')['sales'].transform(
    lambda x: x.fillna(x.median())
)

print(f"Missing sales after: {df_clean['sales'].isnull().sum()}")
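
To check the group-wise fill by hand, the following sketch applies the same `groupby(...).transform(...)` pattern to six made-up rows. The frame `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Two products, each with one missing sales value
toy = pd.DataFrame({
    "product": ["Laptop", "Laptop", "Laptop", "Mouse", "Mouse", "Mouse"],
    "sales": [1000.0, np.nan, 3000.0, 20.0, 40.0, np.nan],
})

# The median skips NaN, so Laptop -> 2000.0 and Mouse -> 30.0
filled = toy.groupby("product")["sales"].transform(lambda s: s.fillna(s.median()))

print(filled.tolist())  # [1000.0, 2000.0, 3000.0, 20.0, 40.0, 30.0]
```

The missing Laptop sale is filled with the Laptop median and the missing Mouse sale with the Mouse median, which a single global median could not do. If a product had no recorded sales at all, its median would itself be NaN and the gap would remain.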

In [None]:
# ========================================
# Handle Missing Satisfaction Scores
# ========================================
# Strategy: Fill with the mode (most common value)
# Why? Satisfaction is an ordinal 1-5 score, so the mode always yields a valid score
# (a median can land on a half point such as 3.5)

print("Handling missing satisfaction values...")
print(f"Missing satisfaction before: {df_clean['satisfaction'].isnull().sum()}")

# .mode()[0] gets the most common value (the first mode if there are ties)
satisfaction_mode = df_clean['satisfaction'].mode()[0]

# .fillna() replaces all NaN (missing) values with the specified value
df_clean['satisfaction'] = df_clean['satisfaction'].fillna(satisfaction_mode)

print(f"Missing satisfaction after: {df_clean['satisfaction'].isnull().sum()}")
print(f"Used mode value: {satisfaction_mode}")

### Step 2.4: Identify and Handle Outliers

In [None]:
# ========================================
# Identify Outliers Using IQR Method
# ========================================
# IQR (Interquartile Range) is a statistical method to detect outliers
# It finds values that are unusually far from the middle 50% of the data

# Calculate quartiles (25th and 75th percentiles)
Q1 = df_clean['sales'].quantile(0.25)  # 25% of data is below this value
Q3 = df_clean['sales'].quantile(0.75)  # 75% of data is below this value
IQR = Q3 - Q1                           # The range of the middle 50% of data

# Calculate outlier boundaries
# Values beyond 1.5 × IQR from Q1 or Q3 are considered outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Sales Statistics:")
print(f"Q1 (25th percentile): ${Q1:,.2f}")
print(f"Q3 (75th percentile): ${Q3:,.2f}")
print(f"IQR: ${IQR:,.2f}")
print(f"\nOutlier bounds:")
print(f"Lower bound: ${lower_bound:,.2f}")
print(f"Upper bound: ${upper_bound:,.2f}")

# Find all outliers (values outside the bounds)
outliers = df_clean[(df_clean['sales'] < lower_bound) | (df_clean['sales'] > upper_bound)]
print(f"\nNumber of outliers detected: {len(outliers)}")

# Display the outlier records
if len(outliers) > 0:
    print("\nOutlier records:")
    print(outliers[['date', 'product', 'sales', 'quantity']].sort_values('sales', ascending=False))
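
The IQR arithmetic can be verified by hand on a short series. This sketch uses nine made-up values with one obvious extreme. The lowercase names (`q1`, `q3`, `iqr`, `lower`, `upper`) are illustrative and deliberately differ from the lesson's variables.

```python
import pandas as pd

# Eight ordinary values and one extreme value
sales = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 10_000])

q1 = sales.quantile(0.25)   # 300.0
q3 = sales.quantile(0.75)   # 700.0
iqr = q3 - q1               # 400.0

lower = q1 - 1.5 * iqr      # 300 - 600 = -300.0
upper = q3 + 1.5 * iqr      # 700 + 600 = 1300.0

# Keep only the values that fall outside the bounds
flagged = sales[(sales < lower) | (sales > upper)]
print(f"Bounds: [{lower}, {upper}]")
print(flagged.tolist())     # [10000]
```

Quartiles depend on rank order, so the single extreme value does not shift the bounds the way it would shift a mean.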

In [None]:
# ========================================
# Remove Outliers
# ========================================
# We'll remove values that fall outside our calculated bounds
# This helps prevent extreme values from skewing our analysis

# Filter to keep only rows within the bounds
# The & operator means "and" - both conditions must be true
df_clean_no_outliers = df_clean[(df_clean['sales'] >= lower_bound) & (df_clean['sales'] <= upper_bound)]

print(f"Records before removing outliers: {len(df_clean)}")
print(f"Records after removing outliers: {len(df_clean_no_outliers)}")
print(f"Outliers removed: {len(df_clean) - len(df_clean_no_outliers)}")

# Use the cleaned dataset for the rest of our analysis
df_clean = df_clean_no_outliers.copy()

### Step 2.5: Add Derived Columns

In [None]:
# ========================================
# Add Derived Columns
# ========================================
# Create new columns calculated from existing data
# These give us additional insights for analysis

# Calculate average price per unit for each transaction
df_clean['price_per_unit'] = df_clean['sales'] / df_clean['quantity']

# Extract time-based features from the date column
# Convert to datetime once, then reuse the result for every feature
date_col = pd.to_datetime(df_clean['date'])
df_clean['month'] = date_col.dt.month                 # Month as number (1-12)
df_clean['month_name'] = date_col.dt.strftime('%B')   # Month name (January, etc.)
df_clean['quarter'] = date_col.dt.quarter             # Quarter (1, 2, 3, or 4)

# Create age groups using pd.cut()
# bins: boundaries for each group (each bin includes its right edge), labels: names for each group
df_clean['age_group'] = pd.cut(df_clean['customer_age'], 
                               bins=[0, 25, 35, 50, 100],
                               labels=['18-25', '26-35', '36-50', '51+'])

print("New columns added:")
print(df_clean[['sales', 'quantity', 'price_per_unit', 'month_name', 'quarter', 'age_group']].head())
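
Bin edges are a common source of off-by-one surprises, so it helps to test `pd.cut()` on boundary ages. In this sketch the ages are hand-picked to sit on each edge. The final label is spelled `'51+'` because the interval (50, 100] starts just above 50.

```python
import pandas as pd

# Ages chosen to sit on or just past each bin boundary
ages = pd.Series([18, 25, 26, 35, 36, 50, 51, 74])

# By default each bin includes its right edge: (0, 25], (25, 35], (35, 50], (50, 100]
groups = pd.cut(ages, bins=[0, 25, 35, 50, 100],
                labels=["18-25", "26-35", "36-50", "51+"])

print(groups.astype(str).tolist())
# ['18-25', '18-25', '26-35', '26-35', '36-50', '36-50', '51+', '51+']
```

An age of exactly 50 lands in the `'36-50'` group. Passing `right=False` to `pd.cut()` would instead make each bin include its left edge.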

---
## Part 3: Data Subsetting and Filtering

Let's create different subsets of data for focused analysis.

### Exercise 3.1: Filter by Product Category

In [None]:
# ========================================
# Filter by Product Category
# ========================================
# Create a subset containing only high-value products

# .isin() checks if each value is in the provided list
# This returns only rows where product is 'Laptop' or 'Monitor'
high_value_products = df_clean[df_clean['product'].isin(['Laptop', 'Monitor'])].copy()

print(f"High-value products subset: {len(high_value_products)} records")
print(f"\nProduct distribution:")
print(high_value_products['product'].value_counts())

# Display summary statistics grouped by product
print("\nSummary Statistics:")
print(high_value_products.groupby('product')['sales'].describe())

### Exercise 3.2: Filter by Region and Time Period

In [None]:
# ========================================
# Filter by Region and Time Period
# ========================================
# Create a subset for the first half of the year in specific regions

# Combine multiple conditions using & (and)
# Parentheses are required when using & operator
first_half = df_clean[
    (df_clean['quarter'].isin([1, 2])) &      # Q1 or Q2
    (df_clean['region'].isin(['North', 'East']))  # North or East region
].copy()

print(f"First half (Q1-Q2) North & East subset: {len(first_half)} records")
print(f"\nRegion distribution:")
print(first_half['region'].value_counts())
print(f"\nQuarter distribution:")
print(first_half['quarter'].value_counts())
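
The parentheses around each condition matter because `&` binds more tightly than comparison operators in Python. The sketch below shows the working form and then the failure you get without parentheses. The frame `toy` and its values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "quarter": [1, 2, 3, 4],
    "sales": [500, 2500, 3000, 100],
})

# Each comparison is wrapped in parentheses so it is evaluated before &
mask = (toy["quarter"] <= 2) & (toy["sales"] > 1000)
subset = toy[mask]
print(subset)

# Without parentheses, Python evaluates `2 & toy["sales"]` first and then
# treats the rest as a chained comparison, which cannot be reduced to True/False
try:
    toy[toy["quarter"] <= 2 & toy["sales"] > 1000]
    raised = False
except ValueError:
    raised = True
print("Unparenthesized version raised ValueError:", raised)
```

Only the Q2 row with sales of 2500 satisfies both conditions. The same rule applies to `|` (or) and `~` (not) when building masks.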

### Exercise 3.3: Filter by Multiple Conditions

In [None]:
# ========================================
# Filter by Multiple Conditions
# ========================================
# Create a subset of high-performing transactions with young, satisfied customers

# Combine three conditions using & (and)
premium_segment = df_clean[
    (df_clean['satisfaction'] >= 4) &                          # High satisfaction (4 or 5)
    (df_clean['sales'] > 2000) &                              # High sales (over $2000)
    (df_clean['age_group'].isin(['18-25', '26-35']))         # Young customers
].copy()

print(f"Premium segment subset: {len(premium_segment)} records")
print(f"\nAverage sales: ${premium_segment['sales'].mean():,.2f}")
print(f"Average satisfaction: {premium_segment['satisfaction'].mean():.2f}")
print(f"\nAge group distribution:")
print(premium_segment['age_group'].value_counts())

---
## Part 4: Advanced Visualizations

Now let's create professional, publication-quality visualizations.

### Visualization 1: Multi-Panel Sales Analysis Dashboard

In [None]:
# Create a comprehensive sales dashboard
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Sales Performance Dashboard - 2024', fontsize=20, fontweight='bold', y=0.995)

# Color palette
colors = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D', '#6A994E', '#BC4B51']

# 1. Sales by Product (Bar chart)
product_sales = df_clean.groupby('product')['sales'].sum().sort_values(ascending=True)
ax1 = axes[0, 0]
bars = ax1.barh(product_sales.index, product_sales.values, color=colors)
ax1.set_xlabel('Total Sales ($)', fontsize=12, fontweight='bold')
ax1.set_title('Total Sales by Product', fontsize=14, fontweight='bold', pad=10)
ax1.grid(axis='x', alpha=0.3, linestyle='--')

# Add value labels on bars
for bar, value in zip(bars, product_sales.values):
    ax1.text(value, bar.get_y() + bar.get_height()/2, 
             f'${value:,.0f}', 
             va='center', ha='left', fontsize=10, fontweight='bold')

# 2. Sales by Region (Pie chart with explosion)
region_sales = df_clean.groupby('region')['sales'].sum()
ax2 = axes[0, 1]
explode = [0.05 if i == region_sales.argmax() else 0 for i in range(len(region_sales))]
wedges, texts, autotexts = ax2.pie(region_sales.values, 
                                     labels=region_sales.index,
                                     autopct='%1.1f%%',
                                     colors=colors[:len(region_sales)],
                                     explode=explode,
                                     shadow=True,
                                     startangle=90)
ax2.set_title('Sales Distribution by Region', fontsize=14, fontweight='bold', pad=10)

# Style the percentage text
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')
    autotext.set_fontsize(10)

# 3. Monthly Sales Trend (Line chart)
monthly_sales = df_clean.groupby('month')['sales'].agg(['sum', 'mean'])
ax3 = axes[1, 0]
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

ax3_twin = ax3.twinx()
line1 = ax3.plot(monthly_sales.index, monthly_sales['sum'], 
                 marker='o', linewidth=2.5, markersize=8, 
                 color=colors[0], label='Total Sales')
line2 = ax3_twin.plot(monthly_sales.index, monthly_sales['mean'], 
                      marker='s', linewidth=2.5, markersize=8, 
                      color=colors[1], label='Average Sales', linestyle='--')

ax3.set_xlabel('Month', fontsize=12, fontweight='bold')
ax3.set_ylabel('Total Sales ($)', fontsize=12, fontweight='bold', color=colors[0])
ax3_twin.set_ylabel('Average Sales ($)', fontsize=12, fontweight='bold', color=colors[1])
ax3.set_title('Monthly Sales Trends', fontsize=14, fontweight='bold', pad=10)
ax3.set_xticks(range(1, 13))
ax3.set_xticklabels(month_names, rotation=45)
ax3.grid(True, alpha=0.3, linestyle='--')
ax3.tick_params(axis='y', labelcolor=colors[0])
ax3_twin.tick_params(axis='y', labelcolor=colors[1])

# Combine legends
lines = line1 + line2
labels = [l.get_label() for l in lines]
ax3.legend(lines, labels, loc='upper left', framealpha=0.9)

# 4. Satisfaction vs Sales (Scatter plot with size variation)
satisfaction_sales = df_clean.groupby('satisfaction').agg({
    'sales': ['mean', 'count']
}).reset_index()
satisfaction_sales.columns = ['satisfaction', 'avg_sales', 'count']

ax4 = axes[1, 1]
scatter = ax4.scatter(satisfaction_sales['satisfaction'], 
                      satisfaction_sales['avg_sales'],
                      s=satisfaction_sales['count']*2,  # Size based on count
                      c=satisfaction_sales['satisfaction'],
                      cmap='RdYlGn',
                      alpha=0.6,
                      edgecolors='black',
                      linewidth=2)

# Add trend line
z = np.polyfit(satisfaction_sales['satisfaction'], satisfaction_sales['avg_sales'], 1)
p = np.poly1d(z)
ax4.plot(satisfaction_sales['satisfaction'], 
         p(satisfaction_sales['satisfaction']), 
         "r--", linewidth=2, alpha=0.8, label='Trend')

ax4.set_xlabel('Customer Satisfaction Score', fontsize=12, fontweight='bold')
ax4.set_ylabel('Average Sales ($)', fontsize=12, fontweight='bold')
ax4.set_title('Satisfaction vs Average Sales\n(Bubble size = # of transactions)', 
              fontsize=14, fontweight='bold', pad=10)
ax4.set_xticks([1, 2, 3, 4, 5])
ax4.grid(True, alpha=0.3, linestyle='--')
ax4.legend()

# Add colorbar
cbar = plt.colorbar(scatter, ax=ax4)
cbar.set_label('Satisfaction Score', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("Dashboard created successfully!")

### Visualization 2: Product Performance Heatmap with Annotations

In [None]:
# Create a heatmap showing product performance across regions
pivot_data = df_clean.pivot_table(
    values='sales',
    index='product',
    columns='region',
    aggfunc='mean'
)

fig, ax = plt.subplots(figsize=(12, 8))

# Create heatmap
im = ax.imshow(pivot_data.values, cmap='YlOrRd', aspect='auto')

# Set ticks and labels
ax.set_xticks(np.arange(len(pivot_data.columns)))
ax.set_yticks(np.arange(len(pivot_data.index)))
ax.set_xticklabels(pivot_data.columns, fontsize=12, fontweight='bold')
ax.set_yticklabels(pivot_data.index, fontsize=12, fontweight='bold')

# Keep the x tick labels horizontal and centered
plt.setp(ax.get_xticklabels(), rotation=0, ha="center")

# Add text annotations
for i in range(len(pivot_data.index)):
    for j in range(len(pivot_data.columns)):
        value = pivot_data.values[i, j]
        text = ax.text(j, i, f'${value:.0f}',
                      ha="center", va="center", 
                      color="white" if value > pivot_data.values.mean() else "black",
                      fontweight='bold', fontsize=11)

# Add colorbar
cbar = ax.figure.colorbar(im, ax=ax)
cbar.ax.set_ylabel('Average Sales ($)', rotation=-90, va="bottom", 
                   fontsize=12, fontweight='bold')

# Add title and labels
ax.set_title('Product Performance Heatmap by Region\nAverage Sales ($)', 
             fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Region', fontsize=14, fontweight='bold')
ax.set_ylabel('Product', fontsize=14, fontweight='bold')

# Add grid
ax.set_xticks(np.arange(pivot_data.shape[1]+1)-.5, minor=True)
ax.set_yticks(np.arange(pivot_data.shape[0]+1)-.5, minor=True)
ax.grid(which="minor", color="white", linestyle='-', linewidth=2)

plt.tight_layout()
plt.show()

print("\nHeatmap Analysis:")
print("=" * 60)
print(f"Highest average sales: ${pivot_data.values.max():.2f}")
print(f"Lowest average sales: ${pivot_data.values.min():.2f}")

# Find best performing product-region combination
max_idx = np.unravel_index(pivot_data.values.argmax(), pivot_data.values.shape)
best_product = pivot_data.index[max_idx[0]]
best_region = pivot_data.columns[max_idx[1]]
print(f"\nBest performing combination: {best_product} in {best_region} region")
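
The `argmax()` plus `np.unravel_index()` step converts a position in the flattened array back into a (row, column) pair. Here is a minimal sketch on a hand-made 2×3 grid, where `grid` is an illustrative stand-in for the pivot table's values.

```python
import numpy as np

grid = np.array([[10, 40, 20],
                 [30, 90, 50]])

# argmax() works on the flattened array: [10, 40, 20, 30, 90, 50] -> position 4
flat_pos = grid.argmax()

# unravel_index maps that flat position back to (row, column) for the 2x3 shape
row, col = np.unravel_index(flat_pos, grid.shape)

print(flat_pos, row, col)  # 4 1 1
print(grid[row, col])      # 90
```

If a pivot table contains NaN for a product-region pair with no sales, plain `argmax()` on its values would point at the NaN. `np.nanargmax()` is the NumPy function that skips missing entries.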

### Visualization 3: Advanced Multi-Dimensional Analysis

In [None]:
# Create a comprehensive analysis showing multiple dimensions
fig = plt.figure(figsize=(18, 10))
gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)

# Main title
fig.suptitle('Multi-Dimensional Sales Analysis', 
             fontsize=22, fontweight='bold', y=0.98)

# 1. Stacked Bar Chart: Sales by Product and Region
ax1 = fig.add_subplot(gs[0, :])
pivot_region_product = df_clean.pivot_table(
    values='sales',
    index='product',
    columns='region',
    aggfunc='sum'
)

pivot_region_product.plot(kind='bar', stacked=True, ax=ax1, 
                          color=colors, width=0.7, edgecolor='white', linewidth=1.5)
ax1.set_title('Total Sales by Product Across Regions', 
              fontsize=16, fontweight='bold', pad=15)
ax1.set_xlabel('Product', fontsize=13, fontweight='bold')
ax1.set_ylabel('Total Sales ($)', fontsize=13, fontweight='bold')
ax1.legend(title='Region', title_fontsize=12, fontsize=11, 
           loc='upper right', framealpha=0.9)
ax1.tick_params(axis='x', rotation=45)
ax1.grid(axis='y', alpha=0.3, linestyle='--')

# 2. Box Plot: Sales Distribution by Age Group
ax2 = fig.add_subplot(gs[1, 0])
age_groups = df_clean['age_group'].cat.categories
data_by_age = [df_clean[df_clean['age_group'] == ag]['sales'].values 
               for ag in age_groups]

# Note: Matplotlib 3.9+ prefers tick_labels= here; labels= still works but is deprecated
bp = ax2.boxplot(data_by_age, labels=age_groups, patch_artist=True,
                 notch=True, showmeans=True,
                 boxprops=dict(facecolor=colors[0], alpha=0.7),
                 medianprops=dict(color='red', linewidth=2),
                 meanprops=dict(marker='D', markerfacecolor='yellow', 
                               markeredgecolor='black', markersize=8),
                 whiskerprops=dict(linewidth=1.5),
                 capprops=dict(linewidth=1.5))

ax2.set_title('Sales Distribution by Age Group', 
              fontsize=14, fontweight='bold', pad=10)
ax2.set_xlabel('Age Group', fontsize=12, fontweight='bold')
ax2.set_ylabel('Sales ($)', fontsize=12, fontweight='bold')
ax2.grid(axis='y', alpha=0.3, linestyle='--')

# 3. Grouped Bar Chart: Quantity by Product and Quarter
ax3 = fig.add_subplot(gs[1, 1])
quarterly_product = df_clean.groupby(['product', 'quarter'])['quantity'].sum().unstack()

x = np.arange(len(quarterly_product.index))
width = 0.2

for i, quarter in enumerate(quarterly_product.columns):
    offset = width * (i - len(quarterly_product.columns)/2 + 0.5)
    ax3.bar(x + offset, quarterly_product[quarter], width, 
            label=f'Q{quarter}', color=colors[i], alpha=0.8, edgecolor='black')

ax3.set_title('Total Quantity Sold by Quarter', 
              fontsize=14, fontweight='bold', pad=10)
ax3.set_xlabel('Product', fontsize=12, fontweight='bold')
ax3.set_ylabel('Total Quantity', fontsize=12, fontweight='bold')
ax3.set_xticks(x)
ax3.set_xticklabels(quarterly_product.index, rotation=45, ha='right')
ax3.legend(title='Quarter', fontsize=10)
ax3.grid(axis='y', alpha=0.3, linestyle='--')

# 4. Violin Plot: Price Distribution by Satisfaction
ax4 = fig.add_subplot(gs[1, 2])
satisfaction_levels = sorted(df_clean['satisfaction'].unique())
data_by_satisfaction = [df_clean[df_clean['satisfaction'] == s]['price_per_unit'].values 
                        for s in satisfaction_levels]

parts = ax4.violinplot(data_by_satisfaction, positions=satisfaction_levels,
                       widths=0.7, showmeans=True, showmedians=True)

# Customize violin plot colors
for i, pc in enumerate(parts['bodies']):
    pc.set_facecolor(colors[i % len(colors)])
    pc.set_alpha(0.7)

ax4.set_title('Price Distribution by Satisfaction', 
              fontsize=14, fontweight='bold', pad=10)
ax4.set_xlabel('Satisfaction Score', fontsize=12, fontweight='bold')
ax4.set_ylabel('Price per Unit ($)', fontsize=12, fontweight='bold')
ax4.set_xticks(satisfaction_levels)
ax4.grid(axis='y', alpha=0.3, linestyle='--')

plt.show()

print("Multi-dimensional analysis visualization complete!")

---
## Part 5: Summary Statistics for Cleaned Data

In [None]:
# Generate comprehensive summary report
print("="*70)
print("DATA CLEANING SUMMARY REPORT")
print("="*70)

print("\n1. DATASET OVERVIEW")
print("-" * 70)
print(f"Total records in cleaned dataset: {len(df_clean):,}")
print(f"Total records in original dataset: {len(df):,}")
print(f"Records removed: {len(df) - len(df_clean):,}")
print(f"Percentage retained: {(len(df_clean)/len(df)*100):.2f}%")

print("\n2. SALES STATISTICS")
print("-" * 70)
print(f"Total Sales: ${df_clean['sales'].sum():,.2f}")
print(f"Average Sale: ${df_clean['sales'].mean():,.2f}")
print(f"Median Sale: ${df_clean['sales'].median():,.2f}")
print(f"Standard Deviation: ${df_clean['sales'].std():,.2f}")

print("\n3. PRODUCT PERFORMANCE")
print("-" * 70)
product_summary = df_clean.groupby('product').agg({
    'sales': ['sum', 'mean', 'count']
}).round(2)
product_summary.columns = ['Total Sales', 'Avg Sales', 'Transactions']
product_summary['Market Share %'] = (product_summary['Total Sales'] / 
                                     product_summary['Total Sales'].sum() * 100).round(2)
print(product_summary.sort_values('Total Sales', ascending=False))

print("\n4. REGIONAL PERFORMANCE")
print("-" * 70)
region_summary = df_clean.groupby('region').agg({
    'sales': ['sum', 'mean'],
    'satisfaction': 'mean'
}).round(2)
region_summary.columns = ['Total Sales', 'Avg Sales', 'Avg Satisfaction']
print(region_summary.sort_values('Total Sales', ascending=False))

print("\n5. CUSTOMER INSIGHTS")
print("-" * 70)
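# observed=True groups only by age groups that actually occur in the data
age_summary = df_clean.groupby('age_group', observed=True).agg({
    'sales': ['mean', 'count'],
    'satisfaction': 'mean'
}).round(2)
# 'count' tallies transactions, not unique customers
age_summary.columns = ['Avg Sales', 'Transactions', 'Avg Satisfaction']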
print(age_summary)

print("\n" + "="*70)
print("END OF REPORT")
print("="*70)

---
## Exercises for Students

### Exercise A: Custom Filtering
Create a subset of data for:
1. Sales in the West region during Q3 and Q4
2. Customers aged 36-50 who purchased Tablets or Phones
3. Low satisfaction (1-2) transactions with sales > $1000

### Exercise B: Additional Visualizations
Create the following visualizations:
1. A histogram showing the distribution of sales amounts
2. A scatter plot showing customer age vs sales with region colors
3. A bar chart comparing average satisfaction across products

### Exercise C: Data Quality Checks
1. Check for any negative values in sales or quantity
2. Identify any dates outside the year 2024
3. Find products with unusually low or high average prices

### Exercise D: Advanced Analysis
1. Calculate the correlation between customer age and satisfaction
2. Identify the most profitable product-region combination
3. Determine if there's a seasonal trend in sales

In [None]:
# Space for student exercises
# Work through Exercises A-D above in this cell

---
## Conclusion

In this lesson, you learned:
- ✅ How to identify and handle data quality issues (missing values, duplicates, outliers)
- ✅ Techniques for standardizing and transforming data
- ✅ Methods to filter and subset data based on multiple criteria
- ✅ Creating professional, multi-dimensional visualizations
- ✅ Using pandas and matplotlib together for comprehensive analysis

### Key Takeaways:
1. **Always inspect your data first** - understand what you're working with
2. **Clean systematically** - handle issues in a logical order
3. **Document your decisions** - keep track of what cleaning steps you performed
4. **Visualize strategically** - choose the right chart type for your message
5. **Tell a story** - combine multiple visualizations to provide complete insights

---
## 🎯 Learning Challenges

Test your skills with these hands-on challenges! Each challenge includes hints to help you succeed.

### 🌟 Challenge 1: Find the Best-Selling Product in Each Region

**Task:** Write code to identify which product has the highest total sales in each region.

**What you'll practice:**
- Grouping data by multiple columns
- Finding maximum values
- Working with pivot tables or groupby operations

**Hints:**
- Use `df_clean.groupby()` with both 'region' and 'product' columns
- The `.sum()` function will help you get total sales
- Try using `.idxmax()` to find the product with maximum sales in each group
- Alternatively, you could use `.sort_values()` and `.head(1)` for each group

**Expected output:** A result showing each region and its top-selling product with the sales amount.
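
As a warm-up, the grouping-then-`idxmax()` pattern from the hints can be tried on unrelated toy data first. The column names `store`, `item`, and `revenue` are hypothetical and chosen so that this sketch is not a drop-in solution for `df_clean`.

```python
import pandas as pd

toy = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "item": ["Pen", "Ink", "Pen", "Pen", "Ink", "Ink"],
    "revenue": [5, 20, 10, 8, 3, 4],
})

# Total revenue for every (store, item) pair
totals = toy.groupby(["store", "item"])["revenue"].sum()

# Within each store, idxmax() returns the (store, item) label of the largest total
top_labels = totals.groupby(level="store").idxmax()

# Look those labels up again to show the winning totals
top_items = totals.loc[top_labels.tolist()]
print(top_items)
```

Store A's top item is Ink (20) and store B's is Pen (8). Adapting the idea to the lesson data means swapping in the real column names.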

In [None]:
# Challenge 1: Your code here
# TODO: Find the best-selling product in each region



---
### 🌟 Challenge 2: Create a Sales Performance Rating System

**Task:** Add a new column to `df_clean` called `performance_rating` that categorizes each sale based on the sales amount:
- 'Low' for sales < $1,500
- 'Medium' for sales between $1,500 and $3,000
- 'High' for sales > $3,000

Then, count how many transactions fall into each category.

**What you'll practice:**
- Creating conditional columns
- Using `pd.cut()` or `np.where()` or `.apply()`
- Value counting and categorization

**Hints:**
- `pd.cut()` is perfect for creating bins/categories based on ranges
- Set the `bins` parameter to [0, 1500, 3000, infinity]
- Set the `labels` parameter to ['Low', 'Medium', 'High']
- Use `np.inf` for infinity
- After creating the column, use `.value_counts()` to see the distribution

**Bonus:** Calculate the average satisfaction score for each performance rating level.

In [None]:
# Challenge 2: Your code here
# TODO: Create performance_rating column and analyze it



---
### 🌟 Challenge 3: Visualize the Age Distribution with Style

**Task:** Create a histogram showing the distribution of customer ages with the following features:
- Use 10 bins
- Add a title and axis labels
- Use a custom color
- Add a grid for easier reading
- Display the mean age as a vertical line in a different color

**What you'll practice:**
- Creating histograms with matplotlib
- Customizing plot appearance
- Adding reference lines
- Using colors and styling

**Hints:**
- Use `plt.figure(figsize=(10, 6))` to create a larger figure
- `plt.hist()` creates the histogram; use the `bins` and `color` parameters
- `plt.axvline()` adds a vertical line; use `df_clean['customer_age'].mean()` for the x position
- Use `linestyle='--'` and `linewidth=2` for a dashed line
- Don't forget `plt.xlabel()`, `plt.ylabel()`, and `plt.title()`
- `plt.grid(alpha=0.3)` adds a subtle grid
- Add a legend with `plt.legend()` to label the mean line

**Example color:** Try '#3498db' for blue or '#e74c3c' for red

In [None]:
# Challenge 3: Your code here
# TODO: Create a styled histogram of customer ages



---
### 🌟 Challenge 4: Filter and Analyze Peak Season

**Task:** Identify the "peak season" (the quarter with the highest total sales), then:
1. Create a subset of data containing only that quarter
2. Calculate which product was most popular (highest quantity sold) during peak season
3. Calculate the average customer age during peak season
4. Create a simple bar chart showing sales by region during that quarter

**What you'll practice:**
- Complex filtering with multiple steps
- Aggregating data in different ways
- Creating focused visualizations
- Combining multiple pandas operations

**Hints:**
- First, use `df_clean.groupby('quarter')['sales'].sum()` to find total sales per quarter
- Use `.idxmax()` to find which quarter has the maximum sales
- Filter the dataframe: `peak_data = df_clean[df_clean['quarter'] == peak_quarter]`
- For most popular product, group by product and sum quantity
- Use `.plot(kind='bar')` on grouped data for a quick bar chart
- Add `.sort_values(ascending=False)` to see results in descending order

**Bonus:** Add the average sales value as a horizontal line on your bar chart!

In [None]:
# Challenge 4: Your code here
# TODO: Analyze peak season data



---
### 🌟 Challenge 5: Create a Customer Satisfaction Dashboard

**Task:** Create a 2x2 subplot figure that shows:
1. **Top-left:** Pie chart of satisfaction score distribution (how many 1s, 2s, 3s, etc.)
2. **Top-right:** Bar chart showing average sales by satisfaction level
3. **Bottom-left:** Count of transactions by age group
4. **Bottom-right:** Scatter plot of quantity vs. sales (colored by satisfaction)

**What you'll practice:**
- Creating subplot figures
- Multiple plot types in one visualization
- Using color to represent additional dimensions
- Creating comprehensive dashboards

**Hints:**
- Start with `fig, axes = plt.subplots(2, 2, figsize=(14, 10))`
- Access each subplot using `axes[0, 0]`, `axes[0, 1]`, etc.
- For pie chart: `df_clean['satisfaction'].value_counts().plot(kind='pie', ax=axes[0,0])`
- For bar chart: use `.groupby('satisfaction')['sales'].mean()` then `.plot(kind='bar')`
- For scatter with color: `axes[1,1].scatter(x, y, c=df_clean['satisfaction'], cmap='viridis')`
- Add titles to each subplot using `axes[row, col].set_title('Title')`
- Use `plt.tight_layout()` at the end to prevent overlap

**Extra challenge:** Add a colorbar to the scatter plot to show what the colors mean!

In [None]:
# Challenge 5: Your code here
# TODO: Create a 2x2 dashboard with multiple visualizations



---
## 💡 Tips for Success

**When working on these challenges:**

1. **Read the task carefully** - Make sure you understand what's being asked before you start coding

2. **Use the hints** - They're there to guide you in the right direction, not give away the answer

3. **Test incrementally** - Don't write all the code at once. Write a little, test it, then add more

4. **Check your output** - Does the result make sense? Are there any unexpected values?

5. **Review earlier examples** - Look back at the visualizations and code from earlier in the notebook

6. **Don't be afraid to experiment** - Try different approaches! Data analysis is creative

7. **Use print statements** - `print()` is your friend for debugging and understanding what your code does

8. **Google is your ally** - Looking up pandas or matplotlib documentation is a normal part of coding

**Remember:** The goal is to learn, not to be perfect. Every mistake is a learning opportunity! 🚀

---
# 📚 Solutions to Learning Challenges

Below are the solutions to each challenge. Try to solve them on your own first before looking at these answers!

## Solution 1: Find the Best-Selling Product in Each Region

In [None]:
# ========================================
# Solution 1: Find Best-Selling Product in Each Region
# ========================================

# Step 1: Group by both region AND product, then sum sales
# This creates a total for each region-product combination
region_product_sales = df_clean.groupby(['region', 'product'])['sales'].sum().reset_index()

# Step 2: For each region, find the row with the maximum sales
# .groupby('region')['sales'].idxmax() returns the index of the max value in each group
# .loc[] then selects those specific rows
best_sellers = region_product_sales.loc[
    region_product_sales.groupby('region')['sales'].idxmax()
]

# Display results in a formatted table
print("Best-Selling Product in Each Region:")
print("=" * 60)
for _, row in best_sellers.iterrows():
    print(f"{row['region']:10} | {row['product']:10} | ${row['sales']:,.2f}")

print("\n" + "=" * 60)
print("\nAlternative display:")
print(best_sellers.sort_values('sales', ascending=False))
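
The hints for Challenge 1 also mention `.sort_values()` with `.head(1)` as an alternative to `.idxmax()`. A minimal sketch of that route follows, using a small hypothetical `toy` DataFrame in place of `df_clean` (the values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data standing in for df_clean
toy = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South"],
    "product": ["Laptop", "Phone", "Laptop", "Phone", "Tablet"],
    "sales": [1200.0, 800.0, 300.0, 950.0, 400.0],
})

# Total sales for every region-product pair
totals = toy.groupby(["region", "product"], as_index=False)["sales"].sum()

# Sort so the largest total comes first, then keep the first row per region
top_per_region = (
    totals.sort_values("sales", ascending=False)
          .groupby("region")
          .head(1)
          .sort_values("region")
          .reset_index(drop=True)
)
print(top_per_region)
```

Both approaches give one row per region. Changing `.head(1)` to `.head(2)` would return the top two products per region, which `.idxmax()` cannot do directly.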

---
## Solution 2: Create a Sales Performance Rating System

In [None]:
# ========================================
# Solution 2: Create Performance Rating Column
# ========================================

# Create a copy to avoid modifying the original cleaned data
df_solution = df_clean.copy()

# Use pd.cut() to create categorical bins based on sales values
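# bins: with the default right=True each bin includes its right edge,
#       so the ranges are (0, 1500], (1500, 3000], and (3000, infinity);
#       a sale of exactly 1500 is therefore labeled 'Low'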
# labels: names for each category
# np.inf represents infinity (no upper limit for 'High')
df_solution['performance_rating'] = pd.cut(
    df_solution['sales'],
    bins=[0, 1500, 3000, np.inf],
    labels=['Low', 'Medium', 'High']
)

# Display distribution of performance ratings
print("Performance Rating Distribution:")
print("=" * 60)
print(df_solution['performance_rating'].value_counts().sort_index())

# Show statistics for each rating category
print("\n" + "=" * 60)
print("Sales Statistics by Performance Rating:")
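# observed=False keeps every rating category and avoids the FutureWarning
# that pandas 2.x raises when grouping by a categorical column
print(df_solution.groupby('performance_rating', observed=False)['sales'].agg(['count', 'mean', 'min', 'max']))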

# BONUS: Calculate average satisfaction for each performance level
print("\n" + "=" * 60)
print("BONUS: Average Satisfaction by Performance Rating:")
satisfaction_by_performance = df_solution.groupby('performance_rating', observed=False)['satisfaction'].mean()
print(satisfaction_by_performance)

# ========================================
# Visualize the Results
# ========================================
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Chart 1: Count of transactions by rating
df_solution['performance_rating'].value_counts().sort_index().plot(
    kind='bar', ax=ax1, color=['#e74c3c', '#f39c12', '#27ae60'], edgecolor='black'
)
ax1.set_title('Transaction Count by Performance Rating', fontsize=14, fontweight='bold')
ax1.set_xlabel('Performance Rating', fontweight='bold')
ax1.set_ylabel('Number of Transactions', fontweight='bold')
ax1.tick_params(axis='x', rotation=0)
ax1.grid(axis='y', alpha=0.3)

# Chart 2: Average satisfaction by rating
satisfaction_by_performance.plot(kind='bar', ax=ax2, color=['#e74c3c', '#f39c12', '#27ae60'], 
                                  edgecolor='black')
ax2.set_title('Average Satisfaction by Performance Rating', fontsize=14, fontweight='bold')
ax2.set_xlabel('Performance Rating', fontweight='bold')
ax2.set_ylabel('Average Satisfaction Score', fontweight='bold')
ax2.tick_params(axis='x', rotation=0)
ax2.grid(axis='y', alpha=0.3)
ax2.set_ylim(0, 5)  # Set y-axis limits from 0 to 5

plt.tight_layout()
plt.show()
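
Boundary values deserve a quick check whenever you bin data. The sketch below uses hypothetical sales values that sit on and around the cut points to compare `pd.cut()` with `np.select()`, one way to make 'Low' strictly below $1,500 and 'High' strictly above $3,000 as the challenge describes.

```python
import numpy as np
import pandas as pd

# Hypothetical sales values chosen to sit on and around the boundaries
sales = pd.Series([500.0, 1500.0, 2999.99, 3000.0, 4200.0])

# pd.cut with the default right=True: each bin includes its right edge
cut_labels = pd.cut(sales, bins=[0, 1500, 3000, np.inf],
                    labels=["Low", "Medium", "High"])

# np.select with explicit conditions: Low is < 1500, High is > 3000,
# and everything else (1500 through 3000 inclusive) falls back to Medium
select_labels = np.select(
    [sales < 1500, sales > 3000],
    ["Low", "High"],
    default="Medium",
)

comparison = pd.DataFrame({
    "sales": sales,
    "pd_cut": cut_labels.astype(str),
    "np_select": select_labels,
})
print(comparison)
```

The two columns differ only for the sale of exactly 1500, which `pd.cut()` places in 'Low' and the `np.select()` version places in 'Medium'. Whichever rule you choose, state it in a comment so readers know how edge cases are handled.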

---
## Solution 3: Visualize the Age Distribution with Style

In [None]:
# ========================================
# Solution 3: Create Styled Histogram of Customer Ages
# ========================================

# Create a larger figure for better visibility
plt.figure(figsize=(12, 7))

# Calculate the mean age (we'll show this as a line on the chart)
mean_age = df_clean['customer_age'].mean()

# Create the histogram
# bins=10: divide the age range into 10 equal groups
# color: hex color code for blue
# edgecolor: color of the bar borders
# alpha: transparency (0=invisible, 1=solid)
plt.hist(df_clean['customer_age'], 
         bins=10, 
         color='#3498db', 
         edgecolor='black', 
         alpha=0.7,
         linewidth=1.5)

# Add a vertical line at the mean age
# axvline: adds a vertical line
# linestyle='--': dashed line
# label: text for the legend
plt.axvline(mean_age, 
            color='#e74c3c', 
            linestyle='--', 
            linewidth=3, 
            label=f'Mean Age: {mean_age:.1f}')

# Add titles and labels
plt.title('Distribution of Customer Ages', 
          fontsize=18, 
          fontweight='bold', 
          pad=20)  # pad: space between title and chart
plt.xlabel('Customer Age (years)', 
           fontsize=14, 
           fontweight='bold')
plt.ylabel('Number of Customers', 
           fontsize=14, 
           fontweight='bold')

# Add a subtle grid for easier reading
plt.grid(axis='y', alpha=0.3, linestyle='--')

# Add a legend to explain the mean line
plt.legend(fontsize=12, loc='upper right', framealpha=0.9)

# Add a text box with statistics
median_age = df_clean['customer_age'].median()
# transform=plt.gca().transAxes uses relative positioning (0,0)=bottom-left, (1,1)=top-right
plt.text(0.02, 0.98, 
         f'Statistics:\nMean: {mean_age:.1f}\nMedian: {median_age:.1f}\nMin: {df_clean["customer_age"].min()}\nMax: {df_clean["customer_age"].max()}',
         transform=plt.gca().transAxes,
         fontsize=11,
         verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()  # Adjust spacing to prevent label cutoff
plt.show()

# Print summary statistics
print("Age Distribution Summary:")
print(f"Mean Age: {mean_age:.2f} years")
print(f"Median Age: {median_age:.2f} years")
print(f"Age Range: {df_clean['customer_age'].min()} - {df_clean['customer_age'].max()} years")
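
To see what `bins=10` actually does, you can compute the bin edges yourself with `np.histogram()`, which matplotlib's `hist` relies on for its binning. This is a small sketch with hypothetical ages rather than the lesson's `customer_age` column.

```python
import numpy as np

# Hypothetical ages standing in for df_clean['customer_age']
ages = np.array([18, 22, 25, 31, 34, 38, 41, 47, 52, 58, 63, 68])

# Ten equal-width bins between the minimum and maximum age
counts, edges = np.histogram(ages, bins=10)

bin_width = edges[1] - edges[0]
print(f"Bin width: {bin_width:.1f} years")
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Ten bins always produce eleven edges, and the counts add up to the number of values, which is a handy sanity check before styling the chart.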

---
## Solution 4: Filter and Analyze Peak Season

In [None]:
# ========================================
# Solution 4: Analyze Peak Season
# ========================================

# ----------------------------------------
# Step 1: Find the Peak Quarter
# ----------------------------------------
# Group by quarter and sum all sales
quarterly_sales = df_clean.groupby('quarter')['sales'].sum()

# Find which quarter has the maximum sales
peak_quarter = quarterly_sales.idxmax()  # Returns the quarter number (1, 2, 3, or 4)
peak_sales = quarterly_sales.max()       # Returns the sales value

print("Quarterly Sales Analysis:")
print("=" * 60)
print(quarterly_sales.sort_index())
print(f"\nPeak Season: Q{peak_quarter} with total sales of ${peak_sales:,.2f}")

# ----------------------------------------
# Step 2: Create Subset for Peak Quarter
# ----------------------------------------
# Filter to include only rows from the peak quarter
peak_data = df_clean[df_clean['quarter'] == peak_quarter].copy()

print("\n" + "=" * 60)
print(f"Peak Season (Q{peak_quarter}) Analysis:")
print(f"Total transactions: {len(peak_data)}")

# ----------------------------------------
# Step 3: Find Most Popular Product (by quantity sold)
# ----------------------------------------
# Group by product and sum the quantities
product_quantity = peak_data.groupby('product')['quantity'].sum().sort_values(ascending=False)
most_popular = product_quantity.index[0]    # First item (highest quantity)
most_popular_qty = product_quantity.values[0]

print(f"\nMost Popular Product: {most_popular}")
print(f"Total Quantity Sold: {most_popular_qty}")

print("\nAll Products by Quantity Sold:")
print(product_quantity)

# ----------------------------------------
# Step 4: Calculate Average Customer Age
# ----------------------------------------
avg_age = peak_data['customer_age'].mean()
print(f"\nAverage Customer Age in Q{peak_quarter}: {avg_age:.1f} years")

# ----------------------------------------
# Step 5: Create Bar Chart of Regional Sales
# ----------------------------------------
print("\n" + "=" * 60)
print("Creating visualization...")

fig, ax = plt.subplots(figsize=(12, 7))

# Calculate sales by region for the peak quarter
regional_sales = peak_data.groupby('region')['sales'].sum().sort_values(ascending=False)
colors_region = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D']

# Create bar chart
bars = ax.bar(regional_sales.index, 
               regional_sales.values, 
               color=colors_region[:len(regional_sales)],
               edgecolor='black',
               linewidth=2,
               alpha=0.8)

# Add value labels on top of each bar
for bar in bars:
    height = bar.get_height()
    # Place text at the center of the bar, just above it
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'${height:,.0f}',
            ha='center', va='bottom', 
            fontweight='bold', fontsize=11)

# ----------------------------------------
# BONUS: Add Average Sales Line
# ----------------------------------------
avg_sales = regional_sales.mean()
# axhline: adds a horizontal line
ax.axhline(avg_sales, color='red', linestyle='--', linewidth=2.5, 
           label=f'Average: ${avg_sales:,.0f}')

# Customize the chart
ax.set_title(f'Sales by Region - Q{peak_quarter} (Peak Season)', 
             fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Region', fontsize=13, fontweight='bold')
ax.set_ylabel('Total Sales ($)', fontsize=13, fontweight='bold')
ax.grid(axis='y', alpha=0.3, linestyle='--')
ax.legend(fontsize=11, loc='upper right')

plt.tight_layout()
plt.show()

# Print final summary
print(f"\nRegional Performance in Q{peak_quarter}:")
print(regional_sales)
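
If you expect to repeat this analysis, the non-plotting steps can be wrapped in a function. The sketch below is one possible arrangement: `summarize_peak_quarter` is a hypothetical helper, and the toy data only borrows the column names used in the solution.

```python
import pandas as pd


def summarize_peak_quarter(df):
    """Return the peak quarter and a few summary figures for it."""
    quarterly = df.groupby("quarter")["sales"].sum()
    peak = quarterly.idxmax()              # index label of the largest total
    subset = df[df["quarter"] == peak]     # rows from that quarter only
    by_product = subset.groupby("product")["quantity"].sum()
    return {
        "peak_quarter": peak,
        "total_sales": quarterly.loc[peak],
        "top_product": by_product.idxmax(),
        "avg_age": subset["customer_age"].mean(),
    }


# Hypothetical toy data with the same column names the solution uses
toy = pd.DataFrame({
    "quarter": [1, 1, 2, 2, 2],
    "sales": [100.0, 200.0, 400.0, 50.0, 150.0],
    "product": ["Phone", "Laptop", "Phone", "Tablet", "Phone"],
    "quantity": [1, 2, 3, 5, 1],
    "customer_age": [30, 40, 20, 50, 35],
})

summary = summarize_peak_quarter(toy)
print(summary)
```

Note that `.idxmax()` returns the index label (here the quarter value), not a row position, which is why it can be used directly in the filter.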

---
## Solution 5: Create a Customer Satisfaction Dashboard

In [None]:
# Solution 5: Create a 2x2 Customer Satisfaction Dashboard

# Create the figure with 2x2 subplots
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Customer Satisfaction Dashboard', fontsize=20, fontweight='bold', y=0.995)

# Define color scheme
colors_dashboard = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', '#98D8C8']

# ============================================================
# 1. TOP-LEFT: Pie chart of satisfaction score distribution
# ============================================================
ax1 = axes[0, 0]
satisfaction_counts = df_clean['satisfaction'].value_counts().sort_index()

wedges, texts, autotexts = ax1.pie(
    satisfaction_counts.values,
    labels=[f'Score {i}' for i in satisfaction_counts.index],
    autopct='%1.1f%%',
    colors=colors_dashboard,
    startangle=90,
    explode=[0.05 if i == satisfaction_counts.idxmax() else 0 for i in satisfaction_counts.index]
)

# Style the text
for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')
    autotext.set_fontsize(10)

ax1.set_title('Satisfaction Score Distribution', fontsize=14, fontweight='bold', pad=10)

# ============================================================
# 2. TOP-RIGHT: Bar chart of average sales by satisfaction
# ============================================================
ax2 = axes[0, 1]
avg_sales_by_satisfaction = df_clean.groupby('satisfaction')['sales'].mean().sort_index()

bars = ax2.bar(avg_sales_by_satisfaction.index,
               avg_sales_by_satisfaction.values,
               color=colors_dashboard,
               edgecolor='black',
               linewidth=1.5,
               alpha=0.8)

# Add value labels
for idx, val in avg_sales_by_satisfaction.items():
    ax2.text(idx, val, f'${val:.0f}',
             ha='center', va='bottom',
             fontweight='bold', fontsize=10)

ax2.set_title('Average Sales by Satisfaction Level', fontsize=14, fontweight='bold', pad=10)
ax2.set_xlabel('Satisfaction Score', fontsize=12, fontweight='bold')
ax2.set_ylabel('Average Sales ($)', fontsize=12, fontweight='bold')
ax2.set_xticks(avg_sales_by_satisfaction.index)
ax2.grid(axis='y', alpha=0.3, linestyle='--')

# ============================================================
# 3. BOTTOM-LEFT: Count of transactions by age group
# ============================================================
ax3 = axes[1, 0]
age_group_counts = df_clean['age_group'].value_counts().sort_index()

bars3 = ax3.barh(range(len(age_group_counts)),
                 age_group_counts.values,
                 color=colors_dashboard[:len(age_group_counts)],
                 edgecolor='black',
                 linewidth=1.5,
                 alpha=0.8)

# Add value labels
for i, (idx, val) in enumerate(age_group_counts.items()):
    ax3.text(val, i, f' {val}',
             ha='left', va='center',
             fontweight='bold', fontsize=10)

ax3.set_yticks(range(len(age_group_counts)))
ax3.set_yticklabels(age_group_counts.index)
ax3.set_title('Transaction Count by Age Group', fontsize=14, fontweight='bold', pad=10)
ax3.set_xlabel('Number of Transactions', fontsize=12, fontweight='bold')
ax3.set_ylabel('Age Group', fontsize=12, fontweight='bold')
ax3.grid(axis='x', alpha=0.3, linestyle='--')

# ============================================================
# 4. BOTTOM-RIGHT: Scatter plot of quantity vs sales (colored by satisfaction)
# ============================================================
ax4 = axes[1, 1]

scatter = ax4.scatter(df_clean['quantity'],
                     df_clean['sales'],
                     c=df_clean['satisfaction'],
                     cmap='viridis',
                     s=50,
                     alpha=0.6,
                     edgecolors='black',
                     linewidth=0.5)

ax4.set_title('Sales vs Quantity (Colored by Satisfaction)',
              fontsize=14, fontweight='bold', pad=10)
ax4.set_xlabel('Quantity Sold', fontsize=12, fontweight='bold')
ax4.set_ylabel('Sales Amount ($)', fontsize=12, fontweight='bold')
ax4.grid(True, alpha=0.3, linestyle='--')

# EXTRA CHALLENGE: Add colorbar
cbar = plt.colorbar(scatter, ax=ax4)
cbar.set_label('Satisfaction Score', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("Dashboard created successfully!")
print("\nKey Insights:")
print(f"- Most common satisfaction score: {satisfaction_counts.idxmax()}")
print(f"- Highest average sales at satisfaction level: {avg_sales_by_satisfaction.idxmax()}")
print(f"- Largest age group: {age_group_counts.idxmax()}")
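
Because satisfaction scores are whole numbers, a continuous colorbar can show fractional tick labels that never occur in the data. The sketch below shows one way to restrict the ticks to the scores themselves. It uses randomly generated, hypothetical data and assumes scores run from 1 to 5.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: integer satisfaction scores from 1 to 5
rng = np.random.default_rng(0)
quantity = rng.integers(1, 10, size=50)
sales = quantity * 100 + rng.normal(0, 50, size=50)
satisfaction = np.arange(50) % 5 + 1  # cycles through 1..5

fig, ax = plt.subplots(figsize=(7, 5))
scatter = ax.scatter(quantity, sales, c=satisfaction, cmap="viridis")

# Attach the colorbar through the figure and show only whole-number ticks
cbar = fig.colorbar(scatter, ax=ax)
cbar.set_ticks([1, 2, 3, 4, 5])
cbar.set_label("Satisfaction Score")

ax.set_xlabel("Quantity Sold")
ax.set_ylabel("Sales Amount ($)")
```

In a notebook the figure displays when the cell finishes. `fig.colorbar(...)` and `plt.colorbar(...)` do the same job here; the figure method simply makes it explicit which figure owns the colorbar.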

---
## 🎓 What You've Learned

By completing these challenges, you've mastered:

✅ **Data Aggregation** - Grouping and summarizing data in multiple ways  
✅ **Conditional Logic** - Creating categories based on value ranges  
✅ **Data Filtering** - Extracting subsets based on complex criteria  
✅ **Statistical Analysis** - Computing means, medians, and distributions  
✅ **Data Visualization** - Creating professional charts with matplotlib  
✅ **Dashboard Creation** - Combining multiple visualizations into comprehensive displays  

### Next Steps in Your Data Analytics Journey:

1. **Practice with Real Data** - Try applying these techniques to actual datasets from Kaggle or government open data portals
2. **Learn Advanced Visualizations** - Explore seaborn for statistical visualizations
3. **Master Statistical Tests** - Learn hypothesis testing and correlation analysis
4. **Build Interactive Dashboards** - Explore tools like Plotly and Dash
5. **Share Your Work** - Create a portfolio of data analysis projects

**Remember:** The best way to learn data analytics is by doing. Keep practicing, stay curious, and don't be afraid to experiment with different approaches!

---
### 🌟 Congratulations on completing this lesson! 🌟
---