# Data Analysis Tutorial with Claude Code

This notebook walks through an end-to-end data analysis workflow in Python with pandas, NumPy, Matplotlib, seaborn, and SciPy, using a sales dataset and a customer dataset.

## Table of Contents
1. Data Loading and Exploration
2. Data Cleaning
3. Exploratory Data Analysis
4. Data Visualization
5. Statistical Analysis
6. Key Insights and Recommendations

## 1. Data Loading and Exploration

Let's start by loading the necessary libraries and our datasets.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries loaded successfully!")

In [None]:
# Load datasets
sales_df = pd.read_csv('../data/sales_data.csv')
customer_df = pd.read_csv('../data/customer_data.csv')

print(f"Sales data shape: {sales_df.shape}")
print(f"Customer data shape: {customer_df.shape}")
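
As an optional variation on the loading step, dates can be parsed at read time and the expected columns checked up front. This is a minimal sketch, not the notebook's actual loader: the inline CSV text and the `expected_cols` set are hypothetical stand-ins based on the column names used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for ../data/sales_data.csv.
csv_text = """date,category,quantity,price,revenue
2024-01-05,Electronics,2,100.0,200.0
2024-01-06,Furniture,1,150.0,150.0
"""

# parse_dates converts the listed columns to datetime while reading.
toy_sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Assumed schema: fail early if a column the analysis relies on is absent.
expected_cols = {"date", "category", "quantity", "price", "revenue"}
missing_cols = expected_cols - set(toy_sales.columns)
if missing_cols:
    raise ValueError(f"Missing expected columns: {sorted(missing_cols)}")

print(toy_sales.dtypes)
```

Checking the schema immediately after loading tends to surface a renamed or missing column before it causes a harder-to-read error in a later cell.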

In [None]:
# Display first few rows of sales data
print("Sales Data Sample:")
sales_df.head()

In [None]:
# Display first few rows of customer data
print("Customer Data Sample:")
customer_df.head()

In [None]:
# Check data types and missing values
print("Sales Data Info:")
sales_df.info()  # info() prints its report directly and returns None, so it needs no print()
print("\nMissing values:")
print(sales_df.isnull().sum())
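
Raw missing-value counts are easier to judge next to the share of rows they represent. The sketch below shows one way to build that summary; `missing_summary` is a hypothetical helper and `toy_sales` is made-up data, so neither reflects the real datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for sales_df; the real columns may differ.
toy_sales = pd.DataFrame({
    "quantity": [1, 2, np.nan, 4],
    "price": [9.99, np.nan, np.nan, 4.50],
    "category": ["Electronics", "Furniture", "Supplies", None],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy_sales)
print(summary)
```

The same helper could be called on both `sales_df` and `customer_df`, since the cell above only inspects the sales data.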

## 2. Data Cleaning

Let's prepare the data for analysis by converting the date columns to datetime and deriving year, month, and day-of-week features.

In [None]:
# Convert date columns to datetime
sales_df['date'] = pd.to_datetime(sales_df['date'])
customer_df['member_since'] = pd.to_datetime(customer_df['member_since'])

# Create additional date features for sales data
sales_df['year'] = sales_df['date'].dt.year
sales_df['month'] = sales_df['date'].dt.month
sales_df['day_of_week'] = sales_df['date'].dt.day_name()

print("Date columns processed successfully!")
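
The cell above assumes every date string parses cleanly and that there are no duplicate rows. If that is not guaranteed, a more defensive version might look like the sketch below. The `raw` frame is invented for illustration, and dropping unparseable rows is only one possible policy.

```python
import pandas as pd

# Hypothetical raw data with one exact duplicate row and one unparseable date.
raw = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-05", "not a date", "2024-02-10"],
    "revenue": [100.0, 100.0, 50.0, 75.0],
})

# Remove exact duplicate rows first.
clean = raw.drop_duplicates().copy()

# errors="coerce" turns unparseable strings into NaT instead of raising.
clean["date"] = pd.to_datetime(clean["date"], errors="coerce")
n_bad_dates = int(clean["date"].isna().sum())

# One possible policy: drop rows whose date could not be parsed.
clean = clean.dropna(subset=["date"]).reset_index(drop=True)

print(f"Rows kept: {len(clean)}, unparseable dates dropped: {n_bad_dates}")
```

Counting the coerced values before dropping them keeps the data loss visible rather than silent.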

## 3. Exploratory Data Analysis

Let's explore our data through summary statistics and aggregations.

In [None]:
# Sales summary statistics
print("Sales Data - Descriptive Statistics:")
sales_df[['quantity', 'price', 'revenue']].describe()

In [None]:
# Revenue by category
category_revenue = sales_df.groupby('category').agg({
    'revenue': ['sum', 'mean', 'count']
}).round(2)

category_revenue.columns = ['Total Revenue', 'Avg Revenue', 'Transactions']
print("Revenue by Category:")
category_revenue.sort_values('Total Revenue', ascending=False)
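
The same per-category table can also be built with pandas named aggregation, which assigns the output column names in the `agg` call instead of renaming them afterwards. This is a sketch on made-up data; the `toy` frame and the snake_case column names are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data standing in for sales_df.
toy = pd.DataFrame({
    "category": ["Electronics", "Furniture", "Electronics", "Supplies"],
    "revenue": [200.0, 150.0, 100.0, 30.0],
})

# Named aggregation: output_name=(input_column, aggregation_function).
by_category = (
    toy.groupby("category")
    .agg(
        total_revenue=("revenue", "sum"),
        avg_revenue=("revenue", "mean"),
        transactions=("revenue", "count"),
    )
    .sort_values("total_revenue", ascending=False)
)

print(by_category)
```

Because each output name is tied to its aggregation, reordering or adding statistics cannot silently mislabel a column.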

In [None]:
# Revenue by region
region_revenue = sales_df.groupby('region').agg({
    'revenue': 'sum',
    'quantity': 'sum'
}).round(2)

print("Revenue by Region:")
region_revenue.sort_values('revenue', ascending=False)

In [None]:
# Customer analysis
print("Customer Data - Descriptive Statistics:")
customer_df[['age', 'income', 'purchase_frequency', 'satisfaction_score']].describe()

## 4. Data Visualization

Let's create visualizations to better understand our data.

In [None]:
# Revenue by category - Bar plot
plt.figure(figsize=(10, 6))
category_total = sales_df.groupby('category')['revenue'].sum().sort_values(ascending=False)
category_total.plot(kind='bar', color='steelblue')
plt.title('Total Revenue by Category', fontsize=16, fontweight='bold')
plt.xlabel('Category', fontsize=12)
plt.ylabel('Revenue ($)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Sales trend over time
plt.figure(figsize=(12, 6))
daily_revenue = sales_df.groupby('date')['revenue'].sum()
plt.plot(daily_revenue.index, daily_revenue.values, marker='o', linewidth=2, color='darkgreen')
plt.title('Daily Revenue Trend', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Revenue ($)', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
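
Daily revenue is often noisy, so a rolling mean can make the underlying trend easier to read. The sketch below computes a 7-day rolling average on invented values; the window length and the `daily` series are assumptions, and the result could be drawn on the same axes as the daily line.

```python
import pandas as pd

# Hypothetical daily revenue series standing in for daily_revenue.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
daily = pd.Series(
    [10, 12, 9, 14, 20, 18, 25, 22, 30, 28], index=dates, dtype=float
)

# 7-day rolling mean; min_periods=1 averages whatever is available
# at the start of the series instead of returning NaN there.
rolling_7d = daily.rolling(window=7, min_periods=1).mean()

print(rolling_7d.round(2))
```

Note that `rolling(window=7)` counts rows, not calendar days, so it only equals a true 7-day window when every date is present in the index.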

In [None]:
# Regional distribution - Pie chart
plt.figure(figsize=(10, 8))
region_revenue_sum = sales_df.groupby('region')['revenue'].sum()
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
plt.pie(region_revenue_sum.values, labels=region_revenue_sum.index, autopct='%1.1f%%',
        colors=colors, startangle=90)
plt.title('Revenue Distribution by Region', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Customer age distribution
plt.figure(figsize=(10, 6))
plt.hist(customer_df['age'], bins=15, color='coral', edgecolor='black', alpha=0.7)
plt.title('Customer Age Distribution', fontsize=16, fontweight='bold')
plt.xlabel('Age', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Income vs Satisfaction scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(customer_df['income'], customer_df['satisfaction_score'], alpha=0.6, s=100, color='purple')
plt.title('Income vs Customer Satisfaction', fontsize=16, fontweight='bold')
plt.xlabel('Income ($)', fontsize=12)
plt.ylabel('Satisfaction Score', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
numeric_cols = ['age', 'income', 'purchase_frequency', 'satisfaction_score']
correlation_matrix = customer_df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap - Customer Metrics', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
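
A heatmap shows every pair twice and includes the trivial diagonal. To read off the strongest relationships directly, the correlation matrix can be flattened into a ranked list of unique pairs. This is a sketch on made-up numbers; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer metrics; income is exactly linear in age here.
toy = pd.DataFrame({
    "age": [25, 35, 45, 55, 65],
    "income": [30, 50, 70, 90, 110],
    "satisfaction_score": [5, 3, 4, 2, 1],
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal of self-correlations is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.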

## 5. Statistical Analysis

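Let's use statistical tests to check whether the patterns seen above are likely to be more than chance: a one-way ANOVA across categories and a Pearson correlation test.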

In [None]:
# Hypothesis test: Is there a significant difference in revenue between categories?
electronics = sales_df[sales_df['category'] == 'Electronics']['revenue']
furniture = sales_df[sales_df['category'] == 'Furniture']['revenue']
supplies = sales_df[sales_df['category'] == 'Supplies']['revenue']

f_stat, p_value = stats.f_oneway(electronics, furniture, supplies)

print("ANOVA Test: Revenue Difference Between Categories")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result: Significant difference exists (p < 0.05)")
else:
    print("Result: No significant difference (p >= 0.05)")
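
One-way ANOVA assumes roughly normal data and similar variances across groups, and revenue data often violates both. As a hedged complement to the test above, the sketch below checks variance equality with Levene's test and runs the rank-based Kruskal-Wallis test as an alternative. The three lists are invented values, not taken from `sales_df`.

```python
from scipy import stats

# Hypothetical revenue samples per category (not real data).
electronics = [520.0, 610.0, 580.0, 640.0, 590.0]
furniture = [310.0, 290.0, 350.0, 330.0, 300.0]
supplies = [45.0, 60.0, 52.0, 48.0, 55.0]

# Levene's test: a small p-value suggests the group variances differ,
# which weakens the equal-variance assumption behind classic ANOVA.
levene_stat, levene_p = stats.levene(electronics, furniture, supplies)

# Kruskal-Wallis: compares groups using ranks, so it does not
# rely on normality of the raw values.
h_stat, kruskal_p = stats.kruskal(electronics, furniture, supplies)

print(f"Levene p-value: {levene_p:.4f}")
print(f"Kruskal-Wallis H: {h_stat:.4f}, p-value: {kruskal_p:.4f}")
```

If ANOVA and Kruskal-Wallis agree, the conclusion is less likely to depend on the distributional assumptions. Neither test says which categories differ; that would need a post-hoc comparison.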

In [None]:
# Correlation test: Income vs Purchase Frequency
corr_coef, p_value = stats.pearsonr(customer_df['income'], customer_df['purchase_frequency'])

print("Correlation Test: Income vs Purchase Frequency")
print(f"Correlation coefficient: {corr_coef:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result: Statistically significant correlation (p < 0.05)")
else:
    print("Result: No significant correlation (p >= 0.05)")
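
Two practical caveats apply to the correlation test above. As far as I know, `stats.pearsonr` does not drop missing values on its own, and Pearson's coefficient only measures linear association. The sketch below drops incomplete rows first and reports Spearman's rank correlation alongside Pearson's; the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical customer data with a missing value in each column.
toy = pd.DataFrame({
    "income": [30000, 45000, np.nan, 60000, 80000, 120000],
    "purchase_frequency": [2, 4, 5, 5, np.nan, 9],
})

# Keep only rows where both variables are present.
pairs = toy[["income", "purchase_frequency"]].dropna()

# Pearson measures linear association; Spearman measures monotonic
# association using ranks and is less sensitive to outliers.
r, r_p = stats.pearsonr(pairs["income"], pairs["purchase_frequency"])
rho, rho_p = stats.spearmanr(pairs["income"], pairs["purchase_frequency"])

print(f"Rows used: {len(pairs)}")
print(f"Pearson r: {r:.4f} (p = {r_p:.4f})")
print(f"Spearman rho: {rho:.4f} (p = {rho_p:.4f})")
```

A large gap between the two coefficients is a hint that the relationship is non-linear or driven by a few extreme incomes.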

## 6. Key Insights and Recommendations

Based on our analysis, here are the key insights:

In [None]:
# Summary statistics
print("="*60)
print("KEY PERFORMANCE INDICATORS")
print("="*60)
print(f"Total Revenue: ${sales_df['revenue'].sum():,.2f}")
print(f"Total Transactions: {len(sales_df)}")
print(f"Average Transaction Value: ${sales_df['revenue'].mean():,.2f}")
print(f"Top Category: {sales_df.groupby('category')['revenue'].sum().idxmax()}")
print(f"Top Region: {sales_df.groupby('region')['revenue'].sum().idxmax()}")
print(f"\nTotal Customers: {len(customer_df)}")
print(f"Average Customer Satisfaction: {customer_df['satisfaction_score'].mean():.2f}/5.0")
print(f"Average Purchase Frequency: {customer_df['purchase_frequency'].mean():.1f} purchases/year")

### Recommendations:

1. **Product Strategy**: Focus on Electronics as it generates the highest revenue
2. **Regional Focus**: Invest more in top-performing regions
3. **Customer Engagement**: Target high-income customers, who tend to purchase more frequently (see the income vs. purchase frequency correlation test in Section 5)
4. **Customer Satisfaction**: Maintain high satisfaction scores through quality service
5. **Inventory Management**: Stock popular items based on sales trends