# Simple Big Data Tutorial - Your First Pipeline

Welcome! This is a beginner-friendly introduction to the Big Data Sandbox.

## What we'll do:
1. ✅ Check that everything is working
2. 📊 Create some sample data
3. 🔄 Process it with Spark
4. 📈 Make a simple chart
5. 📝 Summarize what we learned
6. 💾 Save the results (optional)

**No experience needed!** Just follow along step by step.

## Step 1: Import Libraries and Check Spark

In [None]:
# Import the tools we need
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
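# Import only the Spark functions we use.
# Note: this `sum` replaces Python's built-in sum() for the rest of the notebook.
from pyspark.sql.functions import avg, col, sum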

print("✅ Libraries imported successfully!")

In [None]:
# Connect to Spark (this might take a moment)
print("🔄 Connecting to Spark...")

spark = SparkSession.builder \
    .appName("SimpleTutorial") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

print(f"✅ Connected to Spark version {spark.version}")
print("📊 You can see Spark jobs at: http://localhost:4040")
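
If the connection step hangs or fails, it can help to check whether the Spark master is reachable before building the session. The sketch below is one way to do that using only the standard library. The helper name `pick_master` and the fallback to Spark's local mode (`local[*]`) are assumptions made for illustration; they are not part of the sandbox itself.

```python
import socket


def pick_master(host="spark-master", port=7077, timeout=2.0):
    """Return a Spark master URL, falling back to local mode if unreachable.

    `pick_master` is a hypothetical helper; the default host and port are
    the ones used in the connection cell above.
    """
    try:
        # Open (and immediately close) a TCP connection to the master port.
        with socket.create_connection((host, port), timeout=timeout):
            return f"spark://{host}:{port}"
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return "local[*]"
```

The result could then be passed to the builder, for example `.master(pick_master())`, so the notebook still runs on a single machine when the cluster is down. A successful TCP connection only shows that something is listening on the port, not that Spark itself is healthy.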

## Step 2: Create Some Sample Data

Let's create a simple dataset about online store sales.

In [None]:
# Create sample data
data = [
    ("Phone", "Electronics", 599.99, 5),
    ("Laptop", "Electronics", 1299.99, 3),
    ("Book", "Education", 19.99, 15),
    ("Headphones", "Electronics", 199.99, 8),
    ("Desk", "Furniture", 299.99, 2),
    ("Chair", "Furniture", 149.99, 4),
    ("Notebook", "Education", 4.99, 25),
    ("Mouse", "Electronics", 29.99, 12)
]

# Define column names
columns = ["product", "category", "price", "quantity_sold"]

# Create a Spark DataFrame
df = spark.createDataFrame(data, columns)

print("✅ Sample data created!")
print("Here's what our data looks like:")
df.show()

## Step 3: Process the Data

Let's calculate some basic business metrics.

In [None]:
# Add a new column: total revenue per product
df_with_revenue = df.withColumn("total_revenue", col("price") * col("quantity_sold"))

print("💰 Added revenue calculation:")
df_with_revenue.show()

In [None]:
# Rank products by total revenue (highest first)
top_products = df_with_revenue.orderBy(col("total_revenue").desc())

print("🏆 Products ranked by total revenue:")
top_products.show()

In [None]:
# Group by category to see which category performs best
category_summary = df_with_revenue.groupBy("category") \
    .agg(
        sum("total_revenue").alias("category_revenue"),
        sum("quantity_sold").alias("total_items_sold"),
        avg("price").alias("avg_price")
    ) \
    .orderBy(col("category_revenue").desc())

print("📊 Sales by category:")
category_summary.show()
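
To see what `groupBy` and `agg` are doing, here is a rough plain-Python equivalent of the same aggregation. It is only a sketch for building intuition: it runs on one machine, the variable names are made up for this example, and it does not reflect how Spark executes the query internally.

```python
from collections import defaultdict

# Same rows as the sample data in Step 2:
# (product, category, price, quantity_sold)
rows = [
    ("Phone", "Electronics", 599.99, 5),
    ("Laptop", "Electronics", 1299.99, 3),
    ("Book", "Education", 19.99, 15),
    ("Headphones", "Electronics", 199.99, 8),
    ("Desk", "Furniture", 299.99, 2),
    ("Chair", "Furniture", 149.99, 4),
    ("Notebook", "Education", 4.99, 25),
    ("Mouse", "Electronics", 29.99, 12),
]

# One bucket of running totals per category, like groupBy("category")
totals = defaultdict(lambda: {"revenue": 0.0, "items": 0, "price_sum": 0.0, "n": 0})
for product, category, price, qty in rows:
    bucket = totals[category]
    bucket["revenue"] += price * qty   # sum("total_revenue")
    bucket["items"] += qty             # sum("quantity_sold")
    bucket["price_sum"] += price       # numerator for avg("price")
    bucket["n"] += 1                   # denominator for avg("price")

# Sort by revenue, highest first, like orderBy(col("category_revenue").desc())
summary = sorted(
    (
        (category, b["revenue"], b["items"], b["price_sum"] / b["n"])
        for category, b in totals.items()
    ),
    key=lambda row: row[1],
    reverse=True,
)

for category, revenue, items, avg_price in summary:
    print(f"{category:<12} revenue={revenue:>9,.2f} items={items:>3} avg_price={avg_price:,.2f}")
```

The numbers printed here should match the `category_summary` table above, which makes it a handy cross-check when you change the sample data.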

## Step 4: Create a Simple Chart

Let's visualize our results!

In [None]:
# Convert to pandas for easy plotting
category_data = category_summary.toPandas()

# Create a bar chart
plt.figure(figsize=(10, 6))
plt.bar(category_data['category'], category_data['category_revenue'], color=['skyblue', 'lightgreen', 'salmon'])
plt.title('Revenue by Category', fontsize=16, fontweight='bold')
plt.xlabel('Category')
plt.ylabel('Revenue ($)')
plt.xticks(rotation=45)

# Add value labels on bars
for i, v in enumerate(category_data['category_revenue']):
    plt.text(i, v + 50, f'${v:,.0f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("📈 Chart created successfully!")

## Step 5: Summary

Let's see what we learned from our data:

In [None]:
# Calculate some final insights (both totals in one pass over the data)
totals = df_with_revenue.agg(
    sum("total_revenue").alias("total_revenue"),
    sum("quantity_sold").alias("total_items"),
).collect()[0]
total_revenue = totals["total_revenue"]
total_items = totals["total_items"]

# Each row is a product, not an order, so this is the average revenue per product
avg_revenue_per_product = total_revenue / df_with_revenue.count()

print("📊 BUSINESS INSIGHTS")
print("=" * 30)
print(f"💰 Total Revenue: ${total_revenue:,.2f}")
print(f"📦 Total Items Sold: {total_items}")
print(f"🛒 Average Revenue per Product: ${avg_revenue_per_product:,.2f}")
print("\n🏆 Key Findings:")
print("- Electronics is our top-performing category")
print("- The Laptop generates the highest revenue of any single product")
print("- Education products sell in higher quantities but at lower prices")
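
The key findings in the cell above are typed in by hand, so they would go stale if the sample data changed. As a sketch of how they could be derived instead, the plain-Python snippet below recomputes them from the same rows. The variable names are hypothetical and chosen for this example only.

```python
# Same rows as the sample data in Step 2:
# (product, category, price, quantity_sold)
rows = [
    ("Phone", "Electronics", 599.99, 5),
    ("Laptop", "Electronics", 1299.99, 3),
    ("Book", "Education", 19.99, 15),
    ("Headphones", "Electronics", 199.99, 8),
    ("Desk", "Furniture", 299.99, 2),
    ("Chair", "Furniture", 149.99, 4),
    ("Notebook", "Education", 4.99, 25),
    ("Mouse", "Electronics", 29.99, 12),
]

revenue_by_product = {}
revenue_by_category = {}
items_by_category = {}
for product, category, price, qty in rows:
    revenue = price * qty
    revenue_by_product[product] = revenue
    revenue_by_category[category] = revenue_by_category.get(category, 0.0) + revenue
    items_by_category[category] = items_by_category.get(category, 0) + qty

# Pick the key with the largest value in each dictionary
top_product = max(revenue_by_product, key=revenue_by_product.get)
top_category = max(revenue_by_category, key=revenue_by_category.get)
most_items_category = max(items_by_category, key=items_by_category.get)

print(f"- {top_category} is the top-performing category by revenue")
print(f"- {top_product} generates the highest revenue of any single product")
print(f"- {most_items_category} sells the most items")
```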

## Step 6: Save Your Work (Optional)

The cell below shows how you would save the processed data to MinIO (our data storage). The write itself is commented out for this demo:

In [None]:
# Note: This would normally save to MinIO, but we'll skip it for this demo
# In a real scenario, you would uncomment the lines below:

# df_with_revenue.write \
#     .mode("overwrite") \
#     .parquet("s3a://processed/sales_analysis")

print("💾 In a real scenario, your data would now be saved to MinIO!")
print("📁 You could view it at: http://localhost:9000")
print("🔍 Files would appear in the 'processed' bucket")
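
For the commented-out `s3a://` write to work, Spark usually needs to be told where MinIO lives and how to authenticate. The sketch below collects the Hadoop S3A settings that are commonly used for this. The endpoint `http://minio:9000`, the environment variable names `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY`, and the helper name `minio_s3a_conf` are all assumptions for illustration; check your sandbox configuration for the real values.

```python
import os


def minio_s3a_conf(endpoint="http://minio:9000"):
    """Build Spark settings for the Hadoop S3A connector (hypothetical helper).

    The endpoint and the environment variable names are placeholders,
    not values taken from the sandbox.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": os.environ.get("MINIO_ACCESS_KEY", ""),
        "spark.hadoop.fs.s3a.secret.key": os.environ.get("MINIO_SECRET_KEY", ""),
        # MinIO is typically addressed as endpoint/bucket rather than bucket.endpoint
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Plain HTTP is assumed here because the placeholder endpoint is http://
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }


# Print the settings, hiding the credentials
for key, value in minio_s3a_conf().items():
    shown = "***" if key.endswith(".key") else value
    print(f"{key} = {shown}")
```

Each pair would normally be passed to `SparkSession.builder.config(key, value)` before the session is created, and the cluster also needs the `hadoop-aws` connector on its classpath. The sandbox may already set all of this up for you, in which case uncommenting the write is enough.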

## 🎉 Congratulations!

You've successfully completed your first big data pipeline! Here's what you accomplished:

✅ **Connected to Spark** - You're now using distributed computing!  
✅ **Processed data** - You transformed raw data into insights  
✅ **Created visualizations** - You made data easy to understand  
✅ **Generated insights** - You discovered meaningful business information  

## What's Next?

1. **Try the Advanced Tutorial**: Open `01_getting_started.ipynb` for more complex examples
2. **Explore Real Data**: Use the sample CSV files in the `/data` folder
3. **Stream Data**: Check out the Kafka streaming examples
4. **Build Pipelines**: Create automated workflows with Airflow

## Need Help?

- **Airflow UI**: http://localhost:8080 (workflow management)
- **Spark UI**: http://localhost:4040 (job monitoring)
- **MinIO Console**: http://localhost:9000 (data storage)
- **Kafka UI**: http://localhost:9001 (streaming data)

Happy data engineering! 🚀

In [None]:
# Clean up (optional): uncomment the next line to release Spark resources when you're done
# spark.stop()
print("🎯 Tutorial complete! Keep exploring!")