# 🎓 From Zero to Big Data Hero: Complete Learning Guide

## Welcome, Future Big Data Developer! 👋

Hi there! I'm going to teach you Big Data development step by step, starting from the very basics. Think of me as your friendly guide who will help you become a **Big Data Professional** by the end of this journey!

### 🎯 What You'll Learn:
- **Data basics** using everyday examples
- **Python programming** for data work
- **Working with databases** and files  
- **Big Data tools** like Spark and Hadoop
- **Real-world projects** you can put on your resume
- **Professional skills** that companies want

### 🚀 Learning Path:
1. **Baby Steps**: Understanding data (like counting your toys!)
2. **Walking**: Python basics and small data
3. **Running**: Databases and bigger datasets  
4. **Flying**: Big Data tools and distributed computing
5. **Soaring**: Real projects and professional skills

**Ready to become a Big Data superhero? Let's start! 🦸‍♂️**

---

> 💡 **Learning Tip**: Each section builds on the previous one. Don't skip ahead - master each level first!

## 📚 Section 1: What is Data and Why it Matters

### Let's Start Simple! 🧸

Imagine you have a toy box. Inside, you have:
- 5 red cars
- 3 blue trucks  
- 2 yellow airplanes
- 7 green soldiers

When you **count** and **write down** what you have, that's **DATA**!

### Real-Life Data Examples:
- **Your video game scores**: Mario Kart times, Pokemon you caught
- **School stuff**: Test grades, how many books you read
- **Family data**: Heights, birthdays, favorite foods
- **YouTube**: Views, likes, comments on videos

### Why Data is Like a Superpower 🦸‍♀️

Data helps us:
- **Make better decisions** (Which game should I buy next?)
- **Find patterns** (I always do better on math tests after breakfast!)
- **Predict things** (If it's cloudy, it might rain)
- **Solve problems** (Why is my phone battery dying so fast?)

### 🎯 Your First Mission:
Think about data in YOUR life. What do you count or measure every day?

In [None]:
# Let's practice with data from your life!
# Fill in YOUR numbers below:

print("=== MY PERSONAL DATA ===")
print("My age:", 10)  # Replace with your age
print("My favorite number:", 7)  # Replace with your favorite number
print("Books I read this month:", 3)  # Replace with your number
print("Hours I sleep:", 8)  # Replace with your number
print("Pets I have:", 1)  # Replace with your number

# Let's do some simple math with YOUR data
age = 10  # Replace with your age
favorite_number = 7  # Replace with your favorite number

print("\n=== FUN CALCULATIONS ===")
print("In 10 years, I'll be:", age + 10)
print("My age times my favorite number:", age * favorite_number)
print("Days I've been alive (approximately):", age * 365)

# This is your first data analysis! 🎉

## 🌟 Section 2: Understanding Big Data with Simple Examples

### The Library Analogy 📚

**Small Data** = Your bedroom bookshelf (maybe 20 books)
- You can look at every book quickly
- Easy to find what you want
- You remember where everything is

**Big Data** = ALL the libraries in the ENTIRE WORLD! 
- Millions and millions of books
- Too many to look at one by one
- Need special systems to find anything
- Multiple buildings (computers) to store everything

### The 3 V's of Big Data (Like 3 Superpowers!)

#### 1. 📊 **VOLUME** = How MUCH data
- **Small**: Your playlist (50 songs)
- **Big**: ALL songs on Spotify (100 million songs!)

#### 2. ⚡ **VELOCITY** = How FAST data comes
- **Slow**: Writing in your diary (once a day)
- **Fast**: TikTok videos uploaded (thousands per minute!)

#### 3. 🎨 **VARIETY** = How MANY TYPES of data  
- **Simple**: Just numbers (your test scores)
- **Complex**: Videos, photos, text, sounds, GPS locations ALL mixed together!

### Real Big Data Examples You Know:
- **YouTube**: Stores billions of videos, millions uploaded daily
- **Google**: Searches an index of hundreds of billions of web pages in a fraction of a second
- **Netflix**: Tracks what millions of people watch to recommend movies
- **Weather**: Collects data from thousands of sensors worldwide

### 🎯 Think About It:
Why can't we use regular computers for Big Data? (Hint: imagine counting all the stars in the sky by yourself!)

In [None]:
# Let's simulate the difference between small and big data!

import time
import random

print("=== SMALL DATA EXAMPLE ===")
# Imagine checking 10 students' test scores
small_scores = [85, 92, 78, 95, 88, 76, 90, 82, 89, 94]

start_time = time.perf_counter()  # perf_counter is precise enough to time tiny tasks
average_small = sum(small_scores) / len(small_scores)
end_time = time.perf_counter()

print(f"Small data: {len(small_scores)} students")
print(f"Average score: {average_small:.1f}")
print(f"Time taken: {end_time - start_time:.6f} seconds")

print("\n=== BIG DATA SIMULATION ===")
# Now imagine checking 1 million students' scores!
print("Generating 1 million student scores...")

start_time = time.perf_counter()
# We generate and add up the scores one chunk at a time, so we never
# hold all 1 million scores in memory at once. (1 million numbers would
# still fit on a laptop; real Big Data would not.)
big_data_size = 1_000_000
total_sum = 0

# Process in chunks (this is what Big Data tools do!)
for _ in range(100):  # 100 chunks of 10,000 each
    chunk_sum = sum(random.randint(70, 100) for _ in range(10_000))
    total_sum += chunk_sum

average_big = total_sum / big_data_size
end_time = time.perf_counter()

print(f"Big data: {big_data_size:,} students")
print(f"Average score: {average_big:.1f}")
print(f"Time taken: {end_time - start_time:.6f} seconds")

print("\n🤔 Notice how more data takes longer, even with a simple calculation!")
print("When data no longer fits on one computer, tools like Spark and Hadoop split the work across many.")
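
The chunking idea in the cell above can be pushed one step further. Here is a minimal sketch, assuming each chunk lives on a different computer: every chunk reports only a small summary (its sum and its count), and the summaries are combined at the end. The function names `chunk_summary` and `combine` are made up for this illustration; they are not part of Spark or Hadoop.

```python
import random


def chunk_summary(scores):
    """Return the (sum, count) pair for one chunk of scores."""
    return sum(scores), len(scores)


def combine(summaries):
    """Merge per-chunk (sum, count) pairs into one overall average."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count if count else 0.0


random.seed(42)  # fixed seed so every run uses the same pretend scores
scores = [random.randint(70, 100) for _ in range(1_000)]

# Split the data into 10 chunks of 100 scores,
# as if each chunk were stored on a different machine.
chunks = [scores[i:i + 100] for i in range(0, len(scores), 100)]
summaries = [chunk_summary(chunk) for chunk in chunks]

chunked_average = combine(summaries)
direct_average = sum(scores) / len(scores)

print(f"Chunked average: {chunked_average:.2f}")
print(f"Direct average:  {direct_average:.2f}")
```

Both averages match, which is the point: no single computer ever needs to see all the data to get the right answer.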

## 🔧 Section 3: Setting Up Your Data Science Environment

### Welcome to Python! 🐍

Python is like a magic language that computers understand. It's called Python because its creator, Guido van Rossum, liked a TV show called "Monty Python's Flying Circus" (not because of snakes!).

### Why Python for Data?
- **Easy to read** (almost like English!)
- **Powerful tools** for working with data
- **Used by professionals** at Google, Netflix, NASA
- **Great for beginners** but powerful enough for experts

### Essential Tools We'll Use:

#### 1. 📓 **Jupyter Notebooks** (What you're using right now!)
- Like a digital notebook with superpowers
- Mix text, code, and pictures all together
- Perfect for learning and experimenting

#### 2. 🐼 **Pandas** (Your data best friend)
- Handles spreadsheet-like data (like Excel, but better!)
- Named after "Panel Data" (but everyone thinks of cute pandas 🐼)

#### 3. 📊 **Matplotlib & Seaborn** (Make pretty charts)
- Turn boring numbers into colorful pictures
- Help people understand your data instantly

#### 4. ⚡ **NumPy** (Super fast math)
- Makes calculations lightning fast
- The foundation under many other tools

### Let's Check Your Setup! 🔍

In [None]:
# Let's check if all our data science tools are ready!
print("🔍 CHECKING YOUR DATA SCIENCE TOOLBOX...")
print("=" * 50)

# Check basic Python
import sys
print("✅ Python is working! (You're seeing this message)")
print(f"Python version info: {sys.version}")

# Check essential libraries
tools_to_check = [
    ('pandas', '🐼 Data manipulation'),
    ('numpy', '🔢 Fast math operations'),
    ('matplotlib', '📊 Basic plotting'),
    ('seaborn', '🎨 Beautiful charts'),
]

working_tools = []
missing_tools = []

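import importlib  # standard way to import a module when you only have its name as text

for tool, description in tools_to_check:
    try:
        importlib.import_module(tool)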
        print(f"✅ {tool} - {description}")
        working_tools.append(tool)
    except ImportError:
        print(f"❌ {tool} - {description} (Need to install)")
        missing_tools.append(tool)

print("\n" + "=" * 50)
if missing_tools:
    print("🛠️  TO INSTALL MISSING TOOLS:")
    print("Run this in your terminal:")
    print(f"pip install {' '.join(missing_tools)}")
else:
    print("🎉 CONGRATULATIONS! All tools are ready!")
    print("You're equipped for Big Data adventures!")

print(f"\n📊 Tools working: {len(working_tools)}/{len(tools_to_check)}")

# Let's make sure Jupyter is working too
print("\n🎯 JUPYTER NOTEBOOK CHECK:")
print("✅ Jupyter is working! (You can run this cell)")
print("✅ You can see formatted text and code together")
print("✅ You're ready to become a Data Scientist!")

## 📁 Section 4: Working with Small Data First - CSV Files

### What's a CSV File? 🧾

CSV stands for "Comma Separated Values". Think of it like a digital spreadsheet:

```
Name,Age,Favorite Color,Pet
Alice,10,Blue,Cat
Bob,11,Red,Dog
Charlie,9,Green,Fish
```
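
To see how a program reads this format, here is a small sketch using Python's built-in `csv` module on the same sample rows. The text is kept in a string (`csv_text`, a name chosen just for this example) so nothing has to be saved to disk first.

```python
import csv
import io

# The sample CSV from above, stored as plain text.
csv_text = """Name,Age,Favorite Color,Pet
Alice,10,Blue,Cat
Bob,11,Red,Dog
Charlie,9,Green,Fish
"""

# DictReader uses the first line as column names
# and turns every following line into a dictionary.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

for row in rows:
    print(f"{row['Name']} is {row['Age']} and has a {row['Pet'].lower()}")

# Everything read from a CSV starts out as text,
# so convert to numbers before doing math.
ages = [int(row["Age"]) for row in rows]
average_age = sum(ages) / len(ages)
print("Average age:", average_age)
```

Notice the `int(...)` step: a CSV file has no idea that `10` is a number, so the conversion is up to you (or to a library like Pandas, which guesses the types for you).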

### Why Start with CSV?
- **Simple and common** (like the PDF of the data world)
- **Easy to understand** (you can open it in Excel)
- **Good practice** before tackling Big Data
- **Used everywhere** (schools, businesses, government)

### Real CSV Examples You Might See:
- **School**: Student grades, attendance records
- **Sports**: Player stats, game scores  
- **Business**: Sales data, customer info
- **Science**: Experiment results, survey data

### CSV = Training Wheels for Big Data! 🚲

Just like you learn to ride a bike with training wheels before racing, we'll master CSV files before moving to massive datasets.

### Let's Create and Play with CSV Data! 🎮

In [None]:
# Let's create our first CSV file and work with it!
import pandas as pd

print("🎯 CREATING YOUR FIRST CSV FILE...")

# Let's create a CSV about your imaginary class
class_data = {
    'Student_Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [10, 11, 9, 10, 11],
    'Favorite_Subject': ['Math', 'Science', 'Art', 'Math', 'Reading'],
    'Grade': [95, 87, 92, 88, 96],
    'Has_Pet': [True, True, False, True, False]
}

# Convert to DataFrame (think of it as a smart table)
df = pd.DataFrame(class_data)

print("✅ Created data for 5 students:")
print(df)

# Save it as a CSV file
csv_filename = 'my_first_dataset.csv'
df.to_csv(csv_filename, index=False)  # index=False means no row numbers
print(f"\n💾 Saved data to {csv_filename}")

# Now let's read it back (like magic!)
print("\n📖 READING THE CSV FILE BACK...")
loaded_data = pd.read_csv(csv_filename)
print("✅ Loaded data from file:")
print(loaded_data)

# Let's explore our data
print("\n🔍 EXPLORING OUR DATA...")
print(f"Number of students: {len(loaded_data)}")
print(f"Average age: {loaded_data['Age'].mean():.1f}")
print(f"Average grade: {loaded_data['Grade'].mean():.1f}")
print(f"Students with pets: {loaded_data['Has_Pet'].sum()}")

print("\n🎉 Congratulations! You just:")
print("✅ Created a dataset")
print("✅ Saved it as CSV")
print("✅ Loaded it back")
print("✅ Analyzed the data")
print("\nYou're officially a data analyst now! 📊")

# Section 5: Python Fundamentals for Big Data 🐍

Congratulations! Your Big Data environment is now running perfectly! Let's build your Python skills specifically for Big Data work.

## Why Python for Big Data?

Think of Python as your **universal translator** in the Big Data world:

- **Simple Language**: Like English vs. complicated technical jargon
- **Powerful Libraries**: Like having a toolbox with every tool you need
- **Big Data Ready**: Works with Hadoop, Spark, and all our tools

## Your Learning Path
1. **Variables & Data Types** (basic building blocks)
2. **Lists & Dictionaries** (organizing data)
3. **Functions** (reusable code recipes)
4. **File Handling** (reading/writing data)
5. **Error Handling** (dealing with problems gracefully)

Let's start building your Python foundation!

## 5.1 Variables - Your Data Containers 📦

Variables are like **labeled boxes** where you store information:

**Real-World Analogy**: 
- Box labeled "Age" contains the number 25
- Box labeled "Name" contains the text "Alice"
- Box labeled "Grades" contains a list [85, 92, 78]

In [None]:
# Let's create different types of variables - like different types of boxes!

# 📦 Text Box (String)
student_name = "Alice Johnson"
favorite_color = "blue"

# 📦 Number Boxes (Integers and Decimals)
age = 25
height_cm = 165.5
salary = 75000

# 📦 True/False Box (Boolean)
is_student = True
has_graduated = False

# 📦 List Box (Multiple items)
grades = [85, 92, 78, 90]
subjects = ["Math", "Science", "English", "History"]

# 📦 Dictionary Box (Key-Value pairs - like a filing cabinet)
person = {
    "name": "Alice Johnson",
    "age": 25,
    "city": "New York",
    "is_employed": True
}

# Let's see what's in our boxes!
print("=== What's in our variable boxes? ===")
print(f"Name: {student_name}")
print(f"Age: {age}")
print(f"Height: {height_cm} cm")
print(f"Is student: {is_student}")
print(f"Grades: {grades}")
print(f"Person info: {person}")

# Check the type of each variable
print("\n=== What type of box is each one? ===")
print(f"student_name is a {type(student_name)}")
print(f"age is a {type(age)}")
print(f"height_cm is a {type(height_cm)}")
print(f"is_student is a {type(is_student)}")
print(f"grades is a {type(grades)}")
print(f"person is a {type(person)}")

print("\n🎉 You just created your first Python variables!")

## 5.2 Working with Lists - Your Data Collections 📝

Lists are like **shopping lists** or **to-do lists** - they keep things in order!

**Big Data Connection**: Lists are everywhere in Big Data:
- List of customer names
- List of sales amounts
- List of website clicks
- List of temperatures recorded each hour

In [None]:
# Let's practice with lists - think of them as organized collections!

# 📝 Create a list of daily temperatures (like a weather station collects)
daily_temps = [22, 25, 23, 27, 26, 24, 21]
print("This week's temperatures:", daily_temps)

# 📝 Create a list of customer names (like a business database)
customers = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
print("Our customers:", customers)

# 📝 Create a list of sales amounts (like daily revenue)
daily_sales = [1250.50, 980.75, 1500.00, 750.25, 2100.80]
print("Daily sales:", daily_sales)

print("\n=== List Operations (like working with your data) ===")

# Get specific items (like looking up a specific day)
print(f"Temperature on day 1: {daily_temps[0]}°C")  # First item (index 0)
print(f"Temperature on day 3: {daily_temps[2]}°C")  # Third item (index 2)
print(f"Last temperature: {daily_temps[-1]}°C")     # Last item

# Add new data (like recording today's temperature)
daily_temps.append(28)  # Add 28°C for today
print(f"Updated temperatures: {daily_temps}")

# Add a new customer
customers.append("Frank")
print(f"Updated customers: {customers}")

# Get useful information about our data
print(f"\nData Analysis:")
print(f"Number of temperature readings: {len(daily_temps)}")
print(f"Highest temperature: {max(daily_temps)}°C")
print(f"Lowest temperature: {min(daily_temps)}°C")
print(f"Average temperature: {sum(daily_temps)/len(daily_temps):.1f}°C")

print(f"\nTotal sales this week: ${sum(daily_sales):,.2f}")
print(f"Best sales day: ${max(daily_sales):,.2f}")
print(f"Number of customers: {len(customers)}")

print("\n🎉 You're now working with data collections like a pro!")

## 5.3 Dictionaries - Your Smart Filing Cabinet 🗂️

Dictionaries are like **smart filing cabinets** where you can quickly find information by name!

**Real-World Analogy**: 
- Instead of searching through 100 folders, you just ask for "Customer Info for Alice"
- Like a phone book: You know the name (key), you get the number (value)

**Big Data Power**: In Big Data, we deal with millions of records. A dictionary finds a record by its key in roughly the same short time whether it holds ten records or ten million!
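
Here is a small sketch of why that matters. It compares searching a list one record at a time with looking up the same record in a dictionary keyed by ID. The customer records and the helper name `find_in_list` are invented for this example.

```python
# Made-up customer records for illustration.
customers = [
    {"id": 1001, "name": "Alice"},
    {"id": 1002, "name": "Bob"},
    {"id": 1003, "name": "Charlie"},
]


def find_in_list(records, customer_id):
    """Check records one by one until the ID matches (slow for long lists)."""
    for record in records:
        if record["id"] == customer_id:
            return record
    return None


# Build the dictionary once, using the ID as the key.
customers_by_id = {record["id"]: record for record in customers}

# Both approaches find the same record...
print(find_in_list(customers, 1003)["name"])
print(customers_by_id[1003]["name"])

# ...but the dictionary jumps straight to it instead of scanning.
# .get() returns None instead of raising an error when the key is missing.
print(customers_by_id.get(9999))
```

With three records you won't feel the difference. With millions, the list search has to walk through the records while the dictionary still answers in one step.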

In [None]:
# Let's create smart filing cabinets (dictionaries)!

# 🗂️ Customer information (like a customer database record)
customer_alice = {
    "name": "Alice Johnson",
    "age": 28,
    "city": "New York",
    "purchases": 15,
    "total_spent": 2450.80,
    "is_premium": True
}

# 🗂️ Product information (like inventory system)
product_laptop = {
    "id": "LAP001",
    "name": "Gaming Laptop",
    "price": 1299.99,
    "brand": "TechCorp",
    "in_stock": 45,
    "category": "Electronics"
}

# 🗂️ Website analytics (like Google Analytics data)
daily_website_stats = {
    "date": "2025-10-05",
    "visitors": 1250,
    "page_views": 3800,
    "bounce_rate": 0.35,
    "top_page": "/products",
    "avg_session_time": 180  # seconds
}

print("=== Smart Filing Cabinet in Action! ===")

# Quick lookups (like asking for specific information)
print(f"Customer name: {customer_alice['name']}")
print(f"Customer's total purchases: {customer_alice['purchases']}")
print(f"Is premium customer: {customer_alice['is_premium']}")

print(f"\nProduct: {product_laptop['name']}")
print(f"Price: ${product_laptop['price']:,.2f}")
print(f"Stock available: {product_laptop['in_stock']} units")

print(f"\nWebsite had {daily_website_stats['visitors']} visitors today")
print(f"Most popular page: {daily_website_stats['top_page']}")

# Adding new information (like updating records)
print("\n=== Updating our records ===")
customer_alice["last_login"] = "2025-10-05"
customer_alice["purchases"] += 1  # Customer made another purchase!
customer_alice["total_spent"] += 89.99  # Spent $89.99 more

print(f"Updated purchases: {customer_alice['purchases']}")
print(f"Updated total spent: ${customer_alice['total_spent']:,.2f}")  # :,.2f keeps exactly two decimals

# Getting all information at once
print(f"\n=== Complete customer record ===")
for key, value in customer_alice.items():
    print(f"{key}: {value}")

print("\n🎉 You now know how to organize data like a database pro!")

## 5.4 Functions - Your Code Recipes 👨‍🍳

Functions are like **cooking recipes** - write once, use many times!

**Real-World Analogy**:
- Recipe for "Calculate Average" → Use it for grades, temperatures, sales, etc.
- Recipe for "Send Email" → Use it whenever you need to notify someone
- Recipe for "Process Customer Data" → Use it for every new customer

**Big Data Superpower**: Instead of writing the same code 1000 times, write it once as a function!

In [None]:
# Let's create useful code recipes (functions)!

# 👨‍🍳 Recipe 1: Calculate average of any list of numbers
def calculate_average(numbers):
    """
    Recipe to calculate average of any list of numbers
    Like a cooking recipe - give me ingredients (numbers), get back the dish (average)
    """
    if len(numbers) == 0:
        return 0
    total = sum(numbers)
    average = total / len(numbers)
    return average

# 👨‍🍳 Recipe 2: Analyze customer behavior
def analyze_customer(customer_data):
    """
    Recipe to analyze any customer's behavior
    Input: customer dictionary
    Output: customer insights
    """
    name = customer_data["name"]
    purchases = customer_data["purchases"]
    total_spent = customer_data["total_spent"]
    
    # Calculate average per purchase
    avg_per_purchase = total_spent / purchases if purchases > 0 else 0
    
    # Determine customer type
    if total_spent > 2000:
        customer_type = "VIP Customer"
    elif total_spent > 1000:
        customer_type = "Regular Customer"
    else:
        customer_type = "New Customer"
    
    return {
        "name": name,
        "customer_type": customer_type,
        "avg_per_purchase": avg_per_purchase,
        "total_value": total_spent
    }

# 👨‍🍳 Recipe 3: Process any sales data
def process_sales_data(sales_list):
    """
    Recipe to analyze any sales data
    Works with daily, weekly, monthly sales - any list!
    """
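    if not sales_list:
        # max() and min() fail on an empty list, so stop early with a clear message
        raise ValueError("sales_list must contain at least one value")
    total_sales = sum(sales_list)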
    avg_sales = calculate_average(sales_list)  # Using our other recipe!
    best_day = max(sales_list)
    worst_day = min(sales_list)
    
    return {
        "total": total_sales,
        "average": avg_sales,
        "best_day": best_day,
        "worst_day": worst_day,
        "num_days": len(sales_list)
    }

print("=== Using Our Code Recipes ===")

# Let's use our recipes with different data!

# Test data
test_grades = [85, 92, 78, 90, 88]
test_temperatures = [22, 25, 23, 27, 26, 24, 21]
test_sales = [1250.50, 980.75, 1500.00, 750.25, 2100.80]

# Using Recipe 1: Calculate averages
print(f"Average grade: {calculate_average(test_grades):.1f}")
print(f"Average temperature: {calculate_average(test_temperatures):.1f}°C")
print(f"Average daily sales: ${calculate_average(test_sales):.2f}")

# Using Recipe 2: Analyze customer
test_customer = {
    "name": "Bob Smith",
    "purchases": 8,
    "total_spent": 1750.30
}

customer_analysis = analyze_customer(test_customer)
print(f"\n=== Customer Analysis ===")
print(f"Customer: {customer_analysis['name']}")
print(f"Type: {customer_analysis['customer_type']}")
print(f"Average per purchase: ${customer_analysis['avg_per_purchase']:.2f}")

# Using Recipe 3: Process sales data
sales_analysis = process_sales_data(test_sales)
print(f"\n=== Sales Analysis ===")
print(f"Total sales: ${sales_analysis['total']:,.2f}")
print(f"Average daily sales: ${sales_analysis['average']:,.2f}")
print(f"Best day: ${sales_analysis['best_day']:,.2f}")
print(f"Worst day: ${sales_analysis['worst_day']:,.2f}")

print("\n🎉 You've mastered code recipes! Now you can reuse your code like a pro chef!")
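
The real payoff of a recipe shows up when you run it over many records. Here is a minimal sketch, assuming the same spending thresholds as `analyze_customer` above; the helper `classify_customer` and the sample records are made up for this example.

```python
from collections import Counter


def classify_customer(total_spent):
    """Label a customer using the same thresholds as analyze_customer."""
    if total_spent > 2000:
        return "VIP Customer"
    if total_spent > 1000:
        return "Regular Customer"
    return "New Customer"


# Made-up records for illustration.
customers = [
    {"name": "Alice Johnson", "total_spent": 2450.80},
    {"name": "Bob Smith", "total_spent": 1750.30},
    {"name": "Diana Prince", "total_spent": 890.25},
]

# One function, applied to every record.
labels = {c["name"]: classify_customer(c["total_spent"]) for c in customers}
for name, label in labels.items():
    print(f"{name}: {label}")

# Count how many customers landed in each group.
type_counts = Counter(labels.values())
print(type_counts)
```

Whether the list has 3 customers or 3 million, the function stays the same. Only the amount of data changes.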

# Section 6: Hands-On HDFS - Your Big Data Filing Cabinet 🗂️

Now let's get hands-on with HDFS! Remember, your Hadoop cluster is running and ready.

## What is HDFS?

**Simple Analogy**: HDFS is like a **magical filing cabinet** that:
- **Spreads files across multiple drawers** (distributed storage)
- **Makes copies of important documents** (replication for safety)  
- **Can handle HUGE files** (terabytes and petabytes)
- **Survives broken computers** (fault tolerance: lost copies are re-created from the copies that remain)

## Why HDFS for Big Data?

- **Scale**: Can store more data than any single computer
- **Reliability**: If one computer breaks, your data is safe
- **Speed**: Multiple computers work together to read/write data
- **Cost**: Uses regular computers instead of expensive storage systems
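
To make "spreading" and "copying" concrete, here is a toy model in Python. It is only a sketch of the idea, not how HDFS is implemented: real HDFS uses much larger blocks (128 MB by default) and smarter rules for choosing which machines hold each copy. The function names, node names, and the simple round-robin placement are all assumptions made for this example.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size pieces (the last piece may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Give each block `replication` different nodes, going round-robin."""
    if replication > len(nodes):
        raise ValueError("cannot make more copies than there are nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """List the blocks that still have at least one copy on a working node."""
    return [
        block for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


data = b"x" * 1000                      # pretend this is a huge file
blocks = split_into_blocks(data, 256)   # 4 blocks: 256 + 256 + 256 + 232 bytes
nodes = ["node-1", "node-2", "node-3", "node-4"]

with_copies = place_blocks(len(blocks), nodes, replication=3)
no_copies = place_blocks(len(blocks), nodes, replication=1)

print("3 copies, node-2 fails:", readable_blocks(with_copies, "node-2"))
print("1 copy,   node-2 fails:", readable_blocks(no_copies, "node-2"))
```

With three copies of every block, the file is still fully readable after a node fails. With a single copy, block 1 is gone, and so is the file.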

## Your HDFS Commands Cheat Sheet 📋

We'll practice these essential HDFS operations:
- `hdfs dfs -ls` → List files (like `ls` in Linux)
- `hdfs dfs -mkdir` → Create directories
- `hdfs dfs -put` → Upload files to HDFS
- `hdfs dfs -get` → Download files from HDFS
- `hdfs dfs -cat` → View file contents
- `hdfs dfs -rm` → Delete files

## 6.1 Your First HDFS Commands 🎯

**Time to get hands-on!** Run these commands in your terminal (WSL works well). They assume Docker is running and your NameNode container is named `hadoop-namenode`:

### Step 1: Explore Your HDFS
```bash
# See what's in the root directory
docker exec -it hadoop-namenode hdfs dfs -ls /

# Get detailed HDFS cluster information
docker exec -it hadoop-namenode hdfs dfsadmin -report
```

### Step 2: Create Your Data Directories
```bash
# Create a user directory (like your home folder)
# -p means "no error if it already exists", which is common for /user
docker exec -it hadoop-namenode hdfs dfs -mkdir -p /user

# Create a data directory for our experiments
docker exec -it hadoop-namenode hdfs dfs -mkdir -p /data

# Create nested directories for projects
docker exec -it hadoop-namenode hdfs dfs -mkdir -p /projects/learning/input
docker exec -it hadoop-namenode hdfs dfs -mkdir -p /projects/learning/output
```

### Step 3: Create and Upload Your First File
```bash
# Create a test file inside the container
docker exec -it hadoop-namenode sh -c "echo 'Welcome to Big Data with HDFS!' > /tmp/welcome.txt"
docker exec -it hadoop-namenode sh -c "echo 'This is my first HDFS file' > /tmp/first_file.txt"

# Upload files to HDFS
docker exec -it hadoop-namenode hdfs dfs -put /tmp/welcome.txt /data/
docker exec -it hadoop-namenode hdfs dfs -put /tmp/first_file.txt /projects/learning/input/

# List your uploaded files
docker exec -it hadoop-namenode hdfs dfs -ls /data/
docker exec -it hadoop-namenode hdfs dfs -ls /projects/learning/input/
```

**Try these commands now!** Each one teaches you something important about HDFS.

## 6.2 Working with Files in HDFS 📁

Now let's practice the most common HDFS operations you'll use in Big Data projects:

### View and Analyze Files
```bash
# Read file contents (like opening a text file)
docker exec -it hadoop-namenode hdfs dfs -cat /data/welcome.txt

# See file size and details
docker exec -it hadoop-namenode hdfs dfs -ls -h /data/

# View the first kilobyte of a file (useful for big files; needs a Hadoop version that supports -head)
docker exec -it hadoop-namenode hdfs dfs -head /data/welcome.txt

# Show the space used by each item in /data (add -s for one combined total)
docker exec -it hadoop-namenode hdfs dfs -du -h /data/
```

### Copy and Move Files
```bash
# Make a backup copy
docker exec -it hadoop-namenode hdfs dfs -cp /data/welcome.txt /data/welcome_backup.txt

# Move a file to a different location
docker exec -it hadoop-namenode hdfs dfs -mv /data/welcome_backup.txt /projects/learning/

# List both locations to see the results
docker exec -it hadoop-namenode hdfs dfs -ls /data/
docker exec -it hadoop-namenode hdfs dfs -ls /projects/learning/
```

### Download Files from HDFS
```bash
# Download a file from HDFS to local container storage
docker exec -it hadoop-namenode hdfs dfs -get /data/welcome.txt /tmp/downloaded_welcome.txt

# Verify the download worked
docker exec -it hadoop-namenode cat /tmp/downloaded_welcome.txt
```

**Practice Time!** Try each command and see how HDFS manages your files. This is exactly how you'll work with terabytes of data in real Big Data projects!
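
Once the commands feel familiar, you can also drive them from Python. The sketch below only builds the command and prints it, so it is safe to run anywhere. The helper name `hdfs_command` is made up for this example, the container name is the one used in the commands above, and the `-it` flags are left out because a script does not need an interactive terminal.

```python
import shlex


def hdfs_command(*args, container="hadoop-namenode"):
    """Build the argument list for running `hdfs dfs <args>` inside a container."""
    return ["docker", "exec", container, "hdfs", "dfs", *args]


cmd = hdfs_command("-ls", "/data")
print(shlex.join(cmd))

# To actually run it (needs Docker and a running container):
# import subprocess
# result = subprocess.run(cmd, capture_output=True, text=True, check=True)
# print(result.stdout)
```

Passing the command as a list (instead of one long string) means Python hands each piece to Docker exactly as written, so paths with spaces do not cause surprises.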

## 6.3 Real Big Data Exercise - Working with CSV Data 📊

Let's create a realistic Big Data scenario! We'll create sample customer data and work with it in HDFS:

### Create Sample Big Data Files
```bash
# Create a sample customer database CSV file
docker exec -it hadoop-namenode sh -c "cat > /tmp/customers.csv << 'EOF'
customer_id,name,age,city,total_purchases,signup_date
1001,Alice Johnson,28,New York,2450.80,2023-01-15
1002,Bob Smith,35,Los Angeles,1750.30,2023-02-20
1003,Charlie Brown,42,Chicago,3200.50,2023-01-08
1004,Diana Prince,29,Miami,890.25,2023-03-12
1005,Eve Wilson,31,Seattle,1580.90,2023-02-05
EOF"

# Create a sales transactions file
docker exec -it hadoop-namenode sh -c "cat > /tmp/sales.csv << 'EOF'
transaction_id,customer_id,product,amount,date
T001,1001,Laptop,1299.99,2023-03-01
T002,1002,Phone,899.50,2023-03-02
T003,1001,Mouse,29.99,2023-03-03
T004,1003,Tablet,599.00,2023-03-04
T005,1004,Headphones,199.99,2023-03-05
EOF"
```

### Upload to HDFS and Analyze
```bash
# Create directory structure for our data warehouse
docker exec -it hadoop-namenode hdfs dfs -mkdir -p /warehouse/customers
docker exec -it hadoop-namenode hdfs dfs -mkdir -p /warehouse/sales

# Upload our data files
docker exec -it hadoop-namenode hdfs dfs -put /tmp/customers.csv /warehouse/customers/
docker exec -it hadoop-namenode hdfs dfs -put /tmp/sales.csv /warehouse/sales/

# Verify our data warehouse
docker exec -it hadoop-namenode hdfs dfs -ls -R /warehouse/

# Look at our data
docker exec -it hadoop-namenode hdfs dfs -cat /warehouse/customers/customers.csv
docker exec -it hadoop-namenode hdfs dfs -cat /warehouse/sales/sales.csv

# Check file sizes (this would be gigabytes/terabytes in real scenarios)
docker exec -it hadoop-namenode hdfs dfs -du -h /warehouse/
```

**🎉 Congratulations!** You've just built your first small data warehouse layout in HDFS! Giving each dataset its own directory is a common way to organize much larger collections of files too.
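
Storing the files is only half the story; the two datasets become useful when you connect them. Here is a minimal sketch in plain Python that joins the sample sales to the sample customers through `customer_id` and totals the spending per customer. The customer text is a trimmed copy of `customers.csv` with only the two columns needed here. At real scale you would hand this job to a tool like Spark, but the logic is the same.

```python
import csv
import io
from collections import defaultdict

# Trimmed copy of customers.csv (only the columns used in this example).
customers_csv = """customer_id,name
1001,Alice Johnson
1002,Bob Smith
1003,Charlie Brown
1004,Diana Prince
1005,Eve Wilson
"""

# Same rows as sales.csv above.
sales_csv = """transaction_id,customer_id,product,amount,date
T001,1001,Laptop,1299.99,2023-03-01
T002,1002,Phone,899.50,2023-03-02
T003,1001,Mouse,29.99,2023-03-03
T004,1003,Tablet,599.00,2023-03-04
T005,1004,Headphones,199.99,2023-03-05
"""

# Step 1: look-up table from customer ID to name.
names = {
    row["customer_id"]: row["name"]
    for row in csv.DictReader(io.StringIO(customers_csv))
}

# Step 2: add up the sales amounts for each customer ID.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(sales_csv)):
    totals[row["customer_id"]] += float(row["amount"])

# Step 3: join the two by ID. Customers with no sales show 0.00.
for customer_id, name in names.items():
    print(f"{name}: ${totals.get(customer_id, 0.0):,.2f}")
```

Eve Wilson appears with $0.00 because she has no rows in the sales data. Spotting gaps like that is one of the first things a join is good for.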