# Lesson 5: Data Analysis with Pandas

Learn to work with data using Pandas, Python's most popular data analysis library.

## What You'll Learn
- Working with DataFrames
- Loading and exploring data
- Data cleaning and manipulation
- Basic data visualization

## Introduction to Pandas

Pandas is built around the DataFrame, a table with labeled rows and columns. Think of it as a super-powered spreadsheet in Python!

In [None]:
# You'll need to install pandas:
# pip install pandas matplotlib

import pandas as pd
import numpy as np

# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'Paris', 'London', 'Tokyo', 'Berlin'],
    'Salary': [70000, 80000, 75000, 90000, 85000]
}

df = pd.DataFrame(data)
print("Our DataFrame:")
print(df)
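
In practice, data usually comes from a file rather than a hand-written dictionary. The sketch below shows one way to load CSV data with `pd.read_csv`. The CSV text is made up for illustration and is wrapped in `io.StringIO` so the example runs without a file on disk; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Name,Age,City,Salary
Alice,25,New York,70000
Bob,30,Paris,80000
Charlie,35,London,75000
"""

# pd.read_csv accepts a file path or any file-like object
loaded_df = pd.read_csv(io.StringIO(csv_text))

print(loaded_df)
print("\nInferred data types:")
print(loaded_df.dtypes)
```

Pandas infers a type for each column, so it is worth checking `dtypes` right after loading.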

## Exploring Data

Basic operations to understand your data:

In [None]:
# Basic information
print("Shape (rows, columns):", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

# First and last rows
print("\nFirst 3 rows:")
print(df.head(3))
print("\nLast 2 rows:")
print(df.tail(2))

# Statistical summary
print("\nStatistical Summary:")
print(df.describe())

## Selecting and Filtering Data

Access specific data from your DataFrame:

In [None]:
# Select a single column
print("Names:")
print(df['Name'])

# Select multiple columns
print("\nNames and Ages:")
print(df[['Name', 'Age']])

# Filter rows by condition
print("\nPeople older than 30:")
print(df[df['Age'] > 30])

# Multiple conditions
print("\nPeople older than 28 AND salary > 80000:")
print(df[(df['Age'] > 28) & (df['Salary'] > 80000)])
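
A few more selection patterns are worth knowing. This is a small sketch, not an exhaustive list: it rebuilds the same `df` so it runs on its own, and the variable names (`older`, `europe`, `young_or_top`) are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'Paris', 'London', 'Tokyo', 'Berlin'],
    'Salary': [70000, 80000, 75000, 90000, 85000]
})

# .loc takes a row condition and a list of columns in one step
older = df.loc[df['Age'] > 30, ['Name', 'Salary']]
print("Older than 30 (name and salary only):")
print(older)

# isin() keeps rows whose value appears in a list
europe = df[df['City'].isin(['Paris', 'London', 'Berlin'])]
print("\nPeople in European cities:")
print(europe['Name'].tolist())

# | means OR; each condition still needs its own parentheses
young_or_top = df[(df['Age'] < 28) | (df['Salary'] >= 90000)]
print("\nYounger than 28 OR salary of at least 90000:")
print(young_or_top['Name'].tolist())
```

Use `&` and `|` rather than Python's `and` and `or` when combining conditions, since the comparison runs on every row at once.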

## Data Manipulation

Modify and transform your data:

In [None]:
# Add a new column
df['Salary_Bonus'] = df['Salary'] * 0.1
print("With bonus column:")
print(df)

# Calculate total compensation
df['Total_Comp'] = df['Salary'] + df['Salary_Bonus']

# Sort by total compensation, highest first
print("\nSorted by Total Compensation:")
print(df.sort_values('Total_Comp', ascending=False))

## Grouping and Aggregation

Summarize data by groups:

In [None]:
# Create a larger dataset
sales_data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard',
                'Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Monitor'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'North',
               'South', 'South', 'North', 'North', 'South'],
    'Sales': [1200, 25, 75, 1500, 30, 60, 1100, 28, 80, 300, 350],
    'Quantity': [2, 5, 3, 3, 6, 2, 2, 7, 4, 1, 1]
}

sales_df = pd.DataFrame(sales_data)
print("Sales Data:")
print(sales_df)

# Group by product and calculate total sales
print("\nTotal Sales by Product:")
product_sales = sales_df.groupby('Product')['Sales'].sum().sort_values(ascending=False)
print(product_sales)

# Multiple aggregations
print("\nSummary by Region:")
region_summary = sales_df.groupby('Region').agg({
    'Sales': ['sum', 'mean'],
    'Quantity': ['sum', 'count']
})
print(region_summary)
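
The dictionary form of `agg` produces two-level column names such as `('Sales', 'sum')`. If you prefer flat, readable column names, named aggregation is one alternative. The sketch below rebuilds the same sales data so it runs on its own; the output column names (`total_sales`, `avg_sales`, and so on) are choices made for this example.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard',
                'Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Monitor'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'North',
               'South', 'South', 'North', 'North', 'South'],
    'Sales': [1200, 25, 75, 1500, 30, 60, 1100, 28, 80, 300, 350],
    'Quantity': [2, 5, 3, 3, 6, 2, 2, 7, 4, 1, 1]
})

# Named aggregation: new_column=(source_column, function)
region_flat = sales_df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    total_quantity=('Quantity', 'sum'),
    num_orders=('Quantity', 'count'),
)

print(region_flat)
```

The result has one row per region and ordinary single-level columns, which makes later selection (for example `region_flat['total_sales']`) simpler.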

## Handling Missing Data

Real-world data often has missing values:

In [None]:
# Create data with missing values
messy_data = {
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Score': [85, None, 92, 88, None],
    'Grade': ['A', 'B', 'A', 'B', 'C']
}

messy_df = pd.DataFrame(messy_data)
print("Data with missing values:")
print(messy_df)

# Check for missing values
print("\nMissing values per column:")
print(messy_df.isnull().sum())

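# Fill missing values (assign the result back to the column; calling
# fillna(inplace=True) on a selected column is unreliable in recent pandas versions)
messy_df['Score'] = messy_df['Score'].fillna(messy_df['Score'].mean())
messy_df['Name'] = messy_df['Name'].fillna('Unknown')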

print("\nAfter filling missing values:")
print(messy_df)
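
Filling with the mean is only one option. The sketch below shows two others under the same example data: dropping rows that lack a score, and filling several columns at once with a dictionary. Using the median here is an illustrative choice, not a recommendation for every dataset.

```python
import pandas as pd

messy_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Score': [85, None, 92, 88, None],
    'Grade': ['A', 'B', 'A', 'B', 'C']
})

# Option 1: drop rows where Score is missing
scored_only = messy_df.dropna(subset=['Score'])
print("Rows with a score:")
print(scored_only)

# Option 2: fill several columns at once; fillna returns a new DataFrame
filled = messy_df.fillna({
    'Score': messy_df['Score'].median(),
    'Name': 'Unknown'
})
print("\nFilled copy:")
print(filled)

# The original is left unchanged by both operations
print("\nMissing values still in the original:", messy_df.isnull().sum().sum())
```

Dropping rows loses data but avoids inventing values; filling keeps every row but can distort statistics. Which is better depends on how the data will be used.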

## Simple Visualization

Create basic plots from your data:

In [None]:
import matplotlib.pyplot as plt

# Bar chart of sales by product
product_sales.plot(kind='bar', color='skyblue', figsize=(10, 6))
plt.title('Total Sales by Product')
plt.xlabel('Product')
plt.ylabel('Sales ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Line plot
time_series = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Revenue': [15000, 18000, 17500, 21000, 23000, 25000]
})

time_series.plot(x='Month', y='Revenue', marker='o', figsize=(10, 6))
plt.title('Monthly Revenue')
plt.ylabel('Revenue ($)')
plt.grid(True)
plt.tight_layout()
plt.show()

## Exercise

Analyze a student grades dataset:
1. Create a DataFrame with students, subjects, and scores
2. Calculate average score per student
3. Find the top-performing students
4. Calculate average score per subject
5. Create a visualization of the results

In [None]:
# Your code here
# Sample data structure:
# students_data = {
#     'Student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
#     'Subject': ['Math', 'Science', 'English', 'Math', 'Science', 'English'],
#     'Score': [90, 85, 88, 78, 82, 75]
# }

