# Data Manipulation with Pandas

## Overview

**Data Manipulation** = Transforming, reshaping, and analyzing data

### Key Operations We'll Cover

| Category | Operations | Use Case |
|----------|------------|----------|
| **Transform** | `apply()`, `map()`, `applymap()` | Create new columns, custom logic |
| **String** | `.str` methods | Clean text, extract patterns |
| **DateTime** | `.dt` methods | Parse dates, extract components |
| **Groupby** | `groupby()`, `agg()` | Summary statistics by category |
| **Combine** | `merge()`, `join()`, `concat()` | Join multiple datasets |
| **Reshape** | `pivot()`, `pivot_table()`, `melt()`, `stack()` | Change data structure |
| **Sort** | `sort_values()`, `sort_index()` | Order data |
| **Filter** | Boolean indexing, `query()` | Select specific rows |

### Why Data Manipulation?
- 📊 **Analysis**: Calculate metrics, trends
- 🔄 **Transform**: Prepare data for ML models
- 📈 **Insights**: Answer business questions
- 🎯 **Reporting**: Create summaries, dashboards

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 2)

print("✅ Pandas imported successfully")
print(f"Version: {pd.__version__}")

## Sample Dataset: Sales Data

We'll use a synthetic, randomly generated sales dataset throughout this notebook.

In [None]:
# Create sample sales data
np.random.seed(42)

# Generate dates
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=x) for x in range(100)]

df = pd.DataFrame({
    'date': np.random.choice(dates, 100),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones', 'Watch'], 100),
    'category': np.random.choice(['Electronics', 'Accessories'], 100),
    'quantity': np.random.randint(1, 10, 100),
    'price': np.random.choice([299, 599, 899, 1299, 1999], 100),
    'customer_name': np.random.choice(['John Smith', 'Alice Johnson', 'Bob Wilson', 
                                       'Emma Davis', 'Michael Brown'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'UPI', 'Cash'], 100)
})

# Calculate revenue
df['revenue'] = df['quantity'] * df['price']

print("Sample Sales Data:")
print(df.head(10))
print(f"\nShape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

## 1. Transform Functions: apply(), map(), applymap()

### Differences

| Method | Works On | Purpose | Example |
|--------|----------|---------|----------|
| **`apply()`** | Series/DataFrame | Apply function to rows/columns | Calculate tax, categorize |
| **`map()`** | Series | Replace values via a dict, Series, or function | Category encoding |
| **`applymap()`** | DataFrame | Apply to every cell (deprecated since pandas 2.1 in favor of `DataFrame.map()`) | Format entire DataFrame |

### When to Use
- **`apply()`**: Custom calculations (commission, discounts)
- **`map()`**: Simple value mapping (category codes)
- **`applymap()`**: Format all cells (rarely used)

### Syntax
```python
# apply() on Series
df['column'].apply(function)

# apply() on DataFrame
df.apply(function, axis=0)  # axis=0: column-wise, axis=1: row-wise

# map() on Series
df['column'].map(mapping_dict)

# applymap() on DataFrame (deprecated since pandas 2.1; newer code uses df.map)
df.applymap(function)
```
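
The examples below cover `apply()` and `map()` but not the cell-wise case. Here is a minimal sketch of it, assuming a toy `prices` frame and a hypothetical `elementwise` helper (neither is part of this notebook's dataset). The helper prefers `DataFrame.map()` when the installed pandas provides it and falls back to `applymap()` otherwise:

```python
import pandas as pd

# Hypothetical toy frame, not the sales data used elsewhere in this notebook
prices = pd.DataFrame({'cost': [1.5, 2.25], 'list_price': [3.0, 4.75]})

def elementwise(frame, func):
    """Apply func to every cell, preferring DataFrame.map when it exists."""
    if hasattr(frame, 'map'):       # newer pandas
        return frame.map(func)
    return frame.applymap(func)     # older pandas

# Format every cell as a dollar amount with two decimals
formatted = elementwise(prices, lambda v: f"${v:.2f}")
print(formatted)
```

For purely numeric work, a vectorized expression such as `prices.round(2)` is usually preferable to any cell-wise function.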

In [None]:
print("=== APPLY() EXAMPLES ===\n")

# Example 1: Apply function to Series
print("Example 1: Calculate 18% tax on revenue")
df['tax'] = df['revenue'].apply(lambda x: x * 0.18)
print(df[['revenue', 'tax']].head())
print()

# Example 2: Apply custom function
print("Example 2: Categorize revenue (High/Medium/Low)")
def categorize_revenue(revenue):
    if revenue > 10000:
        return 'High'
    elif revenue > 5000:
        return 'Medium'
    else:
        return 'Low'

df['revenue_category'] = df['revenue'].apply(categorize_revenue)
print(df[['revenue', 'revenue_category']].head())
print()

# Example 3: Apply to DataFrame rows (axis=1)
print("Example 3: Calculate discount based on quantity and price")
def calculate_discount(row):
    if row['quantity'] >= 5:
        return row['revenue'] * 0.10  # 10% discount for bulk
    elif row['price'] > 1000:
        return row['revenue'] * 0.05  # 5% for expensive items
    else:
        return 0

df['discount'] = df.apply(calculate_discount, axis=1)
print(df[['quantity', 'price', 'revenue', 'discount']].head(10))
print()

# Example 4: Apply to column
print("Example 4: Calculate total sum per column")
numeric_cols = df[['quantity', 'price', 'revenue']]
totals = numeric_cols.apply(sum, axis=0)
print("Column Totals:")
print(totals)

In [None]:
print("=== MAP() EXAMPLES ===\n")

# Example 1: Map with dictionary
print("Example 1: Map payment method to codes")
payment_mapping = {
    'Credit Card': 'CC',
    'Debit Card': 'DC',
    'UPI': 'UP',
    'Cash': 'CA'
}

df['payment_code'] = df['payment_method'].map(payment_mapping)
print(df[['payment_method', 'payment_code']].head())
print()

# Example 2: Map with function
print("Example 2: Convert product names to uppercase")
df['product_upper'] = df['product'].map(str.upper)
print(df[['product', 'product_upper']].head())
print()

# Example 3: Map with lambda
print("Example 3: Map price to price range")
df['price_range'] = df['price'].map(lambda x: f"${x-100}-${x+100}")
print(df[['price', 'price_range']].head())
print()

# Example 4: map() vs apply() - when to choose which
print("Example 4: map() vs apply()")
print("map() suits element-wise value replacement (dict, Series, or function)")
print("apply() is more flexible: it also works row- or column-wise on a DataFrame")

## 2. String Operations (.str methods)

### Common String Methods

| Method | Purpose | Example |
|--------|---------|----------|
| `.str.lower()` | Lowercase | 'HELLO' → 'hello' |
| `.str.upper()` | Uppercase | 'hello' → 'HELLO' |
| `.str.title()` | Title Case | 'john smith' → 'John Smith' |
| `.str.strip()` | Remove whitespace | ' text ' → 'text' |
| `.str.replace()` | Replace text | 'hello' → 'hi' |
| `.str.contains()` | Check if contains | Check if 'Smith' in name |
| `.str.startswith()` | Starts with | Check if starts with 'J' |
| `.str.endswith()` | Ends with | Check if ends with '.com' |
| `.str.split()` | Split string | 'John Smith' → ['John', 'Smith'] |
| `.str.len()` | String length | 'hello' → 5 |
| `.str.extract()` | Extract pattern | Extract digits from text |
| `.str.slice()` | Slice string | Get first 3 characters |

### Real-World Use Cases
- Clean customer names
- Extract email domains
- Parse product codes
- Validate phone numbers
- Standardize addresses
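
The method table lists `.str.extract()`, and "extract email domains" is a natural fit for it. This is a rough sketch on a hypothetical `contacts` frame; the regular expression is a deliberately simple assumption, not a full email validator:

```python
import pandas as pd

# Hypothetical contact list for illustration
contacts = pd.DataFrame({
    'email': ['john@email.com', 'alice@example.org', 'not-an-email']
})

# One capture group; expand=False returns a Series instead of a DataFrame
contacts['domain'] = contacts['email'].str.extract(r'@([\w.-]+)$', expand=False)

# Rows without a match get NaN, which doubles as a simple validity check
contacts['has_domain'] = contacts['domain'].notna()
print(contacts)
```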

In [None]:
print("=== STRING OPERATIONS ===\n")

# Example 1: Basic transformations
print("Example 1: Basic string transformations")
print("Original names:")
print(df['customer_name'].head())
print()
print("Uppercase:")
print(df['customer_name'].str.upper().head())
print()
print("Lowercase:")
print(df['customer_name'].str.lower().head())
print()

# Example 2: String length
print("Example 2: Calculate name length")
df['name_length'] = df['customer_name'].str.len()
print(df[['customer_name', 'name_length']].head())
print()

# Example 3: Contains
print("Example 3: Filter customers with 'Smith' in name")
smith_customers = df[df['customer_name'].str.contains('Smith', case=False)]
print(f"Found {len(smith_customers)} customers with 'Smith'")
print(smith_customers[['customer_name']].drop_duplicates())
print()

# Example 4: Split names
print("Example 4: Split customer names into first and last")
df[['first_name', 'last_name']] = df['customer_name'].str.split(' ', n=1, expand=True)
print(df[['customer_name', 'first_name', 'last_name']].head())
print()

# Example 5: String slicing
print("Example 5: Extract first 3 characters of product name")
df['product_code'] = df['product'].str.slice(0, 3).str.upper()
print(df[['product', 'product_code']].head())
print()

# Example 6: Starts with
print("Example 6: Products starting with 'L'")
l_products = df[df['product'].str.startswith('L')]
print(f"Products starting with 'L': {l_products['product'].nunique()}")
print(l_products['product'].unique())
print()

# Example 7: Replace
print("Example 7: Replace 'Phone' with 'Smartphone'")
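# regex=False treats 'Phone' as a literal string; the match is case-sensitive,
# so 'Headphones' (lowercase 'p') is left unchanged
df['product_renamed'] = df['product'].str.replace('Phone', 'Smartphone', regex=False)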
print(df[['product', 'product_renamed']].head())

## 3. DateTime Operations (.dt methods)

### Common DateTime Methods

| Method | Purpose | Example |
|--------|---------|----------|
| `.dt.year` | Extract year | 2024 |
| `.dt.month` | Extract month | 3 (March) |
| `.dt.day` | Extract day | 15 |
| `.dt.dayofweek` | Day of week | 0=Monday, 6=Sunday |
| `.dt.day_name()` | Day name | 'Monday' |
| `.dt.month_name()` | Month name | 'March' |
| `.dt.quarter` | Quarter | 1, 2, 3, 4 |
| `.dt.isocalendar().week` | ISO week number (`.dt.week` was removed in pandas 2.0) | 1-53 |
| `.dt.weekday` | Weekday (alias of `.dt.dayofweek`) | 0-6 |
| `.dt.is_month_end` | Is month end | True/False |
| `.dt.is_month_start` | Is month start | True/False |
| `.dt.date` | Date only | 2024-03-15 |

### Time Calculations
```python
# Date arithmetic
df['date'] + pd.Timedelta(days=7)  # Add 7 days
df['date2'] - df['date1']  # Date difference

# Date ranges
pd.date_range('2024-01-01', periods=10, freq='D')
```

### Real-World Use Cases
- Sales by month/quarter
- Weekday vs weekend analysis
- Time-based filtering
- Calculate age, tenure
- Seasonality analysis
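
Two of these use cases, week-based grouping and tenure, can be sketched on a hypothetical `signups` frame. The reference date is an arbitrary assumption chosen for illustration:

```python
import pandas as pd

# Hypothetical signup dates for illustration
signups = pd.DataFrame({
    'signup_date': pd.to_datetime(['2024-01-01', '2024-03-15', '2024-12-30'])
})

# ISO week number via isocalendar(), which returns year/week/day columns
signups['iso_week'] = signups['signup_date'].dt.isocalendar().week

# Tenure in whole days relative to a fixed reference date
reference = pd.Timestamp('2025-01-01')
signups['tenure_days'] = (reference - signups['signup_date']).dt.days
print(signups)
```

Note that ISO weeks can cross year boundaries: 2024-12-30 falls in ISO week 1 of 2025, so pair the week with `.dt.isocalendar().year` when grouping across years.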

In [None]:
print("=== DATETIME OPERATIONS ===\n")

# Ensure date column is datetime
df['date'] = pd.to_datetime(df['date'])

# Example 1: Extract components
print("Example 1: Extract date components")
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_name'] = df['date'].dt.day_name()
df['month_name'] = df['date'].dt.month_name()

print(df[['date', 'year', 'month', 'day', 'day_name', 'month_name']].head())
print()

# Example 2: Day of week (0=Monday, 6=Sunday)
print("Example 2: Analyze weekday patterns")
df['is_weekend'] = df['date'].dt.dayofweek >= 5
print(df[['date', 'day_name', 'is_weekend']].head())
print()

weekend_sales = df[df['is_weekend']]['revenue'].sum()
weekday_sales = df[~df['is_weekend']]['revenue'].sum()
print(f"Weekend sales: ${weekend_sales:,.2f}")
print(f"Weekday sales: ${weekday_sales:,.2f}")
print()

# Example 3: Quarter
print("Example 3: Extract quarter")
df['quarter'] = df['date'].dt.quarter
print(df[['date', 'quarter']].head())
print()

# Example 4: Date filtering
print("Example 4: Filter sales in March 2024")
march_sales = df[(df['date'].dt.month == 3) & (df['date'].dt.year == 2024)]
print(f"March sales count: {len(march_sales)}")
print(f"March revenue: ${march_sales['revenue'].sum():,.2f}")
print()

# Example 5: Calculate days since first sale
print("Example 5: Calculate days since first sale")
first_sale_date = df['date'].min()
df['days_since_first_sale'] = (df['date'] - first_sale_date).dt.days
print(df[['date', 'days_since_first_sale']].head())
print()

# Example 6: Date arithmetic
print("Example 6: Add 7 days to all dates (delivery date)")
df['delivery_date'] = df['date'] + pd.Timedelta(days=7)
print(df[['date', 'delivery_date']].head())

## 4. GroupBy and Aggregations

### What is GroupBy?

**GroupBy** = Split-Apply-Combine pattern
1. **Split**: Divide data into groups
2. **Apply**: Apply function to each group
3. **Combine**: Combine results

```python
df.groupby('column').agg(function)
```
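
To make the three steps concrete, the sketch below performs them by hand on a hypothetical `orders` frame and compares the result with the `groupby` one-liner:

```python
import pandas as pd

# Hypothetical toy data for illustration
orders = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Tablet'],
    'revenue': [1000, 500, 1200, 700, 300],
})

# Split: iterate over one sub-DataFrame per product
# Apply: sum the revenue inside each piece
pieces = {name: group['revenue'].sum() for name, group in orders.groupby('product')}

# Combine: assemble the per-group results into one Series
manual = pd.Series(pieces, name='revenue').sort_index()

# The one-liner performs the same three steps internally
builtin = orders.groupby('product')['revenue'].sum()
print(manual.to_dict() == builtin.to_dict())
```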

### Common Aggregation Functions

| Function | Purpose | Example |
|----------|---------|----------|
| `sum()` | Total | Total revenue by product |
| `mean()` | Average | Average price by region |
| `median()` | Middle value | Median revenue |
| `count()` | Count | Number of sales |
| `nunique()` | Unique count | Unique customers |
| `min()` | Minimum | Lowest price |
| `max()` | Maximum | Highest revenue |
| `std()` | Standard deviation | Revenue variability |
| `var()` | Variance | Price variance |
| `first()` | First value | First sale date |
| `last()` | Last value | Last sale date |

### Multiple Aggregations

```python
# Single aggregation
df.groupby('product')['revenue'].sum()

# Multiple aggregations
df.groupby('product')['revenue'].agg(['sum', 'mean', 'count'])

# Different aggregations per column
df.groupby('product').agg({
    'revenue': ['sum', 'mean'],
    'quantity': ['sum', 'max']
})
```

### Real-World Use Cases
- Sales by product/region/month
- Customer lifetime value
- Regional performance
- Product performance analysis

In [None]:
print("=== GROUPBY OPERATIONS ===\n")

# Example 1: Simple groupby
print("Example 1: Total revenue by product")
product_revenue = df.groupby('product')['revenue'].sum().sort_values(ascending=False)
print(product_revenue)
print()

# Example 2: Multiple aggregations
print("Example 2: Product statistics")
product_stats = df.groupby('product')['revenue'].agg(['sum', 'mean', 'count', 'max', 'min'])
product_stats.columns = ['Total', 'Average', 'Count', 'Max', 'Min']
print(product_stats)
print()

# Example 3: Group by multiple columns
print("Example 3: Revenue by product and region")
product_region = df.groupby(['product', 'region'])['revenue'].sum().sort_values(ascending=False)
print(product_region.head(10))
print()

# Example 4: Different aggregations per column
print("Example 4: Complex aggregations")
complex_agg = df.groupby('product').agg({
    'revenue': ['sum', 'mean'],
    'quantity': ['sum', 'max'],
    'customer_name': 'nunique'  # Unique customers
})
print(complex_agg)
print()

# Example 5: Custom aggregation function
print("Example 5: Custom aggregation - Revenue range")
def revenue_range(x):
    return x.max() - x.min()

# Named aggregation: output_column=function
revenue_ranges = df.groupby('product')['revenue'].agg(
    Total='sum',
    Average='mean',
    Range=revenue_range
)
print(revenue_ranges)
print()

# Example 6: Group by date components
print("Example 6: Monthly sales trend")
monthly_sales = df.groupby(df['date'].dt.to_period('M'))['revenue'].agg(
    Total_Revenue='sum',
    Num_Orders='count',
    Avg_Order_Value='mean'
)
print(monthly_sales.head())
print()

# Example 7: Filter groups
print("Example 7: Products with total revenue > $50,000")
high_revenue_products = df.groupby('product')['revenue'].sum()
high_revenue_products = high_revenue_products[high_revenue_products > 50000]
print(high_revenue_products)
print()

# Example 8: Transform (keep original DataFrame size)
print("Example 8: Add group average to each row")
df['product_avg_revenue'] = df.groupby('product')['revenue'].transform('mean')
print(df[['product', 'revenue', 'product_avg_revenue']].head())
print()

# Example 9: Rank within groups
print("Example 9: Rank sales within each product")
df['revenue_rank'] = df.groupby('product')['revenue'].rank(ascending=False, method='dense')
print(df[['product', 'revenue', 'revenue_rank']].head(10))

## 5. Combining DataFrames: merge(), join(), concat()

### Types of Joins

| Join Type | Description | SQL Equivalent |
|-----------|-------------|----------------|
| **inner** | Only matching rows | INNER JOIN |
| **left** | All from left, matching from right | LEFT JOIN |
| **right** | All from right, matching from left | RIGHT JOIN |
| **outer** | All rows from both | FULL OUTER JOIN |
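
The effect of each join type is easiest to see on two tiny tables with partly overlapping keys. The `left` and `right` frames below are hypothetical and exist only for this comparison:

```python
import pandas as pd

# Hypothetical tables: 'C' appears only on the left, 'D' only on the right
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'right_val': [10, 20, 40]})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='key', how=how)
    row_counts[how] = len(merged)
    print(f"{how:>5}: keys={merged['key'].tolist()}")
```

Comparing row counts before and after a merge, as done here, is a cheap sanity check against accidentally dropped or duplicated rows.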

### Methods

**1. merge()** - SQL-style joins
```python
pd.merge(df1, df2, on='key', how='inner')
```

**2. join()** - Join on index
```python
df1.join(df2, how='left')
```

**3. concat()** - Stack DataFrames
```python
pd.concat([df1, df2], axis=0)  # Vertical stack
pd.concat([df1, df2], axis=1)  # Horizontal stack
```

### Real-World Use Cases
- Join customer data with orders
- Merge product catalog with sales
- Combine regional datasets
- Add demographic information
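
The cell below demonstrates `merge()` and `concat()`. For completeness, here is a small sketch of `join()`, which aligns on the index rather than on a column. The `orders` and `profiles` frames are hypothetical:

```python
import pandas as pd

# Hypothetical tables sharing customer_id as the index
orders = pd.DataFrame(
    {'revenue': [250, 900, 400]},
    index=pd.Index([101, 102, 104], name='customer_id'),
)
profiles = pd.DataFrame(
    {'loyalty_level': ['Gold', 'Silver', 'Platinum']},
    index=pd.Index([101, 102, 103], name='customer_id'),
)

# Left join on the index: customer 104 has no profile, so loyalty_level is NaN
joined = orders.join(profiles, how='left')
print(joined)
```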

In [None]:
print("=== COMBINING DATAFRAMES ===\n")

# Create additional datasets for merging
print("Creating sample datasets...\n")

# Customer info dataset
customers = pd.DataFrame({
    'customer_name': ['John Smith', 'Alice Johnson', 'Bob Wilson', 'Emma Davis', 'Michael Brown'],
    'customer_id': [101, 102, 103, 104, 105],
    'email': ['john@email.com', 'alice@email.com', 'bob@email.com', 
              'emma@email.com', 'michael@email.com'],
    'loyalty_level': ['Gold', 'Silver', 'Platinum', 'Gold', 'Bronze']
})

# Product info dataset
products = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Watch'],
    'brand': ['Dell', 'Apple', 'Samsung', 'Sony', 'Apple'],
    'warranty_months': [24, 12, 12, 6, 12]
})

print("Customers DataFrame:")
print(customers)
print()
print("Products DataFrame:")
print(products)
print()

# Example 1: Inner merge
print("Example 1: INNER MERGE - Sales with customer info")
sales_with_customers = pd.merge(df, customers, on='customer_name', how='inner')
print(sales_with_customers[['customer_name', 'email', 'loyalty_level', 'revenue']].head())
print(f"Rows: {len(df)} → {len(sales_with_customers)}")
print()

# Example 2: Left merge
print("Example 2: LEFT MERGE - Sales with product info")
sales_with_products = pd.merge(df, products, on='product', how='left')
print(sales_with_products[['product', 'brand', 'warranty_months', 'revenue']].head())
print()

# Example 3: Merge on multiple columns
print("Example 3: Merge on multiple keys")
# Create sample with multiple keys
df_subset = df[['product', 'region', 'revenue']].head()
region_info = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Laptop'],
    'region': ['North', 'North', 'South'],
    'regional_discount': [0.05, 0.03, 0.10]
})

merged = pd.merge(df_subset, region_info, on=['product', 'region'], how='left')
print(merged)
print()

# Example 4: concat() - Vertical stack
print("Example 4: CONCAT - Stack DataFrames vertically")
df1 = df.head(5)
df2 = df.tail(5)
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(f"df1: {len(df1)} rows, df2: {len(df2)} rows")
print(f"Stacked: {len(stacked)} rows")
print()

# Example 5: concat() - Horizontal stack
print("Example 5: CONCAT - Stack DataFrames horizontally")
df_left = df[['product', 'quantity']].head()
df_right = df[['price', 'revenue']].head()
h_stacked = pd.concat([df_left, df_right], axis=1)
print(h_stacked)
print()

# Example 6: merge with indicator
print("Example 6: Merge with indicator (shows merge source)")
merged_indicator = pd.merge(df.head(5), customers, on='customer_name', 
                            how='outer', indicator=True)
print(merged_indicator[['customer_name', 'loyalty_level', '_merge']].head())
print("\n_merge column values:")
print("  - 'both': Found in both DataFrames")
print("  - 'left_only': Only in left DataFrame")
print("  - 'right_only': Only in right DataFrame")

## 6. Pivot Tables and Reshaping

### Pivot Table

**Pivot table** = Reshape data from long to wide format, aggregating values that share the same row and column labels

```python
df.pivot_table(
    values='column_to_aggregate',
    index='row_labels',
    columns='column_labels',
    aggfunc='mean'  # or sum, count, etc.
)
```

### Parameters

| Parameter | Purpose | Example |
|-----------|---------|----------|
| `values` | Column to aggregate | 'revenue' |
| `index` | Row labels | 'product' |
| `columns` | Column labels | 'region' |
| `aggfunc` | Aggregation function | 'sum', 'mean', 'count' |
| `fill_value` | Fill missing values | 0 |
| `margins` | Add row/column totals | True |

### Related Operations

**melt()** - Unpivot (wide to long)
```python
pd.melt(df, id_vars=['id'], value_vars=['col1', 'col2'])
```

**stack()** - Pivot column labels to row index
```python
df.stack()
```

**unstack()** - Pivot row index to column labels
```python
df.unstack()
```

### Real-World Use Cases
- Create cross-tabulation reports
- Sales by product × region
- Time series analysis (month × product)
- Excel-style pivot tables
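
One point worth illustrating is how `pivot()` differs from `pivot_table()`: the former only reshapes, while the latter also aggregates. The sketch below uses a hypothetical `sales` frame with one duplicated product/region pair to show the difference:

```python
import pandas as pd

# Hypothetical long-format data with a duplicate (Laptop, North) pair
sales = pd.DataFrame({
    'product': ['Laptop', 'Laptop', 'Phone'],
    'region': ['North', 'North', 'South'],
    'revenue': [1000, 1500, 700],
})

# pivot() only reshapes, so duplicate index/column pairs raise ValueError
try:
    sales.pivot(index='product', columns='region', values='revenue')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead
table = sales.pivot_table(values='revenue', index='product',
                          columns='region', aggfunc='sum', fill_value=0)
print(pivot_failed)
print(table)
```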

In [None]:
print("=== PIVOT TABLES ===\n")

# Example 1: Simple pivot table
print("Example 1: Revenue by Product × Region")
pivot1 = df.pivot_table(
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum',
    fill_value=0
)
print(pivot1)
print()

# Example 2: Pivot with margins (totals)
print("Example 2: Pivot with row and column totals")
pivot2 = df.pivot_table(
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)
print(pivot2)
print()

# Example 3: Multiple aggregations
print("Example 3: Multiple aggregation functions")
pivot3 = df.pivot_table(
    values='revenue',
    index='product',
    columns='region',
    aggfunc=['sum', 'mean', 'count'],
    fill_value=0
)
print(pivot3)
print()

# Example 4: Multiple values
print("Example 4: Pivot multiple values")
pivot4 = df.pivot_table(
    values=['revenue', 'quantity'],
    index='product',
    columns='region',
    aggfunc='sum',
    fill_value=0
)
print(pivot4)
print()

# Example 5: Cross-tabulation (special pivot)
print("Example 5: Cross-tabulation (product × payment method)")
crosstab = pd.crosstab(
    df['product'],
    df['payment_method'],
    values=df['revenue'],
    aggfunc='sum',
    margins=True
)
print(crosstab)
print()

# Example 6: Melt (unpivot)
print("Example 6: MELT - Convert wide to long format")
# Create wide format
wide_df = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet'],
    'Q1_sales': [1000, 1500, 800],
    'Q2_sales': [1200, 1600, 900],
    'Q3_sales': [1100, 1700, 850]
})
print("Wide format:")
print(wide_df)
print()

# Melt to long format
long_df = pd.melt(
    wide_df,
    id_vars=['product'],
    value_vars=['Q1_sales', 'Q2_sales', 'Q3_sales'],
    var_name='quarter',
    value_name='sales'
)
print("Long format (melted):")
print(long_df)
print()

# Example 7: Stack and Unstack
print("Example 7: STACK and UNSTACK")
pivot_for_stack = df.pivot_table(
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum'
)
print("Original pivot:")
print(pivot_for_stack.head())
print()

stacked = pivot_for_stack.stack()
print("Stacked (long format):")
print(stacked.head())
print()

unstacked = stacked.unstack()
print("Unstacked (back to wide):")
print(unstacked.head())

## 7. Sorting and Filtering

### Sorting

**sort_values()** - Sort by column values
```python
df.sort_values('column', ascending=True)
df.sort_values(['col1', 'col2'], ascending=[True, False])
```

**sort_index()** - Sort by index
```python
df.sort_index()
```

### Filtering (Boolean Indexing)

**Single condition**
```python
df[df['revenue'] > 1000]
```

**Multiple conditions (AND)**
```python
df[(df['revenue'] > 1000) & (df['region'] == 'North')]
```

**Multiple conditions (OR)**
```python
df[(df['product'] == 'Laptop') | (df['product'] == 'Phone')]
```

**NOT condition**
```python
df[~(df['region'] == 'North')]
```

**isin()** - Check if in list
```python
df[df['product'].isin(['Laptop', 'Phone'])]
```

**query()** - SQL-like filtering
```python
df.query('revenue > 1000 and region == "North"')
```
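
`query()` can also reference Python variables with the `@` prefix, which avoids building the expression through string formatting. A minimal sketch on a hypothetical `orders` frame, compared against the equivalent boolean mask:

```python
import pandas as pd

# Hypothetical toy data for illustration
orders = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop'],
    'region': ['North', 'South', 'North', 'West'],
    'revenue': [1200, 800, 1500, 400],
})

min_revenue = 1000
wanted_regions = ['North', 'West']

# @name pulls in a Python variable from the enclosing scope
via_query = orders.query('revenue > @min_revenue and region in @wanted_regions')

# The equivalent boolean mask
via_mask = orders[(orders['revenue'] > min_revenue)
                  & orders['region'].isin(wanted_regions)]
print(via_query.equals(via_mask))
```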

In [None]:
print("=== SORTING ===\n")

# Example 1: Sort by single column
print("Example 1: Sort by revenue (descending)")
sorted_df = df.sort_values('revenue', ascending=False)
print(sorted_df[['product', 'revenue']].head())
print()

# Example 2: Sort by multiple columns
print("Example 2: Sort by product (A-Z) then revenue (high to low)")
sorted_multi = df.sort_values(['product', 'revenue'], ascending=[True, False])
print(sorted_multi[['product', 'revenue']].head(10))
print()

# Example 3: Sort by index
print("Example 3: Sort by index")
df_shuffled = df.sample(frac=1)  # Shuffle
df_sorted_idx = df_shuffled.sort_index()
print(f"Before: {df_shuffled.index[:5].tolist()}")
print(f"After: {df_sorted_idx.index[:5].tolist()}")
print()

print("=== FILTERING ===\n")

# Example 4: Simple filter
print("Example 4: Revenue > $5000")
high_revenue = df[df['revenue'] > 5000]
print(f"Found {len(high_revenue)} high-revenue orders")
print(high_revenue[['product', 'revenue']].head())
print()

# Example 5: Multiple conditions (AND)
print("Example 5: Laptops in North region")
laptop_north = df[(df['product'] == 'Laptop') & (df['region'] == 'North')]
print(f"Found {len(laptop_north)} matching orders")
print(laptop_north[['product', 'region', 'revenue']].head())
print()

# Example 6: Multiple conditions (OR)
print("Example 6: Laptop OR Phone")
laptop_or_phone = df[(df['product'] == 'Laptop') | (df['product'] == 'Phone')]
print(f"Found {len(laptop_or_phone)} orders")
print(laptop_or_phone['product'].value_counts())
print()

# Example 7: NOT condition
print("Example 7: NOT North region")
not_north = df[~(df['region'] == 'North')]
print(f"Orders not in North: {len(not_north)}")
print(not_north['region'].value_counts())
print()

# Example 8: isin() method
print("Example 8: Products in specific list")
products_of_interest = ['Laptop', 'Phone', 'Tablet']
filtered = df[df['product'].isin(products_of_interest)]
print(f"Found {len(filtered)} orders for: {products_of_interest}")
print(filtered['product'].value_counts())
print()

# Example 9: between() method
print("Example 9: Revenue between $2000 and $8000")
mid_revenue = df[df['revenue'].between(2000, 8000)]
print(f"Found {len(mid_revenue)} mid-range orders")
print(mid_revenue[['product', 'revenue']].head())
print()

# Example 10: query() method
print("Example 10: Using query() - SQL-like syntax")
queried = df.query('revenue > 5000 and region == "North"')
print(f"Query result: {len(queried)} rows")
print(queried[['product', 'region', 'revenue']].head())
print()

# Example 11: String contains in filter
print("Example 11: Customers with 'Smith' in name")
smith_customers = df[df['customer_name'].str.contains('Smith')]
print(f"Found {len(smith_customers)} orders from Smith customers")
print(smith_customers['customer_name'].unique())
print()

# Example 12: nlargest() and nsmallest()
print("Example 12: Top 5 and Bottom 5 by revenue")
print("\nTop 5:")
print(df.nlargest(5, 'revenue')[['product', 'revenue']])
print("\nBottom 5:")
print(df.nsmallest(5, 'revenue')[['product', 'revenue']])

## 8. Window Functions (Rolling, Expanding, Cumulative)

### What are Window Functions?

**Window Functions** = Perform calculations across a set of rows (the window) related to the current row

### Types

**1. Rolling (Moving Window)**
- Fixed window size moves through data
- Example: 7-day moving average

```python
df['rolling_avg'] = df['value'].rolling(window=7).mean()
```
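
With the default settings, a rolling window produces `NaN` until it has seen `window` observations. The sketch below, on a hypothetical five-element series, contrasts that with `min_periods=1`:

```python
import pandas as pd

# Hypothetical short series for illustration
values = pd.Series([10, 20, 30, 40, 50])

# Default: a result appears only once the window is full
strict = values.rolling(window=3).mean()

# min_periods=1 computes from whatever observations are available so far
lenient = values.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'value': values, 'strict': strict, 'lenient': lenient}))
```

Whether partial windows are acceptable depends on the analysis; early `lenient` values average fewer points and are therefore noisier.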

**2. Expanding (Cumulative)**
- Window grows from start
- Example: Cumulative sum

```python
df['expanding_sum'] = df['value'].expanding().sum()
```

**3. Exponentially Weighted (EWM)**
- Recent values have more weight
- Example: Exponential moving average

```python
df['ewm_avg'] = df['value'].ewm(span=7).mean()
```
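
The `span` parameter maps to a smoothing factor `alpha = 2 / (span + 1)`. The sketch below checks the recursive form of the average by hand on a hypothetical series. It passes `adjust=False` so that pandas uses the plain recursion; the default `adjust=True` used elsewhere in this notebook weights the early observations differently:

```python
import pandas as pd

# Hypothetical short series for illustration
values = pd.Series([10.0, 20.0, 30.0, 40.0])
span = 3
alpha = 2 / (span + 1)   # span=3 gives alpha=0.5

# Manual recursion: y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = [values.iloc[0]]
for x in values.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

# adjust=False makes pandas use the same recursive definition
builtin = values.ewm(span=span, adjust=False).mean()
print(manual)
print(builtin.tolist())
```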

### Common Operations

| Operation | Purpose | Example |
|-----------|---------|----------|
| `mean()` | Moving average | 7-day avg sales |
| `sum()` | Moving sum | 7-day total sales |
| `min()` | Moving minimum | Lowest price in 30 days |
| `max()` | Moving maximum | Highest revenue |
| `std()` | Moving std dev | Revenue volatility |

### Real-World Use Cases
- **Finance**: Moving averages, volatility
- **Sales**: Trends, seasonality
- **IoT**: Sensor smoothing
- **Web**: Page view trends

In [None]:
print("=== WINDOW FUNCTIONS ===\n")

# Create time series data
print("Creating time series dataset...\n")
date_range = pd.date_range('2024-01-01', periods=30, freq='D')
ts_df = pd.DataFrame({
    'date': date_range,
    'sales': np.random.randint(100, 500, 30) + np.sin(np.arange(30)) * 50
})
ts_df['sales'] = ts_df['sales'].round(2)

print(ts_df.head(10))
print()

# Example 1: Rolling mean (moving average)
print("Example 1: 7-day moving average")
ts_df['rolling_7d_avg'] = ts_df['sales'].rolling(window=7).mean()
print(ts_df[['date', 'sales', 'rolling_7d_avg']].head(10))
print()

# Example 2: Rolling sum
print("Example 2: 7-day rolling sum")
ts_df['rolling_7d_sum'] = ts_df['sales'].rolling(window=7).sum()
print(ts_df[['date', 'sales', 'rolling_7d_sum']].tail())
print()

# Example 3: Rolling min and max
print("Example 3: 7-day rolling min and max")
ts_df['rolling_min'] = ts_df['sales'].rolling(window=7).min()
ts_df['rolling_max'] = ts_df['sales'].rolling(window=7).max()
print(ts_df[['date', 'sales', 'rolling_min', 'rolling_max']].tail())
print()

# Example 4: Expanding (cumulative)
print("Example 4: Cumulative sum and mean")
ts_df['cumsum'] = ts_df['sales'].expanding().sum()
ts_df['cum_avg'] = ts_df['sales'].expanding().mean()
print(ts_df[['date', 'sales', 'cumsum', 'cum_avg']].head(10))
print()

# Example 5: Using cumsum() directly
print("Example 5: Cumulative functions (direct)")
ts_df['cumsum_direct'] = ts_df['sales'].cumsum()
ts_df['cumprod'] = ts_df['sales'].cumprod()  # Cumulative product
ts_df['cummax'] = ts_df['sales'].cummax()  # Cumulative maximum
print(ts_df[['date', 'sales', 'cumsum_direct', 'cummax']].head(10))
print()

# Example 6: Exponential weighted moving average
print("Example 6: Exponential weighted moving average (EWMA)")
ts_df['ewma'] = ts_df['sales'].ewm(span=7).mean()
print(ts_df[['date', 'sales', 'rolling_7d_avg', 'ewma']].tail(10))
print("\nNote: EWMA gives more weight to recent values")
print()

# Example 7: Shift (lag/lead)
print("Example 7: Shift for lag and lead values")
ts_df['prev_day_sales'] = ts_df['sales'].shift(1)  # Lag 1
ts_df['next_day_sales'] = ts_df['sales'].shift(-1)  # Lead 1
ts_df['sales_change'] = ts_df['sales'] - ts_df['prev_day_sales']
print(ts_df[['date', 'prev_day_sales', 'sales', 'next_day_sales', 'sales_change']].head(10))
print()

# Example 8: Percentage change
print("Example 8: Percentage change")
ts_df['pct_change'] = ts_df['sales'].pct_change() * 100
print(ts_df[['date', 'sales', 'pct_change']].head(10))
print()

# Example 9: Rolling with different window sizes
print("Example 9: Multiple rolling windows")
ts_df['ma_3d'] = ts_df['sales'].rolling(window=3).mean()
ts_df['ma_7d'] = ts_df['sales'].rolling(window=7).mean()
ts_df['ma_14d'] = ts_df['sales'].rolling(window=14).mean()
print(ts_df[['date', 'sales', 'ma_3d', 'ma_7d', 'ma_14d']].tail(10))

## 9. Additional Data Manipulation Techniques

### Creating New Columns

**Method 1: Direct assignment**
```python
df['new_col'] = df['col1'] + df['col2']
```

**Method 2: assign()**
```python
df.assign(new_col = lambda x: x['col1'] + x['col2'])
```

**Method 3: np.where() - Conditional**
```python
df['category'] = np.where(df['value'] > 100, 'High', 'Low')
```

**Method 4: np.select() - Multiple conditions**
```python
conditions = [df['value'] > 100, df['value'] > 50]
choices = ['High', 'Medium']
df['category'] = np.select(conditions, choices, default='Low')
```
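
`np.select()` assigns the choice of the first condition that is true, so the order of `conditions` matters. A small sketch on hypothetical values shows what happens when the broader condition comes first:

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration
values = pd.Series([30, 75, 150])

# Most restrictive condition first: 150 is labelled 'High'
ordered = np.select([values > 100, values > 50], ['High', 'Medium'], default='Low')

# Reversed order: 150 also satisfies values > 50, so it never reaches 'High'
reversed_order = np.select([values > 50, values > 100], ['Medium', 'High'], default='Low')

print(ordered.tolist())
print(reversed_order.tolist())
```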

### Binning

**cut()** - Bin continuous data into intervals
```python
pd.cut(df['age'], bins=[0, 18, 35, 60, 100], 
       labels=['Child', 'Young', 'Middle', 'Senior'])
```

**qcut()** - Quantile-based binning
```python
pd.qcut(df['revenue'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
```
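
The difference between the two shows up most clearly on skewed data. In this sketch the values are a hypothetical assumption: `cut()` with an integer `bins` makes equal-width intervals, while `qcut()` makes intervals holding roughly equal numbers of rows:

```python
import pandas as pd

# Hypothetical skewed values: most are small, one is very large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# cut(): 2 equal-width bins over the value range, so the outlier sits alone
equal_width = pd.cut(values, bins=2, labels=['low', 'high'])

# qcut(): 2 equal-frequency bins split at the median
equal_freq = pd.qcut(values, q=2, labels=['low', 'high'])

print(equal_width.value_counts().to_dict())
print(equal_freq.value_counts().to_dict())
```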

### Ranking

```python
df['rank'] = df['revenue'].rank(ascending=False, method='dense')
```
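
The `method` argument controls how ties are ranked. The sketch below compares three common choices on a hypothetical series with one tie:

```python
import pandas as pd

# Hypothetical revenues with a tie at 500
revenue = pd.Series([500, 300, 500, 100])

ranks = pd.DataFrame({
    'revenue': revenue,
    # ties share the mean of the positions they occupy (the default)
    'average': revenue.rank(ascending=False, method='average'),
    # ties share the lowest position; the next rank is skipped
    'min': revenue.rank(ascending=False, method='min'),
    # ties share a rank and no ranks are skipped
    'dense': revenue.rank(ascending=False, method='dense'),
})
print(ranks)
```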

### Sample

```python
df.sample(n=10)  # Random 10 rows
df.sample(frac=0.1)  # Random 10% of data
```

In [None]:
print("=== ADDITIONAL TECHNIQUES ===\n")

# Example 1: np.where() for conditional columns
print("Example 1: Create category using np.where()")
df['revenue_level'] = np.where(df['revenue'] > 5000, 'High', 'Low')
print(df[['revenue', 'revenue_level']].head(10))
print()

# Example 2: np.select() for multiple conditions
print("Example 2: Multiple conditions with np.select()")
conditions = [
    df['revenue'] > 10000,
    df['revenue'] > 5000,
    df['revenue'] > 2000
]
choices = ['Premium', 'High', 'Medium']
df['revenue_tier'] = np.select(conditions, choices, default='Low')
print(df[['revenue', 'revenue_tier']].head(10))
print()
print("Revenue tier distribution:")
print(df['revenue_tier'].value_counts())
print()

# Example 3: cut() - Binning into intervals
print("Example 3: Bin revenue into ranges")
df['revenue_bin'] = pd.cut(
    df['revenue'],
    bins=[0, 2000, 5000, 10000, 20000],
    labels=['0-2k', '2k-5k', '5k-10k', '10k-20k'],  # values above 20,000 would become NaN
    include_lowest=True
)
print(df[['revenue', 'revenue_bin']].head(10))
print()
print("Bin distribution:")
print(df['revenue_bin'].value_counts().sort_index())
print()

# Example 4: qcut() - Quantile-based binning
print("Example 4: Quantile-based binning (quartiles)")
df['revenue_quartile'] = pd.qcut(
    df['revenue'],
    q=4,
    labels=['Q1 (Lowest 25%)', 'Q2', 'Q3', 'Q4 (Top 25%)']
)
print(df[['revenue', 'revenue_quartile']].head(10))
print()
print("Quartile distribution:")
print(df['revenue_quartile'].value_counts().sort_index())
print()

# Example 5: Ranking
print("Example 5: Rank orders by revenue")
df['revenue_rank'] = df['revenue'].rank(ascending=False, method='dense')
top_orders = df.nsmallest(10, 'revenue_rank')[['product', 'revenue', 'revenue_rank']]
print("Top 10 orders by revenue:")
print(top_orders)
print()

# Example 6: assign() method
print("Example 6: Create multiple columns with assign()")
df_assigned = df.assign(
    total_before_tax=lambda x: x['quantity'] * x['price'],
    tax_amount=lambda x: x['revenue'] * 0.18,  # 0.18 is the tax rate used in this example
    total_with_tax=lambda x: x['revenue'] * 1.18
)
print(df_assigned[['revenue', 'total_before_tax', 'tax_amount', 'total_with_tax']].head())
print()

# Example 7: Sampling
print("Example 7: Random sampling")
random_10 = df.sample(n=10, random_state=42)
print(f"Sampled 10 random rows: indices {random_10.index.tolist()}")
print()

random_10pct = df.sample(frac=0.1, random_state=42)
print(f"Sampled 10% of data: {len(random_10pct)} rows")
print()

# Example 8: Clip values
print("Example 8: Clip revenue (cap at min/max)")
df['revenue_clipped'] = df['revenue'].clip(lower=1000, upper=10000)
print("Original revenue range:", df['revenue'].min(), '-', df['revenue'].max())
print("Clipped revenue range:", df['revenue_clipped'].min(), '-', df['revenue_clipped'].max())
print()

# Example 9: Categorical encoding
print("Example 9: Encode categories as numbers")
df['region_code'] = pd.Categorical(df['region']).codes
print(df[['region', 'region_code']].drop_duplicates().sort_values('region_code'))

## 10. Comprehensive Real-World Example

### Business Scenario

**Task**: Analyze sales data to answer key business questions:
1. Which products are top performers?
2. What are the regional trends?
3. Who are the VIP customers?
4. What's the sales trend over time?
5. Which product-region combinations are most profitable?

We'll combine multiple techniques learned in this notebook.

In [None]:
print("="*70)
print("COMPREHENSIVE SALES ANALYSIS")
print("="*70)
print()

# Question 1: Top performing products
print("1. TOP PERFORMING PRODUCTS")
print("-" * 70)
product_performance = df.groupby('product').agg({
    'revenue': ['sum', 'mean', 'count'],
    'quantity': 'sum',
    'customer_name': 'nunique'
}).round(2)

product_performance.columns = ['Total_Revenue', 'Avg_Order_Value', 
                                'Num_Orders', 'Units_Sold', 'Unique_Customers']
product_performance = product_performance.sort_values('Total_Revenue', ascending=False)

print(product_performance)
print()

# Question 2: Regional trends
print("2. REGIONAL PERFORMANCE")
print("-" * 70)
regional_pivot = df.pivot_table(
    values='revenue',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0,
    margins=True
)
print(regional_pivot)
print()

# Question 3: VIP Customers (top 10 by total spend)
print("3. VIP CUSTOMERS (Top 10 by Total Spend)")
print("-" * 70)
customer_analysis = df.groupby('customer_name').agg({
    'revenue': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)

customer_analysis.columns = ['Total_Spend', 'Avg_Order', 'Num_Orders', 'Total_Items']
customer_analysis['Customer_Value_Score'] = (
    customer_analysis['Total_Spend'] * 0.5 + 
    customer_analysis['Num_Orders'] * 100
).round(2)

vip_customers = customer_analysis.sort_values('Total_Spend', ascending=False).head(10)
print(vip_customers)
print()

# Question 4: Sales trend over time
print("4. MONTHLY SALES TREND")
print("-" * 70)
df_sorted = df.sort_values('date')
monthly_trend = df_sorted.groupby(df_sorted['date'].dt.to_period('M')).agg({
    'revenue': ['sum', 'mean', 'count']
}).round(2)

monthly_trend.columns = ['Total_Revenue', 'Avg_Order_Value', 'Num_Orders']
monthly_trend['MoM_Growth_%'] = monthly_trend['Total_Revenue'].pct_change() * 100
monthly_trend['MoM_Growth_%'] = monthly_trend['MoM_Growth_%'].round(2)

print(monthly_trend)
print()

# Question 5: Best product-region combinations
print("5. TOP 10 PRODUCT-REGION COMBINATIONS")
print("-" * 70)
combo_analysis = df.groupby(['product', 'region'])['revenue'].agg(['sum', 'count']).round(2)
combo_analysis.columns = ['Total_Revenue', 'Num_Orders']
combo_analysis['Avg_Order'] = (combo_analysis['Total_Revenue'] / combo_analysis['Num_Orders']).round(2)
combo_analysis = combo_analysis.sort_values('Total_Revenue', ascending=False).head(10)

print(combo_analysis)
print()

# Bonus: Payment method analysis
print("6. PAYMENT METHOD ANALYSIS")
print("-" * 70)
payment_analysis = pd.crosstab(
    df['payment_method'],
    df['product'],
    values=df['revenue'],
    aggfunc='sum',
    margins=True
).round(2)
print(payment_analysis)
print()

# Summary statistics
print("="*70)
print("SUMMARY STATISTICS")
print("="*70)
print(f"Total Revenue: ${df['revenue'].sum():,.2f}")
print(f"Average Order Value: ${df['revenue'].mean():,.2f}")
print(f"Total Orders: {len(df):,}")
print(f"Unique Products: {df['product'].nunique()}")
print(f"Unique Customers: {df['customer_name'].nunique()}")
print(f"Date Range: {df['date'].min().strftime('%Y-%m-%d')} to {df['date'].max().strftime('%Y-%m-%d')}")
print(f"Best Selling Product: {df.groupby('product')['revenue'].sum().idxmax()}")
print(f"Best Region: {df.groupby('region')['revenue'].sum().idxmax()}")
print("="*70)

## 11. Performance Tips & Best Practices

### Performance Optimization

**1. Vectorization > Loops**
```python
# ❌ Slow - Loop
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 2

# ✅ Fast - Vectorized
df['new_col'] = df['col1'] * 2
```
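
To see the gap on your own machine, a rough timing sketch like the one below can help. The frame size is arbitrary and the `demo`, `loop_col`, and `vector_col` names are made up for illustration; absolute timings will vary.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical frame; the size is arbitrary and timings vary by machine
demo = pd.DataFrame({'col1': np.arange(2_000)})

# Row-by-row assignment through .loc
start = time.perf_counter()
for i in range(len(demo)):
    demo.loc[i, 'loop_col'] = demo.loc[i, 'col1'] * 2
loop_seconds = time.perf_counter() - start

# One vectorized operation over the whole column
start = time.perf_counter()
demo['vector_col'] = demo['col1'] * 2
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.6f}s")
```

Both columns hold the same values; only the time taken to compute them differs.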

**2. Use appropriate methods**
```python
# map() is faster than apply() for simple replacements
df['col'].map(mapping_dict)  # ✅ Faster
df['col'].apply(lambda x: mapping_dict.get(x))  # ❌ Slower
```
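
One thing to watch with `map()` and a dictionary: values without a matching key become `NaN`. A minimal sketch with hypothetical region codes (the codes and the `'Unknown'` fallback are assumptions for illustration):

```python
import pandas as pd

# Hypothetical codes and mapping for illustration
codes = pd.Series(['N', 'S', 'X'])
mapping_dict = {'N': 'North', 'S': 'South'}

# 'X' has no entry in the dictionary, so it maps to NaN
mapped = codes.map(mapping_dict)

# fillna() supplies a fallback label for unmatched values
mapped_filled = codes.map(mapping_dict).fillna('Unknown')

print(mapped.tolist())
print(mapped_filled.tolist())
```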

**3. Filter early, aggregate late**
```python
# ✅ Filter first (smaller dataset)
df[df['year'] == 2024].groupby('product')['revenue'].sum()

# ❌ Aggregate everything first, then discard most of the result
df.groupby(['year', 'product'])['revenue'].sum().loc[2024]
```
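
Filtering first and aggregating first give the same totals; the difference is how many rows the `groupby` has to process. A small sketch on a hypothetical `sales` frame with a `year` column:

```python
import pandas as pd

# Hypothetical sales rows spanning two years
sales = pd.DataFrame({
    'year': [2023, 2024, 2024, 2024],
    'product': ['Laptop', 'Laptop', 'Phone', 'Phone'],
    'revenue': [1000, 1200, 600, 400],
})

# Filter first: only the 2024 rows are grouped
filter_first = sales[sales['year'] == 2024].groupby('product')['revenue'].sum()

# Aggregate first: every year is grouped, then everything but 2024 is discarded
aggregate_first = sales.groupby(['year', 'product'])['revenue'].sum().loc[2024]

print(filter_first)
print(aggregate_first)
```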

**4. Use categorical for repeated strings**
```python
df['category'] = df['category'].astype('category')  # Saves memory when a few values repeat many times
```
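
The saving can be measured with `memory_usage(deep=True)`. The sketch below uses a hypothetical column of three distinct strings repeated many times; exact byte counts depend on the pandas version and string backend.

```python
import pandas as pd

# Hypothetical column: 3 distinct strings repeated many times
categories = pd.Series(['Electronics', 'Accessories', 'Home'] * 10_000)

# deep=True counts the string contents, not just the pointers
as_text = categories.memory_usage(deep=True)
as_category = categories.astype('category').memory_usage(deep=True)

print(f"text: {as_text:,} bytes, category: {as_category:,} bytes")
```

The benefit shrinks as the number of distinct values approaches the number of rows.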

**5. Chain operations for readability**
```python
# ✅ Method chaining
result = (df
    .query('revenue > 1000')
    .groupby('product')['revenue']
    .sum()
    .sort_values(ascending=False)
)
```

### Common Mistakes to Avoid

1. **Using `inplace=True` unnecessarily** - Returns `None`, breaks method chaining, and often still copies data internally
2. **Not setting an index for frequent lookups** - Repeated boolean filtering scans every row
3. **Using `apply()` when vectorization is possible** - Runs Python code per element, so it is much slower
4. **Creating unnecessary copies** - Wastes memory
5. **Not using `query()` for complex filters** - Nested boolean masks are harder to read
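
The first mistake is easy to demonstrate. In this sketch (the `demo` frame is hypothetical), the `inplace=True` call returns `None`, so nothing can be chained onto it:

```python
import pandas as pd

# Hypothetical two-row frame for illustration
demo = pd.DataFrame({'product': ['Phone', 'Laptop'], 'revenue': [600, 1200]})

# inplace=True mutates demo and returns None, so nothing can be chained after it
returned = demo.sort_values('revenue', inplace=True)

# Returning a new frame keeps the pipeline chainable
top = demo.sort_values('revenue', ascending=False).head(1)

print(returned)
print(top['product'].iloc[0])
```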

### Memory Management

```python
# Check memory usage
df.memory_usage(deep=True)

# Optimize dtypes
df['int_col'] = df['int_col'].astype('int32')  # Instead of int64; only safe if values fit in int32
df['cat_col'] = df['cat_col'].astype('category')  # Best for columns with few distinct values
```
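
Instead of picking the integer width by hand, `pd.to_numeric()` with `downcast='integer'` chooses the smallest integer dtype that can hold the data. A minimal sketch on a hypothetical `quantity` series:

```python
import numpy as np
import pandas as pd

# Hypothetical quantities that fit comfortably in a small integer type
quantity = pd.Series(np.arange(1, 1001), dtype='int64')

# downcast='integer' picks the smallest signed integer dtype that holds the data
compact = pd.to_numeric(quantity, downcast='integer')

print(quantity.dtype, quantity.memory_usage(index=False))
print(compact.dtype, compact.memory_usage(index=False))
```

Values up to 1000 fit in `int16`, so the column shrinks to a quarter of its `int64` size.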

## Summary & Quick Reference

### Key Operations Covered

| Operation | Method | Use Case |
|-----------|--------|----------|
| **Transform** | `apply()`, `map()` | Create new columns with logic |
| **String** | `.str` methods | Clean and parse text |
| **DateTime** | `.dt` methods | Extract date components |
| **Group** | `groupby()`, `agg()` | Summary by category |
| **Combine** | `merge()`, `concat()` | Join datasets |
| **Reshape** | `pivot_table()`, `melt()` | Change structure |
| **Sort** | `sort_values()` | Order data |
| **Filter** | Boolean indexing, `query()` | Select rows |
| **Window** | `rolling()`, `expanding()` | Moving calculations |
| **Create** | `assign()`, `np.where()` | New columns |

### Quick Reference Guide

**Apply custom function:**
```python
df['new'] = df['col'].apply(lambda x: x * 2)
df['new'] = df.apply(lambda row: row['col1'] + row['col2'], axis=1)
```

**String operations:**
```python
df['col'].str.lower().str.strip()
df['col'].str.contains('pattern')
df['col'].str.split(' ', expand=True)
```

**DateTime:**
```python
df['date'].dt.year
df['date'].dt.day_name()
df['date'].dt.quarter
```

**GroupBy:**
```python
df.groupby('col')['value'].sum()
df.groupby('col').agg(['sum', 'mean', 'count'])
df.groupby(['col1', 'col2'])['value'].sum()
```

**Merge:**
```python
pd.merge(df1, df2, on='key', how='inner')
pd.concat([df1, df2], axis=0)
```
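
The `how` argument decides which keys survive the merge. A small sketch with two hypothetical frames (`orders` and `customers` are made-up names) that share only some keys:

```python
import pandas as pd

# Hypothetical frames that overlap on keys 2 and 3 only
orders = pd.DataFrame({'key': [1, 2, 3], 'revenue': [100, 200, 300]})
customers = pd.DataFrame({'key': [2, 3, 4], 'name': ['Alice', 'Bob', 'Emma']})

inner = pd.merge(orders, customers, on='key', how='inner')  # keys 2, 3
left = pd.merge(orders, customers, on='key', how='left')    # keys 1, 2, 3
outer = pd.merge(orders, customers, on='key', how='outer')  # keys 1, 2, 3, 4

print(len(inner), len(left), len(outer))
```

Rows without a match on the other side get `NaN` in the columns that came from it.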

**Pivot:**
```python
df.pivot_table(values='val', index='row', columns='col', aggfunc='sum')
```

**Filter:**
```python
df[df['col'] > 100]
df[(df['col1'] > 100) & (df['col2'] == 'A')]
df.query('col1 > 100 and col2 == "A"')
```

**Window functions:**
```python
df['rolling_avg'] = df['value'].rolling(window=7).mean()
df['cumsum'] = df['value'].cumsum()
```
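
With the default settings, a rolling window returns `NaN` until it has seen `window` values. The sketch below (hypothetical numbers) shows how `min_periods` changes that:

```python
import pandas as pd

# Hypothetical values for illustration
values = pd.Series([10, 20, 30, 40, 50])

# Default: the first window-1 results are NaN because the window is incomplete
strict = values.rolling(window=3).mean()

# min_periods=1 averages whatever is available so far
lenient = values.rolling(window=3, min_periods=1).mean()

print(strict.tolist())
print(lenient.tolist())
```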

---

## Practice Exercises

**Try these on your own:**

1. Create a pivot table showing average revenue by product and payment method
2. Find the top 3 customers in each region by total spend
3. Calculate 7-day moving average of daily sales
4. Merge customer demographics with sales data
5. Create revenue bins (Low, Medium, High) and analyze patterns
6. Find month-over-month growth percentage for each product
7. Extract customer first names and analyze by first letter
8. Calculate cumulative revenue by date
9. Find products that are popular on weekends vs weekdays
10. Create a function that categorizes orders as "Bulk" (qty>5) or "Regular"

---

## Next Steps

After mastering data manipulation:
1. **Data Visualization** (Matplotlib, Seaborn)
2. **Advanced Analytics** (Statistics, ML)
3. **Time Series Analysis** (Forecasting)
4. **Feature Engineering** (For ML models)
5. **Big Data** (Dask, PySpark for large datasets)

---

### Remember:
- 🎯 **Practice regularly** with real datasets
- 📚 **Read documentation** - Pandas docs are excellent
- 💡 **Think before coding** - Plan your analysis
- 🐛 **Debug systematically** - Check intermediate results
- 🚀 **Optimize when needed** - Vectorize, don't loop!

**Happy Data Manipulation! 🐼**