# Phase 7: Pandas Deep Dive
## The engine that powers your pipelines

Pandas is the most widely used Python library for working with tabular data in data engineering.
It provides the `DataFrame` -- a 2D table of data with labeled rows and columns.

Odibi's `PandasEngine` is the primary development engine. Every transformer, 
validator, and pattern you build will manipulate DataFrames.

This notebook covers the Pandas skills you need for both Odibi and interviews.

---
## Section 1: DataFrame Basics


In [None]:
import pandas as pd

# Creating a DataFrame from a list of dicts
data = [
    {"id": 1, "name": "Alice", "department": "Engineering", "salary": 95000},
    {"id": 2, "name": "Bob", "department": "Marketing", "salary": 72000},
    {"id": 3, "name": "Charlie", "department": "Engineering", "salary": 105000},
    {"id": 4, "name": "Diana", "department": "Marketing", "salary": 68000},
    {"id": 5, "name": "Eve", "department": "Engineering", "salary": 110000},
]
df = pd.DataFrame(data)
print(df)
print(f"\nShape: {df.shape}")       # (rows, columns)
print(f"Columns: {list(df.columns)}")
print(f"\n{df.dtypes}")
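
A DataFrame can also be built column by column. The sketch below is a minimal illustration of the "labeled columns" idea, reusing three rows of the sample data; the variable names `df_cols` and `salary` are chosen only for this example.

```python
import pandas as pd

# Column-oriented input: each key is a column label,
# each list holds that column's values in row order
df_cols = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "salary": [95000, 72000, 105000],
})

# Selecting one label returns a Series that shares the DataFrame's row index
salary = df_cols["salary"]
print(type(salary).__name__)
print(salary.index.equals(df_cols.index))
```

Both construction styles should give the same kind of table; the list-of-dicts form tends to be convenient for row-oriented records, the dict-of-lists form for column-oriented data.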

In [None]:
import pandas as pd

# Reading from CSV (the most common way)
# df = pd.read_csv("data/employees.csv")

# Inspecting data
data = [
    {"id": 1, "name": "Alice", "salary": 95000},
    {"id": 2, "name": "Bob", "salary": 72000},
    {"id": 3, "name": "Charlie", "salary": 105000},
]
df = pd.DataFrame(data)

print(df.head())       # First 5 rows
df.info()              # Column types, non-null counts (prints itself; returns None)
print(df.describe())   # Statistics for numeric columns
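
To try `pd.read_csv` without a file on disk, one option is to feed it an in-memory buffer. This is a sketch under assumptions: the CSV text below is made up to stand in for `data/employees.csv`, whose real contents are not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data/employees.csv
csv_text = """id,name,salary
1,Alice,95000
2,Bob,72000
3,Charlie,105000
"""

# read_csv accepts file-like objects as well as paths,
# so a StringIO buffer behaves like an open file
employees = pd.read_csv(io.StringIO(csv_text))

print(employees.head(2))   # First 2 rows
print(employees.dtypes)    # Column types inferred from the text
```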

---
## Section 2: Selecting and Filtering


In [None]:
import pandas as pd

data = [
    {"id": 1, "name": "Alice", "dept": "Eng", "salary": 95000},
    {"id": 2, "name": "Bob", "dept": "Mkt", "salary": 72000},
    {"id": 3, "name": "Charlie", "dept": "Eng", "salary": 105000},
    {"id": 4, "name": "Diana", "dept": "Mkt", "salary": 68000},
    {"id": 5, "name": "Eve", "dept": "Eng", "salary": 110000},
]
df = pd.DataFrame(data)

# Select columns
print(df["name"])                    # Single column (Series)
print(df[["name", "salary"]])        # Multiple columns (DataFrame)

# Filter rows
high_salary = df[df["salary"] > 90000]
print("\nHigh salary:")
print(high_salary)

# Multiple conditions (use & for AND, | for OR, ~ for NOT)
eng_high = df[(df["dept"] == "Eng") & (df["salary"] > 100000)]
print("\nEngineering > 100k:")
print(eng_high)
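
The cell above only exercises `&`. The following sketch shows the other operators mentioned in its comment (`|` and `~`), plus `isin` and `.loc`, on the same sample data. The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {"id": 1, "name": "Alice", "dept": "Eng", "salary": 95000},
    {"id": 2, "name": "Bob", "dept": "Mkt", "salary": 72000},
    {"id": 3, "name": "Charlie", "dept": "Eng", "salary": 105000},
    {"id": 4, "name": "Diana", "dept": "Mkt", "salary": 68000},
    {"id": 5, "name": "Eve", "dept": "Eng", "salary": 110000},
])

# OR: salary below 70k or above 100k (parentheses are required around each condition)
extremes = df[(df["salary"] < 70000) | (df["salary"] > 100000)]

# NOT: everyone outside Engineering
not_eng = df[~(df["dept"] == "Eng")]

# Membership test against a list of allowed values
selected = df[df["name"].isin(["Alice", "Eve"])]

# .loc filters rows and selects columns in one step
eng_names = df.loc[df["dept"] == "Eng", ["name", "salary"]]

print(extremes)
print(not_eng)
print(selected)
print(eng_names)
```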

---
## Section 3: Transformations


In [None]:
import pandas as pd

data = [
    {"id": 1, "first_name": "Alice", "salary": 95000},
    {"id": 2, "first_name": "Bob", "salary": 72000},
    {"id": 3, "first_name": "Charlie", "salary": 105000},
]
df = pd.DataFrame(data)

# Add a new column
df["bonus"] = df["salary"] * 0.1

# Rename columns
df = df.rename(columns={"first_name": "name"})

# Drop a column
df = df.drop(columns=["bonus"])

# Apply a vectorized string method to a column
df["name_upper"] = df["name"].str.upper()

print(df)
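
When no built-in vectorized method fits, `Series.apply` can run an ordinary Python function on each value, and `DataFrame.assign` adds a column without modifying the original. A minimal sketch follows; `salary_band` and its 90000 threshold are hypothetical and exist only for this example.

```python
import pandas as pd

df = pd.DataFrame([
    {"id": 1, "name": "Alice", "salary": 95000},
    {"id": 2, "name": "Bob", "salary": 72000},
    {"id": 3, "name": "Charlie", "salary": 105000},
])


def salary_band(salary):
    """Hypothetical banding rule, used only for this example."""
    return "high" if salary >= 90000 else "standard"


# Series.apply calls the Python function once per value
df["band"] = df["salary"].apply(salary_band)

# assign returns a new DataFrame and leaves df unchanged
with_bonus = df.assign(bonus=df["salary"] * 0.1)

print(df)
print(with_bonus)
```

Vectorized methods such as `.str.upper()` are usually faster than `apply`, so `apply` is best kept for logic that has no vectorized equivalent.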

---
## Section 4: Groupby and Aggregation


In [None]:
import pandas as pd

data = [
    {"dept": "Eng", "name": "Alice", "salary": 95000},
    {"dept": "Mkt", "name": "Bob", "salary": 72000},
    {"dept": "Eng", "name": "Charlie", "salary": 105000},
    {"dept": "Mkt", "name": "Diana", "salary": 68000},
    {"dept": "Eng", "name": "Eve", "salary": 110000},
]
df = pd.DataFrame(data)

# Group by department
grouped = df.groupby("dept")["salary"].agg(["mean", "min", "max", "count"])
print(grouped)

# Multiple aggregations
summary = df.groupby("dept").agg(
    avg_salary=("salary", "mean"),
    headcount=("name", "count"),
    total_salary=("salary", "sum"),
).reset_index()
print(summary)
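
`agg` collapses each group to one row. When the group statistic is needed on every original row instead, `transform` is one way to get it. The sketch below reuses the sample data; the column names `dept_avg` and `vs_dept_avg` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {"dept": "Eng", "name": "Alice", "salary": 95000},
    {"dept": "Mkt", "name": "Bob", "salary": 72000},
    {"dept": "Eng", "name": "Charlie", "salary": 105000},
    {"dept": "Mkt", "name": "Diana", "salary": 68000},
    {"dept": "Eng", "name": "Eve", "salary": 110000},
])

# transform returns one value per original row, aligned to df's index,
# so the result can be assigned straight back as a column
df["dept_avg"] = df.groupby("dept")["salary"].transform("mean")

# Difference between each salary and the department average
df["vs_dept_avg"] = df["salary"] - df["dept_avg"]

print(df)
```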

---
## Section 5: Merge (Joins)


In [None]:
import pandas as pd

# Two tables to join
employees = pd.DataFrame([
    {"emp_id": 1, "name": "Alice", "dept_id": 10},
    {"emp_id": 2, "name": "Bob", "dept_id": 20},
    {"emp_id": 3, "name": "Charlie", "dept_id": 10},
])

departments = pd.DataFrame([
    {"dept_id": 10, "dept_name": "Engineering"},
    {"dept_id": 20, "dept_name": "Marketing"},
    {"dept_id": 30, "dept_name": "Sales"},
])

# Inner join (only matching rows)
result = pd.merge(employees, departments, on="dept_id", how="inner")
print("Inner join:")
print(result)

# Left join (keeps every employee; one without a dept match would get NaN for dept_name)
result = pd.merge(employees, departments, on="dept_id", how="left")
print("\nLeft join:")
print(result)
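
In the sample data every employee has a matching department, so the inner and left joins return the same rows. The sketch below adds one hypothetical employee (`Dan`, `dept_id` 40, not part of the original data) and uses an outer join with `indicator=True` to show which rows matched on neither, one, or both sides.

```python
import pandas as pd

employees = pd.DataFrame([
    {"emp_id": 1, "name": "Alice", "dept_id": 10},
    {"emp_id": 2, "name": "Bob", "dept_id": 20},
    {"emp_id": 3, "name": "Charlie", "dept_id": 10},
    {"emp_id": 4, "name": "Dan", "dept_id": 40},  # hypothetical row with no matching department
])

departments = pd.DataFrame([
    {"dept_id": 10, "dept_name": "Engineering"},
    {"dept_id": 20, "dept_name": "Marketing"},
    {"dept_id": 30, "dept_name": "Sales"},
])

# Outer join keeps rows from both sides; indicator=True adds a _merge column
# with the values "left_only", "right_only", or "both"
audit = pd.merge(employees, departments, on="dept_id", how="outer", indicator=True)

# Rows that failed to match on one side
unmatched = audit[audit["_merge"] != "both"]

print(audit)
print(unmatched)
```

A check like this can be handy before trusting a join in a pipeline, since silently dropped or unmatched rows are a common source of wrong totals.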

---
## Section 6: Exercises


### Exercise 6.1: Build a transformer function
Write three functions:
- `rename_columns(df, mapping)` renames columns using a dict mapping.
- `filter_rows(df, column, values)` keeps only rows whose value in `column` is in `values`.
- `add_computed_column(df, new_col, source_col, func)` adds `new_col` by applying `func` to `source_col`.

In [None]:
# Exercise 6.1
# YOUR CODE HERE
import pandas as pd










# Test with sample data:
# data = [{"cust_id": 1, "nm": "Alice", "amt": 100}, {"cust_id": 2, "nm": "Bob", "amt": 200}]
# df = pd.DataFrame(data)
# df = rename_columns(df, {"cust_id": "customer_id", "nm": "name", "amt": "amount"})
# print(df)

### Exercise 6.2: Groupby analysis
Given a sales DataFrame, calculate total revenue per product,
the average order amount per customer, and the top 3 products by revenue.

In [None]:
# Exercise 6.2
# YOUR CODE HERE
import pandas as pd

sales = pd.DataFrame([
    {"product": "A", "customer": "C1", "amount": 100},
    {"product": "B", "customer": "C1", "amount": 200},
    {"product": "A", "customer": "C2", "amount": 150},
    {"product": "C", "customer": "C2", "amount": 300},
    {"product": "A", "customer": "C3", "amount": 75},
    {"product": "B", "customer": "C3", "amount": 250},
])


---
## Checkpoint

You now know the core Pandas operations:
- DataFrame creation, inspection
- Selecting columns, filtering rows
- Adding, renaming, dropping columns
- Groupby and aggregation
- Merge (joins)
- Building transformer functions

These are the operations Odibi's `PandasEngine` performs.
You can now build the transformers for mini-odibi.

**Next:** Phase 8 -- Building Mini-Odibi Core.