# Diverse Pandas Quiz — Questions & Solutions

This notebook contains six problems, each followed immediately by its full, runnable solution.
Run the import cell below first; it is the only import the notebook needs.

---


In [None]:
# Single import cell: pandas is the only dependency
import pandas as pd


## Q1 — Grouping, aggregation and filtering

**Instructions (DataFrame construction):**
Create `df_sales` with columns: `OrderID`, `CustomerID`, `Product`, `Units`, `UnitPrice`, `OrderDate`.
- ~12 rows across 4 customers (CustomerID 1–4) and 5 products (P1–P5).
- `OrderDate` as strings like `'2025-01-03'`.

**Task:**
1. Add `Amount = Units * UnitPrice`.
2. Compute total `Amount` per `CustomerID`.
3. Show only customers whose total exceeds 300, sorted by total in descending order.


In [None]:
# Solution for Q1
df_sales = pd.DataFrame([
    {'OrderID':101,'CustomerID':1,'Product':'P1','Units':5,'UnitPrice':20,'OrderDate':'2025-01-03'},
    {'OrderID':102,'CustomerID':2,'Product':'P2','Units':3,'UnitPrice':50,'OrderDate':'2025-01-04'},
    {'OrderID':103,'CustomerID':1,'Product':'P3','Units':2,'UnitPrice':120,'OrderDate':'2025-01-05'},
    {'OrderID':104,'CustomerID':3,'Product':'P2','Units':10,'UnitPrice':15,'OrderDate':'2025-01-06'},
    {'OrderID':105,'CustomerID':4,'Product':'P4','Units':1,'UnitPrice':300,'OrderDate':'2025-01-06'},
    {'OrderID':106,'CustomerID':2,'Product':'P1','Units':7,'UnitPrice':18,'OrderDate':'2025-01-07'},
    {'OrderID':107,'CustomerID':3,'Product':'P5','Units':4,'UnitPrice':40,'OrderDate':'2025-01-08'},
    {'OrderID':108,'CustomerID':1,'Product':'P2','Units':6,'UnitPrice':30,'OrderDate':'2025-01-09'},
    {'OrderID':109,'CustomerID':4,'Product':'P3','Units':2,'UnitPrice':120,'OrderDate':'2025-01-10'},
    {'OrderID':110,'CustomerID':2,'Product':'P4','Units':1,'UnitPrice':250,'OrderDate':'2025-01-10'},
    {'OrderID':111,'CustomerID':3,'Product':'P1','Units':8,'UnitPrice':20,'OrderDate':'2025-01-11'},
    {'OrderID':112,'CustomerID':1,'Product':'P5','Units':3,'UnitPrice':60,'OrderDate':'2025-01-12'},
])

# 1. Add Amount column
df_sales['Amount'] = df_sales['Units'] * df_sales['UnitPrice']

# 2. Total per CustomerID
total_by_cust = df_sales.groupby('CustomerID', as_index=False)['Amount'].sum()

# 3. Filter and sort
result_q1 = total_by_cust[total_by_cust['Amount'] > 300].sort_values('Amount', ascending=False)

print('--- df_sales (first 8 rows) ---')
print(df_sales.head(8).to_string(index=False))
print('\n--- Total Amount per Customer ---')
print(total_by_cust.to_string(index=False))
print('\n--- Customers with total > 300 ---')
print(result_q1.to_string(index=False))
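
The three steps can also be written as a single method chain. The following is a minimal sketch rather than the reference solution: it rebuilds only the `CustomerID`, `Units` and `UnitPrice` values from `df_sales` so that it runs on its own, and the names `orders` and `big_spenders` are illustrative assumptions.

```python
import pandas as pd

# Same CustomerID / Units / UnitPrice values as df_sales, rebuilt so the cell runs alone.
orders = pd.DataFrame({
    'CustomerID': [1, 2, 1, 3, 4, 2, 3, 1, 4, 2, 3, 1],
    'Units':      [5, 3, 2, 10, 1, 7, 4, 6, 2, 1, 8, 3],
    'UnitPrice':  [20, 50, 120, 15, 300, 18, 40, 30, 120, 250, 20, 60],
})

big_spenders = (
    orders
    .assign(Amount=lambda d: d['Units'] * d['UnitPrice'])   # step 1: derived column
    .groupby('CustomerID', as_index=False)['Amount'].sum()  # step 2: total per customer
    .query('Amount > 300')                                  # step 3: filter
    .sort_values('Amount', ascending=False)                 # step 3: sort
)

print(big_spenders.to_string(index=False))
```

With this data every customer clears the 300 threshold (the totals are 700, 540, 526 and 470), so the filter removes nothing; raising the threshold to 500 would drop customer 3.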


## Q2 — Join (merge) and derived boolean column

**Instructions (DataFrame construction):**
Create `df_employees` with `EmpID`, `Name`, `DeptID` (6 rows). Create `df_depts` with `DeptID`, `DeptName`, `Manager` (3 rows).

**Task:**
1. Merge the two frames into `df_full` so that each employee row includes `DeptName` and `Manager`.
2. Create a boolean column `IsManagedByAlice` that is `True` when `Manager == 'Alice'`.
3. Show the number of employees per `DeptName` and how many of them are managed by Alice.


In [None]:
# Solution for Q2
df_employees = pd.DataFrame([
    {'EmpID':1,'Name':'Ravi','DeptID':10},
    {'EmpID':2,'Name':'Meera','DeptID':20},
    {'EmpID':3,'Name':'Ajay','DeptID':10},
    {'EmpID':4,'Name':'Priya','DeptID':30},
    {'EmpID':5,'Name':'Sameer','DeptID':20},
    {'EmpID':6,'Name':'Anita','DeptID':30},
])

df_depts = pd.DataFrame([
    {'DeptID':10,'DeptName':'Data','Manager':'Alice'},
    {'DeptID':20,'DeptName':'Infra','Manager':'Bob'},
    {'DeptID':30,'DeptName':'HR','Manager':'Alice'},
])

# 1. Merge
df_full = df_employees.merge(df_depts, on='DeptID', how='left')

# 2. Derived boolean
df_full['IsManagedByAlice'] = df_full['Manager'] == 'Alice'

# 3. Counts per department and managed by Alice
counts = df_full.groupby('DeptName', as_index=False)['EmpID'].count().rename(columns={'EmpID':'EmployeeCount'})
managed_by_alice = df_full[df_full['IsManagedByAlice']].groupby('DeptName', as_index=False)['EmpID'].count().rename(columns={'EmpID':'ManagedByAlice'})
summary = counts.merge(managed_by_alice, on='DeptName', how='left').fillna(0)
summary['ManagedByAlice'] = summary['ManagedByAlice'].astype(int)

print('--- Merged df_full ---')
print(df_full.to_string(index=False))
print('\n--- Employee counts and managed by Alice ---')
print(summary.to_string(index=False))
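
The summary in step 3 can probably be produced in one pass with named aggregation, because summing a boolean column counts its `True` values. This is a sketch under that assumption, not a replacement for the solution above; `employees`, `depts`, `full` and `summary_alt` are illustrative names, and the `Name` column is left out because the summary does not need it.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [1, 2, 3, 4, 5, 6],
    'DeptID': [10, 20, 10, 30, 20, 30],
})
depts = pd.DataFrame({
    'DeptID': [10, 20, 30],
    'DeptName': ['Data', 'Infra', 'HR'],
    'Manager': ['Alice', 'Bob', 'Alice'],
})

# validate='many_to_one' raises MergeError if a DeptID appears more than once in depts.
full = employees.merge(depts, on='DeptID', how='left', validate='many_to_one')
full['IsManagedByAlice'] = full['Manager'] == 'Alice'

# One groupby, two named aggregations; no second merge or fillna needed.
summary_alt = full.groupby('DeptName', as_index=False).agg(
    EmployeeCount=('EmpID', 'count'),
    ManagedByAlice=('IsManagedByAlice', 'sum'),
)
print(summary_alt.to_string(index=False))
```

The `validate` argument is optional; it is included here only as a guard against duplicate department rows silently multiplying employee counts.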


## Q3 — Reshape: melt and pivot

**Instructions (DataFrame construction):**
Create `df_quarterly` (wide) with columns `Company`, `Q1`, `Q2`, `Q3`, `Q4` for 5 companies (numeric revenue).

**Task:**
1. Use `melt` to convert to long format (`Company`, `Quarter`, `Revenue`).
2. Pivot back to wide using `pivot_table`.
3. Compute each company's annual revenue and append it to the pivot result as a column.


In [None]:
# Solution for Q3
df_quarterly = pd.DataFrame([
    {'Company':'C1','Q1':120,'Q2':130,'Q3':125,'Q4':140},
    {'Company':'C2','Q1':200,'Q2':210,'Q3':190,'Q4':205},
    {'Company':'C3','Q1':95,'Q2':100,'Q3':110,'Q4':105},
    {'Company':'C4','Q1':150,'Q2':160,'Q3':155,'Q4':165},
    {'Company':'C5','Q1':80,'Q2':90,'Q3':85,'Q4':95},
])

# 1. Melt to long format
df_long = df_quarterly.melt(id_vars=['Company'], value_vars=['Q1','Q2','Q3','Q4'], var_name='Quarter', value_name='Revenue')

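# 2. Pivot back to wide (pivot_table aggregates duplicates with the mean by default;
#    each Company/Quarter pair is unique here, so the values are unchanged)
df_pivot = df_long.pivot_table(index='Company', columns='Quarter', values='Revenue').reset_index()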

# 3. Annual revenue
df_pivot['Annual'] = df_pivot[['Q1','Q2','Q3','Q4']].sum(axis=1)

print('--- Long format (first 8 rows) ---')
print(df_long.head(8).to_string(index=False))
print('\n--- Pivoted back to wide with Annual ---')
print(df_pivot.to_string(index=False))
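
When every `Company`/`Quarter` pair is known to be unique, `pivot` is a possible alternative to `pivot_table`: it reshapes without aggregating and fails loudly on duplicates. The sketch below assumes that uniqueness; `wide`, `long_df` and `round_trip` are illustrative names.

```python
import pandas as pd

wide = pd.DataFrame({
    'Company': ['C1', 'C2', 'C3', 'C4', 'C5'],
    'Q1': [120, 200, 95, 150, 80],
    'Q2': [130, 210, 100, 160, 90],
    'Q3': [125, 190, 110, 155, 85],
    'Q4': [140, 205, 105, 165, 95],
})
long_df = wide.melt(id_vars='Company', var_name='Quarter', value_name='Revenue')

# pivot() reshapes without aggregating and raises ValueError if a
# (Company, Quarter) pair occurs more than once.
round_trip = (
    long_df.pivot(index='Company', columns='Quarter', values='Revenue')
    .rename_axis(columns=None)   # drop the leftover 'Quarter' label on the column axis
    .reset_index()
)
round_trip['Annual'] = round_trip[['Q1', 'Q2', 'Q3', 'Q4']].sum(axis=1)
print(round_trip.to_string(index=False))
```

`pivot_table` remains the safer choice when duplicates are expected and an aggregate (mean, sum, and so on) is actually wanted.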


## Q4 — Time series: rolling mean and resample

**Instructions (DataFrame construction):**
Create `df_temp` with daily `Date` (as strings) from '2025-03-01' to '2025-03-10' and a `Temperature` float column.

**Task:**
1. Convert `Date` to datetime and set as index.
2. Add `Temp_3d_avg` as 3-day rolling mean of Temperature.
3. Resample to weekly frequency and report mean temperature per week.


In [None]:
# Solution for Q4
dates = ['2025-03-01','2025-03-02','2025-03-03','2025-03-04','2025-03-05','2025-03-06','2025-03-07','2025-03-08','2025-03-09','2025-03-10']
temps = [25.2, 24.8, 26.1, 27.0, 26.5, 25.0, 24.0, 23.5, 24.8, 25.6]

df_temp = pd.DataFrame({'Date': dates, 'Temperature': temps})

# 1. Convert to datetime and set index
df_temp['Date'] = pd.to_datetime(df_temp['Date'])
df_temp = df_temp.set_index('Date')

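# 2. 3-day rolling average (min_periods=1 averages the partial windows on the first two days
#    instead of leaving them as NaN)
df_temp['Temp_3d_avg'] = df_temp['Temperature'].rolling(window=3, min_periods=1).mean()

# 3. Weekly resample mean ('W' bins end on Sunday)
weekly_mean = df_temp['Temperature'].resample('W').mean()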

print('--- Daily with 3-day rolling average ---')
print(df_temp.to_string())
print('\n--- Weekly mean temperature ---')
print(weekly_mean.to_string())
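
The choice of window definition changes the first rows of the rolling mean, and the weekly bins depend on where the week ends. The sketch below compares a strict count-based window with a time-based one and prints the weekly bins; `temps`, `strict`, `time_based` and `weekly` are illustrative names, not part of the solution above.

```python
import pandas as pd

temps = pd.Series(
    [25.2, 24.8, 26.1, 27.0, 26.5, 25.0, 24.0, 23.5, 24.8, 25.6],
    index=pd.date_range('2025-03-01', periods=10, freq='D'),
    name='Temperature',
)

# Count-based window with the default min_periods (equal to the window size):
# the first two days have fewer than 3 observations and stay NaN.
strict = temps.rolling(window=3).mean()

# Time-based window: averages whatever observations fall within the last 3 days.
time_based = temps.rolling('3D').mean()

# 'W' is shorthand for 'W-SUN': each bin is labelled by the Sunday that ends it.
weekly = temps.resample('W').mean()

print(strict.head(4).to_string())
print(weekly.to_string())
```

Because 2025-03-01 is a Saturday, the ten days fall into three weekly bins labelled 2025-03-02, 2025-03-09 and 2025-03-16, and the first and last bins hold only two days and one day respectively.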


## Q5 — Missing data handling and imputation

**Instructions (DataFrame construction):**
Create `df_products` with columns `SKU`, `Category`, `Price`, `Stock` (8 rows). Include some `NaN` values in `Price` and `Stock`. At least 3 categories.

**Task:**
1. Show number of missing values per column.
2. Fill missing `Price` with median `Price` of the same `Category` and missing `Stock` with 0.
3. After imputation, show SKUs where `Stock == 0` or `Price > 100`.


In [None]:
# Solution for Q5
df_products = pd.DataFrame([
    {'SKU':'S1','Category':'Electronics','Price':120.0,'Stock':10},
    {'SKU':'S2','Category':'Electronics','Price':None,'Stock':5},
    {'SKU':'S3','Category':'Home','Price':45.0,'Stock':None},
    {'SKU':'S4','Category':'Home','Price':55.0,'Stock':8},
    {'SKU':'S5','Category':'Clothing','Price':None,'Stock':2},
    {'SKU':'S6','Category':'Clothing','Price':80.0,'Stock':None},
    {'SKU':'S7','Category':'Electronics','Price':200.0,'Stock':1},
    {'SKU':'S8','Category':'Home','Price':None,'Stock':None},
])

# 1. Missing counts
missing_counts = df_products.isna().sum()

# 2. Impute Price with the category median and Stock with 0
df_imputed = df_products.copy()
df_imputed['Price'] = df_imputed.groupby('Category')['Price'].transform(lambda x: x.fillna(x.median()))
df_imputed['Stock'] = df_imputed['Stock'].fillna(0)

# 3. SKUs where Stock == 0 or Price > 100
condition = (df_imputed['Stock'] == 0) | (df_imputed['Price'] > 100)
skus_selected = df_imputed.loc[condition]

print('--- Missing value counts before imputation ---')
print(missing_counts.to_string())
print('\n--- After imputation ---')
print(df_imputed.to_string(index=False))
print('\n--- SKUs with Stock == 0 or Price > 100 ---')
print(skus_selected.to_string(index=False))
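
The per-group `lambda` can likely be avoided by computing a row-aligned median first and passing it to `fillna`. This is a sketch of that variant using the same values as `df_products`; `products`, `category_median`, `imputed` and `flagged` are illustrative names.

```python
import pandas as pd

products = pd.DataFrame({
    'SKU': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8'],
    'Category': ['Electronics', 'Electronics', 'Home', 'Home',
                 'Clothing', 'Clothing', 'Electronics', 'Home'],
    'Price': [120.0, None, 45.0, 55.0, None, 80.0, 200.0, None],
    'Stock': [10, 5, None, 8, 2, None, 1, None],
})

# transform('median') returns one value per row, aligned to the original index,
# so it can be used directly as the fill value.
category_median = products.groupby('Category')['Price'].transform('median')

imputed = products.assign(
    Price=products['Price'].fillna(category_median),
    Stock=products['Stock'].fillna(0),
)
flagged = imputed.loc[(imputed['Stock'] == 0) | (imputed['Price'] > 100), 'SKU']
print(flagged.tolist())
```

The imputed prices are 160 for S2 (median of 120 and 200), 80 for S5 and 50 for S8 (median of 45 and 55). Note that `Stock` stays a float column after `fillna(0)` because it originally contained `NaN`.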


## Q6 — Apply/custom logic and string operations

**Instructions (DataFrame construction):**
Create `df_courses` with columns `CourseID`, `Title`, `Enrolled`, `StartDate` for 6 courses. Some titles should contain 'Intro' or 'Advanced'. `StartDate` as strings.

**Task:**
1. Create `Level`: 'Beginner' if `Title` contains 'Intro', 'Advanced' if it contains 'Advanced', otherwise 'Intermediate'. Use `apply` or vectorized string operations.
2. Create `StartMonth` derived from `StartDate`.
3. Show average `Enrolled` per `Level`.


In [None]:
# Solution for Q6
df_courses = pd.DataFrame([
    {'CourseID':'C101','Title':'Intro to Python','Enrolled':120,'StartDate':'2025-02-01'},
    {'CourseID':'C102','Title':'Advanced Machine Learning','Enrolled':40,'StartDate':'2025-03-15'},
    {'CourseID':'C103','Title':'Data Visualization','Enrolled':75,'StartDate':'2025-02-20'},
    {'CourseID':'C104','Title':'Intro to SQL','Enrolled':90,'StartDate':'2025-01-10'},
    {'CourseID':'C105','Title':'Advanced Deep Learning','Enrolled':30,'StartDate':'2025-04-05'},
    {'CourseID':'C106','Title':'Applied Statistics','Enrolled':60,'StartDate':'2025-03-01'},
])

# 1. Level creation using vectorized str.contains (a later assignment wins if a title matches both)
df_courses['Level'] = 'Intermediate'
df_courses.loc[df_courses['Title'].str.contains('Intro'), 'Level'] = 'Beginner'
df_courses.loc[df_courses['Title'].str.contains('Advanced'), 'Level'] = 'Advanced'

# 2. StartMonth
df_courses['StartDate'] = pd.to_datetime(df_courses['StartDate'])
df_courses['StartMonth'] = df_courses['StartDate'].dt.month

# 3. Average Enrolled per Level
avg_enrolled = df_courses.groupby('Level', as_index=False)['Enrolled'].mean()

print('--- Courses with Level and StartMonth ---')
print(df_courses.to_string(index=False))
print('\n--- Average Enrolled per Level ---')
print(avg_enrolled.to_string(index=False))
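
The task also allows `apply`. A minimal sketch of that route follows; `classify_level` is a hypothetical helper introduced only for illustration, and `courses` and `avg_by_level` are likewise illustrative names.

```python
import pandas as pd

courses = pd.DataFrame({
    'Title': ['Intro to Python', 'Advanced Machine Learning', 'Data Visualization',
              'Intro to SQL', 'Advanced Deep Learning', 'Applied Statistics'],
    'Enrolled': [120, 40, 75, 90, 30, 60],
})

def classify_level(title: str) -> str:
    """Hypothetical helper: map a course title to a level label."""
    if 'Intro' in title:
        return 'Beginner'
    if 'Advanced' in title:
        return 'Advanced'
    return 'Intermediate'

courses['Level'] = courses['Title'].apply(classify_level)
avg_by_level = courses.groupby('Level')['Enrolled'].mean()
print(avg_by_level.to_string())
```

The two approaches differ only for a title that contains both keywords: this helper checks 'Intro' first and would return 'Beginner', while the vectorized version assigns 'Advanced' last and would keep 'Advanced'. No title in this dataset triggers that case.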
