# Module 4: Pandas Test - SOLUTIONS

This test covers all topics from Module 4 (Pandas):
- Series and DataFrames
- Reading and writing data
- Indexing and selection (loc, iloc, boolean)
- Data cleaning (missing data, duplicates, types)
- Groupby and aggregation
- Merging and joining

**Instructions:**
1. Read each question carefully
2. Write your code in the provided answer cells
3. Run your code to verify it works
4. Do not look at the solutions until you have attempted all questions

**Scoring:**
- Questions 1-6: Basic (1 point each)
- Questions 7-12: Intermediate (2 points each)
- Questions 13-17: Advanced (3 points each)

**Total: 33 points**

---

## Setup

Run this cell first to import the required libraries.

In [None]:
import pandas as pd
import numpy as np

# Set display options for better output
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 100)

print("Setup complete! Pandas version:", pd.__version__)

---

## Section 1: Series and DataFrames (Questions 1-3)

### Question 1: Create a Series (1 point)

Create a Pandas Series called `temperatures` with the following data:
- Index: 'Mon', 'Tue', 'Wed', 'Thu', 'Fri'
- Values: 68, 72, 75, 71, 69

Give the Series the name 'Daily Temps' and print it along with its mean value.

In [None]:
# SOLUTION
temperatures = pd.Series(
    [68, 72, 75, 71, 69],
    index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
    name='Daily Temps'
)

print("Temperatures Series:")
print(temperatures)
print(f"\nMean temperature: {temperatures.mean():.1f}")

### Question 2: Create a DataFrame (1 point)

Create a DataFrame called `products` with the following data:

| product_id | name     | price  | quantity |
|------------|----------|--------|----------|
| P001       | Laptop   | 999.99 | 50       |
| P002       | Mouse    | 29.99  | 200      |
| P003       | Keyboard | 79.99  | 150      |
| P004       | Monitor  | 299.99 | 75       |

Print the DataFrame, its shape, and the data types of each column.

In [None]:
# SOLUTION
products = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004'],
    'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999.99, 29.99, 79.99, 299.99],
    'quantity': [50, 200, 150, 75]
})

print("Products DataFrame:")
print(products)
print(f"\nShape: {products.shape}")
print(f"\nData types:\n{products.dtypes}")

### Question 3: DataFrame Inspection (1 point)

Using the DataFrame below:
1. Display the first 3 rows
2. Display the statistical summary for numeric columns only
3. Print the column names as a list

In [None]:
# Sample DataFrame for Question 3
np.random.seed(42)
employees = pd.DataFrame({
    'emp_id': range(1, 11),
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Iris', 'Jack'],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR'],
    'salary': np.random.randint(50000, 100000, 10),
    'years_exp': np.random.randint(1, 15, 10)
})
print("Employees DataFrame:")
print(employees)

In [None]:
# SOLUTION
# 1. First 3 rows
print("1. First 3 rows:")
print(employees.head(3))

# 2. Statistical summary for numeric columns
print("\n2. Statistical summary:")
print(employees.describe())

# 3. Column names as a list
print("\n3. Column names:")
print(employees.columns.tolist())

---

## Section 2: Reading and Writing Data (Questions 4-5)

### Question 4: CSV String Parsing (1 point)

Parse the following CSV string into a DataFrame, then display the DataFrame and print its data types. Note that:
- The data has a header row
- The 'date' column should be parsed as datetime

In [None]:
# CSV data as a string
csv_data = """date,product,sales,revenue
2024-01-15,Widget,100,1500.00
2024-01-16,Gadget,75,2250.00
2024-01-17,Widget,120,1800.00
2024-01-18,Gizmo,50,1000.00"""

# Hint: Use pd.read_csv() with io.StringIO()
from io import StringIO

# SOLUTION
df = pd.read_csv(StringIO(csv_data), parse_dates=['date'])

print("Parsed DataFrame:")
print(df)
print(f"\nData types:\n{df.dtypes}")

### Question 5: DataFrame to Dictionary (1 point)

Convert the following DataFrame to:
1. A dictionary with orientation 'records' (list of row dictionaries)
2. A dictionary with orientation 'list' (column names as keys, values as lists)

Print both results.

In [None]:
# Sample DataFrame for Question 5
sample_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78],
    'grade': ['B', 'A', 'C']
})
print("Sample DataFrame:")
print(sample_df)

In [None]:
# SOLUTION
# 1. Records orientation (list of row dictionaries)
records_dict = sample_df.to_dict(orient='records')
print("1. Records orientation:")
print(records_dict)

# 2. List orientation (column names as keys, values as lists)
list_dict = sample_df.to_dict(orient='list')
print("\n2. List orientation:")
print(list_dict)
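
As a quick sanity check, both dictionary shapes can be passed back to the `pd.DataFrame` constructor. The sketch below uses a small hypothetical frame (`scores_df`, not the Question 5 data) and assumes default dtypes.

```python
import pandas as pd

# Hypothetical frame for illustration (not the Question 5 data)
scores_df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [85, 92]})

# 'records' gives one dictionary per row
records = scores_df.to_dict(orient='records')

# 'list' gives one list per column
columns = scores_df.to_dict(orient='list')

# Both shapes are accepted by the DataFrame constructor
from_records = pd.DataFrame(records)
from_columns = pd.DataFrame(columns)

print(from_records.equals(scores_df))
print(from_columns.equals(scores_df))
```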

---

## Section 3: Indexing and Selection (Questions 6-8)

### Question 6: Basic Selection with loc and iloc (1 point)

Using the DataFrame below:
1. Use `loc` to select the row with index 'C' and columns 'name' and 'score'
2. Use `iloc` to select the last 2 rows and the first 2 columns

In [None]:
# Sample DataFrame for Question 6
students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'score': [85, 92, 78, 95, 88],
    'grade': ['B', 'A', 'C', 'A', 'B'],
    'passed': [True, True, True, True, True]
}, index=['A', 'B', 'C', 'D', 'E'])
print("Students DataFrame:")
print(students)

In [None]:
# SOLUTION
# 1. Using loc to select row 'C' and columns 'name' and 'score'
print("1. Row 'C', columns 'name' and 'score':")
print(students.loc['C', ['name', 'score']])

# 2. Using iloc to select last 2 rows and first 2 columns
print("\n2. Last 2 rows, first 2 columns:")
print(students.iloc[-2:, :2])
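
One difference worth keeping in mind when slicing: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. A minimal sketch on a hypothetical frame (`df`) with the same index labels as `students`:

```python
import pandas as pd

# Hypothetical frame with string labels, similar in shape to `students`
df = pd.DataFrame({'score': [85, 92, 78, 95, 88]}, index=['A', 'B', 'C', 'D', 'E'])

# loc slices by label and includes the end label 'D'
by_label = df.loc['B':'D']

# iloc slices by position and excludes the end position 3
by_position = df.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```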

### Question 7: Boolean Indexing (2 points)

Using the DataFrame below:
1. Select all rows where salary is greater than 70000
2. Select all rows where department is 'Engineering' AND years_exp is greater than 5
3. Select all rows where department is either 'Sales' OR 'HR'

In [None]:
# Sample DataFrame for Question 7
staff = pd.DataFrame({
    'emp_id': range(1, 9),
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'department': ['Engineering', 'Sales', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR', 'Sales'],
    'salary': [75000, 62000, 82000, 58000, 71000, 90000, 55000, 68000],
    'years_exp': [8, 3, 12, 2, 5, 15, 1, 7]
})
print("Staff DataFrame:")
print(staff)

In [None]:
# SOLUTION
# 1. Salary greater than 70000
print("1. Salary > 70000:")
print(staff[staff['salary'] > 70000])

# 2. Engineering AND years_exp > 5
print("\n2. Engineering AND years_exp > 5:")
print(staff[(staff['department'] == 'Engineering') & (staff['years_exp'] > 5)])

# 3. Sales OR HR
print("\n3. Sales OR HR:")
print(staff[(staff['department'] == 'Sales') | (staff['department'] == 'HR')])
# Alternative: print(staff[staff['department'].isin(['Sales', 'HR'])])

### Question 8: Advanced Selection (2 points)

Using the DataFrame below:
1. Select the 'name' and 'score' columns for rows where score is in the top 3
2. Use `query()` to select rows where subject is 'Math' and score > 80
3. Use `isin()` to select rows where subject is either 'Math' or 'Science'

In [None]:
# Sample DataFrame for Question 8
grades = pd.DataFrame({
    'student_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'name': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie'],
    'subject': ['Math', 'Science', 'English', 'Math', 'Science', 'English', 'Math', 'Science', 'English'],
    'score': [92, 88, 85, 78, 82, 90, 95, 91, 87]
})
print("Grades DataFrame:")
print(grades)

In [None]:
# SOLUTION
# 1. Top 3 scores - name and score columns
print("1. Top 3 scores:")
top3_threshold = grades['score'].nlargest(3).min()
print(grades[grades['score'] >= top3_threshold][['name', 'score']])
# Alternative using nlargest index:
# print(grades.loc[grades['score'].nlargest(3).index, ['name', 'score']])

# 2. Using query() for Math and score > 80
print("\n2. Math and score > 80 (using query):")
print(grades.query("subject == 'Math' and score > 80"))

# 3. Using isin() for Math or Science
print("\n3. Math or Science (using isin):")
print(grades[grades['subject'].isin(['Math', 'Science'])])
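
The two approaches to part 1 differ when scores are tied at the cutoff: the threshold filter keeps every tied row, while `nlargest(3)` keeps exactly three by default. The sketch below uses hypothetical data (`scores`) with a tie for third place to show the difference.

```python
import pandas as pd

# Hypothetical data with a tie at the third-highest score
scores = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cal', 'Dee', 'Eli'],
    'score': [95, 90, 88, 88, 70]
})

# Threshold approach: keeps every row at or above the third-largest value
threshold = scores['score'].nlargest(3).min()
by_threshold = scores[scores['score'] >= threshold][['name', 'score']]

# DataFrame.nlargest keeps exactly three rows by default (keep='first')
exactly_three = scores.nlargest(3, 'score')[['name', 'score']]

# keep='all' retains rows tied for the last place
with_ties = scores.nlargest(3, 'score', keep='all')[['name', 'score']]

print(len(by_threshold), len(exactly_three), len(with_ties))
```

Which behavior is correct depends on how "top 3" is meant to treat ties. The Question 8 data has no ties, so both approaches return the same rows there.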

---

## Section 4: Data Cleaning (Questions 9-11)

### Question 9: Handling Missing Data (2 points)

Using the DataFrame below:
1. Count the number of missing values in each column
2. Fill missing values in 'age' with the median age
3. Fill missing values in 'city' with 'Unknown'
4. Drop any remaining rows with missing values
5. Print the cleaned DataFrame

In [None]:
# Sample DataFrame for Question 9
messy_data = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve', 'Frank'],
    'age': [25, None, 35, 28, None, 42],
    'city': ['NYC', 'LA', 'Chicago', None, 'Phoenix', 'Houston'],
    'salary': [50000, 60000, 70000, 55000, 65000, None]
})
print("Messy DataFrame:")
print(messy_data)

In [None]:
# SOLUTION
# Make a copy to work with
df = messy_data.copy()

# 1. Count missing values in each column
print("1. Missing values per column:")
print(df.isnull().sum())

# 2. Fill missing 'age' with median
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)
print(f"\n2. Filled age with median: {median_age}")

# 3. Fill missing 'city' with 'Unknown'
df['city'] = df['city'].fillna('Unknown')
print("3. Filled city with 'Unknown'")

# 4. Drop remaining rows with missing values
df = df.dropna()
print("4. Dropped rows with remaining missing values")

# 5. Print cleaned DataFrame
print("\n5. Cleaned DataFrame:")
print(df)
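
The column-by-column fills above can also be written as a single `fillna()` call that takes a dictionary mapping column names to fill values. This is a sketch on a smaller hypothetical frame (`messy`), not a replacement for the step-by-step solution.

```python
import pandas as pd

# Hypothetical frame with gaps in three columns
messy = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, None, 35],
    'city': ['NYC', None, 'Chicago']
})

# Fill 'age' and 'city' in one call, then drop rows that still have gaps
cleaned = messy.fillna({
    'age': messy['age'].median(),
    'city': 'Unknown'
}).dropna()

print(cleaned)
```

The row with the missing name is dropped because no fill value was given for that column.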

### Question 10: Handling Duplicates (2 points)

Using the DataFrame below:
1. Identify duplicate rows (print True/False for each row)
2. Count the total number of duplicate rows
3. Remove duplicates, keeping the first occurrence
4. Remove duplicates based only on 'product' and 'store' columns, keeping the last occurrence

In [None]:
# Sample DataFrame for Question 10
sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-01', '2024-01-03'],
    'product': ['Widget', 'Widget', 'Gadget', 'Gadget', 'Widget', 'Widget'],
    'store': ['NYC', 'NYC', 'LA', 'LA', 'NYC', 'NYC'],
    'quantity': [10, 10, 15, 20, 10, 25]
})
print("Sales DataFrame:")
print(sales)

In [None]:
# SOLUTION
# 1. Identify duplicate rows
print("1. Duplicate rows (True/False):")
print(sales.duplicated())

# 2. Count total duplicate rows
print(f"\n2. Total duplicate rows: {sales.duplicated().sum()}")

# 3. Remove duplicates, keeping first
print("\n3. Remove duplicates (keep first):")
print(sales.drop_duplicates(keep='first'))

# 4. Remove duplicates based on 'product' and 'store', keeping last
print("\n4. Remove duplicates based on product and store (keep last):")
print(sales.drop_duplicates(subset=['product', 'store'], keep='last'))
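
By default `duplicated()` marks only the repeats, not the first occurrence. Passing `keep=False` marks every member of a duplicate group, which can help when inspecting the data before dropping anything. A minimal sketch on a hypothetical frame (`small_sales`):

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical
small_sales = pd.DataFrame({
    'product': ['Widget', 'Widget', 'Gadget', 'Widget'],
    'store': ['NYC', 'NYC', 'LA', 'NYC'],
    'quantity': [10, 10, 15, 25]
})

# Default keep='first': only the second copy is flagged
repeats_only = small_sales.duplicated()

# keep=False: both copies are flagged
all_copies = small_sales.duplicated(keep=False)

print(repeats_only.tolist())
print(all_copies.tolist())
```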

### Question 11: Data Type Conversion (2 points)

Using the DataFrame below:
1. Convert 'price' from string to float (remove the '$' first)
2. Convert 'quantity' from string to integer
3. Convert 'date' from string to datetime
4. Convert 'in_stock' from string ('Yes'/'No') to boolean
5. Print the DataFrame and verify all data types

In [None]:
# Sample DataFrame for Question 11
inventory = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': ['$999.99', '$29.99', '$79.99', '$299.99'],
    'quantity': ['50', '200', '150', '75'],
    'date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18'],
    'in_stock': ['Yes', 'Yes', 'No', 'Yes']
})
print("Inventory DataFrame:")
print(inventory)
print("\nOriginal data types:")
print(inventory.dtypes)

In [None]:
# SOLUTION
# Make a copy
df = inventory.copy()

# 1. Convert 'price' to float (remove '$')
df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)

# 2. Convert 'quantity' to integer
df['quantity'] = df['quantity'].astype(int)

# 3. Convert 'date' to datetime
df['date'] = pd.to_datetime(df['date'])

# 4. Convert 'in_stock' to boolean
df['in_stock'] = df['in_stock'].map({'Yes': True, 'No': False})

# 5. Print results
print("Converted DataFrame:")
print(df)
print("\nNew data types:")
print(df.dtypes)
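
The conversions above assume every value is well formed. With messier input, `astype(float)` raises an error on an unparseable string, and `map()` silently produces `NaN` for values missing from the dictionary. The sketch below uses hypothetical values (`raw_prices`, `raw_flags`) to show one way to surface such cases with `pd.to_numeric(errors='coerce')`.

```python
import pandas as pd

# Hypothetical inputs containing values the clean data does not have
raw_prices = pd.Series(['$10.50', 'N/A', '$3.00'])
raw_flags = pd.Series(['Yes', 'No', 'maybe'])

# errors='coerce' turns unparseable strings into NaN instead of raising
prices = pd.to_numeric(raw_prices.str.replace('$', '', regex=False), errors='coerce')

# Values not in the mapping become NaN
flags = raw_flags.map({'Yes': True, 'No': False})

print(prices.isna().sum(), flags.isna().sum())
```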

---

## Section 5: Groupby and Aggregation (Questions 12-14)

### Question 12: Basic Groupby (2 points)

Using the DataFrame below:
1. Calculate the total sales for each product
2. Calculate the average price for each category
3. Count the number of transactions per store

In [None]:
# Sample DataFrame for Question 12
transactions = pd.DataFrame({
    'transaction_id': range(1, 13),
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Monitor', 
                'Keyboard', 'Mouse', 'Laptop', 'Monitor', 'Mouse', 'Keyboard'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories', 'Electronics',
                 'Accessories', 'Accessories', 'Electronics', 'Electronics', 'Accessories', 'Accessories'],
    'store': ['NYC', 'NYC', 'LA', 'LA', 'Chicago', 'NYC', 'Chicago', 'LA', 'NYC', 'LA', 'NYC', 'Chicago'],
    'price': [999.99, 29.99, 79.99, 1099.99, 24.99, 299.99, 89.99, 34.99, 949.99, 349.99, 29.99, 74.99],
    'quantity': [2, 5, 3, 1, 10, 2, 4, 8, 3, 1, 6, 5]
})
transactions['sales'] = transactions['price'] * transactions['quantity']
print("Transactions DataFrame:")
print(transactions)

In [None]:
# SOLUTION
# 1. Total sales for each product
print("1. Total sales by product:")
print(transactions.groupby('product')['sales'].sum())

# 2. Average price for each category
print("\n2. Average price by category:")
print(transactions.groupby('category')['price'].mean())

# 3. Number of transactions per store
print("\n3. Transactions per store:")
print(transactions.groupby('store')['transaction_id'].count())
# Alternative: print(transactions.groupby('store').size())
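
The two counting approaches agree here because `transaction_id` has no missing values. In general, `count()` skips missing values in the selected column while `size()` counts every row. A small sketch with a hypothetical frame (`visits`) containing one gap:

```python
import pandas as pd

# Hypothetical frame with one missing amount
visits = pd.DataFrame({
    'store': ['NYC', 'NYC', 'LA'],
    'amount': [10.0, None, 5.0]
})

# count() ignores the missing amount
non_missing = visits.groupby('store')['amount'].count()

# size() counts all rows in each group
all_rows = visits.groupby('store').size()

print(non_missing.to_dict())
print(all_rows.to_dict())
```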

### Question 13: Multiple Aggregations (3 points)

Using the same transactions DataFrame from Question 12:
1. Group by 'category' and calculate: total sales, average price, and count of transactions
2. Group by both 'store' and 'category' and calculate the sum of sales and quantity
3. Group by 'category' and use `agg()` to apply different aggregations to different columns:
   - 'price': mean and max
   - 'quantity': sum and min
   - 'sales': sum

In [None]:
# SOLUTION (use 'transactions' DataFrame from Question 12)
# 1. Group by category with multiple aggregations
print("1. Category summary:")
category_summary = transactions.groupby('category').agg({
    'sales': 'sum',
    'price': 'mean',
    'transaction_id': 'count'
}).rename(columns={'transaction_id': 'num_transactions'})
print(category_summary)

# 2. Group by store and category
print("\n2. Store and category summary:")
store_category = transactions.groupby(['store', 'category'])[['sales', 'quantity']].sum()
print(store_category)

# 3. Different aggregations for different columns
print("\n3. Custom aggregations:")
custom_agg = transactions.groupby('category').agg({
    'price': ['mean', 'max'],
    'quantity': ['sum', 'min'],
    'sales': 'sum'
})
print(custom_agg)
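
The dictionary form in part 3 produces MultiIndex columns such as `('price', 'mean')`. If flat column names are preferred, named aggregation is one option. The sketch below uses a hypothetical four-row frame (`tx`), and the output column names are illustrative choices.

```python
import pandas as pd

# Hypothetical transactions, smaller than the Question 12 data
tx = pd.DataFrame({
    'category': ['Electronics', 'Accessories', 'Electronics', 'Accessories'],
    'price': [1000.0, 30.0, 300.0, 80.0],
    'quantity': [2, 5, 1, 3]
})
tx['sales'] = tx['price'] * tx['quantity']

# Named aggregation: output_name=(input_column, function)
summary = tx.groupby('category').agg(
    avg_price=('price', 'mean'),
    max_price=('price', 'max'),
    total_quantity=('quantity', 'sum'),
    total_sales=('sales', 'sum'),
)

print(summary)
```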

### Question 14: Transform and Apply (3 points)

Using the DataFrame below:
1. Use `transform()` to add a column 'dept_avg_salary' showing the average salary for each employee's department
2. Use `transform()` to add a column 'salary_pct_of_dept' showing each employee's salary as a percentage of their department's total
3. Use `apply()` with a custom function to find the employee with the highest salary in each department

In [None]:
# Sample DataFrame for Question 14
company = pd.DataFrame({
    'emp_id': range(1, 10),
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Iris'],
    'department': ['Engineering', 'Engineering', 'Engineering', 'Sales', 'Sales', 'Sales', 'HR', 'HR', 'HR'],
    'salary': [85000, 92000, 78000, 65000, 72000, 68000, 55000, 62000, 58000]
})
print("Company DataFrame:")
print(company)

In [None]:
# SOLUTION
# Make a copy
df = company.copy()

# 1. Add department average salary using transform
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
print("1. With department average salary:")
print(df)

# 2. Add salary as percentage of department total
df['salary_pct_of_dept'] = (df['salary'] / df.groupby('department')['salary'].transform('sum') * 100).round(2)
print("\n2. With salary percentage of department:")
print(df)

# 3. Find employee with highest salary in each department
print("\n3. Highest paid employee per department:")
def get_highest_paid(group):
    return group.loc[group['salary'].idxmax()]

# Note: include_groups requires pandas 2.2 or later; omit the argument on older versions
highest_paid = company.groupby('department').apply(get_highest_paid, include_groups=False)
print(highest_paid)

# Alternative approach using idxmax
# idx = company.groupby('department')['salary'].idxmax()
# print(company.loc[idx])
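
Parts 1 and 2 rely on `transform()` returning one value per original row, aligned to the original index, which is why the result can be assigned directly as a new column. An aggregation such as `mean()` returns one value per group instead. A minimal sketch on a hypothetical frame (`team`):

```python
import pandas as pd

# Hypothetical frame with two departments
team = pd.DataFrame({
    'department': ['Eng', 'Eng', 'HR'],
    'salary': [80, 100, 60]
})

# Aggregation: one value per group
per_group = team.groupby('department')['salary'].mean()

# Transform: one value per original row, in the original order
per_row = team.groupby('department')['salary'].transform('mean')

print(len(per_group), len(per_row))
print(per_row.tolist())
```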

---

## Section 6: Merging and Joining (Questions 15-17)

### Question 15: Basic Merge Operations (3 points)

Using the DataFrames below:
1. Perform an inner join of orders and customers on 'customer_id'
2. Perform a left join to include all orders (even those with invalid customer_id)
3. Perform an outer join to see all customers and all orders

In [None]:
# Sample DataFrames for Question 15
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 1, 3, 6, 2],  # Note: customer_id 6 doesn't exist
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet', 'Headphones'],
    'amount': [999.99, 29.99, 79.99, 299.99, 499.99, 149.99]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

In [None]:
# SOLUTION
# 1. Inner join
print("1. Inner join (only matching):")
inner_result = pd.merge(orders, customers, on='customer_id', how='inner')
print(inner_result)

# 2. Left join (all orders)
print("\n2. Left join (all orders):")
left_result = pd.merge(orders, customers, on='customer_id', how='left')
print(left_result)

# 3. Outer join (all customers and all orders)
print("\n3. Outer join (all from both):")
outer_result = pd.merge(orders, customers, on='customer_id', how='outer')
print(outer_result)
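
A quick way to check the three join types is to compare row counts and look for rows whose customer columns are missing after a left join. The sketch below uses small hypothetical tables (`cust`, `ords`) in which one order has no matching customer and one customer has no orders.

```python
import pandas as pd

# Hypothetical tables: customer 3 is unknown, customer 4 has no orders
cust = pd.DataFrame({'customer_id': [1, 2, 4], 'name': ['Alice', 'Bob', 'Diana']})
ords = pd.DataFrame({'order_id': [101, 102, 103], 'customer_id': [1, 3, 2]})

inner = ords.merge(cust, on='customer_id', how='inner')
left = ords.merge(cust, on='customer_id', how='left')
outer = ords.merge(cust, on='customer_id', how='outer')

# Orders with no matching customer have a missing name after the left join
orphans = left[left['name'].isna()]

print(len(inner), len(left), len(outer))
print(orphans['order_id'].tolist())
```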

### Question 16: Multi-Table Join (3 points)

Using the DataFrames below, create a comprehensive report that shows:
- Order ID
- Customer name
- Product name
- Category
- Order amount
- Unit cost
- Profit (amount - unit_cost)

This requires joining all three tables.

In [None]:
# Sample DataFrames for Question 16
customers_q16 = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders_q16 = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer_id': [1, 2, 1, 3, 2],
    'product_id': ['P1', 'P2', 'P3', 'P1', 'P3'],
    'amount': [150.00, 45.00, 250.00, 150.00, 250.00]
})

products_q16 = pd.DataFrame({
    'product_id': ['P1', 'P2', 'P3'],
    'product_name': ['Widget', 'Gadget', 'Gizmo'],
    'category': ['Electronics', 'Accessories', 'Electronics'],
    'unit_cost': [75.00, 20.00, 125.00]
})

print("Customers:")
print(customers_q16)
print("\nOrders:")
print(orders_q16)
print("\nProducts:")
print(products_q16)

In [None]:
# SOLUTION
# Step 1: Merge orders with customers
step1 = pd.merge(orders_q16, customers_q16, on='customer_id')

# Step 2: Merge with products
step2 = pd.merge(step1, products_q16, on='product_id')

# Step 3: Calculate profit and select/rename columns
step2['profit'] = step2['amount'] - step2['unit_cost']

# Create final report with selected columns
report = step2[['order_id', 'name', 'product_name', 'category', 'amount', 'unit_cost', 'profit']]
report.columns = ['Order ID', 'Customer', 'Product', 'Category', 'Amount', 'Unit Cost', 'Profit']

print("Comprehensive Order Report:")
print(report)
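
When chaining merges like this, a duplicated key in a lookup table silently multiplies order rows. The `validate` argument of `merge` can guard against that. The sketch below uses hypothetical tables (`ord_tbl`, `prod_tbl`) and assumes each order should match at most one product.

```python
import pandas as pd

# Hypothetical order and product tables
ord_tbl = pd.DataFrame({
    'order_id': [1001, 1002],
    'product_id': ['P1', 'P2'],
    'amount': [150.0, 45.0]
})
prod_tbl = pd.DataFrame({'product_id': ['P1', 'P2'], 'unit_cost': [75.0, 20.0]})

# validate='many_to_one' checks that product_id is unique in the right table
report = ord_tbl.merge(prod_tbl, on='product_id', how='left', validate='many_to_one')
report['profit'] = report['amount'] - report['unit_cost']

# A lookup table with a repeated key fails the check
dup_prod_tbl = pd.concat([prod_tbl, prod_tbl.iloc[[0]]], ignore_index=True)
try:
    ord_tbl.merge(dup_prod_tbl, on='product_id', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(report)
print(duplicate_detected)
```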

### Question 17: Concatenation and Merge Indicator (3 points)

Using the DataFrames below:
1. Concatenate q1_sales and q2_sales vertically with keys ['Q1', 'Q2'] to identify the source quarter
2. Use merge with indicator to find:
   - Products that were sold in Q1 but not Q2
   - Products that were sold in Q2 but not Q1
   - Products that were sold in both quarters

In [None]:
# Sample DataFrames for Question 17
q1_sales = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'units_sold': [100, 500, 300, 150],
    'revenue': [99999, 14995, 23997, 44999]
})

q2_sales = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Tablet', 'Headphones'],
    'units_sold': [120, 450, 200, 350],
    'revenue': [119999, 13495, 99998, 52493]
})

print("Q1 Sales:")
print(q1_sales)
print("\nQ2 Sales:")
print(q2_sales)

In [None]:
# SOLUTION
# 1. Concatenate with keys
print("1. Concatenated with quarter keys:")
combined = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])
print(combined)

# 2. Use merge with indicator to analyze product overlap
print("\n2. Product analysis using merge indicator:")
merged = pd.merge(
    q1_sales[['product']], 
    q2_sales[['product']], 
    on='product', 
    how='outer', 
    indicator=True
)
print("Merge result:")
print(merged)

# Products only in Q1
q1_only = merged[merged['_merge'] == 'left_only']['product'].tolist()
print(f"\nProducts in Q1 only: {q1_only}")

# Products only in Q2
q2_only = merged[merged['_merge'] == 'right_only']['product'].tolist()
print(f"Products in Q2 only: {q2_only}")

# Products in both quarters
both = merged[merged['_merge'] == 'both']['product'].tolist()
print(f"Products in both quarters: {both}")
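
The keys passed to `pd.concat` become the outer level of a MultiIndex, so each quarter can be selected with `loc` or moved into a regular column for grouping. A minimal sketch, assuming hypothetical two-row frames (`q1`, `q2`) and illustrative level names `'quarter'` and `'row'`:

```python
import pandas as pd

# Hypothetical quarterly frames
q1 = pd.DataFrame({'product': ['Laptop', 'Mouse'], 'units_sold': [100, 500]})
q2 = pd.DataFrame({'product': ['Laptop', 'Tablet'], 'units_sold': [120, 200]})

# names labels the two index levels: the keys and the original row index
combined = pd.concat([q1, q2], keys=['Q1', 'Q2'], names=['quarter', 'row'])

# Select one quarter by its key
q2_rows = combined.loc['Q2']

# Move the quarter level into a column, then aggregate by it
flat = combined.reset_index(level='quarter')
units_by_quarter = flat.groupby('quarter')['units_sold'].sum()

print(q2_rows['product'].tolist())
print(units_by_quarter.to_dict())
```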

---

## End of Test

Review your answers before submitting. Make sure all code cells run without errors.

**Scoring Reminder:**
- Questions 1-6: 1 point each (6 points total)
- Questions 7-12: 2 points each (12 points total)
- Questions 13-17: 3 points each (15 points total)

**Total: 33 points**