# Week 3 Lab: Data Wrangling with Pandas

**Estimated Time:** 30-60 minutes  
**Objective:** Master essential data cleaning and manipulation techniques using Philippine datasets.

In this lab, you will:
- Handle missing data appropriately
- Clean and transform messy datasets
- Merge and join multiple data sources
- Work with Philippine census and economic data

---

## Setup

Run this cell first to import required libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set random seed for reproducibility
np.random.seed(42)

print("✓ Libraries imported successfully!")

---
## Part 1: Handling Missing Data

**Background:** Real-world datasets often have missing values. Let's work with Philippine regional population data with incomplete records.
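
As a warm-up, here is a minimal sketch of counting missing values on a small made-up table. The `toy` DataFrame and its columns are hypothetical and are not part of the lab data; the idea is only to show how a Boolean mask from `.isna()` turns into counts and percentages.

```python
import pandas as pd

# Hypothetical toy data (not the lab dataset); None becomes NaN in numeric columns
toy = pd.DataFrame({
    'Province': ['A', 'B', 'C', 'D'],
    'Households': [1200, None, 950, None],
    'Avg_Age': [24.5, 31.0, None, 28.2],
})

# isna() returns a Boolean mask; summing it counts the True values per column
toy_missing = toy.isna().sum()

# The mean of a Boolean mask is the fraction missing; scale it to a percentage
toy_missing_pct = toy.isna().mean() * 100

print(toy_missing)
print(toy_missing_pct)
```

`.isnull()` is an alias of `.isna()`, so either one can be used in the exercise below.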

### Exercise 1.1: Identify Missing Data

In [None]:
# Philippine regional data with missing values
data = {
    'Region': ['NCR', 'CAR', 'Region I', 'Region II', 'Region III', 'Region IV-A', 'Region V', 'Region VI'],
    'Population': [13484462, 1797660, None, 3685744, 12422172, 16195042, None, 7954723],
    'Poverty_Rate': [4.9, 24.6, 15.3, None, 7.8, 6.5, 20.4, 18.3],
    'Literacy_Rate': [99.2, 94.3, 97.8, 96.5, None, 98.9, 96.2, None]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# TODO: Check for missing values - count per column
missing_counts = None  # Your code here (use .isnull() or .isna())

print("\nMissing values per column:")
print(missing_counts)

# TODO: Calculate percentage of missing values per column
missing_percentage = None  # Your code here

print("\nMissing percentage per column:")
print(missing_percentage)

### Exercise 1.2: Handle Missing Data

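Different columns call for different strategies. The mean is sensitive to extreme values, the median is more robust to them, and forward fill copies the value from the previous row, so its result depends on row order.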
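
The sketch below compares the three fills on a hypothetical rainfall column (the `rain` DataFrame is invented for illustration). It assumes a single large value is present so that the gap between mean and median is visible.

```python
import pandas as pd
import numpy as np

# Hypothetical monthly rainfall with two gaps and one large value (200.0)
rain = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'Rainfall_mm': [10.0, np.nan, 30.0, 200.0, np.nan],
})

# Mean of the observed values is 80.0, pulled upward by the 200.0 reading
mean_filled = rain['Rainfall_mm'].fillna(rain['Rainfall_mm'].mean())

# Median of the observed values is 30.0, unaffected by the large reading
median_filled = rain['Rainfall_mm'].fillna(rain['Rainfall_mm'].median())

# Forward fill repeats the last observed value; a gap in the first row would stay NaN
forward_filled = rain['Rainfall_mm'].ffill()

print(pd.DataFrame({
    'mean': mean_filled,
    'median': median_filled,
    'ffill': forward_filled,
}))
```

Forward fill is most natural for ordered data such as time series. For a table of regions, where row order carries no meaning, treat it as practice with the method rather than a recommended imputation.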

In [None]:
# TODO: Fill missing Population values with the mean population
df_filled = df.copy()
# Your code here (use .fillna() with the mean)

# TODO: Fill missing Poverty_Rate with median
# Your code here

# TODO: Fill missing Literacy_Rate with forward fill method
# Your code here (use .ffill())

print("DataFrame after handling missing values:")
print(df_filled)

# Verify no missing values remain
print("\nRemaining missing values:", df_filled.isnull().sum().sum())

---
## Part 2: Data Cleaning

**Background:** Data often comes in messy formats. Let's clean Philippine student enrollment data.
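
To illustrate the string methods used in this part, here is a small sketch on a hypothetical `shops` table (names and values are made up). It chains `.str` methods to trim whitespace, normalize case, and turn comma-formatted text into numbers.

```python
import pandas as pd

# Hypothetical messy records (not the lab data)
shops = pd.DataFrame({
    'Name': ['  sari-sari ONE ', 'CORNER store'],
    'Sales': [' 1,250 ', '980'],
    'Kind': ['RETAIL', 'retail'],
})

# Trim surrounding whitespace, then capitalize the first letter of each word
shops['Name'] = shops['Name'].str.strip().str.title()

# Remove thousands separators and stray spaces before converting to numbers
sales_text = shops['Sales'].str.replace(',', '', regex=False).str.strip()
shops['Sales'] = pd.to_numeric(sales_text)

# capitalize() upper-cases the first character and lower-cases the rest
shops['Kind'] = shops['Kind'].str.capitalize()

print(shops)
print(shops.dtypes)
```

One caveat worth keeping in mind: `.str.title()` lower-cases every letter after the first in each word, so acronyms lose their capitalization. Whether that matters depends on how the cleaned names will be used.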

### Exercise 2.1: Clean Messy String Data

In [None]:
# Messy university data
messy_data = {
    'University': ['  UP DILIMAN  ', 'up cebu', 'ATENEO DE MANILA', '  De La Salle  ', 'UST'],
    'Students': ['23,000', '4,500', '12000', '  15,000  ', '40000'],
    'Type': ['Public', 'PUBLIC', 'private', 'Private', 'PRIVATE']
}

messy_df = pd.DataFrame(messy_data)
print("Messy DataFrame:")
print(messy_df)

# TODO: Clean University column - strip whitespace and title case
messy_df['University'] = None  # Your code here (use .str.strip() and .str.title())

# TODO: Convert Students to numeric (remove commas first)
# Your code here (use .str.replace() and pd.to_numeric())

# TODO: Standardize the Type column so only the first letter is capitalized
messy_df['Type'] = None  # Your code here (use .str.capitalize())

print("\nCleaned DataFrame:")
print(messy_df)
print("\nData types:")
print(messy_df.dtypes)

### Exercise 2.2: Handle Duplicates

In [None]:
# Data with duplicates
duplicate_data = {
    'City': ['Manila', 'Quezon City', 'Manila', 'Cebu City', 'Davao', 'Quezon City'],
    'Region': ['NCR', 'NCR', 'NCR', 'Region VII', 'Region XI', 'NCR'],
    'Population': [1780000, 2960000, 1780000, 960000, 1780000, 2960000]
}

dup_df = pd.DataFrame(duplicate_data)
print("DataFrame with duplicates:")
print(dup_df)

# TODO: Check for duplicate rows
print(f"\nNumber of duplicate rows: {None}")  # Your code here (use .duplicated().sum())

# TODO: Remove duplicate rows
clean_df = None  # Your code here (use .drop_duplicates())

print("\nDataFrame after removing duplicates:")
print(clean_df)
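
By default, `.duplicated()` flags a row only when every column matches an earlier row. The sketch below, on a hypothetical `visits` table, contrasts that default with the `subset` parameter, which compares selected columns only.

```python
import pandas as pd

# Hypothetical visit log (not the lab data)
visits = pd.DataFrame({
    'Visitor': ['Ana', 'Ben', 'Ana', 'Ana'],
    'Site': ['Museum', 'Museum', 'Museum', 'Park'],
    'Fee': [50, 50, 50, 50],
})

# Full-row comparison: only the third row repeats an earlier row exactly
full_dupes = visits.duplicated().sum()

# Comparing on Visitor alone also flags the fourth row
name_dupes = visits.duplicated(subset=['Visitor']).sum()

# Keep the first occurrence of each full row and renumber the index
deduped = visits.drop_duplicates().reset_index(drop=True)

print(full_dupes, name_dupes)
print(deduped)
```

A shared value in one column (for example, two different cities with the same population figure) does not make a row a duplicate under the default full-row comparison.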

---
## Part 3: Merging and Joining Datasets

**Background:** Analyses often need data from more than one source. In this part you will combine DataFrames with joins and concatenation to build a more complete dataset.
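
The following sketch shows how the `how` argument of `pd.merge()` changes which rows survive. The `left` and `right` tables are hypothetical; `indicator=True` is used on the outer join to add a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical tables that share only the keys 'b' and 'c'
left = pd.DataFrame({'Key': ['a', 'b', 'c'], 'X': [1, 2, 3]})
right = pd.DataFrame({'Key': ['b', 'c', 'd'], 'Y': [20, 30, 40]})

# Inner join: only keys present in both tables
inner = pd.merge(left, right, on='Key', how='inner')

# Left join: every row of `left`; Y is NaN where `right` has no match
left_join = pd.merge(left, right, on='Key', how='left')

# Outer join: keys from either table; _merge labels each row's origin
outer = pd.merge(left, right, on='Key', how='outer', indicator=True)

print(inner)
print(left_join)
print(outer)
```

Because unmatched rows are filled with `NaN`, an integer column such as `Y` is converted to float in the left and outer results.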

### Exercise 3.1: Inner Join

In [None]:
# City population data
population_df = pd.DataFrame({
    'City': ['Manila', 'Quezon City', 'Caloocan', 'Davao', 'Cebu City'],
    'Population': [1780000, 2960000, 1660000, 1780000, 960000]
})

# City economic data
income_df = pd.DataFrame({
    'City': ['Manila', 'Quezon City', 'Cebu City', 'Makati', 'Davao'],
    'Avg_Income': [35000, 32000, 28000, 55000, 30000]
})

print("Population DataFrame:")
print(population_df)
print("\nIncome DataFrame:")
print(income_df)

# TODO: Perform inner join on City
inner_merged = None  # Your code here (use pd.merge() with how='inner')

print("\nInner Join Result (only matching cities):")
print(inner_merged)

### Exercise 3.2: Left Join

In [None]:
# TODO: Perform left join (keep all cities from population_df)
left_merged = None  # Your code here (use pd.merge() with how='left')

print("Left Join Result (all population cities):")
print(left_merged)

# TODO: Fill missing income values with 0
# Your code here

print("\nAfter filling missing values:")
print(left_merged)

### Exercise 3.3: Concatenate DataFrames

In [None]:
# Regional data from different sources
luzon_data = pd.DataFrame({
    'Region': ['NCR', 'Region III', 'Region IV-A'],
    'Population': [13484462, 12422172, 16195042],
    'Island_Group': ['Luzon', 'Luzon', 'Luzon']
})

visayas_data = pd.DataFrame({
    'Region': ['Region VI', 'Region VII', 'Region VIII'],
    'Population': [7954723, 8081988, 4547150],
    'Island_Group': ['Visayas', 'Visayas', 'Visayas']
})

# TODO: Concatenate the two DataFrames vertically
combined_df = None  # Your code here (use pd.concat())

# TODO: Reset the index
# Your code here (use .reset_index(drop=True))

print("Combined DataFrame:")
print(combined_df)
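
For comparison, here is a short sketch of what happens to the index when stacking DataFrames. The `q1` and `q2` tables are hypothetical. It shows that `pd.concat()` keeps each input's original index unless told otherwise, and that `ignore_index=True` is an alternative to calling `.reset_index(drop=True)` afterwards.

```python
import pandas as pd

# Hypothetical tables with the same columns
q1 = pd.DataFrame({'Item': ['rice', 'fish'], 'Qty': [5, 2]})
q2 = pd.DataFrame({'Item': ['eggs'], 'Qty': [12]})

# Default behavior: original index labels are kept, so label 0 appears twice
stacked = pd.concat([q1, q2])

# ignore_index=True assigns a fresh 0..n-1 index in one step
renumbered = pd.concat([q1, q2], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated index labels are legal in pandas, but label-based selection such as `stacked.loc[0]` then returns more than one row, which is why resetting the index is a common follow-up.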

---
## Part 4: Advanced Transformations

### Exercise 4.1: Creating New Columns

In [None]:
# Philippine city data
city_df = pd.DataFrame({
    'City': ['Manila', 'Quezon City', 'Caloocan', 'Davao'],
    'Population': [1780000, 2960000, 1660000, 1780000],
    'Area_km2': [42.88, 161.11, 53.33, 2443.61]
})

# TODO: Create a population density column (Population / Area)
city_df['Density'] = None  # Your code here

# TODO: Create a category column based on population
# Large: > 2M, Medium: 1M to 2M (inclusive), Small: < 1M
def categorize_population(pop):
    # Your code here
    pass

city_df['Size_Category'] = None  # Your code here (use .apply())

print("DataFrame with new columns:")
print(city_df)
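
As an alternative to writing a function for `.apply()`, binning can also be done with `pd.cut()`. The sketch below uses a hypothetical `towns` table and is not the approach the exercise asks for; it is shown only to compare the two styles.

```python
import pandas as pd

# Hypothetical towns (not the lab data)
towns = pd.DataFrame({
    'Town': ['P', 'Q', 'R'],
    'Residents': [500_000, 1_500_000, 3_000_000],
    'Area_km2': [100.0, 300.0, 150.0],
})

# Column arithmetic is vectorized: one division per row, no loop needed
towns['Density'] = towns['Residents'] / towns['Area_km2']

# pd.cut assigns each value to an interval; by default intervals are
# closed on the right, so exactly 1,000,000 would fall in 'Small' here
towns['Size'] = pd.cut(
    towns['Residents'],
    bins=[0, 1_000_000, 2_000_000, float('inf')],
    labels=['Small', 'Medium', 'Large'],
)

print(towns)
```

Note the boundary behavior in the comment: with the default right-closed bins, a value of exactly 1,000,000 is labeled `Small`, which differs from the thresholds stated in the exercise. A hand-written function gives full control over such edge cases.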

### Exercise 4.2: Grouping and Aggregation

In [None]:
# Regional GDP data
gdp_data = pd.DataFrame({
    'Region': ['NCR', 'NCR', 'NCR', 'Region VII', 'Region VII', 'Region XI', 'Region XI'],
    'Year': [2020, 2021, 2022, 2020, 2021, 2020, 2021],
    'GDP_Billion': [6200, 6450, 6800, 1100, 1150, 850, 900]
})

print("GDP Data:")
print(gdp_data)

# TODO: Calculate average GDP per region
avg_gdp = None  # Your code here (use .groupby() and .mean())

print("\nAverage GDP by Region:")
print(avg_gdp)

# TODO: Find the total GDP per region across all years
total_gdp = None  # Your code here

print("\nTotal GDP by Region:")
print(total_gdp)
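
Several aggregations can be computed in one pass with named aggregation. The sketch below uses a hypothetical `sales` table; the output column names (`avg_revenue`, `total_revenue`, `n_years`) are chosen here for illustration.

```python
import pandas as pd

# Hypothetical branch revenue; South has one more year of data than North
sales = pd.DataFrame({
    'Branch': ['North', 'North', 'South', 'South', 'South'],
    'Year': [2021, 2022, 2021, 2022, 2023],
    'Revenue': [100, 140, 80, 90, 100],
})

# Named aggregation: new_column=(source_column, function)
summary = sales.groupby('Branch').agg(
    avg_revenue=('Revenue', 'mean'),
    total_revenue=('Revenue', 'sum'),
    n_years=('Year', 'count'),
)

print(summary)
```

Including a row count next to the totals helps when groups cover different numbers of years. In this toy table South has the larger total but the smaller average, simply because it has an extra year of records.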

---
## Reflection Questions

Answer these questions in the cells below:

1. **Missing Data Strategy:** When would you choose to drop missing values versus filling them? What are the trade-offs?

2. **Data Cleaning Impact:** Why is data cleaning crucial before analysis? Give an example of how messy data could lead to wrong conclusions.

3. **Merging Strategies:** Explain the difference between inner join, left join, and outer join. When would you use each?

### Your Answer to Question 1:

[Your answer here]

### Your Answer to Question 2:

[Your answer here]

### Your Answer to Question 3:

[Your answer here]

---

## 🎯 Congratulations!

You've completed Week 3 Lab on Data Wrangling with Pandas.

**Key Skills Practiced:**
- Identifying and handling missing data
- Cleaning messy string and numeric data
- Removing duplicates
- Merging and joining multiple datasets
- Creating derived columns and aggregations

**Remember:** Check the **solution notebook** if you need help!

**Next Steps:**
- Practice with more complex Philippine datasets
- Explore advanced Pandas functions
- Move on to Week 4 lab (Exploratory Data Analysis)

---

*CMSC 178DA - Data Analytics | University of the Philippines Cebu*