# Week 3: In-Class Exercise - Data Manipulation & Cleaning

## Objective
Practice data cleaning techniques using the Education Statistics dataset (MEN_ESTADISTICAS).

## Time: ~30 minutes

## Dataset
Education Statistics from the Colombian Ministry of Education containing enrollment, dropout, and demographic information across departments and municipalities.

---

## Setup
Run this cell to load the necessary libraries and dataset.

In [None]:
import pandas as pd
import numpy as np

# Load the Education Statistics dataset from datos.gov.co
# MEN_ESTADISTICAS - Education statistics by department
url = "https://www.datos.gov.co/resource/ji8i-4anb.csv?$limit=15000"
df = pd.read_csv(url)

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

---

## Part 1: Data Inspection (5 minutes)

Before cleaning, you need to understand what you are working with.

### Task 1.1: Check the data types
Use `.dtypes` to see the data types of all columns. How many numeric columns do you see? How many object (text) columns?
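
As a warm-up, here is a minimal sketch of the same inspection on a toy DataFrame. The column names (`department`, `year`, `enrollment`, `dropout_rate`) are made up for illustration and are not the real dataset's columns.

```python
import pandas as pd

# Toy data with hypothetical column names (not the real dataset).
toy = pd.DataFrame({
    "department": ["A", "B"],
    "year": ["2020", "2021"],      # digits stored as text
    "enrollment": [100, 250],
    "dropout_rate": [2.5, 3.1],
})

# One dtype per column.
print(toy.dtypes)

# select_dtypes splits the columns into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Text:", text_cols)
```

Note that `year` lands in the text group even though it looks numeric, which is the kind of problem Part 3 deals with.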

In [None]:
# YOUR CODE HERE


### Task 1.2: Find missing values
Use `.isnull().sum()` to count missing values in each column. Which columns have the most missing values?

In [None]:
# YOUR CODE HERE


### Task 1.3: Calculate the percentage of missing values
Divide the count of missing values by the total number of rows and multiply by 100. Which columns have more than 10% missing?
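
The calculation described above can be sketched on a toy DataFrame as follows. The data and column names are hypothetical; apply the same two lines to `df` for the exercise.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names (not the real dataset).
toy = pd.DataFrame({
    "department": ["A", "B", "C", "D"],
    "enrollment": [100.0, np.nan, 250.0, 300.0],
    "dropout_rate": [np.nan, np.nan, 3.5, np.nan],
})

# Count of missing values per column, then convert to a percentage.
missing_count = toy.isnull().sum()
missing_pct = missing_count / len(toy) * 100

# Keep only the columns above the 10% threshold, largest first.
above_10 = missing_pct[missing_pct > 10].sort_values(ascending=False)
print(above_10)
```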

In [None]:
# YOUR CODE HERE


---

## Part 2: Handle Missing Values (10 minutes)

Remember the decision framework:
- **>50% missing**: Consider dropping the column
- **<5% missing**: Usually safe to drop rows
- **5-50% missing**: Need domain knowledge (fill or drop)
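
One way to apply this framework is a small helper that maps each column's missing percentage to a suggestion. This is only a sketch: `suggest_action` is a hypothetical function, the toy columns are invented, and the "nothing to do" case for complete columns is an added assumption.

```python
import numpy as np
import pandas as pd

def suggest_action(missing_pct: float) -> str:
    """Map a column's missing percentage to the rule of thumb above."""
    if missing_pct == 0:
        return "nothing to do"  # assumption: not part of the framework
    if missing_pct > 50:
        return "consider dropping the column"
    if missing_pct < 5:
        return "usually safe to drop rows"
    # Everything from 5% to 50% (inclusive) needs a judgment call.
    return "needs domain knowledge (fill or drop)"

# Toy data: 100 rows with 2, 20, and 60 missing values respectively.
n = 100
toy = pd.DataFrame({
    "mostly_complete": [np.nan] * 2 + [1.0] * (n - 2),
    "partly_missing": [np.nan] * 20 + [1.0] * (n - 20),
    "mostly_missing": [np.nan] * 60 + [1.0] * (n - 60),
})

# The mean of a boolean mask is the fraction of True values.
missing_pct = toy.isnull().mean() * 100
actions = missing_pct.apply(suggest_action)
print(actions)
```

The suggestions are a starting point only; the final decision still depends on what each column means.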

### Task 2.1: Fill numeric missing values with 0
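For enrollment-related columns, missing values likely mean "no data reported". For this exercise we treat them as 0 so that totals are easy to compute, but keep in mind that the added zeros will pull down any averages you calculate later.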

Find a numeric column with missing values and fill them with 0.
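
Before filling, it helps to see what `fillna(0)` does to your aggregates. The sketch below uses a toy `enrollment` column (a hypothetical name) to compare sums and means before and after filling.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (hypothetical data).
toy = pd.DataFrame({"enrollment": [100.0, np.nan, 200.0, np.nan]})

filled = toy["enrollment"].fillna(0)

# The sums agree because pandas skips NaN by default.
sum_before, sum_after = toy["enrollment"].sum(), filled.sum()

# The mean drops, because the zeros now count as observations.
mean_before, mean_after = toy["enrollment"].mean(), filled.mean()

print(f"Sum:  {sum_before} -> {sum_after}")
print(f"Mean: {mean_before} -> {mean_after}")
```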

In [None]:
# YOUR CODE HERE
# Example pattern: df['column_name'] = df['column_name'].fillna(0)


### Task 2.2: Drop rows with critical missing values
Some columns are essential (like department name). If those are missing, the row is useless.

Use `dropna(subset=['column_name'])` to remove rows where a critical column is missing.
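
A minimal sketch of the pattern on toy data, assuming a hypothetical `department` column is the critical one:

```python
import numpy as np
import pandas as pd

# Toy data: one row lacks a department, another lacks enrollment.
toy = pd.DataFrame({
    "department": ["A", None, "C", "D"],
    "enrollment": [100.0, 50.0, np.nan, 300.0],
})

rows_before = len(toy)

# Only rows where 'department' is missing are removed;
# the row with missing enrollment is kept.
cleaned = toy.dropna(subset=["department"])

rows_after = len(cleaned)
print(f"Rows before: {rows_before}, rows after: {rows_after}")
```

Because `subset` limits the check to the listed columns, missing values elsewhere do not cause rows to be dropped.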

In [None]:
# YOUR CODE HERE
# Check row count before and after
print(f"Rows before: {len(df)}")

# Your dropna code here

# print(f"Rows after: {len(df)}")

---

## Part 3: Data Type Conversion (5 minutes)

### Task 3.1: Convert a column to the correct type
Sometimes numeric columns are loaded as text (`object` dtype). Check whether the year column is numeric. If it is not, convert it using `.astype(int)`.

**Remember:** Handle NaN values BEFORE converting to int! `.astype(int)` raises an error if the column still contains missing values.
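
The three steps can be sketched on a toy `year` column (a hypothetical name) stored as text. Two things here go beyond the instructions above and are offered as options: `pd.to_numeric(..., errors="coerce")` to parse the text safely, and the nullable `Int64` dtype as an alternative to filling. The fill value 0 is just a placeholder.

```python
import pandas as pd

# Toy data: years stored as text, with one missing value.
toy = pd.DataFrame({"year": ["2019", "2020", None, "2021"]})

# Step 1: check the current dtype.
print(toy["year"].dtype)

# Parse the text; anything unparseable or missing becomes NaN.
year_numeric = pd.to_numeric(toy["year"], errors="coerce")

# Steps 2 and 3: fill NaN with a placeholder, then convert to int.
year_filled = year_numeric.fillna(0).astype(int)

# Alternative: a nullable integer dtype keeps the value missing.
year_nullable = year_numeric.astype("Int64")

print(year_filled.tolist())
print(year_nullable.isna().sum(), "missing value kept")
```

Filling with 0 makes the conversion work but invents a year that does not exist, so dropping the row or using `Int64` is often the safer choice for a column like this.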

In [None]:
# YOUR CODE HERE
# Step 1: Check the current dtype of the year column
# Step 2: Fill NaN if needed
# Step 3: Convert to int


---

## Part 4: Filtering with Multiple Conditions (5 minutes)

### Task 4.1: Filter for a specific department and year
Create a filtered DataFrame that contains only:
- A specific department (e.g., "ANTIOQUIA" or "BOGOTA")
- A specific year (e.g., 2020 or the most recent year in the data)

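**Remember:** Use `&` (not `and`) and put parentheses around each condition! `and` cannot compare Series element by element, and `&` binds more tightly than `==`, so conditions without parentheses are evaluated in the wrong order.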
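
Here is a small sketch of a two-condition filter on toy data. The column names and values are hypothetical; swap in the real ones after inspecting `df`.

```python
import pandas as pd

# Toy data with hypothetical column names (not the real dataset).
toy = pd.DataFrame({
    "department": ["ANTIOQUIA", "ANTIOQUIA", "BOGOTA", "BOGOTA"],
    "year": [2019, 2020, 2019, 2020],
    "enrollment": [900, 950, 1200, 1250],
})

# Use the most recent year in the data instead of hard-coding it.
latest_year = toy["year"].max()

# Each condition is wrapped in parentheses and combined with &.
mask = (toy["department"] == "ANTIOQUIA") & (toy["year"] == latest_year)
filtered = toy[mask]

print(filtered)
```

Storing the boolean mask in its own variable makes it easy to check how many rows match (`mask.sum()`) before filtering.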

In [None]:
# YOUR CODE HERE
# First, check what departments and years are available
# (DEPARTAMENTO and ANIO are placeholder names; run df.columns to find the real ones)
# print(df['DEPARTAMENTO'].unique())
# print(df['ANIO'].unique())

# Then filter
# filtered_df = df[(condition1) & (condition2)]


---

## Part 5: GroupBy Operations (5 minutes)

Remember the M&Ms analogy: GroupBy is like sorting candy by color and then counting each color.

### Task 5.1: Calculate total enrollment by department
Use GroupBy to find the total enrollment (sum) for each department.

Pattern: `df.groupby('CATEGORY_COLUMN')['VALUE_COLUMN'].sum()`
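
The pattern can be tried on toy data first. The column names below are hypothetical stand-ins for the category and value columns.

```python
import pandas as pd

# Toy data: two rows per department (hypothetical names and values).
toy = pd.DataFrame({
    "department": ["A", "A", "B", "B"],
    "enrollment": [100, 150, 300, 50],
})

# Split rows by department, pick the value column, then sum each group.
totals = toy.groupby("department")["enrollment"].sum()

print(totals)
```

The result is a Series indexed by department, so a single total can be read with `totals["A"]`.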

In [None]:
# YOUR CODE HERE
# Look for an enrollment column (might be named MATRICULA, ESTUDIANTES, etc.)
# Group by department and sum


### Task 5.2: Calculate dropout rates by department
If there is a dropout column (DESERCION, TASA_DESERCION, etc.), calculate the average dropout rate for each department.

If no dropout column exists, calculate the average of any other relevant numeric column.

In [None]:
# YOUR CODE HERE
# Group by department and calculate mean


### Task 5.3: Find the top 5 departments by enrollment
Sort your grouped results to find the top 5 departments with the highest enrollment.
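
A sketch of the sorting step on toy data with six hypothetical departments. `nlargest(5)` is shown as an alternative that is not required for the exercise; it returns the same rows here.

```python
import pandas as pd

# Toy data: six departments, two rows each (hypothetical values).
toy = pd.DataFrame({
    "department": list("AABBCCDDEEFF"),
    "enrollment": [10, 20, 100, 50, 5, 5, 70, 10, 40, 20, 200, 1],
})

totals = toy.groupby("department")["enrollment"].sum()

# Sort from highest to lowest and keep the first five.
top5 = totals.sort_values(ascending=False).head(5)

# Alternative: nlargest does the sort and the cut in one call.
top5_alt = totals.nlargest(5)

print(top5)
```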

In [None]:
# YOUR CODE HERE
# Use .sort_values(ascending=False).head(5)


---

## Bonus Challenge (if time permits)

### Calculate dropout rates by department AND year
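Pass a list of columns to group by more than one key, for example `df.groupby(['DEPARTAMENTO', 'ANIO'])['DESERCION'].mean()` (substitute the actual column names in your DataFrame).

This returns a Series with a two-level (hierarchical) index, showing how dropout rates changed over time for each department.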
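
To see what the two-key result looks like, here is a sketch on toy data with hypothetical column names. The `unstack` step is an optional extra that pivots the years into columns for easier reading.

```python
import pandas as pd

# Toy data with hypothetical column names (not the real dataset).
toy = pd.DataFrame({
    "department": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2019, 2020, 2019, 2020, 2020],
    "dropout_rate": [2.0, 4.0, 5.0, 1.0, 3.0, 5.0],
})

# Grouping by two columns gives one mean per (department, year) pair.
by_dept_year = toy.groupby(["department", "year"])["dropout_rate"].mean()
print(by_dept_year)

# Optional: move the year level into columns (one row per department).
wide = by_dept_year.unstack("year")
print(wide)
```

A single value can be read from the grouped Series with a tuple, for example `by_dept_year.loc[("A", 2019)]`.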

In [None]:
# YOUR CODE HERE (BONUS)


---

## Summary

In this exercise you practiced:

1. **Data inspection** - Understanding your data before cleaning
2. **Missing values** - Using `isnull()`, `fillna()`, and `dropna()`
3. **Type conversion** - Using `astype()` to fix data types
4. **Filtering** - Using multiple conditions with `&` and parentheses
5. **GroupBy** - Aggregating data by categories

These are the foundational skills you will use in EVERY data analysis project!