# Week 4 Workshop: Data Filtering & Selection

## Overview
Practice filtering techniques to answer analytical questions about the Education Statistics dataset.

**Duration:** ~2 hours

**Objectives:**
- Filter rows using comparison operators
- Combine conditions with & (AND), | (OR), ~ (NOT)
- Use .isin(), .between(), .str.contains()
- Use .query() for clean syntax
- Translate analytical questions into filter code

---

## Setup
Run this cell to load libraries, load the dataset, and apply the Week 3 quick-clean.

In [None]:
import pandas as pd
import numpy as np

# Load the Education Statistics dataset
df = pd.read_csv('../../semana03/data/educacion_estadisticas.csv')

# Quick clean (Week 3 skills)
df['departamento'] = df['departamento'].str.upper().str.strip()
df['ano'] = df['ano'].fillna(0).astype(int)
df['poblacion_5_16'] = pd.to_numeric(
    df['poblacion_5_16'].astype(str).str.replace(',', ''), errors='coerce'
)
df = df.drop_duplicates()
df = df.dropna(subset=['departamento'])

print(f"Dataset: {df.shape[0]} rows x {df.shape[1]} columns")
print(f"Years: {sorted(df['ano'].unique())}")
print(f"Departments: {df['departamento'].nunique()} unique")
print(f"\nColumns: {df.columns.tolist()}")
df.head(3)

---

## Part 1: Single Condition Filters (20 minutes)

Practice the basic filtering pattern: `df[df['column'] operator value]`
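
The pattern works in two steps: the comparison builds a True/False Series (a "mask"), and indexing with that mask keeps only the True rows. A minimal sketch on toy data (the `toy` frame and its values are made up for illustration and are not part of the course dataset):

```python
import pandas as pd

# Toy data (hypothetical values, not from the Education Statistics dataset)
toy = pd.DataFrame({
    'city': ['A', 'B', 'C', 'D'],
    'score': [55, 72, 90, 72],
})

# Step 1: the comparison produces a boolean Series (the mask)
mask = toy['score'] > 70
print(mask.tolist())      # [False, True, True, True]

# Step 2: indexing with the mask keeps only the rows where it is True
high = toy[mask]
print(len(high))          # 3

# The one-line form is the same operation written inline
same = toy[toy['score'] > 70]
print(high.equals(same))  # True
```

Printing the mask before indexing is a quick way to debug a filter that returns fewer rows than expected.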

### Task 1.1: Filter by year
Get all rows from the year 2020.

Print the number of rows and display the first 5.

In [None]:
# YOUR CODE HERE


### Task 1.2: Filter by comparison
Find all rows where the total dropout rate (`desercion`) is greater than 8%.

**Question:** How many rows have dropout above 8%? Which departments appear most frequently?

In [None]:
# YOUR CODE HERE
# Hint: after filtering, try result['departamento'].value_counts() to see which departments appear most


### Task 1.3: Filter by exact text
Get all rows for the department "CHOCO".

Display the year (`ano`), net coverage (`cobertura_neta`), and dropout (`desercion`) columns, sorted by year.

In [None]:
# YOUR CODE HERE
# Hint: after filtering, use .sort_values('ano') and select specific columns


### Task 1.4: Filter with not equal
Get all rows EXCEPT the year 0 (rows where the year was originally missing and filled with 0 during cleaning).

Print the row count before and after to see how many "bad year" rows you removed.

In [None]:
# YOUR CODE HERE
# print(f"Before: {len(df)} rows")
# df_valid = ...
# print(f"After: {len(df_valid)} rows")


---

## Part 2: Combining Conditions (30 minutes)

Combine multiple conditions to answer more specific questions.

**Rules:**
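- Use `&` (AND), `|` (OR), and `~` (NOT), not Python's `and`, `or`, `not` keywords, which raise an error on a pandas Series
- Always wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `<`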
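
Both rules can be checked on a small example. The sketch below uses a made-up `toy` frame (hypothetical columns `year` and `rate`) and only records whether each incorrect form fails. The exact error message may vary between pandas versions.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({'year': [2022, 2023, 2023], 'rate': [4.0, 2.5, 6.0]})

# Correct: each comparison is wrapped, so & combines two boolean Series
ok = toy[(toy['year'] == 2023) & (toy['rate'] < 5)]
print(len(ok))  # 1

# Without parentheses, & is evaluated before == and <,
# so Python tries to compute 2023 & toy['rate'] first and fails
try:
    toy[toy['year'] == 2023 & toy['rate'] < 5]
    unparenthesized_failed = False
except (TypeError, ValueError):
    unparenthesized_failed = True
print(unparenthesized_failed)  # True

# Python's `and` fails too: a Series has no single True/False value
try:
    toy[(toy['year'] == 2023) and (toy['rate'] < 5)]
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)  # True
```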

### Task 2.1: Two AND conditions
Find departments in 2023 where net coverage (`cobertura_neta`) is below 80%.

**Question:** Which departments had coverage below 80% in 2023? List them.

In [None]:
# YOUR CODE HERE


### Task 2.2: Three AND conditions
Find departments in 2023 where:
- Dropout rate is below 3% AND
- Approval rate (`aprobacion`) is above 90% AND
- Net coverage is above 85%

These are the "top performers." Show `departamento`, `desercion`, `aprobacion`, and `cobertura_neta`.

In [None]:
# YOUR CODE HERE


### Task 2.3: OR conditions
Find all rows from GUAINIA, VAUPES, or VICHADA (three of Colombia's most remote departments).

**Two approaches:**
1. Using `|` (OR) operators
2. Using `.isin()` (cleaner)

Try both and verify you get the same result.
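
As a sketch of how such a verification can look, the example below compares the two approaches on toy data (the `region` and `value` columns are hypothetical). `.equals()` is stricter than comparing row counts because it also checks the values and the index.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'region': ['NORTH', 'SOUTH', 'EAST', 'WEST', 'NORTH'],
    'value': [1, 2, 3, 4, 5],
})

# Approach 1: one equality test per value, chained with |
with_or = toy[(toy['region'] == 'NORTH') | (toy['region'] == 'EAST')]

# Approach 2: a single membership test against a list
with_isin = toy[toy['region'].isin(['NORTH', 'EAST'])]

print(len(with_or), len(with_isin))  # 3 3
print(with_or.equals(with_isin))     # True
```

The `.isin()` form stays the same length no matter how many values are in the list, which is why it scales better than a chain of `|`.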

In [None]:
# YOUR CODE HERE
# Approach 1: Using |

# Approach 2: Using .isin()

# Verify both have the same number of rows


### Task 2.4: Mixed AND + OR
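Find rows from 2023 that are from either Antioquia OR Bogota, AND have a dropout rate below 3%. The Setup cell uppercased the department names, so match them as they appear in the data (check `df['departamento'].unique()` if unsure).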

**Hint:** You can combine .isin() with other conditions:
`df[(df['col'].isin([...])) & (df['other_col'] < value)]`
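
A minimal sketch of this mixed pattern on toy data (the frame, its column names, and the `wanted` list are all hypothetical). Each condition sits on its own line inside the brackets, which makes the parentheses easier to balance:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'region': ['NORTH', 'SOUTH', 'EAST', 'NORTH'],
    'year': [2023, 2023, 2023, 2022],
    'rate': [2.0, 1.5, 4.0, 1.0],
})

wanted = ['NORTH', 'EAST']

# The .isin() call plays the role of the OR; & adds the other conditions
result = toy[
    (toy['year'] == 2023)
    & (toy['region'].isin(wanted))
    & (toy['rate'] < 3)
]
print(result['region'].tolist())  # ['NORTH']
```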

In [None]:
# YOUR CODE HERE


### Task 2.5: NOT condition with .isin()
Get all department data for 2023, EXCLUDING the national aggregate ("NACIONAL") and Bogota ("BOGOTA D.C.").

**Pattern:** `df[~df['col'].isin([list])]`
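
One reason to exclude an aggregate row is that it double-counts the rows it summarizes. The sketch below shows this on toy data, where a hypothetical `TOTAL` row stands in for a national aggregate:

```python
import pandas as pd

# Toy data: TOTAL is the sum of the other rows (hypothetical values)
toy = pd.DataFrame({
    'region': ['TOTAL', 'NORTH', 'SOUTH'],
    'value': [30, 10, 20],
})

exclude = ['TOTAL']

# ~ flips the mask: True where the region is NOT in the list
kept = toy[~toy['region'].isin(exclude)]

print(kept['region'].tolist())  # ['NORTH', 'SOUTH']
print(toy['value'].sum())       # 60 (aggregate counted twice)
print(kept['value'].sum())      # 30
```

When combining the negated mask with other conditions, wrap it in parentheses like any other condition: `(~df['col'].isin([...])) & (...)`.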

In [None]:
# YOUR CODE HERE


---

## Part 3: Convenience Methods (20 minutes)

Practice .isin(), .between(), and .str.contains() for common filtering patterns.

### Task 3.1: .between() for range
Find all rows where the primary school dropout rate (`desercion_primaria`) is between 1% and 3% (inclusive).

**Pattern:** `df[df['col'].between(low, high)]`
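
By default `.between()` includes both endpoints, so it matches the `>=` and `<=` form. A small sketch on toy data (hypothetical `rate` values) that checks the boundary rows:

```python
import pandas as pd

# Toy data with values on and just outside the boundaries (hypothetical)
toy = pd.DataFrame({'rate': [0.5, 1.0, 2.2, 3.0, 3.1]})

in_range = toy[toy['rate'].between(1, 3)]
print(in_range['rate'].tolist())  # [1.0, 2.2, 3.0]

# Equivalent boolean indexing with explicit comparisons
manual = toy[(toy['rate'] >= 1) & (toy['rate'] <= 3)]
print(in_range.equals(manual))    # True
```

Recent pandas versions also accept an `inclusive` argument (for example `inclusive='neither'`) to exclude the endpoints. Check the documentation for the version you have installed.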

In [None]:
# YOUR CODE HERE


### Task 3.2: .between() with year range
Get data only for the years 2018 through 2023.

Print the unique years in your result to verify.

In [None]:
# YOUR CODE HERE


### Task 3.3: .str.contains() for text search
Find all departments whose name contains the word "NORTE".

**Question:** Which department names match? Print the unique department names.

**Remember:** Always add `na=False` to handle potential NaN values.
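
The sketch below illustrates what `na=False` does, using a made-up Series of names with one missing entry. How the mask looks without `na=False` can depend on the pandas version and the column dtype, so the example prints it without assuming a result.

```python
import pandas as pd
import numpy as np

# Toy names with one missing value (hypothetical data)
names = pd.Series(['NORTH HILLS', 'SOUTH BAY', np.nan, 'GREAT NORTH'])

# Without na=False, the missing entry may stay missing in the mask
raw_mask = names.str.contains('NORTH')
print(raw_mask.tolist())

# With na=False, the missing entry counts as "no match"
safe_mask = names.str.contains('NORTH', na=False)
print(safe_mask.tolist())         # [True, False, False, True]
print(names[safe_mask].tolist())  # ['NORTH HILLS', 'GREAT NORTH']
```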

In [None]:
# YOUR CODE HERE


### Task 3.4: .str.startswith()
Find all departments whose name starts with "SAN".

**Pattern:** `df[df['col'].str.startswith('text', na=False)]`
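
`.str.startswith()` matches a prefix, not a whole word, and it differs from `.str.contains()` in where the text may appear. A short sketch on made-up names (not taken from the dataset):

```python
import pandas as pd

# Toy names, including a missing value (hypothetical data)
names = pd.Series(['SAN JUAN', 'PUERTO SAN LUIS', 'SANTA ANA', None])

# Prefix match: the text must be at the very start
starts = names[names.str.startswith('SAN', na=False)]
print(starts.tolist())    # ['SAN JUAN', 'SANTA ANA']

# Substring match: the text may appear anywhere
contains = names[names.str.contains('SAN', na=False)]
print(contains.tolist())  # ['SAN JUAN', 'PUERTO SAN LUIS', 'SANTA ANA']

# A trailing space narrows the prefix to the separate word "SAN"
word = names[names.str.startswith('SAN ', na=False)]
print(word.tolist())      # ['SAN JUAN']
```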

In [None]:
# YOUR CODE HERE


### Task 3.5: .query() method
Rewrite this boolean indexing filter using `.query()`:

```python
df[(df['ano'] == 2023) & (df['desercion'] > 5) & (df['cobertura_neta'] < 80)]
```

Verify both give the same number of rows.
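
For reference, here is a sketch of the general `.query()` syntax on toy data with different column names (`year`, `rate`, and `threshold` are hypothetical). Inside the query string, column names are written without `df[...]`, `and`/`or`/`not` are allowed, and `@name` refers to a Python variable.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'year': [2022, 2023, 2023],
    'rate': [6.0, 7.5, 2.0],
})

# Boolean indexing and .query() express the same filter
by_mask = toy[(toy['year'] == 2023) & (toy['rate'] > 5)]
by_query = toy.query('year == 2023 and rate > 5')
print(by_mask.equals(by_query))  # True

# @ pulls in a variable defined outside the query string
threshold = 5
by_var = toy.query('year == 2023 and rate > @threshold')
print(len(by_var))               # 1
```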

In [None]:
# YOUR CODE HERE
# Boolean indexing version (given):
result_bool = df[(df['ano'] == 2023) & (df['desercion'] > 5) & (df['cobertura_neta'] < 80)]

# .query() version (your code):
# result_query = ...

# Verify:
# print(f"Boolean indexing: {len(result_bool)} rows")
# print(f".query(): {len(result_query)} rows")


---

## Part 4: Analytical Questions (30 minutes)

Now use your filtering skills to answer real questions about education in Colombia. For each question:
1. Write the filter
2. Display relevant columns
3. Write a 1-2 sentence interpretation of the result

### Question 4.1: Regional comparison
Compare Antioquia vs. Choco in 2023: which has better net coverage and lower dropout?

Filter for both departments in 2023 and show: departamento, cobertura_neta, desercion, aprobacion.

In [None]:
# YOUR CODE HERE


**Your interpretation:** *(Write 1-2 sentences about what you observe)*


### Question 4.2: High dropout departments
Which departments had a secondary school dropout rate (`desercion_secundaria`) above 8% in the most recent 3 years (2022, 2023, 2024)?

Show unique department names and their dropout rates.

In [None]:
# YOUR CODE HERE


**Your interpretation:** *(Write 1-2 sentences)*


### Question 4.3: Coverage extremes
Find departments where gross coverage (`cobertura_bruta`) exceeds 110% in any year.

**Context:** Gross coverage can exceed 100% because it counts every enrolled student, including those outside the expected age range (mostly over-age students), against the school-age population. Values well above 100% are a sign of age-grade distortion.
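
A worked example with made-up numbers shows how this happens. It assumes the usual definitions (net coverage counts only enrolled students within the official age range, gross coverage counts all enrolled students, and both divide by the school-age population). Check the dataset documentation for the exact definitions used here.

```python
# Hypothetical numbers for one department and year
population_5_16 = 1000        # children in the official school-age range
enrolled_in_range = 850       # enrolled students aged 5 to 16
enrolled_out_of_range = 300   # enrolled students outside that age range

# Net coverage: only in-range students in the numerator
net = 100 * enrolled_in_range / population_5_16

# Gross coverage: all enrolled students in the numerator
gross = 100 * (enrolled_in_range + enrolled_out_of_range) / population_5_16

print(net)    # 85.0
print(gross)  # 115.0
```

Under these assumptions net coverage cannot exceed 100%, while gross coverage rises with every out-of-range student.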

Which departments and years show this pattern?

In [None]:
# YOUR CODE HERE


**Your interpretation:** *(Write 1-2 sentences)*


### Question 4.4: Your own question
Design your own analytical question about the education data. Write the question, the filter, and your interpretation.

Ideas:
- How have failure rates (`reprobacion`) changed over time for a specific department?
- Which departments have both high repetition (`repitencia`) and high dropout?
- How does the Eje Cafetero (Caldas, Quindio, Risaralda) compare to the Caribbean coast?

**Your question:** *(Write it here)*


In [None]:
# YOUR CODE HERE


**Your interpretation:** *(Write 1-2 sentences)*


---

## Part 5: Reflection

Answer these questions about your experience with data filtering.

### Reflection 1
Which filtering method did you find most useful: boolean indexing, .isin(), .between(), .str.contains(), or .query()? Why?

**Your answer:**


### Reflection 2
What questions do you want to answer about YOUR project dataset using filtering? List at least 3 questions.

**Your answer:**
1. 
2. 
3. 


### Reflection 3
Why is it important to clean the data (Week 3) BEFORE filtering it (Week 4)? What problems could arise if you filter dirty data?

**Your answer:**


---

## Summary

In this workshop you practiced:

| Skill | Methods |
|-------|--------|
| Single conditions | `==`, `!=`, `>`, `<`, `>=`, `<=` |
| Combining conditions | `&` (AND), `\|` (OR), `~` (NOT) |
| Match a list | `.isin([list])` |
| Numeric range | `.between(low, high)` |
| Text pattern | `.str.contains()`, `.str.startswith()` |
| Clean syntax | `.query('expression')` |

These skills are essential for Milestone 1: you need to filter your project dataset to focus on specific subsets of data for analysis.

**Next week:** Exploratory Data Analysis (EDA) - using GroupBy and statistics to discover patterns.

---

*Week 4 - Data Analytics Course - Universidad Cooperativa de Colombia*