# Week 3 Workshop: Data Cleaning Practice

**Duration:** ~2 hours (independent work)  
**Dataset:** Education Statistics from the Colombian Ministry of Education (datos.gov.co)  
**Rows:** 482 (dirty) | **Columns:** 37  

---

### What is this?

This is your **independent practice** notebook. Unlike the in-class exercise, where the code was provided for you to run,
here **you write all the code yourself**. Each section tells you what to do, hints at which pandas
methods to use, and describes what your output should look like. The actual code is yours to write.

### How to work through this notebook

1. **Read** the markdown cell explaining the task
2. **Write** your code in the code cell (follow the comments for structure)
3. **Run** your code and compare with the expected output described
4. **Document** your decisions in the reflection cells (these matter for grading)

### Grading

You are graded on **two things equally**:
- Correct, working code that cleans the dataset
- Thoughtful documentation of your decisions (the markdown cells asking for your reasoning)

### The 5 data quality issues you must fix

| # | Issue | Columns affected |
|---|-------|------------------|
| 1 | Missing values (NaN) | Multiple rate columns, `departamento`, `tamano_promedio_grupo`, `sedes_conectadas_a_internet` |
| 2 | Wrong data types | `ano` (float instead of int), `poblacion_5_16` (text instead of number) |
| 3 | Duplicate rows | ~20 exact duplicates |
| 4 | Text inconsistencies | `departamento` has 80+ unique values instead of ~34 |
| 5 | Invalid values | Negative percentages and percentages over 100 |

---

## Setup

Run this cell as-is. It imports the libraries, loads the dataset, and stores a copy of the original
so you can compare before and after at the end.

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('../data/educacion_estadisticas.csv')

# Store original for comparison at the end
df_original = df.copy()

print(f"Dataset loaded: {df.shape[0]} rows x {df.shape[1]} columns")
print(f"\nColumns: {df.columns.tolist()}")

---

## Part 1: Initial Inspection

Before you clean anything, you need to **diagnose** the problems. This is the inspection ritual
you should run at the start of every data project:

1. `df.shape` -- how big is the dataset?
2. `df.head()` -- what does the actual data look like?
3. `df.dtypes` -- are columns the right type?
4. `df.isnull().sum()` -- where are the gaps?
5. `df.describe()` -- are the numbers reasonable?
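
As a rough illustration of what each command reveals, here is the same ritual on a tiny made-up DataFrame. The `toy` frame and its values are invented for this sketch and are not part of the workshop dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "ano": [2021.0, 2022.0, np.nan],
    "departamento": ["Antioquia", "ANTIOQUIA", None],
    "desercion": [3.2, -1.0, 4.1],
})

print(toy.shape)           # (rows, columns)
print(toy.head())          # what the data looks like
print(toy.dtypes)          # ano is float64 because it contains NaN
print(toy.isnull().sum())  # NaN count per column
print(toy.describe())      # the min of desercion is negative: suspicious
```

Each command surfaces a different kind of problem, which is why it helps to run all five before changing anything.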

Run all five in the cell below. The first line is done for you. Complete the rest.

In [None]:
# 1. Shape
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print("=" * 50)

# 2. First rows -- display the first 5 rows
# YOUR CODE

print("=" * 50)

In [None]:
# 3. Data types -- display all column types
# Look for: ano should be int but is not. poblacion_5_16 should be numeric but is not.
# YOUR CODE


In [None]:
# 4. Missing values -- count NaN per column, sorted from most to fewest
# Hint: df.isnull().sum().sort_values(ascending=False)
# Only show columns that have at least 1 missing value
# YOUR CODE


In [None]:
# 5. Statistical summary -- run describe() and look at min/max values
# Are there any negative percentages? Any values over 100 that shouldn't be?
# YOUR CODE


### Documentation: Initial Inspection

Based on the 5 commands you just ran, list the data quality issues you found.
For each one, note which column(s) are affected, what the problem is, and which command revealed it.

| # | Issue | Column(s) | How you spotted it |
|---|-------|-----------|--------------------|
| 1 | *YOUR ANSWER* | *YOUR ANSWER* | *YOUR ANSWER* |
| 2 | *YOUR ANSWER* | *YOUR ANSWER* | *YOUR ANSWER* |
| 3 | *YOUR ANSWER* | *YOUR ANSWER* | *YOUR ANSWER* |
| 4 | *YOUR ANSWER* | *YOUR ANSWER* | *YOUR ANSWER* |
| 5 | *YOUR ANSWER* | *YOUR ANSWER* | *YOUR ANSWER* |

---


## Part 2: Missing Values

Missing values (NaN) can break calculations or quietly distort results. But not all missing values
should be handled the same way. The strategy depends on **what the column represents**
and **how much data is missing**.

Use this decision framework:

| Missing % | Recommended action | Reasoning |
|-----------|-------------------|-----------|
| > 50% | Consider dropping the column, or fill with 0 if "not reported" makes sense | More gaps than data |
| < 5% | Drop the rows | Losing very few rows |
| 5-50% | Fill with an appropriate value (median for rates, 0 for counts) | Too many rows to lose |

You need to handle 3 groups of missing values:

1. **Count columns** (`sedes_conectadas_a_internet`, `tamano_promedio_grupo`): ~50% missing. These columns only have data through 2017. Fill with 0 ("not reported").
2. **`departamento`**: ~3% missing. This is the key identifier. Drop those rows (you cannot guess which department a row belongs to).
3. **Rate columns** (dropout, coverage, approval, etc.): ~8-11% missing. Fill with the median (the middle value, resistant to outliers).

**Why median instead of mean for rates?** The mean is pulled by extreme outliers. The median is the
middle value and better represents "typical." For education rates, median is the safer assumption.

**Why NOT fill rates with 0?** A 0% dropout rate means "nobody dropped out" (a strong claim).
That is very different from "we don't know."
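
To make the difference concrete, here is a small sketch with made-up numbers (the `rates` series is hypothetical, not taken from the dataset). One extreme value drags the mean far from the typical rate, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical rates: 50.0 is an outlier, the last value is missing
rates = pd.Series([2.0, 3.0, 4.0, 50.0, np.nan])

mean_val = rates.mean()      # 14.75, pulled up by the outlier
median_val = rates.median()  # 3.5, close to the typical values

filled_median = rates.fillna(median_val)
filled_zero = rates.fillna(0)

print(f"mean={mean_val}, median={median_val}")
print(f"mean after filling with 0: {filled_zero.mean():.2f}")  # 11.80
```

Filling with 0 also shifts the column average, because the zero is counted as a real observation.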

### Step 2.1: Calculate missing percentages

Calculate the percentage of missing values per column. Show only columns with at least 1 missing value,
sorted from highest to lowest.

**Hint:** `(df.isnull().sum() / len(df) * 100).round(2)`

**Expected output:** You should see `sedes_conectadas_a_internet` and `tamano_promedio_grupo` around 50%,
`departamento` around 2-3%, and several rate columns around 8-11%.
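
The filter-and-sort pattern might look like this on a toy frame. The column names `a`, `b`, and `c` are placeholders invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: a is 50% missing, b is 25% missing, c is complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, np.nan],
    "c": [1, 2, 3, 4],
})

missing_pct = (toy.isnull().sum() / len(toy) * 100).round(2)

# Keep only columns with at least one NaN, highest percentage first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```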

In [None]:
# Calculate percentage of missing values per column
missing_pct = # YOUR CODE: (df.isnull().sum() / len(df) * 100).round(2)

# Show only columns with missing values, sorted descending
# YOUR CODE: filter to only > 0, sort descending, print


### Step 2.2: Fill count-based columns with 0

The columns `sedes_conectadas_a_internet` (% schools with internet) and `tamano_promedio_grupo`
(average class size) are only available through 2017. After that, they were not reported.
Fill them with 0 to indicate "no data available."

**Hint:** `df['column'] = df['column'].fillna(0)`

**Expected output:** Both columns should go from ~240 NaN to 0 NaN.
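
As a sketch on invented data, `fillna(0)` can also be applied to several columns in one assignment. The column names `reported_a` and `reported_b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: both columns stop being reported in later years
toy = pd.DataFrame({
    "year": [2016, 2017, 2018],
    "reported_a": [10.0, 12.0, np.nan],
    "reported_b": [30.0, np.nan, np.nan],
})

cols = ["reported_a", "reported_b"]
print("Before:", toy[cols].isnull().sum().to_dict())

toy[cols] = toy[cols].fillna(0)  # fill only the selected columns
print("After: ", toy[cols].isnull().sum().to_dict())
```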

In [None]:
# Print NaN count before
print("Before:")
print(f"  sedes_conectadas_a_internet NaN: {df['sedes_conectadas_a_internet'].isnull().sum()}")
print(f"  tamano_promedio_grupo NaN:       {df['tamano_promedio_grupo'].isnull().sum()}")

# YOUR CODE: fill both columns with 0


# Print NaN count after
print("\nAfter:")
print(f"  sedes_conectadas_a_internet NaN: {df['sedes_conectadas_a_internet'].isnull().sum()}")
print(f"  tamano_promedio_grupo NaN:       {df['tamano_promedio_grupo'].isnull().sum()}")

### Step 2.3: Drop rows with missing `departamento`

The `departamento` column is the key identifier. A row without a department is like a letter
without an address: we cannot use it. Drop those rows.

**Hint:** `df = df.dropna(subset=['departamento'])`

**Expected output:** You should remove approximately 10-15 rows.
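
A minimal sketch of how `subset` limits which columns are checked, using made-up rows. Note that the row with a missing `desercion` survives, because only `departamento` is inspected.

```python
import numpy as np
import pandas as pd

# Toy rows: one has no departamento, another has no desercion
toy = pd.DataFrame({
    "departamento": ["Antioquia", None, "Cauca"],
    "desercion": [3.1, 2.5, np.nan],
})

cleaned = toy.dropna(subset=["departamento"])
print(cleaned)
```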

In [None]:
rows_before = len(df)

# YOUR CODE: drop rows where departamento is NaN


rows_after = len(df)
print(f"Rows before: {rows_before}")
print(f"Rows after:  {rows_after}")
print(f"Removed {rows_before - rows_after} rows with missing departamento")

### Step 2.4: Fill rate columns with the median

Rate columns represent percentages (dropout rate, coverage, approval, etc.). For these,
filling with the median is the safest strategy: "if we don't know, assume a typical value."

The list of rate columns is provided below. Loop through each one, check if it has
missing values, and fill them with that column's median.

**Hint:** For each column, use `df[col].fillna(df[col].median())` to fill NaN with the median.

**Expected output:** Each column should print how many NaN were filled and what median was used.
After the loop, all rate columns should have 0 NaN.
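
The loop structure can be sketched on a toy frame as follows. The columns `rate_a` and `rate_b` are invented for the example; your version should iterate over `rate_columns`.

```python
import numpy as np
import pandas as pd

# Toy frame: rate_a has one gap, rate_b is complete
toy = pd.DataFrame({
    "rate_a": [10.0, np.nan, 30.0, 20.0],
    "rate_b": [1.0, 2.0, 3.0, 4.0],
})

for col in ["rate_a", "rate_b"]:
    n_missing = toy[col].isnull().sum()
    if n_missing > 0:
        median_val = toy[col].median()  # median() ignores NaN
        toy[col] = toy[col].fillna(median_val)
        print(f"{col}: filled {n_missing} NaN with median {median_val:.2f}")
```

Computing the median before filling matters: once the gaps are filled, the filled values would take part in any later median.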

In [None]:
rate_columns = [
    'tasa_matriculacion_5_16',
    'cobertura_neta', 'cobertura_neta_transicion', 'cobertura_neta_primaria',
    'cobertura_neta_secundaria', 'cobertura_neta_media',
    'cobertura_bruta', 'cobertura_bruta_transicion', 'cobertura_bruta_primaria',
    'cobertura_bruta_secundaria', 'cobertura_bruta_media',
    'desercion', 'desercion_transicion', 'desercion_primaria',
    'desercion_secundaria', 'desercion_media',
    'aprobacion', 'aprobacion_transicion', 'aprobacion_primaria',
    'aprobacion_secundaria', 'aprobacion_media',
    'reprobacion', 'reprobacion_transicion', 'reprobacion_primaria',
    'reprobacion_secundaria', 'reprobacion_media',
    'repitencia', 'repitencia_transicion', 'repitencia_primaria',
    'repitencia_secundaria', 'repitencia_media',
]

# YOUR CODE: loop through rate_columns
#   For each column:
#     1. Count how many NaN it has
#     2. If it has any NaN, calculate the median
#     3. Fill NaN with the median
#     4. Print: "{column}: filled {n} NaN with median {value:.2f}"



# Verify: total NaN remaining in rate columns
print(f"\nTotal NaN remaining in rate columns: {df[rate_columns].isnull().sum().sum()}")

### Step 2.5: Verify missing values

Check how many total NaN remain in the entire dataset.

**Expected output:** The only column with remaining NaN should be `poblacion_5_16` (we fix
that in Part 3, because it has both missing values AND type problems).

In [None]:
# Pre-filled: show remaining NaN per column (only columns with > 0)
remaining_nan = df.isnull().sum()
remaining_nan = remaining_nan[remaining_nan > 0]

if len(remaining_nan) == 0:
    print("No missing values remain!")
else:
    print(f"Columns still with NaN ({len(remaining_nan)}):")
    print(remaining_nan)

### Documentation: Missing Values

Answer these questions:

**1. Which fill strategy did you use for each type of column?**

| Column type | Strategy | Why |
|-------------|----------|-----|
| Count columns (sedes, tamano) | *YOUR ANSWER* | *YOUR ANSWER* |
| departamento | *YOUR ANSWER* | *YOUR ANSWER* |
| Rate columns | *YOUR ANSWER* | *YOUR ANSWER* |

**2. Why did you choose median over mean for rate columns?**

*YOUR ANSWER*

**3. How many missing values did `departamento` have? Why is dropping rows acceptable here but not for rate columns?**

*YOUR ANSWER*

---


## Part 3: Data Type Issues

Even if a column contains numbers, pandas might store it as text (`object`) if even one value
has non-numeric characters. And a column of whole numbers will be stored as `float64` (with
decimal points) if it contains any NaN values, because NaN is technically a float.

You need to fix two columns:

1. **`ano`** (year): Currently `float64` (shows as 2023.0). Should be `int64` (2023).
2. **`poblacion_5_16`** (population aged 5-16): Currently `object` (text) because some values have commas like "394,574" and some say "sin dato". Should be `int64`.

**Key methods:**
- `pd.to_numeric(series, errors='coerce')` -- converts to number, turns unparseable values into NaN
- `.str.replace(',', '')` -- removes commas from strings
- `.fillna(0).astype(int)` -- fills NaN then converts to integer

**Why `errors='coerce'`?** Without it, `pd.to_numeric()` raises a `ValueError` when it hits a
non-numeric value like "sin dato". With `errors='coerce'`, it quietly turns those into NaN instead.
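
A short sketch of both behaviors on a made-up series (the `raw` values are invented for illustration):

```python
import pandas as pd

raw = pd.Series(["10", "20", "sin dato"])

# Default behavior: the unparseable value raises an error
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as exc:
    raised = True
    print(f"Without errors='coerce': {type(exc).__name__}")

# With errors='coerce': the unparseable value becomes NaN
converted = pd.to_numeric(raw, errors="coerce")
print(converted)
```

The trade-off is that `errors='coerce'` hides which values failed, so it is worth counting the new NaN values afterwards.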

### Step 3.1: Fix `ano` (year)

Convert `ano` from float to integer. The process is:
1. Use `pd.to_numeric(errors='coerce')` to handle any non-numeric values
2. Fill remaining NaN with 0
3. Convert to int with `.astype(int)`

**Expected output:** `ano` dtype changes from `float64` to `int64`. Sample values like 2023.0 become 2023.
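
The three steps can be chained, as in this sketch on a hypothetical series of years:

```python
import numpy as np
import pandas as pd

# Toy years stored as floats because of the NaN
ano = pd.Series([2021.0, 2022.0, np.nan])

ano_int = pd.to_numeric(ano, errors="coerce").fillna(0).astype(int)
print(ano_int.tolist())  # [2021, 2022, 0]
```

Any row that was NaN now carries the placeholder year 0, so it is worth checking how many rows that affects.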

In [None]:
print(f"Before: ano dtype = {df['ano'].dtype}")
print(f"Sample: {df['ano'].head(5).tolist()}")

# YOUR CODE: convert ano to int
#   1. pd.to_numeric() with errors='coerce'
#   2. fillna(0)
#   3. astype(int)


print(f"\nAfter: ano dtype = {df['ano'].dtype}")
print(f"Sample: {df['ano'].head(5).tolist()}")

### Step 3.2: Fix `poblacion_5_16` (population)

This column is trickier. It is stored as text (`object`) because:
- Some values have commas: "394,574"
- Some values say "sin dato" ("no data" in Spanish)

The process is:
1. Convert to string with `.astype(str)` (ensures all values are strings for `.str.replace()`)
2. Remove commas with `.str.replace(',', '')`
3. Convert to numeric with `pd.to_numeric(errors='coerce')` ("sin dato" becomes NaN)
4. Fill NaN with 0 and convert to int

**Expected output:** `poblacion_5_16` dtype changes from `object` to `int64`. Values like "394,574" become 394574.
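
Here is the same pipeline on a few made-up values, including one comma-formatted number, one "sin dato", and one missing entry. Passing `regex=False` is an optional choice in this sketch so the comma is treated as a literal character.

```python
import pandas as pd

# Toy values mimicking the problems described above
raw = pd.Series(["394,574", "12,000", "sin dato", None])

cleaned = raw.astype(str).str.replace(",", "", regex=False)
numeric = pd.to_numeric(cleaned, errors="coerce")  # text becomes NaN
result = numeric.fillna(0).astype(int)

print(result.tolist())  # [394574, 12000, 0, 0]
```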

In [None]:
print(f"Before: poblacion_5_16 dtype = {df['poblacion_5_16'].dtype}")
print(f"Sample: {df['poblacion_5_16'].head(5).tolist()}")

# YOUR CODE: convert poblacion_5_16 to int
#   1. .astype(str) to make sure everything is a string
#   2. .str.replace(',', '') to remove commas
#   3. pd.to_numeric(errors='coerce') to convert ("sin dato" becomes NaN)
#   4. .fillna(0).astype(int)


print(f"\nAfter: poblacion_5_16 dtype = {df['poblacion_5_16'].dtype}")
print(f"Sample: {df['poblacion_5_16'].head(5).tolist()}")

### Step 3.3: Verify type fixes

Print the dtypes of `ano` and `poblacion_5_16` to confirm they are both `int64`.

**Expected output:** Both should be `int64`.

In [None]:
# Pre-filled: print dtype for both columns and verify they are int64
print(f"ano:             {df['ano'].dtype}")
print(f"poblacion_5_16:  {df['poblacion_5_16'].dtype}")

# Also check: any remaining NaN in these columns?
print(f"\nano NaN:             {df['ano'].isnull().sum()}")
print(f"poblacion_5_16 NaN:  {df['poblacion_5_16'].isnull().sum()}")

### Documentation: Data Types

Answer these questions:

**1. What would happen if you tried `df['ano'].astype(int)` directly on a column that still has NaN values?**

*YOUR ANSWER*

**2. Why do we need `errors='coerce'` in `pd.to_numeric()`? What would happen without it when the column contains "sin dato"?**

*YOUR ANSWER*

**3. Why do we convert to string first (`.astype(str)`) before using `.str.replace()`?**

*YOUR ANSWER*

---


## Part 4: Duplicate Rows

Duplicate rows inflate counts and distort averages. If a department appears twice for the
same year with identical data, every calculation using that data is biased. The dataset
should have ~462 rows (one per department per year), but it has more because of duplicates.

**Key methods:**
- `df.duplicated().sum()` -- count how many duplicate rows exist
- `df[df.duplicated(keep=False)]` -- show ALL copies (both "original" and "duplicate")
- `df.drop_duplicates()` -- keep the first occurrence, remove the rest

**Expected result:** You should find approximately 20 duplicate rows. After removal, expect roughly 462 rows, or slightly fewer once the rows dropped in Step 2.3 are accounted for.
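
The three methods behave like this on a toy frame with one exact duplicate (the rows are invented for the example):

```python
import pandas as pd

# Rows 0 and 1 are exact copies of each other
toy = pd.DataFrame({
    "ano": [2021, 2021, 2022, 2021],
    "departamento": ["CAUCA", "CAUCA", "CAUCA", "HUILA"],
    "desercion": [3.0, 3.0, 2.5, 4.0],
})

n_dupes = int(toy.duplicated().sum())         # 1: only the second copy is flagged
all_copies = toy[toy.duplicated(keep=False)]  # 2 rows: original and copy
deduped = toy.drop_duplicates()               # 3 rows: first occurrence kept

print(n_dupes, len(all_copies), len(deduped))
```

By default `duplicated()` flags only the later copies, which is why `keep=False` is needed to see the originals as well.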

### Step 4.1: Count duplicates

How many exact duplicate rows are in the dataset?

In [None]:
# YOUR CODE: count duplicate rows
n_dupes = # YOUR CODE
print(f"Duplicate rows: {n_dupes}")

### Step 4.2: Examine the duplicates

Look at the actual duplicate rows to understand what they are. Use `keep=False` to see
both the original and the copy side by side. Sort by `departamento` and `ano` to group them.

Show only the key columns: `ano`, `departamento`, `poblacion_5_16`, `desercion`.

In [None]:
# YOUR CODE: show all duplicate rows, sorted by departamento and ano
# Display columns: ano, departamento, poblacion_5_16, desercion
# Hint: df[df.duplicated(keep=False)].sort_values([...])[[...]]



### Step 4.3: Remove duplicates

Remove the duplicate rows, keeping the first occurrence of each.

In [None]:
rows_before = len(df)

# YOUR CODE: remove duplicates


rows_after = len(df)
print(f"Rows before: {rows_before}")
print(f"Rows after:  {rows_after}")
print(f"Removed {rows_before - rows_after} duplicate rows")
print(f"Duplicates remaining: {df.duplicated().sum()}")

### Documentation: Duplicates

Answer these questions:

**1. How many duplicate rows did you find? Are they exact copies (all columns identical)?**

*YOUR ANSWER*

**2. When might a duplicate row be valid and NOT an error? Give one real-world example.**

*YOUR ANSWER*

---


## Part 5: Text Inconsistencies

Colombia has 32 departments plus Bogotá D.C. Together with a national aggregate row, that gives
about 34 expected values. But our `departamento` column has far more unique values because the
same department appears in different forms:

- "Antioquia", "ANTIOQUIA", "  Antioquia  ", "antioquia"
- Leading/trailing spaces
- Mixed upper/lower case

To pandas, each of these is a completely different string. This breaks grouping: if you
try `df.groupby('departamento')`, you get 80+ groups instead of 34.

**Key methods:**
- `.str.upper()` -- convert to uppercase
- `.str.strip()` -- remove leading/trailing whitespace
- `.nunique()` -- count unique values

**IMPORTANT:** After standardizing text, previously different rows ("Antioquia" and "ANTIOQUIA"
for the same year) become identical. You must check for NEW duplicates after this step.

**Expected result:** Unique departments should drop from ~80+ to ~34.
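
A small sketch of both effects, using made-up rows: the number of unique names drops, and rows that differed only in spelling become exact duplicates.

```python
import pandas as pd

# Toy rows: four spellings of the same department
toy = pd.DataFrame({
    "ano": [2021, 2021, 2021, 2022],
    "departamento": ["Antioquia", "ANTIOQUIA", "  Antioquia  ", "antioquia"],
    "desercion": [3.0, 3.0, 3.0, 2.8],
})

unique_before = toy["departamento"].nunique()  # 4
dupes_before = int(toy.duplicated().sum())     # 0

toy["departamento"] = toy["departamento"].str.upper().str.strip()

unique_after = toy["departamento"].nunique()   # 1
dupes_after = int(toy.duplicated().sum())      # 2

toy = toy.drop_duplicates()
print(unique_before, unique_after, dupes_before, dupes_after, len(toy))
```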

### Step 5.1: Check current state

How many unique department values exist before standardization? Print the count and the
sorted list of all unique values.

In [None]:
print(f"Unique departments before: {df['departamento'].nunique()}")
print("(Target after cleaning: ~34)")
print()

# YOUR CODE: print sorted list of all unique department values
# Hint: sorted(df['departamento'].unique())


### Step 5.2: Standardize text

Apply `.str.upper().str.strip()` to the `departamento` column.

In [None]:
# YOUR CODE: standardize departamento to uppercase and stripped


print(f"Unique departments after: {df['departamento'].nunique()}")

### Step 5.3: Check for NEW duplicates

Cleaning text can reveal new problems. Rows that looked different before ("Antioquia" vs
"ANTIOQUIA" for the same year) are now identical. Check for and remove any new duplicates.

In [None]:
# YOUR CODE: check for new duplicates
new_dupes = # YOUR CODE
print(f"New duplicates after text fix: {new_dupes}")

# YOUR CODE: if there are new duplicates, remove them


print(f"Final row count: {len(df)}")
print(f"Duplicates remaining: {df.duplicated().sum()}")

### Step 5.4: Verify final department list

Print the sorted list of unique departments. It should look like a clean list of
Colombian departments, all uppercase, no duplicates.

In [None]:
# Count of unique departments after cleaning
print(f"Unique departments: {df['departamento'].nunique()}")
print()

# YOUR CODE: print the sorted list


### Documentation: Text Inconsistencies

Answer these questions:

**1. How many unique department values did you start with, and how many after cleaning?**

*YOUR ANSWER*

**2. Did text standardization create any new duplicates? How many?**

*YOUR ANSWER*

**3. What other text cleaning might still be needed? (Think about accents: Narino vs. Nariño)**

*YOUR ANSWER*

---


## Part 6: Invalid Values

This is the trickiest issue because the values are not missing, not the wrong type, and
not duplicates. They **exist** and **look like numbers**, but they are **logically impossible**.

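Percentage columns like dropout rate, net coverage, and approval must be between 0 and 100:
- A dropout rate of **-5%** is impossible (you cannot have negative dropouts)
- A net coverage of **150%** is impossible (net coverage caps at 100%)

Gross coverage (`cobertura_bruta`) is different: it counts enrolled students of any age, so it can
legitimately exceed 100%. Note that `percentage_cols` below does not include it.

These errors slip past the type, missing-value, and duplicate checks from the earlier parts. Only
**domain knowledge** (knowing what valid education statistics look like) can catch them.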

**Strategy:** Replace invalid values with NaN, then fill with the column median (same
approach we used for missing values).

**Key methods:**
- `df[cols].describe().loc[['min', 'max']]` -- check ranges quickly
- `df[col] < 0` -- boolean mask for negative values
- `df[col] > 100` -- boolean mask for values over 100
- `df.loc[mask, col] = np.nan` -- replace matching values with NaN

**Expected result:** Approximately 8 negative values and 6 values over 100.
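
The mask, replace, and fill sequence can be sketched on a toy column like this. The values are invented and the frame is named `rates` only for the example.

```python
import numpy as np
import pandas as pd

# Toy column with one negative value and one value over 100
rates = pd.DataFrame({"desercion": [3.0, -5.0, 4.0, 150.0, 5.0]})

mask = (rates["desercion"] < 0) | (rates["desercion"] > 100)
rates.loc[mask, "desercion"] = np.nan       # invalid values become NaN

median_val = rates["desercion"].median()    # median of 3, 4, 5 is 4.0
rates["desercion"] = rates["desercion"].fillna(median_val)

print(rates["desercion"].tolist())  # [3.0, 4.0, 4.0, 4.0, 5.0]
```

Replacing with NaN first keeps the invalid values out of the median, so the fill value is based only on valid data.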

### Step 6.1: Check min/max of percentage columns

Use `describe()` on the percentage columns and look at the `min` and `max` rows.
Any `min` below 0 or `max` above 100 means there are invalid values.

In [None]:
percentage_cols = [
    'desercion', 'desercion_primaria', 'desercion_secundaria', 'desercion_media',
    'cobertura_neta', 'cobertura_neta_primaria', 'cobertura_neta_secundaria', 'cobertura_neta_media',
    'aprobacion', 'reprobacion',
]

# YOUR CODE: show describe() for only the min and max rows
# Hint: df[percentage_cols].describe().loc[['min', 'max']]


### Step 6.2: Count the invalid values

Count exactly how many values are below 0 and above 100 in each column.

In [None]:
# YOUR CODE: count negatives (values < 0) per column
# Hint: df[percentage_cols].lt(0).sum()
negatives = # YOUR CODE
print("Negative values (< 0) per column:")
print(negatives[negatives > 0])
print(f"Total negatives: {negatives.sum()}")

print()

# YOUR CODE: count values > 100 per column
# Hint: df[percentage_cols].gt(100).sum()
over_100 = # YOUR CODE
print("Values over 100 per column:")
print(over_100[over_100 > 0])
print(f"Total over 100: {over_100.sum()}")

### Step 6.3: Fix invalid values

For each percentage column:
1. Create a boolean mask: `(df[col] < 0) | (df[col] > 100)`
2. Replace those values with NaN: `df.loc[mask, col] = np.nan`
3. Fill NaN with the column median

Loop through `percentage_cols` and apply this fix.

In [None]:
total_fixed = 0

for col in percentage_cols:
    # YOUR CODE: create mask for invalid values (< 0 or > 100)
    invalid_mask = # YOUR CODE
    n_invalid = invalid_mask.sum()
    
    if n_invalid > 0:
        # YOUR CODE: replace invalid values with NaN

        # Median of the remaining valid values (median() skips NaN)
        median_val = df[col].median()

        # YOUR CODE: fill NaN with median_val

        print(f"{col}: fixed {n_invalid} invalid values (replaced with median {median_val:.2f})")
        total_fixed += n_invalid

print(f"\nTotal invalid values fixed: {total_fixed}")

### Step 6.4: Verify the fix

Confirm that all percentage columns now have values between 0 and 100.

In [None]:
# YOUR CODE: show min and max for percentage_cols after fix
# Verify: no min < 0 and no max > 100
print("After fix -- Min and Max:")
# YOUR CODE


# Count remaining invalid values
remaining_invalid = (df[percentage_cols].lt(0).sum().sum() + df[percentage_cols].gt(100).sum().sum())
print(f"\nInvalid values remaining: {remaining_invalid}")

### Documentation: Invalid Values

Answer these questions:

**1. How many negative values and how many values over 100 did you find?**

*YOUR ANSWER*

**2. Why is domain knowledge necessary to catch these errors? Could an automated tool find them?**

*YOUR ANSWER*

**3. An alternative approach is "clipping": set negatives to 0 and values >100 to 100. When would clipping be better than replacing with the median? When would it be worse?**

*YOUR ANSWER*

---


## Part 7: Final Verification

Compare the original dataset with your cleaned version to see the full impact of your work.
This cell is pre-filled. Run it and review the output.

In [None]:
print("=" * 55)
print("  BEFORE CLEANING (original)")
print("=" * 55)
print(f"  Rows:                {len(df_original)}")
print(f"  Total NaN:           {df_original.isnull().sum().sum()}")
print(f"  Duplicates:          {df_original.duplicated().sum()}")
print(f"  Unique departments:  {df_original['departamento'].nunique()}")
print(f"  ano dtype:           {df_original['ano'].dtype}")
print(f"  poblacion dtype:     {df_original['poblacion_5_16'].dtype}")

print()

print("=" * 55)
print("  AFTER CLEANING")
print("=" * 55)
print(f"  Rows:                {len(df)}")
print(f"  Total NaN:           {df.isnull().sum().sum()}")
print(f"  Duplicates:          {df.duplicated().sum()}")
print(f"  Unique departments:  {df['departamento'].nunique()}")
print(f"  ano dtype:           {df['ano'].dtype}")
print(f"  poblacion dtype:     {df['poblacion_5_16'].dtype}")

print()

print("=" * 55)
print("  PERCENTAGE COLUMNS RANGE CHECK")
print("=" * 55)
pct_check = ['desercion', 'cobertura_neta', 'aprobacion', 'reprobacion']
for col in pct_check:
    print(f"  {col}: min={df[col].min():.2f}, max={df[col].max():.2f}")

### Verification Checklist

Before submitting, verify each item. Change `[ ]` to `[x]` for each one you confirm:

- [ ] **Missing values:** Zero NaN remaining (or justified remaining)
- [ ] **Data types:** `ano` is `int64`, `poblacion_5_16` is `int64`
- [ ] **Duplicates:** Zero duplicate rows remaining
- [ ] **Text:** ~34 unique departments (all uppercase, no extra spaces)
- [ ] **Invalid values:** All percentage columns between 0 and 100
- [ ] **Row count:** ~462 rows or slightly fewer (original 482, minus ~20 duplicates, minus the rows dropped for missing `departamento`)
- [ ] **All documentation cells filled in** with your reasoning

---


## Part 8: Reflection

Answer each question thoughtfully. These are graded.

### 1. What was the most challenging data quality issue to fix? Why?

*YOUR ANSWER*

### 2. Which cleaning decisions required domain expertise about education data?

*YOUR ANSWER*

### 3. Data cleaning is iterative: fixing one problem can create another. Where did you experience this in the workshop?

*YOUR ANSWER*

### 4. How will you apply these steps to your project dataset from datos.gov.co?

*YOUR ANSWER*

### 5. If you had to clean this dataset again from scratch, what would you do differently?

*YOUR ANSWER*

---

*Week 3 Workshop -- Data Analytics Course -- Universidad Cooperativa de Colombia*