# Handling Missing Data in Pandas

Welcome to this short tutorial on **handling missing data** using the `pandas` library.  
By the end of this lesson, you'll be able to:

* Detect where data is missing and quantify it.
* Decide when to **remove** or **fill** missing values.
* Apply common strategies like mean‑imputation, constant filling, and row/column thresholding.

> **Prerequisites**: Basic familiarity with Python and pandas (creating a `DataFrame`, selecting columns, basic methods).

---

Let's get started!

In [None]:
# Import the core libraries we'll use throughout the notebook
import numpy as np
import pandas as pd

## 📍 Finding Missing Data

Below we explore pandas utilities that help you **identify** missing values.

In [None]:
# 1️⃣ Create a sample DataFrame that intentionally contains `NaN` values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [1, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, np.nan],
    'D': [1, np.nan, np.nan, np.nan, 5]
}
df = pd.DataFrame(data)

In [None]:
# 2️⃣ Inspect the raw DataFrame
df

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,
2,,3,3.0,
3,4.0,4,,
4,5.0,5,,5.0


In [None]:
# 3️⃣ Locate missing values: `isna()` shows True where data is missing
df.isna()

,A,B,C,D
0,False,False,False,False
1,False,False,False,True
2,True,False,False,True
3,False,False,True,True
4,False,False,True,False


In [None]:
# 4️⃣ Count missing values per column
df.isna().sum()

A    1
B    0
C    2
D    3
dtype: int64

In [None]:
# 5️⃣ Quickly check if **any** value is missing in each column
df.isna().any()

A     True
B    False
C     True
D     True
dtype: bool
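
The same boolean mask can also be summarized per row or for the whole table. The sketch below rebuilds the sample `df` so it runs on its own; the names `missing_per_row`, `rows_with_missing`, and `total_missing` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as the tutorial's `df`
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [1, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, np.nan],
    'D': [1, np.nan, np.nan, np.nan, 5],
})

# axis=1 sums across the columns, giving the number of missing values in each row
missing_per_row = df.isna().sum(axis=1)

# Boolean indexing: keep only the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

# Summing the per-column counts gives the total number of missing cells
total_missing = int(df.isna().sum().sum())

print(missing_per_row)
print(rows_with_missing)
print(total_missing)
```

Row-wise counts are useful when deciding whether a whole record is too incomplete to keep, which the next section builds on.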

## 🗑️ Removing Missing Data

Sometimes the simplest solution is to drop rows or columns that contain too many nulls.

In [None]:
# 6️⃣ Re‑display the original DataFrame before dropping rows
df

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,
2,,3,3.0,
3,4.0,4,,
4,5.0,5,,5.0


In [None]:
# 7️⃣ Drop **any** rows that contain at least one missing value
df.dropna()

,A,B,C,D
0,1.0,1,1.0,1.0


In [None]:
# 8️⃣ `dropna()` returns a new DataFrame; the original `df` is unchanged
#    because we didn't assign the result back (e.g. `df = df.dropna()`)
df

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,
2,,3,3.0,
3,4.0,4,,
4,5.0,5,,5.0
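
To actually keep the cleaned result, assign it to a name. A minimal sketch, assuming the same sample data; `df_clean` is a hypothetical variable name chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as the tutorial's `df`
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [1, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, np.nan],
    'D': [1, np.nan, np.nan, np.nan, 5],
})

# Store the cleaned copy under a new name so the raw data stays available
df_clean = df.dropna()

print(df_clean.shape)  # only the fully populated row survives
print(df.shape)        # the original still has all five rows
```

Keeping the raw frame and the cleaned frame under separate names makes it easy to compare strategies later without reloading the data.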


In [None]:
# 9️⃣ Keep only rows that have at least `thresh` non-null values
#    (`thresh=1` drops a row only when every value in it is missing,
#    so nothing is removed from this DataFrame)
df.dropna(thresh=1)

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,
2,,3,3.0,
3,4.0,4,,
4,5.0,5,,5.0
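
Because `thresh=1` removes nothing here, it may help to see two variants that do change the result: thresholding columns instead of rows, and restricting the check to a subset of columns. This is a sketch on the same sample data; `cols_kept` and `rows_kept` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same sample data as the tutorial's `df`
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [1, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, np.nan],
    'D': [1, np.nan, np.nan, np.nan, 5],
})

# axis=1 applies the threshold to columns:
# keep only columns with at least 4 non-null values (C has 3, D has 2)
cols_kept = df.dropna(axis=1, thresh=4)

# subset limits the check to the listed columns:
# drop a row only if 'A' or 'C' is missing, ignoring gaps in 'D'
rows_kept = df.dropna(subset=['A', 'C'])

print(cols_kept)
print(rows_kept)
```

Column thresholding suits features that are mostly empty, while `subset` suits cases where only a few key fields must be present.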


## 🧩 Filling Missing Data

When dropping data isn't an option, we can *impute* – i.e., plug reasonable values in place of the missing ones.

In [None]:
# 🔟 Check the DataFrame again before we start filling values
df

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,
2,,3,3.0,
3,4.0,4,,
4,5.0,5,,5.0


In [None]:
# 1️⃣1️⃣ Replace every missing value with **0**
df.fillna(0)

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,0.0
2,0.0,3,3.0,0.0
3,4.0,4,0.0,0.0
4,5.0,5,0.0,5.0


In [None]:
# 1️⃣2️⃣ Fill missing values with **column‑specific** replacements
values = {'A': 0, 'B': 100, 'C': 300, 'D': 400}
df.fillna(value=values)

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,400.0
2,0.0,3,3.0,400.0
3,4.0,4,300.0,400.0
4,5.0,5,300.0,5.0


In [None]:
# 1️⃣3️⃣ Confirm that `df` itself is unchanged: `fillna()` returned a new DataFrame
df

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,
2,,3,3.0,
3,4.0,4,,
4,5.0,5,,5.0


In [None]:
# 1️⃣4️⃣ Fill numeric columns with their **mean** value (a common choice)
df.fillna(df.mean())

,A,B,C,D
0,1.0,1,1.0,1.0
1,2.0,2,2.0,3.0
2,3.0,3,3.0,3.0
3,4.0,4,2.0,3.0
4,5.0,5,2.0,5.0


---

### ✏️ Your Turn – Practice Exercises

1. **Percentage of Missing Data**  
   Calculate the *percentage* of missing values for each column and visualize it using a bar plot.

2. **Column vs Row Strategy**  
   Imagine the DataFrame represents survey responses. Which *rows* (respondents) would you keep if you can tolerate *up to* 25% missing answers per respondent?

3. **Median Imputation**  
   Fill missing numeric values with the **median** of each column. Compare the results with mean imputation. What differences do you observe?

4. **Domain‑Aware Filling**  
   Pick a column and choose context‑specific replacements (e.g., 'Unknown', 0, or a sentinel date like '1900‑01‑01'). Explain *why* your choice makes sense.

5. **Challenge**  
   Load a real‑world dataset (e.g., `titanic.csv` from [Kaggle](https://www.kaggle.com)) and repeat the steps above. Summarize your findings in a short Markdown cell.

---

## 🚀 Next Steps
- Continue with the next notebook: **4_Merging_Joining_Concatenation.ipynb**