## How to Handle Missing Data in Pandas

In Pandas, missing data is usually represented as:
- `NaN` → "Not a Number" (from NumPy)
- `None` → Python's null object, which Pandas converts to `NaN` automatically in numeric data




In [1]:
import pandas as pd
import numpy as np

a = pd.Series([10, None, 30, np.nan])
a

Unnamed: 0,0
0,10.0
1,
2,30.0
3,
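
As a small illustration of the conversion described above, the sketch below compares a numeric Series with a Series of strings. The `numeric` and `text` names are made up for this example. In both cases `isna()` should flag the `None` entry as missing.

```python
import pandas as pd

numeric = pd.Series([10, None, 30])          # None is stored as NaN
text = pd.Series(["Lagos", None, "Abuja"])   # None is still treated as missing

print(numeric.dtype)             # float64, because NaN forces a float dtype
print(numeric.isna().tolist())   # [False, True, False]
print(text.isna().tolist())      # [False, True, False]
```

This is why the integer inputs above are displayed as `10.0` and `30.0`: one missing value is enough to turn the whole Series into floats.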


### Detecting Missing Values

In [2]:
a.isna()

Unnamed: 0,0
0,False
1,True
2,False
3,True


In [3]:
a.isnull()

Unnamed: 0,0
0,False
1,True
2,False
3,True


In [4]:
a.notna()

Unnamed: 0,0
0,True
1,False
2,True
3,False


In [5]:
# return non-null values in a
a[a.notna()]

Unnamed: 0,0
0,10.0
2,30.0


## Missing Data in DataFrames

In [6]:
data = {
    "name": ["Alice", "Bob", "Caro", "Dan"],
    "age": [23, None, 22, 24],
    "score": [88, 72, None, 91]
}

df = pd.DataFrame(data)
df

Unnamed: 0,name,age,score
0,Alice,23.0,88.0
1,Bob,,72.0
2,Caro,22.0,
3,Dan,24.0,91.0


In [7]:
# detect missing values
df.isna()

Unnamed: 0,name,age,score
0,False,False,False
1,False,True,False
2,False,False,True
3,False,False,False


In [8]:
# count missing values in each column
df.isna().sum()

Unnamed: 0,0
name,0
age,1
score,1


In [9]:
# count total missing values across the whole DataFrame
df.isna().sum().sum()

np.int64(2)

In [11]:
# drop rows with any missing values
df1 = df.dropna()  # returns a new DataFrame; pass inplace=True to modify df itself
df1

Unnamed: 0,name,age,score
0,Alice,23.0,88.0
3,Dan,24.0,91.0


In [12]:
# drop columns with any missing values
df1 = df.dropna(axis=1)
df1

Unnamed: 0,name
0,Alice
1,Bob
2,Caro
3,Dan


In [13]:
# drop rows only if all of their values are missing
df1 = df.dropna(how="all")
df1

Unnamed: 0,name,age,score
0,Alice,23.0,88.0
1,Bob,,72.0
2,Caro,22.0,
3,Dan,24.0,91.0
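
Dropping can also be made more selective. The sketch below rebuilds the same `df` and uses two other `dropna` parameters, `subset` and `thresh`. The variable names `has_age` and `complete_enough` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Caro", "Dan"],
    "age": [23, None, 22, 24],
    "score": [88, 72, None, 91],
})

# keep rows where "age" is present; a missing "score" is tolerated
has_age = df.dropna(subset=["age"])
print(has_age.index.tolist())          # [0, 2, 3]

# keep rows that have at least 3 non-missing values
complete_enough = df.dropna(thresh=3)
print(complete_enough.index.tolist())  # [0, 3]
```

`subset` is useful when only some columns are essential, while `thresh` sets a minimum number of usable values per row.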


### Filling Missing Data

In [14]:
# fill all missing values with a constant
df1 = df.fillna(0)
df1

Unnamed: 0,name,age,score
0,Alice,23.0,88.0
1,Bob,0.0,72.0
2,Caro,22.0,0.0
3,Dan,24.0,91.0


In [17]:
# fill missing age with median age
df1 = df["age"].fillna(round(df["age"].median(), 1))
df1

Unnamed: 0,age
0,23.0
1,23.0
2,22.0
3,24.0


In [18]:
# fill missing score with mean score
df1 = df["score"].fillna(round(df["score"].mean(), 1))
df1

Unnamed: 0,score
0,88.0
1,72.0
2,83.7
3,91.0


### Common filling strategies

| Situation    | Typical strategy             |
| ------------ | ---------------------------- |
| Numeric data | mean / median                |
| Skewed data  | median                       |
| Categorical  | mode                         |
| Time series  | forward fill / backward fill |
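
The categorical row of the table is the only one without an example in this notebook, so here is a minimal sketch of a mode fill. The `grade` Series is hypothetical and not part of the data above.

```python
import pandas as pd

# hypothetical categorical column, invented for this example
grade = pd.Series(["A", "B", None, "B", "C"])

most_common = grade.mode()[0]       # mode() returns a Series; take its first entry
filled = grade.fillna(most_common)
print(filled.tolist())              # ['A', 'B', 'B', 'B', 'C']
```

If several values tie for most frequent, `mode()` returns all of them, so taking the first entry is a simplification.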


In [19]:
# Forward and backward fill
s = pd.Series([10, None, None, 25, None])

s.ffill()   # forward fill

Unnamed: 0,0
0,10.0
1,10.0
2,10.0
3,25.0
4,25.0


In [20]:
s.bfill()   # backward fill

Unnamed: 0,0
0,10.0
1,25.0
2,25.0
3,25.0
4,
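
Forward fill repeats the last known value for as long as the gap lasts, which may be too optimistic for long gaps. As a sketch, the `limit` parameter caps how many consecutive missing values get filled. The `capped` name is only illustrative.

```python
import pandas as pd

s = pd.Series([10, None, None, 25, None])

# fill at most 1 consecutive missing value after each known value
capped = s.ffill(limit=1)
print(capped.isna().tolist())   # [False, False, True, False, False]
```

Here index 1 is filled with 10.0 and index 4 with 25.0, while index 2 stays missing because it is the second gap in a row.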


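Note that missing data affects calculations. Pandas skips `NaN` values by default in most reductions, such as `sum()` and `mean()` (`skipna=True`), so a result can silently rest on fewer values than the column holds. It is useful to know how to change that behaviour when you need to.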
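
A minimal sketch of that default, using a small Series made up for this example:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, None, 30])

print(s.sum())               # 40.0, the NaN is skipped
print(s.mean())              # 20.0, averaged over the 2 present values
print(s.count())             # 2, count() only counts non-missing values
print(len(s))                # 3, len() counts every entry
print(s.sum(skipna=False))   # nan, the NaN propagates into the result
```

Comparing `count()` with `len()` is a quick way to check how many values a statistic was actually computed from.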

## The Professional Cleaning Workflow

When you see missing data, think in this order:

1. Find out how much data is missing.
2. Understand why it’s missing. Could it be:
   - a data collection problem?
   - a sensor failure?
   - an optional field?
3. Decide whether to:
   - drop it
   - fill it
   - leave it as-is
4. Apply the decision explicitly, column by column.
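
The steps above can be sketched on the earlier `df` as follows. This is one possible layout rather than a fixed recipe: the `summary` and `cleaned` names are illustrative, and the median/mean choices simply repeat the ones used earlier in this notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Caro", "Dan"],
    "age": [23, None, 22, 24],
    "score": [88, 72, None, 91],
})

# Step 1: how much is missing, as a count and a percentage per column
summary = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": (df.isna().mean() * 100).round(1),
})
print(summary)

# Steps 3 and 4: one explicit decision per column, passed as a dict
cleaned = df.fillna({
    "age": df["age"].median(),
    "score": df["score"].mean(),
})
print(cleaned.isna().sum().sum())   # 0
```

Passing a dict to `fillna` keeps every column's strategy visible in one place. Step 2 has no code because it depends on knowing how the data was collected.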

## Mini-Project

Given:

In [21]:
data = {
    "city": ["Lagos", "Abuja", "Ibadan", None],
    "population": [14.3, None, 3.6, 2.5],
    "rainfall": [None, 1200, 1300, 1100]
}

data_df = pd.DataFrame(data)
data_df

Unnamed: 0,city,population,rainfall
0,Lagos,14.3,
1,Abuja,,1200.0
2,Ibadan,3.6,1300.0
3,,2.5,1100.0


State:

1. How many missing values per column?

2. One reasonable filling strategy for each column.

3. Which column you would not fill automatically, and why?

In [24]:
data_df.isna().sum(axis=0)

Unnamed: 0,0
city,1
population,1
rainfall,1


In [27]:
# drop the row where city is None; .copy() makes the result an independent
# DataFrame, so the column assignments below do not trigger SettingWithCopyWarning
data_df = data_df.dropna(subset=["city"]).copy()
data_df

Unnamed: 0,city,population,rainfall
0,Lagos,14.3,
1,Abuja,,1200.0
2,Ibadan,3.6,1300.0


In [29]:
# fill missing population with mean
data_df["population"] = data_df["population"].fillna(round(data_df["population"].mean(), 1))
data_df

Unnamed: 0,city,population,rainfall
0,Lagos,14.3,
1,Abuja,9.0,1200.0
2,Ibadan,3.6,1300.0


In [30]:
# fill missing rainfall with mean
data_df["rainfall"] = data_df["rainfall"].fillna(round(data_df["rainfall"].mean(), 1))
data_df

Unnamed: 0,city,population,rainfall
0,Lagos,14.3,1250.0
1,Abuja,9.0,1200.0
2,Ibadan,3.6,1300.0


In [None]:
# there is 1 missing value in each column
# dropped the row whose city name was unknown
# filled population and rainfall with their column means
# didn't fill the city column automatically because the city name is a key piece of data that shouldn't be substituted with a guess