# Data Exploration and Summary

## 6. Missing Data Summary in pandas

Handling **missing values (NaNs)** is a core part of data exploration. Ignoring them can lead to misleading statistics and failed models.

In [1]:
import pandas as pd
import numpy as np


### 📌 Key Functions Overview

| Function                | Purpose                                 |
| ----------------------- | --------------------------------------- |
| `isnull()` / `isna()`   | Detect missing values                   |
| `notnull()` / `notna()` | Detect non-missing values               |
| `dropna()`              | Remove rows/columns with missing values |
| `fillna()`              | Fill missing values                     |
| `interpolate()`         | Fill missing values via interpolation   |
| `df.info()`             | Summarize non-null counts and dtypes    |

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 40, np.nan],
    'Gender': ['F', 'M', np.nan, 'M', 'F'],
    'Score': [85, 90, np.nan, 88, 75]
}
df = pd.DataFrame(data)
df

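|     | Name    | Age  | Gender | Score |
| --- | ------- | ---- | ------ | ----- |
| 0   | Alice   | 25.0 | F      | 85.0  |
| 1   | Bob     | NaN  | M      | 90.0  |
| 2   | Charlie | 35.0 | NaN    | NaN   |
| 3   | David   | 40.0 | M      | 88.0  |
| 4   | Eve     | NaN  | F      | 75.0  |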


### 1. Detecting Missing Values


#### `df.isnull()` or `df.isna()`

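Both return a boolean DataFrame of the same shape as `df`: `True` where a value is missing and `False` otherwise. `isna()` and `isnull()` are aliases, so their output is identical.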

In [5]:
df.isnull()

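|     | Name  | Age   | Gender | Score |
| --- | ----- | ----- | ------ | ----- |
| 0   | False | False | False  | False |
| 1   | False | True  | False  | False |
| 2   | False | False | True   | True  |
| 3   | False | False | False  | False |
| 4   | False | True  | False  | False |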


In [7]:
df.isna()

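|     | Name  | Age   | Gender | Score |
| --- | ----- | ----- | ------ | ----- |
| 0   | False | False | False  | False |
| 1   | False | True  | False  | False |
| 2   | False | False | True   | True  |
| 3   | False | False | False  | False |
| 4   | False | True  | False  | False |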


In [6]:
df.Age.isnull()

0    False
1     True
2    False
3    False
4     True
Name: Age, dtype: bool

In [11]:
df.isnull().sum()

Name      0
Age       2
Gender    1
Score     1
dtype: int64

Also useful: `df.info()` shows non-null counts.

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    5 non-null      object 
 1   Age     3 non-null      float64
 2   Gender  4 non-null      object 
 3   Score   4 non-null      float64
dtypes: float64(2), object(2)
memory usage: 288.0+ bytes


### 2. Detecting Non-Missing Values

#### `df.notnull()` or `df.notna()`
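Inverse of `isnull()`: returns `True` where a value is present. A single column's mask, such as `df['Age'].notnull()`, can be used to filter rows. Indexing with the full boolean DataFrame, as in `df[df.notnull()]`, keeps every row and leaves the NaNs in place.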

In [15]:
df.notnull()

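|     | Name | Age   | Gender | Score |
| --- | ---- | ----- | ------ | ----- |
| 0   | True | True  | True   | True  |
| 1   | True | False | True   | True  |
| 2   | True | True  | False  | False |
| 3   | True | True  | True   | True  |
| 4   | True | False | True   | True  |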


In [17]:
df[df.notnull()]

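|     | Name    | Age  | Gender | Score |
| --- | ------- | ---- | ------ | ----- |
| 0   | Alice   | 25.0 | F      | 85.0  |
| 1   | Bob     | NaN  | M      | 90.0  |
| 2   | Charlie | 35.0 | NaN    | NaN   |
| 3   | David   | 40.0 | M      | 88.0  |
| 4   | Eve     | NaN  | F      | 75.0  |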
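
To actually filter rows, the mask has to be a boolean Series (one value per row) rather than a boolean DataFrame. The sketch below rebuilds the same sample frame so that it runs on its own; the variable names `has_age`, `has_age_and_score`, and `complete_rows` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 40, np.nan],
    'Gender': ['F', 'M', np.nan, 'M', 'F'],
    'Score': [85, 90, np.nan, 88, 75],
})

# Boolean Series mask: keep only rows where Age is present
has_age = df[df['Age'].notna()]

# Combine masks with & to require several columns at once
has_age_and_score = df[df['Age'].notna() & df['Score'].notna()]

# Reduce the boolean DataFrame row-wise to keep fully populated rows
complete_rows = df[df.notna().all(axis=1)]

print(has_age.index.tolist())            # [0, 2, 3]
print(has_age_and_score.index.tolist())  # [0, 3]
print(complete_rows.index.tolist())      # [0, 3]
```

The last form selects the same rows as `df.dropna()`, which is usually the shorter way to write it.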


### 3. Dropping Missing Values – `df.dropna()`

Removes rows or columns with missing values.

#### a. Drop rows (default):

In [18]:
df.dropna()

|     | Name  | Age  | Gender | Score |
| --- | ----- | ---- | ------ | ----- |
| 0   | Alice | 25.0 | F      | 85.0  |
| 3   | David | 40.0 | M      | 88.0  |


Drops rows with **any missing value**.

#### b. Drop columns with missing values:

In [20]:
df.dropna(axis=1)

|     | Name    |
| --- | ------- |
| 0   | Alice   |
| 1   | Bob     |
| 2   | Charlie |
| 3   | David   |
| 4   | Eve     |


#### c. Drop rows only if **all** values are missing:

In [21]:
df.dropna(how='all')

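|     | Name    | Age  | Gender | Score |
| --- | ------- | ---- | ------ | ----- |
| 0   | Alice   | 25.0 | F      | 85.0  |
| 1   | Bob     | NaN  | M      | 90.0  |
| 2   | Charlie | 35.0 | NaN    | NaN   |
| 3   | David   | 40.0 | M      | 88.0  |
| 4   | Eve     | NaN  | F      | 75.0  |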


#### d. Drop rows where **specific columns** are NaN:

In [23]:
# A row is dropped if either Age or Score is NaN
df.dropna(subset=['Age', 'Score'])

|     | Name  | Age  | Gender | Score |
| --- | ----- | ---- | ------ | ----- |
| 0   | Alice | 25.0 | F      | 85.0  |
| 3   | David | 40.0 | M      | 88.0  |


In [24]:
df.dropna(subset=['Age', 'Score'], how='all')
# With how='all', a row is dropped only when both Age and Score are NaN.
# No row has both missing, so nothing is dropped.

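|     | Name    | Age  | Gender | Score |
| --- | ------- | ---- | ------ | ----- |
| 0   | Alice   | 25.0 | F      | 85.0  |
| 1   | Bob     | NaN  | M      | 90.0  |
| 2   | Charlie | 35.0 | NaN    | NaN   |
| 3   | David   | 40.0 | M      | 88.0  |
| 4   | Eve     | NaN  | F      | 75.0  |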


In [25]:
df.dropna(subset=['Gender', 'Score'], how='all')

|     | Name  | Age  | Gender | Score |
| --- | ----- | ---- | ------ | ----- |
| 0   | Alice | 25.0 | F      | 85.0  |
| 1   | Bob   | NaN  | M      | 90.0  |
| 3   | David | 40.0 | M      | 88.0  |
| 4   | Eve   | NaN  | F      | 75.0  |
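
Between "any" and "all" there is a middle ground: `dropna()` also accepts `thresh`, the minimum number of non-missing values a row (or column) needs in order to be kept. The sketch below is a minimal illustration on the same sample data; the 80% cut-off is an arbitrary assumption chosen for the example, not a recommended value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 40, np.nan],
    'Gender': ['F', 'M', np.nan, 'M', 'F'],
    'Score': [85, 90, np.nan, 88, 75],
})

# Keep rows that have at least 3 non-missing values.
# Charlie (row 2) has only Name and Age, so that row is dropped.
rows_kept = df.dropna(thresh=3)

# Keep columns that are at least 80% populated (assumed cut-off).
min_non_null = round(0.8 * len(df))  # 4 of 5 rows
cols_kept = df.dropna(axis=1, thresh=min_non_null)

print(rows_kept.index.tolist())    # [0, 1, 3, 4]
print(cols_kept.columns.tolist())  # ['Name', 'Gender', 'Score']
```

`thresh` counts present values, not missing ones, so a higher number is a stricter filter.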


### 4. Filling Missing Values – `df.fillna()`

Replaces NaNs with a specified value and returns a new object; the original DataFrame is left unchanged.

#### a. Fill with constant:

In [26]:
df

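|     | Name    | Age  | Gender | Score |
| --- | ------- | ---- | ------ | ----- |
| 0   | Alice   | 25.0 | F      | 85.0  |
| 1   | Bob     | NaN  | M      | 90.0  |
| 2   | Charlie | 35.0 | NaN    | NaN   |
| 3   | David   | 40.0 | M      | 88.0  |
| 4   | Eve     | NaN  | F      | 75.0  |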


In [27]:
df.fillna(0)

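|     | Name    | Age  | Gender | Score |
| --- | ------- | ---- | ------ | ----- |
| 0   | Alice   | 25.0 | F      | 85.0  |
| 1   | Bob     | 0.0  | M      | 90.0  |
| 2   | Charlie | 35.0 | 0      | 0.0   |
| 3   | David   | 40.0 | M      | 88.0  |
| 4   | Eve     | 0.0  | F      | 75.0  |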


In [28]:
df.Gender.fillna('M')

0    F
1    M
2    M
3    M
4    F
Name: Gender, dtype: object
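
The calls above return new objects and leave `df` untouched, which is why `df` still shows its NaNs in later cells. A minimal sketch of persisting a fill is shown below; the placeholder `'Unknown'` and the name `df_clean` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 40, np.nan],
    'Gender': ['F', 'M', np.nan, 'M', 'F'],
    'Score': [85, 90, np.nan, 88, 75],
})

# fillna returns a new Series; the original column is not modified
filled = df['Gender'].fillna('Unknown')
print(df['Gender'].isna().sum())  # 1 (df still has its NaN)

# To keep the result, assign it back (here on a copy so df stays intact)
df_clean = df.copy()
df_clean['Gender'] = df_clean['Gender'].fillna('Unknown')
print(df_clean['Gender'].tolist())  # ['F', 'M', 'Unknown', 'M', 'F']
```

Assigning the result back to the column is generally clearer than relying on `inplace=True`.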

#### b. Forward fill (`ffill`) – Use previous value:

In [29]:
df.ffill()

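|     | Name    | Age  | Gender | Score |
| --- | ------- | ---- | ------ | ----- |
| 0   | Alice   | 25.0 | F      | 85.0  |
| 1   | Bob     | 25.0 | M      | 90.0  |
| 2   | Charlie | 35.0 | M      | 90.0  |
| 3   | David   | 40.0 | M      | 88.0  |
| 4   | Eve     | 40.0 | F      | 75.0  |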


#### c. Backward fill (`bfill`) – Use next value:

In [30]:
df.bfill()

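|     | Name    | Age  | Gender | Score |
| --- | ------- | ---- | ------ | ----- |
| 0   | Alice   | 25.0 | F      | 85.0  |
| 1   | Bob     | 35.0 | M      | 90.0  |
| 2   | Charlie | 35.0 | M      | 88.0  |
| 3   | David   | 40.0 | M      | 88.0  |
| 4   | Eve     | NaN  | F      | 75.0  |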
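
Eve's `Age` is still missing after `bfill()` because no later row exists to copy from; `ffill()` leaves a leading NaN for the same reason. The sketch below shows two common adjustments: chaining both directions, and capping the fill with `limit`. The `gaps` Series is made-up data used only to show `limit`.

```python
import numpy as np
import pandas as pd

age = pd.Series([25, np.nan, 35, 40, np.nan], name='Age')

# bfill alone leaves the last value missing: nothing follows it
only_bfill = age.bfill()

# Chaining ffill afterwards fills the trailing gap from the previous value
both = age.bfill().ffill()
print(both.tolist())  # [25.0, 35.0, 35.0, 40.0, 40.0]

# limit caps how many consecutive NaNs are filled in each gap
gaps = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])  # hypothetical data
limited = gaps.ffill(limit=1)
print(limited.isna().tolist())  # [False, False, True, True, False]
```

A `limit` is a reasonable safeguard when long runs of missing values should not be filled with one stale observation.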


#### d. Fill with column mean/median/mode:

In [32]:
df.Age.fillna(df.Age.mean())

0    25.000000
1    33.333333
2    35.000000
3    40.000000
4    33.333333
Name: Age, dtype: float64

In [33]:
df.Score.fillna(df.Score.median())

0    85.0
1    90.0
2    86.5
3    88.0
4    75.0
Name: Score, dtype: float64

In [34]:
df.Gender.fillna(df.Gender.mode()[0])

0    F
1    M
2    F
3    M
4    F
Name: Gender, dtype: object
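
The three fills above can be applied in a single call, since `fillna()` accepts a dict that maps column names to fill values. This is a sketch on the same sample data; `fill_values` and `df_filled` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 40, np.nan],
    'Gender': ['F', 'M', np.nan, 'M', 'F'],
    'Score': [85, 90, np.nan, 88, 75],
})

# One fill value per column: mean for Age, median for Score, mode for Gender
fill_values = {
    'Age': df['Age'].mean(),
    'Score': df['Score'].median(),
    'Gender': df['Gender'].mode()[0],  # mode() returns a Series; take the first entry
}

df_filled = df.fillna(fill_values)
print(df_filled.isna().sum().sum())  # 0
```

Here `Gender` has a tie between `F` and `M` (two each); `mode()` returns both values in sorted order, so `[0]` picks `F`. When ties matter, the choice should be made explicitly.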

### 5. Interpolation – `df.interpolate()`

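Fills numeric NaNs by interpolating between neighboring known values. The default method is linear, so a single gap between 90.0 and 88.0 is filled with their midpoint, 89.0.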

In [38]:
df.Score

0    85.0
1    90.0
2     NaN
3    88.0
4    75.0
Name: Score, dtype: float64

In [36]:
df.Score.interpolate()

0    85.0
1    90.0
2    89.0
3    88.0
4    75.0
Name: Score, dtype: float64

Works well when the data is sequential, such as a time series, where neighboring rows are related.
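
For time series with uneven spacing, the default linear method treats rows as equally spaced, while `method='time'` weights by the elapsed time in a `DatetimeIndex`. The dates and values below are hypothetical, chosen only to make the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the gap on Jan 2 sits 1 day after Jan 1
# and 3 days before Jan 5
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
s = pd.Series([10.0, np.nan, 40.0], index=idx)

# Default (linear): ignores the index, so the gap gets the midpoint
linear = s.interpolate()
print(linear.iloc[1])  # 25.0

# Time-aware: 1 of 4 days elapsed, so 10 + 30 * (1 / 4)
timed = s.interpolate(method='time')
print(timed.iloc[1])  # 17.5
```

`method='time'` requires a `DatetimeIndex`; on a plain integer index such as the sample `df`, the linear default is the usual choice.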

### 6. Counting % Missing

To calculate **percent missing per column**:

In [41]:
(df.isnull().sum() / len(df)) * 100

Name       0.0
Age       40.0
Gender    20.0
Score     20.0
dtype: float64
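
Counts and percentages are often easier to read side by side. The sketch below builds a small summary frame; the column names `missing` and `pct_missing` are arbitrary labels, and `isna().mean()` is used as a shorthand for dividing the count by `len(df)`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 40, np.nan],
    'Gender': ['F', 'M', np.nan, 'M', 'F'],
    'Score': [85, 90, np.nan, 88, 75],
})

# The mean of a boolean column is the fraction of True values
summary = pd.DataFrame({
    'missing': df.isna().sum(),
    'pct_missing': df.isna().mean() * 100,
}).sort_values('pct_missing', ascending=False)

print(summary)
```

Sorting by `pct_missing` puts the most incomplete columns first, with `Age` at the top for this data.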

### ✅ Summary Table

| Task                                | Method                                         |
| ----------------------------------- | ---------------------------------------------- |
| Detect missing                      | `df.isnull()` / `df.isna()`                    |
| Detect not missing                  | `df.notnull()` / `df.notna()`                  |
| Drop rows/columns with NaNs         | `df.dropna()`                                  |
| Fill with a value or by propagation | `df.fillna(value)`, `df.ffill()`, `df.bfill()` |
| Interpolate values                  | `df.interpolate()`                             |

---

### 🔍 Real-Life Use Cases

| Scenario                                   | Solution                                                      |
| ------------------------------------------ | ------------------------------------------------------------- |
| Replace null ages with average             | `df['Age'].fillna(df['Age'].mean())`                          |
| Drop columns where most values are missing | `df.dropna(axis=1, thresh=...)` with a minimum non-null count |
| Forward fill time-series data              | `df.ffill()`                                                  |
| Visualize missing data                     | Use `missingno` or seaborn heatmaps                           |
