# NumPy mathematical operations

In [4]:
import numpy as np

# 9. Handling Missing Data

Handling **missing data** is a very important step in **data preprocessing** for data science and machine learning.

In **NumPy**, missing data is typically represented by:

* **`np.nan`** → Stands for **Not a Number** and is commonly used to denote missing or undefined values. It is a floating-point value, so any array that holds it has a float dtype.

This section covers:

* ✅ How to **identify**, **filter**, **replace**, and **interpolate** missing data in NumPy
* ✅ The NaN-aware functions NumPy provides for these tasks
* ✅ A worked example for each method


## ✅ 1. Identifying Missing Data

### Method: `np.isnan()`

* Returns a boolean array of the same shape as the input, with `True` wherever an element is `np.nan` (missing value).

In [6]:
arr = np.array([1, 2, np.nan, 4, np.nan])

# Find where missing data is
missing_mask = np.isnan(arr)
print(missing_mask)

# For multi-dimensional array
arr_2d = np.array([[1, np.nan], [3, 4]])
print(np.isnan(arr_2d)) 

[False False  True False  True]
[[False  True]
 [False False]]
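
A common mistake is to test for missing values with `==`. The sketch below, which reuses the array from the cell above (the names `eq_mask` and `nan_mask` are only illustrative), shows why `np.isnan()` is needed: NaN never compares equal to anything, including itself.

```python
import numpy as np

arr = np.array([1, 2, np.nan, 4, np.nan])

# NaN is not equal to anything, not even to itself,
# so an equality test finds no missing values at all.
eq_mask = arr == np.nan
print(eq_mask)          # every entry is False

# np.isnan() is the reliable test.
nan_mask = np.isnan(arr)
print(nan_mask)         # True at positions 2 and 4

# np.nan is a float, so the array is stored as float64.
print(arr.dtype)
```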


## ✅ 2. Removing Missing Data

### a) Remove Missing Values (1D)

In [13]:
arr = np.array([1, 2, np.nan, 4, np.nan])

cleaned_arr = arr[~np.isnan(arr)]
print(cleaned_arr)

[1. 2. 4.]


### b) Remove Rows or Columns with Missing Data (2D)

In [16]:
arr_2d = np.array([[1, np.nan, 3], [4, 5, 6], [np.nan, 8, 9]])

# Remove rows with any NaN
cleaned = arr_2d[~np.isnan(arr_2d).any(axis=1)]
print(cleaned)

[[4. 5. 6.]]
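
The same idea works for columns. As a minimal sketch on the same array (`cols_with_nan` and `cleaned_cols` are illustrative names), reduce along `axis=0` to get one flag per column and index the second dimension:

```python
import numpy as np

arr_2d = np.array([[1, np.nan, 3], [4, 5, 6], [np.nan, 8, 9]])

# axis=0 collapses the rows, giving one True/False per column.
cols_with_nan = np.isnan(arr_2d).any(axis=0)

# Keep every row, but only the columns without NaN.
cleaned_cols = arr_2d[:, ~cols_with_nan]
print(cleaned_cols)     # only the last column (3, 6, 9) survives
```

Here only one of three columns is kept, which illustrates how much data dropping can discard when missing values are scattered.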


## ✅ 3. Replacing Missing Data

### a) Replace with a Fixed Value

In [19]:
arr = np.array([1, 2, np.nan, 4, np.nan])

arr[np.isnan(arr)] = 0  # Replace NaN with 0
print(arr)

[1. 2. 0. 4. 0.]


### b) Replace with Mean, Median, etc.

In [21]:
arr = np.array([1, 2, np.nan, 4, np.nan])

mean_value = np.nanmean(arr)  # Mean ignoring NaN
arr[np.isnan(arr)] = mean_value
print(arr)

[1.         2.         2.33333333 4.         2.33333333]
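
The median works the same way, and for 2D data it is often more sensible to fill each column with its own statistic. The following is a sketch under those assumptions; it uses `np.where()` so the original arrays are left unchanged, and the variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, np.nan, 4, np.nan])

# Median of the valid values (1, 2, 4) is 2.
median_filled = np.where(np.isnan(arr), np.nanmedian(arr), arr)
print(median_filled)

arr_2d = np.array([[1, np.nan, 3], [4, 5, 6], [np.nan, 8, 9]])

# One mean per column, ignoring NaN: 2.5, 6.5, 6.0
col_means = np.nanmean(arr_2d, axis=0)

# col_means has shape (3,), so it broadcasts across the rows and
# each NaN is replaced by the mean of its own column.
filled_2d = np.where(np.isnan(arr_2d), col_means, arr_2d)
print(filled_2d)
```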



## ✅ 4. Aggregate Functions that Ignore NaN

NumPy provides **NaN-safe aggregate functions:**

| Function       | Purpose                         |
| -------------- | ------------------------------- |
| `np.nanmean()` | Mean ignoring NaN               |
| `np.nanstd()`  | Standard deviation ignoring NaN |
| `np.nanvar()`  | Variance ignoring NaN           |
| `np.nansum()`  | Sum ignoring NaN                |
| `np.nanmin()`  | Minimum ignoring NaN            |
| `np.nanmax()`  | Maximum ignoring NaN            |

In [23]:
arr = np.array([1, 2, np.nan, 4])

print(np.nanmean(arr))  # Mean ignoring NaN
print(np.nansum(arr))   # Sum ignoring NaN

2.3333333333333335
7.0


✅ Use these instead of the regular `mean`, `sum`, etc. when missing data is present: the regular versions return `nan` as soon as a single element is NaN.
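
To make the difference concrete, here is a short comparison on the same array, followed by a per-column aggregate on a small 2D array chosen for illustration:

```python
import numpy as np

arr = np.array([1, 2, np.nan, 4])

# The regular aggregates propagate NaN.
plain_mean = np.mean(arr)
plain_sum = np.sum(arr)
print(plain_mean, plain_sum)    # nan nan

# The NaN-aware versions skip the missing entries.
safe_mean = np.nanmean(arr)
safe_sum = np.nansum(arr)
print(safe_mean, safe_sum)

# They accept axis as well, e.g. one mean per column.
arr_2d = np.array([[1, np.nan, 3], [4, 5, 6], [np.nan, 8, 9]])
col_means = np.nanmean(arr_2d, axis=0)
print(col_means)                # 2.5, 6.5, 6.0
```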

## ✅ 5. Interpolating Missing Data (Basic Method)

NumPy **does not** provide a dedicated function for filling NaN gaps, but `np.interp()` performs one-dimensional linear interpolation and can be applied to the missing positions manually. For more advanced interpolation, use **`pandas`**.

In [26]:
arr = np.array([1, np.nan, 3, np.nan, 5])

# Find indices of valid and missing data
valid = ~np.isnan(arr)
invalid = np.isnan(arr)

# Perform linear interpolation
arr[invalid] = np.interp(np.flatnonzero(invalid), np.flatnonzero(valid), arr[valid])
print(arr)

[1. 2. 3. 4. 5.]


✅ If you need more powerful interpolation, you can use `pandas.Series.interpolate()`.
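
One behaviour of the `np.interp()` approach is worth knowing: query points outside the range of valid indices are not extrapolated but set to the nearest valid value. The sketch below uses an array invented for illustration, with NaN at both ends, and works on a copy so the input is preserved.

```python
import numpy as np

arr = np.array([np.nan, 2.0, np.nan, 4.0, np.nan])

valid = ~np.isnan(arr)
idx = np.arange(arr.size)

filled = arr.copy()
# Interior gap (index 2) is interpolated between 2 and 4.
# Leading and trailing gaps (indices 0 and 4) take the nearest
# valid value, because np.interp clamps outside the known range.
filled[~valid] = np.interp(idx[~valid], idx[valid], arr[valid])
print(filled)
```

If clamping at the edges is not what you want, leave those positions as NaN or handle them separately.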

## ✅ 6. Counting Missing Values

In [29]:
arr = np.array([1, np.nan, 3, np.nan, 5])

missing_count = np.isnan(arr).sum()
print(missing_count)

2
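
For 2D data it is usually more informative to count missing values per row or per column. This is a minimal sketch, assuming the same 3×3 array used in the removal example; the variable names are illustrative.

```python
import numpy as np

arr_2d = np.array([[1, np.nan, 3], [4, 5, 6], [np.nan, 8, 9]])
mask = np.isnan(arr_2d)

per_column = mask.sum(axis=0)     # missing values in each column
per_row = mask.sum(axis=1)        # missing values in each row
total = np.count_nonzero(mask)    # same result as mask.sum()
fraction = mask.mean()            # share of entries that are missing

print(per_column, per_row, total, fraction)
```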


## ✅ 7. Complex Masking with Conditions

In [31]:
# Keep only elements > 2 and not NaN
arr = np.array([1, 2, np.nan, 4, np.nan, 5])

filtered = arr[(~np.isnan(arr)) & (arr > 2)]
print(filtered)

[4. 5.]
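
Masks can also be combined with `|` (or) and inverted with `~` (not). The sketch below, on the same array, builds a mask of entries to discard and then keeps its complement. Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparison operators.

```python
import numpy as np

arr = np.array([1, 2, np.nan, 4, np.nan, 5])

# Discard anything that is missing OR smaller than 2.
# (arr < 2) is False for NaN, so np.isnan() must be checked explicitly.
discard = np.isnan(arr) | (arr < 2)

# ~ inverts the mask: keep everything that is not discarded.
kept = arr[~discard]
print(kept)
```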


## ✅ Summary Table

| Technique                    | Purpose                                |
| ---------------------------- | -------------------------------------- |
| `np.isnan(arr)`              | Identify missing values                |
| `arr[~np.isnan(arr)]`        | Remove missing values                  |
| `arr[np.isnan(arr)] = value` | Replace missing values                 |
| `np.nanmean(arr)`            | Compute mean ignoring NaN              |
| `np.nansum(arr)`             | Compute sum ignoring NaN               |
| `np.nanmin(arr)`             | Compute min ignoring NaN               |
| `np.nanmax(arr)`             | Compute max ignoring NaN               |
| `np.interp()`                | Basic linear interpolation             |
| `np.isnan(arr).sum()`        | Count number of missing values         |
| Complex masking              | Combine conditions with `&`, `\|`, `~` |

## 🔥 Key Takeaways:

* **NaN-aware functions** (`nanmean`, `nansum`) should be used when missing data is present.
* You can **drop, replace, or interpolate** missing data, depending on what your analysis needs.
* For **more complex imputation (mean by group, regression imputation)** → use `pandas`.

