## Missing Data

When working with real-world data, it is common to encounter missing values. But what does it mean for a value to be missing? It depends on the context. For example, a missing value in a survey could mean that the respondent chose not to answer the question, while a missing value in a time series could mean that the data was not collected for that time period.

Other examples of missing values include:
- A customer did not provide their email address when signing up for a newsletter.
- A sensor failed to record a temperature reading for a specific hour.
- A survey respondent skipped a question.
- A website visitor's session ended unexpectedly, leaving some data points incomplete.
- A financial transaction was not recorded due to a system error.
- The year of birth for a participant in a study is unknown.
- A product's weight is not listed on its packaging.
- A student's grade for a particular assignment is missing.
- A medical record lacks information about a patient's allergy history.
- A social media post does not include the number of likes or shares.
- A value was recorded incorrectly, such as a participant's age entered as 350 instead of 35.0 due to a typographical error.

Many other situations can lead to missing data.

Let's look at a practical example. In a survey, a `Salary` field that is empty, set to 0, or filled with an invalid value (a string, for example) can be considered "missing data". These cases are related to the values that Python considers "falsy":

In [144]:
import numpy as np
import pandas as pd

In [145]:
falsy_values = (0, False, None, "", [], {})

For Python, all the values above are considered "falsy":

In [146]:
any(falsy_values)

False
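A truthiness check alone cannot tell a legitimate zero from an absent value, and it lets non-empty invalid strings through. The sketch below illustrates this with a hypothetical list of `Salary` responses (the values and the names `salaries`, `truthy_only`, and `valid` are made up for illustration). It assumes a salary of 0 is a real answer, which depends on the survey:

```python
# Hypothetical survey responses for a `Salary` field (illustrative values only).
salaries = [52000, 0, None, "", "n/a", 61000.5]

# A plain truthiness check drops the 0 and keeps the invalid string "n/a".
truthy_only = [s for s in salaries if s]

# An explicit type check keeps real numbers (including 0) and rejects the rest.
# bool is excluded because it is a subclass of int in Python.
valid = [
    s for s in salaries
    if isinstance(s, (int, float)) and not isinstance(s, bool)
]

print(truthy_only)  # [52000, 'n/a', 61000.5]
print(valid)        # [52000, 0, 61000.5]
```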

NumPy has a special value for missing numeric data, `np.nan`. NaN stands for "Not a Number":

In [147]:
np.nan

nan

The `np.nan` value is kind of a virus. Everything that it touches becomes `np.nan`:

In [148]:
3 + np.nan

nan

In [149]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [150]:
a.sum()

nan

In [151]:
a.mean()

nan

This is better than regular `None` values, which in the previous examples would have raised an exception:

In [152]:
# 3 + None  # raises TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

When an array is created with a float dtype, `None` is converted to `np.nan`:

In [153]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype="float")

In [154]:
a

array([ 1.,  2.,  3., nan, nan,  4.])
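The explicit `dtype` matters here. As a small sketch (the names `obj` and `flt` are only illustrative), leaving it out makes NumPy fall back to a generic object array, where `None` stays `None` and arithmetic fails just as it does in plain Python:

```python
import numpy as np

# Without an explicit dtype, None cannot be stored as a number,
# so NumPy builds an object array instead.
obj = np.array([1, 2, None])
print(obj.dtype)  # object

# With a float dtype, None is converted to nan.
flt = np.array([1, 2, None], dtype="float")
print(flt.dtype)         # float64
print(np.isnan(flt[2]))  # True

# Summing the object array ends up evaluating int + None, which raises TypeError.
try:
    obj.sum()
except TypeError as exc:
    print("TypeError:", exc)
```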

As we said, `np.nan` is like a virus. If an array contains any `nan` value, aggregations over it will also return `nan`:

In [155]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [156]:
a.mean()

nan

In [157]:
a.sum()

nan

NumPy also supports an infinity value, `np.inf`:

In [158]:
np.inf

inf

Which also behaves as a virus:

In [159]:
3 + np.inf

inf

In [160]:
np.inf / 3

inf

In [161]:
np.inf / np.inf

nan

In [162]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=np.float16)

In [163]:
b.sum()

nan

---

### Checking for `nan` or `inf`

There are two functions, `np.isnan` and `np.isinf`, that perform these checks:

In [164]:
np.isnan(np.nan) # True
# np.isnan(np.inf) # False
# np.isnan(2) # False

True

In [165]:
# np.isinf(np.nan) # False
np.isinf(np.inf) # True

True

The combined check is done with `np.isfinite`, which returns `True` only for values that are neither `nan` nor infinite:

In [166]:
np.isfinite(np.nan), np.isfinite(np.inf)

(False, False)

`np.isnan` and `np.isinf` also take arrays as inputs, and return boolean arrays as results:

In [167]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False,  True, False, False])

In [168]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False, False,  True, False])

In [169]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([ True,  True,  True, False, False,  True])
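One reason these functions are needed is that `nan` never compares equal to anything, including itself, so an equality test cannot locate it. The sketch below shows that, and also how a boolean mask can be summed to count matches (the names `n_nan` and `n_inf` are only illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.inf, 4])

# nan is not equal to itself, so == never finds it.
print(np.nan == np.nan)     # False
print((a == np.nan).any())  # False

# Summing a boolean mask counts the True entries.
n_nan = int(np.isnan(a).sum())
n_inf = int(np.isinf(a).sum())
print(n_nan, n_inf)         # 1 1
```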

Note: Infinite values are uncommon in practice. From now on, we'll work only with `np.nan`.

---

### Filtering them out

Whenever you perform an operation on a NumPy array that might contain missing values, you'll need to filter them out first to avoid `nan` propagation. We'll combine `np.isnan` with boolean indexing for this purpose:

In [170]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])

In [171]:
a[~np.isnan(a)]

array([1., 2., 3., 4.])

For this array, which has no infinite values, that is equivalent to:

In [172]:
a[np.isfinite(a)]

array([1., 2., 3., 4.])
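The two filters only agree when the array contains no infinite values. A short sketch with a hypothetical array `c` that holds both `inf` and `nan` shows the difference:

```python
import numpy as np

# Illustrative array containing both an infinite and a nan value.
c = np.array([1, 2, np.inf, np.nan, 4])

only_nan_removed = c[~np.isnan(c)]  # drops nan, inf is still present
finite_only = c[np.isfinite(c)]     # drops both nan and inf

print(len(only_nan_removed))  # 4
print(len(finite_only))       # 3
```

Which filter is appropriate depends on whether infinite values should count as valid data in your case.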

With the missing values filtered out, all the operations can now be performed:

In [173]:
a[np.isfinite(a)].sum()

10.0

In [174]:
a[np.isfinite(a)].mean()

2.5
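As an alternative to filtering by hand, NumPy also provides nan-aware aggregations. A minimal sketch using `np.nansum` and `np.nanmean` on the same array (the names `total` and `average` are only illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

total = np.nansum(a)     # treats nan entries as zero
average = np.nanmean(a)  # averages only the non-nan entries

print(total)    # 10.0
print(average)  # 2.5
```

Note that these functions skip only `nan`. Infinite values are still included, so the `np.isfinite` filter remains the option to use when both should be excluded.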