# Consistent missing data support

Roadmap item: https://pandas.pydata.org/docs/dev/development/roadmap.html#consistent-missing-value-handling

## The current situation

In [None]:
import pandas as pd
import numpy as np  # needed below for np.nan
import datetime

In [None]:
df = pd.DataFrame([
    (1, 0.1, True, "A", datetime.datetime(2020, 1, 1)),
    (2, None, False, None, datetime.datetime(2020, 1, 1)),
    (None, 0.3, True, "C", None),
    (4, 0.4, None, "D", datetime.datetime(2020, 1, 1)),
], columns=["int", "float", "bool", "string", "timestamp"])

In [None]:
df

In [None]:
df.dtypes

* Integers don't support missing data (the column is cast to float)
* Booleans don't support missing data (the column falls back to object dtype)
* Different missing value indicators (`np.nan`, `None`, `pd.NaT`)
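The three points above can be reproduced on small Series. This is a minimal sketch; the variable names `ints`, `bools` and `stamps` are chosen here for illustration only.

```python
import pandas as pd

# Integers with a missing value are upcast to float64; the gap becomes NaN
ints = pd.Series([1, None, 3])
print(ints.dtype)

# Booleans with a missing value fall back to object dtype; None is kept as-is
bools = pd.Series([True, None, False])
print(bools.dtype)

# Datetimes use their own missing value indicator, pd.NaT
stamps = pd.Series([pd.Timestamp("2020-01-01"), None])
print(stamps.iloc[1])
```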

## New "nullable" dtypes!

Introduced in pandas 1.0 (https://pandas.pydata.org/docs/dev/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values):

* The new `pd.NA` missing value sentinel
* The nullable integer, boolean and string dtypes (a nullable floating dtype will be added in pandas 1.2)
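The nullable dtypes can also be requested explicitly by name. A short sketch, assuming the string aliases `"Int64"`, `"boolean"` and `"string"`; note the capital `I`, since lowercase `"int64"` refers to the NumPy dtype, which cannot hold missing values.

```python
import pandas as pd

# Each array keeps its intended dtype and stores the gap as a missing value
ints = pd.array([1, None, 3], dtype="Int64")
bools = pd.array([True, None, False], dtype="boolean")
strings = pd.array(["A", None, "C"], dtype="string")

print(ints)
print(bools)
print(strings)
```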

In [None]:
# convert_dtypes is a helper method to convert to those new dtypes
df2 = df.convert_dtypes()

In [None]:
df2

In [None]:
pd.NA

In [None]:
df2.dtypes

Columns with these nullable dtypes and `pd.NA` work the same way as before in functions like `fillna()`, `isna()`, `dropna()`, ...
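As a small illustration of those functions on a nullable integer Series (the names `s`, `mask`, `filled` and `dropped` are only for this sketch):

```python
import pandas as pd

s = pd.Series([1, pd.NA, 3], dtype="Int64")

mask = s.isna()        # boolean Series marking the missing entry
filled = s.fillna(0)   # missing entry replaced; the dtype stays Int64
dropped = s.dropna()   # missing entry removed

print(mask.tolist())
print(filled.tolist(), filled.dtype)
print(len(dropped))
```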

## Attention: `pd.NA` has different behaviour in certain cases

In [None]:
s_NA = pd.Series([1, pd.NA, 3], dtype="Int64")
s_nan = pd.Series([1, np.nan, 3])

In [None]:
s_NA

In [None]:
s_nan

As usual, missing values are skipped by default in reductions:

In [None]:
s_NA.sum()

As usual, missing values get propagated in element-wise arithmetic operations:

In [None]:
s_NA + 1

But with `pd.NA`, missing values are *also* propagated in comparison operations, whereas a comparison with `np.nan` gives `False`:

In [None]:
s_NA == 1

In [None]:
s_nan == 1
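When a plain True/False answer is needed from a comparison on a nullable column, one option is to fill the missing results explicitly. This is a sketch of one possible approach; whether a missing comparison should count as `False` depends on the analysis.

```python
import numpy as np
import pandas as pd

s_NA = pd.Series([1, pd.NA, 3], dtype="Int64")
s_nan = pd.Series([1, np.nan, 3])

cmp_NA = s_NA == 1     # nullable "boolean" dtype, missing entry stays missing
cmp_nan = s_nan == 1   # NumPy bool dtype, missing entry compares as False

# Decide explicitly how a missing comparison result should be treated
cmp_filled = cmp_NA.fillna(False)

print(cmp_NA.dtype, cmp_NA.isna().tolist())
print(cmp_nan.tolist())
print(cmp_filled.tolist())
```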

Further, in logical operations, `pd.NA` does not always propagate ("three-valued" or "Kleene" logic):

In [None]:
pd.NA & True

In [None]:
pd.NA | True
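The rule behind this: the result is only missing when it actually depends on the unknown operand. A brief sketch covering the remaining scalar cases and the same behaviour on a nullable boolean Series (the Series `a` is made up for this example):

```python
import pandas as pd

# The outcome is fixed no matter what the missing value would have been
print(pd.NA | True)    # True
print(pd.NA & False)   # False

# The outcome depends on the missing value, so it stays missing
print(pd.NA & True)    # <NA>
print(pd.NA | False)   # <NA>

# The same rules apply element-wise to the nullable boolean dtype
a = pd.Series([True, False, pd.NA], dtype="boolean")
print((a | True).tolist())
print((a & False).tolist())
```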

## How does this work?

For the nullable integer, floating and boolean data types, we use a **"masked array"** approach: one array with the actual data, and one boolean array indicating whether each value is missing.
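The idea can be imitated with two plain NumPy arrays. This is a conceptual sketch only, not the pandas implementation; the names `data` and `mask` and the placeholder value `0` are assumptions made for the illustration.

```python
import numpy as np

# Values live in an ordinary integer array; the masked slot holds a placeholder
data = np.array([1, 0, 3], dtype="int64")
# True marks a missing value
mask = np.array([False, True, False])

# A reduction skips the masked positions
total = data[~mask].sum()

# Element-wise arithmetic acts on the data; the mask is carried along unchanged
shifted_data = data + 1
shifted_mask = mask.copy()

print(total)
print(shifted_data[~shifted_mask].tolist())
```

Because the data array keeps its integer dtype, no cast to float is needed to represent the missing entry.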

In [None]:
arr = s_NA.array

In [None]:
arr

In [None]:
# _data and _mask are private attributes, shown here only to illustrate the two arrays
arr._data

In [None]:
arr._mask

## What's next?

* Expand the use of `pd.NA` to more data types
* Complete support for nullable dtypes across pandas
