# Consistent missing data support

https://pandas.pydata.org/docs/dev/development/roadmap.html#consistent-missing-value-handling

## The current situation

In [1]:
import datetime

import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame([
    (1, 0.1, True, "A", datetime.datetime(2020, 1, 1)),
    (2, None, False, None, datetime.datetime(2020, 1, 1)),
    (None, 0.3, True, "C", None),
    (4, 0.4, None, "D", datetime.datetime(2020, 1, 1)),
], columns=["int", "float", "bool", "string", "timestamp"])

In [3]:
df

   int  float   bool string  timestamp
0  1.0    0.1   True      A 2020-01-01
1  2.0    NaN  False   None 2020-01-01
2  NaN    0.3   True      C        NaT
3  4.0    0.4   None      D 2020-01-01


In [4]:
df.dtypes

int                 float64
float               float64
bool                 object
string               object
timestamp    datetime64[ns]
dtype: object

* Integers don't support missing data (the column is cast to float)
* Booleans don't support missing data (the column falls back to object dtype)
* Different dtypes use different missing value indicators (`np.nan`, `None`, `pd.NaT`)
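The three indicators in the last bullet are distinct Python objects, even though pandas treats all of them as missing. A minimal sketch of that point; the variable names `sentinels` and `flags` are illustrative only:

```python
import numpy as np
import pandas as pd

# The three missing value indicators that show up in the DataFrame above.
sentinels = [np.nan, None, pd.NaT]

# pd.isna recognises each of them as missing ...
flags = [pd.isna(value) for value in sentinels]
print(flags)

# ... but they are objects of three different types.
for value in sentinels:
    print(type(value).__name__, repr(value))
```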

## New "nullable" dtypes!

The `pd.NA` missing value sentinel and the nullable boolean and string dtypes were added in pandas 1.0 (https://pandas.pydata.org/docs/dev/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values). The nullable integer dtype, first introduced in pandas 0.24, has used `pd.NA` since 1.0. The nullable floating dtype will be added in pandas 1.2.

In [5]:
# convert_dtypes is a helper method to convert to those new dtypes
df2 = df.convert_dtypes()

In [6]:
df2

    int  float   bool string  timestamp
0     1    0.1   True      A 2020-01-01
1     2   <NA>  False   <NA> 2020-01-01
2  <NA>    0.3   True      C        NaT
3     4    0.4   <NA>      D 2020-01-01


In [7]:
df2.dtypes

int                   Int64
float               Float64
bool                boolean
string               string
timestamp    datetime64[ns]
dtype: object
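Besides `convert_dtypes`, the nullable dtypes can also be requested explicitly by name when constructing a Series. A short sketch, assuming pandas 1.0 or later; the variable names are illustrative:

```python
import pandas as pd

# Request the nullable dtypes by their string aliases.
ints = pd.Series([1, None, 4], dtype="Int64")
bools = pd.Series([True, None, False], dtype="boolean")
strings = pd.Series(["A", None, "D"], dtype="string")

# In each case, None is stored as a missing value.
for series in (ints, bools, strings):
    print(series.dtype, series.isna().tolist())
```

Note the capital "I" in `"Int64"`: the lowercase `"int64"` is the regular NumPy dtype, which cannot hold missing values.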

## Attention: `pd.NA` has different behaviour

In [8]:
s_NA = pd.Series([1, pd.NA, 3], dtype="Int64")
s_nan = pd.Series([1, np.nan, 3])

In [9]:
s_NA

0       1
1    <NA>
2       3
dtype: Int64

In [10]:
s_nan

0    1.0
1    NaN
2    3.0
dtype: float64

As usual, missing values get propagated in element-wise arithmetic operations:

In [11]:
s_NA + 1

0       2
1    <NA>
2       4
dtype: Int64

But with `pd.NA`, missing values *also* propagate in comparison operations, whereas a comparison with `np.nan` simply gives `False`:

In [12]:
s_NA == 1

0     True
1     <NA>
2    False
dtype: boolean

In [13]:
s_nan == 1

0     True
1    False
2    False
dtype: bool
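One practical consequence is that a comparison result containing `<NA>` is not a plain True/False mask, and `pd.NA` itself has no truth value. A minimal sketch of one way to handle this, by deciding explicitly how missing entries count before filtering; the names `mask`, `explicit_mask` and `na_is_ambiguous` are illustrative:

```python
import pandas as pd

s_NA = pd.Series([1, pd.NA, 3], dtype="Int64")

# The comparison yields a nullable boolean mask with a missing entry.
mask = s_NA == 1

# Decide explicitly that "unknown" should count as "not selected".
explicit_mask = mask.fillna(False)
selected = s_NA[explicit_mask]
print(selected)

# pd.NA cannot be used in a plain `if`: its truth value is ambiguous.
try:
    bool(pd.NA)
    na_is_ambiguous = False
except TypeError:
    na_is_ambiguous = True
print(na_is_ambiguous)
```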

Further, in logical operations, `pd.NA` does not always propagate: it follows "three-valued" (or "Kleene") logic, where the result is only missing if it actually depends on the unknown value:

In [14]:
pd.NA & True

<NA>

In [15]:
pd.NA | True

True
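The two cases above are half of the picture. A small sketch that also evaluates the combinations with `False`, collected in an illustrative dictionary named `results`:

```python
import pandas as pd

# Evaluate `&` and `|` between pd.NA and both boolean values.
results = {
    "NA & True": pd.NA & True,
    "NA & False": pd.NA & False,
    "NA | True": pd.NA | True,
    "NA | False": pd.NA | False,
}

for expression, value in results.items():
    print(f"{expression:<10} -> {value}")
```

The result is known whenever one operand settles it on its own (`False` for `&`, `True` for `|`); otherwise it stays `<NA>`.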

## How does this work?

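For the nullable integer, floating and boolean data types, we use a **"masked array"** approach: one array holds the actual data, and a second boolean array indicates which values are missing. The value stored in the data array at a masked position is just a placeholder and is never used. Note that `_data` and `_mask` are private attributes, shown here for illustration only.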

In [16]:
arr = s_NA.array

In [17]:
arr

<IntegerArray>
[1, <NA>, 3]
Length: 3, dtype: Int64

In [18]:
arr._data

array([1, 1, 3])

In [19]:
arr._mask

array([False,  True, False])
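To make the masked array idea concrete, the sketch below builds the same array from a data array and a mask using the public `pd.arrays.IntegerArray` constructor, and then imitates element-wise addition by hand. The manual version is only an illustration of the principle under simplifying assumptions, not how pandas implements it internally; the names `manual` and `builtin` are illustrative:

```python
import numpy as np
import pandas as pd

# One array with the values, one boolean array flagging the missing entries.
data = np.array([1, 1, 3], dtype="int64")
mask = np.array([False, True, False])

arr = pd.arrays.IntegerArray(data, mask)
print(arr)

# Sketch of element-wise addition: operate on the data, carry the mask along.
# The placeholder under the mask is changed too, but it stays hidden.
manual = pd.arrays.IntegerArray(data + 1, mask.copy())
builtin = arr + 1

print(manual)
print(builtin)
```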