In [2]:
import pandas as pd
import numpy as np

# Handling Missing Data

## Difference between `np.nan` and `pd.NA`

The difference between `numpy.nan` and `pandas.NA` lies primarily in their intended use and behavior within the Pandas library:

### `numpy.nan`

- **Type**: A floating-point value representing "Not a Number" (NaN) as defined by the IEEE 754 standard; NumPy exposes it as `np.nan`.
- **Usage**: Primarily used to denote missing values in floating-point arrays.
- **Behavior**: When used in Pandas, `numpy.nan` can lead to automatic type coercion. For example, if you use `numpy.nan` in an integer column, Pandas will convert the entire column to a floating-point type to accommodate the NaN value[2][3].
- **Operations**: `numpy.nan` behaves as a floating-point number and can be used in arithmetic operations, but it propagates as NaN in results.
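
The coercion and propagation described above can be checked directly. This is a minimal sketch; the variable names `ints` and `with_nan` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# An all-integer Series keeps an integer dtype.
ints = pd.Series([1, 2, 3])
print(ints.dtype)  # int64

# Adding np.nan forces the whole Series to float64.
with_nan = pd.Series([1, np.nan, 3])
print(with_nan.dtype)  # float64

# NaN propagates through arithmetic and is not equal to itself.
print(np.nan + 1)        # nan
print(np.nan == np.nan)  # False
```

Because `np.nan == np.nan` is `False`, missing values should be detected with `isnull()` or `np.isnan()` rather than with `==`.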

### `pandas.NA`

- **Type**: Introduced as an experimental feature in Pandas 1.0, `pandas.NA` is a scalar used to represent missing values across all data types in a more consistent manner.
- **Usage**: Designed to be a more generic missing value indicator that can be used with Pandas' nullable data types, such as `Int64`, `boolean`, and `string`[3][4].
- **Behavior**: It does not coerce the data type of the column. For example, using `pandas.NA` in an integer column with a nullable integer type (`Int64`) will not change the column's type to float[3].
- **Operations**: In addition to arithmetic operations, `pandas.NA` propagates as "missing" or "unknown" in comparison operations, which can be useful for data analysis where such behavior is desired[4].
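
A short sketch of the behavior listed above, assuming pandas 1.0 or later; the names `nullable`, `comparison`, and `eq` are illustrative.

```python
import pandas as pd

# With a nullable integer dtype, pd.NA does not force a float column.
nullable = pd.Series([1, pd.NA, 3], dtype="Int64")
print(nullable.dtype)  # Int64

# Comparing against pd.NA yields pd.NA rather than True or False.
comparison = pd.NA == 1
print(comparison)  # <NA>

# Logical operations return a definite answer only when the result
# does not depend on the missing value.
print(pd.NA | True)   # True
print(pd.NA & False)  # False

# Element-wise comparison on a nullable Series gives a nullable boolean Series.
eq = nullable == 1
print(eq.dtype)  # boolean
```

Since a comparison with `pd.NA` is itself `pd.NA`, using it directly in an `if` statement raises a `TypeError`; `pd.isna()` is the safer check.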

### Conclusion

While both `numpy.nan` and `pandas.NA` are used to represent missing values, `pandas.NA` offers a more consistent and flexible approach for handling missing data across different data types in Pandas. It is particularly useful when working with Pandas' nullable data types, as it avoids the automatic type coercion that occurs with `numpy.nan`. Using `pandas.NA` is recommended when you want to maintain the integrity of data types and take advantage of Pandas' enhanced handling of missing data[1][2][3][4].

Citations:
[1] https://www.includehelp.com/python/pd-na-vs-np-nan-for-pandas.aspx
[2] https://pandas.pydata.org/docs/user_guide/missing_data.html
[3] https://towardsdatascience.com/nan-none-and-experimental-na-d1f799308dd5?gi=37f8cbd168a0
[4] https://stackoverflow.com/questions/60115806/pd-na-vs-np-nan-for-pandas
[5] https://www.youtube.com/watch?v=CeqvH6DdMso

## Detecting Null Values

In [3]:
data = pd.Series([1, np.nan, 'hello', None])

In [4]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [5]:
data[data.notnull()]

0        1
2    hello
dtype: object

In [17]:
data = pd.DataFrame(np.arange(12).reshape(3, -1))

In [18]:
data

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [19]:
data.iloc[0, 1] = pd.NA

In [20]:
data

Unnamed: 0,0,1,2,3
0,0,,2,3
1,4,5.0,6,7
2,8,9.0,10,11


In [21]:
data.iloc[1, 3] = pd.NA

In [22]:
data

Unnamed: 0,0,1,2,3
0,0,,2,3.0
1,4,5.0,6,
2,8,9.0,10,11.0


In [23]:
data.isnull()

Unnamed: 0,0,1,2,3
0,False,True,False,False
1,False,False,False,True
2,False,False,False,False


In [25]:
data.notnull()

Unnamed: 0,0,1,2,3
0,True,False,True,True
1,True,True,True,False
2,True,True,True,True


In [26]:
data[data.notnull()]

Unnamed: 0,0,1,2,3
0,0,,2,3.0
1,4,5.0,6,
2,8,9.0,10,11.0


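Nothing appears to change here: indexing a DataFrame with a boolean DataFrame keeps the original shape and replaces every `False` position with a missing value, and those positions are already missing.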
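
To actually filter a DataFrame on missingness, the mask has to be reduced to one boolean per row (or per column). The following is a minimal sketch using a small hypothetical `frame` rather than the notebook's `data`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [0.0, 4.0, 8.0], "b": [np.nan, 5.0, 9.0]})
mask = frame.notnull()

# A 2-D boolean mask behaves like DataFrame.where: the shape is preserved
# and False positions become NaN, so the result equals the input here.
masked = frame[mask]
print(masked.equals(frame))              # True
print(masked.equals(frame.where(mask)))  # True

# Collapsing the mask to one value per row selects only complete rows.
complete_rows = frame[mask.all(axis=1)]
print(complete_rows)
```

Selecting with `mask.all(axis=1)` gives the same rows as `frame.dropna()`.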

## Dropping Null Values

In [27]:
data

Unnamed: 0,0,1,2,3
0,0,,2,3.0
1,4,5.0,6,
2,8,9.0,10,11.0


In [28]:
data.dropna()

Unnamed: 0,0,1,2,3
2,8,9.0,10,11.0


> We cannot drop single values from a DataFrame; we can only drop entire rows or columns. Depending on the application, you might want one or the other, so `dropna` includes a number of options for a DataFrame.
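
As a rough overview of those options, the sketch below applies `how`, `thresh`, and `subset` to a small hypothetical `sample` frame (not one of the notebook's variables):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "x": [1.0, np.nan, np.nan],
    "y": [2.0, 5.0, np.nan],
    "z": [3.0, 6.0, np.nan],
})

# Default (how="any"): drop a row if any value in it is missing.
any_missing = sample.dropna()

# how="all": drop a row only if every value in it is missing.
all_missing = sample.dropna(how="all")

# thresh=2: keep rows that have at least two non-missing values.
at_least_two = sample.dropna(thresh=2)

# subset=["x"]: only look at column "x" when deciding which rows to drop.
subset_x = sample.dropna(subset=["x"])

print(any_missing.index.tolist())   # [0]
print(all_missing.index.tolist())   # [0, 1]
print(at_least_two.index.tolist())  # [0, 1]
print(subset_x.index.tolist())      # [0]
```

Each call returns a new DataFrame and leaves `sample` unchanged.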

In [48]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

In [31]:
# drop rows
df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [33]:
# drop columns
df.dropna(axis=1)

Unnamed: 0,2
0,2
1,5
2,6


In [34]:
# drop columns (alternative syntax)
df.dropna(axis='columns')

Unnamed: 0,2
0,2
1,5
2,6


In [41]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [50]:
df[3] = pd.NA

In [51]:
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [52]:
df.dropna(axis='columns', how='all')

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [53]:
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [54]:
df.dropna(axis='rows', thresh=3)

Unnamed: 0,0,1,2,3
1,2.0,3.0,5,


## Filling Null Values

In [59]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

In [61]:
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [62]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'), dtype='Int32')

In [63]:
data

a       1
b    <NA>
c       2
d    <NA>
e       3
dtype: Int32

In [64]:
data.fillna(0)

a    1
b    0
c    2
d    0
e    3
dtype: Int32

In [65]:
# fillna(method=...) is deprecated since pandas 2.1 and emits a FutureWarning; prefer data.ffill()
data.fillna(method='ffill')

a    1
b    1
c    2
d    2
e    3
dtype: Int32

In [68]:
# forward fill to propagate the previous value forward
data.ffill()

a    1
b    1
c    2
d    2
e    3
dtype: Int32

In [69]:
# backward fill to propagate the next values backward
data.bfill()

a    1
b    2
c    2
d    3
e    3
dtype: Int32

> In the case of a DataFrame, the options are similar, but we can also specify an axis along which the fills should take place.
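
A minimal sketch of filling along each axis, using a hypothetical `grid` DataFrame:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame([[1.0,    np.nan, 2.0],
                     [np.nan, 3.0,    np.nan],
                     [4.0,    np.nan, 6.0]])

# Default axis=0: propagate the previous value down each column.
down = grid.ffill()

# axis=1: propagate the previous value across each row.
across = grid.ffill(axis=1)

print(down)
print(across)
```

A missing value with no earlier value along the chosen axis stays missing: `down` still has a NaN in the first row, and `across` still has one in the first column.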