# Working with Missing Values in Pandas and NumPy

This notebook contains all the code for the Medium article *Working with Missing Values in Pandas and NumPy*.
Medium profile: https://medium.com/@ealizadeh 
LinkedIn profile: https://www.linkedin.com/in/alizadehesmaeil/


Copyright &copy; 2019 Esmaeil Alizadeh 

In [1]:
import numpy as np
import pandas as pd

## Comparing Null Objects ("=="  vs "is")

For Python's *None*, comparing with "==" or "is" gives the same result, because None is a singleton (although "is" is the idiomatic choice):

In [2]:
None == None

True

In [3]:
None is None

True

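However, when we compare NumPy's NaN object, the two operators disagree: "==" follows the IEEE 754 rule that NaN is never equal to anything, including itself, while "is" only checks whether both sides are the same object.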

In [4]:
np.nan == np.nan

False

In [5]:
np.nan is np.nan

True

This difference can cause a semantic error: the code runs without complaint but gives the wrong answer.

In [6]:
data = [1.0, np.nan, 2.0]

In [7]:
# Using "==" in the element-wise comparison
for x in data:
    if x == np.nan:
        print(f"Using '==' -->  {x} is a nan!")
    else:
        print(f"Using '==' -->  {x} is not a nan!")

Using '==' -->  1.0 is not a nan!
Using '==' -->  nan is not a nan!
Using '==' -->  2.0 is not a nan!


In [8]:
# Using "is" in the element-wise comparison
for x in data:
    if x is np.nan:
        print(f"Using 'is' -->  {x} is a nan!")
    else:
        print(f"Using 'is' -->  {x} is not a nan!")

Using 'is' -->  1.0 is not a nan!
Using 'is' -->  nan is a nan!
Using 'is' -->  2.0 is not a nan!
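
Note that the "is" check succeeds here only because the list holds the very same `np.nan` object. A NaN created any other way (for example by parsing, or as a NumPy array element) is a different object, so an identity check misses it. The following is a minimal sketch of that pitfall; `other_nan` and `arr` are illustrative names that are not part of the original notebook.

```python
import math
import numpy as np

other_nan = float("nan")            # a NaN that is not the np.nan object
arr = np.array([1.0, np.nan, 2.0])  # elements come back as np.float64 objects

# Identity comparison: False, because these are two different objects
same_object = other_nan is np.nan

# Identity fails for every array element, including the NaN
found_by_is = [x is np.nan for x in arr]

# A value-based check finds the NaN regardless of which object holds it
found_by_isnan = [math.isnan(x) for x in arr]

print(same_object, found_by_is, found_by_isnan)
```

For this reason a value-based test such as `math.isnan()`, `np.isnan()`, or `pd.isna()` is the safer choice than either "==" or "is".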


## Finding null objects in Pandas & NumPy

Use NumPy's built-in function **isnan()**, which accepts only numeric input and raises a TypeError for None:

In [9]:
np.isnan(np.nan)

True

In [10]:
try:
    np.isnan(None)
except TypeError as e:
    print("TypeError message:\n", e)

TypeError message:
 ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''


Use Pandas's built-in function **isna()** (or its alias **isnull()**), which also recognizes None and NaT:

In [11]:
pd.isna(np.nan)

True

In [12]:
pd.isna(pd.NaT)

True

In [13]:
pd.isna(None)

True
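
The scalar calls above also work element-wise on array-like input, which is how they are typically used in practice. The following is a small sketch; the Series `values` is made-up data for illustration only.

```python
import numpy as np
import pandas as pd

# None is converted to NaN when the Series is built as float64
values = pd.Series([1.0, np.nan, None, 3.0])

# Element-wise boolean mask: True where the value is missing
mask = pd.isna(values)

# Summing the mask counts the missing entries
n_missing = int(mask.sum())

print(mask.tolist(), n_missing)
```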

## Calculation with missing data

In [14]:
2 + np.nan

nan

In [15]:
np.nan / 2

nan

In [16]:
data = [1.0, 2.0, 3.0, np.nan]

#### np.mean() vs np.nanmean()

In [17]:
np.mean(data)

nan

In [18]:
np.nanmean(data)

2.0

#### np.sum() vs np.nansum()

In [19]:
np.sum(data)

nan

In [20]:
np.nansum(data)

6.0
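
For comparison, Pandas reductions skip NaN by default, so they behave like NumPy's `nan*` functions unless `skipna=False` is passed. A minimal sketch, assuming the same four values as in the cells above (the variable names are illustrative):

```python
import math
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, np.nan])

mean_default = s.mean()             # NaN is skipped by default
sum_default = s.sum()               # NaN is skipped by default
mean_strict = s.mean(skipna=False)  # NaN propagates, as with np.mean()

print(mean_default, sum_default, mean_strict)
```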

## Additional Materials

#### NaNs in *Pandas* **groupby()** method:

In [21]:
data = {
    'Fruit': ['apple', 'orange', 'strawberry', 'tangerine'], 
    'Price': [1.0, np.nan, 0.1, 1.0]
}
df = pd.DataFrame(data)
df

        Fruit  Price
0       apple    1.0
1      orange    NaN
2  strawberry    0.1
3   tangerine    1.0


In [22]:
g_obj = df.groupby(by='Price')
g_obj.groups

{0.1: Int64Index([2], dtype='int64'), 1.0: Int64Index([0, 3], dtype='int64')}

As can be seen above, rows whose group key is NaN are dropped by default. Here we cannot retrieve the fruit *orange*, since its price is missing.
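
If the rows with a missing key are needed, there are two possible workarounds. This sketch assumes pandas 1.1 or later for the `dropna` parameter of `groupby()`; `df_demo` is a fresh copy of the example data, so the notebook's `df` is left untouched.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Fruit": ["apple", "orange", "strawberry", "tangerine"],
    "Price": [1.0, np.nan, 0.1, 1.0],
})

# dropna=False keeps NaN as its own group key (assumes pandas >= 1.1)
group_sizes = df_demo.groupby("Price", dropna=False).size()

# Version-independent alternative: select the rows with a missing key directly
missing_price = df_demo[df_demo["Price"].isna()]

print(group_sizes)
print(missing_price)
```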

#### NaNs in *Pandas* **Categorical** type:

In [23]:
df['Price'] = pd.Categorical(df['Price'])
df['Price'].cat.categories

Float64Index([0.1, 1.0], dtype='float64')

NaN is not stored as a category; missing values are excluded from the list of categories.
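
Although NaN never appears among the categories, the missing entry is not lost from the data itself: a Categorical marks it with the code -1. A short sketch using the same prices as above (the name `cat` is illustrative):

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1.0, np.nan, 0.1, 1.0])

# Only the non-missing values become categories (sorted)
categories = list(cat.categories)

# Each value is stored as an index into the categories; -1 marks a missing value
codes = cat.codes.tolist()

print(categories, codes)
```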