# Handling Missing Data

Real-world data is rarely clean and homogeneous. To complicate matters further, different data sources may indicate missing data in different ways.


In this section, we will discuss some general considerations for missing data, look at how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python.


We’ll refer to missing data in general as null, NaN, or NA values.

In [0]:
import numpy as np
import pandas as pd

# Are np.nan and None the same?

In [10]:
np.nan == None

False

In [11]:
np.NaN == None  # np.NaN is an old alias of np.nan; it was removed in NumPy 2.0

False

# Is np.nan equal to np.nan?


In [12]:
np.nan == np.nan

False

# How do we check for NaN in NumPy, then?

In [13]:
np.isnan(np.nan)

True
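The comparisons above can be gathered into one short sketch. It also shows a limitation worth knowing: `np.isnan` only accepts numeric input, so it fails on `None`, whereas `pd.isna` accepts both. The variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself.
nan_equals_itself = (np.nan == np.nan)

# np.isnan works on floats and on float arrays (element-wise).
isnan_scalar = np.isnan(np.nan)
isnan_array = np.isnan(np.array([1.0, np.nan, 3.0]))

# np.isnan raises TypeError for None, because None is not a number.
try:
    np.isnan(None)
    isnan_accepts_none = True
except TypeError:
    isnan_accepts_none = False

# pd.isna treats both NaN and None as missing.
isna_nan = pd.isna(np.nan)
isna_none = pd.isna(None)

print(nan_equals_itself, isnan_scalar, isnan_array, isnan_accepts_none)
print(isna_nan, isna_none)
```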

# NaN and None in NumPy

In [14]:
type(None)

NoneType

In [15]:
vals1 = np.array([1, 3, 4])
vals1.dtype

dtype('int64')

In [16]:
%timeit vals1.sum()

The slowest run took 220.70 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.17 µs per loop


In [21]:
vals1_obj = np.array([1, 3, 4], dtype=object)
vals1_obj

array([1, 3, 4], dtype=object)

In [22]:
%timeit vals1_obj.sum()

The slowest run took 28.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.22 µs per loop


# None: Pythonic missing data

In [25]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

This dtype=object means that the best common type representation NumPy could
infer for the contents of the array is that they are Python objects.



The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an array with a None value, you will generally get an error:

In [26]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

This reflects the fact that addition between an integer and None is undefined.

# NaN: Missing numerical data


The other missing data representation, NaN (acronym for Not a Number), is different;
it is a special floating-point value recognized by all systems that use the standard
IEEE floating-point representation:

In [27]:
vals2 = np.array([1, np.nan, 3, 4])
vals2

array([ 1., nan,  3.,  4.])

In [28]:
vals2.dtype

dtype('float64')

NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, arithmetic with NaN yields another NaN:

In [30]:
1 + np.nan , 0 * np.nan

(nan, nan)

In [0]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

Aggregates over the values are well defined (i.e., they don’t
result in an error), but the result is not always useful:

In [32]:
#NumPy does provide some special aggregations that will ignore these missing values:
np.nansum(vals2)

8.0

In [33]:
np.nanmin(vals2), np.nanmax(vals2)

(1.0, 4.0)

Keep in mind that NaN is specifically a floating-point value; there is no equivalent
NaN value for integers, strings, or other types.
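A minimal sketch of what "no NaN for integers" means in practice, assuming a plain NumPy integer array: storing NaN in it fails, and the usual workaround is to convert the array to a floating-point dtype first. The variable names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])  # integer dtype

# An integer array cannot hold NaN; NumPy raises ValueError on assignment.
try:
    ints[1] = np.nan
    stored_nan_in_int_array = True
except ValueError:
    stored_nan_in_int_array = False

# Converting to float first makes room for NaN.
floats = ints.astype(float)
floats[1] = np.nan

print(stored_nan_in_int_array)
print(floats, floats.dtype)
```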

# NaN and None in Pandas


NaN and None both have their place, and Pandas is built to handle the two of them
nearly interchangeably, converting between them where appropriate:

In [34]:
x = pd.Series(range(5))
x

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [35]:
x[1] = None
x[3] = np.nan
x

0    0.0
1    NaN
2    2.0
3    NaN
4    4.0
dtype: float64

# IMPORTANT NOTE
  
    1. Both np.nan and None were converted to NaN.
    2. Introducing NaN automatically changed the dtype of the Series from 'int64' to 'float64'.


In [51]:
x = pd.Series([True, False])
x

0     True
1    False
dtype: bool

In [52]:
x[2] = None
x[3] = np.nan

x

0     True
1    False
2     None
3      NaN
dtype: object
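The two cases above (integers becoming float64, Booleans becoming object) can be reproduced at construction time as well. The sketch below builds each Series directly from a list that already contains `None`, which sidesteps item assignment entirely; the variable names are illustrative.

```python
import pandas as pd

# Integers plus a missing value: upcast to float64, None stored as NaN.
int_like = pd.Series([1, None, 3])

# Floats plus a missing value: dtype stays float64.
float_like = pd.Series([1.5, None, 3.5])

# Booleans plus a missing value: no numeric NaN fits, so dtype is object.
bool_like = pd.Series([True, False, None])

print(int_like.dtype, float_like.dtype, bool_like.dtype)
```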

In [53]:
type(np.nan)

float

In [54]:
np.nan

nan

# Operating on Null Values

As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful
methods for detecting, removing, and replacing null values in Pandas data structures.
They are:

    isnull() : Generate a Boolean mask indicating missing values
    notnull() : Opposite of isnull()
    dropna() : Return a filtered version of the data
    fillna() : Return a copy of the data with missing values filled or imputed
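As a quick orientation before the detailed sections, here is a small sketch that calls each of the four methods on one Series. The Series `s` is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None])

mask = s.isnull()      # True where a value is missing
present = s.notnull()  # True where a value is present
dropped = s.dropna()   # new Series without the missing entries
filled = s.fillna(0)   # new Series with missing entries replaced by 0

print(mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

None of these calls modifies `s`; each returns a new object.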


# Detecting null values


Pandas data structures have two useful methods for detecting null data: isnull() and
notnull() . Either one will return a Boolean mask over the data. For example:

## isnull()

Returns a Boolean mask:
True - where data is missing (np.nan or None)
False - where data is present


### isnull() example with series

In [56]:
my_series = pd.Series([1, np.nan, 'Handling Pandas Null Value', None , 3.4 ])
my_series

0                             1
1                           NaN
2    Handling Pandas Null Value
3                          None
4                           3.4
dtype: object

In [57]:
my_series.isnull()

0    False
1     True
2    False
3     True
4    False
dtype: bool

### isnull() example with dataframe

In [0]:
df = pd.DataFrame({'A':[0,1,2,np.nan, 4],
                  'B':[5,np.nan,np.nan,8,None],
                  'C':[10,11,12,13,14]})

In [59]:
df

     A    B   C
0  0.0  5.0  10
1  1.0  NaN  11
2  2.0  NaN  12
3  NaN  8.0  13
4  4.0  NaN  14


In [60]:
df.isnull()

       A      B      C
0  False  False  False
1  False   True  False
2  False   True  False
3   True  False  False
4  False   True  False
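Because `True` counts as 1 when summed, the Boolean mask is a convenient way to count missing values. The sketch below rebuilds the same DataFrame and counts missing entries per column and per row; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, np.nan, 4],
                   'B': [5, np.nan, np.nan, 8, None],
                   'C': [10, 11, 12, 13, 14]})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Number of missing values in each row.
missing_per_row = df.isnull().sum(axis=1)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(missing_per_row.tolist())
print(rows_with_missing.index.tolist())
```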


## notnull()

The opposite of isnull(): returns True where data is present and False where it is missing.


In [61]:
df.notnull()

       A      B     C
0   True   True  True
1   True  False  True
2   True  False  True
3  False   True  True
4   True  False  True


In [62]:
my_series

0                             1
1                           NaN
2    Handling Pandas Null Value
3                          None
4                           3.4
dtype: object

In [63]:
my_series.notnull()

0     True
1    False
2     True
3    False
4     True
dtype: bool
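A Boolean mask like this one can be used directly as an index to keep only the entries that are present. A minimal sketch, rebuilding the same Series under the illustrative name `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 'Handling Pandas Null Value', None, 3.4])

# Index the Series with the mask: only rows where the mask is True survive.
kept = s[s.notnull()]

print(kept)
```

The original index labels (0, 2, 4) are preserved, which is also what `dropna()` does.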

# Dropping null values

# dropna()

Removes NA values. API reference:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

## dropna() on Series

In [64]:
my_series.dropna()

0                             1
2    Handling Pandas Null Value
4                           3.4
dtype: object

In [65]:
my_series

0                             1
1                           NaN
2    Handling Pandas Null Value
3                          None
4                           3.4
dtype: object

In [67]:
my_series.dropna(inplace=True)
my_series

0                             1
2    Handling Pandas Null Value
4                           3.4
dtype: object

## dropna() on DataFrames

# Drop NaN values row-wise

By default, axis = 0, so every row containing at least one NaN is dropped.

In [69]:
df = pd.DataFrame({'A':[0,1,2,np.nan, 4],
                  'B':[5,np.nan,np.nan,8,None],
                  'C':[10,11,12,13,14]})
print(df)

df.dropna(axis=0, inplace=True)
print("\n")
print(df)

     A    B   C
0  0.0  5.0  10
1  1.0  NaN  11
2  2.0  NaN  12
3  NaN  8.0  13
4  4.0  NaN  14


     A    B   C
0  0.0  5.0  10


# Drop NaN values column-wise

With axis = 1, every column containing at least one NaN is dropped.

In [71]:
df = pd.DataFrame({'A':[0,1,2,np.nan, 4],
                  'B':[5,np.nan,np.nan,8,None],
                  'C':[10,11,12,13,14]})
print(df)
df.dropna(axis=1, inplace=True)
print("\n")
print(df)

     A    B   C
0  0.0  5.0  10
1  1.0  NaN  11
2  2.0  NaN  12
3  NaN  8.0  13
4  4.0  NaN  14


    C
0  10
1  11
2  12
3  13
4  14


## Keep only those rows having at least 2 non-NaN values

In [87]:
# thresh=2 keeps rows with at least 2 non-NaN values; rows are checked because axis = 0 by default
df = pd.DataFrame({'A':[0,1,2,np.nan, np.nan],
                  'B':[5,np.nan,np.nan,8,None],
                  'C':[10,11,12,13,None],
                  'D':[10,11,12,13,14]})
print(df)
df.dropna( thresh=2 ,inplace=True)
print("\n")
print(df)


     A    B     C   D
0  0.0  5.0  10.0  10
1  1.0  NaN  11.0  11
2  2.0  NaN  12.0  12
3  NaN  8.0  13.0  13
4  NaN  NaN   NaN  14


     A    B     C   D
0  0.0  5.0  10.0  10
1  1.0  NaN  11.0  11
2  2.0  NaN  12.0  12
3  NaN  8.0  13.0  13


In [91]:
# how='any' with axis=1 drops every column that contains at least one NaN
df = pd.DataFrame({'A':[0,1,2,np.nan, np.nan],
                  'B':[5,np.nan,np.nan,8,9],
                  'C':[10,11,12,13,14]})
print(df)
df.dropna(axis = 1, how='any' , inplace=True)
print("\n")
print(df)


     A    B   C
0  0.0  5.0  10
1  1.0  NaN  11
2  2.0  NaN  12
3  NaN  8.0  13
4  NaN  9.0  14


    C
0  10
1  11
2  12
3  13
4  14


In [92]:
# how='all' with axis=1 drops only the columns in which every value is NaN
df = pd.DataFrame({'A':[0,1,2,np.nan, np.nan],
                  'B':[5,np.nan,np.nan,8,9],
                  'C':[10,11,12,13,14],
                   'D': [np.nan,np.nan,np.nan,np.nan,np.nan]})
print(df)
df.dropna(axis = 1, how='all' , inplace=True)
print("\n")
print(df)


     A    B   C   D
0  0.0  5.0  10 NaN
1  1.0  NaN  11 NaN
2  2.0  NaN  12 NaN
3  NaN  8.0  13 NaN
4  NaN  9.0  14 NaN


     A    B   C
0  0.0  5.0  10
1  1.0  NaN  11
2  2.0  NaN  12
3  NaN  8.0  13
4  NaN  9.0  14
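The same `how` argument applies row-wise when `axis` is left at its default. The sketch below uses a small made-up DataFrame to show `how='all'` on rows, and also the `subset` parameter of `dropna()`, which is not used elsewhere in this section: it restricts the NaN check to the listed columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, np.nan, 2],
                   'B': [5, np.nan, np.nan],
                   'C': [10, np.nan, 12]})

# Drop only the rows in which every value is NaN (row 1 here).
without_empty_rows = df.dropna(how='all')

# Drop rows that have NaN in column 'B', ignoring the other columns.
b_required = df.dropna(subset=['B'])

print(without_empty_rows.index.tolist())
print(b_required.index.tolist())
```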


# Filling null values

Sometimes rather than dropping NA values, you’d rather replace them with a valid
value. This value might be a single number like zero, or it might be some sort of
imputation or interpolation from the good values. You could do this in-place using
the isnull() method as a mask, but because it is such a common operation Pandas
provides the fillna() method, which returns a copy of the array with the null values
replaced.
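To make the comparison concrete, the sketch below fills missing values both ways: by assigning through an `isnull()` mask on a copy, and by calling `fillna()`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Manual approach: assign through the Boolean mask (on a copy, to keep s intact).
masked = s.copy()
masked[masked.isnull()] = 0

# Built-in approach: fillna returns a new Series and leaves s unchanged.
filled = s.fillna(0)

print(masked.equals(filled))
print(s.isnull().sum())
```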

In [93]:
my_series = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
my_series

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

We can fill NA entries with a single value, such as zero:

In [94]:
my_series.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

We can specify a forward-fill to propagate the previous value forward:

In [0]:
my_series.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Or we can specify a back-fill to propagate the next values backward:

In [0]:
my_series.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [95]:
df = pd.DataFrame({'A':[0,1,2,np.nan, np.nan],
                  'B':[5,np.nan,np.nan,8,9],
                  'C':[10,11,12,13,14],
                   'D': [np.nan,np.nan,np.nan,np.nan,np.nan]})

df.fillna('Fill Value')

            A           B   C           D
0           0           5  10  Fill Value
1           1  Fill Value  11  Fill Value
2           2  Fill Value  12  Fill Value
3  Fill Value           8  13  Fill Value
4  Fill Value           9  14  Fill Value


In [97]:
df = pd.DataFrame({'A':[0,1,2,np.nan, np.nan],
                  'B':[5,np.nan,np.nan,8,9],
                  'C':[10,11,12,13,14],
                   'D': [np.nan,np.nan,np.nan,np.nan,np.nan]})

df.fillna(0)

     A    B   C    D
0  0.0  5.0  10  0.0
1  1.0  0.0  11  0.0
2  2.0  0.0  12  0.0
3  0.0  8.0  13  0.0
4  0.0  9.0  14  0.0


In [100]:
df = pd.DataFrame({'A':[0,1,2,np.nan, np.nan],
                  'B':[5,np.nan,np.nan,8,9],
                  'C':[10,11,12,13,14],
                   'D': [np.nan,np.nan,np.nan,np.nan,np.nan]})


df.ffill()


     A    B   C   D
0  0.0  5.0  10 NaN
1  1.0  5.0  11 NaN
2  2.0  5.0  12 NaN
3  2.0  8.0  13 NaN
4  2.0  9.0  14 NaN


Notice that if a previous value is not available during a forward fill, the NA value remains.
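One common way to deal with the leftover leading NaN is to chain a back-fill after the forward fill. The sketch below shows that on a made-up Series, along with the `limit` parameter of `ffill()`, which caps how many consecutive NaN values are filled. Whether either choice is appropriate depends on the data, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward fill alone: the first value has no predecessor and stays NaN.
forward = s.ffill()

# Forward fill, then back-fill whatever is still missing at the start.
both = s.ffill().bfill()

# Fill at most one consecutive NaN after each valid value.
limited = s.ffill(limit=1)

print(forward.tolist())
print(both.tolist())
print(limited.tolist())
```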

In [101]:
df = pd.DataFrame({'A':[0,1,2,np.nan, np.nan],
                  'B':[5,np.nan,np.nan,8,9],
                  'C':[10,11,12,13,14],
                   'D': [np.nan,np.nan,np.nan,np.nan,np.nan]})
df.bfill()


     A    B   C   D
0  0.0  5.0  10 NaN
1  1.0  8.0  11 NaN
2  2.0  8.0  12 NaN
3  NaN  8.0  13 NaN
4  NaN  9.0  14 NaN


In [102]:
df = pd.DataFrame({'A':[0,1,2,np.nan, np.nan,99],
                  'B':[5,np.nan,np.nan,8,9,99],
                  'C':[10,np.nan,12,13,14,99],
                   'D': [np.nan,np.nan,np.nan,np.nan,np.nan,99]})

df.bfill()

      A     B     C     D
0   0.0   5.0  10.0  99.0
1   1.0   8.0  12.0  99.0
2   2.0   8.0  12.0  99.0
3  99.0   8.0  13.0  99.0
4  99.0   9.0  14.0  99.0
5  99.0  99.0  99.0  99.0


In [103]:
df

      A     B     C     D
0   0.0   5.0  10.0   NaN
1   1.0   NaN   NaN   NaN
2   2.0   NaN  12.0   NaN
3   NaN   8.0  13.0   NaN
4   NaN   9.0  14.0   NaN
5  99.0  99.0  99.0  99.0


In [104]:
df['A'].mean()

25.5

# Replacing the NaN values with the mean of the respective column


In [105]:
df['A'].fillna(value=df['A'].mean())

0     0.0
1     1.0
2     2.0
3    25.5
4    25.5
5    99.0
Name: A, dtype: float64

# Replacing the NaN values with the median of the respective column


In [106]:
df['A'].median()

1.5

In [107]:
df['A'].fillna(value = df['A'].median() )

0     0.0
1     1.0
2     2.0
3     1.5
4     1.5
5    99.0
Name: A, dtype: float64
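The mean and median fills above work on one column at a time. As a sketch of how the same idea extends to a whole DataFrame: `fillna()` also accepts a Series, whose index is matched against the column names, so passing `df.mean()` or `df.median()` fills each column with its own statistic. The DataFrame below reuses columns A, B and C from the example; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, np.nan, np.nan, 99],
                   'B': [5, np.nan, np.nan, 8, 9, 99],
                   'C': [10, np.nan, 12, 13, 14, 99]})

# df.mean() is a Series indexed by column name; each column gets its own mean.
mean_filled = df.fillna(df.mean())

# Same pattern with the median, which is less sensitive to the outlier 99.
median_filled = df.fillna(df.median())

print(mean_filled)
print(median_filled)
```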

# Replacing the NaN values with the mode of the respective column


In [0]:
df = pd.DataFrame({'A':[0,1,2,2,np.nan, np.nan,99],
                  'B':[5,np.nan,5,np.nan,6,6,99],
                  'C':[10,np.nan,10,10,14,99,14],
                  'D':[1,np.nan,1,2,3,3,2] })

In [109]:
df['D'].mode()

0    1.0
1    2.0
2    3.0
dtype: float64

In [110]:
df['D'].fillna( value = df['D'].mode())

0    1.0
1    2.0
2    1.0
3    2.0
4    3.0
5    3.0
6    2.0
Name: D, dtype: float64

In [111]:
df['C'].mode()

0    10.0
dtype: float64

Question for you: why are some of the NaN values not replaced with the mode?

In [113]:
df['A'].fillna( value = df['A'].mode())

0     0.0
1     1.0
2     2.0
3     2.0
4     NaN
5     NaN
6    99.0
Name: A, dtype: float64

In [114]:
df['B'].fillna( value = df['B'].mode())

0     5.0
1     6.0
2     5.0
3     NaN
4     6.0
5     6.0
6    99.0
Name: B, dtype: float64

In [0]:
df['C'].fillna( value = df['C'].mode())

0    10.0
1     NaN
2    10.0
3    10.0
4    14.0
5    99.0
6    14.0
Name: C, dtype: float64
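One way to read the outputs above: `mode()` returns a Series (there can be several modes), and when `fillna()` is given a Series it matches values by index label rather than broadcasting a single value. A NaN is therefore filled only if the mode Series happens to have an entry with the same label. The sketch below reproduces this for column A and then passes a scalar instead, picking the first mode with `.iloc[0]`; choosing the first mode is an assumption made here, not something the examples above prescribe.

```python
import numpy as np
import pandas as pd

a = pd.Series([0, 1, 2, 2, np.nan, np.nan, 99], name='A')

# mode() returns a Series; here it holds one value (2.0) at index label 0.
mode_a = a.mode()

# Index-aligned fill: the NaNs sit at labels 4 and 5, which mode_a lacks.
aligned = a.fillna(mode_a)

# Scalar fill: take the first mode and apply it to every NaN.
scalar = a.fillna(mode_a.iloc[0])

print(mode_a.tolist(), mode_a.index.tolist())
print(aligned.tolist())
print(scalar.tolist())
```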