# **Handling Missing Data with Pandas**

In [1]:
import numpy as np
import pandas as pd

### **Pandas Utility Functions**

In [2]:
pd.isnull(np.nan)

True

In [3]:
pd.isnull(None)

True

In [4]:
pd.isna(None)

True

In [5]:
pd.isna(np.nan)

True

So pandas treats both `None` and `NaN` as null values.
The opposites of `isna()` and `isnull()` are `notna()` and `notnull()`:

In [6]:
pd.notnull(np.nan)

False

In [7]:
pd.notna(np.nan)

False

In [8]:
pd.notnull(None)

False

In [9]:
pd.notna(None)

False

In [10]:
pd.notna(3)

True

In [11]:
pd.notna('a')

True

In [12]:
pd.isnull(pd.Series([1,3,np.nan,None,5]))

0    False
1    False
2     True
3     True
4    False
dtype: bool

In [13]:
pd.notnull(pd.Series([1,2,3,np.nan,5]))

0     True
1     True
2     True
3    False
4     True
dtype: bool

In [14]:
pd.isnull(pd.DataFrame({
    'COl_0':[1,2,np.nan,3],
    'COL_1':[3,6,7,np.nan],
    'COL_2':[4,np.nan,5,np.nan]
}))

Unnamed: 0,COl_0,COL_1,COL_2
0,False,False,False
1,False,False,True
2,True,False,False
3,False,True,True
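
As a quick sanity check on what these utility functions count as "missing", the sketch below probes a few scalar values. The lists `missing_like` and `not_missing` are illustrative names that are not part of the original notebook, and `pd.NA` and `pd.NaT` are pandas' own missing-value markers that the cells above do not use.

```python
import numpy as np
import pandas as pd

# Values that pandas treats as missing
missing_like = [np.nan, None, pd.NaT, pd.NA]
missing_flags = [pd.isna(v) for v in missing_like]

# Values that may look "empty" but are NOT treated as missing
not_missing = ['', 0, False, 'NaN']
not_missing_flags = [pd.isna(v) for v in not_missing]

print(missing_flags)
print(not_missing_flags)
```

Empty strings, zeros, and the literal text `'NaN'` are ordinary values, so they usually need to be converted explicitly before the null-handling tools in this notebook will catch them.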


### Pandas Operation with Missing Values

Pandas handles missing values more gracefully than NumPy. In aggregations such as `count`, `sum`, and `mean`, NaNs no longer behave like "viruses" that contaminate the result; they are skipped by default:

In [15]:
pd.Series([1,2,np.nan]).count()

2

In [16]:
pd.Series([1,2,3,np.nan]).sum()

6.0

In [17]:
pd.Series([1,np.nan,3]).mean()

2.0
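
To make the contrast with NumPy concrete, here is a minimal sketch comparing the two on the same data. The variable names are illustrative, and `skipna=False` is shown only to demonstrate that the NumPy-style propagation can be restored when needed.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, np.nan]

# NumPy: a single NaN propagates through the whole sum
numpy_total = np.sum(np.array(values))

# pandas: NaN is skipped by default
pandas_total = pd.Series(values).sum()

# pandas with skipna=False: NaN propagates again
strict_total = pd.Series(values).sum(skipna=False)

print(numpy_total, pandas_total, strict_total)
```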

### Filtering Missing Data

Just as in NumPy, we can combine `isnull` or `notnull` with boolean selection to filter NaN and null values in or out:

In [18]:
s=pd.Series([1,2,3,np.nan])

In [19]:
pd.isnull(s)

0    False
1    False
2    False
3     True
dtype: bool

In [20]:
pd.notnull(s)

0     True
1     True
2     True
3    False
dtype: bool

In [21]:
pd.notnull(s).sum()  # each True counts as 1, so True + True + True = 3

3

In [22]:
pd.isnull(s).sum()

1

In [23]:
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
dtype: float64

In [24]:
s[pd.isnull(s)]

3   NaN
dtype: float64
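
The same boolean-selection idea carries over to DataFrames, where the mask usually comes from a single column. The sketch below uses a small made-up `people` table (hypothetical data, not from the original notebook) to keep or isolate rows based on whether `age` is known.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
people = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cy'],
    'age': [31, np.nan, 45],
})

# Keep only the rows where 'age' is present
known_age = people[people['age'].notnull()]

# Keep only the rows where 'age' is missing
unknown_age = people[people['age'].isnull()]

print(known_age)
print(unknown_age)
```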

### Dropping (Removing) Null Values

Instead of combining boolean selection with `notnull()` to drop null values, we can use the **dropna()** method:

In [25]:
s

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [26]:
s.dropna()

0    1.0
1    2.0
2    3.0
dtype: float64

### Dropping null values on DataFrames

For a Series, dropping nulls is straightforward with **dropna()**. With DataFrames there are a few more things to consider, because we cannot drop single values; we can only drop entire rows or columns. Let's see how it works:

In [27]:
df=pd.DataFrame({
    'ColumnA':[np.nan,1,np.nan,np.nan],
    'ColumnB': [np.nan,2,3,np.nan],
    'ColumnC':[4,5,6,np.nan],
    'ColumnD':[6,7,8,9]
}
)

In [28]:
df

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,,,4.0,6
1,1.0,2.0,5.0,7
2,,3.0,6.0,8
3,,,,9


In [29]:
df.shape

(4, 4)

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ColumnA  1 non-null      float64
 1   ColumnB  2 non-null      float64
 2   ColumnC  3 non-null      float64
 3   ColumnD  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 260.0 bytes


In [31]:
df.isnull()

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,True,True,False,False
1,False,False,False,False
2,True,False,False,False
3,True,True,True,False


In [32]:
df.isnull().sum()

ColumnA    3
ColumnB    2
ColumnC    1
ColumnD    0
dtype: int64

In [33]:
df.dropna()

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
1,1.0,2.0,5.0,7


By default, **dropna()** drops every row that contains at least one null value.

We can use the **axis** parameter to drop columns containing null values instead:

In [34]:
df.dropna(axis=1)  # axis=0 (the default) drops rows; axis=1 drops columns

Unnamed: 0,ColumnD
0,6
1,7
2,8
3,9


We can also use the **how** parameter, which can be either `'any'` (drop the row or column if any of its values is NaN; this is the default) or `'all'` (drop it only if all of its values are NaN):

In [35]:
df.dropna(how='any')

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
1,1.0,2.0,5.0,7


In [36]:
df.dropna(how='all')

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,,,4.0,6
1,1.0,2.0,5.0,7
2,,3.0,6.0,8
3,,,,9


In [37]:
df.dropna(how='any',axis=1)

Unnamed: 0,ColumnD
0,6
1,7
2,8
3,9


**dropna()** also accepts a **thresh** parameter: the minimum number of non-null values a row or column must have in order to be kept:

In [38]:
df.dropna(thresh=2, axis=1)

Unnamed: 0,ColumnB,ColumnC,ColumnD
0,,4.0,6
1,2.0,5.0,7
2,3.0,6.0,8
3,,,9


In [39]:
df.dropna(thresh=3)

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
1,1.0,2.0,5.0,7
2,,3.0,6.0,8
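
Besides `axis`, `how`, and `thresh`, `dropna()` also has a `subset` parameter that restricts the null check to chosen columns. It is not covered in the cells above, so treat the following as a small supplementary sketch on the same `df`; the name `rows_with_b_and_c` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ColumnA': [np.nan, 1, np.nan, np.nan],
    'ColumnB': [np.nan, 2, 3, np.nan],
    'ColumnC': [4, 5, 6, np.nan],
    'ColumnD': [6, 7, 8, 9],
})

# Drop a row only if ColumnB or ColumnC is null; nulls in ColumnA are ignored
rows_with_b_and_c = df.dropna(subset=['ColumnB', 'ColumnC'])

print(rows_with_b_and_c)
```

Row 2 survives here even though its `ColumnA` value is null, because only the listed columns are inspected.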


### **Filling Null Values** 

Instead of dropping null values, we can replace them with some other value. Sometimes **NaN** can be replaced by `0`, and sometimes by the `mean` of the remaining values. There are various ways to do so:

In [40]:
s

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [41]:
s.fillna(0)

0    1.0
1    2.0
2    3.0
3    0.0
dtype: float64

In [42]:
s.fillna(s.mean())

0    1.0
1    2.0
2    3.0
3    2.0
dtype: float64

In [43]:
s

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
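
Note that `s` still contains its `NaN` after the calls above: `fillna` returns a new Series and leaves the original untouched. A minimal sketch of keeping the filled result, with `filled` as an illustrative variable name:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan])

# fillna returns a new Series; assign it to keep the result
filled = s.fillna(0)

original_still_has_nan = s.isnull().any()
filled_has_nan = filled.isnull().any()

print(original_still_has_nan, filled_has_nan)
```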

#### **Filling nulls with contiguous (close) values**

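We can pass the `method` argument to fill each null with a neighbouring value: `'ffill'` carries the previous value forward and `'bfill'` carries the next value backward. In pandas 2.1 and later this argument is deprecated in favor of the dedicated `ffill()` and `bfill()` methods.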

In [44]:
s.fillna(method='ffill') #fills from above

0    1.0
1    2.0
2    3.0
3    3.0
dtype: float64

In [45]:
s.fillna(method='bfill') #fills from below

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

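This can still leave null values at the extremes of the Series/DataFrame: a forward fill cannot fill a leading null, and a backward fill cannot fill a trailing one, as the last output shows.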
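
One common way to deal with those leftover edge values is to chain a forward fill with a backward fill. The sketch below uses the `ffill()` and `bfill()` methods on a made-up Series named `edges` (not part of the original notebook); whether filling in both directions is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with nulls at the start, middle, and end
edges = pd.Series([np.nan, 1, np.nan, 3, np.nan])

# Forward fill alone leaves the leading NaN in place
forward_only = edges.ffill()

# Forward fill, then backward fill for whatever is still missing
both = edges.ffill().bfill()

print(forward_only)
print(both)
```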

#### **Filling Null values on DataFrame:**

The **fillna** method also works on DataFrames. The main differences are that you can specify the **axis** (as usual, rows or columns) along which to fill the values (especially relevant for the fill methods) and that you have more control over the values passed, for example a different fill value per column:

In [46]:
df

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,,,4.0,6
1,1.0,2.0,5.0,7
2,,3.0,6.0,8
3,,,,9


In [47]:
df.fillna({'ColumnA':0,'ColumnB':99, 'ColumnC':df['ColumnC'].mean()})

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,0.0,99.0,4.0,6
1,1.0,2.0,5.0,7
2,0.0,3.0,6.0,8
3,0.0,99.0,5.0,9


In [49]:
df.fillna(method='bfill',axis=1)

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,4.0,4.0,4.0,6.0
1,1.0,2.0,5.0,7.0
2,3.0,3.0,6.0,8.0
3,9.0,9.0,9.0,9.0


In [50]:
df.fillna(method='bfill',axis=0)

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,1.0,2.0,4.0,6
1,1.0,2.0,5.0,7
2,,3.0,6.0,8
3,,,,9


In [51]:
df.fillna(method='ffill',axis=0)

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,,,4.0,6
1,1.0,2.0,5.0,7
2,1.0,3.0,6.0,8
3,1.0,3.0,6.0,9


In [52]:
df.fillna(method='ffill',axis=1)

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
0,,,4.0,6.0
1,1.0,2.0,5.0,7.0
2,,3.0,6.0,8.0
3,,,,9.0
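
The per-column dictionary shown earlier can also be replaced by a Series of fill values indexed by column name. As a sketch under that assumption, passing `df.mean()` fills every column with its own mean; `filled_with_means` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ColumnA': [np.nan, 1, np.nan, np.nan],
    'ColumnB': [np.nan, 2, 3, np.nan],
    'ColumnC': [4, 5, 6, np.nan],
    'ColumnD': [6, 7, 8, 9],
})

# df.mean() is a Series indexed by column name; fillna matches on those names
filled_with_means = df.fillna(df.mean())

print(filled_with_means)
```

Mean imputation is only one option; it keeps the column mean unchanged but reduces the variance, so it may not suit every analysis.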


### **Checking if there are NAs**

#### **Example 1: Checking the length**

In [58]:
len(s)

4

In [53]:
s.dropna().count()

3

In [55]:
missing_values=len(s.dropna())!=len(s)
missing_values

True

There's also a `count` method that excludes `nan`s from its result:

In [56]:
len(s)

4

In [57]:
s.count()

3

So we could just do:

In [59]:
missing_values=len(s)!=s.count()
missing_values

True

#### **More Pythonic solution: `any`**

The methods `any` and `all` check whether there is any `True` value in a Series or whether all of its values are `True`, respectively. They work the same way as Python's built-in `any` and `all`:

In [60]:
pd.Series([True,True,False,True]).any()

True

In [61]:
pd.Series([True,True,False,True]).all()

False

In [62]:
pd.Series([True,True,True,True]).any()

True

The `isnull()` method returns a Boolean `Series` with `True` wherever there is a `nan`:

In [63]:
s.isnull()

0    False
1    False
2    False
3     True
dtype: bool

In [66]:
pd.Series([1,np.nan]).isnull().any()  # isnull() returns [False, True]

True

In [65]:
pd.Series([1,2]).isnull().any()

False

In [67]:
s.isnull().any()

True

In [68]:
s.isnull().values

array([False, False, False,  True])

In [69]:
s.isnull().values.any()

True
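
The same `isnull()` plus `any()` pattern extends to DataFrames, where `any` reduces along one axis at a time. The sketch below applies it to the `df` used earlier; the result names are illustrative, and `Series.hasnans` is shown as a shorthand that the cells above do not use.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ColumnA': [np.nan, 1, np.nan, np.nan],
    'ColumnB': [np.nan, 2, 3, np.nan],
    'ColumnC': [4, 5, 6, np.nan],
    'ColumnD': [6, 7, 8, 9],
})

# Per column: does this column contain at least one null?
columns_with_nulls = df.isnull().any()

# Per row: does this row contain at least one null?
rows_with_nulls = df.isnull().any(axis=1)

# Whole frame: is there any null anywhere?
frame_has_nulls = df.isnull().values.any()

# Series shorthand for isnull().any()
column_a_has_nans = df['ColumnA'].hasnans

print(columns_with_nulls)
print(rows_with_nulls)
print(frame_has_nulls, column_a_has_nans)
```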