## Handling Missing Data Using Pandas

- Real-world datasets very commonly contain missing data
- Missing data must be handled because most ML models and algorithms cannot work with missing values
- pandas represents missing data with `NaN` **(Not a Number)**, and there are also specialized null values such as `pd.NaT` **(Not a Time)** for datetimes and the newer `pd.NA`
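As a quick illustration of these markers, the sketch below builds one small Series per marker and checks that `isna()` flags each of them. The variable names and sample values are invented for this example and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, only for illustration.
scores = pd.Series([8.0, np.nan, 6.0])                    # float column: NaN
dates = pd.Series(pd.to_datetime(["2020-01-01", None]))   # datetime column: NaT
counts = pd.Series([1, None, 3], dtype="Int64")           # nullable integer column: pd.NA

# isna() treats all three markers as missing.
print(scores.isna().tolist())
print(dates.isna().tolist())
print(counts.isna().tolist())
```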

#### Common options to handle missing data
**Keep it**

      i. Easy to do; no manipulation of the data is needed
      ii. Most ML models do not support missing data
  
**Drop/Remove it**

      i. Easy to do
      ii. Can be based on rules
      iii. Potential loss of data or useful information if a large share of the data is missing
      iv. Limits trained models on future data (the model never learns to handle missing data)
      
     - We can drop a row if most of its column values are missing
     - We can drop a column if most of its values are missing

<span style="background-color:red; color:white; padding:2px">NOTE</span>: We can use this approach if we have >5 % of data missing
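A rule-based drop can be sketched with the fraction of missing values per column or per row. The data below is made up, and the `max_missing` cutoff of 0.5 is only an assumed value chosen for this illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "c" is mostly empty, row 2 is completely empty.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, 6.0, np.nan, 8.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

max_missing = 0.5  # assumed cutoff, tune it for your own data

# Fraction of missing values per column; keep the columns at or under the cutoff.
col_missing = df.isna().mean()
df_cols = df.loc[:, col_missing <= max_missing]

# Same idea per row (axis=1 averages across the columns of each row).
row_missing = df.isna().mean(axis=1)
df_rows = df.loc[row_missing <= max_missing]

print(list(df_cols.columns))
print(list(df_rows.index))
```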
  
**Replace it (Impute)**

      i. Potential to save a lot of training data
      ii. The trained model can handle missing data in the future
      iii. Hardest to do
      iv. Potential to lead to false conclusions

    - We can fill all missing data with the same value
    - We can fill missing data with an interpolated or estimated value
    - We can fill missing data with random or arbitrary data
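The fill strategies listed above can be sketched on a toy Series. The values are hypothetical, and drawing the random fill from the observed values is just one possible choice, assumed here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])  # hypothetical values

same_value = s.fillna(0)          # every gap gets the same constant
estimated = s.fillna(s.mean())    # every gap gets an estimate (here the mean, 3.0)
interpolated = s.interpolate()    # each gap is filled linearly from its neighbours

# Random fill: draw replacements from the values that were observed.
rng = np.random.default_rng(0)
mask = s.isna()
random_fill = s.copy()
random_fill[mask] = rng.choice(s.dropna().to_numpy(), size=mask.sum())

print(same_value.tolist())
print(estimated.tolist())
print(interpolated.tolist())
```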

In [1]:
import pandas as pd
import numpy as np

In [2]:
np.nan

nan

In [4]:
pd.NA

<NA>

In [6]:
pd.NaT

NaT

In [7]:
np.nan == np.nan

False

<span style="background-color:red; color:white; padding:2px">IMPORTANT</span>: The comparison above returns `False`. Under IEEE 754 floating-point rules, `NaN` never compares equal to anything, including itself. Intuitively, both values are unknown, so Python cannot say they are equal.
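Because `==` cannot detect `NaN`, a dedicated check is needed. A minimal sketch of the usual options, with `value` as an illustrative name:

```python
import math

import numpy as np
import pandas as pd

value = np.nan

print(value == value)      # False: NaN is never equal to itself
print(value != value)      # True: the classic self-inequality test
print(math.isnan(value))   # True
print(np.isnan(value))     # True
print(pd.isna(value))      # True; pd.isna also accepts None, pd.NaT and pd.NA
```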

In [8]:
# but an identity check with `is` returns True, because both names refer to the same object
np.nan is np.nan

True

In [9]:
myvar = np.nan

In [10]:
myvar is np.nan

True
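The `is` check only succeeds when the value is the very same object as `np.nan`, so it should not be relied on for detecting missing values. A small sketch, using the illustrative names `other_nan` and `numpy_nan`:

```python
import numpy as np
import pandas as pd

other_nan = float("nan")        # a different float object that is also NaN
numpy_nan = np.float64("nan")   # a NumPy scalar that is also NaN

print(other_nan is np.nan)      # False: not the same object
print(numpy_nan is np.nan)      # False: not the same object
print(pd.isna(other_nan), pd.isna(numpy_nan))  # True True
```

`pd.isna` (or `np.isnan`) checks the value rather than the object identity, which is why it is the safer test.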

In [12]:
df = pd.read_csv(filepath_or_buffer='./datasets/movie_scores.csv')

In [16]:
df.shape

(5, 6)

In [14]:
# check for null values
df.isna().sum()

first_name          1
last_name           1
age                 1
sex                 1
pre_movie_score     2
post_movie_score    2
dtype: int64

In [15]:
df.isnull().sum()

first_name          1
last_name           1
age                 1
sex                 1
pre_movie_score     2
post_movie_score    2
dtype: int64
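`isnull()` is an alias of `isna()`, which is why both cells give the same counts. The sketch below re-creates the five rows shown in this notebook by hand (the CSV file itself is not loaded here) and also reports the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the rows displayed in this notebook.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# The two masks are identical.
print(df.isna().equals(df.isnull()))

# The mean of a boolean mask is the fraction of missing values per column.
percent_missing = df.isna().mean() * 100
print(percent_missing)
```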

In [18]:
df['pre_movie_score'].notnull()

0     True
1    False
2    False
3     True
4     True
Name: pre_movie_score, dtype: bool

In [19]:
df[df['pre_movie_score'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
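The same filtering can also be written with the `subset` parameter of `dropna`, which limits the null check to the listed columns. A sketch on a hand-built copy of the data shown above:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the rows displayed in this notebook.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

by_mask = df[df["pre_movie_score"].notnull()]        # boolean-mask filtering
by_subset = df.dropna(subset=["pre_movie_score"])    # same rows via dropna

print(by_mask.equals(by_subset))
```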


In [23]:
df[df['pre_movie_score'].isnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
1,,,,,,
2,Hugh,Jackman,51.0,m,,


In [25]:
df.dropna(how='all') # drop rows in which every column is missing

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [26]:
df = df.dropna(how='all')

In [27]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [28]:
df_copy = df.copy()

In [30]:
df_copy.dropna()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [31]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [35]:
# keep a row only if at least N (here 2) of its columns have non-null data
df.dropna(thresh=2)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [37]:
# keep a row only if at least 4 of its columns have non-null data
df.dropna(thresh=4)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [38]:
# keep a row only if at least 5 of its columns have non-null data
df.dropna(thresh=5)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
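`thresh` counts non-null values, not missing ones: a row survives when it has at least `thresh` non-null columns. The sketch below makes that count explicit on a hand-built copy of the four remaining rows; `non_null_per_row` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the DataFrame after dropping the all-missing row.
df = pd.DataFrame({
    "first_name": ["Tom", "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", "Jackman", "Winfrey", "Stone"],
    "age": [63.0, 51.0, 66.0, 31.0],
    "sex": ["m", "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, 8.0, 9.0],
}, index=[0, 2, 3, 4])

# Number of non-null values in each row.
non_null_per_row = df.notna().sum(axis=1)
print(non_null_per_row.tolist())

# dropna(thresh=5) keeps exactly the rows with at least 5 non-null values.
manual = df[non_null_per_row >= 5]
print(manual.equals(df.dropna(thresh=5)))
```

Row 2 has only 4 non-null values, so it is kept with `thresh=4` and dropped with `thresh=5`.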


In [41]:
# to drop columns instead of rows, use the parameter `axis=1`
df.dropna(axis=1)

Unnamed: 0,first_name,last_name,age,sex
0,Tom,Hanks,63.0,m
2,Hugh,Jackman,51.0,m
3,Oprah,Winfrey,66.0,f
4,Emma,Stone,31.0,f


In [43]:
df.dropna(axis=1, how='all')

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [44]:
df.dropna(axis=1, how='any')

Unnamed: 0,first_name,last_name,age,sex
0,Tom,Hanks,63.0,m
2,Hugh,Jackman,51.0,m
3,Oprah,Winfrey,66.0,f
4,Emma,Stone,31.0,f


In [46]:
df.dropna(axis=1, thresh=2)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [47]:
df.dropna(axis=1, thresh=3)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [48]:
df.dropna(axis=1, thresh=4)

Unnamed: 0,first_name,last_name,age,sex
0,Tom,Hanks,63.0,m
2,Hugh,Jackman,51.0,m
3,Oprah,Winfrey,66.0,f
4,Emma,Stone,31.0,f


In [49]:
## Filling missing data
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [51]:
df.fillna(value=0) # fill all na values with 0

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,0.0,0.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [52]:
help(df.fillna)

Help on method fillna in module pandas.core.generic:

fillna(value: 'Hashable | Mapping | Series | DataFrame | None' = None, *, method: 'FillnaOptions | None' = None, axis: 'Axis | None' = None, inplace: 'bool_t' = False, limit: 'int | None' = None, downcast: 'dict | None | lib.NoDefault' = <no_default>) -> 'Self | None' method of pandas.core.frame.DataFrame instance
    Fill NA/NaN values using the specified method.
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame).  Values not
        in the dict/Series/DataFrame will not be filled. This value cannot
        be a list.
    method : {'backfill', 'bfill', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series:
    
        * ffill: propagate last valid observation forward to next
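Forward and backward filling can be sketched with the dedicated `ffill()` and `bfill()` methods. Recent pandas releases deprecate the `method` argument of `fillna` in favour of these, so they are used here; the Series values are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([8.0, np.nan, 6.0, np.nan])  # hypothetical scores

forward = s.ffill()    # carry the last valid value forward
backward = s.bfill()   # pull the next valid value backward

print(forward.tolist())
print(backward.tolist())
```

A trailing gap has no later value to copy, so `bfill()` leaves it as `NaN` (and `ffill()` does the same for a leading gap).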

In [53]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [56]:
# to fill pre_movie_score with mean and post_movie_score with 0
values = {'pre_movie_score': df['pre_movie_score'].mean(), 'post_movie_score': 0}

In [57]:
df.fillna(value=values)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,7.0,0.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [58]:
list('ABCD')

['A', 'B', 'C', 'D']

In [63]:
print(list('hare krishna hare krishna'))

['h', 'a', 'r', 'e', ' ', 'k', 'r', 'i', 's', 'h', 'n', 'a', ' ', 'h', 'a', 'r', 'e', ' ', 'k', 'r', 'i', 's', 'h', 'n', 'a']


In [66]:
'hare krishna hare krishna'.split()

['hare', 'krishna', 'hare', 'krishna']

In [67]:
'abcd'.split()

['abcd']

In [69]:
# we can also select a specific column and apply fillna to it
df['pre_movie_score'].fillna(0)

0    8.0
2    0.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [70]:
df['post_movie_score'].fillna(df['post_movie_score'].mean())

0    10.0
2     9.0
3     8.0
4     9.0
Name: post_movie_score, dtype: float64

In [73]:
# if all the columns are numeric, we can simply fill with the column means as below
df.fillna(df.mean())

TypeError: Could not convert ['TomHughOprahEmma' 'HanksJackmanWinfreyStone' 'mmff'] to numeric

In [74]:
# with mixed dtypes, restrict the mean to the numeric columns
df.fillna(df.mean(numeric_only=True))

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
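An alternative that makes the numeric-only step explicit is to select the numeric columns first with `select_dtypes` and fill only those. This is a sketch on a hand-built copy of the data; `numeric_cols` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the DataFrame shown above.
df = pd.DataFrame({
    "first_name": ["Tom", "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", "Jackman", "Winfrey", "Stone"],
    "age": [63.0, 51.0, 66.0, 31.0],
    "sex": ["m", "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, 8.0, 9.0],
}, index=[0, 2, 3, 4])

# Pick out the numeric columns and fill each with its own mean.
numeric_cols = df.select_dtypes(include="number").columns
filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(filled.loc[2, ["pre_movie_score", "post_movie_score"]].tolist())
```

Text columns are left untouched, so any missing names would still need a separate strategy.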
