# How to handle missing data
### You'll learn the basics of handling NA/NaN/null values

In [1]:
import pandas as pd 
import numpy  as np

# From NumPy we only need np.nan

### First Steps
    1 - Create a small dataset to work with.
    2 - Turn it into a DataFrame.
    3 - Try to understand it.

In [2]:
people_data = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'], 
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'], 
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com', None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
}

data_frame = pd.DataFrame(people_data)

 
    As we can see, there are a lot of missing values. This is a small dataset, though.
    If there were thousands of rows, reading the table by eye wouldn't really help us
    detect missing values:


In [3]:
data_frame

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


### Looking for NaN values
    So we should use some functions to better understand the data,
    in this case to find out more about the missing values.
 

In [4]:
data_frame.isna()

Unnamed: 0,first,last,email,age
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,True,False
4,True,True,True,True
5,True,True,False,True
6,False,False,False,False
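
    On a large table the full True/False grid is hard to read too. A minimal sketch,
    assuming a small hypothetical frame called `sample` (not part of this notebook),
    of how the same .isna() mask can be used to show only the rows that contain at
    least one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only rows 1 and 2 contain a missing value.
sample = pd.DataFrame({
    'first': ['Ana', None, 'Cy'],
    'age': ['30', '41', np.nan],
})

# True for every row that has at least one missing value.
row_has_missing = sample.isna().any(axis=1)

# Boolean indexing keeps just those rows.
rows_with_missing = sample[row_has_missing]
print(rows_with_missing)
```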


In [5]:
# Count, per column, the missing values pandas detects on its own
data_frame.isna().sum() 

first    2
last     2
email    2
age      2
dtype: int64

    If you look closer, you'll notice that the ['age'] column has 3 missing values (at rows 4, 5 and 6),
    but the previous function tells us there are only 2. Row 6 holds the strings 'NA' and 'Missing',
    which pandas treats as ordinary text. To see what's happening, let's drop the rows with 'na' values.
    You'll notice that row 6 is still there, because pandas doesn't recognize its values as 'NaN'.

In [6]:
data_frame.dropna() 

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
6,NA,Missing,NA,Missing
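
    Why row 6 survives can be checked value by value. A small sketch using pd.isna()
    on the four kinds of "missing" that appear in this dataset (the `candidates` list
    is just an illustration):

```python
import numpy as np
import pandas as pd

# np.nan and None are recognized as missing; placeholder strings are not.
candidates = [np.nan, None, 'NA', 'Missing']
recognized = [bool(pd.isna(value)) for value in candidates]

for value, is_missing in zip(candidates, recognized):
    print(repr(value), '->', is_missing)
```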


In [7]:
# .dropna(how='any') vs .dropna(how='all')
# how='any' drops a row if it contains any 'na' value.
# how='all' drops a row only if every value in it is missing.
# how='any' is the default setting.
data_frame.dropna(how='all')  # The row at index 4 is entirely missing, so it is the only one dropped.

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing
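
    To see both settings side by side, here is a minimal sketch on a hypothetical
    three-row frame (one complete row, one partially missing, one fully missing).
    The frame and its column names are made up for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 complete, row 1 partially missing, row 2 fully missing.
sample = pd.DataFrame({
    'name': ['Ana', 'Bob', np.nan],
    'email': ['ana@email.com', np.nan, np.nan],
})

any_dropped = sample.dropna(how='any')  # keeps only the complete row
all_dropped = sample.dropna(how='all')  # drops only the fully missing row

print(list(any_dropped.index))
print(list(all_dropped.index))
```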


    If only one column needs to be free of missing values (let's say you'll work on the ['age'] column,
    so it doesn't really matter whether the ['email'] column is filled in or not), you can specify
    where you want to look for missing values.

In [8]:
data_frame.dropna(subset=['age'])  
# data_frame.dropna(subset=['email', 'age'])  # You can specify multiple columns too.

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
6,NA,Missing,NA,Missing
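
    subset= can also be combined with how=. A sketch, assuming a hypothetical
    `contacts` frame, that keeps a row when at least one of ['email'] or ['age']
    is present:

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the notebook's people_data.
contacts = pd.DataFrame({
    'first': ['Ana', np.nan, 'Cy', np.nan],
    'email': ['ana@email.com', 'anon@email.com', np.nan, np.nan],
    'age': ['30', np.nan, '41', np.nan],
})

# Only 'age' must be present; missing values in other columns are ignored.
has_age = contacts.dropna(subset=['age'])

# With how='all', a row is dropped only when BOTH 'email' and 'age' are missing.
has_email_or_age = contacts.dropna(subset=['email', 'age'], how='all')

print(list(has_age.index))
print(list(has_email_or_age.index))
```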


    Moving on: to make pandas treat the 'Missing' and 'NA' strings as 'na' values,
    we could do this:

In [9]:
# I'll use a new DataFrame so the first one stays unmodified.
new_data_frame = pd.DataFrame(people_data)
# Now we can replace the undetected placeholder strings with an actual 'NaN'.
new_data_frame.replace(['NA','Missing'], np.nan, inplace=True)

new_data_frame

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,,,,
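
    The same replacement can be wrapped in a small helper that returns a cleaned copy
    instead of modifying the frame in place. This is only a sketch: the function name
    `mark_placeholders_as_nan` and the `raw` frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def mark_placeholders_as_nan(frame, placeholders):
    """Return a copy of frame with each placeholder string replaced by NaN."""
    return frame.replace(placeholders, np.nan)

# Hypothetical data with one real missing value and one placeholder per column.
raw = pd.DataFrame({
    'first': ['Corey', 'NA', np.nan],
    'age': ['33', 'Missing', None],
})

cleaned = mark_placeholders_as_nan(raw, ['NA', 'Missing'])

print(raw.isna().sum().to_dict())      # counts before the replacement
print(cleaned.isna().sum().to_dict())  # counts after the replacement
```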


In [10]:
# Now it detects all of the existing 'na' values
new_data_frame.isna()

Unnamed: 0,first,last,email,age
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,True,False
4,True,True,True,True
5,True,True,False,True
6,True,True,True,True


In [11]:
new_data_frame.isna().sum()

first    3
last     3
email    3
age      3
dtype: int64

In [12]:
# For comparison: 'na' counts in the first DataFrame, before the replacement
data_frame.isna().sum()

first    2
last     2
email    2
age      2
dtype: int64
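
    The two counts can also be put next to each other in one table. A sketch that
    rebuilds the same people_data as above; the names `before`, `after` and
    `comparison` are just for this example:

```python
import numpy as np
import pandas as pd

people_data = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com',
              None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing'],
}

before = pd.DataFrame(people_data)
after = before.replace(['NA', 'Missing'], np.nan)

# One row per column, with the counts side by side.
comparison = pd.DataFrame({
    'before': before.isna().sum(),
    'after': after.isna().sum(),
})
comparison['newly_detected'] = comparison['after'] - comparison['before']
print(comparison)
```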

### How to handle them:
    1 - You can drop them.
    or
    2 - You can replace them using .fillna()

In [13]:
new_data_frame.dropna()

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63


In [14]:
new_data_frame.fillna('MISSING')

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,MISSING,36
4,MISSING,MISSING,MISSING,MISSING
5,MISSING,MISSING,Anonymous@email.com,MISSING
6,MISSING,MISSING,MISSING,MISSING
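
    One fill value for every column is not always what you want. .fillna() also
    accepts a dict that maps column names to fill values. A minimal sketch, assuming
    a hypothetical `contacts` frame and made-up fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the notebook's people_data.
contacts = pd.DataFrame({
    'first': ['Ana', np.nan],
    'email': [np.nan, 'anon@email.com'],
    'age': ['30', np.nan],
})

# Keys are column names; columns not listed ('first') keep their NaN values.
filled = contacts.fillna({'email': 'no-email', 'age': 'unknown'})
print(filled)
```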


In [15]:
# Notice that .dropna() and .fillna() return a modified copy of the DataFrame by default.
# So if we print new_data_frame, we'll see that it is still intact.
new_data_frame

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,,,,


In [16]:
# To really alter it, use inplace=True (or assign the result back to the same name).
new_data_frame.fillna('MISSING', inplace=True)
new_data_frame

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,MISSING,36
4,MISSING,MISSING,MISSING,MISSING
5,MISSING,MISSING,Anonymous@email.com,MISSING
6,MISSING,MISSING,MISSING,MISSING
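
    Assigning the result back to a name has the same effect as inplace=True.
    A small sketch on a hypothetical one-column frame:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({'email': ['a@email.com', np.nan]})

# fillna returns a new DataFrame; 'original' is untouched at this point.
filled = original.fillna('MISSING')
missing_before = int(original['email'].isna().sum())

# Reassigning the name keeps the change without inplace=True.
original = original.fillna('MISSING')
missing_after = int(original['email'].isna().sum())

print(missing_before, missing_after)
```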


In [17]:
new_data_frame.replace('MISSING', np.nan, inplace=True)
new_data_frame

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,,,,


In [18]:
new_data_frame.dropna(inplace=True)
new_data_frame

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@gmail.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63


    Remember: your choice of how to handle missing data depends on the questions you are trying to answer.
    There's much more to learn about this topic. I hope this was helpful to you.
