# Missing data
* Real-world data often has missing values for a variety of reasons
* Many machine learning models and statistical methods cannot work with missing data points, so we need to decide what to do with them
* When reading in missing values, pandas displays them as <b>NaN</b> (not a number)
* There are also newer specialized pandas null values, such as <b>pd.NaT</b> (not a time), which indicates that the missing value should be a timestamp
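
As a small illustration of how these markers appear when reading data, the sketch below parses a tiny in-memory CSV. The column names and values are made up for this example and are not part of the course dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text: the second row has no score and no date
csv_text = "name,score,seen_on\nTom,8,2021-01-05\nHugh,,\n"

# parse_dates asks pandas to treat seen_on as a datetime column
df_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["seen_on"])

print(df_demo)
print(df_demo.dtypes)
```

The missing score is shown as NaN in a float column, while the missing date is shown as NaT in a datetime column.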

* Options for Missing Data
    * Keep it
    * Remove it
    * Replace it
* <i>Note: there is never a 100% correct approach that applies to all circumstances; it all depends on the exact situation you encounter!</i>


# Keeping the missing data
* Keeping the missing data
    * PROS:
        * Easiest to do
        * Does not manipulate or change the true data
    * CONS:
        * Many methods do not support NaN
        * Often there are reasonable guesses for filling in that missing data

# Dropping or Removing the missing data
* Dropping or Removing the missing data
    * PROS:
        * Easy to do
        * Can be based on rules (for example, drop a column that has only 2 non-missing values)
    * CONS:
        * Potential to lose a lot of data or useful information
        * Limits trained models for future data
* Dropping a row
  * Makes sense when a row is missing most of its values
  * In the table below, the USA row has only Year filled in, so it should probably be dropped
  * It is often a good idea to calculate what percentage of the data is dropped
* Dropping a column
  * Good choice if every row is missing that particular feature
  <table class="center">
<tr>
<th>Labeled Index</th>
<th>Year</th>
<th>Pop</th>
<th>GDP</th>
<th>Area</th>
</tr>

<tr>
<th>USA</th>
<th>1776</th>
<th>NaN</th>
<th>NaN</th>
<th>NaN</th>
</tr>

<tr>
<th>CANADA</th>
<th>1867</th>
<th>38</th>
<th>1.7</th>
<th>3.86</th>
</tr>

<tr>
<th>MEXICO</th>
<th>1821</th>
<th>1.7</th>
<th>1.22</th>
<th>0.76</th>
</tr>
</table>
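
To make the "percentage dropped" check concrete, here is a minimal sketch that rebuilds the table above as a DataFrame. The variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

# The table above, with the USA row missing Pop, GDP and Area
countries = pd.DataFrame(
    {
        "Year": [1776, 1867, 1821],
        "Pop": [np.nan, 38, 1.7],
        "GDP": [np.nan, 1.7, 1.22],
        "Area": [np.nan, 3.86, 0.76],
    },
    index=["USA", "CANADA", "MEXICO"],
)

# Percentage of missing values in each row
missing_per_row = countries.isnull().mean(axis=1) * 100

# Drop rows with any missing value and report how much data was lost
cleaned = countries.dropna()
pct_dropped = 100 * (len(countries) - len(cleaned)) / len(countries)

print(missing_per_row)
print(f"{pct_dropped:.1f}% of rows dropped")
```

With only three rows, dropping one removes a third of the data, which is the kind of cost this check is meant to reveal.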


# Filling in the missing data
* Filling in the missing data
    * PROS:
        * Potential to save a lot of data for use in training a model
    * CONS:
        * Hardest to do and somewhat arbitrary
        * Potential to lead to false conclusions (be careful with the reasoning you're using on filling in that missing data)


* Filling in missing data
    * Fill with same value
        * Good choice if NaN was a placeholder
        * Here NaN can be filled in with zero

<table class="center">
<tr>
<th>Labeled Index</th>
<th>Year</th>
<th>Pop</th>
<th>GDP</th>
<th>Carriers</th>
</tr>

<tr>
<th>USA</th>
<th>1776</th>
<th>328</th>
<th>20.5</th>
<th>11</th>
</tr>

<tr>
<th>CANADA</th>
<th>1867</th>
<th>38</th>
<th>1.7</th>
<th>NaN</th>
</tr>

<tr>
<th>MEXICO</th>
<th>1821</th>
<th>1.7</th>
<th>1.22</th>
<th>NaN</th>
</tr>
</table>
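
A minimal sketch of the "fill with the same value" idea, rebuilding the Carriers table above. It assumes, as the bullet suggests, that a missing Carriers value was a placeholder for zero.

```python
import numpy as np
import pandas as pd

# The table above: Carriers is missing for CANADA and MEXICO
fleet = pd.DataFrame(
    {
        "Year": [1776, 1867, 1821],
        "Pop": [328, 38, 1.7],
        "GDP": [20.5, 1.7, 1.22],
        "Carriers": [11, np.nan, np.nan],
    },
    index=["USA", "CANADA", "MEXICO"],
)

# Assumption: NaN here means "no carriers", so zero is a sensible fill value
fleet["Carriers"] = fleet["Carriers"].fillna(0)

print(fleet)
```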

* Filling in missing data
    * Fill with interpolated or estimated value
        * Much harder and requires reasonable assumptions
<table class="center">
<tr>
<th>Labeled Index</th>
<th>Year</th>
<th>Pop</th>
<th>GDP</th>
<th>Percent</th>
</tr>

<tr>
<th>USA</th>
<th>1776</th>
<th>328</th>
<th>20.5</th>
<th>75%</th>
</tr>

<tr>
<th>CANADA</th>
<th>1867</th>
<th>38</th>
<th>1.7</th>
<th>NaN</th>
</tr>

<tr>
<th>MEXICO</th>
<th>1821</th>
<th>1.7</th>
<th>1.22</th>
<th>25%</th>
</tr>
</table>

* Let's explore the code syntax in pandas for dealing with missing values
* Later in the course we will have a deeper discussion of how to choose between the keep, remove, and replace options


In [2]:
import numpy as np
import pandas as pd

In [3]:
# np.nan: the long-standing NumPy float value that pandas uses to mark missing data
np.nan

nan

In [4]:
# pd.NA: a newer pandas missing-value marker (introduced in pandas 1.0)
# intended to represent missing data consistently across data types
pd.NA

<NA>

In [5]:
# pd.NaT: the missing-value marker for datetime-like data
pd.NaT


NaT

In [6]:
# NaN never compares equal to anything, not even to itself
np.nan == np.nan  # False

False

In [7]:
np.nan is np.nan

True

In [8]:
myvar = np.nan

In [9]:
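# == cannot detect NaN; 'is' works here only because myvar refers to the same np.nan object
# for values that come from other sources, pd.isna(myvar) is the more reliable check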
myvar is np.nan

True
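
The identity check above depends on the value being the very same `np.nan` object. As a hedged sketch of a more general check, the example below uses `pd.isna`, which tests the value rather than the object. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A NaN created elsewhere is a different object from np.nan
fresh_nan = float("nan")

same_object = fresh_nan is np.nan  # identity check fails for this NaN
detected = pd.isna(fresh_nan)      # pd.isna looks at the value instead

print(same_object, detected)

# pd.isna also recognises the other missing-value markers
print(pd.isna(pd.NA), pd.isna(pd.NaT), pd.isna(None))
```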

In [10]:
df = pd.read_csv('C:\\Users\\admin\\Desktop\\Data Science\\Course-2021\\03-Pandas\\movie_scores.csv')

In [11]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [12]:
# returns a DataFrame of booleans: True where a value is missing
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [13]:
# the opposite of isnull: True where a value is present
# useful for selecting rows where certain features are present
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [14]:
# check which rows have a non-null pre_movie_score
df['pre_movie_score'].notnull()

0     True
1    False
2    False
3     True
4     True
Name: pre_movie_score, dtype: bool

In [15]:
# pass the boolean mask to the DataFrame
# to keep only the rows where pre_movie_score is present
df[df['pre_movie_score'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [16]:
# combine conditions: pre_movie_score is missing but first_name is present
df[(df['pre_movie_score'].isnull()) & (df['first_name'].notnull())]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,
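
Before filtering, it can help to summarise how much is missing in each column. The sketch below rebuilds the frame by hand from the output shown earlier, since the CSV file itself is not available here; `movies` is a stand-in name for `df`.

```python
import numpy as np
import pandas as pd

# Stand-in for df, typed in from the displayed output
movies = pd.DataFrame(
    {
        "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
        "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
        "age": [63, np.nan, 51, 66, 31],
        "sex": ["m", np.nan, "m", "f", "f"],
        "pre_movie_score": [8, np.nan, np.nan, 6, 7],
        "post_movie_score": [10, np.nan, np.nan, 8, 9],
    }
)

# True counts as 1, so summing the boolean mask counts missing values per column
missing_counts = movies.isnull().sum()

# The mean of the mask gives the fraction of missing values per column
missing_share = movies.isnull().mean()

print(missing_counts)
print(missing_share)
```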


# Keep data
* Read the dataset
* Keep any missing values

In [17]:
help(df.dropna)


Help on method dropna in module pandas.core.frame:

dropna(axis: 'Axis' = 0, how: 'str' = 'any', thresh=None, subset=None, inplace: 'bool' = False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. versionchanged:: 1.0.0
    
           Pass tuple or list to drop on multiple axes.
           Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
        * 'any' : If a

In [18]:
# drop any row with at least one missing value
df.dropna()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [19]:
# thresh=n keeps only rows that have at least n non-null values
# thresh=1 therefore drops only rows in which every value is missing
df.dropna(thresh=1)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [20]:
# row 2 has only 4 non-null values, so thresh=5 drops it as well
df.dropna(thresh=5)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
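
One way to see what `thresh` does is to count the non-null values per row directly. This is a sketch on a hand-built stand-in for `df` (named `movies` here), reconstructed from the output shown above.

```python
import numpy as np
import pandas as pd

# Stand-in for df, typed in from the displayed output
movies = pd.DataFrame(
    {
        "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
        "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
        "age": [63, np.nan, 51, 66, 31],
        "sex": ["m", np.nan, "m", "f", "f"],
        "pre_movie_score": [8, np.nan, np.nan, 6, 7],
        "post_movie_score": [10, np.nan, np.nan, 8, 9],
    }
)

# Number of non-null values in each row
non_null_per_row = movies.count(axis=1)

# dropna(thresh=5) keeps the same rows as filtering on that count
kept_by_thresh = movies.dropna(thresh=5)
kept_by_mask = movies[non_null_per_row >= 5]

# how="all" drops only rows that are entirely missing, like thresh=1 on this data
all_null_dropped = movies.dropna(how="all")

print(non_null_per_row)
print(kept_by_thresh.index.tolist(), all_null_dropped.index.tolist())
```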


In [21]:
# drop any column with at least one missing value
# every column contains a NaN here, so all columns are dropped and only the index remains
df.dropna(axis=1)

0
1
2
3
4
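
Dropping every column is rarely useful. A gentler, rule-based variant is to combine `axis=1` with `thresh`, so that a column is kept when it has enough non-null values. The sketch below uses a hand-built stand-in for `df` (named `movies`), and the threshold of 4 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for df, typed in from the displayed output
movies = pd.DataFrame(
    {
        "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
        "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
        "age": [63, np.nan, 51, 66, 31],
        "sex": ["m", np.nan, "m", "f", "f"],
        "pre_movie_score": [8, np.nan, np.nan, 6, 7],
        "post_movie_score": [10, np.nan, np.nan, 8, 9],
    }
)

# Keep only columns that have at least 4 non-null values
mostly_complete = movies.dropna(axis=1, thresh=4)

print(mostly_complete.columns.tolist())
```

On this data the two score columns have only 3 non-null values each, so they are dropped while the other four columns are kept.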


* When rows are data points and columns are features, leave the default axis=0 so that rows are dropped

In [22]:
df.dropna(axis=0)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [23]:
# drop only the rows where last_name is missing
df.dropna(subset=['last_name'])

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [24]:
# fill every NaN in the DataFrame with the same value
df.fillna('NEW VALUE!')

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!
2,Hugh,Jackman,51.0,m,NEW VALUE!,NEW VALUE!
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [25]:
# fill NaN in one specific column
df['pre_movie_score'].fillna(0)

0    8.0
1    0.0
2    0.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [26]:
# replace NaN with the mean of the column
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [27]:
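# fill every missing numeric value with the mean of its column
# non-numeric columns (first_name, last_name, sex) have no mean, so their NaN values remain
df.fillna(df.mean(numeric_only=True))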


Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,52.75,,7.0,9.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
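
Since the mean only covers numeric columns, a different fill value per column is sometimes wanted. `fillna` also accepts a dictionary that maps column names to fill values. In this sketch, `movies` is a hand-built stand-in for `df`, and the `"unknown"` label is a hypothetical placeholder chosen for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for df, typed in from the displayed output
movies = pd.DataFrame(
    {
        "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
        "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
        "age": [63, np.nan, 51, 66, 31],
        "sex": ["m", np.nan, "m", "f", "f"],
        "pre_movie_score": [8, np.nan, np.nan, 6, 7],
        "post_movie_score": [10, np.nan, np.nan, 8, 9],
    }
)

# One fill value per column; columns not listed are left untouched
fill_values = {
    "age": movies["age"].mean(),
    "sex": "unknown",  # hypothetical placeholder label
}
filled = movies.fillna(fill_values)

print(filled)
```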


In [28]:
# interpolation
airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}

In [29]:
ser = pd.Series(airline_tix)

In [30]:
ser

first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64

In [31]:
# linear interpolation: the NaN is replaced by a value evenly spaced
# between its neighbours, here (100 + 50) / 2 = 75

ser.interpolate()



first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64
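
The same idea extends to a run of several missing values: with the default linear method, the gap is filled with equally spaced steps between the known neighbours. The series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Two consecutive missing values between 100 and 40
prices = pd.Series([100, np.nan, np.nan, 40], index=["a", "b", "c", "d"])

# The default method is linear and treats the values as equally spaced
filled = prices.interpolate()

print(filled)
```

The gap from 100 to 40 is split into three equal steps of 20, giving 80 and 60 for the missing entries.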