# Missing Values
Values can be missing for a number of reasons.
  - **Missing at random (MAR)**: whether a value is missing is related to other observed data in the data set, but not to the missing value itself
  - **Missing completely at random (MCAR)**: whether a value is missing has no relationship with any other data
  - Missing data often comes from joining data sets from different sources that do not completely overlap
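
To illustrate the last point, here is a minimal sketch using two made-up tables (`grades` and `attendance` are hypothetical and not part of the course data). An outer join keeps every student from both sources, so any student present in only one table ends up with missing values.

```python
import pandas as pd

# Hypothetical tables: each source knows about a different set of students.
grades = pd.DataFrame({"student": ["ann", "bob", "cat"],
                       "midterm": [71.0, 64.5, 88.0]})
attendance = pd.DataFrame({"student": ["bob", "cat", "dan"],
                           "sessions": [9, 12, 7]})

# An outer join keeps every student from both tables; cells with no
# counterpart in the other table become NaN.
merged = grades.merge(attendance, on="student", how="outer")
print(merged)
print(merged.isnull().sum())
```

Here `ann` has no attendance record and `dan` has no midterm grade, so each of those columns gains one missing value.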

Pandas has some built-in methods for handling missing values.

In [2]:
import pandas as pd

df = pd.read_csv('../resources/week-2/datasets/class_grades.csv')
df.head()


Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89


In [3]:
# a boolean mask can identify the missing values.
# the .isnull() function can be used to find the missing values
missing_values_mask = df.isnull()
missing_values_mask.head()

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
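
The boolean mask is mostly useful as an input to other operations. A short sketch, assuming a small made-up frame (`sample` is a hypothetical stand-in for a few rows of the grades table), shows how to count missing values per column and pull out the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the grades table.
sample = pd.DataFrame({
    "Assignment": [57.14, 95.05, 83.70, np.nan],
    "Tutorial": [34.09, 105.49, 83.17, np.nan],
    "Midterm": [64.38, 67.50, np.nan, 49.38],
})

mask = sample.isnull()

# Summing a boolean mask counts the True values in each column.
missing_per_column = mask.sum()

# any(axis=1) is True for rows that have at least one missing value.
rows_with_missing = sample[mask.any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```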


In [4]:
# the .dropna() function can be used if you want to drop rows that have _any_ missing values
df.dropna().head() 

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
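
Dropping every row with any missing value can discard a lot of data. As a sketch on a made-up frame (`frame` is hypothetical), `.dropna()` also accepts `how`, `subset` and `thresh` to make the rule less strict.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different amounts of missing data per row.
frame = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [1.0, 2.0, np.nan, np.nan],
    "C": [1.0, np.nan, np.nan, 4.0],
})

any_missing = frame.dropna()               # drop rows with any NaN (the default)
all_missing = frame.dropna(how="all")      # drop only rows that are entirely NaN
need_a = frame.dropna(subset=["A"])        # only look at column A when deciding
at_least_two = frame.dropna(thresh=2)      # keep rows with 2 or more non-NaN values

for name, result in [("any", any_missing), ("all", all_missing),
                     ("subset", need_a), ("thresh", at_least_two)]:
    print(name, list(result.index))
```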


In [7]:
# the .fillna() function can be used to fill all the missing values with another value,
# 0 for example
# here .fillna() is given the fill value and inplace=True, which modifies the dataframe directly
# instead of returning a filled copy

df.fillna(0, inplace = True)
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
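
Filling with a single constant is not the only option. The sketch below uses a made-up `scores` frame (hypothetical, not the course data) to show a per-column fill with a dict and a fill with each column's mean. Whether either is appropriate depends on what a missing grade means in your data.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; NaN marks a grade that was never recorded.
scores = pd.DataFrame({
    "Midterm": [60.0, np.nan, 80.0],
    "Final": [50.0, 70.0, np.nan],
})

# A dict fills only the named columns, each with its own value.
midterm_zeroed = scores.fillna({"Midterm": 0})

# Passing the column means fills each column with its own mean.
mean_filled = scores.fillna(scores.mean())

# Filling with 0 pulls the column average down; filling with the mean keeps it.
print(scores["Midterm"].mean())
print(midterm_zeroed["Midterm"].mean())
print(mean_filled["Midterm"].mean())
```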


In [32]:
# sometimes missing values actually comprise useful information
# example from the log file for the videos of this class
df=pd.read_csv('../resources/week-2/datasets/log.csv')
df.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


## filling missing values (ffill and bfill)

In [33]:
# .ffill() and .bfill() are used for filling in missing values. ffill (forward fill) fills a missing value
# with the last valid value from the rows above it. bfill (backward fill) uses the next valid value below it.
# the data needs to be sorted for this to have the desired effect.
# to do so, the timestamp can be promoted to the index
df = df.set_index('time')
# note that sort_index() returns a sorted copy; the result is not assigned here,
# so df keeps its original row order
df.sort_index()
# note that if this cell is run twice it raises an error because 'time' is no longer in
# the columns list once it has been promoted to the index
df.columns
df.head()


time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974454,cheryl,intro.html,6,,
1469974544,cheryl,intro.html,9,,
1469974574,cheryl,intro.html,10,,
1469977514,bob,intro.html,1,,


In [36]:
# however, this log file records multiple users who can be accessing the system at the same time,
# so the index is not guaranteed to be unique. It would be unique if we had a composite index
# consisting of time and user, so...first reset the index
df = df.reset_index()
# then create the composite index by passing in a list to the set_index function instead of a single value
df = df.set_index(['time', 'user'])
df


time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974454,cheryl,intro.html,6,,
1469974544,cheryl,intro.html,9,,
1469974574,cheryl,intro.html,10,,
1469977514,bob,intro.html,1,,
1469977544,bob,intro.html,1,,
1469977574,bob,intro.html,1,,
1469977604,bob,intro.html,1,,
1469974604,cheryl,intro.html,11,,
1469974694,cheryl,intro.html,14,,


In [38]:
# filling can be done piecemeal, that is, you can fill column by column and you don't need to
# do it all in one command. Here the whole dataframe is forward filled, which covers the paused and
# volume columns: the log only records their values when they change
# (.ffill() is equivalent to the older .fillna(method='ffill'), which recent pandas versions deprecate)
df = df.ffill()
df


time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974544,cheryl,intro.html,9,False,10.0
1469974574,cheryl,intro.html,10,False,10.0
1469977514,bob,intro.html,1,False,10.0
1469977544,bob,intro.html,1,False,10.0
1469977574,bob,intro.html,1,False,10.0
1469977604,bob,intro.html,1,False,10.0
1469974604,cheryl,intro.html,11,False,10.0
1469974694,cheryl,intro.html,14,False,10.0
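
In the output above, bob's rows inherit cheryl's `paused` and `volume` values, because a plain forward fill runs down the table regardless of user. If each user's settings should be carried forward separately, one option is to sort by time and fill within each user group. The sketch below assumes a tiny made-up log (`log` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical log: each user's volume is only recorded when it changes.
log = pd.DataFrame({
    "time": [3, 1, 2, 4],
    "user": ["cheryl", "cheryl", "bob", "bob"],
    "volume": [np.nan, 10.0, 5.0, np.nan],
})

# Sort first so that "previous row" means "earlier in time".
log = log.sort_values("time")

# Plain forward fill: a row takes the last value seen from any user.
plain = log.ffill()

# Grouped forward fill: a row takes the last value seen from the same user.
per_user = log.copy()
per_user["volume"] = log.groupby("user")["volume"].ffill()

print(plain)
print(per_user)
```

With the plain fill, cheryl's row at time 3 picks up bob's volume; with the grouped fill it keeps her own earlier value.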


## customized fill values

In [44]:
testdf = pd.DataFrame({'A':[1,1,2,3,4],
                      'B':[3,6,3,8,9],
                      'C':['a','b','c','d','e']})
testdf


Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [51]:
# value to value replacing
# replace all occurrences of the first parameter with the second parameter
testdf.replace(1, 100)

Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [50]:
# note that this returns a copy of the dataframe and the original dataframe remains unchanged
testdf

Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [52]:
# you can change multiple values by providing parallel lists as the parameters to the replace function
testdf.replace([1,3],[100,300])

Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e
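
As the output above shows, the replacement applies to every column. A short sketch on a made-up frame (`frame` is hypothetical) shows two dict forms of `.replace()`: a flat dict as an alternative to parallel lists, and a nested dict that limits the replacement to one column.

```python
import pandas as pd

# Hypothetical frame where the value 1 appears in both columns.
frame = pd.DataFrame({"A": [1, 1, 2], "B": [1, 3, 3]})

# A flat dict maps old values to new values across the whole frame.
with_lists = frame.replace([1, 3], [100, 300])
with_dict = frame.replace({1: 100, 3: 300})

# A nested dict restricts the replacement to the named column.
only_a = frame.replace({"A": {1: 100}})

print(with_dict)
print(only_a)
```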


In [54]:
# pandas also supports regex replacement
# the first parameter is the regex pattern to match
# the second parameter is the value to replace the matches
# the third parameter, regex=True, tells pandas it is a regex replacement
# in the logfile df we were using earlier, replace all cells that end in .html with the term 'webpage'
df.head()

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974544,cheryl,intro.html,9,False,10.0
1469974574,cheryl,intro.html,10,False,10.0
1469977514,bob,intro.html,1,False,10.0


In [59]:
df.replace(to_replace=r".*\.html$", value="webpage", regex=True)
# note that the regex treats each cell as an individual entity
# it does not try to parse the entire row
# regex explanation:
#   .* matches any number of characters
#   \.html$ matches a literal .html at the end of the cell


time,user,video,playback position,paused,volume
1469974424,cheryl,webpage,5,False,10.0
1469974454,cheryl,webpage,6,False,10.0
1469974544,cheryl,webpage,9,False,10.0
1469974574,cheryl,webpage,10,False,10.0
1469977514,bob,webpage,1,False,10.0
1469977544,bob,webpage,1,False,10.0
1469977574,bob,webpage,1,False,10.0
1469977604,bob,webpage,1,False,10.0
1469974604,cheryl,webpage,11,False,10.0
1469974694,cheryl,webpage,14,False,10.0
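
In a regular expression an unescaped `.` matches any character, so escaping matters when the goal is to match a literal file extension. A small sketch on a made-up Series (`pages` is hypothetical) shows the difference.

```python
import pandas as pd

# Hypothetical cell values; only the first one is really an .html file name.
pages = pd.Series(["intro.html", "notes_html", "intro.htm"])

# Unescaped dot: "." matches any character, so "notes_html" also matches.
loose = pages.replace(to_replace=r".*.html$", value="webpage", regex=True)

# Escaped dot: "\." matches only a literal period.
strict = pages.replace(to_replace=r".*\.html$", value="webpage", regex=True)

print(list(loose))
print(list(strict))
```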


## note on missing values
Most statistical calculations in pandas ignore missing values by default, but it is important
to acknowledge missing values and make a judgement on whether they are significant to the
problem you are trying to solve.
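
A small sketch on a made-up Series (`grades` is hypothetical) shows what this looks like in practice, and how the choice of handling changes the result.

```python
import numpy as np
import pandas as pd

# Hypothetical grades with one missing entry.
grades = pd.Series([10.0, np.nan, 30.0])

default_mean = grades.mean()               # NaN is skipped by default
strict_mean = grades.mean(skipna=False)    # NaN propagates into the result
zero_filled_mean = grades.fillna(0).mean() # the missing grade counts as 0
n_present = grades.count()                 # number of non-missing values

print(default_mean, strict_mean, zero_filled_mean, n_present)
```

The three means differ, which is why it is worth deciding deliberately how missing values should be treated.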

It may be unreasonable, for example, to infer missing values for data that should not exist.