<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">

## <center> Working with missing data </center>
    
+ cumsum()
+ groupby()
+ fillna()
+ fillna(method='pad')
+ interpolate()

In [3]:
import pandas as pd

In [4]:
temps = pd.DataFrame({
    "sequence":[1,2,3,4,5],
    "measurement_type":['actual','actual','actual',None,'estimated'],
    "temperature_f":[67.24,84.56,91.61,None,49.64]
})


temps

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,
4,5,estimated,49.64


In [6]:
temps.isnull().sum()

sequence            0
measurement_type    1
temperature_f       1
dtype: int64

In [8]:
temps.isna().sum()

sequence            0
measurement_type    1
temperature_f       1
dtype: int64
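
The two cells above return the same counts because `isnull()` is an alias of `isna()`. As a small sketch on the same `temps` frame (the variable names `same_mask`, `missing_per_row` and `rows_with_missing` are only illustrative), the same boolean mask can also be used to count missing values per row or to pick out the affected rows:

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# isnull() is an alias of isna(), so both produce the same boolean mask
same_mask = temps.isna().equals(temps.isnull())

# Count missing values per row instead of per column
missing_per_row = temps.isna().sum(axis=1)

# Keep only the rows that contain at least one missing value
rows_with_missing = temps[temps.isna().any(axis=1)]

print(same_mask)
print(missing_per_row)
print(rows_with_missing)
```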

-----

# How is missing data handled?

## Cumulative Sum
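+ by default, cumulative sum skips NaN values: the NaN position stays NaN in the output, but the running total continues after it
+ setting `skipna=False` propagates NaN instead, so every result from the first NaN onward is NaN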

In [10]:
temps['temperature_f'].cumsum()

0     67.24
1    151.80
2    243.41
3       NaN
4    293.05
Name: temperature_f, dtype: float64

In [13]:
# With skipna=False the NaN is included in the sum.
# The NaN in the fourth row (index 3) makes the cumulative sum of the fifth row (index 4) NaN as well.

temps['temperature_f'].cumsum(skipna=False)

0     67.24
1    151.80
2    243.41
3       NaN
4       NaN
Name: temperature_f, dtype: float64
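
If a running total is wanted at every row, one option (an assumption about the use case, not something the notebook above does) is to treat the missing reading as 0 before summing. A minimal sketch comparing the three variants:

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Default: the NaN row stays NaN, the total resumes afterwards
running = temps["temperature_f"].cumsum()

# skipna=False: NaN propagates to every later row
running_strict = temps["temperature_f"].cumsum(skipna=False)

# Treat the missing value as 0 so every row carries a total
running_filled = temps["temperature_f"].fillna(0).cumsum()

print(pd.DataFrame({
    "default": running,
    "skipna_false": running_strict,
    "filled_with_0": running_filled,
}))
```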

--------

## Grouping
+ by default, `groupby()` excludes rows whose group key is NaN
+ we can keep those rows as their own group by setting `dropna=False`

In [14]:
temps

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,
4,5,estimated,49.64


In [15]:
temps.groupby('measurement_type').max()

measurement_type,sequence,temperature_f
actual,3,91.61
estimated,5,49.64


In [17]:
temps.groupby('measurement_type', dropna=False).max()

# The output now includes a group for the row whose measurement_type is NaN

measurement_type,sequence,temperature_f
actual,3,91.61
estimated,5,49.64
,4,
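
To make the effect of `dropna` easier to see, here is a small sketch that counts rows per group with `size()`. The variable names are illustrative. With the default, one of the five rows silently disappears from the counts:

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Default: the row with a missing group key is left out
counts_default = temps.groupby("measurement_type").size()

# dropna=False: the missing key becomes its own group
counts_with_nan = temps.groupby("measurement_type", dropna=False).size()

print(counts_default)
print(counts_with_nan)
```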


------

## Dropping NA values

In [19]:
temps

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,
4,5,estimated,49.64


In [20]:
temps.dropna() # this will drop any rows with na values

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
4,5,estimated,49.64


In [22]:
temps.dropna(axis=1) # this will drop any columns with na values

Unnamed: 0,sequence
0,1
1,2
2,3
3,4
4,5
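
Dropping every row or column that has any NA can be more aggressive than needed. As a sketch, `dropna()` also accepts `subset`, `how` and `thresh` to narrow down what gets removed (the variable names below are illustrative):

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Drop a row only when temperature_f is missing
by_subset = temps.dropna(subset=["temperature_f"])

# Drop a row only when every column is missing (no such row here)
all_missing = temps.dropna(how="all")

# Keep rows that have at least 2 non-missing values
by_thresh = temps.dropna(thresh=2)

print(by_subset)
print(all_missing)
print(by_thresh)
```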


-------

## Fill NA Values

In [24]:
temps

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,
4,5,estimated,49.64


### Fill NA with 0
+ If we fill missing values with 0, the value in the `measurement_type` column doesn't make sense.
+ Additionally, if we then compute the mean of `temperature_f`, it will be heavily biased by the 0 we just introduced.

In [23]:
# fill with 0
temps.fillna(0)

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,0,0.0
4,5,estimated,49.64
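
The bias mentioned above can be checked directly. The sketch below compares the mean with and without the introduced 0, and then fills each column with its own value by passing a dict to `fillna()`. The `"unknown"` label and the choice of the column mean are assumptions made for illustration only:

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# mean() skips NaN by default: 293.05 / 4
mean_skipping_nan = temps["temperature_f"].mean()

# After fillna(0) the 0 counts as a real reading: 293.05 / 5
mean_with_zero = temps["temperature_f"].fillna(0).mean()

# Per-column fill values; "unknown" is a hypothetical placeholder label
filled = temps.fillna({
    "measurement_type": "unknown",
    "temperature_f": mean_skipping_nan,
})

print(mean_skipping_nan, mean_with_zero)
print(filled)
```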


### Fill NA with `pad` method
+ fill NA with the value of the previous row.
+ However, this method has its own issue: the copied value can influence the result too much.

In [25]:
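# Note: recent pandas versions deprecate the `method` argument of fillna();
# temps.ffill() is the equivalent call.
temps.fillna(method='pad')

# Row index 3 now has temperature_f = 91.61 (copied from the row above), which raises the mean of temperature_f.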

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,actual,91.61
4,5,estimated,49.64
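
The same padding can be written with `ffill()`, which is the method form of `pad`. The sketch below also measures how much the copied 91.61 shifts the mean, and uses `limit` to cap how many consecutive rows get filled (variable names are illustrative):

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Forward fill: copy the last valid value into the gap
padded = temps["temperature_f"].ffill()

# Mean before padding (NaN skipped) vs. after padding
mean_before = temps["temperature_f"].mean()
mean_padded = padded.mean()

# limit=1 fills at most one consecutive missing row per gap
padded_frame = temps.ffill(limit=1)

print(mean_before, mean_padded)
print(padded_frame)
```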


-------

## Interpolate
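+ by default, `interpolate()` fills NA linearly; for a single missing row this is the halfway value between the previous row and the next row.
+ non-numeric columns such as `measurement_type` are left unfilled.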

In [27]:
(91.610 + 49.640) / 2

70.625

In [26]:
temps.interpolate()

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,70.625
4,5,estimated,49.64
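
As a sketch, interpolating only the numeric column avoids touching `measurement_type` at all. The second series, `longer_gap`, is made-up data added here to show that a longer gap is filled with evenly spaced values rather than a single midpoint:

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ["actual", "actual", "actual", None, "estimated"],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Linear interpolation (the default) on the numeric column only
interpolated = temps["temperature_f"].interpolate()

# For a one-row gap the result is the midpoint of its neighbours
midpoint = (91.61 + 49.64) / 2

# Hypothetical series with a two-row gap between 10 and 40
longer_gap = pd.Series([10.0, None, None, 40.0]).interpolate()

print(interpolated)
print(midpoint)
print(longer_gap)
```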
