# Dealing with Missing Values: Simple Imputation Example Using `groupby` + `fillna` 

Simple imputation fills missing values with a statistic (the _mean_, _median_ or _mode_) or with a _constant_. Filling with a statistic computed over the whole column is quick and easy, but it ignores differences between subgroups of the data and can distort the column's distribution. This drawback can be reduced, at least to some degree, by computing the statistic within groups using `groupby`, so that each missing value is filled from its own group.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Create a DataFrame
ex = pd.DataFrame({'col1':['A','A','A','B','B','B'], 
                   'col2':[1, 2, 3, 4, 5, 6], 
                   'col3':[1, np.nan, 3, 4 ,np.nan, 4], 
                   'col4':['a', 'b', 'c', 'd', 'e', 'f'],
                   'col5':['a', 'a', np.nan, 'd', np.nan, 'f']})
ex

Unnamed: 0,col1,col2,col3,col4,col5
0,A,1,1.0,a,a
1,A,2,,b,a
2,A,3,3.0,c,
3,B,4,4.0,d,d
4,B,5,,e,
5,B,6,4.0,f,f


In [3]:
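# Broadcast each group's mean back to every row.
# Only the numeric columns are selected: recent pandas versions no longer
# drop non-numeric columns (col4, col5) silently when computing the mean.
ex.groupby('col1')[['col2', 'col3']].transform('mean')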

Unnamed: 0,col2,col3
0,2,2.0
1,2,2.0
2,2,2.0
3,5,4.0
4,5,4.0
5,5,4.0


In [4]:
# Replace every value of col3 with the mean of its group
ex.groupby('col1').col3.transform('mean')

0    2.0
1    2.0
2    2.0
3    4.0
4    4.0
5    4.0
Name: col3, dtype: float64

In [6]:
# Use the group means of col3 (computed as above) to fill the missing values.
# fillna is applied to the original, ungrouped column; the two Series align on the index.
ex.col3.fillna(ex.groupby('col1')['col3'].transform("mean"))

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    4.0
Name: col3, dtype: float64
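
To see what grouping changes, the result above can be compared with a fill that uses the overall mean of `col3`. The following is a minimal sketch on the same toy data; the names `global_fill`, `group_fill` and `comparison` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above (only the columns needed here)
ex = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'col3': [1, np.nan, 3, 4, np.nan, 4]})

# Fill with the overall mean of col3, ignoring the groups
global_fill = ex['col3'].fillna(ex['col3'].mean())

# Fill with the mean of each row's own group
group_fill = ex['col3'].fillna(ex.groupby('col1')['col3'].transform('mean'))

# Put both results side by side for comparison
comparison = pd.DataFrame({'col1': ex['col1'],
                           'global_mean': global_fill,
                           'group_mean': group_fill})
print(comparison)
```

With the overall mean, both missing values receive 3.0; with the group means, the missing value in group A receives 2.0 and the one in group B receives 4.0.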

In [7]:
# Apply fillna within each group.
# Only the numeric columns are selected, since the mean is undefined for strings.
ex.groupby('col1')[['col2', 'col3']].transform(lambda x: x.fillna(x.mean()))

Unnamed: 0,col2,col3
0,1,1.0
1,2,2.0
2,3,3.0
3,4,4.0
4,5,4.0
5,6,4.0
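
The same pattern should carry over to the other options mentioned in the introduction, the median and a constant. The sketch below assumes the same toy data; the constant `-1` is an arbitrary placeholder chosen for illustration, not a recommended value.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above (only the columns needed here)
ex = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'col3': [1, np.nan, 3, 4, np.nan, 4]})

# Fill with the median of each row's own group
median_fill = ex['col3'].fillna(ex.groupby('col1')['col3'].transform('median'))

# Fill with a constant; -1 is an arbitrary illustrative placeholder
constant_fill = ex['col3'].fillna(-1)

print(median_fill.tolist())
print(constant_fill.tolist())
```

On this small example the group medians coincide with the group means, so the median fill gives the same result as the mean fill.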


In [8]:
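# Mode (most frequent value) works for both categorical and numerical variables.
# `mode()` returns a Series holding every mode (in sorted order), so passing it
# straight to `fillna` would align on the index instead of filling with a single
# value. Indexing with [0] takes the first mode, which also resolves ties.
ex.groupby('col1').transform(lambda x: x.fillna(x.mode()[0]))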

Unnamed: 0,col2,col3,col4,col5
0,1,1.0,a,a
1,2,1.0,b,a
2,3,3.0,c,a
3,4,4.0,d,d
4,5,4.0,e,d
5,6,4.0,f,f
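
One caveat with `x.mode()[0]`: if every value in a group is missing, `mode()` returns an empty Series and the lookup fails. A possible guard is sketched below; the helper `fill_with_mode` and the small frame `df` are hypothetical names introduced for this illustration.

```python
import numpy as np
import pandas as pd

def fill_with_mode(s):
    """Fill NaN in a Series with its first mode; return it unchanged if there is no mode."""
    modes = s.mode()      # NaN values are ignored by default
    if modes.empty:       # every value in this group is missing
        return s
    return s.fillna(modes.iloc[0])

# Hypothetical data: group B contains only missing values
df = pd.DataFrame({'grp': ['A', 'A', 'B', 'B'],
                   'val': ['x', np.nan, np.nan, np.nan]})

filled = df.groupby('grp')['val'].transform(fill_with_mode)
print(filled)
```

Here the missing value in group A is filled with `'x'`, while group B stays missing because it has nothing to impute from; such groups would need a fallback, for example the overall mode.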
