# Imputing Missing Data
This notebook walks through a few ways of imputing missing data, using World Bank population data. For details on how to read this data set, see the 'Combining data sets 2' file in this repository. The focus here is on imputation techniques.

In [1]:
# First import pandas
import pandas as pd

In [2]:
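# Read the data set; skiprows=4 skips the lines that precede the header row in this file
Population = pd.read_csv("Population.csv", skiprows=4)
# Drop the 2019 column and the unnamed trailing column
Population = Population.drop(['2019', 'Unnamed: 64'], axis=1)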
Population.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,...,101455.0,101669.0,102046.0,102560.0,103159.0,103774.0,104341.0,104872.0,105366.0,105845.0
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996973.0,9169410.0,9351441.0,9543205.0,9744781.0,9956320.0,...,28394813.0,29185507.0,30117413.0,31161376.0,32269589.0,33370794.0,34413603.0,35383128.0,36296400.0,37172386.0
2,Angola,AGO,"Population, total",SP.POP.TOTL,5454933.0,5531472.0,5608539.0,5679458.0,5735044.0,5770570.0,...,22514281.0,23356246.0,24220661.0,25107931.0,26015780.0,26941779.0,27884381.0,28842484.0,29816748.0,30809762.0
3,Albania,ALB,"Population, total",SP.POP.TOTL,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,...,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0,2866376.0
4,Andorra,AND,"Population, total",SP.POP.TOTL,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,...,84463.0,84449.0,83747.0,82427.0,80774.0,79213.0,78011.0,77297.0,77001.0,77006.0


In [3]:
# Reshape from wide (one column per year) to long (one row per country and year)
df = pd.melt(Population, id_vars=['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'], var_name='year', value_name='GDP')
df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,year,GDP
0,Aruba,ABW,"Population, total",SP.POP.TOTL,1960,54211.0
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,1960,8996973.0
2,Angola,AGO,"Population, total",SP.POP.TOTL,1960,5454933.0
3,Albania,ALB,"Population, total",SP.POP.TOTL,1960,1608800.0
4,Andorra,AND,"Population, total",SP.POP.TOTL,1960,13411.0


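I reshaped the data from wide to long format with the melt method. You can find a detailed and simplified explanation of the melt method in the combining data frames file in this repository. Note that the value column is named GDP in this notebook even though it holds population counts (indicator SP.POP.TOTL).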
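As a quick illustration of what melt does, here is a minimal sketch on a made-up two-country frame. The frame `wide` and its values are hypothetical and only mimic the layout of the World Bank file.

```python
import pandas as pd

# Hypothetical wide frame: one row per country, one column per year
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1960": [1.0, 10.0],
    "1961": [2.0, 20.0],
})

# melt keeps the id_vars columns and stacks the remaining columns:
# the old column headers become values of "year",
# the old cell contents become values of "value"
long = pd.melt(wide, id_vars=["Country Name"], var_name="year", value_name="value")
print(long)
```

The result has one row per country and year. The `year` values are strings because they come from column headers; four-digit year strings still sort in chronological order.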

In [4]:
# Count the missing values in each column
df.isnull().sum()

Country Name        0
Country Code        0
Indicator Name      0
Indicator Code      0
year                0
GDP               167
dtype: int64
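The column totals above do not show how the missing values are spread across countries, which matters for the group-wise imputation that follows. A minimal sketch of a per-group count, on a hypothetical toy frame (the names `toy` and `missing_per_country` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: country A has one gap, country B has no observations
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "value": [1.0, np.nan, np.nan, np.nan],
})

# isna() gives a boolean Series; summing it per country counts the missing rows
missing_per_country = toy["value"].isna().groupby(toy["country"]).sum()
print(missing_per_country)
```

On the real data the same pattern would be `df['GDP'].isna().groupby(df['Country Name']).sum()`.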

## Imputing mean
In this part we fill each country's missing GDP values with that country's mean GDP.  
To do this, we first group the data by country and then impute the mean within each group.  
The groupby method splits the data into groups according to the given variable, in this case the country name.  
The transform() method applies a function to each group and returns a result with the same index as the original data, so it can be assigned back as a new column. In this example we define the function with lambda, which is handy when we need a small function that can be expressed in one line.  
The fillna method replaces missing values with the specified value, in this case the group mean.

In [5]:
df['GDP_filled_mean'] = df.groupby('Country Name')['GDP'].transform(lambda x: x.fillna(x.mean()))

In [6]:
df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,year,GDP,GDP_filled_mean
0,Aruba,ABW,"Population, total",SP.POP.TOTL,1960,54211.0,54211.0
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,1960,8996973.0,8996973.0
2,Angola,AGO,"Population, total",SP.POP.TOTL,1960,5454933.0,5454933.0
3,Albania,ALB,"Population, total",SP.POP.TOTL,1960,1608800.0,1608800.0
4,Andorra,AND,"Population, total",SP.POP.TOTL,1960,13411.0,13411.0
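The first rows shown above have no gaps, so the effect of the imputation is not visible there. Here is a minimal sketch of the same groupby / transform / fillna pattern on a hypothetical toy frame (the frame and its column names are made up for illustration), including a group with no observations at all:

```python
import numpy as np
import pandas as pd

# Hypothetical data: A has one gap, B is entirely missing
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "value": [1.0, np.nan, 3.0, np.nan, np.nan],
})

# Fill each gap with the mean of its own country
toy["value_filled_mean"] = toy.groupby("country")["value"].transform(
    lambda s: s.fillna(s.mean())
)
print(toy)
```

The gap in A is filled with 2.0, the mean of 1.0 and 3.0. The rows for B stay missing, because a group without any valid value has no mean to fill with.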


## Backward / Forward imputation
When the data are ordered in time, imputing the mean may not be the most appropriate way to handle missing values. Instead, we can fill a gap with the last valid observation (forward fill) or the next valid observation (backward fill).  
**Note:** These methods only make sense for time-ordered data.  

In this example we apply backward fill and forward fill within each country, so we again use the groupby method to create country groups. We also need to sort the data set by year, because the order of the rows determines which value is "last" or "next".

In [9]:
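# Forward fill within each country: carry the last valid value forward
# (.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions)
df['GDP_filled_forward'] = df.sort_values('year').groupby('Country Name')['GDP'].ffill()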

df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,year,GDP,GDP_filled_mean,GDP_filled_forward
0,Aruba,ABW,"Population, total",SP.POP.TOTL,1960,54211.0,54211.0,54211.0
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,1960,8996973.0,8996973.0,8996973.0
2,Angola,AGO,"Population, total",SP.POP.TOTL,1960,5454933.0,5454933.0,5454933.0
3,Albania,ALB,"Population, total",SP.POP.TOTL,1960,1608800.0,1608800.0,1608800.0
4,Andorra,AND,"Population, total",SP.POP.TOTL,1960,13411.0,13411.0,13411.0


In [10]:
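# Backward fill within each country: carry the next valid value backward
# (.bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions)
df['GDP_filled_backward'] = df.sort_values('year').groupby('Country Name')['GDP'].bfill()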

df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,year,GDP,GDP_filled_mean,GDP_filled_forward,GDP_filled_backward
0,Aruba,ABW,"Population, total",SP.POP.TOTL,1960,54211.0,54211.0,54211.0,54211.0
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,1960,8996973.0,8996973.0,8996973.0,8996973.0
2,Angola,AGO,"Population, total",SP.POP.TOTL,1960,5454933.0,5454933.0,5454933.0,5454933.0
3,Albania,ALB,"Population, total",SP.POP.TOTL,1960,1608800.0,1608800.0,1608800.0,1608800.0
4,Andorra,AND,"Population, total",SP.POP.TOTL,1960,13411.0,13411.0,13411.0,13411.0


In [11]:
# Let's check for null values
df.isnull().sum()

Country Name             0
Country Code             0
Indicator Name           0
Indicator Code           0
year                     0
GDP                    167
GDP_filled_mean         59
GDP_filled_forward     157
GDP_filled_backward     66
dtype: int64

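As you can see above, there are still null values. So what happened? Forward fill cannot fill a gap that has no valid value before it within the same country, and backward fill cannot fill a gap that has no valid value after it. Likewise, mean imputation leaves a country untouched when it has no valid values at all; the 59 remaining nulls in GDP_filled_mean match the 59 years from 1960 to 2018, which is what we would expect if one country has no observations. We can chain both fill methods to get a complete data set. Be aware that in the cell below only the forward fill runs per country: the final backward fill is applied to the whole sorted Series, so a gap that is still open after the forward fill takes its value from the next row in the sorted data, which may belong to a different country.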
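The leftover nulls are easy to reproduce on a small frame. This is a minimal sketch with hypothetical data (the frame `toy` and its columns are made up for illustration), using the same sort, group and fill steps as above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: A starts with a gap, B ends with two gaps
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": ["1960", "1961", "1962", "1960", "1961", "1962"],
    "value": [np.nan, 5.0, np.nan, 7.0, np.nan, np.nan],
})

# Sort by year first, then fill within each country
grouped = toy.sort_values("year").groupby("country")["value"]
toy["forward"] = grouped.ffill()   # A's 1960 gap has nothing before it
toy["backward"] = grouped.bfill()  # A's 1962 gap and B's 1961-1962 gaps have nothing after them
print(toy)
```

Forward fill leaves one null (the leading gap in A) and backward fill leaves three (the trailing gaps in A and B). The assignment lines up correctly even though the data were sorted, because pandas aligns on the index.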

In [12]:
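# Forward fill within each country, then backward fill whatever is still missing.
# Caution: the trailing .bfill() is not grouped, so it can pull values from another country.
df['GDP_ff_bf'] = df.sort_values('year').groupby('Country Name')['GDP'].ffill().bfill()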
df.isnull().sum()

Country Name             0
Country Code             0
Indicator Name           0
Indicator Code           0
year                     0
GDP                    167
GDP_filled_mean         59
GDP_filled_forward     157
GDP_filled_backward     66
GDP_ff_bf                0
dtype: int64
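In the chained call above, only the forward fill is applied per country; the trailing backward fill works on the whole sorted Series. The sketch below, on a hypothetical toy frame, contrasts that with a variant that keeps both fills inside each country. The column names `leaky` and `per_country` are made up for this illustration, and the rows are sorted by year and country only to make the row order deterministic.

```python
import numpy as np
import pandas as pd

# Hypothetical data: country A has no observations at all
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": ["1960", "1961", "1960", "1961"],
    "value": [np.nan, np.nan, 7.0, 8.0],
})
ordered = toy.sort_values(["year", "country"])

# Grouped forward fill followed by an ungrouped backward fill:
# A's gaps are filled from the B rows that follow them in the sorted order
toy["leaky"] = ordered.groupby("country")["value"].ffill().bfill()

# Both fills inside each country: A stays missing because it has nothing to borrow
toy["per_country"] = ordered.groupby("country")["value"].transform(
    lambda s: s.ffill().bfill()
)
print(toy)
```

Here `leaky` has no nulls, but country A has been given country B's values, while `per_country` leaves A's two rows missing. Which behaviour is wanted depends on the analysis; a country without any data may be better dropped or handled separately than filled from its neighbour.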