# Data Transformations with Pandas in Python - Filling Missing Data

Welcome to the notebook on filling missing data. In this notebook we will learn different techniques for filling missing data using Python. In general, there are two approaches:
1. Filling by providing a value (or multiple values)
2. Filling by using a method: forward fill & backward fill

But first, one remark. **The biggest challenge in filling missing data is to properly prepare your data**. Depending on your data, this preparation requires tools like `.melt()` or `.pivot()` (see notebook `Transforming_Data.ipynb`), indexing and joining (see notebook `Combining_Data.ipynb`), aggregating with functions like `.groupby()` and `.resample()`, and so on. In other words: Pandas can fill missing data, but only when the data is well organized. So you need to know how to (re)organize your data and how you want to fill your missing data.

This notebook does not pay attention to all those preparatory steps, and only focuses on the actual filling of missing data. You will see that this is actually quite easy.

Let us start with a simple dataframe, with some columns and some missing data. Run the code cell below.

In [None]:
import numpy as np
import pandas as pd
np.random.seed(301083)
df1 = pd.DataFrame({'Date':pd.date_range('2017-01-01', '2017-01-10', freq='D'),
                    'RF':[1, 2, np.nan, np.nan, 5] * 2,
                    'RH':np.random.randint(40, 80, 10),
                    'Temp1':[np.nan] * 3 + list(np.random.randint(15, 30, 7)),
                    'Temp2':[np.nan, 18, 13, 15, np.nan, 20, 22, 28, np.nan, 25],
                    'Temp3':np.random.randint(20, 30, 10)})
df1
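
Before filling anything, it helps to see where the gaps are. The sketch below is not part of the original exercise set: it uses a small hypothetical frame `demo` (not `df1`) to show how `.isna()` can count missing values per column and select the rows that contain gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
demo = pd.DataFrame({'RF': [1, 2, np.nan, np.nan, 5],
                     'Temp': [np.nan, 18, 13, 15, np.nan]})

# Number of missing values per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = demo[demo.isna().any(axis=1)]
print(rows_with_gaps)
```

The same two expressions can be applied to `df1` to check which columns need filling.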

## 1. Filling by providing a value (or multiple values)
Filling missing data in a Pandas DataFrame can be done with the method `.fillna()`. Like most Pandas methods, it has several options and is very flexible.

For example, we can fill all missing values with **one specific value**.

In [None]:
# Filling all missing values with the number 5
df1.fillna(5) 
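
Note that `.fillna()` returns a new DataFrame and leaves the original unchanged unless you store the result. A minimal sketch of that behavior, using a hypothetical frame `demo` rather than `df1`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0]})

# fillna returns a new, filled DataFrame
filled = demo.fillna(5)

print(demo.isna().sum().sum())    # the original still has its missing value
print(filled.isna().sum().sum())  # the returned copy has none
```

To keep the filled data, assign the result to a variable, as done with `filled` here.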

If, instead, we want to provide **a specific value per column**, we can specify the value per column through a dictionary. And why not make it a bit smarter right away, by filling each column with its own average?

In [None]:
# Calculating the average per column, for which there are missing values
RF_avg = df1.RF.mean()
Temp1_avg = df1.Temp1.mean()
Temp2_avg = df1.Temp2.mean()

# Use a dictionary to specify which column needs to be filled with which value
df1.fillna({'RF':RF_avg, 'Temp1':Temp1_avg, 'Temp2':Temp2_avg})

We can also **use a whole column** to fill another column. For example, filling the column `Temp1` with values from `Temp2`, and vice versa.

In [None]:
df1.fillna({'Temp1':df1.Temp2, 'Temp2':df1.Temp1})
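
Filling from another column works row by row: a gap is filled with the value from the same row of the other column. The sketch below illustrates this with a hypothetical frame `demo` and made-up column names `T1` and `T2`; it also shows that a row stays missing when both columns are missing there.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
demo = pd.DataFrame({'T1': [np.nan, np.nan, 21.0],
                     'T2': [np.nan, 18.0, 13.0]})

# Fill gaps in T1 with the value of T2 in the same row
filled = demo.fillna({'T1': demo['T2']})
print(filled)
# Row 0 stays NaN in T1 (T2 is also missing there),
# row 1 gets 18.0 from T2, row 2 keeps its own value 21.0
```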

**Exercise**: Write code for the following filling missing data options:
1. Fill missing data of column `Temp1` with the value 20
2. Fill missing data of columns `Temp1` and `Temp2` with the average of column `Temp3`

In [None]:
# Write your code for filling missing data here

Likewise, we can fill column `Temp1` with **the average of columns** `Temp2` and `Temp3`; we only need to calculate that average first. The function `.mean()` does this, but by default it returns the average _per column_, whereas we want the average _per row_. The `axis` argument controls this: the default `axis=0` averages along the rows and returns one value per column, while `axis=1` averages along the columns and returns one value per row. Applying `axis=1` to columns `Temp2` and `Temp3` gives the per-row averages that we can use to fill column `Temp1`. See the example below.

In [None]:
# Calculate average per row, for only columns Temp2 and Temp3
row_avg = df1[['Temp2', 'Temp3']].mean(axis=1)
row_avg

In [None]:
# Using the average per row to fill missing data in column Temp1
df1.fillna({'Temp1':row_avg})
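
To make the effect of `axis` concrete, here is a small sketch with hypothetical values (the frame `demo` is not `df1`). It also shows that `.mean()` skips missing values by default, so a row where one column is missing still gets an average from the remaining value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, for illustration only
demo = pd.DataFrame({'Temp2': [np.nan, 18.0, 13.0],
                     'Temp3': [24.0, 22.0, 27.0]})

col_avg = demo.mean()         # axis=0 (default): one value per column
row_avg = demo.mean(axis=1)   # axis=1: one value per row

print(col_avg)
print(row_avg)
# Row 0 has no Temp2 value, so its row average is simply Temp3 (24.0);
# rows 1 and 2 average to 20.0
```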

**Exercise**: Fill the column `Temp2` with the average of columns `Temp1` and `Temp3`.

In [None]:
# Write your code for filling the missing data of column Temp2

## 2. Filling by using a method: forward fill & backward fill
Instead of supplying values, we can also let Pandas fill gaps from neighboring values. With 'forward filling', each missing value is filled with the last available value before it, so available data is carried forward into the gaps. Historically this was selected with the argument `method='ffill'` of `.fillna()`; since pandas 2.1 that argument is deprecated, and the dedicated method `.ffill()` does the same thing.

There is also a method for filling missing values in the other direction, 'backward filling': `.bfill()` (formerly `method='bfill'`).

See the examples below, and carefully compare the result with the original `df1` to see what happened.

In [None]:
# The original df1, for comparison
df1

In [None]:
# Forward filling of missing data
# Missing data at the start is not filled, because there is no earlier value to carry forward
# df1.ffill() is equivalent to df1.fillna(method='ffill'), which is deprecated since pandas 2.1
df1.ffill()

In [None]:
# Backward filling of missing data
# Missing data at the end is not filled, because there is no later value to carry backward
# df1.bfill() is equivalent to df1.fillna(method='bfill'), which is deprecated since pandas 2.1
df1.bfill()

You can combine a filling method with the argument `limit` to specify how many consecutive missing cells may be filled. For example, backward filling with `limit=1` fills at most one value backward per gap. This is relevant if you want to combine different techniques for filling missing data. For example, first fill a few values with forward or backward fill. Then, fill the rest of the missing data based on neighboring stations, satellite data, or something else.

In [None]:
# An example of backward filling with limit
# (equivalent to df1.fillna(method='bfill', limit=1), which is deprecated since pandas 2.1)
df1.bfill(limit=1)
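
The effect of `limit` is easiest to see on a gap of two consecutive missing values. The sketch below uses a hypothetical Series `s` (not a column of `df1`) and the dedicated methods `.ffill()` and `.bfill()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with a gap of two values and a gap at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()               # 1, 1, 1, 4, 4
forward_lim = s.ffill(limit=1)    # 1, 1, NaN, 4, 4
backward_lim = s.bfill(limit=1)   # 1, NaN, 4, 4, NaN

print(pd.DataFrame({'original': s,
                    'ffill': forward,
                    'ffill_limit1': forward_lim,
                    'bfill_limit1': backward_lim}))
```

With `limit=1`, only the value directly next to available data is filled; the rest of the gap stays missing and can be filled with another technique afterwards.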

Smartly filling missing data is a matter of knowing what you want, and logically combining the different tools for that. If you know your data well, you might realize that you do not want only one technique, but different techniques for different situations and different parameters. In that case, you simply have to systematically combine different techniques, step by step.

For example, we cannot fill one column with a method and another with a fixed value all at once. But step by step it is possible. See the code below for a mixture of techniques. That code also uses the argument `inplace=True`, so that the filling changes the dataframe itself instead of returning a new one.

Go over the code line by line and try to understand what is happening.

In [None]:
# Let us create a copy of df1; df1 remains unchanged, but to_fill will become the dataframe with missing data filled
to_fill = df1.copy() 

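# Fill column RF with ffill, with a limit of 1.
# To apply a fill method to a single column, select that column and assign the result back
# (calling fillna(..., inplace=True) on a selected column may not update the dataframe in recent pandas versions)
to_fill['RF'] = to_fill['RF'].ffill(limit=1)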
print('Filled column RF with forward fill (limit=1): \n\n', to_fill, '\n\n')

# Fill remaining missing values in column RF with average of column RF
to_fill.fillna({'RF':to_fill.RF.mean()}, inplace=True)
print('Filled remaining missing values of column RF with mean value: \n\n', to_fill, '\n\n')

# Fill column Temp1 with values of column Temp3 * 0.8 (e.g., bias correction)
to_fill.fillna({'Temp1':to_fill.Temp3 * 0.8}, inplace=True)
print('Filled column Temp1 with values of column Temp3 * 0.8: \n\n', to_fill, '\n\n')

# Fill column Temp2 with the average of columns Temp1 and Temp3
to_fill.fillna({'Temp2':to_fill[['Temp1', 'Temp3']].mean(axis=1)}, inplace=True)
print('Filled column Temp2 with the average of columns Temp1 and Temp3: \n\n', to_fill)

## Practicing with filling missing data - Exercise
Let's practice some filling missing data skills using the rainfall data of six EMI stations. 

First, load the data and check it by running the cell below.

In [None]:
dailyRFdata = pd.read_csv('dailyRFdata.csv')
dailyRFdata

**Exercises**: Practice several ways of filling missing data:
1. Fill missing data of the column `Maksegnit` based on the overall average of `Maksegnit`
2. Fill missing data of the column `Chewahit` with data from the column `Gondar A.P.`
3. Fill missing data of the column `Ambagiorgis Sch` with the method forward fill
4. Fill missing data of the column `Aymba` with the average of all the other stations
5. Calculate the average of `Gondar A.P.` (`avg_GondarAP`) and the average of `Aymba` (`avg_Aymba`), and fill missing data of `Aymba` based on `Gondar A.P. * avg_Aymba / avg_GondarAP`

In [None]:
# Write your code for different ways of filling missing data here
# You can always add more code cells by clicking on the + sign at the top of the Jupyter Notebook