# Data Transformations with Pandas in Python - Combining Data

Welcome to the notebook about combining data! In this notebook we will learn how to combine two or more datasets. For this we will look at two parts:
1. Flexibility with indexes - How can we make sure our datasets have the same index (which is needed for joining)?
2. Joining two different dataframes - How can we combine two datasets?

Good luck!

## 1. Flexibility with indexes

By default, the index of a dataframe consists of the numbers 0 to length-1. For some operations, however, another index is more useful, for example when combining two dataframes, plotting, or resampling. In other cases a Pandas operation changes your index and you want to go back to the 'regular' one.
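To make the default index concrete, here is a minimal sketch on a small made-up dataframe (the name `toy` and its values are assumptions for illustration only, not part of the course data).

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'temperature': [21.5, 22.0, 20.8]})

# Without further instructions, pandas numbers the rows 0, 1, 2, ...
default_index = list(toy.index)
print(toy.index)
print(default_index)
```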

In this part about indexes, we will cover three situations: 
1. Using a single column as index
2. Using multiple columns as index
3. Going back to a normal index

Let's see some applications with the data from `tempData.csv` and some additional date columns that we create ourselves. Run the cell below to check the data.

In [None]:
import pandas as pd

tempData = pd.read_csv('tempData.csv')
tempData.Datetime = pd.to_datetime(tempData.Datetime,dayfirst=True)

tempData['year'] = tempData.Datetime.dt.year
tempData['month'] = tempData.Datetime.dt.month
tempData['day'] = tempData.Datetime.dt.day

tempData

### 1.1 Using a single column as index
We can set any of the columns of a dataframe as index with the method `.set_index()`.

Let's set the values in the column `Datetime` as index.

In [None]:
# Setting the column Datetime as index
tempData_time_indexed = tempData.set_index('Datetime')
tempData_time_indexed

Setting a time column as index makes the method `.resample()` easier to use. Normally, you have to supply both a `rule` and the column `on` which you want to resample. However, if your index actually contains time data, supplying only the `rule` is enough. See the example below.

In [None]:
# Resampling the original data to monthly averages: argument 'on' is needed
# (pandas 2.2 and later prefer the alias 'ME' over 'M' for month-end frequency)
tempData.resample(rule='M', on='Datetime').mean()

In [None]:
# Resampling the data with Datetime as index: no need for the argument 'on'
tempData_time_indexed.resample(rule='M').mean()
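If you want to convince yourself that both ways of resampling give the same numbers, you could try them on a small made-up dataset. The sketch below uses hypothetical values and a daily rule (`'D'`) instead of the monthly rule, just to keep the example short.

```python
import pandas as pd

# Hypothetical toy data: two measurements per day for four days
toy = pd.DataFrame({
    'Datetime': pd.to_datetime([
        '2020-01-01 00:00', '2020-01-01 12:00',
        '2020-01-02 00:00', '2020-01-02 12:00',
        '2020-01-03 00:00', '2020-01-03 12:00',
        '2020-01-04 00:00', '2020-01-04 12:00',
    ]),
    'temperature': [10.0, 20.0, 12.0, 22.0, 14.0, 24.0, 16.0, 26.0],
})

# Variant 1: time is a regular column, so the argument 'on' is needed
daily_on = toy.resample(rule='D', on='Datetime').mean()

# Variant 2: time is the index, so the rule alone is enough
daily_indexed = toy.set_index('Datetime').resample(rule='D').mean()

print(daily_on)
print(daily_indexed)
```

Both results have one row per day, with the day as index.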

### 1.2 Using multiple columns as index
You can even use multiple columns as index. This will create an index with multiple levels. See the example below.

In [None]:
# Setting the columns year, month, and day as index
tempData_date_indexed = tempData.set_index(['year', 'month', 'day'])
tempData_date_indexed
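Once a dataframe has an index with multiple levels, you can select rows by giving a value for one or more levels in `.loc`. The sketch below shows this on a made-up dataframe (the name `toy` and all values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data with separate year, month and day columns
toy = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021],
    'month': [1, 1, 2, 1],
    'day': [1, 2, 1, 1],
    'temperature': [10.0, 11.0, 12.0, 13.0],
})
toy_indexed = toy.set_index(['year', 'month', 'day'])

# The index now has three levels
level_names = list(toy_indexed.index.names)

# Select all rows of January 2020 by giving the first two levels
jan_2020 = toy_indexed.loc[(2020, 1), :]

# Select the temperature of one specific day by giving all three levels
one_day = toy_indexed.loc[(2020, 2, 1), 'temperature']

print(level_names)
print(jan_2020)
print(one_day)
```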

### 1.3 Going back to a normal index
Methods like `.resample()` and `.groupby()` create a special index of their own. See for example the resampled examples above: the time frequency became the index. You can also get a special index in other cases, for example when you turn a NetCDF dataset into a pandas dataframe.

If for some reason you want to obtain a 'normal' index, you can always use `.reset_index()`.

In [None]:
# Going back to a normal index using .reset_index()
back_to_normal = tempData_time_indexed.reset_index()
back_to_normal
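The same method also works after `.groupby()`. The sketch below, on made-up data, shows two variants: by default `.reset_index()` turns the old index back into a column, while `drop=True` throws the old index away. The names `toy`, `yearly_flat`, and `yearly_dropped` are only for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'temperature': [10.0, 12.0, 14.0, 16.0],
})

# After grouping, 'year' is the index instead of a column
yearly = toy.groupby('year').mean()

# Default: the old index becomes a regular column again
yearly_flat = yearly.reset_index()

# drop=True: the old index is discarded completely
yearly_dropped = yearly.reset_index(drop=True)

print(yearly_flat)
print(yearly_dropped)
```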

## 2. Joining two different dataframes

The above described tools for indexing are especially relevant when you want to combine multiple dataframes.

You can join two dataframes together. The result contains the columns of both dataframes, and values with the _same index_ end up on the _same row_. In other words, for joining it is very important to prepare the indexes correctly.
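A minimal sketch of this behaviour, using two small made-up dataframes (`left` and `right` are hypothetical names): the rows of `right` are stored in a different order, but the join still matches them by index label and not by position.

```python
import pandas as pd

# Hypothetical toy dataframes with the same index labels in a different order
left = pd.DataFrame({'pm25': [5.0, 7.0, 9.0]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'temperature': [20.0, 21.0, 22.0]}, index=['c', 'a', 'b'])

# Values with the same index label end up on the same row
joined = left.join(right)
print(joined)
```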

We have two datasets (`pm25Data.csv` and `tempData.csv`) that we would like to combine into one. They might have different time steps and different indexes. So, our task will be to give the two datasets the same kind of index so that we can combine them.

Let's first load the data. Run the code below.

In [None]:
# Loading pm25Data
pm25Data = pd.read_csv('pm25Data.csv')
pm25Data.Datetime = pd.to_datetime(pm25Data.Datetime, dayfirst=True)
pm25Data

In [None]:
# Loading tempData
tempData = pd.read_csv('tempData.csv')
tempData.Datetime = pd.to_datetime(tempData.Datetime, dayfirst=True)
tempData

### Combining `pm25Data` with `tempData`
First, make sure both datasets have a similar index. One smart way to do that is by resampling them both to the same time frequency. Then the index of both will be time and combining them will bring the same times on the same row.

In [None]:
# Resample the PM2.5 data to hourly frequency
# (pandas 2.2 and later prefer the lowercase alias 'h' over 'H')
pm25_hour = pm25Data.resample(rule='H', on='Datetime').mean()
pm25_hour

In [None]:
# Resample the temperature data to hourly frequency
temp_hour = tempData.resample(rule='H', on='Datetime').mean()
temp_hour

We can combine two dataframes with the method `.join()`. For example, joining `temp_hour` to `pm25_hour`:

In [None]:
pm25_temp_joined = pm25_hour.join(temp_hour)
pm25_temp_joined

**Exercise**: What happens if you use `temp_hour.join(pm25_hour)` instead of `pm25_hour.join(temp_hour)`? Try it in the cell below and explain the difference.

In [None]:
# Write code that joins pm25_hour to temp_hour, instead of temp_hour to pm25_hour

The method `.join()` also has several options. For example, you can decide which rows you want to keep and whether the columns from one of the dataframes get a suffix. You can check these options out for yourself by looking on the internet (search for 'Pandas dataframe join'), or by checking the built-in help text (`pm25_hour.join?`).
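As an illustration of the option that decides which rows to keep, the sketch below uses the argument `how` on two made-up dataframes whose indexes only partly overlap. The names and values are assumptions for illustration; the suffix options are left for you to discover.

```python
import pandas as pd

# Hypothetical toy dataframes with partly overlapping indexes
left = pd.DataFrame({'pm25': [5.0, 7.0, 9.0]}, index=[1, 2, 3])
right = pd.DataFrame({'temperature': [20.0, 21.0, 22.0]}, index=[2, 3, 4])

# how='inner': keep only index values that occur in both dataframes
inner = left.join(right, how='inner')

# how='outer': keep all index values that occur in either dataframe
outer = left.join(right, how='outer')

print(inner)
print(outer)
```

Rows that exist in only one of the two dataframes get `NaN` in the columns of the other one.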

**Exercise**: From the help text, learn how you can give the column of `pm25_hour` the suffix `DF1` and the column of `temp_hour` the suffix `DF2` when you use `pm25_hour.join(temp_hour)`.

Steps:
- Read the help text by writing `pm25_hour.join?`
- Find the arguments that you need for the above mentioned tasks and write code that uses those arguments.

In [None]:
# Write your code for joining with suffixes here.

## Extensive joining example - Satellite and station data

You can use the above techniques to combine, for example, satellite and station data. Of course, you will need to make the necessary preparations: the two datasets must have the same kind of index, on which you can join.

**You can only work with this part if you have installed the packages xarray and netcdf4. If that is not the case, please install those packages or contact one of the instructors/assistants for help.**

### Step 1: Loading the data
We have a timeseries of temperature data for the EMI station Maksegnit (lon=37.56, lat=12.39) inside `makTemp.csv`. Also, we have a file with surface temperatures of the whole world, for multiple years (`temperatures.nc`). Let's start by loading the data. 

In [None]:
# Load maksegnit station temperature data
emiTemp = pd.read_csv('makTemp.csv')

In [None]:
# Load satellite data
import xarray as xr
satTempWorld = xr.open_dataset('temperatures.nc')

### Step 2: Making decisions

We must first make two decisions on how to join the final data:
1. From the satellite data we need to select the lat/lon area closest to Maksegnit. Let us decide to use data between longitudes 37-38 and latitudes 12-13.
2. The surface temperature data is monthly, so let us decide to also turn the EMI station data into monthly data.

For the first decision, we need to _slice_ the data variable `air` (`satTempWorld.air`) for latitudes and longitudes. It has dimensions `(time, lat, lon)`, so we need to slice on the second and third position.

In [None]:
# From the full dataset, only select the surface temperatures for latitudes and longitudes around Maksegnit station
satTempMak = satTempWorld.air[:, (satTempWorld.lat > 12) & (satTempWorld.lat < 13), 
                                 (satTempWorld.lon > 37) & (satTempWorld.lon < 38)]

For the second decision, we need to create monthly averages. The dataframe `emiTemp` (loaded from `makTemp.csv`) has columns `['YEAR', 'Month', 'day', 'temperature']`. We can use the method `.groupby()` to create groups for unique YEAR/Month combinations and take the average per group.

In [None]:
# Select only the columns we need (['YEAR', 'Month', 'temperature']), group the data, and take the average
emiMonth = emiTemp.get(['YEAR', 'Month', 'temperature']).groupby(by=['YEAR', 'Month']).mean()
emiMonth
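To see what this grouping does, you could try it on a few made-up rows first. The sketch below uses the same column names as the station data, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the station data
toy = pd.DataFrame({
    'YEAR': [2000, 2000, 2000, 2001],
    'Month': [1, 1, 2, 1],
    'day': [1, 2, 1, 1],
    'temperature': [20.0, 22.0, 25.0, 19.0],
})

# One group per unique YEAR/Month combination, averaged per group
toy_month = toy[['YEAR', 'Month', 'temperature']].groupby(by=['YEAR', 'Month']).mean()
print(toy_month)
```

The two January 2000 rows are averaged into one row, and `YEAR` and `Month` together form the index of the result.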

### Step 3: Preparing the satellite data to the same format as `emiMonth`

- We now have temperature averages per month for the station data (inside the variable `emiMonth`), with as index `'YEAR' 'Month'`, thanks to `.groupby()`. 
- We also have temperatures per month for the satellite data, for all latitudes and longitudes between 12-13 and 37-38 (`satTempMak`). We need to take a few steps to end up with a dataframe that contains the satellite data in the same format (indexed by `'YEAR' 'Month'`).
    1. We will need to 'average away' the remaining latitudes and longitudes (we only want to keep one value per timestep).
    2. We will need to turn the satellite _xarray dataset_ into a _pandas dataframe_.
    3. In our satellite data, time is given as a timestamp (e.g., `1948-01-01`). We need to get a column with only years and a column with only month numbers from that.
    4. Lastly, we need to set those year- and month-columns as index.

For step 1, we can use the numpy function `np.mean()` in combination with the `axis=` argument: take the mean over the second and third axis (`axis=(1, 2)`), to get the average per time over the selected latitudes and longitudes. See the cell below.

In [None]:
import numpy as np
satTempTime = np.mean(satTempMak, axis=(1, 2))
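The effect of `axis=(1, 2)` is easiest to see on a tiny array. The sketch below uses a plain numpy array with made-up values and an assumed shape of `(time, lat, lon) = (2, 2, 3)`; it is not the satellite data itself.

```python
import numpy as np

# Hypothetical toy array with shape (time, lat, lon) = (2, 2, 3)
toy = np.array([
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],          # time step 0
    [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]],    # time step 1
])

# Average over lat (axis 1) and lon (axis 2): one value per time step remains
per_time = np.mean(toy, axis=(1, 2))
print(per_time.shape)
print(per_time)
```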

For step 2, we can use `.to_dataframe()`. We include `.reset_index()` to get normal indices (without that, we would get the timestamp as index).

In [None]:
satTempTimedf = satTempTime.to_dataframe().reset_index()
satTempTimedf

For step 3, we can obtain years or months only through the attribute `.dt`, which is available to any column with timestamp data (see the notebook `Creating_Data.ipynb` for more explanation on `.dt`). In other words, we can create a column with years through `satTempTimedf.time.dt.year`, and a column with month numbers through `satTempTimedf.time.dt.month`. While adding this as columns to the dataframe, we can already give them **exactly the same name** as the name in our station data dataframe (`YEAR` and `Month`).

In [None]:
# Add columns YEAR and Month
satTempTimedf['YEAR'] = satTempTimedf.time.dt.year
satTempTimedf['Month'] = satTempTimedf.time.dt.month
satTempTimedf
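If you want to check the `.dt` attribute in isolation, the sketch below applies it to three made-up timestamps (the dataframe `toy` is an assumption for illustration).

```python
import pandas as pd

# Hypothetical toy dataframe with a timestamp column
toy = pd.DataFrame({'time': pd.to_datetime(['1948-01-01', '1948-02-01', '1949-01-01'])})

# Extract the year and the month number from each timestamp
toy['YEAR'] = toy.time.dt.year
toy['Month'] = toy.time.dt.month
print(toy)
```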

Finally, for step 4, we can turn the columns `YEAR` and `Month` into the index with the method `.set_index()`, as we saw earlier in this notebook. And we can get rid of the column `time`, because we do not need it anymore.

In [None]:
satMonth = satTempTimedf.set_index(['YEAR', 'Month']).drop(columns='time')
satMonth

### Step 4: Joining EMI Maksegnit data and satellite data together
We now have EMI station data, which is indexed with `'YEAR' 'Month'` (`emiMonth`), and satellite data, indexed in the same way (`satMonth`). We can now join them, so that values with the same index end up in the same row.

In [None]:
emi_sat_joined = emiMonth.join(satMonth)
emi_sat_joined

This looked like a lot of steps. However, keep in mind:
- Once you have coded those steps, you can use similar code for other data files, and you only need to make small changes to the code.
- None of the individual steps is difficult. The challenge is in systematically combining the steps needed to reach your goal.
- If we put all code together in one cell and combine some steps, it already looks smaller (see below).

In [None]:
### All steps for combining Maksegnit station and satellite data combined

# Import necessary packages
import pandas as pd
import numpy as np
import xarray as xr

# Load data, select latitudes and longitudes for the satellite data
emiTemp = pd.read_csv('makTemp.csv')
satTempWorld = xr.open_dataset('temperatures.nc')
satTempMak = satTempWorld.air[:, (satTempWorld.lat > 12) & (satTempWorld.lat < 13), 
                                 (satTempWorld.lon > 37) & (satTempWorld.lon < 38)]

# Turn emi station data into monthly averages
emiMonth = emiTemp.get(['YEAR', 'Month', 'temperature']).groupby(by=['YEAR', 'Month']).mean()

# Take the average over latitudes and longitudes, turn the result into a dataframe, and set 'YEAR' and 'Month' as its index
satTempTimedf = np.mean(satTempMak, axis=(1, 2)).to_dataframe().reset_index()
satTempTimedf['YEAR'] = satTempTimedf.time.dt.year
satTempTimedf['Month'] = satTempTimedf.time.dt.month
satMonth = satTempTimedf.set_index(['YEAR', 'Month']).drop(columns='time')

# Join the two dataframes
emi_sat_joined = emiMonth.join(satMonth)
emi_sat_joined

Most importantly, whether you think it is a lot of work or not, it is worth it: we now have a dataframe with station and satellite data next to each other. It is now very easy to perform operations like **filling missing data** (filling station data based on the satellite data) or **bias correction** (check for a systematic difference between satellite and station, and for example multiply one column by a value calculated from the other column), to name a few.
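As a rough sketch of what such follow-up operations could look like, the cell below works on a small made-up joined dataframe. The column names `station` and `satellite` and all values are assumptions; in your own joined dataframe the columns will have other names. The correction used here is a simple additive one (the mean difference between the two columns), which is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical joined data; column names and values are assumptions
toy = pd.DataFrame(
    {'station': [20.0, np.nan, 24.0], 'satellite': [18.0, 19.0, 22.0]},
    index=pd.MultiIndex.from_tuples(
        [(2000, 1), (2000, 2), (2000, 3)], names=['YEAR', 'Month']
    ),
)

# Bias: mean difference over the rows where both values are present
# (rows with a missing station value are skipped automatically by .mean())
bias = (toy['station'] - toy['satellite']).mean()

# Additive bias correction of the satellite values
toy['satellite_corrected'] = toy['satellite'] + bias

# Fill the missing station values with the corrected satellite values
toy['station_filled'] = toy['station'].fillna(toy['satellite_corrected'])
print(toy)
```

Because both columns share the same index, all of these operations work row by row without any extra matching.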