<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1">Imports</a></span><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-1.1">Libraries</a></span></li></ul></li><li><span><a href="#Data" data-toc-modified-id="Data-2">Data</a></span></li><li><span><a href="#Index" data-toc-modified-id="Index-3">Index</a></span></li><li><span><a href="#Resample" data-toc-modified-id="Resample-4">Resample</a></span></li><li><span><a href="#Plotting-on-a-time-axis" data-toc-modified-id="Plotting-on-a-time-axis-5">Plotting on a time axis</a></span><ul class="toc-item"><li><span><a href="#Plotting-libraries" data-toc-modified-id="Plotting-libraries-5.1">Plotting libraries</a></span></li></ul></li></ul></div>

# E - Temporal data

This notebook shows how to import, format and reconfigure tables of data organized chronologically, using `pandas` data frames.

In the next cells we:
* read data from a file that has a column with the date and time and parse them as time
* select data from a certain period
* resample the data to a different time step
* plot with a time x-axis

## Imports

### Libraries

In [None]:
import pandas as pd
import numpy as np

## Data

We will use measurements of temperature and relative humidity from an experiment with lettuce in Berlin.

In [None]:
df_climate = pd.read_excel( '../../data/climate.xlsx', sheet_name='greenhouse' )

In [None]:
df_climate.head()

In [None]:
df_climate.tail()

We have measurements of temperature and relative humidity inside a greenhouse, taken at intervals of (about) 5 minutes.

The data span from the 18th of April until the 24th of May 2018, a little more than one month.

## Index

Note that the index is numeric. We want to make it time-aware using the data in the ___Date and Time___ column to be able to select rows in time ranges.

For that, we use `pd.DatetimeIndex` and pass it the column of the table that holds the date and time.

In [None]:
df_climate.head()

Old index:

In [None]:
df_climate.index

New index:

In [None]:
df_climate.index = pd.DatetimeIndex( df_climate[ 'Date and Time' ] )

Note: A common source of error in this step is confusing days and months, because their order varies between date formats. The same can happen with the year if only its last two digits are written: which are the day, the month and the year in the date __10-11-12__?

If the days come first, you can say so with `dayfirst=True` (use `yearfirst=True` instead if the year comes first):
`df_climate.index = pd.DatetimeIndex( df_climate[ 'Date and Time' ], dayfirst=True )`
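
The ambiguity can be sketched with `pd.to_datetime` on the example date from the note. The variable names are only illustrative. When the layout of the file is known, an explicit `format` string removes the guessing altogether.

```python
import pandas as pd

ambiguous = '10-11-12'

# Default behaviour: the string is read as month-day-year
month_first = pd.to_datetime(ambiguous)

# Tell pandas that the day comes first (day-month-year)
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# Tell pandas that the year comes first (year-month-day)
year_first = pd.to_datetime(ambiguous, yearfirst=True)

# Spell out the layout: day, month, two-digit year
explicit = pd.to_datetime(ambiguous, format='%d-%m-%y')

print(month_first.date(), day_first.date(), year_first.date(), explicit.date())
```

The same string gives three different dates, so it is worth checking a few rows of the table against the original file after setting the index.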

Now the index is a time object and we can select rows according to it:

In [None]:
df_climate.head()

To select a particular time period, we can use `pd.Timestamp`.

Let's say we want to check the data of the 23rd of May:

In [None]:
start = pd.Timestamp( '2018-05-23, 00:00:00' )
end = pd.Timestamp( '2018-05-24, 00:00:00' )

In [None]:
condition_start = df_climate.index > start
condition_end = df_climate.index < end

df_climate[ condition_start & condition_end ]

And we can use this new table to calculate the average (or max, min, etc) of the columns on that interval:

In [None]:
df_climate[ condition_start & condition_end ].mean()

Of course, we can also look at other periods with this technique. We just follow these steps:
* Define a start time with `pd.Timestamp`
* Define a finishing time with `pd.Timestamp`
* Check condition: > start
* Check condition: < end
* Select from the table (query)
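
The steps above can be wrapped in a small helper. This is a minimal sketch: `select_period` and `df_demo` are hypothetical names, and the data are synthetic, standing in for `df_climate`. It also shows label slicing with `.loc` as a shorter alternative, which differs in that it includes both end points.

```python
import numpy as np
import pandas as pd

def select_period(df, start, end):
    """Return the rows whose index lies strictly between start and end."""
    start = pd.Timestamp(start)
    end = pd.Timestamp(end)
    return df[(df.index > start) & (df.index < end)]

# Synthetic stand-in for df_climate: one value every 5 minutes from 07:50
index = pd.date_range('2018-05-23 07:50', periods=20, freq='5min')
df_demo = pd.DataFrame({'Temp. (°C)': np.arange(20.0)}, index=index)

# Strict comparison: 08:00 and 09:00 themselves are left out
between = select_period(df_demo, '2018-05-23 08:00:00', '2018-05-23 09:00:00')

# Label slicing on a sorted time index: both end points are included
sliced = df_demo.loc[pd.Timestamp('2018-05-23 08:00:00'):pd.Timestamp('2018-05-23 09:00:00')]

print(len(between), len(sliced))
```

With real measurements taken at irregular seconds the two approaches usually return the same rows, because a timestamp rarely falls exactly on a limit.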

In [None]:
start = pd.Timestamp( '2018-05-23, 08:00:00' )
end = pd.Timestamp( '2018-05-23, 09:00:00' )

condition_start = df_climate.index > start
condition_end = df_climate.index < end

df_climate[ condition_start & condition_end ]

This technique does not look for exact matches, because the measuring times in the table include irregular seconds. This makes it very useful: it also works when timestamps are missing or unevenly spaced.

It is now very easy to get the average temperature in the period between 8:00 and 9:00 that we selected previously:

In [None]:
start = pd.Timestamp( '2018-05-23, 08:00:00' )
end = pd.Timestamp( '2018-05-23, 09:00:00' )

condition_start = df_climate.index > start
condition_end = df_climate.index < end

df_climate[ condition_start & condition_end ].mean()


## Resample

Lastly, we give a quick introduction to `.resample`, a method that changes the interval at which data are given: either ___upsample___ (get more points, at smaller intervals) or ___downsample___ (aggregating values into bigger intervals).

The data in the climate data frame is stored in intervals of about 5 minutes. 

We will first ___downsample___ it to hourly and daily values.

There are two things that we have to have clear to correctly resample data:

* What the new frequency will be. 1 hour? 15 minutes? 1 day?
* How we will create the new values. Sum? Average?

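Regarding the first question, we use the following letters to specify the new frequency:

* M → monthly frequency
* W → weekly frequency
* D → daily frequency
* H → hourly frequency
* T → minutely frequency
* S → secondly frequency
* L → milliseconds
* U → microseconds
* N → nanoseconds

Note that pandas 2.2 and later deprecate some of these letters in favour of `ME` (month end), `h`, `min`, `s`, `ms`, `us` and `ns`, which replace `M`, `H`, `T`, `S`, `L`, `U` and `N` respectively.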
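
A number in front of the frequency alias sets a multiple of that unit. The sketch below uses a synthetic series (`readings` is a hypothetical name) and the aliases `min` and `D`, which are assumed here because they are accepted by both older and newer pandas versions.

```python
import pandas as pd

# Synthetic series: one reading every 5 minutes for 2 hours
index = pd.date_range('2018-05-23 08:00', periods=24, freq='5min')
readings = pd.Series(range(24), index=index, dtype=float)

# 15-minute bins: each bin averages 3 readings
quarter_hourly = readings.resample('15min').mean()

# Daily bins: all readings fall on the same day
daily = readings.resample('D').mean()

print(quarter_hourly.head())
print(daily)
```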

Regarding the second question, consider irrigation water (in liters [L]): the amounts applied in the morning and in the afternoon should be ___added___ up to give a daily value.

On the other hand, the temperature in the morning and the temperature in the afternoon should be ___averaged___ to give a daily value.

In other cases we need the last value, the first, or the most common one in the interval. At the end of this notebook is a link to a very nice post where these options can be consulted.
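
The difference between these choices can be seen on a tiny table. This is a sketch with invented values: the column `Water (L)` and the three readings are hypothetical and only illustrate how the aggregation is chosen per variable.

```python
import pandas as pd

# Hypothetical readings for one day: irrigation volume and air temperature
index = pd.to_datetime(['2018-05-23 08:00', '2018-05-23 12:00', '2018-05-23 16:00'])
df_demo = pd.DataFrame(
    {'Water (L)': [2.0, 0.0, 3.0], 'Temp. (°C)': [18.0, 26.0, 22.0]},
    index=index,
)

daily = df_demo.resample('D')

total_water = daily['Water (L)'].sum()     # volumes add up over the day
mean_temp = daily['Temp. (°C)'].mean()     # temperatures are averaged
first_temp = daily['Temp. (°C)'].first()   # first reading of the day
last_temp = daily['Temp. (°C)'].last()     # last reading of the day

print(total_water.iloc[0], mean_temp.iloc[0], first_temp.iloc[0], last_temp.iloc[0])
```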

First we will resample the whole data frame to 1 hour, taking the average of the values in each hourly interval. 

It is like this now:

In [None]:
df_climate.head()

In [None]:
df_climate_1h = df_climate.resample( '1H' ).mean()

In [None]:
df_climate_1h.head()

In [None]:
df_climate_1h.tail()

And now the same for daily values:

In [None]:
df_climate_1d = df_climate.resample( '1D' ).mean()

In [None]:
df_climate_1d.head()

In [None]:
df_climate_1d.tail()

And that is how easily we get the daily average temperature and humidity from measurements taken at 5-minute intervals!

If we want to resample with a different function for each column (the maximum for one, the sum or the mean for others), we pass a dictionary that maps each column name to its function.

In this case, we resample to the same 1-day frequency, but take the minimum temperature and the average relative humidity.

Remember to import numpy to use the functions `np.min` and `np.mean`! The strings `'min'` and `'mean'` are accepted as well.

In [None]:
import numpy as np

In [None]:
df_climate.resample( '1D' ).agg( {'Temp. (°C)':np.min,'Rel. Humidity (%)':np.mean} )

Lastly, we will show what happens if the new frequency is higher, i.e. the time intervals are smaller.

In these cases we get empty cells that need to be filled with _something_. Common options are the next value, the last value, or leaving the cells empty.

As an example, we will change the frequency from 5 minutes to 1 minute.

The mean, the sum and other functions that ___aggregate___ values have no meaning in this case, because we are "creating" new values, cells that were not there before. Here we fill them with the last known value (`ffill`, forward fill):

In [None]:
df_climate['Temp. (°C)'].resample( '1min' ).ffill()
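
The filling options mentioned above can be compared side by side. A minimal sketch, assuming a synthetic series `temp` with only two measurements five minutes apart:

```python
import pandas as pd

# Two measurements, 5 minutes apart
index = pd.to_datetime(['2018-05-23 08:00', '2018-05-23 08:05'])
temp = pd.Series([20.0, 25.0], index=index)

upsampled = temp.resample('1min')

last_known = upsampled.ffill()   # repeat the last measurement
next_known = upsampled.bfill()   # use the next measurement
empty = upsampled.asfreq()       # leave the new cells empty (NaN)

print(pd.DataFrame({'ffill': last_known, 'bfill': next_known, 'asfreq': empty}))
```

Which option is appropriate depends on the variable: repeating the last value is a reasonable assumption for slowly changing quantities, while leaving cells empty keeps it visible that no measurement exists there.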

## Plotting on a time axis

If we have a data frame with time index, we can plot it using `matplotlib`, and treat the x-axis as time. 

This can be useful to set the limits of the axis, or to align plots from different sources. As an example, we often have measurements in different tables, with different time stamps. They will be plotted correctly if we previously set the indexes to time.

In this example, we will plot the temperature from before, as well as the daily averages as points.

### Plotting libraries

In [None]:
%matplotlib inline
#matplotlib notebook

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

First a simple plot:

In [None]:
fig, ax = plt.subplots()

ax.plot( df_climate.index, df_climate['Temp. (°C)'] )

plt.show()

Now a slightly more elaborate format.

To ensure that we get the ticks on the x-axis where we want them, we use the functions in `matplotlib.dates`, which was imported as `mdates` before.

Basically we use the `Locator` and the `Formatter`.

In this case, we put the ticks once a week, on Mondays (`mdates.MO`).

In [None]:
fig, ax = plt.subplots( figsize=(12,5) )

ax.plot( df_climate.index, df_climate['Temp. (°C)'], linewidth=2 )

ax.xaxis.set_major_locator( mdates.WeekdayLocator(byweekday=mdates.MO) ) # Put x ticks on Mondays
ax.xaxis.set_major_formatter( mdates.DateFormatter('%d/%b') )            # Print only day and month
ax.tick_params( axis='both', which='major', labelsize=14 )

plt.show()
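
The format string given to `DateFormatter` uses the same codes as `strftime`. The sketch below applies two formatters to a single date outside of any plot, only to show what the tick labels will look like. The date chosen is an arbitrary Monday within the data range.

```python
from datetime import datetime
import matplotlib.dates as mdates

monday = datetime(2018, 4, 23)

# matplotlib stores dates internally as floating point numbers (days)
x = mdates.date2num(monday)

day_month = mdates.DateFormatter('%d/%m')    # day and month as numbers
day_abbrev = mdates.DateFormatter('%d/%b')   # %b is the abbreviated month name of the locale

print(day_month(x))
print(day_abbrev(x))

# mdates.MO is the Monday constant used by WeekdayLocator
print(monday.weekday() == mdates.MO.weekday)
```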

Now we can add a second data line with the daily means. Notice that the x and y coordinates differ between the two calls to `ax.plot`. This allows us to combine different sources of data and get the plots aligned correctly.

In [None]:
fig, ax = plt.subplots( figsize=(12,5) )

ax.plot( df_climate.index, df_climate['Temp. (°C)'], linewidth=2 )

ax.plot( df_climate_1d.index, df_climate_1d['Temp. (°C)'], linewidth=0, marker='s', markersize=10, color='red' )

ax.xaxis.set_major_locator( mdates.WeekdayLocator(byweekday=mdates.MO) )
ax.xaxis.set_major_formatter( mdates.DateFormatter('%d/%m') )
ax.tick_params( axis='both', which='major', labelsize=14 )

plt.show()

Now, adding a third source of data, the hourly averages, makes the plot look a little messy:

In [None]:
fig, ax = plt.subplots( figsize=(12,5) )

ax.plot( df_climate.index, df_climate['Temp. (°C)'], linewidth=2 )

ax.plot( df_climate_1h.index, df_climate_1h['Temp. (°C)'], linewidth=0, marker='^', markersize=10, color='green', alpha=0.3 )
ax.plot( df_climate_1d.index, df_climate_1d['Temp. (°C)'], linewidth=0, marker='s', markersize=10, color='red' )

ax.xaxis.set_major_locator( mdates.WeekdayLocator(byweekday=mdates.MO) )
ax.xaxis.set_major_formatter( mdates.DateFormatter('%d/%m') )
ax.tick_params( axis='both', which='major', labelsize=14 )

plt.show()

But it looks fine if we select a suitable time range and add a little more formatting to the whole plot:

In [None]:
fig, ax = plt.subplots( figsize=(12,5) )

ax.plot( df_climate.index, df_climate['Temp. (°C)'], linewidth=2, label='Air temperature' )

ax.plot( df_climate_1h.index, df_climate_1h['Temp. (°C)'], linewidth=0, marker='^', markersize=10, 
         markeredgecolor='green', markerfacecolor='None', alpha=0.9, label='Air temperature: 1h mean' )
ax.plot( df_climate_1d.index, df_climate_1d['Temp. (°C)'], linewidth=0, marker='s', markersize=10, 
         color='red', label='Air temperature: 1 day mean' )

ax.xaxis.set_major_locator( mdates.DayLocator() )
ax.xaxis.set_major_formatter( mdates.DateFormatter('%d/%m') )
ax.tick_params( axis='both', which='major', labelsize=14 )

start = pd.Timestamp('2018-Apr-23 00:00:00')
end = pd.Timestamp('2018-Apr-26 00:00:00')
ax.set_xlim( [ start, end ] )

ax.legend()

ax.set_xlabel( 'Timestamp', fontsize=14 )
ax.set_ylabel( 'Temperature', fontsize=14 )

plt.show()

As said, these examples are meant to be instructive and to serve as a reference to consult later.

All the materials will be publicly available in a repository and will be updated eventually. Feel free to download and share.

Hopefully you found a couple of ideas to test in your projects!