<img src="../images/UBRA_Logo_DATA_TRAIN.png" style="width: 800px;">

<img src="../images/pandas.svg" style="width: 400px;">

# Time series analysis (Pandas)


Nikolay Koldunov

koldunovn@gmail.com

## Module import

First we have to import the necessary modules:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_rows', 15)  # this limits the maximum number of rows
np.set_printoptions(precision=3, suppress=True)  # this is just to make the output look better


In [None]:
pd.__version__

## Loading data

Now that we are done with the preparations, let's get some data.

Pandas has very good IO capabilities, and we are going to use them to load our data and convert it to a time series.

You remember our Hamburg temperature file:

In [None]:
!head ../data/Ham_3column.txt

We can certainly load it with numpy:

In [None]:
temp = np.loadtxt('../data/Ham_3column.txt')

In [None]:
temp

But say we would like to select a specific year:

In [None]:
temp[temp[:,0]==2014]

In [None]:
year2000 = temp[temp[:,0]==2000]
year2000[year2000[:,1]==3]

## Exercise

Finish the code below, so that the result (the `monmean` variable) holds the monthly means for the year 2000:

In [None]:
year2000 = temp[temp[:,0]==2000]
monmean = []
for mon in range(1,13):
    mm = ......
    monmean.append(mm)

There should be a better way to do these things :)

Let's use similar data, but for a different location for a change:

In [None]:
!head -n 30 ../data/Bremen_tmax.txt

In [None]:
tmax = pd.read_csv('../data/Bremen_tmax.txt',skiprows=22,
                   delimiter=r"\s+", parse_dates=[[0,1,2]], header=None)

In [None]:
tmax

Here we read our data from the file, telling pandas that the first 22 lines should be skipped, that the delimiter is whitespace, that it has to combine the information in the 0th, 1st and 2nd columns and try to interpret it as a date, and that there is no header in the original data.
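Recent pandas releases warn about the nested-list form of `parse_dates`, so it may help to know another way to build the dates. The sketch below is only an illustration: the file content, the column names and the single skipped line are made up, and the date is assembled with `pd.to_datetime` after reading.

```python
import io
import pandas as pd

# Hypothetical file content: year, month, day and value separated by spaces
raw = """# made-up header line
1890 1 1 -1.5
1890 1 2 0.3
1890 1 3 2.1
"""

# Read the whitespace-separated columns and give them names ourselves
df = pd.read_csv(io.StringIO(raw), skiprows=1, sep=r"\s+", header=None,
                 names=["year", "month", "day", "Temp"])

# Assemble one datetime column from the year/month/day columns
df["Date"] = pd.to_datetime(df[["year", "month", "day"]])
df = df.drop(columns=["year", "month", "day"]).set_index("Date")
print(df)
```

The result already has a date index and a single `Temp` column, so the renaming and `set_index` steps are folded into the loading code.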

In [None]:
tmax

Rename columns:

In [None]:
tmax.columns = ['Date', 'Temp']

Set the "Date" column as our index (instead of 0, 1, 2, ...), so that pandas understands that our data is actually a time series.

In [None]:
tmax = tmax.set_index(['Date'])

In [None]:
tmax.head(3)

Now we can plot the complete time series:

In [None]:
tmax.plot()

or its part:

In [None]:
tmax.loc['1980':'1990'].plot()

or even smaller part:

In [None]:
tmax.loc['1980-05':'1980-07'].plot()

Time periods can be referenced in a very natural way. You can, of course, also get individual values. By index (date in our case):

In [None]:
tmax.loc['1980-01-02':'1980-01-02']

By integer position:

In [None]:
tmax.iloc[120]

And what if we choose only one month?

In [None]:
tmax.loc['1980-01'].plot()

Isn't that great? :)
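To see what these date strings select, here is a small sketch on a synthetic daily series (the values are just a counter, not real temperatures):

```python
import numpy as np
import pandas as pd

# Synthetic daily series for two years (values are just a counter)
idx = pd.date_range("1980-01-01", "1981-12-31", freq="D")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

one_month = s.loc["1980-01"]          # all days of January 1980
span = s.loc["1980-05":"1980-07"]     # both end points are included
one_day = s.loc["1980-01-02"]         # a full date returns a single value

print(len(one_month), len(span), one_day)
```

Note that, unlike ordinary Python slices, the label-based slice includes the whole end month.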

## Exercise

What was the temperature in Bremen on your birthday (or on the closest day)?

## We can select data by condition

This is a plot of all temperatures larger than 30 degrees Celsius.

In [None]:
tmax[tmax > 30].plot(style='r*')
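A condition such as `tmax > 30` produces a table of True/False values, and indexing with it keeps the original shape while masking everything else with NaN. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Five made-up daily values
idx = pd.date_range("2000-07-01", periods=5, freq="D")
df = pd.DataFrame({"Temp": [28.0, 31.5, 29.9, 33.2, 30.0]}, index=idx)

mask = df > 30          # DataFrame of True/False with the same shape
masked = df[mask]       # same shape as df, NaN where the condition is False
hot = masked.dropna()   # only the rows that satisfy the condition

print(masked)
print(hot)
```

Plotting skips the NaN values, which is why only the selected points show up as stars.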

## Exercise

- plot all temperatures larger than 10 (red stars)
- plot temperatures lower than 10 (blue stars)
- limit both plots to the 1990-2013 period

## Multiple columns

Now let's make life a bit more interesting and get more data. This will be the TMIN time series.

In [None]:
tmin = pd.read_csv('../data/Bremen_tmin.txt',skiprows=22,
                   delimiter=r"\s+", parse_dates=[[0,1,2]], header=None)
tmin.columns = ['Date', 'Temp']
tmin = tmin.set_index(['Date'])

In [None]:
tmin.plot()

Note that the numbers of values in `tmin` and `tmax` are not the same:

In [None]:
tmin.describe()

In [None]:
tmax.describe()

We are going to create an empty DataFrame with an index entry for every day and then fill it with TMIN and TMAX (where they exist).

Create the index (use date_range):

In [None]:
tmin

In [None]:
dd = pd.date_range('1890-01','2021-05-31',freq='D')

In [None]:
dd

Create empty data frame:

In [None]:
tmp = pd.DataFrame(index=dd)

In [None]:
tmp

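Convert indexes from datetime values to periods (not needed here, because the empty DataFrame was built with `date_range`, so the lines are left commented out):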

In [None]:
# tmin.index = tmin.index.to_period('D')
# tmax.index = tmax.index.to_period('D')

Now we create a DataFrame that will contain both TMAX and TMIN data. It is a sort of Excel table where the first row contains the headers for the columns and the first column is an index:

In [None]:
tmp['TMIN'] = tmin
tmp['TMAX'] = tmax

In [None]:
tmp.head()
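The assignment works because pandas aligns the data on the index: days that are missing in one of the sources simply become NaN. A small sketch with made-up values (`table`, `tmin_demo` and `tmax_demo` are hypothetical names used only here):

```python
import pandas as pd

days = pd.date_range("2000-01-01", periods=4, freq="D")
table = pd.DataFrame(index=days)

# Made-up values; note that tmin_demo has no entry for 2000-01-03
tmin_demo = pd.Series(
    [-2.0, -1.0, 0.5],
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04"]),
)
tmax_demo = pd.Series([3.0, 4.0, 5.0, 6.0], index=days)

table["TMIN"] = tmin_demo   # aligned on the index, the missing day becomes NaN
table["TMAX"] = tmax_demo
print(table)
```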

One can plot the data straight away:

In [None]:
tmp.plot()

In [None]:
tmp.loc['1940':'1950'].plot()

We can reference each column by its name:

In [None]:
tmp['TMIN'].plot()

or as an attribute of the DataFrame (if the column name is a valid Python identifier):

In [None]:
tmp.TMIN.plot()

We can simply add a column to the DataFrame:

In [None]:
tmp['mean'] = (tmp['TMAX'] + tmp['TMIN'])/2.
tmp.head()

In [None]:
tmp['Diff'] = tmp['TMAX'] - tmp['TMIN']
tmp.head()

## Exercise
Find and plot all differences that are larger than 10

Columns can also be deleted:

In [None]:
del tmp['Diff']
del tmp['mean']
tmp.tail()

Slicing will also work:

In [None]:
tmp.loc['1981-01':'1981-03'].plot()

## Statistics

Back to simple stuff. We can obtain statistical information over the elements of the DataFrame. The default is column-wise:

In [None]:
tmp.mean()

In [None]:
tmp.max()

In [None]:
tmp.min()

You can also do it row-wise:

In [None]:
tmp.mean(axis=1)
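The difference between the two directions is easier to see on a tiny table. This is just a sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame(
    {"TMIN": [0.0, 2.0, 4.0], "TMAX": [10.0, 12.0, 14.0]},
    index=pd.date_range("2000-01-01", periods=3, freq="D"),
)

col_means = df.mean()          # one value per column (axis=0 is the default)
row_means = df.mean(axis=1)    # one value per row, i.e. per day

print(col_means)
print(row_means)
```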

Or get everything at once:

In [None]:
tmp.describe()

By the way, getting correlation coefficients for the columns of the DataFrame is as simple as:

In [None]:
tmp.corr()

## Exercise

Find the means of all TMIN and TMAX temperatures larger than 20.

## Resampling

Pandas provides an easy way to resample data to a different time frequency. The two main choices for resampling are the time period you resample to and the method that you use. The method has to be called explicitly on the result of `resample` (for example `.mean()`). The following example calculates monthly means (the 'ME' alias stands for month end):

In [None]:
tmp.resample?

In [None]:
tmp_mm = tmp.resample("ME").mean()
tmp_mm['2000':].plot()
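To check what resampling does, we can try it on a synthetic series where the answer is known in advance. In this sketch every daily value equals its month number, and we use the 'MS' (month start) alias, an assumption made only so that the example also runs on older pandas versions that do not know 'ME':

```python
import numpy as np
import pandas as pd

# Daily series where each value equals the month number
idx = pd.date_range("2000-01-01", "2000-03-31", freq="D")
s = pd.Series(np.asarray(idx.month, dtype=float), index=idx)

# One mean per calendar month, labelled with the first day of the month
monthly = s.resample("MS").mean()
print(monthly)
```

With 'ME' the values would be the same, only the labels would be the last day of each month.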

You can use your own methods for resampling, for example np.max (in this case we change the resampling frequency to 3 months):

In [None]:
tmp_mm = tmp.resample("3ME").apply(np.max)
tmp_mm['2000':].plot()

In [None]:
def my_max(x):
    out = np.max(x)
    return out

In [None]:
tmp_mm = tmp.resample("3ME").apply(my_max)
tmp_mm['2000':].plot()

You can specify several functions at once as a list:

In [None]:
tmp_mm = tmp.resample("3ME").apply([np.max, np.min])
tmp_mm['2000':].plot()
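When several functions are passed, the result gets two levels of column names: the original column and the function. A minimal sketch on synthetic data, using the string names 'max' and 'min' and the 'MS' alias (both are choices made for this illustration only):

```python
import numpy as np
import pandas as pd

# Half a year of daily values that simply count upwards
idx = pd.date_range("2000-01-01", "2000-06-30", freq="D")
df = pd.DataFrame({"TMAX": np.arange(len(idx), dtype=float)}, index=idx)

# Monthly maximum and minimum in one call
stats = df.resample("MS").agg(["max", "min"])

print(stats.columns.tolist())
print(stats[("TMAX", "max")])
```

A single statistic is then selected with a tuple such as `("TMAX", "max")`.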

## Exercise

Define a function that finds the difference between the maximum and minimum values of the resampled slice, and resample our `tmp` variable with this function.

## Seasonal means with resample

Initially pandas was created for the analysis of financial information, and it thinks not in seasons, but in quarters. So we have to resample our data to quarters. We also need to shift the standard quarters so that they correspond to seasons. This is done by using 'QE-NOV' as the time frequency, indicating that the year in our case ends in November (older pandas versions spell this alias 'Q-NOV'):

In [None]:
q_mean = tmp.resample('QE-NOV').mean()
q_mean

In [None]:
q_mean.plot()

Plot winters (the quarter that ends in February):

In [None]:
q_mean[q_mean.index.quarter==1].plot()
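The same idea can be checked on synthetic data. This sketch uses the quarter-start alias 'QS-DEC' instead of the quarter-end alias used above, an assumption made so that it runs unchanged on older and newer pandas versions. Each season is then labelled with its first day, so winters are the rows that start in December:

```python
import numpy as np
import pandas as pd

# One year of daily values, December to November; value = month number
idx = pd.date_range("2000-12-01", "2001-11-30", freq="D")
s = pd.Series(np.asarray(idx.month, dtype=float), index=idx)

# Quarters that start in December, March, June and September
seasonal = s.resample("QS-DEC").mean()

# Winter (Dec-Jan-Feb) is the quarter labelled with a December date
winters = seasonal[seasonal.index.month == 12]

print(seasonal)
print(winters)
```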

## Multi-year monthly means with groupby

The first step is to add another column with month numbers to our DataFrame:

In [None]:
tmp['mon'] = tmp.index.month
tmp

Now we can use [groupby](http://pandas.pydata.org/pandas-docs/stable/groupby.html) to group our values by month and calculate the mean for each of the groups (months in our case):

In [None]:
monmean = tmp.loc['1950':'2020'].groupby('mon').mean()
monmean.plot(kind='bar')

In [None]:
tmp.boxplot(column=['TMAX'], by='mon', figsize=(10,5))
tmp.boxplot(column=['TMIN'], by='mon', figsize=(10,5))
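The extra column is convenient for `boxplot`, but `groupby` can also take the month numbers directly from the index. A minimal sketch on synthetic data, where every value equals its month number so the expected result is obvious:

```python
import numpy as np
import pandas as pd

# Two years of daily values; each value equals the month number
idx = pd.date_range("2000-01-01", "2001-12-31", freq="D")
df = pd.DataFrame({"TMAX": np.asarray(idx.month, dtype=float)}, index=idx)

# Group by month number taken from the index, no helper column needed
monmean = df.groupby(df.index.month).mean()
print(monmean)
```

The result has one row per month (1 to 12), averaged over all years in the data.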

## Exercise

The data that we are using are from [GHCN (Global Historical Climatology Network)-Daily](https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn). The easiest way to search and extract those data is to use [KNMI Climatological Service](http://climexp.knmi.nl/selectdailyseries.cgi?id).

- Use [KNMI Climatological Service](http://climexp.knmi.nl/selectdailyseries.cgi?id) to search for some other meteo station.
- Select TMAX data set for your home city or nearby place.
- Open it with pandas.
- Plot data for 2000-2010.
- Find the maximum and minimum TMAX for the whole observational period.
- Find the mean of the TMAX temperature.
- Plot monthly means.
- Plot maximum/minimum temperatures for each month.
- Plot seasonal mean for one of the seasons.
- Plot overall monthly means (use groupby(data.index.month)).
- Plot the daily seasonal cycle (use index.dayofyear).

## Links

[Time Series Data Analysis with pandas (Video)](http://www.youtube.com/watch?v=0unf-C-pBYE)

[Data analysis in Python with pandas (Video)](http://www.youtube.com/watch?v=w26x-z-BdWQ)

[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)