## Basic Time Series Analysis

This notebook demonstrates basic time series data analysis using scientific Python libraries such as [NumPy](http://www.numpy.org/).  The example uses air temperature data stored in HydroShare to derive daily aggregated values, then stores them in a new HydroShare resource.

## 1. Script Setup and Preparation

Before we begin our processing, we must import several libraries into this notebook. The `hs_utils` library provides functions for interacting with HydroShare, including resource querying, downloading, and creation.  The `%matplotlib inline` command tells the notebook server to place plots and figures directly into the notebook.

**Note:** You may see some matplotlib warnings if this is the first time you are running this notebook. These warnings can be ignored.

In [None]:
# import required libraries
import os
import hs_utils
import numpy as np
from datetime import datetime
import itertools as it
import matplotlib.pyplot as plt
%matplotlib inline

Next, we need to establish a secure connection with HydroShare. This is done by instantiating the `hydroshare` class that is defined within `hs_utils`. In addition to connecting with HydroShare, this command also sets environment variables for several parameters that may be useful to you:

1. Your username
2. The ID of the resource which launched the notebook
3. The type of resource that launched this notebook
4. The URL of the notebook server

In [None]:
# establish a secure connection to HydroShare
hs = hs_utils.hydroshare()

### Retrieve a resource using its ID

This example uses temperature data that is stored in HydroShare at the following URL: https://www.hydroshare.org/resource/927094481da54af38ffb6f0c39ad8787/ . The data for our processing routines can be retrieved with the `getResourceFromHydroShare` function by passing it the resource identifier from the URL above.

In [None]:
# get some resource content. The resource content is returned as a dictionary
content = hs.getResourceFromHydroShare('927094481da54af38ffb6f0c39ad8787')

One file was downloaded to the Python notebook server: 

1. BeaverDivideTemp.csv


We will be using this file to derive daily minimum, maximum, and average air temperatures.  Let's preview the Beaver Divide temperature data by looping over the first 10 lines of the CSV file that was downloaded.

In [None]:
# preview the first 10 lines of the BeaverDivideTemp file
air_temp_csv = hs.content['BeaverDivideTemp.csv']
with open(air_temp_csv) as f:
    # islice stops early if the file has fewer than 10 lines
    for line in it.islice(f, 10):
        print(line, end='')

## 2. Time Series Analysis

NumPy is a numerical library that we will use to read and analyze this temperature data.  To get started, the `genfromtxt` function is used to parse the text file into NumPy arrays.  This is a powerful function that allows us to skip commented lines, strip whitespace, and transform date strings into Python objects.

In [None]:
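# read all of the data into a numpy array
# (the converter accepts bytes or str, since the input type depends on the NumPy version)
data = np.genfromtxt(air_temp_csv, comments='#', delimiter=',', autostrip=True,
                     converters={0: lambda x: datetime.strptime(
                         x.decode('utf-8') if isinstance(x, bytes) else x,
                         '%m-%d-%Y %H:%M:%S')})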
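
To see what the `comments`, `delimiter`, and `autostrip` options do in isolation, here is a minimal sketch that runs on a small in-memory sample. The sample rows are hypothetical and only mimic the layout assumed for the real file. For simplicity, this sketch reads only the numeric temperature column with `genfromtxt` and parses the date column separately with `datetime.strptime`, which is a different route from the converter used above.

```python
import io
from datetime import datetime

import numpy as np

# Hypothetical sample data (not taken from BeaverDivideTemp.csv)
sample = """# hypothetical header comment
# DateTime, Temp (C)
01-01-2015 00:00:00,  -3.5
01-01-2015 00:30:00,  -4.0
01-02-2015 00:00:00,  -9999
"""

# Lines starting with '#' are skipped, surrounding whitespace is stripped,
# and only the second column (index 1) is read as floats.
temps = np.genfromtxt(io.StringIO(sample), comments='#', delimiter=',',
                      autostrip=True, usecols=(1,))

# Parse the date column by hand from the non-comment lines.
dates = [datetime.strptime(line.split(',')[0].strip(), '%m-%d-%Y %H:%M:%S')
         for line in sample.splitlines()
         if line.strip() and not line.startswith('#')]

print(temps)
print(dates[0])
```

Note that the nodata value `-9999` is read as an ordinary number, so it has to be filtered out in a later step.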

Since we are interested in deriving daily aggregated temperatures, we need to summarize the temperature data by date.  The `itertools` library provides an efficient way for us to do this.

In [None]:
# group the temperature data by day
start = data[0][0]
lenperiod = 1

# lists to hold the grouped temperatures and dates
grouped_data = []
grouped_dates = []

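# groupby yields one group for each run of consecutive rows that share a day index
for k, g in it.groupby(data, lambda row: (row[0] - start).days // lenperiod):
    group = list(g)
    grouped_dates.append(group[0][0])
    # drop the -9999 nodata values
    d = [row[1] for row in group if row[1] != -9999]
    grouped_data.append(d)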
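
The grouping step can be tried on its own with a few hand-written records. This is a minimal sketch: the `records` list and its values are hypothetical, and the key function mirrors the day-index calculation used above with a period length of one day.

```python
import itertools as it
from datetime import datetime

# Hypothetical (timestamp, temperature) records, already sorted by time
records = [
    (datetime(2015, 1, 1, 0, 0), -3.5),
    (datetime(2015, 1, 1, 12, 0), 1.0),
    (datetime(2015, 1, 2, 0, 0), -9999),  # nodata value
    (datetime(2015, 1, 2, 12, 0), 2.5),
]

start = records[0][0]

# Map each day index (whole days elapsed since `start`) to its valid temperatures
grouped = {}
for day_index, group in it.groupby(records, key=lambda rec: (rec[0] - start).days):
    grouped[day_index] = [temp for _, temp in group if temp != -9999]

print(grouped)
```

Two properties of this approach are worth keeping in mind. `itertools.groupby` only merges consecutive items with the same key, so the records must be in chronological order. Also, the day index counts 24-hour windows from the first timestamp, so the groups align with calendar days only when the first record falls at midnight.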
    

Now that all of the data has been read into memory and grouped by date, we can loop over each day and calculate the min, max, and average temperature values.  Note that we consider any value at or below -80 $^\circ C$ to be erroneous.

In [None]:
# initialize the t_min, t_max, and t_ave arrays
# (the built-in float replaces the np.float alias, which was removed from NumPy)
t_min = np.full((len(grouped_dates), 1),  9999, dtype=float)
t_max = np.full((len(grouped_dates), 1), -9999, dtype=float)
t_ave = np.full((len(grouped_dates), 1), 0, dtype=float)

# loop over every day
for i in range(len(grouped_dates)):
    temp_count = 0
    # loop over each recorded temperature 
    for temp in grouped_data[i]:        
        # skip nodata values and any value less than -80 C
        if temp > -80:
            # save the min temp
            if temp < t_min[i]:
                t_min[i] = temp
            # save the max temp
            if temp > t_max[i]:
                t_max[i] = temp
            # sum the temps
            t_ave[i] += temp
            temp_count += 1
    # calculate the average temp
    t_ave[i] = (t_ave[i] / temp_count).round(2)
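
As an alternative to the explicit loop, the same daily statistics can be computed with NumPy array operations. The following is a minimal sketch, not the notebook's method: the `daily_stats` helper and the sample `grouped_data` values are hypothetical, and days with no valid readings are reported as `nan` rather than as the sentinel values used above.

```python
import numpy as np

# Hypothetical grouped temperatures: one list per day
grouped_data = [[-3.5, 1.0, -120.0], [2.5, 4.5], []]


def daily_stats(values, lower_bound=-80.0):
    """Return (min, max, mean) of the values greater than lower_bound."""
    arr = np.asarray(values, dtype=float)
    valid = arr[arr > lower_bound]
    if valid.size == 0:
        # no usable readings for this day
        return (np.nan, np.nan, np.nan)
    return (valid.min(), valid.max(), round(float(valid.mean()), 2))


stats = np.array([daily_stats(day) for day in grouped_data])
t_min, t_max, t_ave = stats[:, 0], stats[:, 1], stats[:, 2]

print(t_min, t_max, t_ave)
```

Returning `nan` for empty days keeps placeholder values such as 9999 out of the plot and the output file, at the cost of having to handle `nan` downstream.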
        

Visualize our derived data by using matplotlib.

In [None]:
# create a figure
fig, ax = plt.subplots(1,1,figsize=(15, 5))

# plot each temperature time series
# (ax.plot handles datetime values directly; plot_date is deprecated in recent matplotlib)
tmax = ax.plot(grouped_dates, t_max, 'g-', label='Maximum Temperature')
tave = ax.plot(grouped_dates, t_ave, 'b-', label='Average Temperature')
tmin = ax.plot(grouped_dates, t_min, 'r-', label='Minimum Temperature')

# display a legend
h, l = ax.get_legend_handles_labels()
ax.legend(h, l)

# set the figure title
fig.suptitle('Beaver Divide Temperatures')
plt.ylabel('Temperature (degrees C)')

# format the ticks
ax.grid(True)
fig.autofmt_xdate()

Now that our data analysis is complete, we need to save the results.  An easy way to accomplish this is to loop over the date range and write each of the arrays to a CSV file.  We use the built-in `os.environ['DATA']` variable to get the path of the default data directory, so the resulting file will be located in the `data` directory on the server.

In [None]:
# set the save path for the aggregated values
temp_agg = os.path.join(os.environ['DATA'], 'beaver_divide_temp_daily_agg.csv')

# write the derived temperatures to a csv file
with open(temp_agg, 'w') as f:
    f.write('Date, Ave Temp (C), Min Temp (C), Max Temp (C)\n')
    for i in range(len(grouped_dates)):
        f.write('%s,%3.2f,%3.2f,%3.2f\n' % 
               (grouped_dates[i].strftime('%m-%d-%Y'), t_ave[i], t_min[i], t_max[i]))
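
The standard library `csv` module offers another way to write the same rows. Below is a minimal sketch with hypothetical values; it writes to an in-memory buffer so that it runs anywhere, whereas a real run would pass a file opened with `open(path, 'w', newline='')`.

```python
import csv
import io
from datetime import datetime

# Hypothetical daily results
dates = [datetime(2015, 1, 1), datetime(2015, 1, 2)]
t_ave = [-1.25, 3.5]
t_min = [-3.5, 2.5]
t_max = [1.0, 4.5]

buffer = io.StringIO()  # stands in for an open file handle
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['Date', 'Ave Temp (C)', 'Min Temp (C)', 'Max Temp (C)'])
for day, ave, low, high in zip(dates, t_ave, t_min, t_max):
    # format each temperature with two decimal places
    writer.writerow([day.strftime('%m-%d-%Y'),
                     f'{ave:.2f}', f'{low:.2f}', f'{high:.2f}'])

csv_text = buffer.getvalue()
print(csv_text)
```

The main benefit of `csv.writer` is that it quotes fields automatically if a value ever contains a comma.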
        

---
## 3. Save the results back into HydroShare

Using the `hs_utils` library, the results of our time series analysis can be saved back into HydroShare.  First, define all of the metadata required for resource creation, i.e. *title*, *abstract*, *keywords*, and *content files*.  In addition, we must define the type of resource that will be created, in this case *genericresource*.  

***Optional*** : define the resource from which this "new" content has been derived.  This is one method for tracking resource provenance.

In [None]:
# define HydroShare required metadata
title = 'Daily Aggregate Temperature for Beaver Divide'
abstract = 'Daily aggregated air temperature for the Beaver Divide gauging station, which is maintained by iUtah researchers.'
keywords = ['Temperature', 'Beaver Divide', 'Time Series']

# set the resource type that will be created.
rtype = 'genericresource'

# create a list of files that will be added to the HydroShare resource.
files = [temp_agg]  

# set the Beaver Divide temperature resource as the "parent"
# (i.e. the new resource will be "derived from" the "927094481da54af38ffb6f0c39ad8787" resource)
parent_resource = '927094481da54af38ffb6f0c39ad8787'


In [None]:
# create a hydroshare resource containing these data
resource_id = hs.createHydroShareResource(abstract, 
                                          title, 
                                          derivedFromId=parent_resource,
                                          keywords=keywords, 
                                          resource_type=rtype, 
                                          content_files=files, 
                                          public=False)