# Exercise 1 - Working with High-Frequency Data

So far we have worked with satellite data, which is often recorded at the timescale of days, weeks, or months. This is very different from the frequency of the data that we can get from on-the-ground sensors. Depending on your needs, you can have data recorded at the scale of hours, minutes, seconds, or even higher!

The goal of this exercise is to look at the capabilities of higher-frequency data -- well beyond the typical daily or weekly data sets from satellites or larger weather station networks.

In this first exercise, we are going to look at both higher-frequency satellite data and the outputs of some sensors installed in the Trishuli to try to understand the usefulness and difficulties associated with high-frequency data.

We can start by importing some Python modules:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import datetime, json
from shapely.geometry import mapping
import geopandas as gpd

In [None]:
import ee
ee.Initialize()
print(ee.__version__)

## High-Frequency Satellite Data

Let's return to the 3-hourly GPM data we worked with yesterday: [https://developers.google.com/earth-engine/datasets/catalog/NASA_GPM_L3_IMERG_V06](https://developers.google.com/earth-engine/datasets/catalog/NASA_GPM_L3_IMERG_V06)

This is one of the best high-frequency climate records from satellites -- there are not many sensors that give us data more often than once a day. In fact, many sensors only give data every one to three weeks! This means they are really good for looking at long-term trends, but very bad for finding short-term changes and any rapid events.

Google Earth Engine sets a limit to how many data points you can access directly -- something around 5000 is the maximum. If we want to look at a longer time series -- for example, the last 5 years -- we need to export the data to our Google Drive like we did with the gridded data.
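
To see why a direct request will not work here, we can count the images involved. This is only a rough sketch: the 3-hour cadence is taken from the text above, the limit of 5000 is the approximate figure quoted above, and `count_images` is a helper name made up for this example.

```python
from datetime import datetime, timedelta

def count_images(start, end, step_hours):
    """Number of time steps in [start, end) at a fixed cadence."""
    total_hours = (end - start) / timedelta(hours=1)
    return int(total_hours // step_hours)

start = datetime(2019, 1, 1)
end = datetime(2024, 1, 1)
direct_limit = 5000  # approximate limit quoted above

n_images = count_images(start, end, step_hours=3)  # assumed 3-hour cadence
print(n_images, n_images > direct_limit)
```

A finer cadence only makes the count larger, so exporting to Drive is the safer route for any multi-year series.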

Let's create a high-frequency time series for the Trishuli:

In [None]:
trishuli_outline = gpd.read_file('GeoData/Trishuli.geojson')
area_of_interest = ee.Geometry.MultiPolygon(ee.List(mapping(trishuli_outline.geometry[0])['coordinates']))

In [None]:
def export_timeseries_todrive(collection, filename, scale, region, agg_fx=ee.Reducer.median()):
    #Reduce every image over the region and export the resulting table to Google Drive as a CSV
    def createTS(image):
        date = image.get('system:time_start')
        value = image.reduceRegion(agg_fx, region, scale) #Dictionary of band name -> reduced value
        ft = ee.Feature(None, {'system:time_start': date, 'date': ee.Date(date).format('Y/M/d-H:m:s'), 'val': value})
        return ft
    
    TS = collection.filterBounds(region).map(createTS) #One feature per image
    
    task_config = {'description': filename, 'fileFormat': 'CSV'}
    task = ee.batch.Export.table.toDrive(TS, **task_config)
    task.start() #The export runs on Google's servers; the CSV appears in Drive when it finishes

def mask_(image):
    mask = image.gt(0.5)
    return image.updateMask(mask)

hr_rainfall = ee.ImageCollection("NASA/GPM_L3/IMERG_V06").select('precipitationCal').filterDate('2019-01-01', '2024-01-01')
hr_rainfall = hr_rainfall.map(mask_) #Remove low values
export_timeseries_todrive(hr_rainfall, 'Trishuli_GPM_Med', scale=11132, region=area_of_interest)
export_timeseries_todrive(hr_rainfall, 'Trishuli_GPM_Sum', scale=11132, region=area_of_interest, agg_fx=ee.Reducer.sum())

For me this took about 45 minutes to run. You can find the files directly on GitHub as well: [link]().

We can now open this file with _pandas_ and have a look:

In [None]:
df = pd.read_csv('Time Series/Trishuli_GPM_Med.csv') #Load in our CSV File
dates = [pd.Timestamp(x) for x in df.date] #Fix the date to something Python recognizes
rain = [float(x.split('=')[1].replace('}','').replace('null','0')) for x in df.val] #Fix the value to something Python recognizes

s = pd.Series(rain, index=dates) #Turn this into an easy to work with time series
rs_d = s.resample('D').mean() #Get the daily mean
rs_m = s.resample('M').mean() #Get the monthly mean

f, ax = plt.subplots(1, figsize=(14,6))
ax.plot(s.index, s.values, 'rx', markersize=2, label='Hourly Data')
ax.plot(rs_d.index, rs_d.values, 'b.', label='Daily Mean')
ax.plot(rs_m.index, rs_m.values, 'co', label='Monthly Mean')
ax.set_xlabel('Time', fontsize=16, fontweight='bold')
ax.set_ylabel('Precipitation Rate (mm/hr)', fontsize=16, fontweight='bold')
ax.set_title('Trishuli GPM Precipitation, 2019-2023', fontsize=18, fontweight='bold')
ax.set_ylim(0,30)
ax.legend()
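
The line that fixes the `val` column is quite compact, so here is a more explicit sketch of the same idea. The sample strings are assumptions, modeled on the `{band=value}` form that the parsing code above expects, and `parse_val` is a made-up helper name.

```python
import re

# Sample strings are assumptions, modeled on the '{band=value}' form
# that the parsing code above expects in the 'val' column.
samples = ['{precipitationCal=1.25}', '{precipitationCal=null}', '{}']

def parse_val(text, fill=0.0):
    """Return the number after '=', or `fill` when it is null or missing."""
    match = re.search(r'=\s*([-+0-9.eE]+)', text)
    return float(match.group(1)) if match else fill

values = [parse_val(x) for x in samples]
print(values)
```

Like the original line, this treats `null` as zero rainfall. If you would rather keep those time steps as missing, you could pass `fill=float('nan')` instead.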

We can do the same thing with the regional sum (`Trishuli_GPM_Sum`) in place of the regional median, again resampled to daily and monthly means:

In [None]:
df = pd.read_csv('Time Series/Trishuli_GPM_Sum.csv') #Load in our CSV File
dates = [pd.Timestamp(x) for x in df.date] #Fix the date to something Python recognizes
rain = [float(x.split('=')[1].replace('}','').replace('null','0')) for x in df.val] #Fix the value to something Python recognizes

s = pd.Series(rain, index=dates) #Turn this into an easy to work with time series
rs_d = s.resample('D').mean() #Get the daily mean
rs_m = s.resample('M').mean() #Get the monthly mean

f, ax = plt.subplots(1, figsize=(14,6))
ax.plot(s.index, s.values, 'rx', markersize=1, label='Hourly Data')
ax.plot(rs_d.index, rs_d.values, 'b.', label='Daily Mean')
ax.plot(rs_m.index, rs_m.values, 'co', label='Monthly Mean')
ax.set_xlabel('Time', fontsize=16, fontweight='bold')
ax.set_ylabel('Precipitation Sum (mm)', fontsize=16, fontweight='bold')
ax.set_title('Trishuli GPM Precipitation, 2019-2023', fontsize=18, fontweight='bold')
ax.set_ylim(0,250)
ax.legend()

This makes it really clear how much data is lost by only looking at daily or monthly averages (or sums). 
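
A tiny synthetic example shows the same effect in numbers. The values here are made up for illustration: two days of 3-hourly data with a single 12 mm/hr burst on the first day.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): two dry days of
# 3-hourly values, with one 12 mm/hr burst on the first day.
index = pd.date_range('2023-07-01', periods=16, freq='180min')
rain = pd.Series(np.zeros(16), index=index)
rain.iloc[3] = 12.0

daily = rain.resample('D').agg(['mean', 'max', 'sum'])
print(daily)
```

The daily mean for the first day is far below the burst itself, which is why it helps to keep a maximum alongside any average when you downsample.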

## Local Monitoring Data

With our own sensors, we can collect data at much higher frequencies -- even down to every second! However, we start to have problems with **actually using** that data -- if there is too much data, it becomes very hard to process.

For example, for one of our stations in Trishuli, we have more than **18 million** measurements since May 2023. That is a lot of data! It thus becomes important to decide how much data you actually need, and to consider the trade-offs between capturing lots of data and having to both transmit that data and work with it.

### Collecting High-Frequency Data

In our experience, it is usually best to first collect the _highest possible resolution_ and then choose the best way to make it smaller. Let's take an example of one week of temperature data from one of our sensors. We can start with the highest (1-second) resolution, and work our way down to an optimum for that data type. 

In [None]:
df = pd.read_csv('Time Series/TRDam_Temperature_Feb2024.csv')
df['date'] = [pd.Timestamp(x) for x in df.date]
s = pd.Series(df.temp.values, index=df.date)
s

In [None]:
f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(s.index, s.values, '.', markersize=1)
ax.set_title('One-Second Data, n=478,006')

That is still a LOT of data! Let's try to make the same plot with one minute data, five minute data, and fifteen minute data:

In [None]:
f, (ax, ax2, ax3, ax4) = plt.subplots(4, figsize=(12,12))
ax.plot(s.index, s.values, 'b.', markersize=1, label='One Second Data')
print('1 Second', s.values.shape)

rs = s.resample('1min').mean()
ax2.plot(rs.index, rs.values, 'rx', label='One Minute Data')
print('1 Minute', rs.values.shape)

rs = s.resample('5min').mean()
ax3.plot(rs.index, rs.values, 'go', label='Five Minute Data')
print('5 Minute', rs.values.shape)

rs = s.resample('15min').mean()
ax4.plot(rs.index, rs.values, 'k^', label='Fifteen Minute Data')
print('15 Minute', rs.values.shape)

ax.legend()
ax2.legend()
ax3.legend()
ax4.legend()

Those graphs all look pretty similar! Unless you really need to know about those really small-scale changes, 5- or 15-minute data is certainly sufficient. It also **substantially** reduces the amount of data you need to work with. We go from 500,000 points to only 600 quite quickly, and without losing much information.

Since we don't expect air temperature to change really rapidly in most cases, 15-minute data would be a good choice to save on data storage space, SIM and Wi-Fi data, and to make it easier to quickly make charts.
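
The size of the saving is easy to estimate by hand. This sketch assumes a gap-free record, so the real counts printed by the cell above can differ by one or two depending on where the record starts and ends.

```python
import math

n_seconds = 478_006  # number of one-second samples quoted above

# Assumes a gap-free record; real bin counts can differ by one or two
# depending on where the record starts relative to the bin edges.
bins = {}
for label, step in [('1 min', 60), ('5 min', 300), ('15 min', 900)]:
    bins[label] = math.ceil(n_seconds / step)
    print(f'{label}: ~{bins[label]} points, {n_seconds / bins[label]:.0f}x fewer')
```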

### Transmitting High-Frequency Data

Before we continue on to processing our high-frequency data, I wanted to make a quick detour about transmitting data. When you collect your own data, you can choose whether to save the data onto a hard drive or to transmit it over the internet. In some cases, saving the data to a hard drive is the only possibility -- if there is no cell network and you are in a very remote area, sending your data over the internet is not possible.

Even when you do have internet, it is not always fast or reliable -- in both cases, it helps to make your data as small as possible to save space and internet bandwidth. One good way to do this is to resample your data to a lower frequency, like we have done above to go from 1-second data to 15-minute data. Another important piece is to **compress** your data, for example into a .zip or .rar file. This is also possible to do directly within your data collection script -- you can use the [gzip](https://docs.python.org/3/library/gzip.html) library which is built into Python for this purpose.
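
As a minimal sketch of what that can look like, the example below compresses one day of made-up 15-minute readings in memory and checks that nothing is lost. The CSV layout and the values are assumptions for illustration, not the format our stations use.

```python
import gzip

# Hypothetical 15-minute readings in CSV form (values are made up).
rows = ['date,temp'] + [f'2024-02-01 {h:02d}:{m:02d}:00,{10 + h * 0.5:.1f}'
                        for h in range(24) for m in (0, 15, 30, 45)]
raw = '\n'.join(rows).encode('utf-8')

compressed = gzip.compress(raw)          # bytes ready to write or send
restored = gzip.decompress(compressed)   # what the receiver would run

print(len(raw), len(compressed), restored == raw)
```

Repetitive text like timestamps compresses very well. To write straight to a file instead, `gzip.open(path, 'wt')` behaves like the normal `open` and compresses as it writes.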

Once you have compressed your data, we have found that the [rsync](https://linux.die.net/man/1/rsync) tool is a very robust way to transmit data to a web server, an office computer, or directly to your laptop. If you want to find out more about this, please ask us! We would be happy to show how we accomplished this with our monitoring stations.

### Processing High-Frequency Data

Once you have received your data, the next question is how to use it best. There are many questions that can only be answered with high-frequency data, such as how quickly conditions change over short time periods, how different day- and night-time data are, and how far extreme temperatures are from the daily average. All of these quantities will change through time, and will give a better understanding of how much change can happen how quickly. 

Let's first load in our 15-minute data, covering a longer time period:

In [None]:
df = pd.read_csv('Time Series/TRDam_Temperature_15Min_2023-2024.csv')
df['date'] = [pd.Timestamp(x) for x in df.date]
s = pd.Series(df['0'].values, index=df.date)
s

In [None]:
f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(s.index, s.values, '.', markersize=1)
ax.set_title('15 Minute Data, n=24,787')

This is still a LOT of data! However, we can start to see some patterns. Let's zoom in on one month to make this clearer:

In [None]:
f, ax = plt.subplots(1, figsize=(10,5))
jan_2024 = np.logical_and(s.index.year == 2024, s.index.month == 1)
jan_data = s[jan_2024]
ax.plot(jan_data.index, jan_data.values, '.', markersize=2)

Let's try to get three high-frequency metrics:

1. Day/night difference
2. Difference from daily average
3. Rate of temperature change through time

Python can understand the dates that we have with our data -- that is how we managed to extract only one month of data. We can also do the same for night, assuming we know what time the sun goes down!

We can get a rough estimate of our day/night range by looking only at times near noon and times near midnight:

In [None]:
night = np.logical_or(jan_data.index.hour > 21, jan_data.index.hour < 4)
day = np.logical_and(jan_data.index.hour > 10, jan_data.index.hour < 16)

night_data = jan_data[night]
day_data = jan_data[day]

#Now resample those to daily data
night_rs = night_data.resample('D').min()
day_rs = day_data.resample('D').max()

f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(day_rs.index, day_rs.values, 'ro')
ax.plot(night_rs.index, night_rs.values, 'bx')
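
If you want to convince yourself that the hour masks pick out the right samples, you can try them on data where you already know the answer. This sketch uses a synthetic temperature curve (an assumption for illustration) that peaks at 15:00 at 15 degrees and bottoms out at 03:00 at 5 degrees.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute temperatures for two days (an assumption for
# illustration): a sine wave peaking at 15:00 and coldest at 03:00.
index = pd.date_range('2024-01-01', periods=2 * 96, freq='15min')
hours = index.hour.to_numpy() + index.minute.to_numpy() / 60
temps = pd.Series(10 + 5 * np.sin(2 * np.pi * (hours - 9) / 24), index=index)

night = (temps.index.hour > 21) | (temps.index.hour < 4)   # 22:00-03:59
day = (temps.index.hour > 10) & (temps.index.hour < 16)    # 11:00-15:59

day_max = temps[day].resample('D').max()
night_min = temps[night].resample('D').min()
day_night_range = day_max - night_min   # pandas aligns on the date index
print(day_night_range)
```

Subtracting the two Series directly, instead of their `.values`, lets pandas match the dates for you, so days that are missing on one side show up as NaN rather than shifting the comparison.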

We can also now calculate the difference between day max/night min for each day:

In [None]:
f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(day_rs.index, day_rs.values - night_rs.values, 'ko')

This tells us that our day/night differences are increasing! This is likely due to the start of spring, when things start to warm up more during the day while still staying cold at night. We can look at the same parameter over the whole time series to see if we're right:

In [None]:
night = np.logical_or(s.index.hour > 21, s.index.hour < 4)
day = np.logical_and(s.index.hour > 10, s.index.hour < 16)

night_data = s[night]
day_data = s[day]

#Now resample those to daily data
night_rs = night_data.resample('D').min()[:-1] #Drop the final day so both arrays have the same length
day_rs = day_data.resample('D').max()

f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(day_rs.index, day_rs.values - night_rs.values, 'ko')

As expected, the day/night difference has been increasing since around November. However, it is clear that there is something else going on here -- there are really large changes in the day/night temperature difference throughout the year. This is due to both stronger sunlight during the summer, and the impact of the monsoonal rainfall, which can rapidly change temperatures, especially if rainfall occurs at night. Let's take a closer look at the other two parameters:

2. Difference from daily average
3. Rate of temperature change through time

We can get a daily average and a daily maximum quite easily using the same format we've looked at above:

In [None]:
avg_temp = s.resample('D').mean()
max_temp = s.resample('D').max()

f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(avg_temp.index, max_temp.values - avg_temp.values, 'ro')

Generally there haven't been many real 'high' outliers -- throughout the past year there haven't been many times when the 15-minute temperature is particularly high. However, there are a few cold outliers, even in the summer, where the maximum and average daily temperatures are quite similar.

We can determine exactly what date that minimum occurred on:

In [None]:
smallest_difference = np.argmin(max_temp.values - avg_temp.values)
date = max_temp.index[smallest_difference]
print(date)
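
An alternative sketch of the same lookup uses the pandas `idxmin` method, which returns the date directly. The daily values below are made up, only to show the call.

```python
import pandas as pd

# Made-up daily values, only to show the lookup.
days = pd.date_range('2023-08-11', periods=4, freq='D')
max_temp = pd.Series([24.0, 25.5, 19.2, 26.1], index=days)
avg_temp = pd.Series([20.0, 21.0, 19.0, 21.5], index=days)

spread = max_temp - avg_temp
print(spread.idxmin(), spread.min())
```

One practical difference is that `idxmin` skips missing values, while `np.argmin` on the raw values can land on a NaN if any day has no data.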

Let's take a look at the rate of temperature change over rolling 3-hour windows to see if that makes things clearer:

In [None]:
def calc_slope(x):
    slope = np.polyfit(range(len(x)), x, 1)[0]
    return slope

result = s.rolling(12, min_periods=2).apply(calc_slope) #Use 12 values, which is 3-hours (15-minute data)

In [None]:
f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(result.index, result.values, '.')
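
It is worth checking what units that slope comes out in. In this sketch we feed the same `calc_slope` function a synthetic ramp (an assumption for illustration) that rises by exactly 0.1 degrees every 15 minutes, so we know what answer to expect.

```python
import numpy as np
import pandas as pd

def calc_slope(x):
    # Least-squares slope, in units of "change per sample"
    return np.polyfit(range(len(x)), x, 1)[0]

# Synthetic ramp (an assumption): temperature rises 0.1 degrees every 15 minutes.
index = pd.date_range('2024-01-01', periods=48, freq='15min')
ramp = pd.Series(0.1 * np.arange(48), index=index)

per_step = ramp.rolling(12, min_periods=2).apply(calc_slope)
per_hour = per_step * 4   # four 15-minute samples per hour
print(per_hour.dropna().round(3).unique())
```

The slope is measured per sample, not per hour or per day, so multiply by the number of samples in whatever time unit you want to report.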

If we look only at the time around August 13:

In [None]:
august = np.logical_and(result.index > pd.Timestamp('2023-08-08'), result.index < pd.Timestamp('2023-08-18'))
august_data = result[august]

f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(august_data.index, august_data.values, '.')

It looks like we are missing data for the hottest part of the day! That would certainly explain why we have such a low daily difference. 

Even though we are collecting data at a very high frequency, we still can miss some! It is always important to check for these things. 
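
One simple way to check is to count how many samples arrived each day. The sketch below builds a synthetic three-day record (an assumption for illustration) with a midday outage on the second day, then flags any day with fewer than the 96 samples we expect from 15-minute data.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute record for three days (an assumption), with
# 10:00-15:45 removed on the second day to mimic a midday outage.
index = pd.date_range('2023-08-12', periods=3 * 96, freq='15min')
temps = pd.Series(np.full(len(index), 20.0), index=index)
outage = (index.day == 13) & (index.hour >= 10) & (index.hour < 16)
temps = temps[~outage]

expected_per_day = 24 * 4                     # 96 samples at 15 minutes
counts = temps.resample('D').count()
incomplete = counts[counts < expected_per_day]
print(incomplete)
```

On a real record the first and last days are usually incomplete as well, so expect those to be flagged too.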

In [None]:
s

### Comparing to Rainfall Data

As a final step, let's compare some rainfall data for the same location as our temperature sensor, to see how that matches up with our hot/cold days. I will again use the 3-hourly GPM data so that we have a good temporal resolution:

In [None]:
trdam = ee.Geometry.Point([85.14564, 27.92085])

hr_rainfall = ee.ImageCollection("NASA/GPM_L3/IMERG_V06").select('precipitationCal').filterDate('2023-05-24', '2024-02-07')
hr_rainfall = hr_rainfall.map(mask_) #Remove low values

export_timeseries_todrive(hr_rainfall, 'TRDAM_GPM', scale=11132, region=trdam)

In [None]:
df = pd.read_csv('Time Series/TRDAM_GPM.csv') #Load in our CSV File
dates = [pd.Timestamp(x) for x in df.date] #Fix the date to something Python recognizes
rain = [float(x.split('=')[1].replace('}','').replace('null','0')) for x in df.val] #Fix the value to something Python recognizes

rain = pd.Series(rain, index=dates) #Turn this into an easy to work with time series
rs_d = rain.resample('D').sum() #Get the daily sum

Let's make a plot showing the daily temperature range and the daily precipitation sum on the same x-axis:

In [None]:
df = pd.read_csv('Time Series/TRDam_Temperature_15Min_2023-2024.csv')
df['date'] = [pd.Timestamp(x) for x in df.date]
temp = pd.Series(df['0'].values, index=df.date)
daily_temp_range = temp.resample('D').max() - temp.resample('D').min()

f, ax = plt.subplots(1, figsize=(10,5))
ax.plot(daily_temp_range.index, daily_temp_range.values, 'b.')
ax2 = ax.twinx()
ax2.plot(rs_d.index, rs_d.values, 'rx')
ax2.set_ylim(ymin=0)
ax.set_xlabel('Date', fontsize=16, fontweight='bold')
ax.set_ylabel('Daily Temperature Range', fontsize=16, fontweight='bold')
ax2.set_ylabel('Daily Precipitation Sum', fontsize=16, fontweight='bold')

## Further Information

We hope that this has helped point out some of the advantages and difficulties of working with high-frequency data! If you want more resources about working with this kind of data, the _pandas_ library is very helpful for doing time-series analysis: [https://pandas.pydata.org/](https://pandas.pydata.org/)