# Time Series Data Analysis with cuDF 
cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data, and it is well suited to large time series datasets. It provides a pandas-like API, so much existing pandas code can be moved to the GPU by replacing calls such as `pandas.DataFrame()` with `cudf.DataFrame()`. In the tests reported below, this gave speed-ups ranging from about 18x to about 880x, depending on the operation. You may explore the latest [cuDF API](https://docs.rapids.ai/api/cudf/nightly/api_docs/index.html).

The table shows the speed-up of cuDF over pandas on an RTX A6000 GPU with an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, with times in seconds (details in Section 6).

**<p style="text-align: center;">Performance Results based on test results</p>**


|Function|GPU Time (s)|CPU Time (s)|GPU Speedup (x)|
| --- | --- | --- | --- |
|read|4.391139|117.607004|26.78|
|drop|0.184182|3.34047|18.14|
|diff|0.131384|16.044269|122.12|
|select|0.07151|62.890464|879.46|
|resample|0.347972|9.892627|28.43|
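
The speed-up column is simply the CPU time divided by the GPU time. As a quick sanity check, the short sketch below recomputes it from the timings in the table (in seconds, since they come from `timeit.default_timer` in Section 6). The dictionary name `timings` is only illustrative.

```python
# Timings copied from the table above: (GPU time, CPU time), in seconds.
timings = {
    "read": (4.391139, 117.607004),
    "drop": (0.184182, 3.34047),
    "diff": (0.131384, 16.044269),
    "select": (0.07151, 62.890464),
    "resample": (0.347972, 9.892627),
}

# Speed-up = CPU time / GPU time, rounded to two decimals as in the table.
speedups = {name: round(cpu / gpu, 2) for name, (gpu, cpu) in timings.items()}

for name, value in speedups.items():
    print(f"{name:>8}: {value}x")
```

The spread is wide: column drops and reads gain far less than row selection, so a single headline figure does not describe every operation.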

This notebook introduces how to use cuDF to apply basic data analysis on time series datasets. The [MeteoNet Dataset](https://www.kaggle.com/datasets/katerpillar/meteonet) is downloaded and analyzed in this notebook to provide a practical example to data scientists. 

This notebook illustrates:
 - Loading and saving the dataset
 - Processing datetime objects
 - Selecting data over a given time period
 - Resampling and grouping time series data
 
We'll download the dataset for you in this notebook, but it can be manually downloaded from https://meteonet.umr-cnrm.fr/dataset/data/ .

## Section 0: Prerequisites   
To use this notebook, [RAPIDS](https://rapids.ai/start.html) must be installed. Please review the following steps and ensure it's properly installed.

### System Requirements
All provisioned systems need to be RAPIDS capable. Here’s what's required:

 **GPU**: NVIDIA Pascal™ or better with compute capability 6.0+

 **OS**: One of the following OS versions:
 - Ubuntu 18.04/20.04 or CentOS 7 / Rocky Linux 8 with gcc/g++ 9.0+
 - Windows 11 using WSL2 (see the separate install guide)
 - RHEL 7/8 support is provided through CentOS 7 / Rocky Linux 8 builds/installs

 **CUDA & NVIDIA Drivers**: One of the following supported versions:
 - CUDA 11.2 with Driver 460.27.03 or newer
 - CUDA 11.4 with Driver 470.42.01 or newer
 - CUDA 11.5 with Driver 495.29.05 or newer
   
Note: RAPIDS is tested with and officially supports the versions listed above. Newer versions of CUDA, drivers, and OS may also work with RAPIDS.

### Environment for RAPIDS
You can install RAPIDS in one of the environments below. As described in [Step 2: Install Environment](https://rapids.ai/start.html), the options are:
* Conda 
* Build from source 
* PIP installation
* Running a Docker container 

### Installing RAPIDS  
There are specific ways to install RAPIDS for each environment.
#### Conda   
Below is the command for basic installation under Conda:
```
conda create -n rapids-23.02 -c rapidsai-nightly -c conda-forge -c nvidia rapids=23.02 python=3.9 cudatoolkit=11.5 jupyterlab
```

You can set the Python version to 3.8 or 3.9, and the cudatoolkit version to 11.2, 11.4, or 11.5.

NOTE: ```rapids=23.02``` installs the standard selection, which contains all of the following packages: _cuDF, cuML, cuGraph, cuSpatial, cuXFilter, cuSignal, cuCIM_. To install a single package instead, name it explicitly, for example ```cudf=23.02```.

To additionally install Dask SQL, JupyterLab, Plotly Dash, Graphistry, etc., add the related package names to the conda command. Find detailed information at [Step 3: Install Rapids](https://rapids.ai/start.html).

#### Docker
Here's an example that uses two commands for a basic installation with a Docker container from NGC, selecting CUDA 11.2 and Ubuntu 20.04. The first command pulls the image and the second runs the container.
```
docker pull nvcr.io/nvidia/rapidsai/rapidsai-core:22.12-cuda11.2-runtime-ubuntu20.04-py3.9

docker run --gpus all --rm -it \
    --shm-size=1g --ulimit memlock=-1 \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    nvcr.io/nvidia/rapidsai/rapidsai-core:22.12-cuda11.2-runtime-ubuntu20.04-py3.9
```
For Docker commands tailored to a specific system, and for _Dask-SQL_ and _CLX_ support, see [Step 3: Install Rapids](https://rapids.ai/start.html).

## Section 1: Preparing the Dataset 
This section shows how to do basic analysis on the Ground Stations Dataset from MeteoNet.

### Downloading the Dataset
This first activity uses the South East France ground station data. Each weather parameter is measured every 6 minutes.

The parameters in the dataset are listed below. For detailed information, refer to the [dataset information](https://meteofrance.github.io/meteonet/english/data/ground-observations/).
**<p style="text-align: center;">Metadata</p>**

|Name| Description | Unit|   
| --- | --- | --- |          
|number_sta|ground station ID| - |   
|lat| latitude| decimal degrees (10^-1°) |    
|lon| longitude| decimal degrees (10^-1°) |   
|height_sta| station height| meters(m) |    
|date| a datetime object| format 'YYYY-MM-DD HH:mm:ss' |    

**<p style="text-align: center;">Meteorological Parameters</p>**

|Name| Description | Unit| 
| --- | --- | --- |   
|dd| Wind direction | degrees (°)|    
|ff| Wind speed | m.s^-1|    
|precip| Precipitation during the reporting period | kg.m^-2|
|hu| Humidity | % |       
|td| Dew point | Kelvin (K) |     
|t| Temperature | Kelvin (K) |   
|psl| Pressure reduced to sea level | Pascal (Pa)|   

In [None]:
# These three wget commands will download three years' worth of data
!wget https://meteonet.umr-cnrm.fr/dataset/data/SE/ground_stations/SE_ground_stations_2018.tar.gz
!wget https://meteonet.umr-cnrm.fr/dataset/data/SE/ground_stations/SE_ground_stations_2017.tar.gz
!wget https://meteonet.umr-cnrm.fr/dataset/data/SE/ground_stations/SE_ground_stations_2016.tar.gz

In [None]:
# Let's untar and unzip them
!tar -xvf SE_ground_stations_2016.tar.gz && rm -f SE_ground_stations_2016.tar.gz
!tar -xvf SE_ground_stations_2017.tar.gz && rm -f SE_ground_stations_2017.tar.gz
!tar -xvf SE_ground_stations_2018.tar.gz && rm -f SE_ground_stations_2018.tar.gz

In [None]:
# Are they listed?
!ls -l -sh SE2*.csv

3.2G -rw-r--r-- 1 meiranp dip 3.2G Jan 23  2020 SE2016.csv
3.2G -rw-r--r-- 1 meiranp dip 3.2G Jan 23  2020 SE2017.csv
3.3G -rw-r--r-- 1 meiranp dip 3.3G Jan 23  2020 SE2018.csv


## Section 2: Loading and Saving the Dataset; Datetime Processing

A few basic DataFrame features make working with time series data easier. Here are the ones we'll use:
- The dataset is in .csv ("comma-separated values") format, so ```cudf.read_csv()``` can load it into a DataFrame. Make special note of the line terminator used in the CSV file.
- ```.head()``` and ```.tail()``` show the first and last few observations of the dataset, which is very handy for long datasets.
- ```.shape``` gives the dimensions of the DataFrame as (rows, columns).
- ```cudf.to_datetime()``` converts its argument to a datetime dtype.
- ```.min()``` and ```.max()``` on a datetime Series help to find the sampling time window.
- ```cudf.concat()``` concatenates DataFrames, Series, or Indices row-wise.
- ```.to_csv()``` writes a DataFrame to a CSV file.
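
The calls listed above can be tried on a tiny in-memory sample before loading the full files. The sketch below uses pandas so that it runs without a GPU. Since cuDF mirrors the pandas API, swapping `pd` for `cudf` should behave similarly, though that is an assumption worth verifying against your cuDF version. The two CSV snippets and their values are made up for illustration; only the column names and the `YYYYMMDD HH:MM` date layout follow the dataset.

```python
import io
import pandas as pd

# Two tiny, made-up CSV "files" that mimic the MeteoNet column layout.
csv_2016 = "number_sta,date,t\n1027003,20160101 00:00,279.05\n1033002,20160101 00:06,278.35\n"
csv_2017 = "number_sta,date,t\n1027003,20170101 00:00,280.15\n"

# Load each file, then stack them row-wise with a fresh 0..n-1 index.
frames = [pd.read_csv(io.StringIO(text)) for text in (csv_2016, csv_2017)]
df = pd.concat(frames, ignore_index=True)

# Parse the date strings into a datetime64 column.
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d %H:%M")

print(df.shape)                            # (rows, columns)
print(df.head(2))                          # first observations
print(df["date"].min(), df["date"].max())  # sampling time window

# Write the combined frame back out as CSV text (no index column).
csv_text = df.to_csv(index=False)
```

Passing `ignore_index=True` matters when stacking yearly files: without it, each file's row labels would restart at zero and repeat in the combined frame.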

**Note**: The following processing uses a combined dataset from years 2016, 2017, and 2018, which is about 10GB in size. If limited by the GPU's memory (out of memory error), you can load just one of the datasets to investigate how cuDF works on time series data. 

In [None]:
import cudf
import cupy as cp
import pandas as pd

In [None]:
%%time
# Do a warm-up when benchmarking performance. Refer to the last section of code for the performance check. 
# If you get an out of memory error, you can comment out one or two of the read_csv lines below. Just make sure
# to update the gdf_frames line, too, to reflect which datasets you're keeping.

# Empty DataFrame placeholders so you can select just one or two of the years of data. 
gdf_2016 = cudf.DataFrame()
gdf_2017 = cudf.DataFrame()
gdf_2018 = cudf.DataFrame()

# **********NOTE***********
# Comment out one or two of these if your GPU memory is full.
gdf_2016 = cudf.read_csv('./SE2016.csv')
gdf_2017 = cudf.read_csv('./SE2017.csv')
gdf_2018 = cudf.read_csv('./SE2018.csv')

gdf_frames =[gdf_2016,gdf_2017,gdf_2018]
gdf = cudf.concat(gdf_frames,ignore_index=True)
gdf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 127515796 entries, 0 to 127515795
Data columns (total 12 columns):
 #   Column      Dtype
---  ------      -----
 0   number_sta  int64
 1   lat         float64
 2   lon         float64
 3   height_sta  float64
 4   date        object
 5   dd          float64
 6   ff          float64
 7   precip      float64
 8   hu          float64
 9   td          float64
 10  t           float64
 11  psl\r       object
dtypes: float64(9), int64(1), object(2)
memory usage: 12.5+ GB
CPU times: user 4.64 s, sys: 6.53 s, total: 11.2 s
Wall time: 22.4 s


In [None]:
# Here's the bottom of the dataset
gdf.tail()

Unnamed: 0,number_sta,lat,lon,height_sta,date,dd,ff,precip,hu,td,t,psl\r
127515791,84086001,43.811,5.146,672.0,20181231 23:54,10.0,3.7,0.0,85.0,274.65,276.95,\r
127515792,84087001,44.145,4.861,55.0,20181231 23:54,350.0,11.4,0.0,80.0,277.85,281.05,102810.000\r
127515793,84094001,44.289,5.131,392.0,20181231 23:54,320.0,3.6,0.0,68.0,274.55,280.05,\r
127515794,84107002,44.041,5.493,836.0,20181231 23:54,280.0,0.6,0.0,91.0,269.55,270.85,\r
127515795,84150001,44.337,4.905,141.0,20181231 23:54,10.0,6.7,0.0,84.0,277.95,280.45,\r
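
The trailing `\r` on the `psl` header and values above suggests that the files use Windows-style `\r\n` line endings while the rows were split on `\n` only, which is likely also why `psl` was read as `object` rather than as a number. One possible cleanup is sketched below with pandas on a made-up two-row frame. Similar string methods exist in cuDF, but treat that portability as an assumption. This notebook sidesteps the issue by dropping `psl` in a later step.

```python
import pandas as pd

# Made-up frame reproducing the symptom: '\r' stuck to the last column.
df = pd.DataFrame({"t": [276.95, 281.05], "psl\r": ["\r", "102810.000\r"]})

# Strip stray whitespace (including '\r') from the column names ...
df.columns = df.columns.str.strip()

# ... and from the values, then convert to numbers; empty strings become NaN.
df["psl"] = pd.to_numeric(df["psl"].str.strip(), errors="coerce")

print(df.dtypes)
```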


In [None]:
%%time
## Save the (concatenated) dataframe to csv file
gdf.to_csv('./SE_data.csv',index=False,chunksize=500000)

CPU times: user 3.36 s, sys: 12.5 s, total: 15.9 s
Wall time: 25.4 s


Restart the kernel to release all GPU memory, then read the data back in for subsequent processing.

In [None]:
## Restart the kernel before doing the performance comparisons below.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [None]:
# Let's make sure the GPU is visible!
!nvidia-smi

Fri Feb 17 01:26:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A6000    Off  | 00000000:65:00.0 Off |                  Off |
| 30%   55C    P5    57W / 300W |      6MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Import the necessary packages
import cudf
import cupy as cp
import pandas as pd

In [None]:
%%time
## Let's focus on the wind speed, temperature, and humidity parameters, and drop the others we're not using.
gdf = cudf.read_csv('./SE_data.csv')
gdf = gdf.drop(columns=['dd','precip','td','psl'])
gdf.head()

CPU times: user 4.43 s, sys: 1.35 s, total: 5.79 s
Wall time: 5.72 s


Unnamed: 0,number_sta,lat,lon,height_sta,date,ff,hu,t
0,1027003,45.83,5.11,196.0,20160101 00:00,,98.0,279.05
1,1033002,46.09,5.81,350.0,20160101 00:00,0.0,99.0,278.35
2,1034004,45.77,5.69,330.0,20160101 00:00,0.0,100.0,279.15
3,1072001,46.2,5.29,260.0,20160101 00:00,,,276.55
4,1089001,45.98,5.33,252.0,20160101 00:00,0.0,95.0,279.55


In [None]:
# Change the date column to datetime datatype. Look at the DataFrame's info
gdf['date'] = cudf.to_datetime(gdf['date'])
gdf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 127515796 entries, 0 to 127515795
Data columns (total 8 columns):
 #   Column      Dtype
---  ------      -----
 0   number_sta  int64
 1   lat         float64
 2   lon         float64
 3   height_sta  float64
 4   date        datetime64[ns]
 5   ff          float64
 6   hu          float64
 7   t           float64
dtypes: datetime64[ns](1), float64(6), int64(1)
memory usage: 7.6 GB


In [None]:
# Print the DataFrame to see the first 5 rows in detail
gdf.head()

Unnamed: 0,number_sta,lat,lon,height_sta,date,ff,hu,t
0,1027003,45.83,5.11,196.0,2016-01-01,,98.0,279.05
1,1033002,46.09,5.81,350.0,2016-01-01,0.0,99.0,278.35
2,1034004,45.77,5.69,330.0,2016-01-01,0.0,100.0,279.15
3,1072001,46.2,5.29,260.0,2016-01-01,,,276.55
4,1089001,45.98,5.33,252.0,2016-01-01,0.0,95.0,279.55


In [None]:
# Print the DataFrame to see the last 5 rows in detail
gdf.tail()

Unnamed: 0,number_sta,lat,lon,height_sta,date,ff,hu,t
127515791,84086001,43.811,5.146,672.0,2018-12-31 23:54:00,3.7,85.0,276.95
127515792,84087001,44.145,4.861,55.0,2018-12-31 23:54:00,11.4,80.0,281.05
127515793,84094001,44.289,5.131,392.0,2018-12-31 23:54:00,3.6,68.0,280.05
127515794,84107002,44.041,5.493,836.0,2018-12-31 23:54:00,0.6,91.0,270.85
127515795,84150001,44.337,4.905,141.0,2018-12-31 23:54:00,6.7,84.0,280.45


In [None]:
# Here are the dimensions, i.e. the shape, of the DataFrame
gdf.shape

(127515796, 8)

In [None]:
## Investigate the sampling frequency: diff() computes the time difference between consecutive
## rows, dt.seconds extracts the seconds component of each difference, and max() returns the
## largest one. Dividing by 60 converts the result to minutes.
delta_mins = gdf['date'].diff().dt.seconds.max()/60

In [None]:
print(f"The dataset collection covers from {gdf['date'].min()} to {gdf['date'].max()} with {delta_mins} minute sampling interval")

The dataset collection covers from 2016-01-01T00:00:00.000000000 to 2018-12-31T23:54:00.000000000 with 6.0 minute sampling interval


### Summary:
- The dataset contains records from time 2016-01-01 00:00 to 2018-12-31 23:54:00
- A new record is sampled every 6 mins
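
To make the `diff()` step concrete, the sketch below applies the same chain to a handful of made-up timestamps using pandas (used here so the example runs without a GPU). Several stations report at the same instant, so consecutive differences are either zero or one sampling step, and the maximum recovers the interval. Note that `.dt.seconds` returns only the seconds component of each timedelta (less than one day), so a gap longer than a day would be under-reported. pandas offers `.dt.total_seconds()` for that case; whether your cuDF version does is worth checking.

```python
import pandas as pd

# Made-up timestamps: three stations report at each 6-minute tick.
dates = pd.Series(pd.to_datetime(
    ["2016-01-01 00:00"] * 3 + ["2016-01-01 00:06"] * 3 + ["2016-01-01 00:12"] * 3
))

# Consecutive differences are 0 s (same tick) or 360 s (next tick).
delta_mins = dates.diff().dt.seconds.max() / 60
print(delta_mins)

# .dt.seconds drops whole days; total_seconds() keeps them.
gap = pd.Series([pd.Timedelta(days=1, minutes=6)])
print(gap.dt.seconds.iloc[0], gap.dt.total_seconds().iloc[0])
```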


## Section 3: Selecting the Data's Date over a Determined Time Period
Common user scenarios include adding new date columns such as year, month, and day, and selecting a date period with specific conditions. The 
cuDF library provides some efficient functions to do so:
- ```.dt.year```, ```.dt.month```, ```.dt.day```, ```.dt.hour```, etc. split a datetime column into separate columns.
- ```cupy.logical_and``` combines conditions for elementwise boolean selection.
- ```pandas.Timestamp``` can be used to define a timestamp.
- ```.shape``` describes the shape of the dataset.

In [None]:
gdf['year'] = gdf['date'].dt.year
gdf['month'] = gdf['date'].dt.month
gdf['day'] = gdf['date'].dt.day
gdf['hour'] = gdf['date'].dt.hour
gdf['mins'] = gdf['date'].dt.minute

#Remember how to check the bottom of a DataFrame without displaying millions of lines?
gdf.tail()

Unnamed: 0,number_sta,lat,lon,height_sta,date,ff,hu,t,year,month,day,hour,mins
127515791,84086001,43.811,5.146,672.0,2018-12-31 23:54:00,3.7,85.0,276.95,2018,12,31,23,54
127515792,84087001,44.145,4.861,55.0,2018-12-31 23:54:00,11.4,80.0,281.05,2018,12,31,23,54
127515793,84094001,44.289,5.131,392.0,2018-12-31 23:54:00,3.6,68.0,280.05,2018,12,31,23,54
127515794,84107002,44.041,5.493,836.0,2018-12-31 23:54:00,0.6,91.0,270.85,2018,12,31,23,54
127515795,84150001,44.337,4.905,141.0,2018-12-31 23:54:00,6.7,84.0,280.45,2018,12,31,23,54


In [None]:
# Let's use the cupy.logical_and(...) function to select the data from a specific time range.
# We can nest more logical_and() calls to combine more than two conditions.
# If you opted to use a partial dataset for the sake of GPU memory, make sure the
# start and end times fall within it.

import pandas as pd
start_time = pd.Timestamp('2017-02-01T00')
end_time = pd.Timestamp('2018-11-01T00')
station_id = 84086001
gdf_period = gdf.loc[cp.logical_and(cp.logical_and(gdf['date']>start_time,gdf['date']<end_time),gdf['number_sta']==station_id)]
gdf_period.shape

(146039, 13)
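
The nested `cp.logical_and` calls can be hard to read once there are three or more conditions. An equivalent and common way to write the mask is with the `&` operator on boolean Series, as sketched below with pandas on made-up rows (cuDF Series support the same operator). Each comparison needs its own parentheses because `&` binds more tightly than `>` or `==`. The sketch also shows that the selected rows keep the index labels they had in the full frame.

```python
import pandas as pd

# Made-up sample: two stations reporting on four dates.
df = pd.DataFrame({
    "number_sta": [84086001, 84087001] * 4,
    "date": pd.to_datetime(
        ["2017-01-31", "2017-01-31", "2017-06-01", "2017-06-01",
         "2018-06-01", "2018-06-01", "2018-12-01", "2018-12-01"]
    ),
})

start_time = pd.Timestamp("2017-02-01T00")
end_time = pd.Timestamp("2018-11-01T00")
station_id = 84086001

# Combine the three conditions element-wise.
mask = (
    (df["date"] > start_time)
    & (df["date"] < end_time)
    & (df["number_sta"] == station_id)
)
df_period = df.loc[mask]

# Rows keep the labels they had in the full frame, so gaps appear.
print(df_period.index.tolist())
```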

In [None]:
# We can see all the columns here, and notice that the indices on the left are no longer contiguous.
# That's expected. Why?
gdf_period

Unnamed: 0,number_sta,lat,lon,height_sta,date,ff,hu,t,year,month,day,hour,mins
45968115,84086001,43.810,5.150,672.0,2017-02-01 00:06:00,7.9,98.0,281.15,2017,2,1,0,6
45968600,84086001,43.810,5.150,672.0,2017-02-01 00:12:00,8.0,98.0,281.15,2017,2,1,0,12
45969085,84086001,43.810,5.150,672.0,2017-02-01 00:18:00,7.3,98.0,281.15,2017,2,1,0,18
45969570,84086001,43.810,5.150,672.0,2017-02-01 00:24:00,7.5,98.0,281.15,2017,2,1,0,24
45970054,84086001,43.810,5.150,672.0,2017-02-01 00:30:00,7.4,98.0,281.05,2017,2,1,0,30
...,...,...,...,...,...,...,...,...,...,...,...,...,...
119937034,84086001,43.811,5.146,672.0,2018-10-31 23:30:00,7.6,95.0,281.65,2018,10,31,23,30
119937538,84086001,43.811,5.146,672.0,2018-10-31 23:36:00,7.0,95.0,281.55,2018,10,31,23,36
119938042,84086001,43.811,5.146,672.0,2018-10-31 23:42:00,6.9,95.0,281.65,2018,10,31,23,42
119938546,84086001,43.811,5.146,672.0,2018-10-31 23:48:00,7.3,95.0,281.65,2018,10,31,23,48


In [None]:
# Let's check for the presence of any NA (invalid) values.
gdf_period.isna().sum()

number_sta     0
lat            0
lon            0
height_sta     0
date           0
ff            83
hu            37
t             37
year           0
month          0
day            0
hour           0
mins           0
dtype: int64

### Summary:
There are 146039 records for the station with ID 84086001 in the period between 2017-02-01 and 2018-10-31.
Among them, 83 records are missing the wind speed, 37 the humidity, and 37 the temperature.


## Section 4: Resampling and Group Time Series Data
Resampling time series data is a very common user scenario, often needed for further investigation.
The cuDF library provides a simple, powerful, and efficient function, [```resample()```](https://docs.rapids.ai/api/cudf/nightly/api_docs/api/cudf.dataframe.resample#cudf-dataframe-resample), for this purpose. We'll use the following functions:
- ```.bfill()``` to backward-fill the NA data in the dataset
- ```resample()``` to resample the data with the date as index
- ```set_index()``` to set the specified column(s) as the index
- ```.groupby()``` to group a DataFrame by one or more columns, and then apply basic aggregations such as “sum”, “mean”, etc.

This section shows how to investigate:
- The maximum temperature of each day during the period between 2017-02-01 and 2018-10-31
- The mean temperature of each month within that period (the code below focuses on 2018). Since cuDF does not yet support month, quarter, or year-anchored frequency resampling, the ```groupby``` function is used instead.

Note: the [cuXFilter library](https://github.com/rapidsai/cuxfilter) is used in Section 5 to plot the temperature trends.
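
As a small, self-contained illustration of the two computations listed above, the sketch below takes a daily maximum with `resample('D')` and a monthly mean with `groupby`. It uses pandas and made-up temperatures so that it runs anywhere; the cuDF calls used later in this section follow the same pattern. The day without readings shows what `bfill()` does: it copies the next available value backwards.

```python
import pandas as pd

# Made-up temperatures (K); note that 2018-02-02 has no readings.
df = pd.DataFrame(
    {"t": [270.0, 280.0, 284.0, 279.0]},
    index=pd.to_datetime(
        ["2018-01-31 12:00", "2018-02-01 00:00",
         "2018-02-01 12:00", "2018-02-03 06:00"]
    ),
)
df.index.name = "date"

# Daily maximum; the empty day is NaN until bfill() copies the next value back.
daily_max = df.resample("D").max().bfill().reset_index()
print(daily_max)

# Monthly mean via groupby on a month column instead of a monthly resample.
df["month"] = df.index.month
monthly_mean = df.groupby("month")["t"].mean().reset_index()
print(monthly_mean)
```

Backward-filling a daily maximum is a modelling choice: it keeps the series gap-free for plotting, but the filled days carry values that were never measured on those days.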

In [None]:
## Set "date" as the index. See what that does?
gdf_period.set_index("date", inplace=True)
gdf_period.tail()

Unnamed: 0_level_0,number_sta,lat,lon,height_sta,ff,hu,t,year,month,day,hour,mins
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2018-10-31 23:30:00,84086001,43.811,5.146,672.0,7.6,95.0,281.65,2018,10,31,23,30
2018-10-31 23:36:00,84086001,43.811,5.146,672.0,7.0,95.0,281.55,2018,10,31,23,36
2018-10-31 23:42:00,84086001,43.811,5.146,672.0,6.9,95.0,281.65,2018,10,31,23,42
2018-10-31 23:48:00,84086001,43.811,5.146,672.0,7.3,95.0,281.65,2018,10,31,23,48
2018-10-31 23:54:00,84086001,43.811,5.146,672.0,8.1,95.0,281.55,2018,10,31,23,54


In [None]:
## Now, resample to daily intervals and take the maximum of each column within each day.
## bfill() backward-fills days without data, and .reset_index() turns "date" back into a regular column.
gdf_day_max = gdf_period.resample('D').max().bfill().reset_index()

## Monthly resampling is not available, so group by the month column and take the mean instead.
## Focus on year 2018 as an example. 
gdf_month_mean = gdf_period[gdf_period["year"]==2018].groupby('month').mean().reset_index()

In [None]:
gdf_day_max.head()

Unnamed: 0,date,number_sta,lat,lon,height_sta,ff,hu,t,year,month,day,hour,mins
0,2017-02-01,84086001,43.81,5.15,672.0,8.1,98.0,283.05,2017,2,1,23,54
1,2017-02-02,84086001,43.81,5.15,672.0,14.1,98.0,283.85,2017,2,2,23,54
2,2017-02-03,84086001,43.81,5.15,672.0,10.1,99.0,281.45,2017,2,3,23,54
3,2017-02-04,84086001,43.81,5.15,672.0,12.5,99.0,284.35,2017,2,4,23,54
4,2017-02-05,84086001,43.81,5.15,672.0,7.3,99.0,280.75,2017,2,5,23,54


In [None]:
gdf_day_max.isna().sum()

date          0
number_sta    0
lat           0
lon           0
height_sta    0
ff            0
hu            0
t             0
year          0
month         0
day           0
hour          0
mins          0
dtype: int64

In [None]:
gdf_month_mean.head()

Unnamed: 0,month,number_sta,lat,lon,height_sta,ff,hu,t,year,day,hour,mins
0,7,84086001.0,43.811,5.146,672.0,4.041219,55.033965,296.290433,2018.0,16.009161,11.508959,26.991513
1,8,84086001.0,43.811,5.146,672.0,4.214624,61.451075,295.031223,2018.0,16.0,11.5,27.0
2,9,84086001.0,43.811,5.146,672.0,3.779583,64.454722,292.081111,2018.0,15.5,11.5,27.0
3,6,84086001.0,43.811,5.146,672.0,4.092944,69.817222,291.707028,2018.0,15.5,11.5,27.0
4,10,84086001.0,43.811,5.146,672.0,5.132343,75.868414,286.593925,2018.0,16.0,11.5,27.0


In [None]:
gdf_month_mean.isna().sum()

month         0
number_sta    0
lat           0
lon           0
height_sta    0
ff            0
hu            0
t             0
year          0
day           0
hour          0
mins          0
dtype: int64

## Section 5: Applying cuxfilter and Finding Daily Temperature Variances

In [None]:
# First, let's import the modules from cuXfilter we'll need.
import cuxfilter
from cuxfilter import themes, layouts
from cuxfilter.assets.custom_tiles import get_provider, Vendors

In [None]:
# It's time to perform the cross filtering operation.
cux_df = cuxfilter.DataFrame.from_dataframe(gdf_day_max)

# Let's make a plot.
chart1 = cuxfilter.charts.line(x='date',y='t',title='Max Temperature of Day')
d = cux_df.dashboard([chart1],layout_array=[[1]], theme=cuxfilter.themes.rapids, data_size_widget=True)
d.app()

In [None]:
# And the mean temperature
cux_df = cuxfilter.DataFrame.from_dataframe(gdf_month_mean)

# Let's make a plot.
chart2 = cuxfilter.charts.line(x='month',y='t',title='Mean Temperature of Month on Year 2018-01 ~ 2018-10')
d = cux_df.dashboard([chart2],layout_array=[[1]], theme=cuxfilter.themes.rapids, data_size_widget=True)
d.app()

### Section 5.1 Investigate the Temperature Variance between Days
Let's look at the change in maximum temperature between two consecutive days using cuDF and cuxfilter.
- ```shift()``` shifts values by the given number of periods (default 1).

In [None]:
gdf_day_max_shift = gdf_day_max.set_index("date").shift(1)
gdf_day_max_shift

Unnamed: 0_level_0,number_sta,lat,lon,height_sta,ff,hu,t,year,month,day,hour,mins
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2017-02-01,,,,,,,,,,,,
2017-02-02,84086001,43.81,5.15,672.0,8.1,98.0,283.05,2017,2,1,23,54
2017-02-03,84086001,43.81,5.15,672.0,14.1,98.0,283.85,2017,2,2,23,54
2017-02-04,84086001,43.81,5.15,672.0,10.1,99.0,281.45,2017,2,3,23,54
2017-02-05,84086001,43.81,5.15,672.0,12.5,99.0,284.35,2017,2,4,23,54
...,...,...,...,...,...,...,...,...,...,...,...,...
2018-10-27,84086001,43.811,5.146,672.0,6.1,100.0,290.85,2018,10,26,23,54
2018-10-28,84086001,43.811,5.146,672.0,7.2,100.0,285.45,2018,10,27,23,54
2018-10-29,84086001,43.811,5.146,672.0,7.3,100.0,283.25,2018,10,28,23,54
2018-10-30,84086001,43.811,5.146,672.0,9.1,100.0,280.15,2018,10,29,23,54


In [None]:
gdf_day_max.set_index("date",inplace=True)

In [None]:
temp_max_day_diff = gdf_day_max - gdf_day_max_shift
temp_max_day_diff

Unnamed: 0_level_0,number_sta,lat,lon,height_sta,ff,hu,t,year,month,day,hour,mins
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2017-02-01,,,,,,,,,,,,
2017-02-02,0,0.0,0.0,0.0,6.0,0.0,0.8,0,0,1,0,0
2017-02-03,0,0.0,0.0,0.0,-4.0,1.0,-2.4,0,0,1,0,0
2017-02-04,0,0.0,0.0,0.0,2.4,0.0,2.9,0,0,1,0,0
2017-02-05,0,0.0,0.0,0.0,-5.2,0.0,-3.6,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2018-10-27,0,0.0,0.0,0.0,1.1,0.0,-5.4,0,0,1,0,0
2018-10-28,0,0.0,0.0,0.0,0.1,0.0,-2.2,0,0,1,0,0
2018-10-29,0,0.0,0.0,0.0,1.8,0.0,-3.1,0,0,1,0,0
2018-10-30,0,0.0,0.0,0.0,2.1,0.0,1.4,0,0,1,0,0


In [None]:
temp_max_day_diff.reset_index(inplace=True)

In [None]:
# We finally are ready to plot the daily temperature differences
cux_df = cuxfilter.DataFrame.from_dataframe(temp_max_day_diff)

# Let's make a plot.
chart4 = cuxfilter.charts.line(x='date',y='t',title='Temperature diff between Days')
d = cux_df.dashboard([chart4],layout_array=[[1]], theme=cuxfilter.themes.rapids, data_size_widget=True)
d.app()

### Summary:
The ```shift()``` function shifts the values of a DataFrame by a number of rows, so subtracting the shifted DataFrame from the original gives the difference between neighboring rows. 
```shift(1)``` is applied in this section to compute the temperature difference between two consecutive days. The chart above clearly shows the temperature difference.
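
A minimal sketch of the same idea on made-up daily maxima is shown below, using pandas so that it runs without a GPU. Subtracting the shifted Series from the original gives the day-over-day change, and the first row is missing because it has no previous day. For a single lag this matches the built-in `diff()`.

```python
import pandas as pd

# Made-up daily maximum temperatures (K).
t_max = pd.Series(
    [283.0, 284.0, 281.5, 284.5],
    index=pd.date_range("2017-02-01", periods=4, freq="D"),
)

# shift(1) moves each value down one row, so row i holds day i-1's value.
day_diff = t_max - t_max.shift(1)
print(day_diff)

# Same result as the built-in first difference.
assert day_diff.equals(t_max.diff())
```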

### Section 5.2 Maximum Temperature with 3 Day Rolling Window
Set a rolling window of 3 days to see the maximum temperature within each window.
- ```rolling()``` sets the rolling window.

In [None]:
gdf_day_max.head()

Unnamed: 0_level_0,number_sta,lat,lon,height_sta,ff,hu,t,year,month,day,hour,mins
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2017-02-01,84086001,43.81,5.15,672.0,8.1,98.0,283.05,2017,2,1,23,54
2017-02-02,84086001,43.81,5.15,672.0,14.1,98.0,283.85,2017,2,2,23,54
2017-02-03,84086001,43.81,5.15,672.0,10.1,99.0,281.45,2017,2,3,23,54
2017-02-04,84086001,43.81,5.15,672.0,12.5,99.0,284.35,2017,2,4,23,54
2017-02-05,84086001,43.81,5.15,672.0,7.3,99.0,280.75,2017,2,5,23,54


In [None]:
# Here we specify the rolling window.
gdf_3d_max = gdf_day_max.rolling('3d',min_periods=1).max()
gdf_3d_max.reset_index(inplace=True)
gdf_3d_max.head()

Unnamed: 0,date,number_sta,lat,lon,height_sta,ff,hu,t,year,month,day,hour,mins
0,2017-02-01,84086001,43.81,5.15,672.0,8.1,98.0,283.05,2017,2,1,23,54
1,2017-02-02,84086001,43.81,5.15,672.0,14.1,98.0,283.85,2017,2,2,23,54
2,2017-02-03,84086001,43.81,5.15,672.0,14.1,99.0,283.85,2017,2,3,23,54
3,2017-02-04,84086001,43.81,5.15,672.0,14.1,99.0,284.35,2017,2,4,23,54
4,2017-02-05,84086001,43.81,5.15,672.0,12.5,99.0,284.35,2017,2,5,23,54
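
To see how the time-based window behaves, the sketch below reproduces the temperature column of the five rows shown above with pandas (used so that the example runs without a GPU). With a `'3D'` window each result covers the current day and the two before it, and `min_periods=1` lets the first rows return a value even though fewer than three days are available. Swapping `.max()` for `.mean()` would give a rolling average instead.

```python
import pandas as pd

# Daily maximum temperatures (K) taken from the first five rows above.
t_max = pd.Series(
    [283.05, 283.85, 281.45, 284.35, 280.75],
    index=pd.date_range("2017-02-01", periods=5, freq="D"),
)

# 3-day rolling maximum: current day plus the two preceding days.
rolling_max = t_max.rolling("3D", min_periods=1).max()
print(rolling_max.tolist())
```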


In [None]:
gdf_3d_max.isna().sum()

date          0
number_sta    0
lat           0
lon           0
height_sta    0
ff            0
hu            0
t             0
year          0
month         0
day           0
hour          0
mins          0
dtype: int64

In [None]:
# Applying cuxfilter.
cux_df = cuxfilter.DataFrame.from_dataframe(gdf_3d_max)

# Let's make a plot.
chart5 = cuxfilter.charts.line(x='date',y='t',title='Three Day Rolling Max of Max Daily Temperatures')
d = cux_df.dashboard([chart5],layout_array=[[1]], theme=cuxfilter.themes.rapids, data_size_widget=True)
d.app()

## Section 6: Accelerated Computing Performance Check

This section covers the performance of a handful of typical functions used in this notebook, comparing pandas (CPU) with cuDF (GPU). You can adapt the code below to measure the performance improvement on your local machine.

Test machine information:
- **GPU**: NVIDIA RTX A6000   
- **CPU**: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz 
- **RAPIDS**: Rapids 23.02 with CUDA 11.5

**<p style="text-align: center;">Performance Results based on test results</p>**


|Function|GPU Time (s)|CPU Time (s)|GPU Speedup (x)|
| --- | --- | --- | --- |
|read|4.391139|117.607004|26.78|
|drop|0.184182|3.34047|18.14|
|diff|0.131384|16.044269|122.12|
|select|0.07151|62.890464|879.46|
|resample|0.347972|9.892627|28.43|

<div align=center><img src="attachment:8c70ade1-084a-4deb-9638-6037e3876db1.png" width=500 height=375></div>

In [None]:
## Restart the kernel before the performance comparison
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [None]:
!nvidia-smi

Thu Feb 16 19:40:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro RTX 6000     Off  | 00000000:15:00.0 Off |                  Off |
| 33%   43C    P8    27W / 260W |      6MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     Off  | 00000000:2D:00.0  On |                  Off |
| 35%   61C    P3    67W / 260W |    620MiB / 24576MiB |      0%      Default |
|       

In [None]:
import numpy as np
import pandas as pd
import cudf
import cupy as cp
from timeit import default_timer as timer

# Run the DataFrame speed performance calculations on your machine.
# The compute-intensive functions will be run on both CPU and GPU, followed by
# displaying a performance table. CPU version is using Pandas, "pd".
# GPU version is RAPIDS, "cudf".

# First, warm up GPU for cuDF performance check.
for i in range(10):
    pf_data = cudf.DataFrame(cp.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
    
# Start by reading the .csv file, a 10+ GB dataset in this example
start_t = timer()
data_cudf = cudf.read_csv('./SE_data.csv')
read_gpu_time = timer() - start_t

start_t = timer()

## Read in chunks to limit peak memory use, then combine the chunks
pd_tf= pd.read_csv('./SE_data.csv',chunksize=30000000)
data_pd = pd.concat([chunk for chunk in pd_tf])
read_cpu_time = timer() - start_t

#drop columns
start_t = timer()
data_cudf = data_cudf.drop(columns=['dd','precip','td','psl'])
drop_gpu_time = timer() - start_t

start_t = timer()
data_pd = data_pd.drop(columns=['dd','precip','td','psl'])
drop_cpu_time = timer() - start_t

# diff() 
start_t = timer()
data_cudf['date'] = cudf.to_datetime(data_cudf['date'])
delta_mins = data_cudf['date'].diff().dt.seconds.max()/60
print(f"The dataset runs from {data_cudf['date'].min()} to {data_cudf['date'].max()} with {delta_mins} mins sampling interval")
diff_gpu_time = timer() - start_t

start_t = timer()
data_pd['date'] = pd.to_datetime(data_pd['date'])
delta_mins = data_pd['date'].diff().dt.seconds.max()/60
print(f"The dataset runs from {data_pd['date'].min()} to {data_pd['date'].max()} with {delta_mins} mins sampling interval")
diff_cpu_time = timer() - start_t

# Select determined date and specific ground station
start_t = timer()
data_cudf['year'] = data_cudf['date'].dt.year
data_cudf['month'] = data_cudf['date'].dt.month
data_cudf['day'] = data_cudf['date'].dt.day
data_cudf['hour'] = data_cudf['date'].dt.hour
data_cudf['mins'] = data_cudf['date'].dt.minute

start_time = pd.Timestamp('2018-02-01T00')
end_time = pd.Timestamp('2018-11-01T00')
station_id = 84086001
gdf_period = data_cudf.loc[cp.logical_and(cp.logical_and(data_cudf['date']>start_time,data_cudf['date']<end_time),data_cudf['number_sta']==station_id)]
select_data_gpu_time = timer() - start_t

start_t = timer()
data_pd['year'] = data_pd['date'].dt.year
data_pd['month'] = data_pd['date'].dt.month
data_pd['day'] = data_pd['date'].dt.day
data_pd['hour'] = data_pd['date'].dt.hour
data_pd['mins'] = data_pd['date'].dt.minute

start_time = pd.Timestamp('2018-02-01T00')
end_time = pd.Timestamp('2018-11-01T00')
station_id = 84086001
df_period = data_pd.loc[np.logical_and(np.logical_and(data_pd['date']>start_time,data_pd['date']<end_time),data_pd['number_sta']==station_id)]
select_data_cpu_time = timer() - start_t

# resample dataset
start_t = timer()
data_cudf.set_index("date", inplace=True)
## resample with day, check the max data during the resampled period 
data_cudf = data_cudf.resample('D').max().reset_index()
resample_gpu_time = timer() - start_t

start_t = timer()
data_pd.set_index("date", inplace=True)
## resample with day, check the max data during the resampled period 
data_pd = data_pd.resample('D').max().reset_index()
resample_cpu_time = timer() - start_t

# Build the performance table (as another DataFrame, of course!).
performance_df = cudf.DataFrame()
performance_df['function'] = ['read','drop','diff','select','resample']
performance_df['time_gpu']=[read_gpu_time,drop_gpu_time,diff_gpu_time,select_data_gpu_time,resample_gpu_time]
performance_df['time_cpu']=[read_cpu_time,drop_cpu_time,diff_cpu_time,select_data_cpu_time,resample_cpu_time]
performance_df['speedup']=performance_df['time_cpu']/performance_df['time_gpu']
performance_df
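
Each measurement above is a single run, so the numbers can vary with caching and background load. One way to make such timings more stable is to repeat each step and keep the best run. The sketch below shows the pattern with a hypothetical helper, `best_time`, and stand-in workloads, using only the standard library. It is not part of the original benchmark.

```python
from timeit import default_timer as timer

def best_time(func, repeats=3):
    """Run func() `repeats` times and return the shortest wall time in seconds."""
    times = []
    for _ in range(repeats):
        start_t = timer()
        func()
        times.append(timer() - start_t)
    return min(times)

# Stand-in workloads; replace with the pandas and cuDF calls being compared.
def slow_step():
    return sum(i * i for i in range(200_000))

def fast_step():
    return sum(range(200_000))

slow_t = best_time(slow_step)
fast_t = best_time(fast_step)
print(f"slow: {slow_t:.4f}s  fast: {fast_t:.4f}s  speedup: {slow_t / fast_t:.1f}x")
```

Steps that modify their input in place, such as `set_index(..., inplace=True)`, would need a fresh copy of the data for each repeat.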

## Conclusion
In this notebook, we applied GPU-accelerated DataFrame computation through the use of RAPIDS, cuDF, and CuPy. We demonstrated data loading methods, data selection techniques on a time series dataset, an application of cuXfilter for data analysis, and finally how to measure the performance of GPU computation versus CPU-only computation on the same dataset.

## Citation
- Gwennaëlle Larvor, Léa Berthomier, Vincent Chabot, Brice Le Pape, Bruno Pradel, Lior Perez. MeteoNet, an open reference weather dataset by METEO FRANCE, 2020 [dataset link](https://www.kaggle.com/datasets/katerpillar/meteonet)    