# Basic Exploratory Data Analysis with cuDF on the MeteoNet Dataset
cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. It provides a pandas-like API, so many workflows can be accelerated by changing `pandas.DataFrame()` to `cudf.DataFrame()`. In the tests reported at the end of this notebook, speedups reached up to ~400x on specific hardware.

Explore the latest [cuDF API](https://docs.rapids.ai/api/cudf/nightly/api_docs/index.html)

The table below compares cuDF with pandas on an NVIDIA RTX A6000 GPU and an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz. Details are given at the end of the notebook.
**<p style="text-align: center;">Performance Results based on test results</p>**    


|Function| GPU Time (s) | CPU Time (s) | GPU Speedup (x) |
| --- | --- | --- | --- |   
|read|1.977145|69.033193|34.92|
|slice|0.030406|13.349222|439.03|
|na|0.07609|8.246114|108.37|
|dropna|0.242239|9.784584|40.39|
|unique|0.013432|0.445705|33.18|
|dropduplicate|0.23392|0.518868|2.22|
|group_sum|0.6725|7.850392|11.67|


This notebook introduces how to use DataFrames and cuDF to apply basic data analysis to the [MeteoNet Dataset](https://www.kaggle.com/datasets/katerpillar/meteonet), an open weather dataset by METEO FRANCE, the nation's official meteorological service.
  
The dataset reflects real-world data collection, including missing or invalid data. Here, we illustrate how to:
- Load and save the data
- Perform some quick checks
- Calculate the rate of missing data
- Check for invalid data
- Run a correlation of meteorological parameters
- Do a computation performance check

At the beginning of each of these topics, guidance is provided to show which functions in cuDF are applied. We follow up with a summary to describe the information we can glean through analysis. 


## Prerequisites   
To use this notebook, [RAPIDS](https://rapids.ai/start.html) must be installed. Please review the following steps and ensure it's properly installed.

### System Requirements
All provisioned systems need to be RAPIDS capable. Here’s what's required:

 **GPU**: NVIDIA Pascal™ or better with compute capability 6.0+

 **OS**: One of the following OS versions:
 - Ubuntu 18.04/20.04 or CentOS 7 / Rocky Linux 8 with gcc/g++ 9.0+
 - Windows 11 using WSL2 (see the separate install guide)
 - RHEL 7/8 support is provided through CentOS 7 / Rocky Linux 8 builds/installs

 **CUDA & NVIDIA Drivers**: One of the following supported versions:
 - CUDA 11.2 with Driver 460.27.03 or newer
 - CUDA 11.4 with Driver 470.42.01 or newer
 - CUDA 11.5 with Driver 495.29.05 or newer
   
Note: RAPIDS is tested with and officially supports the versions listed above. Newer versions of CUDA, drivers, and OS may also work with RAPIDS.

### Environment for RAPIDS
You can install RAPIDS in one of the environments below. Referring to [Step 2: Install Environment](https://rapids.ai/start.html), the possible environments are:
* Conda 
* Build from source 
* PIP installation
* Running a Docker container 

### Installing RAPIDS  
There are specific ways to install RAPIDS for different environments.
#### Conda   
Below is the command for basic installation under Conda:
```
conda create -n rapids-23.02 -c rapidsai-nightly -c conda-forge -c nvidia rapids=23.02 python=3.9 cudatoolkit=11.5 jupyterlab
```

You can specify Python version 3.8 or 3.9, and a cudatoolkit version of 11.2, 11.4, or 11.5.

NOTE: ```rapids=23.02``` installs the standard selection, which contains all of the following packages: _cuDF, cuML, cuGraph, cuSpatial, cuXFilter, cuSignal, cuCIM_. To install a single package instead, specify it directly, for example ```cudf=23.02```.

For additional installation of Dask SQL, JupyterLab, Plotly Dash, Graphistry, etc., you can add the related package names to the conda command. Find detailed information at [Step 3: Install Rapids](https://rapids.ai/start.html).

#### Docker
Here's an example that uses two commands for a basic installation with a Docker container from NGC, selecting CUDA 11.2 and Ubuntu 20.04. The first command pulls the image and the second runs the container.
```
docker pull nvcr.io/nvidia/rapidsai/rapidsai-core:22.12-cuda11.2-runtime-ubuntu20.04-py3.9

docker run --gpus all --rm -it \
    --shm-size=1g --ulimit memlock=-1 \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    nvcr.io/nvidia/rapidsai/rapidsai-core:22.12-cuda11.2-runtime-ubuntu20.04-py3.9
```
For Docker commands for a specific system, and for _Dask-SQL_ and _CLX_ support, see [Step 3: Install Rapids](https://rapids.ai/start.html).
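
Once an environment is in place, it can help to confirm that the packages this notebook imports are discoverable before running any cells. The following is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is hypothetical, and the check only reports whether a package can be located, not whether a usable GPU is present.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True when Python can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Packages imported later in this notebook.
for name in ("cudf", "cupy", "cuxfilter"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If any package is reported as missing, revisit the installation steps above before continuing.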

## Prepare the Dataset - Ground Stations
This section shows how to perform basic analysis on the Ground Stations Dataset from MeteoNet.

### Download the Dataset
For this first task, let's use the Northwest France ground station data. Each parameter was measured every six minutes (10 times an hour). The parameters in the dataset are listed below. More detailed information is available on [this GitHub page](https://meteofrance.github.io/meteonet/english/data/ground-observations/).

**<p style="text-align: center;">Metadata</p>**

|Name| Description | Unit|   
| --- | --- | --- |          
|number_sta|ground station ID| - |   
|lat| latitude| decimal degrees (10^-1°) |    
|lon| longitude| decimal degrees (10^-1°) |   
|height_sta| station height| meters(m) |    
|date| a datetime object| format 'YYYY-MM-DD HH:mm:ss' |    

**<p style="text-align: center;">Meteorological Parameters</p>**

|Name| Description | Unit| 
| --- | --- | --- |   
|dd| Wind direction | degrees (°)|    
|ff| Wind speed | m.s^-1|    
|precip| Precipitation during the reporting period | kg.m^-2|    
|hu| Humidity | % |      
|td| Dew point | Kelvin (K) |     
|t| Temperature | Kelvin (K) |   
|psl| Pressure reduced to sea level | Pascal (Pa)|
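
Because temperatures are stored in Kelvin and pressure in Pascal, it is often convenient to convert values to more familiar units when reading the data. The sketch below shows the two conversions; the helper names are hypothetical and are not part of the dataset tooling. The sample values come from a row displayed later in this notebook.

```python
KELVIN_OFFSET = 273.15      # 0 degrees Celsius expressed in Kelvin
PASCALS_PER_HPA = 100.0     # 1 hectopascal = 100 Pascal


def kelvin_to_celsius(kelvin: float) -> float:
    """Convert a temperature (t or td) from Kelvin to degrees Celsius."""
    return kelvin - KELVIN_OFFSET


def pascal_to_hectopascal(pascal: float) -> float:
    """Convert sea-level pressure (psl) from Pascal to hectopascal."""
    return pascal / PASCALS_PER_HPA


# Sample values: t = 277.65 K and psl = 102360.0 Pa
print(round(kelvin_to_celsius(277.65), 2), "degrees Celsius")
print(pascal_to_hectopascal(102360.0), "hPa")
```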

With the data now described, let's download and unpack the dataset. Each yearly archive expands to a CSV file of about 1.7G.

In [None]:
# Download dataset
!wget https://meteonet.umr-cnrm.fr/dataset/data/NW/ground_stations/NW_ground_stations_2018.tar.gz
!wget https://meteonet.umr-cnrm.fr/dataset/data/NW/ground_stations/NW_ground_stations_2017.tar.gz
!wget https://meteonet.umr-cnrm.fr/dataset/data/NW/ground_stations/NW_ground_stations_2016.tar.gz

--2023-02-20 00:22:48--  https://meteonet.umr-cnrm.fr/dataset/data/NW/ground_stations/NW_ground_stations_2018.tar.gz
Resolving meteonet.umr-cnrm.fr (meteonet.umr-cnrm.fr)... 193.49.97.131
Connecting to meteonet.umr-cnrm.fr (meteonet.umr-cnrm.fr)|193.49.97.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 182609994 (174M) [application/x-gzip]
Saving to: ‘NW_ground_stations_2018.tar.gz’


2023-02-20 00:23:32 (4.15 MB/s) - ‘NW_ground_stations_2018.tar.gz’ saved [182609994/182609994]

--2023-02-20 00:23:32--  https://meteonet.umr-cnrm.fr/dataset/data/NW/ground_stations/NW_ground_stations_2017.tar.gz
Resolving meteonet.umr-cnrm.fr (meteonet.umr-cnrm.fr)... 193.49.97.131
Connecting to meteonet.umr-cnrm.fr (meteonet.umr-cnrm.fr)|193.49.97.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 182748960 (174M) [application/x-gzip]
Saving to: ‘NW_ground_stations_2017.tar.gz’


2023-02-20 00:24:19 (3.76 MB/s) - ‘NW_ground_stations_2017.tar.gz’ s

In [None]:
# Unzip it in the shell
!tar -xvf NW_ground_stations_2018.tar.gz && rm -f NW_ground_stations_2018.tar.gz
!tar -xvf NW_ground_stations_2017.tar.gz && rm -f NW_ground_stations_2017.tar.gz
!tar -xvf NW_ground_stations_2016.tar.gz && rm -f NW_ground_stations_2016.tar.gz

NW2018.csv
NW2017.csv
NW2016.csv


In [None]:
# Are they listed?
!ls -l -sh NW2*.csv

1.7G -rw-r--r-- 1 meiranp dip 1.7G Jan 23  2020 NW2016.csv
1.7G -rw-r--r-- 1 meiranp dip 1.7G Jan 23  2020 NW2017.csv
1.7G -rw-r--r-- 1 meiranp dip 1.7G Jan 23  2020 NW2018.csv


With the dataset now in hand, let's load the data and take an initial look at the contents.

## Load and save the data, and perform some quick checks

There are some basic features of DataFrames that will make your work easier. Here are a few we'll use:
- The MeteoNet dataset is in .csv ("comma-separated values") format, so the ```.read_csv()``` function can load it into a DataFrame. Make special note of the line terminator defined in the csv file.
- The ```.head()``` and ```.tail()``` functions from the cudf library show the first and last several observations of the dataset, which is very handy when working with long datasets.
- ```.shape``` describes the shape of the DataFrame.
- ```.drop_duplicates()``` drops duplicated rows, optionally considering only a certain subset of the DataFrame's columns.
- ```.drop()``` drops specific columns.
- ```cudf.to_datetime()``` converts its argument to a datetime dtype.
- ```cudf.concat()``` concatenates DataFrames, Series, or Indices row-wise.
- ```.to_csv()``` writes a DataFrame to a csv file.

**Note**: The following processing uses a combined dataset from years 2016, 2017, and 2018, which is about 6GB in size. If limited by the GPU's memory (out of memory error), you can load just one of the datasets to investigate how cuDF works. 
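
To see how these calls fit together before working with the full files, here is a small sketch on made-up two-year CSV snippets. It is written against pandas so that it runs without a GPU; on a RAPIDS machine, replacing the import with `import cudf as pd` is assumed (not verified here) to behave the same way for these calls. The CSV text and the date format string are illustrative assumptions modeled on the dataset's `YYYYMMDD HH:mm` strings.

```python
import io

import pandas as pd  # assumption: `import cudf as pd` works the same for these calls

# Made-up miniature versions of two yearly files
csv_2016 = (
    "number_sta,date,t\n"
    "14066001,20160101 00:00,279.85\n"
    "14126001,20160101 00:00,278.45\n"
)
csv_2017 = "number_sta,date,t\n14066001,20170101 00:00,280.15\n"

# Read each file, then stack them row-wise with a fresh index
frames = [pd.read_csv(io.StringIO(text)) for text in (csv_2016, csv_2017)]
combined = pd.concat(frames, ignore_index=True)

# Parse the date strings into a datetime column
combined["date"] = pd.to_datetime(combined["date"], format="%Y%m%d %H:%M")

# Keep one row per station to count the stations
stations = combined.drop_duplicates(subset=["number_sta"])

print(combined.shape)
print(len(stations), "unique stations")
```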

In [None]:
# Here's where we import the cuDF and CuPy libraries
import cudf
import cupy as cp

In [None]:
%%time
# Do a warm-up when benchmarking performance. Refer to the last section of code for the performance check. 
# If you get an out-of-memory error, you can comment out two of the read_csv lines below. Just make sure
# to update the gdf_frames line, too, to reflect which dataset you're keeping.

# Empty DataFrame placeholders so you can select just one or two of the years of data. 
gdf_2016 = cudf.DataFrame()
gdf_2017 = cudf.DataFrame()
gdf_2018 = cudf.DataFrame()

# **********NOTE***********
# Comment out one or two of these if your GPU memory is full.
gdf_2016 = cudf.read_csv('./NW2016.csv')
gdf_2017 = cudf.read_csv('./NW2017.csv')
gdf_2018 = cudf.read_csv('./NW2018.csv')

gdf_frames =[gdf_2016,gdf_2017,gdf_2018]
GS_cudf = cudf.concat(gdf_frames,ignore_index=True)
GS_cudf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 65826837 entries, 0 to 65826836
Data columns (total 12 columns):
 #   Column      Dtype
---  ------      -----
 0   number_sta  int64
 1   lat         float64
 2   lon         float64
 3   height_sta  float64
 4   date        object
 5   dd          float64
 6   ff          float64
 7   precip      float64
 8   hu          float64
 9   td          float64
 10  t           float64
 11  psl\r        object
dtypes: float64(9), int64(1), object(2)
memory usage: 6.5+ GB
CPU times: user 2.86 s, sys: 3.89 s, total: 6.75 s
Wall time: 13.4 s


In [None]:
# Here's the bottom of the dataset
GS_cudf.tail()

Unnamed: 0,number_sta,lat,lon,height_sta,date,dd,ff,precip,hu,td,t,psl\r
65826832,86137003,47.035,0.098,96.0,20181231 23:54,40.0,2.9,0.0,88.0,278.85,280.75,\r
65826833,86165005,46.412,0.841,153.0,20181231 23:54,60.0,3.3,0.0,95.0,278.85,279.55,\r
65826834,86272002,46.839,0.457,120.0,20181231 23:54,,,0.0,,,,\r
65826835,91200002,48.526,1.993,116.0,20181231 23:54,270.0,0.8,0.0,96.0,279.75,280.35,\r
65826836,95690001,49.108,1.831,126.0,20181231 23:54,280.0,2.4,0.0,97.0,279.65,280.05,\r


In [None]:
%%time
## Save the (concatenated) dataframe to csv file
GS_cudf.to_csv('./NW_data.csv',index=False,chunksize=500000)

CPU times: user 1.88 s, sys: 7.24 s, total: 9.11 s
Wall time: 16.8 s


Restart the kernel to release all GPU memory, then read the data again for subsequent processing.

In [None]:
## Restart the kernel before running the performance comparisons below.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [None]:
# Let's make sure the GPU is visible!
!nvidia-smi

Mon Feb 20 01:57:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A6000    Off  | 00000000:65:00.0 Off |                  Off |
| 30%   56C    P8    22W / 300W |      6MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Import the necessary packages
import cudf
import cupy as cp
import pandas as pd

In [None]:
%%time
# Let's read in the dataset, which is in CSV format with newlines as the terminator.
# And let's also keep track of the time elapsed to do so (the first line of the cell).

GS_cudf = cudf.read_csv('./NW_data.csv',lineterminator='\n')

CPU times: user 2.17 s, sys: 743 ms, total: 2.91 s
Wall time: 2.88 s


In [None]:
# change the date column to datetime dtype, see the DataFrame info
GS_cudf['date'] = cudf.to_datetime(GS_cudf['date'])
GS_cudf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 65826837 entries, 0 to 65826836
Data columns (total 12 columns):
 #   Column      Dtype
---  ------      -----
 0   number_sta  int64
 1   lat         float64
 2   lon         float64
 3   height_sta  float64
 4   date        datetime64[ns]
 5   dd          float64
 6   ff          float64
 7   precip      float64
 8   hu          float64
 9   td          float64
 10  t           float64
 11  psl         float64
dtypes: datetime64[ns](1), float64(10), int64(1)
memory usage: 5.9 GB


In [None]:
# Display the first five rows of the DataFrame to examine details
GS_cudf.head()

Unnamed: 0,number_sta,lat,lon,height_sta,date,dd,ff,precip,hu,td,t,psl
0,14066001,49.33,-0.43,2.0,2016-01-01,210.0,4.4,0.0,91.0,278.45,279.85,
1,14126001,49.15,0.04,125.0,2016-01-01,,,0.0,99.0,278.35,278.45,
2,14137001,49.18,-0.46,67.0,2016-01-01,220.0,0.6,0.0,92.0,276.45,277.65,102360.0
3,14216001,48.93,-0.15,155.0,2016-01-01,220.0,1.9,0.0,95.0,278.25,278.95,
4,14296001,48.8,-1.03,339.0,2016-01-01,,,0.0,,,278.35,


In [None]:
# Checking the DataFrame's dimensions. Millions of rows by 12 columns.
GS_cudf.shape

(65826837, 12)

In [None]:
# We can further examine the characteristics of a DataFrame using .info().
# This will show, for instance, the datatype of each column and the total GPU memory it occupies.
GS_cudf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 65826837 entries, 0 to 65826836
Data columns (total 12 columns):
 #   Column      Dtype
---  ------      -----
 0   number_sta  int64
 1   lat         float64
 2   lon         float64
 3   height_sta  float64
 4   date        datetime64[ns]
 5   dd          float64
 6   ff          float64
 7   precip      float64
 8   hu          float64
 9   td          float64
 10  t           float64
 11  psl         float64
dtypes: datetime64[ns](1), float64(10), int64(1)
memory usage: 5.9 GB


In [None]:
# DataFrames simplify data cleaning, such as dropping duplicate entries for a column.
# In this case, we keep one row per ground station to check how many ground stations are monitored
unique_stat_info = GS_cudf.drop_duplicates(subset=['number_sta'])
unique_stat_info.shape[0]

287

### Summary
- The dataset columns match the dataset metadata: number_sta, lat, lon, height_sta, date, dd, ff, etc.
- The combined dataset holds 65826837 records from the ground stations across 12 columns.
- There are 287 ground stations observed in this dataset.

## Calculate the rate of missing data     

Now we can further analyze the data, using a few handy DataFrame methods:
- ```.nunique()``` counts the number of distinct elements in the "number_sta" column.
- ```.str.contains()``` selects the items containing a specific substring.
- ```.to_numeric()``` converts its argument into a numerical type.
- ```.diff()``` returns a new object containing the difference between rows (by default, the difference with the previous row).
- ```.min()``` returns the minimum values in the DataFrame.
- ```%%time``` as the first line of a cell displays the time elapsed running the code.

In [None]:
%%time
# How many weather stations are covered in this dataset? 
# Call nunique() to count the distinct elements along a specified axis.

number_stations = GS_cudf['number_sta'].nunique()
print("The full dataset is composed of {} unique weather stations.".format(number_stations))

The full dataset is composed of 287 unique weather stations.
CPU times: user 18.3 ms, sys: 19.4 ms, total: 37.7 ms
Wall time: 32.8 ms


In [None]:
%%time
## Investigate the recording frequency of the data
## The date column has a datetime dtype, so diff() returns the time delta between consecutive rows.
## Assuming the rows are ordered by timestamp, each delta is either zero or one recording step.
## .dt.seconds gives the seconds component of each delta; divide by 60 to get the difference in minutes.
delta_mins = GS_cudf['date'].diff().dt.seconds.max()/60
print(f"The data is recorded every {delta_mins} minutes")

The data is recorded every 6.0 minutes
CPU times: user 16.3 ms, sys: 28.6 ms, total: 44.9 ms
Wall time: 40.4 ms
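
The check above takes differences across the whole date column. A per-station variant can also reveal gaps in an individual station's record. The following is a minimal sketch on made-up rows, written against pandas so it runs without a GPU; the same `sort_values`, `groupby`, and `diff` calls are assumed (not verified here) to be available in cuDF.

```python
import pandas as pd  # assumption: cudf offers equivalent calls on a GPU machine

# Made-up records for two stations; one record is missing for the first station
records = pd.DataFrame(
    {
        "number_sta": [14066001, 14126001, 14066001, 14126001, 14066001],
        "date": pd.to_datetime(
            [
                "2016-01-01 00:00",
                "2016-01-01 00:00",
                "2016-01-01 00:06",
                "2016-01-01 00:06",
                "2016-01-01 00:18",  # the 00:12 record is absent
            ]
        ),
    }
)

# Order by station and time, then take the time delta within each station
ordered = records.sort_values(["number_sta", "date"])
gaps_min = ordered.groupby("number_sta")["date"].diff().dt.total_seconds() / 60

interval_min = gaps_min.min()     # smallest gap: the nominal recording interval
longest_gap_min = gaps_min.max()  # larger gaps point to missing records
print(interval_min, longest_gap_min)
```

Here the smallest gap recovers the six-minute interval, while the largest gap exposes the skipped record.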


The dataset includes 287 unique stations with 10 records per hour (one every 6 minutes), so the expected number of records over three years, counting 365 days per year, is
```
287 x 10 x 24 x 365 x 3 = 75,423,600 records
```

Knowing this, we can calculate the missing record rate.

In [None]:
# Theoretical number of records is... 
theoretical_nb_records = number_stations * (60 / delta_mins) * 365 * 3 * 24 
actual_nb_of_rows = GS_cudf.shape[0]
missing_record_ratio = 1 - (actual_nb_of_rows/theoretical_nb_records)
print("Percentage of missing records of the NW dataset is: {:.1f}%".format(missing_record_ratio * 100))
print("Theoretical total number of values in dataset is: {:d}".format(int(theoretical_nb_records)))

Percentage of missing records of the NW dataset is: 12.7%
Theoretical total number of values in dataset is: 75423600


### Summary  
The dataset is composed of weather recordings from **287 unique ground stations** in the Northwest of France during 2016, 2017, and 2018. Each station reports every 6 minutes, covering wind direction, wind speed, humidity, temperature, pressure, and more.
- The theoretical number of records is 75423600
- The actual number of records in the dataset is 65826837
- About 12.7% of the expected records are missing for the years 2016 to 2018.
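
The theoretical count treats every year as 365 days, although 2016 was a leap year. As a rough cross-check, the sketch below recomputes the expected count from calendar dates using only the standard library. The station count and row count are the figures reported in this notebook; the constant names are hypothetical.

```python
from datetime import date

STATIONS = 287            # unique stations reported above
RECORDS_PER_HOUR = 10     # one record every 6 minutes
ACTUAL_ROWS = 65_826_837  # rows in the combined DataFrame

# Days from 1 Jan 2016 up to (not including) 1 Jan 2019, leap day included
days = (date(2019, 1, 1) - date(2016, 1, 1)).days

expected = STATIONS * RECORDS_PER_HOUR * 24 * days
missing_ratio = 1 - ACTUAL_ROWS / expected
print(days, expected, f"{missing_ratio:.1%}")
```

Including the leap day changes the expected count only slightly, so the overall picture of roughly one record in eight being absent stays the same.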

## Check for invalid data

Next, we check for invalid data by locating NA values and calculating columnar sums. We'll use:
- ```.isna()``` to create a new DataFrame of boolean values in which NA items are marked True.
- ```.sum()``` to find the sum of each Series.
- ```.str.slice()``` to cut a date string down to the month (the datetime accessor ```.dt.month``` does the same for datetime columns).
- The ```.index``` attribute of a Series, to find which columns have NA values.
- ```.to_frame()``` to convert a Series into a DataFrame.
- ```.reset_index()``` to reset the index of the DataFrame.
- [cuxfilter library](https://github.com/rapidsai/cuxfilter) to plot data for visual analysis.

Overall, we can use these functions to check whether there is NA data, and then total the NA data for each category by month to see which months have the most missing records. (Note that ```.isna()``` flags missing values such as None and NaN; empty strings and infinities are not counted as NA, although empty fields in a CSV file are read in as NA by default.)
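
To make concrete which values `.isna()` flags, here is a small illustration. It uses pandas, whose API cuDF mirrors, because a pandas object Series can hold mixed Python values in one column; treat it as a demonstration of the pandas semantics rather than a statement about every cuDF dtype.

```python
import pandas as pd

# None and NaN are missing values; the empty string and infinity are not
values = pd.Series([None, float("nan"), "", float("inf"), 1.5], dtype="object")

flags = values.isna()
print(flags.tolist())
print(int(flags.sum()), "values flagged as NA")
```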

In [None]:
# Let's find which columns have NA value(s) during year 2018
NA_sum = GS_cudf[GS_cudf['date'].dt.year==2018].isna().sum()
NA_data = NA_sum[NA_sum>0]
NA_data.index

StringIndex(['dd' 'ff' 'precip' 'hu' 'td' 't' 'psl'], dtype='object')

In [None]:
NA_data

dd         8605703
ff         8598613
precip     1279127
hu         8783452
td         8786154
t          2893694
psl       17621180
dtype: int64

In [None]:
%%time
# Let's extract the month and year from the date column
GS_cudf["month"] = GS_cudf["date"].dt.month
GS_cudf["year"] = GS_cudf["date"].dt.year
GS_cudf.head()

CPU times: user 11 ms, sys: 3.43 ms, total: 14.5 ms
Wall time: 10.3 ms


Unnamed: 0,number_sta,lat,lon,height_sta,date,dd,ff,precip,hu,td,t,psl,month,year
0,14066001,49.33,-0.43,2.0,2016-01-01,210.0,4.4,0.0,91.0,278.45,279.85,,1,2016
1,14126001,49.15,0.04,125.0,2016-01-01,,,0.0,99.0,278.35,278.45,,1,2016
2,14137001,49.18,-0.46,67.0,2016-01-01,220.0,0.6,0.0,92.0,276.45,277.65,102360.0,1,2016
3,14216001,48.93,-0.15,155.0,2016-01-01,220.0,1.9,0.0,95.0,278.25,278.95,,1,2016
4,14296001,48.8,-1.03,339.0,2016-01-01,,,0.0,,,278.35,,1,2016


In [None]:
# Let's only analyze the columns that contain NA values during year 2018
NA_column = cudf.DataFrame(GS_cudf,columns=NA_data.index).isna()
NA_column["month"]=GS_cudf["month"]
NA_column["year"]=GS_cudf["year"]
NA_column = NA_column[NA_column['year']==2018]

In [None]:
NA_column.info()

<class 'cudf.core.dataframe.DataFrame'>
Int64Index: 22034571 entries, 43792266 to 65826836
Data columns (total 9 columns):
 #   Column  Dtype
---  ------  -----
 0   dd      bool
 1   ff      bool
 2   precip  bool
 3   hu      bool
 4   td      bool
 5   t       bool
 6   psl     bool
 7   month   int16
 8   year    int16
dtypes: bool(7), int16(2)
memory usage: 399.3 MB


In [None]:
# We can group the data by month and then calculate the sum of the NA data for each month.
# Note, reset_index() keeps the grouping column (month) as a regular column; otherwise it becomes the index.
NA_column
NA_data_month = NA_column.groupby("month",sort=True).sum().reset_index().drop(columns=['year'])
NA_data_month

Unnamed: 0,month,dd,ff,precip,hu,td,t,psl
0,1,704384,703030,101402,721584,722345,242430,1461872
1,2,636406,636114,92062,650117,651227,224703,1329865
2,3,708710,708008,102324,726449,726476,248646,1474683
3,4,687953,686884,107086,708769,709227,245753,1432213
4,5,721118,720603,107630,742137,742504,258661,1489728
5,6,689394,688701,109634,708939,708951,247578,1430580
6,7,719556,719325,111958,738818,739053,250513,1485952
7,8,722750,722723,107515,749206,749225,251985,1497920
8,9,705047,703453,108238,720628,720653,242374,1444152
9,10,734415,734411,109960,735582,735617,237221,1497323


In [None]:
NA_data_month.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   month   12 non-null     int16
 1   dd      12 non-null     int64
 2   ff      12 non-null     int64
 3   precip  12 non-null     int64
 4   hu      12 non-null     int64
 5   td      12 non-null     int64
 6   t       12 non-null     int64
 7   psl     12 non-null     int64
dtypes: int16(1), int64(7)
memory usage: 696.0 bytes
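
The group-and-sum pattern used above can be tried on a tiny made-up frame. The sketch below is written against pandas so it runs without a GPU; the same `isna`, `groupby`, `sum`, and `reset_index` calls are the ones this notebook applies to the cuDF DataFrame. All values are invented for illustration.

```python
import pandas as pd  # assumption: `import cudf as pd` works the same for these calls

df = pd.DataFrame(
    {
        "month": [1, 1, 2, 2, 2],
        "t": [279.85, None, 278.35, None, None],
        "psl": [None, None, 102360.0, None, 101900.0],
    }
)

na_flags = df[["t", "psl"]].isna()  # True where a measurement is missing
na_flags["month"] = df["month"]     # re-attach the grouping key

# Summing booleans counts the True values, giving NA counts per month
na_per_month = na_flags.groupby("month", sort=True).sum().reset_index()
print(na_per_month)
```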


Finally, [cuXfilter](https://github.com/rapidsai/cuxfilter) is applied to show NA value distributions across the year, taking humidity, temperature, and pressure as examples. 

In [None]:
# First, let's import the modules from cuXfilter we'll need.
import cuxfilter
from cuxfilter import themes, layouts
from cuxfilter.assets.custom_tiles import get_provider, Vendors

In [None]:
# Showing the number of NA items for each column.
NA_data_df = NA_data.to_frame(name="num_NA").reset_index()
NA_data_df

Unnamed: 0,index,num_NA
0,dd,8605703
1,ff,8598613
2,precip,1279127
3,hu,8783452
4,td,8786154
5,t,2893694
6,psl,17621180


In [None]:
# Wrap the cuDF DataFrame in a cuxfilter DataFrame so it can be charted.
cux_na_df = cuxfilter.DataFrame.from_dataframe(NA_data_df)

In [None]:
# Let's make a plot.
chart1 = cuxfilter.charts.bar('index','num_NA',title='Number of NA values vs. categories')
na_d = cux_na_df.dashboard([chart1],layout_array=[[1]], theme=cuxfilter.themes.rapids, data_size_widget=True)
na_d.app()

In [None]:
# More specifically, the NA value count for humidity, dew point, and temperature.
cux_df = cuxfilter.DataFrame.from_dataframe(NA_data_month)
chart2= cuxfilter.charts.bar('month','hu',title='NA value for Humidity')
chart3= cuxfilter.charts.bar('month','td',title='NA value for Dew point')
chart4= cuxfilter.charts.bar('month','t',title='NA value for Temperature')
d = cux_df.dashboard([chart2,chart3,chart4],layout_array=[[1],[2],[3]], theme=cuxfilter.themes.rapids, data_size_widget=True)

In [None]:
d.app()

### Summary:
- There are NA (invalid) values in the monitored records for year 2018.
- Seven meteorological parameters are monitored, all of which have some invalid data.
- Among all the categories, precip (precipitation during the reporting period) and t (temperature) have the lowest numbers of NA occurrences.
- The psl (pressure at sea level) parameter has the largest number of invalid values, at 17621180.
- For the other 4 meteorological parameters, there is no significant difference: their NA counts all fall between 8000000 and 9000000.
- Examining the NA value distribution across the whole year, there is no significant difference to suggest that any one month has much more invalid data than the others.

## Run a correlation of meteorological parameters 
We can apply a correlation analysis to measure how strongly the meteorological parameters are related to one another. Removing correlated items may help us train a regression model, for instance. These DataFrame methods will help us:
- ```.corr()``` computes the correlation matrix of a DataFrame, but only on a numeric matrix containing no NA data.
- ```.dropna()``` drops rows (or columns) that contain NA data, which is especially handy for dataset cleaning tasks.
- ```.drop()``` removes specific columns in the DataFrame.
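
For intuition about what a single entry of the correlation matrix represents, the sketch below computes a Pearson correlation by hand with the standard library. It is an illustration only, not how cuDF implements `.corr()`; the function name `pearson` is hypothetical. Pairs with a missing value are skipped, which for two columns matches dropping the rows where either value is NA.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, skipping pairs with a None."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


# Dew point (td) and temperature (t) from the first rows shown earlier; None marks NA
td = [278.45, 278.35, 276.45, 278.25, None]
t = [279.85, 278.45, 277.65, 278.95, 278.35]
print(round(pearson(td, t), 3))
```

A value close to 1 or -1 indicates a strong linear relationship, which is the signal used below to decide which column to drop.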

In [None]:
%%time
# Let's only analyze meteorological columns
Meteo_series = ['dd', 'ff', 'precip' ,'hu', 'td', 't', 'psl']
Meteo_df = cudf.DataFrame(GS_cudf,columns=Meteo_series)
Meteo_corr = Meteo_df.dropna().corr()

# And let's check the items with correlation value > 0.7 
Meteo_corr[Meteo_corr>0.7]


Unnamed: 0,dd,ff,precip,hu,td,t,psl
dd,1.0,,,,,,
ff,,1.0,,,,,
precip,,,1.0,,,,
hu,,,,1.0,,,
td,,,,,1.0,0.840558357,
t,,,,,0.840558357,1.0,
psl,,,,,,,1.0


In [None]:
%%time
Meteo_df_less = Meteo_df.drop(columns=['td'])
Meteo_df_less.head()


Unnamed: 0,dd,ff,precip,hu,t,psl
0,210.0,4.4,0.0,91.0,279.85,
1,,,0.0,99.0,278.45,
2,220.0,0.6,0.0,92.0,277.65,102360.0
3,220.0,1.9,0.0,95.0,278.95,
4,,,0.0,,278.35,


### Summary:
- The ```.corr()``` method was applied to analyze the relationships among the meteorological parameters.
- There is a strong correlation between td (dew point) and t (temperature).
- Based on that information, either the td or the t column should be removed to improve a linear regression model.
- The ```.drop()``` method was used to remove the td column for downstream tasks.

## Accelerated Computing Performance Check

This section compares the performance of a handful of typical functions used in this notebook between pandas (CPU) and cuDF (GPU), with a minimal code change from pd (pandas) to cudf. You can adapt the code below to measure the performance improvement on your local machine.

Test machine information:
- **GPU**: NVIDIA RTX A6000   
- **CPU**: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz 
- **RAPIDS**: RAPIDS 23.02 with CUDA 11.5

**<p style="text-align: center;">Performance Results based on test results</p>**


|Function| GPU Time (s) | CPU Time (s) | GPU Speedup (x) |
| --- | --- | --- | --- |   
|read|1.977145|69.033193|34.92|
|slice|0.030406|13.349222|439.03|
|na|0.07609|8.246114|108.37|
|dropna|0.242239|9.784584|40.39|
|unique|0.013432|0.445705|33.18|
|dropduplicate|0.23392|0.518868|2.22|
|group_sum|0.6725|7.850392|11.67|


<div align=center><img src="attachment:1560a303-5c7c-49d1-b69c-72ed3f127e89.png" width=500 height=375></div>

In [None]:
## Restart the kernel before running the performance comparison below
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [None]:
!nvidia-smi

Mon Feb 20 02:00:00 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA RTX A6000    Off  | 00000000:65:00.0 Off |                  Off |
| 36%   64C    P8    23W / 300W |      6MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import numpy as np
import pandas as pd
import cudf
import cupy as cp
from timeit import default_timer as timer

In [None]:
# Run the DataFrame speed performance calculations on your machine.
# The compute-intensive functions will be run on both CPU and GPU, followed by
# displaying a performance table. CPU version is using Pandas, "pd".
# GPU version is RAPIDS, "cudf".

# First, warm up GPU for cuDF performance check.
for i in range(10):
    pf_data = cudf.DataFrame(cp.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
    
# Start by reading .csv file     
start_t = timer()
data_cudf = cudf.read_csv('./NW_data.csv')
read_gpu_time = timer() - start_t

start_t = timer()
data_pd = pd.read_csv('./NW_data.csv')
read_cpu_time = timer() - start_t

# Slicing function
start_t = timer()
data_cudf_month_s = data_cudf["date"].str.slice(4,6)
slice_gpu_time = timer() - start_t

start_t = timer()
data_pd_month_s = data_pd["date"].str.slice(4,6)
slice_cpu_time = timer() - start_t

data_cudf_month=data_cudf
data_cudf_month["date"]=data_cudf_month_s

data_pd_month=data_pd
data_pd_month["date"]=data_pd_month_s

# NA data check
start_t = timer()
NA_sum = data_cudf.isna().sum()
NA_data_cudf = NA_sum[NA_sum>0]
na_gpu_time = timer() - start_t

start_t = timer()
NA_sum = data_pd.isna().sum()
NA_data_pd = NA_sum[NA_sum>0]
na_cpu_time = timer() - start_t

# drop na 
start_t = timer()
data_cudf.dropna()
dropna_gpu_time = timer() - start_t

start_t = timer()
data_pd.dropna()
dropna_cpu_time = timer() - start_t

# unique data check
start_t = timer()
number_stations = data_cudf['number_sta'].nunique()
unique_gpu_time = timer() - start_t

start_t = timer()
number_stations = data_pd['number_sta'].nunique()
unique_cpu_time = timer() - start_t

# drop_duplicates
start_t = timer()
unique_stat_info = data_cudf.drop_duplicates(subset=['number_sta'])
dropdu_gpu_time = timer() - start_t

start_t = timer()
unique_stat_info = data_pd.drop_duplicates(subset=['number_sta'])
dropdu_cpu_time = timer() - start_t

# group and sum timer
start_t = timer()
NA_column_cudf = cudf.DataFrame(data_cudf_month,columns=NA_data_cudf.index).isna()
NA_column_cudf["month"]=data_cudf_month["date"]
# group the data by month, and then calculate the sum of the NA data for each month
# reset_index() keeps the grouping column (month) as a regular column; otherwise it becomes the index
NA_data_cudf_month = NA_column_cudf.groupby("month",sort=True).sum().reset_index()
group_sum_gpu_time = timer() - start_t

start_t = timer()
NA_column_pd = pd.DataFrame(data_pd_month,columns=NA_data_pd.index).isna()
NA_column_pd["month"]=data_pd_month["date"]
# group the data by month, and then calculate the sum of the NA data for each month
# reset_index() keeps the grouping column (month) as a regular column; otherwise it becomes the index
NA_data_pd_month = NA_column_pd.groupby("month",sort=True).sum().reset_index()
group_sum_cpu_time = timer() - start_t

In [None]:
# Build the performance table (as another DataFrame, of course!).
performance_df = cudf.DataFrame()
performance_df['function'] = ['read','slice','na','dropna','unique','dropduplicate','group_sum']
performance_df['time_gpu']=[read_gpu_time,slice_gpu_time,na_gpu_time,dropna_gpu_time,unique_gpu_time,dropdu_gpu_time,group_sum_gpu_time]
performance_df['time_cpu']=[read_cpu_time,slice_cpu_time,na_cpu_time,dropna_cpu_time,unique_cpu_time,dropdu_cpu_time,group_sum_cpu_time]
performance_df['speedup']=performance_df['time_cpu']/performance_df['time_gpu']
performance_df
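
Each measurement above is a single run, so the numbers can vary from one execution to the next. One possible refinement, sketched below with the standard library only, is to repeat each call and keep the median. The helper name `time_call` and the repeat count are assumptions for illustration, and the sorted-list workload is only a stand-in for a DataFrame operation.

```python
from statistics import median
from timeit import default_timer as timer


def time_call(func, *args, repeats=5, **kwargs):
    """Run func several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = timer()
        func(*args, **kwargs)
        samples.append(timer() - start)
    return median(samples)


# Stand-in workload; in this notebook it could be, e.g., `lambda: data_cudf.dropna()`
elapsed = time_call(sorted, list(range(100_000, 0, -1)))
print(f"median of 5 runs: {elapsed:.6f} s")
```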

## Conclusion

In this notebook, we applied GPU-accelerated DataFrame computation through the use of RAPIDS, cuDF, and CuPy. We demonstrated data loading methods, data cleaning techniques, an application of cuXfilter for data analysis, and finally how to measure performance against CPU-only computation on the same dataset.

### Citation
- Gwennaëlle Larvor, Léa Berthomier, Vincent Chabot, Brice Le Pape, Bruno Pradel, Lior Perez. MeteoNet, an open reference weather dataset by METEO FRANCE, 2020 [dataset link](https://www.kaggle.com/datasets/katerpillar/meteonet)    
