# Advanced data analysis 23/24 - cuDF GPU-based examples

Examples of code using cuDF (a Pandas-like framework that uses GPUs), cuML (a machine-learning library that uses GPUs), and cuGraph (a graph library that uses GPUs).

**IMPORTANT:** To use GPUs in Colab, you need to go to the menu ```Edit > Notebook settings``` and select GPU as the hardware accelerator.

In [3]:
import sys
import os

IN_COLAB = 'google.colab' in sys.modules

## Setup Data

The first time you run the notebook, before running this cell, open the following link [https://drive.google.com/drive/folders/1G6YAxMT9dciRZWemnRzsCaEPeVfSBHZo?usp=sharing](https://drive.google.com/drive/folders/1G6YAxMT9dciRZWemnRzsCaEPeVfSBHZo?usp=sharing) and select "Add shortcut to Drive". This adds a shortcut to the datasets to your Google Drive.

The following cell mounts your Google Drive into the Colab environment, so that the datasets can be accessed as local files.


In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [4]:
# Select the dataset you want to use by uncommenting the appropriate line and running this cell

#Default dataset 2M lines
FILENAME="/content/drive/MyDrive/sbd2223ada/sample.csv.gz"
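Every later cell reads `FILENAME`, so it can help to fail early when the shortcut was not added or the drive was not mounted. The following is a minimal sketch using only the standard library; the helper name `check_dataset` is hypothetical and not part of the original notebook.

```python
import os

def check_dataset(path: str) -> int:
    """Return the file size in bytes, or raise a clear error if the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found: mount Google Drive and add the dataset shortcut first"
        )
    return os.path.getsize(path)

# Path copied from the cell above
FILENAME = "/content/drive/MyDrive/sbd2223ada/sample.csv.gz"

try:
    print(f"Dataset size: {check_dataset(FILENAME) / 1e6:.1f} MB")
except FileNotFoundError as err:
    # Outside Colab, or before mounting, the message explains what is missing
    print(err)
```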


## Install software

Start by checking that we have access to the GPU.

In [5]:
!nvidia-smi
!nvcc -V

Wed Nov 22 22:47:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8    12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
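The same check can be done from Python, which is convenient if a cell should fall back to Pandas when no GPU is attached. This is a minimal sketch that assumes `nvidia-smi` is on the `PATH` whenever a driver is installed; the helper name `gpu_available` is hypothetical.

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Return True if nvidia-smi is present and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: Tesla T4 (...)"
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

print("GPU available:", gpu_available())
```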

And now let's install the needed software.

In [6]:
!pip uninstall --yes protobuf tensorflow tensorboard; pip install cupy-cuda11x

!pip install cudf-cu11 dask-cudf-cu11 cuml-cu11 cugraph-cu11 --extra-index-url https://pypi.nvidia.com


Found existing installation: protobuf 3.20.3
Uninstalling protobuf-3.20.3:
  Successfully uninstalled protobuf-3.20.3
Found existing installation: tensorflow 2.14.0
Uninstalling tensorflow-2.14.0:
  Successfully uninstalled tensorflow-2.14.0
Found existing installation: tensorboard 2.14.1
Uninstalling tensorboard-2.14.1:
  Successfully uninstalled tensorboard-2.14.1
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu11
  Downloading https://pypi.nvidia.com/cudf-cu11/cudf_cu11-23.10.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (502.6 MB)
Collecting dask-cudf-cu11
  Downloading https://pypi.nvidia.com/dask-cudf-cu11/dask_cudf_cu11-23.10.2-py3-none-any.whl (81 kB)
Collecting cuml-cu11
  

## Initialization

Import the libraries used in the examples below.

In [7]:
import os
import time
import pandas as pd
import numpy as np
import cudf
import cuml.cluster
import sklearn.cluster
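Because the install step pulls whatever versions are current, it can be useful to record which ones were actually installed. The sketch below uses the standard-library `importlib.metadata`; the distribution names are those from the `pip install` command above, and the helper `installed_version` is hypothetical.

```python
from importlib import metadata

def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

for name in ["cudf-cu11", "dask-cudf-cu11", "cuml-cu11", "cugraph-cu11", "pandas"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```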


# Example 1: simple statistics (cuDF)

This example computes, for each license, the number of trips performed.

The same computation is implemented with cuDF and with Pandas, and each version prints its runtime so the two can be compared. The measured time includes reading the CSV file.
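The cells in this notebook measure time with a pair of `time.time()` calls. The same pattern can be factored into a small context manager, sketched here with `time.perf_counter()` (a monotonic clock intended for measuring intervals). The name `timed` is hypothetical and the workload is a stand-in for a real query.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {"seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: runtime = {result['seconds']:.3f} s")

# Stand-in workload; in the notebook this would be the read_csv + groupby code
with timed("sum of squares") as t:
    total = sum(i * i for i in range(100_000))
```

The measured value stays available in `t["seconds"]` after the block, which makes it easy to collect the cuDF and Pandas timings side by side.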


### Code: CuDF

In [8]:
start_time = time.time()
mySchema = ["medallion", "hack_license", "pickup_datetime",
            "dropoff_datetime", "trip_time_in_secs", "trip_distance",
            "pickup_longitude", "pickup_latitude", "dropoff_longitude",
            "dropoff_latitude", "payment_type", "fare_amount",
            "surcharge", "mta_tax", "tip_amount",
            "tolls_amount", "total_amount"]

dataset = cudf.read_csv(FILENAME,names=mySchema,compression="gzip")
result = dataset.groupby("hack_license").count()
print(result)

end_time = time.time()

print( "Runtime = " + str(end_time - start_time))


                                  medallion  pickup_datetime  \
hack_license                                                   
2D6E98895D9E4FF08519FB2433988F1A        122              122   
27C10878193D5607F220BF78B2C893A3          1                1   
571ECD7243F7A43B88F652DD7610EB8C        288              288   
0E6AED19C1605D0397146A1AD3798A17         34               34   
7895E8BDFDB5AE3C913D9AFEF756B75E          1                1   
...                                     ...              ...   
C73BE7726024DDC4F89031A5BB76832E          1                1   
549755BB2C577826AC9DBEA6D28182AB        148              148   
4F2FE8F7A29AFEF2FE141FF902170BF7        163              163   
0DECA0712DEFFBA24A0A347D0E6C1A65        184              184   
C85719E27AE39040C2B1538E2A15AE85        248              248   

                                  dropoff_datetime  trip_time_in_secs  \
hack_license                                                            
2D6E98895D9E4FF08519F

### Code: Pandas library

In [9]:
start_time = time.time()
mySchema = ["medallion", "hack_license", "pickup_datetime",
            "dropoff_datetime", "trip_time_in_secs", "trip_distance",
            "pickup_longitude", "pickup_latitude", "dropoff_longitude",
            "dropoff_latitude", "payment_type", "fare_amount",
            "surcharge", "mta_tax", "tip_amount",
            "tolls_amount", "total_amount"]

dataset = pd.read_csv(FILENAME,names=mySchema,compression="gzip")
result = dataset.groupby("hack_license").count()
print(result)

end_time = time.time()

print( "Runtime = " + str(end_time - start_time))


                                  medallion  pickup_datetime  \
hack_license                                                   
0008B3E338CE8C3377E071A4D80D3694        129              129   
000B8D660A329BBDBF888500E4BD8B98          2                2   
000CCA239BFDC0ABE2895AC9086C4290         11               11   
00184958F5D5FD0A9EC0B115C5B55796         62               62   
001C8AAB90AEE49F36FCAA7B4136C81A        178              178   
...                                     ...              ...   
FFF5AD65C673251C1F275CF5B43EC414          2                2   
FFF6401CC16911710E7590FE197E986A         33               33   
FFF657CFEC6A06384C97ACB500916913         68               68   
FFF909B1353148850AD3E40BB878618B        124              124   
FFFBCEA3D4E21E05902EE67AD556F67C        177              177   

                                  dropoff_datetime  trip_time_in_secs  \
hack_license                                                            
0008B3E338CE8C3377E07
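`groupby(...).count()` returns one count of non-null values per remaining column, which is why the printed result is so wide. When only the number of trips per license is needed, `size()` produces a single value per group. The sketch below uses Pandas on a few made-up rows; cuDF groupby objects offer `size()` as well, but that is not exercised here.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
trips = pd.DataFrame({
    "medallion": ["m1", "m2", "m1", "m3"],
    "hack_license": ["A", "A", "B", "A"],
    "trip_distance": [1.2, 7.5, 3.3, None],
})

# One count per column; missing values are not counted
per_column = trips.groupby("hack_license").count()

# One number per license: the number of rows (trips) in each group
trips_per_license = (
    trips.groupby("hack_license").size().sort_values(ascending=False)
)

print(per_column)
print(trips_per_license)
```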

# Example 2 (cuDF)

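This example computes a slightly more involved statistic: the most frequent routes among trips whose distance is greater than 6.
A route is a pair of pickup and drop-off cells, where each location is snapped to the centre of a 250-meter grid square.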
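To make the grid rounding concrete, here is a minimal plain-Python sketch of snapping a single coordinate to the centre of its cell. The constants are copied from the cells of this example; the function `snap_to_cell_center` and the sample point are hypothetical. It uses `westLongitude` as the longitude origin, which describes the same grid as the notebook's `lonRound` because `eastLongitude` lies exactly 600 steps away.

```python
import math

# Grid constants copied from the cells in this example
latitudeStep = 0.004491556 / 2
longitudeStep = 0.005986 / 2
northLatitude = 41.474937 - 0.5 * latitudeStep
westLongitude = -74.913585 - 0.5 * longitudeStep

def snap_to_cell_center(value, origin, step):
    """Map a coordinate to the centre of the grid cell that contains it."""
    cell_index = math.floor((value - origin) / step)
    return origin + cell_index * step + step / 2

# Made-up point; both coordinates move by at most half a step
lon = snap_to_cell_center(-73.8640, westLongitude, longitudeStep)
lat = snap_to_cell_center(40.7700, northLatitude, latitudeStep)
print(round(lon, 6), round(lat, 6))
```

Grouping on the snapped coordinates then treats all trips that start in the same square and end in the same square as one route.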


#### CuDF

In [15]:
start_time = time.time()
mySchema = ["medallion", "hack_license", "pickup_datetime",
            "dropoff_datetime", "trip_time_in_secs", "trip_distance",
            "pickup_longitude", "pickup_latitude", "dropoff_longitude",
            "dropoff_latitude", "payment_type", "fare_amount",
            "surcharge", "mta_tax", "tip_amount",
            "tolls_amount", "total_amount"]

# Squares of 250 meters
latitudeStep = 0.004491556 / 2
longitudeStep = 0.005986 / 2
northLatitude = 41.474937 - 0.5 * latitudeStep
southLatitude = northLatitude - 600 * latitudeStep
westLongitude = -74.913585 - 0.5 * longitudeStep
eastLongitude = westLongitude + 600 * longitudeStep

# function to round longitude to a point in the middle of the square
def lonRound( val):
    return ((val - eastLongitude) // longitudeStep) * longitudeStep + eastLongitude + longitudeStep / 2

# function to round latitude to a point in the middle of the square
def latRound( l):
    return northLatitude - ((northLatitude - l) // latitudeStep) * latitudeStep - latitudeStep / 2

dataset = cudf.read_csv(FILENAME,names=mySchema,compression="gzip")
filtered = dataset[(dataset.pickup_longitude >= westLongitude) &
                   (dataset.pickup_longitude <= eastLongitude) &
                   (dataset.dropoff_longitude >=  westLongitude) &
                   (dataset.dropoff_longitude <= eastLongitude) &
                   (dataset.pickup_latitude <= northLatitude) &
                   (dataset.pickup_latitude >= southLatitude) &
                   (dataset.dropoff_latitude <= northLatitude) &
                   (dataset.dropoff_latitude >= southLatitude) &
                   (dataset.trip_distance > 6)]
filtered["pickup_lon"]=lonRound(filtered.pickup_longitude)
filtered["pickup_lat"]=latRound(filtered.pickup_latitude)
filtered["dropoff_lon"]=lonRound(filtered.dropoff_longitude)
filtered["dropoff_lat"]=latRound(filtered.dropoff_latitude)
filtered = filtered[["trip_distance","pickup_lon","pickup_lat","dropoff_lon","dropoff_lat"]]
result = filtered.groupby(["pickup_lon","pickup_lat","dropoff_lon","dropoff_lat"]).count().sort_values("trip_distance",ascending=False).head(20)
#result = filtered.groupby(["pickup_lon","pickup_lat","dropoff_lon","dropoff_lat"]).count().head(20)
print(result)
end_time = time.time()

print( "Runtime = " + str(end_time - start_time))


                                               trip_distance
pickup_lon pickup_lat dropoff_lon dropoff_lat               
-73.863042 40.769763  -73.985755  40.758534              198
-73.872021 40.774254  -73.985755  40.758534              187
                      -73.973783  40.756288              179
-73.985755 40.758534  -73.872021  40.774254              157
-73.973783 40.756288  -73.872021  40.774254              156
-73.979769 40.763025  -73.872021  40.774254              154
-73.863042 40.769763  -73.973783  40.756288              149
-73.872021 40.774254  -73.979769  40.763025              137
-73.863042 40.769763  -73.979769  40.763025              136
-73.985755 40.760780  -73.872021  40.774254              134
-73.872021 40.774254  -73.970790  40.756288              132
-73.863042 40.769763  -73.970790  40.756288              122
-73.872021 40.774254  -73.976776  40.751796              119
-73.982762 40.763025  -73.872021  40.774254              111
-73.872021 40.774254  -7

#### Pandas

In [16]:
start_time = time.time()
mySchema = ["medallion", "hack_license", "pickup_datetime",
            "dropoff_datetime", "trip_time_in_secs", "trip_distance",
            "pickup_longitude", "pickup_latitude", "dropoff_longitude",
            "dropoff_latitude", "payment_type", "fare_amount",
            "surcharge", "mta_tax", "tip_amount",
            "tolls_amount", "total_amount"]

# Squares of 250 meters
latitudeStep = 0.004491556 / 2
longitudeStep = 0.005986 / 2
northLatitude = 41.474937 - 0.5 * latitudeStep
southLatitude = northLatitude - 600 * latitudeStep
westLongitude = -74.913585 - 0.5 * longitudeStep
eastLongitude = westLongitude + 600 * longitudeStep

# function to round longitude to a point in the middle of the square
def lonRound( val):
    return ((val - eastLongitude) // longitudeStep) * longitudeStep + eastLongitude + longitudeStep / 2

# function to round latitude to a point in the middle of the square
def latRound( l):
    return northLatitude - ((northLatitude - l) // latitudeStep) * latitudeStep - latitudeStep / 2

dataset = pd.read_csv(FILENAME,names=mySchema,compression="gzip")
filtered = dataset[(dataset.pickup_longitude >= westLongitude) &
                   (dataset.pickup_longitude <= eastLongitude) &
                   (dataset.dropoff_longitude >=  westLongitude) &
                   (dataset.dropoff_longitude <= eastLongitude) &
                   (dataset.pickup_latitude <= northLatitude) &
                   (dataset.pickup_latitude >= southLatitude) &
                   (dataset.dropoff_latitude <= northLatitude) &
                   (dataset.dropoff_latitude >= southLatitude) &
                   (dataset.trip_distance > 6)]
filtered["pickup_lon"]=lonRound(filtered.pickup_longitude)
filtered["pickup_lat"]=latRound(filtered.pickup_latitude)
filtered["dropoff_lon"]=lonRound(filtered.dropoff_longitude)
filtered["dropoff_lat"]=latRound(filtered.dropoff_latitude)
filtered = filtered[["medallion","pickup_lon","pickup_lat","dropoff_lon","dropoff_lat"]]
result = filtered.groupby(["pickup_lon","pickup_lat","dropoff_lon","dropoff_lat"]).count().sort_values("medallion",ascending=False).head(20)
#result = filtered.groupby(["pickup_lon","pickup_lat","dropoff_lon","dropoff_lat"]).count().head(20)
print(result)
end_time = time.time()

print( "Runtime = " + str(end_time - start_time))

                                               medallion
pickup_lon pickup_lat dropoff_lon dropoff_lat           
-73.863042 40.769763  -73.985755  40.758534          198
-73.872021 40.774254  -73.985755  40.758534          187
                      -73.973783  40.756288          179
-73.985755 40.758534  -73.872021  40.774254          157
-73.973783 40.756288  -73.872021  40.774254          156
-73.979769 40.763025  -73.872021  40.774254          154
-73.863042 40.769763  -73.973783  40.756288          149
-73.872021 40.774254  -73.979769  40.763025          137
-73.863042 40.769763  -73.979769  40.763025          136
-73.985755 40.760780  -73.872021  40.774254          134
-73.872021 40.774254  -73.970790  40.756288          132
-73.863042 40.769763  -73.970790  40.756288          122
-73.872021 40.774254  -73.976776  40.751796          119
-73.982762 40.763025  -73.872021  40.774254          111
-73.872021 40.774254  -73.982762  40.756288          110
                               

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered["pickup_lon"]=lonRound(filtered.pickup_longitude)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered["pickup_lat"]=latRound(filtered.pickup_latitude)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered["dropoff_lon"]=lonRound(filtered.dropoff_longitude)
A value is trying to be set
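The warning above is raised when new columns are assigned to `filtered`, because Pandas cannot tell whether `filtered` is an independent DataFrame or a view of `dataset`. One common way to avoid it is to take an explicit copy right after filtering. The sketch below shows the idea on a few made-up rows; the column values and the threshold are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the taxi dataset
dataset = pd.DataFrame({
    "pickup_longitude": [-73.99, -73.86, -75.50],
    "trip_distance": [7.2, 9.1, 8.0],
})

# .copy() makes `filtered` an independent DataFrame,
# so adding a column to it is unambiguous
filtered = dataset[dataset.pickup_longitude >= -74.9].copy()
filtered["pickup_lon"] = filtered.pickup_longitude.round(1)

print(filtered)
```

In the cell above, the equivalent change would be to append `.copy()` to the filter expression before the four column assignments.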

# Example 3 (cuML)

This example uses KMeans to cluster pickup locations.

#### cuDF and cuML

In [14]:
start_time = time.time()
mySchema = ["medallion", "hack_license", "pickup_datetime",
            "dropoff_datetime", "trip_time_in_secs", "trip_distance",
            "pickup_longitude", "pickup_latitude", "dropoff_longitude",
            "dropoff_latitude", "payment_type", "fare_amount",
            "surcharge", "mta_tax", "tip_amount",
            "tolls_amount", "total_amount"]

northLatitude = 40.86
southLatitude = 40.68
westLongitude = -74.03
eastLongitude = -73.92

dataset = cudf.read_csv(FILENAME,names=mySchema,compression="gzip")
filtered = dataset[(dataset.pickup_longitude <= eastLongitude) &
                   (dataset.pickup_longitude >=  westLongitude) &
                   (dataset.pickup_latitude <= northLatitude) &
                   (dataset.pickup_latitude >= southLatitude)]
filtered = filtered[["pickup_longitude","pickup_latitude"]].copy()

model = cuml.cluster.KMeans(n_clusters=100, max_iter=30, n_init=1,init='k-means||')
model.fit(filtered)
#using the score included in this class
score = model.score(filtered)
print("Score = " + str(score))

end_time = time.time()

print( "Runtime = " + str(end_time - start_time))


Score = -0.08144558534149837
Runtime = 5.946503162384033


#### Pandas and scikit-learn

In [13]:
import sklearn.cluster

start_time = time.time()
mySchema = ["medallion", "hack_license", "pickup_datetime",
            "dropoff_datetime", "trip_time_in_secs", "trip_distance",
            "pickup_longitude", "pickup_latitude", "dropoff_longitude",
            "dropoff_latitude", "payment_type", "fare_amount",
            "surcharge", "mta_tax", "tip_amount",
            "tolls_amount", "total_amount"]

northLatitude = 40.86
southLatitude = 40.68
westLongitude = -74.03
eastLongitude = -73.92

dataset = pd.read_csv(FILENAME,names=mySchema,compression="gzip")
filtered = dataset[(dataset.pickup_longitude <= eastLongitude) &
                   (dataset.pickup_longitude >=  westLongitude) &
                   (dataset.pickup_latitude <= northLatitude) &
                   (dataset.pickup_latitude >= southLatitude)]
filtered = filtered[["pickup_longitude","pickup_latitude"]].copy()

model = sklearn.cluster.KMeans(n_clusters=100, max_iter=30, n_init=1,init='k-means++')
model.fit(filtered)
score = model.score(filtered)
print("Score = " + str(score))

end_time = time.time()

print( "Runtime = " + str(end_time - start_time))

Score = -12.847048618329463
Runtime = 41.048604249954224
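To help interpret the printed score: scikit-learn documents `KMeans.score` as the opposite of the K-means objective, that is, the negated sum of squared distances from each sample to its nearest centroid, so values closer to zero mean tighter clusters. The sketch below computes that quantity by hand with NumPy on made-up points; the function name `kmeans_score` is hypothetical, and it does not attempt to explain the gap between the two scores printed above.

```python
import numpy as np

def kmeans_score(points: np.ndarray, centroids: np.ndarray) -> float:
    """Negated sum of squared distances from each point to its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids)
    diff = points[:, None, :] - centroids[None, :, :]
    squared = (diff ** 2).sum(axis=2)
    # Keep only the distance to the closest centroid for every point
    return float(-squared.min(axis=1).sum())

# Made-up data: two points near the first centroid, one exactly on the second
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.0]])

print(kmeans_score(points, centroids))
```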
