# Assignment: NY Taxi Fare Prediction with RAPIDS

* Note: see the [NY Taxi Usecase Notebook](https://colab.research.google.com/github/keuperj/BigData_WS23/blob/main/Block_5/UseCase_NY_Taxi.ipynb) for a description of a non-parallel solution

* NY Taxi Fare Prediction Task + Data: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview

## Setup:
* Activate the GPU runtime and check that a T4 GPU is assigned (RAPIDS needs CUDA compute capability >= 7.0; the T4 has 7.5)

In [1]:
!nvidia-smi

Fri Dec 15 10:45:49 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
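
The GPU check can also be scripted instead of being read off the `nvidia-smi` table. The following is a minimal sketch and not part of the original notebook: it assumes the installed `nvidia-smi` supports the `compute_cap` query field (older drivers may not), and the helper names `parse_gpu_query`, `query_gpus`, and `supported` are made up for illustration. The 7.0 threshold is the one stated in the setup note.

```python
import subprocess

# Threshold from the setup note, as (major, minor).
MIN_COMPUTE_CAPABILITY = (7, 0)


def parse_gpu_query(csv_text):
    """Parse 'name, compute_cap' CSV lines into (name, (major, minor)) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so GPU names containing commas stay intact.
        name, cap = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, tuple(int(part) for part in cap.split("."))))
    return gpus


def query_gpus():
    """Ask nvidia-smi for names and compute capabilities; return [] if that fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        return parse_gpu_query(out)
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        # No driver, no GPU, or a driver that does not know 'compute_cap'.
        return []


def supported(gpus, minimum=MIN_COMPUTE_CAPABILITY):
    """Return the names of GPUs whose compute capability meets the minimum."""
    return [name for name, cap in gpus if cap >= minimum]


usable = supported(query_gpus())
print("Usable GPUs:", usable if usable else "none found")
```

On a Colab runtime without a GPU attached, the sketch should report that none was found, which is the cue to switch the runtime type before installing RAPIDS.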
                                                                    

In [2]:
# The RAPIDS install must match the current CUDA version - use https://docs.rapids.ai/install to generate a suitable install command
!pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu12==23.12.* dask-cudf-cu12==23.12.* cuml-cu12==23.12.* \
    cugraph-cu12==23.12.* cuspatial-cu12==23.12.* cuproj-cu12==23.12.* \
    cuxfilter-cu12==23.12.* cucim-cu12==23.12.* pylibraft-cu12==23.12.* \
    raft-dask-cu12==23.12.*

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu12==23.12.*
  Downloading https://pypi.nvidia.com/cudf-cu12/cudf_cu12-23.12.1-cp310-cp310-manylinux_2_28_x86_64.whl (511.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 511.6/511.6 MB 3.2 MB/s eta 0:00:00
Collecting dask-cudf-cu12==23.12.*
  Downloading https://pypi.nvidia.com/dask-cudf-cu12/dask_cudf_cu12-23.12.0-py3-none-any.whl (82 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82.9/82.9 kB 10.4 MB/s eta 0:00:00
Collecting cuml-cu12==23.12.*
  Downloading https://pypi.nvidia.com/cuml-cu12/cuml_cu12-23.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (955.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 955.2/955.2 MB 1.6 MB/s eta 0:00:00
Collecting cugraph-cu12==23.12.*
  Downloading https://pypi.nvidia.com/cugraph-cu12/cugraph_cu12-23.1
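
The `-cu12` suffix in the install command corresponds to the CUDA major version shown in the `nvidia-smi` header (12.2 in this run). As a hedged illustration of that matching step, the sketch below extracts the major version from the header line and builds requirement strings in the same `<package>-cu<major>==<release>.*` pattern used above. The function names are hypothetical, and the pattern is only inferred from this notebook's install command; the selector at https://docs.rapids.ai/install remains the authority on which combinations exist. Note also that `nvidia-smi` reports the highest CUDA version the driver supports, which is not necessarily the installed toolkit version.

```python
import re


def cuda_major_from_smi(smi_text):
    """Return the CUDA major version found in nvidia-smi header text, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return int(match.group(1)) if match else None


def rapids_requirements(packages, cuda_major, release):
    """Build pip requirement strings such as 'cudf-cu12==23.12.*'."""
    return [f"{pkg}-cu{cuda_major}=={release}.*" for pkg in packages]


# Header line copied from the nvidia-smi output in this notebook.
header = "| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |"

major = cuda_major_from_smi(header)
reqs = rapids_requirements(["cudf", "cuml"], major, "23.12")
print(" ".join(reqs))  # cudf-cu12==23.12.* cuml-cu12==23.12.*
```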

In [5]:
import cudf
import cuml


In [6]:
# in Colab, we need to clone the data from the repo
!git clone https://github.com/keuperj/DATA.git


Cloning into 'DATA'...
remote: Enumerating objects: 126, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 126 (delta 11), reused 39 (delta 11), pack-reused 87
Receiving objects: 100% (126/126), 185.56 MiB | 22.44 MiB/s, done.
Resolving deltas: 100% (32/32), done.
Updating files: 100% (86/86), done.


In [8]:
# read data and labels into cuDF (GPU DataFrame)
data = cudf.read_csv("DATA/NY_taxi_train_small.csv")
y = data['fare_amount']                 # regression target: the fare
X = data.drop(['fare_amount'], axis=1)  # all remaining columns are features

In [9]:
X.head()

| Unnamed: 0 | key | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|
| 0 | 2014-03-12 17:04:30.0000001 | 2014-03-12 17:04:30 UTC | -73.956721 | 40.767081 | -73.98908 | 40.772745 | 1 |
| 1 | 2009-04-17 21:59:00.00000044 | 2009-04-17 21:59:00 UTC | -73.870913 | 40.773722 | -73.996285 | 40.716018 | 2 |
| 2 | 2009-10-06 13:42:00.00000015 | 2009-10-06 13:42:00 UTC | -73.976258 | 40.75141 | -73.984795 | 40.751305 | 3 |
| 3 | 2012-05-02 21:38:39.0000004 | 2012-05-02 21:38:39 UTC | -73.97794 | 40.752586 | -73.976525 | 40.667005 | 1 |
| 4 | 2011-04-21 18:11:13.0000001 | 2011-04-21 18:11:13 UTC | -73.98839 | 40.723152 | -73.984367 | 40.736301 | 1 |
