# Assignment: NY Taxi Fare Prediction with DASK

* Note: see the [NY Taxi Usecase Notebook](https://colab.research.google.com/github/keuperj/DataEngineering22/blob/main/week_8/UseCase_NY_Taxi.ipynb) for a description of a non-parallel solution

* NY Taxi Fare Prediction Task + Data: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview

In [1]:
#install DASK
!pip install distributed "dask[complete]" dask-ml graphviz  --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting distributed
  Downloading distributed-2022.12.0-py3-none-any.whl (925 kB)
Collecting dask-ml
  Downloading dask_ml-2022.5.27-py3-none-any.whl (148 kB)
Collecting graphviz
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
Collecting dask==2022.12.0
  Downloading dask-2022.12.0-py3-none-any.whl (1.1 MB)
Collecting dask-glm>=0.2.0
  Downloading dask_glm-0.2.0-py2.py3-none-any.whl (12 kB)
Collecting bokeh<3,>=2.4.2
  Downloading bokeh-2.4.3-py3-none-any.whl (18.5 MB)
Installing collected packages: dask, distributed, dask-glm, bokeh, graphviz, dask-ml
  Attempting uninstall: dask
 

In [2]:
#in colab, we need to clone the data from the repo
!git clone https://github.com/keuperj/DATA.git
path='DATA'

Cloning into 'DATA'...
remote: Enumerating objects: 101, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 101 (delta 2), reused 14 (delta 2), pack-reused 87
Receiving objects: 100% (101/101), 146.44 MiB | 19.16 MiB/s, done.
Resolving deltas: 100% (23/23), done.
Checking out files: 100% (69/69), done.


In [3]:
# read the training data into a Pandas DataFrame
import pandas as pd
data = pd.read_csv("DATA/NY_taxi_train_small.csv")
y = data['fare_amount']                 # target: the fare to predict
X = data.drop(['fare_amount'], axis=1)  # features: all remaining columns


In [4]:
X.head()

Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2014-03-12 17:04:30.0000001,2014-03-12 17:04:30 UTC,-73.956721,40.767081,-73.98908,40.772745,1
1,2009-04-17 21:59:00.00000044,2009-04-17 21:59:00 UTC,-73.870913,40.773722,-73.996285,40.716018,2
2,2009-10-06 13:42:00.00000015,2009-10-06 13:42:00 UTC,-73.976258,40.75141,-73.984795,40.751305,3
3,2012-05-02 21:38:39.0000004,2012-05-02 21:38:39 UTC,-73.97794,40.752586,-73.976525,40.667005,1
4,2011-04-21 18:11:13.0000001,2011-04-21 18:11:13 UTC,-73.98839,40.723152,-73.984367,40.736301,1


In [5]:
y.head()

0    11.0
1    23.7
2     5.7
3    28.1
4     5.7
Name: fare_amount, dtype: float64

## Assignment:
Use *DASK DataFrames* and *DASK-ML* to
* split the data into train and test sets
* build a pipeline for pre-processing, feature extraction, and prediction

to predict the taxi fares. 
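
As a starting point for the feature extraction step, here is a minimal sketch, not a reference solution. It assumes the column names shown in the `X.head()` output above; `haversine_km` and `add_features` are hypothetical helper names, and the choice of features (great-circle distance, hour, weekday) is only one plausible option.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_features(df):
    """Return a copy of df with distance and time features (hypothetical helper)."""
    out = df.copy()
    out["distance_km"] = haversine_km(
        out["pickup_latitude"], out["pickup_longitude"],
        out["dropoff_latitude"], out["dropoff_longitude"],
    )
    pickup = pd.to_datetime(out["pickup_datetime"],
                            format="%Y-%m-%d %H:%M:%S UTC", utc=True)
    out["hour"] = pickup.dt.hour
    out["weekday"] = pickup.dt.dayofweek  # Monday = 0
    return out


# Two rows copied from the X.head() output above
sample = pd.DataFrame({
    "pickup_datetime": ["2014-03-12 17:04:30 UTC", "2009-04-17 21:59:00 UTC"],
    "pickup_longitude": [-73.956721, -73.870913],
    "pickup_latitude": [40.767081, 40.773722],
    "dropoff_longitude": [-73.98908, -73.996285],
    "dropoff_latitude": [40.772745, 40.716018],
    "passenger_count": [1, 2],
})
features = add_features(sample)
print(features[["distance_km", "hour", "weekday"]])
```

Because the function takes and returns a Pandas DataFrame, it should be usable per partition on a DASK DataFrame `ddf` via `ddf.map_partitions(add_features)`, since each partition is itself a Pandas DataFrame.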

### Use
* local DASK cluster
* [DASK DataFrames](https://examples.dask.org/dataframe.html)
* [DASK-ML](https://ml.dask.org/)

### Hints:
* Start with a very simple but working prediction, then improve the solution with better pre-processing and features
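
One possible reading of "very simple but working" is a constant baseline that always predicts the mean fare. The sketch below assumes RMSE as the error measure and uses only the five fares from the `y.head()` output above, so the number it prints says nothing about the full data set; `rmse` is a hypothetical helper name.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# The five fares shown in the y.head() output above
fares = np.array([11.0, 23.7, 5.7, 28.1, 5.7])

# Simplest possible "model": always predict the mean fare
baseline_prediction = np.full_like(fares, fares.mean())
baseline_rmse = rmse(fares, baseline_prediction)
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

In an actual solution the mean would be computed on the training split and the error measured on the test split. Any later model with better features should beat this number to be worth keeping.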

In [6]:
# Cluster setup
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)

INFO:distributed.http.proxy:To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:distributed.scheduler:State start
INFO:distributed.scheduler:  Scheduler at:     tcp://127.0.0.1:45061
INFO:distributed.scheduler:  dashboard at:            127.0.0.1:8787
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:43573'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:36395'
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:41365', name: 1, status: init, memory: 0, processing: 0>
INFO:distributed.scheduler:Starting worker compute stream, tcp://127.0.0.1:41365
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:55496
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:41025', name: 0, status: init, memory: 0, processing: 0>
INFO:distributed.scheduler:Starting worker compute stream, tcp://127.0.0.1:41025
INFO:distributed.core:Startin

In [7]:
#get client info
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Dashboard: http://127.0.0.1:8787/status
Workers: 2
Total threads: 2
Total memory: 12.68 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:45061
Workers: 2
Dashboard: http://127.0.0.1:8787/status
Total threads: 2
Started: 1 minute ago
Total memory: 12.68 GiB

Comm: tcp://127.0.0.1:41025
Total threads: 1
Dashboard: http://127.0.0.1:41561/status
Memory: 6.34 GiB
Nanny: tcp://127.0.0.1:36395
Local directory: /tmp/dask-worker-space/worker-av3f55hk

Comm: tcp://127.0.0.1:41365
Total threads: 1
Dashboard: http://127.0.0.1:34201/status
Memory: 6.34 GiB
Nanny: tcp://127.0.0.1:43573
Local directory: /tmp/dask-worker-space/worker-idg_3p9f


In [8]:
import dask.dataframe as dd
import dask.array as da

### Export DASK Dashboard to public URL

In [9]:
!npm install -g localtunnel

[K[?25h/tools/node/bin/lt -> /tools/node/lib/node_modules/localtunnel/bin/lt.js
[K[?25h+ localtunnel@2.0.2
added 22 packages from 22 contributors in 2.361s


In [None]:
!lt --port 8787


your url is: https://stale-banks-juggle-34-141-183-85.loca.lt
