# High Performance Jupyter

This tutorial is based on: https://github.com/rikturr/high-performance-jupyter

## Scale out with Dask
<img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="400" />


Parallel computation of a machine learning example with Dask.

Resources: 2 nodes on bwUniCluster, each with 40 cores and 90 GB RAM.


In [1]:
import os
from dask.distributed import Client
client = Client(scheduler_file=os.path.expanduser('~/dask-scheduler.json'))
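The scheduler file is how the notebook finds the cluster: the scheduler writes its connection details to a JSON file, and `Client(scheduler_file=...)` reads the address from it. The following is a minimal standard-library sketch of that lookup, not Dask's own code. The helper `read_scheduler_address` and the file contents are hypothetical, and a real scheduler file holds more fields than the single `address` key assumed here.

```python
import json
import os
import tempfile


def read_scheduler_address(path: str) -> str:
    """Return the address stored in a JSON scheduler file (hypothetical helper)."""
    with open(os.path.expanduser(path)) as f:
        info = json.load(f)
    return info["address"]


# Write a stand-in scheduler file. The address is the scheduler's "Comm"
# value from this notebook's client summary; the file layout is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dask-scheduler.json")
    with open(path, "w") as f:
        json.dump({"address": "tcp://172.26.20.6:46419"}, f)
    address = read_scheduler_address(path)

print(address)
```

Because only a file path is exchanged, the notebook does not need to know in advance which node the scheduler was started on.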

In [3]:
client

Connection method: Scheduler file
Scheduler file: /home/es/es_es/es_pkoester/dask-scheduler.json
Dashboard: http://172.26.20.6:8787/status

Comm: tcp://172.26.20.6:46419
Workers: 7
Dashboard: http://172.26.20.6:8787/status
Total threads: 70
Started: 2 minutes ago
Total memory: 76.90 GiB

| Worker comm | Dashboard | Total threads | Memory | Local directory | CPU usage | Memory usage | Read bytes | Write bytes |
|---|---|---|---|---|---|---|---|---|
| tcp://172.26.20.6:40571 | http://172.26.20.6:45059/status | 10 | 10.99 GiB | /tmp/dask-worker-space/worker-hcomzonp | 2.0% | 95.81 MiB | 12.38 kiB | 32.83 kiB |
| tcp://172.26.20.6:35881 | http://172.26.20.6:33899/status | 10 | 10.99 GiB | /tmp/dask-worker-space/worker-kyj886r6 | 4.0% | 95.60 MiB | 13.33 kiB | 43.51 kiB |
| tcp://172.26.20.6:36191 | http://172.26.20.6:39703/status | 10 | 10.99 GiB | /tmp/dask-worker-space/worker-xf094031 | 4.0% | 96.01 MiB | 14.39 kiB | 46.53 kiB |
| tcp://172.26.20.7:42777 | http://172.26.20.7:42817/status | 10 | 10.99 GiB | /tmp/dask-worker-space/worker-bfb5obwz | 2.0% | 96.03 MiB | 3.00 kiB | 3.65 kiB |
| tcp://172.26.20.7:44097 | http://172.26.20.7:42761/status | 10 | 10.99 GiB | /tmp/dask-worker-space/worker-8v4npxsm | 2.0% | 95.94 MiB | 3.00 kiB | 3.65 kiB |
| tcp://172.26.20.7:45773 | http://172.26.20.7:35293/status | 10 | 10.99 GiB | /tmp/dask-worker-space/worker-_84vyekf | 4.0% | 95.96 MiB | 2.99 kiB | 3.65 kiB |
| tcp://172.26.20.7:36757 | http://172.26.20.7:33209/status | 10 | 10.99 GiB | /tmp/dask-worker-space/worker-1xi5rlav | 2.0% | 96.09 MiB | 2.99 kiB | 5.24 kiB |

Every worker reports Nanny: None, Tasks executing: 0, Tasks in memory: 0, Tasks ready: 0, Tasks in flight: 0, Spilled bytes: 0 B, and Last seen: Just now.

The scheduler might be ready before all the workers are. We'll wait until all the workers are up.

In [4]:
client.wait_for_workers(7)
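Conceptually, waiting for workers amounts to polling the cluster until the worker count reaches the requested number. The sketch below shows that pattern with the standard library only. `wait_until`, `all_workers_up`, and the sequence of observed counts are hypothetical stand-ins, not how Dask implements `wait_for_workers`.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.01) -> None:
    """Poll `predicate` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError("workers did not arrive in time")
        time.sleep(interval)


# Hypothetical stand-in for a cluster whose workers register over time:
# each poll observes the next worker count in this sequence.
observed_counts = iter([0, 2, 5, 7])
seen = []


def all_workers_up() -> bool:
    count = next(observed_counts, 7)
    seen.append(count)
    return count >= 7


wait_until(all_workers_up)
print(seen)
```

A timeout is worth having in batch environments, where a worker job may stay queued and never connect.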

In [5]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import datetime
import s3fs

seed = 42

# Load and explore data

Load the data for all of 2019. Note that the workers of a Dask cluster can run on different machines, which do not necessarily share a filesystem, so every worker must be able to reach the data on its own. This is not a problem here because the data is read directly from S3.

In [6]:
fs = s3fs.S3FileSystem(anon=True)
files_2019 = fs.glob('s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv')
file_sizes_2019 = [fs.du(f) for f in files_2019] 

print(f'2019 avg size (MB): {np.round(np.mean(file_sizes_2019) / 1e6)}')
print(f'2019 total size (GB): {np.round(np.sum(file_sizes_2019) / 1e9)}')

2019 avg size (MB): 650.0
2019 total size (GB): 8.0


In [7]:
%%time

taxi = pd.read_csv(
        fs.open('s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv'),
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
)

print(f"Row count: {len(taxi)}")  # number of rows in the January 2019 file
print(f"Size in GB: {taxi.memory_usage(deep=True).sum() / 1e9}")

Row count: 7667792
Size in GB: 1.487551776
CPU times: user 25.2 s, sys: 4.48 s, total: 29.7 s
Wall time: 1min 21s


In [8]:
%%time

taxi = dd.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    assume_missing=True,  # read all integer columns as floats so that missing values are allowed
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],  # interpret these columns as dates
    storage_options={'anon': True},  # S3: no authentication is needed for this bucket
)

CPU times: user 139 ms, sys: 56.7 ms, total: 195 ms
Wall time: 6.26 s
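The short wall time of the `dd.read_csv` call reflects lazy evaluation: Dask reads only a small sample to infer column types and records what to load. The files are read in full only when a result such as `len(taxi)` is requested. The following standard-library sketch illustrates the idea of deferring work; it is not Dask's implementation, and `load_partition` is a hypothetical stand-in for reading one chunk of a file.

```python
from functools import partial

loaded = []  # records which partitions have actually been read


def load_partition(index: int) -> list:
    """Pretend to read one chunk of a CSV file (hypothetical stand-in)."""
    loaded.append(index)
    return [index * 3 + offset for offset in range(3)]


# Building the plan is cheap: the tasks are only described, not executed.
plan = [partial(load_partition, i) for i in range(4)]
loaded_before = list(loaded)

# Asking for a result runs every deferred task.
row_count = sum(len(task()) for task in plan)

print(loaded_before, loaded, row_count)
```

This is why the expensive step shows up later in the notebook, in the cells that call `len()` or `.compute()`.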


In [9]:
print(taxi)

Dask DataFrame Structure:
                VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount    extra  mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
npartitions=127                                                                                                                                                                                                                                                                     
                 float64       datetime64[ns]        datetime64[ns]         float64       float64    float64             object      float64      float64      float64     float64  float64  float64    float64      float64               float64      float64              float64
                     ...                  ...                   ...             ...           ...        ...                ...          ...   

In [9]:
%%time
print(f"Row count: {len(taxi)}")  # number of rows across all 2019 files

Row count: 84399019
CPU times: user 104 ms, sys: 52.1 ms, total: 156 ms
Wall time: 1min 7s


In [13]:
%%time
print(f"Size in GB: {taxi.memory_usage(deep=True).sum().compute() / 1e9}")
# memory_usage: Return the memory usage of each column in bytes.
# If deep=True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.
# sum: Return the sum of the values over the requested axis.
# compute: This function will block until the computation is finished

Size in GB: 16.367014316
CPU times: user 79.6 ms, sys: 11.7 ms, total: 91.2 ms
Wall time: 1min
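The reported in-memory size can be put in relation to the cluster's total memory from the client summary. The short calculation below uses only numbers printed in this notebook. The conversion from decimal GB to GiB is the one assumption added, since the client reports memory in GiB.

```python
GIB = 2**30  # bytes per GiB

dataset_bytes = 16.367014316e9  # "Size in GB" reported by memory_usage, in bytes
row_count = 84_399_019          # row count of the full 2019 dataframe
cluster_memory_gib = 76.90      # "Total memory" from the client summary

dataset_gib = dataset_bytes / GIB
bytes_per_row = dataset_bytes / row_count
fraction_of_cluster = dataset_gib / cluster_memory_gib

print(f"{dataset_gib:.2f} GiB, {bytes_per_row:.0f} bytes per row, "
      f"{fraction_of_cluster:.0%} of cluster memory")
```

The raw data occupies roughly a fifth of the cluster's memory, so keeping a dataframe of this size in memory is feasible on this allocation.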


In [10]:
%%time
np.round(taxi.describe().compute(), 3).T

CPU times: user 587 ms, sys: 220 ms, total: 808 ms
Wall time: 1min 15s


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| VendorID | 84152418.0 | 1.645 | 0.498 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| passenger_count | 84152418.0 | 1.563 | 1.208 | 0.0 | 1.0 | 1.0 | 2.0 | 9.0 |
| trip_distance | 84399019.0 | 3.001 | 8.091 | -37264.53 | 1.07 | 1.93 | 8.82 | 45977.22 |
| RatecodeID | 84152418.0 | 1.061 | 0.76 | 1.0 | 1.0 | 1.0 | 1.0 | 99.0 |
| PULocationID | 84399019.0 | 163.158 | 66.016 | 1.0 | 132.0 | 162.0 | 234.0 | 265.0 |
| DOLocationID | 84399019.0 | 161.353 | 70.251 | 1.0 | 116.0 | 163.0 | 236.0 | 265.0 |
| payment_type | 84152418.0 | 1.289 | 0.479 | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
| fare_amount | 84399019.0 | 13.344 | 174.375 | -1856.0 | 7.0 | 11.0 | 32.04 | 943274.8 |
| extra | 84399019.0 | 1.087 | 1.249 | -60.0 | 0.0 | 1.0 | 3.0 | 535.38 |
| mta_tax | 84399019.0 | 0.495 | 0.067 | -0.5 | 0.5 | 0.5 | 0.5 | 212.42 |


distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
concurrent.futures._base.CancelledError


# Feature engineering

In [7]:
numeric_feat = [
    'pickup_weekday', 
    'pickup_hour', 
    'pickup_week_hour', 
    'pickup_minute', 
    'passenger_count',
]
categorical_feat = [
    'PULocationID', 
    'DOLocationID',
]
features = numeric_feat + categorical_feat
y_col = 'high_tip'

In [8]:
def prep_df(df: dd.DataFrame) -> dd.DataFrame:
    '''
    Generate features from a raw taxi dataframe.
    '''
    df = df[df.fare_amount > 0]  # avoid divide-by-zero
    df['tip_fraction'] = df.tip_amount / df.fare_amount
    df[y_col] = (df['tip_fraction'] > 0.2) # class label
    
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_weekofyear'] = df.tpep_pickup_datetime.dt.weekofyear
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [y_col]].astype(float).fillna(-1)
    
    return df
    
taxi = prep_df(taxi)
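To make the feature definitions concrete, the sketch below applies the same arithmetic as `prep_df` to a single trip using only the standard library. The helper `make_features` and the example trip (timestamp, fare, and tip) are hypothetical and are not taken from the dataset.

```python
from datetime import datetime


def make_features(pickup: datetime, fare_amount: float, tip_amount: float) -> dict:
    """Mirror the per-row arithmetic of prep_df for one trip (illustrative helper)."""
    weekday = pickup.weekday()  # Monday=0 ... Sunday=6, same convention as pandas
    tip_fraction = tip_amount / fare_amount
    return {
        "pickup_weekday": weekday,
        "pickup_hour": pickup.hour,
        "pickup_week_hour": weekday * 24 + pickup.hour,  # hour of the week, 0..167
        "pickup_minute": pickup.minute,
        "high_tip": tip_fraction > 0.2,  # class label
    }


# Hypothetical trip: a Tuesday shortly after midnight, fare 7.00, tip 1.65.
row = make_features(datetime(2019, 1, 1, 0, 46), fare_amount=7.0, tip_amount=1.65)
print(row)
```

`pickup_week_hour` encodes weekday and hour in a single number, so Monday 00:00 maps to 0 and Sunday 23:00 maps to 167.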

In [9]:
taxi.head()

| | pickup_weekday | pickup_hour | pickup_week_hour | pickup_minute | passenger_count | PULocationID | DOLocationID | high_tip |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 24.0 | 46.0 | 1.0 | 151.0 | 239.0 | 1.0 |
| 1 | 1.0 | 0.0 | 24.0 | 59.0 | 1.0 | 239.0 | 246.0 | 0.0 |
| 2 | 4.0 | 13.0 | 109.0 | 48.0 | 3.0 | 236.0 | 236.0 | 0.0 |
| 3 | 2.0 | 15.0 | 63.0 | 52.0 | 5.0 | 193.0 | 193.0 | 0.0 |
| 4 | 2.0 | 15.0 | 63.0 | 56.0 | 5.0 | 193.0 | 193.0 | 0.0 |


Since we're using a cluster with lots of RAM, we can call `persist()` on the dataframe to avoid repeated CSV loading in downstream processing. This tells Dask to execute the task graph that exists up to this point and hold the results in memory.

`persist()` returns immediately with a new dataframe whose partitions are backed by [futures](https://docs.dask.org/en/latest/futures.html); the computation continues in the background until it is complete. To block until it has finished, we call `wait()`.

In [10]:
%%time
from dask.distributed import wait

taxi = taxi.persist()
_ = wait(taxi)

CPU times: user 172 ms, sys: 3.69 ms, total: 176 ms
Wall time: 40.5 s
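The submit-then-wait pattern behind `persist()` and `wait()` can be tried without a cluster using the standard library's `concurrent.futures`. This is an analogy under simplified assumptions (threads instead of distributed workers), not Dask's implementation, and `slow_square` is a hypothetical stand-in for processing one partition.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def slow_square(x: int) -> int:
    """Stand-in for an expensive per-partition computation."""
    time.sleep(0.05)
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns futures immediately; the work runs in the background.
    futures = [pool.submit(slow_square, x) for x in range(4)]
    # wait() blocks until every future has finished.
    done, not_done = wait(futures)
    results = [f.result() for f in futures]

print(results)
```

Once the futures are complete, reading their results is cheap, which mirrors why operations on a persisted dataframe return quickly.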


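Now that the prepared data is held in cluster memory, the same commands run much faster: `len(taxi)` returns in about 100 ms instead of over a minute.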

In [11]:
%%time
len(taxi)

CPU times: user 34.9 ms, sys: 3.96 ms, total: 38.8 ms
Wall time: 108 ms


84194625

In [12]:
%%time
np.round(taxi.describe().compute(), 3).T

CPU times: user 1.87 s, sys: 23.8 ms, total: 1.9 s
Wall time: 7.98 s


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| pickup_weekday | 84194625.0 | 2.977 | 1.933 | 0.0 | 2.0 | 4.0 | 6.0 | 6.0 |
| pickup_hour | 84194625.0 | 13.89 | 6.021 | 0.0 | 12.0 | 16.0 | 22.0 | 23.0 |
| pickup_week_hour | 84194625.0 | 85.35 | 46.356 | 0.0 | 62.0 | 111.0 | 166.0 | 167.0 |
| pickup_minute | 84194625.0 | 29.564 | 17.34 | 0.0 | 15.0 | 30.0 | 45.0 | 59.0 |
| passenger_count | 84194625.0 | 1.555 | 1.214 | -1.0 | 1.0 | 1.0 | 2.0 | 9.0 |
| PULocationID | 84194625.0 | 163.161 | 66.011 | 1.0 | 132.0 | 162.0 | 234.0 | 265.0 |
| DOLocationID | 84194625.0 | 161.342 | 70.245 | 1.0 | 116.0 | 163.0 | 236.0 | 265.0 |
| high_tip | 84194625.0 | 0.541 | 0.498 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


# Hyperparameter tuning

Use a similarly sized sample as [laptop.ipynb](laptop.ipynb) for comparison purposes.

In [13]:
taxi_sample = taxi.sample(frac=0.0045, replace=False, random_state=seed)
taxi_sample = taxi_sample.persist()
_ = wait(taxi_sample)

len(taxi_sample)

378878
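The sample size follows from the sampling fraction and the row count of the prepared dataframe. The check below uses only numbers printed in this notebook. The explanation for the small gap is an assumption: the fraction is presumably applied to each partition separately, so per-partition rounding can shift the total by a few rows.

```python
total_rows = 84_194_625  # len(taxi) after prep_df, as reported in this notebook
frac = 0.0045            # sampling fraction passed to taxi.sample()
observed = 378_878       # len(taxi_sample), as reported in this notebook

expected = round(total_rows * frac)
difference = observed - expected

print(expected, difference)
```

The observed size differs from the naive expectation by only a couple of rows, so the sample is effectively 0.45% of the data.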

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler, DummyEncoder, Categorizer
from dask_ml.model_selection import GridSearchCV

lr = LogisticRegression(
    solver='saga',
    penalty='elasticnet', 
    l1_ratio=0.5,
    max_iter=100, 
    random_state=seed,
)
pipeline = Pipeline(steps=[
    ('categorize', Categorizer(columns=categorical_feat)),
    ('onehot', DummyEncoder(columns=categorical_feat)),
    ('scale', ColumnTransformer(transformers=[('num', StandardScaler(), numeric_feat)])),
    ('clf', lr),
])

params = {
    'clf__l1_ratio': [0.2, 0.3, 0.5, 0.7, 0.9],
}

grid_search = GridSearchCV(
    pipeline, 
    params,
    cv=3, 
    scoring='accuracy',
)
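The cost of the grid search is driven by the number of candidate settings times the number of cross-validation folds. The sketch below enumerates the grid with the standard library to make that count explicit. It only mirrors the bookkeeping; it is not how `GridSearchCV` is implemented, and any final refit of the best candidate comes on top of the count shown.

```python
from itertools import product

# Same grid and fold count as in the notebook cell.
params = {"clf__l1_ratio": [0.2, 0.3, 0.5, 0.7, 0.9]}
cv = 3

# Expand the grid into one dict per candidate (Cartesian product of all values).
names = sorted(params)
candidates = [
    dict(zip(names, values))
    for values in product(*(params[name] for name in names))
]

cv_fits = len(candidates) * cv

print(len(candidates), cv_fits)
```

Adding a second hyperparameter multiplies the number of candidates, so the grid grows quickly; this is where spreading the fits across a cluster pays off.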

In [15]:
%%time
_ = grid_search.fit(taxi_sample[features], taxi_sample[y_col])
grid_search.best_score_

CPU times: user 122 ms, sys: 4.35 ms, total: 126 ms
Wall time: 14.6 s


0.5367321406890873