# Disclaimer
This code is almost entirely copied from https://jcrist.github.io/dask-sklearn-part-3.html. The goal of this exercise is slightly different. We want to understand:

1. How much computational power do we need to process all 22 CSV files, which total ~11.5 GB? Do we really need 4 `m3.2xlarge` instances (8 cores and 30 GB RAM each)?
2. What is the largest amount of data that we can successfully process on a single-node Xeon machine using Dask and scikit-learn?
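
To frame the first question, the aggregate resources of the cluster mentioned above can be compared with the size of the dataset. This is only a rough sketch: the figures come from the text (4 instances, 8 cores and 30 GB RAM each, ~11.5 GB of CSV), the variable names are illustrative, and on-disk CSV size is just a proxy for the in-memory footprint.

```python
# Figures quoted in the text above; the names are illustrative only.
n_instances = 4
cores_per_instance = 8
ram_gb_per_instance = 30
dataset_gb = 11.5  # approximate total size of the 22 CSV files on disk

total_cores = n_instances * cores_per_instance
total_ram_gb = n_instances * ram_gb_per_instance

# Ratio of cluster RAM to on-disk data size (not in-memory size).
ram_to_data_ratio = total_ram_gb / dataset_gb

print(f"{total_cores} cores, {total_ram_gb} GB RAM, "
      f"{ram_to_data_ratio:.1f}x the on-disk dataset size")
```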

In [1]:
import dask

In [2]:
from distributed import Executor, Client, LocalCluster, progress

With the settings below, we will have 11 worker processes, each of which can use two threads.
Additionally, there will be one more Python process running as the scheduler. You can verify this with `ps ax | grep python`.

We are using just a single Xeon processor.

In [3]:
n_workers = 11
ncores = 2 

In [4]:
!grep -m 1 'Xeon' /proc/cpuinfo

model name	: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz


In [5]:
cluster = LocalCluster(n_workers=n_workers, threads_per_worker=ncores)

In [6]:
client = Client(cluster)

In [7]:
client

Client
  Scheduler: tcp://127.0.0.1:42700
  Dashboard: http://127.0.0.1:8787
Cluster
  Workers: 11
  Cores: 22
  Memory: 10.09 GB
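
The client summary can be sanity-checked against the cluster settings. The sketch below recomputes the thread count and a per-worker memory share; the even split of memory among workers is an assumption, not something the output states.

```python
n_workers = 11
threads_per_worker = 2
cluster_memory_gb = 10.09  # as reported in the client summary

total_threads = n_workers * threads_per_worker

# Assumption: the memory limit is divided evenly among the workers.
memory_per_worker_gb = cluster_memory_gb / n_workers

print(total_threads, round(memory_per_worker_gb, 2))
```

Under that assumption each worker has less than 1 GB to work with, which is worth keeping in mind when choosing a block size for reading the CSV files.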


In [8]:
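# Executor is the older name for Client in `distributed`; this connects a second client to the same cluster.
exc = Executor(cluster)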

In [9]:
import dask.dataframe as ddf

In [10]:
!ls ../data/*.csv | wc -l

22


In [11]:
# Subset of the columns to use
cols = ['Year','Month', 'DayOfWeek', 'DepDelay',
        'CRSDepTime', 'UniqueCarrier', 'Origin', 'Dest', 'ArrDelay']

We are using only 1 node, versus the 4 AWS `m3.2xlarge` nodes in the original example.

This should still fit into memory.

The following is not executed; it lists the full set of potentially useful columns for reference:
```python
cols = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime',
       'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum',
       'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
       'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
       'Cancelled', 'CancellationCode', 'Diverted']
```

In [12]:
df = ddf.read_csv('../data/*.csv',
                  blocksize=int(128e6), 
                  usecols=cols)
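
The `blocksize` argument determines how many partitions the dataframe ends up with. As a rough sketch, the helper below (`estimate_partitions`, a hypothetical name) assumes each file is split independently into `ceil(size / blocksize)` blocks; that splitting rule is an assumption and may differ between Dask versions. The equal file sizes are also hypothetical, since the real files vary in size.

```python
import math

def estimate_partitions(file_sizes_bytes, blocksize_bytes):
    """Estimate the partition count, assuming each file is split on its own
    into ceil(size / blocksize) blocks (an assumption, not a Dask guarantee)."""
    return sum(max(1, math.ceil(size / blocksize_bytes))
               for size in file_sizes_bytes)

blocksize = int(128e6)

# Hypothetical: 22 equally sized files totalling ~11.5 GB.
hypothetical_sizes = [11.5e9 / 22] * 22

print(estimate_partitions(hypothetical_sizes, blocksize))
```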

In [13]:
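# Note: persist() returns a new collection backed by in-memory results.
# The return value is only displayed here, not assigned, so `df` itself stays lazy.
exc.persist(df)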

                  Year  Month  DayOfWeek  CRSDepTime  UniqueCarrier  ArrDelay  DepDelay  Origin    Dest
npartitions=104
                 int64  int64      int64       int64         object   float64   float64  object  object
                   ...    ...        ...         ...            ...       ...       ...     ...     ...
...                ...    ...        ...         ...            ...       ...       ...     ...     ...
                   ...    ...        ...         ...            ...       ...       ...     ...     ...
                   ...    ...        ...         ...            ...       ...       ...     ...     ...


In [14]:
progress(df, notebook=False)

[########################################] | 100% Completed |  0.0s

In [15]:
df = (df.drop(['DepDelay', 'CRSDepTime'], axis=1)
        .assign(hour=df.CRSDepTime.clip(upper=2399)//100,
                delayed=(df.DepDelay.fillna(16) > 15).astype('i8')))
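
The two derived columns can be restated per row in plain Python to make their meaning explicit. This is only an illustration of the logic in the cell above; the function names are hypothetical and the real computation runs column-wise in Dask.

```python
import math

def hour_of_day(crs_dep_time):
    """Scheduled departure hour from an HHMM integer, capped at 2399."""
    return min(crs_dep_time, 2399) // 100

def is_delayed(dep_delay):
    """1 if the departure delay exceeds 15 minutes, else 0.
    Missing delays (None or NaN) are filled with 16 and so count as delayed."""
    if dep_delay is None or math.isnan(dep_delay):
        dep_delay = 16
    return int(dep_delay > 15)

print(hour_of_day(1435), hour_of_day(2400), is_delayed(15), is_delayed(float("nan")))
```

Note the design choice this makes visible: a flight with a missing `DepDelay` is treated as delayed.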

In [16]:
aggregations = (df.groupby('Year').delayed.mean(),
                df.groupby('Month').delayed.mean())
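
Because `delayed` is a 0/1 flag, its mean within a group is the fraction of delayed flights in that group. A minimal plain-Python sketch of the same aggregation, using a hypothetical helper `group_mean` and toy rows that are not taken from the dataset:

```python
from collections import defaultdict

def group_mean(rows, key, value):
    """Mean of `value` per distinct `key`, over a list of dict rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}

# Hypothetical toy rows, not taken from the dataset.
rows = [
    {"Year": 1990, "Month": 1, "delayed": 1},
    {"Year": 1990, "Month": 2, "delayed": 0},
    {"Year": 1991, "Month": 1, "delayed": 1},
    {"Year": 1991, "Month": 1, "delayed": 1},
]

print(group_mean(rows, "Year", "delayed"))
print(group_mean(rows, "Month", "delayed"))
```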

In [17]:
# %time (delayed_by_year, delayed_by_month) = dask.compute(*aggregations)

In [18]:
%time df.ArrDelay.mean().compute()

CPU times: user 18.4 s, sys: 3.24 s, total: 21.6 s
Wall time: 2min 1s


7.0499626201265606
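
The wall time gives a rough throughput figure for a single node. The sketch below assumes that the whole ~11.5 GB of CSV was read and parsed during this call, which the output does not confirm; treat the result as an order-of-magnitude estimate.

```python
dataset_bytes = 11.5e9     # ~11.5 GB of CSV, as stated in the introduction
wall_seconds = 2 * 60 + 1  # "Wall time: 2min 1s"

# Assumption: all CSV data was read and parsed during this one call.
throughput_mb_per_s = dataset_bytes / wall_seconds / 1e6

print(f"~{throughput_mb_per_s:.0f} MB/s of CSV")
```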

In [21]:
len(df)

123534969
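
With the row count known, a lower bound on the in-memory size can be estimated and compared with the cluster memory. This is a sketch under stated assumptions: six 8-byte numeric columns and three object columns, and one 8-byte pointer per row for each object column, with the string data itself not counted.

```python
n_rows = 123_534_969   # from len(df) above
bytes_per_numeric = 8  # int64 and float64 both use 8 bytes
n_numeric_cols = 6     # integer and float columns in the dataframe
n_object_cols = 3      # UniqueCarrier, Origin, Dest

numeric_bytes = n_rows * n_numeric_cols * bytes_per_numeric

# Assumption: each object column stores one 8-byte pointer per row;
# the Python string objects themselves come on top of this.
pointer_bytes = n_rows * n_object_cols * 8

print(f"numeric: {numeric_bytes / 1e9:.2f} GB, "
      f"object pointers: {pointer_bytes / 1e9:.2f} GB")
```

Even before counting the string data, this estimate is close to the cluster memory reported by the client, so whether the full dataset fits in memory is not obvious.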

In [43]:
from dask_searchcv import RandomizedSearchCV, GridSearchCV

In [41]:
from dask_searchcv.model_selection import Pipeline, StratifiedKFold

In [42]:
StratifiedKFold  # inspect which class this name resolves to

sklearn.model_selection._split.StratifiedKFold

In [None]:
Pipeline()