# disclaimer
This code is almost entirely copied from https://jcrist.github.io/dask-sklearn-part-3.html. The goal of this exercise is slightly different. We want to understand:

1. How much computational power do we need to process all 22 CSV files, which total ~11.5 GB? Do we really need 4 `m3.2xlarge` instances (8 cores, 30 GB RAM each)?
2. What is the largest amount of data that we can successfully process on a single-node Xeon machine using Dask and scikit-learn?

In [3]:
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler

In [4]:
import dask.dataframe as dd

In [5]:
# Subset of the columns to use
cols = ['Month', 'DayOfWeek', 'DepDelay',
        'CRSDepTime', 'UniqueCarrier', 'Origin', 'Dest']

We are using only 1 node, versus the 4 AWS `m3.2xlarge` nodes used in the original example.

In [6]:
!ls ../data/*.csv

../data/1987.csv  ../data/1993.csv  ../data/1999.csv  ../data/2005.csv
../data/1988.csv  ../data/1994.csv  ../data/2000.csv  ../data/2006.csv
../data/1989.csv  ../data/1995.csv  ../data/2001.csv  ../data/2007.csv
../data/1990.csv  ../data/1996.csv  ../data/2002.csv  ../data/2008.csv
../data/1991.csv  ../data/1997.csv  ../data/2003.csv
../data/1992.csv  ../data/1998.csv  ../data/2004.csv


In [7]:
!ls ../data/*.csv | wc -l

22


In [8]:
datasize = !du -h ../data/*.csv

In [9]:
int(datasize[0].split('\t')[0].split('M')[0])

122

In [25]:
# Sum the sizes (in MB, as printed by `du -h`) of the first five files, then convert to GB
total_size_in_GB = sum(int(size.split('\t')[0].split('M')[0])
                       for size in datasize[:5]) / 1000

In [26]:
total_size_in_GB

2.019
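
Parsing the output of `du -h` only works while every size is printed with an `M` suffix, and `du -h` reports sizes in powers of 1024, so the figure above is approximate. As a rough alternative, the sizes can be read directly from the filesystem. This is a minimal sketch, not part of the original notebook; the helper name `total_size_gb` is hypothetical, and it measures file length in bytes rather than disk usage, so its result may differ slightly from `du`.

```python
import glob
import os


def total_size_gb(pattern):
    """Return the combined size, in GB (1e9 bytes), of all files matching `pattern`."""
    paths = sorted(glob.glob(pattern))
    # os.path.getsize returns the file length in bytes
    return sum(os.path.getsize(path) for path in paths) / 1e9
```

Under these assumptions, `total_size_gb('../data/*.csv')` would give the total for all 22 files, and `total_size_gb('../data/198*.csv')` the total for the files loaded below.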

In [9]:
total_datasize = []

At about 2 GB, this should still fit into memory.

In [31]:
df = dd.read_csv('../data/198*.csv',
                 blocksize=int(128e6), assume_missing=True)

In [29]:
# Judging by the cell numbers, this ran before `assume_missing=True` was added to `read_csv` above
df.persist()

ValueError: Mismatched dtypes found.
Expected integers, but found floats for columns:
- 'Distance'

To fix, specify dtypes manually by adding:

dtype={'Distance': float}

to the call to `read_csv`/`read_table`.

Alternatively, provide `assume_missing=True` to interpret all unspecified integer columns as floats.
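
The error above arises because an integer column cannot hold missing values: a sample with no gaps looks like integers, while data with gaps is read as floats. The sketch below reproduces the effect with plain pandas on a made-up three-row CSV (the data and the `csv_text` name are assumptions for illustration); reading only the first rows stands in for inferring dtypes from a sample.

```python
import io

import pandas as pd

# Hypothetical miniature of the airline data: 'Distance' is missing in the last row
csv_text = "Month,Distance\n1,100\n2,250\n3,\n"

# Reading only the first two rows mimics dtype inference from a sample
head = pd.read_csv(io.StringIO(csv_text), nrows=2)
full = pd.read_csv(io.StringIO(csv_text))

print(head["Distance"].dtype)  # integer dtype: no missing value seen yet
print(full["Distance"].dtype)  # float dtype: the missing value becomes NaN

# Declaring the dtype up front removes the ambiguity
fixed = pd.read_csv(io.StringIO(csv_text), nrows=2, dtype={"Distance": float})
print(fixed["Distance"].dtype)
```

Passing `dtype={'Distance': float}` fixes the one column named in the error, while `assume_missing=True` applies the same idea to every unspecified integer column.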

In [6]:
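# Derive `hour` (scheduled departure hour) and a binary `delayed` label
# (more than 15 minutes late; missing delays count as delayed), then drop the raw columns
df = (df.drop(['DepDelay', 'CRSDepTime'], axis=1)
        .assign(hour=df.CRSDepTime.clip(upper=2399)//100,
                delayed=(df.DepDelay.fillna(16) > 15).astype('i8')))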
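
To make the two derived columns concrete, here is a small worked example in plain pandas. The four sample rows are made up for illustration; the expressions are the same ones used in the cell above.

```python
import pandas as pd

# Hypothetical rows; the column names follow the notebook
sample = pd.DataFrame({
    "CRSDepTime": [5.0, 930.0, 1745.0, 2400.0],
    "DepDelay": [3.0, 16.0, None, 15.0],
})

# HHMM times are capped at 2399, then floor-divided by 100 to get the hour
hour = sample.CRSDepTime.clip(upper=2399) // 100

# Missing delays are filled with 16, so they end up labelled as delayed
delayed = (sample.DepDelay.fillna(16) > 15).astype("i8")

print(hour.tolist())     # [0.0, 9.0, 17.0, 23.0]
print(delayed.tolist())  # [0, 1, 1, 0]
```

Note that a delay of exactly 15 minutes is not counted as delayed, and that a scheduled time of 2400 maps to hour 23 rather than 24.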

In [None]:
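from dask.distributed import progress  # assumed import: `progress` is not defined above and needs a running distributed Client
progress(df.persist())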