# Dask

Parallel processing for NumPy arrays and Pandas DataFrames

http://dask.pydata.org/en/latest/

Interesting example and use-case:

http://matthewrocklin.com/blog/work/2018/02/09/credit-models-with-dask

## Creating a Dask Cluster

In [1]:
import dask
import dask.distributed
import dask.dataframe
import pandas as pd
from glob import glob

In [5]:
client = dask.distributed.Client()  # starts a local Dask cluster and connects a client to it
client

Client: Scheduler: tcp://127.0.0.1:39022, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 10.28 GB


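In principle, Dask also runs without a distributed client by falling back to its default local scheduler, but this appears to be less efficient and is harder to monitor, since the dashboard belongs to the distributed scheduler.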
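As a minimal sketch of running Dask without a distributed client, the snippet below evaluates a small Dask array on the built-in threaded scheduler. The array shape and chunk size are arbitrary values chosen for illustration only.

```python
import dask.array as da

# A 1000 x 1000 array of ones, split into 16 chunks of 250 x 250 (illustrative sizes)
x = da.ones((1000, 1000), chunks=(250, 250))

# scheduler="threads" selects the local thread pool; no Client is required
total = x.sum().compute(scheduler="threads")
print(total)
```

Passing the scheduler explicitly means the thread pool is used even when a client happens to be active in the same session.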

## Dask DataFrame

__ToDo: refactor, use actual example data__

The following example imports a large amount of data from multiple CSV files into a Dask DataFrame, performs some analysis and aggregation, and returns a Pandas DataFrame. Zipped CSV files need the workaround described further below.

Example data:

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

In [None]:
# file_pattern: glob pattern for the input CSV files (see the definition in the zipped-file cell further below)
df = dask.dataframe.read_csv(file_pattern, delimiter=',', decimal='.')  # Dask CSV reader;
# works with wildcards (e.g. *), but not with zip files
agg = df.groupby(['date']).agg({'passenger_count': 'sum', 'trip_distance': 'sum',
                                'tip_fraction': 'mean'})  # groupby analogous to the Pandas syntax
agg_df = agg.compute()  # triggers the actual computation and returns a Pandas DataFrame
agg_df.head()

The execution progress and the efficiency of the parallelization can be monitored in the dashboard (web GUI) of the Dask client:

http://localhost:8787/status

If something goes wrong, kill all Python processes (Windows):

tasklist
taskkill /IM pythonw.exe /F

In contrast to Pandas, Dask DataFrames cannot be created directly from zipped CSV files. The following code snippet uses Pandas to import zipped files on the Dask cluster via Dask Delayed.

In [None]:
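file_pattern = u'C:/Users/d90394/Documents/test_data/*.csv'  # adjust the pattern so that it matches the zipped files
files = glob(file_pattern)  # list of all files matching the pattern

# one lazy pd.read_csv call per file; Pandas infers zip compression from the file extension
dfs = [dask.delayed(pd.read_csv)(filename, delimiter=',', decimal='.')
       for filename in files]
df = dask.dataframe.from_delayed(dfs)  # one partition per input file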

Note that the import of a single zipped csv file is not distributed, thus the memory of each worker must be sufficient to contain the Pandas DataFrame for each input file.

## Dask Delayed

__ToDo: working, but not nice enough - needs refactoring__

Dask Delayed is used to submit functions with defined input and output to the workers. An example is given here:

dask_delayed_example.py

Note that compute() only needs to be called on the final result(s); Dask resolves the dependencies automatically.
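Since the referenced script is not reproduced here, the following minimal sketch illustrates the idea. The functions `load`, `square` and `total` are hypothetical stand-ins for real work.

```python
import dask


@dask.delayed
def load(i):
    # Stand-in for an expensive loading step
    return i


@dask.delayed
def square(x):
    return x * x


@dask.delayed
def total(values):
    return sum(values)


# Building the graph is lazy: no function has been executed yet
squares = [square(load(i)) for i in range(5)]
result = total(squares)

# compute() is called once, on the final result only;
# Dask runs load and square first because total depends on them
value = result.compute(scheduler="synchronous")
print(value)
```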

If a class rather than a single function is to be submitted, a helper function can be used. If the helper function is defined inside a class, it should be a static method (i.e. one that does not refer to self).

In [6]:
import numpy as np
from time import perf_counter as clock  # time.clock was removed in Python 3.8

In [9]:
class A:  # class that does the calculation to be parallelized
    def __init__(self, p1, x1):
        self.p1 = p1
        self.x1 = x1

    def doCalc(self, p2, x2):
        # note: for the integer arrays used below, these products exceed the
        # int64 range and wrap around, so the value is only a test workload
        return np.dot(self.p1 + self.x1, p2 + x2)

# ToDo: working, but not nice enough - needs refactoring
class B:  # class that calls the calculation
    def mainProg(self):
        p1 = np.array(range(10000000))
        p2 = p1**2
        # distribute large data, which is required for all calculations, to the workers ahead of time
        p1s = client.scatter(p1)
        p2s = client.scatter(p2)
        self.result = []
        for x1 in range(100):
            x2 = x1**2
            self.result.append(dask.delayed(self.calcHelper)(p1s, p2s, x1, x2))
        result_tot = dask.delayed(np.sum)(self.result)
        return result_tot.compute()

    @staticmethod
    def calcHelper(p1, p2, x1, x2):  # helper function defined as a static method
        a = A(p1, x1)
        return a.doCalc(p2, x2)

In [10]:
# ToDo: refactor
start_time=clock()
b=B()
print(b.mainProg())
elapsed_time=clock()-start_time
print(elapsed_time)

2201258113697798912
6.076910000000002


In the example above, the command xs=client.scatter(x) is used to send large data to the workers once, ahead of time, instead of shipping it along with every task.
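A smaller, self-contained sketch of the scatter pattern is shown below. The function `weighted_sum` is hypothetical, and the client settings (in-process workers, a random dashboard port) are assumptions chosen only to keep the example lightweight.

```python
import numpy as np
from dask.distributed import Client


def weighted_sum(data, weight):
    # Hypothetical task that needs the large array plus a small parameter
    return float(data.sum() * weight)


# In-process client with a single worker; ":0" picks a free dashboard port
with Client(processes=False, n_workers=1, threads_per_worker=2,
            dashboard_address=":0") as client:
    big = np.arange(1_000_000, dtype="float64")

    # Send the array to the workers once; a future is returned instead of the data
    big_future = client.scatter(big, broadcast=True)

    # Every task receives the future, so the array is not shipped again per task
    futures = [client.submit(weighted_sum, big_future, w) for w in range(4)]
    results = client.gather(futures)

print(results)
```

The returned future can be passed to `client.submit` or to `dask.delayed` calls like any other argument; Dask replaces it with the actual data on the worker.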

If there are issues with multiprocessing, the client can be restricted to multithreading (which is usually slower) using

client=dask.distributed.Client(processes=False, threads_per_worker=4)

## Cleanup

In [4]:
client.close()