# AAI614: Data Science & its Applications

*Notebook 7.1: Introducing Dask*

<a href="https://colab.research.google.com/github/harmanani/AAI614/blob/main/Week%207/Notebook7.1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Source: NVIDIA

# Dask 

Dask is not faster than pandas for a single file or for small datasets. It excels when there are many files to process, because it uses lazy computation and can run the work in parallel. In this lab, we will learn how to use Dask to speed up computation under the correct conditions.
 
First, let's get these libraries loaded.

In [1]:
!pip install dask

import dask.dataframe as dd
import glob
import pandas as pd
import time
import urllib.request
import ssl

ssl._create_default_https_context = ssl._create_unverified_context


## Using Dask versus Pandas

Neither pandas nor cuDF can read multiple CSV files directly with [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). To read multiple files into one DataFrame, we need to loop through the files and concatenate the results.

To see this, let's pull a couple more files from the [Water Level Website](https://tidesandcurrents.noaa.gov/stations.html?type=Water+Levels). This time, we will request each file as a CSV and save it with [urllib.request](https://docs.python.org/3/library/urllib.request.html).
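The download step itself is not shown in this notebook. A minimal sketch of how it might look with `urllib.request.urlretrieve` follows. The helper name `download_csv`, the default `data` folder argument, and the placeholder URL are assumptions for illustration; the real request URL for a station has to be taken from the website.

```python
import urllib.request
from pathlib import Path


def download_csv(url, dest_dir="data"):
    """Download `url` into `dest_dir` and return the saved path.

    Hypothetical helper: the name and the default folder are assumptions.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    # Use the last component of the URL path as the file name.
    file_name = url.split("?")[0].rstrip("/").split("/")[-1]
    dest = dest_dir / file_name
    urllib.request.urlretrieve(url, dest)
    return dest


# Example call (placeholder URL, not a real endpoint):
# download_csv("https://example.com/path/to/station.csv")
```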

We should now have a few `.csv` files in the `data` folder. We could type out the path of each file individually, but instead we will use the [glob](https://docs.python.org/3/library/glob.html) library to collect them programmatically. The `*` acts as a wildcard that matches every file fitting the pattern, like so:

In [2]:
file_paths = glob.glob("data/*.csv")
file_paths = [file for file in file_paths if file != "data/numbers.csv"]
file_paths

['data/RQC00667292.csv',
 'data/USC00020808.csv',
 'data/USC00013620.csv',
 'data/RQC00666992.csv',
 'data/USC00025635.csv',
 'data/USC00010063.csv',
 'data/USC00012675.csv',
 'data/USC00025344.csv',
 'data/USC00026117.csv',
 'data/USC00021870.csv',
 'data/USC00018517.csv',
 'data/RQC00666514.csv',
 'data/USC00012172.csv',
 'data/USC00010957.csv',
 'data/USC00027876.csv',
 'data/RQC00669829.csv',
 'data/RQC00668881.csv',
 'data/USC00030064.csv',
 'data/USC00018673.csv',
 'data/USC00012377.csv',
 'data/USC00013519.csv',
 'data/USC00023009.csv',
 'data/USC00013645.csv',
 'data/USC00010402.csv',
 'data/USC00022329.csv',
 'data/USC00010748.csv',
 'data/USC00013043.csv',
 'data/USC00018670.csv',
 'data/USC00015397.csv',
 'data/RQW00011641.csv',
 'data/USC00027708.csv',
 'data/USC00015553.csv',
 'data/USC00026037.csv',
 'data/USC00018325.csv',
 'data/USC00020080.csv',
 'data/USC00029271.csv',
 'data/USC00010369.csv',
 'data/USC00010425.csv',
 'data/USC00025924.csv',
 'data/USC00030130.csv',
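The string comparison in the cell above depends on the exact path string that `glob` returns (path separators differ on Windows), and `glob` returns files in arbitrary order. A sketch of an alternative with `pathlib`, assuming the same `data` folder layout; the helper name `list_station_files` is hypothetical:

```python
from pathlib import Path


def list_station_files(data_dir="data", exclude=("numbers.csv",)):
    """Return sorted CSV paths in `data_dir`, skipping names in `exclude`."""
    paths = Path(data_dir).glob("*.csv")
    # Compare on the file name only, so the path separator does not matter.
    return sorted(str(p) for p in paths if p.name not in exclude)
```

Sorting makes the file order, and therefore the row order of the combined DataFrame, reproducible between runs.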


Each path starts with `data`, ends with `.csv`, and the `*` picks up anything in between. Let's set up a for loop to see how long it takes to read all of these files. The function below takes the library as an argument, and here we pass pandas (`pd`). If you pass cuDF instead, run the block **twice** to see how much faster cuDF is after it has been initialized.

In [4]:
%%time
usecols = [0, 1, 2, 4, 5]  # Column names are different when pulling csv directly


def read_all(library, file_paths):
    df_list = []
    for file in file_paths:
        df = library.read_csv(
            file, index_col=None, header=None, usecols=usecols, skiprows=1
        )
        df_list.append(df)
    return library.concat(df_list, axis=0, ignore_index=True)



df_cpu = read_all(pd, file_paths)

CPU times: user 1.99 s, sys: 476 ms, total: 2.47 s
Wall time: 2.83 s
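`%%time` is an IPython cell magic, so it is only available inside a notebook. Outside one, a rough equivalent can be sketched with `time.perf_counter`. The helper name `timed` is hypothetical, and it reports wall time only.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example: time a cheap built-in call.
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total}, wall time: {seconds:.3f} s")
```

With the function from the cell above, `df_cpu, seconds = timed(read_all, pd, file_paths)` would time the pandas loop in the same way.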


In [5]:
df_cpu

|  | 0 | 1 | 2 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-01 | HPCP |
| 1 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-02 | HPCP |
| 2 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-03 | HPCP |
| 3 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-04 | HPCP |
| 4 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-05 | HPCP |
| ... | ... | ... | ... | ... | ... |
| 1013567 | USC00026468 | 34.7994 | -109.8850 | 2021-01-28 | HPCP |
| 1013568 | USC00026468 | 34.7994 | -109.8850 | 2021-01-29 | HPCP |
| 1013569 | USC00026468 | 34.7994 | -109.8850 | 2021-01-30 | HPCP |
| 1013570 | USC00026468 | 34.7994 | -109.8850 | 2021-01-31 | HPCP |


Since Dask is made to be parallel, we do not need a for loop. It can read multiple files natively.

The code below shows how to read the data in parallel. Calling `dd.read_csv` only sets up the process to read the files. To actually load the data, we need to force Dask to *compute* by calling `.compute()`.

In [6]:
%%time
ddf_cpu = dd.read_csv(file_paths, usecols=usecols, header=0, skipinitialspace=True)

ddf_cpu.compute()

CPU times: user 2.97 s, sys: 1.8 s, total: 4.77 s
Wall time: 2.15 s


|  | STATION | LATITUDE | LONGITUDE | DATE | ELEMENT |
|---|---|---|---|---|---|
| 0 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-01 | HPCP |
| 1 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-02 | HPCP |
| 2 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-03 | HPCP |
| 3 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-04 | HPCP |
| 4 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-05 | HPCP |
| ... | ... | ... | ... | ... | ... |
| 23347 | USC00026468 | 34.7994 | -109.8850 | 2021-01-28 | HPCP |
| 23348 | USC00026468 | 34.7994 | -109.8850 | 2021-01-29 | HPCP |
| 23349 | USC00026468 | 34.7994 | -109.8850 | 2021-01-30 | HPCP |
| 23350 | USC00026468 | 34.7994 | -109.8850 | 2021-01-31 | HPCP |


Let's sample our data to confirm it has been read correctly. By default, `head()` returns the first five rows and only needs to read from the first partition, so it finishes much faster than a full `compute()`.

In [7]:
%%time
ddf_cpu.head()

CPU times: user 50.5 ms, sys: 9.7 ms, total: 60.2 ms
Wall time: 61.6 ms


|  | STATION | LATITUDE | LONGITUDE | DATE | ELEMENT |
|---|---|---|---|---|---|
| 0 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-01 | HPCP |
| 1 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-02 | HPCP |
| 2 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-03 | HPCP |
| 3 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-04 | HPCP |
| 4 | RQC00667292 | 18.0258 | -66.5252 | 1971-07-05 | HPCP |


How can Dask do this faster than regular pandas or cuDF? Under the hood, Dask builds a graph of operations called a DAG (directed acyclic graph) and only executes it when a result is requested. We can view this DAG with the [visualize](https://docs.dask.org/en/latest/graphviz.html) method.

In [8]:
!pip install graphviz

Collecting graphviz
  Downloading graphviz-0.21-py3-none-any.whl.metadata (12 kB)
Downloading graphviz-0.21-py3-none-any.whl (47 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.21


In [9]:
ddf_cpu.visualize()

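ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH

This error means that the `graphviz` Python package is installed but the Graphviz program itself, which provides the `dot` executable, is not. `pip install graphviz` only installs the Python bindings. Graphviz has to be installed separately with the system package manager (for example `apt-get install graphviz` on Debian or Ubuntu) before `visualize()` can render the graph.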
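Even without the rendered picture, the idea of a task graph can be sketched in plain Python. The toy below is not Dask's scheduler. It is a simplified illustration in which each key maps either to a value or to a tuple of a function and the keys it depends on, similar in spirit to how Dask describes graphs. The names `toy_graph` and `evaluate` are made up.

```python
from operator import add, mul

# Each key maps to a literal value or to (function, *argument_keys).
toy_graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),         # depends on x and y
    "result": (mul, "sum", "sum"),  # depends on sum
}


def evaluate(graph, key, cache=None):
    """Recursively compute `key`, evaluating its dependencies first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]  # each task runs at most once
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *arg_keys = task
        args = [evaluate(graph, k, cache) for k in arg_keys]
        value = func(*args)
    else:
        value = task  # plain data, nothing to run
    cache[key] = value
    return value


print(evaluate(toy_graph, "result"))  # (1 + 2) * (1 + 2) = 9
```

Defining `toy_graph` runs nothing. Work happens only when a key is requested, and only the tasks that key depends on are executed, which is the same principle behind calling `.compute()` on a Dask DataFrame.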