# DS107 Big Data : Lesson Eight Companion Notebook

### Table of Contents <a class="anchor" id="DS107L8_toc"></a>

* [Table of Contents](#DS107L8_toc)
    * [Page 1 - Introduction](#DS107L8_page_1)
    * [Page 2 - What is DASK? ](#DS107L8_page_2)
    * [Page 3 - DASK Tutorial Setup](#DS107L8_page_3)
    * [Page 4 - Packages and Parallel Processing](#DS107L8_page_4)
    * [Page 5 - DASK Data](#DS107L8_page_5)
    * [Page 6 - In-Memory Processing](#DS107L8_page_6)
    * [Page 7 - Key Terms](#DS107L8_page_7)
    * [Page 8 - Lesson 8 Hands-On](#DS107L8_page_8)

    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS107L8_page_1"></a>

[Back to Top](#DS107L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Introduction

Hadoop is not the only way to make use of distributed processing and computer clusters, and neither is AWS. In this lesson, you will explore an up-and-coming, easy-to-use big data library that integrates with Python: DASK.

In this lesson, you will: 

* Learn about the components of DASK
* Use DASK's built-in tutorial feature to explore
* Understand the concepts and usage of parallel processing
* Set up a DASK data frame
* Examine the differences in processing using in-memory storage

This lesson will culminate in a Hands-On in which you explore the wide world of DASK.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - What is DASK? <a class="anchor" id="DS107L8_page_2"></a>

[Back to Top](#DS107L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# What is DASK? 

DASK is a Python library that allows you to manage parallel processing (think clusters, or multiple cores on your personal computer). It integrates easily with Python tools you already know and love, such as `pandas`, `numpy`, and `scikit-learn`. Unlike PySpark, an earlier tool for distributed computing from Python, DASK is written in Python and does not rely on an underlying Java runtime, which can make execution tricky.

---

## Components of DASK

There are two main components of DASK: 

* High-level Collections
* Low-level Schedulers

The *high-level collections* are data structures and functions that mimic the operations you perform on arrays in `numpy` or on dataframes in `pandas`, but operate in parallel on data sets that are too big to fit in memory on a single computer. The *low-level schedulers* execute the actual work of parallel processing.
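As a minimal sketch of how the two components fit together, the snippet below runs the same sum with `numpy` and with `dask.array`. The array size, the chunk size, and the choice of the local threaded scheduler are illustrative assumptions, not values taken from the tutorial.

```python
import numpy as np
import dask.array as da

# A numpy array holds all one million values in memory at once.
x_np = np.arange(1_000_000, dtype="int64")

# The DASK collection splits the same range into four chunks
# (chunk size chosen arbitrarily for this sketch).
x_da = da.arange(1_000_000, chunks=250_000, dtype="int64")

# High-level collection: same method name as numpy, but the result is lazy.
lazy_total = x_da.sum()

# Low-level scheduler: .compute() hands the task graph to a scheduler,
# here the local threaded one, which does the actual work.
total_da = lazy_total.compute(scheduler="threads")
total_np = x_np.sum()

print(total_np, total_da)
```

Both totals should agree; the difference is that DASK could have processed each chunk on a separate thread or machine.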

DASK is currently being used for all sorts of work, including genome sequencing, modeling of hydrology and communication networks, and even in the finance sector.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like to learn more about how DASK is being used, please visit <a href="https://stories.dask.org/en/latest/"> this website on DASK use cases. </a></p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - DASK Tutorial Setup<a class="anchor" id="DS107L8_page_3"></a>

[Back to Top](#DS107L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# DASK Tutorial Setup

Go to the Dask website **[here](https://dask.org/)** to get started!

Then hit the `Try Now` button: 

![The Try Now button on the DASK website](Media/DASK1.png)

It will open up a Jupyter Lab environment that looks like this: 

![The JupyterLab environment opened by the DASK demo](Media/DASK2.png)

Click on `dataframe.ipynb` on the left-hand side to see a starter project using DASK. You could run DASK locally on your own computer, but because you may not have the (often expensive) resources needed to sustain large-scale parallel processing, you'll use the free demo to get familiar with the DASK-specific commands and get a feel for exactly how parallel processing works!

You won't be running the code used in this lesson in your own Jupyter Notebook, but will rather be running sections in JupyterLab per the link above when told.  Code is included in this lesson so that you can understand exactly what each piece does.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>The tutorial is very sensitive to loss of internet connectivity, so make sure you are working in a place with a steady connection before you begin, or you may have to repeat multiple steps! It also times out quickly, so work on this when you have some dedicated time to devote to this lesson!</p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Packages and Parallel Processing<a class="anchor" id="DS107L8_page_4"></a>

[Back to Top](#DS107L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Packages and Parallel Processing

On this page, you'll learn about the packages you need to make this tutorial work and about the parallel processing functions that DASK provides.

---

## Import Packages

The packages you'll need to get DASK running and to execute this tutorial are as follows:

```python
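# Tools for managing a local or distributed cluster of workers
from dask.distributed import Client, progress
import dask
# Parallel, pandas-like dataframes
import dask.dataframe as dd
import pandas as pd
# scikit-learn model used at the end of the tutorial
from sklearn.linear_model import LinearRegression

# Jupyter magic: render plots inline (only valid inside a notebook)
%matplotlib inline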
```

`dask.distributed` contains tools that will help you manage parallel processing. `dask` itself has functions that will help you create dataframes and interact with other packages. And `sklearn` contains tools for machine learning.

---

## Enable Parallel Processing

`Client()` accepts arguments for the number of workers, the number of threads per worker, and the maximum amount of memory each worker may use. `n_workers=` is the argument that specifies the number of workers. *Workers* are the "doers" of your code. *Threads* are how many things a worker can be doing at the same time; you can think of it like multitasking. The argument `threads_per_worker=` tells DASK how many tasks each worker may process at once. Finally, the `memory_limit=` argument caps memory use so that this work does not keep your computer from performing its regular functions.
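The arguments described above can be sketched as follows. The specific values are assumptions for illustration and may differ from the ones in the tutorial notebook; `processes=False` is also an assumption, added so the workers live inside the current Python process and the sketch runs as a plain script.

```python
from dask.distributed import Client

# Illustrative values only; the tutorial notebook may use different ones.
client = Client(
    n_workers=2,            # two "doers"
    threads_per_worker=2,   # each worker may handle two tasks at once
    memory_limit="1GB",     # memory cap applied to each worker
    processes=False,        # assumption: keep workers in this process
)

# Ask the cluster how many threads each worker was given.
threads = client.nthreads()          # dict: worker address -> thread count
total_threads = sum(threads.values())
print(total_threads)

# Shut the local cluster down when finished.
client.close()
```

In the hosted tutorial you would leave the client open, since the `Dask Task Stream` view depends on it.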

Go ahead and run the first line of code now in the tutorial to set this environment up. You'll notice that a view pops up on the right hand side for `Dask Task Stream`. You'll be able to watch everything you run in real-time here, to see a great visual of how the computers are processing your commands.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - DASK Data<a class="anchor" id="DS107L8_page_5"></a>

[Back to Top](#DS107L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# DASK Data

Now that you have the basics set up, you will create a DASK dataframe, learn how to view data types, and convert results to a `pandas` dataframe.

---

## Set up a DASK Dataframe

DASK has the ability to randomly generate data, like many other Python packages. The code `dask.datasets.timeseries()` randomly creates a dataframe that has data stored in ten-second bins, names, ids, and some numeric values.

Please run the second and third lines of code in the tutorial now. Did you see something strange going on after line 2? Looks like there isn't any data in there! But don't be alarmed - this is just how DASK functions. DASK is *lazy*: computing and printing values takes processing power, so by default it shows only the structure of the dataframe (column names and data types), which saves you time and money. You can still use the standard `pandas` method `.head()` if you need to see the actual data.

---

## View Data Types

Instead of actually seeing the data, you may only need to know the types of data you have. Well, DASK has thought of that too! Try the attribute `.dtypes` (note that it takes no parentheses) and see what happens! You can now run lines 4-6 in the tutorial.

---

## Get Pandas Dataframe Output

You can run many of your favorite standard `pandas` commands like normal while using parallel processing through DASK. However, the result is by default a lazy DASK dataframe rather than a `pandas` one. If you want to force your computer to give you back a `pandas` dataframe, then you are in luck - the `.compute()` method will do just that. Keep in mind that the computed result has to fit in your computer's memory.

Go ahead and run lines 7-9 now in the tutorial. Do you see a flurry of activity here? That is DASK carrying out the deferred work across its workers and gathering the result into a single `pandas` object.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - In-Memory Processing<a class="anchor" id="DS107L8_page_6"></a>

[Back to Top](#DS107L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# In-Memory Processing

You can make your processing go a lot faster if you store data in memory. While you can only do this if you have available RAM, it is definitely worth it for a dataframe you'll be working with a lot. The method to do this is `.persist()`. Run line 10, and see how much processing power you take up committing this data to memory.

---

## Check out the Speed Difference! 

Run lines 11-14 and see how much faster and how little processing it took to populate a dataframe that was sampled from the old one! Even generating a graph of that data took almost no time at all.

But even with in-memory data, accessing certain parts of the data can take some time and processing power.  Try running lines 15 and 16 in the tutorial now to view this.

---

## Set Index

The way to fix this little hiccup in speed is to set the index, which sorts the data by the chosen column so that DASK knows where to find each value. While initially costly to do, it can be a time and power saver in the long run. To see how this works, start by running line 17 in the tutorial. That's a lot of colors, huh?! You can see how much processing it took to index. It will take a lot to commit this change to memory, too - which is just what you'll do on line 18 of the tutorial.

But it all pays off with line 19, which when run is now nearly instantaneous. 

---

## Machine Learning in DASK

DASK really does play nicely with others, and you can witness for yourself just how nicely it works with `scikit-learn`! Just run the last two lines in the tutorial.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Key Terms<a class="anchor" id="DS107L8_page_7"></a>

[Back to Top](#DS107L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>DASK</td>
        <td>A Python library used for parallel processing.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>High-level Collections</td>
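        <td>Parallel counterparts of familiar data structures, such as numpy arrays and pandas dataframes, for datasets too large for one computer.</td>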
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Low-level Schedulers</td>
        <td>The components that execute the actual work of parallel processing.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Thread</td>
        <td>One stream of work inside a worker; several threads let a worker handle several tasks at the same time.</td>
    </tr>
</table>

---

## DASK Functions

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Client()</td>
        <td>Sets up distributed processing for Python.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.dtypes</td>
        <td>Retrieves the data types in your DASK dataframe.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.compute()</td>
        <td>Converts a DASK dataframe to a pandas dataframe.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.persist()</td>
        <td>Places data in memory for faster processing.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Lesson 8 Hands-On<a class="anchor" id="DS107L8_page_8"></a>

[Back to Top](#DS107L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


This Hands-On **will** be graded, so make sure you complete each part. When you are done, please submit one document with all of your findings for grading.

---

## Description

For this Hands-On, you'll be exploring the DASK documentation further. Go to **[this website](https://docs.dask.org/en/latest/)** and read the overview section, then pick at least two sections under `USER INTERFACE` and `SCHEDULING` to examine. Please answer the following questions after examining these features:

* What do you think the most useful DASK feature is?
* Why is the advent of DASK so important?
* What would you like to learn more about?

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>
