# Introduction to Distributed Computing (10 mins)

Before diving into code, let's take a look at the current tooling and the use cases that demand distributed computing.

In this section, we explore:

* when do I need distributed computing?
* is big data still a thing?
* what does the big data ecosystem look like?
* what are the issues with current frameworks?

## When Do I Use Distributed Computing?

pandas is great for small datasets, but unfortunately does not scale well to large datasets. The primary reason is that pandas runs on a single core and does not take advantage of all available compute resources. Many operations also generate [intermediate copies](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#scaling-to-large-datasets) of data, using more memory than necessary. To handle data effectively with pandas, users should ideally have [5x to 10x](https://wesmckinney.com/blog/apache-arrow-pandas-internals/) as much RAM as the size of the dataset.
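To make the memory point concrete, here is a minimal sketch that measures how much memory a DataFrame occupies and how much an intermediate result adds on top. The column names and row count are made up for illustration, and the numbers will vary with dtypes and pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: two float64 columns with 100,000 rows each.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({"a": rng.random(n_rows), "b": rng.random(n_rows)})

# Bytes held by the original DataFrame (index included).
base_bytes = int(df.memory_usage(deep=True).sum())

# An elementwise operation returns a new DataFrame with its own buffers,
# so both objects occupy memory while they are alive.
scaled = df * 2
scaled_bytes = int(scaled.memory_usage(deep=True).sum())

print(f"original: {base_bytes / 1e6:.1f} MB")
print(f"intermediate result: {scaled_bytes / 1e6:.1f} MB")
```

Chaining several such steps keeps multiple intermediate results alive at once, which is one reason the recommended RAM is a multiple of the dataset size.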

Spark and Dask allow us to split compute jobs across multiple machines. In some cases, they can also handle datasets that don’t fit into memory by [spilling data](http://distributed.dask.org/en/latest/worker.html#spill-data-to-disk) over to disk.

<img src="https://ml.dask.org/_images/dimensions_of_scale.svg" align="left" width="700"/>

## Is Big Data dying?

## Can't I just use Polars and DuckDB?

## Distributed Computing Architecture

There is an image in the Dask repo [issues](https://github.com/dask/dask/issues/4471) that clearly illustrates the distributed computing paradigm. In general, there is a client or master that takes care of the orchestration and final data collection. The client is responsible for scheduling tasks among workers.

Both Spark and Dask also have local modes that use the cores available on the local machine. This means we can still take advantage of the additional processing power without having a cluster available.
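The orchestration pattern described above can be sketched with the Python standard library alone. This is only an illustration of the idea, not how Spark or Dask are implemented: a thread pool stands in for the workers, and the names `worker_task` and `run_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(chunk):
    """Work done independently by one worker on its share of the data."""
    return sum(x * x for x in chunk)


def run_job(data, n_workers=4):
    """Play the client role: split the data, hand out tasks, collect results."""
    # Ceiling division so every element lands in exactly one chunk.
    chunk_size = max(1, -(-len(data) // n_workers))
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # The pool stands in for the workers; map() schedules one task per chunk.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker_task, chunks))

    # Final collection happens on the client side.
    return sum(partial_results)


total = run_job(list(range(1000)))
print(total)
```

In a real cluster the workers are separate processes or machines, so the function and the data chunks have to be serialized and sent over the network.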

In the diagram below, note how:
- package versions and serialization matter, because code and data are sent between the client and the workers
- reading in files can be optimized
- data actually lives on a physical machine

<img src="https://user-images.githubusercontent.com/11656932/62263986-bbba2f00-b3e3-11e9-9b5c-8446ba4efcf9.png" align="left" width="700"/>

## Introduction to Partitions

To understand partitions, we can look at this image showing the way Dask scales pandas. Each partition is a pandas DataFrame. A Dask DataFrame is the collection of all of these pandas DataFrames. Operations run on each partition, and the results are then aggregated.

<img src="https://docs.dask.org/en/latest/_images/dask-dataframe.svg" align="left" width="400"/>
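The partition idea can be mimicked with plain pandas. The sketch below is not Dask itself: it splits one DataFrame into row-wise partitions by hand, runs a group-by sum on each partition, and then combines the partial results. The column names and the number of partitions are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "b", "a", "b", "a", "b"], "value": [1, 2, 3, 4, 5, 6]}
)

# Split the rows into 3 partitions; each one is an ordinary pandas DataFrame.
n_partitions = 3
bounds = np.linspace(0, len(df), n_partitions + 1, dtype=int)
partitions = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

# Step 1: run the operation on each partition independently.
partials = [part.groupby("key")["value"].sum() for part in partitions]

# Step 2: aggregate the partial results back into one answer.
result = pd.concat(partials).groupby(level=0).sum()

# Same totals as computing on the whole DataFrame at once: a is 9, b is 12.
print(result)
```

Sums combine cleanly because a sum of partial sums equals the overall sum. Operations such as a median do not decompose this way, which is part of why distributed frameworks need more care than a simple loop over partitions.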

## Available Tools

Spark

Dask

Ray

## Issues with Distributed Computing

1. Expertise Required

2. Different Syntax

3. Hard to Test