# Big Data Problems are Composed of Several Small Data Problems

[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)

Partitioning is one of the fundamental concepts in distributed computing. In the local setting, all of the data fits on one machine. In the distributed setting, the data is spread across multiple machines and often has to be moved between them to perform an operation (this movement is called a shuffle). The difference is subtle but important.
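To make the idea of a shuffle concrete, here is a minimal pure-Python sketch. The two "machines" are just lists, and the rule that picks a target machine (`ord(key) % num_machines`) is a toy stand-in chosen for illustration; real engines use their own hashing and networking.

```python
# Hypothetical layout: (id, value) rows as they happen to sit on two machines.
machines = [
    [("A", 3), ("B", 1), ("C", 3)],
    [("A", 4), ("B", 2), ("C", 2)],
]

def shuffle_by_key(machines, num_machines):
    """Move every row to the machine chosen by its key, so equal keys end up together."""
    shuffled = [[] for _ in range(num_machines)]
    moved = 0
    for source, rows in enumerate(machines):
        for key, value in rows:
            # Toy deterministic rule; an assumption made for this sketch only.
            target = ord(key) % num_machines
            shuffled[target].append((key, value))
            # Count rows that had to leave the machine they started on.
            moved += target != source
    return shuffled, moved

shuffled, moved = shuffle_by_key(machines, num_machines=2)
print(shuffled)
print("rows moved:", moved)
```

After the shuffle, every id lives on exactly one machine, which is what makes a per-id operation possible. The rows that had to move are the cost of getting there.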

Fugue abstracts partitioning so that users have to worry about it less. Still, it's important to understand the concept in order to use distributed computing effectively.

## Real Scenarios

In the previous section, we ended with a scikit-learn pipeline that was applied to each group of data. When you have a big dataset, there is a high chance that it contains logical groupings that can serve as the basis for parallelization. Can we tackle the problem per user? Per region? Per item in a store? It's very common for a machine learning model trained on one region to translate poorly to another region. It's also very common for data stored in the same column to be in different units (for example, currencies).

Once we break up big data into several small problems, we can focus on the logic for one group of data, iterate quickly, and then scale with Fugue.

## Simple Partitioning Example

In the DataFrame below, we want to take the difference of the value per day. Because there are three different ids, we want to make sure that we don’t get the difference across ids. 

In [5]:
import pandas as pd 

df = pd.DataFrame({"date":["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                   "id": (["A"]*3 + ["B"]*3 + ["C"]*3),
                   "value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})
df.head()

| | date | id | value |
|---|---|---|---|
| 0 | 2021-01-01 | A | 3 |
| 1 | 2021-01-02 | A | 4 |
| 2 | 2021-01-03 | A | 2 |
| 3 | 2021-01-01 | B | 1 |
| 4 | 2021-01-02 | B | 2 |


Now we create a function that takes in a pd.DataFrame and outputs a pd.DataFrame. This will allow us to bring the logic to Spark and Dask as we’ve seen before.

In [6]:
def diff(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(diff=df['value'].diff())

But if we apply the function directly, as seen below, the first row of B has a value instead of NaN. This is wrong: the function used the last value from A to calculate the difference.

In [7]:
from fugue import transform
transform(df.copy(), 
          diff, 
          schema="*, diff:int").head()

| | date | id | value | diff |
|---|---|---|---|---|
| 0 | 2021-01-01 | A | 3 | NaN |
| 1 | 2021-01-02 | A | 4 | 1.0 |
| 2 | 2021-01-03 | A | 2 | -2.0 |
| 3 | 2021-01-01 | B | 1 | -1.0 |
| 4 | 2021-01-02 | B | 2 | 1.0 |


This is solved by passing a partition specification to Fugue’s transform(). As seen below, the first row of B now correctly shows NaN.

In [8]:
transform(df.copy(), 
          diff, 
          schema="*, diff:int",
          partition={"by": "id"}).head()

| | date | id | value | diff |
|---|---|---|---|---|
| 0 | 2021-01-01 | A | 3 | NaN |
| 1 | 2021-01-02 | A | 4 | 1.0 |
| 2 | 2021-01-03 | A | 2 | -2.0 |
| 3 | 2021-01-01 | B | 1 | NaN |
| 4 | 2021-01-02 | B | 2 | 1.0 |
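Conceptually, partitioning by `id` is similar to splitting the DataFrame into one piece per id, running the function on each piece independently, and combining the results. The pandas-only sketch below illustrates that idea. It is not how Fugue implements partitioning internally, and the `pieces` and `result` names are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                   "id": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
                   "value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})

def diff(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(diff=df["value"].diff())

# Split by id, apply diff to each group on its own, then stitch the groups back together.
pieces = [diff(group) for _, group in df.groupby("id")]
result = pd.concat(pieces)
print(result)
```

Because each group is processed in isolation, the first row of every id has no previous row to subtract from, so its `diff` is NaN. The function itself never needs to know about the grouping.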


## Partitioning Strategies

Fugue provides several partitioning strategies out of the box. There is also a newer one, called `coarse`, that is not shown in this diagram.

![img](https://miro.medium.com/v2/resize:fit:1400/0*5_v4ziLbsZztCavk)
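As a rough intuition for how strategies can differ, the pure-Python sketch below assigns rows to partitions in three ways. The names `hash`, `rand`, and `even` are assumed to correspond to Fugue's partitioning options; the functions themselves are hypothetical, row-level simplifications written for this illustration, not Fugue's implementation.

```python
import random
import zlib

keys = ["A", "A", "A", "A", "B", "B", "C", "C"]

def hash_partition(keys, num):
    # The same key always maps to the same partition, but sizes can be skewed.
    return [zlib.crc32(k.encode()) % num for k in keys]

def rand_partition(keys, num, seed=0):
    # Rows are scattered at random; equal keys are not necessarily kept together.
    rng = random.Random(seed)
    return [rng.randrange(num) for _ in keys]

def even_partition(keys, num):
    # Round-robin assignment: partition sizes differ by at most one row.
    return [i % num for i in range(len(keys))]

for name, assign in [("hash", hash_partition(keys, 3)),
                     ("rand", rand_partition(keys, 3)),
                     ("even", even_partition(keys, 3))]:
    sizes = [assign.count(p) for p in range(3)]
    print(name, assign, "sizes:", sizes)
```

The trade-off to notice is between keeping related rows together (hash) and balancing the amount of work per partition (even).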

## Best Practice

Something to be aware of is two-stage parallelism.

## Exercise