# Big Data Problems are Composed of Several Small Data Problems

[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)

Partitioning is one of the fundamental concepts in distributed computing. In the local setting, all of the data fits on one machine. In the distributed setting, the data is spread across multiple machines and often has to be moved around to perform an operation (which is called a shuffle). This is a subtle but important difference.

Fugue abstracts partitioning so that users have to worry about it less. Still, it's important to understand the concept to use distributed computing effectively.

## Real Scenarios

In the previous section, we ended with a scikit-learn pipeline that was applied to each group of data. When you have a big dataset, there is a high chance that it contains logical groupings that can serve as the basis for parallelization. Can we tackle the problem per user? Per region? Per item in a store? It's very common to see that machine learning models for one region do not translate well to another region. It's also very common to have data stored in the same column that is in different units (for example, currencies).

Once we break up big data into small problems, we can focus on the logic for one group of data, iterate quickly, and then scale with Fugue.

## Simple Partitioning Example

In the DataFrame below, we want to take the difference of the value per day. Because there are three different ids, we want to make sure that we don’t get the difference across ids. 

In [1]:
import pandas as pd 

df = pd.DataFrame({"date":["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
                   "id": (["A"]*3 + ["B"]*3 + ["C"]*3),
                   "value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})
df.head()

Unnamed: 0,date,id,value
0,2021-01-01,A,3
1,2021-01-02,A,4
2,2021-01-03,A,2
3,2021-01-01,B,1
4,2021-01-02,B,2


Now we create a function that takes in a pd.DataFrame and outputs a pd.DataFrame. This will allow us to bring the logic to Spark and Dask as we’ve seen before.

In [2]:
def diff(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(diff=df['value'].diff())

But if we apply the function directly, as shown below, the first row of B has a value instead of NaN. This is wrong because the function used the last value of A to calculate the difference.

In [3]:
from fugue import transform
transform(df.copy(), 
          diff, 
          schema="*, diff:int").head()

Unnamed: 0,date,id,value,diff
0,2021-01-01,A,3,
1,2021-01-02,A,4,1.0
2,2021-01-03,A,2,-2.0
3,2021-01-01,B,1,-1.0
4,2021-01-02,B,2,1.0


This is solved by passing a `partition` to Fugue’s `transform()`. As shown below, the first value of B is now correctly NaN.

In [4]:
transform(df.copy(), 
          diff, 
          schema="*, diff:int",
          partition={"by": "id"}).head()

Unnamed: 0,date,id,value,diff
0,2021-01-01,A,3,
1,2021-01-02,A,4,1.0
2,2021-01-03,A,2,-2.0
3,2021-01-01,B,1,
4,2021-01-02,B,2,1.0
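
To build intuition for what partitioning by `id` changes, here is a minimal plain-Python sketch of the same computation. It is only an illustration of the idea, not how Fugue implements `transform()`; the helper name `diff_values` is made up for this example.

```python
from itertools import groupby
from operator import itemgetter

# Same values as the DataFrame above, as plain (date, id, value) tuples.
rows = [
    ("2021-01-01", "A", 3), ("2021-01-02", "A", 4), ("2021-01-03", "A", 2),
    ("2021-01-01", "B", 1), ("2021-01-02", "B", 2), ("2021-01-03", "B", 5),
    ("2021-01-01", "C", 3), ("2021-01-02", "C", 2), ("2021-01-03", "C", 3),
]

def diff_values(values):
    """Difference with the previous element; the first element has none."""
    return [None] + [b - a for a, b in zip(values, values[1:])]

# Without partitioning: one pass over all rows leaks A's last value into B.
naive = diff_values([value for _, _, value in rows])

# With partitioning: the same logic runs once per id, on that id's rows only.
by_id = itemgetter(1)
partitioned = []
for _, group in groupby(sorted(rows, key=by_id), key=by_id):
    partitioned.extend(diff_values([value for _, _, value in group]))

print(naive[3])        # -1, B's first row wrongly uses A's last value
print(partitioned[3])  # None, B's first row has no previous value
```

The function itself never changes. Only the set of rows it sees changes, which is why the logic can be written and tested on a single group.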


## Presort

We mentioned previously that Spark does not guarantee order. For operations like `diff` above, we need to make sure the records are ordered by date.

In [5]:
transform(df.copy(), 
          diff, 
          schema="*, diff:int",
          partition={"by": "id", "presort": "date asc"}).head()

Unnamed: 0,date,id,value,diff
0,2021-01-01,A,3,
1,2021-01-02,A,4,1.0
2,2021-01-03,A,2,-2.0
3,2021-01-01,B,1,
4,2021-01-02,B,2,1.0
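
To see why the presort matters, here is a small plain-Python sketch for a single id whose rows arrive out of order. The shuffled input is a made-up example, and sorting in Python only stands in for what `presort` asks the engine to do.

```python
# Rows for a single id, deliberately out of date order, as a distributed
# engine might deliver them.
unordered = [("2021-01-03", 2), ("2021-01-01", 3), ("2021-01-02", 4)]

def diff_values(values):
    """Difference with the previous element; the first element has none."""
    return [None] + [b - a for a, b in zip(values, values[1:])]

# Without a presort, the differences follow arrival order, not date order.
wrong = diff_values([value for _, value in unordered])

# ISO-formatted dates sort correctly as plain strings.
presorted = sorted(unordered, key=lambda row: row[0])
right = diff_values([value for _, value in presorted])

print(wrong)  # [None, 1, 1]
print(right)  # [None, 1, -2]
```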


The difference between groupby-apply and partition-transform is that the partition-transform semantic is much more scalable because it accounts for the physical location of the data. In addition to the partition column, we can also define a partitioning strategy.

## Partitioning Strategies

Fugue has some partitioning strategies available off the shelf. There is also a new one not in this diagram called `coarse`.

![img](https://miro.medium.com/v2/resize:fit:1400/0*5_v4ziLbsZztCavk)

Here is an example of how to use a partitioning strategy alongside the `by` argument.

In [6]:
transform(df.copy(), 
          diff, 
          schema="*, diff:int",
          partition={"by": "id", "presort": "date asc","algo": "hash"}).head()

Unnamed: 0,date,id,value,diff
0,2021-01-01,A,3,
1,2021-01-02,A,4,1.0
2,2021-01-03,A,2,-2.0
3,2021-01-01,B,1,
4,2021-01-02,B,2,1.0
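
As a rough intuition for hash partitioning, the sketch below maps each key to a partition index with a deterministic hash. This is an assumption-laden illustration, not how Fugue or Spark implement `hash` internally; `partition_of` is a hypothetical helper and `zlib.crc32` is used only because it is stable across runs.

```python
import zlib

def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic hash of the key, reduced to a partition index."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

ids = ["A", "B", "C", "A", "B", "C", "A", "B", "C"]

# Map each partition index to the (row_number, key) pairs it receives.
placement = {}
for row_number, key in enumerate(ids):
    placement.setdefault(partition_of(key, 2), []).append((row_number, key))

for partition, members in sorted(placement.items()):
    print(partition, members)
```

Every occurrence of a key lands in the same partition, which is what makes per-key logic possible. Nothing forces the partitions to hold the same number of keys, though, so the load can be uneven.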


## Scaling to Big Data

Again, if the `transform()` code works on the local engine, it will also work on Spark, Dask, and Ray by just passing the engine.

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

transform(df.copy(), 
          diff, 
          schema="*, diff:int",
          partition={"by": "id", "presort": "date asc","algo": "hash"},
          engine=spark).show()

+----------+---+-----+----+
|      date| id|value|diff|
+----------+---+-----+----+
|2021-01-01|  B|    1|null|
|2021-01-02|  B|    2|   1|
|2021-01-03|  B|    5|   3|
|2021-01-01|  C|    3|null|
|2021-01-02|  C|    2|  -1|
|2021-01-03|  C|    3|   1|
|2021-01-01|  A|    3|null|
|2021-01-02|  A|    4|   1|
|2021-01-03|  A|    2|  -2|
+----------+---+-----+----+



## Even Partitions

Even partitioning is interesting because it's a perfect fit for problems with small data but heavy compute. For example, if we have 4 machines and 4 models to run, we can assign 1 model to each machine. Doing this by hand is not trivial and takes a lot of extra code, which Fugue handles on its side.
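
The idea of an even assignment can be sketched in a few lines of plain Python. This is a simplified round-robin illustration under our own assumptions, not Fugue's actual algorithm; `assign_evenly` and the model names are hypothetical.

```python
def assign_evenly(keys, num_workers):
    """Round-robin keys over workers so the counts differ by at most one."""
    assignment = {worker: [] for worker in range(num_workers)}
    for position, key in enumerate(sorted(keys)):
        assignment[position % num_workers].append(key)
    return assignment

models = ["model_a", "model_b", "model_c", "model_d"]
print(assign_evenly(models, 4))
# {0: ['model_a'], 1: ['model_b'], 2: ['model_c'], 3: ['model_d']}
```

With a hash-based placement, two of the four models could end up on the same machine while another machine sits idle. An even placement avoids that.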

In the example below, we run an AutoARIMA model for each time series.

In [8]:
from statsforecast.utils import generate_series
from statsforecast.models import AutoARIMA
from statsforecast.core import StatsForecast

series = generate_series(n_series=4, seed=1).reset_index()
series['unique_id'] = series['unique_id'].astype(int)

def run_forecast(df: pd.DataFrame) -> pd.DataFrame:
    # Fit on the partition passed in (df), not on the global `series`
    model = StatsForecast(df=df,
                        models=[AutoARIMA()], 
                        freq='D', 
                        n_jobs=1)
    return model.forecast(7).reset_index()

series.head()


Unnamed: 0,unique_id,ds,y
0,0,2000-01-01,5.7e-05
1,0,2000-01-02,1.151166
2,0,2000-01-03,2.073378
3,0,2000-01-04,3.046169
4,0,2000-01-05,4.09313


In [9]:
forecasts = transform(series, 
                      run_forecast, 
                      schema="unique_id:int, ds:date, AutoARIMA:float", 
                      partition={"by": "unique_id", "presort": "ds asc", "algo": "even"},
                      engine=spark)
forecasts.show(14)

+---------+----------+----------+
|unique_id|        ds| AutoARIMA|
+---------+----------+----------+
|        0|2000-03-28| 1.8282341|
|        0|2000-03-29| 1.4334649|
|        0|2000-03-30|  1.123938|
|        0|2000-03-31|0.88124686|
|        0|2000-04-01|0.69095993|
|        0|2000-04-02|0.54176146|
|        0|2000-04-03|0.42477933|
|        1|2000-10-12|0.99109846|
|        1|2000-10-13| 1.0415428|
|        1|2000-10-14|-1.8498392|
|        1|2000-10-15|       0.0|
|        1|2000-10-16|       0.0|
|        1|2000-10-17|       0.0|
|        1|2000-10-18|       0.0|
+---------+----------+----------+
only showing top 14 rows



## Advanced - Running Compute Intensive Jobs

Taking this one step further, we can even create a DataFrame of jobs. Pickling converts a Python object into a binary representation, so both the data and the function to run on it can be stored in DataFrame columns. We can create such a DataFrame of pickled data easily with Fugue.

In [27]:
import cloudpickle
from typing import List, Iterable, Dict, Any

def serialize(df: pd.DataFrame) -> Iterable[Dict[str,Any]]:
    yield {"unique_id": df.iloc[0]['unique_id'],
           "data": cloudpickle.dumps(df), 
           "fn": cloudpickle.dumps(run_forecast)}

serialized = transform(series, serialize, schema="unique_id:int,data:binary,fn:binary", partition={"by": "unique_id"})
serialized.head()

Unnamed: 0,unique_id,data,fn
0,0,b'\x80\x05\x95\x1b\x0c\x00\x00\x00\x00\x00\x00...,b'\x80\x05\x95\xea^\x00\x00\x00\x00\x00\x00\x8...
1,1,b'\x80\x05\x95\xaf\x1e\x00\x00\x00\x00\x00\x00...,b'\x80\x05\x95\xea^\x00\x00\x00\x00\x00\x00\x8...
2,2,b'\x80\x05\x95\xc7-\x00\x00\x00\x00\x00\x00\x8...,b'\x80\x05\x95\xea^\x00\x00\x00\x00\x00\x00\x8...
3,3,b'\x80\x05\x95c\x0f\x00\x00\x00\x00\x00\x00\x8...,b'\x80\x05\x95\xea^\x00\x00\x00\x00\x00\x00\x8...
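
The round trip behind these binary columns can be sketched with the standard library's `pickle`. The cell above uses `cloudpickle`, which exposes the same `dumps`/`loads` interface and can also serialize functions defined in a notebook. The `job_input` dictionary is a made-up stand-in for one partition's data.

```python
import pickle

# Hypothetical payload standing in for one partition's data.
job_input = {"unique_id": 0, "y": [0.1, 1.2, 2.1]}

blob = pickle.dumps(job_input)   # Python object -> bytes
restored = pickle.loads(blob)    # bytes -> an equal Python object

print(type(blob).__name__, restored == job_input)
# bytes True
```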


In [28]:
def run_job(df: List[Dict[str,Any]]) -> pd.DataFrame:
    row = df[0]
    series = cloudpickle.loads(row['data'])
    forecaster = cloudpickle.loads(row['fn'])
    result = forecaster(series)
    return result

results = transform(serialized, 
                    run_job, 
                    schema="unique_id:int,ds:date,AutoARIMA:float", 
                    partition={"by": "unique_id", "algo": "even"},
                    engine=spark)
results.toPandas().head()


Unnamed: 0,unique_id,ds,AutoARIMA
0,0,2000-03-28,1.828234
1,0,2000-03-29,1.433465
2,0,2000-03-30,1.123938
3,0,2000-03-31,0.881247
4,0,2000-04-01,0.69096


## Best Practice

First, be aware of two-stage parallelism. Note that we set `n_jobs` to 1 in the time series forecasting function above. The engine already runs partitions in parallel, so adding a second level of parallelism inside each job can cause resource contention and actually slow the jobs down.

Second, it's much better to minimize data transfer by passing a path to a file rather than passing the data itself. The goal is for the worker to load the file directly.

Third, it's more stable to use `cloudpickle` over `pickle`.
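
The second point, passing paths instead of data, can be sketched in plain Python. The file layout, the helper names (`write_group_files`, `run_job`), and the mean computation are all assumptions made for illustration. On a real cluster the paths would need to point to storage that every worker can read.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"unique_id": "0", "y": "1.0"},
    {"unique_id": "0", "y": "2.0"},
    {"unique_id": "1", "y": "5.0"},
]

def write_group_files(rows, directory):
    """Write one CSV per unique_id and return a small table of paths."""
    groups = {}
    for row in rows:
        groups.setdefault(row["unique_id"], []).append(row)
    jobs = []
    for uid, group in sorted(groups.items()):
        path = Path(directory) / f"series_{uid}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["unique_id", "y"])
            writer.writeheader()
            writer.writerows(group)
        jobs.append({"unique_id": uid, "path": str(path)})
    return jobs

def run_job(job):
    """Worker side: load the file named in the job row, then compute."""
    with open(job["path"], newline="") as f:
        values = [float(r["y"]) for r in csv.DictReader(f)]
    return {"unique_id": job["unique_id"], "mean": sum(values) / len(values)}

with tempfile.TemporaryDirectory() as tmp:
    jobs = write_group_files(rows, tmp)   # tiny table: one row per job
    results = [run_job(job) for job in jobs]

print(results)
# [{'unique_id': '0', 'mean': 1.5}, {'unique_id': '1', 'mean': 5.0}]
```

The table that gets distributed holds only an id and a path per job, so very little data moves between the driver and the workers.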

## Exercise - Validation by Partition

In this exercise, we use the open-source [Pandera](https://github.com/unionai-oss/pandera) library for data validation. We have two different regions and want to validate each separately.

In [29]:
data = pd.DataFrame(
    {
        'state': ['FL','FL','FL','CA','CA','CA'],
        'city': [
            'Orlando', 'Miami', 'Tampa', 'San Francisco', 'Los Angeles', 'San Diego'
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)

In [30]:
from pandera import Column, DataFrameSchema, Check

price_check_FL = DataFrameSchema({
    "price": Column(int, Check.in_range(min_value=7,max_value=13)),
})

price_check_CA = DataFrameSchema({
    "price": Column(int, Check.in_range(min_value=15,max_value=21)),
})

price_checks = {'CA': price_check_CA, 'FL': price_check_FL}

These checks can be applied as follows:

In [32]:
price_check_CA.validate(data.loc[data['state'] == 'CA'])

Unnamed: 0,state,city,price
3,CA,San Francisco,16
4,CA,Los Angeles,20
5,CA,San Diego,18


But of course, it will fail when applied to FL:

In [34]:
try:
    price_check_CA.validate(data.loc[data['state'] == 'FL'])
except Exception as e:
    print(e)

<Schema Column(name=price, type=DataType(int64))> failed element-wise validator 0:
<Check in_range: in_range(15, 21)>
failure cases:
   index  failure_case
0      0             8
1      1            12
2      2            10


How do we run the appropriate validator for each region in parallel with Spark?

1. Create a `price_validation` function
2. Get the state of the partition
3. Pull the appropriate validator
4. Run the validation
5. Parallelize with the `transform()` function
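
As a starting point, steps 1 to 4 can be outlined in plain Python. This sketch deliberately uses simple range checks (`price_ranges`) as stand-ins for the Pandera schemas in `price_checks`, and a manual loop instead of Fugue. Swapping in the real validators and parallelizing with `transform()` is left as the exercise.

```python
# Stand-in range checks keyed by state; in the exercise these would be the
# Pandera schemas stored in `price_checks`.
price_ranges = {"CA": (15, 21), "FL": (7, 13)}

def price_validation(rows):
    """Validate ONE partition; all rows are assumed to share one state."""
    state = rows[0]["state"]             # step 2: state of the partition
    low, high = price_ranges[state]      # step 3: pull the validator
    bad = [r for r in rows if not low <= r["price"] <= high]
    if bad:                              # step 4: run the validation
        raise ValueError(f"{state}: prices out of range: {bad}")
    return rows

data = [
    {"state": "FL", "city": "Orlando", "price": 8},
    {"state": "FL", "city": "Miami", "price": 12},
    {"state": "FL", "city": "Tampa", "price": 10},
    {"state": "CA", "city": "San Francisco", "price": 16},
    {"state": "CA", "city": "Los Angeles", "price": 20},
    {"state": "CA", "city": "San Diego", "price": 18},
]

# Manual partitioning by state, standing in for partition={"by": "state"}.
partitions = {}
for row in data:
    partitions.setdefault(row["state"], []).append(row)

validated = [row for rows in partitions.values() for row in price_validation(rows)]
print(len(validated))  # 6
```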

## Conclusion

In this section, we learned about partitions. Pandas doesn't have an innate concept of partitioning, so sticking to a Pandas-only mindset can keep us from fully utilizing a distributed system.

Fugue's partitioning also allows users to focus on the logic of one partition. When ready to scale, we just run the logic across all partitions.