(sec-dask-dataframe-shuffle)=
# Shuffle

在分布式场景下，`sort`，`merge`，`groupby` 有可能会在不同 Worker 之间交换数据，即 Shuffle。这些 pandas 算子在单机上实现起来比较简单，但是在大数据分布式计算场景，实现起来并不简单。
Dask 在 `2023.1` 版本之后提供了一种新的 Shuffle 方法，可以加速大部分计算任务。

## `groupby`

{numref}`fig-dataframe-groupby` 展示了 `groupby` 在单机上的操作流程，它主要有三个阶段：分组、聚合、输出。分布式场景下，不同的数据分布在不同的 Partition 下。

```{figure} ../img/ch-dask-dataframe/groupby.svg
---
width: 600px
name: fig-dataframe-groupby
---
DataFrame groupby 示意图
```

* `groupby(indexed_columns).agg()` 和 `groupby(indexed_columns).apply(user_def_fn)` 性能最好。`indexed_columns` 指的是索引列 Key，`agg` 指的是 Dask DataFrame 提供的官方的 `sum`，`mean`，`nunique` 等聚合方法。因为 `indexed_columns` 是排过序的了，可以很快地对 `indexed_columns` 进行分组，Shuffle 数据量不大。
* `groupby(non_indexed_columns).agg()` 的数据交换量要更大一些，`agg` 是 Dask 官方提供的方法，做过一些优化。
* `groupby(non_indexed_columns).apply(user_def_fn)` 的成本最高。它既要对所有数据进行交换，又要执行用户自定义的函数，


In [1]:
import os
import urllib
import shutil
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

folder_path = os.path.join(os.getcwd(), "../data/")
download_url_prefix = "https://gender-pay-gap.service.gov.uk/viewing/download-data/"
file_path_prefix = os.path.join(folder_path, "gender-pay")
if not os.path.exists(file_path_prefix):
    os.makedirs(file_path_prefix)
for year in [2017, 2018, 2019, 2020, 2021, 2022]:
    download_url = download_url_prefix + str(year)
    file_path = os.path.join(file_path_prefix, f"{str(year)}.csv")
    if not os.path.exists(file_path):
        with urllib.request.urlopen(download_url) as response, open(file_path, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

In [2]:
import dask.dataframe as dd
import pandas as pd
from dask.distributed import LocalCluster, Client

cluster = LocalCluster()
client = Client(cluster)

In a future release, Dask DataFrame will use new implementation that
contains several improvements including a logical query planning.
The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by
installing the dask-expr library:

    $ pip install dask-expr

and turning the query planning option on:

    >>> import dask
    >>> dask.config.set({'dataframe.query-planning': True})
    >>> import dask.dataframe as dd

API documentation for the new implementation is available at
https://docs.dask.org/en/stable/dask-expr-api.html

Any feedback can be reported on the Dask issue tracker
https://github.com/dask/dask/issues 

  import dask.dataframe as dd


In [3]:
ddf = dd.read_csv(os.path.join(file_path_prefix, "*.csv"),
                  dtype={'EmployerSize': 'str',
                         'DiffMeanHourlyPercent': 'float64'})

def fillna(df):
    return df.fillna(value={"PostCode": "UNKNOWN"})

ddf = ddf[["PostCode", "EmployerSize", "DiffMeanHourlyPercent"]]
ddf = ddf.dropna()
# ddf = ddf.map_partitions(fillna)
ddf.head(5)
        # .map_partitions(update_empsize_to_median)
# ddf = ddf.map_partitions(update_empsize_to_median)

Unnamed: 0,PostCode,EmployerSize,DiffMeanHourlyPercent
0,DT11 0PX,500 to 999,18.0
1,EH6 8NU,250 to 499,2.3
2,LS7 1AB,250 to 499,41.0
3,TA6 3JA,250 to 499,-22.0
4,SR5 1SU,250 to 499,13.4


In [44]:
ddf.dtypes

PostCode                 string[pyarrow]
EmployerSize             string[pyarrow]
DiffMeanHourlyPercent            float64
dtype: object

In [4]:
# ddf['EmployerSize']=ddf['EmployerSize'].astype(str)
def update_empsize_to_median(df):
    def to_median(value):
        if isinstance(value, str):
            if " to " in value:
                f , t = value.replace(",", "").split(" to ")
                return (int(f) + int(t)) / 2.0
            elif "Less than" in value:
                return 100
            else:
                return 10000
        else:
            return 0
    df["EmployerSize"] = df["EmployerSize"].apply(to_median)
    return df

try:
    ddf = ddf.map_partitions(update_empsize_to_median)
except Exception as e:
    print(f"{type(e).__name__}, {e}")

In [31]:
ddf['PostCodeLength'] = ddf['PostCode'].str.len()
ddf.head(5)

Unnamed: 0,PostCode,EmployerSize,DiffMeanHourlyPercent,PostCodeLength
0,DT11 0PX,749.5,18.0,8
1,EH6 8NU,374.5,2.3,7
2,LS7 1AB,374.5,41.0,7
3,TA6 3JA,374.5,-22.0,7
4,SR5 1SU,374.5,13.4,7


In [40]:
d = ddf.groupby('PostCode')['DiffMeanHourlyPercent'].mean()
d.compute()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


PostCode
 BS34 7QH     7.375000
 DE14 2EB     0.070000
 DN15 6NL     3.666667
 OX1 4BH     17.460000
 S65 1EG     14.716667
               ...    
WS11 0DJ     14.600000
WS13 8EL     -4.800000
WV10 8DS     24.330000
WV6 8DA      18.100000
YO25 8EJ     12.930000
Name: DiffMeanHourlyPercent, Length: 9118, dtype: float64

In [5]:
def process_chunk(chunk):
    def weighted_func(df):
        return (df["EmployerSize"] * df["DiffMeanHourlyPercent"]).sum()
    return (chunk.apply(weighted_func), chunk["EmployerSize"].sum())

def agg(total, weights):
    return (total.sum(), weights.sum())

def finalize(total, weights):
    return total / weights
    
weighted_mean = dd.Aggregation(
    name='weighted_mean',
    chunk=process_chunk,
    agg=agg,
    finalize=finalize)

aggregated = ddf.groupby("PostCode")["EmployerSize", "DiffMeanHourlyPercent"].agg(weighted_mean)
aggregated.head(10)

KeyError: 'EmployerSize'

In [39]:
ddf.groupby("PostCode")['DiffMeanHourlyPercent'].sum()
d = ddf.compute()
d

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,PostCode,EmployerSize,DiffMeanHourlyPercent,PostCodeLength
0,DT11 0PX,749.5,18.0,8
1,EH6 8NU,374.5,2.3,7
2,LS7 1AB,374.5,41.0,7
3,TA6 3JA,374.5,-22.0,7
4,SR5 1SU,374.5,13.4,7
...,...,...,...,...
10838,SN1 1AP,2999.5,25.9,7
10839,PO15 7JZ,2999.5,16.4,8
10840,SK11 0LP,374.5,18.9,8
10841,SY5 0BD,749.5,20.0,7


In [10]:
custom_mean = dd.Aggregation(
    name='custom_mean',
    chunk=lambda s: (s['EmployerSize'].count(), s['DiffMeanHourlyPercent'].sum()),
    agg=lambda count, sum: (count.sum(), sum.sum()),
    finalize=lambda count, sum: sum / count,
)  
a = ddf.groupby('PostCode').agg(custom_mean)
a.head(5) 

ValueError: Metadata inference failed in `_groupby_apply_funcs`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
IndexError('Column(s) DiffMeanHourlyPercent already selected')

Traceback:
---------
  File "/Users/luweizheng/miniconda3/envs/dispy/lib/python3.11/site-packages/dask/dataframe/utils.py", line 194, in raise_on_meta_error
    yield
  File "/Users/luweizheng/miniconda3/envs/dispy/lib/python3.11/site-packages/dask/dataframe/core.py", line 7057, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dispy/lib/python3.11/site-packages/dask/dataframe/groupby.py", line 1194, in _groupby_apply_funcs
    r = func(grouped, **func_kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dispy/lib/python3.11/site-packages/dask/dataframe/groupby.py", line 1240, in _apply_func_to_column
    return func(df_like[column])
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_50604/416441874.py", line 3, in <lambda>
    chunk=lambda s: (s['EmployerSize'].count(), s['DiffMeanHourlyPercent'].sum()),
                     ~^^^^^^^^^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dispy/lib/python3.11/site-packages/pandas/core/base.py", line 234, in __getitem__
    raise IndexError(f"Column(s) {self._selection} already selected")


In [11]:
na_rows.head(3)

Unnamed: 0,EmployerName,EmployerId,Address,PostCode,CompanyNumber,SicCodes,DiffMeanHourlyPercent,DiffMedianHourlyPercent,DiffMeanBonusPercent,DiffMedianBonusPercent,...,FemaleUpperMiddleQuartile,MaleTopQuartile,FemaleTopQuartile,CompanyLinkToGPGInfo,ResponsiblePerson,EmployerSize,CurrentName,SubmittedAfterTheDeadline,DueDate,DateSubmitted
1,"""RED BAND"" CHEMICAL COMPANY, LIMITED",16879,"19 Smith's Place, Leith Walk, Edinburgh, EH6 8NU",EH6 8NU,SC016876,47730,2.3,-2.7,15.0,37.5,...,89.7,18.1,81.9,,Philip Galt (Managing Director),250 to 499,"""RED BAND"" CHEMICAL COMPANY, LIMITED",False,2018/04/05 00:00:00,2018/03/28 16:44:25
2,123 EMPLOYEES LTD,17677,"34 Roundhay Road, Leeds, England, LS7 1AB",LS7 1AB,10530651,78300,41.0,36.0,-69.8,-157.2,...,89.0,23.0,77.0,,Chloe Lines (Financial Controller),250 to 499,123 EMPLOYEES LTD,True,2018/04/05 00:00:00,2018/05/04 11:24:06
7,1STOP HALAL LIMITED,689,"Colmore Court, 9 Colmore Row, Birmingham, West...",B3 2BJ,08929070,56290,11.9,0.0,0.0,0.0,...,41.9,69.8,30.2,,Stephen Elder (Finance Director),250 to 499,SHAZAN FOODS LIMITED,False,2018/04/05 00:00:00,2018/03/22 08:08:33


In [5]:
def process_chunk(chunk):
    def weighted_func(df):
        return (df["EmployerSize"] * df["DiffMeanHourlyPercent"]).sum()
    return (chunk.apply(weighted_func), chunk.sum()["EmployerSize"])
        
def agg(total, weights):
    return (total.sum(), weights.sum())

def finalize(total, weights):
    return total / weights
    
weighted_mean = dd.Aggregation(
    name='weighted_mean',
    chunk=process_chunk,
    agg=agg,
    finalize=finalize)

aggregated = ddf.groupby("PostCode")["EmployerSize", "DiffMeanHourlyPercent"].agg(weighted_mean)

Unnamed: 0_level_0,Date,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,Diverted
npartitions=6,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
,datetime64[ns],int64,float64,int64,float64,int64,string,int64,float64,float64,int64,float64,float64,float64,string,string,float64,float64,float64,int64,int64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [58]:
client.shutdown()

In [47]:
data = {'PostCode': ['A', 'A', 'B', 'B'],
        'EmployerSize': [100, 200, 300, 400],
        'DiffMeanHourlyPercent': [0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data)
df

Unnamed: 0,PostCode,EmployerSize,DiffMeanHourlyPercent
0,A,100,0.1
1,A,200,0.2
2,B,300,0.3
3,B,400,0.4


In [49]:
df.sum()["EmployerSize"]

1000

In [50]:
df['EmployerSize'].sum()

1000

In [10]:
data = {'PostCode': ['A', 'A', 'B', 'B'], 'EmployerSize': [100, 200, 300, 400], 'DiffMeanHourlyPercent': [0.1, 0.2, 0.3, 0.4]}

df = pd.DataFrame(data)
ddf = dd.from_pandas(df, 2)
ddf.compute()

Unnamed: 0,PostCode,EmployerSize,DiffMeanHourlyPercent
0,A,100,0.1
1,A,200,0.2
2,B,300,0.3
3,B,400,0.4


In [13]:
def weighted_mean(data):
    print(data)
    return (data['EmployerSize'] * data['DiffMeanHourlyPercent']).sum() / data['EmployerSize'].sum()

result = df.groupby('PostCode').apply(weighted_mean).reset_index(name='WeightedMean')
result

  PostCode  EmployerSize  DiffMeanHourlyPercent
0        A           100                    0.1
1        A           200                    0.2
  PostCode  EmployerSize  DiffMeanHourlyPercent
2        B           300                    0.3
3        B           400                    0.4


  result = df.groupby('PostCode').apply(weighted_mean).reset_index(name='WeightedMean')


Unnamed: 0,PostCode,WeightedMean
0,A,0.166667
1,B,0.357143


In [10]:
df["Weighted"] = df["EmployerSize"] * df["DiffMeanHourlyPercent"]
df = df.groupby("PostCode").sum()
df = df["Weighted"] / df["EmployerSize"]
df

PostCode
A    0.166667
B    0.357143
dtype: float64

In [12]:
def chunk(chunk):
    def weighted_func(df):
        return (df["EmployerSize"] * df["DiffMeanHourlyPercent"]).sum()
    return (chunk.apply(weighted_func), chunk.sum()["EmployerSize"])

def agg(total, weights):
    return (total.sum(), weights.sum())
    # return chunk_maxes.max(), chunk_mins.min()

def finalize(total, weights):
    return total / weights

extent = dd.Aggregation('extent', chunk, agg, finalize=finalize)
ddf.groupby("PostCode")[['EmployerSize', 'DiffMeanHourlyPercent']].agg(extent).compute()

ValueError: Metadata inference failed in `_groupby_apply_funcs`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
KeyError('EmployerSize')

Traceback:
---------
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/dask/dataframe/utils.py", line 194, in raise_on_meta_error
    yield
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/dask/dataframe/core.py", line 7174, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/dask/dataframe/groupby.py", line 1190, in _groupby_apply_funcs
    r = func(grouped, **func_kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/dask/dataframe/groupby.py", line 1266, in _apply_func_to_column
    return func(df_like[column])
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_6760/1210873169.py", line 4, in chunk
    return (chunk.apply(weighted_func), chunk.sum()["EmployerSize"])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/pandas/core/groupby/generic.py", line 230, in apply
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1824, in apply
    Series or DataFrame
                     ^^^
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1885, in _python_apply_general
    """
        
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/pandas/core/groupby/ops.py", line 919, in apply_groupwise
  File "/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_6760/1210873169.py", line 3, in weighted_func
    return (df["EmployerSize"] * df["DiffMeanHourlyPercent"]).sum()
            ~~^^^^^^^^^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/pandas/core/series.py", line 1112, in __getitem__
    ) from err
       ^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/pandas/core/series.py", line 1228, in _get_value
    return getattr(self, "_cacher", None) is not None
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/luweizheng/miniconda3/envs/dask/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc


In [7]:
df = pd.DataFrame({
  'PostCode': ['a', 'b', 'a', 'a', 'b', 'c', 'd'],
  'Size': [0, 1, 0, 2, 5, 3, 4],
  'DiffMeanHourlyPercent': [0.1, 0.2, 0.3, 0.4, 0.5, 0.1, 0.2],
})
ddf = dd.from_pandas(df, 1)

In [9]:
def chunk(grouped):
    print(grouped.describe())
    return grouped.max(), grouped.min()

def agg(chunk_maxes, chunk_mins):
    return chunk_maxes.max(), chunk_mins.min()

def finalize(maxima, minima):
    return maxima - minima

extent = dd.Aggregation('extent', chunk, agg, finalize=finalize)
ddf.groupby('PostCode')[["Size", "DiffMeanHourlyPercent"]].agg(extent).compute()

          count  mean  std  min  25%  50%  75%  max
PostCode                                           
foo         2.0   1.0  0.0  1.0  1.0  1.0  1.0  1.0
          count  mean  std  min  25%  50%  75%  max
PostCode                                           
foo         2.0   1.0  0.0  1.0  1.0  1.0  1.0  1.0
          count      mean       std  min    25%   50%    75%  max
PostCode                                                         
a           3.0  0.266667  0.152753  0.1  0.200  0.30  0.350  0.4
b           2.0  0.350000  0.212132  0.2  0.275  0.35  0.425  0.5
c           1.0  0.100000       NaN  0.1  0.100  0.10  0.100  0.1
d           1.0  0.200000       NaN  0.2  0.200  0.20  0.200  0.2
          count      mean       std  min  25%  50%  75%  max
PostCode                                                    
a           3.0  0.666667  1.154701  0.0  0.0  0.0  1.0  2.0
b           2.0  3.000000  2.828427  1.0  2.0  3.0  4.0  5.0
c           1.0  3.000000       NaN  3.0  3.0  3

Unnamed: 0_level_0,Size,DiffMeanHourlyPercent
PostCode,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2,0.3
b,4,0.3
c,0,0.0
d,0,0.0


In [3]:
df = pd.DataFrame({
  'A': ['a', 'b', 'a', 'a', 'b'],
  'B': [0, 1, 0, 2, 5],
})
ddf = dd.from_pandas(df, 2)

In [4]:
def chunk(grouped):
    print(grouped)
    return grouped.max(), grouped.min()

def agg(chunk_maxes, chunk_mins):
    return chunk_maxes.max(), chunk_mins.min()

def finalize(maxima, minima):
    return maxima - minima

In [6]:
extent = dd.Aggregation('extent', chunk, agg, finalize=finalize)
ddf.groupby('A').agg(extent).compute()

KeyError: 'A'

In [14]:
client.shutdown()

In [67]:
df = pd.DataFrame({
   'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
   'C': [1, 2, 3, 4, 5, 6, 7, 8],
   'D': [10, 20, 30, 40, 50, 60, 70, 80]
})

# 根据 'A' 列进行分组
grouped = df.groupby('A')
print(grouped)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x2a713c590>


In [2]:
client.shutdown()

NameError: name 'client' is not defined