# `map_partitions`

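Beyond the communication-heavy computations described in {numref}`sec-dask-dataframe-shuffle`, there is a much simpler form of parallelism known as embarrassingly parallel. It refers to computations that need little or no coordination and communication across workers. For example, adding one to a column only requires each worker to perform the addition locally, so there is no communication overhead between workers. In Dask DataFrame, `map_partitions()` handles this kind of embarrassingly parallel operation. `map_partitions(func)` takes a function `func`, which is executed on every partition.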

The following example fills in missing values. It incurs no cross-worker communication, which makes it a typical embarrassingly parallel use case.

In [6]:
import os
import urllib.request
import shutil
from zipfile import ZipFile
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

folder_path = os.path.join(os.getcwd(), "../data/")
download_url_prefix = "https://gender-pay-gap.service.gov.uk/viewing/download-data/"
file_path_prefix = os.path.join(folder_path, "gender-pay")
if not os.path.exists(file_path_prefix):
    os.makedirs(file_path_prefix)
for year in [2017, 2018, 2019, 2020, 2021, 2022]:
    download_url = download_url_prefix + str(year)
    file_path = os.path.join(file_path_prefix, f"{str(year)}.csv")
    if not os.path.exists(file_path):
        with urllib.request.urlopen(download_url) as response, open(file_path, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

In [7]:
import dask.dataframe as dd
import pandas as pd
from dask.distributed import LocalCluster, Client

cluster = LocalCluster()
client = Client(cluster)


In [8]:
ddf = dd.read_csv(os.path.join(file_path_prefix, "*.csv"),
                  dtype={'CompanyNumber': 'str', 'DiffMeanHourlyPercent': 'float64'})

def fillna(df):
    # df is the pandas DataFrame of a single partition
    return df.fillna(value={"PostCode": "UNKNOWN"})

# Apply fillna to every partition independently
ddf = ddf.map_partitions(fillna)

Dask DataFrame mimics the pandas DataFrame API. When one of these APIs has an embarrassingly parallel computation pattern, it is very likely implemented with `map_partitions()` under the hood.

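As mentioned in {numref}`sec-dask-dataframe-indexing`, a Dask DataFrame is partitioned along a particular column. We can do anything we like inside the `func` passed to `map_partitions()`, but if `func` modifies the column used for partitioning, we need to call `clear_divisions()` or run `set_index()` again.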

In [9]:
ddf.clear_divisions()

Dask DataFrame Structure (npartitions=6):

| Column | dtype |
| --- | --- |
| EmployerName | string |
| EmployerId | int64 |
| Address | string |
| PostCode | string |
| CompanyNumber | string |
| SicCodes | string |
| DiffMeanHourlyPercent | float64 |
| DiffMedianHourlyPercent | float64 |
| DiffMeanBonusPercent | float64 |
| DiffMedianBonusPercent | float64 |
| MaleBonusPercent | float64 |
| FemaleBonusPercent | float64 |
| MaleLowerQuartile | float64 |
| FemaleLowerQuartile | float64 |
| MaleLowerMiddleQuartile | float64 |
| FemaleLowerMiddleQuartile | float64 |
| MaleUpperMiddleQuartile | float64 |
| FemaleUpperMiddleQuartile | float64 |
| MaleTopQuartile | float64 |
| FemaleTopQuartile | float64 |
| CompanyLinkToGPGInfo | string |
| ResponsiblePerson | string |
| EmployerSize | string |
| CurrentName | string |
| SubmittedAfterTheDeadline | bool |
| DueDate | string |
| DateSubmitted | string |


In [10]:
client.shutdown()