(sec-dask-expr)=
# Dask Expressions

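In data engineering, systems such as Spark and SQL databases share a set of well-established optimization techniques, for example predicate pushdown. These techniques are collectively known as query optimization. They have been studied in depth by the data engineering community and can substantially speed up data processing.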

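Early versions of Dask DataFrame did not apply these optimizations. Starting in 2023, Dask DataFrame introduced Dask Expressions (dask-expr), which is dedicated to query optimization and speeds up data processing. Dask Expressions keeps the Dask DataFrame API: users write the same code as before, and Dask performs query optimization behind the scenes.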

After installation, the package can be imported directly with `import dask_expr as dd`. To distinguish it from `dask.dataframe`, you can also use `import dask_expr as dx`.

In [1]:
!pip install dask-expr

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


In [2]:
import dask.dataframe as dd
import dask_expr as dx

In [3]:
import os
import shutil
import urllib.request

folder_path = os.path.join(os.getcwd(), "../data/nyc-taxi")
download_url = [
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-04.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-05.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-06.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-07.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-08.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-09.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-11.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-12.parquet"
]
# Create the target folder if it does not exist yet.
os.makedirs(folder_path, exist_ok=True)
for url in download_url:
    file_name = url.split("/")[-1]
    parquet_file_path = os.path.join(folder_path, file_name)
    # Skip files that were already downloaded.
    if not os.path.exists(parquet_file_path):
        with urllib.request.urlopen(url) as response, open(parquet_file_path, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

Let's first look at an example that is not optimized:

In [18]:
%%timeit
ddf = dd.read_parquet(os.path.join(folder_path, "*.parquet"))
payment_filtered = (
    ddf[ddf.payment_type == 1]['tip_amount']
)
payment_filtered_mean = payment_filtered.mean().compute()

284 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


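This code first reads all the data into memory, then keeps only the rows whose `payment_type` is 1, and finally uses only the `tip_amount` column. Both the row filter and the column selection could have been applied earlier, while the data is being read. This is the idea behind predicate pushdown in query optimization (for columns, the same idea is usually called projection pushdown): filter the data as close to the read as possible, so that data which will never be used is not loaded at all.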
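The difference between filtering after the read and filtering during the read can be illustrated with a minimal, standard-library sketch. It is only a toy: the CSV text stands in for a Parquet file, the values are made up, and the helper names `read_then_filter` and `filter_while_reading` are hypothetical. Unlike Parquet, a CSV parser still has to scan every line, so the counters below only track how many values are kept in memory.

```python
import csv
import io

# Toy data standing in for a Parquet file (hypothetical values).
RAW = """payment_type,tip_amount,trip_distance
1,2.0,1.5
2,0.0,3.2
1,4.0,0.8
3,0.0,5.1
"""


def read_then_filter(raw):
    """Materialize every row and column, then filter in memory."""
    rows = list(csv.DictReader(io.StringIO(raw)))
    cells_kept = sum(len(row) for row in rows)  # every value is held in memory
    tips = [float(r["tip_amount"]) for r in rows if r["payment_type"] == "1"]
    return tips, cells_kept


def filter_while_reading(raw):
    """Apply the row filter and the column selection during the scan."""
    tips = []
    cells_kept = 0
    for row in csv.DictReader(io.StringIO(raw)):
        if row["payment_type"] != "1":
            continue  # the row is dropped before anything is retained
        tips.append(float(row["tip_amount"]))
        cells_kept += 1  # only the single needed column is retained
    return tips, cells_kept


eager_tips, eager_cells = read_then_filter(RAW)
pushed_tips, pushed_cells = filter_while_reading(RAW)
print(eager_tips == pushed_tips, eager_cells, pushed_cells)
```

Both functions return the same tips, but the second one retains far fewer values, which is the effect pushdown aims for at a much larger scale.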

Dask DataFrame users who are familiar with query optimization can rewrite the `dd.read_parquet` query as:

In [19]:
%%timeit
ddf = dd.read_parquet(
    os.path.join(folder_path, "*.parquet"),
    filters=[("payment_type", "==", 1)],
    columns=["tip_amount"],
)
payment_filtered_mean = ddf.tip_amount.mean().compute()

146 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
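One reason a `filters=` argument can save work is that Parquet files store per-row-group statistics (such as the minimum and maximum of each column), which allows a reader to skip whole row groups without decoding them. The following is a toy sketch of that idea, not the implementation used by pyarrow or Dask; the `RowGroup` class, the `scan_with_filter` helper, and the data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group and its statistics."""
    min_payment_type: int
    max_payment_type: int
    rows: list  # list of (payment_type, tip_amount) tuples


def scan_with_filter(row_groups, wanted):
    """Return matching tips and the number of row groups actually decoded."""
    tips, decoded = [], 0
    for group in row_groups:
        # The statistics prove that no row can match, so skip the group unread.
        if not (group.min_payment_type <= wanted <= group.max_payment_type):
            continue
        decoded += 1
        tips.extend(tip for payment_type, tip in group.rows if payment_type == wanted)
    return tips, decoded


groups = [
    RowGroup(1, 1, [(1, 2.0), (1, 4.0)]),
    RowGroup(2, 4, [(2, 0.0), (4, 1.0)]),
    RowGroup(1, 2, [(1, 3.0), (2, 0.0)]),
]
tips, decoded = scan_with_filter(groups, wanted=1)
print(tips, decoded)
```

In this sketch the second row group is never decoded, because its minimum `payment_type` is already larger than the requested value.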


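Fortunately, Dask DataFrame now offers an automatic alternative, Dask Expressions, which does not require users to know anything about query optimization. The following cell builds the same row filter and column selection with `dx` and prints the resulting expression tree with `pprint()`: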

In [6]:
ddf = dx.read_parquet(os.path.join(folder_path, "*.parquet"))
payment_filtered = (
    ddf[ddf.payment_type == 1]['tip_amount']
)
payment_filtered_mean = payment_filtered.sum(numeric_only=True)
payment_filtered_mean.pprint()

Sum: numeric_only=True
  Projection: columns='tip_amount'
    Filter:
      ReadParquetFSSpec: path='/Users/luweizheng/Projects/godaai/distributed-python/ch-dask-dataframe/../data/nyc-taxi/*.parquet' kwargs={'dtype_backend': None}
      EQ: right=1
        Projection: columns='payment_type'
          ReadParquetFSSpec: path='/Users/luweizheng/Projects/godaai/distributed-python/ch-dask-dataframe/../data/nyc-taxi/*.parquet' kwargs={'dtype_backend': None}
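To make the idea of rewriting an expression tree more concrete, here is a minimal toy optimizer in plain Python. It is not how dask-expr is implemented: the `Read`, `Filter`, and `Projection` classes and the `optimize` function are hypothetical names chosen for illustration, and the single rewrite rule simply folds a filter and a projection into the read node.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Read:
    path: str
    columns: Optional[Tuple[str, ...]] = None  # None means "all columns"
    filters: Tuple[tuple, ...] = ()


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: tuple  # e.g. ("payment_type", "==", 1)


@dataclass(frozen=True)
class Projection:
    child: object
    columns: Tuple[str, ...]


def optimize(node):
    """Rewrite the tree bottom-up, folding filters and projections into Read."""
    if isinstance(node, Filter):
        child = optimize(node.child)
        # Only fold the filter while the read still loads every column.
        if isinstance(child, Read) and child.columns is None:
            return Read(child.path, child.columns, child.filters + (node.predicate,))
        return Filter(child, node.predicate)
    if isinstance(node, Projection):
        child = optimize(node.child)
        if isinstance(child, Read) and child.columns is None:
            return Read(child.path, node.columns, child.filters)
        return Projection(child, node.columns)
    return node


# The unoptimized plan mirrors the tree shape: Projection -> Filter -> Read.
plan = Projection(
    Filter(Read("nyc-taxi/*.parquet"), ("payment_type", "==", 1)),
    ("tip_amount",),
)
optimized = optimize(plan)
print(optimized)
```

After the rewrite, the whole plan collapses into a single `Read` node that carries both the column list and the filter, which is the hand-written form used with `filters=` and `columns=`.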


In [20]:
%%timeit
ddf = dx.read_parquet(
    os.path.join(folder_path, "*.parquet"),
    filters=[("payment_type", "==", 1)],
    columns=["tip_amount"],
)
payment_filtered_mean = ddf.tip_amount.mean().optimize().compute()

148 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [8]:
%%time
payment_filtered_mean.compute()


In [9]:
%%time
optimized_payment_filtered_mean = payment_filtered_mean.optimize()
optimized_payment_filtered_mean.pprint()
optimized_payment_filtered_mean.compute()


In [10]:
optimized_payment_filtered_mean.simplify().pprint()


In [None]:
%%timeit
optimized_payment_filtered_mean.compute()

281 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
df = dx.datasets.timeseries(
    start="2024-01-01", 
    end="2024-12-30", 
    freq="100ms",
)
df

                   name     id        x        y
npartitions=364
2024-01-01       string  int64  float64  float64
2024-01-02          ...    ...      ...      ...
...                 ...    ...      ...      ...
2024-12-29          ...    ...      ...      ...
2024-12-30          ...    ...      ...      ...


In [None]:
# Calling sum() on every column raises a TypeError for the pyarrow-backed
# string column `name`, so select the numeric column `x` before reducing.
out = df[df.id == 1000]["x"].sum()