(ray-data)=
# Ray Data

In [16]:
import os
import shutil
import urllib.request
from pathlib import Path
import pandas as pd
import ray

if ray.is_initialized():
    ray.shutdown()

ray.init()

2023-09-25 19:10:57,441	INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


| Property | Value |
|---|---|
| Python version | 3.10.9 |
| Ray version | 2.7.0 |
| Dashboard | http://127.0.0.1:8265 |


In [17]:
folder_path = os.path.join(os.getcwd(), "../data/nyc-taxi")
download_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-06.parquet"
file_name = download_url.split("/")[-1]
parquet_file_path = os.path.join(folder_path, file_name)
if not os.path.exists(folder_path):
    # Create the folder
    os.makedirs(folder_path)
    print(f"Folder {folder_path} did not exist and has been created.")
    # Download the Parquet file and save it locally
    with urllib.request.urlopen(download_url) as response, open(parquet_file_path, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    print("Data downloaded and saved as a Parquet file.")
else:
    print(f"Folder {folder_path} already exists; nothing to do.")

Folder /Users/luweizheng/Projects/py-101/distributed-python/ch-ray-air/../data/nyc-taxi already exists; nothing to do.


Read the file with `ray.data`:


In [18]:
dataset = ray.data.read_parquet(parquet_file_path)

2023-09-25 19:11:06,332	INFO read_api.py:406 -- To satisfy the requested parallelism of 200, each read task output is split into 200 smaller blocks.


Inspect the schema of this dataset:


In [19]:
dataset.schema()

Column                 Type
------                 ----
VendorID               int32
tpep_pickup_datetime   timestamp[us]
tpep_dropoff_datetime  timestamp[us]
passenger_count        int64
trip_distance          double
RatecodeID             int64
store_and_fwd_flag     large_string
PULocationID           int32
DOLocationID           int32
payment_type           int64
fare_amount            double
extra                  double
mta_tax                double
tip_amount             double
tolls_amount           double
improvement_surcharge  double
total_amount           double
congestion_surcharge   double
Airport_fee            double

Count the number of rows in the dataset:


In [20]:
dataset.count()

3307234

Look at the first row of the dataset:


In [21]:
dataset.take(1)

2023-09-25 19:11:06,824	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->SplitBlocks(200)] -> LimitOperator[limit=1]
2023-09-25 19:11:06,826	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-09-25 19:11:06,828	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`



[{'VendorID': 1,
  'tpep_pickup_datetime': datetime.datetime(2023, 6, 1, 0, 8, 48),
  'tpep_dropoff_datetime': datetime.datetime(2023, 6, 1, 0, 29, 41),
  'passenger_count': 1,
  'trip_distance': 3.4,
  'RatecodeID': 1,
  'store_and_fwd_flag': 'N',
  'PULocationID': 140,
  'DOLocationID': 238,
  'payment_type': 1,
  'fare_amount': 21.9,
  'extra': 3.5,
  'mta_tax': 0.5,
  'tip_amount': 6.7,
  'tolls_amount': 0.0,
  'improvement_surcharge': 1.0,
  'total_amount': 33.6,
  'congestion_surcharge': 2.5,
  'Airport_fee': 0.0}]

### Transformation

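You can preprocess the rows of a `Dataset` with user-defined transformations or with transformations that Ray provides. For example, use [`map_batches()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html#ray.data.Dataset.map_batches) to preprocess the data.

Ray also provides [`map()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map.html) and [`flat_map()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.flat_map.html). Like their counterparts in other big-data frameworks such as Spark and Flink, these two APIs transform the data one row at a time, so we do not cover them in detail here. `map_batches()`, by contrast, transforms the dataset one batch at a time: typically, one input batch produces one output batch.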

```{figure} ../img/ch-ray-air/map-map-batches.svg
---
width: 800px
name: map-map-batches
---
map() vs. map_batches()
```
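To make the row-wise semantics of `map()` and `flat_map()` concrete, here is a minimal sketch. The two functions, `add_tip_ratio` and `one_row_per_passenger`, are hypothetical names introduced only for illustration, and the sketch assumes each row is a plain Python dict whose `passenger_count` is a non-null integer. Both functions are exercised locally on a single hand-written row, so no Ray cluster is needed.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def add_tip_ratio(row: Row) -> Row:
    """map()-style function: one input row yields exactly one output row."""
    fare = row["fare_amount"]
    ratio = row["tip_amount"] / fare if fare else 0.0
    # Return a new dict so the input row is left untouched.
    return {**row, "tip_ratio": ratio}


def one_row_per_passenger(row: Row) -> List[Row]:
    """flat_map()-style function: one input row yields zero or more rows."""
    return [
        {"PULocationID": row["PULocationID"], "passenger_index": i}
        for i in range(row["passenger_count"])
    ]


# A hand-written row (illustrative values) to try the functions locally.
sample_row = {
    "PULocationID": 140,
    "passenger_count": 2,
    "fare_amount": 21.9,
    "tip_amount": 6.7,
}
mapped = add_tip_ratio(sample_row)
flat_mapped = one_row_per_passenger(sample_row)
print(mapped)
print(flat_mapped)
```

On a Ray `Dataset`, the same functions would be passed as `dataset.map(add_tip_ratio)` and `dataset.flat_map(one_row_per_passenger)`. Rows with a missing `passenger_count` would need extra handling first.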

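`map_batches()` mimics operating on a whole dataset on a single machine. It is designed mainly so that existing single-machine programs can be migrated to Ray smoothly: in short, you first write a single-machine program and then use Ray Data to run it on a cluster.

In `map_batches()`, each batch is represented as a `Dict[str, np.ndarray]`, a `pd.DataFrame`, or a `pyarrow.Table`, matching single-machine business logic written with NumPy, pandas, or Arrow respectively. The most important parameter of `map_batches()` is a user-defined function, `fn`. For example, the following code filters `dataset` on the value of one field; after filtering, the number of rows drops sharply.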

In [22]:
lambda_filtered_dataset = dataset.map_batches(lambda df: df[df["passenger_count"] == 0], batch_format="pandas")
lambda_filtered_dataset.count()

2023-09-25 19:11:08,566	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->SplitBlocks(200)] -> TaskPoolMapOperator[MapBatches(<lambda>)]

54231

This user-defined function was written as a Python lambda expression, that is, an anonymous Python function. We can of course pass an ordinary Python function instead. For example:


In [23]:
def filter_pa_cnt(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the trips recorded with zero passengers
    df = df[df["passenger_count"] == 0]
    return df

filtered_dataset = dataset.map_batches(filter_pa_cnt, batch_format="pandas")
filtered_dataset.count()



2023-09-25 19:11:24,755	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet->SplitBlocks(200)] -> TaskPoolMapOperator[MapBatches(filter_pa_cnt)]

54231
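The two filters above use the pandas batch format. As a hedged sketch of the `Dict[str, np.ndarray]` format, the same filter can be written against NumPy arrays. The function name `filter_zero_passengers` and the small hand-made batch are assumptions for illustration only; the function is checked locally rather than on the full taxi dataset.

```python
from typing import Dict

import numpy as np


def filter_zero_passengers(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Keep only the rows whose passenger_count equals 0."""
    mask = batch["passenger_count"] == 0
    # Apply the same boolean mask to every column so rows stay aligned.
    return {name: column[mask] for name, column in batch.items()}


# A tiny hand-made batch (illustrative values) to try the function locally.
sample_batch = {
    "passenger_count": np.array([1, 0, 2, 0]),
    "trip_distance": np.array([3.4, 1.2, 0.8, 5.0]),
}
filtered = filter_zero_passengers(sample_batch)
print(filtered)
```

With Ray Data, this function would be passed as `dataset.map_batches(filter_zero_passengers, batch_format="numpy")`.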

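By default, `map_batches()` runs the function as Ray remote functions (tasks). It can also use the actor model; interested readers can refer to the documentation, and we do not go into detail here.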
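For readers who want a starting point for the actor-based variant, here is a minimal sketch. It assumes the Ray 2.7 API used in this chapter, where a callable class is passed to `map_batches()` together with `compute=ray.data.ActorPoolStrategy(...)`; newer Ray releases may prefer a different argument, so check the documentation for your version. The class `PassengerCountFilter` and the helper `filter_with_actors` are hypothetical names. The class is checked locally on a small DataFrame, and the Ray call is wrapped in a helper that is not invoked here.

```python
import pandas as pd


class PassengerCountFilter:
    """Callable class: state set up in __init__ is reused across batches."""

    def __init__(self, passenger_count: int = 0):
        self.passenger_count = passenger_count

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        return df[df["passenger_count"] == self.passenger_count]


def filter_with_actors(dataset, num_actors: int = 2):
    """Run the filter on a pool of actors (assumes the Ray 2.7 API)."""
    import ray  # imported lazily so the local check below needs only pandas

    return dataset.map_batches(
        PassengerCountFilter,
        compute=ray.data.ActorPoolStrategy(size=num_actors),
        batch_format="pandas",
    )


# Local check on a tiny hand-made DataFrame (illustrative values).
local_df = pd.DataFrame(
    {"passenger_count": [1, 0, 2, 0], "trip_distance": [3.4, 1.2, 0.8, 5.0]}
)
local_result = PassengerCountFilter()(local_df)
print(local_result)
```

Actors are mostly useful when `__init__` does expensive setup, such as loading a model, that should happen once per worker rather than once per batch.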

### groupby

Another frequently used primitive in data processing is grouped aggregation, for which Ray Data provides [`groupby()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.groupby.html#ray.data.Dataset.groupby). With Ray Data, you first call `groupby()` to group the data by one or more fields, and then call [`map_groups()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.grouped_data.GroupedData.map_groups.html) to aggregate the grouped data.

`groupby()` takes the field or fields to group by, and `map_groups()` takes a Python function that operates on all the rows of one group. Ray Data also ships with built-in aggregation functions, such as [`sum()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.sum.html#ray.data.Dataset.sum), [`max()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.max.html#ray.data.Dataset.max), and [`mean()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.mean.html#ray.data.Dataset.mean).


In [24]:
ds = ray.data.from_items([
    {"group": 1, "value": 1},
    {"group": 1, "value": 2},
    {"group": 2, "value": 3},
    {"group": 2, "value": 4}])
mean_ds = ds.groupby("group").mean("value")
mean_ds.show()

2023-09-25 19:11:37,608	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=20]

{'group': 1, 'mean(value)': 1.5}
{'group': 2, 'mean(value)': 3.5}
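The example above uses the built-in `mean()` aggregation. As a sketch of what a custom function for `map_groups()` might look like, the code below centers `value` within each group. The function name `center_values` is hypothetical, and the grouping is emulated locally with pandas on the same four items so that the logic can be checked without a cluster.

```python
import pandas as pd


def center_values(group_df: pd.DataFrame) -> pd.DataFrame:
    """Subtract the group's mean from each value in the group."""
    group_df = group_df.copy()
    group_df["centered"] = group_df["value"] - group_df["value"].mean()
    return group_df


# The same four items as in the Ray example above.
items = [
    {"group": 1, "value": 1},
    {"group": 1, "value": 2},
    {"group": 2, "value": 3},
    {"group": 2, "value": 4},
]
local_df = pd.DataFrame(items)

# Emulate groupby + map_groups locally: call the function once per group.
local_result = pd.concat(
    [center_values(group) for _, group in local_df.groupby("group")]
)
print(local_result)
```

With Ray Data, the function would be applied as `ds.groupby("group").map_groups(center_values, batch_format="pandas")`.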


## Data Preprocessing and Model Training

Split the dataset into a training set and a test set:


In [25]:
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)


In [26]:
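from ray.data.preprocessors import MinMaxScaler, PowerTransformer

# NOTE: "trip_duration" is not among the columns in the schema shown earlier;
# it has to be derived (e.g., drop-off time minus pick-up time) before this
# scaler can be fitted.
preprocessor = MinMaxScaler(columns=["trip_distance", "trip_duration"])

# Reuse the training split (this binds a second name; it does not copy the data)
sample_data = train_dataset

# Create a new preprocessor
sample_preprocessor = PowerTransformer(columns=["trip_distance"], power=0.5)

# Fit the preprocessor and apply the transformation
transformed_data = sample_preprocessor.fit_transform(sample_data)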
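The `MinMaxScaler` above lists a `trip_duration` column, which does not appear in the schema printed earlier. One possible way to derive it, sketched here under the assumption that the duration is the drop-off time minus the pick-up time in seconds, is a pandas batch function. The name `add_trip_duration` is hypothetical, and the sample row reuses the timestamps from the `take(1)` output.

```python
import pandas as pd


def add_trip_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Add a trip_duration column, in seconds, from the two timestamps."""
    df = df.copy()
    duration = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df["trip_duration"] = duration.dt.total_seconds()
    return df


# One row with the timestamps shown in the take(1) output.
sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(["2023-06-01 00:08:48"]),
        "tpep_dropoff_datetime": pd.to_datetime(["2023-06-01 00:29:41"]),
        "trip_distance": [3.4],
    }
)
with_duration = add_trip_duration(sample)
print(with_duration["trip_duration"].tolist())  # [1253.0]
```

On the training split, this could be applied with `train_dataset.map_batches(add_trip_duration, batch_format="pandas")` before fitting the `MinMaxScaler`.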