(ray-data-preprocessor)=
# Preprocessor

{numref}`ray-data-transform` introduces the general-purpose interfaces `map()` and `map_batches()`. For structured tabular data, Ray Data also provides a higher-level API, the Preprocessor, built on top of `map()` and `map_batches()`. A [Preprocessor](https://docs.ray.io/en/latest/data/api/preprocessor.html) bundles a series of feature-processing operations and integrates more closely with machine learning model training and inference. Its interface resembles scikit-learn's [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), so scikit-learn users can transition quickly. For unstructured data such as images or videos, `map()` or `map_batches()` remains the recommended approach.

## Usage

Preprocessor primarily consists of four types of operations:

1. [`fit()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessor.Preprocessor.fit.html): Computes the state information for the Ray Data `Dataset`, such as calculating the variance or mean of a column.
2. [`transform()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessor.Preprocessor.transform.html): Executes the transformation operation. If the transformation operation involves state, `fit()` must be performed first.
3. [`transform_batch()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessor.Preprocessor.transform_batch.html): Performs the transformation operation on a batch of data.
4. [`fit_transform()`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessor.Preprocessor.fit_transform.html): An operation combining `fit()` and `transform()`. It first performs `fit()` on the `Dataset` and then applies `transform()`.

Below, we demonstrate the Preprocessor on the taxi dataset, a typical structured dataset with many columns, such as the trip distance. These columns can serve as features for machine learning algorithms, but they usually need feature processing before being fed into a model. For instance, [`MinMaxScaler`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessors.MinMaxScaler.html) normalizes a feature by rescaling it to the range $[0, 1]$:

$$
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
$$
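
To make the relationship between the four operations and the formula above concrete, here is a minimal pure-Python sketch. `TinyMinMaxScaler`, its `stats` attribute, and the `batch_size` parameter are hypothetical names invented for this illustration; this is not Ray's implementation, only the same fit-then-transform pattern applied to a list of rows.

```python
from typing import Dict, List, Optional

Row = Dict[str, float]


class TinyMinMaxScaler:
    """Hypothetical stand-in that mimics the fit/transform pattern (not Ray's code)."""

    def __init__(self, column: str) -> None:
        self.column = column
        self.stats: Optional[Dict[str, float]] = None  # state filled in by fit()

    def fit(self, rows: List[Row]) -> "TinyMinMaxScaler":
        # fit() only computes state: the column's minimum and maximum.
        values = [row[self.column] for row in rows]
        self.stats = {"min": min(values), "max": max(values)}
        return self

    def transform_batch(self, batch: List[Row]) -> List[Row]:
        # Stateful transformations require fit() to have run first.
        if self.stats is None:
            raise RuntimeError("call fit() before transforming")
        lo = self.stats["min"]
        span = self.stats["max"] - lo
        # x' = (x - min) / (max - min); a constant column maps to 0.0.
        return [
            {**row, self.column: (row[self.column] - lo) / span if span else 0.0}
            for row in batch
        ]

    def transform(self, rows: List[Row], batch_size: int = 2) -> List[Row]:
        # transform() applies transform_batch() to the data one batch at a time.
        out: List[Row] = []
        for start in range(0, len(rows), batch_size):
            out.extend(self.transform_batch(rows[start:start + batch_size]))
        return out

    def fit_transform(self, rows: List[Row]) -> List[Row]:
        # fit_transform() is fit() followed by transform() on the same data.
        return self.fit(rows).transform(rows)


rows = [{"trip_distance": d} for d in (0.0, 2.5, 10.0)]
scaler = TinyMinMaxScaler("trip_distance")
scaled = scaler.fit_transform(rows)
print(scaled)
# [{'trip_distance': 0.0}, {'trip_distance': 0.25}, {'trip_distance': 1.0}]
print(scaler.transform_batch([{"trip_distance": 5.0}]))
# [{'trip_distance': 0.5}]
```

Once fitted, the same state can be reused on new batches, which is why a fitted preprocessor can be applied consistently during both training and inference.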

In [1]:
import os
import shutil
import urllib.request

import ray

# `ray.is_initialized` is a function; it must be called to get the actual state.
if ray.is_initialized():
    ray.shutdown()

ray.init()

folder_path = os.path.join(os.getcwd(), "../data/nyc-taxi")
download_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-06.parquet"
file_name = download_url.split("/")[-1]
parquet_file_path = os.path.join(folder_path, file_name)

# Download the Parquet file only if it is not already cached locally.
os.makedirs(folder_path, exist_ok=True)
if not os.path.exists(parquet_file_path):
    with urllib.request.urlopen(download_url) as response, open(parquet_file_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)

2024-02-15 09:45:46,678	INFO worker.py:1724 -- Started a local Ray instance.


In [2]:
from ray.data.preprocessors import MinMaxScaler

ds = ray.data.read_parquet(parquet_file_path,
    columns=["trip_distance"])
ds.take(1)

[{'trip_distance': 3.4}]

After `MinMaxScaler` is fitted with `fit()` and applied with `transform()`, each original value is rescaled to the range $[0, 1]$.

In [3]:
preprocessor = MinMaxScaler(columns=["trip_distance"])
preprocessor.fit(ds)
minmax_ds = preprocessor.transform(ds)
minmax_ds.take(1)

[{'trip_distance': 1.8353531664835362e-05}]

In [4]:
minmax_ds_ft = preprocessor.fit_transform(ds)
minmax_ds_ft.take(1)

[{'trip_distance': 1.8353531664835362e-05}]

## Categorical and Numerical Variables

### Categorical Variables

Most machine learning models cannot consume categorical variables directly, so these variables must first be encoded as numbers. The table below lists several preprocessors for handling categorical variables.

```{table} Preprocessors for handling categorical variables
:name: categorical-data-preprocessor

| Preprocessor       	| Variable Type 	| Example                         	|
|--------------------	|---------------	|---------------------------------	|
| [`LabelEncoder`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessors.LabelEncoder.html)     	| Unordered      	| Cat, Dog, Cow, Sheep            	|
| [`OrdinalEncoder`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessors.OrdinalEncoder.html)     	| Ordered        	| High School, Bachelor's, Master's, Ph.D.    	|
| [`MultiHotEncoder`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessors.MultiHotEncoder.html) 	| Multi-label    	| ["Comedy", "Animation"], ["Suspense", "Action"]   	|
```
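
To illustrate what two of these encodings produce, the following pure-Python sketch reuses the example values from the table. The helpers `ordinal_encode` and `multi_hot_encode` are hypothetical and are not Ray's implementation; in particular, passing the category order explicitly is an assumption made for clarity, whereas Ray's encoders derive their vocabulary from the data during `fit()`.

```python
from typing import Dict, List, Sequence


def ordinal_encode(values: Sequence[str], order: Sequence[str]) -> List[int]:
    """Map each category to its rank in an explicitly given order (hypothetical helper)."""
    rank: Dict[str, int] = {category: i for i, category in enumerate(order)}
    return [rank[value] for value in values]


def multi_hot_encode(rows: Sequence[Sequence[str]]) -> List[List[int]]:
    """Turn each list of categories into a 0/1 vector over the sorted vocabulary."""
    vocabulary = sorted({category for row in rows for category in row})
    return [[1 if category in row else 0 for category in vocabulary] for row in rows]


degrees = ["Bachelor's", "Ph.D.", "High School"]
encoded_degrees = ordinal_encode(
    degrees, order=["High School", "Bachelor's", "Master's", "Ph.D."]
)
print(encoded_degrees)
# [1, 3, 0]

genres = [["Comedy", "Animation"], ["Suspense", "Action"]]
encoded_genres = multi_hot_encode(genres)
# Sorted vocabulary: Action, Animation, Comedy, Suspense
print(encoded_genres)
# [[0, 1, 1, 0], [1, 0, 0, 1]]
```

The ordinal codes preserve the ranking of the categories, while the multi-hot vectors allow a single row to belong to several categories at once.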

### Numerical Variables

Numerical variables often need to be rescaled or reshaped before they suit a particular machine learning model. The table below lists several preprocessors for handling numerical variables.

```{table} Preprocessors for handling numerical variables
:name: numerical-data-preprocessor

| Preprocessor       	| Variable Type        	| Computation                                	| Remarks                                              	|
|--------------------	|----------------------	|--------------------------------------------	|------------------------------------------------------	|
| [`RobustScaler`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessors.RobustScaler.html)     	| With Outliers        	| $x' = \frac{x - \mu_{1/2}}{\mu_h - \mu_l}$ 	| $\mu_{1/2}$ is the median, $\mu_h$ is the high quantile, $\mu_l$ is the low quantile (by default the third and first quartiles) 	|
| [`MaxAbsScaler`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessors.MaxAbsScaler.html)     	| Sparse Data          	| $x' = \frac{x}{\max{\vert x \vert}}$       	|                                                      	|
| [`PowerTransformer`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessors.PowerTransformer.html) 	| Gaussian Transformation | Yeo-Johnson or Box-Cox                    	|                                                      	|
| [`Normalizer`](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessors.Normalizer.html)       	| Requires Normalization 	| $x' = \frac{x}{\lVert x \rVert_p}$         	| $p$ selects the norm, e.g., the `l1` norm is the sum of absolute values         	|
```
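
The formulas in the table can be checked by hand with a small pure-Python sketch. The functions below are hypothetical helpers written for this illustration, not Ray's implementation, and the quartiles come from the standard library's `statistics.quantiles` with the `inclusive` method, which is an assumption that may differ slightly from how Ray estimates quantiles.

```python
import statistics
from typing import List, Sequence


def robust_scale(column: Sequence[float]) -> List[float]:
    """x' = (x - median) / (q3 - q1), using the first and third quartiles."""
    median = statistics.median(column)
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in column]


def max_abs_scale(column: Sequence[float]) -> List[float]:
    """x' = x / max(|x|); zeros stay zero, so sparsity is preserved."""
    largest = max(abs(x) for x in column)
    return [x / largest for x in column]


def normalize_row(row: Sequence[float], p: int = 2) -> List[float]:
    """x' = x / ||x||_p, applied to one row (sample) at a time."""
    norm = sum(abs(x) ** p for x in row) ** (1 / p)
    return [x / norm for x in row]


robust = robust_scale([1, 2, 3, 4, 100])
print(robust)
# [-1.0, -0.5, 0.0, 0.5, 48.5]

max_abs = max_abs_scale([-4, 2, 1])
print(max_abs)
# [-1.0, 0.5, 0.25]

l1_normalized = normalize_row([3, -1], p=1)
print(l1_normalized)
# [0.75, -0.25]
```

In the first example the outlier `100` has no influence on the median or the quartiles, so the remaining values keep a sensible spread, which is why this scaling suits columns with outliers.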