Auto Feature Engineering helps to shorten the time required for data scientists to process and transform large-scale raw tabular datasets to ready-to-train features, reducing the necessity of human domain-specific knowledge to explore and identify new useful features, and thus enabling data scientists to be able to focus and iterate on the rest stages of the E2E AI pipelines more efficiently.
The workflow leverages Intel Auto-Feature-Engineering toolkit (codename: RecDP) to automatically transform raw tabular data to enriched useful new features on pluggable execution engine such as Pandas and Spark, with the capability of integrating 3rd party feature engineering primitives, and thus significantly improve developer productivity and efficiency.
Check out more workflow examples in the Developer Catalog.
Auto feature engineering targets to simplify Feature engineering process with enhanced performance via parallel data processing frameworks, automated data processing pipeline and built-in domain-specific feature engineering primitives. This repository provides an end-to-end workflow that automatically analyzes the data based on data type, profiles feature distribution, generates customizable feature engineering pipelines for the data preparation and executes the pipeline parallelly with different backend engines on Intel platform.
This workflow leverages the RecDP toolkits from Intel® End-to-End AI Optimization kit and the features can be used by both ML and DL models.
Steps explained:
- Feature profile: Analyze raw tabular dataset to infer original feature based on data type and generate FeatureList.
- Feature engineering: Use inferred FeatureList to generate Data Pipeline in Json/Yaml File format.
- Feature transformation: Convert Data Pipeline to executable operations and transform original features to candidate features with selected engine, currently Pandas and Spark were supported.
- Feature Importance Estimator: perform feature importance analysis on candidate features to remove un-important features, generate the transfomred dataset that includes all finalize features that will be used for training.
There are workflow-specific hardware and software setup requirements depending on how the workflow is run.
Name | Description |
---|---|
CPU | Intel® 1st, 2nd, 3rd, and 4th Gen Xeon® Scalable Performance processors |
Memory Size | >100 GB |
Disk Size | >300 GB |
Docker Version | 23.0.x |
Package | Version |
---|---|
Python | 3.8.10 |
Spark | 3.3.1 |
Pandas | 2.0.1 |
This workflow has been validated with Ubuntu 20.04 OS.
This Auto Feature Engineering Workflow loads tabular data, profiles data, applies feature transformation, generates new features to enrich the features, and then evaluates performance with transformed data. The whole process is configurable using yaml configuration file including tabular data path, target label of tabular data and execution engine.
Start by defining an environment variable that will store the workspace path, this can be an existing directory or one to be created in further steps. This ENVVAR will be used for all the commands executed using absolute paths.
export WORKSPACE=~/work
Steps explained:
- Download suggested dataset. (Skip if you want to use own dataset).
- Create a folder in applications and put the dataset there.
- Create a workflow yaml file in {your folder} to config dataset_path and label.
- run "docker compose run autofe-local-mode" (cmdline mode) or run "docker compose run autofe-notebook-UI" (jupyter notebook mode).
- Once the script completes successfully, you'll see transformed data and EDA analysis in {your folder}.
Following steps is a auto feature engineering example on NYC taxi dataset.
Create a working directory for workflow and clone workflow repo into working directory.
mkdir -p $WORKSPACE && cd $WORKSPACE
git clone https://github.com/intel/auto-feature-engineering.git
export AUTOFE_DIR=`pwd`/auto-feature-engineering
The examples uses kaggle API to download dataset, you can also download dataset manually.
pip install kaggle==1.5.3
apt install unzip
export KAGGLE_USERNAME=${your_kaggle_username}
export KAGGLE_KEY=${your_kaggle_key}
export KAGGLE_PROXY=${your_proxy}
# download NYC taxi dataset, usually took ~2 mins
cd ${AUTOFE_DIR}/applications/nyc_taxi_fare/; mkdir -p raw_data; cd raw_data;
kaggle competitions download -c new-york-city-taxi-fare-prediction; unzip new-york-city-taxi-fare-prediction.zip
You can execute the references pipelines using the following environments:
- Docker
- Argo
- Jupyter
Follow these instructions to set up and run our provided Docker image.
You'll need to install Docker Engine on your development system. Note that while Docker Engine is free to use, Docker Desktop may require you to purchase a license. See the Docker Engine Server installation instructions for details.
If the Docker image is run on a cloud service, you may also need credentials to perform training and inference related operations (such as these for Azure):
- Set up the Azure Machine Learning Account
- Configure the Azure credentials using the Command-Line Interface
- Compute targets in Azure Machine Learning
- Virtual Machine Products Available in Your Region
Ensure you have Docker Compose installed on your machine. If you don't have this tool installed, consult the official Docker Compose installation documentation.
DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
mkdir -p $DOCKER_CONFIG/cli-plugins
curl -SL https://github.com/docker/compose/releases/download/v2.7.0/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
docker compose version
Build or pull the provided docker image.
cd $AUTOFE_DIR/docker
docker compose build
Or
docker pull intel/ai-workflows:pa-autofe
This command will build RecDP docker image and starts docker container. $AUTOFE_WORKSPACE
directory will be mounted to docker container and corresponding workflow will be launched in container. When the command completes successfully, a EDA
folder will be created under $workspace folder, containing transformed dataset and pipeline.json
file that contains feature transform pipeline created by RecDP.
Note: We use system proxy from host while building docker image, so please set correct environment param for
http_proxy
andhttps_proxy
in host.
%%{init: {'theme': 'dark'}}%%
flowchart RL
VAUTOFEWORKSPACE{{"${AUTOFE_WORKSPACE}"}} x-. /home/vmagent/app/workspace/ .-x autofelocalmode[autofe-local-mode]
VH45679{{../}} x-. /home/vmagent/app/ .-x autofelocalmode
VAUTOFEWORKSPACE x-. /home/vmagent/app/workspace/ .-x autofeEDAUI[autofe-EDA-UI]
VH45679 x-. /home/vmagent/app/ .-x autofeEDAUI
VAUTOFEWORKSPACE x-. /home/vmagent/app/workspace/ .-x autofenotebookUI[autofe-notebook-UI]
VH45679 x-. /home/vmagent/app/ .-x autofenotebookUI
VAUTOFEWORKSPACE x-. /home/vmagent/app/workspace/ .-x autofenotebookdev[autofe-notebook-dev]
VH45679 x-. /home/vmagent/app/ .-x autofenotebookdev
classDef volumes fill:#0f544e,stroke:#23968b
class VAUTOFEWORKSPACE,VH45679,VAUTOFEWORKSPACE,VH45679,VAUTOFEWORKSPACE,VH45679,VAUTOFEWORKSPACE,VH45679 volumes
The table below shows some of the environment variables you can control according to your needs.
Environment Variable Name | Default Value | Description |
---|---|---|
AUTOFE_WORKSPACE | ${AUTOFE_DIR}/applications/nyc_taxi_fare |
Workflow workspace directory, will be mounted to /home/vmagent/app/workspace/ |
Configuration for ${AUTOFE_DIR}/applications/nyc_taxi_fare/workflow.yaml
dataset_path: ./raw_data/train.csv
target_label: fare_amount
engine_type: spark
cd ${AUTOFE_DIR}/docker
export AUTOFE_WORKSPACE=${AUTOFE_DIR}/applications/nyc_taxi_fare
docker compose run autofe-local-mode
- Docker container will be built and started
- Configuration file is parsed and input
autofe_pyrecdp | Configuration is {'dataset_path': './raw_data/train.csv', 'target_label': 'fare_amount'}
- Workflow profiles input data and creates data pipeline automatically.
autofe_pyrecdp | {0: DataFrame, 1: type_infer, 2: astype, 3: fillna, 4: DataFrame, 5: type_infer, 6: tuple, 7: tuple, 8: astype, 9: fillna, 10: astype, 11: datetime_featur
e, 12: haversine, 13: drop}
- Workflow starts to transform data, we provided engine in configuration: pandas and spark
autofe_pyrecdp | execute with spark started ...
...
autofe_pyrecdp | execute drop started ...
autofe_pyrecdp | execute drop took 0.0009502819739282131 sec
autofe_pyrecdp | DataframeTransform took 83.058 secs, processed 55423856 rows with num_partitions as 200
autofe_pyrecdp | DataframeTransform combine to one pandas dataframe took 4.765 secs
autofe_pyrecdp | execute with spark took 151.22523793019354 sec
- Workflow completed!
autofe_pyrecdp | ************ Completed! ************
autofe_pyrecdp | original data shape is (55423856, 8), after transformed data shape is (55423856, 17)
autofe_pyrecdp | Transformed Data and EDA analysis are saved in your workspace
autofe_pyrecdp | transformed_data.parquet
autofe_pyrecdp | workflow.yaml
autofe_pyrecdp | EDA
autofe_pyrecdp | raw_data
autofe_pyrecdp | EDA folder:
autofe_pyrecdp | UI_notebook.ipynb
autofe_pyrecdp | pipeline.json
autofe_pyrecdp | transformed_data_sample.csv
autofe_pyrecdp | original_data_sample.csv
autofe_pyrecdp | .ipynb_checkpoints
autofe_pyrecdp | Please visit EDA/UI_notebook.ipynb for details
autofe_pyrecdp | **************************************
autofe_pyrecdp |
autofe_pyrecdp exited with code 0
Stop containers created by docker compose and remove them.
docker compose down
- Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 && \
chmod 700 get_helm.sh && \
./get_helm.sh
- Install Argo Workflows and Argo CLI
- Configure your Artifact Repository
- Ensure that your dataset and config files are present in your chosen artifact repository.
export NAMESPACE=argo
helm install --namespace ${NAMESPACE} --set proxy=${http_proxy} auto-feature-engineering ./chart
argo submit --from wftmpl/auto-feature-engineering --namespace=${NAMESPACE}
To view your workflow progress
argo logs @latest -f
We setup Jupyter environment in docker image, so please follow Set Up Docker Engine and Setup Docker Compose to install Docker Engine and Docker Compose first.
Build or pull the provided docker image.
cd $AUTOFE_DIR/docker
docker compose build
Or
docker pull intel/ai-workflows:pa-autofe
This command will build RecDP docker image and starts docker container. $AUTOFE_WORKSPACE
directory will be mounted to docker container and corresponding workflow will be launched in container. When the command completes successfully, a EDA
folder will be created under $workspace folder, containing transformed dataset and pipeline.json
file that contains feature transform pipeline created by RecDP.
Note: We use system proxy from host while building docker image, so please set correct environment param for
http_proxy
andhttps_proxy
in host.
Configuration for ${AUTOFE_DIR}/applications/nyc_taxi_fare/workflow.yaml
dataset_path: ./raw_data/train.csv
target_label: fare_amount
engine_type: spark
cd ${AUTOFE_DIR}/docker
export AUTOFE_WORKSPACE=${AUTOFE_DIR}/applications/nyc_taxi_fare
docker compose run autofe-notebook-UI
- Docker container will be built and started
- Then you can follow the instruction to open notebook on http://{host}:{port}/lab/tree/interactive_notebook.ipynb
[+] Building 2.9s (10/10) FINISHED
=> [internal] load build definition from Dockerfile-recdp 0.0s
...
⠿ Container autofe_pyrecdp Created 0.0s
Attaching to autofe_pyrecdp
...
autofe_notebook_UI | [I 2023-05-15 15:48:10.346 ServerApp] Jupyter Server 2.5.0 is running at:
autofe_notebook_UI | [I 2023-05-15 15:48:10.346 ServerApp] http://{host}:8890/lab
autofe_notebook_UI | [I 2023-05-15 15:48:10.346 ServerApp] http://127.0.0.1:8890/lab
autofe_notebook_UI | [I 2023-05-15 15:48:10.346 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Workflow Name | Description |
---|---|
NYC taxi fare | Fare prediction based on NYC taxi dataset |
Amazon Product Review | Product recommandation based on reviews from Amazon |
IBM Card Transaction Fraud Detect | Recognize fraudulent credit card transactions |
Twitter Recsys | Real-world task of tweet engagement prediction |
Outbrain | Click prediction for recommendation system |
Covid19 TabUtils | integration example with Tabular Utils |
PredictiveAssetsMaintenance | integration example with predictive assets maintenance use case |
Create a workflow directory like mkdir -p ${AUTOFE_DIR}/applications/customer_dataset/raw_data
and put your dataset under raw_data
directory.
- Create workflow configuration yaml file
${AUTOFE_DIR}/applications/customer_dataset/workflow.yaml
- Add workflow configurations in workflow.yaml, supported configurations are listed in the table
Name | Description |
---|---|
dataset_path | set dataset directory |
target_label | specify target lable of dataset |
engine_type | config auto feature engineering engine type, support pandas and spark |
- Example configuration of NYC taxi dataset:
dataset_path: ./raw_data/train.csv
target_label: fare_amount
engine_type: spark
Pick either jupyter mode or docker cmdline mode to run the workflow.
cd ${AUTOFE_DIR}/docker
export AUTOFE_WORKSPACE=${AUTOFE_DIR}/applications/customer_dataset
docker compose run autofe-notebook-UI
cd ${AUTOFE_DIR}/docker
export AUTOFE_WORKSPACE=${AUTOFE_DIR}/applications/customer_dataset
docker compose run autofe-local-mode
AutoFE workflows significantly simplifies the complexity of feature engineering with just 3 lines of code, and reduced data process time.
Note: The performance data is tested on single node with 3rd Gen Xeon® Scalable Performance processors with Spark as execution engine.
For more information about or to read about other relevant workflow examples, see these guides and software resources:
- RecDP toolkit
- Intel® End-to-End AI Optimization kit
- Intel® AI Analytics Toolkit (AI Kit)
- Enhance Productivity with Auto Feature Engineering Workflow from Intel
If you have questions or issues about this workflow, contact the Support Team. If there is no support forum, and we want developers to use GitHub issues to submit bugs and enhancement requests, put a link to that GitHub repo's issues.
The Auto Feature Engineering Workflow team tracks both bugs and enhancement requests using GitHub issues. Before submitting a suggestion or bug report, search the DLSA GitHub issues to see if your issue has already been reported.
*Other names and brands may be claimed as the property of others. Trademarks.