feat: support more parts of end-to-end ML workflow #19

Closed · 6 of 13 tasks
deepyaman opened this issue Feb 21, 2024 · 4 comments
Comments

@deepyaman (Collaborator) commented Feb 21, 2024

Objectives

TL;DR

Start at the "Alternatives considered" section.

Constraints

  • Ibis-ML will focus on enabling data processing workloads for ML on tabular data
  • Ibis-ML will be a standalone extension library that depends on Ibis
  • Excludes domain-specific preprocessing like NLP, computer vision, and large language models
  • Does not address exploratory data analysis (EDA) or model training-related procedures

Mapping the landscape

Data processing for ML is a broad area. We need a strategy to differentiate our value, narrowing our focus to where we can provide immediate value.

Breaking down an end-to-end ML pipeline

Stephen Oladele’s neptune.ai blog article provides a high-level depiction of a standard ML pipeline.

[Figure: high-level depiction of a standard ML pipeline. Source: https://neptune.ai/blog/building-end-to-end-ml-pipeline]

The article also describes each step of the pipeline. Based on the previously established constraints, we will limit ourselves to the data preparation and model training components.

The data preparation (data preprocessing and feature engineering) and model training parts can be further subdivided into a number of processes:

  • Feature creation: Creating new features from existing ones or combining different features to create a new one.
  • Feature publishing: Pushing to a feature store to be used for training and inference by the entire organization.
  • Training dataset generation: Constructing training data by retrieving (if necessary) and joining features.
  • Data segregation: Splitting data into training, testing, and validation sets.
  • Cross validation: Assessing model performance across multiple train/validation splits (see https://scikit-learn.org/stable/modules/cross_validation.html).
  • Hyperparameter tuning: Searching over model/pipeline parameters for the best-performing configuration (see https://scikit-learn.org/stable/modules/grid_search.html).
  • Feature preprocessing
    • Feature standardization/normalization: Converting feature values to a similar scale and distribution; usually falls under model preprocessing (see the sketch below).
    • Feature cleaning: Treating missing feature values and removing outliers by capping/flooring them.
  • Feature selection: Selecting the most appropriate features to be cleaned and engineered. A number of automated algorithms exist.

Note

The above list of processes is adapted from the linked article. I've updated some of the definitions based on my experience and understanding.
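
To ground the feature standardization bullet above: z-score standardization is just arithmetic over aggregates, so it can be written as a plain Ibis expression and pushed down to the backend. A minimal sketch, assuming a table with a numeric price column (all names here are illustrative):

```python
import ibis

con = ibis.duckdb.connect()   # any Ibis backend works the same way
t = con.read_csv("data.csv")  # assumption: the file has a numeric "price" column

# z-score standardization as a lazy Ibis expression; the mean and standard
# deviation are computed by the backend rather than in local memory.
scaled = t.mutate(price_scaled=(t.price - t.price.mean()) / t.price.std())
```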

Feature comparison (WIP)

|                             | Tecton  | Scikit-learn | BigQuery ML | NVTabular | Dask-ML | Ray |
| --------------------------- | ------- | ------------ | ----------- | --------- | ------- | --- |
| Feature creation            |         | No           |             | Partial   | No      |     |
| Feature publishing          | Partial | No           |             | No        | No      |     |
| Training dataset generation |         | No           |             | No        | No      |     |
| Data segregation            | Partial | Yes          |             | No        | Yes     |     |
| Cross validation            |         | Yes          |             | No        | Yes     |     |
| Hyperparameter tuning       |         | Yes          |             | No        | Yes     |     |
| Feature preprocessing       |         | Yes          |             | Yes       | Yes     |     |
| Feature selection           |         | Yes          |             | No        | No      |     |
| Model training              |         | Yes          |             | No        | Yes     |     |
| Feature serving             | Partial | No           |             | No        | No      |     |
Details

Tecton

Scikit-learn

  • Feature creation: No
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: Yes
  • Cross validation: Yes
  • Hyperparameter tuning: Yes
  • Feature preprocessing: Yes
  • Feature selection: Yes
  • Model training: Yes
  • Feature serving: No

BigQuery ML

NVTabular

  • Feature creation: Partial
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: No
  • Cross validation: No
  • Hyperparameter tuning: No
  • Feature preprocessing: Yes
  • Feature selection: No
  • Model training: No
  • Feature serving: No

Dask-ML

  • Feature creation: No
  • Feature publishing: No
  • Training dataset generation: No
  • Data segregation: Yes
  • Cross validation: Yes
  • Hyperparameter tuning: Yes
  • Feature preprocessing: Yes
  • Feature selection: No
  • Model training: Yes
  • Feature serving: No

Ray

  • Feature creation:
  • Feature publishing:
  • Training dataset generation:
  • Data segregation:
  • Cross validation:
  • Hyperparameter tuning:
  • Feature preprocessing:
  • Feature selection:
  • Model training:
  • Feature serving:

Ibis-ML product hypotheses

Scope

  • A library needs to solve a sufficiently large problem to be adopted widely. To this end, we want to provide value in multiple stages of the ML pipeline.
  • (Domain-driven) feature engineering is already handled sufficiently well by Ibis. As with other tools that are part of an ecosystem that already supports data transformation (e.g. BigQuery, Dask), we leave feature engineering to the existing tooling (i.e. Ibis).
  • Feature publishing, retrieval, and serving are orthogonal and can be left to feature platforms.
  • Model training can’t be done by a database unless the vendor controls the underlying (cloud) infrastructure and can treat training as another distributed compute problem (as with BigQuery ML and Snowpark ML). Ibis doesn’t control the underlying compute infrastructure.
  • In practice, hyperparameter tuning in industry/large companies is often delegated to purpose-fit tools like Optuna.
  • The remainder is (potentially) in scope.
    • Data segregation and cross validation are required in almost every tabular ML problem (see the split sketch after this list).
      • @jcrist "Re: test/train splits and CV things - I do think we can provide some utilities in the initial release for handling test/train splitting or CV work, but for now I suggest we mostly focus on single model pipelines and already partitioned data. We ingest an ibis.Table as training data, we don't need to care for now where it's coming from or how the process upstream was handled IMO."
    • Feature preprocessing is a good fit for Ibis to provide value. Technically, a user with a model pipeline (e.g. scikit-learn) may already include that in their pipeline, so they may or may not leverage this.
    • (Automated) feature selection is more case-by-case, and therefore a lower priority.
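
To make the data segregation point concrete: a deterministic train/test split can be expressed directly in Ibis and run on any backend. A minimal sketch, assuming a table with a stable integer id column (table and column names are placeholders):

```python
import ibis

con = ibis.duckdb.connect()
t = con.table("events")  # assumption: has a stable integer "id" column

# Deterministic 80/20 split keyed on the id, so rows land in the same set on
# every run and on every engine; hashing a key column works similarly.
train = t.filter(t.id % 10 < 8)
test = t.filter(t.id % 10 >= 8)
```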

Alternatives considered

End-to-end IMO also means that you should be able to go beyond just preprocessing the data. There are a few different approaches here:

  1. Ibis-ML supports fitting data preprocessing steps (during the training process) and applying pre-trained Ibis-ML preprocessing steps (during inference).
    • Pros: Ibis-ML is used during both training and inference
    • Cons: Ibis-ML only supports data preprocessing, and even then only the subset of steps that can be fit in-database (excluding some very widely used steps, like PCA, that sit in the middle of the data-preprocessing pipeline)
  2. Ibis-ML supports constructing transformers from a wider range of pre-trained preprocessors and models (from other libraries, like scikit-learn), and applying them across backends (during inference).
    • Pros: Ibis-ML gives users the ability to apply a much wider range of steps in the ML process at inference time, including preprocessing steps that apply linear transformations (e.g. PCA) and even linear models (e.g. SGDRegressor, GLMClassifier). You can even showcase end-to-end capabilities using just Ibis (from raw data to model outputs, all on your database, across streaming and batch)
    • Cons: Ibis-ML doesn't support training the preprocessors on multiple backends; the expectation is that you use a dedicated library/existing local tools for training
  3. A combination of 1 & 2, where Ibis-ML supports a wider range of preprocessing steps and models, but only a subset support a fit method (the rest must be constructed via .from_sklearn() or similar).
    • Pros: Support the wider range of operations, and also fitting everything on the database in simple cases.
    • Cons: Confusing? If I can train some of my steps using Ibis-ML, but for the rest I have to go to a different library, it doesn't feel very unified. @jcrist makes a good point that it's not so confusing, because of the separation of transformers and steps.

Proposal

I propose to go with option #3 of the alternatives considered. In practice, this will mean:

  • Keeping the existing structure of Ibis-ML
  • Adding the ability to construct transforms from_sklearn (and, in the future, potentially from other libraries); see the sketch after this list
    • Some of the transforms may not be steps you can fit using Ibis-ML
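
A rough sketch of what the from_sklearn direction might look like in practice. The ibis_ml import and the from_sklearn/transform names are placeholders for the proposed API, not an implemented interface, and the table/column names are made up:

```python
import ibis
import ibis_ml  # hypothetical import; exact package/module name may differ
from sklearn.preprocessing import StandardScaler

con = ibis.duckdb.connect()
train = con.table("train")  # assumption: numeric columns "a" and "b"

# Fit locally with scikit-learn...
scaler = StandardScaler().fit(train.select("a", "b").to_pandas())

# ...then lift the fitted estimator into an Ibis-ML transform that can be
# applied lazily, on any backend, at inference time (hypothetical API).
transform = ibis_ml.from_sklearn(scaler)
scored = transform.transform(con.table("new_data"))
```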

This also means that the following will be out of scope (at least, for now):

  • Train-test split (may have value to add in the future)
  • CV (may have value to add in the future)
  • Hyperparameter tuning (less hypothesized value; probably better to integrate with existing frameworks like Optuna)

Deliverables

Guiding principles

  • At launch, we should showcase an end-to-end Ibis-ML workflow, from preprocessing to model inference.
    • The goal is to get people excited about Ibis-ML, and for them to try the example(s).
  • In future releases, we will increase the number of methods we support for each step. If we are successful in the first sub-goal (about generating excitement), people in the community will provide direction for and even contribute to this effort.
  • The library should be incrementally adoptable. The user should get benefit from using just data segregation or just feature preprocessing, and then be able to add another piece and get further value.

Demo workflows

  1. Fit preprocessing on DuckDB (local experience, during experimentation)
  2. Experiment with different features
  3. Fit finalized preprocessing on larger dataset (e.g. from BigQuery)
  4. Perform inference on larger dataset

We are currently targeting the NVTabular demo on the RecSys2020 Challenge as a demo workflow.
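
A sketch of what steps 1–4 above could look like end to end. The Recipe and step names are assumptions (in the spirit of the tidymodels-style steps discussed below), as are the table names and BigQuery connection details:

```python
import ibis
import ibis_ml as ml  # step and method names below are assumptions

# (1) Fit preprocessing on DuckDB during local experimentation.
local = ibis.duckdb.connect()
recipe = ml.Recipe(
    ml.ImputeMean(ml.numeric()),    # hypothetical step names
    ml.ScaleStandard(ml.numeric()),
)
recipe.fit(local.table("train_sample"))
# (2) Iterate on the feature set here, re-fitting cheaply on the sample.

# (3)-(4) Re-fit the finalized recipe on the full dataset in BigQuery and run
# inference there; only the recipe definition moves between engines, not data.
bq = ibis.bigquery.connect(project_id="my-project")  # placeholder project
recipe.fit(bq.table("train_full"))
features = recipe.transform(bq.table("inference_data"))
```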

We need variants for all of:

  • scikit-learn (we already have)
  • XGBoost
  • PyTorch

With less priority:

  • LightGBM
  • TensorFlow
  • CatBoost

High-level deliverables

P0 deliverables must be included in the Q1 release. The remainder are prioritized opportunistically/for future development, but priorities may shift (e.g. due to user feedback).

  • [P0] Support handoff to XGBoost (for training and inference). Update: to_dmatrix/to_dask_dmatrix are already implemented (see the sketch after this list)
  • [P0] Support handoff to PyTorch (for training and inference)
  • [P0] Build demo workflows
  • [P0] Make documentation ready for "initial" release
  • [P0] Increase coverage of Ibis-ML preprocessing steps w.r.t. tidymodels
  • [P1] Increase coverage of data processing transformer(s) from_sklearn
  • [P2] Increase coverage of model prediction transformer(s) from_sklearn (i.e. those with predict functions that don't require UDFs)
  • [P2] Support handoff to LightGBM (for training and inference)
  • [P2] Support handoff to TensorFlow (for training and inference)
  • [P2] Support handoff to CatBoost (for training and inference)
  • [P2] Support (demo?) inference in streaming contexts
  • [P3] Support constructing some data preprocessing transformer(s) from_sklearn (e.g. PCA, or some more frequently used step)
  • [P3] Support constructing some (linear) model prediction transformer(s) from_sklearn (e.g. SGDRegressor)
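
To illustrate the XGBoost handoff flagged as P0 above: the sketch below materializes features via pandas for simplicity, whereas the already-implemented to_dmatrix/to_dask_dmatrix helpers can presumably avoid that extra hop. The "target" label column is an assumption:

```python
import xgboost as xgb

# "features" is a preprocessed Ibis table expression, e.g. from the recipe
# sketch earlier; assume it includes a "target" label column.
df = features.to_pandas()
dtrain = xgb.DMatrix(df.drop(columns=["target"]), label=df["target"])
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)
```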

Questions for validation

  • Does being able to perform inference for certain model classes directly on the database, without UDF, provide real value? Are models with linear predict too narrow a category for people to care?
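
For context on "models with linear predict": prediction is just a dot product plus an intercept, so a fitted model's coefficients can be inlined into an Ibis expression and compiled to SQL with no UDF. A minimal sketch with made-up coefficients and column names:

```python
import ibis

con = ibis.duckdb.connect()
t = con.table("inference_data")  # assumption: numeric "age" and "income" columns

# Coefficients from a fitted linear model (e.g. SGDRegressor) -- illustrative.
coefs = {"age": 0.8, "income": 1.9}
intercept = 0.25

# predict(x) = intercept + sum_i coef_i * x_i, built as a plain expression.
prediction = sum(
    (t[col] * w for col, w in coefs.items()),
    start=ibis.literal(intercept),
)
scored = t.mutate(pred=prediction)
```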

Changelog

2024-03-19

Based on discussion around the Ibis-ML use cases and vision with stakeholders, some of the priorities have shifted:

  • Ibis-ML should leverage the underlying engine during both training and inference, and speeding up the training iteration loop on big data is a key value proposition. Therefore, support for constructing transformers from_sklearn is no longer a priority, dropping from P0 to P3.
    • The associated demo of scaling inference only is also removed.
  • Increasing coverage of ML preprocessing steps w.r.t. tidymodels recipes and sklearn.preprocessing is a higher priority. We break down the relative priority of implementing steps in a separate issue.
@jitingxu1 (Collaborator) commented:

Thanks @deepyaman for putting this together.

Allowing users to complete all data-related tasks before model training, without switching to other tools, would be highly beneficial. Since users need to grasp the data thoroughly before selecting suitable features and preprocessing strategies, integrating EDA (univariate analysis, correlation analysis, and feature importances) into the feature engineering phase becomes imperative. This ensures users have a comprehensive understanding of the data, empowering them to make informed decisions during the feature selection and preprocessing stages.

> Does not address exploratory data analysis (EDA) or model training-related procedures

@deepyaman (Collaborator, Author) commented:

> Allowing users to complete all data-related tasks before model training, without switching to other tools, would be highly beneficial. [...]

I agree that it could be valuable to handle more where Ibis is well-suited (e.g. some EDA). Your open issue on the ibis repo is very relevant. W.r.t. model training, that ultimately would need to be handled by other libraries, but we should make sure that the handoffs are smooth and efficient.

Feature engineering is a much bigger topic; I could see Ibis-ML expanding in that direction, to include some auto-FE (a la Featuretools), but it's not clear whether that's a priority. It's also a bit separate from the initial focus.

@deepyaman (Collaborator, Author) commented:

For consideration from @jcrist just now: Consider something like `transform_sklearn(est, table) -> table` over `from_sklearn(est) -> some_new_type`, to avoid naming/designing the `some_new_type` object.

@deepyaman: The `some_new_type` could just be a transform (or step, post-refactor?); check which option will be easier.
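
Sketching the two shapes side by side (signatures only; neither is implemented, and `some_new_type` is deliberately left undesigned, per the note above):

```python
def transform_sklearn(est, table):
    """Eagerly apply a fitted sklearn estimator to an Ibis table.

    One-shot call: table in, table out; no intermediate object to
    name or design.
    """
    ...

def from_sklearn(est):
    """Wrap a fitted sklearn estimator as a reusable transform.

    Returns some_new_type (a transform, or a step post-refactor?),
    which can then be applied to many tables and composed with other
    Ibis-ML steps.
    """
    ...
```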

@deepyaman (Collaborator, Author) commented:

IbisML 0.1.0 is released and covers most of this.
