Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add storage interfaces, basic file structure #529

Merged
merged 9 commits into from
Mar 18, 2020

Conversation

zhilingc
Copy link
Collaborator

@zhilingc zhilingc commented Mar 10, 2020

What this PR does / why we need it:
This PR adds the proposed storage interfaces for the storage refactor mentioned in issue #402 . Implementations of the interfaces defined should go into individual modules in storage/connectors. The introduction of these interfaces and their implementations will hopefully allow for greater extensibility and less code duplication, as well as unspaghettify the Serving codebase.

Making this PR for discussion before i move the redis and BQ code.

Notes:

  • The OnlineRetriever closely follows the implementation written by @smadarasmi for the cassandra store. It is less generic than the BatchRetriever interface so that implementations don't have to implement the code tracking missing/stale features. EDIT: I've made the methods a lot more high-level.
  • I'm not sure if the BatchRetriever methods should be more granular. If we want to support a wide range of stores, this might be as granular as it gets.
  • FeatureSink writes emit WriteResult objects that support the retrieval of successful and failed attempts in anticipation of Shift ingestion feature metrics to after store write  #489

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhilingc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zhilingc zhilingc requested review from ches and removed request for pradithya March 10, 2020 07:23
@zhilingc
Copy link
Collaborator Author

/retest

1 similar comment
@zhilingc
Copy link
Collaborator Author

/retest

@zhilingc
Copy link
Collaborator Author

/retest

@woop
Copy link
Member

woop commented Mar 17, 2020

lgtm

@woop
Copy link
Member

woop commented Mar 18, 2020

/lgtm

@feast-ci-bot feast-ci-bot merged commit 8d8737a into feast-dev:storage-refactor Mar 18, 2020
zhilingc pushed a commit that referenced this pull request Mar 19, 2020
* Add storage interfaces, basic file structure

* Apply spotless, add comments

* Move parseResponse and isEmpty to response object

* Make changes to write interface to be more beam-like

* Pass feature specs to the retriever

* Pass feature specs to online retriever

* Add FeatureSetRequest

* Add mistakenly removed TestUtil

* Add mistakenly removed TestUtil
Copy link
Member

@ches ches left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post-merge review, hope it can still be useful.

I have a major concern about a single storage-api module having Beam dependencies, which serving should not have. We see it coming to life in #553 adding an exclusion on org.apache.beam:*

@zhilingc had some thoughts that there may be unavoidable realities—seems like a nuanced discussion, maybe should take it to public Feast Slack or calls.

Although writing via Beam IOs is very much part of "storage", I feel it's vital that storage API needs of serving and ingestion are decoupled at a module level.

<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can omit the JUnit version from all of these, 4.12 is the one applied by our <dependencyManagement> via the imported one from Spring.

Apache commons-lang3 is also there, at version 3.7.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha

storage/connectors/pom.xml Show resolved Hide resolved
storage/api/pom.xml Show resolved Hide resolved
import java.util.List;

/** Interface for implementing user defined retrieval functionality from Batch/historical stores. */
public interface BatchRetriever {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: "Interface for implementing" is redundant, it is (statically) an interface and those are always for implementing ☺️

Personally I find it helpful, as a reader and writer, to read/think in terms of the abstraction that's represented, as a noun—what an instance would be/do. Some of the first sentences of thejava.nio.channels.Channel interface Javadoc, for instance:

A channel represents an open connection to an entity such as a hardware device, a file, a network socket, or a program component that is capable of performing one or more distinct I/O operations, for example reading or writing.

This is harder when you get into things that are architectural glue more than domain concepts. I try to defer defining those for as long as possible… we're necessarily near the edge of it here.

I'll take a riff:

A BatchRetriever fetches historical feature values from storage.

Thinking about it, I wonder if HistoricalRetriever is a more expressive name. It's really what conceptually—and technically—distinguishes it from online retrieval in Feast. "Offline" and "batch" are more overloaded, fuzzier terms. Something like Druid or various streaming SQL engines could make online historical usage feasible. Whether that has a place in Feast is beyond the subject here, but hopefully my point stands regardless.

Honestly I liked your first iteration with getFeatures naming and I agree with your instinct that single implementations like class SQLLiteRetriever implements BatchRetriever, OnlineRetriever are going to be the exception much more than the norm. And they would still be possible with the overloads, even elegant. It's subjective taste at that point though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, I wonder if HistoricalRetriever is a more expressive name.

Makes sense. We used to call it historical serving but for some reason it just didn't have the right ring to it. It's definitely more accurate than simply batch.

* contains the error to be returned to the user.
*/
ServingAPIProto.Job getBatchFeatures(
ServingAPIProto.GetBatchFeaturesRequest request, List<FeatureSetRequest> featureSetRequests);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be controversial because it might imply many new classes that are very close to proto generated ones, and a translation layer somewhere with some tedious boilerplate, but I would really like to see "api" modules working with domain model types that are not proto wire format—like the newly-defined FeatureSetRequest in this PR as part of "api" is a step in the right direction to me, but it's essentially just a wrapper of ImmutableSet<FeatureReference> so I wonder if we could go further and use that directly here instead of a wrapper (although FeatureReference is still coming from proto, but maybe it's acceptable when they're more of a fundamental data model type and not request objects, DTOs, etc.).

Concretely I'm skeptical of ServingAPIProto.GetBatchFeaturesRequest and ServingAPIProto.Job at this layer: the former is a request object that should go no deeper than the RPC controller layer of serving IMO. Could the latter be served here by feast.core.model.Job? Granted that having storage-api depend on core is not ideal, but it may be that ubiquitous models that cross these new layers need to move (perhaps datatypes-java starts to carry source code beyond only generated, or "job" is really fundamentally an Ingestion domain concept and a Job model could be in an "ingestion-api" that helps bridge us from Core to Serving doing job management…).

It's a bit odd to me that ServingAPIProto.Jobis defined where it is currently anyhow, though I guess this might also be part of the inclination to move job management. And I acknowledge it might be justified to have different contracts for what a client making a historical query cares about status, versus more internal concerns of ingestion job management. But it's like a serving-specific View Model of the core domain model then. I digress, this is a tangent from this already rambling comment, but a good one to carry on elsewhere.

It feels redundant that this interface method has both the GetBatchFeaturesRequest and List<FeatureSetRequest> as arguments—there's substantial overlap isn't there? Seems like incidental complexity for the caller to provide both, all the information should be in the feature references. If we boil this down to essentials, forgetting any existing types that happen to be defined, I think the method requires these things:

  • Set/list of (qualified) feature references
  • Set of entities, paired with a datetime

(Do we also handle/support a query-supplied max age per feature set, for batch? Proto schemas allow it but Python API doesn't seem to).

As an API interface, what would it look like to simplify this to domain model and basic collection types?

Sorry for such a long and meandering comment…

/**
* Get the featureset associated with this response.
*
* @return String featureset reference in format featureSet:version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've alluded to the same in TestUtil, but the references dearly need to become first-class types instead of strings, IMO. But again, something we should address outside this PR (but quite possibly before storage-refactor merges to mainline, their absence is bearing on these fundamental interfaces we're designing).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! It will be added in @mrzzy's PR #548, in proto format, at least.

import feast.types.FeatureRowProto;

/** Response from an online store. */
public interface OnlineRetrieverResponse {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my earlier remark about FeatureSetRequest, I'm questioning the value at this layer of a wrapper over a simple collection of FeatureRows. They carry the information of what feature set they're associated with.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. I think this is an artifact of a previous iteration that i forgot to remove.

* @return list of {@link OnlineRetrieverResponse} for each entity row
*/
List<List<FeatureRow>> getOnlineFeatures(
List<EntityRow> entityRows, List<FeatureSetRequest> featureSetRequests);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the types used in the signature here, in contrast to getBatchFeatures, lend support to my arguments there.

*
* @return {@link PTransform}
*/
PTransform<PCollection<FeatureRow>, WriteResult> write();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bikeshedding, but the method name feels a little odd for what it does: it sounds imperative. Maybe writer() solves that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is a common convention in Beam especially for the Write transform implementations of IOs, my bad.

Copy link
Collaborator Author

@zhilingc zhilingc Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine, most other implementations do it because it so that the API is fluent: it usually follows with the write destination, e.g. SomeIO.write().to(somewhere). In this case we don't supply such an option, so writer() makes more sense

import org.apache.beam.sdk.values.PCollection;

/** Interface for implementing user defined feature sink functionality. */
public interface FeatureSink extends Serializable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/** A FeatureSink defines how feature values are written to storage, through a Beam transform. */

Or something like that… The current doc tells me practically nothing as an implementer of what these are for (the method docs are good!). It's also a muddy description of a contract, sounds like we're uncommitted to keeping this abstraction well-defined as time goes on.

zhilingc pushed a commit that referenced this pull request Mar 23, 2020
* Add storage interfaces, basic file structure

* Apply spotless, add comments

* Move parseResponse and isEmpty to response object

* Make changes to write interface to be more beam-like

* Pass feature specs to the retriever

* Pass feature specs to online retriever

* Add FeatureSetRequest

* Add mistakenly removed TestUtil

* Add mistakenly removed TestUtil
zhilingc pushed a commit that referenced this pull request Mar 29, 2020
* Add storage interfaces, basic file structure

* Apply spotless, add comments

* Move parseResponse and isEmpty to response object

* Make changes to write interface to be more beam-like

* Pass feature specs to the retriever

* Pass feature specs to online retriever

* Add FeatureSetRequest

* Add mistakenly removed TestUtil

* Add mistakenly removed TestUtil
zhilingc pushed a commit to zhilingc/feast that referenced this pull request Apr 3, 2020
* Add storage interfaces, basic file structure

* Apply spotless, add comments

* Move parseResponse and isEmpty to response object

* Make changes to write interface to be more beam-like

* Pass feature specs to the retriever

* Pass feature specs to online retriever

* Add FeatureSetRequest

* Add mistakenly removed TestUtil

* Add mistakenly removed TestUtil
zhilingc pushed a commit that referenced this pull request Apr 3, 2020
* Add storage interfaces, basic file structure

* Apply spotless, add comments

* Move parseResponse and isEmpty to response object

* Make changes to write interface to be more beam-like

* Pass feature specs to the retriever

* Pass feature specs to online retriever

* Add FeatureSetRequest

* Add mistakenly removed TestUtil

* Add mistakenly removed TestUtil
zhilingc pushed a commit that referenced this pull request Apr 7, 2020
* Add storage interfaces, basic file structure

* Apply spotless, add comments

* Move parseResponse and isEmpty to response object

* Make changes to write interface to be more beam-like

* Pass feature specs to the retriever

* Pass feature specs to online retriever

* Add FeatureSetRequest

* Add mistakenly removed TestUtil

* Add mistakenly removed TestUtil
feast-ci-bot pushed a commit that referenced this pull request Apr 7, 2020
…567)

* Add storage interfaces, basic file structure (#529)

* Add storage interfaces, basic file structure

* Apply spotless, add comments

* Move parseResponse and isEmpty to response object

* Make changes to write interface to be more beam-like

* Pass feature specs to the retriever

* Pass feature specs to online retriever

* Add FeatureSetRequest

* Add mistakenly removed TestUtil

* Add mistakenly removed TestUtil

* Add BigQuery storage (#546)

* Add Redis storage implementation (#547)

* Add Redis storage

* Remove staleness check; can be checked at the service level

* Remove staleness related tests

* Add dependencies to top level pom

* Clean up code

* Change serving and ingestion to use storage API (#553)

* Change serving and ingestion to use storage API

* Remove extra exclusion clause

* Storage refactor API and docstring tweaks (#569)

* API and docstring tweaks

* Fix javadoc linting errors

* Apply spotless

* Fix javadoc formatting

* Drop result from HistoricalRetrievalResult constructors

* Change pipeline to use DeadletterSink API (#586)

* Add better code docs to storage refactor (#601)

* Add better code documentation, make GetFeastServingInfo independent of retriever

* Make getStagingLocation method of historical retriever

* Apply spotless

* Clean up dependencies, remove exclusions at serving (#607)

* Clean up OnlineServingService code (#605)

* Clean up OnlineServingService code to be more readable

* Revert Metrics

* Rename storage API packages to nouns
@ches ches added this to the v0.5.0 milestone Apr 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants