[FeatureStore] Spark read optimization #5514

Merged
merged 124 commits into mlrun:development from the spark_read_optimisation branch
May 21, 2024

tomerm-iguazio
Contributor

@tomerm-iguazio tomerm-iguazio commented May 5, 2024

  1. Added support for additional_filters in:
    a) spark_merger
    b) ParquetSource.to_spark_df
  2. Fixed a remote ingest bug.
  3. Added system and unit tests.
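
For readers unfamiliar with the filter format, here is a minimal sketch of how `(column, operator, value)` tuples such as the `("bad", "=", 95)` filter used in the tests could be turned into a SQL predicate string. This is not the code from this PR: `filters_to_predicate`, `_literal`, and the `room` column are hypothetical names used for illustration only, and None/NaN handling is deliberately left out.

```python
# Hypothetical helper, NOT mlrun's implementation: render filter tuples
# of the form (column, operator, value) as one SQL predicate string.
_SUPPORTED_OPS = {"=", "==", "!=", ">", ">=", "<", "<=", "in", "not in"}


def _literal(value):
    """Render a Python value as a SQL literal (plain numbers and strings only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        if value != value:  # NaN is the only value not equal to itself
            raise ValueError("NaN is not handled in this sketch")
        return repr(value)
    if isinstance(value, str):
        if "'" in value or "\\" in value:
            raise ValueError("quotes and backslashes are not handled in this sketch")
        return "'" + value + "'"
    raise TypeError(f"unsupported filter value: {value!r}")


def filters_to_predicate(filters):
    """AND together a list of (column, operator, value) filter tuples."""
    clauses = []
    for column, op, value in filters:
        op = op.lower()
        if op not in _SUPPORTED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        if op in ("in", "not in"):
            rendered = "(" + ", ".join(_literal(v) for v in value) + ")"
        else:
            rendered = _literal(value)
        sql_op = "=" if op == "==" else op.upper()
        clauses.append(f"`{column}` {sql_op} {rendered}")
    return " AND ".join(clauses)


# "room" is a made-up column name for the example.
predicate = filters_to_predicate([("bad", "=", 95), ("room", "in", [1, 2])])
print(predicate)
```

A string like this can be passed to Spark's `DataFrame.filter`, which accepts a SQL expression string, so the filter is applied when the Parquet data is read rather than after the full dataset is loaded.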

ML-6289

tomerm-iguazio and others added 30 commits April 11, 2024 12:33
…CSVSource, dataframeSource and BigQuerySource
only in parquetsource or parquettarget.
mlrun/datastore/sources.py Outdated
mlrun/datastore/sources.py Outdated
mlrun/datastore/sources.py Outdated
none_exists = False
value = list(value)
for none_value in none_values:
    if none_value in value:
Collaborator

This issue is still outstanding.

mlrun/datastore/sources.py
tests/system/feature_store/utils.py Outdated
kind = None if self.run_local else "remote-spark"
resp = fstore.get_offline_features(
    feature_vector=vec,
    additional_filters=[("bad", "=", 95)],
Collaborator

Still missing a NaN value in the test data afaict. Not to be confused with None.
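
To illustrate why the distinction matters, here is a small sketch (not taken from this PR) of how None and NaN behave differently under plain membership tests like `if none_value in value:`. The helper `split_missing` is a hypothetical name used only for this example.

```python
import math


def split_missing(values):
    """Separate None and NaN entries from the regular filter values.

    Hypothetical helper for illustration; returns (has_none, has_nan, rest).
    """
    has_none = any(v is None for v in values)
    has_nan = any(isinstance(v, float) and math.isnan(v) for v in values)
    rest = [
        v
        for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    return has_none, has_nan, rest


values = [95, None, float("nan")]

# None is found by an ordinary membership test.
none_found = None in values

# A separately created NaN is not: NaN never compares equal to anything,
# itself included, and this is a different object from the one in the list.
nan_found = float("nan") in values

result = split_missing(values)
```

Because an `in` check can miss NaN this way, test data that contains only None would not exercise the NaN path.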

mlrun/datastore/sources.py Outdated
@gtopper gtopper self-requested a review May 21, 2024 03:01
mlrun/datastore/sources.py Outdated
mlrun/datastore/sources.py Outdated
mlrun/datastore/sources.py Outdated
mlrun/datastore/sources.py
tomerm-iguazio and others added 2 commits May 21, 2024 13:36
Co-authored-by: Gal Topper <gal.topper@gmail.com>
Co-authored-by: Gal Topper <gal.topper@gmail.com>
mlrun/datastore/sources.py Outdated
@assaf758 assaf758 merged commit 2bc950e into mlrun:development May 21, 2024
11 checks passed
rokatyy pushed a commit to rokatyy/mlrun that referenced this pull request May 28, 2024
@tomerm-iguazio tomerm-iguazio deleted the spark_read_optimisation branch June 3, 2024 12:52
3 participants