Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: support sourcing images from either file path or in-memory data frame #1174

Merged

Conversation

jimthompson5802
Copy link
Collaborator

Code Pull Requests

PR addresses Issue #484 and Issue #268.

Adds option to support sourcing images from file path or from in-memory as numpy arrays.

Why: renaming variables to support either a string for file path
to image or image stored as numpy array in input data set.
Why: Able to accept either file path string
or numpy array for image feature.
Why: unit test test_basic_image_feature test sourcing
images from both file and as ndarrays in dataframe.
Why: Accounting for possible settings by the user.
Why: Accounting for possible settings by the user.
Why:  To bypass test combination that causes returning index
values instead of image ndarrays for training
why: to comply with design standard on use of COLUMN
key in feature dictionary and added helper function
to declutter a long statement.
why: To explain rationale for special handling
of unit test where images are sourced from
the file system.
why: explict test to ensure the rest api call completed
successfully for /predict and /batch_predict endpoints.
why: put in scaffolding to eventually support use of
ndarray as image input for test.
why: support new capability int image feature.  To support sending
ndarray objects in REST api call, added helper functions to
ludwig.utils.data_utils.py for serializing/deserializing ndarray objects
to/from ludwig custom string format.
Why:  Support passing skip_save_processed_input parameter to
add_feature_data() methods.  This will allow the methods to
customize their setup based on the setting of skip_save_processed_input.
Immediate need is to support image feature setup to support
ndarray support.
@jimthompson5802
Copy link
Collaborator Author

Ready for review. summary of changes:

  • Major changes in image_feature.py to support both file paths and ndarray as image features in a pandas data frame.
  • Modified api signature for couple core apis to support passing skip_save_processed_input flag to the image_feature add_feature_data() to handle different setup situations.
  • added unit test to test for file path and ndarray sourcing of images
  • added 3 helper functions to utils.data_utils to support passing ndarrays in a REST request
  • modified unit test for Ludwig server to support passing ndarray images
  • updated all other feature add_feature_data() methods for adding skip_save_processed_input parameter.

@jimthompson5802 jimthompson5802 marked this pull request as ready for review May 10, 2021 01:02
ludwig/features/image_feature.py Outdated Show resolved Hide resolved
ludwig/features/image_feature.py Outdated Show resolved Hide resolved
ludwig/features/image_feature.py Outdated Show resolved Hide resolved
ludwig/features/image_feature.py Outdated Show resolved Hide resolved
Why: Remove need for custom ludwig string format to handle ndarrays.
Made more robust in file handling.  Add capability to restore
dtype for ndarray are set the same as in the original data source.
Create ludwig.utils.server_utils to house the new serialize/deseriallize
functions.
why: renamed variables to make more consistent. corrected
a few log messages.
@jimthompson5802
Copy link
Collaborator Author

Summary of changes:

  • Deprecated the the ludwig custom string format for ndarrays and removed the related helper functions from ludwig.utils.data_utils.
  • Complete re-write of the functions that serialize/deserialize the data used in the REST API. Removed the old functions. Created three new helper functions for the write of serializing and deserializing the REST API data. Housed these new helper functions in a new module ludwig.utils.server_utils.py.
  • Addressed review comments on variable naming, i.e., img_store -> img_entry.

Work remaining:

  • Have to add additional test for when in_memory and file/ndarray settings are different in training vs when predicting.
  • Code clean-up on now obsolete code that is currently commented out.

remove code made obsolete by PR ludwig-ai#1174 and address minor todos
provide additional details on how serialize helper function works.
removed code for the deprecated helper functions for
ludwig custom string format.
@jimthompson5802 jimthompson5802 marked this pull request as draft May 17, 2021 01:39
@jimthompson5802
Copy link
Collaborator Author

Converted back to draft mode. Clarification of requirements is forcing a rethink of current approach.

Why: replace function _write_file() with _read_image_buffer()
this will eliminate need to write temporary files and
do clean up of the temporary files.  Remove helper functions
for custom ludwig string format of ndarray.
Why: rename img_source to img_entry to be consistent
with rest of code base.
Why:  Use of Dask backend and use of hdf5 cache is incompatibile.  With
the recent change to support ndarray for images, there are two conditions
that now affect if hdf5 cache is used for images.  This change moves
the backend test for use of hdf5 cache earlier in processing to avoid
interactions with the two conditions that are currently in place.
Why: To support audio features that have not been
been updated to support an option for ndarray
representation.  Add unit test for audio feature
for model serving.
Why: allow testing of multiple or single record batch as
independent tests.
Why: To be addressed as part of long-term
update to the audio feature.
@jimthompson5802 jimthompson5802 marked this pull request as ready for review May 21, 2021 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants