feat: add support for raw data when encoding #598

LMMilliken · 2022-11-04T10:06:02Z

This PR adds support for passing a list of strings instead of a DocumentArray to the finetuner.encode function. The modality of the list's contents is inferred from the model passed to the function, so no additional arguments are needed.
The docs have been updated to reflect this change; however, encoding with a list has not been given priority over encoding with a DocumentArray. Existing example code snippets are left unchanged, and some additional snippets have been added.
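
One way the modality inference described above might work is sketched below. This is a minimal illustration under stated assumptions, not finetuner's actual implementation: `Doc`, `FakeModel`, its `modality` attribute, and `docs_from_list` are hypothetical stand-ins, and treating strings as URIs for image models is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a docarray Document."""
    text: Optional[str] = None
    uri: Optional[str] = None


@dataclass
class FakeModel:
    """Hypothetical stand-in for the inference engine; `modality` is assumed."""
    modality: str  # 'text' or 'image'


def docs_from_list(model: FakeModel, data: List[str]) -> List[Doc]:
    # Assumption: text models read each string as content,
    # image models read each string as a URI pointing at an image.
    if model.modality == 'text':
        return [Doc(text=s) for s in data]
    if model.modality == 'image':
        return [Doc(uri=s) for s in data]
    raise ValueError(f'Unsupported modality: {model.modality}')


text_docs = docs_from_list(FakeModel('text'), ['hello', 'world'])
image_docs = docs_from_list(FakeModel('image'), ['cat.png'])
```

Because the model already knows which modality it expects, the caller only has to pass the strings.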

This PR references an open issue
I have added a line about this change to CHANGELOG

LMMilliken · 2022-11-04T10:46:51Z

finetuner/__init__.py

@@ -485,3 +488,5 @@ def encode(
            inputs = model._move_to_device(inputs)
            output = model.run(inputs)
            batch.embeddings = output.detach().cpu().numpy()
+
+    return data


Since the encode method previously modified the input DocumentArray in place, I was not sure how to return the embeddings when the user provides a list. Currently I return the final DocumentArray in both cases, though I would welcome alternative suggestions.

CHANGELOG.md

finetuner/__init__.py

finetuner/data.py

README.md

docs/walkthrough/integrate-with-jina.md

finetuner/__init__.py

guenthermi · 2022-11-04T16:16:48Z

finetuner/data.py

+) -> DocumentArray:
+    """If data has been provided as a list, a :class:`DocumentArray` is created
+    from the elements of the list
+    """


The parameter descriptions are missing.

@gmastrapas suggested they be removed, as this is an internal function.

finetuner/data.py

README.md

bwanglzu

This is a good point. If the user wants to encode a DocumentArray, we should return a DocumentArray; if the user wants to encode a list of objects, I believe it's better to return an np.ndarray of embeddings.
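
The return behaviour suggested in this comment could look roughly like the sketch below. It is only an illustration: `DocList` is a hypothetical stand-in for DocumentArray, `fake_embed` is a placeholder for a real model call, and a nested Python list stands in for the np.ndarray so that the example needs no third-party packages.

```python
from typing import List, Union


class DocList(list):
    """Hypothetical stand-in for a DocumentArray: a list of dict documents."""


def fake_embed(text: str) -> List[float]:
    # Placeholder for a real model call; returns a 1-dimensional "embedding".
    return [float(len(text))]


def encode(data: Union[DocList, List[str]]) -> Union[DocList, List[List[float]]]:
    # Remember the input type before normalising it to a DocList.
    return_docs = isinstance(data, DocList)
    docs = data if return_docs else DocList({'text': s} for s in data)
    for doc in docs:
        doc['embedding'] = fake_embed(doc['text'])
    # DocList in, DocList out; plain list in, bare embeddings out.
    return docs if return_docs else [doc['embedding'] for doc in docs]


vectors = encode(['ab', 'cde'])
docs = encode(DocList([{'text': 'ab'}]))
```

Recording the input type up front keeps the encoding loop identical for both kinds of input; only the final return statement differs.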

docs/walkthrough/integrate-with-jina.md

gmastrapas · 2022-11-07T10:09:50Z

docs/notebooks/image_to_image.ipynb

@@ -1,23 +1,10 @@
 {


What happened here? The notebook changed but not the md file?

Ah, I didn't mean to commit that. I think that since the notebook file records metadata such as the number of runs, running the notebook caused changes that I committed by mistake.

I attempted to revert the changes to that file to match main, but it has since been updated there as well; after merging there will be no difference to either the notebook or the markdown.

finetuner/data.py

gmastrapas · 2022-11-07T14:07:02Z

finetuner/data.py

@@ -58,6 +63,29 @@ def build_dataset(
    return data


+def build_encoding_dataset(
+    model: Union['InferenceEngine', str], data: Union[List[str], DocumentArray]


Can this argument be a string?

No, it cannot. Changed now.

docs/walkthrough/integrate-with-jina.md

bwanglzu

LGTM! Before release, let's also update the notebooks' inference section and make inference from a list the default option. See https://finetuner.jina.ai/notebooks/text_to_text/#inference

guenthermi

LGTM, only added one minor comment

guenthermi · 2022-11-07T15:35:10Z

finetuner/__init__.py

@@ -469,6 +471,9 @@ def encode(

    from _finetuner.models.inference import ONNXRuntimeInferenceEngine

+    return_da = isinstance(data, DocumentArray)
+    data = build_encoding_dataset(model=model, data=data)


I would prefer to call this function only if the input type is a list.
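
The guard requested here can be sketched as follows. This is a simplified illustration, not the merged code: `DocList` is a hypothetical stand-in for DocumentArray, `build_dataset_stub` is a hypothetical simplification of build_encoding_dataset with the model argument omitted, and the `calls` list exists only to show when the conversion runs.

```python
from typing import List, Tuple, Union


class DocList:
    """Hypothetical stand-in for a DocumentArray (deliberately not a list)."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = list(texts)


calls: List[int] = []  # records each conversion, for illustration only


def build_dataset_stub(data: List[str]) -> DocList:
    # Simplified stand-in for build_encoding_dataset; the model argument is omitted.
    calls.append(len(data))
    return DocList(data)


def prepare(data: Union[List[str], DocList]) -> Tuple[DocList, bool]:
    return_da = isinstance(data, DocList)
    if isinstance(data, list):
        # Only raw lists need converting; a DocList passes through untouched.
        data = build_dataset_stub(data)
    return data, return_da


existing = DocList(['a'])
same, was_da = prepare(existing)
converted, was_list_da = prepare(['b', 'c'])
```

With the check at the call site, the conversion helper only ever sees lists, so its signature does not need to accept both types.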

scott-martens

LGTM

github-actions · 2022-11-07T16:57:53Z

📝 Docs are deployed on https://ft-feat-support-encoding-raw-data--jina-docs.netlify.app 🎉

feat: add function

373a47c

LMMilliken self-assigned this Nov 4, 2022

feat: add function

f3f526b

LMMilliken marked this pull request as draft November 4, 2022 10:18

lmmilliken added 2 commits November 4, 2022 11:38

feat: add build_encoding_dataset function

0be6810

feat: add build_encoding_dataset function

59802bb

github-actions bot added size/s area/core area/entrypoint area/testing labels Nov 4, 2022

lmmilliken added 2 commits November 4, 2022 11:43

feat: add build_encoding_dataset function

a0c2697

Merge branch 'feat-support-encoding-raw-data' of https://github.com/jina-ai/finetuner into feat-support-encoding-raw-data

3077e3d

LMMilliken commented Nov 4, 2022

github-actions bot added size/m area/docs and removed size/s labels Nov 4, 2022

LMMilliken linked an issue Nov 4, 2022 that may be closed by this pull request

Add support for raw data when encoding #595

Closed

test: added tests

0fa0f07

LMMilliken force-pushed the feat-support-encoding-raw-data branch from 83304ae to 0fa0f07 Compare November 4, 2022 14:01

LMMilliken marked this pull request as ready for review November 4, 2022 14:02

LMMilliken requested review from gmastrapas, bwanglzu, guenthermi and scott-martens November 4, 2022 14:02

gmastrapas suggested changes Nov 4, 2022

CHANGELOG.md (outdated, resolved)

finetuner/__init__.py (outdated, resolved)

finetuner/data.py (outdated, resolved)

finetuner/data.py (outdated, resolved)

finetuner/data.py (outdated, resolved)

chore: updated typehints

b01d2bd

guenthermi reviewed Nov 4, 2022

fix: fixed modality checking

2ecf4f6

LMMilliken force-pushed the feat-support-encoding-raw-data branch from a70305d to 2ecf4f6 Compare November 4, 2022 16:42

bwanglzu reviewed Nov 7, 2022

README.md (outdated, resolved)

bwanglzu suggested changes Nov 7, 2022

docs/walkthrough/integrate-with-jina.md (outdated, resolved)

LMMilliken force-pushed the feat-support-encoding-raw-data branch 2 times, most recently from a5599a5 to 27e493a Compare November 7, 2022 08:44

feat: changed return type of encode

ac8e010

LMMilliken force-pushed the feat-support-encoding-raw-data branch from 27e493a to ac8e010 Compare November 7, 2022 08:57

LMMilliken requested review from guenthermi, bwanglzu and gmastrapas November 7, 2022 09:07

gmastrapas reviewed Nov 7, 2022

chore: removed comments

66b97f0

LMMilliken requested a review from gmastrapas November 7, 2022 13:59

gmastrapas reviewed Nov 7, 2022

gmastrapas approved these changes Nov 7, 2022

bwanglzu reviewed Nov 7, 2022

docs/walkthrough/integrate-with-jina.md (outdated, resolved)

bwanglzu approved these changes Nov 7, 2022

guenthermi approved these changes Nov 7, 2022

LMMilliken force-pushed the feat-support-encoding-raw-data branch from 4e6e86a to 8c55a9b Compare November 7, 2022 16:26

scott-martens approved these changes Nov 7, 2022

LMMilliken force-pushed the feat-support-encoding-raw-data branch from 8c55a9b to 1772a01 Compare November 7, 2022 16:39

chore: corrected type hint

6035129

LMMilliken force-pushed the feat-support-encoding-raw-data branch from 1772a01 to 6035129 Compare November 7, 2022 16:52

LMMilliken merged commit 1ab1372 into main Nov 8, 2022

LMMilliken deleted the feat-support-encoding-raw-data branch November 8, 2022 07:47
