feat: add support for raw data when encoding #598
…ina-ai/finetuner into feat-support-encoding-raw-data
finetuner/__init__.py
Outdated
@@ -485,3 +488,5 @@ def encode(
inputs = model._move_to_device(inputs)
output = model.run(inputs)
batch.embeddings = output.detach().cpu().numpy()

return data
Since the encode method previously worked on the input DocumentArray in place, I was not sure how to return the embeddings when the user provides a list. Currently I am just returning the final DocumentArray in both cases, though I would welcome alternative suggestions.
83304ae to 0fa0f07
) -> DocumentArray:
"""If data has been provided as a list, a :class:`DocumentArray` is created
from the elements of the list
"""
The parameter descriptions are missing
@gmastrapas suggested they be removed as this is an internal function.
a70305d to 2ecf4f6
This is a good point. If the user wants to encode a DocumentArray, we should return a DocumentArray; otherwise, if the user wants to encode a list of objects, I believe it's better to return an np.ndarray of embeddings.
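A minimal sketch of the return-type rule suggested here, assuming nothing about the real implementation. `Doc`, `DocArray`, `encode_sketch` and `toy_embed` are hypothetical stand-ins, not finetuner or docarray names, and a nested list of floats stands in for the np.ndarray so that the sketch needs no third-party packages.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class Doc:
    """Hypothetical stand-in for a docarray Document."""
    text: str
    embedding: Optional[List[float]] = None


class DocArray(list):
    """Hypothetical stand-in for docarray.DocumentArray."""


def encode_sketch(
    data: Union[DocArray, List[str]],
    embed: Callable[[str], List[float]],
) -> Union[DocArray, List[List[float]]]:
    # Remember the caller's input type before any conversion happens.
    return_da = isinstance(data, DocArray)
    docs = data if return_da else DocArray(Doc(text=t) for t in data)
    for doc in docs:
        # Embeddings are written onto the documents in place.
        doc.embedding = embed(doc.text)
    if return_da:
        return docs
    # Raw input: hand back only the embeddings (np.ndarray in the real API).
    return [doc.embedding for doc in docs]


def toy_embed(text: str) -> List[float]:
    # Toy "model": text length and number of spaces.
    return [float(len(text)), float(text.count(' '))]
```

With this shape, a caller who passes a DocumentArray gets the same object back with embeddings filled in, while a caller who passes raw strings never has to touch a DocumentArray.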
a5599a5 to 27e493a
27e493a to ac8e010
docs/notebooks/image_to_image.ipynb
Outdated
@@ -1,23 +1,10 @@
{
What happened here? The notebook changed but not the md file?
Ah, I didn't mean to commit that. I think that since the notebook file records metadata such as the number of runs, running the notebook caused changes that I committed by mistake.
I attempted to revert the changes to that file to match main, but it has since been updated there as well. After merging, there will be no difference in either the notebook or the markdown.
finetuner/data.py
Outdated
@@ -58,6 +63,29 @@ def build_dataset(
return data


def build_encoding_dataset(
model: Union['InferenceEngine', str], data: Union[List[str], DocumentArray]
can this argument be a string?
No, it cannot. Changed now.
LGTM! Before release, let's also update the inference section of the notebooks and make inference from a list the default option. See https://finetuner.jina.ai/notebooks/text_to_text/#inference
LGTM, only added one minor comment
finetuner/__init__.py
Outdated
@@ -469,6 +471,9 @@ def encode(

from _finetuner.models.inference import ONNXRuntimeInferenceEngine

return_da = isinstance(data, DocumentArray)
data = build_encoding_dataset(model=model, data=data)
I would prefer to call this function only if the input type is a list.
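One way to read this suggestion, as a rough sketch only: guard the conversion with a type check so that a DocumentArray passes through untouched. `DocArray`, `build_encoding_dataset_sketch`, `prepare_input` and the `calls` list are hypothetical names used for illustration; they are not the code in this PR.

```python
from typing import List, Union


class DocArray:
    """Hypothetical stand-in for docarray.DocumentArray."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = list(texts)


# Records each conversion, only to make the behaviour observable.
calls: List[int] = []


def build_encoding_dataset_sketch(data: List[str]) -> DocArray:
    # Stand-in for the internal helper: wraps raw strings in a DocArray.
    calls.append(len(data))
    return DocArray(data)


def prepare_input(data: Union[DocArray, List[str]]) -> DocArray:
    # Convert only when the caller passed a plain list;
    # an existing DocArray is returned as-is.
    if isinstance(data, list):
        data = build_encoding_dataset_sketch(data)
    return data
```

The helper then only ever needs to accept a list, which keeps its signature narrower than a Union of both input types.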
4e6e86a to 8c55a9b
LGTM
8c55a9b to 1772a01
1772a01 to 6035129
📝 Docs are deployed on https://ft-feat-support-encoding-raw-data--jina-docs.netlify.app 🎉
This PR adds support for providing a list of strings instead of a DocumentArray when using the finetuner.encode function. The modality of the contents of this list can be inferred from the model provided to the function, so no additional arguments are needed. The docs have been updated to reflect this change; however, encoding with a list has not been given priority over encoding with a DocumentArray. Example code snippets are left unchanged, and some additional code snippets are added.
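To illustrate how a modality could be inferred from the model rather than passed as an argument, here is a small sketch. It is an assumption about the general idea, not the PR's code: `ToyModel`, its `modality` attribute, `Doc` and `build_docs` are all hypothetical, and the real inference engine may expose this information differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a docarray Document."""
    text: Optional[str] = None
    uri: Optional[str] = None


@dataclass
class ToyModel:
    """Hypothetical model that knows which modality it encodes."""
    modality: str  # assumed to be 'text' or 'image'


def build_docs(model: ToyModel, data: List[str]) -> List[Doc]:
    # Text models treat each string as content to encode.
    if model.modality == 'text':
        return [Doc(text=item) for item in data]
    # Image models treat each string as a location to load from.
    if model.modality == 'image':
        return [Doc(uri=item) for item in data]
    raise ValueError(f'Unsupported modality: {model.modality!r}')
```

Because the model carries the modality, the caller passes the same kind of list in both cases and the strings are interpreted accordingly.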