Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

feat: add support for raw data when encoding #598

Merged
merged 12 commits into from
Nov 8, 2022

Conversation

LMMilliken
Copy link
Contributor

@LMMilliken LMMilliken commented Nov 4, 2022

This pr adds support for providing a list of strings instead of a DocumentArray when using the finetuner.encode function. The modality of the contents of this list can be inferred using the model provided to the function, so no additional arguments need to be provided.
The docs have been updated to reflect this change, however encoding with a list has not been given priority over encoding with a DocumentArray. Example code snippets are left unchanged, and some additional code snippets are added.


  • This PR references an open issue
  • I have added a line about this change to CHANGELOG

@LMMilliken LMMilliken self-assigned this Nov 4, 2022
@LMMilliken LMMilliken marked this pull request as draft November 4, 2022 10:18
@@ -485,3 +488,5 @@ def encode(
inputs = model._move_to_device(inputs)
output = model.run(inputs)
batch.embeddings = output.detach().cpu().numpy()

return data
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the encode method previously worked on the input DocumentArray in place, I was not sure how to return the embeddings in the case that the user provided a list. Currently I am just returning the final DocumentArray in both cases, though I would welcome alternative suggestions

@LMMilliken LMMilliken linked an issue Nov 4, 2022 that may be closed by this pull request
@LMMilliken LMMilliken force-pushed the feat-support-encoding-raw-data branch from 83304ae to 0fa0f07 Compare November 4, 2022 14:01
@LMMilliken LMMilliken marked this pull request as ready for review November 4, 2022 14:02
CHANGELOG.md Outdated Show resolved Hide resolved
finetuner/__init__.py Outdated Show resolved Hide resolved
finetuner/data.py Outdated Show resolved Hide resolved
finetuner/data.py Outdated Show resolved Hide resolved
finetuner/data.py Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
docs/walkthrough/integrate-with-jina.md Outdated Show resolved Hide resolved
finetuner/__init__.py Outdated Show resolved Hide resolved
) -> DocumentArray:
"""If data has been provided as a list, a :class:`DocumentArray` is created
from the elements of the list
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter descriptions are missing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmastrapas suggested they be removed as this is an internal function.

finetuner/data.py Outdated Show resolved Hide resolved
@LMMilliken LMMilliken force-pushed the feat-support-encoding-raw-data branch from a70305d to 2ecf4f6 Compare November 4, 2022 16:42
README.md Outdated Show resolved Hide resolved
Copy link
Member

@bwanglzu bwanglzu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point, if user want to encode a DocumentArray, we should return a DocumentArray, otherwise if user want to encode a list of objects, i believe its better to return a np.ndarray of embeddings.

docs/walkthrough/integrate-with-jina.md Outdated Show resolved Hide resolved
@LMMilliken LMMilliken force-pushed the feat-support-encoding-raw-data branch 2 times, most recently from a5599a5 to 27e493a Compare November 7, 2022 08:44
@@ -1,23 +1,10 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happend here? the notebook changed but not the md file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I didn't mean to commit that, i think since the notebook file records metadata such as number of runs, running the notebook causes changes that I committed mistakenly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to revert the changes to that file to main but it has since been updated there as well, after merging there will be no difference to either the notebook or the markdown

finetuner/data.py Outdated Show resolved Hide resolved
finetuner/data.py Outdated Show resolved Hide resolved
@@ -58,6 +63,29 @@ def build_dataset(
return data


def build_encoding_dataset(
model: Union['InferenceEngine', str], data: Union[List[str], DocumentArray]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this argument be a string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it cannot, changed now

Copy link
Member

@bwanglzu bwanglzu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!, Before release, let's also update notebooks inference session, and make inference from list as default option. see https://finetuner.jina.ai/notebooks/text_to_text/#inference

Copy link
Member

@guenthermi guenthermi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only added one minor comment

@@ -469,6 +471,9 @@ def encode(

from _finetuner.models.inference import ONNXRuntimeInferenceEngine

return_da = isinstance(data, DocumentArray)
data = build_encoding_dataset(model=model, data=data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to call this function only if the input type is a list.

@LMMilliken LMMilliken force-pushed the feat-support-encoding-raw-data branch from 4e6e86a to 8c55a9b Compare November 7, 2022 16:26
Copy link
Member

@scott-martens scott-martens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LMMilliken LMMilliken force-pushed the feat-support-encoding-raw-data branch from 8c55a9b to 1772a01 Compare November 7, 2022 16:39
@LMMilliken LMMilliken force-pushed the feat-support-encoding-raw-data branch from 1772a01 to 6035129 Compare November 7, 2022 16:52
@github-actions
Copy link

github-actions bot commented Nov 7, 2022

📝 Docs are deployed on https://ft-feat-support-encoding-raw-data--jina-docs.netlify.app 🎉

@LMMilliken LMMilliken merged commit 1ab1372 into main Nov 8, 2022
@LMMilliken LMMilliken deleted the feat-support-encoding-raw-data branch November 8, 2022 07:47
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for raw data when encoding
5 participants