Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce catalog + ndcg #120

Merged
merged 21 commits into from
Oct 19, 2021
Merged

Introduce catalog + ndcg #120

merged 21 commits into from
Oct 19, 2021

Conversation

maximilianwerk
Copy link
Member

@maximilianwerk maximilianwerk commented Oct 12, 2021

This PR contains the following changes:

  • add catalog to Labeler and tuner API
    • for Labeler: catalog performs best if it is a DocumentArrayMemmap, since no copying of data is necessary.
    • for tuner: no special requirements of the datatype
  • changed how the frontend requests new Documents from the backend. The decision, which Documents are shown next lies with the backend now.
  • toydata (both QA and FMNIST) return now a data generator and a catalog. If pre_init_generator is set to False, it will return a callable, which will return the generator. The precalculation of the catalog takes a little longer than before. This makes test take longer.
  • Tuner can now compute metrices. For now hits and NDCG is implemented.
  • the result of Tuner.fit will now be a TunerStats object. This object allows easy printing and saving to file.
  • fixed a whole lot of tests in order to respect the new interfaces.

TODO:

  • docstrings

@maximilianwerk
Copy link
Member Author

If someone wants to play around and hasn't the dataset locally: https://github.com/jina-ai/workshops/blob/main/pokedex/get_data.sh

@maximilianwerk maximilianwerk changed the title DON'T MERGE! Feat ndcg Feat ndcg + hits Oct 15, 2021
@maximilianwerk maximilianwerk marked this pull request as ready for review October 15, 2021 12:33
@maximilianwerk
Copy link
Member Author

@bwanglzu please check, if the implementation regarding GPU is correct and is the way to do. It finally passes the test, but I am not sure about standards.

catalog = DocumentArrayMemmap(catalog_folder)
for doc in docs:
for match in doc.matches:
if match.id not in catalog:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think u should never rely on match. I think u should have some reserved key in tags

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now this is our data structure in finetuner. That might change in the future.

@bwanglzu
Copy link
Member

6D919BEF-D7F9-4FA0-B649-D0F4BD3C0A1E
i see some logs using gpu, while others using cpu, looking into the code

Copy link
Member

@bwanglzu bwanglzu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore the screenshot above, it's using gpu.

Copy link
Contributor

@tadejsv tadejsv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have yet to review this in more detail, but I think that the design needs to be refactored. Namely:

  • with this addition, we have two evaluation loops: one that mirrors the training one, and one for computing metrics. We should only have one, otherwise we are computing embeddings twice for no good reason.
  • related, there should only be one evaluation dataset
  • the metrics we compute should replace the ones from "old" evaluation" loop (i am removing them in my PR anyway) and be added to return dict, not just logged

finetuner/tuner/pytorch/__init__.py Outdated Show resolved Hide resolved
finetuner/tuner/pytorch/__init__.py Outdated Show resolved Hide resolved
finetuner/tuner/base.py Show resolved Hide resolved
finetuner/tuner/base.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added area/testing This issue/PR affects testing component/labeler labels Oct 18, 2021
@codecov
Copy link

codecov bot commented Oct 18, 2021

The author of this PR, maximilianwerk, is not an activated member of this organization on Codecov.
Please activate this user on Codecov to display this PR comment.
Coverage data is still being uploaded to Codecov.io for purposes of overall coverage calculations.
Please don't hesitate to email us at success@codecov.io with any questions.

@maximilianwerk
Copy link
Member Author

Beware, that passing a callable into the tuner.fit function does only provide any benefit, if the catalog is also passed.

Copy link
Contributor

@tadejsv tadejsv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main problem of how we design evaluation/metrics remains. Right now, we first run the eval loop, which computes the embeddings, and then run metrics computation, which computes embeddings again (in a more efficient way). I think this can be refactored so that:

  • embeddings are computed after training for the whole catalog
  • both eval loop and metrics computation take these pre-computed embeddings.

Also, we need to add batching to embeddings computation for the catalog - with larger datasets the whole computation simply won't fit into memory.

Not sure if this is feasible before wednesday, but should be done soon

Comment on lines 175 to 176
embeddings = self.embed_model(tensor)
with torch.inference_mode():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
embeddings = self.embed_model(tensor)
with torch.inference_mode():
with torch.inference_mode():
embeddings = self.embed_model(tensor)

finetuner/tuner/stats.py Outdated Show resolved Hide resolved
@maximilianwerk maximilianwerk changed the title Feat ndcg + hits Introduce catalog + ndcg Oct 18, 2021
Copy link
Contributor

@tadejsv tadejsv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

@maximilianwerk maximilianwerk merged commit 0be69a4 into main Oct 19, 2021
@maximilianwerk maximilianwerk deleted the feat-ndcg branch October 19, 2021 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants