Introduce catalog + ndcg #120

maximilianwerk · 2021-10-12T14:39:05Z

This PR contains the following changes:

Add a catalog to the Labeler and Tuner APIs.
- For the Labeler: the catalog performs best if it is a DocumentArrayMemmap, since no data needs to be copied.
- For the Tuner: no special requirements on the data type.
Changed how the frontend requests new Documents from the backend. The backend now decides which Documents are shown next.
toydata (both QA and FMNIST) now returns a data generator and a catalog. If pre_init_generator is set to False, it returns a callable that returns the generator. Precomputing the catalog takes a little longer than before, which makes the tests take longer.
The Tuner can now compute metrics. For now, hits and NDCG are implemented.
Tuner.fit now returns a TunerStats object, which allows easy printing and saving to file.
Fixed many tests to respect the new interfaces.
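
For readers unfamiliar with the two metrics named above, here is a minimal, self-contained sketch of hits@k and NDCG@k using their textbook definitions. It is not the code from this PR: the function names, the `relevance` dict, and the choice of a log2 discount are assumptions made for illustration, and the implementation in `finetuner/tuner/evaluation.py` may differ in detail.

```python
import math
from typing import Dict, Sequence, Set


def hits_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> int:
    """Count how many of the top-k retrieved ids are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids: Sequence[str], relevance: Dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical example: four retrieved matches, two of which are relevant.
ranked = ["a", "b", "c", "d"]
relevance = {"a": 1.0, "c": 1.0}

hits = hits_at_k(ranked, set(relevance), k=2)
score = ndcg_at_k(ranked, relevance, k=4)
perfect = ndcg_at_k(["a", "c", "b", "d"], relevance, k=4)
```

In this toy ranking only "a" falls into the top 2, and the NDCG is below 1 because the second relevant item sits at rank 3 instead of rank 2.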

TODO:

docstrings

maximilianwerk · 2021-10-12T14:41:56Z

If someone wants to play around and doesn't have the dataset locally: https://github.com/jina-ai/workshops/blob/main/pokedex/get_data.sh

maximilianwerk · 2021-10-15T12:34:02Z

@bwanglzu please check whether the GPU implementation is correct and done the recommended way. It finally passes the test, but I am not sure about our standards.

JoanFM · 2021-10-13T05:56:38Z

finetuner/tuner/evaluation.py

+    catalog = DocumentArrayMemmap(catalog_folder)
+    for doc in docs:
+        for match in doc.matches:
+            if match.id not in catalog:


I think you should never rely on match. I think you should have a reserved key in tags instead.

For now this is our data structure in finetuner. That might change in the future.

bwanglzu · 2021-10-15T12:39:57Z

I see some logs using the GPU while others use the CPU. Looking into the code.

finetuner/tuner/base.py

bwanglzu

Ignore the screenshot above; it's using the GPU.

tadejsv

I have yet to review this in more detail, but I think that the design needs to be refactored. Namely:

With this addition, we have two evaluation loops: one that mirrors the training loop, and one for computing metrics. We should only have one; otherwise we compute embeddings twice for no good reason.
Relatedly, there should be only one evaluation dataset.
The metrics we compute should replace the ones from the "old" evaluation loop (I am removing them in my PR anyway) and be added to the returned dict, not just logged.

finetuner/tuner/pytorch/__init__.py

finetuner/tuner/base.py

codecov · 2021-10-18T13:05:02Z

The author of this PR, maximilianwerk, is not an activated member of this organization on Codecov.
Please activate this user on Codecov to display this PR comment.
Coverage data is still being uploaded to Codecov.io for purposes of overall coverage calculations.
Please don't hesitate to email us at success@codecov.io with any questions.

maximilianwerk · 2021-10-18T13:48:05Z

Beware that passing a callable into the tuner.fit function only provides a benefit if the catalog is also passed.

tadejsv

I think the main problem with how we design evaluation/metrics remains. Right now, we first run the eval loop, which computes the embeddings, and then run the metrics computation, which computes the embeddings again (in a more efficient way). I think this can be refactored so that:

embeddings are computed after training for the whole catalog
both the eval loop and the metrics computation use these pre-computed embeddings.

Also, we need to add batching to the embedding computation for the catalog: with larger datasets, the whole computation simply won't fit into memory.

Not sure if this is feasible before Wednesday, but it should be done soon.
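
The refactoring proposed above (embed the catalog once, in batches, and let both the eval loop and the metrics read from the result) could look roughly like the sketch below. Everything here is hypothetical: `embed_catalog`, `embed_batch`, and the `(id, content)` tuples are illustrative stand-ins, not finetuner APIs, and the toy embedder only stands in for a real model call.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Vector = List[float]


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_catalog(
    catalog: Sequence[Tuple[str, object]],
    embed_batch: Callable[[list], List[Vector]],
    batch_size: int = 256,
) -> Dict[str, Vector]:
    """Embed every catalog entry exactly once, one batch at a time.

    Only one batch of inputs is passed to the model per call, so the
    model never has to process the whole catalog in a single step.
    """
    embeddings: Dict[str, Vector] = {}
    for batch in batched(catalog, batch_size):
        ids = [doc_id for doc_id, _ in batch]
        vectors = embed_batch([content for _, content in batch])
        embeddings.update(zip(ids, vectors))
    return embeddings


# Toy stand-in for the embedding model: records the size of each batch it sees.
batch_sizes_seen: List[int] = []


def toy_embed_batch(contents: list) -> List[Vector]:
    batch_sizes_seen.append(len(contents))
    return [[float(len(text))] for text in contents]


toy_catalog = [("d1", "a"), ("d2", "bb"), ("d3", "ccc"), ("d4", "dddd"), ("d5", "eeeee")]
cache = embed_catalog(toy_catalog, toy_embed_batch, batch_size=2)
```

Both the evaluation loss and the ranking metrics could then look up vectors in `cache` by document id instead of calling the model again.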

tadejsv · 2021-10-18T14:21:37Z

finetuner/tuner/pytorch/__init__.py

+        embeddings = self.embed_model(tensor)
+        with torch.inference_mode():


Suggested change

-        embeddings = self.embed_model(tensor)
-        with torch.inference_mode():
+        with torch.inference_mode():
+            embeddings = self.embed_model(tensor)

finetuner/tuner/stats.py

Will be done later.

tadejsv

Looks good 👍

github-actions bot added size/m area/core component/tuner labels Oct 14, 2021

maximilianwerk force-pushed the feat-ndcg branch 3 times, most recently from 2fecb6d to 25edf29 Compare October 15, 2021 10:31

maximilianwerk changed the title ~~DON'T MERGE! Feat ndcg~~ Feat ndcg + hits Oct 15, 2021

maximilianwerk force-pushed the feat-ndcg branch 2 times, most recently from 1dc475f to 7ec3d44 Compare October 15, 2021 11:21

maximilianwerk marked this pull request as ready for review October 15, 2021 12:33

JoanFM reviewed Oct 15, 2021

maximilianwerk added 6 commits October 15, 2021 14:51

feat: add ndcg calc

3570e7a

feat: added hits metric and refactoring

dceae94

feat: fix ndcg

b2ee822

feat: gpu support for eval

c768207

feat: add gpu for eval

6f7e7fa

fix: sample size

630b514

maximilianwerk force-pushed the feat-ndcg branch from 7ec3d44 to 630b514 Compare October 15, 2021 12:52

bwanglzu reviewed Oct 15, 2021

bwanglzu reviewed Oct 15, 2021

tadejsv previously requested changes Oct 15, 2021

maximilianwerk added 2 commits October 18, 2021 09:12

Merge branch 'main' into feat-ndcg

bee87e3

feat: labeler working

c8f6c08

github-actions bot added area/testing This issue/PR affects testing component/labeler labels Oct 18, 2021

feat: paddle and torch

5369b64

fix: train data callable

10e1cc0

test: fixed

2d6811d

maximilianwerk force-pushed the feat-ndcg branch from 00e4ce2 to 2d6811d Compare October 18, 2021 13:58

tadejsv reviewed Oct 18, 2021

feat: fmnist with catalog

e74429f

github-actions bot added size/l component/misc and removed size/m labels Oct 18, 2021

maximilianwerk added 3 commits October 18, 2021 20:16

feat: used dam for catalog

65dd5eb

refactor: qa toy data

cd6f0b7

fix: gpu tests

358b501

maximilianwerk force-pushed the feat-ndcg branch from fbd8087 to 358b501 Compare October 18, 2021 19:34

test: fix test size

6a9364e

maximilianwerk changed the title ~~Feat ndcg + hits~~ Introduce catalog + ndcg Oct 18, 2021

maximilianwerk added 5 commits October 18, 2021 22:30

test: fix wrong arg

3bc8356

test: speed up test data generation

a12e2b7

feat: removed train metrics

2830449

fix: next only called when needed

620f0ff

feat: restored old toy data generation behavior

88419d5

github-actions bot added the area/docs label Oct 19, 2021

tadejsv approved these changes Oct 19, 2021

maximilianwerk merged commit 0be69a4 into main Oct 19, 2021

maximilianwerk deleted the feat-ndcg branch October 19, 2021 09:02
