
add metric template #14

Merged (14 commits) May 3, 2022
Conversation

@lvwerra (Member) commented Apr 13, 2022

This PR adds a proposal template to create new metrics incl. an app.py such that they can be pushed to Spaces and displayed with Gradio (@osanseviero made a PoC here). The template includes:

  • new_metric_script.py: the main code for the metric is here
  • README.md: includes the Spaces tags in meta and takes inspiration from @sashavor's and @emibaylor's template for the body: Create SQuAD metric README.md datasets#3873
  • requirements.txt: file to include dependencies specific to a metric
  • tests.py: includes a few input/output pairs that we could use to automatically test metrics and populate the Spaces widget with examples. The idea of this vs. the doctests was to be more thorough and include edge cases
  • app.py: the code for the Gradio app

We could use cookiecutter to easily set up a new metric and populate some of the information, such that the main manual work would be adding content instead of renaming files/classes etc.
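The renaming step mentioned above (metric name to module/class names) could be sketched like this; the helper name is hypothetical and not part of the PR:

```python
import re

def metric_name_to_identifiers(name):
    """Derive the identifiers a template tool would substitute:
    "New Metric" -> module "new_metric", class "NewMetric"."""
    words = re.findall(r"[A-Za-z0-9]+", name.lower())
    module_name = "_".join(words)
    class_name = "".join(w.capitalize() for w in words)
    return module_name, class_name
```

A template tool like cookiecutter would then substitute these identifiers into file names and class definitions.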

What do you think? @lhoestq @sashavor @osanseviero

@sashavor commented:

I've been building on @osanseviero's POC to add more functionality (including displaying the metric cards): https://huggingface.co/spaces/huggingface/metric-explorer
I'm currently adding the possibility to compare two metrics for the same input as well.

(resolved review thread on templates/README.md)
@lhoestq (Member) commented Apr 13, 2022

> We could use cookiecutter to easily set up a new metric and populate some of the information, such that the main manual work would be adding content instead of renaming files/classes etc.

Maybe we can have something similar to the CLI command in transformers that creates a new model and renames the classes automatically?

@osanseviero (Member) commented:

As for cookiecutter, this is a template done by @nateraw that might be useful here https://github.com/nateraw/spaces-template

@osanseviero (Member) left a comment:

This is very nice!

Comment on lines 11 to 21
# Metric

## Metric description

## How to use

## Examples

## References

## Limitations and bias
Reviewer (Member):
This won't show up in the space, so I wonder if this should instead be in some place that will later be displayed in the Space. You already do this with _DESCRIPTION for example, so I was wondering if all of this should be over there instead.

@lvwerra (Author):
Yes, indeed there is some duplication here. I was thinking we could read the README.md in the gradio app and display it. The reasons why I thought it is nicer as a separate file:

  • it's easier to edit a markdown file directly than a string in Python
  • if we ever decide to make a dedicated metrics/evaluate repository type we would already have a README for all repos
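Reading the README.md and displaying it in the app could look roughly like this; the regex is the same one the PR's gradio helpers use to find the Spaces YAML block, but the function name is my own:

```python
import re

# same pattern the PR's gradio helpers use to locate the Spaces YAML block
REGEX_YAML_BLOCK = re.compile(r"---[\n\r]+([\S\s]*?)[\n\r]+---[\n\r]")

def readme_to_article(readme_text):
    """Strip the YAML metadata block so only the metric card body is displayed."""
    match = REGEX_YAML_BLOCK.match(readme_text)
    return readme_text[match.end():] if match else readme_text
```

The resulting string could then be passed as the `article` argument of `gr.Interface` so it shows up below the widget.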

Reviewer (Member):
Yes, I think that's good. We can read the md file content and add it to the article section of the demo.

```
@@ -0,0 +1,12 @@
test_cases = [
```
Reviewer (Member):
Very nice idea!

templates/app.py (outdated):

```
iface = gr.Interface(
    fn=compute,
    inputs=gr.inputs.Dataframe(headers=metric_features, col_width=len(metric_features), datatype="number"),
```
Reviewer (Member):
Should the datatype also be specified programmatically?

@lvwerra (Author):

Yes I was thinking about that, too. With #15 we should be able to infer that. I think we also need to add a case for when we can't infer the type (e.g. somebody implements a metric for a new modality). Maybe we can then just display a text saying that the widget is not available for that metric but it could be implemented in app.py.

lvwerra and others added 3 commits April 19, 2022 13:33
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@lvwerra (Author) commented Apr 20, 2022

I have been working on the Gradio widget a bit more to make it more generic. Here are the ideas:

Parsing between Gradio and evaluate

We have to do some mapping and parsing between the inputs of metrics and Gradio. I added a src/evaluate/utils/gradio.py module that includes the helper functions. That way we can modify/extend that logic later without needing to change every metric repository (besides updating evaluate).

Now thinking about it we could actually move the whole Gradio code inside evaluate and the app.py would simply look like this:

```
from evaluate import widget
from my_metric import my_metric

widget(my_metric)
```

That way we could update all metrics by just updating evaluate version in the spaces. At the same time a user could still build a custom gradio app by replacing the widget(my_metric) bit with the full Gradio app code.

Input types

I am a Gradio novice so I just built on top of @osanseviero's example. It uses a Dataframe as input, which gives a lot of flexibility. However, there are only two input types we can use in a field of the Dataframe: numbers and strings (there are also bool and date, which are not so useful for us here), but for some metrics we will need more flexibility, e.g. a list of numbers or a list of strings. For that reason I added a new internal type called json. On the Gradio side this is also just a string, but before passing it to the metric I parse it as JSON. That allows for easily adding lists of strings or numbers, or even more complex data structures should we need them, as long as their string representation can be interpreted as JSON; this covers any construct of Python lists/dicts.
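A minimal sketch of the cell parsing described above, assuming the three input types mentioned (the function name is hypothetical):

```python
import json

def parse_gradio_cell(cell, input_type):
    """Convert one Dataframe cell (a string or number coming from Gradio)
    into the Python object the metric expects."""
    if input_type == "json":
        # e.g. "[1, 2, 3]" -> [1, 2, 3]; any JSON-parsable structure works
        return json.loads(cell)
    if input_type == "number":
        return float(cell)
    return str(cell)
```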

Default values & tests

It could be nice to populate the widget with examples. I thought we could use the examples in the tests for that. Unfortunately, there seems to be a bug (gradio-app/gradio#745), but hopefully we can fix this.

In addition we could run the tests at the beginning of the Gradio app and display a message at the top if they are not passing.
E.g. ":rotating_light: This metric's tests failed. See ..."

Here's a picture of what the current widget looks like (everything is generated from the metric generically!). Note that the text below is the content of the README.md that is displayed:
[Screenshot: the generated widget, 2022-04-20]

cc @sashavor

@sashavor commented:
This is great!! I particularly like the idea of moving the Gradio code inside evaluate, it makes things so user friendly 🤗

General thoughts, by topic:

READMEs
We should have all the metric cards ready by the end of the month, so it would be easy to display them in the app as well. How easy is it to make them collapsed by section? Because some of them are pretty long, it may be cumbersome to show all the information at once, but if we make them interactive (with users toggling which sections they want to see), that could be cool.

Input/outputs
Does it make sense to define metric categories (based on the analysis that I did), e.g. numerical metrics, prediction-reference metrics and referenceless metrics, add that information to the metric metadata, and use it in the app?

Comparison feature
I think that metric comparisons are really important as well, so maybe this is a feature that we could add down the line?

@lvwerra (Author) commented Apr 26, 2022

I've updated the PR with a working cookiecutter template and CLI. You can now run:

```
evaluate-cli create "Aweeesoooome Metric"
```

which creates a new Gradio Space, clones it, populates it from the template, and pushes the changes. One then only needs to adapt the generated files and push again.

The following message is displayed at the end of the command:

A new repository for your metric "Aweeesoooome Metric" has been created at /Users/leandro/git/evaluate/aweeesoooome_metric and pushed to the Hugging Face Hub: https://huggingface.co/spaces/lvwerra/aweeesoooome_metric.

Here are the next steps:
- implement the metric logic in aweeesoooome_metric/aweeesoooome_metric.py
- document your metric in aweeesoooome_metric/README.md
- add test cases for your metric in aweeesoooome_metric/tests.py
- if your metric has any dependencies update them in aweeesoooome_metric/requirements.txt

You can test your metric's widget locally by running:

```
python /Users/leandro/git/evaluate/aweeesoooome_metric/app.py
```

When you are happy with your changes you can push your changes with the following commands to the Hugging Face Hub:

```
cd /Users/leandro/git/evaluate/aweeesoooome_metric
git add .
git commit -m "Updating metric"
git push
```

You should then see the updated widget on the Hugging Face Hub: https://huggingface.co/spaces/lvwerra/aweeesoooome_metric
And you can load your metric in Python with the following code:

```
from evaluate import load_metric
metric = load_metric("lvwerra/aweeesoooome_metric")
```

The resulting space of that command can be found here:
https://huggingface.co/spaces/lvwerra/aweeesoooome_metric

@sashavor Regarding the README.md: at the moment the README is displayed after the widget, so I don't think it is a big issue if it is too long. We can open an issue on the Gradio repo should we need it, but let's first have a look at how they turn out.

Next steps: set up a separate PR to enable loading metrics from the Hub. The last step in the instructions does not work yet, as it will look for the metric in the evaluate repository instead of on the Hub.

@lvwerra lvwerra marked this pull request as ready for review April 26, 2022 11:55
@lhoestq (Member) left a comment:

LGTM, just a few nits:

```
import subprocess
from pathlib import Path

from cookiecutter.main import cookiecutter
```
Reviewer (Member):

I think you need to add cookiecutter in setup.py?

@lvwerra (Author):
done

(resolved review thread on templates/{{ cookiecutter.metric_slug }}/requirements.txt)
lvwerra and others added 2 commits May 2, 2022 11:17
@lvwerra (Author) commented May 2, 2022

Thanks @lhoestq for your suggestions - I added them!

@osanseviero (Member) left a comment:

This is super nice! Thanks for this 🔥

setup.py:

```
@@ -89,6 +89,8 @@
# Utilities from PyPA to e.g., compare versions
"packaging",
"responses<0.19",
# to populate metric template
"cookiecutter"
```
Reviewer (Member):
do you want this to be a required dependency?

@lvwerra (Author):

moved it to a "template" requirement group.

```
REGEX_YAML_BLOCK = re.compile(r"---[\n\r]+([\S\s]*?)[\n\r]+---[\n\r]")


def infer_gradio_input_types(feature_types):
```
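For context, a hedged sketch of what such a helper might do; the exact dtype names and the fallback behaviour are assumptions, not the PR's actual implementation:

```python
def infer_gradio_input_types(feature_types):
    """Map datasets feature dtypes to widget input types (sketch).

    Assumed convention: numeric dtypes -> "number", plain strings -> "str",
    anything nested -> "json" (a string parsed with json.loads later).
    """
    input_types = []
    for feature in feature_types:
        if feature in ("int32", "int64", "float32", "float64"):
            input_types.append("number")
        elif feature == "string":
            input_types.append("str")
        else:
            # lists, dicts and other nested structures fall back to JSON strings
            input_types.append("json")
    return input_types
```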
Reviewer (Member):
I don't think you do this in the repo right now, but WDYT of making internal functions a bit more explicitly internal/private? That way other people don't import them and handling backwards compatibility is easier.

I would just add a _ prefix and not export it. Same for other functions.

@lvwerra (Author):

The reason I think it would be a good idea to expose them to the user is that some metrics might require a custom Gradio widget, and the user could then easily reuse these helper functions.

Do you think that's not necessary or would you avoid custom widgets?

Reviewer (Member):
I think this might be a bit too early of an optimization, but I have no strong opinion, so feel free to make it public if you think it will be useful to users.

```
    return examples


def launch_gradio_widget(metric):
```
Reviewer (Member):
This is super nice

@julien-c (Member) commented May 2, 2022

BTW not sure if this was mentioned but you'll be able to list all those Gradio apps with https://huggingface.co/api/spaces?filter=metric
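Consuming that endpoint could be sketched as below; the response shape (a JSON list of objects with an "id" field) is an assumption on my part:

```python
import json

def parse_spaces_response(body):
    """Pull the Space ids out of the JSON returned by the Hub API.

    Endpoint from the comment above: https://huggingface.co/api/spaces?filter=metric
    """
    return [space["id"] for space in json.loads(body)]

# Fetching requires network access, e.g.:
#   from urllib.request import urlopen
#   with urlopen("https://huggingface.co/api/spaces?filter=metric") as r:
#       ids = parse_spaces_response(r.read().decode())
```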

This was referenced May 4, 2022
@lvwerra lvwerra deleted the metrics-template branch July 24, 2022 12:29