
add metric template #14

Merged (14 commits) May 3, 2022
Conversation

@lvwerra (Member) commented Apr 13, 2022

This PR adds a proposal template to create new metrics incl. an app.py such that they can be pushed to Spaces and displayed with Gradio (@osanseviero made a PoC here). The template includes:

  • new_metric_script.py: the main code for the metric is here
  • README.md: includes the Spaces tags in meta and takes inspiration from @sashavor's and @emibaylor's template for the body: Create SQuAD metric README.md datasets#3873
  • requirements.txt: file to include dependencies specific to a metric
  • tests.py: includes a few input/output pairs that we could use to automatically test metrics and populate the Spaces widget with examples. The idea of this vs. the doctests was to be more thorough and include edge cases
  • app.py: the code for the Gradio app

We could use cookiecutter to easily set up a new metric and populate some of the information, such that the main manual work would be adding content instead of renaming files/classes etc.
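The renaming step mentioned above (metric name to module/class names) could be sketched like this; the helper name is hypothetical and not part of the PR:

```python
import re

def metric_name_to_identifiers(name):
    """Derive the identifiers a template tool would substitute:
    "New Metric" -> module "new_metric", class "NewMetric"."""
    words = re.findall(r"[A-Za-z0-9]+", name.lower())
    module_name = "_".join(words)
    class_name = "".join(w.capitalize() for w in words)
    return module_name, class_name
```

A template tool like cookiecutter would then substitute these identifiers into file names and class definitions.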

What do you think? @lhoestq @sashavor @osanseviero

@sashavor commented:

I've been building on @osanseviero's POC to add more functionality (including displaying the metric cards): https://huggingface.co/spaces/huggingface/metric-explorer
I'm currently adding the possibility to compare two metrics for the same input as well.

(resolved review thread on templates/README.md)
@lhoestq (Member) commented Apr 13, 2022

> We could use cookiecutter to easily set up a new metric and populate some of the information, such that the main manual work would be adding content instead of renaming files/classes etc.

Maybe we can have something similar to the CLI command in transformers that creates a new model and renames the classes automatically?

@osanseviero (Member) commented:

As for cookiecutter, this is a template done by @nateraw that might be useful here https://github.com/nateraw/spaces-template

@osanseviero (Member) left a comment:

This is very nice!

Comment on lines 11 to 21
# Metric

## Metric description

## How to use

## Examples

## References

## Limitations and bias
Reviewer (Member):
This won't show up in the space, so I wonder if this should instead be in some place that will later be displayed in the Space. You already do this with _DESCRIPTION for example, so I was wondering if all of this should be over there instead.

@lvwerra (Author):
Yes, indeed there is some duplication here. I was thinking we could read the README.md in the gradio app and display it. The reasons why I thought it is nicer as a separate file:

  • it's easier to edit a markdown file directly than a string in Python
  • if we ever decide to make a dedicated metrics/evaluate repository type we would already have a README for all repos
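Reading the README.md and displaying it in the app could look roughly like this; the regex is the same one the PR's gradio helpers use to find the Spaces YAML block, but the function name is my own:

```python
import re

# same pattern the PR's gradio helpers use to locate the Spaces YAML block
REGEX_YAML_BLOCK = re.compile(r"---[\n\r]+([\S\s]*?)[\n\r]+---[\n\r]")

def readme_to_article(readme_text):
    """Strip the YAML metadata block so only the metric card body is displayed."""
    match = REGEX_YAML_BLOCK.match(readme_text)
    return readme_text[match.end():] if match else readme_text
```

The resulting string could then be passed as the `article` argument of `gr.Interface` so it shows up below the widget.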

Reviewer (Member):
Yes, I think that's good. We can read the md file content and add it to the article section of the demo.

```
@@ -0,0 +1,12 @@
test_cases = [
```
Reviewer (Member):
Very nice idea!

templates/app.py (outdated):

```
iface = gr.Interface(
    fn=compute,
    inputs=gr.inputs.Dataframe(headers=metric_features, col_width=len(metric_features), datatype="number"),
```
Reviewer (Member):
Should the datatype also be specified programmatically?

@lvwerra (Author):

Yes I was thinking about that, too. With #15 we should be able to infer that. I think we also need to add a case for when we can't infer the type (e.g. somebody implements a metric for a new modality). Maybe we can then just display a text saying that the widget is not available for that metric but it could be implemented in app.py.

lvwerra and others added 3 commits April 19, 2022 13:33
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@lvwerra (Author) commented Apr 20, 2022

I have been working on the Gradio widget a bit more to make it more generic. Here are the ideas:

Parsing between Gradio and evaluate

We have to do some mapping and parsing between the inputs of metrics and Gradio. I added a src/evaluate/utils/gradio.py module that includes the helper functions. That way we can modify/extend that logic later without needing to change every metric repository (besides updating evaluate).

Now thinking about it we could actually move the whole Gradio code inside evaluate and the app.py would simply look like this:

```
from evaluate import widget
from my_metric import my_metric

widget(my_metric)
```

That way we could update all metrics by just updating evaluate version in the spaces. At the same time a user could still build a custom gradio app by replacing the widget(my_metric) bit with the full Gradio app code.

Input types

I am a Gradio novice so I just built on top of @osanseviero's example. It uses a Dataframe as input, which gives a lot of flexibility. However, there are only two input types we can use in a field of the Dataframe: numbers and strings (there are also bool and date, which are not so useful for us here), but for some metrics we will need more flexibility, e.g. a list of numbers or a list of strings. For that reason I added a new internal type called json. On the Gradio side this is also just a string, but before passing it to the metric I parse it as JSON. That allows for easily adding lists of strings or numbers, or even more complex data structures should we need them, as long as their string representation can be interpreted as JSON; this covers any construct of Python lists/dicts.
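A minimal sketch of the cell parsing described above, assuming the three input types mentioned (the function name is hypothetical):

```python
import json

def parse_gradio_cell(cell, input_type):
    """Convert one Dataframe cell (a string or number coming from Gradio)
    into the Python object the metric expects."""
    if input_type == "json":
        # e.g. "[1, 2, 3]" -> [1, 2, 3]; any JSON-parsable structure works
        return json.loads(cell)
    if input_type == "number":
        return float(cell)
    return str(cell)
```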

Default values & tests

It could be nice to populate the widget with examples. I thought we could use the examples in the tests for that. Unfortunately, there seems to be a bug (gradio-app/gradio#745), but hopefully we can fix this.

In addition we could run the tests at the beginning of the Gradio app and display a message at the top if they are not passing.
E.g. ":rotating_light: This metric's tests failed. See ..."

Here's a picture of what the current widget looks like (everything is generated from the metric generically!). Note that the text below is the content of the README.md that is displayed:
[Screenshot: the generated widget, 2022-04-20]

cc @sashavor

@sashavor commented:
This is great!! I particularly like the idea of moving the Gradio code inside evaluate, it makes things so user friendly 🤗

General thoughts, by topic:

READMEs
We should have all the metric cards ready by the end of the month, so it would be easy to display them in the app as well. How easy is it to make them collapsed by section? Because some of them are pretty long, it may be cumbersome to show all the information at once, but if we make them interactive (with users toggling which sections they want to see), that could be cool.

Input/outputs
Does it make sense to define metric categories (based on the analysis that I did), e.g. numerical metrics, prediction-reference metrics and referenceless metrics, add that information to the metric metadata, and use it in the app?

Comparison feature
I think that metric comparisons are really important as well, so maybe this is a feature that we could add down the line?

@lvwerra (Author) commented Apr 26, 2022

I've updated the PR with a working cookiecutter template and CLI. You can now run:

```
evaluate-cli create "Aweeesoooome Metric"
```

which creates a new Gradio Space, clones it, populates it from the template, and pushes the changes. One then only needs to adapt the generated files and push again.

The following message is displayed at the end of the command:

A new repository for your metric "Aweeesoooome Metric" has been created at /Users/leandro/git/evaluate/aweeesoooome_metric and pushed to the Hugging Face Hub: https://huggingface.co/spaces/lvwerra/aweeesoooome_metric.

Here are the next steps:
- implement the metric logic in aweeesoooome_metric/aweeesoooome_metric.py
- document your metric in aweeesoooome_metric/README.md
- add test cases for your metric in aweeesoooome_metric/tests.py
- if your metric has any dependencies update them in aweeesoooome_metric/requirements.txt

You can test your metric's widget locally by running:

```
python /Users/leandro/git/evaluate/aweeesoooome_metric/app.py
```

When you are happy with your changes you can push your changes with the following commands to the Hugging Face Hub:

```
cd /Users/leandro/git/evaluate/aweeesoooome_metric
git add .
git commit -m "Updating metric"
git push
```

You should then see the updated widget on the Hugging Face Hub: https://huggingface.co/spaces/lvwerra/aweeesoooome_metric
And you can load your metric in Python with the following code:

```
from evaluate import load_metric
metric = load_metric("lvwerra/aweeesoooome_metric")
```

The resulting space of that command can be found here:
https://huggingface.co/spaces/lvwerra/aweeesoooome_metric

@sashavor Regarding the README.md: at the moment the README is displayed after the widget, so I don't think it is a big issue if it is too long. We can open an issue on the Gradio repo should we need it, but let's first have a look at how they turn out.

Next steps: set up a separate PR to enable loading metrics from the Hub. The last step in the instructions does not work yet, as it will look for the metric in the evaluate repository instead of on the Hub.

@lvwerra lvwerra marked this pull request as ready for review April 26, 2022 11:55
@lhoestq (Member) left a comment:

LGTM, just a few nits:

```
import subprocess
from pathlib import Path

from cookiecutter.main import cookiecutter
```
Reviewer (Member):

I think you need to add cookiecutter in setup.py?

@lvwerra (Author):
done

(resolved review thread on templates/{{ cookiecutter.metric_slug }}/requirements.txt)
lvwerra and others added 2 commits May 2, 2022 11:17
@lvwerra (Author) commented May 2, 2022

Thanks @lhoestq for your suggestions - I added them!

@osanseviero (Member) left a comment:

This is super nice! Thanks for this 🔥

setup.py:

```
@@ -89,6 +89,8 @@
# Utilities from PyPA to e.g., compare versions
"packaging",
"responses<0.19",
# to populate metric template
"cookiecutter"
```
Reviewer (Member):
do you want this to be a required dependency?

@lvwerra (Author):

moved it to a "template" requirement group.

```
REGEX_YAML_BLOCK = re.compile(r"---[\n\r]+([\S\s]*?)[\n\r]+---[\n\r]")


def infer_gradio_input_types(feature_types):
```
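For context, a hedged sketch of what such a helper might do; the exact dtype names and the fallback behaviour are assumptions, not the PR's actual implementation:

```python
def infer_gradio_input_types(feature_types):
    """Map datasets feature dtypes to widget input types (sketch).

    Assumed convention: numeric dtypes -> "number", plain strings -> "str",
    anything nested -> "json" (a string parsed with json.loads later).
    """
    input_types = []
    for feature in feature_types:
        if feature in ("int32", "int64", "float32", "float64"):
            input_types.append("number")
        elif feature == "string":
            input_types.append("str")
        else:
            # lists, dicts and other nested structures fall back to JSON strings
            input_types.append("json")
    return input_types
```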
Reviewer (Member):
I don't think you do this in the repo right now, but WDYT of making internal functions a bit more explicitly internal/private? That way other people don't import them and handling backwards compatibility is easier.

I would just add a _ prefix and not export it. Same for other functions.

@lvwerra (Author):

The reason I think it would be a good idea to expose them to the user is that some metrics might require a custom Gradio widget, and the user could then easily reuse these helper functions.

Do you think that's not necessary or would you avoid custom widgets?

Reviewer (Member):
I think this might be a bit too early of an optimization, but I have no strong opinion, so feel free to make it public if you think it will be useful to users.

```
    return examples


def launch_gradio_widget(metric):
```
Reviewer (Member):
This is super nice

@julien-c (Member) commented May 2, 2022

BTW not sure if this was mentioned but you'll be able to list all those Gradio apps with https://huggingface.co/api/spaces?filter=metric
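Consuming that endpoint could be sketched as below; the response shape (a JSON list of objects with an "id" field) is an assumption on my part:

```python
import json

def parse_spaces_response(body):
    """Pull the Space ids out of the JSON returned by the Hub API.

    Endpoint from the comment above: https://huggingface.co/api/spaces?filter=metric
    """
    return [space["id"] for space in json.loads(body)]

# Fetching requires network access, e.g.:
#   from urllib.request import urlopen
#   with urlopen("https://huggingface.co/api/spaces?filter=metric") as r:
#       ids = parse_spaces_response(r.read().decode())
```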

This was referenced May 4, 2022
@lvwerra lvwerra deleted the metrics-template branch July 24, 2022 12:29