
Add Validation For README #2121

Merged
merged 38 commits into huggingface:master May 10, 2021

Conversation

@gchhablani gchhablani commented Mar 26, 2021

Hi @lhoestq, @yjernite

This is a simple README parser. Classes specific to different sections can inherit from the `Section` class, and we can define more attributes in each.

Let me know if this is going in the right direction :)
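A minimal sketch of what such a `Section` class might look like (hypothetical and simplified, not the PR's actual implementation):

```python
class Section:
    """A README heading with its text and nested subsections."""

    def __init__(self, name: str, attributes: str = ""):
        self.name = name
        self.attributes = attributes  # the text directly under the heading
        self.subsections = []

    def to_dict(self) -> dict:
        # Serialize the section tree into nested dicts, mirroring the
        # JSON-like output shown in this comment.
        return {
            "name": self.name,
            "attributes": self.attributes,
            "subsections": [s.to_dict() for s in self.subsections],
        }
```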

Currently, the output of `to_dict()` on the FashionMNIST `README.md` looks like this:

{
    "name": "./datasets/fashion_mnist/README.md",
    "attributes": "",
    "subsections": [
        {
            "name": "Dataset Card for FashionMNIST",
            "attributes": "",
            "subsections": [
                {
                    "name": "Table of Contents",
                    "attributes": "- [Dataset Description](#dataset-description)\n  - [Dataset Summary](#dataset-summary)\n  - [Supported Tasks](#supported-tasks-and-leaderboards)\n  - [Languages](#languages)\n- [Dataset Structure](#dataset-structure)\n  - [Data Instances](#data-instances)\n  - [Data Fields](#data-instances)\n  - [Data Splits](#data-instances)\n- [Dataset Creation](#dataset-creation)\n  - [Curation Rationale](#curation-rationale)\n  - [Source Data](#source-data)\n  - [Annotations](#annotations)\n  - [Personal and Sensitive Information](#personal-and-sensitive-information)\n- [Considerations for Using the Data](#considerations-for-using-the-data)\n  - [Social Impact of Dataset](#social-impact-of-dataset)\n  - [Discussion of Biases](#discussion-of-biases)\n  - [Other Known Limitations](#other-known-limitations)\n- [Additional Information](#additional-information)\n  - [Dataset Curators](#dataset-curators)\n  - [Licensing Information](#licensing-information)\n  - [Citation Information](#citation-information)\n  - [Contributions](#contributions)",
                    "subsections": []
                },
                {
                    "name": "Dataset Description",
                    "attributes": "- **Homepage:** [GitHub](https://github.com/zalandoresearch/fashion-mnist)\n- **Repository:** [GitHub](https://github.com/zalandoresearch/fashion-mnist)\n- **Paper:** [arXiv](https://arxiv.org/pdf/1708.07747.pdf)\n- **Leaderboard:**\n- **Point of Contact:**",
                    "subsections": [
                        {
                            "name": "Dataset Summary",
                            "attributes": "Fashion-MNIST is a dataset of Zalando's article images\u2014consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.",
                            "subsections": []
                        },
                        {
                            "name": "Supported Tasks and Leaderboards",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        },
                        {
                            "name": "Languages",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        }
                    ]
                },
                {
                    "name": "Dataset Structure",
                    "attributes": "",
                    "subsections": [
                        {
                            "name": "Data Instances",
                            "attributes": "A data point comprises an image and its label.",
                            "subsections": []
                        },
                        {
                            "name": "Data Fields",
                            "attributes": "- `image`: a 2d array of integers representing the 28x28 image.\n- `label`: an integer between 0 and 9 representing the classes with the following mapping:\n  | Label | Description |\n  | --- | --- |\n  | 0 | T-shirt/top |\n  | 1 | Trouser |\n  | 2 | Pullover |\n  | 3 | Dress |\n  | 4 | Coat |\n  | 5 | Sandal |\n  | 6 | Shirt |\n  | 7 | Sneaker |\n  | 8 | Bag |\n  | 9 | Ankle boot |",
                            "subsections": []
                        },
                        {
                            "name": "Data Splits",
                            "attributes": "The data is split into training and test set. The training set contains 60,000 images and the test set 10,000 images.",
                            "subsections": []
                        }
                    ]
                },
                {
                    "name": "Dataset Creation",
                    "attributes": "",
                    "subsections": [
                        {
                            "name": "Curation Rationale",
                            "attributes": "**From the arXiv paper:**\nThe original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. \"If it doesn't work on MNIST, it won't work at all\", they said. \"Well, if it does work on MNIST, it may still fail on others.\"\nHere are some good reasons:\n- MNIST is too easy. Convolutional nets can achieve 99.7% on MNIST. Classic machine learning algorithms can also achieve 97% easily. Check out our side-by-side benchmark for Fashion-MNIST vs. MNIST, and read \"Most pairs of MNIST digits can be distinguished pretty well by just one pixel.\"\n- MNIST is overused. In this April 2017 Twitter thread, Google Brain research scientist and deep learning expert Ian Goodfellow calls for people to move away from MNIST.\n- MNIST can not represent modern CV tasks, as noted in this April 2017 Twitter thread, deep learning expert/Keras author Fran\u00e7ois Chollet.",
                            "subsections": []
                        },
                        {
                            "name": "Source Data",
                            "attributes": "",
                            "subsections": [
                                {
                                    "name": "Initial Data Collection and Normalization",
                                    "attributes": "**From the arXiv paper:**\nFashion-MNIST is based on the assortment on Zalando\u2019s website. Every fashion product on Zalando has a set of pictures shot by professional photographers, demonstrating different aspects of the product, i.e. front and back looks, details, looks with model and in an outfit. The original picture has a light-gray background (hexadecimal color: #fdfdfd) and stored in 762 \u00d7 1000 JPEG format. For efficiently serving different frontend components, the original picture is resampled with multiple resolutions, e.g. large, medium, small, thumbnail and tiny.\nWe use the front look thumbnail images of 70,000 unique products to build Fashion-MNIST. Those products come from different gender groups: men, women, kids and neutral. In particular, whitecolor products are not included in the dataset as they have low contrast to the background. The thumbnails (51 \u00d7 73) are then fed into the following conversion pipeline:\n1. Converting the input to a PNG image.\n2. Trimming any edges that are close to the color of the corner pixels. The \u201ccloseness\u201d is defined by the distance within 5% of the maximum possible intensity in RGB space.\n3. Resizing the longest edge of the image to 28 by subsampling the pixels, i.e. some rows and columns are skipped over.\n4. Sharpening pixels using a Gaussian operator of the radius and standard deviation of 1.0, with increasing effect near outlines.\n5. Extending the shortest edge to 28 and put the image to the center of the canvas.\n6. Negating the intensities of the image.\n7. Converting the image to 8-bit grayscale pixels.",
                                    "subsections": []
                                },
                                {
                                    "name": "Who are the source image producers?",
                                    "attributes": "**From the arXiv paper:**\nEvery fashion product on Zalando has a set of pictures shot by professional photographers, demonstrating different aspects of the product, i.e. front and back looks, details, looks with model and in an outfit.",
                                    "subsections": []
                                }
                            ]
                        },
                        {
                            "name": "Annotations",
                            "attributes": "",
                            "subsections": [
                                {
                                    "name": "Annotation process",
                                    "attributes": "**From the arXiv paper:**\nFor the class labels, they use the silhouette code of the product. The silhouette code is manually labeled by the in-house fashion experts and reviewed by a separate team at Zalando. Each product Zalando is the Europe\u2019s largest online fashion platform. Each product contains only one silhouette code.",
                                    "subsections": []
                                },
                                {
                                    "name": "Who are the annotators?",
                                    "attributes": "**From the arXiv paper:**\nThe silhouette code is manually labeled by the in-house fashion experts and reviewed by a separate team at Zalando.",
                                    "subsections": []
                                }
                            ]
                        },
                        {
                            "name": "Personal and Sensitive Information",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        }
                    ]
                },
                {
                    "name": "Considerations for Using the Data",
                    "attributes": "",
                    "subsections": [
                        {
                            "name": "Social Impact of Dataset",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        },
                        {
                            "name": "Discussion of Biases",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        },
                        {
                            "name": "Other Known Limitations",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        }
                    ]
                },
                {
                    "name": "Additional Information",
                    "attributes": "",
                    "subsections": [
                        {
                            "name": "Dataset Curators",
                            "attributes": "Han Xiao and Kashif Rasul and Roland Vollgraf",
                            "subsections": []
                        },
                        {
                            "name": "Licensing Information",
                            "attributes": "MIT Licence",
                            "subsections": []
                        },
                        {
                            "name": "Citation Information",
                            "attributes": "@article{DBLP:journals/corr/abs-1708-07747,\n  author    = {Han Xiao and\n               Kashif Rasul and\n               Roland Vollgraf},\n  title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning\n               Algorithms},\n  journal   = {CoRR},\n  volume    = {abs/1708.07747},\n  year      = {2017},\n  url       = {http://arxiv.org/abs/1708.07747},\n  archivePrefix = {arXiv},\n  eprint    = {1708.07747},\n  timestamp = {Mon, 13 Aug 2018 16:47:27 +0200},\n  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1708-07747},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}",
                            "subsections": []
                        },
                        {
                            "name": "Contributions",
                            "attributes": "Thanks to [@gchhablani](https://github.com/gchablani) for adding this dataset.",
                            "subsections": []
                        }
                    ]
                }
            ]
        }
    ]
}

Thanks,
Gunjan

@yjernite (Member)

Good start! Here are some proposed next steps:

  • We want the class structure to reflect the template, so the parser knows which section titles to expect and when something has gone wrong.
  • As a result, we don't need to parse the table of contents, since it will always be the same.
  • For each section/subsection, it would be cool to have a variable saying whether it's filled out or not (i.e., whether it's empty or only has [More Information Needed]).
  • `attributes` should probably be `text`.


gchhablani commented Mar 29, 2021

@yjernite @lhoestq

I have added basic validation checking in the class. It works off a YAML string: the YAML string determines the expected structure, and which text is to be checked. The `text` field can be `true` or `false`, indicating whether the section's text has to be checked for emptiness. Similarly, each subsection is parsed recursively. I have used print statements for now so that all issues are shown.
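The scheme described above might look roughly like this (a toy sketch with hypothetical names and dict literals standing in for the parsed YAML, not the PR's actual code):

```python
# Expected structure; in the PR this would be parsed from a YAML string.
# "text": True means the section's text must be checked for emptiness.
EXPECTED = {
    "name": "Dataset Description",
    "text": True,
    "subsections": [
        {"name": "Dataset Summary", "text": True, "subsections": []},
    ],
}

def validate(section: dict, structure: dict, error_list: list) -> None:
    # Check this section's text, then recurse into expected subsections.
    if structure["text"] and not section.get("text", "").strip():
        error_list.append(f"Expected some text for section `{structure['name']}`.")
    subs = {s["name"]: s for s in section.get("subsections", [])}
    for expected_sub in structure["subsections"]:
        if expected_sub["name"] not in subs:
            error_list.append(f"Missing subsection `{expected_sub['name']}`.")
        else:
            validate(subs[expected_sub["name"]], expected_sub, error_list)
```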

Please let me know your thoughts.

I haven't added a variable that keeps track of whether the text is empty or not, but it can be done easily if required.


lhoestq commented Mar 30, 2021

This looks like a good start!
Maybe we can use a field named `allow_empty` instead of `text`?
Also +1 for keeping track of empty texts.

Do you think you can have a way to collect all the validation failures of a readme and then raise an error showing all the failures, instead of using print?

Then we can create a tests/test_dataset_cards.py test file to make sure all the readmes of the repo are valid!

@gchhablani (Contributor Author)

Hi @lhoestq

I have added the changes accordingly. I now collect all the errors in a list and raise them together at the end. I'm not sure if there is a better way.
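The "collect, then raise at the end" pattern can be sketched like this (a hypothetical example, not the PR's code):

```python
def validate_sections(sections: dict) -> None:
    """Collect every failure in a list, then raise once with all of them."""
    error_list = []
    for name, text in sections.items():
        if not text.strip() or text.strip() == "[More Information Needed]":
            error_list.append(f"Expected some text in section `{name}`.")
    if error_list:
        # One error showing all failures, instead of printing each one.
        raise ValueError("The following issues were found:\n" + "\n".join(error_list))
```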

@lhoestq lhoestq left a comment

Nice! Now I'm curious to see the results if we run this on all the dataset cards ^^'

error_list = []
if structure["allow_empty"] == False:
    if section.is_empty:
        print(section.text)
Member:
Suggested change: remove the `print(section.text)` line.

@gchhablani (Contributor Author)

Hi @lhoestq @yjernite

Please find the output for the existing READMEs here: http://p.ip.fi/2vYU

Thanks,
Gunjan

@lhoestq lhoestq left a comment

Very cool, thanks!

Feel free to add a few docstrings and type hints. I also left a few comments:

]


class Section:
Member:

In the future we may have subclasses of this to have more finegrained validation per section

Contributor Author:

I think this class can be extended and we can keep a section-to-class mapping in the future. For now, this should be fine, right?

Member:

Yes it's fine for now

Comment on lines 15 to 16
with open(resource) as f:
    content = yaml.safe_load(f)
Member:

You may need to use pkg_resources here to load the yaml data

See an example here:

def load_json_resource(resource: str) -> Tuple[Any, str]:
    content = pkg_resources.read_text(resources, resource)
    return json.loads(content), f"{BASE_REF_URL}/resources/{resource}"

Contributor Author:

Okay, I'll use pkg_resources, but can you please explain why it is needed?
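For context (the thread doesn't answer this here, so this is the usual rationale, hedged): a plain relative `open()` resolves against the current working directory, while `importlib.resources` locates data files inside the installed package itself, even when the package is installed as a zip archive. A minimal stdlib sketch:

```python
from importlib import resources

def load_text_resource(package: str, resource: str) -> str:
    # Finds `resource` inside the installed `package`, independent of the
    # current working directory (works even for zipped installs).
    return resources.read_text(package, resource)
```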

return error_list


def validate_readme(file_path):
Member:

Could you write a few tests for this function? That would be appreciated.

Contributor Author:

Yes, I will add the tests.


gchhablani commented Apr 27, 2021

Hi @lhoestq

I have added some basic tests, also have restructured ReadMe class slightly.

There is one print statement currently that I'm not sure how to remove. Basically, I want to warn but not stop further validation. I can't append to a list because `error_list` and `warning_list` only exist in the `validate` method, while this print happens in the `parse` method. It occurs when someone has repeated a section multiple times. For example:

---
---

# Dataset Card for FashionMNIST
## Dataset Description
## Dataset Description

In this case, I check for validation only in the latest entry.

I can also raise an error (the ideal scenario), but that would still happen inside `parse`. Should I add `error_lines` and `warning_lines` as instance variables? That would probably solve the issue.

In tests, I'm using a dummy YAML string for structure, we can also make it into a file but I feel that is not a hard requirement. Let me know your thoughts.

I will add tests for from_readme as well.

However, I would love to be able to check the exact message in the test when an error is raised. I checked a couple of methods but couldn't get it working. Let me know if you're aware of a way to do that.

@lhoestq lhoestq left a comment

Thanks!


Comment on lines 80 to 82
print(
    f"Multiple sections with the same heading '{current_sub_level}' have been found. Using the latest one found."
)
Member:

Maybe you could also have self.parsing_error_list and self.parsing_warning_list ?

This way in validate you could get the errors and warnings with section.parsing_error_list and section.parsing_warning_list
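The suggestion above could look roughly like this (a toy sketch with hypothetical names, not the PR's actual code): parsing issues are stored on the instance, so `validate` can read them afterwards.

```python
class Section:
    """Keep parsing issues on the instance so validate() can collect them."""

    def __init__(self, name: str):
        self.name = name
        self.subsections = {}
        self.parsing_warning_list = []

    def add_subsection(self, heading: str) -> None:
        if heading in self.subsections:
            # Warn, but keep parsing: the latest entry wins.
            self.parsing_warning_list.append(
                f"Multiple sections with the same heading '{heading}' have been found."
            )
        self.subsections[heading] = Section(heading)

    def validate(self) -> list:
        # Gather this section's parsing warnings plus those of all subsections.
        warnings = list(self.parsing_warning_list)
        for sub in self.subsections.values():
            warnings.extend(sub.validate())
        return warnings
```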

Contributor Author:

Should I also add self.validate_error_list and self.validate_warning_list?

Currently I am raising both warnings and errors together. Should I handle them separately?

Member:

As you want.
The advantage of having the parsing errors and warnings in attributes is that you can access them from the validate method.

Comment on lines 210 to 230
class TestReadMeUtils(unittest.TestCase):
    def test_from_string(self):
        ReadMe.from_string(README_CORRECT, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_EMPTY_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_INCORRECT_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_NO_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_TEXT, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_SUBSECTION, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MULTIPLE_WRONG_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_WRONG_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_EMPTY, EXPECTED_STRUCTURE)
Member:

Here you could use pytest to check for the error messages.
You can find some documentation here:
https://docs.pytest.org/en/stable/assert.html#assertions-about-expected-exceptions

Note that pytest doesn't use the unittest.TestCase class. Instead you have to define a test function.
For example

def test_from_string():
    ReadMe.from_string(README_CORRECT, EXPECTED_STRUCTURE)
    with pytest.raises(ValueError) as excinfo:
        ReadMe.from_string(README_EMPTY_YAML, EXPECTED_STRUCTURE)
    assert "empty" in str(excinfo.value)

Does that sound good to you?

Member:

Also, you can use @pytest.mark.parametrize(...) to run your test functions on all the dummy YAML strings you defined, if that's more convenient for you.
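The parametrize pattern can be illustrated with a self-contained toy example (a stand-in parser with hypothetical names, not the PR's tests):

```python
import re

import pytest

def parse_heading(line: str) -> str:
    # Toy stand-in for the README parser: extract a markdown heading.
    match = re.match(r"#+\s+(.+)", line)
    if match is None:
        raise ValueError(f"Expected a heading, got: {line!r}")
    return match.group(1)

@pytest.mark.parametrize(
    "line, expected",
    [
        ("# Dataset Card for FashionMNIST", "Dataset Card for FashionMNIST"),
        ("## Dataset Description", "Dataset Description"),
    ],
)
def test_parse_heading(line, expected):
    # pytest runs this once per (line, expected) pair above.
    assert parse_heading(line) == expected
```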

Contributor Author:

Oh, I thought I was restricted to unittest. Cool, I'll write pytest test cases and also check the error messages. I assume that is better?

Member:

That would be ideal, thanks!

Comment on lines 335 to 337
expected_error = expected_error.format(path=path).encode("unicode_escape").decode("ascii")
with pytest.raises(ValueError, match=expected_error):
    ReadMe.from_readme(path, example_yaml_structure)
Member:

match is supposed to be a regex; however, you are passing a path that may be a Windows path.
Instead of escaping the backslashes from the Windows path, you can just escape the full string so that it will be treated as a simple literal.

Suggested change
- expected_error = expected_error.format(path=path).encode("unicode_escape").decode("ascii")
- with pytest.raises(ValueError, match=expected_error):
-     ReadMe.from_readme(path, example_yaml_structure)
+ expected_error = expected_error.format(path=path)
+ with pytest.raises(ValueError, match=re.escape(expected_error)):
+     ReadMe.from_readme(path, example_yaml_structure)
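The effect of `re.escape` here can be demonstrated (the path and message below are hypothetical, not from the PR):

```python
import re

# Hypothetical Windows path illustrating the problem:
path = "C:\\Users\\runner\\datasets\\README.md"
message = f"No first-level heading found in README at {path}"

# Left unescaped, the backslashes are parsed as regex escapes and "\U"
# is an invalid escape, so compiling the pattern fails:
try:
    re.search(message, message)
    compiled = True
except re.error:
    compiled = False
assert compiled is False

# re.escape turns the whole message into a literal pattern that matches:
assert re.search(re.escape(message), message) is not None
```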

@gchhablani gchhablani marked this pull request as ready for review May 4, 2021 14:09
Comment on lines 106 to 108
if self.is_empty:
    # If no header text is found, mention it in the error_list
    error_list.append(f"Expected some header text for section `{self.name}`.")
Member:

Maybe have a more explicit message, like "Expected some text in section {self.name} but it is empty (text in subsections is ignored)."

@lhoestq lhoestq left a comment

Thanks! You really did an amazing job on this one :)

As discussed offline, the next step is to integrate this to the pytest suite, and allow running the validation of all readmes with a RUN_SLOW=1 parameter (i.e. mark the full test with the slow decorator).
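A `slow` decorator of the kind described above is commonly implemented like this (a sketch of the pattern; the repo's actual helper may differ):

```python
import os
import unittest

def slow(test_case):
    """Skip the decorated test unless RUN_SLOW=1 is set in the environment."""
    if os.environ.get("RUN_SLOW", "0") != "1":
        # unittest.skip makes the wrapped test raise SkipTest when run.
        return unittest.skip("test is slow; set RUN_SLOW=1 to run it")(test_case)
    return test_case
```

The full-repo README validation would then be decorated with `@slow`, so the default test run stays fast while `RUN_SLOW=1 pytest` exercises every dataset card.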

@lhoestq lhoestq merged commit f618f54 into huggingface:master May 10, 2021
@gchhablani (Contributor Author)

Hi @lhoestq

Thanks for merging. :)
Thanks a lot to you and @yjernite for guiding me and helping me out.

Yes, I'll also use the next PR for combining the readme and tags validation. ^_^
