
Add Validation For README #2121

Merged
merged 38 commits into huggingface:master May 10, 2021

Conversation

@gchhablani gchhablani commented Mar 26, 2021

Hi @lhoestq, @yjernite

This is a simple README parser. Classes specific to different sections can inherit from the `Section` class, and we can define more attributes in each.

Let me know if this is going in the right direction :)
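A minimal sketch of what such a `Section` class might look like (hypothetical and simplified, not the PR's actual implementation):

```python
class Section:
    """A README heading with its text and nested subsections."""

    def __init__(self, name: str, attributes: str = ""):
        self.name = name
        self.attributes = attributes  # the text directly under the heading
        self.subsections = []

    def to_dict(self) -> dict:
        # Serialize the section tree into nested dicts, mirroring the
        # JSON-like output shown in this comment.
        return {
            "name": self.name,
            "attributes": self.attributes,
            "subsections": [s.to_dict() for s in self.subsections],
        }
```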

Currently, the output of `to_dict()` on the FashionMNIST `README.md` looks like this:

{
    "name": "./datasets/fashion_mnist/README.md",
    "attributes": "",
    "subsections": [
        {
            "name": "Dataset Card for FashionMNIST",
            "attributes": "",
            "subsections": [
                {
                    "name": "Table of Contents",
                    "attributes": "- [Dataset Description](#dataset-description)\n  - [Dataset Summary](#dataset-summary)\n  - [Supported Tasks](#supported-tasks-and-leaderboards)\n  - [Languages](#languages)\n- [Dataset Structure](#dataset-structure)\n  - [Data Instances](#data-instances)\n  - [Data Fields](#data-instances)\n  - [Data Splits](#data-instances)\n- [Dataset Creation](#dataset-creation)\n  - [Curation Rationale](#curation-rationale)\n  - [Source Data](#source-data)\n  - [Annotations](#annotations)\n  - [Personal and Sensitive Information](#personal-and-sensitive-information)\n- [Considerations for Using the Data](#considerations-for-using-the-data)\n  - [Social Impact of Dataset](#social-impact-of-dataset)\n  - [Discussion of Biases](#discussion-of-biases)\n  - [Other Known Limitations](#other-known-limitations)\n- [Additional Information](#additional-information)\n  - [Dataset Curators](#dataset-curators)\n  - [Licensing Information](#licensing-information)\n  - [Citation Information](#citation-information)\n  - [Contributions](#contributions)",
                    "subsections": []
                },
                {
                    "name": "Dataset Description",
                    "attributes": "- **Homepage:** [GitHub](https://github.com/zalandoresearch/fashion-mnist)\n- **Repository:** [GitHub](https://github.com/zalandoresearch/fashion-mnist)\n- **Paper:** [arXiv](https://arxiv.org/pdf/1708.07747.pdf)\n- **Leaderboard:**\n- **Point of Contact:**",
                    "subsections": [
                        {
                            "name": "Dataset Summary",
                            "attributes": "Fashion-MNIST is a dataset of Zalando's article images\u2014consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.",
                            "subsections": []
                        },
                        {
                            "name": "Supported Tasks and Leaderboards",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        },
                        {
                            "name": "Languages",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        }
                    ]
                },
                {
                    "name": "Dataset Structure",
                    "attributes": "",
                    "subsections": [
                        {
                            "name": "Data Instances",
                            "attributes": "A data point comprises an image and its label.",
                            "subsections": []
                        },
                        {
                            "name": "Data Fields",
                            "attributes": "- `image`: a 2d array of integers representing the 28x28 image.\n- `label`: an integer between 0 and 9 representing the classes with the following mapping:\n  | Label | Description |\n  | --- | --- |\n  | 0 | T-shirt/top |\n  | 1 | Trouser |\n  | 2 | Pullover |\n  | 3 | Dress |\n  | 4 | Coat |\n  | 5 | Sandal |\n  | 6 | Shirt |\n  | 7 | Sneaker |\n  | 8 | Bag |\n  | 9 | Ankle boot |",
                            "subsections": []
                        },
                        {
                            "name": "Data Splits",
                            "attributes": "The data is split into training and test set. The training set contains 60,000 images and the test set 10,000 images.",
                            "subsections": []
                        }
                    ]
                },
                {
                    "name": "Dataset Creation",
                    "attributes": "",
                    "subsections": [
                        {
                            "name": "Curation Rationale",
                            "attributes": "**From the arXiv paper:**\nThe original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. \"If it doesn't work on MNIST, it won't work at all\", they said. \"Well, if it does work on MNIST, it may still fail on others.\"\nHere are some good reasons:\n- MNIST is too easy. Convolutional nets can achieve 99.7% on MNIST. Classic machine learning algorithms can also achieve 97% easily. Check out our side-by-side benchmark for Fashion-MNIST vs. MNIST, and read \"Most pairs of MNIST digits can be distinguished pretty well by just one pixel.\"\n- MNIST is overused. In this April 2017 Twitter thread, Google Brain research scientist and deep learning expert Ian Goodfellow calls for people to move away from MNIST.\n- MNIST can not represent modern CV tasks, as noted in this April 2017 Twitter thread, deep learning expert/Keras author Fran\u00e7ois Chollet.",
                            "subsections": []
                        },
                        {
                            "name": "Source Data",
                            "attributes": "",
                            "subsections": [
                                {
                                    "name": "Initial Data Collection and Normalization",
                                    "attributes": "**From the arXiv paper:**\nFashion-MNIST is based on the assortment on Zalando\u2019s website. Every fashion product on Zalando has a set of pictures shot by professional photographers, demonstrating different aspects of the product, i.e. front and back looks, details, looks with model and in an outfit. The original picture has a light-gray background (hexadecimal color: #fdfdfd) and stored in 762 \u00d7 1000 JPEG format. For efficiently serving different frontend components, the original picture is resampled with multiple resolutions, e.g. large, medium, small, thumbnail and tiny.\nWe use the front look thumbnail images of 70,000 unique products to build Fashion-MNIST. Those products come from different gender groups: men, women, kids and neutral. In particular, whitecolor products are not included in the dataset as they have low contrast to the background. The thumbnails (51 \u00d7 73) are then fed into the following conversion pipeline:\n1. Converting the input to a PNG image.\n2. Trimming any edges that are close to the color of the corner pixels. The \u201ccloseness\u201d is defined by the distance within 5% of the maximum possible intensity in RGB space.\n3. Resizing the longest edge of the image to 28 by subsampling the pixels, i.e. some rows and columns are skipped over.\n4. Sharpening pixels using a Gaussian operator of the radius and standard deviation of 1.0, with increasing effect near outlines.\n5. Extending the shortest edge to 28 and put the image to the center of the canvas.\n6. Negating the intensities of the image.\n7. Converting the image to 8-bit grayscale pixels.",
                                    "subsections": []
                                },
                                {
                                    "name": "Who are the source image producers?",
                                    "attributes": "**From the arXiv paper:**\nEvery fashion product on Zalando has a set of pictures shot by professional photographers, demonstrating different aspects of the product, i.e. front and back looks, details, looks with model and in an outfit.",
                                    "subsections": []
                                }
                            ]
                        },
                        {
                            "name": "Annotations",
                            "attributes": "",
                            "subsections": [
                                {
                                    "name": "Annotation process",
                                    "attributes": "**From the arXiv paper:**\nFor the class labels, they use the silhouette code of the product. The silhouette code is manually labeled by the in-house fashion experts and reviewed by a separate team at Zalando. Each product Zalando is the Europe\u2019s largest online fashion platform. Each product contains only one silhouette code.",
                                    "subsections": []
                                },
                                {
                                    "name": "Who are the annotators?",
                                    "attributes": "**From the arXiv paper:**\nThe silhouette code is manually labeled by the in-house fashion experts and reviewed by a separate team at Zalando.",
                                    "subsections": []
                                }
                            ]
                        },
                        {
                            "name": "Personal and Sensitive Information",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        }
                    ]
                },
                {
                    "name": "Considerations for Using the Data",
                    "attributes": "",
                    "subsections": [
                        {
                            "name": "Social Impact of Dataset",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        },
                        {
                            "name": "Discussion of Biases",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        },
                        {
                            "name": "Other Known Limitations",
                            "attributes": "[More Information Needed]",
                            "subsections": []
                        }
                    ]
                },
                {
                    "name": "Additional Information",
                    "attributes": "",
                    "subsections": [
                        {
                            "name": "Dataset Curators",
                            "attributes": "Han Xiao and Kashif Rasul and Roland Vollgraf",
                            "subsections": []
                        },
                        {
                            "name": "Licensing Information",
                            "attributes": "MIT Licence",
                            "subsections": []
                        },
                        {
                            "name": "Citation Information",
                            "attributes": "@article{DBLP:journals/corr/abs-1708-07747,\n  author    = {Han Xiao and\n               Kashif Rasul and\n               Roland Vollgraf},\n  title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning\n               Algorithms},\n  journal   = {CoRR},\n  volume    = {abs/1708.07747},\n  year      = {2017},\n  url       = {http://arxiv.org/abs/1708.07747},\n  archivePrefix = {arXiv},\n  eprint    = {1708.07747},\n  timestamp = {Mon, 13 Aug 2018 16:47:27 +0200},\n  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1708-07747},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}",
                            "subsections": []
                        },
                        {
                            "name": "Contributions",
                            "attributes": "Thanks to [@gchhablani](https://github.com/gchablani) for adding this dataset.",
                            "subsections": []
                        }
                    ]
                }
            ]
        }
    ]
}

Thanks,
Gunjan

@yjernite (Member)

Good start! Here are some proposed next steps:

  • We want the class structure to reflect the template, so the parser knows which section titles to expect and when something has gone wrong.
  • As a result, we don't need to parse the table of contents, since it will always be the same.
  • For each section/subsection, it would be cool to have a variable saying whether it's filled out or not (i.e., whether it's empty or only has [More Information Needed]).
  • `attributes` should probably be `text`.


gchhablani commented Mar 29, 2021

@yjernite @lhoestq

I have added basic validation checking in the class. It works off a YAML string: the YAML string determines the expected structure, and which text is to be checked. The `text` field can be `true` or `false`, indicating whether the section's text has to be checked for emptiness. Similarly, each subsection is parsed recursively. I have used print statements for now so that all issues are shown.
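The scheme described above might look roughly like this (a toy sketch with hypothetical names and dict literals standing in for the parsed YAML, not the PR's actual code):

```python
# Expected structure; in the PR this would be parsed from a YAML string.
# "text": True means the section's text must be checked for emptiness.
EXPECTED = {
    "name": "Dataset Description",
    "text": True,
    "subsections": [
        {"name": "Dataset Summary", "text": True, "subsections": []},
    ],
}

def validate(section: dict, structure: dict, error_list: list) -> None:
    # Check this section's text, then recurse into expected subsections.
    if structure["text"] and not section.get("text", "").strip():
        error_list.append(f"Expected some text for section `{structure['name']}`.")
    subs = {s["name"]: s for s in section.get("subsections", [])}
    for expected_sub in structure["subsections"]:
        if expected_sub["name"] not in subs:
            error_list.append(f"Missing subsection `{expected_sub['name']}`.")
        else:
            validate(subs[expected_sub["name"]], expected_sub, error_list)
```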

Please let me know your thoughts.

I haven't added a variable that keeps track of whether the text is empty or not, but it can be done easily if required.


lhoestq commented Mar 30, 2021

This looks like a good start!
Maybe we can use a field named `allow_empty` instead of `text`?
Also +1 for keeping track of empty texts.

Do you think you can have a way to collect all the validation failures of a readme and then raise an error showing all the failures, instead of using print?

Then we can create a tests/test_dataset_cards.py test file to make sure all the readmes of the repo are valid!

@gchhablani (Contributor Author)

Hi @lhoestq

I have added the changes accordingly. I now collect all the errors in a list and raise them together at the end. I'm not sure if there is a better way.
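The "collect, then raise at the end" pattern can be sketched like this (a hypothetical example, not the PR's code):

```python
def validate_sections(sections: dict) -> None:
    """Collect every failure in a list, then raise once with all of them."""
    error_list = []
    for name, text in sections.items():
        if not text.strip() or text.strip() == "[More Information Needed]":
            error_list.append(f"Expected some text in section `{name}`.")
    if error_list:
        # One error showing all failures, instead of printing each one.
        raise ValueError("The following issues were found:\n" + "\n".join(error_list))
```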

@lhoestq lhoestq left a comment

Nice! Now I'm curious to see the results if we run this on all the dataset cards ^^'

error_list = []
if structure["allow_empty"] == False:
    if section.is_empty:
        print(section.text)
Member:
Suggested change: remove the `print(section.text)` line.

@gchhablani (Contributor Author)

Hi @lhoestq @yjernite

Please find the output for the existing READMEs here: http://p.ip.fi/2vYU

Thanks,
Gunjan

@lhoestq lhoestq left a comment

Very cool, thanks!

Feel free to add a few docstrings and type hints. I also left a few comments:

]


class Section:
Member:

In the future we may have subclasses of this to have more finegrained validation per section

Contributor Author:

I think this class can be extended and we can keep a section-to-class mapping in the future. For now, this should be fine, right?

Member:

Yes it's fine for now

Comment on lines 15 to 16
with open(resource) as f:
    content = yaml.safe_load(f)
Member:

You may need to use pkg_resources here to load the yaml data

See an example here:

def load_json_resource(resource: str) -> Tuple[Any, str]:
    content = pkg_resources.read_text(resources, resource)
    return json.loads(content), f"{BASE_REF_URL}/resources/{resource}"

Contributor Author:

Okay, I'll use pkg_resources, but can you please explain why it is needed?
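For context (the thread doesn't answer this here, so this is the usual rationale, hedged): a plain relative `open()` resolves against the current working directory, while `importlib.resources` locates data files inside the installed package itself, even when the package is installed as a zip archive. A minimal stdlib sketch:

```python
from importlib import resources

def load_text_resource(package: str, resource: str) -> str:
    # Finds `resource` inside the installed `package`, independent of the
    # current working directory (works even for zipped installs).
    return resources.read_text(package, resource)
```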

return error_list


def validate_readme(file_path):
Member:

Could you write a few tests for this function? That would be appreciated.

Contributor Author:

Yes, I will add the tests.


gchhablani commented Apr 27, 2021

Hi @lhoestq

I have added some basic tests, also have restructured ReadMe class slightly.

There is one print statement currently that I'm not sure how to remove. Basically, I want to warn but not stop further validation. I can't append to a list because `error_list` and `warning_list` only exist in the `validate` method, while this print happens in the `parse` method. It occurs when someone has repeated a section multiple times. For example:

---
---

# Dataset Card for FashionMNIST
## Dataset Description
## Dataset Description

In this case, I check for validation only in the latest entry.

I can also raise an error (the ideal scenario), but that would still happen inside `parse`. Should I add `error_lines` and `warning_lines` as instance variables? That would probably solve the issue.

In tests, I'm using a dummy YAML string for structure, we can also make it into a file but I feel that is not a hard requirement. Let me know your thoughts.

I will add tests for from_readme as well.

However, I would love to be able to check the exact message in the test when an error is raised. I checked a couple of methods but couldn't get it working. Let me know if you're aware of a way to do that.

@lhoestq lhoestq left a comment

Thanks!


Comment on lines 80 to 82
print(
    f"Multiple sections with the same heading '{current_sub_level}' have been found. Using the latest one found."
)
Member:

Maybe you could also have self.parsing_error_list and self.parsing_warning_list ?

This way in validate you could get the errors and warnings with section.parsing_error_list and section.parsing_warning_list
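The suggestion above could look roughly like this (a toy sketch with hypothetical names, not the PR's actual code): parsing issues are stored on the instance, so `validate` can read them afterwards.

```python
class Section:
    """Keep parsing issues on the instance so validate() can collect them."""

    def __init__(self, name: str):
        self.name = name
        self.subsections = {}
        self.parsing_warning_list = []

    def add_subsection(self, heading: str) -> None:
        if heading in self.subsections:
            # Warn, but keep parsing: the latest entry wins.
            self.parsing_warning_list.append(
                f"Multiple sections with the same heading '{heading}' have been found."
            )
        self.subsections[heading] = Section(heading)

    def validate(self) -> list:
        # Gather this section's parsing warnings plus those of all subsections.
        warnings = list(self.parsing_warning_list)
        for sub in self.subsections.values():
            warnings.extend(sub.validate())
        return warnings
```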

Contributor Author:

Should I also add self.validate_error_list and self.validate_warning_list?

Currently I am raising both warnings and errors together. Should I handle them separately?

Member:

As you want.
The advantage of having the parsing errors and warnings in attributes is that you can access them from the validate method.

Comment on lines 210 to 230
class TestReadMeUtils(unittest.TestCase):
    def test_from_string(self):
        ReadMe.from_string(README_CORRECT, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_EMPTY_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_INCORRECT_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_NO_YAML, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_TEXT, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_SUBSECTION, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MISSING_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_MULTIPLE_WRONG_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_WRONG_FIRST_LEVEL, EXPECTED_STRUCTURE)
        with self.assertRaises(ValueError):
            ReadMe.from_string(README_EMPTY, EXPECTED_STRUCTURE)
Member:

Here you could use pytest to check for the error messages.
You can find some documentation here:
https://docs.pytest.org/en/stable/assert.html#assertions-about-expected-exceptions

Note that pytest doesn't use the unittest.TestCase class. Instead you have to define a test function.
For example

def test_from_string():
    ReadMe.from_string(README_CORRECT, EXPECTED_STRUCTURE)
    with pytest.raises(ValueError) as excinfo:
        ReadMe.from_string(README_EMPTY_YAML, EXPECTED_STRUCTURE)
    assert "empty" in str(excinfo.value)

Does that sound good to you?

Member:

Also, you can use @pytest.mark.parametrize(...) to run your test functions on all the dummy YAML strings you defined, if that's more convenient for you.
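The parametrize pattern can be illustrated with a self-contained toy example (a stand-in parser with hypothetical names, not the PR's tests):

```python
import re

import pytest

def parse_heading(line: str) -> str:
    # Toy stand-in for the README parser: extract a markdown heading.
    match = re.match(r"#+\s+(.+)", line)
    if match is None:
        raise ValueError(f"Expected a heading, got: {line!r}")
    return match.group(1)

@pytest.mark.parametrize(
    "line, expected",
    [
        ("# Dataset Card for FashionMNIST", "Dataset Card for FashionMNIST"),
        ("## Dataset Description", "Dataset Description"),
    ],
)
def test_parse_heading(line, expected):
    # pytest runs this once per (line, expected) pair above.
    assert parse_heading(line) == expected
```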

Contributor Author:

Oh, I thought I was restricted to unittest. Cool, I'll write pytest test cases and also check the error messages. I assume that is better?

Member:

That would be ideal, thanks!

Comment on lines 335 to 337
expected_error = expected_error.format(path=path).encode("unicode_escape").decode("ascii")
with pytest.raises(ValueError, match=expected_error):
    ReadMe.from_readme(path, example_yaml_structure)
Member:

match is supposed to be a regex; however, you are passing a path that may be a Windows path.
Instead of escaping the backslashes from the Windows path, you can just escape the full string so that it will be treated as a simple literal.

Suggested change
- expected_error = expected_error.format(path=path).encode("unicode_escape").decode("ascii")
- with pytest.raises(ValueError, match=expected_error):
-     ReadMe.from_readme(path, example_yaml_structure)
+ expected_error = expected_error.format(path=path)
+ with pytest.raises(ValueError, match=re.escape(expected_error)):
+     ReadMe.from_readme(path, example_yaml_structure)
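The effect of `re.escape` here can be demonstrated (the path and message below are hypothetical, not from the PR):

```python
import re

# Hypothetical Windows path illustrating the problem:
path = "C:\\Users\\runner\\datasets\\README.md"
message = f"No first-level heading found in README at {path}"

# Left unescaped, the backslashes are parsed as regex escapes and "\U"
# is an invalid escape, so compiling the pattern fails:
try:
    re.search(message, message)
    compiled = True
except re.error:
    compiled = False
assert compiled is False

# re.escape turns the whole message into a literal pattern that matches:
assert re.search(re.escape(message), message) is not None
```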

@gchhablani gchhablani marked this pull request as ready for review May 4, 2021 14:09
Comment on lines 106 to 108
if self.is_empty:
    # If no header text is found, mention it in the error_list
    error_list.append(f"Expected some header text for section `{self.name}`.")
Member:

Maybe have a more explicit message, like "Expected some text in section {self.name} but it is empty (text in subsections is ignored)."

@lhoestq lhoestq left a comment

Thanks! You really did an amazing job on this one :)

As discussed offline, the next step is to integrate this to the pytest suite, and allow running the validation of all readmes with a RUN_SLOW=1 parameter (i.e. mark the full test with the slow decorator).
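A `slow` decorator of the kind described above is commonly implemented like this (a sketch of the pattern; the repo's actual helper may differ):

```python
import os
import unittest

def slow(test_case):
    """Skip the decorated test unless RUN_SLOW=1 is set in the environment."""
    if os.environ.get("RUN_SLOW", "0") != "1":
        # unittest.skip makes the wrapped test raise SkipTest when run.
        return unittest.skip("test is slow; set RUN_SLOW=1 to run it")(test_case)
    return test_case
```

The full-repo README validation would then be decorated with `@slow`, so the default test run stays fast while `RUN_SLOW=1 pytest` exercises every dataset card.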

@lhoestq lhoestq merged commit f618f54 into huggingface:master May 10, 2021
@gchhablani (Contributor Author)

Hi @lhoestq

Thanks for merging. :)
Thanks a lot to you and @yjernite for guiding me and helping me out.

Yes, I'll also use the next PR for combining the readme and tags validation. ^_^
