New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Validation For README #2121
Conversation
Good start! Here are some proposed next steps:
|
I have added basic validation checking in the class. It works based on a YAML string. The YAML string determines the expected structure and which text is to be checked. The Please let me know your thoughts. I haven't added a variable that keeps a track of whether the text is empty or not but it can be done easliy if required. |
This looks like a good start ! Do you think you can have a way to collect all the validation fails of a readme and then raise an error showing all the failures instead of using print ? Then we can create a |
Hi @lhoestq I have added changes accordingly. I prepared a list which stores all the errors and raises them at the end. I'm not sure if there is a better way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ! Now I'm curious to see the results if we run this on all the dataset cards ^^'
src/datasets/utils/readme_parser.py
Outdated
error_list = [] | ||
if structure["allow_empty"] == False: | ||
if section.is_empty: | ||
print(section.text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print(section.text) |
… add-readme-parser
Please find the output for the existing READMEs here: http://p.ip.fi/2vYU Thanks, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool thanks !
Feel free to add a few docstrings and type hints. I also left a few comments:
] | ||
|
||
|
||
class Section: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future we may have subclasses of this to have more finegrained validation per section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this class can be extended and we can keep a section-to-class mapping in the future. For now, this should be fine, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's fine for now
src/datasets/utils/readme.py
Outdated
with open(resource) as f: | ||
content = yaml.safe_load(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may need to use pkg_resources
here to load the yaml data
See an example here:
datasets/src/datasets/utils/metadata.py
Lines 25 to 27 in 8e903b5
def load_json_resource(resource: str) -> Tuple[Any, str]: | |
content = pkg_resources.read_text(resources, resource) | |
return json.loads(content), f"{BASE_REF_URL}/resources/{resource}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll use pkg_resources
, but can you please explain why it is needed?
src/datasets/utils/readme.py
Outdated
return error_list | ||
|
||
|
||
def validate_readme(file_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you write a few tests for this function ? that would be appreciated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will add the tests.
Hi @lhoestq I have added some basic tests, also have restructured There is one print statement currently, I'm not sure how to remove it. Basically, I want to warn but not stop further validation. I can't append to a list because the ---
---
# Dataset Card for FashionMNIST
## Dataset Description
## Dataset Description In this case, I check for validation only in the latest entry. I can also raise an error (ideal case scenario), but still, it is in the In tests, I'm using a dummy YAML string for structure, we can also make it into a file but I feel that is not a hard requirement. Let me know your thoughts. I will add tests for However, I would love to be able to check the exact message in the test when an error is raised. I checked a couple of methods but couldn't get it working. Let me know if you're aware of a way to do that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks !
] | ||
|
||
|
||
class Section: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's fine for now
src/datasets/utils/readme.py
Outdated
print( | ||
f"Multiple sections with the same heading '{current_sub_level}' have been found. Using the latest one found." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could also have self.parsing_error_list
and self.parsing_warning_list
?
This way in validate
you could get the errors and warnings with section.parsing_error_list
and section.parsing_warning_list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I also add self.validate_error_list
and self.validate_warning_list
?
Currently I am raising both warnings and errors together. Should I handle them separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you want.
The advantage of having the parsing error and warnings in the attributes is that you can access them from the validate
methods
tests/test_readme_util.py
Outdated
class TestReadMeUtils(unittest.TestCase): | ||
def test_from_string(self): | ||
ReadMe.from_string(README_CORRECT, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_EMPTY_YAML, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_INCORRECT_YAML, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_NO_YAML, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_MISSING_TEXT, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_MISSING_SUBSECTION, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_MISSING_FIRST_LEVEL, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_MULTIPLE_WRONG_FIRST_LEVEL, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_WRONG_FIRST_LEVEL, EXPECTED_STRUCTURE) | ||
with self.assertRaises(ValueError): | ||
ReadMe.from_string(README_EMPTY, EXPECTED_STRUCTURE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you could use pytest to check for the error messages.
You can find some documentation here:
https://docs.pytest.org/en/stable/assert.html#assertions-about-expected-exceptions
Note that pytest doesn't use the unittest.TestCase
class. Instead you have to define a test function.
For example
def test_from_string(self):
ReadMe.from_string(README_CORRECT, EXPECTED_STRUCTURE)
with pytest.raises(ValueError) as excinfo:
ReadMe.from_string(README_EMPTY_YAML, EXPECTED_STRUCTURE)
assert "empty" in excinfo
Does that sound good for you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also you can use @pytest.mark.parametrize(...)
to run your test functions on all the dummy yaml you defined if it sounds more convenient for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I thought I was restricted to unittest
. Cool, I'll write pytest
test cases and also check the error messages. I assume that is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be ideal, thanks !
tests/test_readme_util.py
Outdated
expected_error = expected_error.format(path=path).encode("unicode_escape").decode("ascii") | ||
with pytest.raises(ValueError, match=expected_error): | ||
ReadMe.from_readme(path, example_yaml_structure) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match
is supposed to be a regex, however you are passing a path that may be a windows path.
Instead of espacing the backslashes from windows, you can just escape the full string so that it will consider it a a simple litteral.
expected_error = expected_error.format(path=path).encode("unicode_escape").decode("ascii") | |
with pytest.raises(ValueError, match=expected_error): | |
ReadMe.from_readme(path, example_yaml_structure) | |
expected_error = expected_error.format(path=path) | |
with pytest.raises(ValueError, match=re.escape(expected_error)): | |
ReadMe.from_readme(path, example_yaml_structure) |
src/datasets/utils/readme.py
Outdated
if self.is_empty: | ||
# If no header text is found, mention it in the error_list | ||
error_list.append(f"Expected some header text for section `{self.name}`.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe have a more explicit message like "Expected some text in section {self.name}
but it is empty (text in subsections are ignored)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! You really did an amazing job on this one :)
As discussed offline, the next step is to integrate this to the pytest suite, and allow running the validation of all readmes with a RUN_SLOW=1
parameter (i.e. mark the full test with the slow
decorator).
Hi @lhoestq, @yjernite
This is a simple Readme parser. All classes specific to different sections can inherit
Section
class, and we can define more attributes in each.Let me know if this is going in the right direction :)
Currently the output looks like this, for
to_dict()
onFashionMNIST
README.md
:Thanks,
Gunjan