Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infer metadata from the data *.csv file #173

Merged
merged 17 commits into from
Jun 4, 2024
Merged

Conversation

Bachibouzouk
Copy link
Collaborator

@Bachibouzouk Bachibouzouk commented May 27, 2024

Description

The function infer_metadata_from_data can be run on a datapackage where only the data/elements and data/sequences have been defined by a user and will generate automatically the datapackage.json file containing the metadata.

Type of change

Please tick or delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

Please tick or delete options that are not relevant.

  • New and adjusted code is formatted using the pre-commit hooks
    --> I couldn't get the pre-commit local tagged object to run, it told me isort is unknown command even after I pip installed it ....
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • existing unit tests pass locally with my changes
  • I have added new features/fixes to the CHANGELOG
  • I have added my name to AUTHORS

Each key map to a list of descriptors which themselves contain a
"fields" attribute.
We map the values under this "fields" attribute to the keys of
FOREIGN_KEY_DESCRIPTORS
The function infer_metadata_from_data can be run on a datapackage where
only the data/elements and data/sequences have been defined by a user
and will generate automatically the datapackage.json file containing the
metadata.
p.add_resource(r.descriptor)


def infer_metadata_from_data(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea was to make this feature optional, therefore I did not want to modify infer_metadata function

if "/elements/" in r.descriptor["path"]:
infer_resource_basic_foreign_keys(r)
# this function saves the metadata of the package in json format
infer_metadata(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this function does already part of the job if provided a dict foreign_keys, I used it and therefore just wrote the function infer_resource_basic_foreign_keys to fill this foreign_keys dict.

@Bachibouzouk Bachibouzouk marked this pull request as ready for review May 28, 2024 20:21
Copy link
Collaborator

@henhuy henhuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice feature! Thanks for implementing.
Tested it with existing investment datapackage from examples - files were almost identical (only no empty line add end of file was missing 😛).
Would be nice to such test in tests folder too (aka build metadata for existing datapackage in examples and check if files are identical).
I also found three smaller issues.


def infer_metadata_from_data(
package_name="default-name",
path=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default parameter "None" throws an error and makes no sense IMO.
I would make it mandatory and first argument.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0c4279b0c4279b

metadata_filename="datapackage.json",
):
"""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small docstring would be good

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0c4279b0c4279b

Comment on lines 109 to 110
# write an error message here
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error should be thrown here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b8b4e64b8b4e64

@Bachibouzouk
Copy link
Collaborator Author

Nice feature! Thanks for implementing. Tested it with existing investment datapackage from examples - files were almost identical (only no empty line add end of file was missing 😛). Would be nice to such test in tests folder too (aka build metadata for existing datapackage in examples and check if files are identical). I also found three smaller issues.

Thanks for your review and your comments :)

I will implement your suggestions and run those on the example datapackages :)

sequence_labels = []
duplicated_labels = []
for r in p.resources:
if "/sequences/" in r.descriptor["path"]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here one has to use os.sep instead of "/"

sequences_profiles_to_resource = map_sequence_profiles_to_resource_name(p)

for r in p.resources:
if "/elements/" in r.descriptor["path"]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here one has to use os.sep instead of "/"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 37c5ad537c5ad5

@Bachibouzouk
Copy link
Collaborator Author

Bachibouzouk commented Jun 3, 2024

Would be nice to such test in tests folder too (aka build metadata for existing datapackage in examples and check if files are identical).

I did run the new function on files and some of them are just changing the order of the resources within the datapackage. There was a difference which helped me find a bug and solve it in c7ab4bf

Those were generated by infer_metadata_from_data
and had no differences except the package name
Those were generated by infer_metadata_from_data
and had only a different order of resources
@Bachibouzouk Bachibouzouk requested a review from henhuy June 3, 2024 22:34
Copy link
Collaborator

@henhuy henhuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thx for the changes

@Bachibouzouk Bachibouzouk merged commit 782118e into dev Jun 4, 2024
2 checks passed
@Bachibouzouk Bachibouzouk deleted the feature/infer_metadata branch June 4, 2024 06:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants