Simplify code and add developer-friendly documentation for nodes. #100

Merged
merged 1 commit into from
Jul 7, 2023
4 changes: 3 additions & 1 deletion .github/workflows/ci.yml
@@ -56,8 +56,10 @@ jobs:
run: pip install .

- name: Validate JSON-LD files
# wiki-text is excluded at the moment. See: https://github.com/mlcommons/croissant/issues/101.
# movielens is excluded at the moment. See: https://github.com/mlcommons/croissant/issues/103.
run: |
JSON_FILES=$(python -c "import os; from etils import epath; [print(os.fspath(path)) for path in epath.Path('../../datasets').glob('*/*.json')]")
JSON_FILES=$(find ../../datasets/ -type f -name "*.json" ! -path '*wiki-text*' ! -path '*movielens*')
for file in ${JSON_FILES}
do
echo "Validating ${file}..."
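
For readers who prefer Python over `find`, the exclusion logic in this step can be sketched as below. This is only an illustration of the filter, not the code used by the workflow: the function name `json_files_to_validate` and the `EXCLUDED` constant are hypothetical, and the match is a plain substring test on the path, which is what the `! -path '*wiki-text*'` patterns amount to.

```python
from __future__ import annotations

from pathlib import Path

# Names excluded from validation, mirroring the `! -path` filters in the workflow.
# (Hypothetical constant, introduced only for this sketch.)
EXCLUDED = ("wiki-text", "movielens")


def json_files_to_validate(root: Path, excluded: tuple[str, ...] = EXCLUDED) -> list[Path]:
    """Return every *.json file under `root` whose path contains no excluded name."""
    return sorted(
        path
        for path in root.rglob("*.json")
        if path.is_file() and not any(name in path.as_posix() for name in excluded)
    )
```

Calling `json_files_to_validate(Path("../../datasets"))` would then yield the files that the shell loop iterates over.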
4 changes: 1 addition & 3 deletions datasets/movielens/metadata.json
@@ -183,7 +183,7 @@
]
},
{
"name": "movies+ratings+tags",
"name": "movies_with_ratings_with_tags",
"@type": "ml:RecordSet",
"source": "#{movies}",
"key": "#{movie_id}",
@@ -209,7 +209,6 @@
"dataType": "ml:RecordSet",
"source": "#{ratings}",
"parentField": {
"@type": "ml:Field",
"source": "#{ratings/movie_id}",
"references": "#{movies}"
},
@@ -237,7 +236,6 @@
"dataType": "ml:RecordSet",
"source": "#{tags}",
"parentField": {
"@type": "ml:Field",
"source": "#{tags/movie_id}",
"references": "#{movies}"
},
2 changes: 1 addition & 1 deletion datasets/recipes/compressed_archive.json
@@ -10,7 +10,7 @@
"source": "ml:source"
},
"@type": "sc:Dataset",
"name": "Compressed archive example",
"name": "compressed_archive_example",
"description": "This is a fairly minimal example, showing a way to describe archive files.",
"url": "https://example.com/datasets/recipes/compressed_archive/about",
"distribution": [
2 changes: 1 addition & 1 deletion datasets/recipes/enum.json
@@ -11,7 +11,7 @@
"references": "ml:references"
},
"@type": "sc:Dataset",
"name": "Enum example",
"name": "enum_example",
"description": "This is a fairly minimal example, showing a way to describe enumerations.",
"url": "https://example.com/datasets/enum/about",
"distribution": [
2 changes: 1 addition & 1 deletion datasets/recipes/minimal.json
@@ -4,7 +4,7 @@
"sc": "https://schema.org/"
},
"@type": "sc:Dataset",
"name": "Minimal example",
"name": "minimal_example",
"description": "This is a very minimal example, with only the required fields.",
"url": "https://example.com/dataset/minimal/about"
}
2 changes: 1 addition & 1 deletion datasets/recipes/minimal_recommended.json
@@ -10,7 +10,7 @@
"references": "ml:references"
},
"@type": "sc:Dataset",
"name": "Minimal example with recommended fields",
"name": "minimal_example_with_recommended_fields",
"description": "This is a minimal example, including the required and the recommended fields.",
"url": "https://example.com/dataset/recipes/minimal-recommended",
"license": "https://creativecommons.org/licenses/by/4.0/",
1 change: 1 addition & 0 deletions datasets/wiki-text/metadata.json
@@ -14,6 +14,7 @@
"applyTransform": "ml:applyTransform",
"format": "ml:format",
"regex": "ml:regex",
"replace": "ml:replace",
"separator": "ml:separator",
"references": "ml:references"
},
50 changes: 46 additions & 4 deletions python/ml_croissant/README.md
@@ -35,10 +35,52 @@ python -m pip install ".[dev]"
pytest .
```

## Roadmap
## Design

Refer to the [design doc](https://docs.google.com/document/d/1zYQIUX9ae1sZOOBq9OCsJ8JW8-Ejy3NLSeqaI5LtOEM/edit?resourcekey=0-CK78DfFvF7fnufyZqF3h3Q) for an overview of the implementation.
The most important modules in the library are:

Refer to the [GitHub project](https://github.com/orgs/mlcommons/projects/26) for more detailed user stories.
- [`ml_croissant/_src/structure_graph`](./ml_croissant/_src/structure_graph/graph.py) is responsible for the **static analysis** of the Croissant files. We convert Croissant files to a Python representation called "**structure graph**" (using [NetworkX](https://networkx.org/)). In the process, we catch any static analysis issues (e.g., a missing mandatory field or a logic problem in the file).
- [`ml_croissant/_src/operation_graph`](./ml_croissant/_src/operation_graph/graph.py) is responsible for the **dynamic analysis** of the Croissant files (i.e., actually loading the dataset by yielding examples). We convert the structure graph into an "**operation graph**". Operations are the unit transformations used to build the dataset (such as [`Download`](./ml_croissant/_src/operation_graph/operations/download.py), [`Extract`](./ml_croissant/_src/operation_graph/operations/extract.py), etc.).
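
The two-stage design described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the library's implementation: it uses the standard library's `graphlib` instead of NetworkX to stay self-contained, and the node names, the `Read` operation, and the helper functions `static_issues` and `operation_order` are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical structure graph: each node maps to the set of nodes it depends on.
structure_graph = {
    "archive.zip": set(),
    "csv_files": {"archive.zip"},
    "records": {"csv_files"},
}

# Hypothetical mapping from a node to the operation that materializes it.
# `Download` and `Extract` are named in the README; `Read` is made up here.
operation_for = {
    "archive.zip": "Download",
    "csv_files": "Extract",
    "records": "Read",
}


def static_issues(graph):
    """Static analysis: report dependencies that no node declares."""
    return [
        f"{node} depends on undeclared node {dep}"
        for node, deps in sorted(graph.items())
        for dep in sorted(deps)
        if dep not in graph
    ]


def operation_order(graph, operations):
    """Dynamic analysis: order operations so that dependencies run first."""
    ordered_nodes = TopologicalSorter(graph).static_order()
    return [f"{operations[node]}({node})" for node in ordered_nodes]
```

The point of the split is that `static_issues` can reject a malformed file before any data is downloaded, while `operation_order` only runs on a graph that passed the static checks.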

All contributions are welcome! We even have [good first issues](https://github.com/mlcommons/croissant/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to start in the project.
Other important modules are:

- [`ml_croissant/_src/core`](./ml_croissant/_src/core) defines all needed core internals. For instance, [`Issues`](./ml_croissant/_src/core/issues.py) are a way to track errors and warnings during the analysis of Croissant files.
- [`ml_croissant/__init__`](./ml_croissant/__init__.py) declares the public API with [`ml_croissant.Dataset`](./ml_croissant/_src/datasets.py).
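
To make the idea of tracking errors and warnings concrete, here is a minimal sketch of an issue collector. It is an assumption-laden illustration and not the actual `Issues` class in `ml_croissant/_src/core/issues.py`: the method names and the `check_metadata` helper are hypothetical, and the list of mandatory fields is taken from the `minimal.json` recipe in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Issues:
    """Hypothetical issue tracker; the real `Issues` class may differ."""

    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    def add_error(self, message: str) -> None:
        self.errors.append(message)

    def add_warning(self, message: str) -> None:
        self.warnings.append(message)

    def report(self) -> str:
        """Render all collected issues, errors first."""
        lines = [f"error: {message}" for message in self.errors]
        lines += [f"warning: {message}" for message in self.warnings]
        return "\n".join(lines)


def check_metadata(metadata: dict, issues: Issues) -> None:
    """Collect every problem instead of stopping at the first one."""
    # Assumed mandatory fields, based on the minimal.json recipe.
    for key in ("name", "description", "url"):
        if key not in metadata:
            issues.add_error(f'missing mandatory field "{key}"')
    # Assumed recommended field, based on the minimal_recommended.json recipe.
    if "license" not in metadata:
        issues.add_warning('missing recommended field "license"')
```

Accumulating issues rather than raising on the first failure lets a validator show the user everything that is wrong with a file in a single run.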

For the full design, refer to the [design doc](https://docs.google.com/document/d/1zYQIUX9ae1sZOOBq9OCsJ8JW8-Ejy3NLSeqaI5LtOEM/edit?resourcekey=0-CK78DfFvF7fnufyZqF3h3Q).

## Contribute

All contributions are welcome! We even have [good first issues](https://github.com/mlcommons/croissant/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to start in the project. Refer to the [GitHub project](https://github.com/orgs/mlcommons/projects/26) for more detailed user stories.

The development workflow goes as follows:

- [Fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) the repository: https://github.com/mlcommons/croissant.
- Clone the newly forked repository:
```bash
git clone git@github.com:<YOUR_GITHUB_LDAP>/croissant.git
```
- Create a new branch:
```bash
cd croissant/
git checkout -b feature/my-awesome-new-feature
```
- Code the feature. We support [VS Code](https://code.visualstudio.com) with pre-set settings.
- Push to GitHub:
```bash
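git add .
# Commit the staged changes; without a commit there is nothing new to push.
git commit -m "Add my awesome new feature"
git push --set-upstream origin feature/my-awesome-new-feature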
```
- Open a pull request (PR) against the main branch of https://github.com/mlcommons/croissant, and ask for feedback!

## Debug

You can debug the validation of the file with the `--debug` flag:

```bash
python scripts/validate.py --file ../../datasets/titanic/metadata.json --debug
```

This will:
1. print extra information, like the generated nodes;
2. save the generated structure graph to a folder indicated in the logs.