mlcommons · marcenacp · Jul 7, 2023 · Jul 6, 2023
@@ -56,8 +56,10 @@ jobs:
       run: pip install .
 
     - name: Validate JSON-LD files
+      # wiki-text is excluded at the moment. See: https://github.com/mlcommons/croissant/issues/101.
+      # movielens is excluded at the moment. See: https://github.com/mlcommons/croissant/issues/103.
       run: |
-        JSON_FILES=$(python -c "import os; from etils import epath; [print(os.fspath(path)) for path in epath.Path('../../datasets').glob('*/*.json')]")
+        JSON_FILES=$(find ../../datasets/ -type f -name "*.json" ! -path '*wiki-text*' ! -path '*movielens*')
         for file in ${JSON_FILES}
         do
           echo "Validating ${file}..."

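For readers comparing the two versions of the file-selection step: the new `find` command is recursive and filters on the whole path, whereas the removed Python one-liner only matched `*/*.json`. A minimal Python sketch of what the new command selects is shown below. The function name `find_json_files` and the `EXCLUDED` tuple are hypothetical and not part of the repository.

```python
from pathlib import Path

# Path fragments skipped by the workflow (see issues #101 and #103 above).
EXCLUDED = ("wiki-text", "movielens")


def find_json_files(root, excluded=EXCLUDED):
    """Recursively list *.json files under `root`.

    Any file whose full path contains one of the `excluded` fragments is
    skipped, mirroring `! -path '*wiki-text*' ! -path '*movielens*'`.
    """
    return sorted(
        path
        for path in Path(root).rglob("*.json")
        if path.is_file()
        and not any(fragment in path.as_posix() for fragment in excluded)
    )


if __name__ == "__main__":
    for json_file in find_json_files("../../datasets"):
        print(f"Validating {json_file}...")
```

Like the `find` version, the filter is a substring match on the path, so a dataset whose folder name merely contains `movielens` would also be skipped.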
@@ -183,7 +183,7 @@
       ]
     },
     {
-      "name": "movies+ratings+tags",
+      "name": "movies_with_ratings_with_tags",
       "@type": "ml:RecordSet",
       "source": "#{movies}",
       "key": "#{movie_id}",
@@ -209,7 +209,6 @@
           "dataType": "ml:RecordSet",
           "source": "#{ratings}",
           "parentField": {
-            "@type": "ml:Field",
             "source": "#{ratings/movie_id}",
             "references": "#{movies}"
           },
@@ -237,7 +236,6 @@
           "dataType": "ml:RecordSet",
           "source": "#{tags}",
           "parentField": {
-            "@type": "ml:Field",
             "source": "#{tags/movie_id}",
             "references": "#{movies}"
           },

@@ -10,7 +10,7 @@
     "source": "ml:source"
   },
   "@type": "sc:Dataset",
-  "name": "Compressed archive example",
+  "name": "compressed_archive_example",
   "description": "This is a fairly minimal example, showing a way to describe archive files.",
   "url": "https://example.com/datasets/recipes/compressed_archive/about",
   "distribution": [

@@ -11,7 +11,7 @@
     "references": "ml:references"
   },
   "@type": "sc:Dataset",
-  "name": "Enum example",
+  "name": "enum_example",
   "description": "This is a fairly minimal example, showing a way to describe enumerations.",
   "url": "https://example.com/datasets/enum/about",
   "distribution": [

@@ -4,7 +4,7 @@
     "sc": "https://schema.org/"
   },
   "@type": "sc:Dataset",
-  "name": "Minimal example",
+  "name": "minimal_example",
   "description": "This is a very minimal example, with only the required fields.",
   "url": "https://example.com/dataset/minimal/about"
 }
@@ -10,7 +10,7 @@
     "references": "ml:references"
   },
   "@type": "sc:Dataset",
-  "name": "Minimal example with recommended fields",
+  "name": "minimal_example_with_recommended_fields",
   "description": "This is a minimal example, including the required and the recommended fields.",
   "url": "https://example.com/dataset/recipes/minimal-recommended",
   "license": "https://creativecommons.org/licenses/by/4.0/",

@@ -14,6 +14,7 @@
     "applyTransform": "ml:applyTransform",
     "format": "ml:format",
     "regex": "ml:regex",
+    "replace": "ml:replace",
     "separator": "ml:separator",
     "references": "ml:references"
   },

@@ -35,10 +35,52 @@ python -m pip install ".[dev]"
 pytest .
 ```
 
-## Roadmap
+## Design
 
-Refer to the [design doc](https://docs.google.com/document/d/1zYQIUX9ae1sZOOBq9OCsJ8JW8-Ejy3NLSeqaI5LtOEM/edit?resourcekey=0-CK78DfFvF7fnufyZqF3h3Q) for an overview of the implementation.
+The most important modules in the library are:
 
-Refer to the [GitHub project](https://github.com/orgs/mlcommons/projects/26) for more detailed user stories.
+- [`ml_croissant/_src/structure_graph`](./ml_croissant/_src/structure_graph/graph.py) is responsible for the **static analysis** of the Croissant files. We convert Croissant files to a Python representation called a "**structure graph**" (using [NetworkX](https://networkx.org/)). In the process, we catch any static analysis issues (e.g., a missing mandatory field or a logic problem in the file).
+- [`ml_croissant/_src/operation_graph`](./ml_croissant/_src/operation_graph/graph.py) is responsible for the **dynamic analysis** of the Croissant files (i.e., actually loading the dataset by yielding examples). We convert the structure graph into an "**operation graph**". Operations are the unit transformations that build the dataset (like [`Download`](./ml_croissant/_src/operation_graph/operations/download.py), [`Extract`](./ml_croissant/_src/operation_graph/operations/extract.py), etc.).
 
-All contributions are welcome! We even have [good first issues](https://github.com/mlcommons/croissant/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to start in the project.
+Other important modules are:
+
+- [`ml_croissant/_src/core`](./ml_croissant/_src/core) defines all needed core internals. For instance, [`Issues`](./ml_croissant/_src/core/issues.py) are a way to track errors and warnings during the analysis of Croissant files.
+- [`ml_croissant/__init__`](./ml_croissant/__init__.py) declares the public API with [`ml_croissant.Dataset`](./ml_croissant/_src/datasets.py).
+
+For the full design and an overview of the implementation, refer to the [design doc](https://docs.google.com/document/d/1zYQIUX9ae1sZOOBq9OCsJ8JW8-Ejy3NLSeqaI5LtOEM/edit?resourcekey=0-CK78DfFvF7fnufyZqF3h3Q).
+
+## Contribute
+
+All contributions are welcome! We even have [good first issues](https://github.com/mlcommons/croissant/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) to help you get started with the project. Refer to the [GitHub project](https://github.com/orgs/mlcommons/projects/26) for more detailed user stories.
+
+The development workflow goes as follows:
+
+- [Fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) the repository: https://github.com/mlcommons/croissant.
+- Clone the newly forked repository:
+  ```bash
+  git clone git@github.com:<YOUR_GITHUB_USERNAME>/croissant.git
+  ```
+- Create a new branch:
+  ```bash
+  cd croissant/
+  git checkout -b feature/my-awesome-new-feature
+  ```
+- Code the feature. We support [VS Code](https://code.visualstudio.com) with pre-set settings.
+- Commit and push to GitHub:
+  ```bash
+  git add . && git commit -m "Add my awesome new feature"
+  git push --set-upstream origin feature/my-awesome-new-feature
+  ```
+- Open a pull request (PR) against the main branch of https://github.com/mlcommons/croissant, and ask for feedback!
+
+## Debug
+
+You can debug the validation of a Croissant file with the `--debug` flag:
+
+```bash
+python scripts/validate.py --file ../../datasets/titanic/metadata.json --debug
+```
+
+This will:
+1. print extra information, like the generated nodes;
+2. save the generated structure graph to a folder indicated in the logs.
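To make the static-analysis step from the Design section more concrete, here is a minimal, dependency-free sketch of the idea: walk the metadata, record nodes and edges, and collect errors and warnings along the way. It is not the library's implementation. The real code uses NetworkX and its own `Issues` class, while this sketch uses a plain `dict` and a hypothetical `Issues` dataclass. The mandatory fields are an assumption inferred from the minimal example in this change, and treating `license` as a recommended field is likewise an assumption.

```python
from dataclasses import dataclass, field

# Assumption: inferred from the "minimal_example" file, which claims to hold
# only the required fields.
MANDATORY_FIELDS = ("name", "description", "url")
# Assumption: `license` appears in the "recommended fields" example.
RECOMMENDED_FIELDS = ("license",)


@dataclass
class Issues:
    """Hypothetical stand-in for the library's issue tracker."""

    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def build_structure_graph(metadata):
    """Return (graph, issues) for a Croissant-like metadata dict.

    `graph` is an adjacency mapping: node name -> list of child node names.
    """
    issues = Issues()
    for key in MANDATORY_FIELDS:
        if key not in metadata:
            issues.errors.append(f"missing mandatory field: {key}")
    for key in RECOMMENDED_FIELDS:
        if key not in metadata:
            issues.warnings.append(f"missing recommended field: {key}")

    dataset = metadata.get("name", "<unnamed>")
    graph = {dataset: []}
    for resource in metadata.get("distribution", []):
        name = resource.get("name")
        if name is None:
            issues.errors.append("distribution entry without a name")
            continue
        graph[dataset].append(name)   # edge: dataset -> resource
        graph.setdefault(name, [])    # make sure the resource is a node
    return graph, issues
```

Collecting issues instead of raising on the first one lets a validator report every problem in a file in a single pass, which is what the Design section suggests the `Issues` tracker is for.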