Merge back/1.4.0rc2 (#1096)
vinnamkim committed Jul 14, 2023
2 parents d425d26 + 9827aef commit f100bb8
Showing 57 changed files with 6,775 additions and 1,099 deletions.
21 changes: 19 additions & 2 deletions CHANGELOG.md
@@ -6,9 +6,24 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## \[Unreleased\]
### New features
- Add tabular data import/export
(<https://github.com/openvinotoolkit/datumaro/pull/1089>)

## 11/07/2023 - Release 1.4.0rc2
### New features
- Add documentation and notebook example for Prune API
(<https://github.com/openvinotoolkit/datumaro/pull/1070>)

### Enhancements
- Give notice that the deprecations will take effect in datumaro==1.5.0
(<https://github.com/openvinotoolkit/datumaro/pull/1085>)

### Bug fixes
- Create the cache dir only under a writable filesystem
(<https://github.com/openvinotoolkit/datumaro/pull/1088>)

## 07/07/2023 - Release 1.4.0rc1
### New features
- Changed supported Python version range (>=3.8, <=3.11)
(<https://github.com/openvinotoolkit/datumaro/pull/1083>)
- Migrate OpenVINO v2023.0.0
@@ -26,7 +41,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- Migrate DVC v3.0.0
(<https://github.com/openvinotoolkit/datumaro/pull/1072>)
- Stream dataset import/export
(<https://github.com/openvinotoolkit/datumaro/pull/1077>)
(<https://github.com/openvinotoolkit/datumaro/pull/1077>, <https://github.com/openvinotoolkit/datumaro/pull/1081>, <https://github.com/openvinotoolkit/datumaro/pull/1082>, <https://github.com/openvinotoolkit/datumaro/pull/1091>)
- Support mask annotations for CVAT data format
(<https://github.com/openvinotoolkit/datumaro/pull/1078>)

@@ -55,6 +70,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
(<https://github.com/openvinotoolkit/datumaro/pull/1053>)
- Prevent installing protobuf>=4
(<https://github.com/openvinotoolkit/datumaro/pull/1054>)
- Fix UnionMerge
(<https://github.com/openvinotoolkit/datumaro/pull/1086>)

## 26/05/2023 - Release 1.3.2
### Enhancements
80 changes: 53 additions & 27 deletions contributing.md
@@ -1,41 +1,67 @@
# Contribution Guide

We appreciate any contribution to [Datumaro](https://github.com/openvinotoolkit/datumaro),
whether it is a pull request, a feature request, or a general comment or issue you found.
For feature requests and issues, please feel free to create a GitHub Issue in this repository.

## Related sections

- [Design document](https://openvinotoolkit.github.io/datumaro/latest/docs/explanation/architecture)
- [Developer manual](https://openvinotoolkit.github.io/datumaro/latest/docs/reference/datumaro_module)

## Installation for developer
## Development and pull requests

### Prerequisites

- Python (3.8+)

``` bash
git clone https://github.com/openvinotoolkit/datumaro
```

Optionally, install a virtual environment (recommended):

``` bash
python -m pip install virtualenv
python -m virtualenv venv
. venv/bin/activate
```

Install Datumaro with optional dependencies:
``` bash
cd /path/to/the/cloned/repo/
pip install -e ".[default,tf]"
```

Then install test dependencies:

``` bash
pip install -r tests/requirements.txt
```

**Optional dependencies**
To set up your development environment, please follow the steps below.
1. Fork the [repo](https://github.com/openvinotoolkit/datumaro).

2. Clone the forked repo:
``` bash
git clone <forked_repo>
```
3. Optionally, install a virtual environment (recommended):
``` bash
python -m pip install virtualenv
python -m virtualenv venv
. venv/bin/activate
```

4. Install Datumaro with [optional dependencies](#optional-dependencies):
``` bash
cd /path/to/the/cloned/repo/
pip install -e .[tf,tfds,default]
```

5. Install dev & test dependencies:
``` bash
pip install -r requirements-dev.txt
pip install -r tests/requirements.txt
```

6. Set up pre-commit hooks in the repo. See [Code style](#code-style).
``` bash
pre-commit install
pre-commit run
```

7. Create your branch based off the `develop` branch and make changes.

8. Verify your code by running unit tests and integration tests. See [Testing](#testing).
``` bash
pytest -v
```
or
``` bash
python -m pytest -v
```

9. Push your changes.

Now you are ready to create a PR (pull request) and get a review.

## Optional dependencies

Developers should install the following optional components to run our tests:

Binary file added docs/images/centroid.png
Binary file added docs/images/cluster_random.png
Binary file added docs/images/entropy.png
Binary file added docs/images/query_clust.png
93 changes: 93 additions & 0 deletions docs/source/docs/command-reference/context_free/prune.md
@@ -0,0 +1,93 @@
# Prune

## Prune Dataset

This command prunes a dataset to extract a representative subset of the whole. It lets you effectively handle a large-scale dataset with redundancy; the result is a representative and manageable subset.

Prune supports several methodologies:
- Randomized
- Hash-based
- Clustering-based

The `Randomized` approach is the most familiar form of randomness: items are selected at random from the dataset. The `Hash-based` approach operates on hashes, like [Explorer](./explorer.md). The default model for computing hashes is CLIP, which supports both image and text modalities. The supported model format is OpenVINO IR; the models are uploaded to [openvinotoolkit storage](https://storage.openvinotoolkit.org/repositories/datumaro/models/). The `Clustering-based` approach relies on clustering, so it also covers unlabeled datasets. We compute hashes for the images in the dataset, or utilize label data, to perform clustering.

By default, datasets are updated in-place. The `-o/--output-dir` option can be used to specify another output directory. When updating in-place, use the `--overwrite` parameter (in-place updates fail by default to prevent data loss), unless a project target is modified.

The current project (`-p/--project`) is also used as a context for plugins, so it can be useful for dataset paths having custom formats. When not specified, the current project's working tree is used.

The command can be applied to a dataset or a project build target, a stage or the combined `project` target, in which case all the project targets will be affected.

Usage:
```
datum prune [TARGET] -m METHOD [-r RATIO] [-h/--hash-type HASH_TYPE]
[-p PROJECT_DIR] [-o DST_DIR] [--overwrite]
```

Parameters:
- `<target>` (string) - Target [dataset revpath](../../user-manual/how_to_use_datumaro.md#dataset-path-concepts).
By default, the joined `project` dataset is used.
- `-m, --method` (string) - Prune method name (default: random).
- `-r, --ratio` (float) - The fraction of the dataset to keep (default: 0.5).
- `--hash-type` (string) - Hash type used for the clustering of `query_clust` (default: `img`). Image and text hashes are supported for extracting features from dataset items; to use the text hash, pass `txt`.
- `-p, --project` (string) - Directory of the project to operate on (default: current directory).
- `-o, --output-dir` (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the `project` target, but not intermediate stages) and dataset targets. If not specified, the results are saved in-place.
- `--overwrite` - Allows overwriting existing files in the output directory, when it is specified and is not empty.

Examples:
- Prune a dataset through `cluster_random` with an image hash, keeping 80% of the items
```console
datum prune source1 -m cluster_random -h img -r 0.8
```

### Built-in prune methods
- [`random`](#random) - Select items randomly from the dataset
- [`cluster_random`](#cluster_random) - Select items randomly from each cluster
- [`centroid`](#centroid) - Select the center of each cluster
- [`query_clust`](#query_clust) - Initialize clusters with a representative query per label
- [`entropy`](#entropy) - Select items based on label entropy within clusters
- [`ndr`](#ndr) - Remove duplicated images from the dataset

#### `random`
Randomly select items from the dataset. Items are chosen uniformly at random from the entire dataset.
```console
datum prune -m random -r 0.8 -p </path/to/project/>
```
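The selection logic can be sketched in plain Python (a minimal illustration, not Datumaro's implementation; `prune_random` and the item list are hypothetical names):

```python
import random

def prune_random(item_ids, ratio, seed=0):
    """Keep a `ratio` fraction of the items, chosen uniformly at random."""
    rng = random.Random(seed)
    keep = max(1, int(len(item_ids) * ratio))
    return rng.sample(item_ids, keep)

items = [f"item_{i}" for i in range(10)]
kept = prune_random(items, 0.8)
print(len(kept))  # 8
```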

#### `cluster_random`
Select items randomly from each cluster. The entire dataset is clustered using K-means, with the number of labels as the number of clusters, and items are selected from each cluster according to the desired ratio.
```console
datum prune -m cluster_random -r 0.8 -p </path/to/project/>
```
![cluster_random](../../../../images/cluster_random.png)
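The per-cluster sampling step can be sketched as follows (a simplified illustration assuming cluster labels have already been computed, e.g. by K-means; `cluster_random_prune` is a hypothetical name):

```python
import random
from collections import defaultdict

def cluster_random_prune(item_ids, cluster_labels, ratio, seed=0):
    """Independently sample a `ratio` fraction of the items in each cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for item, label in zip(item_ids, cluster_labels):
        clusters[label].append(item)
    kept = []
    for members in clusters.values():
        keep = max(1, round(len(members) * ratio))
        kept.extend(rng.sample(members, keep))
    return kept

# Two precomputed clusters of four items each; keep half of each.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
kept = cluster_random_prune(list(range(8)), labels, 0.5)
print(len(kept))  # 4
```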

#### `centroid`
Cluster the entire dataset, using the number of desired samples as the number of clusters: that many data points, determined by the desired proportion, serve as centroids, and the item closest to the center of each cluster is selected.

```console
datum prune -m centroid -r 0.8 -p </path/to/project/>
```
![centroid](../../../../images/centroid.png)
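The final "pick the item closest to each center" step can be sketched like this (a toy illustration with 2-D features standing in for real image hashes; `nearest_to_centers` is a hypothetical name):

```python
def nearest_to_centers(features, centers):
    """For each cluster center, return the index of the closest feature vector."""
    chosen = []
    for center in centers:
        # Squared Euclidean distance from every item to this center.
        dists = [
            sum((x - c) ** 2 for x, c in zip(feat, center))
            for feat in features
        ]
        chosen.append(min(range(len(features)), key=dists.__getitem__))
    return chosen

feats = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
print(nearest_to_centers(feats, [[0.2, 0.2], [9.0, 9.0]]))  # [0, 2]
```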

#### `query_clust`
When clustering the entire dataset, a representative query for each label is used as the corresponding cluster center. The representative query is computed through an image or text hash, with one query per label, chosen by randomly selecting one item for that label. Within the resulting clusters, items are then selected randomly according to the desired ratio.
```console
datum prune -m query_clust -h img -r 0.8 -p </path/to/project/>
```

```console
datum prune -m query_clust -h txt -r 0.8 -p </path/to/project/>
```
![query_clust](../../../../images/query_clust.png)
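The "one representative query per label" selection can be sketched as below (a minimal illustration, not Datumaro's implementation; `pick_label_queries` is a hypothetical name):

```python
import random
from collections import defaultdict

def pick_label_queries(item_ids, labels, seed=0):
    """Pick one random representative item per label to seed the clusters."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(item_ids, labels):
        by_label[label].append(item)
    return {label: rng.choice(members) for label, members in by_label.items()}

queries = pick_label_queries(["a", "b", "c", "d"], ["cat", "cat", "dog", "dog"])
print(sorted(queries))  # ['cat', 'dog']
```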

#### `entropy`
After clustering the entire dataset, items are selected within each cluster according to the desired ratio, taking the entropy of the labels into account.
```console
datum prune -m entropy -r 0.8 -p </path/to/project/>
```
![entropy](../../../../images/entropy.png)
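The label-entropy measure behind this method can be sketched as a standard Shannon entropy over a cluster's label distribution (an illustrative helper, not Datumaro's code; `label_entropy` is a hypothetical name):

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a cluster's label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(label_entropy(["cat", "dog"]))  # 1.0 — an evenly mixed cluster
print(label_entropy(["cat", "cat"]))  # zero — a pure cluster carries no label entropy
```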

#### `ndr`
Remove near-duplicated images in each subset. Details of this method are described in [ndr](./transform.md#ndr).
```console
datum prune -m ndr -p </path/to/project/>
```
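A much-simplified sketch of duplicate removal is bucketing items by a hash key and keeping the first item per bucket (real NDR compares hash similarity rather than exact keys; `drop_near_duplicates` and the toy hashes are hypothetical):

```python
def drop_near_duplicates(items, bucket_key):
    """Keep the first item in each hash bucket; drop later ones as duplicates."""
    seen = set()
    kept = []
    for item in items:
        key = bucket_key(item)
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept

# Toy 2-bit "hashes" standing in for real image hashes.
images = [("img1", "00"), ("img2", "00"), ("img3", "01")]
print(drop_near_duplicates(images, bucket_key=lambda im: im[1]))
# [('img1', '00'), ('img3', '01')]
```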
12 changes: 11 additions & 1 deletion docs/source/docs/jupyter_notebook_examples/refine.rst
@@ -1,7 +1,7 @@
Refine
######

We here provide the examples of dataset validation, correction and query-based filtration.
Here we provide examples of dataset validation, correction, query-based filtration, and pruning.

Datumaro's validator detects 22 anomalies such as missing or undefined label, far-from-mean outliers
and generates the validation report by categorizing anomalies into `info`, `warning`, and `error`.
@@ -17,6 +17,8 @@ For instance, with a given XML file below, we can filter a dataset by the subset
through ``/item[image/width=image/height]``, and annotation information such as id (``id``), type
(``type``), label (``label_id``), bounding box (``x, y, w, h``), etc.

Through the Prune API, you can create representative subsets of an entire dataset using various supported methods.

.. code-block::
<item>
Expand Down Expand Up @@ -61,6 +63,7 @@ datasets are updated in-place by default.
notebooks/11_validate
notebooks/12_correct_dataset
notebooks/04_filter
notebooks/17_data_pruning

.. grid:: 1 2 2 2
:gutter: 2
@@ -85,3 +88,10 @@ datasets are updated in-place by default.
:color: primary
:outline:
:expand:

.. grid-item-card::

.. button-ref:: notebooks/17_data_pruning
:color: primary
:outline:
:expand:
74 changes: 74 additions & 0 deletions docs/source/docs/level-up/advanced_skills/14_data_pruning.rst
@@ -0,0 +1,74 @@
=====================================================
Level 14: Dataset Pruning
=====================================================


Datumaro supports a prune feature to extract a representative subset of a dataset. The pruned dataset lets us examine the trade-off between
accuracy and convergence time when training on a reduced data sample. By selecting a subset of instances that captures the essential patterns
and characteristics of the data, we aim to evaluate the impact of dataset size on model performance.

More detailed descriptions about pruning are given in :doc:`Prune <../../command-reference/context_free/prune>`.
A Python example of pruning usage is described :doc:`here <../../jupyter_notebook_examples/notebooks/17_data_pruning>`.


.. tab-set::

.. tab-item:: Python

With the Python API, we can prune a dataset as below.

.. code-block:: python
from datumaro.components.dataset import Dataset
from datumaro.components.environment import Environment
from datumaro.components.prune import Prune
data_path = '/path/to/data'
env = Environment()
detected_formats = env.detect_dataset(data_path)
dataset = Dataset.import_from(data_path, detected_formats[0])
prune = Prune(dataset, cluster_method='<how/to/prune/dataset>')
result = prune.get_pruned(ratio='<how/much/to/prune/dataset>')
We can choose the desired method as ``<how/to/prune/dataset>`` among the provided ones. The default value is ``random``.
Additionally, we can specify how much of the dataset we want to retain by providing a float value between 0 and 1 for the ``<how/much/to/prune/dataset>`` parameter. The default value is 0.5.

.. tab-item:: CLI

Without a project declaration, we can simply ``prune`` a dataset by

.. code-block:: bash
datum prune <target> -m METHOD -r RATIO -h HASH_TYPE
We can use ``--overwrite`` instead of setting ``-o/--output-dir``.
We can choose the desired method as ``METHOD`` among the provided ones. The default value is ``random``.
Additionally, we can specify how much of the dataset we want to retain by providing a float value between 0 and 1 for the ``RATIO`` parameter. The default value is 0.5.


.. tab-item:: ProjectCLI

With the project-based CLI, we first need to ``create`` a project by

.. code-block:: bash
datum project create --output-dir <path/to/project>
We now ``import`` data into the project through

.. code-block:: bash
datum project import --project <path/to/project> <path/to/data>
We can then ``prune`` the dataset

.. code-block:: bash
datum prune -m METHOD -r RATIO -h HASH_TYPE -p <path/to/project>
We can choose the desired method as ``METHOD`` among the provided ones. The default value is ``random``.
Additionally, we can specify how much of the dataset we want to retain by providing a float value between 0 and 1 for the ``RATIO`` parameter. The default value is 0.5.
14 changes: 14 additions & 0 deletions docs/source/docs/level-up/advanced_skills/index.rst
@@ -7,6 +7,7 @@ Advanced Skills

12_project_versioning
13_pseudo_label_generation
14_data_pruning

.. grid:: 1 2 2 2
:gutter: 2
@@ -32,3 +33,16 @@
Level 13: Pseudo Label Generation

:bdg-success:`ProjectCLI`

.. grid-item-card::

.. button-ref:: 14_data_pruning
:color: primary
:outline:
:expand:

Level 14: Data Pruning

:bdg-warning:`Python`
:bdg-info:`CLI`
:bdg-success:`ProjectCLI`