[Datumaro] Add dataset statistics #1668

Merged
merged 6 commits into from Aug 7, 2020

@zhiltsov-max zhiltsov-max commented Jun 8, 2020

Motivation and context

  • The stats command now outputs JSON files
  • Fixed mean and std computation for empty datasets
  • Added new metrics for the stats command:
    • Image count, annotations count, unannotated images
    • Per-label annotations count, histogram
    • Per-type annotations count
    • Per-label attribute counts, values, and their histogram (as requested in Implement extended annotations statistics #1783)
    • Per-label segment pixel count and histogram (i.e. class balance for segmentations)
    • Segment (bbox, polygons, masks) areas histogram
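
A minimal sketch of how a per-label count and histogram of this kind could be computed. The helper name `label_distribution` and its input (a flat list of label names) are assumptions made for illustration, not Datumaro's actual API. The `[count, fraction]` pair layout follows the example output in this description, and the zero-total guard illustrates the sort of empty-dataset handling mentioned above.

```python
from collections import Counter


def label_distribution(labels, known_labels=()):
    """Return (total, {label: [count, fraction]}) for a list of label names.

    Hypothetical helper for illustration only; not part of Datumaro.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Include known labels that never occur, so they appear with a zero count.
    names = sorted(set(counts) | set(known_labels))
    # Guard against an empty dataset: never divide by zero.
    denom = total if total else 1
    return total, {name: [counts[name], counts[name] / denom] for name in names}


# Counts taken from the "gender" attribute in the example output.
total, dist = label_distribution(["male"] * 209 + ["female"] * 149)
print(total, dist["female"], dist["male"])
```

Passing `known_labels` keeps labels with no annotations in the result, which matches how zero-count entries appear in the example output.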

Example output:

{
    "annotations": {
        "labels": {
            "attributes": {
                "gender": {
                    "count": 358,
                    "distribution": {
                        "female": [
                            149,
                            0.41620111731843573
                        ],
                        "male": [
                            209,
                            0.5837988826815642
                        ]
                    },
                    "values count": 2,
                    "values present": [
                        "female",
                        "male"
                    ]
                },
                "view": {
                    "count": 340,
                    "distribution": {
                        "__undefined__": [
                            4,
                            0.011764705882352941
                        ],
                        "front": [
                            54,
                            0.1588235294117647
                        ],
                        "left": [
                            14,
                            0.041176470588235294
                        ],
                        "rear": [
                            235,
                            0.6911764705882353
                        ],
                        "right": [
                            33,
                            0.09705882352941177
                        ]
                    },
                    "values count": 5,
                    "values present": [
                        "__undefined__",
                        "front",
                        "left",
                        "rear",
                        "right"
                    ]
                }
            },
            "count": 2038,
            "distribution": {
                "car": [
                    340,
                    0.16683022571148184
                ],
                "cyclist": [
                    194,
                    0.09519136408243375
                ],
                "head": [
                    354,
                    0.17369970559371933
                ],
                "ignore": [
                    100,
                    0.04906771344455348
                ],
                "left_hand": [
                    238,
                    0.11678115799803729
                ],
                "person": [
                    358,
                    0.17566241413150147
                ],
                "right_hand": [
                    77,
                    0.037782139352306184
                ],
                "road_arrows": [
                    326,
                    0.15996074582924436
                ],
                "traffic_sign": [
                    51,
                    0.025024533856722278
                ]
            }
        },
        "segments": {
            "area distribution": [
                {
                    "count": 1318,
                    "max": 11425.1,
                    "min": 0.0,
                    "percent": 0.9627465303140978
                },
                {
                    "count": 1,
                    "max": 22850.2,
                    "min": 11425.1,
                    "percent": 0.0007304601899196494
                },
                {
                    "count": 0,
                    "max": 34275.3,
                    "min": 22850.2,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 45700.4,
                    "min": 34275.3,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 57125.5,
                    "min": 45700.4,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 68550.6,
                    "min": 57125.5,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 79975.7,
                    "min": 68550.6,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 91400.8,
                    "min": 79975.7,
                    "percent": 0.0
                },
                {
                    "count": 0,
                    "max": 102825.90000000001,
                    "min": 91400.8,
                    "percent": 0.0
                },
                {
                    "count": 50,
                    "max": 114251.0,
                    "min": 102825.90000000001,
                    "percent": 0.036523009495982466
                }
            ],
            "avg. area": 5411.624543462382,
            "pixel distribution": {
                "car": [
                    13655,
                    0.0018431496518735067
                ],
                "cyclist": [
                    939005,
                    0.12674674030446592
                ],
                "head": [
                    0,
                    0.0
                ],
                "ignore": [
                    5501200,
                    0.7425510702956085
                ],
                "left_hand": [
                    0,
                    0.0
                ],
                "person": [
                    954654,
                    0.12885903974805205
                ],
                "right_hand": [
                    0,
                    0.0
                ],
                "road_arrows": [
                    0,
                    0.0
                ],
                "traffic_sign": [
                    0,
                    0.0
                ]
            }
        }
    },
    "annotations by type": {
        "bbox": {
            "count": 548
        },
        "caption": {
            "count": 0
        },
        "label": {
            "count": 0
        },
        "mask": {
            "count": 0
        },
        "points": {
            "count": 669
        },
        "polygon": {
            "count": 821
        },
        "polyline": {
            "count": 0
        }
    },
    "annotations count": 2038,
    "dataset": {
        "image mean": [
            107.06903686941979,
            79.12831698580979,
            52.95829558185416
        ],
        "image std": [
            49.40237673503467,
            43.29600731496902,
            35.47373007603151
        ],
        "images count": 100
    },
    "images count": 100,
    "subsets": {},
    "unannotated images": [
        "img00051",
        "img00052",
        "img00053",
        "img00054",
        "img00055",
        "img00056",
        "img00057",
        "img00058",
        "img00059",
        "img00060",
        "img00061",
        "img00062",
        "img00063",
        "img00064",
        "img00065",
        "img00066",
        "img00067",
        "img00068",
        "img00069",
        "img00070",
        "img00071",
        "img00072",
        "img00073",
        "img00074",
        "img00075",
        "img00076",
        "img00077",
        "img00078",
        "img00079",
        "img00080",
        "img00081",
        "img00082",
        "img00083",
        "img00084",
        "img00085",
        "img00086",
        "img00087",
        "img00088",
        "img00089",
        "img00090",
        "img00091",
        "img00092",
        "img00093",
        "img00094",
        "img00095",
        "img00096",
        "img00097",
        "img00098",
        "img00099",
        "img00100"
    ],
    "unannotated images count": 50
}
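
The "area distribution" entries above look like an equal-width histogram: ten bins of width 11425.1 spanning 0.0 to 114251.0, with "percent" holding a fraction of all segments rather than a percentage (the three non-empty bins sum to 1.0). The following is a sketch of one way to build such a histogram. It assumes the bin range is taken from the smallest and largest area, which the example does not confirm, and `area_histogram` is a hypothetical name, not the function used in this PR.

```python
def area_histogram(areas, bins=10):
    """Build an equal-width histogram as a list of dicts.

    Hypothetical helper for illustration; the bin range [min, max] of the
    data is an assumption, not taken from the Datumaro implementation.
    """
    if not areas:
        return []  # nothing to bin for an empty dataset
    lo, hi = min(areas), max(areas)
    width = (hi - lo) / bins
    counts = [0] * bins
    for area in areas:
        # The largest value belongs to the last bin (closed right edge).
        idx = min(int((area - lo) / width), bins - 1) if width else 0
        counts[idx] += 1
    total = len(areas)
    return [
        {
            "min": lo + i * width,
            "max": lo + (i + 1) * width,
            "count": count,
            "percent": count / total,  # a fraction in [0, 1]
        }
        for i, count in enumerate(counts)
    ]


hist = area_histogram([0.0, 10.0, 20.0, 100.0], bins=4)
for entry in hist:
    print(entry)
```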

How has this been tested?

Unit tests

How to test:

datum project import -i <path> -f <format>
datum project stats -p <project/path>
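
Once the stats command has produced its JSON, a few quick consistency checks can help when testing by hand. The sketch below checks three relations that hold in the example output above: per-type counts sum to "annotations count", per-label counts sum to the labels "count", and the unannotated list length matches its counter. Whether these hold for every dataset is an assumption, and `check_stats` is a hypothetical helper, not part of Datumaro. The inline JSON is a trimmed stand-in for a real output file.

```python
import json


def check_stats(stats):
    """Return a list of inconsistencies found in a stats dict (empty if none).

    Hypothetical helper; key names follow the example output in this PR.
    """
    problems = []

    by_type = sum(v["count"] for v in stats["annotations by type"].values())
    if by_type != stats["annotations count"]:
        problems.append("per-type counts do not sum to 'annotations count'")

    labels = stats["annotations"]["labels"]
    by_label = sum(count for count, _ in labels["distribution"].values())
    if by_label != labels["count"]:
        problems.append("per-label counts do not sum to labels 'count'")

    if len(stats["unannotated images"]) != stats["unannotated images count"]:
        problems.append("'unannotated images' length differs from its count")

    return problems


# Trimmed, made-up sample in the same layout as the example output.
sample = json.loads("""
{
    "annotations": {"labels": {"count": 3,
        "distribution": {"car": [2, 0.6666666666666666],
                         "person": [1, 0.3333333333333333]}}},
    "annotations by type": {"bbox": {"count": 2}, "polygon": {"count": 1}},
    "annotations count": 3,
    "unannotated images": ["img00002"],
    "unannotated images count": 1
}
""")
print(check_stats(sample))
```

For a real run, the dict would instead come from `json.load` on the file written by the stats command; its path is not stated in this description.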

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@coveralls

coveralls commented Jun 8, 2020

Pull Request Test Coverage Report for Build 6838

  • 58 of 87 (66.67%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.1%) to 69.036%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
datumaro/datumaro/cli/contexts/project/__init__.py | 1 | 9 | 11.11%
datumaro/datumaro/components/operations.py | 57 | 78 | 73.08%

Files with Coverage Reduction | New Missed Lines | %
cvat/apps/engine/media_extractors.py | 1 | 77.29%

Totals
Change from base Build 6836: 0.1%
Covered Lines: 11211
Relevant Lines: 15745

💛 - Coveralls

@zhiltsov-max zhiltsov-max changed the title [Datumaro] Add basic dataset quality estimation [Dependent] [Datumaro] Add basic dataset quality estimation Jun 9, 2020
@zhiltsov-max zhiltsov-max changed the title [Dependent] [Datumaro] Add basic dataset quality estimation [Datumaro] Add basic dataset quality estimation Jun 24, 2020
@nmanovic

@zhiltsov-max, could you please point me to a ready-to-use model (bin, xml, py)? How can the feature be integrated and used in CVAT?

@zhiltsov-max

zhiltsov-max commented Jun 24, 2020

You can either:

  • use an Accuracy Checker model. There is a model at datumaro/tests/assets/pytorch_launcher; you need to adjust a few paths there so that they are absolute, or at least resolvable from the point of invocation. You also need to install accuracy-checker with pip from its repo (https://github.com/opencv/open_model_zoo/tree/master/tools/accuracy_checker).
  • use an OpenVINO model. There are no examples; you should provide an xml, a bin, and an interpreter.py. A number of models are available in OMZ, and any detector would fit.
  • if you have already launched a model and obtained a project, specify the path to that other project to compare against with datum quality --inference-path.

At the moment there is no interface for this in CVAT, but it is a topic for further development.

@nmanovic nmanovic requested a review from azhavoro June 25, 2020 14:07
azhavoro previously approved these changes Jun 26, 2020
@zhiltsov-max zhiltsov-max added this to Review in progress in Dataset framework (Datumaro) via automation Jul 9, 2020
@zhiltsov-max zhiltsov-max changed the title [Datumaro] Add basic dataset quality estimation [Datumaro] Add dataset statistics Jul 10, 2020
@nmanovic

@zhiltsov-max, could you please resolve the conflicts?
@azhavoro, could you please look at the PR and test it?

azhavoro previously approved these changes Jul 14, 2020
azhavoro previously approved these changes Jul 14, 2020

Codacy: Here is an overview of what got changed by this pull request:

Complexity increasing per file
==============================
- datumaro/datumaro/cli/commands/quality.py  8
- datumaro/datumaro/components/operations.py  18
         

See the complete overview on Codacy


@nmanovic nmanovic left a comment

LGTM

Dataset framework (Datumaro) automation moved this from Review in progress to Reviewer approved Aug 7, 2020
@nmanovic nmanovic merged commit eaeb67d into develop Aug 7, 2020
Dataset framework (Datumaro) automation moved this from Reviewer approved to Done Aug 7, 2020
@nmanovic nmanovic deleted the zm/ann-quality-main branch August 7, 2020 19:18