
Add wildreceipt dataset #1359

Merged (33 commits) on Oct 27, 2023
Conversation

HamzaGbada
Contributor

No description provided.

@codecov
codecov bot commented Oct 26, 2023

Codecov Report

Merging #1359 (e257a29) into main (e83c3ab) will increase coverage by 0.01%.
Report is 6 commits behind head on main.
The diff coverage is 97.77%.

❗ Current head e257a29 differs from pull request most recent head 478a420. Consider uploading reports for the commit 478a420 to get more accurate results

@@            Coverage Diff             @@
##             main    #1359      +/-   ##
==========================================
+ Coverage   95.78%   95.80%   +0.01%     
==========================================
  Files         154      155       +1     
  Lines        6910     6954      +44     
==========================================
+ Hits         6619     6662      +43     
- Misses        291      292       +1     
Flag Coverage Δ
unittests 95.80% <97.77%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
doctr/datasets/__init__.py 100.00% <100.00%> (ø)
doctr/datasets/wildreceipt.py 97.72% <97.72%> (ø)

... and 4 files with indirect coverage changes

@felixdittrich92 felixdittrich92 added this to the 0.7.1 milestone Oct 26, 2023
@felixdittrich92 felixdittrich92 added topic: documentation Improvements or additions to documentation ext: tests Related to tests folder module: datasets Related to doctr.datasets labels Oct 26, 2023
@felixdittrich92 felixdittrich92 self-assigned this Oct 26, 2023
@felixdittrich92 felixdittrich92 (Contributor) left a comment

Hi @HamzaGbada 👋,

Thanks a lot, this looks pretty good overall 👍
I have added a few comments.

Furthermore, could you also update the docs, please? :)
https://github.com/mindee/doctr/blob/main/docs/source/index.rst -> Supported Datasets
https://github.com/mindee/doctr/blob/main/docs/source/using_doctr/using_datasets.rst -> Tables

If you are done please run

make style
make quality

to fix formatting, etc.

NOTE: Don't worry about the failing CI TF detection test, I have already opened a fix for it :)



class WILDRECEIPT(AbstractDataset):
"""WildReceipt is a collection of receipts. It contains, for each photo, a list of OCRs - with bounding box, text, and class."
Contributor:

"""WildReceipt dataset from `"Spatial Dual-Modality Graph Reasoning for Key Information Extraction"

<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.

>>> # NOTE: You need to download/generate the dataset from the repository.
Contributor:

download/generate -> download

self.data: List[Tuple[Union[str, Path, np.ndarray], Union[str, Dict[str, Any]]]] = []

# define folder to write IMGUR5K recognition dataset
reco_folder_name = "WILDRECEIPT_recognition_train" if self.train else "WILDRECEIPT_recognition_test"
Contributor:

How many samples are in the train and test splits?
Do we really need to save it locally, or can we keep it in RAM?

Otherwise we can store it directly in RAM, for example:

if recognition_task:

Contributor Author:

Certainly. Given the limited number of samples (1268 for the training set and 472 for the test set), I've opted to store the data directly in RAM.
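A minimal sketch of the in-RAM approach (names and values are illustrative, not doctr's exact API): instead of writing each crop to disk, the (crop, label) pairs are appended to an in-memory list.

```python
# Hedged sketch, not doctr's actual implementation: with only ~1268 train /
# 472 test samples, recognition crops fit in memory, so each (crop, label)
# pair is appended to an in-RAM list instead of being written to disk.
data = []
crops = ["crop0", "crop1", "crop2"]      # stand-ins for cropped image arrays
labels = ["TOTAL", "$4.50", "SUBTOTAL"]
for crop, label in zip(crops, labels):
    data.append((crop, label))
print(len(data))  # prints 3
```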

np_dtype = np.float32
self.data: List[Tuple[Union[str, Path, np.ndarray], Union[str, Dict[str, Any]]]] = []

# define folder to write IMGUR5K recognition dataset
Contributor:

WildReceipt

dtype=np_dtype
)
else:
box = self._convert_xmin_ymin(coordinates)
Contributor:

No need to write your own function, you can use the functions from doctr.utils:

from .utils import polygon_to_bbox
box_targets = polygon_to_bbox(tuple((coordinates[i], coordinates[i + 1]) for i in range(0, len(coordinates), 2)))
box = [coord for coords in box_targets for coord in coords]

OR

write the logic directly here (the function is only used once):

x, y = box[::2], box[1::2]
box = [min(x), min(y), max(x), max(y)]

I would prefer the second way :)
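The second suggestion can be sketched end to end with illustrative values (`coordinates` here is a hypothetical flat polygon as stored in the annotations): min/max over the x and y slices reduces it to an axis-aligned box.

```python
# Illustrative values: a flat quadrilateral [x0, y0, x1, y1, ...].
coordinates = [10.0, 20.0, 50.0, 18.0, 52.0, 40.0, 9.0, 42.0]

# The suggested inline conversion: even-indexed entries are x, odd are y.
x, y = coordinates[::2], coordinates[1::2]
box = [min(x), min(y), max(x), max(y)]
print(box)  # [9.0, 18.0, 52.0, 42.0]
```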

Contributor:

@HamzaGbada can we use the second suggestion please? After reading this again I really don't like it 😅

Contributor Author:

OK

img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
)
for crop, label in zip(crops, list(text_targets)):
with open(os.path.join(reco_folder_path, f"{reco_images_counter}.txt"), "w") as f:
Contributor:

As mentioned, I don't think that we need to save it locally, wdyt?

@HamzaGbada (Contributor Author):

About fixing formatting, these two commands return an error:

make style
make quality
Sphinx error:
Builder name style not registered or available through entry point
make: *** [Makefile:20: style] Error 2

Do you have an idea about it ?

@felixT2K (Contributor):

About fixing formatting, these two commands return an error:

make style
make quality
Sphinx error:
Builder name style not registered or available through entry point
make: *** [Makefile:20: style] Error 2

Do you have an idea about it ?

You have installed doctr with its dev dependencies, correct?

cd doctr
pip3 install -e .[dev]

Looks like you are in the docs directory

cd doctr
make style
make quality

https://github.com/mindee/doctr/blob/main/Makefile

@felixT2K felixT2K (Contributor) left a comment

@HamzaGbada Close to merge, really good job 👍🏼 😃

Only some minor stuff left.

@@ -84,6 +86,8 @@ This datasets contains the information to train or validate a text recognition m
+-----------------------------+---------------------------------+---------------------------------+---------------------------------------------+
| IIITHWS | 7141797 | 793533 | english / handwritten / external resources |
+-----------------------------+---------------------------------+---------------------------------+---------------------------------------------+
| WILDRECEIPT | 1268 | 472 | english / external resources |
@felixT2K (Contributor), Oct 27, 2023:

This doesn't look correct: here we should add the number of samples we get if we use the dataset for recognition :)
So this should be many more samples.

<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.

>>> # NOTE: You need to download the dataset from the repository.
@felixT2K (Contributor), Oct 27, 2023:

Change to: You need to download the dataset first.

crops = crop_bboxes_from_image(
img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
)
for crop, label in zip(crops, list(text_targets)):
Contributor:

Do you know if there is text inside that we need to filter out?
For example, text which contains whitespace?

Ref.:

if not any(char in label for char in ["☑", "☐", "\uf703", "\uf702"]):
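The referenced check, run on a few illustrative labels: any label containing one of the listed special characters is dropped.

```python
# Illustrative labels: the referenced filter drops labels containing any of
# the special characters (checkbox glyphs and private-use codepoints).
labels = ["TOTAL", "☑ PAID", "$4.50"]
kept = [
    label
    for label in labels
    if not any(char in label for char in ["☑", "☐", "\uf703", "\uf702"])
]
print(kept)  # ['TOTAL', '$4.50']
```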

Contributor Author:

It's worth noting that this dataset contains small text elements that might not be conducive to the recognition task. For instance, we could consider filtering out text elements that are empty or consist of characters such as "-", "*", "/", "=", "#", or "@" to enhance the quality of the recognition process.

Contributor:

@HamzaGbada
Mh, in this case I think it would be enough to filter out empty elements, or labels that contain a whitespace.
We can handle all the above punctuation :)

"""WildReceipt dataset from `"Spatial Dual-Modality Graph Reasoning for Key Information Extraction"
<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.

Contributor:

Optional: if we have an image to give a general overview of the dataset, that would be great.

See:
https://mindee.github.io/doctr/modules/datasets.html

.. image:: https://doctr-static.mindee.com/models?id=v0.5.0/funsd-grid.png&src=0

Contributor Author:

Where should I put the image?

Contributor:

@HamzaGbada you can post it here

@odulcy-mindee Could you upload it please ?

Contributor Author:

combined_image

Collaborator:

@felixT2K @HamzaGbada Here you go:

https://doctr-static.mindee.com/models?id=v0.7.0/wildreceipt-dataset.jpg&src=0

@felixT2K (Contributor):

It would be enough if you post the mentioned image here, we can update the docstring later :)
Do make style and make quality work now?

@HamzaGbada (Contributor Author):

It would be enough if you post the mentioned image here, we can update the docstring later :) Do make style and make quality work now?

No it returns:

isort .
make: isort: No such file or directory
make: *** [Makefile:12: style] Error 127

@felixT2K (Contributor):

It would be enough if you post the mentioned image here, we can update the docstring later :) Do make style and make quality work now?

No it returns:

isort .
make: isort: No such file or directory
make: *** [Makefile:12: style] Error 127

What happens if you run the following commands (individually, without make):

isort .
black .
ruff --fix .

@@ -99,7 +99,7 @@ def __init__(
img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
)
for crop, label in zip(crops, list(text_targets)):
if not any(char in label for char in ["", "-", "*", "/", "=", "#", "@"]):
if not any(char in label for char in ["", " "]):
Contributor:

if label and " " not in label:
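The suggested condensed filter, applied to a few illustrative labels: a crop is kept only when its label is non-empty and contains no space.

```python
# Illustrative labels: the condensed check keeps a crop only when the label
# is non-empty (truthy) and contains no whitespace.
labels = ["TOTAL", "", "12 . 99", "$4.50"]
kept = [label for label in labels if label and " " not in label]
print(kept)  # ['TOTAL', '$4.50']
```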

@HamzaGbada (Contributor Author):

ruff --fix .

Got it, the issue was related to my Linux distribution.


@felixdittrich92 felixdittrich92 (Contributor) left a comment

Looks good now, thanks a lot 🤗

@odulcy-mindee (Collaborator):

Thank you @HamzaGbada for this contribution! 👏
Thanks @felixdittrich92 for the review!

@odulcy-mindee odulcy-mindee merged commit 7222fe8 into mindee:main Oct 27, 2023
66 of 68 checks passed
Labels
ext: tests (Related to tests folder), module: datasets (Related to doctr.datasets), topic: documentation (Improvements or additions to documentation), type: new feature (New feature)

4 participants