
Feat: Make detection training and inference Multiclass #1097

Merged · 41 commits · Dec 19, 2022

Conversation

aminemindee (Contributor)

Hello, this PR makes the detection part of docTR multiclass.

⚠️ This PR introduces a breaking change in how docTR is used: the blocks attribute of the Page class is no longer a list of blocks but a dictionary whose keys are the class names.

  • Detection model training (TF/PyTorch) works with the new data format (including classes) as well as with the old data format
  • Inference with ocr_predictor works for TF/PyTorch
  • Visualization and export
  • Middleware and HF API unchanged; should work without changes
  • Minimal API: some changes to the Dockerfile and docs to make it work

A first review of the code was done on my fork. You can find it here: aminemindee#4

Any feedback is welcome!
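To make the breaking change above concrete, here is a minimal sketch of what downstream code has to change, using a placeholder dictionary with the new layout (the class names and block values are illustrative, not docTR's actual API):

```python
# Hypothetical shape of the new Page.blocks attribute after this PR:
# a dict mapping detection class name -> list of blocks, instead of a flat list.
page_blocks = {
    "words": ["block_a", "block_b"],   # placeholder blocks for the default class
    "handwritten": ["block_c"],        # a second, illustrative detection class
}

# Old-style code that iterated a flat list of blocks must now iterate per class:
all_blocks = [block for blocks in page_blocks.values() for block in blocks]
print(all_blocks)
```

Code that previously did `for block in page.blocks` would fail on the dict, which is what makes this change breaking.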

@odulcy-mindee (Collaborator)

Test in CI docker / pytest-api will work after DocTR release 👍

@felixdittrich92 (Contributor)

> Test in CI docker / pytest-api will work after DocTR release 👍

After release? 😅

@odulcy-mindee (Collaborator)

odulcy-mindee commented Oct 13, 2022

@felixdittrich92 Yes, it seems weird written like that, haha.
It's due to these:

@aminemindee following @felixdittrich92's remark, I just saw that we need to pin docTR to the latest version in the API's pyproject.toml; otherwise it may still install an older version from main that doesn't support multiclass.

@felixdittrich92 (Contributor)

felixdittrich92 commented Oct 13, 2022

@odulcy-mindee @aminemindee you also need to bump the major version specifier in the root pyproject.toml (breaking change) and sort the milestones/open tickets after merge/release :)
After scrolling through the changes, I also saw that the XML export changes will break the PDF/A notebook, FYI :)
I will try to review this carefully tomorrow, but for my understanding: what is the general use case for multiclass text detection? Printed/handwritten?

@felixdittrich92 added labels on Oct 14, 2022: topic: documentation, type: breaking change, module: datasets, ext: demo, ext: references, framework: pytorch, framework: tensorflow, topic: text detection, type: new feature
@felixdittrich92 (Contributor) left a comment:

Hi @aminemindee @odulcy-mindee (seems you have both worked on it 😅), thanks for the PR. I added a first general review without going too deep and left some questions and remarks :)

Due to the breaking change and the amount of changes, someone should have a second look at it. @frgfm?

@@ -25,6 +25,11 @@ def _read_sample(self, index: int) -> Tuple[torch.Tensor, Any]:
if isinstance(target, dict):
assert "boxes" in target, "Target should contain 'boxes' key"
assert "labels" in target, "Target should contain 'labels' key"
elif isinstance(target, tuple):
Contributor:

We have to be careful with this; currently we have two cases:

  • more than one annotation: dict (OCR dataset with labels and boxes, or object detection dataset)
  • otherwise: label (recognition) or boxes (text detection)

We need to be as clear as possible here so we don't run into problems with the transformations.

Contributor Author:

Here, for now, the only case where the target is a tuple is when it comes from the detection dataset with multiple classes, to keep that information and include it in the target in _pre_transform.

Yes, we should be careful, but for now this is the only case where it will be used.

@@ -25,6 +25,11 @@ def _read_sample(self, index: int) -> Tuple[tf.Tensor, Any]:
if isinstance(target, dict):
assert "boxes" in target, "Target should contain 'boxes' key"
assert "labels" in target, "Target should contain 'labels' key"
elif isinstance(target, tuple):
Contributor:

same

Contributor:

I still see this as troublesome (transformations); better to keep a dict for everything that has multiple annotations 🤔

@@ -10,13 +10,16 @@
import os
import sys

CLASS_NAME: str = "words"
Contributor:

This is a text-detection-specific attribute.
We have to add it to the cfg of each text detection model, as is already done for object detection, and then work with model.cfg['class_names'] afterwards:

"classes": ["background", "qr_code", "bar_code", "logo", "photo"],

This also makes the names flexible for training.

NOTE: this change will also need an update to the HF API (factory) and to the config.json of uploaded text detection models.
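To illustrate the proposal above, a minimal sketch assuming class names live in the model config and can be overridden from the dataset; the cfg layout and the build_cfg helper are hypothetical, not docTR's actual schema:

```python
# Illustrative model cfg carrying class names (field names assumed, not docTR's real schema)
default_cfg = {
    "input_shape": (1024, 1024, 3),
    "class_names": ["words"],  # single default class for plain text detection
}

def build_cfg(dataset_class_names=None):
    """Return a cfg whose class_names come from the dataset when provided."""
    cfg = dict(default_cfg)
    if dataset_class_names:
        cfg["class_names"] = list(dataset_class_names)
    return cfg

print(build_cfg(["printed", "handwritten"])["class_names"])
```

The point of the design is that nothing is hardcoded: a plain text-detection model keeps the single default class, while a multiclass dataset replaces the list at build time.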

Collaborator:

I think it is OK to have this default single class and otherwise use the classes of the dataset to build the model, but the user should be able to set this CLASS_NAME to a specific value like "words", "numbers", "default_class", "handwritten"... So it shouldn't be hardcoded here; I rather see it as a parameter given by the user, either in the config or when building the model. WDYT?

Contributor:

Hi @charlesmindee yep sounds good to me 👍

I see it as much more problematic to modify the blocks output structure. The results should be added to each predicted word, like:

'house', conf: '0.99', class: 'word'

instead of:

Blocks:
'word': { Line .. Words .. },
'salary': { Line .. Words .. }

WDYT?
(Written from a mobile phone, sorry 😅)

@@ -311,65 +316,66 @@ def export_as_xml(self, file_title: str = "docTR - XML export (hOCR)") -> Tuple[
},
)
# iterate over the blocks / lines / words and create the XML elements in body line by line with the attributes
for block in self.blocks:
Contributor:

It's been a while since I wrote this export-as-XML functionality 😅, but we should test that it still works as expected with the PDF/A notebook.

Contributor Author:

I looked at it a little bit. In Google Colab, the installation using git doesn't seem to work for me. However, the code should work; the output will just be slightly different, I think.

doctr/models/_utils.py (outdated, resolved)
val_metric.update(gts=boxes_gt, preds=boxes_pred[:, :4])
for target, loc_pred in zip(targets, loc_preds):
if isinstance(target, np.ndarray):
target = {CLASS_NAME: target}
Contributor:

Shouldn't this name come from the dataset-specified names? 🤔

Contributor Author:

For the training part, you can still use a data format where you don't give any class name. In that case, the targets are not a dict, so I change them here. But now that I'm looking at it, I think it's better to just add the CLASS_NAME in the dataset and not handle it outside. Right?
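A hedged sketch of the normalization being discussed: legacy targets are plain arrays of boxes, multiclass targets are dicts keyed by class name. The CLASS_NAME value and the list-based boxes are illustrative stand-ins (docTR uses numpy arrays here):

```python
# Illustrative default class name; docTR's actual constant lives elsewhere.
CLASS_NAME = "words"

def normalize_target(target):
    """Wrap a legacy (non-dict) detection target into the single-class dict form."""
    if isinstance(target, dict):
        return target  # already class-keyed, leave untouched
    return {CLASS_NAME: target}

print(normalize_target([[0.1, 0.1, 0.4, 0.2]]))
```

Moving this wrapping into the dataset itself, as suggested above, would mean every consumer sees one uniform dict-shaped target.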

Contributor:

I think we need to avoid this CLASS_NAME completely. Class names should come from the dataset and update the default class_names in model.cfg. To explain: a box can be annotated with different labels, e.g. box 1: name, box 2: salary. I don't fully understand why we need to modify the blocks output structure in the end. I think what we want is this:

(words): [
          Word(value='ABC Gmbh.', confidence=0.91, class_name='name'),
          Word(value='1200', confidence=0.99, class_name='salary'),
        ]

And not something like:

(blocks): {'salary':
      (lines): [Line(
        (words): [
          Word(value='1200.', confidence=0.91),
        ]
      )}, {'name': ....
]

Right?

Contributor Author:

The multiclass detection model predicts bounding boxes that represent the blocks, so in theory the whole block is of that class, and the words inside it are of that class. A user will want to get all the words of the same block of that class, and they can get that directly with the blocks-as-dict approach. Different words in the same block will always have the same class.

val_metric.update(gts=boxes_gt, preds=boxes_pred[:, :4])
for target, loc_pred in zip(targets, loc_preds):
if isinstance(target, np.ndarray):
target = {CLASS_NAME: target}
Contributor:

same ?

@@ -215,7 +220,7 @@ class Page(Element):
"""Implements a page element as a collection of blocks

Args:
blocks: list of block elements
blocks: Dictionary with list of block elements for each detection class
Contributor:

I'm still trying to find a solution to avoid this breaking change, but first a short explanation would help me understand the use case for multiclass text detection, or rather what you want to achieve with it 😅

Is it not possible to add another key-value pair like 'class_names': 'words' instead of changing the output structure?

Contributor Author (@aminemindee, Oct 14, 2022):

Hmm, something like a class_names that has the same length as blocks and gives the class name of each block?

Contributor:

Mh, are you sure that block is the correct part? If I understood correctly, each box can have a different label, so it is the words part and not block (blocks can contain multiple lines, which can contain multiple words):

Document(
  (pages): [Page(
    dimensions=(340, 600)
    (blocks): [Block(
      (lines): [Line(
        (words): [
          Word(value='No.', confidence=0.91),
          Word(value='RECEIPT', confidence=0.99),
          Word(value='DATE', confidence=0.96),
        ]
      )]
      (artefacts): []
    )]
  )]
)

Collaborator:

Actually, blocks are not resolved, so this sub-object in the Page might be obsolete. We could either rename it to Layer to have a more meaningful page structure and keep it like that, or add a Layer object between Page and Block in the hierarchy, so that a class would be on a layer and inside each layer we can resolve blocks and lines.

img, target = self.sample_transforms(img, target)
if isinstance(target, dict):
img_transformed = copy_tensor(img)
for class_name, bboxes in target.items():
Contributor:

An OCRDataset also comes with a dict as its target and has no class names (string labels and bbox annotations), so if we transform samples in that case, this should normally fail.

Contributor Author:

Yes, that's true 😨 I'm trying to think how to handle it; open to any suggestions x)

Contributor:

As described in the other comments, if we handle it like object detection (classes come from model.cfg, which is flexible and can be updated with classes from the dataset), we could remove this.
Maybe think again about this dataset format:

polygons: [....]
classes: ['printed', 'handwritten', 'printed', 'printed', ...]

@frgfm @odulcy-mindee @charlesmindee wdyt?

Collaborator:

I think we should split the topics of this PR and discuss them in several issues/discussions to find proper implementations. There are several breaking-change modifications in the PR over quite a lot of files; it's quite hard to ensure the review quality will be both high and exhaustive 😅

Contributor Author:

I added another check, in addition to the dict check, to ensure it's the target coming from detection.

@@ -57,7 +57,30 @@ labels.json
...
}
```
If you want to train a model with multiple classes, you can use the following format
Contributor:

This should be a bit clearer: does multiclass mean box a: ['printed', 'handwritten'], or box a: ['printed'] and box b: ['handwritten']?

Contributor:

OK, I think now I got it 😅

For example, class_name_1 == 'printed' contains all boxes for printed, and class_name_2 == 'handwritten' contains all boxes for handwritten.

Maybe a better format:

polygons: [....]
classes: ['printed', 'handwritten', 'printed', 'printed', ...]

So we can avoid introducing a new dataset format; then it should also be possible to avoid the breaking change and only add a new key-value pair like 'class': 'printed'. @aminemindee @odulcy-mindee wdyt?
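To make the two candidate label layouts concrete, here is a hedged sketch; the file name, class names, and coordinates are illustrative, not the format docTR finally adopted:

```python
import json

# Layout A (class-keyed, as proposed in the PR): one polygon list per class name.
class_keyed = {
    "img_1.jpg": {
        "printed": [[[0.1, 0.1], [0.4, 0.1], [0.4, 0.2], [0.1, 0.2]]],
        "handwritten": [[[0.5, 0.5], [0.8, 0.5], [0.8, 0.6], [0.5, 0.6]]],
    }
}

# Layout B (flat, as suggested in review): parallel polygon and class lists.
flat = {
    "img_1.jpg": {
        "polygons": [
            [[0.1, 0.1], [0.4, 0.1], [0.4, 0.2], [0.1, 0.2]],
            [[0.5, 0.5], [0.8, 0.5], [0.8, 0.6], [0.5, 0.6]],
        ],
        "classes": ["printed", "handwritten"],
    }
}

def to_flat(per_class):
    """Flatten the class-keyed layout (A) into the parallel-lists layout (B)."""
    polygons, classes = [], []
    for name, polys in per_class.items():
        polygons.extend(polys)
        classes.extend([name] * len(polys))
    return {"polygons": polygons, "classes": classes}

print(json.dumps(to_flat(class_keyed["img_1.jpg"])["classes"]))
```

Layout B keeps the existing single-dict-per-image shape, which is why the reviewers suggest it could avoid a new dataset format entirely.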

@aminemindee (Contributor Author)

> @odulcy-mindee @aminemindee you need also to increase the major version specifier in root pyproject.toml (breaking change) and sort the milestones/open tickets after merge/release :) After scrolling trough the changes i saw also that the xml export changes will break the PDF/A notebook fyi :) I will try to review this tomorrow careful but for my understanding what is the general use case for multiclass text detection ? printed/handwritten ?

Hello @felixdittrich92, the general use case of multiclass text detection is when you want to find not just the words in a document but some specific information. For example, in a receipt from a restaurant you may want to find the name of the restaurant and the total you paid; then you can have two classes and predict different bounding boxes for each class.

@felixdittrich92 (Contributor)

> @odulcy-mindee @aminemindee you need also to increase the major version specifier in root pyproject.toml (breaking change) and sort the milestones/open tickets after merge/release :) After scrolling trough the changes i saw also that the xml export changes will break the PDF/A notebook fyi :) I will try to review this tomorrow careful but for my understanding what is the general use case for multiclass text detection ? printed/handwritten ?
>
> Hello @felixdittrich92 the general use case of multiclass text detection is when you want to find not just the words in a document but some specific infomations. An example, in a receipt from a restaurant you want to find the name of the restaurant and the total you paid. then you can have two classes and predict different bounding boxes for each class.

OK, got it. So in the end it's like the LayoutLM model (without the text input).

@codecov

codecov bot commented Oct 14, 2022

Codecov Report

Merging #1097 (73d10a0) into main (acb9f64) will decrease coverage by 0.25%.
The diff coverage is 91.82%.

@@            Coverage Diff             @@
##             main    #1097      +/-   ##
==========================================
- Coverage   95.23%   94.97%   -0.26%     
==========================================
  Files         142      146       +4     
  Lines        6017     6389     +372     
==========================================
+ Hits         5730     6068     +338     
- Misses        287      321      +34     
Flag Coverage Δ
unittests 94.97% <91.82%> (-0.26%) ⬇️


Impacted Files Coverage Δ
doctr/file_utils.py 72.22% <37.50%> (-6.51%) ⬇️
doctr/datasets/datasets/base.py 91.22% <50.00%> (-6.78%) ⬇️
doctr/datasets/detection.py 81.08% <71.42%> (-14.58%) ⬇️
doctr/utils/visualization.py 88.16% <85.18%> (-1.49%) ⬇️
doctr/models/detection/linknet/base.py 88.88% <88.63%> (+0.19%) ⬆️
doctr/models/_utils.py 96.96% <88.88%> (-1.34%) ⬇️
...dels/detection/differentiable_binarization/base.py 93.92% <90.56%> (+0.39%) ⬆️
...s/detection/differentiable_binarization/pytorch.py 97.77% <90.90%> (-0.68%) ⬇️
...etection/differentiable_binarization/tensorflow.py 95.16% <92.30%> (-0.61%) ⬇️
doctr/models/detection/linknet/pytorch.py 98.18% <92.30%> (-0.86%) ⬇️
... and 39 more


@frgfm (Collaborator) left a comment:

Hello everyone 👋

Thanks a lot for the PR! It's only my opinion, but I think we have too many things at stake in this PR to be able to say whether it's OK to merge or not. I would suggest pausing this one shortly and opening GH issues for the following topics:

  • dataset target format modification for multiple class of text/objects
  • versioning of docTR (when is 1.0.0 within reach, etc.)
  • minimal API modifications
  • export format modifications
  • detection model modifications
  • training/evaluation

I might have forgotten a few

Once a topic has reached a good implementation idea / way to handle it, we can cherry pick from this PR or reimplement it :)

I already added comments for the section I checked, but then thought there is too much to handle in a single PR 😅 If you prefer to go with that one in a single PR, no worries!

Cheers ✌️

version = "0.5.2a0"
version = "1.0.1a0"
Collaborator:

⚠️ This is the version of the package, not of poetry
I'd suggest reverting back

Collaborator:

Actually, I realize you want to bump docTR to 1.0.0. Perhaps it might be easier to discuss this in an issue and lower the amount of modifications in this specific PR?

python = ">=3.8,<3.11"
python = ">=3.8.2,<3.11"
Collaborator:

Could you add a comment to specify the reason for the minor version bump? (Much easier to reassess later when we have a comment saying why.)

Contributor Author:

Yes good idea ! added it.

python-doctr = ">=0.2.0"
python-doctr = { version = ">=1.0.0", extras = ['tf'] }
Collaborator:

Correct me if I'm wrong, but I think this will result in double version specifiers. If we go with extras, we need to remove the "tensorflow" & co. specifiers.

Contributor Author:

I don't think I understand your comment; can you be clearer?
We added the extras because now, if you don't install docTR with the "tf" extra, you don't have the tf2onnx library and similar packages that are needed to run the code.

Collaborator:

Yes sure, I wasn't referring to the extra but to the version specifier:
you switched it from >=0.2.0 to >=1.0.0, which is a big deal 😅

Collaborator:

Regarding the extra, I'd argue it's up for discussion but the API doesn't need to convert models to ONNX

Ideally, we'd like it to run on ONNX, but some postprocessing ops are not compatible yet. So the alternative is to use doctr, and then carefully add as little deps as possible to run it

Contributor Author:

I agree, but for now, given how docTR is built, tf2onnx is needed to run the code in the API.

Comment on lines 28 to 33
elif isinstance(target, tuple):
assert isinstance(target[0], str) or isinstance(
target[0], np.ndarray
), "first element of the tuple should be a string or a numpy array"
assert isinstance(target[1], list), "second element of the tuple should be a list"
Collaborator:

The problem with this type of condition is that if the target is a tuple with a single element, the code will crash. Either we add assert len(target) == 2 or change this a bit.

Contributor Author:

added it.
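A hedged sketch of the stricter validation being discussed; the function name and assertion messages are illustrative, and a plain list stands in for the numpy array that docTR actually accepts:

```python
def validate_tuple_target(target):
    """Illustrative validation for a two-element detection target tuple."""
    assert isinstance(target, tuple), "target should be a tuple"
    # The length check discussed above: a 1-tuple must fail loudly,
    # not crash on unpacking further down.
    assert len(target) == 2, "target tuple should have exactly two elements"
    first, second = target
    assert isinstance(first, (str, list)), "first element should be a string or array-like"
    assert isinstance(second, list), "second element of the tuple should be a list"

validate_tuple_target(("printed", [[0.1, 0.1, 0.4, 0.2]]))  # passes
```

Without the length assertion, a malformed 1-tuple would raise a confusing IndexError or ValueError instead of a clear message.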

@charlesmindee (Collaborator) left a comment:

A few remarks, but minor


@@ -161,3 +161,18 @@ def get_language(text: str) -> Tuple[str, float]:
if len(text) <= 1 or (len(text) <= 5 and lang.prob <= 0.2):
return "unknown", 0.0
return lang.lang, lang.prob


def invert_between_dict_of_lists_and_list_of_dicts(
Collaborator:

invert_data_structure with a clear docstring that explains the typing, maybe? So that we keep a short and clear name?

Contributor:

Sounds good :)

Contributor Author:

Done
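A hedged sketch of what such a helper could look like; the name invert_data_structure matches the suggestion above, but the body and docstring here are illustrative, not docTR's actual implementation:

```python
from typing import Any, Dict, List, Union

def invert_data_structure(
    x: Union[Dict[str, List[Any]], List[Dict[str, Any]]]
) -> Union[List[Dict[str, Any]], Dict[str, List[Any]]]:
    """Convert a dict of lists into a list of dicts, or the reverse.

    Illustrative only: assumes all lists share the same length and
    all dicts share the same keys.
    """
    if isinstance(x, dict):
        keys = list(x.keys())
        length = len(next(iter(x.values())))
        return [{k: x[k][i] for k in keys} for i in range(length)]
    return {k: [d[k] for d in x] for k in x[0]}

print(invert_data_structure({"boxes": [1, 2], "classes": ["a", "b"]}))
```

The short name plus a typing-focused docstring is exactly the trade-off proposed in the review: the signature explains what "invert" means.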


@aminemindee (Contributor Author)

> Hello everyone 👋
>
> Thanks a lot of the PR! It's only my opinion but I think we have too many things at stake in the PR to be able to say whether it's OK to merge or not. I would suggest pausing this one shortly and opening GH issues for the following topics:
>
> * dataset target format modification for multiple class of text/objects
> * versioning of docTR (when is 1.0.0 within reach, etc.)
> * minimal API modifications
> * export format modifications
> * detection model modifications
> * training/evaluation
>
> I might have forgotten a few
>
> Once a topic has reached a good implementation idea / way to handle it, we can cherry pick from this PR or reimplement it :)
>
> I already added comments for the section I checked, but then thought there is too much to handle in a single PR 😅 If you prefer to go with that one in a single PR, no worries!
>
> Cheers ✌️

Hello @frgfm, I understand that this PR is kind of big, with a lot of changes. The goal is to have it available as soon as possible. I know that the review won't be exhaustive, but we can see this 1.0.0 release as experimental, and once it's there we can improve things. After this PR, I will work on improving the training part (changing losses maybe, and also adding some augmentations and things of the sort).

@frgfm (Collaborator)

frgfm commented Oct 18, 2022

Hi @aminemindee :)

i know that the review won't be exhaustive but we can see this 1.0.0 release as experimental and once it's there we can improve things.

I could agree with your suggestion if it were 0.7.0, but 1.0.0 in software dev, at least to the best of my experience and knowledge, is a mark of API stability (in the broad sense) and a performance milestone. So jumping from 0.6.0 to 1.0.0 looks a bit hasty to me (even the 0.7.0 release tracker is full of things that would be worth considering before releasing a 1.0.0).

To put this into context: the day the community saw PyTorch jump from 0.4.0 to 1.0.0, it was because they added the C++ API and the JIT compiler (which were their targets for the first 1.0.0 release). We knew that it could now be considered for running in production. Here, at least, I think the release roadmap should follow a structure readable by the community (cf. release trackers) 😅

After this PR, i will work on improving the training part ( changing losses maybe and also adding some augmentations and things of the sort)

Sounds good 👍

@felixdittrich92 (Contributor)

felixdittrich92 commented Oct 19, 2022

About the versioning, I agree with @frgfm: 1.0.0 should be the first stable (production-ready) version specifier.
On the other hand, with the breaking change, we need to make it clear for end users.
I would still suggest rethinking some parts so that we can implement this without modifying the Page structure and avoid the breaking change (for example, as an extra class attribute for each Word), cf.

class Word(Element):

@aminemindee (Contributor Author)

> #1097 (comment)
>
> About the versioning, i agree with @frgfm 1.0.0 should be the first stable (production ready) version specifier. On the other hand with the breaking change we need to make it clear for end users. I would still suggest to rethink some parts so that we can implement it without modifying the Page structure and avoid the breaking change (for example as extra attribute class for each Word) cf.
>
> class Word(Element):

I don't think adding a class attribute to Word would be ideal, because a user who wants all the blocks of words from a certain class would have to iterate over every block, line, and word to get them.

I think the main problem here is that once the detection model is not trying to detect all the words on a page, the fact that Page is a list of Block is no longer logical, because the blocks don't constitute the page; they just give the information we found on the page.

So I agree with @charlesmindee: we can add a new Layer object, but not between Page and Blocks; rather at the same level as Blocks, with the same information just as a dictionary. WDYT @fg-mindee @charlesmindee @felixdittrich92 ?

@felixdittrich92 (Contributor)

felixdittrich92 commented Oct 20, 2022

> #1097 (comment)
>
> About the versioning, i agree with @frgfm 1.0.0 should be the first stable (production ready) version specifier. On the other hand with the breaking change we need to make it clear for end users. I would still suggest to rethink some parts so that we can implement it without modifying the Page structure and avoid the breaking change (for example as extra attribute class for each Word) cf.
>
> class Word(Element):
>
> I don't think adding an attribute class to Word would be ideal, because for a user who wants all the blocks of words from a certain class will have to iterate over every block line and words to get them.
>
> I think the main problem here is that, once the detection model is not trying to detect all words in a page, the fact that Page is a list of Block is not logical anymore because it doesn't constitute the page, but just give the information we found on the page.
>
> So i agree with @charlemindee, we can add a new object Layer but instead of between Page and Blocks but at the same level of Blocks with the same information just as a dictionary. WDYT @fg-mindee @charlesmindee @felixdittrich92 ?

Mh, we should keep in mind that this is a feature task... the core is still OCR (where a user wants to get all the extracted text at once), so I think it would be totally OK if the user needs to iterate through the results. But yeah, that's my opinion 😅 let's wait on @frgfm @charlesmindee 👍

@aminemindee (Contributor Author)

Hello again! After discussing it internally, I have a new suggestion to add multiclass detection without a big breaking change.

I added a new kie_predictor (like ocr_predictor) that handles the different classes and has a dict-like output, while keeping ocr_predictor as it was.
Also, we're not going major anymore but just to 0.7.0, and what was already tracked for 0.7.0 will probably move to 0.8.0, I guess.

So, a summary of the changes:

  • Addition of kie_predictor and some io classes to use it (KIEDocument, KIEPage, ...)
  • ocr_predictor has the same output as before; no breaking change here
  • Detection model training works as before, takes the new data format as input, and outputs a dict
  • Tests were fixed, and tests for kie_predictor were added
  • A kie route was added to the API

WDYT @felixdittrich92 @odulcy-mindee @frgfm @charlesmindee ?
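A hedged sketch of the difference between the two predictors' outputs described above, using placeholder dictionaries; the real docTR objects are Document/KIEDocument, and the class names and values here are illustrative:

```python
# Illustrative shapes only, not docTR's real result objects.
# ocr_predictor: page -> blocks -> lines -> words (flat, class-agnostic)
ocr_page = {"blocks": [{"lines": [{"words": ["No.", "RECEIPT", "DATE"]}]}]}

# kie_predictor: page -> predictions keyed by detection class
kie_page = {
    "predictions": {
        "restaurant_name": ["ABC Gmbh."],  # hypothetical class names
        "total": ["1200"],
    }
}

# With the dict layout, fetching everything for one class is a single lookup:
print(kie_page["predictions"]["total"])
```

This is why the kie_predictor split avoids the breaking change: ocr_predictor users keep the flat hierarchy, while KIE users opt in to the class-keyed dict.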

@felixdittrich92 (Contributor) left a comment:

Hi @aminemindee, thanks for the changes. I added a few comments, but in general there are still some questions/remarks, and the PR is still much too big to review carefully 😅

api/app/routes/kie.py (outdated, resolved)
doctr/models/detection/linknet/base.py (outdated, resolved)
doctr/models/kie_predictor/base.py (outdated, resolved)

doctr/models/kie_predictor/pytorch.py (outdated, resolved)
aminemindee and others added 24 commits December 5, 2022 17:10
…lements for kie predictor (#6)

* feat: ✨ add load backbone

* feat: change kie predictor out

* fix new elements for kie, dataset when class is empty and fix and add tests

* fix api kie route

* fix evaluate kie script

* fix black

* remove commented code

* update README
@aminemindee aminemindee merged commit e66ce01 into mindee:main Dec 19, 2022
@felixdittrich92 felixdittrich92 added this to the 0.7.0 milestone Jul 24, 2023