DjVu-binarization does not invert parts of the page that have white text on a colored background #21

Open
rmast opened this issue Nov 17, 2021 · 7 comments

rmast commented Nov 17, 2021

If you print and scan this document:
https://www.kvk.nl/download/Formulier-14-wijziging-ondernemings-en-vestigingsgegevens_tcm109-365607.pdf
the resulting Djbz produced by didjvu with its default djvu binarizer contains inverted tiles of text, whereas DjVuSolo 3.1 inverts those regions, so that only the smaller elements within a large surface are kept and the colored frame takes the background color.

I have already read a (probably expired) patent that mentions this, so it deserves attention. didjvu uses Gamera's djvu_threshold (include/plugins/threshold.hpp) for this binarization, so this issue should probably be forwarded to Gamera.
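To make the expected behaviour concrete, here is a minimal sketch of a per-tile polarity fix. It is not what DjVuSolo, Gamera or the patent does; it only illustrates the idea that a tile which is mostly "ink" is probably light text on a dark fill and should be flipped. The function name `normalize_tile_polarity`, the tile size and the majority rule are all assumptions made for illustration.

```python
def normalize_tile_polarity(mask, tile=4):
    """Flip any tile whose 'ink' pixels outnumber its 'paper' pixels.

    mask: list of rows, each a list of 0/1 where 1 means foreground (ink).
    Returns a new mask; the input is left untouched.
    """
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            ink = sum(mask[y][x] for y in rows for x in cols)
            area = len(rows) * len(cols)
            if 2 * ink > area:  # mostly ink: assume light text on a dark fill
                for y in rows:
                    for x in cols:
                        out[y][x] = 1 - mask[y][x]
    return out


# A 4x8 mask: the left half is a dark fill with two light "text" pixels,
# the right half is ordinary dark-on-light content.
mask = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]
fixed = normalize_tile_polarity(mask, tile=4)
```

After the call, only the small text pixels remain as foreground in both halves. A real implementation would need to work on connected regions rather than a fixed grid, since a tile boundary can cut through a colored frame.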

rmast commented Nov 20, 2021

@jsbien Patent https://patents.google.com/patent/US6901169 seems to deal with the choice between foreground and background. I've spent some time trying to understand exactly what they do, but I still don't.

I just looked at the status of the patent: it is still active, so I guess there is no use in implementing or studying it.

jsbien commented Nov 20, 2021

@rmast "Status Active, 2023-11-30 Adjusted expiration". Does it expire in two years or, on the contrary, is that the date when the active status can be prolonged? On the other hand, all USA software patents are valid only in the USA, so if you want to use the software elsewhere, then they do not matter. Am I correct? BTW, I found my notes, but they require some checking and editing before I make them public.

rmast commented Nov 20, 2021

I can't imagine that American software patents wouldn't be valid in Europe or even in Korea. There are patent disputes between Apple and the Korean company Samsung, for example. Otherwise someone could just use an offshore company to get around the patent.

I can imagine those rules are subject to international trade agreements.

However, a European or US patent lasts 20 years from the 'filing' date:
https://www.bardehle.com/europeansoftwarepatents/faq/how-long-does-a-software-patent-last/
https://www.stopfakes.gov/article?id=How-Long-Does-Patent-Trademark-or-Copyright-Protection-Last

So you're right, this patent will probably expire soon.
However, I see the word 'filed' with the date 24 January 2002, so 30 November 2023 is more than 20 years after that apparent filing date.
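The date arithmetic can be checked with a few lines of Python. This only uses the two dates quoted in this thread and the 20-year rule from the links above; it says nothing about why the listed expiration is later.

```python
from datetime import date

filed = date(2002, 1, 24)      # 'filed' date quoted in this thread
adjusted = date(2023, 11, 30)  # 'Adjusted expiration' quoted in this thread

# 20 years from the filing date, per the links above
nominal_expiry = filed.replace(year=filed.year + 20)

# How far the listed expiration lies beyond the nominal 20-year term
extra_days = (adjusted - nominal_expiry).days

print(nominal_expiry.isoformat(), extra_days)  # prints "2022-01-24 675"
```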

Your documentation states that this patented method performs excellently at foreground/background separation.

DjVuSolo3.1's inversion is far from perfect, so I doubt DjVuSolo 3.1 already contains this patented algorithm.

So would you suggest starting to code now, to have something production-ready in two years?

jsbien commented Nov 21, 2021

In principle, patents are valid only in the countries where they were explicitly granted, but you are right that some trade agreements can affect this. I am quite sure they are not valid in countries that do not recognize software patents. A useful list is available at https://en.wikipedia.org/wiki/Software_patent. You are right that the European Union has a kind of software patent; I remembered the heated discussions but forgot that the idea was finally accepted. Besides the USA, the patent in question is active in Europe: https://patents.google.com/patent/EP1229495B1/en (anticipated expiration 2022-01-31), in Canada: https://patents.google.com/patent/CA2369841C/en (anticipated expiration 2022-01-31), and perhaps in South Korea: https://patents.google.com/patent/KR100873583B1/en (no explicit information on status or expiration date). It is not relevant elsewhere unless some trade agreement says otherwise.

This is a very good example of patent-created FUD (https://en.wikipedia.org/wiki/Fear,_uncertainty,_and_doubt).

To be on the safe side, you can contact the current assignee (AT&T Corp, AT&T Intellectual Property II LP) and/or ask the SFLC (https://en.wikipedia.org/wiki/Software_Freedom_Law_Center) or a similar organization for help.

rmast commented Nov 21, 2021

If a European patent exists as well, you're right that the American patent probably doesn't cover Europe; otherwise they wouldn't have gone to that extra effort.
And since the European patent expires in two months anyway, we wouldn't even be able to produce anything that could realistically infringe before then, working in our spare time.

rmast commented Nov 21, 2021

The text for the European patent seems to differ from the American patent, so it might clarify some things.
They talk about getting things done in very few passes, and about choosing foreground/background even before choosing which parts will contribute to estimating the background color. So the binarization, the foreground/background estimation and the color histogram determination are all done simultaneously. I guess we will have to focus on Gamera's djvu_threshold (include/plugins/threshold.hpp) to put it all in.

rmast commented Nov 21, 2021

After reading a bit about the history of DjVuSolo 3.1, I now believe it does contain the patented algorithm. Since it behaves poorly on folded and inkjet-printed content, it probably needs a different or additional strategy to make that content readable.
[image]
Simply thresholding the inverted text at 160 gives a more readable result:
[image: Knipsel]
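For reference, a fixed threshold like this can be sketched in a few lines. This assumes 8-bit gray levels (0 = black, 255 = white) and reads "at 160" as a cut on the gray level, with pixels at or above the cut counted as text because the text is white on a darker fill. The function name `threshold_light_text` and the sample values are made up for illustration and are not taken from the scan.

```python
def threshold_light_text(gray, cut=160):
    """Return a 0/1 mask where 1 marks pixels at or above `cut`.

    gray: rows of 8-bit gray levels (0 = black, 255 = white).
    Assumes light text on a darker fill, so bright pixels are the text.
    """
    return [[1 if v >= cut else 0 for v in row] for row in gray]


# Made-up sample: a dark fill (around 90-100) with a few bright text pixels
# and one blurred edge pixel (150) that falls below the cut.
region = [
    [90, 95, 100, 92],
    [88, 230, 240, 97],
    [91, 150, 235, 99],
]
mask = threshold_light_text(region)
```

A single fixed cut only works when the fill and the text are well separated in brightness, which is why it would have to be applied per region rather than to the whole page.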
