pdfcomp: problems with inverted text that is often better in hocr. #55

Open
rmast opened this issue Jun 25, 2022 · 10 comments

@rmast

rmast commented Jun 25, 2022

This form https://www.kvk.nl/download/Formulier-14-wijziging-ondernemings-en-vestigingsgegevens_tcm109-365607.pdf

First page saved to jpeg via this site: https://smallpdf.com

0001

The left column of the resulting image is quite readable at a suitable screen resolution.

ocrmypdf --pdfa-image-compression lossless -O0  0001.jpg formulierhocrjpg.pdf
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
Successfully converted to PDF, processing...
Scanning contents: 100%|████████████████████████| 1/1 [00:00<00:00, 73.93page/s]
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:09<00:00,  9.92s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 1/1 [00:00<00:00,  2.46page/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

pdfcomp formulierhocrjpg.pdf formulierhocrjpgkleiner.pdf
Compression factor: 9.617848822158944

formulierhocrjpgkleiner.pdf

The text in the left column is unreadable. The hOCR contains "Toelichting 1.1", but in the compressed PDF it is completely unreadable.

My patch for the inversion ratio makes it more readable:

formulierhocrjpgkleinerpatch.pdf

However, if you look at the mask image, it doesn't contain the text in the left column at all.

So my patch isn't the only change that routine needs.

@rmast
Author

rmast commented Jun 25, 2022

If I invert the complete image via https://pinetools.com/invert-image-colors and repeat the steps, all text seems to be recognized correctly by Tesseract and is sharp in the resulting PDF, even though the page contains both inverted and non-inverted text:
imagenpgNLkleiner.pdf
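
Inverting the whole page works for this form, but a mixed page could also be handled per region. The following is only a minimal sketch of that idea, not pdfcomp's actual routine: the function names, the grayscale threshold of 128 and the 0.5 dark-pixel ratio are all assumptions made for illustration, and a region is represented as plain rows of 0-255 integers.

```python
def dark_fraction(region, threshold=128):
    """Fraction of pixels darker than `threshold` in a grayscale region.

    `region` is a list of rows, each a list of 0-255 integers.
    """
    pixels = [p for row in region for p in row]
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if p < threshold) / len(pixels)


def invert(region):
    """Return a copy of the region with every grayscale value flipped."""
    return [[255 - p for p in row] for row in region]


def normalise_polarity(region, max_dark=0.5):
    """Invert a region that is mostly dark, so text ends up dark on light.

    `max_dark` is a hypothetical cut-off, not a value taken from pdfcomp.
    """
    if dark_fraction(region) > max_dark:
        return invert(region)
    return region
```

A region that is mostly background-dark (light text on a dark box) gets flipped, while ordinary dark-on-light text is returned unchanged.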

@rmast
Author

rmast commented Jun 26, 2022

I found a workaround to get the OCR correct:

Create a file tess.cfg containing

tessedit_do_invert      True

And call

ocrmypdf -l nld 175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg --tesseract-config tess.cfg ocrkwaliteit.pdf
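
For repeated runs, the two steps above could be scripted. This is a rough sketch only: the helper names are made up for illustration, and the command is merely assembled from the flags already shown in this comment (`-l`, `--tesseract-config`), not executed.

```python
from pathlib import Path


def write_tess_config(directory):
    """Write a Tesseract config file that enables tessedit_do_invert."""
    cfg = Path(directory) / "tess.cfg"
    # Tesseract config files hold one "name value" pair per line.
    cfg.write_text("tessedit_do_invert True\n", encoding="utf-8")
    return cfg


def build_ocrmypdf_command(image, output, cfg, language="nld"):
    """Assemble the ocrmypdf call as an argument list (hypothetical helper)."""
    return [
        "ocrmypdf",
        "-l", language,
        str(image),
        "--tesseract-config", str(cfg),
        str(output),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once ocrmypdf is installed.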

The OCR text now looks fine, but pdfcomp crashes on this result:

pdf-metadata-json /home/rmast/ocrkwaliteit.pdf
Traceback (most recent call last):
  File "/home/rmast/archive-pdf-tools/venv/bin/pdf-metadata-json", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.16', 'pdf-metadata-json')
  File "/home/rmast/archive-pdf-tools/venv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 656, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/rmast/archive-pdf-tools/venv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(code, namespace, namespace)
  File "/home/rmast/archive-pdf-tools/venv/lib/python3.8/site-packages/archive_pdf_tools-1.4.16-py3.8-linux-x86_64.egg/EGG-INFO/scripts/pdf-metadata-json", line 275, in <module>
    r = analyse(sys.argv[1])
  File "/home/rmast/archive-pdf-tools/venv/lib/python3.8/site-packages/archive_pdf_tools-1.4.16-py3.8-linux-x86_64.egg/EGG-INFO/scripts/pdf-metadata-json", line 187, in analyse
    raise ValueError('Invalid type for Filter: %s' % typ)
ValueError: Invalid type for Filter: array

ocrkwaliteit.pdf
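
For context: in PDF, the /Filter entry of a stream may be either a single name or an array of names, so the "array" case in the error is legal. A minimal sketch of normalising both forms follows. It assumes the filter value has already been converted to plain Python values (a string or a list of strings); the real script works on PDF library objects, so this is not a drop-in patch, and `normalise_filters` is a hypothetical name.

```python
def normalise_filters(value):
    """Return the /Filter entry as a list of filter names.

    Accepts None (no filter), a single name, or a list/tuple of names.
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    if isinstance(value, (list, tuple)):
        for item in value:
            if not isinstance(item, str):
                raise ValueError('Invalid entry in Filter array: %r' % (item,))
        return list(value)
    raise ValueError('Invalid type for Filter: %s' % type(value).__name__)
```

With this shape, code that reports on filters can always iterate over a list instead of special-casing the single-name form.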

@rmast
Author

rmast commented Jun 26, 2022

The new parameter Stefan Weil suggests gives the same error.

@MerlijnWajer
Collaborator

For the record, pdf-metadata-json is still under pretty heavy development, so while I encourage using it and sending patches (thanks, will review later today), it might still change a lot without much warning. It should not matter too much for the compression use case, though.

@rmast
Author

rmast commented Jun 26, 2022

When I look at the hOCR extracted from this "array"-containing PDF, it contains the "wis-clear" part at the top right of the image twice, unfortunately both times with confidence 100. I guess ocrmypdf should already pick the best one:

  <div class="ocr_carea" title="bbox 2163 366 2399 395" id="block_000000_000053">
    <p class="ocr_par" id="par_000000_000053" title="bbox 2163 366 2399 395">
      <span class="ocr_line" title="bbox 2163 366 2341 395; baseline 0.000000 0" id="line_000000_000086">
        <span class="ocrx_word" id="word_000000_000307" title="bbox 2163 366 2169 395; x_wconf 100; x_fsize 4">
          <span class="ocrx_cinfo" title="x_bboxes 2163 366 2169 395; x_confs 100.0">&gt;</span>
        </span>
        <span class="ocrx_word" id="word_000000_000308" title="bbox 2193 366 2326 395; x_wconf 100; x_fsize 7">
          <span class="ocrx_cinfo" title="x_bboxes 2193 366 2208 395; x_confs 100.0">w</span>
          <span class="ocrx_cinfo" title="x_bboxes 2208 366 2223 395; x_confs 100.0">i</span>
          <span class="ocrx_cinfo" title="x_bboxes 2223 366 2238 395; x_confs 100.0">s</span>
          <span class="ocrx_cinfo" title="x_bboxes 2238 366 2252 395; x_confs 100.0">-</span>
          <span class="ocrx_cinfo" title="x_bboxes 2252 366 2267 395; x_confs 100.0">c</span>
          <span class="ocrx_cinfo" title="x_bboxes 2267 366 2282 395; x_confs 100.0">l</span>
          <span class="ocrx_cinfo" title="x_bboxes 2282 366 2296 395; x_confs 100.0">e</span>
          <span class="ocrx_cinfo" title="x_bboxes 2296 366 2311 395; x_confs 100.0">a</span>
          <span class="ocrx_cinfo" title="x_bboxes 2311 366 2326 395; x_confs 100.0">r</span>
        </span>
      </span>
      <span class="ocr_line" title="bbox 2373 366 2399 395; baseline -0.000000 0" id="line_000000_000087">
        <span class="ocrx_word" id="word_000000_000309" title="bbox 2373 366 2386 395; x_wconf 100; x_fsize 6">
          <span class="ocrx_cinfo" title="x_bboxes 2373 366 2386 395; x_confs 100.0">|</span>
        </span>
      </span>
      <span class="ocr_line" title="bbox 2193 366 2334 395; baseline 0.000000 0" id="line_000000_000088">
        <span class="ocrx_word" id="word_000000_000310" title="bbox 2193 366 2226 395; x_wconf 100; x_fsize 6">
          <span class="ocrx_cinfo" title="x_bboxes 2193 366 2204 395; x_confs 100.0">w</span>
          <span class="ocrx_cinfo" title="x_bboxes 2204 366 2215 395; x_confs 100.0">i</span>
          <span class="ocrx_cinfo" title="x_bboxes 2215 366 2226 395; x_confs 100.0">s</span>
        </span>
        <span class="ocrx_word" id="word_000000_000311" title="bbox 2249 366 2252 395; x_wconf 100; x_fsize 3">
          <span class="ocrx_cinfo" title="x_bboxes 2249 366 2252 395; x_confs 100.0">-</span>
        </span>
        <span class="ocrx_word" id="word_000000_000312" title="bbox 2268 366 2323 395; x_wconf 100; x_fsize 6">
          <span class="ocrx_cinfo" title="x_bboxes 2268 366 2279 395; x_confs 100.0">c</span>
          <span class="ocrx_cinfo" title="x_bboxes 2279 366 2290 395; x_confs 100.0">l</span>
          <span class="ocrx_cinfo" title="x_bboxes 2290 366 2301 395; x_confs 100.0">e</span>
          <span class="ocrx_cinfo" title="x_bboxes 2301 366 2312 395; x_confs 100.0">a</span>
          <span class="ocrx_cinfo" title="x_bboxes 2312 366 2323 395; x_confs 100.0">r</span>
        </span>
      </span>
    </p>
  </div>
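
Duplicates like these could be flagged automatically by looking for word boxes that largely overlap. The sketch below is one possible way to do that with only the standard library; the class and function names and the 0.8 overlap threshold are assumptions for illustration, and it is not part of ocrmypdf or archive-pdf-tools.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (text, bbox) for every ocrx_word span in an hOCR fragment."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished (text, (x0, y0, x1, y1)) pairs
        self._stack = []      # class attribute of each open <span>
        self._current = None  # (text_parts, bbox) of the open ocrx_word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        self._stack.append(cls)
        if cls == "ocrx_word":
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._current = ([], tuple(int(g) for g in match.groups()))

    def handle_data(self, data):
        if self._current is not None:
            self._current[0].append(data.strip())

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        cls = self._stack.pop()
        if cls == "ocrx_word" and self._current is not None:
            parts, bbox = self._current
            self.words.append(("".join(parts), bbox))
            self._current = None


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return width * height / smaller if smaller else 0.0


def overlapping_words(hocr, min_ratio=0.8):
    """Return pairs of words whose boxes overlap by at least `min_ratio`."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    words = parser.words
    return [
        (words[i], words[j])
        for i in range(len(words))
        for j in range(i + 1, len(words))
        if overlap_ratio(words[i][1], words[j][1]) >= min_ratio
    ]
```

Dividing by the smaller box's area means a short word that lies entirely inside a longer one counts as a full overlap, which is the situation in the fragment above.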

@MerlijnWajer
Collaborator

There are a few things at play, and any of them could be at fault:

  • Tesseract/OCR engine
  • The PDF creator (I think OCRmyPDF now uses Tesseract's PDF rendering for the text layer)
  • The PDF rendering and extracting library - PyMuPDF in this case
  • My hOCR generating code, based on PyMuPDF text extraction

@rmast
Author

rmast commented Jun 26, 2022

You can already see that these are separately recognized words: for example, the third bbox coordinate of the first 'w' differs from that of the second. But Stefan says this is not by design, so I guess he'll change it in Tesseract. The old functionality with tessedit_do_invert=True already gave an "array" instead of a "name" in pdf-metadata-json; that might be a sign of multiple values and could be related.

@rmast
Author

rmast commented Jul 23, 2022

I couldn't get print/wis-clear read correctly by plain Tesseract automatically. Looking around for a solution, I stumbled upon EasyOCR, which has a reputation for performing better than Tesseract at automatic segmentation and recognition.

EasyOCR doesn't have hOCR output, but it produces something similar when you just follow the main readme and print(result) for the languages nl and en:

[([[107, 181], [500, 181], [500, 306], [107, 306]], 'KVK', 0.9998589158058167), ([[546, 212], [659, 212], [659, 303], [546, 303]], '14', 0.998583705333999), ([[697, 209], [1079, 209], [1079, 333], [697, 333]], 'Wijziging', 0.9999568060343197), ([[2187, 323], [2264, 323], [2264, 359], [2187, 359]], 'print', 0.9999612420764422), ([[546, 359], [1337, 359], [1337, 424], [546, 424]], 'Ondernemings- en vestigingsgegevens', 0.974492532298182), ([[2188, 368], [2244, 368], [2244, 399], [2188, 399]], 'wis', 0.9999067420092292), ([[2262, 368], [2340, 368], [2340, 399], [2262, 399]], 'clear', 0.9999880581002243), ([[545, 600], [866, 600], [866, 636], [545, 636]], 'Waarom dit formulier?', 0.9340071795845962), ([[992, 600], [1373, 600], [1373, 641], [992, 641]], 'voor het doorgeven van bijvoor-', 0.7671805513896455), ([[1433, 600], [1853, 600], [1853, 641], [1433, 641]], 'Waarom het handelsregister?', 0.9457987802731944), ([[548, 638], [902, 638], [902, 677], [548, 677]], 'Met dit formulier kunt u wijzi-', 0.8107827908344889), ([[992, 638], [1384, 638], [1384, 677], [992, 677]], 'beeld veranderde kapitaalsgege-', 0.7377568067143137), ([[1433, 638], [1752, 638], [1752, 677], [1433, 677]], 'Het inschrijven van onder-', 0.9876152636674521), ([[1894, 636], [2371, 636], [2371, 677], [1894, 677]], 'Dit gedeelte wordt door KVK ingevuld.', 0.778349646798972), ([[544, 670], [922, 670], [922, 718], [544, 718]], 'gingen in de ondernemings- en', 0.9089413235356162), ([[992, 676], [1398, 676], [1398, 715], [992, 715]], 'vens of wijzigingen in de statuten:', 0.9142662651164867), ([[1432, 673], [1790, 673], [1790, 718], [1432, 718]], 'nemingen en rechtspersonen', 0.8465853524708478), ([[545, 715], [899, 715], [899, 754], [545, 754]], 'vestigingsgegevens opgeven.', 0.7804667854203973), ([[992, 712], [1359, 712], [1359, 748], [992, 748]], 'Daarvoor heeft u het formulier', 0.9333226651228688), ([[1430, 712], [1771, 712], [1771, 755], [1430, 755]], 'is verplicht op grond van de', 
0.9973401912463694), ([[994, 748], [1340, 748], [1340, 790], [994, 790]], "'Wijziging vennootschaps- of", 0.9835336956859595), ([[1433, 750], [1677, 750], [1677, 789], [1433, 789]], 'Handelsregisterwet.', 0.9722827193654787), ([[1896, 750], [2124, 750], [2124, 789], [1896, 789]], 'Datum ontvangst', 0.9874414185157038), ([[545, 789], [808, 789], [808, 829], [545, 829]], 'Het kan gaan om een', 0.9728439940888028), ([[989, 789], [1368, 789], [1368, 828], [989, 828]], "rechtspersoongegevens' nodig:", 0.9486988029219867), ([[1433, 789], [1850, 789], [1850, 828], [1433, 828]], 'De gegevens die u op dit formulier', 0.9264211175173818), ([[545, 827], [715, 827], [715, 866], [545, 866]], 'wijziging van:', 0.999760229948895), ([[1430, 827], [1831, 827], [1831, 866], [1430, 866]], 'invult, worden opgenomen in het', 0.7579797971333079), ([[569, 861], [798, 861], [798, 902], [569, 902]], 'een handelsnaam;', 0.9354983770073168), ([[992, 863], [1113, 863], [1113, 902], [992, 902]], 'Vragen?', 0.9998577521748023), ([[1433, 863], [1826, 863], [1826, 903], [1433, 903]], 'Handelsregister. Dit is openbaar:', 0.7884380582728256), ([[569, 901], [954, 901], [954, 937], [569, 937]], 'een internetadres (www-adres);', 0.7805590652471217), ([[992, 901], [1362, 901], [1362, 937], [992, 937]], 'Kijk op KVKnl of bel de Kamer', 0.8739483574765413), ([[1433, 901], [1798, 901], [1798, 940], [1433, 940]], 'anderen kunnen uw gegevens', 0.781161744977421), ([[569, 937], [839, 937], [839, 976], [569, 976]], 'de bedrijfsactiviteiten;', 0.8728194735672185), ([[992, 936], [1384, 936], [1384, 978], [992, 978]], 'van Koophandel (KVK) als u nog', 0.9075128691602377), ([[1433, 940], [1847, 940], [1847, 978], [1433, 978]], 'natrekken en ook u kunt gegevens', 0.7539825691271265), ([[569, 973], [930, 973], [930, 1017], [569, 1017]], 'het adres of correspondentie-', 0.8615655799722887), ([[992, 974], [1368, 974], [1368, 1015], [992, 1015]], 'vragen heeft. 
Bijvoorbeeld over', 0.7223914006581907), ([[1433, 978], [1795, 978], [1795, 1014], [1433, 1014]], 'opvragen van ondernemingen', 0.9870294822997413), ([[568, 1012], [654, 1012], [654, 1051], [568, 1051]], 'adres;', 0.999997031312701), ([[992, 1014], [1340, 1014], [1340, 1050], [992, 1050]], 'het invullen van dit formulier', 0.9901984504096066), ([[1433, 1014], [1804, 1014], [1804, 1052], [1433, 1052]], 'waarmee u bijvoorbeeld zaken', 0.782222782261967), ([[1896, 1009], [2139, 1009], [2139, 1053], [1896, 1053]], 'Datum inschrijving', 0.7208259282818439), ([[569, 1048], [913, 1048], [913, 1088], [569, 1088]], 'het telefoon-, faxnummer of', 0.9877933221861724), ([[1434, 1053], [1556, 1053], [1556, 1084], [1434, 1084]], 'wilt doen.', 0.7609010399097881), ([[569, 1088], [726, 1088], [726, 1124], [569, 1124]], 'e-mailadres;', 0.9898517615830904), ([[992, 1090], [1354, 1090], [1354, 1126], [992, 1126]], 'Als u een vergissing maakt bij', 0.7862420307167383), ([[1432, 1084], [1832, 1084], [1832, 1133], [1432, 1133]], 'Zo draagt het Handelsregister bij', 0.8588092579614046), ([[569, 1126], [954, 1126], [954, 1162], [569, 1162]], 'het aantal werkzame personen;', 0.9940783615503822), ([[992, 1126], [1387, 1126], [1387, 1162], [992, 1162]], 'het invullen; dan kunt u het foute', 0.8983763846424127), ([[1430, 1126], [1688, 1126], [1688, 1162], [1430, 1162]], 'tot zeker zakendoen:', 0.7060427266745706), ([[569, 1159], [907, 1159], [907, 1204], [569, 1204]], 'opheffing of overdracht van', 0.9978238227742723), ([[993, 1165], [1320, 1165], [1320, 1197], [993, 1197]], 'antwoord doorhalen en het', 0.9739850544014607), ([[1897, 1165], [2074, 1165], [2074, 1197], [1897, 1197]], 'KVK-nummer', 0.9765328524034846), ([[569, 1199], [921, 1199], [921, 1242], [569, 1242]], 'de onderneming of vestiging:', 0.8827611291745454), ([[990, 1201], [1335, 1201], [1335, 1243], [990, 1243]], 'goede antwoord erbij zetten:', 0.8367759105473107), ([[545, 1236], [957, 1236], [957, 1278], [545, 1278]], 'U kunt 
dit formulier niet gebruiken', 0.9947663851384315), ([[991, 1236], [1404, 1236], [1404, 1278], [991, 1278]], 'Plaats hierbij wel uw handtekening:', 0.7712697939109996), ([[552, 1336], [573, 1336], [573, 1368], [552, 1368]], '1', 0.9998362131432401), ([[621, 1328], [1075, 1328], [1075, 1385], [621, 1385]], 'Inschrijfgegevens bij KVK', 0.7913680104989015), ([[112, 1416], [343, 1416], [343, 1471], [112, 1471]], 'Toelichting 1.1', 0.9665197442435908), ([[546, 1429], [585, 1429], [585, 1460], [546, 1460]], '1.1', 0.9101260751881599), ([[698, 1425], [789, 1425], [789, 1461], [698, 1461]], 'welke', 0.999977395675486), ([[787, 1416], [1707, 1416], [1707, 1472], [787, 1472]], 'onderneming of rechtspersoon wordt de wijziging opgegeven?', 0.8649164762631084), ([[114, 1461], [475, 1461], [475, 1506], [114, 1506]], 'Om de gewijzigde gegevens', 0.9945734635776975), ([[112, 1498], [477, 1498], [477, 1540], [112, 1540]], 'te kunnen doorvoeren, heeft', 0.8308951210369384), ([[113, 1532], [425, 1532], [425, 1581], [113, 1581]], 'KVK de gegevens nodig', 0.8539477367401248), ([[623, 1541], [709, 1541], [709, 1572], [623, 1572]], 'naam', 0.9998448491096497), ([[111, 1570], [473, 1570], [473, 1619], [111, 1619]], 'waaronder de onderneming', 0.9175567179099927), ([[112, 1612], [480, 1612], [480, 1655], [112, 1655]], 'of rechtspersoon staat inge-', 0.9706024251224752), ([[113, 1653], [245, 1653], [245, 1685], [113, 1685]], 'schreven:', 0.9995520407921188), ([[112, 1688], [452, 1688], [452, 1727], [112, 1727]], 'de naam; plaats van vesti-', 0.7291726366433087), ([[112, 1726], [436, 1726], [436, 1765], [112, 1765]], 'ging en inschrijfnummer.', 0.6917421918692422), ([[618, 1718], [922, 1718], [922, 1772], [618, 1772]], 'plaats van vestiging', 0.9094052735610817), ([[619, 1836], [984, 1836], [984, 1880], [619, 1880]], 'inschrijfnummer bij KVK', 0.8361044693870214), ([[549, 1933], [577, 1933], [577, 1970], [549, 1970]], '2', 1.0), ([[620, 1924], [887, 1924], [887, 1984], [620, 1984]], 'Soort 
wijziging', 0.9432145475493581), ([[112, 2016], [343, 2016], [343, 2071], [112, 2071]], 'Toelichting 2.1', 0.8589677715582387), ([[545, 2025], [589, 2025], [589, 2061], [545, 2061]], '2.1', 0.3331562578678131), ([[621, 2021], [982, 2021], [982, 2071], [621, 2071]], 'De wijzigingen betreffen', 0.6568526212805448), ([[114, 2062], [403, 2062], [403, 2102], [114, 2102]], 'U kunt op dit formulier', 0.7075532554106072), ([[114, 2102], [447, 2102], [447, 2141], [114, 2141]], 'wijzigingen opgeven in de', 0.9030469818915413), ([[657, 2099], [1236, 2099], [1236, 2144], [657, 2144]], 'de hoofdvestiging of de enige vestiging', 0.9419140625236007), ([[112, 2138], [477, 2138], [477, 2179], [112, 2179]], 'gegevens van één vestiging:', 0.8397015077006383), ([[131, 2176], [436, 2176], [436, 2215], [131, 2215]], 'de hoofdvestiging of de', 0.919350472671281), ([[697, 2172], [1010, 2172], [1010, 2221], [697, 2221]], 'één andere vestiging', 0.998114945881038), ([[128, 2214], [334, 2214], [334, 2253], [128, 2253]], 'enige vestiging;', 0.7957931561611861), ([[128, 2253], [310, 2253], [310, 2292], [128, 2292]], 'één vestiging:', 0.831584808084107), ([[744, 2246], [1201, 2246], [1201, 2295], [744, 2295]], 'het adres van deze vestiging is', 0.8176983426178079), ([[546, 2478], [594, 2478], [594, 2510], [546, 2510]], '2.2', 0.8481187224388123), ([[618, 2471], [2145, 2471], [2145, 2520], [618, 2520]], 'Kruis hier aan wat er is gewijzigd en ga door naar de aangegeven vraag (meerdere antwoorden mogelijk)', 0.7809722471670952), ([[657, 2547], [1130, 2547], [1130, 2592], [657, 2592]], 'handelsnaam of handelsnamen', 0.9969339882359225), ([[1496, 2551], [1708, 2551], [1708, 2590], [1496, 2590]], 'Ga naar vraag 3', 0.9953507898783989), ([[656, 2620], [1317, 2620], [1317, 2668], [656, 2668]], 'bedrijfsactiviteiten; diensten en/of producten', 0.7978056315421745), ([[1496, 2628], [1708, 2628], [1708, 2664], [1496, 2664]], 'Ga naar vraag 4', 0.7169601715113014), ([[656, 2697], [1435, 2697], [1435, 2745], 
[656, 2745]], 'activiteiten van een rechtspersoon zonder onderneming', 0.9387298802978001), ([[1496, 2702], [1708, 2702], [1708, 2741], [1496, 2741]], 'Ga naar vraag 4', 0.9985427968075652), ([[656, 2769], [1164, 2769], [1164, 2818], [656, 2818]], 'adres en/of correspondentieadres', 0.889592751275233), ([[1496, 2779], [1710, 2779], [1710, 2815], [1496, 2815]], 'Ga naar vraag 5', 0.9997755335174974), ([[656, 2845], [1060, 2845], [1060, 2894], [656, 2894]], 'internetadres (www-adres)', 0.7284716366023437), ([[1496, 2853], [1708, 2853], [1708, 2892], [1496, 2892]], 'Ga naar vraag 6', 0.9986732443390018), ([[655, 2919], [1429, 2919], [1429, 2969], [655, 2969]], 'telefoon-, faxnummer; e-mailadres; berichtenboxnaam', 0.8868889771133504), ([[1496, 2927], [1686, 2927], [1686, 2966], [1496, 2966]], 'Ga naar vraag', 0.9976082940935516), ([[657, 2998], [1061, 2998], [1061, 3040], [657, 3040]], 'aantal werkzame personen', 0.8325593660182602), ([[1496, 3004], [1708, 3004], [1708, 3040], [1496, 3040]], 'Ga naar vraag 8', 0.7877077994100569), ([[656, 3069], [1010, 3069], [1010, 3121], [656, 3121]], 'opheffing of overdracht', 0.9805956245628613), ([[1496, 3078], [1708, 3078], [1708, 3117], [1496, 3117]], 'Ga naar vraag 9', 0.9980991381753203), ([[115, 3408], [901, 3408], [901, 3440], [115, 3440]], 'Kamer van Koophandel@ juni 2020 Wijziging ondernemings- en vestigingsgegevens', 0.7441849892761688), ([[2278, 3408], [2326, 3408], [2326, 3436], [2278, 3436]], 'blad', 0.9999856948852539), ([[2342, 3414], [2396, 3414], [2396, 3435], [2342, 3435]], 'van 4', 0.9974790653868143), ([[624.0298574998546, 1419.1194299994186], [700.8096965887974, 1429.7808970915848], [694.9701425001454, 1466.8805700005814], [618.1903034112026, 1456.2191029084152]], 'Voor', 0.9999222159385681)]

At first sight only the @ is wrong; it should be the ® (registered) sign.
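
Since the printed result is a list of (quadrilateral, text, confidence) tuples, it could be mapped to hOCR-style spans. The following is a rough sketch under stated assumptions: the function names are hypothetical, each EasyOCR entry is emitted as a single `ocrx_word` span even though an entry often holds a whole phrase, and rotated quadrilaterals are flattened to their axis-aligned bounding box.

```python
import math
from html import escape


def quad_to_bbox(quad):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a four-point quad."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return (
        math.floor(min(xs)),
        math.floor(min(ys)),
        math.ceil(max(xs)),
        math.ceil(max(ys)),
    )


def easyocr_to_hocr_words(result):
    """Render EasyOCR-style (quad, text, confidence) tuples as hOCR spans."""
    spans = []
    for index, (quad, text, confidence) in enumerate(result):
        x0, y0, x1, y1 = quad_to_bbox(quad)
        # hOCR word confidence is a 0-100 value; EasyOCR reports 0-1.
        title = f"bbox {x0} {y0} {x1} {y1}; x_wconf {round(confidence * 100)}"
        spans.append(
            f'<span class="ocrx_word" id="word_{index}" title="{title}">'
            f"{escape(text)}</span>"
        )
    return "\n".join(spans)
```

A complete hOCR file would still need the surrounding page, area, paragraph and line elements; this only covers the per-entry conversion.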

@rmast
Author

rmast commented Jan 25, 2023

While playing around with the new You.com YouChat (free to use at the moment), where questions are answered ChatGPT-like but with references and actual web search results, I found this article on document segmentation:
https://pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/

@MerlijnWajer
Collaborator

Right, the hOCR results basically contain the results of the Tesseract segmentation, so we wouldn't have to re-do that.
