Fix asynchronous caching bug #666

Balearica · 2022-09-18T05:15:41Z

Many of the currently open issues appear to stem from two problems in how caching works at present.

1. We assume that workers are created synchronously, and violating this assumption creates invalid cache files.
2. We assume that all cache files are valid.

The former appears to be the most common cause of invalid cache data, since the requirement is non-obvious to users. However, the cache may be invalid for other reasons. For example, until the last version the cache was often invalid because langData responses were cached (see #585). It is therefore possible that not all bugs listed below were directly caused by creating workers asynchronously, but hopefully solving the async issue will resolve most of them.
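One general way to keep concurrent writers from leaving a partially written cache file is to write to a temporary file and rename it over the target. The sketch below is not how Tesseract.js handles caching; `writeCacheAtomic` and `demo` are hypothetical names, and it assumes a POSIX-style filesystem where a rename within one directory replaces the destination in a single step.

```javascript
const fs = require('fs/promises');
const os = require('os');
const path = require('path');

// Hypothetical helper, not part of Tesseract.js.
// Each call writes to its own temporary file, then renames it over the target,
// so a reader sees either the previous file or a complete new one.
let tmpCounter = 0;
async function writeCacheAtomic(target, data) {
  const tmp = `${target}.${process.pid}.${tmpCounter++}.tmp`;
  await fs.writeFile(tmp, data);
  await fs.rename(tmp, target);
}

// Ten concurrent writers, mirroring ten workers created without awaiting each one.
async function demo() {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'cache-demo-'));
  const target = path.join(dir, 'eng.traineddata');
  const data = Buffer.alloc(1024 * 1024, 7); // stand-in for language data
  await Promise.all(
    Array.from({ length: 10 }, () => writeCacheAtomic(target, data))
  );
  const written = await fs.readFile(target);
  return written.equals(data); // resolves to whether the final file is complete
}
```

This only addresses torn writes. It does not detect a cache file that is invalid for other reasons, such as the cached responses described in #585.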

Related issues:

Balearica · 2022-09-18T06:11:38Z

Upon further investigation, this appears to already be fixed (at least for Node.js). The following code snippet consistently throws an error in Version 2 but does not throw an error in Version 3.

const { createWorker, createScheduler } = require('../../');

const scheduler = createScheduler();

// Creates worker and adds to scheduler
const workerGen = async () => {
  const worker = createWorker({cachePath: "."});
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  scheduler.addWorker(worker);
}

const workerN = 10;
(async () => {
  const resArr = Array(workerN);
  for (let i=0; i<workerN; i++) {
    resArr[i] = workerGen();
  }
  await Promise.all(resArr);
  /** Add 10 recognition jobs */
  await Promise.all(Array(10).fill(0).map(() => (
    scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
  )));
  await scheduler.terminate(); // It also terminates all workers.
})();
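For code that has to run on Version 2, one conservative pattern is to create workers one at a time, which keeps to the "created synchronously" assumption described above. This is a minimal sketch rather than an official recommendation; `createSequentially` is a hypothetical helper that accepts any async factory.

```javascript
// Hypothetical helper: runs an async factory n times, awaiting each call
// before starting the next, so no two factory calls overlap.
async function createSequentially(n, factory) {
  const created = [];
  for (let i = 0; i < n; i++) {
    created.push(await factory(i));
  }
  return created;
}
```

In the snippet above, `await createSequentially(workerN, workerGen);` would replace the loop and the first `Promise.all`. The trade-off is slower startup, since each worker is loaded only after the previous one has finished.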


Balearica · 2022-09-20T02:22:44Z

While this issue seems to be largely resolved in version 3 (as stated above), one contributing factor appears to be that when cacheMethod=='write' (the default option) the cache file is overwritten on every call to loadLanguage even if the data was sourced from the cache file. In other words, the cache file is frequently overwritten with identical contents.

tesseract.js/src/worker-script/index.js

Lines 134 to 136 in dd6c40b

    
if (['write', 'refresh', undefined].includes(cacheMethod)) {
  await adapter.writeCache(`${cachePath || '.'}/${lang}.traineddata`, data);
}

I implemented an edit in the dev/v4 branch to no longer do this, which should reduce the number of times the cache is overwritten, and therefore the potential for the file being corrupted.
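The idea can be sketched as follows. This is an illustration, not the code in the dev/v4 branch: `makeAdapter`, `loadLanguageData`, and `fetchLang` are hypothetical names, the adapter is an in-memory stand-in, and the read-side handling of `cacheMethod` (reading for 'write', 'readOnly', and the default) is an assumption about what the option means.

```javascript
// Hypothetical in-memory adapter standing in for the real cache adapter.
function makeAdapter(initial = {}) {
  const store = new Map(Object.entries(initial));
  let writes = 0;
  return {
    readCache: async (key) => store.get(key),
    writeCache: async (key, data) => { writes += 1; store.set(key, data); },
    writeCount: () => writes,
  };
}

// Loads language data and writes the cache only when the data did NOT come
// from the cache, so a cache hit never rewrites the file.
async function loadLanguageData(adapter, fetchLang, { lang, cachePath, cacheMethod } = {}) {
  const key = `${cachePath || '.'}/${lang}.traineddata`;
  // Assumed semantics: these methods allow reading / writing the cache.
  const mayRead = ['write', 'readOnly', undefined].includes(cacheMethod);
  const mayWrite = ['write', 'refresh', undefined].includes(cacheMethod);

  let data = mayRead ? await adapter.readCache(key) : undefined;
  const fromCache = data !== undefined;
  if (!fromCache) data = await fetchLang(lang);

  if (mayWrite && !fromCache) {
    await adapter.writeCache(key, data);
  }
  return data;
}
```

With this shape, repeated calls for the same language write the cache once, on the first miss, and later calls only read it.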

See #662 for explanation of Tesseract.js Version 4 changes. List below is auto-generated from commits.

* Added image preprocessing functions (rotate + save images)
* Updated createWorker to be async
* Reworked createWorker to be async and throw errors per #654
* Reworked createWorker to be async and throw errors per #654
* Edited detect to return null when detection fails rather than throwing error per #526
* Updated types per #606 and #580 (#663) (#664)
* Removed unused files
* Added savePDF option to recognize per #488; cleaned up code for linter
* Updated download-pdf example for node to use new savePDF option
* Added OutputFormats option/interface for setting output
* Allowed for Tesseract parameters to be set through recognition options per #665
* Updated docs
* Edited loadLanguage to no longer overwrite cache with data from cache per #666
* Added interface for setting 'init only' options per #613
* Wrapped caching in try block per #609
* Fixed unit tests
* Updated setImage to resolve memory leak per #678
* Added debug output option per #681
* Fixed bug with saving images per #588
* Updated examples
* Updated readme and Tesseract.js-core version

Balearica · 2023-05-12T03:03:19Z

Closing this issue since the above changes appear to have resolved it (nobody has reported an issue since v4).

Balearica pushed a commit that referenced this issue Sep 20, 2022

Edited loadLanguage to no longer overwrite cache with data from cache…

c0298ff

… per #666

This was referenced Oct 8, 2022

TesseractJS could not load language sometimes and application stops running #676

Closed

Version 4 Development and Changes #662

Closed

Tesseract example is not working? #453

Closed

Balearica mentioned this issue Nov 5, 2022

Tesseract.js can't load language files on deployed server #618

Closed

Balearica mentioned this issue Nov 25, 2022

Error: Error: UNKNOWN: unknown error, open '\eng.traineddata' CRASH #536

Closed

Balearica closed this as completed May 12, 2023
