Add ability to load language from a File (Blob/Stream etc) #794

Closed
datvm opened this issue Jul 3, 2023 · 4 comments

Comments

@datvm

datvm commented Jul 3, 2023

Is your feature request related to a problem? Please describe.
Right now, in order to load a language, the URL has to be specified up front and the file must already be available at that URL. This does not work for a Chrome extension running offline. Say I let the user download a language file while they are online and store it in IndexedDB. I now want to feed it to Tesseract.js, but this is currently impossible because the stored file does not have a URL.

Describe the solution you'd like
loadLanguage could accept an optional second parameter, content, that supplies the language data directly, for example as a Blob, Stream, or ArrayBuffer. Alternatively, a string or URL object could be passed to override the default behavior (langPath + langCode + <magic string>) and tell Tesseract to download from exactly that URL. Another parameter to specify whether the data is gzipped would be nice as well.

Describe alternatives you've considered
Right now the only way to do it is to require users to be online and download the files from a server.

Additional context
N/A

@Balearica
Member

Balearica commented Jul 6, 2023

The worker.loadLanguage function simply fetches a language file and writes it to the virtual filesystem. If you already have the .traineddata file as an ArrayBuffer, you can just write that to the virtual filesystem directly using the filesystem functions we provide. The following code snippet shows how filesystem operations can be used directly rather than worker.loadLanguage.

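        // Fetch the language file and convert the response to a Uint8Array.
        const langRes = await fetch("./lang-data/eng.traineddata");
        const langData = new Uint8Array(await langRes.arrayBuffer());
        // Write the data to the worker's virtual filesystem under the name Tesseract expects.
        await worker.FS("writeFile", ["eng.traineddata", langData]);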
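For data that comes from somewhere other than fetch (for example, a Blob read back from IndexedDB), the same idea can be sketched as below. This is only a rough sketch: toUint8Array and writeLanguageData are hypothetical helper names, not part of Tesseract.js, and worker is assumed to be an already-created worker exposing the FS("writeFile", ...) call shown above. It also assumes the stored data is an uncompressed .traineddata file.

```javascript
// Hypothetical helper (not part of Tesseract.js): normalize stored
// language data to the Uint8Array that the writeFile call expects.
async function toUint8Array(data) {
  if (data instanceof Uint8Array) return data;
  if (data instanceof ArrayBuffer) return new Uint8Array(data);
  if (typeof Blob !== "undefined" && data instanceof Blob) {
    return new Uint8Array(await data.arrayBuffer());
  }
  throw new TypeError("Expected a Blob, ArrayBuffer, or Uint8Array");
}

// Hypothetical helper: write the data to the worker's virtual filesystem
// under the file name Tesseract looks for (<lang>.traineddata).
// Assumes `worker` exposes FS("writeFile", [name, bytes]) as in the snippet above.
async function writeLanguageData(worker, lang, data) {
  const bytes = await toUint8Array(data);
  await worker.FS("writeFile", [`${lang}.traineddata`, bytes]);
  return bytes.length; // number of bytes written, handy for logging
}
```

With a Blob retrieved from IndexedDB in a variable such as storedBlob (also an assumed name), usage would look like `await writeLanguageData(worker, "eng", storedBlob);`.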

@datvm
Author

datvm commented Jul 6, 2023

Thanks, that would be helpful. It will take me a few days to get back to this project, but just to confirm: is simply having the file exist there all that is needed? Does Tesseract detect all *.traineddata files?

@Balearica
Member

Balearica commented Jul 6, 2023

If you initialize Tesseract with language eng (for example), a file named eng.traineddata must exist on the virtual filesystem. As long as it exists (through whatever method), Tesseract will run.
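To make the naming rule concrete, here is a small sketch that lists the file names that would need to exist for a given language string. requiredTrainedDataFiles is a hypothetical helper, not a Tesseract.js function, and it assumes Tesseract's usual convention of joining multiple languages with "+" (for example "eng+fra").

```javascript
// Hypothetical helper (not part of Tesseract.js): list the files that must
// exist on the virtual filesystem for a language string such as "eng" or
// "eng+fra". Assumes languages are joined with "+".
function requiredTrainedDataFiles(langs) {
  return langs
    .split("+")
    .map((code) => code.trim())
    .filter((code) => code.length > 0)
    .map((code) => `${code}.traineddata`);
}

// Example: which files to write before initializing with "eng+fra".
console.log(requiredTrainedDataFiles("eng+fra"));
// → [ 'eng.traineddata', 'fra.traineddata' ]
```

Each returned name would then be written to the virtual filesystem before initializing Tesseract with that language string.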

@Balearica
Member

Closing as answered.
