Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting error Invalid memory access when use Tess4j in JAVA project Netbeans #232

Closed
HerickRaposo opened this issue Jun 1, 2022 · 2 comments

Comments

@HerickRaposo
Copy link

HerickRaposo commented Jun 1, 2022

I'm developing a routine in my JAVA Web project (Using netbeans 13) where I extract the texts from a pdf. If it doesn't find a certain term, it converts the pdf to image and tries to extract the text with OCR tesseract.
After several attempts I always got the same error regardless of the configuration I did. Following error:

Error opening data file tessdata/por.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'por'
Tesseract couldn't load any languages!

Caused by: java.lang.Error: Invalid memory access

I'm having difficulties configuring the library in my development environment because, unlike the tutorials where SpringBoot and Eclipse are used, I use Netbeans 13 and I have the following project structure:

enter image description here

First I added the dependency in pom.xml:

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>5.2.0</version>
        </dependency>

After that as indicated in some tutorials, I went to my tessdata folder which is in dependencies/Tes4j and copied the tessdata folder to the web-inf folder (I also tried to the resources folder).

Later I tried to configure the environment variable
TESSDATA_PREFIX. I couldn't find information if this variable was to be defined in the windows system variables or if there is some other place in netbeans to define this. In my project structure I tried three definitions but none worked:

  1. C:\Programacao\myProjectName\Tess4J
  2. C:\Programacao\myProjectName\src\main\webapp\WEB-INF
  3. C:\Programacao\myProjectName\src\main\webapp\resources

In the code of my method I tried to set the datapath passing only the name tessdata, data and I also tried to point out the paths above. Follow created method:

 public String extractText(Anexos anexo) throws Exception {
        File file = new File(anexo.getCaminho());
        PDDocument doc = PDDocument.load(file);

        System.out.println("=================================> Extraindo com pdfBox <=========================");
        PDFTextStripper estripador = new PDFTextStripper();
        estripador.setSortByPosition(false); 
        String pdfTexto = estripador.getText(doc);
        String line = "";

        line = pdfTexto.toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
                .replaceAll("\\s", " ");
        
        if (line.contains("sped")) {
            return line;
        } else {

            PDFRenderer pdfRenderer = new PDFRenderer(doc);
            StringBuilder out = new StringBuilder();

            Tesseract tesseract = new Tesseract();
            tesseract.setLanguage("por");
            tesseract.setOcrEngineMode(1);

            Path dataDirectory = Paths.get("tessdata");
            tesseract.setDatapath(dataDirectory.toString());

            for (int page = 0; page < doc.getNumberOfPages(); page++) {
                BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

                // Create a temp image file
                File tempFile = new File(file.getPath().replace(File.separator + anexo.getAnexo(), "") + File.separator + "tempfile_" + anexo.getAnexo().replaceAll("\\..*", "") + "_" + page + ".png");
                ImageIO.write(bufferedImage, "png", tempFile);
                String result = tesseract.doOCR(tempFile);
                out.append(result);

                // Delete temp file
                tempFile.delete();
            }

            line = out.toString().toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
                    .replaceAll("\\s", " ");
        }

        return line;
    }

My tessdata folder:

enter image description here

My development environment:

Netbeans:13

JDK 15

tess4j:5.2.0

My doubts:

So I would like to know if this environment variable () is configured in windows variables, in netbeans or in some internal place in my code? Also, did I skip any steps? Do I need to download anything else? Please help me, I don't know what else to do!

@nguyenq
Copy link
Owner

nguyenq commented Jun 9, 2022

You would set the environment variable TESSDATA_PREFIX in Windows.

https://superuser.com/questions/949560/how-do-i-set-system-environment-variables-in-windows-10

Does your example work with eng pack? I suggest you get it to work with a console application first before moving on to a server application.

@nguyenq
Copy link
Owner

nguyenq commented Apr 5, 2023

No response from OP.

@nguyenq nguyenq closed this as completed Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants