I'm developing a routine in my Java web project (using NetBeans 13) that extracts the text from a PDF. If it doesn't find a certain term, it converts the PDF to images and tries to extract the text with Tesseract OCR.
After several attempts I always get the same error, regardless of the configuration I use:
Error opening data file tessdata/por.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'por'
Tesseract couldn't load any languages!
Caused by: java.lang.Error: Invalid memory access
I'm having difficulty configuring the library in my development environment because, unlike the tutorials, which use Spring Boot and Eclipse, I use NetBeans 13. I have the following project structure:
First I added the dependency in pom.xml:
After that, as some tutorials indicate, I took the tessdata folder found under dependencies/Tes4j and copied it to the WEB-INF folder (I also tried the resources folder).
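The native Tesseract library opens por.traineddata by file-system path, so a copy that only exists inside a packed archive or on the classpath may not be reachable. A minimal sketch of one workaround, assuming the file was packaged as the classpath resource /tessdata/por.traineddata (an assumption, as are the class name TessdataExtractor and the helper extract, neither of which is part of Tess4J): copy the resource to a real temporary directory and use that directory as the datapath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; not part of Tess4J.
class TessdataExtractor {

    // Copies the stream to <new temp dir>/<lang>.traineddata and returns the directory.
    static Path extract(InputStream trainedData, String lang) {
        try {
            Path dir = Files.createTempDirectory("tessdata");
            Files.copy(trainedData, dir.resolve(lang + ".traineddata"),
                    StandardCopyOption.REPLACE_EXISTING);
            return dir.toAbsolutePath();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the resource name below matches how the file was packaged.
        try (InputStream in =
                TessdataExtractor.class.getResourceAsStream("/tessdata/por.traineddata")) {
            if (in == null) {
                System.out.println("Resource /tessdata/por.traineddata not found on the classpath");
                return;
            }
            System.out.println("Extracted to " + extract(in, "por"));
        }
    }
}
```

The returned directory, converted with toString(), is the kind of absolute path that could then be passed to tesseract.setDatapath.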
Later I tried to configure the TESSDATA_PREFIX environment variable. I couldn't find out whether this variable should be defined in the Windows system variables or somewhere in NetBeans. In my project structure I tried three definitions, but none worked:
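One way to narrow this down is to print what the running JVM actually sees. A process only inherits the environment variables that existed when it was started, so a variable added in the Windows settings is usually not visible until NetBeans and the application server are restarted. The sketch below is a diagnostic only; the class name TessdataCheck and the helper hasLanguageFile are hypothetical and not part of Tess4J.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic class; not part of Tess4J.
class TessdataCheck {

    // True when <tessdataDir>/<lang>.traineddata exists as a regular file.
    static boolean hasLanguageFile(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Value inherited from the process environment; null if it is not set.
        String prefix = System.getenv("TESSDATA_PREFIX");
        System.out.println("TESSDATA_PREFIX = " + prefix);

        // A relative path such as "tessdata" is resolved against this directory.
        Path workingDir = Paths.get("").toAbsolutePath();
        System.out.println("Working directory = " + workingDir);
        System.out.println("por.traineddata under working dir: "
                + hasLanguageFile(workingDir.resolve("tessdata"), "por"));

        if (prefix != null) {
            // Assumption: the variable points at the tessdata directory itself,
            // as the error message asks.
            System.out.println("por.traineddata under TESSDATA_PREFIX: "
                    + hasLanguageFile(Paths.get(prefix), "por"));
        }
    }
}
```

Running the same checks from inside the web application (for example, logging them at startup) shows the server's view rather than the IDE's, which may differ.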
In my method I tried setting the datapath to just the name tessdata, to data, and to each of the paths above. Here is the method:
public String extractText(Anexos anexo) throws Exception {
    File file = new File(anexo.getCaminho());
    PDDocument doc = PDDocument.load(file);
    System.out.println("=================================> Extraindo com pdfBox <=========================");
    PDFTextStripper estripador = new PDFTextStripper();
    estripador.setSortByPosition(false);
    String pdfTexto = estripador.getText(doc);
    String line = "";
    line = pdfTexto.toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
            .replaceAll("\\s", " ");
    if (line.contains("sped")) {
        return line;
    } else {
        PDFRenderer pdfRenderer = new PDFRenderer(doc);
        StringBuilder out = new StringBuilder();
        Tesseract tesseract = new Tesseract();
        tesseract.setLanguage("por");
        tesseract.setOcrEngineMode(1);
        Path dataDirectory = Paths.get("tessdata");
        tesseract.setDatapath(dataDirectory.toString());
        for (int page = 0; page < doc.getNumberOfPages(); page++) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
            // Create a temp image file
            File tempFile = new File(file.getPath().replace(File.separator + anexo.getAnexo(), "") + File.separator + "tempfile_" + anexo.getAnexo().replaceAll("\\..*", "") + "_" + page + ".png");
            ImageIO.write(bufferedImage, "png", tempFile);
            String result = tesseract.doOCR(tempFile);
            out.append(result);
            // Delete temp file
            tempFile.delete();
        }
        line = out.toString().toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
                .replaceAll("\\s", " ");
    }
    return line;
}
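In the method above, Paths.get("tessdata") is a relative path, so it is resolved against the JVM's working directory, which for a web application is typically the server's directory rather than the project's. A minimal sketch of a guard, assuming a hypothetical helper named resolveTessdata (not a Tess4J API), that converts the configured location to an absolute path and fails early with a readable message when por.traineddata is missing:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of Tess4J.
class DatapathResolver {

    // Returns the absolute tessdata directory, or throws if <lang>.traineddata is absent.
    static Path resolveTessdata(String configuredDir, String lang) {
        Path dir = Paths.get(configuredDir).toAbsolutePath().normalize();
        Path trained = dir.resolve(lang + ".traineddata");
        if (!Files.isRegularFile(trained)) {
            throw new IllegalStateException("Missing language file: " + trained);
        }
        return dir;
    }

    public static void main(String[] args) {
        // Assumption: the location is supplied as the first argument, else "tessdata".
        String configured = args.length > 0 ? args[0] : "tessdata";
        try {
            Path dir = resolveTessdata(configured, "por");
            System.out.println("Absolute datapath: " + dir);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The result of resolveTessdata(...).toString() could be passed to tesseract.setDatapath in place of the relative name, so a wrong location surfaces as a Java exception naming the full path instead of a native "Invalid memory access" error.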
My tessdata folder:
My development environment:
NetBeans: 13
JDK: 15
tess4j: 5.2.0
My questions:
I would like to know whether this environment variable (TESSDATA_PREFIX) should be configured in the Windows variables, in NetBeans, or somewhere in my code. Also, did I skip any steps? Do I need to download anything else? Please help me, I don't know what else to try!