Can't pass citation mark character into tessedit_char_whitelist #501

Closed
natsukashiixo opened this issue Aug 24, 2023 · 2 comments · Fixed by #502

@natsukashiixo

I'm running into issues trying to use " in the tessedit_char_whitelist config flag. This is most likely because " is also used by pytesseract to know when the config ends.
I have no idea if this should be considered a bug.
I'm mostly looking for alternative solutions; I found no info in the documentation on whether you can just pass a config file instead.

import cv2
import pytesseract
from about_time import about_time

# IsImage, input_dir and psm_nr are defined elsewhere in my script.

# Whitelist of allowed characters; the double quote is appended separately.
charwhitelist = r'ABCDEFGHIJKLMNOPQRSTUVWZYXÅÄÖabcdefghijklmnopqrstuvwxyzåäö0123456789-()/=&%!?:;.,é ' + '\"'

# Import path to tesseract executable
with open('tesseract_install.txt', 'r') as file:
    install_path = file.read()

pytesseract.pytesseract.tesseract_cmd = install_path

files = list(filter(IsImage, input_dir))

with about_time() as t1:
    total_iterations = len(files)
    remaining_iterations = len(files)
    completed_iterations = 0
    print(f'Starting Tesseract using PSM {psm_nr}, there are {total_iterations} pages to read.')
    for file in files:
        print(f'Starting work on {file}')
        try:
            img_cv = cv2.imread(str(file))
            img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
            hocr = pytesseract.image_to_pdf_or_hocr(
                img_rgb,
                extension='hocr',
                lang='swe+script/Latin',
                config=f"--oem 3 --psm {psm_nr} -c tessedit_char_whitelist='{charwhitelist}'",
            )
        except Exception as exc:  # placeholder: the handler is not part of the snippet I pasted
            print(f'Failed on {file}: {exc}')

Example output while trying to use this whitelist:
Smhllshjlp1922
ArfreningenSmhllshjlpklsskmpsorgnistion?

Example output without whitelist (and also expected result):
Samhällshjälp 1922
Är föreningen Samhällshjälp klasskampsorganisation?

python version: 3.10.6 run via bundled interpreter in an executable
pytesseract version: 0.3.10
tesseract version: UB Mannheim windows binary, v5.3.0.20221214

@stefan6419846
Contributor

This seems to be related to the following line in pytesseract, and is roughly the same base issue as in #356:

cmd_args += shlex.split(config)
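
For anyone who wants to see how that shlex.split(config) call tokenizes a config string, here is a minimal stdlib-only sketch. It does not call pytesseract and does not try to reproduce the Windows behaviour from the report; it only shows what shlex.split does with its default (POSIX) rules. The shortened whitelist and the --psm value are stand-ins chosen for illustration.

```python
import shlex

# Shortened stand-in for the whitelist in the report:
# a few letters, a space, and a double quote.
whitelist = 'abc ' + '"'

# shlex.quote wraps the value so that shlex.split (default POSIX rules)
# hands it back as one token, with the space and the " preserved.
config = "--psm 6 -c " + shlex.quote(f"tessedit_char_whitelist={whitelist}")
tokens = shlex.split(config)
print(tokens)

# Wrapping the same value in double quotes by hand leaves an unbalanced
# quote in the string, which shlex.split rejects with a ValueError.
bad_config = f'--psm 6 -c tessedit_char_whitelist="{whitelist}"'
split_failed = False
try:
    shlex.split(bad_config)
except ValueError as exc:
    split_failed = True
    print("split failed:", exc)
```

Whether a string built this way behaves the same once it reaches the Tesseract binary on Windows is not something this sketch checks.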

@natsukashiixo
Author

Thanks for looking into it. In case someone else finds this issue, my current workaround is:

cfg_filename = 'letters'
hocr = pytesseract.image_to_pdf_or_hocr(img_rgb, extension='hocr', lang='swe+script/Latin', config=f"--oem 3 --psm {psm_nr} {cfg_filename}")

Here 'letters' refers to a file named letters (no file extension) located in tessdata/configs. The file contains this single line:

tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWZYXÅÄÖabcdefghijklmnopqrstuvwxyzåäö0123456789-()/=&%!?:;.,é "

This lets me load the whitelist from a file while still setting the PSM number from a function, and so far in testing it behaves as expected.
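
The config-file workaround could be scripted roughly as follows. This is only a sketch: write_whitelist_config and build_config are hypothetical helper names, configs_dir must point at your own tessdata/configs directory, and writing the file as UTF-8 with a trailing newline is an assumption rather than something confirmed in this thread.

```python
from pathlib import Path


def write_whitelist_config(configs_dir, name, whitelist):
    """Write a config file holding only the whitelist setting (hypothetical helper)."""
    path = Path(configs_dir) / name
    # One "<variable> <value>" line, as in the workaround; the value is not
    # wrapped in quotes, so a literal " can be part of the whitelist.
    # UTF-8 encoding is an assumption.
    path.write_text(f"tessedit_char_whitelist {whitelist}\n", encoding="utf-8")
    return path


def build_config(psm_nr, cfg_filename):
    """Build the pytesseract config string used in the workaround (hypothetical helper)."""
    return f"--oem 3 --psm {psm_nr} {cfg_filename}"
```

The result of build_config(psm_nr, 'letters') would then be passed as the config argument of pytesseract.image_to_pdf_or_hocr, exactly as in the call above.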
