I'm running into issues trying to use `"` in the `tessedit_char_whitelist` config flag. This is most likely because `"` is also used by pytesseract to know when the config ends.
I have no idea if this should be considered a bug.
I'm mostly looking for alternative solutions; I found no info in the documentation on whether you can just pass a config file instead.
```python
import cv2
import pytesseract

# IsImage, input_dir, about_time, psm_nr and charwhitelist are defined elsewhere in the script.

# Import path to tesseract executable
with open('tesseract_install.txt', 'r') as file:
    install_path = file.read()
pytesseract.pytesseract.tesseract_cmd = install_path

files = list(filter(IsImage, input_dir))
with about_time() as t1:
    total_iterations = len(files)
    remaining_iterations = len(files)
    completed_iterations = 0
    print(f'Starting Tesseract using PSM {psm_nr}, there are {total_iterations} pages to read.')
    for file in files:
        print(f'Starting work on {file}')
        try:
            img_cv = cv2.imread(str(file))
            # OpenCV loads images as BGR; convert to RGB before OCR
            img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
            hocr = pytesseract.image_to_pdf_or_hocr(
                img_rgb,
                extension='hocr',
                lang='swe+script/Latin',
                config=f"--oem 3 --psm {psm_nr} -c tessedit_char_whitelist='{charwhitelist}'",
            )
        except Exception as exc:  # the except clause was not part of the posted snippet
            print(f'Failed on {file}: {exc}')
```
Example output while trying to use this whitelist:
Smhllshjlp1922
ArfreningenSmhllshjlpklsskmpsorgnistion?
Example output without whitelist (and also expected result):
Samhällshjälp 1922
Är föreningen Samhällshjälp klasskampsorganisation?
python version: 3.10.6 run via bundled interpreter in an executable
pytesseract version: 0.3.10
tesseract version: UB Mannheim windows binary, v5.3.0.20221214
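To check the hypothesis about the `"` character, it can help to look at how a config string is tokenized before it reaches the Tesseract executable. The sketch below uses Python's standard `shlex` module in its default POSIX mode on a shortened whitelist. Whether pytesseract tokenizes the config exactly this way, and in which mode on Windows, is an assumption here and not something confirmed in this thread; `tokenize_config` is a hypothetical helper, not part of pytesseract.

```python
import shlex


def tokenize_config(config: str) -> list[str]:
    """Split a config string using POSIX shell rules (an assumed model of the wrapper)."""
    return shlex.split(config)


# Shortened, hypothetical whitelist that ends in a space and a double quote.
whitelist = 'abc "'

# Value wrapped in single quotes: the double quote is kept as a literal character.
single = f"--psm 6 -c tessedit_char_whitelist='{whitelist}'"
print(tokenize_config(single))

# Value wrapped in double quotes: the embedded quote closes the value early
# and leaves an unbalanced quote behind, so tokenizing fails.
double = f'--psm 6 -c tessedit_char_whitelist="{whitelist}"'
try:
    tokenize_config(double)
except ValueError as exc:
    print(f"tokenizing failed: {exc}")

# shlex.quote produces a quoting that survives a POSIX-mode round trip.
safe = "--psm 6 -c " + shlex.quote(f"tessedit_char_whitelist={whitelist}")
print(tokenize_config(safe))
```

Printing the tokens this way only shows what the Python side would hand to the subprocess under POSIX rules; on Windows the result may differ, and the command-line encoding of characters such as `åäö` is a separate thing to check.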
Thanks for looking into it. In case someone else finds this issue, my current workaround is:

```python
cfg_filename = 'letters'
hocr = pytesseract.image_to_pdf_or_hocr(
    img_rgb,
    extension='hocr',
    lang='swe+script/Latin',
    config=f"--oem 3 --psm {psm_nr} {cfg_filename}",
)
```

Here `letters` points to a file located in `tessdata/configs` that is named `letters`, with no file extension. The file contains the line:

```
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWZYXÅÄÖabcdefghijklmnopqrstuvwxyzåäö0123456789-()/=&%!?:;.,é "
```

This lets me load the whitelist from a file without impacting my ability to set the PSM number using a function, and so far in testing it behaves as expected.
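The config-file workaround can also be scripted, so the whitelist lives in Python and the file is generated from it. This is a minimal sketch under stated assumptions: `write_whitelist_config` and `build_config` are hypothetical helpers, a temporary directory stands in for `tessdata/configs`, and the whitelist is shortened. The file format (parameter name, a space, then the value on one line) follows the `letters` file described above.

```python
import tempfile
from pathlib import Path


def write_whitelist_config(directory, name, whitelist):
    """Write a one-line Tesseract config file that sets tessedit_char_whitelist."""
    path = Path(directory) / name
    # UTF-8 so that characters such as å, ä, ö and é are stored unchanged.
    path.write_text(f"tessedit_char_whitelist {whitelist}\n", encoding="utf-8")
    return path


def build_config(psm_nr, cfg_name):
    """Build the config string passed to pytesseract: options first, config file name last."""
    return f"--oem 3 --psm {psm_nr} {cfg_name}"


# A temporary directory stands in for tessdata/configs in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = write_whitelist_config(tmp, "letters", 'abcåäö0123456789 "')
    print(cfg_path.read_text(encoding="utf-8"))
    print(build_config(6, cfg_path.name))
```

In real use the directory would be the `tessdata/configs` folder of the Tesseract installation, and the string returned by `build_config` would be passed as `config=` to `pytesseract.image_to_pdf_or_hocr`, as in the workaround above.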
The whitelist (`charwhitelist`) used in the original post was defined as:

```python
charwhitelist = r'ABCDEFGHIJKLMNOPQRSTUVWZYXÅÄÖabcdefghijklmnopqrstuvwxyzåäö0123456789-()/=&%!?:;.,é ' + '\"'
```