It seems that Tesseract release 4, as integrated in Apache Tika / Open Semantic ETL / Open Semantic Search, recognizes such cases well. For example, "E l s e r" or "Otto S t r a s s e r" in https://commons.wikimedia.org/wiki/File:Gestapo-Akte_Georg_Elser_(Delikt).jpg comes out as "Elser" or "Otto Strasser" in the OCR'd plain-text output.
If there is OCR software, or there are already OCR'd documents, where this is not the case, please reopen or comment, so I can integrate related name-variant extension functions into entity extraction, such as:
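def disjoin_chars(name):
    # Build variants of a name in which trailing words are letter-spaced.
    namevariants = []
    words = name.split(" ")
    for i in range(len(words)):
        # Keep the first i words intact ...
        namevariant = " ".join(words[0:i])
        # ... and space out the letters of every remaining word.
        for word in words[i:]:
            if namevariant:
                namevariant += " "
            namevariant += " ".join(list(word))
        namevariants.append(namevariant)
    return namevariants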
This generates name variants like:
print(entity_manager.disjoin_chars("Georg Elser"))
['G e o r g E l s e r', 'Georg E l s e r']
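A minimal sketch of how such variants might be matched against OCR text, assuming a plain case-insensitive substring search is good enough. `find_spaced_name` is a hypothetical helper, not part of Open Semantic ETL, and the variant generator is restated compactly only so the snippet runs on its own:

```python
def disjoin_chars(name):
    # Compact restatement of the variant generator, so this snippet is self-contained.
    words = name.split(" ")
    return [
        " ".join(words[:i] + [" ".join(word) for word in words[i:]])
        for i in range(len(words))
    ]


def find_spaced_name(name, text):
    # Hypothetical helper: return the first letter-spaced variant of `name`
    # that occurs in `text` (ignoring case), or None if none occurs.
    haystack = text.casefold()
    for variant in disjoin_chars(name):
        if variant.casefold() in haystack:
            return variant
    return None


print(find_spaced_name("John Doe", "Report on John D O E, 1939"))  # John D o e
```

Comparing case-insensitively matters here, because letter-spaced names in old documents are often also written in capitals.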
Names of entities in old/legacy documents are often written with spaces between the letters, like
John D O E
which should be recognized as John Doe, too.
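One possible way to approach this from the text side, sketched here under assumptions rather than as the project's method: collapse runs of three or more single letters separated by single spaces before entity extraction. The names `SPACED_RUN` and `collapse_spaced_letters` and the threshold of three letters are illustrative choices.

```python
import re

# Three or more single letters, each separated by exactly one space.
# [^\W\d_] matches a Unicode letter (a word character that is not a digit or underscore).
SPACED_RUN = re.compile(r"(?<!\w)[^\W\d_](?: [^\W\d_]){2,}(?!\w)")


def collapse_spaced_letters(text):
    # Join each run of letter-spaced characters back into a single token.
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_spaced_letters("John D O E"))            # John DOE
print(collapse_spaced_letters("Otto S t r a s s e r"))  # Otto Strasser
```

The original case is kept, so "John DOE" still needs a case-insensitive comparison to match "John Doe". Two adjacent letter-spaced words are merged into one token ("G e o r g E l s e r" becomes "GeorgElser"), and short sequences of genuine one-letter words could be joined by mistake, so this would likely need to be combined with a name dictionary.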