Names in scanned legacy documents #26

Closed
opensemanticsearch opened this issue May 18, 2018 · 1 comment
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

opensemanticsearch (Owner) commented May 18, 2018

Names of entities in old/legacy documents are often written letter-spaced, with whitespace between the characters, like
John D O E
which should be recognized as John Doe, too.
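One way to read this request is as a normalization step on the extracted text before entity matching. The following is only a minimal sketch under that assumption, not code from this project: the function name `collapse_spaced_letters` and the regular expression are hypothetical, and runs of fewer than three letters are left alone so that initials such as "John F Kennedy" are not merged.

```python
import re

# A run of three or more single letters separated by single spaces,
# e.g. "D O E" or "S t r a s s e r". [^\W\d_] matches any Unicode letter.
SPACED_RUN = re.compile(r"\b(?:[^\W\d_] ){2,}[^\W\d_]\b")


def collapse_spaced_letters(text):
    """Join letter-spaced runs into words (hypothetical helper)."""

    def join(match):
        letters = match.group(0).replace(" ", "")
        # "DOE" -> "Doe"; mixed-case runs such as "Elser" are kept as they are.
        if letters.isupper():
            letters = letters.capitalize()
        return letters

    return SPACED_RUN.sub(join, text)


print(collapse_spaced_letters("John D O E"))
print(collapse_spaced_letters("Otto S t r a s s e r"))
```

A known limitation of this sketch: if every word of a name is letter-spaced ("G e o r g E l s e r"), the word boundary is lost and the whole run is joined into a single word.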

Mandalka (Collaborator) commented Nov 23, 2019

It seems that Tesseract release 4, as integrated in Apache Tika / Open Semantic ETL / Open Semantic Search, recognizes such cases well. For example, "E l s e r" and "Otto S t r a s s e r" in https://commons.wikimedia.org/wiki/File:Gestapo-Akte_Georg_Elser_(Delikt).jpg come out as "Elser" and "Otto Strasser" in the OCR'd plain-text output.

If there is OCR software, or there are already OCR'd documents, where this is not the case, please reopen or comment, so I can add name-variant expansion functions to the entity extraction, like:

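def disjoin_chars(name):
    """Return variants of name in which trailing words are letter-spaced."""
    namevariants = []
    words = name.split(" ")

    # Variant i keeps the first i words intact and letter-spaces the rest.
    for i in range(len(words)):
        namevariant = " ".join(words[0:i])
        for word in words[i:]:
            if namevariant:
                namevariant += " "
            # "Elser" -> "E l s e r"
            namevariant += " ".join(list(word))

        namevariants.append(namevariant)

    return namevariants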

which would generate name variants like:
print(entity_manager.disjoin_chars("Georg Elser"))
['G e o r g E l s e r', 'Georg E l s e r']
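To illustrate how such variants could be put to use, here is a rough sketch that maps each generated variant back to its canonical name and looks for it in OCR'd text. The helpers `build_variant_index` and `find_names` are hypothetical names for this example only, and the plain substring search merely stands in for the real entity extraction, which this sketch does not model. The variant generation is repeated in compact form so the block runs on its own.

```python
def disjoin_chars(name):
    """Compact restatement of the variant generation, for self-containment."""
    words = name.split(" ")
    return [
        " ".join(words[:i] + [" ".join(word) for word in words[i:]])
        for i in range(len(words))
    ]


def build_variant_index(names):
    """Map every letter-spaced variant to its canonical name (hypothetical)."""
    index = {}
    for name in names:
        for variant in disjoin_chars(name):
            index[variant] = name
    return index


def find_names(text, index):
    """Return the canonical names whose variants occur in text (hypothetical)."""
    return sorted({name for variant, name in index.items() if variant in text})


index = build_variant_index(["Georg Elser", "Otto Strasser"])
print(find_names("Vernehmung des Georg E l s e r in Berlin", index))
```

Keeping the canonical name as the dictionary value means a hit on any variant can be reported under the normal spelling.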

opensemanticsearch added the "help wanted" label on Nov 24, 2019