Handle empty documents in FlairTagger #49

jantrienes · 2021-01-20T07:06:38Z

The FlairTagger (and possibly the CRFTagger) silently drops empty documents, so the number of output documents does not match the number of input documents.

We should either allow empty documents, or raise a warning stating that empty strings must not be passed.

Reproducible example

from pprint import pprint

from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

documents = [
    Document(name="doc_01", text=""),
    Document(name="doc_02", text="Stukje tekst met de naam Jan Jansen."),
    Document(name="doc_03", text=""),
]


tokenizer = TokenizerFactory().tokenizer(corpus="ons", disable=("tagger", "ner"))
tagger = FlairTagger(
    model="model_bilstmcrf_ons_fast-v0.2.0", tokenizer=tokenizer, verbose=False
)

annotated_docs = tagger.annotate(documents)
print(f"len(documents) = {len(documents)}")
print(f"len(annotated_docs) = {len(annotated_docs)}")

pprint(annotated_docs)

Actual:

len(documents) = 3
len(annotated_docs) = 1
[Document(name=doc_02). Chars: 36, Annotations: 1]

Expected:

len(documents) = 3
len(annotated_docs) = 3
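
Until this is handled in the tagger itself, a caller-side workaround could filter out empty documents before tagging and put them back in their original positions afterwards. The following is only a sketch under assumptions: `Doc` is a hypothetical stand-in for `deidentify.base.Document`, `annotate_keep_empty` is a made-up helper name, and the code assumes the tagger returns exactly one result per non-empty input, in input order. Whitespace-only texts are not covered.

```python
import warnings
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for deidentify.base.Document (name and text only)."""
    name: str
    text: str


def annotate_keep_empty(
    annotate: Callable[[List[Doc]], List[Doc]],
    documents: Sequence[Doc],
) -> List[Doc]:
    """Annotate non-empty documents; pass empty ones through at their original index."""
    non_empty = [doc for doc in documents if doc.text]
    n_empty = len(documents) - len(non_empty)
    if n_empty:
        warnings.warn(f"{n_empty} empty document(s) passed through without annotation.")

    # Skip the tagger call entirely when there is nothing to tag.
    annotated = annotate(non_empty) if non_empty else []
    if len(annotated) != len(non_empty):
        # Assumption violated: the tagger dropped or added documents.
        raise ValueError("annotate() did not return one result per non-empty document")

    results = iter(annotated)
    # Walk the original order, consuming one annotated result per non-empty input.
    return [next(results) if doc.text else doc for doc in documents]
```

With the reproducible example above, this would presumably be called as `annotate_keep_empty(tagger.annotate, documents)`, assuming `Document` exposes its content as a `text` attribute.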

jantrienes changed the title ~~Allow empty documents in FlairTagger~~ Handle empty documents in FlairTagger Jan 20, 2021