The `FlairTagger` (and possibly `CRFTagger`) ignores empty documents, so the number of output documents does not match the number of input documents.
We should either allow empty documents, or raise a warning stating that empty strings should not be passed.
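The warning option could look roughly like the sketch below. It is not code from the library: `Doc` is a hypothetical stand-in for `deidentify.base.Document` (only `name` and `text`), and `warn_on_empty` is an illustrative helper name. It only checks for empty strings, as described in this issue.

```python
import warnings
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for deidentify.base.Document (name and text only)."""
    name: str
    text: str


def warn_on_empty(documents):
    """Emit a single UserWarning listing every document whose text is empty.

    Returns the names of the empty documents so the caller can act on them.
    """
    empty = [doc.name for doc in documents if not doc.text]
    if empty:
        warnings.warn(
            f"{len(empty)} empty document(s) passed to the tagger: {empty}",
            UserWarning,
            stacklevel=2,
        )
    return empty
```

A tagger could call such a check at the start of `annotate`, before tokenization, so users learn why the output is shorter than the input.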
Reproducible example:

```python
from pprint import pprint

from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

documents = [
    Document(name="doc_01", text=""),
    Document(name="doc_02", text="Stukje tekst met de naam Jan Jansen."),
    Document(name="doc_03", text=""),
]

tokenizer = TokenizerFactory().tokenizer(corpus="ons", disable=("tagger", "ner"))
tagger = FlairTagger(
    model="model_bilstmcrf_ons_fast-v0.2.0", tokenizer=tokenizer, verbose=False
)

annotated_docs = tagger.annotate(documents)

print(f"len(documents) = {len(documents)}")
print(f"len(annotated_docs) = {len(annotated_docs)}")
pprint(annotated_docs)
```
Actual:
```
len(documents) = 3
len(annotated_docs) = 1
[Document(name=doc_02). Chars: 36, Annotations: 1]
```
Expected:
```
len(documents) = 3
len(annotated_docs) = 3
```
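Until this is fixed, a caller-side workaround might be to annotate only the non-empty documents and put the empty ones back in their original positions. The sketch below is an assumption-laden illustration, not part of the library: `Doc` is a hypothetical stand-in for `deidentify.base.Document`, `annotate_keep_empty` is an invented helper name, and it assumes the `annotate` callable returns exactly one document per non-empty input, in input order.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for deidentify.base.Document."""
    name: str
    text: str
    annotations: list = field(default_factory=list)


def annotate_keep_empty(documents, annotate):
    """Annotate non-empty documents and keep empty ones in place.

    `annotate` is any callable taking a list of documents and returning
    one annotated document per input, in the same order (an assumption).
    """
    non_empty = [doc for doc in documents if doc.text]
    annotated = iter(annotate(non_empty))
    # Walk the original list: take the next annotated result for non-empty
    # documents, and pass empty documents through unchanged.
    return [next(annotated) if doc.text else doc for doc in documents]
```

With the example above, this would presumably be called as `annotate_keep_empty(documents, tagger.annotate)`, giving an output list of the same length as the input.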