LiteralNER shall tokenize each entry #20

jmansilla · 2014-05-16T17:18:53Z

Right now our LiteralNER is very literal, so in some cases it does not work.

Example: an entry like this

takayasu's arteritis

is never found because the documents are tokenized, which transforms this

John had takayasu's arteritis

into this

John had takayasu 's arteritis

making a match impossible (notice that 's is a separate token).

What makes things harder is that the tokenizer used to parse the LiteralNER entries must be the same tokenizer used to tokenize the text.
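A minimal sketch of the idea, not the actual iepy code: the entries are run through the same tokenizer as the documents before matching. The `tokenize` function below is a hypothetical stand-in that splits off possessive `'s` the way the example above shows, and `find_entries` is an illustrative helper name.

```python
import re

# Hypothetical stand-in tokenizer (an assumption, not iepy's real one):
# it splits possessive 's into its own token, then words, then punctuation.
_TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")


def tokenize(text):
    """Return the list of tokens for `text`."""
    return _TOKEN_RE.findall(text)


def find_entries(doc_tokens, entries, tokenizer):
    """Return (start, end) token spans where any entry occurs in the document.

    Each entry is tokenized with the same `tokenizer` that produced
    `doc_tokens`, so both sides are compared token by token.
    """
    tokenized_entries = [tuple(tokenizer(entry)) for entry in entries]
    spans = []
    for start in range(len(doc_tokens)):
        for entry in tokenized_entries:
            end = start + len(entry)
            if entry and tuple(doc_tokens[start:end]) == entry:
                spans.append((start, end))
    return spans


doc_tokens = tokenize("John had takayasu's arteritis")
spans = find_entries(doc_tokens, ["takayasu's arteritis"], tokenize)
```

Here the entry becomes the three tokens `takayasu`, `'s`, `arteritis`, which line up with the tokenized document, whereas comparing the raw entry string against the tokens would never match.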

rafacarrascosa · 2014-05-16T20:41:08Z

For documentation's sake, something we talked about in real life:

Perhaps tokenization, pos-tagging, ner-tagging and segmentation could be non-optional parts of the preprocessing pipeline since iepy's core would break without them anyway.

If those parts are non-optional, then they can be passed as kwargs to the preprocessing pipeline and therefore it can be easily ensured that tokenization is the same for both documents and literal tagging.
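A rough sketch of what that could look like, covering only the tokenizer and the literal tagger. The names `Pipeline`, `LiteralTagger`, `literal_entries` and `simple_tokenizer` are hypothetical and do not reflect the current iepy API; the point is only that a single tokenizer passed as a keyword argument is shared by both steps.

```python
import re


def simple_tokenizer(text):
    """Hypothetical tokenizer: splits possessive 's, words and punctuation."""
    return re.findall(r"'s\b|\w+|[^\w\s]", text)


class LiteralTagger:
    """Tags literal entries, tokenizing them with the pipeline's tokenizer."""

    def __init__(self, entries, tokenizer):
        self.entries = [tuple(tokenizer(entry)) for entry in entries]

    def tag(self, tokens):
        """Return (start, end) token spans where an entry occurs."""
        spans = []
        for start in range(len(tokens)):
            for entry in self.entries:
                end = start + len(entry)
                if entry and tuple(tokens[start:end]) == entry:
                    spans.append((start, end))
        return spans


class Pipeline:
    """Preprocessing pipeline with a mandatory, shared tokenizer."""

    def __init__(self, *, tokenizer, literal_entries):
        # Keyword-only arguments: the tokenizer cannot be left out, and the
        # same callable is handed to the literal tagger.
        self.tokenizer = tokenizer
        self.ner = LiteralTagger(literal_entries, tokenizer)

    def process(self, text):
        tokens = self.tokenizer(text)
        return tokens, self.ner.tag(tokens)


pipeline = Pipeline(tokenizer=simple_tokenizer,
                    literal_entries=["takayasu's arteritis"])
tokens, spans = pipeline.process("John had takayasu's arteritis")
```

Swapping in a different tokenizer (for instance `str.split`) still yields a match, because documents and entries are always tokenized consistently.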
