Duplicate IDs in standoff output #9

Closed
spyysalo opened this issue Jun 5, 2012 · 9 comments

@spyysalo
Member

spyysalo commented Jun 5, 2012

When run with the -o standoff option, NERsuite output contains duplicate IDs (within a single input document). For example (for an AnEM model):

$ cut -f 2- featurized/multiclass-withmm/test/AnEM.test | nersuite tag -o standoff -m models/test.multiclass.withmm.model | head
32  48  entity_name id="entity-1" type="Pathological_formation"
68  84  entity_name id="entity-2" type="Pathological_formation"
226 231 entity_name id="entity-3" type="Pathological_formation"
378 389 entity_name id="entity-1" type="Cell"
408 413 entity_name id="entity-4" type="Pathological_formation"
429 445 entity_name id="entity-1" type="Multi-tissue_structure"
450 461 entity_name id="entity-2" type="Multi-tissue_structure"
[...]

Entity IDs should preferably be unique for each input document.

@priancho
Member

priancho commented Jun 5, 2012

Currently, entity IDs are managed separately for each semantic type (just a C++ map container :-).

The output above shows that you used the "standoff" option, not the "brat" option for output.
Do you think unique IDs across semantic types are needed only for the "brat" option, or for all output formats?
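To illustrate the per-type bookkeeping described above, here is a minimal sketch that contrasts one counter per semantic type with one shared counter. The function names `ids_per_type` and `ids_shared` are hypothetical and are not taken from the NERsuite source; only the `entity-N` ID format comes from the output in the report.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helpers for illustration; not NERsuite's actual code.

// One counter per semantic type: numbering restarts at 1 for every new
// type, so entities of different types can end up with the same ID.
std::vector<std::string> ids_per_type(const std::vector<std::string>& types) {
    std::map<std::string, int> counters;  // semantic type -> last number used
    std::vector<std::string> ids;
    for (const std::string& type : types) {
        // operator[] value-initializes a missing counter to 0.
        ids.push_back("entity-" + std::to_string(++counters[type]));
    }
    return ids;
}

// One shared counter: every entity gets a distinct ID regardless of type.
std::vector<std::string> ids_shared(const std::vector<std::string>& types) {
    int counter = 0;
    std::vector<std::string> ids;
    for (std::size_t i = 0; i < types.size(); ++i) {
        ids.push_back("entity-" + std::to_string(++counter));
    }
    return ids;
}
```

Fed the type sequence from the report (three Pathological_formation, one Cell, one Pathological_formation, two Multi-tissue_structure), `ids_per_type` yields entity-1, 2, 3, 1, 4, 1, 2, the same pattern as the output above, while `ids_shared` yields entity-1 through entity-7.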

@spyysalo
Member Author

spyysalo commented Jun 5, 2012

I think unique IDs would be a benefit for all output options. Miwa-san is currently planning to use NERsuite in an extraction pipeline that uses the "standoff" output format, and hopes to avoid duplicate IDs without having to run a separate script, if possible.
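For reference, the kind of separate renumbering step mentioned above could look roughly like the sketch below. It is a workaround under stated assumptions, not part of NERsuite: the function name `renumber_ids` is hypothetical, each line is assumed to carry at most one `id="..."` attribute as in the output shown earlier, and the input is assumed to be a single document (document boundaries are not detected).

```cpp
#include <string>
#include <vector>

// Hypothetical post-processing helper; not part of NERsuite.
// Rewrites the value of the first id="..." attribute on each line with a
// running counter so that IDs are unique across the whole input.
std::vector<std::string> renumber_ids(const std::vector<std::string>& lines) {
    const std::string key = "id=\"";
    std::vector<std::string> out;
    int counter = 0;
    for (const std::string& line : lines) {
        std::string::size_type start = line.find(key);
        std::string::size_type end = std::string::npos;
        if (start != std::string::npos) {
            // Position of the quote that closes the attribute value.
            end = line.find('"', start + key.size());
        }
        if (end == std::string::npos) {
            out.push_back(line);  // no complete id attribute: keep line as-is
            continue;
        }
        out.push_back(line.substr(0, start + key.size()) +
                      "entity-" + std::to_string(++counter) +
                      line.substr(end));
    }
    return out;
}
```

Offsets and `type` attributes are left untouched; only the ID value changes.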

@priancho
Member

While I was trying to add this functionality today, I found that Sampo had already added it.

@spyysalo
Member Author

@priancho : are you sure? If you're referring to 2775f22, that appears to apply to brat-flavored standoff only.

@priancho priancho reopened this Jun 25, 2012
@priancho
Member

Oh, sorry about my mistake. I am now working on this.
The brat output option will soon use unique entity IDs regardless of semantic type :-)

@priancho
Member

The brat output option (-o brat) now generates unique IDs for all entities regardless of their semantic types. It also counts IDs at the document level, whereas the other options (-o conll, -o standoff) still count IDs at the sentence level.
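The difference between sentence-level and document-level counting can be sketched as follows. This is an illustration only: `assign_ids` and its parameters are hypothetical names, the document is modeled simply as a list of per-sentence entity counts, and the `entity-N` format is borrowed from the standoff output shown earlier rather than from any particular output option.

```cpp
#include <string>
#include <vector>

// Hypothetical illustration; not NERsuite's actual code.
// entities_per_sentence[i] is the number of entities found in sentence i.
// With reset_per_sentence == true the counter restarts for each sentence
// (sentence-level IDs); with false it runs across the whole document.
std::vector<std::string> assign_ids(const std::vector<int>& entities_per_sentence,
                                    bool reset_per_sentence) {
    std::vector<std::string> ids;
    int counter = 0;
    for (int count : entities_per_sentence) {
        if (reset_per_sentence) {
            counter = 0;  // sentence-level scope: numbering starts over
        }
        for (int i = 0; i < count; ++i) {
            ids.push_back("entity-" + std::to_string(++counter));
        }
    }
    return ids;
}
```

For a document with two entities in the first sentence and one in the second, sentence-level counting gives entity-1, entity-2, entity-1, while document-level counting gives entity-1, entity-2, entity-3.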

@spyysalo
Member Author

@priancho : thanks, but I think this issue actually applies to the -o standoff option, not to the -o brat one. From the original:

When run with the -o standoff option, NERsuite output contains duplicate IDs (within a single input document).

@priancho
Member

Hi, sorry for my mistake.
I have now applied the same functionality to the standoff output format :-)

@spyysalo
Member Author

Great, thanks!

S

On Mon, Jun 25, 2012 at 7:29 PM, Han-Cheol Cho wrote:

> Hi, sorry for my mistake.
> I applied the same functionality for the standoff format output :-)
