
Creating custom dataset from scratch #27

Closed
appledora opened this issue Sep 13, 2021 · 4 comments

Comments

@appledora

I was wondering whether you download the training data in the original SciERC format (as shown here) and then reformat it automatically before training the model. I ask because I'm a little confused about whether to format my custom dataset like SciERC or like your input data format, as shown in the README.md. Also, SciERC is annotated sentence-wise, as per the aforementioned link. How does PURE handle multi-sentence passages? The pretrained models downloaded via the link in the repo's README.md are also labeled as single-sentence models.

@a3616001
Member

Hi, we downloaded the processed SciERC (link), which is in the same format as shown on the README.
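For a custom dataset, a minimal sketch of building one record in that layout might look like the following. It assumes the README format is the SciERC-style JSON-lines layout (one document per line with `doc_key`, `sentences`, `ner`, and `relations`, with span boundaries indexed at the document level); `make_record` and the sample labels are hypothetical, so check the field names against the README before relying on it.

```python
import json


def make_record(doc_key, sentences, ner_local, rel_local):
    """Build one document record from sentence-local annotations.

    sentences: list of token lists, one per sentence
    ner_local: per sentence, (start, end, label) with inclusive,
               sentence-local token indices
    rel_local: per sentence, (s1, e1, s2, e2, label), also sentence-local
    """
    ner, relations = [], []
    offset = 0  # number of tokens in all preceding sentences
    for tokens, ents, rels in zip(sentences, ner_local, rel_local):
        # Assumption: the target format indexes spans at the document level,
        # so every sentence-local index is shifted by the running offset.
        ner.append([[s + offset, e + offset, label] for s, e, label in ents])
        relations.append(
            [[s1 + offset, e1 + offset, s2 + offset, e2 + offset, label]
             for s1, e1, s2, e2, label in rels]
        )
        offset += len(tokens)
    return {
        "doc_key": doc_key,
        "sentences": sentences,
        "ner": ner,
        "relations": relations,
    }


record = make_record(
    "doc-0",  # hypothetical document ID
    [["BERT", "is", "a", "language", "model", "."],
     ["It", "uses", "attention", "."]],
    [[(0, 0, "Method"), (3, 4, "Generic")], [(2, 2, "Method")]],
    [[(0, 0, 3, 4, "HYPONYM-OF")], []],
)
line = json.dumps(record)  # one such line per document in the data file
```

Sentences without entities or relations keep an empty list, so the three per-sentence lists stay the same length.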

In our experiments, we do not consider cross-sentence relations. All the datasets (ACE04, ACE05, SciERC) are annotated at the sentence level, so we consider each sentence individually if a passage contains several. For cross-sentence relations, you can concatenate the sentences and apply our approach to the concatenated input.
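The concatenation idea can be sketched roughly as below. This is an illustration rather than code from the repository: `merge_sentences` is a hypothetical helper, and it assumes each document is a dict with `sentences`, `ner`, and `relations` lists whose span indices are already document-level, so merging needs no index shifting.

```python
def merge_sentences(doc):
    """Collapse all sentences of a document into one long sentence.

    Assumes span boundaries are document-level token indices, so the
    existing entity and relation spans remain valid after merging.
    """
    merged = dict(doc)  # shallow copy; other fields are carried over as-is
    merged["sentences"] = [[tok for sent in doc["sentences"] for tok in sent]]
    merged["ner"] = [[ent for sent in doc["ner"] for ent in sent]]
    merged["relations"] = [[rel for sent in doc["relations"] for rel in sent]]
    return merged


doc = {
    "doc_key": "doc-0",  # hypothetical document ID
    "sentences": [["A", "binds", "B", "."], ["It", "inhibits", "C", "."]],
    "ner": [[[0, 0, "Material"], [2, 2, "Material"]], [[6, 6, "Material"]]],
    "relations": [[], []],
}
merged = merge_sentences(doc)
# A cross-sentence relation can now live in the single relations list,
# e.g. between token 0 ("A") and token 6 ("C"); the label is illustrative.
merged["relations"][0].append([0, 0, 6, 6, "USED-FOR"])
```

Long passages may exceed the encoder's maximum input length once merged, so concatenating a window of neighbouring sentences instead of the whole document may be the safer choice.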

@appledora
Author

A somewhat complementary question: the processed dataset has an additional clusters node for each document. What does it refer to, and is it important?

@serenalotreck

I think those are coreference clusters, as the model that SciERC was created for also incorporates coreference resolution. I believe this pipeline doesn't do that, so they probably aren't used?

@a3616001
Member

The clusters field contains the coreference annotations. Our approach doesn't use those annotations, so you may ignore this field when using our code.
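If you would rather strip the field from existing files than carry it around, a small sketch along these lines should do. It assumes the data is stored as JSON lines (one document per line); `drop_clusters` is a hypothetical helper, and whether the loader tolerates a missing `clusters` key is an assumption worth testing on a small file first.

```python
import json


def drop_clusters(jsonl_text):
    """Return the JSON-lines text with the 'clusters' key removed from each document."""
    out = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        doc.pop("clusters", None)  # no error if the key is absent
        out.append(json.dumps(doc))
    return "\n".join(out)


sample = "\n".join([
    json.dumps({"doc_key": "d0", "sentences": [["a", "b"]], "clusters": [[[0, 0], [1, 1]]]}),
    json.dumps({"doc_key": "d1", "sentences": [["c"]]}),
])
cleaned = drop_clusters(sample)
```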


3 participants