tokenise raw sentence input #26

Closed
jonorthwash opened this issue May 2, 2017 · 15 comments
@jonorthwash
Owner

Take a bunch of sentences and import as multiple sentences.

@arademaker
Contributor

It would need a backend to run the parser, right? Isn't that out of the scope of this tool?

@jonorthwash
Owner Author

jonorthwash commented May 25, 2017

No, it's not outside the scope, and doesn't need a backend. The sentence tokeniser would default to splitting on [.!?] (or similar) and would provide a simple interface for adjusting the boundaries. The interface could be something like the input text with highlighted characters to split on that allows for selecting/unselecting them, or even better, allows a "combine with next tokenised sentence" option and "split sentence here" option in the regular interface.
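
To make the proposal above concrete, here is a minimal Python sketch of a default splitter on `[.!?]` plus the two correction actions ("combine with next" and "split sentence here"). The function names `split_sentences`, `merge_with_next`, and `split_at` are hypothetical and not taken from the project's code; this only illustrates the idea and assumes boundaries are followed by whitespace.

```python
import re

# Default boundary: whitespace that follows ., ! or ? (the [.!?] default).
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split raw text into candidate sentences at the default boundaries."""
    return [s for s in BOUNDARY.split(text.strip()) if s]


def merge_with_next(sentences, i):
    """Undo a wrong split: join sentence i with sentence i + 1."""
    merged = sentences[i] + " " + sentences[i + 1]
    return sentences[:i] + [merged] + sentences[i + 2:]


def split_at(sentences, i, offset):
    """Add a missed boundary: split sentence i at the given character offset."""
    left = sentences[i][:offset].strip()
    right = sentences[i][offset:].strip()
    return sentences[:i] + [left, right] + sentences[i + 1:]


sentences = split_sentences("Dr. Smith arrived. He sat down!")
# The abbreviation "Dr." causes a wrong split, which the user can undo.
sentences = merge_with_next(sentences, 0)
print(sentences)
```

Both correction functions return a new list rather than mutating the input, which would make an undo feature in the interface straightforward.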

@arademaker
Contributor

OK, so you are proposing a bootstrapping process: from a string, produce a list of tokens and let the user create the tree.

@jonorthwash
Owner Author

Exactly. Several of our other issues here are also for bootstrapping tree-creation, I suppose.

@ftyers ftyers added this to the Phase 1 milestone Jun 2, 2017
@maryszmary
Collaborator

Hm, so this issue means something like adding an option to split several sentences on top of what is done in #1?

@jonorthwash
Owner Author

Yeah, more or less. Though for now, supporting one sentence per line should probably be enough. @ftyers, what do you think?

@maryszmary
Collaborator

What is the input format? A file, or through the textbox, or both? Then if the input is in the textbox, how should it store the tokenized sentences – like loaded from a file: it should store several sentences and let the user move between them using previous sentence / next sentence buttons? Should they be automatically converted to CoNLL-U or should they remain in plain text format until the user clicks the "convert" button?

@jonorthwash
Owner Author

jonorthwash commented Jun 24, 2017

> What is the input format? A file, or through the textbox, or both?

Probably both.

> Then if the input is in the textbox, how should it store the tokenized sentences – like loaded from a file: it should store several sentences and let the user move between them using previous sentence / next sentence buttons?

Yeah, I think that makes the most sense.

> Should they be automatically converted to CONLL-U or should they remain in plain text format before the user clicks the "convert" button?

I think at any point that text is imported, it should be converted to the underlying format, which as discussed previously should probably be conllu. So "import" for plain text implies "convert to conllu" for me. In this case "convert to conllu" probably isn't quite the right approach for single sentences either.
I'm not committed to this idea though—we can discuss other options too. What do you think?
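
As a rough illustration of "import implies convert to conllu", the Python sketch below turns one-sentence-per-line plain text into unannotated CoNLL-U: ID and FORM are filled in and the remaining eight columns are left as `_`. The regex tokeniser and the names `sentence_to_conllu` and `text_to_conllu` are assumptions made for this example, not the tool's actual implementation.

```python
import re

# Naive tokeniser (an assumption): runs of word characters, or single
# punctuation marks, each become one token.
TOKEN = re.compile(r"\w+|[^\w\s]")


def sentence_to_conllu(sentence, sent_id):
    """Render one plain-text sentence as an unannotated CoNLL-U block."""
    lines = [f"# sent_id = {sent_id}", f"# text = {sentence}"]
    for index, form in enumerate(TOKEN.findall(sentence), start=1):
        # ID and FORM are known; the other eight columns stay "_".
        lines.append("\t".join([str(index), form] + ["_"] * 8))
    # CoNLL-U ends every sentence with a blank line.
    return "\n".join(lines) + "\n\n"


def text_to_conllu(text):
    """Convert plain text with one sentence per line to CoNLL-U."""
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    return "".join(
        sentence_to_conllu(s, i) for i, s in enumerate(sentences, start=1)
    )


print(text_to_conllu("This is a sample plain text!\nWhy is it here?"))
```

Because the `# text` comment keeps the original string, the plain-text view can be recovered from the CoNLL-U representation later.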

@maryszmary
Collaborator

> I think at any point that text is imported, it should be converted to the underlying format, which as discussed previously should probably be conllu. So "import" for plain text implies "convert to conllu" for me.

Ok, so, from your point of view, plain text should be converted to conllu instantly (and then be viewed as conllu)?

@jonorthwash
Owner Author

jonorthwash commented Jun 24, 2017

Not necessarily "instantly"—we should wait for the user to finish typing / pasting / whatever. And it doesn't have to be displayed as conllu right away.

I guess I kind of imagine tabs across the top of the textbox (like with GitHub's comment box). The default tab is something like "automatic", and as soon as it detects the format of the input, it switches to the tab for that format (seamlessly, i.e., it just changes which tab is highlighted). So in this case, it would switch to the "plain text" tab, but then you could click the conllu or CG tab to view or edit it in those formats. Further modifications to the contents of the plain text tab would only add/remove/edit tokens, and not change any of the dependencies or POS info (etc.) visible in the other formats.
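
The "automatic" tab needs some way to guess the input format. A minimal Python sketch of such a heuristic follows; the function name `detect_format`, the returned labels, and the specific rules (ten tab-separated fields for CoNLL-U, `"<wordform>"` cohort lines for CG) are assumptions chosen for illustration, and a real detector would likely need to be more tolerant of malformed input.

```python
def detect_format(text):
    """Guess whether pasted text is CoNLL-U, CG, or plain text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    # Comment lines start with "#" and carry no format signal here.
    content = [ln for ln in lines if not ln.startswith("#")]
    # CoNLL-U: every non-comment line has exactly 10 tab-separated fields.
    if content and all(len(ln.split("\t")) == 10 for ln in content):
        return "conllu"
    # CG stream format: cohort lines look like "<wordform>".
    if any(ln.startswith('"<') and ln.rstrip().endswith('>"') for ln in content):
        return "cg"
    return "plain text"


print(detect_format("It works!"))
```

The returned label would then only decide which tab is highlighted; the user can still switch tabs manually if the guess is wrong.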

@maryszmary
Collaborator

So, the interface now allows importing plain text from a file or entering it in the textbox, and then converting all the sentences to CoNLL-U at once. I tested it on the text "This is a sample plain text! Why is it here? It exists for testing how annotatrix works. It Works!".

This is how it looks when the text is loaded:
image
image

Then I press the button Convert to CoNLL-U:
image
image

I can also insert it in the textbox:
image

And then press Convert to CoNLL-U:
image

@maryszmary
Collaborator

> I guess I kind of imagine tabs across the top of the textbox
> ...
> So in this case, it would switch to the "plain text" tab, but then you could click the conllu or CG tab to view it or edit it in those formats.

This functionality is probably a part of #40.

@jonorthwash
Owner Author

Yes, though the "auto" mode is definitely related to this issue. It can certainly be dealt with as part of #40 though.

@maryszmary
Collaborator

Ok, so can this issue be closed then, or is there some other functionality to add that is not described in #40?

@jonorthwash
Owner Author

Yeah, I think this issue is done.


4 participants