tokenise raw sentence input #26
Comments
It would need a backend to run the parser, right? Isn't that outside the scope of this tool?
No, it's not outside the scope, and doesn't need a backend. The sentence tokeniser would default to splitting on [.!?] (or similar) and would provide a simple interface for adjusting the boundaries. The interface could be something like the input text with highlighted characters to split on that allows for selecting/unselecting them, or even better, allows a "combine with next tokenised sentence" option and "split sentence here" option in the regular interface.
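A minimal sketch of what such a splitter could look like, in Python purely for illustration. The function names (`split_sentences`, `combine_with_next`, `split_here`) and the exact boundary pattern are assumptions, not anything that exists in the tool; the two helpers stand in for the "combine with next tokenised sentence" and "split sentence here" actions.

```python
import re

# Assumed default: a boundary is whitespace that follows '.', '!' or '?'.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split raw text into sentences on the default boundaries."""
    return [s for s in BOUNDARY.split(text.strip()) if s]


def combine_with_next(sentences, i):
    """Merge sentence i with sentence i + 1 (undo a wrong split)."""
    merged = sentences[i] + " " + sentences[i + 1]
    return sentences[:i] + [merged] + sentences[i + 2:]


def split_here(sentences, i, offset):
    """Split sentence i at a character offset (add a missed boundary)."""
    left = sentences[i][:offset].strip()
    right = sentences[i][offset:].strip()
    return sentences[:i] + [left, right] + sentences[i + 1:]


sentences = split_sentences("Dr. Smith arrived. He sat down! Why?")
# The abbreviation "Dr." is wrongly treated as a sentence end,
# so the user would merge the first two pieces by hand.
fixed = combine_with_next(sentences, 0)
```

The default split is deliberately naive; the point is that the user corrects the result rather than the tool trying to get abbreviations right on its own.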
OK, you are proposing a bootstrapping process. From a string, produce a list of tokens and let the user create the tree.
Exactly. Several of our other issues here are also for bootstrapping tree-creation, I suppose.
Hm, so, this issue means something like "add an option to split several sentences" to what is done in #1?
Yeah, more or less. Though for now supporting sentences one-per-line should probably be enough. @ftyers, what do you think?
What is the input format? A file, or through the textbox, or both? Then if the input is in the textbox, how should it store the tokenized sentences? Like when loaded from a file: should it store several sentences and let the user move between them using previous sentence / next sentence buttons? Should they be automatically converted to CoNLL-U, or should they remain in plain text format until the user clicks the "convert" button?
Probably both.
Yeah, I think that makes the most sense.
I think at any point that text is imported, it should be converted to the underlying format, which as discussed previously should probably be conllu. So "import" for plain text implies "convert to conllu" for me. In this case "convert to conllu" probably isn't quite the right approach for single sentences either.
Ok, so, from your point of view, plain text should be converted to conllu instantly (and then be viewed as conllu)?
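To make "import implies convert" concrete, here is a rough sketch, assuming one sentence per line as suggested earlier in the thread. The names `to_conllu` and `import_plain_text` and the simple word/punctuation tokeniser are hypothetical. The only thing taken from the CoNLL-U format itself is the layout: ten tab-separated columns per token, `_` for unspecified fields, `#` comment lines, and a blank line after each sentence.

```python
import re

# Assumed tokeniser: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def to_conllu(sentence, sent_id):
    """Render one plain-text sentence as an unannotated CoNLL-U block."""
    lines = [f"# sent_id = {sent_id}", f"# text = {sentence}"]
    for i, form in enumerate(TOKEN.findall(sentence), start=1):
        # ID and FORM are filled in; the other eight columns stay empty.
        lines.append("\t".join([str(i), form] + ["_"] * 8))
    return "\n".join(lines) + "\n\n"


def import_plain_text(text):
    """Treat each non-empty line as one sentence and convert all of them."""
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    return "".join(to_conllu(s, n) for n, s in enumerate(sentences, start=1))


conllu = import_plain_text("Hello world.\nBye!")
```

The output has no heads or relations yet, so it is a starting point for annotation rather than a finished tree.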
Not necessarily "instantly": we should wait for the user to finish typing / pasting / whatever. And it doesn't have to be displayed as conllu right away. I guess I kind of imagine tabs across the top of the textbox (like with GitHub's comment box). The default tab is something like "automatic", and as soon as it detects the input format, it switches to the tab for that format (seamlessly, i.e., it just changes which tab is highlighted). So in this case, it would switch to the "plain text" tab, but then you could click the conllu or CG tab to view or edit it in those formats. Further modifications to the contents of the plain text tab would only add/remove/edit tokens, and not change any of the dependencies or POS info (etc.) visible in the other formats.
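The "automatic" tab could rest on a small heuristic along these lines. This is only a sketch: `detect_format` is a made-up name, and the rules (ten tab-separated columns means conllu, a line starting with `"<` means a CG cohort, anything else is plain text) are assumptions about what would be good enough in practice.

```python
def detect_format(text):
    """Guess which tab should be highlighted for the given input."""
    # Ignore blank lines and '#' comment lines when guessing.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    if not lines:
        return "unknown"
    # CoNLL-U token lines have exactly ten tab-separated fields.
    if all(len(ln.split("\t")) == 10 for ln in lines):
        return "conllu"
    # CG cohorts are written as "<wordform>" at the start of a line.
    if any(ln.startswith('"<') for ln in lines):
        return "cg"
    return "plain text"
```

Running it only after the user pauses typing or pasting would match the "not instantly" behaviour described above.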
This functionality is probably a part of #40.
Yes, though the "auto" mode is definitely related to this issue. It can certainly be dealt with as part of #40 though.
Ok, so, can this issue be closed then, or is there some other functionality to add, which is not described in #40?
Yeah, I think this issue is done.
Take a bunch of sentences and import as multiple sentences.