echo "storeDiagramString:true, text: But as usual,we couldn't make it stick." | nc localhost 9000 #743
Comments
With the current setup, most punctuation that has no white space before or after it is not tokenized. such in However, you can experiment with tokenizing them too. Current definition: |
I just added ellipses, commas, and semicolons to the MPUNC list; this will be in version 5.5.0 later today. Closing. |
What is the reason of including We looked for 3-Amino-3-azabicyclo[3.3.0]octane hydrochloride
(In this particular case, however, the first parse, which uses the alternative of not splitting the word, may be better.) |
Middle-splitting on
we now get also:
which is incorrect. If Middle-splitting on The problem of an exponential increase in the number of parses (2**(number of splits)) will still remain even then, unless we find a compact representation of parses (you wrote about that need in another context). Also note that forbidding splits for tokens that match regexes was abandoned long ago in favor of creating alternatives. Returning to the old way would rule out parsing many useful constructs. |
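To make the 2**(number of splits) growth concrete, here is a minimal Python sketch. It is not link-grammar's tokenizer: the `MPUNC` tuple and both function names are assumptions made for illustration. Each token keeps its unsplit form and, if a listed mark occurs strictly inside it, gains one split alternative, so a sentence with k such tokens yields 2**k tokenizations.

```python
from itertools import product

# Hypothetical stand-in for the MPUNC list; the real one lives in the dictionary.
MPUNC = (",", ";", "...")

def alternatives(token, mpunc=MPUNC):
    """Return the tokenization alternatives for a single token.

    The unsplit token is always kept. If a mark from `mpunc` occurs strictly
    inside the token, one extra alternative splits around its first occurrence.
    """
    result = [(token,)]
    for mark in mpunc:
        pos = token.find(mark, 1)  # skip position 0: not a *middle* split
        if 0 < pos < len(token) - len(mark):
            result.append((token[:pos], mark, token[pos + len(mark):]))
            break  # at most one split alternative per token in this sketch
    return result

def tokenizations(sentence, mpunc=MPUNC):
    """Enumerate every combination of the per-token alternatives."""
    per_token = [alternatives(tok, mpunc) for tok in sentence.split()]
    return [sum(choice, ()) for choice in product(*per_token)]

if __name__ == "__main__":
    for toks in tokenizations("But as usual,we couldn't make it stick."):
        print(toks)
```

The sentence from this issue has one splittable token, so it produces two tokenizations; three splittable tokens would already produce eight.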
Ouch. I added only "[" and not "]" because I was looking at Project Gutenberg texts, which had footnote and reference constructions, e.g. "Studies show that this happens[32]." where the [32] is some reference. I will add closing square brackets now. The addition of the comma is the law of unintended consequences. Perhaps I should remove the comma, for now. I don't see any easy, obvious solution. The complex solutions all seem to be problematic. We have distinct needs: bad punctuation, spelling errors, European morphology, Hebrew morphology, other (e.g. Turkish) morphology. One unified system for all this is possibly not enough. For spelling errors, single-word substitution is a short-term patch; I did it because it's an easy, cheap stunt. The correct long-term approach is some kind of transformative grammar, where we can mutate sentences from unusual forms into "standard" forms. Some of that mutation is bad-spelling, bad-grammar related. |
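For the footnote case above, a rough Python sketch of separating a bracketed reference from the word it is attached to might look like the following. This is only an illustration and not how the MPUNC list works (MPUNC splits on individual marks such as "["); the `FOOTNOTE` pattern and the function name are assumptions.

```python
import re

# Assumed pattern: a purely numeric reference in square brackets, e.g. "[32]".
FOOTNOTE = re.compile(r"\[\d+\]")

def split_footnote_refs(token):
    """Split a token into the text around each bracketed numeric reference."""
    pieces = []
    last = 0
    for match in FOOTNOTE.finditer(token):
        if match.start() > last:
            pieces.append(token[last:match.start()])  # text before the reference
        pieces.append(match.group())                  # the reference itself
        last = match.end()
    if last < len(token):
        pieces.append(token[last:])                   # trailing text, e.g. "."
    return pieces

if __name__ == "__main__":
    print(split_footnote_refs("happens[32]."))
```

Treating the whole "[32]" as one unit avoids the unbalanced-bracket problem, at the cost of handling only references that match the assumed pattern.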
Most probably.
Many, if not all, of these things can be done with the current LG, if definition and lookup limitations are removed (e.g. extending idiom definitions to allow An additional important improvement is a general way to detect "virtual morphemes" like (but not limited to) phonology and capitalization, and to use appropriate disjuncts for them, so that everything can be done by the dict without a need to separate words into different files etc. (like a/an, which can be very complex for complex phonology) or to add all capitalized words with their own rules. Such "virtual morphemes" are exactly like doing such separations: they are just practical shortcuts that are easy to define and modify (I'm working on that).
I think it is definitely possible. |
I fixed a bad formatting of my previous message, so please read it on GitHub... |
OK, that's done now. It's not obviously terribly useful, unless you have some good example that you haven't told me (or I've forgotten).
Can you provide examples? For this, and each of the points below, could you maybe open an issue, and place a description of the problem, an example, and a proposed fix?
Again, not sure what that means.
What would that do, and how would that help?
OK. Yes, that seemed useful at the time, but perhaps has outgrown its utility.
Yes. Well, the REGPARTS/REGMID/etc. was an attempt to do that; I'm not convinced it ever worked very well, but also, we've just barely ever used it. Or are you referring to something else, some replacement, some different way of doing things? I like the general sentiment; please do this. I'm flat out of bright ideas at this particular moment. |
I had a markdown problem in my post, and thought I had corrected it. But the post still contained this markdown problem. I tried again to fix it, and now the format is OK. Anyway, here is the corrected line (I forgot to use backquotes for
I indeed have never updated you about a possible usage of such constructs. I tried to solve the case of collocations with holes. I will open an issue to describe the test I did.
This doesn't parse:
This parses fine:
I can open an issue with my proposal for solving it.
This is one of the many things that are not "orthogonal": if you would like to add a regex, like what I did in my test, you interfere with the rest of the regexes.
Not exactly. Its only intended use is for `amy` and similar languages, to denote tokens that should not be produced by random splitting.
Yes. Say we add an ability to denote regex tokens in the xPUNC definitions.
I have a list of things to implement. I can try to open issues on each of them. |
Yes, please.
Oh, that's interesting! Yeah, I guess I like that.
I can try to give it. I only have a finite amount of ability to read, focus and respond, and am already operating at the limit :-) |
p.s. when should I publish version 5.5.0? |
Just now, if you are willing to have the rand fix for the next release only. |
usual,we