Implosion/to_s problem with Enclitics #68

Open
n8 opened this Issue Jan 13, 2014 · 2 comments


@n8
n8 commented Jan 13, 2014
    require 'treat'
    include Treat::Core::DSL  # provides the sentence(...) builder

    text = "It's about time."
    text = sentence(text).apply(:tokenize, :parse)
    puts text.to_s

Results in:

It 's about time.

Shouldn't to_s return the string without the extra space between It and 's?

@chrisanderton

"it's" is a contraction. For tokenisation, contractions are often treated as two words (because they really are). This is the case in Stanford CoreNLP: http://stackoverflow.com/questions/14058399/stanford-corenlp-split-words-ignoring-apostrophe
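To make the "two words" point concrete, here is a minimal sketch of a naive tokenizer that splits apostrophe-led enclitics off the preceding word. The method name `naive_tokenize` and the regex are assumptions for illustration only; this is neither Treat's nor CoreNLP's tokenizer, and it does not handle forms like "n't" or quoted words.

```ruby
# Hypothetical, deliberately naive tokenizer (not Treat's or CoreNLP's).
# Alternatives are tried left to right at each position:
#   '\w+    an apostrophe-led enclitic such as 's or 'll
#   \w+     an ordinary word
#   [^\w\s] a single punctuation character
def naive_tokenize(text)
  text.scan(/'\w+|\w+|[^\w\s]/)
end

p naive_tokenize("It's about time.")
# => ["It", "'s", "about", "time", "."]
```

Joining these tokens back with a plain space is exactly what produces "It 's".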

One option, as suggested in the above link, would be to handle enclitics when the tokens are joined back together, i.e. in the implode method. In Treat this would be in the module Treat::Entities::Entity::Stringable.
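A rough sketch of what enclitic-aware joining could look like on a flat token list. The names `ENCLITICS`, `PUNCTUATION` and `implode_tokens` are made up for this example and the enclitic list is only a guess; the real Stringable#implode works on Treat entities, not on arrays of strings.

```ruby
# Hypothetical lists, not taken from Treat.
ENCLITICS   = ["'s", "'ll", "'re", "'ve", "'d", "'m", "n't"].freeze
PUNCTUATION = [".", ",", "!", "?", ";", ":"].freeze

# Join tokens with single spaces, except that enclitics and punctuation
# attach directly to the token on their left.
def implode_tokens(tokens)
  tokens.each_with_object(+'') do |token, out|
    attach_left = ENCLITICS.include?(token.downcase) || PUNCTUATION.include?(token)
    out << ' ' unless out.empty? || attach_left
    out << token
  end
end

puts implode_tokens(["It", "'s", "about", "time", "."])
# prints: It's about time.
```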

@chrisanderton

so, it looks like the issue is with the current implode method on Stringable. It does attempt to handle enclitics, but from what I can see in the current implementation 'value' is still blank at that point, so calling strip! on it makes no difference. The space is added when the imploded parts are merged, which happens outside the scope of the strip!, so it is still there in the result.

here's a fixed version: I modified the recursive call to pass the value string along, so all operations are performed on that one string instead of on multiple copies. Disclaimer: I only started looking at Treat about 3 hours ago!
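For anyone following along, the difference between the two approaches can be sketched roughly like this. Everything here (`Node`, `CLITICS`, `implode_buggy`, `implode_fixed`) is a hypothetical stand-in written from the description above, not the actual Treat code or the code in the linked commit.

```ruby
# Hypothetical stand-in for a Treat entity; the real classes differ.
Node = Struct.new(:value, :children)

CLITICS = ["'s", "'ll", "'re", "'ve", "'d", "'m", "n't"].freeze

# Pattern described as broken: each call builds its own local string, so
# strip! runs on an empty string and the space added by the parent survives.
def implode_buggy(node)
  value = +''
  if node.children.empty?
    value.strip! if CLITICS.include?(node.value) # no-op: value is still ""
    value << node.value
  else
    node.children.each do |child|
      value << ' ' unless value.empty?           # space added by the parent
      value << implode_buggy(child)
    end
  end
  value
end

# Pattern described as the fix: one shared string is passed down the
# recursion, so rstrip! can remove the space written before the enclitic.
def implode_fixed(node, value = +'')
  if node.children.empty?
    value.rstrip! if CLITICS.include?(node.value)
    value << node.value << ' '
  else
    node.children.each { |child| implode_fixed(child, value) }
  end
  value
end

tokens   = ["It", "'s", "about", "time"].map { |t| Node.new(t, []) }
sentence = Node.new(nil, tokens)

puts implode_buggy(sentence)        # prints: It 's about time
puts implode_fixed(sentence).rstrip # prints: It's about time
```

The final rstrip in the usage only removes the trailing space that this sketch appends after the last token.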

chrisanderton@d9b912f

for the same code, this now gives:

It's about time.