#word_count incorrectly counts contractions as two words #94

Closed
ojak opened this issue Feb 26, 2015 · 6 comments

@ojak
Contributor

ojak commented Feb 26, 2015

Contractions are counted as two words instead of one with #word_count.

For example, calling #word_count on the following six-word sentence currently returns an incorrect count of 7:

sentence("This sentence doesn't contain seven words.").tokenize.word_count
=> 7

This occurs because #tokenize splits the word doesn't into two tokenized words: does and n't.

Perhaps #word_count should accept arguments, so that the method's default behavior is an accurate word count while the total tokenized count can still be requested explicitly? Or perhaps expose a #count method on a tokenized segment. Something like:

sentence = sentence("This sentence doesn't contain seven words.")

sentence.tokenize.word_count
=> 6
sentence.tokenize.word_count(double_count_contractions: true)
=> 7

# Or, just expose a `#count` or `#length` method on a tokenized segment
sentence.tokenize.count
=> 7

Any thoughts?

@louismullie
Owner

I agree this would be confusing. What does sentence.print_tree show? The "n't" is supposed to be of class Enclitic (which is a descendant of Token, not Word). So token_count should return 7 and word_count should return 6.
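
To illustrate the distinction described above, here is a minimal sketch using hypothetical stand-in classes. These are not Treat's real entity classes, and the two counting helpers are assumptions made for illustration. The trailing period is left out of the token list so the numbers match the 7 and 6 mentioned in the comment.

```ruby
# Hypothetical stand-ins for the entity classes named above.
# They only mirror the inheritance relationship, not Treat's real code.
class Token
  attr_reader :value

  def initialize(value)
    @value = value
  end
end

class Word < Token; end
class Enclitic < Token; end

# Counts every token, whatever its concrete class.
def token_count(tokens)
  tokens.count { |t| t.is_a?(Token) }
end

# Counts only Word instances; an Enclitic is a Token but not a Word.
def word_count(tokens)
  tokens.count { |t| t.is_a?(Word) }
end

# "This sentence doesn't contain seven words" (period omitted).
tokens = [
  Word.new("This"), Word.new("sentence"), Word.new("does"),
  Enclitic.new("n't"), Word.new("contain"), Word.new("seven"),
  Word.new("words")
]

puts token_count(tokens) # 7
puts word_count(tokens)  # 6
```

If "n't" were built as a Word instead of an Enclitic, both helpers would return 7, which matches the behavior reported in this issue.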

@ojak
Contributor Author

ojak commented Feb 26, 2015

Here's print_tree without parsing:

> sentence("This sentence doesn't contain seven words.").tokenize.print_tree
+ Sentence (70319169704620)  --- "This sentence does [...] seven words."  ---  {}   --- []
|
+--> Word (70319118319320)  --- "This"  ---  {}   --- []
+--> Word (70319118317200)  --- "sentence"  ---  {}   --- []
+--> Word (70319118315180)  --- "does"  ---  {}   --- []
+--> Word (70319118313160)  --- "n't"  ---  {}   --- []
+--> Word (70319118286420)  --- "contain"  ---  {}   --- []
+--> Word (70319118284200)  --- "seven"  ---  {}   --- []
+--> Word (70319118282040)  --- "words"  ---  {}   --- []
+--> Punctuation (70319118279820)  --- "."  ---  {}   --- []

And with parsing:

> sentence("This sentence doesn't contain seven words.").tokenize.parse.print_tree
+ Sentence (70319171751360)  --- "This sentence does [...] seven words."  ---  {:tag_set=>:penn}   --- []
|
+--+ Phrase (70319163075360)  --- "This sentence"  ---  {:tag=>"NP"}   --- []
   |
   +--> Word (70319162384740)  --- "This"  ---  {:tag=>"DT"}   --- []
   +--> Word (70319161753420)  --- "sentence"  ---  {:tag=>"NN"}   --- []
+--+ Phrase (70319160833780)  --- "does n't contain seven words"  ---  {:tag=>"VP"}   --- []
   |
   +--> Word (70319160302720)  --- "does"  ---  {:tag=>"VBZ"}   --- []
   +--> Word (70319159626700)  --- "n't"  ---  {:tag=>"RB"}   --- []
   +--+ Phrase (70319125324640)  --- "contain seven words"  ---  {:tag=>"VP"}   --- []
      |
      +--> Word (70319124574860)  --- "contain"  ---  {:tag=>"VB"}   --- []
      +--+ Phrase (70319123708220)  --- "seven words"  ---  {:tag=>"NP"}   --- []
         |
         +--> Word (70319134976880)  --- "seven"  ---  {:tag=>"CD"}   --- []
         +--> Word (70319122487540)  --- "words"  ---  {:tag=>"NNS"}   --- []
+--> Punctuation (70319134697140)  --- "."  ---  {:tag=>"."}   --- []

@louismullie
Owner

So the real issue is that "n't" is being tokenized as a Word when it should be an Enclitic, which means there's a bug in the default :ptb tokenizer. Almost all of the tokenizers use this method to create tokens, and the enclitic should be parsed appropriately (line 350). Can you check what is happening there?

@ojak
Contributor Author

ojak commented Feb 26, 2015

Cool. I'll take a look, thanks for locating it.

@ojak
Contributor Author

ojak commented Feb 26, 2015

Yup. Typo in lib/treat/entities/entity/buildable.rb:18:

  Enclitics = %w['ll 'm 're 's 't 've 'nt]

Changed to:

  Enclitics = %w['ll 'm 're 's 't 've n't]

Works!

sentence("This sentence doesn't contain seven words.").tokenize.word_count
=> 6
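
As a rough illustration of why a one-character typo changes the count, here is a sketch that assumes tokens are classified by a plain membership test against the list. That lookup is an assumption: the actual logic in buildable.rb may differ, and `classify` and `word_count` below are hypothetical helpers, not Treat's API.

```ruby
# The two versions of the list quoted above.
BROKEN_ENCLITICS = %w['ll 'm 're 's 't 've 'nt].freeze
FIXED_ENCLITICS  = %w['ll 'm 're 's 't 've n't].freeze

# Hypothetical classifier: assumes a simple list lookup decides the type.
def classify(token, enclitics)
  enclitics.include?(token) ? :enclitic : :word
end

# Hypothetical counter: counts only tokens classified as words.
def word_count(tokens, enclitics)
  tokens.count { |t| classify(t, enclitics) == :word }
end

# Tokens as the tokenizer splits them (punctuation omitted).
tokens = %w[This sentence does n't contain seven words]

puts word_count(tokens, BROKEN_ENCLITICS) # 7, "n't" never matches "'nt"
puts word_count(tokens, FIXED_ENCLITICS)  # 6
```

Under this assumption the misspelled entry `'nt` can never match the token `n't`, so the token falls through to the word case.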

@louismullie
Owner

Sweet. Can you be so kind as to submit a PR?
