#word_count incorrectly counts contractions as two words #94

ojak · 2015-02-26T18:21:09Z

Contractions are counted as two words instead of one with #word_count.

For example, calling #word_count on the following six-word sentence currently returns an incorrect count of 7:

sentence("This sentence doesn't contain seven words.").tokenize.word_count
=> 7

This occurs because #tokenize splits the word doesn't into two tokenized words: does and n't.
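To make the split concrete, here is a minimal plain-Ruby sketch of a Penn Treebank style split. The `ptb_like_tokens` helper and its regexes are hypothetical and written only for illustration; this is not Treat's tokenizer.

```ruby
# Hypothetical PTB-style splitter, for illustration only (not Treat's code).
def ptb_like_tokens(text)
  text
    .gsub(/(\w)(n't)\b/i, '\1 \2') # detach the enclitic: doesn't -> does n't
    .gsub(/([.,!?])/, ' \1')       # detach basic punctuation
    .split
end

tokens = ptb_like_tokens("This sentence doesn't contain seven words.")
p tokens
# => ["This", "sentence", "does", "n't", "contain", "seven", "words", "."]

# Counting every non-punctuation token as a word reproduces the count of 7.
puts tokens.reject { |t| t == "." }.size # 7
```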

Perhaps #word_count should accept arguments, so that the method's default behavior is an accurate word count but the total tokenized count can still be requested explicitly? Or perhaps expose a #count method on a tokenized segment. Something like:

sentence = sentence("This sentence doesn't contain seven words.")

sentence.tokenize.word_count
=> 6
sentence.tokenize.word_count(double_count_contractions:true)
=> 7

# Or, just expose a `#count` or `#length` method on a tokenized segment
sentence.tokenize.count
=> 7

Any thoughts?

louismullie · 2015-02-26T18:29:07Z

I agree this would be confusing. What does sentence.print_tree show? The "n't" is supposed to be of class Enclitic (which is a descendant of Token, not Word). So token_count should return 7 and word_count should return 6.
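As a rough sketch of that distinction, the plain-Ruby classes below stand in for the entity types named above. They are hypothetical stand-ins, not Treat's implementation, and treating Punctuation as a Token subclass is an assumption. `count_words` and `count_words_and_enclitics` are made-up helper names.

```ruby
# Hypothetical stand-ins for the entity classes; not Treat's actual code.
class Token
  attr_reader :value

  def initialize(value)
    @value = value
  end
end

class Word < Token; end
class Enclitic < Token; end    # sibling of Word, so is_a?(Word) is false
class Punctuation < Token; end # assumption: also a Token subclass

# Count only true words, skipping enclitics and punctuation.
def count_words(tokens)
  tokens.count { |t| t.is_a?(Word) }
end

# Count words plus enclitics, i.e. what the buggy word count returned.
def count_words_and_enclitics(tokens)
  tokens.count { |t| t.is_a?(Word) || t.is_a?(Enclitic) }
end

SAMPLE_TOKENS = [
  Word.new("This"), Word.new("sentence"), Word.new("does"),
  Enclitic.new("n't"), Word.new("contain"), Word.new("seven"),
  Word.new("words"), Punctuation.new(".")
].freeze

puts count_words(SAMPLE_TOKENS)               # 6
puts count_words_and_enclitics(SAMPLE_TOKENS) # 7
```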

ojak · 2015-02-26T18:31:11Z

Here's print_tree without parsing:

> sentence("This sentence doesn't contain seven words.").tokenize.print_tree
+ Sentence (70319169704620)  --- "This sentence does [...] seven words."  ---  {}   --- []
|
+--> Word (70319118319320)  --- "This"  ---  {}   --- []
+--> Word (70319118317200)  --- "sentence"  ---  {}   --- []
+--> Word (70319118315180)  --- "does"  ---  {}   --- []
+--> Word (70319118313160)  --- "n't"  ---  {}   --- []
+--> Word (70319118286420)  --- "contain"  ---  {}   --- []
+--> Word (70319118284200)  --- "seven"  ---  {}   --- []
+--> Word (70319118282040)  --- "words"  ---  {}   --- []
+--> Punctuation (70319118279820)  --- "."  ---  {}   --- []

And with parsing:

> sentence("This sentence doesn't contain seven words.").tokenize.parse.print_tree
+ Sentence (70319171751360)  --- "This sentence does [...] seven words."  ---  {:tag_set=>:penn}   --- []
|
+--+ Phrase (70319163075360)  --- "This sentence"  ---  {:tag=>"NP"}   --- []
   |
   +--> Word (70319162384740)  --- "This"  ---  {:tag=>"DT"}   --- []
   +--> Word (70319161753420)  --- "sentence"  ---  {:tag=>"NN"}   --- []
+--+ Phrase (70319160833780)  --- "does n't contain seven words"  ---  {:tag=>"VP"}   --- []
   |
   +--> Word (70319160302720)  --- "does"  ---  {:tag=>"VBZ"}   --- []
   +--> Word (70319159626700)  --- "n't"  ---  {:tag=>"RB"}   --- []
   +--+ Phrase (70319125324640)  --- "contain seven words"  ---  {:tag=>"VP"}   --- []
      |
      +--> Word (70319124574860)  --- "contain"  ---  {:tag=>"VB"}   --- []
      +--+ Phrase (70319123708220)  --- "seven words"  ---  {:tag=>"NP"}   --- []
         |
         +--> Word (70319134976880)  --- "seven"  ---  {:tag=>"CD"}   --- []
         +--> Word (70319122487540)  --- "words"  ---  {:tag=>"NNS"}   --- []
+--> Punctuation (70319134697140)  --- "."  ---  {:tag=>"."}   --- []

louismullie · 2015-02-26T18:40:46Z

So the real issue is that "n't" is being tokenized as a Word, when it should be an Enclitic. Therefore there's a bug in the default :ptb tokenizer. Almost all the tokenizers use this method to create tokens, and the enclitic should be parsed appropriately (line 350). Can you check what is happening there?

ojak · 2015-02-26T18:43:49Z

Cool. I'll take a look, thanks for locating it.

ojak · 2015-02-26T18:49:15Z

Yup. Typo in lib/treat/entities/entity/buildable.rb:18:

  Enclitics = %w['ll 'm 're 's 't 've 'nt]

Changed to:

  Enclitics = %w['ll 'm 're 's 't 've n't]

Works!

sentence("This sentence doesn't contain seven words.").tokenize.word_count
=> 6
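For anyone wondering why a single transposed character matters, here is a minimal sketch. It assumes the list is used for an exact string match when deciding a token's class; the `classify` helper is hypothetical, and Treat's real token-building logic is more involved.

```ruby
# The two lists are copied from the change above; classify is a hypothetical helper.
TYPO_ENCLITICS  = %w['ll 'm 're 's 't 've 'nt]
FIXED_ENCLITICS = %w['ll 'm 're 's 't 've n't]

# Assumption: a token is an enclitic only if it matches a list entry exactly.
def classify(token, enclitics)
  enclitics.include?(token) ? :enclitic : :word
end

puts classify("n't", TYPO_ENCLITICS)  # word
puts classify("n't", FIXED_ENCLITICS) # enclitic
```

With the typo, the tokenizer's "n't" never matches the "'nt" entry, so it falls through to Word.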

louismullie · 2015-02-26T18:56:05Z

Sweet. Can you be so kind as to submit a PR?

ojak mentioned this issue Feb 26, 2015

Fixed typo that caused n't to be tagged as a Word rather than an Enclitic #95

Merged

louismullie closed this as completed Feb 26, 2015
