Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Pod::Spell ignores stopwords in pod #9

Closed
karenetheridge opened this issue Sep 19, 2013 · 23 comments
Closed

New Pod::Spell ignores stopwords in pod #9

karenetheridge opened this issue Sep 19, 2013 · 23 comments

Comments

@karenetheridge
Copy link
Member

@karenetheridge karenetheridge commented Sep 19, 2013

When I upgraded from Pod-Spell-1.06 to 1.07, stopwords embedded in the pod stopped being recognized.

Downgrading made things work again.

More details in xenoterracide/Dist-Zilla-Plugin-Test-PodSpelling#18

@xenoterracide
Copy link
Contributor

@xenoterracide xenoterracide commented Sep 19, 2013

weird... will look this weekend

@xenoterracide
Copy link
Contributor

@xenoterracide xenoterracide commented Sep 24, 2013

didn't not have appropriate tuit levels this weekend, there's no chance you could write a failing test is there? and since I noticed you commented on "ordered stopwords" problem, do you think that has something to do with this ( possible it is not exasperated by lexical scoping of stopwords )

@pghmcfc
Copy link

@pghmcfc pghmcfc commented Sep 24, 2013

xt/author/pod-spell.t from YAML-Tiny 1.55 will trigger this issue; the word "embeddable" is listed as a stopword in lib/YAML/Tiny.pm but it's still complained about.

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 24, 2013

I'll poke at it a bit as well and see what I find.

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 24, 2013

Got it: "easily-embeddable". In 1.07, learned words are now isolated to an object. The global stoplist is not modified. During processing, Pod::Spell & Pod::Wordlist grabs chunks of \S+ to test against the wordlist (after a tiny bit of cleanup). So "embeddable" doesn't get stripped. The spellchecker does split on hyphen and reports the bad part. Test::Spelling then does a second pass against the global stoplist, which never learned embeddable.

Options:

  • Patch Test::Spelling to make a second pass using the parser's stopwords object and the API instead of hitting the global list
  • Patch Pod::Wordlist to do something "smart" with hyphenated words. (Test it complete against the stoplist, then test each part against the stoplist.)
  • both!

There's also the option of modifying the global stopword list when learning, but I tend to think we should stick with a gradual process of making the global stopword list private.

I'll see if I can bang-out a test and fix for the Pod::Wordlist part, but someone should go write a patch for Test::Spelling to use the new API and stop messing with the global, because it is likely to become a lexical at some point anyway.

@xenoterracide
Copy link
Contributor

@xenoterracide xenoterracide commented Sep 24, 2013

https://rt.cpan.org/Ticket/Display.html?id=57492 < guessing this may or may not be relevant. But something to consider. It's very likely that Pod::Spell needs to make 2 passes, which is what I thought was the cause of that issue a while back.

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 24, 2013

That will do it. Now we just need it shipped.

@dagolden dagolden closed this Sep 24, 2013
@pghmcfc
Copy link

@pghmcfc pghmcfc commented Sep 25, 2013

I still have a bit of an issue here. On Fedora, the default speller is hunspell rather than aspell, and since Fedora 17 its behaviour is to complain about a few things in the YAML::Tiny Pod, such as:

Just like L<Storable>'s thaw() function or the eval() function in relation

and

Support for the "'" single quote indicator is required.

Given this, hunspell reports two mis-spellings:

'
's

Now it's arguable that this is an issue with hunspell and/or its dictionaries, but regardless of that I should be able to shut it up by specifying stopwords. If I add stopwords embedded in the Pod, the test still fails, but if I add the two failing character sequences to the list of stopwords at the end of xt/author/pod-spell.t, the test passes. So there's an inconsistency there.

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 25, 2013

@pghmcfc assuming that's Test::Spelling (or a dzil plugin that uses it), you're possibly seeing the "two pass" approach that Test::Spelling uses. First it runs POD through Pod::Spell to strip stopwords, then puts it to the spell checker, and then checks the result against the stopwords again to catch things the dictionary might have split up.

I think what's going on for "'" is that it's stripping the outer quotes only when it should just punt the whole thing.

I'm not sure about the L<Storable>'s case, but I'll add both to the test file and see what I can find.

David

@dagolden dagolden reopened this Sep 25, 2013
@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 25, 2013

I've pushed up some commits that deal with these issues, mostly by stripping out "words" that are all punctuation or just a lone 's. It also fixes treatment of abbreviations like "Ph.D.", which was getting turned into "Ph.D".

@xenoterracide
Copy link
Contributor

@xenoterracide xenoterracide commented Sep 25, 2013

it appears some of the changes have somehow caused a regression

@xenoterracide
Copy link
Contributor

@xenoterracide xenoterracide commented Sep 25, 2013

nvm, it's not so much of a regression as an oversight, stoplist didn't used to be in the pod. (lol DOH!)

@pghmcfc
Copy link

@pghmcfc pghmcfc commented Sep 26, 2013

With Pod::Spell 1.09 I'm now seeing test failures on Fedora prior to Fedora 17 (and all Red Hat Enterprise Linux versions) in Pod-Spell's own spell test:

$ make test AUTHOR_TESTING=1 RELEASE_TESTING=1
Skip blib/lib/auto/share/dist/Pod-Spell/wordlist (unchanged)
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-compile.t ................ ok
# 
# 
# Generated by Dist::Zilla::Plugin::ReportVersions::Tiny v1.09
# perl: 5.014003 (wanted 5.006) on linux from /usr/bin/perl
# 
# Carp                                          => 1.20       (want any version)
# Class::Tiny                                   => 0.011      (want any version)
# ExtUtils::MakeMaker                           => 6.57_05    (want 6.30)   
# File::ShareDir::Install                       => 0.05       (want 0.03)   
# File::ShareDir::ProjectDistDir                => 0.4.4      (want any version)
# File::Slurp                                   => 9999.19    (want any version)
# File::Spec                                    => 3.33       (want any version)
# File::Temp                                    => 0.22       (want any version)
# IO::Handle                                    => 1.31       (want any version)
# IPC::Open3                                    => 1.0901     (want any version)
# Lingua::EN::Inflect                           => 1.892      (want any version)
# Pod::Coverage::TrustPod                       => 0.100002   (want any version)
# Pod::Escapes                                  => 1.04       (want any version)
# Pod::Parser                                   => 1.37       (want any version)
# Test::CPAN::Changes                           => 0.23       (want 0.19)   
# Test::CPAN::Meta                              => 0.23       (want any version)
# Test::Deep                                    => 0.11       (want any version)
# Test::More                                    => 0.98       (want 0.88)   
# Test::Pod                                     => 1.48       (want 1.41)   
# Test::Pod::Coverage                           => 1.08       (want 1.08)   
# Text::Wrap                                    => 2009.0305  (want any version)
# base                                          => 2.16       (want any version)
# constant                                      => 1.21       (want any version)
# locale                                        => 1.00       (want any version)
# strict                                        => 1.04       (want any version)
# version                                       => 0.9904     (want 0.9901) 
# warnings                                      => 1.12       (want any version)
# 
# Thanks for using my code.  I hope it works for you.
# If not, please try and include this output in the bug report.
# That will help me reproduce the issue and solve your problem.
# 
t/000-report-versions-tiny.t .. ok
t/author-critic.t ............. ok
#   Failed test 'POD spelling for lib/Pod/Spell.pm'
#   at t/author-pod-spell.t line 19.
# Errors:
#     virtù
# Looks like you failed 1 test of 3.
t/author-pod-spell.t .......... 
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests 
t/author-test-eol.t ........... ok
t/basic.t ..................... ok
t/debug.t ..................... ok
t/get-stopwords.t ............. ok
t/release-cpan-changes.t ...... ok
# Unable to parse MANIFEST.SKIP file:
# No such file or directory
# Using default skip data from ExtUtils::Manifest 1.58
t/release-dist-manifest.t ..... ok
t/release-distmeta.t .......... ok
t/release-meta-json.t ......... ok
t/release-minimum-version.t ... ok
t/release-pod-coverage.t ...... ok
t/release-pod-linkcheck.t ..... skipped: Test::Pod::LinkCheck required for testing POD
t/release-pod-syntax.t ........ ok
t/release-portability.t ....... ok
t/release-test-version.t ...... ok
t/release-unused-vars.t ....... ok
t/text-block.t ................ ok
Test Summary Report
-------------------
t/author-pod-spell.t        (Wstat: 256 Tests: 3 Failed: 1)
  Failed test:  3
  Non-zero exit status: 1
Files=20, Tests=96,  7 wallclock secs ( 0.08 usr  0.01 sys +  7.03 cusr  0.22 csys =  7.34 CPU)
Result: FAIL
Failed 1/20 test programs. 1/96 subtests failed.

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 26, 2013

@pghmcfc, that's a weird bug, but can I step back and ask why you're running release and author tests?

Per the Lancaster Consensus definitions of the semantics of those variables, there's really no reason to have it set (unless you're actively developing the module):

RELEASE_TESTING: if true, tests are being run as part of a release QA process; CPAN clients must not set this variable

AUTHOR_TESTING: if true, tests are being run as part of an author's personal development process; such tests may or may not be run prior to release. CPAN clients must not set this variable. Distribution packagers (ppm, deb, rpm, etc.) should not set this variable.

@pghmcfc
Copy link

@pghmcfc pghmcfc commented Sep 26, 2013

I'm the downstream maintainer of about 100 modules in Fedora, including a bunch of testing modules like Test::Spelling, so I tend to run author/release tests where they're provided as they make for good test cases for the test modules (and their dependencies) themselves. That's why I run them despite knowing that I'm not really supposed to do it.

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 26, 2013

Fair enough.

Looking at the Pod::Spell pod, I see where that string is, but I can't see why it's not getting flagged for me. I wonder if it's a matter of the version of Pod::Parser that's doing it.

I'll see if I can replicate it somehow.

@xenoterracide
Copy link
Contributor

@xenoterracide xenoterracide commented Sep 26, 2013

I just checked Pod::Parser, no dice, and that version isn't even on CPAN anymore. In an effort to see if it goes away... I suspect this is a bug in a really old version of something.

also we have no unicode tests in the test suite, should probably fix that.

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 26, 2013

Found it. It's an encoding issue. The test should be failing, but somehow, on my setup, it's not. Why it's failing for @pghmcfc is probably some sort of locale issue or aspell version or something.

I'll open a separate ticket.

@pghmcfc
Copy link

@pghmcfc pghmcfc commented Sep 26, 2013

Possibly; I don't see failures on recent (last 12 months) versions of Fedora but I do everywhere else. Locale should be the same in all cases but aspell/hunspell (which we use in Fedora where possible) may have different versions/dictionaries.

I've also come across another issue with 1.09, this time in our development version with 5.18.1 and mostly up to date versions of everything, running the spell check for Perl-OSType:

$ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-compile.t ......... ok
# Prerequisite Report:
#   Version Module               
#   ------- ---------------------
#      5.68 Exporter             
#      6.78 ExtUtils::MakeMaker  
#      3.40 File::Spec           
#      3.40 File::Spec::Functions
#    0.2301 File::Temp           
#      1.34 IO::Handle           
#      1.13 IPC::Open3           
#      1.31 List::Util           
#   0.98_05 Test::More           
#      1.27 constant             
#      1.07 strict               
#      1.18 warnings             
t/00-report-prereqs.t .. ok
t/OSType.t ............. ok
All tests successful.
Files=3, Tests=21,  0 wallclock secs ( 0.11 usr  0.00 sys +  0.73 cusr  0.03 csys =  0.87 CPU)
Result: PASS
$ make test 'TEST_FILES=$(echo $(find xt/ -name '*.t'))
xt/author/critic.t ............ ok
#   Failed test 'POD spelling for lib/Perl/OSType.pm'
#   at xt/author/pod-spell.t line 11.
# Errors:
#     Win32'
#     freebsd'
# Looks like you failed 1 test of 1.
xt/author/pod-spell.t ......... 
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 
xt/release/distmeta.t ......... ok
xt/release/minimum-version.t .. ok
xt/release/pod-coverage.t ..... ok
xt/release/pod-syntax.t ....... ok
xt/release/portability.t ...... ok
xt/release/test-version.t ..... ok
Test Summary Report
-------------------
xt/author/pod-spell.t       (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=8, Tests=14, 11 wallclock secs ( 0.14 usr  0.01 sys + 10.26 cusr  0.38 csys = 10.79 CPU)
Result: FAIL
Failed 1/8 test programs. 1/14 subtests failed.

Looks like the trailing single quote hasn't been stripped.

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 26, 2013

@pghmcfc the trailing ' should be fixed with 055c2b2

@pghmcfc
Copy link

@pghmcfc pghmcfc commented Sep 27, 2013

Trailing ' still a problem:

$ grep -C 3 trailing /usr/share/perl5/vendor_perl/Pod/Wordlist.pm 
        # some spellcheckers can't cope with anything but Latin1
        $_ = '' if $self->no_wide_chars && /[^\x00-\xFF]/;

        # strip trailing punctuation; we don't strip periods unless
        # after punctuation; we don't want to chop abbreviations like "Ph.D."
        s/[\)\]\}\'\"\:\;\,\?\!]+\.*$//s;

--
sub _strip_a_word {
    my ($self, $word) = @_;
    my $remainder = '';
    # might have trailing period(s) or an internal dash, so first, just check
    # as is in case that's actually in the word list
    if ($self->is_stopword($word) ) {
        # stopword, so do nothing
--
        $remainder = join("-", @keep) if @keep;
    }
    elsif ( $word =~ m{(.*?)\.+$}) {
        # trailing period could be end of sentence or ellipses
        my $part = $1;
        $remainder = $word if ! $self->is_stopword( $part );
    }
$ LANG=en_US.UTF-8 make test TEST_FILES="$(echo $(find xt/ -name '*.t'))"
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" xt/release/pod-syntax.t xt/release/distmeta.t xt/release/test-version.t xt/release/pod-coverage.t xt/release/minimum-version.t xt/release/portability.t xt/author/critic.t xt/author/pod-spell.t
xt/author/critic.t ............ ok   
xt/author/pod-spell.t ......... 1/1 
#   Failed test 'POD spelling for lib/Perl/OSType.pm'
#   at xt/author/pod-spell.t line 11.
# Errors:
#     Win32'
# Looks like you failed 1 test of 1.
xt/author/pod-spell.t ......... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 
xt/release/distmeta.t ......... ok   
xt/release/minimum-version.t .. ok   
xt/release/pod-coverage.t ..... ok   
xt/release/pod-syntax.t ....... ok   
xt/release/portability.t ...... ok   
xt/release/test-version.t ..... ok   

Test Summary Report
-------------------
xt/author/pod-spell.t       (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=8, Tests=14,  2 wallclock secs ( 0.04 usr  0.01 sys +  2.36 cusr  0.10 csys =  2.51 CPU)
Result: FAIL
Failed 1/8 test programs. 1/14 subtests failed.
make: *** [test_dynamic] Error 255

@dagolden
Copy link
Contributor

@dagolden dagolden commented Sep 27, 2013

Ugh. I've taken another shot at this with f340036

The lack of decent test corpus hurts, so the more of these we discover, the easier it is to fix.

@pghmcfc
Copy link

@pghmcfc pghmcfc commented Sep 27, 2013

That's cracked it. Current git master runs its own test suite plus those of YAML-Tiny and Perl-OSType successfully for me with current Test-Spelling on all Fedora releases from Fedora 3 (5.8.5) to the current development version (5.18.1), plus Red Hat Enterprise Linux 4, 5 and 6. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants