I figured I will collect all the pieces under one roof:
It started here: #897 / 7695cea
The problems that came out are things like #942 and
<ribasushi> the second bullet point for v2.20 is quite the achievement ;)
At this point I'd like to +1 this suggestion by @monken: #942 (comment), but really would recommend things go much further. Relevant irclog follows:
[03:37:11] <omega> oalders: yeah, I know. That one is a mix of not following the standard, but following it cloesly enough that something gets parsed, and our relaxed group parsing because of ribasushi "refusal" t
o follow the spec, while we try to parse it anyways
[05:06:10] <haarg> there's a few things wrong with how metacpan treats the changes file
[05:07:00] <haarg> and i don't know parsing well enough to know the proper fix
[05:07:22] <haarg> but generally, changelog entries starting with * shouldn't be treated as groups unconditionally
[05:07:37] <haarg> lots of dists use * instead of -
[05:12:25] <omega> yeah, I know, thats one of the problems
[05:12:30] <omega> I think we have to rip that part out again
[05:12:50] <omega> a line based parser can only work so well for complicated situations
[05:13:40] <haarg> indentation seems like the only thing you can really depend on
[05:14:04] <omega> yeah, which means it isn't really groups, but nesting
[05:14:14] <haarg> combined with recognising lines vs continuations
[05:14:38] <haarg> which you can do a rough job by checking for single non word characters probably
[05:16:02] <haarg> unfortunately, the cpan-changes spec isn't much of a spec
[05:16:08] <omega> nope
[05:16:54] <omega> and unfortunately, it sometimes parses things that aren't according to spec (like dbix-class, which I tried to fix, which lead to this mess)
[05:17:02] <omega> and it parses that in a bad way
[05:17:05] <haarg> well
[05:17:23] <haarg> dbic doesn't follow the spec if you read the english
[05:17:41] <haarg> but it follows the technical part of the spec just fine
[05:17:41] <omega> "doesn't follow the spec" was wrong to say
[05:18:00] <omega> I meant, gets parsed in a broken way, rendering the end result useless
[05:18:09] <haarg> yeah
[05:19:16] <haarg> that's mainly because the spec itself is pretty meaningless
[05:20:15] <haarg> in an attempt to follow all possible existing patterns, it made itself so general that it has almost no value as a spec.
[05:28:41] <omega> yeah, hence the idea that "our" parser could be more relaxed, because we don't really care about specs, we care about getting information out there
[05:28:49] <omega> but it is hard to do right
[05:31:36] <haarg> there are two options i can think of that seem reasonable
[05:32:53] <haarg> first is nested entries, where sub-entries need to be [*-whatever_else]\s*
[05:33:13] <haarg> and otherwise indented entries are continuations
[05:34:09] <haarg> the second would be to define a more strict format, and if it fails just show the raw text data, with white-space: pre
[05:50:23] <omega> the second option gets complicated by getting the latest release only
[05:54:11] <haarg> i mean more by following the spec strictly, so version headings would still be respected
[05:54:39] <haarg> but everything else that was indented after it would be treated as the changes for that version.
[07:32:00] <ribasushi> <omega> the second option gets complicated by getting the latest release only <--- how so?
[08:45:16] <omega> Well, not if we still expect "version lines" to be spec conformant I guess
[08:45:34] <omega> but I understood it as wanting to match more releases that cpan::changes doesn't
[08:45:56] <omega> there is a lot of weird stuff going on in Changes files on cpan :)
[08:47:01] <omega> (Other than that someone needs to write that code)
[08:47:42] <haarg> being strict about the version lines makes sense
[08:59:03] * ribasushi still thinks that attempting to parse the file is a step in the wrong direction
[08:59:36] <ribasushi> even trying to read the version itself is already a step into the woods of heuristics
[09:06:19] <haarg> it's a least worst situation
[09:06:34] <haarg> i think parsing is more reasonable than diffing
[09:07:01] <ribasushi> and I think exactly the opposite ;)
[09:07:38] <ribasushi> haarg: pistols or rapiers?
[09:08:11] * haarg points to https://metacpan.org/release/Moose as evidence that diffs won't always work well
[09:08:24] <haarg> broken changelog entry, likely to be fixed
[09:08:35] <haarg> which reminds me
[09:08:56] <ribasushi> haarg: but it *will* work
[09:09:08] <ribasushi> haarg: -U1 with taking only one hunk (the longest one)
[09:09:29] <ribasushi> yes, still heuristic, but expecting the relevant changes to be in one hunk is beyond reasonable
[09:10:05] <omega> except when someone reformats the entire changelog :)
[09:10:27] <ribasushi> omega: and then you display the entire changelog - how much worse is that than the half-assed stuff we have now?
[09:10:54] <ribasushi> another avenue is to tighten up the spec and not show *anything* that does not strictly conform
[09:11:16] * ribasushi will be happy to just have no changes display for DBIC at all, if it chokes on them ;)
[09:11:31] * ribasushi likes all-or-nothing behaviors
[09:14:34] <haarg> i do think the spec is next to useless as it is. making it more strict can only be an improvement.
Furthermore I am including a relevant implementation-detail-rant from a different channel:
1) get a diff -U1 between the current and last version of the Changes file (metacpan has this)
2) find the longest hunk (to avoid misc typo fixes)
3) remove the diff-prefixed +'es
4) display it in fixed width to preserve author's intent
An extra thought - if there seems to be no practical way to detect a changelog as unparseable - perhaps an unofficial meta-key needs to be provided for those of us who have no intentions of following the current spec (for various reasons that can be sumarrized as "artistic freedom"). Or perhaps the parser can be told somehow "this is not a changelog I want you to parse" by having something at the end of the file or whatnot...
Such an option also adequately resolve #941 and #920 for those of us who want the Changelog to be presented as-is to a potential user, without having a naïve heuristic walk all over it.
I'd like to keep this simple, so I'm on board with @monken and the comment of his that is referenced in the ticket. I don't really want to start diffing to find the largest hunk. I think, if authors want their Changes to be parsed, they need to follow either one spec or a spec that allows for X various formats. What that spec is or should be, I can't say. Maybe you have a strict spec + a relaxed spec and you indicate in your META file which spec you are following. I'm actually not really sure what the problem is with following a basic spec. The Changes files just seem to be a series of bullet points with dates, but then I don't work on modules that have a lot of contributors.
That's the thing - I don't want my Changes file to be parsed, but metacpan assumes I do. In either case - the stuff is at the bottom of the page, so it's not that big of a deal. I was just trying to help make the metacpan user-experience better, without sacrificing readability of the Change file for terminal users (that is not following a butt-ugly specification bestowed upon CPAN with zero discussion ;).
The constructive thing at this point is for me to disclaim any stake in how the Changes of DBIC are rendered. I have no problem with any further changes, including reverting all the changes made by @omega.
I am keeping the ticket open as a reminder that the current state of rendering of random changesets off CPAN as a whole is actually worse before @omega made his changes. Feel free to close.
@oalders, I really don't understand what you mean - you say you want to keep things simple, but then dismiss ribasushi's simple solution in favour of doing a full parse. Given the wide variety of changelog formats, using 'largest chunk of additions since the last release' is about the only thing that will ever cover the majority of cases - and is both simpler and substantially more reliable than the current "solution".
If you have a concrete argument as to why the diff approach is inferior, I'm happy to hear it - but 'simplicity' isn't one given the diff approach is pretty much the simplest thing that can possibly work.
@mst primarily I'm thinking of this kind of a diff: oalders/business-paypal-api@02b0463 Granted, this one would probably be OK because it's a wholesale change, but it would be much more than the last set of changes. We could try it and see how it plays out.
@oalders any word on which way (if at all ;) this is going forward?
@ribasushi I'm OK with trying your approach, but we need someone to do the actual work. I'm really tight for time right now. Not sure if @omega has been watching the ticket at all and has any thoughts?
FWIW http://frepan.64p.org/ uses the algorithm described in the OP i.e. take diff for the Changes file, then display as a plain text
The parser has been improved (hopefully) by #1349, which makes the autodie example given in the first post render reasonably.
Am closing. Please ping this issue if it needs to be re-opened.