8337111: Bad HTML checker for generated documentation #21879

Closed
wants to merge 14 commits into from

Member

@nizarbenalla nizarbenalla commented Nov 4, 2024

Doccheck's human-generated reports are great at previewing a "chessboard" of results, giving the reader a quick glimpse of the quality/health of the documentation. But these tests needed to be automated, and they didn't easily translate into something that can be integrated into CI.

This PR includes an HTML and internal link test on api/java.base and a BadChars and Doctype test on the entire generated documentation bundle.

Here is an example of the output after running all tests on api/java.base

Note: There is an active PR to fix the broken anchors left in java.base so this is not a blocker.

STDOUT:
STDERR:
test: test
Tidy found errors in the generated HTML
/Users/nizarbenalla/Work/jdk-repos/jdk1/build/macosx-aarch64/images/docs/api/java.base/java/lang/Class.html:323:87: Warning: <a> anchor "nest" already defined
Tidy output end.


api/java.base/java/util/concurrent/StructuredTaskScope.ShutdownOnFailure.html:245: id not found: api/java.base/java/util/concurrent/StructuredTaskScope.ShutdownOnFailure.html#TreeStructure
api/java.base/java/util/concurrent/StructuredTaskScope.ShutdownOnSuccess.html:242: id not found: api/java.base/java/util/concurrent/StructuredTaskScope.ShutdownOnSuccess.html#TreeStructure
api/java.base/java/lang/Class.html:323: name already declared: nest
api/java.base/java/lang/Module.html:291: id not found: api/java.base/java/lang/foreign/package-summary.html#restricted
api/java.base/java/lang/Module.html:434: id not found: api/java.base/java/lang/foreign/package-summary.html#restricted
api/java.base/java/lang/foreign/MemorySegment.html:725: id not found: api/java.base/java/lang/foreign/package-summary.html#restricted

Link Checker Report
Checked 3446 files.
Found 445059 references to 48205 anchors in 5770 files and 64 other URIs.
     1 duplicate ids
     3 missing ids

Hosts
    20 docs.oracle.com
     1 tools.ietf.org
     1 www.ietf.org
     1 jcp.org
     4 www.rfc-editor.org
     7 unicode.org
    10 www.unicode.org
    20 www.w3.org
Exception running test test: java.lang.Exception: One or more HTML checkers failed: [java.lang.RuntimeException: Tidy found errors in the generated HTML, java.lang.RuntimeException: LinkChecker encountered errors. Duplicate IDs: 1, Missing IDs: 3, Missing Files: 0, Bad Schemes: 0]
java.lang.Exception: One or more HTML checkers failed: [java.lang.RuntimeException: Tidy found errors in the generated HTML, java.lang.RuntimeException: LinkChecker encountered errors. Duplicate IDs: 1, Missing IDs: 3, Missing Files: 0, Bad Schemes: 0]
        at DocCheck.runCheckersSequentially(DocCheck.java:181)
        at DocCheck.test(DocCheck.java:166)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:573)
        at toolbox.TestRunner.runTests(TestRunner.java:91)
        at toolbox.TestRunner.runTests(TestRunner.java:73)
        at DocCheck.main(DocCheck.java:128)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:573)
        at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
        at java.base/java.lang.Thread.run(Thread.java:1576)

1 errors
java.lang.Exception: 1 errors found
        at toolbox.TestRunner.runTests(TestRunner.java:119)
        at toolbox.TestRunner.runTests(TestRunner.java:73)
        at DocCheck.main(DocCheck.java:128)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:573)
        at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
        at java.base/java.lang.Thread.run(Thread.java:1576)

JavaTest Message: Test threw exception: java.lang.Exception: 1 errors found
JavaTest Message: shutting down test

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8337111: Bad HTML checker for generated documentation (Sub-task - P4)
  • JDK-8337113: Bad character checker for generated documentation (Sub-task - P4)
  • JDK-8337116: Internal links checker for generated documentation (Sub-task - P4)
  • JDK-8337114: DocType checker for generated documentation (Sub-task - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21879/head:pull/21879
$ git checkout pull/21879

Update a local copy of the PR:
$ git checkout pull/21879
$ git pull https://git.openjdk.org/jdk.git pull/21879/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21879

View PR using the GUI difftool:
$ git pr show -t 21879

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21879.diff

Using Webrev

Link to Webrev Comment

@nizarbenalla
Member Author

/label add javadoc

This PR adds 4 separate checkers for the generated documentation.

/issue add 8337114
/issue add 8337113
/issue add 8337116

@bridgekeeper

bridgekeeper bot commented Nov 4, 2024

👋 Welcome back nbenalla! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Nov 4, 2024

@nizarbenalla This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8337111: Bad HTML checker for generated documentation
8337113: Bad character checker for generated documentation
8337116: Internal links checker for generated documentation
8337114: DocType checker for generated documentation

Reviewed-by: hannesw

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 8 new commits pushed to the master branch:

  • c1b868d: 8346602: Remove unused macro parameters in jni.cpp
  • 43b7e9f: 8346713: [testsuite] NeverActAsServerClassMachine breaks TestPLABAdaptToMinTLABSize.java TestPinnedHumongousFragmentation.java TestPinnedObjectContents.java
  • 249f141: 8346737: GenShen: Generational memory pools should not report zero for maximum capacity
  • d562d3c: 8343882: BasicAnnoTests doesn't handle multiple annotations at the same position
  • 7ba969a: 8346739: jpackage tests failed after JDK-8345259
  • b8e40b9: 8346688: GenShen: Missing metadata trigger log message
  • d2a4863: 8346690: Shenandoah: Fix log message for end of GC usage report
  • bcb1bda: 8345259: Disallow ALL-MODULE-PATH without explicit --module-path

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the javadoc javadoc-dev@openjdk.org label Nov 4, 2024
@openjdk

openjdk bot commented Nov 4, 2024

@nizarbenalla
The javadoc label was successfully added.

@openjdk

openjdk bot commented Nov 4, 2024

@nizarbenalla The issue 923 was not found in the JDK project - make sure you have entered it correctly.
As there were validation problems, no additional issues will be added to the list of solved issues.

@openjdk

openjdk bot commented Nov 4, 2024

@nizarbenalla
Adding additional issue to issue list: 8337113: Bad character checker for generated documentation.

@openjdk

openjdk bot commented Nov 4, 2024

@nizarbenalla
Adding additional issue to issue list: 8337116: Internal links checker for generated documentation.

@nizarbenalla
Member Author

/issue add 8337114

@openjdk

openjdk bot commented Nov 4, 2024

@nizarbenalla
Adding additional issue to issue list: 8337114: DocType checker for generated documentation.

@nizarbenalla nizarbenalla marked this pull request as ready for review November 4, 2024 15:51
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 4, 2024
@mlbridge

mlbridge bot commented Nov 4, 2024

Member

@hns hns left a comment

I'm not quite done reviewing this, but here's a first batch of comments. By and large I like what I see, the comments are mostly about details.

links = true;
badchars = true;
doctype = true;
}else {
Member

Missing space before else above, and in several places after if in the lines below.


if (!checks.isEmpty()) {
if (checks.contains(",")) {
EXCLUDE_LIST.addAll(Arrays.asList(checks.split(",")));
Member

It looks like the content of EXCLUDE_LIST is never used.

* Base class for HTML checkers.
* <p>
* For details on HTML syntax and the terms used in this API, see
* W3C <a href="https://www.w3.org/TR/html5/syntax.html#syntax">The HTML syntax</a>.
Member

This URL redirects to https://html.spec.whatwg.org/multipage/syntax.html#syntax. Should we use that URL instead?


@Override
public void xml(int line, Map<String, String> attrs) {
xml = true;
Member

The xml field is never used.

allURIs = new HashMap<>();
}

public void setBaseDirWereChecking(Path dir) {
Member

Is this supposed to be "Where"? I think just setBaseDirectory or setBaseDir would be fine.

Member Author

It was supposed to be "set base dir we're checking"; I wrote it as I was thinking.

if (m2.find()) {
return Charset.forName(m2.group(1));
}
return html5 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
Member

What is the basis for assuming ISO-8859-1 for non-HTML5 files?

Member Author

I assumed text would be written in Latin characters, but I guess this can be removed and we can simply use UTF-8?

Member Author

Unicode has some characters, such as bidi characters, which I don't want to allow, but this test should only check for bit rot and character encoding, so UTF_8 could work as a default.

Contributor

Note that JDK source code is inconsistent. The makefile commands for javac specify just ASCII (since forever) but the javadoc commands allow ISO-8859-1, which was popular before the widespread acceptance of UTF-8.

Ideally, all should be self-consistent -- JDK policy, javac commands, javadoc commands, DocCheck, etc.
Note that resource files might use different encodings to *.java source files.
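
The default discussed in this thread can be sketched as follows. This is only an illustration, assuming a hypothetical `CharsetSniffer` class and a simplified `<meta>` pattern that are not taken from the PR: it uses a declared charset when one is found and supported, and otherwise falls back to UTF-8 regardless of the doctype.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR.
final class CharsetSniffer {
    // Simplified pattern: finds charset=NAME inside a <meta ...> tag, covering
    // both <meta charset="..."> and the http-equiv Content-Type form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?i)<meta\\s[^>]*charset\\s*=\\s*[\"']?([a-z0-9][a-z0-9_.:+-]*)");

    static Charset detect(String head) {
        Matcher m = META_CHARSET.matcher(head);
        // isSupported avoids an exception for names this JVM does not know.
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        // No usable declaration: assume UTF-8 rather than ISO-8859-1.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(detect("<meta charset=\"ISO-8859-1\">"));
        System.out.println(detect("<title>No declaration</title>"));
    }
}
```

Whether UTF-8 is the right fallback still depends on the consistency question raised above; the sketch only shows where such a policy would live.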

Pattern p = Pattern.compile("(?i)doctype"
+ "\\s+html"
+ "\\s+([a-z]+)"
+ "\\s+\"([^\"]+)\""
Member

Doctype legacy string can also use single quotes. Accepting that would make the test more robust, although making sure start and end quotes match could be difficult using a regex.
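
One way to make the start and end quotes match is a regex backreference. The sketch below is an assumption-laden illustration, not the pattern used in the PR: the class name `DoctypeQuotes` is hypothetical, and it only handles a doctype with a single quoted string.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR.
final class DoctypeQuotes {
    // Group 1: keyword (e.g. SYSTEM). Group 2: the opening quote, either kind.
    // Group 3: the quoted text. The backreference \2 requires the closing
    // quote to be the same character as the opening quote.
    private static final Pattern LEGACY = Pattern.compile(
            "(?i)<!doctype\\s+html\\s+([a-z]+)\\s+([\"'])(.*?)\\2\\s*>");

    static Optional<String> legacyString(String doctype) {
        Matcher m = LEGACY.matcher(doctype);
        return m.find() ? Optional.of(m.group(3)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(legacyString("<!DOCTYPE html SYSTEM \"about:legacy-compat\">"));
        System.out.println(legacyString("<!DOCTYPE html SYSTEM 'about:legacy-compat'>"));
        // Mismatched quotes are not accepted.
        System.out.println(legacyString("<!DOCTYPE html SYSTEM \"about:legacy-compat'>"));
    }
}
```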

+ "\\s+html"
+ "\\s+([a-z]+)"
+ "\\s+\"([^\"]+)\""
+ "(?:\\s+\"([^\"]+)\")?"
Member

Where is this part of the doctype specified?

Member Author

It can match the string :legacy-compat in <!DOCTYPE html SYSTEM "about:legacy-compat"> or <!DOCTYPE html SYSTEM 'about:legacy-compat'>

private final Log log;
private final Map<Path, IDTable> allFiles;
private final Map<URI, IDTable> allURIs;
private boolean checkInwardReferencesOnly;
Member

This field is always false.

}

public void log(Path path, int line, String message, Object... args) {
errors.add(formatErrorMessage(path, line, message, args));
Member

It's strange that this class is called Log and has several log methods, but they are also used to report and track errors. It seems like some checkers use this class to track errors, while others use it purely for logging. Maybe the two features should be separated, for example by adding a dedicated logError method?

Member Author

@nizarbenalla nizarbenalla Nov 14, 2024

I mostly use this to store errors, but also anything I want the test to output later. I can find a way to separate this.
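
A rough sketch of the separation suggested in this thread, with plain messages and tracked errors kept in different lists. The class name `CheckLog` and its methods are hypothetical and only illustrate the idea; they are not the PR's `Log` class.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PR's Log class.
final class CheckLog {
    private final List<String> messages = new ArrayList<>();
    private final List<String> errors = new ArrayList<>();

    // Informational output only; never affects the test result.
    void log(String message, Object... args) {
        messages.add(String.format(message, args));
    }

    // Records a failure in the path:line: message shape used by the reports.
    void logError(Path path, int line, String message, Object... args) {
        errors.add(path + ":" + line + ": " + String.format(message, args));
    }

    boolean hasErrors() {
        return !errors.isEmpty();
    }

    List<String> errors() {
        return List.copyOf(errors);
    }

    List<String> messages() {
        return List.copyOf(messages);
    }

    public static void main(String[] args) {
        CheckLog log = new CheckLog();
        log.log("Checked %d files.", 3446);
        log.logError(Path.of("Class.html"), 323, "name already declared: %s", "nest");
        log.messages().forEach(System.out::println);
        log.errors().forEach(System.err::println);
    }
}
```

With this split, a checker can decide to fail based on `hasErrors()` alone, while anything passed to `log` is only printed.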

@nizarbenalla
Member Author

Small update (besides the large file with external links); I'm not yet done updating the tests based on review comments.
FYI, after JDK-8342836 the tests can be run with a simple make test-docs, which is the same as using make test-docs_all TEST_DEPS=docs-jdk.

@nizarbenalla
Member Author

I think I'd like to go for warnings instead of errors for external links, at least for a little while to avoid unnecessary failures in CI.
Maybe until we let people know that they should add the external resources to the whitelist, or we set up GHA for doc tests.
I split the test categories into separate jtreg tests; I think we may get away with one test per category even if it's testing all modules + specs.

@@ -0,0 +1,764 @@
http://cldr.unicode.org/
Member

One thing that would be nice to have (and easy to implement) is to treat lines starting with # as comments and add a few lines at the top of the file describing the purpose of this file and how to add new links.

Member Author

Fixed, thanks!

https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/String.html
https://docs.oracle.com/en/java/javase/24/docs/specs/javadoc/javadoc-search-spec.html
https://docs.oracle.com/en/java/javase/24/docs/specs/man/javadoc.html
https://docs.oracle.com/en/java/javase/24/docs/specs/man/java.html
Member

Is it right that some links are left pointing to version 23 and 24 resources, while most are updated to 25?
I'm afraid it will be impractical to manually update these links twice a year, so we should use some placeholder/macro to insert the current feature release. But that will only work if the links are also generated uniformly with the current feature release (for example by the @extLink taglet).

Member Author

@nizarbenalla nizarbenalla Dec 19, 2024

The java man page has links to previous releases (I think these are added manually).

SourceVersion.java and ClassFileFormatVersion.java have hardcoded jls24 and jvms24 links, as in:

     * @see <a
     * href="https://docs.oracle.com/javase/specs/jls/se24/html/index.html">
     * <cite>The Java Language Specification, Java SE 24 Edition</cite></a>
     */
    RELEASE_24,

I'm not 100% sure, but I think anything that isn't pointing to se25 is hardcoded to point to a different link. I could use a special macro for the current se25 links (something like @VERSION@, substituted when reading from the file).

@nizarbenalla
Member Author

I forgot to link this issue.

/issue add 8337116

@openjdk

openjdk bot commented Dec 20, 2024

@nizarbenalla
Updating description of additional solved issue: 8337116: Internal links checker for generated documentation.

@openjdk openjdk bot removed the rfr Pull request is ready for review label Dec 20, 2024
@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 20, 2024
@nizarbenalla
Member Author

nizarbenalla commented Dec 20, 2024

I've added a small message at the beginning of the whitelist file, and copied part of Linkchecker#foundReference to ExternalLinkChecker#foundReference to make it a bit more robust.

Lines starting with # are ignored and I use a special string @@JAVASE_VERSION@@ to denote the current release.

Additionally, I'd like to backport this to JDK 24, what do you think?

}
}

static class URIComparator implements Comparator<URI> {
Member

@hns hns Dec 20, 2024

It looks like this class is unused, is there a need for it given that URI implements Comparable?

Member Author

Will remove it. It could be used to print out links in the same order every time, but it doesn't seem important now.
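
For reference, a stable output order is available without a custom comparator, because `java.net.URI` implements `Comparable<URI>`. A minimal sketch, with a hypothetical `SortedLinks` helper that is not part of the PR:

```java
import java.net.URI;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of the PR.
final class SortedLinks {
    // A TreeSet uses URI's natural ordering and also drops duplicates.
    static List<URI> sorted(List<String> links) {
        TreeSet<URI> set = new TreeSet<>();
        for (String link : links) {
            set.add(URI.create(link));
        }
        return List.copyOf(set);
    }

    public static void main(String[] args) {
        sorted(List.of(
                "https://www.w3.org/",
                "https://docs.oracle.com/",
                "https://docs.oracle.com/"))
            .forEach(System.out::println);
    }
}
```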

}
extLinks.addAll(input.lines()
.map(line -> line.replaceAll("\\@\\@JAVASE_VERSION\\@\\@", String.valueOf(Runtime.version().feature())))
.filter(line -> !line.startsWith("#"))
Member

Nitpicking for mostly aesthetic reasons, but it would be nice to do the filter before the map, and pull the String.valueOf(...) out into a variable.
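
The suggested reordering might look roughly like this. It is a sketch under assumptions: the `WhitelistLines` class and `expand` method are hypothetical, the feature version is passed in as a parameter so the logic can be exercised in isolation, and the literal `String.replace` stands in for the regex-based `replaceAll` of the original.

```java
import java.util.List;

// Hypothetical helper, not part of the PR.
final class WhitelistLines {
    static final String VERSION_MACRO = "@@JAVASE_VERSION@@";

    static List<String> expand(String input, int featureVersion) {
        // Computed once, outside the stream pipeline.
        String version = String.valueOf(featureVersion);
        return input.lines()
                // Drop comment lines first so they are never rewritten.
                .filter(line -> !line.startsWith("#"))
                // Literal replacement, so the '@' characters need no escaping.
                .map(line -> line.replace(VERSION_MACRO, version))
                .toList();
    }

    public static void main(String[] args) {
        String sample = """
                # Lines starting with # are comments.
                https://docs.oracle.com/en/java/javase/@@JAVASE_VERSION@@/docs/specs/man/java.html
                http://cldr.unicode.org/
                """;
        expand(sample, Runtime.version().feature()).forEach(System.out::println);
    }
}
```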

String fragment = null;

// The checker runs into a problem with links that have more than one hash character.
// You cannot create a URI unless you convert the second hash character into
Member

Unfinished sentence, you could say "... unless the second hash is escaped."
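
To illustrate the problem the code comment describes: `java.net.URI` rejects a reference whose fragment contains a second `#`, while the same reference parses once the extra hash is percent-encoded. The `HashEscaper` class below is a hypothetical sketch of that escaping, not the code from the PR.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the PR.
final class HashEscaper {
    // Keeps the first '#' as the fragment delimiter and percent-encodes
    // every later '#' as %23.
    static String escapeExtraHashes(String ref) {
        int first = ref.indexOf('#');
        if (first < 0) {
            return ref;
        }
        return ref.substring(0, first + 1)
                + ref.substring(first + 1).replace("#", "%23");
    }

    public static void main(String[] args) throws URISyntaxException {
        String ref = "https://example.org/page.html#section#detail";
        try {
            new URI(ref);
        } catch (URISyntaxException e) {
            System.out.println("rejected: " + e.getReason());
        }
        URI uri = new URI(escapeExtraHashes(ref));
        System.out.println("raw fragment: " + uri.getRawFragment());
    }
}
```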

}
});
} catch (IOException e) {
e.printStackTrace();
Member

What's the reason to swallow exceptions here?

Member Author

I could let the exception propagate, I just didn't think it was interesting.

Member Author

I think I fixed this in 95f61e6 (alongside a couple other things that popped up when I ran a linter on the code).

I know those files exist, so it should really never be thrown.

Member

@hns hns left a comment

I think this new suite of doc tests looks good and is ready to be integrated. Kudos for the substantial contribution!

One concern I still have is the potential for breakage in vetted links when rolling over to a new release. But I guess we can still come up with a more flexible version macro in case that is a problem (for example accepting both latest release and latest - 1).
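
The "latest and latest - 1" idea could be sketched as below. This is purely illustrative: the `VersionMacro` class is hypothetical, and nothing in this PR implements it. Each whitelist line expands to the set of URLs that would be accepted.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the PR.
final class VersionMacro {
    static final String MACRO = "@@JAVASE_VERSION@@";

    // Expands the macro to both the given feature release and the one before.
    // Lines without the macro yield a single entry, since the set de-duplicates.
    static Set<String> expand(String line, int featureVersion) {
        Set<String> accepted = new LinkedHashSet<>();
        accepted.add(line.replace(MACRO, String.valueOf(featureVersion)));
        accepted.add(line.replace(MACRO, String.valueOf(featureVersion - 1)));
        return accepted;
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        expand("https://docs.oracle.com/en/java/javase/@@JAVASE_VERSION@@/docs/specs/man/java.html", feature)
                .forEach(System.out::println);
    }
}
```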

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 23, 2024
@nizarbenalla
Member Author

Thanks for all the reviews, Hannes! Here goes.

/integrate

@openjdk

openjdk bot commented Dec 23, 2024

Going to push as commit ed29231.
Since your change was applied there have been 10 commits pushed to the master branch:

  • cd15ebb: 8346477: Clarify the Java manpage in relation to the JVM's OnOutOfMemoryError flags
  • bffa77b: 8346714: [ASAN] compressedKlass.cpp reported applying non-zero offset to null pointer
  • c1b868d: 8346602: Remove unused macro parameters in jni.cpp
  • 43b7e9f: 8346713: [testsuite] NeverActAsServerClassMachine breaks TestPLABAdaptToMinTLABSize.java TestPinnedHumongousFragmentation.java TestPinnedObjectContents.java
  • 249f141: 8346737: GenShen: Generational memory pools should not report zero for maximum capacity
  • d562d3c: 8343882: BasicAnnoTests doesn't handle multiple annotations at the same position
  • 7ba969a: 8346739: jpackage tests failed after JDK-8345259
  • b8e40b9: 8346688: GenShen: Missing metadata trigger log message
  • d2a4863: 8346690: Shenandoah: Fix log message for end of GC usage report
  • bcb1bda: 8345259: Disallow ALL-MODULE-PATH without explicit --module-path

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Dec 23, 2024
@openjdk openjdk bot closed this Dec 23, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Dec 23, 2024
@openjdk

openjdk bot commented Dec 23, 2024

@nizarbenalla Pushed as commit ed29231.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@nizarbenalla nizarbenalla deleted the new-docs-tests-suit branch December 23, 2024 13:52
@nizarbenalla
Member Author

/backport jdk jdk24

@openjdk

openjdk bot commented Dec 23, 2024

@nizarbenalla the backport was successfully created on the branch backport-nizarbenalla-ed292318-jdk24 in my personal fork of openjdk/jdk. To create a pull request with this backport targeting openjdk/jdk:jdk24, just click the following link:

➡️ Create pull request

The title of the pull request is automatically filled in correctly and below you find a suggestion for the pull request body:

Hi all,

This pull request contains a backport of commit ed292318 from the openjdk/jdk repository.

The commit being backported was authored by Nizar Benalla on 23 Dec 2024 and was reviewed by Hannes Wallnöfer.

Thanks!

If you need to update the source branch of the pull then run the following commands in a local clone of your personal fork of openjdk/jdk:

$ git fetch https://github.com/openjdk-bots/jdk.git backport-nizarbenalla-ed292318-jdk24:backport-nizarbenalla-ed292318-jdk24
$ git checkout backport-nizarbenalla-ed292318-jdk24
# make changes
$ git add paths/to/changed/files
$ git commit --message 'Describe additional changes made'
$ git push https://github.com/openjdk-bots/jdk.git backport-nizarbenalla-ed292318-jdk24

Labels
integrated Pull request has been integrated javadoc javadoc-dev@openjdk.org
3 participants