-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8337111: Bad HTML checker for generated documentation #21879
Conversation
/label add javadoc This PR add 4 separate checkers related to checking the generated documentation /issue add 8337114 |
👋 Welcome back nbenalla! A progress list of the required criteria for merging this PR into |
@nizarbenalla This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 8 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the ➡️ To integrate this PR with the above commit message to the |
@nizarbenalla |
@nizarbenalla The issue |
@nizarbenalla |
@nizarbenalla |
/issue add 8337114 |
@nizarbenalla |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite done reviewing this, but here's a first batch of comments. By and large I like what I see, the comments are mostly about details.
links = true; | ||
badchars = true; | ||
doctype = true; | ||
}else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space before else
above, and in several places after if
in the lines below.
|
||
if (!checks.isEmpty()) { | ||
if (checks.contains(",")) { | ||
EXCLUDE_LIST.addAll(Arrays.asList(checks.split(","))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the content of EXCLUDE_LIST
is never used.
* Base class for HTML checkers. | ||
* <p> | ||
* For details on HTML syntax and the terms used in this API, see | ||
* W3C <a href="https://www.w3.org/TR/html5/syntax.html#syntax">The HTML syntax</a>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This URL redirects to https://html.spec.whatwg.org/multipage/syntax.html#syntax. Should we use that URL instead?
|
||
@Override | ||
public void xml(int line, Map<String, String> attrs) { | ||
xml = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The xml
field is never used.
allURIs = new HashMap<>(); | ||
} | ||
|
||
public void setBaseDirWereChecking(Path dir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be "Where"? I think just setBaseDirectory
or setBaseDir
would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was supposed to be "set base dir we're checking", I wrote it as I was thinking.
if (m2.find()) { | ||
return Charset.forName(m2.group(1)); | ||
} | ||
return html5 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the basis for assuming ISO-8859-1 for non-HTML5 files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed text would be written in latin characters, but I guess this can be removed and we can simply use UTF8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unicode has some characters such as bidi characters which I don't want to allow but this test should only check for bistrot and character encoding, so UTF_8 could work as a default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that JDK source code is inconsistent. The makefile commands for javac
specify just ASCII (since forever) but the javadoc
commands allow ISO-8859-1, which was popular before the widespread acceptance of UTF-8.
Ideally, all should be self-consistent -- JDK policy, javac
commands, javadoc
commands, DocCheck, etc.
Note that resource files might use different encodings to *.java
source files.
Pattern p = Pattern.compile("(?i)doctype" | ||
+ "\\s+html" | ||
+ "\\s+([a-z]+)" | ||
+ "\\s+\"([^\"]+)\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doctype legacy string can also use single quotes. Accepting that would make the test more robust, although making sure start and end quotes match could be difficult using a regex.
+ "\\s+html" | ||
+ "\\s+([a-z]+)" | ||
+ "\\s+\"([^\"]+)\"" | ||
+ "(?:\\s+\"([^\"]+)\")?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this part of the doctype specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can match the string :legacy-compat
in <!DOCTYPE html SYSTEM "about:legacy-compat">
or <!DOCTYPE html SYSTEM 'about:legacy-compat'>
private final Log log; | ||
private final Map<Path, IDTable> allFiles; | ||
private final Map<URI, IDTable> allURIs; | ||
private boolean checkInwardReferencesOnly; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field is always false
.
} | ||
|
||
public void log(Path path, int line, String message, Object... args) { | ||
errors.add(formatErrorMessage(path, line, message, args)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a strange that this class is called Log
and it has several log
methods, but they are also used to report and track errors. It seems like some checkers use this class to track errors, while others use it purely for logging. Maybe the two features should be separated, for example by adding a dedicated logError
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly use this to store errors but also anything I want to the test to output later. I can find a way to separate this.
Small update (besides the large file with external links), not yet done updating the tests based on review comments. |
separate checks into different jtreg tests fix typos
I think I'd like to go for warnings instead of errors for external links, at least for a little while to avoid unnecessary failures in CI. |
@@ -0,0 +1,764 @@ | |||
http://cldr.unicode.org/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that would be nice to have (and easy to implement) is to treat lines starting with #
as comments and add a few lines at the top of the file describing the purpose of this file and how to add new links.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks!
https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/String.html | ||
https://docs.oracle.com/en/java/javase/24/docs/specs/javadoc/javadoc-search-spec.html | ||
https://docs.oracle.com/en/java/javase/24/docs/specs/man/javadoc.html | ||
https://docs.oracle.com/en/java/javase/24/docs/specs/man/java.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this right that some links are left pointing to version 23 and 24 resources, while most are updated to 25?
I'm afraid it will be unpractical to manually update these links twice a year, so we should use some placeholder/macro to insert the current feature release. But that will only work if the links are also generated uniformly with the current feature release (for example by the @extLink
taglet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The java man page has links to previous releases (I think these are added manually).
SourceVersion.java
and ClassFileFormatVersion.java
have hardcoded links to jls24 and jvms24 links, as in:
* @see <a
* href="https://docs.oracle.com/javase/specs/jls/se24/html/index.html">
* <cite>The Java Language Specification, Java SE 24 Edition</cite></a>
*/
RELEASE_24,
Not 100% sure but I think anything that isn't pointing to se25
is hardcoded to point to a different link, I could use a special macro for the current se25
links (something like @VERSION@
and then substitute it when reading from the file).
I forgot to link this issue. /issue add 8337116 |
@nizarbenalla |
…ase using a regex
I've added a small message at beginning on the whitelist file, and copied part of Lines starting with Additionally, I'd like to backport this to JDK 24, what do you think? |
} | ||
} | ||
|
||
static class URIComparator implements Comparator<URI> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this class is unused, is there a need for it given that URI
implements Comparable
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove it, it could be used to print out links in the same order everytime but it doesn't seem important now.
} | ||
extLinks.addAll(input.lines() | ||
.map(line -> line.replaceAll("\\@\\@JAVASE_VERSION\\@\\@", String.valueOf(Runtime.version().feature()))) | ||
.filter(line -> !line.startsWith("#")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicking for mostly aesthetic reasons, but it would be nice to do the filter before the map, and pull the String.valueof(...)
out into a variable.
String fragment = null; | ||
|
||
// The checker runs into a problem with links that have more than one hash character. | ||
// You cannot create a URI unless you convert the second hash character into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfinished sentence, you could say "... unless the second hash is escaped."
} | ||
}); | ||
} catch (IOException e) { | ||
e.printStackTrace(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason to swallow exceptions here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could let the exception propagate, I just didn't think it was interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I fixed this in 95f61e6 (alongside a couple other things that popped up when I ran a linter on the code).
I know those files exist, so it should really never be thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this new suite of doc dtests looks good and is ready to be integrated. Kudos for the substantial contribution!
One concern I still have is the potential of breakage in vetted links when rolling over to a new release. But I guess we can still come up with a more version macro in case that is a problem (for example accepting both latest release and latest - 1).
Thanks for all the reviews Hannes! here goes. /integrate |
Going to push as commit ed29231.
Your commit was automatically rebased without conflicts. |
@nizarbenalla Pushed as commit ed29231. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
/backport jdk jdk24 |
@nizarbenalla the backport was successfully created on the branch backport-nizarbenalla-ed292318-jdk24 in my personal fork of openjdk/jdk. To create a pull request with this backport targeting openjdk/jdk:jdk24, just click the following link: The title of the pull request is automatically filled in correctly and below you find a suggestion for the pull request body:
If you need to update the source branch of the pull then run the following commands in a local clone of your personal fork of openjdk/jdk:
|
Doccheck's human-generated reports are great at previewing a "chessboard" of results. Giving reader a quick glimpse at the quality/health of the documentation. But these tests needed to be automated and they didn't easily translate to something that can be integrated into a CI.
This PR includes an HTML and internal link test on
api/java.base
and a BadChars and Doctype test on the entire generated documentation bundle.Here is an example of the output after running all tests on
api/java.base
Note: There is an active PR to fix the broken anchors left in
java.base
so this is not a blocker.Progress
Issues
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21879/head:pull/21879
$ git checkout pull/21879
Update a local copy of the PR:
$ git checkout pull/21879
$ git pull https://git.openjdk.org/jdk.git pull/21879/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 21879
View PR using the GUI difftool:
$ git pr show -t 21879
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21879.diff
Using Webrev
Link to Webrev Comment