New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make some urls with whitespace acceptable to JavaScript extractor. #145

Merged
merged 5 commits into from Feb 11, 2016

Conversation

Projects
None yet
2 participants
@vonrosen
Contributor

vonrosen commented Jan 20, 2016

No description provided.

@vonrosen

This comment has been minimized.

Show comment
Hide comment
@vonrosen

vonrosen Jan 20, 2016

Contributor

Make some urls with whitespace acceptable to JavaScript extractor. Also, allow UTF-16 forward-slash to be used as path separator for use-case at https://agency.wisconsin.gov/sites/prb/SitePages/Home.aspx where many urls have the format: "\u002fsites\u002fprb\u002fPublic Comment Emails PDF\u002fOpen records of electronic records_1r71vt0k.pdf"

Contributor

vonrosen commented Jan 20, 2016

Make some urls with whitespace acceptable to JavaScript extractor. Also, allow UTF-16 forward-slash to be used as path separator for use-case at https://agency.wisconsin.gov/sites/prb/SitePages/Home.aspx where many urls have the format: "\u002fsites\u002fprb\u002fPublic Comment Emails PDF\u002fOpen records of electronic records_1r71vt0k.pdf"

@nlevitt

This comment has been minimized.

Show comment
Hide comment
@nlevitt

nlevitt Jan 27, 2016

Member

Thanks Hunter.

  • ExtractorJS.considerString() calls StringEscapeUtils.unescapeJavaScript() -- it would be best if handling of "\u002f" happened somewhere around here -- it looks like StringEscapeUtils.unescapeJavaScript() already unescapes "\u002f" et al, so you can probably just get rid of your code for that
  • looks like you have a stray close-parenthesis here: if (TextUtils.matches(".*[\\s)]+.*", candidate)) { (also, the square brackets are superfluous)
  • could you add some tests to ExtractorJSTest?
Member

nlevitt commented Jan 27, 2016

Thanks Hunter.

  • ExtractorJS.considerString() calls StringEscapeUtils.unescapeJavaScript() -- it would be best if handling of "\u002f" happened somewhere around here -- it looks like StringEscapeUtils.unescapeJavaScript() already unescapes "\u002f" et al, so you can probably just get rid of your code for that
  • looks like you have a stray close-parenthesis here: if (TextUtils.matches(".*[\\s)]+.*", candidate)) { (also, the square brackets are superfluous)
  • could you add some tests to ExtractorJSTest?

Hunter Stern and others added some commits Jan 28, 2016

Simplify logic for urls with spaces, and make it better follow the pa…
…ttern of UriUtils.isVeryLikelyUri(). Involves a subtle adjustment to the regex LIKELY_RELATIVE_URI_PATTERN to ensure the 2nd capturing group always gets the file extension. Also add a couple of tests of strings with spaces and file extensions that are not known good extensions.

vonrosen pushed a commit that referenced this pull request Feb 11, 2016

Hunter
Merge pull request #145 from vonrosen/ARI-4713
Make some urls with whitespace acceptable to JavaScript extractor.

@vonrosen vonrosen merged commit bfbd61e into internetarchive:master Feb 11, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment