Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make some urls with whitespace acceptable to JavaScript extractor. #145

Merged
merged 5 commits into from Feb 11, 2016

Conversation

@vonrosen
Copy link
Contributor

@vonrosen vonrosen commented Jan 20, 2016

No description provided.

@vonrosen
Copy link
Contributor Author

@vonrosen vonrosen commented Jan 20, 2016

Make some urls with whitespace acceptable to JavaScript extractor. Also, allow UTF-16 forward-slash to be used as path separator for use-case at https://agency.wisconsin.gov/sites/prb/SitePages/Home.aspx where many urls have the format: "\u002fsites\u002fprb\u002fPublic Comment Emails PDF\u002fOpen records of electronic records_1r71vt0k.pdf"

@nlevitt
Copy link
Member

@nlevitt nlevitt commented Jan 27, 2016

Thanks Hunter.

  • ExtractorJS.considerString() calls StringEscapeUtils.unescapeJavaScript() -- it would be best if handling of "\u002f" happened somewhere around here -- it looks like StringEscapeUtils.unescapeJavaScript() already unescapes "\u002f" et al, so you can probably just get rid of your code for that
  • looks like you have a stray close-parenthesis here: if (TextUtils.matches(".*[\\s)]+.*", candidate)) { (also, the square brackets are superfluous)
  • could you add some tests to ExtractorJSTest?
Hunter Stern and others added 2 commits Jan 28, 2016
…ttern of UriUtils.isVeryLikelyUri(). Involves a subtle adjustment to the regex LIKELY_RELATIVE_URI_PATTERN to ensure the 2nd capturing group always gets the file extension. Also add a couple of tests of strings with spaces and file extensions that are not known good extensions.
vonrosen pushed a commit that referenced this pull request Feb 11, 2016
Make some urls with whitespace acceptable to JavaScript extractor.
@vonrosen vonrosen merged commit bfbd61e into internetarchive:master Feb 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants
You can’t perform that action at this time.