Generic extractor, see issue #683 #735
Conversation
Oops, I'm going to fix trivial failures (long lines, CategorySubcategory name, etc.), but I guess that
This fixes a bug when forcing the generic extractor on URLs without a scheme (which are allowed by all other extractors).

Almost all extractors accept URLs without an initial http(s) scheme. Many extractors also allow for generic subdomains in their `pattern` variable; some of them implement this with the regex character class `[^.]+` (everything but a dot). This leads to a problem when the extractor is given a URL starting with `g:` or `r:` (to force the generic or recursive extractor) and without the http(s) scheme: e.g. with "r:foobar.tumblr.com", the "r:" is wrongly considered part of the subdomain. This commit fixes the bug by replacing the overly generic `[^.]+` with the more specific `[\w-]+` (letters, digits and "-", the only characters allowed in domain names), which is already used by some extractors.
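The subdomain bug described above can be reproduced in isolation. The tumblr-style patterns below are simplified stand-ins, not the actual extractor patterns:

```python
import re

url = "r:foobar.tumblr.com"  # recursive-extractor prefix, no scheme

# Hypothetical simplified patterns; the real extractor patterns have more parts.
loose  = re.compile(r"^(?:https?://)?(?P<sub>[^.]+)\.tumblr\.com")
strict = re.compile(r"^(?:https?://)?(?P<sub>[\w-]+)\.tumblr\.com")

m = loose.match(url)
print(m.group("sub"))     # "r:foobar" -- the prefix is swallowed into the subdomain
print(strict.match(url))  # None -- ":" is not in [\w-], so the match fails
```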
Thanks a lot for everything here.
There are some smaller things that should be changed here and there, but most of it looks good.
gallery_dl/extractor/generic.py
```python
pattern = r"""(?ix)
    (?P<generic>g(?:eneric)?:)?     # optional "g(eneric):" prefix
    (?P<scheme>https?://)?          # optional http or https scheme
    (?P<domain>[^/?&#]+)            # required domain
    (?P<path>/[^?&#]*)?             # optional path
    (?:\?(?P<query>[^/?#]*))?       # optional query
    (?:\#(?P<fragment>.*))?$        # optional fragment
"""
```
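For reference, the quoted pattern can be exercised standalone; the group names below come from the snippet, but this is an independent sketch:

```python
import re

# Rebuilt from the pattern quoted above (verbose + case-insensitive mode).
pattern = re.compile(r"""(?ix)
    (?P<generic>g(?:eneric)?:)?     # optional "g(eneric):" prefix
    (?P<scheme>https?://)?          # optional http or https scheme
    (?P<domain>[^/?&#]+)            # required domain
    (?P<path>/[^?&#]*)?             # optional path
    (?:\?(?P<query>[^/?#]*))?       # optional query
    (?:\#(?P<fragment>.*))?$        # optional fragment
""")

m = pattern.match("g:https://example.com/gallery?page=2#top")
print(m.group("generic"))  # "g:"
print(m.group("domain"))   # "example.com"
print(m.group("path"))     # "/gallery"

# It also accepts a bare word, since every part except "domain" is optional:
print(pattern.match("asd").group("domain"))  # "asd"
```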
This currently matches too much, for example non-URLs like `asd`.
It should require a TLD for something to be considered a URL.
I'm also not too comfortable with making the URL scheme optional here. Maybe allow it to be optional when using the `g:` or `generic:` prefix, but require it otherwise.
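A stricter variant along the lines the review suggests could gate the optional scheme on the prefix and require at least one dot in the domain. This is a hypothetical sketch, not code from the PR:

```python
import re

# Assumption: scheme is optional only with the "g(eneric):" prefix,
# and a domain must contain a dot (so bare words are rejected).
with_prefix    = r"g(?:eneric)?:(?:https?://)?"
without_prefix = r"https?://"
domain         = r"[\w-]+(?:\.[\w-]+)+"   # at least one dot, e.g. "example.com"

strict = re.compile(r"(?i)^(?:%s|%s)%s" % (with_prefix, without_prefix, domain))

print(bool(strict.match("asd")))                  # False -- bare word rejected
print(bool(strict.match("g:example.com")))        # True  -- prefix allows schemeless
print(bool(strict.match("https://example.com")))  # True
print(bool(strict.match("example.com")))          # False -- no prefix, no scheme
```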
I don't see any problems; the aim of this regexp is to detect the various URL components, and it's up to the user to provide a valid URL (by the way, "localhost" or any host name on a local private network is a valid URL). If the host can't be resolved, the user will receive an error from urllib instead of 'No suitable extractor found'.

Regarding the URL scheme, it's currently optional for all extractors (except directlink.py), so why should it be required here? Another question is what it should default to; currently I've set it to `https`, but maybe it should be `http`?
c2b8675 made the domain part of the regex stricter (`[-\w.]`, as used in many other extractors), and this also fixes the failures in `test_add` and `test_add_module` (otherwise the "fake:foobar" URL used by the FakeExtractor would be considered a valid domain for the generic extractor).

I also made sure the generic extractor is disabled by default unless `extractor.generic.enabled` is set, mimicking the ytdl extractor, so maybe leaving the https scheme optional is not too much of a problem: for a scheme-less URL to work, one must enable the generic extractor, either via the config file or by prepending `g(eneric):` to the URL.
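The effect of the stricter domain class can be checked in isolation; this sketch isolates just the domain check, while the full pattern in the PR has more parts:

```python
import re

# Assumption: a minimal domain-only matcher using the stricter [-\w.] class.
domain = re.compile(r"^(?:https?://)?(?P<domain>[-\w.]+)$")

print(domain.match("foobar.tumblr.com").group("domain"))  # "foobar.tumblr.com"
print(domain.match("fake:foobar"))                        # None -- ":" not allowed
```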
```python
absimageurls.append(self.baseurl + '/' + u)

# Remove duplicates
absimageurls = set(absimageurls)
```
This reorders all image URLs on Python 3.4 and 3.5.
I don't think that's too important, but there are ways to remove duplicates without putting everything in a set().
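One order-preserving alternative to `set()` is a seen-set filter; `unique` here is a hypothetical helper, not code from the PR:

```python
def unique(urls):
    """Drop duplicates while keeping first-seen order."""
    seen = set()
    # seen.add() returns None (falsy), so the condition keeps
    # a URL only on its first occurrence.
    return [u for u in urls if not (u in seen or seen.add(u))]

print(unique(["a.jpg", "b.jpg", "a.jpg", "c.jpg"]))
# ['a.jpg', 'b.jpg', 'c.jpg']
```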
I'm not sure image URL order is really important; usually you end up with a directory of image files, and what matters is the file name. Of course the image URL order would matter if you used `num` to build a custom `filename_fmt`.

Anyway, even if we removed duplicates while preserving image URL order in the `imageurls` list, at the moment this wouldn't reflect the order of image URLs in the page, since `imageurls` is the product of two passes over the page, one for each search strategy (`imageurl_pattern_src` and `imageurl_pattern_ext`); but of course we could merge them into one big alternative regex and make only one pass (I originally divided it in two mainly for ease of testing).
So we have two options:
1. Status quo: don't care about image URL order
2. Refactor: merge the imageurl search strategies into one regex, and remove duplicates while preserving imageurl order

I would go for option 1 for now, and leave option 2 as an enhancement for the future.
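Option 2 could look like the following. The two pattern bodies are simplified stand-ins borrowing only the names from the comment above; the real strategies in the PR differ:

```python
import re

# Hypothetical simplified versions of the two search strategies, merged into
# one alternation so a single pass over the page preserves in-page order.
imageurl_pattern_src = r'<img[^>]+src=["\'](?P<src>[^"\']+)'
imageurl_pattern_ext = r'(?P<ext>https?://\S+\.(?:jpe?g|png|gif|webp))'

combined = re.compile(
    "(?:%s)|(?:%s)" % (imageurl_pattern_src, imageurl_pattern_ext))

page = '<img src="/a.png"> see https://x.example/b.jpg too'
# Exactly one of the two named groups is set per match.
urls = [m.group("src") or m.group("ext") for m in combined.finditer(page)]
print(urls)  # ['/a.png', 'https://x.example/b.jpg']
```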
Uhm, wait: do I really need to add extractor code to remove duplicate image URLs? Can't I just set `extractor.generic.image-unique` in the extractor, and gallery-dl will take care of removing duplicates? Or am I missing something?
Fixed the domain section for "pattern", to pass "test_add" and "test_add_module" tests. Added the "enabled" configuration option (default False) to enable the generic extractor. Using "g(eneric):URL" forces using the extractor.
First implementation for a generic extractor, see comments in #683 and in the files.