
Generic extractor, see issue #683 #735

Merged: 16 commits merged on Dec 29, 2021

Conversation

@Vrihub (Contributor) commented May 6, 2020

First implementation for a generic extractor, see comments in #683 and in the files.

@Vrihub Vrihub mentioned this pull request May 6, 2020
@Vrihub (Contributor, Author) commented May 6, 2020

Oops, I'm going to fix the trivial failures (long lines, CategorySubcategory name, etc.), but I guess test_unique_pattern_matches has to be fixed to accommodate the generic extractor, which by design must match any URL.

Vrihub added 6 commits May 7, 2020 00:58
This fixes a bug when we force the generic extractor on URLs without a
scheme (which all other extractors allow).
Almost all extractors accept URLs without an initial http(s) scheme.

Many extractors also allow for generic subdomains in their "pattern"
variable; some of them implement this with the regex character class
"[^.]+" (everything but a dot).

This leads to a problem when the extractor is given a URL starting
with g: or r: (to force the generic or recursive extractor)
and without the http(s) scheme: e.g. with "r:foobar.tumblr.com"
the "r:" is wrongly considered part of the subdomain.

This commit fixes the bug, replacing the too generic "[^.]+" with the
more specific "[\w-]+" (letters, digits and "-", the only characters
allowed in domain names), which is already used by some extractors.
@mikf (Owner) left a comment


Thanks a lot for everything here.
There are some smaller things that should be changed here and there, but most of it looks good.

gallery_dl/extractor/generic.py (resolved review threads)
Comment on lines 22 to 29
pattern = r"""(?ix)
(?P<generic>g(?:eneric)?:)? # optional "g(eneric):" prefix
(?P<scheme>https?://)? # optional http or https scheme
(?P<domain>[^/?&#]+) # required domain
(?P<path>/[^?&#]*)? # optional path
(?:\?(?P<query>[^/?#]*))? # optional query
(?:\#(?P<fragment>.*))?$ # optional fragment
"""
@mikf (Owner):

This currently matches too much, for example non-URLs like asd.
It should require a TLD for something to be considered a URL.

I'm also not too comfortable with making the URL scheme optional here.
Maybe allow it to be optional when using the g: or generic: prefix, but require it otherwise.

@Vrihub (Contributor, Author):

I don't see any problems: the aim of this regexp is to detect the various URL components; it's up to the user to provide a valid URL. (BTW, "localhost" or any host name on a local private network is a valid URL.) If the host can't be resolved, the user will receive an exit error from urllib instead of 'No suitable extractor found'.

Re the URL scheme: currently it's optional for all extractors (except directlink.py), so why should it be required here?
Another question is what it should default to; currently I've set it to https, but maybe it should be http?

@Vrihub (Contributor, Author) commented Dec 18, 2021

c2b8675 made the domain part of the regex stricter ([-\w.], as used in many other extractors), which also fixes the failures in test_add and test_add_module (otherwise the "fake:foobar" URL used by FakeExtractor would be considered a valid domain for the generic extractor).

I also made sure the generic extractor is disabled by default unless extractor.generic.enabled is set, mimicking the ytdl extractor. So leaving the https scheme optional is perhaps not too much of a problem: for a scheme-less URL to work, one must enable the generic extractor, either via the config file or by prepending g(eneric): to the URL.
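A minimal illustration of the stricter domain class (a sketch, not the exact pattern from c2b8675):

```python
import re

# Hypothetical, anchored domain-only pattern using the stricter class.
DOMAIN = r"(?:https?://)?(?P<domain>[-\w.]+)$"

# A real host name still parses:
print(re.match(DOMAIN, "example.org").group("domain"))  # → "example.org"

# FakeExtractor's "fake:foobar" test URL no longer counts as a domain,
# because ":" is not in [-\w.]:
print(re.match(DOMAIN, "fake:foobar"))  # → None
```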

gallery_dl/extractor/generic.py (resolved review threads)
absimageurls.append(self.baseurl + '/' + u)

# Remove duplicates
absimageurls = set(absimageurls)
@mikf (Owner):

This reorders all image URLs on Python 3.4 and 3.5.
I don't think that's too important, but there are ways to remove duplicates without putting everything in a set().
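One such way, sketched here with made-up URLs: OrderedDict.fromkeys de-duplicates while preserving insertion order even on Python 3.4/3.5, where plain dicts and sets make no order guarantee.

```python
from collections import OrderedDict

absimageurls = [
    "https://example.org/a.jpg",  # hypothetical image URLs
    "https://example.org/b.jpg",
    "https://example.org/a.jpg",  # duplicate of the first
]

# De-duplicate while keeping the first occurrence of each URL in place.
absimageurls = list(OrderedDict.fromkeys(absimageurls))
print(absimageurls)
# → ['https://example.org/a.jpg', 'https://example.org/b.jpg']
```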

@Vrihub (Contributor, Author):

I'm not sure image URL order is really important; usually you end up with a directory of image files, and what matters is the file name. Of course, image URL order would matter if you used num to build a custom filename_fmt.

Anyway, even if we removed duplicates while preserving image URL order in the imageurls list, at the moment this wouldn't reflect the order of image URLs in the page, since imageurls is the product of two passes over the page, one for each search strategy (imageurl_pattern_src and imageurl_pattern_ext). Of course, we could merge them into one big alternation regex and make only one pass (I originally divided it in two mainly for ease of testing).

So we have two options:

  1. Status quo: don't care about image URL order
  2. Refactor: merge the imageurl search strategies into one regex, and remove duplicates while preserving imageurl order

I would go for option 1 for now, and leave option 2 as a future enhancement.

@Vrihub (Contributor, Author):

Uhm, wait: do I really need to add extractor code to remove duplicate image URLs? Can't I just set extractor.generic.image-unique in the extractor and let gallery-dl take care of removing duplicates? Or am I missing something?
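If image-unique behaves like other per-extractor options, enabling it alongside the generic extractor in the config file might look like this (a sketch, not verified against the final merged code):

```json
{
    "extractor": {
        "generic": {
            "enabled": true,
            "image-unique": true
        }
    }
}
```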

@mikf mikf merged commit 96fcff1 into mikf:master Dec 29, 2021
2 participants