Generic extractor, see issue #683 #735
Conversation
Oops, I'm going to fix trivial failures (long lines, CategorySubcategory name, etc.), but I guess that
This fixes a bug when forcing the generic extractor on URLs without a scheme (which are allowed by all other extractors).

Almost all extractors accept URLs without an initial http(s) scheme. Many extractors also allow for generic subdomains in their `pattern` variable; some of them implement this with the regex character class `[^.]+` (everything but a dot). This leads to a problem when the extractor is given a URL starting with `g:` or `r:` (to force the generic or recursive extractor) and without the http(s) scheme: e.g. with "r:foobar.tumblr.com", the "r:" is wrongly considered part of the subdomain. This commit fixes the bug by replacing the overly generic `[^.]+` with the more specific `[\w-]+` (letters, digits and "-", the only characters allowed in domain names), which is already used by some extractors.
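The subdomain bug described above can be reproduced in isolation. The tumblr-style patterns below are simplified stand-ins, not the actual extractor patterns:

```python
import re

url = "r:foobar.tumblr.com"  # recursive-extractor prefix, no scheme

# Hypothetical simplified patterns; the real extractor patterns have more parts.
loose  = re.compile(r"^(?:https?://)?(?P<sub>[^.]+)\.tumblr\.com")
strict = re.compile(r"^(?:https?://)?(?P<sub>[\w-]+)\.tumblr\.com")

m = loose.match(url)
print(m.group("sub"))     # "r:foobar" -- the prefix is swallowed into the subdomain
print(strict.match(url))  # None -- ":" is not in [\w-], so the match fails
```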
Thanks a lot for everything here.
There are some smaller things that should be changed here and there, but most of it looks good.
gallery_dl/extractor/generic.py
```python
pattern = r"""(?ix)
    (?P<generic>g(?:eneric)?:)?     # optional "g(eneric):" prefix
    (?P<scheme>https?://)?          # optional http or https scheme
    (?P<domain>[^/?&#]+)            # required domain
    (?P<path>/[^?&#]*)?             # optional path
    (?:\?(?P<query>[^/?#]*))?       # optional query
    (?:\#(?P<fragment>.*))?$        # optional fragment
"""
```
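For reference, the quoted pattern can be exercised standalone; the group names below come from the snippet, but this is an independent sketch:

```python
import re

# Rebuilt from the pattern quoted above (verbose + case-insensitive mode).
pattern = re.compile(r"""(?ix)
    (?P<generic>g(?:eneric)?:)?     # optional "g(eneric):" prefix
    (?P<scheme>https?://)?          # optional http or https scheme
    (?P<domain>[^/?&#]+)            # required domain
    (?P<path>/[^?&#]*)?             # optional path
    (?:\?(?P<query>[^/?#]*))?       # optional query
    (?:\#(?P<fragment>.*))?$        # optional fragment
""")

m = pattern.match("g:https://example.com/gallery?page=2#top")
print(m.group("generic"))  # "g:"
print(m.group("domain"))   # "example.com"
print(m.group("path"))     # "/gallery"

# It also accepts a bare word, since every part except "domain" is optional:
print(pattern.match("asd").group("domain"))  # "asd"
```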
This currently matches too much, for example non-URLs like `asd`.
It should require a TLD for something to be considered a URL.
I'm also not too comfortable with making the URL scheme optional here. Maybe allow it to be optional when using the `g:` or `generic:` prefix, but require it otherwise.
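A stricter variant along the lines the review suggests could gate the optional scheme on the prefix and require at least one dot in the domain. This is a hypothetical sketch, not code from the PR:

```python
import re

# Assumption: scheme is optional only with the "g(eneric):" prefix,
# and a domain must contain a dot (so bare words are rejected).
with_prefix    = r"g(?:eneric)?:(?:https?://)?"
without_prefix = r"https?://"
domain         = r"[\w-]+(?:\.[\w-]+)+"   # at least one dot, e.g. "example.com"

strict = re.compile(r"(?i)^(?:%s|%s)%s" % (with_prefix, without_prefix, domain))

print(bool(strict.match("asd")))                  # False -- bare word rejected
print(bool(strict.match("g:example.com")))        # True  -- prefix allows schemeless
print(bool(strict.match("https://example.com")))  # True
print(bool(strict.match("example.com")))          # False -- no prefix, no scheme
```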
I don't see any problems; the aim of this regexp is to detect the various URL components, and it's up to the user to provide a valid URL (by the way, "localhost" or any host name on a local private network is a valid URL). If the host can't be resolved, the user will receive an error from urllib instead of 'No suitable extractor found'.

Regarding the URL scheme, it's currently optional for all extractors (except directlink.py), so why should it be required here? Another question is what it should default to; currently I've set it to `https`, but maybe it should be `http`?
c2b8675 made the domain part of the regex stricter (`[-\w.]`, as used in many other extractors), and this also fixes the failures in `test_add` and `test_add_module` (otherwise the "fake:foobar" URL used by the FakeExtractor would be considered a valid domain for the generic extractor).

I also made sure the generic extractor is disabled by default unless `extractor.generic.enabled` is set, mimicking the ytdl extractor, so maybe leaving the https scheme optional is not too much of a problem: for a scheme-less URL to work, one must enable the generic extractor, either via the config file or by prepending `g(eneric):` to the URL.
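The effect of the stricter domain class can be checked in isolation; this sketch isolates just the domain check, while the full pattern in the PR has more parts:

```python
import re

# Assumption: a minimal domain-only matcher using the stricter [-\w.] class.
domain = re.compile(r"^(?:https?://)?(?P<domain>[-\w.]+)$")

print(domain.match("foobar.tumblr.com").group("domain"))  # "foobar.tumblr.com"
print(domain.match("fake:foobar"))                        # None -- ":" not allowed
```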
```python
absimageurls.append(self.baseurl + '/' + u)

# Remove duplicates
absimageurls = set(absimageurls)
```
This reorders all image URLs on Python 3.4 and 3.5.
I don't think that's too important, but there are ways to remove duplicates without putting everything in a set().
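One order-preserving alternative to `set()` is a seen-set filter; `unique` here is a hypothetical helper, not code from the PR:

```python
def unique(urls):
    """Drop duplicates while keeping first-seen order."""
    seen = set()
    # seen.add() returns None (falsy), so the condition keeps
    # a URL only on its first occurrence.
    return [u for u in urls if not (u in seen or seen.add(u))]

print(unique(["a.jpg", "b.jpg", "a.jpg", "c.jpg"]))
# ['a.jpg', 'b.jpg', 'c.jpg']
```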
I'm not sure image URL order is really important; usually you end up with a directory of image files, and what matters is the file name. Of course the image URL order would matter if you used `num` to build a custom `filename_fmt`.

Anyway, even if we removed duplicates while preserving image URL order in the `imageurls` list, at the moment this wouldn't reflect the order of image URLs in the page, since `imageurls` is the product of two passes over the page, one for each search strategy (`imageurl_pattern_src` and `imageurl_pattern_ext`); but of course we could merge them into one big alternative regex and make only one pass (I originally divided it in two mainly for ease of testing).
So we have two options:
1. Status quo: don't care about image URL order
2. Refactor: merge the imageurl search strategies into one regex, and remove duplicates while preserving imageurl order

I would go for option 1 for now, and leave option 2 as an enhancement for the future.
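Option 2 could look like the following. The two pattern bodies are simplified stand-ins borrowing only the names from the comment above; the real strategies in the PR differ:

```python
import re

# Hypothetical simplified versions of the two search strategies, merged into
# one alternation so a single pass over the page preserves in-page order.
imageurl_pattern_src = r'<img[^>]+src=["\'](?P<src>[^"\']+)'
imageurl_pattern_ext = r'(?P<ext>https?://\S+\.(?:jpe?g|png|gif|webp))'

combined = re.compile(
    "(?:%s)|(?:%s)" % (imageurl_pattern_src, imageurl_pattern_ext))

page = '<img src="/a.png"> see https://x.example/b.jpg too'
# Exactly one of the two named groups is set per match.
urls = [m.group("src") or m.group("ext") for m in combined.finditer(page)]
print(urls)  # ['/a.png', 'https://x.example/b.jpg']
```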
Uhm, wait: do I really need to add extractor code to remove duplicate image URLs? Can't I just set `extractor.generic.image-unique` in the extractor, and gallery-dl will take care of removing duplicates? Or am I missing something?
Fixed the domain section for "pattern", to pass "test_add" and "test_add_module" tests. Added the "enabled" configuration option (default False) to enable the generic extractor. Using "g(eneric):URL" forces using the extractor.
First implementation for a generic extractor, see comments in #683 and in the files.