Extracting IMX.to image hashes #3118

Closed
UtopianElectronics opened this issue Oct 29, 2022 · 4 comments · Fixed by #3175

UtopianElectronics commented Oct 29, 2022

IMX.to stores and displays MD5 hashes of images on download pages (like this). A nice feature would be the ability to extract those hash values and to store them in a plain text file for comparison using md5deep. Is this easily achievable?
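For reference, the comparison step itself can be sketched in plain Python without md5deep. This is only an illustration under assumptions: `md5_of_file` and `matches_expected` are hypothetical helper names, not part of gallery-dl, and the expected hash is assumed to be the hex string shown on the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # Read fixed-size chunks so large images need not fit in memory.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hash):
    """Compare a local file against a hash string copied from the site.

    Surrounding whitespace and letter case in the expected value are ignored.
    """
    return md5_of_file(path) == expected_hash.strip().lower()
```

Stripping the expected value matters if the extracted string carries stray whitespace around the digest.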

@enduser420
Contributor

@mikf should something like this be enough?

--- a/gallery_dl/extractor/imagehosts.py
+++ b/gallery_dl/extractor/imagehosts.py
@@ -54,6 +54,7 @@ class ImagehostImageExtractor(Extractor):

         url, filename = self.get_info(page)
         data = text.nameext_from_url(filename, {"token": self.token})
+        data.update(self.metadata(page))
         if self.https and url.startswith("http:"):
             url = "https:" + url[5:]

@@ -63,6 +64,10 @@ class ImagehostImageExtractor(Extractor):
     def get_info(self, page):
         """Find image-url and string to get filename from"""

+    def metadata(self, page):
+        """Return additional metadata"""
+        return ()
+

 class ImxtoImageExtractor(ImagehostImageExtractor):
     """Extractor for single images from imx.to"""
@@ -108,6 +113,14 @@ class ImxtoImageExtractor(ImagehostImageExtractor):
             filename += splitext(url)[1]
         return url, filename or url

+    def metadata(self, page):
+        extr = text.extract_from(page, page.index("[ FILESIZE <"))
+        return {
+            "size"      : int(extr(">", "</span>")),
+            "dimensions": extr(">", " px</span>"),
+            "hash"      : extr(">", "</span>"),
+        }
+
$ py -m gallery_dl https://imx.to/i/1qdeva -K
Keywords for directory names:
-----------------------------
category
  imxto
dimensions
  64x32
extension
  png
filename
  test-テスト
hash
  94d56c599223c59f3feb71ea603484d1  
size
  182
subcategory
  image
token
  1qdeva

Keywords for filenames and --filter:
------------------------------------
category
  imxto
dimensions
  64x32
extension
  png
filename
  test-テスト
hash
  94d56c599223c59f3feb71ea603484d1  
size
  182
subcategory
  image
token
  1qdeva

@mikf
Owner

mikf commented Nov 4, 2022

@enduser420 yeah, that would be enough.
I'd replace int with text.parse_int so it doesn't crash when the site changes and something that is not int-parseable gets extracted.
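The idea behind this suggestion, sketched as a standalone function: convert when possible and fall back to a default instead of raising. `parse_int_or_default` is a hypothetical stand-in written for illustration; it is not the actual implementation of `text.parse_int`.

```python
def parse_int_or_default(value, default=0):
    """Convert value to int, returning default when conversion fails."""
    try:
        return int(value)
    except (ValueError, TypeError):
        # ValueError: non-numeric strings such as "12.5 KB" or "".
        # TypeError: values such as None when nothing was extracted.
        return default
```

With a helper like this, an unexpected page layout yields a placeholder size rather than aborting the whole extraction.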

@enduser420
Contributor

this won't work, the website shows sizes greater than 1k in bytes-amounts, with a space between the size and unit
https://imx.to/img-57a2050547b97.html

@mikf
Owner

mikf commented Nov 4, 2022

Then use text.parse_bytes() after removing the trailing B and spaces, I'd guess.
Or don't extract size.
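A rough sketch of the first option, assuming the page shows values such as "182" or "12.5 KB". The exact strings the site produces and the use of 1024-based units are assumptions, and `parse_size` is a hypothetical helper rather than gallery-dl's `text.parse_bytes()`.

```python
# Assumed binary multipliers; the site might use 1000-based units instead.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value, default=0):
    """Parse strings like '182', '182 B' or '12.5 KB' into a byte count."""
    # Normalize case, then drop the trailing "B" and surrounding spaces.
    cleaned = value.strip().upper().rstrip("B").strip()
    multiplier = 1
    if cleaned and cleaned[-1] in UNITS:
        multiplier = UNITS[cleaned[-1]]
        cleaned = cleaned[:-1].strip()
    try:
        return round(float(cleaned) * multiplier)
    except (ValueError, OverflowError):
        # Anything that is not a finite number falls back to the default.
        return default
```

Returning a default on failure keeps the same crash-free behavior suggested above for plain integers.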
