Extracting IMX.to image hashes #3118

UtopianElectronics · 2022-10-29T09:46:28Z

IMX.to stores and displays MD5 hashes of images on download pages (like this). A nice feature would be the ability to extract those hash values and to store them in a plain text file for comparison using md5deep. Is this easily achievable?


enduser420 · 2022-11-03T13:17:55Z

@mikf should something like this be enough?

--- a/gallery_dl/extractor/imagehosts.py
+++ b/gallery_dl/extractor/imagehosts.py
@@ -54,6 +54,7 @@ class ImagehostImageExtractor(Extractor):

         url, filename = self.get_info(page)
         data = text.nameext_from_url(filename, {"token": self.token})
+        data.update(self.metadata(page))
         if self.https and url.startswith("http:"):
             url = "https:" + url[5:]

@@ -63,6 +64,10 @@ class ImagehostImageExtractor(Extractor):
     def get_info(self, page):
         """Find image-url and string to get filename from"""

+    def metadata(self, page):
+        """Return additional metadata"""
+        return ()
+

 class ImxtoImageExtractor(ImagehostImageExtractor):
     """Extractor for single images from imx.to"""
@@ -108,6 +113,14 @@ class ImxtoImageExtractor(ImagehostImageExtractor):
             filename += splitext(url)[1]
         return url, filename or url

+    def metadata(self, page):
+        extr = text.extract_from(page, page.index("[ FILESIZE <"))
+        return {
+            "size"      : int(extr(">", "</span>")),
+            "dimensions": extr(">", " px</span>"),
+            "hash"      : extr(">", "</span>"),
+        }
+

$ py -m gallery_dl https://imx.to/i/1qdeva -K
Keywords for directory names:
-----------------------------
category
  imxto
dimensions
  64x32
extension
  png
filename
  test-テスト
hash
  94d56c599223c59f3feb71ea603484d1  
size
  182
subcategory
  image
token
  1qdeva

Keywords for filenames and --filter:
------------------------------------
category
  imxto
dimensions
  64x32
extension
  png
filename
  test-テスト
hash
  94d56c599223c59f3feb71ea603484d1  
size
  182
subcategory
  image
token
  1qdeva
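
For the comparison step from the original request, a minimal sketch of checking a downloaded file against the extracted `hash` value might look like the following. The helper names `md5_of_file` and `md5_matches` are hypothetical and not part of gallery-dl. The sketch assumes the extracted value may carry surrounding whitespace, so it is stripped before comparing.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at 'path' (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # Read in chunks so large images are not loaded into memory at once
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected):
    """Compare a file's MD5 with an extracted hash string (hypothetical helper)."""
    # Strip whitespace and normalize case before comparing
    return md5_of_file(path) == expected.strip().lower()
```

A tool such as md5deep could do the same comparison from a plain-text hash list. This sketch only covers the single-file case.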

mikf · 2022-11-04T15:37:19Z

@enduser420 yeah, that would be enough.
I'd replace int with text.parse_int so it doesn't crash when the site changes and something that is not int-parseable gets extracted.
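
To illustrate the idea of a non-crashing integer parse, here is a small standalone sketch. `safe_int` is a hypothetical stand-in written for this example. It is not gallery-dl's `text.parse_int`, whose exact behavior may differ.

```python
def safe_int(value, default=0):
    """Return int(value), or 'default' if the value cannot be parsed (hypothetical helper)."""
    try:
        return int(value)
    except (ValueError, TypeError):
        # Unparseable text or None: fall back instead of raising
        return default
```

With a plain `int(...)`, a value such as `"1.5 KB"` would raise `ValueError` and abort the extraction. With a fallback, the field simply gets the default value.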

enduser420 · 2022-11-04T15:56:05Z

This won't work: the website shows sizes greater than 1k as byte amounts with a unit, with a space between the size and the unit.
https://imx.to/img-57a2050547b97.html

mikf · 2022-11-04T15:59:10Z

Then use text.parse_bytes() after removing the trailing B and spaces, I'd guess.
Or don't extract size.
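
A rough sketch of that suggestion, written as a standalone function rather than with gallery-dl's own helper: `parse_size` is a hypothetical name, and the input formats (`"182"`, `"182 B"`, `"1.5 KB"`) and the 1024-based multipliers are assumptions about what the site displays, not confirmed in this thread.

```python
def parse_size(value, default=0):
    """Convert a string such as '182' or '1.5 KB' to a byte count (hypothetical helper)."""
    # Remove surrounding whitespace, then a trailing 'B' and the space before it
    cleaned = value.strip().upper()
    if cleaned.endswith("B"):
        cleaned = cleaned[:-1].rstrip()
    if not cleaned:
        return default

    # Assumed unit prefixes, each a power of 1024
    suffixes = "KMGT"
    multiplier = 1
    if cleaned[-1] in suffixes:
        multiplier = 1024 ** (suffixes.index(cleaned[-1]) + 1)
        cleaned = cleaned[:-1].rstrip()

    try:
        return round(float(cleaned) * multiplier)
    except (ValueError, OverflowError):
        # Anything that is not a number falls back to the default
        return default
```

The function expects a string. Anything it cannot interpret yields `default` instead of raising, in the same spirit as the `text.parse_int` suggestion above.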

mikf added the site:feature label Oct 29, 2022

enduser420 mentioned this issue Nov 7, 2022

[imxto] extract additional metadata #3175

Merged

mikf closed this as completed in #3175 Nov 11, 2022
