[Kemono] Re-download a post's text content when it is edited #3800

a84r7a3rga76fg · 2023-03-19T03:46:26Z

I've added archive and archive-format to this postprocessor and I'm wondering if this will re-download a post's text content when content and/or embed[url] is edited? I'm especially not sure if archive-format should be {content}|{embed[url]} or if I should be using two postprocessors for content and embed[url] respectively in archive-format.

    "postprocessors": [{
        "name": "metadata",
        "event": "post",
        "filename": "{_now!s:.16} {id}.txt",
        "filter": "embed.get('url') or re.search(r'(?i)(redgifs|atomicloli|gfycat|google|drive|onedrive|1drv|mega|xgf|k00|koofr|gigafile|mediafire|porn3dx|gofile|dropbox)', content)",
        "mode": "custom",
        "format": "{content}\n{embed[url]:?/\n/}",
        "archive-format": "{service}_{user}_{id}_{content}|{embed[url]}",
        "archive": "./gallery-dl/test/KEMONO-TEXT-archive.db"
    }],

mikf · 2023-03-19T11:18:10Z

{content}|{embed[url]}

That won't work if you want it to be content OR embed[url].
It will instead write both fields, separated by a |.

I mean, it is going to do what you want it to - detecting changes in either content or embed[url] - but it might be quite wasteful. Maybe using the edited date would be better?
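
A minimal Python sketch of the point above: the key contains both fields joined by a literal "|", and it changes when either field changes. It only mimics the substitution with `str.format`; the `archive_key` helper and the sample post are hypothetical and not gallery-dl code.

```python
# Hypothetical stand-in for the archive-format substitution; not gallery-dl code.
ARCHIVE_FORMAT = "{service}_{user}_{id}_{content}|{embed[url]}"


def archive_key(post: dict) -> str:
    """Build the archive key by plain string substitution."""
    return ARCHIVE_FORMAT.format(**post)


original = {
    "service": "patreon", "user": "123", "id": "456",
    "content": "first version", "embed": {"url": "https://example.org/a"},
}
# The same post after only the embed URL was edited.
edited = {**original, "embed": {"url": "https://example.org/b"}}

key_before = archive_key(original)
key_after = archive_key(edited)

# Both fields are written, joined by "|"; it is not an OR.
print(key_before)
# An edit to either field produces a new key, so the post is not skipped.
print(key_before != key_after)
```

Because the whole text ends up in the key, every stored entry is as long as the post itself, which is the wasteful part.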

AlttiRi · 2023-03-19T18:02:40Z

#3679

    "archive": "~/gallery-dl/gallery-dl-kemono-postprocessor.sqlite",
    "archive-prefix": "\fF {category}",
    "archive-format": "_{service}_{user}_{id}_{hash_sha1(content)[:10]}",

UPD:
Ah, yes, with embed[url]:

    "archive-format": "_{service}_{user}_{id}_{hash_sha1(content + (embed and embed['url'] or ''))[:10]}",
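
For reference, a rough Python sketch of what that expression computes. It assumes `hash_sha1()` returns the hex SHA-1 digest of the UTF-8 encoded text; the local `hash_sha1` and `archive_suffix` functions here are stand-ins written for illustration, not gallery-dl's own code.

```python
import hashlib


def hash_sha1(text: str) -> str:
    """Assumed equivalent of hash_sha1(): hex SHA-1 of the UTF-8 text."""
    return hashlib.sha1(text.encode()).hexdigest()


def archive_suffix(content: str, embed: dict) -> str:
    # Same expression as in the archive-format, cut to 10 hex characters.
    # An empty embed dict is falsy, so it contributes an empty string.
    return hash_sha1(content + (embed and embed["url"] or ""))[:10]


print(archive_suffix("some text", {}))
print(archive_suffix("some text", {"url": "https://example.org/a"}))
```

The suffix changes whenever the content or the embed URL changes, while the archive entry itself stays short.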

AlttiRi · 2023-03-19T18:06:57Z

I also suggest saving the content in HTML files, as in the linked issue.

Then you can concatenate all the HTML files into one file and open it in a browser.

For example, .bashrc alias for Windows with Git Bash:

alias catahtml='cat *.html > "$TEMP/_temp-catahtml-result.html"; start "$TEMP/_temp-catahtml-result.html"; sleep 0; exit;'

Then use my userscript to parse the URLs from the HTML file.
https://github.com/AlttiRi/href-taker
Check the readme.

Although, if you will only parse them programmatically, just saving the only text value inside a .txt is also OK.

AlttiRi · 2023-03-19T19:06:30Z

Feel free to use.

"postprocessors": [
    {
        "name": "mtime",
        "event": "post"
    },
    {
        "name":  "metadata",
        "event": "post",
        "mode":  "custom",
        
        "directory": "metadata",
        "filename":  "\fF [{category}] {user}—{id}—{title}—{hash_sha1(content + (embed['url'] if embed else ''))[:10]}.html",
        "content-format": "\fT ~/gallery-dl/templates/kemono.html",
        
        "archive": "~/gallery-dl/gallery-dl-kemono-postprocessor.sqlite",
        "archive-prefix": "\fF {category}",
        "archive-format": "_{service}_{user}_{id}_{hash_sha1(content + (embed['url'] if embed else ''))[:10]}",

        "mtime": true
    }
]

~/gallery-dl/templates/kemono.html

<div class="post" id="{id}" data-added="{added}" data-published="{published}" data-edited="{edited}">
  <h4>
    <a href="https://kemono.party/{service}/user/{user}/post/{id}">{title}</a><span class="id" style="color: gray;"><i> #{id}</i></span>
  </h4>
  <div class="content">{content}</div>
  <div class="content embed" title="{embed[subject]:?//}">{embed[url]:?//}</div>
  <br>
  <div class="date"><i>{date:%Y.%m.%d %H:%M:%S}</i></div>
  <hr>
</div>

AlttiRi · 2023-03-19T19:09:48Z

BTW, is it possible to use f-string in templates? The custom formatting syntax is too difficult to write and read.

a84r7a3rga76fg · 2023-03-19T19:58:08Z

Is it better to hash the text data? Aren't the two methods practically the same? Is there a scenario where storing the hash value of the text data beats storing the text data itself?

That's a cool idea to combine all of the HTML files into one, I'll do that too.

Is there a difference in saving the archive file as sqlite, sqlite3 or db?

AlttiRi · 2023-03-19T20:29:38Z

Storing a hash reduces the DB size, and you can also use it as a filename suffix; see the example above. (That generates a unique filename for each post edit; otherwise the metadata file for a new edit would overwrite the old one.)

Since you store both the hash and the post ID, I think even 10 characters of the SHA-1 hash are enough to prevent collisions within a post.
Of course, you can store the full text concatenated with embed[url], but that's just data duplication.
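
A back-of-the-envelope check of the "10 chars is enough" estimate, as a hedged sketch: it uses the standard birthday-bound approximation and assumes truncated SHA-1 values behave like uniformly random 40-bit numbers. The function name is made up for this example.

```python
import math

HEX_CHARS = 10
SPACE = 16 ** HEX_CHARS  # 10 hex characters = 40 bits of hash


def collision_probability(versions: int, space: int = SPACE) -> float:
    """Birthday-bound approximation for `versions` random values in `space`."""
    return -math.expm1(-versions * (versions - 1) / (2 * space))


# Chance that any two of 100 edits of the same post share a 10-char prefix.
print(collision_probability(100))
```

Since the post ID is part of the key, only edits of the same post can collide with each other, which keeps `versions` small.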

sqlite, sqlite3 or db

In any case it will be a SQLite DB file.
For the file association (to open the file with a SQL browser), I recommend using .sqlite (or .sqlite3).

a84r7a3rga76fg · 2023-03-19T20:41:18Z

Thank you. I'll change it to sqlite and use your postprocessor config, that's a neat way to reduce the size of the archive file.

Does gallery-dl support xxhash64? The latter is much faster than SHA-1 and the value is much smaller.

I'm assuming I can use SHA-1 for everything, for example, a Twitter name {hash_sha1(author[name] if embed else ''))}?

AlttiRi · 2023-03-19T20:49:32Z

hash_md5() and hash_sha1() #3679 (comment)

Any text key. #3679

Use f-string formatting (the \fF prefix) to run any Python code:
https://github.com/mikf/gallery-dl/blob/master/docs/formatting.md#special-type-format-strings

"\fF {hash_sha1(author[name])}"

This usage is wrong: the key must be quoted.
The correct form: "\fF {hash_sha1(author['name'])}"
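
The difference can be reproduced in plain Python. This is only an illustration with a made-up `author` dictionary: `str.format`-style fields accept an unquoted key, while code evaluated as Python treats the unquoted key as a variable name.

```python
author = {"name": "example"}

# str.format-style fields take the key unquoted.
plain = "{author[name]}".format(author=author)

# An f-string evaluates real Python, so the key must be a quoted string.
evaluated = f"{author['name']}"

# In evaluated code, the unquoted form looks up a variable called `name`.
try:
    eval("author[name]", {"author": author})
    unquoted_error = None
except NameError as exc:
    unquoted_error = type(exc).__name__

print(plain, evaluated, unquoted_error)
```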

mikf · 2023-03-19T20:58:28Z

As a side note: You can use the globals option to load a (custom) Python file/module and use its functions.
So, if you want to use xxhash64, you can define your own function that implements it.

In 1.25.0 it overwrites the default set of functions, but it will add them in the next version (#3773, a1ca240).
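
A sketch of what such a custom module might look like. The file name `my_globals.py` and the function name `hash_b2_64` are hypothetical. To keep the example free of third-party dependencies it swaps in the standard library's BLAKE2b with an 8-byte digest as a stand-in 64-bit hash; it is not xxhash64.

```python
# my_globals.py: hypothetical module to load through the "globals" option.
import hashlib


def hash_b2_64(text: str) -> str:
    """Return a 64-bit hash of `text` as 16 hex characters.

    Stand-in for xxhash64: BLAKE2b from the standard library with an
    8-byte digest, so no extra package is needed to run this sketch.
    """
    return hashlib.blake2b(text.encode(), digest_size=8).hexdigest()


print(hash_b2_64("some post content"))
```

With the module loaded, an \fF format string should be able to call `hash_b2_64(content)` the same way the examples above call `hash_sha1(content)`. If the third-party xxhash package is installed, the function body could presumably be replaced by `xxhash.xxh64(text.encode()).hexdigest()`.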

a84r7a3rga76fg · 2023-03-19T21:11:47Z

Thanks for the help everyone.

I'm using AlttiRi's postprocessor and I changed the filename to "filename": "\fF {_now!s:.16}-metadata.html", but it doesn't work. The error is [kemonoparty][error] An unexpected error occurred: NameError - name '_now' is not defined. Why can't I use {_now!s:.16}?

AlttiRi · 2023-03-19T21:19:06Z

~~Because _now!s:.16 is not valid Python code.~~ Also, as I understand it, there is no _now variable in the props object.
If you use \fF, you can't use the custom formatting, only Python code and globally available variables.

Or is there any workaround? Something like format('_now!s:.16')?

a84r7a3rga76fg · 2023-03-19T21:25:00Z

Will your postprocessor still work if I remove \fF from filename?

mikf · 2023-03-19T21:27:15Z

{_now} effectively just calls datetime.now(), so {datetime.now()!s:.16} should work in an f-string
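
A quick way to see what that expression produces, using a fixed timestamp so the result is reproducible:

```python
from datetime import datetime

stamp = datetime(2023, 3, 19, 21, 27, 15)

# !s converts with str(); the .16 precision keeps the first 16 characters,
# which cuts the string off after the minutes.
short = f"{stamp!s:.16}"
print(short)  # 2023-03-19 21:27

# The same expression with the current time, as it would appear in a filename.
print(f"{datetime.now()!s:.16}")
```

Note that the result contains a colon, which is not allowed in file names on Windows unless something replaces it.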

BTW, is it possible to use f-string in templates? The custom formatting syntax is too difficult to write and read.

Not at the moment, but this is easy enough to add. Would \fTF <path> as format be OK, or do you have a better suggestion?

AlttiRi · 2023-03-19T21:31:42Z

Ah, I forgot that f-strings have some extra syntax of their own.

In my example, \fF is required in order to use the hash_sha1 function.

a84r7a3rga76fg · 2023-03-19T21:34:08Z

Thanks, that worked. This is what I've currently got; please let me know if I'm missing anything, because I've made quite a few changes. Does filter work? Also, @AlttiRi, do you have a content-format template and postprocessor for kemonoparty discord? That subcategory is very different from the others, e.g. patreon, fantia, etc.

            "postprocessors": [
			{
				"name": "mtime",
				"event": "post"
			},
			{
				"filter": "subcategory not in ('discord')",
				"name":  "metadata",
				"event": "post",
				"mode":  "custom",
				"filename":  "\fF {datetime.now()!s:.16} metadata.html",
				"filter": "embed.get('url') or re.search(r'(?i)(redgifs|atomicloli|gfycat|google|drive|onedrive|1drv|mega|xgf|k00|koofr|gigafile|mediafire|porn3dx|gofile|dropbox)', content)",
				"content-format": "\fT ~/gallery-dl/templates/kemono.html",
				"archive": "~/gallery-dl/test1/kemono-metadata-archive.sqlite",
				"archive-prefix": "\fF {category}",
				"archive-format": "_{service}_{user}_{id}_{hash_sha1(content + (embed['url'] if embed else ''))}",
				"mtime": true
			}
			],

mikf · 2023-03-19T21:36:50Z

You cannot use multiple "filter" statements at once. It will only see and use the last one. Combine both with (...) and (...) or use a blacklist as suggested in #3803 (comment).
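
The "last one wins" behaviour can be checked with Python's standard `json` module, assuming the config file is parsed with it. The two filter strings are shortened versions of the ones above (with `!=` instead of `not in`), and the sample post is made up.

```python
import json

raw = """
{
    "filter": "subcategory != 'discord'",
    "filter": "embed.get('url')"
}
"""

# Duplicate keys are not an error; the later value replaces the earlier one.
options = json.loads(raw)
print(options)

# Combining both conditions into a single expression keeps both checks.
combined = "(subcategory != 'discord') and (embed.get('url'))"
post = {"subcategory": "patreon", "embed": {"url": "https://example.org/a"}}
keep = bool(eval(combined, {}, post))
print(keep)
```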

a84r7a3rga76fg · 2023-03-19T21:42:24Z

I changed it to "blacklist": ":discord",. If in the future I want to add one more subcategory, would this work "blacklist": {":discord", ":example"},?

mikf · 2023-03-19T21:45:38Z

"blacklist": [":discord", ":example"],
([...] instead of {...})

The blacklist syntax is documented here
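
The braces version is not just a style issue: JSON has no set literal, so it does not parse at all. A short check with Python's standard `json` module:

```python
import json

# A JSON array parses into a list.
as_list = json.loads('{"blacklist": [":discord", ":example"]}')
print(as_list["blacklist"])

# Braces start an object, which needs "key": value pairs, so this fails.
try:
    json.loads('{"blacklist": {":discord", ":example"}}')
    braces_error = None
except json.JSONDecodeError as exc:
    braces_error = type(exc).__name__

print(braces_error)
```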

AlttiRi · 2023-03-19T21:47:21Z

\fTF <path>

Yeah, it's fine.

"filter": "embed.get('url') or re.search(r'(?i)(redgifs|atomicloli|gfycat|google|drive|onedrive|1drv|mega|xgf|k00|koofr|gigafile|mediafire|porn3dx|gofile|dropbox)', content)",

I find this filter questionable. I would save every post. Then just parse the links that you need.
It's also strange that catbox, uploadir, webmshare, imgur, ... are missing from your list. Is it really the complete list of domains you need, with nothing forgotten?

That's why my userscript has two search/filter inputs: the first works like your regex and lists only known sites; the second works in reverse and lists every link except those matching the input values.

a84r7a3rga76fg · 2023-03-19T21:56:09Z

I forgot about those, do you know of a list of file sharing domains? I add any such domain that I see to the filter and I know I'm missing a lot. I do actually save every post with another run, the reason for the filter is that imo it's faster and mainly because some artists will split the link into different chunks, e.g. "https :// mega [dot] nz".
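
A small illustration of why a loose keyword filter still catches split links. The keyword list is a shortened version of the one in the filter above; the stricter URL pattern is hypothetical and only there for contrast.

```python
import re

# Shortened version of the keyword list from the filter above.
KEYWORDS = re.compile(r"(?i)(mega|mediafire|gofile|dropbox)")
# Hypothetical stricter pattern that expects a well-formed URL.
STRICT_URL = re.compile(r"(?i)https?://(?:www\.)?mega\.nz")

obfuscated = "Download: https :// mega [dot] nz"

# The bare keyword still appears in the text, so the loose filter matches.
keyword_hit = KEYWORDS.search(obfuscated) is not None
# The split-up link is not a valid URL, so the strict pattern misses it.
strict_hit = STRICT_URL.search(obfuscated) is not None

print(keyword_hit, strict_hit)
```

The trade-off is false positives: a bare keyword such as "drive" or "google" also matches ordinary prose.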

AlttiRi · 2023-12-28T04:00:30Z

I also suggest saving the content in HTML files, as in the linked issue.
Then you can concatenate all the HTML files into one file and open it in a browser.

BTW, here is how it looks (Twitter's example):

For example, the resulting single HTML file (created with gallery-dl's postprocessor) for a tweet:

It has a description as well as other meta information: date, id, and links to the tweet and the profile.

Here is a screenshot of all the HTML files concatenated into one file, with the popup of my HrefTaker userscript open to parse links (only catbox in the screenshot) from all tweets:

To concatenate the HTML files I use a bash function defined in my ~/.bashrc file:

alias catahtml=fun_cat_html

function fun_cat_html {
    current_date_time=$(date +"%Y.%m.%d-%H.%M.%S")
    cat *.html > "$TEMP/_temp-cat-html-result-$current_date_time.html";
    start "$TEMP/_temp-cat-html-result-$current_date_time.html";
    sleep 0;
    exit;
}

Just type catahtml in a bash terminal, then the result file will be opened in your browser automatically.
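
For setups without Git Bash, roughly the same thing can be sketched in Python. This is an assumed equivalent, not part of the original alias; the function name `cat_html` is made up, and files are joined in sorted name order.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def cat_html(source_dir: str = ".") -> Path:
    """Concatenate all *.html files in source_dir into one temp file."""
    stamp = datetime.now().strftime("%Y.%m.%d-%H.%M.%S")
    target = Path(tempfile.gettempdir()) / f"_temp-cat-html-result-{stamp}.html"
    with target.open("w", encoding="utf-8") as out:
        # Sorted order keeps the result stable between runs.
        for path in sorted(Path(source_dir).glob("*.html")):
            out.write(path.read_text(encoding="utf-8"))
    return target


if __name__ == "__main__":
    print(cat_html())
```

To open the result automatically, the returned path can be passed to the standard library's `webbrowser.open()`, for example `webbrowser.open(cat_html().as_uri())`.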

This is just a cross-post of AlttiRi/twitter-click-and-save#36 (comment).
Maybe it will be useful for someone.

mikf mentioned this issue Mar 21, 2023

[Twitter] Can't hash the description #3807

Closed

mikf closed this as completed Mar 25, 2023

a84r7a3rga76fg mentioned this issue Jul 21, 2023

Can't figure out how to download only text from kemonoparty #4330

Open

a84r7a3rga76fg mentioned this issue Feb 28, 2024

[kemono] Remove Redundant/Nonexistent Files in Current Version #5247

Closed
