[Kemono] Re-download a post's text content when it is edited #3800

Closed
a84r7a3rga76fg opened this issue Mar 19, 2023 · 22 comments
Comments

@a84r7a3rga76fg

I've added archive and archive-format to this postprocessor and I'm wondering if this will re-download a post's text content when content and/or embed[url] is edited? I'm especially not sure if archive-format should be {content}|{embed[url]} or if I should be using two postprocessors for content and embed[url] respectively in archive-format.

            "postprocessors": [{
                "name": "metadata",
                "event": "post",
				"filename": "{_now!s:.16} {id}.txt",
				"filter": "embed.get('url') or re.search(r'(?i)(redgifs|atomicloli|gfycat|google|drive|onedrive|1drv|mega|xgf|k00|koofr|gigafile|mediafire|porn3dx|gofile|dropbox)', content)",
				"mode": "custom",
				"format": "{content}\n{embed[url]:?/\n/}",
				"archive-format": "{service}_{user}_{id}_{content}|{embed[url]}",
				"archive": "./gallery-dl/test/KEMONO-TEXT-archive.db"
			}],
@mikf
Owner

mikf commented Mar 19, 2023

{content}|{embed[url]}

That won't work if you want it to be content OR embed[url].
It will instead write both fields, separated by a |.

I mean, it is going to do what you want it to - detecting changes in either content or embed[url] - but it might be quite wasteful. Maybe using the edited date would be better?

@AlttiRi

AlttiRi commented Mar 19, 2023

#3679

    "archive": "~/gallery-dl/gallery-dl-kemono-postprocessor.sqlite",
    "archive-prefix": "\fF {category}",
    "archive-format": "_{service}_{user}_{id}_{hash_sha1(content)[:10]}",

UPD:
Ah, yes, with embed[url]:

    "archive-format": "_{service}_{user}_{id}_{hash_sha1(content + (embed and embed['url'] or ''))[:10]}",

@AlttiRi

AlttiRi commented Mar 19, 2023

I also suggest saving the content as HTML files, as in the linked issue.

Then you can concatenate all the HTML files into one file and open it in a browser.

For example, .bashrc alias for Windows with Git Bash:

alias catahtml='cat *.html > "$TEMP/_temp-catahtml-result.html"; start "$TEMP/_temp-catahtml-result.html"; sleep 0; exit;'

Then use my userscript to parse the URLs from the HTML file.
https://github.com/AlttiRi/href-taker
Check the readme.


Although, if you will only parse them programmatically, saving just the text value in a .txt file is also OK.

@AlttiRi

AlttiRi commented Mar 19, 2023

Feel free to use.

"postprocessors": [
    {
        "name": "mtime",
        "event": "post"
    },
    {
        "name":  "metadata",
        "event": "post",
        "mode":  "custom",
        
        "directory": "metadata",
        "filename":  "\fF [{category}] {user}—{id}—{title}—{hash_sha1(content + (embed['url'] if embed else ''))[:10]}.html",
        "content-format": "\fT ~/gallery-dl/templates/kemono.html",
        
        "archive": "~/gallery-dl/gallery-dl-kemono-postprocessor.sqlite",
        "archive-prefix": "\fF {category}",
        "archive-format": "_{service}_{user}_{id}_{hash_sha1(content + (embed['url'] if embed else ''))[:10]}",

        "mtime": true
    }
]

~/gallery-dl/templates/kemono.html

<div class="post" id="{id}" data-added="{added}" data-published="{published}" data-edited="{edited}">
  <h4>
    <a href="https://kemono.party/{service}/user/{user}/post/{id}">{title}</a><span class="id" style="color: gray;"><i> #{id}</i></span>
  </h4>
  <div class="content">{content}</div>
  <div class="content embed" title="{embed[subject]:?//}">{embed[url]:?//}</div>
  <br>
  <div class="date"><i>{date:%Y.%m.%d %H:%M:%S}</i></div>
  <hr>
</div>

@AlttiRi

AlttiRi commented Mar 19, 2023

BTW, is it possible to use f-string in templates? The custom formatting syntax is too difficult to write and read.

@a84r7a3rga76fg
Author

Is it better to hash the text data? Aren't the two methods practically the same? Is there a scenario where storing the hash value of the text data beats storing the text data itself?

That's a cool idea to combine all of the HTML files into one, I'll do that too.

Is there a difference in saving the archive file as sqlite, sqlite3 or db?

@AlttiRi

AlttiRi commented Mar 19, 2023

Storing a hash reduces the DB size, and you can also use it in the filename as a postfix; see the example above. (This generates a unique filename for each post edit; otherwise the metadata file for a new edit would overwrite the old one.)
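
As a rough Python illustration of the idea: the key below changes whenever the content or the embed URL changes, while staying short. It assumes hash_sha1 is the hexadecimal SHA-1 of the UTF-8 encoded text; the archive_key helper and the sample values are made up for this sketch and are not part of gallery-dl.

```python
import hashlib


def archive_key(service, user, post_id, content, embed):
    """Build a short key that changes when content or the embed URL changes."""
    # Same idea as: content + (embed['url'] if embed else '')
    text = content + ((embed or {}).get("url") or "")
    # Assumption: hash_sha1 behaves like a hex SHA-1 of the UTF-8 text.
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return f"_{service}_{user}_{post_id}_{digest[:10]}"


# Hypothetical post values, for illustration only.
original = archive_key("patreon", "123", "456", "<p>old text</p>",
                       {"url": "https://example.org/a"})
edited_text = archive_key("patreon", "123", "456", "<p>new text</p>",
                          {"url": "https://example.org/a"})
edited_embed = archive_key("patreon", "123", "456", "<p>old text</p>",
                           {"url": "https://example.org/b"})
unchanged = archive_key("patreon", "123", "456", "<p>old text</p>",
                        {"url": "https://example.org/a"})
```

Here original equals unchanged, so an unedited post would be skipped, while both edited variants produce new keys and would be written again.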

To prevent collisions for a post (since you store both the hash and the post id), I think even 10 chars of the SHA-1 hash are enough.
Of course, you can store the full text concatenated with embed[url], but that's just data duplication.
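
A back-of-the-envelope check of the "10 chars is enough" estimate, assuming truncated SHA-1 values behave like uniformly random 40-bit numbers. The collision_probability helper is hypothetical and uses the usual birthday-bound approximation (number of pairs divided by the size of the hash space).

```python
def collision_probability(n_versions, hex_chars=10):
    """Approximate chance that any two of n versions share a truncated hash."""
    space = 16 ** hex_chars                    # 10 hex chars = 40 bits
    pairs = n_versions * (n_versions - 1) / 2  # distinct pairs of versions
    return pairs / space


space_size = 16 ** 10
p_100_edits = collision_probability(100)  # one post edited 100 times
```

Because the post id is part of the key, only edits of the same post can collide with each other, so n_versions stays small in practice.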

sqlite, sqlite3 or db

In any case it will be a SQLite DB file.
For file association (to open the file with an SQL browser), I recommend using .sqlite (or .sqlite3).
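
A small Python check of the point that the extension does not matter: the same SQLite file format is written whatever the file is called. The table layout and the entry value here are illustrative assumptions, not necessarily what gallery-dl creates.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_archive(path):
    """Create a tiny SQLite database at the given path."""
    con = sqlite3.connect(str(path))
    # Illustrative schema; the real archive layout may differ.
    con.execute("CREATE TABLE IF NOT EXISTS archive (entry TEXT PRIMARY KEY)")
    con.execute("INSERT OR IGNORE INTO archive VALUES (?)", ("example_entry",))
    con.commit()
    con.close()


headers = {}
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".db", ".sqlite", ".sqlite3"):
        path = Path(tmp) / f"archive{ext}"
        make_archive(path)
        # Every SQLite 3 database file starts with the same 16-byte magic header.
        headers[ext] = path.read_bytes()[:16]
```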

@a84r7a3rga76fg
Author

Thank you. I'll change it to sqlite and use your postprocessor config, that's a neat way to reduce the size of the archive file.

Does gallery-dl support xxhash64? It is much faster than SHA-1 and the value is much smaller.

I'm assuming I can use SHA-1 for everything, for example, a Twitter name {hash_sha1(author[name] if embed else ''))}?

@AlttiRi

AlttiRi commented Mar 19, 2023

hash_md5() and hash_sha1() #3679 (comment)

Any text key. #3679

By using f-string formatting (the \fF prefix), you can use any Python code:
https://github.com/mikf/gallery-dl/blob/master/docs/formatting.md#special-type-format-strings

"\fF {hash_sha1(author[name])}"

This is wrong: inside an f-string, the key name must be quoted.
The correct form: "\fF {hash_sha1(author['name'])}"

@mikf
Owner

mikf commented Mar 19, 2023

As a side note: You can use the globals option to load a (custom) Python file/module and use its functions.
So, if you want to use xxhash64, you can define your own function that implements it.

In 1.25.0 it overwrites the default set of functions, but in the next version it will add to them instead (#3773, a1ca240).
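
A sketch of what such a custom module might look like. The function name hash_xxh64 is hypothetical, and xxhash is a third-party package (pip install xxhash). When it is not installed, this sketch falls back to a 64-bit BLAKE2b digest, which is a different algorithm and is used here only so the example still runs.

```python
import hashlib

try:
    import xxhash  # third-party package, optional for this sketch
except ImportError:
    xxhash = None


def hash_xxh64(text):
    """Return a 16-character hex digest (64 bits) of the given text."""
    data = text.encode("utf-8")
    if xxhash is not None:
        return xxhash.xxh64(data).hexdigest()
    # Fallback, NOT xxHash: 64-bit BLAKE2b from the standard library.
    return hashlib.blake2b(data, digest_size=8).hexdigest()
```

With the file loaded through the globals option, the function could then presumably be called from an \fF format string in the same way as hash_sha1.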

@a84r7a3rga76fg
Author

Thanks for the help everyone.

I'm using AlttiRi's postprocessor and I changed the filename to "filename": "\fF {_now!s:.16}-metadata.html", but it doesn't work. The error is [kemonoparty][error] An unexpected error occurred: NameError - name '_now' is not defined. Why can't I use {_now!s:.16}?

@AlttiRi

AlttiRi commented Mar 19, 2023

Because _now!s:.16 is not valid Python code. Also, as far as I understand, there is no _now variable in the props object.
If you use \fF, you can't use the custom formatting syntax, only Python code and globally available variables.

Or is there a workaround? Something like format('_now!s:.16')?

@a84r7a3rga76fg
Author

Will your postprocessor still work if I remove \fF from filename?

@mikf
Owner

mikf commented Mar 19, 2023

{_now} effectively just calls datetime.now(), so {datetime.now()!s:.16} should work in an f-string
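
In plain Python, the !s conversion turns the datetime into a string and the .16 format spec truncates it to 16 characters, which drops the seconds and microseconds. The fixed timestamp below is made up so the result is reproducible.

```python
from datetime import datetime

# Fixed, made-up timestamp instead of datetime.now() so the result is stable.
stamp = datetime(2023, 3, 19, 14, 5, 9, 123456)

full = str(stamp)                        # '2023-03-19 14:05:09.123456'
name = f"{stamp!s:.16} metadata.html"    # '2023-03-19 14:05 metadata.html'
```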


BTW, is it possible to use f-string in templates? The custom formatting syntax is too difficult to write and read.

Not at the moment, but this is easy enough to add. Would \fTF <path> as format be OK, or do you have a better suggestion?

@AlttiRi

AlttiRi commented Mar 19, 2023

Ah, I forgot that f-strings have some extra syntax of their own.

In my example, \fF is required in order to use the hash_sha1 function.

@a84r7a3rga76fg
Author

Thanks, that worked. This is what I've currently got; please let me know if I'm missing anything, because I've made quite a few changes. Does filter work? Also, @AlttiRi, do you have a content-format template and postprocessor for kemonoparty discord? That subcategory is very different from the other subcategories, e.g. patreon, fantia, etc.

            "postprocessors": [
			{
				"name": "mtime",
				"event": "post"
			},
			{
				"filter": "subcategory not in ('discord')",
				"name":  "metadata",
				"event": "post",
				"mode":  "custom",
				"filename":  "\fF {datetime.now()!s:.16} metadata.html",
				"filter": "embed.get('url') or re.search(r'(?i)(redgifs|atomicloli|gfycat|google|drive|onedrive|1drv|mega|xgf|k00|koofr|gigafile|mediafire|porn3dx|gofile|dropbox)', content)",
				"content-format": "\fT ~/gallery-dl/templates/kemono.html",
				"archive": "~/gallery-dl/test1/kemono-metadata-archive.sqlite",
				"archive-prefix": "\fF {category}",
				"archive-format": "_{service}_{user}_{id}_{hash_sha1(content + (embed['url'] if embed else ''))}",
				"mtime": true
			}
			],

@mikf
Owner

mikf commented Mar 19, 2023

You cannot use multiple "filter" statements at once. It will only see and use the last one. Combine both with (...) and (...) or use a blacklist as suggested in #3803 (comment).
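
A minimal Python sketch of both points. It assumes the config is read by a standard JSON parser (Python's json module keeps the last value of a repeated key) and that filter strings are evaluated as Python expressions against the post's metadata. The combine helper and the sample post dicts are made up for illustration.

```python
import json

# Two "filter" keys in one JSON object: json.loads keeps only the last one.
raw = """
{
    "filter": "subcategory not in ('discord')",
    "filter": "embed.get('url')"
}
"""
config = json.loads(raw)
surviving = config["filter"]  # the second filter; the first is silently dropped


def combine(*expressions):
    """Join filter expressions as (...) and (...)."""
    return " and ".join(f"({expr})" for expr in expressions)


combined = combine("subcategory not in ('discord')", "embed.get('url')")

# Hypothetical post metadata, used only to evaluate the combined expression.
post = {"subcategory": "patreon", "embed": {"url": "https://example.org/file"}}
discord_post = {"subcategory": "discord", "embed": {"url": "https://example.org/file"}}

keep_post = bool(eval(combined, {}, post))
keep_discord = bool(eval(combined, {}, discord_post))
```

The combined string is what would go into a single "filter" entry.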

@a84r7a3rga76fg
Author

I changed it to "blacklist": ":discord",. If in the future I want to add one more subcategory, would this work "blacklist": {":discord", ":example"},?

@mikf
Owner

mikf commented Mar 19, 2023

"blacklist": [":discord", ":example"],
([...] instead of {...})
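
The reason is that {...} in JSON always denotes an object with "key": value pairs, so a brace-delimited list of strings fails to parse. A quick check with Python's json module (the parses helper is just for this example):

```python
import json


def parses(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


list_ok = parses('{"blacklist": [":discord", ":example"]}')
braces_ok = parses('{"blacklist": {":discord", ":example"}}')
```

Here list_ok is True and braces_ok is False.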

The blacklist syntax is documented here

@AlttiRi

AlttiRi commented Mar 19, 2023

\fTF <path>

Yeah, it's fine.

"filter": "embed.get('url') or re.search(r'(?i)(redgifs|atomicloli|gfycat|google|drive|onedrive|1drv|mega|xgf|k00|koofr|gigafile|mediafire|porn3dx|gofile|dropbox)', content)",

I find this filter questionable. I would save every post and then just parse the links you need.
It's also strange that catbox, uploadir, webmshare, imgur, ... are missing from your list. Is it really the complete list of domains you need? Are you sure you did not forget anything?

That's why my userscript has 2 search/filter inputs: the first works like your regex and lists only known sites; the second works in reverse and lists every link except those matching the input values.

@a84r7a3rga76fg
Author

I forgot about those, do you know of a list of file sharing domains? I add any such domain that I see to the filter and I know I'm missing a lot. I do actually save every post with another run, the reason for the filter is that imo it's faster and mainly because some artists will split the link into different chunks, e.g. "https :// mega [dot] nz".
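
For split-up links like that one, a normalization pass before matching might help. This is only a sketch: the normalize_obfuscated helper is hypothetical and handles just the two patterns from the example (a bracketed "dot" and spaces around "://"); real posts will use other tricks.

```python
import re


def normalize_obfuscated(text):
    """Undo two common link obfuscations: '[dot]' and spaces around '://'."""
    # "mega [dot] nz" or "mega (dot) nz" -> "mega.nz"
    text = re.sub(r"\s*[\[\(]\s*dot\s*[\]\)]\s*", ".", text, flags=re.IGNORECASE)
    # "https :// " -> "https://"
    text = re.sub(r"(https?)\s*:\s*/\s*/\s*", r"\1://", text, flags=re.IGNORECASE)
    return text


fixed = normalize_obfuscated("https :// mega [dot] nz")   # 'https://mega.nz'
untouched = normalize_obfuscated("https://mega.nz/file")  # already clean
```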

@AlttiRi

AlttiRi commented Dec 28, 2023

I also suggest saving the content as HTML files, as in the linked issue.
Then you can concatenate all the HTML files into one file and open it in a browser.

BTW, here is how it looks (Twitter's example):


For example, the resulting single HTML file (created with gallery-dl's postprocessor) for a tweet:

[Screenshot: single-tweet-text]

It has a description as well as other meta information: date, id, the links to a tweet and a profile.


Here is a screenshot of all the HTML files concatenated into one file, with the popup of my HrefTaker userscript open to parse links (only catbox in the screenshot) from all tweets:

[Screenshot: cat-all-tweets-plus-extension]


To concatenate the HTML files, I use a bash function defined in my ~/.bashrc file:

alias catahtml=fun_cat_html

function fun_cat_html {
    current_date_time=$(date +"%Y.%m.%d-%H.%M.%S")
    cat *.html > "$TEMP/_temp-cat-html-result-$current_date_time.html";
    start "$TEMP/_temp-cat-html-result-$current_date_time.html";
    sleep 0;
    exit;
}

Just type catahtml in a bash terminal, then the result file will be opened in your browser automatically.
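
For anyone without Git Bash, roughly the same concatenation can be sketched in Python. The concat_html helper, the file names, and the sample contents are made up for illustration; the demo writes into a temporary directory rather than a real download folder.

```python
import tempfile
from pathlib import Path


def concat_html(directory, out_path):
    """Concatenate all *.html files in directory (sorted by name) into out_path."""
    directory = Path(directory)
    out_path = Path(out_path)
    # Skip the output file itself in case it lives in the same directory.
    parts = sorted(p for p in directory.glob("*.html")
                   if p.resolve() != out_path.resolve())
    with out_path.open("w", encoding="utf-8") as out:
        for part in parts:
            out.write(part.read_text(encoding="utf-8"))
            out.write("\n")
    return len(parts)


# Demo with two made-up metadata files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "metadata"
    src.mkdir()
    (src / "a.html").write_text("<div>first</div>", encoding="utf-8")
    (src / "b.html").write_text("<div>second</div>", encoding="utf-8")
    result = Path(tmp) / "combined.html"
    count = concat_html(src, result)
    combined = result.read_text(encoding="utf-8")
```

Opening the result could then be done with the standard webbrowser module, e.g. webbrowser.open(result.as_uri()).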


Just a cross post of this AlttiRi/twitter-click-and-save#36 (comment)
Maybe it would be useful for someone.
