[twitter][inquiry][possible feature request] Skip download for already downloaded text-tweets? #3786

Closed
a-washing-machine opened this issue Mar 17, 2023 · 8 comments


@a-washing-machine

My gallery-dl config is set up to download text-tweets too. However, it re-downloads them on every run and overwrites the existing files.

This significantly slows down the entire download process.

(Yes, I compared it, and it really is significantly slower. Setting "text-tweets": false in my config didn't change much, but removing the corresponding post-processor setting from my config sped things up significantly. So the problem is the act of needlessly re-saving the text files.)

It also creates tremendous overhead when I back up my gallery-dl directory: literally hundreds of thousands of "updated" text files get backed up needlessly, which takes a frankly absurd amount of time and probably isn't good for the hard drive.

Is this something that can be changed in the config already?

If not, is this something that would be easy to implement? "If text-file with name X exists, skip and don't even attempt download"?

My config:

{
    "extractor":
    {
        "twitter":
        {
            "username": "[REDACTED]",
            "password": "[REDACTED]",
            "cookies": "twitter.com_cookies.txt",
            "cookies-update": true,
            "retweets": true,
            "quoted": true,
            "replies": true,
            "text-tweets": true,

            "directory": {
                "retweet_id"              : ["{category}", "{user[name]}", "Retweets", "{author[name]}"],
                "locals().get('quote_by')": ["{category}", "{user[name]}", "Quoted"  , "{author[name]}"],
                ""                        : ["{category}", "{user[name]}"]
            },

            "postprocessors": [
                {
                    "name": "metadata",
                    "event": "post",
                    "filename": "{tweet_id}.txt",
                    "mode": "custom",
                    "content-format": "{content}",
                    "directory": "TEXT"
                }
            ]
        },

        ... ... ...
    }
}
@ClosedPort22
Contributor

Metadata postprocessors support archive: https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#metadataarchive

@a-washing-machine
Author

Ah! Didn't think of that. Thanks! :D

@a-washing-machine
Author

a-washing-machine commented Mar 18, 2023

Hmm. It no longer overwrites the files, which is good; that was the important part.

On the other hand, I timed a small test sample, and it still takes just as long as when the files were overwritten. Removing the "postprocessors" section from the config altogether is still noticeably faster.

Not sure what's happening here to slow things down, hard to say.

I know it used to take a couple hours to fully reparse twitter before I started also downloading text-tweets, even without an abort parameter. Now it takes about a day just to do it with abort set to 1000.

Could of course be other factors at play here, maybe twitter just slowed down too.

Ah well, slow and steady wins the race, and not overwriting the files was the more important of the two inquiries.

{
    "extractor":
    {
        "twitter":
        {
            "username": "[REDACTED]",
            "password": "[REDACTED]",
            "cookies": "twitter.com_cookies.txt",
            "cookies-update": true,
            "retweets": true,
            "quoted": true,
            "replies": true,
            "text-tweets": true,

            "directory": {
                "retweet_id"              : ["{category}", "{user[name]}", "Retweets", "{author[name]}"],
                "locals().get('quote_by')": ["{category}", "{user[name]}", "Quoted"  , "{author[name]}"],
                ""                        : ["{category}", "{user[name]}"]
            },

            "postprocessors": [
                {
                    "name": "metadata",
                    "event": "post",
                    "filename": "{tweet_id}.txt",
                    "mode": "custom",
                    "content-format": "{content}",
                    "directory": "TEXT",
                    "archive": "./gallery-dl/twitterMetadataDownloadsArchive.db"
                }
            ]
        },

        ... ... ...
    }
}

EDIT:

mikf added a commit that referenced this issue 4 minutes ago
@mikf
[postprocessor:metadata] add 'skip' option (https://github.com/mikf/gallery-dl/issues/3786), commit https://github.com/mikf/gallery-dl/commit/00f0233b2890ddf68bc9887aa5714c838cb12203

Ooops, didn't see that! :)
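
For readers landing here later: a minimal sketch of how the new option might be used, assuming it is a boolean named "skip" on the metadata post-processor, as the commit title suggests. The exact name, type, and default are not confirmed in this thread, so check the configuration docs for your gallery-dl version. The other keys are copied from the config above.

```json
{
    "name": "metadata",
    "event": "post",
    "filename": "{tweet_id}.txt",
    "mode": "custom",
    "content-format": "{content}",
    "directory": "TEXT",
    "skip": true
}
```

Under that assumption, an existing {tweet_id}.txt would be left untouched instead of being rewritten on every run, which is what the original request asked for.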

@mikf
Owner

mikf commented Mar 18, 2023

> maybe twitter just slowed down too.

Last time I did a rough measurement, GraphQL endpoints were around 4x slower than the previous REST API from before cb43f77.

Also, I did try to improve metadata performance in v1.25.0 (3436c6b), but I don't know how much that actually did.

You can also disable extractor.twitter.transform if you don't want gallery-dl to waste time with reordering / processing any of the metadata entries.
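
As a rough sketch, disabling that option would look something like the fragment below. It assumes "transform" takes a plain boolean. Note that the metadata field names used elsewhere in the config (for example {tweet_id} and {content}) may not be available in the same form once the transformation is turned off, so test on a small sample first.

```json
{
    "extractor": {
        "twitter": {
            "transform": false
        }
    }
}
```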

  			"event": "post",
  			"archive": "./gallery-dl/twitterMetadataDownloadsArchive.db"

I think you also need to set a custom archive-format for it to work properly with event: post.
"archive-format": "{tweet_id}" should be good enough.

You might also want to look into archive-pragma options for better performance.
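
Putting those two suggestions together, the post-processor entry could look roughly like this. The "archive-format" value is the one suggested above. The "archive-pragma" values are only an illustrative assumption (ordinary SQLite PRAGMA statements, not something recommended in this thread), so adjust or drop them as needed.

```json
{
    "name": "metadata",
    "event": "post",
    "filename": "{tweet_id}.txt",
    "mode": "custom",
    "content-format": "{content}",
    "directory": "TEXT",
    "archive": "./gallery-dl/twitterMetadataDownloadsArchive.db",
    "archive-format": "{tweet_id}",
    "archive-pragma": ["journal_mode=WAL", "synchronous=NORMAL"]
}
```

With "archive-format" set to "{tweet_id}", each tweet gets one archive entry keyed by its ID, which fits a post-processor that runs once per post rather than once per file.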

@a84r7a3rga76fg

Is metadata.content-format the same as metadata.content?

@mikf
Owner

mikf commented Mar 19, 2023

@a84r7a3rga76fg
metadata.content is not a recognized/available option.
metadata.content-format is the same as plain metadata.format though.
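
In other words, assuming the equivalence described above, these two post-processor entries should behave identically (shown as a list only to put them side by side):

```json
[
    {
        "name": "metadata",
        "mode": "custom",
        "content-format": "{content}"
    },
    {
        "name": "metadata",
        "mode": "custom",
        "format": "{content}"
    }
]
```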

@a84r7a3rga76fg

Oops, I meant metadata.format. Should I change all instances of metadata.format to metadata.content-format? Only the latter is in the config manual. Does that mean metadata.format will be removed in a future update?

@mikf
Owner

mikf commented Mar 25, 2023

@a84r7a3rga76fg I'm not planning on removing support for metadata.format.

@mikf mikf closed this as completed Mar 25, 2023