[1.0.3.1] Problem: Overwrite only if different in size not working correctly anymore #18

Open
reyaz006 opened this issue Apr 14, 2014 · 7 comments

@reyaz006

As of yesterday, April 13, 2014, the app can no longer detect the file sizes of many images, so it re-downloads and rewrites tons of files that are already downloaded. Up until yesterday I never had this problem.

Log file says:
Content-Length Filesize: -1
File Exists: [xxx].jpg, different size: 489199 vs -1, backing up to: [xxx].jpg.1397402216,98304
For some reason I can't reproduce this with a test batchjob.xml of 1 member job. It either has something to do with too many concurrent jobs (I haven't increased it though, it's still 6) or temporary server issues.

Would be great if you could add some kind of protection against incorrect server responses about filesizes.

EDIT:
I've now deleted all dupes after some fiddling with logs and cmd.exe. There were 12018 dupes after 2 runs (yesterday and today). My current download folder contains ~21000 files, so as you can guess the app created dupes for over 50% of downloaded files.

@Nandaka
Owner

Nandaka commented Apr 14, 2014

Content-Length Filesize: -1

The application is unable to get the file size, so it cannot compare it with the local file. The default action is to download the file anyway.

Is the downloaded file the same? If yes, I can add a post-download check.
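
A post-download check along these lines could look like the sketch below. It is only an illustration, not the application's actual code: `post_check` and its return values are hypothetical names, and the backup-with-timestamp step seen in the log is left out for brevity.

```python
import os


def post_check(existing_path: str, temp_path: str) -> str:
    """Compare a freshly downloaded temp file with the copy already on disk.

    Returns "identical" when the temp file was discarded and "replaced"
    when the new download was moved into place. (Hypothetical helper.)
    """
    if (os.path.exists(existing_path)
            and os.path.getsize(existing_path) == os.path.getsize(temp_path)):
        # Same size as the existing file: treat it as unchanged and
        # delete the temp file instead of creating a duplicate.
        os.remove(temp_path)
        return "identical"
    # Different size, or nothing on disk yet: keep the new download.
    os.replace(temp_path, existing_path)
    return "replaced"
```

The size is taken from the bytes actually received, so the check does not depend on the server sending a Content-Length header.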

@reyaz006
Author

Yes, they are exactly the same. As I've said elsewhere, the website doesn't allow members to modify uploaded pictures (afaik), so changed files are only possible for avatars.

A post-check would do the job, I guess. It would probably need to redo the online file size check if the returned value is 0 or less.

Until this problem is addressed somehow, the better option for me seems to be disabling overwriting completely, since I don't care about updated avatars.
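
As a rough sketch of that retry idea (all names here are made up for illustration, and `fetch_size` stands in for whatever call the app uses to ask the server for the size):

```python
import time
from typing import Callable


def remote_size_with_retry(fetch_size: Callable[[], int],
                           attempts: int = 3,
                           delay_seconds: float = 1.0) -> int:
    """Call fetch_size until it returns a positive size or attempts run out.

    Returns -1 if every attempt reported an unknown size.
    """
    for attempt in range(attempts):
        size = fetch_size()
        if size > 0:
            return size
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # brief pause before asking again
    return -1


def should_download(local_size: int, remote_size: int) -> bool:
    """Re-download only when the remote size is known and differs."""
    if remote_size <= 0:
        # Unknown size: keep the existing file rather than duplicating it.
        return False
    return local_size != remote_size
```

With this rule, the logged case of `489199 vs -1` would be skipped instead of triggering a backup and a fresh download.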

FYI, from the way avatar changing works, you can see that an updated image gets a different filename on the server. The added "13xxxxxxxx" part is a timestamp in hex format. Even though the scheme is different for uploaded images, you can see it serves the same purpose there: a timestamp. This means it's possible to find changed/updated files by comparing filenames rather than file sizes. If a user chooses to use {serverFilename}, you could monitor changes this way without using the database. The scheme used on the server may or may not change in the future, but it can still give valid results if used properly.

Another possibility is to take the actual modified time from the corresponding server response and write it as the creation time of each downloaded file, but this may cause issues if the timezone on the user's PC changes after the file was saved. It may also be more complex to implement.
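
A minimal sketch of the filename comparison, assuming the user's filename format includes {serverFilename} so that the server name appears somewhere in the local name. The function names and the example URL are hypothetical.

```python
import posixpath
from typing import Iterable
from urllib.parse import urlsplit


def server_filename(url: str) -> str:
    """Return the last path component of an image URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def is_already_downloaded(url: str, local_filenames: Iterable[str]) -> bool:
    """True if some local filename contains the server filename.

    A plain substring test is used here; a real implementation would
    probably want a stricter match to avoid accidental hits.
    """
    name = server_filename(url)
    return any(name in local for local in local_filenames)
```

For example, `server_filename("https://img.example.com/avatars/12345_13f0a1b2c3.jpg?x=1")` gives `12345_13f0a1b2c3.jpg`, so a changed timestamp part in the name shows up as a file that has not been downloaded yet.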
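
For illustration only, here is how a Last-Modified header value could be written to a file's timestamp. This sketch sets the modification time rather than the creation time, because Python's standard library has no portable call for setting creation time; `apply_last_modified` is a hypothetical name.

```python
import os
from email.utils import parsedate_to_datetime


def apply_last_modified(path: str, last_modified_header: str) -> float:
    """Set the file's access and modification times from an HTTP
    Last-Modified header value and return the timestamp that was used."""
    dt = parsedate_to_datetime(last_modified_header)
    timestamp = dt.timestamp()  # seconds since the Unix epoch
    os.utime(path, (timestamp, timestamp))
    return timestamp
```

Since `os.utime` takes seconds since the epoch rather than a local wall-clock time, comparing the stored value with a later server response should be less sensitive to timezone changes on the PC, though filesystem behaviour may vary.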

@Nandaka
Owner

Nandaka commented Apr 15, 2014

@reyaz006
Author

Thanks, will keep an eye on the log with this version and overwriting enabled.

While it's still running now, noticed the following things:

  • Content-Length Filesize: -1 is still happening, but it is sometimes followed by a Content-Length Filesize: line with a different size. I can't say whether both refer to the same image (due to concurrency), but I no longer see "different size" messages in the log. There are now things like this:
    2014-04-16 10:50:17,627 DEBUG [ 9] - Content-Length Filesize: -1
    2014-04-16 10:50:17,681 DEBUG [ 8] - Content-Length Filesize: -1
    2014-04-16 10:50:18,087 DEBUG [ 11] - Content-Length Filesize: 1805154
    If the numbers in [ ] refer to different threads, I find it a bit strange.
  • Messages like this started appearing: Compression Enabled and Identical size: xxxxxx, deleting temp file...
    It may be related to other changes and features, not sure.
  • [1.0.3.x] Problem: batch download speed degradation #17 happened again and was fixed with a Stop/Start sequence.

So far it looks like it no longer creates any redundant dupes.

By the way, I've noticed that my daily log size decreased from ~20 MB to ~7 MB after disabling overwriting. I think it would really help to switch from the file size check to something else for regular images. Right now, if an image is changed/updated under the same image id, the app won't even be able to detect it as already downloaded, since the image file will have a different name. It will then be downloaded as a new file, and the old file under the same id will never be touched again.

@Nandaka
Owner

Nandaka commented Apr 16, 2014

Content-Length Filesize: -1
I have no control over this one, as the value comes from the server. If there is no Content-Length header, the value is set to -1.
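
The fallback described here can be sketched roughly as follows. This is not the application's actual code; `content_length` is a hypothetical helper that takes the response headers as a plain mapping.

```python
from typing import Mapping


def content_length(headers: Mapping[str, str]) -> int:
    """Return Content-Length as an int, or -1 when it is missing or malformed."""
    for key, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if key.lower() == "content-length":
            try:
                return int(value)
            except ValueError:
                return -1
    return -1
```

A caller can then treat any non-positive result as "size unknown" rather than as a size that differs from the local file.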

Compression Enabled and Identical size: xxxxxx, deleting temp file...
yah, forgot to update the log message.

#17
Still not sure what the cause is; I am refactoring to see the flow better.

For the file size check, I need to depend on the filename and the file size, because the filename format is customizable. So I cannot depend on the server filename, unless I save the image id to the db (only for single images, no manga), record the downloaded file size (not done yet), and then refer to those to do the check.

@reyaz006
Author

I need to depend on the filename and the file size, as the filename format is customizable, so I cannot depend on the server filename, unless I do save the image id to the db

I forgot to mention that I was referring to the case where the user-defined filename format contains {serverFilename}. Since it's not included by default, this would indeed need the db.

and record the downloaded filesize (not done), and then refer to those to do the check.

I was trying to say that the file size check is useless once you take serverFilename into account, since with the current scheme the size will always be the same for the same serverFilename.

@Nandaka
Owner

Nandaka commented Apr 24, 2014

DB updated: it now records both the file size and the server filename extracted from the URL. No detection logic yet.

Technically, we can compare the URL/server filename stored in the DB with the actual URL parsed from the image page.
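
A rough sketch of what that detection logic might look like, using sqlite3 from the standard library. The table layout, column names, and function names are assumptions made for this example and do not reflect the application's real schema.

```python
import posixpath
import sqlite3
from urllib.parse import urlsplit


def server_name(url: str) -> str:
    """Server filename taken from the URL path."""
    return posixpath.basename(urlsplit(url).path)


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) image table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image ("
        " image_id INTEGER PRIMARY KEY,"
        " server_filename TEXT NOT NULL,"
        " filesize INTEGER NOT NULL)"
    )
    return conn


def record_download(conn: sqlite3.Connection, image_id: int,
                    url: str, filesize: int) -> None:
    """Store the server filename and downloaded size for an image id."""
    conn.execute(
        "INSERT OR REPLACE INTO image VALUES (?, ?, ?)",
        (image_id, server_name(url), filesize),
    )
    conn.commit()


def needs_download(conn: sqlite3.Connection, image_id: int, url: str) -> bool:
    """True if the image is unknown or its server filename has changed."""
    row = conn.execute(
        "SELECT server_filename FROM image WHERE image_id = ?", (image_id,)
    ).fetchone()
    return row is None or row[0] != server_name(url)
```

Because the comparison uses the name parsed from the image page URL, it works regardless of the user's filename format and does not rely on the Content-Length header.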
