Some downloaded files are gzip stream #259

FoxAhead · 2023-09-17T10:32:23Z

I used this great tool to download the site http://web.archive.org/web/20230713110210/http://users.tpg.com.au/jpwbeest/. At first glance everything went well, but then I found that some downloaded files, regardless of extension, had been saved as gzip streams, while others were fine. The result was the same on repeated downloads: about 30 "corrupted" files out of 245 in total.

Examples of gzipped files (the first two bytes, 1F 8B, are the gzip magic number, and the third, 08, indicates deflate compression)

I would like to know what causes this. Is it a bug, or a peculiarity of this site or of the Wayback Machine as a whole? Can it be fixed?

So far I've worked around the problem with a simple Python script that scans the files in the directory and, if a file looks like a gzip stream, decompresses it; otherwise it just copies the file to the output folder.
Thanks!
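
For anyone who needs the same workaround, here is a minimal sketch of such a script. It is not the original script, only an illustration of the approach described above. The names `is_gzip` and `fix_tree` and the example paths are made up for this sketch, and it assumes the output folder lies outside the input folder.

```python
import gzip
import shutil
import zlib
from pathlib import Path

# gzip magic number (1F 8B) followed by the deflate method byte (08)
GZIP_SIGNATURE = b"\x1f\x8b\x08"


def is_gzip(path: Path) -> bool:
    """Return True if the file starts with the gzip signature."""
    with path.open("rb") as f:
        return f.read(3) == GZIP_SIGNATURE


def fix_tree(src: Path, dst: Path) -> int:
    """Mirror src into dst, decompressing files that are gzip streams.

    Returns the number of files that were decompressed.
    Assumption: dst is not located inside src.
    """
    fixed = 0
    for path in src.rglob("*"):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        if is_gzip(path):
            try:
                with gzip.open(path, "rb") as fin, target.open("wb") as fout:
                    shutil.copyfileobj(fin, fout)
                fixed += 1
                continue
            except (OSError, EOFError, zlib.error):
                # Signature matched but the stream is invalid or truncated:
                # fall through and copy the file unchanged.
                pass
        shutil.copy2(path, target)
    return fixed


# Example call (hypothetical paths):
# fix_tree(Path("websites/example"), Path("websites_fixed/example"))
```

The check looks only at the first three bytes, so a file that is legitimately a `.gz` archive on the original site would be decompressed too. If that matters, such extensions could be skipped explicitly.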

lihaohong6 · 2023-09-30T05:13:52Z

The Wayback Machine might have changed how its API works. I'm downloading webpages archived in the past two months, and all of them end up being gzip files. I suspect that the change happened sometime during the last two months.

Na-x4 · 2023-11-02T15:00:21Z

I think reverting #34 will solve this problem.

tudoujunha · 2024-04-12T10:27:11Z

> I think reverting #34 will solve this problem.

You are right. This will solve the problem: #267 (comment)
I recommend this: #280 (comment)

Forage linked a pull request on Oct 11, 2023 that will close this issue: Decompress gzip content #262 (Open)

nicholascc mentioned this issue on Nov 12, 2023: I found garbled code files after downloading the entire website! #265 (Closed)
