This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

File still exists but can't be downloaded. #41

Open
jerejesse opened this issue Jun 20, 2022 · 3 comments


jerejesse commented Jun 20, 2022

Downloading the file manually from the Ceiba website still works, and the downloaded file opens normally.

2022-06-20 10:09:51 - INFO - [Algorithms / Homework] Downloading PA#0...
2022-06-20 10:09:51 - ERROR - ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
Traceback (most recent call last):
  File "urllib3/response.py", line 401, in _decode
  File "urllib3/response.py", line 88, in decompress
zlib.error: Error -3 while decompressing data: incorrect header check

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "requests/models.py", line 758, in generate
  File "urllib3/response.py", line 576, in stream
  File "urllib3/response.py", line 548, in read
  File "urllib3/response.py", line 407, in _decode
urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ceiba/util.py", line 176, in loop_connect
  File "requests/sessions.py", line 555, in get
  File "requests/sessions.py", line 542, in request
  File "requests/sessions.py", line 697, in send
  File "requests/models.py", line 836, in content
  File "requests/models.py", line 763, in generate
requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
2022-06-20 10:09:51 - ERROR - URL: https://ceiba.ntu.edu.tw/course/f1c349/hw/PA0_MergeSort.tgz
2022-06-20 10:09:51 - WARNING - Reconnecting in five seconds...

This repeats ten times, then the crawler stops trying.

2022-06-20 10:10:41 - WARNING - Exceeded the maximum number of connection retries! Giving up!
2022-06-20 10:10:41 - ERROR - A problem occurred while downloading the file! (URL: https://ceiba.ntu.edu.tw/course/f1c349/hw/PA0_MergeSort.tgz)
Traceback (most recent call last):
  File "ceiba/crawler.py", line 159, in crawl_hrefs
  File "ceiba/crawler.py", line 55, in crawl
  File "ceiba/util.py", line 167, in get
  File "ceiba/util.py", line 188, in loop_connect
ceiba.exceptions.CrawlerConnectionError: A connection problem occurred! URL that caused the error: https://ceiba.ntu.edu.tw/course/f1c349/hw/PA0_MergeSort.tgz

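I have attached the manually downloaded file; please delete it once you have checked it.
The original extension is tgz. GitHub does not allow uploading tgz files, so I renamed it to zip; after downloading, change the extension back to tgz and it opens normally.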
[The file has been deleted by @jameshwc]

@jameshwc
Owner

I'm not sure what the problem is here. My guess is that Python's requests thinks the encoding is gzip when it actually isn't?
Could you use curl or another tool to check which content-encoding the URL https://ceiba.ntu.edu.tw/course/f1c349/hw/PA0_MergeSort.tgz returns?

To access Ceiba with curl: curl -I -H 'Accept-Encoding: gzip,compress,br,deflate,identity' --cookie "PHPSESSID=foobar" https://ceiba.ntu.edu.tw/course/f1c349/hw/PA0_MergeSort.tgz
Replace PHPSESSID with your own value; you can find it in the cookies of the Ceiba site after logging in.

You can also look at this issue for debugging ideas.

@jerejesse
Author

It looks like that is indeed the problem. The response declares Content-Encoding: gzip:

HTTP/1.1 200 OK
Date: Wed, 22 Jun 2022 09:35:26 GMT
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1;mode=block
Last-Modified: Wed, 22 Feb 2017 11:09:49 GMT
ETag: "171bcee-624800-5491c88f52ba2"
Accept-Ranges: bytes
Content-Length: 6440960
Content-Type: application/x-tar
Content-Encoding: gzip
Set-Cookie: TS01fb8d58=01048815223ff22d433ff9d7c48c45253b11e6fe4dc3ec25d6dcd78f71b9d9b9d191fbbb8bd0d2df2694191974b1a013af9c378721; Path=/; Domain=.ceiba.ntu.edu.tw
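
The headers above claim the body is gzip-encoded, while the traceback shows zlib rejecting it with "incorrect header check". One way to see which side is right is to look at the first two bytes of the manually downloaded file: a gzip stream always begins with the bytes 0x1f 0x8b. The sketch below is only an illustration; the helper names `looks_like_gzip` and `file_looks_like_gzip` are made up for this example and are not part of the crawler.

```python
# Every gzip member starts with the two magic bytes 0x1f 0x8b.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the given bytes begin with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def file_looks_like_gzip(path: str) -> bool:
    """Check only the first two bytes of a file on disk."""
    with open(path, "rb") as f:
        return looks_like_gzip(f.read(2))
```

Running `file_looks_like_gzip` on the file saved from the browser shows what that saved copy contains. It cannot by itself tell whether the browser already decoded the body once on the way to disk, so treat the result as a hint rather than proof.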

@jerejesse
Author

I'm not familiar with Python, so I'm a bit lost.
Is this the line that downloads the file?

resp = util.get(self.session, urljoin(self.url, res))

Should it be changed to something like this?
resp = util.get(self.session, urljoin(self.url, res), headers={'Accept-Encoding': 'identity'})

I'll set up an environment and try it later.
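
Whether `util.get` forwards a `headers` argument is not shown in this thread. It is also unconfirmed whether the server would stop sending Content-Encoding: gzip when asked for `identity`. An alternative that does not depend on either is to ask requests for the undecoded body. The sketch below assumes a `requests.Session`-like object; the function name `save_undecoded` and its parameters are hypothetical and not taken from the crawler's code.

```python
def save_undecoded(session, url, dest_path, chunk_size=64 * 1024):
    """Write the response body to dest_path exactly as the server sent it.

    `session` is assumed to behave like requests.Session.
    """
    # stream=True defers reading the body, so requests does not try to
    # gunzip it while building resp.content (where the traceback fails).
    resp = session.get(url, stream=True)
    try:
        resp.raise_for_status()
        with open(dest_path, "wb") as out:
            # decode_content=False asks urllib3 to ignore Content-Encoding
            # and hand back the bytes untouched.
            for chunk in resp.raw.stream(chunk_size, decode_content=False):
                out.write(chunk)
    finally:
        resp.close()
    return dest_path
```

With this approach the file on disk is whatever the server actually transmitted, so a mislabelled Content-Encoding header no longer matters. The trade-off is that a response that really is gzip-encoded would be saved still compressed.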
