UTF-8 characters unable to handle in headers. #4187

Closed
hrimfaxi opened this issue Jul 7, 2017 · 14 comments

Comments


hrimfaxi commented Jul 7, 2017

requests is unable to handle utf-8 characters in headers.

http://assrt.net/download/217234/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%BA%8C%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar

Expected Result

Content-Disposition should be subtitle; filename="植物王国第三集蓝光版中英文及双语字幕.rar"

Actual Result

{'Cache-Control': 'max-age=2678400', 'X-Cache': 'MISS', 'Expires': 'Mon, 07 Aug 2017 08:16:04 GMT', 'Server': 'openresty', 'Content-Disposition': 'subtitle; filename="æ¤\x8dç\x89©ç\x8e\x8bå\x9b½ç¬¬äº\x8cé\x9b\x86è\x93\x9då\x85', 'Connection': 'keep-alive', 'Date': 'Fri, 07 Jul 2017 08:16:04 GMT', 'Content-Length': '62859', 'Servant': 'Berserker', 'ETag': '"56f296fd-f58b"', 'Content-Type': 'application/octet-stream', 'Last-Modified': 'Wed, 23 Mar 2016 13:15:41 GMT'}

The Content-Disposition value is truncated, and the trailing characters are lost:

a.headers['content-disposition'].encode('iso8859-1').decode('utf-8', errors='ignore')
'subtitle; filename="植物王国第三集蓝'

Reproduction Steps

System Information

hrimfaxi@ubuntu-build-server:~/aasrt$ python3 -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "implementation": {
    "name": "CPython",
    "version": "3.5.2"
  },
  "platform": {
    "release": "4.7.4+",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.18.1"
  },
  "system_ssl": {
    "version": "1000207f"
  },
  "urllib3": {
    "version": "1.21.1"
  },
  "using_pyopenssl": false
}
Member

Lukasa commented Jul 7, 2017

On Python 3 the header decoding is actually done by the Python standard library. This decodes headers using Latin-1, so you can resolve this issue by doing headers['content-disposition'].encode('iso-8859-1').decode('utf-8').

Sadly, there is no guaranteed encoding for header values, so this approach (while silly) works pretty well.
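
A minimal sketch of the round trip described above, assuming the server really did send UTF-8 bytes in the header value. The helper name `redecode_header` is hypothetical and is not part of Requests; it returns the original string unchanged when the bytes are not valid in the assumed encoding.

```python
def redecode_header(value, encoding="utf-8"):
    """Undo the Latin-1 decoding applied on Python 3 and decode again.

    Hypothetical helper for illustration; not a Requests API.
    """
    try:
        raw = value.encode("iso-8859-1")  # recover the bytes sent on the wire
        return raw.decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either not representable in Latin-1 or not valid in `encoding`.
        return value


# Simulate what the caller sees for a UTF-8 encoded header value.
wire_bytes = 'subtitle; filename="植物王国.rar"'.encode("utf-8")
as_seen_by_caller = wire_bytes.decode("iso-8859-1")
print(redecode_header(as_seen_by_caller))
```

Falling back to the original string is only one possible design choice; raising the error instead makes a truncated or non-UTF-8 value easier to notice.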

Lukasa closed this as completed Jul 7, 2017
Author

hrimfaxi commented Jul 7, 2017

I did try that, but the filename comes out truncated:

 a.headers['content-disposition'].encode('iso8859-1').decode('utf-8', errors='ignore')
'subtitle; filename="植物王国第三集蓝'

The complete header should be:

Content-Disposition: subtitle; filename="植物王国第三集蓝光版中英文及双语字幕.rar"

curl -I -L -v gets the header without any problem.

Member

Lukasa commented Jul 7, 2017

Hrm. Why are you setting errors='ignore'? Do you get an error if you don't do that?

Author

hrimfaxi commented Jul 7, 2017

Yes.

>>> a['Content-Disposition'].encode('iso-8859-1').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 44-45: unexpected end of data

Member

Lukasa commented Jul 7, 2017

Ok, that strongly suggests that the data is not UTF-8 encoded. Once you do the latin-1 encode, what does the repr of the raw bytes look like?

Author

hrimfaxi commented Jul 7, 2017

I think it is UTF-8 encoded, but it was corrupted after being decoded as latin-1.

curl -L -I -v 'http://assrt.net/download/217235/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%89%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar'
*   Trying 210.129.49.133...
*   Trying 2001:470:24:401::1...
* Immediate connect fail for 2001:470:24:401::1: Network is unreachable
* Connected to assrt.net (210.129.49.133) port 80 (#0)
> HEAD /download/217235/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%89%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar HTTP/1.1
> Host: assrt.net
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 302 Moved Temporarily
HTTP/1.1 302 Moved Temporarily
< Server: openresty
Server: openresty
< Date: Fri, 07 Jul 2017 09:06:29 GMT
Date: Fri, 07 Jul 2017 09:06:29 GMT
< Content-Type: text/html
Content-Type: text/html
< Content-Length: 154
Content-Length: 154
< Connection: keep-alive
Connection: keep-alive
< Location: http://file0.assrt.net/download/217235/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%89%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499418389&-=2e12dadb7fd23e0f65e121f62890f7d2
Location: http://file0.assrt.net/download/217235/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%89%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499418389&-=2e12dadb7fd23e0f65e121f62890f7d2
< Master: Yoda
Master: Yoda
< Droid: R2-D2
Droid: R2-D2

< 
* Connection #0 to host assrt.net left intact
* Issue another request to this URL: 'http://file0.assrt.net/download/217235/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%89%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499418389&-=2e12dadb7fd23e0f65e121f62890f7d2'
*   Trying 210.129.49.133...
* Connected to file0.assrt.net (210.129.49.133) port 80 (#1)
> HEAD /download/217235/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%89%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499418389&-=2e12dadb7fd23e0f65e121f62890f7d2 HTTP/1.1
> Host: file0.assrt.net
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Server: openresty
Server: openresty
< Date: Fri, 07 Jul 2017 09:06:30 GMT
Date: Fri, 07 Jul 2017 09:06:30 GMT
< Content-Type: application/octet-stream
Content-Type: application/octet-stream
< Content-Length: 67270
Content-Length: 67270
< Connection: keep-alive
Connection: keep-alive
< Last-Modified: Wed, 23 Mar 2016 13:15:41 GMT
Last-Modified: Wed, 23 Mar 2016 13:15:41 GMT
< ETag: "56f296fd-106c6"
ETag: "56f296fd-106c6"
< Servant: Berserker
Servant: Berserker
< Expires: Mon, 07 Aug 2017 09:06:30 GMT
Expires: Mon, 07 Aug 2017 09:06:30 GMT
< Cache-Control: max-age=2678400
Cache-Control: max-age=2678400
< X-Cache: HIT
X-Cache: HIT
< Content-Disposition: subtitle; filename="植物王国第三集蓝光版中英文及双语字幕.rar"
Content-Disposition: subtitle; filename="植物王国第三集蓝光版中英文及双语字幕.rar"
< Accept-Ranges: bytes
Accept-Ranges: bytes

< 
* Connection #1 to host file0.assrt.net left intact

The link may stop working after a short time.

>>> a = requests.get('http://file0.assrt.net/download/217235/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%89%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499418389&-=2e12dadb7fd23e0f65e121f62890f7d2')
>>> a.headers
{'Server': 'openresty', 'Date': 'Fri, 07 Jul 2017 09:08:16 GMT', 'X-Cache': 'HIT', 'Connection': 'keep-alive', 'Content-Disposition': 'subtitle; filename="æ¤\x8dç\x89©ç\x8e\x8bå\x9b½ç¬¬ä¸\x89é\x9b\x86è\x93\x9då\x85', 'Content-Type': 'application/octet-stream', 'Content-Length': '67270', 'Expires': 'Mon, 07 Aug 2017 09:08:16 GMT', 'Servant': 'Berserker', 'Cache-Control': 'max-age=2678400', 'Last-Modified': 'Wed, 23 Mar 2016 13:15:41 GMT', 'ETag': '"56f296fd-106c6"'}
>>> a.headers['Content-Disposition']
'subtitle; filename="æ¤\x8dç\x89©ç\x8e\x8bå\x9b½ç¬¬ä¸\x89é\x9b\x86è\x93\x9då\x85'
>>> a.headers['Content-Disposition'].encode('latin1')
b'subtitle; filename="\xe6\xa4\x8d\xe7\x89\xa9\xe7\x8e\x8b\xe5\x9b\xbd\xe7\xac\xac\xe4\xb8\x89\xe9\x9b\x86\xe8\x93\x9d\xe5\x85'
>>> a.headers['Content-Disposition'].encode('latin1').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 44-45: unexpected end of data
>>> a.headers['Content-Disposition'].encode('latin1').decode('utf-8', errors='ignore')
'subtitle; filename="植物王国第三集蓝'
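
The traceback above is what Python reports when a UTF-8 byte string stops in the middle of a multi-byte character. A small sketch, using the full header value quoted elsewhere in this thread and cutting it at the same offset (the slice index 46 is derived from the byte repr above, not from any Requests internals):

```python
full = 'subtitle; filename="植物王国第三集蓝光版中英文及双语字幕.rar"'.encode("utf-8")

# Keep the first 46 bytes: the 20-byte ASCII prefix, eight complete
# three-byte characters, and the first two bytes (e5 85) of the ninth.
truncated = full[:46]

error = None
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc.start, exc.reason)

# errors='ignore' silently drops the incomplete trailing character.
print(truncated.decode("utf-8", errors="ignore"))
```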

Member

Lukasa commented Jul 7, 2017

Decoding as latin-1 cannot corrupt binary data: latin-1 is a character map encoding, which means that it has a one-to-one mapping of bytes to unicode code points. The corruption appears to be happening lower down the stack, or more likely in the redirect. Can you use Wireshark to capture the HTTP traffic that Requests is sending?
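
The one-to-one mapping can be checked directly. This short sketch decodes every possible byte value as Latin-1 and encodes it back, which should reproduce the input exactly:

```python
all_bytes = bytes(range(256))

# Latin-1 maps byte value N to the code point with the same number N.
as_text = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in as_text]

round_tripped = as_text.encode("iso-8859-1")
print(code_points == list(range(256)))
print(round_tripped == all_bytes)
```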

Author

hrimfaxi commented Jul 7, 2017

But why does curl work fine?

17:15:11.019128 IP 192.168.42.34.50210 > 210.129.49.133.80: Flags [P.], seq 1:377, ack 1, win 58, options [nop,nop,TS val 43682431 ecr 794638478], length 376: HTTP: GET /download/217235/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%89%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499418389&-=2e12dadb7fd23e0f65e121f62890f7d2 HTTP/1.1
	0x0000:  4500 01ac 4e50 4000 4006 fc2a c0a8 2a22  E...NP@.@..*..*"
	0x0010:  d281 3185 c422 0050 ad55 c430 fc1f 7a0a  ..1..".P.U.0..z.
	0x0020:  8018 003a 8636 0000 0101 080a 029a 8a7f  ...:.6..........
	0x0030:  2f5d 388e 4745 5420 2f64 6f77 6e6c 6f61  /]8.GET./downloa
	0x0040:  642f 3231 3732 3335 2f25 4536 2541 3425  d/217235/%E6%A4%
	0x0050:  3844 2545 3725 3839 2541 3925 4537 2538  8D%E7%89%A9%E7%8
	0x0060:  4525 3842 2545 3525 3942 2542 4425 4537  E%8B%E5%9B%BD%E7
	0x0070:  2541 4325 4143 2545 3425 4238 2538 3925  %AC%AC%E4%B8%89%
	0x0080:  4539 2539 4225 3836 2545 3825 3933 2539  E9%9B%86%E8%93%9
	0x0090:  4425 4535 2538 3525 3839 2545 3725 3839  D%E5%85%89%E7%89
	0x00a0:  2538 3825 4534 2542 3825 4144 2545 3825  %88%E4%B8%AD%E8%
	0x00b0:  3842 2542 3125 4536 2539 3625 3837 2545  8B%B1%E6%96%87%E
	0x00c0:  3525 3846 2538 4125 4535 2538 4625 3843  5%8F%8A%E5%8F%8C
	0x00d0:  2545 3825 4146 2541 4425 4535 2541 4425  %E8%AF%AD%E5%AD%
	0x00e0:  3937 2545 3525 4239 2539 352e 7261 723f  97%E5%B9%95.rar?
	0x00f0:  5f3d 3134 3939 3431 3833 3839 262d 3d32  _=1499418389&-=2
	0x0100:  6531 3264 6164 6237 6664 3233 6530 6636  e12dadb7fd23e0f6
	0x0110:  3565 3132 3166 3632 3839 3066 3764 3220  5e121f62890f7d2.
	0x0120:  4854 5450 2f31 2e31 0d0a 486f 7374 3a20  HTTP/1.1..Host:.
	0x0130:  6669 6c65 302e 6173 7372 742e 6e65 740d  file0.assrt.net.
	0x0140:  0a41 6363 6570 743a 202a 2f2a 0d0a 4163  .Accept:.*/*..Ac
	0x0150:  6365 7074 2d45 6e63 6f64 696e 673a 2067  cept-Encoding:.g
	0x0160:  7a69 702c 2064 6566 6c61 7465 0d0a 436f  zip,.deflate..Co
	0x0170:  6e6e 6563 7469 6f6e 3a20 6b65 6570 2d61  nnection:.keep-a
	0x0180:  6c69 7665 0d0a 5573 6572 2d41 6765 6e74  live..User-Agent
	0x0190:  3a20 7079 7468 6f6e 2d72 6571 7565 7374  :.python-request
	0x01a0:  732f 322e 3138 2e31 0d0a 0d0a            s/2.18.1....
17:15:11.162475 IP 210.129.49.133.80 > 192.168.42.34.50210: Flags [.], ack 377, win 235, options [nop,nop,TS val 794638514 ecr 43682431], length 0
	0x0000:  4548 0034 42b8 4000 2b06 1df3 d281 3185  EH.4B.@.+.....1.
	0x0010:  c0a8 2a22 0050 c422 fc1f 7a0a ad55 c5a8  ..*".P."..z..U..
	0x0020:  8010 00eb e43c 0000 0101 080a 2f5d 38b2  .....<....../]8.
	0x0030:  029a 8a7f                                ....
17:15:11.162623 IP 210.129.49.133.80 > 192.168.42.34.50210: Flags [.], ack 377, win 235, options [nop,nop,TS val 794638514 ecr 43682431], length 0
	0x0000:  4548 0034 42b9 4000 2b06 1df2 d281 3185  EH.4B.@.+.....1.
	0x0010:  c0a8 2a22 0050 c422 fc1f 7a0a ad55 c5a8  ..*".P."..z..U..
	0x0020:  8010 00eb e43c 0000 0101 080a 2f5d 38b2  .....<....../]8.
	0x0030:  029a 8a7f                                ....
17:15:11.277061 IP 210.129.49.133.80 > 192.168.42.34.50210: Flags [.], seq 1:1389, ack 377, win 235, options [nop,nop,TS val 794638541 ecr 43682431], length 1388: HTTP: HTTP/1.1 200 OK
	0x0000:  4548 05a0 42ba 4000 2b06 1885 d281 3185  EH..B.@.+.....1.
	0x0010:  c0a8 2a22 0050 c422 fc1f 7a0a ad55 c5a8  ..*".P."..z..U..
	0x0020:  8010 00eb f62f 0000 0101 080a 2f5d 38cd  ...../....../]8.
	0x0030:  029a 8a7f 4854 5450 2f31 2e31 2032 3030  ....HTTP/1.1.200
	0x0040:  204f 4b0d 0a53 6572 7665 723a 206f 7065  .OK..Server:.ope
	0x0050:  6e72 6573 7479 0d0a 4461 7465 3a20 4672  nresty..Date:.Fr
	0x0060:  692c 2030 3720 4a75 6c20 3230 3137 2030  i,.07.Jul.2017.0
	0x0070:  393a 3135 3a31 3120 474d 540d 0a43 6f6e  9:15:11.GMT..Con
	0x0080:  7465 6e74 2d54 7970 653a 2061 7070 6c69  tent-Type:.appli
	0x0090:  6361 7469 6f6e 2f6f 6374 6574 2d73 7472  cation/octet-str
	0x00a0:  6561 6d0d 0a43 6f6e 7465 6e74 2d4c 656e  eam..Content-Len
	0x00b0:  6774 683a 2036 3732 3730 0d0a 436f 6e6e  gth:.67270..Conn
	0x00c0:  6563 7469 6f6e 3a20 6b65 6570 2d61 6c69  ection:.keep-ali
	0x00d0:  7665 0d0a 4c61 7374 2d4d 6f64 6966 6965  ve..Last-Modifie
	0x00e0:  643a 2057 6564 2c20 3233 204d 6172 2032  d:.Wed,.23.Mar.2
	0x00f0:  3031 3620 3133 3a31 353a 3431 2047 4d54  016.13:15:41.GMT
	0x0100:  0d0a 4554 6167 3a20 2235 3666 3239 3666  ..ETag:."56f296f
	0x0110:  642d 3130 3663 3622 0d0a 5365 7276 616e  d-106c6"..Servan
	0x0120:  743a 2042 6572 7365 726b 6572 0d0a 4578  t:.Berserker..Ex
	0x0130:  7069 7265 733a 204d 6f6e 2c20 3037 2041  pires:.Mon,.07.A
	0x0140:  7567 2032 3031 3720 3039 3a31 353a 3131  ug.2017.09:15:11
	0x0150:  2047 4d54 0d0a 4361 6368 652d 436f 6e74  .GMT..Cache-Cont
	0x0160:  726f 6c3a 206d 6178 2d61 6765 3d32 3637  rol:.max-age=267
	0x0170:  3834 3030 0d0a 582d 4361 6368 653a 2048  8400..X-Cache:.H
	0x0180:  4954 0d0a 436f 6e74 656e 742d 4469 7370  IT..Content-Disp
	0x0190:  6f73 6974 696f 6e3a 2073 7562 7469 746c  osition:.subtitl
	0x01a0:  653b 2066 696c 656e 616d 653d 22e6 a48d  e;.filename="...
	0x01b0:  e789 a9e7 8e8b e59b bde7 acac e4b8 89e9  ................
	0x01c0:  9b86 e893 9de5 8589 e789 88e4 b8ad e88b  ................
	0x01d0:  b1e6 9687 e58f 8ae5 8f8c e8af ade5 ad97  ................
	0x01e0:  e5b9 952e 7261 7222 0d0a 4163 6365 7074  ....rar"..Accept
	0x01f0:  2d52 616e 6765 733a 2062 7974 6573 0d0a  -Ranges:.bytes..
	0x0200:  0d0a 5261 7221 1a07 00cf 9073 0000 0d00  ..Rar!.....s....
	0x0210:  0000 0000 0000 4988 7420 927d 0091 4300  ......I.t..}..C.
	0x0220:  0064 9900 0002 ff96 0f99 af72 1841 1d33  .d.........r.A.3
	0x0230:  5800 2000 0000 4253 6b79 422e 4b69 6e67  X.....BSkyB.King
	0x0240:  646f 6d2e 6f66 2e50 6c61 6e74 732e 336f  dom.of.Plants.3o
	0x0250:  6634 2e53 7572 7669 7661 6c2e 3732 3070  f4.Survival.720p
	0x0260:  2e42 4452 6970 2e48 442e 7832 3634 2e41  .BDRip.HD.x264.A
	0x0270:  4143 2e4d 5647 726f 7570 2e6f 7267 2e43  AC.MVGroup.org.C
	0x0280:  6869 6e65 7365 2e73 7274 0001 c052 00b0  hinese.srt...R..
	0x0290:  3ad3 7410 2114 d50c cd9d 1016 157b ee22  :.t.!........{."
	0x02a0:  fc1e fab3 a9dd def9 be88 cc8c e8bf 15c8  ................
	0x02b0:  8c8b 8c8c 8be8 fd11 9119 e368 94d2 4ca4  ...........h..L.
	0x02c0:  d335 fe10 d551 35c2 6b80 d551 aa28 b293  .5...Q5.k..Q.(..
	0x02d0:  5c2d 36d7 3369 15c2 e2e1 8145 36ea a8ea  \-6.3i.....E6...
	0x02e0:  40de f740 bfd9 e019 d1f7 cfa1 bd91 200d  @..@............
	0x02f0:  81a1 a1ae 3dd7 8909 1224 6848 d0d0 d44c  ....=....$hH...L
	0x0300:  cec5 4ff6 a49f 055f fe3f f059 6ff4 5fe2  ..O...._.?.Yo._.
	0x0310:  ebfd 4c7f 37fc bfe7 ff3f fdbf e507 81bf  ..L.7....?......
	0x0320:  a963 2fef afe2 c9c7 bf4a efbb 7bc5 5fcf  .c/......J..{._.
	0x0330:  2ac6 58f7 fd17 2ad2 fcbf b7fa 2c7f c7fe  *.X...*.....,...
	0x0340:  115b abb2 caa2 5efa 9bfe d0ea 4e3c de7d  .[....^.....N<.}
	0x0350:  179d 97f8 954b fa35 ead5 f6ff 0036 4d58  .....K.5.....6MX
	0x0360:  64c9 66cd d61c fa9a 6ccf 1e1d cbb8 6b54  d.f.....l.....kT
	0x0370:  dea3 467d e67f ed6f c18f 6ccb 6fd1 97e0  ..F}...o..l.o...
	0x0380:  0534 6343 1b75 83e3 ffd8 3077 ab3a f66c  .4cC.u....0w.:.l
	0x0390:  7262 a153 e177 e5fd 1b7e 78fc 59d5 7e2a  rb.S.w...~x.Y.~*
	0x03a0:  fc86 8d9a b3ab 5658 74fd 64c9 f7da 37a9  ......VXt.d...7.
	0x03b0:  75ee 46a3 e1dc f75d ed4d b142 86af 45d1  u.F....].M.B..E.
	0x03c0:  a438 de3c d44d 1c35 69af a9c7 15d3 4d3b  .8.<.M.5i.....M;
	0x03d0:  f53c d3e6 5a7d a925 2146 bb26 ee92 6d7e  .<..Z}.%!F.&..m~
	0x03e0:  53ef d3ab 1f96 9cfe 1f78 b83a 6ada da21  S........x.:j..!
	0x03f0:  a75c fd6f 4a4b 2e58 6009 014a afa4 7e72  .\.oJK.X`..J..~r
	0x0400:  70fd 92fe ebfe 8d7a 05e8 8596 d6f5 059f  p......z........
	0x0410:  3502 d3c5 b3dc 7971 f937 3cd7 bbe8 1331  5.....yq.7<....1
	0x0420:  38a7 cc3d a2ad 1f8b ffa5 8b26 ffb1 cea3  8..=.......&....
	0x0430:  ec59 2deb abd5 a723 2e1b 7f09 3e2c 15aa  .Y-....#....>,..
	0x0440:  c901 ae44 0e0b 363e a61d 5d51 d1bf 45e9  ...D..6>..]Q..E.
	0x0450:  7a2d a6a9 bed9 757f 96ff 725f efaa aa1f  z-....u...r_....
	0x0460:  09d3 b488 81e5 cc0b 4bc7 cd56 b8f3 d0ed  ........K..V....
	0x0470:  8c75 caed 6dea cf48 133a d725 6022 3899  .u..m..H.:.%`"8.
	0x0480:  4bfa 28bb 2c18 d117 5857 1b5c b3f5 9e67  K.(.,...XW.\...g
	0x0490:  3bef a927 87bf 4faf 1f46 6c6a 0ec1 7ff8  ;..'..O..Flj....
	0x04a0:  de6e 0eda 0799 35d2 369f de36 115f 587c  .n....5.6..6._X|
	0x04b0:  bfb5 8b34 ef5c b38f 6277 c285 ca38 8172  ...4.\..bw...8.r
	0x04c0:  c9b7 67d5 f4d0 3ed1 874c 9870 d8f9 6d79  ..g...>..L.p..my
	0x04d0:  61ef 3a79 3aa0 5c2c 69f3 83f4 b98a afa1  a.:y:.\,i.......
	0x04e0:  038d 982b 43a7 c22f e37f d843 ef13 3639  ...+C../...C..69
	0x04f0:  eb18 9cde 7a3d 7977 7488 35dd cb98 fa87  ....z=ywt.5.....
	0x0500:  f9c3 5634 fac9 b595 8e9e 2d59 c51c 177b  ..V4......-Y...{
	0x0510:  7a40 cf8d 1cc3 d2c3 7aff 877b dc6f 9d35  z@......z..{.o.5
	0x0520:  45a5 83e3 5cfc b8d9 a73e f1b6 6874 eefb  E...\....>..ht..
	0x0530:  f1e1 34bf 26ac 9d22 70d2 9732 afc4 df9e  ..4.&.."p..2....
	0x0540:  31a2 6cff 07c6 cb6f e5b3 dd5e e6c6 8b75  1.l....o...^...u
	0x0550:  f31c f8b8 fba7 e8e0 b79d 3b2c 0def da71  ..........;,...q
	0x0560:  162f fd54 ffdf d5a3 66b5 1dee c65e cd1c  ./.T....f....^..
	0x0570:  567c 167f 499e 7c19 bfb3 ffe7 fdbf dbff  V|..I.|.........
	0x0580:  1feb febf f5ff f33a ffa7 fdff faff c7ff  .......:........
	0x0590:  bff7 ffa7 ff7f fc7f b7fd 7fb6 8d4f 365b  .............O6[

It is UTF-8 encoded:

>>> b"""\x65\x3b\x20\x66\x69\x6c\x65\x6e\x61\x6d\x65\x3d\x22\xe6\xa4\x8d\xe7\x89\xa9\xe7\x8e\x8b\xe5\x9b\xbd\xe7\xac\xac\xe4\xb8\x89\xe9\x9b\x86\xe8\x93\x9d\xe5\x85\x89\xe7\x89\x88\xe4\xb8\xad\xe8\x8b\xb1\xe6\x96\x87\xe5\x8f\x8a\xe5\x8f\x8c\xe8\xaf\xad\xe5\xad\x97\xe5\xb9\x95\x2e\x72\x61\x72\x22""".decode('utf-8')
'e; filename="植物王国第三集蓝光版中英文及双语字幕.rar"'

Author

hrimfaxi commented Jul 7, 2017

And that tcpdump result is from python-requests, not curl.

Author

hrimfaxi commented Jul 7, 2017

Somehow, requests stripped the last 32 bytes from the value of 'Content-Disposition'.

Member

Lukasa commented Jul 7, 2017

Yup, that appears to be the problem here. Requests isn't doing custom header parsing here: it's done by http.client. So I suspect http.client is the one making the error.

I recommend trying to use http.client to reproduce that request and see if the error persists. If it does, this is a bug in the Python standard library.

Author

hrimfaxi commented Jul 7, 2017

I tried http.client. It didn't reproduce the error.

#!/usr/bin/python3
# coding: utf-8

import http.client

conn = http.client.HTTPConnection("file0.assrt.net", 80)
conn.set_debuglevel(9)
#http://assrt.net/download/217233/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%B8%80%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar
conn.request("HEAD","/download/217234/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%BA%8C%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499423618&-=cf2f5266fe8012534e2c0a8c8ac8aafe")
response = conn.getresponse()
print(response.getheaders())
# Undo the Latin-1 decoding done by http.client, then decode the bytes as UTF-8.
print(response.getheader('Content-Disposition').encode('latin1').decode('utf-8'))

send: b'HEAD /download/217234/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%BA%8C%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499423618&-=cf2f5266fe8012534e2c0a8c8ac8aafe HTTP/1.1\r\nHost: file0.assrt.net\r\nAccept-Encoding: identity\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server header: Date header: Content-Type header: Content-Length header: Connection header: Last-Modified header: ETag header: Servant header: Expires header: Cache-Control header: X-Cache header: Content-Disposition header: Accept-Ranges [('Server', 'openresty'), ('Date', 'Fri, 07 Jul 2017 10:36:32 GMT'), ('Content-Type', 'application/octet-stream'), ('Content-Length', '62859'), ('Connection', 'keep-alive'), ('Last-Modified', 'Wed, 23 Mar 2016 13:15:41 GMT'), ('ETag', '"56f296fd-f58b"'), ('Servant', 'Berserker'), ('Expires', 'Mon, 07 Aug 2017 10:36:32 GMT'), ('Cache-Control', 'max-age=2678400'), ('X-Cache', 'HIT'), ('Content-Disposition', 'subtitle; filename="æ¤\x8dç\x89©ç\x8e\x8bå\x9b½ç¬¬äº\x8cé\x9b\x86è\x93\x9då\x85\x89ç\x89\x88ä¸\xadè\x8b±æ\x96\x87å\x8f\x8aå\x8f\x8cè¯\xadå\xad\x97å¹\x95.rar"'), ('Accept-Ranges', 'bytes')]
subtitle; filename="植物王国第二集蓝光版中英文及双语字幕.rar"

Author

hrimfaxi commented Jul 7, 2017

And urllib3:

import urllib3

http = urllib3.PoolManager()
r = http.request("HEAD","http://assrt.net/download/217234/%E6%A4%8D%E7%89%A9%E7%8E%8B%E5%9B%BD%E7%AC%AC%E4%BA%8C%E9%9B%86%E8%93%9D%E5%85%89%E7%89%88%E4%B8%AD%E8%8B%B1%E6%96%87%E5%8F%8A%E5%8F%8C%E8%AF%AD%E5%AD%97%E5%B9%95.rar?_=1499423618&-=cf2f5266fe8012534e2c0a8c8ac8aafe")
print(r.headers)
print(r.headers['Content-Disposition'].encode('iso-8859-1').decode('utf-8'))

HTTPHeaderDict({'Server': 'openresty', 'Date': 'Fri, 07 Jul 2017 10:44:12 GMT', 'Content-Type': 'application/octet-stream', 'Content-Length': '62859', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 23 Mar 2016 13:15:41 GMT', 'ETag': '"56f296fd-f58b"', 'Servant': 'Berserker', 'Expires': 'Mon, 07 Aug 2017 10:44:12 GMT', 'Cache-Control': 'max-age=2678400', 'X-Cache': 'HIT', 'Content-Disposition': 'subtitle; filename="æ¤\x8dç\x89©ç\x8e\x8bå\x9b½ç¬¬äº\x8cé\x9b\x86è\x93\x9då\x85\x89ç\x89\x88ä¸\xadè\x8b±æ\x96\x87å\x8f\x8aå\x8f\x8cè¯\xadå\xad\x97å¹\x95.rar"', 'Accept-Ranges': 'bytes'})
subtitle; filename="植物王国第二集蓝光版中英文及双语字幕.rar"

Author

hrimfaxi commented Jul 7, 2017

Sorry, I screwed this up myself: I was matching a regex against the latin-1 decoded text as if it were ordinary Unicode. Never mind.
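
The thread does not show the regex in question, so the following is only a hypothetical illustration of how this kind of mistake can arise. When UTF-8 bytes are decoded as Latin-1, some of the resulting characters count as whitespace in Python, for example U+0085 (NEL), so a pattern built on `\S` stops partway through the value. The pattern below is an assumption made for the example, not the one used by the reporter.

```python
import re

value = 'subtitle; filename="植物王国第三集蓝光版.rar"'
# What the header looks like after UTF-8 bytes are decoded as Latin-1.
mojibake = value.encode("utf-8").decode("iso-8859-1")

# Hypothetical pattern: capture the filename up to the first whitespace.
match = re.search(r'filename="(\S+)', mojibake)
captured = match.group(1)

# The capture ends early because "\x85" counts as whitespace.
print("\x85".isspace())
print(len(captured), len(mojibake) - len('subtitle; filename="'))
```

Re-encoding to bytes and decoding as UTF-8 before applying any text pattern avoids this class of problem.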

github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021