# Fetching Internet Pages
Two common, convenient ways to fetch pages without using the socket API directly are:

1. using the Python standard library, or
1. installing and using the `requests` module.

#### Note: What is a URL?
Uniform Resource Locator
- scheme: the protocol
- netloc: gives the server's hostname (or IP address) and port number needed to open a TCP connection.
- path: the path within the server. (The path need not be a file that actually exists on the server. A web application may receive it, call a particular function, and generate an HTML page as the result, i.e. a dynamic page.)
- query string: passed to the web application together with the path. The web application treats it as input parameters for its work.
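
As a small sketch of how these components map onto `urllib.parse`, the example below splits a URL into its parts. The URL itself (`example.com`, port 8443, the path, and the query) is a made-up placeholder, not something taken from the text above.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL used only for illustration.
sample_url = 'https://example.com:8443/search/page?lang=ko&q=python#results'

parts = urlsplit(sample_url)
print(parts.scheme)            # protocol
print(parts.netloc)            # host and port exactly as written in the URL
print(parts.hostname)          # host only
print(parts.port)              # port as an int, or None when the URL omits it
print(parts.path)              # path within the server
print(parse_qs(parts.query))   # query string as a dict of lists
print(parts.fragment)          # part after '#', used by the client only
```

When the URL has no explicit port, `parts.port` is `None` and the client falls back to the default port of the scheme (443 for `https`, 80 for `http`).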


In [1]:
from urllib.parse import urlparse, parse_qs, urlencode

url = 'https://search.naver.com/search.naver?sm=top_hty&fbm=1&ie=utf8&query=%EC%BD%94%EB%A1%9C%EB%82%98+%EB%B0%94%EC%9D%B4%EB%9F%AC%EC%8A%A4'
r = urlparse(url)
print(r)
print(r.query)

ParseResult(scheme='https', netloc='search.naver.com', path='/search.naver', params='', query='sm=top_hty&fbm=1&ie=utf8&query=%EC%BD%94%EB%A1%9C%EB%82%98+%EB%B0%94%EC%9D%B4%EB%9F%AC%EC%8A%A4', fragment='')
sm=top_hty&fbm=1&ie=utf8&query=%EC%BD%94%EB%A1%9C%EB%82%98+%EB%B0%94%EC%9D%B4%EB%9F%AC%EC%8A%A4


Decoding the query string and converting it to a Python dict

In [2]:
qs_dict = parse_qs(r.query)   # query string to dict
print(qs_dict)

{'sm': ['top_hty'], 'fbm': ['1'], 'ie': ['utf8'], 'query': ['코로나 바이러스']}


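Encoding for use in a URL. `parse_qs` returns every value as a list, so pass `doseq=True`; without it, the text form of each list (such as `['top_hty']`) is what gets percent-encoded.

In [3]:
urlencode(qs_dict, doseq=True)

'sm=top_hty&fbm=1&ie=utf8&query=%EC%BD%94%EB%A1%9C%EB%82%98+%EB%B0%94%EC%9D%B4%EB%9F%AC%EC%8A%A4'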
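
Parsing and encoding can be combined to change one query parameter and rebuild the URL. The following is a minimal sketch; `replace_query_param` is a hypothetical helper name, and the starting URL is a shortened variant of the one used above.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def replace_query_param(url, key, value):
    """Return url with query parameter `key` set to `value` (hypothetical helper)."""
    parts = urlparse(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    params[key] = [value]
    # doseq=True expands each list into key=value pairs instead of encoding the list itself.
    new_query = urlencode(params, doseq=True)
    return urlunparse(parts._replace(query=new_query))

new_url = replace_query_param(
    'https://search.naver.com/search.naver?ie=utf8&query=python',
    'query', '코로나 바이러스')
print(new_url)
```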

----
## `urllib` standard library

In [4]:
from urllib.request import urlopen, urlretrieve

f = urlopen(url)
print(f)
print(f.status)
print(f.headers['content-type'])

<http.client.HTTPResponse object at 0x000001EB1AA9E2B0>
200
text/html; charset=UTF-8


An HTTP response consists of the following parts:
- status line
- header lines
- content

Status code 200 means the request was answered successfully.
The content is HTML text and the header states its encoding, so the bytes that are read must be decoded from UTF-8 into a Python (Unicode) string. (Web content is usually encoded in **UTF-8**.)
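
The charset does not have to be hard-coded: the headers object returned by `urlopen` is an `email.message.Message` subclass, so it offers `get_content_charset()`. The sketch below uses a hand-built `Message` as a stand-in for `f.headers` so that it runs without a network connection; `decode_body` is a hypothetical helper, and falling back to UTF-8 is an assumption.

```python
from email.message import Message

def decode_body(body, headers, default='utf-8'):
    """Decode bytes using the charset declared in Content-Type (hypothetical helper)."""
    charset = headers.get_content_charset() or default
    return body.decode(charset, errors='replace')

# Stand-in for f.headers from urlopen().
headers = Message()
headers['Content-Type'] = 'text/html; charset=UTF-8'

raw = '<title>코로나 바이러스</title>'.encode('utf-8')
print(headers.get_content_charset())   # charset name, normalized to lower case
print(decode_body(raw, headers))
```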

In [5]:
text = f.read().decode('UTF-8')
print(text[:500])

<!doctype html> <html lang="ko"><head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="코로나 바이러스 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="'코로나 바이러스'의 네이버 통합검색 결과입니다."> <meta name="d


Note: The fetched web content can also be inspected in the Chrome browser.
>    `More tools` >> `Developer tools`

In [6]:
import re

image_srcs = re.findall(r'<img.*?src="(.*?)".*?>', text)
len(image_srcs)

27
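
A regular expression works for a quick count, but it misses single-quoted attributes and upper-case tags, and it leaves relative paths unresolved. As an alternative sketch, the standard library's `html.parser` can collect the same information. `ImgSrcCollector` is a hypothetical class name, and the HTML fragment and base URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper class)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling this.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                # Resolve relative paths against the page URL.
                self.srcs.append(urljoin(self.base_url, src))

# Made-up HTML fragment for illustration.
sample_html = '''
<p><img alt="logo" src='/static/logo.png'>
<IMG SRC="https://example.com/a.gif"><img alt="no source"></p>
'''

collector = ImgSrcCollector('https://example.com/page/index.html')
collector.feed(sample_html)
print(collector.srcs)
```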

### urlretrieve - retrieve URL resource into local file
Looking at the fetched HTML content, a number of images must be downloaded to render this page completely.

The `urlretrieve` function fetches the content of the given URL as-is (without decoding), saves it to a temporary file, and returns the file name and the response headers.

To choose the file to save to, use `urlretrieve(url, file_name)`.

In [7]:
file_name, headers = urlretrieve(image_srcs[0])
print(file_name)
print(headers)

C:\Users\jphong\AppData\Local\Temp\tmpytrk3_wp
accept-ranges: bytes
cache-control: max-age=86400
content-length: 1387
content-type: image/png
expires: Sat, 24 Jun 2023 10:03:25 GMT
last-modified: Fri, 23 Jun 2023 10:03:25 GMT
p3p: CP="ALL CURa ADMa DEVa TAIa OUR BUS IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE LOC OTC"
date: Fri, 23 Jun 2023 10:03:25 GMT
age: 19915
server: Testa/6.1.4
strict-transport-security: max-age=31536000
connection: close
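
When the destination is given explicitly, a natural choice is to name the local file after the last segment of the URL path. The sketch below does this with a hypothetical helper, `download`; the fallback name `download.bin` is an assumption. To stay runnable offline it retrieves a local file through a `file://` URL instead of an image from the page.

```python
import os
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

def download(url, dest_dir):
    """Save url under dest_dir, named after the last path segment (hypothetical helper)."""
    name = os.path.basename(urlparse(url).path) or 'download.bin'
    target = os.path.join(dest_dir, name)
    saved_path, headers = urlretrieve(url, target)
    return saved_path, headers

# Offline demonstration: a small local file served through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source' / 'pixel.png'
    source.parent.mkdir()
    source.write_bytes(b'\x89PNG\r\n\x1a\n')   # just the PNG signature, not a full image
    dest_dir = Path(tmp) / 'downloads'
    dest_dir.mkdir()

    saved_path, headers = download(source.as_uri(), str(dest_dir))
    saved_bytes = Path(saved_path).read_bytes()
    print(os.path.basename(saved_path), saved_bytes)
```

With an `https://` image URL such as `image_srcs[0]`, the same call would write the image into `dest_dir`.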




In [8]:
with open(file_name, 'rb') as f:  # open in binary mode
    print(f.read())

b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1e\x00\x00\x00\x1e\x08\x06\x00\x00\x00;0\xae\xa2\x00\x00\x052IDATH\xc7\xb5\x97}L\x13w\x18\xc7\x9f{\xe9\xb5\xb5\xf4\x8dk\xa1\\\xb1\xbc\xd8\x03J\x85\x16\x84\xc2oG\x84\xaa\x7f(F\xf0\rD\xa6+\x86\xdb\x16\x97\x98,\xd3?N;\xf0mn\xbe%\xc6\x855\x9b\x8b\x893,ls\xcddq\xba8\xb3\xcd\x98\xcd\xb8\x99,8\x93E\x83\x99\x8e8\x99rns\x11\x98 ,W\xae\xee\xa8\xbc\x14\xc5o\xf2\xfcqw\xbf\xe7\xf9\xe4y\xee\xf9=\xf7;\x10\x04ar\xdb\x12\xc4\x84`\xb3V8\xf0\x01+\x84\xc2\xb5B(\xbcY\x08\x85\xdf\x92m\x8b\x10\n\xaf\x12B\xe1la\xfbnmdm\x1c1\'\x16I\xe1\x90\x94\xcd\x82\xb7f\x07 \xfeg@\xfc  ~Xi\x18\xe2\x87\xf1\x91\xeb\x87\x80\xf8\xcb\xe0\xad\xd9\x19\xf1\x91|\x9fH\xd6\xacd\xf0\xac\xd8\x03\x88\xbf\xab\x04%\xfb\x1a\xfa\xcb\n\xea{\x9b\\U\xdd\xef\xb1\x95}G\x9c\x0b\x87\x8f8\x17\x0e\xbd\xcbV\xf6\xbe\xee\xaa\xee\x96\x9e%\xf9\x1a\xba\xc1\xb3b_$F\xdc"5\x18\xd8\xdc\xe5P\xb2\xaeC\tL+\x0e\xf4o\xcd\xa9\xfa\xeb\xa2\xc3\xffO\x0f\xc3=\x14\x19nx,\xeba\xb8\xc1\x1f\x1d\xfe\xfb\xcd\xae*qf\xd1\x9a\x1

----
## `requests` library
The `requests` module, which is similar to `urlopen`, is also widely used. It is not part of the standard library, so it must be installed before use. (It is already included in the Anaconda distribution.)

From the command line:
- `$ pip install requests`

Reference: Requests: HTTP for Humans https://requests.readthedocs.io/en/master/

In [9]:
import requests

# send GET method and receive response message
response = requests.get('https://search.naver.com/search.naver',
                       params={'query': '코로나 바이러스'})    
# response is a class instance
print(response)
print(response.status_code)
print(bool(response))         # True when status_code is below 400 (e.g. 200, 204, 304)

<Response [200]>
200
True


In [10]:
# response.content            # body of response message (bytes)
print(response.encoding)      # guess the encoding based on response's headers
print(response.text[:200])    # decoded text

UTF-8
<!doctype html> <html lang="ko"><head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" c


In [11]:
print(response.headers)    # headers of response message

{'Date': 'Fri, 23 Jun 2023 15:35:20 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': 'page_uid=i5TJXsprvN8ssZ37VzlssssssEo-295559; path=/; domain=.naver.com, _naver_usersession_=uy6p0BdQeWz2zWSm1T3Lwg==; path=/; expires=Fri, 23-Jun-23 15:40:20 GMT; domain=.naver.com, nx_ssl=2; Domain=.naver.com; Path=/; Expires=Sun, 23-Jul-2023 15:35:20 GMT;', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; report=/p/er/post/xss', 'Cache-Control': 'no-cache, no-store, must-revalidate, max-age=0', 'Pragma': 'no-cache', 'Referrer-Policy': 'unsafe-url', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Server': 'nxg', 'Accept-CH': 'Sec-CH-UA, Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Mobile, Sec-CH-UA-Model, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version, Sec-CH-UA-WoW64'}


In [12]:
response.request.headers

{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
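
The request that `requests` would send can also be built and inspected without sending it. The sketch below prepares a GET request with the same `params` as above and a custom `User-Agent`; the value `my-crawler/0.1` is a made-up example.

```python
import requests
from urllib.parse import urlparse, parse_qs

# Build the request without sending it, so nothing goes over the network.
req = requests.Request(
    'GET',
    'https://search.naver.com/search.naver',
    params={'query': '코로나 바이러스'},
    headers={'User-Agent': 'my-crawler/0.1'},   # hypothetical User-Agent value
)
prepared = req.prepare()

print(prepared.method)
print(prepared.url)                     # params are percent-encoded into the URL
print(prepared.headers['User-Agent'])
```

A prepared request can then be sent with `requests.Session().send(prepared, timeout=10)`. For everyday use, passing `headers=` and `timeout=` directly to `requests.get` is simpler.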

### Handling JSON response

In [13]:
import requests

# Search GitHub's repositories for requests
response = requests.get(
    'https://api.github.com/search/repositories',
    params={'q': 'requests+language:python'},
)

print(response.headers['content-type'])
print(response.text[:200])

application/json; charset=utf-8
{"total_count":20989,"incomplete_results":false,"items":[{"id":4290214,"node_id":"MDEwOlJlcG9zaXRvcnk0MjkwMjE0","name":"grequests","full_name":"spyoungtech/grequests","private":false,"owner":{"login":


In [14]:
# Inspect some attributes of the first repository in the search results
json_response = response.json()     # convert JSON content to a Python dict
repository = json_response['items'][0]
print(f'Repository name: {repository["name"]}')  # Python 3.6+
print(f'Repository description: {repository["description"]}')  # Python 3.6+

Repository name: grequests
Repository description: Requests + Gevent = <3
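
`response.json()` parses the body text as JSON, much as `json.loads(response.text)` would. The sketch below shows the resulting Python types on a made-up sample shaped like the search response above. The repository names and descriptions are placeholders, not real results.

```python
import json

# Made-up sample in the same shape as the search response (values are placeholders).
sample_text = '''
{"total_count": 2, "incomplete_results": false,
 "items": [{"name": "repo-a", "description": "first sample"},
           {"name": "repo-b", "description": null}]}
'''

data = json.loads(sample_text)   # object -> dict, array -> list, null -> None
names = [repo['name'] for repo in data['items']]

for repo in data.get('items', []):
    # A description can be null in JSON, which arrives as None in Python.
    description = repo.get('description') or '(no description)'
    print(f'{repo["name"]}: {description}')
```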


### HTTP POST method
Note: https://httpbin.org is a site that the author of the requests module built for testing. For the various HTTP methods, it replies with the request message it received, as JSON or in other formats.

In [15]:
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(r.headers)
if r:
    print(r.text)
    if 'application/json' in r.headers['content-type']:
        json_r = r.json()
        print(json_r)

{'Date': 'Fri, 23 Jun 2023 15:35:56 GMT', 'Content-Type': 'application/json', 'Content-Length': '479', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key": "value"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "9", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.21.0", 
    "X-Amzn-Trace-Id": "Root=1-6495bbba-68d129bd5a1c562641221957"
  }, 
  "json": null, 
  "origin": "118.34.152.67", 
  "url": "https://httpbin.org/post"
}

{'args': {}, 'data': '', 'files': {}, 'form': {'key': 'value'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '9', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0', 'X-Amzn-Trac

In [16]:
print(r.request.url)
print(r.request.headers)
print(r.request.body)

https://httpbin.org/post
{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '9', 'Content-Type': 'application/x-www-form-urlencoded'}
key=value
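
The body `key=value` and the `Content-Length` of 9 follow from form encoding the `data` dict. As a rough standard-library sketch, the same body can be produced with `urlencode` and compared with a JSON-encoded body. The `Request` objects are only constructed here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

payload = {'key': 'value'}

# Form encoding: what data={'key': 'value'} turns into.
form_body = urlencode(payload).encode('ascii')
# JSON encoding: an alternative body format for APIs that expect JSON.
json_body = json.dumps(payload).encode('utf-8')

form_request = Request(
    'https://httpbin.org/post', data=form_body, method='POST',
    headers={'Content-Type': 'application/x-www-form-urlencoded'})
json_request = Request(
    'https://httpbin.org/post', data=json_body, method='POST',
    headers={'Content-Type': 'application/json'})

print(form_body, len(form_body))
print(json_body)
print(form_request.get_method(), json_request.get_method())
```

With `requests`, the JSON variant is written as `requests.post(url, json=payload)`, in which case httpbin would report the payload under `"json"` rather than `"form"`.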
