# Fetching Internet Pages

## What is a URL?
Uniform Resource Locator

In [1]:
from urllib.parse import urlparse, parse_qs, urlencode

url = "http://mclab.hufs.ac.kr/mediawiki/index.php?title=Lectures/IA/2018&action=edit&section=25"
r = urlparse(url)
print(r)
print(r.netloc, r.path, r.query)
print(r.hostname, r.port)

ParseResult(scheme='http', netloc='mclab.hufs.ac.kr', path='/mediawiki/index.php', params='', query='title=Lectures/IA/2018&action=edit&section=25', fragment='')
mclab.hufs.ac.kr /mediawiki/index.php title=Lectures/IA/2018&action=edit&section=25
mclab.hufs.ac.kr None


- netloc: gives the server's hostname (or IP address) and port number needed to open the TCP connection.
- path: the path within the server. (The path does not have to be a file that actually exists on the server. A web application may receive it, call a specific function, and generate an HTML page as the result: a dynamic page.)
- query string: passed to the web application together with the path. The web application treats it as input parameters for the work it performs.
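
When the URL carries no explicit port, as in the example above, `r.port` is `None` and a client falls back to the default port of the scheme (80 for `http`, 443 for `https`). A minimal sketch of turning a URL into the `(host, port)` pair needed for a TCP connection follows; the helper name `host_and_port` and the `example.com` URL are hypothetical, used only for illustration.

```python
from urllib.parse import urlparse

# Well-known default ports for the two common web schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}

def host_and_port(url):
    """Return (hostname, port), using the scheme default when no port is given."""
    r = urlparse(url)
    port = r.port if r.port is not None else DEFAULT_PORTS.get(r.scheme)
    return r.hostname, port

print(host_and_port("http://mclab.hufs.ac.kr/mediawiki/index.php"))
print(host_and_port("https://example.com:8443/app?x=1"))  # hypothetical URL
```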


Parsing and decoding the query string

In [2]:
url1 = "https://www.google.co.kr/search?q=%ED%8C%90%EB%AC%B8%EC%A0%90+%EC%84%A0%EC%96%B8&rlz=1C1CHZL_koKR684KR684&oq=%ED%8C%90%EB%AC%B8%EC%A0%90+%EC%84%A0%EC%96%B8&aqs=chrome..69i57j0l2.7892j1j8&sourceid=chrome&ie=UTF-8"
qs = urlparse(url1).query   # fetch query string
print(qs)
parsed_qs = parse_qs(qs)    # parse and decode query string
print(parsed_qs)

q=%ED%8C%90%EB%AC%B8%EC%A0%90+%EC%84%A0%EC%96%B8&rlz=1C1CHZL_koKR684KR684&oq=%ED%8C%90%EB%AC%B8%EC%A0%90+%EC%84%A0%EC%96%B8&aqs=chrome..69i57j0l2.7892j1j8&sourceid=chrome&ie=UTF-8
{'q': ['판문점 선언'], 'rlz': ['1C1CHZL_koKR684KR684'], 'oq': ['판문점 선언'], 'aqs': ['chrome..69i57j0l2.7892j1j8'], 'sourceid': ['chrome'], 'ie': ['UTF-8']}


Encoding to match URL syntax

In [3]:
urlencode(parsed_qs)

'q=%5B%27%ED%8C%90%EB%AC%B8%EC%A0%90+%EC%84%A0%EC%96%B8%27%5D&rlz=%5B%271C1CHZL_koKR684KR684%27%5D&oq=%5B%27%ED%8C%90%EB%AC%B8%EC%A0%90+%EC%84%A0%EC%96%B8%27%5D&aqs=%5B%27chrome..69i57j0l2.7892j1j8%27%5D&sourceid=%5B%27chrome%27%5D&ie=%5B%27UTF-8%27%5D'
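
`parse_qs` returns every value as a list, so passing its result straight to `urlencode` percent-encodes the text form of each list. That is why `%5B%27` (`['`) and `%27%5D` (`']`) appear in the output above. One way around this, sketched here on a shortened query string, is the `doseq=True` option of `urlencode`, which encodes each list element as its own `key=value` pair.

```python
from urllib.parse import parse_qs, urlencode

# Shortened version of the query string used above.
qs = "q=%ED%8C%90%EB%AC%B8%EC%A0%90+%EC%84%A0%EC%96%B8&ie=UTF-8"

parsed = parse_qs(qs)                     # values are lists of decoded strings
encoded = urlencode(parsed, doseq=True)   # encode each list element separately

print(encoded)
print(encoded == qs)   # the round trip reproduces the original query string
```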

## Fetching Web Contents via URL
### urlopen - open Internet contents with URL
Web content is usually encoded in **UTF-8**. In Python it arrives as a `bytes` object, which must be decoded into a Python string, that is, **Unicode** text.
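
The bytes-to-string step can be tried without a network connection. A minimal sketch, using a short Korean string as a stand-in for a downloaded page body:

```python
text = "한국외국어대학교"

raw = text.encode("utf-8")       # what travels over the network: bytes
print(type(raw), len(raw))       # more bytes than characters

decoded = raw.decode("utf-8")    # back to a Python str
print(decoded == text)
```

Each Hangul syllable takes three bytes in UTF-8, so the byte count is three times the character count here.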

In [4]:
from urllib.request import urlopen, urlretrieve

url = "http://mclab.hufs.ac.kr/test/index.html"
with urlopen(url) as f:
    content = f.read().decode('utf-8')
print(content)

<html>
<head>
<title>Test Page</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>

<body>
<h1>Information and Communications Engineering</h1>
<p><img src="http://ice.hufs.ac.kr/hufs-image01.jpg" border="0"></p>
<p>Welcome to Dept. of Information and Communications Engineering</p>
<p></p>
<p>한국외국어대학교 정보통신공학과</p>

<h2>Blue Sky</h2>
<h3>SKY 2</h3>
<p><img src="s3test2.gif" border="0"></p>

<h3>SKY 3</h3>
<p><img src="s3test3.jpg" border="0"></p>
<h3>SKY 4</h3>
<p><img src="s3test4.jpg"  height="100" width="100" border="0"></p>
<h3>SKY 5</h3>
<p><img src="s3test5.jpg" height="100" width="100"></p>

<h3>TCP/IP Protocol Suits</h3>
<p><img src="tcp_ip.png" height="300" width="500"></p>
<h3>HTTP Protocol</h3>
<p>HTTP Request/Response Messages</p>
<p><img src="HTTP_RequestResponseMessages.png"></p>
<h3>Web Server Architecture</h3>
<p>Single Threaded Web Server</p>
<p><img src="single_threaded_web_server.gif" height="400" width="500"></p>
<p>Thread Pool Web Serve

Note: you can inspect the fetched web content in the Chrome browser.
>    `More tools` >> `Developer tools`

### urlretrieve - retrieve URL resource into local file
The fetched HTML content shows that several images must be downloaded to complete this page. Let us fetch the relative URL "s3test4.jpg" and save it to a local file of the same name.
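
A relative URL has to be resolved against the URL of the page it came from before it can be fetched. The sketch below is one possible approach using only the standard library: it collects the `src` attributes of `<img>` tags with `html.parser` and resolves them with `urljoin`. The class name `ImgSrcCollector` is hypothetical, and the HTML is a two-line excerpt of the page shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

base_url = "http://mclab.hufs.ac.kr/test/index.html"
html = (
    '<p><img src="http://ice.hufs.ac.kr/hufs-image01.jpg" border="0"></p>\n'
    '<p><img src="s3test4.jpg"  height="100" width="100" border="0"></p>\n'
)

collector = ImgSrcCollector()
collector.feed(html)

# Absolute URLs are left unchanged; relative ones are resolved against base_url.
absolute_urls = [urljoin(base_url, src) for src in collector.sources]
for u in absolute_urls:
    print(u)
```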

In [5]:
url = "http://mclab.hufs.ac.kr/test/s3test4.jpg"
file, headers = urlretrieve(url, "s3test4.jpg")
print(file)
print(headers)

s3test4.jpg
Date: Mon, 29 Apr 2019 14:04:25 GMT
Server: Apache/2.2.22 (Ubuntu)
Last-Modified: Thu, 03 Oct 2013 05:45:45 GMT
ETag: "1ea145-39f41-4e7cfb278356b"
Accept-Ranges: bytes
Content-Length: 237377
Connection: close
Content-Type: image/jpeg
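
The `headers` object returned by `urlretrieve` is an `http.client.HTTPMessage`, a subclass of `email.message.Message`, so individual fields can be looked up by name. The offline sketch below rebuilds a few of the headers printed above as a plain `Message`; this stand-in is an assumption made so that the example runs without a network connection.

```python
from email import message_from_string

# Header text copied from the output above; a plain Message stands in
# for the HTTPMessage that urlretrieve actually returns.
raw_headers = (
    "Content-Length: 237377\n"
    "Connection: close\n"
    "Content-Type: image/jpeg\n"
    "\n"
)
headers = message_from_string(raw_headers)

size = int(headers["content-length"])   # header lookup is case-insensitive
print(size)
print(headers.get_content_type())
```

In the notebook itself, comparing `os.path.getsize(file)` with `int(headers["Content-Length"])` would be one quick way to check that the download is complete.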




In [6]:
with open(file, 'rb') as f:  # open in binary mode
    print(f.read(237377))

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00\xff\xed\x08\x1ePhotoshop 3.0\x008BIM\x03\xed\nResolution\x00\x00\x00\x00\x10\x00H\x00\x00\x00\x01\x00\x02\x00H\x00\x00\x00\x01\x00\x028BIM\x04\r\x18FX Global Lighting Angle\x00\x00\x00\x00\x04\x00\x00\x00\x1e8BIM\x04\x19\x12FX Global Altitude\x00\x00\x00\x00\x04\x00\x00\x00\x1e8BIM\x03\xf3\x0bPrint Flags\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x00\x01\x008BIM\x04\n\x0eCopyright Flag\x00\x00\x00\x00\x01\x00\x008BIM\'\x10\x14Japanese Print Flags\x00\x00\x00\x00\n\x00\x01\x00\x00\x00\x00\x00\x00\x00\x028BIM\x03\xf5\x17Color Halftone Settings\x00\x00\x00H\x00/ff\x00\x01\x00lff\x00\x06\x00\x00\x00\x00\x00\x01\x00/ff\x00\x01\x00\xa1\x99\x9a\x00\x06\x00\x00\x00\x00\x00\x01\x002\x00\x00\x00\x01\x00Z\x00\x00\x00\x06\x00\x00\x00\x00\x00\x01\x005\x00\x00\x00\x01\x00-\x00\x00\x00\x06\x00\x00\x00\x00\x00\x018BIM\x03\xf8\x17Color Transfer Settings\x00\x00\x00p\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xf
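
Printing the whole file is rarely informative; the first few bytes are usually enough to confirm the file type. A JPEG file begins with the two-byte start-of-image marker `FF D8`, which matches the start of the output above. A minimal sketch of such a check follows; the helper name `looks_like_jpeg` is hypothetical.

```python
# First bytes of the file, copied from the output above.
head = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

def looks_like_jpeg(data):
    """Return True if data starts with the JPEG start-of-image marker (FF D8)."""
    return data[:2] == b"\xff\xd8"

print(looks_like_jpeg(head))
print(looks_like_jpeg(b"GIF89a"))   # a GIF signature, for contrast
```

With the downloaded file, the same check could be run on `f.read(2)` after opening it in binary mode.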

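## Note: the `requests` module
The `requests` module, which is similar to `urlopen`, is also widely used. It is not part of the standard library, so it must be installed before use. (It is already included in the Anaconda distribution.)

From the command line: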
- `$ pip install requests`

In [7]:
import requests

response = requests.get("https://www.naver.com")    # send a GET request and receive the response message
# response is an instance of the requests.Response class
print(response.content.decode('utf-8')[:400])   # body of the response message
print(response.headers)    # headers of the response message

<!doctype html>
















<html lang="ko">
<head>
<meta charset="utf-8">
<meta name="Referrer" content="origin">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=1100">
<meta name="apple-mobile-web-app-title" content="NAVER" />

{'Server': 'NWS', 'Date': 'Mon, 29 Apr 2019 14:03:54 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': 'PM_CK_loc=53db4f7ef4cb126fbbfcba034d0edde5f670516cc637f653d659fa2b6c903cb7; Expires=Tue, 30 Apr 2019 14:03:54 GMT; Path=/; HttpOnly', 'Cache-Control': 'no-cache, no-store, must-revalidate', 'Pragma': 'no-cache', 'P3P': 'CP="CAO DSP CURa ADMa TAIa PSAa OUR LAW STP PHY ONL UNI PUR FIN COM NAV INT DEM STA PRE"', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'Content-Encoding': 'gzip', 'Strict-Transport-

In [8]:
response.status_code

200

In [9]:
response.headers['content-type']

'text/html; charset=UTF-8'