# Basic concepts of a web crawler

## Send a request to the server and receive response data
<img src="images/http.png"/>
- The client requests the starting URL
- The web server returns response data for that URL

## Process response data
<img src="images/html-parser-json.png"/>
- Data is received from the web server
- Choose an off-the-shelf library, or write your own code, to parse the received data
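
A minimal sketch of the parsing step, assuming the response body has already been received as text. It is written for Python 3 and uses only the standard library (`json` and `html.parser`); the sample bodies and the `LinkCollector` class are made up for illustration.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Made-up response bodies standing in for data received from a web server.
json_body = '{"name": "crawler", "pages": 3}'
html_body = '<p><a href="/a.html">A</a> and <a href="/b.html">B</a></p>'

data = json.loads(json_body)   # JSON text -> Python dict
collector = LinkCollector()
collector.feed(html_body)      # HTML text -> list of links

print(data["pages"])
print(collector.links)
```

Libraries such as BeautifulSoup do the HTML part with far less code, which is why a ready-made parser is usually the first choice.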


---

---

---

# Request and Response

## HTTP Request Method

### HTTP 1.1: Method definitions

> OPTIONS

> GET

> HEAD

> POST

> PUT

> DELETE

> TRACE

> CONNECT

Reference: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html


### Two HTTP Request Methods: GET vs. POST
> Think of them as a postcard (GET) and an envelope (POST)

### GET: Requests data from a specified resource.

> GET requests can be cached

> GET requests remain in the browser history

> GET requests can be bookmarked

> GET requests should never be used when dealing with sensitive data
>> Bad example: http://www.justfortest.com/login.html?username=HaHaHa&password=UCCU (the password is visible in the URL)

> GET requests have length restrictions (browsers and servers limit the URL length)

> GET requests should be used only to retrieve data

### POST: Submits data to be processed to a specified resource.

> POST requests are never cached

> POST requests do not remain in the browser history

> POST requests cannot be bookmarked

> POST requests have no restrictions on data length
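
A small sketch of where the data travels in each method, assuming Python 3 and only the standard library. Nothing is sent over the network: the two `Request` objects are only built and inspected. The parameter names are made up.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "python", "page": "2"}   # made-up example parameters

# GET: parameters are appended to the URL as a query string.
get_request = Request("http://httpbin.org/get?" + urlencode(params))

# POST: parameters are encoded into the request body instead.
post_request = Request("http://httpbin.org/post",
                       data=urlencode(params).encode("ascii"))

print(get_request.get_method())
print(get_request.full_url)
print(get_request.data)

print(post_request.get_method())
print(post_request.full_url)
print(post_request.data)
```

The GET parameters end up in the URL, where history, bookmarks, and logs can see them; the POST parameters stay out of the URL.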

## Request - Response: common response formats

> HTML

> XML

> JSON

> ...

---

---

---

---

---

---

# Python libraries for web crawling

## Requests: HTTP for Humans 
- Requests is an Apache 2.0 licensed HTTP library
- Powered by urllib3, which is bundled inside the Requests version used here (newer releases install it as a separate dependency).
- Document: http://docs.python-requests.org/en/latest/

## Install the library with the command below:

In [1]:
!pip install requests



## Import Requests and List Its Members

In [2]:
import requests

dir(requests)

['ConnectionError',
 'HTTPError',
 'NullHandler',
 'PreparedRequest',
 'Request',
 'RequestException',
 'Response',
 'Session',
 'Timeout',
 'TooManyRedirects',
 'URLRequired',
 '__author__',
 '__build__',
 '__builtins__',
 '__copyright__',
 '__doc__',
 '__file__',
 '__license__',
 '__name__',
 '__package__',
 '__path__',
 '__title__',
 '__version__',
 'adapters',
 'api',
 'auth',
 'certs',
 'codes',
 'compat',
 'cookies',
 'delete',
 'exceptions',
 'get',
 'head',
 'hooks',
 'logging',
 'models',
 'options',
 'packages',
 'patch',
 'post',
 'put',
 'request',
 'session',
 'sessions',
 'status_codes',
 'structures',
 'utils']

## Available HTTP Request Methods

In [3]:
response = requests.options("http://httpbin.org/get")
response = requests.get("http://httpbin.org/get")
response = requests.head("http://httpbin.org/get")
response = requests.post("http://httpbin.org/post")
response = requests.put("http://httpbin.org/put")
response = requests.delete("http://httpbin.org/delete")

---

## Make A Request

- make an HTTP GET request

In [4]:
response = requests.get('https://api.github.com/events')
print(response)

<Response [200]>


- make an HTTP POST request

In [5]:
response = requests.post("http://httpbin.org/post", data = {"key":"value"})
print(response)

<Response [200]>


---

## Passing Parameters In URLs

### Passing parameters: the basic case

In [6]:
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)

http://httpbin.org/get?key2=value2&key1=value1


### Any dictionary key whose value is None will not be added to the URL’s query string.

In [7]:
payload = {'key1': 'value1', 'key2': None}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)

http://httpbin.org/get?key1=value1


### Pass a list of items as a value

In [8]:
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)

http://httpbin.org/get?key2=value2&key2=value3&key1=value1
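
The resulting URL can also be checked without sending anything. The sketch below builds a `requests.Request` and calls `prepare()` on it, then reads the URL of the prepared request. The extra `key3` entry is added here for illustration only.

```python
import requests

# key3 is None, so it should be left out of the query string.
payload = {'key1': 'value1', 'key2': ['value2', 'value3'], 'key3': None}

# Build and prepare the request without sending it.
request = requests.Request('GET', 'http://httpbin.org/get', params=payload)
prepared = request.prepare()

print(prepared.url)
```

On recent Python versions the parameters appear in dictionary insertion order; the recorded outputs above come from an older interpreter, where the order can differ.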


### Custom Headers

In [9]:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
response = requests.get(url, headers=headers)
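
A crawler usually sends the same headers on every request. One way to do that, sketched below, is a `requests.Session` with default headers. Nothing is sent here: `prepare_request()` only merges the session defaults into the request so the result can be inspected. The `accept` header is an assumption added for the example.

```python
import requests

session = requests.Session()
# Default header applied to every request made through this session.
session.headers.update({'user-agent': 'my-app/0.0.1'})

# Per-request headers are merged with (and override) the session defaults.
request = requests.Request('GET', 'https://api.github.com/some/endpoint',
                           headers={'accept': 'application/json'})
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])
print(prepared.headers['Accept'])
```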

### Passing form parameters in a POST request

In [10]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.8.1"
  }, 
  "json": null, 
  "origin": "112.104.13.37", 
  "url": "http://httpbin.org/post"
}
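
The `Content-Type` and `Content-Length` headers in the output above follow from how the body is encoded. The sketch below prepares the same payload twice without sending it: once with `data=` (form-encoded) and once with `json=` (JSON-encoded), so the two encodings can be compared.

```python
import json
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
url = 'http://httpbin.org/post'

# Prepare both requests without sending them.
form = requests.Request('POST', url, data=payload).prepare()
as_json = requests.Request('POST', url, json=payload).prepare()

# data= produces a form-encoded body.
print(form.headers['Content-Type'])
print(form.headers['Content-Length'])
print(form.body)

# json= produces a JSON body with a matching Content-Type.
print(as_json.headers['Content-Type'])
print(json.loads(as_json.body))
```

Use `data=` for HTML-form style endpoints and `json=` when the server expects a JSON document.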



### POST a Multipart-Encoded File

In [11]:
url = 'http://httpbin.org/post'
files = {'filename': open('doc/test.txt', 'rb')}

r = requests.post(url, files=files)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {
    "filename": "Hello!\nHour of code.\n\n"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "170", 
    "Content-Type": "multipart/form-data; boundary=002255b261ac467194893a59f8b809d9", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.8.1"
  }, 
  "json": null, 
  "origin": "112.104.13.37", 
  "url": "http://httpbin.org/post"
}



- You can also set the filename, content type, and headers explicitly

In [12]:
url = 'http://httpbin.org/post'
files = {'filename': ('report.xls', open('doc/test.txt', 'rb'), 'text/plain', {'Expires': '0'})}

r = requests.post(url, files=files)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {
    "filename": "Hello!\nHour of code.\n\n"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "210", 
    "Content-Type": "multipart/form-data; boundary=90743a788911411baf056f0e8462f469", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.8.1"
  }, 
  "json": null, 
  "origin": "112.104.13.37", 
  "url": "http://httpbin.org/post"
}



- Send strings to be received as files

In [13]:
url = 'http://httpbin.org/post'
files = {'filename': ('test.doc', 'some,data,to,send\nanother,row,to,send\n')}

r = requests.post(url, files=files)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {
    "filename": "some,data,to,send\nanother,row,to,send\n"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "186", 
    "Content-Type": "multipart/form-data; boundary=5869e7745ad34474850bfb044c0cbc8a", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.8.1"
  }, 
  "json": null, 
  "origin": "112.104.13.37", 
  "url": "http://httpbin.org/post"
}



### Send your own cookies to the server

In [14]:
url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')

response = requests.get(url, cookies=cookies)
print(response.text)

{
  "cookies": {
    "cookies_are": "working"
  }
}



---

## Response Attributes and Methods

In [15]:
response = requests.get('https://api.github.com/events')
dir(response)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__getstate__',
 '__hash__',
 '__init__',
 '__iter__',
 '__module__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

### Content of the server’s response

In [16]:
response.text
type(response.text)

unicode

### Binary response content

In [17]:
response.content
type(response.content)

str

### JSON Response Content

In [18]:
try:
    content  = response.json()
except ValueError as e:
    print(e)  # No JSON object could be decoded
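
The same try/except pattern can be wrapped in a small helper. The sketch below uses the standard library's `json.loads` on made-up response bodies, so it runs without a network connection; `parse_json_or_none` is a hypothetical name, not part of Requests.

```python
import json


def parse_json_or_none(text):
    """Hypothetical helper: return decoded JSON, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # invalid JSON is reported as a ValueError (or a subclass)
        return None


# Made-up bodies: one valid JSON document and one HTML error page.
events = parse_json_or_none('[{"type": "PushEvent"}, {"type": "WatchEvent"}]')
not_json = parse_json_or_none('<html>Not JSON</html>')

print(events[0]["type"])
print(not_json)
```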

### Response Status Codes

In [19]:
response = requests.get('http://httpbin.org/get')
print('response.status_code: {}'.format(response.status_code))

# Requests also comes with a built-in status code lookup object for easy reference:
print(response.status_code == requests.codes.ok)

response.status_code: 200
True


In [20]:
bad_response = requests.get('http://httpbin.org/status/404')
print('bad_response.status_code: {}'.format(bad_response.status_code))

# If the request failed (a 4XX client error or 5XX server error response), Response.raise_for_status() raises an HTTPError:
bad_response.raise_for_status()

bad_response.status_code: 404


HTTPError: 404 Client Error: NOT FOUND for url: http://httpbin.org/status/404

In [None]:
# Since the status code of `response` was 200, raise_for_status() raises nothing and returns None:
print('response.status_code: {}'.format(response.status_code))
print(response.raise_for_status())
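
In a crawler, the exception is usually caught rather than left to stop the program. The sketch below shows that pattern on a hand-built `requests.Response` standing in for a real 404 reply, so no network access is needed; building a response by hand like this is only for illustration.

```python
import requests

# A hand-built Response standing in for a real 404 reply.
fake_response = requests.Response()
fake_response.status_code = 404
fake_response.url = 'http://httpbin.org/status/404'

try:
    fake_response.raise_for_status()
    outcome = 'ok'
except requests.exceptions.HTTPError as error:
    # The failing response is attached to the exception.
    outcome = 'failed with status {}'.format(error.response.status_code)

print(outcome)
```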

### Response Headers

In [None]:
print(response.headers)
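
Response headers are stored in a case-insensitive mapping, so a header can be looked up with any capitalization. The sketch below demonstrates this with `requests.structures.CaseInsensitiveDict` and made-up header values.

```python
from requests.structures import CaseInsensitiveDict

# Made-up header values; response.headers is the same kind of mapping.
headers = CaseInsensitiveDict({'Content-Type': 'application/json',
                               'Server': 'nginx'})

print(headers['content-type'])      # lookup ignores case
print(headers.get('CONTENT-TYPE'))
print(headers.get('x-missing'))     # .get() returns None for absent headers
```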

---

---

---

---

---

---

## BeautifulSoup4:  sits atop an HTML or XML parser

- A Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

- It also supports a number of third-party Python parsers, such as lxml.
- Document: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Install the libraries with the command below:

In [None]:
!pip install beautifulsoup4 lxml

## Import BeautifulSoup4 and List Its Members

In [None]:
from bs4 import BeautifulSoup

dir(BeautifulSoup)

## Typical Usage

- Python's html.parser
#### soup = BeautifulSoup(html_markup, 'html.parser')

- lxml's HTML parser
#### soup = BeautifulSoup(html_markup, "lxml")

- lxml's XML parser
#### soup = BeautifulSoup(xml_markup, "lxml-xml")
#### soup = BeautifulSoup(xml_markup, "xml")

## Quick Start

In [21]:
#Here’s an HTML document I’ll be using as an example throughout this document.
#It’s part of a story from Alice in Wonderland:

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

### Import and Get BeautifulSoup object

In [22]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_markup, 'lxml')

In [23]:
type(soup)

bs4.BeautifulSoup

In [24]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


### Select a tag by name - title

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [25]:
soup.title

<title>The Dormouse's story</title>

In [26]:
soup.title.name

'title'

In [27]:
soup.title.string

u"The Dormouse's story"

---

### Select a tag by name - head

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [28]:
soup.head

<head><title>The Dormouse's story</title></head>

In [29]:
soup.head.name

'head'

In [30]:
soup.head.string

u"The Dormouse's story"

---

### Find the parent of a tag

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [31]:
soup.title.parent.name

'head'

---

### Find the 'p' tag's attribute - class

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [32]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [33]:
soup.p['class']
# returns a list

['title']
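
Other attributes are read the same way. The sketch below uses a shortened, made-up version of the example markup and the built-in `html.parser`, and also shows `get()`, which returns `None` for a missing attribute instead of raising `KeyError`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example document.
html_markup = ('<p class="title story">'
               '<a href="http://example.com/elsie" id="link1">Elsie</a></p>')
soup = BeautifulSoup(html_markup, 'html.parser')

link = soup.a
print(soup.p['class'])      # multi-valued attribute -> list
print(link['href'])         # single-valued attribute -> string
print(link.get('target'))   # missing attribute -> None
print(link.attrs)           # all attributes as a dict
```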

---

### Find all tags with a given name

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [39]:
soup.find_all('a')
# returns a list

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
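
`find_all()` accepts further filters. The sketch below, on a shortened copy of the example markup parsed with `html.parser`, filters by CSS class, limits the number of results, and shows the equivalent CSS selector query with `select()`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document.
html_markup = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_markup, 'html.parser')

by_class = soup.find_all('a', class_='sister')   # filter on the class attribute
first_two = soup.find_all('a', limit=2)          # stop after two matches
by_css = soup.select('p.story > a')              # CSS selector

print([a.get_text() for a in by_class])
print([a['id'] for a in first_two])
print([a['href'] for a in by_css])
```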

---

### Find a tag by attribute value

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [35]:
target = soup.find(id="link3")


In [38]:
target.name

'a'

In [37]:
target.string

u'Tillie'
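
`find()` returns `None` when nothing matches, so real pages need a guard before reading `.name` or `.string`. The sketch below shows that guard on a shortened copy of the example markup; the id `link9` is made up and does not exist in it. It also shows that attribute filters accept a compiled regular expression.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the example document.
html_markup = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_markup, 'html.parser')

missing = soup.find(id="link9")   # no such id -> None
name = missing.string if missing is not None else None

# Match every tag whose id looks like "link" followed by digits.
link_ids = [tag['id'] for tag in soup.find_all(id=re.compile(r'^link\d+$'))]

print(name)
print(link_ids)
```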

---

---

---

---

---

---

## DEMO: Get all hyperlinks on the PyCon Taiwan front page
<img src='images/pycon.png'/>

In [41]:
from bs4 import BeautifulSoup
import lxml
import requests

response = requests.get("https://tw.pycon.org/2016")
soup = BeautifulSoup(response.text, 'lxml')

In [44]:
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="" name="description"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="/2016/static/images/favicon_pyconTW_32x32.png" rel="icon" type="image/png"/>
  <link href="/2016/static/CACHE/css/6a4f8e910730.css" rel="stylesheet" type="text/css"/>
  <link href="https://fonts.googleapis.com/css?family=Roboto:300,500" rel="stylesheet" type="text/css"/>
  <link href="https://fonts.googleapis.com/earlyaccess/cwtexhei.css" rel="stylesheet" type="text/css"/>
  <link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css" rel="stylesheet"/>
  <!--

In [56]:
for link in soup.find_all('a'):
    print(link.get('href'))

/2016/
https://www.facebook.com/pycontw
https://twitter.com/pycontw
https://docs.google.com/forms/d/1JoHhAj6NeXg98OFAAvAHQ6Dcrh9Dxp2PttHKwxNbiFs/viewform
http://www.meetup.com/Taipei-py/
http://www.meetup.com/pythonhug/
http://www.meetup.com/PyLadiesTW/
http://www.meetup.com/Tainan-py-Python-Tainan-User-Group/
http://www.meetup.com/Kaohsiung-Python-Meetup
http://www.meetup.com/Taichung-Python-Meetup/
http://djangogirls.org/taipei
http://www.meetup.com/Hualien-Py/
mailto:sponsorship@pycon.tw
https://tw.pycon.org/2015apac/en/sponsors/
http://aktsk.com.tw/
http://ocf.tw
http://www.wolftea.com
http://www.gliacloud.com
https://www.facebook.com/pycontw
https://twitter.com/pycontw


In [57]:
for link in soup.find_all('a', target="_blank"):
    print(link.get('href'))

https://www.facebook.com/pycontw
https://twitter.com/pycontw
https://docs.google.com/forms/d/1JoHhAj6NeXg98OFAAvAHQ6Dcrh9Dxp2PttHKwxNbiFs/viewform
http://www.meetup.com/Taipei-py/
http://www.meetup.com/pythonhug/
http://www.meetup.com/PyLadiesTW/
http://www.meetup.com/Tainan-py-Python-Tainan-User-Group/
http://www.meetup.com/Kaohsiung-Python-Meetup
http://www.meetup.com/Taichung-Python-Meetup/
http://djangogirls.org/taipei
http://www.meetup.com/Hualien-Py/
https://tw.pycon.org/2015apac/en/sponsors/
http://aktsk.com.tw/
http://ocf.tw
http://www.wolftea.com
http://www.gliacloud.com
https://www.facebook.com/pycontw
https://twitter.com/pycontw
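
The collected links include relative paths, `mailto:` addresses, and duplicates. Before following them, a crawler would typically clean the list. The sketch below assumes Python 3 and uses only the standard library, on a few hrefs copied from the first listing above; it resolves relative links against the page URL, keeps only http(s) links, and removes duplicates while preserving order.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://tw.pycon.org/2016"
# A few hrefs copied from the crawl output.
hrefs = [
    "/2016/",
    "https://www.facebook.com/pycontw",
    "mailto:sponsorship@pycon.tw",
    "http://ocf.tw",
    "https://www.facebook.com/pycontw",
]

seen = set()
absolute_links = []
for href in hrefs:
    url = urljoin(base_url, href)                    # resolve relative links
    if urlparse(url).scheme not in ("http", "https"):
        continue                                     # skip mailto: and similar
    if url not in seen:                              # drop duplicates, keep order
        seen.add(url)
        absolute_links.append(url)

print(absolute_links)
```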
