# Basic concepts of a web crawler

## Send a request to the server and receive the response data
<img src="images/http.png"/>
- The client starts from an initial URL and sends a request to it
- The web server returns the response data for that URL

## Process response data
<img src="images/html-parser-json.png"/>
- Receive the data from the web server
- Choose an off-the-shelf library, or write a custom parser, to parse the received data
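
The "customize" option above can be sketched with the standard library alone. This is a minimal illustration, not a full parser: `TitleParser`, `parse_body`, and the two hard-coded bodies are hypothetical names and data introduced here so the example runs offline.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_body(body, content_type):
    """Return a dict for JSON bodies, or the page title for HTML bodies."""
    if "json" in content_type:
        return json.loads(body)
    parser = TitleParser()
    parser.feed(body)
    return parser.title


# Hypothetical response bodies, hard-coded so the sketch runs without a network.
json_result = parse_body('{"name": "crawler", "pages": 3}', "application/json")
html_result = parse_body("<html><head><title>Hello</title></head></html>", "text/html")
print(json_result)
print(html_result)
```

In practice a ready-made library usually handles malformed markup far better than a hand-written parser like this one.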


---


# Request and Response

## HTTP Request Method

### HTTP 1.1: Method definitions

> OPTIONS

> GET

> HEAD

> POST

> PUT

> DELETE

> TRACE

> CONNECT

Reference: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html


### Two HTTP Request Methods: GET vs. POST
> Think of GET as a postcard (the data is visible in the URL) and POST as a sealed envelope (the data travels in the request body)

### GET: Requests data from a specified resource.

> GET requests can be cached

> GET requests remain in the browser history

> GET requests can be bookmarked

> GET requests should never be used when dealing with sensitive data
>> Bad example: http://www.justfortest.com/login.html?username=HaHaHa&password=UCCU

> GET requests have length restrictions, because browsers and servers limit URL length

> GET requests should be used only to retrieve data

### POST: Submits data to be processed to a specified resource.

> POST requests are not cached by default

> POST requests do not remain in the browser history

> POST requests cannot be bookmarked

> POST requests have no protocol-level restriction on data length (servers may still set their own limits)
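
To make the postcard and envelope picture concrete, here is a small offline sketch using the standard library's `urllib`. It only builds the two request objects and never sends them. The form fields are made-up values, and the httpbin URLs are used purely as placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"username": "alice", "password": "s3cret"}  # hypothetical form data

# GET: the data is appended to the URL as a query string (the "postcard").
get_request = Request("http://httpbin.org/get?" + urlencode(fields))

# POST: the data travels in the request body (the "envelope").
post_request = Request("http://httpbin.org/post", data=urlencode(fields).encode("ascii"))

# Nothing is sent here; we only inspect what each request would carry.
print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Note that the POST body is hidden from the URL, history, and bookmarks, but it is still readable on the wire unless the request goes over HTTPS.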

---


# Python libraries for web crawlers

## Requests: Process HTTP requests and responses
- Requests is an Apache2 Licensed HTTP library
- Powered by urllib3 (bundled inside older Requests releases, installed as a separate dependency by newer ones)
- Document: http://docs.python-requests.org/en/latest/

## Install the library with the command below:

In [None]:
!pip install requests

## Import Requests

In [None]:
import requests

## Make A Request

### make an HTTP GET request

In [None]:
response = requests.get("https://api.github.com/events")
print(response)

- GET request's URL

In [None]:
print(response.url)

- Response body of the GET request

In [None]:
print(response.text)

### Make an HTTP GET request with query parameters

In [None]:
response = requests.get("http://httpbin.org/get", params={"key1": "value1", "key2": "value2"})
print(response)
print(response.url)

- GET request's URL

In [None]:
print(response.url)
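
The printed URL shows the parameters encoded into a query string. As a rough standard-library sketch of that kind of encoding (not a claim about how Requests builds it internally), `urlencode` produces the query string and `parse_qs` reverses it. The parameter values below are hypothetical and chosen to include characters that need escaping.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"key1": "value1", "key2": "value 2 & more"}  # hypothetical values

# Spaces and '&' inside a value must be escaped so they are not
# mistaken for separators between parameters.
url = "http://httpbin.org/get?" + urlencode(params)
print(url)

# Decoding the query string recovers the original values (as lists).
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```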

- Response body of the GET request

In [None]:
print(response.text)

- DEMO

https://www.cousera.org

---

### make an HTTP POST request

In [None]:
response = requests.post("http://httpbin.org/post")
print(response)

- POST request's URL

In [None]:
print(response.url)

- Response body of the POST request

In [None]:
print(response.text)

### Make an HTTP POST request with form data

In [None]:
response = requests.post("http://httpbin.org/post", data = {"key1":"value1", "key2":"value2"})
print(response)

- POST request's URL

In [None]:
print(response.url)

- Response body of the POST request

In [None]:
print(response.text)

- DEMO - httpbin: HTTP Request & Response Service

https://httpbin.org/post #Returns POST data.

---

## Response Attributes

In [None]:
response = requests.get('https://api.github.com/events')

### Content of the server’s response

In [None]:
response.text
type(response.text)

### Binary response content

In [None]:
response.content
type(response.content)
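
The difference between the two attributes is the difference between `str` and `bytes`: `response.content` holds the raw bytes, and `response.text` is those bytes decoded with the response's encoding. The sketch below imitates that relationship offline with a made-up body, so no request is sent.

```python
raw = "PyCon Taiwan 台灣".encode("utf-8")  # hypothetical body bytes, like response.content
text = raw.decode("utf-8")                 # decoded string, like response.text

print(type(raw), len(raw))    # length counted in bytes
print(type(text), len(text))  # length counted in characters

# Decoding with the wrong encoding does not fail here, but yields garbled text.
garbled = raw.decode("latin-1")
print(garbled == text)
```

When the text of a page looks garbled, checking which encoding was used for decoding is a reasonable first step.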

---


## BeautifulSoup4: sits atop an HTML or XML parser

- A Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

- It also supports a number of third-party Python parsers, such as lxml.
- Document: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Install the libraries with the commands below:

In [None]:
!pip install beautifulsoup4

In [None]:
!conda install lxml

## Import BeautifulSoup4

In [None]:
from bs4 import BeautifulSoup

## Typical Usage

- lxml's HTML parser
#### soup = BeautifulSoup(html_markup, "lxml")

- Python's html.parser
#### soup = BeautifulSoup(html_markup, 'html.parser')

## Quick Start

- Markup sample

In [None]:
#Here’s an HTML document I’ll be using as an example throughout this document.
#It’s part of a story from Alice in Wonderland:

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

[Markup sample](dormouse.html "The Dormouse's story")

### Import and Get BeautifulSoup object

In [None]:
from bs4 import BeautifulSoup

# Read the sample markup from disk; the context manager closes the file afterwards
with open('dormouse.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

In [None]:
type(soup)

In [None]:
print(soup.prettify())

### Select tag name - title

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup.title

In [None]:
soup.title.name

In [None]:
soup.title.string

---

### Select tag name - head

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup.head

In [None]:
soup.head.name

In [None]:
soup.head.string

---

### Find the parent of a tag

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup.title.parent.name

---

### Find the 'p' tag's attribute - class

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup.p

In [None]:
soup.p['class']
# returns a list of class names

---

### Find all tags with a given name

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup.find_all('a')
# returns a list-like collection of matching tags

---

### Find a tag by attribute value

html_markup = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
target = soup.find(id="link3")
target

In [None]:
target.name

In [None]:
target.string

---


# DEMO: Get all hyperlinks on the PyCon Taiwan front page
<img src='images/pycon.png'/>
### https://tw.pycon.org/2016

## Quick start

### Import necessary libraries

In [None]:
from bs4 import BeautifulSoup
import lxml
import requests

### Make a GET request

In [None]:
url = "https://tw.pycon.org/2016"
response = requests.get(url)

### Parse response data using BeautifulSoup

In [None]:
soup = BeautifulSoup(response.text, 'lxml')

### Print PyCon Taiwan front page content

In [None]:
print(soup.prettify())

### Find hyperlink URLs via BeautifulSoup

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

### Try again: find only hyperlinks with target="_blank"

In [None]:
for link in soup.find_all('a', target="_blank"):
    print(link.get('href'))
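
The `href` values printed by these loops can be relative paths, non-HTTP links, or `None` when an `a` tag has no `href` attribute. A crawler typically normalizes them before following them. The sketch below is one possible way to do that with the standard library. `normalize_links` and the sample `hrefs` list are hypothetical, introduced only for illustration.

```python
from urllib.parse import urljoin, urlsplit


def normalize_links(base_url, hrefs):
    """Turn raw href values into a sorted list of unique absolute http(s) URLs."""
    seen = set()
    for href in hrefs:
        if not href:  # skip None and empty strings (missing href attribute)
            continue
        absolute = urljoin(base_url, href)  # resolve relative paths against the page URL
        if urlsplit(absolute).scheme in ("http", "https"):
            seen.add(absolute)
    return sorted(seen)


# Hypothetical href values, standing in for the ones printed by the loops above.
hrefs = [
    "/2016/en-us/",
    "https://tw.pycon.org/2016/en-us/",
    None,
    "mailto:hello@example.com",
    "speaking/talk/",
]
links = normalize_links("https://tw.pycon.org/2016/", hrefs)
print(links)
```

With BeautifulSoup, the input list could be built as `[link.get('href') for link in soup.find_all('a')]`.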