## Web Scraping Prerequisites

Related terms: web scraping, web crawler, spider

Technology
- Request-response: [web access request-response cycle](https://media.geeksforgeeks.org/wp-content/uploads/20210905091508/ImageOfHTTPRequestResponse.png)  
- HTML structure (DOM, Document Object Model): [DOM](https://steam.oxxostudio.tw/webp/python/spider/beautiful-soup-03.webp)
- HTTP Response: [Response](https://media.geeksforgeeks.org/wp-content/uploads/20210905094321/StructureOfAHTTPResponse.png)
  

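Scraping etiquette
- Add a reasonable delay between requests so the crawl does not burden the site's server
- Use the site's API to get data whenever one is available
- Read and follow the rules in the site's robots.txt (https://www.tsg.com.tw/robots.txt) (https://www.iana.org/robots.txt)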
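
As a minimal sketch of the robots.txt rule above, the standard library's `urllib.robotparser` can check whether a URL may be fetched. The robots.txt content, the `my-crawler` user agent, the `example.com` URLs, and the `filter_allowed` helper are all assumptions made up for illustration; the rules are inlined so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def filter_allowed(robots_txt, user_agent, urls):
    """Return only the URLs that robots_txt allows user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]

# Hypothetical candidate URLs.
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]
allowed = filter_allowed(ROBOTS_TXT, "my-crawler", candidates)
print(allowed)
```

Against a live site, `set_url()` followed by `read()` would load the real robots.txt instead of the inlined text, and a `time.sleep()` call between requests would cover the delay rule.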

Example sites
- Internet Assigned Numbers Authority (https://www.iana.org/domains/)
- Code-Gym (https://code-gym.github.io/spider_demo/)
- 文淵閣 (http://ehappy.tw/bsdemo1.htm)
  
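Required Python modules
- pip3 install requests  
- pip3 install beautifulsoup4 (imported as bs4)  
- pip3 install html5lib (the parser used in the examples below)  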

Workflow
- analyze the web page manually
- build connection (requests module)
- parse HTML (BeautifulSoup module)
- find data (BeautifulSoup's find, find_all)

In [None]:
import requests
from bs4 import BeautifulSoup

## Build connection and parse HTML
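- Common parsers are lxml, html5lib, and html.parser (built in). The parser argument tells BeautifulSoup how to convert the raw HTML **plain text** into a **tag tree** that can be searched. The parsers differ in error tolerance and speed.
- Use print to inspect the page content. The prettify function gives a more readable, indented layout.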
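
To illustrate the step from plain text to a tag tree without touching the network, here is a small sketch that parses an inline HTML string with the built-in `html.parser`. The `raw_html` string is a made-up page, deliberately left with an unclosed `<p>` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; note that the <p> tag is never closed.
raw_html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><p>unclosed</body></html>"
)

# Before parsing, the page is just a string.
print(type(raw_html).__name__)

# After parsing, tags become attributes of the tree.
soup = BeautifulSoup(raw_html, "html.parser")
print(soup.title.text)
print(soup.h1.text)

# prettify() re-serializes the tree with one tag per line and indentation.
print(soup.prettify())
```

Swapping `"html.parser"` for `"lxml"` or `"html5lib"` keeps the rest of the code unchanged, provided those packages are installed. The resulting tree may differ slightly for malformed input.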

In [None]:
url = 'https://code-gym.github.io/spider_demo/'
response = requests.get(url)
print(response.status_code, '\n')

## Parse HTML (BeautifulSoup module)

In [None]:
sp = BeautifulSoup(response.text, 'html5lib')
print(sp.title,'\n')
print(sp.prettify())

## Find data (BeautifulSoup's find, find_all)

In [None]:
# find data
print(sp.h1); print(sp.h1.text)
print(sp.find('h1')); print(sp.find('h1').text)

In [None]:
# Find the site's article list with find_all
article_list = sp.find_all('h3')
print(article_list[0].prettify())
print(article_list[1].prettify())
print(article_list[1].text.strip(),'\n')
print(article_list[1].a['href'],'\n')

print('--- Article List ---')
# enumerate supplies the running index, so no manual counter is needed
for i, h3_element in enumerate(sp.find_all('h3')):
    print(i, '-->', h3_element.text.strip())
    if h3_element.a is not None:
        print('(' + h3_element.a['href'].upper() + ')')
    # print(h3_element.prettify())
    # if h3_element.a is not None:
    #     print(h3_element.a.prettify())
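
The difference between `find` and `find_all` is easiest to see when a tag is missing. The sketch below uses a made-up HTML fragment (not the demo site) so it runs offline: `find` returns the first match or `None`, while `find_all` always returns a list-like result, possibly empty.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one heading with a link, one without.
html = """
<h3><a href="/post/1">First post</a></h3>
<h3>Untitled draft</h3>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h3")            # first matching tag only
all_h3 = soup.find_all("h3")       # every matching tag
missing = soup.find("h4")          # no match: None
missing_all = soup.find_all("h4")  # no match: empty result

# Guard against headings that have no <a> child before reading href.
links = [h3.a["href"] for h3 in all_h3 if h3.a is not None]

print(first.text, len(all_h3), missing, len(missing_all), links)
```

This is why the loop above checks `h3_element.a is not None` before indexing `['href']`: indexing `None` would raise a `TypeError`.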

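## Common anti-scraping techniques
- Checking the browser's request headers
- *Serving dynamically rendered pages*
- *Detecting user behavior patterns*
- *Requiring real-user login and authorization*
- ***Adding CAPTCHA verification***
- ***Blocking proxy servers and third-party IPs***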

In [3]:
# Add header information to mimic a browser
import requests
url = "https://www.kingstone.com.tw"
headers_chrome = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
headers_safari = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.15'}

response = requests.get(url)
#response = requests.get(url, headers = headers_chrome)
#response = requests.get(url, headers = headers_safari)
response.status_code

400
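
A site that inspects headers may be reacting to the default `User-Agent` that requests sends, although the exact cause of a given error code depends on the server. The sketch below prepares two requests without sending them, so the headers can be compared offline. The `example.com` URL is a placeholder.

```python
import requests

# Placeholder URL; the requests are only prepared, never sent.
url = "https://example.com/"
browser_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# What requests identifies itself as when no headers are given.
default_ua = requests.utils.default_user_agent()

session = requests.Session()
plain = session.prepare_request(requests.Request("GET", url))

# Session-level headers apply to every later request on this session.
session.headers.update({"User-Agent": browser_ua})
spoofed = session.prepare_request(requests.Request("GET", url))

print(plain.headers["User-Agent"])
print(spoofed.headers["User-Agent"])
```

Using a `Session` keeps the header in one place, which is convenient when many pages of the same site are fetched.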

## Opening a web page with webbrowser


In [None]:
import webbrowser

address = "新竹市大學路1001號"
map_url = 'http://www.google.com/maps/place/' + address
print(map_url)
f_ctrler = webbrowser.get("safari") # get a webbrowser controller for Safari (macOS)
f_ctrler.open(map_url)
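
Browsers usually cope with non-ASCII characters in a URL, but percent-encoding the address first is the safer choice when the URL is passed to other tools. A minimal sketch, where `build_map_url` is a hypothetical helper and the browser is not opened:

```python
from urllib.parse import quote, unquote

def build_map_url(address):
    """Return a Google Maps place URL with the address percent-encoded."""
    return "http://www.google.com/maps/place/" + quote(address)

address = "新竹市大學路1001號"
map_url = build_map_url(address)
print(map_url)

# Decoding recovers the original address, so no information is lost.
print(unquote(map_url))
```

The resulting `map_url` can be passed to `webbrowser.open()` in the same way as the unencoded string.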