# Scraping Strategies for Static Web Pages


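* Understand scraping strategies for static web pages
* Get to know a package suited to scraping static pages: Requests
* Get to know a package suited to scraping static pages: BeautifulSoup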

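## Assignment Goals

Use Requests + BeautifulSoup to fetch and parse the content of the following two sites:

1. Dcard: https://www.dcard.tw/f
2. Zhihu: https://www.zhihu.com/explore

Then answer the following questions:

1. After Requests returns the response, how do you extract the data, and what is its data type?
2. Why process the data with BeautifulSoup? What is the type after processing?
3. The data that comes back from Zhihu looks a bit odd. How can this be resolved?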

### 1. Dcard: https://www.dcard.tw/f

In [7]:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dcard.tw/f'
r = requests.get(url)
r.encoding = 'utf-8'
print(r.text[0:3000])


  <!DOCTYPE html>
  <html lang="zh-tw">
    <head>
      <meta charset="UTF-8">
      <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
      <meta http-equiv="X-UA-Compatible" content="ie=edge">
      <meta name="theme-color" content="#006aa6">
      <meta data-react-helmet="true" property="og:type" content="website"/><meta data-react-helmet="true" property="og:site_name" content="Dcard"/><meta data-react-helmet="true" property="og:title" content="全部  | Dcard"/><meta data-react-helmet="true" property="og:url" content="https://www.dcard.tw/f"/><meta data-react-helmet="true" name="twitter:title" content="全部 | Dcard"/><meta data-react-helmet="true" property="al:ios:url" content="dcard://category/all/全部/hot"/><meta data-react-helmet="true" property="al:android:url" content="dcard://category/all/全部/hot"/>
      <title data-react-helmet="true">全部 | Dcard</title>
      <link data-react-helmet="true" rel="canonical" href="https://www.dcard.tw/f"/>
      <l

In [10]:
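print('After Requests returns the response, how do you extract the data, and what is its data type? =>')
response = r.text  # the response body decoded to text using r.encoding
print(type(response))

After Requests returns the response, how do you extract the data, and what is its data type? =>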
<class 'str'>
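
To make the answer above concrete: `r.content` holds the raw bytes of the response body, and `r.text` is those bytes decoded to a `str` using `r.encoding`. The sketch below imitates that step offline. The `raw_bytes` value is a hand-written stand-in, not a real Dcard response, and the `latin-1` decode is only there to illustrate what a wrong codec does to Chinese text.

```python
# Stand-in for r.content: the body as raw bytes (illustrative, not real Dcard data).
raw_bytes = '<title>全部 | Dcard</title>'.encode('utf-8')

# Roughly what r.text yields once r.encoding = 'utf-8' has been set.
text = raw_bytes.decode('utf-8')

print(type(raw_bytes))
print(type(text))

# Decoding the same bytes with the wrong codec does not raise an error,
# but the Chinese characters come out garbled.
garbled = raw_bytes.decode('latin-1')
print(garbled == text)
```

This is why the cells in this notebook set `r.encoding = 'utf-8'` before reading `r.text`.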


In [12]:
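print('Why process the data with BeautifulSoup? What is the type after processing? =>')
# After BeautifulSoup processes the text, the data becomes a nested object structure; tags, classes, and content can all be retrieved through its methods
soup = BeautifulSoup(response, 'html.parser')  # parse the HTML, naming the parser explicitly to avoid the "no parser specified" warning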
print(type(soup))
print(soup)

idth":180,"height":320}},"uploadedAt":"2019-12-28T15:21:40.125Z"},"likeCount":1169,"viewCount":3052,"createdAt":"2019-12-28T15:21:59.728Z","updatedAt":"2019-12-29T06:59:21.047Z","forumName":"寵物","forumAlias":"pet","score":1.99999999660841,"featured":true,"position":0.999999996608406,"gender":"F","school":"嶺東科技大學","read":false,"postAvatar":"","duration":5.865,"mediaHeight":1280,"mediaWidth":720,"previewHeight":320,"previewWidth":180,"previewUrl":"https:\u002F\u002Fmegapx-assets.dcard.tw\u002Fvideos\u002Fa1e70a20-82cb-4f04-b9df-27ec2e6f0216\u002Fpreview.mp4"},"a6420ee3-2a9f-4544-a344-d583e5c0aeee":{"id":"a6420ee3-2a9f-4544-a344-d583e5c0aeee","mediaUrl":"https:\u002F\u002Fgcs.dcard.tw\u002Fstory\u002Fimages\u002Fee8f14df-8cbe-4373-a50a-97855e6c7491.jpg","mediaType":"image","mediaData":null,"likeCount":567,"viewCount":1927,"createdAt":"2019-12-28T16:23:51.316Z","updatedAt":"2019-12-29T07:00:19.529Z","forumName":"寵物","forumAlias":"pet","score":1.99992201837195,"featured":true,"position":0.9
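
Printing the whole `soup` is hard to read, so here is a minimal sketch of pulling individual values out of the parsed tree. It runs on a small inline HTML string modeled on the Dcard `<head>` shown earlier rather than on a live response. The `html` sample and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modeled on the Dcard <head> output above (illustrative).
html = """<!DOCTYPE html>
<html lang="zh-tw">
  <head>
    <meta charset="UTF-8">
    <meta property="og:site_name" content="Dcard"/>
    <title>全部 | Dcard</title>
    <link rel="canonical" href="https://www.dcard.tw/f"/>
  </head>
  <body></body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')
print(type(soup))

# Tag access by name, then its text content.
title_text = soup.title.string

# Look up a tag by attribute value, then read another attribute from it.
site_name = soup.find('meta', attrs={'property': 'og:site_name'})['content']
canonical = soup.find('link', attrs={'rel': 'canonical'})['href']

print(title_text, site_name, canonical)
```

The same `find` calls should work on the `soup` built from the real response, provided the page still contains those tags.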

### 2. Zhihu: https://www.zhihu.com/explore

In [13]:
url = 'https://www.zhihu.com/explore'
r = requests.get(url)
r.encoding = 'utf-8'

print(r.text[0:600])

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
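
A 400 page like the one above is easy to miss if only `r.text` is printed. One way to catch it earlier is to check the status before parsing. The sketch below uses a hand-built `requests.Response` object (named `fake` here, purely as a stand-in) so that it runs without network access; with a real request the object would come from `requests.get(url)`.

```python
import requests

# Hand-built stand-in for the response shown above; a real one would be
# returned by requests.get(url).
fake = requests.Response()
fake.status_code = 400

# .ok is False for any 4xx or 5xx status code.
print(fake.ok)

# raise_for_status() turns an error status into an exception.
try:
    fake.raise_for_status()
    failed = False
except requests.HTTPError as err:
    failed = True
    print('Request failed:', err)
```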



### 3. The data that comes back from Zhihu looks a bit odd. How can this be resolved?

In [14]:
import requests
url = 'https://www.zhihu.com/explore'
# Define the request headers: testing showed that the Zhihu server only checks the user-agent
header={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
r = requests.get(url,headers=header)

r.encoding = 'utf-8'
print(r.text[0:600])

<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">发现 - 知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><meta name="360_ssp_verify" content="693549ae953a04cb4990f79614e4392d"/><meta name="description" property="og:description" co
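
To see what the `headers` argument actually changes, the sketch below prepares two requests without sending them and compares their `User-Agent` values. By default Requests identifies itself as `python-requests/<version>`, which is plausibly what the server rejected in the first attempt. That explanation is an assumption based on the test noted in the cell above. The names `plain` and `custom` are illustrative.

```python
import requests

url = 'https://www.zhihu.com/explore'
browser_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36')

# The User-Agent that Requests sends when none is given.
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)

# prepare_request() builds the outgoing request but does not send it.
with requests.Session() as session:
    plain = session.prepare_request(requests.Request('GET', url))
    custom = session.prepare_request(
        requests.Request('GET', url, headers={'user-agent': browser_ua})
    )

print(plain.headers['User-Agent'])
print(custom.headers['User-Agent'])
```

Header names are matched case-insensitively, so the lowercase `user-agent` key replaces the default value rather than adding a second header.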
