# Scraping Strategies for Static Web Pages


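* Understand the scraping strategy for static web pages
* Get to know a package suited to scraping static pages: Requests
* Get to know a package suited to scraping static pages: BeautifulSoup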

## Fetching Web Data: Requests

In [25]:
# Import the library
import requests
# Send a GET request to the target URL we want to scrape
r = requests.get('https://github.com/timeline.json')
# Read the response body as text
response = r.text

In [26]:
print(response)
response[:150]

{"message":"Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.","documentation_url":"https://docs.github.com/v3/activity/events/#list-public-events"}


'{"message":"Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that '

In [18]:
print(type(response))

# print(response['message'])  # would raise a TypeError: response is still a str here
print(response[:100])

<class 'str'>
{"message":"Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our


In [10]:
import json
response = json.loads(response)

print(type(response))
print(response['message'])

<class 'dict'>
Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.
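
The step above, turning the response text from a `str` into a `dict`, can be sketched without any network access. The `raw_body` string below is a shortened stand-in for the body shown above, and `parse_json_body` is a hypothetical helper name introduced only for this illustration. With a real response, Requests also offers `r.json()`, which performs the same decoding.

```python
import json

# Shortened stand-in for the response body shown above (assumption: not fetched live).
raw_body = (
    '{"message": "Hello there, wayfaring stranger.", '
    '"documentation_url": "https://docs.github.com/v3/activity/events/#list-public-events"}'
)


def parse_json_body(text):
    """Return the decoded JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


data = parse_json_body(raw_body)
not_json = parse_json_body("<html>this is not JSON</html>")

print(type(data))
print(data["message"])
print(not_json)
```

Returning `None` for non-JSON input is just one possible design; it is handy when a site sometimes answers with an HTML page instead of JSON.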


## HTML Parser: BeautifulSoup


In [11]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

print(html_doc)
print(type(html_doc))


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>

<class 'str'>


In [9]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html5lib")
print(soup)
print(type(soup))

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>


</body></html>
<class 'bs4.BeautifulSoup'>
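
Once the document is parsed, individual elements can be pulled out of the `soup` object. The sketch below reuses the same sample HTML and assumes the built-in `"html.parser"` backend rather than `html5lib`, so that no extra package is needed. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Text inside the <title> tag
title_text = soup.title.string

# Every <a class="sister"> tag, as (link text, href) pairs
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", class_="sister")]

# A single element looked up by its id attribute
second_link = soup.find(id="link2")

# All <p class="story"> paragraphs
story_paragraphs = soup.find_all("p", class_="story")

print(title_text)
print(links)
print(second_link["href"])
print(len(story_paragraphs))
```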


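## Assignment Goals

Use Requests + BeautifulSoup to fetch and parse the content of the following two sites:

1. Dcard: https://www.dcard.tw/f
2. Zhihu: https://www.zhihu.com/explore

Then answer the following questions:

1. After Requests returns, how do we get the data out, and what type is it?
2. Why process the response with BeautifulSoup?
3. The data returned by Zhihu looks a bit odd. How can this be fixed?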

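In [21]:
print('After Requests returns, how do we get the data out, and what type is it? =>')

url = 'https://www.dcard.tw/f'

r = requests.get(url)
r.encoding = 'utf-8'  # set the encoding before reading r.text so the body is decoded as UTF-8
response = r.text  # r.text returns the decoded response body as a str

print("Type of r is", type(r))
print("Type of r.text is", type(r.text))
print('\n')
print(r.text[0:3000])

After Requests returns, how do we get the data out, and what type is it? =>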
Type of r is <class 'requests.models.Response'>
Type of r.text is <class 'str'>


<!DOCTYPE html>
<html>
  <head>
    <meta
      name="viewport"
      content="width=device-width, initial-scale=1, minimum-scale=1, viewport-fit=cover" />
    <meta charset="utf-8" />
    <title>Dcard 需要確認您的連線是安全的</title>
    <style>*{box-sizing:border-box}a,body,div,h1,h2,html,p,pre,span{color:#000;margin:0;padding:0}body{background-color:#00324e;font-family:Roboto,Helvetica Neue,Helvetica,Arial,PingFang TC,黑體-繁,Heiti TC,蘋果儷中黑,Apple LiGothic Medium,微軟正黑體,Microsoft JhengHei,sans-serif}@media (max-width:798px){body{background-color:#fff}}.dcard_nav{align-items:center;background-color:#006aa6;display:flex;height:48px}.dcard_logo{height:28px;margin-left:124px;width:74px}@media (max-width:798px){.dcard_logo{margin-left:16px}}h1{font-size:28px;line-height:40px;margin:10px 0}h1,h2{font-weight:500}h2{font-size:24px;line-height:33px;margin:8px 0}p,pre{color:rgba(0,0,0,.75);font-si
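
`r.text` is the raw bytes in `r.content` decoded with `r.encoding`, which is why the encoding has to be right before the text is read. The sketch below shows the effect offline using plain byte strings; the sample text is an assumption chosen for illustration. ISO-8859-1 is used as the wrong codec because Requests may fall back to it when the `Content-Type` header names no charset.

```python
# Two Chinese characters encoded the way a UTF-8 page would send them.
raw_bytes = "知乎".encode("utf-8")

# Decoding with the correct codec recovers the original text.
correct = raw_bytes.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 yields one garbage character per byte.
garbled = raw_bytes.decode("iso-8859-1")

print(correct)
print(repr(garbled))
print(len(correct), len(garbled))
```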

In [20]:
import requests
from bs4 import BeautifulSoup

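In [22]:
print('Why process the response with BeautifulSoup, and what type is the result? => \n')

from bs4 import BeautifulSoup

soup = BeautifulSoup(response, "html.parser")
print(soup)
print(type(soup))

# Parsing the HTML with BeautifulSoup makes it much easier to pull out individual elements; the result is a bs4.BeautifulSoup object

Why process the response with BeautifulSoup, and what type is the result? => 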

<!DOCTYPE html>

<html>
<head>
<meta content="width=device-width, initial-scale=1, minimum-scale=1, viewport-fit=cover" name="viewport"/>
<meta charset="utf-8"/>
<title>Dcard 需要確認您的連線是安全的</title>
<style>*{box-sizing:border-box}a,body,div,h1,h2,html,p,pre,span{color:#000;margin:0;padding:0}body{background-color:#00324e;font-family:Roboto,Helvetica Neue,Helvetica,Arial,PingFang TC,黑體-繁,Heiti TC,蘋果儷中黑,Apple LiGothic Medium,微軟正黑體,Microsoft JhengHei,sans-serif}@media (max-width:798px){body{background-color:#fff}}.dcard_nav{align-items:center;background-color:#006aa6;display:flex;height:48px}.dcard_logo{height:28px;margin-left:124px;width:74px}@media (max-width:798px){.dcard_logo{margin-left:16px}}h1{font-size:28px;line-height:40px;margin:10px 0}h1,h2{font-weight:500}h2{font-size:24px;line-height:33px;margin:8px 0}p,pre{color:rgba(0,0,0,.75);font-size:14px;font-weight:400;line-height:20px}pre{margin:8px 0;white-space:pre-wrap;white-space:break-spaces;wo
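
One practical use of the parsed `soup` is checking what kind of page actually came back. In the output above, the `<title>` reads "Dcard 需要確認您的連線是安全的" (roughly "Dcard needs to confirm that your connection is secure"), which suggests a verification page rather than forum content. The sketch below runs on a shortened copy of that HTML; `CHALLENGE_MARKER` and `is_challenge_page` are hypothetical names, and matching on the title text is only an assumed heuristic.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the Dcard response printed above (not a live request).
sample_html = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Dcard 需要確認您的連線是安全的</title>
  </head>
  <body></body>
</html>"""

soup = BeautifulSoup(sample_html, "html.parser")

# Pull the page title out of the parsed tree.
page_title = soup.title.get_text(strip=True)

# Hypothetical marker: the phrase seen in the title of the verification page.
CHALLENGE_MARKER = "需要確認您的連線是安全的"
is_challenge_page = CHALLENGE_MARKER in page_title

print(page_title)
print("verification page:", is_challenge_page)
```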

In [23]:
url = 'https://www.zhihu.com/explore'
r = requests.get(url)
r.encoding = 'utf-8'  # the second positional argument of requests.get is params, so set the encoding on the response instead

print(r.text[0:600])

<!doctype html>
<html lang="zh" data-hairline="true" class="itcauecng" data-theme="light"><head><meta charSet="utf-8"/><title data-rh="true">发现 - 知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><meta name="description" property="og:description" content="知乎，中文互联网高质量的问答社区和创作者聚集的原创内容平台，于 2011 年 1 月正式上线，以「让人们更好的分享
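
One detail worth knowing about `requests.get`: its second positional parameter is `params`, so a string passed there becomes part of the query string and does not set the encoding. The sketch below builds the request with `requests.Request(...).prepare()` without sending it, so the final URL can be inspected offline. The names `with_string_params` and `without_params` are illustrative.

```python
import requests

url = "https://www.zhihu.com/explore"

# Passing a string as params appends it to the URL as a query string.
with_string_params = requests.Request("GET", url, params="utf-8").prepare()

# Without params, the URL is left unchanged.
without_params = requests.Request("GET", url).prepare()

print(with_string_params.url)
print(without_params.url)
```

The encoding is a property of the response, so it is set afterwards with `r.encoding = 'utf-8'`.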


# 3. The data returned by Zhihu looks a bit odd. How can this be fixed?

In [24]:
# Add a User-Agent header so the request gets past Zhihu's check

import requests
url = 'https://www.zhihu.com/explore'

headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)
r.encoding = 'utf-8'

print(r.text[0:600])

<!doctype html>
<html lang="zh" data-hairline="true" class="itcauecng" data-theme="light"><head><meta charSet="utf-8"/><title data-rh="true">发现 - 知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><meta name="description" property="og:description" content="知乎，中文互联网高质量的问答社区和创作者聚集的原创内容平台，于 2011 年 1 月正式上线，以「让人们更好的分享
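
When several requests need the same header, one option is to set it once on a `requests.Session`. The sketch below prepares a request without sending it, to show which `User-Agent` would go out. It reuses the `my-app/0.0.1` value from the cell above; whether a given site accepts that value or expects a browser-like string is an assumption that has to be checked per site.

```python
import requests

url = "https://www.zhihu.com/explore"

# The User-Agent that Requests sends when none is specified.
default_user_agent = requests.utils.default_user_agent()

# Set the header once; every request made through this session inherits it.
session = requests.Session()
session.headers.update({"User-Agent": "my-app/0.0.1"})

# Prepare (but do not send) a request to inspect the merged headers.
prepared = session.prepare_request(requests.Request("GET", url))
sent_user_agent = prepared.headers["User-Agent"]

print(default_user_agent)
print(sent_user_agent)
```

Calling `session.get(url)` would then send the same header without it being repeated on each call.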
