# [Morvan (莫烦Python) web scraping tutorial](https://morvanzhou.github.io/tutorials/data-manipulation/scraping/)

## Fetching a web page with Python

In [22]:
from urllib.request import urlopen

# the page contains Chinese text, so decode the downloaded bytes as UTF-8
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>
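
The `.decode('utf-8')` step matters because `read()` returns raw bytes rather than a string. A minimal offline sketch of that conversion, where the `raw` value is a hypothetical stand-in for what `urlopen(...).read()` would return:

```python
# Hypothetical stand-in for urlopen(...).read(), which returns bytes
raw = "<title>Scraping tutorial 1 | 莫烦Python</title>".encode("utf-8")

print(type(raw))            # a bytes object, not yet usable as text
text = raw.decode("utf-8")  # turn the bytes into a str
print(text)                 # the Chinese characters are now readable
```

Regular expressions written as `str` patterns, such as the ones in the next cells, need a `str` to search, which is another reason to decode first.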


In [29]:
import re
res = re.findall("<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])

# Page title is:  Scraping tutorial 1 | 莫烦Python


Page title is:  Scraping tutorial 1 | 莫烦Python


In [33]:
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL lets "." match newlines; without it this multi-line <p> is not matched and res[0] fails
print("\nPage paragraph is: ", res[0])

# Page paragraph is:
#  这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
#  <a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.


Page paragraph is:  
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	


In [37]:
res = re.findall(r'href="(.*?)"', html)  # collect every link on the page
print("\nAll links: ", res)
# All links:
# ['https://morvanzhou.github.io/static/img/description/tab_icon.png',
# 'https://morvanzhou.github.io/',
# 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']


All links:  ['https://morvanzhou.github.io/static/img/description/tab_icon.png', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']
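
The `re.DOTALL` flag used for the paragraph above is easy to check in isolation. A small sketch, assuming a hypothetical two-line `<p>` snippet in place of the downloaded page:

```python
import re

# Hypothetical snippet standing in for the downloaded page
sample = "<p>\n\tline one\n\tline two\n</p>"

# Without re.DOTALL, "." does not match "\n", so the pattern finds nothing
without_flag = re.findall(r"<p>(.*?)</p>", sample)

# With re.DOTALL, "." also matches "\n", so the whole paragraph is captured
with_flag = re.findall(r"<p>(.*?)</p>", sample, flags=re.DOTALL)

print(without_flag)
print([text.strip() for text in with_flag])
```

Because the first call returns an empty list, indexing it with `[0]` would raise `IndexError`, which is why the paragraph cell does not run without the flag.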

---
## My practice

* Scrape the [NTU Department of Sociology page](http://sociology.ntu.edu.tw/zh_tw/teacher/FullTime)

In [39]:
from urllib.request import urlopen
html = urlopen(
    "http://sociology.ntu.edu.tw/zh_tw/teacher/FullTime"
).read().decode("utf-8")
print(html)

<!DOCTYPE html>
<html lang="zh_tw" class="orbit">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link href="/assets/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon">
  <title>台灣大學社科院-社會系</title>
  <link href="//cdnjs.cloudflare.com/ajax/libs/font-awesome/4.3.0/css/font-awesome.min.css" media="screen" rel="stylesheet">
  <link href="/assets/bootstrap/bootstrap.min-cfd64c67a341584d2fc093d1c6737cff.css" media="screen" rel="stylesheet">
  <link href="/assets/template/template-5a204b05325a2786c31792bc9937774b.css" media="screen" rel="stylesheet">
  <link rel="stylesheet" media="print" type="text/css" href="/assets/template/print.css">
  <script src="/assets/plugin/modernizr-6a5ad612c54982e085aff743118bd2d0.js"></script>
  <script src="/assets/plugin/picturefill.min-292c56cabcb10d5120d55aa48aa6e442.js"></script>
  <scr

In [44]:
res = re.findall("<title>(.+?)</title>", html)
print(res)

['台灣大學社科院-社會系']


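This is as far as I can get with regular expressions for now. The markup in the rest of the page is more complex, so I will come back to it later.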

---

## BeautifulSoup

In [45]:
!pip install beautifulsoup4



In [62]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# the page contains Chinese text, so decode the downloaded bytes as UTF-8
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')
print(html)

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>


In [63]:
soup = BeautifulSoup(html, features='lxml')
print(soup.h1)

<h1>爬虫测试1</h1>


In [64]:
print(soup.p)

<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>


In [98]:
all_href = soup.find_all('a')
for x in all_href:
    print(x['href'])
all_href = [x['href'] for x in all_href]  # the same loop written as a list comprehension
print('\n', all_href)

# ['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']

https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/

 ['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']
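
Indexing with `x['href']` raises `KeyError` when an `<a>` tag has no `href`, and real pages often use relative links. A hedged sketch of one way to handle both cases: the snippet, `base_url`, and the choice of the built-in `html.parser` (instead of `lxml`) are assumptions made for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base URL and snippet, used only for illustration
base_url = "https://example.com/tutorials/"
sample = """
<p>
  <a href="https://example.com/">home</a>
  <a href="scraping/">relative link</a>
  <a name="top">anchor without href</a>
</p>
"""

soup = BeautifulSoup(sample, features="html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")  # returns None instead of raising KeyError
    if href is None:
        continue
    links.append(urljoin(base_url, href))  # resolve relative paths against base_url

print(links)
```

`urljoin` leaves absolute URLs unchanged, so the same loop works for pages that mix both kinds of link.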


### Practice

In [102]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen(
    "http://sociology.ntu.edu.tw/zh_tw/teacher/FullTime"
).read().decode("utf-8")
soup = BeautifulSoup(html, features='lxml')
print(soup)

<!DOCTYPE html>
<html class="orbit" lang="zh_tw">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/assets/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<title>台灣大學社科院-社會系</title>
<link href="//cdnjs.cloudflare.com/ajax/libs/font-awesome/4.3.0/css/font-awesome.min.css" media="screen" rel="stylesheet"/>
<link href="/assets/bootstrap/bootstrap.min-cfd64c67a341584d2fc093d1c6737cff.css" media="screen" rel="stylesheet"/>
<link href="/assets/template/template-5a204b05325a2786c31792bc9937774b.css" media="screen" rel="stylesheet"/>
<link href="/assets/template/print.css" media="print" rel="stylesheet" type="text/css"/>
<script src="/assets/plugin/modernizr-6a5ad612c54982e085aff743118bd2d0.js"></script>
<script src="/assets/plugin/picturefill.min-292c56cabcb10d5120d55aa48aa6e442.js"></script>
<script src="//cdnjs

---

## What is CSS

* Look at the `class` attribute

In [115]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# the page contains Chinese text, so decode the downloaded bytes as UTF-8
html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
print(html)

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>爬虫练习 列表 class | 莫烦 Python</title>
	<style>
	.jan {
		background-color: yellow;
	}
	.feb {
		font-size: 25px;
	}
	.month {
		color: red;
	}
	</style>
</head>

<body>

<h1>列表 爬虫练习</h1>

<p>这是一个在 <a href="https://morvanzhou.github.io/" >莫烦 Python</a> 的 <a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/" >爬虫教程</a>
	里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.</p>

<ul>
	<li class="month">一月</li>
	<ul class="jan">
		<li>一月一号</li>
		<li>一月二号</li>
		<li>一月三号</li>
	</ul>
	<li class="feb month">二月</li>
	<li class="month">三月</li>
	<li class="month">四月</li>
	<li class="month">五月</li>
</ul>

</body>
</html>


In [129]:
soup = BeautifulSoup(html, features='lxml')

# use class to narrow search
month = soup.find_all('li',
                     {'class' : 'month'})  # pass the attribute filter as a dict
for m in month:
#    print(m)
    print(m.get_text())  # print only the text, without the <li> tags

一月
二月
三月
四月
五月


In [142]:
jan = soup.find('ul', {"class" : "jan"})
d_jan = jan.find_all('li')
for d in d_jan:
    print(d.get_text())

一月一号
一月二号
一月三号
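
The same lookups can also be written as CSS selectors with `soup.select`, which may read more naturally once the page's `class` names are known. A minimal sketch on a shortened copy of the list above, parsed with the built-in `html.parser` as an assumption so it runs without `lxml`:

```python
from bs4 import BeautifulSoup

# Shortened copy of the list markup, used only for illustration
sample = """
<ul>
  <li class="month">一月</li>
  <ul class="jan">
    <li>一月一号</li>
    <li>一月二号</li>
  </ul>
  <li class="feb month">二月</li>
</ul>
"""
soup = BeautifulSoup(sample, features="html.parser")

# "li.month" matches any <li> whose class list contains "month"
months = [li.get_text() for li in soup.select("li.month")]

# "ul.jan li" matches every <li> nested inside <ul class="jan">
jan_days = [li.get_text() for li in soup.select("ul.jan li")]

# "li.feb.month" requires both classes on the same element
feb = [li.get_text() for li in soup.select("li.feb.month")]

print(months, jan_days, feb)
```

Note that `class` is a multi-valued attribute: searching for `month` also matches `class="feb month"`, which is why 二月 appears in the `find_all` result above.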


### Practice

In [166]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen(
    "http://sociology.ntu.edu.tw/zh_tw/teacher/FullTime"
).read().decode("utf-8")
soup = BeautifulSoup(html, features='lxml')
status = soup.find_all('span', {'class' : 'i-member-value member-data-value-job-title'})
name = soup.find_all('span', {'class' : 'i-member-title member-data-title-name'})

In [182]:
name = soup.find_all('span', {'class' : 'i-member-value member-data-value-name'})

How can I separate the name from the URL?
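
One possible approach to the question above, sketched under a loud assumption: that each name `<span>` wraps an `<a>` tag whose text is the name and whose `href` is the profile link. The markup, the placeholder names, and the paths below are hypothetical; the real page may be structured differently.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the cells above
sample = """
<span class="i-member-value member-data-value-name"><a href="/teacher/a">Teacher A</a></span>
<span class="i-member-value member-data-value-name"><a href="/teacher/b">Teacher B</a></span>
"""
base_url = "http://sociology.ntu.edu.tw/"
soup = BeautifulSoup(sample, features="html.parser")

teachers = []
for span in soup.find_all("span", {"class": "member-data-value-name"}):
    link = span.find("a")             # first <a> inside the span, or None
    name = span.get_text(strip=True)  # visible text only, tags removed
    # Read the href separately and make it absolute; None if there is no link
    url = urljoin(base_url, link["href"]) if link and link.get("href") else None
    teachers.append((name, url))

for name, url in teachers:
    print(name, url)
```

The idea is that `get_text()` returns only the text while `link["href"]` returns only the attribute, so the two values never need to be split apart from one string.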