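## Getting Started with Beautiful Soup
### Overview
1. Beautiful Soup is an HTML/XML parsing library, used mainly to parse HTML/XML documents and extract data from them.
2. It loads the whole document and builds the complete tree in memory, so its time and memory costs are relatively high; it is generally slower than using lxml directly.
3. BeautifulSoup4 is simple and easy to pick up, but its matching speed falls well short of regular expressions and XPath, so where extraction speed matters these notes prefer regular expressions or XPath.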
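### Basic elements
- Tag: the basic unit of organization, delimited by <> and </>;
- Name: the tag's name, e.g. the name of <p>…</p> is 'p'; accessed as <tag>.name;
- Attributes: the tag's attributes, organized as a dictionary; accessed as <tag>.attrs;
- NavigableString: the non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string;
- Comment: the comment part of a string inside a tag; a special subclass of NavigableString.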
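
The cells in this section never show a Comment, so here is a minimal sketch of how it differs from an ordinary NavigableString. The HTML snippet is invented for illustration and is not part of the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet: one tag holding only a comment, one holding plain text.
html = '<p class="note"><!-- hidden remark --></p><b>visible text</b>'
soup = BeautifulSoup(html, "html.parser")

comment = soup.p.string   # the only child of <p> is the comment node
text = soup.b.string      # the only child of <b> is a text node

print(type(comment).__name__)
print(type(text).__name__)

# A Comment is still a NavigableString, so check for Comment explicitly
# when comments must be kept out of extracted text.
is_comment = isinstance(comment, Comment)
is_plain_text = isinstance(text, NavigableString) and not isinstance(text, Comment)
```

Printing a Comment shows only its content, without the comment delimiters, so a type check is the reliable way to tell it apart from page text.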

In [2]:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://python123.io/ws/demo.html')
demo = r.text
demo

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

In [3]:
# Parse the HTML with bs4
soup = BeautifulSoup(demo, 'html.parser')   
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>


In [4]:
# Access the first <a> tag
soup.a

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

In [5]:
# Access the <title> tag
soup.title

<title>This is a python demo page</title>

In [6]:
# Access the tag's name
soup.a.name


'a'

In [7]:
soup.a.parent.name  # the parent's name of tag a

'p'

In [8]:
soup.p.parent.name

'body'

In [9]:
# attrs of tag
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(type(tag.attrs))

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
<class 'dict'>
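
Since attrs is a plain dictionary, the usual dictionary access patterns apply. The sketch below uses a made-up tag (the example.com URL and the extra class are assumptions) to show direct indexing, safe lookup with a default, and why class comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute.
html = '<a href="http://example.com/basic" class="py1 featured" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]                 # indexing the tag reads from tag.attrs
classes = tag["class"]             # class is multi-valued, so bs4 returns a list
missing = tag.get("title")         # .get() returns None instead of raising KeyError
fallback = tag.get("title", "n/a") # or a default of your choice

print(href, classes, missing, fallback)
```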


In [10]:
# The non-attribute string inside the tag, i.e. the tag's text
print(soup.a.string)
print(type(soup.a.string))

Basic Python
<class 'bs4.element.NavigableString'>
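
One caveat worth a quick sketch: .string only works when a tag has a single child. For a tag with mixed content it returns None, and get_text() is the usual alternative. The snippet below is an invented example.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph mixing text nodes and a nested tag.
soup = BeautifulSoup("<p>Learn <b>Python</b> today</p>", "html.parser")

inner = soup.b.string        # <b> has exactly one child, so .string returns it
outer = soup.p.string        # <p> has three children, so .string is None
full_text = soup.p.get_text()  # concatenates every text node under <p>

print(inner, outer, full_text)
```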


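### Traversing HTML content with bs4
1. Downward traversal of the tag tree
    - contents: list of child nodes; all direct children of <tag> are stored in a list
    - children: iterator over child nodes; like .contents, used to loop over direct children
    - descendants: iterator over all descendant nodes, used to loop over the whole subtree
2. Upward traversal of the tag tree
    - parent: the node's parent tag
    - parents: iterator over the node's ancestors, used to loop over ancestor nodes
3. Sideways (sibling) traversal of the tag tree
    - next_sibling: the next sibling node in HTML document order (may be a text node rather than a tag)
    - previous_sibling: the previous sibling node in HTML document order (may be a text node rather than a tag)
    - next_siblings: iterator over all following sibling nodes in HTML document order
    - previous_siblings: iterator over all preceding sibling nodes in HTML document order
4. For details, see: https://www.cnblogs.com/mengxiaoleng/p/11585754.html#_label0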
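
As a compact illustration of the three directions listed above, here is a sketch on a small invented document. It also filters with isinstance(..., Tag), because whitespace between tags shows up as text nodes during traversal.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document; the newlines become whitespace-only text nodes.
html = "<body>\n<p>one <b>bold</b></p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Downward: direct children versus the whole subtree.
all_children = list(body.children)
tag_children = [c for c in body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]

# Upward: ancestors of <b>, ending at the BeautifulSoup object itself.
ancestors = [p.name for p in soup.b.parents]

print(len(all_children), len(tag_children), descendant_tags, ancestors)
```

The last ancestor is the BeautifulSoup object, whose name is '[document]', so a loop over .parents never yields None.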

**Downward traversal of the tag tree**

In [15]:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>


In [18]:
# Child nodes at the top of the tag tree
print(soup.contents)

[<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>]


In [19]:
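# Note: .content (without the trailing "s") is not the child list; bs4 treats it as a lookup
# for a child tag named <content>, so this prints None. The child list is soup.body.contents.
print(soup.body.content)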

None
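
The None here comes from the attribute name rather than from the page: on a Tag, an unknown attribute such as .content is treated as a search for a child tag with that name. A minimal sketch with two invented snippets makes the difference from .contents visible.

```python
from bs4 import BeautifulSoup

plain = BeautifulSoup("<body><p>text</p></body>", "html.parser")
lookup_result = plain.body.content   # behaves like plain.body.find("content")
child_list = plain.body.contents     # the actual list of child nodes

# Hypothetical document that really contains a <content> tag.
odd = BeautifulSoup("<body><content>x</content></body>", "html.parser")
found_tag = odd.body.content         # now the same lookup finds a tag

print(lookup_result, child_list, found_tag)
```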


In [20]:
print(soup.head)

<head><title>This is a python demo page</title></head>


In [21]:
for child in soup.body.children:  # iterate over direct children (not all descendants)
    print(child)



<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>




In [22]:
for child in soup.body.descendants:
    print(child)



<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.




**Upward traversal of the tag tree**

In [23]:
soup.title.parent

<head><title>This is a python demo page</title></head>

In [27]:
soup.a.parent

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

In [34]:
for parent in soup.a.parents:
    print('='*20)
    print(parent)
    print('-'*20)
    if parent is None:
        print(0)
    else:
        print(parent.name)

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
--------------------
p
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
--------------------
body
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several py

**Sideways traversal of the tag tree**

In [37]:
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>


In [38]:
print(soup.a.next_sibling)  # the node right after the first <a>; here a text node, not a tag

and 


In [39]:
print(soup.a.next_sibling.next_sibling)  # the sibling after that: the second <a> tag

<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>


In [40]:
print(soup.a.previous_sibling)  # the node right before the first <a> (a text node)

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:



In [41]:
print(soup.a.previous_sibling.previous_sibling)  # nothing precedes that text node, so None

None


In [42]:
# Iterate over the following siblings under the same parent
for sibling in soup.a.next_siblings:
    print(sibling)

and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
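
Because sibling traversal returns text nodes as well as tags, getting "the next tag" usually takes either a type filter or find_next_sibling(). A small sketch on an invented paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph shaped like the course paragraph of the demo page.
html = '<p>Intro: <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling              # the text node ' and '
next_tag = first.find_next_sibling("a")    # skips text nodes, returns the next <a>
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), next_tag["id"], len(tag_siblings))
```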


In [47]:
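# Iterating over the preceding node
print(soup.a.previous_sibling)
for sibling in soup.a.previous_sibling:  # singular: this is one NavigableString, not an iterator of nodes
    print(sibling)  # so the loop walks the string character by character; use previous_siblings to walk nodes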

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

P
y
t
h
o
n
 
i
s
 
a
 
w
o
n
d
e
r
f
u
l
 
g
e
n
e
r
a
l
-
p
u
r
p
o
s
e
 
p
r
o
g
r
a
m
m
i
n
g
 
l
a
n
g
u
a
g
e
.
 
Y
o
u
 
c
a
n
 
l
e
a
r
n
 
P
y
t
h
o
n
 
f
r
o
m
 
n
o
v
i
c
e
 
t
o
 
p
r
o
f
e
s
s
i
o
n
a
l
 
b
y
 
t
r
a
c
k
i
n
g
 
t
h
e
 
f
o
l
l
o
w
i
n
g
 
c
o
u
r
s
e
s
:
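
The character-by-character output comes from looping over previous_sibling (singular), which is a single string node. A minimal sketch of the difference, using an invented paragraph and starting from its second link so that more than one node precedes it:

```python
from bs4 import BeautifulSoup

html = '<p>Intro text <a id="link1">Basic</a> and <a id="link2">Advanced</a></p>'
soup = BeautifulSoup(html, "html.parser")
second = soup.find("a", id="link2")

# Singular: one NavigableString, which iterates like any Python string.
chars = list(second.previous_sibling)

# Plural: an iterator over whole nodes, nearest sibling first.
nodes = list(second.previous_siblings)

print(chars)
print(nodes)
```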





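### Searching HTML content with bs4
1. <>.find_all(name, attrs, recursive, string, **kwargs)

Parameters:
- name: search string for tag names
- attrs: search string for tag attribute values; a specific attribute can be named
- recursive: whether to search all descendants; defaults to True
- string: search string for the text between <> and </>
    - Shorthand:
    - <tag>(..) is equivalent to <tag>.find_all(..)
    - soup(..) is equivalent to soup.find_all(..)
2. Related methods:
- <>.find(): searches and returns only the first result; same parameters as .find_all()
- <>.find_parents(): searches the ancestors and returns a list; same parameters as .find_all()
- <>.find_parent(): returns one result from the ancestors; same parameters as .find()
- <>.find_next_siblings(): searches the following siblings and returns a list; same parameters as .find_all()
- <>.find_next_sibling(): returns one result from the following siblings; same parameters as .find()
- <>.find_previous_siblings(): searches the preceding siblings and returns a list; same parameters as .find_all()
- <>.find_previous_sibling(): returns one result from the preceding siblings; same parameters as .find()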
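
The related methods listed above are not exercised in the cells that follow, so here is a rough sketch of how they behave. The HTML is a cut-down imitation of the demo page and the example.com URLs are placeholders.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>Demo</b></p>
<p class="course">
  <a href="http://example.com/basic" class="py1" id="link1">Basic Python</a>
  <a href="http://example.com/advanced" class="py2" id="link2">Advanced Python</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")
shorthand = soup("a")                        # calling the object is find_all()
first_link = soup.find("a")                  # first match only, or None
limited = soup.find_all("a", limit=1)        # a list capped at one element

parent_p = first_link.find_parent("p")       # nearest ancestor that is a <p>
later_links = first_link.find_next_siblings("a")
second_link = later_links[0]
earlier_link = second_link.find_previous_sibling("a")

print(len(all_links), parent_p["class"], second_link["id"], earlier_link["id"])
```

Note that find() returns None when nothing matches, while find_all() returns an empty list, so the two need different checks before use.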

In [49]:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo,'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>


In [50]:
# Search by tag name
soup.find_all('a')

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

In [51]:
soup.find_all(['a', 'p'])

[<p class="title"><b>The demo python introduces several python courses.</b></p>,
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>,
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

In [53]:
# Search by tag attribute value (a specific attribute can be named); start from all <p> tags
soup.find_all('p')

[<p class="title"><b>The demo python introduces several python courses.</b></p>,
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

In [54]:
soup.find_all('p', 'course')  # a string as the second argument is matched against the CSS class

[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

In [55]:
soup.find_all(id='link')  # exact match on id; no tag has the id 'link', hence the empty list

[]
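
The empty list is expected, since a plain string must equal the attribute value. For a partial match, a compiled regular expression can be passed instead. A short sketch on an invented snippet:

```python
import re
from bs4 import BeautifulSoup

html = '<p><a id="link1">Basic</a> <a id="link2">Advanced</a> <a id="other">Other</a></p>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(id="link")                # no id equals 'link'
partial = soup.find_all(id=re.compile("link"))  # any id containing 'link'

print(exact, [a["id"] for a in partial])
```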

In [56]:
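# recursive: whether to search all descendants; defaults to True
soup.find_all('p', recursive=False)  # only direct children of soup are checked, and <p> sits deeper, hence []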

[]
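
With recursive=False the result depends on where the search starts, which the following sketch tries to make concrete on a small invented document.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>one</p><div><p>nested</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

from_root = soup.find_all("p", recursive=False)       # soup's only child is <html>
from_body = soup.body.find_all("p", recursive=False)  # direct children of <body>
everywhere = soup.body.find_all("p")                  # default: all descendants

print(len(from_root), len(from_body), len(everywhere))
```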

In [57]:
# string: search string for the text between <> and </>
soup.find_all(string="Basic Python")  # a plain string must match the whole text exactly

['Basic Python']
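
Text searches follow the same rule: a plain string must match the whole text node, while a compiled regular expression matches any node that contains the pattern. The results are strings, and .parent leads back to the enclosing tag. A sketch with invented HTML:

```python
import re
from bs4 import BeautifulSoup

html = "<p>Courses: <a>Basic Python</a> and <a>Advanced Python</a></p>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Python")                # no text node is exactly 'Python'
partial = soup.find_all(string=re.compile("Python"))  # text nodes containing 'Python'
owners = [s.parent.name for s in partial]             # the tags holding those strings

print(exact, partial, owners)
```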

### Hands-on: targeted scraping of a Chinese university ranking


In [58]:
import requests
from bs4 import BeautifulSoup
import bs4

In [70]:
r = requests.get('http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html')
r.encoding = 'utf-8'  # avoid garbled Chinese text
demo = r.text
soup = BeautifulSoup(demo,'html.parser')
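
The request above assumes the page is reachable and returns a normal response. A slightly more defensive fetch might look like the sketch below; fetch_html is a hypothetical helper name, and the timeout value and the empty-string fallback are arbitrary choices, not something taken from the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # turn HTTP error statuses into exceptions
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException:
        return ""

# A malformed URL fails before any network traffic and yields the fallback.
print(repr(fetch_html("not-a-url")))
```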

In [74]:
print(soup.find_all('tr', 'alt')[0])
print(type(soup.find_all('tr', 'alt')[0]))

<tr class="alt"><td>1</td><td><div align="left">清华大学</div></td><td>北京</td><td>94.6</td><td class="hidden-xs need-hidden indicator5">100.0</td><td class="hidden-xs need-hidden indicator6" style="display: none;">98.30%</td><td class="hidden-xs need-hidden indicator7" style="display: none;">1589319</td><td class="hidden-xs need-hidden indicator8" style="display: none;">48698</td><td class="hidden-xs need-hidden indicator9" style="display: none;">1.512</td><td class="hidden-xs need-hidden indicator10" style="display: none;">1810</td><td class="hidden-xs need-hidden indicator11" style="display: none;">126</td><td class="hidden-xs need-hidden indicator12" style="display: none;">1697330</td><td class="hidden-xs need-hidden indicator13" style="display: none;">302898</td><td class="hidden-xs need-hidden indicator14" style="display: none;">6.81%</td></tr>
<class 'bs4.element.Tag'>


In [88]:
content = soup.find_all('tr', 'alt')[0]
a = [td.string for td in content]
a[:4]

['1', '清华大学', '北京', '94.6']
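
Iterating over the row directly works here because the row contains nothing but td tags. If the markup had whitespace between cells, text nodes would appear in the list too. A somewhat sturdier variant, sketched on a shortened copy of the row above, selects the cells explicitly and reads their text with get_text().

```python
from bs4 import BeautifulSoup

# Shortened copy of the first ranking row, with whitespace added between cells.
row_html = (
    '<tr class="alt"> <td>1</td> <td><div align="left">清华大学</div></td> '
    '<td>北京</td> <td>94.6</td> </tr>'
)
row = BeautifulSoup(row_html, "html.parser").tr

naive = [node.string for node in row]                             # includes whitespace nodes
cells = [td.get_text(strip=True) for td in row.find_all("td")]    # only the <td> cells

print(len(naive), cells)
```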

In [102]:
infos = []
contents = soup.find_all('tr', 'alt')
for content in contents:
    info = {}
    cont = [td.string for td in content]
    info['rank'] = cont[0]
    info['school'] = cont[1]
    info['score'] = cont[3]
    infos.append(info)

In [104]:
print('Rank\tSchool\tScore')
for info in infos[:30]:
    print(info['rank'], '\t', info['school'], '\t', info['score'])

Rank	School	Score
1 	 清华大学 	 94.6
2 	 北京大学 	 76.5
3 	 浙江大学 	 72.9
4 	 上海交通大学 	 72.1
5 	 复旦大学 	 65.6
6 	 中国科学技术大学 	 60.9
7 	 华中科技大学 	 58.9
7 	 南京大学 	 58.9
9 	 中山大学 	 58.2
10 	 哈尔滨工业大学 	 56.7
11 	 北京航空航天大学 	 56.3
12 	 武汉大学 	 56.2
13 	 同济大学 	 55.7
14 	 西安交通大学 	 55.0
15 	 四川大学 	 54.4
16 	 北京理工大学 	 54.0
17 	 东南大学 	 53.6
18 	 南开大学 	 52.8
19 	 天津大学 	 52.3
20 	 华南理工大学 	 52.0
21 	 中南大学 	 50.3
22 	 北京师范大学 	 49.7
23 	 山东大学 	 49.1
23 	 厦门大学 	 49.1
25 	 吉林大学 	 48.9
26 	 大连理工大学 	 48.6
27 	 电子科技大学 	 48.4
28 	 湖南大学 	 48.1
29 	 苏州大学 	 47.3
30 	 西北工业大学 	 46.7
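
Tab-separated output drifts out of alignment because the school names differ in length. One common workaround, sketched here as an assumption rather than as part of the original notebook, is to pad the name column with the full-width space chr(12288) so that Chinese text lines up. The function name format_row and the column widths are arbitrary.

```python
def format_row(rank, school, score, width=10):
    """Center each field; pad the school name with full-width spaces."""
    pad = chr(12288)  # full-width space, as wide as a Chinese character
    return "{0:^6}{1:{3}^{4}}{2:^8}".format(rank, school, score, pad, width)

# Sample values copied from the first rows of the output above.
rows = [("1", "清华大学", "94.6"), ("6", "中国科学技术大学", "60.9")]
for rank, school, score in rows:
    print(format_row(rank, school, score))
```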

