## BeautifulSoup

### Parsers

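| Parser | Usage | Advantages | Disadvantages |
|--|--|--|--|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed markup | Less tolerant of malformed markup in Python versions before 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed markup | Requires an external C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires an external C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant of malformed markup; parses the document the way a browser does; produces valid HTML5 | Slow; requires an external Python package |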
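
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that. The helper name `make_soup` and the preference order are assumptions made for illustration; they are not part of BeautifulSoup itself.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Assumed preference order for this sketch: fastest parser first,
# with the built-in html.parser as the last resort.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in PREFERRED_PARSERS:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello</p>")
print(used, soup.p.string)
```

Different parsers can build different trees from the same malformed markup, so pinning one parser explicitly is the safer choice when results must be reproducible.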

### Basic usage

In [19]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

In [20]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml') # parse with the lxml parser
print(soup.prettify()) # pretty-print the markup, filling in missing closing tags
print(soup.title.string) # get the text inside the <title> tag

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story


### Tag selectors

#### Selecting elements

In [21]:
print(soup.p, soup.a)
print(soup.p.name, soup.p.attrs['name'], soup.p['name']) # tag name, then the attribute value accessed two ways
print(soup.p.string) # text content
print(soup.p.contents) # direct children of the <p> tag

<p class="title" name="dromouse"><b>The Dormouse's story</b></p> <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
p dromouse dromouse
The Dormouse's story
[<b>The Dormouse's story</b>]
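
The `.string` attribute used above only has a value when a tag has exactly one child, and that child may be an HTML comment. The following is a small sketch of those edge cases. The markup in `snippet` is made up for illustration, and `html.parser` is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: one <p> with mixed children, one comment-only <a>.
snippet = (
    '<div><p>Once upon a time '
    '<a id="link1"><!-- Elsie --></a> and <b>Lacie</b></p></div>'
)
soup = BeautifulSoup(snippet, "html.parser")

# A tag with a single text child: .string is that text.
print(soup.b.string)

# A tag with several children: .string is None.
print(soup.p.string)

# A tag whose only child is a comment: .string is a Comment object.
print(isinstance(soup.a.string, Comment))

# get_text() skips comments and concatenates the remaining text.
print(repr(soup.a.get_text()))
print(soup.p.get_text(" ", strip=True))
```

When the structure of a tag is not known in advance, `get_text()` is usually the more predictable way to read its text.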


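### Standard selectors
#### find_all(name, attrs, recursive, text, **kwargs) and find(name, attrs, recursive, text, **kwargs)
Both methods search the document by tag name, attributes, or text content. find_all returns a list of every match; find returns only the first match, or None if nothing matches. Since Beautiful Soup 4.4.0 the text argument is called string.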

In [22]:
print(soup.find_all('a'))
print(soup.find_all('a')[0].string)
print(soup.find('a'))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 Elsie 
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
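
The calls above filter only by tag name. As a rough illustration of the other filters named in the signature, the sketch below searches by attribute, class, regular expression, and text. It repeats a shortened copy of the sample markup so that it runs on its own, and it uses `html.parser` and the `string` keyword, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by an attribute dictionary.
by_id = soup.find_all("a", attrs={"id": "link2"})

# Filter by CSS class; the keyword is class_ because class is reserved in Python.
by_class = soup.find_all(class_="sister")

# Attribute values can also be matched with a compiled regular expression.
by_href = soup.find_all("a", href=re.compile(r"lacie|tillie"))

# Filter by the tag's text content.
by_text = soup.find_all("a", string="Lacie")

# Stop after the first two matches.
first_two = soup.find_all("a", limit=2)

# find() returns None instead of an empty list when nothing matches.
missing = soup.find("a", id="link9")

print([a["id"] for a in by_class])
print([a["id"] for a in by_href])
print(by_id == by_text, len(first_two), missing)
```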


### CSS selectors
Pass a CSS selector string directly to select() to get a list of the matching elements.

In [23]:
ls = soup.select('a')
for a in ls:
    print(a['href'], a.get_text())

http://example.com/elsie 
http://example.com/lacie Lacie
http://example.com/tillie Tillie
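
Selectors more specific than a bare tag name work the same way. The sketch below tries a few common selector forms against a shortened copy of the sample markup. It assumes a Beautiful Soup release that includes `select_one()`, and it uses `html.parser` to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Child combinator: <a> tags directly inside <p class="story">.
story_links = soup.select("p.story > a")

# ID selector; select() still returns a list.
link2 = soup.select("#link2")

# Attribute selector: href ending in "tillie".
tillie = soup.select('a[href$="tillie"]')

# select_one() returns the first match, or None when nothing matches.
first_sister = soup.select_one("a.sister")
nothing = soup.select_one("a.missing")

print([a["id"] for a in story_links])
print(link2[0].get_text(), tillie[0]["href"])
print(first_sister["id"], nothing)
```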
