# Using Parsing Libraries

## XPath 

### Common XPath rules

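| Expression | Description |
|:------|:-----|
| nodename | Selects all child nodes of the named node |
| / | Selects direct children of the current node |
| // | Selects descendants of the current node |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |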

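> First import the etree module from lxml. Calling etree.HTML() initializes the HTML text and builds an object that supports XPath queries; etree also repairs malformed HTML automatically.  
> etree.tostring() returns the repaired text as bytes, so call decode() to convert it to str.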

In [None]:
from lxml  import etree
text = '''
<div>
<ul>
<li class = "item-0"><a href = "link1.html">first item</a></li>
<li class = "item-1"><a href = "link2.html">second item</a></li>
<li class = "item-inactive"><a href = "link3.html">third item</a></li>
<li class = "item-1"><a href = "link4.html">fourth item</a></li>
<li class = "item-1"><a href = "link5.html">fifth item</a>
</ul>
</div>

'''
# In the result, the unclosed li node is completed automatically
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

> Parse directly from a text file  
```html = etree.parse('./test.html',etree.HTMLParser())```

In [None]:
from lxml import etree 

html = etree.parse('./test.html',etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))


## Selecting nodes

In [2]:
#example1
from lxml import etree

html = etree.parse('./test.html',etree.HTMLParser())
#print(etree.tostring(html))
# Select all nodes
result = html.xpath('//*')
print(result)

# Select all li nodes
result = html.xpath('//li')
print(result)
print(result[0])

# Get child nodes
# Select every a node that is a direct child of an li node
result = html.xpath('//li/a')
print(result)

# Get descendant nodes
result = html.xpath('//ul//a')
print(result)

# Get the parent node with ..
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

# Get the parent node with the parent:: axis
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)

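# Attribute matching
# Use @ to filter on an attribute
result = html.xpath('//li[@class="item-0"]')
print(result)

# Getting text with text()

# A single / before text() selects only text nodes that are direct children of the li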
result = html.xpath('//li[@class="item-0"]/text()')
print(result)

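# Select the a child first, then take its text
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

# Get all text inside the li, including text in descendant nodes
result = html.xpath('//li[@class="item-0"]//text()')
print(result)

# Getting attributes: read an attribute value from the selected nodes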
result = html.xpath('//li/a/@href')
print(result)

# Matching multi-valued attributes with contains(): the class attribute here holds two values, li and li-first, so an exact match on "li" finds nothing
text ='''
<li class = "li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
result = html.xpath('//li[contains(@class,"li")]/a/text()')
print(result)

# Matching on several attributes at once
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
print(result)

# Selecting by position
html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

# Node axis selection

[<Element html at 0x7fbc6008d408>, <Element body at 0x7fbc60302608>, <Element div at 0x7fbc603187c8>, <Element ul at 0x7fbc6008d448>, <Element li at 0x7fbc6008d488>, <Element a at 0x7fbc6008d508>, <Element li at 0x7fbc6008d548>, <Element a at 0x7fbc6008d588>, <Element li at 0x7fbc6008d5c8>, <Element a at 0x7fbc6008d4c8>, <Element li at 0x7fbc6008d608>, <Element a at 0x7fbc6008d648>, <Element li at 0x7fbc6008d688>, <Element a at 0x7fbc6008d6c8>]
[<Element li at 0x7fbc6008d488>, <Element li at 0x7fbc6008d548>, <Element li at 0x7fbc6008d5c8>, <Element li at 0x7fbc6008d608>, <Element li at 0x7fbc6008d688>]
<Element li at 0x7fbc6008d488>
[<Element a at 0x7fbc603187c8>, <Element a at 0x7fbc60302608>, <Element a at 0x7fbc60302548>, <Element a at 0x7fbc6008d448>, <Element a at 0x7fbc6008d508>]
[<Element a at 0x7fbc603187c8>, <Element a at 0x7fbc60302608>, <Element a at 0x7fbc60302548>, <Element a at 0x7fbc6008d448>, <Element a at 0x7fbc6008d508>]
['item-1']
['item-1']
[<Element li at 0x7fbc603
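
XPath also supports node axes such as `ancestor`, `attribute`, `child` and `following-sibling`. A minimal sketch of how they might be used, assuming a trimmed copy of the sample list from the start of this section (the contents of `./test.html` are not shown, so the markup below is an assumption):

```python
from lxml import etree

# Assumed markup: a shortened version of the sample list used earlier.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# ancestor axis: every ancestor of the first li, in document order
ancestors = [el.tag for el in html.xpath('//li[1]/ancestor::*')]
# attribute axis: all attribute values of the first li
attributes = html.xpath('//li[1]/attribute::*')
# child axis, combined with a predicate on the child node
children = html.xpath('//li[1]/child::a[@href="link1.html"]/text()')
# following-sibling axis: class values of the li nodes after the first one
siblings = html.xpath('//li[1]/following-sibling::*/@class')

print(ancestors)
print(attributes)
print(children)
print(siblings)
```

The `html` and `body` entries in `ancestors` appear because `etree.HTML()` wraps the fragment in a full document.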

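## Beautiful Soup
> A parsing tool that handles encoding conversion automatically.

### Initializing Beautiful Soup
> On initialization, Beautiful Soup automatically corrects the document's formatting and completes missing tags.  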

| Parser | Usage | Advantages | Disadvantages |
|:------|:-----|:-----|:-----|
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; decent speed; lenient with malformed documents | Less lenient in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Fast; lenient with malformed documents | Requires an external C library |
| lxml XML parser | `BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")` | Fast; the only parser that supports XML | Requires an external C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Most lenient; parses documents the way a browser does; produces HTML5 output | Slow; requires an external Python package |
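
The parsers can repair the same broken markup differently. A small sketch comparing two of them, assuming both `bs4` and `lxml` are installed (the `broken` string is a made-up input, not taken from the notes):

```python
from bs4 import BeautifulSoup

# Hypothetical malformed input: a stray closing tag and no html/body wrapper.
broken = "<a></p>"

builtin_soup = BeautifulSoup(broken, "html.parser")
lxml_soup = BeautifulSoup(broken, "lxml")

# html.parser keeps the fragment as a fragment; lxml wraps it in html/body.
has_wrapper_builtin = builtin_soup.html is not None
has_wrapper_lxml = lxml_soup.html is not None

print(builtin_soup)
print(lxml_soup)
print(has_wrapper_builtin, has_wrapper_lxml)
```

Because the resulting trees differ, code that navigates from `soup.html` or `soup.body` may behave differently depending on the parser chosen.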

In [12]:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>','lxml')
print(soup.p.string)

Hello


### Basic usage
> The prettify() method outputs the parsed string with standard indentation.  
> soup.title.string selects the title node of the HTML and reads its text through the string attribute.

In [25]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story


### Selecting elements
> Selected elements are of type bs4.element.Tag

In [28]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html,'lxml')

print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
# Only the first p node is matched
print(soup.p)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>


In [30]:
# Get the tag name
print(soup.title.name)

title


In [34]:
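# Get attributes
# attrs returns a dictionary
print(soup.p.attrs)
print(soup.p.attrs['class'])
# A shorter form: index the tag directly
# Single-valued attributes such as id come back as a str; multi-valued attributes such as class come back as a list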
print(soup.p['class'])

{'class': ['title']}
['title']
['title']
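
To make the difference between single-valued and multi-valued attributes concrete, here is a short sketch. The markup is invented for illustration, and `html.parser` is used only so the example has no extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with one single-valued and one multi-valued attribute.
soup = BeautifulSoup('<p id="intro" class="lead note">hi</p>', "html.parser")
p = soup.p

tag_id = p["id"]          # single-valued attribute: a plain str
classes = p["class"]      # multi-valued attribute: a list of str
missing = p.get("title")  # .get() returns None instead of raising KeyError

print(tag_id, classes, missing)
```

Using `.get()` is a reasonable default when an attribute may be absent, since `p["title"]` would raise `KeyError` here.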


In [37]:
# Get the text content
print(soup.p.string)

The Dormouse's story
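
Note that `.string` only works when a tag has a single child; for a tag with mixed content it returns `None`. A minimal sketch with made-up markup, showing `get_text()` as one possible alternative:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph mixing text and a nested tag.
soup = BeautifulSoup("<p>one <b>two</b> three</p>", "html.parser")

single = soup.b.string         # b has exactly one child, so this is its text
mixed = soup.p.string          # p has several children, so this is None
full_text = soup.p.get_text()  # concatenates all text inside p

print(single, mixed, full_text)
```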


In [40]:
# Nested selection
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story


In [1]:
# Related-node selection
# Children and descendants

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>


<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

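# contents returns a list of the direct children only; a node nested deeper (for example a span
# inside an a node) would not be listed as a separate item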
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

# The children attribute yields the same nodes
# children returns an iterator rather than a list
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)
    
# Get all descendant nodes
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, ',\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\nand they lived at the bottom of a well.']
<list_iterator object at 0x7f2ebabda978>
0 Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 ,

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ;
and they lived at the bottom of a well.
<generator object descendants at 0x7f2ebabdc780>
0 Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
2 Elsie
3 ,

4 <a class="sister" href="http://example

In [3]:
# Get parent and ancestor nodes

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>


<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html,'lxml')
# Get the direct parent node
# Prints the parent tag together with everything inside it
print(soup.a.parent)

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


In [9]:
# Get ancestor nodes

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
"""

soup = BeautifulSoup(html,'lxml')
print(soup.a.parents)
# parents returns a generator
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

<generator object parents at 0x7f2eba93b9e8>
<class 'generator'>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body>), (2, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>), (3, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>)]


In [19]:
# Get sibling nodes

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>


<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
#next_sibling&previous_sibling
print('next sibling ',soup.a.next_sibling)
print('prev sibling ',soup.a.previous_sibling)

#next_siblings&previous_siblings
print('next siblings',list(enumerate(soup.a.next_siblings)))
print('previous siblings',list(enumerate(soup.a.previous_siblings)))

next sibling  hello

prev sibling  Once upon a time there were three little sisters; and their names were

next siblings [(0, 'hello\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' and\n'), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, ';\nand they lived at the bottom of a well.')]
previous siblings [(0, 'Once upon a time there were three little sisters; and their names were\n')]


In [23]:
# Extracting information
# string and attrs also work on the nodes reached above, to get their text or attributes

print(soup.a.next_sibling.string)
print(list(soup.a.parents)[0].attrs['class'])

hello

['story']


### Method selectors

####  find_all()

```find_all( name , attrs , recursive , text , **kwargs )```  
The find_all() method searches all tag descendants of the current tag and checks whether each one matches the filter. A few examples follow.  

##### Filter types
* String  
Passing a string finds tags whose name matches the string exactly  
```
soup.find_all('b')
# [<b>The Dormouse's story</b>]
```  
* Regular expression  
bs4 matches tag names against the regular expression using its search() method  
```
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
```  
* List  
bs4 returns everything that matches any item in the list  
```
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
* True  
True matches any value: it finds every tag but does not return string nodes
```
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
```
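* Function  
If no other filter fits, define a function that takes a single element as its argument and returns True when the element matches and False otherwise  
```
def has_class_but_no_id(tag):
    # Match tags that define a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)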
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
```


##### A few simple find_all() examples

In [44]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all("title"))
# [<title>The Dormouse's story</title>]

print(soup.find_all("p", "title"))
# [<p class="title"><b>The Dormouse's story</b></p>]

print(soup.find_all("a"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find_all(id="link2"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
print(soup.find(text=re.compile("sisters")))
# u'Once upon a time there were three little sisters; and their names were\n'



[<title>The Dormouse's story</title>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Once upon a time there were three little sisters; and their names were



##### The name parameter  
Finds all tags named name; string objects are ignored automatically  


In [33]:
print(soup.find_all("title"))
# [<title>The Dormouse's story</title>]

[<title>The Dormouse's story</title>]


##### Keyword parameters  
If a keyword argument is not one of find_all()'s built-in parameter names, it is treated as a filter on the tag attribute with that name

In [36]:
print(soup.find_all(id='link2'))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# If you pass an href argument, Beautiful Soup filters on each tag's "href" attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# The next example finds every tag that has an id attribute, whatever its value:
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Passing several keyword arguments filters on several attributes at once:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Some attribute names cannot be used as keyword arguments, such as HTML5 data-* attributes:
# they can still be searched by passing a dictionary to the attrs parameter of find_all():
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]


[<div data-foo="value">foo!</div>]

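##### Searching by CSS class  
Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. Since Beautiful Soup 4.1.1 you can search by CSS class with the class_ parameter: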

In [39]:
print(soup.find_all("a", class_="sister"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


class_ accepts the same kinds of filter: a string, a regular expression, a function, or True

In [53]:
print(soup.find_all(class_=re.compile("itl")))

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
print(soup.find_all(class_=has_six_characters))

[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


The class attribute of a tag is multi-valued, so when searching by CSS class you can match against any single one of the tag's classes

In [55]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

# Match the exact string value of the class attribute
print(css_soup.find_all("p",class_="body strikeout"))

[<p class="body strikeout"></p>]


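##### The text parameter  
The text parameter searches the string content of the document; it accepts a string, a regular expression, a list, or True. In Beautiful Soup 4.4.0 and later the same parameter is also available under the name string. 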


In [60]:
print(soup.find_all(text="Elsie"))

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

['Elsie']


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

##### The limit parameter  
The limit parameter caps the number of results returned

In [62]:
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

##### The recursive parameter
By default find_all() searches all descendants of the tag; with recursive=False it searches only the direct children
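
A small sketch of the difference, using a made-up document and `html.parser` (both are assumptions chosen to keep the example self-contained):

```python
from bs4 import BeautifulSoup

# Hypothetical document: title is a grandchild of html, not a direct child.
markup = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Default search walks all descendants of <html>.
deep = soup.html.find_all("title")
# recursive=False looks only at the direct children of <html>.
shallow = soup.html.find_all("title", recursive=False)
# The direct children themselves can be listed the same way.
direct_children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(deep, shallow, direct_children)
```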

##### find()
```find( name , attrs , recursive , text , **kwargs )```  
find() returns only the first tag that matches, whereas find_all() returns every match  
When nothing matches, find_all() returns an empty list and find() returns None

In [64]:
soup.find('title')
# <title>The Dormouse's story</title>

<title>The Dormouse's story</title>
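
Because find() returns None when nothing matches, chaining an attribute access onto it can raise AttributeError. A minimal sketch of guarding against that, with an invented document that contains no table:

```python
from bs4 import BeautifulSoup

# Hypothetical document without any <table> tag.
soup = BeautifulSoup("<p>no tables here</p>", "html.parser")

missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # empty list when nothing matches

# Check for None before touching the result of find().
caption = missing_one.get_text() if missing_one is not None else ""

print(missing_one, missing_all, repr(caption))
```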

##### find_parents() and find_parent()
```
find_parents( name , attrs , recursive , text , **kwargs )

find_parent( name , attrs , recursive , text , **kwargs )


```  
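Both search the ancestors of the current node: find_parents() returns every matching ancestor, while find_parent() returns only the closest matching one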

In [67]:
a_string = soup.find(text="Lacie")
a_string
# u'Lacie'

a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>

a_string.find_parents("p", class_="title")
# []

[]

##### find_next_siblings() & find_next_sibling()
```
find_next_siblings( name , attrs , recursive , text , **kwargs )

find_next_sibling( name , attrs , recursive , text , **kwargs )
```  
The former returns all matching siblings that come after the node; the latter returns only the first one  
##### find_previous_siblings() and find_previous_sibling()  
```
find_previous_siblings( name , attrs , recursive , text , **kwargs )

find_previous_sibling( name , attrs , recursive , text , **kwargs )
```  
The former returns all matching siblings that come before the node; the latter returns only the closest one  
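
A short sketch of the four sibling methods, assuming a simplified version of the three-sisters paragraph (the markup and the use of `html.parser` are choices made for this example):

```python
from bs4 import BeautifulSoup

# Simplified paragraph with three sibling links.
markup = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> '
          'and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("a", id="link1")
last = soup.find("a", id="link3")

# Siblings after the first link, in document order.
next_ids = [a["id"] for a in first.find_next_siblings("a")]
next_id = first.find_next_sibling("a")["id"]

# Siblings before the last link, closest first.
prev_ids = [a["id"] for a in last.find_previous_siblings("a")]
prev_id = last.find_previous_sibling("a")["id"]

print(next_ids, next_id, prev_ids, prev_id)
```

Passing the tag name `"a"` skips the text nodes between the links, which the plain `next_sibling` attribute would return.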
#####  find_all_next() and find_next()  
```
find_all_next( name , attrs , recursive , text , **kwargs )

find_next( name , attrs , recursive , text , **kwargs )
```
##### find_all_previous() and find_previous()   
```
find_all_previous( name , attrs , recursive , text , **kwargs )

find_previous( name , attrs , recursive , text , **kwargs )


```
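
These four methods search the elements that come after or before a node in document order, not just its siblings. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical document: one heading followed by three paragraphs.
markup = "<div><h1>Title</h1><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(markup, "html.parser")

h1 = soup.h1
last_p = soup.find_all("p")[-1]

# Elements after the heading, in document order.
first_after = h1.find_next("p").string
all_after = [p.string for p in h1.find_all_next("p")]

# Elements before the last paragraph, closest first.
heading_before = last_p.find_previous("h1").string
all_before = [p.string for p in last_p.find_all_previous("p")]

print(first_after, all_after, heading_before, all_before)
```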

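##### CSS selectors 
Call .select() on a Tag or BeautifulSoup object with a CSS selector string; it returns a list of the matching tags 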

In [69]:
print(soup.select("title"))

[<title>The Dormouse's story</title>]


Find tags by descending through tag names level by level

In [71]:
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]

[<title>The Dormouse's story</title>]

Find the direct child tags of a given tag

In [72]:
soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# nth-of-type(2) matches each a that is the second a element among its parent's children
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

[]

Find sibling tags 


In [73]:
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find by CSS class name

In [74]:
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by tag id

In [75]:
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find by whether an attribute is present

In [79]:
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('p[class]'))

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]


Find by attribute value


In [80]:
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Find by language setting


In [81]:
multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup, 'lxml')
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]

[<p lang="en">Hello</p>,
 <p lang="en-us">Howdy, y'all</p>,
 <p lang="en-gb">Pip-pip, old fruit</p>]
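
The tags returned by select() are ordinary Tag objects, so attributes and text can be read from them in the usual way. A minimal sketch, assuming a shortened version of the sisters paragraph; `select_one()` is the variant that returns only the first match:

```python
from bs4 import BeautifulSoup

# Shortened sample: two of the sister links inside one paragraph.
markup = ('<p class="story">'
          '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
          '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
          '</p>')
soup = BeautifulSoup(markup, "html.parser")

# Collect (id, href, text) for every link with class "sister".
links = [(a["id"], a["href"], a.get_text()) for a in soup.select("a.sister")]

# select_one() returns the first match, or None if nothing matches.
first = soup.select_one("p.story > a")

print(links)
print(first["id"])
```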