# BeautifulSoup

BeautifulSoup converts a complex HTML document into a tree structure. Every node is a Python object of one of the following four types:

1. Tag

2. NavigableString

3. BeautifulSoup

4. Comment
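As a quick illustration of the four types, here is a minimal sketch on a tiny inline document. The HTML string is made up for this example and is not the `baidu.html` file used in the rest of the notebook.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical inline document, used only for illustration.
html = '<p class="intro">hello</p><b><!--note--></b>'
soup = BeautifulSoup(html, "html.parser")

# Record the class name of each kind of node.
node_types = {
    "document": type(soup).__name__,
    "tag": type(soup.p).__name__,
    "text": type(soup.p.string).__name__,
    "comment": type(soup.b.string).__name__,
}
print(node_types)

# Comment is a subclass of NavigableString, and the document object is itself a Tag.
comment_is_string = issubclass(Comment, NavigableString)
document_is_tag = isinstance(soup, Tag)
```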

In [1]:
from bs4 import BeautifulSoup

## 1.Tag

Accessing a tag by name (for example `bs.title`) returns only the first matching tag in the document.

In [2]:
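with open('./baidu.html',"rb") as file:#the with block closes the file automatically
    html=file.read()
bs=BeautifulSoup(html,"html.parser")#other parsers such as "lxml" or "html5lib" can be used if installed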
print(bs.title)
print(bs.a)
print(bs.head)
print(type(bs.head))

<title>百度一下·你就知道</title>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<head>
<meta content="text/html; charset=UTF-8"/><meta http-equiv="content-type">
<meta content="IE=edge,http-equiv=" x-ua-compatible"=""/>
<meta content="always" name="referrer"/>
<link href="https://ss1.dbstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下·你就知道</title>
</meta></head>
<class 'bs4.element.Tag'>


## 2.NavigableString

The text content inside a tag (a string).

In [3]:
print(bs.title.string)
print(type(bs.title.string))

百度一下·你就知道
<class 'bs4.element.NavigableString'>


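The attributes of a tag (a dictionary of key-value pairs). Note that `.attrs` belongs to the `Tag` object, not to `NavigableString`.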

In [4]:
print(bs.a.attrs)
print(type(bs.a.attrs))

{'class': ['mnav'], 'href': 'https://news.baidu.com', 'name': 'tj_trnews'}
<class 'dict'>
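Individual attributes can also be read directly from the tag. The sketch below uses a made-up one-link document modelled on the output above, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document, used only for illustration.
soup = BeautifulSoup('<a class="mnav bri" href="https://news.baidu.com">news</a>', "html.parser")
link = soup.a

href = link["href"]          # dictionary-style access; raises KeyError if the attribute is missing
missing = link.get("title")  # .get() returns None when the attribute is absent
classes = link["class"]      # class is multi-valued, so it comes back as a list
print(href, missing, classes)
```

Using `.get()` is the safer choice when an attribute may not be present on every tag.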


## 3.BeautifulSoup

Represents the whole document.

In [5]:
print(type(bs))

<class 'bs4.BeautifulSoup'>


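## 4.Comment

Represents a comment. It is a special kind of `NavigableString`, and its value does not include the comment markers (`<!--` and `-->`).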

In [6]:
print(bs.a.string)
print(type(bs.a.string))

新闻
<class 'bs4.element.Comment'>
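Because a comment prints just like ordinary text, it can be useful to check the type before trusting `.string`. A minimal sketch, assuming a made-up two-link document:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical document: the first link holds a comment, the second holds real text.
soup = BeautifulSoup("<a><!--news--></a><a>news</a>", "html.parser")

# True where the link's only child is a comment rather than visible text.
flags = [isinstance(link.string, Comment) for link in soup.find_all("a")]
print(flags)
```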


---
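## Traversing the document

Walking parent and child nodes through the tree structure is usually less convenient than searching directly.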

In [7]:
print(bs.head.contents)#.contents returns the child nodes as a list

['\n', <meta content="text/html; charset=UTF-8"/>, <meta http-equiv="content-type">
<meta content="IE=edge,http-equiv=" x-ua-compatible"=""/>
<meta content="always" name="referrer"/>
<link href="https://ss1.dbstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下·你就知道</title>
</meta>]
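Besides `.contents`, a tag exposes `.children`, `.descendants` and `.parent`. The following is a small sketch on a made-up fragment, not on the page above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment, used only for illustration.
soup = BeautifulSoup("<div><p>one</p><p>two <b>bold</b></p></div>", "html.parser")
div = soup.div

# .children yields direct children only; .descendants walks the whole subtree.
child_names = [child.name for child in div.children if isinstance(child, Tag)]
descendant_names = [node.name for node in div.descendants if isinstance(node, Tag)]

# .parent moves one level up the tree.
parent_name = soup.b.parent.name
print(child_names, descendant_names, parent_name)
```

The `isinstance(..., Tag)` filter skips text nodes, which also appear among children and descendants.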


## Searching the document

# find_all

## 1.Exact tag-name match

String filter: finds the tags whose name exactly matches the given string.

In [8]:
t_list=bs.find_all("a")#every tag whose name is exactly "a"
print(t_list)

[<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>, <a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="https://map.baidu.com" name="tj_trmap">地图</a>, <a class="mnav" href="https://v.baidu.com" name="tj_trvideo">视频</a>, <a class="mnav" href="https://tieba.baidu.com" name="tj_trtieba">贴吧</a>, <a class="bri" href="https://www.baidu.com/more/" name="tj_briicon">style="..."更多产品</a>]


## 2.Regular expression search: tag names are matched with the pattern's search() method

In [9]:
import re
t_list=bs.find_all(re.compile("a"))#tags whose name contains "a"
for i in t_list:#print each match on its own line
    print(i)

<head>
<meta content="text/html; charset=UTF-8"/><meta http-equiv="content-type">
<meta content="IE=edge,http-equiv=" x-ua-compatible"=""/>
<meta content="always" name="referrer"/>
<link href="https://ss1.dbstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下·你就知道</title>
</meta></head>
<meta content="text/html; charset=UTF-8"/>
<meta http-equiv="content-type">
<meta content="IE=edge,http-equiv=" x-ua-compatible"=""/>
<meta content="always" name="referrer"/>
<link href="https://ss1.dbstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下·你就知道</title>
</meta>
<meta content="IE=edge,http-equiv=" x-ua-compatible"=""/>
<meta content="always" name="referrer"/>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</
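Since the pattern is applied with `search()`, it matches anywhere in the tag name, which is why `head` and `meta` appear in the output. A minimal sketch on a made-up document, contrasting this with an anchored pattern:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
soup = BeautifulSoup("<head><meta/></head><body><a>x</a></body>", "html.parser")

# Unanchored: any tag name containing the letter "a".
contains_a = [t.name for t in soup.find_all(re.compile("a"))]

# Anchored with ^ and $: only the tag named exactly "a".
exactly_a = [t.name for t in soup.find_all(re.compile("^a$"))]
print(contains_a, exactly_a)
```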

## 3.Function: pass in a function and search by the condition it defines (rarely used)
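A function filter receives each tag and returns `True` for the ones to keep. The sketch below is one possible example; the document and the helper name `has_href_but_no_id` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
soup = BeautifulSoup(
    '<a href="/x" id="first">x</a><a href="/y">y</a><p id="p1">z</p>',
    "html.parser",
)

def has_href_but_no_id(tag):
    """Return True for tags that carry an href attribute but no id."""
    return tag.has_attr("href") and not tag.has_attr("id")

# find_all calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(has_href_but_no_id)
print(matches)
```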
---

# kwargs arguments

In [10]:
t_list=bs.find_all(id="head")#filter by the id attribute
for i in t_list:
    print(i)

<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="https://map.baidu.com" name="tj_trmap">地图</a>
<a class="mnav" href="https://v.baidu.com" name="tj_trvideo">视频</a>
<a class="mnav" href="https://tieba.baidu.com" name="tj_trtieba">贴吧</a>
<a class="bri" href="https://www.baidu.com/more/" name="tj_briicon">style="..."更多产品</a>
</div>
</div>
</div>


In [11]:
t_list=bs.find_all(class_=True)#class_=True matches every tag that has a class attribute
for i in t_list:
    print(i)

<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="https://map.baidu.com" name="tj_trmap">地图</a>
<a class="mnav" href="https://v.baidu.com" name="tj_trvideo">视频</a>
<a class="mnav" href="https://tieba.baidu.com" name="tj_trtieba">贴吧</a>
<a class="bri" href="https://www.baidu.com/more/" name="tj_briicon">style="..."更多产品</a>
</div>
</div>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="https://map.baidu.com" name="tj_trmap">地图</a>
<a class="mnav" href="https://v.baidu.com" name="tj_trvideo">视频</a>
<a class="mnav" href="https://tieba.baidu.com" name="tj_trtieba">贴吧

In [12]:
t_list=bs.find_all(href="https://news.baidu.com")
for i in t_list:
    print(i)

<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>
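Some attributes cannot be passed as keyword arguments: `name` is already the tag-name parameter of `find_all`, and names such as `data-id` are not valid Python identifiers. In those cases the `attrs` dictionary can be used instead. A minimal sketch, assuming a made-up two-link document:

```python
from bs4 import BeautifulSoup

# Hypothetical document modelled on the links above; data-id is an invented attribute.
html = (
    '<a class="mnav" name="tj_trnews" data-id="1">news</a>'
    '<a class="bri" name="tj_briicon">more</a>'
)
soup = BeautifulSoup(html, "html.parser")

# attrs= takes a dict, so any attribute name can be used as a key.
by_name_attr = soup.find_all(attrs={"name": "tj_trnews"})
by_data_attr = soup.find_all(attrs={"data-id": "1"})

# class_ also accepts a specific class name rather than True.
by_class = soup.find_all("a", class_="bri")
print(by_name_attr, by_data_attr, by_class)
```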


# text argument

In [13]:
t_list=bs.find_all(text=[re.compile(r"\d"),"贴吧","地图"])#raw string avoids an invalid-escape warning; newer bs4 releases prefer string= over text=
for i in t_list:
    print(i)

hao123
地图
贴吧
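In Beautiful Soup 4.4.0 and later the same filter is spelled `string=`. The sketch below uses a made-up document with English link text; it also shows that combining `string=` with a tag name returns the tags rather than the strings.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
soup = BeautifulSoup("<a>hao123</a><a>map</a><a>video</a>", "html.parser")

# Without a tag name, the matching strings themselves are returned.
strings = soup.find_all(string=[re.compile(r"\d"), "map"])

# With a tag name, the tags whose .string matches are returned.
tags = soup.find_all("a", string="map")
print(strings, tags)
```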


# limit argument

Limits how many results are returned.

In [14]:
t_list=bs.find_all("a",limit=3)
for i in t_list:
    print(i)

<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
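When only the first match is needed, `find()` is a shorter alternative to `find_all(..., limit=1)`. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
soup = BeautifulSoup("<a>1</a><a>2</a><a>3</a>", "html.parser")

first_two = soup.find_all("a", limit=2)  # stops searching after two matches
first = soup.find("a")                   # the first match, as a single Tag
missing = soup.find("table")             # None when nothing matches
print(first_two, first, missing)
```

Note the difference in the no-match case: `find()` returns `None`, while `find_all()` returns an empty list.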


# CSS selectors

In [15]:
t_list=bs.select('title')#select by tag name
for i in t_list:
    print(i)

<title>百度一下·你就知道</title>


In [16]:
t_list=bs.select('.mnav')#select by class name
for i in t_list:
    print(i)

<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="https://map.baidu.com" name="tj_trmap">地图</a>
<a class="mnav" href="https://v.baidu.com" name="tj_trvideo">视频</a>
<a class="mnav" href="https://tieba.baidu.com" name="tj_trtieba">贴吧</a>


In [17]:
t_list=bs.select('#u1')#select by id
for i in t_list:
    print(i)

<div id="u1">
<a class="mnav" href="https://news.baidu.com" name="tj_trnews"><!--新闻--></a>
<a class="mnav" href="https://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="https://map.baidu.com" name="tj_trmap">地图</a>
<a class="mnav" href="https://v.baidu.com" name="tj_trvideo">视频</a>
<a class="mnav" href="https://tieba.baidu.com" name="tj_trtieba">贴吧</a>
<a class="bri" href="https://www.baidu.com/more/" name="tj_briicon">style="..."更多产品</a>
</div>


In [18]:
t_list=bs.select('a[class="bri"]')#select by attribute
for i in t_list:
    print(i)

<a class="bri" href="https://www.baidu.com/more/" name="tj_briicon">style="..."更多产品</a>


In [19]:
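#returns [] here: judging from the bs.head output earlier, <title> is nested inside an unclosed <meta>, so it is not a direct child of <head>
t_list=bs.select("head > title")#select a direct child tag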
print(t_list)

[]
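The `>` combinator matches direct children only, whereas a space matches descendants at any depth. A minimal sketch on a made-up fragment showing the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the <span> is a grandchild of #box, not a direct child.
soup = BeautifulSoup('<div id="box"><p><span>x</span></p></div>', "html.parser")

direct = soup.select("#box > span")  # child combinator: no match here
nested = soup.select("#box span")    # descendant combinator: finds the span
one = soup.select_one("#box span")   # first match only, or None
print(direct, nested, one)
```

If a child selector unexpectedly returns nothing, trying the descendant form is a quick way to check whether the parser placed the element deeper in the tree than expected.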
