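# Parsing Text Content

This section shows how to parse text content with regular expressions, that is, how to use them to locate and extract the text of interest.

## Regular Expressions

For a tutorial, see: https://github.com/ziishaned/learn-regex/blob/master/translations/README-cn.md

Python's built-in `re` module provides regular expression support.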

In [1]:
import re
print(re.__file__)
help(re)


C:\Users\leo\Anaconda3\lib\re.py
Help on module re:

NAME
    re - Support for regular expressions (RE).

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matches the string 'last'.
    
    The special characters are:
        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the prece

### The Standard Workflow for Regular Expressions

**Step 1: Compile the regular expression**

In [2]:
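# Basic matching
import re

# Target string
text = 'Hello world, abcdefg, 1234567890@163.com,ABCDEFG.1+1=25678@126.com'

# Define the regular expression; this one matches an email address in the target string.
# Note: the unescaped "." matches any character, not only a literal dot.
p = re.compile('1234567890@163.com')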
print(p)

re.compile('1234567890@163.com')
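The pattern above leaves its dots unescaped, so each `.` matches any character. The following is a minimal sketch of the difference, assuming we only want the literal address; the probe string `1234567890@163Xcom` and the variable names are made up for illustration.

```python
import re

text = 'Hello world, abcdefg, 1234567890@163.com,ABCDEFG.1+1=25678@126.com'
probe = '1234567890@163Xcom'   # hypothetical string that is NOT an email address

# An unescaped "." matches any character, so the loose pattern accepts the probe.
loose = re.compile('1234567890@163.com')
# re.escape() escapes the regex metacharacters in a literal string.
strict = re.compile(re.escape('1234567890@163.com'))

loose_hit = loose.search(probe) is not None
strict_hit = strict.search(probe) is not None
found = strict.findall(text)
print(loose_hit, strict_hit, found)
```

Writing the pattern by hand as a raw string, `r'1234567890@163\.com'`, should behave the same way as the escaped version.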


In [None]:
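# Metacharacters
import re

# Target string
text = 'Hello world, abcdefg, 1234567890@163.com,ABCDEFG.1+1=25678@126.com'

# Define the regular expression; this one matches email addresses in the target string,
# starting from the digit 5. A raw string keeps \w intact, and \. matches a literal dot.
pattern = re.compile(r'5\w+@\w+\.com')

help(pattern)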

**Step 2: Match against the target string**

In [3]:
re.__all__

['match',
 'fullmatch',
 'search',
 'sub',
 'subn',
 'split',
 'findall',
 'finditer',
 'compile',
 'purge',
 'template',
 'escape',
 'error',
 'Pattern',
 'Match',
 'A',
 'I',
 'L',
 'M',
 'S',
 'X',
 'U',
 'ASCII',
 'IGNORECASE',
 'LOCALE',
 'MULTILINE',
 'DOTALL',
 'VERBOSE',
 'UNICODE']

In [4]:
import re

# Target string
text = 'Hello world, abcdefg, 1234567890@163.com,ABCDEFG.1+1=25678@126.com'

# Define the regular expression; this one matches email addresses in the target string.
# A raw string keeps \w intact, and \. matches a literal dot.
pattern = re.compile(r'5\w+@\w+\.com')

print("Result of pattern.findall(text):")
match = pattern.findall(text)
print(match)

print("Result of pattern.finditer(text):")
#matchiter = pattern.finditer(text)
for i in pattern.finditer(text):
    print(i)
    print(i.group())

print("Result of pattern.fullmatch(text):")
match =pattern.fullmatch(text)
print(match)

print("Result of pattern.match(text):")
match = pattern.match(text)
print(match)

print("Result of pattern.search(text):")
match = pattern.search(text)
print(match)
print(match.group())

print("Result of pattern.split(text):")
match =pattern.split(text)
print(match)

print("Result of pattern.sub('new-value', text):")
match =pattern.sub('new-value',text)
print(match)
print("Original string:",text)

print("Result of pattern.subn('new-value', text):")
match =pattern.subn('new-value',text)
print(match)
print("Original string:",text)

Result of pattern.findall(text):
['567890@163.com', '5678@126.com']
Result of pattern.finditer(text):
<re.Match object; span=(26, 40), match='567890@163.com'>
567890@163.com
<re.Match object; span=(54, 66), match='5678@126.com'>
5678@126.com
Result of pattern.fullmatch(text):
None
Result of pattern.match(text):
None
Result of pattern.search(text):
<re.Match object; span=(26, 40), match='567890@163.com'>
567890@163.com
Result of pattern.split(text):
['Hello world, abcdefg, 1234', ',ABCDEFG.1+1=2', '']
Result of pattern.sub('new-value', text):
Hello world, abcdefg, 1234new-value,ABCDEFG.1+1=2new-value
Original string: Hello world, abcdefg, 1234567890@163.com,ABCDEFG.1+1=25678@126.com
Result of pattern.subn('new-value', text):
('Hello world, abcdefg, 1234new-value,ABCDEFG.1+1=2new-value', 2)
Original string: Hello world, abcdefg, 1234567890@163.com,ABCDEFG.1+1=25678@126.com
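The match objects shown above can also expose parts of a match through capture groups. Here is a small sketch on the same target string; the group names `user` and `domain` are illustrative choices, not part of the original example.

```python
import re

text = 'Hello world, abcdefg, 1234567890@163.com,ABCDEFG.1+1=25678@126.com'

# Named groups label the parts of each match; \. matches a literal dot.
pattern = re.compile(r'(?P<user>\d+)@(?P<domain>\w+\.com)')

# finditer() yields Match objects, so each group can be read by name.
pairs = [(m.group('user'), m.group('domain')) for m in pattern.finditer(text)]
print(pairs)
```

Here `\d+` is used for the user part only because both sample addresses are numeric; real addresses would need a broader pattern.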


In [None]:
text = "gu12an@163.com; \
        hz1dau@163.com; \
        yaadlou@163.com;\
        mzessdng@163.com; \
        tsli123@163.com; \
        arsheng@163.com; \
        hwfsansg@163.com; \
        xwrrre@163.com"

import re

# A raw string avoids invalid-escape warnings; \w already includes digits, so [\d\w] is just \w.
p = re.compile(r"\w+@\w+\.com")

m = p.findall(text)

print(m)

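**Step 3: Extract the results**

- Results returned as a list are accessed with ordinary list operations;
- Results returned as an iterator are accessed with a loop and the group() method;
- Results returned as a match object are accessed with the group() method.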
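The three access styles listed above can be sketched side by side. This is a minimal illustration on a shortened version of the address list used in this section; the variable names are arbitrary.

```python
import re

text = "gu12an@163.com; hz1dau@163.com; yaadlou@163.com"
p = re.compile(r"\w+@\w+\.com")

# 1. findall() returns a list of strings: use normal list access.
as_list = p.findall(text)

# 2. finditer() returns an iterator of Match objects: loop and call group().
from_iter = [m.group() for m in p.finditer(text)]

# 3. search() returns a single Match object, or None when nothing matches.
first = p.search(text)
first_text = first.group() if first else None
first_span = first.span() if first else None

print(as_list[0], from_iter[-1], first_text, first_span)
```

Checking for `None` before calling `group()` matters because `match()`, `search()` and `fullmatch()` all return `None` when there is no match.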

In [None]:
text = "gu12an@163.com; \
        hz1dau@163.com; \
        yaadlou@163.com;\
        mzessdng@163.com; \
        tsli123@163.com; \
        arsheng@163.com; \
        hwfsansg@163.com; \
        xwrrre@163.com"

import re

# A raw string avoids invalid-escape warnings; \w already includes digits, so [\d\w] is just \w.
p = re.compile(r"\w+@\w+\.com")

m1 = p.findall(text)
m2 = p.finditer(text)
m3 = p.match(text)

print(m1)
print(m2)
print(m3)

### Applications in Web Crawling

#### Example 1

In [None]:
import requests
import re

def fetchUrl(url,queryload = None):
    try:
        headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                   'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
                  }
        session = requests.Session()
        r = session.get(url,params = queryload,headers = headers)
        
        r.raise_for_status()
        r.encoding = r.apparent_encoding

        return r.text

    except requests.exceptions.RequestException as e:
        # RequestException also covers connection errors and timeouts, not only HTTP status errors.
        print(e)       
        return "Some exceptions were raised."

def parseText(htmltext,pattern):
    rlist = pattern.findall(htmltext)
    return rlist

# Fetch the page at the given URL
url = 'https://gupiao.baidu.com/'
htmltext = fetchUrl(url)

# Extract the Chinese text
pattern = re.compile('[\u4e00-\u9fa5]+')
r = parseText(htmltext,pattern)
print(r) 
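The same extraction can be tried offline, without a network request. The sketch below assumes a short hand-written HTML fragment in place of the fetched page; the fragment is hypothetical.

```python
import re

# Hypothetical HTML fragment standing in for a fetched page.
html = '<div><h1>测试文档</h1><p>购买 Office 365，体验最新版本</p></div>'

# The range \u4e00-\u9fa5 covers the commonly used CJK ideographs;
# punctuation such as the full-width comma falls outside it.
chinese = re.compile(r'[\u4e00-\u9fa5]+')
runs = chinese.findall(html)
print(runs)
```

Each run of consecutive Chinese characters becomes one list item, so tags, Latin text, digits and punctuation act as separators.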

#### Example 2

In [None]:
# Extract all stock codes
url = 'TEST_URL_003'

htmltext = fetchUrl(url)
pattern = re.compile(r'\d{6}')
r = parseText(htmltext,pattern)
print(r) 
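One caveat with `\d{6}` is that it also matches six digits inside a longer number. The sketch below shows one possible way to require exactly six digits with lookarounds; the sample string is invented for illustration.

```python
import re

# Hypothetical sample: two six-digit codes and one eleven-digit phone number.
sample = 'sh600000 sz000001 tel 13800138000'

# Plain \d{6} also cuts six digits out of the longer number.
naive = re.findall(r'\d{6}', sample)

# (?<!\d) and (?!\d) require that no digit precedes or follows the six digits.
strict = re.findall(r'(?<!\d)\d{6}(?!\d)', sample)

print(naive)
print(strict)
```

Whether the stricter form is appropriate depends on the page being parsed, since it will skip codes that are embedded directly in longer digit strings.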

In [None]:
url = 'TEST_URL_003'
htmltext = fetchUrl(url)
pattern = re.compile(r"www\.\S+\.com")
r = parseText(htmltext,pattern)
print(r)  

#### Example 3

Convert the following information into dictionary format.

In [7]:
text="""

Host: baike.baidu.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
Sec-Fetch-User: ?1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: navigate
Referer: https://baike.baidu.com/item/%E9%82%AE%E4%BB%B6%E5%88%97%E8%A1%A8/3242524?fr=aladdin
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cookie: BAIDUID=427D8AC45B87E59C7DFFB9D4FC86A80A:FG=1; BIDUPSID=427D8AC45B87E59C7DFFB9D4FC86A80A; PSTM=1562547125; BK_SEARCHLOG=%7B%22key%22%3A%5B%22ID%22%5D%7D; BDUSS=1xVWdObmEzTThqcnFnTmlzbE5KaGlvOXo0dUI5V2hUNG1nWlo0akp2VXc4cGxkRVFBQUFBJCQAAAAAAAAAAAEAAABz6UcBZmx5YXRvAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADBlcl0wZXJdVm; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; yjs_js_security_passport=803e0f0e692af01f00d4b68206081a8691e7d675_1573729072_js; H_PS_PSSID=1435_21091_29568_29220_22160; Hm_lvt_55b574651fcae74b0a9f1cf9c8d7c93a=1573018919,1573483554,1573782269,1573782347; Hm_lpvt_55b574651fcae74b0a9f1cf9c8d7c93a=1573782347; delPer=0; PSINO=1; PMS_JT=%28%7B%22s%22%3A1573785239469%2C%22r%22%3A%22https%3A//baike.baidu.com/item/%25E9%2582%25AE%25E4%25BB%25B6%25E5%2588%2597%25E8%25A1%25A8/3242524%3Ffr%3Daladdin%22%7D%29

"""

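import re

# Caution: without re.MULTILINE, ^ and $ anchor only at the start and end of the whole
# string, so this pattern never matches the multi-line text and sub() returns it unchanged,
# as the output below shows. Compile with re.MULTILINE to apply the substitution per line.
p = re.compile(r"^([\d\w-]+):\s*([\d\w\.\-/\(\t;\) ,? \+=\*:%]+)$")

m1 =p.sub( r'"\1":"\2"',text)

print(m1)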




Host: baike.baidu.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
Sec-Fetch-User: ?1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: navigate
Referer: https://baike.baidu.com/item/%E9%82%AE%E4%BB%B6%E5%88%97%E8%A1%A8/3242524?fr=aladdin
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cookie: BAIDUID=427D8AC45B87E59C7DFFB9D4FC86A80A:FG=1; BIDUPSID=427D8AC45B87E59C7DFFB9D4FC86A80A; PSTM=1562547125; BK_SEARCHLOG=%7B%22key%22%3A%5B%22ID%22%5D%7D; BDUSS=1xVWdObmEzTThqcnFnTmlzbE5KaGlvOXo0dUI5V2hUNG1nWlo0akp2VXc4cGxkRVFBQUFBJCQAAAAAAAAAAAEAAABz6UcBZmx5YXRvAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADBlcl0wZXJdVm; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; yjs_js_security_passport=803e0f0e69
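The output above is identical to the input because `^` and `$` only anchor at line boundaries when `re.MULTILINE` is set. Below is a minimal sketch of one way to build an actual dictionary, assuming a shortened sample of the headers (the Referer value is abbreviated for illustration) and a simplified pattern.

```python
import re

# Shortened sample of the request headers shown above.
raw = """
Host: baike.baidu.com
Connection: keep-alive
Accept-Language: zh-CN,zh;q=0.9
Referer: https://baike.baidu.com/item/3242524?fr=aladdin
"""

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# Group 1 is the header name, group 2 is the rest of the line.
line = re.compile(r'^([\w-]+):\s*(.+)$', re.MULTILINE)

# findall() returns (name, value) tuples, which dict() accepts directly.
headers = dict(line.findall(raw))
print(headers)
```

Using `findall()` plus `dict()` yields a real Python dictionary, whereas `sub()` would only produce text that looks like one.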

## Parsing XML-like Content

For a detailed introduction to XML, see the W3School tutorial: http://www.w3school.com.cn/xml/

Install lxml before use:

```pip install lxml```

Suppose our document has the following content:

```
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Test Document</title>
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<link rel="stylesheet" href="bootstrap/css/bootstrap.min.css">
</head>
<body>	
    <div><h1>测试文档</h1></div>
    <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
	      <ul class="nav navbar-nav">
            
	        <li>
	        	<a href="http://www.website1.com">Link 1 <span class="sr-only">(current)</span></a>
	        </li>
	        <li><a href="http://www.website2.com">Link 2</a></li>
	        <li><a href="http://www.website3.com">Link 3</a></li>
	       	<li><a href="http://www.website4.com">Link 4</a></li>
	       
	      </ul>
    </div>
    <div class="row products">
			<div  class="col-xs-12 col-sm-5 col-sm-offset-1 col-md-3 col-md-offset-0">
				<img  src="image/c1.jpg" alt="">
				<h3>Office 365</h3>
				<p>购买 Office 365，体验最新版本的 Word、Excel、PowerPoint 等常用应用。
				</p>
				<a href="http://www.office365.com">立即购买 </a>
				<span class="buy-icon">&gt;</span>				
			</div>
			<div class="col-xs-12 col-sm-5  col-md-3">
				<img  src="image/c2.webp" alt="">
				<h3>wps office</h3>
				<p>购买wps Office，体验最新版本的 WPS office 应用。
				</p>
				<a href="https://www.wps.cn/">立即购买 </a>
				<span class="buy-icon">&gt;</span>
			</div>
    </div>
    <script src="bootstrap/js/bootstrap.min.js"></script>
</body>
</html>
```
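The document above is saved as htmlsample.htm.


Try to parse the document above with XPath and complete the following tasks:
- Get all li elements in the document;
- Get the attribute values of the a elements;
- Get the element whose href attribute equals http://www.website1.com;
- Get the content of the last element;
- Get the content of the second-to-last element.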
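As a dependency-free warm-up for these tasks, the sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, rather than lxml. The small `ul` fragment is a simplified, well-formed stand-in for the sample document.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the navigation list in the sample document.
doc = """
<ul>
  <li><a href="http://www.website1.com">Link 1</a></li>
  <li><a href="http://www.website2.com">Link 2</a></li>
  <li><a href="http://www.website3.com">Link 3</a></li>
  <li><a href="http://www.website4.com">Link 4</a></li>
</ul>
"""
root = ET.fromstring(doc)

all_li = root.findall(".//li")                                   # all li elements
hrefs = [a.get("href") for a in root.iter("a")]                  # href of every a
site1 = root.findall(".//a[@href='http://www.website1.com']")    # filter by attribute
last_text = root.find(".//li[last()]/a").text                    # last li
second_last_text = root.find(".//li[last()-1]/a").text           # second-to-last li

print(len(all_li), hrefs, site1[0].text, last_text, second_last_text)
```

ElementTree paths are relative to the element they are called on, hence the leading `.`; lxml's `xpath()` accepts full XPath 1.0 expressions such as `//li`.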

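**Load an XML-like document or string with the etree module of lxml**

To use the XPath support in lxml, first import its etree module. lxml.etree implements the XML element tree API, which makes it easy to read and modify the elements and attributes of an XML document.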

In [8]:
from lxml import etree

help(etree)

Help on module lxml.etree in lxml:

NAME
    lxml.etree - The ``lxml.etree`` module implements the extended ElementTree API for XML.

CLASSES
    builtins.Exception(builtins.BaseException)
        Error
            LxmlError
                C14NError
                DTDError
                    DTDParseError
                    DTDValidateError
                DocumentInvalid
                LxmlRegistryError
                    NamespaceRegistryError
                LxmlSyntaxError(LxmlError, builtins.SyntaxError)
                    ParseError
                        XMLSyntaxError
                    XPathSyntaxError(LxmlSyntaxError, XPathError)
                ParserError
                RelaxNGError
                    RelaxNGParseError
                    RelaxNGValidateError
                SchematronError
                    SchematronParseError
                    SchematronValidateError
                SerialisationError
                XIncludeError
                XMLSchema

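To read an XML file or a file-like object with etree, use the parse() function, which returns an ElementTree object.

Note: parsing an XML-like document with etree's default parser requires well-formed markup in which every tag is properly closed. Many HTML pages leave tags unclosed, so parsing loosely formatted XML-like documents may raise an error.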
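To illustrate the difference between strict and lenient parsing without extra dependencies, the sketch below uses the standard library instead of lxml: `xml.etree.ElementTree` as the strict parser and `html.parser.HTMLParser` as a tolerant one. The `LinkCollector` class and the broken fragment are hypothetical. (lxml itself offers `etree.HTMLParser` for lenient parsing.)

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# <meta> and <li> are left unclosed here, as in much real-world HTML.
broken = ('<head><meta charset="UTF-8"></head>'
          '<ul><li><a href="http://www.website1.com">Link 1</a>'
          '<li><a href="http://www.website2.com">Link 2</a></ul>')

# A strict XML parser rejects the fragment.
try:
    ET.fromstring(broken)
    strict_ok = True
except ET.ParseError:
    strict_ok = False

# The event-based HTML parser simply reports the tags it sees.
collector = LinkCollector()
collector.feed(broken)
links = collector.links
print(strict_ok, links)
```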

In [14]:
from lxml import etree 

# A raw string keeps the backslashes in the Windows path literal.
et = etree.parse(r'..\examples\htmlsample.htm')
# Alternatively, pass a file-like object, e.g. etree.parse(StringIO(xml_text)) with StringIO from io
print("ElementTree object:")
print(et)
print("ElementTree content:")
print(etree.tostring(et, encoding='unicode'))


ElementTree object:
<lxml.etree._ElementTree object at 0x0000022C7D72A708>
ElementTree content:
<html lang="en">
<head>
	<meta charset="UTF-8"/>
	<title>Test Document</title>
	<meta name="viewport" content="width=device-width, initial-scale=1"/>
	<link rel="stylesheet" href="bootstrap/css/bootstrap.min.css"/>
</head>
<body>	
    <div><h1>测试文档</h1></div>
    <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
	      <ul class="nav navbar-nav">
            
	        <li>
	        	<a href="http://www.website1.com">Link 1 <span class="sr-only">(current)</span></a>
	        </li>
	        <li><a href="http://www.website2.com">Link 2</a></li>
	        <li><a href="http://www.website3.com">Link 3</a></li>
	       	<li><a href="http://www.website4.com">Link 4</a></li>
	       
	      </ul>
    </div>
    <div class="row products">
			<div class="col-xs-12 col-sm-5 col-sm-offset-1 col-md-3 col-md-offset-0">
				<img src="image/c1.jpg" alt=""/>
				<h3>Office 365</h3>
				<p>购买 Of

**Get all li elements in the document**

Next, we get the li elements in the document.


In [15]:
from lxml import etree 

htmldoc = etree.parse(r'..\examples\htmlsample.htm')

result = htmldoc.xpath("//li")

for r in result:
    print("li element:")
    print(r)
    print("Tag name:")
    print(r.tag)
    print("Text of the first child:")
    print(r[0].text)   # r[0] is the first child element (the a tag)

li element:
<Element li at 0x22c7d804688>
Tag name:
li
Text of the first child:
Link 1 
li element:
<Element li at 0x22c7d8047c8>
Tag name:
li
Text of the first child:
Link 2
li element:
<Element li at 0x22c7d804808>
Tag name:
li
Text of the first child:
Link 3
li element:
<Element li at 0x22c7d804648>
Tag name:
li
Text of the first child:
Link 4


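**Get the attribute values of the a elements**

In [17]:
from lxml import etree 

htmldoc = etree.parse(r'..\examples\htmlsample.htm')

# //@href selects every href attribute in the document, including the one on the link element;
# use //a/@href to restrict the result to a elements.
result = htmldoc.xpath("//@href")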
print(result)

['bootstrap/css/bootstrap.min.css', 'http://www.website1.com', 'http://www.website2.com', 'http://www.website3.com', 'http://www.website4.com', 'http://www.office365.com', 'https://www.wps.cn/']


**Get the element whose href attribute equals http://www.website1.com**



In [19]:
from lxml import etree 

htmldoc = etree.parse(r'..\examples\htmlsample.htm')

result = htmldoc.xpath('//a[@href="http://www.website1.com"]')

for i in result:
    print(i.tag)
    print(i.attrib)
    print(i.text)

a
{'href': 'http://www.website1.com'}
Link 1 


**Get the content of the last element**

In [23]:

from lxml import etree 

htmldoc = etree.parse(r'..\examples\htmlsample.htm')

result = htmldoc.xpath('//li[last()]/a/@href')

print(result)


['http://www.website4.com']


**Get the content of the second-to-last element**

In [24]:

from lxml import etree 

htmldoc = etree.parse(r'..\examples\htmlsample.htm')

result = htmldoc.xpath('//li[last()-1]/a/text()')
print(result)

['Link 3']


**Exercise**

Use XPath to parse the following XML document and extract the name of every book.


In [30]:
from lxml import etree
from io import StringIO, BytesIO

testtext = '''
<User name="test" age="18" marry="true">
    <Books>
        <Book name="java"/>
        <Book name="android"/>
        <Book name="XMLParser"/>
    </Books>
    <Phone number="110110110" type="home"/>
    <Phone number="221221221" type="company"/>
</User>
'''
parser = etree.XMLParser(ns_clean=True) 
et = etree.parse(StringIO(testtext),parser)

print(etree.tostring(et).decode('utf-8'))

print("Names of all books:")
result = et.xpath("//Book/@name")
print(result)

<User name="test" age="18" marry="true">
    <Books>
        <Book name="java"/>
        <Book name="android"/>
        <Book name="XMLParser"/>
    </Books>
    <Phone number="110110110" type="home"/>
    <Phone number="221221221" type="company"/>
</User>
Names of all books:
['java', 'android', 'XMLParser']


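## Parsing XML-like Documents with BeautifulSoup

Documentation (in Chinese): https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Installation is simple. With Python and pip installed, run:

```pip install beautifulsoup4```

### Examples and Applications

The following example parses a web page with BeautifulSoup and highlights these points:

- Import BeautifulSoup, load the HTML document to parse, and build a BeautifulSoup object;
- Use Tag object attributes to get the content of a tag (an HTML element and its attributes);
- For tags that occur repeatedly in the HTML, use find_all() (findAll() is its legacy alias) to get all elements with the same name;
- Use the .string attribute of a tag object to get the text between the tags;
- Use "tag.attrs" to get the attributes of an element.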


In [31]:
"""BeautifulSoup example"""
from bs4 import BeautifulSoup

htmlDoc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
</html>
"""
# Use html.parser as the parser
bs = BeautifulSoup(htmlDoc,'html.parser')  # the lxml parser can be used as well
print("Type of the BeautifulSoup object:")
print(type(bs))
print("Base classes of BeautifulSoup:")
print(BeautifulSoup.__bases__)
print("Content of the BeautifulSoup object:")
print(bs)

Type of the BeautifulSoup object:
<class 'bs4.BeautifulSoup'>
Base classes of BeautifulSoup:
(<class 'bs4.element.Tag'>,)
Content of the BeautifulSoup object:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
</html>



**Common objects in BeautifulSoup**

In [32]:
# Use Tag objects to get a tag and its inner content
print(bs.head)
print(bs.title)
print(bs.a)

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [33]:
# To get the attributes of an element, use "tag.attrs":

bsoup = BeautifulSoup(htmlDoc,'html.parser')

result = bsoup.a.attrs
print(result)

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}


In [34]:
# To get the text between the tags, use the string attribute:
result = bs.p.string
print(result)

The Dormouse's story


In [35]:
# A comment inside a tag is also returned by the string attribute (as a bs4.element.Comment object):

result = bs.div.string
print(result)

 This is a comment 


The methods used most often in crawlers are:

- find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
- find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
- decode: returns a Unicode representation of this tag and its contents.
- encode: renders the contents of this tag as a bytestring.

In [36]:
result = bs.find(name = bs.a.name)
print(result)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [37]:
result = bs.find_all(name = bs.a.name)  # find_all is the current name of findAll
for r in result:
    print(r)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [None]:
#  To query elements by attribute value, use the following approach:

In [41]:
"""BeautifulSoup example"""
from bs4 import BeautifulSoup

htmlDoc = """
<User name="test" age="18" marry="true">
    <Books>
        <Book name="java"/>
        <Book name="android"/>
        <Book name="XMLParser"/>
    </Books>
    <Phone number="110110110" type="home"/>
    <Phone number="221221221" type="company"/>
</User>
"""
# html.parser lowercases tag names, so the Book elements are matched as 'book'.
bsObj = BeautifulSoup(htmlDoc,'html.parser')
result = bsObj.find_all(name = 'book',attrs={'name':'android'})
print(result)

[<book name="android"></book>]


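## Reading JSON Data

The examples below cover two topics:

- Converting between Python data structures and JSON strings;
- Reading a JSON file to get the data of interest.

First, consider the conversion between Python built-in objects and JSON strings. The main functions are dumps and loads.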

In [42]:
import json
 
# Convert a Python dict to a JSON string
dict_data = {
    'no' : 1,
    'name' : 'zhangsan',
    'score' : 90.5
}
# Use dumps to convert Python data to a JSON string
json_str = json.dumps(dict_data)

print ("Python dict data:", dict_data)
for key,value in dict_data.items():
    print(key,':',value)
    
print ("JSON string:", json_str)

Python dict data: {'no': 1, 'name': 'zhangsan', 'score': 90.5}
no : 1
name : zhangsan
score : 90.5
JSON string: {"no": 1, "name": "zhangsan", "score": 90.5}


Next, convert a Python list to a JSON string.



In [43]:

import json

alist = ['张三','李四','王五']

jsonstr = json.dumps(alist, sort_keys=True, indent=4,ensure_ascii=False)
print("jsonstr:")
print(jsonstr)

blist = json.loads(jsonstr)
print("blist:", blist)

jsonstr:
[
    "张三",
    "李四",
    "王五"
]
blist: ['张三', '李四', '王五']
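A quick way to check that dumps and loads are inverses is a round trip. The sketch below also shows the effect of `ensure_ascii`; the `record` dictionary is made-up sample data.

```python
import json

# Made-up sample covering the common JSON types.
record = {'no': 1, 'name': '张三', 'score': 90.5, 'tags': ['a', 'b'], 'ok': True, 'note': None}

escaped = json.dumps(record)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # characters are kept as-is
restored = json.loads(readable)                    # back to Python objects

print(escaped)
print(readable)
print(restored == record)
```

Note how Python's `True` and `None` appear as `true` and `null` in the JSON text and are mapped back on loading.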


Next, we read a JSON file to get the data of interest.

The example uses json.load to deserialize fp (a text file or binary file that supports .read() and contains a JSON document) into a Python object.


In [45]:

import json

filepath = "../examples/jsonsample.json"

with open(filepath,'r',encoding='utf-8') as f:   
    adict = json.load(fp=f)

print(adict)

{'employees': [{'firstName': 'Bill', 'lastName': 'Gates'}, {'firstName': 'George', 'lastName': 'Bush'}, {'firstName': 'Thomas', 'lastName': 'Carter'}]}


Next, we encode a Python object as JSON and write it to a JSON file.

The example uses json.dump, which encodes a Python object and writes it to the file given by fp (that is, it serializes the Python object).


In [48]:
import json
dict_data = {
    'no' : 1,
    'name' : 'zhangsan',
    'score' : 90.5
}
with open("../examples/jsonsampl2.json",'w') as f:
    json.dump(dict_data,fp=f)
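The dump and load steps can be combined into a self-checking round trip. This sketch writes to a temporary directory instead of `../examples`, and the file name `roundtrip.json` is a hypothetical choice.

```python
import json
import os
import tempfile

dict_data = {'no': 1, 'name': 'zhangsan', 'score': 90.5}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'roundtrip.json')   # hypothetical file name
    # Serialize the dict to the file; an explicit encoding avoids platform defaults.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(dict_data, f, ensure_ascii=False, indent=2)
    # Deserialize it again from the same file.
    with open(path, 'r', encoding='utf-8') as f:
        loaded = json.load(f)

print(loaded == dict_data)
```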

Below is a more practical example:


In [50]:

import requests
from bs4 import BeautifulSoup
import json


def getUrl(url,queryload = None):
    # Validate the arguments first; raise needs an exception instance, not a plain string.
    if not url:
        raise ValueError("A URL to visit must be provided")
    if queryload and not isinstance(queryload,dict):
        raise TypeError("The query parameters of a GET request must be a dict: %r" % (queryload,))
    try:
        headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                   'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
                  }
        session = requests.Session()
        r = session.get(url,params = queryload,headers = headers)        
        r.raise_for_status()
        r.encoding = r.apparent_encoding

        return r.text

    except requests.exceptions.RequestException as e:
        # requests.exceptions is a module; RequestException is its base exception class.
        print(e)       
        return "Some exceptions were raised."
    
url = 'http://vip.stock.finance.sina.com.cn/quotes_service/api/json_v2.php/Market_Center.getFundNetData?page=1&num=40&sort=symbol&asc=1&node=open_fund&_s_r_a=init'

r = getUrl(url)
print(r)
# Save the fetched text to a file. Note that the response is not standard JSON
# (its keys are unquoted), so json.dump stores it as one JSON string, not as a list.
with open("../examples/jsonsampl4.json",'w',encoding='utf-8') as f:
    json.dump(r,fp=f,ensure_ascii=False)


   

[{symbol:"000001",name:"华夏成长混合",dwjz:"1.0560",ljdwjz:"3.5170",zrjz:1.052,jzzz:"0.38022814",date:"2019-11-14",jjgm:"41.5606"},{symbol:"000003",name:"中海可转债债券A",dwjz:"0.7440",ljdwjz:"0.9540",zrjz:0.74,jzzz:"0.54054054",date:"2019-11-14",jjgm:"1.4078"},{symbol:"000004",name:"中海可转债债券C",dwjz:"0.7450",ljdwjz:"0.9550",zrjz:0.74,jzzz:"0.67567568",date:"2019-11-14",jjgm:"0.6133"},{symbol:"000005",name:"嘉实增强信用定期债券",dwjz:"1.0350",ljdwjz:"1.3400",zrjz:1.035,jzzz:"0.00000000",date:"2019-11-14",jjgm:"1.1824"},{symbol:"000006",name:"西部利得量化成长混合",dwjz:"1.1079",ljdwjz:"1.1079",zrjz:1.1026,jzzz:"0.48068202",date:"2019-11-14",jjgm:"1.1433"},{symbol:"000008",name:"嘉实中证500ETF联接A",dwjz:"1.3308",ljdwjz:"1.3968",zrjz:1.3216,jzzz:"0.69612591",date:"2019-11-14",jjgm:"14.4993"},{symbol:"000011",name:"华夏大盘精选混合",dwjz:"13.4010",ljdwjz:"18.3410",zrjz:13.376,jzzz:"0.18690191",date:"2019-11-14",jjgm:"3.8336"},{symbol:"000014",name:"华夏聚利债券",dwjz:"1.2290",ljdwjz:"1.2290",zrjz:1.228,jzzz:"0.08143322",date:"2019-11-14",jjgm
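The response above is not standard JSON because its keys are unquoted, so `json.loads` would reject it. One possible workaround, shown here as a heuristic sketch on a shortened sample taken from the first record above, is to quote the keys with a regular expression before parsing.

```python
import json
import re

# Shortened sample in the same shape as the response above.
raw = '[{symbol:"000001",name:"华夏成长混合",zrjz:1.052,date:"2019-11-14"}]'

# Heuristic: quote a bare identifier that follows "{" or "," and precedes ":".
# It can misfire if a string value itself contains text such as ',abc:'.
fixed = re.sub(r'([{,])\s*([A-Za-z_]\w*)\s*:', r'\1"\2":', raw)

funds = json.loads(fixed)
print(funds[0]['symbol'], funds[0]['name'], funds[0]['zrjz'])
```

Because this is pattern-based rather than a real parser, it is worth spot-checking the parsed records against the raw text before relying on it.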

## Parsing Office Documents

### Excel Data
The following example reads the content of an Excel file.

In [51]:

import pandas as pd

df = pd.read_excel('../examples/xlssample.xlsx')
df.head()

   序号  姓名     学号  年龄
0   1  王亮  90001  21
1   2  李莉  90002  21
2   3  徐彤  90003  23
3   4  侯月  90004  22
4   5  马凯  90005  21


In [52]:
df.loc[df['姓名']=='王亮']

   序号  姓名     学号  年龄
0   1  王亮  90001  21


In [53]:
df.iloc[1,1]

'李莉'

In [56]:
df.iloc[1,1] = '李梅丽'
print(df.head())
df.to_excel('../examples/xlssample2.xlsx')

   序号   姓名     学号  年龄
0   1   王亮  90001  21
1   2  李梅丽  90002  21
2   3   徐彤  90003  23
3   4   侯月  90004  22
4   5   马凯  90005  21


### Word Data

The following example shows how to read a Word file with python-docx:

- First, install the package: pip install python-docx
- Then import the module in Python

In [57]:
from docx import Document

document = Document('../examples/wordsample.docx')

for p in document.paragraphs:
    if p.text:
        print(p.style)
        print(p.text)
        
table = []    
for t in document.tables:
    for i in t.rows:
        row = []
        for j in i.cells:
            row.append(j.text)
        table.append(row)
    print(table)
    table.clear()


_ParagraphStyle('Heading 1') id: 2390144022008
13省份经济半年报：四川总量首破2万亿，云南贵州高增长
_ParagraphStyle('Normal') id: 2390144022120
截至7月17日，全国已有四川、湖北、湖南、福建、安徽、北京、陕西、江西、内蒙古、云南、贵州、海南、宁夏13省份交出上半年经济成绩单。
_ParagraphStyle('Normal') id: 2390144023016
从经济总量上看，与往年半年报相比，四川今年首次突破2万亿元大关，以经济总量2.05万亿元领跑13省份。湖北以1.99万亿元的成绩暂列第二。宁夏上半年GDP为0.17万亿元，同比增长6.5%，总量暂时垫底。
[['排名', '省份', '经济增速', '年度'], ['1', '云南', '9.2%', '2019'], ['2', '贵州', '9%', '2019'], ['3', '江西', '8.6%', '2019']]


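## Parsing Database Content

The Python DB-API workflow:

1. Import the API module.
2. Obtain a connection to the database.
3. Execute SQL statements and stored procedures.
4. Close the database connection.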
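The four steps can be tried without a database server. The sketch below swaps in the standard library's `sqlite3` module, which also follows the DB-API (PEP 249), in place of MySQL; the `users` table and the in-memory database are assumptions made for illustration.

```python
import sqlite3                                    # 1. import the API module

conn = sqlite3.connect(':memory:')                # 2. connect (in-memory database)
try:
    cur = conn.cursor()
    # 3. execute SQL; sqlite3 uses "?" placeholders, while PyMySQL uses "%s"
    cur.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)')
    cur.execute('INSERT INTO users (email) VALUES (?)', ('aaa@shop.com',))
    cur.execute('UPDATE users SET email = ? WHERE email = ?',
                ('bbb@shop.com', 'aaa@shop.com'))
    conn.commit()                                 # make the changes permanent
    cur.execute('SELECT id, email FROM users')
    rows = cur.fetchall()
finally:
    conn.close()                                  # 4. close the connection

print(rows)
```

Passing values as parameters instead of formatting them into the SQL string lets the driver handle quoting, which also guards against SQL injection.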

To work with MySQL from Python, we need to install a MySQL driver and a MySQL server.

Two MySQL drivers are available:

- mysql-connector-python, the pure-Python driver provided officially by MySQL;
- PyMySQL, a pure-Python MySQL client library based on PEP 249.

```pip install PyMySQL```

```pip install mysql-connector-python```

To map relational tables to Python objects, we need an ORM tool such as SQLAlchemy.

```pip install sqlalchemy```

In [None]:
# The following demonstrates connecting to a database with pymysql


import pymysql.cursors

sql_create_table_usertest = '''CREATE TABLE `userstest` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `email` varchar(255) COLLATE utf8_bin NOT NULL,
    `password` varchar(255) COLLATE utf8_bin NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
AUTO_INCREMENT=1 ;

'''

dbconnection = pymysql.connect(host='localhost',
                               user='root',
                               password='your-password',  # replace with the real password
                               database='test',
                               charset='utf8mb4',
                               cursorclass=pymysql.cursors.DictCursor)
try:
    # Query
    with dbconnection.cursor() as cursor:
        # view table user
        sql = "SELECT * FROM user WHERE email = %s"
        # The trailing comma makes the argument a one-element tuple.
        cursor.execute(sql,('aaa@shop.com',))
        rlist = cursor.fetchall()
        for r in rlist:
            print(r)
    # Update
    with dbconnection.cursor() as cursor:
        # update a record
        sql_update = "UPDATE user SET email = %s WHERE email = %s"
        cursor.execute(sql_update,('bbb@shop.com','aaa@shop.com'))
        
    dbconnection.commit()
    with dbconnection.cursor() as cursor:
        # view table user
        sql = "SELECT * FROM user"
        cursor.execute(sql)
        rlist = cursor.fetchall()
        for r in rlist:
            print(r)
        
finally:
    dbconnection.close()

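## Homework for This Lesson

Design a web crawler that collects university rankings.

Shanghai Jiao Tong University created a site called "最好大学网" (zuihaodaxue.com) that lists current university rankings. We will design a crawler to collect the ranking information.

**Functional requirements:**

- Input: the URL of a university ranking page
- Output: the ranking information printed to the screen (rank, university name, total score)
- Tools: python3, requests, beautifulsoup

**Program design:**

1. Study the URLs of the ranking site's pages;
2. Write a fetchUrl function that tries to fetch a page;
3. Write a parseHtml function that parses the content;
4. Write an output function that presents the results as a list;
5. Call everything from a main function.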

In [1]:
"""Example: crawling university rankings

"""
import requests
from bs4 import BeautifulSoup
import bs4


def fetchURL(url):
    """
    Send an HTTP request to the given url and return the content of the page.
    
    Parameters:
        url: the URL of a web page.
    
    Returns: the page content as text (str).
    """
    
    headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
    
    try:
        r = requests.get(url,headers = headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print('    Success.')
        return r.text
    except requests.RequestException as e:
        # requests has no RequestError; RequestException is the base class of its errors.
        print('    Error:', e)
        return ""
    
    
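def parseHTML(html,urating):
    """
    Parse the structure of the HTML text given by html and extract the required content.
    
    Parameters:
        html: the HTML text held in memory, for example the content of requests.Response.text;
        urating: a list object to which the results are appended;
        
    Returns: a two-dimensional list holding the university ranking information.
    """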
    bsobj = BeautifulSoup(html,'html.parser')
 
    # Get the table header
    tr = bsobj.find('thead').find('tr')
    hlist = []
    if isinstance(tr,bs4.element.Tag):
        for th in tr('th'):
            hlist.append(th.string)
        urating.append(hlist)
        
    # Get the table body
    for tr in bsobj.find('tbody').children:
        blist = []
        if isinstance(tr,bs4.element.Tag):            
            for td in tr('td'):
                blist.append(td.string)
            urating.append(blist)   
    return urating

def output(urating,filename):
    """
    Write the university ranking results to a CSV file.
    
    Parameters:
        urating: a two-dimensional list of ranking rows;
        filename: the path of the CSV file to write.
    """
    #fmtStr = "{:^10}\t{:^6}\t{:^6}".format("排名","学校名称","评分")
    
    import pandas as pd
    dataframe = pd.DataFrame(urating)
    dataframe.to_csv(filename,index=False,sep=',',header=False)     
    print('    CSV file: %s has been written.' % filename)
    
    
def chinese2pinyin(chineseStr):
    """
    Convert Chinese characters to the corresponding full pinyin string (without tone marks).
    
    Parameter chineseStr: a Chinese string
    """
    import pypinyin
    
    return "".join(pypinyin.lazy_pinyin(chineseStr))

   
def main():
    """
    Basic UI and function dispatch.
    """
   
    urldict = {
           '生源质量': 'http://www.zuihaodaxue.com/shengyuanzhiliangpaiming2018.html',
           '培养结果': 'http://www.zuihaodaxue.com/biyeshengjiuyelv2018.html',
           '社会声誉': 'http://www.zuihaodaxue.com/shehuishengyupaiming2018.html',
           '科研规模': 'http://www.zuihaodaxue.com/keyanguimopaiming2018.html',
           '科研质量': 'http://www.zuihaodaxue.com/keyanzhiliangpaiming2018.html',
           '顶尖成果': 'http://www.zuihaodaxue.com/dingjianchengguopaiming2018.html',
           '顶尖人才': 'http://www.zuihaodaxue.com/dingjianrencaipaiming2018.html',
           '科技服务': 'http://www.zuihaodaxue.com/kejifuwupaiming2018.html',
           '成果转化': 'http://www.zuihaodaxue.com/chengguozhuanhuapaiming2018.html',
           '学生国际化': 'http://www.zuihaodaxue.com/xueshengguojihuapaiming2018.html',
}
    indicationSystem = ['生源质量',
                        '培养结果',
                        '社会声誉',
                        '科研规模',
                        '科研质量',
                        '顶尖成果',
                        '顶尖人才',
                        '科技服务',
                        '成果转化',
                        '学生国际化',
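                       ]
    """
    # Generate the URLs automatically. Note that the site names the URL for 培养结果 biyeshengjiuyelv2018, which does not follow the pattern of the others, so this is not used for now.
    url = 'http://www.zuihaodaxue.com/'
    for i in indicationSystem:
        print(url + chinese2pinyin(i) + 'paiming2018.html')
    """       
    
    # Start crawling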
    print('Begin to crawl the http://www.zuihaodaxue.com/ and get the rating of university in China......')
    url = 'http://www.zuihaodaxue.com/xueshengguojihuapaiming2018.html'
    print('    Try to fetch url : ' + url)
    
    urating = []
    html = fetchURL(url)
    print('    Try to parse html... ' )
    ur = parseHTML(html,urating)
    print('    Try to save the results in file...')
    output(ur,'daxuepaiming.csv')
    
        
    
    for key,url in urldict.items():
        print('    Try to fetch url : ' + url)
        urating = []
        html = fetchURL(url)
        print('    Try to parse html... ' )
        ur = parseHTML(html,urating)
        print('    Try to save the results in file...')
        output(ur,key+'排名2018.csv')
    print('The work of crawling is done.') 
    
if __name__ == '__main__':
    main()

Begin to crawl the http://www.zuihaodaxue.com/ and get the rating of university in China......
    Try to fetch url : http://www.zuihaodaxue.com/xueshengguojihuapaiming2018.html
    Success.
    Try to parse html... 


NameError: name 'bs4' is not defined