# Scraping Information from the Web

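These notes started from Chapter 11, "Web Scraping", of *Automate the Boring Stuff with Python* (《Python编程快速上手——让繁琐工作自动化》), with additions of my own and material from other books.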

Several important libraries:  

- 1. `webbrowser`
- 2. `requests`
- 3. `BeautifulSoup`  
- 4. `selenium`
- 5. `scrapy`

There is also `urllib`, but Al Sweigart says to forget about that library, meaning it is awkward to use.

- Dynamic page content loaded by JavaScript

---
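Appendix: About HTML

**On macOS, Command+Option+I opens or closes the browser's developer tools, the same as F12 on Windows.**  

The author advises: **do not parse HTML with regular expressions**. For example, the case I ran into yesterday, where the `class` attribute sat in the middle of an `a` tag, is still valid HTML, and anticipating every such variation with a regex becomes very tedious. A module built for parsing HTML, such as BeautifulSoup, is much less error-prone. See [html - RegEx match open tags except XHTML self-contained tags - Stack Overflow](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)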
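
To make the point concrete, here is a minimal sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs; the `sample` snippet and the `LinkCollector` class are made up for illustration.

```python
from html.parser import HTMLParser
import re

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, whatever order its attributes come in."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Hypothetical snippet: the same kind of link written three valid ways.
sample = (
    '<a href="/a.html" class="x">A</a>'
    '<a class="x" href="/b.html">B</a>'
    "<a\n   class='x'\n   href='/c.html'>C</a>"
)

# A naive regex that expects href to come right after "<a ".
naive = re.findall(r'<a href="([^"]+)"', sample)

print(naive)                  # the regex finds only the first link
print(collect_links(sample))  # the parser finds all three
```

The regex could be patched for each new variation, but a parser handles attribute order, quoting style and line breaks without special cases.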

---

A diagram:  

![](https://user-gold-cdn.xitu.io/2019/8/22/16cb96d421b52ad9?imageslim)

In [1]:
# Make IPython display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 1. The `webbrowser` module

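The `webbrowser` module opens a URL directly in the browser.

Build the URL first, then open it with `webbrowser`; this suits URLs that follow a pattern. When is "generate the URL, then open it to check" useful?

- Dynamic URLs  

    - Query type: content parameters make up the new URL, e.g. Google Maps
    - Parsing type: the new URL is extracted from a piece of text
    
- Confirming page content

    - Numbered type: the new URL is built from a series of ID numbers, e.g. the novel site Jinjiang (jjwxc)
    - URLs already collected that are tedious to open by hand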
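
The "build the URL first" step can be sketched with `urllib.parse` from the standard library (only its URL-building helpers, not its fetching side). The helper names `build_place_url` and `build_query_url` are made up for illustration; the base URLs are the ones used in the sections that follow.

```python
from urllib.parse import quote_plus, urlencode

def build_place_url(address):
    """Build a maps-style URL; quote_plus turns spaces into '+' and escapes the rest."""
    # Collapse runs of whitespace first so 'Wuxi,   Jiangsu' does not become 'Wuxi,+++Jiangsu'.
    cleaned = ' '.join(address.split())
    return 'http://www.google.com/maps/place/' + quote_plus(cleaned)

def build_query_url(base, **params):
    """Append a query string such as ?novelid=3000011 to a base URL."""
    return base + '?' + urlencode(params)

print(build_place_url('Wuxi,   Jiangsu,   China'))
print(build_query_url('http://www.jjwxc.net/onebook.php', novelid=3000011))
```

Unlike a plain whitespace substitution, `quote_plus` also percent-encodes characters such as commas, which keeps the URL valid for addresses with unusual characters.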

### 1.1 Opening novel pages

In [1]:
import webbrowser
for i in range(3000011, 3000020):
    url = 'http://www.jjwxc.net/onebook.php?novelid=' + str(i)
    webbrowser.open(url)

### 1.2 Opening a single city on Google Maps

In [2]:
#_*_coding:utf-8_*_
import webbrowser
import re

def mapit(address):
    address = re.compile(r'\s+').sub('+',address) # the '+' in the replacement string needs no backslash escape
    url = 'http://www.google.com/maps/place/' + address
    return webbrowser.open(url)

address = 'Wuxi,   Jiangsu,   China'    
mapit(address)

True

### 1.3 Opening a batch of cities on Google Maps

In [3]:
import webbrowser
import re

def mapcities(cities):
    address = []
    for city in cities:
        address.append(city + ', Jiangsu, China')
    for a in address:
        a = re.compile(r'\s+').sub('+',a)
        url = 'http://www.google.com/maps/place/' + a
        webbrowser.open(url)
        
cities = ['Wuxi', 'Suzhou', 'Xuzhou', 'Zhenjiang', 'Taizhou']
mapcities(cities)

### 1.4 Opening a batch of Jianshu notes

Copy the block of links below (taken from WorkFlowy) and open them all with the program that follows:

    [每天一本书 -《思考线》](https://www.jianshu.com/p/ee5e1c32f97d)
	[创意变为现实的最佳方法——《思考线》读后感](https://www.jianshu.com/p/c2132f7e02ae)
	[读思考线 ](https://www.jianshu.com/p/52e2e4dbb08c)
	[极具说服力的书（思考线：让你的创意变为现实的最佳方法](https://book.douban.com/review/7747085/)
	[思考线·思维导图.png (3104×1802)](https://upload-images.jianshu.io/upload_images/14183687-4b46af1a5294edfd.png)

In [4]:
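import webbrowser, pyperclip, re

# Each clipboard line looks like "[title](url)"; strip the tabs and split into lines
mdUrls = pyperclip.paste().replace('\t','').split('\n')
urlRegex = re.compile(r'\((http.*)\)')
for mdUrl in mdUrls:
    match = urlRegex.search(mdUrl)
    if match is None:
        continue  # skip blank lines and lines without a link
    webbrowser.open(match.group(1))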

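## 2. The `requests` module

Documentation for requests: [Requests: HTTP for Humans™ — Requests 2.21.0 documentation](https://requests.readthedocs.io/en/master/)

`requests.get(url)` returns a response object:  

- Its type is `requests.models.Response`
- Attribute `text`
- Attribute `headers`: a dictionary-like structure (dictionaries support `dict.get(key)` to return the value), `requests.structures.CaseInsensitiveDict`, whose keys are case-insensitive. The keys of a real dictionary are case-sensitive.
- Attribute `status_code`
- Attribute `encoding`
- Method `raise_for_status()`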

### 2.1 Trying it out

In [2]:
#_*_coding:utf-8_*_
import requests
import chardet

url0 = 'http://www.gutenberg.org/cache/epub/1112/pg1112.txt'
url1 = 'http://example.webscraping.com'
url2 = 'http://www.engine3d.com'

res0 = requests.get(url0)
res1 = requests.get(url1)
headers = {'user-agent': 'my-app/0.0.1'}
res2 = requests.get(url2, headers = headers)

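# Basic facts about the response
type(res0)
res0.status_code == requests.codes.ok
res2.encoding

# Look at the response text
len(res0.text)
res0.text[:250]

# Look at the response headers
res0.headers.get('content-type')
res2.headers
res2.headers.get('user-agent')  # None: User-Agent is a request header, so the response headers do not carry it

# A URL that does not exist: see how the response reacts
url3 = 'http://inventwithpython.com/page_that_does_not_exist'
res3 = requests.get(url3)
res3.raise_for_status()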

requests.models.Response

True

'ISO-8859-1'

179378

"\ufeffThe Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare\r\n\r\n\r\n*******************************************************************\r\nTHIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A\r\nTIME WHEN PROOFING METHODS AND TOO"

'text/plain; charset=utf-8'

{'Server': 'AliyunOSS', 'Date': 'Fri, 01 Nov 2019 01:39:49 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'x-oss-request-id': '5DBB8CE55C74183036984F90', 'Last-Modified': 'Thu, 31 Oct 2019 02:27:20 GMT', 'x-oss-object-type': 'Normal', 'x-oss-hash-crc64ecma': '4434422554274640270', 'x-oss-storage-class': 'Standard', 'Content-MD5': 'l2yxdfF3i/9yvmsmNVa0SA==', 'x-oss-server-time': '18', 'Content-Encoding': 'gzip'}

HTTPError: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist

### 2.2 The first program that actually runs

In [4]:
def download(url):
    try:
        # Keep the request itself inside try so that connection errors are reported too
        res = requests.get(url)
        res.raise_for_status()
        html = res.text
        print('Webpage was downloaded successfully.'+'\n'+'The beginning of the text:  '+'\n\n')
        print(html[:200])
    except Exception as exc :
        print('There was a problem: \n%s' %exc)
print('Download result for url0: \n')
download(url0)
print('Download result for url3: \n')
download(url3)

Download result for url0: 

Webpage was downloaded successfully.
The beginning of the text:  


﻿The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare


*******************************************************************
THIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES
Download result for url3: 

There was a problem: 
404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist


### 2.3 Saving a file locally

In [5]:
import requests
import os

res = requests.get(url0)
res.raise_for_status()
path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', 'RomeoAndJuliet.txt')
playFile = open(path, 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)
playFile.close()

100000

79380

### 2.4 Encoding problems in the page source

Background on Unicode and encodings: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

```
encode_type = chardet.detect(html)
html = html.decode(encode_type['encoding'])
```

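These two lines are not what solved the problem. I tested it: `html` is already a `str`, so there is nothing to decode.

The file is opened in `wb` mode and the content written comes from `res.iter_content(100000)`.  
The author says `iter_content` is used so that the requests module **does not consume too much** memory even when **downloading a huge** file.

**The chunk size is the number of bytes it should read into memory.**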
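
The chunking idea can be sketched without any network access. This is not how `iter_content` is implemented, only the same pattern applied to an in-memory stream; `copy_in_chunks` is a hypothetical helper, and the 179380-byte body is the size implied by the two numbers printed in section 2.3 (100000 + 79380).

```python
import io

def copy_in_chunks(source, target, chunk_size=100000):
    """Copy a binary stream chunk by chunk; return the size of each chunk written."""
    sizes = []
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # an empty read means the stream is exhausted
        sizes.append(target.write(chunk))
    return sizes

# Stand-in for a downloaded body of 179380 bytes.
body = io.BytesIO(b'x' * 179380)
saved = io.BytesIO()
print(copy_in_chunks(body, saved))
```

At no point does the loop hold more than one chunk, which is why memory use stays flat no matter how large the file is.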

In the end, this post solved the garbled-text problem on Chinese pages: [Python爬虫及存入txt中文编码错误的解决（一） - WANGZHUCHEN的博客 - CSDN博客](https://blog.csdn.net/WANGZHUCHEN/article/details/80033073).

```
res.encoding = res.apparent_encoding
```

In [30]:
# _*_coding:utf-8_*_
import requests, bs4, os, lxml, chardet

url = 'https://www.114zw.la/book/1425/8236687.html'
res = requests.get(url)
res.raise_for_status()
res.encoding = res.apparent_encoding
soup = bs4.BeautifulSoup(res.text, 'lxml')
content = soup.select('#htmlContent')[0].text

path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', 'jinshi.txt')
with open(path, 'w') as file:
    file.write(content)

1536

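After switching to a new page the problem did not appear, so the line `res.encoding = res.apparent_encoding` can also be left out.  
I checked: both pages declare their charset as gbk in the HTML. requests takes `res.encoding` from the HTTP `Content-Type` header rather than from the page's `<meta>` tag, so the difference probably lies in whether the server sends a charset in that header.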
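
The garbling itself can be reproduced offline. The sketch below assumes the page bytes are GBK and that they were decoded with ISO-8859-1, the fallback requests uses for `text/*` responses whose header carries no charset (compare the `'ISO-8859-1'` output in section 2.1); the sample text is made up.

```python
# A GBK-encoded page body, as a server might send it (sample text is made up).
original = '第一章 金石'
raw_bytes = original.encode('gbk')

# What you get when the bytes are decoded with the ISO-8859-1 fallback.
garbled = raw_bytes.decode('iso-8859-1')

# Decoding the same bytes with the right codec recovers the text.
recovered = raw_bytes.decode('gbk')

# This kind of garbling is reversible, because ISO-8859-1 maps every byte to exactly one character.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(garbled != original, recovered == original, repaired == original)
```

Setting `res.encoding` before reading `res.text` corresponds to the `recovered` path: the raw bytes are decoded once, with the right codec.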

In [29]:
# _*_coding:utf-8_*_
import requests, bs4, os, lxml, chardet

url = 'https://m.lewenxiaoshuo.com/books/tufeigonglve/12880240.html'
res = requests.get(url)
res.raise_for_status()
# res.encoding = res.apparent_encoding
soup = bs4.BeautifulSoup(res.text, 'lxml')
content = soup.select('#content')[0].text

path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', 'tufei.txt')
with open(path, 'w') as file:
    file.write(content)

3224

## 3. The `BeautifulSoup` module

[BeautifulSoup高级应用 之 CSS selectors /CSS 选择器 - Winterto1990的博客 - CSDN博客](https://blog.csdn.net/Winterto1990/article/details/47808949)

A soup can come from two kinds of sources:

- 1. Fetched from a web page
- 2. Read from a local file

### 3.1 Trying it out

In [2]:
import requests
import bs4
import os

# Two kinds of sources for a soup:
# 1. Fetched from a web page
url = 'https://www.tripadvisor.cn/'
res = requests.get(url)
res.raise_for_status()
example1Soup = bs4.BeautifulSoup(res.text)
type(example1Soup)

# 2. Read from a local file
path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'NoStarch.html')
example2File = open(path)
example2Soup = bs4.BeautifulSoup(example2File.read()) # BeautifulSoup accepts a file object or a string, so .read() makes no difference
type(example2Soup)

bs4.BeautifulSoup

bs4.BeautifulSoup

In [2]:
# Parsing with lxml
import requests
import bs4
import lxml

url = 'https://www.tripadvisor.cn/'
res = requests.get(url)
res.raise_for_status()
html = res.text
tripSoup = bs4.BeautifulSoup(html, 'lxml')

tripSoup.select('title')
tripSoup.select_one('a>img')
tripSoup.select_one('a img')
# The page has changed, so the code below no longer works.
'''
tripSoup.select('input[type="radio"]')
tripSoup.select_one('input[type="radio"]')
tripSoup.select('#popularDestinations > div.section > ul.regionContent > li.active > ul > li:nth-child(1) > a > span.thumbCrop > img')
images = tripSoup.select('#popularDestinations > div.section > ul.regionContent > li.active > ul > li > a > span.thumbCrop > img')
titles = tripSoup.select('#popularDestinations > div.section > ul.regionContent > li.active > ul > li > div.title')
info = []
for title,image in zip(titles, images):
    data = {
            'title':((title.get_text()).replace('\n','')).replace('游记指南',''),
            'image':image.get('src')
        }
    info.append(data)
info
'''
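# Not sure why the background image URL cannot be read; the static HTML only says background-image:none, so it is probably set later by JavaScript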
tripSoup.select('a>div>ul>li>div')[0]['style'].split(';')

[<title>TripAdvisor(猫途鹰) - 全球旅游点评,酒店/景点/餐厅,真实旅客评论</title>]

<img alt="TripAdvisor(猫途鹰)" class="brand-header-Logo__resizeImg--15ZcW" src="https://cc.ddcdn.com/img2/langs/zh_CN/branding/rebrand/TA_logo_primary.svg"/>

<img alt="TripAdvisor(猫途鹰)" class="brand-header-Logo__resizeImg--15ZcW" src="https://cc.ddcdn.com/img2/langs/zh_CN/branding/rebrand/TA_logo_primary.svg"/>

'\ntripSoup.select(\'input[type="radio"]\')\ntripSoup.select_one(\'input[type="radio"]\')\ntripSoup.select(\'#popularDestinations > div.section > ul.regionContent > li.active > ul > li:nth-child(1) > a > span.thumbCrop > img\')\nimages = tripSoup.select(\'#popularDestinations > div.section > ul.regionContent > li.active > ul > li > a > span.thumbCrop > img\')\ntitles = tripSoup.select(\'#popularDestinations > div.section > ul.regionContent > li.active > ul > li > div.title\')\ninfo = []\nfor title,image in zip(titles, images):\n    data = {\n            \'title\':((title.get_text()).replace(\'\n\',\'\')).replace(\'游记指南\',\'\'),\n            \'image\':image.get(\'src\')\n        }\n    info.append(data)\ninfo\n'

['height:100%', 'width:100%', 'background-size:cover', 'background-image:none']

In [3]:
import requests, bs4

url = 'https://www.joelonsoftware.com/'
res = requests.get(url)
htmlSoup = bs4.BeautifulSoup(res.text)

import pprint
for i in range(2):
    # pprint.pprint(htmlSoup.select('div>p>span>a')[i].attrs)
    # print(htmlSoup.select('div>p>span>a')[i].get('href'))
    # pprint.pprint(htmlSoup.select('header>h2>a')[i].attrs)
    print(htmlSoup.select('header>h2>a')[i].text)
    print(htmlSoup.select('header>h2>a')[i].get('href'))
for i in range(2):
    print(htmlSoup.select('div>ul>li>a')[i].text)
    
htmlSoup.select_one('img')['src']

Welcome, Prashanth!
https://www.joelonsoftware.com/2019/09/24/announcing-stack-overflows-new-ceo/
The next CEO of Stack Overflow
https://www.joelonsoftware.com/2019/03/28/the-next-ceo-of-stack-overflow/
Things You Should Never Do, Part I
Strategy Letter I: Ben and Jerry’s vs. Amazon


'https://i1.wp.com/www.joelonsoftware.com/wp-content/uploads/2016/12/Pong.png?w=730&ssl=1'

In [2]:
import requests, bs4

url = 'https://blog.csdn.net' 
res = requests.get(url)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text)
soup.select('title')
# soup.select('div nth-of-type(0)') does not work: the pseudo-class is written div:nth-of-type(1), with a colon and counting from 1
# soup.select('body a')[:3]
# soup.select('div>ul>li.active')
# soup.select('.carousel-caption, p.name')
soup.select('a[href]')[0]
soup.select("a[href$='102605809']")[1].text

[<title>CSDN博客-专业IT技术发表平台</title>]

<a href="/">推荐</a>

'\n\n\t\t\t\t\tCSDN产品公告第2期：博客支持视频、专栏文章拖拽排序、APP霸王课来袭……\t\t\t\t\n'

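### 3.2 Downloading Chinese books from Project Gutenberg

I suddenly wanted to download books from Project Gutenberg. Let's try the Chinese ones.  

[Download]
1. Extract the txt URLs from the main listing page
2. Download them with the approach shown above.

[Traditional to Simplified]
Find a library for the conversion 

> No installation is needed; just download these two files and save them in the same directory as the code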
```
https://raw.githubusercontent.com/skydark/nstools/master/zhtools/langconv.py  
https://raw.githubusercontent.com/skydark/nstools/master/zhtools/zh_wiki.py
```

#### 3.2.1 Download

In [9]:
# Download the txt files
import requests
import bs4
import os

url = 'https://www.gutenberg.org/browse/languages/zh'
res = requests.get(url)
res.raise_for_status()
fictionSoup = bs4.BeautifulSoup(res.text)
fictionList = fictionSoup.select("li.pgdbetext a[href^='/ebooks']")
for i in range(len(fictionList)):
    fileName = fictionList[i].get_text()    
    if '/' in fileName:
        fileName = fileName.replace('/', ' ')
    if '\\' in fileName:
        fileName = fileName.replace('\\', ' ')
    file = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', fileName +'.txt')
    if os.path.exists(file):
        fileName = fileName + '_new'
        file = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', fileName +'.txt')
    fictionID = (fictionList[i].get('href')).split('/')[2]
    fictionUrl = 'https://www.gutenberg.org/files/' + fictionID + '/' + fictionID + '-0.txt'
    with open(file ,'wb') as fictionFile:
        resFiction = requests.get(fictionUrl)
        for chunk in resFiction.iter_content(100000):
            fictionFile.write(chunk)

In total, 475 e-books in plain-text format were downloaded.  

The txt format downloads fine; now try another format:  

In [11]:
# Download an epub

file = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', '玉樓春.epub')
url = 'https://www.gutenberg.org/ebooks/25422.epub.noimages?session_id=3c29b07a963878c5cd004f277b6d1d0adb08d623'
with open(file ,'wb') as fictionFile:
        resFiction = requests.get(url)
        for chunk in resFiction.iter_content(100000):
            fictionFile.write(chunk)

#### 3.2.2 Converting Traditional to Simplified Chinese  

In [13]:
from langconv import *

sentence = '玉樓春'
Converter('zh-hans').convert(sentence)

'玉楼春'

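Batch conversion requires:  

1. Converting the file name  
2. Creating a new file. Since some file names are already in Simplified Chinese, the safe approach is to create a separate folder for the Simplified versions.  
3. Reading the content of the file
4. Converting the content  
5. Writing it to the new file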

In [26]:
import os
from langconv import Converter

pathFanti = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', 'fanti')
pathJianti = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', 'jianti')
# Keep only the .txt files in this directory (a new list is built instead of removing items while iterating)
filesFanti = [fileName for fileName in os.listdir(pathFanti) if fileName.split('.')[-1] == 'txt']
for fileNameFanti in filesFanti:
    fileNameJianti = Converter('zh-hans').convert(fileNameFanti)
    with open(os.path.join(pathJianti, fileNameJianti), 'w') as fileJianti:
        with open(os.path.join(pathFanti, fileNameFanti)) as fileFanti:
            contentFanti = fileFanti.readlines()
            for sentenceFanti in contentFanti:
                sentenceJianti = Converter('zh-hans').convert(sentenceFanti)
                fileJianti.write(sentenceJianti)

All files were converted successfully except one that could not be opened.
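
One thing worth knowing when filtering a directory listing like the one above: removing items from a list while looping over it can skip entries. A minimal sketch, with a made-up listing and hypothetical helper names:

```python
def remove_while_iterating(names):
    """The pitfall: removing from the list being looped over skips the next item."""
    names = list(names)
    for name in names:
        if not name.endswith('.txt'):
            names.remove(name)
    return names

def filter_into_new_list(names):
    """The safe version: build a new list and leave the original alone."""
    return [name for name in names if name.endswith('.txt')]

# Hypothetical directory listing with two non-txt files next to each other.
listing = ['a.txt', '.DS_Store', 'cover.jpg', 'b.txt']
print(remove_while_iterating(listing))  # 'cover.jpg' survives the filter
print(filter_into_new_list(listing))
```

After `.DS_Store` is removed, the remaining items shift left and the loop steps over `cover.jpg` without ever testing it.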

### 3.3 Automated Google search

In [5]:
# _*_coding:utf-8_*_
import requests, bs4, webbrowser,re

def googleit(query):
    
    # Build the URL of the results page; spaces in the query become '+'
    if ' ' in query:
        query = re.compile(r'\s+').sub('+', query)
    url = 'http://www.google.com/search?q=' + query
    print('Googling...')
    headers = {'User-Agent':'Mozilla/8.0 (compatible; MSIE 8.0; Windows 7)'}
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    
    # Select the result links
    soup = bs4.BeautifulSoup(res.text)
    linkElems = soup.select('.r a')
    
    # Open the first 5 results
    numOpen = min(5, len(linkElems))
    for i in range(numOpen):
        webbrowser.open('http://google.com' + linkElems[i].get('href'))
        # print(linkElems[i].get('href'))
query = 'python webscraping'
googleit(query)

Googling...


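### 3.4 Downloading xkcd comics

This was my first attempt at downloading images. I started with the program from the textbook, then rewrote it myself.

#### 3.4.1 The standard program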

In [None]:
#_*_coding:utf-8_*_
import requests, bs4,os

url = 'http://xkcd.com'
# os.makedirs('xkcd', exist_ok=True)
while not url.endswith('#'):
    # Download the page
    # print('Downloading page %s ...'%url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)

    # Find the previous comic first, so that every branch below can move on
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')

    # Find the comic image
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
        continue  # nothing to save for this page
    try:
        comicUrl = 'http:'+ comicElem[0].get('src')
        # print('Downloading image is %s ...'%comicUrl)
        res = requests.get(comicUrl)
        res.raise_for_status()
    except Exception:
        # src was a site-relative path rather than //imgs.xkcd.com/...
        comicUrl = 'http://www.xkcd.com'+ comicElem[0].get('src')
        print('That is wrong. Downloading image is %s ...'%comicUrl)
        res = requests.get(comicUrl)
        res.raise_for_status()

    # Save the comic
    # imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
    imageFile = open(os.path.dirname(os.getcwd()) + '/files/xkcd/' + os.path.basename(comicUrl), 'wb')
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)
    imageFile.close()

print('Done.')

That is wrong. Downloading image is http://www.xkcd.com/2067/asset/challengers_header.png ...
Could not find comic image.
Could not find comic image.
That is wrong. Downloading image is http://www.xkcd.com/1525/bg.png ...
Could not find comic image.
Could not find comic image.


In [176]:
url = 'https://xkcd.com/2031/'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
comicElem = soup.select('#comic img')
try:
    comicUrl = 'http:'+ comicElem[0].get('src')
    print('Downloading image is %s ...'%comicUrl)
    res = requests.get(comicUrl)
    res.raise_for_status()  
except:
    comicUrl = 'http://www.xkcd.com'+ comicElem[0].get('src')
    print('That is wrong. Downloading image is %s ...'%comicUrl)    
    res = requests.get(comicUrl)
    res.raise_for_status() 

Downloading image is http://imgs.xkcd.com/comics/pie_charts.png ...


Got it: this page uses a canvas, not an img.

An error that occurred on both 2031 and 1813:  

SysCallError                              Traceback (most recent call last)

SSLError: HTTPSConnectionPool(host='xkcd.com', port=443): Max retries exceeded with url: /1813/ (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

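#### 3.4.2 Downloading xkcd comics again

After learning that the comic's author, Randall Munroe, also wrote *Thing Explainer*, I decided to download the xkcd comics again.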

In [85]:
import requests
import bs4
import os

url = 'https://xkcd.com/1'
while True:
    # Get the comic image URL on the current page
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)
    try:
        # The current page has a static comic image
        srcComic = soup.select_one('#comic img').get('src')
        urlComic = 'https:'+ srcComic
    except Exception:
        # No image URL: move straight on to the previous comic's page
        srcPrev = soup.select_one(".comicNav a[rel='prev']").get('href')
        print(srcPrev)
        url = 'https://xkcd.com' + srcPrev
        continue
    
    # Save the current comic locally
    fileName = srcComic.split('/')[-1]
    filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'xkcd', fileName)
    with open(filePath, 'wb') as comicFile:
        resComic = requests.get(urlComic)
        for chunk in resComic.iter_content(100000):
            comicFile.write(chunk)
    
    # Get the URL of the previous page
    srcPrev = soup.select_one(".comicNav a[rel='prev']").get('href')
    # Reached the page of the first comic
    if srcPrev == '#':
        break
    print(srcPrev)
    url = 'https://xkcd.com' + srcPrev

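Problem comics: 2198, 2067, 1663, 1608, 1525, 1416, 1350  

2198, 1663, 1608, 1525 and 1416 are all games: the pages are interactive rather than ordinary static images. 1416 zooms and embeds a framed page. 1350 has no content.

2067 zooms and contains links, so it is not an ordinary image either. The error shown is:  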

```
MissingSchema: Invalid URL 'https:/2067/asset/challengers_header.png': No schema supplied. Perhaps you meant http://https:/2067/asset/challengers_header.png?
```
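
The error message suggests that the `src` on 2067 is a site-relative path (`/2067/asset/...`), so prefixing it with `https:` gives a URL with no host. One possible way around this, sketched here rather than taken from the book, is to resolve every `src` against the page URL with `urllib.parse.urljoin`; `resolve_image_url` is a hypothetical helper.

```python
from urllib.parse import urljoin

def resolve_image_url(page_url, src):
    """Resolve an <img src> value against the page it was found on."""
    return urljoin(page_url, src)

page = 'https://xkcd.com/2067/'
# Protocol-relative src, the usual xkcd form
print(resolve_image_url(page, '//imgs.xkcd.com/comics/pie_charts.png'))
# Site-relative src, the form that broke on comic 2067
print(resolve_image_url(page, '/2067/asset/challengers_header.png'))
```

Both forms come out as complete URLs, so no try-except is needed to tell them apart. Whether the resolved asset is the comic one actually wants is a separate question for the interactive pages.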

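1052 down to 878 completed in one run without surprises. 874 to 787 completed in one run. 691 to 483 completed in one run. 446 to 250 completed in one run.

The timeout error is:  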

```
SSLError: HTTPSConnectionPool(host='xkcd.com', port=443): Max retries exceeded with url: /240/ (Caused by SSLError(SSLError("bad handshake: SysCallError(60, 'ETIMEDOUT')")))

```


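1. A try-except is needed so that when something unexpected happens, the program just follows the prev button and carries on. 
   try-except cannot solve the 2067 problem.
2. How should timeouts be handled?

The program's three current problems:  

1. try-except cannot solve the 2067 problem  
2. Timeouts are not handled  
3. Downloading is too slow: on average only 200 to 300 comics per hour, so all 2200 comics would take seven or eight hours.  

Problem 3 can be solved with multithreading.  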
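
For problem 2, one common approach is to retry a failed request a few times before giving up. The sketch below is library-agnostic: `with_retries` and `flaky_fetch` are hypothetical names, and the simulated failure stands in for a real timeout.

```python
import time

def with_retries(action, attempts=3, delay=0.0, retry_on=(Exception,)):
    """Call action() up to `attempts` times, sleeping `delay` seconds between failures."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise last_error

# Stand-in for a flaky download: fails twice with a timeout, then succeeds.
calls = {'count': 0}
def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return 'comic bytes'

print(with_retries(flaky_fetch, attempts=5, retry_on=(TimeoutError,)))
print(calls['count'])
```

With requests, the action could be something like `lambda: requests.get(url, timeout=10)` with `retry_on=(requests.exceptions.RequestException,)`; the `timeout` argument matters, because without it a stalled connection never raises anything to retry on.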

In [14]:
# Download a single comic

url = 'https://xkcd.com/'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
srcComic = soup.select_one('#comic img').get('src')
urlComic = 'https:'+ srcComic

# Save the current comic locally
fileName = srcComic.split('/')[-1]
filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'xkcd', fileName)
with open(filePath, 'wb') as comicFile:
    resComic = requests.get(urlComic)
    for chunk in resComic.iter_content(100000):
        comicFile.write(chunk)

#### 3.4.3 Downloading xkcd comics with multiple threads

In [6]:
import requests
import bs4
import threading
import os

def downloadComic(startComic, endComic):
    for comicID in range(startComic, endComic):
        # Get the comic image URL
        res = requests.get('https://xkcd.com/%s'%str(comicID))
        soup = bs4.BeautifulSoup(res.text)
        comicElem = soup.select('#comic img')
        if comicElem == []:
            print('Could not find comic img: %s'%str(comicID))
        else:
            srcComic = comicElem[0].get('src')
            urlComic = 'https:'+ srcComic
            # Save it locally
            fileName = str(comicID) + '-' + srcComic.split('/')[-1]
            filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'xkcd', fileName)
            if os.path.exists(filePath):
                continue
            with open(filePath, 'wb') as comicFile:
                resComic = requests.get(urlComic)
                for chunk in resComic.iter_content(100000):
                    comicFile.write(chunk)

# One thread per block of 10 comics
downloadThreads = []
for i in range(1, 81, 10):
    downloadThread = threading.Thread(target=downloadComic, args=[i, i+10])
    downloadThreads.append(downloadThread)
    downloadThread.start()
# Wait for every thread to finish
for downloadThread in downloadThreads:
    downloadThread.join()
print('Download End.')

Download End.


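'Download End.' is printed only after every `join()` returns, so every thread has to finish. Once a timeout occurs, the program is stuck in a state where it never ends; `requests.get` is called here without a `timeout` argument, so a stalled connection can block its thread indefinitely.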
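
One way to keep a single failing range from affecting the rest is to collect errors per task. This sketch swaps in `concurrent.futures.ThreadPoolExecutor` from the standard library in place of raw `threading.Thread`; `download_range` is a stand-in worker with a simulated failure, not the real downloader.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_range(start, end):
    """Stand-in worker: pretend to download comics start..end-1; one range fails."""
    if start == 21:
        raise TimeoutError('simulated timeout in range %d-%d' % (start, end - 1))
    return list(range(start, end))

def run_all(ranges, worker, max_workers=8):
    """Run worker(start, end) for each range; collect results and errors separately."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, s, e): (s, e) for s, e in ranges}
        for future in as_completed(futures):
            key = futures[future]
            try:
                done[key] = future.result()
            except Exception as exc:
                failed[key] = exc  # remember which range failed, keep going
    return done, failed

ranges = [(i, i + 10) for i in range(1, 81, 10)]
done, failed = run_all(ranges, download_range)
print(len(done), sorted(failed))
print('Download End.')
```

The failed ranges can then be retried on their own. This only helps if the worker actually raises, so a real download worker would still need a `timeout` on each request.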

### 3.5 Downloading geek comics (linux.cn)

In [149]:
import requests, bs4, os

for i in range(1,6):
    url = 'https://linux.cn/talk/comic/index.php?page=' + str(i)
    headers = {'User-Agent':'Mozilla/8.0 (compatible; MSIE 8.0; Windows 7)'}
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)
    for s in soup.select('h2 span[class="title"] a'):
        comicUrl = s.get('href')
        comicRes = requests.get(comicUrl, headers=headers)
        comicRes.raise_for_status()
        comicSoup = bs4.BeautifulSoup(comicRes.text)
        comicImageUrl = comicSoup.select('#article_content img')[0].get('src')
        comicImageRes = requests.get(comicImageUrl, headers=headers)
        imageFile = open(os.path.dirname(os.getcwd()) + '/files/jkmh/' + os.path.basename(comicImageUrl), 'wb')
        for chunk in comicImageRes.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
print('Done!')

### 3.6 Playing with Douban

#### 3.6.1 Douban Top 250 books

Douban has a Top 250 book chart; read the title, author and rating of each book on it.

In [1]:
import requests, bs4, lxml, re
urls = []
for i in range(0,250,25):
    urls.append('https://book.douban.com/top250?start=' + str(i))
info = []
for url in urls:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    doubanBookSoup = bs4.BeautifulSoup(html, 'lxml')
    titles = doubanBookSoup.select('a[title]')
    scores = doubanBookSoup.select('span[class="rating_nums"]')
    authors = doubanBookSoup.select('p.pl')
    for title, author, score in zip(titles, authors, scores):
        data = {
            'title':((title.get_text()).replace('\n','')).replace(' ','') ,
            'author':re.compile(r'(\[\w+\])?(\w)+(·)?(\w+)(·\w+)?').search(author.get_text()).group(),
            'score':score.get_text()
        }
        info.append(data)

for i in info:
    if 9.5<=float(i['score'])<9.7:
        print(i)

{'title': '红楼梦', 'author': '曹雪芹', 'score': '9.6'}
{'title': '海贼王:ONEPIECE', 'author': '尾田荣一郎', 'score': '9.5'}


In [6]:
# Rewritten on 2019.11.01
import requests, bs4, lxml, os

topBooks = []
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
}
for pageNum in range(0,250,25):
    url = 'https://book.douban.com/top250?start=' + str(pageNum)
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    for i in range(25):
        info = soup.select('p.pl')[i].text
        data = {
            'Title':soup.select('.pl2 a')[i]['title'],
            'Rating':soup.select('.rating_nums')[i].text,    
            'Author': info.split('/')[0]
        }
        topBooks.append(data)
for book in topBooks:
    if float(book['Rating'])>=9.5:
        print(book)

{'Title': '红楼梦', 'Rating': '9.6', 'Author': '[清] 曹雪芹 著 '}
{'Title': '海贼王', 'Rating': '9.5', 'Author': '尾田荣一郎 '}


#### 3.6.2 Douban tags

In [42]:
# Single tag
import requests, bs4, lxml
import os, openpyxl, re, threading

def saveContent(tag):
    # Create and save the initial Excel workbook
    wb = openpyxl.Workbook()
    sheetTag = wb.create_sheet()
    sheetTag.title = tag
    sheetTag['A1'], sheetTag['B1'], sheetTag['C1'] = 'Label', 'Title', 'Author'
    sheetTag['D1'], sheetTag['E1'], sheetTag['F1'] = 'Rating', 'Comments', 'Link' 
    path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'douban', 'tagsEdu.xlsx')
    wb.save(path)
    wb.close()
    
    # Starting URL (%E6%95%99%E8%82%B2 is the URL-encoded tag 教育)
    urlTag = 'https://book.douban.com/tag/%E6%95%99%E8%82%B2'

    # Read the data
    total = 0
    while True:
        resTag = requests.get(urlTag)
        resTag.raise_for_status()
        soupTag = bs4.BeautifulSoup(resTag.text, 'lxml')
        num = len(soupTag.select('h2 a'))
        if num == 0:
            break
        for i in range(num):
            sheetTag['A'+str(i+2+total)] = soupTag.select('h1')[0].text.split(': ')[1]
            # print(soupTag.select('h2 a')[i].get('title'))
            sheetTag['B'+str(i+2+total)] = soupTag.select('h2 a')[i].get('title')
            sheetTag['C'+str(i+2+total)] = re.compile(r'\s+').sub('', soupTag.select('div.pub')[i].text).split('/')[0]
            sheetTag['F'+str(i+2+total)] = soupTag.select('h2 a')[i].get('href') 
            if '少于10人评价' in soupTag.select('.clearfix .pl')[i].text or '无人评价' in soupTag.select('.clearfix .pl')[i].text:
                sheetTag['D'+str(i+2+total)] = ''
                sheetTag['E'+str(i+2+total)] = 0
            else:
                sheetTag['D'+str(i+2+total)] = float(soupTag.select('.info .clearfix')[i].select('.rating_nums')[0].text)
                sheetTag['E'+str(i+2+total)] = int(re.compile(r'\d+').search(soupTag.select('.clearfix .pl')[i].text).group(0))
        total = total + num
        urlTag = 'https://book.douban.com' + soupTag.select('.next link')[0].get('href')
    # Save the data
    wb.save(path)
    wb.close()
saveContent('教育')

In [25]:
# Multi-tag rewrite: uses a User-Agent and multiple threads, and handles the case where there is no next page
import requests, bs4, lxml, os, openpyxl, re, threading

# "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
# "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
headers = {
    "Accept": "*/*",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
}

def tagSave(tagName, tagUrl):
    # Initialize the Excel workbook
    filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', '%s.xlsx'%tagName)
    # wb = openpyxl.load_workbook(filePath)
    wb = openpyxl.Workbook()
    sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
    sheet.title = tagName
    fields = ['Label', 'Title', 'Author', 'Rating', 'Comments', 'Link']
    for field, alpha in zip(fields, 'ABCDEF'):
        sheet[alpha+'1'] = field    
    # Read the book information under the tag
    total = 0
    while True:
        tagRes = requests.get(tagUrl, headers=headers)
        tagRes.raise_for_status()
        tagSoup = bs4.BeautifulSoup(tagRes.text, 'lxml')
        bookNums = len(tagSoup.select('h2 a'))
        if bookNums == 0:
            break
        for i in range(bookNums):
            sheet['A'+str(i+2+total)] = tagName
            sheet['B'+str(i+2+total)] = tagSoup.select('h2 a')[i].get('title')
            sheet['C'+str(i+2+total)] = re.compile(r'\s+').sub('', tagSoup.select('div.pub')[i].text).split('/')[0]
            sheet['F'+str(i+2+total)] = tagSoup.select('h2 a')[i].get('href')
            if '少于10人评价' in tagSoup.select('.clearfix .pl')[i].text or '无人评价' in tagSoup.select('.clearfix .pl')[i].text:
                sheet['D'+str(i+2+total)] = ''
                sheet['E'+str(i+2+total)] = 0
            else:
                sheet['D'+str(i+2+total)] = tagSoup.select('.info .clearfix')[i].select('.rating_nums')[0].text
                sheet['E'+str(i+2+total)] = int(re.compile(r'\d+').search(tagSoup.select('.clearfix .pl')[i].text).group(0))
        total = total + bookNums
        if tagSoup.select('.next link') == []:
            break
        else:
            tagUrl = 'https://book.douban.com' + tagSoup.select('.next link')[0].get('href')
    # Save the Excel workbook
    wb.save(filePath)
    wb.close()

# Read the tag names and URLs
url = 'https://book.douban.com/tag'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'lxml')
tags = soup.select('.tagCol>tbody>tr>td>a')
tagNames = []
tagUrls = []
for i in range(len(tags)):
    tagNames.append(tags[i].text)    
    tagUrls.append('https://book.douban.com' + tags[i].get('href'))

# Run the function, one thread per tag
saveThreads = []
for i in range(5):
    saveThread = threading.Thread(target=tagSave, args=[tagNames[i], tagUrls[i]])
    saveThreads.append(saveThread)
    saveThread.start()



In [6]:
# Original-text excerpts (blockquotes) for Douban books
import requests, bs4, os, lxml, re

def blockQuotes(fileID, fileName):
    url = 'https://book.douban.com/subject/%s/blockquotes'%fileID
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
    }
    while True:
        res = requests.get(url, headers=headers)
        soup = bs4.BeautifulSoup(res.text, 'lxml')
        originalTexts = soup.select('.blockquote-list>ul>li>figure')
        filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'douban', '%s.txt'%fileName)
        if os.path.exists(filePath):
            with open(filePath, 'a') as file:
                for text in originalTexts:
                    file.write(re.compile(r'\s+').sub('', text.text) + '\n' + '\n')
        else:
            with open(filePath, 'w') as file:
                for text in originalTexts:
                    file.write(re.compile(r'\s+').sub('', text.text) + '\n' + '\n')
        if soup.select('.next link') != []:
            url = 'https://book.douban.com/subject/%s/blockquotes'%fileID + soup.select('.next link')[0].get('href')
        else:
            break

In [6]:
# Changed for multithreading, so the parameters differ
import requests, bs4, lxml, os, openpyxl, re
import threading

def blockQuotes(startID, EndID):
    filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'doubanBooks.xlsx')
    wb = openpyxl.load_workbook(filePath)
    sheet = wb['total']  # get_sheet_by_name() is deprecated in openpyxl
    for i in range(startID, EndID):
        url = sheet['F'+str(i+1)].value + 'blockquotes'
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
        }
        # A dict keeps one value per key. The original literal repeated the "http" key
        # five times, so only the last entry ever took effect; it is the one kept here.
        proxies = {
            "http":"113.194.31.30:9999"
        }
        while True:
            res = requests.get(url, headers=headers, proxies=proxies)
            soup = bs4.BeautifulSoup(res.text, 'lxml')
            originalTexts = soup.select('.blockquote-list>ul>li>figure')
            fileName = str(i) +  '-' + sheet['A'+str(i+1)].value
            filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'douban', '%s.txt'%fileName)
            # Mode 'a' creates the file when it does not exist, so no existence check is needed
            with open(filePath, 'a') as file:
                for text in originalTexts:
                    file.write(re.compile(r'\s+').sub('', text.text) + '\n' + '\n')
            if soup.select('.next link') != []:
                url = sheet['F'+str(i+1)].value + 'blockquotes' + soup.select('.next link')[0].get('href')
            else:
                break
    wb.close()
# Gave up on multithreading, because every thread would have to read the same workbook file.
blockQuotes(411, 421)
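
One thing to watch with the `proxies` argument: a Python dict literal that repeats a key keeps only the last value, so listing several `"http"` entries does not rotate proxies. Below is a minimal sketch of one way to rotate them. The pool addresses are placeholders copied from the cell above (and are probably long dead), and `next_proxies` is a hypothetical helper, not part of `requests`.

```python
import itertools

# Repeating a key in a dict literal silently keeps only the last value.
collapsed = {
    "http": "59.57.149.237:9999",
    "http": "113.194.31.30:9999",
}

# Hypothetical pool; replace with proxies you actually control.
PROXY_POOL = ["59.57.149.237:9999", "114.239.151.8:808", "113.194.31.30:9999"]
_rotation = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict for the next address in the pool."""
    address = "http://" + next(_rotation)
    return {"http": address, "https": address}
```

Each request can then pass `proxies=next_proxies()` to `requests.get`, so consecutive requests go out through different addresses.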

  


### 3.7 Reading posts from the Zhao Yazhi (赵雅芝) Tieba forum

#### 3.7.1 Storing in dictionaries

In [2]:
import requests, bs4, lxml

info = []
for i in range(0, 20000, 50):
    url = 'http://tieba.baidu.com/f?kw=%E8%B5%B5%E9%9B%85%E8%8A%9D&ie=utf-8&pn=' + str(i)
    res = requests.get(url)
    res.raise_for_status()
    yazhiSoup = bs4.BeautifulSoup(res.text, 'lxml')
    titles = yazhiSoup.select('a[class="j_th_tit"]')
    replies = yazhiSoup.select('span[class="threadlist_rep_num center_text"]')
    for title, reply in zip(titles, replies):
        data = {
            "title":title.get_text(),
            "replies":reply.get_text(),
            "link":'http://tieba.baidu.com' + title['href']
        }
        info.append(data)

for i in info:
    if '新白娘子传奇' in i['title'] and int(i['replies'])>2000:
        print(i)

{'title': '【典雅一生】《新白娘子传奇》截图帖（不定时更新）', 'replies': '2474', 'link': 'http://tieba.baidu.com/p/4983620050'}
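
The `kw=%E8%B5%B5...` part of the URL is simply the forum name percent-encoded as UTF-8, and `pn` is the offset of the first thread on the page (50 threads per page, judging from the `range` step above). A small sketch of building these URLs with the standard library follows; `tieba_page_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def tieba_page_url(forum, offset):
    """Build the thread-list URL for a forum name and a thread offset."""
    query = urlencode({"kw": forum, "ie": "utf-8", "pn": offset})
    return "http://tieba.baidu.com/f?" + query

# First three pages of the forum
urls = [tieba_page_url("赵雅芝", pn) for pn in range(0, 150, 50)]
```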


#### 3.7.2 Writing to a text file

In [1]:
import requests, bs4, lxml, os

path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'yazhitieba', 'yazhi.txt')
with open(path, 'w+') as file_txt:
    file_txt.write('-------------------Title-----------------replies---------------link--------------\n')
    for i in range(0, 1000, 50):
        url = 'http://tieba.baidu.com/f?kw=%E8%B5%B5%E9%9B%85%E8%8A%9D&ie=utf-8&pn=' + str(i)
        res = requests.get(url)
        res.raise_for_status()
        yazhiSoup = bs4.BeautifulSoup(res.text, 'lxml')
        titles = yazhiSoup.select('a[class="j_th_tit"]')
        replies = yazhiSoup.select('span[class="threadlist_rep_num center_text"]')
        for title, reply in zip(titles, replies):
            data = {
                "title":title.get_text(),
                "replies":reply.get_text(),
                "link":'http://tieba.baidu.com' + title['href']
            }        
            file_txt.write(data['title'] + '  ' + data['replies'] + '  ' + data['link'] + '\n')
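
Joining fields with two spaces makes the file hard to parse later, since titles can contain spaces themselves. One hedged alternative is the standard `csv` module, sketched here against an in-memory buffer with a made-up sample row (the field names mirror the `data` dict above).

```python
import csv
import io

rows = [
    {"title": "Sample  title, with comma", "replies": "12",
     "link": "http://tieba.baidu.com/p/0000000000"},  # made-up sample row
]

# StringIO stands in for open(path, 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "replies", "link"])
writer.writeheader()
writer.writerows(rows)

# Reading the data back recovers every field intact, spaces and commas included.
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
```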

#### 3.7.3 Writing to Excel

In [7]:
import requests, bs4, lxml, openpyxl, os

wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'Tieba'
sheet['A1'] = 'Title'
sheet['B1'] = 'Replies'
sheet['C1'] = 'Link'
total = 0

for i in range(0, 20000, 50):
    url = 'http://tieba.baidu.com/f?kw=%E8%B5%B5%E9%9B%85%E8%8A%9D&ie=utf-8&pn=' + str(i)
    res = requests.get(url)
    res.raise_for_status()
    yazhiSoup = bs4.BeautifulSoup(res.text, 'lxml')
    titles = yazhiSoup.select('a[class="j_th_tit"]')
    replies = yazhiSoup.select('span[class="threadlist_rep_num center_text"]')
    numbers = range(2+total, 52+total)
    for title, reply, number in zip(titles, replies, numbers):
        sheet['A'+str(number)] = title.get_text()
        sheet['B'+str(number)] = reply.get_text()
        sheet['C'+str(number)] = 'http://tieba.baidu.com' + title['href']
    total = total + len(titles)
path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'yazhitieba', 'yazhi.xlsx')
wb.save(path)
wb.close()



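## 4. The `selenium` module

See the tutorial for the installation struggles.

### 4.1 First attempts

**If the window that opened when `browser` was assigned has been closed, calling `browser.get(url)` raises an error.**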

In [2]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

path = '/Users/caimeijuan/anaconda3/lib/python3.7/site-packages/selenium/webdriver/chrome/chromedriver'
browser = webdriver.Chrome(path)
url = 'http://inventwithpython.com'
browser.get(url)

try:
    elem = browser.find_element_by_class_name('card-img-top')
    print('Found <%s> element with that class name!' %(elem.tag_name))
except NoSuchElementException:  # catch only the "not found" case
    print('Was not able to find an element with that name.')

linkElem = browser.find_element_by_link_text('Read Online for Free')
type(linkElem)

linkElem.click()

links = browser.find_elements_by_partial_link_text('free review copy')
links

Found <img> element with that class name!


selenium.webdriver.remote.webelement.WebElement

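The link really did open. This finally reminded me of a **key trait of selenium**: it always speaks for wherever it currently is. Once a click navigates to another page, only that page's elements can be found.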

In [15]:
from selenium import webdriver

path = '/Users/caimeijuan/anaconda3/lib/python3.7/site-packages/selenium/webdriver/chrome/chromedriver'
browser = webdriver.Chrome(path)
url = 'http://gmail.com'
browser.get(url)

# Locate the input element in several ways:
classElem = browser.find_element_by_class_name('whsOnd')
idElem = browser.find_element_by_id('identifierId')
nameElem = browser.find_element_by_name('identifier')
tagElem = browser.find_element_by_tag_name('input')
classElem == idElem
idElem == nameElem
nameElem == tagElem

tagsElem_div = browser.find_elements_by_tag_name('div')
tagsElem_a = browser.find_elements_by_tag_name('a')

True

True

True

### 4.2 Logging in to Gmail

#### 4.2.1 Entering the email account

In [6]:
from selenium import webdriver
import getpass

path = '/Users/caimeijuan/anaconda3/lib/python3.7/site-packages/selenium/webdriver/chrome/chromedriver'
browser = webdriver.Chrome(path)
url = 'http://gmail.com'
browser.get(url)

In [7]:
email_input = browser.find_element_by_tag_name('input')
email_input.clear()
email_input.send_keys(getpass.getpass())

········


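I tried to do it all in one go by merging the two cells above, and it failed. The error was **element not visible**. The cause was that the Gmail page had not opened yet, so there was no element to find.

#### 4.2.2 Clicking Next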

In [8]:
classElem_btn = browser.find_element_by_class_name('RveJvd')
classElem_btn.click()

#### 4.2.3 Entering the password

In [9]:
tagsElem_input = browser.find_elements_by_tag_name('input')
password = tagsElem_input[2]
password.send_keys(getpass.getpass())

········


#### 4.2.4 Clicking Next

In [11]:
classElem_btn = browser.find_element_by_class_name('RveJvd')
classElem_btn.click()

Below is the screen for choosing a theme, shown after not logging in for a long time. It normally does not appear.

In [15]:
nameElem_btn = browser.find_element_by_name('welcome_dialog_next')
nameElem_btn.click()

In [18]:
typeElem_btn = browser.find_element_by_name('ok')
typeElem_btn.click()

The step-by-step login succeeded.

---
Merged into one complete program:

In [None]:
#_*_coding:utf-8_*_
from selenium import webdriver
import getpass

path = '/Users/caimeijuan/anaconda3/lib/python3.7/site-packages/selenium/webdriver/chrome/chromedriver'
browser = webdriver.Chrome(path)

def gmailSignin(email, password):
    # Open the sign-in page
    url = 'http://gmail.com'
    browser.get(url)

    # Enter the email account
    tagsElem_input = browser.find_elements_by_tag_name('input')
    emailElem = tagsElem_input[0]
    emailElem.clear()
    emailElem.send_keys(email)

    # Click Next
    classElem_btn = browser.find_element_by_class_name('RveJvd')
    classElem_btn.click()

    # Enter the password
    Elem_input = browser.find_element_by_name('password')
    print(Elem_input)
    Elem_input.send_keys(password)

    # Click Next
    classElem_btn = browser.find_element_by_class_name('RveJvd')
    classElem_btn.click()

email = getpass.getpass('Input your email account: ')
password = getpass.getpass('Input your password: ') # the typed value is not echoed
gmailSignin(email, password)

## 5. Getting data from dynamic web pages

Two tools:  

- [在线JSON校验格式化工具（Be JSON）](https://www.bejson.com/)
- [JSON在线解析及格式化验证 - JSON.cn](https://www.json.cn/)

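A pile of tutorials:

> Always remember: for a crawler, simulating a browser is usually the last resort. Consider it only when there is really no other way, because the memory overhead is large and it is very inefficient.

Two tutorials about Lagou, best read together: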

- [Web crawler with Python - 04.另一种抓取方式 - 知乎](https://zhuanlan.zhihu.com/p/20430122)
- [Python搭建代理池爬取拉勾网招聘信息 - 掘金](https://juejin.im/post/5d5e92916fb9a06ac93cd5f5)
    
The third is a bit more complex:  
    
- [Python爬取拉钩招聘网，让你清楚了解Python行业 - 掘金](https://juejin.im/post/5dc3ce0a6fb9a04aba52b643)

The fourth:
    
- [拉勾网反爬虫解决方法 | lijun Blog](https://darkless.cn/2019/05/25/lagou-crawl-solution/)

A tutorial about Toutiao:  

- [Python 网络爬虫：解析JSON, 获取JS动态内容—爬取今日头条, 抓取json内容 - Just Code](http://justcode.ikeepstudying.com/2018/12/python-%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB%EF%BC%9A%E8%A7%A3%E6%9E%90json-%E8%8E%B7%E5%8F%96js%E5%8A%A8%E6%80%81%E5%86%85%E5%AE%B9-%E7%88%AC%E5%8F%96%E4%BB%8A%E6%97%A5%E5%A4%B4%E6%9D%A1/)  

Tutorials about Taobao:  

- [用python抓取淘宝评论 - 云+社区 - 腾讯云](https://cloud.tencent.com/developer/article/1059747)  
- [Python 从零开始爬虫(五)——初遇json&爬取某宝商品信息 - Python 从零开始爬虫 - SegmentFault 思否](https://segmentfault.com/a/1190000014688216)  


### 5.1 Lagou  

#### 5.1.1 The usual approach with `requests` and `bs4`:  

In [4]:
import requests, bs4, lxml, os

url = 'https://www.lagou.com/zhaopin/Python/?labelWords=label'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')
soup.select('.list_item_top')

[]

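The result is empty, as expected, because the data is delivered dynamically by JavaScript.  

#### 5.1.2 Finding the JSON data  

In the developer tools (F12, or option+command+I on macOS), switch from the Elements tab used so far to the Network tab and watch the XHR entries. Refresh the page or click something that changes it, and see which new requests appear. Take the URL of a new request and analyze the JSON it returns.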

In [38]:
import requests, json, lxml, os

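# After clicking a job option, positionAjax really did show up. The URL below can be taken from right-click > Copy link address, from the URL under Headers, or by double-clicking the entry to open it.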
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'
res = requests.post(url)
res.raise_for_status()
res.json()

{'status': False,
 'msg': '您操作太频繁,请稍后再访问',
 'clientIp': '221.225.172.92',
 'state': 2408}

#### 5.1.3 Analyzing the URL parameters

```
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'
 
px = default
needAddtionalResult = false
```
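
The same breakdown can be done programmatically with `urllib.parse`, which is handy when a URL carries many parameters. A minimal sketch:

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'
parts = urlsplit(url)

# Everything before the "?" is the endpoint itself.
endpoint = parts.scheme + '://' + parts.netloc + parts.path
# parse_qs maps each parameter name to a list of values; keep the first of each.
params = {name: values[0] for name, values in parse_qs(parts.query).items()}
```

The resulting `params` dict can be passed to `requests` via the `params=` argument together with `endpoint`, instead of hard-coding the query string.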

#### 5.1.4 Adding a cookie

The result above matches the description in [Python搭建代理池爬取拉勾网招聘信息 - 掘金](https://juejin.im/post/5d5e92916fb9a06ac93cd5f5): no real data came back.  

The following can be extracted from the request headers:  

```
Request Method: POST 
Accept: application/json, text/javascript, */*; q=0.01
Cookie: (omitted)
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
Referer: https://www.lagou.com/jobs/list_Python/p-city_0?px=default
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Form Data  

    first: true
    pn: 1
    kd: Python
```
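
One detail worth checking when copying Form Data into Python: a `bool` is sent as `True`/`False` (Python's `str()` form), not the lowercase `true` shown by the browser. The sketch below uses `urllib.parse.urlencode` to show the difference; `requests` form-encodes a `data` dict in much the same way. Whether Lagou actually cares about the casing is an assumption that was not verified here.

```python
from urllib.parse import urlencode

# A Python bool is stringified with a capital letter...
python_style = urlencode({"first": True, "pn": 1, "kd": "Python"})
# ...whereas passing the literal string reproduces what the browser sends.
browser_style = urlencode({"first": "true", "pn": 1, "kd": "Python"})
```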

In [2]:
import requests, json

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Connection": "keep-alive",
    "Host": "www.lagou.com",
    "Referer": 'https://www.lagou.com/jobs/list_Python/p-city_0?px=default',
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    "Cookie": "JSESSIONID=ABAAABAAAIAACBIF4CE656D7045142F03438F07953A2EC5; user_trace_token=20191112134949-99c0fe75-1522-4700-9299-277ae02d8684; WEBTJ-ID=20191112134949-16e5e29370fcd-07a51ff750fd41-1c3c6a5a-1049088-16e5e293710479; _ga=GA1.2.1632364892.1573537790; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1573182236; LGUID=20191112134950-3ddaf708-0510-11ea-a62d-5254005c3644; _gid=GA1.2.1406510492.1573537790; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216e5e2939bb3cd-016f07345976e3-1c3c6a5a-1049088-16e5e2939bcd7%22%2C%22%24device_id%22%3A%2216e5e2939bb3cd-016f07345976e3-1c3c6a5a-1049088-16e5e2939bcd7%22%7D; index_location_city=%E5%85%A8%E5%9B%BD; _gat=1; LGSID=20191113092033-ca0183c5-05b3-11ea-a4f3-525400f775ce; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_Python%2Fp-city_0%3Fpx%3Ddefault%26gx%3D%25E5%2585%25A8%25E8%2581%258C%26gj%3D%26isSchoolJob%3D1; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_Python%2Fp-city_0%3Fpx%3Ddefault%26gx%3D%26isSchoolJob%3D1; TG-TRACK-CODE=index_navigation; X_HTTP_TOKEN=004a2ca20ecbebc496380637513c48bf18724f2cd0; LGRID=20191113092609-929473ea-05b4-11ea-a62d-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1573608370; SEARCH_ID=6303c489a71c4875a6cc7d1d88198d63"
}

data = {
    "first": True,
    "pn": 1,
    "kd": "python"
}

url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'
res = requests.post(url, headers=headers, data=data)
res.raise_for_status()
res.encoding = 'utf-8'
result = res.json()['content']['positionResult']['result']

# Save to Excel
import openpyxl, os

wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'PythonPosition'
sheet['A1'], sheet['B1'], sheet['C1'], sheet['D1'] = 'positionName','companyFullName','skillLables','createTime'
sheet['E1'], sheet['F1'], sheet['G1'] = 'salary','workYear','jobNature'
sheet['H1'], sheet['I1'], sheet['J1'] = 'address', 'latitude', 'longitude'

# Row 1 holds the headers, so the data rows start at 2
for i, p in enumerate(result, start=2):
    sheet['A' + str(i)] = p['positionName']
    sheet['B' + str(i)] = p['companyFullName']
    sheet['C' + str(i)] = ''.join(p['skillLables'])
    sheet['D' + str(i)] = p['createTime']
    sheet['E' + str(i)] = p['salary']
    sheet['F' + str(i)] = p['workYear']
    sheet['G' + str(i)] = p['jobNature']
    sheet['H' + str(i)] = ','.join([p['city'], p['district']])
    sheet['I' + str(i)] = p['latitude']
    sheet['J' + str(i)] = p['longitude']

path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'lagou', 'lagou.xlsx')
wb.save(path)
wb.close()

KeyError: 'content'

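I tried to fetch the second page, but got the same result as a few days earlier, and changing the IP did not help. The first page of data now feels like something that only appeared in a dream.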
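
The `KeyError: 'content'` above comes from indexing into a rate-limit reply as if it were a result page. A defensive sketch follows, using the two response shapes seen in this notebook. `extract_positions` is a hypothetical helper, and the success payload is a trimmed, made-up stand-in.

```python
def extract_positions(payload):
    """Return the job list, or raise with the server's message when blocked."""
    if 'content' not in payload:
        raise RuntimeError('Lagou refused the request: %s (state %s)'
                           % (payload.get('msg'), payload.get('state')))
    return payload['content']['positionResult']['result']

# Shape of the rate-limit reply recorded in this notebook
blocked = {'status': False, 'msg': '您操作太频繁,请稍后再访问', 'state': 2408}
# Made-up minimal success payload
ok = {'content': {'positionResult': {'result': [{'positionName': 'Python'}]}}}

positions = extract_positions(ok)
```

Calling `extract_positions(blocked)` raises a `RuntimeError` that carries the server's own message, which is easier to diagnose than a bare `KeyError`.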

In [14]:
import requests, json

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Connection": "keep-alive",
    "Host": "www.lagou.com",
    "Origin": "https://www.lagou.com",
    "Referer": "https://www.lagou.com/jobs/list_Python/p-city_0?px=default",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
}

# A dict keeps only the last value for a repeated key, so list a single proxy per scheme
proxies = {
    "http": "113.194.31.30:9999"
}

session = requests.session()
session.get(url, headers=headers, proxies=proxies)

data = {
    "first": "true",
    "pn": 1,
    "kd": "Python"
}

rep = session.post(url, headers=headers, proxies=proxies, data=data)
rep.json()

<Response [200]>

{'status': False,
 'msg': '您操作太频繁,请稍后再访问',
 'clientIp': '49.78.9.32',
 'state': 2402}

In [64]:
import requests, json

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Connection": "keep-alive",
    "Host": "www.lagou.com",
    "Referer": 'https://www.lagou.com/jobs/list_Python/p-city_0?px=default',
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    "Cookie": "JSESSIONID=ABAAABAAAIAACBIF4CE656D7045142F03438F07953A2EC5; user_trace_token=20191112134949-99c0fe75-1522-4700-9299-277ae02d8684; WEBTJ-ID=20191112134949-16e5e29370fcd-07a51ff750fd41-1c3c6a5a-1049088-16e5e293710479; _ga=GA1.2.1632364892.1573537790; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1573182236; LGUID=20191112134950-3ddaf708-0510-11ea-a62d-5254005c3644; _gid=GA1.2.1406510492.1573537790; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216e5e2939bb3cd-016f07345976e3-1c3c6a5a-1049088-16e5e2939bcd7%22%2C%22%24device_id%22%3A%2216e5e2939bb3cd-016f07345976e3-1c3c6a5a-1049088-16e5e2939bcd7%22%7D; index_location_city=%E5%85%A8%E5%9B%BD; _gat=1; LGSID=20191113092033-ca0183c5-05b3-11ea-a4f3-525400f775ce; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_Python%2Fp-city_0%3Fpx%3Ddefault%26gx%3D%25E5%2585%25A8%25E8%2581%258C%26gj%3D%26isSchoolJob%3D1; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_Python%2Fp-city_0%3Fpx%3Ddefault%26gx%3D%26isSchoolJob%3D1; TG-TRACK-CODE=index_navigation; X_HTTP_TOKEN=004a2ca20ecbebc496380637513c48bf18724f2cd0; LGRID=20191113092609-929473ea-05b4-11ea-a62d-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1573608370; SEARCH_ID=6303c489a71c4875a6cc7d1d88198d63"
}

data = {
    "first": False,
    "pn": 2,
    "kd": "python"
}

url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'
res = requests.post(url, headers=headers, data=data)
res.raise_for_status()
res.encoding = 'utf-8'
res.json()

{'status': False,
 'msg': '您操作太频繁,请稍后再访问',
 'clientIp': '180.117.236.114',
 'state': 2402}

### 5.2 Toutiao (今日头条)

#### 5.2.1 First attempt  

This time I only found the link behind the "24-hour hot articles" panel on the right.

In [20]:
import requests, json, bs4, lxml

url = 'https://www.toutiao.com/api/pc/realtime_news/' # right-click > Copy link address
res = requests.get(url)
data = json.loads(res.text)

for i in range(len(data['data'])) :
    newsUrl = 'https://www.toutiao.com' + data['data'][i]['open_url']
    # print(newsUrl)
   

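#### 5.2.2 Found the link for the main content list

Without headers and data, nothing comes back. With them, several consecutive runs all return data, which seems to differ from Lagou.  
Fetching pages in bulk still does not work, though.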

In [27]:
import requests, json, openpyxl, os

headers = {
    "Accept": "text/javascript, text/html, application/xml, text/xml, */*",
    "Referer": 'https://www.toutiao.com/',
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    "Cookie": "tt_webid=6756795573083686408; s_v_web_id=c3d479b6049b458d283d183b5d273435; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6756795573083686408; csrftoken=99c459a02dd8aa3bd451f4eda8c97e65; UM_distinctid=16e49816ff08a-0be5c23d474e65-1c3c6a5a-100200-16e49816ff14cf; CNZZDATA1259612802=1054615425-1573187717-https%253A%252F%252Fwww.toutiao.com%252F%7C1573187717; _ga=GA1.2.1403944884.1573191251; __tasessionId=gxxa0jjg41573695423978"
}

data = {
    "min_behot_time": 0,
    "category": "__all__",
    "utm_source": "toutiao",
    "widen": 1,
    "tadrequire": True,
    "as": "A1D56D4C1C8B28D",
    "cp": "5DCC2B02C87DBE1",
    "_signature": "WFkevAAgEB3UTJNZiPzNflhZHqAAAWW"
}

url = 'https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as=A1D56D4C1C8B28D&cp=5DCC2B02C87DBE1&_signature=WFkevAAgEB3UTJNZiPzNflhZHqAAAWW'
res = requests.get(url, headers=headers, data=data)
data = json.loads(res.text)
result1 = data['data']

headers = {
    "Accept": "text/javascript, text/html, application/xml, text/xml, */*",
    "Referer": 'https://www.toutiao.com/',
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    "Cookie": "tt_webid=6756795573083686408; s_v_web_id=c3d479b6049b458d283d183b5d273435; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6756795573083686408; csrftoken=99c459a02dd8aa3bd451f4eda8c97e65; UM_distinctid=16e49816ff08a-0be5c23d474e65-1c3c6a5a-100200-16e49816ff14cf; CNZZDATA1259612802=1054615425-1573187717-https%253A%252F%252Fwww.toutiao.com%252F%7C1573187717; _ga=GA1.2.1403944884.1573191251; __tasessionId=gxxa0jjg41573695423978"
}

data = {
    "max_behot_time": 1573699086,
    "category": "__all__",
    "utm_source": "toutiao",
    "widen": 1,
    "tadrequire": True,
    "as": "A1253D4CFC8C053",
    "cp": "5DCC5C2055634E1",
    "_signature": "NJMTMAAgEB64hp7VRhDaLTSTEyAAGlm"
}
url = 'https://www.toutiao.com/api/pc/feed/?max_behot_time=1573699086&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as=A1253D4CFC8C053&cp=5DCC5C2055634E1&_signature=NJMTMAAgEB64hp7VRhDaLTSTEyAAGlm'
res = requests.get(url, headers=headers, data=data)
data = json.loads(res.text)
result2 = data['data']

# Save to Excel
import openpyxl, os, time

wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'toutiaoNews'
sheet['A1'], sheet['B1'], sheet['C1']  = 'title','abstract','comments_count'
sheet['D1'], sheet['E1'] = 'behot_time','sourceMedia'
# Both pages share one layout, so write them in a single pass; row 1 holds the headers
for i, p in enumerate(result1 + result2, start=2):
    sheet['A' + str(i)] = p['title']
    sheet['A' + str(i)].hyperlink = 'https://www.toutiao.com' + p['source_url']
    sheet['B' + str(i)] = p['abstract']
    sheet['C' + str(i)] = p.get('comments_count', 0) # comments_count is occasionally missing and would raise KeyError
    sheet['D' + str(i)] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(p['behot_time']))
    sheet['E' + str(i)] = p['source']
    sheet['E' + str(i)].hyperlink = 'https://www.toutiao.com' + p.get('media_url', 'null')

path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'toutiao.xlsx')
wb.save(path)
wb.close()
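
The per-item logic above (optional fields read with `.get`, `behot_time` treated as a Unix timestamp) can be pulled into one function, which also makes it testable without Excel. This is a sketch with a made-up item. `feed_row` is a hypothetical name, and `time.gmtime` replaces `time.localtime` only so that the result does not depend on the machine's time zone.

```python
import time

def feed_row(item):
    """Flatten one feed item into (title, abstract, comments, time, source)."""
    stamp = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(item['behot_time']))
    return (
        item['title'],
        item.get('abstract', ''),
        item.get('comments_count', 0),  # sometimes missing from the feed
        stamp,
        item.get('source', ''),
    )

# Made-up item; only the timestamp is taken from the request above
sample = {'title': 'demo', 'behot_time': 1573699086, 'source': 'demo media'}
row = feed_row(sample)
```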



### 5.3 Taobao

[Python format 格式化函数 | 菜鸟教程](https://www.runoob.com/python/att-string-format.html)

In [4]:
import requests, bs4, lxml, json

url = 'https://item.taobao.com/item.htm?spm=a219r.lm874.14.13.292512d5Thbdic&id=599063101905&ns=1&abbucket=9'
res = requests.get(url)
res.raise_for_status()
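soup = bs4.BeautifulSoup(res.text, 'lxml')

# Let requests build the query string: backslash continuations inside a string
# literal would embed the indentation spaces in the URL
params = {
    'q': 'zara',
    'imgfile': '',
    'js': 1,
    'stats_click': 'search_radio_all:1',
    'initiative_id': 'staobaoz_{date}'.format(date='20191109'),
    'ie': 'utf8',
    'sort': 'price-asc',
    's': 88,
}
# sort: default (overall) ranking, default
# sort: by sales volume, sale-desc
# sort: by seller credit, credit-desc
# sort: price low to high, price-asc
# sort: price high to low, price-desc
# sort: total price low to high, total-asc
# sort: total price high to low, total-desc

res = requests.get('https://s.taobao.com/search', params=params)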

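## 6. Homework: surveying and mapping qualifications (测绘资质)

On the first attempt I thought the content had to be read from inside an iframe. At the time I did not know that JavaScript delivers live data, and I only saw that the content sat in an iframe. Nothing worked.
The pause I used then differs from the later one: `browser.implicitly_wait(20)  # let the browser wait, at most 20 seconds, to see whether the data has loaded`.   
In practice this line was less useful than `time.sleep(2)`.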

The tutorial I noted down last time:  
[Python selenium —— 一定要会用selenium的等待，三种等待方式解读 - 灰蓝 - CSDN博客](https://blog.csdn.net/huilan_same/article/details/52544521)
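
Between a fixed `time.sleep` and the global `implicitly_wait` sits the explicit wait: poll a condition until it holds or a timeout expires. The sketch below shows only that polling idea in plain Python. It is not selenium's implementation, and `wait_until` is a hypothetical helper. With selenium, the condition could be a lambda that checks whether an element lookup returns a non-empty list.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)

# Simulated page whose table becomes "ready" on the third poll
polls = []
def table_rows():
    polls.append(1)
    return ['row'] if len(polls) >= 3 else []

rows = wait_until(table_rows, timeout=2.0, interval=0.01)
```

Compared with `time.sleep(2)`, this returns as soon as the data is there and fails loudly when it never arrives.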

A way to handle nested frames: [python - Selenium Webdriver give NoSuchFrameException - Stack Overflow](https://stackoverflow.com/questions/28778142/selenium-webdriver-give-nosuchframeexception) 

I kept the province dictionary painstakingly written on the first attempt and deleted the rest:  
```
provinces = {
    '北 京':'BJ', '上 海':'SH', '天 津':'TJ', '重 庆':'CQ', '河 北':'HEB', '河 南':'HEN', '山 东':'SD', '山 西':'SX', '湖 北':'HB',\
    '湖 南':'HN', '辽 宁':'LN', '吉 林':'JL', '黑龙江':'HL', '江 苏':'JS', '浙 江':'ZJ', '福 建':'FJ', '安 徽':'AH', '广 东':'GD', \
    '广 西':'GX', '江 西':'JX', '四 川':'SC', '贵 州':'GZ', '海 南':'HI', '云 南':'YN', '陕 西':'SN', '甘 肃':'GS', '青 海':'QH', \
    '宁 夏':'NX', '新 疆':'XJ', '西 藏':'XZ', '内蒙古':'NM'
}
```

---
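Starting over. At first I assumed the JavaScript was sending JSON, but it is actually a table. So I dropped the JSON idea and simply had selenium click through the browser.  
This time it helped a lot to have learned how to find the JS requests in the Network tab.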

---

In [2]:
import requests, os, openpyxl, time
from selenium import webdriver

def downloadCehui(prov, url, pageNum):
    # Open the page
    path = '/Users/caimeijuan/anaconda3/lib/python3.7/site-packages/selenium/webdriver/chrome/chromedriver'
    browser = webdriver.Chrome(path)
    browser.get(url)
    browser.find_element_by_id('BtnSearch').click()
    browser.maximize_window()  # maximize the browser window
    # Initialize the Excel workbook
    filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'cehuizizhi.xlsx')
    wb = openpyxl.load_workbook(filePath) # append to an existing workbook; use openpyxl.Workbook() to create a new one
    sheet = wb.create_sheet()
    sheet.title = prov
    for alpha, i in zip('ABCDE', range(5)):
        field = browser.find_elements_by_css_selector('#UpdatePanel5>table>tbody>tr[class=formbg]>td')
        sheet[alpha + str(1)] = field[i].text
    # Store the page data
    total = 0
    for num in range(pageNum):
        try:
            number = len(browser.find_elements_by_css_selector('#UpdatePanel5>table>tbody>tr[style="line-height: 23px;"]'))
            for i in range(number):
                for alpha, j in zip('ABCDE', range(5)):
                    trTab = browser.find_elements_by_css_selector('#UpdatePanel5>table>tbody>tr[style="line-height: 23px;"]')
                    sheet[alpha + str(i+2+total)] = trTab[i].find_elements_by_css_selector('td')[j].text
            total = total + number
            nextPage = browser.find_elements_by_css_selector('a>img[src="/images/PageNavi/nextn.gif"]')[0]
            nextPage.click()
            time.sleep(2)
        except:
            break
    # Save the Excel workbook
    wb.save(filePath)
    wb.close()

In [3]:
url = 'http://bjchzz.ch.mnr.gov.cn/Index/QueryList.aspx?AreaId=2'  # Beijing
pageNum = 23
prov = 'bj'
downloadCehui(prov, url, pageNum)

In [4]:
url = 'http://tjchzz.ch.mnr.gov.cn/Index/QueryList.aspx?AreaId=21' # Tianjin
pageNum = 14
prov = 'tj'
downloadCehui(prov, url, pageNum)

In [5]:
url = 'http://hebchzz.ch.mnr.gov.cn/Index/QueryList.aspx?AreaId=28' # Hebei
pageNum = 64
prov = 'heb'
downloadCehui(prov, url, pageNum)

## 7. Zhihu  

I wanted to look at the columns (专栏). After writing the code based on the experience above, parsing the JSON raised an error:  

```
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```

According to [python - JSONDecodeError: Expecting value: line 1 column 1 (char 0) - Stack Overflow](https://stackoverflow.com/questions/16573332/jsondecodeerror-expecting-value-line-1-column-1-char-0), removing `"Accept-Encoding": "gzip, deflate, br"` from the headers is enough. I tried it and it worked. Many of the other answers assumed the cause was an abnormal response status code.
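
A likely explanation, offered as an assumption rather than something verified in this notebook: advertising `br` lets the server answer with Brotli, and if the Python environment cannot decode Brotli, `res.text` is still compressed bytes, which is not JSON. The sketch below reproduces the same error message, using `gzip` from the standard library as a stand-in for Brotli.

```python
import gzip
import json

body = json.dumps({'paging': {'is_end': False}}).encode('utf-8')
compressed = gzip.compress(body)

try:
    # Parsing the still-compressed payload as text fails at the very first character.
    json.loads(compressed.decode('latin-1'))
except json.JSONDecodeError as err:
    error_message = str(err)

# After decompression the same bytes parse normally.
decoded = json.loads(gzip.decompress(compressed).decode('utf-8'))
```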

This URL is handy: just change the value of `limit` (initially 6) to download more, without adding any code.
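
Paging through this endpoint follows a common `limit`/`offset` pattern, with `paging.is_end` marking the last page. Note that `json.loads` turns JSON `true` into Python `True`, so the flag should be tested as a boolean rather than compared with the string `'true'`. The sketch below uses a fake `fake_fetch` function standing in for `requests.get(...).text`; the page contents are made up.

```python
import json

URL_TEMPLATE = ('https://zhuanlan.zhihu.com/api/recommendations/columns'
                '?limit=%d&offset=%d&seed=7')

def fake_fetch(url, limit, offset, total=15):
    """Stand-in for the HTTP call: serves `total` made-up items page by page."""
    items = [{'id': n} for n in range(offset, min(offset + limit, total))]
    return json.dumps({'paging': {'is_end': offset + limit >= total}, 'data': items})

def collect_all(limit=6):
    """Walk the pages until the server reports the end."""
    results, offset = [], 0
    while True:
        url = URL_TEMPLATE % (limit, offset)
        payload = json.loads(fake_fetch(url, limit, offset))
        results.extend(payload['data'])
        if payload['paging']['is_end']:  # a real bool after json.loads
            break
        offset += limit
    return results

columns = collect_all()
```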

In [None]:
import requests, json, openpyxl, os, time

# Initialize the Excel workbook
filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'zhihu.xlsx')
wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'zhuanlan'
fields = ['Title', 'Description', 'Articles_count', 'ID', 'Intro', 'Followers', 'Created', 'Updated', 'URL']
for field, alpha in zip(fields, 'ABCDEFGHI'):
    sheet[alpha + '1'] = field

headers = {
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Origin": "https://zhuanlan.zhihu.com",
    "Referer": "https://zhuanlan.zhihu.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Cookie":'_zap=4f087203-bde1-469b-95d7-bec4f36c7893; d_c0="AIArtCd7iQ-PTqFWBtSg-yAwhjme3UJPnX0=|1559704236"; z_c0="2|1:0|10:1560992880|4:z_c0|92:Mi4xRDAwQUFBQUFBQUFBZ0N1MEozdUpEeVlBQUFCZ0FsVk5jQ3I0WFFBeWl2R1phaUxIbjZqRlc0NW9uQTNQeFJBMzhR|b17c2924dee3cb1868739bca6785dfc4dd7268b5af85498660c6fdc0feba7dc5"; __utmv=51854390.100--|2=registration_date=20110513=1^3=entry_date=20110513=1; __gads=ID=cca6833400db0db9:T=1561517189:S=ALNI_MZTpiPGOFaaZQsdAb1Jvru-1Zm3mw; __utma=51854390.143703964.1560993917.1562660822.1564392440.6; __utmz=51854390.1564392440.6.6.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/cao-mei-juan/collections; tst=r; _xsrf=j44aK1MNZhzDPcsBpwuu1Dgra5E8LAmh; q_c1=0bfc58ff4bb9453896e281ade35242f0|1573974697000|1560222725000; tgw_l7_route=73af20938a97f63d9b695ad561c4c10c; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1574297705,1574298200,1574303217,1574309266; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1574309635'
}

pn = 0
page = 6
total = 0
while True:
    url = 'https://zhuanlan.zhihu.com/api/recommendations/columns?limit=%d&offset=%d&seed=7'%(page, page*pn)
    res = requests.get(url, headers=headers)
    payload = json.loads(res.text)
    result = payload['data']
    for i in range(len(result)):
        sheet['A' + str(i+2+total)] = result[i]['title']
        sheet['B' + str(i+2+total)] = result[i]['description']
        sheet['C' + str(i+2+total)] = result[i]['articles_count']
        sheet['D' + str(i+2+total)] = result[i]['id']
        sheet['E' + str(i+2+total)] = result[i]['intro']
        sheet['F' + str(i+2+total)] = result[i]['followers']
        sheet['G' + str(i+2+total)] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(result[i]['created']))
        sheet['H' + str(i+2+total)] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(result[i]['updated']))
        sheet['I' + str(i+2+total)] = result[i]['url']
    total = total + len(result)
    pn = pn + 1
    wb.save(filePath)
    # json.loads turns JSON true into Python True, so test the flag as a boolean;
    # also stop on an empty page to avoid looping forever
    if payload['paging']['is_end'] or not result:
        break
wb.close()

  


In [74]:
import requests, json, openpyxl, os, time

# Initialize the Excel workbook
filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'zhihu.xlsx')
wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'zhuanlan'
fields = ['Title', 'Description', 'Articles_count', 'ID', 'Intro', 'Followers', 'Created', 'Updated', 'URL']
for field, alpha in zip(fields, 'ABCDEFGHI'):
    sheet[alpha + '1'] = field

headers = {
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Origin": "https://zhuanlan.zhihu.com",
    "Referer": "https://zhuanlan.zhihu.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Cookie":'_zap=4f087203-bde1-469b-95d7-bec4f36c7893; d_c0="AIArtCd7iQ-PTqFWBtSg-yAwhjme3UJPnX0=|1559704236"; z_c0="2|1:0|10:1560992880|4:z_c0|92:Mi4xRDAwQUFBQUFBQUFBZ0N1MEozdUpEeVlBQUFCZ0FsVk5jQ3I0WFFBeWl2R1phaUxIbjZqRlc0NW9uQTNQeFJBMzhR|b17c2924dee3cb1868739bca6785dfc4dd7268b5af85498660c6fdc0feba7dc5"; __utmv=51854390.100--|2=registration_date=20110513=1^3=entry_date=20110513=1; __gads=ID=cca6833400db0db9:T=1561517189:S=ALNI_MZTpiPGOFaaZQsdAb1Jvru-1Zm3mw; __utma=51854390.143703964.1560993917.1562660822.1564392440.6; __utmz=51854390.1564392440.6.6.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/cao-mei-juan/collections; tst=r; _xsrf=j44aK1MNZhzDPcsBpwuu1Dgra5E8LAmh; q_c1=0bfc58ff4bb9453896e281ade35242f0|1573974697000|1560222725000; tgw_l7_route=73af20938a97f63d9b695ad561c4c10c; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1574297705,1574298200,1574303217,1574309266; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1574309635'
}

page = 6
total = 0
for pn in range(5):
    url = 'https://zhuanlan.zhihu.com/api/recommendations/columns?limit=%d&offset=%d&seed=7'%(page, page*pn)
    res = requests.get(url, headers=headers)
    isEnd = json.loads(res.text)['paging']['is_end']
    result = json.loads(res.text)['data']
    print(len(result))
    for i in range(len(result)):
        sheet['A' + str(i+2+total)] = result[i]['title']
        sheet['B' + str(i+2+total)] = result[i]['description']
        sheet['C' + str(i+2+total)] = result[i]['articles_count']
        sheet['D' + str(i+2+total)] = result[i]['id']
        sheet['E' + str(i+2+total)] = result[i]['intro']
        sheet['F' + str(i+2+total)] = result[i]['followers']
        sheet['G' + str(i+2+total)] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(result[i]['created']))
        sheet['H' + str(i+2+total)] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(result[i]['updated']))
        sheet['I' + str(i+2+total)] = result[i]['url']
    total = total + len(result)
    print(total)
wb.save(filePath)
wb.close()

  


6
6
6
12
6
18
6
24
6
30


In [3]:
# _*_coding:utf-8_*_
import requests, bs4, re

def googleit(query):
    # Open the results page for the query
    # Replace each run of whitespace with '+'
    query = re.compile(r'\s+').sub('+', query)
    url = 'http://www.google.com/search?q=' + query
    print('Googling...')
    headers = {'User-Agent':'Mozilla/8.0 (compatible; MSIE 8.0; Windows 7)'}
    res = requests.get(url, headers=headers)
    res.raise_for_status()

    # Collect the result links
    urls = []
    soup = bs4.BeautifulSoup(res.text, 'lxml')  # name the parser explicitly
    linkElems = soup.select('.r a')
    for link in linkElems:
        urls.append('http://google.com' + link.get('href'))
    return urls

query = 'site:zhuanlan.zhihu.com'
googleit(query)

Googling...


ConnectionError: HTTPConnectionPool(host='www.google.com', port=80): Max retries exceeded with url: /search?q=site:zhuanlan.zhihu.com (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x105acd0b8>: Failed to establish a new connection: [Errno 60] Operation timed out'))

In [12]:
# Get the follow list
import re, bs4, requests, os, openpyxl, json

# Initialize the Excel worksheet
filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'users.xlsx')
wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'users'
fields = ['name', 'headline', 'articles_count', 'url']
for field, alpha in zip(fields, 'ABCD'):
    sheet[alpha + '1'] = field

url = 'https://www.zhihu.com/api/v4/members/cao-mei-juan/relations/mutuals?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=10'
headers = {
    "cookie":'_zap=4f087203-bde1-469b-95d7-bec4f36c7893; d_c0="AIArtCd7iQ-PTqFWBtSg-yAwhjme3UJPnX0=|1559704236"; z_c0="2|1:0|10:1560992880|4:z_c0|92:Mi4xRDAwQUFBQUFBQUFBZ0N1MEozdUpEeVlBQUFCZ0FsVk5jQ3I0WFFBeWl2R1phaUxIbjZqRlc0NW9uQTNQeFJBMzhR|b17c2924dee3cb1868739bca6785dfc4dd7268b5af85498660c6fdc0feba7dc5"; __utmv=51854390.100--|2=registration_date=20110513=1^3=entry_date=20110513=1; __gads=ID=cca6833400db0db9:T=1561517189:S=ALNI_MZTpiPGOFaaZQsdAb1Jvru-1Zm3mw; __utma=51854390.143703964.1560993917.1562660822.1564392440.6; __utmz=51854390.1564392440.6.6.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/cao-mei-juan/collections; _xsrf=j44aK1MNZhzDPcsBpwuu1Dgra5E8LAmh; q_c1=0bfc58ff4bb9453896e281ade35242f0|1573974697000|1560222725000; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1574303217,1574309266,1574317264,1574382663; tst=r; tgw_l7_route=7bacb9af7224ed68945ce419f4dea76d; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1574384964',
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
}
res = requests.get(url, headers=headers)
result = json.loads(res.text)['data']
for i in range(len(result)):
    sheet['A' + str(i+2)] = result[i]['name']
    sheet['B' + str(i+2)] = result[i]['headline']
    sheet['C' + str(i+2)] = result[i]['articles_count']
    sheet['D' + str(i+2)] = result[i]['url']
wb.save(filePath)
wb.close()  # the original omitted the parentheses, so the method was never called

In [19]:
# People I follow
import re, bs4, requests, os, openpyxl, json

# Initialize the Excel worksheet
filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'users.xlsx')
wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'users'
fields = ['name', 'headline', 'articles_count', 'url']
for field, alpha in zip(fields, 'ABCD'):
    sheet[alpha + '1'] = field

# Initial URL
headers = {
    "cookie":'_zap=4f087203-bde1-469b-95d7-bec4f36c7893; d_c0="AIArtCd7iQ-PTqFWBtSg-yAwhjme3UJPnX0=|1559704236"; z_c0="2|1:0|10:1560992880|4:z_c0|92:Mi4xRDAwQUFBQUFBQUFBZ0N1MEozdUpEeVlBQUFCZ0FsVk5jQ3I0WFFBeWl2R1phaUxIbjZqRlc0NW9uQTNQeFJBMzhR|b17c2924dee3cb1868739bca6785dfc4dd7268b5af85498660c6fdc0feba7dc5"; __utmv=51854390.100--|2=registration_date=20110513=1^3=entry_date=20110513=1; __gads=ID=cca6833400db0db9:T=1561517189:S=ALNI_MZTpiPGOFaaZQsdAb1Jvru-1Zm3mw; __utma=51854390.143703964.1560993917.1562660822.1564392440.6; __utmz=51854390.1564392440.6.6.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/cao-mei-juan/collections; _xsrf=j44aK1MNZhzDPcsBpwuu1Dgra5E8LAmh; q_c1=0bfc58ff4bb9453896e281ade35242f0|1573974697000|1560222725000; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1574303217,1574309266,1574317264,1574382663; tst=r; tgw_l7_route=7bacb9af7224ed68945ce419f4dea76d; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1574384964',
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
}
url = 'https://www.zhihu.com/api/v4/members/cao-mei-juan/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20'
total = 0

# Page through the results
while True:
    res = requests.get(url, headers=headers)
    result = json.loads(res.text)['data']
    for i in range(len(result)):
        sheet['A' + str(i+2+total)] = result[i]['name']
        sheet['B' + str(i+2+total)] = result[i]['headline']
        sheet['C' + str(i+2+total)] = result[i]['articles_count']
        sheet['D' + str(i+2+total)] = result[i]['url']
    total = total + len(result)
    wb.save(filePath)
    isEnd = json.loads(res.text)['paging']['is_end']
    if isEnd:  # JSON true is decoded to the bool True, never the string 'true'
        break
    else:
        url = 'https://www.zhihu.com/api/v4' + (json.loads(res.text)['paging']['next']).split('https://www.zhihu.com')[1]
wb.close()

KeyError: 'data'
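
The paging logic in the cell above depends on two details: `is_end` arrives as a JSON boolean, and a response without a `data` key has to be handled rather than indexed blindly. A minimal offline sketch of that loop follows. `iter_pages` and `FAKE_PAGES` are hypothetical names used only for illustration, and the guess that a missing `data` key comes from an error body (for example an expired cookie) is an assumption, not something the notebook confirms.

```python
import json

def iter_pages(fetch, first_url):
    """Yield each page's records until the API reports the last page."""
    url = first_url
    while url:
        payload = json.loads(fetch(url))
        if 'data' not in payload:
            # Assumption: an error body, e.g. returned when the cookie has expired
            raise RuntimeError('unexpected response: %r' % payload)
        yield payload['data']
        paging = payload.get('paging', {})
        # JSON `true` is decoded to the Python bool True, not the string 'true'
        if paging.get('is_end', True):
            break
        url = paging.get('next')

# Stand-in for requests.get(url).text, so the sketch runs offline
FAKE_PAGES = {
    'page1': '{"data": [{"name": "a"}, {"name": "b"}], "paging": {"is_end": false, "next": "page2"}}',
    'page2': '{"data": [{"name": "c"}], "paging": {"is_end": true, "next": "page3"}}',
}

names = [user['name'] for page in iter_pages(FAKE_PAGES.get, 'page1') for user in page]
print(names)
```

With a real endpoint, `fetch` could be something like `lambda u: requests.get(u, headers=headers).text`.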

## 8. Lianjia transaction data

In [25]:
import requests, bs4, lxml, openpyxl, os

wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'LianJia'
# Column headers
sheet['A1'] = 'Listing'
sheet['B1'] = 'Deal date'
sheet['C1'] = 'Total price (10k CNY)'
sheet['D1'] = 'Unit price (CNY/m²)'
sheet['E1'] = 'District'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
}
pn = 0
for i in range(100):
    url = 'https://su.lianjia.com/chengjiao/gaoxin1/pg'+str(i+1)
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    lianSoup = bs4.BeautifulSoup(res.text, 'lxml')

    fangs = lianSoup.select('ul li div div a[target="_blank"]')
    dates = lianSoup.select('div[class="dealDate"]')
    totals = lianSoup.select('div[class="totalPrice"] span')
    units = lianSoup.select('div[class="unitPrice"] span')

    for fang, date, total, unit, number in zip(fangs, dates, totals, units, range(2, 32)):
        sheet['A'+str(number+pn)] = fang.get_text()
        sheet['B'+str(number+pn)] = date.get_text()
        sheet['C'+str(number+pn)] = total.get_text()
        sheet['D'+str(number+pn)] = unit.get_text()
        sheet['E'+str(number+pn)] = 'Gaoxin District'
    pn = pn + 30
path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'scrapy', 'lianjia', 'lianjia.xlsx')
wb.save(path)
wb.close()
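
The loop above assumes every page returns exactly 30 deals (`range(2, 32)` and `pn + 30`). If a page is shorter, `zip` stops early and the next page still starts 30 rows further down, which leaves blank rows. One alternative is a running row counter. A minimal sketch with plain lists standing in for the scraped pages; `number_rows` is a hypothetical helper, not part of the original cell.

```python
def number_rows(pages, first_row=2):
    """Pair every record with the spreadsheet row it should be written to."""
    row = first_row  # row 1 holds the column headers
    for records in pages:
        for record in records:
            yield row, record
            row += 1  # advance by what was actually written, not by a fixed 30

# Hypothetical pages of uneven length
pages = [['deal-1', 'deal-2', 'deal-3'], ['deal-4'], ['deal-5', 'deal-6']]
for row, record in number_rows(pages):
    print('A' + str(row), record)
```

In the scraper, each `record` would be one `(fang, date, total, unit)` tuple from `zip`, and `row` would replace `number+pn`.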


## 9. Looking up software copyrights on Zhifufu

In [1]:
import requests, bs4, lxml, os

url = 'https://www.zhifufu.com/soft'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')
soup.select('.name')

[]
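
An empty list here means the raw HTML returned by `requests` contains no element with class `name`. One possible cause (not confirmed by this notebook) is that the page fills in its content with JavaScript after loading. A quick way to test this is to count the class in the downloaded HTML before writing selectors. The sketch below uses only the standard library and two made-up HTML snippets; `ClassCounter` and `count_class` are hypothetical names.

```python
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""
    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get('class') or '').split()
        if self.class_name in classes:
            self.count += 1

def count_class(html, class_name):
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count

# Made-up examples: one page with the markup present, one that only loads a script
static_html = '<ul><li class="name item">Demo</li></ul>'
script_only_html = '<div id="app"></div><script src="app.js"></script>'
print(count_class(static_html, 'name'))
print(count_class(script_only_html, 'name'))
```

If `count_class(res.text, 'name')` is zero for the real page, a browser-driven tool such as `selenium` is the more likely route.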

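## 10. Tianyancha

[A complete guide to scraping the Tianyancha site with Python - Zhihu](https://zhuanlan.zhihu.com/p/111913499)
Well, well: there is actually a patent titled "Anti-crawler system and method". And that patent belongs to Beijing Jindi Technology Co., Ltd. (北京金堤科技有限公司). And that company's product is, of all things, Tianyancha.

[Simple scraping of Tianyancha data, with code - Zhihu](https://zhuanlan.zhihu.com/p/25273167)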

In [12]:
# -*- coding:utf-8 -*-

from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def driver_open():
    # PhantomJS support was deprecated in Selenium 3.8 and removed in Selenium 4,
    # so this function needs an older Selenium release.
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    driver = webdriver.PhantomJS(executable_path='C:/ProgramData/Anaconda3/phantomjs-2.1.1-windows/bin/phantomjs.exe', desired_capabilities=dcap)
    return driver

def get_content(driver,url):
    driver.get(url)
    # Wait 5 seconds; adjust to how long the dynamic page takes to load
    time.sleep(5)
    # Get the rendered page source
    content = driver.page_source.encode('utf-8')
    driver.close()
    soup = BeautifulSoup(content, 'lxml')
    return soup

def get_basic_info(soup):
    # Variable names are pinyin abbreviations of the fields printed below
    company = soup.select('div.company_info_text > p.ng-binding')[0].text.replace("\n","").replace(" ","")
    fddbr = soup.select('.td-legalPersonName-value > p > a')[0].text
    zczb = soup.select('.td-regCapital-value > p ')[0].text
    zt = soup.select('.td-regStatus-value > p ')[0].text.replace("\n","").replace(" ","")
    zcrq = soup.select('.td-regTime-value > p ')[0].text
    basics = soup.select('.basic-td > .c8 > .ng-binding ')
    hy = basics[0].text
    qyzch = basics[1].text
    qylx = basics[2].text
    zzjgdm = basics[3].text
    yyqx = basics[4].text
    djjg = basics[5].text
    hzrq = basics[6].text
    tyshxydm = basics[7].text
    zcdz = basics[8].text
    jyfw = basics[9].text
    print(u'Company name: '+company)
    print(u'Legal representative: '+fddbr)
    print(u'Registered capital: '+zczb)
    print(u'Company status: '+zt)
    print(u'Registration date: '+zcrq)
    # print basics
    print(u'Industry: '+hy)
    print(u'Business registration number: '+qyzch)
    print(u'Company type: '+qylx)
    print(u'Organization code: '+zzjgdm)
    print(u'Business term: '+yyqx)
    print(u'Registration authority: '+djjg)
    print(u'Approval date: '+hzrq)
    print(u'Unified social credit code: '+tyshxydm)
    print(u'Registered address: '+zcdz)
    print(u'Business scope: '+jyfw)

def get_gg_info(soup):
    # Executives: print every position/name pair, not only the last one
    ggpersons = soup.find_all(attrs={"event-name": "company-detail-staff"})
    ggnames = soup.select('table.staff-table > tbody > tr > td.ng-scope > span.ng-binding')
    # print(len(gg))
    for ggperson, ggname in zip(ggpersons, ggnames):
        print(ggperson.text+" "+ggname.text)

def get_gd_info(soup):
    # Shareholders: the print now sits inside the loop
    tzfs = soup.find_all(attrs={"event-name": "company-detail-investment"})
    for i in range(len(tzfs)):
        tzf_split = tzfs[i].text.replace("\n","").split()
        tzf = ' '.join(tzf_split)
        print(tzf)

def get_tz_info(soup):
    # Outbound investments: the print now sits inside the loop
    btzs = soup.select('a.query_name')
    for i in range(len(btzs)):
        btz_name = btzs[i].select('span')[0].text
        print(btz_name)

if __name__=='__main__':
    url = "http://www.tianyancha.com/company/2310290454"
    driver = driver_open()
    soup = get_content(driver, url)
    print('----Fetching basic information----')
    get_basic_info(soup)
    print('----Fetching executive information----')
    get_gg_info(soup)
    print('----Fetching shareholder information----')
    get_gd_info(soup)
    print('----Fetching outbound investment information----')
    get_tz_info(soup)

----Fetching basic information----


IndexError: list index out of range
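
The `IndexError` comes from indexing `[0]` on a selector result that turned out to be empty, which happens whenever the page markup differs from what the selectors expect. A small guard makes that case explicit instead of crashing. This is a minimal sketch: `text_at` is a hypothetical helper, and `FakeTag` is only a stand-in for a `bs4` tag so the example runs without fetching the page.

```python
def text_at(elements, index=0, default=''):
    """Return the stripped text of elements[index], or default if it is missing."""
    try:
        element = elements[index]
    except IndexError:
        return default
    # bs4 Tag objects expose .text; fall back to str() for plain values
    return getattr(element, 'text', str(element)).strip()

class FakeTag:
    """Stand-in for a bs4 Tag so the sketch runs without a network call."""
    def __init__(self, text):
        self.text = text

print(repr(text_at([FakeTag('  Example Co.  ')])))
print(repr(text_at([], default='N/A')))
```

In `get_basic_info` this could be used as `company = text_at(soup.select('div.company_info_text > p.ng-binding'))`, so a changed page layout yields empty fields rather than a traceback.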