# Scraping Information from the Web

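These notes grew out of Chapter 11, "Web Scraping," of *Automate the Boring Stuff with Python*, supplemented with my own experiments and material from other books.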

Several important libraries:  

- 1. `webbrowser`
- 2. `requests`
- 3. `BeautifulSoup`  
- 4. `selenium`

There is also `urllib`, but Al Sweigart tells readers to forget about this library, meaning it is awkward to use.

---
Appendix: About HTML

**On macOS, Command+Option+I opens or closes the developer tools, just like F12 on Windows.**  

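The author advises: **do not use regular expressions to parse HTML**. For example, the case I ran into yesterday, where the `class` attribute sat in the middle of an `a` tag, is still valid HTML, and anticipating every such variation with a regex quickly becomes tedious. A module built for parsing HTML, such as BeautifulSoup, is far less error-prone. [html - RegEx match open tags except XHTML self-contained tags - Stack Overflow](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)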
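
To make the point concrete, here is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup. The class name `LinkCollector`, the helper `extract_links`, and the sample markup are all made up for illustration. The parser hands over attributes as name/value pairs, so their order, quoting style, and line breaks stop mattering.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, whatever order its attributes come in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in source order
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical snippet: attribute order, quoting, and line breaks all vary
sample = '''
<a href="/a.html" class="x">one</a>
<a class="x" href='/b.html'>two</a>
<a
   class="x"
   href="/c.html">three</a>
'''
print(extract_links(sample))
```

A regex that expects `<a href=` right after the tag name would miss the second and third links; the parser finds all three.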

---

In [1]:
# Make IPython display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 1. The `webbrowser` module

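The `webbrowser` module opens a URL directly in the browser.

Generate the URL first, then open it with `webbrowser`; this suits URLs that follow a pattern. When is it worth generating URLs and opening them to check?

- Dynamic URLs  

    - Query type: parameters are assembled into a new URL, as with Google Maps
    - Parsing type: the new URL is extracted from a piece of text
    
- Confirming page content

    - Numbered type: an ID from some irregular sequence of numbers forms the new URL, as on the novel site Jinjiang (jjwxc.net)
    - URLs already in hand that are tedious to open by hand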
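
As a rough sketch of the "generate the URL first" step, the two helpers below build URLs without opening anything. The function names are hypothetical; the two base URLs are the ones used in these notes. `urllib.parse.quote_plus` is used here as one alternative to a hand-written whitespace regex: it turns spaces into `+` and also percent-encodes characters such as commas.

```python
from urllib.parse import quote_plus


def build_maps_url(address):
    # Collapse runs of whitespace first, then percent-encode; spaces become '+'
    normalized = ' '.join(address.split())
    return 'http://www.google.com/maps/place/' + quote_plus(normalized)


def build_novel_urls(first_id, last_id):
    # Numbered case: one URL per novel ID in [first_id, last_id)
    base = 'http://www.jjwxc.net/onebook.php?novelid='
    return [base + str(novel_id) for novel_id in range(first_id, last_id)]


print(build_maps_url('Wuxi,   Jiangsu,   China'))
print(build_novel_urls(3000011, 3000014))
```

Each resulting string can then be passed to `webbrowser.open()`.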

### 1.1 Opening novel pages

In [1]:
import webbrowser
for i in range(3000011, 3000020):
    url = 'http://www.jjwxc.net/onebook.php?novelid=' + str(i)
    webbrowser.open(url)

### 1.2 Opening a single city on Google Maps

In [2]:
#_*_coding:utf-8_*_
import webbrowser
import re

def mapit(address):
    address = re.compile(r'\s+').sub('+', address)  # the replacement string '+' is literal, so it needs no backslash escape
    url = 'http://www.google.com/maps/place/' + address
    return webbrowser.open(url)

address = 'Wuxi,   Jiangsu,   China'    
mapit(address)

True

### 1.3 Opening a group of cities on Google Maps in bulk

In [3]:
import re
import webbrowser

def mapcities(cities):
    address = []
    for city in cities:
        address.append(city + ', Jiangsu, China')
    for a in address:
        a = re.compile(r'\s+').sub('+',a)
        url = 'http://www.google.com/maps/place/' + a
        webbrowser.open(url)
        
cities = ['Wuxi', 'Suzhou', 'Xuzhou', 'Zhenjiang', 'Taizhou']
mapcities(cities)

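### 1.4 Opening Jianshu notes in bulk

Copy the links below (exported from WorkFlowy) to the clipboard, then open them all with the program that follows: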

    [每天一本书 -《思考线》](https://www.jianshu.com/p/ee5e1c32f97d)
	[创意变为现实的最佳方法——《思考线》读后感](https://www.jianshu.com/p/c2132f7e02ae)
	[读思考线 ](https://www.jianshu.com/p/52e2e4dbb08c)
	[极具说服力的书（思考线：让你的创意变为现实的最佳方法](https://book.douban.com/review/7747085/)
	[思考线·思维导图.png (3104×1802)](https://upload-images.jianshu.io/upload_images/14183687-4b46af1a5294edfd.png)

In [4]:
import webbrowser, pyperclip, re

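# Strip the tabs that WorkFlowy adds, then handle one Markdown link per line
mdUrls = pyperclip.paste().replace('\t','').split('\n')
urlRegex = re.compile(r'\((http.*)\)')
for mdUrl in mdUrls:
    match = urlRegex.search(mdUrl)
    if match is None:   # skip blank lines and lines without a link
        continue
    webbrowser.open(match.group(1))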
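
The extraction step can also be separated from the clipboard and the browser, which makes it easier to try out. The sketch below is one possible variant, not the original method: `MD_LINK` and `extract_md_urls` are hypothetical names, and the pattern anchors on `](` so that parentheses inside a link title are ignored.

```python
import re

# Matches the "(url)" part of a Markdown link "[title](url)".
# Anchoring on "](" keeps parentheses inside the title from confusing the match.
MD_LINK = re.compile(r'\]\((https?://[^\s)]+)\)')


def extract_md_urls(text):
    return MD_LINK.findall(text)


clipboard_text = (
    '[读思考线 ](https://www.jianshu.com/p/52e2e4dbb08c)\n'
    '\t[思考线·思维导图.png (3104×1802)](https://upload-images.jianshu.io/upload_images/14183687-4b46af1a5294edfd.png)\n'
    '\n'
    'a line without any link\n'
)
for url in extract_md_urls(clipboard_text):
    print(url)
```

Because `findall` scans the whole text, blank lines and lines without links need no special handling.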

## 2. The `requests` module

requests documentation: [Requests: HTTP for Humans™ — Requests 2.21.0 documentation](https://requests.readthedocs.io/en/master/)

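`requests.get(url)` returns a response object:  

- Its type is `requests.models.Response`
- Attribute `text`
- Attribute `headers` is a dict-like structure (like a dict, it supports `get(key)` to return a value): `requests.structures.CaseInsensitiveDict`, whose keys are case-insensitive. Keys of a real dict are case-sensitive.
- Attribute `status_code`
- Attribute `encoding`
- Method `raise_for_status()`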

### 2.1 Trying it out

In [2]:
#_*_coding:utf-8_*_
import requests
import chardet

url0 = 'http://www.gutenberg.org/cache/epub/1112/pg1112.txt'
url1 = 'http://example.webscraping.com'
url2 = 'http://www.engine3d.com'

res0 = requests.get(url0)
res1 = requests.get(url1)
headers = {'user-agent': 'my-app/0.0.1'}
res2 = requests.get(url2, headers = headers)

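# Basic facts about the response
type(res0)
res0.status_code == requests.codes.ok
res2.encoding

# Look at the response text
len(res0.text)
res0.text[:250]

# Look at the response headers
res0.headers.get('content-type')
res2.headers
res2.headers.get('user-agent')  # None: 'user-agent' was sent in the request and is not in these response headers

# A URL that does not exist: see how the response reacts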
url3 = 'http://inventwithpython.com/page_that_does_not_exist'
res3 = requests.get(url3)
res3.raise_for_status()

requests.models.Response

True

'ISO-8859-1'

179378

"\ufeffThe Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare\r\n\r\n\r\n*******************************************************************\r\nTHIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A\r\nTIME WHEN PROOFING METHODS AND TOO"

'text/plain; charset=utf-8'

{'Server': 'AliyunOSS', 'Date': 'Fri, 01 Nov 2019 01:39:49 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'x-oss-request-id': '5DBB8CE55C74183036984F90', 'Last-Modified': 'Thu, 31 Oct 2019 02:27:20 GMT', 'x-oss-object-type': 'Normal', 'x-oss-hash-crc64ecma': '4434422554274640270', 'x-oss-storage-class': 'Standard', 'Content-MD5': 'l2yxdfF3i/9yvmsmNVa0SA==', 'x-oss-server-time': '18', 'Content-Encoding': 'gzip'}

HTTPError: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist

### 2.2 The first program that actually runs

In [4]:
def download(url):
    try:
        # Keep the request inside the try block so connection errors are reported too
        res = requests.get(url)
        res.raise_for_status()
        html = res.text
        print('Webpage was downloaded successfully.'+'\n'+'The beginning texts here:  '+'\n\n')
        print(html[:200])
    except Exception as exc :
        print('There was a problem: \n%s' %exc)
print('Download result for url0: \n')
download(url0)
print('Download result for url3: \n')
download(url3)

Download result for url0: 

Webpage was downloaded successfully.
The beginning texts here:  


The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare


*******************************************************************
THIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES
Download result for url3: 

There was a problem: 
404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist


### 2.3 Saving a file locally

In [5]:
import requests
import os

res = requests.get(url0)
res.raise_for_status()
path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', 'RomeoAndJuliet.txt')
playFile = open(path, 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)
playFile.close()

100000

79380

### 2.4 Encoding of the page source

Background on Unicode encoding: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

```
encode_type = chardet.detect(html)
html = html.decode(encode_type['encoding'])
```

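These two lines are not what solved the problem. I tested it: `html` is already a `str`, so there is nothing left to decode.

The file is opened in `wb` mode, and the content written comes from `res.iter_content(100000)`.  
The author says `iter_content` is used so that the requests module **does not consume too much** memory even when **downloading huge** files. According to the help text of `iter_content`, that saving applies when `stream=True` is set on the request.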

**The chunk size is the number of bytes it should read into memory.**
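
To see what "chunk size" means without touching the network, here is a small stand-in built on `io.BytesIO`. It does not use requests; `iter_chunks` and `copy_in_chunks` are hypothetical helpers that only imitate the read-a-block, write-a-block loop. The 179380-byte size is the sum of the two numbers printed in section 2.3 (100000 + 79380).

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield successive blocks of at most chunk_size bytes until the stream is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def copy_in_chunks(source, target, chunk_size=100000):
    # Returns the size of each block written, so the chunking is visible
    return [target.write(chunk) for chunk in iter_chunks(source, chunk_size)]


# Stand-in for a 179380-byte response body
source = io.BytesIO(b'x' * 179380)
target = io.BytesIO()
print(copy_in_chunks(source, target))
```

This prints `[100000, 79380]`: one full chunk and one partial chunk. Only one chunk is held in memory at a time.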

In [6]:
help(res.iter_content)

Help on method iter_content in module requests.models:

iter_content(chunk_size=1, decode_unicode=False) method of requests.models.Response instance
    Iterates over the response data.  When stream=True is set on the
    request, this avoids reading the content at once into memory for
    large responses.  The chunk size is the number of bytes it should
    read into memory.  This is not necessarily the length of each item
    returned as decoding can take place.
    
    chunk_size must be of type int or None. A value of None will
    function differently depending on the value of `stream`.
    stream=True will read data as it arrives in whatever size the
    chunks are received. If stream=False, data is returned as
    a single chunk.
    
    If decode_unicode is True, content will be decoded using the best
    available encoding based on the response.



## 3. The `BeautifulSoup` module

[BeautifulSoup高级应用 之 CSS selectors /CSS 选择器 - Winterto1990的博客 - CSDN博客](https://blog.csdn.net/Winterto1990/article/details/47808949)

A soup can come from two kinds of sources:

- 1. A web page
- 2. A local file

### 3.1 Trying it out

In [2]:
import requests
import bs4
import os

# Two kinds of sources for a soup:
# 1. From a web page
url = 'https://www.tripadvisor.cn/'
res = requests.get(url)
res.raise_for_status()
example1Soup = bs4.BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly
type(example1Soup)

# 2. From a local file
path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'NoStarch.html')
with open(path) as example2File:
    # Calling .read() or passing the file object directly makes no difference
    example2Soup = bs4.BeautifulSoup(example2File.read(), 'html.parser')
type(example2Soup)

bs4.BeautifulSoup

bs4.BeautifulSoup

In [2]:
# Parsing with lxml
import requests
import bs4
import lxml

url = 'https://www.tripadvisor.cn/'
res = requests.get(url)
res.raise_for_status()
html = res.text
tripSoup = bs4.BeautifulSoup(html, 'lxml')

tripSoup.select('title')
tripSoup.select_one('a>img')
tripSoup.select_one('a img')
# The page has changed, so the code below no longer works.
'''
tripSoup.select('input[type="radio"]')
tripSoup.select_one('input[type="radio"]')
tripSoup.select('#popularDestinations > div.section > ul.regionContent > li.active > ul > li:nth-child(1) > a > span.thumbCrop > img')
images = tripSoup.select('#popularDestinations > div.section > ul.regionContent > li.active > ul > li > a > span.thumbCrop > img')
titles = tripSoup.select('#popularDestinations > div.section > ul.regionContent > li.active > ul > li > div.title')
info = []
for title,image in zip(titles, images):
    data = {
            'title':((title.get_text()).replace('\n','')).replace('游记指南',''),
            'image':image.get('src')
        }
    info.append(data)
info
'''
# Not sure why the URL of the background image cannot be read
tripSoup.select('a>div>ul>li>div')[0]['style'].split(';')

[<title>TripAdvisor(猫途鹰) - 全球旅游点评,酒店/景点/餐厅,真实旅客评论</title>]

<img alt="TripAdvisor(猫途鹰)" class="brand-header-Logo__resizeImg--15ZcW" src="https://cc.ddcdn.com/img2/langs/zh_CN/branding/rebrand/TA_logo_primary.svg"/>

<img alt="TripAdvisor(猫途鹰)" class="brand-header-Logo__resizeImg--15ZcW" src="https://cc.ddcdn.com/img2/langs/zh_CN/branding/rebrand/TA_logo_primary.svg"/>

['height:100%', 'width:100%', 'background-size:cover', 'background-image:none']

In [3]:
import requests, bs4

url = 'https://www.joelonsoftware.com/'
res = requests.get(url)
htmlSoup = bs4.BeautifulSoup(res.text)

import pprint
for i in range(2):
    # pprint.pprint(htmlSoup.select('div>p>span>a')[i].attrs)
    # print(htmlSoup.select('div>p>span>a')[i].get('href'))
    # pprint.pprint(htmlSoup.select('header>h2>a')[i].attrs)
    print(htmlSoup.select('header>h2>a')[i].text)
    print(htmlSoup.select('header>h2>a')[i].get('href'))
for i in range(2):
    print(htmlSoup.select('div>ul>li>a')[i].text)
    
htmlSoup.select_one('img')['src']

Welcome, Prashanth!
https://www.joelonsoftware.com/2019/09/24/announcing-stack-overflows-new-ceo/
The next CEO of Stack Overflow
https://www.joelonsoftware.com/2019/03/28/the-next-ceo-of-stack-overflow/
Things You Should Never Do, Part I
Strategy Letter I: Ben and Jerry’s vs. Amazon


'https://i1.wp.com/www.joelonsoftware.com/wp-content/uploads/2016/12/Pong.png?w=730&ssl=1'

In [2]:
import requests, bs4

url = 'https://blog.csdn.net' 
res = requests.get(url)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text)
soup.select('title')
# soup.select('div nth-of-type(0)') did not work in IPython: the pseudo-class needs a colon and counts from 1, e.g. 'div:nth-of-type(1)'
# soup.select('body a')[:3]
# soup.select('div>ul>li.active')
# soup.select('.carousel-caption, p.name')
soup.select('a[href]')[0]
soup.select("a[href$='102605809']")[1].text

[<title>CSDN博客-专业IT技术发表平台</title>]

<a href="/">推荐</a>

'\n\n\t\t\t\t\tCSDN产品公告第2期：博客支持视频、专栏文章拖拽排序、APP霸王课来袭……\t\t\t\t\n'

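### 3.2 Downloading Chinese books from Project Gutenberg

I suddenly wanted to download books from Project Gutenberg. Let's try the Chinese ones.  

[Download]
1. Extract the txt URLs from the index page
2. Download them with the code from above.

[Traditional to Simplified conversion]
Find a suitable library 

> No installation is needed; just download these two files and save them in the same directory as your code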
```
https://raw.githubusercontent.com/skydark/nstools/master/zhtools/langconv.py  
https://raw.githubusercontent.com/skydark/nstools/master/zhtools/zh_wiki.py
```

#### 3.2.1 Downloading

In [9]:
# Download the txt files
import requests
import bs4
import os

url = 'https://www.gutenberg.org/browse/languages/zh'
res = requests.get(url)
res.raise_for_status()
fictionSoup = bs4.BeautifulSoup(res.text)
fictionList = fictionSoup.select("li.pgdbetext a[href^='/ebooks']")
for i in range(len(fictionList)):
    fileName = fictionList[i].get_text()    
    if '/' in fileName:
        fileName = fileName.replace('/', ' ')
    if '\\' in fileName:
        fileName = fileName.replace('\\', ' ')
    file = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', fileName +'.txt')
    if os.path.exists(file):
        fileName = fileName + '_new'
        file = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', fileName +'.txt')
    fictionID = (fictionList[i].get('href')).split('/')[2]
    fictionUrl = 'https://www.gutenberg.org/files/' + fictionID + '/' + fictionID + '-0.txt'
    with open(file ,'wb') as fictionFile:
        resFiction = requests.get(fictionUrl)
        for chunk in resFiction.iter_content(100000):
            fictionFile.write(chunk)

In total, 475 e-books were downloaded in txt format.  

The txt files all download fine, so let's try another format:  
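
The download loop replaces `/` and `\` in titles and appends `_new` when a file already exists, which covers only one collision per title. Below is a hedged sketch of a more general approach. `safe_filename` and `unique_path` are hypothetical helpers, and the set of characters treated as illegal is an assumption about common filesystems.

```python
import os
import re
import tempfile


def safe_filename(title):
    # Replace path separators and other characters that are illegal on common filesystems
    return re.sub(r'[\\/:*?"<>|]', ' ', title).strip()


def unique_path(directory, title, ext='.txt'):
    """Return a path in directory that does not exist yet, adding _2, _3, ... if needed."""
    base = safe_filename(title)
    candidate = os.path.join(directory, base + ext)
    counter = 2
    while os.path.exists(candidate):
        candidate = os.path.join(directory, '%s_%d%s' % (base, counter, ext))
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as folder:
    for _ in range(3):
        path = unique_path(folder, 'a/b')
        open(path, 'w').close()
    print(sorted(os.listdir(folder)))
```

Three books with the same title end up as `a b.txt`, `a b_2.txt`, and `a b_3.txt` instead of overwriting one another.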

In [11]:
# Download an epub

file = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', '玉樓春.epub')
url = 'https://www.gutenberg.org/ebooks/25422.epub.noimages?session_id=3c29b07a963878c5cd004f277b6d1d0adb08d623'
with open(file ,'wb') as fictionFile:
        resFiction = requests.get(url)
        for chunk in resFiction.iter_content(100000):
            fictionFile.write(chunk)

#### 3.2.2 Converting Traditional to Simplified Chinese  

In [13]:
from langconv import *

sentence = '玉樓春'
Converter('zh-hans').convert(sentence)

'玉楼春'

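Converting in bulk requires:  

1. Converting the file name  
2. Creating a new file. Since some file names are already in Simplified Chinese, the safe approach is to create a separate folder for the Simplified versions.  
3. Reading the file contents
4. Converting the file contents  
5. Writing to the new file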

In [26]:
import os
from langconv import Converter

pathFanti = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', 'fanti')
pathJianti = os.path.join(os.path.dirname(os.getcwd()), 'files', 'fiction', 'jianti')
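# List every file in this directory and keep only the .txt files. Building a new list
# avoids removing items from a list while iterating over it, which skips entries.
filesFanti = [f for f in os.listdir(pathFanti) if f.split('.')[-1] == 'txt']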
for fileNameFanti in filesFanti:
    fileNameJianti = Converter('zh-hans').convert(fileNameFanti)
    with open(os.path.join(pathJianti, fileNameJianti), 'w') as fileJianti:
        with open(os.path.join(pathFanti, fileNameFanti)) as fileFanti:
            contentFanti = fileFanti.readlines()
            for sentenceFanti in contentFanti:
                sentenceJianti = Converter('zh-hans').convert(sentenceFanti)
                fileJianti.write(sentenceJianti)

All files converted successfully except one that could not be opened.
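
One way to keep a single unreadable file from stopping the batch is to catch the error per file and report the failures at the end. This is a sketch under assumptions: `convert_folder` is a hypothetical helper, the files are assumed to be UTF-8, and `str.upper` stands in for `Converter('zh-hans').convert` so the example runs without `langconv`.

```python
import os
import tempfile


def convert_folder(src_dir, dst_dir, convert, encoding='utf-8'):
    """Convert every .txt file in src_dir into dst_dir; return the names that failed."""
    os.makedirs(dst_dir, exist_ok=True)
    failed = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith('.txt'):
            continue
        try:
            with open(os.path.join(src_dir, name), encoding=encoding) as src:
                text = src.read()
        except (UnicodeDecodeError, OSError):
            # Record the file and move on instead of aborting the whole batch
            failed.append(name)
            continue
        with open(os.path.join(dst_dir, convert(name)), 'w', encoding=encoding) as dst:
            dst.write(convert(text))
    return failed


with tempfile.TemporaryDirectory() as root:
    src_dir = os.path.join(root, 'fanti')
    os.makedirs(src_dir)
    with open(os.path.join(src_dir, 'good.txt'), 'w', encoding='utf-8') as f:
        f.write('hello')
    with open(os.path.join(src_dir, 'bad.txt'), 'wb') as f:
        f.write(b'\xff\xfe\xff')  # not valid UTF-8
    # str.upper stands in for Converter('zh-hans').convert
    print(convert_folder(src_dir, os.path.join(root, 'jianti'), str.upper))
```

The returned list names the files that still need attention, here `['bad.txt']`.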

### 3.3 Automated Google search

In [5]:
# _*_coding:utf-8_*_
import requests, bs4, webbrowser,re

def googleit(query):
    
    # Open the search results page
    if ' ' in query:
        query = re.compile(r'\s+').sub('+', query)
    url = 'http://www.google.com/search?q=' + query
    print('Googling...')
    headers = {'User-Agent':'Mozilla/8.0 (compatible; MSIE 8.0; Windows 7)'}
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    
    # Select the result links on the page
    soup = bs4.BeautifulSoup(res.text)
    linkElems = soup.select('.r a')
    
    # Open the first 5 results
    numOpen = min(5, len(linkElems))
    for i in range(numOpen):
        webbrowser.open('http://google.com' + linkElems[i].get('href'))
        # print(linkElems[i].get('href'))
query = 'python webscraping'
googleit(query)

Googling...


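### 3.4 Downloading xkcd comics

This was my first attempt at downloading images. I started with the program from the textbook and later rewrote it myself.

#### 3.4.1 The textbook program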

In [None]:
#_*_coding:utf-8_*_
import requests, bs4,os

url = 'http://xkcd.com'
# os.makedirs('xkcd', exist_ok=True)
while not url.endswith('#'):
    # Download the page
    # print('Downloading page %s ...'%url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)
    
    # Find the comic image
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        try:
            comicUrl = 'http:'+ comicElem[0].get('src')
            # print('Downloading image is %s ...'%comicUrl)
            res = requests.get(comicUrl)
            res.raise_for_status()  
        except requests.exceptions.RequestException:
            # src was a site-relative path rather than a protocol-relative URL
            comicUrl = 'http://www.xkcd.com'+ comicElem[0].get('src')
            print('That is wrong. Downloading image is %s ...'%comicUrl)    
            res = requests.get(comicUrl)
            res.raise_for_status() 

        # Save the comic. This stays inside the else branch so that a page without
        # an image does not write the previous response to disk again.
        # imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        imageFile = open(os.path.dirname(os.getcwd()) + '/files/xkcd/' + os.path.basename(comicUrl), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close() 

    # Find the previous comic
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')
    
print('Done.')

That is wrong. Downloading image is http://www.xkcd.com/2067/asset/challengers_header.png ...
Could not find comic image.
Could not find comic image.
That is wrong. Downloading image is http://www.xkcd.com/1525/bg.png ...
Could not find comic image.
Could not find comic image.


In [176]:
url = 'https://xkcd.com/2031/'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
comicElem = soup.select('#comic img')
try:
    comicUrl = 'http:'+ comicElem[0].get('src')
    print('Downloading image is %s ...'%comicUrl)
    res = requests.get(comicUrl)
    res.raise_for_status()  
except:
    comicUrl = 'http://www.xkcd.com'+ comicElem[0].get('src')
    print('That is wrong. Downloading image is %s ...'%comicUrl)    
    res = requests.get(comicUrl)
    res.raise_for_status() 

Downloading image is http://imgs.xkcd.com/comics/pie_charts.png ...


Now I understand: these pages use a canvas, not an img.

The error that appeared for both 2031 and 1813:  

SysCallError                              Traceback (most recent call last)

SSLError: HTTPSConnectionPool(host='xkcd.com', port=443): Max retries exceeded with url: /1813/ (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

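#### 3.4.2 Downloading the xkcd comics again

After learning that the comic's author, Randall Munroe, also wrote *Thing Explainer*, I decided to download the xkcd comics again.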

In [85]:
import requests
import bs4
import os

url = 'https://xkcd.com/1'
while True:
    # Get the comic URL from the current page
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)
    try:
        # The current page has a static image for the comic
        srcComic = soup.select_one('#comic img').get('src')
        urlComic = 'https:'+ srcComic
    except (AttributeError, TypeError):
        # No image on this page: move straight on to the previous comic
        srcPrev = soup.select_one(".comicNav a[rel='prev']").get('href')
        print(srcPrev)
        url = 'https://xkcd.com' + srcPrev
        continue
    
    # Save the current comic locally
    fileName = srcComic.split('/')[-1]
    filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'xkcd', fileName)
    with open(filePath, 'wb') as comicFile:
        resComic = requests.get(urlComic)
        for chunk in resComic.iter_content(100000):
            comicFile.write(chunk)
    
    # Get the URL of the previous page
    srcPrev = soup.select_one(".comicNav a[rel='prev']").get('href')
    # Reached the page of the very first comic
    if srcPrev == '#':
        break
    print(srcPrev)
    url = 'https://xkcd.com' + srcPrev

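Problem comics: 2198, 2067, 1663, 1608, 1525, 1416, 1350  

2198, 1663, 1608, 1525, and 1416 are all games: the pages are interactive rather than ordinary static images. 1416 zooms in and embeds a framed web page. 1350 has no content.

2067 also zooms in and contains links, so it is not an ordinary image either. The error shown is:  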

```
MissingSchema: Invalid URL 'https:/2067/asset/challengers_header.png': No schema supplied. Perhaps you meant http://https:/2067/asset/challengers_header.png?
```
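
The `MissingSchema` error comes from gluing `'https:'` onto a `src` that starts with a single slash. A possible fix, sketched here with the standard library, is to resolve `src` against the page URL with `urllib.parse.urljoin`, which handles both the protocol-relative form (`//imgs.xkcd.com/...`) and the site-relative form (`/2067/asset/...`). The helper name `comic_image_url` is made up; the two `src` values are inferred from the URLs printed in these notes.

```python
from urllib.parse import urljoin


def comic_image_url(page_url, src):
    # urljoin resolves src the way a browser would, relative to the page URL
    return urljoin(page_url, src)


print(comic_image_url('https://xkcd.com/2031/', '//imgs.xkcd.com/comics/pie_charts.png'))
print(comic_image_url('https://xkcd.com/2067/', '/2067/asset/challengers_header.png'))
```

Both results are complete URLs with a scheme. That only fixes the URL; interactive comics such as 2067 may still need special handling.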

Comics 1052-878 downloaded in one run without incident, as did 874-787, 691-483, and 446-250.

The timeout error looks like this:  

```
SSLError: HTTPSConnectionPool(host='xkcd.com', port=443): Max retries exceeded with url: /240/ (Caused by SSLError(SSLError("bad handshake: SysCallError(60, 'ETIMEDOUT')")))

```


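1. A try-except is needed so that, when something unexpected happens, the program simply follows the prev button and carries on. 
   But try-except cannot solve the 2067 problem.
2. How should timeouts be handled?

Three problems with the current program:  

1. try-except cannot solve the 2067 problem  
2. Timeouts are not handled  
3. Downloading is too slow: only 200-300 comics per hour on average, so all 2200 comics would take seven or eight hours.  

Problem 3 can be solved with multithreading.  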
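
For problem 2, a common pattern is to retry a failed request a few times before giving up. The sketch below shows only the retry logic, using a simulated fetch so it runs offline; `with_retries` and `flaky_fetch` are hypothetical names. With requests, the wrapped call would typically be something like `lambda: requests.get(url, timeout=10)` with `retry_on=(requests.exceptions.RequestException,)`. The 10-second value is an arbitrary choice.

```python
import time


def with_retries(fetch, attempts=3, delay=1.0, retry_on=(TimeoutError, ConnectionError)):
    """Call fetch() up to `attempts` times, sleeping `delay` seconds between failures."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as exc:
            last_error = exc
            print('attempt %d failed: %s' % (attempt, exc))
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Stand-in for a flaky download: fails twice, then succeeds
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return b'image bytes'


print(with_retries(flaky_fetch, delay=0))
```

If every attempt fails, the last exception is re-raised, so the caller can still decide to skip that comic and move on to the previous one.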

In [14]:
# Download a single comic

url = 'https://xkcd.com/'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
srcComic = soup.select_one('#comic img').get('src')
urlComic = 'https:'+ srcComic

# Save the current comic locally
fileName = srcComic.split('/')[-1]
filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'xkcd', fileName)
with open(filePath, 'wb') as comicFile:
    resComic = requests.get(urlComic)
    for chunk in resComic.iter_content(100000):
        comicFile.write(chunk)

#### 3.4.3 Downloading xkcd comics with multiple threads

In [6]:
import requests
import bs4
import threading
import os

def downloadComic(startComic, endComic):
    for comicID in range(startComic, endComic):
        # Get the comic's image URL
        res = requests.get('https://xkcd.com/%s'%str(comicID))
        soup = bs4.BeautifulSoup(res.text)
        comicElem = soup.select('#comic img')
        if comicElem == []:
            print('Could not find comic img: %s'%str(comicID))
        else:
            srcComic = comicElem[0].get('src')
            urlComic = 'https:'+ srcComic
            # Save it locally
            fileName = str(comicID) + '-' + srcComic.split('/')[-1]
            filePath = os.path.join(os.path.dirname(os.getcwd()), 'files', 'xkcd', fileName)
            if os.path.exists(filePath):
                continue
            with open(filePath, 'wb') as comicFile:
                resComic = requests.get(urlComic)
                for chunk in resComic.iter_content(100000):
                    comicFile.write(chunk)

downloadThreads = []
for i in range(1, 81, 10):
    downloadThread = threading.Thread(target=downloadComic, args=[i, i+10])
    downloadThreads.append(downloadThread)
    downloadThread.start()
for downloadThread in downloadThreads:
    downloadThread.join()
print('Download End.')

Download End.


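For 'Download End.' to print after all the threads finish, no thread may crash or stall. Once a timeout occurs and a request hangs, its `join()` never returns, and the program is left in a state where it can never finish.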
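
One way to make the run resilient is to submit one task per comic to a `concurrent.futures.ThreadPoolExecutor` and collect each task's exception instead of letting it end a whole batch. This is a sketch, not the program above: `run_batches` and `fake_download` are hypothetical, and the fake download only simulates failures. A real download function would also need a `timeout=` on its requests so that a stalled connection raises instead of hanging.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batches(download_one, comic_ids, max_workers=8):
    """Run download_one(comic_id) in a thread pool; return (succeeded, failed) id lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, cid): cid for cid in comic_ids}
        for future in as_completed(futures):
            cid = futures[future]
            try:
                future.result()
                succeeded.append(cid)
            except Exception:
                # One failed comic no longer takes the rest of its batch down with it
                failed.append(cid)
    return sorted(succeeded), sorted(failed)


def fake_download(comic_id):
    # Stand-in for the real download; pretends every 10th comic times out
    if comic_id % 10 == 0:
        raise TimeoutError('simulated timeout for %d' % comic_id)


ok, bad = run_batches(fake_download, range(1, 31))
print('Download End. %d ok, failed: %s' % (len(ok), bad))
```

The list of failed IDs can be fed back into `run_batches` for a second pass.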

### 3.5 Downloading the Geek Comics (极客漫画)

In [149]:
import requests, bs4, os

for i in range(1,6):
    url = 'https://linux.cn/talk/comic/index.php?page=' + str(i)
    headers = {'User-Agent':'Mozilla/8.0 (compatible; MSIE 8.0; Windows 7)'}
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)
    for s in soup.select('h2 span[class="title"] a'):
        comicUrl = s.get('href')
        comicRes = requests.get(comicUrl, headers=headers)
        comicRes.raise_for_status()
        comicSoup = bs4.BeautifulSoup(comicRes.text)
        comicImageUrl = comicSoup.select('#article_content img')[0].get('src')
        comicImageRes = requests.get(comicImageUrl, headers=headers)
        imageFile = open(os.path.dirname(os.getcwd()) + '/files/jkmh/' + os.path.basename(comicImageUrl), 'wb')
        for chunk in comicImageRes.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
print('Done!')

### 3.6 Playing with Douban

#### 3.6.1 Douban Top 250 books

Douban has a Top 250 book list; this reads each book's title, author, and rating.

In [1]:
import requests, bs4, lxml, re
urls = []
for i in range(0,250,25):
    urls.append('https://book.douban.com/top250?start=' + str(i))
info = []
for url in urls:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    doubanBookSoup = bs4.BeautifulSoup(html, 'lxml')
    titles = doubanBookSoup.select('a[title]')
    scores = doubanBookSoup.select('span[class="rating_nums"]')
    authors = doubanBookSoup.select('p.pl')
    for title, author, score in zip(titles, authors, scores):
        data = {
            'title':((title.get_text()).replace('\n','')).replace(' ','') ,
            'author':re.compile(r'(\[\w+\])?(\w)+(·)?(\w+)(·\w+)?').search(author.get_text()).group(),
            'score':score.get_text()
        }
        info.append(data)

for i in info:
    if 9.5<=float(i['score'])<9.7:
        print(i)

{'title': '红楼梦', 'author': '曹雪芹', 'score': '9.6'}
{'title': '海贼王:ONEPIECE', 'author': '尾田荣一郎', 'score': '9.5'}


In [2]:
# Rewritten on 2019-11-01
import requests
import bs4
import lxml
import os

topBooks = []
for pageNum in range(0,250,25):
    url = 'https://book.douban.com/top250?start=' + str(pageNum)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    for i in range(25):
        info = soup.select('p.pl')[i].text
        data = {
            'Title':soup.select('.pl2 a')[i]['title'],
            'Rating':soup.select('.rating_nums')[i].text,    
            'Author': info.split('/')[0]
        }
        topBooks.append(data)
for book in topBooks:
    if float(book['Rating'])>=9.5:
        print(book)

{'Title': '红楼梦', 'Rating': '9.6', 'Author': '[清] 曹雪芹 著 '}
{'Title': '海贼王', 'Rating': '9.5', 'Author': '尾田荣一郎 '}
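
The second version leaves decorations such as `[清]` and `著` in the author field. A possible clean-up step is sketched below. `clean_author` is a hypothetical helper, and the publisher, date, and price fields in the sample strings are made-up placeholders; only the author parts come from the output above.

```python
import re


def clean_author(info):
    """Extract the bare author name from an 'author / publisher / ...' line."""
    first = info.split('/')[0].strip()
    first = re.sub(r'^\[[^\]]*\]\s*', '', first)  # drop a leading "[清]"-style prefix
    first = re.sub(r'\s*著$', '', first)           # drop a trailing "著" (written by)
    return first


# The publisher/date/price fields below are made-up placeholders
print(clean_author('[清] 曹雪芹 著 / Publisher / 1996-12 / 59.70'))
print(clean_author('尾田荣一郎 / Publisher / 2007-1 / 7.50'))
```

Splitting first and trimming afterwards keeps the two regular expressions short compared with one pattern that tries to match the whole name.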


#### 3.6.2 Douban tags

In [20]:
# Single tag
import requests, bs4, lxml
import os, openpyxl, re, threading

urlTag = 'https://book.douban.com/tag/%E6%95%99%E8%82%B2'  
resTag = requests.get(urlTag)
resTag.raise_for_status()
soupTag = bs4.BeautifulSoup(resTag.text, 'lxml')

wb = openpyxl.Workbook()
sheetTag = wb.create_sheet()
tag = soupTag.select('h1')[0].text.split(': ')[1]  # split the heading's text, not the Tag object
sheetTag.title = tag
sheetTag['A1'], sheetTag['B1'], sheetTag['C1'] = 'Label', 'Title', 'Author'
sheetTag['D1'], sheetTag['E1'] = 'Rating', 'Link' 
path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'douban', 'tags.xlsx')
wb.save(path)
wb.close()

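def saveContent(tag):
    urlTag = 'https://book.douban.com/tag/' + tag
    total = 0
    while True:
        # Fetch the page on every iteration; otherwise the loop never advances
        resTag = requests.get(urlTag)
        resTag.raise_for_status()
        soupTag = bs4.BeautifulSoup(resTag.text, 'lxml')
        num = len(soupTag.select('h2 a'))
        if num == 0:
            break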
        for i in range(num):
            sheetTag['A'+str(i+2+total)] = soupTag.select('h1')[0].text.split(': ')[1]
            sheetTag['B'+str(i+2+total)] = soupTag.select('h2 a')[i].get('title')
            sheetTag['C'+str(i+2+total)] = re.compile(r'\s+').sub('', soupTag.select('div.pub')[i].text).split('/')[0]
            sheetTag['E'+str(i+2+total)] = soupTag.select('h2 a')[i].get('href') 
            if '少于10人评价' in soupTag.select('.clearfix .pl')[i].text:
                sheetTag['D'+str(i+2+total)] = ''
            else:
                sheetTag['D'+str(i+2+total)] = soupTag.select('.info .clearfix')[i].select('.rating_nums')[0].text
        total = total + num
        urlTag = 'https://book.douban.com' + soupTag.select('.next link')[0].get('href')
    path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'douban', 'tags.xlsx')
    wb.save(path)
    wb.close()

saveThread = threading.Thread(target=saveContent, args=[tag])
saveThread.start()

HTTPError: 403 Client Error: Forbidden for url: https://sec.douban.com/b?r=https%3A%2F%2Fbook.douban.com%2Ftag%2F%25E6%2595%2599%25E8%2582%25B2

In [19]:
# Multiple tags
import requests, bs4, lxml
import os, openpyxl, re, threading

url = 'https://book.douban.com/tag/'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')
wb = openpyxl.Workbook()
tags = []
for tagID in range(len(soup.select('.tagCol a'))):
    urlTag = 'https://book.douban.com' + soup.select('.tagCol a')[tagID].get('href')   
    sheetTag = wb.create_sheet()
    tag = soup.select('.tagCol a')[tagID].text
    sheetTag.title = tag
    tags.append(tag)
    sheetTag['A1'], sheetTag['B1'], sheetTag['C1'] = 'Label', 'Title', 'Author'
    sheetTag['D1'], sheetTag['E1'] = 'Rating', 'Link' 
path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'douban', 'tags.xlsx')
wb.save(path)
wb.close()

def saveContent(tag):
    urlTag = 'https://book.douban.com/tag/' + tag
    sheetTag = wb[tag]
    total = 0
    while True:
        resTag = requests.get(urlTag)
        resTag.raise_for_status()
        soupTag = bs4.BeautifulSoup(resTag.text, 'lxml')
        num = len(soupTag.select('h2 a'))
        if num == 0:
            break
        for i in range(num):
            sheetTag['A'+str(i+2+total)] = soupTag.select('h1')[0].text.split(': ')[1]
            sheetTag['B'+str(i+2+total)] = soupTag.select('h2 a')[i].get('title')
            sheetTag['C'+str(i+2+total)] = re.compile(r'\s+').sub('', soupTag.select('div.pub')[i].text).split('/')[0]
            sheetTag['E'+str(i+2+total)] = soupTag.select('h2 a')[i].get('href') 
            if '少于10人评价' in soupTag.select('.clearfix .pl')[i].text:
                sheetTag['D'+str(i+2+total)] = ''
            else:
                sheetTag['D'+str(i+2+total)] = soupTag.select('.info .clearfix')[i].select('.rating_nums')[0].text
        total = total + num
        urlTag = 'https://book.douban.com' + soupTag.select('.next link')[0].get('href')
    path = os.path.join(os.path.dirname(os.getcwd()), 'files', 'douban', 'tags.xlsx')
    wb.save(path)
    wb.close()

saveThreads = []
for i in range(2):
    saveThread = threading.Thread(target=saveContent, args=[tags[i]])
    saveThreads.append(saveThread)
    saveThread.start()

IndexError: list index out of range

In [9]:
url = 'https://book.douban.com/tag/'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'lxml')  # the parser name belongs to BeautifulSoup, not requests.get
len(soup.select('.tagCol a'))

145

### 3.7 Reading posts from the Zhao Yazhi (赵雅芝) Tieba forum 

#### Storing in dictionaries

In [55]:
import requests, bs4, lxml

info = []
for i in range(0, 100, 50):
    url = 'http://tieba.baidu.com/f?kw=%E8%B5%B5%E9%9B%85%E8%8A%9D&ie=utf-8&pn=' + str(i)
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    yazhiSoup = bs4.BeautifulSoup(html, 'lxml')
    titles = yazhiSoup.select('a[class="j_th_tit"]')
    replies = yazhiSoup.select('span[class="threadlist_rep_num center_text"]')
    for title, reply in zip(titles, replies):
        data = {
            "title":title.get_text(),
            "replies":reply.get_text(),
            "link":'http://tieba.baidu.com' + title['href']
        }
        info.append(data)

In [56]:
for i in info:
    if int(i['replies'])>2000:
        print(i)

{'title': '【典雅一生】收集10000句：赵雅芝我爱你', 'replies': '31627', 'link': 'http://tieba.baidu.com/p/3412726961'}
{'title': '【典雅一生】赵雅芝', 'replies': '2654', 'link': 'http://tieba.baidu.com/p/5144633788'}
{'title': '【典雅一生】新小说《蓦然回首，浮生若梦》', 'replies': '6894', 'link': 'http://tieba.baidu.com/p/4872399665'}


In [57]:
for i in info:
    if '新白娘子传奇' in i['title']:
        print(i)

{'title': '【典雅一生】新白娘子传奇的删减', 'replies': '38', 'link': 'http://tieba.baidu.com/p/5873317948'}
{'title': '【典雅一生】《新白娘子传奇》曾删减过部分内容吗？', 'replies': '346', 'link': 'http://tieba.baidu.com/p/1958610478'}
{'title': '【典雅一生】凯杰即将出演新新白娘子传奇里的仕林了，还是小小期', 'replies': '9', 'link': 'http://tieba.baidu.com/p/6083013894'}
{'title': '《新白娘子传奇》1-50集下载（内有下载地址）', 'replies': '45', 'link': 'http://tieba.baidu.com/p/8705594'}
{'title': '【典雅一生】《新白娘子传奇》芝姐白素贞好美', 'replies': '0', 'link': 'http://tieba.baidu.com/p/6093271906'}
{'title': '【典雅一生】求。新白娘子传奇，谁有？', 'replies': '18', 'link': 'http://tieba.baidu.com/p/5807575411'}
{'title': '【典雅一生】《新白娘子传奇》高清大图剧照', 'replies': '1811', 'link': 'http://tieba.baidu.com/p/2476521385'}
{'title': '【典雅一生】新白娘子传奇', 'replies': '1', 'link': 'http://tieba.baidu.com/p/6080937969'}


#### Writing to a text file

In [60]:
import requests, bs4, lxml

path = './yazhitieba/'
file_txt = open(path + 'yazhi.txt', 'w+')
file_txt.write('-------------------Title-----------------------replies----------------------link-------------------')
file_txt.write('\n')
for i in range(0, 100, 50):
    url = 'http://tieba.baidu.com/f?kw=%E8%B5%B5%E9%9B%85%E8%8A%9D&ie=utf-8&pn=' + str(i)
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    yazhiSoup = bs4.BeautifulSoup(html, 'lxml')
    titles = yazhiSoup.select('a[class="j_th_tit"]')
    replies = yazhiSoup.select('span[class="threadlist_rep_num center_text"]')
    for title, reply in zip(titles, replies):
        data = {
            "title":title.get_text(),
            "replies":reply.get_text(),
            "link":'http://tieba.baidu.com' + title['href']
        }        
        file_txt.write(data['title'])
        file_txt.write('  ')
        file_txt.write(data['replies'])
        file_txt.write('  ')
        file_txt.write(data['link'])
        file_txt.write('\n')
file_txt.close()
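
The text file above separates fields with two spaces, which is hard to parse back because titles can contain spaces themselves. As an alternative sketch, the standard `csv` module handles quoting automatically. `write_threads_csv` is a hypothetical helper; the sample row is taken from the output shown earlier in this section.

```python
import csv
import io


def write_threads_csv(rows, out):
    """Write thread dictionaries to out as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=['title', 'replies', 'link'])
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {'title': '【典雅一生】赵雅芝', 'replies': '2654',
     'link': 'http://tieba.baidu.com/p/5144633788'},
]
buffer = io.StringIO()  # with a real file, open it with newline='' and encoding='utf-8'
write_threads_csv(rows, buffer)
print(buffer.getvalue())
```

The resulting file opens directly in a spreadsheet and can be read back with `csv.DictReader`.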

#### Writing to Excel

In [54]:
import requests, bs4, lxml, openpyxl

wb = openpyxl.Workbook()
sheet = wb.active  # get_active_sheet() is deprecated in openpyxl
sheet.title = 'Tieba'
sheet['A1'] = 'Title'
sheet['B1'] = 'Replies'
sheet['C1'] = 'Link'

row = 2  # next free row; it keeps counting across pages so page 2 does not overwrite page 1
for i in range(0, 100, 50):
    url = 'http://tieba.baidu.com/f?kw=%E8%B5%B5%E9%9B%85%E8%8A%9D&ie=utf-8&pn=' + str(i)
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    yazhiSoup = bs4.BeautifulSoup(html, 'lxml')
    titles = yazhiSoup.select('a[class="j_th_tit"]')
    replies = yazhiSoup.select('span[class="threadlist_rep_num center_text"]')
    for title, reply in zip(titles, replies):
        sheet['A'+str(row)] = title.get_text()
        sheet['B'+str(row)] = reply.get_text()
        sheet['C'+str(row)] = 'http://tieba.baidu.com' + title['href']
        row += 1

path = './yazhitieba/'
wb.save(path + 'yazhi.xlsx')
wb.close()
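
Both loops above build the page URL by concatenating a pre-encoded `kw` value with the `pn` offset. The percent-encoded string `%E8%B5%B5%E9%9B%85%E8%8A%9D` is the UTF-8 encoding of 赵雅芝, so the same URLs can be generated from the readable keyword. A small sketch using only the standard library; the function name `page_urls` and the `page_size` parameter are illustrative, and the step of 50 is taken from the `range(0, 100, 50)` loops above.

```python
from urllib.parse import urlencode

BASE = 'http://tieba.baidu.com/f'


def page_urls(keyword, pages, page_size=50):
    """Yield one listing URL per page; pn is the offset of the first thread on that page."""
    for page in range(pages):
        query = urlencode({'kw': keyword, 'ie': 'utf-8', 'pn': page * page_size})
        yield BASE + '?' + query


urls = list(page_urls('赵雅芝', 2))
for u in urls:
    print(u)
```

`requests.get(BASE, params={...})` performs the same encoding internally, so passing a `params` dict is another option when the URL string itself is not needed.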



## 4. The `selenium` module

The trial and error of getting it set up is covered in the tutorial.

**If the window that opened when `browser` was assigned has been closed, calling `browser.get(url)` raises an error.**

In [151]:
#_*_coding:utf-8_*_
from selenium import webdriver

path = '/Users/caimeijuan/anaconda/envs/python35/lib/python3.7/site-packages/selenium/webdriver/chrome/chromedriver'
browser = webdriver.Chrome(path)

SessionNotCreatedException: Message: session not created: Chrome version must be between 70 and 73
  (Driver info: chromedriver=73.0.3683.68 (47787ec04b6e38e22703e856e101e840b65afe72),platform=Mac OS X 10.14.4 x86_64)


In [1]:
url = 'http://inventwithpython.com'
browser.get(url)

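from selenium.common.exceptions import NoSuchElementException

try:
    elem = browser.find_element_by_class_name('card-img-top')
    print('Found <%s> element with that class name!' % elem.tag_name)
except NoSuchElementException:
    # Catch only the "not found" case instead of hiding every error with a bare except
    print('Was not able to find an element with that name.')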

Found <img> element with that class name!


In [2]:
browser = webdriver.Chrome(path)
url = 'http://inventwithpython.com'
browser.get(url)

linkElem = browser.find_element_by_link_text('Read Online for Free')
type(linkElem)

selenium.webdriver.remote.webelement.WebElement

In [3]:
linkElem.click()

The link really did open.

#### Logging in to Gmail

In [26]:
url = 'http://gmail.com'
browser.get(url)

Find the input element:  

In [21]:
classElem = browser.find_element_by_class_name('whsOnd')

In [22]:
idElem = browser.find_element_by_id('identifierId')

In [23]:
nameElem = browser.find_element_by_name('identifier')

In [24]:
classElem == idElem

True

In [25]:
idElem == nameElem

True

In [26]:
tagElem = browser.find_element_by_tag_name('input')

In [27]:
nameElem == tagElem

True

In [36]:
tagsElem_div = browser.find_elements_by_tag_name('div')

In [48]:
tagsElem_a = browser.find_elements_by_tag_name('a')

##### Enter the email account

In [27]:
import getpass
tagsElem_input = browser.find_elements_by_tag_name('input')
email = tagsElem_input[0]
email.clear()
email.send_keys(getpass.getpass())

········


Tried to fill in both fields in one go; it failed. The error message says the **element is not visible**.

In [5]:
tagsElem_input = browser.find_elements_by_tag_name('input')
email = None
password = None
for t in tagsElem_input:
    if t.get_property('type')=='email':
        email = t
    elif t.get_property('type')=='password':
        password = t

In [8]:
email.clear()
email.send_keys('xxxxxxx@gmail.com')
password.clear()
password.send_keys('********')

ElementNotVisibleException: Message: element not interactable
  (Session info: chrome=73.0.3683.103)
  (Driver info: chromedriver=73.0.3683.68 (47787ec04b6e38e22703e856e101e840b65afe72),platform=Mac OS X 10.14.4 x86_64)


##### Click Next

In [28]:
classElem_btn = browser.find_element_by_class_name('RveJvd')
classElem_btn.click()

##### Enter the password

In [29]:
tagsElem_input = browser.find_elements_by_tag_name('input')
password = tagsElem_input[2]
password.send_keys(getpass.getpass())

········


##### Click Next

In [30]:
classElem_btn = browser.find_element_by_class_name('RveJvd')
classElem_btn.click()

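The cells below handle the screen that asks you to choose a style, which appears after a long time without logging in. You usually won't run into it.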

In [15]:
nameElem_btn = browser.find_element_by_name('welcome_dialog_next')
nameElem_btn.click()

In [18]:
typeElem_btn = browser.find_element_by_name('ok')
typeElem_btn.click()

Logging in step by step succeeded.

##### Combined into one complete program

In [34]:
#_*_coding:utf-8_*_
from selenium import webdriver
import getpass

path = '/Users/caimeijuan/anaconda/envs/python35/lib/python3.7/site-packages/selenium/webdriver/chrome/chromedriver'
browser = webdriver.Chrome(path)

def gmailSignin(email, password):
    # Open the sign-in page
    url = 'http://gmail.com'
    browser.get(url)

    # Enter the email account
    tagsElem_input = browser.find_elements_by_tag_name('input')
    emailElem = tagsElem_input[0]
    emailElem.clear()
    emailElem.send_keys(email)

    # Click Next
    classElem_btn = browser.find_element_by_class_name('RveJvd')
    classElem_btn.click()

    # Enter the password
    Elem_input = browser.find_elements_by_tag_name('input')
    for t in Elem_input:
        print(t.get_property('name'))
    passwordElem = Elem_input[2]      
    passwordElem.send_keys(password)

    # Click Next
    classElem_btn = browser.find_element_by_class_name('RveJvd')
    classElem_btn.click()
    
email = getpass.getpass('Input your email account: ')
password = getpass.getpass('Input your password: ')  # input is not echoed
gmailSignin(email, password)

Input your email account: ········
Input your password: ········
identifier
hiddenPassword
ca
ct
pstMsg
checkConnection
checkedDomains


ElementNotInteractableException: Message: element not interactable
  (Session info: chrome=73.0.3683.103)
  (Driver info: chromedriver=73.0.3683.68 (47787ec04b6e38e22703e856e101e840b65afe72),platform=Mac OS X 10.14.4 x86_64)
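
A likely reason the combined program fails while the step-by-step cells worked: right after clicking Next, the password page has not finished loading, and the names printed above are still the first page's inputs (`identifier`, `hiddenPassword`, `ca`, ...), so index 2 picks the hidden `ca` field. Run one cell at a time, the page had time to change. A rough sketch of polling until a visible password input shows up, under that assumption; `wait_until` and `first_visible_password` are made-up helper names, not part of selenium, and they only rely on `get_property` and `is_displayed`, which `WebElement` provides.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(poll)


def first_visible_password(elements):
    """Return the first displayed input of type "password" in elements, or None."""
    for elem in elements:
        if elem.get_property('type') == 'password' and elem.is_displayed():
            return elem
    return None


# Intended use inside gmailSignin, in place of Elem_input[2] (needs a live browser):
# passwordElem = wait_until(
#     lambda: first_visible_password(browser.find_elements_by_tag_name('input')))
# passwordElem.send_keys(password)
```

Selenium also ships its own explicit wait, `WebDriverWait` together with `expected_conditions`, which does the same kind of polling; the hand-rolled version here is only meant to make the idea visible.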
