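# Scraping RIA Website Content

## Fetching RIA site data with Requests

### Case 1: Fetching data from 某条

We use the page content of 某条 as the example.

First, apply what we learned earlier: analyze the page and make an exploratory fetch.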

In [1]:
import requests

def fetchContent(url, queryStr=None):
    """Fetch url and return the response body as text (None on failure)."""
    try:
        # Pass a dict and let requests percent-encode the query string.
        params = {"keyword": queryStr} if queryStr else None

        r = requests.get(url, params=params)
        r.raise_for_status()
        # Use the encoding guessed from the body when decoding r.text.
        r.encoding = r.apparent_encoding
        return r.text

    except Exception as e:
        print(e)


In [2]:
url = "https://www.toutiao.com/search/?"
queryStr = input("Enter a search keyword: ")

fetchContent(url,queryStr)

Enter a search keyword: 中国


'<!DOCTYPE html><html><head><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefetch" href="//s3.pstatp.com/"><link rel="dns-prefetch" href="//s3a.pstatp.com/"><link rel="dns-prefetch" href="//s3b.pstatp.com"><link rel="dns-prefetch" href="//p1.pstatp.com/"><link rel="dns-prefetch" href="//p3.pstatp.com/"><meta charset="utf-8"><meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta name="renderer" content="webkit"><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no, minimal-ui"><meta name="baidu-site-verification" content="T239HZAbh7"><meta name="360-site-verification" content="b96e1758dfc9156a410a4fb9520c5956"><meta name=\'360_ssp_verify\' content=\'2ae4ad39552c45425bddb738efda3dbb\'><meta name="google-site-verification" content="3PYTTW

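Try one of the request URLs we saw in the browser and check what comes back:

In [5]:
url = "https://www.某条.com/api/article/user_log/?c=/search/&sid=dndvxamwo1574493680709&type=close&st=3962&t=1574495387989"
queryStr = input("Enter a search keyword: ")

fetchContent(url,queryStr)

Enter a search keyword: 中国


'{"message": "success"}'

This attempt fails: the response only reports success and carries no search results.

Change the URL and try again.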

In [6]:
url = "https://www.某条.com/search/?keyword=%E7%8E%8B%E6%80%9D%E8%81%AA"
fetchContent(url)

'<!DOCTYPE html><html><head><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefetch" href="//s3.pstatp.com/"><link rel="dns-prefetch" href="//s3a.pstatp.com/"><link rel="dns-prefetch" href="//s3b.pstatp.com"><link rel="dns-prefetch" href="//p1.pstatp.com/"><link rel="dns-prefetch" href="//p3.pstatp.com/"><meta charset="utf-8"><meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta name="renderer" content="webkit"><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no, minimal-ui"><meta name="baidu-site-verification" content="T239HZAbh7"><meta name="360-site-verification" content="b96e1758dfc9156a410a4fb9520c5956"><meta name=\'360_ssp_verify\' content=\'2ae4ad39552c45425bddb738efda3dbb\'><meta name="google-site-verification" content="3PYTTW

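This result is not much use either: the returned HTML appears to be only the page shell, without the search results.

The original approach does not work well here. What now?

The results are loaded dynamically, so we use the browser's developer tools to find the requests that fetch them.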

In [11]:
import requests
from urllib.parse import urlencode
import time
def fetchContent(url,queryStr = None):
    try:
        
        # Headers copied from the request shown in the browser's developer tools.
        # The HTTP/2 pseudo-headers that DevTools lists (authority, method, path,
        # scheme) are not ordinary request headers, so they are left out.
        headers = {
            "accept":"application/json,text/javascript",
            "accept-encoding":"gzip, deflate, br",
            "accept-language":"zh-CN,zh;q=0.9",
            "content-type":"application/x-www-form-urlencoded",
            # The RT cookie value appears to be double-quoted itself, so the whole string uses single quotes.
            "cookie":'tt_webid=6762398801557128712; s_v_web_id=b15d614a445679097241d4c8fda6c2a3; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6762398801557128712; __tasessionId=dndvxamwo1574493680709; csrftoken=442fe99b35620b1fb0f5a84c789f8eb2; RT="z=1&dm=toutiao.com&si=xxicvkr2hj&ss=k3b8vziq&sl=2&tt=sm&obo=1&ld=3q9y&r=f707240a16867893c666bc65f8643932&ul=3qaa&hd=3qft"',
            "referer":"https://www.toutiao.com/search/?keyword=%E4%B8%AD%E5%9B%BD",
            "sec-fetch-mode":"cors",
            "sec-fetch-site":"same-origin",
            "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
            "x-requested-with":"XMLHttpRequest",
        }
        q = None
        if queryStr:
            q = urlencode(queryStr)
        

        r = requests.get(url,params=q,headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding

        return r.json()
        
    except Exception as e:
        print(e)

url = "https://www.toutiao.com/api/search/content/?"
keyword = input('Enter a keyword: ')
offset = 0
# Millisecond timestamp, as in the request observed in the browser
timestamp = str(int(time.time() * 1000))
# Query parameters copied from the observed request
query = {
        "aid":"24",
        "app_name":"web_search",
        "offset":offset,
        "format":"json",
        "keyword":keyword,
        "autoload":"true",
        "count":"20",
        "en_qc":"1",
        "cur_tab":"1",
        "from":"search_tab",
        "pd":"synthesis",
        "timestamp":timestamp,
    }
jsondata = fetchContent(url,query)
print(jsondata)

Enter a keyword: 中国
{'count': 20, 'return_count': 20, 'query_id': '6605800898642859267', 'has_more': 1, 'request_id': '201911241102430100080630440EA00E15', 'search_id': '201911241102430100080630440EA00E15', 'cur_ts': 1574564563, 'offset': 20, 'message': 'success', 'pd': 'synthesis', 'show_tabs': 1, 'keyword': 'ä¸\xadå›½', 'city': 'æ‰¬å·�', 'tokens': ['ä¸\xadå›½'], 'log_pb': {'impr_id': '201911241102430100080630440EA00E15'}, 'data': [{'abstract': 'å¹¿ä¸œè\xad¦æ–¹å±•ç¤ºç¼´è�·çš„æ–°å�‹æ¯’å“�å’Œåˆ€å…·ç\xad‰ã€‚ é»„ä¸½å�› æ‘„ä¸\xadæ–°ç½‘å¹¿å·�11æœˆ21æ—¥ç”µ (é»„ä¸½å�› åˆ˜ä¸½å¨Ÿ æ��é•¿è¾¾)21æ—¥ï¼Œå¹¿ä¸œçœ�å…¬å®‰å�…ä¸¾è¡Œæ–°é—»å�‘å¸ƒä¼šï¼Œä»‹ç»�å¹¿ä¸œâ€œç¦�æ¯’2019ä¸¤æ‰“ä¸¤æ�§â€�ä¸“é¡¹è¡ŒåŠ¨ç›¸å…³æƒ…å†µã€‚', 'app_info': {'db_name': 'R_SITE', 'page_type': '1', 'query_type': 'SearchAggregationInternalQueryType'}, 'article_url': 'http://ex.chinadaily.com.cn/exchange/partners/77/rss/channel/cn/columns/ruv4j9/stories/WS5dd64366a31099ab995ed53e.html', 'behot_time': '1574323457', 'comment_count': 0, 'comments_coun

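The response above contains plenty of data, but the Chinese text is garbled, so it is not yet the information we want.

The headers, especially the cookie, may be part of the problem. The garbled text itself, however, looks like UTF-8 bytes decoded with a single-byte codec, which points at `r.encoding = r.apparent_encoding`: the guessed encoding is then used when `r.json()` decodes the body. The next version drops that line and trims the headers.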
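
To see why the text looks garbled, the following minimal sketch reproduces the effect offline. It assumes the server sent UTF-8 and the client decoded it as Windows-1252 (`cp1252`); the codec is an assumption inferred from the characters in the output above, not something the response itself confirms.

```python
# UTF-8 bytes for the keyword, as the server would send them
raw = "中国".encode("utf-8")

# Decoding with the wrong single-byte codec yields mojibake
garbled = raw.decode("cp1252")
print(repr(garbled))

# Reversing the wrong decode recovers the original text
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

With Requests, setting `r.encoding = "utf-8"` before reading the body is one way to avoid relying on a guessed encoding.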

In [32]:

import requests


def fetchContent(url,offset,keyword):
    headers = {
        # The RT cookie value appears to be double-quoted itself, so the whole string uses single quotes.
        "cookie": 'tt_webid=6762398801557128712; s_v_web_id=b15d614a445679097241d4c8fda6c2a3; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6762398801557128712; __tasessionId=dndvxamwo1574493680709; csrftoken=442fe99b35620b1fb0f5a84c789f8eb2; RT="z=1&dm=某条.com&si=xxicvkr2hj&ss=k3b8vziq&sl=2&tt=sm&obo=1&ld=3q9y&r=f707240a16867893c666bc65f8643932&ul=3qaa&hd=3qft"',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.某条.com/search/?keyword=%E4%B8%AD%E5%9B%BD',
    }
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
     
    try:
        r = requests.get(url,params=params,headers=headers)
        r.raise_for_status()
        return r.json()
    except Exception as e:
        print(e)
        
def getInfo(result):
    """Yield the title of every result item that has one."""
    # fetchContent returns None on failure, so guard before reading 'data'.
    if result and result.get('data'):
        for item in result.get('data'):
            title = item.get('title')
            if title is None:
                continue
            yield {
                'title': title
            }

def main():
    url = "https://www.某条.com/api/search/content/?"
    keyword = input('Enter a keyword: ')
    offset = 0
    result = fetchContent(url=url, offset=offset, keyword=keyword)

    for i in getInfo(result):
        print(i)
main()

Enter a keyword: 中国
{'title': '广东警方前十个月共破毒品案8500余起 刑拘1.1万余人'}
{'title': '沉迷电子烟 加拿大少年患上“爆米花肺”'}
{'title': '“还是用中文吧，在座的都是中国人”'}
{'title': '中国🇨🇳著名实力派山水画家李劲松彩墨山水画欣赏'}
{'title': '把中国警告当耳旁风？这回美国真闯大祸了，11月20日央视十二连击'}
{'title': '中国🇨🇳 著名实力派山水画家李劲松参加了'}
{'title': '吴作栋：中国7亿多人口脱贫，是个史无前例的奇迹'}
{'title': '中国🇨🇳实力派山水画家李劲松彩墨山水画作品欣赏有喜欢本作'}
{'title': '"56岁不丹王母穿粉色礼服出席活动，皮肤白皙透亮像少女"'}
{'title': '习近平：创新成果不应成为埋在山洞里的宝藏'}
{'title': '中国🇨🇳实力派山水画家李劲松彩墨山水画作品欣赏'}
{'title': '中国🇨🇳实力派山水画家李劲松彩墨山水画作品欣赏有喜欢本作'}


The output above shows the content we wanted.
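
The JSON response printed earlier carries `has_more`, `offset`, and `data` fields, which suggests how to page through results. The sketch below is a hedged illustration: `iter_items` and the injected `fetch_page` callable are hypothetical names, and treating `has_more` as the stop signal is an assumption about the API rather than documented behavior.

```python
def iter_items(fetch_page, keyword, max_items, page_size=20):
    """Yield up to max_items result items, requesting one page per call.

    fetch_page(offset, keyword) is a hypothetical callable that returns the
    decoded JSON dict for one page, or None on failure.
    """
    yielded = 0
    offset = 0
    while yielded < max_items:
        page = fetch_page(offset, keyword)
        if not page:
            break
        for item in page.get("data") or []:
            if yielded >= max_items:
                break
            yield item
            yielded += 1
        # Assumption: a falsy has_more means there are no further pages.
        if not page.get("has_more"):
            break
        offset += page_size


# Offline demonstration with a fake two-page API
def fake_fetch(offset, keyword):
    pages = {
        0: {"data": [{"title": "a"}, {"title": "b"}], "has_more": 1},
        20: {"data": [{"title": "c"}], "has_more": 0},
    }
    return pages.get(offset)


titles = [item["title"] for item in iter_items(fake_fetch, "demo", max_items=10)]
print(titles)
```

In the notebook, `fetch_page` could be a thin wrapper such as `lambda offset, kw: fetchContent(url, offset, kw)`.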

Next, we try to download the related images.

In [42]:
import requests
import os
from hashlib import md5
import re


def fetchContent(url,offset,keyword):
    headers = {
        # The RT cookie value appears to be double-quoted itself, so the whole string uses single quotes.
        "cookie": 'tt_webid=6762398801557128712; s_v_web_id=b15d614a445679097241d4c8fda6c2a3; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6762398801557128712; __tasessionId=dndvxamwo1574493680709; csrftoken=442fe99b35620b1fb0f5a84c789f8eb2; RT="z=1&dm=某条.com&si=xxicvkr2hj&ss=k3b8vziq&sl=2&tt=sm&obo=1&ld=3q9y&r=f707240a16867893c666bc65f8643932&ul=3qaa&hd=3qft"',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.某条.com/search/?keyword=%E4%B8%AD%E5%9B%BD',
    }
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
     
    try:
        r = requests.get(url,params=params,headers=headers)
        r.raise_for_status()
        return r.json()
    except Exception as e:
        print("Cannot fetch the content. ")
        
def getInfo(json):
    if json.get('data'):
        
        data = json.get('data')        
        for item in data:
            
            if item.get('title') is None:
                continue
            title = item.get('title')
            yield{
                'title':title
            }

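def getImages(json):
    """Yield one {'image', 'title'} dict per image attached to a result item."""
    # fetchContent returns None on failure, so guard before reading 'data'.
    if json and json.get('data'):
        data = json.get('data')
        for item in data:
            if item.get('title') is None:
                continue
            # Strip characters that are not allowed in Windows directory names.
            title = re.sub(r'[\t\\/:*?"<>|]', '', item.get('title'))
            # Some items carry no image_list; treat that as an empty list.
            images = item.get('image_list') or []
            for image in images:
                # Rewrite the thumbnail URL so that it points at the large image.
                origin_image = re.sub("list.*?pgc-image", "large/pgc-image", image.get('url'))
                yield {
                    'image': origin_image,
                    'title': title
                }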

def saveImages(image):
    """Download one image into img/<title>/, named by the MD5 of its content."""
    try:
        imgDir = os.path.join('img', image.get('title'))
        # exist_ok avoids an error when the directory already exists.
        os.makedirs(imgDir, exist_ok=True)
        r = requests.get(image.get('image'))
        r.raise_for_status()
        # Naming the file by its content hash makes repeated downloads idempotent.
        file_path = os.path.join(imgDir, '{file_name}.{file_suffix}'.format(
                    file_name=md5(r.content).hexdigest(),
                    file_suffix='jpg'))
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(r.content)
            print("Downloaded image path is {}".format(file_path))
        else:
            print("Already Downloaded")
    except Exception as e:
        print("Images download error: {}".format(e))

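def main():
    url = "https://www.某条.com/api/search/content/?"
    keyword = input("Enter a search keyword: ")
    # input() returns a string, so convert it before using it in range().
    total = int(input("Enter the number of items (a multiple of 20): "))
    for offset in range(0, total, 20):
        # Pass the loop offset so that each iteration requests the next page.
        json = fetchContent(url=url, offset=offset, keyword=keyword)
        for item in getImages(json):
            saveImages(item)


main()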

输入关键字中国
Downloaded image path is img\广东警方前十个月共破毒品案8500余起 刑拘1.1万余人\f32a3100473930a4e03ba5cc0e413148.jpg
Downloaded image path is img\陕西省委副秘书长王飞出任咸阳市委常委、副市长\a93e615705ef8f8b16c633a781e33ea5.jpg
Downloaded image path is img\沉迷电子烟 加拿大少年患上“爆米花肺”\78039f1aba211b7a3262b7cf77c01ca6.jpg
Downloaded image path is img\“还是用中文吧，在座的都是中国人”\27d464d53f0c035a44b2fddcd9100ff7.jpg
Downloaded image path is img\中国🇨🇳著名实力派山水画家李劲松彩墨山水画欣赏\049ef16fb568d45fb206e66ed5f135b5.jpg
Downloaded image path is img\中国🇨🇳著名实力派山水画家李劲松彩墨山水画欣赏\0abdc78f2da1edad05fe63e2038d244e.jpg
Downloaded image path is img\中国🇨🇳著名实力派山水画家李劲松彩墨山水画欣赏\f71a06692abe679acb78dc8d67d4099e.jpg
Downloaded image path is img\中国🇨🇳著名实力派山水画家李劲松彩墨山水画欣赏\8c7123a7d2893f2f44d9919770d27fa3.jpg
Downloaded image path is img\中国🇨🇳 著名实力派山水画家李劲松参加了\75c9641e93e85c058086781a66b50d13.jpg
Downloaded image path is img\中国🇨🇳 著名实力派山水画家李劲松参加了\4d30b0e582c0b63b5b3ea0a50e5b89bb.jpg
Downloaded image path is img\中国🇨🇳 著名实力派山水画家李劲松参加了\35a6365e541636ed72dbab9ae76df3c2.jpg
Downloaded image 
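
Titles are used as directory names above, so characters that a file system rejects need to be removed first. A minimal sketch, assuming Windows naming rules are the strictest target; `safe_dirname`, the 80-character cap, and the `untitled` fallback are hypothetical choices, not part of the original program.

```python
import re

# Characters Windows forbids in file and directory names, plus control characters
_FORBIDDEN = re.compile(r'[\x00-\x1f\\/:*?"<>|]')


def safe_dirname(title, max_len=80, fallback="untitled"):
    """Return a version of title that is usable as a directory name."""
    cleaned = _FORBIDDEN.sub("", title)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.strip().rstrip(". ")
    cleaned = cleaned[:max_len].rstrip(". ")
    return cleaned or fallback


print(safe_dirname('"56岁不丹王母穿粉色礼服出席活动"'))
print(safe_dirname("a/b:c?"))
```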

The program above is single-threaded. Next, we turn it into a multithreaded one.

In [47]:
import threading

####################################
# Earlier definitions omitted here #
####################################

def main(keyword,offset):
    url = "https://www.某条.com/api/search/content/?"
    json = fetchContent(url =url, offset=offset,keyword=keyword)
    for item in getImages(json):
        saveImages(item)


if __name__ == '__main__':
    keyword = input("Enter a search keyword: ")
    # input() returns a string; each page holds 20 items.
    count = int(input("Enter the number of result pages (20 items each): "))
    groups = [x * 20 for x in range(count)]
    tasks = []   # the started threads

    for group in groups:
        task = threading.Thread(target=main, args=(keyword,group))
        tasks.append(task)
        task.start()

    # Wait for all threads to finish
    for task in tasks:
        task.join()
    print("Images download completed.")

Enter a search keyword: 中国
Already Downloaded
Downloaded image path is img\小伙邀请大家一起探讨，中国第一大古都是哪里呢？洛阳还是西安？\30e20bd5c06bec995208ab8da0839c44.jpg
Already Downloaded
Downloaded image path is img\中日文化趣谈（11）“苦”与“涩”\877564fe4fa33989f5517a356395d76d.jpgDownloaded image path is img\中国🇨🇳实力派山水画家李劲松彩墨山水画作品欣赏有喜欢本作\eca7c5e733beb62d0a42d5384c78f4c3.jpg

Downloaded image path is img\中国🇨🇳著名的武术家郝凤鸣形意拳系列教学片《健身第一》\a71cd704d66fe92f3f089278c9d69161.jpg
Downloaded image path is img\美反华议员遭群嘲：中国指责美国干涉中国内政是干涉美国内政\41cbf02de84aabae9edf9b14e4a8630c.jpg
Already Downloaded
Already Downloaded
Downloaded image path is img\中国不会屈服于来自印度的压力，不会放弃与“巴铁”牢不可破关系\30cd48faa88827ef115b6b49e22e9133.jpg
Downloaded image path is img\美反华议员遭群嘲：中国指责美国干涉中国内政是干涉美国内政\1eb3d09a9222c191b12108aeb11cae36.jpg
Downloaded image path is img\两院院士结果揭晓！12位中国科大校友百花齐鸣，工程院一枝独秀\a98841ba7262d589007bf6bc7e3b06d0.jpg
Downloaded image path is img\职场中，一定要多管闲事\17e673802ecade1be9facc9eac226855.jpg
Already Downloaded
Already Downloaded
Downloaded image path is img\《长三角生态绿色一体化发展示

Already Downloaded
Downloaded image path is img\中国🇨🇳实力派山水画家李劲松彩墨山水画作品欣赏有喜欢本作\602600c8ed6f48057364093fb81b425a.jpg
Already Downloaded
Downloaded image path is img\全球洪门联盟在泰国曼谷举行全球百万掌门人千人年度盛会\cde4fac9969e0296ffe0149cc45baec1.jpg
Downloaded image path is img\普京谈苏联解体：中国是更好的经济转型案例\f34367dd1dedf528694655350ad45475.jpg
Already Downloaded
Already Downloaded
Downloaded image path is img\普京谈苏联解体：中国是更好的经济转型案例\ce86674b088bd25b4f20fd9af0b89f3d.jpg
Downloaded image path is img\全球洪门联盟在泰国曼谷举行全球百万掌门人千人年度盛会\b7caf79c6b4e81548a255235175fd2e9.jpg
Already Downloaded
Downloaded image path is img\前国足助教批判中国足球现状！德国玩命日本认真，中国足球就是玩\2c4ec8f84fa12381266cdae9f96010c3.jpg
Downloaded image path is img\全球洪门联盟在泰国曼谷举行全球百万掌门人千人年度盛会\f4139910cf79832c3fe4f8e4242b6ae0.jpg
Already Downloaded
Downloaded image path is img\前国足助教批判中国足球现状！德国玩命日本认真，中国足球就是玩\08dc860c23a065cbd9d0824dffe93d3a.jpg
Downloaded image path is img\全球洪门联盟在泰国曼谷举行全球百万掌门人千人年度盛会\8174304dd6445f333d86b064f01e6352.jpg
Downloaded image path is img\全球洪门联盟在泰国曼谷举行全球百万掌门人千人

KeyboardInterrupt: 
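
Starting one thread per page leaves the number of concurrent requests unbounded. As an alternative sketch (not the approach used above), the standard library's `concurrent.futures.ThreadPoolExecutor` caps concurrency. `run_pages` and the stand-in `fake_worker` are hypothetical names; the real worker would be the `main(keyword, offset)` function.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pages(worker, keyword, pages, page_size=20, max_workers=4):
    """Run worker(keyword, offset) for each page with at most max_workers threads."""
    offsets = [page * page_size for page in range(pages)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and re-raises worker exceptions here.
        return list(pool.map(lambda offset: worker(keyword, offset), offsets))


# Stand-in worker so that the sketch runs without network access
def fake_worker(keyword, offset):
    return "{}@{}".format(keyword, offset)


results = run_pages(fake_worker, "demo", pages=3)
print(results)
```

Leaving the `with` block waits for all submitted work, which replaces the explicit `join()` loop.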

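### Case 2: Fetching product information from 某宝

This program does not handle the login flow; log in manually first and copy your cookie from the browser.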
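
A cookie copied from the browser is one long `name=value; name=value` string. As a small sketch of how it could be turned into a dict, for instance to pass through the `cookies=` argument of `requests.get`, the helper below splits it by hand. `parse_cookie_header` is a hypothetical name and the sample values are made up.

```python
def parse_cookie_header(raw):
    """Split a raw Cookie header string into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue
        # Split on the first '=' only, because values may contain '=' themselves.
        name, value = pair.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


sample = "thw=cn; tg=0; uc3=id2=abc&vt3=def"
print(parse_cookie_header(sample))
```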

In [3]:
import requests
import re


def getHTMLText(url,headers):
    try:
        r=requests.get(url,headers=headers,timeout=30)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
        r.encoding=r.apparent_encoding  # decode with the encoding guessed from the body
        #print(r.text[:1000])
        return r.text
    except requests.RequestException:
        print ("urlError")
        


def parsePage(ilt,html):
    try:
        # Capture only the value part of each "view_price":"..." pair.
        plt=re.findall(r'"view_price":"([\d.]*)"',html)
        # .*? is a non-greedy match: it stops at the first closing quote,
        # so only the product title itself is captured.
        tlt=re.findall(r'"raw_title":"(.*?)"',html)
        # The capture groups return the bare values, so eval() and split(':') are not needed:
        # eval() on scraped text is unsafe, and split(':') breaks on titles that contain a colon.
        for price,title in zip(plt,tlt):
            ilt.append([price,title])
        #print(ilt)
    except Exception:
        print ("parserError")
    
def printGoodsList(ilt):
    tplt="{:4}\t{:8}\t{:16}"  # output format template
    print(tplt.format("num","price","name"))
    count=0
    for g in ilt:
        count=count+1
        print(tplt.format(count,g[0],g[1]))

def main():
    goods=input("Enter a search keyword: ")
    depth=3  # crawl depth: how many result pages to fetch
    headers = {"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Cookie" : "thw=cn; cna=vci1Ff9tXTcCAWdavQq45RyK; t=5922c3d73a9a97219e796951f673161d; lgc=nchycom; tracknick=nchycom; tg=0; enc=NRzcjpZZU2DsXuSVQWjOD%2FdgoJs091d1tIr1OwkBKMCr1ur7i4mfI8DJrwGIha7RZY7ag0yPV5xXuOfkpPFdYw%3D%3D; hng=CN%7Czh-CN%7CCNY%7C156; v=0; cookie2=14cef6d3a92d856e6f412cca3d540a38; _tb_token_=e1787e17133e6; unb=75965512; uc3=id2=VASjULlFuZE%3D&vt3=F8dByuQDYTLj2te6%2Fm4%3D&lg2=VT5L2FSpMGV7TQ%3D%3D&nk2=DeSe0Q8ONA%3D%3D; csg=0f6fc257; cookie17=VASjULlFuZE%3D; dnk=nchycom; skt=f7f18af33f92b7a9; existShop=MTU3NDUwMjA0MA%3D%3D; uc4=id4=0%40Vh3Dtwy4orMbQ7cG8JgNj%2BqJLg%3D%3D&nk4=0%40DzERyuLSckbO7cdHC4sb2PB9; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; sg=m23; _nk_=nchycom; cookie1=BqJk9D7342ZkMXFfMykDvEYqmkiwE8r43rljzNeky5E%3D; mt=ci=57_1; alitrackid=i.taobao.com; lastalitrackid=i.taobao.com; uc1=cookie16=UIHiLt3xCS3yM2h4eKHS9lpEOw%3D%3D&cookie21=VFC%2FuZ9aiKCaj7AzMHh1&cookie15=UtASsssmOIJ0bQ%3D%3D&existShop=false&pas=0&cookie14=UoTbmVU5Yv1UfQ%3D%3D&tag=8&lng=zh_CN; JSESSIONID=A372A25931AF9C9CE8E6E12BD20FF4BB; l=dB_Ox0mVqylJ4GUABOfBnurza77TqQAbzrVzaNbMiICPOUf1iTNAWZpt4yTBCnGV3sN6R3Jt3efYBXYLwyUIh2nk8b8CgsDpKdTeR; isg=BBISzf5Wm3WGZ-e_cBrkhqBYY9g0ixfV7e3WjtxpdUSm77DpxLOWzbpJXwv2mY5V"
}               
    start_url="https://s.moubao.com/search?q=" + goods
    infoList=[]
    for i in range(depth):
        try:
            url=start_url+'&s='+str(44*i)
            html=getHTMLText(url,headers)
            parsePage(infoList,html)
        except:
            continue
    printGoodsList(infoList)


main()

num 	price   	name            
   1	199.00  	双肩包女小包2019新款韩版背包ins百搭书包
   2	149.00  	小米双肩包简约休闲多功能书包男女笔记本电脑包时尚潮流旅行背包
   3	99.00   	小米经典商务双肩包男女潮流时尚笔记本电脑包旅行大容量背包
   4	29.80   	双肩包女士2019新款韩版百搭潮背包牛津布休闲时尚旅行大容量书包
   5	49.90   	迪卡侬官网新款户外双肩包登山旅行包男书包学生休闲女背包QUBP
   6	3500.00 	【直营】Michael Kors MK MERCER女牛皮单肩斜挎中包包30F6GM9M2L
   7	69.00   	小米胸包男士单肩包斜跨包男斜挎多功能实用迷你运动腰包手提包
   8	4599.00 	【直营】MCM男女STARK侧面铆钉小号双肩背包功能包包包MMK6SVE37
   9	698.00  	Herschel Supply Retreat 时尚潮流旅游男女双肩包书包背包百搭
  10	629.00  	伊米妮包包女包新款2019时尚流行高级感斜挎菱格链条流浪包水桶包
  11	469.00  	CHARLES＆KEITH2019秋季CK2-80150844圆环手提单肩腰包斜挎包女
  12	439.00  	CHARLES＆KEITH 小方包 CK2-80680599 金属装饰肩带单肩斜挎女包
  13	619.00  	红谷明星同款包包女包2019新款时尚丝巾斜挎包牛皮链条手提单肩包
  14	299.00  	ZARA新款 女包 黑色摇滚软质单肩斜挎包 16312004040
  15	990.00  	迪桑娜包包戚薇同款女包单肩斜挎包2019新款时尚小香风菱格链条包
  16	329.00  	稻草人女包斜挎包女士包包女2019新款时尚手提包百搭女式单肩包潮
  17	258.00  	电视剧款JanSport旗舰店官网杰斯伯双肩包时尚女书包背包男大容量
  18	199.00  	Dickies潮牌斜挎包男士腰包大学生潮流休闲女士胸包单肩包包C012
  19	229.00  	GOLF包包女包新款2019真皮时尚单肩斜挎包潮韩版百搭风琴包小方包
  20	218.00  	真皮豆腐包box包2019新款潮复古小方包空
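
The search URL above is built by concatenating the raw keyword, which leaves non-ASCII characters unescaped. A hedged sketch of building the page URLs with `urllib.parse.urlencode` instead: `build_search_urls` is a hypothetical helper, and the step of 44 items per page is taken from the loop above.

```python
from urllib.parse import urlencode


def build_search_urls(keyword, depth, page_size=44,
                      base="https://s.moubao.com/search"):
    """Return one search URL per result page, with the query percent-encoded."""
    urls = []
    for page in range(depth):
        # q is the keyword, s is the index of the first item on the page
        query = urlencode({"q": keyword, "s": page_size * page})
        urls.append("{}?{}".format(base, query))
    return urls


for url in build_search_urls("书包", depth=3):
    print(url)
```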

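## Scraping RIA sites with Selenium

The following Selenium features are introduced through hands-on code:

### Creating a browser driver instance

First, place the downloaded driver executable in the same directory as this file.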

In [12]:
from selenium import webdriver

# Create a new instance of the chrome driver
driver = webdriver.Chrome()


### Visiting a page with Selenium


In [13]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")


### Locating UI elements (WebElements)

Elements can be located either from the WebDriver instance itself or from a WebElement.

The function to use is `find_element(self, by='id', value=None)`.
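
The `by` argument is a plain strategy string, and the same `find_element(by, value)` call works on both a driver and an element. In recent Selenium 4 releases the `find_element_by_*` helpers are no longer available, so the generic call is the portable form. The sketch below shows the idea offline: `STRATEGIES`, `locator`, `find`, and `FakeDriver` are hypothetical names for this illustration, and the strings are the values behind Selenium's `By` constants.

```python
# Locator strategy strings; these are the values behind Selenium's By constants.
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def locator(kind, value):
    """Return a (by, value) tuple suitable for find_element(*locator)."""
    return (STRATEGIES[kind], value)


def find(driver_or_element, kind, value):
    """Locate one element from a driver or from a previously found element."""
    return driver_or_element.find_element(*locator(kind, value))


# Offline stand-in that records the call instead of driving a browser
class FakeDriver:
    def find_element(self, by, value):
        return (by, value)


print(find(FakeDriver(), "id", "kw"))
```

With a real driver, `find(driver, "id", "kw")` would correspond to `driver.find_element("id", "kw")`.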

In [14]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
inputElement = driver.find_element( by='id', value="kw")


In [17]:
dir(driver)
help(driver.find_element_by_tag_name)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_file_detector',
 '_is_remote',
 '_mobile',
 '_switch_to',
 '_unwrap_value',
 '_web_element_cls',
 '_wrap_value',
 'add_cookie',
 'application_cache',
 'back',
 'capabilities',
 'close',
 'command_executor',
 'create_options',
 'create_web_element',
 'current_url',
 'current_window_handle',
 'delete_all_cookies',
 'delete_cookie',
 'desired_capabilities',
 'error_handler',
 'execute',
 'execute_async_script',
 'execute_cdp_cmd',
 'execute_script',
 'file_detector',
 'file_detector_context',
 'find_element',
 'find_element_by_class_name',
 'find_element_by_css_selector',
 'find_element_by_id',
 

Try locating elements with the following functions:
    
 'find_element',
 'find_element_by_class_name',
 'find_element_by_css_selector',
 'find_element_by_id',
 'find_element_by_link_text',
 'find_element_by_name',
 'find_element_by_partial_link_text',
 'find_element_by_tag_name',
 'find_element_by_xpath',
 'find_elements',
 'find_elements_by_class_name',
 'find_elements_by_css_selector',
 'find_elements_by_id',
 'find_elements_by_link_text',
 'find_elements_by_name',
 'find_elements_by_partial_link_text',
 'find_elements_by_tag_name',
 'find_elements_by_xpath',

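A look at the WebElement instance shows that it offers methods similar to those of the driver.

This means we can locate elements recursively, starting from an element we have already found.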

In [20]:
type(inputElement)
dir(inputElement)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_execute',
 '_id',
 '_parent',
 '_upload',
 '_w3c',
 'clear',
 'click',
 'find_element',
 'find_element_by_class_name',
 'find_element_by_css_selector',
 'find_element_by_id',
 'find_element_by_link_text',
 'find_element_by_name',
 'find_element_by_partial_link_text',
 'find_element_by_tag_name',
 'find_element_by_xpath',
 'find_elements',
 'find_elements_by_class_name',
 'find_elements_by_css_selector',
 'find_elements_by_id',
 'find_elements_by_link_text',
 'find_elements_by_name',
 'find_elements_by_partial_link_text',
 'find_elements_by_tag_name',
 'find_elements_by_xpath',
 'get_attribute',
 'get_property',
 'id',
 

In [26]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
aElement = driver.find_element_by_id("wrapper")
print(aElement)

subE = aElement.find_element_by_id("wrapper_wrapper")
print(subE)


<selenium.webdriver.remote.webelement.WebElement (session="c022e01080a9393324ff6ba4a43997aa", element="b53d11de-cc75-40c2-9f5c-a3b9ef4c5d2a")>
<selenium.webdriver.remote.webelement.WebElement (session="c022e01080a9393324ff6ba4a43997aa", element="1369ac0d-b974-4135-a667-5fa8b5a7be7f")>


**Getting text values**

In [27]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
aElement = driver.find_element_by_id("wrapper")
print(aElement)
print(aElement.text)
subE = aElement.find_element_by_id("qrcode")
print(subE)
print(subE.text)

<selenium.webdriver.remote.webelement.WebElement (session="32e6ddbfc3544c626eb9d03174aaa337", element="76f8e720-616f-430c-abd4-bb52f7f19ac8")>
新闻
hao123
地图
视频
贴吧
学术
登录
设置
更多产品
下载百度APP
有事搜一搜  没事看一看
把百度设为主页关于百度About  Baidu百度推广
©2019 Baidu 使用百度前必读 意见反馈 京ICP证030173号  京公网安备11000002000001号 
<selenium.webdriver.remote.webelement.WebElement (session="32e6ddbfc3544c626eb9d03174aaa337", element="9cf3323d-156b-4df0-a9f3-ca536e798297")>
下载百度APP
有事搜一搜  没事看一看


**Keyboard actions**

The following methods are available:
- send_keys
- submit
- clear

In [15]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
inputElement = driver.find_element_by_id("kw")
inputElement.clear()
inputElement.send_keys("selenium")
#
inputElement.submit()

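**Selection lists**

The earlier examples showed how to type text into a text box with Selenium. What about ticking a checkbox or radio button, or driving other HTML input elements such as a drop-down list?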

In [None]:
html="""<select id="status" class="form-control valid" onchange="" name="status">
    <option value=""></option>
    <option value="0">未审核</option>
    <option value="1">初审通过</option>
    <option value="2">复审通过</option>
    <option value="3">审核不通过</option>
</select>
"""

# Import the Select class
from selenium.webdriver.support.ui import Select

# Locate the <select> element by its name attribute
select = Select(driver.find_element_by_name('status'))

# Three equivalent ways to choose the option "未审核"
select.select_by_index(1)
select.select_by_value("0")
select.select_by_visible_text(u"未审核")


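- `index` starts at 0
- `value` is an attribute of the option tag, not the text shown in the drop-down
- `visible_text` is the text of the option tag, which is what the drop-down displays
- To clear every selection (this only works on multi-select lists):

select.deselect_all()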

**Mouse action chains**

In [None]:
# Import the ActionChains class
from selenium.webdriver import ActionChains

# Move the mouse to element ac
ac = driver.find_element_by_xpath('element')
ActionChains(driver).move_to_element(ac).perform()


# Click on ac
ac = driver.find_element_by_xpath("elementA")
ActionChains(driver).move_to_element(ac).click(ac).perform()

# Double-click on ac
ac = driver.find_element_by_xpath("elementB")
ActionChains(driver).move_to_element(ac).double_click(ac).perform()

# Right-click on ac
ac = driver.find_element_by_xpath("elementC")
ActionChains(driver).move_to_element(ac).context_click(ac).perform()

# Press and hold the left mouse button on ac
ac = driver.find_element_by_xpath('elementF')
ActionChains(driver).move_to_element(ac).click_and_hold(ac).perform()

# Drag ac1 and drop it onto ac2
ac1 = driver.find_element_by_xpath('elementD')
ac2 = driver.find_element_by_xpath('elementE')
ActionChains(driver).drag_and_drop(ac1, ac2).perform()



**Executing JavaScript**

In [None]:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0
import time
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()

# Open the QQ registration page
driver.get("https://ssl.zc.qq.com/v3/index-chs.html")

print(driver.title)

try:
    # Explicitly wait up to 10 s for the title to contain "QQ注册"; otherwise a timeout error is raised.
    WebDriverWait(driver, 10).until(EC.title_contains("QQ注册"))

    # The script wraps the element collection in an array, so the elements arrive in inputs[0].
    inputs = driver.execute_script("var inputs = []; inputs.push(document.getElementsByTagName(\"input\")); return inputs;")

    for i in inputs[0]:
        print(i.get_attribute("id"))

except Exception as e:
    print(e)
finally:
    driver.quit()

**Switching between windows and frames**

In [None]:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0
import time
from selenium.webdriver.common.by import By

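# Create the driver before the try block so that driver.quit() in finally always has a driver.
driver = webdriver.Chrome()
try:
    driver.get("http://www.xuetangx.com/")

    print(driver.title)
    # Explicitly wait up to 10 s for the page title; otherwise a timeout error is raised.
    WebDriverWait(driver, 10).until(EC.title_contains("学堂在线-国家精品课程在线学习平台"))
    
    loginlink = driver.find_element_by_id("header_login")
    loginlink.click()
    
    # Wait for the login modal to become visible
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, 'g_modal')))
    
    # Remember the windows that are already open so that the new one can be detected
    handles_before = driver.window_handles
    driver.find_element_by_xpath("//a[@data-description=\"TOLOGIN#WEIBO\"]").click()
    
    # Wait for the Weibo login window to open
    WebDriverWait(driver, 10).until(EC.new_window_is_opened(handles_before))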

    
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
        if driver.title.find("网站连接") > -1:
            print(driver.title)
            WebDriverWait(driver, 10).until(EC.title_contains("网站连接"))
            weibo_username = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//input[@id="userId"]')))
            weibo_username.send_keys('13611112222')      
            
            weibo_passwd = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//input[@id="passwd"]')))
            weibo_passwd.send_keys('asdfghjkl12345')
            
            weibo_submit = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//a[@class=\"WB_btn_login formbtn_01\"]")))
            weibo_submit.click()
            
            time.sleep(10)
            break
except Exception as e:
    print(e)
finally:
    driver.quit()

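**Handling pop-up dialogs**

In [None]:
# Switch to the currently open alert; alert.accept() or alert.dismiss() then closes it
alert = driver.switch_to.alert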

**Navigation: history and location**

In [None]:
driver.get("http://www.example.com") 
driver.back()     # go back one step in the browser history
driver.forward()  # go forward again

**Explicit waits**

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0
import time
ff = webdriver.Chrome()
ff.get("https://www.kaggle.com/")
try:
    element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "site-header-kernels__a")))
    print(element.tag_name)
    print(element.text)
    time.sleep(3)
finally:
    ff.quit()
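
Conceptually, an explicit wait polls a condition until it returns something truthy or a deadline passes. The sketch below illustrates that idea in plain Python; it is not Selenium's implementation, and `wait_until`, its parameters, and the 0.5 s default poll interval are choices made for this illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)


# Offline demonstration: the condition becomes true on the third call
calls = {"n": 0}

def ready():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None

print(wait_until(ready, timeout=2.0, poll=0.01))
```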

In [None]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Placeholders (not defined in the original notebook): fill in your own values.
chromedriver_path = "./chromedriver"
weibo_username = "your-weibo-username"
weibo_password = "your-weibo-password"

class taobao_infos:

    # Initialize the object
    def __init__(self):
        url = 'https://login.taobao.com/member/login.jhtml'
        self.url = url

        options = webdriver.ChromeOptions()
        options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2}) # do not load images, which speeds up page loads
        options.add_experimental_option('excludeSwitches', ['enable-automation']) # important: exclude the enable-automation switch so that sites are less likely to detect Selenium

        self.browser = webdriver.Chrome(executable_path=chromedriver_path, options=options)
        self.wait = WebDriverWait(self.browser, 10) # timeout of 10 s
        
    def loginAsWeiboUser(self):

        # Open the login page
        self.browser.get(self.url)
        time.sleep(3.4)
        # Wait for the password-login option to appear
        password_login = self.wait.until(EC.presence_of_element_located((By.XPATH,'//a[@class="forget-pwd J_Quick2Static"]')))
        password_login.click()
        time.sleep(3)

        # Choose to log in with a Weibo account
        weibo_login = self.wait.until(EC.presence_of_element_located((By.XPATH,'//a[@class="weibo-login"]')))
        weibo_login.click()
        time.sleep(3)

        weibo_user = self.wait.until(EC.presence_of_element_located((By.XPATH, '//input[@name="username"]')))
        weibo_user.send_keys(weibo_username)
        time.sleep(2.1)

        weibo_pwd = self.wait.until(EC.presence_of_element_located((By.XPATH, '//input[@name="password"]')))
        weibo_pwd.send_keys(weibo_password)
        time.sleep(2.6)

        # Wait for the login button to appear
        submit = self.wait.until(EC.presence_of_element_located((By.XPATH,'//a[@class="W_btn_g"]')))
        submit.click()

        # The login counts as successful only once the Taobao member nickname can be read
        taobao_name = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.site-nav-bd > ul.site-nav-bd-l > li#J_SiteNavLogin > div.site-nav-menu-hd > div.site-nav-user > a.site-nav-login-info-nick ')))
        # Print the Taobao nickname
        print(taobao_name.text)

**Implicit waits**

In [None]:
from selenium import webdriver

ff = webdriver.Chrome()
ff.implicitly_wait(10) # seconds
ff.get("http://somedomain/url_that_delays_loading")
myDynamicElement = ff.find_element_by_id("myDynamicElement")

**Exception handling**

In [None]:
import selenium.webdriver.support.expected_conditions as EC

# The import above binds only the name EC, so pass EC to help()
help(EC)


Example

In [None]:
"""Selenium introductory example"""
import time
from selenium import webdriver

# The Keys class is needed to send keyboard keys
from selenium.webdriver.common.keys import Keys

# Create the browser object with Chrome; the commented lines show how to run it headless
#chromeOptions = webdriver.ChromeOptions()
#chromeOptions.add_argument("headless")
#driver = webdriver.Chrome(chrome_options=chromeOptions)
driver = webdriver.Chrome()
# The PhantomJS driver below is deprecated, although it still worked when this was written
# driver = webdriver.PhantomJS(executable_path='./phantomjs/bin/phantomjs.exe')

# get() blocks until the page has finished loading; tests often add a short time.sleep() here as well
driver.get("https://www.baidu.com/")
time.sleep(3)

# Take a screenshot of the current page and save it
driver.save_screenshot("baidu.png")

# id="kw" is the Baidu search box; type the string "长城" into it
driver.find_element_by_id("kw").send_keys(u"长城")

# id="su" is the Baidu search button; click() simulates a click
driver.find_element_by_id("su").click()
time.sleep(2)
# Take a screenshot of the new page
driver.save_screenshot("长城.png")

# Print the page source after rendering
print(driver.page_source[:1000])

# Get the cookies of the current page
print(driver.get_cookies())

# Ctrl+A: select everything in the search box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL,'a')

# Ctrl+X: cut the content of the search box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL,'x')

# Type new content into the search box
driver.find_element_by_id("kw").send_keys("itcast")

# Simulate pressing the Enter key
driver.find_element_by_id("su").send_keys(Keys.RETURN)

# Clear the search box
driver.find_element_by_id("kw").clear()

# Take another screenshot
driver.save_screenshot("itcast.png")

# Get the current URL
print(driver.current_url)

# Close the current page; if it is the only page, the browser closes too
# driver.close()

# Quit the browser
time.sleep(20)
driver.quit()

## Dealing with anti-scraping measures aimed at Selenium

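Driving a real browser with Selenium is arguably the most general-purpose data collection approach available today, because:

- It handles all kinds of data-loading mechanisms
- It can bypass client-side JS encryption
- It gets past many crawler-detection checks
- It gets past request-signature mechanisms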

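With it, the anti-scraping strategies of many websites become ineffective. Because Selenium leaves no fingerprint in the HTTP request data, a site cannot recognize and block it directly from the requests alone.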

Does this mean a website really has no way of blocking Selenium?

No, it does not.

While running, Selenium exposes a number of predefined JavaScript variables (marker strings), for example "window.navigator.webdriver".

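Outside Selenium its value is undefined (newer browser versions may report false instead), whereas under Selenium it is true; that is the value the Chrome console prints when the browser is driven by Selenium.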
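
The flag can also be read from Python rather than from the console. The following is a minimal sketch that assumes `driver` is an already-created Selenium WebDriver; it relies only on `execute_script`, and the function names are made up for this illustration.

```python
def read_webdriver_flag(driver):
    """Return navigator.webdriver as page scripts see it.

    A JavaScript `undefined` comes back to Python as None.
    """
    return driver.execute_script("return navigator.webdriver")


def looks_automated(driver):
    """True only when the page reports navigator.webdriver === true."""
    return read_webdriver_flag(driver) is True
```

With a real browser this would be called as `looks_automated(webdriver.Chrome())`, after importing `webdriver` from `selenium`.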

Beyond this, there are other telltale strings (they may differ from browser to browser). Commonly seen markers include:



In [None]:
webdriver  
__driver_evaluate  
__webdriver_evaluate  
__selenium_evaluate  
__fxdriver_evaluate  
__driver_unwrapped  
__webdriver_unwrapped  
__selenium_unwrapped  
__fxdriver_unwrapped  
_Selenium_IDE_Recorder  
_selenium  
calledSelenium  
_WEBDRIVER_ELEM_CACHE  
ChromeDriverw  
driver-evaluate  
webdriver-evaluate  
selenium-evaluate  
webdriverCommand  
webdriver-evaluate-response  
__webdriverFunc  
__webdriver_script_fn  
__$webdriverAsyncExecutor  
__lastWatirAlert  
__lastWatirConfirm  
__lastWatirPrompt  
$chrome_asyncScriptInfo  
$cdc_asdjflasutopfhvcZLmcfl_  

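Knowing this, a site can have its client-side JS check for these markers to decide whether Selenium is in use, and attach the result to subsequent requests so that the server can recognize and block them.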
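
The detect-then-tag flow can be modelled in a few lines of Python. This is only an illustrative model: real sites do the detection in browser JavaScript, and the `automation` field name, the abbreviated marker list, and all function names here are assumptions rather than any site's actual protocol.

```python
# A few of the marker strings from the list above (abbreviated)
MARKERS = (
    "webdriver",
    "__driver_evaluate",
    "__webdriver_evaluate",
    "__selenium_evaluate",
    "_selenium",
    "calledSelenium",
    "$cdc_asdjflasutopfhvcZLmcfl_",
)


def detect_markers(global_names):
    """Return the known markers found among the page's global names."""
    present = set(global_names)
    return [marker for marker in MARKERS if marker in present]


def tag_request(params, global_names):
    """Client side: copy the request parameters and attach the verdict."""
    tagged = dict(params)
    # "automation" is a hypothetical field name chosen for this sketch
    tagged["automation"] = "1" if detect_markers(global_names) else "0"
    return tagged


def should_block(params):
    """Server side: reject any request the client script has flagged."""
    return params.get("automation") == "1"
```

The point of the model is that the block decision happens on a later request, not on the page load itself, which is why the page can appear to load normally before requests start failing.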

Here is one site that detects and blocks Selenium effectively: Qunar (qunar.com).

In [1]:
from selenium import webdriver
driver = webdriver.Chrome()

driver.get("https://flight.qunar.com/site/oneway_list.htm")

All we need to do is hide these markers. We cannot simply delete the properties, though, because Selenium might then stop working properly.

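Instead, we can use a man-in-the-middle proxy such as Fiddler, proxy2.py, or mitmproxy to filter the marker strings out of the JS file (or replace them, for example with marker names that do not exist). The detection code then stops working, and the client-side script can no longer detect Selenium.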

In [None]:
# coding: utf-8  
# modify_response.py  
  
import re  
from mitmproxy import ctx  
    
def response(flow):  
  """Modify the response data.
  """  
  if '/js/yoda.' in flow.request.url:  
      # Defeat the Selenium detection  
      for webdriver_key in ['webdriver', '__driver_evaluate', '__webdriver_evaluate', 
                            '__selenium_evaluate', '__fxdriver_evaluate', '__driver_unwrapped',
                            '__webdriver_unwrapped', '__selenium_unwrapped', '__fxdriver_unwrapped', 
                            '_Selenium_IDE_Recorder', '_selenium', 'calledSelenium', '_WEBDRIVER_ELEM_CACHE',
                            'ChromeDriverw', 'driver-evaluate', 'webdriver-evaluate', 'selenium-evaluate', 
                            'webdriverCommand', 'webdriver-evaluate-response', '__webdriverFunc',
                            '__webdriver_script_fn', '__$webdriverAsyncExecutor', '__lastWatirAlert',
                            '__lastWatirConfirm', '__lastWatirPrompt', '$chrome_asyncScriptInfo', '$cdc_asdjflasutopfhvcZLmcfl_']:  
          ctx.log.info('Remove "{}" from {}.'.format(webdriver_key, flow.request.url))  
          flow.response.text = flow.response.text.replace('"{}"'.format(webdriver_key), '"NO-SUCH-ATTR"')  
      flow.response.text = flow.response.text.replace('t.webdriver', 'false')  
      flow.response.text = flow.response.text.replace('ChromeDriver', '')  
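
The replacement step can be tried offline, without running a proxy, by applying the same string substitutions to a piece of JavaScript text. The sketch below mirrors the three replacements in the script above. `scrub_detection_script` and the sample snippet are hypothetical and exist only for this illustration.

```python
def scrub_detection_script(js_text, markers, placeholder="NO-SUCH-ATTR"):
    """Apply the same substitutions as the mitmproxy response hook."""
    for key in markers:
        # Replace quoted marker names with a name that does not exist
        js_text = js_text.replace('"{}"'.format(key), '"{}"'.format(placeholder))
    # Neutralise the direct property check and the ChromeDriver token
    js_text = js_text.replace("t.webdriver", "false")
    js_text = js_text.replace("ChromeDriver", "")
    return js_text


# A made-up detection snippet, used only to exercise the function
sample = 'var k=["webdriver","calledSelenium"];if(t.webdriver){block()}'
cleaned = scrub_detection_script(sample, ["webdriver", "calledSelenium"])
print(cleaned)
```

Replacing rather than deleting the names keeps the script syntactically valid, so the page still runs while its checks look up attributes that never exist.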

In [1]:
import selenium
print(selenium.__file__)

C:\Users\leo\Anaconda3\lib\site-packages\selenium\__init__.py
