In [None]:
import sc
sc.init('subscriber_searching.txt')

In [None]:
sc.sc()

# Web Crawler (Scraping the Web)

[Wiki - Web Crawler](https://www.wikiwand.com/en/Web_crawler)
======
> A Web crawler is an Internet bot which **systematically browses** the World Wide Web, typically for the purpose of Web indexing

![crawler arch](images/WebCrawlerArchitecture.svg.png)

### Video: [Web Crawler - CS101 - Udacity](https://www.youtube.com/watch?v=CDXOcvUNBaA&hd=1)

### Where to start? 
* [Top 30 Free Web Scraping Software](http://www.octoparse.com/blog/top-30-free-web-scraping-software/)
- [Bazhuayu Collector (八爪鱼采集器)](http://www.bazhuayu.com/about)
- [Scrapy](http://scrapy.org/)  Demo:⬇︎

dmoz_spider.py

In [None]:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

scrapy crawl dmoz

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)

 [Wiki - Web Scraping](https://www.wikiwand.com/en/Web_scraping)
 ======
> Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.

> Techniques:
> - Human copy-and-paste
> - Text grepping and regular expression matching  
[regular expression for url](http://t.cn/RcrMHbr)
> - HTTP programming
> - HTML parsers
![HTML Tags](images/html_tags.png)
(https://www.nobledesktop.com/html-quick-guide/)
> - DOM(_Document Object Model_) parsing
![DOM Tree](images/domTree.jpg)
> - Web-scraping software
> - Vertical aggregation platforms
> - Semantic annotation recognizing
> - Computer vision web-page analyzers

## [《Web Scraping with Python: Collecting Data from the Modern Web》](http://shop.oreilly.com/product/0636920034391.do)
#### by Ryan Mitchell (2015) _4.5 out of 5 stars_

![book logo](images/51ZuvPdvCjL._SX379_BO1,204,203,200_.jpg)

> Learn web scraping and crawling techniques to access unlimited data from any web source in any format.
> - Learn how to __parse__ complicated HTML pages
> - __Traverse__ multiple pages and sites
> - Get a general overview of __APIs__ and how they work
> - Learn several methods for __storing__ the data you scrape
> - Download, read, and __extract__ data from documents
> - Use tools and techniques to __clean__ badly formatted data
> - Read and write __natural languages__
> - Crawl through __forms and logins__
> - Understand how to scrape __JavaScript__
> - Learn __image processing and text recognition__

### [pyspider](https://github.com/binux/pyspider)'s Features
> - Write script in Python
> - Powerful WebUI with script editor, task monitor, project manager and result viewer
> - MySQL, MongoDB, Redis, SQLite, PostgreSQL with SQLAlchemy as database backend
> - RabbitMQ, Beanstalk, Redis and Kombu as message queue
> - Task priority, retry, periodical, recrawl by age, etc...
> - Distributed architecture, Crawl Javascript pages, Python 2&3, etc...

### There are libs for that!

* [pattern](http://www.clips.ua.ac.be/pages/pattern)
* [lxml](http://lxml.de/)
* [requests](http://docs.python-requests.org/en/latest/)
* [Scrapy](https://scrapy.org/)
* [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)
* [mechanize](https://pypi.python.org/pypi/mechanize) 

### and Chrome DevCenter & Plugins...

* [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=zh-CN)
* [Postman](https://chrome.google.com/webstore/detail/postman/fhbjgbiflinjbdggehcddcbncdddomop?hl=zh-CN)

### Page Parsing technology
- XPath / CSS Path
- Regular Expression

HTML and the DOM
========
The general idea behind web scraping is to retrieve data that exists on a website and convert it into a format that is usable for analysis. Web pages are rendered by the browser from HTML and CSS code, but much of the information in the HTML underlying any website is not interesting to us.
We begin by reading in the source code for a given web page and creating a Beautiful Soup object with the BeautifulSoup function.

In [None]:
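from bs4 import BeautifulSoup
import urllib

url = 'http://www.bazhuayu.com/about'
# Download the raw HTML (Python 2: urllib.urlopen)
r = urllib.urlopen(url).read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r, 'html.parser')
print type(soup)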

The soup object contains all of the HTML in the original document.

In [None]:
print soup.prettify()[0:1000]

The HTML tags contained in the angled brackets provide structural information (and sometimes formatting), which we probably don't care about in and of itself but is useful for selecting only the content relevant to our needs. 
 
Beautiful Soup is essentially a set of wrapper functions that make it simple to select common HTML elements.

![dom tree](images/dom_tree.png)

Most modern browsers have a parser that reads in the HTML document, parses it into a DOM (Document Object Model) structure, and then renders the DOM structure.
Much like HTTP, the DOM is an agreed-upon standard. 
 
The DOM is much more than what is described here, but for our purposes, what is most important to understand is that the text is only one part of an HTML element, and we need to select it explicitly.
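
To make the point concrete, here is a minimal sketch that walks a small, made-up HTML fragment and collects tags and text separately. It uses Python 3's standard-library `html.parser` rather than Beautiful Soup, and the fragment and the `TagAndTextCollector` class are illustrative assumptions, not taken from any page scraped in this notebook.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Hypothetical helper: records start tags and text nodes in separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []   # (tag name, attribute dict) for every start tag
        self.texts = []  # non-empty text nodes, in document order

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


# Made-up fragment: one element with attributes, child elements, and text
sample_html = '<div class="movie"><a href="/m/1">First Title</a><span>8.5</span></div>'

parser = TagAndTextCollector()
parser.feed(sample_html)
print(parser.tags)   # structure: tag names and attributes
print(parser.texts)  # content: the text has to be picked out explicitly
```

The tag list carries the structure we use for selection, while the text list carries the content we usually want to keep.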

![dom tree](images/dom_tree2.png)

More about DOM Tree:
- [DOM visualization](http://dok.github.io/dom-visualization/)
- [Live DOM Viewer](https://software.hixie.ch/utilities/js/live-dom-viewer/)
- [HTML DOM Tree](https://gojs.net/latest/samples/DOMTree.html)

XPath & lxml:
========
`<div id='idValue' class='classValue' username='username'>
    <a href='http://www.bupt.edu.cn' target='_blank'>BUPT Home</a>
    <div>
        <a href='#'>click here 1st</a>
        <a href='#'>click here 2nd</a>
        <a href='#'>click here 3rd</a>
    </div>
</div>`
![XPath CSS Path Cheat Sheet](images/XpathCssPathCheatSheet.jpg)
(http://axatrikx.com/xpath-css-path-cheat-sheet/)

- [[XPath] XPath and lxml (1): XPath Terminology](http://www.cnblogs.com/ifantastic/p/3863271.html)  
- [[XPath] XPath and lxml (2): XPath Syntax](http://www.cnblogs.com/ifantastic/p/3863415.html)  
- [[XPath] XPath and lxml (3): XPath Axes](http://www.cnblogs.com/ifantastic/p/3863808.html)
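
As a rough illustration of the path expressions in the cheat sheet, the sketch below runs a few queries against the sample markup from this section. It assumes Python 3 and uses the limited XPath subset in the standard library's `xml.etree.ElementTree`, not lxml, so functions such as `text()` are unavailable and the text is read from the `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

# The sample markup from this section, with plain ASCII quotes
snippet = """<div id='idValue' class='classValue' username='username'>
    <a href='http://www.bupt.edu.cn' target='_blank'>BUPT Home</a>
    <div>
        <a href='#'>click here 1st</a>
        <a href='#'>click here 2nd</a>
        <a href='#'>click here 3rd</a>
    </div>
</div>"""

root = ET.fromstring(snippet)

# ".//a" selects every <a> descendant, at any depth
all_links = [a.text for a in root.findall(".//a")]

# "./div/a[2]" selects the second <a> child of the nested <div> (positions are 1-based)
second_link = root.find("./div/a[2]").text

# "[@attr='value']" filters on an attribute value
external = [a.get("href") for a in root.findall(".//a[@target='_blank']")]

print(all_links)
print(second_link)
print(external)
```

With lxml the same queries could end in `/text()` or `/@href` to return strings directly, which is the style used in the rest of this notebook.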

### To the internet (with dev tools)
#### [MTime](http://www.mtime.com/hotest/)
After poking around the MTime page with Chrome DevTools (Command + Option + I on a Mac) and a Chrome plugin, [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=zh-CN), we can see that all the movie titles can be grabbed with the XPath `//dl/dt/a/text()`. lxml lets us retrieve all the text that matches our XPath.

Full XPath for movie name:
/html/body[@id='bodyRegion']/div[@class='centent']/div[@class='mtimetip']/div[@class='mtipbox']/div[@class='mtipmid']/div[@class='mtiplist'][2]/div[@class='clearfix']/div[@class='picbox']/dl/dt/a

In [None]:
#encoding=utf-8
from lxml import html 

x = html.parse('http://www.mtime.com/hotest/')
titles = x.xpath("//dt/a/text()")
print "We got %s titles. Here are the first 5:" % len(titles)
for title in titles[:5]:
    print title

Full XPath for score:
/html/body[@id='bodyRegion']/div[@class='centent']/div[@class='mtimetip']/div[@class='mtipbox']/div[@class='mtipmid']/div[@class='mtiplist'][7]/div[@class='clearfix']/div[@class='picbox']/div[@class='score']/strong

In [None]:
scores = x.xpath("//div[@class='score']/strong/text()")
for score in scores:
    print score

In [None]:
# Select each movie group, then query relative to it (note the leading ".")
groups = x.xpath("//div[@class='mtiplist']")
for group in groups:
    titles = group.xpath(".//dt/a/text()")
    scores = group.xpath(".//div[@class='score']/strong/text()")

In [None]:
for i in range(1, 11):
    actors = x.xpath("//div[%s]/div/div/dl/dd/ul/li[2]/a/text()" % i)
    print actors
    for actor in actors:
        print actor
    print '-----'

### Scraping for Multi-pages

This page only has ~20 titles on it. A next-page link at the bottom of the page leads to another, similarly structured page with new titles and its own next-page link. To get more examples, we'll repeat the process above in a loop, with each iteration fetching the page that the link points to. To figure out what the link is, inspect the page with the dev tools, or just look at the code below.

In [None]:
# We'll use sleep to add some time in between requests
# so that we're not bombarding MTime's server too hard. 
from time import sleep

# Now we'll fill this list of MTime titles by starting
# at the landing page and following the next-page links
titles = []
base_url = 'http://www.mtime.com/hotest/{}'
next_page = "http://www.mtime.com/hotest/"

# These are the xpaths we determined from snooping 
next_button_xpath = "//a[@id='key_nextpage']/@href"
headline_xpath = "//div[@class='picbox']/dl/dt/a/text()"

while len(titles) < 50 and next_page:
    dom = html.parse(next_page)
    headlines = dom.xpath(headline_xpath)
    print "Retrieved {} titles from url: {}".format(len(headlines), next_page)
    titles += headlines
    next_pages = dom.xpath(next_button_xpath)
    if next_pages: 
        next_page = base_url.format(next_pages[0]) 
    else:
        print "No next button found"
        next_page = None
    sleep(3)

In [None]:
for title in titles[:15]:
    print title

In [None]:
with open('mtime_titles.txt', 'wb') as out:
    out.write('\n'.join(titles).encode('utf-8'))
with open('mtime_titles.txt') as f:
    titles_ = f.readlines()
    
print "Well, we got {} Hot Movies!".format(len(titles_))
for title in titles[:15]:
    print title

### A special example

In [None]:
y = html.parse('http://music.baidu.com/top/dayhot')
next_page = y.xpath("//a[@class='page-navigator-next']/@href")
print next_page

In [None]:
print [str(next_page[0]).strip()]

In [None]:
print ['http://music.baidu.com' + 
       str(next_page[0])
           .replace('\\t', '')
           .replace('\\n', '')
           .replace('[\'', '').replace('\']', '')
           .strip()
      ]
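
The manual string clean-up above can also be expressed with `urljoin`, which resolves a relative link against the page URL. This is a sketch for Python 3 (`urllib.parse`; in Python 2 the function lives in the `urlparse` module), and the `raw_href` value is a made-up stand-in for whatever the XPath query returns.

```python
from urllib.parse import urljoin

page_url = "http://music.baidu.com/top/dayhot"

# Hypothetical href as it might come out of the XPath query: padded with whitespace
raw_href = "\n\t\t/top/dayhot/?pst=shouyeTop&start=100\n\t"

# strip() removes the surrounding whitespace; urljoin supplies scheme and host
next_url = urljoin(page_url, raw_href.strip())
print(next_url)
```

Unlike string concatenation, `urljoin` also handles hrefs that are already absolute or that are relative to the current directory.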

### Another xpath example

In [None]:
import requests
url = 'https://book.douban.com/series/1163?page=11'
page = requests.get(url)
y = html.fromstring(page.content)
stars = y.xpath("//div[@class='star clearfix']/*")
for star in stars:
    print star.attrib['class'], star.text.strip() if star.attrib['class'] == 'rating_nums' or star.attrib['class'] == 'pl'  else ''

## [A Simple Crawler Example](https://github.com/gwulfs/bostonml/blob/master/scraping/scraping.ipynb)


#### How about regular expression?

In [None]:
import requests
import re

url = 'http://book.douban.com/series/1163?page=11'
re_extract = re.compile('<a href="(.*?)" title="(.*?)"[\S\s]*?class="pub">([\S\s]*?)<\/div>')
page = requests.get(url)
item_match = re.findall(re_extract, page.content)
if item_match:
    for item_info in item_match:
        print item_info
        print item_info[0]
        print item_info[1]
        print item_info[2].strip(), '\n'
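
To see what the three capture groups in that pattern pick up without hitting the network, here is a small sketch that applies the same regular expression to an invented HTML fragment. The fragment and its URLs are assumptions for illustration only; the real Douban markup may differ. The code targets Python 3.

```python
import re

# Invented fragment shaped like the markup the pattern expects
sample = '''
<li><a href="https://example.com/book/1" title="Book One">Book One</a>
<div class="pub">Author A / Press X / 2010</div></li>
<li><a href="https://example.com/book/2" title="Book Two">Book Two</a>
<div class="pub">
  Author B / Press Y / 2012
</div></li>
'''

# Group 1: link target, group 2: title attribute, group 3: publication info.
# [\S\s]*? is a lazy "any character including newlines" match.
pattern = re.compile(r'<a href="(.*?)" title="(.*?)"[\S\s]*?class="pub">([\S\s]*?)</div>')

books = [(href, title, pub.strip()) for href, title, pub in pattern.findall(sample)]
for book in books:
    print(book)
```

The lazy quantifiers matter here: a greedy `[\S\s]*` would run to the last `</div>` in the document and merge both entries into one match.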

### About Regular Expression:

![Python Regular Expression](images/pyre_ebb9ce1c-e5e8-4219-a8ae-7ee620d5f9f1.png)

References:
- [Regulex](https://jex.im/regulex/)
- [regular expressions 101](https://regex101.com/)
- [Regex Builder](https://regexbuilder.codeplex.com/)
- [A Guide to Python Regular Expressions (in Chinese)](http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html)

- [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)

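### Python Encoding Issues
In short, Python 2.x has two string types: str and unicode.  
Going from the former to the latter requires decode, and from the latter to the former requires encode. Taking 'utf-8' as an example:  
str.decode('utf-8') -> Unicode  
str <- Unicode.encode('utf-8')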
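
For comparison, here is a sketch of the same round trip in Python 3, where the two types are named `bytes` and `str` and the conversion directions are unchanged. It is an illustration only; the cells in this notebook remain Python 2.

```python
text = "中文"                       # Python 3 str: a sequence of Unicode code points
raw = text.encode("utf-8")          # str -> bytes

print(type(text).__name__, len(text))  # 2 characters
print(type(raw).__name__, len(raw))    # 6 bytes: 3 per character in UTF-8

restored = raw.decode("utf-8")      # bytes -> str
print(restored == text)

# Decoding with the wrong codec fails loudly instead of guessing
try:
    raw.decode("ascii")
except UnicodeDecodeError as error:
    print("ascii cannot decode these bytes:", error.reason)
```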

In [None]:
print "Type of    '中文'   is %s" % type('中文')
print "Type of   '中文'.decode('utf-8')   is %s" % type('中文'.decode('utf-8')) 
print "Type of   u '中文'   is %s" % type(u'中文')
print "Type of   u'中文'.encode('utf-8')   is %s" % type(u'中文'.encode('utf-8')) 

Tip 1: Use a character encoding declaration, and use the same declaration in every source file of a project.

In [None]:
#encoding=utf-8

Tip 2: Abandon str and use unicode everywhere.

In [None]:
test1 = 'présenter'
test2 = u'汉字'
print type(test1)
test_unicode = test1.decode('utf-8')
print type(test_unicode)
print ("%s+%s" % (test_unicode, test2)).encode('utf-8')

Handling JavaScript-escaped strings

In [None]:
src_str = u"\\u4e09\\u73af\\u4ee5\\u5185"
print src_str.encode('utf-8')

In [None]:
tar_str_decode = src_str.decode("unicode-escape")
print tar_str_decode.encode('utf-8')

In [None]:
tar_str_eval = eval("u\"" + src_str + "\"")
print tar_str_eval.encode('utf-8')
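
Because `eval` executes whatever it is given, it is risky on text scraped from the web. One alternative sketch, assuming Python 3 and an input that contains no unescaped double quotes, is to let the JSON parser interpret the `\uXXXX` escapes, since JSON strings use the same escape syntax as JavaScript for these sequences.

```python
import json

# The same escaped text as above: literal backslash-u sequences, not real characters
src_str = "\\u4e09\\u73af\\u4ee5\\u5185"

# Wrap the text in double quotes so that it is a valid JSON string literal
decoded = json.loads('"%s"' % src_str)
print(decoded)

# The codec-based route from the earlier cell, written for Python 3
decoded_via_codec = src_str.encode("ascii").decode("unicode_escape")
print(decoded_via_codec == decoded)
```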

References:
- [Python Character Encoding Explained (in Chinese)](http://www.cnblogs.com/huxi/archive/2010/12/05/1897271.html)

## About Ajax

Let’s start with the imports:

In [None]:
from lxml import html
import requests

Next we will use requests.get to retrieve the web page with our data, parse it using the html module:

In [None]:
page = requests.get('http://bbs.byr.cn/board/Recommend?p=1&_uid=guest')
tree = html.fromstring(page.text)

```html
<td class="title_9"><a href="……" class="">【公告】北邮人论坛热点活动管理条例</a></td>
…… 
<td class="title_12">| <a href="/user/query/wangxiaobupt" class="c63f">wangxiaobupt</a></td>
```

Knowing this we can create the correct XPath query:

In [None]:
title_xpath = '//td[@class="title_9"]/a/text()'
author_xpath = '//a[@class="c63f"]/text()'

Use the lxml xpath function like this:

In [None]:
titles = tree.xpath(title_xpath)
authors = tree.xpath(author_xpath)
print "We got %s titles and %s authors" % (len(titles), len(authors))

Why do the counts not match what the browser shows? Let's introduce Ajax (Asynchronous JavaScript and XML):
![Ajax](images/dojo_0401.png)

So let's review the page code... 

In [None]:
# No leading space in the value: newer versions of requests reject such headers
headers = {'X-Requested-With': 'XMLHttpRequest'}
page = requests.get('http://bbs.byr.cn/board/Recommend?p=1&_uid=guest', headers=headers)
tree = html.fromstring(page.text)

### Another Ajax example

In [None]:
page = requests.get('http://list.jd.com/list.html?cat=9987,653,655', headers=headers)
tree = html.fromstring(page.text)

In [None]:
jd_names = tree.xpath("//div[@class='p-name']/a/em/text()")
for jd_name in jd_names[:5]:
    print jd_name

In [None]:
jd_ids = tree.xpath("//div[@class='gl-i-wrap j-sku-item']/@data-sku")
# for jd_id in jd_ids:
#     print jd_id.attrib['data-sku']
# Query prices in batches of 5 SKU ids, joined with a URL-encoded comma (%2C)
for ids_group in [jd_ids[i:i+5] for i in range(0, len(jd_ids), 5)]:
    skuIds = '%2C'.join('J_%s' % sku_id for sku_id in ids_group)
    page = requests.get('http://p.3.cn/prices/mgets?callback=jQuery8870889&type=1&area=1&skuIds=%s' % skuIds)
    print page.content
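
The `callback=jQuery8870889` parameter in the price URL suggests that the response is JSONP, that is, JSON wrapped in a JavaScript function call. Assuming a body of the form `jQuery8870889([...]);`, a sketch for unwrapping it could look like this. The sample body, its field names, and the `parse_jsonp` helper are hypothetical, and the code targets Python 3.

```python
import json
import re


def parse_jsonp(body):
    """Strip a `callbackName( ... );` wrapper and parse the JSON inside it."""
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', body, re.S)
    if match is None:
        raise ValueError("not a JSONP payload")
    return json.loads(match.group(1))


# Hypothetical response body; the real service may use different fields
sample_body = 'jQuery8870889([{"id": "J_1001", "p": "1999.00"}, {"id": "J_1002", "p": "2599.00"}]);\n'

prices = {item["id"]: float(item["p"]) for item in parse_jsonp(sample_body)}
print(prices)
```

If the service also accepts the request without the `callback` parameter, the wrapper disappears and `json.loads` can be applied to the body directly, but that is an assumption to verify against the actual endpoint.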

### JSON (JavaScript Object Notation)
![JSON](./images/JSON.gif)



JSON is commonly used as the payload format for REST (Representational State Transfer) APIs.

![rest](images/How to parse JSON In Java.png)

JSON in Python

In [None]:
import json

json_str = """
{
  "maps": [
    {
      "id": "blabla",
      "iscategorical": "0"
    },
    {
      "id": "blabla",
      "iscategorical": "0"
    }
  ],
  "masks": {
    "id": "valore"
  },
  "om_points": "value",
  "parameters": {
    "id": "valore"
  }
}
"""

json_obj = json.loads(json_str)
print json_obj, json_obj['masks']

References:
- [Introducing JSON](http://www.json.org/)
- [Python: json — JSON encoder and decoder](https://docs.python.org/2/library/json.html)

### Web Debugging Proxy
- [Fiddler](http://www.telerik.com/fiddler)
- [Charles](http://www.charlesproxy.com/)

### Headers
- User-Agent
- Referer
- Cookie
- Accept

### Robot exclusion
![Robot exclusion](images/robot_explained.png)

In [None]:
#http://www.intel.com/robots.txt

# robots.txt exclusion for www.intel.com/ - US
User-agent: *
Disallow: /cgi
Disallow: /iaweb/
Disallow: /cpc/vision/
Disallow: /intel/june297/
Disallow: /cpc/eps/
Disallow: /design/june297/
Disallow: /cpc/archive/
......
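
A polite crawler checks these rules before fetching a URL. The sketch below feeds the first few rules from the listing above to Python 3's standard-library `urllib.robotparser` and asks whether some paths may be fetched. The `ExampleCrawler` agent name and the tested paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules shown above
robots_lines = """\
User-agent: *
Disallow: /cgi
Disallow: /iaweb/
Disallow: /cpc/vision/
""".splitlines()

rules = RobotFileParser()
rules.parse(robots_lines)  # parse in-memory lines instead of downloading robots.txt


def allowed(path):
    """True if a crawler named ExampleCrawler (hypothetical) may fetch the path."""
    return rules.can_fetch("ExampleCrawler", "http://www.intel.com" + path)


for path in ["/cgi/search", "/iaweb/index.htm", "/products/"]:
    print(path, allowed(path))
```

Rules are matched by path prefix, so `Disallow: /cgi` also covers paths such as `/cgi-bin/`. To load a live file instead, `set_url(...)` followed by `read()` replaces the `parse` call.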

### Evolution of freshness and age in Web crawling
![](images/freshness.png)

### When the scale of Crawler grows...

### Typical anatomy of a large-scale crawler
![Typical anatomy of a large-scale crawler](images/5396ee05gw1ewdwnihf6jj20kl0hlwhx.jpg)

References:  
- [Python web scraping resource](http://jakeaustwick.me/python-web-scraping-resource/)
- [Web Scraping 101 with Python](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/)
- The Science of Crawl [Part 1](http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content) [Part2](http://blog.urx.com/urx-blog/2014/10/23/the-science-of-crawl-part-2-content-freshness)
- [阅读 coursera-dl 源码](http://blog.onlyice.net/2015/07/11/read-coursera-dl-code/) 

### Crawler Tech Graph
![crawler](images/crawler.jpg)

In [None]:
sc.sc()

### Customized DNS component
![DNS](images/dns-rev-1.gif)
![DNS Lookup Time](images/Waterfall-Chart-New.png)
- Custom client for address resolution
- **Caching** server
- **Prefetching** client
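
The caching and prefetching ideas above can be sketched in a few lines. This is a minimal illustration for Python 3, not a production resolver: the `CachingResolver` class, its fixed TTL, and the fake resolver used in the demo are all assumptions, and a real crawler would more likely run a local caching server such as Dnsmasq.

```python
import socket
import time


class CachingResolver:
    """Hypothetical resolver that remembers answers for a fixed time-to-live."""

    def __init__(self, resolve=socket.gethostbyname, ttl_seconds=300):
        self._resolve = resolve      # function: hostname -> IP string
        self._ttl = ttl_seconds
        self._cache = {}             # hostname -> (ip, expiry timestamp)

    def lookup(self, hostname):
        now = time.time()
        cached = self._cache.get(hostname)
        if cached and cached[1] > now:
            return cached[0]         # cache hit: no network round trip
        ip = self._resolve(hostname)
        self._cache[hostname] = (ip, now + self._ttl)
        return ip

    def prefetch(self, hostnames):
        """Resolve hosts found in freshly parsed links before they are fetched."""
        for hostname in hostnames:
            self.lookup(hostname)


# Demo with a fake resolver so that the example runs without network access
calls = []

def fake_resolve(hostname):
    calls.append(hostname)
    return "203.0.113.10"            # address from a documentation-only range

resolver = CachingResolver(resolve=fake_resolve)
resolver.prefetch(["example.com", "example.org"])
ip = resolver.lookup("example.com")  # served from the cache
print(ip, len(calls))
```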

DNS cache poisoning:
![DNS cache poisoning](images/dns-security-fig3.png)

References:
- [Dnsmasq](http://www.thekelleys.org.uk/dnsmasq/doc.html)

## Multi-Thread
![Multi-Thread](images/Picture1.png)

### Thread Pool
![thread pool](images/400px-Thread_pool.svg.png)

#### Advantages of Multithreading
- Improved performance and concurrency
- Simplified coding of remote procedure calls and conversations
- Simultaneous access to multiple applications
- Reduced number of required servers

#### Disadvantages of Multithreading
- Difficulty of writing code
- Difficulty of debugging
- Difficulty of managing concurrency
- Difficulty of testing
- Difficulty of porting existing code
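
A thread pool hides much of that difficulty behind a small interface. The sketch below uses Python 3's standard-library `concurrent.futures`; `fake_download` and the URLs are stand-ins so that the example runs without network access, with a short sleep playing the role of network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Stand-in for requests.get(url).text: waits briefly, then returns a fake page."""
    time.sleep(0.1)
    return "<html>%s</html>" % url


sample_urls = ["http://example.com/%s" % name for name in "abcde"]

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() runs the calls on worker threads and yields results in input order
    pages = list(pool.map(fake_download, sample_urls))
elapsed = time.time() - start

print(len(pages), "pages in %.2f seconds" % elapsed)
```

Because the threads spend their time waiting rather than computing, the five waits overlap instead of adding up, which is the effect the timing comparison in the next section measures on real sites.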

### [Tomorrow](https://github.com/madisonmay/Tomorrow)

In [None]:
urls = [
    'http://google.com',
    'http://facebook.com',
    'http://youtube.com',
    'http://baidu.com',
    'http://yahoo.com',
]

In [None]:
import time
import requests

def download(url):
    return requests.get(url)

if __name__ == "__main__":

    start = time.time()
    responses = [download(url) for url in urls]
    html = [response.text for response in responses]
    end = time.time()
    print "Time: %f seconds" % (end - start)

In [None]:
import time
import requests

from tomorrow import threads

@threads(5)
def download(url):
    return requests.get(url)

if __name__ == "__main__":
    start = time.time()
    responses = [download(url) for url in urls]
    html = [response.text for response in responses]
    end = time.time()
    print "Time: %f seconds" % (end - start)

Multi-Sockets
![sockets](images/TCP_IP_socket_diagram.png)

About APIs
---
Application Programming Interface (API)

### Long URL

In [None]:
!curl "http://api.longurl.org/v2/expand?url=http://kck.st/1Q51X6T&title=1&content-type=1&rel-canonical=1&format=json"

In [None]:
import requests
import json

short_url = "http://kck.st/1Q51X6T"
url_template = "http://api.longurl.org/v2/expand?url=%s&title=1&content-type=1&rel-canonical=1&format=json"
tar_url = url_template % short_url
print tar_url

result = requests.get(tar_url).content
json_result = json.loads(result)
print "short url:%s \n->long url:%s " % (short_url, json_result['long-url'])
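
The template above pastes the short URL into the query string unescaped, which can break once the target URL contains characters such as `&` or `?`. A sketch of building the same request with proper escaping, using Python 3's `urllib.parse.urlencode` (the `build_expand_url` helper is hypothetical):

```python
from urllib.parse import urlencode


def build_expand_url(short_url):
    """Build the expand request with every parameter value percent-encoded."""
    params = [
        ("url", short_url),
        ("title", 1),
        ("content-type", 1),
        ("rel-canonical", 1),
        ("format", "json"),
    ]
    return "http://api.longurl.org/v2/expand?" + urlencode(params)


tar_url = build_expand_url("http://kck.st/1Q51X6T")
print(tar_url)
```

With the requests library, passing `params=dict(...)` to `requests.get` performs the same encoding.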

### Page Summary

In [None]:
!curl 'http://clipped.me/algorithm/clippedapi.php?url=http://kck.st/1Q51X6T'

In [None]:
import requests
import json

url = "http://kck.st/1Q51X6T"
url_template = "http://clipped.me/algorithm/clippedapi.php?url=%s"
tar_url = url_template % url

result = requests.get(tar_url).content
json_result = json.loads(result)
print "for:%s\ntitle:%s\nsummary:" % (url, json_result['title'])
for summary in json_result['summary']:
    print summary
    

### Share Count

In [None]:
!curl 'http://free.sharedcount.com/?url=https://google.com/&apikey=d730c518430eabcabc46ab79528c744067afa17e'

In [None]:
import requests
import json

url = "https://google.com"
url_template = "http://free.sharedcount.com/?url=%s&apikey=d730c518430eabcabc46ab79528c744067afa17e"
tar_url = url_template % url

result = requests.get(tar_url).content
json_result = json.loads(result)
print "for %(url)s\nFacebook:%(facebook)s\nTwitter:%(twitter)s" % {
    'url' : url,
    'facebook' : json_result['Facebook']['like_count'],
    'twitter' : json_result['Twitter'],
} 

### About Deduplication

![About Deduplication](images/image5.png)

Set in Python

In [None]:
urls = [
    'http://www.google.com',
    'http://www.aol.com',    
    'http://www.google.com',
]
dedup = set()
for url in urls:
    dedup.add(url)
print dedup

Bloom Filter
![Bloom Filter](images/800px-Bloom_filter.svg.png)
![](images/Bloom-Filter-Simple-howto.png)

[pybloom](https://github.com/jaybaird/python-bloomfilter)

In [None]:
from pybloom import BloomFilter
f = BloomFilter(capacity=1000, error_rate=0.001)
[f.add(x) for x in range(10)]

In [None]:
all([(x in f) for x in range(10)])
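
To show what happens inside such a filter, here is a toy Bloom filter for Python 3 built on `hashlib`. It is a sketch of the general technique, not pybloom's implementation: the `TinyBloomFilter` class, the fixed bit count, and the way hash functions are derived from SHA-256 are all assumptions chosen for brevity.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter: k hash positions per item in a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = str(item).encode("utf-8")
        for seed in range(self.num_hashes):
            # Prefixing a seed byte gives k different hash functions from one algorithm
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen"
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = TinyBloomFilter()
crawled = ["http://www.google.com", "http://www.aol.com"]
for url in crawled:
    seen.add(url)

print(all(url in seen for url in crawled))  # added items are always reported as present
print("http://www.example.com" in seen)     # almost certainly False, but false positives can occur
```

Compared with the `set` above, the filter never stores the URLs themselves, so memory stays fixed at `num_bits` regardless of URL length, at the cost of occasional false positives.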

### Message Queue
![Message Queue](images/IC709523.png)
- A sender can post a message to the queue.
- A receiver can retrieve a message from the queue (the message is removed from the queue).
- A receiver can examine (or peek) the next available message in the queue (the message is not removed from the queue).
#### Scenarios for Asynchronous Messaging
- Load balancing
- Decoupling workloads
- Temporal decoupling
- Load leveling
- Cross-platform integration
- Asynchronous workflow
- Deferred processing
- Reliable messaging
- Resilient message handling
- Non-blocking receivers
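
The three queue operations described at the start of this section (post, retrieve, peek) can be sketched with an in-memory queue. This Python 3 example uses `collections.deque`; the `InMemoryQueue` class is a hypothetical stand-in for a real broker such as RabbitMQ, with none of its persistence or networking.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical single-process queue with post / retrieve / peek."""

    def __init__(self):
        self._messages = deque()

    def post(self, message):
        """Sender side: append a message to the tail of the queue."""
        self._messages.append(message)

    def retrieve(self):
        """Receiver side: remove and return the oldest message, or None if empty."""
        return self._messages.popleft() if self._messages else None

    def peek(self):
        """Receiver side: look at the oldest message without removing it."""
        return self._messages[0] if self._messages else None

    def __len__(self):
        return len(self._messages)


frontier = InMemoryQueue()
frontier.post("http://example.com/page1")
frontier.post("http://example.com/page2")

peeked = frontier.peek()          # queue still holds both messages
first = frontier.retrieve()       # removes page1
remaining = len(frontier)
print(peeked, first, remaining)
```

In a crawler, the fetcher posts newly discovered URLs and the downloader workers retrieve them, which is the workload decoupling named in the scenarios list.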

### Google File System
![GFS Architecture](images/gfs_architecture.png)  
- Client translates file name and byte offset to chunk index.
- Sends request to master.
- Master replies with chunk handle and location of replicas.
- Client caches this info.
- Sends request to a close replica, specifying chunk handle and byte range.
- Requests to master are typically buffered.
#### GFS's winning attributes
- Availability
- Performance
- Management
- Cost   
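
The first steps of the read path above are simple arithmetic plus a lookup. The sketch below assumes the 64 MB chunk size described in the GFS paper; the file name, chunk handle, and chunkserver names in `MASTER_TABLE` are invented, and the dictionary merely stands in for the master's metadata service.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described in the GFS paper

# Hypothetical master metadata: (file name, chunk index) -> (chunk handle, replica locations)
MASTER_TABLE = {
    ("/logs/crawl.dat", 3): ("handle-0x2ef0", ["chunkserver-a", "chunkserver-c", "chunkserver-f"]),
}


def to_chunk_request(file_name, byte_offset):
    """Client side: translate (file name, byte offset) into a master request key."""
    chunk_index, offset_in_chunk = divmod(byte_offset, CHUNK_SIZE)
    return (file_name, chunk_index), offset_in_chunk


# Read starting at byte offset 200 MB of the file
key, offset_in_chunk = to_chunk_request("/logs/crawl.dat", 200 * 1024 * 1024)

# The master replies with the chunk handle and replica locations, which the client caches
handle, replicas = MASTER_TABLE[key]

print(key, offset_in_chunk, handle, replicas[0])
```

An offset of 200 MB falls 8 MB into the fourth chunk (index 3), so the client asks one of the listed replicas for that handle and byte range without involving the master again.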

### About Cookie

Requests: Persistent Sessions

In [None]:
import requests

url1 = 'http://elib.cnki.net/grid2008/brief/result.aspx?DbPrefix=hotspotcomp&showTitle=学科学术热点'
url2 = 'http://elib.cnki.net/request/search.aspx?action=&PageName=ASP.brief_result_aspx&DBViewType=FullText&DbCatalog=中国学术文献网络出版总库&DbPrefix=hotspotcomp&ConfigFile=hotspotcomp.xml&db_value=SUBJECT_BASE_INFO&NaviField=主题学科代码&orderby=(主题热度值,\'integer\')&txt_extension=false&his=1&SourceTypeCode=undefined&pSourceTypeCode='
url3 = 'http://elib.cnki.net/DataCenter/DoGridTable.aspx?action=grid&pagename=ASP.brief_result_aspx&dbPrefix=hotspotcomp&dbCatalog=中国学术文献网络出版总库&ConfigFile=hotspotcomp.xml&sKuakuID=1&loadgroup=1&prio=true&db_value=SUBJECT_BASE_INFO'

s = requests.Session()
page = s.get(url1)
page = s.get(url2)
page = s.get(url3)
print page.content
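
What a persistent session does with cookies can be approximated as: remember each `Set-Cookie` value from a response and send it back in the `Cookie` header of later requests. A sketch with Python 3's standard-library `http.cookies`, using invented header values rather than anything returned by the site above:

```python
from http.cookies import SimpleCookie

jar = SimpleCookie()

# Invented Set-Cookie headers, as two successive responses might send them
jar.load("SID=abc123; Path=/; HttpOnly")
jar.load("lang=zh; Path=/")

# Only name=value pairs go back to the server; attributes such as Path stay local
cookie_header = "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())
print(cookie_header)
```

`requests.Session` performs this bookkeeping automatically, which is why the three `s.get(...)` calls above can depend on state set by the earlier ones.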

pyCookieCheat

In [None]:
import pyCookieCheat
local_cookies = pyCookieCheat.chrome_cookies(profile='Profile 4', domain='baidu.com')
local_cookies

### About Proxy

In [None]:
# proxies belongs inside the get() call; `url` and `cookies` are assumed to be defined earlier
page_content = requests.get(url, cookies=cookies, proxies={"http": "http://117.177.243.42:85/"}).content

### Crawler vs. Anti-Crawler

| Crawler  | Anti-Crawler|
|:------------- |:---------------|
| Naive Crawler(No specific header)      | Verify User-Agent/Referer/X-Requested-With/... | 
| Multi-threaded      |  Connections per IP limit |
| Multi-Proxy | Connections per IP limit + Proxy Detection | 
| *Multi-IP | Cookie Limit |
| Cloud(PaaS(Platform-as-a-Service) / SaaS(Software-as-a-Service)) | OAuth/Cookie limit |
| CrowdSourcing | Credit / Noise / HoneyPot / Pattern Recognition  |

In [None]:
sc.sc()

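- The core problem: telling real people from bots
- Multi-IP is the mainstream industry practice: marshal a large pool of IP addresses for access
- Acquiring accounts in bulk: automation or crowdsourcing
- Blocking = announcing the limit
- Exploiting loopholes is not a long-term strategy
- The simplest honeypot: links that are invisible to users but can still be parsed or inferred

- Data by itself has no price; mining it is what has value
- A collection by itself has no price; reading it is what has value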

### A Real Story......

> Please stop automated bots against our platform and stealing data. This is not a good way for a smart guy to spend his time and energy. If you'd like to engage, we love to have intelligent guys join, work with us and make things better.

> I'm so sorry to have troubled you. I'm not a data stealer, but a NLP researcher. I really love your service and I've recommended it to all my friends. In the last days, I did some data collection to analysis social relevances of them, sorry for any inconvenience this may have caused, and I'll stop the collection right now. Further more, do you have any plan to provide open(or commerial) API service? I would appreciate it very much.

> We’re working on a set of OAuth2 APIs for trials with enterprise customers and they should be ready by the end of July. When they’re ready we’d be happy to share based on either a commercial agreement (if using for commercial purposes) or non-commercial agreement meant purely for research/education with all citations/references in place.

> We’re happy to support research work that makes things better.

> In the meantime, if you can please outline what exactly are you looking to do and how may we help, we can figure how we may get you the required data.

> Once we understand and come to an agreement, we’ll unblock your primary account.
Also, to reiterate, please refrain from all data collection without our knowledge and consent. You seem like a talented individual so we consider it mutually beneficial to maintain good relations with you.
We appreciate you recommending us to your friends.

In [None]:
from selenium import webdriver

driver = webdriver.Firefox()

driver.get('https://en.wikipedia.org/wiki/International_Space_Station')

In [None]:
iss_table = driver.find_element_by_xpath('//table')
iss_table_html = iss_table.get_attribute('outerHTML')

print iss_table_html[:200]
print '\n. . .\n'
print iss_table_html[-200:]

In [None]:
# for ipython notebook display
from IPython.core.display import display, HTML

display(HTML(iss_table_html))