# 使用API

**应用编程接口（Application Programming Interface，API）**的用处：它们为不同的应用提供了方便友好的接口。不同的开发者用不同的架构，甚至不同的语言编写软件都没问题——因为API 设计的目的就是要成为一种通用语言，让不同的软件进行信息共享。

尽管目前不同的软件应用都有各自不同的API，但“API”经常被看成“网络应用API”。一般情况下，程序员可以用HTTP 协议向API 发起请求以获取某种信息，API 会用XML（eXtensible Markup Language， 可扩展标记语言） 或JSON（JavaScript Object Notation，JavaScript 对象表示）格式返回服务器响应的信息。尽管大多数API 仍然在用XML，**但是JSON 正在快速成为数据编码格式的主流选择**。

用这种即开即用的接口获取预先打包好的信息，看起来好像和本书主题没什么关系，但是这种看法只对了一半。虽然大多数人通常不会把使用API 看成网络数据采集，但是实际上两者使用的许多技术（都是发送HTTP 请求）和产生的结果（都是获取信息）差不太多；两者经常是相辅相成的关系。

API 之所以叫API 而不是叫网站的原因，其实是首先API 请求使用非常严谨的语法，其次API 用JSON 或XML 格式表示数据，而不是HTML 格式。

### 1.API通用规则

API 用一套非常标准的规则生成数据，而且生成的数据也是按照非常标准的方式组织的。因为规则很标准，所以一些简单、基本的规则很容易学，可以帮你快速地掌握任意API 的用法。

不过并非所有API 都很简单，有些API 的规则比较复杂，因此第一次使用一个API 时，建议阅读文档，无论你对以前用过的API 是多么熟悉。

#### 1.1 方法

利用HTTP 从网络服务获取信息有四种方式：
- GET
- POST
- PUT
- DELETE

  - 1.0 GET 就是你在浏览器中输入网址浏览网站所做的事情。当你访问http://freegeoip.net/json/50.78.253.58 时，就会使用GET 方法.
  - 2.0 POST 基本就是当你填写表单或提交信息到网络服务器的后端程序时所做的事情.
  - 3.0 PUT 在网站交互过程中不常用，但是在API 里面有时会用到。PUT 请求用来更新一个对象或信息。
  - 4.0 DELETE 用于删除一个对象。。DELETE 方法在公共API 里面不常用，它们主要用于创建信息，不能随便让一个用户去删掉数据库的信息。但是，和PUT 方法一样，DELETE 方法也值得了解一下
  
虽然在HTTP 规范里还有一些信息处理方式，但是这四种基本是你使用API 过程中可能遇到的全部。

#### 2.2 服务器响应

API 有一个重要的特征是它们会反馈格式友好的数据。大多数反馈的数据格式都是XML 和JSON。

这几年，JSON 比XML 更受欢迎，主要有两个原因。


**首先，JSON 文件比完整的XML 格式小**。

比如下面的XML 数据用了98 个字符：
```XML
<user><firstname>Ryan</firstname><lastname>Mitchell</lastname><username>Kludgist</username></user>
```

同样的JSON 格式数据：
```JSON
{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}
```

**JSON 格式比XML 更受欢迎的另一个原因是网络技术的改变**

过去，服务器端用PHP和.NET 这些程序作为API 的接收端。现在，服务器端也会用一些JavaScript 框架作为API的发送和接收端，像Angular 或Backbone 等。虽然服务器端的技术无法预测它们即将收到的数据格式，但是像Backbone 之类的JavaScript 库处理JSON 比处理XML 要更简单。

### API调用

不同API 的调用语法大不相同，但是有几条共同准则。当使用GET 请求获取数据时，用URL 路径描述你要获取的数据范围，查询参数可以作为过滤器或附加请求使用。

有许多API 会通过文件路径（path）的形式指定API 版本、数据格式和其他属性。例如，下面的链接会返回同样的结果，但是使用虚拟API 的第四版，反馈数据为JSON 格式：
```JSON
http://socialmediasite.com/api/v4/json/users/1234/posts?from=08012014&to=08312014
```

还有一些API 会通过请求参数（request parameter）的形式指定数据格式和API 版本：
```JSON
http://socialmediasite.com/users/1234/posts?format=json&from=08012014&to=08312014
```

### 2.  解析JSON数据

在本章中，我们介绍了许多不同类型的API 以及它们的使用方法，也介绍了这些API 反馈的一些简单的JSON 格式数据。现在让我们看看如何解析和使用这些信息。

本章开始的时候，我用过freegeoip.net 网站IP 查询的例子，可以把IP 地址解析转换成地
理位置：

```JSON
http://freegeoip.net/json/50.78.253.58
```
我可以获取这个请求的反馈数据，然后用Python 的JSON 解析函数来解码：

In [10]:
import json
from urllib.request import urlopen

def responseJson(ipAddress):
    response = urlopen("http://freegeoip.net/json/" + ipAddress).read().decode('utf-8')
    responseJson = json.loads(response)
    return responseJson

def getCountry(ipAddress):
    responseJ = responseJson(ipAddress)
    return responseJ.get("country_code")

def getCity(ipAddress):
    responseJ = responseJson(ipAddress)
    return responseJ.get("city")

ip = "50.78.253.58"

print(getCountry(ip), getCity(ip))


US Boston


这里用的JSON 解析库是Python 标准库的一部分。只需要在代码开头写上import json，
你就可以使用它了！不同于那些需要先把JSON 解析成一种JSON 对象或JSON 节点的语
言，Python 使用了一种更加灵活的方式，把JSON 转换成字典，JSON 数组转换成列表，
JSON 字符串转换成Python 字符串。通过这种方式，就可以让JSON 的获取和操作变得非
常简单。

下面的例子演示了如何使用Python 的JSON 解析库，处理JSON 字符串中可能出现的不同
数据类型：

In [15]:
import json

jsonString = '{"arrayOfNums":[{"number":0},{"number":1},{"number":2}],"arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},{"fruit":"pear"}]}'

In [22]:
jsonObj = json.loads(jsonString)

print(jsonObj.get("arrayOfNums"))
print(jsonObj.get("arrayOfFruits"))
print(jsonObj.get("arrayOfFruits")[2].get("fruit"))

[{'number': 0}, {'number': 1}, {'number': 2}]
[{'fruit': 'apple'}, {'fruit': 'banana'}, {'fruit': 'pear'}]
pear


### 3.如何把API 和网络数据采集结合起来：看看维基百科的贡献者们大都在哪里。

首先做一个采集维基百科的基本程序，寻找编辑历史页面，然后把编辑历史里面的IP 地址
找出来，这并不难。只要对第3 章的代码做些修改就可以，代码如下所示：

In [27]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html,'html.parser')
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",
                                                           href=re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    # 编辑历史页面URL链接格式是：
    # http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
    print("history url is: "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    
    # 找出class属性是"mw-anonuserlink"的链接
    # 它们用IP地址代替用户名
    ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
        return addressList
    
links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0):
    for link in links:
        print("-------------------")
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            print(historyIP)
    
    newLink = links[random.randint(0, len(links)-1)].attrs["href"]
    links = getLinks(newLink)

-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Programming_paradigm&action=history
68.151.180.83
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Object-oriented_programming&action=history
103.74.23.139
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Imperative_programming&action=history
85.133.27.110
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Functional_programming&action=history
93.104.182.39
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Procedural_programming&action=history
51.6.173.174
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Reflective_programming&action=history
212.96.25.37
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Software_design&action=history
91.180.76.245
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Guido_van_Rossum&a

KeyboardInterrupt: 

这个程序包括两个函数：getLinks（第3 章里用过）和新的函数getHistoryIPs，搜索所有mw-anonuserlin 类里面的链接信息（匿名用户的IP 地址，不是用户名），返回一个链接列表。

#### Python 的集合类型简介

```
对于未来可能需要扩展的代码，在决定使用集合还是列表时，有两件事情需要考虑：虽然列表迭代速度
比集合稍微快一点儿，但集合查找速度更快（确定一个对象是否在集合中），因为Python 集合就是值
为None 的词典，用的是哈希表结构，查询速度为O(1)。
```

上面的代码还用了一些随机的（不过对这个示例是有效的）搜索模式来查找词条的编
辑历史。首先获取起始词条连接的所有词条的编辑历史（示例中是Python programming
language 词条）。然后，随机选择一个词条作为起始点，再获取这个页面连接的所有词条的
编辑历史。重复这个过程直到页面没有连接维基词条为止。

现在，我们获得了编辑历史的IP 地址数据，把它们与上一节的getCountry 函数结合起来，
就可以查询IP 地址所属的国家和地区了。我对getCountry 函数做了一点儿修改，处理了
无效或错误的IP 地址引起的“404 Not Found”异常（比如，写到这里时，freegeoip.net 不
能查询IPv6 地址，可能会引起404 错误）：

In [30]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html,'html.parser')
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",
                                                           href=re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    # 编辑历史页面URL链接格式是：
    # http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
    print("history url is: "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    
    # 找出class属性是"mw-anonuserlink"的链接
    # 它们用IP地址代替用户名
    ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
        return addressList
    

def getCountry(ipAddress):
    try:
        response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    except HTTPError:
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")
        
    
links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0):
    for link in links:
        print("-------------------")
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP + "is from" + country)
                
newLink = links[random.randint(0, len(links)-1)].attrs["href"]
links = getLinks(newLink)                

-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Programming_paradigm&action=history
68.151.180.83is fromCA
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Object-oriented_programming&action=history
103.74.23.139is fromPK
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Imperative_programming&action=history
85.133.27.110is fromGB
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Functional_programming&action=history
93.104.182.39is fromDE
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Procedural_programming&action=history
51.6.173.174is fromGB
-------------------
history url is: http://en.wikipedia.org/w/index.php?title=Reflective_programming&action=history


NameError: name 'HTTPError' is not defined

完整代码在http://www.pythonscraping.com/code/6-3.txt。 下面是部分输出结果：

### 4. CrawlingModels

In [31]:
import requests
from bs4 import BeautifulSoup

In [32]:
def getPage(url):
    """
    Utilty function used to get a Beautiful Soup object from a given URL
    """

    session = requests.Session()
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
               "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
    try:
        req = session.get(url, headers=headers)
    except requests.exceptions.RequestException:
        return None
    bs = BeautifulSoup(req.text, "html.parser")
    return bs

### Dealing with different website layouts

In [33]:
import requests


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')


def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find("h1").text
    lines = bs.find_all("p", {"class": "story-content"})
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)


def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find("h1").text
    body = bs.find("div", {"class", "post-body"}).text
    return Content(url, title, body)


url = 'https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'
content = scrapeBrookings(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

Title: Delivering inclusive urban access: 3 uncomfortable truths
URL: https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/


The past few decades have been filled with a deep optimism about the role of cities and suburbs across the world. These engines of economic growth host a majority of world population, are major drivers of economic innovation, and have created pathways to opportunities for untold amounts of people.	
Authors






Jeffrey Gutman
Senior Fellow - Global Economy and Development







Adie Tomer
Fellow - Metropolitan Policy Program

 Twitter
AdieTomer






But all is not well within our so-called Urban Century. Rapid urbanization, rising gentrification, concentrated poverty, and shortages of basic infrastructure have combined to create spatial inequity in cities and suburbs across the globe. The challenges of housing, moving, and employing so many people have led to longer travel times, rising housing 

ConnectionError: HTTPSConnectionPool(host='www.nytimes.com', port=443): Max retries exceeded with url: /2018/01/25/opinion/sunday/silicon-valley-immortality.html (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000005A41C88>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。',))

In [34]:
class Content:
    """
    Common base class for all articles/pages
    """

    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        """
        Flexible printing function controls output
        """
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))


class Website:
    """ 
    Contains information about website structure
    """

    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

In [35]:
import requests
from bs4 import BeautifulSoup


class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utilty function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

In [36]:
import requests
from bs4 import BeautifulSoup


class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        """
        Utilty function used to get a content string from a Beautiful Soup
        object and a selector. Returns an empty string if no object
        is found for the given selector
        """
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

In [37]:
crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu', 'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com', 'h1', 'p.story-content']
]
websites = []
for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))

crawler.parse(websites[0], 'http://shop.oreilly.com/product/0636920028154.do')
crawler.parse(
    websites[1], 'http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(
    websites[2],
    'https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(
    websites[3], 
    'https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html')

URL: http://shop.oreilly.com/product/0636920028154.do
TITLE: Learning Python, 5th Edition 
BODY:

Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages. 

Complete with quizzes, exercises, and helpful illustrations,  this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3— the latest releases in the 3.X  and 2.X lines—plus all other releases in common use today. You’ll also learn some advanced language features that recently have become more common in Python code.

Explore Python’s major built-in object types such as numbers, lists, and dictionaries 
Create and process objects with Python statements, and learn Python’s general syntax model
Use functions

### Crawling through sites with search

In [38]:
class Content:
    """Common base class for all articles/pages"""

    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.title = title
        self.body = body
        self.url = url

    def print(self):
        """
        Flexible printing function controls output
        """
        print("New article found for topic: {}".format(self.topic))
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))

In [39]:
class Website:
    """Contains information about website structure"""

    def __init__(self, name, url, searchUrl, resultListing, resultUrl, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl
        self.resultListing = resultListing
        self.resultUrl = resultUrl
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag

In [40]:
import requests
from bs4 import BeautifulSoup


class Crawler:

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        childObj = pageObj.select(selector)
        if childObj is not None and len(childObj) > 0:
            return childObj[0].get_text()
        return ""

    def search(self, topic, site):
        """
        Searches a given website for a given topic and records all pages found
        """
        bs = self.getPage(site.searchUrl + topic)
        searchResults = bs.select(site.resultListing)
        for result in searchResults:
            url = result.select(site.resultUrl)[0].attrs["href"]
            # Check to see whether it's a relative or an absolute URL
            if(site.absoluteUrl):
                bs = self.getPage(url)
            else:
                bs = self.getPage(site.url + url)
            if bs is None:
                print("Something was wrong with that page or URL. Skipping!")
                return
            title = self.safeGet(bs, site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(topic, title, body, url)
                content.print()


crawler = Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com', 'https://ssearch.oreilly.com/?q=',
        'article.product-result', 'p.title a', True, 'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'http://www.reuters.com/search/news?blob=', 'div.search-result-content',
        'h3.search-result-title a', False, 'h1', 'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu', 'https://www.brookings.edu/search/?s=',
        'div.list-content article', 'h4.title a', True, 'h1', 'div.post-body']
]
sites = []
for row in siteData:
    sites.append(Website(row[0], row[1], row[2],
                         row[3], row[4], row[5], row[6], row[7]))

topics = ['python', 'data science']
for topic in topics:
    print("GETTING INFO ABOUT: " + topic)
    for targetSite in sites:
        crawler.search(topic, targetSite)

GETTING INFO ABOUT: python
New article found for topic: python
URL: Learning Python, 5th Edition 
TITLE: 
Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages. 

Complete with quizzes, exercises, and helpful illustrations,  this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3— the latest releases in the 3.X  and 2.X lines—plus all other releases in common use today. You’ll also learn some advanced language features that recently have become more common in Python code.

Explore Python’s major built-in object types such as numbers, lists, and dictionaries 
Create and process objects with Python statements, and learn Python’s general syntax model
Use f

New article found for topic: python
URL: Machine Learning with Python Cookbook
TITLE: 
This practical guide provides nearly 200 self-contained recipes to help you solve machine learning challenges you may encounter in your daily work. If you’re comfortable with Python and its libraries, including pandas and scikit-learn, you’ll be able to address specific problems such as loading data, handling text or numerical data, model selection, and dimensionality reduction and many other topics.

Each recipe includes code that you can copy and paste into a toy dataset to ensure that it actually works. From there, you can insert, combine, or adapt the code to help construct your application. Recipes also include a discussion that explains the solution and provides meaningful context. This cookbook takes you beyond theory and concepts by providing the nuts and bolts you need to construct working machine learning applications. 

You’ll find recipes for:

Vectors, matrices, and arrays
Handling numer

AttributeError: 'NoneType' object has no attribute 'select'

### Crawling Sites through Links

In [41]:
class Website:

    def __init__(self, name, url, targetPattern, absoluteUrl, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.targetPattern = targetPattern
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.bodyTag = bodyTag


class Content:

    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))

In [44]:
import re


class Crawler:
    def __init__(self, site):
        self.site = site
        self.visited = []

    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, 'html.parser')

    def safeGet(self, pageObj, selector):
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    def parse(self, url):
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs, self.site.titleTag)
            body = self.safeGet(bs, self.site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()

    def crawl(self):
        """
        Get pages from website home page
        """
        bs = self.getPage(self.site.url)
        targetPages = bs.findAll('a', href=re.compile(self.site.targetPattern))
        for targetPage in targetPages:
            targetPage = targetPage.attrs['href']
            if targetPage not in self.visited:
                self.visited.append(targetPage)
                if not self.site.absoluteUrl:
                    targetPage = '{}{}'.format(self.site.url, targetPage)
                self.parse(targetPage)


reuters = Website('Reuters', 'https://www.reuters.com', '^(/article/)',
                  False, 'h1', 'div.StandardArticleBody_body_1gnLA')
crawler = Crawler(reuters)
crawler.crawl()

TypeError: __init__() takes 5 positional arguments but 7 were given

### Crawling multiple page types¶

In [43]:
class Website:
    """Common base class for all articles/pages"""

    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag
        

In [45]:
class Product(Website):
    """Contains information for scraping a product page"""

    def __init__(self, name, url, titleTag, productNumber, price):
        Website.__init__(self, name, url, TitleTag)
        self.productNumberTag = productNumberTag
        self.priceTag = priceTag


class Article(Website):
    """Contains information for scraping an article page"""

    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        Website.__init__(self, name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag

### 再说一点API

本章我们介绍了几个新式API 常用的获取网络数据的方式，重点介绍了有助于网络数据采
集工作的API 用法。但是，对API 的这点儿介绍还是远远不够的，API 的内容非常丰富，
这里并没有体现出API 具有“许多不同的软件都可以通过相同的API 分享数据”的特点。

由于本书的主题是网络数据采集，因此无意成为数据收集的百科全书，如果你需要，我只
能为你推荐一些优质的资源，帮助你对这个主题进行深入的研究。

虽然初看网络数据采集和网络API 好像完全是两个不同的主题，但是希望这一章的内容
可以为你呈现出两者在网络数据收集这个领域中相互补充的能力。从某种意义上看，网络
API 的使用可以作为网络数据采集的一个子集。毕竟，最终都是要从网络服务器收集数据，
然后把它们解析成可用的数据格式，这和你用任何网络爬虫做的事情一模一样。