# Fetching WWW resources with urllib

Goal: learn to write a simple crawler with urllib that fetches web page resources.


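1. Use urlopen() for the simplest kind of URL access and fetch the content of the target page.

2. Use urlretrieve() to save the page content to a temporary file and get the response headers.

3. Customize the Request object so the crawler looks more like a browser.
    By default, urllib sends the following request headers (as captured with Wireshark):
        GET / HTTP/1.1
        Accept-Encoding: identity
        Host: 10.10.10.135
        User-Agent: Python-urllib/3.6
        Connection: close
4. Use the GET method to send a search query to the Baidu server and read the results.

5. Use the POST method to submit a username and password to http://10.10.10.135/dvwa/login.php

6. Handle exceptions with urllib.error. The two commonly used exception classes are urllib.error.URLError and urllib.error.HTTPError.

7. Get to know the response methods info() and geturl().

8. A more complete case: download multiple pages of a topic from Baidu Tieba.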

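9. Pass HTTP basic authentication
    A prompt for a username and password before a page loads is usually an HTTP authentication challenge.
    HTTP authentication protects the resources within a scope (called a realm) from unauthorized access.
    The HTTP specification defines two authentication schemes: basic auth and digest auth.
    The basic flow is: 1. the client requests a page; 2. the server returns a 401 error, asking for authentication;
    3. the client resubmits the request with credentials attached, in encoded form;
    4. the server checks the credentials and serves the page if they are valid; otherwise it returns 401 again.

    When the server first returns 401, the response headers include a line such as:
    WWW-Authenticate: Basic realm="cPanel"
    Assuming the username and password are known, the client encodes "username:password" with base64
    and sends the result in the Authorization header, which lets the authentication pass.
    The realm is not part of the encoded value; it only tells the client which credentials apply.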

10. Use a proxy
11. Set a timeout
12. Send a request to the server with the HEAD method

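13. Load AJAX data with urllib
   AJAX = Asynchronous JavaScript and XML.
   The main advantage of AJAX is that a page can exchange data with the server and update part of its content without reloading the whole page.
   AJAX needs no browser plugin, but JavaScript must be allowed to run in the browser.

   Take https://movie.douban.com/tag/#/ as an example.
   Inspect the page with a packet capture tool first: testing shows that each click on "更多" (More) triggers one additional response from
   https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=40
   Opening that URL directly in a browser shows the newly loaded movie entries in JSON format.
   Once this URL has been found, it can be fetched directly.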
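14. Use the cookie of a logged-in session so that urllib accesses pages as the logged-in user.
    First log in with a browser and copy the cookie issued after login; this cookie is usually very long.
    We use http://www.renren.com/968196747/profile, a page only available after login, as the example. On success the page content is returned.
    Put the cookie value into the headers as a string.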
    
    
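15. Access HTTPS sites
    An HTTPS server presents a certificate that the client verifies. A certificate is a digital credential used to encrypt the connection and authenticate identity, usually issued by a trusted certificate authority (CA).
    Try accessing http://www.baidu.com and https://www.baidu.com and compare the results.
    Baidu redirects when accessed over HTTPS.
    Then compare 'http://www.12306.cn/mormhweb/' with https://www.12306.cn/mormhweb/
    Accessing https://www.12306.cn/mormhweb/ raises an error: CertificateError: hostname 'www.12306.cn' doesn't match either of 'webssl.chinanetcenter.com'
    The ssl library is used to deal with this.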

In [None]:
"""Use urlopen() for the simplest kind of URL access
"""
import urllib.request

url = 'http://10.10.10.135/'
with urllib.request.urlopen(url) as response:
    print(response.status)

In [None]:
"""Use urlretrieve() to save the page content to a temporary file and get the response headers
"""
import urllib.request

url = 'http://10.10.10.135/'
localfile, headers = urllib.request.urlretrieve(url)
print(localfile)
print()
print(headers)

    

In [None]:
"""Customize the Request object so the crawler looks more like a browser
"""
import urllib.request

url = 'http://10.10.10.135/'
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
request = urllib.request.Request(url,headers=headers)
with urllib.request.urlopen(request) as response:
    print(response.status)


In [None]:
"""Use the GET method to send a search query to the Baidu server
"""
import urllib.request
import urllib.parse

querystr = {'wd':'北航'}
querystr_encode = urllib.parse.urlencode(querystr)
print(querystr_encode)
#https://www.baidu.com/s?wd=%E5%8C%97%E8%88%AA
url = 'http://www.baidu.com/s?' + querystr_encode
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
request = urllib.request.Request(url,headers=headers)
with urllib.request.urlopen(request) as response:
    print(response.status)
    print(response.headers)
    html = response.read()
    print(html.decode('utf-8'))
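
The query string construction used above can be checked without touching the network. The following is a minimal sketch, assuming a hypothetical helper `build_search_url` (it is not part of urllib). It shows how `urllib.parse.urlencode` percent-encodes non-ASCII text as UTF-8 and how `parse_qs` reverses the encoding.

```python
import urllib.parse

def build_search_url(keyword, base='http://www.baidu.com/s'):
    """Hypothetical helper: return base + '?wd=<percent-encoded keyword>'."""
    return base + '?' + urllib.parse.urlencode({'wd': keyword})

url = build_search_url('北航')
print(url)

# Round trip: split the URL and decode the query string again
query = urllib.parse.urlsplit(url).query
print(urllib.parse.parse_qs(query))
```

Building the URL in a separate function makes it easy to verify the encoding before any request is sent.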


In [None]:
"""Use the POST method to submit a username and password to http://10.10.10.135/dvwa/login.php
"""
import urllib.request
import urllib.parse

url = 'http://10.10.10.135/dvwa/login.php'
# Only the name=value pair belongs in a Cookie request header; attributes such as path do not
cookie = 'PHPSESSID=898c1rsum58475qh3nros002n7'
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
         'Cookie':cookie,
          }
authstr = {'username':'admin',
           'password':'admin',
           'Login':'Login',}
data = urllib.parse.urlencode(authstr).encode('utf-8')

request = urllib.request.Request(url,data=data,headers=headers)
with urllib.request.urlopen(request) as response:
    print(response.status)
    print(response.headers)
    # Set-Cookie may be absent; keep only its name=value pair, not attributes such as path
    set_cookie = response.headers.get('Set-Cookie', '')
    cookie1 = set_cookie.split(';')[0].strip()

url = 'http://10.10.10.135/dvwa/index.php'
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          'Cookie':'; '.join(c for c in (cookie, cookie1) if c),
          }
print(headers)
request = urllib.request.Request(url,headers=headers)
with urllib.request.urlopen(request) as response:
    print(response.read())
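
A `Set-Cookie` response header carries attributes (such as `path`) that must not be echoed back in the `Cookie` request header. The sketch below, assuming a hypothetical helper `cookie_header_from_set_cookie` and a made-up session value, uses the standard library's `http.cookies.SimpleCookie` to keep only the name=value pairs.

```python
from http.cookies import SimpleCookie

def cookie_header_from_set_cookie(set_cookie):
    """Hypothetical helper: keep only name=value pairs from a Set-Cookie value."""
    jar = SimpleCookie()
    jar.load(set_cookie)
    return '; '.join('%s=%s' % (name, morsel.value) for name, morsel in jar.items())

# 'abc123' is a made-up session id used only for illustration
print(cookie_header_from_set_cookie('PHPSESSID=abc123; path=/'))
```

For longer sessions, `http.cookiejar.CookieJar` combined with `urllib.request.HTTPCookieProcessor` can store and resend cookies automatically, which avoids building the header by hand.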

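In [None]:
"""Handle exceptions with urllib.error
URLError inherits from OSError and is the base class for urllib's exceptions.
HTTPError is a subclass of URLError, raised when the server answers with an HTTP error status.

An HTTP protocol error is still a valid response, with a status code, headers, and a body.

A mature program has to handle every outcome: not only the expected output but also unexpected exceptions.
For logging, see https://docs.python.org/3.5/howto/logging.html
"""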

import urllib.request
import urllib.error
import urllib.parse
import logging

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    # Raw string, so every backslash in the Windows path is taken literally
                    filename=r'C:\Users\leo\Documents\crawlerslesson1_crawler.log',
                    level=logging.DEBUG)
try: 
    #url = 'http://www.baidu11.com'
    url = 'http://10.10.10.135/WebGoat/attack'
    headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
    request = urllib.request.Request(url,headers=headers)
    with urllib.request.urlopen(request) as response:
        print(response.status)
        print(response.read().decode('utf-8'))

# HTTPError is a subclass of URLError, so it has to be caught first
except urllib.error.HTTPError as e:
    import http.server
    #print(http.server.BaseHTTPRequestHandler.responses[e.code])
    logging.error('HTTPError code: %s and Messages: %s'% (str(e.code),http.server.BaseHTTPRequestHandler.responses[e.code]))
    logging.info('HTTPError headers: ' + str(e.headers))
    logging.info(e.read().decode('utf-8'))
    print('Sorry, the server hit a snag. Please try again later.')
except urllib.error.URLError as e:
    logging.error(e.reason)
    print('Sorry, the server hit a snag. Please try again later.')
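
The relationship between the two exception classes can be explored without a server by constructing the exceptions by hand. This is a minimal sketch, assuming a hypothetical helper `describe_error` and a placeholder URL. It tests for `HTTPError` first because that class is a subclass of `URLError`.

```python
import http
import urllib.error

def describe_error(exc):
    """Hypothetical helper: turn a urllib exception into a short log message."""
    # HTTPError is a subclass of URLError, so it has to be tested first
    if isinstance(exc, urllib.error.HTTPError):
        try:
            phrase = http.HTTPStatus(exc.code).phrase
        except ValueError:
            phrase = 'Unknown status'
        return 'HTTP %d %s' % (exc.code, phrase)
    if isinstance(exc, urllib.error.URLError):
        return 'URL error: %s' % exc.reason
    return 'Unexpected error: %r' % exc

# Exceptions built by hand, so no network access is needed
print(describe_error(urllib.error.HTTPError('http://example.invalid/', 404, 'Not Found', None, None)))
print(describe_error(urllib.error.URLError('connection refused')))
```

Swapping the two `isinstance` checks would route every HTTP error into the `URLError` branch, which is the same reason the `except` clauses above are ordered the way they are.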

In [None]:
"""Use the response's geturl() and info() methods to check that the fetched page matches the request
geturl - this returns the real URL of the page fetched. 
This is useful because urlopen (or the opener object used) may have followed a redirect. 
The URL of the page fetched may not be the same as the URL requested.

info - this returns a dictionary-like object that describes the page fetched,
particularly the headers sent by the server.
It is currently an http.client.HTTPMessage instance.
"""
import urllib.request
import urllib.parse

url = 'http://www.baidu.com'
with urllib.request.urlopen(url) as response:
    #print(type(response))
    print(response.info())
    print(response.geturl())

In [None]:
"""A crawler case: fetch pages of a Baidu Tieba forum chosen by the user's keyword.

"""
import urllib.request
import urllib.parse

def loadPage(url):
    """
        Function: Fetch the url and return the webpage content as bytes.
        url: the wanted webpage url.
    """
    headers = {'Accept': 'text/html','User-Agent':'Mozilla/5.0',}
    print('Sending http request to %s' % url)
    request = urllib.request.Request(url,headers=headers)

    # Close the connection as soon as the body has been read
    with urllib.request.urlopen(request) as response:
        return response.read()

def writePage(html,filename):
    """
        Function: Write the content of html into a local file.
        html: the response content (bytes).
        filename: the local filename used to store the response.
    """
    print('Writing html into local file %s ...' % filename)
    # Binary mode stores the raw bytes; str(html) would write the b'...' repr instead
    with open(filename,'wb') as f:
        f.write(html)
    print('Work done.')
    print('-'*10)

def tiebaCrawler(url,beginPage,endPage,keyword):
    """
        Function: The scheduler of the tieba crawler; it visits every wanted url in turn.
        url: the url of baidu's tieba webpage
        beginPage: initial page
        endPage: end page
        keyword: the wanted keyword 
    """
    for page in range(beginPage,endPage+1):
        pn = (page - 1) * 50
        queryurl = url + '&pn=' + str(pn)
        #print(queryurl)
        # One file per page, so later pages do not overwrite earlier ones
        filename = keyword + '_tieba_' + str(page) + '.html'
        writePage(loadPage(queryurl),filename)
        
if __name__ == '__main__':
    kw = input('Please input the wanted tieba\'s name:' )
    beginPage = int(input('The beginning page number:'))
    endPage = int(input('The ending page number:'))
    url = 'http://tieba.baidu.com/f?'
    key = urllib.parse.urlencode({'kw':kw})
    queryurl = url+ key
    # Pass the url that carries the keyword, not the bare base url
    tiebaCrawler(queryurl,beginPage,endPage,kw)
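
The paging rule used by the crawler above (50 posts per page, so page n starts at offset (n - 1) * 50) can be sketched as a pure function. The helper name `tieba_page_urls` and the keyword 'python' are assumptions made for illustration only.

```python
import urllib.parse

def tieba_page_urls(keyword, begin_page, end_page, page_size=50):
    """Hypothetical helper: list the URLs for pages begin_page..end_page (inclusive)."""
    base = 'http://tieba.baidu.com/f?' + urllib.parse.urlencode({'kw': keyword})
    return [base + '&pn=' + str((page - 1) * page_size)
            for page in range(begin_page, end_page + 1)]

for u in tieba_page_urls('python', 1, 3):
    print(u)
```

Generating the list up front makes it possible to inspect the URLs before any page is downloaded.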

In [None]:
"""A case of passing basic authentication with urllib


"""

import urllib.request
import urllib.error
import urllib.parse
import logging

url = 'http://10.10.10.135/WebGoat/attack'
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
request = urllib.request.Request(url,headers=headers)

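def passBasicAuth(challenge):
    """Retry the request with an Authorization header built for a Basic challenge.

    challenge: value of the WWW-Authenticate header, e.g. 'Basic realm="cPanel"'.
    """
    import base64
    username = 'webgoat'
    password = 'webgoat'
    # The first token of the challenge is the authentication scheme
    scheme = challenge.split(None, 1)[0]
    if scheme.lower() != 'basic':
        raise ValueError('The authentication scheme is not Basic: %s' % scheme)

    # Basic credentials are base64("username:password"); the realm is not encoded
    credentials = username + ':' + password
    base64str = base64.b64encode(credentials.encode('utf-8'))
    authheader = 'Basic %s' % base64str.decode('utf-8')

    request.add_header('Authorization',authheader)
    print(request.headers)
    with urllib.request.urlopen(request) as response:
        print(response.status)
        print(response.read().decode('utf-8'))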
    
     
    
try:
    #url = 'http://www.baidu11.com'
    
    with urllib.request.urlopen(request) as response:
        print(response.status)
        print(response.info())
        #print(response.read().decode('utf-8'))
except urllib.error.URLError as e:
    if  hasattr(e,'code'):
        
        print(e.code)
        print(e.info())
        if e.code == 401:
            passBasicAuth(e.headers['WWW-Authenticate'])            
    elif hasattr(e,'reason'):
        print(e.reason)
    else:
        print('unknown error.')
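
The two offline parts of the Basic authentication exchange, reading the challenge and building the `Authorization` value, can be sketched as follows. The helper names `parse_challenge` and `basic_auth_header` are hypothetical, and the challenge parser is deliberately simple: it assumes the realm contains no commas.

```python
import base64

def parse_challenge(header_value):
    """Hypothetical helper: split 'Basic realm="cPanel"' into ('basic', 'cPanel')."""
    scheme, _, params = header_value.partition(' ')
    realm = ''
    for part in params.split(','):
        key, _, value = part.strip().partition('=')
        if key.lower() == 'realm':
            realm = value.strip('"')
    return scheme.lower(), realm

def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode(('%s:%s' % (username, password)).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')

print(parse_challenge('Basic realm="cPanel"'))
print(basic_auth_header('webgoat', 'webgoat'))
```

Base64 is an encoding, not encryption, so Basic credentials are only protected when the connection itself uses HTTPS.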
        


In [None]:
"""A case of passing basic authentication with urllib

Worked!

"""
import urllib.request
import urllib.parse
import urllib.error

url = 'http://10.10.10.135/WebGoat/attack'
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
request = urllib.request.Request(url,headers=headers)
username = 'webgoat'
password = 'webgoat'

passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# this creates a password manager
passman.add_password(None, url, username, password)
# because we have put None at the start it will always
# use this username/password combination for urls
# for which `url` is a super-url

authhandler = urllib.request.HTTPBasicAuthHandler(passman)
# create the AuthHandler

opener = urllib.request.build_opener(authhandler)

urllib.request.install_opener(opener)
# All calls to urllib.request.urlopen will now use our handler.
# The URL given to add_password may include the scheme, as it does here.

with urllib.request.urlopen(request) as response:
    print(response.status)
    print(response.read().decode('utf-8'))
# authentication is now handled automatically for us

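In [None]:
"""Use a proxy"""
import urllib.request
# Keys are URL schemes such as 'http' or 'https'. urllib has no built-in SOCKS support,
# so this assumes an HTTP proxy is listening on localhost:1080.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})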
 
opener = urllib.request.build_opener(proxy_support)
 
urllib.request.install_opener(opener)
 
a = urllib.request.urlopen("http://www.python.org/").read().decode("utf8")
print(a)

In [None]:
"""Set a timeout"""
import socket
import urllib.request
# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://www.python.org/')
a = urllib.request.urlopen(req).read()
print(a)
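
Item 12 of the overview mentions the HEAD method. A minimal sketch, assuming a hypothetical helper `head`: passing `method='HEAD'` to `urllib.request.Request` asks the server for the status line and headers only. The helper is defined but not called here, so running the block needs no network access.

```python
import urllib.request

# The URL is the one used in the timeout cell above
req = urllib.request.Request('http://www.python.org/', method='HEAD')
print(req.get_method())

def head(url, timeout=2):
    """Hypothetical helper: send a HEAD request and return (status, headers)."""
    request = urllib.request.Request(url, method='HEAD')
    # timeout applies to this call only, unlike socket.setdefaulttimeout()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # A HEAD response has no body, so only status and headers are useful
        return response.status, dict(response.headers.items())
```

A HEAD request is a cheap way to check whether a resource exists or how large it is before downloading it.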

In [None]:
"""Fetch an AJAX endpoint"""
import urllib.request
import urllib.parse


url = 'https://movie.douban.com/j/new_search_subjects?'
movietype = '动作'  # the "action" genre tag
params = {'sort':'U','range':'0,10','tags':movietype,'start':'1'}
params_encode = urllib.parse.urlencode(params)
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }

# The endpoint is queried with GET, so the parameters go in the query string;
# passing them as data= would turn the request into a POST.
request = urllib.request.Request(url + params_encode,headers=headers)
with urllib.request.urlopen(request) as response:
    print(response.read().decode('utf-8'))
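
Since the endpoint returns JSON, the decoded text can be handed to the `json` module instead of being printed raw. The sketch below uses a made-up sample payload. The `data`, `title`, and `rate` keys are assumptions about the response shape and should be checked against a real response.

```python
import json

# Hypothetical sample payload; the real response may have a different shape
sample = '{"data": [{"title": "Movie A", "rate": "8.1"}, {"title": "Movie B", "rate": "7.4"}]}'

def extract_titles(json_text):
    """Hypothetical helper: return (title, rate) pairs from a JSON response body."""
    payload = json.loads(json_text)
    return [(item.get('title'), item.get('rate')) for item in payload.get('data', [])]

print(extract_titles(sample))
```

Using `.get()` keeps the helper from raising `KeyError` when an entry lacks one of the assumed fields.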
   

In [None]:
"""Use the cookie of a logged-in session so that urllib accesses the page as the logged-in user"""
import urllib.request
import urllib.parse


url = 'http://www.renren.com/968196747/profile'

headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
           
          }
'''
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
           'Cookie':'anonymid=jmn5bx3v-5le0e; depovince=GW; _r01_=1; ick_login=c9a5ab7b-7432-4e79-8d8d-d9e3f89ab72c; ick=a0121b31-6fe8-4eef-bf25-4278b6f19743; XNESSESSIONID=d51be5b82a20; WebOnLineNotice_968196747=1; JSESSIONID=abcISInMXWcrBd2AlxKyw; jebe_key=ed233682-75b8-496c-a199-557f631f200e%7Cc377fecba1c1e1def24233f88b207b06%7C1538208356017%7C1%7C1538208354004; wp_fold=0; jebecookies=dea9e704-78ec-4756-90d7-1a93dae3be67|||||; _de=D6104CF9DBA07C121FF9E00605E6865D; p=a9d3694434120963a494d947ab2906477; first_login_flag=1; ln_uact=13141055789; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; t=3395ed44534775ea30e64655670389627; societyguester=3395ed44534775ea30e64655670389627; id=968196747; xnsid=dba2a4e4; loginfrom=syshome',
         }
'''
request = urllib.request.Request(url)#,headers=headers)
with urllib.request.urlopen(request) as response:
    print(response.read().decode('utf-8'))

In [None]:
"""Access HTTPS sites"""

import urllib.request
import ssl
# Try accessing http://www.baidu.com and https://www.baidu.com and compare the results
# Then compare 'http://www.12306.cn/mormhweb/' with https://www.12306.cn/mormhweb/

url = 'https://www.12306.cn/mormhweb/'

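# Certificate handling goes through the ssl library; see the urllib source for how this function is used.
# _create_unverified_context() is a private helper that disables certificate verification; use it for testing only.
context = ssl._create_unverified_context()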


#request = urllib.request.Request(url,unverifiable=True)
request = urllib.request.Request(url)
# Try each of the following blocks in turn and compare the results.
with urllib.request.urlopen(request,context=context) as response:
    print(response.read().decode('utf-8'))
'''        
with urllib.request.urlopen(request) as response:
    print(response.read().decode('utf-8'))
'''
'''
with urllib.request.urlopen(url,cafile=cafile) as response:
    print(response.read().decode('utf-8'))
'''
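
The same choice between verified and unverified connections can be made through the public `ssl` API. This is a minimal sketch, assuming a hypothetical helper `make_context`. It builds the context only and opens no connection.

```python
import ssl

def make_context(verify=True):
    """Hypothetical helper: build an SSLContext with or without certificate checks."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname must be switched off before verify_mode can be CERT_NONE
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context

print(make_context(True).verify_mode)
print(make_context(False).verify_mode)
```

The returned object can be passed as `context=` to `urllib.request.urlopen`. Disabling verification removes the protection against impersonated servers, so it is best limited to experiments like the one above.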

In [None]:
?urllib.request.urlopen