# 相关技术简介
本文都以Python3为基础，Python2的相关使用语法将不做赘述。

## urllib
urllib是Python中的一个标准库，用于操作URL。（官方文档详见：https://docs.python.org/3.5/library/urllib.html ）
</p>其中包含主要几个模块
1. urllib.request，用于打开和读取URL
2. urllib.error，包含从urllib.request中抛出的异常
3. urllib.parse，用于解析URL
4. urllib.robotparser，用于解析robots.txt文件

### Import
import urllib.request
### 打开一个链接
response = urllib.request.urlopen(url)



# 测试数据准备
本文用于测试的包含如下网站</p>
1. 新浪财经
2. 东方财富

In [1]:
import requests
import re

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.parse import urlparse

In [2]:
url = "http://roll.finance.sina.com.cn/finance/zq1/scyj/index_1.shtml"

# 关于requests返回乱码的问题
requests访问中文网站有时会返回乱码，此时需要使用encoding对返回内容进行编码。
<code>
res = requests.get(url)
res.encoding = res.apparent_encoding
</code>

In [3]:
url = "http://roll.finance.sina.com.cn/finance/zq1/scyj/index_1.shtml"
res = requests.get(url)
res.encoding = res.apparent_encoding

res.raise_for_status
so_obj=BeautifulSoup(res.text,'lxml')
result = so_obj.find_all("a",href=re.compile("doc"))
for i in result:
    if len(i.string)<5:
        result.remove(i)

result

[<a href="http://finance.sina.com.cn/stock/marketresearch/2017-02-19/doc-ifyarrcc7994021.shtml" target="_blank">兴业证券："红包行情"后半段规避两大板块 把握三大机会</a>,
 <a href="http://finance.sina.com.cn/stock/t/2017-02-19/doc-ifyarzzv3236485.shtml" target="_blank">兴业证券点评再融资新规：抑制二级市场对壳资源等投机炒作</a>,
 <a href="http://finance.sina.com.cn/stock/marketresearch/2017-02-19/doc-ifyarrcc7979981.shtml" target="_blank">国信证券：A股仍处在反弹窗口期 莫负春日好时光</a>,
 <a href="http://finance.sina.com.cn/stock/marketresearch/2017-02-19/doc-ifyarrcc7979279.shtml" target="_blank">中信评再融资新规影响：新规挤出资金利好二级市场</a>,
 <a href="http://finance.sina.com.cn/stock/marketresearch/2017-02-19/doc-ifyarrcc7977128.shtml" target="_blank">广发评再融资新政：A股多了一个短期利好 少了一个牛市因子</a>,
 <a href="http://finance.sina.com.cn/stock/marketresearch/2017-02-19/doc-ifyarrcc7975190.shtml" target="_blank">下周股市投资日历：或仅有7只新股申购 关注8大主题机会</a>,
 <a href="http://finance.sina.com.cn/stock/marketresearch/2017-02-18/doc-ifyarrcc7898401.shtml" target="_blank">华泰证券：再融资新规强化“买入实,卖脱虚”</a>,
 <a href="h

# 如何对url字符串进行操作
以下代码示范了如何替换url中特定内容

In [4]:
url_target = 'http://roll.finance.sina.com.cn/finance/zq1/scyj/index_1.shtml'
url_next = './index_2.shtml'

p1 = url_next.index('./')

url_next[p1+2:len(url_next)]
path = urlparse(url_target).path
url_target[url_target.rfind('/')+1:len(url_target)]
url_target.replace(url_target[url_target.rfind('/')+1:len(url_target)],url_next[2:len(url_next)])



'http://roll.finance.sina.com.cn/finance/zq1/scyj/index_2.shtml'

# 如何识别不同类型的内容来获取模糊的可能是文章的url

在抓取新浪财经新闻标题的时候发现从某一时期开始，其新闻标题的命名方式发生了变化<p>
最新的新闻标题链接为如下形式：<p>
http://finance.sina.com.cn/roll/2017-02-13/doc-ifyamkra7109699.shtml<p>
而从2015年12月1日之前的新闻则为如下形式：<p>
http://finance.sina.com.cn/stock/marketresearch/20151201/073623896421.shtml<p>

下面示范了如何处理这种可能的变化<p>
解决思路：
文章的shtml文件名称一半都比较长，且包含数字，所以采用判断网页文件名长度的办法，小于12的全部去除。

In [43]:
#url = 'http://roll.finance.sina.com.cn/finance/zq1/scyj/index_1.shtml'
# url = 'http://roll.finance.sina.com.cn/finance/zq1/scyj/index_170.shtml'
url = 'http://finance.sina.com.cn/stock/'
res = requests.get(url)
res.encoding = res.apparent_encoding

bs_obj = BeautifulSoup(res.text,'lxml')
result = bs_obj.find_all('a',{'target':'_blank','href':re.compile('shtml')})
original = len(result)

for i in result:
    gap = i['href'].rfind('.shtml')-i['href'].rfind('/')-1
    filename = i['href'][i['href'].rfind('/')+1:i['href'].rfind('.shtml')]
    text = i.string
    if i.string is None:
        i.string = ''
    
    len_text = len(i.string)
        
    if len_text<5 or gap<12:
        print('Remove: %s(%s),%s, %s' %(filename, gap, i.string,i['href']))
        result.remove(i)
    else:
        print('Keep: %s(%s), %s, %s' %(filename, gap, i.string,i['href']))
        

print('totaly retrieved %s links, keeped %s.' %(original,len(result)))

Remove: comfinanceweb(13),, http://finance.sina.com.cn/mobile/comfinanceweb.shtml
Remove: ggdp(4),个股, http://finance.sina.com.cn/column/ggdp.shtml
Remove: index(5),主力, http://roll.finance.sina.com.cn/finance/zq1/zldx/index.shtml
Remove: nc(2),, http://finance.sina.com.cn/realstock/company/sz399416/nc.shtml
Remove: euro(4),欧洲股市, http://finance.sina.com.cn/money/globalindex/euro.shtml
Remove: sector(6),板块异动, http://finance.sina.com.cn/stock/usstock/sector.shtml#f_bkzf
Remove: sector(6),板块异动, http://finance.sina.com.cn/stock/usstock/sector.shtml#f_bkzf
Remove: sector(6),板块异动, http://finance.sina.com.cn/stock/usstock/sector.shtml
Remove: OIL(3),布伦特原油, http://finance.sina.com.cn/futures/quotes/OIL.shtml
Remove: index(5),机构看盘, http://roll.finance.sina.com.cn/finance/qh/jgkp__nysh/index.shtml
Remove: index(5),白银分析, http://roll.finance.sina.com.cn/finance/gjs/byfx/index.shtml
Remove: index(5),滚动新闻, http://roll.finance.sina.com.cn/finance/wh/index.shtml
Keep: doc-ifyarref5813229(19), 震荡或加剧, htt

In [38]:
type(result[0])

bs4.element.Tag

In [47]:
for i in result:
    if i.string is None or len(i.string)<5:
        result.remove(i)

for j in result:
    print('%s: %s(%s)' % (len(j.string), j.string, j['href']))

3: 能源股(http://finance.sina.com.cn/stock/usstock/sector.shtml#c76m)
12: 春季行情第一波攻势结束？(http://finance.sina.com.cn/stock/jsy/2017-02-18/doc-ifyarrcc7844675.shtml)
5: 震荡或加剧(http://finance.sina.com.cn/roll/2017-02-18/doc-ifyarref5813229.shtml)
7: 挖掘投资新主题(http://finance.sina.com.cn/roll/2017-02-18/doc-ifyarzzv3059525.shtml)
6: 周末要闻回顾(http://finance.sina.com.cn/stock/y/2017-02-19/doc-ifyarrcc7991608.shtml)
8: 短线调整空间有限(http://finance.sina.com.cn/roll/2017-02-18/doc-ifyarref5812941.shtml)
6: 酝酿两会行情(http://finance.sina.com.cn/stock/marketresearch/2017-02-18/doc-ifyarrcf4559681.shtml)
7: 演绎结构性机会(http://finance.sina.com.cn/roll/2017-02-18/doc-ifyarref5815702.shtml)
11: 信托私募及外资共振进场(http://finance.sina.com.cn/stock/marketresearch/2017-02-18/doc-ifyarrcf4559681.shtml)
10: 下周解禁规模大幅下降(http://finance.sina.com.cn/stock/s/2017-02-19/doc-ifyarrcf4731559.shtml)
16: 杨德龙:今年A股港股有补涨需求 (http://finance.sina.com.cn/stock/marketresearch/2017-02-18/doc-ifyarrcc7796051.shtml)
7: 大摩看好三只股(http://finance.sina.com.cn/stoc

## Scrapy
