**Pre-work**
  
  
- Define your goal -- what to crawl?  
  
  
- Find the best webpage that has the data -- where to crawl?  
  
  
- Analyze the structure of the webpage and locate the tags for your data.
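
As a rough illustration of the third step, the sketch below lists every tag that carries a `class` attribute, which is often a quick way to spot candidate tags for the data. The HTML fragment and the `TagLister` class are made up for this example and are not taken from any real page.

```python
from html.parser import HTMLParser

# Hypothetical fragment for illustration; a real page is much larger.
SAMPLE_HTML = """
<ul>
  <li>
    <span class="txt">
      <i class="nick" title="Alice">Alice</i>
      <i class="js-num">12.3万</i>
    </span>
  </li>
</ul>
"""


class TagLister(HTMLParser):
    """Collect (tag, class) pairs for every start tag that has a class attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'class' in attrs:
            self.found.append((tag, attrs['class']))


lister = TagLister()
lister.feed(SAMPLE_HTML)
for tag, css_class in lister.found:
    print(tag, css_class)
```

In practice the browser's developer tools do the same job interactively; the point is only to find a tag and class combination that surrounds the data.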

**Start crawling**  

- Simulate an HTTP request  
  
  
- Send the request to the server  
  
  
- Get the HTML returned from the server  
  
  
- Use regular expressions to extract the data we want
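
The first three steps can be sketched with `urllib` alone. This is a minimal sketch, not a definitive implementation: `build_request` and `fetch_html` are hypothetical helper names, and the `User-Agent` value is an assumption (some servers reject requests without a browser-like one, but whether this site does is not confirmed here). Only the request object is built and inspected below; nothing is sent until `fetch_html` is called.

```python
from urllib import request


def build_request(url, user_agent='Mozilla/5.0'):
    """Simulate a browser request by attaching a User-Agent header (assumed value)."""
    return request.Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Send the request and return the response body decoded as UTF-8 text."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode('utf-8')


# Inspect the request without touching the network.
req = build_request('https://www.huya.com/g/100043')
print(req.get_method(), req.full_url)
print(req.get_header('User-agent'))  # urllib stores header names capitalized
```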

In [2]:
import re

from urllib import request


class Spider():
    url = 'https://www.huya.com/g/100043'
    # Raw strings keep backslashes such as \s intact for the regex engine.
    root_pattern = r'<span class="txt">([\s\S]*?)</li>'
    name_pattern = r'<i class="nick" title="([\s\S]*?)">'
    number_pattern = r'<i class="js-num">([\s\S]*?)</i>'

    def __fetch_content(self):
        # The with-block closes the connection once the body has been read.
        with request.urlopen(Spider.url) as r:
            htmls = r.read()
        htmls = str(htmls, encoding='utf-8')
        return htmls

    def __analysis(self, htmls):
        root_html = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        # Strip surrounding whitespace and unwrap each single-element match list into a string.
        lamb = lambda anchor: {
            'name': anchor['name'][0].strip(),
            'number': anchor['number'][0].strip()
        }
        return map(lamb, anchors)

    def __sort(self, anchors):
        """Sort the anchors by viewer count, highest first.

        Arguments:
            anchors {list} -- dicts with 'name' and 'number' string values
        """
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        return anchors

    def __sort_seed(self, anchor):  # anchor is a single dict element of anchors
        # Match the whole decimal number, e.g. '126.1' in '126.1万'.
        r = re.findall(r'\d+\.?\d*', anchor['number'])
        number = float(r[0]) if r else 0.0
        if '万' in anchor['number']:  # '万' means ten thousand
            number *= 10000
        return number

    def __show(self, anchors):
        for rank, anchor in enumerate(anchors, start=1):
            print(str(rank) + ': ' + anchor['name'] + '     ' + anchor['number'])

    def go(self):
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors = list(self.__refine(anchors))
        anchors = self.__sort(anchors)
        self.__show(anchors)


spider = Spider()
spider.go()


1: 安德罗妮丶     126.1万
2: -Mt马特     73.6万
3: 冥界     18.5万
4: 三千丶菜鹏-53511     18.2万
5: GK-MIU酱     14.7万
6: 执念     12.9万
7: 自闭少年王富贵     10.9万
8: 尚娱-非酋的时臣     10.6万
9: 荀公子     10.4万
10: 巅峰-老菜     10.2万
11: 疼叔叔疼你     9.5万
12: 燕青zyq     9.0万
13: Ym-烟总     8.9万
14: 绫乃儿     8.8万
15: 甲第-澄海、花心-9182     8.7万
16: DT-渣鸡     8.6万
17: 巨蘑哥一身武艺     8.5万
18: 虎牙丶小帅     8.5万
19: MY-王某人     7.4万
20: 霆锋大神     7.4万
21: 椰子水     7.0万
22: 大灰灰     7.0万
23: 小米饭     6.9万
24: 小奈斯     6.9万
25: 慌拥i     6.8万
26: 亿凡-娜么     6.8万
27: MY-冰糖解说     6.8万
28: 莫小雪     6.6万
29: 守望先锋联赛     6.6万
30: 阿司Minigun     6.6万
31: 鱼鱼贝贝Lyn     6.4万
32: 暴雪游戏频道     6.3万
33: 老王     6.2万
34: Young     6.2万
35: 炉石平民法王     6.1万
36: LUCA啊     6.1万
37: 折木御秋     6.1万
38: 喜欢笨笨     6.0万
39: 龙     6.0万
40: 虎牙-梦醒     6.0万
41: SoloAsR穆镧     5.8万
42: 麦抖丶奥丁-90137     5.8万
43: 救难小福星     5.8万
44: 小老虎     5.7万
45: 娱加-精灵丶流氓海     5.7万
46: Joker-魔兽老男孩     5.7万
47: 亿凡-湖北鱼神     5.7万
48: 星点-黎猫     5.7万
49: 威廉丶RPG-人族无敌     5.7万
50: 觅心者太子哥     5.6万
51: 辣舞ing     5.6万
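
The counts in the output above use `万` (ten thousand) as a unit, so they have to be converted to plain numbers before they can be compared. Below is a minimal sketch of such a conversion, assuming every count looks like `126.1万` or a plain number such as `9500`. `parse_count` is a hypothetical standalone helper, not part of the `Spider` class.

```python
import re


def parse_count(text):
    """Convert a count such as '126.1万' or '9500' to a float."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        return 0.0  # no digits found; treat as zero
    number = float(match.group())
    if '万' in text:  # '万' means ten thousand
        number *= 10000
    return number


counts = ['8.5万', '8.9万', '9500', '10.2万']
ranked = sorted(counts, key=parse_count, reverse=True)
print(ranked)
```

Matching the decimal part matters here: if only the integer part were read, `8.5万` and `8.9万` would compare as equal.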

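**Explanation of the regular expression used:**  

`'<span class="txt">([\s\S]*?)</li>'` matches any run of characters, including newlines (`[\s\S]*`), between the boundaries `<span class="txt">` and `</li>`. The trailing `?` makes the match non-greedy, so it stops at the first `</li>` it finds. The parentheses form a capture group, so `re.findall` returns only the text between the boundaries and not the boundaries themselves.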
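
To see what the non-greedy `?` and the `[\s\S]` class each contribute, here is a small sketch that runs three variants of the pattern against a hand-written fragment. The `sample` string is made up for illustration; only the first pattern is the one used in the crawler.

```python
import re

# Hypothetical two-item fragment; the first item contains a newline.
sample = (
    '<li><span class="txt">first\nitem</li>'
    '<li><span class="txt">second item</li>'
)

# Non-greedy with [\s\S]: one match per item, newlines included.
non_greedy = re.findall(r'<span class="txt">([\s\S]*?)</li>', sample)

# Greedy: runs on to the LAST </li>, swallowing everything in between.
greedy = re.findall(r'<span class="txt">([\s\S]*)</li>', sample)

# '.' does not match a newline by default, so the first item is missed.
dot_only = re.findall(r'<span class="txt">(.*?)</li>', sample)

print(non_greedy)
print(greedy)
print(dot_only)
```

The non-greedy variant returns two matches, the greedy one returns a single oversized match, and the `.` variant finds only the second item.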

**Breakpoint debugging**

In [None]:
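Choosing the tags to anchor on
1. Prefer tags that are uniquely identifiable.
2. Prefer tags that are closest to the data you want to extract.
3. Prefer tags that have a matching closing tag.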
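
A short sketch of why these choices matter, assuming a made-up fragment shaped like the markup targeted earlier: a pattern anchored on a generic tag mixes unrelated fields together, while one anchored on a unique, closed tag next to the data returns only that field.

```python
import re

# Hypothetical fragment for illustration only.
sample = (
    '<li>'
    '<span class="txt">'
    '<i class="nick" title="Alice">Alice</i>'
    '<i class="js-num">12.3万</i>'
    '</span>'
    '</li>'
)

# Too generic: every <i> element matches, so names and counts are mixed.
generic = re.findall(r'<i[^>]*>([\s\S]*?)</i>', sample)

# Unique and adjacent to the data, with </i> as a reliable closing boundary.
specific = re.findall(r'<i class="js-num">([\s\S]*?)</i>', sample)

print(generic)
print(specific)
```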