# Distributed scraping: multiprocessing

**Speed up scraping by distributing the crawling and parsing across several processes. I'm going to scrape [my website](https://mofanpy.com/), but from a local server at "http://127.0.0.1:4000/", so that varying download speeds don't distort the timing. Because you cannot access "http://127.0.0.1:4000/", use "https://mofanpy.com/" instead.**

**We are going to scrape every page of my website and return the title and URL of each page.**

In [1]:
import multiprocessing as mp
import time
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re

base_url = "http://127.0.0.1:4000/"
# base_url = 'https://mofanpy.com/'

# Don't over-crawl the website, or you may never be able to visit it again.
# Crawling is restricted whenever the target is not the local server.
restricted_crawl = base_url != "http://127.0.0.1:4000/"

**Create a crawl function that opens a URL. Later it will be run in parallel.**

In [1]:
def crawl(url):
    response = urlopen(url)
    time.sleep(0.1)             # short delay to mimic download time
    return response.read().decode()
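
**The `crawl` function above raises an exception as soon as one request fails, which stops the whole crawl. A minimal sketch of a more forgiving variant follows. The name `safe_crawl` and the `timeout` default are assumptions made for illustration and are not part of the original notebook. The `data:` URL is only there so the example runs without a network connection.**

```python
from urllib.request import urlopen


def safe_crawl(url, timeout=5):
    """Return the decoded page body, or None if the request fails.

    Hypothetical helper, not part of the original notebook.
    """
    try:
        # the with-block closes the connection once the body is read
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as err:
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # ValueError is raised for a URL without a scheme
        print("skipped %s: %s" % (url, err))
        return None


# a data: URL is handled by urllib itself, so no server is needed
print(safe_crawl("data:text/plain;charset=utf-8,hello"))
```

**A caller can then drop the `None` results before parsing instead of losing the whole round.**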

**Create a parse function that extracts everything we need from a page. It will also be run in parallel.**

In [3]:
def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    # internal links only: hrefs that start and end with "/"
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    # make the links absolute and drop duplicates
    page_urls = {urljoin(base_url, url['href']) for url in urls}
    # the page's own URL, read from its og:url meta tag
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url
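
**To see which links `parse` keeps, the regular expression and `urljoin` can be tried on their own. This is a small sketch using only the standard library. The `hrefs` list is made up for illustration. BeautifulSoup applies a compiled pattern with `search()`, so the sketch does the same.**

```python
import re
from urllib.parse import urljoin

base_url = "http://127.0.0.1:4000/"        # same base as in the notebook
internal_link = re.compile('^/.+?/$')      # same pattern as in parse()

# hypothetical href values, as they might appear in <a> tags
hrefs = [
    "/tutorials/others/",      # kept: starts and ends with "/"
    "/about/",                 # kept
    "/about/",                 # duplicate, removed by the set
    "/",                       # dropped: nothing between the slashes
    "/static/img.png",         # dropped: does not end with "/"
    "https://example.com/",    # dropped: does not start with "/"
    "#top",                    # dropped
]

# keep the matching hrefs and turn them into absolute URLs
page_urls = {urljoin(base_url, h) for h in hrefs if internal_link.search(h)}

for u in sorted(page_urls):
    print(u)
```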

## Normal way

**First, measure the speed without multiprocessing. We keep track of the URLs we have already seen and the ones we haven't in two Python sets.**

In [6]:
unseen = set([base_url,])
seen = set()

count, t1 = 1, time.time()

while len(unseen) != 0:                 # still have URLs to visit
    if restricted_crawl and len(seen) > 20:
        break

    print('\nDistributed Crawling...')
    htmls = [crawl(url) for url in unseen]

    print('\nDistributed Parsing...')
    results = [parse(html) for html in htmls]

    print('\nAnalysing...')
    seen.update(unseen)         # everything just crawled is now seen
    unseen.clear()              # start the next round with an empty set

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)     # get new url to crawl
print('Total time: %.1f s' % (time.time()-t1, ))    # 53 s


Distributed Crawling...

Distributed Parsing...

Analysing...
1 教程 http://localhost:4000

Distributed Crawling...

Distributed Parsing...

Analysing...
2 其他教学系列 http://localhost:4000tutorials/others/
3 Linux 简易教学 http://localhost:4000tutorials/others/linux-basic/
4 近期更新 http://localhost:4000recent-posts/
5 正则表达式 http://localhost:4000tutorials/python-basic/basic/13-10-regular-expression/
6 从头开始做一个汽车状态分类器1: 分析数据 http://localhost:4000tutorials/machine-learning/ML-practice/build-car-classifier-from-scratch1/
7 Threading 多线程教程系列 http://localhost:4000tutorials/python-basic/threading/
8 强化学习 Reinforcement Learning 教程系列 http://localhost:4000tutorials/machine-learning/reinforcement-learning/
9 从头开始做一个汽车状态分类器2: 搭建模型 http://localhost:4000tutorials/machine-learning/ML-practice/build-car-classifier-from-scratch2/
10 我的一点点背景资料 (Mofan's Background) http://localhost:4000about/
11 Numpy & Pandas 教程系列 http://localhost:4000tutorials/data-manipulation/np-pd/
12 推荐学习顺序 http://localhost:4000learning-steps/
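
**The bookkeeping with the `seen` and `unseen` sets in the loop above can be tried without any network access. The sketch below replaces the website with a tiny in-memory link graph. `LINKS` and `crawl_in_rounds` are hypothetical names used only for this illustration.**

```python
# hypothetical link graph standing in for a website: page -> pages it links to
LINKS = {
    "/": {"/a/", "/b/"},
    "/a/": {"/", "/b/", "/c/"},
    "/b/": {"/a/"},
    "/c/": set(),
}


def crawl_in_rounds(start):
    """Visit every reachable page once, round by round, like the loop above."""
    unseen, seen = {start}, set()
    rounds = []
    while unseen:                        # still have pages to visit
        rounds.append(sorted(unseen))    # the pages crawled in this round
        found = set()
        for page in unseen:
            found |= LINKS[page]         # links discovered on each page
        seen |= unseen                   # everything just crawled is now seen
        unseen = found - seen            # only new pages go into the next round
    return rounds, seen


rounds, seen = crawl_in_rounds("/")
for number, pages in enumerate(rounds, start=1):
    print(number, pages)
```

**Each round contains only pages that have not been visited before, which is why the loop terminates even though the pages link back to each other.**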

## Multiprocessing

**Create a process pool and scrape in parallel.**

In [7]:
unseen = set([base_url,])
seen = set()

pool = mp.Pool(4)                       # pool of 4 worker processes
count, t1 = 1, time.time()
while len(unseen) != 0:                 # still have URLs to visit
    if restricted_crawl and len(seen) > 20:
        break
    print('\nDistributed Crawling...')
    crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
    htmls = [j.get() for j in crawl_jobs]                                       # request connection

    print('\nDistributed Parsing...')
    parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
    results = [j.get() for j in parse_jobs]                                     # parse html

    print('\nAnalysing...')
    seen.update(unseen)         # everything just crawled is now seen
    unseen.clear()              # start the next round with an empty set

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)     # get new url to crawl
print('Total time: %.1f s' % (time.time()-t1, ))    # 16 s !!!


Distributed Crawling...

Distributed Parsing...

Analysing...
1 教程 http://localhost:4000

Distributed Crawling...

Distributed Parsing...

Analysing...
2 其他教学系列 http://localhost:4000tutorials/others/
3 Linux 简易教学 http://localhost:4000tutorials/others/linux-basic/
4 近期更新 http://localhost:4000recent-posts/
5 正则表达式 http://localhost:4000tutorials/python-basic/basic/13-10-regular-expression/
6 从头开始做一个汽车状态分类器1: 分析数据 http://localhost:4000tutorials/machine-learning/ML-practice/build-car-classifier-from-scratch1/
7 Threading 多线程教程系列 http://localhost:4000tutorials/python-basic/threading/
8 强化学习 Reinforcement Learning 教程系列 http://localhost:4000tutorials/machine-learning/reinforcement-learning/
9 从头开始做一个汽车状态分类器2: 搭建模型 http://localhost:4000tutorials/machine-learning/ML-practice/build-car-classifier-from-scratch2/
10 我的一点点背景资料 (Mofan's Background) http://localhost:4000about/
11 Numpy & Pandas 教程系列 http://localhost:4000tutorials/data-manipulation/np-pd/
12 推荐学习顺序 http://localhost:4000learning-steps/
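
**The `apply_async` plus `get` pattern above can also be written with `pool.map`, which submits one job per item and returns the results in input order. The following is only a sketch of that alternative, not the notebook's method. `fake_crawl` and `crawl_all` are hypothetical names, and the sleep stands in for a download so the example runs without a server.**

```python
import multiprocessing as mp
import time


def fake_crawl(url):
    """Stand-in for crawl(): sleeps instead of downloading (hypothetical)."""
    time.sleep(0.1)
    return "<html>%s</html>" % url


def crawl_all(urls, workers=4):
    # the with-block shuts the worker processes down when it exits
    with mp.Pool(workers) as pool:
        return pool.map(fake_crawl, urls)    # blocks until every job is done


if __name__ == "__main__":
    # the guard keeps worker processes from re-running this part
    # on platforms that start workers by importing the main module
    urls = ["page-%d" % i for i in range(8)]
    t0 = time.time()
    htmls = crawl_all(urls)
    print(len(htmls), "pages in %.2f s" % (time.time() - t0))
```

**When the code lives in a script rather than a notebook, the worker functions should be defined at module level, as they are here, so that the worker processes can find them.**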