# Multiprocessing in Python: Web scraping application
### by [Jason DeBacker](http://jasondebacker.com), December 2017

This notebook provides a tutorial and examples showing how to use the `multiprocessing` library in Python.  Parallelization is applied to web scraping.

## Scraping Wikipedia

Recall our example of scraping some data on the Georgia Bulldogs football team from Wikipedia.  For each season of data we had to grab the HTML from a different URL and parse it. Then we had to search through this parsed data to pull out the elements of the season results table that we were interested in.  This was not a trivial task.

Consider the time it takes to scrape the 1980-1988 seasons and load the results into a dataframe:

In [143]:
# import packages
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
import multiprocessing
import time
import sys
import os

# create a dictionary in which to store data
# the keys will be the column names, the values lists
# containing the element in each row in that column
results_dict = {'date': [], 'opponent': [], 'rank': [], 'site': [],
                'tv': [], 'result': [], 'attendance': [], 'year': []}

start = time.time()
# Loop over years
for year in range(1980, 1989):
    # give URL and header
    wiki = "https://en.wikipedia.org/wiki/" + str(year) + "_Georgia_Bulldogs_football_team"
    header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia

    # Make the request to get served the webpage, "soupify" it
    req = urllib.request.Request(wiki,headers=header)
    page = urllib.request.urlopen(req)
    soup = BeautifulSoup(page, 'lxml')

    # extract the table by pulling information from the wikitable class
    # There's only one table like this here, so that makes it easier
    table = soup.find("table", {"class": "wikitable"})

    # iterate through the table, pulling out each row
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        #For each "tr", assign each "td" to a variable.
        if len(cells) == 7:
            results_dict['date'].append(str(cells[0].find(text=True)))
            results_dict['opponent'].append(str(cells[1].findAll(text=True)))
            results_dict['rank'].append(str(cells[2].find(text=True)))
            results_dict['site'].append(str(cells[3].findAll(text=True)))
            results_dict['tv'].append(str(cells[4].find(text=True)))
            results_dict['result'].append(str(cells[5].findAll(text=True)))
            results_dict['attendance'].append(str(cells[6].find(text=True)))
            results_dict['year'].append(year)

    print('Year = ', year)
print('Time to scrape 1980-1988 is ', time.time()-start)
uga_df = pd.DataFrame(results_dict)
uga_df.head(n=10)

Year =  1980
Year =  1981
Year =  1982
Year =  1983
Year =  1984
Year =  1985
Year =  1986
Year =  1987
Year =  1988
Time to scrape 1980-1988 is  10.302467823028564


Unnamed: 0,attendance,date,opponent,rank,result,site,tv,year
0,95288,September 6,"['at\xa0', 'Tennessee']",No. 16,"['W', '\xa016–15\xa0\xa0']","['Neyland Stadium', ' • ', 'Knoxville, TN', ' ...",,1980
1,60150,September 13,"['Texas A&M', '*']",No. 12,"['W', '\xa042–0\xa0\xa0']","['Sanford Stadium', ' • ', 'Athens, GA']",,1980
2,61800,September 20,"['Clemson', '*']",No. 10,"['W', '\xa020–16\xa0\xa0']","['Sanford Stadium • Athens, GA (', 'Rivalry', ...",,1980
3,59200,September 27,"['TCU', '*']",No. 10,"['W', '\xa034–3\xa0\xa0']","['Sanford Stadium • Athens, GA']",,1980
4,60300,October 11,['Ole Miss'],No. 6,"['W', '\xa028–21\xa0\xa0']","['Sanford Stadium • Athens, GA']",,1980
5,59300,October 18,['Vanderbilt'],No. 6,"['W', '\xa041–0\xa0\xa0']","['Sanford Stadium • Athens, GA (', 'Rivalry', ...",,1980
6,57239,October 25,"['at\xa0', 'Kentucky']",No. 5,"['W', '\xa027–0\xa0\xa0']","['Commonwealth Stadium', ' • ', 'Lexington, KY']",,1980
7,62200,November 1,"['No. 14\xa0', 'South Carolina', '*']",No. 4,"['W', '\xa013–10\xa0\xa0']","['Sanford Stadium • Athens, GA (', 'Rivalry', ...",ABC,1980
8,68528,November 8,"['vs.\xa0No. 20\xa0', 'Florida']",No. 2,"['W', '\xa026–21\xa0\xa0']","['Gator Bowl Stadium', ' • ', 'Jacksonville, F...",ABC,1980
9,74900,November 15,"['at\xa0', 'Auburn']",No. 1,"['W', '\xa031–21\xa0\xa0']","['Jordan–Hare Stadium', ' • ', 'Auburn, AL', '...",,1980


## Scraping on multiple processors at once

We can reduce the time this task takes by splitting it into multiple processes that run simultaneously on different cores of our computer.  That is, we will do these tasks (fetching and parsing each URL) in "parallel".

To do this, we need to do just two things:
1) Import the `multiprocessing` package (which will give us the Python tools to handle the multiprocessing).
2) Make a slight modification to our code so it's written in more of a functional style that can be easily handled by the `multiprocessing` library.
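
Before touching the scraper, the change in step 2 can be sketched on a toy problem. The names below (`shared_results`, `square`, `functional_results`) are hypothetical and are not part of the scraping code; this is only meant to show the shape of the refactoring.

```python
# Toy illustration of the refactoring in step 2; the function and
# variable names here are hypothetical and not part of the scraper.

# Loop style: the loop body appends to a shared container.
shared_results = []
for n in range(5):
    shared_results.append(n * n)


# Functional style: one input in, one self-contained result out.
def square(n):
    """Return the result for a single input without touching shared state."""
    return n * n


# The built-in map() applies the function to each input, in order.
functional_results = list(map(square, range(5)))

# Both styles produce the same values.
print(shared_results == functional_results)
```

Because `square` depends only on its argument and returns everything it computes, each call could run in a separate process. The loop version could not be split up this way, since a worker process does not share the parent's `shared_results` list.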

Let's do the harder (but not hard) part in (2) first:

In [123]:
def get_wiki(year):
    '''
    This function grabs and parses the Wikipedia page
    for UGA football of the given year
    
    Args:
        year: integer, the year of the football season to scrape data for
    
    Returns:
        results_dict: dictionary, the variables to extract from the table 
    '''
    
    # create a dictionary in which to store data for this year
    # the keys will be the column names; the empty lists are
    # placeholders that are replaced by values from the table below
    results_dict = {'date': [], 'opponent': [], 'rank': [], 'site': [],
                'tv': [], 'result': [], 'attendance': [], 'year': []}

    # give URL and header
    wiki = "https://en.wikipedia.org/wiki/" + str(year) + "_Georgia_Bulldogs_football_team"
    header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia

    # Make the request to get served the webpage, "soupify" it
    req = urllib.request.Request(wiki,headers=header)
    page = urllib.request.urlopen(req)
    soup = BeautifulSoup(page, 'lxml')
    
    # extract the table by pulling information from the wikitable class
    # There's only one table like this here, so that makes it easier
    table = soup.find("table", {"class": "wikitable"})
#     table = beautiful_soup_tag_to_unicode(soup.find("table", {"class": "wikitable"}))
    
    
    # iterate through the table, pulling out each row
    for row in table.findAll("tr"):
        cells = row.findAll("td")
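        # For each "tr", assign each "td" to a key in results_dict.
        # Each assignment overwrites the previous row, so only the last
        # 7-cell row of the table (the season's final game) is returned.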
        if len(cells) == 7:
            results_dict['date'] = str(cells[0].find(text=True))
            results_dict['opponent'] = str(cells[1].findAll(text=True))
            results_dict['rank'] = str(cells[2].find(text=True))
            results_dict['site'] = str(cells[3].findAll(text=True))
            results_dict['tv'] = str(cells[4].find(text=True))
            results_dict['result'] = str(cells[5].findAll(text=True))
            results_dict['attendance'] = str(cells[6].find(text=True))
            results_dict['year']= year
    del table
    print('Year = ', year)
#     print('Process = ', os.getpid())
    return results_dict

Now the easier part.  We'll call this function repeatedly, first in a serial loop and then through the `multiprocessing` library that we imported above.

In [124]:
# list of years
years = range(1980, 1989)

# call wiki function in loop (should be similar time to loop above without function)
start = time.time()
results = {}
for y in years:
    results[y] = get_wiki(y)
list_of_dicts = []
for i, result in results.items():
    list_of_dicts.append(result)
print('Time for scrape in serial is: ', time.time()-start)
uga_df = pd.DataFrame(list_of_dicts)
uga_df.head(n=10)

Year =  1980
Year =  1981
Year =  1982
Year =  1983
Year =  1984
Year =  1985
Year =  1986
Year =  1987
Year =  1988
Time for scrape in serial is:  6.79303503036499


Unnamed: 0,attendance,date,opponent,rank,result,site,tv,year
0,77896,January 1,"['vs.\xa0No. 7\xa0', 'Notre Dame', '*']",No. 1,"['W', '\xa017–10\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans, L...",ABC,1980
1,77224,January 1,"['vs.\xa0No. 8\xa0', 'Pittsburgh', '*']",No. 2,"['L', '\xa020–24\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans', ...",ABC,1981
2,78124,January 1,"['vs.\xa0No. 2\xa0', 'Penn State', '*']",No. 1,"['L', '\xa023–27\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans, L...",ABC,1982
3,67891,January 2,"['vs.\xa0No. 2\xa0', 'Texas', '*']",No. 7,"['W', '\xa010–9\xa0\xa0']","['Cotton Bowl', ' • ', 'Dallas', ' (', 'Cotton...",CBS,1983
4,51821,December 22,"['vs.\xa0No. 15\xa0', 'Florida State', '*']",,"['T', '\xa017–17\xa0\xa0']","['Florida Citrus Bowl', ' • ', 'Orlando, FL', ...",NBC,1984
5,52203,December 28,"['vs.\xa0', 'Arizona', '*']",,"['T', '\xa013c13\xa0\xa0']","['Sun Bowl Stadium', ' • ', 'El Paso, Texas', ...",CBS,1985
6,25358,November 23,"['vs.\xa0', 'Boston College', '*']",No. 17,"['L', '\xa024–27\xa0\xa0']","['Tampa Stadium', ' • ', 'Tampa, FL', ' (', 'H...",Mizlou,1986
7,53240,December 29,"['vs.\xa0', 'Arkansas', '*']",No. 15,"['W', '\xa020–17\xa0\xa0']","['Liberty Bowl Memorial Stadium', ' • ', 'Memp...",Raycom,1987
8,76236,January 1,"['vs.\xa0', 'Michigan State', '*']",No. 19,"['W', '\xa034–27\xa0\xa0']","['Gator Bowl Stadium', ' • ', 'Jacksonville, F...",ESPN,1988
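
Each year appears only once in this table because `get_wiki` assigns to the dictionary keys rather than appending, so the last row of each season's table is the one that survives. If every game were wanted instead, one possible variant is to return a list with one dict per game and then flatten the per-year lists. The sketch below assumes this approach and uses a hypothetical `PARSED_SEASONS` stand-in (a few date/opponent pairs copied from the outputs above) in place of the BeautifulSoup cells; `rows_for_year` and `flatten` are illustrative names.

```python
# Hypothetical stand-in for the parsed tables: one list of cell strings
# per game. A real run would build these from the BeautifulSoup cells.
PARSED_SEASONS = {
    1980: [["September 6", "Tennessee"],
           ["September 13", "Texas A&M"],
           ["September 20", "Clemson"]],
    1981: [["January 1", "Pittsburgh"]],
}


def rows_for_year(year, parsed_rows):
    """Return one dict per game, so no row overwrites another."""
    rows = []
    for cells in parsed_rows:
        rows.append({"date": cells[0], "opponent": cells[1], "year": year})
    return rows


def flatten(per_year_rows):
    """Join the per-year lists into one flat list of game dicts."""
    return [row for rows in per_year_rows for row in rows]


per_year = [rows_for_year(y, PARSED_SEASONS[y]) for y in sorted(PARSED_SEASONS)]
all_rows = flatten(per_year)
print(len(all_rows))
```

A flat list of dicts like `all_rows` can be passed to `pd.DataFrame` in the same way as `list_of_dicts` above.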


In [132]:
# try to increase recursion limit
# sys.setrecursionlimit(5000)

# use multiprocessing to do different years simultaneously
start = time.time()
results = {}
pool = multiprocessing.Pool()
for y in years:
    results[y] = pool.apply_async(get_wiki, args=(y,))
pool.close()
pool.join()
list_of_dicts = []
for i, result in results.items():
    list_of_dicts.append(result.get())
print('Time for scrape in parallel is: ', time.time()-start)
uga_df = pd.DataFrame(list_of_dicts)
uga_df.head(n=10)

Year =  1983
Year =  1981
Year =  1982
Year =  1980
Year =  1984
Year =  1986
Year =  1985
Year =  1987
Year =  1988
Time for scrape in parallel is:  2.3501181602478027


Unnamed: 0,attendance,date,opponent,rank,result,site,tv,year
0,77896,January 1,"['vs.\xa0No. 7\xa0', 'Notre Dame', '*']",No. 1,"['W', '\xa017–10\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans, L...",ABC,1980
1,77224,January 1,"['vs.\xa0No. 8\xa0', 'Pittsburgh', '*']",No. 2,"['L', '\xa020–24\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans', ...",ABC,1981
2,78124,January 1,"['vs.\xa0No. 2\xa0', 'Penn State', '*']",No. 1,"['L', '\xa023–27\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans, L...",ABC,1982
3,67891,January 2,"['vs.\xa0No. 2\xa0', 'Texas', '*']",No. 7,"['W', '\xa010–9\xa0\xa0']","['Cotton Bowl', ' • ', 'Dallas', ' (', 'Cotton...",CBS,1983
4,51821,December 22,"['vs.\xa0No. 15\xa0', 'Florida State', '*']",,"['T', '\xa017–17\xa0\xa0']","['Florida Citrus Bowl', ' • ', 'Orlando, FL', ...",NBC,1984
5,52203,December 28,"['vs.\xa0', 'Arizona', '*']",,"['T', '\xa013c13\xa0\xa0']","['Sun Bowl Stadium', ' • ', 'El Paso, Texas', ...",CBS,1985
6,25358,November 23,"['vs.\xa0', 'Boston College', '*']",No. 17,"['L', '\xa024–27\xa0\xa0']","['Tampa Stadium', ' • ', 'Tampa, FL', ' (', 'H...",Mizlou,1986
7,53240,December 29,"['vs.\xa0', 'Arkansas', '*']",No. 15,"['W', '\xa020–17\xa0\xa0']","['Liberty Bowl Memorial Stadium', ' • ', 'Memp...",Raycom,1987
8,76236,January 1,"['vs.\xa0', 'Michigan State', '*']",No. 19,"['W', '\xa034–27\xa0\xa0']","['Gator Bowl Stadium', ' • ', 'Jacksonville, F...",ESPN,1988
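
Notice that the `Year =` lines print out of order, since each worker prints when it happens to finish, yet the dataframe rows are still in year order. That is because `apply_async()` hands back a placeholder object right away, and we call `.get()` on those placeholders in the order we submitted them. A minimal sketch of this pattern follows; `label_year` is a hypothetical stand-in for `get_wiki` that needs no network access, and the pool size of 2 is an arbitrary choice.

```python
import multiprocessing


def label_year(year):
    """Hypothetical stand-in for get_wiki: returns a small dict for one year."""
    return {"year": year, "label": "season " + str(year)}


def run_async(years):
    """Submit one task per year, then collect results in submission order."""
    years = list(years)
    with multiprocessing.Pool(processes=2) as pool:
        # apply_async returns immediately with an AsyncResult placeholder
        pending = {y: pool.apply_async(label_year, args=(y,)) for y in years}
        # .get() blocks until that particular task has finished
        return [pending[y].get() for y in years]


if __name__ == "__main__":
    print(run_async(range(1980, 1983)))
```

The `if __name__ == "__main__":` guard is harmless in a notebook and matters when the same code is saved as a script on platforms that start workers by launching a fresh interpreter.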


Same as above, but using `multiprocessing.Pool.map()` instead of `multiprocessing.Pool.apply_async()`:

In [133]:
# use multiprocessing to do different years simultaneously
start = time.time()
pool = multiprocessing.Pool()
list_of_dicts = pool.map(get_wiki, years)
pool.close()
pool.join()
print('Time for scrape in parallel is: ', time.time()-start)
uga_df = pd.DataFrame(list_of_dicts)
uga_df.head(n=10)

Year =  1983
Year =  1981
Year =  1982
Year =  1980
Year =  1984
Year =  1985
Year =  1986
Year =  1987
Year =  1988
Time for scrape in parallel is:  2.2375168800354004


Unnamed: 0,attendance,date,opponent,rank,result,site,tv,year
0,77896,January 1,"['vs.\xa0No. 7\xa0', 'Notre Dame', '*']",No. 1,"['W', '\xa017–10\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans, L...",ABC,1980
1,77224,January 1,"['vs.\xa0No. 8\xa0', 'Pittsburgh', '*']",No. 2,"['L', '\xa020–24\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans', ...",ABC,1981
2,78124,January 1,"['vs.\xa0No. 2\xa0', 'Penn State', '*']",No. 1,"['L', '\xa023–27\xa0\xa0']","['Louisiana Superdome', ' • ', 'New Orleans, L...",ABC,1982
3,67891,January 2,"['vs.\xa0No. 2\xa0', 'Texas', '*']",No. 7,"['W', '\xa010–9\xa0\xa0']","['Cotton Bowl', ' • ', 'Dallas', ' (', 'Cotton...",CBS,1983
4,51821,December 22,"['vs.\xa0No. 15\xa0', 'Florida State', '*']",,"['T', '\xa017–17\xa0\xa0']","['Florida Citrus Bowl', ' • ', 'Orlando, FL', ...",NBC,1984
5,52203,December 28,"['vs.\xa0', 'Arizona', '*']",,"['T', '\xa013c13\xa0\xa0']","['Sun Bowl Stadium', ' • ', 'El Paso, Texas', ...",CBS,1985
6,25358,November 23,"['vs.\xa0', 'Boston College', '*']",No. 17,"['L', '\xa024–27\xa0\xa0']","['Tampa Stadium', ' • ', 'Tampa, FL', ' (', 'H...",Mizlou,1986
7,53240,December 29,"['vs.\xa0', 'Arkansas', '*']",No. 15,"['W', '\xa020–17\xa0\xa0']","['Liberty Bowl Memorial Stadium', ' • ', 'Memp...",Raycom,1987
8,76236,January 1,"['vs.\xa0', 'Michigan State', '*']",No. 19,"['W', '\xa034–27\xa0\xa0']","['Gator Bowl Stadium', ' • ', 'Jacksonville, F...",ESPN,1988
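
`Pool.map()` does the submitting and the collecting in a single call: it blocks until every task is done and returns a list of results in the same order as the inputs. Here is a minimal sketch of that behavior, written with a `with` block so the pool is cleaned up automatically. `tag_year` is a hypothetical stand-in for `get_wiki`, and the pool size of 2 is an arbitrary choice.

```python
import multiprocessing


def tag_year(year):
    """Hypothetical stand-in for get_wiki: one year in, one dict out."""
    return {"year": year, "decade": (year // 10) * 10}


def run_map(years):
    """Apply tag_year to every year in parallel, keeping input order."""
    with multiprocessing.Pool(processes=2) as pool:
        # map() waits for all workers and returns a list ordered like `years`
        return pool.map(tag_year, years)


if __name__ == "__main__":
    for row in run_map(range(1978, 1982)):
        print(row)
```

When each task takes a single argument and all results are wanted at once, `map()` is usually the shorter choice; `apply_async()` is handy when tasks take different arguments or results are needed one at a time.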


In [135]:
multiprocessing.cpu_count()

4
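
This machine reports 4 CPUs, and `multiprocessing.Pool()` called with no argument starts one worker per reported CPU. With nine years to scrape, that means at most four pages are being fetched at any moment, which fits with the parallel run being faster than the serial one but well short of nine times faster. If we wanted to choose the pool size ourselves, a small helper along these lines is one option; `pool_size` and its `max_workers` parameter are hypothetical names, not part of the `multiprocessing` library.

```python
import multiprocessing


def pool_size(n_tasks, max_workers=None):
    """Pick a worker count no larger than the task count or the core count."""
    cores = multiprocessing.cpu_count()
    # optionally cap the worker count below the number of cores
    limit = cores if max_workers is None else min(cores, max_workers)
    # never ask for fewer than one worker
    return max(1, min(n_tasks, limit))


# Nine seasons to scrape: the answer depends on the machine's core count.
print(pool_size(9))
```

The returned value could then be passed as `multiprocessing.Pool(processes=pool_size(len(years)))`.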