![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

# Lab | Parallelization

## Introduction

This lab will combine parallelization with some of the other topics you have learned in the Intermediate Python module of this program (list comprehensions, requests library, functional programming, web scraping, etc.). You will write code that extracts a list of links from a web page, requests each URL, and then indexes the page referenced by each link - both sequentially and in parallel.

## Resources

- [Multiprocessing Library Documentation](https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing#module-multiprocessing)
- [Python Parallel Computing (in 60 Seconds or less)](https://dbader.org/blog/python-parallel-computing-in-60-seconds)
- [Python Multiprocessing: Pool vs Process – Comparative Analysis](https://www.ellicium.com/python-multiprocessing-pool-process/)

## Step 1: Use the requests library to retrieve the content from the URL below.

In [45]:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Data_science'

In [3]:
response = requests.get(url)  # reuse the url variable defined above
html = response.content
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

## Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [21]:
soup.find_all('a')

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a href="/wiki/Information_science" title="Information science">information science</a>,
 <a class="image" href="/wiki/File:PIA23792-1600x1200(1).jpg"><img alt="" class="thumbimage" data-file-height="1200" data-file-width="1600" decoding="async" height="188" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/45/PIA23792-1600x1200%281%29.jpg/250px-PIA23792-1600x1200%281%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/45/PIA23792-1600x1200%281%29.jpg/375px-PIA23792-1600x1200%281%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/45/PIA23792-1600x1200%281%29.jpg/500px-PIA23792-1600x1200%281%29.jpg 2x" width="250"/></a>,
 <a class="internal" href="/wiki/File:PIA23792-1600x1200(1).jpg" title="Enlarge"></a>,
 <a href="/wiki/Comet_NEOWISE" title="Comet NEOWISE">Comet NEOWISE</a>,
 <a href="/wiki/Astronomical_survey

In [25]:
results = []
for link in soup.find_all('a'):
    try:
        results.append(link['href'])
    except KeyError:
        results.append('NA')
results

['NA',
 '#mw-head',
 '#searchInput',
 '/wiki/Information_science',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/Comet_NEOWISE',
 '/wiki/Astronomical_survey',
 '/wiki/Space_telescope',
 '/wiki/Wide-field_Infrared_Survey_Explorer',
 '/wiki/Machine_learning',
 '/wiki/Data_mining',
 '/wiki/File:Kernel_Machine.svg',
 '/wiki/Statistical_classification',
 '/wiki/Cluster_analysis',
 '/wiki/Regression_analysis',
 '/wiki/Anomaly_detection',
 '/wiki/Data_Cleaning',
 '/wiki/Automated_machine_learning',
 '/wiki/Association_rule_learning',
 '/wiki/Reinforcement_learning',
 '/wiki/Structured_prediction',
 '/wiki/Feature_engineering',
 '/wiki/Feature_learning',
 '/wiki/Online_machine_learning',
 '/wiki/Semi-supervised_learning',
 '/wiki/Unsupervised_learning',
 '/wiki/Learning_to_rank',
 '/wiki/Grammar_induction',
 '/wiki/Supervised_learning',
 '/wiki/Statistical_classification',
 '/wiki/Regression_analysis',
 '/wiki/Decision_tree_learning',
 '/wiki/Ensembl

In [34]:
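# Keep only the anchors that have an href attribute
list_links = soup.find_all('a', href=True)
# Keep absolute links (starting with 'http') and drop any that contain '%'
list_links_clean = [link['href'] for link in list_links if link['href'].startswith('http') and '%' not in link['href']]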
len(list_links_clean)

61
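Step 2 asks for unique links, but the lists above still contain repeats (for example, `/wiki/Statistical_classification` and `/wiki/Regression_analysis` each appear twice in the output). A minimal sketch of one way to drop duplicates while keeping the original order is shown here; the helper name `unique_in_order` is hypothetical, not part of the lab.

```python
def unique_in_order(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(links))


# A few hrefs copied from the output above, including two repeats
sample = [
    '/wiki/Statistical_classification',
    '/wiki/Cluster_analysis',
    '/wiki/Regression_analysis',
    '/wiki/Statistical_classification',  # repeated
    '/wiki/Regression_analysis',         # repeated
]

print(unique_in_order(sample))
```

Using `set(links)` would also remove duplicates, but it does not keep the order in which the links appear on the page.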

## Step 3: Use list comprehensions with conditions to clean the link list.

Create a list containing only the absolute links, and remove any that contain a percent sign (%).

In [None]:
# your code here
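The `startswith('http')` filter used earlier keeps only links that are already absolute, so relative links such as `/wiki/Machine_learning` are dropped. If the step is read as "convert every link to an absolute one", a sketch along these lines might work. It uses `urljoin` from the standard library; the helper name `clean_links` and the sample hrefs are assumptions made for illustration.

```python
from urllib.parse import urljoin


def clean_links(hrefs, base_url):
    """Resolve hrefs against base_url, then drop links containing '%'."""
    # Skip in-page anchors such as '#mw-head'; resolve everything else
    absolute = [urljoin(base_url, href) for href in hrefs if not href.startswith('#')]
    # Keep http(s) links that do not contain a percent sign
    return [link for link in absolute if link.startswith('http') and '%' not in link]


base = 'https://en.wikipedia.org/wiki/Data_science'
hrefs = [
    '#mw-head',                       # in-page anchor, skipped
    '/wiki/Machine_learning',         # relative link, resolved against base
    'https://example.org/a%20b',      # contains '%', removed
    'https://example.org/page',       # already absolute, kept
]

print(clean_links(hrefs, base))
```

This version keeps many more links than the `startswith('http')` filter, so the count it produces would differ from the one shown above.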

## Step 4: Write a function called crawl_page that accepts a link and does the following.

- Request the content of the page referenced by that link.
- Create a soup with the request content.
- Extract a list of links.
- Return the count of links in the page.

In [35]:
def crawl_page(url):
    # Download the page and parse it
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # Anchors that have an href attribute
    list_links = soup.find_all('a', href=True)
    # Absolute links that do not contain '%'
    list_links_clean = [link['href'] for link in list_links if link['href'].startswith('http') and '%' not in link['href']]
    return len(list_links_clean)

crawl_page('https://en.wikipedia.org/wiki/Data_science')

61

## Step 5: Sequentially loop through the list of links, running the crawl_page function each time, and save the results in a list.

Remember to put `%%time` on the first line of the cell so that it measures how long the cell takes to run.

In [36]:
[crawl_page(url) for url in list_links_clean]

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


[14,
 14,
 1,
 15,
 21,
 3,
 66,
 10,
 131,
 150,
 15,
 21,
 1,
 63,
 36,
 112,
 0,
 4,
 54,
 44,
 69,
 15,
 0,
 0,
 0,
 0,
 321,
 4,
 4,
 21,
 28,
 279,
 266,
 50,
 14,
 61,
 12,
 39,
 36,
 45,
 79,
 77,
 50,
 60,
 26,
 33,
 47,
 43,
 33,
 38,
 29,
 28,
 31,
 44,
 39,
 133,
 39,
 0,
 14,
 81,
 81]

## Step 6: Sequentially loop through the list of links, running the crawl_page function each time.

Remember to put `%%time` on the first line of the cell so that it measures how long the cell takes to run.

In [37]:
%%time
[crawl_page(url) for url in list_links_clean]

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Wall time: 56.8 s


[14,
 14,
 1,
 15,
 21,
 3,
 66,
 10,
 131,
 150,
 15,
 21,
 1,
 63,
 36,
 112,
 0,
 4,
 54,
 44,
 69,
 15,
 0,
 0,
 0,
 0,
 321,
 4,
 4,
 21,
 28,
 279,
 266,
 50,
 14,
 61,
 12,
 39,
 36,
 45,
 79,
 77,
 50,
 60,
 26,
 33,
 47,
 43,
 33,
 38,
 29,
 28,
 31,
 44,
 39,
 133,
 39,
 0,
 14,
 81,
 81]
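The `%%time` magic only works in IPython and Jupyter. Outside a notebook, a similar wall-clock measurement could be sketched with `time.perf_counter`. The helper name `timed` is hypothetical, and `sum` is only a stand-in workload; in this lab the timed call would be the loop over `crawl_page`.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the example runs without network access
result, seconds = timed(sum, range(1_000_000))
print(f"Wall time: {seconds:.3f} s")
```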

In [38]:
!pip3 install multiprocess

# import multiprocessing
from multiprocess import Pool, cpu_count
# If you are using a Mac, use the multiprocessing library instead

Collecting multiprocess
  Downloading multiprocess-0.70.12.2-py38-none-any.whl (128 kB)
Collecting dill>=0.3.4
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Installing collected packages: dill, multiprocess
Successfully installed dill-0.3.4 multiprocess-0.70.12.2


In [39]:
cpu_count()

8

In [41]:
pool = Pool(cpu_count()-1)
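The pool above leaves one core free for the rest of the system. On a single-core machine `cpu_count() - 1` is 0, and `Pool` rejects a process count below 1, so a small guard may be worth adding. This is a sketch only: the helper name `worker_count` is hypothetical, and it uses `cpu_count` from the standard `multiprocessing` module, which the `multiprocess` package mirrors.

```python
from multiprocessing import cpu_count


def worker_count(reserve=1):
    """Number of worker processes to start, never fewer than one."""
    # Leave `reserve` cores free, but always keep at least one worker
    return max(1, cpu_count() - reserve)


print(worker_count())
```

The result can then be passed to the pool constructor, for example `Pool(worker_count())`.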

In [42]:
%%time
pool.map(crawl_page, list_links_clean)

NameError: name 'BeautifulSoup' is not defined
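The `NameError` above is raised inside the worker processes: `crawl_page` is sent to each worker, but the `import` statements run in the notebook are likely not available there. One common workaround is to import the libraries inside the function, so every worker loads them itself. The sketch below assumes that is the cause. The `timeout=10` argument, the explicit `'html.parser'`, and the helper name `crawl_in_parallel` are additions made for illustration.

```python
def crawl_page(url):
    """Return the number of absolute, '%'-free links on the page at url."""
    # Imports live inside the function so each worker process loads them
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=True)
             if a['href'].startswith('http') and '%' not in a['href']]
    return len(links)


def crawl_in_parallel(urls, workers):
    """Run crawl_page over urls with a pool of worker processes."""
    # Same package the notebook installs; imported here to keep the sketch self-contained
    from multiprocess import Pool

    # The context manager shuts the pool down once map() has returned
    with Pool(workers) as pool:
        # map() returns results in the same order as the input urls
        return pool.map(crawl_page, urls)
```

In the notebook this could be called as `crawl_in_parallel(list_links_clean, cpu_count() - 1)` in a cell that starts with `%%time`, which allows a direct comparison with the sequential run.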