# Parallelization Lab

In this lab, you will apply several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [18]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [19]:
# Fetch the page with requests, as this step asks, and keep the raw bytes for parsing
response = requests.get(url)
html = response.content

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [4]:
from bs4 import BeautifulSoup

In [20]:
soup = BeautifulSoup(html, 'html.parser')
soup.prettify()

'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Data science - Wikipedia\n  </title>\n  <script>\n   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XrEF5wpAMNUAAAW5gqYAAACI","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":954961466,"wgRevisionId":954961466,"wgArticleId":35458904,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: others","CS1 maint: date and year","Use dmy dates from December 2012","Information science","Computer occupations","Computational fields of study
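
The link extraction this step asks for could be sketched as follows. The snippet runs on a small hand-written HTML string (`sample_html`, a hypothetical stand-in for the downloaded page) so that it can be tried without a network connection. `find_all('a', href=True)` returns only the anchor tags that carry an `href` attribute, and the set comprehension removes duplicates.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; in the lab, parse `html` instead.
sample_html = """
<html><body>
  <a href="/wiki/Machine_learning">Machine learning</a>
  <a href="/wiki/Machine_learning">Machine learning (again)</a>
  <a href="http://cacm.acm.org">CACM</a>
  <a name="no-href">anchor without a link</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Collect the href of every <a> tag that has one; the set drops duplicates.
unique_links = sorted({a['href'] for a in sample_soup.find_all('a', href=True)})
print(unique_links)
```

Sorting is optional; it is used here only so that the result is printed in a stable order.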

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percent sign (%).
- Relative Links: Create a list of these, prepend the domain to each link so that you have the full URL, and remove any that contain a percent sign (%).
- Combine the lists of absolute and relative links and ensure there are no duplicates.

In [21]:
domain = 'http://wikipedia.org'

In [148]:
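import re

# Search the prettified HTML for absolute links.
# Note: this pattern only matches http:// URLs, so https:// links are not captured.
pattern = r'http://[\w]+\..+\.[\w]+'
absolute_links = re.findall(pattern, soup.prettify())
print('\nAbsolute links:', absolute_links)

# Search the prettified HTML for relative links of the form /wiki/<page>.
# Note: [\w]+ stops at characters such as '-' and ':', so some paths are cut short
# (for example '/wiki/Semi' and '/wiki/File' in the output below).
pattern = r'/wiki/[\w]+'
relative_links = re.findall(pattern, soup.prettify())
print(len(relative_links))
print('\nRelative links:', relative_links)
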
Absolute links: ['http://cacm.acm.org', 'http://cacm.acm.org', 'http://www.datascienceassn.org', 'http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf', 'http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf']
228

Relative links: ['/wiki/Data_science', '/wiki/Information_science', '/wiki/Machine_learning', '/wiki/Data_mining', '/wiki/File', '/wiki/Statistical_classification', '/wiki/Cluster_analysis', '/wiki/Regression_analysis', '/wiki/Anomaly_detection', '/wiki/Automated_machine_learning', '/wiki/Association_rule_learning', '/wiki/Reinforcement_learning', '/wiki/Structured_prediction', '/wiki/Feature_engineering', '/wiki/Feature_learning', '/wiki/Online_machine_learning', '/wiki/Semi', '/wiki/Unsupervised_learning', '/wiki/Learning_to_rank', '/wiki/Grammar_induction', '/wiki/Supervised_learning', '/wiki/Statistical_classification', '/wiki/Regression_analysis', '/wiki/Decision_tree_learning', '/wiki/Ensemble_learning', '/wiki/Bootstrap_aggregating', '/wik
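
The cell above finds links with regular expressions. The cleaning rules listed in this step can also be sketched with list comprehensions, as shown here on a short hypothetical list of raw `href` values (`raw_links`). The names `clean_absolute`, `clean_relative`, and `all_links` are illustrative, not part of the lab.

```python
domain = 'http://wikipedia.org'  # same value as the lab's `domain` variable

# Hypothetical sample of raw href values, standing in for the links found on the page.
raw_links = [
    'http://cacm.acm.org',
    'http://cacm.acm.org',             # duplicate
    'http://example.org/a%20b',        # contains %, dropped
    '/wiki/Machine_learning',
    '/wiki/Semi-supervised_learning',
    '/wiki/S%C3%A3o_Paulo',            # contains %, dropped
    '#cite_note-1',                    # neither absolute nor relative
]

# Absolute links: start with "http" and contain no percent sign.
clean_absolute = [link for link in raw_links
                  if link.startswith('http') and '%' not in link]

# Relative links: start with "/", contain no percent sign, and get the domain prepended.
clean_relative = [domain + link for link in raw_links
                  if link.startswith('/') and '%' not in link]

# dict.fromkeys removes duplicates while keeping first-seen order.
all_links = list(dict.fromkeys(clean_absolute + clean_relative))
print(all_links)
```

Protocol-relative links that begin with `//` would also pass the `startswith('/')` test; whether to exclude them is a choice left to you.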

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [None]:
import os
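
One possible way to do this is sketched below. The helper name `enter_folder` is hypothetical; it wraps `os.makedirs` (with `exist_ok=True` so that an existing folder is not an error) and `os.chdir`.

```python
import os

def enter_folder(name):
    """Create `name` under the current directory if needed, then switch into it."""
    os.makedirs(name, exist_ok=True)  # no error if the folder already exists
    os.chdir(name)
    return os.getcwd()
```

Calling `enter_folder('wikipedia')` once makes the new folder the working directory. Because the path is relative, running the same cell a second time would create a nested `wikipedia/wikipedia` folder, so it is best run only once per session.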

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows: `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [None]:
from slugify import slugify

In [None]:
# your code here
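
A minimal sketch of what `index_page` might look like, assuming the working directory is already the *wikipedia* folder from Step 4. The 10-second timeout is an assumption added here so that a stalled request cannot block the whole crawl; the lab does not specify one.

```python
import requests
from slugify import slugify

def index_page(link):
    """Download `link` and save it as <slug>.html in the current working directory."""
    try:
        response = requests.get(link, timeout=10)  # timeout value is an assumption
        filename = slugify(link) + '.html'
        # Write the raw bytes so that no text encoding has to be guessed.
        with open(filename, 'wb') as f:
            f.write(response.content)
    except Exception:
        # The lab asks to ignore any failure and move on to the next link.
        pass
```

Silently passing on every exception follows the lab's instructions, but it also hides problems such as a mistyped URL; printing the link inside the `except` block while debugging can help.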

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [None]:
# your code here
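
`%%time` is an IPython cell magic, so it only works as the very first line of a notebook cell. In the notebook, the cell would simply start with `%%time` followed by a `for` loop over the links. The sketch below shows the same measurement in plain Python using `time.perf_counter`, with `fake_index_page` as a hypothetical stand-in that sleeps instead of downloading.

```python
import time

def run_sequentially(worker, items):
    """Call worker(item) for each item in order and return the elapsed seconds."""
    start = time.perf_counter()
    for item in items:
        worker(item)
    return time.perf_counter() - start

# Hypothetical stand-in for index_page: sleeps briefly instead of downloading.
def fake_index_page(link):
    time.sleep(0.01)

elapsed = run_sequentially(fake_index_page, ['a', 'b', 'c'])
print(f'{elapsed:.3f} seconds')
```

With the real function, the call would be `run_sequentially(index_page, links)`, where `links` is the combined list from Step 3.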

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [None]:
import multiprocessing

In [None]:
# your code here
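
One way to parallelize the loop is a process pool, sketched below. The helper name `run_in_parallel` and its `start_method` parameter are illustrative additions, not part of the lab; `multiprocessing.get_context(None)` returns the platform's default context.

```python
import multiprocessing

def run_in_parallel(worker, items, processes=None, start_method=None):
    """Map worker over items using a pool of processes; results keep the input order."""
    # start_method=None selects the platform's default way of starting processes.
    ctx = multiprocessing.get_context(start_method)
    # processes=None lets the pool pick a size based on the number of CPUs.
    with ctx.Pool(processes=processes) as pool:
        return pool.map(worker, items)
```

With the function from Step 5, the call would be `run_in_parallel(index_page, links)` in a cell that begins with `%%time`. On platforms where the default start method is `spawn` (Windows and macOS), worker processes import the main module afresh, so a function defined only in a notebook cell may not be found by the workers. A common workaround is to place `index_page` in a separate `.py` file and import it.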