# Parallelization Lab

In this lab, you will apply several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [52]:
import requests
import re

url = 'https://en.wikipedia.org/wiki/Data_science'

In [17]:
# your code here
html = requests.get(url).content

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [18]:
from bs4 import BeautifulSoup

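In [67]:
# your code here
# Parse the downloaded HTML so the anchor tags can be searched.
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', href=True)
links = [item['href'] for item in links]
links = list(set(links))
links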

['/wiki/U-Net',
 'https://it.wikipedia.org/wiki/Scienza_dei_dati',
 '/wiki/Feature_engineering',
 '/wiki/Convolutional_neural_network',
 'https://web.archive.org/web/20120403153707/http://www.jstage.jst.go.jp/browse/dsj/_vols',
 '/w/index.php?title=Special:UserLogin&returnto=Data+science',
 '/wiki/Special:RecentChangesLinked/Data_science',
 '/wiki/Random_forest',
 '//pubmed.ncbi.nlm.nih.gov/19265007',
 '#cite_note-Escoufier-8',
 '#cite_ref-:2_28-2',
 '/wiki/Perceptron',
 '#mw-head',
 '//pubmed.ncbi.nlm.nih.gov/19535325',
 '/wiki/Deep_learning',
 'https://doi.org/10.1126%2Fscience.1250475',
 '/wiki/DBSCAN',
 '/w/index.php?title=Data_science&action=edit&section=1',
 'https://web.archive.org/web/20120822033955/http://www.jds-online.com/v1-1',
 '#cite_note-29',
 '/wiki/Bayesian_network',
 'https://web.archive.org/web/20171124155559/https://blogs.wsj.com/cio/2014/05/02/why-do-we-need-data-science-when-weve-had-statistics-for-centuries/',
 '/wiki/Glossary_of_artificial_intelligence',
 '/w/in

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
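
The cleaning rules above can be sketched on a small hand-made sample. The `sample_links` list and the `base` value are illustrative assumptions, not the full output of Step 2. Skipping protocol-relative links (those starting with `//`) is an extra assumption that the rules above do not mention.

```python
# Hypothetical sample; the real list comes from Step 2.
sample_links = [
    'https://it.wikipedia.org/wiki/Scienza_dei_dati',
    'https://doi.org/10.1126%2Fscience.1250475',
    '/wiki/Perceptron',
    '/wiki/Bias%E2%80%93variance_dilemma',
    '//pubmed.ncbi.nlm.nih.gov/19265007',
    '#mw-head',
    '/wiki/Perceptron',
]

base = 'https://en.wikipedia.org'  # assumed domain, no trailing slash

# Absolute links: start with http and contain no percent sign.
absolute = [link for link in sample_links
            if link.startswith('http') and '%' not in link]

# Relative links: a single leading slash, no percent sign, domain prepended.
relative = [base + link for link in sample_links
            if link.startswith('/') and not link.startswith('//')
            and '%' not in link]

# Combine and deduplicate; sorted() only makes the order reproducible.
clean_links = sorted(set(absolute + relative))
print(clean_links)
```

Filtering with `'%' not in link` drops a link entirely, whereas `link.replace('%', '')` would keep it with a corrupted URL.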

In [0]:
domain = 'http://wikipedia.org'

In [88]:
# your code here
domain = 'https://en.wikipedia.org'  # no trailing slash: relative links already start with one
# Drop links that contain '%' instead of stripping the character from them.
absolute_links = [link for link in links if link.startswith('http') and '%' not in link]
# Skip protocol-relative links ('//host/...'), which are not paths on this domain.
relative_links = [domain + link for link in links
                  if link.startswith('/') and not link.startswith('//') and '%' not in link]
links_ok = list(set(absolute_links + relative_links))
links_ok

['https://it.wikipedia.org/wiki/Scienza_dei_dati',
 'https://en.wikipedia.org/wiki/Digital_object_identifier',
 'https://web.archive.org/web/20120403153707/http://www.jstage.jst.go.jp/browse/dsj/_vols',
 'https://en.wikipedia.org/wiki/Perceptron',
 'https://en.wikipedia.org/wiki/Linear_regression',
 'https://en.wikipedia.org/wiki/Discipline_(academia)',
 'https://en.wikipedia.org/wiki/Computational_learning_theory',
 'https://en.wikipedia.org/wiki/University_of_Michigan',
 'https://web.archive.org/web/20120822033955/http://www.jds-

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [90]:
import os

In [93]:
# your code here

# exist_ok=True keeps the cell from failing when the folder already exists
os.makedirs('./wikipedia', exist_ok=True)

In [94]:
os.chdir('./wikipedia')

In [95]:
!pwd

/c/Users/lzapa/DataBootCampIronH/Labs/Week 4/3. Parallelization/your-code/wikipedia
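
Calling `os.mkdir` on a folder that already exists raises `FileExistsError`, which is easy to hit when a notebook cell is rerun. Below is a minimal sketch of a rerun-safe version. It works inside a temporary directory (an assumption made so the example does not touch your lab folder), and `enter_folder` is a hypothetical helper name.

```python
import os
import tempfile

def enter_folder(path):
    """Create path if needed, make it the working directory, and return it."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    os.chdir(path)
    return os.getcwd()

start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, 'wikipedia')
    first = enter_folder(target)
    os.chdir(scratch)
    second = enter_folder(target)  # the second call succeeds as well
    os.chdir(start_dir)            # step out before the temp folder is removed
print(os.path.basename(first))
```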


### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [98]:
from slugify import slugify

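In [0]:
# your code here

def index_page(link):
    try:
        # Download the page and save it under a slugified, filesystem-safe name.
        html = requests.get(link).content
        filename = slugify(link) + '.html'
        with open(filename, 'wb') as f:
            f.write(html)
    except Exception:
        pass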

In [100]:
# slugify expects text, so pass the link itself rather than the response bytes
slugify(url)

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run. 

_Hint: Use tqdm to track progress and elapsed time._
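
`%%time` only works as a notebook cell magic. As a rough sketch of the same measurement in plain Python, the loop below times a sequential run with `time.perf_counter`. `fake_index_page` is a hypothetical stand-in that sleeps instead of downloading, `demo_links` is made-up input, and the `tqdm` fallback is an assumption for environments where tqdm is not installed.

```python
import time

try:
    from tqdm import tqdm  # progress bar, if installed
except ImportError:
    def tqdm(iterable, **kwargs):
        # Fallback: iterate without a progress bar.
        return iterable

def fake_index_page(link):
    """Hypothetical stand-in for index_page: waits instead of downloading."""
    time.sleep(0.01)
    return link

demo_links = [f'https://example.org/page-{i}' for i in range(10)]

start = time.perf_counter()
done = [fake_index_page(link) for link in tqdm(demo_links)]
elapsed = time.perf_counter() - start
print(f'{len(done)} pages in {elapsed:.2f} s')
```

In the notebook itself, replacing `fake_index_page` with `index_page` and `demo_links` with your cleaned list gives the sequential baseline to compare against.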

In [0]:
# your code here

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

Use both methods. First, use the `multiprocess` module, which lets you run the function defined in the Jupyter notebook in parallel.

Second, create a Python file containing the download function and use the `multiprocessing` module to run it.
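
As a sketch of the kind of speed-up to look for, the example below runs the same stand-in task sequentially and then through a pool of workers. It uses `ThreadPool` from `multiprocessing.pool`, which has the same `map` interface as `multiprocessing.Pool` but uses threads, so it runs inside a notebook without moving the function to a separate file. That substitution, the `fake_index_page` stand-in, and the worker count of 5 are all assumptions for illustration.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_index_page(link):
    """Hypothetical stand-in for index_page: waits instead of downloading."""
    time.sleep(0.05)
    return link

demo_links = [f'https://example.org/page-{i}' for i in range(10)]

# Sequential baseline.
start = time.perf_counter()
sequential = [fake_index_page(link) for link in demo_links]
sequential_time = time.perf_counter() - start

# Same work spread over 5 workers; map() keeps the input order.
start = time.perf_counter()
with ThreadPool(5) as pool:
    parallel = pool.map(fake_index_page, demo_links)
parallel_time = time.perf_counter() - start

print(f'sequential: {sequential_time:.2f} s, parallel: {parallel_time:.2f} s')
```

With process-based `multiprocessing.Pool`, platforms that start workers in a fresh interpreter (Windows, for example) need the function to live in an importable module and the pool to be created under `if __name__ == '__main__':`.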

In [0]:
# your code here



**BONUS**: Create a function that uses the `os` module to count how many files there are in the wikipedia folder.

Delete the files from the folder before you run it, then perform the solution above asynchronously.

Use your function to check how many files are being downloaded.
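
One way to approach the bonus is sketched below, under a few assumptions. It works in a temporary folder, not the real `wikipedia` folder. `fake_download` is a hypothetical stand-in that writes a small file instead of fetching a page. `ThreadPool.map_async` stands in for whichever asynchronous method you choose.

```python
import os
import tempfile
import time
from multiprocessing.pool import ThreadPool

def count_files(folder):
    """Count regular files (not subfolders) directly inside folder."""
    return len([name for name in os.listdir(folder)
                if os.path.isfile(os.path.join(folder, name))])

def clear_folder(folder):
    """Delete every regular file directly inside folder."""
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            os.remove(path)

def fake_download(path):
    """Hypothetical stand-in for index_page: writes a tiny file after a delay."""
    time.sleep(0.02)
    with open(path, 'w') as f:
        f.write('<html></html>')

with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, 'old.html'), 'w').close()  # leftover from a past run
    clear_folder(folder)
    count_before = count_files(folder)

    paths = [os.path.join(folder, f'page-{i}.html') for i in range(8)]
    with ThreadPool(4) as pool:
        job = pool.map_async(fake_download, paths)  # returns immediately
        while not job.ready():
            print('files so far:', count_files(folder))
            time.sleep(0.01)
        job.get()  # re-raises any exception from a worker
    count_after = count_files(folder)

print(count_before, count_after)
```

Because `map_async` returns before the work is finished, the polling loop can call `count_files` while files are still being written.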