# Parallelization Lab

In this lab, you will apply several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [2]:
import requests
from tqdm.auto import tqdm
url = 'https://en.wikipedia.org/wiki/Data_science'

In [3]:
# your code here
html = requests.get(url).content

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [4]:
from bs4 import BeautifulSoup

In [5]:
# your code here
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
links = soup.find_all('a', href=True)
links = [item['href'] for item in links]
links = list(set(links))
links

['#cite_ref-9',
 '#cite_ref-TansleyTolle2009_4-1',
 'https://doi.org/10.1093%2Fbiostatistics%2Fkxp014',
 '/wiki/Recurrent_neural_network',
 '/wiki/OPTICS_algorithm',
 '/wiki/Category:Data_analysis',
 '/wiki/Empirical_risk_minimization',
 '#cite_ref-32',
 '#cite_ref-:2_28-4',
 '#cite_note-BellHey2009-5',
 '/w/index.php?title=Data_science&action=info',
 '/wiki/Wikipedia:File_Upload_Wizard',
 '/wiki/Knowledge',
 '/wiki/Wikipedia:Community_portal',
 '/wiki/Statistical_classification',
 '/wiki/Long_short-term_memory',
 '/wiki/Bayesian_network',
 '/wiki/Category:Computer_occupations',
 '#cite_ref-:2_28-2',
 '/wiki/Special:Random',
 'http://www.gfkl.org/welcome/',
 '/wiki/Statistics',
 'https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/',
 '//foundation.wikimedia.org/wiki/Privacy_policy',
 '#cite_ref-23',
 '/wiki/Academic_publishing',
 '#cite_note-6',
 'https://en.wikipedia.org/w/index.php?title=Data_science&oldid=946667548',
 '/wiki/Gated_recurrent_unit',


### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
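
Some links in the raw list, such as `//foundation.wikimedia.org/wiki/Privacy_policy`, also begin with a forward slash but already name a host, so prefixing the domain to them produces URLs like `http://wikipedia.org//...`. The sketch below is one possible way to handle both forms, assuming the standard library's `urljoin` is acceptable for the lab; `clean_links` and the `sample` list are hypothetical names used only for illustration.

```python
from urllib.parse import urljoin

def clean_links(links, base):
    """Return sorted, unique absolute URLs built from `links`, skipping any with '%'."""
    cleaned = set()
    for href in links:
        if '%' in href:
            continue  # the lab asks us to drop links containing a percentage sign
        if href.startswith('http'):
            cleaned.add(href)
        elif href.startswith('/'):
            # urljoin resolves both '/wiki/...' and '//host/...' against the base
            cleaned.add(urljoin(base, href))
        # anything else (for example '#cite_ref-9') is ignored
    return sorted(cleaned)

# Hypothetical sample taken from the kinds of links shown in Step 2.
sample = [
    '#cite_ref-9',
    '/wiki/Statistics',
    '//foundation.wikimedia.org/wiki/Privacy_policy',
    'http://www.gfkl.org/welcome/',
    'https://doi.org/10.1093%2Fbiostatistics%2Fkxp014',
    '/wiki/Statistics',
]
result = clean_links(sample, 'http://wikipedia.org')
print(result)
```

Using a set while building the list removes duplicates as a side effect, which covers the last bullet above.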

In [6]:
domain = 'http://wikipedia.org'

In [7]:
# your code here
links_abs = [ item for item in links if item.startswith('http')  and '%' not in item]
links_rel = [ domain + item for item in links if item.startswith('/') and '%' not in item]
links_clean = list(set(links_abs + links_rel))
links_clean

['http://datamining.it.uts.edu.au/conferences/dsaa14/',
 'http://wikipedia.org/wiki/Category:Articles_with_unsourced_statements_from_April_2018',
 'http://wikipedia.org/wiki/Peter_Naur',
 'http://wikipedia.org/wiki/Decision_tree_learning',
 'http://wikipedia.org/wiki/Industry',
 'http://wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'http://wikipedia.org/wiki/Template:Machine_learning_bar',
 'http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf',
 'https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/',
 'http://wikipedia.org/wiki/General_Assembly_(school)',
 'http://wikipedia.org/wiki/Canonical_correlation_analysis',
 'http://wikipedia.org/wiki/Unsupervised_learning',
 'https://web.archive.org/web/20120403153707/http://www.jstage.jst.go.jp/browse/dsj/_vols',
 'http://wikipedia.org/wiki/Deep_learning',
 'http://wikipedia.org/wiki/Data_analysis',
 'http://wikipedia.org/wiki/Convolutional_neural_network',
 'https://cs.wikipedia.org/wiki/Da

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [8]:
import os

In [9]:
# your code here
os.makedirs('wikipedia', exist_ok=True)  # no error if the folder already exists
os.chdir('wikipedia')                    # relative path, so it works on any OS
os.getcwd()

'C:\\Users\\pedro\\Ironhack_DAFT\\Labs\\week4_remote\\Parallelization\\your-code\\wikipedia'

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows: `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.
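
The steps above can be tried offline with a sketch like the following. It is not the lab solution: `simple_slug` is a hypothetical, much simplified stand-in for `slugify`, `fake_fetch` replaces `requests.get(link).content`, and the function returns `True`/`False` instead of silently passing so that the outcome is visible.

```python
import os
import re
import tempfile

def simple_slug(text):
    """Hypothetical stand-in for slugify: lowercase, runs of other characters become '-'."""
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')

def index_page_sketch(link, fetch, folder):
    """Fetch `link` with `fetch` and save the bytes to <folder>/<slug>.html."""
    try:
        content = fetch(link)
        filename = simple_slug(link) + '.html'
        # The with-block closes the file even if the write fails.
        with open(os.path.join(folder, filename), 'wb') as f:
            f.write(content)
        return True
    except Exception:
        return False  # the lab version would simply `pass` here

def fake_fetch(link):
    """Offline substitute for a real HTTP request (assumption for this demo)."""
    if 'broken' in link:
        raise ConnectionError(link)
    return b'<html><title>demo</title></html>'

demo_dir = tempfile.mkdtemp()
ok = index_page_sketch('http://wikipedia.org/wiki/Statistics', fake_fetch, demo_dir)
failed = index_page_sketch('http://broken.example/page', fake_fetch, demo_dir)
saved = os.listdir(demo_dir)
print(ok, failed, saved)
```

Passing the fetch function in as a parameter is only a convenience for testing without network access; the lab function calls `requests` directly.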

In [10]:
from slugify import slugify

In [11]:
# your code here
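def index_page(url):
    """Download `url` and save it as <slugified page title>.html in the current folder."""
    try:
        content = requests.get(url).content
        soup = BeautifulSoup(content, 'html.parser')
        # Build the filename from the slugified page title.
        filename = slugify(soup.find('title').text) + '.html'
        # The with-block closes the file even if the write fails.
        with open(filename, 'wb') as f:
            f.write(content)
    except Exception:
        # Per the lab instructions, skip any page that fails.
        pass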

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run. 

_Hint: Use tqdm to show a progress bar and the elapsed time._

In [14]:
%%time

for link in tqdm(links_clean):
    index_page(link)

HBox(children=(IntProgress(value=0, max=283), HTML(value='')))


Wall time: 33min 21s


### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

Use both approaches. First, use the `multiprocess` module, which can run the function defined directly in the Jupyter notebook, to perform the download in parallel.

Second, create a Python file containing the download function, import it into the notebook, and run it with the standard `multiprocessing` module.

In [15]:
from multiprocessing import Pool
from index_page import index_page

In [16]:
%%time

pool = Pool(processes=4)
results = pool.map(index_page, links_clean)
pool.terminate()

Wall time: 3min 2s


In [11]:
from multiprocess import Pool

In [12]:
%%time

pool = Pool(processes=4)
results = pool.map(index_page, links_clean)
pool.terminate()

Wall time: 498 ms
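
Downloading pages is mostly waiting on the network, so running several requests at once can overlap those waits. The sketch below illustrates the effect offline, under two stated assumptions: `time.sleep` stands in for a download, and `multiprocessing.pool.ThreadPool` (a thread-based pool with the same `map` interface as `Pool`) is used so the example runs without any pickling concerns. `fake_download` and `demo_links` are hypothetical names.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_download(link):
    """Simulate a slow network request (assumption: 0.05 s per page)."""
    time.sleep(0.05)
    return link

demo_links = [f'page-{i}' for i in range(8)]

# Sequential: each simulated download waits for the previous one.
start = time.perf_counter()
sequential = [fake_download(link) for link in demo_links]
sequential_time = time.perf_counter() - start

# Parallel: four workers overlap their waiting time.
start = time.perf_counter()
with ThreadPool(processes=4) as pool:
    parallel = pool.map(fake_download, demo_links)
parallel_time = time.perf_counter() - start

print(f'sequential: {sequential_time:.2f}s, parallel: {parallel_time:.2f}s')
```

`pool.map` returns results in the same order as the input list, so the two result lists can be compared directly.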


**BONUS**: Create a function that uses the `os` module to count the files in the wikipedia folder.

Delete the files from the folder before you run, then perform the above solution asynchronously.

Use your function to check how many files have been downloaded so far.

In [14]:
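def files_qntd():
    """Count the entries in the current working directory (the wikipedia folder)."""
    return len(os.listdir())

qtd = files_qntd()
# Empty the folder before the asynchronous run.
for item in os.listdir():
    os.remove(item)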

In [None]:
from multiprocessing import Pool
from index_page import index_page
pool = Pool(processes=4)
results = pool.map_async(index_page, links_clean)
result_list = results.get()
pool.terminate()

In [None]:
qtd = files_qntd()
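
Calling `.get()` immediately after `map_async` blocks until every task has finished, so the file count can only be read at the end. One possible way to watch the count grow while the work is still running is to poll `ready()` on the result object. The sketch below is an offline illustration under stated assumptions: `fake_index_page` writes a small file after a short sleep instead of downloading, `count_files` is a hypothetical helper that takes the folder as an argument, and `ThreadPool` is used in place of `Pool` because it offers the same `map_async` interface.

```python
import os
import tempfile
import time
from multiprocessing.pool import ThreadPool

def count_files(folder):
    """Count the entries currently in `folder`."""
    return len(os.listdir(folder))

def fake_index_page(args):
    """Simulate a download, then write a small file (offline stand-in)."""
    folder, name = args
    time.sleep(0.05)
    with open(os.path.join(folder, name + '.html'), 'wb') as f:
        f.write(b'<html></html>')

demo_dir = tempfile.mkdtemp()
tasks = [(demo_dir, f'page-{i}') for i in range(12)]
observed = []

with ThreadPool(processes=4) as pool:
    async_result = pool.map_async(fake_index_page, tasks)
    # Poll while the workers are still busy.
    while not async_result.ready():
        observed.append(count_files(demo_dir))
        time.sleep(0.02)
    async_result.get()  # re-raises any exception from a worker

final_count = count_files(demo_dir)
print(observed, final_count)
```

The intermediate counts depend on timing and will differ between runs; only the final count is fixed by the number of tasks.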