In [1]:
from multiprocessing import Pool, cpu_count

In [2]:
cpu_count()

8

# Parallelization Lab

In this lab, you will apply several concepts you have learned to obtain a list of links from a web page, then crawl and index the pages those links reference, both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [4]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [6]:
response=requests.get(url)
response

<Response [200]>

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [7]:
from bs4 import BeautifulSoup

In [29]:
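# Name the parser explicitly so results do not depend on which parsers are installed.
soup = BeautifulSoup(response.content, "html.parser")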
soup
links = [link.get('href') for link in soup.find_all("a")]
links

[None,
 '#mw-head',
 '#searchInput',
 '/wiki/Information_science',
 '/wiki/Machine_learning',
 '/wiki/Data_mining',
 '/wiki/Statistical_classification',
 '/wiki/Cluster_analysis',
 '/wiki/Regression_analysis',
 '/wiki/Anomaly_detection',
 '/wiki/Automated_machine_learning',
 '/wiki/Association_rule_learning',
 '/wiki/Reinforcement_learning',
 '/wiki/Structured_prediction',
 '/wiki/Feature_engineering',
 '/wiki/Feature_learning',
 '/wiki/Online_machine_learning',
 '/wiki/Semi-supervised_learning',
 '/wiki/Unsupervised_learning',
 '/wiki/Learning_to_rank',
 '/wiki/Grammar_induction',
 '/wiki/Supervised_learning',
 '/wiki/Statistical_classification',
 '/wiki/Regression_analysis',
 '/wiki/Decision_tree_learning',
 '/wiki/Ensemble_learning',
 '/wiki/Bootstrap_aggregating',
 '/wiki/Boosting_(machine_learning)',
 '/wiki/Random_forest',
 '/wiki/K-nearest_neighbors_algorithm',
 '/wiki/Linear_regression',
 '/wiki/Naive_Bayes_classifier',
 '/wiki/Artificial_neural_network',
 '/wiki/Logistic_regre

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
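The cleaning rules above can be sketched on a small hand-written list of hrefs. This is an illustrative sketch under stated assumptions, not the lab's reference solution: `sample_hrefs`, `page_url`, and `clean_links` are hypothetical names, and `urllib.parse.urljoin` from the standard library is swapped in for plain string concatenation. Because `urljoin` resolves a relative path against the page URL, relative links land on `en.wikipedia.org`. Fragment-only links such as `#mw-head` are skipped here because they are neither absolute nor relative in the sense defined above.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Data_science"

# Hypothetical sample of href values, including a missing href and a duplicate.
sample_hrefs = [
    None,
    "#mw-head",
    "/wiki/Machine_learning",
    "/wiki/Machine_learning",
    "/wiki/Na%C3%AFve_Bayes",
    "https://arxiv.org/list/cs.LG/recent",
    "https://example.org/a%20b",
]


def clean_links(hrefs, base_url):
    """Return sorted unique full URLs, dropping any href that contains '%'."""
    absolute = [h for h in hrefs if h and h.startswith("http") and "%" not in h]
    relative = [
        urljoin(base_url, h)
        for h in hrefs
        if h and h.startswith("/") and "%" not in h
    ]
    # A set removes duplicates; sorting makes the result reproducible.
    return sorted(set(absolute + relative))


cleaned = clean_links(sample_hrefs, page_url)
print(cleaned)
```

Sorting is optional; it is used here only so that repeated runs list the links in the same order, which a bare `set` does not guarantee.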

In [35]:
domain = 'http://wikipedia.org'

In [72]:
import re  # not used below, but later cells rely on it being imported

# Resolve each href: fragments (#...) are appended to the page URL and
# relative paths (/...) to the domain; anything else is passed through unchanged.
hrefs = [link.get('href') for link in soup.find_all('a')]
relative_links = [f"{url}{h}" if str(h).startswith('#') else f"{domain}{h}" if str(h).startswith('/') else h for h in hrefs]
# Drop missing hrefs and any link that contains a percent sign.
relative_links = [l for l in relative_links if l is not None and '%' not in l]
relative_links

['https://en.wikipedia.org/wiki/Data_science#mw-head',
 'https://en.wikipedia.org/wiki/Data_science#searchInput',
 'http://wikipedia.org/wiki/Information_science',
 'http://wikipedia.org/wiki/Machine_learning',
 'http://wikipedia.org/wiki/Data_mining',
 'http://wikipedia.org/wiki/Statistical_classification',
 'http://wikipedia.org/wiki/Cluster_analysis',
 'http://wikipedia.org/wiki/Regression_analysis',
 'http://wikipedia.org/wiki/Anomaly_detection',
 'http://wikipedia.org/wiki/Automated_machine_learning',
 'http://wikipedia.org/wiki/Association_rule_learning',
 'http://wikipedia.org/wiki/Reinforcement_learning',
 'http://wikipedia.org/wiki/Structured_prediction',
 'http://wikipedia.org/wiki/Feature_engineering',
 'http://wikipedia.org/wiki/Feature_learning',
 'http://wikipedia.org/wiki/Online_machine_learning',
 'http://wikipedia.org/wiki/Semi-supervised_learning',
 'http://wikipedia.org/wiki/Unsupervised_learning',
 'http://wikipedia.org/wiki/Learning_to_rank',
 'http://wikipedia.org

In [73]:
# Keep only hrefs that begin with "http" and contain no percent sign.
absolute_links = [link.get('href') for link in soup.find_all('a', href=True) if link.get('href').startswith('http') and '%' not in link.get('href')]
absolute_links

['https://arxiv.org/list/cs.LG/recent',
 'https://en.wikipedia.org/w/index.php?title=Template:Machine_learning_bar&action=edit',
 'https://en.wikipedia.org/w/index.php?title=Data_science&action=edit',
 'http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext',
 'https://web.archive.org/web/20141109113411/http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext',
 'http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/',
 'https://web.archive.org/web/20140102194117/http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/',
 'https://www.springer.com/book/9784431702085',
 'https://books.google.com/books?id=oGs_AQAAIAAJ',
 'https://web.archive.org/web/20170320193019/https://books.google.com/books?id=oGs_AQAAIAAJ',
 'http://www.datascienceassn.org/about-data-science',
 'https://www.oreilly.com/library/view/doing-data-science/9781449363871/ch01.html',
 'https://me

In [74]:
links= absolute_links + relative_links
links=list(set(links))
links

['https://en.wikipedia.org/wiki/Data_science#Frameworks',
 'http://wikipedia.org/w/index.php?title=Data_science&action=edit&section=14',
 'https://en.wikipedia.org/wiki/Data_science#Platforms',
 'http://wikipedia.org/wiki/Artificial_neural_network',
 'https://en.wikipedia.org/wiki/Data_science#cite_ref-8',
 'http://wikipedia.org/wiki/Naive_Bayes_classifier',
 'http://wikipedia.org/wiki/Data_security',
 'http://wikipedia.org/wiki/Montpellier_2_University',
 'https://en.wikipedia.org/wiki/Data_science#cite_ref-:6_32-1',
 'http://wikipedia.org/wiki/Deep_learning',
 'https://en.wikipedia.org/wiki/Data_science#cite_note-14',
 'http://wikipedia.org/w/index.php?title=Data_science&printable=yes',
 'https://en.wikipedia.org/wiki/Data_science#cite_note-1',
 'http://wikipedia.org/wiki/Doi_(identifier)',
 'http://wikipedia.org/wiki/Wikipedia:About',
 'http://wikipedia.org/wiki/Temporal_difference_learning',
 'https://en.wikipedia.org/wiki/Data_science#cite_ref-12',
 'http://wikipedia.org/wiki/Glos

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [75]:
import os

In [76]:
# exist_ok=True lets the cell be re-run without raising FileExistsError.
os.makedirs("./wikipedia", exist_ok=True)

In [78]:
os.chdir("./wikipedia")

In [79]:
os.getcwd()

'C:\\Users\\Pedro\\ironhack\\week4-APIandPythonVisualization\\wikipedia'

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows: `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.
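The steps above can be sketched so that they run without network access. This is a minimal sketch under stated assumptions, not the expected solution: the `fetch` and `folder` parameters, the two stub fetchers, and the temporary directory are hypothetical additions that make both the success path and the exception path easy to exercise. If python-slugify is not installed, a simple regex stand-in is used instead, and it may not match the library's output for every input.

```python
import os
import re
import tempfile

try:
    from slugify import slugify  # python-slugify, as used in the lab
except ImportError:
    # Stand-in (assumption) so the sketch runs without the library installed.
    def slugify(text):
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def index_page(link, fetch, folder="."):
    """Save the page at `link` as <slug>.html in `folder`; return the path or None."""
    try:
        content = fetch(link)  # expected to return bytes
        path = os.path.join(folder, slugify(link) + ".html")
        # Binary mode writes the page bytes unchanged.
        with open(path, "wb") as f:
            f.write(content)
        return path
    except Exception:
        # The lab asks to ignore failures; returning None keeps them countable.
        return None


# Hypothetical offline fetchers, used only to exercise both code paths.
def fake_fetch(link):
    return b"<html><body>stub page</body></html>"


def failing_fetch(link):
    raise ConnectionError("simulated network failure")


demo_dir = tempfile.mkdtemp()
saved = index_page("https://en.wikipedia.org/wiki/Data_science", fake_fetch, demo_dir)
skipped = index_page("https://en.wikipedia.org/wiki/Data_science", failing_fetch, demo_dir)
print(saved, skipped)
```

In the notebook, `fetch` could be something like `lambda link: requests.get(link, timeout=10).content`, with the `wikipedia` folder as the current working directory.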

In [83]:
!pip install python-slugify

Collecting python-slugify
  Downloading https://files.pythonhosted.org/packages/9f/42/e336f96a8b6007428df772d0d159b8eee9b2f1811593a4931150660402c0/python-slugify-4.0.1.tar.gz
Collecting text-unidecode>=1.3 (from python-slugify)
  Downloading https://files.pythonhosted.org/packages/a6/a5/c0b6468d3824fe3fde30dbb5e1f687b291608f9473681bbf7dabbf5a87d7/text_unidecode-1.3-py2.py3-none-any.whl (78kB)
Building wheels for collected packages: python-slugify
  Building wheel for python-slugify (setup.py): started
  Building wheel for python-slugify (setup.py): finished with status 'done'
  Created wheel for python-slugify: filename=python_slugify-4.0.1-py2.py3-none-any.whl size=6774 sha256=148c4ce56e4a05d1536df3d084d555fac58c4c53c1019ffd163f4c2935ca5fe9
  Stored in directory: C:\Users\Pedro\AppData\Local\pip\Cache\wheels\67\b8\ba\041548f30a6fc058c9b3f79a5b7b6aea925a15dd1e5c4992a4
Successfully built python-slugify
Installing collected packages: text-unidecode, python-slugify
Successfully installed 

In [84]:
from slugify import slugify

In [108]:
response=requests.get("https://www.wikipedia.org/")
a=response.content

In [110]:
type(a.decode())

str

In [121]:
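def index_page(link):
    try:
        response = requests.get(link)
        filename = slugify(link) + ".html"
        # Write the raw bytes so the file holds valid HTML, not a b'...' string repr.
        with open(filename, "wb") as f:
            f.write(response.content)
    except Exception:
        # As the step requires, ignore any failure and move on.
        pass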

In [122]:
index_page("https://www.wikipedia.org/")

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run. 

_hint: Use tqdm to keep track of the time._ 

In [0]:
# your code here
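One possible shape for this step is sketched below, outside the notebook's `%%time` magic so that it also runs as plain Python. The names `run_sequential`, `slow_stub`, and `demo_links` are hypothetical; `slow_stub` stands in for `index_page` so the sketch needs no network, and `time.perf_counter` stands in for `%%time`. If tqdm is not installed, the loop simply runs without a progress bar.

```python
import time

try:
    from tqdm import tqdm  # progress bar suggested by the hint
except ImportError:
    # Fallback (assumption): iterate without a progress bar.
    def tqdm(iterable, **kwargs):
        return iterable


def run_sequential(links, worker):
    """Call worker(link) for each link in turn; return (results, seconds)."""
    start = time.perf_counter()
    results = [worker(link) for link in tqdm(links)]
    return results, time.perf_counter() - start


def slow_stub(link):
    """Hypothetical stand-in for index_page: wait briefly, return a value."""
    time.sleep(0.01)
    return len(link)


demo_links = [f"https://example.org/page/{i}" for i in range(4)]
seq_results, seq_seconds = run_sequential(demo_links, slow_stub)
print(f"{len(seq_results)} links in {seq_seconds:.2f}s")
```

In the lab itself, the call would be `run_sequential(links, index_page)`, or an equivalent `for` loop in a cell that starts with `%%time`.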

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

Use both methods. On the one hand, use the `multiprocess` module to run the function created in the Jupyter notebook in parallel.

On the other hand, create a Python file containing the download function and use the `multiprocessing` module to run it.

In [0]:
# your code here
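The parallel version can be sketched with the same `map` interface that `Pool` offers. To keep the sketch runnable in any context, it swaps in `multiprocessing.pool.ThreadPool` (threads, not processes) as the default; this is an assumption made for illustration, not the method the step asks for. The names `run_parallel`, `slow_stub`, and `demo_links` are hypothetical, and `slow_stub` again stands in for `index_page`.

```python
import time
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool


def run_parallel(links, worker, pool_cls=ThreadPool, processes=None):
    """Map worker over links with a pool; return (results, seconds)."""
    start = time.perf_counter()
    with pool_cls(processes or cpu_count()) as pool:
        # map() blocks until every link is processed and preserves input order.
        results = pool.map(worker, links)
    return results, time.perf_counter() - start


def slow_stub(link):
    """Hypothetical stand-in for index_page: wait briefly, return a value."""
    time.sleep(0.05)
    return len(link)


demo_links = [f"https://example.org/page/{i}" for i in range(8)]
par_results, par_seconds = run_parallel(demo_links, slow_stub, processes=4)
print(f"{len(par_results)} links in {par_seconds:.2f}s")
```

Passing `pool_cls=multiprocessing.Pool` should give the process-based variant when `worker` is imported from a separate Python file, and `multiprocess.Pool` should do the same for a function defined in the notebook. Process pools need the worker to be picklable, which is why the step distinguishes the two approaches.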