# Lab07: Web Crawling

## Overview

There are three parts to this exercise.  I would like you to place all material for this exercise in your GitHub repository in the `labs/07/` directory.

Objectives:
* Learning HTML by writing it
* Using BeautifulSoup to scrape data from webpages

<span style="color:red">Make sure you have installed BeautifulSoup into your conda environment</span>.



## Part A:  Making a webpage

I would like you to write a valid HTML webpage.  You don't need any special software to do this: any text editor (including VSCode) will do.  Name the page `exercise08.html`.  The page must contain the following elements:

* A level-1 header
* An image (download one and put it in the same folder as the page)
* A few paragraphs (about any topic)
* A few links in those paragraphs (that are to other pages on that topic)
* A table with three columns, one heading row, and at least three data rows (you can pick the topic for the table)

Make this webpage and its contents available in your GitHub repository for today's lab.
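
If you want a quick self-check before committing, the checklist above can be sketched in a few lines of Python using only the standard library's `html.parser`. This is a rough sketch, not a validator: the names `TagCounter` and `missing_elements` are hypothetical, and the minimum counts (reading "a few" as at least two) are my assumption, not a grading rule.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each opening tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def missing_elements(html_text):
    """Return the checklist tags that appear fewer times than the assumed minimum."""
    # Assumed minimums: "a few" is read here as "at least two".
    required = {"h1": 1, "img": 1, "p": 2, "a": 2, "table": 1}
    counter = TagCounter()
    counter.feed(html_text)
    return [tag for tag, minimum in required.items() if counter.counts[tag] < minimum]


# A small sample page that deliberately leaves out the table.
sample = """<!DOCTYPE html>
<html><body>
<h1>Title</h1>
<img src="photo.jpg" alt="A photo">
<p>First paragraph with a <a href="https://example.com/a">link</a>.</p>
<p>Second paragraph with another <a href="https://example.com/b">link</a>.</p>
</body></html>"""

print(missing_elements(sample))  # prints ['table']
```

To check your own page, read the file into a string and pass it to `missing_elements`. An empty list means every tag on the checklist was found at least the assumed number of times; it does not mean the HTML is valid.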


In [None]:
# There is no code for part A, but
# a code cell does break up the notebook

## Part B: A Simple Beautiful Soup

I would like you to find all the URLs used in the links in your webpage using the `find_all` method from Beautiful Soup.

You may use this starter code to assist you:

In [6]:
# This code will get you started with finding the links
# in your webpage

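from bs4 import BeautifulSoup

filename = 'exercise08.html'  # <--- change this to your file
                              # it should be in the same directory
                              # as this notebook
with open(filename) as fh:
    # name the parser explicitly so the result does not depend on
    # which optional parsers happen to be installed
    parsed_page = BeautifulSoup(fh, 'html.parser')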

In [None]:

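# collect every <a> tag in the page
alltags = parsed_page.find_all('a')
# keep the href of each tag that has one (attrs is a dict, so index it with [])
alllink = list(map(lambda x: x.attrs['href'], filter(lambda x: 'href' in x.attrs, alltags)))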

## Part C:  Crawling Wikipedia with Beautiful Soup

I'd like you to find all the unique web pages linked to from the **Young Frankenstein** Wikipedia page that are also within the Wikipedia domain.  Use the demo from today to guide your efforts.

Link: [https://en.wikipedia.org/wiki/Young_Frankenstein](https://en.wikipedia.org/wiki/Young_Frankenstein)

In [11]:
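import ssl
import urllib.parse    # import the submodules explicitly;
import urllib.request  # a bare "import urllib" does not guarantee they are loaded

url = 'https://en.wikipedia.org/wiki/Young_Frankenstein'

# the default context verifies the server's certificate
with urllib.request.urlopen(url, context=ssl.create_default_context(), timeout=5) as response:
    soup = BeautifulSoup(response.read(), 'html.parser')
    meta_data = response.info()
    response_code = response.getcode()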

print(f'Response code: {response_code}')
print(meta_data)

Response code: 200
Date: Mon, 21 Feb 2022 02:27:28 GMT
Server: mw1387.eqiad.wmnet
X-Content-Type-Options: nosniff
P3p: CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
Content-Language: en
Vary: Accept-Encoding,Cookie,Authorization
Last-Modified: Mon, 21 Feb 2022 02:24:41 GMT
Content-Type: text/html; charset=UTF-8
Age: 52279
X-Cache: cp1085 miss, cp1075 hit/16
X-Cache-Status: hit-front
Server-Timing: cache;desc="hit-front", host;desc="cp1075"
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
Report-To: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
NEL: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
Permissions-Policy: interest-cohort=()
Set-Cookie: WMF-Last-Access=21-Feb-2022;Path=/;HttpOnly;secure;Expires=Fri, 25 Mar 2022 12:00:

In [17]:
# a set keeps only the unique items

# keep only the <a> tags that have an href attribute
all_a_tags_with_href = set(filter(lambda x: 'href' in x.attrs, soup.body.find_all('a')))

# pull the link out of each of those tags
links_in_page = set(map(lambda x: x.attrs['href'], all_a_tags_with_href))

# resolve relative links against the page URL so every link is fully specified
links_in_page = set(map(lambda x: urllib.parse.urljoin(url, x), links_in_page))

len(links_in_page)

813
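
To make the normalization step concrete, here is a small sketch of what `urllib.parse.urljoin` does with the kinds of `href` values that show up on a Wikipedia page. The sample hrefs and the helper name `normalize` are made up for illustration. The sketch also strips the `#fragment` with `urllib.parse.urldefrag`, which the cell above does not do; I am assuming that links such as `#cite_note-34` should count as the same page when you are after unique web pages.

```python
from urllib.parse import urldefrag, urljoin

base = "https://en.wikipedia.org/wiki/Young_Frankenstein"

# Illustrative href values (hypothetical, not copied from the real page).
raw_hrefs = [
    "/wiki/Mel_Brooks",                     # path relative to the site root
    "#cite_note-34",                        # fragment within the same page
    "//fa.wikipedia.org/wiki/Example",      # protocol-relative link
    "https://web.archive.org/web/example",  # already absolute
    "",                                     # empty href
]


def normalize(href, base_url=base):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    page, _fragment = urldefrag(absolute)
    return page


pages = {normalize(href) for href in raw_hrefs}

for page in sorted(pages):
    print(page)
```

Five hrefs go in and four pages come out, because both the fragment link and the empty href resolve to the base page itself.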

In [22]:
# look at the host (netloc) of every link to see which domains appear
link_hosts = list(map(lambda x: urllib.parse.urlparse(x).netloc, links_in_page))
link_hosts


['en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'data.bnf.fr',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'fa.wikipedia.org',
 'web.archive.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'id.loc.gov',
 'en.wikipedia.org',
 'en.wikipedia.org',
 'cr

In [23]:
links_in_wikipedia = list(filter(lambda x: urllib.parse.urlparse(x).netloc.endswith('wikipedia.org'), links_in_page))
links_in_wikipedia

['https://en.wikipedia.org/wiki/Jesse_James_Meets_Frankenstein%27s_Daughter',
 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Last_Crusade',
 'https://en.wikipedia.org/wiki/Young_Frankenstein#cite_note-34',
 'https://en.wikipedia.org/wiki/COVID-19_pandemic',
 'https://en.wikipedia.org/wiki/Young_Frankenstein',
 'https://en.wikipedia.org/wiki/Sunday_Lovers',
 'https://en.wikipedia.org/wiki/Frankenstein%27s_Monster_(video_game)',
 'https://en.wikipedia.org/wiki/Young_Frankenstein#cite_note-41',
 'https://en.wikipedia.org/wiki/Neill_Blomkamp',
 'https://en.wikipedia.org/wiki/Victor_Frankenstein',
 'https://en.wikipedia.org/wiki/Young_Frankenstein#cite_note-19',
 'https://en.wikipedia.org/wiki/Bride_of_Frankenstein',
 'https://en.wikipedia.org/wiki/Young_Frankenstein#cite_ref-27',
 'https://en.wikipedia.org/wiki/Shuler_Hensley',
 'https://en.wikipedia.org/wiki/Apt_Pupil_(film)',
 'https://en.wikipedia.org/wiki/Template_talk:Saturn_Award_for_Best_Horror_Film',
 'https://fa.wikipedia.o

In [25]:
links_in_wikipedia = set(filter(lambda x: urllib.parse.urlparse(x).netloc.endswith('wikipedia.org'), links_in_page))
len(links_in_wikipedia)


717
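
One caveat with `netloc.endswith('wikipedia.org')`: it would also accept a host such as `notwikipedia.org`. A slightly stricter test, sketched below with a hypothetical helper named `is_wikipedia`, compares the parsed hostname against the bare domain or requires a leading dot. Whether this changes the count for this particular page is not something I have checked.

```python
from urllib.parse import urlparse


def is_wikipedia(link):
    """True when the link's host is wikipedia.org or one of its subdomains."""
    # hostname is None for links without a network location (e.g. mailto:)
    host = urlparse(link).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


print(is_wikipedia("https://en.wikipedia.org/wiki/Young_Frankenstein"))  # True
print(is_wikipedia("https://notwikipedia.org/wiki/Young_Frankenstein"))  # False
print(is_wikipedia("mailto:someone@example.com"))                        # False
```

It can be dropped into the cell above in place of the lambda, for example `set(filter(is_wikipedia, links_in_page))`.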

In [26]:
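webpages_on_wikipedia = set()
ssl_context = ssl.create_default_context()  # build once, reuse for every request

for wiki_url in links_in_wikipedia:
    try:
        with urllib.request.urlopen(wiki_url, context=ssl_context, timeout=5) as response:
            # a missing Content-Type header becomes '' rather than None
            content_type = response.info().get('Content-Type', '')
            if 'text/html' in content_type:
                webpages_on_wikipedia.add(wiki_url)
    except Exception as err:
        # catch Exception rather than using a bare except,
        # so interrupting the kernel still stops the loop
        print(f'Unable to open {wiki_url}: {err}')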
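
The content-type test in the loop can also be written with the header object's own parser. For HTTP(S) URLs, the headers returned by `urlopen` are an `http.client.HTTPMessage`, a subclass of `email.message.Message`, so `get_content_type()` is available and returns the lowercased type without parameters such as `charset`. The sketch below exercises that method on hand-built headers so it runs without network access; `is_html_response` and `make_headers` are hypothetical helper names.

```python
from email.message import Message


def is_html_response(headers):
    """True when the Content-Type header names text/html, ignoring parameters."""
    return headers.get_content_type() == "text/html"


def make_headers(content_type=None):
    """Build a header object by hand, standing in for response.headers."""
    headers = Message()
    if content_type is not None:
        headers["Content-Type"] = content_type
    return headers


print(is_html_response(make_headers("text/html; charset=UTF-8")))  # True
print(is_html_response(make_headers("image/png")))                 # False
# With no Content-Type header, get_content_type() falls back to text/plain.
print(is_html_response(make_headers()))                            # False
```

Inside the `with` block of the loop above, the equivalent call would be `is_html_response(response.headers)`.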