# Lab07: Web Crawling

## Overview

There are three parts to this exercise.  I would like you to place all material for this exercise in the `labs/07/` directory of your GitHub repository.

Objectives:
* Learning HTML by writing it
* Using BeautifulSoup to scrape data from webpages

<span style="color:red">Make sure you have installed BeautifulSoup into your conda environment</span>.



## Part A:  Making a webpage

I would like you to write a valid HTML webpage.  You don't need any special software to do this: any text editor (including VSCode) will do.  Name the page `exercise08.html`.  The page must contain the following elements:

* A level-1 header
* An image (download one and put it in the exercise/08 folder with the page)
* A few paragraphs (about any topic)
* A few links in those paragraphs (that are to other pages on that topic)
* A table with three columns, one heading row, and at least three data rows (you can pick the topic for the table)

Make this webpage and its contents available in your GitHub repository for today's lab.
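
One way to sanity-check the page before committing it is to count the required elements. The sketch below uses only the standard library's `html.parser`; the `TagCounter` class, the `count_tags` helper, and the sample markup are illustrative assumptions, not part of the lab's starter code.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each opening tag appears (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name -> number of opening tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Made-up sample markup; replace with open('exercise08.html').read()
sample = """<!DOCTYPE html>
<html><body>
<h1>Weather</h1>
<img src="cloud.png" alt="A cloud">
<p>See the <a href="http://www.weather.gov">forecast</a>.</p>
<table>
<tr><th>Day</th><th>High</th><th>Low</th></tr>
<tr><td>Mon</td><td>70</td><td>50</td></tr>
</table>
</body></html>"""

counts = count_tags(sample)
for tag in ('h1', 'img', 'p', 'a', 'table', 'th', 'td'):
    print(f'{tag}: {counts[tag]}')
```

A count of zero for any required tag suggests that element is still missing. This only counts tags; it does not confirm that the HTML is valid.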


In [12]:
# There is no code for part A, but
# a code cell does break up the notebook
fn_html = 'exercise08.html'

## Part B: A Simple Beautiful Soup

I would like you to find all the URLs used in the links in your webpage using the `find_all` method from Beautiful Soup.

You may use this starter code to assist you.

In [13]:
# This code will get you started with finding the links
# in your webpage

from bs4 import BeautifulSoup

filename='exercise08.html'  # <--- change this to your file
                            # it should be in the same directory 
                            # as this notebook
with open(filename) as fh:
    # Name the parser explicitly; otherwise Beautiful Soup guesses and warns
    parsed_page = BeautifulSoup(fh, 'html.parser')

In [14]:
all_a_tags = parsed_page.find_all('a')
all_links = list(map(lambda x: x.attrs['href'], all_a_tags))
all_links

['https://cse.msu.edu/~cse801a',
 'http://www.weather.gov',
 'http://www.msu.edu']
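
If any `<a>` tag lacks an `href` attribute (a named anchor, for instance), `x.attrs['href']` raises a `KeyError`. A minimal sketch of one way around this, using an inline HTML string made up for illustration, is to pass `href=True` to `find_all` so that only tags carrying the attribute are returned:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second <a> tag has no href attribute
html_text = """
<p><a href="http://www.msu.edu">MSU</a>
<a name="top">anchor only</a>
<a href="http://www.weather.gov">Weather</a></p>
"""

page = BeautifulSoup(html_text, 'html.parser')

# href=True keeps only the <a> tags that actually carry an href
links = [tag['href'] for tag in page.find_all('a', href=True)]
print(links)
```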

## Part C:  Crawling Wikipedia with Beautiful Soup

I'd like you to find all the unique web pages linked to from the **Young Frankenstein** Wikipedia page that are also within the Wikipedia domain.  Use the demo from today to guide your efforts.

Link: [https://en.wikipedia.org/wiki/Young_Frankenstein](https://en.wikipedia.org/wiki/Young_Frankenstein)

In [15]:
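# A bare `import urllib` is not guaranteed to load the submodules used below
import urllib.request  # urlopen
import urllib.parse    # urljoin, urlparse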

In [16]:
url = 'https://en.wikipedia.org/wiki/Young_Frankenstein'
import ssl

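# A bare ssl.SSLContext() does not verify the server's certificate by default,
# so pages whose certificates are invalid or self-signed can still be opened.
# That is acceptable for this lab but unsafe for anything sensitive.
with urllib.request.urlopen(url, context=ssl.SSLContext(), timeout=5) as response:
    soup = BeautifulSoup(response.read(), 'html.parser')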
    meta_data = response.info()
    response_code = response.getcode()

print(f'Response code: {response_code}')
print(meta_data)

Response code: 200
date: Sat, 30 Apr 2022 18:33:04 GMT
vary: Accept-Encoding,Cookie,Authorization
server: ATS/8.0.8
x-content-type-options: nosniff
p3p: CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
content-language: en
last-modified: Fri, 29 Apr 2022 18:37:03 GMT
content-type: text/html; charset=UTF-8
age: 11504
x-cache: cp1085 miss, cp1085 hit/132
x-cache-status: hit-front
server-timing: cache;desc="hit-front", host;desc="cp1085"
strict-transport-security: max-age=106384710; includeSubDomains; preload
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
set-cookie: WMF-Last-Access=30-Apr-2022;Path=/;HttpOnly;secure;Expires=Wed, 01 Jun 2022 12:00:00 GMT
set-cookie: WMF-Last-Access-Global=30-Ap

In [17]:
# Give us all unique links (unnormalized)

# Keep only the <a> tags that have an href attribute
all_a_tags_with_href = set(filter(lambda x: 'href' in x.attrs, soup.body.find_all('a')))

# We want the links from those <a> tags' href attributes
links_in_page = set(map(lambda x: x.attrs['href'], all_a_tags_with_href))

# We want to "normalize" these links to make them fully specified
links_in_page = set(map(lambda x: urllib.parse.urljoin(url, x), links_in_page))
len(links_in_page)

817
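
To see what the normalization step does, the sketch below applies `urllib.parse.urljoin` to a few hand-picked `href` values of the kinds that commonly appear in a Wikipedia page (the sample values are illustrative, not taken from the actual page). It also shows `urllib.parse.urldefrag`, an optional extra step not used above, which strips `#fragment` suffixes so that links to different sections of one page are not counted separately.

```python
from urllib.parse import urldefrag, urljoin

base = 'https://en.wikipedia.org/wiki/Young_Frankenstein'

samples = [
    '/wiki/Mel_Brooks',                 # site-relative path
    '//fr.wikipedia.org/wiki/Accueil',  # protocol-relative (scheme comes from base)
    '#Plot',                            # fragment on the same page
    'https://www.imdb.com/',            # already absolute
]

normalized = [urljoin(base, href) for href in samples]
for href, full in zip(samples, normalized):
    print(f'{href!r:40} -> {full}')

# Optional: drop fragments so '#Plot' collapses to the page itself
without_fragments = {urldefrag(link).url for link in normalized}
```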

In [18]:
# We are just interested in pages in the wikipedia.org domain
# We can restrict ourselves to en.wikipedia.org for brevity; it's fine if you do this
# in your next assignment
links_in_wikipedia = set(filter(lambda x: urllib.parse.urlparse(x).netloc.endswith('wikipedia.org'), links_in_page))
len(links_in_wikipedia)

720
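
Note that `netloc.endswith('wikipedia.org')` would also accept a host such as `notwikipedia.org`. A slightly stricter check, sketched below under the assumption that only `wikipedia.org` and its subdomains should count, compares the parsed hostname exactly or requires a leading dot. The helper name `in_wikipedia` and the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse


def in_wikipedia(link):
    """Return True if link's host is wikipedia.org or one of its subdomains."""
    # hostname is lowercased by urlparse and is None when the URL has no host
    host = urlparse(link).hostname or ''
    return host == 'wikipedia.org' or host.endswith('.wikipedia.org')


candidates = {
    'https://en.wikipedia.org/wiki/Mel_Brooks',
    'https://fr.wikipedia.org/wiki/Accueil',
    'https://notwikipedia.org/wiki/Fake',
    'https://www.imdb.com/',
    'mailto:someone@example.org',
}

kept = {link for link in candidates if in_wikipedia(link)}
print(sorted(kept))
```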

In [19]:
webpages_on_wikipedia = set()

# We want just the web pages.
# So we have to examine the meta info of "Content-Type" to find "text/html"
for wiki_url in links_in_wikipedia:
    try:
        # Something could go wrong (e.g. 404 not found), so use a try
        with urllib.request.urlopen(wiki_url, context=ssl.SSLContext(), timeout=5) as response:
            content_type = response.info()['Content-Type']
            # Alternative: content_type = response.info().get_content_type()
            if 'text/html' in content_type:
                webpages_on_wikipedia.add(wiki_url)
    except Exception:
        # Catch the error (but not KeyboardInterrupt), print a message, continue the loop
        print(f'Unable to open {wiki_url}')
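
The `Content-Type` header usually carries parameters (for example `text/html; charset=UTF-8`) and may be missing altogether. In that case `response.info()['Content-Type']` is `None`, and the `in` test raises a `TypeError` that gets reported as a failure to open the URL. A small helper, sketched here with the hypothetical name `is_html`, makes both cases explicit:

```python
def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    if content_type is None:
        # The header was absent from the response
        return False
    # Keep only the media type, dropping parameters such as charset
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'text/html'


print(is_html('text/html; charset=UTF-8'))
print(is_html('image/png'))
print(is_html(None))
```

Inside the loop, `if is_html(response.info()['Content-Type']):` could then stand in for the substring test.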

In [20]:
x = [[0], [1], [2]]
x

[[0], [1], [2]]

In [21]:
# Take the first element of each inner list
x_ints = list(map(lambda item: item[0], x))
x_ints

[0, 1, 2]