<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then create the analysis and code needed to answer the task below and reach a conclusion.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.
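
Rule 2 can be enforced in code rather than left to habit. The sketch below is one possible way to keep requests at least one second apart. The class name `Throttle` and its parameters are illustrative assumptions, not part of any library; the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Usage (assumes 'http' is a urllib3.PoolManager and 'urls' is a list of pages):
# throttle = Throttle(min_interval=1.0)
# for url in urls:
#     throttle.wait()
#     r = http.request('GET', url)
```

Calling `wait()` before every request means the first request goes out immediately and later ones are delayed only as much as needed.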

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, note that the main text sits inside several levels of HTML tags.

In [2]:
## Import Libraries
import re  # the standard library module covers the re.sub and re.search calls used below

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

### Define the content to retrieve (webpage's URL)

In [3]:
# specify the url
quote_page = 'https://marvel.fandom.com/wiki/Thor_Odinson_(Earth-616)'

### Retrieve the page
- Requires an Internet connection

In [4]:
# query the website and return the html to the variable 'page'
http = urllib3.PoolManager()
r = http.request('GET', quote_page)  # other HTTP methods include PUT, POST and DELETE
if r.status == 200:  # 200: success; 4xx: problem on our side; 5xx: problem on their side
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 800325
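
The status check above treats everything other than 200 as a problem. As a small illustration of the status-code ranges mentioned in the comment, the sketch below maps a code to its broad class. The function name `status_category` is an assumption made for this example; it is not part of `urllib3`.

```python
def status_category(status: int) -> str:
    """Return the broad class of an HTTP status code (hypothetical helper)."""
    if 100 <= status < 200:
        return 'informational'
    if 200 <= status < 300:
        return 'success'
    if 300 <= status < 400:
        return 'redirect'
    if 400 <= status < 500:
        return 'client error'   # problem on our side, e.g. 404 Not Found
    if 500 <= status < 600:
        return 'server error'   # problem on their side, e.g. 503 Service Unavailable
    return 'unknown'


print(status_category(200))  # success
print(status_category(404))  # client error
print(status_category(503))  # server error
```

In the cell above, a message such as `'Request failed (%s)' % status_category(r.status)` could make the `else` branch more informative.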


### Convert the stream of bytes into a BeautifulSoup representation

In [6]:
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [None]:
print(soup.prettify()[:1000])  # first 1000 characters of the formatted HTML

# soup.get_text() is an alternative that returns the text without the tags

### Check the HTML's Title

In [7]:
print('Title tag :%s:' % soup.title)

print('Title text :%s:' % soup.title.string)

Title tag :<title>Thor Odinson (Earth-616) | Marvel Database | Fandom</title>:
Title text :Thor Odinson (Earth-616) | Marvel Database | Fandom:
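
To experiment with these calls without an Internet connection, the same parsing steps can be run on a small inline document. The HTML below is made up for illustration; only `BeautifulSoup`, `.title`, `.string` and `.get_text()` come from the library.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page standing in for the bytes returned by the request
html = b"""<html><head><title>Sample Page | Example Wiki</title></head>
<body><div class="mw-parser-output"><p>Hello, <b>world</b>!</p></div></body></html>"""

soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.title             # the whole <title> tag
title_text = soup.title.string     # only the text inside it
paragraph = soup.find('p')

print(title_text)                  # Sample Page | Example Wiki
print(paragraph.get_text())        # Hello, world!
print(paragraph.string)            # None, because <p> has more than one child
```

The last line shows why `.get_text()` is often the safer choice: `.string` only returns text when a tag has a single child.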


### Find the main content
- Check if it is possible to use only the relevant data

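##### MediaWiki's article
- Wiki pages have used the tag `article` for the actual content of the page, as in the snippet below. On the page retrieved here, the article body sits in a `div` with class `mw-parser-output`, so that is the tag the code targets.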

        <article class="WikiaMainContent" id="WikiaMainContent">
            <div class="WikiaMainContentContainer" id="WikiaMainContentContainer">
                <div class="WikiaArticle" id="WikiaArticle">

In [8]:
tag = 'div'
attrs = {"class": "mw-parser-output"}  # attribute filter: match on the CSS class
article = soup.find_all(tag, attrs)  # every matching tag, returned as a ResultSet
print('Type of the variable \'article\':', article.__class__.__name__)

Type of the variable 'article': ResultSet
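
Because `find_all` returns a `ResultSet` (a list of tags) even when only one tag matches, it can help to compare it with `find`, which returns the first match or `None`. The snippet below is a minimal sketch on made-up HTML; the class name `sidebar` and the paragraph text are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="sidebar"><p>Navigation</p></div>
<div class="mw-parser-output"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all: every match, as a list-like ResultSet
matches = soup.find_all('div', {'class': 'mw-parser-output'})
print(len(matches))  # 1

# find: the first match only, or None when nothing matches
content = soup.find('div', class_='mw-parser-output')
missing = soup.find('div', class_='does-not-exist')

# Guard against None before descending into the tag
paragraphs = [p.get_text() for p in content.find_all('p')] if content is not None else []
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
```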


In [9]:
for a in article: #loop through output object "article"
    print(a)

<div class="mw-parser-output"><p>
<aside class="portable-infobox pi-background pi-border-color pi-theme-character pi-layout-default" role="region">
<h2 class="pi-item pi-item-spacing pi-title pi-secondary-background" data-source="Title"><a class="mw-disambig" href="/wiki/Thor" title="Thor">Thor</a></h2>
<figure class="pi-item pi-image" data-source="Image">
<a class="image image-thumbnail" href="https://static.wikia.nocookie.net/marveldatabase/images/5/55/Thor_Odinson_%28Earth-616%29_from_Empyre_Vol_1_1_001.jpg/revision/latest?cb=20200829210404" title="">
<img alt="" class="pi-image-thumbnail" data-image-key="Thor_Odinson_%28Earth-616%29_from_Empyre_Vol_1_1_001.jpg" data-image-name="Thor Odinson (Earth-616) from Empyre Vol 1 1 001.jpg" height="415" src="https://static.wikia.nocookie.net/marveldatabase/images/5/55/Thor_Odinson_%28Earth-616%29_from_Empyre_Vol_1_1_001.jpg/revision/latest/scale-to-width-down/325?cb=20200829210404" srcset="https://static.wikia.nocookie.net/marveldatabase/ima

In [10]:
words = [a.get_text() for a in article]
words

['\n\nThor\n\n\n\n\n\nGallery\n\nName\nThor Odinson\n\n\nAliases\nEditorial Names:Formerly Unworthy Thor, Thor: God of Thunder, Mighty Thor, Thor: Son of Asgard, Astonishing ThorOther Aliases:All-Father Thor,[1] All-Father Odinson,[2] Arkin Torsen,[3] Arthur,[4] Beowulf,[5] Blond Hair,[6] "bork! borkborkbork!",[7] Brood of Thunder,[8] Brood Thor,[8] Deconsecrator,[9] Donald M. Blake,[10] Donar (Old Dutch name),[11] The Gaea-Son,[12] God of Lightning and Thunder,[13] God of Thunder,[10] The Golden Avenger,[14] Hammer-Thrower,[15] Herald of Thunder,[16] Hloriddi,[17] Hrodr\'s foeman,[17] Jake Olson,[18] Jormungand\'s Fear,[17] King Thor,[19] Lightning-Caller,[15] The Lightning-Giver,[20] Longbeard\'s Son,[17] "No-Name",[21] Phoenix-Son,[2] The Scion of all Asgard,[22] Siegfried,[23] Siegmund,[24] Sigurd Jarlson,[25] Son of Earth,[26] Son of Gaea,[26] Son of Thunderbird,[27] Sparkles,[28] Storm-God,[15] Thorr,[29][30][31] Thunaer,[11] Thunder Father,[32] Thunderboy[33], The Thunderer,[34]

In [11]:
for word in words:
    print(word)



Thor





Gallery

Name
Thor Odinson


Aliases
Editorial Names:Formerly Unworthy Thor, Thor: God of Thunder, Mighty Thor, Thor: Son of Asgard, Astonishing ThorOther Aliases:All-Father Thor,[1] All-Father Odinson,[2] Arkin Torsen,[3] Arthur,[4] Beowulf,[5] Blond Hair,[6] "bork! borkborkbork!",[7] Brood of Thunder,[8] Brood Thor,[8] Deconsecrator,[9] Donald M. Blake,[10] Donar (Old Dutch name),[11] The Gaea-Son,[12] God of Lightning and Thunder,[13] God of Thunder,[10] The Golden Avenger,[14] Hammer-Thrower,[15] Herald of Thunder,[16] Hloriddi,[17] Hrodr's foeman,[17] Jake Olson,[18] Jormungand's Fear,[17] King Thor,[19] Lightning-Caller,[15] The Lightning-Giver,[20] Longbeard's Son,[17] "No-Name",[21] Phoenix-Son,[2] The Scion of all Asgard,[22] Siegfried,[23] Siegmund,[24] Sigurd Jarlson,[25] Son of Earth,[26] Son of Gaea,[26] Son of Thunderbird,[27] Sparkles,[28] Storm-God,[15] Thorr,[29][30][31] Thunaer,[11] Thunder Father,[32] Thunderboy[33], The Thunderer,[34] Uncle Thor,[35] unh

### Get some of the text
- Plain text without HTML tags

In [13]:
# remove redundant blank lines and print the cleaned text

clean_words = [re.sub(r'\n\n+', '\n', word) for word in words]
# each run of two or more newlines is replaced with a single newline

for cw in clean_words:
    print(cw)


Thor
Gallery
Name
Thor Odinson
Aliases
Editorial Names:Formerly Unworthy Thor, Thor: God of Thunder, Mighty Thor, Thor: Son of Asgard, Astonishing ThorOther Aliases:All-Father Thor,[1] All-Father Odinson,[2] Arkin Torsen,[3] Arthur,[4] Beowulf,[5] Blond Hair,[6] "bork! borkborkbork!",[7] Brood of Thunder,[8] Brood Thor,[8] Deconsecrator,[9] Donald M. Blake,[10] Donar (Old Dutch name),[11] The Gaea-Son,[12] God of Lightning and Thunder,[13] God of Thunder,[10] The Golden Avenger,[14] Hammer-Thrower,[15] Herald of Thunder,[16] Hloriddi,[17] Hrodr's foeman,[17] Jake Olson,[18] Jormungand's Fear,[17] King Thor,[19] Lightning-Caller,[15] The Lightning-Giver,[20] Longbeard's Son,[17] "No-Name",[21] Phoenix-Son,[2] The Scion of all Asgard,[22] Siegfried,[23] Siegmund,[24] Sigurd Jarlson,[25] Son of Earth,[26] Son of Gaea,[26] Son of Thunderbird,[27] Sparkles,[28] Storm-God,[15] Thorr,[29][30][31] Thunaer,[11] Thunder Father,[32] Thunderboy[33], The Thunderer,[34] Uncle Thor,[35] unhappy Hrun
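
The cleaned text still carries citation markers such as `[1]` and `[29][30][31]`. If those are not wanted, one more substitution can remove them. This is a sketch under the assumption that markers are always digits in square brackets; the function name `clean_text` and the shortened sample string are made up for the example.

```python
import re


def clean_text(text: str) -> str:
    """Strip citation markers and collapse blank lines (hypothetical helper)."""
    text = re.sub(r'\[\d+\]', '', text)    # drop markers such as [1] or [29]
    text = re.sub(r'\n{2,}', '\n', text)   # runs of newlines become one newline
    return text.strip()                    # trim leading and trailing whitespace


sample = ('\n\nThor\n\n\n\nName\nThor Odinson\n\n\nAliases\n'
          'All-Father Thor,[1] Thorr,[29][30][31] Sparkles,[28]\n')
cleaned = clean_text(sample)
print(cleaned)
```

Removing the markers before collapsing newlines keeps the two steps independent, so either one can be dropped without affecting the other.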

### Find the links in the text

In [14]:
# identify the type of tag to retrieve
# create a list with the 'href' value of every '<a>' tag (None when a tag has no href)

tag = 'a'
links = soup.find_all(tag)
tag_list = [t.get('href') for t in links]
tag_list

['//marvel.fandom.com',
 '#',
 'https://marvel.fandom.com/wiki/Marvel_Database',
 '/f',
 'https://marvel.fandom.com/wiki/Special:AllPages',
 'https://marvel.fandom.com/wiki/Special:Community',
 '/wiki/Blog:Recent_posts',
 'https://shop.fandom.com/marvel-ft4844.html',
 'https://shop.fandom.com/marvel/apparel/4844-gf33402.html',
 'https://shop.fandom.com/marvel/kids/4844-gf33417.html',
 'https://shop.fandom.com/marvel/bags/4844-gf33426.html',
 'https://shop.fandom.com/marvel/accessories/4844-gf33432.html',
 'https://shop.fandom.com/marvel/jewelry/4844-gf33447.html',
 'https://shop.fandom.com/marvel/home/4844-gf33453.html',
 'https://shop.fandom.com/marvel/kitchen/4844-gf33468.html',
 'https://shop.fandom.com/marvel/stationery/4844-gf33479.html',
 'https://shop.fandom.com/marvel/novelties/4844-gf33488.html',
 '#',
 '#',
 'https://marvel.fandom.com/wiki/Hub:Comics',
 'https://marvel.fandom.com/wiki/Hub:Events',
 'https://marvel.fandom.com/wiki/Hub:Games',
 'https://marvel.fandom.com/wiki/H
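
The list mixes several link forms: protocol-relative (`//marvel.fandom.com`), fragments (`#`), site-relative (`/f`) and absolute URLs. One way to compare them on equal terms is to resolve each against the page URL with `urljoin` from the standard library. The helper names `to_absolute` and `is_wiki_article` below are assumptions for this sketch, and the sample list is a hand-picked subset of the output above plus a `None` entry.

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = 'https://marvel.fandom.com/wiki/Thor_Odinson_(Earth-616)'


def to_absolute(hrefs, base_url=BASE_URL):
    """Resolve hrefs against the page URL, skipping empty and fragment-only links."""
    absolute = []
    for href in hrefs:
        if not href or href.startswith('#'):
            continue
        absolute.append(urljoin(base_url, href))
    return absolute


def is_wiki_article(url, host='marvel.fandom.com'):
    """True when the URL points at a /wiki/ path on the given host."""
    parts = urlsplit(url)
    return parts.netloc == host and parts.path.startswith('/wiki/')


sample = ['//marvel.fandom.com', '#', None, '/f',
          'https://marvel.fandom.com/wiki/Special:AllPages',
          '/wiki/Blog:Recent_posts',
          'https://shop.fandom.com/marvel-ft4844.html']

absolute = to_absolute(sample)
wiki_links = [u for u in absolute if is_wiki_article(u)]
print(wiki_links)
```

Resolving first means that wiki pages linked with a full `https://marvel.fandom.com/wiki/...` address are treated the same as those linked with a relative `/wiki/...` path.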

In [15]:
# keep only links to the wiki itself (relative links that start with '/wiki/')
tag_list = [t for t in tag_list if t and t.startswith('/wiki/')]
tag_list

['/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Special:Search',
 '/wiki/Special:Search',
 '/wiki/Special:Search',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Thor_Odinson_(Earth-616)?action=edit',
 '/wiki/Category:Characters',
 '/wiki/Category:Avengers_(Earth-616)/Members',
 '/wiki/Category:Heralds_of_Galactus_(Earth-616)/Members',
 '/wiki/Category:Avengers_(Hydra)_(Earth-616)/Members',
 '/wiki/Category:Thor_Corps_(Earth-15513)/Members',
 '/wiki/Category:New_Avengers_(A.I.M.)_(Earth-616)/Members',
 '/wiki/Category:Avengers_Unity_Division_(Earth-616)/Members',
 '/wiki/Category:Axis_(Avengers)_(Earth-616)/Members',
 '/wiki/Category:Avengers_(Heroes_Reborn)_(Earth-616)/Members',
 '/wiki/Category:Avengers_(1,000_AD)_(Earth-616)/Members',
 '/wiki/Category:League_of_Realms_(Earth-616)/Members',
 '/wiki/Category:God_Squad_(Earth-616)/Members',
 '
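
The output contains many repeated entries, such as `/wiki/Blog:Recent_posts`. A short sketch of removing duplicates while keeping the original order, and of turning a link into a readable title with `unquote`, follows. The helper names are assumptions, and the percent-encoded last sample entry is made up to show what `unquote` does.

```python
from urllib.parse import unquote


def unique_in_order(items):
    """Drop duplicates while preserving first-seen order."""
    return list(dict.fromkeys(items))


def link_to_title(link):
    """Turn '/wiki/Some_Page' into 'Some Page' (hypothetical helper)."""
    name = link.split('/wiki/', 1)[-1]
    return unquote(name).replace('_', ' ')


sample = ['/wiki/Blog:Recent_posts',
          '/wiki/Blog:Recent_posts',
          '/wiki/Special:Search',
          '/wiki/Blog:Recent_posts',
          '/wiki/Category:Characters',
          '/wiki/Thor_Odinson_%28Earth-616%29']  # made-up encoded variant

unique_links = unique_in_order(sample)
titles = [link_to_title(link) for link in unique_links]
print(titles)
# ['Blog:Recent posts', 'Special:Search', 'Category:Characters', 'Thor Odinson (Earth-616)']
```

`dict.fromkeys` is used instead of `set` because a set would not keep the links in page order.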

### Create a filter for unwanted types of articles

In [16]:
# create filter for undesired links
# (named 'unwanted' so that the built-in filter() is not shadowed)
unwanted = '(%s)' % '|'.join([
    'Category:',
    'File:',
    'Help:'
])

In [17]:
# remove links that match the filter
tag_list = [t for t in tag_list if not re.search(unwanted, t)]
tag_list

['/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Special:Search',
 '/wiki/Special:Search',
 '/wiki/Special:Search',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Blog:Recent_posts',
 '/wiki/Thor_Odinson_(Earth-616)?action=edit',
 '/wiki/Thor_Odinson_(Earth-616)?action=edit',
 '/wiki/Thor_Odinson_(Earth-616)?action=history',
 '/wiki/Talk:Thor_Odinson_(Earth-616)',
 '/wiki/Thor',
 '/wiki/Thor_Odinson_(Earth-616)/Gallery',
 '/wiki/Thor_Odinson',
 '/wiki/Unworthy_Thor_Vol_1',
 '/wiki/Thor:_God_of_Thunder_Vol_1',
 '/wiki/Mighty_Thor_Vol_2',
 '/wiki/Thor_Son_of_Asgard_Vol_1',
 '/wiki/Astonishing_Thor_Vol_1',
 '/wiki/All-Father',
 '/wiki/Arkin_Torsen',
 '/wiki/Beowulf',
 '/wiki/Brood',
 '/wiki/Deconsecrator',
 '/wiki/Donald_Blake',
 '/wiki/Donar',
 '/wiki/God_of_Thunder',
 '/wiki/Hloriddi',
 '/wiki/Jake_Olson',
 '/wiki/Jormungand_(Earth-616)',
 '/wiki/King_Thor',
 '/wiki/No-Nam
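
The filtered list still includes entries such as `/wiki/Special:Search`, `/wiki/Talk:...` and links ending in `?action=edit`. Whether those count as unwanted depends on the task; the sketch below assumes they do. The namespace list, the pattern and the function name `is_article_link` are all illustrative assumptions. Anchoring the pattern right after `/wiki/` keeps article titles that merely contain a colon, such as `/wiki/Thor:_God_of_Thunder_Vol_1`.

```python
import re
from urllib.parse import urlsplit

# Assumed list of namespaces to exclude; adjust to the task at hand
NAMESPACES = ('Category', 'File', 'Help', 'Blog', 'Special', 'Talk')
namespace_pattern = re.compile(
    r'^/wiki/(?:%s):' % '|'.join(re.escape(ns) for ns in NAMESPACES)
)


def is_article_link(link):
    """True for plain article links: no query string and no excluded namespace."""
    parts = urlsplit(link)
    if parts.query:  # e.g. '?action=edit' or '?action=history'
        return False
    return namespace_pattern.match(parts.path) is None


sample = ['/wiki/Blog:Recent_posts',
          '/wiki/Special:Search',
          '/wiki/Thor_Odinson_(Earth-616)?action=edit',
          '/wiki/Talk:Thor_Odinson_(Earth-616)',
          '/wiki/Thor',
          '/wiki/Thor:_God_of_Thunder_Vol_1',
          '/wiki/Category:Characters']

articles = [link for link in sample if is_article_link(link)]
print(articles)  # ['/wiki/Thor', '/wiki/Thor:_God_of_Thunder_Vol_1']
```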



---



> > > > > > > > > © 2021 Institute of Data





