<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and a conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
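
Rule 2 can be sketched as a small helper that pauses between requests. This is a minimal illustration and not part of the lab: `fetch_politely` and its `fetch` parameter are hypothetical names, and `fetch` stands for whatever function actually downloads a page. The usage line passes a stand-in function so the sketch runs without network access.

```python
import time


def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = {}
    for i, u in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request except the first
        results[u] = fetch(u)
    return results


# Usage with a stand-in fetch function (hypothetical; no network access needed)
pages = fetch_politely(['page-1', 'page-2'], fetch=lambda u: u.upper(), delay=0.1)
print(pages)
```

With the default `delay=1.0` the helper matches the one-request-per-second guideline above.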

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, note which HTML tags enclose the main text; it usually sits a few levels deep.

In [4]:
## Import Libraries
import re  # the standard library module covers every pattern used in this lab
import warnings
from urllib.parse import unquote

import requests
import urllib3
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [2]:
url = 'https://marvelcinematicuniverse.fandom.com/wiki/Scarlet_Witch'

### Retrieve the page
- Requires an Internet connection

In [3]:
http = urllib3.PoolManager()
r = http.request('GET', url)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 677993
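
The output above shows that `page` is a `bytes` object. BeautifulSoup accepts bytes directly, but string methods and regular expressions need text. A minimal sketch of the conversion follows, assuming the page is UTF-8 encoded; `sample` is a short stand-in for the real `page`, not data from the lab.

```python
# Stand-in for the downloaded `page` (the real one is several hundred kB)
sample = '<title>Scarlet Witch | Marvel Cinematic Universe Wiki | Fandom</title>'.encode('utf-8')

# errors='replace' swaps undecodable bytes for U+FFFD instead of raising
text = sample.decode('utf-8', errors='replace')

print(type(sample).__name__, len(sample))
print(type(text).__name__, text)
```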


In [9]:
# Alternative way: retrieve the page with requests
# (the status is in `req.status_code`; `req.status` raises AttributeError)
req = requests.get(url)
content = req.content

### Convert the stream of bytes into a BeautifulSoup representation

In [10]:
soup = BeautifulSoup(page, 'html.parser')

### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [14]:
print(soup.prettify()[:5000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Scarlet Witch | Marvel Cinematic Universe Wiki | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Scarlet_Witch","wgTitle":"Scarlet Witch","wgCurRevisionId":1287414,"wgRevisionId":1287414,"wgArticleId":94179,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Characters","Captain America: The Winter Soldier Characters","Avengers: Age of Ultron Characters","Captain America: Civil War Characters","Avengers: Infinity War Characters","Avengers: Endgame Characters","Doctor Strange in the Multiverse of Madness Characters","WandaVision C

### Check the HTML's Title

In [12]:
soup.title

<title>Scarlet Witch | Marvel Cinematic Universe Wiki | Fandom</title>

### Find the main content
- Check if it is possible to use only the relevant data

In [23]:
tag = 'article'
article = soup.find_all(tag)[0]

<article class="WikiaMainContent" id="WikiaMainContent">
<div class="WikiaMainContentContainer" id="WikiaMainContentContainer">
<script data-name="featured-video-mapped-to-wiki">
    (function () {
        // we can't use mw.config here because it's too early so we need to pass this data in different way
        var videoDetails = {
            mediaId: "Vz50u7iA",
            impressionsPerSession: 1,
            isDedicatedForArticle: true,
        };
        var hasVideoOnPage = null;
        var videoBridgeCountries = ["US","CA","AU","GB","UK"];

        function getCookieValue(cookieName) {
            var cookieSplit = ('; ' + document.cookie).split('; ' + cookieName + '=');
            return cookieSplit.length === 2 ? cookieSplit.pop().split(';').shift() : null;
        }

        function hasMaxedOutPlayerImpressionsInWiki() {
            var impressionsSoFar = Number(getCookieValue('playerImpressionsInWiki')) || 0;
            var allowedImpressions = Number(videoDetails.impr

In [24]:
article.text



### Get some of the text
- Plain text without HTML tags

In [34]:
print(re.sub(r'\n\n+', '\n', article.text)[500:2000])

Mom[8]Stinkin' Sister[8]Old Riding Hood[9]Sis[9]Magical Girl[10]Toots[1]Superstar[1]Baby Witch[1]Buttercup[1]Angel[1]
Species
Human
Citizenship
 Sokovian
Gender
Female
Date of Birth
Early 1989
Date of Death
Spring 2018 (victim of the Snap; resurrected by Hulk in 2023)
Affiliation
 HYDRA (formerly) Avengers (formerly)
Status
Alive
Appearances
Movie
Captain America: The Winter Soldier (mid-credits scene)Avengers: Age of UltronCaptain America: Civil WarAvengers: Infinity WarAvengers: EndgameSpider-Man: Far From Home (picture)Doctor Strange in the Multiverse of Madness (unreleased)
TV Series
Agents of S.H.I.E.L.D. (mentioned)WandaVision (8 episodes)
Web Series
WHiH Newsfront (footage)Team Thor (drawing)
Docuseries
Legends
Comic
Avengers: Age of Ultron Prelude - This Scepter'd IsleCaptain America: Road to WarSpider-Man: Homecoming PreludeAvengers: Infinity War PreludeCaptain Marvel Prelude (computer screen)Avengers: Endgame PreludeBlack Widow Prelude (flashbacks)
Actors/Actresses
Portrayed 
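
The substitution in the cell above collapses every run of two or more newlines into a single newline. The sketch below shows that step on a short stand-in string modelled on the output. The second substitution, which replaces citation markers such as `[8]` with a separator, is an assumption added for illustration and is not part of the lab.

```python
import re

# Stand-in for article.text, modelled on the output above
raw = "Mom[8]Sis[9]\n\n\nSpecies\n\nHuman\n\n\n\nGender\nFemale"

# Same pattern as the lab: two or more newlines become one
collapsed = re.sub(r'\n\n+', '\n', raw)

# Assumption (not in the lab): turn citation markers like [8] into separators
cleaned = re.sub(r'\[\d+\]', '; ', collapsed)

print(cleaned)
```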

### Find the links in the text

In [44]:
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in article.find_all(tag)]
tag_list

['/wiki/Wanda_Maximoff_(disambiguation)',
 'https://static.wikia.nocookie.net/marvelcinematicuniverse/images/3/30/Wanda_from_WandaVision.jpg/revision/latest?cb=20210102021722',
 '#cite_note-WV108-1',
 '/wiki/Quicksilver',
 '#cite_note-CATWSPCS-2',
 '#cite_note-AAoU-3',
 '/wiki/Quicksilver',
 '#cite_note-AAoU-3',
 '/wiki/Wolfgang_von_Strucker',
 '/wiki/Quicksilver',
 '#cite_note-AAoU-3',
 '/wiki/Quicksilver',
 '#cite_note-AAoU-3',
 '#cite_note-CACW-4',
 '#cite_note-AIW-5',
 '#cite_note-WV101-6',
 '#cite_note-WV101-6',
 '#cite_note-WV101-6',
 '#cite_note-WV102-7',
 '#cite_note-WV102-7',
 '#cite_note-WV102-7',
 '#cite_note-WV102-7',
 '#cite_note-WV105-8',
 '#cite_note-WV105-8',
 '#cite_note-WV106-9',
 '#cite_note-WV106-9',
 '/wiki/Magic',
 '#cite_note-WV107-10',
 '#cite_note-WV108-1',
 '#cite_note-WV108-1',
 '#cite_note-WV108-1',
 '#cite_note-WV108-1',
 '#cite_note-WV108-1',
 '/wiki/Humans',
 '/wiki/Sokovia',
 '/wiki/1980s#1989',
 '/wiki/2018',
 '/wiki/Snap',
 '/wiki/Hulk',
 '/wiki/2023',
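
The list mixes several kinds of `href` values: links to other wiki pages, in-page citation anchors, and external files. An `<a>` tag without an `href` attribute yields `None`. The sketch below counts each kind on a small sample taken from the output; `classify` is a hypothetical helper and the category names are arbitrary.

```python
from collections import Counter

# Sample hrefs copied from the output above, plus None for a tag without href
hrefs = [
    '/wiki/Wanda_Maximoff_(disambiguation)',
    'https://static.wikia.nocookie.net/marvelcinematicuniverse/images/3/30/Wanda_from_WandaVision.jpg/revision/latest?cb=20210102021722',
    '#cite_note-WV108-1',
    '/wiki/Quicksilver',
    None,
]


def classify(href):
    """Return a coarse label for one href value."""
    if not href:
        return 'missing'
    if href.startswith('#'):
        return 'fragment'
    if href.startswith('/wiki/'):
        return 'wiki page'
    return 'other'


counts = Counter(classify(h) for h in hrefs)
print(counts)
```

A count like this helps decide which prefixes to keep before filtering the list.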

In [45]:
tag_list = [t[6:] for t in tag_list if (t) and (t.startswith('/wiki/'))]
tag_list

['Wanda_Maximoff_(disambiguation)',
 'Quicksilver',
 'Quicksilver',
 'Wolfgang_von_Strucker',
 'Quicksilver',
 'Quicksilver',
 'Magic',
 'Humans',
 'Sokovia',
 '1980s#1989',
 '2018',
 'Snap',
 'Hulk',
 '2023',
 'HYDRA',
 'Avengers',
 'Captain_America:_The_Winter_Soldier',
 'List_of_Post-credits_Scenes',
 'Avengers:_Age_of_Ultron',
 'Captain_America:_Civil_War',
 'Avengers:_Infinity_War',
 'Avengers:_Endgame',
 'Spider-Man:_Far_From_Home',
 'Doctor_Strange_in_the_Multiverse_of_Madness',
 'Agents_of_S.H.I.E.L.D.',
 'WandaVision',
 'WHiH_Newsfront_(web_series)',
 'Team_Thor',
 'Legends',
 'Avengers:_Age_of_Ultron_Prelude_-_This_Scepter%27d_Isle',
 'Captain_America:_Road_to_War',
 'Spider-Man:_Homecoming_Prelude',
 'Avengers:_Infinity_War_Prelude',
 'Captain_Marvel_Prelude',
 'Avengers:_Endgame_Prelude',
 'Black_Widow_Prelude',
 'Elizabeth_Olsen',
 'Michaela_Russell',
 'Sophia_Gaidarova',
 'Vision',
 'Captain_America:_Civil_War',
 'Sokovia',
 'Pietro_Maximoff',
 'HYDRA',
 'Scepter',
 'Wolf
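
Some entries, such as `1980s#1989`, still carry a fragment that points to a section within a page. To request a linked page, a relative link must also be resolved against the site address. A minimal sketch using the standard library follows; `base` repeats the lab's `url`, and `resolved` is a hypothetical name.

```python
from urllib.parse import urljoin, urldefrag

base = 'https://marvelcinematicuniverse.fandom.com/wiki/Scarlet_Witch'

resolved = []
for href in ['/wiki/Quicksilver', '/wiki/1980s#1989']:
    absolute = urljoin(base, href)                # resolve against the site address
    page_url, fragment = urldefrag(absolute)      # split off the '#...' part
    resolved.append((page_url, fragment))

for page_url, fragment in resolved:
    print(page_url, repr(fragment))
```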

In [48]:
tag_list = list(set(tag_list))
tag_list

['Stan_Nielson',
 'Nick_Fury',
 'List_of_Post-credits_Scenes',
 'Korg',
 'Stark_Eco-Compound',
 'James_Rhodes',
 'FBI',
 'Valkyrie',
 'Joss_Whedon',
 'Pepper_Potts',
 'Soul_Stone',
 'Cameron_Klein',
 'Jimmy_Woo',
 'Agnes',
 'Darcy_Lewis',
 'HYDRA_Research_Base',
 'Guardians_of_the_Galaxy',
 'South_Africa',
 'Masters_of_the_Mystic_Arts',
 'Anthony_Russo',
 'Avengers:_Age_of_Ultron',
 'Helen_Cho',
 'Morgan_Stark',
 'Mind_Stone',
 'Special:Categories',
 'Arc_Reactor',
 'Category:Sorcerers',
 'Avengers:_Age_of_Ultron_Prelude_-_This_Scepter%27d_Isle',
 'Thresher',
 'Chitauri',
 'War_Machine',
 'Hulk',
 'Iron_Man',
 'Wasp',
 'Category:Comics_Characters',
 'Asgardians',
 'Crossbones',
 'Monica_Rambeau',
 'WandaVision',
 'Stark_Sonic_Cannon',
 'Avengers_Civil_War',
 'Wakandan_Royal_Guard',
 'Ultron',
 'C.C._Ice',
 'Michaela_Russell',
 '1980s',
 'Ayo',
 'Clint_Barton',
 'Sophia_Gaidarova',
 'Wormhole',
 'Delly_Allen',
 'Filmed_Before_a_Live_Studio_Audience',
 'Oleg_Maximoff',
 'Luis%27_Van',
 '
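
Converting to a `set` removes duplicates but does not keep the order in which the links appeared on the page, as the shuffled output shows. If the original order matters, one alternative is `dict.fromkeys`, since dictionaries preserve insertion order in Python 3.7 and later. The sketch below uses a short sample from the earlier output; `unique_in_order` is a hypothetical name.

```python
# Sample with duplicates, taken from the earlier output
links = ['Quicksilver', 'Quicksilver', 'Wolfgang_von_Strucker', 'Quicksilver', 'Magic']

# Keys of a dict are unique and keep first-seen order
unique_in_order = list(dict.fromkeys(links))

print(unique_in_order)
```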

In [49]:
tag_list = [unquote(t) for t in tag_list]
tag_list

['Stan_Nielson',
 'Nick_Fury',
 'List_of_Post-credits_Scenes',
 'Korg',
 'Stark_Eco-Compound',
 'James_Rhodes',
 'FBI',
 'Valkyrie',
 'Joss_Whedon',
 'Pepper_Potts',
 'Soul_Stone',
 'Cameron_Klein',
 'Jimmy_Woo',
 'Agnes',
 'Darcy_Lewis',
 'HYDRA_Research_Base',
 'Guardians_of_the_Galaxy',
 'South_Africa',
 'Masters_of_the_Mystic_Arts',
 'Anthony_Russo',
 'Avengers:_Age_of_Ultron',
 'Helen_Cho',
 'Morgan_Stark',
 'Mind_Stone',
 'Special:Categories',
 'Arc_Reactor',
 'Category:Sorcerers',
 "Avengers:_Age_of_Ultron_Prelude_-_This_Scepter'd_Isle",
 'Thresher',
 'Chitauri',
 'War_Machine',
 'Hulk',
 'Iron_Man',
 'Wasp',
 'Category:Comics_Characters',
 'Asgardians',
 'Crossbones',
 'Monica_Rambeau',
 'WandaVision',
 'Stark_Sonic_Cannon',
 'Avengers_Civil_War',
 'Wakandan_Royal_Guard',
 'Ultron',
 'C.C._Ice',
 'Michaela_Russell',
 '1980s',
 'Ayo',
 'Clint_Barton',
 'Sophia_Gaidarova',
 'Wormhole',
 'Delly_Allen',
 'Filmed_Before_a_Live_Studio_Audience',
 'Oleg_Maximoff',
 "Luis'_Van",
 'Spid
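
`unquote` turns percent-encoded sequences such as `%27` back into the characters they stand for. The sketch below applies it to two entries from the output and, as an extra step that is not part of the lab, replaces underscores with spaces to get a readable title. `readable_title` is a hypothetical helper.

```python
from urllib.parse import unquote


def readable_title(slug):
    """Decode percent-escapes and replace underscores with spaces."""
    return unquote(slug).replace('_', ' ')


print(unquote('Luis%27_Van'))
print(readable_title('Avengers:_Age_of_Ultron_Prelude_-_This_Scepter%27d_Isle'))
```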

### Create a filter for unwanted types of articles

In [51]:
# Regex alternation of substrings that mark unwanted links.
# The dots are escaped so that '.png' and '.jpg' match a literal dot.
link_filter = '(%s)' % '|'.join([
    r'\.png',
    r'\.jpg',
    'Category:',
    'Special',
    'production',
    'Season'
])
# remove the links that match the filter
tag_list = [t for t in tag_list if not re.search(link_filter, t)]
tag_list

['Stan_Nielson',
 'Nick_Fury',
 'List_of_Post-credits_Scenes',
 'Korg',
 'Stark_Eco-Compound',
 'James_Rhodes',
 'FBI',
 'Valkyrie',
 'Joss_Whedon',
 'Pepper_Potts',
 'Soul_Stone',
 'Cameron_Klein',
 'Jimmy_Woo',
 'Agnes',
 'Darcy_Lewis',
 'HYDRA_Research_Base',
 'Guardians_of_the_Galaxy',
 'South_Africa',
 'Masters_of_the_Mystic_Arts',
 'Anthony_Russo',
 'Avengers:_Age_of_Ultron',
 'Helen_Cho',
 'Morgan_Stark',
 'Mind_Stone',
 'Arc_Reactor',
 "Avengers:_Age_of_Ultron_Prelude_-_This_Scepter'd_Isle",
 'Thresher',
 'Chitauri',
 'War_Machine',
 'Hulk',
 'Iron_Man',
 'Wasp',
 'Asgardians',
 'Crossbones',
 'Monica_Rambeau',
 'WandaVision',
 'Stark_Sonic_Cannon',
 'Avengers_Civil_War',
 'Wakandan_Royal_Guard',
 'Ultron',
 'C.C._Ice',
 'Michaela_Russell',
 '1980s',
 'Ayo',
 'Clint_Barton',
 'Sophia_Gaidarova',
 'Wormhole',
 'Delly_Allen',
 'Filmed_Before_a_Live_Studio_Audience',
 'Oleg_Maximoff',
 "Luis'_Van",
 'Spider-Man',
 "T'Challa",
 'S.W.O.R.D.',
 'Berlin',
 'Kraglin_Obfonteri',
 'S.W.O
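
In a regular expression an unescaped `.` matches any character, so a pattern such as `.png` also matches text with no dot in it. One way to build a filter from plain substrings is to pass each through `re.escape`. The sketch below is an illustration under that assumption; `Poster.png` and `Xpng` are made-up test values, not links from the page.

```python
import re

unwanted = ['.png', '.jpg', 'Category:', 'Special', 'production', 'Season']

# re.escape makes every character literal, so '.' only matches a real dot
pattern = re.compile('|'.join(re.escape(s) for s in unwanted))

# 'Poster.png' is a hypothetical entry added to exercise the '.png' rule
candidates = ['Nick_Fury', 'Category:Sorcerers', 'Special:Categories', 'Poster.png']
kept = [c for c in candidates if not pattern.search(c)]
print(kept)

# Difference between the unescaped and escaped forms on a made-up string
print(bool(re.search('.png', 'Xpng')), bool(re.search(re.escape('.png'), 'Xpng')))
```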



---

© 2021 Institute of Data



