<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and a conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
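
As a minimal sketch of rule 2, the loop below pauses between requests and skips any URL that a `robots.txt` file disallows. The `robots.txt` content, the `example.org` URLs, and the helper names `make_robot_parser` and `polite_fetch` are assumptions made for illustration; the `fetch` argument stands in for whatever function actually performs the HTTP request.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_robot_parser(robots_text):
    """Build a RobotFileParser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def polite_fetch(urls, fetch, robots, delay=1.0, user_agent='*'):
    """Call `fetch` on each allowed URL, sleeping `delay` seconds between calls."""
    results = {}
    first = True
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # skip pages the site asks crawlers to avoid
        if not first:
            time.sleep(delay)  # at most one request per `delay` seconds
        first = False
        results[url] = fetch(url)
    return results

robots = make_robot_parser(ROBOTS_TXT)
pages = polite_fetch(
    ['https://example.org/wiki/Page_A',
     'https://example.org/private/Page_B',
     'https://example.org/wiki/Page_C'],
    fetch=lambda url: '<html>stub for %s</html>' % url,  # stand-in for a real HTTP call
    robots=robots,
    delay=1.0,
)
print(list(pages))
```

Passing the fetch function as an argument keeps the pacing logic separate from the HTTP library, so the same loop can be tried with a stub before it touches a real site.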

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

From the result, note which HTML tags enclose the main text; it usually sits a few levels deep.

In [2]:
## Import Libraries
import re  # the standard-library module covers every pattern used in this lab

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [4]:
quote_page = 'https://elderscrolls.fandom.com/wiki/The_Elder_Scrolls_Wiki'

### Retrieve the page
- Requires an Internet connection

In [5]:
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 249188
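
The retrieval step can also be wrapped in a small function that raises an error instead of printing one. This is only a sketch: the name `fetch_page`, the `User-Agent` string, and the 10-second timeout are assumptions, and `http` is expected to be a `urllib3.PoolManager` (or any object with a compatible `request` method).

```python
def fetch_page(http, url, user_agent='web-scraping-lab/0.1', timeout=10.0):
    """Return the response body as bytes, or raise if the status is not 200."""
    response = http.request(
        'GET',
        url,
        headers={'User-Agent': user_agent},  # identify the script to the server
        timeout=timeout,                     # give up instead of waiting forever
    )
    if response.status != 200:
        raise RuntimeError('Request for %s failed with status %s' % (url, response.status))
    return response.data

# Possible usage with the objects defined in this notebook:
# page = fetch_page(urllib3.PoolManager(), quote_page)
```

Raising an exception stops later cells from running against a `page` variable that was never assigned.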


### Convert the stream of bytes into a BeautifulSoup representation

In [6]:
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [7]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Elder Scrolls | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"The_Elder_Scrolls_Wiki","wgTitle":"The Elder Scrolls Wiki","wgCurRevisionId":3121405,"wgRevisionId":3121405,"wgArticleId":123863,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["The Elder Scrolls Wiki"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","

### Check the HTML's Title

In [8]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>Elder Scrolls | Fandom</title>:
Title text:Elder Scrolls | Fandom:


### Find the main content
- Check if it is possible to use only the relevant data
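
One way to keep only the relevant data is to locate the container that the browser inspection pointed to and search inside it. The sketch below runs on a small inline HTML string; the `id` value `content` and the variable names are assumptions for illustration, so substitute whatever tag and attribute the inspected page actually uses.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the id 'content' is an assumption, not taken from the real site.
html = """
<html><body>
  <nav><a href="/wiki/Special:Search">Search</a></nav>
  <div id="content">
    <p>Welcome to the wiki.</p>
    <a href="/wiki/Skyrim">Skyrim</a>
  </div>
</body></html>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None when nothing matches
main = demo_soup.find('div', id='content')

# searches on `main` only see tags nested inside that container
main_links = [a.get('href') for a in main.find_all('a')]
main_text = main.get_text(separator=' ', strip=True)

print(main_links)
print(main_text)
```

Because `find()` can return `None`, it is worth checking the result before calling methods on it when working with a live page.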

### Get some of the text
- Plain text without HTML tags

In [10]:
# show the first 500 characters after removing redundant newlines
print(re.sub(r'\n\n+', '\n', soup.text)[:500])


Elder Scrolls | Fandom
 
	Games
	Movies
	TV
	Video
Wikis
 
	Explore Wikis
	Community Central
	Start a Wiki
  
				Search			
							This wiki						
								This wiki							
								All wikis							
				 | 
			
 
 
 
 
					Sign In				
Don't have an account?
					Register				
	Start a Wiki
Elder Scrolls
67,822
Pages
 Add new page
Elder Scrolls Online
 
Quests
 
Main Quest
Aldmeri Dominion
Daggerfall Covenant
Ebonheart Pact
Side Quests
Alliances
 
Aldmeri Dominion
Daggerfall Covenant
Ebonheart Pac
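
The output above still contains lines made up only of tabs and spaces, because the pattern `\n\n+` matches consecutive newlines only. A possible alternative, sketched here with a hypothetical helper named `tidy_text` and a short sample string, strips each line and drops the ones that end up empty.

```python
def tidy_text(text):
    """Strip every line and drop the lines that are empty after stripping."""
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)

# Short sample that imitates the whitespace seen in the page text
sample = 'Elder Scrolls | Fandom\n \n\tGames\n\tMovies\n\t\t\t\tSearch\t\t\t\n'
print(tidy_text(sample))
```

With the notebook's variables this could be called as `tidy_text(soup.text)[:500]`.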


### Find the links in the text

In [12]:
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in soup.find_all(tag)]
tag_list

['https://www.fandom.com/',
 'https://www.fandom.com/topics/games',
 'https://www.fandom.com/topics/movies',
 'https://www.fandom.com/topics/tv',
 'https://www.fandom.com/video',
 'https://www.fandom.com/explore',
 '//community.fandom.com/wiki/Community_Central',
 '//community.fandom.com/wiki/Special:CreateNewWiki',
 None,
 None,
 'https://www.fandom.com/signin?redirect=https%3A%2F%2Felderscrolls.fandom.com%2Fwiki%2FThe_Elder_Scrolls_Wiki',
 'https://www.fandom.com/register?redirect=https%3A%2F%2Felderscrolls.fandom.com%2Fwiki%2FThe_Elder_Scrolls_Wiki',
 '//community.fandom.com/wiki/Special:CreateNewWiki',
 '//elderscrolls.fandom.com',
 '//elderscrolls.fandom.com',
 '/wiki/Special:CreatePage',
 '/wiki/Portal:Online',
 '/wiki/Quests_(Online)',
 '/wiki/Main_Quest_(Online)',
 '/wiki/Aldmeri_Dominion_Quests',
 '/wiki/Daggerfall_Covenant_Quests',
 '/wiki/Ebonheart_Pact_Quests',
 '/wiki/Category:Online:_Side_Quests',
 '/wiki/Alliances',
 '/wiki/Aldmeri_Dominion_(Online)',
 '/wiki/Daggerfall_
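
The list mixes absolute URLs, protocol-relative URLs (starting with `//`), site-relative paths (starting with `/wiki/`), and `None` for `<a>` tags without an `href`. If full URLs are needed, `urllib.parse.urljoin` can resolve each form against the page URL. The sketch below uses a few values copied from the output above; the variable names are assumptions.

```python
from urllib.parse import urljoin

base_url = 'https://elderscrolls.fandom.com/wiki/The_Elder_Scrolls_Wiki'
hrefs = [
    'https://www.fandom.com/',                        # already absolute
    '//community.fandom.com/wiki/Community_Central',  # protocol-relative
    None,                                             # <a> tag without an href
    '/wiki/Portal:Online',                            # site-relative
]

# drop missing hrefs, then resolve the rest against the page URL
absolute_links = [urljoin(base_url, h) for h in hrefs if h]
for link in absolute_links:
    print(link)
```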

In [13]:
tag_list = [t[6:] for t in tag_list if (t) and (t.startswith('/wiki/'))]
tag_list

['Special:CreatePage',
 'Portal:Online',
 'Quests_(Online)',
 'Main_Quest_(Online)',
 'Aldmeri_Dominion_Quests',
 'Daggerfall_Covenant_Quests',
 'Ebonheart_Pact_Quests',
 'Category:Online:_Side_Quests',
 'Alliances',
 'Aldmeri_Dominion_(Online)',
 'Daggerfall_Covenant',
 'Ebonheart_Pact',
 'Classes_(Online)',
 'Dragonknight',
 'Nightblade_(Online)',
 'Sorcerer_(Online)',
 'Templar',
 'Warden',
 'Necromancer_(Online)',
 'Races_(Online)',
 'Altmer_(Online)',
 'Argonian_(Online)',
 'Bosmer_(Online)',
 'Breton_(Online)',
 'Dunmer_(Online)',
 'Imperial_(Online)',
 'Khajiit_(Online)',
 'Nord_(Online)',
 'Orsimer_(Online)',
 'Redguard_(Online)',
 'Locations_(Online)',
 'Regions_(Online)',
 'Category:Online:_Realms',
 'Category:Online:_Cities',
 'Category:Online:_Delves',
 'Category:Online:_Dungeons',
 'Category:Online:_Dark_Anchors',
 'Wayshrines_(Online)',
 'Category:Online:_Unmarked_Locations',
 'Category:Online:_Gameplay',
 'Combat_(Online)',
 'Skills_(Online)',
 'Ultimate_Skills',
 'Syner

In [14]:
tag_list = [re.sub('_', ' ', t) for t in tag_list]
tag_list

['Special:CreatePage',
 'Portal:Online',
 'Quests (Online)',
 'Main Quest (Online)',
 'Aldmeri Dominion Quests',
 'Daggerfall Covenant Quests',
 'Ebonheart Pact Quests',
 'Category:Online: Side Quests',
 'Alliances',
 'Aldmeri Dominion (Online)',
 'Daggerfall Covenant',
 'Ebonheart Pact',
 'Classes (Online)',
 'Dragonknight',
 'Nightblade (Online)',
 'Sorcerer (Online)',
 'Templar',
 'Warden',
 'Necromancer (Online)',
 'Races (Online)',
 'Altmer (Online)',
 'Argonian (Online)',
 'Bosmer (Online)',
 'Breton (Online)',
 'Dunmer (Online)',
 'Imperial (Online)',
 'Khajiit (Online)',
 'Nord (Online)',
 'Orsimer (Online)',
 'Redguard (Online)',
 'Locations (Online)',
 'Regions (Online)',
 'Category:Online: Realms',
 'Category:Online: Cities',
 'Category:Online: Delves',
 'Category:Online: Dungeons',
 'Category:Online: Dark Anchors',
 'Wayshrines (Online)',
 'Category:Online: Unmarked Locations',
 'Category:Online: Gameplay',
 'Combat (Online)',
 'Skills (Online)',
 'Ultimate Skills',
 'Syner

In [15]:
tag_list = list(set(tag_list))
tag_list

['Portal:Daggerfall',
 'Category:Lore: Factions',
 'Category:Items',
 'Breton (Online)',
 'Bethesda Softworks',
 'Marriage',
 'The Elder Scrolls IV: Shivering Isles',
 'Crown Store',
 'Category:Skyrim: Unobtainable Items',
 'The Elder Scrolls Wiki:Vandalism in progress',
 'The Elder Scrolls Online',
 'Imperial (Online)',
 'Followers (Skyrim)',
 'User blog:Amulet of Kings/ESO: Markarth %26 Update 28 Released on PC %26 Mac',
 'Races (Online)',
 'Synergy',
 'Altmer (Online)',
 'The Elder Scrolls III: Morrowind',
 'Nord (Online)',
 'The Queen%27s Decree',
 'Dragon Shouts',
 'Category:The Elder Scrolls Wiki',
 'Help:Editing',
 'Help:Talk page',
 'The Elder Scrolls Wiki:Chat',
 'Category:Online: Cities',
 'Category:Content',
 'Sorcerer (Online)',
 'Races (Skyrim)',
 'Skills (Skyrim)',
 'Creation Club',
 'The Elder Scrolls: Blades',
 'The Elder Scrolls Wiki:News',
 'The Elder Scrolls Wiki:Style and Formatting',
 'Quests (Summerset)',
 'Map (Skyrim)',
 'Dragons (Skyrim)',
 'Category:Games',
 '
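
Some titles in this output are still percent-encoded, for example `The Queen%27s Decree`. The `unquote` function imported at the top of the notebook decodes such sequences. A minimal sketch on two titles from the output (the variable names are assumptions):

```python
from urllib.parse import unquote

titles = ['The Queen%27s Decree', 'Races (Online)']

# %27 is the percent-encoding of an apostrophe; unencoded titles pass through unchanged
decoded = [unquote(t) for t in titles]
print(decoded)
```

Applied to the notebook's list, this would be `tag_list = [unquote(t) for t in tag_list]`.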

In [16]:
tag_list.sort()
tag_list

['Abilities (Skyrim)',
 'Alchemy (Skyrim)',
 'Aldmeri Dominion (Online)',
 'Aldmeri Dominion Quests',
 'Alliances',
 'Altmer (Online)',
 'An Elder Scrolls Legend: Battlespire',
 'Argonian (Online)',
 'Armor (Skyrim)',
 'Bethesda Softworks',
 'Bitter Coast (Online)',
 'Blog:Recent posts',
 'Bosmer (Online)',
 'Breton (Online)',
 'Category:Audio File',
 'Category:Books',
 'Category:Characters',
 'Category:Concept Art',
 'Category:Content',
 'Category:ESO Morrowind: Location Images',
 'Category:Factions',
 'Category:Gameplay',
 'Category:Games',
 'Category:Images',
 'Category:Items',
 'Category:Locations',
 'Category:Lore',
 'Category:Lore: Characters',
 'Category:Lore: Concepts',
 'Category:Lore: Events',
 'Category:Lore: Factions',
 'Category:Lore: Locations',
 'Category:Online: Cities',
 'Category:Online: Dark Anchors',
 'Category:Online: Delves',
 'Category:Online: Dungeons',
 'Category:Online: Gameplay',
 'Category:Online: Official plug-ins',
 'Category:Online: Realms',
 'Category:On

### Create a filter for unwanted types of articles

In [17]:
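# build a regex that matches the prefixes of undesired links
# (named `unwanted` so that it does not shadow the built-in `filter`)
unwanted = '(%s)' % '|'.join([
    'User:',
    'Blog:',
])
# keep only the links that do not match the pattern
tag_list = [t for t in tag_list if not re.search(unwanted, t)]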
tag_list

['Abilities (Skyrim)',
 'Alchemy (Skyrim)',
 'Aldmeri Dominion (Online)',
 'Aldmeri Dominion Quests',
 'Alliances',
 'Altmer (Online)',
 'An Elder Scrolls Legend: Battlespire',
 'Argonian (Online)',
 'Armor (Skyrim)',
 'Bethesda Softworks',
 'Bitter Coast (Online)',
 'Bosmer (Online)',
 'Breton (Online)',
 'Category:Audio File',
 'Category:Books',
 'Category:Characters',
 'Category:Concept Art',
 'Category:Content',
 'Category:ESO Morrowind: Location Images',
 'Category:Factions',
 'Category:Gameplay',
 'Category:Games',
 'Category:Images',
 'Category:Items',
 'Category:Locations',
 'Category:Lore',
 'Category:Lore: Characters',
 'Category:Lore: Concepts',
 'Category:Lore: Events',
 'Category:Lore: Factions',
 'Category:Lore: Locations',
 'Category:Online: Cities',
 'Category:Online: Dark Anchors',
 'Category:Online: Delves',
 'Category:Online: Dungeons',
 'Category:Online: Gameplay',
 'Category:Online: Official plug-ins',
 'Category:Online: Realms',
 'Category:Online: Side Quests',
 '
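
Note that the pattern above is case-sensitive, so a title such as `User blog:…` matches neither `User:` nor `Blog:`. The sketch below is a variation on the cell above, not the same filter: it anchors the match at the start of the title, ignores case, and escapes the prefixes. The helper name `build_namespace_filter` and the extra prefix `User blog:` are assumptions.

```python
import re

def build_namespace_filter(prefixes):
    """Compile a case-insensitive pattern matching titles that start with any prefix."""
    alternatives = '|'.join(re.escape(p) for p in prefixes)
    return re.compile(r'^(?:%s)' % alternatives, re.IGNORECASE)

unwanted_pattern = build_namespace_filter(['User:', 'User blog:', 'Blog:'])

# Sample titles copied from earlier outputs in this notebook
titles = [
    'Alliances',
    'Blog:Recent posts',
    'User blog:Amulet of Kings/ESO: Markarth %26 Update 28 Released on PC %26 Mac',
    'Category:Books',
]
kept = [t for t in titles if not unwanted_pattern.match(t)]
print(kept)
```

Anchoring with `^` avoids dropping an article whose title merely contains one of the prefixes somewhere in the middle.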



---



© 2021 Institute of Data

