<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and rewrite the code as needed.
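
Rule 2 can be enforced in code rather than left to discipline. The sketch below is one possible approach and not part of the lab itself: a hypothetical `Throttle` helper (the class name and the one-second default are assumptions based on the guideline above) that sleeps just long enough to keep a minimum interval between requests.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum delay between calls to wait()."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Sleep if the previous call was less than min_interval seconds ago."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=1.0)
# In a scraping loop, call throttle.wait() immediately before each request.
```

The first call returns at once; every later call pauses only for the time still missing from the interval, so slow pages do not add extra delay.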

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, note which HTML tags enclose the main text; it sits inside a few levels of nested tags.

In [1]:
## Import Libraries
import re  # the standard-library module covers the re.sub and re.search calls used below

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [2]:
url = 'https://liquipedia.net/starcraft2/Serral'

### Retrieve the page
- Requires an Internet connection

In [4]:
http = urllib3.PoolManager()
r = http.request('GET', url)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 361941
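
The output above shows that `page` is a `bytes` object, so its length is counted in bytes, not characters. BeautifulSoup accepts bytes directly, but printing or pattern matching on the raw page needs a decoded string first. The following is a minimal sketch, assuming the page is UTF-8 encoded; `sample` is a made-up stand-in for `r.data`.

```python
# Made-up stand-in for r.data; the real page is far larger.
sample = '<p>caf\u00e9</p>'.encode('utf-8')

# errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
text = sample.decode('utf-8', errors='replace')

print(type(sample).__name__, len(sample))  # length in bytes
print(type(text).__name__, len(text))      # length in characters
```

The two lengths differ because the accented character occupies two bytes in UTF-8.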


### Convert the stream of bytes into a BeautifulSoup representation

### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [7]:
soup = BeautifulSoup(page, 'html.parser')
soup.prettify()

'<!DOCTYPE html>\n<html class="client-nojs Send_pizza_to_FO-nTTaX All_glory_to_Liquipedia" dir="ltr" lang="en" prefix="og: http://ogp.me/ns#">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Serral - Liquipedia - The StarCraft II Encyclopedia\n  </title>\n  <script>\n   document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );\n  </script>\n  <script>\n   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Serral","wgTitle":"Serral","wgCurRevisionId":1899866,"wgRevisionId":1899866,"wgArticleId":32557,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Improperly formatted TeamHistory date","Active players","Pages with dead links","Pages with hard coded colors","1998 births","Players","Finnish Players","Zerg Players","Pages with ExternalMediaLinks"],"
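
Evaluating `soup.prettify()` as the last line of a cell shows the string's repr, which is why the output above contains literal `\n` sequences. As a small sketch on a made-up document (`html` and `demo_soup` are illustrative names), printing a slice of the string shows the indented markup instead and keeps the output short.

```python
from bs4 import BeautifulSoup

# Made-up document standing in for the downloaded page.
html = b'<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'
demo_soup = BeautifulSoup(html, 'html.parser')

pretty = demo_soup.prettify()  # returns a str with one tag per line
print(pretty[:200])            # print only the first 200 characters
```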

### Check the HTML's Title

In [8]:
soup.title

<title>Serral - Liquipedia - The StarCraft II Encyclopedia</title>

### Find the main content
- Check if it is possible to use only the relevant data

In [43]:
tag = 'p'
article = soup.find_all(tag)
article
#print('Type of the variable \'article\':', article.__class__.__name__)

[<p><br/>
 </p>,
 <p>Joona "<b>Serral</b>" Sotala is a Zerg player from <a href="/starcraft2/Category:Finland" title="Category:Finland">Finland</a> who is currently playing for <a href="/starcraft2/Ence_eSports" title="Ence eSports">Ence eSports</a>.
 </p>,
 <p>Although Serral has kept a relatively low profile, he has had some appearances in major tournaments.
 Serral participated <a href="/starcraft2/Copenhagen_Games_Spring_2012" title="Copenhagen Games Spring 2012">Copenhagen Games Spring 2012</a> in April, 2012. He was in the same group with <a href="/starcraft2/StarNaN" title="StarNaN">StarNaN</a>, <a href="/starcraft2/Joe" title="Joe">Joe</a> and <a class="new" href="/starcraft2/index.php?title=Buddha&amp;action=edit&amp;redlink=1" title="Buddha (page does not exist)">Buddha</a> and managed to advance to the main bracket by placing 2nd in his group by only dropping two games against the group winner, <a href="/starcraft2/StarNaN" title="StarNaN">StarNaN</a>.
 In the main bracket S
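
The first element of the result above is a paragraph that holds only a `<br/>` tag. One way to keep just the paragraphs that carry text is to test each one's stripped text, as in this sketch on a made-up fragment (`html`, `demo` and `paragraphs` are illustrative names).

```python
from bs4 import BeautifulSoup

# Made-up fragment: one empty paragraph followed by one with text.
html = '<div><p><br/></p><p>Joona "<b>Serral</b>" Sotala is a Zerg player.</p></div>'
demo = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) returns '' for the empty paragraph, which is falsy.
paragraphs = [p for p in demo.find_all('p') if p.get_text(strip=True)]

for p in paragraphs:
    print(p.get_text())
```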

In [58]:
article[5]

<p>Serral made an impressive run at <a href="/starcraft2/2013_DreamHack_Open/Bucharest" title="2013 DreamHack Open/Bucharest">2013 DreamHack Open: Bucharest</a> on September 14, 2013. After topping his 3-player group in the <a href="/starcraft2/2013_DreamHack_Open/Bucharest/Group_Stage_1" title="2013 DreamHack Open/Bucharest/Group Stage 1">first group stage</a>, he played against <a href="/starcraft2/SuperNova" title="SuperNova">SuperNova</a>, <a href="/starcraft2/Pal" title="Pal">pal</a>, and <a href="/starcraft2/TargA" title="TargA">TargA</a> in the <a href="/starcraft2/2013_DreamHack_Open/Bucharest/Group_Stage_2" title="2013 DreamHack Open/Bucharest/Group Stage 2">second group stage</a>. His match against SuperNova was not originally planned to be featured, however due to delays to the <a href="/starcraft2/DIMAGA" title="DIMAGA">DIMAGA</a> vs. <a href="/starcraft2/Flash" title="Flash">Flash</a> series, it was casted on the main stream. Despite losing the series 0-2, his performance 

### Get some of the text
- Plain text without HTML tags

In [48]:
print(re.sub(r'\n\n+', '\n', article[3].text)[:500])

On July 28, 2012, Serral participated in the World Championship Series 2012 Finland event. He was one of the eight players to qualify for the event through Vectorama 2012. Serral lost against Welmu in round 1 and fell into the losers bracket where he managaed to beat grim and Winsti, but failing to defeat elfi in round 3. Serral finished 7th/8th along with core, winning $200.
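
The pattern `r'\n\n+'` in the cell above matches any run of two or more newlines and replaces it with a single one, which removes blank lines from the extracted text. A small sketch on a made-up string (`raw` is illustrative) shows the effect in isolation.

```python
import re

# Made-up text with runs of two and three newlines.
raw = 'First paragraph.\n\n\nSecond paragraph.\n\nThird.'

# Collapse every run of two or more newlines into one.
collapsed = re.sub(r'\n\n+', '\n', raw)
print(collapsed)
```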



### Find the links in the text

In [59]:
# identify the type of tag to retrieve
tag = 'a'
# create a list with the href of every `<a>` tag in the paragraph
tag_list = [t.get('href') for t in article[5].find_all(tag)]
tag_list

['/starcraft2/2013_DreamHack_Open/Bucharest',
 '/starcraft2/2013_DreamHack_Open/Bucharest/Group_Stage_1',
 '/starcraft2/SuperNova',
 '/starcraft2/Pal',
 '/starcraft2/TargA',
 '/starcraft2/2013_DreamHack_Open/Bucharest/Group_Stage_2',
 '/starcraft2/DIMAGA',
 '/starcraft2/Flash',
 '#cite_note-2',
 '/starcraft2/Pal',
 '/starcraft2/TargA',
 '/starcraft2/2013_DreamHack_Open/Bucharest/Group_Stage_3',
 '/starcraft2/Jaedong',
 '/starcraft2/CranK',
 '/starcraft2/ForGG']
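
The links above are relative to the site, some repeat, and `#cite_note-2` points to a footnote on the same page. To request the linked pages they must first become absolute URLs. The sketch below shows one way to do this with `urllib.parse.urljoin`; `hrefs` is a shortened, hand-copied subset of the list above and `absolute` is an illustrative name.

```python
from urllib.parse import urljoin

base = 'https://liquipedia.net/starcraft2/Serral'
# Shortened subset of the links listed above.
hrefs = ['/starcraft2/SuperNova', '#cite_note-2', '/starcraft2/Pal', '/starcraft2/Pal']

absolute = []
for href in hrefs:
    if href.startswith('#'):
        continue  # skip same-page anchors such as footnotes
    full = urljoin(base, href)  # resolve the path against the page URL
    if full not in absolute:    # drop duplicates, keep first-seen order
        absolute.append(full)

print(absolute)
```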

### Create a filter for unwanted types of articles

In [3]:
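# tag_list comes from the previous cell; run that cell first, otherwise this cell raises a NameError
# build one regular expression that matches any of the unwanted terms
link_filter = '(%s)' % '|'.join([
    'TargA',
])
# remove the links that match the filter
tag_list = [t for t in tag_list if not re.search(link_filter, t)]
tag_list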


NameError: name 'tag_list' is not defined
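
The filtering step can be tried on its own with a small stand-in list, which avoids any dependence on earlier cells. The sketch below is a variation, not the lab's exact code: it joins several unwanted terms into one alternation and passes each through `re.escape` so that characters with a special meaning in regular expressions are matched literally. The names `links`, `unwanted` and `kept` are illustrative.

```python
import re

# Stand-in for tag_list, copied by hand from the links found earlier.
links = ['/starcraft2/SuperNova', '/starcraft2/TargA', '#cite_note-2', '/starcraft2/Flash']

# Terms to exclude; the second one is an assumed extra example.
unwanted = ['TargA', '#cite_note']
pattern = re.compile('|'.join(re.escape(term) for term in unwanted))

# Keep only the links that match none of the unwanted terms.
kept = [link for link in links if not pattern.search(link)]
print(kept)
```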



---




> > > > > > > > > © 2021 Institute of Data


---




