<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.2: Web Scraping

INSTRUCTIONS:

- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website's **Terms and Conditions** before you scrape it. Read the statements about legal use of data carefully. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may overload the website. Make sure the program behaves reasonably (i.e. at roughly the pace of a human visitor). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and rewrite the code as needed.
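
The one-request-per-second guideline in rule 2 could be enforced with a small helper along these lines. This is only a sketch: `fetch_politely`, its parameters and the example URLs are hypothetical names introduced for illustration, and the stand-in `fetch` function avoids any real network traffic.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results

# Demonstration with a stand-in fetch function (no network access needed)
pages = fetch_politely(
    ['https://example.com/a', 'https://example.com/b'],
    fetch=lambda url: 'contents of ' + url,
    delay_seconds=0.1,
)
print(pages)
```

In a real scraper, `fetch` would be whatever function performs the HTTP request, and `delay_seconds` would stay at 1.0 or higher.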

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website and find a wiki that interests you, based on a TV show or movie. As in Demo 8.3, we focus on the navigation bar and aim to extract the names of all the show's characters from the links it contains.

Open a web page with the browser and inspect it.

Hover the cursor over the text and follow the shaded box surrounding the main text.

From the result, check the main text inside a few levels of HTML tags.
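
The same "levels of HTML tags" that the browser inspector shows can also be listed in code. The sketch below uses a made-up, heavily simplified page (not the real Fandom markup) and walks outwards from a link through its enclosing tags with BeautifulSoup's `.parents`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page structure for illustration only
html = """
<html><body>
  <main><div class="content"><p>Barry Kripke is a <a href="/wiki/Caltech">Caltech</a> physicist.</p></div></main>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a')
# .parents yields each enclosing tag, from the nearest one outwards
ancestors = [parent.name for parent in link.parents]
print(ancestors)
```

The first entries of `ancestors` name the tags that wrap the link, which helps when deciding which tag to pass to `find_all` later.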

In [None]:
## Import Libraries
import re  # the standard-library module covers everything used in this lab

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [None]:
# specify the url
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

### Retrieve the page
- Requires an Internet connection

In [None]:
# query the website and return the HTML to the variable 'page'
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 456671
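
A slightly more defensive way to set up the retrieval might look like the sketch below. The User-Agent string, the timeout values and the retry settings are assumptions chosen for illustration, and `fetch` is a hypothetical helper, not part of the lab's own code. Nothing is requested until `fetch` is called.

```python
import urllib3
from urllib3.util import Retry, Timeout

# Assumed settings for illustration; tune them for the site being scraped
http = urllib3.PoolManager(
    headers={'User-Agent': 'iod-lab-scraper/0.1 (educational use)'},  # hypothetical identifier
    timeout=Timeout(connect=5.0, read=10.0),     # give up on slow connections
    retries=Retry(total=3, backoff_factor=1.0),  # retry a few times, waiting longer each time
)

def fetch(url):
    """Return the page body as bytes, or None if the status is not 200."""
    response = http.request('GET', url)
    if response.status != 200:
        print('Request failed with status %d for %s' % (response.status, url))
        return None
    return response.data
```

If every retry fails at the connection level, urllib3 raises an exception instead of returning a response, so a caller may want to wrap `fetch(url)` in a `try`/`except` block.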


### Convert the stream of bytes into a BeautifulSoup representation

In [None]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [None]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Barry Kripke | The Big Bang Theory Wiki | Fandom
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"29ff722a5e31bd41","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Barry_Kripke","wgTitle":"Barry Kripke","wgCurRevisionId":357620,"wgRevisionId":357620,"wgArticleId":2273,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Characters","Caltech Faculty","Scientists","Physicists","Experimental Physicists","Theoretical Physicists","Particle Physicists","Recurring Characters","Season

### Check the HTML's Title

In [None]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>Barry Kripke | The Big Bang Theory Wiki | Fandom</title>:
Title text:Barry Kripke | The Big Bang Theory Wiki | Fandom:


### Find the main content
- Check if it is possible to use only the relevant data

In [None]:
article_tag = 'p'
p = soup.find_all(article_tag)
print('Type of the variable \'p\':', p.__class__.__name__)

Type of the variable 'p': ResultSet


In [None]:
for para in p:
    print(para.text)



Barry Kripke





							Adult
							
						



							Young Adult
							
						



















General Information

Name
Barry Kripke


Born
Possibly May 12


Gender
Male


Nicknames
Bawwy (Siri)


Religion
Unknown


Nationality
American


Occupation
Physicist


Portrayed By
John Ross Bowie



Relationships

Relationships
Amy Farrah Fowler (crush)Beverly Hofstadter (romantic interest)


Family
Unknown



Episode Guide

First episode
"The Killer Robot Instability"


Last episode
The Change Constant


Number of episodes
25



Seasons Guide

Seasons
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12



"I am a stwing pwagmatist. I say I'm gonna pwove something that cannot be pwoved, I appwy for gwant money, and then I spend it on wiquor and bwoads."
―Barry Kripke, The Relationship Diremption

Beverly Hofstadter (romantic interest)
Barry Kripke, Ph.D. is a Caltech plasma-physicist-turned-string-theorist and he is a colleague of Leonard and Sheldon. He has a case of rhotacism, where he pronoun

### Find the content under the 'nav' tag

In [None]:
article_tag = 'nav'
nav = soup.find_all(article_tag)[0]
print('Type of the variable \'nav\':', nav.__class__.__name__)

Type of the variable 'nav': Tag


In [None]:
nav.text

'\n\n\n\n\n Explore\n\n \n\n\n\n\n Main Page\n\n\n\n\n Discuss\n\n\n\n\nAll Pages\n\n\n\n\nCommunity\n\n\n\n\nInteractive Maps\n\n\n\n\nRecent Blog Posts\n\n\n\n\n\n\n\n\nCharacters\n\n\n\n\n\n\nThe Big Bang Theory\n\n \n\n\n\n\nMain Characters\n \n\n\n\n\nLeonard Hofstadter\n\n\n\n\nPenny Hofstadter\n\n\n\n\nSheldon Cooper\n\n\n\n\nAmy Farrah Fowler\n\n\n\n\nHoward Wolowitz\n\n\n\n\nBernadette Rostenkowski-Wolowitz\n\n\n\n\nRajesh Koothrappali\n\n\n\n\nStuart Bloom\n\n\n\n\nLeslie Winkle\n\n\n\n\nEmily Sweeney\n\n\n\n\n\n\n\nRecurring Characters\n \n\n\n\n\nBeverly Hofstadter\n\n\n\n\nMary Cooper\n\n\n\n\nDebbie Wolowitz\n\n\n\n\nMike Rostenkowski\n\n\n\n\nV. M. Koothrappali\n\n\n\n\nPriya Koothrappali\n\n\n\n\nDenise\n\n\n\n\nBarry Kripke\n\n\n\n\nWil Wheaton\n\n\n\n\nZack Johnson\n\n\n\n\n\n\n\nSeasons (1-6)\n \n\n\n\n\nSeason 1\n\n\n\n\nSeason 2\n\n\n\n\nSeason 3\n\n\n\n\nSeason 4\n\n\n\n\nSeason 5\n\n\n\n\nSeason 6\n\n\n\n\n\n\n\nSeasons (7-12)\n \n\n\n\n\nSeason 7\n\n\n\n\nSeason

In [None]:
# show the first 500 characters after removing redundant newlines
print(re.sub(r'\n\n+', '\n', nav.text)[:500])


 Explore
 
 Main Page
 Discuss
All Pages
Community
Interactive Maps
Recent Blog Posts
Characters
The Big Bang Theory
 
Main Characters
 
Leonard Hofstadter
Penny Hofstadter
Sheldon Cooper
Amy Farrah Fowler
Howard Wolowitz
Bernadette Rostenkowski-Wolowitz
Rajesh Koothrappali
Stuart Bloom
Leslie Winkle
Emily Sweeney
Recurring Characters
 
Beverly Hofstadter
Mary Cooper
Debbie Wolowitz
Mike Rostenkowski
V. M. Koothrappali
Priya Koothrappali
Denise
Barry Kripke
Wil Wheaton
Zack Johnson
Seasons (1-6
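
The regex clean-up above still leaves leading spaces and whitespace-only lines in the output. One possible alternative, sketched here on a made-up and shortened piece of navigation markup, is to let BeautifulSoup strip each text fragment via `get_text`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened navigation markup for illustration only
html = """
<nav>
  <ul>
    <li><a href="#"> <span>Explore</span> </a></li>
    <li><a href="/wiki/Main_Page"> <span>Main Page</span> </a></li>
    <li><a href="/wiki/Sheldon_Cooper"> <span>Sheldon Cooper</span> </a></li>
  </ul>
</nav>
"""
nav = BeautifulSoup(html, 'html.parser').find('nav')

# strip=True trims each text fragment and drops whitespace-only ones;
# the separator is placed between the fragments that remain
labels = nav.get_text(separator='\n', strip=True).split('\n')
print(labels)  # ['Explore', 'Main Page', 'Sheldon Cooper']
```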


### Find the links in the text

In [None]:
for t in nav.find_all('a'):
    print(t)

<a data-tracking="custom-level-1" href="#">
<svg class="wds-icon-tiny wds-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
</a>
<a data-tracking="explore-main-page" href="https://bigbangtheory.fandom.com/wiki/Main_Page">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
</a>
<a data-tracking="explore-discuss" href="/f">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-discussions-tiny"></use></svg> <span>Discuss</span>
</a>
<a data-tracking="explore-all-pages" href="https://bigbangtheory.fandom.com/wiki/Special:AllPages">
<span>All Pages</span>
</a>
<a data-tracking="explore-community" href="https://bigbangtheory.fandom.com/wiki/Special:Community">
<span>Community</span>
</a>
<a data-tracking="interactive-maps" href="https://bigbangtheory.fandom.com/wiki/Special:AllMaps">
<span>Interactive Maps</span>
</a>
<a data-tracking="explore-blogs" h

In [None]:
# identify the type of tag to retrieve
link_tag = 'a'

# create a list with the links from the `<a>` tag
tag_list = []
for t in nav.find_all(link_tag):
    tag_list.append(t.get('href'))

# List comprehension version:
# tag_list = [t.get('href') for t in nav.find_all(link_tag)]

print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 68


['#',
 'https://bigbangtheory.fandom.com/wiki/Main_Page',
 '/f',
 'https://bigbangtheory.fandom.com/wiki/Special:AllPages',
 'https://bigbangtheory.fandom.com/wiki/Special:Community',
 'https://bigbangtheory.fandom.com/wiki/Special:AllMaps',
 '/Blog:Recent_posts',
 'https://bigbangtheory.fandom.com/wiki/Category:Characters',
 'https://bigbangtheory.fandom.com/wiki/The_Big_Bang_Theory',
 'https://bigbangtheory.fandom.com/wiki/Category:Main_Characters',
 'https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper',
 'https://bigbangtheory.fandom.com/wiki/Amy_Farrah_Fowler',
 'https://bigbangtheory.fandom.com/wiki/Howard_Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
 'https://bigbangtheory.fandom.com/wiki/Stuart_Bloom',
 'https://bigbangtheory.fandom.com/wiki/Leslie_Winkle',
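
The list mixes absolute URLs with relative ones such as `'/f'` and a bare `'#'`. If full URLs are needed, one option is `urllib.parse.urljoin`, sketched below with a few values copied from the output above plus a `None` to stand in for an `<a>` tag without an `href`.

```python
from urllib.parse import urljoin

base_url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
hrefs = ['#', '/f', '/Blog:Recent_posts',
         'https://bigbangtheory.fandom.com/wiki/Main_Page', None]

# skip missing hrefs and the '#' placeholder, then resolve the rest against the page URL
absolute_links = [urljoin(base_url, href) for href in hrefs if href and href != '#']
print(absolute_links)
```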
 

In [None]:
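# keep only the links to the wiki itself
# (the prefix 'https://bigbangtheory.fandom.com' is 32 characters long,
# so '/wiki/' occupies positions 32-37 and the page name starts at 38)
wiki_tag_list = []
for link in tag_list:
    if link is not None and link[32:38] == '/wiki/':
        wiki_link = link[38:]
        wiki_tag_list.append(wiki_link)

# List comprehension version:
# wiki_tag_list = [link[38:] for link in tag_list if link is not None and link[32:38] == '/wiki/']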

print('Size of \'wiki_tag_list\':', len(wiki_tag_list))
wiki_tag_list

Size of 'wiki_tag_list': 65


['Main_Page',
 'Special:AllPages',
 'Special:Community',
 'Special:AllMaps',
 'Category:Characters',
 'The_Big_Bang_Theory',
 'Category:Main_Characters',
 'Leonard_Hofstadter',
 'Penny_Hofstadter',
 'Sheldon_Cooper',
 'Amy_Farrah_Fowler',
 'Howard_Wolowitz',
 'Bernadette_Rostenkowski-Wolowitz',
 'Rajesh_Koothrappali',
 'Stuart_Bloom',
 'Leslie_Winkle',
 'Emily_Sweeney',
 'Category:Recurring_Characters',
 'Beverly_Hofstadter',
 'Mary_Cooper',
 'Debbie_Wolowitz',
 'Mike_Rostenkowski',
 'V._M._Koothrappali',
 'Priya_Koothrappali',
 'Denise',
 'Barry_Kripke',
 'Wil_Wheaton',
 'Zack_Johnson',
 'Seasons_(1-6)',
 'Season_1',
 'Season_2',
 'Season_3',
 'Season_4',
 'Season_5',
 'Season_6',
 'Seasons_(7-12)',
 'Season_7',
 'Season_8',
 'Season_9',
 'Season_10',
 'Season_11',
 'Season_12',
 'Young_Sheldon',
 'Category:Main_Characters',
 'Sheldon_Cooper',
 'Mary_Cooper',
 'George_Cooper_Sr.',
 'George_Cooper_Jr.',
 'Missy_Cooper',
 'Meemaw',
 'Jeff_Difford',
 'Category:Recurring_Characters',
 'Ta
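
Slicing at fixed positions only works while the host name keeps exactly the same length. A variant that does not depend on character offsets could parse each link with `urllib.parse.urlparse`, as in this sketch. `wiki_page_name` and `WIKI_PREFIX` are hypothetical names, and unlike the cell above this version also accepts relative `/wiki/...` links.

```python
from urllib.parse import urlparse

WIKI_PREFIX = '/wiki/'

def wiki_page_name(link):
    """Return the page name after '/wiki/', or None for other links."""
    if not link:
        return None
    path = urlparse(link).path  # the part after the host, without query or fragment
    if path.startswith(WIKI_PREFIX):
        return path[len(WIKI_PREFIX):]
    return None

links = ['#', '/f', None,
         'https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper',
         '/wiki/Special:AllPages']
names = [name for name in map(wiki_page_name, links) if name]
print(names)  # ['Sheldon_Cooper', 'Special:AllPages']
```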

### Create a filter for undesired links (those not corresponding to characters)

In [None]:
# regular expression matching any of the undesired fragments
# (named exclude_pattern to avoid shadowing the built-in filter())
exclude_pattern = '(%s)' % '|'.join([
    'Main_',
    'Season',
    'Category:',
    'File:',
    'Help:',
    'Portal:',
    'action=',
    'Special:',
    'Talk:',
    'The'
])
# remove the links that match the pattern
filtered_tag_list = []
for t in wiki_tag_list:
    if not re.search(exclude_pattern, t):
        filtered_tag_list.append(t)

# List comprehension version:
# filtered_tag_list = [t for t in wiki_tag_list if not re.search(exclude_pattern, t)]
print('Size of \'filtered_tag_list\':', len(filtered_tag_list))
filtered_tag_list

Size of 'filtered_tag_list': 35


['Leonard_Hofstadter',
 'Penny_Hofstadter',
 'Sheldon_Cooper',
 'Amy_Farrah_Fowler',
 'Howard_Wolowitz',
 'Bernadette_Rostenkowski-Wolowitz',
 'Rajesh_Koothrappali',
 'Stuart_Bloom',
 'Leslie_Winkle',
 'Emily_Sweeney',
 'Beverly_Hofstadter',
 'Mary_Cooper',
 'Debbie_Wolowitz',
 'Mike_Rostenkowski',
 'V._M._Koothrappali',
 'Priya_Koothrappali',
 'Denise',
 'Barry_Kripke',
 'Wil_Wheaton',
 'Zack_Johnson',
 'Young_Sheldon',
 'Sheldon_Cooper',
 'Mary_Cooper',
 'George_Cooper_Sr.',
 'George_Cooper_Jr.',
 'Missy_Cooper',
 'Meemaw',
 'Jeff_Difford',
 'Tam_Nguyen',
 'Veronica_Duncan',
 'Billy_Sparks',
 'Brenda_Sparks',
 'John_Sturgis',
 'Dale_Ballard',
 'Paige_Swanson']
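
Because the pattern above is searched anywhere in the name, a fragment such as `The` would also remove any page whose name merely contains those letters. A stricter variant, sketched here under the assumption that the unwanted pages can be recognised by how their names start, matches only at the beginning. The pattern is a reduced version of the list above, and `Theodore_Example` is a made-up page name used only to show the difference.

```python
import re

# Prefixes to exclude; re.match only looks at the start of each page name
exclude_prefixes = re.compile(
    r'(Main_|Seasons?_|Category:|File:|Help:|Portal:|Special:|Talk:|The_)'
)

page_names = ['Main_Page', 'Season_1', 'Seasons_(1-6)', 'The_Big_Bang_Theory',
              'Category:Characters', 'Sheldon_Cooper', 'Theodore_Example']

kept = [name for name in page_names if not exclude_prefixes.match(name)]
print(kept)  # ['Sheldon_Cooper', 'Theodore_Example']
```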

### Remove duplicates

In [None]:
unique_tag_list = list(set(filtered_tag_list))
print('Size of \'unique_tag_list\':', len(unique_tag_list))
unique_tag_list

Size of 'unique_tag_list': 33


['Sheldon_Cooper',
 'Bernadette_Rostenkowski-Wolowitz',
 'Howard_Wolowitz',
 'Wil_Wheaton',
 'Penny_Hofstadter',
 'Veronica_Duncan',
 'Mike_Rostenkowski',
 'Brenda_Sparks',
 'Denise',
 'Paige_Swanson',
 'Zack_Johnson',
 'George_Cooper_Sr.',
 'George_Cooper_Jr.',
 'John_Sturgis',
 'Amy_Farrah_Fowler',
 'Priya_Koothrappali',
 'Missy_Cooper',
 'Emily_Sweeney',
 'Young_Sheldon',
 'Jeff_Difford',
 'Mary_Cooper',
 'Leonard_Hofstadter',
 'Rajesh_Koothrappali',
 'Dale_Ballard',
 'Debbie_Wolowitz',
 'Beverly_Hofstadter',
 'Barry_Kripke',
 'Tam_Nguyen',
 'Billy_Sparks',
 'Meemaw',
 'Leslie_Winkle',
 'Stuart_Bloom',
 'V._M._Koothrappali']
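
Converting to a `set` removes duplicates but, as the output shows, loses the original order. If the navigation-bar order matters, one common alternative is `dict.fromkeys`, which keeps the first occurrence of each item. The short input list below is a hand-picked excerpt for illustration.

```python
filtered = ['Sheldon_Cooper', 'Mary_Cooper', 'Young_Sheldon',
            'Sheldon_Cooper', 'Mary_Cooper']

# dict keys are unique and remember insertion order
unique_in_order = list(dict.fromkeys(filtered))
print(unique_in_order)  # ['Sheldon_Cooper', 'Mary_Cooper', 'Young_Sheldon']
```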

### Convert underscore to space

In [None]:
spaced_tag_list = []
for tag in unique_tag_list:
    processed_tag = re.sub('_', ' ', tag)
    spaced_tag_list.append(processed_tag)

# spaced_tag_list = [re.sub('_', ' ', t) for t in unique_tag_list]
print('Size of \'spaced_tag_list\':', len(spaced_tag_list))
spaced_tag_list

Size of 'spaced_tag_list': 33


['Sheldon Cooper',
 'Bernadette Rostenkowski-Wolowitz',
 'Howard Wolowitz',
 'Wil Wheaton',
 'Penny Hofstadter',
 'Veronica Duncan',
 'Mike Rostenkowski',
 'Brenda Sparks',
 'Denise',
 'Paige Swanson',
 'Zack Johnson',
 'George Cooper Sr.',
 'George Cooper Jr.',
 'John Sturgis',
 'Amy Farrah Fowler',
 'Priya Koothrappali',
 'Missy Cooper',
 'Emily Sweeney',
 'Young Sheldon',
 'Jeff Difford',
 'Mary Cooper',
 'Leonard Hofstadter',
 'Rajesh Koothrappali',
 'Dale Ballard',
 'Debbie Wolowitz',
 'Beverly Hofstadter',
 'Barry Kripke',
 'Tam Nguyen',
 'Billy Sparks',
 'Meemaw',
 'Leslie Winkle',
 'Stuart Bloom',
 'V. M. Koothrappali']
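
For a fixed one-character substitution, `str.replace` does the same job as `re.sub`. Page names taken from URLs may also contain percent-encoded characters, which `unquote` (imported at the top of the notebook) can decode. The sketch below combines both. `page_name_to_title` is a hypothetical helper, and the second input is a made-up example of an encoded name.

```python
from urllib.parse import unquote

def page_name_to_title(page_name):
    """Decode percent-escapes, then turn underscores into spaces."""
    return unquote(page_name).replace('_', ' ')

# The second name is a made-up example of a percent-encoded page name
examples = ['V._M._Koothrappali', 'Ren%C3%A9e_Example']
titles = [page_name_to_title(name) for name in examples]
print(titles)
```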

### Order the list

In [None]:
spaced_tag_list.sort()
print('Size of \'spaced_tag_list\':', len(spaced_tag_list))
spaced_tag_list

Size of 'spaced_tag_list': 33


['Amy Farrah Fowler',
 'Barry Kripke',
 'Bernadette Rostenkowski-Wolowitz',
 'Beverly Hofstadter',
 'Billy Sparks',
 'Brenda Sparks',
 'Dale Ballard',
 'Debbie Wolowitz',
 'Denise',
 'Emily Sweeney',
 'George Cooper Jr.',
 'George Cooper Sr.',
 'Howard Wolowitz',
 'Jeff Difford',
 'John Sturgis',
 'Leonard Hofstadter',
 'Leslie Winkle',
 'Mary Cooper',
 'Meemaw',
 'Mike Rostenkowski',
 'Missy Cooper',
 'Paige Swanson',
 'Penny Hofstadter',
 'Priya Koothrappali',
 'Rajesh Koothrappali',
 'Sheldon Cooper',
 'Stuart Bloom',
 'Tam Nguyen',
 'V. M. Koothrappali',
 'Veronica Duncan',
 'Wil Wheaton',
 'Young Sheldon',
 'Zack Johnson']
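
`list.sort()` reorders the list in place, while `sorted()` returns a new list and accepts the same `key` argument. As a sketch of a different ordering, the example below sorts a small hand-picked subset by the last word of each name. `surname_key` is a hypothetical helper and treats the final word as the surname, which would not be right for names such as 'George Cooper Sr.'.

```python
names = ['Sheldon Cooper', 'Amy Farrah Fowler', 'Barry Kripke', 'Mary Cooper']

def surname_key(name):
    """Sort by the last word first, then by the full name to break ties."""
    parts = name.split()
    return (parts[-1].casefold(), name.casefold())

by_surname = sorted(names, key=surname_key)  # 'names' itself is left unchanged
print(by_surname)
# ['Mary Cooper', 'Sheldon Cooper', 'Amy Farrah Fowler', 'Barry Kripke']
```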



---

© 2023 Institute of Data