<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and a conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
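Rule 2 can be sketched as a small helper that pauses between consecutive requests. This is only an illustration: `fetch_politely`, `fake_fetch`, and the `delay_seconds` parameter are hypothetical names introduced here, and the stand-in fetcher replaces a real HTTP call so the sketch runs offline.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for index, page_url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results[page_url] = fetch(page_url)
    return results

# Stand-in fetcher (hypothetical) so the sketch runs without a network connection.
def fake_fetch(page_url):
    return 'contents of ' + page_url

pages = fetch_politely(['page_a', 'page_b'], fake_fetch, delay_seconds=0.1)
print(pages)
```

In a real scraper, `fetch` would wrap the actual request, and a delay of about one second matches the guideline above.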

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect its HTML.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

From the result, note which HTML tags enclose the main text; it usually sits a few levels deep.

In [1]:
import re

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [2]:
url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

### Retrieve the page
- Requires an Internet connection

In [3]:
http = urllib3.PoolManager()
r = http.request('GET', url)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 518088


### Convert the stream of bytes into a BeautifulSoup representation

In [4]:
soup = BeautifulSoup(page, 'html.parser')

In [5]:
type(soup)

bs4.BeautifulSoup

In [6]:
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Barry Kripke | The Big Bang Theory Wiki | Fandom\n  </title>\n  <script>\n   document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );\n  </script>\n  <script>\n   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Barry_Kripke","wgTitle":"Barry Kripke","wgCurRevisionId":352395,"wgRevisionId":352395,"wgArticleId":2273,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Characters","Caltech Faculty","Scientists","Physicists","Experimental Physicists","Theoretical Physicists","Particle Physicists","Recurring Characters","Season 2","Season 3","Season 4","Season 5","Season 6","Season 7","Season 8","Season 9","The Big Bang Theory","Kripke","Si

### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

### Check the HTML's Title

In [7]:
soup.title

<title>Barry Kripke | The Big Bang Theory Wiki | Fandom</title>

In [8]:
soup.title.string

'Barry Kripke | The Big Bang Theory Wiki | Fandom'

### Find the main content
- Check if it is possible to use only the relevant data

In [36]:
article_tag = 'div'
article = soup.find_all(article_tag)[4]


In [37]:
article

<div class="wds-dropdown">
<div class="wds-tabs__tab-label wds-dropdown__toggle first-level-item">
<a data-tracking="custom-level-1" href="#">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
</a>
<svg class="wds-icon wds-icon-tiny wds-dropdown__toggle-chevron"><use xlink:href="#wds-icons-dropdown-tiny"></use></svg> </div>
<div class="wds-is-not-scrollable wds-dropdown__content">
<ul class="wds-list wds-is-linked">
<li>
<a data-tracking="explore-main-page" href="https://bigbangtheory.fandom.com/wiki/Main_Page">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
</a>
</li>
<li>
<a data-tracking="explore-discuss" href="/f">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-comment-tiny"></use></svg> <span>Discuss</span>
</a>
</li>
<li>
<a data-tracking="explore-all-pages" href="https://bigbangt

In [38]:
type(article)

bs4.element.Tag
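The `div` at index 4 turns out to be a navigation dropdown rather than the article body, and a positional index breaks whenever the page layout shifts. A possible alternative is to select the container by its class. The sketch below uses a tiny made-up page; the class name `mw-parser-output` is an assumption about how the wiki marks its article body and should be confirmed by inspecting the real page.

```python
from bs4 import BeautifulSoup

# Made-up stand-in page. The class 'mw-parser-output' is an assumption,
# not something confirmed by the output shown in this lab.
sample_html = """
<html><body>
  <div class="wds-dropdown"><a href="/f">Discuss</a></div>
  <div class="mw-parser-output">
    <p>Sample article paragraph.</p>
    <a href="/wiki/Sample_Page">Sample link</a>
  </div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching tag, or None when nothing matches.
main = sample_soup.find('div', class_='mw-parser-output')

if main is None:
    print('Main content container not found; inspect the page again.')
else:
    print(main.find('p').get_text())
```

Checking for `None` keeps the code from failing with an `AttributeError` when the layout changes.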

In [39]:
for t in article.find_all('a'):
    print(t)

<a data-tracking="custom-level-1" href="#">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
</a>
<a data-tracking="explore-main-page" href="https://bigbangtheory.fandom.com/wiki/Main_Page">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
</a>
<a data-tracking="explore-discuss" href="/f">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-comment-tiny"></use></svg> <span>Discuss</span>
</a>
<a data-tracking="explore-all-pages" href="https://bigbangtheory.fandom.com/wiki/Special:AllPages">
<span>All Pages</span>
</a>
<a data-tracking="explore-community" href="https://bigbangtheory.fandom.com/wiki/Special:Community">
<span>Community</span>
</a>
<a data-tracking="explore-blogs" href="/wiki/Blog:Recent_posts">
<span>Recent Blog Posts</span>
</a>


In [41]:
link_tag = 'a'

tag_list = []
for t in article.find_all(link_tag):
    tag_list.append(t.get('href'))


print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 6


['#',
 'https://bigbangtheory.fandom.com/wiki/Main_Page',
 '/f',
 'https://bigbangtheory.fandom.com/wiki/Special:AllPages',
 'https://bigbangtheory.fandom.com/wiki/Special:Community',
 '/wiki/Blog:Recent_posts']

In [42]:
# keep only the links to the wiki itself
wiki_tag_list = []
for link in tag_list:
    if link is not None and link[:6] == '/wiki/':
        wiki_link = link[6:]
        wiki_tag_list.append(wiki_link)

# List comprehension:
# wiki_tag_list = [link[6:] for link in tag_list if link is not None and link[:6] == '/wiki/']

print('Size of \'wiki_tag_list\':', len(wiki_tag_list))
wiki_tag_list

Size of 'wiki_tag_list': 1


['Blog:Recent_posts']

In [50]:

# keep only the absolute links (those that start with 'https')
httplist = []
for link in tag_list:
    if link is not None and link.startswith('https'):
        httplist.append(link)

print('Size of http URLs:', len(httplist))
httplist

Size of http URLs: 3


['https://bigbangtheory.fandom.com/wiki/Main_Page',
 'https://bigbangtheory.fandom.com/wiki/Special:AllPages',
 'https://bigbangtheory.fandom.com/wiki/Special:Community']
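The two filters above treat relative and absolute links separately. As a sketch of another option, `urllib.parse.urljoin` can resolve every link against the page URL so that both kinds end up in one list of absolute URLs. The names `base_url`, `sample_links`, and `absolute_links` are introduced here for illustration; the link values are taken from the output above.

```python
from urllib.parse import urljoin

base_url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
sample_links = [
    '#',
    '/f',
    '/wiki/Blog:Recent_posts',
    'https://bigbangtheory.fandom.com/wiki/Main_Page',
    None,
]

# Skip missing hrefs and bare '#' anchors, then resolve the rest against the page URL.
absolute_links = [
    urljoin(base_url, link)
    for link in sample_links
    if link and link != '#'
]

for link in absolute_links:
    print(link)
```

Links that are already absolute pass through `urljoin` unchanged.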

In [51]:
for t in article.find_all('div'):
    print(t)

<div class="wds-tabs__tab-label wds-dropdown__toggle first-level-item">
<a data-tracking="custom-level-1" href="#">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
</a>
<svg class="wds-icon wds-icon-tiny wds-dropdown__toggle-chevron"><use xlink:href="#wds-icons-dropdown-tiny"></use></svg> </div>
<div class="wds-is-not-scrollable wds-dropdown__content">
<ul class="wds-list wds-is-linked">
<li>
<a data-tracking="explore-main-page" href="https://bigbangtheory.fandom.com/wiki/Main_Page">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
</a>
</li>
<li>
<a data-tracking="explore-discuss" href="/f">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-comment-tiny"></use></svg> <span>Discuss</span>
</a>
</li>
<li>
<a data-tracking="explore-all-pages" href="https://bigbangtheory.fandom.com/wiki/Speci

### Get some of the text
- Plain text without HTML tags
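
One way to approach this step, shown on a small made-up fragment rather than the real page, is to call `get_text()` on each paragraph tag. The names `sample_html`, `sample_soup`, and `paragraphs` are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the article body.
sample_html = '<div><p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p></div>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# get_text() concatenates all text inside a tag and drops the markup.
paragraphs = [p.get_text() for p in sample_soup.find_all('p')]

for text in paragraphs:
    print(text)
```

On the real page, the same loop would run over the paragraphs of whichever tag holds the main content.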

### Find the links in the text
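
A minimal sketch for this step, again on a made-up fragment: `find_all('a', href=True)` returns only the anchor tags that actually carry an `href` attribute, so no `None` check is needed afterwards. The page names in the sample are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the hrefs are illustrative, not taken from the real page.
sample_html = """
<div>
  <p>See <a href="/wiki/Sample_Page_One">page one</a> and
     <a href="/wiki/Sample_Page_Two">page two</a>.
     <a name="anchor-only">This anchor has no href.</a></p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Collect (link text, target) pairs for anchors that have an href.
links = [
    (a.get_text(strip=True), a['href'])
    for a in sample_soup.find_all('a', href=True)
]

for text, href in links:
    print(text, '->', href)
```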

### Create a filter for unwanted types of articles
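
The link lists above include pages such as `Special:AllPages` and `Blog:Recent_posts`, which are not regular articles. One possible filter, sketched below, rejects page names that start with a namespace prefix. The prefix tuple is an assumption to adapt to the wiki at hand, and `is_wanted` is a hypothetical helper name.

```python
from urllib.parse import unquote

# Assumed set of namespace prefixes to exclude; extend or trim as needed.
UNWANTED_PREFIXES = ('Special:', 'Blog:', 'File:', 'Category:')

def is_wanted(page_name):
    """Return True when the decoded page name has no unwanted prefix."""
    # unquote() turns percent-encoded characters such as %3A back into ':'.
    return not unquote(page_name).startswith(UNWANTED_PREFIXES)

sample_pages = [
    'Blog:Recent_posts',
    'Special:AllPages',
    'Main_Page',
    'Special%3ACommunity',
]

wanted_pages = [name for name in sample_pages if is_wanted(name)]
print(wanted_pages)
```

Decoding with `unquote` first means percent-encoded names are filtered the same way as plain ones.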



---

© 2021 Institute of Data

---
