# Scrape a website using BeautifulSoup

In [2]:
import requests
from bs4 import BeautifulSoup

`1` [Workflow](#1) 
<br>
`2` [Searching and navigating HTML tree](#2)
<br>
`3` [Extracting data from the HTML tree](#3)
<br>
`4` [Applications](#4)

<a id = '1'></a>
## `1.` Workflow - 5 Steps

1. Inspect the webpage (its content and structure) in the browser.
2. Send a request to the server to get the HTML content.
3. Select a parser
    - parsing means splitting a string/text into its syntactic components
    - a parser identifies all the elements, their relationships to one another, their attributes, and their contents
    - the result is represented as a parse tree
    - BeautifulSoup supports three parsers: lxml > html5lib > html.parser (built-in)
    

4. Create a BeautifulSoup object
5. Export the HTML to a file (optional, but recommended)
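To make the idea of a parse tree concrete, here is a minimal sketch that runs steps 3 and 4 on a tiny hand-written page instead of a downloaded one. The HTML string and the variable names (`html_doc`, `link`) are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, so the example runs without a network request.
html_doc = (
    "<html><body><div id='main'>"
    "<p>Hello <a href='/wiki/Music'>music</a></p>"
    "</div></body></html>"
)

# Steps 3 and 4: pick a parser and build the parse tree.
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.find("a")
print(link.name)                # a
print(link.attrs)               # {'href': '/wiki/Music'}
print(link.parent.name)         # p    (the tag that encloses the link)
print(link.parent.parent.name)  # div  (one level further up the tree)
```

Every tag knows its name, its attributes, and its position relative to other tags, which is what the searching and navigating methods in the next sections rely on.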

We use a Wikipedia page as the example. Suppose we already have a good grasp of the page's structure from inspecting it.

In [3]:
base_url = 'https://en.wikipedia.org/wiki/Music'

In [4]:
response = requests.get(base_url)
print(response.status_code) # 200 if the request succeeded

200


In [5]:
html = response.content
print(html[1:100]) # binary starting with b'

b'!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'


In [6]:
soup = BeautifulSoup(html, 'html.parser') # using the built-in parser 'html.parser'
print(soup.prettify()[1:500])

!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Music - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"8d812247-752c-42c8-b404-90b86185


#### `The prettify() method` will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

In [20]:
# optional | exporting the HTML as a file
with open('wiki_response.html', 'wb') as file: 
    file.write(soup.prettify(encoding = 'utf-8')) # with an encoding, prettify() returns bytes, hence mode 'wb'
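The difference between the two ways of calling `prettify()` is easy to check on a small document. This is a minimal sketch; the HTML snippet and the names `soup_small`, `as_text`, and `as_bytes` are illustrative only.

```python
from bs4 import BeautifulSoup

soup_small = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

as_text = soup_small.prettify()                   # no encoding: a str
as_bytes = soup_small.prettify(encoding="utf-8")  # with encoding: bytes

print(type(as_text).__name__)   # str
print(type(as_bytes).__name__)  # bytes
print(as_text)                  # one tag or string per line, indented by depth
```

A `str` result can be written to a file opened in text mode (`'w'`), while the `bytes` result needs binary mode (`'wb'`), as in the cell above.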

---
<a id = '2'></a>
## `2.` Searching and navigating HTML tree

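- soup.find( ): returns the first matching tag, or None if there is no match
- soup.find_all( ): returns a list of all matching tags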

In [76]:
soup.find('div') # finding one div tag

<div class="noprint" id="mw-page-base"></div>

In [24]:
print(soup.find('movie')) # there is no 'movie' tag so it returns None

None


In [25]:
soup.find('a') # looking for an 'a' tag with a url, but find() only returns the first 'a' tag

<a id="top"></a>

In [74]:
soup.find_all('a')[1:10] # voilà, find_all() returns all 'a' tags

[<a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a class="image" href="/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg"><img alt="François Boucher, Allegory of Music, 1764, NGA 32680.jpg" data-file-he

---
### Searching by attribute

In [10]:
soup.find_all('a', class_ = 'image')[0]

<a class="image" href="/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg"><img alt="François Boucher, Allegory of Music, 1764, NGA 32680.jpg" data-file-height="3189" data-file-width="4000" decoding="async" height="175" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg/220px-Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg/330px-Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg/440px-Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg 2x" width="220"/></a>

#### Note the underscore (_) attached to class: `class` is a reserved keyword in Python, so BeautifulSoup uses `class_` as the argument name. Alternatively, we can perform the same search by passing the attributes in a dictionary.

In [9]:
soup.find_all('a', {'class':'image'})[0] 

<a class="image" href="/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg"><img alt="François Boucher, Allegory of Music, 1764, NGA 32680.jpg" data-file-height="3189" data-file-width="4000" decoding="async" height="175" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg/220px-Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg/330px-Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg/440px-Fran%C3%A7ois_Boucher%2C_Allegory_of_Music%2C_1764%2C_NGA_32680.jpg 2x" width="220"/></a>

In [47]:
# extend to find a specific href 
soup.find_all('a', {'class':'image', 
                    'href':"/wiki/File:Muses_sarcophagus_Louvre_MR880.jpg"})

[<a class="image" href="/wiki/File:Muses_sarcophagus_Louvre_MR880.jpg"><img alt="" class="thumbimage" data-file-height="1140" data-file-width="2670" decoding="async" height="167" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Muses_sarcophagus_Louvre_MR880.jpg/390px-Muses_sarcophagus_Louvre_MR880.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Muses_sarcophagus_Louvre_MR880.jpg/585px-Muses_sarcophagus_Louvre_MR880.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Muses_sarcophagus_Louvre_MR880.jpg/780px-Muses_sarcophagus_Louvre_MR880.jpg 2x" width="390"/></a>]
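The three ways of searching by attribute can be compared side by side on a small document. This is a sketch under the assumption of a made-up HTML snippet; only the `find()` and `find_all()` calls mirror the cells above.

```python
from bs4 import BeautifulSoup

html_doc = """
<a class="image" href="/wiki/File:A.jpg">A</a>
<a class="image" href="/wiki/File:B.jpg">B</a>
<a class="mw-jump-link" href="#top">Top</a>
"""
soup_small = BeautifulSoup(html_doc, "html.parser")

# Keyword form: class_ avoids the clash with Python's reserved word `class`.
by_keyword = soup_small.find_all("a", class_="image")

# Dictionary form: attribute names are plain strings, so no underscore is needed.
by_dict = soup_small.find_all("a", {"class": "image"})

# Several attributes at once: a tag must match all of them.
one_link = soup_small.find_all("a", {"class": "image", "href": "/wiki/File:B.jpg"})

# find() takes the same filters but returns only the first match.
first_only = soup_small.find("a", class_="image")

print(len(by_keyword), len(by_dict), len(one_link))  # 2 2 1
print(first_only["href"])                            # /wiki/File:A.jpg
```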

---
<a id = '3'></a>
## `3.` Extracting data from the HTML tree

In [81]:
# randomly picked for an example
a = soup.find_all('a')[1]
a

<a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>

### Getting the attribute value

In [82]:
a.attrs

{'href': '/wiki/Wikipedia:Protection_policy#semi',
 'title': 'This article is semi-protected.'}

In [83]:
# can extract an attribute like a dictionary
a['title']

'This article is semi-protected.'

In [84]:
a.get('title')

'This article is semi-protected.'

In [85]:
a['id'] # a missing attribute raises a KeyError

KeyError: 'id'

In [86]:
a.get('id') # the .get() method returns None for a missing attribute instead of raising an error
print(a.get('id'))

None
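When scraping many tags, some of them will usually lack the attribute you are after, so it helps to see the safe access patterns together. A minimal sketch on a made-up tag (the variable `tag` and the fallback value `'no-id'` are illustrative):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="/wiki/Music" title="Music">Music</a>', "html.parser").a

print(tag["href"])             # /wiki/Music
print(tag.get("id"))           # None
print(tag.get("id", "no-id"))  # no-id  (a default, like dict.get)
print(tag.has_attr("title"))   # True

# Bracket access is still useful when a missing attribute should be treated as an error.
try:
    tag["id"]
except KeyError as err:
    print("missing attribute:", err)
```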


---
### Extracting text from a tag

In [87]:
p = soup.find_all('p')[1] # extracting a paragraph from the html

In [88]:
p.string # hmm.. nothing returned: .string is None when a tag has more than one child

In [89]:
repr(p.string)

'None'

In [90]:
p.text # the .text property gathers the text of all nested tags

'Music is an art form, and cultural activity, whose medium is sound. General definitions of music include common elements such as pitch (which governs melody and harmony), rhythm (and its associated concepts tempo, meter, and articulation), dynamics (loudness and softness), and the sonic qualities of timbre and texture (which are sometimes termed the "color" of a musical sound). Different styles or types of music may emphasize, de-emphasize or omit some of these elements. Music is performed with a vast range of instruments and vocal techniques ranging from singing to rapping; there are solely instrumental pieces, solely vocal pieces (such as songs without instrumental accompaniment) and pieces that combine singing and instruments. The word derives from Greek μουσική (mousike; "art of the Muses").[1]\nSee glossary of musical terminology.\n'

In [91]:
p.get_text() # works the same

'Music is an art form, and cultural activity, whose medium is sound. General definitions of music include common elements such as pitch (which governs melody and harmony), rhythm (and its associated concepts tempo, meter, and articulation), dynamics (loudness and softness), and the sonic qualities of timbre and texture (which are sometimes termed the "color" of a musical sound). Different styles or types of music may emphasize, de-emphasize or omit some of these elements. Music is performed with a vast range of instruments and vocal techniques ranging from singing to rapping; there are solely instrumental pieces, solely vocal pieces (such as songs without instrumental accompaniment) and pieces that combine singing and instruments. The word derives from Greek μουσική (mousike; "art of the Muses").[1]\nSee glossary of musical terminology.\n'

#### When extracting text from an element containing `lots of nested elements,` use `.text` or `.get_text()`. When extracting text from an individual child element, `.string` may be useful.

In [84]:
print(p.contents[2]) # one element inside p tag
print(p.contents[2].string)
print(p.contents[2].text)
print(p.contents[2].get_text())

<a href="/wiki/The_arts#Music" title="The arts">art form</a>
art form
art form
art form
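The behaviour above can be reproduced on a one-line paragraph, which makes the rule easier to see. A minimal sketch with a made-up snippet (`soup_small` and `p_tag` are illustrative names):

```python
from bs4 import BeautifulSoup

soup_small = BeautifulSoup(
    '<p>Music is an <a href="/wiki/The_arts">art form</a>.</p>', "html.parser"
)
p_tag = soup_small.p

# <p> has three children (text, <a>, text), so .string cannot pick a single one.
print(p_tag.string)      # None

# .get_text() (and the equivalent .text property) joins the text of all descendants.
print(p_tag.get_text())  # Music is an art form.

# <a> has exactly one child, a string, so .string works here.
print(p_tag.a.string)    # art form
```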


---
<a id = '4'></a>
## `4.` Applications

Here we will apply the scraping methods to extract urls from the data.
- [Extracting urls](#4.1)
- [Extracting from nested tags](#4.2)
- [Scraping multiple pages automatically](#4.3)

In [12]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

<a id = '4.1'></a>
### Extracting urls
Create a function to scrape a wikipedia page.

In [13]:
base_url = 'https://en.wikipedia.org/wiki/Music'

In [14]:
def wiki_extract(url):
    """Download a page and return it as a BeautifulSoup object, or None on failure."""
    response = requests.get(url)
    if response.status_code != 200:
        print("We cannot extract data from the url. Please check again")
        return None
    return BeautifulSoup(response.content, 'lxml') # using the 'lxml' parser this time

In [15]:
data = wiki_extract(base_url)

Anchor tags (a) contain href attributes (for links).

In [92]:
links = data.find_all('a')
links[1:5]

[<a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>]

In [8]:
links[20] # how it looks within an anchor tag

<a href="/wiki/Puppetry" title="Puppetry">Puppetry</a>

In [9]:
links[20].get('href')

'/wiki/Puppetry'

The above result '/wiki/Puppetry' is a relative url stored in the href attribute. We would like to extract this.
<br>
However, to request the page we need it in the form of an absolute url, for instance: https://en.wikipedia.org/wiki/Puppetry.

Instead of building it manually, there is a very handy function that can help: **`urllib.parse.urljoin(base, url)`**

As we already imported the function with `from urllib.parse import urljoin`, we can call `urljoin()` directly without the `urllib.parse` prefix.

In [10]:
relative_url = links[20].get('href')
absolute_url = urljoin(base_url, relative_url)
print(absolute_url)

https://en.wikipedia.org/wiki/Puppetry
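`urljoin()` resolves a link the way a browser would, so it also copes with the other kinds of href found on the page. The sketch below uses only the standard library; the example hrefs other than '/wiki/Puppetry' are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Music"

# Root-relative path: replaces the whole path of the base url.
print(urljoin(base_url, "/wiki/Puppetry"))
# https://en.wikipedia.org/wiki/Puppetry

# Fragment only: stays on the same page.
print(urljoin(base_url, "#Etymology"))
# https://en.wikipedia.org/wiki/Music#Etymology

# Protocol-relative url: only the scheme is taken from the base.
print(urljoin(base_url, "//upload.wikimedia.org/x.png"))
# https://upload.wikimedia.org/x.png

# Already absolute: returned unchanged.
print(urljoin(base_url, "https://www.example.org/"))
# https://www.example.org/
```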


---
### Processing multiple links all at once

Using list comprehension

In [26]:
links = [l.get('href') for l in links] # collect the href of every <a> tag (None when a tag has no href)

Some cleaning is needed for:
- None values (anchor tags without an href)
- in-page fragment links (e.g. #Etymology) that do not lead to another article

In [27]:
clean_links = [l for l in links if l is not None]
clean_links[1:20]

['#mw-head',
 '#searchInput',
 '/wiki/Music_(disambiguation)',
 '/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 '/wiki/Fran%C3%A7ois_Boucher',
 '/wiki/Paleolithic',
 '/wiki/Performing_arts',
 '/wiki/Acrobatics',
 '/wiki/Ballet',
 '/wiki/List_of_circus_skills',
 '/wiki/Clown',
 '/wiki/Dance',
 '/wiki/Gymnastics',
 '/wiki/Magic_(illusion)',
 '/wiki/Mime_artist',
 '/wiki/Opera',
 '/wiki/Professional_wrestling',
 '/wiki/Puppetry',
 '/wiki/Public_speaking']

In [28]:
relative_urls = [link for link in clean_links if '/wiki/' in link]
relative_urls[1:20]

['/wiki/Music_(disambiguation)',
 '/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 '/wiki/Fran%C3%A7ois_Boucher',
 '/wiki/Paleolithic',
 '/wiki/Performing_arts',
 '/wiki/Acrobatics',
 '/wiki/Ballet',
 '/wiki/List_of_circus_skills',
 '/wiki/Clown',
 '/wiki/Dance',
 '/wiki/Gymnastics',
 '/wiki/Magic_(illusion)',
 '/wiki/Mime_artist',
 '/wiki/Opera',
 '/wiki/Professional_wrestling',
 '/wiki/Puppetry',
 '/wiki/Public_speaking',
 '/wiki/Theatre',
 '/wiki/Ventriloquism']

In [30]:
full_urls = [urljoin(base_url, link) for link in relative_urls]
full_urls[1:20]

['https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 'https://en.wikipedia.org/wiki/Fran%C3%A7ois_Boucher',
 'https://en.wikipedia.org/wiki/Paleolithic',
 'https://en.wikipedia.org/wiki/Performing_arts',
 'https://en.wikipedia.org/wiki/Acrobatics',
 'https://en.wikipedia.org/wiki/Ballet',
 'https://en.wikipedia.org/wiki/List_of_circus_skills',
 'https://en.wikipedia.org/wiki/Clown',
 'https://en.wikipedia.org/wiki/Dance',
 'https://en.wikipedia.org/wiki/Gymnastics',
 'https://en.wikipedia.org/wiki/Magic_(illusion)',
 'https://en.wikipedia.org/wiki/Mime_artist',
 'https://en.wikipedia.org/wiki/Opera',
 'https://en.wikipedia.org/wiki/Professional_wrestling',
 'https://en.wikipedia.org/wiki/Puppetry',
 'https://en.wikipedia.org/wiki/Public_speaking',
 'https://en.wikipedia.org/wiki/Theatre',
 'https://en.wikipedia.org/wiki/Ventriloquism']
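The three cleaning steps above can also be folded into one helper. The following is a sketch, not the notebook's method: the function name `to_absolute_wiki_urls` is hypothetical, it uses `startswith('/wiki/')` rather than the looser `'/wiki/' in link` test, and it additionally drops duplicates, which the cells above do not do.

```python
from urllib.parse import urljoin

def to_absolute_wiki_urls(base_url, hrefs):
    """Keep hrefs that start with /wiki/ and return them as absolute, de-duplicated urls."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip tags without an href, fragments such as '#mw-head', and other non-article links.
        if href is None or not href.startswith("/wiki/"):
            continue
        full_url = urljoin(base_url, href)
        if full_url not in seen:  # keep the first occurrence, preserve page order
            seen.add(full_url)
            result.append(full_url)
    return result

sample_hrefs = [None, "#mw-head", "/wiki/Opera", "/wiki/Dance", "/wiki/Opera"]
cleaned = to_absolute_wiki_urls("https://en.wikipedia.org/wiki/Music", sample_hrefs)
print(cleaned)
# ['https://en.wikipedia.org/wiki/Opera', 'https://en.wikipedia.org/wiki/Dance']
```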

---
<a id = '4.2'></a>
### Extracting from nested tags
First, inspect the element on the website and find a unique identifier (a tag or attribute). Then validate it by inspecting other similar elements.

After inspecting the webpage (try it!), I assumed the links that I want to scrape sit in `<a>` tags located inside `<div>` tags with the attribute `role="note"`. Inspecting the rest of the page confirmed this.

In [32]:
div_notes = data.find_all('div', {'role':'note'})
div_notes[1:5]

[<div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Musical_notation" title="Musical notation">Musical notation</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Musical_improvisation" title="Musical improvisation">Musical improvisation</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Music_theory" title="Music theory">Music theory</a></div>]

#### Inspect

In [39]:
div_notes[20].find_all('a')

[<a href="/wiki/Music_education" title="Music education">Music education</a>]

In [93]:
div_links = [div.find_all('a') for div in div_notes]
div_links[1:10]# some div tags contain more than one link 

[[<a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a>],
 [<a href="/wiki/Musical_notation" title="Musical notation">Musical notation</a>],
 [<a href="/wiki/Musical_improvisation" title="Musical improvisation">Musical improvisation</a>],
 [<a href="/wiki/Music_theory" title="Music theory">Music theory</a>],
 [<a href="/wiki/Elements_of_music" title="Elements of music">Elements of music</a>],
 [<a href="/wiki/Strophic_form" title="Strophic form">Strophic form</a>,
  <a href="/wiki/Binary_form" title="Binary form">Binary form</a>,
  <a href="/wiki/Ternary_form" title="Ternary form">Ternary form</a>,
  <a class="mw-redirect" href="/wiki/Rondo_form" title="Rondo form">Rondo form</a>,
  <a href="/wiki/Variation_(music)" title="Variation (music)">Variation (music)</a>,
  <a href="/wiki/Musical_development" title="Musical development">Musical development</a>],
 [<a href="/wiki/History_of_music" title="History of music">History of music</a>],
 [<a href="/wiki

As some div tags contain more than one link, we iterate over div_notes again and append the links one by one to a flat list.

In [41]:
div_links = []

for div in div_notes:
    links = div.find_all('a')
    
    for link in links:
        div_links.append(link)

In [43]:
div_links[1:10]

[<a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a>,
 <a href="/wiki/Musical_notation" title="Musical notation">Musical notation</a>,
 <a href="/wiki/Musical_improvisation" title="Musical improvisation">Musical improvisation</a>,
 <a href="/wiki/Music_theory" title="Music theory">Music theory</a>,
 <a href="/wiki/Elements_of_music" title="Elements of music">Elements of music</a>,
 <a href="/wiki/Strophic_form" title="Strophic form">Strophic form</a>,
 <a href="/wiki/Binary_form" title="Binary form">Binary form</a>,
 <a href="/wiki/Ternary_form" title="Ternary form">Ternary form</a>,
 <a class="mw-redirect" href="/wiki/Rondo_form" title="Rondo form">Rondo form</a>]
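Flattening a list of lists is a general pattern, so the nested loop has two common shorter spellings. The sketch below uses plain strings in place of the `<a>` tags; the list `nested` is made up for illustration.

```python
from itertools import chain

# Stand-in for the per-div results: one inner list per div, possibly empty.
nested = [
    ["/wiki/Musical_composition"],
    ["/wiki/Strophic_form", "/wiki/Binary_form"],
    [],
]

# Nested list comprehension: same order of loops as the for-loop version.
flat_comprehension = [href for group in nested for href in group]

# itertools.chain.from_iterable: concatenates the inner lists lazily.
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension)
# ['/wiki/Musical_composition', '/wiki/Strophic_form', '/wiki/Binary_form']
print(flat_comprehension == flat_chain)  # True
```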

Combine the link (inside href attribute) with base_url, and get absolute_urls.

In [44]:
abs_urls = [urljoin(base_url, link.get('href')) for link in div_links]

In [46]:
abs_urls[1:10]

['https://en.wikipedia.org/wiki/Musical_composition',
 'https://en.wikipedia.org/wiki/Musical_notation',
 'https://en.wikipedia.org/wiki/Musical_improvisation',
 'https://en.wikipedia.org/wiki/Music_theory',
 'https://en.wikipedia.org/wiki/Elements_of_music',
 'https://en.wikipedia.org/wiki/Strophic_form',
 'https://en.wikipedia.org/wiki/Binary_form',
 'https://en.wikipedia.org/wiki/Ternary_form',
 'https://en.wikipedia.org/wiki/Rondo_form']

<a id = '4.3'></a>
### Scraping multiple pages automatically

We'll try to get the main text from the links that we scraped above. To do so, we need a way to connect to multiple pages with little time and effort.

In [51]:
url_list = abs_urls

In [52]:
import requests
from bs4 import BeautifulSoup

def wiki_contents_extract(url_list):
    """Return one list of paragraph texts per url, in the same order as url_list."""
    texts = []
    for i, url in enumerate(url_list): # i is a counter to check or debug this function
        response = requests.get(url)
        if response.status_code != 200:
            print("We cannot access the url. Check again.")
            texts.append([]) # keep texts aligned with url_list
            continue

        print('Url #{}, {}'.format(i, url))
        try:
            data = BeautifulSoup(response.content, 'lxml')
        except Exception:
            print("We cannot parse the url. Check the code.")
            texts.append([]) # keep texts aligned with url_list
            continue

        # only reached when the page was downloaded and parsed
        paragraphs = data.find_all('p')
        contents = [p.text for p in paragraphs]
        texts.append(contents)

    return texts

In [53]:
texts = wiki_contents_extract(url_list)

Url #0, https://en.wikipedia.org/wiki/Music_(disambiguation)
Url #1, https://en.wikipedia.org/wiki/Musical_composition
Url #2, https://en.wikipedia.org/wiki/Musical_notation
Url #3, https://en.wikipedia.org/wiki/Musical_improvisation
Url #4, https://en.wikipedia.org/wiki/Music_theory
Url #5, https://en.wikipedia.org/wiki/Elements_of_music
Url #6, https://en.wikipedia.org/wiki/Strophic_form
Url #7, https://en.wikipedia.org/wiki/Binary_form
Url #8, https://en.wikipedia.org/wiki/Ternary_form
Url #9, https://en.wikipedia.org/wiki/Rondo_form
Url #10, https://en.wikipedia.org/wiki/Variation_(music)
Url #11, https://en.wikipedia.org/wiki/Musical_development
Url #12, https://en.wikipedia.org/wiki/History_of_music
Url #13, https://en.wikipedia.org/wiki/Music_of_Egypt
Url #14, https://en.wikipedia.org/wiki/History_of_music_in_the_biblical_period
Url #15, https://en.wikipedia.org/wiki/20th-century_music
Url #16, https://en.wikipedia.org/wiki/Aesthetics_of_music
Url #17, https://en.wikipedia.org/w
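One possible way to test this kind of loop without hitting the network is to separate downloading from parsing. The sketch below is an assumption-laden variant, not the function used above: `extract_paragraphs`, `scrape_pages`, `fake_fetch`, and the example.org urls are all hypothetical, and the `fetch` argument is expected to return HTML or None on failure.

```python
from bs4 import BeautifulSoup

def extract_paragraphs(html):
    """Return the text of every <p> tag in an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]

def scrape_pages(url_list, fetch):
    """Map each url to its paragraph texts; pages that cannot be fetched map to []."""
    results = {}
    for url in url_list:
        html = fetch(url)  # hypothetical downloader: returns HTML, or None on failure
        results[url] = extract_paragraphs(html) if html is not None else []
    return results

# Stand-in for a real downloader so the sketch runs offline.
fake_pages = {"https://example.org/a": "<p>First.</p><p>Second.</p>"}

def fake_fetch(url):
    return fake_pages.get(url)

pages = scrape_pages(["https://example.org/a", "https://example.org/missing"], fake_fetch)
print(pages)
# {'https://example.org/a': ['First.', 'Second.'], 'https://example.org/missing': []}
```

For real pages, `fetch` could be a small wrapper around `requests.get(url, timeout=10)` that returns `response.content` when the status code is 200 and None otherwise.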

Use the .join() method to combine all the paragraphs of a page into a single string.

In [94]:
print(texts[0])
print(''.join(texts[0]))

['Music is an art form consisting of sound and silence, expressed through time.\n', 'Music may also refer to:\n']
Music is an art form consisting of sound and silence, expressed through time.
Music may also refer to:



In [60]:
page_text = [''.join(text) for text in texts]

#### For easier look up, we make a dictionary. Urls will be used as keys.

In [61]:
text_dictionary = dict(zip(url_list, page_text))
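Since `zip()` pairs items by position, the dictionary is only correct when `url_list` and `page_text` have the same length and order. A small offline sketch of the join-then-zip step and of looking pages up afterwards (the urls and texts are made up):

```python
url_list = ["https://example.org/a", "https://example.org/b"]
texts = [["First paragraph.\n", "Second paragraph.\n"], []]  # second page had no <p> tags

# One string per page; an empty list of paragraphs becomes an empty string.
page_text = ["".join(paragraphs) for paragraphs in texts]

# zip() pairs the n-th url with the n-th text.
text_dictionary = dict(zip(url_list, page_text))

print(repr(text_dictionary["https://example.org/a"]))
# 'First paragraph.\nSecond paragraph.\n'
print(repr(text_dictionary["https://example.org/b"]))
# ''
print(text_dictionary.get("https://example.org/c", "not scraped"))
# not scraped
```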