# Web Scraping Using BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files; this lab focuses on HTML. It represents the HTML as a set of objects with methods for parsing it. We can navigate the HTML as a tree and/or filter out what we are looking for.


In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
# store the HTML source code in a variable
html = "<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3> \
<b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p> \
<h3>Stephen Curry</h3><p> Salary: $85,000,000</p> \
<h3>Kevin Durant</h3><p> Salary: $73,200,000</p></body></html>"
html

"<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3> <b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p> <h3>Stephen Curry</h3><p> Salary: $85,000,000</p> <h3>Kevin Durant</h3><p> Salary: $73,200,000</p></body></html>"

In [3]:
# to parse a document, pass it into the BeautifulSoup constructor
# the resulting object represents the document as a nested data structure

soup = BeautifulSoup(html, 'html5lib')
soup

<!DOCTYPE html>
<html><head><title>Page Title</title></head><body><h3> <b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p> <h3>Stephen Curry</h3><p> Salary: $85,000,000</p> <h3>Kevin Durant</h3><p> Salary: $73,200,000</p></body></html>

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a tree of Python objects. The <code>BeautifulSoup</code> object can create other types of objects. In this lab, we will cover <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab can be treated as identical. Finally, we will look at <code>NavigableString</code> objects.
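
The object types mentioned above can be inspected directly. The following is a minimal sketch, not part of the original lab: the HTML snippet and the variable name `doc` are made up for illustration, and it uses the built-in `html.parser` so that `html5lib` is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical snippet, used only to show the object types
doc = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(type(doc))           # the whole document
print(type(doc.b))         # a single tag
print(type(doc.b.string))  # the text inside that tag

# BeautifulSoup is a subclass of Tag, which is why Tag methods such as
# find_all() and prettify() also work on the document object itself.
print(isinstance(doc, Tag))
```

This is why the lab can treat the two as interchangeable: anything that works on a `Tag` also works on the `BeautifulSoup` object.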


In [4]:
# display the HTML with indentation that reflects the nested structure
soup.prettify()

'<!DOCTYPE html>\n<html>\n <head>\n  <title>\n   Page Title\n  </title>\n </head>\n <body>\n  <h3>\n   <b id="boldest">\n    Lebron James\n   </b>\n  </h3>\n  <p>\n   Salary: $ 92,000,000\n  </p>\n  <h3>\n   Stephen Curry\n  </h3>\n  <p>\n   Salary: $85,000,000\n  </p>\n  <h3>\n   Kevin Durant\n  </h3>\n  <p>\n   Salary: $73,200,000\n  </p>\n </body>\n</html>\n'

## Tag

The <code>Tag</code> object corresponds to an HTML tag in the original document.  
A <code>Tag</code> object is a tree of objects.  
For example, the <code>title</code> tag corresponds to the HTML title element.

In [5]:
# the title tag
tag_title = soup.title
print("Tag object: ", tag_title)

# find out the type of the tag object
print(type(tag_title))

# if there is more than one tag with the same name,
# the first element with that tag name is returned
tag_object = soup.h3
print(tag_object)

# tag_object is itself a tree: the h3 tag contains a b tag

Tag object:  <title>Page Title</title>
<class 'bs4.element.Tag'>
<h3> <b id="boldest">Lebron James</b></h3>


## Children, Parents and Siblings

In [6]:
print(tag_object)

# tag_object is the first h3 element

<h3> <b id="boldest">Lebron James</b></h3>


In [7]:
# we can navigate down a branch of the tree by using a child tag's name as an attribute

# access the child of tag_object
tagChild = tag_object.b
print(tagChild)

# access the parent of tag_object's child
parentTag = tagChild.parent
print(parentTag)  # this is tag_object itself


<b id="boldest">Lebron James</b>
<h3> <b id="boldest">Lebron James</b></h3>


In [8]:
#access the tag_object (h3) parent, which is the body
print(tag_object.parent)

<body><h3> <b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p> <h3>Stephen Curry</h3><p> Salary: $85,000,000</p> <h3>Kevin Durant</h3><p> Salary: $73,200,000</p></body>


In [9]:
# access the next sibling of tag_object (h3), i.e. the next node at the same level
sibling1 = tag_object.next_sibling
print(sibling1)  # the paragraph with Lebron James's salary

# sibling2 is the whitespace text node between </p> and the next <h3>
sibling2 = sibling1.next_sibling
print(sibling2)

# sibling3 is the Stephen Curry header
sibling3 = sibling2.next_sibling
print(sibling3)

# the sibling after sibling3 is the paragraph with Stephen Curry's salary
print(sibling3.next_sibling)

<p> Salary: $ 92,000,000 </p>
 
<h3>Stephen Curry</h3>
<p> Salary: $85,000,000</p>
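
Because `next_sibling` also returns whitespace text nodes, it can help to step over them. The sketch below is one possible way to do that, assuming a shortened copy of the player HTML and the built-in `html.parser`; the names `snippet`, `doc`, and `tag_siblings` are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical copy of the player HTML (note the space after </p>)
snippet = (
    "<h3>Lebron James</h3><p>Salary: $92,000,000</p> "
    "<h3>Stephen Curry</h3><p>Salary: $85,000,000</p>"
)
doc = BeautifulSoup(snippet, "html.parser")
first_header = doc.h3

# Keep only the siblings that are tags, dropping whitespace text nodes
tag_siblings = [s for s in first_header.next_siblings if isinstance(s, Tag)]
print([s.name for s in tag_siblings])

# find_next_sibling() with a tag name jumps straight to the next matching tag
next_header = first_header.find_next_sibling("h3")
print(next_header)
```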


## HTML Attributes

A tag may have any number of attributes.  
For example, the tag <code>&lt;b id="boldest"&gt;</code> has an attribute <code>id</code> whose value is <code>boldest</code>.  
We can access a tag's attributes by treating the tag like a dictionary.

In [10]:
print(tagChild['id'])
print(tagChild.attrs)

print(tagChild.get('id'))

boldest
{'id': 'boldest'}
boldest
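
The difference between the two access styles shows up when an attribute is missing. The following sketch assumes a made-up tag with an extra `class` attribute and uses the built-in `html.parser`; it is an illustration, not part of the original lab.

```python
from bs4 import BeautifulSoup

# Hypothetical tag: the class attribute is added here for illustration
doc = BeautifulSoup("<b id='boldest' class='name bold'>Lebron James</b>", "html.parser")
tag = doc.b

print(tag.get("id"))            # value of an existing attribute
print(tag.get("title"))         # missing attribute: get() returns None
print(tag.get("title", "n/a"))  # or a default that we supply

# class can hold several values, so Beautiful Soup returns a list
print(tag["class"])

# square-bracket access raises KeyError for a missing attribute
try:
    tag["title"]
except KeyError:
    print("no title attribute")
```

Using `get()` is usually the safer choice when scraping pages where an attribute may or may not be present.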


## Navigable String

A string corresponds to a bit of text or content within a tag.  
Beautiful Soup uses the <code>NavigableString</code> class to contain this text.

In [11]:
print(tagChild)

tagChildString = tagChild.string
print(tagChildString)
print(type(tagChildString))

# convert the NavigableString to a plain Python str
tagChildWord = str(tagChildString)
print(type(tagChildWord))

<b id="boldest">Lebron James</b>
Lebron James
<class 'bs4.element.NavigableString'>
<class 'str'>
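
One caveat with `.string` is that it only works when a tag has a single child. The sketch below, which assumes a made-up paragraph and the built-in `html.parser`, contrasts it with `get_text()`.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph that mixes plain text with a nested tag
doc = BeautifulSoup("<p>Salary: <b>$92,000,000</b></p>", "html.parser")

print(doc.b.string)      # b has a single text child, so .string returns it
print(doc.p.string)      # p has two children, so .string is None
print(doc.p.get_text())  # get_text() joins all the text inside the tag
```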


## Filter

Filters allow you to find complex patterns; the simplest filter is a string.  
In this section, we pass a string to different filter methods, and Beautiful Soup performs a match against that exact string.

In [13]:
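# store an HTML table in the variable table
# note: two of the anchors are closed with <a> instead of </a>;
# the parser repairs this by creating extra, empty <a> tags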

table = "<table><tr><td id='flight'>Flight No</td><td>Launch site</td> \
<td>Payload mass</td></tr><tr> <td>1</td> \
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td> \
<td>300 kg</td></tr><tr><td>2</td> \
<td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td> \
<td>94 kg</td></tr><tr><td>3</td> \
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td> \
<td>80 kg</td></tr></table>"

# then parse the table into a nested data structure with BeautifulSoup
tableBs = BeautifulSoup(table, 'html5lib')

**Find All**

In [26]:
def head():
    print ("========")
    
# find all tags with the name tr
tableRow = tableBs.find_all('tr')
print(tableRow)
head()

# the result can be indexed and iterated like a list
print(tableRow[0])
head()

# we can also obtain the first td child of a row
print(tableRow[0].td)
head()

# if we iterate through the list, each element corresponds to a row in the table
for i, row in enumerate(tableRow):
    print('row', i, 'is', row)
head()

# as each row is a Tag object, we can apply find_all to it
# and extract the table cells into the list "cells" using the tag name "td";
# these are all the descendants of the row with the name "td"
for i, row in enumerate(tableRow):
    print("Row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print("Column", j, "cell", cell)
head()

# we can also pass a list, and find_all matches against any item in that list
listInput = tableBs.find_all(name=['tr', 'td'])
print(listInput)

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <tr> <td>1</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td> <td>300 kg</td></tr>, <tr><td>2</td> <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td> <td>94 kg</td></tr>, <tr><td>3</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td> <td>80 kg</td></tr>]
<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
<td id="flight">Flight No</td>
row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td> <td>300 kg</td></tr>
row 2 is <tr><td>2</td> <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td> <td>94 kg</td></tr>
row 3 is <tr><td>3</td> <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td> <td>80 kg</td></tr>
Row 0
Column 0 cell <td id="flig
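
The nested loops above print each cell as a tag. To keep only the text, the same traversal can build a list of rows. This is a minimal sketch that assumes a shortened, well-formed copy of the flight table and the built-in `html.parser`; the names `html_table`, `rows`, `header`, and `records` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the flight table with properly closed anchors
html_table = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr>"
    "</table>"
)
doc = BeautifulSoup(html_table, "html.parser")

# One inner list per <tr>, holding the stripped text of each <td>
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in doc.find_all("tr")
]

# Split the header row from the data rows
header, *records = rows
print(header)
print(records)
```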

## Attributes

If an argument is not recognized, it is turned into a filter on the tag's attributes.

For example, with the <code>id</code> argument, Beautiful Soup filters against each tag's <code>id</code> attribute.  
The first <code>td</code> element has an <code>id</code> of <code>flight</code>, so we can filter on that value.

In [28]:
tableBs.find_all(id = 'flight')

[<td id="flight">Flight No</td>]

In [37]:
# we can also find all the elements that link to the Florida Wikipedia page
listInput = tableBs.find_all(href="https://en.wikipedia.org/wiki/Florida")
print(listInput)
head()

# if we set href=True, the code finds all anchor tags
# that have an href attribute, regardless of its value
print(tableBs.find_all('a', href=True))
head()

# conversely, href=False finds all anchor tags without an href attribute
print(tableBs.find_all('a', href=False))
head()

# using the object soup, we can find the element whose id attribute is set to boldest
print(soup.find_all(id='boldest'))
head()

# lastly, with the string argument, we can search for strings instead of tags
print(tableBs.find_all(string='Florida'))

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
[<a></a>, <a> </a>]
[<b id="boldest">Lebron James</b>]
['Florida', 'Florida']
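
Beyond exact strings and booleans, attribute filters also accept regular expressions and functions. The following is a sketch under assumptions: the HTML, the helper `has_no_href`, and the variable names are made up for illustration, and the built-in `html.parser` is used.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical set of anchors, one of them without an href
html_links = (
    "<a href='https://en.wikipedia.org/wiki/Florida'>Florida</a>"
    "<a href='https://en.wikipedia.org/wiki/Texas'>Texas</a>"
    "<a>no link</a>"
)
doc = BeautifulSoup(html_links, "html.parser")

# A compiled regular expression is matched with search(), so a partial
# match anywhere in the attribute value is enough.
wiki_links = doc.find_all("a", href=re.compile(r"wikipedia\.org"))

# A function filter receives each Tag and returns True to keep it.
def has_no_href(tag):
    return tag.name == "a" and not tag.has_attr("href")

unlinked = doc.find_all(has_no_href)

# The string argument accepts a regular expression as well.
texas = doc.find_all(string=re.compile("^Tex"))

print(wiki_links)
print(unlinked)
print(texas)
```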


## Find with Two Tables

In [38]:
twoTb="<h3>Rocket Launch </h3> \
<p><table class='rocket'> \
<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr> \
<tr><td>1</td><td>Florida</td><td>300 kg</td></tr> \
<tr><td>2</td><td>Texas</td><td>94 kg</td></tr> \
<tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p>\
<p><h3>Pizza Party</h3> \
<table class='pizza'> \
<tr><td>Pizza Place</td><td>Orders</td><td>Slices </td></tr> \
<tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr> \
<tr><td>Little Caesars</td><td>12</td><td >144 </td></tr> \
<tr><td>Papa John's</td><td>15 </td><td>165</td></tr>"

In [39]:
# then create a BeautifulSoup object from twoTb
tbBs = BeautifulSoup(twoTb, 'html.parser')

In [42]:
# we can find the first table using the tag name table
print(tbBs.find('table'))
head()

# we can also filter on the class attribute to find the second table
# because class is a reserved keyword in Python, the argument is named class_
print(tbBs.find("table", class_="pizza"))

<table class="rocket"> <tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr> <tr><td>1</td><td>Florida</td><td>300 kg</td></tr> <tr><td>2</td><td>Texas</td><td>94 kg</td></tr> <tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>
<table class="pizza"> <tr><td>Pizza Place</td><td>Orders</td><td>Slices </td></tr> <tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr> <tr><td>Little Caesars</td><td>12</td><td>144 </td></tr> <tr><td>Papa John's</td><td>15 </td><td>165</td></tr></table>
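
An alternative to the `class_` keyword is to pass the attribute in the `attrs` dictionary. The sketch below assumes two tiny made-up tables and the built-in `html.parser`; it also shows that `find` returns `None` when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

# Two tiny, hypothetical tables standing in for the rocket and pizza tables
two_tables = (
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Little Caesars</td></tr></table>"
)
doc = BeautifulSoup(two_tables, "html.parser")

by_keyword = doc.find("table", class_="pizza")
by_attrs = doc.find("table", attrs={"class": "pizza"})
print(by_keyword is by_attrs)  # both calls return the same Tag object

# find() returns None when no tag matches
missing = doc.find("table", class_="burger")
print(missing)
```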


# Download and Scrape the Contents of the IBM Web Page

In [43]:
# store the web page URL in the variable url
url = "http://www.ibm.com"

In [45]:
# use requests.get to download the contents of the webpage as text
# and store it in the variable data
data = requests.get(url).text
data

'\n<!DOCTYPE HTML>\n<html lang="id-id">\n<head>\r\n    \r\n    \r\n    \r\n    \r\n    \r\n    \r\n    \r\n    <meta charset="UTF-8"/>\r\n    <meta name="languageCode" content="id"/>\r\n    <meta name="countryCode" content="id"/>\r\n    <meta name="searchTitle" content="IBM - Indonesia"/>\r\n    <meta name="focusArea" content="Cross IBM - All"/>\r\n    <title>IBM - Indonesia</title>\r\n    <link rel="icon" href="/content/dam/adobe-cms/default-images/favicon.svg"/>\r\n    \r\n    <meta name="description" content="Selama lebih dari satu abad, IBM telah menjadi inovator teknologi global, yang memimpin kemajuan dalam AI, otomatisasi, dan solusi hybrid cloud yang membantu pertumbuhan bisnis."/>\r\n    <meta name="template" content="full-width-layout"/>\r\n    <meta name="viewport" content="width=device-width, initial-scale=1"/>\r\n    <meta name="robots" content="index, follow"/>\r\n    \r\n      \r\n      \r\n    <link rel="canonical" href="https://www.ibm.com/id-id"/>\r\n    <style id="an
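
The download step above assumes the request succeeds. A slightly more defensive version might set a timeout and check the HTTP status. This is only a sketch: the function name `fetch_html` and its `get` parameter are hypothetical and exist here so the function can be exercised without a network connection.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Download a page and return its text.

    `get` defaults to requests.get; it is a parameter only so that a
    stand-in can be supplied for offline testing (an assumption of this sketch).
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises an exception for 4xx/5xx responses
    return response.text
```

It could then be called as `data = fetch_html("http://www.ibm.com")` in place of the one-line download.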

In [46]:
#then, create a beautifulsoup object using its constructor
soup = BeautifulSoup(data, 'html5lib')
soup

<!DOCTYPE html>
<html lang="id-id"><head>
    
    
    
    
    
    
    
    <meta charset="utf-8"/>
    <meta content="id" name="languageCode"/>
    <meta content="id" name="countryCode"/>
    <meta content="IBM - Indonesia" name="searchTitle"/>
    <meta content="Cross IBM - All" name="focusArea"/>
    <title>IBM - Indonesia</title>
    <link href="/content/dam/adobe-cms/default-images/favicon.svg" rel="icon"/>
    
    <meta content="Selama lebih dari satu abad, IBM telah menjadi inovator teknologi global, yang memimpin kemajuan dalam AI, otomatisasi, dan solusi hybrid cloud yang membantu pertumbuhan bisnis." name="description"/>
    <meta content="full-width-layout" name="template"/>
    <meta content="width=device-width, initial-scale=1" name="viewport"/>
    <meta content="index, follow" name="robots"/>
    
      
      
    <link href="https://www.ibm.com/id-id" rel="canonical"/>
    <style id="anti-flicker-style">
        :not(:defined) {
          visibility: hidden;
    

In [47]:
#1. scrape all of the links in the webpage
for link in soup.find_all('a', href=True):
    print(link.get('href'))

https://www.ibm.com/cloud?lnk=intro


In [50]:
#2. scrape the image tags
for image in soup.find_all('img'):
    print(image)
    print(image.get('src'))

<img alt="Potret para konsultan IBM" class="bx--image__img" id="image-1653134387" loading="lazy" src="/content/dam/adobe-cms/default-images/home-consultants.component.crop-16by9-xl.ts=1695238692853.jpg/content/adobe-cms/id/id/homepage/_jcr_content/root/table_of_contents/simple_image"/>
/content/dam/adobe-cms/default-images/home-consultants.component.crop-16by9-xl.ts=1695238692853.jpg/content/adobe-cms/id/id/homepage/_jcr_content/root/table_of_contents/simple_image
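
The `src` value printed above is a relative path, so it cannot be requested on its own. One way to turn such paths into absolute URLs is `urllib.parse.urljoin`. In this sketch the page fragment and its paths are made up, and the base URL is assumed to be the address the page was downloaded from.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.ibm.com"  # assumed address of the downloaded page

# Hypothetical fragment with one relative link, one image, and one absolute link
page = (
    "<a href='/cloud'>Cloud</a>"
    "<img src='/content/dam/logo.svg'/>"
    "<a href='https://example.com/x'>X</a>"
)
doc = BeautifulSoup(page, "html.parser")

# urljoin leaves absolute URLs untouched and resolves relative ones
links = [urljoin(base_url, a["href"]) for a in doc.find_all("a", href=True)]
images = [urljoin(base_url, img["src"]) for img in doc.find_all("img", src=True)]

print(links)
print(images)
```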


In [52]:
# 3. scrape data from HTML tables
# Note: we use another source that contains an HTML table

# 3.1 Get the source
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

# 3.2 Get the contents of the webpage in text format
datas = requests.get(url).text

# 3.3 Create a BeautifulSoup object
soup = BeautifulSoup(datas, 'html5lib')

# 3.4 Find the HTML table in the webpage
table = soup.find('table')  # in HTML, a table is represented by the tag <table>

# 3.5 Get all rows from the table
for row in table.find_all('tr'):  # in HTML, a row is represented by the tag <tr>
    cols = row.find_all('td')  # in HTML, a cell is represented by the tag <td>
    color_name = cols[2].string  # store the value in column 3 as color_name
    color_code = cols[3].text  # store the value in column 4 as color_code
    print("{}--->{}".format(color_name, color_code))

Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
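
The loop above indexes `cols[2]` and `cols[3]` directly, which fails on rows that have fewer cells. A more defensive variant might skip such rows and collect the results in a dictionary. This sketch uses a small made-up table whose column layout is an assumption, not the real page, and it uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a <th> header row, two data rows, and a short footer row
sample = (
    "<table>"
    "<tr><th>Number</th><th>Color</th><th>Color Name</th><th>Hex Code</th></tr>"
    "<tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>"
    "<tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>"
    "<tr><td colspan='4'>footer note</td></tr>"
    "</table>"
)
doc = BeautifulSoup(sample, "html.parser")

colors = {}
for row in doc.find("table").find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 4:
        continue  # header rows (th cells) and short rows carry no color data
    # get_text(strip=True) also works when a cell contains nested tags
    name = cols[2].get_text(strip=True)
    code = cols[3].get_text(strip=True)
    colors[name] = code

print(colors)
```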
