# BeautifulSoup:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the page source, which lets you navigate the document hierarchically and extract data in a more readable way.


In [1]:
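import requests
from bs4 import BeautifulSoup

# Placeholder URL: the original call had no argument, so substitute the page you want to scrape
url = "http://www.example.com"
page = requests.get(url)

# Create the BeautifulSoup object from the response body
soup = BeautifulSoup(page.text, "html.parser")

# Pull all instances of the <a> tag
artists = soup.find_all('a')

# Print the text and the link target of each tag
for artist in artists:
    names = artist.get_text(strip=True)
    fullLink = artist.get('href')
    print(names)
    print(fullLink)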

# Scrapy
Scrapy is an open-source, collaborative web crawling framework for Python. It is used to crawl websites and extract structured data from their pages.


In [None]:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/',]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'quote': quote.css('span.text::text').get()}

# Selenium:
Selenium is a tool used for controlling web browsers through programs and automating browser tasks.

In [None]:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.example.com")

# Web Scraping Tables using Pandas

In [8]:
import pandas as pd
import requests

URL = 'https://en.wikipedia.org/wiki/List_of_largest_banks'

# request with headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/127.0.0.1 Safari/537.36"
}
response = requests.get(URL, headers=headers)

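# Parse the HTML tables with pandas. Wrapping the text in StringIO avoids the
# FutureWarning that recent pandas versions raise for literal HTML strings.
from io import StringIO
tables = pd.read_html(StringIO(response.text))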
df = tables[0]

print(df)

    Rank                                Bank name  \
0      1  Industrial and Commercial Bank of China   
1      2               Agricultural Bank of China   
2      3                  China Construction Bank   
3      4                            Bank of China   
4      5                           JPMorgan Chase   
..   ...                                      ...   
95    96                            Handelsbanken   
96    97                 Industrial Bank of Korea   
97    98                                      DNB   
98    99                      Qatar National Bank   
99   100                  National Bank of Canada   

    Total assets (2024) (US$ billion)  
0                             6303.44  
1                             5623.12  
2                             5400.28  
3                             4578.28  
4                             4002.81  
..                                ...  
95                             351.79  
96                             345.81  
97 



### BeautifulSoup Objects

In [4]:
!pip install bs4
!pip install requests pandas



In [5]:
from bs4 import BeautifulSoup
import requests

In [6]:
%%html
<!-- Consider this HTML. The %%html cell magic must be the first line of the cell. -->
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

In [7]:
# We can store it as a string in the variable html:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

In [8]:
# To parse a document, pass it to the BeautifulSoup constructor. The BeautifulSoup object represents the document as a nested data structure:
soup = BeautifulSoup(html, 'html.parser')
# First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
# Beautiful Soup then transforms the HTML document into a tree of Python objects.
# In this lab we cover the BeautifulSoup and Tag objects, which behave the same way for our purposes,
# and then look at NavigableString objects.

# The prettify() method displays the HTML with one tag per line, indented to show the nesting:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



## Tags
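Suppose we want the title of the page and the name of the top-paid player. We can use a Tag object, which corresponds to an HTML tag in the original document, for example the title tag. When the document contains several tags with the same name, accessing the tag as an attribute (such as soup.h3) returns the first one.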

In [9]:
tag_object = soup.title
print("Tag object:", tag_object)

Tag object: <title>Page Title</title>


In [10]:
tag_object=soup.h3
tag_object


<h3><b id="boldest">Lebron James</b></h3>

### Children, Parents, Siblings

The Tag object is a tree of objects. We can access the child of a tag, or navigate down the branch, as follows:

In [11]:
tag_child = tag_object.b
tag_child

<b id="boldest">Lebron James</b>

In [14]:
parent_tag = tag_child.parent
parent_tag # this is identical to tag_object
#tag_object parent is the body element
tag_object.parent

<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>

In [17]:
# the sibling of tag_object is the paragraph element
sibling1=tag_object.next_sibling
sibling1

<p> Salary: $ 92,000,000 </p>

In [25]:
sibling2 = sibling1.next_sibling  # the next <h3> (Stephen Curry)
sibling2
sibling3 = sibling2.next_sibling  # the paragraph with Stephen Curry's salary
sibling3

<p> Salary: $85,000, 000 </p>
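
The sibling navigation above works cleanly because the HTML string has no whitespace between tags. As a minimal sketch, assuming a hypothetical snippet (`snippet`) with line breaks between the tags, `next_sibling` returns the whitespace text node first, while `find_next_sibling()` skips straight to the next matching tag:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: same structure as the lab HTML, but with line breaks between tags
snippet = """<body>
<h3>Lebron James</h3>
<p>Salary: $ 92,000,000</p>
</body>"""

demo = BeautifulSoup(snippet, "html.parser")
h3 = demo.h3

# next_sibling returns the very next node, which here is the newline text node
raw_sibling = h3.next_sibling
print(repr(raw_sibling))

# find_next_sibling() ignores text nodes and returns the next matching tag
tag_sibling = h3.find_next_sibling("p")
print(tag_sibling)
```

For pages fetched from the web, which are usually indented, `find_next_sibling()` tends to be the safer choice.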

In [38]:
## Example made by Pablo Herrador
import pandas as pd
soup = BeautifulSoup(html, "html.parser")

names = []
salaries= []

for h3 in soup.find_all("h3"):
    name = h3.get_text(strip=True)
    salario_p = h3.find_next_sibling("p")  # find the next <p> at the same level
    salary = salario_p.get_text(strip=True) if salario_p else None

    names.append(name)
    salaries.append(salary)

data = pd.DataFrame({"name":names, "salary": salaries})
print(data)

import re # "regular expressions" library

# clean df
def clean_salary(s):
    if s is None:
        return None
    # remove every non-digit character ('Salary: $', commas and spaces)
    s_clean = re.sub(r'[^\d]', '', s)
    return int(s_clean)


data['salary'] = data['salary'].apply(clean_salary)

print(data)


            name                salary
0   Lebron James  Salary: $ 92,000,000
1  Stephen Curry  Salary: $85,000, 000
2   Kevin Durant  Salary: $73,200, 000
            name    salary
0   Lebron James  92000000
1  Stephen Curry  85000000
2   Kevin Durant  73200000


### HTML Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute id whose value is boldest. You can access a tag’s attributes by treating the tag like a dictionary:

In [40]:
tag_child["id"]

'boldest'

In [41]:
tag_child.attrs

{'id': 'boldest'}

In [42]:
tag_child.get("id")

'boldest'
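
The difference between the two access styles shows up when an attribute is missing. The following sketch uses a hypothetical one-tag document (`demo`) with an extra `class` attribute added for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical tag based on the lab HTML, with an added class attribute
demo = BeautifulSoup("<b id='boldest' class='name star'>Lebron James</b>", "html.parser")
b_tag = demo.b

# get() returns None, or a supplied default, when the attribute is missing
missing = b_tag.get("href")
fallback = b_tag.get("href", "no link")

# Square-bracket access raises KeyError when the attribute is missing
try:
    b_tag["href"]
    raised = False
except KeyError:
    raised = True

# class can hold several values, so Beautiful Soup returns it as a list
classes = b_tag["class"]
print(missing, fallback, raised, classes)
```

Using `get()` is convenient when scraping real pages, where some tags may lack the attribute you are looking for.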

### Navigable String

A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the NavigableString class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the Tag object tag_child as follows:

In [43]:
tag_string = tag_child.string
tag_string

'Lebron James'

In [44]:
type(tag_string)

bs4.element.NavigableString

A NavigableString behaves like a Python string; the main difference is that it also supports some Beautiful Soup features, such as navigating the tree. We can convert it to a plain Python string:

In [45]:
unicode_string = str(tag_string)
unicode_string

'Lebron James'

In [46]:
type(unicode_string)

str
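
One caveat is worth a short illustration: `.string` only returns text when the tag has a single child. The sketch below assumes a hypothetical heading (`demo`) that mixes a tag and plain text, and compares `.string` with `get_text()`:

```python
from bs4 import BeautifulSoup

# Hypothetical heading that contains both a <b> tag and loose text
demo = BeautifulSoup("<h3><b>Lebron James</b> (forward)</h3>", "html.parser")

# <b> has exactly one text child, so .string returns it
single_child = demo.b.string

# <h3> has two children (a tag and a text node), so .string is None
mixed_children = demo.h3.string

# get_text() concatenates all the text inside the tag
full_text = demo.h3.get_text()

print(single_child, mixed_children, full_text)
```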

### Filter

Filters allow you to find complex patterns; the simplest filter is a string. In this section we pass a string to a filter method, and Beautiful Soup performs a match against that exact string. Consider the following HTML of rocket launches:

In [None]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>

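We can store it as a string in the variable table. Note that in this string the Florida links end with `<a>` instead of `</a>`; the parser treats each one as a new, empty anchor and closes it automatically, as the prettified output shows: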

In [51]:
table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"

In [56]:
table_bs = BeautifulSoup(table, "html.parser")
print(table_bs.prettify())

<table>
 <tr>
  <td id="flight">
   Flight No
  </td>
  <td>
   Launch site
  </td>
  <td>
   Payload mass
  </td>
 </tr>
 <tr>
  <td>
   1
  </td>
  <td>
   <a href="https://en.wikipedia.org/wiki/Florida">
    Florida
    <a>
    </a>
   </a>
  </td>
  <td>
   300 kg
  </td>
 </tr>
 <tr>
  <td>
   2
  </td>
  <td>
   <a href="https://en.wikipedia.org/wiki/Texas">
    Texas
   </a>
  </td>
  <td>
   94 kg
  </td>
 </tr>
 <tr>
  <td>
   3
  </td>
  <td>
   <a href="https://en.wikipedia.org/wiki/Florida">
    Florida
    <a>
    </a>
   </a>
  </td>
  <td>
   80 kg
  </td>
 </tr>
</table>



# Find All

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

The method signature is `find_all(name, attrs, recursive, string, limit, **kwargs)`.
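
As a rough sketch of how some of these parameters behave, the example below uses a small hypothetical table (`demo_html`), not the lab's `table` variable:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table used only for this illustration
demo_html = "<table><tr><td>1</td><td>Florida</td></tr><tr><td>2</td><td>Texas</td></tr></table>"
demo = BeautifulSoup(demo_html, "html.parser")

# name: every <td> anywhere below the root
all_cells = demo.find_all("td")

# limit: stop after the first two matches
first_two = demo.find_all("td", limit=2)

# recursive=False: only direct children are searched; the <td> tags sit
# inside <tr>, so nothing directly under <table> matches
direct_only = demo.table.find_all("td", recursive=False)

# string: keep only the tags whose text is exactly "Texas"
texas_cells = demo.find_all("td", string="Texas")

print(len(all_cells), first_two, direct_only, texas_cells)
```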

### Name

When we set the name parameter to a tag name, the method extracts all the tags with that name, together with their children.

In [61]:
table_row = table_bs.find_all('tr')
table_row

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>]

In [69]:
for i in range(0,4):
    print(f"Row {i} in table_row:", table_row[i])
type(table_row[0])

Row 0 in table_row: <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
Row 1 in table_row: <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>
Row 2 in table_row: <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
Row 3 in table_row: <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>


bs4.element.Tag

In [75]:
# get the first <td> child of the second row, then its next sibling
tbl_r0_child = table_row[1].td
tbl_r0_child.next_sibling


<td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>

In [79]:
for i, row in enumerate(table_row):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print('column', j, "cell", cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>
column 2 cell <td>80 kg</td>


## Attributes

If a keyword argument is not one of the recognized parameters, it is turned into a filter on the tag’s attributes. For example, with the id argument Beautiful Soup filters against each tag’s id attribute. The first td element has an id of flight, so we can filter on that value:

In [80]:
table_bs.find_all(id="flight")

[<td id="flight">Flight No</td>]

In [81]:
list_input=table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]

In [82]:
table_bs.find_all(href=True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]
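
Attribute filters are not limited to plain keyword arguments. The sketch below, built on a hypothetical fragment (`demo_html`) with invented `class` and `data-col` attributes, shows three other forms: `class_`, the `attrs` dictionary, and a compiled regular expression:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; the class and data-col attributes are invented for this example
demo_html = (
    "<td id='flight' data-col='0'>Flight No</td>"
    "<a class='ref external' href='https://en.wikipedia.org/wiki/Florida'>Florida</a>"
    "<a class='ref' href='https://en.wikipedia.org/wiki/Texas'>Texas</a>"
)
demo = BeautifulSoup(demo_html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_
external = demo.find_all("a", class_="external")

# Attribute names that are not valid Python identifiers go in the attrs dict
by_data = demo.find_all(attrs={"data-col": "0"})

# A compiled regular expression is searched for inside the attribute value
wiki_links = demo.find_all(href=re.compile(r"wikipedia\.org/wiki/T"))

print(external, by_data, wiki_links)
```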

# Exercises

### find_all()

Find all the `<a>` tags without an `href` value

In [99]:
anchors_without_href = table_bs.find_all("a", href=False)

In [86]:
table_bs.find_all(lambda tag: tag.name == "a" and not tag.has_attr("href"))

[<a></a>, <a> </a>]

Find all the elements that do not contain any links

In [103]:
table_bs.find_all(lambda tag: tag.name != "a" and not tag.find('a'))

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <td>1</td>,
 <td>300 kg</td>,
 <td>2</td>,
 <td>94 kg</td>,
 <td>3</td>,
 <td>80 kg</td>]

With the string argument you can search for strings instead of tags. Here we find all the strings that are exactly Florida:

In [112]:
table_bs.find_all(string="Florida")

['Florida', 'Florida']
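
A string filter matches the whole text exactly, so a cell containing `Florida ` with a trailing space would be missed. One possible workaround, sketched here on a hypothetical one-row table (`demo_html`), is to pass a compiled regular expression instead:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical row in which the last cell has a trailing space
demo_html = "<table><tr><td>Florida</td><td>Texas</td><td>Florida </td></tr></table>"
demo = BeautifulSoup(demo_html, "html.parser")

# Exact match: the cell with the trailing space is not found
exact = demo.find_all(string="Florida")

# Regular expression: any string containing "Florida" is found
pattern = demo.find_all(string=re.compile("Florida"))

# Each result is a NavigableString, so .parent leads back to the enclosing tag
cells = [s.parent.name for s in pattern]

print(exact, pattern, cells)
```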

### Let's go again

In [None]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>


In [134]:
html="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"
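html_bs = BeautifulSoup(html, "html.parser")
# Select the table whose class is "pizza"; the result is not displayed because it is not the last expression in the cell
html_bs.find_all("table", class_="pizza")
# html_bs.table returns the first <table> in the document, which is the rocket table
table = html_bs.table
table.contents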


[<tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr><td>1</td><td>Florida</td><td>300 kg</td></tr>,
 <tr><td>2</td><td>Texas</td><td>94 kg</td></tr>,
 <tr><td>3</td><td>Florida </td><td>80 kg</td></tr>]
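
The output above shows the rocket table because `html_bs.table` always returns the first table. As a minimal sketch, assuming a shortened hypothetical version of the two tables (`demo_html`), the pizza table can be selected by class and its rows collected into a list of lists:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the two tables used in this section
demo_html = (
    "<table class='rocket'><tr><td>Flight No</td></tr><tr><td>1</td></tr></table>"
    "<table class='pizza'><tr><td>Pizza Place</td><td>Orders</td></tr>"
    "<tr><td>Domino's Pizza</td><td>10</td></tr>"
    "<tr><td>Papa John's </td><td>15 </td></tr></table>"
)
demo = BeautifulSoup(demo_html, "html.parser")

# Attribute access returns the first <table>, here the rocket table
first_table = demo.table

# find() with class_ selects the table we actually want
pizza_table = demo.find("table", class_="pizza")

# One inner list per row; strip=True removes the stray spaces around the text
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in pizza_table.find_all("tr")
]
print(rows)
```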

# Downloading And Scraping The Contents Of A Web Page

In [140]:
import requests
url = "https://www.ibm.com"
data = requests.get(url).text  # download the page as text
soup = BeautifulSoup(data, "html.parser")

### Scrape all links

In [142]:
for link in soup.find_all('a',href=True):
    print(link.get('href'))

https://www.ibm.com/sports/usopen?lnk=hpls1us
https://www.ibm.com/think/insights/lets-create-smarter-business?lnk=hpls2us
https://www.ibm.com/think/insights/scale-ai-agents-business?lnk=hprc1us
https://www.ibm.com/solutions/ipaas?lnk=hprc2us
https://www.ibm.com/campaign/data-and-ai-trust?lnk=hprc3us
https://www.ibm.com/thought-leadership/institute-business-value/report/quantum-safe?lnk=hprc4us
https://www.ibm.com/community/ibm-techxchange-conference/?lnk=hpdev1us
https://github.com/ibm-granite-community/granite-snack-cookbook?lnk=hpdev2us
https://developer.ibm.com/technologies/artificial-intelligence/?lnk=hpdev3us
https://www.ibm.com/products/watsonx-governance?OT=AITrustWeek&lnk=hpdev4us
https://www.ibm.com/new/announcements/openai-s-open-source-models-available-on-ibm-watsonx-ai?lnk=hpdev5us
https://research.ibm.com/blog/granite-vision-ocr-leaderboard?lnk=hpdev6us
https://www.ibm.com/new/announcements/governing-ai-with-confidence-our-journey-with-watsonx-governance?lnk=hpdev7us
https
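
Pages often contain relative links such as `/about` alongside absolute ones. The following sketch, which assumes a placeholder base URL (`base_url`) and a hypothetical set of anchors, uses `urllib.parse.urljoin` from the standard library to turn every link into an absolute URL:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Placeholder base URL and hypothetical anchors, used only for illustration
base_url = "https://www.example.com/products/"
demo_html = (
    "<a href='/about'>About</a>"
    "<a href='pricing.html'>Pricing</a>"
    "<a href='https://www.ibm.com/think'>Think</a>"
    "<a>No link</a>"
)
demo = BeautifulSoup(demo_html, "html.parser")

# href=True skips anchors without an href; urljoin resolves relative paths
# against the base URL and leaves absolute URLs unchanged
absolute_links = [urljoin(base_url, a["href"]) for a in demo.find_all("a", href=True)]

for link in absolute_links:
    print(link)
```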

## Scrape all image tags

In [143]:
for link in soup.find_all('img'):  # in HTML an image is represented by the <img> tag
    print(link)
    print(link.get('src'))

<img alt="Two hands scaling a square with AI pictogram" class="cmp-image__image" height="1271" itemprop="contentUrl" loading="lazy" src="https://assets.ibm.com/is/image/ibm/api-connect-data-fabric?ts=1757005554058&amp;dpr=off" srcset="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="1271"/>
https://assets.ibm.com/is/image/ibm/api-connect-data-fabric?ts=1757005554058&dpr=off
<img alt="Technical illustration to represent how to best scale and accelerate the impact of AI through watsonx.data" class="cmp-image__image" height="3139" itemprop="contentUrl" loading="lazy" src="https://assets.ibm.com/is/image/ibm/watsonx-data-technical-illustration-grey10-web-guidebook?ts=1757005562484&amp;dpr=off" srcset="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" title="Abstract tech illustration with geometric cubes" width="3660"/>
https://assets.ibm.com/is/image/ibm/watsonx-data-technical-illustration-grey10-web-guidebook?ts=1757005562484&dp

## Scrape Data from HTML tables

In [150]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")

In [151]:
table = soup.find('table') # in html table is represented by the tag <table>

In [152]:
# Get all rows from the table
for row in table.find_all('tr'):  # in HTML a table row is represented by the <tr> tag
    # Get all cells in each row
    cols = row.find_all('td')  # in HTML a table cell is represented by the <td> tag
    color_name = cols[2].string  # store the value in column 3 as color_name
    color_code = cols[3].string  # store the value in column 4 as color_code
    print("{}--->{}".format(color_name, color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
