# **Web Scraping Lab**


Importing the required modules and functions


In [2]:
from bs4 import BeautifulSoup 
import requests

<h2 id="BSO">Beautiful Soup Objects</h2>


Beautiful Soup is a Python library for pulling data out of HTML and XML files; this lab focuses on HTML. It represents the HTML as a set of objects with methods for parsing it, so we can navigate the document as a tree and filter out what we are looking for.

Consider the following HTML:


In [3]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

The same HTML is then stored as a string in the variable <code>html</code>:


In [4]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. The resulting <code>BeautifulSoup</code> object represents the document as a nested data structure:


In [5]:
soup = BeautifulSoup(html, "html.parser")

Using the method <code>prettify()</code> to display the HTML in the nested structure:


In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



Using a tag name as an attribute of the soup object to get the title of the page and the name of the top-paid player:

In [7]:
tag_object=soup.title

print("tag object:",tag_object)

tag object: <title>Page Title</title>


In [8]:
print("tag object type:",type(tag_object))

tag object type: <class 'bs4.element.Tag'>


In [9]:
tag_object=soup.h3
tag_object

<h3><b id="boldest">Lebron James</b></h3>

Accessing the child of the tag and navigating down the branch


In [10]:
tag_child =tag_object.b
tag_child

<b id="boldest">Lebron James</b>

You can access the parent with the <code>parent</code> attribute:


In [11]:
parent_tag=tag_child.parent
parent_tag

<h3><b id="boldest">Lebron James</b></h3>

This is identical to <code>tag_object</code>:


In [12]:
tag_object

<h3><b id="boldest">Lebron James</b></h3>

The parent of <code>tag_object</code> is the <code>body</code> element:


In [13]:
tag_object.parent

<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>
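The single `parent` steps above can be repeated all the way to the root. As a minimal sketch (using a shortened copy of the sample HTML, so the string here is an assumption rather than the notebook's `html` variable), the `parents` generator walks every ancestor of a tag in order:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page, used only for this illustration
html = "<html><body><h3><b id='boldest'>Lebron James</b></h3><p>Salary: $92,000,000</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

bold = soup.find("b")

# .parents yields each ancestor in turn, ending with the BeautifulSoup object itself
ancestor_names = [ancestor.name for ancestor in bold.parents]
print(ancestor_names)  # ['h3', 'body', 'html', '[document]']
```

The last entry, `[document]`, is the name Beautiful Soup gives to the top-level `BeautifulSoup` object.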

The next sibling of <code>tag_object</code> is the paragraph (<code>p</code>) element:


In [14]:
sibling_1=tag_object.next_sibling
sibling_1

<p> Salary: $ 92,000,000 </p>

`sibling_2` is the next heading (`h3`) element, which is a sibling of both `sibling_1` and `tag_object`:


In [15]:
sibling_2=sibling_1.next_sibling
sibling_2

<h3> Stephen Curry</h3>

Using the object sibling_2 and the property next_sibling to find the salary of Stephen Curry


In [16]:
sibling_2.next_sibling

<p> Salary: $85,000, 000 </p>
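The `next_sibling` calls above return tags because the sample string has no whitespace between elements. In HTML that is laid out on separate lines, the next sibling is often a text node holding only a newline. The sketch below, using a small made-up fragment, contrasts `next_sibling` with `find_next_sibling()`, which skips ahead to the next matching tag:

```python
from bs4 import BeautifulSoup

# Made-up fragment with line breaks between the elements
html = """<body>
<h3>Stephen Curry</h3>
<p>Salary: $85,000,000</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.h3

# The very next node is the newline text between </h3> and <p>
raw_sibling = heading.next_sibling
print(repr(raw_sibling))  # '\n'

# find_next_sibling() ignores text nodes and returns the next matching tag
tag_sibling = heading.find_next_sibling("p")
print(tag_sibling.get_text())  # Salary: $85,000,000
```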

Accessing a tag’s attributes by treating the tag like a dictionary


In [17]:
tag_child['id']

'boldest'

Accessing the dictionary directly as attrs


In [18]:
tag_child.attrs

{'id': 'boldest'}

Obtaining the value of an attribute with the tag's <code>get()</code> method, which works like the <code>get()</code> method of a Python dictionary:


In [19]:
tag_child.get('id')

'boldest'
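The difference between the two access styles shows up when an attribute is missing. A minimal sketch, assuming a one-tag document and a `title` attribute that does not exist on it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
tag = soup.b

present = tag.get("id")              # existing attribute: 'boldest'
missing = tag.get("title")           # absent attribute: None, no exception
fallback = tag.get("title", "n/a")   # absent attribute with a default value

# Dictionary-style access raises KeyError for an absent attribute
try:
    tag["title"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

has_id = tag.has_attr("id")          # explicit membership test

print(present, missing, fallback, lookup_failed, has_id)
```

Using `get()` is the safer choice when scraping pages where an attribute may or may not be present.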

The text inside a tag is available through the <code>string</code> attribute:


In [20]:
tag_string=tag_child.string
tag_string

'Lebron James'

Verifying that the type is <code>NavigableString</code>:


In [21]:
type(tag_string)

bs4.element.NavigableString

Converting it to a string object in Python:


In [23]:
unicode_string = str(tag_string)
unicode_string

'Lebron James'

In [24]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

0,1,2
Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg


Storing it as a string in the variable <code>table</code>:


In [25]:
table="<table><tr><td id='flight' >Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"

In [26]:
table_bs = BeautifulSoup(table, "html.parser")

In [27]:
table_rows=table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>]

In [28]:
first_row =table_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>

The type is <code>Tag</code>:


In [29]:
print(type(first_row))

<class 'bs4.element.Tag'>


Obtaining the first <code>td</code> child of the row:


In [30]:
first_row.td

<td id="flight">Flight No</td>

Iterating through the list; each element corresponds to a row in the table:


In [31]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)
    

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
row 1 is <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>


Applying the method find_all to each row object extracts its table cells into the list cells using the tag 'td'; these are all the children of the row with the name td. Each element of the resulting list corresponds to a cell and is a Tag object, which we print as we iterate through the list. The text of a cell is available through its string attribute.


In [32]:
for i,row in enumerate(table_rows):
    print("row",i)
    cells=row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>
column 2 cell <td>80 kg</td>
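The nested loop above prints whole `td` tags. To collect only the text of each cell, a sketch along the following lines may help. It uses a shortened copy of the flight table and `get_text(strip=True)`, which also copes with cells that wrap their text in an `a` tag:

```python
from bs4 import BeautifulSoup

# Shortened copy of the flight table, used only for this illustration
table = (
    "<table>"
    "<tr><td id='flight'>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr>"
    "</table>"
)
table_bs = BeautifulSoup(table, "html.parser")

# One inner list per row, one stripped text value per cell
rows_as_text = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in table_bs.find_all("tr")
]

for row in rows_as_text:
    print(row)
```

The result is a plain list of lists, which is a convenient shape for writing to CSV or loading into a DataFrame.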


Passing a list to <code>find_all</code> matches any tag whose name is an item in that list:


In [33]:
list_input=table_bs.find_all(name=["tr", "td"])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>,
 <td>80 kg</td>]

In [34]:
table_bs.find_all(id="flight")

[<td id="flight">Flight No</td>]

In [35]:
list_input=table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

In [36]:
table_bs.find_all(href=True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

In [37]:
table_bs.find_all('a', href=False)

[]
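The `href=False` query above returns an empty list because every `a` tag in the table has an `href`. The sketch below uses a made-up fragment that includes one anchor without a link, and also shows a regular expression as the filter value:

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment: two links and one anchor that has no href attribute
html = (
    "<p>"
    "<a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> "
    "<a href='https://en.wikipedia.org/wiki/Texas'>Texas</a> "
    "<a name='top'>Anchor without link</a>"
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

with_href = soup.find_all("a", href=True)        # attribute present, any value
without_href = soup.find_all("a", href=False)    # attribute absent
florida = soup.find_all("a", href=re.compile("Florida"))  # value matches a pattern

print(len(with_href), len(without_href), len(florida))  # 2 1 1
```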

Using the soup object we find the element with the id attribute content set to "boldest"

In [38]:
soup.find_all(id="boldest")

[<b id="boldest">Lebron James</b>]

In [40]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>
</table>
</p>


0,1,2
Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg

0,1,2
Pizza Place,Orders,Slices
Domino's Pizza,10,100
Little Caesars,12,144
Papa John's,15,165


Storing the HTML as a Python string 


In [41]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table></p>"

creating a BeautifulSoup object  

In [42]:
two_tables_bs= BeautifulSoup(two_tables, 'html.parser')

finding the first table using the tag name table


In [43]:
two_tables_bs.find("table")

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

We can filter on the class attribute to find the second table. Because <code>class</code> is a reserved keyword in Python, Beautiful Soup uses the argument name <code>class_</code> with a trailing underscore.


In [44]:
two_tables_bs.find("table",class_='pizza')

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
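The `class_` keyword is one of several ways to filter on a class. As a sketch on a cut-down two-table string (an assumption, not the notebook's `two_tables` variable), the `attrs` dictionary and a CSS selector via `select_one` find the same element:

```python
from bs4 import BeautifulSoup

# Cut-down version of the two tables, used only for this illustration
two_tables = (
    "<table class='rocket'><tr><td>Flight No</td></tr></table>"
    "<table class='pizza'><tr><td>Pizza Place</td></tr></table>"
)
soup = BeautifulSoup(two_tables, "html.parser")

by_keyword = soup.find("table", class_="pizza")           # keyword with underscore
by_attrs = soup.find("table", attrs={"class": "pizza"})   # plain dictionary key
by_css = soup.select_one("table.pizza")                   # CSS selector

# All three refer to the same Tag object in the parse tree
print(by_keyword is by_attrs, by_attrs is by_css)  # True True
print(by_css["class"])                             # ['pizza']
```

Note that the `class` attribute comes back as a list, because an HTML element can carry several classes.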

<h2 id="DSCW">Downloading And Scraping The Contents Of A Web Page</h2> 


In [45]:
url = "https://web.archive.org/web/20230224123642/https://www.ibm.com/us-en/"

Downloading the contents of the webpage as text and storing it in the variable <code>data</code>:


In [46]:
data  = requests.get(url).text 
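The one-line download above assumes the request succeeds. A slightly more defensive sketch is shown here; the helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the lab. It sets a timeout, since `requests` waits indefinitely by default, and raises an exception for HTTP error responses:

```python
import requests

def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into requests.exceptions.HTTPError
    response.raise_for_status()
    return response.text

# Hypothetical usage with the notebook's url variable:
# try:
#     data = fetch_html(url)
# except requests.exceptions.RequestException as error:
#     print("Download failed:", error)
```

Catching `requests.exceptions.RequestException` covers timeouts, connection problems, malformed URLs, and HTTP error statuses in one place.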

creating a BeautifulSoup object

In [47]:
soup = BeautifulSoup(data,"html.parser")  # create a soup object using the variable 'data'

Scrape all links


In [48]:
for link in soup.find_all('a',href=True):  # in html anchor/link is represented by the tag <a>

    print(link.get('href'))


https://web.archive.org/web/20230224123642/https://www.ibm.com/reports/threat-intelligence/
https://web.archive.org/web/20230224123642/https://www.ibm.com/about
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/strategy/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/ibmix?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/technology/
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/operations/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/strategic-partnerships
https://web.archive.org/web/20230224123642/https://www.ibm.com/employment/?lnk=flatitem
https://web.archive.org/web/20230224123642/https://www.ibm.com/impact
https://web.archive.org/web/20230224123642/https://research.ibm.com/
https://web.archive.org/web/20230224123642/https://www.ibm.com/
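The links printed above are all absolute URLs. Many pages also contain relative links, which need to be combined with the page address before they can be requested. A minimal sketch with a made-up fragment and a hypothetical base URL, using `urljoin` from the standard library:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page address and made-up links, for illustration only
base_url = "https://www.example.com/docs/"
html = (
    "<a href='/about'>About</a>"
    "<a href='guide.html'>Guide</a>"
    "<a href='https://research.ibm.com/'>Research</a>"
)
soup = BeautifulSoup(html, "html.parser")

# urljoin resolves relative paths against base_url and leaves absolute URLs untouched
absolute_links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in absolute_links:
    print(link)
```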


## Scrape all image tags


In [44]:
for link in soup.find_all('img'):
    print(link)
    print(link.get('src'))

<img alt="Person standing with arms crossed" aria-describedby="bx--image-1" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/0a23e414312bcb6f/08196d0e04260ae5_cropped.jpg.global.sr_16x9.jpg"/>
https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/0a23e414312bcb6f/08196d0e04260ae5_cropped.jpg.global.sr_16x9.jpg
<img alt="Team members at work in a conference room" aria-describedby="bx--image-2" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/06655c075aa3aa29/CaitOppermann_2019_12_06_IBMGarage_DSC3304.jpg.global.m_16x9.jpg"/>
https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/06655c075aa3aa29/CaitOppermann_2019_12_06_IBMGarage_DSC3304.jpg.global.m_16x9.jpg
<img alt="Coworkers looking at laptops" aria-describedby="bx--image-3" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/08f951353c2707b8/052022_CaitOp

## Scrape data from HTML tables


In [49]:

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

In [50]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [51]:
soup = BeautifulSoup(data,"html.parser")

In [52]:
#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>

In [53]:
#Get all rows from the table
for row in table.find_all('tr'): # in html table row is represented by the tag <tr>
    # Get all columns in each row.
    cols = row.find_all('td') # in html a table cell is represented by the tag <td>; the cells of a row form its columns
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
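The `None` in the first output row comes from the `string` attribute, which returns `None` when a tag contains more than one child node. Whether that is what happens in the header cell of this page is an assumption. The sketch below uses made-up markup to show the behaviour and how `get_text()` avoids it:

```python
from bs4 import BeautifulSoup

# Made-up row: the first cell mixes plain text with a nested <b> tag
html = "<table><tr><td>Color <b>Name</b></td><td>#FFA07A</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

mixed_string = cells[0].string                      # None: the cell has two children
mixed_text = cells[0].get_text(" ", strip=True)     # joins all text pieces with a space
simple_string = cells[1].string                     # single text child, so a string

print(mixed_string, "|", mixed_text, "|", simple_string)
```

Replacing `.string` with `get_text(strip=True)` in the loop above would therefore be a reasonable way to avoid `None` values.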


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas


In [76]:
!pip install pandas
import pandas as pd



In [54]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

In [55]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [56]:
soup = BeautifulSoup(data,"html.parser")

In [57]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [58]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

31

In [59]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

8


In [60]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_104-0">
   <a href="#cite_note-:10-104">
    <span class="cite-bracket">
     [
    </span>
    99
    <span class="cite-bracket">
     ]
    </span>
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <span class="mw-image-border" typeof="mw:File">
      <span>
       <img alt="" class="mw-file-element" data-file-height="600" data-fi

In [61]:
import pandas as pd

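In [70]:
# use the table found via table_index; the bare name `table` still points at the last table from the loop above
df = pd.DataFrame([[td.text for td in row.find_all('td')] for row in tables[table_index].find_all('tr')])
print(df)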
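A list comprehension over `td` cells skips the header row, because header cells use the `th` tag. The sketch below shows one way to keep the column names. It uses a small made-up table whose two data rows are taken from the density figures shown later in this lab, and the variable names are illustrative:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Small stand-in table with a <th> header row and two data rows
html = (
    "<table>"
    "<tr><th>Rank</th><th>Country</th><th>Density</th></tr>"
    "<tr><td>1</td><td>Singapore</td><td>8235</td></tr>"
    "<tr><td>2</td><td>Bangladesh</td><td>1116</td></tr>"
    "</table>"
)
table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# Column names come from the <th> cells of the first row
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Every later row contributes one record built from its <td> cells
records = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

density_df = pd.DataFrame(records, columns=headers)
print(density_df)
```

All values are strings at this point; numeric columns can be converted afterwards, for example with `pd.to_numeric`.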


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html


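Using the same `url`, `data`, `soup`, and `tables` objects as in the last section, we can use the `read_html` function to create a DataFrame.

The table of densely populated countries is located in `tables[table_index]`; the cells below read `tables[5]`, a table listing the ten most populous countries.

We pass the `pandas` function `read_html` the string version of the table together with `flavor`, which selects the parsing engine. The cells below use `lxml`; `bs4` is another accepted value and additionally requires `html5lib`, which is installed here.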


In [64]:
!pip install html5lib



In [72]:
from io import StringIO  # recent pandas versions expect a file-like object rather than a literal HTML string

pd.read_html(StringIO(str(tables[5])), flavor='lxml')


[  Country / Dependency  Population % of world         Date  \
 0                India  1425775850      17.5%  14 Apr 2023   
 1                China  1409670000      17.3%  17 Jan 2024   
 2        United States   337272301      4.14%  13 Dec 2024   
 3            Indonesia   278696200      3.42%   1 Jul 2023   
 4             Pakistan   229488994      2.82%   1 Jul 2022   
 5              Nigeria   216746934      2.66%   1 Jul 2022   
 6               Brazil   218536918      2.68%  13 Dec 2024   
 7           Bangladesh   168220000      2.06%   1 Jul 2020   
 8               Russia   147190000      1.81%   1 Oct 2021   
 9               Mexico   128271248      1.57%  31 Mar 2022   
 
   Source (official or from the United Nations)  
 0                            UN projection[89]  
 1                 National annual estimate[90]  
 2                National population clock[91]  
 3                 National annual estimate[92]  
 4                            UN projection[93]  
 5   

The function `read_html` always returns a list of DataFrames so we must pick the one we want out of the list.


In [74]:
from io import StringIO  # wrap the HTML so pandas receives a file-like object

population_data_read_html = pd.read_html(StringIO(str(tables[5])), flavor='lxml')[0]

population_data_read_html


Unnamed: 0,Country / Dependency,Population,% of world,Date,Source (official or from the United Nations)
0,India,1425775850,17.5%,14 Apr 2023,UN projection[89]
1,China,1409670000,17.3%,17 Jan 2024,National annual estimate[90]
2,United States,337272301,4.14%,13 Dec 2024,National population clock[91]
3,Indonesia,278696200,3.42%,1 Jul 2023,National annual estimate[92]
4,Pakistan,229488994,2.82%,1 Jul 2022,UN projection[93]
5,Nigeria,216746934,2.66%,1 Jul 2022,UN projection[93]
6,Brazil,218536918,2.68%,13 Dec 2024,National population clock[94]
7,Bangladesh,168220000,2.06%,1 Jul 2020,Annual Population Estimate[95]
8,Russia,147190000,1.81%,1 Oct 2021,2021 preliminary census results[96]
9,Mexico,128271248,1.57%,31 Mar 2022,


## Scrape data from HTML tables into a DataFrame using read_html


We can also use the `read_html` function to directly get DataFrames from a `url`.


In [75]:
dataframe_list = pd.read_html(url, flavor='lxml')

We can see there are 31 DataFrames, the same number of tables we found when we used `find_all` on the `soup` object.


In [76]:
len(dataframe_list)

31

Finally we can pick the DataFrame we need out of the list.


In [77]:
dataframe_list[5]

Unnamed: 0,Country / Dependency,Population,% of world,Date,Source (official or from the United Nations)
0,India,1425775850,17.5%,14 Apr 2023,UN projection[89]
1,China,1409670000,17.3%,17 Jan 2024,National annual estimate[90]
2,United States,337272301,4.14%,13 Dec 2024,National population clock[91]
3,Indonesia,278696200,3.42%,1 Jul 2023,National annual estimate[92]
4,Pakistan,229488994,2.82%,1 Jul 2022,UN projection[93]
5,Nigeria,216746934,2.66%,1 Jul 2022,UN projection[93]
6,Brazil,218536918,2.68%,13 Dec 2024,National population clock[94]
7,Bangladesh,168220000,2.06%,1 Jul 2020,Annual Population Estimate[95]
8,Russia,147190000,1.81%,1 Oct 2021,2021 preliminary census results[96]
9,Mexico,128271248,1.57%,31 Mar 2022,


We can also use the `match` parameter to select the specific table we want. If the table contains a string matching the text it will be read.


In [78]:
pd.read_html(url, match="10 most densely populated countries", flavor='lxml')[0]

Unnamed: 0,Rank,Country,Population,Area (km2),Density (pop/km2)
0,1,Singapore,5921231,719,8235
1,2,Bangladesh,165650475,148460,1116
2,3,Palestine[note 3][100],5223000,6025,867
3,4,Taiwan[note 4],23580712,35980,655
4,5,South Korea,51844834,99720,520
5,6,Lebanon,5296814,10400,509
6,7,Rwanda,13173730,26338,500
7,8,Burundi,12696478,27830,456
8,9,Israel,9402617,21937,429
9,10,India,1389637446,3287263,423
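The same behaviour can be tried without a network connection. This sketch feeds `read_html` a made-up two-table string wrapped in `StringIO` and relies on whichever parser pandas finds installed. It shows both the list return value and the effect of `match`:

```python
from io import StringIO
import pandas as pd

# Two small made-up tables in one HTML string
html = """
<table>
  <tr><th>Flight No</th><th>Launch site</th></tr>
  <tr><td>1</td><td>Florida</td></tr>
  <tr><td>2</td><td>Texas</td></tr>
</table>
<table>
  <tr><th>Pizza Place</th><th>Orders</th></tr>
  <tr><td>Little Caesars</td><td>12</td></tr>
</table>
"""

# Without match, every table in the document becomes a DataFrame
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the given text are returned, still as a list
pizza_tables = pd.read_html(StringIO(html), match="Pizza Place")
pizza_df = pizza_tables[0]

print(len(all_tables), len(pizza_tables))
print(pizza_df)
```

Because the result is always a list, indexing with `[0]` is still needed even when `match` narrows it down to a single table.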


In [79]:
df.to_csv('10_most_densly_populated_countries.csv',index=False)

## Author


RAHMAT USMAN
