### [DOM of an HTML page](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model) and Table Scraping

[Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library for pulling data out of HTML and XML files. It delegates the parsing itself to a backend such as Python's built-in html.parser, html5lib, or lxml.

Note:  
DOM stands for Document Object Model. It [represents a document with a logical tree](https://en.wikipedia.org/wiki/Document_Object_Model#/media/File:DOM-model.svg). Each branch of the tree ends in a node, and each node contains objects. DOM methods allow programmatic access to the tree.
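
To make the tree idea concrete, here is a minimal sketch that prints the nesting of tags in a small HTML string. It uses only the standard library's `html.parser`; the `TreePrinter` class and the sample markup are made up for illustration, and the depth tracking assumes well-formed markup without void tags such as `<br>`.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Hypothetical helper: records one indented line per opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level in the tree
        self.lines = []  # indented tag names, in document order

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back to the parent's level


html = (
    "<html><head><title>Tables Task</title></head>"
    "<body><table><tr><td>1976</td></tr></table></body></html>"
)
printer = TreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```

Each level of indentation corresponds to one step down the tree, which is the same path that attribute access such as `soup.body.table` follows in Beautiful Soup.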

In [1]:
import requests
# pip install beautifulsoup4  -> parses HTML into a navigable tree
# pip install html5lib        -> optional parser backend
# pip install lxml            -> optional, faster parser backend
from bs4 import BeautifulSoup
import pandas as pd

In [14]:
url = "https://raw.githubusercontent.com/mdn/css-examples/main/learn/tasks/tables/table-download.html"
r = requests.get(url)
if r.status_code == 200:  # 200 means the request succeeded
    print("Success!")
r

# https://bioub.github.io/dom-visualizer/ can be used to visualize the DOM

Success!


<Response [200]>
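
Checking `status_code` by hand works, but a failed request would pass silently here. A possible alternative, sketched under the assumption that an exception is preferable to a silent failure, is to wrap the download in a small function. The name `fetch_html`, its `session` parameter, and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the response body as text, raising on HTTP errors.

    `session` is optional: any object with a requests-style `get` method
    works, which also makes the function easy to exercise offline.
    """
    http = session or requests
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

It could then be called as `html = fetch_html(url)` in place of `requests.get(url)` followed by `r.text`.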

In [15]:
print(r.text)  # the raw HTML source as a string
# r.json() would only apply if the response body were JSON; this page is HTML

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8"/>
    <title>Tables Task</title>

    <style>
      table {
        font-size: 75%;
      }
    </style>

  </head>

  <body>
    <table>
      <caption>A summary of the UK's most famous punk bands</caption>
      <thead>
        <tr>
          <th scope="col">Band</th>
          <th scope="col">Year formed</th>
          <th scope="col">No. of Albums</th>
          <th scope="col">Most famous song</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th scope="row">Buzzcocks</th>
          <td>1976</td>
          <td>9</td>
          <td>Ever fallen in love (with someone you shouldn't've)</td>
        </tr>
        <tr>
          <th scope="row">The Clash</th>
          <td>1976</td>
          <td>6</td>
          <td>London Calling</td>
        </tr>
        <tr>
          <th scope="row">The Damned</th>
          <td>1976</td>
          <td>10</td>
          <td>Smash it up</td>
        </tr>
     

In [18]:
soup = BeautifulSoup(r.text, 'html.parser')
# soup = BeautifulSoup(r.text, 'lxml')  # alternative parser, requires lxml
soup  # soup is now a BeautifulSoup object: a navigable parse tree, not JSON
# soup.index

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Tables Task</title>
<style>
      table {
        font-size: 75%;
      }
    </style>
</head>
<body>
<table>
<caption>A summary of the UK's most famous punk bands</caption>
<thead>
<tr>
<th scope="col">Band</th>
<th scope="col">Year formed</th>
<th scope="col">No. of Albums</th>
<th scope="col">Most famous song</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Buzzcocks</th>
<td>1976</td>
<td>9</td>
<td>Ever fallen in love (with someone you shouldn't've)</td>
</tr>
<tr>
<th scope="row">The Clash</th>
<td>1976</td>
<td>6</td>
<td>London Calling</td>
</tr>
<tr>
<th scope="row">The Damned</th>
<td>1976</td>
<td>10</td>
<td>Smash it up</td>
</tr>
<tr>
<th scope="row">Sex Pistols</th>
<td>1975</td>
<td>1</td>
<td>Anarchy in the UK</td>
</tr>
<tr>
<th scope="row">Sham 69</th>
<td>1976</td>
<td>13</td>
<td>If the kids are united</td>
</tr>
<tr>
<th scope="row">Siouxsie and the Banshees</th>
<td>1976</td>
<td>11</td>
<td>H

In [19]:
soup.head

<head>
<meta charset="utf-8"/>
<title>Tables Task</title>
<style>
      table {
        font-size: 75%;
      }
    </style>
</head>

In [20]:
soup.body

<body>
<table>
<caption>A summary of the UK's most famous punk bands</caption>
<thead>
<tr>
<th scope="col">Band</th>
<th scope="col">Year formed</th>
<th scope="col">No. of Albums</th>
<th scope="col">Most famous song</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Buzzcocks</th>
<td>1976</td>
<td>9</td>
<td>Ever fallen in love (with someone you shouldn't've)</td>
</tr>
<tr>
<th scope="row">The Clash</th>
<td>1976</td>
<td>6</td>
<td>London Calling</td>
</tr>
<tr>
<th scope="row">The Damned</th>
<td>1976</td>
<td>10</td>
<td>Smash it up</td>
</tr>
<tr>
<th scope="row">Sex Pistols</th>
<td>1975</td>
<td>1</td>
<td>Anarchy in the UK</td>
</tr>
<tr>
<th scope="row">Sham 69</th>
<td>1976</td>
<td>13</td>
<td>If the kids are united</td>
</tr>
<tr>
<th scope="row">Siouxsie and the Banshees</th>
<td>1976</td>
<td>11</td>
<td>Hong Kong Garden</td>
</tr>
<tr>
<th scope="row">Stiff Little Fingers</th>
<td>1977</td>
<td>10</td>
<td>Suspect Device</td>
</tr>
<tr>
<th scope="row">The Stranglers</

In [21]:
soup.body.table

<table>
<caption>A summary of the UK's most famous punk bands</caption>
<thead>
<tr>
<th scope="col">Band</th>
<th scope="col">Year formed</th>
<th scope="col">No. of Albums</th>
<th scope="col">Most famous song</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Buzzcocks</th>
<td>1976</td>
<td>9</td>
<td>Ever fallen in love (with someone you shouldn't've)</td>
</tr>
<tr>
<th scope="row">The Clash</th>
<td>1976</td>
<td>6</td>
<td>London Calling</td>
</tr>
<tr>
<th scope="row">The Damned</th>
<td>1976</td>
<td>10</td>
<td>Smash it up</td>
</tr>
<tr>
<th scope="row">Sex Pistols</th>
<td>1975</td>
<td>1</td>
<td>Anarchy in the UK</td>
</tr>
<tr>
<th scope="row">Sham 69</th>
<td>1976</td>
<td>13</td>
<td>If the kids are united</td>
</tr>
<tr>
<th scope="row">Siouxsie and the Banshees</th>
<td>1976</td>
<td>11</td>
<td>Hong Kong Garden</td>
</tr>
<tr>
<th scope="row">Stiff Little Fingers</th>
<td>1977</td>
<td>10</td>
<td>Suspect Device</td>
</tr>
<tr>
<th scope="row">The Stranglers</th>
<td

In [22]:
soup.body.table.thead

<thead>
<tr>
<th scope="col">Band</th>
<th scope="col">Year formed</th>
<th scope="col">No. of Albums</th>
<th scope="col">Most famous song</th>
</tr>
</thead>

In [23]:
ths = soup.body.table.thead.find_all('th')
ths

[<th scope="col">Band</th>,
 <th scope="col">Year formed</th>,
 <th scope="col">No. of Albums</th>,
 <th scope="col">Most famous song</th>]

In [24]:
columns = [th.get_text() for th in ths]  # text of each header cell
columns

['Band', 'Year formed', 'No. of Albums', 'Most famous song']

In [25]:
soup.body.table.tbody

<tbody>
<tr>
<th scope="row">Buzzcocks</th>
<td>1976</td>
<td>9</td>
<td>Ever fallen in love (with someone you shouldn't've)</td>
</tr>
<tr>
<th scope="row">The Clash</th>
<td>1976</td>
<td>6</td>
<td>London Calling</td>
</tr>
<tr>
<th scope="row">The Damned</th>
<td>1976</td>
<td>10</td>
<td>Smash it up</td>
</tr>
<tr>
<th scope="row">Sex Pistols</th>
<td>1975</td>
<td>1</td>
<td>Anarchy in the UK</td>
</tr>
<tr>
<th scope="row">Sham 69</th>
<td>1976</td>
<td>13</td>
<td>If the kids are united</td>
</tr>
<tr>
<th scope="row">Siouxsie and the Banshees</th>
<td>1976</td>
<td>11</td>
<td>Hong Kong Garden</td>
</tr>
<tr>
<th scope="row">Stiff Little Fingers</th>
<td>1977</td>
<td>10</td>
<td>Suspect Device</td>
</tr>
<tr>
<th scope="row">The Stranglers</th>
<td>1974</td>
<td>17</td>
<td>No More Heroes</td>
</tr>
</tbody>

In [26]:
trs = soup.body.table.tbody.find_all('tr')
trs

[<tr>
 <th scope="row">Buzzcocks</th>
 <td>1976</td>
 <td>9</td>
 <td>Ever fallen in love (with someone you shouldn't've)</td>
 </tr>,
 <tr>
 <th scope="row">The Clash</th>
 <td>1976</td>
 <td>6</td>
 <td>London Calling</td>
 </tr>,
 <tr>
 <th scope="row">The Damned</th>
 <td>1976</td>
 <td>10</td>
 <td>Smash it up</td>
 </tr>,
 <tr>
 <th scope="row">Sex Pistols</th>
 <td>1975</td>
 <td>1</td>
 <td>Anarchy in the UK</td>
 </tr>,
 <tr>
 <th scope="row">Sham 69</th>
 <td>1976</td>
 <td>13</td>
 <td>If the kids are united</td>
 </tr>,
 <tr>
 <th scope="row">Siouxsie and the Banshees</th>
 <td>1976</td>
 <td>11</td>
 <td>Hong Kong Garden</td>
 </tr>,
 <tr>
 <th scope="row">Stiff Little Fingers</th>
 <td>1977</td>
 <td>10</td>
 <td>Suspect Device</td>
 </tr>,
 <tr>
 <th scope="row">The Stranglers</th>
 <td>1974</td>
 <td>17</td>
 <td>No More Heroes</td>
 </tr>]

In [27]:
for row in trs:
    ths = row.find_all('th')  # row header cell (the band name)
    tds = row.find_all('td')  # remaining data cells

    row_data = [cell.get_text() for cell in ths + tds]
    print(row_data)

['Buzzcocks', '1976', '9', "Ever fallen in love (with someone you shouldn't've)"]
['The Clash', '1976', '6', 'London Calling']
['The Damned', '1976', '10', 'Smash it up']
['Sex Pistols', '1975', '1', 'Anarchy in the UK']
['Sham 69', '1976', '13', 'If the kids are united']
['Siouxsie and the Banshees', '1976', '11', 'Hong Kong Garden']
['Stiff Little Fingers', '1977', '10', 'Suspect Device']
['The Stranglers', '1974', '17', 'No More Heroes']
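
The header list and the row lists are the two ingredients of a DataFrame. As a minimal sketch of that last step, the example below repeats the extraction on a shortened copy of the table (the inline `html` string and the name `bands` are illustrative) so that it runs without a network request, then converts the numeric columns from strings to integers.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the punk bands table, embedded so the example runs offline.
html = """
<table>
  <thead>
    <tr><th scope="col">Band</th><th scope="col">Year formed</th><th scope="col">No. of Albums</th></tr>
  </thead>
  <tbody>
    <tr><th scope="row">Buzzcocks</th><td>1976</td><td>9</td></tr>
    <tr><th scope="row">The Clash</th><td>1976</td><td>6</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Column names come from the header cells.
columns = [th.get_text(strip=True) for th in soup.table.thead.find_all("th")]

# Passing a list of tag names returns <th> and <td> cells in document order.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.table.tbody.find_all("tr")
]

bands = pd.DataFrame(rows, columns=columns)

# Scraped values are always strings; convert the numeric columns explicitly.
numeric_columns = ["Year formed", "No. of Albums"]
bands[numeric_columns] = bands[numeric_columns].astype(int)
print(bands)
```

Restricting the row search to `tbody` keeps the "Total albums" footer row out of the data, since that row lives in `tfoot`.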


In [28]:
soup.body.table.tfoot

<tfoot>
<tr>
<th colspan="2" scope="row">Total albums</th>
<td colspan="2">77</td>
</tr>
</tfoot>

### [pd.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)
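A convenience function that finds the `<table>` elements in a page and returns them as a list of DataFrames. By default it parses with lxml and falls back to Beautiful Soup with html5lib if that fails.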

In [29]:
url = "https://ourworldindata.org/the-worlds-deadliest-earthquakes"
tables = pd.read_html(url, encoding='utf-8')
len(tables)

1

In [30]:
tables[0]

| | Ranking | Location | Year | Estimated death toll | Earthquake magnitude | Additional information |
|---|---|---|---|---|---|---|
| 0 | 1 | Shaanxi, China | 1556 | 830000 | 8 | More than 97 counties in China were affected. ... |
| 1 | 2 | Port-au-Prince, Haiti | 2010 | 316000 | 7 | Death toll is still disputed. Here we present ... |
| 2 | 3 | Antakya, Turkey | 115 | 260000 | 7.5 | Antioch (ancient ruins which lie near the mode... |
| 3 | 4 | Antakya, Turkey | 525 | 250000 | 7 | Severe damage to the area of the Byzantine Emp... |
| 4 | 5 | Tangshan, China | 1976 | 242769 | 7.5 | Reported that the earthquake risk had been gre... |
| 5 | 6 | Gyzndzha, Azerbaijan | 1139 | 230000 | Unknown | Often termed the Ganja earthquake. Much less i... |
| 6 | 7 | Sumatra, Indonesia | 2004 | 227899 | 9.1 | Earthquake in Indian Ocean off the coast of Su... |
| 7 | 8 | Damghan, Iran | 856 | 200000 | 7.9 | Estimated that extent of the damage area was 2... |
| 8 | 8 | Gansu, China | 1920 | 200000 | 8.3 | Damage occurred across 7 provinces and regions... |
| 9 | 9 | Dvin, Armenia | 893 | 150000 | Unknown | City of Dvin was destroyed, with the collapse ... |
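
Because some magnitudes are recorded as "Unknown", that column is read as text rather than numbers. The sketch below shows one way to handle this on a two-row stand-in for the table: the inline `html` string and the name `quakes` are illustrative, and wrapping the markup in `StringIO` is a choice made here because recent pandas versions discourage passing literal HTML strings directly.

```python
from io import StringIO

import pandas as pd

# Two-row stand-in for the earthquake table, embedded so the example runs offline.
html = """
<table>
  <thead>
    <tr><th>Location</th><th>Year</th><th>Earthquake magnitude</th></tr>
  </thead>
  <tbody>
    <tr><td>Shaanxi, China</td><td>1556</td><td>8</td></tr>
    <tr><td>Gyzndzha, Azerbaijan</td><td>1139</td><td>Unknown</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list, one DataFrame per table it finds.
tables = pd.read_html(StringIO(html))
quakes = tables[0]

# errors="coerce" turns non-numeric entries such as "Unknown" into NaN.
quakes["Earthquake magnitude"] = pd.to_numeric(
    quakes["Earthquake magnitude"], errors="coerce"
)
print(quakes)
```

After the conversion the column can be sorted or averaged, with the unknown magnitudes treated as missing values.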


### [tabula-py](https://nbviewer.org/github/chezou/tabula-py/blob/master/examples/tabula_example.ipynb): A PDF table scraper
tabula-py converts tables in PDF files to pandas DataFrames. It is a wrapper around tabula-java, so Java must be installed on your machine.

In [2]:
# !pip install tabula-py   (the PyPI package is named tabula-py; a separate package named tabula also exists)
from tabula import read_pdf
url = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
tables = read_pdf(url, pages='all')
len(tables)

ImportError: cannot import name 'read_pdf' from 'tabula' (c:\Users\nishi\AppData\Local\Programs\Python\Python311\Lib\site-packages\tabula\__init__.py)
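
One possible cause of this `ImportError`, offered as an assumption since the traceback alone does not confirm it, is that the unrelated `tabula` distribution is installed instead of, or alongside, `tabula-py`; both provide a top-level `tabula` module. The sketch below uses the standard library's `importlib.metadata` to report which of the two is present. The helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Both distributions install a module named `tabula`, so check each by name.
for name in ("tabula-py", "tabula"):
    print(name, "->", installed_version(name))
```

If `tabula` shows a version and `tabula-py` shows `None`, uninstalling `tabula` and installing `tabula-py` would be the first thing to try.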

In [None]:
tables[0]

In [None]:
tables[1]