# Web Scraping

## HTML Table Structure

Table Schematic

![Table structure](./pic03.gif)

Table Structure

![Table structure](./pic01.gif)

Table HTML Code

![Table structure](./pic02.png)

Sample HTML Table Code

![Table in Page](./result.png)
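To connect the diagrams above to actual markup, here is a minimal sketch that parses a tiny hand-written table. The HTML string and its contents are made up for illustration; only the `table`, `tr`, `th`, and `td` tags matter.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table written by hand for illustration.
sample_html = """
<table>
  <tr><th>Name</th><th>Office</th></tr>
  <tr><td>Ada</td><td>Room 1</td></tr>
  <tr><td>Linus</td><td>Room 2</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table")

# Each <tr> is one row; <th> cells are headers and <td> cells are data.
for tr in sample_table.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    print([(cell.name, cell.get_text(strip=True)) for cell in cells])
```

The first printed row contains only `th` cells and the other two contain only `td` cells, which is the layout the scraping code later in this notebook relies on.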

## Web Scraping with Python

Import the required packages.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

This is the sample page we will scrape data from. Open it in a browser to see the tables:

https://afd.calpoly.edu/web/sample-tables

In [2]:
# Send a request to the web page and fetch the HTML content
url = 'https://afd.calpoly.edu/web/sample-tables'

In [3]:
response = requests.get(url)

Let's take a look at the retrieved data.

Every response to a web page request carries, among other things, two pieces of data: a status code for the request and the text of the result. Let's take a look at them.

In [5]:
response.status_code

200
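A status code of 200 means the request succeeded. As a hedged sketch of how the check could be automated, the helper below (the name `fetch_html` and the 10-second timeout are assumptions, not part of the original notebook) uses `raise_for_status()` so that a failed request raises an exception instead of silently returning an error page.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page's HTML text, raising if the request fails."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes.
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out here):
# html = fetch_html("https://afd.calpoly.edu/web/sample-tables")
```

Setting a timeout is a design choice: without one, `requests.get` can wait indefinitely on an unresponsive server.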

In [4]:
response.text

'<!DOCTYPE html>\r\n<html class="no-js" lang="en" dir="ltr">\r\n<head>\r\n<!--\r\nCal Poly Web Template v2.0.0\r\nCode maintained by\r\nInformation Technology Services\r\nCalifornia Polytechnic State University\r\nSan Luis Obispo, CA 93407\r\n-->\r\n<meta charset="UTF-8">\r\n<meta http-equiv="X-UA-Compatible" content="IE=Edge">\r\n<meta name="viewport" content="initial-scale=1">\r\n\r\n<title>Sample Tables - Web - Cal Poly</title>\n\r\n<meta http-equiv="content-language" content="en" />\r\n<meta name="language" content="en" />\r\n<meta name="msapplication-config" content="none"/>\r\n\r\n<meta name="codebase" content="AFD-2.0" />\r\n\r\n\r\n<meta name="Description" content="As stewards of the University resources, we provide high quality, efficient support and planning services as an integral part of the campus community in support of student learning." />\r\n<meta name="Keywords" content="Cal Poly, Administration, AFD, Finance, Police, Budget, Facilities, Fiscal, HR, Risk, Technology, 

Now it's time to parse the result text into a structure that is easier to read and navigate.

In [6]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
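To see what parsing buys us without fetching anything, here is a small sketch on a hand-written HTML string (hypothetical, standing in for `response.text`). It shows `prettify()`, which returns the parsed tree as an indented string, and attribute-style access to a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML standing in for response.text.
raw_html = (
    "<html><head><title>Sample Tables</title></head>"
    "<body><p>Hello</p></body></html>"
)

demo_soup = BeautifulSoup(raw_html, "html.parser")

# prettify() returns the parsed tree as an indented, multi-line string.
print(demo_soup.prettify())

# Tags can be reached as attributes of the soup object.
print(demo_soup.title.get_text())
```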

In [7]:
soup

<!DOCTYPE html>

<html class="no-js" dir="ltr" lang="en">
<head>
<!--
Cal Poly Web Template v2.0.0
Code maintained by
Information Technology Services
California Polytechnic State University
San Luis Obispo, CA 93407
-->
<meta charset="utf-8"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="initial-scale=1" name="viewport"/>
<title>Sample Tables - Web - Cal Poly</title>
<meta content="en" http-equiv="content-language">
<meta content="en" name="language">
<meta content="none" name="msapplication-config">
<meta content="AFD-2.0" name="codebase"/>
<meta content="As stewards of the University resources, we provide high quality, efficient support and planning services as an integral part of the campus community in support of student learning." name="Description"/>
<meta content="Cal Poly, Administration, AFD, Finance, Police, Budget, Facilities, Fiscal, HR, Risk, Technology, Corporation" name="Keywords"/>
<link href="https://use.typekit.net/asw2aly.css" rel="stylesheet

Now we have the complete HTML source of the page in a form that is easier to understand. Let's extract the tables.

In [8]:
# Find every table element on the page by its tag name
tables = soup.find_all('table')
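`find_all` returns every match. When only one table is wanted and it carries a unique attribute, `find` can target it directly. The sketch below uses a hand-written page; the `id` and `class` values are made up for illustration and do not come from the sample page.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; the id and class values are made up.
page_html = """
<table id="events"><tr><th>Description</th></tr><tr><td>Meeting</td></tr></table>
<table class="contacts"><tr><th>Name</th></tr><tr><td>Ada</td></tr></table>
"""
page_soup = BeautifulSoup(page_html, "html.parser")

all_tables = page_soup.find_all("table")  # every table, in page order
events = page_soup.find("table", id="events")  # first table with this id
contacts = page_soup.find("table", class_="contacts")  # first with this class

print(len(all_tables))
print(events.th.get_text(), contacts.th.get_text())
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so Beautiful Soup uses `class_` as the keyword argument.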

In [9]:
tables

[<table border="1" summary="Provide table summary here">
 <caption>
 <strong>Table caption (name and description of table)</strong><br/>
         You can describe your table here or in the context of your page.  
         This table is read using the first row as the header for each column.
         (Replace this caption with your own description of the table)
         </caption>
 <tr>
 <th scope="col">Description</th>
 <th scope="col">Date</th>
 <th scope="col"><a href="#">Location</a></th>
 </tr>
 <tr>
 <td> Academic Senate Meeting</td>
 <td>May 25, 2205</td>
 <td>Building 99 Room 1</td>
 </tr>
 <tr class="shade-row">
 <td>Commencement Meeting</td>
 <td>December 15, 2205</td>
 <td>Building 42 Room 10</td>
 </tr>
 <tr>
 <td>Dean's Council</td>
 <td>February 1, 2206</td>
 <td>Building 35 Room 5</td>
 </tr>
 <tr class="shade-row">
 <td>Committee on Committees</td>
 <td>March 3, 2206</td>
 <td>Building 1 Room 201</td>
 </tr>
 <tr>
 <td> Lorem ipsum dolor sit amet, <a href="#">consectetue

The `tables` object is a list of all the tables found on the page.

The following function extracts all **th** tags in a table, and the **td** tags inside every **tr** tag after the first.

In [10]:
# Extract the header and data cells from one table
def extractTablesData(table):
    all_rows = []
    all_columns = []
    if table:
        # Collect the text of every header cell (th) in the table as a single
        # header row. Note: this includes any th cells used as row headers.
        headers = table.find_all("th")
        header_row = [header.get_text(strip=True) for header in headers]
        all_columns.append(header_row)

        # Collect the text of the data cells (td) of each row (tr)
        rows = table.find_all("tr")
        for row in rows[1:]:  # Skip the first row as it contains the headers
            row_data = [cell.get_text(strip=True) for cell in row.find_all("td")]
            all_rows.append(row_data)

    return all_columns, all_rows
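Because `find_all("th")` searches the whole table, any `th` cells used as row headers end up in the header list too. The following is one possible variant, not the notebook's own method: it assumes the column headers live in the first `tr` and that there are no nested tables. The function name `extract_table` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_table(table):
    """Return (column_headers, data_rows) for one <table> tag."""
    rows = table.find_all("tr")
    if not rows:
        return [], []
    # Column headers: only the <th> cells of the first row.
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    # Data rows: keep <th> and <td> together so row headers stay in place.
    data = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in rows[1:]
    ]
    return columns, data


# Hypothetical table whose first column uses <th> row headers.
html = """
<table>
  <tr><th>Instructor</th><th>Class</th></tr>
  <tr><th>Dr. Sally</th><td>Math</td></tr>
  <tr><th>Dr. Steve</th><td>Art</td></tr>
</table>
"""
columns, data = extract_table(BeautifulSoup(html, "html.parser").find("table"))
print(columns)  # ['Instructor', 'Class']
print(data)     # [['Dr. Sally', 'Math'], ['Dr. Steve', 'Art']]
```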

In this step we iterate over all the tables in the list and retrieve the headers and data of each one.

In [11]:
for table in tables:
    cols, rows = extractTablesData(table)
    print('*******')
    print(cols)
    print('*******')
    print('')

*******
[['Description', 'Date', 'Location']]
*******

*******
[['Name', 'Telephone', 'Email', 'Office']]
*******

*******
[['Instructor', 'Class', 'Location', 'Dr. Sally', 'Dr. Steve', 'Dr. Kathy']]
*******

*******
[['Aligned Left', 'Aligned Center', 'Aligned Right']]
*******

*******
[['Day', 'Time', 'Location']]
*******

*******
[['NAME OF SYSTEM OR PORTAL CHANNEL', 'NAME OF SYSTEM OR ACTIVITY', 'STATUS DURING OUTAGE', 'DATA FROZEN AS OF', 'EXPECTED UP TIME', 'Personal Information', 'Group Leave Balance', 'Leave/CTO Balances', 'Faculty Course Info', 'Enrollment Planning', 'Student Pay', 'PolyData', 'PolyProfile']]
*******



In [12]:
for table in tables:
    cols, rows = extractTablesData(table)
    print('*******')
    print(cols)
    print('--------')
    for row in rows:
        print(row)
    print('---------')
    break

*******
[['Description', 'Date', 'Location']]
--------
['Academic Senate Meeting', 'May 25, 2205', 'Building 99 Room 1']
['Commencement Meeting', 'December 15, 2205', 'Building 42 Room 10']
["Dean's Council", 'February 1, 2206', 'Building 35 Room 5']
['Committee on Committees', 'March 3, 2206', 'Building 1 Room 201']
['Lorem ipsum dolor sit amet,consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.', 'Loremipsum dolorsit amet, consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.', 'Loremipsum dolorsit amet, consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.']
['Lorem ipsum dolor', 'Lorem ipsum dolor', 'Lorem ipsum dolor']
---------
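Some cells in the output above have words run together (for example `amet,consectetuer`). This happens when a cell contains nested tags: `get_text(strip=True)` strips each text fragment and joins the fragments with no separator. A minimal sketch on a hypothetical cell shows the effect and one way around it, passing a separator as the first argument.

```python
from bs4 import BeautifulSoup

# Hypothetical cell with a nested link, similar in shape to the sample row.
cell_html = (
    "<td>Lorem ipsum dolor sit amet, "
    "<a href='#'>consectetuer</a> adipiscing elit.</td>"
)
cell = BeautifulSoup(cell_html, "html.parser").td

# strip=True strips every text fragment, then joins them with no separator.
print(cell.get_text(strip=True))
# → Lorem ipsum dolor sit amet,consectetueradipiscing elit.

# Passing a separator keeps one space between the fragments.
print(cell.get_text(" ", strip=True))
# → Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
```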


In [14]:
for table in tables:
    cols, rows = extractTablesData(table)
    columns = cols[0]

    break
    
print(columns)

['Description', 'Date', 'Location']


In [15]:
for table in tables:
    cols, rows = extractTablesData(table)
    for row in rows:
        print(row)
        break
    break

['Academic Senate Meeting', 'May 25, 2205', 'Building 99 Room 1']


In [16]:
for table in tables:
    cols, rows = extractTablesData(table)
    for row in rows:
        print(row[0])
        print(row[1])
        print(row[2])
        break
    break

Academic Senate Meeting
May 25, 2205
Building 99 Room 1
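The loop-and-`break` pattern used above reaches the first item of a list. Plain indexing and tuple unpacking do the same thing in fewer lines. In this sketch, `rows` is a stand-in list holding two rows copied from the first table's output.

```python
# Stand-in for the rows returned for the first table (copied from the output).
rows = [
    ["Academic Senate Meeting", "May 25, 2205", "Building 99 Room 1"],
    ["Commencement Meeting", "December 15, 2205", "Building 42 Room 10"],
]

first_row = rows[0]  # the same row the loop-and-break reaches
description, date, location = first_row  # unpack the three cells by position

print(description)
print(date)
print(location)
```

Unpacking raises a `ValueError` if a row does not have exactly three cells, which can be a useful early warning for irregular tables.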


In [17]:
Description = []
Date = []
Location = []

for table in tables:
    cols, rows = extractTablesData(table)
    for row in rows:
        Description.append(row[0])
        Date.append(row[1])
        Location.append(row[2])
        break
    break

In [18]:
print(Description)

['Academic Senate Meeting']


In [19]:
print(Date)

['May 25, 2205']


In [20]:
print(Location)

['Building 99 Room 1']


In [21]:
Description = []
Date = []
Location = []

for table in tables:
    cols, rows = extractTablesData(table)
    for row in rows:
        Description.append(row[0])
        Date.append(row[1])
        Location.append(row[2])
    break
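Appending to three lists inside a loop turns a list of rows into a list per column. As an alternative sketch, `zip(*rows)` performs the same transposition in one step. The `rows` list below is a stand-in with three rows copied from the first table's output, and the lowercase variable names are assumptions.

```python
# Stand-in for the rows of the first table (copied from the output).
rows = [
    ["Academic Senate Meeting", "May 25, 2205", "Building 99 Room 1"],
    ["Commencement Meeting", "December 15, 2205", "Building 42 Room 10"],
    ["Dean's Council", "February 1, 2206", "Building 35 Room 5"],
]

# zip(*rows) groups the first cells together, then the second cells, and so on.
descriptions, dates, locations = (list(column) for column in zip(*rows))

print(descriptions)
print(dates)
print(locations)
```

One caveat: `zip` stops at the shortest row, so rows with missing cells silently shorten the result.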

In [22]:
print(Description)

['Academic Senate Meeting', 'Commencement Meeting', "Dean's Council", 'Committee on Committees', 'Lorem ipsum dolor sit amet,consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.', 'Lorem ipsum dolor']


In [23]:
print(columns)

['Description', 'Date', 'Location']


In [24]:
print(Description)

['Academic Senate Meeting', 'Commencement Meeting', "Dean's Council", 'Committee on Committees', 'Lorem ipsum dolor sit amet,consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.', 'Lorem ipsum dolor']


In [25]:
print(Date)

['May 25, 2205', 'December 15, 2205', 'February 1, 2206', 'March 3, 2206', 'Loremipsum dolorsit amet, consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.', 'Lorem ipsum dolor']


In [26]:
print(Location)

['Building 99 Room 1', 'Building 42 Room 10', 'Building 35 Room 5', 'Building 1 Room 201', 'Loremipsum dolorsit amet, consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.', 'Lorem ipsum dolor']


In [27]:
dic = {
    'Description': Description,
    'Date': Date,
    'Location': Location
}


In [28]:
dic

{'Description': ['Academic Senate Meeting',
  'Commencement Meeting',
  "Dean's Council",
  'Committee on Committees',
  'Lorem ipsum dolor sit amet,consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.',
  'Lorem ipsum dolor'],
 'Date': ['May 25, 2205',
  'December 15, 2205',
  'February 1, 2206',
  'March 3, 2206',
  'Loremipsum dolorsit amet, consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.',
  'Lorem ipsum dolor'],
 'Location': ['Building 99 Room 1',
  'Building 42 Room 10',
  'Building 35 Room 5',
  'Building 1 Room 201',
  'Loremipsum dolorsit amet, consectetuer adipiscing elit. Sed lacus arcu, porta posuere, varius et.',
  'Lorem ipsum dolor']}

In [29]:
df = pd.DataFrame(dic)
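Building a dictionary of column lists is one route to a DataFrame. As an alternative sketch, `pd.DataFrame` also accepts the list of rows directly together with a `columns` argument, which skips the intermediate lists. The values below are copied from the first table's output; the name `df_direct` is an assumption.

```python
import pandas as pd

# Header and two rows copied from the first table's output.
columns = ["Description", "Date", "Location"]
rows = [
    ["Academic Senate Meeting", "May 25, 2205", "Building 99 Room 1"],
    ["Commencement Meeting", "December 15, 2205", "Building 42 Room 10"],
]

# Each inner list becomes one row; columns supplies the header names.
df_direct = pd.DataFrame(rows, columns=columns)

print(df_direct.shape)
print(df_direct)
```

This form requires every row to have the same number of cells as `columns`, otherwise pandas raises an error.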

In [30]:
df

|   | Description | Date | Location |
|---|---|---|---|
| 0 | Academic Senate Meeting | May 25, 2205 | Building 99 Room 1 |
| 1 | Commencement Meeting | December 15, 2205 | Building 42 Room 10 |
| 2 | Dean's Council | February 1, 2206 | Building 35 Room 5 |
| 3 | Committee on Committees | March 3, 2206 | Building 1 Room 201 |
| 4 | Lorem ipsum dolor sit amet,consectetuer adipis... | Loremipsum dolorsit amet, consectetuer adipisc... | Loremipsum dolorsit amet, consectetuer adipisc... |
| 5 | Lorem ipsum dolor | Lorem ipsum dolor | Lorem ipsum dolor |


![Table structure](./table.png)