# **Web Scraping Using Pandas and Selenium**
> ## - Jonathan Scofield
>> ###          2/17/2024

## **Introduction** 

> #### This project demonstrates how to use the Python **selenium** library to extract table data from a dynamically loaded web page. Popular web scraping libraries such as **Beautiful Soup** and **requests** work well with static web pages; however, they only see the HTML the server returns and do not execute JavaScript, so they cannot easily parse dynamically loaded content. **Selenium WebDriver** drives a real browser, so the page's **JavaScript** runs and events can be triggered before the HTML is read. 

> #### This project uses public well data from the [Geological Survey of Alabama](https://www.gsa.state.al.us/). The parent web page for this project can be found at https://www.gsa.state.al.us/ogb/wells. We will use the hyperlinks in the API column to obtain additional information about the wells. A sample of what a child page might look like can be found here: https://www.gsa.state.al.us/ogb/wells/details/construct/17323-B-1/26114/32904/404/221100

> #### **IMPORTANT:** *The web pages used in this project are not owned by the author and may be removed or altered at any time. The author has provided copies of these pages for reference in this repository.*


## **Setup and Requirements**

> #### The following dependencies must be installed:
>> \#!pip install pandas==2.0.3 selenium==4.17.2 chromedriver-autoinstaller==0.6.4 

In [1]:
# Import libraries
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

In [2]:
# Downloads a chromedriver that matches the installed Chrome version.
# Uncomment the next line if Selenium cannot locate a driver on its own.
#chromedriver_autoinstaller.install()

## **Classes and Methods**

### **Parent Class**

> #### This class serves as our base class for scraping a web page. It starts the selenium WebDriver when instantiated and contains three methods:
>> 1. **extract_text()** keeps only the anchor text for every column that is not listed in **href_columns**. When links are extracted, Pandas returns each cell as a tuple: (anchor text, hyperlink).
>> 2. **get_html_table()** loads a web page from a given url and waits until JavaScript has loaded a table. It then uses the Pandas **read_html()** function to extract the tables and returns the selected one as a Pandas DataFrame.
>> 3. **close_browser()** closes the browser once all crawling operations have completed.
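
The tuple handling described in item 1 can be tried without a browser. The following is a minimal sketch, assuming a hand-built DataFrame that mimics the cell shape produced when links are extracted (a `(text, href)` tuple per cell, with `None` as the href when a cell has no link). The names `raw`, `clean`, and the truncated paths are placeholders for illustration, not values fetched from the site.

```python
import pandas as pd

# Hand-built stand-in for a table read with link extraction enabled:
# every body cell is a (text, href) tuple; href is None when there is no link.
raw = pd.DataFrame({
    "Permit": [("17469-CG", None), ("17468", None)],
    "API": [
        ("01073219420000", "/ogb/wells/details/construct/..."),
        ("01013200050000", "/ogb/wells/details/construct/..."),
    ],
})

href_columns = ["API"]  # columns whose (text, href) tuples should be kept

clean = raw.copy()
for column in clean.columns:
    if column not in href_columns:
        # Keep only the text half of each tuple
        clean[column] = clean[column].map(lambda cell: cell[0])

print(clean["Permit"].tolist())  # plain strings
print(clean["API"].iloc[0])      # still a (text, href) tuple
```

Columns left out of `href_columns` end up as plain text, while the listed columns keep both halves of the tuple so the link can be followed later.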



In [3]:
class WebPage:
    
    def __init__(self):
        self.driver = webdriver.Chrome() # Start web driver
        
    # When extracting links in Pandas, the link text and href are returned as tuples: (text, href).
    # The following function keeps only the link text for every column not listed in href_columns
    def extract_text(self, df, href_columns): 
        for column in df.columns.to_list():
            df[column] = df[column].map(lambda x: x[0] if column not in href_columns else x)
        return df
    
    # This function loads a page with Selenium and extracts one HTML table using Pandas
    def get_html_table(self, url, href_columns = None, table_index = 0):
        href_columns = href_columns or [] # Avoid a mutable default argument
        preserve_links = len(href_columns) > 0 
        self.driver.get(url) # Load the web page
        # Important: wait up to 10 seconds for a table to load before reading the HTML
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "table")))
        html = self.driver.page_source # HTML after JavaScript has run
        # Use Pandas to extract tables via HTML tags
        tables = pd.read_html(html, extract_links = 'body', flavor = 'bs4') if preserve_links else pd.read_html(html, flavor = 'bs4')
        df = pd.DataFrame(tables[table_index])
        # Return the data frame with selected links preserved as tuples
        return self.extract_text(df, href_columns) if preserve_links else df
    
    def close_browser(self): # This function closes the browser
        self.driver.quit()

### **Child Class**

> #### This class extends the parent class by allowing us to follow embedded links in our table. When instantiating the class, we pass all the parameters needed to locate the table on the parent page. It contains three methods:
>> 1. **get_links()** creates a dictionary from the columns that preserve links, in the form {column: {anchor text: link}}. This is needed to crawl the child tables.
>> 2. **format_child_url()** takes the partial url returned in the tuple and combines it with the parent url, removing overlapping directories with Python set operations.
>> 3. **get_linked_table()** follows a link and returns another table. The anchor text must be present in the parent table.

In [4]:
class ChildPage(WebPage): #Child class of web page
    
    def __init__(self, parent_url, href_columns, parent_table_index):
        super().__init__() # Run the parent initializer, which starts the web driver
        self.parent_url = parent_url
        self.href_columns = href_columns # Preserve links in these columns
        self.parent_table_index = parent_table_index # In case of multiple tables
        self.parent_df = self.get_html_table(url = self.parent_url, href_columns = self.href_columns, table_index = self.parent_table_index) 
        
    # Create a dictionary from columns containing link tuples as {column: {anchor: link}}
    # Uses format_child_url (below) to fix url formatting
    def get_links(self): 
        return {href_column : {anchor_text: self.format_child_url(link) for anchor_text, link in self.parent_df[href_column].to_list()} for href_column in self.href_columns}

    # Function to join partial child url with parent url
    def format_child_url(self, child_url): 
        child_url = str(child_url) # Partial url for child page
        parent_url_dir = set(self.parent_url.split('/')) # Extract directories from parent url
        child_url_dir = set(child_url.split('/')) # Extract directories from child url
        common_dir = parent_url_dir & child_url_dir # Intersection of directories
        for element in common_dir: # Delete elements in place and remove doubled slashes
            child_url = child_url.replace(element, '').replace('//', '/') 
        return self.parent_url + child_url # Combine and remove overlap
    
    def get_linked_table(self, href_column, anchor_text):
        link_dict = self.get_links() # Load link dictionary
        link = link_dict[href_column].get(anchor_text, None) # Get link value
        if link is None: # Only follow links whose anchor text is in the parent table
            raise KeyError(f"{anchor_text!r} not found in column {href_column!r}")
        df = self.get_html_table(url = link) # Follow link using inherited method
        df[href_column] = anchor_text # Adds a new column with the anchor text for clarity
        return df
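
As a point of comparison for **format_child_url()**, the standard library's `urllib.parse.urljoin` can also resolve a partial link against the parent url. This is a sketch of an alternative technique, not the method used in the class above. The child path is taken from the sample child page linked in the introduction.

```python
from urllib.parse import urljoin

parent_url = "https://www.gsa.state.al.us/ogb/wells"
# Site-relative href, in the same form as the sample child page in the introduction
child_href = "/ogb/wells/details/construct/17323-B-1/26114/32904/404/221100"

# An href that starts with "/" is resolved against the site root of parent_url
full_url = urljoin(parent_url, child_href)
print(full_url)
```

For links under `/ogb/wells` this gives the same kind of result as the set-based approach. The two differ for hrefs that live elsewhere on the site: `urljoin` resolves them against the site root, while the set-based method appends them to the parent url.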
        
            

### **Demonstration**

#### Retrieve the oil well table from the main page and preserve links in the "API" and "Online Map" columns

In [5]:
oil_wells = ChildPage('https://www.gsa.state.al.us/ogb/wells', ['API', 'Online Map'], 0)

In [6]:
oil_wells.parent_df

Unnamed: 0,Permit,API,Name,Permit Date,Log Date,Status,Status Date,Type,Operator,Field,Pool,County,Section,TwpRng,Unit Acres,Unit Description,Lon,Lat,Online Map
0,17469-CG,"(01073219420000, /ogb/wells/details/construct/...",,02-16-2024,,PW,02-16-2024,CM,Black Warrior Methane Corp.,Oak Grove,Pottsville Coal Interval,Jefferson,34,18S 7W,,VI-C,-87.2571,33.43193,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
1,17468,"(01013200050000, /ogb/wells/details/construct/...",Louis Sudler 18-16 #1,02-09-2024,,PW,02-09-2024,UN,"Ranger Ventures, LLC",Wildcat,,Butler,18,9N 13E,160.0,SE,-86.79062,31.74438,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
2,17467,"(01013200040000, /ogb/wells/details/construct/...",Tisdale 13-16 #1,02-09-2024,,PW,02-09-2024,UN,"Ranger Ventures, LLC",Wildcat,,Butler,13,10N 13E,160.0,SE,-86.70578,31.83361,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
3,17466-CG,"(01073219410000, /ogb/wells/details/construct/...",RGGS 07-05-06,02-07-2024,,PW,02-07-2024,CM,Black Warrior Methane Corp.,Oak Grove,Pottsville Coal Interval,Jefferson,7,19S 6W,,Unit VI-B,-87.21371,33.40288,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
4,17465-CG,"(01125263000000, /ogb/wells/details/construct/...",Cassidy 29-15-02,02-07-2024,,PW,02-07-2024,CM,"Warrior Met Coal Gas, LLC",Blue Creek,Pottsville Coal Interval,Tuscaloosa,29,18S 8W,,Unit 1 (gob wells only),-87.39608,33.44236,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
5,17464,"(01013200030000, /ogb/wells/details/construct/...",W. Forsyth 31-2 #1,01-19-2024,,PW,01-19-2024,UN,"Ranger Ventures, LLC",Wildcat,,Butler,31,10N 13E,160.0,NE,-86.7948,31.79964,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
6,17463-CG,"(01073219400000, /ogb/wells/details/construct/...",RGGS 34-01-02,12-27-2023,,AC,01-04-2024,CM,Black Warrior Methane Corp.,Oak Grove,Pottsville Coal Interval,Jefferson,34,18S 7W,,VI-C (gob and horizontal boreholes),-87.24975,33.43654,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
7,17462-B,"(01035203790000, /ogb/wells/details/construct/...",A. E. Davis 29-9 #2,12-19-2023,,PW,12-19-2023,OIL,ESE Operating LLC,Southwest Range,Smackover,Conecuh,29,4N 8E,160.0,SE,-87.28168,31.27951,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
8,17461-CG,"(01125262990000, /ogb/wells/details/construct/...",Cassidy 29-09-01,12-13-2023,,AC,01-11-2024,CM,"Warrior Met Coal Gas, LLC",Blue Creek,Pottsville Coal Interval,Tuscaloosa,29,18S 8W,,Unit 1 (GOB wells only),-87.39097,33.4466,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."
9,17460-CG,"(01073219390000, /ogb/wells/details/construct/...",C-3 23-4 No. 1,12-12-2023,,PW,12-12-2023,CM,Keyrock Energy LLC,Oak Grove,Pottsville Coal Interval,Jefferson,23,18S 7W,80.0,W2 NW,-87.24553,33.46616,"(, /apps/maps/?query=map_1748_3%2CWEBOGBSDE.DB..."


#### See embedded links

In [7]:
oil_wells.get_links()

{'API': {'01073219420000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17469-CG/28400/33951/84/327100',
  '01013200050000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17468/28399/33950/0/',
  '01013200040000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17467/28398/33949/0/',
  '01073219410000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17466-CG/28397/33948/84/327100',
  '01125263000000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17465-CG/28396/33947/227/327100',
  '01013200030000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17464/28395/33946/0/',
  '01073219400000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17463-CG/28394/33945/84/327100',
  '01035203790000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17462-B/28393/33944/289/221100',
  '01125262990000': 'https://www.gsa.state.al.us/ogb/wells/details/construct/17461-CG/27393/32944/227/327100',
  '01073219390000': 'https://ww

#### Get casing data for well with API 01035203650100 (index 48). *Note: the casing table is the first table that appears via the child link, so the default index will suffice*

In [8]:
oil_wells.get_linked_table('API','01035203650100')

Unnamed: 0,Feature,Diameter,Inside Diameter,Casing Top,Casing Bottom,Pressure,Recovered,Comment,API
0,COND,20.0,,0.0,80.0,,,,1035203650100
1,HOLE,12.25,,0.0,2679.0,,,,1035203650100
2,SURF,9.625,,0.0,2679.0,,,,1035203650100
3,HOLE,8.75,,2679.0,12274.0,,,,1035203650100
4,,,,,,,,,1035203650100
5,,,,,,,,,1035203650100
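
To gather child tables for several wells at once, the single call above could be wrapped in a loop. The sketch below is one possible way to do that, under stated assumptions: `collect_linked_tables` is a hypothetical helper that is not part of the classes above, and `FakePage` is a stand-in with made-up anchors and urls so the example runs without a browser.

```python
import pandas as pd

def collect_linked_tables(page, href_column, anchors):
    """Fetch the child table for each anchor and stack the results (hypothetical helper)."""
    known = page.get_links()[href_column]  # {anchor text: url}
    frames = [
        page.get_linked_table(href_column, anchor)
        for anchor in anchors
        if anchor in known  # skip anchors that are not in the parent table
    ]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


class FakePage:
    """Stand-in for ChildPage so the sketch runs without a browser (made-up data)."""
    def get_links(self):
        return {"API": {"A1": "https://example.com/a1", "B2": "https://example.com/b2"}}

    def get_linked_table(self, href_column, anchor_text):
        return pd.DataFrame({"Feature": ["COND", "HOLE"], href_column: anchor_text})


combined = collect_linked_tables(FakePage(), "API", ["A1", "B2", "missing"])
print(combined)
```

With the real object, the equivalent call would pass `oil_wells` in place of `FakePage()` and must run before the browser is closed. Each child page is loaded in turn, so a long list of anchors will take a while.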


In [9]:
oil_wells.close_browser()

### **References**

> - https://medium.com/thedevproject/how-to-scrape-javascript-heavy-sites-like-a-pro-with-python-1ecf6f829538
> - https://selenium-python.readthedocs.io/
> - https://pandas.pydata.org/pandas-docs/version/2.0.3/
