## Dynamic Scraping

This Jupyter Notebook is designed for dynamic web scraping, data extraction, and analysis. It leverages powerful libraries such as Selenium for browser automation, BeautifulSoup for parsing HTML content, and Pandas for data manipulation and analysis. The goal is to automate the extraction of specific information from web pages, which involves navigating through pages, handling dynamic content, and extracting and processing the data for insights or further analysis.

### Load packages and define the functions for crawling, scraping and data matching

The cells below load the packages and define the functions.

In [11]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests                 # to interact with websites and request/get data from them
from bs4 import BeautifulSoup   # to parse and extract data from websites
import pandas as pd
import numpy as np
import time
import random
import urllib
import re
from PIL import Image
from io import BytesIO

In [90]:
def dynamic_crawler(url_list, numbers_list):
    '''
    Crawl the search result pages and collect the URL of each target company's financial report.
    Input:
        - url_list: URLs where we start crawling
        - numbers_list: list of unique company identifiers (CIK) covering
        all the companies whose ESG content we want to scrape
    Output: 
        - url_dict: dictionary mapping each filing's accession number to the URL of its report
    '''
    url_dict = {}
    for u in url_list:
        driver = webdriver.Chrome()
        try:
            driver.get(u)
            time.sleep(2)  # give the dynamic results time to render
            result_count_element = driver.find_element(By.ID, "show-result-count").find_element(By.TAG_NAME, "h5")
            # The first integer in the result-count text is the number of hits
            match = re.search(r"\b\d+\b", result_count_element.text)
            if match:
                page_num = int(match.group()) // 100 + 1  # assumes 100 results per page
                for _ in range(page_num):
                    soup_page = BeautifulSoup(driver.page_source, "html.parser")
                    identify_list = []
                    print("1")
                    # Keep only the filings whose accession-number prefix is in numbers_list
                    for a in soup_page.find_all("a", class_="preview-file"):
                        if a["data-adsh"][:10] in numbers_list:
                            identify_list.append((a["data-adsh"], a["data-file-name"]))
                    for adsh, file_name in identify_list:
                        # Open the preview of this filing
                        xpath = "//a[@class='preview-file'][@data-adsh='{}'][@data-file-name='{}']".format(adsh, file_name)
                        driver.find_element(By.XPATH, xpath).click()
                        print("find button")
                        time.sleep(1.5)
                        # Read the report URL behind the "open file" button
                        soup = BeautifulSoup(driver.page_source, "html.parser")
                        url_dict[adsh] = soup.find(id="open-file")["href"]
                        # Close the preview before moving to the next filing
                        driver.find_element(By.XPATH, "//button[@id='close-modal']").click()
                        print("close")
                        time.sleep(2)
                    if page_num > 1:
                        # Move to the next results page while pages remain
                        driver.find_element(By.XPATH, "//a[@data-value='nextPage']").click()
                        time.sleep(2)
                        print("find next page")
                        page_num -= 1
        finally:
            driver.quit()  # close the browser opened for this URL
    return url_dict
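
The page count in `dynamic_crawler` comes from a regex on the result-count text followed by integer division by 100. The sketch below isolates that arithmetic. The helper name `count_pages` and the sample strings are hypothetical, and the page size of 100 is only inferred from the division in the function above. Because `n // 100 + 1` gives one page too many when the count is an exact multiple of 100, the sketch swaps in ceiling division; this is a variant, not the notebook's own formula.

```python
import math
import re

PAGE_SIZE = 100  # assumption: inferred from the division by 100 in dynamic_crawler


def count_pages(result_text, page_size=PAGE_SIZE):
    """Return the number of result pages implied by a result-count string.

    Hypothetical helper: it takes the first integer in the text and
    rounds the page count up.
    """
    match = re.search(r"\b\d+\b", result_text)
    if match is None:
        return 0  # no number found, so there is nothing to page through
    return math.ceil(int(match.group()) / page_size)


# Made-up sample strings, not real EDGAR output
print(count_pages("237 search results"))  # 3
print(count_pages("200 search results"))  # 2
print(count_pages(""))                    # 0
```

With this variant, a count of 200 yields two pages rather than three, so the loop would not look for a "next page" link that does not exist.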

In [53]:
def scrapper(header, url_dict, key_words):
    '''
    Scrape, screen and store the ESG content from the URLs of the target companies' financial reports.
    Input:
        - header: request headers that contain the user agent info
        - url_dict: URLs of the target companies' financial reports
        - key_words: list of ESG keywords
    Output:
        - a dictionary that contains all the scraped content
    '''
    company_scrapped = {}
    lowered_key_words = [k.lower() for k in key_words]
    for code, url in url_dict.items():
        response = requests.get(url, headers=header)
        soup = BeautifulSoup(response.text, 'xml')
        # Keep the text of every div/p/span that mentions at least one keyword
        content = []
        for tag in soup.find_all(["div", "p", "span"]):
            text = str(tag.text)
            if any(k in text.lower() for k in lowered_key_words):
                content.append(text)
        # Drop the first match when there is more than one
        if len(content) > 1:
            del content[0]
        # The first 10 characters of the accession number serve as the company key
        company_scrapped[code[:10]] = content
        time.sleep(random.randint(2, 3))  # pause between requests
        print(response.status_code)
    return company_scrapped
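
The screening step in `scrapper` is a case-insensitive substring test against every keyword. A minimal sketch of that test on plain strings follows; the function name `screen_passages` and the sample passages are hypothetical and only illustrate the matching rule, without any network request or HTML parsing.

```python
def screen_passages(passages, key_words):
    """Keep the passages that contain at least one keyword, ignoring case."""
    lowered = [k.lower() for k in key_words]
    return [p for p in passages if any(k in p.lower() for k in lowered)]


# Made-up passages for illustration only
sample_passages = [
    "We reduced Greenhouse Gas emissions across our facilities.",
    "Net revenue increased compared with the prior year.",
    "Our biodiversity program covers three sites.",
]
sample_key_words = ["greenhouse gas", "diversity"]

for passage in screen_passages(sample_passages, sample_key_words):
    print(passage)
```

Because the test is a substring match, the keyword "diversity" also selects the passage about "biodiversity". Short keywords therefore cast a wider net than whole-word matching would.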

In [57]:
def matching_dataset(df, all_esg_content):
    '''
    Match the two datasets on the company's unique identifier (CIK).
    Input:
        - df: company financial info dataframe
        - all_esg_content: dictionary of ESG content keyed by CIK
    Output:
        - matched_df: merged dataframe 
    '''
    # .copy() makes the filtered frame independent, which avoids pandas' SettingWithCopyWarning
    matched_df = df[df['CIK'].isin(all_esg_content.keys())].copy()
    matched_df['Content'] = matched_df['CIK'].map(all_esg_content)
    return matched_df

### Load company codes from the downloaded Orbis dataset

This cell reads the company codes (CIK) from the dataset downloaded from Orbis and stores them in a list. These codes identify the companies to include in the scrape. We only scrape companies that have corresponding financial data in Orbis. Because we applied our selection criteria when exporting the data from Orbis, this also restricts the sample to large, active, US public limited companies.

In [79]:
financial_info = pd.read_excel("company_info_1.xlsx",dtype={'CIK': str}) 
numbers_list = list(financial_info["CIK"])

### Begin the dynamic crawling and scraping process

The cell below defines a list of EDGAR full-text search URLs, one per US state plus the District of Columbia (51 URLs in total), each filtered to 10-K filings from the past year. We first launch the Chrome driver and collect every "preview document" link on the results page. The driver then clicks each matching link to open the preview, reads the URL behind the "open file" button, closes the preview, and moves on to the next results page when there is one. We store the "open file" URL of each report.

In [88]:
state_list = ["https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=AL&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=AK&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=AZ&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=AR&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=CA&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=CO&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=CT&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=DE&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=DC&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=FL&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=GA&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=HI&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=ID&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=IL&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=IN&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=IA&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=KS&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=KY&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=LA&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=ME&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=MD&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=MA&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=MI&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=MN&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=MS&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=MO&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=MT&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=NE&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=NV&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=NH&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=NJ&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=NM&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=NY&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=NC&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=ND&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=OH&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=OK&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=OR&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=PA&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=RI&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=SC&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=SD&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=TN&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=TX&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=UT&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=VT&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=VA&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=WA&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=WV&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=WI&forms=10-K",
            "https://www.sec.gov/edgar/search/#/dateRange=1y&category=custom&locationCode=WY&forms=10-K"]
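
Since the URLs differ only in `locationCode`, the same list could be generated from a template. The sketch below is one way to do that. The names `BASE_URL`, `STATE_CODES` and `state_urls` are hypothetical, and the code list is an assumption: the 50 two-letter state codes plus DC, each appearing once.

```python
# URL pattern copied from the list above, with the state code left as a placeholder
BASE_URL = (
    "https://www.sec.gov/edgar/search/#/dateRange=1y"
    "&category=custom&locationCode={}&forms=10-K"
)

# Assumption: 50 state codes plus DC, with no duplicates
STATE_CODES = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA",
    "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY",
    "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX",
    "UT", "VT", "VA", "WA", "WV", "WI", "WY",
]

state_urls = [BASE_URL.format(code) for code in STATE_CODES]

print(len(state_urls))
print(state_urls[0])
```

Building the list this way makes a repeated or missing code easy to spot, for example by comparing `len(STATE_CODES)` with `len(set(STATE_CODES))`.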

In [91]:
url_dict = dynamic_crawler(state_list, numbers_list)

1
find button
close
find button
close
...

In [93]:
url_dict

{'0000785161-24-000009': 'https://www.sec.gov/Archives/edgar/data/0000785161/000078516124000009/ehc-20231231.htm',
 '0001396009-24-000006': 'https://www.sec.gov/Archives/edgar/data/0001396009/000139600924000006/vmc-20231231x10k.htm',
 '0000092122-24-000009': 'https://www.sec.gov/Archives/edgar/data/0000066904/000009212224000009/so-20231231.htm',
 '0001691303-24-000008': 'https://www.sec.gov/Archives/edgar/data/0001691303/000169130324000008/hcc-20231231.htm',
 '0001718227-23-000081': 'https://www.sec.gov/Archives/edgar/data/0001718227/000171822723000081/road-20230930.htm',
 '0001017480-23-000026': 'https://www.sec.gov/Archives/edgar/data/0001017480/000101748023000026/hibb-20230128.htm',
 '0001169445-23-000003': 'https://www.sec.gov/Archives/edgar/data/0001169445/000116944523000003/cpsi-20221231.htm',
 '0001731289-24-000064': 'https://www.sec.gov/Archives/edgar/data/0001731289/000173128924000064/nkla-20231231.htm',
 '0001060391-24-000142': 'https://www.sec.gov/Archives/edgar/data/0001060
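
In the output above, each key is a filing's accession number and each URL contains a 10-digit directory after `/data/`. The two do not always agree: for `0000092122-24-000009` the path shows `0000066904`. The sketch below reads the identifier from the URL rather than from the accession-number prefix. The helper name `cik_from_url` is hypothetical, and treating the `/data/` directory as the company's CIK is an assumption about EDGAR's URL layout that the notebook itself does not confirm.

```python
import re


def cik_from_url(url):
    """Return the zero-padded 10-digit directory that follows /edgar/data/, or None.

    Hypothetical helper; it assumes this directory is the company's CIK.
    """
    match = re.search(r"/edgar/data/(\d+)/", url)
    return match.group(1).zfill(10) if match else None


# Two entries copied from the output above
sample_url_dict = {
    "0000785161-24-000009": "https://www.sec.gov/Archives/edgar/data/0000785161/000078516124000009/ehc-20231231.htm",
    "0000092122-24-000009": "https://www.sec.gov/Archives/edgar/data/0000066904/000009212224000009/so-20231231.htm",
}

for adsh, url in sample_url_dict.items():
    # Accession-number prefix on the left, directory from the URL on the right
    print(adsh[:10], cik_from_url(url))
```

When the two columns differ, a lookup keyed on the accession-number prefix may attach the content to a different identifier than the one in the URL, so it can be worth checking both against the `CIK` column before merging.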

In [92]:
# Store the url list for backup 
import json
with open("url_full_list_0301", 'w') as file:
    # Use json.dump() to write the dictionary to the file
    json.dump(url_dict, file, indent=4)  # 'indent' for pretty printing
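
The backup is useful because crawling is the slow step: a saved dictionary can be reloaded with `json.load` and passed straight to the scraping function without reopening the browser. The sketch below shows the round trip in a temporary directory. The file name `url_backup.json` and the variable names are hypothetical, and the sample entry is copied from the output above.

```python
import json
import os
import tempfile

# One entry copied from the url_dict output above
sample_url_dict = {
    "0000785161-24-000009": "https://www.sec.gov/Archives/edgar/data/0000785161/000078516124000009/ehc-20231231.htm",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    backup_path = os.path.join(tmp_dir, "url_backup.json")  # hypothetical file name

    # Write the dictionary to disk, as in the cell above
    with open(backup_path, "w") as file:
        json.dump(sample_url_dict, file, indent=4)

    # Read it back in a later session
    with open(backup_path) as file:
        restored_url_dict = json.load(file)

print(restored_url_dict == sample_url_dict)  # True
```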

### Scrape the ESG content of each report

We use the URLs collected above to scrape the content of each report, then screen the text for ESG content with a keyword dictionary generated with NLP.

In [94]:
header = { "User-Agent" : "scraper for final project of a course MACS 30122 at UChicago, huiy@uchicago.edu" } 

In [82]:
# ESG words dictionary
key_words = [ "climate change", "greenhouse gas", "sustainable", "renewable energy",
        "carbon footprint", "biodiversity", "energy efficiency", "pollution", "waste management",
        "sustainable development", "natural resources", "deforestation", "eco-friendly", "recycling",
        "water scarcity", "air quality", "clean energy", "environmental impact", "ozone depletion",
        "sustainability goals", "renewable resources", "carbon neutral", "emissions reduction",
        "worker safety", "gender equality", "data protection", "community engagement",
        "social responsibility", "employee well-being", "labor rights", "inclusion", "diversity",
        "human rights", "child labor", "forced labor", "living wage", "workplace harassment",
        "consumer protection", "product safety", "data privacy", "community development",
        "stakeholder relations", "supply chain ethics", "social impact", "access to healthcare",
        "educational programs","corporate governance", "board diversity", "executive compensation", "anti-corruption",
        "ethical practices", "compliance", "transparency", "stakeholder engagement", "accountability",
        "corporate ethics", "board independence", "shareholder rights", "audit committee",
        "risk management", "regulatory compliance", "corporate transparency", "governance structure",
        "business ethics", "conflict of interest", "data governance", "ethical supply chain",
        "whistleblower protection", "sustainable governance"
        ]

In [95]:
esg_dict_no_subtitles_incorporated =[
        "climate", "climate change", "climates", "global warming", "climatic", "climatic conditions",
        "pollution", "air pollution", "pollutants", "pollutant", "emissions", "pollutions", "pollutant emissions", "polluting", "mercury pollution",
        "resource", "resources", "mineral resources",
        "biodiversity", "bio diversity", "biodiversity conservation", "ecosystems", "marine biodiversity", "deforestation", "ecological", "ecology", "habitats", "fauna",
        "waste", "wastes", "waste disposal", "hazardous waste", "garbage", "recyclable waste", "landfills", "recycling", "landfill", "landfilling",
        "carbon", "carbon emissions", "carbon emission", "CO2", "greenhouse gas", "emission", "carbon dioxide emissions", "greenhouse gases", "greenhouse gas emissions", "carbon dioxide",
        "renewable", "renewable energy", "renewables", "biomass", "renewable fuels", "biofuels", "renewable energies", "fossil fuels",
        "water", "potable water", "sewage", "groundwater", "freshwater", "potable", "wastewater", "brackish groundwater",
        "deforestation", "tropical deforestation", "Amazon deforestation", "rampant deforestation", "rainforest destruction", "biodiversity", "tropical forests", "desertification", "rainforests",
        "greenhouse", "greenhouses", "hydroponic garden", "glasshouse", "unheated greenhouse", "hydroponically", "garden", "hydroponic greenhouse", "glasshouses",
        "rights", "freedoms", "inalienable rights", "constitutional protections",
        "labor", "wages", "union", "labor unions", "wage",
        "employee", "employees", "worker", "employer", "coworker", "workers", "staffer",
        "diversity", "cultural diversity", "diverse", "inclusiveness", "multicultural", "culturally diverse", "geographic diversity", "linguistic diversity", "inclusivity",
        "community", "communities",
        "safety",
        "development", "revitalization",
        "consumer", "consumers", "retail", "consumer electronics",
        "trade", "trading", "trades", "traded",
        "justice", "judicial", "criminal justice", "equality", "injustice",
        "board", "directors", "trustees", "boards",
        "pay", "paying", "paid", "pays", "reimburse", "payment", "repay",
        "corruption", "rampant corruption", "graft", "bribery", "endemic corruption", "corrupt", "cronyism", "rampant graft", "endemic graft", "anticorruption",
        "shareholder", "shareholders", "stockholder", "controlling shareholder", "stockholders", "shareowner", "investor", "shareholding", "unitholder",
        "transparency", "accountability", "openness", "transparent", "clarity", "objectivity",
        "ethics", "ethical", "ethical lapses",
        "risk", "risks", "probability", "danger", "likelihood", "risky", "hazard", "peril",
        "privacy", "confidentiality",
        "investment", "investments", "investing", "investment", "investor", "invest", "investors", "equity", "investement",
        "corporate", "corporations", "multinational corporations"
    ]


In [96]:
esg_content = scrapper(header, url_dict,key_words)

200
200
200
...


### Match the two datasets

We match the Orbis dataset with the scraped ESG content, using CIK as the mapping key.

In [97]:
filtered_df = matching_dataset(financial_info, esg_content)



In [98]:
filtered_df

Unnamed: 0.1,Unnamed: 0,Company name Latin alphabet,Adjust-Company Name,Inactive,Quoted,Branch,OwnData,Woco,Country ISO code,"NACE Rev. 2, core code (4 digits)",...,"US SIC, primary code(s)",Unnamed: 21,"US SIC, primary code(s).1","US SIC, secondary code(s)",National ID,National ID type,National ID label,Ticker symbol,CIK,Content
0,1.0,WALMART INC.,WALMART INC.,No,Yes,No,No,Yes,US,4719.0,...,5331.0,5331,5331.0,5411.0,71-0415188,VAT/Tax number,EIN,WMT,0000104169,"[Directors, Executive Officers and Corporate G..."
1,2.0,"AMAZON.COM, INC.","AMAZON.COM, INC.",No,Yes,No,No,Yes,US,4791.0,...,5961.0,5961,5961.0,5999.0,91-1646860,VAT/Tax number,EIN,AMZN,0001018724,"[Directors, Executive Officers, and Corporate ..."
3,4.0,EXXON MOBIL CORP,EXXON MOBIL CORP,No,Yes,No,No,Yes,US,1920.0,...,2911.0,2911,2911.0,1311.0,13-5409005,VAT/Tax number,EIN,XOM,0000034088,"[Directors, Executive Officers and Corporate G..."
4,5.0,CVS HEALTH CORPORATION,CVS HEALTH CORPORATION,No,Yes,No,No,Yes,US,4773.0,...,5912.0,5912,5912.0,,05-0494040,VAT/Tax number,EIN,CVS,0000064803,"[Directors, Executive Officers and Corporate G..."
6,7.0,MCKESSON CORPORATION,MCKESSON CORPORATION,No,Yes,No,No,Yes,US,4645.0,...,5122.0,5122,5122.0,5047.0,94-3207296,VAT/Tax number,EIN,MCK,0000927653,"[Directors, Executive Officers, and Corporate ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9688,10692.0,"HEPION PHARMACEUTICALS, INC.","HEPION PHARMACEUTICALS, INC.",No,Yes,No,No,No,US,7211.0,...,8731.0,8731,8731.0,,46-2783806,VAT/Tax number,EIN,HEPA,0001583771,[Non-compliance with applicable regulatory req...
9771,10778.0,SES AI CORPORATION,SES AI CORPORATION,No,Yes,No,No,No,US,2720.0,...,3691.0,3691,3691.0,,98-1567584,VAT/Tax number,EIN,SES,0001819142,[TABLE OF CONTENTS​​​PART I​Item 1.Business5It...
9806,10813.0,"KRYSTAL BIOTECH, INC.","KRYSTAL BIOTECH, INC.",No,Yes,No,No,No,US,7211.0,...,8731.0,8731,8731.0,,82-1080209,VAT/Tax number,EIN,KRYS,0001711279,"[Directors, Executive Officers and Corporate G..."
9831,10838.0,INSTIL BIO INC,INSTIL BIO INC,No,Yes,No,No,No,US,7211.0,...,8731.0,8731,8731.0,,83-2072195,VAT/Tax number,EIN,TIL,0001789769,"[Item 10. Directors, Executive Officers and Co..."


In [99]:
filename = "full_data——0302.xlsx"
filtered_df.to_excel(filename, index=False, engine='openpyxl')