## 3. Scraping data from HTML files

The HTMLScraper is an extension to the EXSCLAIM! code (https://github.com/MaterialEyes/exsclaim) that lets users place HTML files of articles from Nature, Wiley, ACS, and RSC journals in a folder and scrape the images and captions from them.

Note: for journals with dynamic webpages, a working ChromeDriver installation is crucial.

Run the cell below to confirm that ChromeDriver is connected, since Selenium cannot be used without it.

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium_stealth import stealth

URL = "https://google.com"

# create a new Service instance and specify path to Chromedriver executable
service = ChromeService(executable_path=ChromeDriverManager().install())

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(service=service, options=options)

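# Apply the stealth settings before loading a page so they take effect on it
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

# Load a test page and read its title to confirm the driver responds
driver.get(URL)
title = driver.title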

print(f" {title} Driver is connected successfully")


 Google Driver is connected successfully


```
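# Clone and install the EXSCLAIM fork that contains the HTMLScraper
!git clone https://github.com/katerinavr/exsclaim.git
%cd exsclaim
!python setup.py install
!pip install urllib3==1.25.10
!pip install --upgrade --no-cache-dir gdown
from IPython.display import clear_output
import pandas as pd
import locale
# Force UTF-8 as the preferred encoding for the rest of the session
locale.getpreferredencoding = lambda: "UTF-8"
# Hide the lengthy installation logs
clear_output()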

```

```
!pip install langchain
!pip install transformers
!pip install gradio
!pip install accelerate
!pip install chromadb
!pip install sentence_transformers
!pip install unstructured
!pip install tiktoken
!pip install openai
clear_output()

```

In [4]:
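import subprocess

def download_file_from_google_drive(file_id, destination):
    """Download a Google Drive file to `destination` with the gdown CLI."""
    url = "https://docs.google.com/uc?id=" + file_id
    # Pass the arguments as a list so paths with spaces are handled safely;
    # check=True raises an error if the download fails
    subprocess.run(["gdown", url, "-O", destination], check=True)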

```
# Download the pretrained models (-p avoids an error if the folder already exists)
!mkdir -p /content/exsclaim/exsclaim/figures/checkpoints/
download_file_from_google_drive('1ZodeH37Nd4ZbA0_1G_MkLKuuiyk7VUXR', '/content/exsclaim/exsclaim/figures/checkpoints/classifier_model.pt')
download_file_from_google_drive('1Hh7IPTEc-oTWDGAxI9o0lKrv9MBgP4rm', '/content/exsclaim/exsclaim/figures/checkpoints/object_detection_model.pt')
download_file_from_google_drive('1rZaxCPEWKGwvwYYa8jLINpUt20h0jo8y', '/content/exsclaim/exsclaim/figures/checkpoints/text_recognition_model.pt')
download_file_from_google_drive('1B4_rMbP3a1XguHHX4EnJ6tSlyCCRIiy4', '/content/exsclaim/exsclaim/figures/checkpoints/scale_bar_detection_model.pt')
download_file_from_google_drive('1oGjPG698LdSGvv3FhrLYh_1FhcmYYKpu', '/content/exsclaim/exsclaim/figures/checkpoints/scale_label_recognition_model.pt')

```
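Because a failed download can go unnoticed until the pipeline starts, it may help to confirm that all five checkpoint files are in place first. The following is a minimal sketch, not part of EXSCLAIM: the helper `missing_checkpoints` is a hypothetical name, and the file names and directory are simply copied from the download cell above.

```python
from pathlib import Path

# File names copied from the download cell above.
CHECKPOINT_NAMES = [
    "classifier_model.pt",
    "object_detection_model.pt",
    "text_recognition_model.pt",
    "scale_bar_detection_model.pt",
    "scale_label_recognition_model.pt",
]


def missing_checkpoints(checkpoint_dir, names=CHECKPOINT_NAMES):
    """Return the checkpoint file names that are absent or empty (hypothetical helper)."""
    checkpoint_dir = Path(checkpoint_dir)
    missing = []
    for name in names:
        path = checkpoint_dir / name
        # A zero-byte file usually means the download did not complete.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Directory used in the download cell above.
missing = missing_checkpoints("/content/exsclaim/exsclaim/figures/checkpoints")
if missing:
    print("Missing or empty checkpoints:", ", ".join(missing))
else:
    print("All checkpoints are present")
```

If any file is reported as missing, rerunning the corresponding download line is usually the simplest remedy.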

Below is an example of the JSON query that serves as the input to the pipeline. Several example queries can be found under /exsclaim/query.

When you use the HTMLScraper, create an 'html_files' folder and upload your HTML files to it. The "html_folder" entry of the query must point to that folder.
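As a minimal sketch of this step, the folder can be created from Python and its contents listed before the pipeline runs. The helper `list_html_files` is a hypothetical name, not part of EXSCLAIM, and filtering on the `.html` extension is an assumption based on the file used in this example.

```python
from pathlib import Path


def list_html_files(html_folder):
    """Create the folder if needed and return its .html files, sorted by name."""
    folder = Path(html_folder)
    folder.mkdir(parents=True, exist_ok=True)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".html"
    )


# Example usage with the folder referenced by the query in this section:
# html_files = list_html_files("/content/html_files")
# print(f"Found {len(html_files)} HTML file(s)")
```

An empty list means that no HTML files have been uploaded yet, so the scraper would have nothing to process.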

In [6]:
test_json = {
    "name": "html-ECPs",

    "html_folder": "/content/html_files",

    "llm": "gpt-3.5-turbo",

    "openai_API": "YOUR_OPENAI_API_KEY",  # replace with your OpenAI API key
    "save_format": ["boxes", "save_subfigures", "csv"],

    "logging": ["print", "exsclaim.log"]
}
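
To avoid pasting the API key directly into a shared notebook, one option is to read it from an environment variable and assemble the query from it. This is only a sketch: the helpers `read_openai_key` and `build_query` and the variable name `OPENAI_API_KEY` are assumptions made for illustration and are not part of EXSCLAIM; the dictionary keys are the ones used in the query above.

```python
import os


def read_openai_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in an environment variable, or raise if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def build_query(name, html_folder, api_key):
    """Assemble a query dictionary with the same keys as the example query."""
    return {
        "name": name,
        "html_folder": html_folder,
        "llm": "gpt-3.5-turbo",
        "openai_API": api_key,
        "save_format": ["boxes", "save_subfigures", "csv"],
        "logging": ["print", "exsclaim.log"],
    }


# Example usage (requires the environment variable to be set beforehand):
# test_json = build_query("html-ECPs", "/content/html_files", read_openai_key())
```

Keeping the key out of the notebook means the file can be shared or committed without exposing the credential.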



Once the run completes successfully, a **SUCCESS** message is printed.
The generated files are written inside the exsclaim directory under /exsclaim/output/name, where name is the "name" field of the query.
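
A quick way to see what a run produced is to group the files in the output folder by extension. The sketch below assumes the Colab path that appears in this example's log (`/content/exsclaim/output/html-ECPs`); the helper `summarize_outputs` is a hypothetical name, not part of EXSCLAIM.

```python
from collections import defaultdict
from pathlib import Path


def summarize_outputs(output_dir):
    """Group every file under output_dir by extension (hypothetical helper)."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing to report if the run has not created the folder yet.
        return {}
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[path.suffix.lower() or "(no extension)"].append(path)
    return dict(groups)


# Path assumed from this example's query name and Colab layout.
for ext, paths in summarize_outputs("/content/exsclaim/output/html-ECPs").items():
    print(f"{ext}: {len(paths)} file(s)")
```

Nothing is printed if the folder does not exist, for example before the pipeline has been run.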

In [7]:
from exsclaim.pipeline import Pipeline

test_pipeline = Pipeline(test_json)
# Disable the online journal scraper and read the local HTML files instead
results = test_pipeline.run(
    tools=None,
    figure_separator=True,
    caption_distributor=True,
    journal_scraper=False,
    html_scraper=True,
    driver=driver,
)

Running HTML Scraper
>>> (1 of 1) Extracting figures from: All Donor Electrochromic Polymers Tunable across the Visible Spectrum via Random Copolymerization _ Chemistry of Materials.html
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b01293_fig1.png
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b01293_fig2.png
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b01293_fig3.png
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b01293_fig4.png
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b01293_fig5.png
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b01293_fig6.png
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b01293_fig7.png
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b01293_fig8.png
image saved as:  /content/exsclaim/output/html-ECPs/figures/acs.chemmater.9b0129