Getting started with Botasaurus script throws lots of exceptions #24

Closed
gameuser1982 opened this issue Dec 24, 2023 · 4 comments

@gameuser1982

gameuser1982 commented Dec 24, 2023

Description

I am just seeing a ton of exceptions trying to run the first Selenium scraping task that goes to https://www.omkar.cloud/ and grabs the h1 heading. It's the first Botasaurus script here:

from botasaurus import *

@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")
    
    # Retrieve the heading element's text
    heading = driver.text("h1")

    # Save the data as a JSON file in output/all.json
    return {
        "heading": heading
    }
     
if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()

It's the first script on the "What is Botasaurus" docs page: https://www.omkar.cloud/botasaurus/docs/what-is-botasaurus/

Steps to Reproduce

  1. Run python main.py

Expected behavior:

Scrape the h1 heading and return it as a string named heading when the function is called (presumably the botasaurus framework then saves it to a JSON file automatically).

Actual behavior:

Lots of errors:

(py311selenium) C:\py311seleniumbot>python main.py
Running

DevTools listening on ws://127.0.0.1:64985/devtools/browser/6520850b-e749-463b-9c45-8e5ecdea678e
[24816:3140:1224/150501.718:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

[24816:3140:1224/150501.917:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable
Error getting page source: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580980]
        (No symbol) [0x00581F8D]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x008483B8]
        (No symbol) [0x008484DD]
        (No symbol) [0x00835818]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 431, in save_screenshot
    self.get_screenshot_as_file(
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 927, in get_screenshot_as_file
    png = self.get_screenshot_as_png()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 963, in get_screenshot_as_png
    return b64decode(self.get_screenshot_as_base64().encode('ascii'))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 975, in get_screenshot_as_base64
    return self.execute(Command.SCREENSHOT)['value']
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580862]
        (No symbol) [0x005A6EBA]
        (No symbol) [0x005A2036]
        (No symbol) [0x005A1CC2]
        (No symbol) [0x005370DB]
        (No symbol) [0x005375DE]
        (No symbol) [0x005379EB]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x00536F4C]
        (No symbol) [0x00536AEA]
        (No symbol) [0x006A526C]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Failed to save screenshot
Failed for input: None
We've paused the browser to help you debug. Press 'Enter' to close.
Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\py311seleniumbot\main.py", line 18, in <module>
    scrape_heading_task()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 443, in wrapper_browser
    current_result = run_task(data_item, False, 0)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 411, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 249, in close_driver
    driver.close()
    ^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 551, in close
    self.execute(Command.CLOSE)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580862]
        (No symbol) [0x005A6EBA]
        (No symbol) [0x005A2036]
        (No symbol) [0x005A1CC2]
        (No symbol) [0x005370DB]
        (No symbol) [0x005375DE]
        (No symbol) [0x005379EB]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x00536F4C]
        (No symbol) [0x00536AEA]
        (No symbol) [0x006A526C]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Reproduces how often:

It happens every time.

Additional context

I set up a virtual environment with botasaurus installed.

@gameuser1982
Author

gameuser1982 commented Dec 24, 2023

Update: It's my own damn fault. I installed botasaurus into a virtual environment that already had Selenium installed, stupidly thinking they could co-exist without conflict. Wrong wrong wrong.

Solution: I uninstalled botasaurus from the virtual environment that I had originally used for Selenium, created a new virtual environment, and installed ONLY botasaurus in it.
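
A quick way to compare the old and new environments is to print the installed package versions in each. The snippet below is only a minimal sketch using the standard library; the helper name `installed_version` is made up for illustration. Since the traceback shows botasaurus calling into `selenium`, a botasaurus-only environment will presumably still list Selenium, so the useful signal is whether the Selenium version differs between the two environments.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The two package names mentioned in this thread.
    for name in ("botasaurus", "selenium"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it once with each virtual environment activated and compare the two outputs.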

Now the script scrapes as expected, though the certificate parsing errors still appear, so I am keeping this issue open. Do these cert errors mean that the website is being connected to insecurely, or can they be safely ignored?

Here is the new output:

(py311botasaurus) C:\py311botasaurus>python main.py
Running
[INFO] Downloading Chrome Driver. This is a one-time process. Download in progress...

DevTools listening on ws://127.0.0.1:2309/devtools/browser/1ea8b6bd-45cd-4b14-af05-ef74b8bf8484
[6340:14368:1224/155004.893:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

[6340:14368:1224/155005.099:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

Written
     output/scrape_heading_task.json

(py311botasaurus) C:\py311botasaurus>
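
On the "is the connection insecure?" question, one independent check is to open a verified TLS connection to the same host from Python. This is a rough sketch, not a statement about what Chrome does internally: `check_tls` and `make_verifying_context` are illustrative names, and Python uses its own trust store and does not fetch missing intermediates via AIA, so a failure here would not by itself prove that the browser session is insecure.

```python
import socket
import ssl
from typing import Optional


def make_verifying_context() -> ssl.SSLContext:
    """Build a client context that requires a valid chain and matching hostname."""
    return ssl.create_default_context()


def check_tls(host: str, port: int = 443, timeout: float = 10.0) -> Optional[str]:
    """Connect with verification enabled and return the negotiated TLS version.

    Raises ssl.SSLCertVerificationError if the certificate cannot be verified.
    """
    context = make_verifying_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()


if __name__ == "__main__":
    # Host taken from the script in this issue.
    print(check_tls("www.omkar.cloud"))
```

If this prints a TLS version instead of raising, the site's certificate verified against the local trust store.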

@Chetan11-dev
Contributor

Yes, these keep occurring. Ignore them. Also, it wasn't your fault: I released buggy code yesterday (fixed now), and that's why it occurred.
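
For readers who land here from the `TypeError: 'bool' object is not callable` raised at `self.close_proxy()` in the traceback: one common way to get that error in Python is an instance attribute that shadows a method of the same name. Whether that was the actual bug in botasaurus is not stated in this thread, so the class below is a purely hypothetical stand-in that reproduces the same error message.

```python
class FakeDriver:
    """Hypothetical stand-in; NOT the real botasaurus AntiDetectDriver."""

    def __init__(self) -> None:
        # Bug: this instance attribute has the same name as the method below,
        # so `self.close_proxy` now resolves to the bool, not the method.
        self.close_proxy = False

    def close_proxy(self) -> None:
        print("proxy closed")

    def quit(self) -> None:
        # Raises TypeError: 'bool' object is not callable
        self.close_proxy()


if __name__ == "__main__":
    try:
        FakeDriver().quit()
    except TypeError as exc:
        print(exc)
```

Renaming either the attribute or the method removes the clash.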

@gameuser1982
Author

Wow nice! Thanks for the quick reply on this! This is a pretty awesome framework and the scraping side of things makes sense to me!

@Chetan11-dev
Contributor

Chetan11-dev commented Dec 24, 2023

Thanks, a lot of awesomeness is on its way that will seriously change the landscape of web scraping.
