This notebook is based on the walkthrough in Thu Vu's YouTube video on Python web scraping with Selenium.

https://www.youtube.com/watch?v=RuNolAh_4bU&ab_channel=ThuVudataanalytics

If you also run Python through WSL2 on Windows, you may run into a couple of issues. I'm noting them here in the hope of saving you the time I spent on them.

https://www.gregbrisebois.com/posts/chromedriver-in-wsl2/ - This post made me realize that I needed to launch Chrome through an X server at least once so that I could check the box that makes Chrome the default browser. Until I did that, everything timed out because the pages wouldn't load.

https://cloudbytes.dev/snippets/run-selenium-and-chrome-on-wsl2 - This was the first link I used for troubleshooting, and I needed it to walk through installing Chrome in my WSL2 environment. The rest of it wasn't strictly necessary, but it does show an alternative method for getting the pages. Options are good, I suppose.
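Before launching Selenium, it can help to confirm the two things the links above were needed for: that Chrome is installed and that an X server display is available. The following is only a rough sketch under a few assumptions: that the Chrome binary is named `google-chrome`, that the X server is advertised through the `DISPLAY` environment variable, and that a WSL kernel release string contains "microsoft". The helper names `looks_like_wsl` and `chrome_preflight` are made up for this example.

```python
import os
import shutil


def looks_like_wsl(kernel_release: str) -> bool:
    """Guess whether a kernel release string (as from `uname -r`) is WSL."""
    # Assumption: WSL kernels include "microsoft" in their release string.
    return "microsoft" in kernel_release.lower()


def chrome_preflight(environ=None, which=shutil.which):
    """Collect hints about the environment Selenium is about to run in.

    `environ` and `which` can be swapped out, so the check can be tried
    without depending on the real machine.
    """
    environ = os.environ if environ is None else environ
    return {
        # Assumption: Chrome was installed under the name "google-chrome".
        "chrome_path": which("google-chrome"),
        # None here suggests no X server is reachable from this shell.
        "display": environ.get("DISPLAY"),
    }
```

Calling `looks_like_wsl(platform.release())` (after `import platform`) and `chrome_preflight()` with no arguments inspects the current machine. A `chrome_path` or `display` of `None` points at the installation or X server steps described in the two posts.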

In [17]:
import pandas as pd
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

### Create driver

In [15]:
from selenium.webdriver.chrome.service import Service

# Selenium 4 expects the driver path wrapped in a Service object;
# passing the path positionally is deprecated.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

[WDM] - Current google-chrome version is 102.0.5005
[WDM] - Get LATEST chromedriver version for 102.0.5005 google-chrome
[WDM] - Driver [/home/jpigg/.wdm/drivers/chromedriver/linux64/102.0.5005.61/chromedriver] found in cache


In [36]:
page_url = "https://witcher.fandom.com/wiki/Category:Characters_in_the_stories"
driver.get(page_url)
# driver.find_element(By.XPATH, '//div[text()="ACCEPT"]').click()

## Find books


In [37]:
book_categories = driver.find_elements(by=By.CLASS_NAME, value='category-page__member-link')
book_categories[0].get_attribute('href')

'https://witcher.fandom.com/wiki/Category:Baptism_of_Fire_characters'

In [38]:
driver.get(book_categories[0].get_attribute('href'))


In [39]:
character_elems = driver.find_elements(by=By.CLASS_NAME, value='category-page__member-link')
character_elems

[<selenium.webdriver.remote.webelement.WebElement (session="1838789e9c834dd25c8bc549dd1eae44", element="e3899e29-aa3a-4f31-8db5-0e7258d739ec")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1838789e9c834dd25c8bc549dd1eae44", element="f7b25b8e-8922-4b5e-8b4b-89110c171d90")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1838789e9c834dd25c8bc549dd1eae44", element="1633968c-ca8f-4c1e-a3a8-9bcdbcf96a76")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1838789e9c834dd25c8bc549dd1eae44", element="933c9cc9-282e-48ac-9fa3-3effad610543")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1838789e9c834dd25c8bc549dd1eae44", element="c05026f8-8051-4c19-b12a-75f3a13ab4b3")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1838789e9c834dd25c8bc549dd1eae44", element="97e05e08-d460-4402-b4be-60949da327cf")>,
 <selenium.webdriver.remote.webelement.WebElement (session="1838789e9c834dd25c8bc549dd1eae44", element="c4a135ce-14b7-44ec-ac9d-07
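The raw `WebElement` objects printed above are not very readable on their own. The useful parts are each element's `.text` and its `href` attribute, the same two accessors the notebook uses elsewhere. Here is a minimal sketch of pulling those out into plain dictionaries. `FakeElement` is a hypothetical stand-in for a real `WebElement` so the snippet runs without a browser, and `elements_to_records` and the sample URL are likewise invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (text plus href)."""
    text: str
    href: str

    def get_attribute(self, name):
        # Only "href" is modelled here; anything else returns None.
        return self.href if name == "href" else None


def elements_to_records(elements):
    """Pull the visible text and link target out of each link element."""
    return [
        {"name": elem.text, "url": elem.get_attribute("href")}
        for elem in elements
    ]


# Made-up sample data; a real run would pass in `character_elems` instead.
sample = [
    FakeElement("Example Character", "https://example.com/wiki/Example_Character"),
]
records = elements_to_records(sample)
print(records)
```

With a live driver, `elements_to_records(character_elems)` should work the same way, since it relies only on `.text` and `get_attribute('href')`.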

# Full Code

In [53]:
import pandas as pd
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options


# Set up Chrome options
print("Setting options...")
chrome_options = Options()
chrome_options.add_argument("--headless") # Ensure GUI is off
chrome_options.add_argument("--no-sandbox")
print("Options set!")

# Set path to chromedriver as per your configuration
print("Preparing service...")
webdriver_service = Service("./chromedriver/stable/chromedriver")
print("Service ready!")

# Choose Chrome Browser
print("Creating driver...")
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
print("Driver created!")

# Get page
print("Getting page with characters...")
page_url = "https://witcher.fandom.com/wiki/Category:Characters_in_the_stories"
driver.get(page_url)
print("Page retrieved!")


# # Create driver
# print("Creating driver...")
# driver = webdriver.Chrome(ChromeDriverManager().install())
# print("Driver created!")

# # Go to the characters in books page
# print("Getting page with characters...")
# page_url = "https://witcher.fandom.com/wiki/Category:Characters_in_the_stories"
# driver.get(page_url)
# print("Page retrieved!")

# # Accept the cookies
# print("Accepting cookies...")
# time.sleep(3)
# driver.find_element(By.XPATH, '//div[text()="ACCEPT"]').click()
# print("Cookies accepted!")

# Find books
print('Building list of books...')
book_categories = driver.find_elements(by=By.CLASS_NAME, value='category-page__member-link')

books = []
for category in book_categories:
    book_url = category.get_attribute('href')
    book_name = category.text
    books.append({'book_name': book_name, 'url': book_url})
print('Books list complete!')

# Build character list
print('Building character list...')
character_list = []
for book in books:
    # go to book page
    driver.get(book['url'])

    # find links to characters by using class name
    character_elems = driver.find_elements(by=By.CLASS_NAME, value='category-page__member-link')

    for elem in character_elems:
        character_list.append({'book': book['book_name'], 'character': elem.text})
print('Character list complete!')

# Close the headless browser now that scraping is finished
driver.quit()


Setting options...
Options set!
Preparing service...
Service ready!
Creating driver...
Driver created!
Getting page with characters...
Page retrieved!
Building list of books...
Books list complete!
Building character list...
Character list complete!


In [56]:
# Convert character list to a pandas DataFrame for export to CSV and use in R
character_df = pd.DataFrame(character_list)
character_df.to_csv('characters.csv', index=False)
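Before handing the CSV over to R, a quick sanity check on the scraped records can catch obvious problems such as repeated rows or a book with no characters. The sketch below uses only the standard library and assumes records shaped like the entries in `character_list` (a `book` key and a `character` key). The helper names and the sample data are invented for illustration and are not part of the original walkthrough.

```python
from collections import Counter


def dedupe_records(records):
    """Drop repeated (book, character) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["book"], rec["character"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def characters_per_book(records):
    """Count how many character rows each book contributed."""
    return Counter(rec["book"] for rec in records)


# Made-up sample; a real run would pass in `character_list` instead.
sample = [
    {"book": "Book A", "character": "Alice"},
    {"book": "Book A", "character": "Bob"},
    {"book": "Book A", "character": "Alice"},
    {"book": "Book B", "character": "Alice"},
]
unique = dedupe_records(sample)
counts = characters_per_book(unique)
```

The same character appearing under two different books is kept on purpose, since that overlap is presumably what the later analysis in R is interested in.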