# Dealing with Dynamic Websites

### ISM6564

**Week01, Part04b**

&copy; 2023 Dr. Tim Smith

<a target="_blank" href="https://colab.research.google.com/github/prof-tcsmith/ta-f23/blob/main/W02/W02.4b-selenium.ipynb#offline=1">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

---

## Learning Objectives

Some websites use JavaScript to load content dynamically. In such cases, a simple GET request may not retrieve all the content you see when browsing the site manually. Selenium is a popular library for handling dynamic content because it drives a real browser that executes the page's JavaScript.

In this notebook, we will use Selenium to scrape a dynamic website. You will learn how to install Selenium, a WebDriver, and WebDriver Manager. You will also learn how to use Selenium to scrape a website that relies on JavaScript and could not be scraped with the requests library alone.

Follow along with this notebook as you watch the associated video posted on the course Canvas site.
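To see why a plain GET request can miss content, here is a minimal sketch that uses only the standard library. `RAW_HTML` is a made-up page (an assumption for illustration, not taken from any real site) standing in for what a server might send: the list is empty in the markup, and a script would fill it in only once a browser runs it. Counting the `<li>` tags in the raw markup shows what a non-browser client would see.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the HTML the server sends,
# and the script adds an item only after a browser executes it.
RAW_HTML = """
<html><body>
  <ul id="items"></ul>
  <script>
    var li = document.createElement("li");
    li.textContent = "Loaded by JavaScript";
    document.getElementById("items").appendChild(li);
  </script>
</body></html>
"""

class ListItemCounter(HTMLParser):
    """Counts the <li> start tags present in the markup itself."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1

counter = ListItemCounter()
counter.feed(RAW_HTML)
print(counter.li_count)  # prints 0: the parser never runs the script
```

A browser would show one list item for this page, while the raw markup contains none. That gap is what Selenium closes by letting a real browser render the page first.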

#### Troubleshooting File Locations and Paths

The following two cells can help you find your notebook's default (working) directory and display Python's module search path (`sys.path`). Note that `sys.path` is where Python looks for modules to import; it is not the same as the operating system's PATH environment variable.

You don't need to run these cells in your own notebooks, but they can help when you are troubleshooting file-not-found issues.

In [1]:
import os
os.getcwd()

'c:\\Users\\veera\\OneDrive\\Desktop\\Text_Analytics\\W02'

In [2]:
import sys
sys.path

['c:\\Users\\veera\\OneDrive\\Desktop\\Text_Analytics\\W02',
 'c:\\Users\\veera\\AppData\\Local\\Programs\\Python\\Python311\\python311.zip',
 'c:\\Users\\veera\\AppData\\Local\\Programs\\Python\\Python311\\DLLs',
 'c:\\Users\\veera\\AppData\\Local\\Programs\\Python\\Python311\\Lib',
 'c:\\Users\\veera\\AppData\\Local\\Programs\\Python\\Python311',
 '',
 'C:\\Users\\veera\\AppData\\Roaming\\Python\\Python311\\site-packages',
 'C:\\Users\\veera\\AppData\\Roaming\\Python\\Python311\\site-packages\\win32',
 'C:\\Users\\veera\\AppData\\Roaming\\Python\\Python311\\site-packages\\win32\\lib',
 'C:\\Users\\veera\\AppData\\Roaming\\Python\\Python311\\site-packages\\Pythonwin',
 'c:\\Users\\veera\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages']
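If the problem is that a driver executable cannot be found, it can also help to look at the operating system's PATH environment variable, which is separate from `sys.path`. The sketch below uses only the standard library; the helper name `find_on_path` is made up for this example, and `geckodriver` is the driver name assumed because the notebook uses Firefox.

```python
import os
import shutil

def find_on_path(executable):
    """Return the full path of `executable` if the OS can find it on PATH, else None."""
    return shutil.which(executable)

# PATH is a single string of folders separated by os.pathsep (";" on Windows, ":" elsewhere)
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
print(f"{len(path_dirs)} folders on PATH")

location = find_on_path("geckodriver")
print(location if location else "geckodriver was not found on PATH")
```

If the second line reports that the driver was not found, either add its folder to PATH or let WebDriver Manager download it for you.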

In [6]:
pip install webdriver-manager


Collecting webdriver-manager
  Downloading webdriver_manager-4.0.0-py2.py3-none-any.whl (27 kB)
Collecting python-dotenv (from webdriver-manager)
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv, webdriver-manager
Successfully installed python-dotenv-1.0.0 webdriver-manager-4.0.0
Note: you may need to restart the kernel to use updated packages.





### Using Selenium to Scrape from a Dynamic Site

In [1]:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager 

# For more information on WebDriver Manager, see https://pypi.org/project/webdriver-manager/
# Note 1: to install webdriver_manager from the command line, use the name webdriver-manager (hyphen, not underscore).
# Note 2: you will need Firefox installed on your computer for this to work. (You are welcome to try Chrome
# instead, but you will need to download the Chrome driver and change the code below to use it.)

# When loading the driver, use one of the options below. The first downloads the driver for you; the others
# use a driver you downloaded and placed in the same folder as your notebook or in a folder on your PATH.
# NOTE (Linux only): the first option will work once, but then you will need to comment it out and use another option OR
# run ps -aux, find the running geckodriver process, and kill it.

# If you have Selenium 3 installed, use one of these:
#driver = webdriver.Firefox(executable_path=GeckoDriverManager().install()) # works on Windows and Mac, and should work on Linux the first time it is run
#driver = webdriver.Firefox(executable_path=<insert path to manually downloaded geckodriver>)
driver = webdriver.Firefox() # use if geckodriver is on your PATH environment variable (which includes the same folder as your notebook)

# If you have Selenium 4 installed, use one of these:
#driver = webdriver.Firefox(service=Service(GeckoDriverManager().install())) # works on Windows and Mac, and should work on Linux the first time it is run
#driver = webdriver.Firefox() # use if geckodriver is on your PATH (Selenium 4.6 and later can also locate or download the driver itself)

url = "https://www.britannica.com/topic/Presidents-of-the-United-States-1846696"
driver.implicitly_wait(10) # maximum time (in seconds) to keep looking for an element that is not yet present
driver.get(url) # by default, get() returns once the browser reports that the page has loaded

html = driver.page_source # the page's HTML as currently rendered by the browser
driver.quit() # close the browser and stop the geckodriver process
html[:500] # display the first 500 characters of the rendered HTML
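Once the rendered HTML is available as a string, it still has to be parsed to pull out the data. The following is a minimal sketch using only the standard library. `SAMPLE_HTML` is a made-up stand-in for the string returned by `driver.page_source`; its table markup is an assumption for illustration and does not reflect how the Britannica page is actually structured. The class name `TableRowParser` is also just for this example.

```python
from html.parser import HTMLParser

# Stand-in for the string returned by driver.page_source.
# This markup is hypothetical; the real page is structured differently.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td>George Washington</td></tr>
  <tr><td>2</td><td>John Adams</td></tr>
</table>
"""

class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # the row currently being built, if any
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)  # prints [['1', 'George Washington'], ['2', 'John Adams']]
```

For a real page, inspect the rendered HTML first to find which tags and attributes hold the data you want, then adapt the tag checks accordingly.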