# Getting started

In [None]:
from selenium import webdriver
import pandas as pd
import numpy as np

from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

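To open Chrome through Selenium, create a driver with webdriver.Chrome(). In Selenium 4, the path to the ChromeDriver executable is passed through a Service object rather than as a positional argument. ChromeDriverManager().install() downloads a matching driver and returns that path.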

In [None]:
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

# Basics
To open a website, you can use the driver.get() method.

In [None]:
driver.get("https://en.wikipedia.org/wiki/Northwestern_University")

The webdriver allows you to interact with the browser window in various ways such as
+ Get window size

In [None]:
# driver.get_window_size()

+ Set window size

In [None]:
# driver.set_window_size(800, 600)

+ Get window position

In [None]:
# driver.get_window_position()

+ Set window position

In [None]:
# driver.set_window_position(100, 100)

+ Minimize window

In [None]:
#driver.minimize_window()

+ Maximize window

In [None]:
# driver.maximize_window()

+ Get the website title

In [None]:
# driver.title

+ Take a screenshot of the page

In [None]:
# driver.get_screenshot_as_file('./test.png')

For a complete list of properties and methods, see the WebDriver documentation: https://www.selenium.dev/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webdriver.html

To shut down the driver and close the browser, use the quit method.

In [None]:
# driver.quit()

### Headless mode
Headless mode runs the browser in the background without displaying a window. You can configure ChromeDriver to open the browser in headless mode with the following code.

In [None]:
# from selenium.webdriver.chrome.options import Options

# options = Options()
# options.add_argument('--headless=new')
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

In [None]:
# driver.quit()

For web scraping, we recommend opening a visible browser because it allows you to observe and inspect the website elements. Headless mode may be used after you are done with development.

# Interacting with website content

## Understanding HTML

HyperText Markup Language (HTML) is the language used to create the basic structure of a website. To retrieve information from a website, we need to be able to read and understand HTML to some extent.

#### HTML source code
The source code of a website lets you see the elements on the page as well as paths/links to other resources such as images or audio.

To get the HTML code of the website, you can right click on any blank space and select "View Page Source". You can also right click on a specific object and click "Inspect". This will direct you to where the object is located in the source code.

You can also load the page source in Python using the driver's page_source attribute.

In [None]:
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
# driver.get("https://en.wikipedia.org/wiki/Northwestern_University")

In [None]:
# psource = driver.page_source

#### HTML elements

There are several types of HTML elements and each is specified by a tag name. Below are some of the most common tags you will see (Credit: Aaron Geller). 
![alt text](HTMLTags.png "Examples of HTML tags")

Depending on your goals and the type of information you are looking for, you may encounter other tags such as ```<table>```, ```<video>```, ```<audio>```, ```<data>```, etc. 

#### HTML attributes

Attributes describe properties of an HTML tag, and there are different types of attributes. The most important ones for web scraping are "id", "class", and "name".
+ "id": specifies a unique ID; the value should be used only once in the HTML document.
+ "class": groups HTML tags that share a style; the value can be reused.
+ "name": used on form controls to label the data sent to the server; multiple elements may share the same name.
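To make these three attributes concrete, here is a minimal sketch that lists them for every tag in a small, made-up HTML snippet. It uses only Python's built-in html.parser module; the snippet and the `AttributeLister` class are illustrative assumptions, not part of the workshop site.

```python
from html.parser import HTMLParser

# Made-up HTML used only for illustration.
SNIPPET = """
<div id="roster" class="card">
  <a class="card link" href="/players/1">Player One</a>
  <input name="season" value="2023">
</div>
"""


class AttributeLister(HTMLParser):
    """Collect (tag, id, class, name) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        found = dict(attrs)
        self.rows.append(
            (tag, found.get("id"), found.get("class"), found.get("name"))
        )


lister = AttributeLister()
lister.feed(SNIPPET)
for row in lister.rows:
    print(row)
```

Note that the class value "card" appears on two tags, while the id "roster" appears only once.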

## Retrieving website content in Python

#### Scraping Rules

1. You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
2. Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.

(Source: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/)
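One way to follow the second rule is to pause between page loads. The sketch below is a hypothetical helper (the `Throttle` class is not part of Selenium) that enforces a minimum interval between calls. The clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraping loop you would create `throttle = Throttle(1.0)` once and call `throttle.wait()` right before each `driver.get(...)`.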

Selenium supports two methods for locating elements:
- find_element: returns the first element that matches a condition
- find_elements: returns a list of all matching elements
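The two methods also differ when nothing matches: find_element raises NoSuchElementException, while find_elements returns an empty list. The sketch below builds on the second behavior. `first_text` is a hypothetical helper, not a Selenium function; it only assumes that `driver` has a find_elements method and that elements have a text attribute.

```python
def first_text(driver, by, value, default=None):
    """Return the text of the first match, or `default` when nothing matches."""
    matches = driver.find_elements(by, value)  # empty list if nothing is found
    return matches[0].text if matches else default
```

With the `By` class imported in the next cell, a call could look like `first_text(driver, By.TAG_NAME, "h1")`.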

We can search elements by eight different attributes as indicated by the By class.

In [None]:
from selenium.webdriver.common.by import By

In [None]:
for k in dir(By):
    if not k.startswith("__"):
        print(k)

The general syntax is:

driver.find_elements(By.[__attribute__], [__attribute_value__])

| Attribute | Search condition | HTML Script Example | Usage |
|:--------:|:--------:|:--------:|:------------------------:|
| ID | HTML ID | ```<div id="myID">``` | (By.ID, "myID")|
| NAME | HTML name | ```<input name="myNAME">``` | (By.NAME, "myNAME")|
| CLASS_NAME | HTML class | ```<div class="myCLASS">``` | (By.CLASS_NAME, "myCLASS")|
| TAG_NAME | HTML tag | ```<h1>Welcome</h1>``` | (By.TAG_NAME, "h1")|
| LINK_TEXT | Text content of a link | ```<a href="something.html">myLINK</a>``` | (By.LINK_TEXT, "myLINK")|
| PARTIAL_LINK_TEXT | Substring of the text content of a link |  ```<a href="something.html">myLINK</a>``` | (By.PARTIAL_LINK_TEXT, "LINK")|
<!-- | XPATH | XPATH expression |  -->

The Selenium documentation does a good job of explaining how to make use of these attributes and can be found at https://selenium-python.readthedocs.io/locating-elements.html. For this workshop, we will go over the TAG_NAME, CLASS_NAME, XPATH, and LINK_TEXT attributes.

#### TAG_NAME

As the name suggests, you can search and obtain an element if you know its tag in the HTML source code.

In [None]:
# driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get("https://www.nfl.com/stats/player-stats/")

Exercise: Write code to retrieve the second header on the page

In [None]:
# header = 

In [None]:
# print(header)

To get the text content from a Selenium element, use its text attribute.

In [None]:
print(header[0].text)

You can also find the tag of an element using "Inspect" and use that to retrieve the data in Python.

In [None]:
table = driver.find_elements(By.TAG_NAME, "table")

In [None]:
# print(table[0].text)

#### CLASS_NAME

Exercise: Write code to retrieve the list of players

In [None]:
# players = 

In [None]:
# player_list = []
# for p in players:
#     player_list.append(p.text)
# print(player_list)

If you use the class name "d3-o-player-fullname nfl-o-cta--link", the code may return an empty list. Pay close attention to the associated HTML tag: this class belongs to an ```<a>``` tag, which marks a link. What you want is the class of the ```<div>``` element just above it.
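A value containing a space is also a problem in its own right, because By.CLASS_NAME expects a single class name. If you ever need to match an element that carries several classes at once, one option is a CSS selector with the classes chained by dots. The helper below is a hypothetical convenience function that builds such a selector from a tag and a class attribute string.

```python
def css_for_classes(tag, class_attribute):
    """Build a CSS selector for `tag` elements carrying every listed class."""
    # "a b" in the class attribute becomes ".a.b" in the selector.
    return tag + "".join("." + name for name in class_attribute.split())


selector = css_for_classes("a", "d3-o-player-fullname nfl-o-cta--link")
print(selector)
```

The resulting string can then be passed as `driver.find_elements(By.CSS_SELECTOR, selector)`.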

Exercise: Write code to retrieve the passing yards data

In [None]:
# yds = 

In [None]:
# yds_list = []
# for y in yds:
#     yds_list.append(y.text)
# print(yds_list)

#### XPATH

XPath is the language used for locating nodes in an XML document.

Some vocabulary
- node: a unit of the document tree; types include root, element, and attribute nodes
- parent: the immediate element containing the current one. E.g.: _html_ is the parent of _body_
- children: the immediate elements contained by the current one. E.g.: _head_ and _body_ are the children of _html_
- siblings: nodes on the same level as the current element. E.g.: _head_ and _body_ are siblings


Some syntax

| Syntax | Description | Example | Explanation |
|:--------:|:--------:|:------------------------:|:--------:|
| / | Select from the root node. Useful for writing an absolute path | /html/body/div[1]/form[1] | Select the first _form_ in the first _div_ of _body_ |
| // | Select matching nodes anywhere in the document | //form[1] | Select every _form_ that is the first _form_ child of its parent |
| tag name | Select all nodes with this tag name | form | Select all nodes with the _form_ tag |
| @ | Select an attribute | //input[@name="email"] or //input/@name | Select all _input_ tags whose name attribute is "email", or the name attribute of every _input_ tag |
| .. | Select the parent of a node | //form/.. | Select the parent tag of each _form_ node |

(Source: https://www.scrapingbee.com/blog/practical-xpath-for-web-scraping/)
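The syntax above can be tried offline before pointing it at a live page. The sketch below uses Python's built-in xml.etree.ElementTree, which supports only a limited subset of XPath (for example, paths must be relative, hence the leading dot), so treat it as practice for the notation rather than a stand-in for By.XPATH. The small document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed document used only for illustration.
PAGE = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div>
      <form id="login">
        <input name="email"/>
        <input name="password"/>
      </form>
    </div>
    <form id="search"><input name="q"/></form>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# // : every form, wherever it sits in the tree
all_forms = root.findall(".//form")
# @  : inputs whose name attribute equals "email"
email_inputs = root.findall(".//input[@name='email']")
# .. : the parent of each form
form_parents = root.findall(".//form/..")
# [1]: the first input child of each form
first_inputs = root.findall(".//form/input[1]")

print([f.get("id") for f in all_forms])
print([i.get("name") for i in email_inputs])
print(sorted(p.tag for p in form_parents))
print([i.get("name") for i in first_inputs])
```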

In [None]:
table = driver.find_elements(By.XPATH, '//table')
print(table[0].text)

Exercise: Write code to get the list of players using By.XPATH

In [None]:
# players = 

In [None]:
# for p in players:
#     print(p.text)

#### LINK_TEXT

In [None]:
player = driver.find_element(By.LINK_TEXT, "Tua Tagovailoa")

You can follow the link by using the click method.

In [None]:
player.click()

To navigate back to the previous page, use the back method.

In [None]:
driver.back()

## Bonus

#### WebDriverWait

It is possible that you may need to wait for a few (or more) seconds for the site to load all necessary content. It is important to account for this, especially if you want to run your code in headless mode. WebDriverWait waits for specific conditions before proceeding and is often accompanied by some expected conditions. Some of them are

+ presence_of_element_located: an HTML element is present.
+ visibility_of: an element becomes visible.
+ element_to_be_clickable: an HTML element is clickable.

More details can be found in the Selenium documentation https://selenium-python.readthedocs.io/waits.html.
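Conceptually, an explicit wait is a polling loop: check a condition, and if it is not yet satisfied, sleep briefly and try again until a deadline passes. The sketch below illustrates that idea with plain Python. It is not Selenium's implementation; `wait_until` and `WaitTimeout` are hypothetical names, and details such as ignoring "element not found" errors while polling are omitted.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy before the deadline."""


def wait_until(condition, timeout=10.0, poll=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call `condition()` every `poll` seconds until it returns a truthy value."""
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the condition produced
        if clock() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        sleep(poll)
```

WebDriverWait(driver, 10).until(...) follows the same shape: the 10 is the timeout, and the expected condition plays the role of `condition`.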

In [None]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [None]:
player = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Tua Tagovailoa")))

In [None]:
player.click()

In [None]:
driver.back()

Suppose you wait for an element that is not present on the page. Once the timeout expires, WebDriverWait raises a TimeoutException.

In [None]:
player = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Travis Kelce")))

In [None]:
from selenium.common.exceptions import TimeoutException

In [None]:
try:
    player = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Travis Kelce")))
except TimeoutException:
    player = []

In [None]:
player

Exercise: Write code to retrieve the link and navigate to the next page

In [None]:
# next = 

In [None]:
# next.click()

In [None]:
# driver.back()

If you encounter an ElementClickInterceptedException, another element is probably blocking the one you want to click. A quick fix is to scroll until the element is visible. You can do that interactively or with Python code.

The execute_script method executes JavaScript in the current window and thus helps us interact with a website.

#### Scrolling

In [None]:
# Get scroll height
driver.execute_script("return document.documentElement.scrollHeight")

In [None]:
# From top to end of page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

In [None]:
# From end to top of page
driver.execute_script("window.scrollTo(0, 0);")

We can guess the vertical scroll position of the Next Page element and try it out.

In [None]:
driver.execute_script("window.scrollTo(0, 1000);")
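Instead of guessing a pixel offset, you can ask the browser to scroll a specific element into view. The sketch below wraps the standard JavaScript scrollIntoView call in a hypothetical helper; it assumes only that `driver` has an execute_script method and that `element` was located beforehand.

```python
def scroll_into_view(driver, element):
    """Ask the browser to scroll until `element` sits in the middle of the viewport."""
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )
```

Centering the element rather than aligning it to the top edge can help when a fixed banner covers the top of the page, although whether that is enough depends on the site's layout.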

In [None]:
next.click()

In [None]:
driver.back()

Another workaround is to click through execute_script. Unlike the previous method, which uses the WebDriver element's click, this method executes the click via JavaScript.

In [None]:
driver.execute_script("arguments[0].click();", next)

In [None]:
driver.back()

A StaleElementReferenceException is another common error in Selenium. An element becomes stale when the page refreshes or the DOM updates after the element was located, and Selenium raises this exception when you reference it. The JavaScript click still executes, so we can add a safeguard for this exception, similar to what we did for TimeoutException.

In [None]:
from selenium.common.exceptions import StaleElementReferenceException

In [None]:
try:
    next_link = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.LINK_TEXT, "Next Page")))
    driver.execute_script("arguments[0].click();", next_link)
except TimeoutException:
    next_link = []
    print("timeout")
except StaleElementReferenceException:
    next_link = []
    print("stale")
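Because a stale reference goes away once the element is located again, another common pattern is to retry the whole locate-and-click step a few times. The sketch below is a generic, hypothetical helper (`retry_on` is not a Selenium function) that retries a callable when a given exception type is raised.

```python
def retry_on(exceptions, action, attempts=3):
    """Call `action()` until it succeeds, retrying on `exceptions`.

    The exception is re-raised if the last attempt also fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
```

In this notebook, `exceptions` would be StaleElementReferenceException and `action` a small function that both finds the "Next Page" link and clicks it, so each retry works with a fresh element.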

In [None]:
driver.back()

#### Putting everything together
Why is this useful? Suppose you want to gather the entire list of players along with their passing yards. We need to iteratively navigate to the next page.

In [None]:
# Retrieving and saving list of players and their passing yards on the first page
players = driver.find_elements(By.CLASS_NAME, "d3-o-media-object__body")
players_list = [p.text for p in players]
yds = driver.find_elements(By.CLASS_NAME, "selected")
yds_list = [y.text for y in yds]

In [None]:
while True:
    try:
        next_link = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Next Page")))
        driver.execute_script("arguments[0].click();", next_link)
        players = driver.find_elements(By.CLASS_NAME, "d3-o-media-object__body")
        players_list = players_list + [p.text for p in players]
        yds = driver.find_elements(By.CLASS_NAME, "selected")
        yds_list = yds_list + [y.text for y in yds]
    except TimeoutException:
        players = driver.find_elements(By.CLASS_NAME, "d3-o-media-object__body")
        players_list = players_list + [p.text for p in players]
        yds = driver.find_elements(By.CLASS_NAME, "selected")
        yds_list = yds_list + [y.text for y in yds]
        print("timeout")
        break
    except StaleElementReferenceException:
        continue
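The loop above repeats the same scraping lines in two branches. As a sketch of one way to separate the concerns, the hypothetical function below takes two callables: `read_page` returns the rows on the current page, and `go_to_next` moves to the next page and returns False when there is none. The `max_pages` cap is an added assumption that guards against looping forever.

```python
def scrape_all_pages(read_page, go_to_next, max_pages=100):
    """Collect rows from the current page and every following page."""
    rows = []
    for _ in range(max_pages):
        rows.extend(read_page())   # scrape the page we are on
        if not go_to_next():       # stop when there is no next page
            break
    return rows
```

Here `read_page` would wrap the find_elements calls, and `go_to_next` would wrap the WebDriverWait and click, returning False when a TimeoutException is caught.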

In [None]:
df = pd.DataFrame()
df["player"] = players_list
df["yds"] = yds_list
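Assigning the two lists as columns fails with a ValueError if they ended up with different lengths, which can happen when a page is scraped before it finishes loading. A minimal sketch of an explicit check is shown below; `pair_columns` is a hypothetical helper that returns a list of dictionaries, a form that pd.DataFrame accepts directly.

```python
def pair_columns(players, yds):
    """Pair each player with a yards value, refusing mismatched lengths."""
    if len(players) != len(yds):
        raise ValueError(
            f"{len(players)} players but {len(yds)} yards values"
        )
    return [{"player": p, "yds": y} for p, y in zip(players, yds)]
```

For example, `pd.DataFrame(pair_columns(players_list, yds_list))` would build the same table, with a clearer error message if the scrape went wrong.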

In [None]:
df

In [None]:
driver.quit()

### Practice

1. Retrieve other player stats for the year 2023.
2. Repeat the process for previous years back to 2020.