# Web scraping with Selenium in Python

Selenium for Python module docs: https://selenium-python.readthedocs.io/

## Install the Selenium module
```
pip install selenium
```

## Import the Selenium module

```Python
from selenium import webdriver

# We will probably also need:
from selenium.webdriver.common.by import By         # To define search criteria
from selenium.webdriver.common.keys import Keys     # To send special keypresses like arrows
```

## Browser driver

Selenium needs a browser driver to communicate with the browser. The driver must match both the browser you use and its version.

Download browser driver for Chrome from: https://sites.google.com/chromium.org/driver/

Other options:
- MS Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
- Firefox: https://github.com/mozilla/geckodriver/releases
- Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Make sure the driver version you download matches the browser version installed on your system.

Place the downloaded driver in your project's folder, extracting it from its archive first if necessary.
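
Before starting the browser, it can help to confirm that Python can actually see the driver executable. The sketch below is one way to check this with the standard library only. `locate_driver` is a hypothetical helper name, and the default file name `chromedriver` is an assumption; adjust it for other drivers.

```python
import os
import shutil
from pathlib import Path


def locate_driver(name="chromedriver", project_dir="."):
    """Return the full path of the driver executable, or None if it is not found.

    The project folder is searched first, then every directory on PATH.
    """
    search_dirs = [str(Path(project_dir).resolve())]
    if os.environ.get("PATH"):
        search_dirs.append(os.environ["PATH"])
    # shutil.which() only reports files that exist and are executable.
    return shutil.which(name, path=os.pathsep.join(search_dirs))


path = locate_driver()
print(path if path else "chromedriver not found in the project folder or on PATH")
```

If the function returns `None`, the driver is either missing, in another folder, or not marked as executable.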

## Start the browser/driver

First, the browser driver is activated:
```Python
driver = webdriver.Chrome()
```

For other browsers:
- Firefox: `webdriver.Firefox()`
- Edge: `webdriver.Edge()`
- Safari: `webdriver.Safari()`

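If Selenium cannot find the driver by itself, pass the path to the driver through a `Service` object. The older form `webdriver.Chrome('./chromedriver')` is no longer accepted by recent Selenium 4 releases:
```Python
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('./chromedriver'))
```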

## Basic use

```Python
# Start the browser
driver = webdriver.Chrome()

# Open a web page
driver.get('https://google.com')
```

```Python
# Google shows a user agreement. Find the "Accept" button on the page.
# Old method (removed in Selenium 4): button = driver.find_element_by_xpath("//button[@id='L2AGLb']")
# New method: button = driver.find_element(by=By.XPATH, value="//button[@id='L2AGLb']")
# Or, even easier:
button = driver.find_element(By.ID, 'L2AGLb')

# Click the button
button.click()
```


Two functions/methods are used to locate HTML elements:
- `driver.find_element(by, value)`    # Returns only one element
- `driver.find_elements(by, value)`   # Returns a list of elements (even if there is only one result)
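
One practical difference between the two: when nothing matches, `find_element` raises `NoSuchElementException`, whereas `find_elements` returns an empty list. A minimal sketch of a helper built on that behaviour follows. `find_or_none` is a hypothetical name, not part of Selenium, and it accepts any object that exposes `find_elements` (a driver or an element).

```python
def find_or_none(context, by, value):
    """Return the first matching element, or None when nothing matches.

    `context` may be a driver or an element; both expose find_elements().
    """
    matches = context.find_elements(by, value)
    return matches[0] if matches else None
```

With a running driver this could be called as `button = find_or_none(driver, By.ID, 'L2AGLb')`, followed by `if button is not None: button.click()`, so the script keeps going when the consent dialog is not shown.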

The search can be based on different criteria (tag name, class, text, ...), selected with the `by` parameter:
- `By.ID`: Match the element's `id` attribute
- `By.TAG_NAME`: Find HTML elements by their tag name (a, p, h1, button, img, ...)
- `By.CLASS_NAME`: Find HTML elements by their class
- `By.NAME`: Match the element's `name` attribute
- `By.XPATH`: Search with an XPath expression (a query language originally designed for XML)
- `By.LINK_TEXT`: Find `<a>` elements whose text matches exactly
- `By.PARTIAL_LINK_TEXT`: Find `<a>` elements whose text contains the given string
- `By.CSS_SELECTOR`: Find HTML elements with a CSS selector

See https://selenium-python.readthedocs.io/locating-elements.html for more information on locating elements.

```Python
# Locate an element by its name attribute; here, Google's search box.
search_box = driver.find_element(By.NAME, 'q')

# Write a search term
search_box.send_keys('ChromeDriver')

# And submit the search
search_box.submit()
```

Note that `submit()` loads a new page, so any object found on the old page (like `search_box`) stops working; using it raises `StaleElementReferenceException`. Elements must be located again on the new page.
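
One way to avoid holding on to outdated objects is to store the locator rather than the element, and to look the element up again each time it is needed. The sketch below assumes that approach; `SEARCH_BOX` and `type_into` are hypothetical names. The strategy is written as the plain string `'name'`, which is the value behind `By.NAME`, so the snippet needs no imports of its own.

```python
# A locator is a (strategy, value) pair; 'name' is the string behind By.NAME.
SEARCH_BOX = ("name", "q")


def type_into(driver, locator, text):
    """Look the element up on the current page, then type into it."""
    element = driver.find_element(*locator)  # fresh lookup on every call
    element.clear()
    element.send_keys(text)
    return element
```

After a search has been submitted, `type_into(driver, SEARCH_BOX, 'geckodriver')` finds the search box on the new page instead of reusing the old object.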

See below to get a list of properties and methods for elements (a.k.a. what can I do with an element?)

```Python
# Close the browser and shut down the driver
# (close() would only close the current window)
driver.quit()
```
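
If the script raises an exception before it reaches its last line, the browser window stays open. A minimal sketch of a context manager that always shuts the browser down is shown here. `browser_session` is a hypothetical helper, and it assumes that the factory it receives (for example `webdriver.Chrome`) returns an object with a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(factory):
    """Create a driver with factory() and guarantee that quit() is called."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the body of the `with` block raises
```

Assuming Chrome, usage could look like `with browser_session(webdriver.Chrome) as driver:` followed by an indented `driver.get('https://google.com')`.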

Another example:

```Python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Chrome()
driver.get('http://www.python.org')

# Stop execution if 'Python' is not in the page's title
assert 'Python' in driver.title

# Find the search box
elem = driver.find_element(By.NAME, 'q')

# Clear it (delete its content, if any)
elem.clear()

# Write a search term
elem.send_keys('pycon')

# And press ENTER to submit the search
elem.send_keys(Keys.RETURN)

# Stop execution if the text 'No results found.' appears in the page source
assert 'No results found.' not in driver.page_source

# Wait a moment before closing everything
sleep(5)

# Close the browser and shut down the driver
driver.quit()
```
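
The example above relies on timing: the check on the page source runs right after ENTER is pressed, and the final pause is a fixed five seconds. Selenium ships `WebDriverWait` for waiting on a condition; the sketch below is not that class, only a standard-library illustration of the same idea. `wait_until` is a hypothetical helper, and its default timeout and polling interval are arbitrary.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

For instance, if `start_url = driver.current_url` is saved before pressing ENTER, then `wait_until(lambda: driver.current_url != start_url)` returns as soon as the browser has navigated away, instead of sleeping for a fixed time.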


## Driver object methods

- `add_cookie()`: Adds a cookie to your current session.
- `back()`: Goes one step backward in the browser history.
- `close()`: Closes the current window.
- `create_web_element()`: Creates a web element with the specified `element_id`.
- `delete_all_cookies()`: Delete all cookies in the scope of the session.
- `delete_cookie()`: Deletes a single cookie with the given name.
- `delete_network_conditions()`: Resets Chromium network emulation settings.
- `execute()`: Sends a command to be executed by a `command.CommandExecutor`.
- `execute_async_script()`: Asynchronously Executes JavaScript in the current window/frame.
- `execute_cdp_cmd()`: Execute Chrome Devtools Protocol command and get returned result
- `execute_script()`: Synchronously Executes JavaScript in the current window/frame.
- `file_detector_context()`: Overrides the current file detector (if necessary) in limited context.
- `find_element()`: Find an element given a By strategy and locator.
- `find_elements()`: Find elements given a By strategy and locator.
- `forward()`: Goes one step forward in the browser history.
- `fullscreen_window()`: Invokes the window manager-specific 'full screen' operation
- `get()`: Loads a web page in the current browser session.
- `get_cookie()`: Get a single cookie by name. Returns the cookie if found, None if not.
- `get_cookies()`: Returns a set of dictionaries, corresponding to cookies visible in the current session.
- `get_issue_message()`: Returns an error message when there is any issue in a Cast session.
- `get_log()`: Gets the log for a given log type
- `get_network_conditions()`: Gets Chromium network emulation settings.
- `get_screenshot_as_base64()`: Gets the screenshot of the current window as a base64 encoded string
- `get_screenshot_as_file()`: Saves a screenshot of the current window to a PNG image file.
- `get_screenshot_as_png()`: Gets the screenshot of the current window as a binary data.
- `get_sinks()`: Returns a list of sinks available for Cast.
- `get_window_position()`: Gets the x,y position of the current window.
- `get_window_rect()`: Gets the x, y coordinates of the window as well as the height and width of the current window.
- `get_window_size()`: Gets the width and height of the current window.
- `implicitly_wait()`: Sets a sticky timeout to implicitly wait for an element to be found, or a command to complete.
- `launch_app()`: Launches Chromium app specified by id.
- `maximize_window()`: Maximizes the current window that webdriver is using
- `minimize_window()`: Invokes the window manager-specific 'minimize' operation
- `print_page()`: Takes PDF of the current page.
- `quit()`: Closes the browser and shuts down the ChromiumDriver executable
- `refresh()`: Refreshes the current page.
- `save_screenshot()`: Saves a screenshot of the current window to a PNG image file. Returns `False` if there is any `IOError`, else returns `True`.
- `set_network_conditions()`: Sets Chromium network emulation settings.
- `set_page_load_timeout()`: Set the amount of time to wait for a page load to complete
- `set_permissions()`: Sets Applicable Permission.
- `set_script_timeout()`: Set the amount of time that the script should wait during an `execute_async_script()` call before throwing an error.
- `set_sink_to_use()`: Sets a specific sink, using its name, as a Cast session receiver target.
- `set_window_position()`: Sets the x,y position of the current window. (window.moveTo)
- `set_window_rect()`: Sets the x, y coordinates of the window as well as the height and width of the current window.
- `set_window_size()`: Sets the width and height of the current window. (window.resizeTo)
- `start_client()`: Called before starting a new session. This method may be overridden to define custom startup behavior.
- `start_session()`: Creates a new session with the desired capabilities.
- `start_tab_mirroring()`: Starts a tab mirroring session on a specific receiver target.
- `stop_casting()`: Stops the existing Cast session on a specific receiver target.
- `stop_client()`: Called after executing a quit command. This method may be overridden to define custom shutdown behavior.
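
As an illustration of `execute_script()`, pages that load more content while scrolling can be scrolled from Python by running small JavaScript snippets. This is a sketch under assumptions: `scroll_to_end` is a hypothetical helper, the page is assumed to grow `document.body.scrollHeight` when new content arrives, and the pause length is a guess that depends on the site.

```python
import time


def scroll_to_end(driver, max_rounds=10, pause=1.0):
    """Scroll down until the page height stops growing; return the final height."""
    height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == height:
            break  # nothing new was loaded, so the end was reached
        height = new_height
    return height
```

`max_rounds` keeps the loop from running forever on pages that never stop growing.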

## Driver object properties

- `capabilities` or `caps`: Browser capabilities (dict)
- `current_url`: Current URL on the browser (str)
- `name`: Driver name (str)
- `page_source`: Page source (str)
- `session_id`: Current session ID (str)
- `title`: The title of the current page (str)
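
These properties are plain attributes, so they are read without parentheses. A small sketch that collects a few of them into a dictionary, for example to log which page the scraper is on; `page_summary` is a hypothetical helper name.

```python
def page_summary(driver):
    """Collect a few driver properties describing the current page."""
    return {
        "browser": driver.name,
        "title": driver.title,
        "url": driver.current_url,
        "html_length": len(driver.page_source),
    }
```

Calling `print(page_summary(driver))` after `driver.get(...)` gives a quick check that the expected page was loaded.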

## HTML element object methods

- `clear()`: Clears the text if it's a text entry element.
- `click()`: Clicks the element.
- `find_element()`: Find an element given a By strategy and locator.
- `find_elements()`: Find elements given a By strategy and locator.
- `get_attribute()`: Get an element's attribute value.
- `get_dom_attribute()`: Gets the given attribute of the element. Unlike `get_attribute()`, this method only returns attributes declared in the element's HTML markup.
- `get_property()`: Gets the given property of the element.
- `is_displayed()`: Whether the element is visible to a user.
- `is_enabled()`: Returns whether the element is enabled.
- `is_selected()`: Whether the element is selected at this moment.
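- `location_once_scrolled_into_view`: Scrolls the element into view and returns its coordinates (dict). This one is a property, not a method, so it is read without parentheses.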
- `screenshot()`: Saves a screenshot of the current element to a PNG image file.
- `send_keys()`: Sends keypresses to the element. Special keys can be simulated with the `Keys` class (e.g. `Keys.ARROW_DOWN`). See https://www.selenium.dev/selenium/docs/api/py/webdriver/selenium.webdriver.common.keys.html for a list of supported keys.
- `submit()`: Submits a form.
- `value_of_css_property()`: The value of a CSS property.
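
A typical scraping task combines several of these methods. The sketch below gathers the visible links inside a driver or element; `extract_links` is a hypothetical helper. The default strategy is the plain string `'css selector'`, which is the value behind `By.CSS_SELECTOR`, so the snippet needs no imports; passing `By.CSS_SELECTOR` works the same way.

```python
def extract_links(context, by="css selector", value="a"):
    """Return (text, href) pairs for the visible links found in `context`.

    `context` may be the driver (whole page) or an element (part of the page).
    """
    links = []
    for anchor in context.find_elements(by, value):
        href = anchor.get_attribute("href")
        # Skip anchors without a target and anchors the user cannot see.
        if href and anchor.is_displayed():
            links.append((anchor.text.strip(), href))
    return links
```

Passing an element instead of the driver, for example a navigation bar found earlier, limits the search to that part of the page.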

## HTML element object properties

- `aria_role`: Element's ARIA role (str)
- `accessible_name`: Element's accessible name or ARIA label (str)
- `id`: Internal ID that Selenium uses for the element (str). This is not the HTML `id` attribute; read that one with `get_attribute('id')`
- `location`: Element coordinates (dict)
- `parent`: The WebDriver instance the element was found from (object). This is not the parent HTML element
- `rect`: Element size and location (dict)
- `screenshot_as_base64`: Element image as a base64-encoded string (str)
- `screenshot_as_png`: Element image as bytes (bytes)
- `size`: Element size (dict)
- `tag_name`: Element's tag name (str)
- `text`: Element's text (str)

To get an element's HTML use `get_attribute('outerHTML')`.
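
Put together, a scraped element can be turned into a plain dictionary that is easy to store. A minimal sketch, where `element_record` is a hypothetical helper and the choice of fields is only an example:

```python
import json


def element_record(element):
    """Return a JSON-friendly description of an element."""
    return {
        "tag": element.tag_name,
        "text": element.text,
        "html": element.get_attribute("outerHTML"),
    }


def records_to_json(elements):
    """Serialize a list of elements as a JSON string."""
    return json.dumps([element_record(e) for e in elements], indent=2)
```

For example, `records_to_json(driver.find_elements(By.TAG_NAME, 'h2'))` would produce one entry per heading on the page.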