# Data Extraction with Selenium
This tutorial shows how to use Selenium to drive a web browser and extract data from web pages. See https://selenium-python.readthedocs.io for more details.

## Installation
Before using Selenium, we have to install a WebDriver for the browser of our choice, such as Chrome or Firefox. Selenium needs to locate the driver executable in order to start a browser. For Chrome, the simplest way is to install the Python helper package chromedriver_autoinstaller, which downloads a ChromeDriver matching the installed Chrome version and adds it to the PATH.

        pip install chromedriver_autoinstaller

We also have to install the selenium package.

        pip install selenium

In [None]:
from selenium import webdriver
import chromedriver_autoinstaller
import time
import os

In [None]:
chromedriver_autoinstaller.install()

In [None]:
browser = webdriver.Chrome()
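
The browser can also be started with custom options, for example without a visible window. The following is a minimal sketch and not part of the original notebook: `chrome_arguments` and `start_chrome` are hypothetical helper names, and the `--headless=new` flag assumes a reasonably recent Chrome version.

```python
def chrome_arguments(headless=True, window_size=(1280, 800)):
    """Build a list of Chrome command-line flags (hypothetical helper)."""
    width, height = window_size
    args = [f"--window-size={width},{height}"]
    if headless:
        # Assumption: a recent Chrome that understands the "new" headless mode.
        args.append("--headless=new")
    return args


def start_chrome(headless=True):
    """Start Chrome with the flags above; needs Selenium and a driver installed."""
    # Imported here so that chrome_arguments() can be tried without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


print(chrome_arguments())
```

With this in place, `browser = start_chrome(headless=False)` would behave much like the cell above, but with a fixed window size.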

## Browsing a webpage
Once the browser starts, we can tell it to visit a webpage.

In [None]:
url = 'https://www.google.com'

In [None]:
browser.get(url=url)

In [None]:
html = browser.execute_script("return document.documentElement.outerHTML")
html[:3000]
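
Once the HTML is available as a string, it can be processed without the browser. As a small illustration, the sketch below pulls the page title out of an HTML string using only the standard library. The names `TitleParser` and `extract_title` and the sample string are made up for this example; in the notebook, the `html` variable from the cell above would be passed instead.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html_text):
    """Return the stripped title text of an HTML document ('' if none)."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()


# Made-up sample document, standing in for the HTML returned by the browser.
sample_html = "<html><head><title> Example page </title></head><body><p>Hi</p></body></html>"
print(extract_title(sample_html))
```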

## Interact with a webpage
Once the page is loaded, we can interact with any element on it. In this example, we search Google for a keyword: we locate the search box and then send keystrokes to it.

In [None]:
from selenium.webdriver.common.by import By

Find the element whose name attribute is 'q'. Note that this is a textarea element.

In [None]:
q_element = browser.find_element(By.CSS_SELECTOR, '[name=q]')

In [None]:
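q_element.clear()
# Type the search keyword ('ประเทศไทย' is Thai for "Thailand").
q_element.send_keys('ประเทศไทย')
# '\ue007' is the WebDriver key code for Enter (selenium's Keys.ENTER).
q_element.send_keys('\ue007')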
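
After Enter is sent, the results page loads asynchronously, so elements may not exist yet when the next cell runs. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for this purpose. The sketch below is a plain-Python version of the same polling idea, not Selenium's own implementation: `wait_for_elements` is a hypothetical helper and `SlowPage` is a stand-in driver, so the block runs without a browser.

```python
import time

CSS_SELECTOR = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def wait_for_elements(driver, selector, timeout=10.0, poll=0.5):
    """Poll driver.find_elements until it returns a match or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements(CSS_SELECTOR, selector)
        if elements:
            return elements
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {selector!r} within {timeout} s")
        time.sleep(poll)


class SlowPage:
    """Stand-in driver whose elements appear only on the third lookup."""

    def __init__(self):
        self.calls = 0

    def find_elements(self, by, value):
        self.calls += 1
        return ["result"] if self.calls >= 3 else []


page = SlowPage()
found = wait_for_elements(page, "#search a", timeout=2.0, poll=0.01)
print(found, page.calls)
```

With a real session, `wait_for_elements(browser, '#search a')` could be called before reading the search results.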

## Navigate the webpage
We can navigate the elements of the current webpage, similar to Beautiful Soup. Selenium supports several ways to locate elements, including CSS selectors, XPath expressions, ids, and tag names.

Google puts search results as 'a' elements inside a 'div' with id='search'. Thus, the CSS selector '#search a' matches every 'a' element nested inside the element with id='search'.

In [None]:
all_link = browser.find_elements(By.CSS_SELECTOR, '#search a')

Check all results that we found.

In [None]:
for link in all_link:
    print('[link text]', link.text)
    print('[link href]', link.get_attribute('href'))
    print('---')
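
Instead of only printing, the results can be collected into a list of records for later processing. The sketch below is one possible way to do this: `collect_links` is a hypothetical helper that relies only on the `.text` attribute and `get_attribute('href')` used above, and `FakeLink` is a stand-in for a Selenium element so that the block runs without a browser.

```python
def collect_links(elements):
    """Return unique {'text', 'href'} records for elements that have an href."""
    seen = set()
    records = []
    for element in elements:
        href = element.get_attribute("href")
        if not href or href in seen:
            continue  # skip anchors without a target and repeated targets
        seen.add(href)
        records.append({"text": element.text.strip(), "href": href})
    return records


class FakeLink:
    """Stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text, href):
        self.text = text
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


sample = [
    FakeLink("Example", "https://example.com/"),
    FakeLink("", None),
    FakeLink("Example again", "https://example.com/"),
]
print(collect_links(sample))
```

In the notebook, `collect_links(all_link)` would produce the same kind of list from the real search results.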

Click the first result.

In [None]:
all_link[0].click()

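Assuming the first result is a Wikipedia page, we find all of its headlines, which are 'span' elements whose class starts with 'mw-headline'.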

In [None]:
all_headlines = browser.find_elements(By.CSS_SELECTOR, 'span[class^="mw-headline"]')

List all headlines, together with their class, id, and parent tag.

In [None]:
for h in all_headlines:
    print('[text]', h.text)
    print('[class]', h.get_attribute('class'))
    print('[id]', h.get_attribute('id'))
    print('[parent]', h.find_element(By.XPATH, '..').tag_name)
    print('---')
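
Extracted values are usually saved for later analysis. As a hedged sketch, the block below turns headline records into CSV text with the standard library. `headlines_to_csv` is a hypothetical helper and the sample records are made up. With a real page, each record could be built from `h.get_attribute('id')`, `h.text`, and `h.get_attribute('class')`.

```python
import csv
import io


def headlines_to_csv(records, fieldnames=("id", "text", "class")):
    """Serialize a list of dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records standing in for values read from all_headlines.
sample_records = [
    {"id": "History", "text": "History", "class": "mw-headline"},
    {"id": "Climate", "text": "Climate", "class": "mw-headline"},
]
print(headlines_to_csv(sample_records))
```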

Click the 'ภูมิอากาศ' (Climate) headline.

In [None]:
for h in all_headlines:
    if h.get_attribute('id') == 'ภูมิอากาศ':
        h.click()
        break

## End browsing session

In [None]:
browser.quit()
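
If a step fails with an exception before `browser.quit()` is reached, the browser process can stay open. One common pattern is to wrap the session in `try`/`finally`. The sketch below illustrates this under stated assumptions: `run_with_browser` is a hypothetical helper, and `FakeBrowser` is a stand-in so that the block runs without a real browser.

```python
def run_with_browser(start_browser, task):
    """Start a browser, run task(browser), and always quit the browser afterwards."""
    browser = start_browser()
    try:
        return task(browser)
    finally:
        # Runs on success and on error, so the browser never stays open.
        browser.quit()


class FakeBrowser:
    """Stand-in for a WebDriver (illustration only)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeBrowser()
result = run_with_browser(lambda: fake, lambda b: "done")
print(result, fake.closed)
```

With Selenium, a call such as `run_with_browser(webdriver.Chrome, lambda b: b.get(url))` would open Chrome, load the page, and close the browser even if loading fails.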