# Data Extraction with Selenium
This tutorial shows how to use Selenium to extract data from web pages. See https://selenium-python.readthedocs.io for more details.

## Installation
Before using Selenium, we have to install a WebDriver for the browser of our choice, such as Chrome or Firefox. Selenium needs the location of the driver executable in order to start a browser. For Chrome, the simplest route is the Python helper package chromedriver_autoinstaller, which downloads a driver matching the installed Chrome version and adds it to the PATH.

        pip install chromedriver_autoinstaller

We also have to install the selenium package.

        pip install selenium

In [1]:
from selenium import webdriver
import chromedriver_autoinstaller
import time
import os

In [2]:
chromedriver_autoinstaller.install()

'/Users/natawut/opt/anaconda3/lib/python3.9/site-packages/chromedriver_autoinstaller/111/chromedriver'

In [3]:
browser = webdriver.Chrome()

## Browsing a webpage
Once the browser starts, we can tell it to visit a webpage.

In [4]:
url = 'https://www.google.com'

In [5]:
browser.get(url=url)

In [6]:
html = browser.execute_script("return document.documentElement.outerHTML")
html[:3000]

'<html itemscope="" itemtype="http://schema.org/WebPage" lang="th"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><link href="/manifest?pwa=webhp" crossorigin="use-credentials" rel="manifest"><title>Google</title><script src="https://apis.google.com/_/scs/abc-static/_/js/k=gapi.gapi.en.yHsE3XoyXLE.O/m=gapi_iframes,googleapis_client/rt=j/sv=1/d=1/ed=1/rs=AHpOoo8LDClD0V3IE-5SJcudVO91TD73Qw/cb=gapi.loaded_0" nonce="PMhYP3QI6o3DzuYtOq65RQ" async=""></script><script nonce="PMhYP3QI6o3DzuYtOq65RQ">(function(){window.google={kEI:\'yN0NZOKOJNXY-QaguJqwCQ\',kEXPI:\'31\',kBL:\'4YFS\'};google.sn=\'webhp\';google.kHL=\'th\';})();(function(){\nvar f=this||self;var h,k=[];function l(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||h}function m(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b}\nfunction n(a
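Selenium also exposes the rendered HTML through the driver's `page_source` attribute, so no JavaScript has to be run by hand. The sketch below wraps the two steps (visit, then read the HTML) in one helper. The name `fetch_html` and its `max_chars` parameter are illustrative and not part of Selenium; the helper only assumes the object passed in has a `get` method and a `page_source` attribute, as a Selenium WebDriver does.

```python
def fetch_html(driver, url, max_chars=None):
    """Load `url` in `driver` and return the rendered HTML as a string.

    `driver` is any object with the WebDriver interface used here:
    a `get(url)` method and a `page_source` attribute.
    `max_chars` (hypothetical parameter) truncates the result for previews.
    """
    driver.get(url)
    html = driver.page_source  # HTML of the page currently loaded in the browser
    return html if max_chars is None else html[:max_chars]
```

With the browser started above, `fetch_html(browser, url, max_chars=3000)` would give a preview similar to the one shown in the output.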

## Interact with a webpage
Once the page has loaded, we can interact with its elements. In this example, we search Google for a particular keyword. To do so, we locate the search box and then send keystrokes to it.

In [7]:
from selenium.webdriver.common.by import By

In [8]:
q_element = browser.find_element(By.CSS_SELECTOR, '[name=q]')

In [9]:
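from selenium.webdriver.common.keys import Keys

q_element.clear()                 # remove any existing text from the search box
q_element.send_keys('ประเทศไทย')   # type the query ("Thailand" in Thai)
q_element.send_keys(Keys.ENTER)   # press Enter; Keys.ENTER is '\ue007'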
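After Enter is sent, the results page takes a moment to load, and looking for result links too early may return nothing. Selenium ships its own `WebDriverWait` class for this situation; the sketch below is not that class but a small standard-library polling helper that illustrates the same idea. The name `wait_until` and its parameters are assumptions made for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Because `find_elements` returns an empty list when nothing matches, a call such as `wait_until(lambda: browser.find_elements(By.CSS_SELECTOR, '.g a'))` would keep polling until at least one result link is present.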

## Navigate the webpage
We can locate elements within the current page, much as we would with Beautiful Soup. Selenium supports several locator strategies, including CSS selectors and XPath, both of which are used below.

In [10]:
all_link = browser.find_elements(By.CSS_SELECTOR, '.g a')

In [11]:
for link in all_link:
    print('[link text]', link.text)
    print('[link href]', link.get_attribute('href'))
    print('---')

[link text] ประเทศไทย - วิกิพีเดีย
wikipedia.org
https://th.wikipedia.org › wiki › ประเทศไทย
[link href] https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
---
[link text] 
[link href] https://www.google.com/search?q=related:https://th.wikipedia.org/wiki/%25E0%25B8%259B%25E0%25B8%25A3%25E0%25B8%25B0%25E0%25B9%2580%25E0%25B8%2597%25E0%25B8%25A8%25E0%25B9%2584%25E0%25B8%2597%25E0%25B8%25A2+%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2&sa=X&ved=2ahUKEwj-1NK0ydb9AhWdV2wGHauXCoIQH3oECBEQDQ
---
[link text] 
[link href] https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
---
[link text] ประชาธิปไตยอันมีพระมหากษัตริย์ทรงเ...
[link href] http://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B8%8A%E0%B8%B2%E0%B8%98%E0%B8%B4%E0%B8%9B%E0%B9%84%E0%B8%95%E0%B8%A2%E0%B8%AD%E0%B8%B1%E0%B8%99%E0%B8%A1%E0%B8%B5%E0%B8%9E%E0%B8%A3%E0%B8%B0%E0%B8

[link text] ธนาคารโลก
[link href] http://datatopics.worldbank.org/world-development-indicators
---
[link text] สกุลเงิน
[link href] https://www.google.com/search?q=%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2+%E0%B8%AA%E0%B8%81%E0%B8%B8%E0%B8%A5%E0%B9%80%E0%B8%87%E0%B8%B4%E0%B8%99&stick=H4sIAAAAAAAAAOPgE-LQz9U3ME8zrNCSzU620s_JT04syczP00_OL80rKaq0Si4tKkrNS65cxGryYMfsBzsWP9ix4cHOhgc7pj_YseLBzhYwY5HCgx2rHuxofLBjx4MdS8HS7Q92bHmwYyYA1dQM3mIAAAA&sa=X&ved=2ahUKEwj-1NK0ydb9AhWdV2wGHauXCoIQ6BMoAHoECF8QAg
---
[link text] บาท
[link href] https://www.google.com/search?q=&si=AEcPFx6l3RvH8SFlhHZyn7jIc6m2bU9vmoFvFAMQv2WWSYjXN5QZDDEnuaKd__gEE1a0wKEQk7pg3LK-3n0E0EDKWtdJPyBWPpI4_LME_0R7k_pF_9lRmf_15RH53Bq2P8S7wX5sVA7mvmSghfDNcUITXzO61DI2acRUfWfxe57g2K1zY3eKD7I%3D&sa=X&ved=2ahUKEwj-1NK0ydb9AhWdV2wGHauXCoIQmxMoAXoECF8QAw
---
[link text] ทวีป
[link href] https://www.google.com/search?q=%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2+%E0
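The output above contains links with empty text, multi-line link text, and percent-encoded URLs. A minimal sketch of one way to tidy it up follows; it works on plain `(text, href)` tuples so that it does not depend on a running browser. The helper name `clean_links` is hypothetical, and decoding with `urllib.parse.unquote` is meant for display only, since a decoded URL is not always safe to request as-is.

```python
from urllib.parse import unquote


def clean_links(pairs):
    """Drop links with empty text and decode percent-encoded URLs.

    `pairs` is an iterable of (text, href) tuples; href may be None.
    Returns a list of (title, readable_url) tuples.
    """
    cleaned = []
    for text, href in pairs:
        if not text or not text.strip() or not href:
            continue  # skip icon-only links and anchors without an href
        # Keep only the first line of multi-line link text (the title).
        title = text.strip().splitlines()[0]
        cleaned.append((title, unquote(href)))
    return cleaned
```

The input could be built from the elements found earlier with `[(link.text, link.get_attribute('href')) for link in all_link]`.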

In [12]:
all_link[0].click()

In [13]:
all_headlines = browser.find_elements(By.CSS_SELECTOR, 'span[class^="mw-headline"]')

In [14]:
for h in all_headlines:
    print('[text]', h.text)
    print('[class]', h.get_attribute('class'))
    print('[id]', h.get_attribute('id'))
    print('[parent]', h.find_element(By.XPATH, '..').tag_name)
    print('---')

[text] ชื่อเรียก
[class] mw-headline
[id] ชื่อเรียก
[parent] h2
---
[text] ประวัติศาสตร์
[class] mw-headline
[id] ประวัติศาสตร์
[parent] h2
---
[text] ยุคก่อนประวัติศาสตร์
[class] mw-headline
[id] ยุคก่อนประวัติศาสตร์
[parent] h3
---
[text] อาณาจักรสุโขทัยและแคว้นต่าง ๆ
[class] mw-headline
[id] อาณาจักรสุโขทัยและแคว้นต่าง_ๆ
[parent] h3
---
[text] อาณาจักรอยุธยาและธนบุรี
[class] mw-headline
[id] อาณาจักรอยุธยาและธนบุรี
[parent] h3
---
[text] กรุงรัตนโกสินทร์ตอนต้นและสมัยอาณานิคม
[class] mw-headline
[id] กรุงรัตนโกสินทร์ตอนต้นและสมัยอาณานิคม
[parent] h3
---
[text] ราชาธิปไตยภายใต้รัฐธรรมนูญ สงครามโลกครั้งที่สอง และสงครามเย็น
[class] mw-headline
[id] ราชาธิปไตยภายใต้รัฐธรรมนูญ_สงครามโลกครั้งที่สอง_และสงครามเย็น
[parent] h3
---
[text] ร่วมสมัย
[class] mw-headline
[id] ร่วมสมัย
[parent] h3
---
[text] ภูมิประเทศ
[class] mw-headline
[id] ภูมิประเทศ
[parent] h2
---
[text] ภูมิอากาศ
[class] mw-headline
[id] ภูมิอากาศ
[parent] h3
---
[text] ความหลากหลายทางชีวภาพ
[class] mw-headline
[id] ความหลาก
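Since each headline's parent tag (`h2`, `h3`) gives its level, the same information can be turned into an indented outline of the article. The sketch below assumes the headings have already been collected as `(tag, text)` tuples; `format_outline` is an illustrative name, and treating `h2` as the top level is a choice made for this page layout.

```python
def format_outline(headings):
    """Render (tag, text) pairs such as ('h2', 'History') as an indented outline."""
    lines = []
    for tag, text in headings:
        level = int(tag[1])                 # 'h2' -> 2, 'h3' -> 3
        indent = '  ' * max(level - 2, 0)   # treat h2 as the top level
        lines.append(f'{indent}- {text}')
    return '\n'.join(lines)
```

The tuples could be gathered with `[(h.find_element(By.XPATH, '..').tag_name, h.text) for h in all_headlines]` and the result printed with `print(format_outline(...))`.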

## End browsing session

In [15]:
browser.quit()
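If an earlier cell raises an exception, `browser.quit()` may never run and the browser process is left open. One way to guard against that, sketched below with the standard library's `contextlib`, is to wrap the session in a context manager. The name `managed_browser` is hypothetical; `factory` is any callable that returns an object with a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with `factory()` and always quit it on exit."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # runs even if the body of the `with` block raises
```

In a script, this would be used as `with managed_browser(webdriver.Chrome) as browser:` followed by the calls to `browser.get(...)` and `browser.find_elements(...)` inside the block.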