<img src="https://pbs.twimg.com/profile_images/620117027689136129/-vYs_XqS_400x400.png" height="200" width="200">

# USING SELENIUM FOR WEB CRAWLING


### About Selenium
Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that.
In our scenario, we'll use its ability to harvest data from online resources.




### Installation - Python package
Selenium can be installed from your terminal with one of the following commands:
#### - Anaconda Distribution
- `conda install selenium`


#### - Virtual Environment (pip)
- `pip3 install selenium`


OR
- `pip install selenium`
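
To confirm that the package landed in the environment your notebook is actually using, a quick check like the one below can help. This is a minimal sketch: the helper name `selenium_version` is made up for illustration, and it relies only on the standard library, so it also runs when Selenium is missing.

```python
import importlib.util
from importlib import metadata


def selenium_version():
    """Return the installed Selenium version string, or None if it is missing."""
    # find_spec returns None when the package cannot be imported from this environment
    if importlib.util.find_spec("selenium") is None:
        return None
    try:
        return metadata.version("selenium")
    except metadata.PackageNotFoundError:
        return None


print(selenium_version() or "selenium is not installed in this environment")
```

If this prints the "not installed" message even though the install command succeeded, the notebook kernel is probably running in a different environment from the one you installed into.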

### Installation - Browser drivers
A browser driver is the mechanism that lets Selenium control a browser according to the parameters we specify. Download the driver for the browser you prefer, and make sure that browser is already installed on your system.

https://www.seleniumhq.org/download/



Recommended drivers:
1. Mozilla Firefox - https://github.com/mozilla/geckodriver/releases
2. Google Chrome - https://sites.google.com/a/chromium.org/chromedriver/downloads
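
After downloading a driver, it can be handy to check whether it is reachable from your `PATH`. The sketch below assumes the executables keep their default names (`chromedriver`, `geckodriver`); the helper `find_driver` is hypothetical and uses only the standard library. A driver that is not on `PATH` can still be used by passing its full path explicitly, as this notebook does later.

```python
import shutil


def find_driver(names=("chromedriver", "geckodriver")):
    """Map each driver name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


for name, path in find_driver().items():
    print(f"{name}: {path or 'not found on PATH'}")
```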

### Basic Terminology:
- XPath - XPath uses path expressions to select nodes or node-sets in an XML document. The XPath that reaches a given element can vary between browsers. Reference: https://www.w3schools.com/xml/xpath_intro.asp
- CSS Selectors - In CSS, selectors are patterns used to select the element(s) you want to style. The selector for a given element usually stays the same across browsers. Reference: https://www.w3schools.com/cssref/css_selectors.asp
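
To make the idea of a path expression concrete, here is a small sketch that runs without a browser. It uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath, and the markup in `SAMPLE` is invented for illustration. The CSS selector that targets the same links would be `li.course > a`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on a list of course links.
SAMPLE = """
<ul id="courses">
  <li class="course"><a href="/c/1">Databases SM1</a></li>
  <li class="course"><a href="/c/2">Networks SM2</a></li>
  <li class="other"><a href="/help">Help</a></li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# XPath: every <a> directly inside an <li> whose class attribute is "course".
links = root.findall(".//li[@class='course']/a")
titles = [a.text for a in links]
print(titles)
```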

### A few pointers:
- Try exploring a webpage manually first, look for patterns and try to make your code as generic as possible.
- The same webpage may behave differently in different browsers.
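
One way to keep code generic is to move a recurring pattern into a small helper instead of hard-coding it in a loop. The sketch below is illustrative only: `elements_containing` is a hypothetical name, and the `SimpleNamespace` objects stand in for Selenium elements, of which only the `.text` attribute is used.

```python
from types import SimpleNamespace


def elements_containing(elements, keyword):
    """Return the elements whose visible text contains `keyword`."""
    return [el for el in elements if keyword in el.text]


# Stand-ins for WebElement objects; the subject names are made up.
fake_items = [
    SimpleNamespace(text="Subject A SM2"),
    SimpleNamespace(text="Subject B SM1"),
]
print([el.text for el in elements_containing(fake_items, "SM2")])
```

The same helper works for any keyword, so switching semesters means changing one argument rather than editing the loop.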

In [11]:
from selenium import webdriver
# OPTIONS: For configuring browser-driver setting
from selenium.webdriver.chrome.options import Options

### Exception Handling Configuration

In [12]:
import selenium.common.exceptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

#### Playing around with Webdriver - UNIMELB

In [14]:
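# Chrome configuration
options = Options()
options.add_experimental_option("prefs", {
    "download.default_directory": r"/Users/k/Desktop/",  # where downloads are saved
    "download.prompt_for_download": False,               # skip the "Save as" dialog
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True
})
# options.add_argument('--headless')  # uncomment to run without a visible window
# Firefox alternative (needs Options from selenium.webdriver.firefox.options):
# browser = webdriver.Firefox(executable_path='/Users/k/PycharmProjects/Tels/geckodriver', firefox_options=options)
# Note: executable_path is the Selenium 3 style; Selenium 4 expects a Service object instead.
browser = webdriver.Chrome(executable_path='/Users/k/Downloads/chromedriver', options=options)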

browser.maximize_window()

browser.get('https://app.lms.unimelb.edu.au/webapps/portal/execute/tabs/tabAction?tab_tab_group_id=_41_1')
browser.implicitly_wait(20)  # implicitly_wait: wait up to 20 s for elements to appear in later lookups

# Login Module
username = browser.find_element(By.ID, 'user_id')
password = browser.find_element(By.ID, 'password')
username.send_keys('YOUR_USERNAME')  # placeholder: replace with your own login
password.send_keys('YOUR_PASSWORD')  # placeholder: never commit real credentials
browser.find_element(By.ID, 'entry-login').click()
browser.implicitly_wait(10)

# Exploring the webpage
all_subjects = browser.find_element(By.CSS_SELECTOR, '.coursefakeclass')
items = all_subjects.find_elements(By.TAG_NAME, 'li')
semester2 = []  # Elements for the current-semester subjects available to the student

# Keep the subjects of the current semester: those with 'SM2' in their text
for item in items:
    if 'SM2' in item.text:
        semester2.append(item)

# Exploring the first found subject
semester2[0].click()

# Goal: download all lecture slides and text material for that subject.
# In this case they are listed under the Documents tab.
browser.find_element(By.LINK_TEXT, 'Documents').click()
browser.implicitly_wait(5)
browser.find_element(By.LINK_TEXT, 'Labs and Workshops').click()

# Now inside the directory structure: Subject -> Labs and Workshops -> Contents
# Iterate through the contents, which the HTML source exposes as a list
docdump = browser.find_element(By.ID, 'content_listContainer')
docslist = docdump.find_elements(By.TAG_NAME, 'li')

# Generic approach: collect the text of each list item, then locate each one by its link text
documents = [document.text for document in docslist]
print('List of documents found: ' + str(documents))

# Opening the first found document
browser.find_element(By.LINK_TEXT, documents[0]).click()
browser.implicitly_wait(10)

# Check number of tabs available
print(browser.window_handles)
if len(browser.window_handles) == 2:
    print('Tabs available')
    print(browser.window_handles)
    browser.switch_to.window(browser.window_handles[1])
else:
    print('Tabs not found')

List of documents found: ['Lab and Workshop 1', 'Lab and Workshop 1 Solutions', 'Lab and Workshop 2', 'Lab and Workshop 2 Solutions', 'Lab and Workshop 3', 'Sample Lab Test', 'Sample Lab Test Solutions']
['CDwindow-1AB32D84E1D4EC64A5193362C1A46097', 'CDwindow-96D1AA070E3275EA51FE0D7204FC1205']
Tabs available
['CDwindow-1AB32D84E1D4EC64A5193362C1A46097', 'CDwindow-96D1AA070E3275EA51FE0D7204FC1205']
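
The tab check above assumes exactly two windows and that the new one sits at index 1. A slightly more general approach, sketched here under those caveats, is to remember the handles that existed before the click and switch to whichever handle is new. The helper `switch_to_new_tab` is hypothetical; it touches only `window_handles` and `switch_to.window`, and the `SimpleNamespace` stub stands in for a real driver so the sketch runs without a browser.

```python
from types import SimpleNamespace


def switch_to_new_tab(driver, known_handles):
    """Switch to the first window handle not in `known_handles`.

    Returns the new handle, or None if no new tab or window has appeared.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None


# Stub driver for demonstration: records which handle was switched to.
switched = []
fake_driver = SimpleNamespace(
    window_handles=["tab-1", "tab-2"],
    switch_to=SimpleNamespace(window=switched.append),
)
new_handle = switch_to_new_tab(fake_driver, known_handles={"tab-1"})
print(new_handle, switched)
```

With a real browser, you would capture `set(browser.window_handles)` just before the click that opens the document and pass it as `known_handles` afterwards.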


In [None]:
# JS Scrolling
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
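
A single scroll reaches the bottom of the page as it is currently loaded. On pages that load more content as you scroll, one option is to repeat the scroll until the page height stops changing. The following is a sketch under that assumption: `scroll_to_bottom`, `pause`, and `max_rounds` are illustrative names, and `FakeDriver` is a stand-in that lets the loop run without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in that reports a page growing twice, then staying the same height."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            height = self.heights[min(self.index, len(self.heights) - 1)]
            self.index += 1
            return height
        return None


print(scroll_to_bottom(FakeDriver(), pause=0))
```

The `max_rounds` cap keeps the loop from running forever on pages that never stop loading.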

In [None]:
# Waiting for an element to become visible or accessible
wait = WebDriverWait(browser, 10)  # explicit wait: poll for up to 10 seconds
try:
    # What you want from the browser window; locate it by ID, XPath, CSS selector, etc.
    wait.until(expected_conditions.visibility_of_element_located((By.ID, 'orderSearchForm_orderNumber')))
except selenium.common.exceptions.TimeoutException:
    # How you want to handle the error. Keep try/except blocks small so that you
    # can control browser behaviour for each scenario separately.
    print('Element did not become visible within 10 seconds')

In [None]:
# Taking a screenshot
brows