# <span style="color:darkblue"> Lecture 6 - Interacting with and Extracting Web Data </span>

<font size = "5">

In the previous class we covered ...

<font size = "3">


- The basics of web scraping: APIs and web crawling
- Basic concepts, benefits, limitations, and <br>
ethical considerations
- How to access an API, and how to read and process the JSON format
- The tree-like structure of many web formats

<font size = "5">

In this class we will cover ...

<font size = "3">

- Web crawling
- Basics of extracting data using Selenium

# <span style="color:darkblue"> I. Before we start </span>

<font size = "5">

Follow these instructions:

<font size = "3">

1. Install Google Chrome, if you haven't already. <br>
We will connect Python to Chrome for web navigation.

2. Install the following packages in Anaconda.<br>
Open your operating system terminal and type the following:

``` conda install conda-forge::html5lib  ``` <br>
``` conda install conda-forge::selenium  ``` <br>
``` conda install conda-forge::webdriver-manager  ```
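
<font size = "3">

To confirm that the installation worked, a quick check like the one below can be run from Python. This is only a sketch: the helper name `installed_versions` is made up for this example, and it assumes the packages are registered under the same names used in the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in package_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Names as used in the install commands above (assumed to match)
report = installed_versions(["html5lib", "selenium", "webdriver-manager"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as not installed should be installed again before continuing.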


<font size = "5">

Import packages for data processing

In [1]:
# Manage datasets
import pandas as pd

# Work with time data
import time 

<font size = "5">

Import packages for HTML processing

In [2]:
# Conduct HTTP requests
import requests

# Construct tree structure of HTML data
import html5lib

# Parse HTML data obtained from scraping
from bs4 import BeautifulSoup as soup 
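
<font size = "3">

As a reminder of the tree-like structure of HTML, here is a minimal sketch of parsing with BeautifulSoup. The HTML snippet is invented for illustration, and the example uses Python's built-in `"html.parser"`; passing `"html5lib"` instead should also work once that package is installed.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical HTML snippet, written here only for illustration
html = """
<form>
  <div class="form-group">
    <input class="form-control" placeholder="Keyword">
  </div>
  <div class="form-group">
    <input class="form-control" placeholder="Instructor">
  </div>
</form>
"""

# "html.parser" ships with Python; "html5lib" could be passed instead
page = soup(html, "html.parser")

# Search the tree for every <input> whose class is "form-control"
inputs = page.find_all("input", class_="form-control")
placeholders = [tag.get("placeholder") for tag in inputs]

# Each tag knows its parent, which reflects the tree-like structure
parent_classes = [tag.parent.get("class") for tag in inputs]

print(placeholders)
print(parent_classes)
```

BeautifulSoup returns the `class` attribute as a list because an element can carry several classes at once.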


<font size = "5">

Import packages for interactive <br>
website navigation

In [3]:
# Import webdriver for chrome
from webdriver_manager.chrome import ChromeDriverManager

# Automate navigating within browser (SELENIUM)
#------ Keys: Send special keys (e.g. ENTER, TAB)
#------ Select: Interact with drop-down (<select>) menus
#------ WebDriverWait: Add explicit wait times
#------ By: Use common element locator strategies
#------ EC: Conditions to wait for (e.g. an element is present)
#------ Options: Browser configuration
#------ remote.command: Check whether browser is active

from selenium import webdriver #to automate the navigating within the browser
from selenium.webdriver.common.keys    import Keys
from selenium.webdriver.support.ui     import Select
from selenium.webdriver.support.ui     import WebDriverWait 
from selenium.webdriver.common.by      import By
from selenium.webdriver.support        import expected_conditions as EC
from selenium.webdriver.chrome.options import Options #to use properties of the chrome webbrowser
from selenium.webdriver.remote.command import Command # Use to check whether the web driver is active


# <span style="color:darkblue"> II. Running Chrome from Python </span>

<font size = "5">

We will be working with the Course Atlas

<font size = "3">

- This is where Emory courses are listed
- It is an interactive website that displays information <br>
as you click through <br>

<img src="figures/screenshot_homepage.png" alt="drawing" width="400"/>

<font size = "5">

Define URL

In [4]:
base_url = 'https://atlas.emory.edu'

<font size = "5">

Initialize web driver

<font size = "3">

- You may have to grant permission to Python to access the browser <br>
via a pop-up window
- By default a new Chrome window will appear
- If you add the argument ```--headless=new``` the whole process will be hidden <br>
(you may consider this option after you've got the example automated)

In [5]:
# Service wraps the path to the ChromeDriver executable
from selenium.webdriver.chrome.service import Service

options = Options()

# Uncomment the next line to hide the browser while the scraper navigates.
# Leave it commented to see the window open and elements get clicked.
# (Recent Selenium releases no longer support `options.headless`.)
# options.add_argument("--headless=new")

# Download a matching ChromeDriver and start Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)


<font size = "5">

Open website

<font size = "3">

- For this to work, the browser window started by the driver needs to remain open

In [6]:
driver.get(base_url)

# <span style="color:darkblue"> III. Diagnose elements from website </span>

<font size = "5">

To know what to scrape, go to Developer Mode <br>
on your browser (preferably in a different window than the one <br>
automated by Python)

<font size = "5">


<img src="figures/screenshot_courseatlas_form.png" alt="drawing" width="200"/>


<font size = "5">

In Developer mode, you will find that <br>
all the form fields have the class ``` form-control ```

<img src="figures/screenshot_courseatlas_developer.png" alt="drawing" width="500"/>

<font size = "5">

We can use Python to extract all elements with the class ``` form-control ```

In [7]:
from selenium.webdriver.common.by import By

list_forms = driver.find_elements(By.CLASS_NAME, 'form-control')
len(list_forms)

16
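
<font size = "3">

If the list comes back empty, the page may still be rendering when the search runs. One way to handle this is to retry until the elements show up. The helper below is a hand-written sketch (the name `wait_until` and its defaults are assumptions made for this example); Selenium's own `WebDriverWait` together with `expected_conditions` serves the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Pause between attempts so the browser has time to load
        time.sleep(poll)
```

With the driver from this lecture, it could be used as `list_forms = wait_until(lambda: driver.find_elements(By.CLASS_NAME, 'form-control'))`, which returns the list as soon as it is non-empty.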

<font size = "5">

Open a single element

In [9]:
list_forms[0].get_attribute("placeholder")

'Title, Subject, Instructor, or Keyword'
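
<font size = "3">

Rather than opening elements one at a time, the same lookup can be repeated for the whole list. The sketch below uses a hypothetical helper, `describe_elements`, and the attribute names in its default argument are only guesses at what may be useful on this page. It relies on each element having `tag_name` and `get_attribute`, as Selenium elements do.

```python
def describe_elements(elements, attributes=("placeholder", "id", "name")):
    """Return one dict per element with its tag name and selected attributes."""
    rows = []
    for position, element in enumerate(elements):
        row = {"position": position, "tag": element.tag_name}
        for attribute in attributes:
            # get_attribute returns None when the attribute is absent
            row[attribute] = element.get_attribute(attribute)
        rows.append(row)
    return rows
```

Calling `pd.DataFrame(describe_elements(list_forms))` would then give a table with one row per form field, which makes it easier to see which fields have a placeholder and which do not.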

<font size = "5">

Try it yourself: Amazon website


In [15]:
base_url = 'https://www.amazon.com/b?node=283155'

In [16]:
driver.get(base_url)

<div class="p13n-sc-truncate-desktop-type2  p13n-sc-truncated" aria-hidden="true" data-rows="2">The Women: A Novel</div>

In [21]:
list_objects = driver.find_elements(By.CLASS_NAME, 'p13n-sc-truncate-desktop-type2')


len(list_objects)

20
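
<font size = "3">

The count alone does not show what was matched. A possible next step is to pull the text out of each element, as in the sketch below. The helper name `element_texts` is made up for this example. It falls back to the `textContent` attribute because `.text` only returns text that Selenium considers visible, and it may come back empty for some elements.

```python
def element_texts(elements):
    """Return the non-empty text of each element, in page order without repeats."""
    texts = []
    for element in elements:
        # .text holds only visible text; fall back to textContent if it is empty
        text = element.text or element.get_attribute("textContent") or ""
        text = text.strip()
        if text and text not in texts:
            texts.append(text)
    return texts
```

For the page above, `element_texts(list_objects)` should return a list of titles such as `'The Women: A Novel'`, provided the site still uses the same class name.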