# Selenium
## Web Browser Automation and Scraping

Selenium is a browser automation tool. Its Python package lets you drive a web browser from code and scrape data off web pages.

###### Download Link and Instructions: http://selenium-python.readthedocs.io/installation.html

# Example 1: Scraping from Cav Daily Salary Data

In [1]:
# imports

import numpy as np
import pandas as pd
from selenium import webdriver #pip install selenium first...
import time # this is for sleeping

In [45]:
# initializing the browser and going to a web page

# open instance of browser
# Note: you may need to put the path of your webdriver in the parentheses as a string
browser = webdriver.Chrome()

# go to the web page that we want to scrape from
browser.get('https://cavdailyonline.github.io/facultysalarygryphon/')

# wait for browser/page to load before doing anything else
'''
If you don't do this, selenium may get confused while running 
the next command because whatever object it looks for may not yet be there.
So when running a command that will open a new web page it is usually
a good idea to sleep for a few seconds.
''' 
time.sleep(2)
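
A fixed `time.sleep` either waits longer than needed or not long enough. One alternative, sketched below, is to poll until the thing you need shows up and give up after a timeout. The helper name `wait_until` and its parameters are illustrative assumptions, not part of Selenium; Selenium provides its own version of this idea as `WebDriverWait` in `selenium.webdriver.support.ui`.

```python
import time


def wait_until(condition, timeout=10.0, poll_every=0.25):
    """Call `condition` repeatedly until it returns something truthy.

    Returns that truthy value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within %.1f seconds" % timeout)
        time.sleep(poll_every)


# Stand-in for a page that only becomes ready on the third check.
checks = {"count": 0}


def page_is_ready():
    checks["count"] += 1
    return "table loaded" if checks["count"] >= 3 else None


status = wait_until(page_is_ready, timeout=5, poll_every=0.01)
print(status)
```

With a real browser, the condition could be something like `lambda: browser.find_elements_by_id("data-table-container")`, which returns an empty (falsy) list until the table exists.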

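Selenium allows you to select web page elements in a variety of ways:
1. .find_element_by_class_name
2. .find_element_by_css_selector
3. .find_element_by_id
4. .find_element_by_link_text
5. .find_element_by_name
6. .find_element_by_partial_link_text
7. .find_element_by_tag_name
8. .find_element_by_xpath

These helper methods belong to the Selenium 3 API. Selenium 4.3 and later removed them in favor of the generic find_element(By.ID, value) form, so the code in this notebook assumes an older Selenium release.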
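
Each of these helpers pairs a lookup strategy with a value, which is the same pair that the generic `find_element(by, value)` call takes. A rough sketch of that pairing follows. The strategy strings are the values of the constants on Selenium's `By` class (`By.ID` is `"id"`, `By.XPATH` is `"xpath"`, and so on); `FakeDriver` and `find` are hypothetical names used only so the example runs without a browser.

```python
# Locator strategy behind each find_element_by_* helper.
# The strings match the constants on selenium.webdriver.common.by.By.
STRATEGIES = {
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
    "find_element_by_id": "id",
    "find_element_by_link_text": "link text",
    "find_element_by_name": "name",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_xpath": "xpath",
}


class FakeDriver:
    """Hypothetical stand-in for a browser: it records what was asked for."""

    def find_element(self, by, value):
        return (by, value)


def find(driver, helper_name, value):
    """Translate an old-style helper name into a generic find_element call."""
    return driver.find_element(STRATEGIES[helper_name], value)


driver = FakeDriver()
print(find(driver, "find_element_by_id", "domain"))
print(find(driver, "find_element_by_xpath", '//*[@id="data-table-container"]'))
```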

## Collecting Data Across 428 Pages... All at once.

In [53]:
n_clicks = 1
# total pages = 10,678 entries / 25 per page, rounded up = 428
df = pd.DataFrame() # blank df
while n_clicks < 429:
    # grab the table's HTML and parse it into a data frame
    html = browser.find_element_by_xpath('//*[@id="data-table-container"]').get_attribute('outerHTML')
    table_list = pd.read_html(html)
    temp_df = table_list[0]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, temp_df], ignore_index=True)
    # move to next page
    browser.find_element_by_xpath('//*[@id="data-table-container_wrapper"]/div[4]/ul/li[7]/a').click()
    time.sleep(0.5)
    n_clicks += 1
    # compare numbers with ==, not `is`
    if n_clicks % 10 == 0:
        print("Scraping Page: " + str(n_clicks))
print(df.shape)
df.head()

Scraping Page: 10
Scraping Page: 20
Scraping Page: 30
Scraping Page: 40
Scraping Page: 50
Scraping Page: 60
Scraping Page: 70
Scraping Page: 80
Scraping Page: 90
Scraping Page: 100
Scraping Page: 110
Scraping Page: 120
Scraping Page: 130
Scraping Page: 140
Scraping Page: 150
Scraping Page: 160
Scraping Page: 170
Scraping Page: 180
Scraping Page: 190
Scraping Page: 200
Scraping Page: 210
Scraping Page: 220
Scraping Page: 230
Scraping Page: 240
Scraping Page: 250
Scraping Page: 260
Scraping Page: 270
Scraping Page: 280
Scraping Page: 290
Scraping Page: 300
Scraping Page: 310
Scraping Page: 320
Scraping Page: 330
Scraping Page: 340
Scraping Page: 350
Scraping Page: 360
Scraping Page: 370
Scraping Page: 380
Scraping Page: 390
Scraping Page: 400
Scraping Page: 410
Scraping Page: 420
(10675, 4)


| | Name | Salary | Division | Title |
|---|---|---|---|---|
| 0 | Aaron, Bradley J (Brad) | 53560 | AS-College of Arts & Sciences | Education and Outreach Manager-EO58 |
| 1 | Abad-Jorge, Ana R | 144900 | CP-School of Cont/Prof Studies | Assistant Professor |
| 2 | Abate, Hiwot Mulugeta | 32000 | MD-School of Medicine | Lab and Research Technician 1-LAB26 |
| 3 | Abatzis, Vaia T | 100000 | MD-School of Medicine | Assistant Professor |
| 4 | Abbas, Tarek A | 131000 | MD-School of Medicine | Assistant Professor |
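
The 428 in the loop condition comes from dividing the number of entries by the page size and rounding up. A quick check of that arithmetic, using the 10,678 entries and 25 rows per page noted in the loop's comment:

```python
import math

total_entries = 10678   # entries reported for the salary table
rows_per_page = 25      # rows shown on each page

# Round up, because a partly filled last page still counts as a page.
n_pages = math.ceil(total_entries / rows_per_page)
rows_on_last_page = total_entries - (n_pages - 1) * rows_per_page

print(n_pages)
print(rows_on_last_page)
```

The scraped frame above has 10,675 rows, which is exactly 427 full pages, so it may be worth checking that the last, partial page was captured.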


In [59]:
# I need this data frame for class, so I'm going to save it as an Excel document...
df.to_excel('cavdaily2015.xlsx')

### Let's break that down...
Steps:<br>
1. Go to the website:<br>
https://cavdailyonline.github.io/facultysalarygryphon/<br>
2. Right click -> Inspect -> find the data of interest by clicking around<br>
3. Right click the element -> Copy -> Copy XPath<br>
4. Get the element's outer HTML<br>
5. Read the HTML using pd.read_html()<br>
6. Find the df object (if it exists) in the list that pd.read_html() returns<br>
7. Find the XPath for the "next" button and tell the browser to click it<br>
8. Scrape this new page using the same steps as 3-6<br>
9. Manually keep doing this, or combine these steps into a loop
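
Steps 4 through 9 can be sketched as one reusable loop. To keep the example runnable without a browser, the page access is passed in as functions. The names `scrape_pages` and `FakeSite` and the parameter names are hypothetical; with Selenium, the two callables would wrap the `get_attribute('outerHTML')` and `.click()` calls used above.

```python
def scrape_pages(read_page, go_to_next_page, n_pages):
    """Read the current page, move on, and repeat for n_pages pages."""
    pages = []
    for page_number in range(1, n_pages + 1):
        pages.append(read_page())
        # Only click "next" if there is another page left to read.
        if page_number < n_pages:
            go_to_next_page()
    return pages


class FakeSite:
    """Hypothetical stand-in for the browser: three pages of table HTML."""

    def __init__(self):
        self.html_pages = ["<table>page 1</table>",
                           "<table>page 2</table>",
                           "<table>page 3</table>"]
        self.current = 0

    def outer_html(self):
        return self.html_pages[self.current]

    def click_next(self):
        self.current += 1


site = FakeSite()
collected = scrape_pages(site.outer_html, site.click_next, n_pages=3)
print(len(collected))
print(collected[-1])
```

Each collected item would then go through `pd.read_html()` as in step 5.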

In [None]:
#steps 1-5 code
html = browser.find_element_by_xpath('//*[@id="data-table-container"]').get_attribute('outerHTML')
pd_html = pd.read_html(html)

In [None]:
#step 6: find df object
print(type(pd_html))

We have a list. How long is the list? What type are the list elements?

In [None]:
print("List Length: " + str(len(pd_html)))
print("Type of List Element 1: " + str(type(pd_html[0])))

In [None]:
# here's our data frame: index the list with square brackets (the first element is at index 0)
salary_page1 = pd_html[0]

See the while loop earlier in this notebook for how to click a button to move to another page...

In [60]:
# close the browser now that we've got the data we need
'''
just for convenience's sake, because Selenium opens up a new 
window of your browser every time you run it... which can get annoying
'''
browser.close()
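
If the scraping code raises an error halfway through, the browser window stays open. One way to guard against that, sketched here with a hypothetical `FakeBrowser` so it runs without Selenium, is to put the cleanup in a `finally` block. The function name `scrape_with_cleanup` is also just an illustration.

```python
class FakeBrowser:
    """Hypothetical stand-in that only tracks whether it is still open."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False


def scrape_with_cleanup(browser, scrape):
    """Run `scrape(browser)` and close the browser even if scraping fails."""
    try:
        return scrape(browser)
    finally:
        browser.close()


def failing_scrape(browser):
    # Simulates a lookup that goes wrong partway through a scrape.
    raise RuntimeError("element not found")


fake_browser = FakeBrowser()
try:
    scrape_with_cleanup(fake_browser, failing_scrape)
except RuntimeError as error:
    print("scrape failed:", error)

print(fake_browser.is_open)
```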

# Example 2: Slack Bot

In [61]:
# first, let's pick a channel and a message to send to that channel

# pick your channel
channel = 'bot_world'
# write a message
message = 'Hi there Slack!'

In [62]:
# open browser
browser = webdriver.Chrome()

# go to the Slack sign-in page
browser.get('http://slack.com/signin')

# wait for browser/page to load before doing anything else
time.sleep(1)

In [7]:
# LOGIN
# We'll do this by finding elements by id, rather than by xpath as we did before...

# type the slack team name
browser.find_element_by_id("domain").send_keys("hackcville")

# press continue button
browser.find_element_by_id("submit_team_domain").click()

# wait for next page to load
time.sleep(2)

In [8]:
# supply username and password for slack
from getpass import getpass # hides the password as you type it

email = input("Email: ")
password = getpass("Password: ")

# type username
browser.find_element_by_id("email").send_keys(email)
# type password
browser.find_element_by_id("password").send_keys(password)
# click sign in button
browser.find_element_by_id("signin_btn").click()

# wait for the next page to load
time.sleep(15)

Email: ct3fb@virginia.edu
Password: ········


In [9]:
# navigate to the channel you want to write in

'''
This finds the elements with class name = "overflow_ellipsis" 
within the element with id = "channels" and puts them all in a list.
'''
channels = browser.find_element_by_id('channels').find_elements_by_class_name('overflow_ellipsis')

# let's see what the inner HTML elements look like
channels[0].get_attribute('innerHTML')

'\n\t\t\t\t\t\tannouncements\n\t\t\t\t'

In [10]:
# that is ugly ^^^
# let's clean it up with .strip() to clear the whitespace in the for loop

'''
this just iterates through the stripped inner HTML of the items 
in the 'channels' list until it finds the one we want then clicks it
'''
for ch in channels:
    if ch.get_attribute('innerHTML').strip() == channel:
        # click on the 'bot_world' channel
        ch.click()

        # break out of the for loop
        break
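
The matching logic in this loop can be tried on plain strings first. The sample values below imitate the raw `innerHTML` shown earlier; the `random` channel is made up for illustration.

```python
# Raw innerHTML values as Selenium might return them (sample data).
raw_names = [
    "\n\t\t\t\t\t\tannouncements\n\t\t\t\t",
    "\n\t\t\t\t\t\tbot_world\n\t\t\t\t",
    "\n\t\t\t\t\t\trandom\n\t\t\t\t",
]
channel = "bot_world"

# strip() removes the leading and trailing newlines and tabs.
clean_names = [name.strip() for name in raw_names]
print(clean_names)

# Index of the first entry that matches, or None if nothing matches.
match_index = next(
    (i for i, name in enumerate(clean_names) if name == channel), None
)
print(match_index)
```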

In [11]:
# find the element for the text bar
text_bar = browser.find_element_by_class_name('ql-editor.ql-blank').find_element_by_css_selector('p')
# type your message
text_bar.send_keys(message)
# press enter ('\ue007' is the code Selenium uses for the Enter key, Keys.ENTER)
text_bar.send_keys(u'\ue007')

#you can repeat this cell with different messages to send multiple messages to the channel

So if you ever want to write a message to Slack but you're in class and don't want to get called out for not working, just whip out this script and no one will suspect anything.

## Challenge(s)
1. Scrape the entire table from http://www.imdb.com/title/tt1490017/ by making Selenium press the "See full cast >>" button at the bottom of the table, which opens the full cast list rather than just the first 15 members.
2. Scrape the full cast list from multiple IMDb pages by using the search bar to navigate between the pages.
3. Go to a popular news source and scrape the first few pages of titles.