Selenium
---
<a class="anchor" id="selenium"></a>

Scraping often means visiting many different websites and interacting with them: typing into search boxes, clicking links, and so on. The Selenium package automates this by driving a real browser from Python.

First, we'll need to install it.

On a Mac, try typing into your terminal: 

conda install -c conda-forge selenium

Then type:

which chromedriver

If that prints a path, great. If not (or if you have a PC), download chromedriver directly from

http://chromedriver.storage.googleapis.com/index.html 

After you've downloaded it, type "which chromedriver" into your terminal to make sure it gives you a path. Everyone will have a slightly different path. Mine is "/usr/local/bin/chromedriver". Replace my path with yours in the code below.
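
If you would rather check from Python (for example on a PC, where `which` is not available), the standard library's `shutil.which` performs the same PATH lookup. This is only a small sketch; the message it prints is illustrative wording, not something Selenium produces:

```python
import shutil

# shutil.which searches the directories on PATH, like the shell's `which`.
# It returns the full path as a string, or None if nothing is found.
chromedriver = shutil.which("chromedriver")

if chromedriver is None:
    print("chromedriver was not found on PATH; download it and add its folder to PATH")
else:
    print(chromedriver)
```

If a path is printed, that is the string to use for `chromedriver` in the next cell.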

In [11]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

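# Type `which chromedriver` in your shell to find chromedriver, then change the next line!
chromedriver = "/usr/local/bin/chromedriver"
# Note: this notebook uses the Selenium 3 API. Selenium 4 takes the path as
# webdriver.Chrome(service=Service(chromedriver)) and replaced the
# find_element_by_* helpers with find_element(By.ID, ...) and friends.
driver = webdriver.Chrome(chromedriver)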
driver.get("https://images.google.com")

What happened? A browser window should have popped up and gone to the Google Images site. Whoa! What if we want to search for Kanye West pics? We'll need to type something into the search text box. Once again, to find out how to reference the text box, we'll right click on it and go to Inspect. It seems to be referenced by id=lst-ib, so let's tell Selenium to look for that:

In [12]:
search_box = driver.find_element_by_id("lst-ib")
print(search_box)

<selenium.webdriver.remote.webelement.WebElement (session="f4ab5ac08cccb88e640bdfde0d488ca7", element="0.6447736553766905-1")>


Okay, it found it. Now let's tell Selenium to type "Kanye West" into the search box:

In [13]:
search_box.send_keys("Kanye West")

Did you see what just happened in your Google browser window? Now, let's press enter to complete our search:

In [14]:
search_box.send_keys(Keys.RETURN)

What if we want to download the first image that comes up? Right click on it and press Inspect to see that its name is "hN6qb_8t9e0lcM:". Let's tell Selenium to search for this name:

In [15]:
search = driver.find_element_by_name('hN6qb_8t9e0lcM:')

Let's click on this pic:

In [16]:
search.click()

Now we are on a new page. If we wanted to know the url of the page we are currently on, we could type:

In [18]:
driver.current_url

'https://www.google.com/search?tbm=isch&source=hp&biw=1200&bih=672&ei=tPimWoWRIMPk0gKEsIugCw&q=Kanye+West&oq=Kanye+West&gs_l=img.3..0l10.9153.9228.0.52295.10.2.0.0.0.0.82.138.2.2.0....0...1ac..64.img..8.2.136....0.5WPJdzhJvVg#imgrc=hN6qb_8t9e0lcM:'
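
The URL above carries the search term in its query string and the clicked image's name in its fragment. As a rough sketch, the standard library's `urllib.parse` can pull both out. The `current_url` string below is a trimmed copy of the output above (most parameters dropped for readability), not a live value from the driver:

```python
from urllib.parse import urlsplit, parse_qs

# Trimmed copy of the URL printed above; in the notebook you would use
# driver.current_url instead of this hard-coded string.
current_url = "https://www.google.com/search?tbm=isch&q=Kanye+West#imgrc=hN6qb_8t9e0lcM:"

parts = urlsplit(current_url)
query = parse_qs(parts.query)        # parameters after the "?"
fragment = parse_qs(parts.fragment)  # parameters after the "#"

search_term = query["q"][0]
image_name = fragment["imgrc"][0]
print(search_term)
print(image_name)
```

Note that `parse_qs` decodes the `+` in `Kanye+West` back into a space, and that the image name matches the one we passed to Selenium earlier.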

Okay. This new image is a little harder to locate using id or class. Instead, we'll use XPath, which is typically the easiest way to search for something on a webpage. To do this, right click on the Kanye image, choose Inspect, and look at the blue text that gets highlighted. Now, right click on the blue text and choose "Copy - XPath". It should look like this:

<img src="xpath.jpg" style="width: 200px;"/>

Now, we can paste what we just copied into this new search:

In [19]:
# An XPath starting with // searches the whole page, so call it on the driver
pic = driver.find_element_by_xpath('//*[@id="irc_cc"]/div[2]/div[1]/div[2]/div[2]/a/img')
pic


<selenium.webdriver.remote.webelement.WebElement (session="f4ab5ac08cccb88e640bdfde0d488ca7", element="0.6236586095908558-2")>
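
To get a feel for how an XPath like this walks down the page, here is a small offline sketch using the standard library's `xml.etree.ElementTree`. The markup is a made-up, simplified stand-in for the Google page (only the `irc_cc` id is borrowed from above), and ElementTree supports just a subset of XPath and needs well-formed markup, so treat it as an illustration rather than a replacement for Selenium:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page structure for illustration only.
page = """
<html><body>
  <div id="irc_cc">
    <div>first child</div>
    <div>
      <div><a href="#"><img src="kanye.jpg"/></a></div>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(page)

# Read the path left to right: find any element with id="irc_cc", take its
# second <div> child, then that element's first <div>, then <a>, then <img>.
# ElementTree requires the leading "." before "//".
img = root.find(".//*[@id='irc_cc']/div[2]/div[1]/a/img")
print(img.get("src"))
```

Each `/` moves one level down, and a number in square brackets picks the n-th child with that tag, counting from 1.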

We can use the .get_attribute method to read the image's URL from its src attribute:

In [20]:
url = pic.get_attribute('src')
print(url)

https://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Kanye_West_at_the_2009_Tribeca_Film_Festival.jpg/1200px-Kanye_West_at_the_2009_Tribeca_Film_Festival.jpg


How can we download this picture? We can use requests to communicate with this url and write its contents to a file called kanye.jpg:

In [21]:
import requests

response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with open('kanye.jpg', 'wb') as f:
    f.write(response.content)
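
If you download many images, hard-coding a name like kanye.jpg for each one gets tedious. One possible approach, sketched below, is to take the file name from the last part of the URL's path. The helper `filename_from_url` is a hypothetical name introduced here for illustration; it is not part of requests or Selenium:

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, unquote

def filename_from_url(url):
    """Return the last path segment of a URL, e.g. 'pic.jpg'."""
    # urlsplit drops the scheme, host, query, and fragment from the path;
    # unquote turns escapes such as %20 back into plain characters.
    path = unquote(urlsplit(url).path)
    return PurePosixPath(path).name

image_url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/1/11/"
    "Kanye_West_at_the_2009_Tribeca_Film_Festival.jpg/"
    "1200px-Kanye_West_at_the_2009_Tribeca_Film_Festival.jpg"
)
filename = filename_from_url(image_url)
print(filename)
```

The result could then be passed to `open(filename, 'wb')` in place of the fixed name.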

### Exercise - Selenium 1

Use Selenium to go to http://www.boxofficemojo.com/movies/?id=matrix.htm and print the Domestic Total Gross.


In [22]:
chromedriver = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(chromedriver)
matrix_url = "http://www.boxofficemojo.com/movies/?id=matrix.htm"
driver.get(matrix_url)
gross_selector = '//font[contains(text(), "Domestic")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

$171,479,930
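
The `.text` attribute gives us a string, so if you want to do arithmetic with the gross you will need to convert it to a number first. A minimal sketch, assuming the text always looks like the output above (a dollar sign followed by comma-separated digits):

```python
# The string printed above; in the notebook it would come from
# driver.find_element_by_xpath(gross_selector).text
gross_text = "$171,479,930"

# Strip the dollar sign and the thousands separators, then convert.
gross = int(gross_text.replace("$", "").replace(",", ""))
print(gross)
```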


### Exercise - Selenium 2

Use Selenium to go to http://www.boxofficemojo.com/movies/?id=matrix.htm and print the genres.
Hint: driver.find_elements_by_xpath returns a list that may contain more than one element, so you might need a loop.

In [23]:
chromedriver = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(chromedriver)
matrix_url = "http://www.boxofficemojo.com/movies/?id=matrix.htm"
driver.get(matrix_url)
genre_selector = '//a[contains(@href, "/genres/chart/")]/b'
for genre_anchor in driver.find_elements_by_xpath(genre_selector):
    print(genre_anchor.text)

Action - Wire-Fu
Man vs. Machine
Post-Apocalypse
Virtual Reality


### Exercise - Selenium 3
Use Selenium and XPath to click on the tab that says "Similar Movies." Print the new url location.

In [24]:
chromedriver = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(chromedriver)
matrix_url = "http://www.boxofficemojo.com/movies/?id=matrix.htm"
driver.get(matrix_url)
similar_selector = '//*[@id="body"]/table[2]/tbody/tr/td/table[2]/tbody/tr[1]/td/ul/li[6]/a'
driver.find_element_by_xpath(similar_selector).click()
print(driver.current_url)

http://www.boxofficemojo.com/movies/?page=similar&id=matrix.htm
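
Notice that the new URL differs from the movie page only by an extra `page=similar` parameter. As an alternative to clicking, you could build such a URL yourself and pass it to `driver.get`. This sketch assumes the URL pattern shown in the output above, which the site may have changed since:

```python
from urllib.parse import urlencode

base = "http://www.boxofficemojo.com/movies/"

# Query parameters taken from the URL printed above.
params = {"page": "similar", "id": "matrix.htm"}

# urlencode joins the pairs with "&" and escapes any special characters.
similar_url = base + "?" + urlencode(params)
print(similar_url)
```

Swapping in a different `id` value would give the similar-movies page for another film, under the same assumption about the URL pattern.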


### Exercise - Selenium 4
Use pd.read_html to read in the table of similar movies from the page you are now on.

In [25]:
import pandas as pd
tables = pd.read_html(driver.current_url)
tables[2]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,461,462,463,464,465,466,467,468,469,470
0,"The Matrix Domestic Total Gross: $171,479,930...",,"The Matrix Domestic Total Gross: $171,479,930...","Domestic Total Gross: $171,479,930Distributor:...","Domestic Total Gross: $171,479,930",Distributor: Warner Bros.,"Release Date: March 31, 1999",Genre: Sci-Fi Action,Runtime: 2 hrs. 16 min.,MPAA Rating: R,...,1112.0,"$4,020,663",1005.0,10/26/84,Averages:,"$146,573,152",3319.0,"$48,255,848",3301.0,
1,,"The Matrix Domestic Total Gross: $171,479,930...","Domestic Total Gross: $171,479,930Distributor:...","Domestic Total Gross: $171,479,930",Distributor: Warner Bros.,"Release Date: March 31, 1999",Genre: Sci-Fi Action,Runtime: 2 hrs. 16 min.,MPAA Rating: R,Production Budget: $63 million,...,,,,,,,,,,
2,"Domestic Total Gross: $171,479,930Distributor:...","Domestic Total Gross: $171,479,930",Distributor: Warner Bros.,"Release Date: March 31, 1999",Genre: Sci-Fi Action,Runtime: 2 hrs. 16 min.,MPAA Rating: R,Production Budget: $63 million,,,...,,,,,,,,,,
3,"Domestic Total Gross: $171,479,930",,,,,,,,,,...,,,,,,,,,,
4,Distributor: Warner Bros.,"Release Date: March 31, 1999",,,,,,,,,...,,,,,,,,,,
5,Genre: Sci-Fi Action,Runtime: 2 hrs. 16 min.,,,,,,,,,...,,,,,,,,,,
6,MPAA Rating: R,Production Budget: $63 million,,,,,,,,,...,,,,,,,,,,
7,Title (click to view),Studio,Release Gross* / Theaters,Opening / Theaters,Date^,,,,,,...,,,,,,,,,,
8,Self/Less,Focus,"$12,279,691",2353,"$5,403,460",2353,7/10/15,,,,...,,,,,,,,,,
9,Tomorrowland,BV,"$93,436,322",3972,"$33,028,165",3972,5/22/15,,,,...,,,,,,,,,,


### Exercise - Selenium 5

Use Selenium to go to http://www.imdb.com/ and type "Kanye West" into the search box. You won't quite be there yet because two names are listed. Use Selenium again to click on Kanye West.

In [26]:
chromedriver = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(chromedriver)

url = "http://www.imdb.com"
driver.get(url)

query = driver.find_element_by_id("navbar-query")
query.send_keys("Kanye West")
query.send_keys(Keys.RETURN)
name_selector = '//*[@id="main"]/div/div[2]/table/tbody/tr[1]/td[2]/a'
driver.find_element_by_xpath(name_selector).click()