# Selenium:

## Learning Goals:

- Able to install and set up Selenium
- Able to log in to a website platform
- Able to navigate through pages/pop-ups
- Able to write a scraper that is more "human-like"
- Able to know when to use the appropriate `find_element(s)_by...`
- Able to acquire the desired data

---

## First, head over to [this page](https://chromedriver.chromium.org/downloads) and locate the chromedriver that matches your Chrome version.

**How to Find Your Internet Browser Version Number - Google Chrome.**

1) Click on the Menu icon in the upper right corner of the screen. 

2) Click on Help, and then About Google Chrome. 

3) Your Chrome browser version number can be found here.

## Next, download the appropriate driver that matches your version of Chrome

- After you have downloaded the driver, press `command` + `spacebar`
- In the Spotlight search you just opened, type `/usr/local/bin/` and open that folder
- Next, in a separate Finder window (`command` + `n`), navigate to where you downloaded the `chromedriver`
- Finally, move the `chromedriver` from wherever you downloaded it into `/usr/local/bin/`

*Technically, you can install the driver anywhere, but most tutorials I have read say to put it in `/usr/local/bin/`*

...After a bit of research, the reason appears to be that `/usr/local/bin/` is normally on your shell's `PATH`, so Selenium can find the `chromedriver` without you explicitly stating its path when you instantiate your driver 😎 
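
To check whether the driver can actually be found without an explicit path, here is a minimal sketch using only the standard library. The helper name `find_chromedriver` is made up for this example; `shutil.which` searches the same `PATH` your shell uses.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


driver_path = find_chromedriver()
if driver_path is None:
    print("chromedriver was not found on PATH")
else:
    print("chromedriver found at", driver_path)
```

If this reports that the driver was not found, the move into `/usr/local/bin/` most likely did not work, or that folder is not on your `PATH`.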

## Install Selenium if you have not already done so:

If you can, please use the conda install. I would suggest running it in a terminal rather than in Jupyter.

In [None]:
#!conda install -c conda-forge selenium
#or
#!pip install selenium

In [None]:
import re
import os
import time
import random
import requests
import numpy as np
import pandas as pd
from os import system
from math import floor
from copy import deepcopy
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [None]:
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 200)

---

## DRIVER HERE:

In [None]:
driver = webdriver.Chrome()

---

## Time to scrape!

<img src = "https://media1.tenor.com/images/3fd84ba4b54f8d299f7732e63cdb3c00/tenor.gif?itemid=11903546" />

### Visiting a webpage

In [None]:
# Visit the website of your choice:
my_url = 'https://www.espn.com'
driver.get(my_url)

#### Methods for finding a single element 

    This will return the FIRST instance of your desired "element"

* find_element_by_id
* find_element_by_name
* find_element_by_xpath  
* find_element_by_link_text
* find_element_by_partial_link_text
* find_element_by_tag_name
* find_element_by_class_name
* find_element_by_css_selector

---

#### Methods for finding multiple elements

    This will return a list of ALL instances of your desired "element"

* find_elements_by_name
* find_elements_by_xpath
* find_elements_by_link_text
* find_elements_by_partial_link_text
* find_elements_by_tag_name
* find_elements_by_class_name
* find_elements_by_css_selector

From the [Selenium Python Docs](https://selenium-python.readthedocs.io/locating-elements.html "Selenium Docs") 
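
One caveat if you are on a newer Selenium release: from Selenium 4.3 onward the `find_element(s)_by_*` helpers listed above were removed in favor of `find_element(By.X, value)` and `find_elements(By.X, value)`. The sketch below is only a translation aid, not part of Selenium: `BY_NAMES` and `modern_call` are hypothetical names, and the function just builds the equivalent call as text so you can see the mapping.

```python
# Old helper suffix -> attribute name on selenium.webdriver.common.by.By
BY_NAMES = {
    "id": "ID",
    "name": "NAME",
    "xpath": "XPATH",
    "link_text": "LINK_TEXT",
    "partial_link_text": "PARTIAL_LINK_TEXT",
    "tag_name": "TAG_NAME",
    "class_name": "CLASS_NAME",
    "css_selector": "CSS_SELECTOR",
}


def modern_call(old_method, value):
    """Translate e.g. 'find_element_by_id' into the By-based call, as source text."""
    method, _, suffix = old_method.partition("_by_")
    if method not in ("find_element", "find_elements") or suffix not in BY_NAMES:
        raise ValueError(f"not a find_element(s)_by_* helper: {old_method}")
    return f"driver.{method}(By.{BY_NAMES[suffix]}, {value!r})"


print(modern_call("find_element_by_css_selector", "h1"))
print(modern_call("find_elements_by_class_name", "result_text"))
```

The `By` class is already imported at the top of this notebook (`from selenium.webdriver.common.by import By`), so the translated calls can be pasted into a cell as-is.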

### Selecting the FIRST instance of an "element"

First, we'll check out `.find_element_by_css_selector()`

In [None]:
driver.find_element_by_css_selector('h1')



In [None]:
driver.find_element_by_css_selector('h1').text

### Selecting ALL instances of your desired "element"

In [None]:
listy = driver.find_elements_by_css_selector('h1')

In [None]:
for x in listy[:15]:
    if len(x.text) > 3:
        print(x.text)

---

#### Closing the driver:

If you only close the driver's browsing window, the Google Chrome instance will still appear open in your Mac's dock. Using `driver.quit()`, we can end the Google Chrome instance, which also closes the driver's browser window:

In [None]:
driver.quit()

#### Timing

Sometimes we will need to wait for the page to load. Other times, we may want to have our scraper act more like a human, in terms of "click rate."

Two possible ways to make this happen are by using `time.sleep()` or `WebDriverWait()`

If we just want to mimic the behavior of a human, we can use `time.sleep()`:

In [None]:
# Using a single "wait" time:

time.sleep(5)

In [None]:
# Using a randomized time:

sequence = [x/10 for x in range(8, 14)]
print(sequence)

time.sleep(random.choice(sequence))
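
The randomized wait can also be wrapped in a small helper so it is easy to call between clicks. This is a sketch, not a Selenium feature: `human_pause` is a made-up name, and it draws from a continuous range with `random.uniform` instead of picking from a fixed list.

```python
import random
import time


def human_pause(low=0.8, high=1.3):
    """Sleep for a random duration between low and high seconds; return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


# Short bounds here so the demo cell finishes quickly
print(round(human_pause(0.01, 0.02), 3))
```

Calling `human_pause()` with the defaults gives roughly the same 0.8 to 1.3 second range as the list-based version.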

In [None]:
driver = webdriver.Chrome()

If we explicitly want to wait for our page to load, we can use `WebDriverWait()`:

In [None]:
wait = WebDriverWait(driver, 5)

try:
    page_loaded = wait.until(lambda driver: driver.current_url == 'https://www.espn.com/')
    print('The page loaded correctly')
except TimeoutException:
    print("Loading timeout expired")
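
To make the idea behind an explicit wait concrete, here is a simplified stand-in written with the standard library only. It is not Selenium's implementation; `wait_until` is a hypothetical helper that shows the general pattern of re-checking a condition until it is truthy or the time runs out.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() until it is truthy, or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)  # wait a little before checking again


# Demo: the condition only becomes true on the third check
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(ready, timeout=2.0, poll=0.01)
print(len(attempts))
```

`WebDriverWait(driver, 5).until(...)` follows the same shape: it keeps calling the function you pass in (with the driver as its argument) and raises `TimeoutException` if the condition never holds.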

#### Wait a second... what is that `xpath` thing?

XPath stands for XML Path Language. It is a syntax for locating elements in a document using path expressions, and Selenium can use it to find any element on a web page through the HTML DOM structure. The basic format of an XPath expression is shown in the screenshot below.

<img src='https://www.guru99.com/images/3-2016/032816_0758_XPathinSele1.png' >

An XPath expression describes where an element sits on the web page. The standard syntax is:

`Xpath=//tagname[@attribute='value']`

- `//` == Select matching nodes anywhere in the document, at any depth.
- `tagname` == Tag name of the node to match.
- `@` == Select an attribute.
- `attribute` == Name of the node's attribute.
- `value` == Value of that attribute.
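
The `//tagname[@attribute='value']` pattern can be tried out without opening a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (searches start with `.//` relative to the root), and the HTML snippet is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet standing in for a real page
page = """
<html>
  <body>
    <div id="news-feed">
      <a class="headline" href="/story-1">First story</a>
      <a class="headline" href="/story-2">Second story</a>
      <a class="footer-link" href="/about">About</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)

# .//tagname[@attribute='value']: every <a> below the root whose class is "headline"
headlines = root.findall(".//a[@class='headline']")
print([a.text for a in headlines])

# .//*[@id='news-feed']: any element, whatever its tag, with id="news-feed"
feed = root.find(".//*[@id='news-feed']")
print(feed.tag)
```

In Selenium the same expressions are written without the leading dot, for example `//a[@class='headline']`.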

<img src='https://media1.giphy.com/media/XBpEStoQ5rftPFA8rh/giphy.gif?cid=790b7611dbcd651cd785fb8382888f7b41666d5c8695755b&rid=giphy.gif'>

In [None]:
driver.find_element_by_xpath('//*[@id="news-feed"]/section[1]/header/a/div[1]/div/div[2]/span').text



# Modal buttons and scrolling:

In [None]:
driver = webdriver.Chrome()

In [None]:
# These websites have modal popups:

driver.get('https://nike.com')

# Other options:
# https://www.meundies.com

The following cells show how you can click on buttons and scroll down the page (useful for dynamically loaded content).

In [None]:
modal_button = driver.find_element_by_xpath('//*[@id="gen-nav-commerce-header-v2"]/div[3]/header/div/div[1]/div[2]/nav/div[2]/ul/li[2]/a')
webdriver.ActionChains(driver).move_to_element(modal_button).click(modal_button).perform()

In [None]:
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
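
A single `scrollTo` only reaches the bottom of what has loaded so far. For pages that keep loading content, one common pattern is to scroll repeatedly until the page height stops growing. This is a sketch under that assumption: `scroll_to_bottom`, `pause`, and `max_rounds` are names chosen for this example, and the only driver method it relies on is `execute_script`.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until the page height stops growing; return the final height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the real bottom
        last_height = new_height
    return last_height
```

With an open driver you would call `scroll_to_bottom(driver)`. `max_rounds` keeps the loop from running forever on pages with endless feeds.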

In [None]:
driver.quit()

## When to use BeautifulSoup vs. Selenium?

Everything depends on the website and your data goals.

In general:
- If the data only appears after interaction (clicks, logins, scrolling), then go for Selenium.
- Selenium is also the better fit for more complex, JavaScript-heavy pages.
- If the data is accessible in the HTML structure (more static pages), soup is a more lightweight tool.
- Soup gives you more control over navigating the HTML tree.
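
A quick way to decide between the two is to check whether the data is already in the raw HTML. The sketch below counts `<tr>` tags with the standard library's `HTMLParser`; `RowCounter` and `count_static_rows` are hypothetical names, and the two sample pages are made up. Zero rows in the raw HTML is only a hint that the table is built in the browser, not proof.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> start tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_static_rows(html):
    """Return how many table rows are present in the raw HTML."""
    counter = RowCounter()
    counter.feed(html)
    return counter.rows


static_page = "<table><tr><td>Arsenal</td></tr><tr><td>Chelsea</td></tr></table>"
js_page = "<div id='app'></div><script>/* rows are built in the browser */</script>"

print(count_static_rows(static_page))
print(count_static_rows(js_page))
```

On a real site you would pass in `requests.get(url).text`. If the count is zero but you can see a table in your browser, Selenium is probably the right tool.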

In [None]:
html = requests.get('https://www.skysports.com/premier-league-table')
bs = BeautifulSoup(html.content, 'lxml')
#table = bs.table
table = bs.find('table')


In [None]:
table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

In [None]:
html = requests.get('https://www.skysports.com/premier-league-table')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

#If you know there is more than one table, you can edit the code to include the proper index:
# table = bs.find_all('table')[0] 

df = pd.read_html(str(table), index_col='Team')
df = df[0].dropna(axis=0, thresh=4)
df

#### Adjusting the header and index:

- Caveat: this uses pandas, not Selenium or Soup

If there is more than one table, pandas reads the html as a list of tables:

In [None]:
df2 = pd.read_html('https://www.sportsmole.co.uk/football/premier-league/2018-19/')

df2

In [None]:
# Let's check out one of our tables:

df2[0]

As we can see above, the table's formatting is slightly off...

So we can make adjustments like so:

In [None]:
df2 = pd.read_html('https://www.sportsmole.co.uk/football/premier-league/2018-19/',header=0, index_col=1)

df2[0].columns =  ['final_standings', 'P', 'W', 'D', 'L', 'F', 'A', 'GD', 'PTS']

df2[0]

---

### The best example of when Selenium is supreme:

When the page's content is rendered by JavaScript, so the data is missing from the HTML that `requests` receives

In [None]:
html = requests.get('http://www.tennisabstract.com/cgi-bin/player.cgi?p=RogerFederer')
bs = BeautifulSoup(html.content, 'lxml')

# Look for a table in the raw HTML; guard against there being none at all
table = bs.find('table')
rows = table.find_all('tr') if table is not None else []
rows


In [None]:
bs.find_all('tr')



In [None]:
url = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=RogerFederer"
driver = webdriver.Chrome()

In [None]:
driver.get(url)

In [None]:
table = driver.find_element_by_id("recent-results")

In [None]:
table.text

In [None]:
body = table.find_element_by_css_selector('tbody')

In [None]:
# Table rows use the HTML tag 'tr'
rows = body.find_elements_by_css_selector('tr')

In [None]:
[x.text for x in rows]

In [None]:
rows[0]

In [None]:
#allows us to look at the HTML inside the selenium object
rows[0].get_attribute('innerHTML')

In [None]:
row_data = rows[0].find_elements_by_css_selector('td')

In [None]:
for e in row_data: 
    print(e.text)

In [None]:
data_list = []
for r in rows: 
    row_list = []
    row_data = r.find_elements_by_css_selector('td')
    for d in row_data: 
        row_list.append(d.text)
    data_list.append(row_list)

In [None]:
data_list[10]

In [None]:
len(data_list[10])

In [None]:
headers = table.find_element_by_css_selector('thead')
headers.text

In [None]:
columns = headers.text.split(' ')
print(columns)

In [None]:
print('Number of columns:     '+ str(len(columns)))
print()
print('Number of data points: '+ str(len(data_list[0])))
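
If the two counts printed above disagree, passing the rows straight to `pd.DataFrame` with those column names will fail. One way to inspect the problem first is to pair each row with the column names and set aside rows of the wrong length. This is a sketch with made-up sample values, and `rows_to_records` is a hypothetical helper.

```python
def rows_to_records(columns, rows):
    """Pair each row with the column names; collect rows whose length does not match."""
    records, skipped = [], []
    for row in rows:
        if len(row) == len(columns):
            records.append(dict(zip(columns, row)))
        else:
            skipped.append(row)
    return records, skipped


# Made-up sample data standing in for the scraped rows
sample_columns = ["Date", "Tournament", "Rd"]
sample_rows = [
    ["01-Jul-2019", "Wimbledon", "R128"],
    ["header row with a different shape"],
]

records, skipped = rows_to_records(sample_columns, sample_rows)
print(len(records), "usable rows;", len(skipped), "skipped")
```

Looking at the skipped rows usually shows whether the mismatch comes from a header row, an empty row, or column names that contain spaces.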

In [None]:
columns = ['Date','Tournament','Surface','Rd','Rk','vRk', 
           'Opponent','Score','DR','A%','DF%','1stIn',
           '1st%','2nd%','BPSvd','Time']

In [None]:
print(data_list[0])

In [None]:
federer_h2h = pd.DataFrame(data_list[1:], columns=columns)

In [None]:
federer_h2h.head()

- A slightly different approach:

## Some other neat stuff:

In [None]:
# Let's take a screenshot! 

driver.get('https://www.nytimes.com')

driver.get_screenshot_as_file('ny_times_front_pg.png')

driver.quit()

In [None]:
!ls

# IMDB
Example using `click` and `send_keys`

In [None]:
url = 'https://www.imdb.com/'

In [None]:
driver = webdriver.Chrome()

In [None]:
driver.get(url)

In [None]:
search_bar = driver.find_element_by_xpath('//*[@id="suggestion-search"]')

In [None]:
search_bar.send_keys('Ready Player One')

In [None]:
search_bar.send_keys(Keys.ENTER)

In [None]:
rp1 = [x.find_element_by_tag_name('a').get_attribute('href') for x in driver.find_elements_by_class_name('result_text')][0]

In [None]:
rp1

In [None]:
driver.get(rp1)

In [None]:
f = driver.find_element_by_class_name('title_block').text.split('\n')

In [None]:
f

In [None]:
keys = ['star_rating','num_reviews','Title','rating','length','Genres','Release']

In [None]:
f = f[:-1] + f[-1].split(' | ')
f.pop(2)
f

In [None]:
dict(zip(keys,f))

## Find reviews for a list of movies

In [None]:
#get url for reviews by class name and use driver.get
reviews = driver.find_element_by_class_name('user-comments')
reviews.find_elements_by_tag_name('a')[-1].get_attribute('href')


In [None]:
#or find url by xpath and click on that button
driver.find_element_by_xpath('//*[@id="titleUserReviewsTeaser"]/div/a[2]').click()

In [None]:
# Let's pull out one review and get each item; then we can put it in a loop
x = driver.find_elements_by_class_name('review-container')[0]

In [None]:
#get rating
x.find_element_by_tag_name('span').text

In [None]:
#get title
x.find_element_by_class_name('title').text

In [None]:
#get date
x.find_element_by_class_name('review-date').text

In [None]:
#get the review
x.find_element_by_class_name('content').text

In [None]:
# Now let's put it in a loop
data = []
keys = ['rating', 'title', 'date', 'review']
reviews = driver.find_elements_by_class_name('review-container')
for review in reviews:
    lst = [
        review.find_element_by_tag_name('span').text,
        review.find_element_by_class_name('title').text,
        review.find_element_by_class_name('review-date').text,
        review.find_element_by_class_name('content').text,
    ]
    data.append(dict(zip(keys, lst)))

In [None]:
data

## Beautiful Soup Version

In [None]:
resp = requests.get('https://www.imdb.com/title/tt0360717/reviews?ref_=tt_urv')

In [None]:
bs = BeautifulSoup(resp.content, 'html.parser')

In [None]:
[x.text for x in bs.findAll(class_='content')]

In [None]:
driver.quit()

# YELP
Using Beautiful Soup to parse a list of URLs

In [None]:
resp = requests.get('https://www.yelp.com/biz/bushwick-grind-cafe-brooklyn-2?adjust_creative=XCA5Kc7RlIdeGhJ7qoZSYA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=XCA5Kc7RlIdeGhJ7qoZSYA')

In [None]:
bs = BeautifulSoup(resp.content, 'html.parser')

In [None]:
x = bs.findAll(class_='margin-b5__373c0__2ErL8')

In [None]:
len(x)

In [None]:
x = x[1]

In [None]:
x.find(class_='css-n6i4z7').text

In [None]:
x.find(class_='css-e81eai').text

In [None]:
x.find(class_='raw__373c0__3rcx7').text

In [None]:
resp = requests.get('https://www.yelp.com/biz/grey-cafe-flushing?adjust_creative=XCA5Kc7RlIdeGhJ7qoZSYA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=XCA5Kc7RlIdeGhJ7qoZSYA')

In [None]:
bs = BeautifulSoup(resp.content, 'html.parser')

In [None]:
x = bs.findAll(class_='margin-b5__373c0__2ErL8')[2]

In [None]:
x.find(class_='css-n6i4z7').text

In [None]:
x.find(class_='i-stars__373c0__1T6rz')['aria-label']

In [None]:
x.find(class_='css-e81eai').text

In [None]:
x.find(class_='raw__373c0__3rcx7').text