# Selenium:

## Learning Goals:

- Able to install and set up Selenium
- Able to log in to a website
- Able to navigate through pages and pop-ups
- Able to write a scraper that is more "human-like"
- Able to choose the appropriate `find_element(s)_by...` method
- Able to acquire the desired data

---

## First, head over to [this page](https://chromedriver.chromium.org/downloads) and locate the ChromeDriver that matches your Chrome version.

**How to Find Your Internet Browser Version Number - Google Chrome.**

1) Click on the Menu icon in the upper right corner of the screen. 

2) Click on Help, and then About Google Chrome. 

3) Your Chrome browser version number can be found here.

## Next, download the appropriate driver that matches your version of Chrome

- After you have downloaded the driver, press `command` + `spacebar`
- In the Spotlight search you just opened, type `/usr/local/bin/` and open that folder
- Next, in a separate Finder window (`command` + `n`), navigate to where you downloaded the `chromedriver`
- Finally, move the `chromedriver` from wherever you downloaded it into `/usr/local/bin/`

*Technically, you can install the driver anywhere, but most tutorials I have read say to put it in `/usr/local/bin/`*

...However, after a bit of research, I believe the reason we install the `chromedriver` in `/usr/local/bin/` is that this folder is normally on your `PATH`, so you don't have to explicitly state the chromedriver path when you instantiate your driver 😎

## Install Selenium if you have not already done so:

In [None]:
# !pip install selenium

# Please complete the above steps before lecture

In [1]:
import re
import os
import time
import random
import requests
import numpy as np
import pandas as pd
from os import system
from math import floor
from copy import deepcopy
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [2]:
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 200)

---

**Next, I always like to label my driver with a bold title cell**


- I find it helps when we need to re-instantiate our driver and for general organization
- Also, when we copy and paste this code from a notebook to a .py file, we would usually only need one driver

## DRIVER HERE:

In [3]:
driver = webdriver.Chrome()

---

### Note: Headless Browsers

**Headless Browser**
A headless browser is a web browser without a graphical user interface (GUI). It is controlled programmatically, which makes it useful for automation, testing, and similar tasks.

**Why use Headless Browsers?**
Headless browsers have both advantages and disadvantages. A headless browser is not much help for browsing the web yourself, but for automating tasks and tests it's awesome.

**Advantages of Headless Browsers**

Some of the advantages are as follows:

- Headless browsers are typically faster than browsers with a visible window.
    - They still load the CSS and JavaScript and build the HTML DOM, but they skip starting up the browser GUI and painting the page to the screen.
- Performance-wise, speedups of roughly 2x to 15x are sometimes quoted for headless browsers, though the gain depends on the page and the task.

*More info on headless browsers here:* https://stackoverflow.com/questions/53083952/difference-of-headless-browsers-for-automation
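
As a minimal sketch of how headless mode is usually switched on for Chrome: the browser accepts a `--headless` command-line flag, which Selenium passes along through a `ChromeOptions` object. The helper names `headless_arguments` and `make_headless_driver` and the window size are illustrative assumptions, not part of the lecture code.

```python
def headless_arguments(width=1920, height=1080):
    """Command-line flags for a headless Chrome session (the size is arbitrary)."""
    return ["--headless", f"--window-size={width},{height}"]


def make_headless_driver():
    """Start Chrome without a visible window (needs Selenium and chromedriver)."""
    # Imported here so the flag helper above can be tried without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in headless_arguments():
        options.add_argument(flag)
    return webdriver.Chrome(options=options)


print(headless_arguments())
```

Calling `make_headless_driver()` in place of `webdriver.Chrome()` should give a driver that behaves the same way in code but never opens a window.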

## Time to scrape!

<img src = "https://media1.tenor.com/images/3fd84ba4b54f8d299f7732e63cdb3c00/tenor.gif?itemid=11903546" />

### Visiting a webpage

In [5]:
# Visit the website of your choice:

driver.get('https://www.espn.com')

#### Methods for finding a single element 

    This will return the FIRST instance of your desired "element"

* find_element_by_id
* find_element_by_name
* find_element_by_xpath  
* find_element_by_link_text
* find_element_by_partial_link_text
* find_element_by_tag_name
* find_element_by_class_name
* find_element_by_css_selector

---

#### Methods for finding multiple elements

    This will return a list of ALL instances of your desired "element"

* find_elements_by_name
* find_elements_by_xpath
* find_elements_by_link_text
* find_elements_by_partial_link_text
* find_elements_by_tag_name
* find_elements_by_class_name
* find_elements_by_css_selector

From the [Selenium Python Docs](https://selenium-python.readthedocs.io/locating-elements.html "Selenium Docs") 
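
A note for newer installs: Selenium 4 deprecated the `find_element_by_*` helpers listed above, and releases from 4.3 onward dropped them in favor of `find_element(strategy, value)` and `find_elements(strategy, value)`. The strategy is one of the `By` constants imported at the top of this notebook, and each constant is a plain string, so the mapping can be sketched without a browser. The `locate` helper and `StubDriver` below are hypothetical and exist only to show the translation.

```python
# Suffix of the legacy method name -> locator strategy string.
# These strings are the values of By.ID, By.NAME, By.XPATH, and so on.
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def locate(driver, legacy_name, value):
    """Translate e.g. 'find_elements_by_css_selector' into the newer call style."""
    many = legacy_name.startswith("find_elements_by_")
    suffix = legacy_name.split("_by_", 1)[1]
    strategy = STRATEGIES[suffix]
    finder = driver.find_elements if many else driver.find_element
    return finder(strategy, value)


class StubDriver:
    """Stand-in for a real driver so the sketch runs without a browser."""

    def find_element(self, by, value):
        return ("one", by, value)

    def find_elements(self, by, value):
        return [("many", by, value)]


stub = StubDriver()
print(locate(stub, "find_element_by_css_selector", "h1"))
print(locate(stub, "find_elements_by_class_name", "contentItem__title"))
```

With a real driver, `driver.find_element(By.CSS_SELECTOR, 'h1')` is the direct equivalent of `driver.find_element_by_css_selector('h1')`.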

### Selecting the FIRST instance of an "element"

First, we'll check out `.find_element_by_css_selector()`

In [6]:
driver.find_element_by_css_selector('h1')

<selenium.webdriver.remote.webelement.WebElement (session="d7c5dafc6d3c4eed336485ffc6d34e1c", element="c1b8bceb-9889-4d86-9d47-405a55d22627")>

In [7]:
driver.find_element_by_css_selector('h1').text

'ESPN'

### Selecting ALL instances of your desired "element"

In [8]:
listy = driver.find_elements_by_css_selector('h1')

In [9]:
for x in listy[:15]:
    if len(x.text) > 3:
        print(x.text)

ESPN
Customize ESPN
Picking the five best possible 2019 World Series matchups
'The Bogeyman' is back: What Luis Severino expects from himself in his 2019 debut
LATEST ON ESPN+
NFL PrimeTime returns: Boomer and TJ recap Week 2
The Fantasy Show: Week 2 reactions
ESPN FC: Six Premier League questions
Peyton's Places (Ep. 10): Peyton catches up with J.J. Watt
UEFA CHAMPIONS LEAGUE
Week 3 NFL Power Rankings: 1-32 poll, plus the most pleasant surprise for each team
GIANTS BENCH ELI
Daniel Jones is right quarterback for Giants -- right now


### Selecting a specific element (by class name)

Using `.find_element_by_class_name()` to locate an element:

In [10]:
driver.find_element_by_class_name('contentItem__title--hero').text

'Picking the five best possible 2019 World Series matchups'

---

#### Closing the driver:

If you were to just close your driver's browsing window, your Google Chrome instance would still appear open in your Mac's dock. Using `driver.quit()`, we can close the Google Chrome instance, which will also close the driver's browser:

In [12]:
driver.quit()

### Logging into websites

We'll use `.find_element_by_id()` for this example:

In [2]:
from private import *

In [3]:
my_url = 'https://www.facebook.com/'

In [4]:
driver = webdriver.Chrome()
driver.get(my_url)

In [5]:
username = driver.find_element_by_id("email")
password = driver.find_element_by_id("pass")
submit   = driver.find_element_by_id("loginbutton")
  
username.send_keys(FB_USERNAME)
password.send_keys(PASSWORD)

In [8]:
submit.click()

#### Timing

Sometimes we will need to wait for the page to load. Other times, we may want to have our scraper act more like a human, in terms of "click rate."

Two possible ways to make this happen are by using `time.sleep()` or `WebDriverWait()`

If we just want to mimic the behavior of a human, we can use `time.sleep()`:

In [19]:
# Using a single "wait" time:

time.sleep(2)

In [20]:
# Using a randomized time:

sequence = [x/10 for x in range(8, 14)]
print(sequence)

time.sleep(random.choice(sequence))

[0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
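
A continuous alternative to picking from a fixed list can be sketched with `random.uniform`. The helper name `human_pause` and its default bounds are assumptions chosen to match the range above.

```python
import random
import time


def human_pause(low=0.8, high=1.3):
    """Sleep for a random duration between low and high seconds and return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


waited = human_pause(0.01, 0.02)  # tiny bounds so the demo finishes quickly
print(f"Paused for {waited:.3f} seconds")
```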


If we explicitly want to wait for our page to load, we can use `WebDriverWait()`:

In [23]:
wait = WebDriverWait(driver, 5)

try:
    page_loaded = wait.until(lambda driver: driver.current_url == my_url)
    print('The page loaded correctly')
except TimeoutException:
    print("Loading timeout expired")

The page loaded correctly
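
Under the hood, `WebDriverWait(...).until(...)` polls the condition repeatedly until it returns something truthy or the timeout runs out. The loop below is a simplified, browser-free sketch of that idea, not Selenium's actual implementation. The name `wait_until`, its `poll` parameter, and the use of the built-in `TimeoutError` are illustrative assumptions.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll)


# Demo: the condition becomes true on the third check
attempts = {"count": 0}


def page_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(page_ready, timeout=2, poll=0.01))
```

This is also why `WebDriverWait` usually beats a fixed `time.sleep()`: it returns as soon as the condition holds instead of always waiting the full duration.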


In [24]:
driver.quit()

### Ohhhh nooooooo, I can't remember how I named my variables...

And I don't want to open the file elsewhere to check, because that seems inefficient...

We can do something like this:

In [2]:
print(list(locals().keys()))

['__name__', '__doc__', '__package__', '__loader__', '__spec__', '__builtin__', '__builtins__', '_ih', '_oh', '_dh', 'In', 'Out', 'get_ipython', 'exit', 'quit', '_', '__', '___', '_i', '_ii', '_iii', '_i1', '_i2']


The previous output is a bit messy... 

If we are writing a .py file specifically to store "private" variables, I recommend using an all caps syntax. The two reasons I like this are:

    1) This mimics the syntax of ENVIRONMENT_VARIABLES

    2) If we name our private.py file variables with all caps, we can see all our private variable names like this:

In [26]:
for key in list(globals().keys()):
    # Keep only names written entirely in capitals (e.g. FB_USERNAME)
    if key.isupper():
        print(key)

EC
FB_USERNAME
INSTA_USERNAME
PASSWORD


**NOTE:** The variable "`EC`" is present in the list above because of how we imported the `expected_conditions` module up at the top

In [27]:
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/')

time.sleep(3)

In [28]:
# Find the login click button
ig_login_button = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/div[2]/div[2]/p/a')

# Click the button
ig_login_button.click()

# Here, I could use a more elegant sleeper, using WebDriverWait and waiting for the page to load,
# but I'm gonna be lazy 
time.sleep(3)

#### Wait a second... what is that `xpath` thing?

XPath stands for XML Path Language. It is a syntax for locating elements on a web page by writing a path expression over the page's HTML DOM structure. The basic format of an XPath expression is shown in the screenshot below.

<img src='https://www.guru99.com/images/3-2016/032816_0758_XPathinSele1.png' >

An XPath expression describes the path to an element on the web page. The standard syntax for creating an XPath is:

`Xpath=//tagname[@attribute='value']`

- // == Select matching nodes anywhere in the document, no matter where they are.
- Tagname == Tag name of the node you are looking for.
- @ == Select an attribute.
- Attribute == Name of the attribute on that node.
- Value == Value of that attribute.
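
To try the `//tagname[@attribute='value']` pattern without opening a browser, Python's built-in `xml.etree.ElementTree` understands a limited subset of XPath. In that module, a search anywhere below the starting element is written `.//` rather than `//`. The markup below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a login form (hypothetical markup)
snippet = """
<html>
  <body>
    <form id="login">
      <input name="username" type="text" />
      <input name="password" type="password" />
      <button type="submit">Log In</button>
    </form>
  </body>
</html>
"""

root = ET.fromstring(snippet)

# tagname = input, attribute = name, value = password
password_box = root.find(".//input[@name='password']")
print(password_box.attrib["type"])

# A shorter path works too, as long as it still matches only one element
button = root.find(".//form/button")
print(button.text)
```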

<img src='https://media1.giphy.com/media/XBpEStoQ5rftPFA8rh/giphy.gif?cid=790b7611dbcd651cd785fb8382888f7b41666d5c8695755b&rid=giphy.gif'>

**We can perform the next operations a few different ways:**

Similar to above, we could use the `xpath`

Or... based on visual knowledge of inspecting html/css elements, we can see the css selector `input` and we could assume that the only 2 possible inputs are Username and Password

---

With that knowledge, we can define both variables in one line of code


In [30]:
ig_username, ig_password = driver.find_elements_by_css_selector('input')

# ig_username = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[2]/div/label/input')
# ig_password = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[3]/div/label/input')

In [31]:
ig_username.send_keys(INSTA_USERNAME)
ig_password.send_keys(PASSWORD)

In [32]:
# Here is the complete xpath to the login element:

full_login_xpath = '//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[4]/button/div'

ig_submit = driver.find_element_by_xpath(full_login_xpath)

In [None]:
# Sometimes, depending on the HTML layout, we might want to truncate the xpath

ig_submit = driver.find_element_by_xpath('//div[4]/button/div')

In [33]:
ig_submit.click()

# Modal buttons and scrolling:

In [34]:
# Whoah! What's that modal? Dismiss it if it shows up.
try:
    modal_button = driver.find_element_by_class_name("HoLwm")
    modal_button.click()
except Exception:
    # No modal appeared (or it was not clickable), so carry on
    pass

In [13]:
# These websites have modal popups:

driver.get('https://www.nike.com')

# Other options:
# https://www.carbon38.com
# https://www.meundies.com

The following cell is an example of how you could write functions to scroll down the page (for dynamic loading) and for loading more content with "clicks"

In [35]:
# Example: Scroll down (with a test for a modal)

def scroll_down():
    for i in range(1, 10):
        try:
            # Dismiss the modal if it is on screen
            modal_button = driver.find_element_by_class_name("button2")
            webdriver.ActionChains(driver).move_to_element(modal_button).click(modal_button).perform()
            # modal_button.click() also works
        except Exception:
            # No modal this time; pause briefly and keep scrolling
            time.sleep(.5)

        # Scroll to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

        
# Example: Load more content
# Code snippet for context purposes only. We will not run this function:

def get_more(): 
    for i in range(1, 5):
        try:
            next_b = driver.find_element_by_xpath("//*[contains(text(), 'Load next Politics story')]")
            webdriver.ActionChains(driver).move_to_element(next_b).click(next_b).perform()
            time.sleep(.5)
        except Exception:
            print("Page #" + str(i) + " has failed to load")

In [36]:
# Run this cell and watch the page scrollllllll

scroll_down()

In [37]:
driver.quit()

## When to use BeautifulSoup vs. Selenium?

<img src='https://media.giphy.com/media/xTiN0IuPQxRqzxodZm/giphy.gif' width = 400>

<img src='https://media2.giphy.com/media/3o7TKAdOad9Y3eSMZG/giphy.gif?cid=790b761168b43f2be748800602251dce3cad91fcb4c972f9&rid=giphy.gif' width = 400>

<img src = "https://media1.giphy.com/media/8VLgtJqaxIlhu/giphy.gif?cid=790b7611df175494e219b99894f7e717b3ea7bfbf806f9c4&rid=giphy.gif" />

**Just kidding!**

Everything depends on the website and your data goals.

In general:
- If the data only appears after interacting with the page (clicking, scrolling, logging in), go for Selenium.
- Selenium is also the better fit for more complex, JavaScript-heavy pages.

---
- If the data is already in the HTML source (more static pages), soup is the more lightweight tool.
- Soup gives you more control over navigating the HTML tree.

In [42]:
html = requests.get('https://www.skysports.com/premier-league-table')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

# table = bs.find(lambda tag: tag.name=='table' ) 
# rows = table.findAll(lambda tag: tag.name=='tr')

In [43]:
table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

[]
['1', '\nLiverpool\n', '5', '5', '0', '0', '15', '4', '11', '15', '\n\n       \n']
['2', '\nManchester City\n', '5', '3', '1', '1', '16', '6', '10', '10', '\n\n       \n']
['3', '\nTottenham Hotspur\n', '5', '2', '2', '1', '11', '6', '5', '8', '\n\n       \n']
['4', '\nManchester United\n', '5', '2', '2', '1', '8', '4', '4', '8', '\n\n       \n']
['5', '\nLeicester City\n', '5', '2', '2', '1', '6', '4', '2', '8', '\n\n       \n']
['6', '\nChelsea\n', '5', '2', '2', '1', '11', '11', '0', '8', '\n\n       \n']
['7', '\nArsenal\n', '5', '2', '2', '1', '8', '8', '0', '8', '\n\n       \n']
['8', '\nWest Ham United\n', '5', '2', '2', '1', '6', '7', '-1', '8', '\n\n       \n']
['9', '\nBournemouth\n', '5', '2', '1', '2', '8', '9', '-1', '7', '\n\n       \n']
['10', '\nSouthampton\n', '5', '2', '1', '2', '5', '6', '-1', '7', '\n\n       \n']
['11', '\nEverton\n', '5', '2', '1', '2', '5', '7', '-2', '7', '\n\n       \n']
['12', '\nCrystal Palace\n', '5', '2', '1', '2', '3', '6', '-3', '7', '

In [44]:
html = requests.get('https://www.skysports.com/premier-league-table')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

#If you know there is more than one table, you can edit the code to include the proper index:
# table = bs.find_all('table')[0] 

df = pd.read_html(str(table), index_col='Team')
df = df[0].dropna(axis=0, thresh=4)
df

Unnamed: 0_level_0,#,Pl,W,D,L,F,A,GD,Pts,Last 6
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Liverpool,1,5,5,0,0,15,4,11,15,
Manchester City,2,5,3,1,1,16,6,10,10,
Tottenham Hotspur,3,5,2,2,1,11,6,5,8,
Manchester United,4,5,2,2,1,8,4,4,8,
Leicester City,5,5,2,2,1,6,4,2,8,
Chelsea,6,5,2,2,1,11,11,0,8,
Arsenal,7,5,2,2,1,8,8,0,8,
West Ham United,8,5,2,2,1,6,7,-1,8,
Bournemouth,9,5,2,1,2,8,9,-1,7,
Southampton,10,5,2,1,2,5,6,-1,7,


#### Adjusting the header and index:

- Caveat: this uses pandas, not Selenium or Soup

If there is more than one table, pandas reads the html as a list of tables:

In [45]:
df2 = pd.read_html('https://www.sportsmole.co.uk/football/premier-league/2018-19/')

df2

[      0                               1    2    3    4    5    6    7    8  \
 0   NaN                            Team    P    W    D    L    F    A   GD   
 1     C         Manchester CityMan City   38   32    2    4   95   23   72   
 2     2                       Liverpool   38   30    7    1   89   22   67   
 3     3                         Chelsea   38   21    9    8   63   39   24   
 4     4          Tottenham HotspurSpurs   38   23    2   13   67   39   28   
 5     5                         Arsenal   38   21    7   10   73   51   22   
 6     6        Manchester UnitedMan Utd   38   19    9   10   65   54   11   
 7     7   Wolverhampton WanderersWolves   38   16    9   13   47   46    1   
 8     8                         Everton   38   15    9   14   54   46    8   
 9     9         Leicester CityLeicester   38   15    7   16   51   48    3   
 10   10         West Ham UnitedWest Ham   38   15    7   16   52   55   -3   
 11   11                         Watford   38   14  

In [46]:
# Let's check out one of our tables:

df2[0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,,Team,P,W,D,L,F,A,GD,PTS
1,C,Manchester CityMan City,38,32,2,4,95,23,72,98
2,2,Liverpool,38,30,7,1,89,22,67,97
3,3,Chelsea,38,21,9,8,63,39,24,72
4,4,Tottenham HotspurSpurs,38,23,2,13,67,39,28,71
5,5,Arsenal,38,21,7,10,73,51,22,70
6,6,Manchester UnitedMan Utd,38,19,9,10,65,54,11,66
7,7,Wolverhampton WanderersWolves,38,16,9,13,47,46,1,57
8,8,Everton,38,15,9,14,54,46,8,54
9,9,Leicester CityLeicester,38,15,7,16,51,48,3,52


As we can see above, the table's formatting is slightly off...

So we can make adjustments like so:

In [47]:
df2 = pd.read_html('https://www.sportsmole.co.uk/football/premier-league/2018-19/',header=0, index_col=1)

df2[0].columns =  ['final_standings', 'P', 'W', 'D', 'L', 'F', 'A', 'GD', 'PTS']

df2[0]

Unnamed: 0_level_0,final_standings,P,W,D,L,F,A,GD,PTS
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Manchester CityMan City,C,38.0,32.0,2.0,4.0,95.0,23.0,72.0,98.0
Liverpool,2,38.0,30.0,7.0,1.0,89.0,22.0,67.0,97.0
Chelsea,3,38.0,21.0,9.0,8.0,63.0,39.0,24.0,72.0
Tottenham HotspurSpurs,4,38.0,23.0,2.0,13.0,67.0,39.0,28.0,71.0
Arsenal,5,38.0,21.0,7.0,10.0,73.0,51.0,22.0,70.0
Manchester UnitedMan Utd,6,38.0,19.0,9.0,10.0,65.0,54.0,11.0,66.0
Wolverhampton WanderersWolves,7,38.0,16.0,9.0,13.0,47.0,46.0,1.0,57.0
Everton,8,38.0,15.0,9.0,14.0,54.0,46.0,8.0,54.0
Leicester CityLeicester,9,38.0,15.0,7.0,16.0,51.0,48.0,3.0,52.0
West Ham UnitedWest Ham,10,38.0,15.0,7.0,16.0,52.0,55.0,-3.0,52.0


---

### An example where formatting is an issue:

In [48]:
html = requests.get('http://www.nfl.com/stats/team')
nfl_soup = BeautifulSoup(html.content, 'lxml')
table = nfl_soup.table

In [49]:
# Displaying the tag shows its raw HTML; print(table.prettify()) would give an indented version
table

<table border="0" cellpadding="0" cellspacing="0" class="data-table1" summary="This table summarizes the NFL Total Offense Leaders." width="100%">
<thead>
<tr class="thd1">
<td colspan="2">Total Offense (YPG)</td>
<td align="right"><a href="/stats/categorystats?tabSeq=2&amp;offensiveStatisticCategory=GAME_STATS&amp;conference=ALL&amp;role=TM&amp;season=2019&amp;seasonType=REG&amp;d-447263-s=TOTAL_YARDS_GAME_AVG&amp;d-447263-o=2&amp;d-447263-n=1">Complete List</a></td>
</tr>
</thead>
<tbody>
<tr class="tbdy-sorted" id="r1c1_1" onmouseover="nfl.ui.behaviors.thumbs.tabs.team.mouseOver(this, 'r1c1', 'http://i.nflcdn.com/static/site/7.5/img/teams/BAL/BAL_logo-80x90.gif')">
<td class="tbdy-sorted team-logo-container" rowspan="5">
<img id="r1c1_thumb" onerror="nocover(this);" src="http://i.nflcdn.com/static/site/7.5/img/teams/BAL/BAL_logo-80x90.gif"/>
</td>
<td scope="row">
								1. 
								<a href="/teams/baltimoreravens/profile?team=BAL">Baltimore Ravens</a>

In [50]:
nfl = pd.read_html('http://www.nfl.com/stats/team')

nfl

[       Total Offense (YPG)        Complete List  Unnamed: 2
 0                      NaN  1. Baltimore Ravens       541.5
 1        2. Dallas Cowboys                484.0         NaN
 2    3. Kansas City Chiefs                477.5         NaN
 3  4. Los Angeles Chargers                429.5         NaN
 4  5. New England Patriots                423.0         NaN,
              Passing (YPG)          Complete List  Unnamed: 2
 0                      NaN  1. Kansas City Chiefs       405.5
 1    2. Cincinnati Bengals                  343.0         NaN
 2        3. Dallas Cowboys                  333.0         NaN
 3      4. Baltimore Ravens                  318.0         NaN
 4  5. New England Patriots                  310.5         NaN,
             Rushing (YPG)        Complete List  Unnamed: 2
 0                     NaN  1. Baltimore Ravens       223.5
 1   2. Indianapolis Colts                185.0         NaN
 2    3. Minnesota Vikings                185.0         NaN
 3  4. San Fra

In [51]:
# PRO-TIP: if you want to instantiate a new df variable from a previous df or list of dfs,
# making a copy of the df will save you from a headache.
# Note: nfl[3] is the Total Defense table, even though the variable is called `offense`

offense = deepcopy(nfl[3])
offense

Unnamed: 0,Total Defense (YPG),Complete List,Unnamed: 2
0,,1. New England Patriots,246.0
1,2. Baltimore Ravens,274.5,
2,3. Atlanta Falcons,277.5,
3,4. Chicago Bears,292.5,
4,5. Los Angeles Rams,293.5,


In [None]:
# passing = nfl[1].copy()
# passing

In [52]:
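# The first row is shifted one column to the right, so move its values back
cell1 = offense.iloc[0,1]
cell2 = offense.iloc[0,2]
offense.iloc[0,0] = cell1
offense.iloc[0,1] = cell2
offense.drop(['Unnamed: 2'],axis=1,inplace=True)
# nfl[3] holds Total Defense, so name the column accordingly
offense.columns = ['team', 'total_defense_ypg']
# Strip the rank prefix ("1. ") from each team name
offense.team = offense.team.apply(lambda x: x.split('.\xa0')[1] if '.\xa0' in x else x.split('. ')[1])
# Drop the decimals (246.0 -> 246)
offense.total_defense_ypg = offense.total_defense_ypg.astype(float).astype(int)

In [53]:
offense

Unnamed: 0,team,total_defense_ypg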
0,New England Patriots,246
1,Baltimore Ravens,274
2,Atlanta Falcons,277
3,Chicago Bears,292
4,Los Angeles Rams,293
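
The rank prefix can also be stripped with one regular expression that covers both the regular space and the non-breaking space (`\xa0`). This is a sketch of an alternative to the `split`-based lambda, not a change to the lecture code, and `strip_rank` is a hypothetical helper.

```python
import re

# Leading digits, a period, then any whitespace (\s also matches \xa0 in str patterns)
RANK_PREFIX = re.compile(r"^\d+\.\s*")


def strip_rank(label):
    """Remove a leading rank such as '2. ' and return the bare team name."""
    return RANK_PREFIX.sub("", label).strip()


for raw in ["1.\xa0New England Patriots", "2. Baltimore Ravens", "Chicago Bears"]:
    print(strip_rank(raw))
```

Unlike the lambda, this version returns a label unchanged when it has no rank prefix instead of raising an `IndexError`.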


In [54]:
def clean_data(data_list):
    cleaned = []
    data_copy = deepcopy(data_list)
    for data in data_copy:
        # Keep the stat name (first header) and find the spare last column
        col_1, col_3 = data.columns[0], data.columns[-1]
        # The first row is shifted one column to the right, so move it back
        cell1 = data.iloc[0,1]
        cell2 = data.iloc[0,2]
        data.iloc[0,0] = cell1
        data.iloc[0,1] = cell2
        data.drop([col_3],axis=1,inplace=True)
        data.columns = ['team', col_1]
        data['team'] = data['team'].apply(lambda x: x.split('.\xa0')[1] if '.\xa0' in x else x.split('. ')[1])
        # Assign back to the column itself; setting a new attribute would not update the column
        data[col_1] = data[col_1].astype(float)
        cleaned.append(data)
    return cleaned

In [55]:
tables_list = deepcopy(nfl[:6])

In [56]:
tables_list[2]

Unnamed: 0,Rushing (YPG),Complete List,Unnamed: 2
0,,1. Baltimore Ravens,223.5
1,2. Indianapolis Colts,185.0,
2,3. Minnesota Vikings,185.0,
3,4. San Francisco 49ers,178.5,
4,5. Houston Texans,153.0,


In [57]:
clean_tables = clean_data(tables_list)



In [58]:
clean_tables[0]

Unnamed: 0,team,Total Offense (YPG)
0,Baltimore Ravens,541.5
1,Dallas Cowboys,484.0
2,Kansas City Chiefs,477.5
3,Los Angeles Chargers,429.5
4,New England Patriots,423.0


In [59]:
clean_tables[3]

Unnamed: 0,team,Total Defense (YPG)
0,New England Patriots,246.0
1,Baltimore Ravens,274.5
2,Atlanta Falcons,277.5
3,Chicago Bears,292.5
4,Los Angeles Rams,293.5


### The best example of when Selenium is supreme:

When the page's content is rendered by JavaScript, so it is missing from the HTML that `requests` downloads

In [60]:
html = requests.get('http://www.tennisabstract.com/cgi-bin/player.cgi?p=RogerFederer')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

# table = bs.find(lambda tag: tag.name=='table' ) 
# rows = table.findAll(lambda tag: tag.name=='tr')

In [62]:
# bs

In [64]:
# table
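
One quick way to tell whether `requests` plus BeautifulSoup will be enough is to check if the raw HTML already contains the element you are after. When a page builds its tables with JavaScript, the downloaded source has no `<table>` for soup to find. The sketch below counts `<table>` tags with the standard library's `html.parser`. The two HTML strings are made-up stand-ins for a static page and a script-rendered page, and `count_tables` is a hypothetical helper.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html_text):
    counter = TableCounter()
    counter.feed(html_text)
    return counter.tables


static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
dynamic_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(count_tables(static_page))
print(count_tables(dynamic_page))
```

A count of zero on the downloaded source, for a table that is clearly visible in the browser, is a strong hint that Selenium is the right tool.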

In [65]:
url = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=RogerFederer"
driver = webdriver.Chrome()

In [66]:
driver.get(url)

In [67]:
table = driver.find_element_by_id("recent-results")

In [68]:
body = table.find_element_by_css_selector('tbody')

In [69]:
# Table rows usually have the HTML tag 'tr'
rows = body.find_elements_by_css_selector('tr')

In [70]:
len(rows)

25

In [71]:
rows[0].get_attribute('innerHTML')

'<td>26-Aug-2019</td><td><a href="http://www.tennisabstract.com/cgi-bin/tourney.cgi?t=2019US_Open">US Open</a></td><td>Hard</td><td>QF</td><td align="right">3</td><td align="right">78</td><td><a href="http://www.tennisabstract.com/cgi-bin/player.cgi?p=GrigorDimitrov">Grigor Dimitrov</a> [BUL] d. (3)<b>Federer</b></td><td>3-6 6-4 3-6 6-4 6-2</td><td align="right">0.91</td><td align="right">4.2%</td><td align="right">0.7%</td><td align="right">61.1%</td><td align="right">73.9%</td><td align="right">50.0%</td><td align="right">10/15</td><td align="right">3:12</td>'

In [72]:
row_data = rows[0].find_elements_by_css_selector('td')

In [73]:
for e in row_data: 
    print(e.text)

26-Aug-2019
US Open
Hard
QF
3
78
Grigor Dimitrov [BUL] d. (3)Federer
3-6 6-4 3-6 6-4 6-2
0.91
4.2%
0.7%
61.1%
73.9%
50.0%
10/15
3:12


In [74]:
data_list = []
for r in rows: 
    row_list = []
    row_data = r.find_elements_by_css_selector('td')
    for d in row_data: 
        row_list.append(d.text)
    data_list.append(row_list)

In [75]:
data_list[10]

['01-Jul-2019',
 'Wimbledon',
 'Grass',
 'R16',
 '3',
 '20',
 '(2)Federer d. (17)Matteo Berrettini [ITA]',
 '6-1 6-2 6-2',
 '2.77',
 '8.2%',
 '1.6%',
 '68.9%',
 '88.1%',
 '68.4%',
 '1/1',
 '1:14']

In [76]:
len(data_list[10])

16

In [77]:
headers = table.find_element_by_css_selector('thead')
headers.text

'Date Tournament Surface Rd Rk vRk Score DR A% DF% 1stIn 1st% 2nd% BPSvd Time'

In [78]:
columns = headers.text.split(' ')
print(columns)

['Date', 'Tournament', 'Surface', 'Rd', 'Rk', 'vRk', 'Score', 'DR', 'A%', 'DF%', '1stIn', '1st%', '2nd%', 'BPSvd', 'Time']


In [79]:
print('Number of columns:     '+ str(len(columns)))
print()
print('Number of data points: '+ str(len(data_list[0])))

Number of columns:     15

Number of data points: 16
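
A mismatch like this is easy to miss, so a small guard before building the DataFrame can help. The sketch below is one possible check, assuming every row is a plain list of strings. `check_widths` and the demo data are hypothetical.

```python
def check_widths(columns, rows):
    """Return the indexes of rows whose length differs from the header length."""
    expected = len(columns)
    return [i for i, row in enumerate(rows) if len(row) != expected]


demo_columns = ["Date", "Tournament", "Score"]
demo_rows = [
    ["26-Aug-2019", "US Open", "3-6 6-4 3-6 6-4 6-2"],
    ["01-Jul-2019", "Wimbledon", "Grass", "6-1 6-2 6-2"],  # one value too many
]

print(check_widths(demo_columns, demo_rows))
```

An empty list means every row lines up with the header.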


In [80]:
columns = ['Date','Tournament','Surface','Rd','Rk','vRk', 
           'Opponent','Score','DR','A%','DF%','1stIn',
           '1st%','2nd%','BPSvd','Time']

In [81]:
print(data_list[0])

['26-Aug-2019', 'US Open', 'Hard', 'QF', '3', '78', 'Grigor Dimitrov [BUL] d. (3)Federer', '3-6 6-4 3-6 6-4 6-2', '0.91', '4.2%', '0.7%', '61.1%', '73.9%', '50.0%', '10/15', '3:12']


In [82]:
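# Note: data_list[1:] skips the first scraped row (the 26-Aug-2019 quarterfinal printed above);
# pass data_list instead to keep every match
federer_h2h = pd.DataFrame(data_list[1:], columns=columns)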

In [83]:
federer_h2h.head()

Unnamed: 0,Date,Tournament,Surface,Rd,Rk,vRk,Opponent,Score,DR,A%,DF%,1stIn,1st%,2nd%,BPSvd,Time
0,26-Aug-2019,US Open,Hard,R16,3,15,(3)Federer d. (15)David Goffin [BEL],6-2 6-2 6-0,2.18,16.1%,4.8%,67.7%,83.3%,40.0%,5/7,1:19
1,26-Aug-2019,US Open,Hard,R32,3,58,(3)Federer d. Daniel Evans [GBR],6-2 6-2 6-1,2.6,16.4%,1.6%,65.6%,80.0%,71.4%,1/2,1:20
2,26-Aug-2019,US Open,Hard,R64,3,99,(3)Federer d. Damir Dzumhur [BIH],3-6 6-2 6-3 6-4,1.2,13.4%,3.4%,68.9%,76.8%,43.2%,6/8,2:22
3,26-Aug-2019,US Open,Hard,R128,3,190,(3)Federer d. Sumit Nagal [IND],4-6 6-1 6-2 6-4,1.21,9.1%,5.3%,60.6%,71.3%,48.1%,10/13,2:30
4,12-Aug-2019,Cincinnati Masters,Hard,R16,3,70,(Q)Andrey Rublev [RUS] d. (3)Federer,6-3 6-4,0.68,10.2%,4.1%,65.3%,62.5%,52.9%,1/4,1:01


- A slightly different approach:

In [84]:
header = table.find_element_by_css_selector('thead')

header_elements = header.find_elements_by_css_selector('th')

len(header_elements)

16

In [85]:
headers = []

for x in header_elements: 
    headers.append(x.text)
print(headers)

['Date', 'Tournament', 'Surface', 'Rd', 'Rk', 'vRk', '', 'Score', 'DR', 'A%', 'DF%', '1stIn', '1st%', '2nd%', 'BPSvd', 'Time']


## Some other neat stuff:

In [86]:
# Let's take a screenshot! 

driver.get('https://www.nytimes.com')

driver.get_screenshot_as_file('ny_times_front_pg.png')

driver.quit()

In [None]:
# The .get_attribute() method is your friend
# Example code (don't run this):

# element.get_attribute("attribute name")

# attribute_value = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID,
#                                                                 "id_name_here"))).get_attribute("attribute_name_here")

# Example of "complete" scraper:

- Including an example of `.get_attribute()`

In [88]:
driver = webdriver.Chrome()

In [89]:
topic_url_dict = {'15': 'arts', '16':'sports', '24':'sci-tech', '14': 'business',
                  '17': 'international', '13': 'authority'}

topic_codes = list(topic_url_dict.keys())

In [90]:
# Before creating a scraper, examine user interface structure:
# Can we scroll through articles? Do we need to get the article links first?

driver.get('http://www.satirewire.com/')

In [98]:
# <-- STEP 1 -->
# Create a function to scrape the links of articles:

def scrape_links_satirewire(topic_codes):
    
    base_url = "http://www.satirewire.com/content1/?cat="
    link_list = []
    
    for code in topic_codes: 
        url = base_url + code 
        topic = topic_url_dict[code]
        driver.get(url)
        time.sleep(1.18)
        last_page = driver.find_element_by_class_name('pages').text
        last_page_value = int(last_page.split(' of ', 1)[1])
        link_objects1 = driver.find_elements_by_class_name('morelink')
        print('Scraping ', len(link_objects1), topic.upper(), ' article links')
        for link in link_objects1:
            if '#' not in link.get_attribute('href'):
                link_list.append((link.get_attribute('href'), topic))
            else:
                pass       
        for x in range(2, (last_page_value + 1)):
            driver.get(url + '&paged=' + str(x))
            time.sleep(1.08)
            link_objects1 = driver.find_elements_by_class_name('morelink')
            print('Scraping ', len(link_objects1), topic.upper(), ' article links')

            for link in link_objects1:
                if '#' not in link.get_attribute('href'):
                    link_list.append((link.get_attribute('href'), topic))
                else:
                    pass                        
    df = pd.DataFrame()
    df['urls'] = [x[0] for x in link_list]
    df['topics'] = [x[1] for x in link_list]
    df = df.drop_duplicates(subset='urls')
    set_satirewire_urls = [(df['urls'][i], df['topics'][i]) for i in list(df.index)]
    
    print('-----------------------------------')
    print('Total of', len(set_satirewire_urls), 'urls to scrape for articles')
    print('-----------------------------------')
    return set_satirewire_urls                  
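
The pagination step in `scrape_links_satirewire` relies on the pager text containing `' of '` followed by the last page number. A slightly more defensive sketch of that parsing step falls back to a single page when the pattern is missing. It assumes pager text shaped like `'Page 1 of 12'` (the exact wording on the site is an assumption), and `last_page_number` is a hypothetical helper.

```python
import re


def last_page_number(pager_text, default=1):
    """Extract N from text like 'Page 1 of N'; return default if it is absent."""
    match = re.search(r"\bof\s+(\d+)", pager_text)
    return int(match.group(1)) if match else default


print(last_page_number("Page 1 of 12"))
print(last_page_number("No pager here"))
```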

In [99]:
# <-- STEP 2 -->
# Create a helper function to clean up the article's text:

def clean_up_satirewire(dirty_string):
    
    body_clean1 = re.sub(r"\s+", " ", dirty_string)
    body_squeaky = body_clean1.split('Copyright ©', 1)[0]
    sep1 = '(SatireWire) — '
    sep2 = '(SatireWire.com) — '
    sep3 = '(SatireWire.com) – '
    if sep1 in body_squeaky:
        clean = body_squeaky.split(sep1, 1)[1]
    elif sep2 in body_squeaky:
        clean = body_squeaky.split(sep2, 1)[1]
    elif sep3 in body_squeaky:
        clean = body_squeaky.split(sep3, 1)[1]
    else:
        clean = body_squeaky
    return clean 

In [100]:
print('Are these two "dash" characters equal to each other? Answer: ' + str('—' == '–'))

Are these two "dash" characters equal to each other? Answer: False
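
Because the em dash (U+2014) and en dash (U+2013) look alike but compare unequal, the three separators in `clean_up_satirewire` can also be folded into a single pattern. This is a sketch of that alternative using a character class for the two dashes, and `strip_dateline` is a hypothetical name. Note that it also accepts `(SatireWire)` followed by an en dash, a combination the original separators do not cover.

```python
import re
import unicodedata

# '(SatireWire)' or '(SatireWire.com)', then a space, an em or en dash, and a space
DATELINE = re.compile(r"\(SatireWire(?:\.com)?\) [\u2014\u2013] ")


def strip_dateline(text):
    """Return the text after the first dateline, or the text unchanged if none."""
    parts = DATELINE.split(text, maxsplit=1)
    return parts[1] if len(parts) == 2 else text


# Show that the two look-alike characters really are different code points
for dash in ("\u2014", "\u2013"):
    print(repr(dash), unicodedata.name(dash))

print(strip_dateline("NEW YORK (SatireWire.com) \u2013 Markets fell today."))
```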


In [101]:
# <-- STEP 3 -->
# Create a helper function to scrape individual article content:

def scrape_one_article(url, topic, ind, all_urls, all_dates, all_titles, 
                       all_lengths, all_topics1, body_contents, source_id):
    try:
        driver.get(url)
        time.sleep(1.1)
        body = driver.find_element_by_class_name('entry').text
        # Rough length estimate, apparently ~5 characters per word and ~250 words per page
        length = round(len(body) /5/ 250, 1)
        
        if length >= .5:
            if url not in all_urls:                
                date = driver.find_element_by_class_name('entry-date').text
                date = pd.to_datetime(date).date().strftime('%Y-%m-%d')
                title = driver.find_element_by_tag_name('h2').text
                body_squeaky = clean_up_satirewire(body)
                
                all_urls.append(url)
                body_contents.append(body_squeaky)
                all_dates.append(date)
                all_titles.append(title)
                all_lengths.append(length)
                all_topics1.append(topic)

#           --- ADDING CATEGORIES AT A LATER TIME FOR TOPIC MODELING ---
#                 category_dict = find_categories(content, categories)
#                 all_topics1.append(category_dict[0])
#                 all_topics2.append(category_dict[1])
#                 all_topics3.append(category_dict[2])
#                 all_topics4.append(category_dict[3])
#                 all_topics5.append(category_dict[4])

            else:
                print("Duplicate link not added", ind)
        else:
            print('Not worthy of scraping article #', ind)
    except Exception as e:
        print('Nothing to scrape for link #', str(ind), e)
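
The `length` value decides whether an article is "worthy" of scraping, so it is worth seeing the arithmetic on its own. The sketch below assumes the `5` stands for characters per word and the `250` for words per page or per minute of reading. That interpretation is my guess, not something the scraper states. The function name `estimate_length` and the dummy strings are hypothetical:

```python
def estimate_length(text, chars_per_word=5, words_per_unit=250):
    """Same arithmetic as `round(len(body) / 5 / 250, 1)` in the scraper."""
    return round(len(text) / chars_per_word / words_per_unit, 1)


# Dummy bodies, for illustration only
short_body = 'x' * 500     # 500 characters
long_body = 'x' * 2500     # 2500 characters

short_length = estimate_length(short_body)
long_length = estimate_length(long_body)

print(short_length, short_length >= .5)   # below the cutoff, would be skipped
print(long_length, long_length >= .5)     # above the cutoff, would be kept
```

In other words, under this reading the `.5` cutoff drops anything shorter than roughly 625 characters.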

In [102]:
# <-- STEP 4 -->
# Scrape each article's content and populate a dataframe

def scrape_satirewire_articles(urls_list):
    ind = 1
    body_contents = []
    all_urls = []
    all_dates = []
    all_titles = []
    all_lengths = []
    all_topics1 = []
    author = 'Author not specified'
    source_id = 'SatireWire'
    
#     all_topics2 = []
#     all_topics3 = []
#     all_topics4 = []
#     all_topics5 = []
    
    for url, topic in urls_list:
        print('Working on #' + str(ind) + ' of '+ str(len(urls_list)) +' links')
        print()
        scrape_one_article(url, topic, ind, all_urls, all_dates, all_titles, 
                           all_lengths, all_topics1, body_contents, source_id)
        ind += 1    

    df = pd.DataFrame()
    df['body_content'] = body_contents
    df['url'] = all_urls
    df['date'] = all_dates
    df['title'] = all_titles
    df['length'] = all_lengths
    df['topic_1'] = all_topics1
    df['author'] = author
    df['source_id'] = source_id
    df['satire_or_not'] = 'satire'
    df['label'] = 1

# ADDING CATEGORIES AT A LATER TIME FOR TOPIC MODELING    
#     df['topic_1'] = all_topics1
#     df['topic_2'] = all_topics2
#     df['topic_3'] = all_topics3
#     df['topic_4'] = all_topics4
#     df['topic_5'] = all_topics5

    df = df.drop_duplicates()
    df.index = range(len(df.index))

    return df
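
Passing six parallel lists around works, but they only stay aligned as long as every `append` happens together. One alternative is to collect one dict per article and build the table at the end. This is only a sketch of that idea: `make_record`, `seen_urls`, and the sample values are hypothetical, and the code stops at the list of dicts (which `pd.DataFrame(records)` accepts directly):

```python
def make_record(url, topic, title, date, body, length):
    """Bundle one article's fields, plus the constant columns, into a dict."""
    return {
        'body_content': body,
        'url': url,
        'date': date,
        'title': title,
        'length': length,
        'topic_1': topic,
        'author': 'Author not specified',
        'source_id': 'SatireWire',
        'satire_or_not': 'satire',
        'label': 1,
    }


# Hypothetical scraped values, for illustration only
scraped = [
    ('http://example.com/a', 'sports', 'Title A', '2009-05-05', 'Body A', 1.0),
    ('http://example.com/a', 'sports', 'Title A', '2009-05-05', 'Body A', 1.0),
    ('http://example.com/b', 'sports', 'Title B', '2009-05-05', 'Body B', 2.5),
]

records = []
seen_urls = set()
for url, topic, title, date, body, length in scraped:
    if url in seen_urls:
        # Same idea as the "Duplicate link not added" branch
        continue
    seen_urls.add(url)
    records.append(make_record(url, topic, title, date, body, length))

print(len(records), 'records collected')
```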

In [103]:
# Complete scraping function

def scrape_satirewire():
    
    start = time.time()
    
    satirewire_urls = scrape_links_satirewire(['16'])    # <--- FOR TESTING/DEMONSTRATION PURPOSES
#     satirewire_urls = scrape_links_satirewire(topic_codes)

    satirewire_df = scrape_satirewire_articles(satirewire_urls)
    print('The satire scraper took', round(time.time() - start, 1), 'seconds.')  # <---   Can remove before uploading to AWS

    return satirewire_df
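
If you end up timing more than one scraper, the start/stop bookkeeping can be pulled into a small context manager. A minimal sketch using `time.perf_counter` from the standard library; the helper name `timed` and the `time.sleep` stand-in are assumptions for the example, not part of the scraper:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took and expose the value in a dict."""
    timing = {}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        elapsed = time.perf_counter() - start
        timing['seconds'] = elapsed
        print(f'{label} took {elapsed:.1f} seconds.')


# time.sleep stands in for the real scraping call
with timed('The demo step') as timing:
    time.sleep(0.05)
```

Because the print happens in `finally`, the elapsed time is still reported if the block raises an exception.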

In [104]:
satire = scrape_satirewire()

Scraping  20 SPORTS  article links
Scraping  20 SPORTS  article links
Scraping  17 SPORTS  article links
-----------------------------------
Total of 32 urls to scrape for articles
-----------------------------------
Working on #1 of 32 links

Working on #2 of 32 links

Working on #3 of 32 links

Working on #4 of 32 links

Working on #5 of 32 links

Working on #6 of 32 links

Working on #7 of 32 links

Working on #8 of 32 links

Working on #9 of 32 links

Working on #10 of 32 links

Working on #11 of 32 links

Working on #12 of 32 links

Working on #13 of 32 links

Working on #14 of 32 links

Working on #15 of 32 links

Working on #16 of 32 links

Working on #17 of 32 links

Working on #18 of 32 links

Working on #19 of 32 links

Working on #20 of 32 links

Working on #21 of 32 links

Working on #22 of 32 links

Working on #23 of 32 links

Working on #24 of 32 links

Working on #25 of 32 links

Working on #26 of 32 links

Working on #27 of 32 links

Working on #28 of 32 links

Not wort

In [105]:
satire.tail()

| | body_content | url | date | title | length | topic_1 | author | source_id | satire_or_not | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 26 | In a surprising tactical shift, the frustrated European Central Bank announced yesterday it will no longer intervene to bolster the sagging euro, but will instead intervene in English Premier Leag... | http://www.satirewire.com/content1/?p=561 | 2009-05-05 | European Central Bank to Intervene in Football Matches | 1.0 | sports | Author not specified | SatireWire | satire | 1 |
| 27 | As World Cup fever grips the globe, nowhere is the mania for Earth’s greatest sporting event stronger than in the United States, where 280 million soccer-mad Americans are on the emotional edge, u... | http://www.satirewire.com/content1/?p=825 | 2009-05-05 | SOCCER-MAD U.S. CRAZED OVER WORLD CUP | 3.0 | sports | Author not specified | SatireWire | satire | 1 |
| 28 | Saying there could be no greater blow to the enemies of freedom than to see the United States win gold, President George W. Bush today called upon non-U.S. Olympians to unite behind America by fin... | http://www.satirewire.com/content1/?p=880 | 2009-05-05 | BUSH ASKS NON-U.S. OLYMPIANS TO UNITE BEHIND AMERICA BY FINISHING BEHIND AMERICA | 3.6 | sports | Author not specified | SatireWire | satire | 1 |
| 29 | The awe and wonder over miraculous World Series victories by baseball’s New York Yankees were dampened by growing cynicism today as residents said they couldn’t help but notice that God had sudden... | http://www.satirewire.com/content1/?p=916 | 2009-05-05 | SO NOW GOD TAKES AN INTEREST IN ANSWERING NEW YORK’S PRAYERS? | 2.0 | sports | Author not specified | SatireWire | satire | 1 |
| 30 | The National Hockey League opened its regular season Wednesday, offering fans a welcome respite from last month’s stunningly mindless violence by allowing them to sit back and watch players from C... | http://www.satirewire.com/content1/?p=929 | 2009-05-05 | HOCKEY PROVIDES WELCOME RESPITE FROM VIOLENCE | 2.5 | sports | Author not specified | SatireWire | satire | 1 |


In [106]:
satire.body_content[30]

'The National Hockey League opened its regular season Wednesday, offering fans a welcome respite from last month’s stunningly mindless violence by allowing them to sit back and watch players from Calgary to Colorado entertain them with stunningly mindless violence. In Pittsburgh, Penguin right winger Stephane Richer slammed Colorado Avalanche forward Eric Messier into the boards, injuring Messier’s shoulder in a senseless attack that temporarily mitigated memories of the senseless attacks on America on Sept. 11. Colorado won the game 3-1. In Toronto, Maple Leafs’ center Darcy Tucker did his part to distract fans from the seemingly random brutality of a chaotic world by blindsiding Ottawa’s Karel Rachunek, leaving the Senator defenseman motionless on the ice for 10 minutes. Ottawa squeaked out a 5-4 win. Rachunek was taken off on a stretcher. “The disquieting images of New York and Washington are forever encased on the minds of millions, but at some point, we have to let ourselves escap