# Scraping basics for Selenium

If you feel comfortable with scraping, you're free to skip this notebook.

## Part 0: Imports

Import what you need to use Selenium, and start up a new Chrome to use for scraping. You might want to copy from the [Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) page.

**You only need to run `driver = webdriver.Chrome(...)` once.** Every time you run it, you'll open a new Chrome instance. You'll only need to run it again if you close the window (or want another Chrome, for some reason).

In [6]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.service import Service as BraveService

from webdriver_manager.core.utils import ChromeType
from webdriver_manager.chrome import ChromeDriverManager

In [7]:
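# Point Selenium at Brave, a Chromium-based browser (this path is macOS-specific)
options = webdriver.ChromeOptions()
options.binary_location = "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"

# Let webdriver_manager download a matching driver, then launch the browser
service = BraveService(ChromeDriverManager(chrome_type=ChromeType.BRAVE).install())
driver = webdriver.Chrome(options=options, service=service)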




## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [8]:
driver.get('http://jonathansoma.com/lede/static/by-class.html')

In [9]:
title = driver.find_element(By.CLASS_NAME, 'title').text
print(title)

How to Scrape Things


In [10]:
subhead = driver.find_element(By.CLASS_NAME, 'subhead').text
print(subhead)

Some Supplemental Materials


In [11]:
byline = driver.find_element(By.CLASS_NAME, 'byline').text
print(byline)

By Jonathan Soma


## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [12]:
driver.get('http://jonathansoma.com/lede/static/by-tag.html')

In [13]:
title = driver.find_element(By.CSS_SELECTOR, 'h1').text
print(title)

How to Scrape Things


In [14]:
subhead = driver.find_element(By.CSS_SELECTOR, 'h3').text
print(subhead)

Some Supplemental Materials


In [15]:
byline = driver.find_element(By.CSS_SELECTOR, 'p').text
print(byline)

By Jonathan Soma


## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline.

> **This will be important for the next few:** if you scrape multiples, you have a list. Even though it's Selenium, you can use things like `[0]`, `[1]`, `[-1]`, etc. just like you would for a normal list.
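
As a minimal sketch of the note above, the snippet below uses a plain list of strings as a stand-in for what `find_elements` gives back. The name `texts` is made up for illustration; with Selenium each item would be an element, and you would read its `.text`.

```python
# Hypothetical stand-in for [el.text for el in driver.find_elements(...)]
texts = ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"]

first = texts[0]    # first match
second = texts[1]   # second match
last = texts[-1]    # last match, counting from the end

print(first)
print(second)
print(last)

# When nothing matches, find_elements returns an empty list,
# so check the length before indexing into it
no_matches = []
if len(no_matches) == 0:
    print("nothing matched")
```

Indexing an empty list raises an `IndexError`, which is why the length check is worth doing on pages you don't control.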

In [16]:
driver.get('http://jonathansoma.com/lede/static/by-list.html')

In [17]:
elements = driver.find_elements(By.CSS_SELECTOR, 'p')

title = elements[0].text
print(title)

How to Scrape Things


In [18]:
subhead = elements[1].text
print(subhead)

Some Supplemental Materials


In [19]:
byline = elements[2].text
print(byline)

By Jonathan Soma


## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [20]:
driver.get('http://jonathansoma.com/lede/static/single-table-row.html')

In [21]:
elements = driver.find_elements(By.CSS_SELECTOR, 'td')

title = elements[0].text
print(title)

How to Scrape Things


In [22]:
subhead = elements[1].text
print(subhead)

Some Supplemental Materials


In [23]:
byline = elements[2].text
print(byline)

By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [24]:
driver.get('http://jonathansoma.com/lede/static/single-table-row.html')

In [25]:
elements = driver.find_elements(By.CSS_SELECTOR, 'td')

book = {}

book['title'] = elements[0].text
book['subhead'] = elements[1].text
book['byline'] = elements[2].text

book

{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}
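
One possible alternative to assigning each key by hand, sketched here with plain strings standing in for the scraped cell text (the names `texts` and `keys` are assumptions, not part of the exercise): `zip` pairs each key with the value at the same position, and `dict` turns those pairs into a dictionary.

```python
# Hypothetical stand-in for [el.text for el in elements]
texts = ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"]
keys = ["title", "subhead", "byline"]

# zip pairs each key with the text at the same position
book = dict(zip(keys, texts))
print(book)
```

This only works when the cells always appear in the same order as `keys`, which is the case for the table used here.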

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [26]:
driver.get('http://jonathansoma.com/lede/static/multiple-table-rows.html')

In [27]:
books = driver.find_elements(By.CSS_SELECTOR, 'tr')

for book in books:
    # Grab the row's cells once instead of re-querying for each field
    cells = book.find_elements(By.CSS_SELECTOR, 'td')
    title = cells[0].text
    subhead = cells[1].text
    byline = cells[2].text
    
    print(title)
    print(subhead)
    print(byline)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
How to Scrape Many Things
But, Is It Even Possible?
By Sonathan Joma
The End of Scraping
Let's All Use CSV Files
By Amos Nathanos


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [28]:
driver.get('http://jonathansoma.com/lede/static/the-actual-table.html')

In [29]:
books = driver.find_elements(By.CSS_SELECTOR, 'tr')

book_list = []
for book in books:
    # Grab the row's cells once instead of re-querying for each field
    cells = book.find_elements(By.CSS_SELECTOR, 'td')
    title = cells[0].text
    subhead = cells[1].text
    byline = cells[2].text
    
    record = {
        'title': title,
        'subhead': subhead,
        'byline': byline
    }
    
    book_list.append(record)
    
book_list

[{'title': 'How to Scrape Things',
  'subhead': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]
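
The same list of dictionaries can be built more compactly. This is a sketch under assumptions: `rows` is a made-up stand-in where each inner list holds the cell text from one `tr`, and the empty first row imitates a header row that has no `td` cells (the real page may not have one).

```python
# Hypothetical stand-in: each inner list is the td text from one table row
rows = [
    [],  # e.g. a header row, which would contain no td cells
    ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"],
    ["How to Scrape Many Things", "But, Is It Even Possible?", "By Sonathan Joma"],
    ["The End of Scraping", "Let's All Use CSV Files", "By Amos Nathanos"],
]
keys = ["title", "subhead", "byline"]

# Build one dictionary per row, skipping rows without the expected number of cells
records = [dict(zip(keys, cells)) for cells in rows if len(cells) == len(keys)]

for record in records:
    print(record)
```

The length check is a defensive choice: without it, a row with too few cells would silently produce a dictionary with missing keys.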

## Part 8: Scraping a table into a pandas DataFrame

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas; the other uses the result from Part 7.

In [30]:
driver.get('http://jonathansoma.com/lede/static/the-actual-table.html')

In [31]:
import pandas as pd

pd.DataFrame(book_list)

|   | title | subhead | byline |
|---|-------|---------|--------|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


In [32]:
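from io import StringIO

# Wrap the HTML in StringIO; newer pandas versions deprecate passing a raw HTML string
pd.read_html(StringIO(driver.page_source))[0]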

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`.

In [33]:
driver.get('http://jonathansoma.com/lede/static/the-actual-table.html')

In [34]:
books = driver.find_elements(By.CSS_SELECTOR, 'tr')

book_list = []
for book in books:
    # Grab the row's cells once instead of re-querying for each field
    cells = book.find_elements(By.CSS_SELECTOR, 'td')
    title = cells[0].text
    subhead = cells[1].text
    byline = cells[2].text
    
    record = {
        'title': title,
        'subhead': subhead,
        'byline': byline
    }
    
    book_list.append(record)

df = pd.DataFrame(book_list)

In [35]:
df.to_csv('output.csv', index=False)
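
If you'd rather not go through pandas, the standard library's `csv.DictWriter` can write a list of dictionaries directly. The sketch below is an alternative, not the notebook's method: it hard-codes the scraped records and writes to an in-memory buffer (the name `buffer` is an assumption) so the result can be inspected.

```python
import csv
import io

# Same shape as the list of dictionaries built above
book_list = [
    {"title": "How to Scrape Things",
     "subhead": "Some Supplemental Materials",
     "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things",
     "subhead": "But, Is It Even Possible?",
     "byline": "By Sonathan Joma"},
    {"title": "The End of Scraping",
     "subhead": "Let's All Use CSV Files",
     "byline": "By Amos Nathanos"},
]

# In-memory file; to write to disk, use open('output.csv', 'w', newline='') instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "subhead", "byline"])
writer.writeheader()          # column names as the first row
writer.writerows(book_list)   # one CSV row per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

Fields that contain a comma, like the second subhead, are quoted automatically, so the file still has three columns per row.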

In [36]:
driver.quit()