# Scraping basics for Selenium

If you feel comfortable with scraping, you're free to skip this notebook.

## Part 0: Imports

Import what you need to use Selenium, and start up a new Chrome to use for scraping. You might want to copy from the [Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) page.

**You only need to run `driver = webdriver.Chrome(...)` once.** Every time you run it, you'll open a new Chrome instance. You'll only need to run it again if you close the window (or want another Chrome, for some reason).

In [25]:
import pandas as pd
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
# Selenium 4 expects the chromedriver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))


## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [3]:
driver.get("http://jonathansoma.com/lede/static/by-class.html")

In [4]:
driver.page_source

'<html><head></head><body><h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Some Supplemental Materials</h3>\n<p class="byline">By Jonathan Soma</p></body></html>'

In [5]:
title = driver.find_element(By.CSS_SELECTOR, ".title").text
subhead = driver.find_element(By.CSS_SELECTOR, ".subhead").text
byline = driver.find_element(By.CSS_SELECTOR, ".byline").text

print("title:", title)
print("subhead:", subhead)
print("byline:", byline)

title: How to Scrape Things
subhead: Some Supplemental Materials
byline: By Jonathan Soma
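One thing to watch for when scraping by class: `find_element` raises `NoSuchElementException` when nothing matches, while `find_elements` returns an empty list. Below is a minimal sketch of a helper built on that difference. The names `first_text`, `FakeElement`, and `FakeDriver` are hypothetical; the fake classes only stand in for a real Selenium driver so the sketch runs without a browser.

```python
CSS = "css selector"  # the string value behind Selenium's By.CSS_SELECTOR


def first_text(driver, selector):
    """Return the text of the first match for selector, or None if nothing matches."""
    matches = driver.find_elements(CSS, selector)
    return matches[0].text if matches else None


# Hypothetical stand-ins so the sketch runs without opening Chrome.
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def __init__(self, texts_by_selector):
        self.texts_by_selector = texts_by_selector

    def find_elements(self, by, selector):
        # Mimics Selenium: an empty list when the selector matches nothing
        return [FakeElement(t) for t in self.texts_by_selector.get(selector, [])]


fake = FakeDriver({".title": ["How to Scrape Things"]})
print(first_text(fake, ".title"))
print(first_text(fake, ".missing"))
```

With a real driver you would pass `driver` instead of `fake`; the helper itself does not change.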


## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [6]:
driver.get("http://jonathansoma.com/lede/static/by-tag.html")

In [7]:
driver.page_source

'<html><head></head><body><h1>How to Scrape Things</h1>\n<h3>Some Supplemental Materials</h3>\n<p>By Jonathan Soma</p></body></html>'

In [8]:
title = driver.find_element(By.CSS_SELECTOR, "h1").text
subhead = driver.find_element(By.CSS_SELECTOR, "h3").text
byline = driver.find_element(By.CSS_SELECTOR, "p").text

print("title:", title)
print("subhead:", subhead)
print("byline:", byline)

title: How to Scrape Things
subhead: Some Supplemental Materials
byline: By Jonathan Soma


## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline.

> **This will be important for the next few:** if you scrape multiples, you have a list. Even though it's Selenium, you can use things like `[0]`, `[1]`, `[-1]`, etc. just like you would for a normal list.

In [9]:
driver.get("http://jonathansoma.com/lede/static/by-list.html")

In [10]:
driver.page_source

'<html><head></head><body><p>How to Scrape Things</p>\n<p>Some Supplemental Materials</p>\n<p>By Jonathan Soma</p></body></html>'

In [11]:
tags = driver.find_elements(By.CSS_SELECTOR, "p")
texts = [tag.text for tag in tags]

print("title:", texts[0])
print("subhead:", texts[1])
print("byline:", texts[-1])

title: How to Scrape Things
subhead: Some Supplemental Materials
byline: By Jonathan Soma
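The indexing works because `find_elements` hands back an ordinary Python list. Here is a small sketch of the same idea using a plain list of strings as a stand-in for the scraped `.text` values, so it runs without a browser:

```python
# Stand-in for the texts pulled out of find_elements: an ordinary Python list
texts = ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"]

first = texts[0]       # first match
second = texts[1]      # second match
last = texts[-1]       # last match, however long the list is
first_two = texts[:2]  # slicing works too

print(first)
print(second)
print(last)
print(len(texts))
```

Using `[-1]` for the byline is handy when the last item is the one you want and the number of matches might vary.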


## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [12]:
driver.get("http://jonathansoma.com/lede/static/single-table-row.html")

In [13]:
driver.page_source

'<html><head></head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n</tbody></table></body></html>'

In [14]:
elements = driver.find_elements(By.CSS_SELECTOR, "td")
texts = [element.text for element in elements]

print("title:", texts[0])
print("subhead:", texts[1])
print("byline:", texts[-1])

title: How to Scrape Things
subhead: Some Supplemental Materials
byline: By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [15]:
items = ["title", "subhead", "byline"]

book = dict(zip(items, texts))

In [16]:
book

{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}
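If `dict(zip(...))` looks like magic, this sketch breaks it into steps. It uses the same three texts typed in by hand rather than scraped, and the `book_by_hand` and `short_book` names are only for illustration:

```python
items = ["title", "subhead", "byline"]
texts = ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"]

# zip pairs the two lists up by position: ("title", "How to Scrape Things"), ...
pairs = list(zip(items, texts))

# dict turns those (key, value) pairs into a dictionary
book = dict(pairs)

# The same dictionary written out by hand
book_by_hand = {"title": texts[0], "subhead": texts[1], "byline": texts[2]}

# Caveat: zip stops at the shorter list, so a missing cell silently drops a key
short_book = dict(zip(items, texts[:2]))

print(book)
print(short_book)
```

The caveat matters once you scrape real tables: a row with fewer cells than expected produces a dictionary with fewer keys rather than an error.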

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [17]:
driver.get("https://jonathansoma.com/lede/static/multiple-table-rows.html")

In [18]:
driver.page_source

"<html><head></head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table></body></html>"

In [19]:
items = ["title", "subhead", "byline"]

rows = driver.find_elements(By.CSS_SELECTOR, "tr")

for row in rows:
    elements = row.find_elements(By.CSS_SELECTOR, "td")
    texts = [element.text for element in elements]
    book = dict(zip(items, texts))
    for key, value in book.items():
        print(f"{key}: {value}")
    print('------')

title: How to Scrape Things
subhead: Some Supplemental Materials
byline: By Jonathan Soma
------
title: How to Scrape Many Things
subhead: But, Is It Even Possible?
byline: By Sonathan Joma
------
title: The End of Scraping
subhead: Let's All Use CSV Files
byline: By Amos Nathanos
------


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [20]:
driver.get("https://jonathansoma.com/lede/static/the-actual-table.html")

In [21]:
driver.page_source

'<html><head></head><body><table id="booklist">\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let\'s All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table></body></html>'

In [22]:
items = ["title", "subhead", "byline"]
books = []

rows = driver.find_elements(By.CSS_SELECTOR, "tr")

for row in rows:
    elements = row.find_elements(By.CSS_SELECTOR, "td")
    texts = [element.text for element in elements]
    book = dict(zip(items, texts))
    books.append(book)
    
books

[{'title': 'How to Scrape Things',
  'subhead': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]
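This table has no header row, but many real ones do. A header row built from `<th>` tags contains no `<td>` cells, so the loop above would add an empty dictionary for it. The sketch below shows one way to skip such rows. It assumes a hypothetical header row and uses plain lists of cell texts in place of Selenium elements:

```python
items = ["title", "subhead", "byline"]

# Hypothetical stand-in for the cell texts of each row. The first row is a
# header made of <th> tags, so searching it for "td" finds nothing.
rows_of_texts = [
    [],
    ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"],
    ["How to Scrape Many Things", "But, Is It Even Possible?", "By Sonathan Joma"],
]

books = []
for texts in rows_of_texts:
    # An empty list means the row had no <td> cells, so skip it
    if not texts:
        continue
    books.append(dict(zip(items, texts)))

print(books)
```

If a page holds more than one table, a selector such as `"#booklist tr"` would limit the rows to the table whose `id` is `booklist`.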

## Part 8: Scraping a table into a pandas DataFrame

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [23]:
df = pd.DataFrame(books)
df

|   | title | subhead | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [24]:
df.to_csv("output.csv", index=False)
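pandas is not the only route to a CSV. As a sketch of an alternative, the list of dictionaries from Part 7 can be written with Python's built-in `csv` module. The version below writes to an in-memory buffer so it leaves no file behind, and the `books` list is typed in by hand rather than scraped:

```python
import csv
import io

books = [
    {"title": "How to Scrape Things",
     "subhead": "Some Supplemental Materials",
     "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things",
     "subhead": "But, Is It Even Possible?",
     "byline": "By Sonathan Joma"},
    {"title": "The End of Scraping",
     "subhead": "Let's All Use CSV Files",
     "byline": "By Amos Nathanos"},
]

# In-memory buffer for the sketch; to write a real file, use
# open("output.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "subhead", "byline"])
writer.writeheader()     # first line: the column names
writer.writerows(books)  # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

Fields that contain a comma, like the second subhead, are wrapped in quotes automatically so the columns stay aligned when the file is read back.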