# Scraping basics for Selenium

If you feel comfortable with scraping, you're free to skip this notebook.

## Part 0: Imports

Import what you need to use Selenium, and start up a new Chrome to use for scraping. You might want to copy from the [Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) page.

**You only need to run `driver = webdriver.Chrome(...)` once.** Every time you run it, you'll open a new Chrome instance, so you'll only need to run it again if you close the window (or want another Chrome, for some reason).

In [1]:
import pandas as pd

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

from webdriver_manager.chrome import ChromeDriverManager

In [2]:
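from selenium.webdriver.chrome.service import Service

# Selenium 4 expects the driver path wrapped in a Service object;
# passing the path directly is deprecated
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))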



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [C:\Users\nao22\.wdm\drivers\chromedriver\win32\96.0.4664.45\chromedriver.exe] found in cache


## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.
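
`select` gives you a list of matching tags, so printing just the text takes one more step. Here is a minimal sketch, assuming an inline snippet (`SAMPLE_HTML`, a made-up stand-in built from the tags shown in the outputs below) instead of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for by-class.html; the real page's markup may differ.
SAMPLE_HTML = """
<h1 class="title">How to Scrape Things</h1>
<h3 class="subhead">Some Supplemental Materials</h3>
<p class="byline">By Jonathan Soma</p>
"""

sample_doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one returns the first match (or None) instead of a list
title_text = sample_doc.select_one(".title").get_text(strip=True)
subhead_text = sample_doc.select_one(".subhead").get_text(strip=True)
byline_text = sample_doc.select_one(".byline").get_text(strip=True)

print(title_text)
print(subhead_text)
print(byline_text)
```

If a class might be missing from the page, check for `None` before calling `get_text`.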

In [3]:
driver.get("http://jonathansoma.com/lede/static/by-class.html")

In [4]:
from bs4 import BeautifulSoup
import requests

response = requests.get("http://jonathansoma.com/lede/static/by-class.html")
# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn)
doc = BeautifulSoup(response.text, "html.parser")

In [5]:
doc.select('.title')

[<h1 class="title">How to Scrape Things</h1>]

In [6]:

doc.select('.subhead')

[<h3 class="subhead">Some Supplemental Materials</h3>]

In [7]:
doc.select('.byline')

[<p class="byline">By Jonathan Soma</p>]

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [8]:
driver.get("http://jonathansoma.com/lede/static/by-tag.html")

response = requests.get("http://jonathansoma.com/lede/static/by-tag.html")
doc_t = BeautifulSoup(response.text, "html.parser")

In [9]:
doc_t.find('h1').string

'How to Scrape Things'

In [10]:
doc_t.find('h3').string

'Some Supplemental Materials'

In [11]:
doc_t.find('p').string

'By Jonathan Soma'

## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline.

> **This will be important for the next few:** if you scrape multiples, you have a list. Even though it's Selenium, you can use things like `[0]`, `[1]`, `[-1]`, etc. just like you would for a normal list.
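
To see the list behavior in isolation, here is a small sketch using BeautifulSoup on a made-up snippet (`SAMPLE_HTML` is an assumption, not the real page). Selenium's `find_elements` hands back a plain list too, so the same indexing applies there.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for by-list.html: three <p> tags in a row.
SAMPLE_HTML = """
<p>How to Scrape Things</p>
<p>Some Supplemental Materials</p>
<p>By Jonathan Soma</p>
"""

sample_doc = BeautifulSoup(SAMPLE_HTML, "html.parser")
paragraphs = sample_doc.find_all("p")  # behaves like a normal list

first_text = paragraphs[0].string   # first match
last_text = paragraphs[-1].string   # negative indexes count from the end

print(len(paragraphs))
print(first_text)
print(last_text)
```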

In [12]:
driver.get("http://jonathansoma.com/lede/static/by-list.html")

response = requests.get("http://jonathansoma.com/lede/static/by-list.html")
doc_st = BeautifulSoup(response.text, "html.parser")

In [13]:
doc_st.find_all('p')[0].string

'How to Scrape Things'

In [14]:
doc_st.find_all('p')[1].string

'Some Supplemental Materials'

In [15]:
doc_st.find_all('p')[2].string

'By Jonathan Soma'

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [16]:
driver.get("http://jonathansoma.com/lede/static/single-table-row.html")

response = requests.get("http://jonathansoma.com/lede/static/single-table-row.html")
doc_table = BeautifulSoup(response.text, "html.parser")

In [22]:
doc_table.find_all('td')[0].string

'How to Scrape Things'

In [23]:
doc_table.find_all('td')[1].string

'Some Supplemental Materials'

In [24]:
doc_table.find_all('td')[2].string

'By Jonathan Soma'

## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!
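
One compact way to build the dictionary is to pair a list of keys with the cells in the same order. This is only a sketch: `SAMPLE_ROW` is a made-up one-row table, and `sample_book` is named differently so it won't clash with your own `book`.

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table; the real page's markup may differ.
SAMPLE_ROW = """
<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Some Supplemental Materials</td>
    <td>By Jonathan Soma</td>
  </tr>
</table>
"""

cells = BeautifulSoup(SAMPLE_ROW, "html.parser").find_all("td")

# Pair each key with the text of the cell in the same position
keys = ["title", "subhead", "byline"]
sample_book = dict(zip(keys, [cell.get_text(strip=True) for cell in cells]))

print(sample_book)
```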

In [30]:

book = {}
book['title'] = doc_table.find_all('td')[0].string
book['subhead'] = doc_table.find_all('td')[1].string
book['byline'] = doc_table.find_all('td')[2].string

book

{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [33]:
driver.get("http://jonathansoma.com/lede/static/multiple-table-rows.html")

response = requests.get("http://jonathansoma.com/lede/static/multiple-table-rows.html")
doc_tables = BeautifulSoup(response.text, "html.parser")

In [82]:
tr_list = doc_tables.find_all('tr')

# Each <tr> is one row; the first <td> in it holds the title
for row in tr_list:
    print(row.select('td')[0].string)

How to Scrape Things
How to Scrape Many Things
The End of Scraping


In [77]:
# The second <td> in each row holds the subhead
for row in tr_list:
    print(row.select('td')[1].string)

Some Supplemental Materials
But, Is It Even Possible?
Let's All Use CSV Files


In [85]:
# The third <td> in each row holds the byline
for row in tr_list:
    print(row.select('td')[2].string)

By Jonathan Soma
By Sonathan Joma
By Amos Nathanos


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!
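
Real tables often start with a header row made of `<th>` cells, and indexing `[0]` on a row with no `<td>` raises an `IndexError`. Here is a hedged sketch of a helper that skips such rows. `rows_to_dicts`, `KEYS`, and `SAMPLE_TABLE` are illustrative names, and the header row is an assumption about what a page might contain.

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row; the real page's markup may differ.
SAMPLE_TABLE = """
<table>
  <tr><th>Title</th><th>Subhead</th><th>Byline</th></tr>
  <tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
  <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
</table>
"""

KEYS = ["title", "subhead", "byline"]


def rows_to_dicts(html):
    """Return one dictionary per row that has exactly len(KEYS) <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != len(KEYS):
            continue  # skip header rows (<th> only) and malformed rows
        records.append(
            {key: cell.get_text(strip=True) for key, cell in zip(KEYS, cells)}
        )
    return records


sample_records = rows_to_dicts(SAMPLE_TABLE)
print(sample_records)
```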

In [80]:
driver.get("http://jonathansoma.com/lede/static/the-actual-table.html")

response = requests.get("http://jonathansoma.com/lede/static/the-actual-table.html")
actual_tables = BeautifulSoup(response.text, "html.parser")

In [87]:
tr_list = actual_tables.find_all('tr')

actual_list = []

for element in tr_list:
    # Use a name other than `dict` so the built-in isn't shadowed
    row = {}
    row['title'] = element.select('td')[0].string
    row['subhead'] = element.select('td')[1].string
    row['byline'] = element.select('td')[2].string
    actual_list.append(row)

actual_list

[{'title': 'How to Scrape Things',
  'subhead': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]

## Part 8: Scraping an actual table into a pandas DataFrame

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.
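
The pandas-only route is most likely `pd.read_html`, which parses every `<table>` it finds into a DataFrame. A minimal sketch, assuming a made-up table with a header row (`SAMPLE_TABLE`) and that a parser such as `lxml` is installed. With no header row, pandas numbers the columns instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical table markup; the real page's markup may differ.
SAMPLE_TABLE = """
<table>
  <thead>
    <tr><th>title</th><th>subhead</th><th>byline</th></tr>
  </thead>
  <tbody>
    <tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
    <tr><td>The End of Scraping</td><td>Let's All Use CSV Files</td><td>By Amos Nathanos</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds
tables = pd.read_html(StringIO(SAMPLE_TABLE))
df_sample = tables[0]

print(df_sample)
```

`pd.read_html` also accepts a URL, so you can point it at the page directly instead of a string.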

In [88]:
import pandas as pd

In [92]:
df = pd.DataFrame(actual_list)
df

|   | title | subhead | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`.
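
If you have a list of dictionaries rather than a DataFrame, the standard library's `csv.DictWriter` can write the file as well. This sketch is an alternative to pandas. `sample_rows` is made-up data, and the file goes to a temporary folder so it won't overwrite your own `output.csv`.

```python
import csv
import os
import tempfile

sample_rows = [
    {"title": "How to Scrape Things",
     "subhead": "Some Supplemental Materials",
     "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things",
     "subhead": "But, Is It Even Possible?",
     "byline": "By Sonathan Joma"},
]

# Write to a temporary folder so the sketch doesn't touch a real output.csv
sample_path = os.path.join(tempfile.mkdtemp(), "output.csv")

with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "subhead", "byline"])
    writer.writeheader()
    writer.writerows(sample_rows)

# Read the file back to confirm what was saved
with open(sample_path, newline="", encoding="utf-8") as f:
    saved_rows = list(csv.DictReader(f))

print(saved_rows)
```

The comma inside "But, Is It Even Possible?" is quoted automatically, so it survives the round trip.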

In [94]:
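# index=False leaves out the row index, which would otherwise come back as an "Unnamed: 0" column
df.to_csv('output.csv', index=False)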