# Advanced Web Scraping
### Selenium and Scrapy

In class, we went over using `requests` and `BeautifulSoup` to scrape web pages.

When is this not enough?
- When the content you need is generated by JavaScript after the page loads.
- When you need to interact with the webpage.
- When you want to parallelize or manage many scrapers at once.
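
To make the first point concrete, here is a minimal sketch using only the standard library. The page below is a made-up example (the `price` id and the script are assumptions, not taken from any real site): the `<div>` is empty in the HTML the server sends, and only a browser running the script would fill it in. A parser that reads the raw HTML, as `BeautifulSoup` does, therefore finds no text there.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the HTML the server sends;
# a browser would fill it in by running the script.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script>document.getElementById("price").textContent = "42";</script>
</body></html>
"""


class DivText(HTMLParser):
    """Collect the text found inside the <div id="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_div = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("id", "price") in attrs:
            self.in_div = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_div = False

    def handle_data(self, data):
        # Script source is also delivered here, but only text inside the div is kept.
        if self.in_div:
            self.text += data


parser = DivText()
parser.feed(RAW_HTML)
static_text = parser.text.strip()
print(repr(static_text))  # the static HTML holds no value for the div
```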


What each package does:
#### __Selenium__
   - Creates an instance of a web browser which you can use to interact with a web page.
   
#### __Scrapy__
   
   - A larger package that manages scraping "spiders."

When do you use each one?

#### Selenium
- When you need to interact dynamically with a web page.

#### Scrapy
- When you need to manage several spiders at once and set parameters on how they scrape.



### Scraping using Selenium
- Create a web browser instance
- Interact with the elements on the web page.
- Download what you want.
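
One step the outline above leaves implicit is closing the browser when you are done. The sketch below is one possible way to guarantee that; `managed_driver` is a hypothetical helper name, not part of `selenium`. It relies only on the WebDriver `quit()` method, which ends the browser session.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Start a browser with make_driver() and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # ends the browser session, even if scraping raised an error


# Usage (needs selenium and Firefox installed):
#     from selenium import webdriver
#     with managed_driver(webdriver.Firefox) as driver:
#         driver.get("https://www.naab-css.org/dairy-cross-reference")
```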

In this example, I interact with a web page by filling in text fields, clicking buttons, and choosing items from drop-down menus.

In [22]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver import FirefoxOptions

import pandas as pd

First, create a web browser instance:

In [23]:
driver = webdriver.Firefox()

Go to the web page.

In [24]:
driver.get("https://www.naab-css.org/dairy-cross-reference")

Suppose that I want to enter something into the "NAAB Code" field. First I have to find it:

In [25]:
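# "name" is the locator strategy (the value of By.NAME); the find_element_by_* helpers were removed in recent Selenium 4 releases.
naab_id = driver.find_element("name", "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$TextBox_NAABCode")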

Now I can use `send_keys` to put text in the field:

In [26]:
naab_id.send_keys("This is not a code!")

Now I have to click "search":

In [27]:
search_naab_code = driver.find_element("name", "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$Button_SearchNAAB")

search_naab_code.click()

That's not a real code, so I need to go back:

In [28]:
driver.back()

Now I will find the box again, clear it, and put in an actual code:

In [29]:
naab_id = driver.find_element("name", "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$TextBox_NAABCode")

naab_id.clear()

naab_id.send_keys("029HO16708")

Click "Search"

In [31]:
search_naab_code = driver.find_element("name", "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$Button_SearchNAAB")

search_naab_code.click()

Now we can scrape the table:

In [32]:
# "class name" is the value of By.CLASS_NAME
table = driver.find_element("class name", "DairyCrossTable")

pd.read_html(table.get_attribute('outerHTML'))[0]

|   | 0 | 1 |
|---|---|---|
| 0 | Breed | HO |
| 1 | Country | ITA |
| 2 | ID Number | 028990245854 |
| 3 | Semen Release Date | 2013-5 |
| 4 | Status | I |
| 5 | Sampling Code |  |
| 6 | Original Controller |  |
| 7 | Reg. Name | GEGANIA DOB.DODY ET TV TL TY |
| 8 | Short Name | DODY |
| 9 | Birthdate | 10/3/2011 |
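
If you need to look up many codes, the steps above can be wrapped in a function. This is only a sketch: `lookup_naab_code` and `NAME_PREFIX` are hypothetical names, and it assumes the driver is already on the search page and that a failed search raises an error when the table is missing. It passes the locator strategy as a plain string (`"name"` and `"class name"` are the values of `By.NAME` and `By.CLASS_NAME`).

```python
# Shared prefix of the form field names used on this page
NAME_PREFIX = "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$"


def lookup_naab_code(driver, code):
    """Search the page for one NAAB code and return the result table's HTML."""
    box = driver.find_element("name", NAME_PREFIX + "TextBox_NAABCode")
    box.clear()
    box.send_keys(code)
    driver.find_element("name", NAME_PREFIX + "Button_SearchNAAB").click()
    try:
        table = driver.find_element("class name", "DairyCrossTable")
        return table.get_attribute("outerHTML")
    finally:
        driver.back()  # return to the search form even if no table was found


# Usage with a live browser:
#     html = lookup_naab_code(driver, "029HO16708")
```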


Now suppose we want to interact with the drop-down menus. How do we do that?
- Find the elements.
- Select the item you want.
- Click the button.

In [33]:
driver.back()

First the Breed selection:

In [34]:
breed_list = "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$DropDownList_Breed"

breed_dropdown = Select(driver.find_element("name", breed_list))

[x.text for x in breed_dropdown.options]

['-- Select Breed --',
 'AMERICAN LINEBACK',
 'AYRSHIRE',
 'BROWN SWISS',
 'CROSSBREEDS (XX)',
 'Dairy Cross (XD)',
 'DUTCH BELTED',
 'Fleckvieh',
 'GIROLANDO',
 'GUERNSEY',
 'HOLSTEIN',
 'INTERNATIONAL RED DAIRY',
 'JERSEY',
 'MILKING DEVON',
 'MONTBELIARDE',
 'MUESE-RHINE-ISSEL',
 'NORWEGIAN RED AND WHITE',
 'RED & WHITE HOLSTEIN',
 'SHORTHORN (Milking)',
 'SIMMENTAL',
 'SWEDISH RED AND WHITE']

In [35]:
breed_dropdown.select_by_visible_text('HOLSTEIN')

Country selection:

In [36]:
country_name = "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$DropDownList_Country"

country_dropdown = Select(driver.find_element("name", country_name))

country_dropdown.select_by_visible_text("USA/840")

Now enter the ID number in the box:

In [37]:
id_entry = "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$TextBox_IDNumber"

id_enter = driver.find_element("name", id_entry)

id_enter.send_keys("003141494481")

Click "Search"

In [38]:
search_name = "ctl00$ctl00$ContentPlaceHolder_Content$ContentPlaceHolder_Content$Button_SearchRegNumber"

search_box = driver.find_element("name", search_name)

search_box.click()

And now scrape the table:

In [39]:
table = driver.find_element("class name", "DairyCrossTable")

pd.read_html(table.get_attribute('outerHTML'))[0]

|   | 0 | 1 |
|---|---|---|
| 0 | Breed | HO |
| 1 | Country | 840 |
| 2 | ID Number | 003141494481 |
| 3 | Semen Release Date | 2018-8 |
| 4 | Status | A |
| 5 | Sampling Code |  |
| 6 | Original Controller |  |
| 7 | Reg. Name | DENOVO 7895 MENTOR-ET |
| 8 | Short Name | MENTOR |
| 9 | Birthdate | 6/13/2017 |
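
The table above is really a set of label/value pairs, so a dictionary can be a handier shape than a two-column DataFrame. The sketch below swaps `pd.read_html` for the standard library's `html.parser`, which keeps every value as a string. `TwoColumnTable` and `table_to_dict` are hypothetical names, and the sample HTML is hand-written; the real page's markup may differ.

```python
from html.parser import HTMLParser


class TwoColumnTable(HTMLParser):
    """Turn rows of <td>label</td><td>value</td> into a dict."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._row = None       # cells of the row being read, or None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._row.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if len(self._row) == 2:  # keep only label/value rows
                label, value = (cell.strip() for cell in self._row)
                self.record[label] = value
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def table_to_dict(table_html):
    parser = TwoColumnTable()
    parser.feed(table_html)
    return parser.record


# Hand-written sample standing in for table.get_attribute("outerHTML")
sample = (
    "<table class='DairyCrossTable'>"
    "<tr><td>Breed</td><td>HO</td></tr>"
    "<tr><td>Sampling Code</td><td></td></tr>"
    "<tr><td>Short Name</td><td>MENTOR</td></tr>"
    "</table>"
)
record = table_to_dict(sample)
print(record["Short Name"])  # prints MENTOR
```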


In this case, I did not actually need to run any JavaScript, so `selenium` was not strictly necessary. However, there are many cases where you will need it, because `scrapy` and `BeautifulSoup` __cannot__ run JavaScript: they only see the HTML that the server sends.

When you don't need JavaScript, `scrapy` is a much better package for scraping.

### What is `scrapy`?
A framework for web crawling which creates "spiders" that autonomously crawl websites for you.

What are its advantages?
- Automated: no need to run a script multiple times. Running it once spawns spiders to do work for you.
- Parallel: it sends many requests concurrently with little extra effort.
- Parameters: you can constrain the spiders to run only at certain times and under certain conditions.


### Rule of Scraping: Don't Be A Jerk
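Scraping too aggressively can overload or even crash a website, which violates basic scraping etiquette.

Using `scrapy`, you can limit how hard your spiders hit a site: set a delay between requests, cap the number of concurrent requests, and have it back off automatically when the website responds slowly. You can also schedule runs for hours with less traffic.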

__Scrapy is a more responsible and efficient way to scrape websites.__
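
As a rough illustration of what "responsible" can look like in practice, here is a sketch of throttling settings. The setting names are standard Scrapy settings; the values and the `POLITE_SETTINGS` name are assumptions for illustration and should be tuned for the site you scrape.

```python
# Values are illustrative assumptions, not recommendations for any particular site.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    "DOWNLOAD_DELAY": 2.0,                # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per site
    "AUTOTHROTTLE_ENABLED": True,         # slow down when responses get slower
    "AUTOTHROTTLE_START_DELAY": 2.0,      # initial delay used by AutoThrottle
    "AUTOTHROTTLE_MAX_DELAY": 60.0,       # upper bound on the delay
    "CLOSESPIDER_ERRORCOUNT": 10,         # stop the spider after this many errors
}
```

A dictionary like this can be assigned to a spider's `custom_settings` class attribute, or the same keys can be set in the project's `settings.py`.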

### Example:
To scrape the tables I needed, I programmed a `scrapy` spider to visit a list of web pages.

For a long list of pages like mine, this is a much better approach than driving a browser by hand.