# Scraping many pages + Using Selenium

## The pages we'll be looking at

If I want to read information about a specific mine, it takes a few steps. **Do these steps with your browser before you try any programming.**

1. Visit the [Mine Data Retrieval System](https://arlweb.msha.gov/drs/drshome.htm)
2. Scroll down to **Mine Identification Number (ID) Search**
3. Type in a mine ID number, such as `3503598`, click **Search**
4. I'm on a page! It lists the MINE NAME and MINE OWNER.

After searching for and finding a mine, I can use this page to **find reports about this mine**. Some of the reports are on accidents, violations, inspections, health samples and more. To get those reports:

1. Search for a mine (if you haven't already)
2. Scroll down and change **Beginning Date** to `1/1/1995` (violation reports begin in 1995, accidents begin in 1983)
3. Select the report type of `Violations`
4. Click **Get Report**
5. I'm on a page! It lists ALL OF THE MINE'S VIOLATIONS.

By changing the report type you're searching for you can find all sorts of different data.

# Researching mine information

## Preparation 

### When you search for information on a specific mine, what URL should Selenium visit first?

- *TIP: the answer is NOT `https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp`*

In [1]:
# https://arlweb.msha.gov/drs/drshome.htm 

### How can you identify the text field we're going to type the Mine ID into?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

In [2]:
# //*[@id="inputdrs"] (the xpath for the Mine ID area on the main search page )
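As a rough illustration of the list above, the same text field can be described with several locator strategies. The `name` and `id` values below are the ones that appear in this notebook's answers (assuming both refer to the same field); the CSS selector is just the `id` rewritten. The strategy strings are the ones Selenium's generic `driver.find_element(by, value)` accepts. This is a sketch, not something checked against the live page.

```python
# Different ways of describing the same Mine ID text field.
# "MineId" and "inputdrs" come from this notebook; "#inputdrs" is just the
# id written as a CSS selector.
locators = {
    "name": "MineId",
    "id": "inputdrs",
    "css selector": "#inputdrs",
    "xpath": '//*[@id="inputdrs"]',
}

# Print the call you would make for each strategy (no browser needed here).
for strategy, value in locators.items():
    print(f"driver.find_element({strategy!r}, {value!r})")
```

Whichever one you pick, the goal is the same: something short that matches exactly one element on the page.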

### How can you identify the search button we're going to click, or the form we're going to submit?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [3]:
# //*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input 
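The answer above identifies the button, which pairs with `.click()`. For comparison, here is a minimal sketch of the other route, `.submit()`, which only needs the text field. The function name `search_by_submit` is made up for illustration, and it assumes the page's form accepts a plain submit. It uses the same Selenium 3 style `find_element_by_name` call as the rest of this notebook; newer Selenium releases use `driver.find_element(By.NAME, "MineId")` instead.

```python
def search_by_submit(driver, mine_id):
    """Type a mine ID into the search box and submit its form.

    `driver` is an already-open Selenium browser sitting on the search page.
    """
    field = driver.find_element_by_name("MineId")
    # send_keys types text, so convert the ID to a string first.
    field.send_keys(str(mine_id))
    # .submit() on an element inside a form submits that form,
    # so the search button never has to be located.
    field.submit()

# Usage, with a real browser open on the search page:
# search_by_submit(driver, 3901432)
```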

### Use Selenium to search using the mine ID `3901432`. Get me the operator's name by scraping.

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

In [8]:
import selenium

In [9]:
from selenium import webdriver
driver = webdriver.Chrome() 

In [10]:
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

In [14]:
id_finder = driver.find_element_by_name('MineId')
id_finder

<selenium.webdriver.remote.webelement.WebElement (session="123acad9cc936fa1fd94266bccf57e1f", element="0.15989126644044038-1")>

In [15]:
id_finder.send_keys('3901432')

In [16]:
search = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
search.click()

In [17]:
#don't need beautiful soup because we are just looking for one element on the HTML page, not a bunch of information! 
#from bs4 import BeautifulSoup
#doc = BeautifulSoup(driver.page_source, 'html.parser')
#doc.find_all('tr')

In [18]:
mine_name = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b')
mine_name.text

'Krueger Brothers Gravel & Dirt'
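The BeautifulSoup route from the tip works too. Below is a small sketch of it that runs without a browser: the HTML string is an invented stand-in (the real page's markup and labels will differ), and with Selenium running you would pass `driver.page_source` instead.

```python
from bs4 import BeautifulSoup

# Stand-in HTML; with a live browser this string would be driver.page_source.
page_source = """
<table>
  <tr><td>Mine Name:</td><td><font><b>Example Quarry</b></font></td></tr>
  <tr><td>Operator:</td><td><font><b>Example Operator Inc</b></font></td></tr>
</table>
"""

doc = BeautifulSoup(page_source, "html.parser")

# Find the row whose first cell is the label we want, then read the bold text.
for row in doc.find_all("tr"):
    cells = row.find_all("td")
    if cells[0].text.strip() == "Operator:":
        operator = cells[1].find("b").text.strip()
        print(operator)
```

Searching by label like this tends to survive small layout changes better than a long copied XPath, at the cost of a few more lines.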

# Using .apply to find data about SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the operator's name for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [19]:
#we are dealing with a CSV file, so we are going to read it into a pandas dataframe

import pandas as pd
df = pd.read_csv('mines-subset.csv')
df.head()


FileNotFoundError: File b'mines-subset.csv' does not exist

### Open up `mines-subset.csv` in a text editor, then look at your dataframe. Is something different about them?

In [20]:
#running terminal commands inside notebook
!cat mines-subset.csv
#this creates just one column

cat: mines-subset.csv: No such file or directory
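Since the file was not found here, the comparison could not actually be made. One difference that often shows up with ID columns is that pandas reads them as numbers and drops any leading zeros. Whether `mines-subset.csv` has such IDs is an assumption; the sketch below uses made-up file contents and the column name `id` that the later cells expect.

```python
import io
import pandas as pd

# Stand-in for the file contents; the real mines-subset.csv was not available.
csv_text = "id\n0100003\n3503598\n"

# Default behaviour: pandas guesses the column is numeric.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Asking for strings keeps the IDs exactly as they appear in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(as_numbers["id"].tolist())
print(as_text["id"].tolist())
```

If the IDs are going to be typed into a search box, keeping them as text is usually the safer choice.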


### Scrape the operator's name for each of those mines and print it

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*

In [None]:
#run a function through every single row
def name_scrape(row):
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')
    #the field's name is 'MineId' (lowercase d), same as in the single-mine search above
    id_finder = driver.find_element_by_name('MineId')
    #send_keys types text, so convert the ID to a string first
    id_finder.send_keys(str(row['id']))
    search = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    #click the button, otherwise we never leave the search page
    search.click()
    mine_name = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b')
    print(row['id'])
    print(mine_name.text)
    print('-----')
    
driver = webdriver.Chrome()     
df.apply(name_scrape, axis = 1)
driver.close()
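If `.apply` with `axis=1` is still fuzzy, it can help to try it without a browser first. The sketch below uses a tiny made-up dataframe (the column name `id` matches the cell above) and a function that only prints, so you can see that the function receives one row at a time.

```python
import pandas as pd

# Stand-in data; the column name "id" matches what the cell above uses.
mines = pd.DataFrame({"id": ["3901432", "3503598"]})

def describe(row):
    # With axis=1, `row` is one row of the dataframe, indexed by column name.
    print("Would search for mine", row["id"])

# apply calls describe() for each row, top to bottom.
mines.apply(describe, axis=1)
```

Once the printing version behaves, swapping the body of the function for the Selenium steps is a much smaller jump.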

### Scrape the operator's name and save it into a new column

- *TIP: Use .apply and a function*
- *TIP: Remember to use `return`*

In [22]:
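def name_scrape(row):
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')
    id_finder = driver.find_element_by_name('MineId')
    #send_keys types text, so convert the ID to a string first
    id_finder.send_keys(str(row['id']))
    search = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    #click the button so the mine's page actually loads
    search.click()
    mine_name = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b')
    print(row['id'])
    print('-----')
    #return the text so .apply can collect it
    return mine_name.text
    
driver = webdriver.Chrome()     
#whatever the function returns for each row becomes that row's value in the new column
df['operator_name'] = df.apply(name_scrape, axis = 1)
driver.close()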

NameError: name 'df' is not defined
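When scraping many pages, one bad ID can stop the whole `.apply` run. A possible pattern, sketched here with a made-up `fake_lookup` function standing in for the Selenium steps, is to catch the error inside the function and return `None` for that row. In real use you would probably catch Selenium's `NoSuchElementException` rather than every exception.

```python
import pandas as pd

mines = pd.DataFrame({"id": ["3901432", "0000000"]})

def fake_lookup(mine_id):
    # Hypothetical stand-in for the Selenium steps; raises for an unknown ID.
    names = {"3901432": "Krueger Brothers Gravel & Dirt"}
    return names[mine_id]

def safe_name(row):
    try:
        return fake_lookup(row["id"])
    except Exception as error:
        # Keep going: record a missing value instead of stopping the run.
        print("Could not scrape", row["id"], "-", type(error).__name__)
        return None

# The returned values become the new column, one per row.
mines["operator"] = mines.apply(safe_name, axis=1)
print(mines)
```

Rows with a missing value are then easy to find afterwards and check by hand in the browser.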

# Researching mine violations

Read the very top again to remember how to find mine violations.

### When you search for a mine's violations, what URL is Selenium going to start on?

- *TIP: `requests` can send form data to load in the middle of a bunch of steps, but Selenium has to start at the beginning*

In [None]:
# https://arlweb.msha.gov/drs/drshome.htm 

### When you're searching for violations from the Mine Information page, how are you going to identify the "Beginning Date" field?

In [None]:
#//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1] 


### When you're searching for violations from the Mine Information page, how are you going to identify the "Violations" button?

In [None]:
#violations: //*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input 


### When you're searching for violations from the Mine Information page, how are you going to identify the form or the button to click to get a list of the violations?

In [None]:
# //*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input 

### Using the mine ID `3901432`, scrape all of their violations since 1/1/1995

**Save this into a CSV called `3901432-violations.csv`.** This CSV must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

**Tips:**

- *TIP: It's probably worth it to print them all first, then save them to a CSV once you know it's all working.*
- *TIP: You'll use the parent pattern - get the ROWS first (tr), then loop through and get the TABLE CELLS (td)*
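Here is a small sketch of the parent pattern on its own, using an invented two-row table so it runs without a browser. The class name `violation` and the cell positions are made up for this example; the real report has more columns, so the positions will be different.

```python
from bs4 import BeautifulSoup

# Stand-in table; the real report has more columns and different values.
html = """
<table>
  <tr class="violation"><td>1111111</td><td>$100.00</td></tr>
  <tr class="violation"><td>2222222</td><td>$250.00</td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

# Parent first: one <tr> per violation...
for row in doc.find_all("tr", class_="violation"):
    # ...then the children: that row's <td> cells, picked by position.
    cells = row.find_all("td")
    print("citation:", cells[0].text.strip(), "| penalty:", cells[1].text.strip())
```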

In [None]:
#this cell is not inside a function, so no indentation and no `row` - type the ID directly
driver.get('https://arlweb.msha.gov/drs/drshome.htm')
id_finder = driver.find_element_by_name('MineId')
id_finder.send_keys('3901432')
search = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
search.click()

In [None]:
date = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1]')
#empty the field first so the new date replaces whatever is already there
date.clear()
date.send_keys('1/1/1995')

In [None]:
pick_violation = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')
#finding the option is not enough, it has to be clicked to be selected
pick_violation.click()


In [None]:
get_report_button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')
#click it to load the report page
get_report_button.click()


In [None]:
#make one dictionary per violation using BeautifulSoup
#an XPath copied from the browser points at one element, but the report has a lot of rows,
#so it is easier to grab every row and then read its cells
#fields we need:
# Citation number
# Case number
# Standard violated
# Link to standard
# Proposed penalty
# Amount paid to date


In [None]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
violations = doc.find_all('tr', class_ = 'draviols')
#use a different name for the loop variable so the list is not overwritten
for violation in violations:
    print('this is a violation')
    all_cells = violation.find_all('td')
    print('citation number', all_cells[2].text.strip())
    print('case number', all_cells[3].text.strip())
    print('Standard violated:', all_cells[10].text.strip())
    print('Link to standard:', all_cells[10].find('a')['href'])
    print('Proposed penalty', all_cells[11].text.strip())
    print('Amount paid', all_cells[14].text.strip())
    

In [None]:
#saving csv
#1 make an empty list
#2 every time through loop, create dictionary
#3 add dictionary to the list
#4 when done, convert list to a dataframe 

# Using .apply to save mine data for SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the operator's name for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [28]:
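#re-parse the report page so this cell works on its own
doc = BeautifulSoup(driver.page_source, 'html.parser')
all_violations = []
for row in doc.find_all('tr', class_ = 'draviols'):
    violation = {}
    all_cells = row.find_all('td')
    #use = to store each value under its key
    violation['citation number'] = all_cells[2].text.strip()
    violation['case number'] = all_cells[3].text.strip()
    violation['standard violated'] = all_cells[10].text.strip()
    violation['link to standard'] = all_cells[10].find('a')['href']
    violation['proposed penalty'] = all_cells[11].text.strip()
    violation['amount paid'] = all_cells[14].text.strip()
    #add this violation's dictionary to the list (the list does the appending)
    all_violations.append(violation)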
    
    

In [30]:
import pandas as pd
df = pd.DataFrame(all_violations)
df.head()

In [31]:
df.to_csv('3901432-violations.csv', index = False)

### Scrape the violations for each mine

**Save each mine's violations into separate CSV files.** Each CSV file must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

Make sure you are saving them into **separate files.** It might be nice to name them after the mine id.

- *TIP: Use .apply for this*
- *TIP: Print out the ID before you start scraping. That way you can take that ID and search manually to see if there is anything weird about the results.*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*
- *TIP: It's probably worth it to print the fields first, then save them to a CSV once you know it's all working.*
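
One way to lay this out is sketched below. The function `fake_violations` is a made-up stand-in that returns a list of dictionaries, which is where the Selenium and BeautifulSoup steps would go. The files are written to a temporary folder here only so the example can run anywhere; the file naming follows the `3901432-violations.csv` pattern used earlier.

```python
import os
import tempfile
import pandas as pd

def fake_violations(mine_id):
    # Hypothetical stand-in for the scraping steps: a list of dictionaries.
    return [{"citation number": "1111111", "proposed penalty": "100.00"}]

def save_violations(row, folder):
    # Print the ID first so a strange result can be checked by hand.
    print("Scraping", row["id"])
    rows = fake_violations(row["id"])
    # One file per mine, named after the mine ID.
    filename = os.path.join(folder, f"{row['id']}-violations.csv")
    pd.DataFrame(rows).to_csv(filename, index=False)
    return filename

mines = pd.DataFrame({"id": ["3901432", "3503598"]})
output_folder = tempfile.mkdtemp()

# Extra keyword arguments to apply() are passed along to the function.
saved = mines.apply(save_violations, axis=1, folder=output_folder)
print(saved.tolist())
```

Returning the filename is optional, but it leaves you with a record of which file belongs to which mine.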