# Scraping many pages + Using Selenium

## The pages we'll be looking at

Finding information about a specific mine takes a few steps. **Do these steps with your browser before you try any programming.**

1. Visit the [Mine Data Retrieval System](https://arlweb.msha.gov/drs/drshome.htm)
2. Scroll down to **Mine Identification Number (ID) Search**
3. Type in a mine ID number, such as `3503598`, click **Search**
4. I'm on a page! It lists the MINE NAME and MINE OWNER.

After searching for and finding a mine, I can use this page to **find reports about this mine**. Some of the reports are on accidents, violations, inspections, health samples and more. To get those reports:

1. Search for a mine (if you haven't already)
2. Scroll down and change **Beginning Date** to `1/1/1995` (violation reports begin in 1995, accidents begin in 1983)
3. Select the report type of `Violations`
4. Click **Get Report**
5. I'm on a page! It lists ALL OF THE MINE'S VIOLATIONS.

By changing the report type you're searching for you can find all sorts of different data.

# Researching mine information

## Preparation 

### When you search for information on a specific mine, what URL should Selenium visit first?

- *TIP: the answer is NOT `https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp`*

In [1]:
# url = "https://arlweb.msha.gov/drs/drshome.htm"

### How can you identify the text field we're going to type the Mine ID into?

Selenium can find elements by:

- name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?
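
If XPath is new to you, the sketch below may help. It runs two XPath expressions with Python's built-in `xml.etree.ElementTree`, which supports only a small subset of XPath, against a made-up HTML fragment. The fragment is an assumption for illustration and is not the real MSHA markup. The first expression finds an element by an attribute (the CSS selector for the same thing would be `input[name="MineId"]`), and the second finds one by its position in the page, which is the style of XPath a browser gives you when you copy it from the inspector.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment for illustration; NOT the real MSHA markup.
html = """
<div id="content">
  <form>
    <table>
      <tr><td>Mine ID</td><td><input name="MineId" type="text" /></td></tr>
      <tr><td></td><td><input name="go" type="submit" value="Search" /></td></tr>
    </table>
  </form>
</div>
"""

root = ET.fromstring(html)

# Attribute-based XPath: "any <input> whose name attribute is MineId".
by_name = root.find(".//input[@name='MineId']")

# Position-based XPath: "second row, second cell, the input inside it".
by_position = root.find("./form/table/tr[2]/td[2]/input")

print(by_name.get("type"))       # text
print(by_position.get("value"))  # Search
```

Attribute-based expressions tend to survive small page changes better than position-based ones, which break as soon as a row or table is added.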

In [2]:
# mine_id = driver.find_element_by_name('MineId')

### How can you identify the search button we're going to click, or the form we're going to submit?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [3]:
# search = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')

### Use Selenium to search using the mine ID `3901432`. Get me the operator's name by scraping.

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

In [34]:
# Putting all together

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Driver is the thing that controls the browser. Opens a separate Chrome browser
driver = webdriver.Chrome()
# Tell Selenium to go to a url
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

# Use find_element_by_name to get the Mine ID box, then send_keys to type in the ID
# (Selenium 4.3+ removed find_element_by_name; there, use driver.find_element(By.NAME, 'MineId'))
mine_field = driver.find_element_by_name('MineId')
mine_field.send_keys('3901432')

search_field = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
search_field.click()

# from bs4 import BeautifulSoup
# doc = BeautifulSoup(driver.page_source, "html.parser")
# operator_names = doc.find_all('tr')[3].find_all('td')[1].find('b').text
# operator_names   

# Using Selenium
operator_names = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b')
operator_names.text # we only want the text here, so use .text and NOT .click()

'Krueger Brothers Gravel & Dirt'

# Using .apply to find data about SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the operator's name for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [9]:
import pandas as pd
df = pd.read_csv('mines-subset.csv')
df

Unnamed: 0,id
0,4104757
1,801306
2,3609931


### Open up `mines-subset.csv` in a text editor, then look at your dataframe. Is something different about them?

In [35]:
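# The second ID in the file starts with a 0, but the dataframe dropped it.
# pandas reads the column as integers by default, so the leading zero is lost.
# Use dtype={'id': 'str'} to read the IDs as strings and keep the zero.
# !cat is a terminal command that prints the file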

!cat mines-subset.csv

id
4104757
0801306
3609931

In [36]:
df = pd.read_csv('mines-subset.csv', dtype={'id': 'str'})
df

Unnamed: 0,id
0,4104757
1,0801306
2,3609931
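
To see why the zero disappears without `dtype`, here is a small sketch in plain Python. The three IDs come from the file above. The 7-character width used to restore the zero is an assumption based on those three IDs, not something the site documents.

```python
raw_ids = ["4104757", "0801306", "3609931"]  # as they appear in the CSV

# Converting to a number throws away the leading zero.
as_numbers = [int(value) for value in raw_ids]
print(as_numbers)  # [4104757, 801306, 3609931]

# Assumption: every mine ID is 7 characters wide, as in the three IDs above.
restored = [str(number).zfill(7) for number in as_numbers]
print(restored)  # ['4104757', '0801306', '3609931']
```

Reading the column as a string in the first place is simpler than repairing it afterwards, which is why `dtype={'id': 'str'}` is the approach used here.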


### Scrape the operator's name for each of those mines and print it

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*
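
Before wiring `.apply` up to a browser, it can help to try it with a function that does no scraping at all. This is a minimal sketch: `mines`, `describe_mine` and the `label` column are made-up names for illustration, and the function only builds a string where a real one would visit the site.

```python
import pandas as pd

# Small stand-in dataframe; the IDs are the ones from mines-subset.csv.
mines = pd.DataFrame({"id": ["4104757", "0801306", "3609931"]})

def describe_mine(row):
    # row behaves like a dict: row['id'] is this row's mine ID.
    # A real version would scrape here; this one only builds a string.
    return "mine " + row["id"]

# axis=1 calls describe_mine once per row; the returned values form a Series.
mines["label"] = mines.apply(describe_mine, axis=1)
print(mines["label"].tolist())
# ['mine 4104757', 'mine 0801306', 'mine 3609931']
```

Whatever the function returns for a row becomes that row's value in the new column, so forgetting `return` fills the column with `None`.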

In [12]:
# You have ID numbers
# Open up the page
# Type ID in
# Click search
# Scrape info off of page


import time

# Open one Chrome window before looping, so every row is scraped in the same browser
driver = webdriver.Chrome()

def scrape_mine_info(row):
    # row acts like a dict with a key called 'id', so row['id'] is the mine ID
    # Pause for a second before each search so we don't hammer the site
    time.sleep(1)
    # First visit the search page
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')
    # Type this row's ID into the Mine ID field
    mine_id_field = driver.find_element_by_name('MineId')
    mine_id_field.send_keys(row['id'])
    # Click the search button
    search_field = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    search_field.click()
    # We only want the text, so use .text instead of .click()
    # When the thing you want appears only once on the page, XPath is an easy way to grab it
    operator_names_field = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b')

    # Use return to send the result back to .apply()
    return operator_names_field.text

# df.apply() runs a function (scrape_mine_info) once for every row of the dataframe
# axis=1 works like a loop over rows: the function gets one row at a time
# axis=0 would pass one column at a time instead
# df.apply(scrape_mine_info, axis=1)

# Take the result (names) of df.apply(scrape_mine_info, axis=1) 
# Send it back and save it into a column: df['name']


df['name'] = df.apply(scrape_mine_info, axis=1)
df['name']

# Closes browser after it finishes scraping.
# driver.close()

0                  Dirt Works
1    Holley Dirt Company, Inc
2              M.R. Dirt Inc.
Name: name, dtype: object

### Scrape the operator's name and save it into a new column

- *TIP: Use .apply and a function*
- *TIP: Remember to use `return`*

In [13]:
# Look at the dataframe to see if it worked
# Save it into a new column: df['name']
df.head()

Unnamed: 0,id,name
0,4104757,Dirt Works
1,801306,"Holley Dirt Company, Inc"
2,3609931,M.R. Dirt Inc.


# Researching mine violations

Read the very top again to remember how to find mine violations

### When you search for a mine's violations, what URL is Selenium going to start on?

- *TIP: `requests` can send form data to load in the middle of a bunch of steps, but Selenium has to start at the beginning*

In [14]:
# https://arlweb.msha.gov/drs/drshome.htm

### When you're searching for violations from the Mine Information page, how are you going to identify the "Beginning Date" field?

In [15]:
# date = '//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1]'

### When you're searching for violations from the Mine Information page, how are you going to identify the "Violations" button?

In [16]:
# violations = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')

### When you're searching for violations from the Mine Information page, how are you going to identify the form or the button to click to get a list of the violations?

In [17]:
# get_report = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')

### Using the mine ID `3901432`, scrape all of their violations since 1/1/1995

**Save this into a CSV called `3901432-violations.csv`.** This CSV must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

**Tips:**

- *TIP: It's probably worth it to print them all first, then save them to a CSV once you know it's all working.*
- *TIP: You'll use the parent pattern - get the ROWS first (tr), then loop through and get the TABLE CELLS (td)*

In [37]:
driver = webdriver.Chrome() 
# go to the search page
driver.get('https://arlweb.msha.gov/drs/drshome.htm')
# find the Mine ID field
mine_id_field = driver.find_element_by_name('MineId') 
# type in the mine ID we are interested in: 3901432
mine_id_field.send_keys('3901432') 
# click the search button
search_field = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
search_field.click() 

# Enter the date into the form
# use find_element_by_xpath because we only want one element
date_field = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1]')
# use send_keys() to type in the date
date_field.send_keys('1/1/1995')


violations_field = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')
violations_field.click()

reports_field = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')
reports_field.click()

from bs4 import BeautifulSoup
# Do not use requests here, because Selenium has already loaded the page.
# requests is a library that downloads pages from the internet, like this:
# response = requests.get('')
# doc = BeautifulSoup(response.text, 'html.parser')

# Instead, use driver.page_source to hand the source of the page we are currently on to BeautifulSoup
# and process it with html.parser

doc = BeautifulSoup(driver.page_source, 'html.parser')

# Look at the current page and collect its table rows ('tr')
# find_all returns every row with the class drsviols, not just the first one

violations_data = doc.find_all('tr', class_='drsviols') 


# len(violations_data) gives us 18 rows
# Grab the table cells in each row: the citation number is the 3rd cell, and so on
# DO NOT USE XPATH TO SELECT MORE THAN ONE ELEMENT. USE IT ONLY WHEN THE ELEMENT IS UNIQUE ON THE PAGE
# We want every row with the class drsviols, then we loop through them. We don't need a DataFrame
# What we need from the table about the violations:

    # Citation number
    # Case number
    # Standard violated
    # Link to standard
    # Proposed penalty
    # Amount paid to date
    
# Loop through each row. 
# This is not a DataFrame, so do not use .apply()
# violations_data is basically a list.

for item in violations_data:
    # td is each cell
    cells = item.find_all('td')
    print('Citation number', cells[2].text)
    print('Case number', cells[3].text)
    # The 'Standard violated' cell also contains JavaScript. Just get the text of the "a" tag inside the cell.
    print('Standard violated', cells[10].find('a').text)
    # In order to get an attribute like [href] from an element, treat it like a dict
    print('Link to standard', cells[10].find('a')['href'])
    print('Proposed penalty', cells[11].text)
    print('Amount paid to date', cells[14].text)


Citation number 8750964                        
Case number 000361866           
Standard violated 
56.18010            
Link to standard http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-vol1/pdf/CFR-2014-title30-vol1-sec56-18010.pdf
Proposed penalty 100.00
Amount paid to date 100.00 
Citation number 6426438                        
Case number 000260865           
Standard violated 
56.4101             
Link to standard http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4101.pdf
Proposed penalty 100.00
Amount paid to date 100.00 
Citation number 6426439                        
Case number 000260865           
Standard violated 
56.4201(a)(2)       
Link to standard http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4201.pdf
Proposed penalty 100.00
Amount paid to date 100.00 
Citation number 6588189                        
Case number 000260865           
Standard violated 
56.14200            
Link to standard http://www.gpo.

In [24]:
# The code above only prints each field: cells[2], cells[3], etc.
# What we want instead is to collect all of the table's data and save it into a CSV.

# Follow these steps to do that:
    # 1. First make an empty list []
    # 2. Every time through the loop, create a dict {} of your data
    # 3. Add the dict to the list
    # 4. When the entire loop is over, convert the list to a DataFrame
    # 5. And save that DataFrame to a CSV
    
# Before we go through any rows of the table, violations is an empty list

violations = []
for item in violations_data: # the whole row
    # every time we go through the loop, we save one violation into a new dict (field names as keys)
    # Use strip() after text to clean up the data
    violation = {}
    cells = item.find_all('td') # td is each cell
    violation['Citation number'] = cells[2].text.strip()
    violation['Case number'] = cells[3].text.strip()
    # The 'Standard violated' cell also contains JavaScript. Just get the text of the "a" tag inside the cell.
    violation['Standard violated'] = cells[10].find('a').text.strip()
    # In order to get an attribute like [href] from an element, treat it like a dict
    violation['Link to standard'] = cells[10].find('a')['href']
    violation['Proposed penalty'] = cells[11].text.strip()
    violation['Amount paid to date'] = cells[14].text.strip()
    
    # add them to the list of dictionaries
    violations.append(violation)
violations

[{'Amount paid to date': '100.00',
  'Case number': '000361866',
  'Citation number': '8750964',
  'Link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-vol1/pdf/CFR-2014-title30-vol1-sec56-18010.pdf',
  'Proposed penalty': '100.00',
  'Standard violated': '56.18010'},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Citation number': '6426438',
  'Link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4101.pdf',
  'Proposed penalty': '100.00',
  'Standard violated': '56.4101'},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Citation number': '6426439',
  'Link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4201.pdf',
  'Proposed penalty': '100.00',
  'Standard violated': '56.4201(a)(2)'},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Citation number': '6588189',
  'Link to standard': 'http://www.gpo.gov/fdsys/pkg/

In [25]:
# Save above data into a DataFrame
import pandas as pd
df = pd.DataFrame(violations)
df.head()

Unnamed: 0,Amount paid to date,Case number,Citation number,Link to standard,Proposed penalty,Standard violated
0,100.0,361866,8750964,http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...,100.0,56.18010
1,100.0,260865,6426438,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,56.4101
2,100.0,260865,6426439,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,56.4201(a)(2)
3,100.0,260865,6588189,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,56.14200
4,100.0,238554,6588210,http://www.gpo.gov/fdsys/pkg/CFR-2010-title30-...,100.0,50.30(a)


In [26]:
# save it into a CSV
df.to_csv('3901432-violations.csv', index=False)

# Using .apply to save mine data for SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the violations for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [27]:
df = pd.read_csv('mines-subset.csv', dtype={'id' : 'str'})
df

Unnamed: 0,id
0,4104757
1,0801306
2,3609931


### Scrape the violations for each mine

**Save each mine's violations into separate CSV files.** Each CSV file must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

Make sure you are saving them into **separate files.** It might be nice to name them after the mine id.

- *TIP: Use .apply for this*
- *TIP: Print out the ID before you start scraping. That way you can take that ID and search manually to see if there is anything weird about the results.*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*
- *TIP: It's probably worth it to print the fields first, then save them to a CSV once you know it's all working.*

In [32]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd
import time

driver = webdriver.Chrome() 

def scrape_violations(row):
    # Go to the search page
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')
    mine_id_field = driver.find_element_by_name('MineId') 
    # Type this row's ID into the field; row acts like a dict, so use row['id']
    mine_id_field.send_keys(row['id'])
    # hits search button
    search_field = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    search_field.click() 
    # Use find_element_by_xpath because we only want one element
    date_field = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1]')
    # Use send_keys() to type the date into the field
    date_field.send_keys('1/1/1995')  
    # clicks violations button
    violations_field = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')
    violations_field.click()
    # gets the report
    reports_field = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')
    reports_field.click()
    
    # After we get the report with all the data, use BeautifulSoup to find all the rows that are important to us
    # driver.page_source is the source of the current page, so we don't need to use requests
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    # Look at the current page and collect its violation rows ('tr')
    violations_data = doc.find_all('tr', class_='drsviols') 

    # Now start searching for violations
    violations = []
    for item in violations_data: # the whole row
        # every time we go through the loop, we save one violation into a new dict (field names as keys)
        # Use strip() after text to clean up the data
        violation = {}
        cells = item.find_all('td') # td is each cell
        violation['Citation number'] = cells[2].text.strip()
        violation['Case number'] = cells[3].text.strip()

        # The 'Standard violated' cell also contains JavaScript. Just get the text of the 'a' tag inside the cell.
        # Some cells don't have an 'a' tag, so check for one first
        # instead of letting the code raise an error
        a_tag = cells[10].find('a')
        if a_tag:
            violation['Standard violated'] = a_tag.text.strip()
            # In order to get an attribute like [href] from an element, treat it like a dict
            violation['Link to standard'] = a_tag['href']

        # Some rows have fewer cells than the rest, so the two fields below may be missing.
        # Only read them if the row has more than 14 cells, to avoid an error
        if len(cells) > 14:
            violation['Proposed penalty'] = cells[11].text.strip()
            violation['Amount paid to date'] = cells[14].text.strip()

        # add it to list of dictionaries
        violations.append(violation)

    # Put this mine's violations into a new dataframe
    violations_df = pd.DataFrame(violations)
    # Save it under a different filename for every mine, using the row's ID
    violations_df.to_csv(row['id'] + '-violations.csv', index=False)

df.apply(scrape_violations, axis=1)

0    None
1    None
2    None
dtype: object
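
The `if` checks in `scrape_violations` skip a field when a row is too short, which means some dicts end up with fewer keys than others. One possible alternative, sketched here with a hypothetical helper called `cell_text`, is to always set every key and fall back to `None`. The rows below are plain lists of strings standing in for BeautifulSoup cells; with real cells you would read `cells[index].text` before stripping.

```python
def cell_text(cells, index):
    """Return the stripped text at cells[index], or None if the row is too short."""
    if index < len(cells):
        return cells[index].strip()
    return None

# Stand-in rows: plain strings instead of BeautifulSoup <td> elements.
full_row = ["a", "b", " 8750964 ", "000361866"]
short_row = ["a", "b", " 8912694 "]

print(cell_text(full_row, 3))   # 000361866
print(cell_text(short_row, 3))  # None
```

With every key always present, each row in the saved CSV has the same columns and the gaps show up as empty cells.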

In [33]:
pd.DataFrame(violations).head()

Unnamed: 0,Amount paid to date,Case number,Citation number,Link to standard,Proposed penalty,Standard violated
0,,,8912694,http://www.ecfr.gov/cgi-bin/text-idx?SID=f462b...,,56.14132(a)
1,351.0,427623.0,8638781,http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...,351.0,56.12028
2,117.0,411633.0,8903435,http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...,117.0,56.9300(a)
3,117.0,411633.0,8903434,http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...,117.0,46.11(d)
4,100.0,411633.0,8903436,http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...,100.0,56.12004


In [None]:
# If the code fails on some rows, you can wrap it in 'try' and 'except'.
# This shows you which mine ID the code failed on:
# try:
#     ...the scraping code...
# except:
#     print('Failed on', row['id'])
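
Here is one way the `try`/`except` idea could look as running code. This is a sketch under assumptions: `scrape_safely` and `fake_scrape` are made-up names, and `fake_scrape` stands in for a real scraping function by failing on purpose for one ID.

```python
def scrape_safely(scrape, row):
    """Run scrape(row); on any error, report the mine ID and return None."""
    try:
        return scrape(row)
    except Exception as error:
        print('Failed on', row['id'], '-', error)
        return None

# Hypothetical stand-in scraper that fails for one ID, to show the behaviour.
def fake_scrape(row):
    if row['id'] == '0801306':
        raise ValueError('no violations table found')
    return 'ok'

rows = [{'id': '4104757'}, {'id': '0801306'}, {'id': '3609931'}]
results = [scrape_safely(fake_scrape, row) for row in rows]
print(results)  # ['ok', None, 'ok']
```

With a dataframe, the same wrapper could be used as `df.apply(lambda row: scrape_safely(scrape_violations, row), axis=1)`, so one bad mine does not stop the rest from being scraped.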