# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm). Thank goodness we can search for these things.

## Setup: Import what you'll need to search and scrape with Selenium

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait

## Starting from `https://arlweb.msha.gov/drs/drshome.htm`, search for every operator with 'dirt' in their name, including abandoned mines.

> - *Tip: If you can't make an element work using name, class or ID, try to use the XPath*
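
As a rough illustration of the tip, the sketch below compares a lookup by `name` with a positional XPath. It uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) rather than Selenium, and the form markup, including the `Search` button, is made up for the example and is not the real MSHA page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified form markup; the real page is laid out differently.
html = """
<div id="content">
  <form>
    <table>
      <tr><td><input name="OperSearch" type="text" /></td></tr>
      <tr><td><input type="submit" value="Search" /></td></tr>
    </table>
  </form>
</div>
""".strip()

root = ET.fromstring(html)

# Lookup by attribute: works because this input has a name.
by_name = root.find(".//input[@name='OperSearch']")

# Lookup by position: second row, first cell, then the input inside it.
# Handy when the element has no name, class, or ID to target.
by_position = root.find("./form/table/tr[2]/td[1]/input")

print(by_name.get("type"))
print(by_position.get("value"))
```

Positional paths break as soon as the page layout changes, so they are best kept as a fallback.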

In [2]:
driver = webdriver.Chrome()

In [3]:
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

In [4]:
operator = driver.find_element_by_name('OperSearch')
driver.execute_script("arguments[0].scrollIntoView(true)", operator)

In [5]:
operator.send_keys('dirt')

In [6]:
button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table/tbody/tr[7]/td[3]/input[1]')
button.click()

## Scrape the results page, saving it as `dirt-operators.csv`

> - *Tip: Think about what each row in your dataset will be, and start by looping through that*
> - *Tip: Printing is cool and good! Print everything! Move it into a dictionary later.*
> - *Tip: If you don't want a row, think about what's in the row that makes it different. You can use an `if` statement or list slicing to skip the ones you aren't interested in.*
> - *Tip: Make sure your dictionary and your loop variable have DIFFERENT NAMES*
> - *Tip: After you've made your dictionary (and printed it, of course), you'll want to add it to your list of rows*
> - *Tip: Be sure to import pandas to convert it to a dataframe*
> - *Tip: Make sure you don't include the index when saving your dataframe*
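
The tip about using different names is easy to trip over, so here is a small sketch of what goes wrong. It uses plain lists in place of Selenium elements; the two ID/state pairs are taken from the results, and the variable names are only illustrative.

```python
# Plain lists standing in for the cells of two scraped rows.
scraped = [["3503598", "OR"], ["1401575", "KS"]]

# Problem: reusing `row` for the dictionary overwrites the loop variable,
# so the cell values are gone before they can be read.
broken = []
for row in scraped:
    row = {}  # the list of cells is no longer reachable
    broken.append(row)

# Fix: keep the loop variable and the dictionary under different names.
rows = []
for cells in scraped:
    row = {}
    row["ids"] = cells[0]
    row["states"] = cells[1]
    rows.append(row)

print(broken)
print(rows)
```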

### Hopefully you know that each `tr` is supposed to be a row of your data. What is the index of the first row element that is actually a result?

> - *Tip: `.text` will help you here.*
> - *Tip: You aren't interested in annotations or anything, just mines and where they are from*
> - *Tip: Using `print("-----")` will help you keep track of different rows*
> - *Tip: If you have a list called `animals`, `animals[2:]` will skip the first two and start with the third. You can use this to skip ahead to the 'good' data if you want*
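
A quick sketch of the slicing tip, using a made-up `animals` list. The same pattern applies to a list of `tr` elements.

```python
animals = ["cat", "dog", "emu", "fox", "gnu"]

print(animals[2:])    # skips the first two items
print(animals[:-1])   # drops the last item
print(animals[2:-1])  # does both at once
```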

In [7]:
results = driver.find_elements_by_tag_name('tr')

In [9]:
rows = []

# Skip the seven non-result rows at the top and the trailing row at the bottom.
for result in results[7:-1]:
    #print("-----")
    #print(result.text)
    row = {}

    # Fetch the row's cells once, then read each value by position.
    cells = result.find_elements_by_tag_name("td")
    row['ids'] = cells[0].text
    row['states'] = cells[1].text
    row['operators'] = cells[2].text
    row['names'] = cells[3].text

    # NOTE: these four keys also read cells[3], so they repeat the mine name
    # (visible in the printed rows); each one needs its own column index.
    row['types'] = cells[3].text
    row['cm'] = cells[3].text
    row['commodity'] = cells[3].text
    row['more'] = cells[3].text

    print(row)
    rows.append(row)

{'ids': '3503598', 'states': 'OR ', 'operators': 'Newberg Rock & Dirt  ', 'names': 'Newberg Rock & Dirt', 'types': 'Newberg Rock & Dirt', 'cm': 'Newberg Rock & Dirt', 'commodity': 'Newberg Rock & Dirt', 'more': 'Newberg Rock & Dirt'}
{'ids': '1401575', 'states': 'KS ', 'operators': 'Bender Sand & Dirt  ', 'names': 'BENDER SAND & DIRT', 'types': 'BENDER SAND & DIRT', 'cm': 'BENDER SAND & DIRT', 'commodity': 'BENDER SAND & DIRT', 'more': 'BENDER SAND & DIRT'}
{'ids': '5001797', 'states': 'AK ', 'operators': 'Dirt Company  ', 'names': 'Bush Pilot', 'types': 'Bush Pilot', 'cm': 'Bush Pilot', 'commodity': 'Bush Pilot', 'more': 'Bush Pilot'}
{'ids': '2103723', 'states': 'MN ', 'operators': 'Dirt Doctor Inc  ', 'names': 'Rock Lake Plant', 'types': 'Rock Lake Plant', 'cm': 'Rock Lake Plant', 'commodity': 'Rock Lake Plant', 'more': 'Rock Lake Plant'}
{'ids': '2103914', 'states': 'MN ', 'operators': 'Dirt Work Specialists LLC  ', 'names': 'Astec Plant', 'types': 'Astec Plant', 'cm': 'Astec Plant

In [11]:
import pandas as pd

df = pd.DataFrame(rows)
df.head(10)

|  | cm | commodity | ids | more | names | operators | states | types |
|---|---|---|---|---|---|---|---|---|
| 0 | Newberg Rock & Dirt | Newberg Rock & Dirt | 3503598 | Newberg Rock & Dirt | Newberg Rock & Dirt | Newberg Rock & Dirt | OR | Newberg Rock & Dirt |
| 1 | BENDER SAND & DIRT | BENDER SAND & DIRT | 1401575 | BENDER SAND & DIRT | BENDER SAND & DIRT | Bender Sand & Dirt | KS | BENDER SAND & DIRT |
| 2 | Bush Pilot | Bush Pilot | 5001797 | Bush Pilot | Bush Pilot | Dirt Company | AK | Bush Pilot |
| 3 | Rock Lake Plant | Rock Lake Plant | 2103723 | Rock Lake Plant | Rock Lake Plant | Dirt Doctor Inc | MN | Rock Lake Plant |
| 4 | Astec Plant | Astec Plant | 2103914 | Astec Plant | Astec Plant | Dirt Work Specialists LLC | MN | Astec Plant |
| 5 | Portable #1 | Portable #1 | 4104757 | Portable #1 | Portable #1 | Dirt Works | TX | Portable #1 |
| 6 | River Road Pit | River Road Pit | 0801306 | River Road Pit | River Road Pit | Holley Dirt Company, Inc | FL | River Road Pit |
| 7 | PORTABLE SCREENER | PORTABLE SCREENER | 3901432 | PORTABLE SCREENER | PORTABLE SCREENER | Krueger Brothers Gravel & Dirt | SD | PORTABLE SCREENER |
| 8 | Forbes Pit | Forbes Pit | 3609624 | Forbes Pit | Forbes Pit | M R Dirt | PA | Forbes Pit |
| 9 | Camptown Quarry | Camptown Quarry | 3609931 | Camptown Quarry | Camptown Quarry | M.R. Dirt Inc. | PA | Camptown Quarry |


In [12]:
df.to_csv("dirt-operators.csv", index=False)

### Loop through each operator result, printing its name

> - *Tip: If you have a list called `animals`, `animals[2:]` will skip the first two and start with the third.*
> - *Tip: You can use list slicing or an `if` statement to skip the non-data row(s). List slicing is probably easier, even if you aren't comfortable with it.*
> - *Tip: Or, honestly, you can use `try` and `except` if you know how they work.*
> - *Tip: Once you have the "right" rows of data, you're going to be looking for a certain tag inside*
> - *Tip: Sometimes you can't say "give me this class," and instead you have to say "give me all of the `div` elements, and then give me the third one."*
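
For the `try` / `except` route, a minimal sketch might look like the following. It works on plain lists of cell text instead of live Selenium elements, and the short "header" and empty rows are hypothetical stand-ins for whatever non-data rows the page contains.

```python
# Each inner list holds the text of one row's cells.
# Header or spacer rows have fewer cells than real results.
table = [
    ["Search results"],
    ["3503598", "OR", "Newberg Rock & Dirt"],
    [],
    ["1401575", "KS", "Bender Sand & Dirt"],
]

operators = []
for cells in table:
    try:
        # "Give me all of the cells, and then give me the third one."
        operators.append(cells[2])
    except IndexError:
        # Not a data row, so skip it.
        continue

print(operators)
```

Catching only `IndexError` keeps other, unexpected errors visible instead of silently swallowing them.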

In [23]:
df.names

0           Newberg Rock & Dirt
1            BENDER SAND & DIRT
2                    Bush Pilot
3               Rock Lake Plant
4                   Astec Plant
5                   Portable #1
6                River Road Pit
7             PORTABLE SCREENER
8                    Forbes Pit
9               Camptown Quarry
10            Fedscreek Surface
11                        No. 3
12                   Mine No. 6
13              Sandretto Drive
14    R D BLANKENSHIP DIRT WORK
15                   Molino Pit
16        Pettibone Jaw Crusher
17                Chieftan 1400
18             Mike's Money Pit
19                      Crusher
Name: names, dtype: object

### Loop through each operator result, printing its ID

There should be ONE code per row, and NO empty rows between them.

In [31]:
df.ids

0     3503598
1     1401575
2     5001797
3     2103723
4     2103914
5     4104757
6     0801306
7     3901432
8     3609624
9     3609931
10    1519799
11    4407379
12    4407296
13    0203332
14    2901986
15    0801417
16    4300768
17    4300776
18    2302283
19    2103518
Name: ids, dtype: object
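
If a loop over the rows does pick up blank entries, one way to meet the "one code per row, no empty rows" requirement is to filter them out before printing. This is only a sketch: the IDs come from the output above, while the blank strings are hypothetical.

```python
# A few IDs from the output, plus blank strings standing in for the
# empty rows a loop might pick up.
raw_ids = ["3503598", "", "1401575", "   ", "0801306"]

# Keep only non-blank values, trimming stray whitespace.
ids = [value.strip() for value in raw_ids if value.strip()]

for mine_id in ids:
    print(mine_id)
```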

## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.

> - *Tip: Start with an empty dictionary, then add the keys one at a time like we did during class*
> - *Tip: You might want to save all of the cells in a variable, then use indexes to get the second, third, fourth, etc.*
> - *Tip: I know you already skipped a bunch of rows, but one of them still might be bad! Which one is it? How can you skip it? You might need to slice out some of the end of your list, too. Use `print` to help you debug, or just look at the page closely.*
> - *Tip: Or, if you did the other homework already, `try` / `except` is also an option*
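
One way to follow the "save all of the cells in a variable" tip is sketched below. Each field gets its own index, which avoids repeating the mine name across columns as in the output earlier. The column order in `FIELDS`, the `status` key, the helper name `row_from_cells`, and the example values are all assumptions; check the order against the actual results table.

```python
# Assumed column order; verify it against the results table before relying on it.
FIELDS = ["ids", "states", "operators", "names",
          "types", "cm", "status", "commodity"]


def row_from_cells(cell_texts):
    """Build one row dictionary from the text of a row's cells."""
    row = {}
    for index, field in enumerate(FIELDS):
        # Each field reads its own cell; strip the trailing spaces the page adds.
        row[field] = cell_texts[index].strip()
    return row


# Hypothetical cell text for a single result row.
example = ["0000000", "ZZ ", "Example Dirt Co  ", "Example Pit",
           "Surface", "M", "Active", "Sand"]
print(row_from_cells(example))
```

In the notebook, the list of cell text could come from `[td.text for td in result.find_elements_by_tag_name("td")]`.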

### Save that to a CSV named `dirt-operators.csv`

### Open the CSV file and examine the first few rows.

Make sure you didn't save that extra weird unnamed index column.
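
To see where the unnamed column comes from, the sketch below writes a tiny dataframe to CSV text with and without the index and reads it back. It uses an in-memory buffer instead of a file so it can run anywhere; the two rows reuse IDs and states from the results. It also shows that reading IDs back as numbers would drop leading zeros, which `dtype` can prevent.

```python
import io

import pandas as pd

rows = [
    {"ids": "0801306", "states": "FL"},
    {"ids": "3503598", "states": "OR"},
]
df = pd.DataFrame(rows)

# With no path given, to_csv returns the CSV as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The saved index has no header, so pandas names it on the way back in.
print(list(pd.read_csv(io.StringIO(with_index)).columns))
print(list(pd.read_csv(io.StringIO(without_index)).columns))

# Read IDs as strings so values like 0801306 keep their leading zero.
reread = pd.read_csv(io.StringIO(without_index), dtype={"ids": str})
print(reread["ids"].tolist())
```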