# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm#MID). Thank goodness we can search for these things.

## Preparation: Knowing your tags

These questions are the same for every data set, so they might not fit yours exactly.

**Search for every operator with 'dirt' in their name, including abandoned mines.**

### What is the tag and class name for every row of data?

In [None]:
# Each row is a <tr> tag without class

### What is the tag and class name for every mine operator's name?

In [115]:
# Every mine operator's name is in a <font> tag within a <td> tag and has no class

### What is the tag and class name for every mine's name?

In [116]:
# Every mine's name is in a <font> tag within a <td> tag and has no class

## Being lazy

If you only needed these results, what would you do instead of scraping them?

In [None]:
# copy-paste into Excel

## Setup: Import what you'll need to scrape the page

Use `requests`, not `urllib`.

In [118]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Try to scrape the page

To test if you requested the page correctly, save the BeautifulSoup document as `doc` and run the code `doc.find_all('tr')[-1].text` to get the text of the last `<tr>` element.

- If the result starts with **Total Number of Mines Found**, you were successful.
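
The check described above can be wrapped in a small helper so it also works when no `<tr>` is found at all. This is a minimal sketch: `last_row_text` and `sample_html` are hypothetical names, and the inline HTML is a made-up stand-in for `response.text`, not the real page.

```python
from bs4 import BeautifulSoup


def last_row_text(html):
    """Return the stripped text of the last <tr> in an HTML string, or None."""
    doc = BeautifulSoup(html, 'html.parser')
    rows = doc.find_all('tr')
    if not rows:
        return None
    return rows[-1].text.strip()


# Hypothetical stand-in for response.text
sample_html = """
<table>
<tr><td>State</td><td>Operator</td></tr>
<tr><td>OR</td><td>Newberg Rock &amp; Dirt</td></tr>
<tr><td>Total Number of Mines Found: 1</td></tr>
</table>
"""

text = last_row_text(sample_html)
looks_ok = text is not None and text.startswith('Total Number of Mines Found')
print(looks_ok)
```

Returning `None` for a page without rows keeps the check from raising an `IndexError` when the request came back with an error page instead of results.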

In [119]:
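base_url = 'https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp'
# Form fields sent with the search request
data = {
    'OperSearch': '',
    'MineName': 'dirt',
    'StateSearch': 'None',
    'CM': 'All',
    # x and y are presumably the click coordinates that the form's image submit button sends
    'x': '21',
    'y': '9',
    'MC': 'Opersearch'
}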

In [120]:
response = requests.post(base_url, data=data)
doc = BeautifulSoup(response.text, 'html.parser')

In [121]:
doc.find_all('tr')[-1].text.strip()

'Total Number of Mines Found:\xa0\xa077'

## Actually scraping

### Hopefully you know that each `tr` is supposed to be your data. What is the index of the first row element that is actually a result?

`.text` will help you here.

In [122]:
doc.find_all('tr')[7]

<tr>
<td align="center">
<form action="/drs/ASP/BasicMineInfostatecounty.asp" method="post" name="search">
<input name="MineId" type="hidden" value="3503598"/><font style="FONT-SIZE:.75em;">3503598</font>
</form></td>
<td><font style="FONT-SIZE:.75em;"><!-- DNT --><b>OR</b><!-- /DNT --> </font></td>
<td><font style="FONT-SIZE:.75em;"><!-- DNT --> Newberg Rock &amp; Dirt<!-- /DNT -->  </font></td>
<td><font style="FONT-SIZE:.75em;"><!-- DNT -->Newberg Rock &amp; Dirt<!-- /DNT --></font></td>
<td align="center"><font style="FONT-SIZE:.75em;"><!-- DNT -->Surface             <!-- /DNT --></font></td>
<td align="center"><font style="FONT-SIZE:.75em;"><!-- DNT -->M<!-- /DNT --> </font></td>
<td><font style="FONT-SIZE:.75em;">Active  </font></td>
<td><font style="FONT-SIZE:.75em;">Crushed, Broken Stone NEC  </font></td>
<th bgcolor="#000000"><input alt="More Information" border="0" name="submit" src="/drs/images/moreinfo.jpg" type="image"/></th></tr>
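
Rather than guessing indexes one at a time, one way to find the first result row is to print a short text preview of the first few rows next to their indexes. The sketch below assumes a hypothetical helper `preview_rows` and a tiny made-up table; on the real page you would pass `doc.find_all('tr')`.

```python
from bs4 import BeautifulSoup


def preview_rows(rows, limit=10, width=40):
    """Return (index, short text) pairs for the first `limit` rows."""
    previews = []
    for i, row in enumerate(rows[:limit]):
        # Join the text pieces with spaces so neighbouring cells do not run together
        text = row.get_text(' ', strip=True)
        previews.append((i, text[:width]))
    return previews


# Hypothetical miniature of the results page
sample_html = """
<table>
<tr><th>Mine Search Results</th></tr>
<tr><td>ID</td><td>State</td><td>Operator</td></tr>
<tr><td>3503598</td><td>OR</td><td>Newberg Rock &amp; Dirt</td></tr>
</table>
"""

doc_sample = BeautifulSoup(sample_html, 'html.parser')
previews = preview_rows(doc_sample.find_all('tr'))
for index, text in previews:
    print(index, text)
```

The first index whose preview starts with a mine ID is the first real result.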

### Loop through each operator result, printing its name

Use LIST SLICING to skip the non-data row(s).

In [123]:
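# Select data rows by their black "More Information" <th> cell instead of slicing the <tr> list
for t in doc.find_all('th', attrs={'bgcolor':'#000000'}):
    # Column 3 is the mine name; the operator name is column 2
    print(t.find_parent().find_all('td')[3].text.strip())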

Newberg Rock & Dirt
Allied Dirt Moving Co Pit & Plant
AM Dirtworks & Aggregate Sales
Bar-Lin Dirt Pit
Barber'S Dirt Pit
Pay Dirt
BENDER SAND & DIRT
BERT'S DIRT
Big Red Dirt Farm
Big River Dirt Pit
BOHANNON SAND & DIRT
Buck'S Dirt Pit
17-A DIRT PIT
PAYDIRT #1
Dirt Crew 1
Dirt Crew # 2
Crowes Dirt Pit
D & H Dirt
WISE DIRT
Dirt Cheap
THE DIRT PIT
DIRTCO INC
Dirtman Sand & Gravel #2
DIRTWORKS, INC.
Dirtworks, Inc.
Eddie Carr Dirt LLC
DIRT/SAND SCREEN-AASE PIT
Floyd Smith Dirt Pit
Greer Dirt Pit
Guidry Sand & Dirt Pit
Dirt Pit
Harris Dirt Pit
Hatchet Creek Rock & Dirt LLC
Dirt Pit
Iske Dirt Sand & Gravel
Cowan Dirt Pit #1 & #2
Cowan Dirt Pit
COWAN DIRT PIT #3
Stephens Red Dirt Farm
Cowan Dirt Pit # 4
Pay Dirt
Trainer Dirt Pit
L I P Dirt & Trucking
La Amite Dirt Pit
Lee'S Dirt Pit
Dirt Dumper
Lowe Dirt Pit
Lowe Dirt Pit
Little-G-Dirt Pit
Long'S Dirt Pit
MARCELO DIRT-LOAM
Maurice Dirt And Sand
Moss Dirt Pit
Nelson & Sons Dirt Haulers Incorporated
NELSON'S DIRT PIT
Dirty Ike Quarry
R D BLANKEN
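
The loop above picks out result rows through their `<th bgcolor="#000000">` cell. The list slicing that the prompt asks for might look roughly like the sketch below. It runs on a made-up miniature table, so `FIRST_DATA_ROW` is 2 here; on the real page the notebook found the first result at index 7, and the assumption that only the final row (the total) needs dropping should be checked against the actual output.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the results page: two header rows, two results, one total row
sample_html = """
<table>
<tr><th>Mine Search Results</th></tr>
<tr><td>ID</td><td>State</td><td>Operator</td><td>Mine</td></tr>
<tr><td>3503598</td><td>OR</td><td>Newberg Rock &amp; Dirt</td><td>Newberg Rock &amp; Dirt</td></tr>
<tr><td>0502030</td><td>CO</td><td>Allied Dirt Moving Company</td><td>Allied Dirt Moving Co Pit &amp; Plant</td></tr>
<tr><td>Total Number of Mines Found: 2</td></tr>
</table>
"""

doc_sample = BeautifulSoup(sample_html, 'html.parser')
rows = doc_sample.find_all('tr')

FIRST_DATA_ROW = 2  # assumption for this sample only

# Skip the header rows at the front and the total row at the end
data_rows = rows[FIRST_DATA_ROW:-1]

names = [row.find_all('td')[3].text.strip() for row in data_rows]
for name in names:
    print(name)
```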

### Loop through each operator result, printing its ID

There should be ONE code per row, and NO empty rows between them.

In [124]:
for t in doc.find_all('th', attrs={'bgcolor':'#000000'}):
    # Column 0 holds the ID
    print(t.find_parent().find_all('td')[0].text.strip())

3503598
0502030
4801789
1601167
4103265
2601714
1401575
1700776
0301963
1601082
1600916
1600956
3800631
3401803
1302275
1302409
1601106
3400915
4104192
4503200
4104016
0301729
0404851
2200734
5002028
2200637
1301775
3401762
4103577
1601124
1600801
1601150
4703427
0405187
2501216
1600761
1600954
1601271
0301890
1601391
0202046
1601163
1601250
1600950
1600908
1601196
1600899
1601049
1600953
4102999
4103597
1601257
1601165
1601194
4104054
2401288
2901986
1601127
3800655
4105017
1600980
1600986
1600951
4103211
2402115
1601159
4104475
0801388
1601178
3800617
4104618
1601234
4801174
1601131
1600952
1601162
4103264


## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.

In [126]:
mines = []

for t in doc.find_all('th', attrs={'bgcolor':'#000000'}):
    columns = t.find_parent().find_all('td')
    # Page column order: ID, state, operator, mine, type, coal/metal, status, commodity
    mines.append({
        'Operator ID': columns[0].text.strip(),
        'Operator name': columns[2].text.strip(),
        'Mine name': columns[3].text.strip(),
        'State': columns[1].text.strip(),
        'Mine type': columns[4].text.strip(),
        'Coal or metal': columns[5].text.strip(),
        'Status': columns[6].text.strip(),
        'Commodity': columns[7].text.strip()
    })

mines

[{'Coal or metal': 'M',
  'Commodity': 'Crushed, Broken Stone NEC',
  'Mine name': 'Newberg Rock & Dirt',
  'Mine type': 'Surface',
  'Operator ID': '3503598',
  'Operator name': 'Newberg Rock & Dirt',
  'State': 'OR',
  'Status': 'Active'},
 {'Coal or metal': 'M',
  'Commodity': 'Construction Sand and Gravel',
  'Mine name': 'Allied Dirt Moving Co Pit & Plant',
  'Mine type': 'Surface',
  'Operator ID': '0502030',
  'Operator name': 'Allied Dirt Moving Company',
  'State': 'CO',
  'Status': 'Abandoned'},
 {'Coal or metal': 'M',
  'Commodity': 'Construction Sand and Gravel',
  'Mine name': 'AM Dirtworks & Aggregate Sales',
  'Mine type': 'Surface',
  'Operator ID': '4801789',
  'Operator name': 'AM Dirtworks & Aggregate Sales',
  'State': 'ND',
  'Status': 'Intermittent'},
 {'Coal or metal': 'M',
  'Commodity': 'Construction Sand and Gravel',
  'Mine name': 'Bar-Lin Dirt Pit',
  'Mine type': 'Surface',
  'Operator ID': '1601167',
  'Operator name': 'Bar-Lin Dirt Company',
  'State': 'L
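
Because the eight cells always arrive in the same order, another way to build each dictionary is to pair a list of field names with the cell texts using `zip`. This is only a sketch: `FIELDS` and `row_to_dict` are hypothetical names, and the sample cell texts are typed in by hand from the first result row rather than scraped.

```python
# Field names in the order the columns appear on the page
FIELDS = [
    'Operator ID', 'State', 'Operator name', 'Mine name',
    'Mine type', 'Coal or metal', 'Status', 'Commodity',
]


def row_to_dict(cell_texts):
    """Pair each field name with the stripped text of the matching cell."""
    if len(cell_texts) != len(FIELDS):
        raise ValueError('expected %d cells, got %d' % (len(FIELDS), len(cell_texts)))
    return dict(zip(FIELDS, (text.strip() for text in cell_texts)))


# Cell texts for the first result, including the stray whitespace seen in the HTML
cells = [
    '3503598', 'OR ', ' Newberg Rock & Dirt  ', 'Newberg Rock & Dirt',
    'Surface             ', 'M ', 'Active  ', 'Crushed, Broken Stone NEC  ',
]

mine = row_to_dict(cells)
print(mine)
```

With real rows you would pass `[td.text for td in columns]`. The length check makes a malformed row fail loudly instead of producing a dictionary with missing fields.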

### Save that to a CSV

In [127]:
df = pd.DataFrame(mines)
# index=False keeps the row index out of the file, so no 'Unnamed: 0' column appears on reload
df.to_csv('mines.csv', index=False)

### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [128]:
pd.read_csv('mines.csv')

,Coal or metal,Commodity,Mine name,Mine type,Operator ID,Operator name,State,Status
0,M,"Crushed, Broken Stone NEC",Newberg Rock & Dirt,Surface,3503598,Newberg Rock & Dirt,OR,Active
1,M,Construction Sand and Gravel,Allied Dirt Moving Co Pit & Plant,Surface,502030,Allied Dirt Moving Company,CO,Abandoned
2,M,Construction Sand and Gravel,AM Dirtworks & Aggregate Sales,Surface,4801789,AM Dirtworks & Aggregate Sales,ND,Intermittent
3,M,Construction Sand and Gravel,Bar-Lin Dirt Pit,Surface,1601167,Bar-Lin Dirt Company,LA,Abandoned
4,M,Construction Sand and Gravel,Barber'S Dirt Pit,Surface,4103265,Barber'S Dirt Pit,TX,Abandoned
5,M,Gold Ore,Pay Dirt,Surface,2601714,Basil Cramer,NV,Abandoned
6,M,Construction Sand and Gravel,BENDER SAND & DIRT,Surface,1401575,Bender Sand & Dirt,KS,Abandoned
7,M,Construction Sand and Gravel,BERT'S DIRT,Surface,1700776,BERT'S DIRT,ME,Abandoned
8,M,Construction Sand and Gravel,Big Red Dirt Farm,Surface,301963,Big Red Dirt Farm LLC,AR,Abandoned
9,M,Construction Sand and Gravel,Big River Dirt Pit,Surface,1601082,Big River Dirt Pit,LA,Abandoned
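
The IDs read back above have lost their leading zeros (`0502030` came back as `502030`) because `read_csv` infers an integer column. A possible fix, sketched here on two hand-typed rows and an in-memory buffer standing in for `mines.csv`, is to read that column as text with `dtype`. Writing with `index=False` is the usual way to keep the row index from turning into an unnamed column.

```python
import io

import pandas as pd

# Two hand-typed rows standing in for the scraped list of dictionaries
mines_sample = [
    {'Operator ID': '3503598', 'Mine name': 'Newberg Rock & Dirt', 'State': 'OR'},
    {'Operator ID': '0502030', 'Mine name': 'Allied Dirt Moving Co Pit & Plant', 'State': 'CO'},
]
df_sample = pd.DataFrame(mines_sample)

# In-memory stand-in for 'mines.csv'
buffer = io.StringIO()
df_sample.to_csv(buffer, index=False)  # do not write the row index
buffer.seek(0)

# Read the ID column as text so leading zeros survive the round trip
df_back = pd.read_csv(buffer, dtype={'Operator ID': str})
print(list(df_back.columns))
print(df_back['Operator ID'].tolist())
```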
