# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm#MID). Thank goodness we can search for these things.

## Preparation: Knowing your tags

These questions are the same for every data set, so they might not apply exactly to yours.

**Search for every operator with 'dirt' in their name, including abandoned mines.**

### What is the tag and class name for every row of data?

In [51]:
from bs4 import BeautifulSoup
import requests

In [52]:
url = 'https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp'

data = {
    'OperSearch': 'dirt',
    'Abandoned': 'No',
    'MineName': '',
    'StateSearch': 'None',
    'CM': 'All',
    'x': '0',
    'y': '0',
    'MC': 'Opersearch'
}

response = requests.post(url, data=data)
doc = BeautifulSoup(response.text, 'html.parser')  
doc.prettify()

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<head>\n <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n <!-- ****************************************** Begin META TAGS ********************************************* -->\n <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>\n <!-- ****************************************** End META TAGS *********************************************** -->\n <title>\n  MSHA  - Mine  Data Retrieval System - Basic Mine Information Page\n </title>\n <script src="/2010redesign/Scripts/federated-analytics.js" type="text/javascript">\n </script>\n <script src="/2010redesign/Scripts/AC_RunActiveContent.js" type="text/javascript">\n </script>\n <link href="/2010Redesign/includes/Print.css" media="print" rel="stylesheet" type="text/css"/>\n <link href="/2010Redesign/Includes/MSHAwebnew.css" media="screen" rel="stylesheet" type="text/css">\n  <link href="/2010Redesign/includes/style-screen.css" media=

### What is the tag and class name for every mine operator's name?

In [53]:
#<td> 

### What is the tag and class name for every mine's name?

In [54]:
#<td> 
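
One way to answer these questions in code, rather than by reading the prettified HTML, is to print each tag's `.name` and its `class` attribute. The sketch below runs on a small made-up table, not on the MSHA page, and the `id` class in it is hypothetical. `tag.get('class')` returns `None` when a tag has no class.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML; the class name "id" is an assumption for illustration
sample_html = """
<table>
  <tr><td class="id">3503598</td><td>Newberg Rock &amp; Dirt</td></tr>
</table>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Collect (tag name, class list) for every cell; class is None if absent
tag_summary = [(cell.name, cell.get('class')) for cell in sample_doc.find_all('td')]
for name, classes in tag_summary:
    print(name, classes)
```

If every cell comes back with `None`, the rows can only be told apart by position, which is why the later cells rely on indexes.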

## Being lazy

If you only needed these results, what would you do instead of scraping them?

In [57]:
# Copy the results table from the browser and paste it into Excel

## Setup: Import what you'll need to scrape the page

Use `requests`, not `urllib`.

In [58]:
#done above 

## Try to scrape the page

To test if you requested the page correctly, save the BeautifulSoup document as `doc` and run the code `doc.find_all('tr')[-1].text` to get the text of the last `<tr>` element.

- If the result starts with **Total Number of Mines Found**, you were successful.

In [59]:
scrape = doc.find_all('tr')[-1]
print(scrape.text)


Total Number of Mines Found:  19
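
The total in that last row can also be pulled out as a number, which is handy later for checking that you scraped as many rows as the site reports. A minimal sketch, assuming the row text looks like the output above; `parse_total` is a hypothetical helper name, not something from the assignment.

```python
import re

def parse_total(last_row_text):
    """Return the mine count from the summary row, or None if it is missing."""
    match = re.search(r'Total Number of Mines Found:\s*(\d+)', last_row_text)
    return int(match.group(1)) if match else None

# Sample text shaped like the last <tr> printed above
total = parse_total('\n\nTotal Number of Mines Found:  19\n')
print(total)
```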


## Actually scraping

### Hopefully you know that each `tr` is supposed to be your data. What is the index of the first row element that is actually a result?

`.text` will help you here.

In [60]:
body = doc.find('body') 
row_tags = body.find_all('tr')[7]
print(row_tags.text)




3503598

OR 
 Newberg Rock & Dirt  
Newberg Rock & Dirt
Surface             
M 
Active  
Crushed, Broken Stone NEC  
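
The row text is padded with newlines, trailing spaces and non-breaking spaces (`\xa0`). Python's `str.strip()` removes all of these from both ends of a string. A short sketch; the sample strings mimic the padding seen in this row.

```python
# Cell texts padded the way they are in the first result row
raw_cells = ['\n\n3503598\n', 'OR\xa0', ' Newberg Rock & Dirt \xa0', 'Surface             ']

# strip() with no arguments removes any leading/trailing whitespace, including \xa0
clean_cells = [cell.strip() for cell in raw_cells]
print(clean_cells)
```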



### Loop through each operator result, printing its name

Use LIST SLICING to skip the non-data row(s).

In [61]:
row = doc.find_all('tr')

for element in row[7:19]:
    operator = element.find_all('td')[2]
    print(operator.text)
    print('---')

 Newberg Rock & Dirt  
---
AM Dirtworks & Aggregate Sales  
---
Dirt Company  
---
Dirt Con  
---
Dirt Doctor Inc  
---
Dirt Works  
---
Holley Dirt Company, Inc  
---
Krueger Brothers Gravel & Dirt  
---
M R Dirt  
---
M.R. Dirt Inc.  
---
P B Dirt Movers, Inc  
---
PB Dirt Movers  
---
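
A hard-coded end index only fits one particular search. Because the last `<tr>` is the "Total Number of Mines Found" row, a negative index can drop it however many results come back. The sketch below uses a plain list standing in for `doc.find_all('tr')`. The 7 leading non-data rows come from the index found earlier; the assumption that only the final row is non-data should be checked against the real page.

```python
# Stand-in for doc.find_all('tr'): 7 non-data rows, some results, 1 total row
header_rows = ['header'] * 7
result_rows = ['Newberg Rock & Dirt', 'AM Dirtworks & Aggregate Sales', 'Dirt Company']
all_rows = header_rows + result_rows + ['Total Number of Mines Found:  3']

# Skip the first 7 rows and stop before the last one
data_rows = all_rows[7:-1]
print(data_rows)
```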


### Loop through each operator result, printing its ID

There should be ONE code per row, and NO empty rows between them.

In [62]:
rows = doc.find_all('tr')

for element in rows[7:19]:
    # .text includes the newlines around the ID, so strip them before printing
    operator_id = element.find_all('td')[0].text.strip()
    print(operator_id)

3503598
4801789
5001797
4608254
2103723
4104757
0801306
3901432
3609624
3609931
1519799
4407296



## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.

In [63]:
mines = {
    
    'Operator ID': '',
    'Operator name': '',
    'Mine name': '',
    'State': '',
    'Mine type': '',
    'Coal or metal': '',
    'Status': '',
    'Commodity': ''
}

In [64]:
mines.keys()

dict_keys(['Operator ID', 'Operator name', 'Mine name', 'State', 'Mine type', 'Coal or metal', 'Status', 'Commodity'])
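
One way to fill such a dictionary is to pair the keys with a row's cell values using `zip`. This is a sketch, assuming the cells are already stripped and that the keys are listed in the order the cells appear in a row (ID, then state, then operator, and so on, as read from the first row's text). The values are those of the first result.

```python
# Keys in the order the cells appear in a result row (assumed from the first row's text)
page_order = ['Operator ID', 'State', 'Operator name', 'Mine name',
              'Mine type', 'Coal or metal', 'Status', 'Commodity']

# Stripped cell values of the first result row
cells = ['3503598', 'OR', 'Newberg Rock & Dirt', 'Newberg Rock & Dirt',
         'Surface', 'M', 'Active', 'Crushed, Broken Stone NEC']

# zip pairs each key with the cell in the same position
mine = dict(zip(page_order, cells))
print(mine['Operator name'], '-', mine['State'])
```

Because the pairing is by position, a wrong order in `page_order` silently puts values under the wrong keys, so it is worth printing one record to check.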

In [68]:
mines_list = []

rows = doc.find_all('tr')
for element in rows[7:19]:
    # Get the row's cells once and strip the padding (newlines, spaces, \xa0)
    cells = [td.text.strip() for td in element.find_all('td')]

    # As in the loops above, td[0] is the ID and td[2] the operator name;
    # td[1] is the state and td[3] is taken to be the mine name
    mines_dict = {
        'Operator ID': cells[0],
        'Operator name': cells[2],
        'Mine name': cells[3],
        'State': cells[1],
        'Mine type': cells[4],
        'Coal or metal': cells[5],
        'Status': cells[6],
        'Commodity': cells[7]
    }
    mines_list.append(mines_dict)

# Print the first record to check that keys and values line up
print(mines_list[0])

{'Operator ID': '3503598', 'Operator name': 'Newberg Rock & Dirt', 'Mine name': 'Newberg Rock & Dirt', 'State': 'OR', 'Mine type': 'Surface', 'Coal or metal': 'M', 'Status': 'Active', 'Commodity': 'Crushed, Broken Stone NEC'}

### Save that to a CSV

In [69]:
import pandas as pd
df = pd.DataFrame(mines_list)
df.to_csv("../mines2.csv", index=False)

### Open the CSV file and examine the first few rows. Make sure you didn't save an extra weird unnamed column.

In [None]:
#yay! 
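
The check can also be done in code. The sketch below writes two records to an in-memory buffer instead of a file (an assumption made so the example is self-contained) and reads them back. Passing `dtype=str` to `pd.read_csv` keeps IDs such as `0801306` from losing their leading zero.

```python
import io
import pandas as pd

# Two records using IDs and operator names from the results above
records = [
    {'Operator ID': '3503598', 'Operator name': 'Newberg Rock & Dirt'},
    {'Operator ID': '0801306', 'Operator name': 'Holley Dirt Company, Inc'},
]

# Write without the index, then read the CSV text back in
buffer = io.StringIO()
pd.DataFrame(records).to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer, dtype=str)

columns = list(check.columns)
ids = check['Operator ID'].tolist()
print(columns)
print(check.head())
```

If a column whose name starts with `Unnamed` shows up in `columns`, the index was probably written to the file.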
