# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm#MID), thank goodness we can search for these things.

## Preparation: Knowing your tags

These questions are the same for every data set, and might not work exactly for yours.

**Search for every operator with 'dirt' in their name, including abandoned mines.**

### What is the tag and class name for every row of data?

tag is tr and class name is ??? I'm not sure. I don't see anything that says 'class' anywhere.

### What is the tag and class name for every mine operator's name?

I think they're in !-- DNT -- tags, but I can't get them out!

### What is the tag and class name for every mine's name?

I think they're in !-- DNT -- tags, but I can't get them out!

## Being lazy

If you only needed these results, what would you do instead of scraping them?

## Setup: Import what you'll need to scrape the page

Use `requests`, not `urllib`.

In [None]:
from bs4 import BeautifulSoup
import requests
response = requests.get('https://www.tdlr.texas.gov/cimsfo/fosearch.asp')
response.text

## Try to scrape the page

To test if you requested the page correctly, save the BeautifulSoup document as `doc` and run the code `doc.find_all('tr')[-1].text` to get the text of the last `<tr>` element.

- If the result starts with **Total Number of Mines Found**, you were successful.

In [None]:
data = {"OperSearch":"dirt",
"Abandoned":"No",
"MineName":"",
"StateSearch":"None",
"CM":"All",
"x":"26",
"y":"10",
"MC":"Opersearch"
}

Headers = {
    "Referer": "https://arlweb.msha.gov/drs/drshome.htm",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

response = requests.post('https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp', data = data, headers = Headers)
response.text

In [None]:
doc = BeautifulSoup(response.text,'html.parser')
doc.prettify()

In [6]:
doc.find_all('tr')[-1].text

'\nTotal Number of Mines Found:\xa0\xa019'

## Actually scraping

### Hopefully you know that each `tr` is supposed to be your data. What is the index of the first row element that is actually a result?

`.text` will help you here.

In [11]:
#it's in the tr tag and if we count down, it's the 7th one (starting at zero)
table_row = doc.find_all('tr')[7]
print(table_row.text)




3503598

OR 
 Newberg Rock & Dirt  
Newberg Rock & Dirt
Surface             
M 
Active  
Crushed, Broken Stone NEC  



### Loop through each operator result, printing its ID

There should be ONE code per row, and NO empty rows between them.

In [76]:
table_row = doc.find_all('tr')[7:]
for each_element in table_row:
    print(each_element.find_all('font')[0].string)

3503598
4801789
5001797
4608254
2103723
4104757
0801306
3901432
3609624
3609931
1519799
4407296
4407270
0203332
2901986
4300768
4300776
2302283
2103518
None


### This is practice 
#### table_row = doc.find_all('tr')[7:]
#### for each_element in table_row:
#####     print(each_element.find_all(value=True))
    #so here we're using the attribute 'value' which has a value (confusing yes) of the mind ID number. 

###and so is this
#here's another way to do it, although it doesn't work as well...
ID_num = doc.find_all('td')
for each_element in ID_num:
    print(each_element.find_all(value=True)) !=None

    I can't figure out how to get rid of the spaces. I think it has to do with next siblings but neither of these
    work and I'm confused.
    if each_element.find_all(value=True) !=None:
        print(each_element.find_next_siblings(value=True))

## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.

In [45]:
#first let's practice finding things, because I'm having trouble getting them out.
doc.find_all('tr')[7:]
#here's where the info starts.

[<tr>
 <td align="center">
 <form action="/drs/ASP/BasicMineInfostatecounty.asp" method="post" name="search">
 <input name="MineId" type="hidden" value="3503598"/><font style="FONT-SIZE:.75em;">3503598</font>
 </form></td>
 <td><font style="FONT-SIZE:.75em;"><!-- DNT --><b>OR</b><!-- /DNT --> </font></td>
 <td><font style="FONT-SIZE:.75em;"><!-- DNT --> Newberg Rock &amp; Dirt<!-- /DNT -->  </font></td>
 <td><font style="FONT-SIZE:.75em;"><!-- DNT -->Newberg Rock &amp; Dirt<!-- /DNT --></font></td>
 <td align="center"><font style="FONT-SIZE:.75em;"><!-- DNT -->Surface             <!-- /DNT --></font></td>
 <td align="center"><font style="FONT-SIZE:.75em;"><!-- DNT -->M<!-- /DNT --> </font></td>
 <td><font style="FONT-SIZE:.75em;">Active  </font></td>
 <td><font style="FONT-SIZE:.75em;">Crushed, Broken Stone NEC  </font></td>
 <th bgcolor="#000000"><input alt="More Information" border="0" name="submit" src="/drs/images/moreinfo.jpg" type="image"/></th></tr>,
 <tr>
 <td align="center">
 

In [86]:
table_row = doc.find_all('tr')[7:]
for each_element in table_row:
    print(each_element.find_all('td')[7].string)
#where is there a list index out of range at the bottom, but it seems to return what I want?

Crushed, Broken Stone NEC  
Construction Sand and Gravel  
Construction Sand and Gravel  
Crushed, Broken Limestone NEC  
Construction Sand and Gravel  
Construction Sand and Gravel  
Sand, Common  
Construction Sand and Gravel  
Construction Sand and Gravel  
Dimension Stone NEC  
Coal (Bituminous)  
Coal (Bituminous)  
Coal (Bituminous)  
Construction Sand and Gravel  
Construction Sand and Gravel  
Construction Sand and Gravel  
Construction Sand and Gravel  
Construction Sand and Gravel  
Construction Sand and Gravel  


IndexError: list index out of range

In [73]:
mines = []

mine_table = doc.find_all('tr')[7:]
for each_element in mine_table:
    current = {}
    if each_element.find_all('form'):
        current['ID'] = each_element.find_all(Value=True)
        current['operator_name'] = each_element.find_all('td')[2] #or are these in DNT tags? How do you write the DNT
            #tags? I can't get anything to come up. I know that in tr the operator name is in the 2 td tag. 
        current['mine_name'] = each_element.find_all('td')[3]
        current['state'] =
        current['mine_type']
        current['coal_metal']
        current['status']
        current['commodity']
        mines.append(current)

[<font style="FONT-SIZE:.75em;">3503598</font>, <font style="FONT-SIZE:.75em;"><!-- DNT --><b>OR</b><!-- /DNT --> </font>, <font style="FONT-SIZE:.75em;"><!-- DNT --> Newberg Rock &amp; Dirt<!-- /DNT -->  </font>, <font style="FONT-SIZE:.75em;"><!-- DNT -->Newberg Rock &amp; Dirt<!-- /DNT --></font>, <font style="FONT-SIZE:.75em;"><!-- DNT -->Surface             <!-- /DNT --></font>, <font style="FONT-SIZE:.75em;"><!-- DNT -->M<!-- /DNT --> </font>, <font style="FONT-SIZE:.75em;">Active  </font>, <font style="FONT-SIZE:.75em;">Crushed, Broken Stone NEC  </font>]
[<font style="FONT-SIZE:.75em;">4801789</font>, <font style="FONT-SIZE:.75em;"><!-- DNT --><b>ND</b><!-- /DNT --> </font>, <font style="FONT-SIZE:.75em;"><!-- DNT -->AM Dirtworks &amp; Aggregate Sales<!-- /DNT -->  </font>, <font style="FONT-SIZE:.75em;"><!-- DNT -->AM Dirtworks &amp; Aggregate Sales<!-- /DNT --></font>, <font style="FONT-SIZE:.75em;"><!-- DNT -->Surface             <!-- /DNT --></font>, <font style="FONT-SIZE:

### Save that to a CSV

In [1]:
import pandas as pd
df = pd.DataFrame(mines_df)
df.to_csv("mines_df.csv", index=False)

NameError: name 'mines_df' is not defined

### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [66]:
mines_df = pd.read_csv('mines')
mines_df.head()


FileNotFoundError: File b'mines' does not exist