## Gathering some reactor data


In [None]:
# import modules to facilitate the scrape
import csv

import pandas as pd
import requests

from bs4 import BeautifulSoup

### Fetch HTML with `requests`

`requests` is great at playing web browser. For more information, check out the [full documentation](http://docs.python-requests.org/en/master/).

```python
response = requests.get('some URL')
# navigates to a site and hands you back the response

response.content
# the page's HTML source code, as served up by requests
```

We're going to scrape data on nuclear reactors operating in the U.S.: http://www.nrc.gov/reactors/operating/list-power-reactor-units.html

In [None]:
# fetch the contents of webpage with requests
url = "http://www.nrc.gov/reactors/operating/list-power-reactor-units.html"
main_page = requests.get(url)
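
The cell above assumes the request worked. Here's a minimal sketch of a more cautious fetch. `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice; neither comes from this notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return a page's raw HTML, or raise an exception if the request fails."""
    # timeout (in seconds) keeps the script from hanging forever on a dead server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() turns 4xx and 5xx responses into an exception
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` would hand back the same bytes as `requests.get(url).content`, but it fails loudly on something like a 404 instead of quietly passing an error page along to the parser.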

### Parse HTML with `BeautifulSoup`

In [None]:
# let BeautifulSoup parse the content of that page
soup = BeautifulSoup(main_page.content, 'html.parser')

### Target the data

In [None]:
# snip out the table and pass it to a new variable
reactors_table = soup.find('table')

In [None]:
# print reactors_table to verify we have the right thing
print(reactors_table)

In [None]:
# use .find_all to create a list of rows in the table
reactor_rows = reactors_table.find_all('tr')

In [None]:
# isolate the fourth row and print it
ex_row = reactor_rows[3]
print(ex_row)

One of our table's rows, with a little shading and indentation:

```html
<tr valign="top">
    <td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>
    <td align="center">DPR-51</td>
    <td>PWR</td>
    <td>6 miles WNW of Russellville,  AR</td>
    <td>Entergy Nuclear Operations, Inc.</td>
    <td align="middle">4</td>
</tr>
```

In [None]:
# use .find_all again to generate a list of the row's cells and return it
cells = ex_row.find_all('td')
cells

BeautifulSoup tags have a couple of other tools we haven't discussed yet that are helpful for extracting the information _inside_ of tags:
```python
soup.contents
# breaks up everything in a tag into a fresh list (useful when you have more than text in a cell)

soup.get('some attribute')
# returns the attribute (useful for getting URLs from <a> tags, for example)
```
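
To see both in action without hitting the network, here's a small sketch that parses just the first cell of the example row shown above. The variable names (`cell`, `children`, `link_tag` and so on) are made up for illustration.

```python
from bs4 import BeautifulSoup

# the first cell of the example row, copied in as a plain string
html = '<td scope="row"><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>'
cell = BeautifulSoup(html, 'html.parser').find('td')

# .contents splits the cell into its children:
# the <a> tag, the <br/> tag and the trailing docket text
children = cell.contents
link_tag = children[0]

name = link_tag.text              # the text inside the <a> tag
href = link_tag.get('href')       # the value of the href attribute
docket = str(children[2])         # the bare text after the <br/>
missing = link_tag.get('title')   # .get returns None if the attribute isn't there
```

That last line is the reason to prefer `.get('href')` over `link_tag['href']` when you're not sure every tag has the attribute: the square-bracket version raises a `KeyError` instead of returning `None`.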

In [None]:
# examine the "contents" of the first item in cells
cells[0].contents

In [None]:
# isolate and print the name, the link and the docket number
print(cells[0].contents[0].text)
print(cells[0].contents[0].get('href'))
print(cells[0].contents[2])

### Extract the data

In [None]:
# make an empty list to hold the data
scraped_data = []

# a for loop is going to take us through every row in the table EXCEPT the header
# combining two steps: the list it pulls from will be created by a .find_all for 'tr' tags
for row in reactors_table.find_all('tr')[1:]:
    
    # .find_all 'td' tags in the row and put them into a variable
    cells = row.find_all('td')
    
    # extract the cell contents
    reactor_name = cells[0].contents[0].text
    link = 'http://www.nrc.gov' + cells[0].contents[0].get('href')
    docket = cells[0].contents[2]
    license = cells[1].text
    reactor_type = cells[2].text
    location = cells[3].text
    owner = cells[4].text
    region = cells[5].text
    
    # append the collected data to the empty list
    scraped_data.append([reactor_name, link, docket, license, reactor_type, location, owner, region])
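
The loop above trusts every row to have six cells and a link in the first one. If the table ever picks up a stray row, it'll crash with an `IndexError` or `AttributeError`. Here's a sketch of a more defensive version, run against a tiny made-up table rather than the live page. `parse_row`, `BASE_URL` and the sample HTML are all assumptions for illustration, not part of the original scrape.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://www.nrc.gov'


def parse_row(row):
    """Return a list of eight values for one table row, or None if the row doesn't fit."""
    cells = row.find_all('td')
    link_tag = cells[0].find('a') if cells else None
    # skip header rows (no td cells), short rows and rows without a link
    if len(cells) < 6 or link_tag is None:
        return None
    # the docket number is the last piece of text in the first cell
    docket = list(cells[0].stripped_strings)[-1]
    first_three = [
        link_tag.get_text(strip=True),
        urljoin(BASE_URL, link_tag.get('href')),
        docket,
    ]
    return first_three + [cell.get_text(strip=True) for cell in cells[1:6]]


# a hypothetical table: one header row, one good row, one malformed row
sample = """
<table>
  <tr><th>Plant Name</th><th>License</th></tr>
  <tr>
    <td><a href="/info-finder/reactors/ano1.html">Arkansas Nuclear 1</a><br/>05000313</td>
    <td>DPR-51</td><td>PWR</td><td>6 miles WNW of Russellville, AR</td>
    <td>Entergy Nuclear Operations, Inc.</td><td>4</td>
  </tr>
  <tr><td>a malformed row</td></tr>
</table>
"""

rows = BeautifulSoup(sample, 'html.parser').find_all('tr')
parsed = [parse_row(row) for row in rows]
clean_rows = [r for r in parsed if r is not None]
```

Because the header row has no `td` cells, it gets filtered out along with the malformed row, so there's no need for the `[1:]` slice in this version.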
    

### Write the data to CSV

In [None]:
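# open a file and write our data to it
# (newline='' lets the csv module handle line endings, which avoids blank rows on Windows)
with open('reactor_data.csv', 'w', newline='') as outfile: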
    writer = csv.writer(outfile)
    writer.writerow(['reactor_name', 'link', 'docket', 'license', 'reactor_type', 'location', 'owner', 'region'])
    writer.writerows(scraped_data)
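
A quick way to check a CSV is to read it straight back in. This sketch writes one sample record (copied from the example row earlier) to a temporary file, so it won't touch `reactor_data.csv`, then loads it with `csv.DictReader`. The file name `reactor_check.csv` and the variable names are made up for illustration.

```python
import csv
import os
import tempfile

header = ['reactor_name', 'link', 'docket', 'license', 'reactor_type', 'location', 'owner', 'region']
sample_rows = [[
    'Arkansas Nuclear 1', 'http://www.nrc.gov/info-finder/reactors/ano1.html', '05000313',
    'DPR-51', 'PWR', '6 miles WNW of Russellville, AR', 'Entergy Nuclear Operations, Inc.', '4',
]]

# write to a throwaway directory instead of the working directory
path = os.path.join(tempfile.mkdtemp(), 'reactor_check.csv')

with open(path, 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(sample_rows)

# DictReader uses the header row as keys, giving one dict per record
with open(path, newline='', encoding='utf-8') as infile:
    records = list(csv.DictReader(infile))
```

Note that the location and owner values contain commas. The `csv` module wraps those in quotes on the way out and unwraps them on the way back in, which is the main reason to use it rather than joining strings by hand.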

### If you want only a simple HTML table on a page:

In [None]:
pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population', header=0)[3]

In [None]:
pd.read_clipboard()

For more on pandas and parsing tabular data, I recommend the CAR class and/or [this book](http://shop.oreilly.com/product/0636920023784.do).