# Education Locations

## Overview

This activity lets you practice using the [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) library to scrape data from the web. It also lets you practice using a **Jupyter Notebook** to both document and perform your work. As you can see, a notebook can contain _Markdown_ as well as Python.

**Quick tips**
- Press `esc`, `m`, `enter` to start writing Markdown rather than code
- Press `shift` + `enter` to run a code cell

## Set up

To use the Python libraries, first make sure they are installed on your machine. You can do this by running the following commands in your terminal:

```
# Install the required libraries using pip on the terminal
pip install beautifulsoup4
pip install requests
pip install pandas
pip install pygeocoder
pip install plotly
```

You should now be able to import the library inside this notebook by running the following line of Python code:

In [1]:
from bs4 import BeautifulSoup as bs, SoupStrainer as ss


We'll also need to import a few other libraries, such as `pandas` to manage our data and `requests` to make HTTP requests:

In [6]:
import requests as r
import pandas as p
import re
from pygeocoder import Geocoder
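# Note: plotly 4 and later moved this module to the separate chart-studio package
# (there the equivalent import is `import chart_studio.plotly as py`)
import plotly.plotly as py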

## Identify Institution Links

Our first task is to use Python to identify the **links to institution pages** on the collegecost.ed.gov [website](https://collegecost.ed.gov/catc/Default.aspx). We'll begin by loading the page content. Because of peculiarities in how the page is built on the client side, we'll read a local copy of the page using the `codecs` package rather than requesting it directly.

In [7]:
import codecs

# Read the saved copy of the page (closing the file afterwards) and parse it
with codecs.open("college-site.html", 'r') as file:
    page_content = file.read()
soup = bs(page_content, 'html.parser')

Now that we have all the page content, open the [website](https://collegecost.ed.gov/catc/Default.aspx) in your browser to _identify the part of the DOM_ that holds the relevant information.

In [8]:
# Find the TuitionGrid table
table = soup.find(id = 'dvCATWTuitionGrid')

In [9]:
# Extract each row from the table
table_rows = table.find_all('tr', recursive=True)

In [10]:
# Look at a single row of your table, and figure out how to extract the link (URL) from it
table_rows[0]['onclick']
re.findall(r"'(.*?)'", table_rows[0]['onclick'])

[u'http://nces.ed.gov/collegenavigator/?id=142328']
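
To see why the pattern `'(.*?)'` returns just the URL, it can help to run it on standalone strings. The sketch below assumes a hypothetical `onclick` value of the form `window.open('...')`; the real attribute on the page may be worded differently, but any single-quoted text is captured the same way. It also contrasts the lazy `.*?` with a greedy `.*`.

```python
import re

# Hypothetical onclick value; the attribute on the real page may differ.
sample_onclick = "window.open('http://nces.ed.gov/collegenavigator/?id=142328')"

# '(.*?)' lazily captures the text between each pair of single quotes.
quoted = re.findall(r"'(.*?)'", sample_onclick)
print(quoted)  # ['http://nces.ed.gov/collegenavigator/?id=142328']

# With two quoted arguments, the lazy form yields one match per argument,
# while the greedy form runs from the first quote to the last one.
two_args = "go('a', 'b')"
lazy = re.findall(r"'(.*?)'", two_args)
greedy = re.findall(r"'(.*)'", two_args)
print(lazy)    # ['a', 'b']
print(greedy)  # ["a', 'b"]
```

Because `re.findall` returns a list, the URL itself is the first element of the result.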

## Extracting links from table rows

In this section, we'll iterate through the table rows and extract the links from each one.

In [11]:
# Write a simple function to extract the link from each row
def extract_url(row):
    links = re.findall(r"'(.*?)'", row['onclick'])
    return links[0]

In [15]:
# List to store links
links = []

# Iterate through table rows and use the `extract_url` function to get the URL and store it in `links`
for tr in table_rows:
    link = extract_url(tr)
    links.append(link)
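
The loop above assumes every row has an `onclick` attribute. If the table also contained rows without one (a header row, for instance), `row['onclick']` would raise a `KeyError`. The following is a minimal sketch of a more defensive variant, run against hypothetical markup rather than the real page; the name `extract_url_or_none` and the sample HTML are illustrative assumptions.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a header row without onclick plus two data rows.
sample_html = """
<table id="dvCATWTuitionGrid">
  <tr><th>Institution</th></tr>
  <tr onclick="window.open('http://example.com/?id=1')"><td>College A</td></tr>
  <tr onclick="window.open('http://example.com/?id=2')"><td>College B</td></tr>
</table>
"""

def extract_url_or_none(row):
    """Return the first single-quoted string in the row's onclick, or None."""
    onclick = row.get('onclick')  # Tag.get returns None when the attribute is missing
    if onclick is None:
        return None
    matches = re.findall(r"'(.*?)'", onclick)
    return matches[0] if matches else None

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.find(id='dvCATWTuitionGrid').find_all('tr')

# Keep only the rows that actually yielded a URL.
sample_links = [url for url in map(extract_url_or_none, sample_rows) if url is not None]
print(sample_links)  # ['http://example.com/?id=1', 'http://example.com/?id=2']
```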

## Iterate through links and extract address from webpage

In [16]:
# Write a function to retrieve the address of an institution given its URL (go to the URL, extract address)
def get_address(url):
    page = r.get(url)
    soup = bs(page.content, 'html.parser')
    container = soup.find('div', { "class" : "collegedash"})
    # First span without a class attribute holds the name and address
    span = container.find_all('span', attrs={'class': None})[0]
    # This helped: http://stackoverflow.com/questions/38754940/get-text-after-specific-tag-with-beautiful-soup
    # The address is the text node immediately after the first <br>
    text = span.find_all('br')[0].next_sibling
    return text

In [17]:
# List to store addresses
addresses = []

# Iterate through links and use your `get_address` function to get the address and store it in `addresses`
for link in links:
    address = get_address(link)
    addresses.append(address)
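
Since `get_address` downloads a page on every call, it can be easier to check the parsing step on its own. The sketch below separates parsing from downloading and runs it on hypothetical markup that imitates the structure the function expects (a `collegedash` container, a class-less span, and the address after a `<br>`). The helper name `parse_address`, the sample HTML, and the use of a function filter to find the class-less span are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure that get_address relies on.
sample_page = """
<div class="collegedash">
  <span><span class="headerlg">Example College</span><br>123 Main Street, Springfield, Illinois 62701</span>
</div>
"""

def parse_address(html):
    """Return the text right after the first <br> in the first class-less span, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', class_='collegedash')
    if container is None:
        return None
    # Function filter: the first <span> that has no class attribute.
    span = container.find(lambda tag: tag.name == 'span' and not tag.has_attr('class'))
    if span is None or span.find('br') is None:
        return None
    sibling = span.find('br').next_sibling
    return str(sibling).strip() if sibling is not None else None

print(parse_address(sample_page))
print(parse_address("<p>no address here</p>"))
```

Returning `None` instead of raising lets a long scraping loop continue when a single page has an unexpected layout.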

In [18]:
# Iterate through the addresses and use the `Geocoder.geocode` function to get the lat/long
coordinates = []
for address in addresses:
    try:
        location = Geocoder.geocode(address)
        coordinates.append(location.coordinates)
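    except Exception:
        # Skip addresses that cannot be geocoded; `coordinates` may end up shorter than `addresses`
        pass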
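
Skipping failures means the position of a coordinate no longer tells you which address it came from. One possible alternative, sketched below, records `None` for each failure so the two lists stay aligned. The helper `geocode_all` and the stand-in `fake_geocode` are hypothetical names used only for this demonstration; no real geocoding service is called.

```python
def geocode_all(addresses, geocode):
    """Return one entry per address: the geocode result, or None when the lookup fails."""
    results = []
    for address in addresses:
        try:
            results.append(geocode(address))
        except Exception:
            results.append(None)  # keep the slot so indexes still match `addresses`
    return results

# Hypothetical stand-in for a real geocoder, used only for this demonstration.
def fake_geocode(address):
    known = {"123 Main Street": (39.8, -89.6)}
    return known[address]  # raises KeyError for unknown addresses

aligned = geocode_all(["123 Main Street", "nowhere"], fake_geocode)
print(aligned)  # [(39.8, -89.6), None]
```

In the notebook, the second argument could be something like `lambda a: Geocoder.geocode(a).coordinates`, and the `None` entries would need to be filtered out before plotting.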

## Mapping with Plotly

In [23]:
# You need to sign in to plotly; get an API key here: https://plot.ly/settings/api
# py.sign_in('USERNAME', 'API-KEY')

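# Convert the (lat, long) tuples to a DataFrame: column 0 is latitude, column 1 is longitude
coordinates_df = p.DataFrame(coordinates)

# Plot with plotly, from example: https://plot.ly/python/scatter-plots-on-maps/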
data = [ dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = coordinates_df[1],
        lat = coordinates_df[0],
        mode = 'markers',
        marker = dict( 
            size = 8, 
            opacity = 0.8,
            reversescale = True,
            autocolorscale = False,
            line = dict(
                width=1,
                color='rgb(102, 102, 102)'
            ),
        ))]

layout = dict(
        title = 'Most Increases in Higher Education Tuition',
        colorbar = True,   
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5        
        ),
    )

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='tuition-increases' )