# Education Deserts Project

By: Lucas Hu, Nikhil Sinha, Chengyi (Jeff) Chen, Ashwath Raj

---
## Problem Statement
The social issue we are addressing is education, specifically the limited accessibility of higher education in certain pockets of the United States. These areas are called education deserts: places with no colleges within sixty miles, or within any similar distance that limits a student's ability to attend college. In its report "Education Deserts: The Continued Significance of 'Place' in the Twenty-First Century," the American Council on Education states that the majority, 57.4 percent, of incoming freshmen attending public four-year colleges "enroll within 50 miles from their permanent home" (ACE). Because millions of Americans, adults and students alike, live in these education deserts, geographic limitations plausibly contribute to higher poverty rates and below-average incomes.

Our project seeks to help alleviate the issue of geographically limited college accessibility by mapping where education deserts are. Then, by calculating the potential socioeconomic impact of placing colleges in those areas, we can create a framework for optimizing the benefits of building new colleges in any given education desert. 

---
## Project Components
Our project will be divided into 3 primary steps:

### Data Mining: 
First, we will look at the locations of existing colleges to discover regions in the U.S. that are currently "education deserts." We will define an "education desert" as a region that does not have any colleges within a 50-mile radius. This is essentially the first step in our project, and should be the first component we complete.

We will then use census data to discover which features of a region are predictive of that region being an education desert. This will be accomplished by training a binary classifier to identify regions as being education deserts or not, and then interpreting the feature importances of that classifier.


### Prediction: 
Next, we want to predict the "population cover" of existing colleges in the U.S. (i.e. how many college graduates each college contributes to the region). We will do this by looking at non-education desert regions in particular, and using census data to regress on the percentage of the population in that region that has a college degree. (For the time being, we can make progress on this in parallel with Step 1 by ignoring whether or not a given region is labeled as an education desert.)

By applying this regressor to regions that are currently education deserts, we will be able to predict the economic impact of a hypothetical new college (by way of increased college graduate salaries), if a new college were to be built within (or near) a current education desert region.


### Optimization: 
Lastly, we will use the findings from Step 2 to decide where to create new colleges in the U.S. so as to maximize the overall increase in economic impact. Let’s say we have the budget to create 10 new colleges: where should we place those 10 colleges so as to maximize the increase in total college graduate salaries? (Algorithm to be described more in Section 4.)

This will be the last step in our project. If we want to work in parallel, we can start off by optimizing for alternative/proxy metrics, so that we can have the overall optimization pipeline ready to go before Step 2 is complete.

---
## Datasets
We will primarily work with three data sources:

1. The American Community Survey (ACS) 5-Year Estimates in Census Block Groups dataset, which contains 220,334 rows (one per census block group) and 2,161 columns (one per feature), such as Median Earnings broken down by education level, as well as Age and Occupation data.
2. Shapefiles for each state, which include the latitude and longitude coordinates of the polygons that describe each census block group.
3. The Integrated Postsecondary Education Data System (IPEDS) dataset, which contains a list of 1,534 colleges and 145 features about each school, including the exact location of the school.

### Data Mining: 
We will first loop through all the census block groups, represented as polygons, and calculate the latitude / longitude position of each one's centroid. We will then either:

1. Loop through the list of colleges and use the Haversine formula to calculate the great-circle distance between each college and each census block group centroid, labelling any census block group with no college within a 50-mile radius as an Education Desert [1] and every other block group as a Non-education Desert [0],

OR

2. Use tries to find the longest matching prefix of the lat-long pairs, so that we compute one distance per census block group rather than one for every (block group, college) pair. For m block groups and n colleges, this would reduce the runtime from O(mn) for the former approach to roughly O(m + n).
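
As a rough illustration of the first option, the sketch below labels regions by brute-force Haversine distance. The function names, the Earth-radius constant, and the toy centroids are assumptions made for this example; only the college coordinate is taken from the IPEDS sample shown later in the notebook. It returns booleans rather than 0/1 so that it does not depend on a particular label encoding.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant for this sketch)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def label_education_deserts(centroids, colleges, threshold_miles=50):
    """Map each region id to True if no college lies within the threshold.

    centroids: dict of region id -> (lat, lon)
    colleges:  list of (lat, lon)
    """
    labels = {}
    for region_id, (lat, lon) in centroids.items():
        in_range = any(
            haversine_miles(lat, lon, c_lat, c_lon) <= threshold_miles
            for c_lat, c_lon in colleges
        )
        labels[region_id] = not in_range
    return labels


# Toy centroids (illustrative only, not taken from the datasets)
centroids = {"near": (34.73, -86.59), "far": (31.00, -88.50)}
colleges = [(34.783368, -86.568502)]  # Alabama A & M University
print(label_education_deserts(centroids, colleges))
```

This version does one distance computation per (region, college) pair, so it is only a baseline against which a faster lookup could be checked.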

### Prediction: 
To estimate the economic impact of an education desert, we will rely mainly on the following columns from the ACS dataset:
`Population 25 Years and Over: Some College or More`, 
`Occupied Housing Units`,
`Median Earnings: <Gender><Education Level>`.

---
## Approaches/Metrics
For the different parts of our project, we will employ different techniques to try to generate the best results.

### Data mining: 
After calculating the distances between the colleges and the census block groups, using either the trie technique or the pairwise-distance technique, we will train a Support Vector Machine (SVM) classifier to differentiate between education deserts (1) and non-education deserts (0). The classifier's predictions will only tell us what we already know, but a feature importance analysis will reveal which features or feature pairs matter most in determining whether an area is an education desert. This can tell us what to focus on later in the project. We chose an SVM because, with a non-linear kernel, it can draw non-linear decision boundaries, which makes it well suited to this binary classification task.
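
A kernel SVM does not expose per-feature weights directly, so one possible way to carry out the feature importance analysis is permutation importance. The sketch below assumes scikit-learn and NumPy are available and uses synthetic data in place of the census features; the choice of permutation importance is an assumption of this example, not something the project has settled on.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for census features: only column 0 drives the label
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)  # 1 = education desert, 0 = not (toy labels)

# Scale the features, then fit an RBF-kernel SVM
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)

# Shuffle one column at a time and measure how much the accuracy drops
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)  # feature indices, most important first
```

On real data the importances would more sensibly be computed on a held-out split rather than on the training rows used here.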


### Population cover estimation: 
To estimate population cover, we will first compute, for every census block group, the ratio of the number of people with college degrees to the number of people of working age. These ratios will serve as the labels for our model: the percentage of people in a given region who go to college. We will then train a standard feed-forward neural network on the rest of the census features to predict this percentage.

To train this model, we will restrict the data to areas that are not in education deserts. We do this because we want to estimate what percentage of the surrounding population would go to college if a college were placed in a region that is currently an education desert. Training only on regions that already have college access lets us generate a more precise estimate of how many new students would attend college if one were built in a specific education desert.
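
The label construction and the filtering step can be sketched as follows. The dictionary keys (`is_desert`, `degree_holders`, `working_age`, `features`) and the toy numbers are hypothetical stand-ins for the real ACS columns, and the neural network itself is left out.

```python
def college_share(degree_holders, working_age_population):
    """Fraction of the working-age population holding a college degree."""
    if working_age_population <= 0:
        return None  # no denominator, so no usable label
    return degree_holders / working_age_population


def build_training_rows(regions):
    """Keep non-desert regions with a valid label; return (features, label) pairs."""
    rows = []
    for region in regions:
        if region["is_desert"]:
            continue  # deserts are held out and scored by the trained model later
        label = college_share(region["degree_holders"], region["working_age"])
        if label is not None:
            rows.append((region["features"], label))
    return rows


# Toy regions (illustrative values only)
regions = [
    {"is_desert": False, "degree_holders": 300, "working_age": 1200, "features": [0.4, 52000]},
    {"is_desert": True, "degree_holders": 50, "working_age": 1000, "features": [0.1, 31000]},
    {"is_desert": False, "degree_holders": 0, "working_age": 0, "features": [0.0, 0]},
]
print(build_training_rows(regions))
```

Only the first toy region survives: the second is an education desert and the third has no working-age population to divide by.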


### Optimization: 
We will approach this optimization problem in two ways: with and without the population cover estimation.

Without the population cover estimation, we will try to maximize the number of students who have access to a college education. This means hypothetically placing a college in each census block group in turn, computing how many new people would gain access to college, and recording the top k locations by number of people newly given access.

With the population cover estimation, we will try to maximize the economic benefit of each college that we could theoretically place. To do this, we would place the college in a candidate census block group and determine which census block groups fall within the 50-mile constraint. We would then calculate the population cover for each of these census block groups and compute the difference between our estimate and the current percentage of people who go to college in the region. This difference is the percentage of people who do not currently have a college education but would if there were a college in the region.

From here, we assume that the average salary of these new college graduates will be the average salary of a college graduate, and we use this figure to calculate a new average salary for the region. Multiplying the new average salary by the working population of the region gives us a total salary estimate. We repeat the same calculation using the original average salary and subtract the two totals to get the number of dollars the college would add to the surrounding population. We compute this value for every candidate census block group and record the top k values.

This leaves us with two sets of recommendations for where to place new colleges, each maximizing a different objective.
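
The salary arithmetic described here can be made concrete with a small sketch. The function names and numbers are hypothetical, and the way the new regional average is formed (new graduates move from the regional average salary to the graduate average salary) is one possible reading of the description, stated here as an assumption.

```python
import heapq


def salary_gain(working_pop, current_share, predicted_share,
                current_avg_salary, grad_avg_salary):
    """Estimated increase in total salary for one region, in dollars."""
    # Share of the population that would newly hold a degree (never negative)
    new_share = max(predicted_share - current_share, 0.0)
    # Assumption: new graduates move from the regional average to the graduate average
    new_avg_salary = current_avg_salary + new_share * (grad_avg_salary - current_avg_salary)
    # Total salary with the college minus total salary without it
    return working_pop * (new_avg_salary - current_avg_salary)


def top_k_sites(candidate_gains, k):
    """Return the k (site, gain) pairs with the largest total gain."""
    return heapq.nlargest(k, candidate_gains.items(), key=lambda item: item[1])


# One region: 10,000 workers, degree share rising from 25% to 50%
gain = salary_gain(10000, 0.25, 0.50, 40000, 60000)
print(gain)

# Toy totals per candidate site (illustrative values only)
print(top_k_sites({"A": 5e7, "B": 1e7, "C": 3e7}, k=2))
```

Ranking sites independently like this ignores overlap between the 50-mile areas of nearby candidates, so a site-by-site top k is only a first approximation when several colleges are placed at once.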

---
## Data Mining
__In this section we will read in census tract data, representing each census tract as a graph node located at the centroid of its polygon__

In [1]:
# Library Imports
import fiona 
import pandas as pd
import numpy as np
import subprocess
import os
import requests
from bs4 import BeautifulSoup
import seaborn as sns
sns.set(style="ticks")

# To unzip file
import zipfile

# To have progress bar
from tqdm import tqdm

# plotting libraries
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn-paper')
%matplotlib inline

# Helper function to create a new folder
def mkdir(path):
    try: 
        os.makedirs(path)
    except OSError:
        if not os.path.isdir(path):
            raise
        else:
            print("(%s) already exists" % (path))

---
## Datasets

In [2]:
# Census tracts shapefiles url
ct_shape_url = 'https://www.census.gov/geo/maps-data/data/cbf/cbf_tracts.html'

# Census tracts data url from 2012 - 2017
ct_file_name = 'acs_5_year_estimates_census_tracts.csv'
ct_data_url = 'https://www.dropbox.com/s/ni28x7mw6uh00dg/' + ct_file_name + '.zip?dl=1'

# American University Data
au_file_name = 'IPEDS_data.xlsx'
au_data_url = 'https://public.tableau.com/s/sites/default/files/media/Resources/' + au_file_name

# Directory of datasets
DATASETS_PATH = 'datasets/'

# Directory of census tract shapefile data
CENSUS_TRACTS_PATH = DATASETS_PATH + 'census_tracts/'

# Make the directory for the census tracts shapefiles data
mkdir(DATASETS_PATH)

# Remove any old data for census tracts shapefiles
# subprocess.call(['rm', '-rf', CENSUS_TRACTS_PATH])

(datasets/) already exists


### Census Tracts Data

In [3]:
# Download data 
if not os.path.isfile(DATASETS_PATH + ct_file_name):
    
    # Plain `wget`: a leading `!` is IPython shell syntax and is not valid inside os.system
    os.system('wget --directory-prefix={} -Nq {}'.format(DATASETS_PATH, ct_data_url))
    
    # Unzipping the file
    with zipfile.ZipFile(DATASETS_PATH + ct_file_name + '.zip', 'r') as zip_ref:
        zip_ref.extractall(DATASETS_PATH + ct_file_name + '/')
    
    # Remove the downloaded .zip archive
    os.remove(DATASETS_PATH + ct_file_name + '.zip')

In [4]:
# Let's take a look at the census tract data
census_tracts = pd.read_csv(DATASETS_PATH + ct_file_name, encoding='ISO-8859-1', low_memory=False)
census_tracts.head()

Unnamed: 0,FIPS,Geographic Identifier,Name of Area,Qualifying Name,State/U.S.-Abbreviation (USPS),Summary Level,Geographic Component,File Identification,Logical Record Number,US,...,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: Not in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: Not in Labor Force,Households.1,Households with Housing Costs more than 30% of Income
0,Geo_FIPS,Geo_GEOID,Geo_NAME,Geo_QName,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_FILEID,Geo_LOGRECNO,Geo_US,...,SE_T254_004,SE_T254_005,SE_T254_006,SE_T254_007,SE_T254_008,SE_T254_009,SE_T254_010,SE_T254_011,SE_T255_001,SE_T255_002
1,01001020100,14000US01001020100,"Census Tract 201, Autauga County, Alabama","Census Tract 201, Autauga County, Alabama",al,140,00,ACSSF,0001766,,...,36,7,80,1360,880,845,35,480,754,144
2,01001020200,14000US01001020200,"Census Tract 202, Autauga County, Alabama","Census Tract 202, Autauga County, Alabama",al,140,00,ACSSF,0001767,,...,59,0,204,1230,823,793,30,407,783,218
3,01001020300,14000US01001020300,"Census Tract 203, Autauga County, Alabama","Census Tract 203, Autauga County, Alabama",al,140,00,ACSSF,0001768,,...,61,3,305,2291,1491,1421,70,800,1279,357
4,01001020400,14000US01001020400,"Census Tract 204, Autauga County, Alabama","Census Tract 204, Autauga County, Alabama",al,140,00,ACSSF,0001769,,...,16,0,66,3241,1953,1833,120,1288,1749,361


### American University Data

In [5]:
# Download data 
if not os.path.isfile(DATASETS_PATH + au_file_name):
    
    # Plain `wget`: a leading `!` is IPython shell syntax and is not valid inside os.system
    os.system('wget --directory-prefix={} -Nq {}'.format(DATASETS_PATH, au_data_url))

In [6]:
# Let's take a look at the american university data
universities = pd.read_excel(DATASETS_PATH + au_file_name)
universities.head()

Unnamed: 0,ID number,Name,year,ZIP code,Highest degree offered,County name,Longitude location of institution,Latitude location of institution,Religious affiliation,Offers Less than one year certificate,...,Percent of freshmen receiving federal grant aid,Percent of freshmen receiving Pell grants,Percent of freshmen receiving other federal grant aid,Percent of freshmen receiving state/local grant aid,Percent of freshmen receiving institutional grant aid,Percent of freshmen receiving student loan aid,Percent of freshmen receiving federal student loans,Percent of freshmen receiving other loan aid,Endowment assets (year end) per FTE enrollment (GASB),Endowment assets (year end) per FTE enrollment (FASB)
0,100654,Alabama A & M University,2013,35762,Doctor's degree - research/scholarship,Madison County,-86.568502,34.783368,Not applicable,Implied no,...,81.0,81.0,7.0,1.0,32.0,89.0,89.0,1.0,,
1,100663,University of Alabama at Birmingham,2013,35294-0110,Doctor's degree - research/scholarship and pro...,Jefferson County,-86.80917,33.50223,Not applicable,Implied no,...,36.0,36.0,10.0,0.0,60.0,56.0,55.0,5.0,24136.0,
2,100690,Amridge University,2013,36117-3553,Doctor's degree - research/scholarship and pro...,Montgomery County,-86.17401,32.362609,Churches of Christ,Implied no,...,90.0,90.0,0.0,40.0,90.0,100.0,100.0,0.0,,302.0
3,100706,University of Alabama in Huntsville,2013,35899,Doctor's degree - research/scholarship and pro...,Madison County,-86.63842,34.722818,Not applicable,Yes,...,31.0,31.0,4.0,1.0,63.0,46.0,46.0,3.0,11502.0,
4,100724,Alabama State University,2013,36104-0271,Doctor's degree - research/scholarship and pro...,Montgomery County,-86.295677,32.364317,Not applicable,Implied no,...,76.0,76.0,13.0,11.0,34.0,81.0,81.0,0.0,13202.0,


### Census Tract Shape Files

In [7]:
# Make request to get the webpage
r = requests.get(ct_shape_url)
soup = BeautifulSoup(r.content, "html.parser")

# Get the download links from the dropdown <option> tag
locations = soup.find('select',
                      {'name':'Location',
                       'id':'ct2017m'}).findChildren('option' , recursive=False)[1:]

# Put all the states and the urls for their shape files in a dictionary
state_urls = {location.text.strip() : location.attrs['value'] for location in locations}

In [8]:
# Download data 
if not os.path.isdir(CENSUS_TRACTS_PATH):
    
    # Make the directory for the census tracts shapefiles data
    mkdir(CENSUS_TRACTS_PATH)
    
    for state, state_url in state_urls.items():
        os.system('wget --directory-prefix={} -Nq {}'.format(CENSUS_TRACTS_PATH, state_url))
        
    # Storing the shape file names
    census_tract_shapefiles = []
    for state, state_url in state_urls.items():

        # Extracting the name of shapefile
        shapefile = state_url[state_url.rindex('/') + 1:]
        census_tract_shapefiles.append(shapefile)

        # Renaming the file
        os.rename(CENSUS_TRACTS_PATH + shapefile, CENSUS_TRACTS_PATH + state + '.zip')

        # Unzipping the file
        zip_ref = zipfile.ZipFile(CENSUS_TRACTS_PATH + state + '.zip', 'r')
        zip_ref.extractall(CENSUS_TRACTS_PATH + state + '/')
        zip_ref.close()

        # Remove the old census tract .zip shapefile
        subprocess.call(['rm', '-rf', CENSUS_TRACTS_PATH + state + '.zip'])

In [9]:
# Shapely helper that converts a GeoJSON-like
# geometry into a Polygon / MultiPolygon object
from shapely.geometry import shape

# Let's calculate and store a representative
# point for each census tract in a dictionary
census_tracts_centroids = {}
for subdir, dirs, files in list(os.walk(CENSUS_TRACTS_PATH))[1:]:
    
    # Opening the shapefile
    with fiona.open(subdir) as state_shapes:
        
        # Looping through each census tract in the state and 
        # making key: geoid, value: Point with
        # x = longitude, y = latitude
        for census_tract in state_shapes:
            
            geoid = census_tract['properties']['GEOID']
            
            # shape() handles both Polygon and MultiPolygon
            # geometries, whose coordinate lists are nested
            # to different depths
            geometry = shape(census_tract['geometry'])
            
            # geometry.centroid would give the true centroid,
            # which is not necessarily inside the tract.
            # A representative point, not the centroid,
            # is guaranteed to be within the geometry
            census_tracts_centroids[geoid] = geometry.representative_point()

In [10]:
# Number of census tracts
len(census_tracts_centroids)

73874

__TODO: Problem with shape of the geometry, I thought it was supposed to be a list of size=2 tuples.__

__UPDATE: I reduced the 3D and 2D arrays to 1D, hopefully I didn't miss information__

### Census Tract Graph

__We will loop through each census tract and, for each one, compute its distance to all other $n - 1$ census tracts, storing the distances as edge weights in a graph. Without a distance cutoff this would be the complete graph $K_{73,874}$; the implementation below keeps only edges of 50 miles or less.__

In [11]:
# Haversine formula to calculate 
# geographical distance between 2 pairs of
# latitude, longitude coordinates
from haversine import haversine

# The Census Tract Graph where each node represents
# a census tract and each edge exists if and only if
# the two census tracts are within 50 miles of each
# other
class CT_Graph:
    
    # The Nodes (Census Tracts)
    class Node:
        def __init__(self, geo_id, coordinates):
            self._geo_id = geo_id
            self._coordinates = coordinates
        
        # US Census Tract Geo ID
        @property
        def geo_id(self):
            return self._geo_id
        
        @geo_id.setter
        def geo_id(self, value):
            self._geo_id = value
        
        # Census tract centroid location 
        # Type: shapely Point
        # Format: x = longitude, y = latitude
        @property
        def coordinates(self):
            return self._coordinates
        
        @coordinates.setter
        def coordinates(self, value):
            self._coordinates = value
        
    # The Edges (Connections from centroids to centroids)
    class Edge:
        def __init__(self, node_pair, distance):
            self._node_pair = node_pair
            
            # Storing Haversine Distance in miles
            self._distance = distance

        # Tuple of 2 Nodes
        @property
        def node_pair(self):
            return self._node_pair
        
        @node_pair.setter
        def node_pair(self, value):
            self._node_pair = value
        
        # Edge Weight / Haversine Distance
        # between the pairs of lat longs
        @property
        def distance(self):
            return self._distance
        
        @distance.setter
        def distance(self, value):
            self._distance = value
        
    def __init__(self, census_tracts_centroids):
        self._threshold_distance = 50
        self._nodes = []
        self._edges = []
        
        # Creating all my nodes O(n)
        for geo_id, coordinates in census_tracts_centroids.items():
            self._nodes.append(CT_Graph.Node(geo_id, coordinates))
         
        # Creating all my edges O(n^2)
        for idx, node1 in tqdm(enumerate(self._nodes)):
            
            for node2 in self._nodes[idx + 1:]:
                
                # Calculating Haversine Distance. haversine() expects
                # (latitude, longitude) tuples, while a Point stores
                # x = longitude and y = latitude
                distance = haversine(*[(node.coordinates.y, node.coordinates.x) for node in (node1, node2)], unit='mi')
                
                # Add the Edge only if Haversine Distance is at most 50 miles
                if distance <= self._threshold_distance:
                    self._edges.append(CT_Graph.Edge((node1, node2), distance))
        
    # List of Census Tract nodes
    @property
    def nodes(self):
        return self._nodes

    @nodes.setter
    def nodes(self, value):
        self._nodes = value

    # List of Census Tract edges
    @property
    def edges(self):
        return self._edges

    @edges.setter
    def edges(self, value):
        self._edges = value

In [12]:
# Initializing Census Tract Graph
ct_graph = CT_Graph(census_tracts_centroids)

16it [00:38,  2.45s/it]

KeyboardInterrupt: 
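
At roughly 2.45 seconds per node, the all-pairs loop would need about two days to cover all 73,874 tracts, which is why the run was interrupted. One possible way to prune the work, sketched below, is to sort the tracts by latitude and stop comparing as soon as the latitude gap alone exceeds 50 miles. This sweep is a swapped-in technique for illustration, not the trie approach mentioned earlier, and the function names, Earth-radius constant, and toy coordinates are assumptions.

```python
from math import asin, cos, degrees, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant for this sketch)


def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    (lat1, lon1), (lat2, lon2) = p, q
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def edges_within(points, threshold_miles=50):
    """Yield (id_a, id_b, miles) for every pair at most threshold_miles apart.

    points: dict of id -> (lat, lon)
    """
    # Two points within the threshold differ in latitude by at most this many degrees
    max_dlat = degrees(threshold_miles / EARTH_RADIUS_MILES)
    ordered = sorted(points.items(), key=lambda item: item[1][0])
    for i, (id_a, p) in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            id_b, q = ordered[j]
            if q[0] - p[0] > max_dlat:
                break  # every later point is even farther north
            d = haversine_miles(p, q)
            if d <= threshold_miles:
                yield id_a, id_b, d


# Toy tracts (illustrative coordinates only)
tracts = {"a": (34.0, -86.0), "b": (34.3, -86.0), "c": (40.0, -86.0)}
print(list(edges_within(tracts)))
```

The worst case is still quadratic when many tracts share a latitude band, so a two-dimensional spatial index would be the next step if this pruning is not enough.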

### Census Tract Data

__Census tracts have a population of around 2,500 to 8,000 people__

In [None]:
census_tracts.describe()