In [1]:
!conda install -c conda-forge folium



# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a new high school. Specifically, this report is targeted at public school policymakers in **Atlanta, GA**.
There are already several public high schools in the city of Atlanta, but they do not serve equal portions of the population or equal areas of the city. We will attempt to find a location for a new school that can reduce overcrowding in Atlanta Public Schools.

We will use data science methods to identify a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by the city.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing schools in the city
* distance from each neighborhood to the nearest school

We use 2010 census data for Atlanta for neighborhood information.

The following data sources are needed to extract or generate the required information:
* School addresses were looked up with Google Maps and school websites; the Nominatim geocoder (through geopy) converts each address to coordinates.
* Neighborhood boundaries and 2010 populations come from a GeoJSON file hosted on opendata.arcgis.com.

In [1]:
import json
import urllib.request  # importing the submodule explicitly makes urllib.request.urlopen available

import numpy as np
import pandas as pd
import requests
from pandas import json_normalize  # transform JSON into a pandas DataFrame
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
import geopy

from geopy.geocoders import Nominatim

### School Locations
We gather the list of public high schools in Atlanta from Wikipedia. Google Maps and the high schools' websites were used to find each school's address. We then create a DataFrame for these schools.

In [28]:
geolocator = Nominatim(user_agent="AtlData")
schoolLis = [['Benjamin Elijah Hays High School', '3450 Benjamin E Mays Dr SW Atlanta, Georgia 30331 United States'],
              ["BEST Academy High School", '1190 Northwest Drive NW, Atlanta, Georgia 30318'],
              ['Booker T. Washington High School', '45 Whitehouse Drive SW, Atlanta, Georgia 30314 United States'],
              ["Coretta Scott King Young Women's Leadership Academy", '1190 Northwest Drive NW, Atlanta, GA 30318'],
              ['Daniel McLaughlin Therrell High School', '3099 Panther Trail Southwest Atlanta, Georgia'],
              ['Frederick Douglass High School', '225 Hamilton E Holmes Dr NW, Atlanta, GA 30318'],
              ['Henry W. Grady High School', '929 Charles Allen Dr NE, Atlanta, GA 30309'],
              ['Maynard H. Jackson High School', '801 Glenwood Avenue SE, Atlanta, Georgia 30316'],
              ['Carver High School','55 McDonough Blvd SE, Atlanta, GA 30315'],
              ['North Atlanta High School', '4111 Northside Parkway NW, Atlanta, Georgia 30327'],
              ['South Atlanta High school', '800 Hutchens Rd SE, Atlanta, GA 30354']]
names = []
longs = []
lats = []
address = []
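for item in schoolLis:
    location = geolocator.geocode(item[1])
    # geocode returns None when Nominatim cannot resolve the address
    if location is None:
        raise ValueError("Could not geocode address: " + item[1])
    names.append(item[0])
    address.append(item[1])
    longs.append(location.longitude)
    lats.append(location.latitude)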
schoolData = {'School': names, "Address": address, "Latitude": lats, "Longitude": longs}
schools = pd.DataFrame(data = schoolData)
schools.head()

Unnamed: 0,School,Address,Latitude,Longitude
0,Benjamin Elijah Hays High School,"3450 Benjamin E Mays Dr SW Atlanta, Georgia 30...",33.737973,-84.500985
1,BEST Academy High School,"1190 Northwest Drive NW, Atlanta, Georgia 30318",33.788677,-84.479339
2,Booker T. Washington High School,"45 Whitehouse Drive SW, Atlanta, Georgia 30314...",33.754066,-84.420035
3,Coretta Scott King Young Women's Leadership Ac...,"1190 Northwest Drive NW, Atlanta, GA 30318",33.788677,-84.479339
4,Daniel McLaughlin Therrell High School,"3099 Panther Trail Southwest Atlanta, Georgia",33.69919,-84.490096


### Removing data
As we can see, BEST Academy High School is located at the same coordinates as Coretta Scott King Young Women's Leadership Academy. These are single-gender public schools that share a campus. It is therefore appropriate to remove one of them from our data and treat the pair as one combined school.

In [29]:
schools = schools.drop([1])
schools = schools.reset_index(drop = True)
schools

Unnamed: 0,School,Address,Latitude,Longitude
0,Benjamin Elijah Hays High School,"3450 Benjamin E Mays Dr SW Atlanta, Georgia 30...",33.737973,-84.500985
1,Booker T. Washington High School,"45 Whitehouse Drive SW, Atlanta, Georgia 30314...",33.754066,-84.420035
2,Coretta Scott King Young Women's Leadership Ac...,"1190 Northwest Drive NW, Atlanta, GA 30318",33.788677,-84.479339
3,Daniel McLaughlin Therrell High School,"3099 Panther Trail Southwest Atlanta, Georgia",33.69919,-84.490096
4,Frederick Douglass High School,"225 Hamilton E Holmes Dr NW, Atlanta, GA 30318",33.766561,-84.470118
5,Henry W. Grady High School,"929 Charles Allen Dr NE, Atlanta, GA 30309",33.781093,-84.372139
6,Maynard H. Jackson High School,"801 Glenwood Avenue SE, Atlanta, Georgia 30316",33.739245,-84.361899
7,Carver High School,"55 McDonough Blvd SE, Atlanta, GA 30315",33.719924,-84.386178
8,North Atlanta High School,"4111 Northside Parkway NW, Atlanta, Georgia 30327",33.86471,-84.449704
9,South Atlanta High school,"800 Hutchens Rd SE, Atlanta, GA 30354",33.671388,-84.363435
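The shared-campus check above was done by eye. As a minimal sketch, assuming the same column names as the schools DataFrame, duplicated coordinates can also be flagged programmatically. The `schools_demo` frame is a hypothetical three-row copy of values from the table above, not the full dataset.

```python
import pandas as pd

# Hypothetical three-row excerpt; coordinates copied from the geocoded table
schools_demo = pd.DataFrame({
    "School": ["BEST Academy High School",
               "Booker T. Washington High School",
               "Coretta Scott King Young Women's Leadership Academy"],
    "Latitude": [33.788677, 33.754066, 33.788677],
    "Longitude": [-84.479339, -84.420035, -84.479339],
})

# keep=False marks every row of a duplicated group, not only the later ones
is_shared = schools_demo.duplicated(subset=["Latitude", "Longitude"], keep=False)
shared_names = schools_demo.loc[is_shared, "School"].tolist()
print(shared_names)
```

Listing all members of a duplicated group leaves the choice of which row to drop to the analyst, which matches the manual decision made in this notebook.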


### Creating the Map
We define a function resetMap() that rebinds map_atl to a fresh folium map centered on Atlanta, discarding any layers added so far.

In [35]:
def resetMap():
    # Rebinding the global name discards the old map, so no explicit del is needed
    global map_atl
    map_atl = folium.Map(location=[33.74, -84.38], zoom_start=11)
resetMap()

map_atl

### Loading data
We load GeoJSON data on the populations of Atlanta neighborhoods. This data lets us map the neighborhood boundaries as well as their populations. We drop columns that are not relevant to our analysis, such as the population counts for individual races and ethnicities, because we only want the total population of each neighborhood.
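To illustrate the flattening step, here is a small sketch of how `pd.json_normalize` turns nested GeoJSON features into dotted column names such as `properties.POP2010`. The `toy_geojson` dictionary is a made-up, heavily trimmed stand-in for the real file, which has many more properties.

```python
import pandas as pd

# Hypothetical single-feature collection; the name and values are illustrative only
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"NEIGHBORHO": "Example Park", "POP2010": 1234},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[-84.40, 33.80], [-84.38, 33.80],
                                 [-84.38, 33.82], [-84.40, 33.80]]],
            },
        }
    ],
}

# Nested dictionaries become dotted column names; lists are kept as-is in a cell
flat = pd.json_normalize(toy_geojson["features"])
print(sorted(flat.columns))
```

Because list values are not expanded, the polygon coordinates stay available in a single `geometry.coordinates` cell for later processing.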

In [32]:
popurl = 'https://opendata.arcgis.com/datasets/d6298dee8938464294d3f49d473bcf15_196.geojson'
with urllib.request.urlopen(popurl) as url:
    data2 = json.loads(url.read().decode())
# pd.json_normalize is the current entry point for flattening nested JSON
popData = pd.json_normalize(data2["features"])
neighborhoods = popData.drop(columns = ["properties.URL", "properties.asian", "properties.black", "properties.hispanic",
                                        "properties.last_edited_date", "properties.other", "properties.white","geometry.coordinates",
                                        "geometry.type", "properties.A", "properties.GlobalID", "properties.NPU", "properties.OBJECTID", 
                                        "properties.STATISTICA", "properties.pop", "type"])

## Methodology <a name="methodology"></a>

In this project, we focus on determining the optimal location for a new public high school. To do this, we make a few assumptions. First, we assume that population is evenly distributed within each neighborhood. Second, we assume that high-school-aged minors make up the same percentage of the population in every neighborhood. Finally, we assume that an entire neighborhood is assigned to one school.

The first step is to plot each school and build a cluster of neighborhoods for each school, based on the distance from each neighborhood to each school. We use a variant of k-means clustering for this step, in which the existing schools act as fixed cluster centers.

From there, we analyze the share of the population that each school serves. This helps us determine which schools can be considered overpopulated and would therefore need another school to share the load.

Finally, we add one new school and use repeated k-means-style updates to determine which location would be ideal for it.
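The procedure outlined above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the notebook's implementation: the function names `assign_nearest` and `fit_new_center` are hypothetical, the points are made-up coordinates, and only the one new center is allowed to move while the existing centers stay fixed.

```python
import numpy as np

def assign_nearest(points, centers):
    """Return, for each point, the index of the closest center (Euclidean)."""
    diffs = points[:, None, :] - centers[None, :, :]
    return np.argmin(np.linalg.norm(diffs, axis=2), axis=1)

def fit_new_center(points, fixed_centers, start, max_iter=500):
    """Alternate assignment and mean update for a single movable center."""
    new_center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        centers = np.vstack([fixed_centers, new_center])
        labels = assign_nearest(points, centers)
        # The movable center is the last row, so its label is len(fixed_centers)
        members = points[labels == len(fixed_centers)]
        if len(members) == 0:
            break  # nothing is assigned to the new center; leave it in place
        updated = members.mean(axis=0)
        if np.allclose(updated, new_center):
            break  # the center stopped moving
        new_center = updated
    return new_center

# Made-up points: two near the fixed center, three far away from it
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 12.0], [12.0, 10.0]])
fixed = np.array([[0.0, 0.5]])
new_center = fit_new_center(points, fixed, start=[9.0, 9.0])
print(new_center)
```

Reassigning points inside the loop is what lets the center keep moving; without it, every pass would compute the same mean.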

## Analysis <a name="analysis"></a>

### Plotting the population and schools
We define a function plotPops() that plots the relevant neighborhood population data and a function plotSchools(schools) that plots the locations of the schools saved in the DataFrame argument "schools". These functions will be useful later when we need to see how our recommendations change school clustering. 

In [36]:
def plotPops():
    # folium.Choropleth replaces the deprecated Map.choropleth method
    folium.Choropleth(geo_data = data2,
        data=popData,
        columns=['properties.NEIGHBORHO', 'properties.POP2010'],
        key_on = "feature.properties.NEIGHBORHO",
        fill_color='YlOrRd', 
        fill_opacity=0.7, 
        line_opacity=0.2,
        legend_name='Atlanta Neighborhood Population').add_to(map_atl)
plotPops()
colorss = [
    'red',
    'blue',
    'gray',
    'darkred',
    'black',
    'purple',
    'darkblue',
    'green',
    'darkgreen',
    'lightgreen',
    'orange',
    'lightblue',
    'purple',
    'darkpurple',
    'pink',
    'cadetblue',
    'lightgray',
    'black'
]
def plotSchools(schools):
    global colorss
    for index, item in schools.iterrows():
        folium.Marker([item["Latitude"], item["Longitude"]], popup=item["School"], icon=folium.Icon(color=colorss[index])).add_to(map_atl)
plotSchools(schools)
map_atl

### Data on neighborhoods
We now format our data into a useful DataFrame. We remove all irrelevant columns and approximate each neighborhood's centroid as the midpoint of the bounding box of its GeoJSON boundary. The resulting DataFrame holds, for each neighborhood, its centroid latitude and longitude, its population, and the school it is assigned to. We initialize the Assigned School column to blank values; they will be updated later.

In [39]:
neighpops = popData.drop(columns = ["properties.URL", "properties.asian", "properties.black", "properties.hispanic",
                                        "properties.last_edited_date", "properties.other", "properties.white",
                                        "geometry.type", "properties.A", "properties.GlobalID", "properties.NPU", "properties.OBJECTID", 
                                        "properties.STATISTICA", "properties.pop", "type"])
neighDF = pd.DataFrame({"Neighborhood":[],
                        "Lat":[],
                        "Long":[],
                        "Pop":[],
                        "Assigned School":[]})
for index, item in neighpops.iterrows():
    latLis = []
    longLis = []
    for neigh in item["geometry.coordinates"][0]:
        if isinstance(neigh[1], list):
            # MultiPolygon: neigh is a ring of [longitude, latitude] pairs
            for coor in neigh:
                latLis.append(coor[1])
                longLis.append(coor[0])
        else:
            # Polygon: neigh is a single [longitude, latitude] pair
            latLis.append(neigh[1])
            longLis.append(neigh[0])
    # Midpoint of the bounding box, used as an approximate centroid
    neighLat = min(latLis) + (max(latLis) - min(latLis))/2
    neighLong = min(longLis) + (max(longLis) - min(longLis))/2
    folium.Circle([neighLat, neighLong], radius = 70, color = "red").add_to(map_atl)
    neighDF = pd.concat([neighDF, pd.DataFrame({"Neighborhood":[item["properties.NEIGHBORHO"]],
                                                "Lat": [neighLat],
                                                "Long": [neighLong],
                                                "Pop": [item["properties.POP2010"]],
                                                "Assigned School": [""]})], ignore_index = True)

In [40]:
neighDF.head()

Unnamed: 0,Neighborhood,Lat,Long,Pop,Assigned School
0,"Arden/Habersham, Argonne Forest, Peachtree Bat...",33.830723,-84.39838,2672.0,
1,"Peachtree Heights East, Peachtree Hills",33.820586,-84.381416,3736.0,
2,Peachtree Heights West,33.832133,-84.388578,4874.0,
3,"Buckhead Forest, South Tuxedo Park",33.846813,-84.383542,3372.0,
4,"Chastain Park, Tuxedo Park",33.865428,-84.398157,3423.0,
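The centroid approximation used for each neighborhood is the midpoint of its bounding box. A minimal standalone sketch of that calculation follows. The helper name `bbox_midpoint` and both toy geometries are assumptions made for illustration; like the notebook cell, it reads only the first element of the GeoJSON `coordinates` array.

```python
def bbox_midpoint(coordinates):
    """Return (lat, long) at the middle of the bounding box of coordinates[0]."""
    lats, longs = [], []
    for entry in coordinates[0]:
        if isinstance(entry[1], list):
            # MultiPolygon: entry is a ring of [longitude, latitude] pairs
            for pair in entry:
                longs.append(pair[0])
                lats.append(pair[1])
        else:
            # Polygon: entry is a single [longitude, latitude] pair
            longs.append(entry[0])
            lats.append(entry[1])
    mid_lat = min(lats) + (max(lats) - min(lats)) / 2
    mid_long = min(longs) + (max(longs) - min(longs)) / 2
    return mid_lat, mid_long

# Made-up triangle, once as Polygon coordinates and once wrapped as a MultiPolygon
ring = [[-84.40, 33.80], [-84.38, 33.80], [-84.38, 33.84], [-84.40, 33.80]]
polygon = [ring]
multipolygon = [[ring]]

poly_mid = bbox_midpoint(polygon)
multi_mid = bbox_midpoint(multipolygon)
print(poly_mid, multi_mid)
```

The bounding-box midpoint is cheap to compute but is not the area-weighted centroid, so for irregular shapes it can fall away from where most of the neighborhood lies.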


### withProp
We create a DataFrame withProp, which is the same as the previously created schools DataFrame except that it includes one new school called "Proposed Location". The location is initialized in Brookhaven, as this area intuitively seemed to lack nearby schools.

In [41]:
withProp = pd.concat([schools[["School", "Latitude", "Longitude"]], pd.DataFrame({"School": ["Proposed Location"],
                                                                               "Latitude": [33.8650],
                                                                               "Longitude": [-84.3371]})], ignore_index = True)


### Assigning neighborhoods to schools
We define a function that assigns a neighborhood to one school from a list of schools. A neighborhood is assigned to the nearest school.

In [42]:
def assign_members(neigh, schools):
    distLis = []
    assigned = ""
    for index, school in schools.iterrows():
        distLis.append(np.sqrt(np.square(neigh["Lat"] - school["Latitude"]) + np.square(neigh["Long"] - school["Longitude"])))
    minDist = min(distLis)
    closestSchool = [i for i,j in enumerate(distLis) if j == minDist]
    for index, school in schools.iterrows():
        if index in closestSchool:
            assigned = school["School"]
    return assigned

print('assign_members function defined!')


assign_members function defined!
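assign_members measures distance as a straight line in latitude/longitude degrees. As a hedged alternative, not the method used in this notebook, the sketch below uses the haversine great-circle formula, which accounts for a degree of longitude being shorter than a degree of latitude at Atlanta's latitude. The helper names `haversine_km` and `nearest_school` are hypothetical; the coordinates are copied from the tables above.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_school(lat, lon, school_list):
    """Return the name of the closest school from (name, lat, lon) tuples."""
    return min(school_list, key=lambda s: haversine_km(lat, lon, s[1], s[2]))[0]

# Two schools from the schools table
school_list = [
    ("Henry W. Grady High School", 33.781093, -84.372139),
    ("North Atlanta High School", 33.86471, -84.449704),
]

# Midpoint of "Peachtree Heights West" from neighDF.head()
closest = nearest_school(33.832133, -84.388578, school_list)
print(closest)

# One degree of longitude versus one degree of latitude near Atlanta
deg_long_km = haversine_km(33.75, -84.0, 33.75, -83.0)
deg_lat_km = haversine_km(33.75, -84.0, 34.75, -84.0)
print(deg_long_km < deg_lat_km)
```

For a city-sized area the two distance measures usually agree on the nearest school, but they can differ for neighborhoods that lie roughly equidistant between an east-west and a north-south neighbor.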


### Updating and plotting neighborhoods
We define one function that updates the school assignment of each neighborhood using assign_members, and another that plots the neighborhoods on the map, colored by assigned school.

In [43]:
def updateNeighs(schools):
    assignments = []
    for index, neigh in neighDF.iterrows():
        assignments.append(assign_members(neigh,schools))
    neighDF.update(pd.DataFrame({"Assigned School":assignments}))
def plotNeighs(schools):
    global colorss
    schoolCol = 0
    for index,neigh in neighDF.iterrows():
        for index2, school in schools.iterrows():
            if neigh["Assigned School"] == school["School"]:
                schoolCol = index2
        #print(neigh["Assigned School"], schoolCol, neigh["Neighborhood"] )
        folium.Circle([neigh["Lat"], neigh["Long"]], radius = 70, color = colorss[schoolCol]).add_to(map_atl)
updateNeighs(schools)
resetMap()
plotPops()
plotSchools(schools)
plotNeighs(schools)
map_atl

### Population of schools
Now that neighborhoods have been clustered for school assignment, we would like to know the population each school must serve under these assignments. As the table below shows, the population of Atlanta is unequally distributed among the schools. In particular, Henry W. Grady High School must serve almost twice the population of the next largest school, Booker T. Washington High School. There is therefore cause for opening a new school.

In [44]:
neighDF.groupby("Assigned School")[["Pop"]].sum()

Assigned School,Pop
Benjamin Elijah Hays High School,27831.0
Booker T. Washington High School,65969.0
Carver High School,35005.0
Coretta Scott King Young Women's Leadership Academy,20479.0
Daniel McLaughlin Therrell High School,31189.0
Frederick Douglass High School,20451.0
Henry W. Grady High School,127585.0
Maynard H. Jackson High School,39539.0
North Atlanta High School,31203.0
South Atlanta High school,20752.0
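To make the imbalance concrete, the short sketch below computes each school's share of the assigned population and the ratio between the two largest clusters. The numbers are copied from the table above; the variable names are illustrative.

```python
import pandas as pd

# Cluster populations copied from the groupby output above
pops = pd.Series({
    "Benjamin Elijah Hays High School": 27831.0,
    "Booker T. Washington High School": 65969.0,
    "Carver High School": 35005.0,
    "Coretta Scott King Young Women's Leadership Academy": 20479.0,
    "Daniel McLaughlin Therrell High School": 31189.0,
    "Frederick Douglass High School": 20451.0,
    "Henry W. Grady High School": 127585.0,
    "Maynard H. Jackson High School": 39539.0,
    "North Atlanta High School": 31203.0,
    "South Atlanta High school": 20752.0,
}, name="Pop")

# Fraction of the total assigned population served by each school, largest first
share = (pops / pops.sum()).sort_values(ascending=False)
ratio = pops["Henry W. Grady High School"] / pops["Booker T. Washington High School"]

print(share.round(3))
print(round(ratio, 2))
```

Under the notebook's assumptions, Grady's cluster holds roughly 30% of the assigned population, which is the imbalance the proposed school is meant to relieve.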


In [14]:
schools

Unnamed: 0,School,Address,Latitude,Longitude
0,Benjamin Elijah Hays High School,"3450 Benjamin E Mays Dr SW Atlanta, Georgia 30...",33.737973,-84.500985
1,Booker T. Washington High School,"45 Whitehouse Drive SW, Atlanta, Georgia 30314...",33.754066,-84.420035
2,Coretta Scott King Young Women's Leadership Ac...,"1190 Northwest Drive NW, Atlanta, GA 30318",33.788677,-84.479339
3,Daniel McLaughlin Therrell High School,"3099 Panther Trail Southwest Atlanta, Georgia",33.69919,-84.490096
4,Frederick Douglass High School,"225 Hamilton E Holmes Dr NW, Atlanta, GA 30318",33.766561,-84.470118
5,Henry W. Grady High School,"929 Charles Allen Dr NE, Atlanta, GA 30309",33.781093,-84.372139
6,Maynard H. Jackson High School,"801 Glenwood Avenue SE, Atlanta, Georgia 30316",33.739245,-84.361899
7,Carver High School,"55 McDonough Blvd SE, Atlanta, GA 30315",33.719924,-84.386178
8,North Atlanta High School,"4111 Northside Parkway NW, Atlanta, Georgia 30327",33.86471,-84.449704
9,South Atlanta High school,"800 Hutchens Rd SE, Atlanta, GA 30354",33.671388,-84.363435


In [15]:
withProp

Unnamed: 0,School,Latitude,Longitude
0,Benjamin Elijah Hays High School,33.737973,-84.500985
1,Booker T. Washington High School,33.754066,-84.420035
2,Coretta Scott King Young Women's Leadership Ac...,33.788677,-84.479339
3,Daniel McLaughlin Therrell High School,33.69919,-84.490096
4,Frederick Douglass High School,33.766561,-84.470118
5,Henry W. Grady High School,33.781093,-84.372139
6,Maynard H. Jackson High School,33.739245,-84.361899
7,Carver High School,33.719924,-84.386178
8,North Atlanta High School,33.86471,-84.449704
9,South Atlanta High school,33.671388,-84.363435


### Updating with new school
We now use our previously defined functions to add the initial Proposed Location to the map and cluster the neighborhoods based on its current location. This location is initialized outside the city limits and will therefore need to be updated.

In [45]:
updateNeighs(withProp)
resetMap()
plotPops()
plotSchools(withProp)
plotNeighs(withProp)
map_atl

In [17]:
neighDF.groupby("Assigned School")[["Pop"]].sum()

Assigned School,Pop
Benjamin Elijah Hays High School,27831.0
Booker T. Washington High School,65969.0
Carver High School,35005.0
Coretta Scott King Young Women's Leadership Academy,20479.0
Daniel McLaughlin Therrell High School,31189.0
Frederick Douglass High School,20451.0
Henry W. Grady High School,107515.0
Maynard H. Jackson High School,39539.0
North Atlanta High School,20779.0
Proposed Location,30494.0


As we can see, adding the new school at its initial location already reduces the population assigned to Henry W. Grady by about 20,000 people. We now create a function that moves the Proposed Location to the average coordinates of its cluster.
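The averaging step described above treats every neighborhood equally, whatever its population. As a hedged alternative, not the method used in this notebook, a population-weighted mean would pull the location toward more populous neighborhoods. The helper name `weighted_centroid` is hypothetical; the three rows are copied from neighDF.head() purely as sample input.

```python
import numpy as np

def weighted_centroid(lats, longs, pops):
    """Return the population-weighted mean latitude and longitude."""
    lat = np.average(lats, weights=pops)
    lon = np.average(longs, weights=pops)
    return lat, lon

# First three neighborhoods from neighDF.head(), used only as sample input
lats = [33.830723, 33.820586, 33.832133]
longs = [-84.39838, -84.381416, -84.388578]
pops = [2672.0, 3736.0, 4874.0]

w_lat, w_long = weighted_centroid(lats, longs, pops)
u_lat, u_long = np.mean(lats), np.mean(longs)
print(w_lat, w_long)
print(u_lat, u_long)
```

Whether weighting is appropriate depends on the goal: the unweighted mean minimizes squared distance to neighborhood midpoints, while the weighted mean minimizes it per resident.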

In [18]:
def update_newSchool():
    # Average only the coordinate columns; the neighborhood names are not numeric
    avgs = neighDF.groupby("Assigned School")[["Lat", "Long"]].mean()
    newLat = avgs.loc["Proposed Location", "Lat"]
    newLong = avgs.loc["Proposed Location", "Long"]
    # Overwrite the coordinates of the Proposed Location row in place
    isProp = withProp["School"] == "Proposed Location"
    withProp.loc[isProp, ["Latitude", "Longitude"]] = [newLat, newLong]


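### Iteration
We now call update_newSchool in a loop of 500 iterations and then reassign the neighborhoods. Note that updateNeighs runs only once, after the loop, so every call inside the loop computes the same cluster mean and the Proposed Location moves a single step. A full k-means-style iteration would also reassign neighborhoods inside the loop.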

In [46]:
for x in range(500):
    update_newSchool()
updateNeighs(withProp)
resetMap()
plotPops()
plotSchools(withProp)
plotNeighs(withProp)
map_atl

In [47]:
neighDF.groupby("Assigned School")[["Pop"]].sum()

Assigned School,Pop
Benjamin Elijah Hays High School,27831.0
Booker T. Washington High School,65969.0
Carver High School,35005.0
Coretta Scott King Young Women's Leadership Academy,20479.0
Daniel McLaughlin Therrell High School,31189.0
Frederick Douglass High School,20451.0
Henry W. Grady High School,84131.0
Maynard H. Jackson High School,39539.0
North Atlanta High School,17356.0
Proposed Location,57301.0


Now that a new location has been found for the school, we look at our population data to determine whether this addition relieves the burden on Henry W. Grady. It does: the population assigned to Henry W. Grady falls by over 40,000 people, and the new proposed location serves approximately 57,000 people. Below, we determine the address for the new school.

In [24]:
prop = withProp[withProp["School"] == "Proposed Location"]
prop[["Latitude","Longitude"]].values

array([[ 33.85288117, -84.36772806]])

In [25]:

location = geolocator.reverse(prop[["Latitude","Longitude"]].values[0])
print(location.address)

3596, North Stratford Road Northeast, Buckhead, Atlanta, Fulton County, Georgia, 30342, United States of America


## Results and Discussion <a name="results"></a>

Our purpose in this project was to determine the necessity and location of a new public high school in the Atlanta area. Our analysis shows that while several high schools in Atlanta serve a manageable portion of the population, Henry W. Grady High School is in a highly populated area and must serve a larger portion of the city than many of the other schools. Specifically, the northeast region of the city has few public high schools nearby.

To address this issue, we added a new public high school location and used k-means clustering to find a location that would help alleviate this population burden. A location at or around 3596 North Stratford Road NE, Buckhead, Atlanta is the suggested location. Of course, there may be unknown reasons why this area does not have a high school nearby. Private high schools may serve the population, there may be schools just outside the city limits, or transportation between this area and other high schools may be quick and easy. The recommendation should therefore be used as a starting point for further analysis.

## Conclusion <a name="conclusion"></a>

The purpose of this project was to analyze Atlanta school and population information in order to make a recommendation to policymakers regarding the need for a new school within city limits. We identified highly populated neighborhoods and clustered them around their nearest high school. From there we determined which schools would be overpopulated based on these clusters.
All this information, in addition to the general visual distribution of the public schools, led us to determine which area was most in need of a high school. The northeast region of the city was chosen, and a new proposed school location was determined using k-means clustering. This process led to 