# Capstone Project - Educational Business in Boston
#### Author: Jorge Calvo Martín

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

<p>In this project we will try to find an optimal location for an educational business, such as a college, a nursery school, or a university. Specifically, this report is addressed to parties interested in opening this type of business in any of the neighborhoods of the city of Boston.

As there are many educational businesses in the city of Boston, we will try to detect the places where it is best to set up the business and also which type of educational business is the most appropriate.

We will use data science techniques to identify the most promising neighborhoods based on these criteria. The advantages of each area will be clearly expressed so that those interested can choose the best possible final location.



## Data <a name="data"></a>

According to the definition of our problem, the factors that will influence our decision are:
    
<ul>
    <li>Number of existing schools, kindergartens, and universities in the neighborhood</li>

    <li>Number of schools in the neighborhood and the distance to them, if there are any, as well as the distance from the neighborhood to the city center</li>
</ul>
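
The distance to the city center mentioned above can be computed from latitude/longitude pairs with the haversine formula. The following is a minimal sketch, not part of the original notebook: the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example: distance between two neighborhood coordinates reported later in this notebook
roslindale = (42.2912093, -71.1244966)
north_end = (42.3650974, -71.0544954)
print(round(haversine_km(*roslindale, *north_end), 1), "km")
```

The same function could be applied to each neighborhood center and the geocoded center of Boston to obtain the distance factor listed above.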

The following data sources will be needed to extract / generate the required information:

<ul>
    <li>The centers of the candidate areas will be generated algorithmically, and the list of neighborhoods will be obtained from the dataset at https://data.boston.gov/dataset/boston-neighborhoods</li>
    <li>The number, type, and location of schools in each neighborhood will be obtained using the Foursquare API</li>
    <li>The coordinates of the center of Boston will be obtained using Nominatim geocoding</li>
</ul></p>

#### Load libraries

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation


from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming a json file into a pandas dataframe (top-level import available since pandas 1.0)
from pandas import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


### Next, let's load the data.

In [2]:
df_boston=pd.read_csv("Boston_Neighbourhood.csv")

In [3]:
df_boston.head()

Unnamed: 0,OBJECTID,Name,Acres,Neighborhood_ID,SqMiles,ShapeSTArea,ShapeSTLength
0,27,Roslindale,1605.568237,15,2.51,69938270.0,53563.912597
1,28,Jamaica Plain,2519.245394,11,3.94,109737900.0,56349.937161
2,29,Mission Hill,350.853564,13,0.55,15283120.0,17918.724113
3,30,Longwood,188.611947,28,0.29,8215904.0,11908.757148
4,31,Bay Village,26.539839,33,0.04,1156071.0,4650.635493


## Methodology <a name="methodology"></a>

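Now that we have our candidate neighborhoods, let's use the Foursquare API to get information about schools in each one, using the “college” category.

For this search we set a radius of 1000 meters around the center of each neighborhood (the Foursquare `radius` parameter is expressed in meters, not miles). Note that `getNearbyVenues` defaults to `radius=500`, so the 1000-meter value only applies if it is passed explicitly in the call. We are interested in places in the 'college' category, so every result returned by the search is useful for our analysis.

Chinatown returns a large number of places (47) that match the category we are looking for. However, the coordinates geocoded for it (40.7164913, -73.9962504) lie well outside Boston, so this count should not be read as describing Boston's Chinatown.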

#### Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [4]:
CLIENT_ID = 'FL33NARRRSLQVOWU5RS1S4GJ1PYCCU20MURF54X0LHOVGP10' # your Foursquare ID
CLIENT_SECRET = 'DYCI23FO5QM0GEYELTO102C4Y2R12KDJBYI5OWVL0OYF32XL' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FL33NARRRSLQVOWU5RS1S4GJ1PYCCU20MURF54X0LHOVGP10
CLIENT_SECRET:DYCI23FO5QM0GEYELTO102C4Y2R12KDJBYI5OWVL0OYF32XL
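
Hard-coding and printing credentials exposes them to anyone who reads the notebook. As a hedged alternative, the sketch below reads them from environment variables and masks them before display. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical, introduced only for this example.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables.

    The variable names used here are hypothetical; adapt them to your setup.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret


def mask(value, visible=4):
    """Keep only the first few characters so the secret never appears in cell output."""
    return value[:visible] + "*" * max(len(value) - visible, 0)


# Demo with a stand-in mapping instead of the real environment
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id-123", "FOURSQUARE_CLIENT_SECRET": "demo-secret-456"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(demo_id))
print("CLIENT_SECRET:", mask(demo_secret))
```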


In [5]:
query = '4bf58dd8d48988d13b941735'
radius =1000

In [6]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            query,
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
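
The function above formats the request URL by hand and indexes straight into the JSON response, which raises an error if a neighborhood returns no groups or a venue has no category. The following sketch shows one possible, more defensive variant. The helper names `build_explore_params` and `extract_venue_rows` and the sample payload are assumptions made for illustration, and the network call itself is only shown as a comment.

```python
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # endpoint used in the cell above


def build_explore_params(client_id, client_secret, version, lat, lng,
                         category_id, radius, limit):
    """Collect the query parameters in a dict instead of formatting the URL by hand."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "categoryId": category_id,
        "radius": radius,  # meters
        "limit": limit,
    }


def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore response into rows, tolerating empty or partial responses."""
    groups = payload.get("response", {}).get("groups") or []
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0].get("name")))
    return rows


# With the requests library, the call would look like this (not executed here):
#   payload = requests.get(EXPLORE_URL, params=params, timeout=10).json()
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example School",  # made-up venue for illustration
               "location": {"lat": 42.35, "lng": -71.08},
               "categories": [{"name": "School"}]}}]}]}}
print(extract_venue_rows("Back Bay", 42.3507067, -71.0797297, sample_payload))
print(extract_venue_rows("Back Bay", 42.3507067, -71.0797297, {"response": {}}))
```

Passing the parameters as a dict lets `requests` handle URL encoding, and the empty-response case yields an empty list rather than an exception.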

In [7]:
neighborhood=[]
list_latitude=[]
list_longitude=[]
for name in zip(df_boston["Name"]):

        geolocator = Nominatim(user_agent="boston_explorer")
        location = geolocator.geocode(name)
        latitude = location.latitude
        longitude = location.longitude
        print('The geograpical coordinate {}, {}, {}.'.format(name,latitude, longitude))
        #create list
        neighborhood=np.append(neighborhood,[[name]])
        list_latitude=np.append(list_latitude,[[latitude]])
        list_longitude=np.append(list_longitude,[longitude])

The geograpical coordinate ('Roslindale',), 42.2912093, -71.1244966.
The geograpical coordinate ('Jamaica Plain',), 42.3098201, -71.1203299.
The geograpical coordinate ('Mission Hill',), 46.4719653, -84.6767229.
The geograpical coordinate ('Longwood',), 28.7007225, -81.3492779714387.
The geograpical coordinate ('Bay Village',), 41.4849875, -81.920832.
The geograpical coordinate ('Leather District',), -33.6976001, 19.0061312.
The geograpical coordinate ('Chinatown',), 40.7164913, -73.9962504.
The geograpical coordinate ('North End',), 42.3650974, -71.0544954.
The geograpical coordinate ('Roxbury',), 43.2494702, -89.6749947.
The geograpical coordinate ('South End',), 44.6318333, -63.5800267.
The geograpical coordinate ('Back Bay',), 42.3507067, -71.0797297.
The geograpical coordinate ('East Boston',), 42.3750973, -71.0392173.
The geograpical coordinate ('Charlestown',), 43.2387, -72.424622.
The geograpical coordinate ('West End',), -27.4804327, 153.0128302.
The geograpical coordinate ('B
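
Several coordinates in the output above fall far outside the Boston area (for example, Leather District has a negative latitude), which suggests that a bare neighborhood name is ambiguous for the geocoder. The sketch below shows one way to qualify the query and to flag suspicious results. The bounding box values and the helper names `qualified_query` and `in_boston` are assumptions for illustration, not part of the original notebook.

```python
# Rough bounding box around Boston; these limits are an assumption for illustration
BOSTON_BOUNDS = {"lat_min": 42.20, "lat_max": 42.45, "lon_min": -71.25, "lon_max": -70.80}


def qualified_query(name, city="Boston", state="MA"):
    """Append city and state so the geocoder is less likely to match a namesake elsewhere."""
    return "{}, {}, {}".format(name, city, state)


def in_boston(lat, lon, bounds=BOSTON_BOUNDS):
    """Return True when the coordinate falls inside the assumed bounding box."""
    return (bounds["lat_min"] <= lat <= bounds["lat_max"]
            and bounds["lon_min"] <= lon <= bounds["lon_max"])


# Coordinates copied from the output above
geocoded = {
    "Roslindale": (42.2912093, -71.1244966),
    "Mission Hill": (46.4719653, -84.6767229),
    "Leather District": (-33.6976001, 19.0061312),
    "North End": (42.3650974, -71.0544954),
}
suspect = [name for name, (lat, lon) in geocoded.items() if not in_boston(lat, lon)]
print(suspect)
print(qualified_query("Mission Hill"))
# With geopy, the qualified string would be passed as geolocator.geocode(qualified_query(name))
```

Neighborhoods flagged as suspect could then be geocoded again with the qualified query before any venue search is run.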

### Create the DataFrame

In [8]:
boston_df=pd.DataFrame(list(zip(neighborhood,list_latitude,list_longitude)),columns=["Neighborhood","Latitude","Longitude"])
boston_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Roslindale,42.291209,-71.124497
1,Jamaica Plain,42.30982,-71.12033
2,Mission Hill,46.471965,-84.676723
3,Longwood,28.700723,-81.349278
4,Bay Village,41.484988,-81.920832


In [9]:
boston_venues = getNearbyVenues(names=boston_df["Neighborhood"],
                                           latitudes=boston_df["Latitude"],
                                           longitudes=boston_df["Longitude"]
                                          )

Roslindale
Jamaica Plain
Mission Hill
Longwood
Bay Village
Leather District
Chinatown
North End
Roxbury
South End
Back Bay
East Boston
Charlestown
West End
Beacon Hill
Downtown
Fenway
Brighton
West Roxbury
Hyde Park
Mattapan
Dorchester
South Boston Waterfront
South Boston
Allston
Harbor Islands


In [10]:
boston_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Jamaica Plain,42.30982,-71.12033,Eliot School of Fine and Applied Arts,42.310946,-71.116934,Trade School
1,Longwood,28.700723,-81.349278,Friends Academy,28.699155,-81.348271,Elementary School
2,Longwood,28.700723,-81.349278,Rockin Guitar Man,28.697035,-81.347362,Gun Shop
3,Bay Village,41.484988,-81.920832,Normandy school,41.483254,-81.918746,Elementary School
4,Bay Village,41.484988,-81.920832,Bay School Board,41.484546,-81.920825,Government Building


### How many venues were returned for each neighborhood

In [11]:
boston_venues.groupby("Neighborhood").count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Allston,2,2,2,2,2,2
Back Bay,7,7,7,7,7,7
Bay Village,2,2,2,2,2,2
Beacon Hill,2,2,2,2,2,2
Brighton,6,6,6,6,6,6
Chinatown,47,47,47,47,47,47
East Boston,2,2,2,2,2,2
Fenway,6,6,6,6,6,6
Jamaica Plain,1,1,1,1,1,1
Longwood,2,2,2,2,2,2


In [12]:
# one hot encoding
df_venues = pd.get_dummies(boston_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
df_venues['Neighborhood'] = boston_venues['Neighborhood'] 

df_venues.Neighborhood.head()

0    Jamaica Plain
1         Longwood
2         Longwood
3      Bay Village
4      Bay Village
Name: Neighborhood, dtype: object

In [13]:
df_venues.shape

(91, 19)

## Analysis <a name="analysis"></a>

### Mean frequency of occurrence of each category

In [14]:
boston_grouped = df_venues.groupby('Neighborhood').mean().reset_index()
boston_grouped.head()

Unnamed: 0,Neighborhood,Church,College Communications Building,Community College,Driving School,Elementary School,General College & University,Government Building,Gun Shop,High School,Language School,Middle School,Music School,Office,Preschool,School,Student Center,Trade School,University
0,Allston,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0
1,Back Bay,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.428571,0.0,0.0,0.142857,0.0,0.142857,0.142857,0.0,0.0,0.0
2,Bay Village,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Beacon Hill,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0
4,Brighton,0.0,0.166667,0.166667,0.0,0.0,0.166667,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0
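
To make the frequency table above concrete, here is a minimal sketch on a toy dataset showing that the mean of one-hot indicator columns per neighborhood is the share of venues in each category. The toy rows are made up for illustration.

```python
import pandas as pd

# Toy venue table; the rows are made up for illustration
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "B"],
    "Venue Category": ["School", "Language School", "School", "School", "High School", "School"],
})

# One indicator column per category (dtype=float keeps the arithmetic explicit)
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# The mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Each row of `freq` sums to 1, which is why a neighborhood with only two venues, such as Allston above, shows values of 0.5.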


In [15]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Based on the mean frequency of each category per neighborhood, we select the 10 most common types of educational business in each neighborhood, as a step toward the clustering that follows.

In [16]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = boston_grouped['Neighborhood']

for ind in np.arange(boston_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(boston_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allston,School,Language School,University,Gun Shop,College Communications Building,Community College,Driving School,Elementary School,General College & University,Government Building
1,Back Bay,High School,School,Preschool,Music School,Elementary School,Government Building,College Communications Building,Community College,Driving School,General College & University
2,Bay Village,Elementary School,Government Building,University,Trade School,College Communications Building,Community College,Driving School,General College & University,Gun Shop,High School
3,Beacon Hill,School,Driving School,University,Gun Shop,College Communications Building,Community College,Elementary School,General College & University,Government Building,High School
4,Brighton,School,College Communications Building,Community College,General College & University,Language School,University,Gun Shop,Driving School,Elementary School,Government Building
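
In the table above, categories with a frequency of zero still fill the lower ranks. For example, "University" is listed third for Allston although its mean frequency in the grouped table is 0.0. A possible refinement, sketched below under the assumption that only categories actually present should be ranked, keeps the non-zero entries only. The function name `top_nonzero_categories` is hypothetical.

```python
import pandas as pd


def top_nonzero_categories(row, num_top_venues):
    """Return up to num_top_venues category names whose frequency is above zero."""
    freqs = row.drop(labels="Neighborhood").astype(float)
    ranked = freqs[freqs > 0].sort_values(ascending=False, kind="mergesort")
    return list(ranked.index[:num_top_venues])


# Frequencies for Allston copied from the grouped table above (a subset of its columns)
allston = pd.Series({
    "Neighborhood": "Allston",
    "Gun Shop": 0.0,
    "Language School": 0.5,
    "School": 0.5,
    "University": 0.0,
})
print(top_nonzero_categories(allston, 10))
```

With this variant a neighborhood may return fewer than 10 entries, so the result would need padding before being written into a fixed-width table.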


## K-means Clustering
<p>We will group the neighborhoods using K-means with 5 clusters. Each cluster gathers neighborhoods with a similar mix of educational businesses, which indicates where each type of business is already established.</p>

In [17]:
# set number of clusters
kclusters = 5

boston_grouped_clustering = boston_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(boston_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 3, 4, 1, 0, 1, 1, 3, 2, 4], dtype=int32)
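
The number of clusters is fixed at 5 above. One common heuristic for checking such a choice, not used in the original notebook, is to compare the K-means inertia for several values of k and look for the point where it stops dropping sharply (the "elbow"). The sketch below illustrates this on made-up frequency profiles. The toy data and the range of k are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency profiles forming three obvious groups; made up for illustration
profiles = np.array([
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
])

inertia_by_k = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles)
    inertia_by_k[k] = model.inertia_  # within-cluster sum of squared distances

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 4))
```

On the real data, `profiles` would be replaced by `boston_grouped_clustering`, keeping in mind that k cannot exceed the number of neighborhoods.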

#### Join the cluster labels with the Boston dataset

In [18]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

boston_total = boston_df

# join neighborhoods_venues_sorted onto boston_total to add cluster labels and top venues to each neighborhood's latitude/longitude
boston_total = boston_total.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

boston_total.dropna(inplace=True)
boston_total.head() # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Jamaica Plain,42.30982,-71.12033,2.0,Trade School,University,College Communications Building,Community College,Driving School,Elementary School,General College & University,Government Building,Gun Shop,High School
3,Longwood,28.700723,-81.349278,4.0,Elementary School,Gun Shop,University,Trade School,College Communications Building,Community College,Driving School,General College & University,Government Building,High School
4,Bay Village,41.484988,-81.920832,4.0,Elementary School,Government Building,University,Trade School,College Communications Building,Community College,Driving School,General College & University,Gun Shop,High School
6,Chinatown,40.716491,-73.99625,1.0,School,Driving School,Music School,High School,Middle School,Elementary School,Language School,Church,Office,Student Center
7,North End,42.365097,-71.054495,4.0,Elementary School,Church,School,Preschool,Gun Shop,College Communications Building,Community College,Driving School,General College & University,Government Building


In [19]:
boston_total.shape

(14, 14)

### Let's visualize Boston with its educational businesses

In [22]:
geolocator = Nominatim(user_agent="boston_explorer")
location = geolocator.geocode("Boston")
latitude = location.latitude
longitude = location.longitude
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boston_total['Latitude'], boston_total['Longitude'], boston_total['Neighborhood'], boston_total['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        #color="green",
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        #fill_color="red",
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Cluster 0 = School

In [24]:
boston_total.loc[boston_total['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,West End,-27.480433,153.01283,0.0,School,Language School,University,Gun Shop,College Communications Building,Community College,Driving School,Elementary School,General College & University,Government Building
17,Brighton,50.82204,-0.137406,0.0,School,College Communications Building,Community College,General College & University,Language School,University,Gun Shop,Driving School,Elementary School,Government Building
24,Allston,42.355434,-71.132127,0.0,School,Language School,University,Gun Shop,College Communications Building,Community College,Driving School,Elementary School,General College & University,Government Building


## Cluster 1 = School

In [25]:
boston_total.loc[boston_total['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Chinatown,40.716491,-73.99625,1.0,School,Driving School,Music School,High School,Middle School,Elementary School,Language School,Church,Office,Student Center
9,South End,44.631833,-63.580027,1.0,School,University,Elementary School,Gun Shop,College Communications Building,Community College,Driving School,General College & University,Government Building,High School
11,East Boston,42.375097,-71.039217,1.0,School,Elementary School,University,Gun Shop,College Communications Building,Community College,Driving School,General College & University,Government Building,High School
14,Beacon Hill,47.579258,-122.311598,1.0,School,Driving School,University,Gun Shop,College Communications Building,Community College,Elementary School,General College & University,Government Building,High School
20,Mattapan,42.267566,-71.092427,1.0,School,Preschool,University,Gun Shop,College Communications Building,Community College,Driving School,Elementary School,General College & University,Government Building


## Cluster 2 = Trade School

In [26]:
boston_total.loc[boston_total['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Jamaica Plain,42.30982,-71.12033,2.0,Trade School,University,College Communications Building,Community College,Driving School,Elementary School,General College & University,Government Building,Gun Shop,High School


## Cluster 3 = High School

In [27]:
boston_total.loc[boston_total['Cluster Labels'] == 3]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Back Bay,42.350707,-71.07973,3.0,High School,School,Preschool,Music School,Elementary School,Government Building,College Communications Building,Community College,Driving School,General College & University
16,Fenway,42.343451,-71.097716,3.0,High School,General College & University,Church,School,Preschool,Office,Music School,Middle School,Language School,Trade School


## Cluster 4 = Elementary School

In [28]:
boston_total.loc[boston_total['Cluster Labels'] == 4]

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Longwood,28.700723,-81.349278,4.0,Elementary School,Gun Shop,University,Trade School,College Communications Building,Community College,Driving School,General College & University,Government Building,High School
4,Bay Village,41.484988,-81.920832,4.0,Elementary School,Government Building,University,Trade School,College Communications Building,Community College,Driving School,General College & University,Gun Shop,High School
7,North End,42.365097,-71.054495,4.0,Elementary School,Church,School,Preschool,Gun Shop,College Communications Building,Community College,Driving School,General College & University,Government Building


## Conclusion <a name="conclusion"></a>

Interested parties will make the final decision on the optimal location of the educational business based on the specific characteristics of the neighborhoods and locations in each recommended area, taking into account additional factors such as the attractiveness of each location (proximity to a park or water), noise levels and proximity to main roads, availability of real estate, prices, and the social and economic dynamics of each neighborhood.