# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM / Coursera for Pedro G Jimenez Gutz


## Table of Contents

1. <a href="#item1">Introduction: Business Problem</a>    
2. <a href="#item2">Data</a>
3. <a href="#item3">Methodology</a>
4. <a href="#item4">Results and Discussion</a> 
5. <a href="#item5">Conclusion</a> 

## 1. Introduction: Business Problem

Mexico City abbreviated as CDMX, is the capital city of Mexico and the most populous city in North America.
The city has 16 subdivisions, formerly known as boroughs.
Only 1 in 10 Mexicans has insurance for private medical expenses.
Insurance companies in the health sector have the opportunity to increase the percentage of the insured population for the private sector.
The public health sector is applying the strategy of improving existing capacity, or temporarily reconverting internal areas to deal with the covit-19 pandemic global.

The purpose of this project is to identify boroughs in CDMX, with a low number of public hospitals.
Health is a priority issue, and it is essential to have effective universal medical coverage. 
The challenge is to expand the health system. Where should there be another hospital?


## 2. Data

**2.1 Data Acquisition**

Data requirements to solve the problem:
    List of CDMX boroughs with population density and coordinates.
    List of public sector hospitals.
    Search for health interest venues in each borough.

**Import libraries**

In [94]:
import requests # library to handle requests
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes
import random # library for random number generation
from bs4 import BeautifulSoup
import  urllib.request
import csv

import geocoder
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 
import matplotlib.cm as cm
import matplotlib.colors as colors

# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


**Define Foursquare Credentials and Version**

In [43]:
CLIENT_ID = 'BSM4513XEFJWXHZBFQDXOEZMKP0WBELOT0132HV5YO04UW2Q' # your Foursquare ID
CLIENT_SECRET = '2JEXUZHFRS4SSS3D1AT4UAILPG0MA5FQGYW3CISUNAUAA4GV' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: BSM4513XEFJWXHZBFQDXOEZMKP0WBELOT0132HV5YO04UW2Q
CLIENT_SECRET:2JEXUZHFRS4SSS3D1AT4UAILPG0MA5FQGYW3CISUNAUAA4GV


**Read in the Hospital dataset**

In [44]:
# Read in the data
df_hospital = pd.read_csv('hospitales-y-centros-de-salud.csv')
df_hospital.shape

(27, 6)

In [45]:
# View the top rows of the dataset
df_hospital.head()

Unnamed: 0,Nombre,Titular,Latitud,Longitud,Coordenadas,Geopoint
0,Hospital Pediátrico Iztapalapa,Director: Dr. Ramón Carballo Herrada Dirección...,19.356548,-99.107689,"-99.107689,19.356548,0.000000","19.356548,-99.107689"
1,Hospital Pediátrico Moctezuma,Director: Dr. Enrique Grania Pozada Dirección:...,19.432852,-99.098358,"-99.098358,19.432852,0.000000","19.432852,-99.098358"
2,Hospital Pediátrico Coyoacán,Director: Dr. Miguel Camarillo Valencia Direcc...,19.345737,-99.167725,"-99.167725,19.345737,0.000000","19.345737,-99.167725"
3,Hospital Pediátrico San Juan de Aragón,Director: Dr. Raúl Contreras Artine Dirección:...,19.457306,-99.092812,"-99.092812,19.457306,0.000000","19.457306,-99.092812"
4,Hospital Pediátrico Villa,Director: Dra. María del Rocío Lima Carcaño Di...,19.487551,-99.113876,"-99.113876,19.487551,0.000000","19.487551,-99.113876"


**Read in the Population Borough dataset**

In [46]:
# Read in the data
df_population = pd.read_csv('Demarcaciones_Territoriales_Ciudad_de_Mexico.csv')
df_population.shape

(17, 3)

In [47]:
# View the top rows of the dataset
df_population.head()

Unnamed: 0,Demarcaciones territoriales,Población (2010),Superficie (km²)
0,Ciudad de México,8 851 080,1 479.00
1,Álvaro Obregón,727 034,96.17
2,Azcapotzalco,414 711,33.66
3,Benito Juárez,385 439,26.63
4,Coyoacán,620 416,54.40


**Read in the IDH Borough dataset**

In [48]:
# Read in the data
df_IDH = pd.read_csv('Delegaciones_de_la_Ciudad_de_Mexico_por_IDH.csv')
df_IDH.shape

(16, 7)

In [49]:
# View the top rows of the dataset
df_IDH.head()

Unnamed: 0,Informe 2010,Variación respecto al informe de 2005,Delegación,Informe 2010.1,Informe 2005,Informe 2000,País comparable
0,1,Sin cambios,Benito Juárez,0.917,0.9509,0.9164,Bandera de Alemania
1,2,Crecimiento (1),Miguel Hidalgo,0.88,0.9188,0.8816,Bandera de Eslovenia
2,3,Decrecimiento (1),Coyoacán,0.867,0.9169,0.8837,Bandera de Grecia
3,4,Crecimiento (1),Cuauhtémoc,0.848,0.8921,0.8699,Bandera de Catar
4,5,Crecimiento (1),Azcapotzalco,0.832,0.8915,0.8551,Bandera de Chile


**Data Wrangling**

**Rename and Dropping unnecesary columns - Hospital dataset**

In [50]:
#Extract Borough from column 'Titular'
df_hospital['Borough'] = df_hospital['Titular'].str.split(',', 0).str[2].str.strip()
df_hospital['Borough'] = df_hospital['Borough'].str.replace(r'Delegación', '')
df_hospital['Borough'] = df_hospital['Borough'].str.strip()

df_hospital.drop(['Titular', 'Coordenadas', 'Geopoint'], axis=1, inplace = True)
df_hospital.rename(columns = {'Nombre':'Hospital', 'Latitud':'Latitude', 'Longitud':'Longitude'}, inplace = True) 

df_hospital

Unnamed: 0,Hospital,Latitude,Longitude,Borough
0,Hospital Pediátrico Iztapalapa,19.356548,-99.107689,Iztapalapa
1,Hospital Pediátrico Moctezuma,19.432852,-99.098358,Venustiano Carranza
2,Hospital Pediátrico Coyoacán,19.345737,-99.167725,Coyoacán
3,Hospital Pediátrico San Juan de Aragón,19.457306,-99.092812,Gustavo A. Madero
4,Hospital Pediátrico Villa,19.487551,-99.113876,Gustavo A. Madero
5,Hospital Pediátrico Iztacalco,19.402376,-99.117943,Iztacalco
6,Hospital Pediátrico Peralvillo,19.460251,-99.141022,Cuauhtémoc
7,Hospital Materno Infantil Iniguarán,19.452307,-99.113228,Venustiano Carranza
8,Hospital Materno Infantil Xochimilco,19.254906,-99.104958,Xochimilco
9,Hospital General Xoco,19.36005,-99.163162,Benito Juárez


**Rename columns - Population Borough dataset**

In [51]:
# Elimina el primer renglon que es la Ciudad de México
df_population.drop([0,0], inplace=True)
df_population.rename(columns = {'Demarcaciones territoriales':'Borough', 'Población (2010)':'Population', 'Superficie (km²)':'Area km2'}, inplace = True)
df_population

Unnamed: 0,Borough,Population,Area km2
1,Álvaro Obregón,727 034,96.17
2,Azcapotzalco,414 711,33.66
3,Benito Juárez,385 439,26.63
4,Coyoacán,620 416,54.4
5,Cuajimalpa,186 391,74.58
6,Cuauhtémoc,531 831,32.4
7,Gustavo A. Madero,1 185 772,94.07
8,Iztacalco,384 326,23.3
9,Iztapalapa,1 815 786,117.0
10,La Magdalena Contreras,239 086,74.58


In [52]:
# Remove spaces in Population column
df_population['Population'] = df_population['Population'].str.replace(' ', '')
# Created column calculated
df_population['Population density in km2'] = df_population['Population'].astype(int) / df_population['Area km2'].astype(float)
df_population

Unnamed: 0,Borough,Population,Area km2,Population density in km2
1,Álvaro Obregón,727034,96.17,7559.88354
2,Azcapotzalco,414711,33.66,12320.588235
3,Benito Juárez,385439,26.63,14473.864063
4,Coyoacán,620416,54.4,11404.705882
5,Cuajimalpa,186391,74.58,2499.208903
6,Cuauhtémoc,531831,32.4,16414.537037
7,Gustavo A. Madero,1185772,94.07,12605.208887
8,Iztacalco,384326,23.3,16494.678112
9,Iztapalapa,1815786,117.0,15519.538462
10,La Magdalena Contreras,239086,74.58,3205.765621


**Rename and Dropping unnecesary columns - IDH dataset**

In [53]:
df_IDH.drop(['Informe 2010', 'Variación respecto al informe de 2005', 'Informe 2005', 'Informe 2000', 'País comparable'], axis=1, inplace = True)
df_IDH.rename(columns = {'Delegación':'Borough', 'Informe 2010.1':'IDH'}, inplace = True) 
df_IDH

Unnamed: 0,Borough,IDH
0,Benito Juárez,0.917
1,Miguel Hidalgo,0.88
2,Coyoacán,0.867
3,Cuauhtémoc,0.848
4,Azcapotzalco,0.832
5,Tlalpan,0.829
6,Cuajimalpa de Morelos,0.825
7,Iztacalco,0.822
8,Venustiano Carranza,0.816
9,La Magdalena Contreras,0.815


**Merge (df_population and df_IDH) datasets**

In [54]:
df_borough = pd.merge(df_population, df_IDH, on='Borough')
df_borough

Unnamed: 0,Borough,Population,Area km2,Population density in km2,IDH
0,Álvaro Obregón,727034,96.17,7559.88354,0.806
1,Azcapotzalco,414711,33.66,12320.588235,0.832
2,Benito Juárez,385439,26.63,14473.864063,0.917
3,Coyoacán,620416,54.4,11404.705882,0.867
4,Cuauhtémoc,531831,32.4,16414.537037,0.848
5,Gustavo A. Madero,1185772,94.07,12605.208887,0.806
6,Iztacalco,384326,23.3,16494.678112,0.822
7,Iztapalapa,1815786,117.0,15519.538462,0.783
8,La Magdalena Contreras,239086,74.58,3205.765621,0.815
9,Miguel Hidalgo,372889,46.99,7935.496914,0.88


**Check if the Borough in both the data frames match**

In [55]:
set(df_IDH.Borough) - set(df_population.Borough)

{'Cuajimalpa de Morelos'}

**Find the index of the Boroughs that didn't match**

In [56]:
print("The index of borough is",df_IDH.index[df_IDH['Borough'] == 'Cuajimalpa de Morelos'].tolist())

The index of borough is [6]


**Changing the Borough names to match the other data frame**

In [57]:
df_IDH.iloc[6,0] = 'Cuajimalpa'

**Check again if the Borough names in both data sets match**

In [58]:
set(df_IDH.Borough) - set(df_population.Borough)

set()

**We can combine both the data frames together**

In [59]:
df_borough = pd.merge(df_population, df_IDH, on='Borough')
df_borough

Unnamed: 0,Borough,Population,Area km2,Population density in km2,IDH
0,Álvaro Obregón,727034,96.17,7559.88354,0.806
1,Azcapotzalco,414711,33.66,12320.588235,0.832
2,Benito Juárez,385439,26.63,14473.864063,0.917
3,Coyoacán,620416,54.4,11404.705882,0.867
4,Cuajimalpa,186391,74.58,2499.208903,0.825
5,Cuauhtémoc,531831,32.4,16414.537037,0.848
6,Gustavo A. Madero,1185772,94.07,12605.208887,0.806
7,Iztacalco,384326,23.3,16494.678112,0.822
8,Iztapalapa,1815786,117.0,15519.538462,0.783
9,La Magdalena Contreras,239086,74.58,3205.765621,0.815


**Get the number of hospitals per Borough**

In [60]:
df_total_hospital = df_hospital['Borough'].value_counts().rename_axis('Borough').reset_index(name='Total Hospitals')
df_total_hospital

Unnamed: 0,Borough,Total Hospitals
0,Gustavo A. Madero,5
1,Miguel Hidalgo,3
2,Iztapalapa,3
3,Venustiano Carranza,3
4,Azcapotzalco,2
5,Cuauhtémoc,2
6,Iztacalco,1
7,Coyoacán,1
8,Xochimilco,1
9,Cuajimalpa de Morelos,1


**Check if the Borough in both the data frames match**

In [61]:
set(df_total_hospital.Borough) - set(df_borough.Borough)

{'Cuajimalpa de Morelos'}

**Find the index of the Borough that didn't match**

In [62]:
print("The index of borough is",df_total_hospital.index[df_total_hospital['Borough'] == 'Cuajimalpa de Morelos'].tolist())

The index of borough is [9]


**Changing the Borough names to match the other data frame**

In [63]:
df_total_hospital.iloc[14,0] = 'Cuajimalpa'

**Check again if the Borough names in both data sets match**

In [64]:
set(df_total_hospital.Borough) - set(df_borough.Borough)

{'Cuajimalpa de Morelos'}

**We can combine both the data frames together**

In [65]:
df_borough_h = pd.merge(df_borough, df_total_hospital, on='Borough', how='left')

**Replace Nan values with zeros**

In [66]:
df_borough_h['Total Hospitals'] = df_borough_h['Total Hospitals'].fillna(0)

<a id='item3'></a>

## 3. Methodology

The methodology in this project consists of two parts:

<a href="#part1">Exploratory Data Analysis</a> View the total of Hospitals by boroughs in CDMX, and extract neighborhoods of the five selected boroughs.

<a href="#part2">Modelling</a> To help stakeholders interested in knowing, using or improving the hospital health system in CDMX. We will use K-means clustering which is a form of unsupervised machine learning algorithm that clusters data based on predefined cluster size. We will use a cluster size of 5 for this project.

### Exploratory Data Analysis

**View the information of the dataset**

In [67]:
df_borough_h.describe()

Unnamed: 0,Population density in km2,IDH,Total Hospitals
count,16.0,16.0,16.0
mean,8972.517577,0.823875,1.625
std,5667.70899,0.041262,1.310216
min,571.700013,0.742,0.0
25%,3439.197761,0.806,1.0
50%,9670.101398,0.819,1.0
75%,13296.115716,0.836,2.25
max,16494.678112,0.917,5.0


In [68]:
df_borough_h.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 0 to 15
Data columns (total 6 columns):
Borough                      16 non-null object
Population                   16 non-null object
Area km2                     16 non-null object
Population density in km2    16 non-null float64
IDH                          16 non-null float64
Total Hospitals              16 non-null float64
dtypes: float64(3), object(3)
memory usage: 896.0+ bytes


In [69]:
df_borough_h

Unnamed: 0,Borough,Population,Area km2,Population density in km2,IDH,Total Hospitals
0,Álvaro Obregón,727034,96.17,7559.88354,0.806,1.0
1,Azcapotzalco,414711,33.66,12320.588235,0.832,2.0
2,Benito Juárez,385439,26.63,14473.864063,0.917,0.0
3,Coyoacán,620416,54.4,11404.705882,0.867,1.0
4,Cuajimalpa,186391,74.58,2499.208903,0.825,1.0
5,Cuauhtémoc,531831,32.4,16414.537037,0.848,2.0
6,Gustavo A. Madero,1185772,94.07,12605.208887,0.806,5.0
7,Iztacalco,384326,23.3,16494.678112,0.822,1.0
8,Iztapalapa,1815786,117.0,15519.538462,0.783,3.0
9,La Magdalena Contreras,239086,74.58,3205.765621,0.815,0.0


**Sort the total Hospitals in descenting order to see 5 boroughs with the lowest number of Hospitals**

In [70]:
df_borough_h.sort_values(['Total Hospitals'], ascending = False, axis = 0, inplace = True )
df_borough_h_top = df_borough_h.head() 
df_borough_h

Unnamed: 0,Borough,Population,Area km2,Population density in km2,IDH,Total Hospitals
6,Gustavo A. Madero,1185772,94.07,12605.208887,0.806,5.0
8,Iztapalapa,1815786,117.0,15519.538462,0.783,3.0
10,Miguel Hidalgo,372889,46.99,7935.496914,0.88,3.0
14,Venustiano Carranza,430978,33.4,12903.532934,0.816,3.0
1,Azcapotzalco,414711,33.66,12320.588235,0.832,2.0
5,Cuauhtémoc,531831,32.4,16414.537037,0.848,2.0
0,Álvaro Obregón,727034,96.17,7559.88354,0.806,1.0
3,Coyoacán,620416,54.4,11404.705882,0.867,1.0
4,Cuajimalpa,186391,74.58,2499.208903,0.825,1.0
7,Iztacalco,384326,23.3,16494.678112,0.822,1.0


**We will choose the last five Borough.
The Borough 'La Magdalena Contreras' not have Hospitals.
The Boroughs {'Xochimilco', 'Tlalpan', 'Tláhuac', 'Milpa Alta' have one Hospital.**

In [71]:
df_borough_h.corr()

Unnamed: 0,Population density in km2,IDH,Total Hospitals
Population density in km2,1.0,0.390904,0.399217
IDH,0.390904,1.0,-0.134106
Total Hospitals,0.399217,-0.134106,1.0


**Boroughs in CDMX**

<img src='CDMX.jpg'>

**We can see that the chosen Boroughs are contiguous**

**Creating a new dataset of the Boroughs choose, and generating their coordinates**

In [72]:
Borough = ['La Magdalena Contreras', 'Xochimilco', 'Tlalpan', 'Tláhuac', 'Milpa Alta']
Neighborhood = ['La Magdalena Contreras, CDMX', 'Xochimilco, CDMX', 'Tlalpan, CDMX', 'Tláhuac, CDMX', 'Milpa Alta, CDMX']

Latitude = ['','','','','']
Longitude = ['','','','','']

df_N = {'Neighborhood': Neighborhood,'Borough':Borough,'Latitude': Latitude,'Longitude':Longitude}
df_neighborhood = pd.DataFrame(data=df_N, columns=['Neighborhood', 'Borough', 'Latitude', 'Longitude'], index=None)
df_neighborhood

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,"La Magdalena Contreras, CDMX",La Magdalena Contreras,,
1,"Xochimilco, CDMX",Xochimilco,,
2,"Tlalpan, CDMX",Tlalpan,,
3,"Tláhuac, CDMX",Tláhuac,,
4,"Milpa Alta, CDMX",Milpa Alta,,


**Find the Coordinates**

In [75]:
Latitude = []
Longitude = []

for i in range(len(Neighborhood)):
    address = Neighborhood[i]
    geolocator = Nominatim(user_agent="cdmx_agent")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    Latitude.append(location.latitude)
    Longitude.append(location.longitude)
print(Latitude, Longitude)

[19.27547005, 19.23697845, 19.200877, 19.26950425, 19.138028] [-99.26333858358939, -99.0823001406525, -99.21701240427146, -99.00409684032508, -99.05892017210884]


In [None]:
df_neigh = {'Neighborhood': Neighborhood,'Borough':Borough,'Latitude': Latitude,'Longitude':Longitude}
df_neigN = pd.DataFrame(data=df_neigh, columns=['Neighborhood', 'Borough', 'Latitude', 'Longitude'], index=None)

df_neigN

In [76]:
# create map of CDMX using latitude and longitude values
map_lon = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_neigN['Latitude'], df_neigN['Longitude'], df_neigN['Borough'], df_neigN['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_lon)  
    
map_lon

<a id='part2'></a>

### Modelling

**Finding all the venues within a 3000 meter radius of each neighborhood.
    Perform one hot ecoding on the venues data.
    Grouping the venues by the neighborhood and calculating their mean.
    Performing a K-means clustering (Defining K = 5)**

Create a function to extract the venues from each Neighborhood

In [77]:
def getNearbyVenues(names, latitudes, longitudes, radius=3000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [78]:
df_venues = getNearbyVenues(names=df_neigN['Neighborhood'],
                                   latitudes=df_neigN['Latitude'],
                                   longitudes=df_neigN['Longitude']
                                  )

La Magdalena Contreras, CDMX
Xochimilco, CDMX
Tlalpan, CDMX
Tláhuac, CDMX
Milpa Alta, CDMX


In [79]:
print(df_venues.shape)
df_venues.head()

(88, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"La Magdalena Contreras, CDMX",19.27547,-99.263339,Segundo Dinamo,19.286371,-99.272958,Rock Climbing Spot
1,"La Magdalena Contreras, CDMX",19.27547,-99.263339,La Virgen,19.254767,-99.267493,Memorial Site
2,"La Magdalena Contreras, CDMX",19.27547,-99.263339,Gotchamania,19.264595,-99.240855,Paintball Field
3,"La Magdalena Contreras, CDMX",19.27547,-99.263339,Gotcha Manía,19.254879,-99.244915,Theme Park
4,"La Magdalena Contreras, CDMX",19.27547,-99.263339,"Parque y Corredor Ecoturístico ""Los Dínamos""",19.30054,-99.253412,Nature Preserve


In [80]:
df_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"La Magdalena Contreras, CDMX",18,18,18,18,18,18
"Milpa Alta, CDMX",2,2,2,2,2,2
"Tlalpan, CDMX",8,8,8,8,8,8
"Tláhuac, CDMX",30,30,30,30,30,30
"Xochimilco, CDMX",30,30,30,30,30,30


In [81]:
print('There are {} uniques categories.'.format(len(df_venues['Venue Category'].unique())))

There are 46 uniques categories.


In [82]:
# one hot encoding
df_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_onehot['Neighborhood'] = df_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]

df_onehot.head()

Unnamed: 0,Neighborhood,BBQ Joint,Bakery,Bar,Beer Garden,Burger Joint,Café,Camera Store,Candy Store,Clothing Store,...,Plaza,Restaurant,Rock Climbing Spot,Seafood Restaurant,Soccer Field,Soccer Stadium,Spa,Taco Place,Theme Park,Vegetarian / Vegan Restaurant
0,"La Magdalena Contreras, CDMX",0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,"La Magdalena Contreras, CDMX",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"La Magdalena Contreras, CDMX",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"La Magdalena Contreras, CDMX",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,"La Magdalena Contreras, CDMX",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [83]:
df_grouped = df_onehot.groupby('Neighborhood').mean().reset_index()
df_grouped

Unnamed: 0,Neighborhood,BBQ Joint,Bakery,Bar,Beer Garden,Burger Joint,Café,Camera Store,Candy Store,Clothing Store,...,Plaza,Restaurant,Rock Climbing Spot,Seafood Restaurant,Soccer Field,Soccer Stadium,Spa,Taco Place,Theme Park,Vegetarian / Vegan Restaurant
0,"La Magdalena Contreras, CDMX",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.055556,0.0,0.055556,0.0,0.111111,0.055556,0.055556,0.0,0.111111,0.0
1,"Milpa Alta, CDMX",0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Tlalpan, CDMX",0.0,0.125,0.0,0.0,0.0,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Tláhuac, CDMX",0.0,0.0,0.033333,0.0,0.033333,0.033333,0.0,0.0,0.033333,...,0.033333,0.0,0.0,0.1,0.0,0.0,0.0,0.1,0.0,0.033333
4,"Xochimilco, CDMX",0.066667,0.033333,0.0,0.066667,0.0,0.0,0.0,0.033333,0.0,...,0.0,0.066667,0.033333,0.0,0.0,0.0,0.0,0.1,0.0,0.0


In [84]:
df_grouped.shape

(5, 47)

In [85]:
num_top_venues = 3

for hood in df_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = df_grouped[df_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----La Magdalena Contreras, CDMX----
                venue  freq
0   Convenience Store  0.11
1          Theme Park  0.11
2  Mexican Restaurant  0.11


----Milpa Alta, CDMX----
          venue  freq
0  Camera Store   0.5
1       Factory   0.5
2     BBQ Joint   0.0


----Tlalpan, CDMX----
                venue  freq
0  Mexican Restaurant  0.12
1              Bakery  0.12
2            Mountain  0.12


----Tláhuac, CDMX----
                venue  freq
0  Mexican Restaurant  0.23
1          Taco Place  0.10
2  Seafood Restaurant  0.10


----Xochimilco, CDMX----
                venue  freq
0  Mexican Restaurant   0.2
1          Taco Place   0.1
2         Flower Shop   0.1




In [86]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [87]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = df_grouped['Neighborhood']

for ind in np.arange(df_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"La Magdalena Contreras, CDMX",Mexican Restaurant,Nature Preserve,Convenience Store,Soccer Field,Theme Park,Playground,Memorial Site,Paintball Field,Rock Climbing Spot,Gun Range
1,"Milpa Alta, CDMX",Camera Store,Factory,Vegetarian / Vegan Restaurant,Deli / Bodega,Gym,Gun Range,Garden Center,Flower Shop,Farm,Fair
2,"Tlalpan, CDMX",Deli / Bodega,Bakery,Farm,Café,Mexican Restaurant,Mountain,Outdoors & Recreation,Park,Diner,Garden Center
3,"Tláhuac, CDMX",Mexican Restaurant,Pizza Place,Taco Place,Gym,Seafood Restaurant,Vegetarian / Vegan Restaurant,Lounge,Italian Restaurant,Clothing Store,Park
4,"Xochimilco, CDMX",Mexican Restaurant,Taco Place,Flower Shop,BBQ Joint,Beer Garden,Restaurant,Music Venue,Bakery,Candy Store,Coffee Shop


**Clustering similar neighborhoods together using K-Means clustering**

In [88]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

df_grouped_clustering = df_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([3, 1, 2, 4, 0])

In [89]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_merged = df_neigN

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
df_merged = df_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

df_merged.head()

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"La Magdalena Contreras, CDMX",La Magdalena Contreras,19.27547,-99.263339,3,Mexican Restaurant,Nature Preserve,Convenience Store,Soccer Field,Theme Park,Playground,Memorial Site,Paintball Field,Rock Climbing Spot,Gun Range
1,"Xochimilco, CDMX",Xochimilco,19.236978,-99.0823,0,Mexican Restaurant,Taco Place,Flower Shop,BBQ Joint,Beer Garden,Restaurant,Music Venue,Bakery,Candy Store,Coffee Shop
2,"Tlalpan, CDMX",Tlalpan,19.200877,-99.217012,2,Deli / Bodega,Bakery,Farm,Café,Mexican Restaurant,Mountain,Outdoors & Recreation,Park,Diner,Garden Center
3,"Tláhuac, CDMX",Tláhuac,19.269504,-99.004097,4,Mexican Restaurant,Pizza Place,Taco Place,Gym,Seafood Restaurant,Vegetarian / Vegan Restaurant,Lounge,Italian Restaurant,Clothing Store,Park
4,"Milpa Alta, CDMX",Milpa Alta,19.138028,-99.05892,1,Camera Store,Factory,Vegetarian / Vegan Restaurant,Deli / Bodega,Gym,Gun Range,Garden Center,Flower Shop,Farm,Fair


In [90]:
# Dropping the row with the NaN value 
df_merged.dropna(inplace = True)

In [91]:
df_merged.shape

(5, 15)

In [92]:
df_merged['Cluster Labels'] = df_merged['Cluster Labels'].astype(int)

**Visualize the clusters**

In [95]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Neighborhood'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.5).add_to(map_clusters)
       
map_clusters

**View clusters**

Cluster 1

In [96]:
df_merged[df_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Xochimilco, CDMX",Xochimilco,19.236978,-99.0823,0,Mexican Restaurant,Taco Place,Flower Shop,BBQ Joint,Beer Garden,Restaurant,Music Venue,Bakery,Candy Store,Coffee Shop


Cluster 2

In [97]:
df_merged[df_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,"Milpa Alta, CDMX",Milpa Alta,19.138028,-99.05892,1,Camera Store,Factory,Vegetarian / Vegan Restaurant,Deli / Bodega,Gym,Gun Range,Garden Center,Flower Shop,Farm,Fair


Cluster 3

In [98]:
df_merged[df_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,"Tlalpan, CDMX",Tlalpan,19.200877,-99.217012,2,Deli / Bodega,Bakery,Farm,Café,Mexican Restaurant,Mountain,Outdoors & Recreation,Park,Diner,Garden Center


Cluster 4

In [99]:
df_merged[df_merged['Cluster Labels'] == 3]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"La Magdalena Contreras, CDMX",La Magdalena Contreras,19.27547,-99.263339,3,Mexican Restaurant,Nature Preserve,Convenience Store,Soccer Field,Theme Park,Playground,Memorial Site,Paintball Field,Rock Climbing Spot,Gun Range


Cluster 5

In [100]:
df_merged[df_merged['Cluster Labels'] == 4]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,"Tláhuac, CDMX",Tláhuac,19.269504,-99.004097,4,Mexican Restaurant,Pizza Place,Taco Place,Gym,Seafood Restaurant,Vegetarian / Vegan Restaurant,Lounge,Italian Restaurant,Clothing Store,Park


## 4. Results and Discussion

Clusters show a place, because they are the coordinates of the city. In the analysis of the data it was determined that the feasible Borough to build the Borough was 'La Magdalena Contreras', since it is the one that would bring better benefits to your city and nearby places.

The results were limited by the number of places initially included and by the low calls and results of the Foursquare API.

<a id='item5'></a>

## 5. Conclusion

Fortunately, from the analysis of the data sets, we were able to determine the candidate Boroughs to have a Hospital.