# The Battle of Neighborhoods

## Business Problem

In this notebook We will find the optimal location for a pharmacy in the city of Seville, located at the south of Spain. 
During this process we will use data comming from different data sources like the National Statistics Institute of Spain or ESRI databases.

We will use that information along with the one provided by the Foursquare API to look for the venues around each sub-neighborhood (centroid of the sub-neighborhood) in the city of Seville, then we will identify the pharmacies next to these sub-neighborhoods (centroids) as well as the sub-neighborhoods with no pharmacies around them.

On the other hand, we will use the income per household and age in the sub-neighborhoods as well as the the services around the already existing pharmacies in them to classify the pharmacies.

Finally, we will clasify the sub-neighborhoods with no pharmacy inside them and will select these sub-neighborhoods where were the age and income conditions combined with services makes them suitable for opening a pharmacy. These conditions will be:
- More than 50% of people is over 40 years
- Average perhousehold income greater than 30K €

## Data

For the execution of this analysis we will use the folowing data:

- Demographic data: https://github.com/jomsaga/Capstone/blob/main/Sevilla_Seccion_Censal_Barrio_Distrito.csv

- Sub-neighborhood centroid data: https://github.com/jomsaga/Capstone/blob/main/Centroids.csv

- Income data: https://github.com/jomsaga/Capstone/blob/main/Renta_Media_Persona_Media_Hogar_Seccion_Censal.csv





# Analysis

## Import necesarry libraries

In [1]:
import pandas as pd
import numpy as np
import geocoder
import folium
import json
import requests
from tqdm import tqdm
import branca.colormap as cm
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import scale

## Create the dataframe

### Create dataframes from different sources (CSVs)

In [2]:
DD_url = 'https://raw.githubusercontent.com/jomsaga/Capstone/main/Sevilla_Seccion_Censal_Barrio_Distrito.csv'
C_url = 'https://raw.githubusercontent.com/jomsaga/Capstone/main/Centroids.csv'
ID_url = 'https://raw.githubusercontent.com/jomsaga/Capstone/main/Renta_Media_Persona_Media_Hogar_Seccion_Censal.csv'

In [67]:
DD_df = pd.read_csv(DD_url,delimiter=';')
C_df = pd.read_csv(C_url,delimiter=';', encoding='latin-1')
ID_df = pd.read_csv(ID_url,delimiter=r"\s+", encoding='latin-1', header=None)

In [68]:
DD_df.head()

Unnamed: 0,CUSEC,Barrio,Distrito,Población_Total,H_total,M_Total,H_00_Entre_0_y_4_años,H_01_Entre_5_y_9_años,H_02_Entre_10_y_14_años,H_03_Entre_15_y_19_años,...,M_11_Entre_55_y_59_años,M_12_Entre_60_y_64_años,M_13_Entre_65_y_69_años,M_14_Entre_70_y_74_años,M_15_Entre_75_y_79_años,M_16_Entre_80_y_84_años,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años,Shape__Area,Shape__Length
0,4109106023,TRIANA ESTE,Triana,821,358,463,10,17,28,22,...,27,31,25,31,29,32,14.0,12.0,43979.91553,875.704245
1,4109104028,LOS PAJAROS,Cerro Amate,795,384,411,30,22,17,24,...,23,30,11,8,7,10,9.0,1.0,60709.0752,1385.198757
2,4109104033,JUAN XXIII,Cerro Amate,1156,535,621,22,30,34,27,...,33,34,36,37,42,27,33.0,7.0,102626.6489,1279.491928
3,4109104035,ROCHELAMBERT,Cerro Amate,1449,676,773,21,33,38,32,...,57,30,47,62,31,29,15.0,6.0,78884.0791,1493.304158
4,4109101002,FERIA,Casco Antiguo,950,449,501,13,19,13,16,...,36,27,21,35,23,16,9.0,7.0,77461.28052,1305.417423


In [69]:
C_df.head()

Unnamed: 0,CUSEC,Barrio,Distrito,Poblacion,Shape__Are,xcoord,ycoord
0,4109108035,SAN PABLO D Y E,San Pablo - Santa Justa,769,32050.2644,-5.95739,37.39938
1,4109109010,PALACIO DE CONGRESOS URBADIEZ ENTREPUENTES,Este,2208,485294.7092,-5.94065,37.40601
2,4109109003,COLORES ENTREPARQUES,Este,2119,380933.1804,-5.91771,37.38841
3,4109110013,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,647,26359.63672,-5.98599,37.35797
4,4109110012,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,1066,482544.1013,-5.98489,37.36369


In [70]:
ID_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,4109101001,Sevilla,sección,1001,Renta,media,por,persona,2017,15.189
1,4109101001,Sevilla,sección,1001,Renta,media,por,hogar,2017,35.467
2,4109101002,Sevilla,sección,1002,Renta,media,por,persona,2017,14.763
3,4109101002,Sevilla,sección,1002,Renta,media,por,hogar,2017,35.431
4,4109101003,Sevilla,sección,1003,Renta,media,por,persona,2017,15.518


#### Remove inecesary columns

In [71]:
ID_df = ID_df[[0,9]]
columns_ID_df = ['CUSEC', 'Income [k€]']
ID_df.columns = columns_ID_df
ID_df.head()

Unnamed: 0,CUSEC,Income [k€]
0,4109101001,15.189
1,4109101001,35.467
2,4109101002,14.763
3,4109101002,35.431
4,4109101003,15.518


#### Create 2 income dataframes, one for person and another per house

In [72]:
ID_df_P = ID_df[ID_df.index % 2 == 0]
ID_df_P.columns = ['CUSEC','Income_Per_Person [K€]']
ID_df_P.head()

Unnamed: 0,CUSEC,Income_Per_Person [K€]
0,4109101001,15.189
2,4109101002,14.763
4,4109101003,15.518
6,4109101004,15.818
8,4109101005,15.507


In [73]:
ID_df_H = ID_df[ID_df.index % 2 != 0]
ID_df_H.columns = ['CUSEC','Income_Per_House [K€]']
ID_df_H.head()

Unnamed: 0,CUSEC,Income_Per_House [K€]
1,4109101001,35.467
3,4109101002,35.431
5,4109101003,34.13
7,4109101004,37.452
9,4109101005,35.293


### Let's Create a single dataframe containing all the information included in the different dataframes

In [74]:
C_df.set_index('CUSEC', inplace=True)

In [75]:
DD_filtered = DD_df.drop(columns=['Barrio', 'Distrito'])
DD_filtered

Unnamed: 0,CUSEC,Población_Total,H_total,M_Total,H_00_Entre_0_y_4_años,H_01_Entre_5_y_9_años,H_02_Entre_10_y_14_años,H_03_Entre_15_y_19_años,H_04_Entre_20_y_24_años,H_05_Entre_25_y_29_años,...,M_11_Entre_55_y_59_años,M_12_Entre_60_y_64_años,M_13_Entre_65_y_69_años,M_14_Entre_70_y_74_años,M_15_Entre_75_y_79_años,M_16_Entre_80_y_84_años,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años,Shape__Area,Shape__Length
0,4109106023,821,358,463,10,17,28,22,16,20,...,27,31,25,31,29,32,14.0,12.0,4.397992e+04,875.704245
1,4109104028,795,384,411,30,22,17,24,21,37,...,23,30,11,8,7,10,9.0,1.0,6.070908e+04,1385.198757
2,4109104033,1156,535,621,22,30,34,27,26,30,...,33,34,36,37,42,27,33.0,7.0,1.026266e+05,1279.491928
3,4109104035,1449,676,773,21,33,38,32,36,41,...,57,30,47,62,31,29,15.0,6.0,7.888408e+04,1493.304158
4,4109101002,950,449,501,13,19,13,16,24,45,...,36,27,21,35,23,16,9.0,7.0,7.746128e+04,1305.417423
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
526,4109109068,1252,601,651,18,27,45,62,67,21,...,45,16,13,14,8,13,7.0,1.0,5.443147e+04,1451.575365
527,4109103007,1510,698,812,72,44,26,26,14,26,...,39,43,35,27,23,21,12.0,8.0,6.879409e+04,1606.676826
528,4109105044,1450,758,692,44,47,43,53,47,60,...,53,50,38,33,30,20,15.0,3.0,1.765981e+05,2025.047814
529,4109109069,1237,608,629,23,42,79,79,35,29,...,29,11,21,5,8,3,3.0,,1.641491e+07,19886.712560


In [76]:
df = C_df
df = df.join(ID_df_P.set_index('CUSEC'), on= df.index)
df = df.join(ID_df_H.set_index('CUSEC'), on= df.index)
df = df.join(DD_filtered.set_index('CUSEC'), on= df.index)
df.head()

Unnamed: 0_level_0,Barrio,Distrito,Poblacion,Shape__Are,xcoord,ycoord,Income_Per_Person [K€],Income_Per_House [K€],Población_Total,H_total,...,M_11_Entre_55_y_59_años,M_12_Entre_60_y_64_años,M_13_Entre_65_y_69_años,M_14_Entre_70_y_74_años,M_15_Entre_75_y_79_años,M_16_Entre_80_y_84_años,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años,Shape__Area,Shape__Length
CUSEC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4109108035,SAN PABLO D Y E,San Pablo - Santa Justa,769,32050.2644,-5.95739,37.39938,8.048,19.469,769,363,...,29,21,21,23,33,21,20.0,6.0,32050.2644,1090.486795
4109109010,PALACIO DE CONGRESOS URBADIEZ ENTREPUENTES,Este,2208,485294.7092,-5.94065,37.40601,15.638,51.195,2208,1099,...,121,114,89,45,25,13,9.0,8.0,485294.7092,3443.731401
4109109003,COLORES ENTREPARQUES,Este,2119,380933.1804,-5.91771,37.38841,11.587,34.827,2119,1036,...,120,74,66,43,35,10,9.0,8.0,380933.1804,3206.726561
4109110013,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,647,26359.63672,-5.98599,37.35797,13.765,34.004,647,283,...,8,18,28,22,35,23,22.0,12.0,26359.63672,866.828293
4109110012,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,1066,482544.1013,-5.98489,37.36369,18.457,53.481,1066,473,...,38,41,30,42,36,23,24.0,20.0,482544.1013,4208.913795


#### Reset Index and cast CUSEC field as string

In [77]:
df.reset_index(inplace = True)
df[['CUSEC']] = df[['CUSEC']].astype(str) # Converting this to string since otherwise folium will not be able to link the key on the dataframe with the dataframe
df.head()

Unnamed: 0,CUSEC,Barrio,Distrito,Poblacion,Shape__Are,xcoord,ycoord,Income_Per_Person [K€],Income_Per_House [K€],Población_Total,...,M_11_Entre_55_y_59_años,M_12_Entre_60_y_64_años,M_13_Entre_65_y_69_años,M_14_Entre_70_y_74_años,M_15_Entre_75_y_79_años,M_16_Entre_80_y_84_años,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años,Shape__Area,Shape__Length
0,4109108035,SAN PABLO D Y E,San Pablo - Santa Justa,769,32050.2644,-5.95739,37.39938,8.048,19.469,769,...,29,21,21,23,33,21,20.0,6.0,32050.2644,1090.486795
1,4109109010,PALACIO DE CONGRESOS URBADIEZ ENTREPUENTES,Este,2208,485294.7092,-5.94065,37.40601,15.638,51.195,2208,...,121,114,89,45,25,13,9.0,8.0,485294.7092,3443.731401
2,4109109003,COLORES ENTREPARQUES,Este,2119,380933.1804,-5.91771,37.38841,11.587,34.827,2119,...,120,74,66,43,35,10,9.0,8.0,380933.1804,3206.726561
3,4109110013,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,647,26359.63672,-5.98599,37.35797,13.765,34.004,647,...,8,18,28,22,35,23,22.0,12.0,26359.63672,866.828293
4,4109110012,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,1066,482544.1013,-5.98489,37.36369,18.457,53.481,1066,...,38,41,30,42,36,23,24.0,20.0,482544.1013,4208.913795


### Let's search the pharmacies inside each sub-neighborhood

In [78]:
# Let's load the geographical information

Seville_geo = 'https://raw.githubusercontent.com/jomsaga/Capstone/main/Sevilla_Full.json'

geojson = requests.get(Seville_geo).json()

In [79]:
# Create two lists one containing the sub-neighborhood ids and another lists of the sub-neighborhoods' vertexs' coordinates

list_sub_neighborhood_coordinates = []

list_sub_neighborhood_CUSEC = []

for venue in range(len(geojson['features'])):
    
    list_sub_neighborhood_coordinates.append(geojson['features'][venue]['geometry']['coordinates'][0])
    
    list_sub_neighborhood_CUSEC.append(geojson['features'][venue]['properties']['CUSEC'])

In [80]:
# Create a list in whcih each element is a string containing each sub-neighborhoods' vertexs' coordinates
# Each of the strings will be passed as parameter to the request in order to get the pharmacies inside each sub-neighborhood

list_sub_neighborhood_coordinates_url  = []

for sub_neighborhood_coordinates in list_sub_neighborhood_coordinates:
    
    sub_neighborhood_vertexs  =''
    
    for vertex in sub_neighborhood_coordinates:
        sub_neighborhood_vertexs+= str(vertex[1]) + ',' + str(vertex[0]) + ';'
    
    list_sub_neighborhood_coordinates_url.append(sub_neighborhood_vertexs)

In [81]:
# Static information of the request

CLIENT_ID = '5AHFGCONKYCMOZSZJLT4BYJTBV3SEQWMCKSPYBZDAZTBUUVC'#'UZZCI3GU0B0TESGPPWWGIL1UFKAB1NKJ2QGF0W0VEAZBLGUH'  #'XZ00IYHHH1TQXGJPE44J24KGX2L00I4M4QNZ2QX3DSDWDUTN'
CLIENT_SECRET = 'RA5BEWZQMXOB4PNDFALI24RMXMDFCAYNLC14LSMVWSUC14IJ'#'YN03BGN2GZCZBTQH4ANJVVKSMLUH350ON0BLPRSNWHZ4UZAE' #'B0UT2Q1ZKYGGXQYSNW2EIR2ZMJQL1THI2UMQ2N2VR4XBKWZK'
VERSION = '20180605' #'20161101'
LIMIT = 100
RADIUS = 500 
categoryId = '4bf58dd8d48988d10f951735' # This is the category Id of the Pharmacy venue
llAcc = 100000000

In [18]:
# Perform the requests and create a list of tuples to build the pharmacies dataframe

pharmacy_df_list = []

for sub_neighborhood,sub_neighborhood_coordinates in tqdm(zip(list_sub_neighborhood_CUSEC, list_sub_neighborhood_coordinates_url)):
    
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&polygon={}&categoryId={}&v={}&llAcc={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET,
            sub_neighborhood_coordinates,
            categoryId,
            VERSION,
            llAcc)
    
    try:
        
        results = requests.get(url).json()
        
        if results['response']['groups'][0]['items'] != []:
            for venue in range(len(results['response']['groups'][0]['items'])):    
                Pharmacy_id = results['response']['groups'][0]['items'][venue]['venue']['id']
                Pharmacy_name = results['response']['groups'][0]['items'][venue]['venue']['name']
                Pharmacy_lat = results['response']['groups'][0]['items'][venue]['venue']['location']['lat']
                Pharmacy_lng = results['response']['groups'][0]['items'][venue]['venue']['location']['lng']

            pharmacy_df_list.append((sub_neighborhood, Pharmacy_id, Pharmacy_name, Pharmacy_lat, Pharmacy_lng))

        else:

            Pharmacy_id = np.nan
            Pharmacy_name = np.nan
            Pharmacy_lat = np.nan
            Pharmacy_lng = np.nan


            pharmacy_df_list.append((sub_neighborhood, Pharmacy_id, Pharmacy_name, Pharmacy_lat, Pharmacy_lng))
    
    except:
        
        Pharmacy_id = np.nan
        Pharmacy_name = np.nan
        Pharmacy_lat = np.nan
        Pharmacy_lng = np.nan


        pharmacy_df_list.append((sub_neighborhood, Pharmacy_id, Pharmacy_name, Pharmacy_lat, Pharmacy_lng))

531it [04:12,  2.11it/s]


In [82]:
# Build the dataframe

columns = ['Sub-neighborhood','Pharmacy Id','Pharmacy Name','Pharmacy Latitude','Pharmacy Longitude']

pharmacy_df = pd.DataFrame(pharmacy_df_list)

pharmacy_df.columns = columns

In [83]:
# Clean the dataframe

pharmacy_df_final = pharmacy_df.dropna(inplace=True)

pharmacy_df_final = pharmacy_df.drop_duplicates(subset = 'Pharmacy Id')

pharmacy_df_final.head()

Unnamed: 0,Sub-neighborhood,Pharmacy Id,Pharmacy Name,Pharmacy Latitude,Pharmacy Longitude
2,4109104033,4f70af04e4b077f17f9addfe,Farmacia Fernandez Vega,37.375087,-5.952968
3,4109104035,53f1fec1498e97084aff42a3,FARMACIA PARQUE AMATE 24H,37.379246,-5.953345
8,4109101027,4bfc13a1e05e0f47e027cfa8,Farmacia Marqués De Paradas,37.389537,-6.001053
10,4109101008,4e69fcef45ddadf2d040fb90,Farmacia San Julian,37.398658,-5.985472
16,4109101017,4f5dce02e4b03690722fdb1a,Farmacia Montesion,37.396951,-5.991197


### Let's represent the dataframes information in a map

In [84]:
# Import the geogrphical information from json files

Seville_popup = 'https://raw.githubusercontent.com/jomsaga/Capstone/main/Sevilla_Full.json'

json_data = requests.get(Seville_popup).json()

# Add the income per persone and per house in new columns in the popup json

for i in tqdm(range(len(json_data['features']))):

    for j in range(len(json_data['features'])):

        if json_data['features'][i]['properties']['CUSEC'] == int(df.iloc[j,0]):
            
            json_data['features'][i]['properties']['Income_Per_Person [K€]'] = df.iloc[j,7]
            
            json_data['features'][i]['properties']['Income_Per_House [K€]'] = df.iloc[j,8]

100%|████████████████████████████████████████████████████████████████████████████████| 531/531 [00:17<00:00, 29.95it/s]


### Let's search the pharmacies inside each sub-neighborhood

In [85]:
# Create a single maps with different layers representing the income per person, per house and the pharmacies

Map = folium.Map(location=[37.3826, -5.94], zoom_start=12, tiles = None)

tiles = ['cartodbpositron','openstreetmap','stamenterrain','stamentoner']
tiles_names = ['Cartodb Positron','Open StreetMap','Stamen Terrain', 'Stamen Toner']

i = 0
for tile in tiles:
    folium.TileLayer(tile,name = tiles_names[i]).add_to(Map)
    i += 1
    
# Colormaps can be found here: https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/Colormaps.ipynb 
colormap = cm.linear.Set1_08.scale(df['Income_Per_Person [K€]'].min(), df['Income_Per_House [K€]'].max()).to_step(9)
colormap.caption = 'Income [Thousands of Euros]'
Map.add_child(colormap)

# Create the income per person layer
IPP_layer = folium.GeoJson(json_data,
                          style_function = lambda feature:{
                              'weight': 0.3,
                              'color':'Black',
                              'fillColor': colormap(feature['properties']['Income_Per_Person [K€]']) if 
                              feature['properties']['Income_Per_Person [K€]'] > 0 else '#00000000',
                              'fillOpacity': 0.3,
                              
                          },
                          tooltip=folium.GeoJsonTooltip(fields=['Distrito','Barrio', 'Income_Per_Person [K€]'],
                          aliases=['District','Neighborhood', 'Income per person [K€]']),
                          name ='Income per person [k€]',
                          overlay = True,
                          show = True)

IPP_layer.add_to(Map)

# Create the income per house layer
IPH_layer = folium.GeoJson(json_data,
                          style_function = lambda feature:{
                              'weight': 0.3,
                              'color':'Black',
                              'fillColor': colormap(feature['properties']['Income_Per_House [K€]']) if 
                              feature['properties']['Income_Per_Person [K€]'] > 0 else '#00000000',
                              'fillOpacity': 0.3,
                              
                          },
                          tooltip=folium.GeoJsonTooltip(fields=['Distrito','Barrio', 'Income_Per_House [K€]'],
                          aliases=['District','Neighborhood', 'Income per House [K€]']),
                          name ='Income per House [k€]',
                          overlay = True,
                          show = False)

IPH_layer.add_to(Map)

# Create the pharmacies layer

Pharmacies_layer = folium.FeatureGroup(name='Pharmacies') #Crewate a feature group whcih includes all the markes (pharmacies)

# Then we can include the markers as a single layer to the map control

for lat, lng, sub_neighborhood, pharmacy_name in zip(pharmacy_df_final['Pharmacy Latitude'],
                                                     pharmacy_df_final['Pharmacy Longitude'],
                                                     pharmacy_df_final['Sub-neighborhood'],
                                                     pharmacy_df_final['Pharmacy Name']):
    
    label = 'Sub-neighborhood: {}\n, Pharmacy: {}'.format(sub_neighborhood, pharmacy_name)
    
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Pharmacies_layer) 

Pharmacies_layer.add_to(Map)

Map.keep_in_front(Pharmacies_layer) # Keep the pharmacies layer allways in front

folium.LayerControl().add_to(Map)

Map.save('Map.html')

Map

## Neighborhood Analysis

In [86]:
df.head()

Unnamed: 0,CUSEC,Barrio,Distrito,Poblacion,Shape__Are,xcoord,ycoord,Income_Per_Person [K€],Income_Per_House [K€],Población_Total,...,M_11_Entre_55_y_59_años,M_12_Entre_60_y_64_años,M_13_Entre_65_y_69_años,M_14_Entre_70_y_74_años,M_15_Entre_75_y_79_años,M_16_Entre_80_y_84_años,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años,Shape__Area,Shape__Length
0,4109108035,SAN PABLO D Y E,San Pablo - Santa Justa,769,32050.2644,-5.95739,37.39938,8.048,19.469,769,...,29,21,21,23,33,21,20.0,6.0,32050.2644,1090.486795
1,4109109010,PALACIO DE CONGRESOS URBADIEZ ENTREPUENTES,Este,2208,485294.7092,-5.94065,37.40601,15.638,51.195,2208,...,121,114,89,45,25,13,9.0,8.0,485294.7092,3443.731401
2,4109109003,COLORES ENTREPARQUES,Este,2119,380933.1804,-5.91771,37.38841,11.587,34.827,2119,...,120,74,66,43,35,10,9.0,8.0,380933.1804,3206.726561
3,4109110013,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,647,26359.63672,-5.98599,37.35797,13.765,34.004,647,...,8,18,28,22,35,23,22.0,12.0,26359.63672,866.828293
4,4109110012,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,1066,482544.1013,-5.98489,37.36369,18.457,53.481,1066,...,38,41,30,42,36,23,24.0,20.0,482544.1013,4208.913795


In [87]:
df.dropna(inplace = True)

In [101]:
df.reset_index(inplace = True)

In [88]:
df.set_index('CUSEC', inplace = True)

In [89]:
List_drop = ['Barrio','Distrito','xcoord','ycoord','Poblacion','Shape__Are','Shape__Area','Shape__Length']

In [90]:
df_cluster = df.drop(List_drop, axis = 1)

In [91]:
df_cluster.dropna(inplace = True)

In [92]:
df_cluster.head()

Unnamed: 0_level_0,Income_Per_Person [K€],Income_Per_House [K€],Población_Total,H_total,M_Total,H_00_Entre_0_y_4_años,H_01_Entre_5_y_9_años,H_02_Entre_10_y_14_años,H_03_Entre_15_y_19_años,H_04_Entre_20_y_24_años,...,M_09_Entre_45_y_49_años,M_10_Entre_50_y_54_años,M_11_Entre_55_y_59_años,M_12_Entre_60_y_64_años,M_13_Entre_65_y_69_años,M_14_Entre_70_y_74_años,M_15_Entre_75_y_79_años,M_16_Entre_80_y_84_años,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años
CUSEC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4109108035,8.048,19.469,769,363,406,19,21,12,20,27,...,23,32,29,21,21,23,33,21,20.0,6.0
4109109010,15.638,51.195,2208,1099,1109,30,44,49,60,101,...,58,95,121,114,89,45,25,13,9.0,8.0
4109109003,11.587,34.827,2119,1036,1083,23,44,60,67,100,...,101,130,120,74,66,43,35,10,9.0,8.0
4109110013,13.765,34.004,647,283,364,12,13,13,11,21,...,34,27,8,18,28,22,35,23,22.0,12.0
4109110012,18.457,53.481,1066,473,593,22,29,26,44,28,...,40,37,38,41,30,42,36,23,24.0,20.0


In [94]:
df_cluster_normalized = scale(df_cluster)
df_cluster_normalized

array([[-0.81735137, -0.91173291, -1.27794876, ..., -0.19989281,
         0.38100964, -0.55135696],
       [ 1.0486814 ,  1.89382791,  2.13458202, ..., -0.91077676,
        -0.87485333, -0.23461998],
       [ 0.05272636,  0.44638986,  1.92352209, ..., -1.17735825,
        -0.87485333, -0.23461998],
       ...,
       [ 0.11935283,  0.59150508,  2.20809727, ..., -0.19989281,
        -0.87485333, -0.55135696],
       [-0.68409843, -0.53643189, -0.63765459, ..., -1.26621874,
        -1.10319205, -1.50156789],
       [-0.79153668, -0.92862322, -0.99811649, ...,  0.77757263,
        -0.76068397, -0.70972545]])

### K-means Model

In [95]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_cluster_normalized)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 2, 0, 1, 4, 4, 1, 2, 4, 2, 4, 2, 4, 2, 4, 1, 4, 4, 4, 0, 4, 4,
       1, 4, 4, 2, 0, 4, 4, 1, 0, 1, 0, 1, 1, 0, 3, 1, 0, 1, 1, 1, 0, 4,
       2, 0, 2, 3, 2, 4, 2, 1, 2, 4, 1, 2, 4, 4, 4, 1, 2, 1, 4, 4, 4, 4,
       4, 1, 2, 4, 1, 1, 2, 1, 1, 1, 1, 4, 0, 0, 4, 2, 1, 1, 0, 1, 1, 4,
       4, 1, 4, 1, 4, 4, 2, 4, 1, 1, 1, 1, 1, 1, 2, 2, 1, 3, 1, 0, 1, 1,
       3, 2, 1, 4, 2, 4, 4, 1, 4, 2, 0, 0, 1, 1, 4, 1, 0, 4, 1, 4, 1, 4,
       2, 4, 4, 1, 1, 4, 2, 2, 2, 4, 1, 4, 4, 1, 2, 3, 4, 1, 1, 1, 1, 2,
       4, 1, 2, 1, 3, 1, 2, 0, 2, 2, 2, 2, 0, 4, 4, 4, 1, 2, 2, 4, 4, 1,
       1, 1, 4, 4, 1, 4, 1, 1, 1, 4, 1, 1, 1, 2, 4, 2, 4, 4, 1, 1, 1, 4,
       4, 4, 4, 4, 4, 4, 1, 4, 1, 1, 1, 2, 4, 1, 1, 2, 4, 1, 1, 4, 1, 2,
       2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 3, 1, 2, 1, 1, 1, 1, 4, 4, 2, 1, 1,
       1, 4, 1, 4, 1, 0, 1, 3, 4, 1, 2, 1, 1, 1, 4, 0, 2, 1, 4, 3, 4, 4,
       0, 1, 1, 4, 2, 3, 0, 4, 4, 1, 2, 2, 4, 4, 2, 0, 4, 1, 1, 2, 1, 2,
       2, 4, 4, 0, 4, 1, 2, 1, 1, 1, 2, 4, 4, 1, 1,

In [96]:
# Add cluster information to the dataframe 
df['Cluster'] = kmeans.labels_

In [102]:
#df.reset_index(inplace = True)
df = df.astype({'CUSEC': 'int64'})
df = df.astype({'Cluster':'int64'})
df.head()

Unnamed: 0,CUSEC,Barrio,Distrito,Poblacion,Shape__Are,xcoord,ycoord,Income_Per_Person [K€],Income_Per_House [K€],Población_Total,...,M_12_Entre_60_y_64_años,M_13_Entre_65_y_69_años,M_14_Entre_70_y_74_años,M_15_Entre_75_y_79_años,M_16_Entre_80_y_84_años,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años,Shape__Area,Shape__Length,Cluster
0,4109108035,SAN PABLO D Y E,San Pablo - Santa Justa,769,32050.2644,-5.95739,37.39938,8.048,19.469,769,...,21,21,23,33,21,20.0,6.0,32050.2644,1090.486795,1
1,4109109010,PALACIO DE CONGRESOS URBADIEZ ENTREPUENTES,Este,2208,485294.7092,-5.94065,37.40601,15.638,51.195,2208,...,114,89,45,25,13,9.0,8.0,485294.7092,3443.731401,2
2,4109109003,COLORES ENTREPARQUES,Este,2119,380933.1804,-5.91771,37.38841,11.587,34.827,2119,...,74,66,43,35,10,9.0,8.0,380933.1804,3206.726561,0
3,4109110013,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,647,26359.63672,-5.98599,37.35797,13.765,34.004,647,...,18,28,22,35,23,22.0,12.0,26359.63672,866.828293,1
4,4109110012,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,1066,482544.1013,-5.98489,37.36369,18.457,53.481,1066,...,41,30,42,36,23,24.0,20.0,482544.1013,4208.913795,4


In [103]:
# Join the df containing the demographic information and the pharmacies data frame

df_final = df

df_final = df_final.join(pharmacy_df_final.set_index('Sub-neighborhood'), on='CUSEC')

df_final.head()

Unnamed: 0,CUSEC,Barrio,Distrito,Poblacion,Shape__Are,xcoord,ycoord,Income_Per_Person [K€],Income_Per_House [K€],Población_Total,...,M_16_Entre_80_y_84_años,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años,Shape__Area,Shape__Length,Cluster,Pharmacy Id,Pharmacy Name,Pharmacy Latitude,Pharmacy Longitude
0,4109108035,SAN PABLO D Y E,San Pablo - Santa Justa,769,32050.2644,-5.95739,37.39938,8.048,19.469,769,...,21,20.0,6.0,32050.2644,1090.486795,1,,,,
1,4109109010,PALACIO DE CONGRESOS URBADIEZ ENTREPUENTES,Este,2208,485294.7092,-5.94065,37.40601,15.638,51.195,2208,...,13,9.0,8.0,485294.7092,3443.731401,2,4bbb04547421a5933e57c440,Farmacia Lda. Mª. Carmen Garzón Álvarez,37.407375,-5.940567
2,4109109003,COLORES ENTREPARQUES,Este,2119,380933.1804,-5.91771,37.38841,11.587,34.827,2119,...,10,9.0,8.0,380933.1804,3206.726561,0,,,,
3,4109110013,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,647,26359.63672,-5.98599,37.35797,13.765,34.004,647,...,23,22.0,12.0,26359.63672,866.828293,1,,,,
4,4109110012,SECTOR SUR LA PALMERA REINA MERCEDES,Bellavista La Palmera,1066,482544.1013,-5.98489,37.36369,18.457,53.481,1066,...,23,24.0,20.0,482544.1013,4208.913795,4,,,,


### Represent the different sub-neighborhood's clusters in a map

In [113]:
# In order to represent a choropleth map we need to cast the key to match the datatype of the key in the json
df_final = df_final.astype({'CUSEC': 'int64'})

In [173]:
Cluster_map = folium.Map(location=[37.3826, -5.94], zoom_start=12, tiles = None)

tiles = ['cartodbpositron','openstreetmap','stamenterrain','stamentoner']
tiles_names = ['Cartodb Positron','Open StreetMap','Stamen Terrain', 'Stamen Toner']

i = 0
for tile in tiles:
    folium.TileLayer(tile,name = tiles_names[i]).add_to(Cluster_map)
    i += 1

Cluster_map.choropleth(
    geo_data=Seville_geo,
    data=df_final,
    columns=['CUSEC', 'Cluster'],
    key_on='feature.properties.CUSEC',
    bins = [0,1,2,3,4,5],
    fill_color='YlOrBr',
    nan_fill_color = '#00000000',
    fill_opacity=0.4, 
    line_opacity=0.2,
    name= 'Cluster',
    legend_name='Neighborhood Cluster'
)

Pharmacies_layer = folium.FeatureGroup(name='Pharmacies') #Crewate a feature group whcih includes all the markes (pharmacies)

# Then we can include the markers as a single layer to the map control

for lat, lng, sub_neighborhood, pharmacy_name in zip(pharmacy_df_final['Pharmacy Latitude'],
                                                     pharmacy_df_final['Pharmacy Longitude'],
                                                     pharmacy_df_final['Sub-neighborhood'],
                                                     pharmacy_df_final['Pharmacy Name']):
    
    label = 'Sub-neighborhood: {}\n, Pharmacy: {}'.format(sub_neighborhood, pharmacy_name)
    
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Pharmacies_layer) 

Pharmacies_layer.add_to(Cluster_map)

Cluster_map.keep_in_front(Pharmacies_layer) # Keep the pharmacies layer allways in front

    
folium.LayerControl().add_to(Cluster_map)

Cluster_map

### Examine each cluster

In [202]:
cluster_num = 2
print('Cluster: ', cluster_num)
print('Min Population: ', df_final[df_final['Cluster']==cluster_num]['Poblacion'].min())
print('Max Population: ', df_final[df_final['Cluster']==cluster_num]['Poblacion'].max())
print('Min Income per person: ', df_final[df_final['Cluster']==cluster_num]['Income_Per_Person [K€]'].min())
print('Max Income per person: ', df_final[df_final['Cluster']==cluster_num]['Income_Per_Person [K€]'].max())
print('Min Income per house: ', df_final[df_final['Cluster']==cluster_num]['Income_Per_House [K€]'].min())
print('Max Income per House: ', df_final[df_final['Cluster']==cluster_num]['Income_Per_House [K€]'].max())

Cluster:  2
Min Population:  1306
Max Population:  2329
Min Income per person:  4.452
Max Income per person:  20.491
Min Income per house:  12.242
Max Income per House:  63.471000000000004


In [180]:
df_pob_40 = df_final

In [181]:
df_pob_40['Poblacion_40'] = df_pob_40.loc[:,'H_08_Entre_40_y_44_años':'M_18_Entre_90_y_más_años'].sum(axis = 1)

In [208]:
df_interest = df_pob_40[df_pob_40['Poblacion'] > 2000].copy()

In [209]:
df_interst = df_interest.sort_values(by='Poblacion_40', ascending = False, inplace = True) 

In [210]:
df_interest

Unnamed: 0,CUSEC,Barrio,Distrito,Poblacion,Shape__Are,xcoord,ycoord,Income_Per_Person [K€],Income_Per_House [K€],Población_Total,...,M_17_Entre_85_y_89_años,M_18_Entre_90_y_más_años,Shape__Area,Shape__Length,Cluster,Pharmacy Id,Pharmacy Name,Pharmacy Latitude,Pharmacy Longitude,Poblacion_40
429,4109110024,ELCANO BERMEJALES,Bellavista La Palmera,2983,133897.9,-5.97689,37.34497,13.957,40.812,2983,...,5.0,4.0,133897.9,1507.350927,3,,,,,2062.0
230,4109104042,SANTA AURELIA CANTABRICO ATLANTICO LA ROMERIA,Cerro Amate,2599,223184.1,-5.95242,37.38434,8.669,24.284,2599,...,12.0,7.0,223184.1,2423.344607,3,,,,,1942.0
501,4109110025,ELCANO BERMEJALES,Bellavista La Palmera,2687,536384.6,-5.97681,37.3424,11.118,28.691,2687,...,8.0,3.0,536384.6,4430.830197,3,4e6f813fae604d1b45a6af03,Farmacia J.Machuca De Castro,37.345607,-5.979164,1903.0
322,4109101023,FERIA,Casco Antiguo,2329,146176.5,-5.98961,37.39875,13.16,29.54,2329,...,24.0,10.0,146176.5,2174.550532,2,,,,,1856.0
261,4109105020,GIRALDA SUR,Sur,2452,292776.7,-5.9772,37.37483,16.224,43.43,2452,...,18.0,7.0,292776.7,2532.771785,3,,,,,1825.0
397,4109102001,DOCTOR BARRAQUER GRUPO RENFE POLICLINICO,Macarena,2164,515753.0,-5.99101,37.40516,14.34,31.29,2164,...,33.0,14.0,515753.0,3677.886985,2,,,,,1780.0
249,4109104063,LA PLATA,Cerro Amate,2389,171731.7,-5.94533,37.36942,6.114,17.926,2389,...,14.0,4.0,171731.7,1875.456637,3,,,,,1772.0
55,4109106017,TRIANA OESTE,Triana,2172,143794.8,-6.01112,37.38826,14.218,36.705,2172,...,32.0,15.0,143794.8,1618.525565,2,,,,,1771.0
510,4109109053,COLORES ENTREPARQUES,Este,2239,195338.5,-5.92523,37.40352,11.858,36.468,2239,...,9.0,6.0,195338.5,2024.085304,0,,,,,1757.0
338,4109103032,LA BUHAIRA,Nervión,2169,162512.9,-5.97157,37.38624,18.219,49.015,2169,...,28.0,13.0,162512.9,1749.3897,2,5c49d9ec31fd14002ccc9a7a,Kiehls,37.386669,-5.971599,1752.0


### Map of preferred areas

In [217]:
Results_map = folium.Map(location=[37.3826, -5.94], zoom_start=12, tiles = None)

tiles = ['cartodbpositron','openstreetmap','stamenterrain','stamentoner']
tiles_names = ['Cartodb Positron','Open StreetMap','Stamen Terrain', 'Stamen Toner']

i = 0
for tile in tiles:
    folium.TileLayer(tile,name = tiles_names[i]).add_to(Results_map)
    i += 1

Pharmacies_layer_result = folium.FeatureGroup(name='Suitable sub-neighborhoods') #Crewate a feature group whcih includes all the markes (pharmacies)

# Then we can include the markers as a single layer to the map control

for lat, lng, sub_neighborhood, barrio in zip(df_interest['ycoord'],
                                              df_interest['xcoord'],
                                              df_interest['CUSEC'],
                                              df_interest['Barrio']):
    
    label = 'Sub-neighborhood: {}\n, Barrio: {}'.format(sub_neighborhood, barrio)
    
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Pharmacies_layer_result) 

Pharmacies_layer_result.add_to(Results_map)

Results_map.keep_in_front(Pharmacies_layer_result) # Keep the pharmacies layer allways in front

folium.LayerControl().add_to(Results_map)

Results_map

### Conclusion

Clusters 1 and 4 are the clusters with less population, there are no more than 2000 people in them, so we are not interersted in the sub-neighborhoods inside these clusters.

On the other hand, the sub-neighborhood included in clusters 2 and 3 are the ones in whcih the icome per person and house can reach higher limits. 

And finally, the sub-neighborhoods included in clusters 2 and 3 are the ones in whcih the population older than 40 years is higher, so we will select sub-neighborhoods in clusters 2 and 3 in which there are no pharmacies as the prefered subneighborhoods to open a pharmacy.