# Capstone Project

## Introduction
The city of Santiago, Chile, is known for its strong socioeconomic segregation. The city is divided into 31 communes. The upper class is concentrated mainly in the north-eastern communes, while the lower class lives mainly in the southern ones, and little to no mixing of classes is seen across the city.

However, the country has experienced strong economic growth over the last few decades, lifting the lower class into a higher economic status. This, together with population growth, has raised housing prices across the whole city.

In this project I try to distinguish the economic status of a commune based on the types of venues located within it. Knowing which types of venues characterize the communes with higher economic status could be of great importance for real estate agencies. With this information they would be able to tell whether a lower-class commune is moving towards a higher economic status by checking whether its venue types resemble those of higher-class communes. If so, they may want to invest in these places.

## Data
For this project I require location data and a development index for each commune. Fortunately, this information is available at https://es.wikipedia.org/wiki/Anexo:Comunas_de_Chile. I loaded the data into a dataframe and preprocessed it, obtaining a dataset called santiago_data that contains the commune name, HDI (Human Development Index) and coordinates for each commune in Santiago. In addition, I obtained data on the venues in each commune using the Foursquare API; this information is stored in the santiago_venues dataset. The following cells show how I obtained and processed the data, along with a map that displays the location of each commune together with its name and HDI.

In [113]:
import pandas as pd # library for data analysis

In [114]:
#loading data
df = pd.read_html('https://es.wikipedia.org/wiki/Anexo:Comunas_de_Chile')[0]

In [115]:
#selecting the relevant columns and translating their names
df = df[["Nombre", "Provincia", "IDH 2005.1", "Latitud", "Longitud"]]
df = df.rename(columns={
    'Nombre': 'Commune',
    'Provincia': 'Province',
    'IDH 2005.1': 'HDI',
    'Latitud': 'Latitude',
    'Longitud': 'Longitude',
})

In [116]:
#translating HDI values
df.loc[df['HDI'] == "Medio", 'HDI'] = "Medium"
df.loc[df['HDI'] == "Alto", 'HDI'] = "High"
df.loc[df['HDI'] == "Muy alto", 'HDI'] = "Very high"
df.loc[df['HDI'] == "Bajo medio", 'HDI'] = "Medium low"
df.loc[df['HDI'] == "Bajo alto", 'HDI'] = "Upper low"
df.loc[df['HDI'] == "Bajo", 'HDI'] = "Low"

In [120]:
#obtaining only communes from santiago
santiago_data = df[df['Province'].str.contains("Santiago")].reset_index(drop=True)

In [121]:
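#changing coordinates types and obtaining final data set
import re

def dms2dd(degrees, minutes, seconds, direction):
    # convert degrees/minutes/seconds to decimal degrees
    dd = float(degrees) + float(minutes)/60 + float(seconds)/(60*60)
    # by convention, western and southern coordinates are negative
    if direction == 'W' or direction == 'S':
        dd *= -1
    return dd

def dd2dms(deg):
    # inverse conversion: decimal degrees to [degrees, minutes, seconds]
    d = int(deg)
    md = abs(deg - d) * 60
    m = int(md)
    sd = (md - m) * 60
    return [d, m, sd]

def parse_dms(dms):
    # split on every run of non-alphanumeric characters (degree, minute and second marks)
    parts = re.split(r'[^\d\w]+', dms)
    return dms2dd(parts[0], parts[1], parts[2], parts[3])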

santiago_data['Longitude'] = santiago_data['Longitude'].map(lambda x: x.lstrip('-') + ".0E")
santiago_data['Longitude'] = santiago_data['Longitude'].map(lambda x: x[:5] + "\\" + x[5:])
santiago_data['Latitude'] = santiago_data['Latitude'].map(lambda x: x.lstrip('-') + ".0S")
santiago_data['Latitude'] = santiago_data['Latitude'].map(lambda x: x[:5] + "\\" + x[5:])
santiago_data['Longitude'] = santiago_data['Longitude'].apply(parse_dms)
santiago_data['Latitude'] = santiago_data['Latitude'].apply(parse_dms)
santiago_data['Longitude'] = santiago_data['Longitude'].map(lambda x:  -x)
santiago_data['Latitude'] = santiago_data['Latitude'].map(lambda x: -x)

santiago_data.head()

|   | Commune | Province | HDI | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Santiago | Santiago | Very high | -33.437222 | -70.657222 |
| 1 | Cerrillos | Santiago | High | -33.5 | -70.716667 |
| 2 | Cerro Navia | Santiago | Medium | -33.421944 | -70.735 |
| 3 | Conchalí | Santiago | High | -33.38 | -70.675 |
| 4 | El Bosque | Santiago | High | -33.566944 | -70.675 |
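
The conversion above amounts to decimal degrees = degrees + minutes/60 + seconds/3600, negated for southern and western coordinates. The following is a minimal, self-contained sketch of that arithmetic. The function name `dms_to_decimal` and the input format (for example `33°26′14″`) are assumptions made for illustration, and the sample strings are back-computed from the coordinates shown for Santiago rather than copied from the Wikipedia table.

```python
import re

def dms_to_decimal(dms, hemisphere):
    """Convert a degrees/minutes/seconds string to signed decimal degrees.

    `dms` is assumed to look like "33°26′14″"; `hemisphere` is one of N, S, E, W.
    """
    # pull out the numeric fields in order: degrees, minutes, seconds
    numbers = re.findall(r'\d+(?:\.\d+)?', dms)[:3]
    degrees, minutes, seconds = (float(n) for n in numbers)
    value = degrees + minutes / 60 + seconds / 3600
    # southern latitudes and western longitudes are negative by convention
    return -value if hemisphere in ('S', 'W') else value

# assumed sample values for the commune of Santiago
print(round(dms_to_decimal("33°26′14″", "S"), 6))
print(round(dms_to_decimal("70°39′26″", "W"), 6))
```

Passing the hemisphere explicitly avoids having to negate the columns in a separate step afterwards.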


In [232]:
#dividing data frame into two groups: cluster 1 for communes with very high HDI and cluster 2 for the rest
santiago_data.loc[santiago_data['HDI'] == 'Very high', 'Cluster'] = 1
santiago_data.loc[santiago_data['HDI'] != 'Very high', 'Cluster'] = 2

In [132]:
santiago_data.head()

|   | Commune | Province | HDI | Latitude | Longitude | Cluster |
|---|---|---|---|---|---|---|
| 0 | Santiago | Santiago | Very high | -33.437222 | -70.657222 | 1.0 |
| 1 | Cerrillos | Santiago | High | -33.5 | -70.716667 | 2.0 |
| 2 | Cerro Navia | Santiago | Medium | -33.421944 | -70.735 | 2.0 |
| 3 | Conchalí | Santiago | High | -33.38 | -70.675 | 2.0 |
| 4 | El Bosque | Santiago | High | -33.566944 | -70.675 | 2.0 |


In [119]:
#importing libraries for the next section
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import numpy as np # library to handle arrays (used below as np)
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [106]:
#obtaining coordinates from Santiago
address = 'Santiago, Chile'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Santiago are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Santiago are -33.4377968, -70.6504451.


In [233]:
#Displaying data in a map
map_santiago = folium.Map(location=[latitude, longitude], zoom_start=10, tiles="Stamen Toner")
x = np.arange(2)
ys = [i + x + (i*x)**2 for i in range(2)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to map
for lat, lng,name, HDI, cluster in zip(santiago_data['Latitude'], santiago_data['Longitude'], santiago_data['Commune'], santiago_data['HDI'], santiago_data["Cluster"]):
    cluster = int(cluster)
    label = folium.Popup(name + "," + HDI, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7,
        parse_html=False).add_to(map_santiago)  
    
map_santiago

In [140]:
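# define function to get venues near each commune
import os

# Foursquare credentials are read from environment variables instead of being hard-coded;
# the variable names below are placeholders, adjust them to your own setup
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'
radius = 5000 # note: not passed to getNearbyVenues below, so its default of 500 m applies
LIMIT = 100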
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # build the API request: endpoint plus query parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }
            
        # make the GET request
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
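
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, so any venue that comes back with an empty category list would raise an `IndexError`. The sketch below shows one defensive way to flatten the items. The helper name `extract_venue_rows` and the sample payload are hypothetical; the payload only mimics the fields that the function above reads.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare 'explore' items into rows, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        if not categories:
            # nothing to report as 'Venue Category', so skip this venue
            continue
        location = venue.get('location', {})
        rows.append((
            name,
            lat,
            lng,
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            categories[0]['name'],
        ))
    return rows

# hypothetical payload in the shape the function above expects
sample_items = [
    {'venue': {'name': 'Starbucks',
               'location': {'lat': -33.437938, 'lng': -70.657007},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorized place',
               'location': {'lat': -33.44, 'lng': -70.65},
               'categories': []}},
]

rows = extract_venue_rows('Santiago', -33.437222, -70.657222, sample_items)
print(rows)
```

Only the first sample venue survives, because the second one has no category to report.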

In [141]:
# we get nearby venues for each commune and save it in santiago_venues dataframe
santiago_venues = getNearbyVenues(names=santiago_data['Commune'],
                                   latitudes=santiago_data['Latitude'],
                                   longitudes=santiago_data['Longitude']
                                  )
print(santiago_venues.shape)
santiago_venues.head()

(470, 7)


|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Santiago | -33.437222 | -70.657222 | Plaza de Bolsillo - Santiago Centro | -33.436778 | -70.655481 | Plaza |
| 1 | Santiago | -33.437222 | -70.657222 | Starbucks | -33.437938 | -70.657007 | Coffee Shop |
| 2 | Santiago | -33.437222 | -70.657222 | YMCA | -33.43906 | -70.656257 | Pool |
| 3 | Santiago | -33.437222 | -70.657222 | Caffe Mauro | -33.437763 | -70.655304 | Coffee Shop |
| 4 | Santiago | -33.437222 | -70.657222 | Bambudda | -33.438987 | -70.655631 | Asian Restaurant |


In [142]:
# one hot encoding
santiago_onehot = pd.get_dummies(santiago_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
santiago_onehot['Neighborhood'] = santiago_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [santiago_onehot.columns[-1]] + list(santiago_onehot.columns[:-1])
santiago_onehot = santiago_onehot[fixed_columns]


In [231]:
santiago_grouped = santiago_onehot.groupby('Neighborhood').mean().reset_index()
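
Because each one-hot column holds only 0s and 1s, taking the mean per commune yields the share of that commune's venues that fall in each category. The sketch below reproduces that arithmetic in plain Python for the five Santiago venues displayed in the santiago_venues preview. The function name `category_shares` is an assumption made for illustration.

```python
from collections import Counter, defaultdict

def category_shares(venues):
    """Return {neighborhood: {category: share of venues}} from (neighborhood, category) pairs."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for neighborhood, counter in counts.items()
    }

# the five rows shown in the santiago_venues preview
venues = [
    ('Santiago', 'Plaza'),
    ('Santiago', 'Coffee Shop'),
    ('Santiago', 'Pool'),
    ('Santiago', 'Coffee Shop'),
    ('Santiago', 'Asian Restaurant'),
]

shares = category_shares(venues)
print(shares['Santiago'])
```

With these five rows, Coffee Shop accounts for two venues out of five, which is the value the grouped mean would report for that column if the preview were the whole dataset.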


In [218]:
def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighborhood name) and sort categories by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = santiago_grouped['Neighborhood']

for ind in np.arange(santiago_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(santiago_grouped.iloc[ind, :], num_top_venues)
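
The `try`/`except` used to build the column names works for the ten columns needed here, but it would label a twenty-first column `21th`. A small helper that builds the ordinal suffix explicitly is sketched below. The name `ordinal` is an assumption made for illustration and is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    # numbers ending in 10 to 20 (including 11, 12, 13, 111, 112, ...) always take 'th'
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this produces the same column names as the loop in the cell above.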


In [219]:
# set number of clusters
kclusters = 5

santiago_grouped_clustering = santiago_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(santiago_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

santiago_merged = santiago_data.rename({'Commune':'Neighborhood'}, axis=1)

# merge santiago_data with the sorted venues to add cluster labels and top venues for each commune
santiago_merged = santiago_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

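# exclude La Florida from the clustered results (presumably because no venues were returned for it)
santiago_merged = santiago_merged[santiago_merged["Neighborhood"]!="La Florida"]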
santiago_merged["Cluster Labels"].value_counts()

0.0    23
3.0     5
2.0     1
4.0     1
1.0     1
Name: Cluster Labels, dtype: int64

In [212]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11, tiles="Stamen Toner")

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]



# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(santiago_merged['Latitude'], santiago_merged['Longitude'], santiago_merged['Neighborhood'], santiago_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [225]:
#displaying venues for each cluster
santiago_merged.loc[santiago_merged['Cluster Labels'] == 0, santiago_merged.columns[[0] + list(range(5, santiago_merged.shape[1]))]]

|   | Neighborhood | Cluster | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Santiago | 1.0 | 0.0 | Coffee Shop | Peruvian Restaurant | Pizza Place | Sandwich Place | Chinese Restaurant | Burger Joint | Asian Restaurant | Japanese Restaurant | Bakery | Sushi Restaurant |
| 4 | El Bosque | 2.0 | 0.0 | Pet Store | Pizza Place | Food & Drink Shop | Diner | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop | Dive Bar | Dessert Shop |
| 5 | Estación Central | 2.0 | 0.0 | Residential Building (Apartment / Condo) | Gym | Argentinian Restaurant | Food Truck | Restaurant | Japanese Restaurant | Yoga Studio | Farmers Market | Falafel Restaurant | Electronics Store |
| 6 | Huechuraba | 2.0 | 0.0 | Market | Outdoors & Recreation | Ice Cream Shop | Dive Bar | Fish & Chips Shop | Fast Food Restaurant | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop |
| 7 | Independencia | 2.0 | 0.0 | Food | Fried Chicken Joint | Park | Plaza | Asian Restaurant | Sandwich Place | Diner | Farmers Market | Falafel Restaurant | Electronics Store |
| 8 | La Cisterna | 2.0 | 0.0 | Sushi Restaurant | Pizza Place | Chinese Restaurant | Bakery | Fast Food Restaurant | Middle Eastern Restaurant | Basketball Court | Farmers Market | Falafel Restaurant | Electronics Store |
| 12 | La Reina | 1.0 | 0.0 | Sushi Restaurant | Coffee Shop | Café | Gourmet Shop | General Entertainment | Italian Restaurant | Liquor Store | Fish & Chips Shop | Cupcake Shop | Chinese Restaurant |
| 13 | Las Condes | 1.0 | 0.0 | Restaurant | Fast Food Restaurant | Bakery | Creperie | Plaza | Dessert Shop | Sandwich Place | Coffee Shop | Bike Shop | Salad Place |
| 14 | Lo Barnechea | 1.0 | 0.0 | Restaurant | Gym | Burger Joint | Pizza Place | Sushi Restaurant | Electronics Store | Shopping Mall | Other Great Outdoors | Diner | Soccer Stadium |
| 16 | Lo Prado | 2.0 | 0.0 | Sushi Restaurant | Pharmacy | Nightclub | Garden Center | Convenience Store | Food Truck | Farmers Market | Bakery | Chinese Restaurant | Grocery Store |


In [226]:
santiago_merged.loc[santiago_merged['Cluster Labels'] == 1, santiago_merged.columns[[0] + list(range(5, santiago_merged.shape[1]))]]

|   | Neighborhood | Cluster | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | La Granja | 2.0 | 1.0 | Candy Store | Yoga Studio | Dive Bar | Fish & Chips Shop | Fast Food Restaurant | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop | Diner |


In [227]:
santiago_merged.loc[santiago_merged['Cluster Labels'] == 2, santiago_merged.columns[[0] + list(range(5, santiago_merged.shape[1]))]]

|   | Neighborhood | Cluster | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | San Ramón | 2.0 | 2.0 | Soccer Stadium | Moving Target | Dive Bar | Fast Food Restaurant | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop | Yoga Studio | Fish & Chips Shop |


In [228]:
santiago_merged.loc[santiago_merged['Cluster Labels'] == 3, santiago_merged.columns[[0] + list(range(5, santiago_merged.shape[1]))]]

|   | Neighborhood | Cluster | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Cerrillos | 2.0 | 3.0 | Fast Food Restaurant | Restaurant | Grocery Store | Plaza | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop | Dive Bar | Diner |
| 2 | Cerro Navia | 2.0 | 3.0 | Dive Bar | Hardware Store | Plaza | Arts & Entertainment | Yoga Studio | Fast Food Restaurant | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop |
| 3 | Conchalí | 2.0 | 3.0 | Southern / Soul Food Restaurant | Fast Food Restaurant | Liquor Store | Plaza | Yoga Studio | Dive Bar | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop |
| 11 | La Pintana | 2.0 | 3.0 | Plaza | Pharmacy | Farmers Market | Grocery Store | Soccer Stadium | Diner | Falafel Restaurant | Electronics Store | Donut Shop | Dive Bar |
| 27 | Renca | 2.0 | 3.0 | BBQ Joint | Football Stadium | Plaza | Dive Bar | Fast Food Restaurant | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop | Yoga Studio |


In [229]:
santiago_merged.loc[santiago_merged['Cluster Labels'] == 4, santiago_merged.columns[[0] + list(range(5, santiago_merged.shape[1]))]]

|   | Neighborhood | Cluster | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | Lo Espejo | 2.0 | 4.0 | Plaza | Café | Bus Station | Yoga Studio | Dive Bar | Fast Food Restaurant | Farmers Market | Falafel Restaurant | Electronics Store | Donut Shop |
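
To relate the k-means labels back to the HDI-based split, one can count how many communes of each HDI group fall into each k-means cluster. The sketch below does this only for the rows displayed in the tables above (the cluster 0 listing shows 10 of its 23 communes), so it illustrates the bookkeeping rather than a result. The variable names are assumptions made for illustration.

```python
from collections import Counter

# (commune, HDI group from the 'Cluster' column, k-means label) for the rows displayed above
displayed = [
    ('Santiago', 1, 0), ('El Bosque', 2, 0), ('Estación Central', 2, 0),
    ('Huechuraba', 2, 0), ('Independencia', 2, 0), ('La Cisterna', 2, 0),
    ('La Reina', 1, 0), ('Las Condes', 1, 0), ('Lo Barnechea', 1, 0),
    ('Lo Prado', 2, 0),
    ('La Granja', 2, 1),
    ('San Ramón', 2, 2),
    ('Cerrillos', 2, 3), ('Cerro Navia', 2, 3), ('Conchalí', 2, 3),
    ('La Pintana', 2, 3), ('Renca', 2, 3),
    ('Lo Espejo', 2, 4),
]

# count communes per (HDI group, k-means label) pair
crosstab = Counter((hdi_group, label) for _, hdi_group, label in displayed)
for (hdi_group, label), n in sorted(crosstab.items()):
    print('HDI group {} / k-means cluster {}: {} communes'.format(hdi_group, label, n))
```

On the full dataframe the same table could be obtained with `pd.crosstab(santiago_merged['Cluster'], santiago_merged['Cluster Labels'])`.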
