# Capstone Project: Introduction and Data

## Introduction
The city of Santiago, Chile, is known for its strong socioeconomic segregation. The city is divided into 31 communes. The upper class is concentrated mainly in the north-eastern communes, while the lower class lives mainly in the southern ones, and there is little to no mixing of classes across the city.

However, the country has experienced strong economic growth over the last few decades, lifting the lower class into a higher economic status. This, together with population growth, has raised housing prices across the whole city.

In this project I will try to distinguish the economic status of a commune based on the types of venues located within it. Knowing which types of venues are characteristic of communes with higher economic status could be of great value to real estate agencies. With this information, they will be able to tell whether a lower-class commune is moving toward a higher economic status by checking whether its venue types resemble those of higher-class communes. If so, they will want to invest in those places.

## Data
For this project I need location data and development indices for each commune. Fortunately, this information is available at https://es.wikipedia.org/wiki/Anexo:Comunas_de_Chile. I loaded this data into a dataframe and preprocessed it, obtaining a dataset called santiago_data that contains the commune name, HDI (Human Development Index) and coordinates for each commune in Santiago. In addition, I obtained data on the venues in each commune using the Foursquare API; this information is stored in the santiago_venues dataset. The following cells show how I obtained and processed the data, along with a map that displays each commune's location, name and HDI.

In [113]:
import pandas as pd # library for data analysis

In [114]:
#loading data
df = pd.read_html('https://es.wikipedia.org/wiki/Anexo:Comunas_de_Chile')[0]

In [115]:
#selecting the columns of interest and translating their names
df = df[["Nombre", "Provincia", "IDH 2005.1", "Latitud", "Longitud"]]
df = df.rename(columns={
    'Nombre': 'Commune',
    'Provincia': 'Province',
    'IDH 2005.1': 'HDI',
    'Latitud': 'Latitude',
    'Longitud': 'Longitude',
})

In [116]:
#translating HDI values
df.loc[df['HDI'] == "Medio", 'HDI'] = "Medium"
df.loc[df['HDI'] == "Alto", 'HDI'] = "High"
df.loc[df['HDI'] == "Muy alto", 'HDI'] = "Very high"
df.loc[df['HDI'] == "Bajo medio", 'HDI'] = "Medium low"
df.loc[df['HDI'] == "Bajo alto", 'HDI'] = "Upper low"
df.loc[df['HDI'] == "Bajo", 'HDI'] = "Low"
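
The six assignments above amount to a lookup table from Spanish to English labels. As a minimal sketch of the same idea in plain Python (the names `HDI_TRANSLATIONS` and `translate_hdi`, and the sample value "Sin datos", are illustrative and not part of the original notebook), the translation could be expressed as a dictionary:

```python
# Lookup table mirroring the six assignments above (Spanish label -> English label).
HDI_TRANSLATIONS = {
    "Muy alto": "Very high",
    "Alto": "High",
    "Medio": "Medium",
    "Bajo alto": "Upper low",
    "Bajo medio": "Medium low",
    "Bajo": "Low",
}

def translate_hdi(label):
    """Return the English HDI label, leaving unknown values unchanged."""
    return HDI_TRANSLATIONS.get(label, label)

# "Sin datos" is a made-up value used only to show the fallback behaviour.
sample = ["Muy alto", "Bajo medio", "Sin datos"]
translated = [translate_hdi(label) for label in sample]
print(translated)
```

With pandas, the same dictionary could presumably be passed to `df['HDI'].replace(HDI_TRANSLATIONS)` to translate the whole column in one step.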

In [120]:
#keeping only the communes whose province contains "Santiago"
santiago_data = df[df['Province'].str.contains("Santiago")].reset_index(drop=True)

In [121]:
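#changing coordinate types and obtaining the final data set
import re

def dms2dd(degrees, minutes, seconds, direction):
    # convert degrees, minutes and seconds to decimal degrees
    dd = float(degrees) + float(minutes)/60 + float(seconds)/(60*60)
    # note: the usual convention negates S and W instead; the final sign is applied after parsing, below
    if direction == 'E' or direction == 'N':
        dd *= -1
    return dd

def dd2dms(deg):
    # convert decimal degrees back to [degrees, minutes, seconds]
    d = int(deg)
    md = abs(deg - d) * 60
    m = int(md)
    sd = (md - m) * 60
    return [d, m, sd]

def parse_dms(dms):
    # split the string on non-alphanumeric characters and convert the first four parts
    parts = re.split(r'[^\d\w]+', dms)
    lat = dms2dd(parts[0], parts[1], parts[2], parts[3])

    return lat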

santiago_data['Longitude'] = santiago_data['Longitude'].map(lambda x: x.lstrip('-') + ".0E")
santiago_data['Longitude'] = santiago_data['Longitude'].map(lambda x: x[:5] + "\\" + x[5:])
santiago_data['Latitude'] = santiago_data['Latitude'].map(lambda x: x.lstrip('-') + ".0S")
santiago_data['Latitude'] = santiago_data['Latitude'].map(lambda x: x[:5] + "\\" + x[5:])
santiago_data['Longitude'] = santiago_data['Longitude'].apply(parse_dms)
santiago_data['Latitude'] = santiago_data['Latitude'].apply(parse_dms)
santiago_data['Longitude'] = santiago_data['Longitude'].map(lambda x:  -x)
santiago_data['Latitude'] = santiago_data['Latitude'].map(lambda x: -x)

santiago_data.head()

| | Commune | Province | HDI | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Santiago | Santiago | Very high | -33.437222 | -70.657222 |
| 1 | Cerrillos | Santiago | High | -33.5 | -70.716667 |
| 2 | Cerro Navia | Santiago | Medium | -33.421944 | -70.735 |
| 3 | Conchalí | Santiago | High | -33.38 | -70.675 |
| 4 | El Bosque | Santiago | High | -33.566944 | -70.675 |
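
To make the coordinate conversion easier to check, here is a minimal standalone sketch of the degrees/minutes/seconds to decimal degrees formula. It is not the notebook's `dms2dd`: the function name `dms_to_decimal` is hypothetical, and it takes the sign directly from the hemisphere letter (S and W negative) rather than negating afterwards. The input values are an assumption chosen to match the first row of the table.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by the usual convention.
    return -value if hemisphere in ("S", "W") else value

# 33 deg 26' 14" S and 70 deg 39' 26" W (assumed inputs for the Santiago commune)
lat = dms_to_decimal(33, 26, 14, "S")
lng = dms_to_decimal(70, 39, 26, "W")
print(round(lat, 6), round(lng, 6))  # -33.437222 -70.657222
```

The result agrees with the Latitude and Longitude shown for Santiago in the first row.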


In [119]:
#importing libraries for the next section
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available in pandas 1.0+)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [106]:
#obtaining coordinates from Santiago
address = 'Santiago, Chile'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Santiago are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Santiago are -33.4377968, -70.6504451.


In [122]:
#Displaying data in a map
map_santiago = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, name, HDI in zip(santiago_data['Latitude'], santiago_data['Longitude'], santiago_data['Commune'], santiago_data['HDI']):
    label = folium.Popup(name + "," + HDI, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_santiago)  
    
map_santiago
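
Since every marker above is drawn in the same blue, the HDI is only visible in the popup. One possible refinement, sketched here under assumptions, is to pick the marker colour from the HDI category. The palette and the names `HDI_COLORS` and `hdi_color` are my own illustrative choices, not part of the original notebook.

```python
# Hypothetical palette: these colour choices are an assumption for illustration only.
HDI_COLORS = {
    "Very high": "#1a9850",
    "High": "#91cf60",
    "Medium": "#fee08b",
    "Upper low": "#fc8d59",
    "Medium low": "#d73027",
    "Low": "#a50026",
}

def hdi_color(hdi, default="#808080"):
    """Return a hex colour for an HDI label, or grey for unknown labels."""
    return HDI_COLORS.get(hdi, default)

for label in ["Very high", "Medium", "Unknown"]:
    print(label, hdi_color(label))
```

In the loop above, `color=hdi_color(HDI)` and `fill_color=hdi_color(HDI)` could then replace the fixed `'blue'` and `'#3186cc'` values.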

In [108]:
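# Foursquare credentials, request parameters and a function to get venues near each commune
import os

# credentials are read from environment variables so they are not stored in the notebook
# (the variable names below are a suggestion)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'
radius = 3000
LIMIT = 100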
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
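
To illustrate what the list comprehension inside `getNearbyVenues` does without calling the API, here is a small sketch that flattens a hand-written payload. The payload is hypothetical and contains only the fields the function reads; the helper name `flatten_items` is also my own. Unlike the function above, this sketch tolerates a venue with an empty `categories` list, which is an assumption about data that might occur rather than something observed in this project.

```python
# Hypothetical payload shaped like the fields read by getNearbyVenues.
sample_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Cafe Ejemplo",
                               "location": {"lat": -33.4370, "lng": -70.6500},
                               "categories": [{"name": "Coffee Shop"}]}},
                    {"venue": {"name": "Plaza Ejemplo",
                               "location": {"lat": -33.4380, "lng": -70.6510},
                               "categories": []}},
                ]
            }
        ]
    }
}

def flatten_items(commune, lat, lng, payload):
    """Turn the nested response into one flat tuple per venue."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((commune, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

rows = flatten_items("Santiago", -33.437222, -70.657222, sample_response)
for row in rows:
    print(row)
```

Each tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues`.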

In [None]:
# we get nearby venues for each commune and save them in the santiago_venues dataframe
santiago_venues = getNearbyVenues(names=santiago_data['Commune'],
                                   latitudes=santiago_data['Latitude'],
                                   longitudes=santiago_data['Longitude'],
                                   radius=radius  # use the 3000 m radius defined above instead of the function default of 500
                                  )
print(santiago_venues.shape)
santiago_venues.head()