# Capstone Project - The Battle of Neighborhoods



## Introduction

### Background

Curitiba is a Brazilian city and the capital of the state of Paraná. It is divided into 75 neighborhoods, grouped into ten administrative regions (*regionais*). The regionais are a kind of subprefecture, treated here as boroughs. Their headquarters are the units known as "Rua da Cidadania" (Citizenship Street), and their purpose is to decentralize public agencies and the provision of social, structural and leisure services within the city. According to the Brazilian Institute of Geography and Statistics, in 2012 the city had 108,474 local units, 103,211 active companies and commercial establishments, and 780,390 workers, of which 1,084,369 were employed and 931,971 were employees.

### Problem

This project aims to help people who want to open a business in Curitiba decide in which neighborhood to open it.

### Interest

The target audience is entrepreneurs who are looking for a place to open a business and want to know in which neighborhoods it is most likely to fit in and prosper.

Importing the libraries used.

In [0]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
import re
import requests
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

## Data collection and cleaning

The data comes from the Portuguese Wikipedia page listing the neighborhoods of Curitiba.

In [0]:
url = 'https://pt.wikipedia.org/wiki/Lista_de_bairros_de_Curitiba'
raw_data = pd.read_html(url, decimal=',', thousands=u'\xa0')

Creating the dataframe.

In [0]:
columns = ['borough', 'neighborhood', 'men', 'women', 'total', 'households', 'income_per_head', 'area']
frames = []

# Every table except the last one is read as one borough
for i in range(len(raw_data)-1):
  table = raw_data[i]
  aux = pd.DataFrame(columns=columns)
  # The code assumes the data rows start at index 3, below the header rows
  aux['neighborhood'] = pd.Series(table[0][3:])
  aux['men'] = pd.to_numeric(pd.Series(table[2][3:]), downcast='float')
  aux['women'] = pd.to_numeric(pd.Series(table[3][3:]), downcast='float')
  aux['total'] = pd.to_numeric(pd.Series(table[4][3:]), downcast='float')
  aux['households'] = pd.to_numeric(pd.Series(table[5][3:]), downcast='float')
  aux['income_per_head'] = pd.to_numeric(pd.Series(table[6][3:]), downcast='float')
  aux['area'] = pd.to_numeric(pd.Series(table[1][3:]), downcast='float')
  # The borough name is taken from the first header cell, between "- " and " ("
  borough = re.search(r'-\s(.*)\s\(', table[0][0]).group(1).replace('Regional ', '')
  aux['borough'] = borough
  frames.append(aux)

# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated instead
df = pd.concat(frames, ignore_index=True)

df = df[~df['neighborhood'].duplicated()] # There is a neighborhood that is in two boroughs
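The borough name is pulled out of each table's first header cell with a regular expression. A minimal sketch of that single step follows, run on a made-up header string. The exact header text on the Wikipedia page is an assumption here, so `header` is purely hypothetical.

```python
import re

# Hypothetical header text; the real header on the page may differ.
header = "Tabela 1 - Regional Bairro Novo (3 bairros)"

# Capture everything between "- " and the last " (" in the header.
match = re.search(r'-\s(.*)\s\(', header)

# Guard against headers that do not follow the expected pattern.
borough = match.group(1).replace('Regional ', '') if match else None

print(borough)  # Bairro Novo
```

Checking `match` before calling `.group(1)` avoids an `AttributeError` if a table header does not follow the expected pattern.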

Normalizing some fields.

In [4]:
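# Express the male population as a share of the total population
df['men'] = df['men'] / df['total']
df = df.drop(columns=['women'])
# Standardize the five numeric columns (men, total, households, income_per_head, area) to z-scores
for i in range(2, 7):
  df.iloc[:,i]=(df.iloc[:,i]-df.iloc[:,i].mean())/df.iloc[:,i].std()
df.head()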


|   | borough | neighborhood | men | total | households | income_per_head | area |
|---|---|---|---|---|---|---|---|
| 0 | Bairro Novo | Ganchinho | 1.30841 | -0.534514 | -0.622308 | -0.941887 | 0.862672 |
| 1 | Bairro Novo | Sitio Cercado | 0.973782 | 3.216365 | 3.054554 | -0.779877 | 0.849981 |
| 2 | Bairro Novo | Umbará | 1.210841 | -0.24773 | 1.519758 | -0.805252 | 2.650608 |
| 3 | Boa Vista | Abranches | 0.699475 | -0.383036 | -0.447892 | -0.707649 | -0.228809 |
| 4 | Boa Vista | Atuba | 0.593846 | -0.325166 | -0.380984 | -0.512454 | -0.236741 |
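The normalization loop rescales each numeric column to a z-score: subtract the column mean, then divide by the column's sample standard deviation. A small sketch on toy data makes the effect visible. The values below are illustrative and not taken from the Curitiba table.

```python
import pandas as pd

# Toy data; the values are illustrative, not taken from the Curitiba table.
toy = pd.DataFrame({'total': [100.0, 200.0, 300.0],
                    'area': [1.0, 2.0, 6.0]})

# Subtract the column mean and divide by the sample standard deviation
# (pandas uses ddof=1 by default in .std()).
standardized = (toy - toy.mean()) / toy.std()

print(standardized)
```

After this step every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that k-means relies on.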


In [5]:
df.shape

(75, 7)

Using geopy to get the coordinates of each *bairro* (neighborhood).

In [0]:
# One geocoder client is enough for all lookups
geolocator = Nominatim(user_agent="jhkl")
rows = []

for neighborhood in list(df['neighborhood']):
  location = geolocator.geocode("Curitiba, %s" % neighborhood, timeout=10)
  rows.append({'neighborhood': neighborhood,
               'latitude': location.latitude,
               'longitude': location.longitude})

# Build the dataframe once instead of appending row by row
coords = pd.DataFrame(rows, columns=['neighborhood', 'latitude', 'longitude'])

# Manually correcting two neighborhood coordinates
coords.loc[coords['neighborhood'] == 'Boa Vista', 'latitude'] =  -25.3851343
coords.loc[coords['neighborhood'] == 'Boa Vista', 'longitude']= -49.2460489
coords.loc[coords['neighborhood'] == 'Cachoeira', 'latitude'] = -25.3539823
coords.loc[coords['neighborhood'] == 'Cachoeira', 'longitude'] = -49.2572706
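geopy's `geocode` returns `None` when a query has no match, in which case reading `.latitude` raises an `AttributeError`. The sketch below shows one possible guard. The helper `safe_coordinates` and the stand-in geocoder are hypothetical names introduced only for illustration, so the example runs offline.

```python
def safe_coordinates(geocode, query):
    """Return (latitude, longitude), or (None, None) when nothing is found.

    `geocode` is any callable that behaves like geopy's `geocode` method:
    it returns an object with `.latitude` and `.longitude`, or None.
    """
    location = geocode(query)
    if location is None:
        return (None, None)
    return (location.latitude, location.longitude)


# Stand-in for a geocoder result so the sketch needs no network access.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


# Coordinates reused from the manual correction in the notebook.
known = {"Curitiba, Boa Vista": FakeLocation(-25.3851343, -49.2460489)}
fake_geocode = known.get  # returns None for unknown queries

print(safe_coordinates(fake_geocode, "Curitiba, Boa Vista"))
print(safe_coordinates(fake_geocode, "Curitiba, Nowhere"))
```

With a real geocoder, `geolocator.geocode` could be passed in place of `fake_geocode`, and neighborhoods that come back as `(None, None)` could then be filled in by hand.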

Merging the coordinates into the original dataframe.

In [0]:
df = pd.merge(left=df, right=coords, how='left', on='neighborhood')

In [8]:
df.head()

|   | borough | neighborhood | men | total | households | income_per_head | area | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bairro Novo | Ganchinho | 1.30841 | -0.534514 | -0.622308 | -0.941887 | 0.862672 | -25.572076 | -49.263667 |
| 1 | Bairro Novo | Sitio Cercado | 0.973782 | 3.216365 | 3.054554 | -0.779877 | 0.849981 | -25.542701 | -49.269106 |
| 2 | Bairro Novo | Umbará | 1.210841 | -0.24773 | 1.519758 | -0.805252 | 2.650608 | -25.568169 | -49.285699 |
| 3 | Boa Vista | Abranches | 0.699475 | -0.383036 | -0.447892 | -0.707649 | -0.228809 | -25.361474 | -49.272054 |
| 4 | Boa Vista | Atuba | 0.593846 | -0.325166 | -0.380984 | -0.512454 | -0.236741 | -25.3875 | -49.206606 |
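A left merge keeps every neighborhood even when no coordinates were found for it, so missing matches show up only as `NaN`. One way to surface them is sketched below on toy frames. The names and values are illustrative only.

```python
import pandas as pd

# Toy frames; names and values are illustrative only.
left = pd.DataFrame({'neighborhood': ['A', 'B', 'C'], 'total': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'neighborhood': ['A', 'B'], 'latitude': [-25.1, -25.2]})

# indicator=True adds a `_merge` column saying where each row came from;
# validate='one_to_one' raises if a neighborhood appears twice on either side.
merged = pd.merge(left, right, how='left', on='neighborhood',
                  indicator=True, validate='one_to_one')

# Rows found only in the left frame have no coordinates.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'neighborhood'].tolist()
print(unmatched)  # ['C']
```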


Retrieving up to 50 venues within a radius of 2 km of each neighborhood's coordinates.

In [0]:
CLIENT_ID = ''  # REMOVED
CLIENT_SECRET = ''  # REMOVED
VERSION = '20180604'
LIMIT = 50

In [0]:
# function that collects the venues near each neighborhood, including each venue's category

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_venues
    

In [0]:
curitiba_venues = getNearbyVenues(names=df['neighborhood'],
                                  latitudes=df['latitude'],
                                  longitudes=df['longitude'],
                                  radius=2000
                                  )
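The list comprehension in `getNearbyVenues` reads `categories[0]` for every venue, which raises an `IndexError` if a venue comes back without a category. The sketch below shows a guarded version of that extraction step. The helper `extract_venues`, the `'Uncategorized'` placeholder, and the sample items are assumptions for illustration; the sample is hand-written to mirror the shape the notebook reads from `response['groups'][0]['items']` and is not real API output.

```python
def extract_venues(items, neighborhood):
    """Flatten Foursquare-style items into (neighborhood, name, lat, lng, category)."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to a placeholder instead of raising IndexError.
        category = categories[0]['name'] if categories else 'Uncategorized'
        rows.append((neighborhood,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written sample shaped like the items the notebook iterates over.
sample_items = [
    {'venue': {'name': 'Padaria X', 'location': {'lat': -25.4, 'lng': -49.2},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Lugar Y', 'location': {'lat': -25.5, 'lng': -49.3},
               'categories': []}},
]

rows = extract_venues(sample_items, 'Centro')
print(rows)
```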

One-hot encoding the venue categories and creating a dataframe with the mean frequency of each category per neighborhood.

In [0]:
curitiba_onehot = pd.get_dummies(curitiba_venues[['Venue Category']], prefix="", prefix_sep="")
curitiba_onehot['Neighborhood'] = curitiba_venues['Neighborhood'] 
curitiba_grouped = curitiba_onehot.groupby('Neighborhood').mean().reset_index()
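To see what the one-hot encoding and the grouped mean produce, here is a small sketch on a toy venue list. The neighborhood and category names are illustrative only, and passing `dtype=float` is a choice made here so the indicators are numeric in every pandas version.

```python
import pandas as pd

# Toy venue list; names are illustrative only.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bakery', 'Bakery', 'Bar', 'Bar', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)

# Averaging the indicators per neighborhood gives each category's share.
grouped = onehot.groupby(venues['Neighborhood']).mean()

print(grouped.round(2))
```

Each row sums to 1, so a neighborhood is described by the mix of venue categories around it rather than by the raw number of venues. Grouping by the original `Neighborhood` series, instead of adding it as a column to the one-hot frame, also avoids a name clash if a venue category happens to be called "Neighborhood".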

Function that returns the most common venue categories in a neighborhood.

In [0]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
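A quick way to check the ranking logic is to run it on a single toy row. The sketch below restates the function under the hypothetical name `top_categories` and adds an explicit cast to `float`, since a row that mixes the neighborhood name with frequencies has `object` dtype. The frequencies are illustrative only.

```python
import pandas as pd


def top_categories(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and rank the rest.
    frequencies = row.iloc[1:].astype(float)
    ranked = frequencies.sort_values(ascending=False)
    return ranked.index.values[:num_top_venues]


# Toy row; frequencies are illustrative only.
row = pd.Series({'Neighborhood': 'A', 'Bakery': 0.5, 'Bar': 0.3, 'Park': 0.2})

top = list(top_categories(row, 2))
print(top)  # ['Bakery', 'Bar']
```

When two categories have the same frequency, their relative order depends on the sort algorithm; passing `kind='stable'` to `sort_values` is one way to make ties keep their original column order.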

Returning the top 10 venue categories per neighborhood.

In [14]:
num_top_venues = 10

# create columns according to number of top venues
columns = ['neighborhood']
for ind in range(num_top_venues):
    columns.append('{} category'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['neighborhood'] = curitiba_grouped['Neighborhood']

for ind in range(curitiba_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(curitiba_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

|   | neighborhood | 1 category | 2 category | 3 category | 4 category | 5 category | 6 category | 7 category | 8 category | 9 category | 10 category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abranches | Bakery | Grocery Store | Soccer Field | Brazilian Restaurant | Pizza Place | Gym / Fitness Center | Park | Farmers Market | Restaurant | Snack Place |
| 1 | Ahú | Gym / Fitness Center | Dessert Shop | Bar | Martial Arts Dojo | Coffee Shop | Italian Restaurant | Soccer Field | Restaurant | Chocolate Shop | Pet Store |
| 2 | Alto Boqueirão | Bakery | Gym / Fitness Center | Pizza Place | Gym | Brazilian Restaurant | Seafood Restaurant | Grocery Store | Snack Place | Supermarket | Hot Dog Joint |
| 3 | Alto da Glória | Brazilian Restaurant | Italian Restaurant | Theater | Pizza Place | Soccer Stadium | Hotel | Restaurant | Dessert Shop | Bar | Portuguese Restaurant |
| 4 | Alto da XV | Italian Restaurant | BBQ Joint | Coffee Shop | Gym / Fitness Center | Restaurant | Liquor Store | Beer Bar | Bakery | Café | Dessert Shop |


## Modeling

Getting a view of the city and its neighborhoods.

In [15]:
# create map
latitude = -25.4295963
longitude = -49.2712724

map_curitiba = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to the map
markers_colors = []
for lat, lon, poi in zip(df['latitude'], df['longitude'], df['neighborhood']):
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#8000ff',
        fill=True,
        fill_color='#8000ff',
        fill_opacity=0.7).add_to(map_curitiba)
       
map_curitiba

Creating the dataframes for clustering.

In [0]:
df_clust_1 = df.drop(columns=['neighborhood', 'borough', 'latitude', 'longitude']) # Neighborhood socioeconomic data
df_clust_2 = curitiba_grouped.drop(columns='Neighborhood') # Neighborhood venue data

First clustering: k-means on the socioeconomic data, with 4 clusters.

In [17]:
# number of clusters
kclusters_1 = 4

# run k-means clustering
kmeans_1 = KMeans(n_clusters=kclusters_1, random_state=0).fit(df_clust_1)

# check cluster labels generated for each row in the dataframe
kmeans_1.labels_[0:10] 

array([1, 2, 2, 1, 1, 0, 2, 1, 0, 1], dtype=int32)

Second clustering: k-means on the venue data, with 5 clusters.

In [18]:
# number of clusters
kclusters_2 = 5

# run k-means clustering
kmeans_2 = KMeans(n_clusters=kclusters_2, random_state=0).fit(df_clust_2)

# check cluster labels generated for each row in the dataframe
kmeans_2.labels_[0:10]

array([1, 0, 1, 0, 0, 1, 3, 4, 1, 1], dtype=int32)
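The cluster counts above (4 and 5) are set by hand. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic data with three well-separated blobs, not on the Curitiba data, and is offered only as an illustration of the technique, not as the method used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (not the Curitiba data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit k-means for several k and keep the silhouette score of each fit.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the best-separated partition.
best_k = max(scores, key=scores.get)
print(best_k)
```

The same loop could be run on `df_clust_1` or `df_clust_2` in place of `X`; on real data the scores are usually much closer together, so they are better read as a guide than as a rule.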

Merging the cluster labels into the original dataframe.

In [0]:
df['clust_geo'] = pd.Series(kmeans_1.labels_)
neighborhoods_venues_sorted['clust_ven'] = pd.Series(kmeans_2.labels_)
df = pd.merge(left=df, right=neighborhoods_venues_sorted, how='left', on='neighborhood')

## Maps (visualizing clusters)

Map showing the clusters based on socioeconomic data.

In [20]:
# create map
latitude = -25.4295963
longitude = -49.2712724

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters_1))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df['latitude'], df['longitude'], df['neighborhood'], df['clust_geo']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Map showing the clusters based on venue data.

In [21]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters_2))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df['latitude'], df['longitude'], df['neighborhood'], df['clust_ven']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters