# Capstone Project - The Battle of the Neighborhoods

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
  * [Library import and Variables Initialization](#data_init)
  * [Import São Paulo data and fill geolocation](#imp_sp)
  * [Import Canberra data and fill geolocation](#imp_cb)
  * [Merge both cities](#merge)
  * [Map Plot of both neighborhoods](#map_both)
  * [Foursquare exploration](#fs_expl)
    * [Credential initialization](#fs_expl_init)
    * [Venue gathering](#fs_venue_gathering)
    * [Venue compare](#fs_venue_compare)
  * [Getting venue category dummies and grouping](#fs_venue_dummies)
  * [Prepare top venues data for neighborhood](#prepare_top_10)
* [Methodology](#methodology)
* [Analysis](#analysis)
  * [Top 5 Venue Categories](#top_5)
  * [Clustering](#clustering)
  * [Cluster map output](#cluster_map)
  * [Clustered Map Observations](#clustered_map_obs)
  * [Cluster Detail](#cluster_detail)
    * [Neighborhoods with no venue data](#no_venue)
    * [Cluster 0](#cluster_0)
    * [Cluster 1](#cluster_1)
    * [Cluster 2](#cluster_2)
    * [Cluster 3](#cluster_3)
    * [Cluster 4](#cluster_4)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

The objective is to compare the neighborhoods of São Paulo, Brazil with those of Canberra, Australia, to help people who want to move from one city to the other.

To that end, the neighborhoods of both cities are grouped by their similarities and by the characteristics of their main venues.

## Data <a name="data"></a>

Data sources:
- Boroughs and neighborhoods
 - São Paulo: obtained from https://www.prefeitura.sp.gov.br/cidade/secretarias/subprefeituras/subprefeituras/dados_demograficos/index.php
 - Canberra: obtained from https://en.wikipedia.org/wiki/List_of_Canberra_suburbs
- Neighborhood geolocation: obtained with Nominatim, via geopy
- Venue data: obtained from the Foursquare API (`venues/explore` endpoint)

### Library Import and Variables Initialization <a name="data_init"></a>

In [667]:
#library import

import requests
from bs4 import BeautifulSoup
import pandas as pd
import pgeocode
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
import folium.plugins
from IPython.display import display, HTML
import numpy as np

In [668]:
# global use variables
geolocator = Nominatim(user_agent="battlen")

# são paulo and canberra's locations for plot maps
loc_sp = geolocator.geocode('São Paulo, Brasil')
lat_sp = loc_sp.latitude
lng_sp = loc_sp.longitude

loc_cb = geolocator.geocode('Canberra, Australia')
lat_cb = loc_cb.latitude
lng_cb = loc_cb.longitude

print("São Paulo's Location {} {}".format(lat_sp,lng_sp))
print("Canberra's Location {} {}".format(lat_cb,lng_cb))

São Paulo's Location -23.5506507 -46.6333824
Canberra's Location -35.2975906 149.1012676


### Import São Paulo data and fill geolocation <a name="imp_sp"></a>

In [669]:
# Import São Paulo data and fill geolocation
page_sp = requests.get('https://www.prefeitura.sp.gov.br/cidade/secretarias/subprefeituras/subprefeituras/dados_demograficos/index.php')
soup_sp = BeautifulSoup(page_sp.text, 'html.parser')
    
df_sp = pd.read_html(str(soup_sp.find('table')))[0]
df_sp = df_sp[(df_sp['Distritos'] != 'TOTAL') & df_sp['Distritos'].notnull()]
df_sp = df_sp.drop(columns=df_sp.columns[[2,3,4]])
df_sp.columns = ['Borough','Neighborhood']


for index, row in df_sp.iterrows():
    print('.',end='')
    try:
      location = geolocator.geocode('{}, Sao Paulo, Brazil'.format(row['Neighborhood']))
      df_sp.at[index,'Latitude'] = location.latitude
      df_sp.at[index,'Longitude'] = location.longitude
    except Exception as e:
      print('***',e)
        
df_sp.head()

................................................................................................

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Aricanduva | Aricanduva | -23.578024 | -46.511454 |
| 1 | Aricanduva | Carrão | -23.55153 | -46.537791 |
| 2 | Aricanduva | Vila Formosa | -23.566876 | -46.546323 |
| 4 | Butantã | Butantã | -23.569056 | -46.721883 |
| 5 | Butantã | Morumbi | -23.596499 | -46.717845 |


In [670]:
print("Checking NA São Paulo values:")
print(df_sp.isna().sum())

Checking NA São Paulo values:
Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64
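
geopy's `Nominatim.geocode` returns `None` when a query has no match, so the loop above only notices a miss through the broad `except`. The following is a minimal sketch, not the notebook's code, of a helper that separates resolved from unresolved names explicitly. The function name `geocode_rows` and the stand-in geocoder are assumptions made for illustration, so the example runs without network access.

```python
from collections import namedtuple


def geocode_rows(names, geocode, query_template):
    """Return (resolved, unresolved) for an iterable of place names.

    `geocode` is any callable that takes a query string and returns an
    object with .latitude/.longitude, or None when nothing matches.
    """
    resolved, unresolved = {}, []
    for name in names:
        location = geocode(query_template.format(name))
        if location is None:
            unresolved.append(name)
        else:
            resolved[name] = (location.latitude, location.longitude)
    return resolved, unresolved


# Offline demonstration with a stand-in geocoder (hypothetical lookup table).
Location = namedtuple("Location", "latitude longitude")
fake_index = {"Carrão, Sao Paulo, Brazil": Location(-23.55153, -46.537791)}

resolved, unresolved = geocode_rows(
    ["Carrão", "Nowhere"], fake_index.get, "{}, Sao Paulo, Brazil")
print(resolved)
print(unresolved)
```

In the notebook, `geolocator.geocode` could be passed as the `geocode` argument. Wrapping it in geopy's `RateLimiter` (from `geopy.extra.rate_limiter`) is one way to keep request rates low when geocoding a couple of hundred names.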


### Import Canberra data and fill geolocation <a name="imp_cb"></a>

In [671]:
# Import Canberra data and fill geolocation
page_cb = requests.get('https://en.wikipedia.org/wiki/List_of_Canberra_suburbs')
soup_cb = BeautifulSoup(page_cb.text, 'html.parser')

district = soup_cb.find('h2').find_next('h2')

extracted_data=[]

while district is not None:
    district = district.find('span')

    if district is not None:
        suburbs = district.find_next('ul')

        if suburbs is not None:
            # find_all returns a (possibly empty) list, never None
            links = suburbs.find_all('a')

            if district.find('a'):
                district_name = district.a.string
            else:
                district_name = district.string

            # skip the non-district sections at the end of the page
            if district_name not in ('References', 'External links'):
                for link in links:
                    extracted_data.append({'Borough': district_name,
                                           'Neighborhood': link.string})

        district = district.find_next('h2')
        
        
df_cb = pd.DataFrame(extracted_data)

for index, row in df_cb.iterrows():
    print('.',end='')    
    try:
      # if row['Borough'] == 'Other':
      #   s = '{}, Australia'.format(row['Neighborhood'])    
      # else:
      #   s = '{}, {}, Australia'.format(row['Neighborhood'],row['Borough']) 
        
      s = '{}, Território da Capital Australiana, Australia'.format(row['Neighborhood'])      
        
      location = geolocator.geocode(s)
    
      df_cb.at[index,'Latitude'] = location.latitude
      df_cb.at[index,'Longitude'] = location.longitude
    except Exception as e:
      print('***',row['Neighborhood'],row['Borough'],e)

df_cb.head()

...............................................................................................................................................

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Belconnen | Aranda | -35.258055 | 149.080426 |
| 1 | Belconnen | Belconnen | -35.227434 | 149.043145 |
| 2 | Belconnen | Belconnen Town Centre | -35.227434 | 149.043145 |
| 3 | Belconnen | Emu Ridge | -35.235379 | 149.066002 |
| 4 | Belconnen | Bruce | -35.245352 | 149.091633 |


In [672]:
print("Checking NA values:")
df_cb.isna().sum()

Checking NA values:


Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64

### Merge both cities <a name="merge"></a>

In [725]:
df_merged = pd.concat([df_sp,df_cb],keys=['sp','cb'],names=['City'])
df_merged.groupby('City').count()

| City | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| sp | 96 | 96 | 96 | 96 |
| cb | 143 | 143 | 143 | 143 |


### Map Plot of both neighborhoods <a name="map_both"></a>

In [891]:
# create map of são paulo using latitude and longitude values
map_sp = folium.Map(location=[lat_sp, lng_sp], zoom_start=10)

# create map of canberra using latitude and longitude values
map_cb = folium.Map(location=[lat_cb, lng_cb], zoom_start=10)

# add markers to map
for index, row in df_merged.iterrows():
    label = '{}, {}'.format(row['Neighborhood'], row['Borough'])
    label = folium.Popup(label, parse_html=True)
    ci = folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False)
    if index[0] == 'cb':
        ci.add_to(map_cb)
    else:
        ci.add_to(map_sp)


htmlmap = HTML('<iframe srcdoc="{}" style="float:left; width: {}px; height: {}px; display:inline-block; width: 45%; margin: 0 auto; border: 2px solid black"></iframe>'
           '<iframe srcdoc="{}" style="float:right; width: {}px; height: {}px; display:inline-block; width: 45%; margin: 0 auto; border: 2px solid black"></iframe>'
           .format(map_sp.get_root().render().replace('"', '&quot;'),500,500,
                   map_cb.get_root().render().replace('"', '&quot;'),500,500))
display(htmlmap)
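
The same two-iframe markup is assembled again later in `show_cluster_map`. As a hedged sketch, the escaping and layout could be factored into one helper. The name `side_by_side` is hypothetical, and `html.escape` is used here in place of the manual `replace('"', '&quot;')`, which is a substitution on my part rather than the notebook's approach.

```python
import html


def side_by_side(left_html, right_html, height_px=500):
    """Wrap two HTML documents in floating iframes, escaping them for srcdoc."""
    frame = ('<iframe srcdoc="{doc}" style="float:{side}; width:45%; '
             'height:{h}px; border:2px solid black"></iframe>')
    left = frame.format(doc=html.escape(left_html, quote=True),
                        side="left", h=height_px)
    right = frame.format(doc=html.escape(right_html, quote=True),
                         side="right", h=height_px)
    return left + right


# Tiny stand-in documents instead of rendered folium maps.
markup = side_by_side('<p class="a">SP</p>', '<p class="b">CB</p>')
print(markup)
```

In the notebook this would be called as `display(HTML(side_by_side(map_sp.get_root().render(), map_cb.get_root().render())))`.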



### Foursquare exploration <a name="fs_expl"></a>
#### Credential initialization <a name="fs_expl_init"></a>

In [727]:
CLIENT_ID = 'MAPXQQR5QDKO0YVCYVFWPZCBQG1UWTLZSQQJZXSGYIRD4VK0' # your Foursquare ID
CLIENT_SECRET = 'I4MAMG2JDMA54ZL21SEPJA15K5QLL4GHMWUQ5W3H3GRFJBOB' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
RADIUS =500 # Default radius to check

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: MAPXQQR5QDKO0YVCYVFWPZCBQG1UWTLZSQQJZXSGYIRD4VK0
CLIENT_SECRET:I4MAMG2JDMA54ZL21SEPJA15K5QLL4GHMWUQ5W3H3GRFJBOB
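
Hard-coding and printing the client secret makes it visible to anyone who reads the notebook. One common alternative, sketched here under the assumption that the credentials are exported as environment variables, reads them at run time and masks them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical, as are the demo values.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client id and secret from environment variables (hypothetical names)."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(
            "Set {} before running the notebook".format(missing)) from None


def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demo mapping standing in for os.environ; the values are made up.
demo_env = {"FOURSQUARE_CLIENT_ID": "EXAMPLEID123",
            "FOURSQUARE_CLIENT_SECRET": "EXAMPLESECRET456"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
```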


#### Venue gathering  <a name="fs_venue_gathering"></a>

In [785]:
venues_list=[]

for index, row in df_merged.iterrows():
    print('.',end='')
    
    if index[0] == 'cb':
        r = RADIUS * 3
    else:
        r = RADIUS
            
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        row['Latitude'], 
        row['Longitude'], 
        r, 
        LIMIT)
            
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
        
    # return only relevant information for each nearby venue
    venues_list.append([(
        index[0],
        row['Neighborhood'], 
        row['Latitude'], 
        row['Longitude'],  
        v['venue']['name'], 
        v['venue']['location']['lat'], 
        v['venue']['location']['lng'],  
        v['venue']['categories'][0]['name']) for v in results])

df_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
df_venues.columns = ['City',
              'Neighborhood', 
              'NLatitude', 
              'NLongitude', 
              'Venue', 
              'VLatitude', 
              'VLongitude', 
              'VCategory']
    
df_venues.head()


...............................................................................................................................................................................................................................................

|   | City | Neighborhood | NLatitude | NLongitude | Venue | VLatitude | VLongitude | VCategory |
|---|---|---|---|---|---|---|---|---|
| 0 | sp | Aricanduva | -23.578024 | -46.511454 | Academia Mega Fitness | -23.581196 | -46.508753 | Gym / Fitness Center |
| 1 | sp | Aricanduva | -23.578024 | -46.511454 | Loja A Moderna | -23.576725 | -46.516153 | Clothing Store |
| 2 | sp | Aricanduva | -23.578024 | -46.511454 | Padaria Doce Villa | -23.574097 | -46.513193 | Bakery |
| 3 | sp | Aricanduva | -23.578024 | -46.511454 | Padaria Reis | -23.580423 | -46.512293 | Bakery |
| 4 | sp | Aricanduva | -23.578024 | -46.511454 | Falcon Doces | -23.576635 | -46.51609 | Candy Store |
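
The gathering loop indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, which raises if a response has no groups or a venue has no category. Below is a minimal defensive sketch of the same extraction. The function name `extract_venues` and the sample payload are illustrative assumptions; only the nesting of keys is taken from the loop above.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from a venues/explore payload.

    Missing groups give an empty list and a venue without categories
    gets None as its category, instead of raising IndexError/KeyError.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hand-written sample in the shape the loop above expects.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Padaria Reis",
               "location": {"lat": -23.580423, "lng": -46.512293},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": -23.58, "lng": -46.51},
               "categories": []}},
]}]}}

venues = extract_venues(sample)
print(venues)
```

The query string could likewise be passed through the `params=` argument of `requests.get` rather than formatted by hand, which also takes care of URL encoding.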


#### Venue compare  <a name="fs_venue_compare"></a>

In [786]:
df_venues.groupby('City').count()

| City | Neighborhood | NLatitude | NLongitude | Venue | VLatitude | VLongitude | VCategory |
|---|---|---|---|---|---|---|---|
| cb | 2823 | 2823 | 2823 | 2823 | 2823 | 2823 | 2823 |
| sp | 2653 | 2653 | 2653 | 2653 | 2653 | 2653 | 2653 |


In [787]:
df_venues.groupby('City')['VCategory'].nunique()

City
cb    214
sp    292
Name: VCategory, dtype: int64

### Getting venue category dummies and grouping <a name="fs_venue_dummies"></a>

In [788]:
# one hot encoding
df_onehot = pd.get_dummies(df_venues[['VCategory']], prefix="", prefix_sep="")

# add  column back to dataframe
df_onehot['City'] = df_venues['City'] 
df_onehot['Neighborhood'] = df_venues['Neighborhood'] 

# move column to the first column
fixed_columns = list(df_onehot.columns[-2:]) + list(df_onehot.columns[:-2])

df_onehot = df_onehot[fixed_columns]

df_onehot.head()

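|   | City | Neighborhood | Acai House | Accessories Store | African Restaurant | Airport | Airport Lounge | Airport Ticket Counter | American Restaurant | Animal Shelter | ... | Water Park | Waterfront | Whisky Bar | Wine Bar | Wine Shop | Winery | Wings Joint | Women's Store | Yoga Studio | Zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sp | Aricanduva | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | sp | Aricanduva | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | sp | Aricanduva | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | sp | Aricanduva | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | sp | Aricanduva | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |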


In [789]:
df_grouped = df_onehot.groupby(['City','Neighborhood']).mean().reset_index()
df_grouped.head()

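|   | City | Neighborhood | Acai House | Accessories Store | African Restaurant | Airport | Airport Lounge | Airport Ticket Counter | American Restaurant | Animal Shelter | ... | Water Park | Waterfront | Whisky Bar | Wine Bar | Wine Shop | Winery | Wings Joint | Women's Store | Yoga Studio | Zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cb | Acton | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.025 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | cb | Ainslie | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | cb | Amaroo | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | cb | Aranda | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | cb | Banks | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |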
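
Taking the mean of one-hot columns per neighborhood gives each category's share of that neighborhood's venues, which is why every row of `df_grouped` sums to 1. The plain-Python sketch below reproduces that arithmetic on the first five Aricanduva rows of the venue table. It illustrates the idea only and is not a replacement for the pandas code; `category_frequencies` is a hypothetical name.

```python
from collections import Counter


def category_frequencies(categories):
    """Share of each category among a list of venue categories."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Categories of the first five Aricanduva venues shown earlier.
aricanduva_head = ["Gym / Fitness Center", "Clothing Store",
                   "Bakery", "Bakery", "Candy Store"]
freqs = category_frequencies(aricanduva_head)
print(freqs["Bakery"])  # 0.4, two of the five venues are bakeries
```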


### Prepare top venues data for neighborhood <a name="prepare_top_10"></a>

In [790]:
# common venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [791]:
# dataframe with top venues

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City','Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
df_venues_sorted = pd.DataFrame(columns=columns)
df_venues_sorted['City'] = df_grouped['City']
df_venues_sorted['Neighborhood'] = df_grouped['Neighborhood']

for ind in np.arange(df_grouped.shape[0]):
    df_venues_sorted.iloc[ind, 2:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)

df_venues_sorted.head()

|   | City | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cb | Acton | Café | Coffee Shop | Hotel | Park | History Museum | Italian Restaurant | Plaza | Exhibit | Concert Hall | River |
| 1 | cb | Ainslie | Hotel | Pub | Mountain | Gym | Grocery Store | Rugby Pitch | Café | Scenic Lookout | Shopping Plaza | Business Service |
| 2 | cb | Amaroo | Shopping Plaza | Café | Lake | Italian Restaurant | Indian Restaurant | Supermarket | Playground | Grocery Store | Persian Restaurant | Korean Restaurant |
| 3 | cb | Aranda | Café | Supermarket | Bakery | Liquor Store | Nature Preserve | Gas Station | Newsstand | Sports Club | Chinese Restaurant | Mexican Restaurant |
| 4 | cb | Banks | Sports Club | Grocery Store | Pizza Place | Bistro | Acai House | Northern Brazilian Restaurant | Outdoor Supply Store | Outdoor Sculpture | Other Repair Shop | Other Nightlife |
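
The Banks row shows a side effect of always taking ten categories: with only four categories present, the remaining slots are filled with categories whose frequency is zero (Acai House, Northern Brazilian Restaurant, and so on). A hedged sketch of a ranking that skips zero-frequency categories follows, together with an ordinal-suffix helper that also handles 11th to 13th. Both function names are hypothetical, and the Banks frequencies are taken from the top-5 listing in the Analysis section.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 11th, 23rd."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(freqs, n):
    """Up to n categories by descending frequency, skipping zero-frequency ones."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, freq in ranked if freq > 0][:n]


banks = {"Sports Club": 0.25, "Grocery Store": 0.25, "Pizza Place": 0.25,
         "Bistro": 0.25, "Acai House": 0.0}

for rank, name in enumerate(top_categories(banks, 10), start=1):
    print(ordinal(rank), "Most Common Venue:", name)
```

With this variant a neighborhood can have fewer than ten entries, so the result no longer fits a fixed-width dataframe without padding. That trade-off may be why the notebook keeps the zero-frequency fillers.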


## Methodology <a name="methodology"></a>

- Data gathering
  - First, I collected the neighborhoods of both cities and geocoded each one with geopy.
  - I then merged all neighborhoods into a single dataframe, indexed by city.
  - The first Foursquare exploration showed that Canberra has significantly less venue data than São Paulo (767 vs 2658 venues) when a 500-meter radius is used for both, so I tripled the radius for Canberra only.
- Data preparation
  - One-hot encode the venue categories, one column per category
  - Aggregate by neighborhood using the mean, so that each value is the category's share of the neighborhood's venues
  - Rank the top 10 venue categories for each neighborhood
- Clustering
  - Cluster the neighborhoods of both cities together with K-Means on the per-neighborhood category frequencies; the top 10 venue categories are then used to describe each cluster
- Comparison
  - Plot the clustered maps side by side to visualize similar neighborhoods
  - Compare the most common venue categories at each rank (1st, 2nd, 3rd) within each cluster


## Analysis <a name="analysis"></a>

The first finding is that Canberra has significantly less venue data than São Paulo (767 vs 2658 venues) with a 500-meter radius, which suggests that Canberra may be a much more residential city than São Paulo. To obtain a comparable amount of venue data I had to triple the radius for Canberra (2823 vs 2653 venues in the run shown above).

Even with the larger radius, Canberra has fewer venue categories than São Paulo (214 vs 292 in the run shown above), so São Paulo's venues are about 36% more varied even though its search radius is three times smaller.
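
To put the radius change in perspective: tripling the radius multiplies the search area by nine, since the area grows with the square of the radius. The short calculation below is a worked check, using the category counts printed by the `nunique()` cell (214 for Canberra, 292 for São Paulo). The function name is hypothetical.

```python
import math


def search_area_km2(radius_m):
    """Area of a circular search region, in square kilometers."""
    return math.pi * (radius_m / 1000) ** 2


area_sp = search_area_km2(500)    # São Paulo radius
area_cb = search_area_km2(1500)   # Canberra radius (tripled)
area_ratio = area_cb / area_sp
print(round(area_sp, 2), round(area_cb, 2), round(area_ratio, 1))

categories_cb, categories_sp = 214, 292
extra_variety_pct = (categories_sp / categories_cb - 1) * 100
print(round(extra_variety_pct, 1))
```

So each Canberra query covers roughly 7.07 km² against 0.79 km² for São Paulo, and São Paulo still shows about 36% more distinct categories.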



### Top 5 venue categories <a name="top_5"></a>
Let's take a look at the top 5 venue categories in each neighborhood.

In [792]:
num_top_venues = 5

for city,hood in zip(df_grouped['City'],df_grouped['Neighborhood']):
    print("----"+city+':'+hood+"----")
    temp = df_grouped[(df_grouped['City'] == city) & (df_grouped['Neighborhood'] == hood)].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----cb:Acton----
            venue  freq
0            Café  0.12
1     Coffee Shop  0.10
2           Hotel  0.08
3            Park  0.05
4  History Museum  0.05


----cb:Ainslie----
           venue  freq
0          Hotel  0.19
1            Pub  0.06
2       Mountain  0.06
3            Gym  0.06
4  Grocery Store  0.06


----cb:Amaroo----
                venue  freq
0      Shopping Plaza  0.17
1                Café  0.17
2                Lake  0.08
3  Italian Restaurant  0.08
4   Indian Restaurant  0.08


----cb:Aranda----
             venue  freq
0             Café  0.15
1      Supermarket  0.10
2           Bakery  0.05
3     Liquor Store  0.05
4  Nature Preserve  0.05


----cb:Banks----
           venue  freq
0    Sports Club  0.25
1  Grocery Store  0.25
2    Pizza Place  0.25
3         Bistro  0.25
4     Acai House  0.00


----cb:Barton----
             venue  freq
0             Café  0.17
1            Hotel  0.10
2      Art Gallery  0.05
3  Thai Restaurant  0.04
4      Coffee Shop  

            venue  freq
0     Sports Club  0.22
1            Park  0.22
2  Shopping Plaza  0.11
3   Grocery Store  0.11
4     IT Services  0.11


----cb:Gordon----
                  venue  freq
0  Fast Food Restaurant  0.13
1           Supermarket  0.13
2           Gas Station  0.07
3                Bistro  0.07
4                Bridge  0.07


----cb:Gowrie----
                       venue  freq
0               Noodle House  0.09
1               Veterinarian  0.05
2          Convenience Store  0.05
3  Middle Eastern Restaurant  0.05
4            Thai Restaurant  0.05


----cb:Greenway----
                  venue  freq
0           Coffee Shop  0.13
1  Fast Food Restaurant  0.08
2      Department Store  0.08
3           Supermarket  0.08
4     Electronics Store  0.05


----cb:Griffith----
                venue  freq
0                Café  0.16
1               Hotel  0.08
2  Italian Restaurant  0.05
3     Thai Restaurant  0.05
4   Indian Restaurant  0.05


----cb:Gungahlin----
           

                venue  freq
0                Café  0.12
1         Coffee Shop  0.09
2                Park  0.05
3    Asian Restaurant  0.04
4  Italian Restaurant  0.04


----cb:Richardson----
                  venue  freq
0            Sports Bar  0.14
1  Fast Food Restaurant  0.14
2           Gas Station  0.14
3           Supermarket  0.14
4              Pharmacy  0.14


----cb:Rivett----
                   venue  freq
0                   Café  0.19
1            Supermarket  0.19
2  Vietnamese Restaurant  0.06
3            Gelato Shop  0.06
4   Gym / Fitness Center  0.06


----cb:Russell----
           venue  freq
0           Café  0.19
1  Memorial Site  0.14
2           Park  0.08
3    Art Gallery  0.06
4          Plaza  0.06


----cb:Scullin----
            venue  freq
0  Shopping Plaza  0.22
1     Gas Station  0.11
2    Soccer Field  0.11
3  Baseball Field  0.11
4     Supermarket  0.11


----cb:Southlands Centre----
                  venue  freq
0                  Café  0.20
1      

           venue  freq
0    Pizza Place  0.15
1  Deli / Bodega  0.08
2            Bar  0.08
3   Samba School  0.08
4    Salad Place  0.08


----sp:Cidade Tiradentes----
                    venue  freq
0  Furniture / Home Store  0.14
1             Bus Station  0.14
2       Electronics Store  0.14
3          Clothing Store  0.14
4                Pharmacy  0.14


----sp:Consolação----
                  venue  freq
0  Brazilian Restaurant  0.09
1           Coffee Shop  0.05
2                 Hotel  0.03
3        Ice Cream Shop  0.03
4  Gym / Fitness Center  0.03


----sp:Cursino----
                  venue  freq
0                Bakery  0.18
1     Food & Drink Shop  0.09
2  Gym / Fitness Center  0.09
3                Market  0.09
4           Candy Store  0.09


----sp:Ermelino Matarazzo----
                     venue  freq
0                BBQ Joint  0.14
1        Food & Drink Shop  0.14
2            Historic Site  0.14
3  Comfort Food Restaurant  0.14
4               Restaurant  0.14


--

                 venue  freq
0        Grocery Store  0.17
1  Sporting Goods Shop  0.08
2  Japanese Restaurant  0.08
3                  Gym  0.08
4         Soccer Field  0.08


----sp:São Miguel----
                     venue  freq
0            Grocery Store  0.21
1  Fruit & Vegetable Store  0.14
2      Japanese Restaurant  0.07
3              Pizza Place  0.07
4                      Gym  0.07


----sp:São Rafael----
           venue  freq
0            Bar   0.4
1           Park   0.2
2  Women's Store   0.2
3        Brewery   0.2
4     Acai House   0.0


----sp:Sé----
                  venue  freq
0  Brazilian Restaurant  0.10
1                Bakery  0.05
2           Snack Place  0.05
3                  Café  0.05
4    Miscellaneous Shop  0.05


----sp:Tatuapé----
            venue  freq
0            Café  0.06
1     Coffee Shop  0.06
2     Pizza Place  0.06
3  Ice Cream Shop  0.06
4    Dessert Shop  0.06


----sp:Tremembé----
                  venue  freq
0  Brazilian Restaurant  0.33

### Clustering <a name="clustering"></a>

In [804]:
# set number of clusters
kclusters = 5

df_clustering = df_grouped.drop(['City','Neighborhood'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 2, 0, 3, 0, 3, 2, 2, 2], dtype=int32)
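
To make concrete what `KMeans` does with the frequency vectors, here is a minimal pure-Python sketch of Lloyd's algorithm on four toy neighborhoods. It is not scikit-learn's implementation. Initializing the centroids with the first `k` points and the toy data are assumptions made to keep the example deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy frequency vectors over (Café, Bakery, Shopping Plaza).
points = [(0.6, 0.4, 0.0),   # café/bakery heavy
          (0.0, 0.1, 0.9),   # shopping heavy
          (0.5, 0.5, 0.0),   # café/bakery heavy
          (0.0, 0.0, 1.0)]   # shopping heavy
labels, centroids = kmeans(points, k=2)
print(labels)
```

Neighborhoods with similar category shares end up with the same label, which is the property the cluster maps rely on.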

In [805]:
# add clustering labels
df_venues_sorted.drop('Cluster Labels', axis=1, inplace=True, errors='ignore')
df_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_merged_k = df_merged.reset_index().drop('level_1',axis=1).set_index(['City','Neighborhood'])

df_merged_k = df_merged_k.join(df_venues_sorted.set_index(['City','Neighborhood']), on=['City','Neighborhood'])


df_merged_k.head() # check the last columns!

| City | Neighborhood | Borough | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sp | Aricanduva | Aricanduva | -23.578024 | -46.511454 | 3.0 | Bakery | Gym / Fitness Center | Grocery Store | Candy Store | Clothing Store | Paintball Field | Paella Restaurant | Outdoors & Recreation | Outdoor Supply Store | Outdoor Sculpture |
| sp | Carrão | Aricanduva | -23.55153 | -46.537791 | 3.0 | BBQ Joint | Pharmacy | Pizza Place | Brazilian Restaurant | Gym / Fitness Center | Bakery | Bar | Clothing Store | Park | Dessert Shop |
| sp | Vila Formosa | Aricanduva | -23.566876 | -46.546323 | 3.0 | Bakery | Brazilian Restaurant | Plaza | Pharmacy | Furniture / Home Store | Chocolate Shop | Clothing Store | Pizza Place | Gym | Gym / Fitness Center |
| sp | Butantã | Butantã | -23.569056 | -46.721883 | 3.0 | Science Museum | Mattress Store | Coffee Shop | Brazilian Restaurant | Vegetarian / Vegan Restaurant | Music Venue | Food Truck | Mineiro Restaurant | Bar | Thrift / Vintage Store |
| sp | Morumbi | Butantã | -23.596499 | -46.717845 | 0.0 | Soccer Stadium | Café | Snack Place | Restaurant | Athletics & Sports | Sports Bar | Coffee Shop | Clothing Store | Farmers Market | Japanese Restaurant |


### Cluster map output <a name='cluster_map'></a>

In [806]:
# set color scheme for the clusters
x = np.arange(kclusters)+1
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

In [892]:
def show_cluster_map(clusters):
    # create map of são paulo using latitude and longitude values
    map_sp_cl = folium.Map(location=[lat_sp, lng_sp], zoom_start=10)

    # create map of canberra using latitude and longitude values
    map_cb_cl = folium.Map(location=[lat_cb, lng_cb], zoom_start=10)

    df_merged_k.reset_index(inplace=True)
    df_merged_k.drop('level_0',axis=1,inplace=True, errors='ignore')    
    
    # add markers to the map
    markers_colors = []
    for index, row in df_merged_k.iterrows():
        lat = row['Latitude']
        lon = row['Longitude']
        city = row['City']
        poi = row['Neighborhood']

        if pd.isnull(row['Cluster Labels']):
            cluster = -1
        else:
            cluster = int(row['Cluster Labels'])
            
        if cluster in clusters:
            if cluster == -1:
                scluster = 'No Cluster'
            else:
                scluster = ' Cluster ' + str(cluster)
                
            label = folium.Popup(str(poi) + ' ' + scluster, parse_html=True)

            ci = folium.CircleMarker(
                [lat, lon],
                radius=5,
                popup=label,
                color=rainbow[cluster-1],
                fill=True,
                fill_color=rainbow[cluster-1],
                fill_opacity=0.7)

            if city == 'cb':
                ci.add_to(map_cb_cl)
            else:
                ci.add_to(map_sp_cl)


    htmlmap = HTML('<iframe srcdoc="{}" style="float:left; width: {}px; height: {}px; display:inline-block; width: 45%; margin: 0 auto; border: 2px solid black"></iframe>'
               '<iframe srcdoc="{}" style="float:right; width: {}px; height: {}px; display:inline-block; width: 45%; margin: 0 auto; border: 2px solid black"></iframe>'
               .format(map_sp_cl.get_root().render().replace('"', '&quot;'),500,500,
                       map_cb_cl.get_root().render().replace('"', '&quot;'),500,500))
    display(htmlmap)
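
One detail of the color lookup is worth noting: with `kclusters = 5`, `rainbow[cluster-1]` maps the "no cluster" label `-1` to `rainbow[-2]`, which is the same entry as `rainbow[3]` used for cluster 4, so the two groups share a color. A small sketch of an explicit mapping follows. The function name, the grey fallback, and the hex values in the palette are illustrative assumptions.

```python
NO_CLUSTER = -1


def cluster_color(cluster, palette, fallback="#808080"):
    """Color for a cluster label; unlabeled neighborhoods get the fallback grey."""
    if cluster == NO_CLUSTER:
        return fallback
    return palette[cluster % len(palette)]


# Illustrative five-color palette standing in for the `rainbow` list.
palette = ["#8000ff", "#00b5eb", "#80ffb4", "#ffb360", "#ff0000"]

colors_used = [cluster_color(c, palette) for c in [-1, 0, 1, 2, 3, 4]]
print(colors_used)
```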








In [893]:
show_cluster_map([-1,0,1,2,3,4,5])

#### Clustered Map Observations <a name='clustered_map_obs'></a>

- São Paulo and Canberra have different predominant clusters
- As can be seen below, Canberra's neighborhoods are spread across several clusters, while most of São Paulo's fall into a single cluster

In [890]:
df_merged_k.groupby(['City','Cluster Labels']).size()

City  Cluster Labels
cb    0.0               60
      1.0                1
      2.0               36
      3.0               39
      4.0                1
sp    0.0                4
      3.0               89
      4.0                1
dtype: int64
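
The counts above can be turned into shares, and compared with the 143 Canberra and 96 São Paulo rows in the merged dataframe to see how many neighborhoods received no cluster label. The sketch below is a worked calculation on the printed numbers only; `summarize` is a hypothetical helper.

```python
def summarize(sizes, total_rows):
    """Return (clustered, unlabeled, largest_cluster, share_percent) for one city."""
    clustered = sum(sizes.values())
    largest = max(sizes, key=sizes.get)
    share = round(100 * sizes[largest] / clustered, 1)
    return clustered, total_rows - clustered, largest, share


cluster_sizes = {
    "cb": {0: 60, 1: 1, 2: 36, 3: 39, 4: 1},
    "sp": {0: 4, 3: 89, 4: 1},
}
total_rows = {"cb": 143, "sp": 96}

summary = {city: summarize(sizes, total_rows[city])
           for city, sizes in cluster_sizes.items()}
for city, (clustered, unlabeled, largest, share) in summary.items():
    print(city, "clustered:", clustered, "unlabeled:", unlabeled,
          "largest cluster:", largest, "share: {}%".format(share))
```

On these numbers, about 94.7% of São Paulo's clustered neighborhoods sit in cluster 3, while Canberra's largest cluster (cluster 0) holds about 43.8%. Six Canberra rows and two São Paulo rows have no label.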

#### Cluster Detail <a name='cluster_detail'></a>

In [810]:
# return data from a specific cluster
def get_cluster(cluster):
    return df_merged_k.loc[df_merged_k['Cluster Labels'] == cluster, df_merged_k.columns[[0]+[1]+[2] + list(range(5, df_merged_k.shape[1]))]]

In [811]:
# return the n most frequent venue categories at a given rank (e.g. '1st'), per city
def show_c_venues(cluster,venue_position,n):
    col = venue_position+' Most Common Venue'
    df = get_cluster(cluster).copy()  # copy to avoid modifying a slice of df_merged_k
    df['Counts'] = df[['City',col]].groupby(['City',col])['City'].transform('count')
    df = df[['City',col,'Counts']]
    df = df.sort_values(['City','Counts'],ascending=False)
    df = df.drop_duplicates()

    df_t_sp = df[df['City'] == 'sp'].iloc[0:n]
    df_t_cb = df[df['City'] == 'cb'].iloc[0:n]

    # São Paulo rows first, then Canberra
    return pd.concat([df_t_sp,df_t_cb])
    

In [877]:
# print the most common venue categories and the neighborhood list of a cluster
def show_cluster(cluster,n):
    df_c = get_cluster(cluster)
    print("Neighborhood count by city:")
    print("===========================")    
    print(df_c.groupby('City').size())
    print('')
    print("Most common venue categories:")
    print("=============================")
    print(show_c_venues(cluster,'1st',n))
    print()
    print(show_c_venues(cluster,'2nd',n))
    print()
    print(show_c_venues(cluster,'3rd',n))
    print()
    print("Neighborhood list")
    print("=================")    
    
    print("SP neighborhoods: ",end="")
    for row in df_c[df_c['City']=='sp'].iterrows():
      print('"'+row[1]['Neighborhood'].strip(),end='" ')
    
    print()
    
    print("CB neighborhoods: ",end="")
    for row in df_c[df_c['City']=='cb'].iterrows():
      print('"'+row[1]['Neighborhood'].strip(),end='" ')    
    
    print()

##### Neighborhoods with no venue data <a name='no_venue'></a>

In [894]:
show_cluster_map([-1])

These neighborhoods are located mainly in forest and preservation areas.

##### Cluster 0 <a name='cluster_0'></a>

In [895]:
show_cluster_map([0])

In [878]:
show_cluster(0,2)

Neighborhood count by city:
City
cb    60
sp     4
dtype: int64

Most common venue categories:
    City 1st Most Common Venue  Counts
39    sp                Bakery       2
4     sp        Soccer Stadium       1
96    cb                  Café      46
190   cb           Supermarket       3

    City 2nd Most Common Venue  Counts
4     sp                  Café       1
37    sp            Restaurant       1
129   cb                 Hotel       7
136   cb    Chinese Restaurant       6

    City 3rd Most Common Venue  Counts
4     sp           Snack Place       1
37    sp            Public Art       1
122   cb           Supermarket       7
127   cb                 Hotel       5

Neighborhood list
SP neighborhoods: "Morumbi" "Tremembé" "Jaguara" "Cangaíba" 
CB neighborhoods: "Aranda" "Bruce" "Cook" "Lawson" "Jamison Centre" "Page" "Acton" "Ainslie" "Barton" "Braddon" "Campbell" "Duntroon" "Capital Hill" "City" "Deakin" "Dickson" "Dickson Centre" "Downer" "Forrest" "Fyshwick" "Griffith" "Manu

- Predominance of bakeries, restaurants, cafés, etc.

##### Cluster 1 <a name='cluster_1'></a>

In [896]:
show_cluster_map([1])

In [880]:
show_cluster(1,2)

Neighborhood count by city:
City
cb    1
dtype: int64

Most common venue categories:
    City 1st Most Common Venue  Counts
236   cb        Shop & Service       1

    City 2nd Most Common Venue  Counts
236   cb            Acai House       1

    City 3rd Most Common Venue  Counts
236   cb       Paintball Field       1

Neighborhood list
SP neighborhoods: 
CB neighborhoods: "Tharwa" 


- There's no equivalent in São Paulo

##### Cluster 2 <a name='cluster_2'></a>

In [897]:
show_cluster_map([2])

In [882]:
show_cluster(2,2)

Neighborhood count by city:
City
cb    36
dtype: int64

Most common venue categories:
    City 1st Most Common Venue  Counts
104   cb        Shopping Plaza      14
124   cb         Grocery Store       6

    City 2nd Most Common Venue  Counts
97    cb        Shopping Plaza       8
109   cb                  Café       4

    City 3rd Most Common Venue  Counts
97    cb           Pizza Place       6
157   cb           Supermarket       4

Neighborhood list
SP neighborhoods: 
CB neighborhoods: "Belconnen" "Belconnen Town Centre" "Charnwood" "Evatt" "Florey" "Flynn" "Giralang" "Hawker" "Higgins" "Holt" "Kippax Centre" "Kaleen" "McKellar" "Scullin" "Spence" "O'Connor" "Amaroo" "Bonner" "Casey" "Crace" "Forde" "Ngunnawal" "Palmerston" "Denman Prospect" "Bonython" "Tuggeranong Town Centre" "Isabella Plains" "Kambah" "Kambah Village Centre" "Chapman" "Duffy" "Curtin" "Curtin Centre" "Farrer" "Isaacs" "Mawson" 


- This cluster contains only Canberra neighborhoods (36) and has no equivalent in São Paulo.

##### Cluster 3 <a name='cluster_3'></a>

In [898]:
show_cluster_map([3])

In [884]:
show_cluster(3,2)

Neighborhood count by city:
City
cb    39
sp    89
dtype: int64

Most common venue categories:
    City 1st Most Common Venue  Counts
0     sp                Bakery      15
34    sp  Brazilian Restaurant      10
114   cb  Fast Food Restaurant       4
116   cb           Sports Club       4

    City 2nd Most Common Venue  Counts
2     sp  Brazilian Restaurant       7
11    sp                   Bar       7
116   cb  Fast Food Restaurant       6
103   cb                  Park       3

    City 3rd Most Common Venue  Counts
1     sp           Pizza Place      13
15    sp            Restaurant       7
103   cb           Supermarket       4
114   cb           Gas Station       4

Neighborhood list
SP neighborhoods: "Aricanduva" "Carrão" "Vila Formosa" "Butantã" "Raposo Tavares" "Rio Pequeno" "Vila Sônia" "Campo Limpo" "Capão Redondo" "Vila Andrade" "Cidade Dutra" "Socorro" "Casa Verde" "Limão" "Cidade Ademar" "Pedreira" "Cidade Tiradentes" "Ermelino Matarazzo" "Ponte Rasa" "Brasilândia" "Fre

- This is the most populous cluster in São Paulo (89 neighborhoods) and the one with the best match in Canberra (39 neighborhoods).

##### Cluster 4 <a name='cluster_4'></a>

In [899]:
show_cluster_map([4])

In [889]:
show_cluster(4,2)

Neighborhood count by city:
City
cb    1
sp    1
dtype: int64

Most common venue categories:
    City 1st Most Common Venue  Counts
14    sp                Market       1
233   cb                Market       1

    City 2nd Most Common Venue  Counts
14    sp    Athletics & Sports       1
233   cb            Acai House       1

    City 3rd Most Common Venue  Counts
14    sp          Optical Shop       1
233   cb     Paella Restaurant       1

Neighborhood list
SP neighborhoods: "Cachoeirinha" 
CB neighborhoods: "Hall" 


## Results and Discussion <a name="results"></a>

Although both are major cities, São Paulo and Canberra are very different. Clusters 1 and 2 contain only Canberra neighborhoods and have no equivalent in São Paulo. In contrast, every cluster that contains São Paulo neighborhoods also contains Canberra neighborhoods.

Clusters with many neighborhoods in Canberra have few in São Paulo, and vice versa. Only cluster 3 has a significant number of neighborhoods in both cities (89 in São Paulo and 39 in Canberra).
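
To make this imbalance easier to see, the per-cluster counts can be put side by side. The sketch below copies the counts from the cluster outputs above into a small table. The `shares` and `shared` names are introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Neighborhood counts per cluster and city, copied from the cluster outputs above.
counts = pd.DataFrame(
    {"cb": [60, 1, 36, 39, 1], "sp": [4, 0, 0, 89, 1]},
    index=pd.Index([0, 1, 2, 3, 4], name="Cluster"),
)

# Share (in percent) of each city's clustered neighborhoods that falls in each cluster.
shares = (counts / counts.sum() * 100).round(1)

# A cluster is "shared" when both cities contribute at least one neighborhood.
counts["shared"] = (counts["cb"] > 0) & (counts["sp"] > 0)

print(counts)
print(shares)
```

With these counts, cluster 3 holds 89 of the 94 clustered São Paulo neighborhoods, while Canberra's 137 clustered neighborhoods are spread mainly over clusters 0, 2, and 3.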

The exchangeable neighborhoods, by cluster, are:

- Cluster 0
  - SP neighborhoods: "Morumbi" "Tremembé" "Jaguara" "Cangaíba" 
  - CB neighborhoods: "Aranda" "Bruce" "Cook" "Lawson" "Jamison Centre" "Page" "Acton" "Ainslie" "Barton" "Braddon" "Campbell" "Duntroon" "Capital Hill" "City" "Deakin" "Dickson" "Dickson Centre" "Downer" "Forrest" "Fyshwick" "Griffith" "Manuka" "Hackett" "Kingston" "Lyneham" "North Lyneham" "Parkes" "Red Hill" "Reid" "Russell" "Turner" "Watson" "Yarralumla" "Jacka" "Kenny" "Kinlyside" "Nicholls" "Canberra Airport" "Pialligo" "Chisholm" "Fadden" "Gilmore" "Macarthur" "Holder" "Rivett" "Stirling" "Waramanga" "Weston" "Weston Creek Centre" "Chifley" "Garran" "Hughes" "Lyons" "Southlands Centre" "O'Malley" "Pearce" "Phillip" "Woden Town Centre" "Swinger Hill" "Torrens"  
- Cluster 3
  - SP neighborhoods: "Aricanduva" "Carrão" "Vila Formosa" "Butantã" "Raposo Tavares" "Rio Pequeno" "Vila Sônia" "Campo Limpo" "Capão Redondo" "Vila Andrade" "Cidade Dutra" "Socorro" "Casa Verde" "Limão" "Cidade Ademar" "Pedreira" "Cidade Tiradentes" "Ermelino Matarazzo" "Ponte Rasa" "Brasilândia" "Freguesia do Ó" "Lajeado" "Guaianases" "Cursino" "Ipiranga" "Sacomã" "Itaim Paulista" "Vila Curuçá" "Cidade Líder" "Itaquera" "José Bonifácio" "Parque do Carmo" "Jabaquara" "Jaçanã" "Barra Funda" "Jaguaré" "Lapa" "Perdizes" "Vila Leopoldina" "Jardim Ângela" "Jardim São Luís" "Água Rasa" "Belém" "Brás" "Mooca" "Pari" "Tatuapé" "Parelheiros" "Artur Alvim" "Penha" "Vila Matilde" "Anhanguera" "Perus" "Alto de Pinheiros" "Itaim Bibi" "Jardim Paulista" "Pinheiros" "Jaraguá" "Pirituba" "São Domingos" "Mandaqui" "Santana" "Tucuruvi" "Campo Belo" "Campo Grande" "Santo Amaro" "Iguatemi" "São Rafael" "São Mateus" "São Miguel" "Jardim Helena" "Vila Jacuí" "Sapopemba" "Bela Vista" "Bom Retiro" "Cambuci" "Consolação" "Liberdade" "República" "Santa Cecília" "Sé" "Vila Guilherme" "Vila Maria" "Vila Medeiros" "Moema" "Saúde" "Vila Mariana" "São Lucas" "Vila Prudente" 
  - CB neighborhoods: "Emu Ridge" "Dunlop" "Fraser" "Latham" "Macgregor" "Macquarie" "Melba" "Strathnairn" "Weetangera" "Narrabundah" "Franklin" "Gungahlin" "Gungahlin Town Centre" "Harrison" "Mitchell" "Moncrieff" "Throsby" "Beard" "Hume" "Oaks Estate" "Symonston" "Coombs" "Molonglo" "Sulman" "Wright" "Banks" "Calwell" "Conder" "Gordon" "Gowrie" "Greenway" "Monash" "Oxley" "Richardson" "Theodore" "Wanniassa" "Erindale Centre" "Fisher" "Harman
- Cluster 4
  - SP neighborhoods: "Cachoeirinha" 
  - CB neighborhoods: "Hall"
  
  
Neighborhoods with no match in the other city:
- Cluster 1
  - CB neighborhoods: "Tharwa" 
- Cluster 2
  - CB neighborhoods: "Belconnen" "Belconnen Town Centre" "Charnwood" "Evatt" "Florey" "Flynn" "Giralang" "Hawker" "Higgins" "Holt" "Kippax Centre" "Kaleen" "McKellar" "Scullin" "Spence" "O'Connor" "Amaroo" "Bonner" "Casey" "Crace" "Forde" "Ngunnawal" "Palmerston" "Denman Prospect" "Bonython" "Tuggeranong Town Centre" "Isabella Plains" "Kambah" "Kambah Village Centre" "Chapman" "Duffy" "Curtin" "Curtin Centre" "Farrer" "Isaacs" "Mawson" 
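
The two lists above can be derived directly from the cluster labels instead of being assembled by hand. The following is a minimal sketch on a six-row excerpt of the assignments listed above. The column name `Cluster` and the `neighborhoods_by_city` helper are assumptions made for this example, not names from the notebook.

```python
import pandas as pd

# Small excerpt of the cluster assignments listed above.
# The column names are assumptions for this sketch.
labeled = pd.DataFrame({
    "City": ["sp", "cb", "cb", "cb", "sp", "cb"],
    "Neighborhood": ["Cachoeirinha", "Hall", "Tharwa",
                     "Belconnen", "Morumbi", "Aranda"],
    "Cluster": [4, 4, 1, 2, 0, 0],
})

# Number of distinct cities present in each cluster.
cities_per_cluster = labeled.groupby("Cluster")["City"].nunique()

# Clusters with both cities are exchangeable; the rest have no match.
matched = sorted(cities_per_cluster[cities_per_cluster == 2].index)
unmatched = sorted(cities_per_cluster[cities_per_cluster < 2].index)


def neighborhoods_by_city(df, clusters):
    """Map cluster -> city -> list of neighborhood names."""
    subset = df[df["Cluster"].isin(clusters)]
    grouped = subset.groupby(["Cluster", "City"])["Neighborhood"].apply(list)
    result = {}
    for (cluster, city), names in grouped.items():
        result.setdefault(int(cluster), {})[city] = names
    return result


print("Exchangeable:", neighborhoods_by_city(labeled, matched))
print("No match:", neighborhoods_by_city(labeled, unmatched))
```

Run on the full labeled data frame, the same grouping should reproduce the cluster lists in this section, provided the column names are adjusted to match the notebook.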

## Conclusion <a name="conclusion"></a>

Most people who want to move from São Paulo will find a suitable neighborhood in Canberra. The reverse does not hold to the same extent, because many Canberra neighborhoods fall into clusters with few or no São Paulo counterparts.