# Capstone Project - The Battle of the Neighborhoods

## Table of contents
* [Introduction](#introduction)
* [Data](#data)

## Introduction

The objective is to compare neighborhoods of São Paulo, Brazil with those of Canberra, Australia, to provide information to people who want to move from one city to the other.

To that end, the neighborhoods of both cities will be grouped by their similarities and the characteristics of their main venues.

## Data

Data sources:
- Boroughs and neighborhoods
  - São Paulo: https://www.prefeitura.sp.gov.br/cidade/secretarias/subprefeituras/subprefeituras/dados_demograficos/index.php
  - Canberra: https://en.wikipedia.org/wiki/List_of_Canberra_suburbs
- Geolocation of the neighborhoods: Nominatim, via geopy
- Trending venue data: Foursquare API

After mapping the main venues, I will cluster the neighborhoods using KMeans, combining both cities in a single dataset so that neighborhoods from either city can occupy the same clusters.
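The clustering step described above can be sketched as follows. This is only an illustration under stated assumptions: the `venues` table and its venue categories are made up for the example (they are not Foursquare results), and the choice of two clusters is arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue records: one row per (city, neighborhood, venue category).
# The categories are invented for illustration only.
venues = pd.DataFrame({
    'City': ['São Paulo'] * 6 + ['Canberra'] * 6,
    'Neighborhood': ['Butantã'] * 3 + ['Morumbi'] * 3 + ['Bruce'] * 3 + ['Aranda'] * 3,
    'Category': ['Cafe', 'Cafe', 'Park', 'Bar', 'Bar', 'Gym',
                 'Cafe', 'Park', 'Park', 'Bar', 'Gym', 'Gym'],
})

# One-hot encode the categories, then average per neighborhood to get
# the relative frequency of each category in that neighborhood.
onehot = pd.get_dummies(venues['Category'], dtype=float)
profile = (
    pd.concat([venues[['City', 'Neighborhood']], onehot], axis=1)
    .groupby(['City', 'Neighborhood'])
    .mean()
)

# Fit KMeans on both cities at once so clusters can span cities.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profile.values)
profile['Cluster'] = kmeans.labels_
print(profile)
```

Because both cities share one feature matrix, a neighborhood in São Paulo and one in Canberra with similar venue profiles receive the same cluster label.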

In [181]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pgeocode
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

geolocator = Nominatim(user_agent="battlen")

Import the São Paulo data and fill in the geolocation

In [178]:
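from io import StringIO

page_sp = requests.get('https://www.prefeitura.sp.gov.br/cidade/secretarias/subprefeituras/subprefeituras/dados_demograficos/index.php')
soup_sp = BeautifulSoup(page_sp.text, 'html.parser')

# Parse the first table on the page; wrapping the HTML in StringIO avoids the
# deprecation warning newer pandas versions raise for literal HTML strings
df_sp = pd.read_html(StringIO(str(soup_sp.find('table'))))[0]

# Drop the TOTAL row and rows without a district name
df_sp = df_sp[(df_sp['Distritos'] != 'TOTAL') & df_sp['Distritos'].notnull()]
# Remove columns 2 to 4, leaving only the borough and district columns
df_sp = df_sp.drop(columns=df_sp.columns[[2, 3, 4]])
df_sp.columns = ['Borough', 'Neighborhood']

# Geocode each neighborhood; failures (including a None result from
# geocode) are printed and leave the coordinates as NaN
for index, row in df_sp.iterrows():
    print('.', end='')
    try:
        location = geolocator.geocode('{}, Sao Paulo, Brazil'.format(row['Neighborhood']))
        df_sp.at[index, 'Latitude'] = location.latitude
        df_sp.at[index, 'Longitude'] = location.longitude
    except Exception as e:
        print('***', e)

df_sp.head()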

................................................................................................

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Aricanduva | Aricanduva | -23.578024 | -46.511454 |
| 1 | Aricanduva | Carrão | -23.55153 | -46.537791 |
| 2 | Aricanduva | Vila Formosa | -23.566876 | -46.546323 |
| 4 | Butantã | Butantã | -23.569056 | -46.721883 |
| 5 | Butantã | Morumbi | -23.596499 | -46.717845 |


In [179]:
print("Checking NA values:")
print(df_sp.isna().sum())

Checking NA values:
Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64
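The geocoding loop above relies on a broad `except` to catch the case where no match is found. A minimal sketch of a more explicit variant follows. The helper `geocode_places`, its parameters, and the stand-in `fake_geocode` are hypothetical names introduced for illustration; the pause between requests reflects the public Nominatim usage policy of at most one request per second.

```python
import time
from collections import namedtuple


def geocode_places(names, geocode, suffix, delay_seconds=1.0):
    """Map each name to (latitude, longitude), or None when nothing is found.

    `geocode` is any callable that takes a query string and returns an object
    with `latitude` and `longitude` attributes, or None (hypothetical helper).
    """
    results = {}
    for name in names:
        location = geocode('{}, {}'.format(name, suffix))
        if location is None:
            results[name] = None  # no match: record it explicitly
        else:
            results[name] = (location.latitude, location.longitude)
        time.sleep(delay_seconds)  # stay under the service's rate limit
    return results


# Stand-in geocoder so the sketch runs without network access
Location = namedtuple('Location', ['latitude', 'longitude'])


def fake_geocode(query):
    if query.startswith('Butantã'):
        return Location(-23.569056, -46.721883)
    return None


coords = geocode_places(['Butantã', 'Nowhere'], fake_geocode,
                        'Sao Paulo, Brazil', delay_seconds=0)
print(coords)
```

With geopy, the same helper could be called as `geocode_places(df_sp['Neighborhood'], geolocator.geocode, 'Sao Paulo, Brazil')`, and the names mapped to `None` would show which neighborhoods need a different query.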


In [180]:
page_cb = requests.get('https://en.wikipedia.org/wiki/List_of_Canberra_suburbs')
soup_cb = BeautifulSoup(page_cb.text, 'html.parser')

# Skip the first <h2>; the district sections start at the second one
heading = soup_cb.find('h2').find_next('h2')

extracted_data = []

while heading is not None:
    district = heading.find('span')
    if district is None:
        break  # a heading without a <span> title ends the scan

    suburb_list = district.find_next('ul')
    if suburb_list is not None:
        # The section title is either plain text or wrapped in a link
        district_name = district.a.string if district.find('a') else district.string

        if district_name not in ('References', 'External links'):
            for link in suburb_list.find_all('a'):
                extracted_data.append({'Borough': district_name,
                                       'Neighborhood': link.string})

    heading = district.find_next('h2')

df_cb = pd.DataFrame(extracted_data)

# Geocode each suburb within the Australian Capital Territory; failures
# (including a None result from geocode) are printed and leave NaN
for index, row in df_cb.iterrows():
    print('.', end='')
    try:
        s = '{}, Australian Capital Territory, Australia'.format(row['Neighborhood'])
        location = geolocator.geocode(s)
        df_cb.at[index, 'Latitude'] = location.latitude
        df_cb.at[index, 'Longitude'] = location.longitude
    except Exception as e:
        print('***', row['Neighborhood'], row['Borough'], e)

df_cb.head()

...............................................................................................................................................

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Belconnen | Aranda | -35.258055 | 149.080426 |
| 1 | Belconnen | Belconnen | -35.227434 | 149.043145 |
| 2 | Belconnen | Belconnen Town Centre | -35.227434 | 149.043145 |
| 3 | Belconnen | Emu Ridge | -35.235379 | 149.066002 |
| 4 | Belconnen | Bruce | -35.245352 | 149.091633 |


In [177]:
print("Checking NA values:")
df_cb.isna().sum()

Checking NA values:


Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64
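The NA check does not reveal neighborhoods that were geocoded to the same point. In the Canberra preview, Belconnen and Belconnen Town Centre share identical coordinates. A minimal sketch of how such cases might be listed is shown below; it uses a small `sample` frame copied from the preview rather than the full `df_cb`, and the variable names are introduced only for this example.

```python
import pandas as pd

# Rows copied from the Canberra preview output
sample = pd.DataFrame({
    'Neighborhood': ['Aranda', 'Belconnen', 'Belconnen Town Centre',
                     'Emu Ridge', 'Bruce'],
    'Latitude': [-35.258055, -35.227434, -35.227434, -35.235379, -35.245352],
    'Longitude': [149.080426, 149.043145, 149.043145, 149.066002, 149.091633],
})

# keep=False flags every member of a duplicated group, not only the repeats
dupes = sample[sample.duplicated(subset=['Latitude', 'Longitude'], keep=False)]
print(dupes)
```

Neighborhoods that share a point would later receive identical venue data, so it may be worth deciding whether to merge or drop them before clustering.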