# Finding Similar Cities in São Paulo State of Brazil
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
1. [Introduction: Business Problem](#introduction)
2. [Data](#data)\
    2.1. [Data Scraping](#scraping)\
    2.2. [Geographical Position from a Python Module](#position) \
    2.3. [Data Retrieving from API](#retrieving) 


## 1. Introduction: Business Problem <a name="introduction"></a>

In this project, we will try to find similar cities in the state of **São Paulo (Brazil)**. This report can be useful to stakeholders own an establishment and wants to expand it to other similar cities that still have few or no venues of the same category, or to someone else who needs to move to another city, but keeping accessibility to venues available at his current city.

It is common, at least here in Brazil, to have every kind of establishments in big cities, but miss some categories of establishments in smaller cities, so, they could be great opportunities to endeavor. Thus, the goal of this project is to **detect similar cities in terms of their population and their most common venue categories**.

Using data science powers, we will segment and cluster cities based on the criteria above. The similar cities will be shown on maps, and the features that distinguish each cluster will then be clearly expressed through distribution plots so the stakeholders can easily aim the right city(ies).

## 2. Data <a name="data"></a>

Based on definition of our problem, factors that will influence our decission are:
* list of citiesin SP state;
* number of habitants of the city;
* each city's geographical position in latitude and longitude;
* number of existing venue categories in the city.

Following, data sources will be needed to extract/generate the required information:
* the list of all cities within state of São Paulo (Brazil) will be scraped from [**Wikipedia**](https://pt.wikipedia.org/wiki/Lista_de_mesorregi%C3%B5es_e_microrregi%C3%B5es_de_S%C3%A3o_Paulo);
* the population of each city will be scraped from [**Wikipedia**](https://pt.wikipedia.org/wiki/Lista_de_munic%C3%ADpios_do_Brasil_por_popula%C3%A7%C3%A3o);
* the geographical position of each city will be obtained using the module [**GeoPy**](https://geopy.readthedocs.io/en/stable/).
* the venues and their categories will be obtained using [**Foursquare API**](https://developer.foursquare.com/).


### 2.1. Data scraping and cleaning <a name="scraping"></a>

First of all, we will scrape list of all cities in São Paulo (SP) state available in [**Wikipedia**](https://pt.wikipedia.org/wiki/Lista_de_mesorregi%C3%B5es_e_microrregi%C3%B5es_de_S%C3%A3o_Paulo).

In [1]:
# Importing necessary modules
# -*- coding: utf-8 -*-
import sys
import pandas as pd
#from bs4 import BeautifulSoup
import numpy as np
import requests
import json
import folium  # map rendering module
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

coding = sys.stdout.encoding

In [4]:
# The URL which hospedates all information of SP cities.
url = 'https://pt.wikipedia.org/wiki/Lista_de_mesorregi%C3%B5es_e_microrregi%C3%B5es_de_S%C3%A3o_Paulo'
# Read all tables within site.
table_list = pd.read_html(url)

# Mesoregions of SP state are on first table of the site
mesoregions_df = table_list[0]

mesoregions_df.head(3)

Unnamed: 0,Mesorregião[1][2],Código,Número de municípios,Localização,Microrregiões,Código.1
0,São José do Rio Preto,1,109,,Jales,1
1,São José do Rio Preto,1,109,,Fernandópolis,2
2,São José do Rio Preto,1,109,,Votuporanga,3


On the mesoregion table, there are some irrelevant data: "Código", "Número de municípios", "Loalização", "Código.1" (from portuguese means "Code", "Number of cities", "Localization", "Code.1").
After collecting all tables, we will clean these unecessary data from this table.

In [5]:
# Microregions of SP state are on all the rest of tables of the site
microregions_df = table_list[1]
microregions_len = microregions_df.shape[0]
for i in range(2,len(table_list)-1):
    microregions_df = microregions_df.append(table_list[i], ignore_index=True)
    microregions_len += table_list[i].shape[0]

microregions_df.head(3)

Unnamed: 0,Microrregião[1][2],Código,Localização,Municípios
0,Jales,1,,Aparecida d'Oeste
1,Jales,1,,Aspásia
2,Jales,1,,Dirce Reis


Like the mesoregion table, there are irrelevant data for us on Microregion table: "Código", "Localização" (from portuguese means "Code", "Localization").

Now we have collected the raw data, **let's clean them**.

In [6]:
# Droping unecessary features from Mesoregion dataframe.
mesoregions_df.drop(columns=['Código', 'Número de municípios', 'Localização', 'Código.1'], inplace=True)
# Renaming columns with English terms.
mesoregions_df.columns = ['Mesoregion', 'Microregion']
mesoregions_df.head(3)

Unnamed: 0,Mesoregion,Microregion
0,São José do Rio Preto,Jales
1,São José do Rio Preto,Fernandópolis
2,São José do Rio Preto,Votuporanga


In [7]:
# Droping unecessary features from Microregions dataframe.
microregions_df.drop(columns=['Código', 'Localização'], inplace=True)
# Renaming columns with English terms.
microregions_df.columns = ['Microregion', 'City']
microregions_df.head(3)

Unnamed: 0,Microregion,City
0,Jales,Aparecida d'Oeste
1,Jales,Aspásia
2,Jales,Dirce Reis


Now, let's merge both dataframes based on "Microregion" feature.

In [8]:
# Merging mesoregions with microregions into one dataframe.
cities_df = mesoregions_df.merge(microregions_df, left_on='Microregion', right_on='Microregion')
print("Shape of cities_df: ", cities_df.shape)
cities_df.head(3)

Shape of cities_df:  (645, 3)


Unnamed: 0,Mesoregion,Microregion,City
0,São José do Rio Preto,Jales,Aparecida d'Oeste
1,São José do Rio Preto,Jales,Aspásia
2,São José do Rio Preto,Jales,Dirce Reis


Now we have the dataframe containing all regions and cities of SP state, **let's collect the population of each city** from [**Wikipedia**](https://pt.wikipedia.org/wiki/Lista_de_munic%C3%ADpios_do_Brasil_por_popula%C3%A7%C3%A3o).

In [9]:
# The Wikipedia URL with population information of SP state.
population_url = 'https://pt.wikipedia.org/wiki/Lista_de_munic%C3%ADpios_do_Brasil_por_popula%C3%A7%C3%A3o'
# Collect population data from table within site.
population_list = pd.read_html(population_url)
population_df = population_list[0]

population_df.head(3)

Unnamed: 0,Posição,Código IBGE,Município,Unidade federativa,População
0,1º,3550308,São Paulo,São Paulo,12252023
1,2º,3304557,Rio de Janeiro,Rio de Janeiro,6718903
2,3º,5300108,Brasília,Distrito Federal,3015268


The raw dataframe collected contains some irrelevant features for our study case: "Posição", "Código IBGE", "Unidade federativa" (from portuguese they mean "Ranking", "IBGE Code", "Federative Unit"). 
Also, this table contains population from all Brazilian cities, so we need to filter by "Unidade federativa", which is the state in which each city belongs to, considering only São Paulo.

Thus, **let's get rid of these unecessary instances and columns**, and rename the relevant feature in English terms.

In [10]:
# Drop all instance are note from SP state.
popsp_df = population_df[population_df['Unidade federativa'] == 'São Paulo'].iloc[:,2:].reset_index(drop=True)
# Drop all unecessary features.
popsp_df.drop(columns=['Unidade federativa'], inplace=True)
# Rename columns with English terms.
popsp_df.columns = ['City', 'Population']
popsp_df.head(3)

Unnamed: 0,City,Population
0,São Paulo,12252023
1,Guarulhos,1379182
2,Campinas,1204073


Finally, **let's merge the region-cities and cities-population dataframes** into one dataframe.

In [11]:
# merging city geographical information with population
spcities_df = cities_df.merge(popsp_df, left_on='City', right_on='City')
spcities_df.head(3)

Unnamed: 0,Mesoregion,Microregion,City,Population
0,São José do Rio Preto,Jales,Aparecida d'Oeste,4196
1,São José do Rio Preto,Jales,Aspásia,1822
2,São José do Rio Preto,Jales,Dirce Reis,1793


### 2.2. Geographical Position from a Python Module <a name="position"></a>

Now we have all SP state's cities and their respective population, **let's collect their respective geographical position** using the **module** [**GeoPy**](https://geopy.readthedocs.io/en/stable/).

With this module, it is possible to easily retrieve the geographical location given an address or, in this case, the city name and its state and country.

In [12]:
# Importing Nominatim
from geopy.geocoders import Nominatim

In [13]:
# Define agent to request position coordinates.
geolocator = Nominatim(user_agent='it_is_me')

# Retrieving each city's geographical coordinates.
longitude = []
latitude = []
i=0
for city in spcities_df['City']:
    city = city + ', SP'
    
    location = geolocator.geocode(city)
    longitude.append(location.longitude)
    latitude.append(location.latitude)
    i+=1
    # print("{}/{}".format(i, spcities_df.shape[0]))

# Adding the obtained coordinates into sp_cities dataframe
spcities_df['Longitude'] = longitude
spcities_df['Latitude'] = latitude

print("Shape of spcities_df: ", spcities_df.shape)
spcities_df.head(3)

Shape of spcities_df:  (643, 6)


Unnamed: 0,Mesoregion,Microregion,City,Population,Longitude,Latitude
0,São José do Rio Preto,Jales,Aparecida d'Oeste,4196,-50.880871,-20.449811
1,São José do Rio Preto,Jales,Aspásia,1822,-50.728046,-20.160028
2,São José do Rio Preto,Jales,Dirce Reis,1793,-50.606276,-20.466407


Let's **visualize these cities on map**, using the [Folium](https://python-visualization.github.io/folium/) module.

In [14]:
# Collecting state of São Paulo coordinates
geolocator = Nominatim(user_agent='it_is_me')
saopaulo = geolocator.geocode('São Paulo, Brazil')
# adjusting for better visualization
latitude, longitude = (saopaulo.latitude+1, saopaulo.longitude-1)

# Create map of São Paulo using latitude and longitude values
map_sp = folium.Map(location=[latitude, longitude], zoom_start=7)

# add markers to map
max_population = spcities_df['Population'].max()
for lat, lng, population, city in zip(spcities_df['Latitude'], spcities_df['Longitude'], spcities_df['Population'], spcities_df['City']):
    label = '{}, {}'.format(city, population)
    label = folium.Popup(label, parse_html=True, max_width=100)
    if(city != 'São Paulo'):
        rad = 6*(population/1370000)
    else:
        rad = 7    
    folium.CircleMarker(
        [lat, lng],
        radius=rad,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sp)  
    
map_sp

### 2.3. Data Retrieving from API <a name="retrieving"></a>

Next, with help of [**Foursquare API**](https://developer.foursquare.com/), let's explore and segment the cities in SP state!

Basically, **Foursquare** provide an API to easily explore venues, given a geographical location (latitude and longitude) and a radius, among other interesting features.

Before start using the API, it is necessary to sign up to Foursquare to get the necessary credentials. All steps can be read on Foursquare [_Get Started_](https://developer.foursquare.com/docs/places-api/getting-started/) tutorial.

In [17]:
# Declare your credentials to request data from Foursquare
CLIENT_ID = '1NOSQ24FIWKBZDPIOGK5TKHLVFIWZF4CAUWD0E2BY22IXVML' # your Foursquare ID
CLIENT_SECRET = '5EXD4G4A5OVVGTGSEQFUPPHYQPCYAMTKGIWXRJ05VKHNDTUI' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
ACCESS_TOKEN = 'BN2N5315OFVCV3QM2AV5TQH5ICC3CKJQMKUFB1SNKHPLZBRV' # your access token

When exploring venues with Foursquare API, the raw data retrieved is given in the JSON standard. So, to make it faster and get a pandas dataframe with the willed data, **let's define a function to parse the raw data to return a dataframe with the available venues**, given the city, its geographical coordinates, radius of research and the limit of venues to be retrieved.

In [18]:
# Function to request the list of venues from Foursquare API, and parse the json raw data into a pandas dataframe.
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    counter = 1
    max_city = len(names)
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(counter, '/', max_city, " ", name)
        counter+=1
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        try:
            # make the GET request
            results = requests.get(url).json()["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
            
            #print(name)
        except:
            venues_list.append([
                name, 
                lat, 
                lng, 
                '', 
                np.nan, 
                np.nan,  
                ''])
            
            #print("No venue found at " + name)

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Now, we can just call this function, giving the list (in pandas Series) with all cities, their respective coordinates, the radius of exploration (in this case 5 km), and limit of venues to be retrieved (in this case, set to 200).

>P.S.: The radius and limit values were defined empiracally. Using distance measurement tool from Google Earth, I estimated the average radius of cities in SP state, measured from the center of the city.

In [None]:
saopaulo_venues = getNearbyVenues(names=spcities_df['City'],
                                 latitudes=spcities_df['Latitude'],
                                 longitudes=spcities_df['Longitude'],
                                 radius=5000,
                                 LIMIT=200)

In [23]:
#saopaulo_venues.to_csv('1_saopaulo_venues.csv')
#saopaulo_venues = pd.read_csv('1_saopaulo_venues.csv', index_col=0)
print("Shape of saopaulo_venues: ", saopaulo_venues.shape)
saopaulo_venues.head(3)

Shape of saopaulo_venues:  (14757, 7)


Unnamed: 0,City,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Aparecida d'Oeste,-20.449811,-50.880871,Farmácia do Pedro,-20.451159,-50.881847,Pharmacy
1,Aparecida d'Oeste,-20.449811,-50.880871,Bar do Fabião,-20.45006,-50.886469,African Restaurant
2,Dirce Reis,-20.466407,-50.606276,Padaria Doce Pao,-20.46483,-50.606962,Bakery


Let's see how many unique venue categories there are in this retrieved data.

In [24]:
print("There are {} different categories!".format(len(saopaulo_venues['Venue Category'].unique())))

There are 392 different categories!
