# Capstone Project - Location of a Pet Store in São Paulo

### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem

In this project we will try to find an optimal location for a pet store in the city of São Paulo, the capital of São Paulo state. This report is aimed at stakeholders interested in opening a pet store in São Paulo, Brazil.

Since São Paulo already has many pet stores, we will look for locations that are not already crowded with competitors.

We will use data science techniques to shortlist the most promising neighborhoods based on this criterion. The advantages of each area will then be clearly stated so that stakeholders can choose the best possible final location.

## Data

Based on the definition of our problem, the factors that will influence our decision are the number of existing pet stores in each neighborhood and the neighborhood's population.

I decided to use the list of neighborhoods in São Paulo published by the city hall at https://www.prefeitura.sp.gov.br/cidade/secretarias/subprefeituras/subprefeituras/dados_demograficos/index.php?p=12758 and obtained the location of each neighborhood using the arcgis method from geocoder.

The number of pet stores in every neighborhood, and their locations, will be obtained using the Foursquare API.

The coordinates of the center of São Paulo will be obtained using Nominatim from geopy.

First, let's import the required libraries:

In [7]:
!pip install bs4
from bs4 import BeautifulSoup

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!pip install geocoder 
import geocoder # import geocoder

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# install folium (skip this line if it is already installed)
!pip install folium
# map rendering library
import folium

print('Libraries imported.')

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
Libraries imported.


### Import Data from São Paulo City Hall

In [8]:
# using beautiful soup to import data from the city hall site
url='https://www.prefeitura.sp.gov.br/cidade/secretarias/subprefeituras/subprefeituras/dados_demograficos/index.php?p=12758'
data=requests.get(url).text
soup=BeautifulSoup(data,'html.parser')


In [9]:
table=soup.find('table')
neighborhood_rows=[] # one dict per neighborhood
population_rows=[] # one dict per neighborhood, with its population
for row in table.tbody.find_all('tr'):
    col=row.find_all('td')
    if not col:
        continue # skip rows without data cells
    # rows with 5 cells start with the subprefeitura name, which shifts the other cells by one
    offset=1 if len(col)==5 else 0
    neighborhood=col[offset].text.strip()
    population=col[offset+2].text.strip()
    if neighborhood=='TOTAL':
        continue # skip subtotal rows
    neighborhood_rows.append({'Neighborhood':neighborhood})
    population_rows.append({'Neighborhood':neighborhood,'Population':population})

# build each dataframe once; DataFrame.append was removed in pandas 2.0
neighborhoods=pd.DataFrame(neighborhood_rows,columns=['Neighborhood'])
population_data=pd.DataFrame(population_rows,columns=['Neighborhood','Population'])


In [10]:
neighborhoods.head() # examine the dataframe

| | Neighborhood |
|---|---|
| 0 | Aricanduva |
| 1 | Carrão |
| 2 | Vila Formosa |
| 3 | Butantã |
| 4 | Morumbi |


In this project we will be examining 96 neighborhoods.

In [11]:
neighborhoods.shape

(96, 1)

In [12]:
population_data.head() #examine the dataframe

| | Neighborhood | Population |
|---|---|---|
| 0 | Aricanduva | 89.622 |
| 1 | Carrão | 83.281 |
| 2 | Vila Formosa | 94.799 |
| 3 | Butantã | 54.196 |
| 4 | Morumbi | 46.957 |
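
The population values above are stored as text and appear to use the Brazilian convention of a dot as the thousands separator, so 89.622 would mean 89,622 inhabitants rather than a decimal number. Under that assumption, a minimal sketch of a conversion helper could look like this (parse_br_int is a hypothetical name, not part of the original notebook):

```python
def parse_br_int(text):
    """Convert a Brazilian-formatted integer string such as '89.622' to an int.

    Assumption: the dot is a thousands separator and the value has no decimal part.
    """
    cleaned = text.strip().replace('.', '')
    if not cleaned.isdigit():
        raise ValueError('unexpected population value: {!r}'.format(text))
    return int(cleaned)


# values copied from the first rows of population_data shown above
sample = ['89.622', '83.281', '94.799']
print([parse_br_int(value) for value in sample])
```

If the assumption holds, the whole column could be converted with population_data['Population'].map(parse_br_int) before population is used in any numeric comparison.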


In [15]:
# using arcgis to find the latitude and longitude of each neighborhood
rows=[] # one dict per neighborhood
for neighborhood in neighborhoods['Neighborhood']:
    address = '{}, São Paulo, São Paulo, Brasil'.format(neighborhood)
    g = geocoder.arcgis(address)
    while g.latlng is None: # retry until the request succeeds (e.g. after a read timeout)
        g = geocoder.arcgis(address)
        print(address, g.latlng)
    lat, lng = g.latlng
    rows.append({'Neighborhood':neighborhood,'Latitude':lat,'Longitude':lng})

# build the dataframe once; DataFrame.append was removed in pandas 2.0
sp_data=pd.DataFrame(rows,columns=['Neighborhood','Latitude','Longitude'])

Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)


Cidade Tiradentes, São Paulo, São Paulo, Brasil [-23.60120999999998, -46.39875999999998]


Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)


Jabaquara, São Paulo, São Paulo, Brasil [-23.637179999999944, -46.64614999999998]


Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)


Jaguaré, São Paulo, São Paulo, Brasil [-23.54207999999994, -46.747919999999965]


Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)


Vila Matilde, São Paulo, São Paulo, Brasil [-23.53782999999993, -46.52613999999994]


Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Read timed out. (read timeout=5.0)


São Miguel, São Paulo, São Paulo, Brasil [-23.511289999999974, -46.437209999999936]
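
The output above shows the ArcGIS service timing out and the loop simply trying again. Because that loop never gives up, a persistent outage would hang the notebook. One possible alternative is a bounded retry with a growing pause between attempts. The sketch below is an illustration only: retry_until_result, max_attempts and base_delay are hypothetical names, the lookup function is passed in so the helper does not depend on any particular geocoder, and the coordinate pair returned by the stand-in lookup is a placeholder value.

```python
import time


def retry_until_result(lookup, address, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call lookup(address) until it returns a non-None value or attempts run out.

    The pause doubles after every failed attempt (base_delay, 2*base_delay, ...).
    Returns None if every attempt fails, so the caller decides how to handle it.
    """
    for attempt in range(max_attempts):
        result = lookup(address)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# stand-in for a flaky geocoder: fails twice, then returns a placeholder coordinate pair
calls = {'count': 0}


def flaky_lookup(address):
    calls['count'] += 1
    return None if calls['count'] < 3 else [-23.5, -46.6]


# the sleep function is replaced here so the example runs instantly
print(retry_until_result(flaky_lookup, 'example address', sleep=lambda seconds: None))
```

With geocoder, the lookup argument could be something like lambda a: geocoder.arcgis(a).latlng, and a None result would mark a neighborhood that still needs to be geocoded by hand.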


In [14]:
sp_data.head() # examine the dataframe

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Aricanduva | -23.56771 | -46.51025 |
| 1 | Carrão | -23.54798 | -46.53885 |
| 2 | Vila Formosa | -23.56642 | -46.5394 |
| 3 | Butantã | -23.57089 | -46.70968 |
| 4 | Morumbi | -23.601 | -46.71551 |


In [16]:
address = 'Sao Paulo, Sao Paulo'

geolocator = Nominatim(user_agent="sp_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of São Paulo are {}, {}.'.format(latitude, longitude))

The geographical coordinates of São Paulo are -23.5506507, -46.6333824.
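
With both the neighborhood coordinates and the city center available, one optional sanity check is to measure how far each geocoded point lies from the center; a neighborhood that lands hundreds of kilometers away would suggest a bad geocoding match. The sketch below uses the haversine formula with a mean Earth radius of 6371 km. The function name and the idea of running this check are additions for illustration, not part of the original analysis.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius, an approximation
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


# city center from Nominatim and the Aricanduva coordinates from sp_data.head()
center = (-23.5506507, -46.6333824)
aricanduva = (-23.56771, -46.51025)
print(round(haversine_km(*center, *aricanduva), 1))
```

The same function could be applied row by row to sp_data to list any neighborhood whose distance from the center looks implausible for a point inside the city.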


In [17]:
# create map of São Paulo using latitude and longitude values
map_sp = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(sp_data['Latitude'], sp_data['Longitude'], sp_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sp)
    
map_sp