# IBM Applied Data Science Capstone Course by Coursera
### Week 5 Final Report
**_Opening a new supermarket in Porto, Portugal_**
- Build a DataFrame of neighborhoods in Porto, Portugal by scraping the data from a Wikipedia page
- Get the geographical coordinates of the neighborhoods from a CSV file
- Obtain the venue data for the neighborhoods from the Foursquare API
- Explore and cluster the neighborhoods
- Select the best cluster to open a new supermarket
***
### 1. Import libraries

In [181]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON into a flat pandas DataFrame (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported.")

Libraries imported.


### 2. Scrape data from Wikipedia page into a DataFrame

In [182]:
# send the GET request
data = requests.get("https://pt.wikipedia.org/wiki/Lista_de_freguesias_de_Portugal").text

In [183]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [184]:
# create a list to store neighborhood data
neighborhoodList = []

In [185]:
# append the data into the list
for row in soup.find_all("div", class_="mw-parser-output")[0].find_all("li"):
    neighborhoodList.append(row.text)
    
neighborhoodList.pop(0)

'NOTA: Nesta lista não é apresentado o texto "União das Freguesias de" nas freguesias que o têm no nome oficial.'

In [186]:
# create a new DataFrame from the list
pt_df = pd.DataFrame({"Neighborhood": neighborhoodList})

pt_df.head()

Unnamed: 0,Neighborhood
0,Abrantes (São Vicente e São João) e Alferrarede
1,Aldeia do Mato e Souto
2,Alvega e Concavada
3,Bemposta
4,Carvalhal


In [None]:
# print the number of rows of the dataframe
pt_df.shape

### 3. Get the geographical coordinates

In [195]:
# import csv file
geo_df = pd.read_csv(r'C:\geofile.csv', encoding='iso-8859-1') # raw string so the backslash is not read as an escape

geo_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Aldoar Foz do Douro e Nevogilde,41.170415,-8.667161
1,Bonfim,41.15464,-8.59907
2,Campanhã,41.165753,-8.578014
3,Cedofeita Santo Ildefonso Sé Miragaia São Nico...,41.155233,-8.614337
4,Lordelo do Ouro e Massarelos,41.153699,-8.651808


In [209]:
# align the two multi-parish names with the spelling used in the CSV (no commas), then merge the two DataFrames
pt_df['Neighborhood'] = pt_df['Neighborhood'].replace(
    ['Aldoar, Foz do Douro e Nevogilde',
     'Cedofeita, Santo Ildefonso, Sé, Miragaia, São Nicolau e Vitória'],
    ['Aldoar Foz do Douro e Nevogilde',
     'Cedofeita Santo Ildefonso Sé Miragaia São Nicolau e Vitória'])

final_df = pt_df.merge(geo_df,on="Neighborhood")

final_df.drop_duplicates(inplace=True)

final_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Aldoar Foz do Douro e Nevogilde,41.170415,-8.667161
1,Bonfim,41.15464,-8.59907
2,Campanhã,41.165753,-8.578014
3,Cedofeita Santo Ildefonso Sé Miragaia São Nico...,41.155233,-8.614337
4,Lordelo do Ouro e Massarelos,41.153699,-8.651808
5,Paranhos,41.177172,-8.603603
7,Ramalde,41.172881,-8.653226
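
The `replace` call above fixes the two names that differ from the CSV only by commas. A more general option, sketched here as an assumption rather than as the notebook's method, is to normalize names on both sides before merging. `normalize_name` is a hypothetical helper, not part of pandas.

```python
def normalize_name(name: str) -> str:
    """Drop commas and collapse repeated whitespace (hypothetical helper)."""
    return " ".join(name.replace(",", " ").split())


# names as they appear on the Wikipedia page
wiki_names = [
    "Aldoar, Foz do Douro e Nevogilde",
    "Cedofeita, Santo Ildefonso, Sé, Miragaia, São Nicolau e Vitória",
    "Bonfim",
]

normalized = [normalize_name(n) for n in wiki_names]
print(normalized)
```

With pandas, the same helper could be applied with `pt_df['Neighborhood'].map(normalize_name)` before calling `merge`.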


### 4. Create a map of Porto with neighborhoods superimposed on top

In [211]:
# create map of Porto using latitude and longitude values
latitude = 41.16627304622898
longitude = -8.617834456256702

map_pt = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(final_df['Latitude'], final_df['Longitude'], final_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_pt)  
    
map_pt

### 5. Use the Foursquare API to explore the neighborhoods

In [213]:
# define Foursquare credentials and version
# the secrets are read from environment variables (the variable names are placeholders)
# so that they are not stored or printed in the notebook
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20210511' # Foursquare API version

print('Credentials loaded.')

Credentials loaded.


**Now, let's get the top 100 venues that are within a radius of 2000 meters.**

In [214]:
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(final_df['Latitude'], final_df['Longitude'], final_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
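
The loop above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so it raises an exception if a request fails or a venue has no category. The following is a minimal sketch of a more defensive parser, not the notebook's implementation. `extract_venues` is a hypothetical helper, and the sample payload is made up; it mirrors only the keys the loop already uses.

```python
def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one explore-style payload into tuples (hypothetical helper)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # fall back to an empty category when the list is missing or empty
        categories = venue.get("categories") or [{}]
        rows.append((
            neighborhood, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name"),
        ))
    return rows


# made-up payload for illustration only
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 41.16, "lng": -8.66},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 41.17, "lng": -8.65},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Bonfim", 41.15464, -8.59907)
empty = extract_venues({}, "Bonfim", 41.15464, -8.59907)
print(rows)
print(empty)
```

A venue without a category ends up with `None` in the last field, and a payload without a `response` key yields an empty list instead of a `KeyError`.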

In [215]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(700, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Aldoar Foz do Douro e Nevogilde,41.170415,-8.667161,Parque da Cidade,41.169351,-8.679156,Park
1,Aldoar Foz do Douro e Nevogilde,41.170415,-8.667161,ElDorado,41.164253,-8.662522,Portuguese Restaurant
2,Aldoar Foz do Douro e Nevogilde,41.170415,-8.667161,My Palace,41.165077,-8.672653,Brewery
3,Aldoar Foz do Douro e Nevogilde,41.170415,-8.667161,Mestre de Aviz,41.1631,-8.664427,Café
4,Aldoar Foz do Douro e Nevogilde,41.170415,-8.667161,Temaki D'lux Sushi,41.173241,-8.654995,Sushi Restaurant


**Let's check how many venues were returned for each neighborhood**

In [216]:
venues_df.groupby(["Neighborhood"]).count()

Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Aldoar Foz do Douro e Nevogilde,100,100,100,100,100,100
Bonfim,100,100,100,100,100,100
Campanhã,100,100,100,100,100,100
Cedofeita Santo Ildefonso Sé Miragaia São Nicolau e Vitória,100,100,100,100,100,100
Lordelo do Ouro e Massarelos,100,100,100,100,100,100
Paranhos,100,100,100,100,100,100
Ramalde,100,100,100,100,100,100
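
Every neighborhood returns exactly 100 venues, which is the `LIMIT`, and with a 2000 m radius the search circles of adjacent neighborhoods can overlap, so the same venue may be counted for more than one neighborhood. A rough check, assuming a spherical Earth and using the Bonfim and Cedofeita coordinates from the table in section 3 (`haversine_m` is a hypothetical helper):

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


bonfim = (41.15464, -8.59907)
cedofeita = (41.155233, -8.614337)

distance = haversine_m(*bonfim, *cedofeita)
# two circles of radius 2000 m overlap when their centres are closer than 4000 m
overlaps = distance < 2 * 2000
print(round(distance), overlaps)
```

The two centres are roughly 1.3 km apart, well inside the 4000 m overlap threshold, so the per-neighborhood venue lists are not independent samples.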


**Let's find out how many unique categories can be curated from all the returned venues**

In [217]:
print('There are {} unique categories.'.format(venues_df['VenueCategory'].nunique()))

There are 117 unique categories.


In [218]:
# show the first 50 categories
venues_df['VenueCategory'].unique()[:50]

array(['Park', 'Portuguese Restaurant', 'Brewery', 'Café',
       'Sushi Restaurant', 'Gym', 'Sandwich Place', 'Bakery',
       'Art Museum', 'Office', 'Museum', 'BBQ Joint', 'Garden',
       'Restaurant', 'Supermarket', 'Greek Restaurant', 'Shopping Mall',
       'Italian Restaurant', 'Japanese Restaurant', 'Burger Joint',
       "Men's Store", 'Exhibit', 'Hotel', 'Music Venue', 'Movie Theater',
       'Clothing Store', 'Beach', 'Tapas Restaurant', 'Bookstore',
       'Dessert Shop', 'Hookah Bar', 'Food Court', 'Coffee Shop',
       'Pizza Place', 'Nightclub', 'Bar', 'Flower Shop', 'Shop & Service',
       'Wine Bar', 'Seafood Restaurant', 'Bistro', 'Tea Room',
       'Mediterranean Restaurant', 'Plaza', 'Electronics Store',
       'Sporting Goods Shop', 'Yoga Studio', 'Chinese Restaurant',
       'Snack Place', 'Monument / Landmark'], dtype=object)

In [220]:
# check if the results contain "Supermarket"
"Supermarket" in venues_df['VenueCategory'].unique()

True

### 6. Analyze Each Neighborhood

In [221]:
# one hot encoding
pt_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
pt_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [pt_onehot.columns[-1]] + list(pt_onehot.columns[:-1])
pt_onehot = pt_onehot[fixed_columns]

print(pt_onehot.shape)
pt_onehot.head()

(700, 118)


Unnamed: 0,Neighborhoods,Art Gallery,Art Museum,Asian Restaurant,BBQ Joint,Bakery,Bar,Beach,Beer Bar,Beer Garden,Bistro,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bridge,Burger Joint,Café,Candy Store,Chinese Restaurant,Chocolate Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,College Cafeteria,Comfort Food Restaurant,Concert Hall,Cosmetics Shop,Dessert Shop,Diner,Electronics Store,Exhibit,Fast Food Restaurant,Flower Shop,Food Court,Fried Chicken Joint,Furniture / Home Store,Garden,General College & University,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hardware Store,Health Food Store,Hockey Arena,Hookah Bar,Hostel,Hot Dog Joint,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indoor Play Area,Italian Restaurant,Japanese Restaurant,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Monument / Landmark,Movie Theater,Multiplex,Museum,Music Venue,Nature Preserve,Nightclub,Nightlife Spot,Noodle House,Office,Other Great Outdoors,Paper / Office Supplies Store,Park,Pastry Shop,Pedestrian Plaza,Pet Service,Pharmacy,Pizza Place,Planetarium,Playground,Plaza,Pool,Portuguese Restaurant,Pub,Restaurant,Rock Climbing Spot,Roof Deck,Salad Place,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shop & Service,Shopping Mall,Snack Place,Soccer Field,Soccer Stadium,Spa,Sporting Goods Shop,Squash Court,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Theater,Toy / Game Store,Train Station,Vegetarian / Vegan Restaurant,Wine Bar,Yoga Studio
0,Aldoar Foz do Douro e Nevogilde,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Aldoar Foz do Douro e Nevogilde,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Aldoar Foz do Douro e Nevogilde,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Aldoar Foz do Douro e Nevogilde,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Aldoar Foz do Douro e Nevogilde,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


**Next, let's group the rows by neighborhood and take the mean of each category column, which gives the frequency of occurrence of each category**

In [222]:
pt_grouped = pt_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(pt_grouped.shape)
pt_grouped

(7, 118)


Unnamed: 0,Neighborhoods,Art Gallery,Art Museum,Asian Restaurant,BBQ Joint,Bakery,Bar,Beach,Beer Bar,Beer Garden,Bistro,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bridge,Burger Joint,Café,Candy Store,Chinese Restaurant,Chocolate Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,College Cafeteria,Comfort Food Restaurant,Concert Hall,Cosmetics Shop,Dessert Shop,Diner,Electronics Store,Exhibit,Fast Food Restaurant,Flower Shop,Food Court,Fried Chicken Joint,Furniture / Home Store,Garden,General College & University,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hardware Store,Health Food Store,Hockey Arena,Hookah Bar,Hostel,Hot Dog Joint,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indoor Play Area,Italian Restaurant,Japanese Restaurant,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Monument / Landmark,Movie Theater,Multiplex,Museum,Music Venue,Nature Preserve,Nightclub,Nightlife Spot,Noodle House,Office,Other Great Outdoors,Paper / Office Supplies Store,Park,Pastry Shop,Pedestrian Plaza,Pet Service,Pharmacy,Pizza Place,Planetarium,Playground,Plaza,Pool,Portuguese Restaurant,Pub,Restaurant,Rock Climbing Spot,Roof Deck,Salad Place,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shop & Service,Shopping Mall,Snack Place,Soccer Field,Soccer Stadium,Spa,Sporting Goods Shop,Squash Court,Steakhouse,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Theater,Toy / Game Store,Train Station,Vegetarian / Vegan Restaurant,Wine Bar,Yoga Studio
0,Aldoar Foz do Douro e Nevogilde,0.0,0.01,0.0,0.01,0.08,0.01,0.06,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.01,0.0,0.02,0.07,0.0,0.01,0.0,0.0,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.01,0.1,0.0,0.04,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.06,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.01
1,Bonfim,0.01,0.0,0.01,0.01,0.02,0.04,0.0,0.02,0.01,0.0,0.0,0.0,0.01,0.02,0.03,0.01,0.01,0.06,0.0,0.01,0.01,0.02,0.01,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.05,0.01,0.03,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.02,0.02,0.0,0.0,0.01,0.0,0.01,0.08,0.0,0.11,0.01,0.03,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.01,0.0,0.01,0.0,0.03,0.0
2,Campanhã,0.0,0.0,0.0,0.03,0.09,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.02,0.0,0.02,0.05,0.0,0.01,0.0,0.0,0.03,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.04,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.03,0.02,0.01,0.01,0.01,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.01,0.0,0.03,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.03,0.0,0.07,0.01,0.06,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.02,0.01,0.0,0.01,0.0,0.02,0.0,0.01,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
3,Cedofeita Santo Ildefonso Sé Miragaia São Nico...,0.01,0.0,0.01,0.0,0.04,0.04,0.0,0.02,0.01,0.0,0.02,0.0,0.0,0.04,0.03,0.0,0.02,0.05,0.0,0.0,0.01,0.02,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.05,0.01,0.05,0.0,0.0,0.01,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.02,0.02,0.0,0.0,0.01,0.0,0.0,0.07,0.01,0.08,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.03,0.01,0.0,0.01,0.0,0.01,0.01,0.03,0.0
4,Lordelo do Ouro e Massarelos,0.0,0.01,0.0,0.01,0.04,0.0,0.0,0.0,0.02,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.01,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.02,0.0,0.01,0.01,0.0,0.02,0.02,0.01,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.04,0.01,0.0,0.02,0.01,0.13,0.0,0.07,0.01,0.01,0.01,0.0,0.01,0.01,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.02,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.01
5,Paranhos,0.0,0.0,0.0,0.03,0.11,0.05,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.07,0.01,0.02,0.0,0.0,0.0,0.0,0.02,0.02,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.03,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.02,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.06,0.0,0.01,0.01,0.01,0.0,0.03,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.04,0.0,0.01,0.03,0.01,0.12,0.01,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.03,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Ramalde,0.0,0.01,0.0,0.01,0.06,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.02,0.11,0.0,0.02,0.0,0.0,0.03,0.0,0.02,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.01,0.01,0.0,0.02,0.01,0.01,0.02,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.01,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.02,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.03,0.0,0.0,0.01,0.01,0.07,0.0,0.02,0.0,0.01,0.01,0.01,0.0,0.0,0.01,0.01,0.0,0.03,0.0,0.02,0.02,0.0,0.0,0.03,0.03,0.0,0.02,0.01,0.0,0.01,0.0,0.0,0.0,0.0
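
Each venue has exactly one 1 in its one-hot row, so the mean of a category column within a neighborhood is simply that category's share of the neighborhood's venues. Because every neighborhood returned 100 venues, a value of 0.04 corresponds to 4 venues. The sketch below reproduces the same arithmetic in plain Python on made-up rows; the neighborhood labels `A` and `B` are placeholders.

```python
from collections import Counter, defaultdict

# made-up (neighborhood, category) rows for illustration only
venues = [
    ("A", "Park"), ("A", "Café"), ("A", "Café"), ("A", "Supermarket"),
    ("B", "Café"), ("B", "Bakery"),
]

# count categories per neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# share of each category = count / number of venues in the neighborhood,
# which is what the mean of a one-hot column computes
categories = sorted({category for _, category in venues})
frequencies = {
    neighborhood: {c: counter[c] / sum(counter.values()) for c in categories}
    for neighborhood, counter in counts.items()
}
print(frequencies)
```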


In [223]:
len(pt_grouped[pt_grouped["Supermarket"] > 0])

5

**Create a new DataFrame for Supermarket data only**

In [224]:
pt_super = pt_grouped[["Neighborhoods","Supermarket"]]

In [225]:
pt_super.head()

Unnamed: 0,Neighborhoods,Supermarket
0,Aldoar Foz do Douro e Nevogilde,0.02
1,Bonfim,0.0
2,Campanhã,0.04
3,Cedofeita Santo Ildefonso Sé Miragaia São Nico...,0.0
4,Lordelo do Ouro e Massarelos,0.01


### 7. Cluster Neighborhoods
Run k-means to group the neighborhoods in Porto into 3 clusters, using the supermarket frequency as the only feature.

In [226]:
# set number of clusters
kclusters = 3

pt_clustering = pt_super.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(pt_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 1, 2, 0, 1, 1])
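
One way to read these labels: k-means stops once every point is nearest to the centroid of its own cluster. The sketch below checks that condition for the labels above, using the seven supermarket frequencies shown in this notebook's output tables, in the row order of `pt_super`. It is a verification in plain Python, not a re-implementation of scikit-learn's `KMeans`.

```python
# supermarket frequencies (row order of pt_super) and the labels printed above
frequencies = [0.02, 0.0, 0.04, 0.0, 0.01, 0.03, 0.03]
labels = [0, 2, 1, 2, 0, 1, 1]
k = 3

# centroid of each cluster = mean frequency of its members
centroids = []
for c in range(k):
    members = [f for f, lab in zip(frequencies, labels) if lab == c]
    centroids.append(sum(members) / len(members))

# assignment step: index of the nearest centroid for every point
nearest = [min(range(k), key=lambda c: abs(f - centroids[c])) for f in frequencies]

is_fixed_point = nearest == labels
print(centroids, is_fixed_point)
```

The centroids come out at about 0.015, 0.033 and 0.0, so the three clusters correspond to moderate, high and zero supermarket frequency.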

In [227]:
# create a new dataframe that holds the supermarket frequency and the cluster label for each neighborhood
pt_merged = pt_super.copy()

# add clustering labels
pt_merged["Cluster Labels"] = kmeans.labels_

In [228]:
pt_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
pt_merged.head()

Unnamed: 0,Neighborhood,Supermarket,Cluster Labels
0,Aldoar Foz do Douro e Nevogilde,0.02,0
1,Bonfim,0.0,2
2,Campanhã,0.04,1
3,Cedofeita Santo Ildefonso Sé Miragaia São Nico...,0.0,2
4,Lordelo do Ouro e Massarelos,0.01,0


In [229]:
# join pt_merged with final_df to add latitude/longitude for each neighborhood
pt_merged = pt_merged.join(final_df.set_index("Neighborhood"), on="Neighborhood")

print(pt_merged.shape)
pt_merged.head() # check the last columns!

(7, 5)


Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
0,Aldoar Foz do Douro e Nevogilde,0.02,0,41.170415,-8.667161
1,Bonfim,0.0,2,41.15464,-8.59907
2,Campanhã,0.04,1,41.165753,-8.578014
3,Cedofeita Santo Ildefonso Sé Miragaia São Nico...,0.0,2,41.155233,-8.614337
4,Lordelo do Ouro e Massarelos,0.01,0,41.153699,-8.651808


In [230]:
# sort the results by Cluster Labels
print(pt_merged.shape)
pt_merged.sort_values(["Cluster Labels"], inplace=True)
pt_merged

(7, 5)


Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
0,Aldoar Foz do Douro e Nevogilde,0.02,0,41.170415,-8.667161
4,Lordelo do Ouro e Massarelos,0.01,0,41.153699,-8.651808
2,Campanhã,0.04,1,41.165753,-8.578014
5,Paranhos,0.03,1,41.177172,-8.603603
6,Ramalde,0.03,1,41.172881,-8.653226
1,Bonfim,0.0,2,41.15464,-8.59907
3,Cedofeita Santo Ildefonso Sé Miragaia São Nico...,0.0,2,41.155233,-8.614337


**Finally, let's visualize the resulting clusters**

In [231]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(pt_merged['Latitude'], pt_merged['Longitude'], pt_merged['Neighborhood'], pt_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 8. Examine Clusters

In [232]:
#Cluster 0
pt_merged.loc[pt_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
0,Aldoar Foz do Douro e Nevogilde,0.02,0,41.170415,-8.667161
4,Lordelo do Ouro e Massarelos,0.01,0,41.153699,-8.651808


In [233]:
#Cluster 1
pt_merged.loc[pt_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
2,Campanhã,0.04,1,41.165753,-8.578014
5,Paranhos,0.03,1,41.177172,-8.603603
6,Ramalde,0.03,1,41.172881,-8.653226


In [234]:
#Cluster 2
pt_merged.loc[pt_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
1,Bonfim,0.0,2,41.15464,-8.59907
3,Cedofeita Santo Ildefonso Sé Miragaia São Nico...,0.0,2,41.155233,-8.614337


#### Observations:
Supermarkets make up the largest share of the returned venues in cluster 1 (3% to 4% in Campanhã, Paranhos and Ramalde) and a moderate share in cluster 0 (1% to 2% in Aldoar, Foz do Douro e Nevogilde and in Lordelo do Ouro e Massarelos). In cluster 2 (Bonfim and Cedofeita, Santo Ildefonso, Sé, Miragaia, São Nicolau e Vitória), no supermarket appears among the top 100 venues at all.

Cluster 2 therefore looks like the best opportunity for a new supermarket, since there is little to no competition from existing ones, whereas supermarkets in cluster 1 probably face the most competition because of their higher concentration. Cluster 2 consists of the parishes that cover Porto's historic centre together with the adjacent Bonfim, so in this data supermarkets are more frequent in the outer neighborhoods than in the centre.

This project therefore recommends that property developers open new supermarkets in the cluster 2 neighborhoods. Developers with a unique selling proposition that stands out from the competition could also consider the cluster 0 neighborhoods, where competition is moderate. The cluster 1 neighborhoods, which already have the highest concentration of supermarkets, are best avoided.
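
The ordering of the clusters can be reproduced from the table in section 7 by averaging the supermarket frequency per cluster and sorting from least to most competition. This is a plain-Python sketch with the values copied from that table; the variable names are illustrative.

```python
from collections import defaultdict

# (neighborhood, supermarket frequency, cluster label) copied from the table in section 7
rows = [
    ("Aldoar Foz do Douro e Nevogilde", 0.02, 0),
    ("Lordelo do Ouro e Massarelos", 0.01, 0),
    ("Campanhã", 0.04, 1),
    ("Paranhos", 0.03, 1),
    ("Ramalde", 0.03, 1),
    ("Bonfim", 0.0, 2),
    ("Cedofeita Santo Ildefonso Sé Miragaia São Nicolau e Vitória", 0.0, 2),
]

# collect the frequencies of each cluster
by_cluster = defaultdict(list)
for name, frequency, cluster in rows:
    by_cluster[cluster].append(frequency)

# mean supermarket share per cluster, then clusters ordered by ascending share
mean_share = {c: sum(v) / len(v) for c, v in by_cluster.items()}
ranking = sorted(mean_share, key=mean_share.get)
print(mean_share, ranking)
```

The ranking lists cluster 2 first, then cluster 0, then cluster 1, which matches the order of the recommendations.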