# IBM Applied Data Science Capstone Course by Coursera

## Week 5 Final Report

#### **Title: *Opening a new Shopping Mall in Milan, Italy***

**Steps:** 
- Build a dataframe of the neighborhoods
- Get the geographical coordinates of the neighborhoods
- Obtain the venue data for the neighborhoods from the Foursquare API
- Explore and cluster the neighborhoods
- Select the best cluster to open the Shopping Mall

### 1. Importing libraries

In [1]:
# pip install geocoder

In [2]:
import pandas as pd # library for data analysis

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

import numpy as np
print("Libraries imported.")

Libraries imported.


### 2. Scraping Data from the Wikipedia Page

In [3]:
url = 'https://en.wikipedia.org/wiki/Category:Districts_of_Milan' # Wikipedia Page

data = requests.get(url).text

In [4]:
# Creating a BeautifulSoup object
soup = BeautifulSoup(data, 'html.parser')

In [5]:
# Filling up the list
neighborhood_list = []
for row in soup.find_all("div", class_="mw-category")[0].findAll("li"):
    neighborhood_list.append(row.text)

In [6]:
# Create a new DataFrame from the list
df = pd.DataFrame({"Neighborhood": neighborhood_list})

print(df.to_string())

                                       Neighborhood
0                                            Affori
1                                           Assiano
2                        Baggio (district of Milan)
3                                            Barona
4                       Bicocca (district of Milan)
5                                            Bovisa
6                                         Bovisasca
7                         Brera (district of Milan)
8                                          Bruzzano
9                                        Calvairate
10                     Centro Direzionale di Milano
11                  Chiaravalle (district of Milan)
12                                 Chinatown, Milan
13                                          Cimiano
14                                      Città Studi
15                                         Comasina
16                                      Crescenzago
17                                          Dergano
18          

In [7]:
# We can drop the first row
df.drop([0], inplace=True)

In [8]:
# We reset the index of the dataframe
df.reset_index(drop=True, inplace=True)

In [9]:
# Printing the shape
print(df.shape)

(75, 1)
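
Several scraped names carry Wikipedia disambiguation suffixes such as "(district of Milan)" or ", Milan". A minimal sketch of stripping them before geocoding follows; `clean_name` is a hypothetical helper, and it assumes these two suffix patterns are the only ones present in the list.

```python
import re

# Hypothetical helper (not part of the original notebook): removes the
# Wikipedia disambiguation suffixes " (district of Milan)" and ", Milan".
_SUFFIXES = re.compile(r"\s*\(district of Milan\)$|,\s*Milan$")


def clean_name(raw_name: str) -> str:
    """Return the neighborhood name without a trailing disambiguation suffix."""
    return _SUFFIXES.sub("", raw_name.strip())


examples = ["Affori", "Baggio (district of Milan)", "Chinatown, Milan"]
cleaned = [clean_name(name) for name in examples]
print(cleaned)  # ['Affori', 'Baggio', 'Chinatown']
```

In the notebook this could be applied with `df['Neighborhood'].apply(clean_name)`, should shorter names be preferred for the geocoding query.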


### 3. Getting the coordinates

In [10]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Milano, Italia'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
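
The `while` loop above never terminates if the service keeps returning no result for a name. A possible variant with a bounded number of attempts is sketched below. `get_latlng_bounded`, `fake_lookup`, `max_attempts` and `pause_seconds` are illustrative names, not part of the notebook, and the lookup is passed in as a function so the sketch runs offline.

```python
import time
from typing import Callable, List, Optional


def get_latlng_bounded(
    neighborhood: str,
    lookup: Callable[[str], Optional[List[float]]],
    max_attempts: int = 3,
    pause_seconds: float = 0.0,
) -> Optional[List[float]]:
    """Try the lookup up to max_attempts times; return None if all fail."""
    query = '{}, Milano, Italia'.format(neighborhood)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)  # wait before the next attempt
    return None


# Stand-in lookup so the sketch runs without network access. With the real
# service this could be: lambda q: geocoder.arcgis(q).latlng
def fake_lookup(query: str) -> Optional[List[float]]:
    known = {'Barona, Milano, Italia': [45.43371, 9.1516]}
    return known.get(query)


found = get_latlng_bounded('Barona', fake_lookup)
missing = get_latlng_bounded('Nowhere', fake_lookup)
print(found, missing)  # [45.43371, 9.1516] None
```

Returning `None` instead of looping forever means the caller has to decide what to do with neighborhoods that could not be geocoded, for example dropping them before the merge.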

In [11]:
# call the function to get the coordinates and store them in a new list
coords = []
i = 0
for neighborhood in df['Neighborhood'].tolist():
    coords.append(get_latlng(neighborhood))
    print('Current neighborhood: {} {}'.format(neighborhood, coords[i]))
    i += 1

Current neighborhood: Assiano [45.450614388470385, 9.061634526056144]
Current neighborhood: Baggio (district of Milan) [45.46324000000004, 9.092700000000036]
Current neighborhood: Barona [45.433710000000076, 9.15160000000003]
Current neighborhood: Bicocca (district of Milan) [45.52149000000003, 9.213260000000048]
Current neighborhood: Bovisa [45.503130000000056, 9.161220000000071]
Current neighborhood: Bovisasca [45.515550000000076, 9.150940000000048]
Current neighborhood: Brera (district of Milan) [45.47470005107049, 9.190009970569658]
Current neighborhood: Bruzzano [45.52825000000007, 9.180710000000033]
Current neighborhood: Calvairate [45.456180000000074, 9.224880000000041]
Current neighborhood: Centro Direzionale di Milano [45.501986915580225, 9.264641435209318]
Current neighborhood: Chiaravalle (district of Milan) [45.41719000000006, 9.23971000000006]
Current neighborhood: Chinatown, Milan [45.500860000000046, 9.265130000000056]
Current neighborhood: Cimiano [45.503460000000075, 9

In [12]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
print(df_coords.shape)
df_coords.head()

(75, 2)


| | Latitude | Longitude |
|---|---|---|
| 0 | 45.450614 | 9.061635 |
| 1 | 45.46324 | 9.0927 |
| 2 | 45.43371 | 9.1516 |
| 3 | 45.52149 | 9.21326 |
| 4 | 45.50313 | 9.16122 |


In [13]:
# We can now merge both dataframes
df['Latitude'] = df_coords['Latitude']
df['Longitude'] = df_coords['Longitude']

In [14]:
print('The shape is: {}'.format(df.shape))

df

The shape is: (75, 3)


| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Assiano | 45.450614 | 9.061635 |
| 1 | Baggio (district of Milan) | 45.463240 | 9.092700 |
| 2 | Barona | 45.433710 | 9.151600 |
| 3 | Bicocca (district of Milan) | 45.521490 | 9.213260 |
| 4 | Bovisa | 45.503130 | 9.161220 |
| ... | ... | ... | ... |
| 70 | Turro | 45.494520 | 9.221710 |
| 71 | Vaiano Valle | 45.428930 | 9.216200 |
| 72 | Vialba | 45.514910 | 9.128150 |
| 73 | Vigentino | 45.433720 | 9.201040 |


In [15]:
# We will save the dataframe into a csv file
df.to_csv('milan_neighborhoods.csv', index = False)

### 4. Create a map of Milan with neighborhoods

In [16]:
# get the coordinates of Milan
address = 'Milano, Italia'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Milano, Italia are: {}, {}.'.format(latitude, longitude))

The geographical coordinates of Milano, Italia are: 45.4668, 9.1905.


In [17]:
# create map of Milan using latitude and longitude values

map_milan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_milan)  
    
map_milan

In [18]:
# We will save the map for the report
# save the map as HTML file
map_milan.save('map_milan.html')

### 5. Use the Foursquare API to explore the neighborhoods

In [19]:
# define Foursquare credentials and version
# (replace the placeholders with your own values; never publish real credentials)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


In [20]:
# Let's get the top 100 venues that are within a 2 km radius
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
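
The loop above indexes straight into the response, so a reply without `groups`, or a venue with an empty `categories` list, raises an exception and stops the whole run. A more defensive way to flatten one response might look like the following sketch. `extract_venues` is a hypothetical helper, and the sample payload is hand-made to mirror the keys used above, not real API output.

```python
from typing import Any, Dict, List, Tuple


def extract_venues(payload: Dict[str, Any], neighborhood: str,
                   lat: float, lng: float) -> List[Tuple]:
    """Flatten one explore response into rows, skipping incomplete venues."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories', [])
        location = venue.get('location', {})
        if not categories or 'lat' not in location or 'lng' not in location:
            continue  # skip venues without a category or coordinates
        rows.append((neighborhood, lat, lng, venue.get('name'),
                     location['lat'], location['lng'], categories[0]['name']))
    return rows


# Hand-made payload with the same nesting the loop above relies on
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Panta Rei',
               'location': {'lat': 45.455693, 'lng': 9.064415},
               'categories': [{'name': 'Mediterranean Restaurant'}]}},
    {'venue': {'name': 'Uncategorised place',  # made-up entry, no category
               'location': {'lat': 45.45, 'lng': 9.06},
               'categories': []}},
]}]}}

rows = extract_venues(sample, 'Assiano', 45.450614, 9.061635)
empty = extract_venues({}, 'Assiano', 45.450614, 9.061635)
print(len(rows), len(empty))  # 1 0
```

Inside the loop, `venues.extend(extract_venues(requests.get(url).json(), neighborhood, lat, long))` would then replace the direct indexing.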

In [21]:
# Convert the list into a dataframe
venues_df = pd.DataFrame(venues)

venues_df.columns = ['Neighborhood', 
                     'Latitude', 
                     'Longitude', 
                     'VenueName', 
                     'VenueLatitude', 
                     'VenueLongitude', 
                     'VenueCategory'
                    ]

print(venues_df.shape)

venues_df.head()

(6028, 7)


| | Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|
| 0 | Assiano | 45.450614 | 9.061635 | Panta Rei | 45.455693 | 9.064415 | Mediterranean Restaurant |
| 1 | Assiano | 45.450614 | 9.061635 | Viridea | 45.445845 | 9.0403 | Flower Shop |
| 2 | Assiano | 45.450614 | 9.061635 | Tourlé | 45.443644 | 9.040191 | Pizza Place |
| 3 | Assiano | 45.450614 | 9.061635 | Muggiano | 45.451495 | 9.069825 | Neighborhood |
| 4 | Assiano | 45.450614 | 9.061635 | Appaloosa | 45.442789 | 9.042189 | Brewery |


In [22]:
# Let's check how many venues were returned for each neighborhood
venues_df.groupby(["Neighborhood"]).count()

| Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|
| Assiano | 9 | 9 | 9 | 9 | 9 | 9 |
| Baggio (district of Milan) | 42 | 42 | 42 | 42 | 42 | 42 |
| Barona | 31 | 31 | 31 | 31 | 31 | 31 |
| Bicocca (district of Milan) | 100 | 100 | 100 | 100 | 100 | 100 |
| Bovisa | 100 | 100 | 100 | 100 | 100 | 100 |
| ... | ... | ... | ... | ... | ... | ... |
| Turro | 100 | 100 | 100 | 100 | 100 | 100 |
| Vaiano Valle | 72 | 72 | 72 | 72 | 72 | 72 |
| Vialba | 72 | 72 | 72 | 72 | 72 | 72 |
| Vigentino | 100 | 100 | 100 | 100 | 100 | 100 |


In [23]:
# Let's check how many categories
print('There are {} unique categories.\n'.format(len(venues_df['VenueCategory'].unique())))

print("Here are all the categories:\n ", venues_df['VenueCategory'].unique())

There are 281 unique categories.

Here are all the categories:
  ['Mediterranean Restaurant' 'Flower Shop' 'Pizza Place' 'Neighborhood'
 'Brewery' 'Soccer Field' 'Park' 'Supermarket' 'Italian Restaurant' 'Café'
 'Bar' 'Water Park' 'Pool' 'Sushi Restaurant' 'Cheese Shop' 'Plaza'
 'Campanian Restaurant' 'Japanese Restaurant' 'Movie Theater' 'Theater'
 'Shopping Mall' 'Yoga Studio' 'Athletics & Sports' 'Bistro' 'Restaurant'
 'Food' 'Cupcake Shop' 'Bus Station' 'Bus Stop' 'Bed & Breakfast'
 'Dance Studio' 'Bakery' 'Ice Cream Shop' 'Harbor / Marina'
 'South American Restaurant' 'Bookstore' 'Event Space' 'Hotel'
 'Burger Joint' 'Arts & Crafts Store' 'Russian Restaurant' 'Cocktail Bar'
 'Chinese Restaurant' 'Light Rail Station' 'Dog Run' 'Art Gallery' 'Gym'
 'Piadineria' 'Performing Arts Venue' 'Pub' 'Trattoria/Osteria' 'Office'
 'Sandwich Place' 'Museum' 'Gastropub' 'Frozen Yogurt Shop'
 'Seafood Restaurant' 'Music Venue' 'Cosmetics Shop'
 'Sardinian Restaurant' 'Multiplex' 'Breakfast Spot' 'Pe

In [24]:
# Let's check if there is a Shopping Mall 

"Shopping Mall" in venues_df['VenueCategory'].unique()

True

### 6. Analyse each neighborhood

In [25]:
# one hot encoding
df_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]

print(df_onehot.shape)
df_onehot.head(5)

(6028, 282)


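| | Neighborhoods | Abruzzo Restaurant | Accessories Store | Adult Education Center | African Restaurant | Agriturismo | Airport | Airport Lounge | Airport Service | Airport Terminal | ... | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Volleyball Court | Warehouse Store | Water Park | Wine Bar | Wine Shop | Winery | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Assiano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Assiano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Assiano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Assiano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Assiano | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |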


In [26]:
# Let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category
df_grouped = df_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(df_grouped.shape)

df_grouped.head()

(75, 282)


| | Neighborhoods | Abruzzo Restaurant | Accessories Store | Adult Education Center | African Restaurant | Agriturismo | Airport | Airport Lounge | Airport Service | Airport Terminal | ... | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Volleyball Court | Warehouse Store | Water Park | Wine Bar | Wine Shop | Winery | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Assiano | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Baggio (district of Milan) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.02381 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02381 |
| 2 | Barona | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Bicocca (district of Milan) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Bovisa | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
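
Taking the mean of a 0/1 indicator column gives the share of a neighborhood's venues that belong to that category. For example, Baggio returned 42 venues, so its 0.02381 for Water Park corresponds to 1 venue out of 42. The toy example below illustrates the same arithmetic on a made-up venue list for one hypothetical neighborhood.

```python
from collections import Counter

# Made-up venue categories for one hypothetical neighborhood (not real data)
categories = ['Pizza Place', 'Café', 'Pizza Place', 'Shopping Mall', 'Café',
              'Pizza Place', 'Park', 'Bar']

counts = Counter(categories)
total = len(categories)

# Mean of a 0/1 indicator column == count of that category / number of venues
shares = {name: count / total for name, count in counts.items()}

print(shares['Shopping Mall'])  # 0.125, i.e. 1 of 8 venues
print(round(1 / 42, 5))         # 0.02381, the Baggio value in the table above
```

Because each venue contributes to exactly one category, the shares of a neighborhood sum to 1, which makes rows comparable even when neighborhoods returned different numbers of venues.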


In [27]:
# Let's check how many neighborhoods have shopping malls

len(df_grouped[df_grouped["Shopping Mall"] > 0])

14

In [28]:
# We can now create a dataframe with only shopping malls
df_mall = df_grouped[["Neighborhoods","Shopping Mall"]]

df_mall.head()

| | Neighborhoods | Shopping Mall |
|---|---|---|
| 0 | Assiano | 0.0 |
| 1 | Baggio (district of Milan) | 0.02381 |
| 2 | Barona | 0.0 |
| 3 | Bicocca (district of Milan) | 0.01 |
| 4 | Bovisa | 0.0 |


### 7. Cluster neighborhoods

In [29]:
# Let's set the number of clusters to 3
kclusters = 3

df_clustering = df_mall.drop(["Neighborhoods"], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 0, 1, 0, 1, 0, 2, 0, 0], dtype=int32)
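
The label numbers that k-means assigns are arbitrary: a different `random_state` can produce the same grouping under different numbers. If the labels are to be read as "low", "moderate" and "high" mall frequency, they can be renumbered by cluster mean. The sketch below does this in plain Python; `relabel_by_centroid` is a hypothetical helper, the five frequencies are the first rows of `df_mall`, and the second label list is an invented permutation used only to show the effect.

```python
from collections import defaultdict
from typing import Dict, List


def relabel_by_centroid(values: List[float], labels: List[int]) -> List[int]:
    """Renumber cluster labels so that 0 is the cluster with the lowest mean."""
    members: Dict[int, List[float]] = defaultdict(list)
    for value, label in zip(values, labels):
        members[label].append(value)
    # 1-D centroid of each cluster = mean of its members
    centroids = {label: sum(v) / len(v) for label, v in members.items()}
    ordered = sorted(centroids, key=centroids.get)
    remap = {old: new for new, old in enumerate(ordered)}
    return [remap[label] for label in labels]


frequencies = [0.0, 0.02381, 0.0, 0.01, 0.0]
as_printed = relabel_by_centroid(frequencies, [0, 2, 0, 1, 0])
permuted = relabel_by_centroid(frequencies, [1, 0, 1, 2, 1])  # invented labels
print(as_printed, permuted)  # [0, 2, 0, 1, 0] [0, 2, 0, 1, 0]
```

With scikit-learn the fitted centroids are also available as `kmeans.cluster_centers_`, which could be sorted in the same way.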

In [30]:
# create a new dataframe that includes the cluster label as well as the shopping mall frequency for each neighborhood.
df_merged = df_mall.copy()

# add clustering labels
df_merged["Cluster Labels"] = kmeans.labels_

In [31]:
df_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
df_merged.head()

| | Neighborhood | Shopping Mall | Cluster Labels |
|---|---|---|---|
| 0 | Assiano | 0.0 | 0 |
| 1 | Baggio (district of Milan) | 0.02381 | 2 |
| 2 | Barona | 0.0 | 0 |
| 3 | Bicocca (district of Milan) | 0.01 | 1 |
| 4 | Bovisa | 0.0 | 0 |


In [32]:
# join df_merged with df to add latitude/longitude for each neighborhood
df_merged = df_merged.join(df.set_index("Neighborhood"), on="Neighborhood")

print(df_merged.shape)
df_merged.tail() # check that the Latitude and Longitude columns were added

(75, 5)


| | Neighborhood | Shopping Mall | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 70 | Turro | 0.0 | 0 | 45.49452 | 9.22171 |
| 71 | Vaiano Valle | 0.0 | 0 | 45.42893 | 9.2162 |
| 72 | Vialba | 0.013889 | 1 | 45.51491 | 9.12815 |
| 73 | Vigentino | 0.0 | 0 | 45.43372 | 9.20104 |
| 74 | Villapizzone | 0.0 | 0 | 45.49834 | 9.14453 |



In [34]:
# Let's sort the results by cluster labels
print(df_merged.shape)
df_merged.sort_values(["Cluster Labels"], inplace=True)
df_merged

(75, 5)


| | Neighborhood | Shopping Mall | Cluster Labels | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Assiano | 0.000000 | 0 | 45.450614 | 9.061635 |
| 51 | Prato Centenaro | 0.000000 | 0 | 45.506710 | 9.199210 |
| 50 | Portello (district of Milan) | 0.000000 | 0 | 45.480230 | 9.149850 |
| 49 | Porta Volta | 0.000000 | 0 | 45.481510 | 9.177540 |
| 48 | Porta Vittoria | 0.000000 | 0 | 45.461060 | 9.206760 |
| ... | ... | ... | ... | ... | ... |
| 32 | Niguarda | 0.020000 | 2 | 45.518400 | 9.192010 |
| 61 | Rogoredo | 0.020000 | 2 | 45.430160 | 9.244000 |
| 1 | Baggio (district of Milan) | 0.023810 | 2 | 45.463240 | 9.092700 |
| 14 | Comasina | 0.031746 | 2 | 45.526310 | 9.158870 |


Finally, let's visualize the resulting clusters:

In [35]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Neighborhood'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [36]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

### 8. Examine the clusters

**Cluster 0**

In [37]:
print(df_merged.loc[df_merged['Cluster Labels'] == 0].to_string())

                                       Neighborhood  Shopping Mall  Cluster Labels   Latitude  Longitude
0                                           Assiano            0.0               0  45.450614   9.061635
51                                  Prato Centenaro            0.0               0  45.506710   9.199210
50                     Portello (district of Milan)            0.0               0  45.480230   9.149850
49                                      Porta Volta            0.0               0  45.481510   9.177540
48                                   Porta Vittoria            0.0               0  45.461060   9.206760
47                                  Porta Vigentina            0.0               0  45.453736   9.196127
46                                    Porta Venezia            0.0               0  45.470980   9.199810
52                                         Precotto            0.0               0  45.515410   9.225530
45                                   Porta Ticinese    

**Cluster 1**

In [38]:
print(df_merged.loc[df_merged['Cluster Labels'] == 1].to_string())

                        Neighborhood  Shopping Mall  Cluster Labels  Latitude  Longitude
72                            Vialba       0.013889               1  45.51491    9.12815
3        Bicocca (district of Milan)       0.010000               1  45.52149    9.21326
5                          Bovisasca       0.010417               1  45.51555    9.15094
58                    Quarto Oggiaro       0.013699               1  45.51674    9.14090
24                       Gratosoglio       0.015873               1  45.41459    9.17122
35  Ponte Lambro (district of Milan)       0.016129               1  45.44240    9.26420
27                        Lampugnano       0.010000               1  45.49163    9.12196
20                         Garegnano       0.010000               1  45.50469    9.13697


**Cluster 2**

In [39]:
print(df_merged.loc[df_merged['Cluster Labels'] == 2].to_string())

                  Neighborhood  Shopping Mall  Cluster Labels  Latitude  Longitude
7                     Bruzzano       0.025974               2  45.52825    9.18071
32                    Niguarda       0.020000               2  45.51840    9.19201
61                    Rogoredo       0.020000               2  45.43016    9.24400
1   Baggio (district of Milan)       0.023810               2  45.46324    9.09270
14                    Comasina       0.031746               2  45.52631    9.15887
17  Figino (district of Milan)       0.024390               2  45.49234    9.07852


### 9. Observations

Most of the shopping malls are concentrated in the suburbs of Milan. Cluster 2 has the highest share of shopping malls among the returned venues (about 2.0% to 3.2%), and cluster 1 has a moderate share (about 1.0% to 1.6%).

Cluster 0, on the other hand, has very few to no shopping malls: the 14 neighborhoods with a non-zero shopping mall frequency all fall in clusters 1 and 2. Cluster 0 neighborhoods therefore offer the greatest opportunity for a new shopping mall, as there is little to no competition from existing malls.

Meanwhile, shopping malls in cluster 2 are likely to face the most intense competition because of the high concentration of malls there.

This project therefore recommends that property developers open new shopping malls in cluster 0 neighborhoods, where there is little to no competition. Developers with a unique selling proposition that sets them apart from the competition can also consider cluster 1 neighborhoods, where competition is moderate. Lastly, developers are advised to avoid cluster 2 neighborhoods, which already have a high concentration of shopping malls and are likely to suffer from intense competition.
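
The gap between the clusters can be checked directly from the frequencies printed in section 8. The sketch below is only a summary of those printed values, written in plain Python rather than as the notebook's own method. Cluster 0 is left out because its printout is truncated.

```python
from statistics import mean

# Shopping Mall frequencies copied from the cluster printouts in section 8
cluster_1 = [0.013889, 0.010000, 0.010417, 0.013699,
             0.015873, 0.016129, 0.010000, 0.010000]
cluster_2 = [0.025974, 0.020000, 0.020000, 0.023810, 0.031746, 0.024390]

for name, values in (('Cluster 1', cluster_1), ('Cluster 2', cluster_2)):
    print('{}: n={}, min={:.4f}, mean={:.4f}, max={:.4f}'.format(
        name, len(values), min(values), mean(values), max(values)))

# The two clusters do not overlap: every cluster 2 value exceeds every
# cluster 1 value.
print(max(cluster_1) < min(cluster_2))  # True
```

These frequencies are shares of at most 100 venues returned per neighborhood within a 2 km radius, so they indicate relative mall density rather than an absolute count of malls.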