# Shopping Mall analysis of Warsaw, Poland

Coursera Capstone - Week 5

# 1. Introduction

#### 1.1 Business context

Numerous analyses of the Polish commercial property market point to the same conclusion: the market is performing strongly. Commercial property investment in 2018 reached a record transaction value of €7.2bn, and the first quarter of 2019 indicates continued strong interest in investing in Poland. Moreover, our practice and market data show that the group of active investors has grown since last year.

One such investor, our client, is a commercial real-estate developer impressed by the growing potential of Warsaw as a retail destination. Aiming to take advantage of this wave, the client is looking to open or acquire a shopping mall in Warsaw.

#### 1.2 Business problem

In order to understand the commercial real-estate market better, the client has approached us to answer the following key questions:

- What is the current landscape of the shopping centers/retail parks/high streets in the city of Warsaw?
    - What is the typical size?
    - Are they relatively modern construction or older?
    - What is their distribution in terms of location (city center, office district, residential district, suburbs etc.)?
- How are malls that are situated in city center/commercial district different from those situated in residential districts in terms of size and store formats?
- Given that the client is looking to open/acquire a large shopping mall (60,000 sq.m) near the city center, what should the typical stores in the mall be, based on the current environment?

# 2. Data acquisition and cleaning

#### 2.1 Data sources

For this analysis, we use the dataset of shopping malls/high streets/retail parks available at https://prch.org.pl/en/sc-catalog. The dataset was created by scraping the contents of that URL. The shopping mall dataset contains the following fields:
- Name of the mall
- Address 
- Type of mall (traditional/Speciality/retail-park/high-street/mixed-use)
- GLA (Gross Leasable Area in sq.m)
- Status (Open, In Construction, Closed, Planned)
- Opening year

In addition, we will use location data from the Foursquare API to analyze the venues around each shopping mall. With a search radius of 250 m, we assume that the venues returned by the Foursquare query lie within or immediately around the mall. This helps us understand the prevailing store formats of a given shopping mall.

Importing the required libraries

In [2]:
#importing required libraries
#from bs4 import BeautifulSoup
#import lxml

import pandas as pd
import numpy as np

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library



#### 2.2 Data load, summary and cleaning 

In [None]:
#Loading the shopping centers dataset

In [3]:
mall_data = pd.read_csv('Shopping_Mall_Warsaw.csv')
mall_data.head()

Unnamed: 0,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Status,Year,GLA
0,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,Traditional Mall,Open,2004,117000
1,ArtN,Warszawa,Mazowieckie,00-841,ul. zelazna 51/53,20.991977,52.232447,mixed-use,In construction,2020,24000
2,Atrium Promenada,Warszawa,Mazowieckie,04-175,ul. Ostrobramska 75C,21.106915,52.232908,Traditional Mall,Open,1996,93000
3,Atrium Reduta,Warszawa,Mazowieckie,02-326,Al. Jerozolimskie 148,20.951726,52.212434,Traditional Mall,Open,1999,40700
4,Atrium Targowek,Warszawa,Mazowieckie,03-287,ul. Glebocka 15,21.05922,52.303658,Traditional Mall,Open,1998,50300


Getting basic summary of the data

In [4]:
print(mall_data.shape)
print(mall_data['Type'].value_counts())
print(mall_data['Status'].value_counts())

(60, 11)
Traditional Mall    38
mixed-use            7
Retail park          5
Speciality mall      4
High street          4
outlet               2
Name: Type, dtype: int64
Open               50
In construction     7
Planned             2
Closed              1
Name: Status, dtype: int64


We see that there are 60 shopping centers in the data: 38 traditional malls, 7 mixed-use, 5 retail parks, 4 speciality malls, 4 high streets and 2 outlets. Looking at the status of the malls, 1 is closed. We will remove this entry from our data.

In [5]:
#Removing the entry with 'Closed' status
mall_data = mall_data.set_index('Status').drop('Closed', axis = 0)
mall_data = mall_data.reset_index()
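
The same clean-up can also be written as a boolean filter, which leaves the column order untouched. The snippet below is only a minimal sketch on a small hypothetical frame (`toy_malls` and its mall names are not part of the dataset); the `Status` values are the ones listed in the summary above.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate the filter
toy_malls = pd.DataFrame({
    'Name': ['Mall A', 'Mall B', 'Mall C'],
    'Status': ['Open', 'Closed', 'In construction'],
})

# Keep every row whose status is not 'Closed', then renumber the index
toy_open = toy_malls[toy_malls['Status'] != 'Closed'].reset_index(drop=True)
print(toy_open)
```

Unlike the `set_index`/`drop`/`reset_index` route, this variant does not move `Status` to the front of the frame.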

Visualizing the shopping centers in Warsaw

In [6]:
# Taking Warsaw's Main post office as the center point of the city
address = 'Świętokrzyska 31/33, 00-001 Warsaw, Poland'

geolocator = Nominatim(user_agent="pl_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Warsaw City are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Warsaw City are 52.2353714, 21.0105726.


In [7]:
# create map of Warsaw using latitude and longitude values
map_warsaw = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, name in zip(mall_data['GEO_Y'], mall_data['GEO_X'], mall_data['Name']):
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_warsaw)  
    
map_warsaw


#### 2.3 Data transformation

To answer the first question, we create two additional variables: the distance of the mall from the city center (Distance_center) and its years of operation (Years_operation). To calculate the distance from the city center, we take Warsaw's main post office as the center point. Historically, in Europe, distances between cities were measured between their main post offices, which were typically situated in the city center.
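
As a rough sanity check on such distances, the great-circle (haversine) formula can be evaluated with the standard library alone. This is a sketch under stated assumptions: it treats the Earth as a sphere with a mean radius of 6371 km, whereas the notebook itself uses geopy's ellipsoidal `geodesic`, so the two results can differ slightly. `haversine_km` is a helper name introduced here for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km on a sphere of the given radius (an approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Coordinates as they appear in this notebook (GEO_Y is latitude, GEO_X is longitude)
center = (52.2353714, 21.0105726)   # geocoded main post office
arkadia = (52.256346, 20.981652)    # Arkadia

d = haversine_km(center[0], center[1], arkadia[0], arkadia[1])
print(round(d, 1))
```

At city scale the spherical approximation should land close to the geodesic value, which makes it a convenient cross-check when geopy is not available.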

In [8]:
import geopy.distance

In [9]:
#Calculating 'Distance_center' variable (geodesic distance to the city center, in km)
#Using apply avoids chained assignment and the SettingWithCopyWarning it triggers
mall_data['Distance_center'] = mall_data.apply(
    lambda row: geopy.distance.geodesic((row['GEO_Y'], row['GEO_X']), (latitude, longitude)).km,
    axis=1)

mall_data['Distance_center'] = np.round(mall_data['Distance_center'].astype(np.double),1)
mall_data.head()


Unnamed: 0,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Year,GLA,Distance_center
0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,Traditional Mall,2004,117000,3.1
1,In construction,ArtN,Warszawa,Mazowieckie,00-841,ul. zelazna 51/53,20.991977,52.232447,mixed-use,2020,24000,1.3
2,Open,Atrium Promenada,Warszawa,Mazowieckie,04-175,ul. Ostrobramska 75C,21.106915,52.232908,Traditional Mall,1996,93000,6.6
3,Open,Atrium Reduta,Warszawa,Mazowieckie,02-326,Al. Jerozolimskie 148,20.951726,52.212434,Traditional Mall,1999,40700,4.8
4,Open,Atrium Targowek,Warszawa,Mazowieckie,03-287,ul. Glebocka 15,21.05922,52.303658,Traditional Mall,1998,50300,8.3


In [10]:
#Calculating Years_operation variable (years since opening; future openings count as 0)
import datetime
now = datetime.datetime.now()  # the outputs shown below were generated in 2019
year_diff = now.year - pd.to_numeric(mall_data['Year'])
mall_data['Years_operation'] = year_diff.clip(lower=0)
mall_data.head()


Unnamed: 0,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Year,GLA,Distance_center,Years_operation
0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,Traditional Mall,2004,117000,3.1,15
1,In construction,ArtN,Warszawa,Mazowieckie,00-841,ul. zelazna 51/53,20.991977,52.232447,mixed-use,2020,24000,1.3,0
2,Open,Atrium Promenada,Warszawa,Mazowieckie,04-175,ul. Ostrobramska 75C,21.106915,52.232908,Traditional Mall,1996,93000,6.6,23
3,Open,Atrium Reduta,Warszawa,Mazowieckie,02-326,Al. Jerozolimskie 148,20.951726,52.212434,Traditional Mall,1999,40700,4.8,20
4,Open,Atrium Targowek,Warszawa,Mazowieckie,03-287,ul. Glebocka 15,21.05922,52.303658,Traditional Mall,1998,50300,8.3,21


# 3. What is the current landscape of the shopping centers/retail parks/high streets in the city of Warsaw?

To understand the characteristics of the shopping centers in Warsaw, we use three features: size (GLA), location (Distance_center) and years of operation (Years_operation). Let's segment the shopping centers on these features using the k-means method.

First, let's standardize the dataset with the above-mentioned features, so that GLA, which is measured in tens of thousands of sq.m, does not dominate the distance computation.
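
To make the standardization step concrete, the sketch below computes z-scores by hand with NumPy on the five rows shown in the table above. Because it uses only five of the malls, the numbers will not match the full standardized array; the point is only to illustrate the transformation (subtract the column mean, divide by the population standard deviation, which is also what `StandardScaler` does).

```python
import numpy as np

# GLA, Distance_center and Years_operation for the first five malls shown above
features = np.array([
    [117000, 3.1, 15],
    [24000, 1.3, 0],
    [93000, 6.6, 23],
    [40700, 4.8, 20],
    [50300, 8.3, 21],
], dtype=float)

# z-score per column: subtract the column mean, divide by the population standard deviation
z = (features - features.mean(axis=0)) / features.std(axis=0)

print(z.round(2))
```

After this step every column has mean 0 and standard deviation 1, so all three features contribute on a comparable scale to the Euclidean distances used by k-means.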

#### 3.1 Feature selection

In [11]:
#Standardizing dataset
#Casting to float before fitting avoids the dtype conversion warning
from sklearn import preprocessing
X = mall_data[['GLA','Distance_center','Years_operation']]
X = preprocessing.StandardScaler().fit_transform(X.astype(float))
X


array([[ 3.31741588, -0.86330627,  0.28865801],
       [-0.18351056, -1.39790175, -1.31802336],
       [ 2.41395099,  0.17618495,  1.14555474],
       [ 0.44515042, -0.35841053,  0.82421846],
       [ 0.80653638,  0.68108069,  0.93133055],
       [-0.44336965,  0.2058847 ,  0.93133055],
       [ 2.07515166, -0.44750978,  0.28865801],
       [-0.82158264, -0.89300602, -1.10379918],
       [-0.87993141, -1.75429874, -1.21091127],
       [-0.62771413,  0.94837843, -0.03267827],
       [ 0.52043916,  1.33447517,  1.03844265],
       [-0.48466553,  0.08708571, -0.88957499],
       [-0.39296384, -1.04150476,  0.82421846],
       [ 0.19293314,  0.05738596, -0.99668708],
       [-0.4620789 ,  0.82957943,  0.07443383],
       [ 0.04235566, -1.66519949,  3.9304691 ],
       [-0.86110923,  0.35438345,  1.14555474],
       [-0.67288738,  0.1464852 ,  0.82421846],
       [-0.71053175, -1.368202  , -1.31802336],
       [-0.99286452, -1.45730125,  0.93133055],
       [-0.34538136,  0.3246837 , -0.675

#### 3.2 Segmentation of shopping centers

In [12]:
#Running KMeans on the dataset
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 1, 3, 0, 0, 0, 3, 1, 1, 2], dtype=int32)

Adding the cluster labels to the original dataset

In [13]:
# add clustering labels to the original dataset
mall_data_clustered = mall_data.reset_index()
mall_data_clustered.insert(0, 'Cluster Labels', kmeans.labels_)

mall_data_clustered.head()

Unnamed: 0,Cluster Labels,index,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Year,GLA,Distance_center,Years_operation
0,3,0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,Traditional Mall,2004,117000,3.1,15
1,1,1,In construction,ArtN,Warszawa,Mazowieckie,00-841,ul. zelazna 51/53,20.991977,52.232447,mixed-use,2020,24000,1.3,0
2,3,2,Open,Atrium Promenada,Warszawa,Mazowieckie,04-175,ul. Ostrobramska 75C,21.106915,52.232908,Traditional Mall,1996,93000,6.6,23
3,0,3,Open,Atrium Reduta,Warszawa,Mazowieckie,02-326,Al. Jerozolimskie 148,20.951726,52.212434,Traditional Mall,1999,40700,4.8,20
4,0,4,Open,Atrium Targowek,Warszawa,Mazowieckie,03-287,ul. Glebocka 15,21.05922,52.303658,Traditional Mall,1998,50300,8.3,21


Visualizing the clusters

In [14]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.viridis(np.linspace(0, 1, len(ys)))
viridis = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mall_data_clustered['GEO_Y'], mall_data_clustered['GEO_X'], mall_data_clustered['Name'], mall_data_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=4,
        popup=label,
        color=viridis[cluster-1],
        fill=True,
        fill_color=viridis[cluster-1],
        fill_opacity=1).add_to(map_clusters)
    


       
map_clusters

#### 3.3 Understanding the characteristics of each cluster of shopping malls

In [15]:
#Cluster 0
cluster0 = mall_data_clustered.loc[mall_data_clustered['Cluster Labels']==0] #Yellow dots in the map
cluster0

Unnamed: 0,Cluster Labels,index,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Year,GLA,Distance_center,Years_operation
3,0,3,Open,Atrium Reduta,Warszawa,Mazowieckie,02-326,Al. Jerozolimskie 148,20.951726,52.212434,Traditional Mall,1999,40700,4.8,20
4,0,4,Open,Atrium Targowek,Warszawa,Mazowieckie,03-287,ul. Glebocka 15,21.05922,52.303658,Traditional Mall,1998,50300,8.3,21
5,0,5,Open,Auchan Modlinska,Warszawa,Mazowieckie,03-216,ul. Modlinska 8,20.999779,52.294924,Traditional Mall,1998,17097,6.7,21
10,0,10,Open,Centrum Ursynow,Warszawa,Mazowieckie,02-801,ul. Pulawska 427,21.025902,52.141391,Traditional Mall,1997,42700,10.5,22
15,0,15,Open,DT WarsSawaJunior,Warszawa,Mazowieckie,00-017,ul. Marszalkowska 104/122,21.01084,52.232159,mixed-use,1970,30000,0.4,49
16,0,16,Open,E.Leclerc Aspekt (Bielany),Warszawa,Mazowieckie,01-904,ul. Aspekt 79,20.931104,52.278005,Traditional Mall,1996,6000,7.2,23
17,0,17,Open,E.Leclerc Jutrzenki,Warszawa,Mazowieckie,02-231,ul. Jutrzenki 156,20.935849,52.198934,Traditional Mall,1999,11000,6.5,20
21,0,21,Open,Factory Ursus,Warszawa,Mazowieckie,02-495,Plac Czerwca 1976 r. 6,20.89367,52.20134,outlet,2002,19900,8.8,17
23,0,23,Open,Galeria Bemowo,Warszawa,Mazowieckie,01-466,ul. Powstancow slaski126,20.93009,52.26438,Traditional Mall,1999,40910,6.4,20
26,0,26,Open,Galeria pod Debami,Warszawa,Mazowieckie,03-137,ul. Paslecka 8D,20.9483,52.33237,Traditional Mall,2000,2741,11.6,19


The first cluster seems to be a group of relatively older shopping centers situated in the residential districts of the city. Most malls in this cluster are small to medium sized: in the rows shown above, GLA ranges from about 3k to 50k sq.m.

In [14]:
#Cluster 1
cluster1 = mall_data_clustered.loc[mall_data_clustered['Cluster Labels']==1] #Purple dots in the map
cluster1

Unnamed: 0,Cluster Labels,index,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Year,GLA,Distance_center,Years_operation
1,1,1,In construction,ArtN,Warszawa,Mazowieckie,00-841,ul. zelazna 51/53,20.991977,52.232447,mixed-use,2020,24000,1.3,0
7,1,7,In construction,CEDET,Warszawa,Mazowieckie,,,21.053202,52.227876,Traditional Mall,2017,7050,3.0,2
8,1,8,Open,Centrum Marszalkowska,Warszawa,Mazowieckie,00-057,Marszalkowska 126,21.009273,52.234763,mixed-use,2018,5500,0.1,1
12,1,12,Open,Dom Mody Klif,Warszawa,Mazowieckie,01-042,ul. Okopowa 58/72,20.97958,52.247259,Traditional Mall,1999,18436,2.5,20
18,1,18,In construction,Elektrownia Powisle,Warszawa,Mazowieckie,,Elektryczna 2,21.029445,52.239531,mixed-use,2019,10000,1.4,0
19,1,19,In construction,Ethos,Warszawa,Mazowieckie,00-499,pl. TrzeKrzyzy 10/14,21.022925,52.229399,mixed-use,1998,2500,1.1,21
31,1,31,Open,Galeria Wilenska,Warszawa,Mazowieckie,03-734,ul. Targowa 72,21.03595,52.2546,Traditional Mall,2002,40000,2.8,17
32,1,32,Open,Hala Koszyki,Warszawa,Mazowieckie,00-646,ul. Koszykowa 63,21.01148,52.22253,Traditional Mall,2016,7500,1.4,3
36,1,36,Open,Koneser Centrum Praskie,Warszawa,Mazowieckie,03-736,ul. Zabkowska 27/31,21.04506,52.25481,Traditional Mall,2018,21000,3.2,1
37,1,37,Open,Metropol Dom i Wnetrze,Warszawa,Mazowieckie,03-301,ul. Jagiellonska 82,21.01817,52.27441,Speciality mall,2010,13500,4.4,9


The second cluster seems to be a group of shopping centers situated in or close to the city center. Most of these shopping malls are relatively new and small.

In [15]:
#Cluster 2
cluster2 = mall_data_clustered.loc[mall_data_clustered['Cluster Labels']==2] #Blue dots in the map
cluster2

Unnamed: 0,Cluster Labels,index,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Year,GLA,Distance_center,Years_operation
9,2,9,Open,Centrum Skorosze,Warszawa,Mazowieckie,02-497,ul. gen. Slawoja-Skladkowskiego 4,20.90022,52.188364,Traditional Mall,2007,12200,9.2,12
11,2,11,Open,Centrum lopuszanska 22,Warszawa,Mazowieckie,02-220,ul. lopuszanska 22,20.951214,52.191931,Speciality mall,2015,16000,6.3,4
13,2,13,Open,DomExpo,Warszawa,Mazowieckie,03-216,ul. Modlinska 4,21.003569,52.291336,Speciality mall,2016,34000,6.2,3
14,2,14,Open,Domoteka (czesc PH Targowek),Warszawa,Mazowieckie,03-286,ul. Malborska 41,21.079861,52.302399,Speciality mall,2006,16600,8.8,13
20,2,20,Open,Factory Annopol,Warszawa,Mazowieckie,03-236,ul. Annopol 2,21.023576,52.299068,outlet,2013,19700,7.1,6
22,2,22,Open,Ferio Wawer,Warszawa,Mazowieckie,04-738,ul. Szpotanskiego 6,21.16804,52.2061,Traditional Mall,2015,12300,11.2,4
28,2,28,Open,Galeria Renova,Warszawa,Mazowieckie,03-352,ul. Rembielinska 20,21.03028,52.28959,Traditional Mall,2008,12800,6.2,11
29,2,29,Open,Galeria Rondo Wiatraczna,Warszawa,Mazowieckie,04-077,Grochowska 207,21.087879,52.244967,Traditional Mall,2018,11000,5.4,1
39,2,39,Open,PH Zielony Targowek,Warszawa,Mazowieckie,03-287,ul. Glebocka 13,21.05777,52.30012,Retail park,2007,24985,7.9,12
42,2,42,Open,Plac Vogla,Warszawa,Mazowieckie,02-963,ul. Syta 98,21.11045,52.1647,Traditional Mall,2015,5200,10.4,4


This cluster consists of medium sized malls situated in residential districts of the city that are newer or upcoming constructions.

In [16]:
#Cluster 3
cluster3 = mall_data_clustered.loc[mall_data_clustered['Cluster Labels']==3] #Green dots in the map
cluster3

Unnamed: 0,Cluster Labels,index,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Year,GLA,Distance_center,Years_operation
0,3,0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,Traditional Mall,2004,117000,3.1,15
2,3,2,Open,Atrium Promenada,Warszawa,Mazowieckie,04-175,ul. Ostrobramska 75C,21.106915,52.232908,Traditional Mall,1996,93000,6.6,23
6,3,6,Open,Blue City,Warszawa,Mazowieckie,02-222,Al. Jerozolimskie 179,20.955113,52.213058,Traditional Mall,2004,84000,4.5,15
24,3,24,Open,Galeria Mokotow,Warszawa,Mazowieckie,02-675,ul. Woloska 12,21.00401,52.17993,Traditional Mall,2000,68500,6.2,19
25,3,25,In construction,Galeria Mlociny,Warszawa,Mazowieckie,01-943,"Zgrupowania AK ""Kampinos"" 15",20.927194,52.293985,Traditional Mall,2019,75000,8.7,0
27,3,27,Open,Galeria Polnocna,Warszawa,Mazowieckie,03-144,swiatowida,20.94375,52.338792,Traditional Mall,2017,64500,12.4,2
30,3,30,Planned,Galeria Wilanow,Warszawa,Mazowieckie,,Przyczolkowa 370,21.051414,52.161515,Traditional Mall,2016,60000,8.7,3
33,3,33,Open,Homepark Targowek,Warszawa,Mazowieckie,03-286,ul. Malborska 51-53,21.08336,52.30521,Retail park,2006,90600,9.2,13
38,3,38,Open,PH Centrum Krakowska,Warszawa,Mazowieckie,02-183,Al. Krakowska 61,20.936983,52.170121,Retail park,2001,56400,8.8,18
57,3,57,Open,Wola Park,Warszawa,Mazowieckie,01-460,ul. Gorczewska 124,20.93087,52.24131,Traditional Mall,2002,77000,5.5,17


The last cluster comprises large shopping malls situated in the business districts of Warsaw. Moreover, these malls seem to be middle-aged.

#### 3.4 Conclusion

To summarize, we see 4 broad segments of shopping centers in Warsaw based on their location, size and years of operation:
- Cluster 0 - Older, medium sized malls, located in residential districts of Warsaw
- Cluster 1 - Smaller sized units, located mainly in the city center of Warsaw
- Cluster 2 - New/upcoming, mainly medium sized malls, located in residential districts of Warsaw
- Cluster 3 - Middle-aged, large malls, located in business districts of Warsaw
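
A compact way to back such a summary with numbers is a per-cluster mean of the three features. The sketch below uses only two example rows per cluster, copied from the cluster tables above, so the resulting averages are illustrative and not the true cluster means; on the full data the same `groupby` would be applied to `mall_data_clustered`.

```python
import pandas as pd

# Two example rows per cluster, copied from the cluster tables above
sample = pd.DataFrame({
    'Cluster Labels': [0, 0, 1, 1, 2, 2, 3, 3],
    'Name': ['Atrium Reduta', 'Atrium Targowek', 'ArtN', 'Centrum Marszalkowska',
             'Centrum Skorosze', 'DomExpo', 'Arkadia', 'Atrium Promenada'],
    'GLA': [40700, 50300, 24000, 5500, 12200, 34000, 117000, 93000],
    'Distance_center': [4.8, 8.3, 1.3, 0.1, 9.2, 6.2, 3.1, 6.6],
    'Years_operation': [20, 21, 0, 1, 12, 3, 15, 23],
})

# Average size, distance from the center and age per cluster
profile = (sample
           .groupby('Cluster Labels')[['GLA', 'Distance_center', 'Years_operation']]
           .mean()
           .round(1))
print(profile)
```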

# 4. How are malls that are situated in city center/commercial district different from those situated in residential districts in terms of size and store formats?

As we saw in the previous section, clusters 0 and 2 contain shopping centers in the residential districts, while clusters 1 and 3 contain shopping centers located in the city center or business districts of the city.

To understand how store formats differ between these locations, we can use Foursquare data to get the venues near each shopping center. With a radius of 250 m, we assume that most of the venues returned by Foursquare are located within the shopping malls.

Accessing Foursquare data.
Define Foursquare credentials and Version

In [16]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # Foursquare ID (placeholder, supply your own)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' #  Foursquare Secret (placeholder, supply your own)
VERSION = '20180605' # Foursquare API version
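
Credentials are easier to keep out of a shared notebook when they are read from environment variables. The following is a minimal sketch of that idea; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not something Foursquare prescribes.

```python
import os

# Hypothetical variable names; set them in the shell before starting Jupyter, e.g.
#   export FOURSQUARE_CLIENT_ID=...
#   export FOURSQUARE_CLIENT_SECRET=...
def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) or raise if either one is missing."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret
```

In the notebook this could be used as `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`.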

In [17]:
import requests

#### 4.1 Exploring store formats of all shopping centers of Warsaw

Let's create a function to analyze the venues around each shopping center in Warsaw

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=250, LIMIT = 50):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Mall name', 
                  'Mall Latitude', 
                  'Mall Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
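
The function above indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. A more defensive way to flatten the items might look like the sketch below. It assumes the same response shape that `getNearbyVenues` relies on; the helper name `extract_venue_rows`, the `'Unknown'` fallback label and the hand-written `fake_items` payload are all introduced here for illustration.

```python
def extract_venue_rows(mall_name, mall_lat, mall_lng, items):
    """Flatten Foursquare 'explore' items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to a placeholder label when a venue has no category
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((mall_name, mall_lat, mall_lng,
                     venue.get('name'), location.get('lat'), location.get('lng'),
                     category))
    return rows

# Hand-written stand-in for results["response"]["groups"][0]["items"]
fake_items = [
    {'venue': {'name': 'Starbucks',
               'location': {'lat': 52.256573, 'lng': 20.983657},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed kiosk',
               'location': {'lat': 52.2565, 'lng': 20.9836},
               'categories': []}},
]

rows = extract_venue_rows('Arkadia', 52.256346, 20.981652, fake_items)
print(rows)
```

Separating the parsing from the HTTP request also makes this part testable without network access.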

Now we run the above function on each shopping mall and create a new dataframe called warsaw_venues.

In [19]:
warsaw_venues = getNearbyVenues(names=mall_data['Name'],
                                   latitudes=mall_data['GEO_Y'],
                                   longitudes=mall_data['GEO_X']
                                  )



Arkadia
ArtN
Atrium Promenada
Atrium Reduta
Atrium Targowek
Auchan Modlinska
Blue City
CEDET
Centrum Marszalkowska
Centrum Skorosze
Centrum Ursynow
Centrum lopuszanska 22
Dom Mody Klif
DomExpo
Domoteka (czesc PH Targowek)
DT WarsSawaJunior
E.Leclerc Aspekt (Bielany)
E.Leclerc Jutrzenki
Elektrownia Powisle
Ethos
Factory Annopol
Factory Ursus
Ferio Wawer
Galeria Bemowo
Galeria Mokotow
Galeria Mlociny
Galeria pod Debami
Galeria Polnocna
Galeria Renova
Galeria Rondo Wiatraczna
Galeria Wilanow
Galeria Wilenska
Hala Koszyki
Homepark Targowek
KEN Center
King Cross Praga
Koneser Centrum Praskie
Metropol Dom i Wnetrze
PH Centrum Krakowska
PH Zielony Targowek
Plac Trzech Krzyzy
Plac Unii City Shopping
Plac Vogla
Quick Park Okecie
Royal Wilanow
Sadyba Best Mall
Tesco Goclaw
Tesco Gorczewska
Tesco Kabaty
Tesco Polczynska
Tesco Stalowa
Ulica Marszalkowska
Ulica Mokotowska
Ulica Nowy swiat
Uniwersam Grochow
Vis a Vis Warszawa
Wodny Retail park
Wola Park
Zlote Tarasy


In [20]:
print(warsaw_venues.shape)
warsaw_venues.head()

(762, 7)


Unnamed: 0,Mall name,Mall Latitude,Mall Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Arkadia,52.256346,20.981652,CH Arkadia,52.257244,20.98453,Shopping Mall
1,Arkadia,52.256346,20.981652,Starbucks,52.256573,20.983657,Coffee Shop
2,Arkadia,52.256346,20.981652,Skok na Sok,52.257224,20.984014,Juice Bar
3,Arkadia,52.256346,20.981652,Peek & Cloppenburg,52.256308,20.985159,Clothing Store
4,Arkadia,52.256346,20.981652,Zielona,52.257217,20.984577,Vegetarian / Vegan Restaurant


In [21]:
#Merging the venues information to main dataset
mall_data_merged = mall_data_clustered.join(warsaw_venues.set_index('Mall name'), on = 'Name', how = 'right')
mall_data_merged.head()

Unnamed: 0,Cluster Labels,index,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,...,Year,GLA,Distance_center,Years_operation,Mall Latitude,Mall Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,3,0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,...,2004,117000,3.1,15,52.256346,20.981652,CH Arkadia,52.257244,20.98453,Shopping Mall
0,3,0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,...,2004,117000,3.1,15,52.256346,20.981652,Starbucks,52.256573,20.983657,Coffee Shop
0,3,0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,...,2004,117000,3.1,15,52.256346,20.981652,Skok na Sok,52.257224,20.984014,Juice Bar
0,3,0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,...,2004,117000,3.1,15,52.256346,20.981652,Peek & Cloppenburg,52.256308,20.985159,Clothing Store
0,3,0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,...,2004,117000,3.1,15,52.256346,20.981652,Zielona,52.257217,20.984577,Vegetarian / Vegan Restaurant


#### 4.2 Analyzing residential and commercial clusters 

To understand the difference in store formats between residential and city-center/commercial malls, we apply one-hot encoding to determine the frequency of each venue category in each shopping mall.
We then create a variable that labels each mall as residential or commercial. Lastly, we compute the average frequency of each venue category for residential and commercial shopping malls to identify the differences in store formats.
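
The idea behind this step is that the mean of a 0/1 indicator column equals the share of venues in that category. The sketch below shows this on a small hypothetical table (the five venues and their labels are made up for illustration and are not taken from the Foursquare results).

```python
import pandas as pd

# Hypothetical venues for two made-up segments
toy = pd.DataFrame({
    'Res_Comm_label': ['commercial', 'commercial', 'commercial', 'residential', 'residential'],
    'Venue Category': ['Coffee Shop', 'Cocktail Bar', 'Coffee Shop', 'Supermarket', 'Coffee Shop'],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(toy['Venue Category'], dtype=int)

# The mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby(toy['Res_Comm_label']).mean()
print(freq.round(2))
```

Each row of `freq` sums to 1, so the values can be read directly as the category mix of a segment.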

In [22]:
# one hot encoding
warsaw_onehot = pd.get_dummies(mall_data_merged[['Venue Category']], prefix="", prefix_sep="")


# add mall name and cluster label columns back to dataframe
warsaw_onehot['Mall Name'] = mall_data_merged['Name']
warsaw_onehot['Cluster'] = mall_data_merged['Cluster Labels']

#Defining function to create residential/Commercial label based on Cluster labels
def f(row):
    if (row['Cluster'] == 0) | (row['Cluster'] == 2):
        value = 'residential'
    else:
        value = 'commerical'
    return value

#creating the new column        
warsaw_onehot['Res_Comm_label'] = warsaw_onehot.apply(f, axis = 1)

# move last columns to the first columns
fixed_columns = list(warsaw_onehot.columns[-3:]) + list(warsaw_onehot.columns[:-3])
warsaw_onehot = warsaw_onehot[fixed_columns]

warsaw_onehot.head()

Unnamed: 0,Mall Name,Cluster,Res_Comm_label,African Restaurant,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,...,Train Station,Tram Station,Trattoria/Osteria,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop,Women's Store,Yoga Studio
0,Arkadia,3,commerical,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,3,commerical,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,3,commerical,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,3,commerical,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,3,commerical,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [23]:
warsaw_comm_res = warsaw_onehot.drop(['Cluster'], axis = 1)
warsaw_comm_res.head()

Unnamed: 0,Mall Name,Res_Comm_label,African Restaurant,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bar,...,Train Station,Tram Station,Trattoria/Osteria,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop,Women's Store,Yoga Studio
0,Arkadia,commerical,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,commerical,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,commerical,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,commerical,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,commerical,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


Next, let's group the rows by the residential/commercial label and take the mean of the frequency of occurrence of each category

In [24]:
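#Calculating average frequencies of venue category for residential and commercial clusters
#(the text column 'Mall Name' is dropped first so that only the 0/1 indicator columns are averaged)
warsaw_grouped = warsaw_comm_res.drop(['Mall Name'], axis = 1).groupby('Res_Comm_label').mean().reset_index()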
warsaw_grouped.head()

Unnamed: 0,Res_Comm_label,African Restaurant,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bar,Beach,...,Train Station,Tram Station,Trattoria/Osteria,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop,Women's Store,Yoga Studio
0,commerical,0.00202,0.008081,0.0,0.00202,0.016162,0.0,0.010101,0.016162,0.0,...,0.00202,0.0,0.00202,0.00404,0.00202,0.014141,0.006061,0.00202,0.006061,0.006061
1,residential,0.0,0.003745,0.003745,0.0,0.018727,0.003745,0.007491,0.0,0.003745,...,0.003745,0.003745,0.0,0.0,0.0,0.003745,0.003745,0.0,0.003745,0.0


Let's print each segment (commercial and residential) along with its top 10 most common venues

In [25]:
num_top_venues = 10

for Name in warsaw_grouped['Res_Comm_label']:
    print(Name)
    temp = warsaw_grouped[warsaw_grouped['Res_Comm_label'] == Name].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

commerical
                  venue  freq
0           Coffee Shop  0.06
1                  Café  0.05
2  Fast Food Restaurant  0.04
3          Cocktail Bar  0.04
4        Clothing Store  0.04
5            Restaurant  0.03
6             Bookstore  0.02
7          Dessert Shop  0.02
8         Shopping Mall  0.02
9                 Plaza  0.02


residential
                  venue  freq
0           Coffee Shop  0.05
1          Dessert Shop  0.04
2         Shopping Mall  0.04
3           Supermarket  0.04
4        Clothing Store  0.04
5  Fast Food Restaurant  0.03
6           Pizza Place  0.03
7     Electronics Store  0.03
8                  Café  0.03
9   Sporting Goods Shop  0.03




In [26]:
#helper that returns the most common venue categories for one row of the grouped dataframe
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
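
As a quick illustration of what this helper returns, the sketch below repeats the function and applies it to a single hypothetical row. The label `example` and the frequencies are made up for the demonstration. Note that the helper ranks the unrounded frequencies, whereas the printed listings above sort after rounding to two decimals, so the two orderings can differ slightly.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the group label), then rank the remaining frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row: a label followed by invented category frequencies
toy_row = pd.Series({
    "Res_Comm_label": "example",
    "Bakery": 0.10,
    "Coffee Shop": 0.50,
    "Supermarket": 0.30,
    "Bar": 0.05,
})

top3 = return_most_common_venues(toy_row, 3)
print(list(top3))
```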

In [27]:
#Now let's create the new dataframe and display the top 10 venues for each segment
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Res_Comm_label']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
mall_venues_sorted = pd.DataFrame(columns=columns)
mall_venues_sorted['Res_Comm_label'] = warsaw_grouped['Res_Comm_label']

for ind in np.arange(warsaw_grouped.shape[0]):
    mall_venues_sorted.iloc[ind, 1:] = return_most_common_venues(warsaw_grouped.iloc[ind, :], num_top_venues)

mall_venues_sorted.head()

Unnamed: 0,Res_Comm_label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,commerical,Coffee Shop,Café,Cocktail Bar,Clothing Store,Fast Food Restaurant,Restaurant,Dessert Shop,Hotel,Shopping Mall,Plaza
1,residential,Coffee Shop,Dessert Shop,Supermarket,Clothing Store,Shopping Mall,Fast Food Restaurant,Electronics Store,Sporting Goods Shop,Café,Pizza Place
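
The column labels above are built with a `try`/`except` around a three-element suffix list, which works for the top 10 but would label rank 21 as "21th". One possible alternative, shown here only as a sketch, is a small helper that computes the English ordinal suffix directly. The name `ordinal_label` is hypothetical and does not appear in the notebook.

```python
def ordinal_label(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        # 10th to 20th (and 110th to 120th, ...) always take 'th'
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

num_top_venues = 10

# Same column layout as in the notebook: the label column, then one column per rank
columns = ["Res_Comm_label"] + [
    f"{ordinal_label(rank)} Most Common Venue"
    for rank in range(1, num_top_venues + 1)
]
print(columns)
```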


#### 4.3 Conclusion

The table above shows that stores in shopping malls located in the city center/commercial districts of Warsaw are more 'entertainment based', with a focus on cafes, cocktail bars and restaurants. This could be because these districts also attract a large volume of tourists.

On the other hand, shopping malls located in residential districts are dominated by 'utility based' stores such as supermarkets, electronics stores and sporting goods shops. These stores are family oriented and exist to cater to the household needs of the city.
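
One way to quantify this contrast is to subtract the residential frequency from the commercial one for each category. The sketch below uses only the rounded frequencies from the two printed top-10 listings in section 4, restricted to the six categories that appear in both; categories listed for one segment only (for example Cocktail Bar or Supermarket) are left out because their rounded value in the other segment is not shown. The column names are chosen for the example.

```python
import pandas as pd

# Rounded frequencies copied from the printed top-10 listings;
# only categories present in both listings are included.
freq = pd.DataFrame(
    {
        "commercial":  [0.06, 0.05, 0.04, 0.04, 0.02, 0.02],
        "residential": [0.05, 0.03, 0.03, 0.04, 0.04, 0.04],
    },
    index=["Coffee Shop", "Café", "Fast Food Restaurant",
           "Clothing Store", "Dessert Shop", "Shopping Mall"],
)

# Positive values: relatively more common near commercial-district malls
freq["difference"] = (freq["commercial"] - freq["residential"]).round(2)
print(freq.sort_values("difference", ascending=False))
```

On the full, unrounded `warsaw_grouped` table the same subtraction could be applied across all categories, which would also cover those that appear in only one of the top-10 listings.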

# 5. The client is looking to open/acquire a large shopping mall (60,000 sq. m) near the city center. What should the typical stores in the mall be, based on the current environment?

#### 5.1 Identifying the relevant cluster for the client

Based on the identified characteristics of shopping malls in the city, the typical store format of a large shopping mall near the city center should resemble Cluster 3.

Let's obtain the prevailing store formats of Cluster 3 shopping malls.

In [28]:
warsaw_cluster = warsaw_onehot.drop(['Res_Comm_label'], axis = 1)
warsaw_cluster.head()

Unnamed: 0,Mall Name,Cluster,African Restaurant,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bar,...,Train Station,Tram Station,Trattoria/Osteria,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop,Women's Store,Yoga Studio
0,Arkadia,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,Arkadia,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [29]:
#Calculating average frequencies of venue category for each cluster
#numeric_only=True skips the text column 'Mall Name', which cannot be averaged
warsaw_grouped = warsaw_cluster.groupby('Cluster').mean(numeric_only=True).reset_index()
warsaw_grouped.head()

Unnamed: 0,Cluster,African Restaurant,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bar,Beach,...,Train Station,Tram Station,Trattoria/Osteria,Turkish Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop,Women's Store,Yoga Studio
0,0,0.0,0.0,0.0,0.0,0.011696,0.005848,0.005848,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.005848,0.005848,0.0,0.005848,0.0
1,1,0.0,0.006042,0.0,0.003021,0.015106,0.0,0.012085,0.024169,0.0,...,0.003021,0.0,0.0,0.006042,0.003021,0.015106,0.009063,0.003021,0.006042,0.009063
2,2,0.0,0.010417,0.010417,0.0,0.03125,0.0,0.010417,0.0,0.010417,...,0.010417,0.010417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,0.006098,0.012195,0.0,0.0,0.018293,0.0,0.006098,0.0,0.0,...,0.0,0.0,0.006098,0.0,0.0,0.012195,0.0,0.0,0.006098,0.0


In [30]:
#Now let's create the new dataframe and display the top 10 venues for each cluster
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Cluster']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
mall_venues_sorted = pd.DataFrame(columns=columns)
mall_venues_sorted['Cluster'] = warsaw_grouped['Cluster']

for ind in np.arange(warsaw_grouped.shape[0]):
    mall_venues_sorted.iloc[ind, 1:] = return_most_common_venues(warsaw_grouped.iloc[ind, :], num_top_venues)

mall_venues_sorted.head()

Unnamed: 0,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Coffee Shop,Electronics Store,Clothing Store,Pizza Place,Dessert Shop,Supermarket,Shopping Mall,Sandwich Place,Fast Food Restaurant,Restaurant
1,1,Coffee Shop,Café,Cocktail Bar,Restaurant,Hotel,Plaza,Dessert Shop,Bar,Polish Restaurant,Hostel
2,2,Bus Station,Shopping Mall,Dessert Shop,Supermarket,Fast Food Restaurant,Coffee Shop,Sporting Goods Shop,Café,Furniture / Home Store,Clothing Store
3,3,Clothing Store,Fast Food Restaurant,Coffee Shop,Shopping Mall,Pizza Place,Electronics Store,Bookstore,Cosmetics Shop,Supermarket,Café


#### 5.2 Results

As the table shows, Cluster 3 shopping malls have a wide variety of store categories, including fast food restaurants, clothing stores, electronics stores and supermarkets. Owing to their size and location, these malls are designed to cater to most visitor needs, both entertainment based and utility based.
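
To pull out only the row that matters for the client, the Cluster 3 entry can be filtered from the summary table. The sketch below rebuilds two rows of that table by hand (the Cluster 1 and Cluster 3 values are copied from the output in section 5.1). The names `cluster_summary` and `top_cols` and the shortened column labels are assumptions made for the example.

```python
import pandas as pd

# Shortened, hypothetical column names standing in for
# '1st Most Common Venue' ... '10th Most Common Venue'
top_cols = [f"Top {i}" for i in range(1, 11)]

# Two rows copied from the cluster summary table in section 5.1
cluster_summary = pd.DataFrame(
    [
        [1, "Coffee Shop", "Café", "Cocktail Bar", "Restaurant", "Hotel",
         "Plaza", "Dessert Shop", "Bar", "Polish Restaurant", "Hostel"],
        [3, "Clothing Store", "Fast Food Restaurant", "Coffee Shop",
         "Shopping Mall", "Pizza Place", "Electronics Store", "Bookstore",
         "Cosmetics Shop", "Supermarket", "Café"],
    ],
    columns=["Cluster"] + top_cols,
)

# Keep only the Cluster 3 row and list its venue categories in rank order
cluster3 = cluster_summary.loc[cluster_summary["Cluster"] == 3, top_cols].iloc[0].tolist()
print(cluster3)
```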

In [31]:
mall_data

Unnamed: 0,Status,Name,City,Voivodeship,Postal_Code,Address,GEO_X,GEO_Y,Type,Year,GLA,Distance_center,Years_operation
0,Open,Arkadia,Warszawa,Mazowieckie,00-175,Al. Jana Pawla II 82,20.981652,52.256346,Traditional Mall,2004,117000,3.1,15
1,In construction,ArtN,Warszawa,Mazowieckie,00-841,ul. zelazna 51/53,20.991977,52.232447,mixed-use,2020,24000,1.3,0
2,Open,Atrium Promenada,Warszawa,Mazowieckie,04-175,ul. Ostrobramska 75C,21.106915,52.232908,Traditional Mall,1996,93000,6.6,23
3,Open,Atrium Reduta,Warszawa,Mazowieckie,02-326,Al. Jerozolimskie 148,20.951726,52.212434,Traditional Mall,1999,40700,4.8,20
4,Open,Atrium Targowek,Warszawa,Mazowieckie,03-287,ul. Glebocka 15,21.05922,52.303658,Traditional Mall,1998,50300,8.3,21
5,Open,Auchan Modlinska,Warszawa,Mazowieckie,03-216,ul. Modlinska 8,20.999779,52.294924,Traditional Mall,1998,17097,6.7,21
6,Open,Blue City,Warszawa,Mazowieckie,02-222,Al. Jerozolimskie 179,20.955113,52.213058,Traditional Mall,2004,84000,4.5,15
7,In construction,CEDET,Warszawa,Mazowieckie,,,21.053202,52.227876,Traditional Mall,2017,7050,3.0,2
8,Open,Centrum Marszalkowska,Warszawa,Mazowieckie,00-057,Marszalkowska 126,21.009273,52.234763,mixed-use,2018,5500,0.1,1
9,Open,Centrum Skorosze,Warszawa,Mazowieckie,02-497,ul. gen. Slawoja-Skladkowskiego 4,20.90022,52.188364,Traditional Mall,2007,12200,9.2,12
