# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In this project we look for optimal warehouse locations. The report is aimed at management interested in opening new facilities to streamline logistics for a supermarket chain within two districts of **Moscow, Russia**.

Because many stores need regular deliveries, we will try to identify **locations that allow routes to be optimized and expenses to be minimized**.

We will use data science methods to generate a few of the most promising locations based on these criteria. The advantages of each location will be quantified so that management can choose the best final location.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:

* number and locations of delivery points (stores) in the area
* number of warehouses we should set up to supply the stores effectively and reduce costs
* number of warehouses available for rent around the optimum points
* distance from each store to the optimal warehouse

We define our neighborhoods as the area within a 5 km radius of each district center.

The following data sources will be needed to extract or generate the required information:

* coordinates of district centers will be obtained using the **GeoPy** library (Nominatim geocoder)
* the number and locations of chain stores in every area will be obtained using the **Foursquare API**
* centers of optimal locations will be generated algorithmically using a clustering algorithm
* the list of warehouses for rent will be obtained using the **Yandex Realty** service


Let's start by importing the required tools.

In [24]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, available in pandas 1.0+)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import warnings                                  # 'do not disturb' mode: suppress warnings
warnings.filterwarnings('ignore')

print('Libraries imported.')

Libraries imported.


### Exploration Area

Let's obtain latitude and longitude coordinates for the centers of the **South-Eastern and Southern Administrative Okrugs (Districts)** by passing each district's name to the **GeoPy** geocoder, which hopefully recognizes them.

In [25]:
address = 'Moscow, South-Eastern Administrative Okrug'

geolocator = Nominatim(user_agent="mos_dist")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(address, 'coordinates are {}, {}.'.format(latitude, longitude))

Moscow, South-Eastern Administrative Okrug coordinates are 55.6993352, 37.765247853023425.


In [26]:
seao_center = [address, latitude, longitude]
seao_center

['Moscow, South-Eastern Administrative Okrug', 55.6993352, 37.765247853023425]

In [27]:
address1 = 'Moscow, Southern Administrative Okrug'

geolocator = Nominatim(user_agent="mos_dist")
location = geolocator.geocode(address1)
latitude1 = location.latitude
longitude1 = location.longitude
print(address1, 'coordinates are {}, {}.'.format(latitude1, longitude1))

Moscow, Southern Administrative Okrug coordinates are 55.64796495, 37.644082606766474.


In [28]:
sao_center = [address1, latitude1, longitude1]
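
Nominatim's `geocode` returns `None` when it finds no match, so `location.latitude` would raise an `AttributeError` for an unrecognized district name. Below is a minimal sketch of a guard. The helper `geocode_center` and the stub geocoder are hypothetical and not part of the original notebook; the helper accepts any callable with the same contract as `geolocator.geocode`, which lets it be tried offline.

```python
from collections import namedtuple

# Stand-in for the object geopy returns; only the two attributes used here.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def geocode_center(geocode, address):
    """Return [address, latitude, longitude], or raise if nothing was found.

    `geocode` is any callable that returns an object with `.latitude` and
    `.longitude`, or None when the address is unknown.
    """
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return [address, location.latitude, location.longitude]


def stub_geocode(address):
    """Offline stub that knows a single address (values from the cell above)."""
    known = {
        "Moscow, Southern Administrative Okrug": FakeLocation(
            55.64796495, 37.644082606766474
        )
    }
    return known.get(address)


center = geocode_center(stub_geocode, "Moscow, Southern Administrative Okrug")
print(center)
```

In the notebook itself the call would look like `geocode_center(geolocator.geocode, address1)`.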

## Foursquare
Now that we have our district centers, let's use the Foursquare API to get information on chain supermarkets in the area.

We're interested in supermarkets of the chain named **"Perekrestok"**, so we specify that in the search.

In [29]:
CLIENT_ID = 'R1A0HJSZF23PZDHTNHIXVLYIIX0EYGBF411D4UUZIKPW440P' # your Foursquare ID
CLIENT_SECRET = 'P24XMBE41KNAS4W4CAAKRZCQO0LPKBH0432BTVZXV2LO10CX' 
VERSION = '20200622' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: R1A0HJSZF23PZDHTNHIXVLYIIX0EYGBF411D4UUZIKPW440P
CLIENT_SECRET:P24XMBE41KNAS4W4CAAKRZCQO0LPKBH0432BTVZXV2LO10CX


In [30]:
LIMIT=40
query = 'Перекресток'
radius = 5000
print(query + ' .... OK!')

Перекресток .... OK!


In [31]:
def getNearbyVenues(names, latitudes, longitudes):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            query,
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()
        
        # assign relevant part of JSON to venues
        venues = results['response']['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['categories'][0]['name']) for v in venues])

    nearby_venues = pd.DataFrame([venues for venue_list in venues_list for venues in venue_list])
    nearby_venues.columns = ['District', 
                  'District Latitude', 
                  'District Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude',          
                  'Venue Category']
    
    return(nearby_venues)
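
The function above assembles the request URL with `str.format` and relies on `requests` to escape the Cyrillic query. As an alternative, here is a small sketch that builds the same query string with the standard library's `urlencode`, which percent-encodes every parameter explicitly. The helper `build_search_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in `getNearbyVenues`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, version, lat, lng, query, radius, limit):
    """Build the venue search URL with all parameters percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials; the coordinates are the South-Eastern okrug center
url = build_search_url(
    "YOUR_ID", "YOUR_SECRET", "20200622",
    55.6993352, 37.765247853023425, "Перекресток", 5000, 40,
)
print(url)

# Decoding the query string recovers the original values
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"])
```

Passing the same dictionary as `requests.get(SEARCH_ENDPOINT, params=params)` would have the same effect without building the string by hand.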

In [32]:
moscow_venues = getNearbyVenues(names=[seao_center[0],sao_center[0]],latitudes=[seao_center[1],sao_center[1]],longitudes=[seao_center[2],sao_center[2]])
moscow_venues.shape
                                

Moscow, South-Eastern Administrative Okrug
Moscow, Southern Administrative Okrug


(80, 7)

In [33]:
moscow_venues.head(3)

|   | District | District Latitude | District Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Moscow, South-Eastern Administrative Okrug | 55.699335 | 37.765248 | Перекресток | 55.704134 | 37.765758 | Supermarket |
| 1 | Moscow, South-Eastern Administrative Okrug | 55.699335 | 37.765248 | Перекресток | 55.717705 | 37.745007 | Supermarket |
| 2 | Moscow, South-Eastern Administrative Okrug | 55.699335 | 37.765248 | Перекресток | 55.703226 | 37.792331 | Supermarket |


In [34]:
venues_map = folium.Map(location=[latitude1, longitude1], zoom_start=11) # generate map centred on the Southern district center (latitude1, longitude1)

# add the supermarkets as blue circle markers
for lat, lng in zip(moscow_venues['Venue Latitude'], moscow_venues['Venue Longitude']):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup='Supermarket',
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map

## Methodology <a name="methodology"></a>

In this project we direct our efforts at two areas of Moscow: the South-Eastern Administrative Okrug and the Southern Administrative Okrug. An okrug is an administrative district, roughly between a neighborhood and a borough in scale. Within the borders of these districts there are a number of chain supermarkets and candidate warehouses.

In the first step we collected the locations of the chain's supermarkets within 5 km of each district center (up to 40 per district, the request limit). We then split them into a few clusters (using k-means clustering) for more effective logistics.

In the second step we identify warehouses available for rent within our districts using the Yandex Realty service. To calculate distances between warehouses and cluster centers, we convert coordinates from the WGS84 spherical coordinate system (latitude/longitude in degrees) to the UTM Cartesian coordinate system (X/Y coordinates in meters).

In the third and final step we focus on the most promising warehouse locations according to how close they are to the optimum cluster centers, and present a map to support the search for an optimal facility location.



In [35]:
# set number of clusters
kclusters = 2

supermarket_clustering = moscow_venues[['Venue Latitude', 'Venue Longitude']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(supermarket_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
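
The clustering above runs on raw latitude/longitude degrees. At Moscow's latitude a degree of longitude covers roughly 56% of the ground distance of a degree of latitude, so Euclidean distances in degrees overweight east-west separations. One possible adjustment, which this notebook does not apply, is to convert the coordinates to approximate local meters before clustering. The sketch below uses an equirectangular approximation; `to_local_xy` and the Earth radius constant are assumptions introduced for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def to_local_xy(lat, lon, lat0, lon0):
    """Approximate east/north offsets in meters from a reference point."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


# Reference point: South-Eastern okrug center from the geocoding step
lat0, lon0 = 55.6993352, 37.765247853023425

# First store from the venues table
x, y = to_local_xy(55.704134, 37.765758, lat0, lon0)
print(round(x), round(y))
```

The resulting `x`/`y` pairs could be passed to `KMeans` in place of the degree columns, and the fitted centers converted back if needed.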

In [36]:
moscow_venues['Cluster']=kmeans.labels_
moscow_venues.head(3)

|   | District | District Latitude | District Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Cluster |
|---|---|---|---|---|---|---|---|---|
| 0 | Moscow, South-Eastern Administrative Okrug | 55.699335 | 37.765248 | Перекресток | 55.704134 | 37.765758 | Supermarket | 1 |
| 1 | Moscow, South-Eastern Administrative Okrug | 55.699335 | 37.765248 | Перекресток | 55.717705 | 37.745007 | Supermarket | 1 |
| 2 | Moscow, South-Eastern Administrative Okrug | 55.699335 | 37.765248 | Перекресток | 55.703226 | 37.792331 | Supermarket | 1 |


In [37]:
moscow_venues['Cluster'].value_counts()

1    40
0    40
Name: Cluster, dtype: int64

Each cluster contains 40 supermarkets, which is good because the potential warehouses will carry a similar load.

Next we get the coordinates of the cluster centers, which we treat as the optimum warehouse locations.

In [38]:
k_means_cluster_centers = kmeans.cluster_centers_
k_means_cluster_centers

array([[55.65652144, 37.63224279],
       [55.68979383, 37.7682774 ]])
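
Each k-means center is the arithmetic mean of the points assigned to that cluster. Here is a minimal sketch of that calculation in plain Python, using the three stores shown in the venues table as sample input. The helper `centroid` is hypothetical, and the result covers only these three rows rather than a full cluster, so it will not match the centers printed above.

```python
def centroid(points):
    """Mean latitude and longitude of a non-empty list of (lat, lon) pairs."""
    n = len(points)
    mean_lat = sum(lat for lat, _ in points) / n
    mean_lon = sum(lon for _, lon in points) / n
    return mean_lat, mean_lon


# First three stores from the venues table
sample = [
    (55.704134, 37.765758),
    (55.717705, 37.745007),
    (55.703226, 37.792331),
]
c = centroid(sample)
print(c)
```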

In [39]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lng, cluster in zip(moscow_venues['Venue Latitude'], moscow_venues['Venue Longitude'], moscow_venues['Cluster']):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup='Supermarket',
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=1).add_to(map_clusters)
    
# add cluster centers
for lat1, lng1 in zip(k_means_cluster_centers[:, 0], k_means_cluster_centers[:, 1]):
    folium.CircleMarker(
        [lat1, lng1],
        radius=6,
        color='green',
        popup='Cluster center',
        fill=True,
        fill_color='white',
        fill_opacity=0.9
    ).add_to(map_clusters)

map_clusters

The map above shows our supermarkets divided into 2 groups, along with the center of each group.

In [40]:
warehouse=pd.read_csv('warehouse_en.csv')
warehouse.shape

(38, 6)

This is the warehouse table that we preprocessed from the Yandex Realty service. We filtered the listings on requirements such as maintenance expenses, utility connections, and rental fee. The result contains the address, coordinates in both systems, and floor area in square meters for all 38 candidates.

In [41]:
warehouse.head(3)

|   | Address | Latitude | Longitude | Square footage | X | Y |
|---|---|---|---|---|---|---|
| 0 | 1-j Grajvoronovskij proezd, 20s26 | 55.721858 | 37.728083 | 1080 | 1913127.143 | 6411793.605 |
| 1 | 1-ja Frezernaja ulitsa, 2/1s20 | 55.738482 | 37.748942 | 1183 | 1913775.003 | 6414024.141 |
| 2 | Avtomobil'nyj proezd, 10s14 | 55.724393 | 37.707134 | 1000 | 1911758.536 | 6411625.945 |


Now we can visualize all of them.

In [42]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add supermarkets to the map
markers_colors = []
for lat, lng, cluster in zip(moscow_venues['Venue Latitude'], moscow_venues['Venue Longitude'], moscow_venues['Cluster']):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup='Supermarket',
        color='grey',
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=1).add_to(map_clusters)
    
# add cluster centers
for lat1, lng1 in zip(k_means_cluster_centers[:, 0], k_means_cluster_centers[:, 1]):
    folium.CircleMarker(
        [lat1, lng1],
        radius=6,
        color='green',
        popup='Cluster center',
        fill=True,
        fill_color='white',
        fill_opacity=0.9
    ).add_to(map_clusters)

# add warehouses to the map as hollow orange circles
for lat2, lng2, wh_address in zip(warehouse['Latitude'], warehouse['Longitude'], warehouse['Address']):
    folium.CircleMarker(
        [lat2, lng2],
        radius=5,
        popup='Warehouse: {}'.format(wh_address),  # popup text must be a single string
        color='orange',
        fill=False,
        fill_color='orange',
        fill_opacity=1).add_to(map_clusters)

map_clusters

## Analysis <a name="analysis"></a>

The map shows that the warehouses are not spread evenly across our area. While some candidates are really close to one supermarket cluster center, the other center has only distant options.

Next we calculate the distance from every warehouse to each cluster center.

As mentioned above, to calculate these distances we convert from the WGS84 spherical coordinate system (latitude/longitude in degrees) to the UTM Cartesian coordinate system (X/Y coordinates in meters).

In [43]:
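#!pip install shapely
import shapely.geometry

#!pip install pyproj
import pyproj

import math

# Build the transformer once: WGS84 lon/lat -> UTM zone 33 (meters).
# Zone 33 is kept so that results stay comparable with the precomputed X/Y
# columns in warehouse_en.csv, which appear to use the same zone. Moscow itself
# falls in UTM zone 37, so distances in this projection carry some distortion.
# pyproj.Transformer replaces the deprecated pyproj.transform function.
proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
proj_xy = pyproj.Proj(proj='utm', zone=33, datum='WGS84')
to_utm = pyproj.Transformer.from_proj(proj_latlon, proj_xy, always_xy=True)

def lonlat_to_xy(lon, lat):
    """Convert WGS84 longitude/latitude (degrees) to UTM X/Y (meters)."""
    x, y = to_utm.transform(lon, lat)
    return x, y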

In [49]:
data = []
for center_lat, center_lon in zip(k_means_cluster_centers[:, 0], k_means_cluster_centers[:, 1]):
    x_m, y_m = lonlat_to_xy(center_lon, center_lat)  # projected X/Y in meters
    data.append([x_m, y_m])

dfc = pd.DataFrame(data, columns=['X', 'Y'])
dfc

|   | X | Y |
|---|---|---|
| 0 | 1909723.0 | 6402737.0 |
| 1 | 1916770.0 | 6409186.0 |


In [50]:
dx0=(warehouse['X']-dfc.iloc[0,0]).values
dy0=(warehouse['Y']-dfc.iloc[0,1]).values

dx1=(warehouse['X']-dfc.iloc[1,0]).values
dy1=(warehouse['Y']-dfc.iloc[1,1]).values

dist0=np.sqrt(dx0*dx0+dy0*dy0)
dist1=np.sqrt(dx1*dx1+dy1*dy1)

df = pd.DataFrame({'Address':warehouse['Address'],'Distance_C1':dist0,'Distance_C2':dist1})

df.head(5)

|   | Address | Distance_C1 | Distance_C2 |
|---|---|---|---|
| 0 | 1-j Grajvoronovskij proezd, 20s26 | 9675.027148 | 4479.751342 |
| 1 | 1-ja Frezernaja ulitsa, 2/1s20 | 11992.212116 | 5690.080709 |
| 2 | Avtomobil'nyj proezd, 10s14 | 9118.818327 | 5573.614951 |
| 3 | Avtomobil'nyj proezd, 8s1 | 9516.770951 | 6379.688675 |
| 4 | Moskva ul. Zolotorozhskij Val d.11, str. 8 i s... | 11859.386859 | 8932.015057 |
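
As a sanity check on the planar distances, a great-circle (haversine) distance can be computed directly from latitude and longitude. Moscow (around 37.6°E) lies in UTM zone 37, whereas `lonlat_to_xy` projects into zone 33. Far from a zone's central meridian the projection inflates distances, so the planar figures here may run a few percent above the great-circle ones. The sketch below is a cross-check under stated assumptions (spherical Earth, hypothetical helper `haversine_m`), not the method used in this notebook.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# First warehouse in the warehouse table and the second cluster center
warehouse_0 = (55.721858, 37.728083)
center_2 = (55.68979383, 37.7682774)

d = haversine_m(*warehouse_0, *center_2)
print(round(d))
```

For this pair the great-circle result comes out somewhat below the 4479.75 m listed under `Distance_C2`. The same inflation applies to every warehouse, so the ranking by distance is unlikely to change much.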


Now that we have a table with all distances, we sort it to see which warehouses are closest.

In [51]:
Optimum1=df[['Address','Distance_C1']].sort_values('Distance_C1').reset_index(drop=True)
Optimum1.head(5)

|   | Address | Distance_C1 |
|---|---|---|
| 0 | Moskva, Kashirskij proezd, 17s1 | 616.917562 |
| 1 | Moskva, 2-j Kotljakovskij pereulok, 1s69 | 645.654343 |
| 2 | Moskva, 1-j Kotljakovskij pereulok, 13 | 1653.896744 |
| 3 | Moskva, 1-j Kotljakovskij pereulok, 12s2 | 2110.589138 |
| 4 | Moskva, Kantemirovskaja ulitsa, 65 | 2137.561909 |


Here we have 2 options within 650 m of the optimum center, which is good. This is the red cluster, located mainly in the Southern Administrative Okrug.
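
Picking candidates within a given distance of a center can be written as a small filter. The sketch below uses the five rows of the `Optimum1` table as input; the helper `shortlist` and the 1000 m threshold are hypothetical choices for illustration, not values set in this analysis.

```python
def shortlist(distances, max_distance_m):
    """Return (address, distance) pairs within the threshold, nearest first."""
    hits = [(addr, d) for addr, d in distances.items() if d <= max_distance_m]
    return sorted(hits, key=lambda pair: pair[1])


# Five closest warehouses to the first center (distances in meters)
optimum1 = {
    "Moskva, Kashirskij proezd, 17s1": 616.917562,
    "Moskva, 2-j Kotljakovskij pereulok, 1s69": 645.654343,
    "Moskva, 1-j Kotljakovskij pereulok, 13": 1653.896744,
    "Moskva, 1-j Kotljakovskij pereulok, 12s2": 2110.589138,
    "Moskva, Kantemirovskaja ulitsa, 65": 2137.561909,
}

result = shortlist(optimum1, 1000)
for addr, d in result:
    print("{:.0f} m  {}".format(d, addr))
```

With the DataFrame from the notebook, the equivalent filter would be `Optimum1[Optimum1['Distance_C1'] <= 1000]`.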

In [47]:
Optimum2=df[['Address','Distance_C2']].sort_values('Distance_C2').reset_index(drop=True)
Optimum2.head(10)

|   | Address | Distance_C2 |
|---|---|---|
| 0 | Moskva, Shossejnaja ul, d. 2A | 2823.710375 |
| 1 | Moskva, 2-j Vjazovskij proezd, 16 | 2978.868355 |
| 2 | Moskva, 1-j Vjazovskij proezd, 4k1 | 3832.524355 |
| 3 | Moskva, Juzhnoportovaja ulitsa, 40 | 4114.997046 |
| 4 | Moskva, Juzhnoportovaja ulitsa, 36s1 | 4182.554675 |
| 5 | 1-j Grajvoronovskij proezd, 20s26 | 4479.751342 |
| 6 | Moskva, Volgogradskij prospekt, 42k23 | 4527.78161 |
| 7 | Moskva, 1-j Grajvoronovskij proezd, 20 | 4571.697271 |
| 8 | Moskva, Rjazanskij prospekt, vl4A | 4710.046936 |
| 9 | Moskva, Novohohlovskaja ulitsa, 14s1 | 5037.965308 |


The second center looks less favorable: the closest option is about 2,800 m away.

## Results and Discussion <a name="results"></a>

Our analysis covers 80 supermarkets, which we divided into two clusters for future logistics purposes.
We also found 38 warehouse candidates that currently meet all basic formal conditions (fee, engineering systems, construction condition).

Mapping and the subsequent calculations showed that the candidates are spread irregularly. This may be caused by city zoning (residential, recreational, or industrial areas), which is out of our control, or it may simply reflect current supply, which could change over time.

The result is two good options for one cluster (within 650 m of its center) and only distant options for the other (the closest is about 2,800 m away).

The purpose of this analysis was only to provide information on the best warehouse locations available for rent. This was achieved only partially because of factors that are out of our control.

Recommended locations should therefore be considered only as a starting point for more detailed analysis, with other factors taken into account and all other relevant conditions met.


## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify optimal locations for warehouse rental in the South-Eastern and Southern Administrative Okrugs of Moscow, in order to narrow down the search among candidates available on the realty market.

Using Foursquare data, we first identified the locations of “Perekrestok” supermarkets in the area.

We then clustered those locations, aiming for even groups of around 40 (+/-5) stores for optimal logistics, and found their centers, which we used as the optimum warehouse locations. Clustering resulted in two exactly equal groups.

The mapped results and calculations showed an irregular distribution of warehouse candidates, so our optimization task was completed only partially.

Final decisions on warehouse rental will be made by management based on the specific characteristics of the facilities and locations closest to each optimum center.
