# Capstone project - Week 2
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data and how it will be used to solve the problem](#data)
* [Methodology](#methodology)
* [Developing SVM model](#svnmodel)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to predict popularity of newly opened restaurant given its location.
Specifically, this report will be targeted on stakeholders, who want to open **McDonald's** restaurant in **Kiev, Ukraine** and want to see if their selected location will be popular enough.
The project is focused on **chain fast food restaurant** for few reasons:
* Fast food restaurant is valued for possibility to hop in and out to grab some food on the commute, it is not popular because of high class cuisine.
* Chain restaurants look similar and have the same menu within given country. 
* Because of similarity, we can assume that popularity metric for each restaurant in chain is more dependant on it's location, than on its cuisine or interior.
* There are already quite a few opened **McDonald's** restaurants in **Kiev**, so we can have enough data to make predictions 

Speaking of location, fast food restaurant should be more popular if it is located near some significant point(s) of interest, such as shopping center, metro station, city center or train station. But how much each type of interest affect on the popularity? Is the selected place for the new restaurant is good enough, if it was chosen by simple criterias (e.g. near metro station)? These are the questions I am trying to answer with this project.


## Data and how it will be used to solve the problem <a name="data"></a>

Based on the Business Problem, we will use multiple metrics from the **Foursquare API**:
* Number of McDonald's restaurants in Kiev
* Location for each restaurant
* Popularity of each restaurant (amount of visitors)

Unfortunately, third metric is not that simple to recover from the API. For this metric we could use total amount of check-ins for each restaurant. After some initial research, with the current version of **Foursquare API** we cannot retrive amount of check-ins anymore.<br>
Other two possible resolutions for this metric are number of *'Likes'* for each restaurant and number of *'Rating signals'*.

I've decided to stick with the amount of *'Rating Signals'* as during the initial research it seems that there are more *'Rating Signals'* than there are *'Likes'* per restaurant. Therefore results should be more precise.<br>
Rating for each restaurant is a value between 1 and 10. *Rating Signals* is the total amount of people, who rated this venue. As we are not interested in the rating itself, we will use only amount of votes which will be our indicator of popularity for the venue. More people visited place means more people rated it.<br>
But what if one restaurant was opened two years ago and have only 100 *Rating Signals* while another one is opened for 10 years and has 1000 *Rating Signals*? Can we assume that second restaurant is more popular than the first one only by the amount of *Rating Signals*? No. To solve this issue we will also use one more metric for each restaurant:
* Venue Creation Date


This is a date, when the restaurant was added to **Forsquare**. From this value we will calculate average amount of *Rating Signals* per month. This value will be our main metric for the venue popularity.
Let's call this value as **Venue Popularity Index**, or **VPI**.
The bigger **VPI** is, the higher is the popularity.

After calculating **VPI** for each McDonald's restaurant in Kiev, we can plot these values on the map of the city  and predict **VPI** for future restaurants, based on their location and possibly other parameters.



### Data collection

Lets start with collecting all neccessary data for our project.

Importing some python libraries and Foursquare API credentials:

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
from pandas import json_normalize

Foursquare credentials:

In [2]:
CLIENT_ID = #'EHGYCFMDLQCZJXQA2DP0BSXGLVMMTEIGS1NU0JNJ5O0I4QES' 
CLIENT_SECRET = #'ECRF4FO2EAABHXNR2GJIJHBSVUQRDSWL3BWPUE404HM3BG5V'

Now, we will pull a list of all McDonald's restaurants in Kiev by creating search request and creating dataframe of the result.
We will use 'near' parameter for the query, that requires name of the place, instead of coordinates for our city, so that we won't need to add radius parmeter around the coordinates. 

In [19]:
VERSION = '20200430'
LIMIT = 40

search_query='McDonalds' # 
Location = 'Kiev, Ukraine' 

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&near={}&v={}&query={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Location, VERSION, search_query, LIMIT)
results = requests.get(url).json()

In [4]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# tranform venues into a dataframe
McD_df = json_normalize(venues)
McD_df

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.lat,location.lng,location.labeledLatLngs,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress,location.crossStreet,location.neighborhood,venuePage.id
0,4bcb33f7fb84c9b6b64d1e3e,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"вул. Борщагівська, 2б",50.448042,30.479175,"[{'label': 'display', 'lat': 50.44804208177356...",3087.0,UA,Київ,м. Київ,Україна,"[вул. Борщагівська, 2б, Київ, 03087, Україна]",,,
1,4c00b39434ccc9284a10e2cd,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"вул. Софіївська, 1/2",50.451128,30.521917,"[{'label': 'display', 'lat': 50.45112810903885...",1001.0,UA,Київ,м. Київ,Україна,"[вул. Софіївська, 1/2 (Майдан Незалежності), К...",Майдан Незалежності,,
2,4bd200aa77b29c748fc38d82,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"вул. Хрещатик, 19а",50.44752,30.522896,"[{'label': 'display', 'lat': 50.4475202043031,...",1001.0,UA,Київ,м. Київ,Україна,"[вул. Хрещатик, 19а, Київ, 01001, Україна]",,Липки,
3,568d19d0498e545e812fa206,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"Боричів узвіз, 10",50.459679,30.525817,"[{'label': 'display', 'lat': 50.45967850686011...",4070.0,UA,Київ,м. Київ,Україна,"[Боричів узвіз, 10 (Поштова площа), Київ, 0407...",Поштова площа,"Podil, Kyiv",
4,4ed3b0d2e5faa5ec069df659,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"Майдан Незалежності, 1",50.450967,30.522714,"[{'label': 'display', 'lat': 50.45096729018484...",1001.0,UA,Київ,м. Київ,Україна,"[Майдан Незалежності, 1 (ТРЦ «Глобус», фудкорт...","ТРЦ «Глобус», фудкорт",,
5,4c111d9681e976b0623e10eb,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"пл. Московська, 1/3",50.406227,30.518996,"[{'label': 'display', 'lat': 50.40622742762063...",2000.0,UA,Київ,м. Київ,Україна,"[пл. Московська, 1/3, Київ, 02000, Україна]",,,
6,4bc6088842419521dc76031d,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"вул. Богдана Хмельницького, 40/25",50.446909,30.509092,"[{'label': 'display', 'lat': 50.44690870171596...",,UA,Київ,м. Київ,Україна,"[вул. Богдана Хмельницького, 40/25 (вул. Івана...",вул. Івана Франка,,
7,4c39d1edae2da5938f1103c6,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"вул. Мельникова, 3",50.462544,30.481603,"[{'label': 'display', 'lat': 50.46254352624871...",4119.0,UA,Київ,м. Київ,Україна,"[вул. Мельникова, 3, Київ, 04119, Україна]",,Лукьяновка,
8,4c1686aadaf42d7f4b4e4466,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"вул. Вишгородська, 33а",50.506461,30.450408,"[{'label': 'display', 'lat': 50.50646077074081...",,UA,Київ,м. Київ,Україна,"[вул. Вишгородська, 33а, Київ, Україна]",,,
9,4c0a64c932daef3bf7a14b50,McDonald's,"[{'id': '4bf58dd8d48988d16e941735', 'name': 'F...",v-1588611707,False,"просп. Степана Бандери, 12А",50.488507,30.497852,"[{'label': 'display', 'lat': 50.48850712917407...",4073.0,UA,Київ,м. Київ,Україна,"[просп. Степана Бандери, 12А (Оболонський прос...",Оболонський просп.,Оболонь,


As we can see, items with id = 34 and higher are not McDonald's restaurants and are irrelevant for our task. Let's drop those items. We will also drop rows 24, 28 and 33, as those are not in Kiev itself, but in a satellite town, thus are also irrelevant (see that column 'location.city' and 'location.state' differ from other for those venues.

In [5]:
McD_df.drop(McD_df.index[[24,28,33,34,35,36,37,38,39]], axis=0, inplace=True)
McD_df = McD_df.reset_index(drop=True)

Let's also drop now all columns that we don't need. From this dataframe we will only need unique id, address and coordinates

In [6]:
McD_df = McD_df[['id', 'location.address', 'location.lat', 'location.lng']]

In [383]:
McD_df.head()

Unnamed: 0,id,location.address,location.lat,location.lng,ratingSignals,createdAt,timeDelta,VPI
0,4bcb33f7fb84c9b6b64d1e3e,"вул. Борщагівська, 2б",1.425964e-09,8.615241e-10,2190,2010-04-18 16:31:51,120.52194,18.171
1,4c00b39434ccc9284a10e2cd,"вул. Софіївська, 1/2",1.426051e-09,8.627322e-10,1131,2010-05-29 06:26:28,119.188702,9.48915
2,4bd200aa77b29c748fc38d82,"вул. Хрещатик, 19а",1.425949e-09,8.627599e-10,3406,2010-04-23 20:18:50,120.352487,28.3002
3,568d19d0498e545e812fa206,"Боричів узвіз, 10",1.426293e-09,8.628424e-10,829,2016-01-06 13:42:40,51.891947,15.9755
4,4ed3b0d2e5faa5ec069df659,"Майдан Незалежності, 1",1.426047e-09,8.627547e-10,308,2011-11-28 16:03:30,101.17106,3.04435


Now we can visualize our data: Kiev city with all McDonald's restaurants marked on it.

In [8]:
import folium
Kiev_lat = '50.45466'
Kiev_lng = '30.5238'

Kiev = folium.Map(location=[Kiev_lat, Kiev_lng], zoom_start=11)
for lat, lng, address in zip(McD_df['location.lat'], McD_df['location.lng'], McD_df['location.address']):
    label = str(address).encode('ascii', 'xmlcharrefreplace') # we will have to encode address as otherwise cyrillic symbols are rendered incorrectly
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup = label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Kiev) 
Kiev

Now we can collect information for each venue: 'Rating Signals' and creation date.

First, we will create new copy of McD_df dataframe with two more values: '*ratingSignals*','*createdAt*'

In [9]:
#add to McD_df two more colums for ratingSignals and createdAt
McD_full_df = McD_df
McD_full_df['ratingSignals'] = ''
McD_full_df['createdAt'] = ''

In [10]:
McD_full_df

Unnamed: 0,id,location.address,location.lat,location.lng,ratingSignals,createdAt
0,4bcb33f7fb84c9b6b64d1e3e,"вул. Борщагівська, 2б",50.448042,30.479175,,
1,4c00b39434ccc9284a10e2cd,"вул. Софіївська, 1/2",50.451128,30.521917,,
2,4bd200aa77b29c748fc38d82,"вул. Хрещатик, 19а",50.44752,30.522896,,
3,568d19d0498e545e812fa206,"Боричів узвіз, 10",50.459679,30.525817,,
4,4ed3b0d2e5faa5ec069df659,"Майдан Незалежності, 1",50.450967,30.522714,,
5,4c111d9681e976b0623e10eb,"пл. Московська, 1/3",50.406227,30.518996,,
6,4bc6088842419521dc76031d,"вул. Богдана Хмельницького, 40/25",50.446909,30.509092,,
7,4c39d1edae2da5938f1103c6,"вул. Мельникова, 3",50.462544,30.481603,,
8,4c1686aadaf42d7f4b4e4466,"вул. Вишгородська, 33а",50.506461,30.450408,,
9,4c0a64c932daef3bf7a14b50,"просп. Степана Бандери, 12А",50.488507,30.497852,,


Now we can create a loop for each venue id. 

In [11]:
i = 0
for venue in McD_df['id']:
    url3 = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue, CLIENT_ID, CLIENT_SECRET,VERSION)
    venue_info = requests.get(url3).json()
    McD_full_df.at[i, 'ratingSignals'] = venue_info['response']['venue']['ratingSignals']
    McD_full_df.at[i, 'createdAt'] = venue_info['response']['venue']['createdAt']
    i+=1


In [12]:
McD_full_df

Unnamed: 0,id,location.address,location.lat,location.lng,ratingSignals,createdAt
0,4bcb33f7fb84c9b6b64d1e3e,"вул. Борщагівська, 2б",50.448042,30.479175,2190,1271608311
1,4c00b39434ccc9284a10e2cd,"вул. Софіївська, 1/2",50.451128,30.521917,1131,1275114388
2,4bd200aa77b29c748fc38d82,"вул. Хрещатик, 19а",50.44752,30.522896,3406,1272053930
3,568d19d0498e545e812fa206,"Боричів узвіз, 10",50.459679,30.525817,829,1452087760
4,4ed3b0d2e5faa5ec069df659,"Майдан Незалежності, 1",50.450967,30.522714,308,1322496210
5,4c111d9681e976b0623e10eb,"пл. Московська, 1/3",50.406227,30.518996,1452,1276190102
6,4bc6088842419521dc76031d,"вул. Богдана Хмельницького, 40/25",50.446909,30.509092,1294,1271269512
7,4c39d1edae2da5938f1103c6,"вул. Мельникова, 3",50.462544,30.481603,2640,1278857709
8,4c1686aadaf42d7f4b4e4466,"вул. Вишгородська, 33а",50.506461,30.450408,1142,1276544682
9,4c0a64c932daef3bf7a14b50,"просп. Степана Бандери, 12А",50.488507,30.497852,2027,1275749577


'*createdAt*' value is saved in epoch time format, so we will need to convert it to datetime format.

In [13]:
McD_full_df['createdAt'] = pd.to_datetime(McD_full_df['createdAt'], unit='s')

In [14]:
McD_full_df

Unnamed: 0,id,location.address,location.lat,location.lng,ratingSignals,createdAt
0,4bcb33f7fb84c9b6b64d1e3e,"вул. Борщагівська, 2б",50.448042,30.479175,2190,2010-04-18 16:31:51
1,4c00b39434ccc9284a10e2cd,"вул. Софіївська, 1/2",50.451128,30.521917,1131,2010-05-29 06:26:28
2,4bd200aa77b29c748fc38d82,"вул. Хрещатик, 19а",50.44752,30.522896,3406,2010-04-23 20:18:50
3,568d19d0498e545e812fa206,"Боричів узвіз, 10",50.459679,30.525817,829,2016-01-06 13:42:40
4,4ed3b0d2e5faa5ec069df659,"Майдан Незалежності, 1",50.450967,30.522714,308,2011-11-28 16:03:30
5,4c111d9681e976b0623e10eb,"пл. Московська, 1/3",50.406227,30.518996,1452,2010-06-10 17:15:02
6,4bc6088842419521dc76031d,"вул. Богдана Хмельницького, 40/25",50.446909,30.509092,1294,2010-04-14 18:25:12
7,4c39d1edae2da5938f1103c6,"вул. Мельникова, 3",50.462544,30.481603,2640,2010-07-11 14:15:09
8,4c1686aadaf42d7f4b4e4466,"вул. Вишгородська, 33а",50.506461,30.450408,1142,2010-06-14 19:44:42
9,4c0a64c932daef3bf7a14b50,"просп. Степана Бандери, 12А",50.488507,30.497852,2027,2010-06-05 14:52:57


Let's save our dataframe to a file for backup purposes.

In [275]:
import pickle

McD_full_df.to_pickle('./McDonalds_df.pkl')

Next step is to calculate **Venue Popularity Index**, or **VPI**. For this we will calculate the average amount of *ratingSignals* per month since the *createAt* date. The following calculations were made at May, 4th, 2020.


In [276]:
from datetime import datetime

today = datetime.today().strftime('%Y-%m-%d')
today = pd.to_datetime(today).round('D')
today 

Timestamp('2020-05-10 00:00:00')

In [277]:
McD_full_df['timeDelta'] = today - McD_full_df['createdAt']
McD_full_df['timeDelta'] = McD_full_df['timeDelta']/np.timedelta64(1,'M')
#add column with VPI values
McD_full_df['VPI'] = McD_full_df['ratingSignals']/McD_full_df['timeDelta']
#look at the results in descending order
McD_full_df.sort_values(by='VPI', ascending=False)

Unnamed: 0,id,location.address,location.lat,location.lng,ratingSignals,createdAt,timeDelta,VPI
0,4bcb33f7fb84c9b6b64d1e3e,"вул. Борщагівська, 2б",50.448042,30.479175,2190,2010-04-18 16:31:51,120.719069,18.1413
1,4c00b39434ccc9284a10e2cd,"вул. Софіївська, 1/2",50.451128,30.521917,1131,2010-05-29 06:26:28,119.385831,9.47349
2,4bd200aa77b29c748fc38d82,"вул. Хрещатик, 19а",50.44752,30.522896,3406,2010-04-23 20:18:50,120.549616,28.2539
3,568d19d0498e545e812fa206,"Боричів узвіз, 10",50.459679,30.525817,829,2016-01-06 13:42:40,52.089076,15.915
4,4ed3b0d2e5faa5ec069df659,"Майдан Незалежності, 1",50.450967,30.522714,308,2011-11-28 16:03:30,101.368189,3.03843
5,4c111d9681e976b0623e10eb,"пл. Московська, 1/3",50.406227,30.518996,1452,2010-06-10 17:15:02,118.976775,12.2041
6,4bc6088842419521dc76031d,"вул. Богдана Хмельницького, 40/25",50.446909,30.509092,1294,2010-04-14 18:25:12,120.847902,10.7077
7,4c39d1edae2da5938f1103c6,"вул. Мельникова, 3",50.462544,30.481603,2640,2010-07-11 14:15:09,117.962378,22.38
8,4c1686aadaf42d7f4b4e4466,"вул. Вишгородська, 33а",50.506461,30.450408,1142,2010-06-14 19:44:42,118.841941,9.6094
9,4c0a64c932daef3bf7a14b50,"просп. Степана Бандери, 12А",50.488507,30.497852,2027,2010-06-05 14:52:57,119.144291,17.013


Now lets visualize our VPI values formatted by color on the map of the city. Popup value for each venue is its VPI.

In [280]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 31
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

Kiev = folium.Map(location=[Kiev_lat, Kiev_lng], tiles = "Stamen Toner", zoom_start=11)
for index, row in McD_full_df.iterrows():
    label = folium.Popup(str(row['VPI']), parse_html=True)
    col = round(row['VPI'])
    folium.CircleMarker([row['location.lat'], row['location.lng']],                   
                        popup=label,
                        radius=7,
                        fill=True,
                        fill_color=rainbow[col],
                        fill_opacity=1,
                        color=rainbow[col]).add_to(Kiev)    
Kiev

## Methodology <a name="methodology"></a>

We've gatered information about restaurant popularity and its location on the map. Unfortunately, it seems that this data is not enough to make a prediction regarding possible popularity of the new restaurant. We will need some more parameters for our markers to make a prediction:

1. **Distance from metro station in meters.** the closer venue is to metro station the higher will be score. If it is more than 500m away from the station, score=0. We won't increase score if there is more than 1 metro station nearby. Instead we will choose only the nearest one. This is done because if person wants to visit McDonald's near metro station, he would just go to the nearest located station, therefore number of visitors is not affected by the amount of stations in close proximity, but by how close is the nearest one.
2. **Distance to the nearest Shopping Mall.** If distance is less than 500m - the score is 0. The closer is Shopping Mall the higher the score. We will add score if there is more than one Shopping center nearby. Opposite to metro stations, possible venue visitors is higher with increased amount of Malls nearby. Each mall may have different set of shops, hence some people will go to one, and other will go to the second mall. Both of those groups are possible restaurant visitors, therefore score is higher.
3. **Distance from City center.** City center will be Khreschiatyk metro station. It is located right at the middle of the Kiev main street - Khreschiatyk. 

We will also adjust manually couple of restaurants that are close to eachother. Those we will count as one, adding up VPI value and other scores. Otherwise this will negatively affect our prediction as in those paris it seems that one restaurant has significantly higher VPI than another, while both have similar other scores.

Also, gathered data will not be enough to predict continuous value of VPI. Therefore we won't be using Multiple linear regression. SVM (Support Vector Machines) will be used instead. Our VPI score will be split in 3 groups, low, medium and high popularity venues. After model is built - we will test it and then predict the group for a new restaurant with specific coordinates. 

### Our first step is to gather coordinates of all metro stations and shopping malls.

In [104]:
## Foursquare API category IDs for metro and shopping malls 
metro = '4bf58dd8d48988d1fd931735'
Shopping_Mall = '4bf58dd8d48988d1fd941735'

urlMetro = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&near={}&v={}&categoryId={}&limit=50'.format(CLIENT_ID, CLIENT_SECRET, 'Kiev, Ukraine', VERSION, metro)
urlMall = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&near={}&v={}&categoryId={}&limit=50'.format(CLIENT_ID, CLIENT_SECRET, 'Kiev, Ukraine', VERSION, Shopping_Mall)

#write returned data to JSON
resultsMetro = requests.get(urlMetro).json()
resultsMall = requests.get(urlMall).json()

# assign relevant part of JSON to venues
metro_venues = resultsMetro['response']['venues']
Mall_venues = resultsMall['response']['venues']

# tranform venues into a dataframe
metro_df = json_normalize(metro_venues)
mall_df = json_normalize(Mall_venues)

#Drop all unimportant data.

#drop train depot, which is not metro station
#metro_df.drop(metro_df.index[[12]], axis=0, inplace=True)
#metro_df = metro_df.reset_index(drop=True)


#leave only necessary columns
metro_df = metro_df[['id', 'name', 'location.address', 'location.lat', 'location.lng']]

mall_df = mall_df[['id', 'name', 'location.address', 'location.lat', 'location.lng']]

In [35]:
mall_df.head()

Unnamed: 0,id,name,location.address,location.lat,location.lng
0,4c23961fc9bbef3b250dafac,ТЦ «Глобус»,"Майдан Незалежності, 1",50.450926,30.522695
1,5ac123bb83e38058a160ae6a,ТРЦ Smart Plaza Polytech (ТРЦ «Smart Plaza Pol...,"просп. Перемоги, 26",50.451513,30.467663
2,5b6da0f6666116002ceb22c6,River Mall,"вул. Дніпровська Набережна, 10-14",50.405058,30.612352
3,4e1f094dd22d7c148ce71cd8,Dream Town (2 лінія / 2nd line) (Dream Town (2...,"просп. Оболонський, 21б",50.516236,30.498763
4,5abf905e86f4cc28361bc0ca,ТРЦ «Rive Gauche»,"вул. Здолбунівська, 17",50.41854,30.630857


In [36]:
metro_df.head()

Unnamed: 0,id,name,location.address,location.lat,location.lng
0,4c06abf82e80a593e07a74f9,Станцiя «Майдан Незалежностi»,Майдан Незалежності,50.450434,30.523836
1,4c0e80f92466a593482a7921,Станція «Університет»,бул. Тараса Шевченка,50.444418,30.505927
2,4cae0fcd18a3199ca6fb5bfb,Станція «Видубичі»,Наддніпрянське шосе,50.401064,30.562484
3,4c170ad596040f47a0d373a5,Станція «Арсенальна»,Арсенальна пл.,50.444132,30.545478
4,4c11bc8cb4dfd13a01fa2a8b,Станція «Вокзальна»,Вокзальна пл.,50.441508,30.488576


Okay. Now we can plot all above information and see if our guesses about parameters relation is correct.
Gray circle with black outline will represent metro stations and yellow circle with red outline are shopping malls.


In [108]:
kclusters = 31
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


Kiev = folium.Map(location=[Kiev_lat, Kiev_lng], tiles = "Stamen Toner", zoom_start=11)
for index, row in McD_full_df.iterrows():
    label = folium.Popup(str(row['VPI']), parse_html=True)
    col = round(row['VPI'])
    folium.CircleMarker([row['location.lat'], row['location.lng']],                   
                        popup=label,
                        radius=6,
                        fill=True,
                        fill_color=rainbow[col],
                        fill_opacity=1,
                        color=rainbow[col]).add_to(Kiev)

for index, row in metro_df.iterrows():
    folium.CircleMarker([row['location.lat'], row['location.lng']],                   
                        popup='Metro',
                        radius=3,
                        fill=True,
                        fill_color='lightgray',
                        fill_opacity=1,
                        color='black').add_to(Kiev)

for index, row in mall_df.iterrows():
    folium.CircleMarker([row['location.lat'], row['location.lng']],                   
                        popup='Shopping Mall',
                        radius=3,
                        fill=True,
                        fill_color='yellow',
                        fill_opacity=1,
                        color='red').add_to(Kiev)
     
Kiev

As we can see from the map, restaurant has higher VPI when it is near both metro and mall or if it is far enough from other McDonald's restaurant. Also generally it seems that closer to city center venues have higher VPI.

Next step will be for each venue to find nearest metro station and shopping mall and calculate the distance to them and also to the city center.

#### 1. Calculating distances between McDonald's restaurants
We will create distance matrix to find pairs of close McDonalds for further adjustment.

In [109]:
from sklearn.neighbors import DistanceMetric
from math import radians

#Load dataframe from file, in case it was previously corrupted.
with open('McDonalds_df.pkl', 'rb') as f:
    McD_full_df = pickle.load(f)

#defining distance calculating method
dist = DistanceMetric.get_metric('haversine')

#copying dataframe to new one, where we will replace coordinates values with radians for calculation purposes
McD_radian_coords = None
McD_radian_coords = McD_full_df.copy()

McD_radian_coords['location.lat'] = np.radians(McD_radian_coords['location.lat'])
McD_radian_coords['location.lng'] = np.radians(McD_radian_coords['location.lng'])


#Create new dataframe with where rows and columns will be Venue addresses and values will be distance in meters
#Earth radius for this step is set to 6365600 meters.
#It is an average radius at Kiev latitude. This will increase accuracy.
DistanceMatrix = pd.DataFrame(dist.pairwise(McD_radian_coords[['location.lat','location.lng']].to_numpy())*6365600,
                                            columns=McD_radian_coords['id'].unique(), #'id' can be changed to 'location.address' for better visibility
                                            index=McD_radian_coords['id'].unique()    # in Cyrillic lang. 'id' is easier to use later for DF merging.
                                            ) 


DistanceMatrix.head()

Unnamed: 0,4bcb33f7fb84c9b6b64d1e3e,4c00b39434ccc9284a10e2cd,4bd200aa77b29c748fc38d82,568d19d0498e545e812fa206,4ed3b0d2e5faa5ec069df659,4c111d9681e976b0623e10eb,4bc6088842419521dc76031d,4c39d1edae2da5938f1103c6,4c1686aadaf42d7f4b4e4466,4c0a64c932daef3bf7a14b50,...,4bbb2fc898c7ef3bf7cf3302,58630bc2debdf6159594e27b,5bb6545f031320002c77c9d7,54843c37498e11594c9a25d8,50a39b83e4b023198a9f04b3,5cca84bd9b04730039b7703d,57dfb26ecd1077642bc93cc3,52bf2105498ecad1a763b1a8,54901272498e4a9081a2c289,54294b26498e07eb6ca6f060
4bcb33f7fb84c9b6b64d1e3e,0.0,3043.139278,3093.642859,3543.568692,3097.237379,5433.72137,2120.287114,1620.245088,6801.585254,4685.691096,...,10455.499558,1351.386494,7849.111216,8299.826187,11113.888516,6564.908438,963.666698,4946.407475,7001.3371,11038.405403
4c00b39434ccc9284a10e2cd,3043.139278,0.0,406.773245,989.187288,59.13316,4992.780361,1021.268836,3120.907346,7959.417975,4487.979717,...,7468.215089,2912.177748,10686.725984,11291.754365,8963.860474,9535.355651,2903.773392,4217.11184,6719.38582,7998.524211
4bd200aa77b29c748fc38d82,3093.642859,406.773245,0.0,1366.506561,383.189013,4595.953188,978.940402,3364.16729,8315.503507,4885.938417,...,7363.85433,2780.178596,10591.114382,11258.901628,8634.145889,9639.948334,2825.142729,3816.012322,7125.113578,7963.019544
568d19d0498e545e812fa206,3543.568692,989.187288,1366.506561,0.0,992.3915,5958.038468,1847.287639,3143.279913,7445.367886,3764.060752,...,7367.959281,3713.648973,11355.869947,11840.775005,9482.009299,9793.31488,3631.373376,5171.671961,5885.70846,7724.363118
4ed3b0d2e5faa5ec069df659,3097.237379,59.13316,383.189013,992.3915,0.0,4977.591789,1063.970655,3179.67809,8009.072097,4526.106568,...,7410.177907,2950.417166,10730.715143,11341.367447,8911.467349,9592.825752,2947.643331,4198.944521,6750.572226,7943.318589


In [384]:
#Replace zeroes with NaN to find minimum values
DistanceMatrix = DistanceMatrix.replace(0, np.NaN)
minValuesObj = DistanceMatrix.idxmin()

#Creating new DataFrame with two addresses and distance between them
Dist=None
Dist = pd.DataFrame(minValuesObj, columns=['Address2'])

Dist.reset_index(level=0, inplace=True)
Dist.rename(columns={'index':'Address1'}, inplace=True)
Dist = Dist.assign(Distance=DistanceMatrix.min().values)

Dist.head()

Unnamed: 0,Address1,Address2,Distance
0,4bcb33f7fb84c9b6b64d1e3e,57dfb26ecd1077642bc93cc3,963.666698
1,4c00b39434ccc9284a10e2cd,4ed3b0d2e5faa5ec069df659,59.13316
2,4bd200aa77b29c748fc38d82,4ed3b0d2e5faa5ec069df659,383.189013
3,568d19d0498e545e812fa206,4c00b39434ccc9284a10e2cd,989.187288
4,4ed3b0d2e5faa5ec069df659,4c00b39434ccc9284a10e2cd,59.13316


Later we will manually combine below venue pairs:<br>
* Index 1: 4c00b39434ccc9284a10e2cd 4ed3b0d2e5faa5ec069df659
* Index 10: 4bfd5d50e529c92899cfba8c	54901272498e4a9081a2c289
* Index 22:	58630bc2debdf6159594e27b	57dfb26ecd1077642bc93cc3

In [282]:
Dist.Distance = Dist.Distance.round()
Dist.Distance = Dist.Distance.astype('int')

In [283]:
Dist.head()

Unnamed: 0,Address1,Address2,Distance
0,4bcb33f7fb84c9b6b64d1e3e,57dfb26ecd1077642bc93cc3,964
1,4c00b39434ccc9284a10e2cd,4ed3b0d2e5faa5ec069df659,59
2,4bd200aa77b29c748fc38d82,4ed3b0d2e5faa5ec069df659,383
3,568d19d0498e545e812fa206,4c00b39434ccc9284a10e2cd,989
4,4ed3b0d2e5faa5ec069df659,4c00b39434ccc9284a10e2cd,59


#### 2. Calculating distance to the nearest metro station for each venue

We will build similar 2D matrix with McDonald's ID as rows and metro stations as columns. Values will be distance between corresponding two objects. Then we will search for minimum values and calculate score for each restaurant.

In [284]:
metro_df_rad = metro_df.copy()
metro_df_rad['location.lat'] = np.radians(metro_df['location.lat'])
metro_df_rad['location.lng'] = np.radians(metro_df['location.lng'])

MetroMatrixdf = pd.DataFrame(
                            dist.pairwise(
                                McD_radian_coords[['location.lat','location.lng']].to_numpy(),
                                metro_df_rad[['location.lat','location.lng']].to_numpy())*6365600,
                                columns=metro_df['name'].unique(), 
                                index=McD_radian_coords['id'].unique()
                            )

In [285]:
MetroMatrixdf.head()

Unnamed: 0,Станцiя «Майдан Незалежностi»,Станція «Видубичі»,Станція «Університет»,Станція «Арсенальна»,Станція «Площа Льва Толстого»,Станція «Золоті Ворота»,Станція «Дніпро»,Станція «Вокзальна»,Станція «Театральна»,Станція «Нивки»,...,Станція «Дружби Народiв»,Станція «Лівобережна»,Станція «Петрівка» (Станція «Почайна»),Станція «Дарниця»,Станція «Чернігівська»,Станцiя «Голосіївська»,Станція «Житомирська»,Станція «Харківська»,Станція «Васильківська»,Станція «Іподром»
4bcb33f7fb84c9b6b64d1e3e,3170.695362,7874.810616,1935.024344,4710.935888,2817.567595,2446.904697,5708.948524,984.506587,2779.921666,5339.520604,...,5716.738006,8404.120707,4631.87721,9470.24793,10771.742182,6023.51025,8095.911772,13262.278713,6089.385494,7956.256456
4c00b39434ccc9284a10e2cd,156.153039,6259.544496,1354.79192,1839.148352,1308.655955,631.843881,2859.873538,2589.722066,704.482441,8259.604292,...,4012.819899,5370.882806,4435.767236,6427.9708,7728.909665,6048.245807,11082.171949,10720.684365,6804.526436,9101.279682
4bd200aa77b29c748fc38d82,330.442932,5872.920663,1249.011623,1641.411607,958.665581,664.194476,2661.41861,2518.40471,416.59641,8383.53044,...,3619.638659,5321.507905,4833.351045,6405.343552,7722.685266,5663.715731,11179.384113,10458.609763,6454.893136,8769.754209
568d19d0498e545e812fa206,1036.629729,7010.133286,2203.196328,2217.60399,2297.838039,1476.288312,3135.434322,3319.0314,1691.703529,8480.648713,...,4812.058575,5171.439504,3716.027545,6143.5174,7388.665965,7029.97972,11348.649774,11017.57808,7791.751882,10080.253473
4ed3b0d2e5faa5ec069df659,99.087308,6217.940967,1392.79233,1780.524732,1308.295903,677.399967,2801.012331,2633.999466,711.267163,8317.661458,...,3973.904289,5314.757431,4474.108053,6373.332772,7675.357277,6039.290268,11139.411674,10663.209455,6807.32622,9108.960008


In [286]:
#create dataframe with minimum values from MetroMatrixdf
McD_to_Metro = pd.DataFrame(MetroMatrixdf.min(axis=1), columns=['Distance'])
McD_to_Metro.reset_index(level=0, inplace=True)
McD_to_Metro.rename(columns={'index':'McDonaldsID'}, inplace=True)
McD_to_Metro.head()

Unnamed: 0,McDonaldsID,Distance
0,4bcb33f7fb84c9b6b64d1e3e,960.127932
1,4c00b39434ccc9284a10e2cd,156.153039
2,4bd200aa77b29c748fc38d82,43.903391
3,568d19d0498e545e812fa206,945.998739
4,4ed3b0d2e5faa5ec069df659,99.087308


In [287]:
McD_to_Metro.Distance = McD_to_Metro.Distance.round()
McD_to_Metro.Distance = McD_to_Metro.Distance.astype('int')
#Convert distance to Metro to Score, where the closer is restaurant to metro - the higher is score. 
McD_to_Metro.Distance = McD_to_Metro.Distance.apply(lambda x: (500-x)/5 if x <= 500 else 0).round().astype('int')

In [289]:
#Maxumum score is 100
McD_to_Metro.head()

Unnamed: 0,McDonaldsID,Distance
0,4bcb33f7fb84c9b6b64d1e3e,0
1,4c00b39434ccc9284a10e2cd,69
2,4bd200aa77b29c748fc38d82,91
3,568d19d0498e545e812fa206,0
4,4ed3b0d2e5faa5ec069df659,80


#### 3. Calculating distance from restaurants to shopping malls.

Let's repeat workflow as it was with Metro stations, but this time we will need to take into account all Shopping Malls that are near the restaurant.<br>
To complete this task we will replace all values higher than 300m from distance matrix with 0.<br>
All values that are less than 300 will be replaced with (300-value). This way the closer is the Mall - the higher is the score. Then we will summarize each row values to calculate total score for each restaurant.

In [297]:
mall_df_rad = mall_df.copy()
mall_df_rad['location.lat'] = np.radians(mall_df['location.lat'])
mall_df_rad['location.lng'] = np.radians(mall_df['location.lng'])

MallMatrixdf = pd.DataFrame(
                            dist.pairwise(
                                McD_radian_coords[['location.lat','location.lng']].to_numpy(),
                                mall_df_rad[['location.lat','location.lng']].to_numpy())*6365600,
                                columns=mall_df['id'].unique(), 
                                index=McD_radian_coords['id'].unique()
                            )
#rows = McDonalds ID
#columns = Shopping Mall ID
MallMatrixdf.head()

Unnamed: 0,4b680c8af964a52038652be3,5ac123bb83e38058a160ae6a,5abf905e86f4cc28361bc0ca,5b6da0f6666116002ceb22c6,4c5d4e7c9b28d13a2a225970,4badfeb5f964a520fe783be3,5dd57dc8579eca00076fbc04,4bfa6b71b182c9b6ca3d7a5a,5423a2e0498e4b6aeeaba589,4c4abd9ef7cc1b8d9790db3e,...,4c287ac53492a593d55cb728,5bf2be06aa6c95002c4b8b24,4bffd2536f12b7130ba6685a,5e214f563b102f00082e0b1c,5e01fead28f8fe00085349ae,531c3c4b498e20ab6407e73a,4d9339b47ac3a35d3073d125,58e2697154386d4959b15216,5acb6f41135b3906fd64e0d1,5bae02f459c423002c8a9e7c
4bcb33f7fb84c9b6b64d1e3e,3047.2901,901.126404,11223.527584,10566.78007,7615.58984,6352.133661,5261.922672,2642.99625,11027.239036,12296.209456,...,1584.934211,5858.055746,8937.612966,22739.679945,10604.733833,34963.816564,32672.121751,11275.283981,11186.639225,5107.390145
4c00b39434ccc9284a10e2cd,1076.355492,3838.297272,8517.057156,8195.515667,5414.800204,6942.93718,3915.996976,5651.895697,7987.784612,9940.314106,...,3120.557512,8520.259597,11963.330067,19782.579232,13309.457948,32203.538651,29891.576491,14184.883945,13878.466957,8131.394611
4bd200aa77b29c748fc38d82,686.914316,3932.527026,8290.963177,7895.893972,5734.75989,7325.222068,4316.238733,5731.058441,7950.888203,9642.457922,...,3360.088256,8711.805858,11956.378769,19866.400511,13170.778404,32006.057544,29699.75169,14107.411107,14074.382986,8129.190042
568d19d0498e545e812fa206,2052.932615,4212.37269,8725.424304,8621.486478,4458.712115,6303.485805,2972.675327,6012.541362,7717.046244,10349.921267,...,3154.459811,8561.951293,12457.504225,19199.232339,14046.265818,32276.862152,29950.929006,14811.462367,13877.956618,8636.181756
4ed3b0d2e5faa5ec069df659,1063.520295,3894.906699,8458.437251,8140.347914,5402.756064,6986.947323,3933.283831,5708.107431,7932.505731,9884.945534,...,3179.40091,8579.390593,12014.91422,19736.464711,13348.988711,32144.44074,29832.454185,14230.776553,13937.596036,8183.100012


In [298]:
columns = list(MallMatrixdf)

#for each row replace distances with scores
for col in columns:
    MallMatrixdf[col] = MallMatrixdf[col].round()
    MallMatrixdf[col] = MallMatrixdf[col].astype('int')
    MallMatrixdf[col] = MallMatrixdf[col].apply(lambda x: (500-x) if x <= 500 else 0)

#add column with SUM of all the scores
MallMatrixdf['SUM'] = MallMatrixdf.sum(axis=1)
#drop all columns with only zeroes (Drop Mall that are not close enought to McDonald's)
MallMatrixdf = MallMatrixdf.loc[:, (MallMatrixdf != 0).any(axis=0)]

In [299]:
MallMatrixdf.reset_index(level=0, inplace=True)
MallMatrixdf.rename(columns={'index':'McDonaldsID'}, inplace=True)
MallMatrixdf.head(10)

Unnamed: 0,McDonaldsID,4b680c8af964a52038652be3,5423a2e0498e4b6aeeaba589,4c4abd9ef7cc1b8d9790db3e,4c3c8985a9509c74f130395b,4c0c9a80b4c6d13a15740c30,58e871a66ad5a12a81f3b425,4c38774b1a38ef3b8ba29221,53f61953498e07b7674879f0,4f71f280e4b007de903b4100,4d8743c281fdb1f796cd34c0,501f66b8e4b00e1d09c90ce1,4e1f094dd22d7c148ce71cd8,4bf3aebbcad2c928db079b99,5357c101498ef6ba3a1b814f,5ba12035efa94f00254bd42e,4c287ac53492a593d55cb728,SUM
0,4bcb33f7fb84c9b6b64d1e3e,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,4c00b39434ccc9284a10e2cd,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,4bd200aa77b29c748fc38d82,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,568d19d0498e545e812fa206,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4ed3b0d2e5faa5ec069df659,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,4c111d9681e976b0623e10eb,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,4bc6088842419521dc76031d,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,4c39d1edae2da5938f1103c6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,463,463
8,4c1686aadaf42d7f4b4e4466,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,4c0a64c932daef3bf7a14b50,0,0,0,0,0,0,0,0,0,0,0,0,300,0,0,0,300


#### 4. Calculating distance to the city center.

First, lets find City center coordinates. For the City center we will use Khreschiatyk metro station coordinates as it is right in the middle of the main street in Kiev. We will locate coordinates from metro_df by stations id value:

In [118]:
metro_df.loc[metro_df['id'] == '4bd7d62c5cf276b0e1209c00']

Unnamed: 0,id,name,location.address,location.lat,location.lng
12,4bd7d62c5cf276b0e1209c00,Станція «Хрещатик»,"вул. Хрещатик, 19а",50.447243,30.522453


Now we can calculate distance to the city center for every McDonald's.

In [153]:
#write coordinates to variable
Center_lat = np.radians(metro_df.iloc[12]['location.lat'])
Center_lng = np.radians(metro_df.iloc[12]['location.lng'])

CityCenterDistanceDf = pd.DataFrame(
                            dist.pairwise(
                                McD_radian_coords[['location.lat','location.lng']].to_numpy(),
                                [[Center_lat, Center_lng]])*6365600,
                                index=McD_radian_coords['id'].unique()
                                    )
#clean up DF
CityCenterDistanceDf = CityCenterDistanceDf.round()
CityCenterDistanceDf = CityCenterDistanceDf.astype('int32')
CityCenterDistanceDf.rename(columns = {0: 'DistanceToCC'}, inplace=True)
CityCenterDistanceDf.reset_index(level=0, inplace=True)
CityCenterDistanceDf.rename(columns={'index':'McDonaldsID'}, inplace=True)
CityCenterDistanceDf.head()

Unnamed: 0,McDonaldsID,DistanceToCC
0,4bcb33f7fb84c9b6b64d1e3e,3063
1,4c00b39434ccc9284a10e2cd,433
2,4bd200aa77b29c748fc38d82,44
3,568d19d0498e545e812fa206,1402
4,4ed3b0d2e5faa5ec069df659,414


After gathering all needed data, we can start to fill our final Dataframe that will be used for prediction purposes.
Columns for our dataframe will include:
* Venue ID
* latitude in radians
* longitude in radians
* Neighbour Venue Distance, or **NVDScore**
* Metro Distance Score, or **MDScore**
* Shopping Mall Score, or **SMScore**
* Distance from City Center, or **CCScore**
* **VPI**

In [328]:
Prediction_df = pd.DataFrame(columns = ['VenueID', 'lat', 'lng', 'NVDScore', 
                                        'MDScore', 'SMScore', 'CCScore', 'VPI'])

In [329]:
Prediction_df['VenueID'] = McD_radian_coords['id']
Prediction_df['lat'] = McD_radian_coords['location.lat']
Prediction_df['lng'] = McD_radian_coords['location.lng']
Prediction_df['NVDScore'] = Dist['Distance']
Prediction_df['MDScore'] = McD_to_Metro['Distance']
Prediction_df['SMScore'] = MallMatrixdf['SUM']
Prediction_df['CCScore'] = CityCenterDistanceDf['DistanceToCC']
Prediction_df['CCScore'] = Prediction_df['CCScore'].apply(lambda x:(Prediction_df['CCScore'].max()-x)*100/Prediction_df['CCScore'].max()).round().astype('int')
Prediction_df['VPI'] = McD_full_df['VPI']

In [330]:
Prediction_df.head()

Unnamed: 0,VenueID,lat,lng,NVDScore,MDScore,SMScore,CCScore,VPI
0,4bcb33f7fb84c9b6b64d1e3e,0.880484,0.531962,964,0,0,74,18.1413
1,4c00b39434ccc9284a10e2cd,0.880538,0.532708,59,69,0,96,9.47349
2,4bd200aa77b29c748fc38d82,0.880475,0.532725,383,91,0,100,28.2539
3,568d19d0498e545e812fa206,0.880688,0.532776,989,0,0,88,15.915
4,4ed3b0d2e5faa5ec069df659,0.880535,0.532722,59,80,0,97,3.03843


Now lets finalize our database with following actions:
* merge all previously defined pairs of restaurants,
* round VPI score to integer

In [331]:
Prediction_df['VPI'] = Prediction_df['VPI'].astype('float').round().astype('int')

In [332]:
Prediction_df.at[1, 'VPI'] = Prediction_df.at[1, 'VPI'] + Prediction_df.at[4, 'VPI']
Prediction_df.drop(index=4, inplace=True)

Prediction_df.at[10, 'VPI'] = Prediction_df.at[10, 'VPI'] + Prediction_df.at[29, 'VPI']
Prediction_df.drop(index=29, inplace=True)

Prediction_df.at[22, 'VPI'] = Prediction_df.at[22, 'VPI'] + Prediction_df.at[27, 'VPI']
Prediction_df.at[22, 'MDScore'] = Prediction_df.at[27, 'MDScore']
Prediction_df.drop(index=27, inplace=True)

Prediction_df.reset_index(inplace=True)
Prediction_df.drop(columns='index', inplace=True)

In [430]:
Prediction_df.head()

Unnamed: 0,VenueID,lat,lng,NVDScore,MDScore,SMScore,CCScore,VPI,VPI-binned
0,4bcb33f7fb84c9b6b64d1e3e,0.880484,0.531962,964,0,0,74,18,2
1,4c00b39434ccc9284a10e2cd,0.880538,0.532708,59,69,0,96,12,2
2,4bd200aa77b29c748fc38d82,0.880475,0.532725,383,91,0,100,28,3
3,568d19d0498e545e812fa206,0.880688,0.532776,989,0,0,88,16,2
4,4c111d9681e976b0623e10eb,0.879755,0.532657,814,19,0,62,12,2


### Developing SVM model  <a name="svnmodel"></a>

Modeling SVM comes in few steps. First, we need to import all needed libraries. Second, We will create 3 categories for VPI values to increase our prediction accuracy. Yes, we won't be able to predict exact amount of VPI but this is done due to limited amount of available data. Then We will define 3 categories, which will impact on result category of VPI. Last step is to train and test our model and use it on a new restaurant.

In [334]:
#save final DF as local file
Prediction_df.to_pickle('./Prediction_df.pkl')

Import libraries:

In [338]:
import matplotlib.pyplot as plt
import pylab as pl
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, jaccard_similarity_score
%matplotlib inline

Now we will split VPI scores to three categories and add new column to Dataframe with corresponding values:<br>
**'3'**: High popularity<br>
**'2'**: Medum popularity<br>
**'1'**: Low popularity

In [339]:
bins = np.linspace(min(Prediction_df['VPI']),max(Prediction_df['VPI']),4)

group_names = ['1','2','3']
Prediction_df['VPI-binned'] = pd.cut(Prediction_df['VPI'], bins, labels = group_names, include_lowest=True)

In [340]:
#convert type to integer
Prediction_df['VPI-binned'] = Prediction_df['VPI-binned'].astype('int')

In [344]:
#Select features that will be used for prediction
feature_df = Prediction_df[[ 'MDScore', 'SMScore', 'CCScore']]
X = np.asarray(feature_df)
y = np.asarray(Prediction_df['VPI-binned'])
y [0:5]

array([2, 2, 3, 2, 2])

Split all data to testing and trainig parts:

In [378]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=3)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (22, 3) (22,)
Test set: (6, 3) (6,)


In [379]:
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [380]:
yhat = clf.predict(X_test)
yhat [0:5]

array([2, 1, 2, 1, 2])

In [381]:
print(classification_report(y_test, yhat))
print(f1_score(y_test, yhat, average='weighted'))
print(jaccard_similarity_score(y_test, yhat))

              precision    recall  f1-score   support

           1       1.00      0.67      0.80         3
           2       0.75      1.00      0.86         3

    accuracy                           0.83         6
   macro avg       0.88      0.83      0.83         6
weighted avg       0.88      0.83      0.83         6

0.8285714285714286
0.8333333333333334




As we can see above, our Jaccard Similarity score is 0.83, which is good enough. Although I must note, that accuracy of developed model varies quite a bit with different random state. Unfortunately, to stabilize our results we would need either much more restaurant data or we could add more parameters, that will affect VPI. As I decided to see dependance of VPI from geographical parameters alone, I am left with current results.

### Predicting popularity category for new restaurant

Assume we have following coordinate for the possible future McDonald's restaurant that we need to calculate popularity for:

Latitude: 50.45236955299899<br>
Longitude: 30.44524657687377<br>
Here's its location on the map:

In [391]:
rest_lat = '50.45236955299899'
rest_lng = '30.44524657687377'
Kiev = folium.Map(location=[rest_lat, rest_lng], tiles = "Stamen Toner", zoom_start=12)

label = folium.Popup('NEW MCDONALDS', parse_html=True)
folium.CircleMarker([rest_lat, rest_lng],                   
                     popup=label, radius=6,
                        fill=True,
                        fill_color='blue',
                        fill_opacity=1,
                        color='blue').add_to(Kiev)

for index, row in metro_df.iterrows():
    folium.CircleMarker([row['location.lat'], row['location.lng']],                   
                        popup='Metro',
                        radius=3,
                        fill=True,
                        fill_color='lightgray',
                        fill_opacity=1,
                        color='black').add_to(Kiev)

for index, row in mall_df.iterrows():
    folium.CircleMarker([row['location.lat'], row['location.lng']],                   
                        popup='Shopping Mall',
                        radius=3,
                        fill=True,
                        fill_color='yellow',
                        fill_opacity=1,
                        color='red').add_to(Kiev)
     
Kiev

As we can see from the map, our restaurant is near a metro station and shopping mall. It is also not very far from the city center. Lets see what predicted category is for this venue.

Now we will find distances to metro, mall and city center:

In [412]:
#convert coordinates to radins
rest_rad_lat = np.radians(float(rest_lat))
rest_rad_lng = np.radians(float(rest_lng))

rest_to_metro_df = pd.DataFrame(
                            dist.pairwise(
                                [[rest_rad_lat,rest_rad_lng]],
                                metro_df_rad[['location.lat','location.lng']].to_numpy())*6365600,
                                columns=metro_df['name'].unique(), 
                                
                            )

#Add distance to closest metro to variabe:
closMetro = rest_to_metro_df.min(axis=1)
closMetro = closMetro[0]
print(closMetro)

277.72923005069634


Distance to near shopping mall:

In [415]:
rest_to_mall_df = pd.DataFrame(
                            dist.pairwise(
                                [[rest_rad_lat,rest_rad_lng]],
                                mall_df_rad[['location.lat','location.lng']].to_numpy())*6365600,
                                columns=mall_df['id'].unique(), 
                               
                            )
closMall = rest_to_mall_df.min(axis=1)
closMall = closMall[0]
print(closMall)

302.60394310989045


Distance to city center:

In [417]:
rest_to_CitCent_df = pd.DataFrame(
                            dist.pairwise(
                                [[rest_rad_lat,rest_rad_lng]],
                                [[Center_lat, Center_lng]])*6365600,
                
                                    )
rest_cc = rest_to_CitCent_df.min(axis=1)
rest_cc = rest_cc[0]
print(rest_cc)

5491.509232819694


Dataframe with all values from above:

In [458]:
rest_df = pd.DataFrame(columns = ['MDScore', 'SMScore', 'CCScore'])

In [431]:
#Calculating scores for distances
#MDScore
closMetroScore = (500-closMetro)/5
closMetroScore = int(closMetroScore.round())
#SMScore
closMallScore = int(((500-closMall)/5).round())
#CCScore
rest_ccScore = (CityCenterDistanceDf['DistanceToCC'].max()-rest_cc)*100/CityCenterDistanceDf['DistanceToCC'].max()
rest_ccScore = int(rest_ccScore.round())

In [459]:
#add values to our dataframe
rest_df = rest_df.append({
                          'MDScore': closMetroScore,
                          'SMScore': closMallScore, 
                          'CCScore': rest_ccScore}, ignore_index=True)

In [460]:
rest_df

Unnamed: 0,MDScore,SMScore,CCScore
0,44,39,54


After we filled dataframe, we can make a prediction for VPI category, using our trained model:

In [464]:
predictor_array = np.asarray(rest_df)
PredictedVPICategory = clf.predict(np.asarray(rest_df))
print('Predicted category for the restaurant:', PredictedVPICategory)

Predicted category for the restaurant: [2]


Our Predicted category for new venue is: **2**, or 'Medium popularity'.


## Results and Discussion <a name="results"></a>

Results that we recieved in prediction looks plausible. Coordinates for the new restaurant were selected so that we can visually determine popularity of the venue. And the fact that developed model returns result that is logically correct just from visual analysis of the map confirms that similar model may be used for similar real world problem. Although, I should note, that Foursquare does not contain the most correct venue data for Kiev anymore, as during development of this project I've stumbled upon incorrect locations, low amount of usefull data or old information. This affects seriously on the results but are enough for the proof of concept. 

There is always room for improvement: we could try to use another parameters for determining the popularity, 
Get more general data regarding restaurant chain - use not only McDonald's, but other popular chain fast food restaurant, such as KFC or Dominos. In that case, if we have enough data - we could use different prediction model and even try to predict continuous value of popularity index, not only category.

## Conclusion <a name="conclusion"></a>

Purpose for this project was to help stakeholders to decide, whether they should open restaurant in their selected location and what popularity should they expect. We defined three valuable parameters that should affect the most on the popularity of the venue. All of them are geographical-based: distance to the Shopping Center, City Center and Metro Station. These places have one of the highest flow of people daily, therefore will affect on the amount of visitors to the restaurant, located nearby. 

There is always limitation in where stakeholders could open restaurant, and so they could have couple of locations on their mind, where they wanted to place their next venue. Methodology used in this project could help them to check popularity for each of those locations and decide which one fits better to their expectations.