In [237]:
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import requests
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import folium

# The Starbucks Exploration

 ![Alt text](https://1000logos.net/wp-content/uploads/2016/12/Font-Starbucks-Logo.png)

# Contents

1. <a href="#item1">Introduction</a>
2. <a href="#item2">Data</a>  
3. <a href="#item3">Methodology</a>
4. <a href="#item4">Results</a> 
5. <a href="#item5">Discussion</a> 
6. <a href="#item6">Conclusion</a> 


## <a id="item1" style="color:#006400">1. Introduction</a>

### Context
***

Thanks to Coursera and IBM, we are certified in Python and machine learning, and it did not take long for coffee company Starbucks to hire us as data scientists! We will work for the European Division!  

Starbucks has kept growing since its creation and continuously opens more stores across the world.  

In [106]:
url = "https://en.wikipedia.org/wiki/Starbucks#Locations"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

table = soup.find_all('table')
df = pd.read_html(str(table))[1]
df.tail(7)

Unnamed: 0,Year,Revenuein mil. US$,Net incomein mil. US$,Total Assetsin mil. US$,AveragePrice per Sharein US$,Employees
8,2013,14867,8,11517,33.71,182000
9,2014,16448,2068,10753,37.78,191000
10,2015,19163,2757,12416,53.25,238000
11,2016,21316,2818,14313,56.59,254000
12,2017,22387,2885,14366,57.27,277000
13,2018,24720,4518,24156,57.5,291000
14,2019,26509,3599,19220,81.44,346000


According to the same Wikipedia page, as of May 2020 Starbucks is present in over 30,000 locations, on 6 continents and in 79 countries.

### Business Problem
***

Our mission is to keep this global expansion going by opening a new store in Europe, but the location must be carefully chosen to guarantee success.

We will solve this problem by studying the current store locations. We will then choose a highly populated city where Starbucks is not yet heavily present.

We will then try to find a more precise location within this city. To do so, we will select several successful Starbucks stores, use the Foursquare API to characterize their neighbourhoods, and try to find a similar location in our target city where there is no store yet!

---
## <a id="item2">2. Data</a>

### City Populations
***

We will need some population data to be able to find out where Starbucks is not yet heavily present.  
The table from Wikipedia also contains GPS coordinates, which will be useful later, so I convert them into a usable form.

In [107]:
url = "https://en.wikipedia.org/wiki/List_of_European_cities_by_population_within_city_limits"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

table = soup.find_all('table')
df = pd.read_html(str(table))[0]
df = df[['City', 'Country', 'Officialpopulation', 'Location']].copy()

# Strip single-character footnote markers such as [a], then thousands separators
df['City'] = df['City'].replace(r'\[.\]', "", regex=True)
df['Officialpopulation'] = df['Officialpopulation'].replace(r'\[.\]', "", regex=True)
df['Officialpopulation'] = df['Officialpopulation'].replace(r',', "", regex=True)
# Keep only the decimal-degree part of the location (everything after the last "/ ")
df['Location'] = df['Location'].replace(r'(.*)/ ', "", regex=True)
df['Latitude'] = df['Location'].str.extract(r'(\d{1,}\.\d{1,})').astype('f4')
df['Longitude'] = df['Location'].replace(r'(.*)°N ', "", regex=True)
df['Longitude'] = df['Longitude'].replace(r'°(.*)', "", regex=True)

df.drop('Location', inplace=True, axis=1)
df['Officialpopulation'] = df['Officialpopulation'].astype('int32')
df['Longitude'] = df['Longitude'].astype('f4')
# The E/W suffix was dropped above, so western longitudes have to be negated by hand
df.loc[5, 'Longitude'] = -df.loc[5, 'Longitude']
# NOTE: row 3 is Saint Petersburg, which lies east of London, while London itself
# (row 2, about 0.13°W) is not negated; the outputs shown in this notebook reflect this
df.loc[3, 'Longitude'] = -df.loc[3, 'Longitude']
df.loc[24, 'Longitude'] = -df.loc[24, 'Longitude']

df.head(10)

Unnamed: 0,City,Country,Officialpopulation,Latitude,Longitude
0,Istanbul,Turkey,15519267,41.013611,28.955
1,Moscow,Russia,12615279,55.75,37.616669
2,London,United Kingdom,9126366,51.507221,0.1275
3,Saint Petersburg,Russia,5383890,59.950001,-30.299999
4,Berlin,Germany,3748148,52.516666,13.383333
5,Madrid,Spain,3223334,40.383331,-3.716667
6,Kiev,Ukraine,2950800,50.450001,30.523333
7,Rome,Italy,2844750,41.900002,12.5
8,Paris,France,2140526,48.856701,2.3508
9,Bucharest,Romania,2106144,44.432499,26.103889


In [108]:
df.shape

(35, 5)
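
Negating longitudes by hand is easy to get wrong, so here is a minimal sketch of how the hemisphere letters could drive the sign instead. It assumes each location string ends in decimal degrees formatted like `41.01361°N 28.95500°E`, which I have not checked for every row; `parse_decimal_coords` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

# Assumed format of the decimal part of a location cell: "41.01361°N 28.95500°E"
_COORD = re.compile(r"(\d+(?:\.\d+)?)°\s*([NS])\s+(\d+(?:\.\d+)?)°\s*([EW])")

def parse_decimal_coords(location):
    """Return (latitude, longitude) in signed decimal degrees, or None if no match."""
    match = _COORD.search(location)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (-1 if ns == "S" else 1)    # south is negative
    longitude = float(lon) * (-1 if ew == "W" else 1)   # west is negative
    return latitude, longitude

print(parse_decimal_coords("51.50722°N 0.12750°W"))   # a point west of Greenwich
print(parse_decimal_coords("59.95°N 30.3°E"))         # a point east of Greenwich
print(parse_decimal_coords("58°N 56.316667°E"))       # latitude without decimals
```

Making the decimal part optional also handles latitudes given as whole degrees, which a pattern requiring a decimal point skips over.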

### Starbucks and Neighborhood Venues from Foursquare API
***

The Foursquare API will be used to find the number of Starbucks stores in each city and to identify which stores have the best ratings and the most likes, and hence are likely to be top spots in their respective cities!

We will then use the Foursquare API again to characterize the surroundings of a reference store and try to find a similar neighbourhood in our target city.

---
## <a id="item3"><font color=darkgreen>3. Methodology</font></a>

### Finding the City with the Fewest Stores
***

A big limitation of the Foursquare API is that a venue search returns at most 50 results.  
This is acceptable for our application because we are interested in cities with a low number of stores!  

The first step of our study is to group Starbucks stores by city and count the number of occurrences.  
Let us query Foursquare API to find Starbucks Stores for each city above.

In [247]:
CLIENT_ID = 'XXX' # your Foursquare ID
CLIENT_SECRET = 'XXX' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 50
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: XXX
CLIENT_SECRET:XXX


In [111]:
search_query = 'Starbucks'
SB_nb = {'StarbucksStoresCount': []}

radius = 15000  # search radius in metres around each city centre
for Latitude, Longitude in zip(df['Latitude'], df['Longitude']): 
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&intent=browse&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Latitude, Longitude, VERSION, search_query, radius, LIMIT)
    # One API call per city: run this loop once, then work from the saved CSV file
    results = requests.get(url).json()
    venues = results['response']['venues']

    dataframe = json_normalize(venues)
    SB_nb['StarbucksStoresCount'].append(dataframe.shape[0])
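
The long format string above is hard to read and easy to break. As a sketch, the same search URL could be assembled from named parameters with the standard library. `build_search_url` is a hypothetical helper; the parameter names are the ones already used in the URL above.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lng, query,
                     radius=15000, limit=50, version="20180604"):
    """Build a Foursquare v2 venue-search URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",      # latitude,longitude of the search centre
        "v": version,
        "query": query,
        "radius": radius,          # metres
        "intent": "browse",
        "limit": limit,
    }
    # urlencode escapes special characters (the comma in ll becomes %2C)
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

url = build_search_url("XXX", "XXX", 41.900002, 12.5, "Starbucks")
print(url)
```

Passing the same dictionary to `requests.get(base_url, params=params)` would have the same effect without building the string by hand.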


In [112]:
SB_nb = pd.DataFrame.from_dict(SB_nb)
df = pd.concat([df, SB_nb], axis=1)
df.head(15)

Unnamed: 0,City,Country,Officialpopulation,Latitude,Longitude,StarbucksStoresCount
0,Istanbul,Turkey,15519267,41.013611,28.955,50
1,Moscow,Russia,12615279,55.75,37.616669,50
2,London,United Kingdom,9126366,51.507221,0.1275,50
3,Saint Petersburg,Russia,5383890,59.950001,-30.299999,0
4,Berlin,Germany,3748148,52.516666,13.383333,22
5,Madrid,Spain,3223334,40.383331,-3.716667,50
6,Kiev,Ukraine,2950800,50.450001,30.523333,1
7,Rome,Italy,2844750,41.900002,12.5,2
8,Paris,France,2140526,48.856701,2.3508,50
9,Bucharest,Romania,2106144,44.432499,26.103889,32


Let us now add a column SB_Density (Starbucks density).  
It holds the number of inhabitants per Starbucks store. The higher the value, the better for our study!

In [113]:
df['SB_Density'] = df['Officialpopulation'].divide(df['StarbucksStoresCount'])
df.sort_values(by = 'SB_Density', ascending=False, inplace=True)
df.head(25)

Unnamed: 0,City,Country,Officialpopulation,Latitude,Longitude,StarbucksStoresCount,SB_Density
21,Nizhny Novgorod,Russia,1259013,56.326942,44.0075,0,inf
3,Saint Petersburg,Russia,5383890,59.950001,-30.299999,0,inf
32,Perm,Russia,1051583,56.316666,56.316666,0,inf
27,Ufa,Russia,1121429,54.75,55.966667,0,inf
6,Kiev,Ukraine,2950800,50.450001,30.523333,1,2950800.0
10,Minsk,Belarus,1982444,53.900002,27.566668,1,1982444.0
17,Kharkiv,Ukraine,1451132,50.004444,36.231388,1,1451132.0
7,Rome,Italy,2844750,41.900002,12.5,2,1422375.0
30,Tekirdağ,Turkey,1055412,40.977779,27.515278,1,1055412.0
33,Volgograd,Russia,1013533,48.700001,44.516666,1,1013533.0
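
Two special cases hide in this column: a count of 0 gives `inf`, and a count of 50 only means "at least 50" because of the API limit. A small sketch of how both could be made explicit, using a hypothetical helper `people_per_store` and figures taken from the table above:

```python
import math

API_LIMIT = 50  # maximum number of venues returned by one search

def people_per_store(population, store_count, limit=API_LIMIT):
    """Inhabitants per store; None when the count is capped and therefore unknown."""
    if store_count >= limit:
        return None        # the true count may be higher, so the ratio is not reliable
    if store_count == 0:
        return math.inf    # no store found at all
    return population / store_count

for city, pop, count in [("Rome", 2844750, 2), ("London", 9126366, 50), ("Ufa", 1121429, 0)]:
    print(city, people_per_store(pop, count))
```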


Every time I run a lot of API requests (because of a loop), I save the results as a CSV file so that I do not need to repeat the same requests and hit the maximum allowed.

In [114]:
df.to_csv(r'european_cities.csv')
df = pd.read_csv("european_cities.csv", index_col=0)
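
The same idea can be wrapped in a small helper so that a request is only sent when no saved copy exists. This is a sketch with hypothetical names (`cached_json`, the file path), storing the raw JSON response rather than a dataframe.

```python
import json
from pathlib import Path

def cached_json(path, fetch):
    """Return the JSON stored at `path`, calling `fetch()` only if the file is missing."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()  # for example: lambda: requests.get(url).json()
    path.write_text(json.dumps(data), encoding="utf-8")
    return data

# Usage (hypothetical file name):
# results = cached_json("rome_search.json", lambda: requests.get(url).json())
```

Deleting the file forces a fresh request the next time the cell runs.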

A few observations on these results:  

* Russia, Ukraine and Belarus all look like promising markets. Starbucks is already successful in Moscow, and many other populous cities could be good locations for a new store. However, because the alphabet is different, the next part of this project would be difficult (with some neighbourhoods spelt in Cyrillic!).

* Italy is a peculiar case, with 2 cities (Rome, Milan) in our top 25. It may sound astonishing that Starbucks has not already taken over this market, but coffee culture in Italy is very traditional and deeply rooted. This [Forbes article](https://www.forbes.com/sites/jennawang/2018/09/13/why-it-took-starbucks-47-years-to-open-a-store-in-italy/) is a good read. Still, Starbucks is not only about coffee: the lifestyle experience is equally important, as the recent store opening in Milan shows!

* Germany is another country with a reasonably small density of Starbucks stores! 

* SB_Density is naturally wrong for cities where the count reaches 50, because of the API results limit: the real number of stores may be higher. The count of 0 for Saint Petersburg is also unreliable, since its longitude appears as -30.3 in the table whereas the city lies at about 30.3°E, so the query searched the wrong place.

Rome therefore looks like a good place to build a new store!

### Finding the City with the Most Stores
***

The API results limit is a real obstacle here, but a quick Google search shows that [London](https://www.newstatesman.com/jonn-elledge/2014/05/london-has-more-branches-starbucks-any-eu-country) has the most Starbucks stores.  

Without that limit, the best method would have been to simply count the stores returned by the API query for each city.

### Finding the Most Popular Starbucks & Reference Neighborhood 
***

Let's explore the centre of London and try to find the most popular Starbucks!

In [115]:
Latitude = 51.507221
Longitude = -0.127500
radius = 1000
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Latitude, Longitude, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()
venues_London = results['response']['venues']
dataframe_London = json_normalize(venues_London)
dataframe_London.head(5)


Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress,location.crossStreet,location.neighborhood
0,502904745dd7750e9d63bc17,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,1-3 Villiers street,51.50745,-0.122863,"[{'label': 'display', 'lat': 51.50745, 'lng': ...",322,WC2N 6NN,GB,London,Greater London,United Kingdom,"[1-3 Villiers street, London, Greater London, ...",,
1,4b840247f964a520e91a31e3,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,"10 Kingsway, Unit B2; St Catherines House",51.513868,-0.117669,"[{'label': 'display', 'lat': 51.513868, 'lng':...",1005,WC2B 6LH,GB,London,Greater London,United Kingdom,"[10 Kingsway, Unit B2; St Catherines House, Lo...",,
2,4b73d904f964a52076bd2de3,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,"Charing Cross Road, Unit 1 129-133",51.515233,-0.130191,"[{'label': 'display', 'lat': 51.515233, 'lng':...",911,WC2H 0EA,GB,London,Greater London,United Kingdom,"[Charing Cross Road, Unit 1 129-133, Camden, ...",,
3,4b7553b8f964a5207a062ee3,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,10 Russell Street,51.512332,-0.121685,"[{'label': 'display', 'lat': 51.512332, 'lng':...",697,WC2B 5HZ,GB,London,Greater London,United Kingdom,"[10 Russell Street, London, Greater London, WC...",,
4,4ad58b9ff964a5201b0321e3,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,112 - 116 New Oxford Street,51.516645,-0.129358,"[{'label': 'display', 'lat': 51.516645, 'lng':...",1056,WC1A 1HH,GB,Bloomsbury,Greater London,United Kingdom,"[112 - 116 New Oxford Street, Bloomsbury, Grea...",,


### Finding the Best-Rated Store in London and Studying Its Neighborhood
***

We can get the rating of each store in Central London.

In [116]:
ratings = {'rating': [], 'likes': []}
for venue_id in dataframe_London['id']:
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    result = requests.get(url).json()
    # Some venues have no rating or likes: fall back to 0 when a key is missing
    try:
        ratings['rating'].append(result['response']['venue']['rating'])
    except KeyError:
        ratings['rating'].append(0)
    try:
        ratings['likes'].append(result['response']['venue']['likes']['count'])
    except KeyError:
        ratings['likes'].append(0)
        
ratings = pd.DataFrame.from_dict(ratings)
dataframe_London = pd.concat([dataframe_London, ratings], axis=1)

dataframe_London.head(5)

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress,location.crossStreet,location.neighborhood,rating,likes
0,502904745dd7750e9d63bc17,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,1-3 Villiers street,51.50745,-0.122863,"[{'label': 'display', 'lat': 51.50745, 'lng': ...",322,WC2N 6NN,GB,London,Greater London,United Kingdom,"[1-3 Villiers street, London, Greater London, ...",,,6.8,125
1,4b840247f964a520e91a31e3,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,"10 Kingsway, Unit B2; St Catherines House",51.513868,-0.117669,"[{'label': 'display', 'lat': 51.513868, 'lng':...",1005,WC2B 6LH,GB,London,Greater London,United Kingdom,"[10 Kingsway, Unit B2; St Catherines House, Lo...",,,7.3,70
2,4b73d904f964a52076bd2de3,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,"Charing Cross Road, Unit 1 129-133",51.515233,-0.130191,"[{'label': 'display', 'lat': 51.515233, 'lng':...",911,WC2H 0EA,GB,London,Greater London,United Kingdom,"[Charing Cross Road, Unit 1 129-133, Camden, ...",,,6.9,126
3,4b7553b8f964a5207a062ee3,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,10 Russell Street,51.512332,-0.121685,"[{'label': 'display', 'lat': 51.512332, 'lng':...",697,WC2B 5HZ,GB,London,Greater London,United Kingdom,"[10 Russell Street, London, Greater London, WC...",,,6.6,165
4,4ad58b9ff964a5201b0321e3,Starbucks,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1591021496,False,112 - 116 New Oxford Street,51.516645,-0.129358,"[{'label': 'display', 'lat': 51.516645, 'lng':...",1056,WC1A 1HH,GB,Bloomsbury,Greater London,United Kingdom,"[112 - 116 New Oxford Street, Bloomsbury, Grea...",,,6.4,293
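
Reading optional fields from a nested response can also be done without `try`/`except`. The sketch below uses a hypothetical helper, `nested_get`, and made-up responses shaped like the venue details used above.

```python
def nested_get(data, keys, default=0):
    """Walk `keys` through nested dicts, returning `default` if any key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data

# Hypothetical responses, shaped like the venue-details payload
with_rating = {"response": {"venue": {"rating": 7.3, "likes": {"count": 186}}}}
without_rating = {"response": {"venue": {"likes": {"count": 12}}}}

for result in (with_rating, without_rating):
    rating = nested_get(result, ["response", "venue", "rating"])
    likes = nested_get(result, ["response", "venue", "likes", "count"])
    print(rating, likes)
```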


In [117]:
dataframe_London.to_csv(r'SB_London.csv')
dataframe_London = pd.read_csv("SB_London.csv", index_col=0)

In [118]:
dataframe_London.sort_values(by = 'rating', ascending=False, inplace=True)
dataframe_London = dataframe_London[['id', 'location.address', 'location.lat', 'location.lng', 'location.postalCode', 'rating', 'likes']]
dataframe_London.head()

Unnamed: 0,id,location.address,location.lat,location.lng,location.postalCode,rating,likes
8,4ae2d576f964a520828f21e3,52 Berkeley St.,51.507442,-0.142527,W1J 8ET,7.3,186
1,4b840247f964a520e91a31e3,"10 Kingsway, Unit B2; St Catherines House",51.513868,-0.117669,WC2B 6LH,7.3,70
16,4b8e209af964a5200b1933e3,27 Berkeley St,51.508831,-0.144199,W1X 5AD,7.1,96
15,4ad9ee57f964a520031c21e3,"34 Great Marlborough St, (Carnaby Street)",51.513952,-0.139378,W1F 7JD,7.0,108
5,4b9e7bb3f964a520ece736e3,"6A Vigo Street, London",51.510288,-0.139154,W1S 3HF,7.0,380


We have 2 winners with a rating of 7.3! However, the first one has more likes (186 against 70), so it sounds like a good place to start from.

In [122]:
London_Latitude = dataframe_London['location.lat'].iloc[0]
London_Longitude = dataframe_London['location.lng'].iloc[0]
print('Target Latitude: ' + str(London_Latitude)  + '; Target Longitude: ' + str(London_Longitude))

Target Latitude: 51.507442; Target Longitude: -0.142527


52 Berkeley St. is the reference location for the remainder of this battle.  
I will define a function to search venues in each neighborhood in the next section!

### Studying neighbourhoods in Rome
***

I could not find a list of Rome neighbourhoods along with GPS coordinates, so I created my own using Folium.  
I fine-tuned my grid until I was visually happy with the point positions. I wanted them to cover most of the city centre.

In [246]:
def plot_Rome_neigh():
    
    [latitude, longitude] = [41.900002, 12.500000]
    k = 0
    
    venues_map = folium.Map(location=[latitude, longitude], zoom_start=13)
    for i in range(-2,3):
        for j in range(-3,3):

            folium.CircleMarker(
                [latitude+i/110, longitude+j/80],  # same grid step as the neighborhood dataframe
                radius=5,
                color='blue',
                popup='R' + str(k),
                fill = True,
                fill_color='blue',
                fill_opacity=0.6
            ).add_to(venues_map)
            k = k + 1
    return venues_map

plot_Rome_neigh()

The associated dataframe with custom neighborhood names is built as follows.  
Note that I add the London reference neighborhood straight away for efficiency!

In [126]:
neighborhoods = {'Neighborhood': ['London'], 'Latitude': [London_Latitude], 'Longitude': [London_Longitude]}
[latitude, longitude] = [41.900002, 12.500000]
k = 0
for i in range(-2,3):
    for j in range(-3,3):
        neighborhoods['Neighborhood'].append('R' + str(k))
        k = k + 1
        neighborhoods['Latitude'].append(latitude+i/110)
        neighborhoods['Longitude'].append(longitude+j/80)
df_nb = pd.DataFrame.from_dict(neighborhoods)
df_nb.head(5)

Unnamed: 0,Neighborhood,Latitude,Longitude
0,London,51.507442,-0.142527
1,R0,41.88182,12.4625
2,R1,41.88182,12.475
3,R2,41.88182,12.4875
4,R3,41.88182,12.5


At this latitude and for 1/110 deg of latitude and 1/80 deg of longitude, 1 grid step is approximately 1000m.  
Source: http://www.csgnetwork.com/degreelenllavcalc.html
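
The grid step can also be checked with a few lines of arithmetic. This is a rough sketch that assumes a spherical Earth with a mean radius of 6371 km, so it will differ slightly from the calculator linked above; `grid_step_metres` is a hypothetical helper.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation

def grid_step_metres(lat_deg, dlat_deg, dlon_deg):
    """Return the (north-south, east-west) size in metres of one grid step."""
    metres_per_degree = math.pi / 180 * EARTH_RADIUS_M
    north_south = dlat_deg * metres_per_degree
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = dlon_deg * metres_per_degree * math.cos(math.radians(lat_deg))
    return north_south, east_west

ns, ew = grid_step_metres(41.900002, 1 / 110, 1 / 80)
print(f"north-south step: {ns:.0f} m, east-west step: {ew:.0f} m")
```

Under this approximation both steps come out a little above 1000 m, which is consistent with the estimate above.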

Let's reuse the function from the NYC clustering exercise to get all nearby venues.

In [127]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

We have a grid, which is not so convenient for a search based on coordinates and a radius, but we can simply choose a radius for which our neighborhoods overlap (this is not a huge problem: neighborhoods do not have to be fully separated).  
For a grid with points spaced 1000m apart, circles of radius 500m just touch without overlapping. The radius necessary to fully cover the grid is half the diagonal of a grid square: about 707m. Let's choose 700m.
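
As a quick check of the covering radius: the point of a grid cell farthest from all four corners is its centre, at half the diagonal. `covering_radius` is a hypothetical helper written for this check.

```python
import math

def covering_radius(dx, dy):
    """Smallest radius of circles centred on grid points that leaves no gap."""
    return math.hypot(dx, dy) / 2  # half the diagonal of one dx-by-dy grid cell

print(round(covering_radius(1000, 1000), 1))
```

With 700m, only a tiny area around the centre of each cell is left uncovered.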

In [128]:
LIMIT = 100
radius = 700

# DO NOT RUN EVERY TIME - API REQUESTS LIMIT!
venues = getNearbyVenues(names=df_nb['Neighborhood'],
                                   latitudes=df_nb['Latitude'],
                                   longitudes=df_nb['Longitude'],
                                   radius=radius  # without this, the function falls back to its 500m default
                                  )

In [178]:
#venues.to_csv(r'battle_venues.csv')
venues = pd.read_csv("battle_venues.csv", index_col=0)
print(venues.shape)
venues.head()

(1468, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,London,51.507442,-0.142527,The Ritz London,51.507078,-0.141627,Hotel
1,London,51.507442,-0.142527,Novikov,51.507767,-0.14285,Asian Restaurant
2,London,51.507442,-0.142527,Brown's Hotel,51.509127,-0.142077,Hotel
3,London,51.507442,-0.142527,Burger & Lobster,51.507118,-0.145477,Seafood Restaurant
4,London,51.507442,-0.142527,Prada,51.508998,-0.140959,Boutique


The following code one-hot-encodes the data and works out the frequency of each venue for each neighborhood.

In [179]:
# one hot encoding
rome_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

rome_onehot['Neighborhood'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [rome_onehot.columns[-1]] + list(rome_onehot.columns[:-1])
rome_onehot = rome_onehot[fixed_columns]

rome_onehot.head()

Unnamed: 0,Neighborhood,Abruzzo Restaurant,Accessories Store,African Restaurant,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Art Studio,Arts & Crafts Store,...,Track Stadium,Train Station,Trattoria/Osteria,Turkish Restaurant,Vegetarian / Vegan Restaurant,Watch Shop,Wine Bar,Wine Shop,Winery,Zoo
0,London,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,London,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,London,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,London,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,London,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [180]:
rome_grouped = rome_onehot.groupby('Neighborhood').mean().reset_index()
rome_grouped.head(5)

Unnamed: 0,Neighborhood,Abruzzo Restaurant,Accessories Store,African Restaurant,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Art Studio,Arts & Crafts Store,...,Track Stadium,Train Station,Trattoria/Osteria,Turkish Restaurant,Vegetarian / Vegan Restaurant,Watch Shop,Wine Bar,Wine Shop,Winery,Zoo
0,London,0.0,0.0,0.0,0.0,0.013333,0.08,0.013333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.013333,0.0,0.013333,0.0,0.0
1,R0,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,...,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0
2,R1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.045455,0.0,0.0,0.0,0.022727,0.0,0.0,0.0
3,R10,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.04878,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,R11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.023256,0.0,0.0,0.0,0.046512,0.0,0.0,0.0
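
Taking the mean of one-hot columns per neighborhood is the same as computing, for each category, its share of the venues in that neighborhood. A toy sketch with made-up rows (the pairs in `rows` are hypothetical, not taken from the data above):

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, category) pairs standing in for the venues table
rows = [("A", "Hotel"), ("A", "Café"), ("A", "Hotel"), ("A", "Plaza"),
        ("B", "Café"), ("B", "Café")]

def category_frequencies(rows):
    """Share of each venue category within each neighborhood."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }

freqs = category_frequencies(rows)
print(freqs)
```

Each neighborhood's frequencies sum to 1, which is why neighborhoods with different numbers of venues can be compared.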


In [182]:
#rome_grouped.to_csv(r'rome_grouped.csv')
rome_grouped = pd.read_csv("rome_grouped.csv", index_col=0)
rome_grouped.columns

Index(['Neighborhood', 'Abruzzo Restaurant', 'Accessories Store',
       'African Restaurant', 'American Restaurant', 'Argentinian Restaurant',
       'Art Gallery', 'Art Museum', 'Art Studio', 'Arts & Crafts Store',
       ...
       'Track Stadium', 'Train Station', 'Trattoria/Osteria',
       'Turkish Restaurant', 'Vegetarian / Vegan Restaurant', 'Watch Shop',
       'Wine Bar', 'Wine Shop', 'Winery', 'Zoo'],
      dtype='object', length=177)

Ideally, I would spend a lot of time post-processing this data to obtain better groups of venues. I have only so much time, so I simply grouped together restaurants, bars, etc. The result still has 114 columns, so quite a lot...

In [188]:
def GroupSimilarVenues(df, strings):
    # df: the dataframe to condense (left unmodified; a condensed copy is returned)
    # strings: list of substrings to group on, for example Restaurant or Art
    df_new = df.copy()
    for string in strings:
        columns = [column for column in df_new.columns if string in column]
        # Sum first, drop the originals, then assign: a category named exactly
        # like the group (e.g. 'Bar') is then replaced instead of being dropped
        total = df_new[columns].sum(axis=1)
        df_new = df_new.drop(columns, axis=1)
        df_new[string] = total
    return df_new

venues_matrix = GroupSimilarVenues(rome_grouped, ['Restaurant', 'Art', 'Bar', 'Store'])
venues_matrix.shape

(31, 114)

Let's display the top 5 venues for each neighbourhood.

In [184]:
num_top_venues = 5

for hood in venues_matrix['Neighborhood']:
    print("----"+hood+"----")
    temp = venues_matrix[venues_matrix['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]  # skip the 'Neighborhood' row
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----London----
      venue  freq
0     Store  0.17
1       Art  0.09
2  Boutique  0.05
3     Hotel  0.05
4    Lounge  0.04


----R0----
         venue  freq
0        Hotel  0.09
1        Plaza  0.06
2       Winery  0.06
3  Pizza Place  0.06
4         Café  0.06


----R1----
               venue  freq
0              Plaza  0.09
1     Ice Cream Shop  0.07
2              Hotel  0.07
3               Café  0.07
4  Trattoria/Osteria  0.05


----R10----
               venue  freq
0              Hotel  0.10
1     Ice Cream Shop  0.07
2        Pizza Place  0.07
3    Bed & Breakfast  0.05
4  Trattoria/Osteria  0.05


----R11----
            venue  freq
0     Pizza Place  0.19
1            Café  0.07
2          Bistro  0.05
3           Plaza  0.05
4  Ice Cream Shop  0.05


----R12----
               venue  freq
0               Café  0.12
1              Hotel  0.11
2  Trattoria/Osteria  0.05
3     Ice Cream Shop  0.05
4             Castle  0.04


----R13----
            venue  freq
0           Hot

A proper classification of the venues would be necessary to improve the accuracy of the correlation between neighborhoods.  

I have taken a first step, but not all restaurants are equal, for example! For now, I am happy enough with this.
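
One possible direction for such a classification is an explicit keyword-to-group table matched on whole words, so that 'Bar' does not catch 'Barbershop'. The sketch below is purely illustrative: the group names, the keywords and the helper `assign_group` are my own assumptions, not a vetted taxonomy.

```python
import re

# Hypothetical rules; earlier groups take priority over later ones
GROUP_RULES = {
    "Food": ("Restaurant", "Pizza Place", "Trattoria/Osteria", "Bistro"),
    "Drinks": ("Bar", "Café", "Lounge"),
    "Shopping": ("Store", "Boutique"),
    "Lodging": ("Hotel", "Bed & Breakfast"),
}

def assign_group(category, rules=GROUP_RULES):
    """Return the first group with a keyword found as whole words in `category`."""
    for group, keywords in rules.items():
        for keyword in keywords:
            # \b anchors the keyword on word boundaries
            if re.search(r"\b" + re.escape(keyword) + r"\b", category):
                return group
    return "Other"

for name in ["Wine Bar", "Barbershop", "Italian Restaurant", "Arts & Crafts Store"]:
    print(name, "->", assign_group(name))
```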

### Similarity between Neighbourhoods
***

Let's assume that we did all the necessary work to properly classify venues of the same type.  
We can now calculate the correlation between our London reference neighborhood and each Rome neighborhood. I will use Pearson correlation for this task.

In [206]:
venues_matrix_clean = venues_matrix.copy()
venues_matrix_clean.drop('Neighborhood', axis=1, inplace=True)
venues_matrix_clean.head(10)

Unnamed: 0,BBQ Joint,Bagel Shop,Bakery,Bed & Breakfast,Beer Garden,Bistro,Boarding House,Bookstore,Boutique,Breakfast Spot,...,Tour Provider,Track Stadium,Train Station,Trattoria/Osteria,Watch Shop,Wine Shop,Winery,Zoo,Restaurant,Bar
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.053333,0.0,...,0.0,0.0,0.0,0.0,0.013333,0.013333,0.0,0.0,0.0,0.0
1,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.029412,0.0,0.0,0.0,0.058824,0.0,0.0,0.0
2,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.04878,0.0,0.02439,0.02439,0.0,0.0,0.0,...,0.0,0.0,0.0,0.04878,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.023256,0.0,0.046512,0.0,0.023256,0.0,0.0,...,0.0,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.01,0.03,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.05,0.0,0.0,0.01,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.014286,0.028571,0.0,0.0,0.014286,0.0,0.028571,0.0,...,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.017544,0.0,0.017544,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.025,0.0,0.0,...,0.0,0.0,0.0,0.05,0.0,0.025,0.0,0.0,0.0,0.0


In [210]:
corr_Pearson = venues_matrix_clean.corrwith(venues_matrix_clean.iloc[0, :], axis=1, drop=False, method='pearson')
corr_Pearson.head(5)

0    1.000000
1    0.208595
2    0.194343
3    0.154222
4   -0.054915
dtype: float64
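
For reference, the Pearson correlation between two frequency vectors $x$ and $y$ is $r = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$. A minimal plain-Python sketch of that formula, applied to made-up vectors rather than the real data:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical venue-frequency vectors over the same five categories
reference = [0.17, 0.09, 0.05, 0.05, 0.00]
candidate = [0.10, 0.08, 0.02, 0.09, 0.01]
print(round(pearson(reference, candidate), 3))
```

A value near 1 means the two neighborhoods have a similar mix of venue categories; a value near 0 means the mixes are unrelated.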

In [230]:
results = pd.concat([venues_matrix.iloc[:, 0], corr_Pearson], axis=1)
results.columns = ['Neighborhood', 'Correlation']

results.sort_values(by = 'Correlation', ascending=False, inplace=True)
results = pd.merge(results, df_nb, on='Neighborhood', validate="one_to_one")
results.head(5)

Unnamed: 0,Neighborhood,Correlation,Latitude,Longitude
0,London,1.0,51.507442,-0.142527
1,R20,0.448627,41.909093,12.4875
2,R19,0.435693,41.909093,12.475
3,R15,0.408138,41.900002,12.5
4,R28,0.326147,41.918184,12.5125


Naturally, the correlation of London with itself is 1. We also find that R20, R19 and R15 are the Rome neighborhoods most similar to the reference, which makes them plausible locations for a new Starbucks store!  
Let's map these places in Rome.

---
## <a id="item4"><font color=darkgreen>4. Results</font></a>

Let's plot the 3 suggested locations for a new Starbucks store in Rome.

In [245]:
def plot_Rome_SB(df):
    
    [latitude, longitude] = [41.900002, 12.500000]
    
    venues_map = folium.Map(location=[latitude, longitude], zoom_start=14) # generate map centred on Rome
    for nb, lat, lng in zip(df['Neighborhood'], df['Latitude'], df['Longitude']):
        folium.CircleMarker(
                [lat, lng],
                radius=5,
                color='blue',
                popup=nb,
                fill = True,
                fill_color='blue',
                fill_opacity=0.6).add_to(venues_map)
    return venues_map

plot_Rome_SB(results.iloc[1:4])

Looking back at the typical venues of each neighborhood, the main point in common between our London "reference" neighborhood and the Rome candidates is the high presence of hotels, restaurants and cafés.

Interestingly, Termini Station (R15), a major transportation hub of the city, comes up as a potential location. In fact, the only "Starbucks" location that the Foursquare API search query returned for Rome is located inside Termini Station. It is not a licensed Starbucks store but very much looks like one. This is a good sign that the algorithm has not completely lost it!

The other 2 are essentially next to the Villa Borghese park. R19 is next to the river. I would certainly enjoy a coffee in either place!

---
## <a id="item5"><font color=darkgreen>5. Discussion</font></a>

There are several ways this analysis could be improved.  
Because of time and resource limitations, I have taken certain shortcuts, but I believe the method would still apply.

Major improvements would consist of:
* Using an up-to-date list of current Starbucks locations, including coordinates and sales volume. This would certainly be available to an employee of the company.
* Not limiting the study to European countries. If Starbucks wanted to expand to areas where the alphabet or culture can be challenging to non-locals, the best option would be to have local offices carry out this job.
* Improving the queries to the API and sorting the request results better. Getting good-quality data takes as much time as one is willing to invest. Still, the results are far from illogical.

---
## <a id="item6"><font color=darkgreen>6. Conclusion</font></a>

As a new data scientist for Starbucks, my mission was to find the best location to open a new store.
To do so, I have:
* Determined which cities in Europe have the fewest stores per inhabitant. I decided to focus on Rome.
* Chosen a reference neighborhood around a Starbucks store with high user ratings. The store is located in London, the city where Starbucks is most present.
* Searched venues in Rome and near the reference store in London, and worked out the similarity between all those neighborhoods.

I have determined that 3 specific locations in Rome are similar to the London neighborhood where one of the best-rated Starbucks stores in central London is located. This result is a good point to start from.  
The next step would be to verify whether customers would be likely to visit these new locations.