# Helping people find the best neighborhood to live in

## Preface - visualizing maps

GitHub does not render Folium maps natively. To see them, please use nbviewer:
https://nbviewer.jupyter.org/

Copy the link below and paste it into nbviewer:

https://github.com/jrkumelys/Coursera_Capstone/blob/master/Coursera%20Capstone%20Final%20Project.ipynb

## Introduction

All over the world, people are constantly moving to different cities. They can move because of work, family, lifestyle, culture, and many other reasons.

When moving, some people may want to live in a neighborhood similar to where they live today.

However, they may not know the city they are moving to. Therefore, we will find which neighborhoods are the best match for them by comparing:
* Venues near their current home
* Venues in each neighborhood of the destination city

For this, we will consider a walking range of half a mile (about 800 m).

Of course, many other variables may affect this decision, such as cost of living, distance to work, crime rates, etc. These will <b><u>not</u></b> be included in this analysis.

To test the models, we will run two scenarios:

* Scenario 1: Jane currently lives in central London, and has received a job offer in Boston. She doesn't know the city, and wants to know which neighborhoods are best for her.
* Scenario 2: John lives in New York, and wants to live abroad for one year to improve his Spanish. He decided to go to Madrid. However, to reduce the culture shock, he wants to live in a neighborhood similar to the one he currently lives in (or as similar as possible).

## Data

* Boston neighborhoods: The list used will be from <a href="https://www.boston.gov/neighborhoods">Boston's government website</a>. 
* Madrid neighborhoods: The list used will be from <a href="https://en.wikipedia.org/wiki/List_of_neighborhoods_of_Madrid">Wikipedia</a>.
* Neighborhood latitude and longitude: These will be fetched using a Python geocoding library such as geopy.
* Venues nearby: These will be fetched using the Foursquare API.
* Current addresses from our 'clients'
  * Jane lives in Rushworth St, London
    * Latitude: 51.501463
    * Longitude: -0.1020907
  * John lives in Scholes St, Brooklyn, NY
    * Latitude: 40.708179
    * Longitude: -73.949628

## Methodology

We will follow these steps:
1. Use HTML parsing (requests and BeautifulSoup) to get the list of neighborhoods of each destination city into dataframes
2. Use APIs to get the latitude and longitude of each neighborhood
3. For each neighborhood, get a list of nearby venues and their types (e.g. restaurants, bars, nightclubs, etc) using the Foursquare API, and consolidate by neighborhood
4. Repeat the process for each of their current addresses
5. Use some measure of similarity to identify the best neighborhoods (e.g. Euclidean distance)
6. Analyze the results and plot on a map using the Folium library
7. Conclusion and discussion
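
Step 5 can be illustrated with a minimal sketch. It assumes each location is summarized as a mapping from venue category to relative frequency. The names `euclidean_distance`, `home`, and `candidates`, and all the numbers, are made up for illustration and are not results from this project.

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two {category: frequency} profiles.

    Categories missing from one profile are treated as frequency 0.
    """
    categories = set(profile_a) | set(profile_b)
    return math.sqrt(sum(
        (profile_a.get(c, 0.0) - profile_b.get(c, 0.0)) ** 2
        for c in categories
    ))

# Hypothetical profiles: the share of each venue category near a location
home = {"Cafe": 0.5, "Pub": 0.5}
candidates = {
    "Neighborhood 1": {"Cafe": 0.5, "Pub": 0.5},
    "Neighborhood 2": {"Park": 1.0},
}

# Smaller distance means a more similar neighborhood
ranked = sorted(candidates, key=lambda n: euclidean_distance(home, candidates[n]))
print(ranked)
```

Other similarity measures (cosine similarity, for example) could be swapped in by replacing the distance function.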

## 1. Getting the list of neighborhoods

First, we will import the necessary libraries for this step

In [244]:
from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us to download a web page

import pandas as pd
import numpy as np

print('Done')

Done


Now, we will define the urls containing the list of neighborhoods. Note that each link has a different structure and will need to be treated separately.

In [245]:
url_boston = "https://www.boston.gov/neighborhoods"
url_madrid = "https://en.wikipedia.org/wiki/List_of_neighborhoods_of_Madrid"

### 1.1. Getting the neighborhoods of Boston

First, we get the data from the page and process it with BeautifulSoup

In [246]:
text_data_boston = requests.get(url_boston).text
soup_boston = BeautifulSoup(text_data_boston,"html5lib")

Inspecting the page, we can see that the neighborhood names are all under 'h3' tags

In [247]:
all_h3s_boston = soup_boston.find_all('h3')
print('There are {} h3s in the page'.format(len(all_h3s_boston)))

There are 24 h3s in the page


We can see that we have 24 neighborhoods in the page. Great!!

Now, we will process these into a list of dictionaries and then into a dataframe

In [248]:
# Create an empty list
neigh_list_boston = []

# Append the text of each h3 into the list 
for name in all_h3s_boston:
    
    # Create empty dictionary
    cell = {}
    
    # Get the neighborhood name and append to the list
    cell['Neighborhood'] = name.text
    neigh_list_boston.append(cell)

# Pass the list into a dataframe
neigh_df_boston = pd.DataFrame(neigh_list_boston)

# Let's replace some values that our geolocator can't find under the previous name
neigh_df_boston.replace(to_replace = "Chinatown-Leather District", value="Chinatown", inplace = True)

# Let's see our list
neigh_df_boston

Unnamed: 0,Neighborhood
0,Allston
1,Back Bay
2,Bay Village
3,Beacon Hill
4,Brighton
5,Charlestown
6,Chinatown
7,Dorchester
8,Downtown
9,East Boston


Amazing. Now let's do the same for Madrid.

### 1.2. Getting the neighborhoods (and districts) of Madrid

First, we get the data from the page and process it with BeautifulSoup

In [249]:
text_data_madrid = requests.get(url_madrid).text
soup_madrid = BeautifulSoup(text_data_madrid,"html5lib")

Inspecting the page, we can see that the neighborhood names are all in a table. Let's see if it's the only table on the page

In [250]:
all_tables_madrid = soup_madrid.find_all('table')
print('There are {} tables in the page'.format(len(all_tables_madrid)))

There are 2 tables in the page


On the page, we can see that the table we're interested in is the first one. Let's get it

In [251]:
neigh_raw_table_madrid = all_tables_madrid[0]

We can see that Madrid's neighborhoods will require a lot more pre-processing than Boston. Let's do it

In [252]:
all_tds_madrid = neigh_raw_table_madrid.find_all('td')

In [253]:
# Create an empty list
neigh_list_madrid = []
district = ""
# Append the text of each td into the list 
for name in all_tds_madrid:
    
    # Create empty dictionary
    cell = {}
    
    # Get the neighborhood name and append to the list
    text = name.text

    # Neighborhood names contain a "\n", so let's remove it
    text = text.replace("\n","")
    
    # There are 3 things we need to filter out:
        # (1) Cells that are only numbers
        # (2) Cells that are empty
        # (3) Cells that contain "[[]]" - these are placeholders for images
    if (not text.isdecimal()) and (text != "") and (text != "[[]]"): # This part will make sure that we ignore the numbers of each neighborhood
        
        # We also want to identify cells that contain parentheses - these are districts
        if (text.find("(") >= 0):
            parenthesis = text.find("(") # Find the parenthesis
            text = text[0:parenthesis-1] # Only get what is before the parenthesis
            district = text.strip() # Record it as the current district

        else:
            cell['District'] = district
            cell['Neighborhood'] = text.strip() # Strip removes leading and trailing whitespace
            neigh_list_madrid.append(cell)
    
    
print("We extracted {} neighborhoods from the table".format(len(neigh_list_madrid)))

We extracted 131 neighborhoods from the table


Great, we got all 131 neighborhoods, and filtered out other information. Now let's move it into a dataframe
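
Before doing so, the filtering rules can be checked on a toy input. The sketch below is a simplified restatement of the loop above, not a replacement for it: `parse_cells` is a hypothetical helper, the sample strings are invented, and unlike the original it strips whitespace before testing for empty cells.

```python
def parse_cells(cell_texts):
    """Turn raw table-cell strings into district/neighborhood rows."""
    rows = []
    district = ""
    for raw in cell_texts:
        text = raw.replace("\n", "").strip()
        # Skip empty cells, pure numbers, and image placeholders
        if not text or text.isdecimal() or text == "[[]]":
            continue
        if "(" in text:
            # A cell with a parenthesis starts a new district
            district = text.split("(", 1)[0].strip()
        else:
            rows.append({"District": district, "Neighborhood": text})
    return rows

# Invented sample cells that mimic the shapes described in the comments above
sample = ["Centro (01)", "11", "Palacio\n", "", "[[]]", "12", " Embajadores\n",
          "Arganzuela (02)", "21", "Imperial\n"]
parsed = parse_cells(sample)
print(parsed)
```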

In [254]:
neigh_df_madrid = pd.DataFrame(neigh_list_madrid)

# Let's replace some values that our geolocator can't find under the previous name
neigh_df_madrid.replace(to_replace = "Valderrivas", value="Valderribas", inplace = True)
neigh_df_madrid.replace(to_replace = "Casco Histórico de Barajas", value="Calle Canal de Suez", inplace = True)

# Let's see our list
neigh_df_madrid

Unnamed: 0,District,Neighborhood
0,Centro,Palacio
1,Centro,Embajadores
2,Centro,Cortes
3,Centro,Justicia
4,Centro,Universidad
...,...,...
126,Barajas,Alameda de Osuna
127,Barajas,Aeropuerto
128,Barajas,Calle Canal de Suez
129,Barajas,Timón


This concludes step 1.

## 2. Getting latitude and longitude

First, we will import the necessary libraries for this step

In [255]:
!pip install geopy
from geopy.geocoders import Nominatim
print('Done')

Done


In [256]:
!pip install folium
import folium
print('Done')

Done


Now let's define the geolocator and test it on a famous location

In [257]:
geolocator = Nominatim(user_agent="coursera-capstone")
location = geolocator.geocode("Eiffel Tower, Paris, France")
print("Location of the Eiffel Tower: Latitude = {}, Longitude = {}".format(location.latitude, location.longitude))

Location of the Eiffel Tower: Latitude = 48.858260200000004, Longitude = 2.2944990543196795


Great. It's working
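
One caveat: geopy's `geocode` returns `None` when it finds no match, so reading `.latitude` from the result can fail for an unknown name. The sketch below shows one way to guard against that. `lookup_latlong` and `fake_geocode` are hypothetical names; the fake stands in for `geolocator.geocode` so the example runs without network access.

```python
from types import SimpleNamespace

def lookup_latlong(geocode, address):
    """Return a Latitude/Longitude dict, with None values if nothing is found.

    `geocode` is any callable that behaves like Nominatim.geocode.
    """
    location = geocode(address)
    if location is None:
        return {"Latitude": None, "Longitude": None}
    return {"Latitude": location.latitude, "Longitude": location.longitude}

def fake_geocode(address):
    # Stand-in for geolocator.geocode: knows a single made-up address
    if address == "Known Place":
        return SimpleNamespace(latitude=1.0, longitude=2.0)
    return None

found = lookup_latlong(fake_geocode, "Known Place")
missing = lookup_latlong(fake_geocode, "Nowhere")
print(found, missing)
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`. For long lists of addresses, geopy also provides `geopy.extra.rate_limiter.RateLimiter` to space out requests.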

### 2.1. Boston neighborhood locations

Now let's get the latlong values for our Boston neighborhoods

In [258]:
latlong_boston = []

for neigh in neigh_df_boston['Neighborhood']:
    cell = {}
    address = "{}, Boston, MA, US".format(neigh)
    print("..." + address)
    location = geolocator.geocode(address)
    cell['Latitude'] = location.latitude
    cell['Longitude'] = location.longitude
    latlong_boston.append(cell)
    
latlong_boston

...Allston, Boston, MA, US
...Back Bay, Boston, MA, US
...Bay Village, Boston, MA, US
...Beacon Hill, Boston, MA, US
...Brighton, Boston, MA, US
...Charlestown, Boston, MA, US
...Chinatown, Boston, MA, US
...Dorchester, Boston, MA, US
...Downtown, Boston, MA, US
...East Boston, Boston, MA, US
...Fenway-Kenmore, Boston, MA, US
...Hyde Park, Boston, MA, US
...Jamaica Plain, Boston, MA, US
...Mattapan, Boston, MA, US
...Mid-Dorchester, Boston, MA, US
...Mission Hill, Boston, MA, US
...North End, Boston, MA, US
...Roslindale, Boston, MA, US
...Roxbury, Boston, MA, US
...South Boston, Boston, MA, US
...South End, Boston, MA, US
...West End, Boston, MA, US
...West Roxbury, Boston, MA, US
...Wharf District, Boston, MA, US


[{'Latitude': 42.3554344, 'Longitude': -71.1321271},
 {'Latitude': 42.35054885, 'Longitude': -71.08031131584724},
 {'Latitude': 42.35001105, 'Longitude': -71.0669477958571},
 {'Latitude': 42.3587085, 'Longitude': -71.067829},
 {'Latitude': 42.3500971, 'Longitude': -71.1564423},
 {'Latitude': 42.3778749, 'Longitude': -71.0619957},
 {'Latitude': 42.3522166, 'Longitude': -71.0626074},
 {'Latitude': 42.2973205, 'Longitude': -71.0744952},
 {'Latitude': 42.3554309, 'Longitude': -71.0605001},
 {'Latitude': 42.3750973, 'Longitude': -71.0392173},
 {'Latitude': 42.34422445, 'Longitude': -71.09444446673666},
 {'Latitude': 42.2556543, 'Longitude': -71.1244963},
 {'Latitude': 42.3098201, 'Longitude': -71.1203299},
 {'Latitude': 42.2675657, 'Longitude': -71.0924273},
 {'Latitude': 42.3307858, 'Longitude': -71.0547497},
 {'Latitude': 42.33255965, 'Longitude': -71.10360773640765},
 {'Latitude': 42.3650974, 'Longitude': -71.0544954},
 {'Latitude': 42.2912093, 'Longitude': -71.1244966},
 {'Latitude': 42

Amazing! Let's add those to our dataframe

In [259]:
# Let's make a copy of our neighborhood dataframe
neigh_df_latlong_boston = neigh_df_boston.copy(deep = True)

# Let's pass the lat long to the dataframe as a dictionary
neigh_df_latlong_boston['Latlong'] = latlong_boston

# Let's split it into two columns - one for latitude and one for longitude
neigh_df_latlong_boston[['Latitude','Longitude']] = pd.DataFrame(neigh_df_latlong_boston['Latlong'].tolist(),index=neigh_df_latlong_boston.index)

# Let's drop the previous column
neigh_df_latlong_boston.drop('Latlong', axis=1, inplace=True)
neigh_df_latlong_boston

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Allston,42.355434,-71.132127
1,Back Bay,42.350549,-71.080311
2,Bay Village,42.350011,-71.066948
3,Beacon Hill,42.358708,-71.067829
4,Brighton,42.350097,-71.156442
5,Charlestown,42.377875,-71.061996
6,Chinatown,42.352217,-71.062607
7,Dorchester,42.29732,-71.074495
8,Downtown,42.355431,-71.0605
9,East Boston,42.375097,-71.039217
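
As an aside, the same result can probably be reached without the temporary 'Latlong' column, because a list of dictionaries converts directly into a dataframe. The sketch below uses two rows copied from the output above; `neigh` and `latlong` are illustrative stand-ins for `neigh_df_boston` and `latlong_boston`.

```python
import pandas as pd

# Two rows taken from the Boston output above
neigh = pd.DataFrame({"Neighborhood": ["Allston", "Back Bay"]})
latlong = [
    {"Latitude": 42.3554344, "Longitude": -71.1321271},
    {"Latitude": 42.35054885, "Longitude": -71.08031131584724},
]

# Build the coordinate columns in one step and attach them by index
combined = neigh.join(pd.DataFrame(latlong, index=neigh.index))
print(combined)
```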


Let's plot these to see if they seem reasonable

In [260]:
# Get average Boston coordinates to place the map
lat_boston = neigh_df_latlong_boston['Latitude'].mean()
long_boston = neigh_df_latlong_boston['Longitude'].mean()

# Define the map
map_neigh_boston = folium.Map(location=[lat_boston, long_boston], zoom_start=12)

# Add markers to neighborhoods
for lat, lng, label in zip(neigh_df_latlong_boston['Latitude'], neigh_df_latlong_boston['Longitude'], neigh_df_latlong_boston['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_neigh_boston)  

# Display the map
map_neigh_boston

Something we can immediately notice is that these are very far apart. It would be more appropriate to have smaller sections within the neighborhoods. Unfortunately, we don't have the data necessary for that

### 2.2. Madrid neighborhood locations

Let's repeat it for Madrid. This time, we can also use the district name for better precision

In [261]:
latlong_madrid = []

for neigh, distr in zip(neigh_df_madrid['Neighborhood'], neigh_df_madrid['District']):
    cell = {}
    address = "{}, {}, Madrid, Spain".format(neigh, distr)
    print("..." + address)
    location = geolocator.geocode(address)
    cell['Latitude'] = location.latitude
    cell['Longitude'] = location.longitude
    latlong_madrid.append(cell)
    
latlong_madrid

...Palacio, Centro, Madrid, Spain
...Embajadores, Centro, Madrid, Spain
...Cortes, Centro, Madrid, Spain
...Justicia, Centro, Madrid, Spain
...Universidad, Centro, Madrid, Spain
...Sol, Centro, Madrid, Spain
...Imperial, Arganzuela, Madrid, Spain
...Acacias, Arganzuela, Madrid, Spain
...Chopera, Arganzuela, Madrid, Spain
...Legazpi, Arganzuela, Madrid, Spain
...Delicias, Arganzuela, Madrid, Spain
...Palos de Moguer, Arganzuela, Madrid, Spain
...Atocha, Arganzuela, Madrid, Spain
...Pacífico, Retiro, Madrid, Spain
...Adelfas, Retiro, Madrid, Spain
...Estrella, Retiro, Madrid, Spain
...Ibiza, Retiro, Madrid, Spain
...Los Jerónimos, Retiro, Madrid, Spain
...Niño Jesús, Retiro, Madrid, Spain
...Recoletos, Salamanca, Madrid, Spain
...Goya, Salamanca, Madrid, Spain
...Fuente del Berro, Salamanca, Madrid, Spain
...Guindalera, Salamanca, Madrid, Spain
...Lista, Salamanca, Madrid, Spain
...Castellana, Salamanca, Madrid, Spain
...El Viso, Chamartín, Madrid, Spain
...Prosperidad, Chamartín, Madrid

[{'Latitude': 40.41512925, 'Longitude': -3.7156179983990922},
 {'Latitude': 40.409680550000004, 'Longitude': -3.701644426413222},
 {'Latitude': 40.4143476, 'Longitude': -3.6985251827738512},
 {'Latitude': 40.42395689999999, 'Longitude': -3.6957473208550464},
 {'Latitude': 40.425310350000004, 'Longitude': -3.706629859074133},
 {'Latitude': 40.4169467, 'Longitude': -3.7034891},
 {'Latitude': 40.4069288, 'Longitude': -3.71732197218085},
 {'Latitude': 40.4040749, 'Longitude': -3.7059572},
 {'Latitude': 40.39489315, 'Longitude': -3.6997051134630077},
 {'Latitude': 40.3911717, 'Longitude': -3.6951902},
 {'Latitude': 40.39729215, 'Longitude': -3.6894948496947286},
 {'Latitude': 40.40363845, 'Longitude': -3.6952890284051163},
 {'Latitude': 40.40053705, 'Longitude': -3.6821399313698384},
 {'Latitude': 40.4013961, 'Longitude': -3.6748832},
 {'Latitude': 40.4019026, 'Longitude': -3.6709579},
 {'Latitude': 40.4117618, 'Longitude': -3.6669977},
 {'Latitude': 40.4189526, 'Longitude': -3.6737251},
 {

Great, let's pass it to our dataframe and then place it on the map

In [262]:
# Let's make a copy of our neighborhood dataframe
neigh_df_latlong_madrid = neigh_df_madrid.copy(deep = True)

# Let's pass the lat long to the dataframe as a dictionary
neigh_df_latlong_madrid['Latlong'] = latlong_madrid

# Let's split it into two columns - one for latitude and one for longitude
neigh_df_latlong_madrid[['Latitude','Longitude']] = pd.DataFrame(neigh_df_latlong_madrid['Latlong'].tolist(),index=neigh_df_latlong_madrid.index)

# Let's drop the previous column
neigh_df_latlong_madrid.drop('Latlong', axis=1, inplace=True)
neigh_df_latlong_madrid

Unnamed: 0,District,Neighborhood,Latitude,Longitude
0,Centro,Palacio,40.415129,-3.715618
1,Centro,Embajadores,40.409681,-3.701644
2,Centro,Cortes,40.414348,-3.698525
3,Centro,Justicia,40.423957,-3.695747
4,Centro,Universidad,40.425310,-3.706630
...,...,...,...,...
126,Barajas,Alameda de Osuna,40.457581,-3.587975
127,Barajas,Aeropuerto,40.494838,-3.574081
128,Barajas,Calle Canal de Suez,40.473524,-3.579216
129,Barajas,Timón,40.473171,-3.584152


In [263]:
# Get average Madrid coordinates to place the map
lat_madrid = neigh_df_latlong_madrid['Latitude'].mean()
long_madrid = neigh_df_latlong_madrid['Longitude'].mean()

# Define the map
map_neigh_madrid = folium.Map(location=[lat_madrid, long_madrid], zoom_start=11)

# Add markers to neighborhoods
for lat, lng, label in zip(neigh_df_latlong_madrid['Latitude'], neigh_df_latlong_madrid['Longitude'], neigh_df_latlong_madrid['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_neigh_madrid)  

# Display the map
map_neigh_madrid

Amazing! We can see that they are much closer to each other, which will probably lead to more accurate results than in Boston

### 2.3. Boston LatLong Matrix

Due to Boston's neighborhoods being far apart, we will try a new approach.
It's still unclear if it will work, but we will try to generate a 'Grid' of points for Boston, not necessarily following the neighborhoods

In [264]:
# Define grid size
grid_size = 12

# Get the minimum and maximum latitudes and longitudes we have from Boston
lat_min_boston = neigh_df_latlong_boston['Latitude'].min()
long_min_boston = neigh_df_latlong_boston['Longitude'].min()
lat_max_boston = neigh_df_latlong_boston['Latitude'].max()
long_max_boston = neigh_df_latlong_boston['Longitude'].max()

# Create a list of N points across latitude and across longitude
list_lat_boston = np.arange(lat_min_boston, lat_max_boston, (lat_max_boston - lat_min_boston)/grid_size)
list_long_boston = np.arange(long_min_boston, long_max_boston, (long_max_boston - long_min_boston)/grid_size)

# Pass this into a list of dictionaries, with neighborhood as placeholders
new_latlong_boston = []

for lat in list_lat_boston:
    for long in list_long_boston:
        new_latlong_boston.append({'Latitude': lat, 'Longitude': long, 'Neighborhood': 'empty (placeholder)'})
        
# Finally, pass it into a dataframe
new_df_boston = pd.DataFrame(new_latlong_boston)
new_df_boston

Unnamed: 0,Latitude,Longitude,Neighborhood
0,42.255654,-71.156442,empty (placeholder)
1,42.255654,-71.146674,empty (placeholder)
2,42.255654,-71.136905,empty (placeholder)
3,42.255654,-71.127136,empty (placeholder)
4,42.255654,-71.117367,empty (placeholder)
...,...,...,...
139,42.367690,-71.088061,empty (placeholder)
140,42.367690,-71.078292,empty (placeholder)
141,42.367690,-71.068524,empty (placeholder)
142,42.367690,-71.058755,empty (placeholder)


It seems to have worked. Now let's plot it!

In [265]:
# Get average Boston coordinates to place the map
new_lat_boston = new_df_boston['Latitude'].mean()
new_long_boston = new_df_boston['Longitude'].mean()

# Define the map
new_map_boston = folium.Map(location=[new_lat_boston, new_long_boston], zoom_start=12)

# Add markers to neighborhoods
for lat, lng, label in zip(new_df_boston['Latitude'], new_df_boston['Longitude'], new_df_boston['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(new_map_boston)  

# Display the map
new_map_boston

It works. It follows a different premise, but will be interesting to see.
We could increase the number of points, but that would be computationally expensive, so we will leave it at 144 (12 x 12) for now
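
To judge whether the grid is dense enough for an 800 m search radius, it helps to convert the grid spacing to meters. The sketch below uses the haversine formula with a mean Earth radius, which is an approximation, on the first two latitude and longitude values printed for the grid. `haversine_m` is a hypothetical helper name.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Spacing between adjacent grid points, using coordinates from the grid output
north_south = haversine_m(42.2556543, -71.1564423, 42.26583935, -71.1564423)
east_west = haversine_m(42.2556543, -71.1564423, 42.2556543, -71.14667355)
print("North-south spacing: {:.0f} m".format(north_south))
print("East-west spacing: {:.0f} m".format(east_west))
```

Comparing these spacings with the search radius indicates how much the search circles of neighboring points overlap, and whether a finer grid would add much information.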

Let's try to get an address for each of these points

In [266]:
# Initialize an empty address list
addresses_list = []

# Find the closest address to each latlong combination and add them to the list
for lat, long in zip(new_df_boston['Latitude'], new_df_boston['Longitude']):
    found_address = geolocator.reverse([lat, long])
    print("... {}, {}".format(lat, long))
    addresses_list.append(found_address.address)

addresses_list

... 42.2556543, -71.1564423
... 42.2556543, -71.14667355
... 42.2556543, -71.13690480000001
... 42.2556543, -71.12713605000002
... 42.2556543, -71.11736730000003
... 42.2556543, -71.10759855000003
... 42.2556543, -71.09782980000004
... 42.2556543, -71.08806105000005
... 42.2556543, -71.07829230000006
... 42.2556543, -71.06852355000007
... 42.2556543, -71.05875480000007
... 42.2556543, -71.04898605000008
... 42.26583935, -71.1564423
... 42.26583935, -71.14667355
... 42.26583935, -71.13690480000001
... 42.26583935, -71.12713605000002
... 42.26583935, -71.11736730000003
... 42.26583935, -71.10759855000003
... 42.26583935, -71.09782980000004
... 42.26583935, -71.08806105000005
... 42.26583935, -71.07829230000006
... 42.26583935, -71.06852355000007
... 42.26583935, -71.05875480000007
... 42.26583935, -71.04898605000008
... 42.2760244, -71.1564423
... 42.2760244, -71.14667355
... 42.2760244, -71.13690480000001
... 42.2760244, -71.12713605000002
... 42.2760244, -71.11736730000003
... 42.27602

['121, Rockland Street, East Dedham, Dedham, Norfolk County, Massachusetts, 02026, United States',
 '6, Hyde Park Street, Dedham, Norfolk County, Massachusetts, 02026, United States',
 'Stony Brook Reservation, Dedham Boulevard, East Dedham, Boston, Suffolk County, Massachusetts, 02026, United States',
 '1339, River Street, Hyde Park, Boston, Suffolk County, Massachusetts, 02136, United States',
 'Fairmount, Walter Street, Fairmount, Hyde Park, Boston, Suffolk County, Massachusetts, 02137, United States',
 '55, Smith Road, Brushwood, Milton, Norfolk County, Massachusetts, 02137, United States',
 '28, Pagoda Street, Milton Upper Mills, Milton, Norfolk County, Massachusetts, 02137, United States',
 '65, Winthrop Street, Milton Upper Mills, Milton, Norfolk County, Massachusetts, 02126, United States',
 '104, Reedsdale Road, Milton Center, Milton, Norfolk County, Massachusetts, 02126, United States',
 'Milton Academy, Gun Hill Street, Milton, Norfolk County, Massachusetts, 02126, United St

Finally, let's add those to our new dataframe (PS: We will keep the column name as 'Neighborhood' for compatibility)

In [267]:
new_df_boston['Neighborhood'] = addresses_list
new_df_boston

Unnamed: 0,Latitude,Longitude,Neighborhood
0,42.255654,-71.156442,"121, Rockland Street, East Dedham, Dedham, Nor..."
1,42.255654,-71.146674,"6, Hyde Park Street, Dedham, Norfolk County, M..."
2,42.255654,-71.136905,"Stony Brook Reservation, Dedham Boulevard, Eas..."
3,42.255654,-71.127136,"1339, River Street, Hyde Park, Boston, Suffolk..."
4,42.255654,-71.117367,"Fairmount, Walter Street, Fairmount, Hyde Park..."
...,...,...,...
139,42.367690,-71.088061,"152, Fulkerson Street, East Cambridge, Cambrid..."
140,42.367690,-71.078292,"16, Hurley Street, East Cambridge, Cambridge, ..."
141,42.367690,-71.068524,"Craigie Drawbridge, Charles River Dam Road, We..."
142,42.367690,-71.058755,North Washington Street Bridge Replacement (20...


This has worked out well!

From now on, we will use these as the datapoints for Boston.

It was an interesting lesson: sometimes each problem requires a slightly different approach.

The new approach has the added benefit of being much more flexible - it can be applied to any location in the world, without the need to find the neighborhood names!
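
A minimal sketch of that idea as a reusable function follows. `make_grid` is a hypothetical name, and it uses plain Python rather than `np.arange`: NumPy's documentation notes that `arange` with a non-integer step can return an inconsistent number of elements, so computing each point from its index keeps the count fixed at `n` per axis. Like the cell above, it leaves out the upper bound.

```python
from itertools import product

def make_grid(lat_min, lat_max, long_min, long_max, n):
    """Return n * n evenly spaced points covering a bounding box.

    The upper bounds are excluded, mirroring the np.arange-based cell above.
    """
    lat_step = (lat_max - lat_min) / n
    long_step = (long_max - long_min) / n
    lats = [lat_min + i * lat_step for i in range(n)]
    longs = [long_min + j * long_step for j in range(n)]
    return [{"Latitude": lat, "Longitude": lng} for lat, lng in product(lats, longs)]

# Illustrative bounding box, not real coordinates
grid = make_grid(0.0, 1.0, 0.0, 1.0, 4)
print(len(grid), grid[0], grid[-1])
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `new_latlong_boston`.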

## 3. Get nearby venues using Foursquare API

First we configure our Foursquare IDs

In [268]:
# Correct values are in a hidden cell

CLIENT_ID = 'xxxx' # your Foursquare ID
CLIENT_SECRET = 'xxxx' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
RADIUS = 800 # Search radius in meters (about half a mile)

In [269]:
# The code was removed by Watson Studio for sharing.
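
An alternative to a hidden cell is to read the credentials from environment variables, so they never appear in the notebook. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions, not something Foursquare or this project defines.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping of environment variables."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")          # hypothetical variable name
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")  # hypothetical variable name
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set")
    return client_id, client_secret

# Demonstration with a plain dict instead of the real environment
demo_env = {"FOURSQUARE_CLIENT_ID": "xxxx", "FOURSQUARE_CLIENT_SECRET": "yyyy"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id, demo_secret)
```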

Now we define a function to get nearby venues based on latlong

In [270]:
def getNearbyVenues(names, latitudes, longitudes, radius=RADIUS):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print('Done!!')
    return(nearby_venues)
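
The function above assumes every response contains at least one group and every venue has at least one category. A more defensive variant of its two core steps might look like the sketch below. `build_explore_url` and `extract_rows` are hypothetical helper names, and the sample payload is invented to mirror the response structure the function above indexes into.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL, letting urlencode escape the parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

def extract_rows(name, lat, lng, payload):
    """Pull venue rows out of a response dict, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups") or []
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Invented payload: one venue with a category, one without
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Mystery", "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]}]}}

url = build_explore_url("xxxx", "xxxx", "20180605", 42.0, -71.0, 800, 100)
rows = extract_rows("Test", 42.0, -71.0, sample_payload)
print(url)
print(rows)
```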

### 3.1. Boston Venues

Let's get the venues for Boston. Remember, we will use the new 'grid' points

In [271]:
venues_boston = getNearbyVenues(names=new_df_boston['Neighborhood'],
                                 latitudes = new_df_boston['Latitude'],
                                 longitudes = new_df_boston['Longitude'])

121, Rockland Street, East Dedham, Dedham, Norfolk County, Massachusetts, 02026, United States
6, Hyde Park Street, Dedham, Norfolk County, Massachusetts, 02026, United States
Stony Brook Reservation, Dedham Boulevard, East Dedham, Boston, Suffolk County, Massachusetts, 02026, United States
1339, River Street, Hyde Park, Boston, Suffolk County, Massachusetts, 02136, United States
Fairmount, Walter Street, Fairmount, Hyde Park, Boston, Suffolk County, Massachusetts, 02137, United States
55, Smith Road, Brushwood, Milton, Norfolk County, Massachusetts, 02137, United States
28, Pagoda Street, Milton Upper Mills, Milton, Norfolk County, Massachusetts, 02137, United States
65, Winthrop Street, Milton Upper Mills, Milton, Norfolk County, Massachusetts, 02126, United States
104, Reedsdale Road, Milton Center, Milton, Norfolk County, Massachusetts, 02126, United States
Milton Academy, Gun Hill Street, Milton, Norfolk County, Massachusetts, 02126, United States
959, Brook Road, East Milton, Mil

Let's see our results :)

In [272]:
print('Shape of venues_boston is {}'.format(venues_boston.shape))
venues_boston.head(5)

Shape of venues_boston is (5710, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"121, Rockland Street, East Dedham, Dedham, Nor...",42.255654,-71.156442,Greek International Food Market,42.260872,-71.15714,Grocery Store
1,"121, Rockland Street, East Dedham, Dedham, Nor...",42.255654,-71.156442,Viva Mi Arepa,42.261372,-71.157,Latin American Restaurant
2,"121, Rockland Street, East Dedham, Dedham, Nor...",42.255654,-71.156442,BCYF- Mary Draper Pool,42.259236,-71.159917,Pool
3,"121, Rockland Street, East Dedham, Dedham, Nor...",42.255654,-71.156442,Family Dollar,42.251721,-71.15542,Discount Store
4,"121, Rockland Street, East Dedham, Dedham, Nor...",42.255654,-71.156442,KFC,42.257887,-71.160799,Fried Chicken Joint


Let's see how many results we have for each address

In [273]:
venues_boston.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"10, Wooddale Avenue, Mattapan, Boston, Suffolk County, Massachusetts, 02126, United States",19,19,19,19,19,19
"1000, Harvard Street, Boston, Suffolk County, Massachusetts, 02126, United States",11,11,11,11,11,11
"104, Reedsdale Road, Milton Center, Milton, Norfolk County, Massachusetts, 02126, United States",4,4,4,4,4,4
"1084, Boylston Street, Back Bay, Boston, Suffolk County, Massachusetts, 02115, United States",100,100,100,100,100,100
"11, Norway Road, Milton Upper Mills, Milton, Norfolk County, Massachusetts, 02126, United States",18,18,18,18,18,18
...,...,...,...,...,...,...
"The Jewish Advocate, School Street, Downtown Crossing, Financial District, Boston, Suffolk County, Massachusetts, 02201, United States",100,100,100,100,100,100
"United States Postal Service Lot A, A Street, Seaport District, Financial District, Boston, Suffolk County, Massachusetts, 02205, United States",100,100,100,100,100,100
"Untitled Landscape, Boston HarborWalk, Waterfront, Financial District, Boston, Suffolk County, Massachusetts, 02110, United States",100,100,100,100,100,100
"Walter C. Wood Sailing Pavilion, 134, Memorial Drive, Cambridgeport, Cambridge, Middlesex County, Massachusetts, 02139, United States",53,53,53,53,53,53


Great! Some addresses have very few (e.g. 4 venues), but some have reached the limit of 100 (for our free API)
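
Addresses that hit the limit may have more venues nearby than the API returned, so it can be useful to flag them. The sketch below shows one way to do that on made-up data; `venues` stands in for `venues_boston` and the counts are invented.

```python
import pandas as pd

LIMIT = 100  # same cap as in the configuration cell

# Made-up data: address "A" reaches the cap, address "B" does not
venues = pd.DataFrame({
    "Neighborhood": ["A"] * 100 + ["B"] * 4,
    "Venue": ["venue {}".format(i) for i in range(104)],
})

# Number of venues per address, then the addresses at or above the cap
counts = venues.groupby("Neighborhood").size()
capped = counts[counts >= LIMIT].index.tolist()
print(capped)
```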

### 3.2. Madrid Venues

As usual, let's repeat the process for Madrid.

I will keep comments to a minimum since it's the same as above

In [274]:
venues_madrid = getNearbyVenues(names=neigh_df_latlong_madrid['Neighborhood'],
                                 latitudes = neigh_df_latlong_madrid['Latitude'],
                                 longitudes = neigh_df_latlong_madrid['Longitude'])

Palacio
Embajadores
Cortes
Justicia
Universidad
Sol
Imperial
Acacias
Chopera
Legazpi
Delicias
Palos de Moguer
Atocha
Pacífico
Adelfas
Estrella
Ibiza
Los Jerónimos
Niño Jesús
Recoletos
Goya
Fuente del Berro
Guindalera
Lista
Castellana
El Viso
Prosperidad
Ciudad Jardín
Hispanoamérica
Nueva España
Castilla
Bellas Vistas
Cuatro Caminos
Castillejos
Almenara
Valdeacederas
Berruguete
Gaztambide
Arapiles
Trafalgar
Almagro
Ríos Rosas
Vallehermoso
El Pardo
Fuentelarreina
Peñagrande
Pilar
La Paz
Valverde
Mirasierra
El Goloso
Casa de Campo
Argüelles
Ciudad Universitaria
Valdezarza
Valdemarín
El Plantío
Aravaca
Los Cármenes
Puerta del Ángel
Lucero
Aluche
Campamento
Cuatro Vientos
Las Águilas
Comillas
Opañel
San Isidro
Vista Alegre
Puerta Bonita
Buenavista
Abrantes
Orcasitas
Orcasur
San Fermín
Almendrales
Moscardó
Zofío
Pradolongo
Entrevías
San Diego
Palomeras Bajas
Palomeras Sureste
Portazgo
Numancia
Pavones
Horcajo
Marroquina
Media Legua
Fontarrón
Vinateros
Ventas
Pueblo Nuevo
Quintana
Concepción


In [275]:
print('Shape of venues_madrid is {}'.format(venues_madrid.shape))
venues_madrid.head(5)

Shape of venues_madrid is (6298, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Palacio,40.415129,-3.715618,Santa Iglesia Catedral de Santa María la Real ...,40.415767,-3.714516,Church
1,Palacio,40.415129,-3.715618,Plaza de La Almudena,40.41632,-3.713777,Plaza
2,Palacio,40.415129,-3.715618,Palacio Real de Madrid,40.41794,-3.714259,Palace
3,Palacio,40.415129,-3.715618,Taberna Rayuela,40.413179,-3.713496,Tapas Restaurant
4,Palacio,40.415129,-3.715618,Cervecería La Mayor,40.415218,-3.712194,Beer Bar


In [276]:
venues_madrid.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Abrantes,16,16,16,16,16,16
Acacias,100,100,100,100,100,100
Adelfas,94,94,94,94,94,94
Aeropuerto,15,15,15,15,15,15
Alameda de Osuna,36,36,36,36,36,36
...,...,...,...,...,...,...
Ventas,19,19,19,19,19,19
Villaverde Alto,7,7,7,7,7,7
Vinateros,23,23,23,23,23,23
Vista Alegre,30,30,30,30,30,30


As in Boston, some neighborhoods have many nearby venues and others have very few.

### 3.3. Process venues data

At this point, we are not interested in specific venues - only in the 'consolidated' view by neighborhood.

First, let's see how many different categories we have

In [277]:
print('There are {} unique categories in Boston.'.format(len(venues_boston['Venue Category'].unique())))
print('There are {} unique categories in Madrid.'.format(len(venues_madrid['Venue Category'].unique())))

There are 328 unique categories in Boston.
There are 293 unique categories in Madrid.


Let's apply the one-hot encoding to our dataframes (aka rows to columns).

It's important to note that one of the venue categories is itself named 'Neighborhood', which would collide with our neighborhood column after encoding. From here on, we will use 'NeighborhoodName' to identify our rows.

In [278]:
# First for Boston

# one hot encoding
onehot_boston = pd.get_dummies(venues_boston[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot_boston['NeighborhoodName'] = venues_boston['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns_boston = [onehot_boston.columns[-1]] + list(onehot_boston.columns[:-1])
onehot_boston = onehot_boston[fixed_columns_boston]

onehot_boston.head()

Unnamed: 0,NeighborhoodName,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,"121, Rockland Street, East Dedham, Dedham, Nor...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"121, Rockland Street, East Dedham, Dedham, Nor...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"121, Rockland Street, East Dedham, Dedham, Nor...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"121, Rockland Street, East Dedham, Dedham, Nor...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"121, Rockland Street, East Dedham, Dedham, Nor...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [279]:
# Now for Madrid

# one hot encoding
onehot_madrid = pd.get_dummies(venues_madrid[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot_madrid['NeighborhoodName'] = venues_madrid['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns_madrid = [onehot_madrid.columns[-1]] + list(onehot_madrid.columns[:-1])
onehot_madrid = onehot_madrid[fixed_columns_madrid]

onehot_madrid.head()

Unnamed: 0,NeighborhoodName,Accessories Store,Airport,Airport Lounge,Airport Service,American Restaurant,Aquarium,Arcade,Argentinian Restaurant,Art Gallery,...,Video Game Store,Vietnamese Restaurant,Warehouse Store,Watch Shop,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Palacio,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Palacio,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Palacio,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Palacio,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Palacio,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's consolidate it by neighborhood

In [280]:
onehot_neigh_boston = onehot_boston.groupby('NeighborhoodName').sum().reset_index()
onehot_neigh_madrid = onehot_madrid.groupby('NeighborhoodName').sum().reset_index()

print('The shape of onehot_neigh_boston is {}'.format(onehot_neigh_boston.shape))
print('The shape of onehot_neigh_madrid is {}'.format(onehot_neigh_madrid.shape))

The shape of onehot_neigh_boston is (144, 329)
The shape of onehot_neigh_madrid is (131, 294)
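
The one-hot encoding and the per-neighborhood sum above are easier to follow on a tiny example. The sketch below uses made-up data (`toy_venues` and its values are hypothetical, not taken from Foursquare) and passes `dtype=int` so the indicator columns are integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data: three venues spread over two neighborhoods
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Park'],
})

# One indicator column per category, one row per venue
toy_onehot = pd.get_dummies(toy_venues[['Venue Category']],
                            prefix="", prefix_sep="", dtype=int)
toy_onehot['NeighborhoodName'] = toy_venues['Neighborhood']

# Summing per neighborhood turns the indicators into venue counts
toy_counts = toy_onehot.groupby('NeighborhoodName').sum().reset_index()
print(toy_counts)
```

Each row of `toy_counts` is one neighborhood and each remaining column is the number of venues of that category, which is the same layout as `onehot_neigh_boston` and `onehot_neigh_madrid`.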


Great! We got all the 144 points in Boston and all the 131 neighborhoods in Madrid. Now let's get the top 10 venue types for each point/neighborhood

In [281]:
# First, we define a function to return the most common venues

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
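
One property of `return_most_common_venues` is worth keeping in mind: it always returns `num_top_venues` names, so a neighborhood with fewer distinct categories than that is padded with categories whose count is zero. The variant below is only a sketch of how those zero-count entries could be left out; `top_nonzero_categories` and `toy_row` are hypothetical names and are not used elsewhere in this notebook.

```python
import pandas as pd

def top_nonzero_categories(row, num_top_venues):
    """Hypothetical variant: most common categories, skipping zero counts."""
    counts = row.iloc[1:].astype(int)   # skip the NeighborhoodName cell
    counts = counts[counts > 0]         # keep only categories that occur
    # mergesort is stable, so tied categories keep their column order
    ranked = counts.sort_values(ascending=False, kind='mergesort')
    return ranked.index.tolist()[:num_top_venues]

toy_row = pd.Series({'NeighborhoodName': 'Toy', 'Bar': 3, 'Lake': 1, 'Park': 0, 'Zoo': 0})
print(top_nonzero_categories(toy_row, 3))
```

Because this variant can return fewer than `num_top_venues` names, it cannot be dropped directly into a loop that fills a fixed number of 'Top N' columns; the result would first have to be padded, for example with empty strings.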

In [282]:
num_top_venues = 10

# Create columns according to number of top venues
columns = ['NeighborhoodName']
for ind in np.arange(num_top_venues):
    columns.append('Top {}'.format(ind+1))

# Create a new dataframe for each city
neigh_venues_sorted_boston = pd.DataFrame(columns=columns)
neigh_venues_sorted_boston['NeighborhoodName'] = onehot_neigh_boston['NeighborhoodName']


neigh_venues_sorted_madrid = pd.DataFrame(columns=columns)
neigh_venues_sorted_madrid['NeighborhoodName'] = onehot_neigh_madrid['NeighborhoodName']

# For each row, we get the top categories
for ind in np.arange(onehot_neigh_boston.shape[0]):
    neigh_venues_sorted_boston.iloc[ind, 1:] = return_most_common_venues(onehot_neigh_boston.iloc[ind, :], num_top_venues)

for ind in np.arange(onehot_neigh_madrid.shape[0]):
    neigh_venues_sorted_madrid.iloc[ind, 1:] = return_most_common_venues(onehot_neigh_madrid.iloc[ind, :], num_top_venues)


Let's visualize our dataframes

In [283]:
neigh_venues_sorted_boston.head(5)

Unnamed: 0,NeighborhoodName,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
0,"10, Wooddale Avenue, Mattapan, Boston, Suffolk...",Pizza Place,Donut Shop,Indian Restaurant,Train Station,Pharmacy,Park,Bakery,Gym / Fitness Center,BBQ Joint,Liquor Store
1,"1000, Harvard Street, Boston, Suffolk County, ...",Liquor Store,Supermarket,Video Store,Gym,Pharmacy,Latin American Restaurant,Pizza Place,Discount Store,Bank,Buffet
2,"104, Reedsdale Road, Milton Center, Milton, No...",Lake,Doctor's Office,Food,Skating Rink,Zoo Exhibit,Electronics Store,Empanada Restaurant,Ethiopian Restaurant,Event Space,Falafel Restaurant
3,"1084, Boylston Street, Back Bay, Boston, Suffo...",Clothing Store,Ice Cream Shop,Bookstore,Seafood Restaurant,Coffee Shop,Vietnamese Restaurant,Hotel,Grocery Store,Greek Restaurant,Garden
4,"11, Norway Road, Milton Upper Mills, Milton, N...",Nail Salon,Gym / Fitness Center,Playground,Pharmacy,Caribbean Restaurant,Shoe Store,Shopping Mall,Soccer Field,Fast Food Restaurant,Southern / Soul Food Restaurant


In [284]:
neigh_venues_sorted_madrid.head(5)

Unnamed: 0,NeighborhoodName,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
0,Abrantes,Metro Station,Plaza,Ice Cream Shop,Athletics & Sports,Burger Joint,Nightclub,Tapas Restaurant,Fast Food Restaurant,Park,Gym / Fitness Center
1,Acacias,Bar,Tapas Restaurant,Spanish Restaurant,Coffee Shop,Pizza Place,Art Gallery,Plaza,Indie Theater,Vegetarian / Vegan Restaurant,Market
2,Adelfas,Spanish Restaurant,Grocery Store,Bar,Bakery,Fast Food Restaurant,Gym,Burger Joint,Pizza Place,Supermarket,Hotel
3,Aeropuerto,Airport Lounge,Spanish Restaurant,Coffee Shop,Duty-free Shop,Sporting Goods Shop,Fast Food Restaurant,Breakfast Spot,Airport Service,Diner,French Restaurant
4,Alameda de Osuna,Restaurant,Hotel,Park,Spanish Restaurant,Hotel Bar,Café,Gym,Tapas Restaurant,Bistro,Coffee Shop


## 4. Get current address venues

Now we are ready to identify the venues near the current homes of Jane and John. Let's recall their addresses:
  * Jane lives in Rushworth St, London
    * Latitude: 51.501463
    * Longitude: -0.1020907
  * John lives in Scholes St, Brooklyn, NY
    * Latitude: 40.708179
    * Longitude: -73.949628

In [285]:
lat_jane = 51.501463
long_jane = -0.1020907

lat_john = 40.708179
long_john = -73.949628

Let's confirm their addresses with our geolocator

In [286]:
address_jane = geolocator.reverse([lat_jane, long_jane])
address_john = geolocator.reverse([lat_john, long_john])
print("Jane's address: {}".format(address_jane))
print("John's address: {}".format(address_john))

Jane's address: Friars Primary (Foundation) School, Rushworth Street, Bankside, Southwark, London Borough of Southwark, London, Greater London, England, SE1 0QN, United Kingdom
John's address: 22, Scholes Street, Williamsburg, Brooklyn, Kings County, New York, 11206, United States


We can see that the addresses are in the correct streets. For Jane it indicates a school that's nearby, even though she obviously doesn't live in the school.

For our 'getNearbyVenues' function to work, we need those in a dataframe

In [287]:
client_data = [["Jane's neighborhood", lat_jane, long_jane], ["John's neighborhood", lat_john, long_john]]
client_df = pd.DataFrame(client_data, columns = ['Neighborhood', 'Latitude', 'Longitude'])
client_df

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Jane's neighborhood,51.501463,-0.102091
1,John's neighborhood,40.708179,-73.949628


Now let's get the venues

In [288]:
client_venues = getNearbyVenues(names=client_df['Neighborhood'],
                                latitudes = client_df['Latitude'],
                                longitudes = client_df['Longitude'])

print('We found a total of {} venues'.format(len(client_venues)))

client_venues.head()

Jane's neighborhood
John's neighborhood
Done!!
We found a total of 200 venues


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Jane's neighborhood,51.501463,-0.102091,Crossfit Blackfriars,51.500581,-0.100361,Gym / Fitness Center
1,Jane's neighborhood,51.501463,-0.102091,Terry's Cafe,51.500715,-0.098314,Café
2,Jane's neighborhood,51.501463,-0.102091,Baltic,51.503309,-0.10476,Eastern European Restaurant
3,Jane's neighborhood,51.501463,-0.102091,Union Theatre,51.503677,-0.101808,Performing Arts Venue
4,Jane's neighborhood,51.501463,-0.102091,Chimichurris,51.500664,-0.09902,Argentinian Restaurant


We can see that we found 200 venues. Since our API only allows 100 venues per request and we queried two locations, each location returned the maximum of 100 venues. Let's consolidate those

In [289]:
print("There are {} unique categories close to our clients' houses.".format(len(client_venues['Venue Category'].unique())))

There are 97 unique categories close to our clients' houses.


In [290]:
# one hot encoding
client_onehot = pd.get_dummies(client_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
client_onehot['NeighborhoodName'] = client_venues['Neighborhood'] 

# move neighborhood column to the first column
fixedcols_client_onehot = [client_onehot.columns[-1]] + list(client_onehot.columns[:-1])
client_onehot = client_onehot[fixedcols_client_onehot]

client_onehot.head()

Unnamed: 0,NeighborhoodName,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Bagel Shop,Bakery,Bar,...,Taiwanese Restaurant,Thai Restaurant,Theater,Thrift / Vintage Store,Toy / Game Store,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop
0,Jane's neighborhood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Jane's neighborhood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Jane's neighborhood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Jane's neighborhood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Jane's neighborhood,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [291]:
client_onehot_neigh = client_onehot.groupby('NeighborhoodName').sum().reset_index()

print('The shape of client_onehot_neigh is {}'.format(client_onehot_neigh.shape))

The shape of client_onehot_neigh is (2, 98)


In [292]:
client_onehot_neigh

Unnamed: 0,NeighborhoodName,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Bagel Shop,Bakery,Bar,...,Taiwanese Restaurant,Thai Restaurant,Theater,Thrift / Vintage Store,Toy / Game Store,Udon Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Bar,Wine Shop
0,Jane's neighborhood,1,2,1,3,0,1,0,1,2,...,0,0,3,0,0,0,0,1,1,0
1,John's neighborhood,0,1,0,0,1,0,1,4,9,...,1,2,0,1,1,1,2,0,2,2


In [294]:
num_top_venues = 10

# Create columns according to number of top venues
columns = ['NeighborhoodName']
for ind in np.arange(num_top_venues):
    columns.append('Top {}'.format(ind+1))

# Create a new dataframe for each city
client_venues_sorted = pd.DataFrame(columns=columns)
client_venues_sorted['NeighborhoodName'] = client_onehot_neigh['NeighborhoodName']

# For each row, we get the top categories
for ind in np.arange(client_onehot_neigh.shape[0]):
    client_venues_sorted.iloc[ind, 1:] = return_most_common_venues(client_onehot_neigh.iloc[ind, :], num_top_venues)
    
client_venues_sorted

Unnamed: 0,NeighborhoodName,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
0,Jane's neighborhood,Coffee Shop,Hotel,Pub,Gym / Fitness Center,Café,Theater,Italian Restaurant,Art Museum,Portuguese Restaurant,Argentinian Restaurant
1,John's neighborhood,Bar,Pizza Place,Coffee Shop,Italian Restaurant,Bakery,Japanese Restaurant,Latin American Restaurant,Wine Shop,Restaurant,Food Truck


We can see that Jane lives next to coffee shops, hotels, pubs and gyms. John lives next to bars, pizza places, coffee shops and Italian restaurants.

## 5. Compare locations

First, it's important to discuss some decisions for this step.

We can see that we have 97 different categories near our clients. We also have a total of 328 in Boston and 293 in Madrid. (Each dataframe has one more column than this, because the neighborhood name also takes one column.)

To compare those, what we can do is:
* Assume our clients only care about the top 10 of their lists
* Compare across all venue categories that exist in our clients' lists
* Compare across all venue categories

The second approach is a good balance between complexity and completeness, so it is the one we will choose.

Important: The venue counts will not be normalized. This is intentional, and we believe it will lead to better results. For example, a neighborhood with 50 coffee shops and 50 bars is very different from a neighborhood with one of each. If we normalized the counts, both would be represented as the same thing, and we do not want that.
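
The effect of this choice can be checked with a small calculation. The sketch below uses made-up counts for two categories (coffee shops and bars); all names and numbers are hypothetical and only illustrate the 50/50 versus 1/1 example.

```python
import numpy as np

# Hypothetical counts per category: [coffee shops, bars]
client = np.array([50.0, 50.0])
busy = np.array([50.0, 50.0])
quiet = np.array([1.0, 1.0])

def normalize(v):
    # Convert counts into shares that sum to 1
    return v / v.sum()

raw_busy = np.linalg.norm(client - busy)     # identical counts
raw_quiet = np.linalg.norm(client - quiet)   # about 69.3
norm_quiet = np.linalg.norm(normalize(client) - normalize(quiet))

print(raw_busy, raw_quiet, norm_quiet)
```

With raw counts the quiet neighborhood is far from the client, while after normalization both neighborhoods sit at distance zero and become indistinguishable.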

For this we need three dataframes complete with all possible venue categories:
* Boston neighborhoods
* Madrid neighborhoods
* Client neighborhoods

First, let's get all the venue categories we found

In [296]:
# Let's get all the venue categories for each of our dataframes
categories_boston = onehot_neigh_boston.columns.tolist()
categories_madrid = onehot_neigh_madrid.columns.tolist()
categories_client = client_onehot_neigh.columns.tolist()

# Let's merge all of them
categories_list = categories_boston + categories_madrid + categories_client
print("We have {} values in our list, including repeats".format(len(categories_list)))

# Let's make them into a set to remove duplicates, and then transform it back to a list
categories_set = set(categories_list)
categories_list = list(categories_set)
print("We have {} values in our list after removing repeats. This includes the 'NeighborhoodName' column".format(len(categories_list)))

# Let's get the 'NeighborhoodName' to the front of the dataset and sort the rest alphabetically
categories_list = sorted(categories_list)
categories_list.insert(0, categories_list.pop(categories_list.index('NeighborhoodName')))
print("We still have {} values in our list after removing repeats. This includes the 'NeighborhoodName' column".format(len(categories_list)))
print("The first values of our list are {}".format(categories_list[0:5]))

We have 721 values in our list, including repeats
We have 403 values in our list after removing repeats. This includes the 'NeighborhoodName' column
We still have 403 values in our list after removing repeats. This includes the 'NeighborhoodName' column
The first values of our list are ['NeighborhoodName', 'ATM', 'Accessories Store', 'Afghan Restaurant', 'African Restaurant']


Great! Now let's create three empty dataframes with these columns

In [297]:
complete_df_boston = pd.DataFrame(columns = categories_list)
complete_df_madrid = pd.DataFrame(columns = categories_list)
complete_df_client = pd.DataFrame(columns = categories_list)

complete_df_client

Unnamed: 0,NeighborhoodName,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit


Now let's get the values in our original dataframes. For empty values, we naturally want to set the value to zero

In [298]:
# onehot_neigh_boston
# onehot_neigh_madrid
# client_onehot_neigh

# First for Boston
for column in onehot_neigh_boston:
    complete_df_boston[column] = onehot_neigh_boston[column]
complete_df_boston.fillna(0, inplace=True)

# Now for Madrid
for column in onehot_neigh_madrid:
    complete_df_madrid[column] = onehot_neigh_madrid[column]
complete_df_madrid.fillna(0, inplace=True)

# And now for our clients' locations
for column in client_onehot_neigh:
    complete_df_client[column] = client_onehot_neigh[column]
complete_df_client.fillna(0, inplace=True)
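
As an aside, the same alignment can be sketched with `DataFrame.reindex`, which adds the missing columns and fills them in one step. This is an alternative shown on hypothetical toy data (`toy_counts`, `all_columns`), not the code used to produce the results in this notebook.

```python
import pandas as pd

# Hypothetical per-neighborhood counts with only two categories
toy_counts = pd.DataFrame({
    'NeighborhoodName': ['A', 'B'],
    'Bar': [2, 0],
    'Park': [0, 1],
})

# Hypothetical full list of columns, playing the role of categories_list
all_columns = ['NeighborhoodName', 'ATM', 'Bar', 'Park', 'Zoo']

# Columns missing from toy_counts are created and filled with 0
toy_complete = toy_counts.reindex(columns=all_columns, fill_value=0)
print(toy_complete)
```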

Let's test our results

In [299]:
complete_df_boston.head(5)

Unnamed: 0,NeighborhoodName,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,"10, Wooddale Avenue, Mattapan, Boston, Suffolk...",0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,"1000, Harvard Street, Boston, Suffolk County, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"104, Reedsdale Road, Milton Center, Milton, No...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"1084, Boylston Street, Back Bay, Boston, Suffo...",0,1,0,0,0,0,0,0,2,...,0,0,1,1,0,0,1,1,0,0
4,"11, Norway Road, Milton Upper Mills, Milton, N...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [300]:
complete_df_madrid.head(5)

Unnamed: 0,NeighborhoodName,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Abrantes,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Acacias,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,Adelfas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Aeropuerto,0,0,0,0,0,4,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Alameda de Osuna,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [301]:
complete_df_client.head(5)

Unnamed: 0,NeighborhoodName,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Jane's neighborhood,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
1,John's neighborhood,0,0,0,0,0,0,0,0,0,...,0,0,2,2,0,0,0,0,0,0


Excellent! Now we will find the distance between their current neighborhoods and the neighborhoods in their destination.

Remember, Jane is going to Boston and John is going to Madrid

For this, we will consider only the venue types that exist in our clients' neighborhoods, as discussed previously. For simplicity, the categories of both clients are used together.

In [302]:
selected_df_boston = complete_df_boston[categories_client]
selected_df_madrid = complete_df_madrid[categories_client]
selected_df_client = complete_df_client[categories_client]

First let's do it for Jane

In [303]:
# Let's get the venues close to Jane on a series
venues_series_jane = selected_df_client.iloc[0].drop('NeighborhoodName')

# Let's define a list for our output
distance_jane_boston = []

# Iterate through all rows in the Boston dataframe
for index, row_series in selected_df_boston.iterrows():
    
    # Get the neighborhood name
    neigh_name = row_series['NeighborhoodName']
    
    # Get the venues
    neigh_venues_series = row_series.drop('NeighborhoodName')
    
    # Get distance between client's series and current neighborhood
    neigh_distance = np.linalg.norm(venues_series_jane-neigh_venues_series)
    
    cell = {'Neighborhood':neigh_name, 'Distance':neigh_distance}
    distance_jane_boston.append(cell)

distance_jane_boston_df = pd.DataFrame(distance_jane_boston)
distance_jane_boston_df

Unnamed: 0,Neighborhood,Distance
0,"10, Wooddale Avenue, Mattapan, Boston, Suffolk...",19.000000
1,"1000, Harvard Street, Boston, Suffolk County, ...",19.339080
2,"104, Reedsdale Road, Milton Center, Milton, No...",19.235384
3,"1084, Boylston Street, Back Bay, Boston, Suffo...",15.842980
4,"11, Norway Road, Milton Upper Mills, Milton, N...",18.920888
...,...,...
139,"The Jewish Advocate, School Street, Downtown C...",14.456832
140,"United States Postal Service Lot A, A Street, ...",15.297059
141,"Untitled Landscape, Boston HarborWalk, Waterfr...",16.673332
142,"Walter C. Wood Sailing Pavilion, 134, Memorial...",16.000000


In [304]:
print("The maximum 'distance' between Jane and Boston's neighborhoods is {} and the minimum is {}. Remember that distance, in this case, is the opposite of similarity between neighborhoods".format(
    distance_jane_boston_df['Distance'].max(),
    distance_jane_boston_df['Distance'].min()))

The maximum 'distance' between Jane and Boston's neighborhoods is 25.25866188063018 and the minimum is 14.45683229480096. Remember that distance, in this case, is the opposite of similarity between neighborhoods
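
The row-by-row loop above computes a Euclidean distance between the client's count vector and each neighborhood's count vector. The same calculation can be sketched without an explicit loop; the data below is a hypothetical toy example, and the names `toy_neigh`, `toy_client` and `toy_distances` do not exist elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical neighborhood counts and a client with no venues at all
toy_neigh = pd.DataFrame({
    'NeighborhoodName': ['A', 'B'],
    'Bar': [3, 0],
    'Park': [4, 0],
})
toy_client = pd.Series({'Bar': 0, 'Park': 0})

# Subtracting a Series from a DataFrame aligns on the column names
features = toy_neigh.drop(columns='NeighborhoodName').astype(float)
diff = features - toy_client.astype(float)

# One Euclidean norm per row
toy_distances = pd.DataFrame({
    'Neighborhood': toy_neigh['NeighborhoodName'],
    'Distance': np.linalg.norm(diff.to_numpy(), axis=1),
})
print(toy_distances)
```

Neighborhood A ends up at distance 5 (a 3-4-5 triangle) and neighborhood B at distance 0, since it matches the client exactly.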


Let's do the same for John

In [305]:
# Let's get the venues close to John as a series
venues_series_john = selected_df_client.iloc[1].drop('NeighborhoodName')

# Let's define a list for our output
distance_john_madrid = []

# Iterate through all rows in the Madrid dataframe
for index, row_series in selected_df_madrid.iterrows():
    
    # Get the neighborhood name
    neigh_name = row_series['NeighborhoodName']
    
    # Get the venues
    neigh_venues_series = row_series.drop('NeighborhoodName')
    
    # Get distance between client's series and current neighborhood
    neigh_distance = np.linalg.norm(venues_series_john-neigh_venues_series)
    
    cell = {'Neighborhood':neigh_name, 'Distance':neigh_distance}
    distance_john_madrid.append(cell)

distance_john_madrid_df = pd.DataFrame(distance_john_madrid)
distance_john_madrid_df

Unnamed: 0,Neighborhood,Distance
0,Abrantes,15.905974
1,Acacias,12.529964
2,Adelfas,17.262677
3,Aeropuerto,16.462078
4,Alameda de Osuna,16.062378
...,...,...
126,Ventas,15.459625
127,Villaverde Alto,16.401219
128,Vinateros,14.899664
129,Vista Alegre,14.933185


In [306]:
print("The maximum 'distance' between John and Madrid's neighborhoods is {} and the minimum is {}. Remember that distance, in this case, is the opposite of similarity between neighborhoods".format(
    distance_john_madrid_df['Distance'].max(),
    distance_john_madrid_df['Distance'].min()))

The maximum 'distance' between John and Madrid's neighborhoods is 26.210684844162312 and the minimum is 12.529964086141668. Remember that distance, in this case, is the opposite of similarity between neighborhoods


Now let's calculate a similarity score by linearly rescaling the 'Distance' we calculated, so that the closest neighborhood gets a score of 1 and the farthest gets a score of 0

In [322]:
distance_jane_boston_df['Score'] = (distance_jane_boston_df['Distance'].max() - distance_jane_boston_df['Distance']) / (distance_jane_boston_df['Distance'].max() - distance_jane_boston_df['Distance'].min())
distance_john_madrid_df['Score'] = (distance_john_madrid_df['Distance'].max() - distance_john_madrid_df['Distance']) / (distance_john_madrid_df['Distance'].max() - distance_john_madrid_df['Distance'].min())

print("The maximum score (scaled) for Jane is {} and the minimum is {}".format(distance_jane_boston_df['Score'].max(), distance_jane_boston_df['Score'].min()))
print("The maximum score (scaled) for John is {} and the minimum is {}".format(distance_john_madrid_df['Score'].max(), distance_john_madrid_df['Score'].min()))

# Let's visualize it
distance_john_madrid_df.head(5)

The maximum score (scaled) for Jane is 1.0 and the minimum is 0.0
The maximum score (scaled) for John is 1.0 and the minimum is 0.0


Unnamed: 0,Neighborhood,Distance,Score
0,Abrantes,15.905974,0.753229
1,Acacias,12.529964,1.0
2,Adelfas,17.262677,0.65406
3,Aeropuerto,16.462078,0.71258
4,Alameda de Osuna,16.062378,0.741796


We can see that the lower the 'Distance' is, the higher the 'Score'.
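
The rescaling used for the score is an inverted min-max scaling. A minimal sketch on made-up distances follows; `similarity_score` is a hypothetical helper name, and the three values are illustrative only.

```python
import pandas as pd

def similarity_score(distance):
    """Inverted min-max scaling: smallest distance -> 1, largest -> 0."""
    d_min, d_max = distance.min(), distance.max()
    return (d_max - distance) / (d_max - d_min)

toy_distance = pd.Series([10.0, 15.0, 20.0])
toy_score = similarity_score(toy_distance)
print(toy_score.tolist())
```

Two consequences follow from the formula. A score of 1 only means "the closest neighborhood available in that city", not a perfect match, so scores are not comparable between Boston and Madrid. Also, the formula is undefined if every neighborhood has the same distance, because the denominator becomes zero.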

## 6. Analyze results

### 6.1. Consolidating everything

Let's get everything together into two different dataframes, one for each destination/client, each containing:
* Neighborhood name
* Similarity score
* Latitude
* Longitude
* Top 10 venue types

Note that we will not bring all venue categories into these dataframes

Let's recap the dataframes we will need

In [331]:
# These contain the score of destination neighborhoods
    # distance_jane_boston_df --> Key = 'Neighborhood'; Columns to use = 'Score'
    # distance_john_madrid_df --> Key = 'Neighborhood'; Columns to use = 'Score'

# These contain latitude and longitude for each neighborhood
    # new_df_boston           --> Key = 'Neighborhood'; Columns to use = 'Latitude' and 'Longitude'
    # neigh_df_latlong_madrid --> Key = 'Neighborhood'; Columns to use = 'Latitude' and 'Longitude'

# These contain the top 10 venues in each neighborhood
    # neigh_venues_sorted_boston --> Key = 'NeighborhoodName'; Columns to use = 'Top 1', 'Top 2', ..., 'Top 10'
    # neigh_venues_sorted_madrid --> Key = 'NeighborhoodName'; Columns to use = 'Top 1', 'Top 2', ..., 'Top 10'

# These will be our final dataframes
    # jane_boston_final_df
    # john_madrid_final_df

First, let's do it for Jane

In [352]:
# Merge dataframes containing score and latitude/longitude
jane_boston_final_df = pd.merge(distance_jane_boston_df, new_df_boston, left_on='Neighborhood', right_on='Neighborhood')

# Merge resulting dataframe from above and dataframe containing top venue categories
jane_boston_final_df = pd.merge(jane_boston_final_df, neigh_venues_sorted_boston, left_on='Neighborhood', right_on='NeighborhoodName')

# Drop unnecessary columns
jane_boston_final_df.drop(['NeighborhoodName', 'Distance'], axis=1, inplace=True)

# Let's see it
jane_boston_final_df.head(5)

Unnamed: 0,Neighborhood,Score,Latitude,Longitude,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
0,"10, Wooddale Avenue, Mattapan, Boston, Suffolk...",0.579408,42.276024,-71.088061,Pizza Place,Donut Shop,Indian Restaurant,Train Station,Pharmacy,Park,Bakery,Gym / Fitness Center,BBQ Joint,Liquor Store
1,"1000, Harvard Street, Boston, Suffolk County, ...",0.548017,42.276024,-71.107599,Liquor Store,Supermarket,Video Store,Gym,Pharmacy,Latin American Restaurant,Pizza Place,Discount Store,Bank,Buffet
2,"104, Reedsdale Road, Milton Center, Milton, No...",0.557616,42.255654,-71.078292,Lake,Doctor's Office,Food,Skating Rink,Zoo Exhibit,Electronics Store,Empanada Restaurant,Ethiopian Restaurant,Event Space,Falafel Restaurant
3,"1084, Boylston Street, Back Bay, Boston, Suffo...",0.871675,42.34732,-71.088061,Clothing Store,Ice Cream Shop,Bookstore,Seafood Restaurant,Coffee Shop,Vietnamese Restaurant,Hotel,Grocery Store,Greek Restaurant,Garden
4,"11, Norway Road, Milton Upper Mills, Milton, N...",0.586732,42.265839,-71.088061,Nail Salon,Gym / Fitness Center,Playground,Pharmacy,Caribbean Restaurant,Shoe Store,Shopping Mall,Soccer Field,Fast Food Restaurant,Southern / Soul Food Restaurant


Now let's repeat it for John

In [354]:
# Merge dataframes containing score and latitude/longitude
john_madrid_final_df = pd.merge(distance_john_madrid_df, neigh_df_latlong_madrid, left_on='Neighborhood', right_on='Neighborhood')

# Merge resulting dataframe from above and dataframe containing top venue categories
john_madrid_final_df = pd.merge(john_madrid_final_df, neigh_venues_sorted_madrid, left_on='Neighborhood', right_on='NeighborhoodName')

# Drop unnecessary columns
john_madrid_final_df.drop(['NeighborhoodName', 'Distance', 'District'], axis=1, inplace=True)

# Let's see it
john_madrid_final_df.head(5)

Unnamed: 0,Neighborhood,Score,Latitude,Longitude,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
0,Abrantes,0.753229,40.380998,-3.727985,Metro Station,Plaza,Ice Cream Shop,Athletics & Sports,Burger Joint,Nightclub,Tapas Restaurant,Fast Food Restaurant,Park,Gym / Fitness Center
1,Acacias,1.0,40.404075,-3.705957,Bar,Tapas Restaurant,Spanish Restaurant,Coffee Shop,Pizza Place,Art Gallery,Plaza,Indie Theater,Vegetarian / Vegan Restaurant,Market
2,Adelfas,0.65406,40.401903,-3.670958,Spanish Restaurant,Grocery Store,Bar,Bakery,Fast Food Restaurant,Gym,Burger Joint,Pizza Place,Supermarket,Hotel
3,Aeropuerto,0.71258,40.494838,-3.574081,Airport Lounge,Spanish Restaurant,Coffee Shop,Duty-free Shop,Sporting Goods Shop,Fast Food Restaurant,Breakfast Spot,Airport Service,Diner,French Restaurant
4,Alameda de Osuna,0.741796,40.457581,-3.587975,Restaurant,Hotel,Park,Spanish Restaurant,Hotel Bar,Café,Gym,Tapas Restaurant,Bistro,Coffee Shop
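
Both merges above use the default inner join of `pd.merge`, which silently drops any neighborhood whose key is missing from one side. A possible sanity check, sketched here on hypothetical toy frames (`toy_scores`, `toy_coords`), is to run a left join with `indicator=True` and list the keys that found no match.

```python
import pandas as pd

# Hypothetical scores for three neighborhoods, coordinates for only two
toy_scores = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'],
                           'Score': [1.0, 0.5, 0.0]})
toy_coords = pd.DataFrame({'Neighborhood': ['A', 'B'],
                           'Latitude': [40.4, 40.5],
                           'Longitude': [-3.7, -3.6]})

# indicator=True adds a '_merge' column describing where each row came from
checked = pd.merge(toy_scores, toy_coords, on='Neighborhood',
                   how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Neighborhood'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner join did not lose any neighborhood.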


### 6.2. Top 10 comparison

Let's recap the Top 10 venues in Jane's current neighborhood, and compare them to the top selections for her in Boston

In [361]:
client_venues_sorted[client_venues_sorted['NeighborhoodName'] == "Jane's neighborhood"]

Unnamed: 0,NeighborhoodName,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
0,Jane's neighborhood,Coffee Shop,Hotel,Pub,Gym / Fitness Center,Café,Theater,Italian Restaurant,Art Museum,Portuguese Restaurant,Argentinian Restaurant


In [381]:
pd.set_option('display.max_colwidth', 0) # This is so we can see the entire address

jane_boston_final_df_sorted = jane_boston_final_df.sort_values(by='Score', ascending = False)
jane_boston_final_df_sorted.head(10)

Unnamed: 0,Neighborhood,Score,Latitude,Longitude,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
139,"The Jewish Advocate, School Street, Downtown Crossing, Financial District, Boston, Suffolk County, Massachusetts, 02201, United States",1.0,42.357505,-71.058755,Seafood Restaurant,Historic Site,Coffee Shop,Park,Gym / Fitness Center,Salad Place,Bakery,Clothing Store,Italian Restaurant,Restaurant
106,"Copley Place, 100, Huntington Avenue, Back Bay, Boston, Suffolk County, Massachusetts, 02116, United States",0.993612,42.34732,-71.078292,Coffee Shop,Hotel,Seafood Restaurant,Italian Restaurant,Clothing Store,French Restaurant,Shopping Mall,Ice Cream Shop,Dessert Shop,Park
140,"United States Postal Service Lot A, A Street, Seaport District, Financial District, Boston, Suffolk County, Massachusetts, 02205, United States",0.922214,42.34732,-71.048986,Hotel,Coffee Shop,Italian Restaurant,Restaurant,Asian Restaurant,Park,Bar,Steakhouse,Art Museum,Gym
107,"Craigie Drawbridge, Charles River Dam Road, West End, Boston, Suffolk County, Massachusetts, 02214, United States",0.904174,42.36769,-71.068524,Hotel,Park,Science Museum,American Restaurant,Italian Restaurant,Hotel Bar,Bar,Museum,Mediterranean Restaurant,Burrito Place
59,"41, Chestnut Street, Downtown Crossing, Beacon Hill, Boston, Suffolk County, Massachusetts, 02108, United States",0.889311,42.357505,-71.068524,Italian Restaurant,Spa,Hotel,Pizza Place,Park,French Restaurant,Coffee Shop,Sandwich Place,Gourmet Shop,Performing Arts Venue
15,"152, Fulkerson Street, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States",0.883409,42.36769,-71.088061,Coffee Shop,Pizza Place,American Restaurant,New American Restaurant,Italian Restaurant,Mediterranean Restaurant,Café,Chinese Restaurant,Food Truck,Gastropub
3,"1084, Boylston Street, Back Bay, Boston, Suffolk County, Massachusetts, 02115, United States",0.871675,42.34732,-71.088061,Clothing Store,Ice Cream Shop,Bookstore,Seafood Restaurant,Coffee Shop,Vietnamese Restaurant,Hotel,Grocery Store,Greek Restaurant,Garden
83,"75;77, William T Morrissey Boulevard, Savin Hill, Boston, Suffolk County, Massachusetts, 02125, United States",0.862936,42.316765,-71.048986,Park,Coffee Shop,Café,Pharmacy,Liquor Store,Grocery Store,Pub,Pizza Place,Breakfast Spot,Bar
142,"Walter C. Wood Sailing Pavilion, 134, Memorial Drive, Cambridgeport, Cambridge, Middlesex County, Massachusetts, 02139, United States",0.857138,42.357505,-71.088061,Coffee Shop,Pizza Place,New American Restaurant,Mediterranean Restaurant,Chinese Restaurant,Sandwich Place,College Gym,Deli / Bodega,Hotel,Garden
65,"539, Shawmut Avenue, South End, Boston, Suffolk County, Massachusetts, 02118, United States",0.845611,42.337135,-71.078292,Café,Coffee Shop,Italian Restaurant,Sandwich Place,Donut Shop,Asian Restaurant,French Restaurant,Caribbean Restaurant,Sushi Restaurant,Tapas Restaurant


As we can see, many of the Top 10 venue categories in these neighborhoods overlap with Jane's, especially her top two (Coffee Shop and Hotel). Pubs are rare in these lists, probably because they are more common in the UK than in the US.

We would recommend that Jane move to Downtown Crossing, Back Bay or Seaport District, the three highest-scoring areas.

Now let's look at John.
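
As a quick check of the overlap described above, the sketch below lists which of Jane's Top 10 categories also appear in the Top 10 of the three highest-scoring Boston locations. The category lists are copied from the tables above; `shared_categories` is a hypothetical helper name, and this simple set comparison is only an illustration, not the way the `Score` column was computed.

```python
# Jane's Top 10 venue categories, copied from the table above.
jane_top10 = [
    "Coffee Shop", "Hotel", "Pub", "Gym / Fitness Center", "Café",
    "Theater", "Italian Restaurant", "Art Museum",
    "Portuguese Restaurant", "Argentinian Restaurant",
]

# Top 10 categories of the three highest-scoring Boston locations.
boston_top10 = {
    "Downtown Crossing": [
        "Seafood Restaurant", "Historic Site", "Coffee Shop", "Park",
        "Gym / Fitness Center", "Salad Place", "Bakery", "Clothing Store",
        "Italian Restaurant", "Restaurant",
    ],
    "Back Bay": [
        "Coffee Shop", "Hotel", "Seafood Restaurant", "Italian Restaurant",
        "Clothing Store", "French Restaurant", "Shopping Mall",
        "Ice Cream Shop", "Dessert Shop", "Park",
    ],
    "Seaport District": [
        "Hotel", "Coffee Shop", "Italian Restaurant", "Restaurant",
        "Asian Restaurant", "Park", "Bar", "Steakhouse", "Art Museum", "Gym",
    ],
}


def shared_categories(client, neighborhood):
    """Return the client's categories that also appear in the neighborhood list,
    keeping the client's ranking order (hypothetical helper)."""
    neighborhood_set = set(neighborhood)
    return [category for category in client if category in neighborhood_set]


for name, categories in boston_top10.items():
    shared = shared_categories(jane_top10, categories)
    print(f"{name}: {len(shared)} shared -> {shared}")
```

Note that the comparison is by exact category name, so "Gym" and "Gym / Fitness Center" count as different categories here.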

In [363]:
client_venues_sorted[client_venues_sorted['NeighborhoodName'] == "John's neighborhood"]

Unnamed: 0,NeighborhoodName,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
1,John's neighborhood,Bar,Pizza Place,Coffee Shop,Italian Restaurant,Bakery,Japanese Restaurant,Latin American Restaurant,Wine Shop,Restaurant,Food Truck


In [386]:
john_madrid_final_df_sorted = john_madrid_final_df.sort_values(by='Score', ascending = False)
john_madrid_final_df_sorted.head(10)

Unnamed: 0,Neighborhood,Score,Latitude,Longitude,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
1,Acacias,1.0,40.404075,-3.705957,Bar,Tapas Restaurant,Spanish Restaurant,Coffee Shop,Pizza Place,Art Gallery,Plaza,Indie Theater,Vegetarian / Vegan Restaurant,Market
81,Opañel,0.924368,40.386929,-3.723178,Bar,Plaza,Fast Food Restaurant,Gym / Fitness Center,Coffee Shop,Bakery,Library,Grocery Store,Bus Station,Colombian Restaurant
35,Comillas,0.908332,40.392825,-3.714187,Bar,Park,Playground,Spanish Restaurant,Pizza Place,Fast Food Restaurant,Seafood Restaurant,Bakery,Tapas Restaurant,Beer Garden
91,Peñagrande,0.905684,40.478964,-3.725842,Bakery,Tapas Restaurant,Pizza Place,Bar,Diner,Spanish Restaurant,Soccer Field,Grocery Store,Residential Building (Apartment / Condo),Argentinian Restaurant
75,Media Legua,0.895161,40.411995,-3.657534,Restaurant,Bar,Pizza Place,Fast Food Restaurant,Coffee Shop,Gym,Plaza,Electronics Store,Park,Pub
65,La Paz,0.884745,40.482122,-3.695211,Bar,Tapas Restaurant,Spanish Restaurant,Park,Beer Garden,Pizza Place,Bus Station,Sculpture Garden,Mediterranean Restaurant,Bookstore
22,Campamento,0.879576,40.394683,-3.768279,Spanish Restaurant,Bar,Pizza Place,Italian Restaurant,Chinese Restaurant,Metro Station,Pub,Light Rail Station,Diner,Department Store
92,Pilar,0.85411,40.477133,-3.708916,Clothing Store,Tapas Restaurant,Italian Restaurant,Park,Bakery,Supermarket,Gym / Fitness Center,Pizza Place,Spanish Restaurant,Restaurant
49,Embajadores,0.834169,40.409681,-3.701644,Plaza,Theater,Bar,Restaurant,Coffee Shop,Hotel,Tapas Restaurant,Spanish Restaurant,Art Gallery,Market
128,Vinateros,0.826785,40.405197,-3.641547,Park,Plaza,Bakery,Pizza Place,Café,Bar,Pub,Gym / Fitness Center,Nightclub,Falafel Restaurant


Bars and pizza places, John's top two categories, appear in most of these lists. The recommendations for John would be Acacias, Opañel and Comillas, the three highest-scoring neighborhoods.

### 6.3. Plotting results on a map

Lastly, let's see on a map which neighborhoods we would recommend.

The compatibility will be indicated by:
* Marker color - depends on the score as follows:
    * 0.9 and above: blue
    * From 0.7 up to (but not including) 0.9: green
    * From 0.5 up to (but not including) 0.7: yellow
    * From 0.3 up to (but not including) 0.5: orange
    * Below 0.3: red
* Marker opacity - the more opaque the marker, the better the match is for our clients
* Marker size - the larger the marker, the better the match is

First, we do it for Jane.
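
The three rules above can be collected in one small function. The sketch below is only an illustration: `marker_style`, `BAND_BOUNDS` and `BAND_COLORS` are hypothetical names that do not appear in the notebook, and the band lookup uses the standard library's `bisect_right` instead of the `if`/`elif` chain in the plotting cells. The radius and opacity formulas are the ones used in those cells.

```python
from bisect import bisect_right

# Lower bounds of the score bands and the color of each band.
# A score equal to a bound falls in the higher band, matching the
# `>=` comparisons used in the plotting cells.
BAND_BOUNDS = [0.3, 0.5, 0.7, 0.9]
BAND_COLORS = ["red", "orange", "yellow", "green", "blue"]


def marker_style(score):
    """Return color, radius and opacity for a score between 0 and 1
    (hypothetical helper)."""
    color = BAND_COLORS[bisect_right(BAND_BOUNDS, score)]
    return {
        "color": color,
        "radius": 5 + 5 * score,        # radius grows from 5 to 10
        "opacity": 0.3 + 0.5 * score,   # opacity grows from 0.3 to 0.8
    }


for score in (1.0, 0.9, 0.85, 0.3, 0.1):
    print(score, marker_style(score))
```

With these formulas, the best possible match (score 1.0) is drawn as a blue marker of radius 10, and a score of 0 would give a red marker of radius 5.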

In [396]:
# Get average Boston coordinates to place the map
final_lat_boston = jane_boston_final_df['Latitude'].mean()
final_long_boston = jane_boston_final_df['Longitude'].mean()

# Define the map
ranked_map_boston = folium.Map(location=[final_lat_boston, final_long_boston], zoom_start=12)

# Add markers to neighborhoods
for lat, lng, label, score in zip(jane_boston_final_df['Latitude'], jane_boston_final_df['Longitude'], jane_boston_final_df['Neighborhood'], jane_boston_final_df['Score']):
    label = folium.Popup(label, parse_html=True)
    if (score >= 0.90):
        chosen_color = 'blue'
    elif (score >= 0.70):
        chosen_color = 'green'
    elif (score >= 0.50):
        chosen_color = 'yellow'
    elif (score >= 0.3):
        chosen_color = 'orange'
    else:
        chosen_color = 'red'

    folium.CircleMarker(
        [lat, lng],
        radius=5 + 5*score,
        popup=label,
        color = chosen_color,
        opacity = 0.3 + 0.5 * score,
        fill=True,
        fill_color=chosen_color,
        fill_opacity=0.3 + 0.5 * score,
        parse_html=False).add_to(ranked_map_boston)  

# Display the map
ranked_map_boston

We can see that there are four optimal neighborhoods (in blue), and many others that are reasonably compatible (in green). Most of those are close to downtown.

Now let's see John.
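
The count of optimal neighborhoods can be checked against the sorted table of Jane's results shown earlier. The sketch below copies the ten highest Boston scores from that table; because the table is sorted in descending order and the fifth score is already below 0.9, no location further down the list can be blue. The variable names are illustrative only.

```python
# Scores of the ten best Boston matches, copied from the sorted table.
top10_scores = [1.0, 0.993612, 0.922214, 0.904174, 0.889311,
                0.883409, 0.871675, 0.862936, 0.857138, 0.845611]

# Count how many of these scores fall in the blue and green bands.
blue = sum(score >= 0.9 for score in top10_scores)
green = sum(0.7 <= score < 0.9 for score in top10_scores)

print(f"blue: {blue}, green: {green}")  # prints "blue: 4, green: 6"
```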

In [399]:
# Get average Madrid coordinates to place the map
final_lat_madrid = john_madrid_final_df['Latitude'].mean()
final_long_madrid = john_madrid_final_df['Longitude'].mean()

# Define the map
ranked_map_madrid = folium.Map(location=[final_lat_madrid, final_long_madrid], zoom_start=11)

# Add markers to neighborhoods
for lat, lng, label, score in zip(john_madrid_final_df['Latitude'], john_madrid_final_df['Longitude'], john_madrid_final_df['Neighborhood'], john_madrid_final_df['Score']):
    label = folium.Popup(label, parse_html=True)
    if (score >= 0.90):
        chosen_color = 'blue'
    elif (score >= 0.70):
        chosen_color = 'green'
    elif (score >= 0.50):
        chosen_color = 'yellow'
    elif (score >= 0.3):
        chosen_color = 'orange'
    else:
        chosen_color = 'red'

    folium.CircleMarker(
        [lat, lng],
        radius=5 + 5*score,
        popup=label,
        color = chosen_color,
        opacity = 0.3 + 0.5 * score,
        fill=True,
        fill_color=chosen_color,
        fill_opacity=0.3 + 0.5 * score,
        parse_html=False).add_to(ranked_map_madrid)  

# Display the map
ranked_map_madrid

We can see that there is a central region, especially around Salamanca, that John would not enjoy. Overall, he would fit better in the outskirts of the city, with four optimal locations (in blue).

## 7. Conclusion

### 7.1. Results and recommendations

We successfully found the best-matching neighborhoods for Jane in Boston and for John in Madrid.

For Jane, we recommend moving somewhere close to the city center, especially Downtown Crossing, Back Bay or Seaport District. However, there are many other places in Boston where she would feel at home.

For John, we do not recommend the city center, especially not around Salamanca. He should move to neighborhoods slightly towards the outskirts, such as Acacias, Opañel and Comillas.

### 7.2. Next possible steps and further analysis

Disclaimer: these will not be included in this notebook. They are simply suggestions that other people can try.

* Zoom in on selected regions to find the best streets or blocks within neighborhoods
* Experiment with different clients moving to Boston and Madrid
* Incorporate neighborhood distance to workplace into our analysis
* Incorporate neighborhood criminality levels into our analysis
* Incorporate cost of living into our analysis
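
As a possible starting point for the distance-to-workplace suggestion, the sketch below computes great-circle distances with the haversine formula. It is not part of the analysis above: `haversine_km` is a hypothetical helper, the workplace coordinates are a made-up placeholder, and the three candidate coordinates are copied from Jane's results table. How to combine such a distance with the similarity score is left open.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Assumption: a placeholder workplace location, not taken from the notebook.
workplace = (42.3601, -71.0589)

# Coordinates of Jane's three recommended locations, from the results table.
candidates = {
    "Downtown Crossing": (42.357505, -71.058755),
    "Back Bay": (42.34732, -71.078292),
    "Seaport District": (42.34732, -71.048986),
}

for name, (lat, lon) in candidates.items():
    distance = haversine_km(lat, lon, *workplace)
    print(f"{name}: {distance:.2f} km from the workplace")
```

Straight-line distance ignores streets and transit, so a routing service would likely give more realistic commute estimates.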