# Neighborhood recommendation based on venue categories of interest

In this notebook I use location data for one borough of Mexico City to recommend neighborhoods where a person can live close to the kinds of venues they like the most.

## Introduction

The code in this project recommends a place to live in Mexico City based on a person's venue interests. This can be helpful to people who want to move to a part of Mexico City that has specific kinds of venues nearby.

## Collecting the data

The geospatial data was downloaded from Mexico City's open data portal, https://datos.cdmx.gob.mx. The dataset describes the neighborhoods of Mexico City: the name of each neighborhood, the borough it belongs to, the coordinates of its center, and its geometry. The geometry data will be used to create a choropleth map.
The data dictionary can be downloaded from this link: https://datos.cdmx.gob.mx/api/datasets/1.0/coloniascdmx/attachments/diccionario_de_datos_colonias_iecm_pdf/

The location data is obtained from the Foursquare API.
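
As a rough illustration of the record layout that the rest of the notebook relies on, the sketch below builds one made-up record and wraps it as a GeoJSON feature. The field names (`alcaldia`, `nombre`, `geo_point_2d`, `geo_shape`) are the ones the notebook's code reads; the values and the helper name `to_feature` are hypothetical.

```python
# Hypothetical record mimicking the shape read from MexicoCity.json.
# Field names come from the notebook's code; the values are made up.
record = {
    "fields": {
        "alcaldia": "CUAUHTEMOC",
        "nombre": "EXAMPLE",
        "geo_point_2d": [19.43, -99.15],  # [latitude, longitude]
        "geo_shape": {
            "type": "Polygon",
            # GeoJSON coordinates are [longitude, latitude] pairs
            "coordinates": [[[-99.16, 19.42], [-99.14, 19.42],
                             [-99.14, 19.44], [-99.16, 19.42]]],
        },
    }
}


def to_feature(rec):
    """Wrap one record as a GeoJSON Feature; non-geometry fields become properties."""
    fields = rec["fields"]
    return {
        "type": "Feature",
        "geometry": fields["geo_shape"],
        "properties": {k: v for k, v in fields.items() if k != "geo_shape"},
    }


feature = to_feature(record)
latitude, longitude = record["fields"]["geo_point_2d"]
print(feature["properties"]["nombre"], latitude, longitude)
```

Note that the center point is stored as latitude then longitude, while the polygon coordinates follow the GeoJSON order of longitude then latitude.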

In [1]:
# The code was removed by Watson Studio for sharing.

In [2]:
#Importing libraries
!pip install folium
import folium
import requests
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib import colors

Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


In [3]:
#Downloading the Neighborhoods data of Mexico City
! wget 'https://datos.cdmx.gob.mx/explore/dataset/coloniascdmx/download/?format=json&timezone=America/Mexico_City&lang=es' -O MexicoCity.json

--2020-07-03 20:55:18--  https://datos.cdmx.gob.mx/explore/dataset/coloniascdmx/download/?format=json&timezone=America/Mexico_City&lang=es
Resolving datos.cdmx.gob.mx (datos.cdmx.gob.mx)... 52.1.105.32, 34.196.27.91
Connecting to datos.cdmx.gob.mx (datos.cdmx.gob.mx)|52.1.105.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘MexicoCity.json’

    [     <=>                               ] 5,992,493   4.40MB/s   in 1.3s   

2020-07-03 20:55:21 (4.40 MB/s) - ‘MexicoCity.json’ saved [5992493]



In [4]:
with open('MexicoCity.json','r') as file:
    JSON = json.load(file)
JSON = list(filter(lambda x: 'geo_shape' in x['fields'], JSON))

In [5]:
# Create the dataframe with Mexico City's location information
cdmx_neighborhoods = pd.DataFrame([
    [
        colonia['fields']['alcaldia'],
        colonia['fields']['nombre'], 
        colonia['fields']['geo_point_2d'][0],
        colonia['fields']['geo_point_2d'][1]
    ] for colonia in JSON
], columns=['Borough','Neighborhood','Latitude','Longitude'])

In [6]:
# I will analyse the data from only one borough, because there are a lot of neighborhoods and the Foursquare API only accepts 950 calls.
cuauhtemoc = cdmx_neighborhoods[cdmx_neighborhoods.Borough == 'CUAUHTEMOC'].reset_index(drop=True)
cuauhtemoc.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,CUAUHTEMOC,TABACALERA,19.435776,-99.153949
1,CUAUHTEMOC,CENTRO VII,19.430225,-99.128141
2,CUAUHTEMOC,GUERRERO I,19.449076,-99.143749
3,CUAUHTEMOC,NONOALCO-TLATELOLCO (U HAB) II,19.453315,-99.141769
4,CUAUHTEMOC,JUAREZ,19.427004,-99.161605


In [7]:
# The code was removed by Watson Studio for sharing.

In [8]:
radius = 500
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a DataFrame with the venues near each neighborhood, requested from the Foursquare API."""
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):            
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [9]:
# Obtaining the foursquare location data.
# cuauhtemoc_venues = getNearbyVenues(cuauhtemoc.Neighborhood,cuauhtemoc.Latitude,cuauhtemoc.Longitude)

# Save the file so the requests to Foursquare are not repeated every time I open the notebook.
# project.save_data(file_name = "cuauhtemoc_venues.csv",data = cuauhtemoc_venues.to_csv(index=False))

# Reading the file from the Cloud Object Storage
cuauhtemoc_venues = pd.read_csv(project.get_file('cuauhtemoc_venues.csv'))
cuauhtemoc_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,TABACALERA,19.435776,-99.153949,Mirador Monumento a la Revolución Mexicana,19.436212,-99.15475,Scenic Lookout
1,TABACALERA,19.435776,-99.153949,Monumento a la Revolución Mexicana,19.436022,-99.154212,Monument / Landmark
2,TABACALERA,19.435776,-99.153949,Barbacoa edison,19.43701,-99.152502,Taco Place
3,TABACALERA,19.435776,-99.153949,Terraza Timberland,19.436591,-99.152948,General Entertainment
4,TABACALERA,19.435776,-99.153949,Revolution Square,19.435959,-99.153023,Historic Site


Transforming the data into one-hot encoded vectors

In [10]:
cuauhtemoc_onehot = pd.get_dummies(cuauhtemoc_venues.set_index('Neighborhood')['Venue Category'])

In [11]:
cuauhtemoc_grouped = cuauhtemoc_onehot.groupby(cuauhtemoc_onehot.index).mean()
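
To make the two steps above concrete, here is a minimal sketch on a toy table. The rows are made up; only the column names `Neighborhood` and `Venue Category` match the notebook's data.

```python
import pandas as pd

# Toy venue table (made-up rows) with the same two columns the notebook uses.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Hotel", "Taco Place",
                       "Taco Place", "Hotel"],
})

# One row per venue, one indicator column per category.
onehot = pd.get_dummies(venues.set_index("Neighborhood")["Venue Category"])

# Averaging the indicators per neighborhood gives each category's share of venues.
grouped = onehot.groupby(onehot.index).mean()
print(grouped)
```

Each row of `grouped` sums to 1, so a value such as 0.5 for `Coffee Shop` in neighborhood `A` means that half of the venues found there are coffee shops.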

In [12]:
# Obtaining a DataFrame with the 10 most frequent venues in each neighborhood
df_top10venues = pd.DataFrame(cuauhtemoc_grouped.columns.values[np.argsort(-cuauhtemoc_grouped.values, axis=1)[:, :10]], 
                  index=cuauhtemoc_grouped.index,
                  columns = ['1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue', '4th Most Common Venue',
                             '5th Most Common Venue', '6th Most Common Venue', '7th Most Common Venue', '8th Most Common Venue', 
                             '9th Most Common Venue', '10th Most Common Venue']).reset_index()

In [13]:
cuauhtemoc_merged = cuauhtemoc_grouped.merge(cuauhtemoc,left_on=cuauhtemoc_grouped.index,right_on='Neighborhood') \
    .drop('Neighborhood_y',axis=1)

## Methodology

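Suppose there is a person who wants to move to the Cuauhtemoc borough of Mexico City. We will help him find a neighborhood that has the types of venues he likes the most. To obtain recommendations, we rank the neighborhoods by the dot product between a vector that represents the person's interests and the venue-frequency vector of each neighborhood. This score is closely related to cosine similarity, which additionally divides the dot product by the norms of both vectors.

Suppose this person wants to live near venues with categories like Scenic Lookout, Monument or Landmark, General Entertainment, History Museum, Dance Studio and Coffee Shop.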
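
The sketch below compares a plain dot-product score with a cosine-similarity score on a toy frequency table. The numbers, the neighborhood labels and the variable names are made up for illustration; they are not taken from the Foursquare data.

```python
import numpy as np
import pandas as pd

# Toy frequency table (made-up numbers): rows are neighborhoods, columns are categories.
freq = pd.DataFrame(
    [[0.50, 0.25, 0.25],
     [0.00, 0.50, 0.50]],
    index=["A", "B"],
    columns=["Coffee Shop", "Hotel", "Taco Place"],
)

# Preference vector: 1 for each category of interest, 0 elsewhere.
wanted = ["Coffee Shop", "Hotel"]
prefs = pd.Series(0.0, index=freq.columns)
prefs.loc[wanted] = 1.0

# Dot product: total share of venues that fall in the wanted categories.
dot_scores = freq.dot(prefs)

# Cosine similarity: the same dot product divided by both vector norms.
norms = np.linalg.norm(freq.values, axis=1) * np.linalg.norm(prefs.values)
cosine_scores = dot_scores / norms

print(dot_scores.sort_values(ascending=False))
print(cosine_scores.sort_values(ascending=False))
```

With a 0/1 preference vector, the dot product is simply the fraction of a neighborhood's venues that belong to the chosen categories. The cosine version also penalizes neighborhoods whose venues are concentrated in a few categories, so the two rankings can differ on real data.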

In [14]:
# The array with the venue categories in which the person is interested
category_venues = ['Scenic Lookout', 'Monument / Landmark', 'General Entertainment', 'History Museum', 'Dance Studio', 
                   'Coffee Shop']

In [15]:
# Create the dataframe with the venue categories converted into a one-hot encoded vector
custom_df = pd.DataFrame([{category : 1 for category in category_venues}],columns=cuauhtemoc_grouped.columns).fillna(0)

In [16]:
# We obtain the dot product between the vector that represents the interests of the person and each of the vectors that
# represent the neighborhoods, and keep the 5 neighborhoods with the largest results
recommended_neighborhoods = pd.DataFrame(
    cuauhtemoc_grouped.values * custom_df.values,columns=custom_df.columns
).sum(axis=1).to_frame().nlargest(5,0).index

df_top10venues.iloc[recommended_neighborhoods]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
59,TABACALERA,Coffee Shop,Mexican Restaurant,Taco Place,Hotel,Plaza,Argentinian Restaurant,Exhibit,Diner,Restaurant,Bar
50,ROMA SUR II,Mexican Restaurant,Taco Place,Coffee Shop,Restaurant,Optical Shop,Pet Store,Café,Tattoo Parlor,Gym / Fitness Center,Bakery
51,SAN RAFAEL I,Taco Place,Mexican Restaurant,Pizza Place,Coffee Shop,Ice Cream Shop,Sandwich Place,Theater,Spanish Restaurant,Breakfast Spot,Dance Studio
58,SANTA MARIA LA RIBERA IV,Mexican Restaurant,Coffee Shop,Taco Place,Pizza Place,Art Gallery,Food & Drink Shop,Restaurant,Theater,Dance Studio,Bakery
32,JUAREZ,Coffee Shop,Bakery,Art Gallery,Pizza Place,Hotel,Ice Cream Shop,Chinese Restaurant,Clothing Store,Cocktail Bar,Comfort Food Restaurant


In [17]:
# Create a geojson object needed in the choropleth map.
geojson_recommended = {
  "type" : "FeatureCollection",  
  "features" : [
      {
          "type" : "Feature", 
          "geometry" :  colonia['fields']['geo_shape'], 
          "properties" : { 
              key:elem
              for key,elem in colonia['fields'].items() if key != 'geo_shape' 
          }
      }
      for colonia in JSON if colonia['fields']['nombre'] in cuauhtemoc_merged.iloc[recommended_neighborhoods].Neighborhood.values
  ]
}

In [18]:
# Creation of the map with folium
latitude = 19.44306
longitude = -99.144725

n_recomm_neighborhoods = len(recommended_neighborhoods)

map_clusters = folium.Map(location=[latitude, longitude],zoom_start=14)

# set the color scheme: one distinct color per recommended neighborhood
colors_array = cm.rainbow(np.linspace(0, 1, n_recomm_neighborhoods))
rainbow = [colors.rgb2hex(i) for i in colors_array]

folium.GeoJson(
    geojson_recommended,
    name='geojson'
).add_to(map_clusters)

# add markers to the map
recommended_rows = cuauhtemoc_merged.iloc[recommended_neighborhoods]
for number, (lat, lon, poi) in enumerate(zip(
        recommended_rows['Latitude'],
        recommended_rows['Longitude'],
        recommended_rows['Neighborhood'])):
    label = folium.Popup(str(poi) + ' Option ' + str(number + 1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[number],  # colors follow the ranking order
        fill=True,
        fill_color=rainbow[number],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [19]:
# Let's put the code together in a function
def recommend_neighborhoods(venue_categories_list):
    custom_df = pd.DataFrame([{category : 1 for category in venue_categories_list}],columns=cuauhtemoc_grouped.columns) \
    .fillna(0)
    recommended_neighborhoods = pd.DataFrame(cuauhtemoc_grouped.values * custom_df.values,columns=custom_df.columns) \
    .sum(axis=1) \
    .to_frame() \
    .nlargest(5,0).index
    
    latitude = 19.439
    longitude = -99.144725

    n_recomm_neighborhoods = len(recommended_neighborhoods)

    map_clusters = folium.Map(location=[latitude, longitude],zoom_start=13.5)

    # set the color scheme: one distinct color per recommended neighborhood
    colors_array = cm.rainbow(np.linspace(0, 1, n_recomm_neighborhoods))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    
    geojson_recommended = {
      "type" : "FeatureCollection",  
      "features" : [
          {
              "type" : "Feature", 
              "geometry" :  colonia['fields']['geo_shape'], 
              "properties" : { 
                  key:elem
                  for key,elem in colonia['fields'].items() if key != 'geo_shape' 
              }
          }
          for colonia in JSON if colonia['fields']['nombre'] in cuauhtemoc_merged.iloc[recommended_neighborhoods].Neighborhood.values
      ]
    }

    folium.GeoJson(
        geojson_recommended,
        name='geojson'
    ).add_to(map_clusters)

    # add markers to the map
    recommended_rows = cuauhtemoc_merged.iloc[recommended_neighborhoods]
    for number, (lat, lon, poi) in enumerate(zip(
            recommended_rows['Latitude'],
            recommended_rows['Longitude'],
            recommended_rows['Neighborhood'])):
        label = folium.Popup(str(poi) + ' Option ' + str(number + 1), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[number],  # colors follow the ranking order
            fill=True,
            fill_color=rainbow[number],
            fill_opacity=0.7).add_to(map_clusters)

    # return the table of top venues and the map for the recommended neighborhoods
    return df_top10venues.iloc[recommended_neighborhoods], map_clusters

## Results

Let's now consider a person who is interested in hotels, seafood restaurants and Italian restaurants.

In [20]:
venue_categories_list = ['Hotel', 'Seafood Restaurant', 'Italian Restaurant']

df, map_ = recommend_neighborhoods(venue_categories_list)
df

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,ESPERANZA,Seafood Restaurant,Mexican Restaurant,Coffee Shop,Fast Food Restaurant,Market,Warehouse Store,Gym / Fitness Center,Restaurant,Burger Joint,Gym
40,OBRERA II,Taco Place,Bar,Mexican Restaurant,Bakery,Seafood Restaurant,Gym,Bistro,Stationery Store,Liquor Store,Indie Theater
44,PERALVILLO I,Mexican Restaurant,Taco Place,Seafood Restaurant,Fried Chicken Joint,Dessert Shop,Diner,Restaurant,Food,Burger Joint,Furniture / Home Store
59,TABACALERA,Coffee Shop,Mexican Restaurant,Taco Place,Hotel,Plaza,Argentinian Restaurant,Exhibit,Diner,Restaurant,Bar
60,TRANSITO,Ice Cream Shop,Mexican Restaurant,Seafood Restaurant,Taco Place,Coffee Shop,Sushi Restaurant,Steakhouse,Breakfast Spot,Burger Joint,Snack Place


In [21]:
# Display the map
map_

As we can see, the recommended neighborhoods have some of the venue categories that the person is most interested in among their most frequent venue categories, so I think that the algorithm is performing well.

The map shows the boundaries of the recommended neighborhoods.

## Discussion

During the analysis I noticed that every neighborhood in the data set has several Mexican restaurants and taco places, so it could be a good option to remove those categories from the analysis because they do not help to distinguish between the neighborhoods.
It would also be a good idea to add more information to the analysis, such as the ratings of the venues. With this we could improve the recommendations by using not only the frequency of the venues in a specific neighborhood, but also the prestige of those venues. The problem is that the rating data is obtained with premium requests to the Foursquare API. As I am using the free account, I don't have enough free premium requests to get the ratings for all venues.
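
One possible way to drop categories that appear everywhere is sketched below on a toy table. The numbers are made up, and the rule used here (remove a category if it is present in every neighborhood) is an assumption for illustration, not something this notebook has evaluated.

```python
import pandas as pd

# Toy frequency table (made-up numbers) in the same layout as the grouped data.
freq = pd.DataFrame(
    [[0.40, 0.30, 0.30, 0.00],
     [0.50, 0.25, 0.00, 0.25],
     [0.25, 0.25, 0.50, 0.00]],
    index=["A", "B", "C"],
    columns=["Mexican Restaurant", "Taco Place", "Coffee Shop", "Hotel"],
)

# Share of neighborhoods in which each category appears at all.
presence = (freq > 0).mean(axis=0)

# Hypothetical rule: drop categories that are present in every neighborhood.
ubiquitous = presence[presence >= 1.0].index
filtered = freq.drop(columns=ubiquitous)

# Rescale the rows so the remaining shares sum to 1 again.
filtered = filtered.div(filtered.sum(axis=1), axis=0)

print(list(ubiquitous))
print(filtered)
```

A neighborhood that contains only the dropped categories would end up with an all-zero row and produce missing values after rescaling, so real data would need a check for that case.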

## Conclusion

There are many different ways to exploit location information, and if we add information collected from other sources, such as population data or economic and statistical data about each location, we could obtain more interesting insights.

This project is a good introduction that helps to understand the value of location data.

Author: Ivan Jimenez Martinez