<a id="top"></a> 
# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

# Table Of Contents
* [Introduction: Business Problem](#intro)
* [Initial Setup](#setup)
* [Data Sources](#data)
    * [List of Barcelona Neighbourhoods](#data1)
    * [list of avg house prices](#data2)
    * [geoJSON definitions of all the neighbourhood data](#data3)
    * [merging location data](#data4)
    * [venue data from FourSquare](#data5)
    * [venue category taxonomy from FourSquare](#data6)
* [Data Processing](#processing)
* [Investigations](#investigations)
    * [price by district](#inv1)
    * [price effect each type of venue](#inv2)
    * [identifying types of districts](#inv3)
    * [underpriced/overpriced neighbourhoods by venue](#inv4)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a id="intro"></a> 

Conventions
Barcelona is (at the time of writing) made up of 10 districts ("districtes"), 73 neighbourhoods ("barris") and 1069 sub-neighbourhoods.


<div style="text-align: right"> <a href="#top">back to top</a> </div>

---


## Initial setup <a id="setup"></a>
---


Install any additional libraries that we might not already have.
GeoPandas is really powerful and handles a lot of the geometric calculations that we'd otherwise need to do manually.

In [None]:
# I started with pip but hit a bunch of incompatibility issues, so I switched to conda
# Strict channel priority was needed to prevent conda compatibility issues
#!conda config --set channel_priority strict
#!conda install --channel conda-forge shapely folium pandas numpy geopandas  matplotlib geopy scikit-learn python-dotenv  -y
#!conda install --channel conda-forge openpyxl -y
#!conda install --channel conda-forge nodejs -y




In [None]:
#I'm using these variables to control whether I'm downloading or repulling the data from the API or loading from local files already pulled

download_all_data = False
load_venues_from_file = True
load_venue_data_from_api = False
write_venues_to_file = False


venue_data_from_foursquare = True #1.5.1
price_by_district = True #
price_effect_each_type_of_venue = True
identifying_types_of_districts = True
underpriced_overpriced_neighbourhoods_by_venue = True

#Uncomment the below if this is the first time you're running
#download_all_data = True
#load_venues_from_file = False
#load_venue_data_from_api = True
#write_venues_to_file = True



In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd
import math
from IPython.display import JSON

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files


from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
#from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# import k-means from clustering stage
from sklearn.cluster import KMeans

from sklearn import linear_model

from sklearn import metrics
from scipy.spatial.distance import cdist



import folium # map rendering library

from shapely.geometry import Polygon, mapping, Point

import os
from dotenv import load_dotenv
load_dotenv()

import pickle 


print('Libraries imported.')

Let's set up our coordinates for Barcelona, which we'll use for a lot of our mapping.

In [None]:
address = 'Barcelona, ES'

geolocator = Nominatim(user_agent="bcn_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Barcelona are {}, {}.'.format(latitude, longitude))

<div style="text-align: right"> <a href="#top">back to top</a> </div>


<br><br><br>
## 1 Data sources <a id="data"></a>
---


We have 5 different data sources that we need in order to perform our analyses:
1. List of all the Barcelona neighbourhoods
2. List of average house prices
3. GeoJSON representations of all of the neighbourhoods
4. The venue details from FourSquare
5. The FourSquare categorization taxonomy
<br><br>

### 1.1 list of Barcelona neighbourhoods <a id="data1"></a>

This file contains the names of the different districts and neighbourhoods in Barcelona. It's available from the Barcelona Ajuntament (Council).

In [None]:
# Set download_all_data = True to download the file; leave it False once the file has already been downloaded
if(download_all_data):
    print("Downloading")
    !wget -q -O 'barcelona_neighbourhoods.csv' https://opendata-ajuntament.barcelona.cat/data/dataset/8f144d2c-1185-4e5c-9b97-ac930eeffca8/resource/d7aa700f-c2dc-4ffb-b5c2-62f494dd3c34/download/2017_superficie.csv

Load our file in and rename some of the columns to allow linking later

In [None]:
barrios_df = pd.read_csv('barcelona_neighbourhoods.csv')
barrios_df.rename(columns={"Codi_Barri": "BARRI"}, inplace=True)
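# Note: dropna() returns a new frame, so this only displays the non-null rows; barrios_df itself is unchanged
barrios_df.dropna()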
#barrios_df

<div style="text-align: right"> <a href="#top">back to top</a> </div>

<br><br>
### 1.2 list of avg house prices <a id="data2"></a>
This file has each of the districts and the average property price per m² in 2015. Again, this is available from the city council.

In [None]:
# Set download_all_data = True to download the file; leave it False once the file has already been downloaded
if(download_all_data):
    print("Downloading")
    !wget -q -O 'barcelona_prices.csv' https://opendata-ajuntament.barcelona.cat/data/dataset/59975890-c615-4080-8dd7-ef1406085590/resource/cd9118c6-427c-4390-8334-3670cc3f3f6a/download/2015_habitatges_2na_ma2015.csv

In [None]:
prices_df = pd.read_csv('barcelona_prices.csv',encoding = 'latin_1')
prices_df.head(10)

To link to our other data, we need to extract the neighbourhood id from the neighbourhood column. We can also remove some unneeded columns.

In [None]:

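def zf(value):
    # left-pad the value with zeros to two characters (currently unused)
    return str(value).zfill(2)
def rc(value):
    # strip the '.' thousands separators and convert the text to a number
    return pd.to_numeric(value.replace('.',''))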


#prices_df.rename(columns={"Codi_Barri": "BARRI"}, inplace=True)
prices_df = prices_df.join( prices_df["Barris"].str.split(".", n = 1, expand = True) )
prices_df.rename(columns={'Dte.': "Codi_Districte",'2015':'price_per_m2', 0:'BARRI_lu',1:'Nom_Barri_lu'}, inplace=True)
prices_df.dropna(inplace = True)
prices_df = prices_df[prices_df['price_per_m2'] != 'n.d.']
#prices_df.astype({'BARRI': 'str'}).dtypes
#prices_df['BARRI'] = prices_df['BARRI'].apply(zf);
prices_df['price_per_m2'] = prices_df['price_per_m2'].apply(rc)
prices_df = prices_df.astype({'BARRI_lu': 'int64','price_per_m2':'int64'})
prices_df.sort_values(by=['price_per_m2'], inplace=True)
prices_df.drop(['Codi_Districte', 'Barris'], axis=1, inplace = True)
prices_df.head(5)
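
To make the two clean-up steps above concrete, here is a minimal sketch using only the standard library. The helper names `split_barri` and `parse_price` and the sample strings are assumptions made for illustration (the real file's exact formatting may differ). The logic mirrors the notebook's split on the first `.` and its removal of `.` thousands separators, with `n.d.` treated as missing.

```python
def split_barri(label):
    """Split a label such as '1. el Raval' into (id, name) on the first '.'."""
    code, name = label.split(".", 1)
    return int(code), name.strip()


def parse_price(text):
    """Convert a price such as '3.456' to 3456; return None for 'n.d.' (no data)."""
    if text.strip() == "n.d.":
        return None
    return int(text.replace(".", ""))


# Sample values are hypothetical, chosen to match the shapes the notebook handles
print(split_barri("1. el Raval"))  # (1, 'el Raval')
print(parse_price("3.456"))        # 3456
print(parse_price("n.d."))         # None
```

In the notebook itself the same work is done column-wise with pandas, which is the better choice for a whole file.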

<div style="text-align: right"> <a href="#top">back to top</a> </div>

<br><br>
### 1.3 the geoJSON definitions of all the neighbourhood data <a id="data3"></a>
This file gives the sub-neighbourhoods encoded as GeoJSON, and will allow us to map our data very accurately.

In [None]:
# Set download_all_data = True to download the file; leave it False once the file has already been downloaded
if(download_all_data):
    print("Downloading")
    !wget -q -O 'barcelona_seccio-censal.geojson'  https://raw.githubusercontent.com/martgnz/bcn-geodata/master/seccio-censal/seccio-censal.geojson

Load the file in, and convert the datatypes of the columns that we're going to be linking on later.

In [None]:
df_hoods = gpd.read_file('barcelona_seccio-censal.geojson',crs={'init':'epsg:4326'})
df_hoods = df_hoods.astype({'BARRI': 'int64'})

#df_hoods


<div style="text-align: right"> <a href="#top">back to top</a> </div>

<br><br>
### 1.4 merging location data <a id="data4"></a>
Before we can continue, we need to merge some of our location data, so that we can look at both neighbourhoods and districts together.


In [None]:

df_hoods = df_hoods.merge(barrios_df.set_index('BARRI'), how='inner', on='BARRI')
df_hoods = df_hoods.merge(prices_df, how='inner', left_on='BARRI',right_on='BARRI_lu')
df_hoods['neighbourhood'] = df_hoods['Nom_Barri'] + ' - ' + df_hoods.LITERAL
df_hoods.drop(['BARRI_lu', 'Nom_Barri_lu'], axis=1, inplace = True)

df_hoods.index.name = 'id'
df_hoods['id'] = df_hoods.index




<br><br>
### 1.5 our venue data from FourSquare <a id="data5"></a>
In order to get our venue data from the FourSquare API, we need to pass in a centrepoint and a radius.
To do this we'll need to look at each of our sub-neighbourhoods, and calculate these two values.
Let's start with the centrepoints.
#### 1.5.1 centerpoints

We can use the built-in GeoPandas centroid functionality to calculate the centre point for each sub-neighbourhood:

In [None]:
df_hoods['centre_lat']=df_hoods['geometry'].centroid.y
df_hoods['centre_lng']=df_hoods['geometry'].centroid.x

In [None]:
df_hoods.head()

We can plot our shapes in blue, and our centroids in red to check that everything looks right

In [None]:
# create map of Barcelona using latitude and longitude values
map_bcn = folium.Map(location=[latitude, longitude], zoom_start=12)
if(venue_data_from_foursquare):
   
    folium.GeoJson(
        df_hoods.to_json(),
        name='geojson'
    ).add_to(map_bcn)


    # add markers to map
    for lat, lng, borough, neighbourhood in zip(df_hoods['centre_lat'], df_hoods['centre_lng'], df_hoods['Nom_Barri'], df_hoods['LITERAL']):
        label = '{}, {}'.format(neighbourhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='red',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bcn)  


map_bcn

<div style="text-align: right"> <a href="#top">back to top</a> </div>


This looks great. However, it's clear that some of the sub-neighbourhoods are much bigger than others, and they're also different shapes.


#### 1.5.2 radiuses
So how should we choose the right search radius for Foursquare data?

Let's take a look at some examples of the issue with 3 sub-neighbourhoods (032, 040 and 039) in the Barceloneta neighbourhood.

In [None]:
problem_hoods = ['032', '040','039']

# take a copy so that the colour assignments below don't write to a view of df_hoods
df_problems = df_hoods[df_hoods.LITERAL.isin(problem_hoods) & df_hoods.Codi_Districte.eq(1) ].copy()


df_problems.at[97, 'FHEX_COLOR'] = '#FF0000'
df_problems.at[104, 'FHEX_COLOR'] = '#00FF00'
df_problems.at[105, 'FHEX_COLOR'] = '#0000FF'
df_problems


If we plot these three sub-neighbourhoods, along with the 500m radiuses that are standard with FourSquare, we can see that the two smaller areas marked in green and blue (areas 039 and 040) almost totally overlap each other.


Not only that, but if we look at the Museum of Catalan History, highlighted by the purple circle, it would be included in the catchment areas of both 039 and 040. This isn't correct, as it's actually situated within area 032, shown in red.

With a 500m catchment area (the red circle), area 032 wouldn't include this venue at all.

So we need a custom catchment area for each sub-neighbourhood.

In [None]:
def style_function(feature):
    return {
        'fillOpacity': 0.5,
        'weight': 0,
        'fillColor': feature['properties']['FHEX_COLOR'] 
    }

if(venue_data_from_foursquare):
    # create map of Barcelona using latitude and longitude values
    map_bcn = folium.Map(location=[df_problems.iloc[0].centre_lat, df_problems.iloc[0].centre_lng], zoom_start=15)
    folium.GeoJson(
        df_problems.to_json(),
        name='geojson',
        style_function=style_function
    ).add_to(map_bcn)
    # add markers to map
    for lat, lng, borough, neighbourhood,colour in zip(df_problems['centre_lat'], df_problems['centre_lng'], df_problems['Nom_Barri'], df_problems['LITERAL'],df_problems['FHEX_COLOR']):
        label = '{}, {}'.format(neighbourhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=colour,
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bcn)  
        folium.Circle([lat, lng],
                        radius=500,
                      color= colour
                       ).add_to(map_bcn)
    folium.CircleMarker(
            [41.380711, 2.185559],
            radius=15,
            color='black',
            fill=True,
            fill_color='purple',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bcn)  
map_bcn

OK, so we need to find the radius that should apply to each area.
Let's get the enclosing box (the envelope) that covers the area in question.
Then we need to calculate the length of the sides of the box in metres. For this I'm using the haversine formula. There's probably a way to do this directly in GeoPandas, but sometimes doing it yourself isn't a bad thing.
Once we have the lengths of all the sides, we take the maximum and divide by 2 to find an appropriate radius, then increase it by 15% to allow for the corners of the bounding box.
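
Written out, the calculation described above is the haversine great-circle distance between two corners of the bounding box, followed by the radius rule applied in the code that follows. This is a summary sketch: the symbols ($\varphi$ for latitude and $\lambda$ for longitude, both in radians, $d_1 \dots d_4$ for the four side lengths, $r$ for the search radius) are notation introduced here, while $R = 6\,371\,000$ m and the factor 1.15 are the values used in the code. Rounding is ignored.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2R \,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right) \\
r &= 1.15 \cdot \frac{\max(d_1, d_2, d_3, d_4)}{2}
\end{aligned}
```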

In [None]:


def haversine(coord1, coord2):

    # Coordinates are (longitude, latitude) pairs in decimal degrees (e.g. 2.89078, 12.79797)
    lon1, lat1 = coord1
    lon2, lat2 = coord2
    R = 6371000  # radius of Earth in meters
    phi_1 = math.radians(lat1)
    phi_2 = math.radians(lat2)

    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)

    a = math.sin(delta_phi / 2.0) ** 2 + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2.0) ** 2

    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    meters = R * c  # output distance in meters
    km = meters / 1000.0  # output distance in kilometers

    meters = round(meters)
    km = round(km, 3)
    #print(f"Distance: {meters} m")
    #print(f"Distance: {km} km")
    return meters

def calc_stats(coords):
    #x_max = max(coords[0][0],coords[1][0],coords[2][0],coords[3][0])*100002
    #y_max = max(coords[0][1],coords[1][1],coords[2][1],coords[3][1])*100002
    tot_max = max(coords)
    radius = tot_max/2
    return {'tot_max':tot_max,'radius':radius}

def calc_dist(coord1, coord2):
    dist = haversine(coord1,coord2)
    #dist1 = abs(coord1[0] - coord2[0])
    #dist2 = abs(coord1[1] - coord2[1])
    return dist

def get_radius(env):
    coors = list(zip(*env.exterior.coords.xy))
    length = [calc_dist(coors[0],coors[1])]
    length.append(calc_dist(coors[1],coors[2]))
    length.append(calc_dist(coors[2],coors[3]))
    length.append(calc_dist(coors[3],coors[4]))

    stats = calc_stats(length)
    radius = stats['radius']
    #let's increase slightly to account for the edges of the bounding box
    radius = radius * 1.15 
    #print(stats)
    return round(radius,0)


    

Now let's calculate the envelopes and apply our new functions.

In [None]:
envelopes = df_problems['geometry'].envelope
df_problems['radius'] = envelopes.apply(get_radius)
df_problems.head()

Let's plot the bounding boxes and the new radiuses on our map and see if it's worked.
We're looking to see if the Catalan History Museum is now correctly included within the radius of the sub-neighbourhood it's in.

In [None]:
def style_function(feature):
    return {
        'fillOpacity': 0.5,
        'weight': 0,
        'fillColor': feature['properties']['FHEX_COLOR'] 
    }

if(venue_data_from_foursquare):
    # create map of Barcelona using latitude and longitude values
    map_bcn = folium.Map(location=[df_problems.iloc[0].centre_lat, df_problems.iloc[0].centre_lng], zoom_start=15)
    folium.GeoJson(
        df_problems.to_json(),
        name='geojson',
        style_function=style_function
    ).add_to(map_bcn)


    folium.GeoJson(
        envelopes.to_json(),
        name='geojson'
    ).add_to(map_bcn)

    # add markers to map
    for lat, lng, borough, neighbourhood,colour,radius in zip(df_problems['centre_lat'], df_problems['centre_lng'], df_problems['Nom_Barri'], df_problems['LITERAL'],df_problems['FHEX_COLOR'],df_problems['radius']):
        label = '{}, {}'.format(neighbourhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=colour,
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bcn)  
        folium.Circle([lat, lng],
                        radius=radius,
                      color= colour
                       ).add_to(map_bcn)

    folium.CircleMarker(
            [41.380711, 2.185559],
            radius=15,
            color='black',
            fill=True,
            fill_color='purple',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bcn)  

map_bcn

That's looking great. Now let's start pulling our Foursquare data in.


<br><br>
#### 1.5.3 venue data
We'll need to set up our API credentials.
I'm using python-dotenv and `os.getenv` here to allow me to work with dynamic credentials.

In [None]:

CLIENT_ID =  os.getenv('FOURSQUARE_CLIENT_ID')
CLIENT_SECRET = os.getenv('FOURSQUARE_CLIENT_SECRET')
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
#print('CLIENT_ID: '  + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

Let's set up our code to pull our venues in from the API. This code was originally based on the source code given in the course, but because changes to the FourSquare API reduced the number of venues available in each call from 100 to 50, I've modified it to pull the results in batches.

In [None]:

LIMIT = 50

def getNearbyVenues(names, latitudes, longitudes, radiuses):
    
    venues_list=[]
    for name, lat, lng, radius in zip(names, latitudes, longitudes, radiuses):
        #print('name:',name,' lat:',lat,' lng:',lng,' radius:',radius)
        these_venues = getNearbyVenues2(name, lat,lng,radius)
        venues_list = venues_list + these_venues
    # venues_list is already a flat list of rows (one per venue), so no flattening is needed
    nearby_venues = pd.DataFrame(venues_list)
    nearby_venues.columns = ['neighbourhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                            'point']
    
    return(nearby_venues)




def getNearbyVenues2(name, latitude, longitude, radius):
    this_list = []
    print(name)
    offset = 0
    offset_max = 0
    while(offset <= offset_max):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&intent=checkin&llAcc=1&sortByPopularity=1&offset={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longitude, 
            radius, 
            LIMIT,
            offset * LIMIT )
        #print(url) 
        # make the GET request (once per batch)
        raw = requests.get(url).json()
        #print(raw)
        if(raw["response"]):
            totresults = raw["response"]["totalResults"]
            if(offset_max == 0 and totresults > LIMIT):
                # index of the last batch needed to cover all of the results
                offset_max = math.ceil(totresults / LIMIT) - 1
            results = raw["response"]['groups'][0]['items']
            #print(results)
            # return only relevant information for each nearby venue
            
            for v in results:
                #print(v)
                this_list.append([
                    name, 
                    latitude, 
                    longitude, 
                    v['venue']['name'], 
                    v['venue']['location']['lat'], 
                    v['venue']['location']['lng'],  
                    v['venue']['categories'][0]['name'],
                    Point(v['venue']['location']['lng'],v['venue']['location']['lat'])])
            print('got batch ',offset+1 ,' for a total of ',len(results),' results')
        offset += 1
    return this_list
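
The batching logic in `getNearbyVenues2` can be sketched in isolation, without calling the API. The helper `page_offsets` below is a hypothetical name used only for this illustration; it is not part of the notebook or of the FourSquare API. It assumes that one request is always made (to learn the total) and that each request returns at most `limit` results.

```python
import math


def page_offsets(total_results, limit=50):
    """Return the offset to send with each request so that every result is fetched once."""
    # At least one request is needed, even when there are no results
    pages = max(1, math.ceil(total_results / limit))
    return [page * limit for page in range(pages)]


print(page_offsets(120))  # [0, 50, 100]
print(page_offsets(100))  # [0, 50]
print(page_offsets(0))    # [0]
```

Rounding up rather than down avoids an extra, empty request when the total is an exact multiple of the batch size.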

    

Let's call it for our smaller dataset.

In [None]:
bcn_venues = getNearbyVenues(names=df_problems['neighbourhood'],
                                   latitudes=df_problems['centre_lat'],
                                   longitudes=df_problems['centre_lng'],
                             radiuses=df_problems['radius']
                                  )
bcn_venues.shape

We can combine our venue data with the data from the neighbourhood.

In [None]:

bcn_merged = df_problems.merge(bcn_venues, how='inner',on='neighbourhood')
bcn_merged.shape

We need to identify whether the venues that were returned within the radius are actually within the sub-neighbourhood.

In [None]:
def is_within(row):
    return row['point'].within(row['geometry'])

bcn_merged['within'] = bcn_merged.apply(is_within, axis=1)
#bcn_merged.head()
bcn_one_district = bcn_merged[bcn_merged['LITERAL'].eq('032')]
bcn_one_district = bcn_one_district.drop(['point'], axis=1)
#bcn_one_district.head()
#bcn_one_district.shape
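
For intuition about what a point-in-polygon test such as `within` decides, here is a minimal pure-Python sketch of the classic ray-casting approach. This is an illustration only and not how Shapely is implemented; `point_in_polygon` is a hypothetical helper, it handles a single ring without holes, and points lying exactly on the boundary are not treated specially.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects back to the first.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(point_in_polygon((1, 1), square))  # True
print(point_in_polygon((3, 1), square))  # False
```

A venue inside the search circle but outside the polygon fails this test, which is exactly the case the `within` column is there to catch.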

Then let's visualise it to see how accurate our calculations are for a single area.
Green should be within the boundaries, whereas red should be outside.

In [None]:
def style_function(feature):
    if(feature['properties']['within'] == True):
        color = 'green'
    else:
        color = 'black'
    return {
        'fillOpacity': 0.5,
        'weight': 0,
        'fillColor': color 
    }
# create map of Barcelona using latitude and longitude values
map_bcn = folium.Map(location=[df_problems.iloc[0].centre_lat, df_problems.iloc[0].centre_lng], zoom_start=15)
if(venue_data_from_foursquare):

    folium.GeoJson(
        envelopes.to_json(),
        name='geojson'
    ).add_to(map_bcn)




    # add markers to map
    for lat, lng, borough, neighbourhood,colour,radius,within in zip(bcn_one_district['Venue Latitude'], bcn_one_district['Venue Longitude'], bcn_one_district['Nom_Barri'], bcn_one_district['Venue'],bcn_one_district['FHEX_COLOR'],bcn_one_district['radius'],bcn_one_district['within']):
        label = '{}, {}'.format(neighbourhood, borough)
        label = folium.Popup(label, parse_html=True)
        if(within):
            colour = 'green'
        else:
            colour = 'red'
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=colour,
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bcn)  


map_bcn

That looks good, so now let's exclude the venues outside each area and replot for our 3 sub-neighbourhoods.

In [None]:
barceloneta = bcn_merged[bcn_merged['within'].eq(True)]
barceloneta = barceloneta.drop(['point'], axis=1)
barceloneta.shape

def style_function(feature):
    return {
        'fillOpacity': 0.03,
        'weight': 0,
        'fillColor': feature['properties']['FHEX_COLOR'] 
    }
if(venue_data_from_foursquare):
    # create map of Barcelona using latitude and longitude values
    map_bcn = folium.Map(location=[df_problems.iloc[0].centre_lat, df_problems.iloc[0].centre_lng], zoom_start=15)
    folium.GeoJson(
        barceloneta.to_json(),
        name='geojson',
        style_function=style_function
    ).add_to(map_bcn)




    # add markers to map
    for lat, lng, borough, neighbourhood,colour,radius,within in zip(barceloneta['Venue Latitude'], barceloneta['Venue Longitude'], barceloneta['Nom_Barri'], barceloneta['Venue'],barceloneta['FHEX_COLOR'],barceloneta['radius'],barceloneta['within']):
        label = '{}, {}'.format(neighbourhood, borough)
        label = folium.Popup(label, parse_html=True)

        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=colour,
            fill=True,
            fill_color='#3186cc',
            fill_opacity=1,
            parse_html=False).add_to(map_bcn)  


map_bcn

Great, let's redo this for all of the sub-neighbourhoods.

In [None]:
all_envelopes = df_hoods['geometry'].envelope
df_hoods['radius'] = all_envelopes.apply(get_radius)

And plot each centroid and radius to see if it looks correct.

In [None]:
def style_function(feature):
    return {
        'fillOpacity': 0.5,
        'weight': 0,
        'fillColor': feature['properties']['FHEX_COLOR'] 
    }

# create map of Barcelona using latitude and longitude values
map_bcn = folium.Map(location=[df_hoods.iloc[0].centre_lat, df_hoods.iloc[0].centre_lng], zoom_start=15)
folium.GeoJson(
    df_hoods.to_json(),
    name='geojson',
    style_function=style_function
).add_to(map_bcn)


folium.GeoJson(
    all_envelopes.to_json(),
    name='geojson'
).add_to(map_bcn)

# add markers to map
for lat, lng, borough, neighbourhood,colour,radius in zip(df_hoods['centre_lat'], df_hoods['centre_lng'], df_hoods['Nom_Barri'], df_hoods['LITERAL'],df_hoods['FHEX_COLOR'],df_hoods['radius']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colour,
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_bcn)  
    folium.Circle([lat, lng],
                    radius=radius,
                  color= colour
                   ).add_to(map_bcn)

folium.CircleMarker(
        [41.380711, 2.185559],
        radius=15,
        color='black',
        fill=True,
        fill_color='purple',
        fill_opacity=0.7,
        parse_html=False).add_to(map_bcn)  

#map_bcn

Great, that looks right.
Because we're going to make a large number of API calls, we should set things up so that if we get interrupted mid-process, we can pick up where we left off.
To do this, we'll use a received_venues column, which will allow us to identify neighbourhoods that we've already processed.
We'll also set up a dataframe for all the venues.
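
The resume-after-interruption idea can be sketched with plain Python, independent of pandas and the API. The names `process_all` and `flaky_fetch` are hypothetical and exist only for this illustration; the `done` dictionary plays the role that the received_venues column plays in the notebook.

```python
def process_all(items, fetch, done):
    """Call fetch(item) for every item not yet recorded in done.

    done is updated in place, so it keeps its progress if fetch raises part-way through.
    """
    for item in items:
        if item in done:
            continue  # already processed in an earlier run
        done[item] = fetch(item)
    return done


calls = []

def flaky_fetch(name):
    """Stand-in for an API call that fails the first time it sees 'c'."""
    calls.append(name)
    if name == "c" and calls.count("c") == 1:
        raise RuntimeError("simulated interruption")
    return len(name)


progress = {}
try:
    process_all(["a", "b", "c"], flaky_fetch, progress)
except RuntimeError:
    pass  # interrupted: 'a' and 'b' are already recorded in progress

process_all(["a", "b", "c"], flaky_fetch, progress)
print(calls)  # ['a', 'b', 'c', 'c']
```

On the second run only the unfinished item is requested again, which is what saves API calls.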

In [None]:
df_hoods['received_venues'] = False
df_hoods['received_venues_cnt'] = 0
venue_columns = ['neighbourhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                            'point']
df_hoods_venues = pd.DataFrame(columns = venue_columns)
#df_hoods.head()

If we've already pulled our data and saved it, let's reload it now

In [None]:
if(load_venues_from_file):
    print("Loading already pulled data from file")
    loaded_venues = pickle.load( open( "venuesAll.pkl", "rb" ) )
    loaded_hoods = pickle.load( open( "df_hoods.pkl", "rb" ) )
    df_hoods_venues = loaded_venues
    df_hoods = loaded_hoods

Now let's run for all our neighbourhoods in Barcelona

In [None]:
if(load_venue_data_from_api == False):
    print("Skipping loading from the foursquare api")
else:
    for ind in df_hoods.index: 
            this_hood = df_hoods.loc[ind]  # ind is an index label, so look it up with .loc
            if (this_hood['received_venues']):
                 print(this_hood['neighbourhood'], ' : already processed') 
            elif ind > 5000:
                #print(this_hood['neighbourhood'], ' : skipping') 
                x = 1
            else:
                #print(this_hood['neighbourhood'], ' : requesting') 
                these_venues = getNearbyVenues2(name=this_hood['neighbourhood'] , latitude=this_hood['centre_lat'],longitude=this_hood['centre_lng'],radius=this_hood['radius'])
                print(these_venues)

                if(len(these_venues)> 0  and len(these_venues[0]) > 0):
                    # these_venues is already a list of rows (one per venue), so no flattening is needed
                    these_venues_df = pd.DataFrame(these_venues)

                    these_venues_df.columns = ['neighbourhood', 
                          'Neighborhood Latitude', 
                          'Neighborhood Longitude', 
                          'Venue', 
                          'Venue Latitude', 
                          'Venue Longitude', 
                          'Venue Category',
                                    'point']

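                    # DataFrame.append was removed in pandas 2.0, so concatenate instead
                    df_hoods_venues = pd.concat([df_hoods_venues, these_venues_df])
                    # .at avoids chained assignment, which may not update df_hoods
                    df_hoods.at[ind, 'received_venues'] = True
                    df_hoods.at[ind, 'received_venues_cnt'] = len(these_venues_df)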
                    print(this_hood['neighbourhood'], ' : got ',len(these_venues_df), ' venues') 
                else:
                    print(this_hood['neighbourhood'], ' : got 0 venues') 
                    df_hoods.at[ind, 'received_venues'] = False
                    df_hoods.at[ind, 'received_venues_cnt'] = 0



As loading the venue data took a long time, let's save it to disk so we don't need to redo it each time.

In [None]:
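# set write_venues_to_file = True to actually (over)write the saved venue data
# note: venues are written to venuesAllNew.pkl, whereas the load cell above reads venuesAll.pkl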
if(write_venues_to_file):
    print("Writing venue data to file")
    pickle.dump( df_hoods_venues, open( "venuesAllNew.pkl", "wb" ) )
    pickle.dump( df_hoods, open( "df_hoods.pkl", "wb" ) )
    

<div style="text-align: right"> <a href="#top">back to top</a> </div>

<br><br>
### 1.6 venue category taxonomy from FourSquare <a id="data6"></a>
In order to roll up different venue types (all restaurants, rather than Chinese, Tapas etc.), we'll need to get the FourSquare venue category taxonomy from the API.


In [None]:
def getcategories():
    this_list = []
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION )
    #print(url) 
    # make the GET request
    raw = requests.get(url).json()
    #print(raw)
    #results = requests.get(url).json()["response"]
    if(raw["response"]):
        results = raw["response"]['categories']
        #print(type(results))
        for r in results:
            this_master_category = r['name']
            print(this_master_category)
            this_list = recursive_categories(this_master_category, this_list, r)
    return this_list

def recursive_categories(main_name, arr, obj):
    #print(type(obj))
    for r in obj['categories']:
        this_name = r['name']
        #print(this_name)
        arr.append([this_name ,main_name])
        #print(r['categories'])
        if(len(r['categories']) > 0):
            arr = recursive_categories(main_name, arr, r)
    return arr


In [None]:
cat_df = pd.DataFrame(getcategories(), columns =['subcategory', 'category']).set_index('subcategory')
cat_df.head(10)
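
To show the shape of this roll-up without calling the API, here is a small sketch that flattens a nested category tree into a subcategory-to-top-level mapping. The function name `flatten_taxonomy` and the sample tree are made up for illustration and are not a real API response; only the nesting under a `categories` key follows the structure the code above relies on.

```python
def flatten_taxonomy(top_level_categories):
    """Map every nested subcategory name to the name of its top-level category."""
    mapping = {}

    def walk(node, top_name):
        # Record each child under the top-level name, then descend into it
        for child in node.get("categories", []):
            mapping[child["name"]] = top_name
            walk(child, top_name)

    for top in top_level_categories:
        walk(top, top["name"])
    return mapping


# Hypothetical, heavily trimmed taxonomy
sample = [
    {"name": "Food", "categories": [
        {"name": "Asian Restaurant", "categories": [
            {"name": "Chinese Restaurant", "categories": []},
        ]},
        {"name": "Tapas Restaurant", "categories": []},
    ]},
]

mapping = flatten_taxonomy(sample)
print(mapping["Chinese Restaurant"])  # Food
print(mapping["Tapas Restaurant"])    # Food
```

As in the notebook's version, the top-level categories themselves are not added as keys, so a venue tagged directly with a top-level category would have no match.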

<div style="text-align: right"> <a href="#top">back to top</a> </div>

---

<br><br><br>
## 2. data processing <a id="processing"></a>
---


Merge our neighbourhood data with the venue data.

In [None]:
# df_hoods_venues holds the venues whether they were loaded from file or pulled from the API
bcn_merged_all = df_hoods.merge(df_hoods_venues, how='inner',on='neighbourhood')
bcn_merged_all.shape

Identify the venues which are actually within our neighbourhood (vs just being in the radius) 

In [None]:
bcn_merged_all['within'] = bcn_merged_all.apply(is_within, axis=1)

Create a new dataset containing only the venues that are actually within their sub-neighbourhood.

In [None]:
only_within = bcn_merged_all[bcn_merged_all['within'].eq(True)]

Add our FourSquare category data

In [None]:
only_within_with_cats = only_within.join(cat_df,on='Venue Category')
only_within_with_cats.head()

Let's one-hot encode our data, both by the venue type and by the venue category.

In [None]:
# one hot encoding
bcn_all_onehot = pd.get_dummies(only_within[['Venue Category']], prefix="", prefix_sep="")
bcn_all_onehot_cats = pd.get_dummies(only_within_with_cats[['category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
bcn_all_onehot['Nom_Districte'] = only_within['Nom_Districte'] 
bcn_all_onehot['Nom_Barri'] = only_within['Nom_Barri'] 
bcn_all_onehot['neighbourhood'] = only_within['neighbourhood'] 
bcn_all_onehot['price_per_m2'] = only_within['price_per_m2'] 
bcn_all_onehot_cats['Nom_Districte'] = only_within_with_cats['Nom_Districte'] 
bcn_all_onehot_cats['Nom_Barri'] = only_within_with_cats['Nom_Barri'] 
bcn_all_onehot_cats['neighbourhood'] = only_within_with_cats['neighbourhood'] 
bcn_all_onehot_cats['price_per_m2'] = only_within_with_cats['price_per_m2'] 

# move the four identifier columns (district, barri, neighbourhood, price) to the front
fixed_columns = list(bcn_all_onehot.columns[-4:]) + list(bcn_all_onehot.columns[:-4])
fixed_columns_cats = list(bcn_all_onehot_cats.columns[-4:]) + list(bcn_all_onehot_cats.columns[:-4])
bcn_all_onehot = bcn_all_onehot[fixed_columns]
bcn_all_onehot_cats = bcn_all_onehot_cats[fixed_columns_cats]
bcn_all_onehot_cats.head()

In [None]:
# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
bcn_grouped_cats = bcn_all_onehot_cats.groupby(['Nom_Barri','price_per_m2']).mean(numeric_only=True).reset_index()
bcn_grouped_cats.head()
bcn_grouped_hoods = bcn_all_onehot.groupby(['neighbourhood','price_per_m2']).mean(numeric_only=True).reset_index()
bcn_grouped_hoods_cats = bcn_all_onehot_cats.groupby(['neighbourhood','price_per_m2']).mean(numeric_only=True).reset_index()
bcn_grouped_hoods_cats.head()

In [None]:
bcn_sum = bcn_all_onehot.groupby(['Nom_Barri','price_per_m2']).sum(numeric_only=True).reset_index()
bcn_sum_cats = bcn_all_onehot_cats.groupby(['Nom_Barri','price_per_m2']).sum(numeric_only=True).reset_index()
bcn_sum_cats.head()

In [None]:
bcn_grouped = bcn_all_onehot.groupby(['Nom_Barri','price_per_m2']).mean(numeric_only=True).reset_index()
bcn_grouped.head()
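
It is worth spelling out what the grouped tables contain. Averaging one-hot columns within a group gives the share of that group's venues in each category, while summing them gives the raw count. The sketch below uses an invented six-venue table; the data is hypothetical and only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical venues in two neighbourhoods
toy = pd.DataFrame({
    'Nom_Barri': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Bar', 'Park', 'Bar', 'Park'],
})

# One indicator column per category; dtype=int keeps them as 0/1
onehot = pd.get_dummies(toy['Venue Category'], dtype=int)
onehot['Nom_Barri'] = toy['Nom_Barri']

shares = onehot.groupby('Nom_Barri').mean()   # proportion of venues per category
counts = onehot.groupby('Nom_Barri').sum()    # number of venues per category
print(shares)
print(counts)
```

Because each venue has exactly one category, the shares in every row add up to 1.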

<div style="text-align: right"> <a href="#top">back to top</a> </div>

---
<br><br><br>
## 3. investigations <a id="investigations"></a>
---
Here we can start to use our data to answer the following questions:
1. Which areas are more or less expensive?
2. How can we group areas based on the type of venues they contain?
3. What type of venues correlate with richer or poorer areas?
4. Which areas appear over or under valued based on their venues?


<br><br>
### 3.1 price by district <a id="inv1"></a>
Let's check that our data makes sense by plotting it on a choropleth map.
First we'll extract just our prices.

In [None]:
# copy() so that the type conversion below does not modify a view of df_hoods
prices = df_hoods[['id','price_per_m2']].copy()
prices['id'] = prices['id'].astype('str')
prices.dtypes

Then we will check that geolocation is working correctly to centre our maps.

Now we are able to draw our choropleth map, with the cheapest areas shown in yellow and the most expensive shown in green.

In [None]:
m = folium.Map(location=[latitude, longitude], zoom_start=12)
if price_by_district:
    # quartile edges, kept for reference; uncomment bins=bins below to use them
    bins = list(prices['price_per_m2'].quantile([0, 0.25, 0.5, 0.75, 1]))

    # Add the colour layer for the choropleth:
    folium.Choropleth(
        geo_data=df_hoods.to_json(),
        name='choropleth',
        data=prices,
        columns=['id','price_per_m2'],
        key_on='feature.id',
        fill_color='YlGn',
        fill_opacity=0.7,
        line_opacity=0.8,
        #bins=bins,
        legend_name='price per sq m',
        reset=True
    ).add_to(m)
m
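
The cell above computes quartile edges but leaves `bins=bins` commented out, so the map uses folium's default binning. As a rough illustration of why quantile edges can be useful, the sketch below compares them with equal-width edges on invented prices that include one outlier. The numbers are hypothetical.

```python
import numpy as np

# Hypothetical prices per square metre, with one expensive outlier
toy_prices = np.array([1000, 2000, 3000, 4000, 5000, 12000])

# Quartile edges: each of the four colour classes holds about a quarter of the areas
quartile_edges = np.quantile(toy_prices, [0, 0.25, 0.5, 0.75, 1])

# Equal-width edges for comparison: the outlier stretches the scale
equal_edges = np.linspace(toy_prices.min(), toy_prices.max(), 5)

print(quartile_edges)
print(equal_edges)
```

With equal-width edges, five of the six areas fall into the lowest two classes, whereas the quartile edges spread them across all four.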

<div style="text-align: right"> <a href="#top">back to top</a> </div>

<br><br>
### 3.2 price effect of each type of venue <a id="inv2"></a>
So let's see which venues are associated with higher-priced areas and which with lower-priced ones.
We'll take our one-hot encoded data, averaged by neighbourhood (`Nom_Barri`), so that each column holds the proportion of an area's venues that fall into that venue type.

We're going to apply linear regression to estimate the association between venue proportion and price. To start, we split off our venue types as the independent variables.

In [None]:
x_colnames = bcn_grouped.columns[2:]
x_colnames

And now we use price per m2 as the dependent variable and run our regression.

In [None]:

regr = linear_model.LinearRegression()

x = np.asanyarray(bcn_grouped[x_colnames])
y = np.asanyarray(bcn_grouped['price_per_m2'])
regr.fit(x, y)
# The coefficients
print('Coefficients: ', regr.coef_)
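
To show how such coefficients can be read, here is a minimal sketch of an ordinary least squares fit on invented data. It uses NumPy's `lstsq` instead of scikit-learn purely to stay self-contained, and the venue shares and prices are hypothetical. The prices are constructed from known coefficients so that we can check the fit recovers them.

```python
import numpy as np

# Hypothetical venue shares for five areas (columns: bars, parks).
# A third category is left out, so the columns are not collinear with the intercept.
X = np.array([
    [0.1, 0.6],
    [0.3, 0.4],
    [0.5, 0.2],
    [0.2, 0.2],
    [0.4, 0.5],
])
# Prices constructed as 2000 + 1000 * bars - 500 * parks, with no noise
y = 2000 + 1000 * X[:, 0] - 500 * X[:, 1]

# Ordinary least squares with an explicit intercept column
design = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coef_bars, coef_parks = solution
print(intercept, coef_bars, coef_parks)
```

A positive coefficient means the predicted price rises as that venue type's share grows, and a negative one means it falls.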

Our coefficients show the direction and size of the association between each venue type's share and price: positive values go with higher prices, negative values with lower ones.

In [None]:
# avoid naming the variable "dict", which would shadow the Python built-in
coef_data = {'category': x_colnames, 'importance': regr.coef_}
category_df = pd.DataFrame(coef_data)
category_df.head(10)

Let's sort and scale them so we can better identify the relative strength of the correlations.
From here we have a comprehensive list of all the different types of venues that exist in Barcelona, and how correlated they are with property price.
We can see that having a Pub in an area is most correlated with a lower price, and having a Beach Bar is most correlated with a higher price.

In [None]:
category_df.sort_values(by=['importance'], inplace=True)
max_val = max(category_df['importance'].max(),abs(category_df['importance'].min()))
category_df['scaled'] = (category_df['importance'] / max_val) * 100
category_df.set_index(['category'],inplace=True)
category_df
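
The scaling step divides every coefficient by the largest absolute coefficient, so the strongest effect becomes +100 or -100 and the rest are expressed relative to it. A small worked sketch with invented coefficients (not results from this notebook):

```python
# Hypothetical regression coefficients (not results from the notebook)
toy_coefs = {'Category A': -400.0, 'Category B': 800.0, 'Category C': 200.0}

# Divide by the largest absolute coefficient so values fall in [-100, 100]
max_abs = max(abs(v) for v in toy_coefs.values())
scaled = {name: value / max_abs * 100 for name, value in toy_coefs.items()}

# Sort from most negative to most positive association
for name, value in sorted(scaled.items(), key=lambda item: item[1]):
    print(f'{name}: {value:+.1f}')
```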

In [None]:
category_df.to_excel('coeficients.xlsx')

<br><br>
### 3.3 identifying types of district <a id="inv3"></a>
By looking at the types and frequency of venue types in each district, are we able to identify neighbourhoods which are similar to each other?


In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Nom_Barri']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' we fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Nom_Barri'] = bcn_grouped['Nom_Barri']

for ind in np.arange(bcn_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(bcn_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head(10)
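
The ranking logic can be sketched without pandas: sort the venue types by their share and label the top entries with ordinal column names. The `ordinal` and `top_categories` helpers and the shares below are hypothetical, written only to illustrate the idea.

```python
def ordinal(n):
    """Return n with its English ordinal suffix (1st, 2nd, 3rd, 4th, ...)."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'          # covers 11th, 12th, 13th
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


def top_categories(frequencies, n):
    """Names of the n categories with the highest frequency."""
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    return ranked[:n]


# Hypothetical venue shares for one neighbourhood
row = {'Bar': 0.40, 'Park': 0.10, 'Bakery': 0.30, 'Hotel': 0.20}
labels = [f'{ordinal(i)} Most Common Venue' for i in range(1, 4)]
top3 = dict(zip(labels, top_categories(row, 3)))
print(top3)
```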

In [None]:
num_top_venues = 9

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Nom_Barri']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' we fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted_cat = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted_cat['Nom_Barri'] = bcn_grouped_cats['Nom_Barri']

for ind in np.arange(bcn_grouped_cats.shape[0]):
    neighbourhoods_venues_sorted_cat.iloc[ind, 1:] = return_most_common_venues(bcn_grouped_cats.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted_cat.head(100)

In [None]:
num_top_venues = 8

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' we fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
sub_neighbourhoods_venues_sorted_cat = pd.DataFrame(columns=columns)
sub_neighbourhoods_venues_sorted_cat['neighbourhood'] = bcn_grouped_hoods_cats['neighbourhood']

for ind in np.arange(bcn_grouped_hoods_cats.shape[0]):
    sub_neighbourhoods_venues_sorted_cat.iloc[ind, 1:] = return_most_common_venues(bcn_grouped_hoods_cats.iloc[ind, :], num_top_venues)

sub_neighbourhoods_venues_sorted_cat.head(10)

<div style="text-align: right"> <a href="#top">back to top</a> </div>

In [None]:
#bcn_grouped = bcn_onehot.groupby('neighbourhood').mean().reset_index()
bcn_grouped_hoods_cats.head()

In [None]:
#bcn_grouped_clustering = bcn_grouped.drop(['Nom_Barri','price_per_m2'], 1)
#bcn_grouped_clustering = bcn_grouped_cats.drop(['Nom_Barri','price_per_m2'], 1)
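# use the columns= keyword; recent pandas versions no longer accept the axis as a positional argument
bcn_grouped_clustering = bcn_grouped_hoods_cats.drop(columns=['neighbourhood','price_per_m2'])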



In [None]:





# k-means: determine k using the elbow method
distortions = []
K = range(1,15)
for k in K:
    # fit() returns the fitted model, so a single call is enough
    kmeanModel = KMeans(n_clusters=k).fit(bcn_grouped_clustering)
    # distortion = mean distance from each row to its nearest cluster centre
    distortions.append(sum(np.min(cdist(bcn_grouped_clustering, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / bcn_grouped_clustering.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
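
The distortion plotted above is the mean Euclidean distance from each row to its nearest cluster centre. The sketch below computes the same quantity in plain Python for four invented points. Note that the centres here are set by hand rather than fitted by k-means, so this only illustrates the measure, not the clustering itself.

```python
import math


def distortion(points, centres):
    """Mean distance from each point to its nearest centre."""
    nearest = [min(math.dist(p, c) for c in centres) for p in points]
    return sum(nearest) / len(points)


# Hypothetical 2-D points forming two tight groups
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

one_centre = distortion(points, [(5, 1)])            # k = 1
two_centres = distortion(points, [(0, 1), (10, 1)])  # k = 2
print(one_centre, two_centres)
```

Moving from one centre to two cuts the distortion sharply. The "elbow" is the value of k after which further centres bring only small reductions.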

In [None]:
# set number of clusters
kclusters = 8




# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bcn_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
bcn_grouped_clustering.head()

In [None]:
# add clustering labels
#bcn_grouped['Cluster Labels'] =  kmeans.labels_
#bcn_grouped_cats['Cluster Labels'] =  kmeans.labels_
bcn_grouped_hoods_cats['Cluster Labels'] = kmeans.labels_

bcn_grouped_hoods_cats.head()  # check the last column

In [None]:
sub_neighbourhoods_venues_sorted_cat.head()

In [None]:
df_hoods_k_means = df_hoods.merge(bcn_grouped_hoods_cats.set_index('neighbourhood'), how='inner', on='neighbourhood')
df_hoods_k_means = df_hoods_k_means.merge(sub_neighbourhoods_venues_sorted_cat.set_index('neighbourhood'), how='inner', on='neighbourhood')



In [None]:
category_columns = ['Arts & Entertainment','College & University','Food','Nightlife Spot',
                    'Outdoors & Recreation','Professional & Other Places','Residence',
                    'Shop & Service','Travel & Transport']
# mean share of each venue category within each cluster
table = pd.pivot_table(df_hoods_k_means, values=category_columns, index=['Cluster Labels'], aggfunc='mean')
table

We can see that our clusters could be described as:

| Cluster | Description   |
|------|------|
|   0  | food/shops|
|   1  | food|
|   2  | recreation|
|   3  | nightlife|
|   4  | arts/food/recreation|
|   5  | shops|
|   6  | travel/transport|
|   7  | food/recreation/nightlife|
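
The descriptions above were assigned by reading the pivot table. One possible way to make that reading repeatable, offered as an assumption and not as what this notebook does, is to name each cluster after the categories whose mean share reaches a threshold. The `describe_cluster` helper, the 0.25 threshold and the shares below are all hypothetical.

```python
def describe_cluster(mean_shares, threshold=0.25):
    """Join the categories whose mean share reaches the threshold."""
    ranked = sorted(mean_shares.items(), key=lambda kv: kv[1], reverse=True)
    dominant = [name for name, share in ranked if share >= threshold]
    return '/'.join(dominant) if dominant else 'mixed'


# Hypothetical mean category shares for two clusters
toy_clusters = {
    0: {'Food': 0.45, 'Shop & Service': 0.30, 'Nightlife Spot': 0.10},
    1: {'Food': 0.70, 'Shop & Service': 0.10, 'Nightlife Spot': 0.05},
}
descriptions = {label: describe_cluster(shares)
                for label, shares in toy_clusters.items()}
print(descriptions)
```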



In [None]:
desc_data = {'Cluster Labels': [0,1,2,3,4,5,6,7],
        'Description': ['food/shops','mostly food','mostly recreation','nightlife','arts/food/recreation','mostly shopping','travel/transport','food/recreation/nightlife']
        }

df_desc = pd.DataFrame(desc_data, columns = ['Cluster Labels', 'Description'])
df_desc.set_index('Cluster Labels',inplace=True)

print (df_desc)

In [None]:
df_hoods_k_means = df_hoods_k_means.merge(df_desc, how='inner', on='Cluster Labels')
cat_lookups = df_hoods_k_means[['neighbourhood','Description']]


In [None]:
cat_lookups.head(100)

In [None]:
#df_hoods_k_means_dist = df_hoods.merge(bcn_grouped_cats.set_index('Nom_Barri'), how='inner', on='Nom_Barri')



m = folium.Map(location=[latitude, longitude], zoom_start=12)
 
# Add the colour layer for the choropleth:
choropleth = folium.Choropleth(
 geo_data=df_hoods_k_means.to_json(),
 name='choropleth',
 data=df_hoods_k_means,
 columns=['SEC_CENS','Cluster Labels'],
 key_on='feature.properties.SEC_CENS',
 fill_color='Set1',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Cluster',
 reset=True
).add_to(m)


# add a tooltip showing each area's cluster description
tooltip_style = "font-size: 15px; font-weight: bold"
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['Description'], style=tooltip_style, labels=False))

# create a layer control
folium.LayerControl().add_to(m)




m

In [None]:

def build_segment_map(dt):
    """Draw a price choropleth for one cluster and print its cheapest and most expensive areas."""
    geo = dt[['SEC_CENS','geometry','neighbourhood','price_per_m2_x']]

    m = folium.Map(location=[latitude, longitude], zoom_start=12)

    # Add the colour layer for the choropleth:
    choropleth = folium.Choropleth(
        geo_data=geo.to_json(),
        name='choropleth',
        data=geo,
        columns=['SEC_CENS','price_per_m2_x'],
        key_on='feature.properties.SEC_CENS',
        fill_color='RdYlGn',
        fill_opacity=0.7,
        line_opacity=1,
        legend_name='relative price',
        highlight=True,
        smooth_factor=0
    ).add_to(m)

    # add a tooltip showing each area's price
    tooltip_style = "font-size: 15px; font-weight: bold"
    choropleth.geojson.add_child(
        folium.features.GeoJsonTooltip(['price_per_m2_x'], style=tooltip_style, labels=False))

    # create a layer control
    folium.LayerControl().add_to(m)

    # rank the areas in this cluster by price (set_index returns a new frame, so keep the result)
    stats = geo[['neighbourhood','price_per_m2_x']].set_index('neighbourhood')
    stats = stats.sort_values(by='price_per_m2_x')
    print('Looking at Neighbourhoods of Type: ' + dt.Description.values[0])
    print('Top 5 Lowest Priced Neighbourhoods')
    print(stats.head(5))
    print('Top 5 Highest Priced Neighbourhoods')
    print(stats.sort_values(by='price_per_m2_x', ascending=False).head(5))

    return m


In [None]:
cat1 = df_hoods_k_means[df_hoods_k_means['Cluster Labels'].eq(0)] 
m = build_segment_map(cat1)
m

In [None]:
cat1 = df_hoods_k_means[df_hoods_k_means['Cluster Labels'].eq(1)] 
m = build_segment_map(cat1)
m

In [None]:
cat1 = df_hoods_k_means[df_hoods_k_means['Cluster Labels'].eq(2)] 
m = build_segment_map(cat1)
m

In [None]:
cat1 = df_hoods_k_means[df_hoods_k_means['Cluster Labels'].eq(3)] 
m = build_segment_map(cat1)
m

In [None]:
cat1 = df_hoods_k_means[df_hoods_k_means['Cluster Labels'].eq(4)] 
m = build_segment_map(cat1)
m

In [None]:
cat1 = df_hoods_k_means[df_hoods_k_means['Cluster Labels'].eq(5)] 
m = build_segment_map(cat1)
m

In [None]:
cat1 = df_hoods_k_means[df_hoods_k_means['Cluster Labels'].eq(6)] 
m = build_segment_map(cat1)
m

In [None]:
cat1 = df_hoods_k_means[df_hoods_k_means['Cluster Labels'].eq(7)] 
m = build_segment_map(cat1)
m

<br><br>
### 3.4 underpriced/overpriced neighbourhoods by venue <a id="inv4"></a>
By comparing each area's actual price with the price our regression predicts from its venue mix, can we identify areas that appear overpriced or underpriced?

<div style="text-align: right"> <a href="#top">back to top</a> </div>

In [None]:
y_hat= regr.predict(bcn_grouped[x_colnames])
bcn_grouped['predicted_price'] = y_hat
bcn_grouped = bcn_grouped.astype({"predicted_price": int})

In [None]:
bcn_grouped.head()

In [None]:
bcn_grouped['price_delta'] = bcn_grouped['price_per_m2'] - bcn_grouped['predicted_price']

In [None]:
#bcn_grouped

In [None]:
bcn_grouped_neigh = bcn_all_onehot.groupby(['neighbourhood','price_per_m2']).mean(numeric_only=True).reset_index()
#bcn_grouped_neigh

In [None]:

y_hat= regr.predict(bcn_grouped_neigh[x_colnames])
bcn_grouped_neigh['predicted_price'] = y_hat
bcn_grouped_neigh = bcn_grouped_neigh.astype({"predicted_price": int})
bcn_grouped_neigh['price_delta'] = bcn_grouped_neigh['price_per_m2'] - bcn_grouped_neigh['predicted_price']
bcn_grouped_neigh['price_diff'] = (bcn_grouped_neigh['price_delta']/bcn_grouped_neigh['price_per_m2']) * 100
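
The two derived columns are simple arithmetic: `price_delta` is the actual price minus the predicted price, and `price_diff` expresses that gap as a percentage of the actual price. A worked sketch with invented prices (the `price_gap` helper and the numbers are hypothetical):

```python
def price_gap(actual, predicted):
    """Return (delta, delta as a percentage of the actual price) for one area."""
    delta = actual - predicted
    return delta, delta / actual * 100


# Hypothetical prices per square metre
over = price_gap(actual=3000, predicted=2400)    # priced above the model
under = price_gap(actual=2000, predicted=2500)   # priced below the model
print(over, under)
```

A positive value means the area costs more than its venue mix would suggest, and a negative value means it costs less.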

In [None]:
bcn_grouped_neigh.head(50)
df_hoods_price_diff = df_hoods_k_means.merge(bcn_grouped_neigh.set_index('neighbourhood'), how='inner', on='neighbourhood')

In [None]:
df_hoods_price_diff.head()

In [None]:

# copy() so that adding a column does not modify a view of df_hoods_price_diff
prices_diff = df_hoods_price_diff[['SEC_CENS','price_diff']].copy()
# decile edges for the colour scale
bins = list(prices_diff['price_diff'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]))
# decile (0 = lowest) that each area falls into
prices_diff['price_diff_quart'] = pd.qcut(prices_diff['price_diff'], 10, labels=False, duplicates='drop')
m = folium.Map(location=[latitude, longitude], zoom_start=12)

# Add the colour layer for the choropleth:
choropleth = folium.Choropleth(
 geo_data=df_hoods_price_diff.to_json(),
 name='choropleth',
 data=prices_diff,
 columns=['SEC_CENS','price_diff'],
 key_on='feature.properties.SEC_CENS',
 fill_color='RdYlGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 bins=bins,
 legend_name='difference from predicted price',
 reset=True
).add_to(m)

style_function = "font-size: 15px; font-weight: bold"
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['Description'], style=style_function, labels=False))

m