<img src="Imatges/PortadaGeneral.jpg" width="1200" />

# Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis and Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

# 1 INTRODUCTION <a name="introduction"></a>

## 1.1 Summary
In this project I’m going to answer one question:
* Is there a relationship between the socioeconomic status of people living in a district and the profile of venues in it?

The target audience is:
* Companies wishing to establish themselves in Girona, or new businesses that want to open there
* The local city council

The project has two main parts:
* Data acquisition and data preparation.
  * Fetch geographical information, process it and feed it to the Foursquare queries.
  * Process the results of Foursquare queries
  * Fetch and process statistical data 
* From data to information.
  * Apply machine learning methodologies and techniques to extract relevant information out of data
  * Use data visualization to help understand the insights obtained.

## 1.2 Background
Cities, as large human settlements, flourished in parallel with the development of agriculture and have been characterized by specialization and the social division of labor. Great cities have existed in every civilization in human history, and they have been planned and managed by bureaucracies that needed data science to do the job. One example I like to use to link cities and data science is the development of the concept of zero. Even where the numeral itself was not invented, most ancient civilizations developed the concept of zero and used it for calculations in areas such as accounting, astronomy and geography.

Cities are great places to enhance human interaction, for better and for worse. In both cases data science is paramount in improving human life. In this project I develop a process to gather data from cities that can be used by different stakeholders. I’m mainly targeting:
* Companies which want to explore the socioeconomic landscape of cities to make informed decisions about their businesses
* Government and governmental agencies seeking to improve resource allocation, from the deployment of police forces to budget allocation

In all these cases, a fine-grained division of the city is of major interest, as it lets us zoom in on small patches that can also be aggregated conveniently. Postal code areas are often used for this division, but I think that census areas are a better choice. Census areas in Catalonia contain between 500 and 2000 electors. A census area cannot span more than one municipality, and census areas are grouped administratively into census districts. Moreover, official statistical data is geographically segmented by census districts.

In this project, I use Girona as a case study that can later be extrapolated to other cities in the country. Girona is divided into census areas and, taking these as reference areas, Foursquare is used to locate the venues in each neighborhood.

### 1.2.1 Mapping venues
Mapping venues in towns is the first step in the process of typifying census areas with relation to the profile of facilities and commercial activities. This profiling can be complemented and/or further correlated with other statistical data like socioeconomic status of inhabitants.

For example, if an area is characterized by the presence of restaurants and small households occupied mainly by young people, a real estate company can focus its efforts on selling or renting in this area to wealthy young singles or couples. If another area has parks and supermarkets, and its households average four people, it may be a family neighborhood. A supermarket chain may consider it a target neighborhood.

For governmental stakeholders, checking the reviews of businesses around an area might be a very useful source of information to plan specific support actions for this or that particular area.


# 2 DATA <a name="data"></a>

Though we live in times of data hype and terabytes of data accumulate endlessly, data is often kept at very different places, in different formats. Fortunately, there are official governmental sources of data which are reliable. These sources are not completely coherent in terms of format, but they are manageable.
In this project I’m going to use three main sources of data:
* Data from the Cartographic Institute of Catalonia [ICC](https://www.icgc.cat/Administracio-i-empresa/Descarregues/Capes-de-geoinformacio/Seccions-censals)
* Data from Girona’s local council that can be obtained either from Girona Open Data [GOD](https://www.girona.cat/opendata/) or from l’[Observatori](https://terra.girona.cat/apps/observatori/).
* Data from [Foursquare](https://foursquare.com/)

# 3 METHODOLOGY <a name="methodology"></a>

## 3.1 Prepping data

### 3.1.1 Geographic data 
<img src="Imatges/DataProcessing_small.jpg"  style="float:right" />
The data can be downloaded directly from the Jupyter notebook, or downloaded externally and kept in a local file to be imported into the notebook later. I chose the second option to keep network problems from interfering with the development process.

The figure beside shows a schematic representation of the transformations performed on the geographic data from ICC and GOD. In the original sources, geographical data is kept in Shapefile format, which is a common format in geographical information systems (GIS). However, the GeoJSON format is increasingly used in data science and is the format used to feed map tools like Folium and its choropleth maps. Thus, the first step is to transform .shp files into .json files with the PyShp library. I will define the function *shpToGeoJSON*, where I will use the [PyShp](https://pypi.org/project/pyshp/) tools to read .shp files and return the information in the form of Python dictionaries.
In both ICC and GOD the geographical data is kept in UTM coordinates, but many current applications are better fed with WGS84 coordinates. So, I will transform the UTM coordinates to WGS84 coordinates with the help of the [PyProj](https://pypi.org/project/pyproj/) library. For this job I’ll create the function *getWGS84Coordinates*.

Once the data is in JSON format and the coordinates are expressed in the WGS84 system, I’ll make some small transformations, like replacing town codes with the official town names for the sake of readability. I’ll also create a merged code, joining the district code and the section code, to uniquely identify each census section in Girona.

In [None]:
#!pip install PyShp #This is a library to read and manipulate .shp files
#!pip install pyproj #This is a library for performing coordinate transformations

In [1]:
from datetime import date # class to get the current date
import folium # library to draw maps
import json # to manipulate JSON files, i.e. transforming JSON to dict and vice versa
import math # library with mathematical functions
import numpy as np # library to handle data in a vectorized manner
import os # library to operate with the file system
import pandas as pd # library for data analysis
from pyproj import Proj # library for performing coordinate transformations
import requests # library to handle HTTP requests
from scipy.spatial import ConvexHull, convex_hull_plot_2d # library to do geometric calculations
import shapefile # PyShp, a library to read and manipulate .shp files
from shapely.geometry import Point # library to do geometric calculations
from shapely.geometry.polygon import Polygon # library to do geometric calculations
from sklearn.cluster import KMeans # library to perform cluster analysis
import sys # useful library for showing errors
import time # library to manage time and timestamps

#### Function to transform coordinates to WGS84 in geo_JSON dictionaries

In [2]:
def getWGS84Coordinates(transformationParameters, JSON):
    # Build the projection from the parameters passed in, not from a global variable
    myProj = Proj(transformationParameters)
    list0 = []
    # Only the outer ring (index 0) of the polygon is converted
    for point in JSON['coordinates'][0]:
        east_utm = point[0]
        north_utm = point[1]
        # inverse=True converts projected (UTM) coordinates back to longitude and latitude
        lon, lat = myProj(east_utm, north_utm, inverse=True)
        list0.append([lon, lat]) # beware that Choropleth takes longitude first
    JSON['coordinates'] = [list0]
    return JSON

#### Function to transform .shp data to geoJSON and save it to file

In [4]:
# This function loads a .shp file, converts its records to GeoJSON features and saves them as a .geoJSON file
def shpToGeoJSON(PathIn, PathOut, Name, Encoding, CoordinateTransformParameters):
    inPath = PathIn + Name + '.shp'
    outPath = PathOut + Name + '.geoJSON'
    reader = shapefile.Reader(inPath, encoding=Encoding)
    # The first field is the deletion flag, so it is skipped
    fields = reader.fields[1:]
    field_names = [field[0] for field in fields]
    buffer = []
    for sr in reader.shapeRecords():
        atr = dict(zip(field_names, sr.record))
        geom = sr.shape.__geo_interface__
        # After a lot of struggle I realized that the best way to modify the coordinates
        # is to transform them here, before saving as GeoJSON
        geom = getWGS84Coordinates(CoordinateTransformParameters, geom)
        buffer.append(dict(type="Feature", properties=atr, geometry=geom))
    # Write the GeoJSON file
    with open(outPath, "w", encoding="utf-8") as geojson:
        geojson.write(json.dumps({"type": "FeatureCollection", "features": buffer}, indent=2) + "\n")

#### Transformation of the .shp files with utm coordinates into .json files with WGS84 coordinates

In [6]:
# Here we use the function shpToGeoJSON to transform several .shp files and save the content in .geoJSON files
PathIn = "C:/Users/user/Documents/Els_meus_documents/projectes/CompetitiveIntelligence/Data/Seccions censals/"
PathOut = "C:/Users/user/Documents/Els_meus_documents/projectes/CompetitiveIntelligence/Data/geoJSON/"
Names = ['Catalunya2016', 'Girona', 'GironaCentre']
#Here we define the projection parameters used by the pyproj library
projectParameters = "+proj=utm +zone=31N +north +ellps=WGS84 +datum=WGS84 +units=m +no_defs"

for Name in Names:
    if "Centre" in Name:
        Encoding = 'latin-1'
    else:
        Encoding = 'utf-8'
    shpToGeoJSON(PathIn, PathOut, Name, Encoding, projectParameters)

#### Loading the data in json format from the .geojson files

In [2]:
Names = ['Catalunya2016', 'Girona', 'GironaCentre']
PathOut = "C:/Users/user/Documents/Els_meus_documents/projectes/CompetitiveIntelligence/Data/geoJSON/"
for Name in Names:
    Path = PathOut + Name + '.geoJSON'
    with open(Path, 'r') as f:
        if 'Catalunya' in Name:
            Catalunya = json.loads(f.read())
        elif 'Centre' in Name:
            Centre = json.loads(f.read())
        else:
            Girona = json.loads(f.read())

#### Changing the city code by city names
In the Catalan database of census sections, cities are coded numerically. To make the data more human-readable, we replace the codes with the official names. This is a two-step process.
1. We build a dictionary with codes and names
2. We loop through the geo_JSON dict to replace the numeric codes with names
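
The two steps above can be sketched on toy data. This is only an illustration: the helper `pad_code` and the sample rows are hypothetical, and the feature layout is a simplified version of the dictionaries used in this notebook.

```python
# Toy data: the codes and names below are illustrative sample values.
rows = [
    ("Municipi", 80193, "Barcelona"),   # leading zero lost when read as an integer
    ("Municipi", 170792, "Girona"),
    ("Comarca", 20, "Gironès"),         # non-municipal rows are skipped
]

def pad_code(code, width=6):
    """Return the code as a zero-padded string of the given width."""
    return str(code).zfill(width)

# Step 1: dictionary {municipal code: official name}
code_to_name = {pad_code(code): name for kind, code, name in rows if kind == "Municipi"}

# Step 2: replace the numeric code in each feature with the name
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"MUNICIPI": "170792"}, "geometry": None},
        {"type": "Feature", "properties": {"MUNICIPI": "080193"}, "geometry": None},
    ],
}
for feature in geojson["features"]:
    code = feature["properties"]["MUNICIPI"]
    # Keep the original code if it is missing from the dictionary
    feature["properties"]["MUNICIPI"] = code_to_name.get(code, code)

print([f["properties"]["MUNICIPI"] for f in geojson["features"]])
```

Using `dict.get` with the code as fallback is a design choice of this sketch: an unknown code stays visible in the data instead of raising an error.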

#### Build a dictionary with key: city code, and value: name of the city

In [3]:
nomenclator = pd.read_csv('C:/Users/user/Documents/Els_meus_documents/projectes/CompetitiveIntelligence/Data/nomenclatorPrincipatCatalunya.csv')
nomenclator_dict = {}
for x in range(0, len(nomenclator)):
    # Keep only the rows that describe a municipality
    if nomenclator.iloc[x, 0] == 'Municipi':
        # Codes are read as numbers, so the leading zero of 5-digit codes is restored
        if len(str(nomenclator.iloc[x, 1])) == 5:
            codi = '0' + str(nomenclator.iloc[x, 1])
        else:
            codi = str(nomenclator.iloc[x, 1])
        nomenclator_dict[codi] = nomenclator.iloc[x, 6]

#### Function to change the numeric municipal codes by city names
We use the data from the official Catalan Institute of Statistics [IDESCAT](https://www.idescat.cat/codis/?cin=0&nom=&ambit=a&cic=0&codi=&pobi=&pobf=&id=50&n=22&inf=c&t=01-01-2019)

In [4]:
def changeTownCodeByName(JSON, DICT):
    properties = JSON['features'][0]['properties']
    # Replace the numeric municipal code with the official name
    if 'MUNICIPI' in properties:
        for feature in JSON['features']:
            code = feature['properties']['MUNICIPI']
            feature['properties']['MUNICIPI'] = DICT[code]
    # Rename the accented key 'SECCIÓ' to plain 'SECCIO'
    if 'SECCIÓ' in properties:
        for feature in JSON['features']:
            feature['properties']['SECCIO'] = feature['properties'].pop('SECCIÓ')
    else:
        # Reached when there is no 'SECCIÓ' key; the message reads "There is no municipality code"
        print("No hi ha codi de municipi")
    return JSON
    

#### Transforming the original data
Applying the functions to improve datasets

In [5]:
dataSets = [Catalunya, Girona, Centre]

j = 0
for i in dataSets:
    j +=1
    try:
        i = changeTownCodeByName(i, nomenclator_dict)
    except ValueError:
         print("Got a problem in", j)
    except:
        print("Unexpected error:", sys.exc_info()[0])
        raise

No hi ha codi de municipi
No hi ha codi de municipi


<img src="Imatges/geojson.png"  style="float:right" />
The final GeoJSON dictionaries look like the ones in the figures beside. I have highlighted in yellow the two main keys of each feature: "properties", whose value is a dictionary with general information, and "geometry", whose value is a dictionary that contains the coordinates of the polygons (highlighted in red).
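
For readers without access to the figure, here is a minimal sketch of that structure. The property values and coordinates are made-up placeholders, not data from the ICC or GOD files.

```python
import json

# A minimal, made-up feature: property values and coordinates are placeholders.
feature = {
    "type": "Feature",
    "properties": {"MUNICIPI": "Girona", "DISTRICTE": "3", "SECCIO": "1"},
    "geometry": {
        "type": "Polygon",
        # One outer ring; each vertex is [longitude, latitude]; first vertex == last vertex
        "coordinates": [[[2.820, 41.983], [2.823, 41.983], [2.823, 41.985],
                         [2.820, 41.985], [2.820, 41.983]]],
    },
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Round-trip through text, as happens when saving and reloading a .geojson file
reloaded = json.loads(json.dumps(collection))
ring = reloaded["features"][0]["geometry"]["coordinates"][0]
print(sorted(reloaded["features"][0].keys()))
print(len(ring), ring[0] == ring[-1])
```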

The geojson from the ICC contains the data for all the census sections of Catalonia. We are not going to use this data in this report. But it is a key database to expand the model to the rest of the country.
The geojson containing the coordinates of the census sections of Girona will be used to draw the sections in the map and to make the calls to Foursquare.

The last geojson, containing the coordinates of the census districts of Girona will be used to overlay the district on the map with the sections. It will also be very important to draw statistical information on the map.


### 3.1.2 Foursquare data
Foursquare is a powerful database that can be easily accessed through the APIs provided on its developer page. However, some limitations force us to do extra work. The main one I found is that venue searches return results within a circular area defined by a radius, whereas I want to assign each venue to the census section it belongs to. To achieve that, I take the following steps:
 1. Determine the centroid of each census area. Most of these areas are quite irregular polygons. For this I use the library *scipy.spatial*, which has a tool to calculate the convex hull of a polygon. The centroid is then approximated as the mean of the x coordinates and the mean of the y coordinates of the hull vertices. I used this approximation because it gives better results than the simpler approach of taking the means of the points of the raw polygon (data not shown).
 2. Once the centroid is found, I use the function *distanceBetweenPoints* to determine the longest distance between the centroid and the vertices. This distance is used as the RADIUS parameter in the Foursquare search.
 3. The next step is to retrieve Foursquare data using the latitude and longitude of the centroid of each census section, with the longest centroid-to-vertex distance as the radius. This yields a lot of overlapping information, as the circle of each census section intersects the circles of the neighboring sections.
 4. Once venue data are obtained, venues that do not belong to the census section must be filtered out. To do this, I check whether the venue coordinates are inside the polygon that defines the census section. This can be easily done with the function *polygon.contains* of the [Shapely](https://pypi.org/project/Shapely/) library, which I call inside the local function *ifInDistrict*.
 5. The final step is to convert all the data into a pandas data frame.
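
To make step 4 concrete without any dependency, the filter can be sketched with a plain ray-casting test. This is a stand-in written for illustration, not the way Shapely implements *polygon.contains*; the function name `point_in_polygon`, the L-shaped section and the venues are all hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count how many polygon edges a horizontal ray crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does the edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical L-shaped census section, vertices as (longitude, latitude)
section = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
# Hypothetical venues returned by a circular search around the centroid
venues = {"cafe": (1, 1), "museum": (3, 3), "bar": (1, 3)}

kept = [name for name, (lon, lat) in venues.items()
        if point_in_polygon(lon, lat, section)]
print(kept)
```

The "museum" point lies inside the search circle of such a section but outside its polygon, which is exactly the kind of venue that the filter is meant to discard.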
 

#### Determining the coordinates of the centroid and the max radius of every census district
The following three functions are used to:
* Calculate the distance between two points given their coordinates (distanceBetweenPoints)
* Generate a census code by joining the census district code with the census section code (codiSeccioCensal)
* Calculate the centroid and the maximum radius of a polygon (calcCentroidOfPolygon)

In [6]:
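def distanceBetweenPoints(lat1, lon1, lat2, lon2):
    # Approximate distance in kilometres between two points given in decimal degrees
    km_per_deg_lat = 111.132954
    # math.cos expects radians, so the latitude is converted first
    km_per_deg_lon = 40075 * math.cos(math.radians(lat1)) / 360
    dist = math.sqrt(((lat1 - lat2) * km_per_deg_lat)**2 + ((lon1 - lon2) * km_per_deg_lon)**2)
    return dist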

In [7]:
def codiSeccioCensal(districte, seccio):
    # Two-digit district code followed by a three-digit section code,
    # e.g. district 2 and section 12 give '02012'
    codiDistricte = str(districte).zfill(2)
    codiSeccio = str(seccio).zfill(3)
    return codiDistricte + codiSeccio

In [8]:
# There are different ways to approximate the centroid of a polygon. The simplest one is to take the x and y of the centroid
# as the average of the x and y of the vertices. This gives suboptimal results for complex polygons. An alternative is to
# use a convex hull and then calculate the centroid of the convex hull
def calcCentroidOfPolygon(JSON):
    from scipy.spatial import ConvexHull, convex_hull_plot_2d
    Centroid = []
    for feature in range(0, len(JSON['features'])):
        x = JSON['features'][feature]['geometry']['coordinates'][0]
        points = np.asarray(x, dtype = np.float32)
        hull = ConvexHull(points)
        centroid = [np.mean(points[hull.vertices,1]), np.mean(points[hull.vertices,0])]
        dist = []
        for point in range(0, len(hull.vertices)):
            dist.append(distanceBetweenPoints(centroid[0], centroid[1], points[hull.vertices[point],1], points[hull.vertices[point],0]))
        SeccioCensal = codiSeccioCensal(JSON['features'][feature]['properties']['DISTRICTE'], JSON['features'][feature]['properties']['SECCIO'])
        Centroid.append( [SeccioCensal,centroid[0], centroid[1], round(max(dist) * 1000, 0),x])
    return Centroid
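
The difference between the two centroid approximations can be seen on a small example. The sketch below is a pure-Python stand-in for *scipy.spatial.ConvexHull* (Andrew's monotone chain), written only for illustration; the function names and the sample polygon are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of the cross product of vectors o->a and o->b
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def mean_point(points):
    """Mean of the x coordinates and mean of the y coordinates."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Hypothetical square section whose left edge was digitized with many vertices
square = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 3), (0, 2), (0, 1)]

raw = mean_point(square)                 # pulled toward the densely sampled edge
hull = mean_point(convex_hull(square))   # collinear points are dropped first
print(raw, hull)
```

In this toy case the mean of the raw vertices is shifted toward the edge with more vertices, while the mean of the hull vertices falls at the middle of the square. Both remain approximations of the true area centroid.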
    

In [9]:
Centroid = calcCentroidOfPolygon(Girona)

Centroid is a list of lists. Each of the lists contains the following items:
1. The code of the census section
2. The latitude of the centroid of the census section
3. The longitude of the centroid
4. The maximum radius of the polygon, in meters, taken as the longest distance between the centroid and the vertices.
5. A list with the coordinates of the vertices of the census section
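
The radius in item 4 relies on a flat-earth approximation of distance. As a rough sanity check, and not as part of the original notebook, the sketch below compares an approximation of this kind with the haversine great-circle formula for two nearby sample points in central Girona. The function names and the Earth radius constant are assumptions introduced for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

def equirectangular_km(lat1, lon1, lat2, lon2):
    """Flat-earth approximation, adequate over a few kilometres."""
    km_per_deg_lat = math.pi * EARTH_RADIUS_KM / 180
    # A degree of longitude shrinks with the cosine of the latitude (in radians)
    km_per_deg_lon = km_per_deg_lat * math.cos(math.radians((lat1 + lat2) / 2))
    return math.hypot((lat1 - lat2) * km_per_deg_lat, (lon1 - lon2) * km_per_deg_lon)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two sample points in central Girona, as (latitude, longitude)
centroid = (41.983753, 2.821557)
venue = (41.984567, 2.821547)

approx = equirectangular_km(*centroid, *venue)
exact = haversine_km(*centroid, *venue)
print(round(approx * 1000, 1), round(exact * 1000, 1))  # both in meters
```

At the scale of a census section the two formulas agree to well under a meter, so the simpler one is adequate for choosing a search radius.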

#### Adding data from places using Foursquare database

##### Fetching the venues in census districts
Now we know the centroid of each census section and the largest distance in meters between the centroid and the vertices. Next we are going to:
1. get all venues around each centroid
2. check whether each venue is within the section or not
3. keep the information of the venues that are inside the section

First we define the function to fetch venues around a point. The output is a data frame with the basic information of each venue.

In [59]:
def ifInDistrict(Lon, Lat, Poly):
    point = Point(Lon, Lat)
    polygon = Polygon(Poly)
    return polygon.contains(point)


In [60]:
def getNearbyVenues(Names, Latitudes, Longitudes, Radius, Poly):
    venues_list = []
    # names of the venues already assigned to a section (venues are de-duplicated by name)
    venues = set()
    for name, lat, lng, rad, poly in zip(Names, Latitudes, Longitudes, Radius, Poly):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            rad, 
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the new venues that lie inside the polygon of the census section
        i = 0
        for v in results:
            venue = v['venue']
            if venue['name'] in venues:
                continue
            if ifInDistrict(venue['location']['lng'], venue['location']['lat'], poly):
                i += 1
                venues_list.append([name, lat, lng, venue['name'], venue['location']['lat'], venue['location']['lng'], venue['categories'][0]['name']])
                venues.add(venue['name'])
        if i >= 100:
            print("Beware, in census district", name, "there are more than 100 venues")
        # pause between consecutive requests
        time.sleep(0.5)
    # passing the column names here also works when no venue was found
    nearby_venues = pd.DataFrame(venues_list, columns=['Census section', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'])
    
    return(nearby_venues)
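
As an aside, the query string can also be assembled from a dictionary instead of a long format string, which makes each parameter easier to read. The sketch below only builds and inspects the URL and sends no request; the helper name `build_explore_url` and the credentials are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the query URL from a parameter dictionary."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": int(radius),   # whole meters
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials; no request is sent in this sketch
url = build_explore_url("MY_ID", "MY_SECRET", "20200330", 41.983753, 2.821557, 250.0, 100)
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"])
```

`requests.get` accepts the same dictionary through its `params` argument and performs the encoding itself.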

Now, we loop through the census districts to get the venues in them

In [61]:
CLIENT_ID = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' # your Foursquare ID
CLIENT_SECRET = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY' # your Foursquare Secret
VERSION = '20200330' # Foursquare API version
LIMIT = 100
# Unpack the Centroid list into one list per field
Nam = [x[0] for x in Centroid]
Lat = [x[1] for x in Centroid]
Lon = [x[2] for x in Centroid]
Rad = [x[3] for x in Centroid]
Pol = [x[4] for x in Centroid]
GironaVenues = getNearbyVenues(Nam, Lat, Lon, Rad, Pol)


In [69]:
GironaVenues.head(10)


Unnamed: 0,Census section,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Census district
0,"""02012""",41.975842,2.81903,Sweets by Abraham Balaguer,41.976007,2.819248,Dessert Shop,2
1,"""02012""",41.975842,2.81903,Cafeteria Tornés,41.975807,2.818575,Café,2
2,"""03001""",41.983753,2.821557,Umai,41.983094,2.821719,Japanese Restaurant,3
3,"""03001""",41.983753,2.821557,Museu del Cinema,41.983756,2.822213,Museum,3
4,"""03001""",41.983753,2.821557,Llibreria 22,41.984379,2.822301,Bookstore,3
5,"""03001""",41.983753,2.821557,Plaça de Josep Pla,41.983295,2.821843,Plaza,3
6,"""03001""",41.983753,2.821557,Plaça de la Constitució,41.98375,2.821248,Plaza,3
7,"""03001""",41.983753,2.821557,Bionèctar,41.984095,2.821569,Vegetarian / Vegan Restaurant,3
8,"""03001""",41.983753,2.821557,Restaurant Gran Muralla,41.984567,2.821547,Chinese Restaurant,3
9,"""03001""",41.983753,2.821557,Apple Cafè Girona,41.982338,2.82244,Sushi Restaurant,3


In [72]:
today = str(date.today())
filename = "C:/Users/user/Documents/Els_meus_documents/projectes/CompetitiveIntelligence/Data/Foursquare/" + "GironaVenues" + today + ".csv"
GironaVenues.to_csv(filename, index=False)

In [13]:
filename = "C:/Users/user/Documents/Els_meus_documents/projectes/CompetitiveIntelligence/Data/Foursquare/GironaVenues2020-04-26.csv"
GironaVenues = pd.read_csv(filename)  
GironaVenues['Census section']= GironaVenues['Census section'].str.slice(1, 6, 1)
csv_file = "C:/Users/user/Documents/Els_meus_documents/projectes/CompetitiveIntelligence/Data/seccioCensalvsBarriGirona.csv"
from csv import reader
districtes = {}
with open(csv_file, 'r') as f:
    csv_reader = reader(f)
    # Iterate over each row in the csv using reader object
    for row in csv_reader:
        districtes['0' + row[0]] = row[1]

GironaVenues['Neighborhood'] = [districtes[x] for x in GironaVenues['Census section']]
GironaVenues.head()

Unnamed: 0,Census section,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Neighborhood
0,2012,41.975842,2.81903,Sweets by Abraham Balaguer,41.976007,2.819248,Dessert Shop,Eixample
1,2012,41.975842,2.81903,Cafeteria Tornés,41.975807,2.818575,Café,Eixample
2,3001,41.983753,2.821557,Umai,41.983094,2.821719,Japanese Restaurant,Centre
3,3001,41.983753,2.821557,Museu del Cinema,41.983756,2.822213,Museum,Centre
4,3001,41.983753,2.821557,Llibreria 22,41.984379,2.822301,Bookstore,Centre


### 3.1.3 Clustering analysis
#### 3.1.3.1 Clustering by Venues
Here we prepare the Foursquare data for a clustering analysis by neighborhood (census district).

#### Analyze each neighborhood

In [14]:
# one hot encoding
Girona_onehot = pd.get_dummies(GironaVenues[['Venue Category']], prefix="", prefix_sep="")

# drop the dummy column produced by a venue category that is itself called 'Neighborhood' (if present),
# then add the Neighborhood column back to the dataframe
Girona_onehot.drop(labels=['Neighborhood'], axis=1, inplace=True, errors='ignore')
Girona_onehot.insert(0, 'Neighborhood', GironaVenues['Neighborhood'])

Girona_onehot.head()

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Argentinian Restaurant,Art Museum,Arts & Entertainment,Asian Restaurant,Athletics & Sports,Auditorium,Auto Dealership,...,Tennis Court,Theater,Toy / Game Store,Train Station,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Store,Volleyball Court,Wine Bar,Wine Shop
0,Eixample,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Eixample,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Centre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Centre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Centre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Grouping rows by census district and taking the mean of the frequency of occurrence of each category

In [15]:
Girona_grouped = Girona_onehot.groupby('Neighborhood').mean().reset_index()
Girona_grouped

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Argentinian Restaurant,Art Museum,Arts & Entertainment,Asian Restaurant,Athletics & Sports,Auditorium,Auto Dealership,...,Tennis Court,Theater,Toy / Game Store,Train Station,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Store,Volleyball Court,Wine Bar,Wine Shop
0,Centre,0.010417,0.0,0.010417,0.020833,0.0,0.0,0.0,0.0,0.0,...,0.0,0.020833,0.010417,0.0,0.010417,0.010417,0.0,0.0,0.03125,0.0
1,Eixample,0.0,0.010101,0.0,0.0,0.0,0.020202,0.010101,0.010101,0.0,...,0.0,0.0,0.0,0.020202,0.0,0.020202,0.0,0.0,0.0,0.010101
2,Est,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Mas Xirgu,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Montjuïc,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Nord,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125
6,Oest,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Santa Eugènia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.055556,0.0,0.0
8,Sud,0.032258,0.0,0.0,0.0,0.0,0.0,0.064516,0.0,0.0,...,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
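
To see why the group means can be read as category frequencies, here is a minimal sketch on made-up venues. The data frame below is hypothetical and much smaller than `GironaVenues`.

```python
import pandas as pd

# Made-up venues: three in one neighborhood, two in another
venues = pd.DataFrame({
    "Neighborhood": ["Centre", "Centre", "Centre", "Eixample", "Eixample"],
    "Venue Category": ["Café", "Café", "Museum", "Café", "Park"],
})

# One column per category, with 1 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column within a group is the share of venues in that category
frequencies = onehot.groupby("Neighborhood").mean()
print(frequencies.round(2))
```

Each row of the result sums to 1, so neighborhoods with very different numbers of venues become comparable.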


#### Grouping the 10 top venues in each neighborhood

In [16]:
# first we write a function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
#Now, we create a new data frame
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions after the third take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
districts_venues_sorted = pd.DataFrame(columns=columns)
districts_venues_sorted['Neighborhood'] = Girona_grouped['Neighborhood']

for ind in np.arange(Girona_grouped.shape[0]):
    districts_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Girona_grouped.iloc[ind, :], num_top_venues)

districts_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Centre,Restaurant,Plaza,Mediterranean Restaurant,Café,Spanish Restaurant,Hotel,Tapas Restaurant,Gastropub,Wine Bar,Ice Cream Shop
1,Eixample,Café,Japanese Restaurant,Hotel,Bar,Spanish Restaurant,Diner,Pizza Place,Park,Plaza,Pub
2,Est,Historic Site,Garden,Mediterranean Restaurant,Scenic Lookout,Stables,Hotel,Basketball Stadium,Jazz Club,Soccer Stadium,Park
3,Mas Xirgu,Grocery Store,Restaurant,Auto Dealership,Chinese Restaurant,Café,Shopping Mall,Sporting Goods Shop,Supermarket,Electronics Store,Food & Drink Shop
4,Montjuïc,Garden,Italian Restaurant,Snack Place,Soccer Field,Wine Shop,Diner,Coffee Shop,College Cafeteria,Concert Hall,Creperie
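
The ranking step can be illustrated on a tiny frequency table. The values and the helper name `top_categories` are made up for this sketch.

```python
import pandas as pd

# Made-up category frequencies for two neighborhoods (no ties within a row)
freq = pd.DataFrame(
    {"Café": [0.5, 0.1], "Museum": [0.3, 0.2], "Park": [0.2, 0.7]},
    index=["Centre", "Est"],
)

def top_categories(row, n):
    """Return the names of the n largest values in a row, largest first."""
    return row.sort_values(ascending=False).index[:n].tolist()

top2 = {name: top_categories(row, 2) for name, row in freq.iterrows()}
print(top2)
```

One caveat when reading such rankings: categories with equal frequency, including zero, have no meaningful order among themselves, so the last positions of a top ten in a neighborhood with few venues should be interpreted with care.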


#### 3.1.3.2 Clustering by statistical data

In [122]:
Path = 'C:/Users/user/Documents/Els_meus_documents/projectes/CompetitiveIntelligence/Data/Estadistica/Girona/'
files = os.listdir(Path)

categoriesInst = pd.read_csv(Path + files[0]).iloc[1:10, 0]
categoriesMembr = pd.read_csv(Path + files[10]).iloc[1:8, 0]
instruccio = pd.DataFrame(columns = categoriesInst)
membresLlar = pd.DataFrame(columns = categoriesMembr)

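for name in files:
    # The neighborhood is the third underscore-separated token of the file name, without the extension
    barri = name.split('_')[2].split('.')[0]
    # A hyphen in the file name stands for a space in the neighborhood name
    if "-" in name.split('_')[2]:
        barri = barri.split('-')[0] + " " + barri.split('-')[1]
    if 'instruccio' in name.split('_')[0]:
        # education level table
        inst = pd.read_csv(Path + name)
        instruccio.loc[barri] = inst.iloc[1:10,2].tolist()
    else:
        # household size table
        membr = pd.read_csv(Path + name)
        membresLlar.loc[barri] = membr.iloc[1:8,2].tolist()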
membresLlar = membresLlar.reset_index()
membresLlar.rename(columns = {'index' : 'Neighborhood'}, inplace = True)
instruccio = instruccio.reset_index()
instruccio.rename(columns = {'index' : 'Neighborhood'}, inplace = True)
instruccio

Unnamed: 0,Neighborhood,No aplicable per ser menor de 16 anys,Ni llegir ni escriure,Sense estudis o educació primària incompleta,EGB. ESO. FP1 o equivalent,BUP. Batxillerat. FP2 o equivalent,Estudis universitaris de grau mig,Estudis universitaris de grau superior,Estudis superiors no universitaris,Doctorats i post-graus
0,Centre,15.13,0.43,9.44,21.96,24.7,5.82,16.48,0.86,5.18
1,Eixample,19.13,0.33,9.48,23.13,24.75,5.52,13.28,0.5,3.88
2,Est,26.25,1.75,30.23,27.96,8.26,1.22,3.17,0.16,1.0
3,Mas Xirgu,10.53,0.0,26.32,26.32,15.79,0.0,15.79,0.0,5.26
4,Montjuïc,21.27,0.21,5.13,16.85,28.35,7.01,15.79,0.25,5.13
5,Nord,20.04,1.22,21.31,30.78,16.27,2.87,5.82,0.4,1.27
6,Oest,20.52,1.07,18.46,30.55,17.91,3.31,6.51,0.21,1.46
7,Santa Eugènia,21.23,1.08,15.21,33.26,19.77,2.78,5.28,0.26,1.12
8,Sud,18.05,0.39,10.61,20.4,25.1,5.75,14.4,0.43,4.88


# 4 ANALYSIS AND RESULTS <a name="results"></a>

## 4.1 Clustering analysis
In this section we are going to compare the clustering results using different clustering criteria. I'm going to cluster by:
1. Venue category
2. Members in households
3. Level of education

### 4.1.1 Clustering by venue category
The first clustering analysis is by venue category.

In [91]:
districts_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Centre,Restaurant,Plaza,Mediterranean Restaurant,Café,Spanish Restaurant,Hotel,Tapas Restaurant,Gastropub,Wine Bar,Ice Cream Shop
1,Eixample,Café,Japanese Restaurant,Hotel,Bar,Spanish Restaurant,Diner,Pizza Place,Park,Plaza,Pub
2,Est,Historic Site,Garden,Mediterranean Restaurant,Scenic Lookout,Stables,Hotel,Basketball Stadium,Jazz Club,Soccer Stadium,Park
3,Mas Xirgu,Grocery Store,Restaurant,Auto Dealership,Chinese Restaurant,Café,Shopping Mall,Sporting Goods Shop,Supermarket,Electronics Store,Food & Drink Shop
4,Montjuïc,Garden,Italian Restaurant,Snack Place,Soccer Field,Wine Shop,Diner,Coffee Shop,College Cafeteria,Concert Hall,Creperie
5,Nord,Restaurant,Arts & Entertainment,Soccer Field,Deli / Bodega,Wine Shop,Gym,Grocery Store,Cocktail Bar,Coffee Shop,College Cafeteria
6,Oest,Hotel,Sandwich Place,Casino,Sporting Goods Shop,Plaza,Pool,Café,Burger Joint,Multiplex,Mediterranean Restaurant
7,Santa Eugènia,Restaurant,Park,Café,Bakery,Basketball Court,Mediterranean Restaurant,Pet Store,Cafeteria,Dance Studio,Soccer Field
8,Sud,Mediterranean Restaurant,Outdoors & Recreation,Athletics & Sports,Restaurant,Sporting Goods Shop,Bar,Supermarket,Soccer Field,Electronics Store,Soccer Stadium


Next I tune the clustering parameter *kclusters*. The method is rather manual: it consists of checking the clustering results for *kclusters* values from 2 to 7. Since there are only 9 neighborhoods, I assume that values above 7 are meaningless.

In [186]:
from sklearn.cluster import KMeans, SpectralClustering  # clustering algorithms
from sklearn.cluster import AgglomerativeClustering as agglo

# feature matrix: one row per neighborhood, without the name column
Girona_grouped_clustering = Girona_grouped.drop(columns='Neighborhood')
# the first element of each list is its name; the labels for each k follow
venues_kmeansList = ['Venues_kmeans']
venues_spectralList = ['Venues_spectral']
venues_agglomerativeList = ['Venues_agglomerative']
for kclusters in range(2,8):
    # fit the three algorithms with the same number of clusters
    kmeans = KMeans(n_clusters = kclusters, random_state=0).fit(Girona_grouped_clustering)
    spectral = SpectralClustering(n_clusters = kclusters, eigen_solver = 'lobpcg').fit(Girona_grouped_clustering)
    agglomerative = agglo(n_clusters = kclusters).fit(Girona_grouped_clustering)

    venues_kmeansList.append(kmeans.labels_)
    venues_spectralList.append(spectral.labels_)
    venues_agglomerativeList.append(agglomerative.labels_)
    
print("\t\t\tK-Means\t\t\tSpectral Clustering\t\tAgglomerative Clustering\nN clusters\t kluster labels \t\tkluster labels\t\t\tkluster labels")
for i in range(1, len(venues_kmeansList)):
    print(i + 1,  "\t\t", venues_kmeansList[i], "\t\t", venues_spectralList[i], "\t\t", venues_agglomerativeList[i])
    

			K-Means			Spectral Clustering		Agglomerative Clustering
N clusters	 kluster labels 		kluster labels			kluster labels
2 		 [1 1 1 1 0 1 1 1 1] 		 [0 0 1 0 1 0 0 0 0] 		 [0 0 0 0 1 0 0 0 0]
3 		 [1 1 1 1 0 2 1 1 1] 		 [0 0 0 0 2 1 0 0 0] 		 [0 0 0 0 1 2 0 0 0]
4 		 [1 1 1 0 3 2 1 1 1] 		 [0 0 0 3 2 1 0 0 0] 		 [0 0 0 3 1 2 0 0 0]
5 		 [2 2 4 0 1 3 2 2 2] 		 [1 1 0 4 2 3 1 1 0] 		 [0 0 4 3 1 2 0 0 0]
6 		 [2 2 4 0 1 3 2 5 2] 		 [2 2 0 5 3 1 2 4 0] 		 [0 0 4 3 1 5 0 2 0]
7 		 [4 4 5 0 3 2 4 6 1] 		 [1 1 4 3 0 5 1 6 2] 		 [0 0 4 3 1 5 0 6 2]
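
The integer labels in the table above are arbitrary: each algorithm numbers its clusters in its own way, so two rows can describe the same grouping with different numbers. As a minimal sketch (the helper `canonical_labels` is an assumption of this example, not part of the notebook), relabeling the clusters by order of first appearance makes the groupings directly comparable:

```python
def canonical_labels(labels):
    """Relabel clusters by order of first appearance (hypothetical helper)."""
    mapping = {}
    result = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        result.append(mapping[label])
    return result

# k = 3 rows copied from the printed table above
kmeans_k3 = [1, 1, 1, 1, 0, 2, 1, 1, 1]
spectral_k3 = [0, 0, 0, 0, 2, 1, 0, 0, 0]
agglomerative_k3 = [0, 0, 0, 0, 1, 2, 0, 0, 0]

same_partition = (
    canonical_labels(kmeans_k3)
    == canonical_labels(spectral_k3)
    == canonical_labels(agglomerative_k3)
)
print(same_partition)
```

For *k* = 3 the three algorithms produce the same grouping: Montjuïc and Nord each form their own cluster, and the other seven neighborhoods share one.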


In [72]:
barris_venues = pd.DataFrame()
barris_venues['BARRIS'] = membresLlar['Neighborhood']
barris_venues['Code'] = cluster_final_venues
barris_venues

Unnamed: 0,BARRIS,Code
0,Centre,0
1,Eixample,0
2,Est,2
3,Mas Xirgu,4
4,Montjuïc,3
5,Nord,1
6,Oest,0
7,Santa Eugènia,0
8,Sud,2


#### 4.1.2 Clustering by statistical parameters
In this section we run the clustering analysis on two parameters: the level of education of the people in each neighborhood and the number of people living in each household. The data frames below contain the data for the analysis.

There are 9 levels of education, ranging from the category of people under 16, who are supposed to be in compulsory education, to the doctorate level.

The number of people per household ranges from 1 to 7 or more.

In [95]:
# instruccio = instruccio.div(100.0)
instruccio

Unnamed: 0,Neighborhood,No aplicable per ser menor de 16 anys,Ni llegir ni escriure,Sense estudis o educació primària incompleta,EGB. ESO. FP1 o equivalent,BUP. Batxillerat. FP2 o equivalent,Estudis universitaris de grau mig,Estudis universitaris de grau superior,Estudis superiors no universitaris,Doctorats i post-graus
0,Centre,15.13,0.43,9.44,21.96,24.7,5.82,16.48,0.86,5.18
1,Eixample,19.13,0.33,9.48,23.13,24.75,5.52,13.28,0.5,3.88
2,Est,26.25,1.75,30.23,27.96,8.26,1.22,3.17,0.16,1.0
3,Mas Xirgu,10.53,0.0,26.32,26.32,15.79,0.0,15.79,0.0,5.26
4,Montjuïc,21.27,0.21,5.13,16.85,28.35,7.01,15.79,0.25,5.13
5,Nord,20.04,1.22,21.31,30.78,16.27,2.87,5.82,0.4,1.27
6,Oest,20.52,1.07,18.46,30.55,17.91,3.31,6.51,0.21,1.46
7,Santa Eugènia,21.23,1.08,15.21,33.26,19.77,2.78,5.28,0.26,1.12
8,Sud,18.05,0.39,10.61,20.4,25.1,5.75,14.4,0.43,4.88


In [154]:
membresLlar

Unnamed: 0,Neighborhood,1,2,3,4,5,6,7 i més
0,Centre,38.54,27.23,15.84,11.28,4.17,1.5,1.42
1,Eixample,29.19,26.46,18.59,15.88,5.48,2.19,2.22
2,Est,22.69,22.5,18.25,18.51,10.75,4.25,3.05
3,Mas Xirgu,44.44,33.33,0.0,11.11,11.11,0.0,0.0
4,Montjuïc,15.07,23.74,27.76,24.87,6.4,1.86,0.31
5,Nord,30.4,24.5,17.32,15.1,6.31,3.29,3.09
6,Oest,22.91,26.98,20.87,20.11,5.71,1.73,1.7
7,Santa Eugènia,25.09,24.64,17.99,16.18,7.6,3.83,4.67
8,Sud,23.09,24.61,21.87,20.32,6.78,1.77,1.55


In [230]:
# feature matrices without the name column; the percentages are divided by 10
instruccio_clustering = instruccio.drop(columns='Neighborhood').astype(float).div(10)
membresLlar_clustering = membresLlar.drop(columns='Neighborhood').astype(float).div(10)

edu_KmeansList = ['edu_Kmeans']
edu_SpectralList = ['edu_Spectral']
edu_AgglomerativeList = ['edu_Agglomerative']
house_KmeansList = ['house_Kmeans']
house_SpectralList = ['house_Spectral']
house_AgglomerativeList = ['house_Agglomerative']
    
for kclusters in range(2,8):
    # fit the three algorithms on both data sets
    kmeans_inst = KMeans(n_clusters=kclusters, random_state=0).fit(instruccio_clustering)
    spectral_inst = SpectralClustering(n_clusters = kclusters).fit(instruccio_clustering)
    agglomerative_inst = agglo(n_clusters = kclusters).fit(instruccio_clustering)
    kmeans_llar = KMeans(n_clusters=kclusters, random_state=0).fit(membresLlar_clustering)
    spectral_llar = SpectralClustering(n_clusters = kclusters).fit(membresLlar_clustering)
    agglomerative_llar = agglo(n_clusters = kclusters).fit(membresLlar_clustering)
        
    edu_KmeansList.append(kmeans_inst.labels_)
    edu_SpectralList.append(spectral_inst.labels_)
    edu_AgglomerativeList.append(agglomerative_inst.labels_)
    house_KmeansList.append(kmeans_llar.labels_)
    house_SpectralList.append(spectral_llar.labels_)
    house_AgglomerativeList.append(agglomerative_llar.labels_)

print("\t\t\t\t\tEducation Level\n\n\t\t\tK-Means\t\tSpectral Clustering\tAgglomerative Clust\nN clusters\t kluster labels \t  kluster labels\t  kluster labels")
for i in range(1, len(edu_SpectralList)):
    print(i + 1,  "\t\t", edu_KmeansList[i], "\t", edu_SpectralList[i], "\t", edu_AgglomerativeList[i])
print("\n\n\t\t\t\t\tPeople in household\n\n\t\t\tK-Means\t\tSpectral Clustering\tAgglomerative Clust\nN clusters\t kluster labels \t  kluster labels\t  kluster labels")
for i in range(1, len(edu_SpectralList)):
    print(i + 1,  "\t\t", house_KmeansList[i], "\t", house_SpectralList[i], "\t", house_AgglomerativeList[i])





					Education Level

			K-Means		Spectral Clustering	Agglomerative Clust
N clusters	 kluster labels 	  kluster labels	  kluster labels
2 		 [1 1 0 0 1 0 0 0 1] 	 [0 0 1 1 0 1 1 1 0] 	 [1 1 0 0 1 0 0 0 1]
3 		 [1 1 2 0 1 2 2 2 1] 	 [0 0 1 1 0 2 2 2 0] 	 [1 1 0 2 1 0 0 0 1]
4 		 [1 1 3 0 1 2 2 2 1] 	 [3 3 2 1 3 0 0 0 3] 	 [0 0 3 2 0 1 1 1 0]
5 		 [1 1 4 0 3 2 2 2 1] 	 [0 0 2 1 4 3 3 3 0] 	 [1 1 3 2 4 0 0 0 1]
6 		 [1 1 3 0 4 2 2 5 1] 	 [5 0 1 2 4 3 3 3 5] 	 [0 0 3 5 4 1 1 2 0]
7 		 [4 1 3 0 5 2 2 6 1] 	 [5 1 0 2 4 3 3 3 6] 	 [6 0 3 5 4 1 1 2 0]


					People in household

			K-Means		Spectral Clustering	Agglomerative Clust
N clusters	 kluster labels 	  kluster labels	  kluster labels
2 		 [1 0 0 1 0 0 0 0 0] 	 [0 0 0 1 0 0 0 0 0] 	 [0 0 0 1 0 0 0 0 0]
3 		 [2 2 1 0 1 2 1 1 1] 	 [0 0 0 1 2 0 0 0 0] 	 [2 2 0 1 0 2 0 0 0]
4 		 [0 0 3 1 2 0 3 3 3] 	 [3 0 0 1 2 0 0 0 0] 	 [0 0 2 3 1 0 2 2 2]
5 		 [2 3 1 0 4 3 1 3 1] 	 [0 4 2 1 3 2 4 2 4] 	 [4 2 0 3 1 2 0 0 0]
6 		 [2 3 5 0 4 3 1 5 1] 	 [4 0 



#### 4.1.3 Determination of optimal clustering parameters
One of the limitations of certain clustering techniques is the lack of a formal method to optimize the clustering parameters, namely the number of clusters *k*.

The elbow technique, which plots the within-cluster sum of squares against *k*, is commonly used. Because the number of samples in our project is low, we use another criterion: choose the *k* value that yields the maximum number of clusters containing more than one sample. Since several *k* values may meet this first criterion, the second criterion is to take the lowest of them with *k* > 2.
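
As a minimal sketch of this criterion (the helpers `multi_member_clusters` and `pick_k` are names assumed for illustration; the label vectors are the Spectral Clustering rows for education level printed in section 4.1.2):

```python
from collections import Counter

def multi_member_clusters(labels):
    """Number of clusters that contain more than one sample."""
    return sum(1 for size in Counter(labels).values() if size > 1)

def pick_k(labels_by_k):
    """Lowest k > 2 reaching the maximum count of multi-member clusters.

    The maximum is taken over all k, including k = 2, so the result is
    None when only k = 2 reaches it.
    """
    scores = {k: multi_member_clusters(labels) for k, labels in labels_by_k.items()}
    best = max(scores.values())
    candidates = [k for k, score in scores.items() if score == best and k > 2]
    return min(candidates) if candidates else None

# Spectral Clustering labels for education level, k = 2..5
edu_spectral = {
    2: [0, 0, 1, 1, 0, 1, 1, 1, 0],
    3: [0, 0, 1, 1, 0, 2, 2, 2, 0],
    4: [3, 3, 2, 1, 3, 0, 0, 0, 3],
    5: [0, 0, 2, 1, 4, 3, 3, 3, 0],
}
print(pick_k(edu_spectral))
```

With these rows the sketch selects *k* = 3, the only value for which three clusters contain more than one neighborhood.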

In [231]:
results = [venues_kmeansList,venues_spectralList, venues_agglomerativeList, edu_KmeansList, edu_SpectralList, edu_AgglomerativeList, house_KmeansList, house_SpectralList, house_AgglomerativeList]
optimization = {}
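for List in results:
    i = 0
    countsList = []
    for ks in List:
        countsDict = {}
        if i == 0:
            # the first element of each list is its name
            name = ks
        else:
            # count how many neighborhoods fall under each cluster label
            for element in ks:
                countsDict[element] = countsDict.get(element, 0) + 1
            countsMax = 0  # clusters with more than one neighborhood
            countsOne = 0  # clusters with a single neighborhood
            for keys in countsDict:
                if countsDict[keys] == 1:
                    countsOne += 1
                else:
                    countsMax += 1
            # position i holds the labels for k = i + 1
            countsList.append([i, countsMax, countsOne])
        i += 1

    # keep the entries with k > 2 (i > 1) that reach the maximum number of
    # multi-neighborhood clusters; the first entry has the lowest k
    best = max(a[1] for a in countsList)
    optimization[name] = [x for x in countsList if x[1] == best and x[0] > 1]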
    
venues = []
education = []
household = []
maxVenues = 0
maxEducation = 0
maxHousehold = 0
for keys in optimization:
    if 'Venues' in keys:
        if optimization[keys][0][1] > maxVenues:
            maxVenues = optimization[keys][0][1]
            venues = [keys, optimization[keys][0][0]]
    if 'edu' in keys:
        if optimization[keys][0][1] > maxEducation:
            maxEducation = optimization[keys][0][1]
            education = [keys,optimization[keys][0][0]]
    if 'house' in keys:
        if optimization[keys][0][1] > maxHousehold:
            maxHousehold = optimization[keys][0][1]
            household = [keys, optimization[keys][0][0]]

print("Optimal method\tOptimal k value\n")    
print(venues[0], "\t", venues[1]+1, "\n", education[0], "\t\t", education[1]+1,"\n", household[0], "\t\t", household[1]+1)

Optimal method	Optimal k value

Venues_spectral 	 5 
 edu_Spectral 		 3 
 house_Kmeans 		 6


In [223]:
barris_venues = pd.DataFrame()
barris_venues['BARRIS'] = membresLlar['Neighborhood']
barris_venues['Code'] = SpectralClustering(n_clusters = venues[1]+1, eigen_solver = 'lobpcg').fit(Girona_grouped_clustering).labels_
barris_instruction = pd.DataFrame()
barris_instruction['BARRIS'] = membresLlar['Neighborhood']
barris_instruction['Code'] = SpectralClustering(n_clusters = education[1]+1).fit(instruccio_clustering).labels_
barris_household = pd.DataFrame()
barris_household['BARRIS'] = membresLlar['Neighborhood']
barris_household['Code'] = KMeans(n_clusters = household[1]+1, random_state=0).fit(membresLlar_clustering).labels_



In [222]:
barris_venues

Unnamed: 0,BARRIS,Code
0,Centre,2
1,Eixample,2
2,Est,0
3,Mas Xirgu,1
4,Montjuïc,4
5,Nord,3
6,Oest,2
7,Santa Eugènia,2
8,Sud,0


In [27]:
kmeans_final_inst
for x in range(0, len(kmeans_final_llar)):
#     print(kmeans_final_llar[x], "\t")
    if kmeans_final_llar[x] == 2:
        kmeans_final_llar[x] = 1
    elif kmeans_final_llar[x] == 1:
        kmeans_final_llar[x] = 2
#     print(kmeans_final_llar[x], "\n")

### 4.2 Visualization of the results
First we plot the basic geographical information on a map. In the following map, lines delimit the census sections and colored areas mark the census districts.

In [29]:
# collect the neighborhood name of every GeoJSON feature
Barris = [feature['properties']['BARRIS'] for feature in Centre['features']]
barris = pd.DataFrame(Barris, columns = ['BARRIS'])
# one color code per neighborhood
barris['Code'] =  range(0, len(Barris))
barris.loc[barris['BARRIS'] == 'Eixample', ['Code']] = 3
barris.loc[barris['BARRIS'] == 'Mas Xirgu', ['Code']] = 6

In [30]:
def drawBasicMap(JSON1, JSON2, barris):
    lat = 41.9802474
    long = 2.8236477
    gir_map = folium.Map(location=[lat,long], zoom_start=13)

    # for i in range(0,len(labels)):
    #     folium.Marker([Labels[i][1], Labels[i][2]],
    #                  popup=Labels[i][0]).add_to(gir_map)

    folium.Choropleth(
        geo_data = JSON1,
        fill_opacity = 0.005, 
        line_opacity = 1
    ).add_to(gir_map)

    choropleth = folium.Choropleth(
        geo_data = JSON2,
        data = barris,
        columns = ['BARRIS', 'Code'],
        key_on = 'feature.properties.BARRIS',
        fill_color = 'Dark2', 
        fill_opacity = 0.5, 
        line_opacity = 0
    ).add_to(gir_map)

    # add labels indicating the name of the neighborhood
    style_function = "font-size: 15px; font-weight: bold"
    choropleth.geojson.add_child(
        folium.features.GeoJsonTooltip(['BARRIS'], style=style_function, labels=False))

    return gir_map
def drawMap(JSON, barris):
    lat = 41.9802474
    long = 2.8236477
    gir_map = folium.Map(location=[lat,long], zoom_start=13)

    choropleth = folium.Choropleth(
        geo_data = JSON,
        data = barris,
        columns = ['BARRIS', 'Code'],
        key_on = 'feature.properties.BARRIS',
        fill_color = 'Dark2', 
        fill_opacity = 0.5, 
        line_opacity = 1
    ).add_to(gir_map)

    # add labels indicating the name of the neighborhood
    style_function = "font-size: 15px; font-weight: bold"
    choropleth.geojson.add_child(
        folium.features.GeoJsonTooltip(['BARRIS'], style=style_function, labels=False))

    return gir_map

Next we draw the map of Girona with the census sections delimited by lines and the census districts, or neighborhoods, shown as colored areas. Remember that the census sections were used as the unit to retrieve Foursquare information, but the rest of the analysis is done at the neighborhood level.

In [1]:
drawBasicMap(Girona, Centre, barris)

In [2]:
drawMap(Centre, barris_venues)

In [3]:
drawMap(Centre, barris_instruction)

In [4]:
drawMap(Centre, barris_household)

Next we plot the four maps in a combined image to better compare them
<img src="Imatges/FinalResults.png" width="1200" />

After a marathon of data processing and parameter optimization, we are in the final 100 m before the finish line, where we uncover the results of clustering the neighborhoods.

The figure above contains four maps of Girona with different information. The map at the top left is the basic map, showing the overlap between census sections and neighborhoods. Apart from some small mismatches, the neighborhoods are groups of census sections equivalent to census districts.

The top right map shows the result of clustering the neighborhoods by their profile of venue categories. There is a main cluster of four neighborhoods, a second cluster with two big neighborhoods, and three neighborhoods that each form a separate cluster.

For education level there is also one big cluster containing 4 neighborhoods, but its composition is not the same as that of the big venue-category cluster. Then there is a cluster with three neighborhoods and, finally, a cluster with two neighborhoods.

For the number of people in households, the map is quite diverse, with three clusters of two neighborhoods each and three neighborhoods that cannot be clustered.
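
One way to put a number on how much two of these maps agree is to count the pairs of neighborhoods that are grouped together in both. The sketch below is not part of the original analysis: `co_clustered_pairs` is a hypothetical helper, the venue labels are copied from the last `barris_venues` table in section 4.1.3, and the education labels are the Spectral Clustering row for *k* = 3.

```python
from itertools import combinations

neighborhoods = ["Centre", "Eixample", "Est", "Mas Xirgu", "Montjuïc",
                 "Nord", "Oest", "Santa Eugènia", "Sud"]
venue_labels = [2, 2, 0, 1, 4, 3, 2, 2, 0]
education_labels = [0, 0, 1, 1, 0, 2, 2, 2, 0]

def co_clustered_pairs(labels):
    """Set of index pairs (i, j) that share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

venue_pairs = co_clustered_pairs(venue_labels)
education_pairs = co_clustered_pairs(education_labels)
shared = sorted(venue_pairs & education_pairs)

# pairs of neighborhoods grouped together under both criteria
for i, j in shared:
    print(neighborhoods[i], "-", neighborhoods[j])
print(len(venue_pairs), len(education_pairs), len(shared))
```

Only two of the 7 pairs grouped by venues are also grouped by education level (Centre with Eixample, and Oest with Santa Eugènia), which is consistent with the two big clusters having different compositions.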


# 5 Discussion <a name="discussion"></a>
### Neighborhoods by venues
The total number of venues recorded in the Foursquare database for Girona is fewer than 300, far below the actual number of venues. Girona is a very dynamic town and the commercial center of a rather large geographical area. It concentrates many governmental offices and hosts a university that has had a great impact on the city over the last 30 years. Although underrepresented, the Foursquare venues capture the distinct profile of the different neighborhoods. There is a large cluster containing four neighborhoods: Centre, Eixample, Santa Eugènia and Oest. These four neighborhoods constitute what we can call the center of the city. Although their socioeconomic profiles are not necessarily similar, their streets hold a complex mixture of restaurants, supermarkets, groceries, hairdressers, etc. All four are places with everything you need to live and, at the same time, most of them have locations where people from other neighborhoods may occasionally go shopping.

The next cluster is formed by the neighborhoods Est and Sud. They are very different in terms of socioeconomic landscape: while Est is a low-income neighborhood, Sud has areas with very high-income residents. However, their venue profiles are similar, characterized by almost no retailers or personal services and by the presence of some municipal facilities.

Finally, we have three clusters with only one neighborhood each. Montjuïc is a residential area with some gardens. Nord is a mixture of working-class immigrant and working-class native population, located somewhat far from the city center. It is by no means a commercial area, but some specialized businesses can be found there. The last one, Mas Xirgu, is a special case: it stands alone in almost every clustering because it is basically an industrial area with specialized businesses such as car dealers, garages, pet shops and clinics, and industrial suppliers.
### Neighborhoods by education level
When looking at Girona by its venue profile, the main feature was a large center comprising four neighborhoods. When we look at the education level, the city is split into three clear areas, and the center is split too. Now Montjuïc joins the Centre, Eixample and Sud neighborhoods. These are the places where the people with the highest education levels live. Next, with middle education levels, we find the western crown of the city: Santa Eugènia, Oest and Nord. Finally, Est and Mas Xirgu cluster together even though their profiles are not exactly the same. Both have a large share of people without primary studies or with primary studies only, which is probably what makes them cluster together. But Mas Xirgu has less than half the share of people under 16 found in Est (10.5% versus 26.3%). Mas Xirgu also has almost 16% of people with university studies, while in Est it is barely 3%.
### Neighborhoods by number of people in households
Clustering by the number of people in households is often a good way to reveal socioeconomic features of neighborhoods and family structures, if any. Clustering Girona by the number of people per household yields a rather fragmented map, with three clusters of two neighborhoods each and three neighborhoods that stand alone. First of all, Centre is a special case: it has experienced gentrification due to pressure from tourism and university students. Here 65% of households have one or two residents, while only 7% have more than 4 residents. Montjuïc is another singularity: it has the lowest proportion of single-resident households, and about 75% of households have between 2 and 4 residents. It is the prototypical middle-class family neighborhood. Mas Xirgu again comes out as a special case: it has the highest proportion of single-resident households, which together with two-resident households make up more than 75% of the total. Another 22% of households have four or five residents, showing the presence of two very different areas in the same neighborhood.

Est and Santa Eugènia probably cluster together because they have the highest proportions of households with 5 or more residents, more than 16% in both cases.

Sud and Oest are family neighborhoods where about 90% of households have 1 to 4 residents, two-resident households being the most frequent class.

In Nord and Eixample the leading categories are households with one or two residents, followed at some distance by families with 3 or 4 members.

In summary, Girona has the profile of a modern city where families are getting smaller and most people live alone or as a couple. However, the slight differences are relevant and show that the map of the city is quite diverse.
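
The shares quoted in this section can be recomputed from the `membresLlar` table. The following is a small sketch under stated assumptions: the rows are copied by hand from that table, and `share` is a hypothetical helper introduced only for this example.

```python
# Percentage of households by number of residents, copied from membresLlar.
# Columns: 1, 2, 3, 4, 5, 6, "7 i més" (7 or more)
households = {
    "Centre": [38.54, 27.23, 15.84, 11.28, 4.17, 1.5, 1.42],
    "Est": [22.69, 22.5, 18.25, 18.51, 10.75, 4.25, 3.05],
    "Mas Xirgu": [44.44, 33.33, 0.0, 11.11, 11.11, 0.0, 0.0],
    "Montjuïc": [15.07, 23.74, 27.76, 24.87, 6.4, 1.86, 0.31],
    "Santa Eugènia": [25.09, 24.64, 17.99, 16.18, 7.6, 3.83, 4.67],
}

def share(row, first, last):
    """Summed percentage of households with first..last residents (1-based)."""
    return round(sum(row[first - 1:last]), 2)

for name, row in households.items():
    print(f"{name}: 1-2 residents {share(row, 1, 2)}%, "
          f"2-4 residents {share(row, 2, 4)}%, "
          f"5 or more {share(row, 5, 7)}%")
```

Summing the columns this way makes it easy to check any of the percentages above against the source table, or to try other groupings of household sizes.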

# 6 Conclusion <a name="conclusion"></a>
* Foursquare might not be the most comprehensive database for profiling the venues of the city of Girona. In the future it should be complemented with data from other databases such as Google Maps.
* Girona is a small town of 102 thousand people, but it is quite diverse. Neighborhoods matter, both in terms of venues and in terms of socioeconomic profile. Clustering them by different criteria gives different results, which shows that having the right information might be crucial for city management.
* Future directions:
  * Complementing Foursquare data with data from other databases.
  * Adding data on other socioeconomic variables.
  * Defining a socioeconomic index, which could help deliver more focused information to stakeholders.
  * Tracking venues to follow foot traffic as time series, which would be of great interest to stakeholders; this will be done after confinement measures are lifted.