## Cluster analysis

This notebook walks through the function cluster_analysis from cluster_analysis.py. Each block is followed by notes on the research and justification behind it, along with any challenges that arose.

In [None]:
from google.cloud import bigquery
import numpy as np
import pandas as pd
import geopy.distance
import geopandas
from shapely.geometry import Point, mapping
from shapely.geometry.polygon import Polygon
from sklearn.cluster import SpectralClustering

Ensuring that the modules were compatible with each other was a challenge. I use NumPy and pandas on a daily basis and have some experience with Geopy, GeoPandas and Shapely. I had no previous experience with scikit-learn, although I have studied some of the statistics and graph theory behind its techniques.


In [None]:
# obtaining the data
bqclient=bigquery.Client()
table = bigquery.TableReference.from_string('carto-ps-bq-developers.data_test.osm_spain_pois')
rows = bqclient.list_rows(table)
spain_pois = rows.to_dataframe(create_bqstorage_client=True)
spain_pois.to_csv('spain_pois.csv')

# spain_pois=pd.read_csv('spain_pois.csv')

I had never used Google Cloud before this exercise and initially found the interface difficult, but I was eventually able to set it up as required and obtain the data. I would like the opportunity to work with it further, as it seems to have huge potential.

In [None]:
# Discarding the points outside of the Madrid Region
min_lon = -3.93455628
min_lat = 40.25387182
max_lon = -3.31993445
max_lat = 40.57085727
madrid_pois = spain_pois.loc[(spain_pois['lon'] >= min_lon) & (spain_pois['lon'] <= max_lon) &
                             (spain_pois['lat'] >= min_lat) & (spain_pois['lat'] <= max_lat)]
madrid_pois=madrid_pois.reset_index()

As I work regularly with pandas, this was straightforward. I used reset_index so that the rows can be addressed by position (0 to n-1) when iterating later.

In [None]:
## Cinemas
madrid_cinemas = pd.read_csv('208862-7650164-ocio_salas.csv',sep=';')

I had assumed that spain_pois.csv would contain the cinemas in Madrid as points of interest, but I was unable to find them. I searched every column for 'cinema' and 'cine' and found nothing. The closest match was 'theatre' under amenity, but a check on Google Maps showed that these were stage theatres and not cinemas.
The .csv file was obtained from https://datos.madrid.es/portal/site/egob/. Unfortunately it only contains the cinemas within the municipality of Madrid and not the whole region. The website of the Comunidad de Madrid did not contain this information.
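The column search described above could be sketched roughly as follows. This is an illustration on a made-up toy frame, not the check that was actually run; `find_matching_columns` is a hypothetical helper name, and a substring such as 'cine' can also match unrelated words, so any hits would still need a manual look.

```python
import pandas as pd


def find_matching_columns(df, pattern):
    """Return {column: number of matching rows} for every column with a hit."""
    hits = {}
    for col in df.columns:
        # Cast to string so numeric columns and missing values do not raise.
        matches = df[col].astype(str).str.contains(pattern, case=False, regex=True)
        if matches.any():
            hits[col] = int(matches.sum())
    return hits


# Toy stand-in for spain_pois; these rows are illustrative only.
toy_pois = pd.DataFrame({
    "name": ["Teatro Real", "Museo del Prado", None],
    "amenity": ["theatre", "museum", "cafe"],
})

cinema_hits = find_matching_columns(toy_pois, r"cine")
theatre_hits = find_matching_columns(toy_pois, r"theatre")
```

On the real data the same call would be made with spain_pois in place of toy_pois.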

In [None]:
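# Converting degrees to kilometres, using (min_lat, min_lon) as the origin.
# geopy expects (latitude, longitude) tuples. The helper and column names say
# "m"/"metres", but the values are in km to match the 0.15 km buffer used later.
def lat_to_m(mlat, lat, mlon):
    # north-south offset of `lat` from the origin latitude
    return geopy.distance.distance((lat, mlon), (mlat, mlon)).km
def lon_to_m(mlon, lon, mlat):
    # east-west offset of `lon` from the origin longitude
    return geopy.distance.distance((mlat, lon), (mlat, mlon)).km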
vlat_to_m = np.vectorize(lat_to_m)
vlon_to_m = np.vectorize(lon_to_m)

madrid_pois['y_metres'] = vlat_to_m(min_lat, np.array(madrid_pois['lat']), min_lon)
madrid_pois['x_metres'] = vlon_to_m(min_lon, np.array(madrid_pois['lon']), min_lat)
madrid_cinemas['y_metres'] = vlat_to_m(min_lat, np.array(madrid_cinemas['LATITUD']), min_lon)
madrid_cinemas['x_metres'] = vlon_to_m(min_lon, np.array(madrid_cinemas['LONGITUD']), min_lat)

When first trying the buffer function (below), I noticed that most points were being filtered out because the radius of 0.15 was being applied in degrees, not kilometres. I therefore added extra columns to both the POI and cinema dataframes holding each point's horizontal and vertical displacement from an origin at (min_lat, min_lon). These values are in kilometres, despite the _metres column names. Given that there are more than 90000 data points, I wrapped these functions with np.vectorize rather than writing an explicit for loop.
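For comparison, the same displacement idea can be written with plain NumPy array arithmetic. This is a sketch under stated assumptions, not the method used in this notebook: it uses an equirectangular (spherical) approximation with an assumed mean Earth radius, so its results will differ slightly from the geodesic distances that geopy computes. The name `offsets_km` is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius for this sketch


def offsets_km(lat, lon, origin_lat, origin_lon):
    """Approximate east-west (x) and north-south (y) offsets in km from an origin."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # One degree of latitude covers roughly the same distance everywhere.
    y_km = np.radians(lat - origin_lat) * EARTH_RADIUS_KM
    # One degree of longitude shrinks with cos(latitude); use the origin latitude.
    x_km = np.radians(lon - origin_lon) * EARTH_RADIUS_KM * np.cos(np.radians(origin_lat))
    return x_km, y_km


# The two corners of the bounding box used in this notebook.
x_km, y_km = offsets_km(
    [40.25387182, 40.57085727],
    [-3.93455628, -3.31993445],
    origin_lat=40.25387182,
    origin_lon=-3.93455628,
)
```

Because the whole array is processed in one expression, this form avoids calling a Python function once per point.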

In [None]:
# Defining the region of Madrid that is less than 0.15 km from a cinema
cinema_locs = geopandas.GeoSeries([Point(madrid_cinemas['y_metres'][i], madrid_cinemas['x_metres'][i]) for i in range(madrid_cinemas.shape[0])])
cinema_buffer = cinema_locs.buffer(0.15)

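Used y_metres and x_metres as planar point coordinates instead of latitude and longitude, so that the buffer radius of 0.15 is applied in kilometres rather than degrees.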

In [None]:
# Discard the pois near cinemas
madrid_pois['near_cinema'] = [0 for i in range(madrid_pois.shape[0])]

def in_buffer(p):
    p = Point(madrid_pois['y_metres'][p], madrid_pois['x_metres'][p])
    for buff in cinema_buffer:
        if buff.contains(p):
            return 1
    # no buffer contains the point; return 0 explicitly so np.vectorize always gets an int
    return 0

vin_buffer = np.vectorize(in_buffer)
madrid_pois['near_cinema'] = vin_buffer(range(madrid_pois.shape[0]))
madrid_pois_nocinema = madrid_pois.loc[madrid_pois['near_cinema']!=1]

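Wrapped in np.vectorize so that it can be applied to every row index in one call. Note that np.vectorize is a convenience wrapper around a Python loop rather than a true speed-up. The function takes each point and checks it against the 150 m buffer of each cinema, returning as soon as a containing buffer is found. I opted for marking each nearby point with a 1 and filtering at the end.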
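A possible alternative to the point-in-buffer loop is to compare squared distances directly with NumPy broadcasting. The sketch below is an assumption-laden illustration rather than the notebook's method: `near_any` is a hypothetical name, the coordinates are made up, and a plain distance test treats the boundary slightly differently from Shapely's polygonal buffer.

```python
import numpy as np


def near_any(points_xy, centres_xy, radius):
    """Return a boolean array: True where a point lies within `radius` of any centre."""
    points_xy = np.asarray(points_xy, dtype=float)
    centres_xy = np.asarray(centres_xy, dtype=float)
    # Broadcast to shape (n_points, n_centres, 2): every point minus every centre.
    diff = points_xy[:, None, :] - centres_xy[None, :, :]
    dist_sq = (diff ** 2).sum(axis=2)
    # Compare squared distances to avoid taking square roots.
    return (dist_sq <= radius ** 2).any(axis=1)


# Made-up planar coordinates in km, for illustration only.
pois = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
cinemas = np.array([[0.0, 0.05], [5.0, 5.0]])
near = near_any(pois, cinemas, radius=0.15)
```

The intermediate array grows with the number of points times the number of centres, so for very large inputs it may need to be processed in chunks.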

In [None]:
# taking a 20000 sample
madrid_pois_nocinema = madrid_pois_nocinema.sample(n=20000)

Took a random sample of 20000 points to perform the analysis. Running the cluster analysis on all points caused a memory error on my machine.
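A rough back-of-the-envelope calculation suggests why the full dataset may not fit in memory. It assumes that the clustering step builds a dense n x n affinity matrix of 8-byte floats, and it uses 90000 as a round figure for the full point count; both are assumptions of this sketch, and `dense_matrix_gib` is a hypothetical helper.

```python
def dense_matrix_gib(n_points, bytes_per_value=8):
    """Size in GiB of a dense n_points x n_points matrix."""
    return n_points * n_points * bytes_per_value / 1024 ** 3


full_gib = dense_matrix_gib(90_000)    # all points (round figure)
sample_gib = dense_matrix_gib(20_000)  # the 20000-point sample

print(f"all points: {full_gib:.1f} GiB, sample: {sample_gib:.1f} GiB")
```

Under these assumptions the full matrix needs roughly twenty times the memory of the sampled one, since the size grows with the square of the number of points.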

In [None]:
# performing the cluster analysis
X = np.zeros((madrid_pois_nocinema.shape[0],2))
X[:,0],X[:,1]  = np.array(madrid_pois_nocinema['lat']), np.array(madrid_pois_nocinema['lon'])
model = SpectralClustering(n_clusters=5)
yhat = model.fit_predict(X)

I was new to this kind of analysis, so I researched the possible options. This article, [https://machinelearningmastery.com/clustering-algorithms-with-python/](https://machinelearningmastery.com/clustering-algorithms-with-python/), describes 10 of them. After reading through it, I decided that the two most worth considering were the last two, Spectral Clustering and Gaussian Mixture, mainly because their main hyperparameter is the number of clusters. I opted for Gaussian Mixture first, but when plotted it did not produce a very suitable result on the map. I therefore switched to Spectral Clustering, which produced a more sensible end result. One advantage of Gaussian Mixture is that it uses less memory than Spectral Clustering. I had also considered implementing a genetic algorithm using graph theory with distances as a cost function, but opted not to because of time constraints and because the size of the problem (90000 points) was simply too large for the computing resources available to me.
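The comparison between the two candidates could be reproduced on a small scale along these lines. This is a minimal sketch on synthetic data, not the Madrid points: the blob centres, spread, sample size and random seeds are all made up for illustration, and the names `X_demo`, `gmm_labels` and `spectral_labels` are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic blobs of 100 points each (illustrative only).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in centres])

# Both estimators take the number of clusters as their main hyperparameter.
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X_demo)
spectral_labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X_demo)
```

Plotting the two label arrays over the same points is one way to judge which result looks more sensible for a given dataset.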

In [None]:
out = pd.DataFrame({'lat':np.array(madrid_pois_nocinema['lat']),
                    'lon':np.array(madrid_pois_nocinema['lon']),
                    'point_id':np.array(madrid_pois_nocinema['id']),
                    'cluster_id':yhat})

out.to_csv('madrid_poi_clusters.csv')

Saved the point IDs, coordinates and cluster labels to madrid_poi_clusters.csv.