# Capstone Project - The Battle of Neighbourhoods (Week 1 & 2)
<h2><center>Property Prices & Venue Data Analysis of London</center></h2>


## 1. Introduction

### 1.1. Background
The coronavirus (COVID-19) pandemic has had, and will continue to have, a significant impact on businesses and the economy worldwide. This is evident in the stock market and oil price crashes, the record-breaking number of people filing for unemployment and major airlines on the brink of administration.

The real estate and property market is no exception, with the London property market coming to a halt in March 2020 when the full lockdown was announced to prevent the spread of the virus. Physical viewings were postponed, construction was suspended, and estate agents and mortgage lenders were no longer able to value properties in person.

As a result, Zoopla has predicted that completed sales in the UK will be 50% lower in 2020 than in 2019, and Knight Frank has predicted that the number of sales in Greater London will fall by 35%. However, despite the bleak outlook for property and housing prices this year, a large number of firms and their analysts believe that the housing market could make a very strong recovery by 2021, with estimates in the range of 3% to 6%.

### 1.2. Business Problem
The best decisions are often backed by insight and data. By utilising machine learning, we can generate those insights effectively and efficiently, giving potential home-buyers and investors the best possible decision-making support. This brings us to our business problem: how can we generate insight so that home-buyers and investors can make well-informed choices when purchasing properties in London, especially in this uncertain economic situation?

To solve this business problem, we will cluster London areas based on the average sales price and on local venues and amenities, e.g. schools, supermarkets and coffee shops. We will then compare these clusters with the average property and rental prices for each borough, and calculate the rental yield for each cluster for buy-to-let investors. This will provide valuable information on whether a property is a viable choice for home-buyers and investors.
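
The rental yield calculation can be sketched as follows. This is a minimal illustration, assuming the gross yield definition (annual rent divided by purchase price) and assuming that rents are quoted per week; the function name `gross_rental_yield` and the sample figures are hypothetical and not taken from the notebook.

```python
def gross_rental_yield(avg_property_price, weekly_rent, weeks_per_year=52):
    """Return the gross rental yield as a percentage.

    Assumption: rent is quoted per week and no costs (fees, voids, tax)
    are deducted, so this is a gross rather than a net yield.
    """
    if avg_property_price <= 0:
        raise ValueError("avg_property_price must be positive")
    annual_rent = weekly_rent * weeks_per_year
    return annual_rent / avg_property_price * 100


# Hypothetical example: a £500,000 property let at £450 per week
print(round(gross_rental_yield(500_000, 450), 2))
```

A net yield would subtract running costs from the annual rent before dividing, so the gross figure should be read as an upper bound.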

## 2. Data Acquisition

### 2.1. Data Sources

The Price Paid Data (property sales data) in London will be sourced from HM Land Registry, where the data is based on the raw data released each month. The dataset will include the following columns: Transaction unique identifier, Price, Date of Transfer, Postcode, Property Type, Old/New, Duration, PAON (Primary Addressable Object Name), SAON (Secondary Addressable Object Name), Street, Locality, Town/City, District, County and PPD Category Type.

The FourSquare API will be used to access and explore venues and amenities based on the Latitude and Longitude collected using the GeoCoder library, which will then be read into a dataframe for data wrangling and cleaning. This dataframe will be merged with the Price Paid Data from HM Land Registry and processed to be suitable for fitting the machine learning model.

The list of boroughs in London will be scraped from the Wikipedia page, and the average property and rental prices per borough will be scraped from Foxtons (a UK estate agency). The data will be visualised using Plotly in order to gauge the recommendations generated by our model against the average prices for each cluster.

Please see the References section at the end of the notebook for links and descriptions of the data sources.

### 2.2. Data Collecting & Cleaning

In [364]:
import pandas as pd
import geopandas as gpd
import numpy as np
import json
import csv
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen
from geopy.geocoders import Nominatim

import requests

import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
import hdbscan

import plotly.express as px

import ipywidgets as widgets
from ipywidgets import interact, interact_manual
print('Libraries imported.')

Libraries imported.


#### Price Paid Data

In [365]:
ppd_2019 = pd.read_csv('../data/external/pp-2019.csv')
ppd_2019.head()

Unnamed: 0,{8F1B26BD-60CA-53DB-E053-6C04A8C03649},221950,2019-04-26 00:00,TS17 5FF,D,Y,F,...,CARRAWBURGH ROAD,INGLEBY BARWICK,STOCKTON-ON-TEES,STOCKTON-ON-TEES.1,STOCKTON-ON-TEES.2,A,A.1
0,{8F1B26BD-60CB-53DB-E053-6C04A8C03649},246995,2019-03-29 00:00,TS15 9ZH,D,Y,F,...,GRESLEY CLOSE,,YARM,STOCKTON-ON-TEES,STOCKTON-ON-TEES,A,A
1,{8F1B26BD-60CC-53DB-E053-6C04A8C03649},244950,2019-05-17 00:00,TS18 2FN,T,Y,F,...,INFINITY VIEW,,STOCKTON-ON-TEES,STOCKTON-ON-TEES,STOCKTON-ON-TEES,A,A
2,{8F1B26BD-60CD-53DB-E053-6C04A8C03649},139950,2019-05-31 00:00,TS18 2FN,S,Y,F,...,INFINITY VIEW,,STOCKTON-ON-TEES,STOCKTON-ON-TEES,STOCKTON-ON-TEES,A,A
3,{8F1B26BD-60CE-53DB-E053-6C04A8C03649},271995,2019-05-31 00:00,TS15 9FD,D,Y,F,...,MALLARD DRIVE,,YARM,STOCKTON-ON-TEES,STOCKTON-ON-TEES,A,A
4,{8F1B26BD-60CF-53DB-E053-6C04A8C03649},84450,2019-04-26 00:00,TS18 2FD,T,Y,F,...,DEEPDALE AVENUE,,STOCKTON-ON-TEES,STOCKTON-ON-TEES,STOCKTON-ON-TEES,A,A


As mentioned on the 'How to access HM Land Registry Price Paid Data' website, the column headers are not supplied in the file, so they need to be added manually.

In [366]:
ppd_2019.columns = ['TUID', 'Price', 'Date_of_Transfer', 'Postcode', 'Property_Type', 'Old_New', 'Duration',
                    'PAON', 'SAON', 'Street', 'Locality', 'Town_City', 'District', 'County', 'PPD_Cat_Type', 'Record_Status']

ppd_2019.sort_values(by=['Date_of_Transfer'], ascending=False, inplace=True)
ppd_2019.head()

Unnamed: 0,TUID,Price,Date_of_Transfer,Postcode,Property_Type,Old_New,Duration,...,Street,Locality,Town_City,District,County,PPD_Cat_Type,Record_Status
908444,{9DBAD222-BE41-6EB3-E053-6B04A8C0F257},155000,2019-12-31 00:00,LS28 8ED,S,N,F,...,BRADFORD ROAD,,PUDSEY,LEEDS,WEST YORKSHIRE,B,A
415790,{9DBAD222-8F5A-6EB3-E053-6B04A8C0F257},19476811,2019-12-31 00:00,WA5 3UZ,O,N,L,...,LINGLEY GREEN AVENUE,LINGLEY MERE BUSINESS PARK,WARRINGTON,WARRINGTON,WARRINGTON,B,A
921591,{A2479555-56B8-74C7-E053-6B04A8C0887D},294000,2019-12-31 00:00,SP11 6ZQ,T,Y,F,...,CASHMERE DRIVE,,ANDOVER,TEST VALLEY,HAMPSHIRE,A,A
902369,{9DBAD222-B849-6EB3-E053-6B04A8C0F257},67500,2019-12-31 00:00,SA11 2HG,T,N,F,...,PENRHIWTYN STREET,,NEATH,NEATH PORT TALBOT,NEATH PORT TALBOT,B,A
885654,{9FF0D969-B57B-11ED-E053-6C04A8C06383},176000,2019-12-31 00:00,M46 9EF,S,N,L,...,CHANTERS AVENUE,ATHERTON,MANCHESTER,WIGAN,GREATER MANCHESTER,A,A
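
Note that when a header-less file is read with default settings, the first record is consumed as the header row, which is why a data row appears in the header position in the first output above. A minimal sketch of supplying the names at read time instead, using the standard library's `csv.DictReader` with `fieldnames`. The two sample rows are copied from the output above but cut down to four fields for brevity, and `read_headerless` is a hypothetical helper. In pandas, the equivalent is `pd.read_csv(path, header=None, names=columns)`.

```python
import csv
import io

# Illustrative subset of the Price Paid Data columns (assumption: only the
# first four fields are kept here to keep the sample short).
COLUMNS = ['TUID', 'Price', 'Date_of_Transfer', 'Postcode']

SAMPLE = (
    '"{8F1B26BD-60CA-53DB-E053-6C04A8C03649}","221950","2019-04-26 00:00","TS17 5FF"\n'
    '"{8F1B26BD-60CB-53DB-E053-6C04A8C03649}","246995","2019-03-29 00:00","TS15 9ZH"\n'
)


def read_headerless(text, fieldnames):
    """Parse CSV text that has no header row, naming the fields explicitly."""
    # Passing fieldnames makes DictReader treat the first line as data.
    reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames)
    return list(reader)


rows = read_headerless(SAMPLE, COLUMNS)
print(len(rows), rows[0]['Postcode'])
```

Naming the columns while reading keeps every record, including the first one.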


#### List of London Boroughs

In [None]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_London_boroughs').text
soup = BeautifulSoup(source, 'html.parser')  # name the parser explicitly to avoid a parser-guessing warning
table = soup.find('table',class_='wikitable sortable')
tr_elements = soup.find_all(['tr'])[0:34]

# Write the table headers and cells into a CSV
with open('../data/raw/london_boroughs.csv', 'w', newline='', encoding='utf-8') as f:
    column_headers = ['Borough','Inner','Status', 'Local authority', 'Political control',
                      'Headquarters', 'Area (sq_mi)', 'Population (2013_est)', 'Coordinates', 'Nr in map']
    writer = csv.writer(f)
    writer.writerow(column_headers)
    for cell in tr_elements:
        td = cell.find_all('td')
        row = [i.text.replace('\n','').replace(' / ',',') for i in td]
        writer.writerow(row)

Three boroughs were scraped with citation reference text, '[note #]', so those were removed by chaining .replace methods. The latitude and longitude were also sliced out of the Coordinates column and assigned to their own respective columns.

In [None]:
london_boroughs = pd.read_csv('../data/raw/london_boroughs.csv', usecols=['Borough', 'Coordinates'])
london_boroughs['Latitude'] = london_boroughs['Coordinates'].str[43:50]
london_boroughs['Longitude'] = london_boroughs['Coordinates'].str[52:60]
london_boroughs['Borough'] = [b.replace('[note 1]', '').replace('[note 4]', '').replace('[note 2]', '') for b in london_boroughs['Borough'] ]
london_boroughs
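
Fixed-position slicing such as `.str[43:50]` depends on every Coordinates string having exactly the same length. A minimal alternative sketch uses a regular expression to pull out the decimal-degree pair. It assumes the cell contains a pair written like `51.5607°N 0.1557°E`; the sample string and the helper name `parse_decimal_coordinates` are hypothetical.

```python
import re

# Matches a decimal-degree pair such as "51.5607°N 0.1557°E".
DECIMAL_PAIR = re.compile(r'(\d+(?:\.\d+)?)°([NS])\s+(\d+(?:\.\d+)?)°([EW])')


def parse_decimal_coordinates(text):
    """Return (latitude, longitude) as signed floats, or None if absent."""
    match = DECIMAL_PAIR.search(text)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) if ns == 'N' else -float(lat)
    longitude = float(lon) if ew == 'E' else -float(lon)
    return latitude, longitude


# Hypothetical cell content in the style of the scraped Coordinates column
sample = '51°33′39″N 0°09′21″E,51.5607°N 0.1557°E'
print(parse_decimal_coordinates(sample))
```

Because west longitudes are returned as negative numbers, the result can be used directly as map coordinates.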

#### Property and Rental Prices 

Next, I scraped a list of London postcodes and their corresponding districts from the following website: https://www.doogal.co.uk/london_postcodes.php. Using each postcode, I then scraped the average property and rental prices from the Foxtons website. All of the data is written to a CSV file.

In [None]:
with open('../data/raw/london_property_prices.csv', 'w', newline='', encoding='utf-8') as f:
    column_headers = ['postcode','districts', 'avg_property_price','avg_rental_price']
    writer = csv.DictWriter(f, fieldnames = column_headers)
    writer.writeheader()
    
    # Scrape postcodes and districts 
    source_postcode = requests.get('https://www.doogal.co.uk/london_postcodes.php').text
    soup1 = BeautifulSoup(source_postcode, 'html.parser')
    districts = soup1.find('div', class_='realContent')
    a_elements = districts.find_all('a')[2:157]
    for a in a_elements:
        Postcode = a.getText().split(':')[0]
        try:
            Districts = a.getText().split(': ')[1]
        except IndexError:
            # No district text after the postcode
            Districts = 'NaN'
        
        # Scrape the prices for each postcode obtained above
        source_foxtons = requests.get('https://www.foxtons.co.uk/living-in/{}'.format(Postcode)).text
        soup2 = BeautifulSoup(source_foxtons, 'html.parser')
        var_elements = soup2.find_all(['var'], class_="price_headline")
        
        
        property_price = var_elements[0].getText()[1:]
        try:
            rental_price = var_elements[1].getText()[1:]
        except IndexError:
            rental_price = 'NaN'
        # Return NaN if there is no data for rental prices.
        # Match digits and commas only: the earlier hex-style pattern also matched the 'a' in 'NaN'.
        match = re.search('[0-9,]+', rental_price) if len(rental_price) > 1 else None
        result = match.group() if match else 'NaN'
            
        # Write all of the above into the CSV    
        writer.writerow({'postcode': Postcode, 'districts':Districts,
                         'avg_property_price':property_price, 'avg_rental_price':result})
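
The price-cleaning steps in the scraper above can be isolated into a small helper. This is a minimal sketch, assuming the scraped text looks like `£659,278` or `£481 per week`; the exact wording on the Foxtons pages is an assumption and `parse_price` is a hypothetical name.

```python
import re


def parse_price(text):
    """Extract the first number from a price string as an int, or None."""
    # A digit followed by any run of digits and thousands separators
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))


print(parse_price('£659,278'))       # 659278
print(parse_price('£481 per week'))  # 481
print(parse_price('n/a'))            # None
```

Returning `None` for missing values, rather than the string `'NaN'`, lets the later `dropna` call detect them reliably once the values are loaded into a dataframe.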

In [367]:
london_property_prices = pd.read_csv('../data/raw/london_property_prices.csv')
london_property_prices.dropna(how='any', inplace=True)
london_property_prices['avg_property_price'] = london_property_prices['avg_property_price'].apply(lambda X: X.replace(",", ""))
london_property_prices 

Unnamed: 0,postcode,districts,avg_property_price,avg_rental_price
0,E1,"Mile End, Stepney, Whitechapel",659278,481
1,E2,"Bethnal Green, Shoreditch",454292,655
2,E3,"Bow, Bromley-by-Bow",490185,473
3,E4,"Chingford, Highams Park",409644,325
4,E5,Clapton,662577,514
...,...,...,...,...
109,W10,"Ladbroke Grove, North Kensington",729400,418
110,W11,"Holland Park, Notting Hill",3555708,751
111,W12,Shepherd's Bush,619423,486
112,W13,West Ealing,703833,385


### 2.3. Feature Selection 

From the Price Paid Data, the columns that were not relevant to our business problem were dropped: TUID, Duration, PAON, SAON, Locality, PPD_Cat_Type and Record_Status.

There were also a number of rows where the price was very high, which may have been commercial properties. Therefore rows where the price is larger than £2,000,000 were also dropped.

In [368]:
# Drop features that are irrelevant for this project, filter for London rows and clean up the data
ppd_2019_clean = ppd_2019.drop(columns=['TUID', 'Duration', 'PAON', 'SAON', 'Locality', 'PPD_Cat_Type', 'Record_Status'])

# Keep only rows where the Town_City column is 'LONDON'
ppd_london = ppd_2019_clean[ppd_2019_clean['Town_City'] == 'LONDON']
# Drop rows priced above £2,000,000
ppd_london = ppd_london[ppd_london['Price'] <= 2000000].copy()
ppd_london.dropna(axis=0, how='any', inplace=True)

# Add a new column for the postcode prefixes
ppd_london['Postcode_Prefix'] = ppd_london['Postcode'].apply(lambda x: x.split(' ')[0])
ppd_london.sort_values('Street')

Unnamed: 0,Price,Date_of_Transfer,Postcode,Property_Type,Old_New,Street,Town_City,District,County,Postcode_Prefix
954295,296000,2019-08-09 00:00,SW2 3BN,F,N,ABBESS CLOSE,LONDON,LAMBETH,GREATER LONDON,SW2
201551,299950,2019-02-28 00:00,SW4 9LA,F,N,ABBEVILLE ROAD,LONDON,LAMBETH,GREATER LONDON,SW4
414232,793750,2019-12-16 00:00,SW4 9NA,F,N,ABBEVILLE ROAD,LONDON,LAMBETH,GREATER LONDON,SW4
955674,1335000,2019-11-29 00:00,SW4 9LP,T,N,ABBEVILLE ROAD,LONDON,LAMBETH,GREATER LONDON,SW4
767739,525000,2019-12-02 00:00,SW4 9NJ,F,N,ABBEVILLE ROAD,LONDON,LAMBETH,GREATER LONDON,SW4
...,...,...,...,...,...,...,...,...,...,...
715562,255000,2019-09-06 00:00,SE3 8EU,F,N,ZANGWILL ROAD,LONDON,GREENWICH,GREATER LONDON,SE3
585843,557000,2019-01-18 00:00,SE3 8EH,S,N,ZANGWILL ROAD,LONDON,GREENWICH,GREATER LONDON,SE3
349699,790000,2019-07-05 00:00,E3 5RB,T,N,ZEALAND ROAD,LONDON,TOWER HAMLETS,GREATER LONDON,E3
254026,375000,2019-01-28 00:00,NW9 6FD,F,N,ZENITH CLOSE,LONDON,BARNET,GREATER LONDON,NW9


In [371]:
ppd_grouped = ppd_london.groupby(['Street', 'District', 'Postcode_Prefix'])['Price'].mean().round(0).reset_index()
ppd_grouped.columns = ['street', 'district', 'postcode_prefix', 'avg_price']
ppd_grouped.sort_values(by=['street'], inplace=True)
ppd_grouped

Unnamed: 0,street,district,postcode_prefix,avg_price
0,ABBESS CLOSE,LAMBETH,SW2,296000.0
1,ABBEVILLE ROAD,LAMBETH,SW4,613870.0
2,ABBEY GARDENS,CITY OF WESTMINSTER,NW8,588750.0
3,ABBEY GARDENS,HAMMERSMITH AND FULHAM,W6,470750.0
4,ABBEY GARDENS,SOUTHWARK,SE16,330500.0
...,...,...,...,...
15550,YUNUS KHAN CLOSE,WALTHAM FOREST,E17,275000.0
15551,ZANGWILL ROAD,GREENWICH,SE3,406000.0
15552,ZEALAND ROAD,TOWER HAMLETS,E3,790000.0
15553,ZENITH CLOSE,BARNET,NW9,375000.0


In the cells above, we kept only the rows of the ppd_2019_clean dataframe where the town is 'LONDON', then grouped the dataframe by street name (together with district and postcode prefix) and calculated the average price paid for property on each street.

As there are a large number of rows, getting the latitude, longitude and FourSquare data for each street will take a significant amount of time. A separate Python script is therefore used to look up the latitude and longitude of each street and write them to a CSV file.

This decreases the computational time required and provides us with an overview of properties and their nearby venues across different price ranges.
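
The geocoding script itself is not shown in the notebook. A minimal sketch of what such a step might look like is given below. It assumes a geocoder callable that takes an address string and returns an object with `latitude` and `longitude` attributes, or `None` when nothing is found (which is how geopy's `Nominatim.geocode` behaves). The names `geocode_rows`, `write_rows`, `FakeLocation` and `fake_geocode`, and the query format, are hypothetical.

```python
import csv
import io
from collections import namedtuple

# Stand-in for a geocoder result (hypothetical, for offline illustration only)
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])


def geocode_rows(rows, geocode):
    """Yield each row with latitude/longitude added; None when not found."""
    for row in rows:
        # Assumed query format: street, postcode district, city
        query = '{}, {}, London, UK'.format(row['street'], row['postcode_prefix'])
        location = geocode(query)
        lat = location.latitude if location is not None else None
        lon = location.longitude if location is not None else None
        yield {**row, 'latitude': lat, 'longitude': lon}


def write_rows(rows, handle):
    """Write the geocoded rows to an open text file as CSV."""
    fields = ['street', 'postcode_prefix', 'latitude', 'longitude']
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


def fake_geocode(query):
    # Pretend only one street is known
    known = {'ABBESS CLOSE, SW2, London, UK': FakeLocation(51.442879, -0.108249)}
    return known.get(query)


streets = [
    {'street': 'ABBESS CLOSE', 'postcode_prefix': 'SW2'},
    {'street': 'NOWHERE LANE', 'postcode_prefix': 'ZZ9'},
]
buffer = io.StringIO()
write_rows(geocode_rows(streets, fake_geocode), buffer)
print(buffer.getvalue())
```

With geopy, `fake_geocode` would be replaced by a real geocoder's `geocode` method. Streets that cannot be located end up with empty coordinate fields, which matches the `dropna` call applied when the CSV is loaded.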

In [372]:
ppd_london_2019 = pd.read_csv('../data/processed/ppd_london_2019.csv')
ppd_london_2019.dropna(axis=0, how='any', inplace=True)
ppd_london_2019.shape

(14323, 6)

In [373]:
ppd_london_2019['latitude'] = pd.to_numeric(ppd_london_2019['latitude'], downcast="float")
ppd_london_2019['longitude'] = pd.to_numeric(ppd_london_2019['longitude'], downcast="float")
ppd_london_2019.reset_index(drop=True)

Unnamed: 0,street,district,postcode_prefix,avg_price,latitude,longitude
0,ABBESS CLOSE,LAMBETH,SW2,296000.0,51.442879,-0.108249
1,ABBEVILLE ROAD,LAMBETH,SW4,613870.0,51.453304,-0.140988
2,ABBEY GARDENS,CITY OF WESTMINSTER,NW8,588750.0,51.533905,-0.179989
3,ABBEY GARDENS,HAMMERSMITH AND FULHAM,W6,470750.0,51.484844,-0.213365
4,ABBEY GARDENS,SOUTHWARK,SE16,330500.0,51.491653,-0.066099
...,...,...,...,...,...,...
14318,YUNUS KHAN CLOSE,WALTHAM FOREST,E17,275000.0,51.578888,-0.019688
14319,ZANGWILL ROAD,GREENWICH,SE3,406000.0,51.472534,0.042171
14320,ZEALAND ROAD,TOWER HAMLETS,E3,790000.0,51.531441,-0.037656
14321,ZENITH CLOSE,BARNET,NW9,375000.0,51.592243,-0.255944


There are over 14,000 rows in the dataframe above, so fetching venue data for every row using the FourSquare API would take a significant amount of computational time. In addition, an application can only make a maximum of 5000 requests per hour to the venues endpoint. To reduce the dataset without introducing selection bias, we use the .sample method to draw a random 20% sample of the full dataset.

In [374]:
# Let's get a sample from this dataframe
ppd_london_2019_sample = ppd_london_2019.sample(frac=0.2, replace=False, random_state=1).copy()
ppd_london_2019_sample = ppd_london_2019_sample.sort_values('street').reset_index(drop=True)
ppd_london_2019_sample

Unnamed: 0,street,district,postcode_prefix,avg_price,latitude,longitude
0,ABBEY GARDENS,SOUTHWARK,SE16,330500.0,51.491653,-0.066099
1,ABBEY PARADE,MERTON,SW19,242750.0,51.531391,-0.292546
2,ABBEY ROAD,BRENT,NW10,950000.0,51.530067,-0.269922
3,ABBEY ROAD,CAMDEN,NW6,396429.0,51.540989,-0.189608
4,ABBOTS PARK,LAMBETH,SW2,489000.0,51.442993,-0.113085
...,...,...,...,...,...,...
2860,YORK ROAD,EALING,W3,462000.0,51.518475,-0.264443
2861,YORK WAY,CAMDEN,N1C,357350.0,51.536472,-0.122328
2862,YORK WAY ESTATE,ISLINGTON,N7,275625.0,51.545193,-0.125491
2863,YOUNG STREET,KENSINGTON AND CHELSEA,W8,1275735.0,51.501156,-0.189701
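
To stay under an hourly quota such as the 5000 requests per hour mentioned above, calls can also be spaced out evenly. A minimal sketch follows, with the clock and sleep functions injected so that it can be exercised without actually waiting. `Throttle` is a hypothetical helper and not part of any FourSquare client.

```python
import time


class Throttle:
    """Space calls at least `period / max_calls` seconds apart."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# 5000 requests per hour works out to one request every 0.72 seconds
throttle = Throttle(max_calls=5000, period=3600)
print(throttle.min_interval)
```

Calling `throttle.wait()` immediately before each request would keep a loop over all streets within the quota, at the cost of a longer total run time.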


## 3. Exploratory Data Analysis (EDA)

### 3.1. Price Paid Data

Let's visualise the average property prices for those streets by plotting them on a map.

In [375]:
geolocator = Nominatim(user_agent='london_explorer')
location = geolocator.geocode('London, UK')
latitude_ldn = location.latitude
longitude_ldn = location.longitude
print('The geographical coordinate of London, UK are {}, {}.'.format(latitude_ldn, longitude_ldn))

The geographical coordinate of London, UK are 51.5073219, -0.1276474.


In [376]:
with open("../secrets/mapbox_token.txt") as token_file:
    mapbox_access_token = token_file.read().strip()

fig = px.scatter_mapbox(ppd_london_2019_sample, lat="latitude", lon="longitude", size="avg_price", color="avg_price",
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=10, width=1000, height=700)
fig.update_layout(
    title='Property Paid Price in London 2019 (Capped at £2 million)',
    autosize=True,
    hovermode='closest',
    showlegend=True,
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=latitude_ldn,
            lon=longitude_ldn
        ),
        pitch=5,
        zoom=10,
        style='light'
    ),
)
fig.show()

As expected, neighbourhoods such as Mayfair, Chelsea, Knightsbridge, Notting Hill and Fulham have the highest average property prices. The map above also shows that the more expensive properties are mostly located on the west side of central London, while the east side has far fewer properties that exceed the £1,000,000 mark. There are exceptions, however, with small clusters near Blackheath, Canary Wharf, Newbury Park and Bexleyheath.

This will be useful to home-buyers or investors as they may take into consideration a neighbourhood that they were not aware of previously. The next step would be to explore the said neighbourhoods using the FourSquare API.

### 3.2. Explore the area and nearby venues

Let's take a look at the first neighbourhood and its nearby venues within a 300 metre radius.

In [None]:
secret_dict = {}
with open('../secrets/foursquare_secrets.txt') as f:
    for item in f:
        key, val = item.split(':', 1)  # split on the first colon only
        secret_dict[key.strip()] = val.strip()

In [None]:
LIMIT = 100
radius = 300
VERSION = '20180605'
neighborhood_latitude = ppd_london_2019_sample.loc[0, 'latitude']
neighborhood_longitude = ppd_london_2019_sample.loc[0, 'longitude']
neighborhood_name = ppd_london_2019_sample.loc[0, 'street']
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(secret_dict.get('client_id'), secret_dict.get('client_secret'), neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
results = requests.get(url).json()

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

#### Now we repeat the steps above for all the other neighbourhoods by wrapping the process in a function.

The function below takes four required arguments and one optional argument with a default value. It loops over each row in the dataframe and sends an API call to FourSquare. The JSON returned is then processed to extract the data we are after: venue name, venue latitude, venue longitude and venue category. Finally, the data is collected into a dataframe.

In [None]:
def getNearbyVenues(names, districts, latitudes, longitudes, radius=300):
    
    venues_list=[]
    for name, dstr, lat, lng in zip(names, districts, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            secret_dict.get('client_id'), 
            secret_dict.get('client_secret'), 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
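        # make the GET request; skip this street if the call or the parsing fails
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        except (requests.RequestException, ValueError, KeyError, IndexError):
            continue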
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            dstr,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Street', 
                             'Street District', 
                             'Street Latitude', 
                             'Street Longitude', 
                             'Venue', 
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    
    return(nearby_venues)
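
One edge case in the list comprehension above: a venue whose `categories` list is empty would raise an `IndexError`, and that line sits outside the `try` block. A minimal sketch of flattening the `items` list defensively, assuming the response shape used above (`item['venue']['name']`, `['location']['lat']`, `['location']['lng']` and `['categories'][0]['name']`). The helper `flatten_venues` and the sample payload are hypothetical.

```python
def flatten_venues(items, street, district, lat, lng):
    """Turn a list of Foursquare 'items' into flat tuples, one per venue."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None rather than failing when no category is present
        category = categories[0]['name'] if categories else None
        rows.append((
            street, district, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Hypothetical payload in the shape the function above expects
sample_items = [
    {'venue': {'name': 'Corner Cafe',
               'location': {'lat': 51.4917, 'lng': -0.0660},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled Place',
               'location': {'lat': 51.4920, 'lng': -0.0655},
               'categories': []}},
]
rows = flatten_venues(sample_items, 'ABBEY GARDENS', 'SOUTHWARK', 51.491653, -0.066099)
print(rows[1][-1])
```

Rows with a missing category can then be dropped or relabelled before the one-hot encoding step.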

In [None]:
district_venues = getNearbyVenues(names=ppd_london_2019_sample['street'],
                                  districts=ppd_london_2019_sample['district'],
                                  latitudes=ppd_london_2019_sample['latitude'],
                                  longitudes=ppd_london_2019_sample['longitude'])
district_venues.groupby(['Street', 'Street District'])['Venue'].count()

The FourSquare venue dataframe is pickled using the pandas .to_pickle method. This will eliminate the need to re-run the FourSquare venue calls above, thus saving time between runs.

In [None]:
district_venues.to_pickle('../data/processed/london_venues.pkl')  # saving the dataframe as a .pkl

In [None]:
london_venues= pd.read_pickle('../data/processed/london_venues.pkl')
print(london_venues.shape)
london_venues.head()
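
The save-then-load pattern above can be wrapped so that the slow step only runs when no cached file exists. A minimal sketch using the standard library's `pickle` follows; `load_or_compute` and `expensive` are hypothetical names, and a plain dictionary stands in for the venue dataframe.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if absent."""
    path = Path(path)
    if path.exists():
        with path.open('rb') as handle:
            return pickle.load(handle)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as handle:
        pickle.dump(result, handle)
    return result


calls = []


def expensive():
    # Stand-in for the slow API loop (hypothetical)
    calls.append(1)
    return {'venues': 3}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'venues.pkl'
    first = load_or_compute(cache_file, expensive)
    second = load_or_compute(cache_file, expensive)

print(first == second, len(calls))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.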

In [None]:
print('There are {} unique categories.'.format(len(london_venues['Venue Category'].unique())))

Let's plot the top 25 venue categories from the FourSquare data we collected.

In [None]:
london_venues_top25 = london_venues.groupby(['Venue Category'])['Venue Category'].count()\
    .reset_index(name="count").sort_values(['count'], ascending=False)[0:25]

fig2 = px.bar(london_venues_top25, x='Venue Category', y='count', labels={'Venue Category': 'Venue Categories', 'count': 'Count'})
fig2.update_layout(title='Top 25 Venue Categories',)
fig2.show()

### 3.3. Analyse each neighbourhood

In [None]:
# one hot encoding
pd.options.display.max_rows = 10
pd.options.display.max_columns = 15

london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")
# There was a 'Neighborhood' venue category which needed to be dropped as it was skewing the results
# london_onehot.drop('Neighborhood', axis=1, inplace=True)

london_onehot_wstreets = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe as the first column
london_onehot_wstreets.insert(loc=0, column='street', value=london_venues['Street'])
london_onehot_wstreets.insert(loc=1, column='district', value=london_venues['Street District'])

print(london_onehot_wstreets.shape)
london_onehot_wstreets[70:80]

In [None]:
london_onehot_grouped = london_onehot_wstreets.groupby(['street', 'district']).mean().reset_index()

print(london_onehot_grouped.shape)
london_onehot_grouped
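
Taking the mean of one-hot columns per group gives, for each street, the share of its venues that fall into each category. A minimal standard-library sketch of the same idea, keyed on street only for brevity. The sample data and the helper name `category_shares` are hypothetical. Unlike the dataframe version, categories that do not occur on a street are simply left out here rather than stored as 0.

```python
from collections import Counter, defaultdict


def category_shares(venues):
    """Map each street to {category: share of that street's venues}."""
    counts = defaultdict(Counter)
    for street, category in venues:
        counts[street][category] += 1
    shares = {}
    for street, counter in counts.items():
        total = sum(counter.values())
        shares[street] = {cat: n / total for cat, n in counter.items()}
    return shares


# Hypothetical venues: (street, category)
sample = [
    ('ABBEY ROAD', 'Pub'),
    ('ABBEY ROAD', 'Pub'),
    ('ABBEY ROAD', 'Café'),
    ('ABBEY ROAD', 'Park'),
    ('YORK WAY', 'Café'),
]
print(category_shares(sample)['ABBEY ROAD'])
```

These shares are the features that the clustering step operates on, so a street with a single venue gets a share of 1.0 for that venue's category.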

#### Now we write a function to get the top 10 venues for each neighborhood

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

# for assigning indicators to 1st, 2nd & 3rd
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['street', 'district']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['street'] = london_onehot_grouped['street']
neighborhoods_venues_sorted['district'] = london_onehot_grouped['district']

for ind in np.arange(london_onehot_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(london_onehot_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted

## 4. Modeling

### 4.1. K-Means Clustering

#### 4.1.1. Optimising K

The Elbow Method is used to determine the optimal value of k, as it is one of the most popular methods. We compute two metrics over a range of k values in order to find the 'elbow point', i.e. the point after which the metrics start decreasing roughly linearly.

Those 2 metric values are:
- Distortion: The average distance from each sample to its closest cluster center. The code below uses the unsquared Euclidean distance; some definitions average the squared distances instead.
- Inertia: The sum of squared distances of samples to their closest cluster center.
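
The two metrics can be worked through by hand on a toy example. This is a minimal sketch using only the standard library, with hypothetical points and cluster centers. It follows the elbow cell in this notebook in averaging unsquared Euclidean distances for distortion, while inertia sums the squared distances.

```python
import math


def nearest_distances(points, centers):
    """Euclidean distance from each point to its closest center."""
    return [min(math.dist(p, c) for c in centers) for p in points]


def distortion(points, centers):
    """Mean (unsquared) distance to the closest center."""
    d = nearest_distances(points, centers)
    return sum(d) / len(d)


def inertia(points, centers):
    """Sum of squared distances to the closest center."""
    return sum(x ** 2 for x in nearest_distances(points, centers))


# Hypothetical toy data: two tight groups and their centers
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(distortion(points, centers), inertia(points, centers))
```

Every point here lies exactly one unit from its nearest center, so the distortion is 1.0 and the inertia is 4.0. Both metrics fall as k grows, which is why the elbow, rather than the minimum, is used to pick k.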

In [None]:
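from scipy.spatial.distance import cdist

# Feature matrix: venue-category frequencies without the label columns
ppd_grouped_clustering = london_onehot_grouped.drop(labels=['street', 'district'], axis=1)

distortions = []
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k, random_state=0)
    km = km.fit(ppd_grouped_clustering)
    # distortion: mean Euclidean distance to the nearest cluster center
    distortions.append(sum(np.min(cdist(ppd_grouped_clustering, km.cluster_centers_, 
                      'euclidean'),axis=1)) / ppd_grouped_clustering.shape[0]) 
    Sum_of_squared_distances.append(km.inertia_)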

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
ax1.plot(K, distortions, 'bx-')
ax2.plot(K, Sum_of_squared_distances, 'bx-')
ax1.set_title('The Elbow Method using Distortion', fontsize = 15)
ax2.set_title('The Elbow Method using Inertia', fontsize = 15)
ax1.set_ylabel('Distortion', fontsize = 12)
ax2.set_ylabel('Sum_of_squared_distances', fontsize = 12)
ax1.set_xlabel('k', fontsize = 12)
ax2.set_xlabel('k', fontsize = 12)

#### 4.1.2. Clustering

In [None]:
# set number of clusters
kclusters = 5

ppd_grouped_clustering = london_onehot_grouped.drop(labels=['street', 'district'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ppd_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
# neighborhoods_venues_sorted = []

In [None]:
# add clustering labels
neighborhoods_venues_clustered = neighborhoods_venues_sorted.copy()
neighborhoods_venues_clustered.insert(loc = 0, column = 'Cluster Labels', value = kmeans.labels_)

ppd_london_merged = ppd_london_2019_sample

# merge the clustered venue data into the sampled price data, which holds latitude/longitude for each street
ppd_london_merged = ppd_london_merged.join(neighborhoods_venues_clustered.set_index(['street', 'district']), on=['street', 'district'])

ppd_london_merged # check the last columns!

In [None]:
ppd_london_merged.isnull().sort_values(by = '1st Most Common Venue', ascending = False)
ppd_london_merged.shape

In [None]:
ppd_london_merged.dropna(axis=0, how='any', inplace=True)
ppd_london_merged.shape

In [None]:
ppd_london_merged['Cluster Labels'] = ppd_london_merged['Cluster Labels'].astype(int)
ppd_london_merged = ppd_london_merged.sort_values('Cluster Labels')

#### Now we visualise the clusters using an interactive Plotly map.

In [None]:
fig3 = px.scatter_mapbox(ppd_london_merged, lat="latitude", lon="longitude",
                        color=ppd_london_merged["Cluster Labels"].astype(str), hover_data=['street'], width=1000, height=700)

fig3.update_layout(
    title='Clustering London Neighbourhoods (K-Means)',
    autosize=True,
    hovermode='closest',
    showlegend=True,
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(lat=latitude_ldn, lon=longitude_ldn), 
        pitch=5,
        zoom=10,
        style='light'
    ),
    legend={'title':'Clusters', 'traceorder':'normal'}
)
fig3.update_traces(marker=dict(size=10))
fig3.show()

### Cluster 0 - Bus Stops

In [None]:
cluster_0 = ppd_london_merged.loc[ppd_london_merged['Cluster Labels'] == 0, 
                      ppd_london_merged.columns[[0,1] + list(range(4, ppd_london_merged.shape[1]))]]
cluster_0

### Cluster 1 - Pubs

In [None]:
cluster_1 = ppd_london_merged.loc[ppd_london_merged['Cluster Labels'] == 1, 
                      ppd_london_merged.columns[[0,1] + list(range(5, ppd_london_merged.shape[1]))]]
cluster_1

### Cluster 2 - Parks

In [None]:
cluster_2 = ppd_london_merged.loc[ppd_london_merged['Cluster Labels'] == 2, 
                      ppd_london_merged.columns[[0,1] + list(range(5, ppd_london_merged.shape[1]))]]
cluster_2

### Cluster 3 - Grocery Stores

In [None]:
cluster_3 = ppd_london_merged.loc[ppd_london_merged['Cluster Labels'] == 3, 
                      ppd_london_merged.columns[[0,1] + list(range(5, ppd_london_merged.shape[1]))]]
cluster_3

### Cluster 4 - Coffee Shops & Cafes

In [None]:
cluster_4 = ppd_london_merged.loc[ppd_london_merged['Cluster Labels'] == 4, 
                      ppd_london_merged.columns[[0,1] + list(range(5, ppd_london_merged.shape[1]))]]
cluster_4

Let's plot the top 5 venues in each cluster onto a bar chart.

In [None]:
ppd_london_merged_top5 = ppd_london_merged.groupby(["Cluster Labels", "1st Most Common Venue"])['Cluster Labels'].count()\
.reset_index(name="count").sort_values(['Cluster Labels','count'], ascending=False)

cluster_0_top5 = (ppd_london_merged_top5[ppd_london_merged_top5['Cluster Labels'] == 0][0:5])
cluster_1_top5 = (ppd_london_merged_top5[ppd_london_merged_top5['Cluster Labels'] == 1][0:5])
cluster_2_top5 = (ppd_london_merged_top5[ppd_london_merged_top5['Cluster Labels'] == 2][0:5])
cluster_3_top5 = (ppd_london_merged_top5[ppd_london_merged_top5['Cluster Labels'] == 3][0:5])
cluster_4_top5 = (ppd_london_merged_top5[ppd_london_merged_top5['Cluster Labels'] == 4][0:5])

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
top_5_venues = pd.concat([cluster_0_top5, cluster_1_top5, cluster_2_top5, cluster_3_top5,
                          cluster_4_top5], ignore_index=True)

In [None]:
fig4 = px.bar(top_5_venues, x="Cluster Labels", y="count", color='1st Most Common Venue',
             height=500)

fig4.update_layout(title='Clustering London Streets', 
                   barmode='stack',
                   bargap=0.15,
                   bargroupgap=0.1, 
                   legend={'title':'Venue Category',
                          'traceorder':'normal'}
                  )
fig4.show()

As we can see above, clusters 0, 1 and 2 all have a high number of pubs, which is not surprising as these clusters are located around Central London, where there are more than 3500 pubs.

### 4.2. K-Modes

In [None]:
kmode_onehot_grouped = london_onehot_wstreets.groupby(['street', 'district']).sum().reset_index()
kmode_grouped_clustering = kmode_onehot_grouped.drop(labels=['street', 'district'], axis=1)
kmode_grouped_clustering

In [None]:
from kmodes.kmodes import KModes

# define the k-modes model
km = KModes(n_clusters=4, init='Huang', n_init=10, verbose=1)

# fit the clusters to the one-hot venue dataframe
clusters = km.fit_predict(kmode_grouped_clustering)

# get an array of cluster modes
kmodes = km.cluster_centroids_
shape = kmodes.shape

# For each cluster mode (a vector of venue-category values),
# find and print the column headings where a non-zero value appears.
# If every entry is zero, report a "No venues" cluster.
for i in range(shape[0]):
    if sum(kmodes[i,:]) == 0:
        print("\ncluster " + str(i) + ": ")
        print("No venues cluster")
    else:
        print("\ncluster " + str(i) + ": ")
        cent = kmodes[i,:]
        for j in kmode_grouped_clustering.columns[np.nonzero(cent)]:
            print(j)

In [None]:
kmodedf = neighborhoods_venues_sorted.copy()
kmodedf.insert(loc = 2, column = 'Cluster Labels', value = clusters)
kmodedf.sort_values('Cluster Labels')

In [None]:
kmode_merge = ppd_london_2019_sample

# join the K-Modes cluster labels and top venues onto the 2019 property sample by street and district
kmode_merge = kmode_merge.join(kmodedf.set_index(['street', 'district']), on=['street', 'district'])
kmode_merge.dropna(inplace=True)
kmode_merge.sort_values('Cluster Labels')

fig10 = px.scatter_mapbox(kmode_merge, lat="latitude", lon="longitude",
                        color=kmode_merge["Cluster Labels"].astype(str), hover_data=['street'], width=1000, height=700)

fig10.update_layout(
    title='Clustering London Neighbourhoods (K-Mode, k = 4)',
    autosize=True,
    hovermode='closest',
    showlegend=True,
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(lat=latitude_ldn, lon=longitude_ldn), 
        pitch=5,
        zoom=10,
        style='light'
    ),
    legend={'title':'Clusters', 'traceorder':'normal'}
)
fig10.update_traces(marker=dict(size=10))
fig10.show()

In [None]:
kmodedf_top5 = kmode_merge.groupby(["Cluster Labels", "1st Most Common Venue"])['Cluster Labels'].count()\
.reset_index(name="count").sort_values(['Cluster Labels','count'], ascending=False)

kcluster_0_top5 = (kmodedf_top5[kmodedf_top5['Cluster Labels'] == 0][0:5])
kcluster_1_top5 = (kmodedf_top5[kmodedf_top5['Cluster Labels'] == 1][0:5])
kcluster_2_top5 = (kmodedf_top5[kmodedf_top5['Cluster Labels'] == 2][0:5])
kcluster_3_top5 = (kmodedf_top5[kmodedf_top5['Cluster Labels'] == 3][0:5])
kcluster_4_top5 = (kmodedf_top5[kmodedf_top5['Cluster Labels'] == 4][0:5])

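# pd.concat replaces DataFrame.append (removed in pandas 2.0).
# With n_clusters=4 the labels run 0-3, so kcluster_4_top5 is empty here.
kmode_top_5_venues = pd.concat([kcluster_0_top5, kcluster_1_top5, kcluster_2_top5, kcluster_3_top5,
                                kcluster_4_top5], ignore_index=True)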

kmode_top_5_venues

In [None]:
fig11 = px.bar(kmode_top_5_venues, x="Cluster Labels", y="count", color='1st Most Common Venue',
             height=500)

fig11.update_layout(title='Clustering London Streets', 
                   barmode='stack',
                   bargap=0.15,
                   bargroupgap=0.1, 
                   legend={'title':'Venue Category',
                          'traceorder':'normal'}
                  )
fig11.show()

#### Optimising K

In [None]:
cost = []
for num_clusters in list(range(1,15)):
    kmode = KModes(n_clusters=num_clusters, init = "Huang", n_init = 5, verbose=1)
    kmode.fit_predict(kmode_grouped_clustering)
    cost.append(kmode.cost_)

In [None]:
k_values = np.arange(1, 15)  # the k values tried in the loop above
plt.plot(k_values, cost, 'bx-')
plt.xlabel('k'); plt.ylabel('cost')
plt.show()
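
The cost curve is usually read by eye for an "elbow". As a rough aid, one common heuristic picks the k where the curve bends most sharply, i.e. where the discrete second difference of the cost is largest. A minimal sketch on made-up cost values (not results from this notebook); `elbow_k` is a hypothetical helper:

```python
import numpy as np

def elbow_k(k_values, costs):
    """Return the k at which the discrete second difference of cost is largest."""
    costs = np.asarray(costs, dtype=float)
    # second_diff[j] belongs to k_values[j + 1] (the interior points of the curve)
    second_diff = costs[:-2] - 2 * costs[1:-1] + costs[2:]
    return int(k_values[1 + int(np.argmax(second_diff))])

k_values = list(range(1, 8))
toy_costs = [1000, 600, 350, 300, 270, 250, 235]  # made-up values
print(elbow_k(k_values, toy_costs))
```

This is only a heuristic: on a noisy or nearly linear curve it can pick an arbitrary k, so it is best treated as a cross-check on the plot rather than a replacement for it.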

## 5. Discussion

### 5.1. Housing Prices 1995 - 2020

In [None]:
# Average PPD per borough from 1995 - 2017
historical_london_ppd = pd.read_csv('../data/external/land-registry-house-prices-ward.csv')
historical_london_ppd = historical_london_ppd[(historical_london_ppd['Measure'] == 'Mean') & (historical_london_ppd['Value'] != '-')]
historical_london_ppd['Year'] = historical_london_ppd['Year'].str[-4:]  # keep the 4-digit year
historical_london_ppd['Value'] = historical_london_ppd['Value'].str.replace(',', '', regex=False).astype(int)

borough_avg_ppd = historical_london_ppd.groupby(['Year', 'Borough'])['Value'].mean().round(2).reset_index()

In [None]:
# read in the 2018 csv and calculate the average PPD per borough
ppd_2018 = pd.read_csv('../data/external/pp-2018.csv')

ppd_2018.columns = ['TUID', 'Price', 'Date_of_Transfer', 'Postcode', 'Property_Type', 'Old_New', 'Duration',
                    'PAON', 'SAON', 'Street', 'Locality', 'Town_City', 'District', 'County', 'PPD_Cat_Type', 'Record_Status']

# Drop features that are irrelevant for this project, filter for London rows and clean up the data
ppd_2018_clean = ppd_2018.drop(columns=['TUID', 'Duration', 'PAON', 'SAON', 'Locality', 'PPD_Cat_Type', 'Record_Status'])

# Keep only rows where the Town_City column is 'LONDON'
ppd_london_2018 = ppd_2018_clean[ppd_2018_clean['Town_City']=='LONDON'].copy()
ppd_london_2018.dropna(axis=0, how='any', inplace=True)

avg_ppd_borough_2018 = ppd_london_2018.groupby('District')['Price'].mean().round(2).reset_index()

In [None]:
# read in the 2020 csv (data up to March) and calculate the average PPD per borough
ppd_2020 = pd.read_csv('../data/external/pp-2020.csv')

ppd_2020.columns = ['TUID', 'Price', 'Date_of_Transfer', 'Postcode', 'Property_Type', 'Old_New', 'Duration',
                    'PAON', 'SAON', 'Street', 'Locality', 'Town_City', 'District', 'County', 'PPD_Cat_Type', 'Record_Status']

# Drop features that are irrelevant for this project, filter for London rows and clean up the data
ppd_2020_clean = ppd_2020.drop(columns=['TUID', 'Duration', 'PAON', 'SAON', 'Locality', 'PPD_Cat_Type', 'Record_Status'])

# Keep only rows where the Town_City column is 'LONDON'
ppd_london_2020 = ppd_2020_clean[ppd_2020_clean['Town_City']=='LONDON'].copy()
ppd_london_2020.dropna(axis=0, how='any', inplace=True)

avg_ppd_borough_2020 = ppd_london_2020.groupby('District')['Price'].mean().round(2).reset_index()

In [None]:
# Average PPD per borough in 2019
avg_ppd_borough_2019 = ppd_london.groupby('District')['Price'].mean().round(2).reset_index()

In [None]:
fig12 = px.bar(avg_ppd_borough_2019, x='District', y='Price', color='Price', range_y=[0,1000000])
fig12.layout.update(title={'text':'Average Property Prices in London (2019)'},
                   xaxis={'title':{'text':'Borough',
                                  'font':{'size':17}},
                          'tickangle':30,
                          'tickfont':{'size':12}
                         },
                   yaxis={'title':{'text':'Value in GBP (£)',
                                  'font':{'size':17}},
                         },
                  )
fig12.show()

In [None]:
avg_ppd_borough_2018

In [None]:
avg_ppd_borough_2019

In [None]:
avg_ppd_borough_2020

In [None]:
fig13 = px.bar(borough_avg_ppd, x='Borough', y='Value', color='Value', animation_frame='Year',
           hover_name='Borough', range_y=[0,2000000])

fig13.layout.update(title={'text':'Average Property Prices in London (1995-2017)'},
                   xaxis={'title':{'text':'Borough',
                                  'font':{'size':17}},
                          'tickangle':30,
                          'tickfont':{'size':12}
                         },
                   yaxis={'title':{'text':'Value in GBP (£)',
                                  'font':{'size':17}},
                         },
                  sliders=[{'visible':True,
                           'currentvalue':{'prefix':'Year: ',
                                           'font':{'size':20},
                                           'xanchor':'right',
                                           'visible':True},
                            'pad':{'t':100},
                            'transition':{'duration':20,
                                         'easing':'linear'}
                            
                           }
                          ],
                   updatemenus=[{'pad':{'t':135}}
                               ]
                  )
fig13.show()

### 5.2. Pubs vs No Pubs

In [None]:
ppd_london_merged.dropna(inplace=True)
# A row counts as "with pub" if 'Pub' appears in any of its ten most-common-venue columns
ordinals = ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']
venue_cols = [f'{o} Most Common Venue' for o in ordinals]
with_pub = ppd_london_merged[ppd_london_merged[venue_cols].eq('Pub').any(axis=1)].copy()
with_pub

In [None]:
# A row counts as "without pub" if 'Pub' appears in none of its ten most-common-venue columns
ordinals = ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']
venue_cols = [f'{o} Most Common Venue' for o in ordinals]
without_pub = ppd_london_merged[~ppd_london_merged[venue_cols].eq('Pub').any(axis=1)].copy()
without_pub

In [None]:
with_pub['Key'] = 'with_pub'
without_pub['Key'] = 'without_pub'

df_pub = pd.concat([with_pub, without_pub], keys=['with_pub', 'without_pub'])
df_pub_grouped = df_pub.groupby('Key')['avg_price'].mean().reset_index()
fig, ax = plt.subplots(figsize=(15, 5))
# sns.barplot draws onto ax, so style the Axes directly rather than rebinding fig
sns.barplot(data=df_pub_grouped, y='avg_price', x='Key', ax=ax)
ax.set_title('Average property prices within 300 meters of a pub vs without', fontsize=15)
ax.set_ylabel('Value in GBP (£)', fontsize=15)
ax.set_xlabel('')
ax.set_xticklabels(labels=['With pub', 'Without pub'], fontsize=15)

### 5.3. Most Common Venue in Each Borough

In [None]:
borough_top_venue = neighborhoods_venues_sorted.groupby(['district','1st Most Common Venue'])['1st Most Common Venue']\
    .count().reset_index(name='count').copy()
# For each borough, keep the venue category that is most often a street's top venue
borought_top_venue_unique = borough_top_venue.loc[borough_top_venue.groupby('district')['count'].idxmax()]

In [None]:
fig, ax = plt.subplots(figsize=(17, 5))
sns.barplot(data=borought_top_venue_unique, x='district', y='count', ax=ax, ci=None, hue='1st Most Common Venue', dodge=False)
ax.set_ylabel('Count', fontsize=15)
ax.set_xlabel('Borough', fontsize=15)
ax.set(ylim=(0, 40))
ax.tick_params(axis='y', labelsize=12)  # resize tick labels without freezing their text
ax.legend(ncol=2, loc="upper right", frameon=True, fontsize=12)
plt.xticks(rotation=-35, horizontalalignment='left', fontsize=12)

In [None]:
# find neighbourhoods/districts where housing price is above the mean, then find the most common venue.

In [None]:
# calculate rental yield and find the average for each cluster and compare with data scraped from Foxton
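
A minimal sketch of the planned rental-yield step, assuming gross rental yield is defined as annual rent divided by purchase price and that rents are quoted per week. The figures and the `gross_rental_yield` helper are illustrative only, not data from this project:

```python
def gross_rental_yield(avg_price, weekly_rent):
    """Gross rental yield in percent: annual rent / purchase price * 100."""
    if avg_price <= 0:
        raise ValueError("avg_price must be positive")
    annual_rent = weekly_rent * 52
    return annual_rent / avg_price * 100

# Made-up (average price in GBP, weekly rent in GBP) pairs keyed by cluster label
toy_clusters = {0: (650_000, 500), 1: (400_000, 350)}

yields = {label: round(gross_rental_yield(price, rent), 2)
          for label, (price, rent) in toy_clusters.items()}
print(yields)
```

Gross yield ignores costs such as maintenance, void periods and tax, so it is best used to compare clusters with each other rather than as an estimate of actual return.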

## Bits of code

In [None]:
# Get unique street names from the new ppd_london dataframe, remove nan values.
ppd_london_streets = [x for x in ppd_london['Street'].unique() if str(x) != 'nan']
ppd_london_streets[0:10]

In [None]:
from sklearn.metrics import silhouette_score

sil = []
K_sil = range(2, 20)

for k in K_sil:
    print(k, end=' ')
    kmeans = KMeans(n_clusters=k).fit(ppd_grouped_clustering)
    labels = kmeans.labels_
    sil.append(silhouette_score(ppd_grouped_clustering, labels, metric = 'euclidean'))

In [None]:
plt.plot(K_sil, sil, 'bx-')
plt.xlabel('k')
plt.ylabel('silhouette_score')
plt.title('Silhouette Method For Optimal k')
plt.show()
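
For reference, the silhouette of a single point is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the nearest other cluster. The sketch below computes this by hand on made-up 1-D data. It assumes the point's own cluster has at least two members, and `silhouette_of_point` is a hypothetical helper rather than a scikit-learn function.

```python
import numpy as np

def silhouette_of_point(i, X, labels):
    """Silhouette s(i) = (b - a) / max(a, b) using Euclidean distances."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # exclude the point itself from its own cluster
    a = dists[own].mean()
    # b: smallest mean distance to the points of any other cluster
    other_labels = [c for c in np.unique(labels) if c != labels[i]]
    b = min(dists[labels == c].mean() for c in other_labels)
    return (b - a) / max(a, b)

X = np.array([[0.0], [1.0], [10.0], [11.0]])  # made-up points
labels = np.array([0, 0, 1, 1])
score = silhouette_of_point(0, X, labels)
print(score)
```

Values close to 1 mean the point sits well inside its cluster; `silhouette_score` reports the mean of this quantity over all points.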

In [None]:
# map animation of mean PPD 1995 - 2017
with open('../data/external/london_boroughs_proper.geojson', 'r') as response:
    boroughs = json.load(response)

fig7 = px.choropleth_mapbox(borough_avg_ppd, geojson=boroughs, locations='Borough', color='Value',
                            range_color=(0, 2000000),
                            animation_frame='Year',
                            color_continuous_scale="Viridis",
                            mapbox_style="carto-positron",
                            zoom=9, center = {"lat": latitude_ldn, "lon": longitude_ldn},
                            featureidkey='properties.name',
                            opacity=0.5,
                           labels={'Value':'Avg Property Prices'}
                          )
fig7.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig7.show()

## References

How to access HM Land Registry Price Paid Data: https://www.gov.uk/guidance/about-the-price-paid-data

Price Paid Data - HM Land Registry: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads

Average private rental prices per borough: https://data.london.gov.uk/dataset/average-private-rents-borough

Borough property and rental prices - Foxtons: https://www.foxtons.co.uk/living-in/bermondsey

List of London boroughs: https://en.wikipedia.org/wiki/List_of_London_boroughs

London Borough GeoJSON: https://joshuaboyd1.carto.com/tables/london_boroughs_proper/public

https://stackoverflow.com/questions/36631163/what-are-the-pros-and-cons-between-get-dummies-pandas-and-onehotencoder-sciki/38650886#38650886

https://stats.stackexchange.com/questions/187595/clustering-with-categorical-and-numeric-data

https://www.ritchieng.com/machinelearning-one-hot-encoding/

https://towardsdatascience.com/clustering-burger-venues-in-s%C3%A3o-paulo-f4bfc0a031cd

https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a

K-Modes:
https://www.kaggle.com/ashydv/bank-customer-clustering-k-modes-clustering

https://stackoverflow.com/questions/42639824/python-k-modes-explanation

https://medium.com/@davidmasse8/unsupervised-learning-for-categorical-data-dd7e497033ae