# New Restaurant Location Recommendations - The Johannesburg Edition

## 1. Introduction and Business Problem

The City of Johannesburg is the largest metropolitan municipality in South Africa and the country's economic capital. It had a population of 4,434,827 according to the official 2011 Census. The municipality covers 1,645 square kilometres, stretching from Orange Farm in the south to Midrand in the north, and comprises several urban centres, including Johannesburg, Midrand, Roodepoort, Diepsloot, Killarney, Melrose Arch, Randburg, Rosebank, Sandton, Soweto, and Sunninghill.
As the economic capital of the country, and by extension of Gauteng Province, the city attracts substantial inbound migration from individuals and entrepreneurs seeking work, business and other economic opportunities. In addition to migrants from other cities and provinces within South Africa, the city receives migrants from other African countries such as Zimbabwe, Botswana, Mozambique, Lesotho, Swaziland and Nigeria, as well as from countries beyond the African continent. While some business intelligence services can advise interested business people on possible locations for, say, a new restaurant within the city, they come at a cost. I am not aware of an inexpensive way for someone planning to open a restaurant to evaluate the locations of existing restaurants within the city, identify gaps, and shortlist potential areas before committing to expensive, localised market research. Such an evaluation must also cover the type of restaurant and the cuisine served, so that it can answer two critical questions up front: who the likely competition is, and how well patrons rate them on popular platforms.


## 2. Available Data

The datasets to be used are the following:

a. __FOURSQUARE__ Location Data <br>
The following description is mostly adapted from the Applied Data Science Capstone Course which offers a useful summary of the data and technology.<br>

Foursquare is a technology company that has built a massive dataset of location data through crowd-sourcing: people who use its app add venues and complete any missing information. Its location data is among the most comprehensive and accurate available, which appeals to service providers such as Apple Maps, Uber, Snapchat, Twitter, Garmin and many others, including over 100,000 developers who use the platform. Foursquare can be used via a mobile device or its website. Both allow you to search for a type of venue at a specific location, and the results list all such venues together with their ratings. Clicking on a venue redirects you to its page on Foursquare, which includes its name, full address, working hours, menu, tips and images that users have posted about the venue, among other attributes. To find the most popular (trending) places, only the area of interest is required, and the top venues are listed in order of ranking.
All these search activities are possible because of the comprehensive underlying database, which is available free of charge through an API. This availability, with some restrictions, enables anyone to access it from any application, and it will be utilised in this project. <br>

b. __PLACES__ Database for the City of Johannesburg <br>
This data is based on the 2011 Census results which divided the city into 40 main places, and over 800 sub places. It has been scraped from the https://census2011.adrianfrith.com/ website, which provides online access to a selection of results from South Africa's Census 2011 down to the “sub place” layer of detail, as released in the Community Profile Database DVD set. As an example, the following is the head of the table of the dataset.

|Main Place|Sub Place|Population|Area (km²)|
|------|------|------|------|
|Alexandra|Alexandra Ext 1|5,267| 0.22|
|Alexandra|Alexandra Ext 10|1,865| 0.06|
|Alexandra|Alexandra Ext 11|682| 0.02|
|Alexandra|Alexandra Ext 12|353| 0.01|
|Alexandra|Alexandra Ext 13|1,164| 0.07|

This data will be the baseline against which the availability of restaurants in a particular area is evaluated. The initial part of the evaluation will consider whether there is merit in using the sub place level in the various datasets, particularly for areas with extensions. <br>

c. __GEOCODER__ Package Data <br>
Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739), which can be used to place markers on a map or to position the map. The Geocoder package, or a similar alternative, will be utilised to obtain the coordinates of the main/sub places against which the data from Foursquare will be evaluated.

## 3. Code

Installing and importing all required dependencies

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # install geopy (skip if it is already available in the environment)
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!conda install -c conda-forge geocoder --yes #installing geocoder

!conda install -c conda-forge GoogleMaps --yes #installing GoogleMaps

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (exposed as pd.json_normalize from pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # install folium (skip if it is already available in the environment)
import folium # map rendering library

print('Libraries imported.')


### 3.1 Uploading and Exploring Places Data

Inserting project tokens to be able to use uploaded data and save results in the cloud

In [2]:
# The code was removed by Watson Studio for sharing.

Loading data from the IBM Cloud Object Storage

In [5]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,MainPlace,Sub Place,Population,Area (km²)
0,Johannesburg,Alexandra,138471,3.5
1,Alexandra,East Bank,6771,0.96
2,Alexandra,Far East Bank,29251,2.38
3,Alexandra,Sejwetla,5131,0.13
4,Chartwell,Chartwell AH,1653,8.5


Changing column names to coding-friendly names

In [6]:
column_names = ['mainplace', 'subplace', 'population', 'area'] 

datadf.columns=column_names
data = datadf
data.head()

Unnamed: 0,mainplace,subplace,population,area
0,Johannesburg,Alexandra,138471,3.5
1,Alexandra,East Bank,6771,0.96
2,Alexandra,Far East Bank,29251,2.38
3,Alexandra,Sejwetla,5131,0.13
4,Chartwell,Chartwell AH,1653,8.5


Getting the shape of the data

In [7]:
data.shape

(628, 4)

Exploring the number of main places in Johannesburg

In [8]:
len(data['mainplace'].unique())

24

Exploring the number of sub-places

In [9]:
len(data['subplace'].unique())

627

The data has 24 main places and 627 distinct sub place names across 628 rows (quite a few!)
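
The row count (628) is one more than the number of distinct sub place names (627), so one name presumably appears twice. A minimal sketch of how such repeats could be located is shown here on a small made-up sample rather than the real dataframe; the `rows` list and the `repeated_subplaces` helper are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical sample of (mainplace, subplace) pairs; not the real dataset
rows = [
    ("Alexandra", "East Bank"),
    ("Alexandra", "Sejwetla"),
    ("Chartwell", "Chartwell AH"),
    ("Johannesburg", "East Bank"),  # invented duplicate name, for illustration only
]

def repeated_subplaces(pairs):
    """Return sub place names that occur more than once, with their main places."""
    counts = Counter(sub for _, sub in pairs)
    return {
        sub: [main for main, s in pairs if s == sub]
        for sub, n in counts.items()
        if n > 1
    }

duplicates = repeated_subplaces(rows)
print(duplicates)
```

On the real dataframe, `data[data['subplace'].duplicated(keep=False)]` should give the equivalent view.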

### 3.2 Using the GoogleMaps Geocoding Package

Inserting two columns for latitude and longitude in the dataframe

In [10]:
data['lat']=""
data['lon']=""
data.head()

Unnamed: 0,mainplace,subplace,population,area,lat,lon
0,Johannesburg,Alexandra,138471,3.5,,
1,Alexandra,East Bank,6771,0.96,,
2,Alexandra,Far East Bank,29251,2.38,,
3,Alexandra,Sejwetla,5131,0.13,,
4,Chartwell,Chartwell AH,1653,8.5,,


Importing the googlemaps package and setting the API key

In [17]:
# The code was removed by Watson Studio for sharing.

Populating the dataframe with latitude and longitude data

In [12]:
# Geocoding each sub place and storing its latitude and longitude
lat_col = data.columns.get_loc("lat")
lon_col = data.columns.get_loc("lon")

for i in range(len(data)):
    # query format: "<subplace>, <mainplace>, ZA"
    geocode_result = gmaps_key.geocode(data.iat[i, 1] + ', ' + data.iat[i, 0] + ', ZA')
    try:
        lat = geocode_result[0]["geometry"]["location"]["lat"]
        lon = geocode_result[0]["geometry"]["location"]["lng"]
        data.iat[i, lat_col] = lat
        data.iat[i, lon_col] = lon
    except (IndexError, KeyError):
        # no usable result: leave lat/lon empty for this row
        pass
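
The loop above relies on the nested `[0]["geometry"]["location"]` structure of the geocoding result. As a rough sketch, the lookup can be isolated in a small helper so that the "no match" case is explicit. The helper name `extract_lat_lng` and the hand-written `sample` are assumptions for illustration; no API call is made here.

```python
def extract_lat_lng(geocode_result):
    """Return (lat, lng) from a geocoding-style result list, or (None, None)."""
    try:
        location = geocode_result[0]["geometry"]["location"]
        return location["lat"], location["lng"]
    except (IndexError, KeyError, TypeError):
        # Empty list, missing key, or an unexpected type: treat as "not found"
        return None, None

# Hand-written sample shaped like the structure the loop indexes into
sample = [{"geometry": {"location": {"lat": -26.2041028, "lng": 28.0473051}}}]

found = extract_lat_lng(sample)
missing = extract_lat_lng([])  # an empty list stands in for "no match"
print(found, missing)
```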

Saving the new Dataframe as a .csv and checking the first 10 rows

In [13]:
project.save_data("joburglocations.csv", data.to_csv())
data.head(10)

Unnamed: 0,mainplace,subplace,population,area,lat,lon
0,Johannesburg,Alexandra,138471,3.5,-26.1033,28.0976
1,Alexandra,East Bank,6771,0.96,-26.101,28.1105
2,Alexandra,Far East Bank,29251,2.38,-26.097,28.1133
3,Alexandra,Sejwetla,5131,0.13,-26.1033,28.0976
4,Chartwell,Chartwell AH,1653,8.5,-25.987,27.9715
5,Chartwell,North Champagne Estates AH,75,0.57,-25.9755,27.9608
6,Johannesburg,Althea AH,713,18.01,-26.4289,27.9088
7,Johannesburg,Johannesburg,8537,244.0,-26.2041,28.0473
8,Johannesburg,Diepsloot Nature Reserve,683,25.07,-25.9301,27.9542
9,Johannesburg,Lanseria Airport,0,2.76,-25.9378,27.9264


__OPTIONAL:__ Run this cell only if the Google Geocoding API step above is skipped because the data has already been obtained and saved

In [56]:
body = client_d56fd597f0c643299f9b555c9e110b7d.get_object(Bucket='capstoneprojectnotebook-donotdelete-pr-mrs8cgmt3vnax2',Key='joburglocations.csv')['Body']
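# add missing __iter__ method, so pandas accepts body as file-like object
# (`types` and the `__iter__` helper are assumed to be defined in the hidden Watson Studio setup cell)
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )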

data = pd.read_csv(body)
data = data.drop(['Unnamed: 0'], axis=1)
data.head(10)

Unnamed: 0,mainplace,subplace,population,area,lat,lon
0,Johannesburg,Alexandra,138471,3.5,-26.103301,28.097637
1,Alexandra,East Bank,6771,0.96,-26.100969,28.110518
2,Alexandra,Far East Bank,29251,2.38,-26.096986,28.113328
3,Alexandra,Sejwetla,5131,0.13,-26.103301,28.097637
4,Chartwell,Chartwell AH,1653,8.5,-25.987007,27.971508
5,Chartwell,North Champagne Estates AH,75,0.57,-25.975547,27.960779
6,Johannesburg,Althea AH,713,18.01,-26.42887,27.908789
7,Johannesburg,Johannesburg,8537,244.0,-26.204103,28.047305
8,Johannesburg,Diepsloot Nature Reserve,683,25.07,-25.930144,27.954189
9,Johannesburg,Lanseria Airport,0,2.76,-25.937751,27.926416


Observing the last 10 rows

In [57]:
data.tail(10)

Unnamed: 0,mainplace,subplace,population,area,lat,lon
618,Soweto,Thulani,40376,4.98,-26.224944,27.828212
619,Soweto,Tladi,14435,1.35,-26.256075,27.8414
620,Soweto,Valentine Village/Mandela View,4346,0.19,-26.248538,27.854032
621,Soweto,Winnie Camp,2608,0.13,-26.202459,27.86677
622,Soweto,Zola,44777,3.91,-26.238759,27.839935
623,Soweto,Zondi,15130,1.29,-26.234281,27.867772
624,Stretford,Stretford Ext,61139,7.0,-26.49436,27.848726
625,Johannesburg,Tshepisong,53260,6.56,-26.189778,27.801834
626,Johannesburg,Vlakfontein,27291,4.63,-26.375235,27.887028
627,Johannesburg,Zakariyya Park,6200,1.96,-26.366332,27.897071


Checking data types for the dataframe

In [58]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 628 entries, 0 to 627
Data columns (total 6 columns):
mainplace     628 non-null object
subplace      628 non-null object
population    628 non-null int64
area          628 non-null float64
lat           626 non-null float64
lon           626 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 29.5+ KB


Converting data to the right types if required

In [34]:
#data["lat"] = pd.to_numeric(data['lat'])
#data.info()


Removing rows with no data for lat and lon

In [60]:
# treat empty strings as missing values, then drop rows without coordinates
data['lon'] = data['lon'].replace('', np.nan)
data['lat'] = data['lat'].replace('', np.nan)
data = data.dropna(subset=['lat'])
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 626 entries, 0 to 627
Data columns (total 6 columns):
mainplace     626 non-null object
subplace      626 non-null object
population    626 non-null int64
area          626 non-null float64
lat           626 non-null float64
lon           626 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 34.2+ KB


Two rows with missing coordinates were removed, leaving 626 of the original 628.
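
Dropping empty rows does not catch sub places that the geocoder resolved to the same point. In the output above, for instance, Alexandra and Sejwetla carry identical coordinates; one plausible explanation (an assumption, not verified here) is that the geocoder fell back to the surrounding area. A minimal sketch for listing such collisions, using a few rows copied from that output and an illustrative `shared_coordinates` helper:

```python
from collections import defaultdict

# A few rows copied from the table above: (subplace, lat, lon)
rows = [
    ("Alexandra", -26.103301, 28.097637),
    ("East Bank", -26.100969, 28.110518),
    ("Far East Bank", -26.096986, 28.113328),
    ("Sejwetla", -26.103301, 28.097637),
]

def shared_coordinates(records):
    """Group names by exact (lat, lon) and keep only points used more than once."""
    groups = defaultdict(list)
    for name, lat, lon in records:
        groups[(lat, lon)].append(name)
    return {coords: names for coords, names in groups.items() if len(names) > 1}

shared = shared_coordinates(rows)
print(shared)
```

Sub places that share a point will also return identical Foursquare results, which is worth keeping in mind when reading the restaurant counts later on.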

### 3.3 Mapping and Exploring Johannesburg

Obtaining the lat and lon for the City of Johannesburg

In [61]:
geocode_result = gmaps_key.geocode('Johannesburg, ZA')
latitude = geocode_result[0]["geometry"]["location"]["lat"]
longitude = geocode_result[0]["geometry"]["location"]["lng"]
print('The geographical coordinates of the City of Johannesburg are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the City of Johannesburg are -26.2041028, 28.0473051.


In [62]:
# create map of City of Johannesburg using latitude and longitude values
map_joburg = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, subplace, mainplace in zip(data['lat'], data['lon'], data['subplace'], data['mainplace']):
    label = '{}, {}'.format(subplace, mainplace)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_joburg)  
    
map_joburg

#### Defining Foursquare Credentials and Version

In [50]:
# The code was removed by Watson Studio for sharing.

For demonstration and simplicity, and given the request limits of the Foursquare API, we will focus on the main place of Roodepoort for the rest of the work.

In [63]:
roodepoort_data = data[data.mainplace == "Roodepoort"].reset_index(drop=True)

In [64]:
roodepoort_data.head()

Unnamed: 0,mainplace,subplace,population,area,lat,lon
0,Roodepoort,Aanwins AH,927,0.33,-26.103782,27.882422
1,Roodepoort,Allen's Nek,6373,2.89,-26.130802,27.908789
2,Roodepoort,Alsef AH,299,1.68,-26.083618,27.897803
3,Roodepoort,Ambot AH,124,0.61,-26.087073,27.900733
4,Roodepoort,Amorosa,1881,1.24,-26.100808,27.870703


Obtaining the lat and lon for Roodepoort

In [65]:
geocode_result = gmaps_key.geocode('Roodepoort, ZA')
latitude = geocode_result[0]["geometry"]["location"]["lat"]
longitude = geocode_result[0]["geometry"]["location"]["lng"]
print('The geographical coordinates of Roodepoort are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Roodepoort are -26.1201355, 27.9014654.


In [66]:
# create map of Roodepoort using latitude and longitude values
map_roodepoort = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, subplace, mainplace in zip(roodepoort_data['lat'], roodepoort_data['lon'], roodepoort_data['subplace'], roodepoort_data['mainplace']):
    label = '{}, {}'.format(subplace, mainplace)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_roodepoort)  
    
map_roodepoort

### Exploring a subplace in our dataset

Obtaining the name

In [67]:
roodepoort_data.loc[0, 'subplace']

'Aanwins AH'

Get the subplace's latitude and longitude values.

In [68]:
roodepoort_latitude = roodepoort_data.loc[0, 'lat'] # neighborhood latitude value
roodepoort_longitude = roodepoort_data.loc[0, 'lon'] # neighborhood longitude value

roodepoort_name = roodepoort_data.loc[0, 'subplace'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(roodepoort_name, 
                                                               roodepoort_latitude, 
                                                               roodepoort_longitude))

Latitude and longitude values of Aanwins AH are -26.1037823, 27.8824224.


In [69]:
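# query parameters for the Foursquare explore endpoint
LIMIT = 100    # maximum number of venues to return
radius = 1000  # search radius in metres
categoryId = "4d4b7105d754a06374d81259"  # category ID used here to restrict results to food venues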

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, roodepoort_latitude, roodepoort_longitude, radius, LIMIT, categoryId)
results = requests.get(url).json()
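
Long format strings like the one above are easy to get wrong when a parameter is added or reordered. As an alternative sketch, the same query string can be assembled from a dictionary with the standard library. The function name `build_explore_url` and the placeholder credentials and version string are assumptions for illustration; the parameter names are the ones used in the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=1000, limit=100,
                      category_id="4d4b7105d754a06374d81259"):
    """Assemble the explore URL from named parameters instead of a format string."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
        "categoryId": category_id,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials and version string, for illustration only
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", -26.1037823, 27.8824224)
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"])
```

With `requests`, the same dictionary can also be passed directly as `requests.get(base_url, params=params)`, which handles the encoding.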

Obtaining venue categories

In [70]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened responses prefix the field with 'venue.'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

Creating a dataframe of the venues

In [71]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.location.distance', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(15)

Unnamed: 0,name,distance,categories,lat,lng
0,Thyme @ The Falls,639,Restaurant,-26.108729,27.885677
1,Turn 'n Tender,627,Steakhouse,-26.108698,27.885498
2,Plaashuis Coffee Shop,688,Café,-26.109674,27.884525
3,Fish Aways Strubensvallei,818,Seafood Restaurant,-26.109695,27.887289
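
The `distance` column is presumably measured in metres from the `ll` point passed in the request. A quick sanity check, sketched below with the haversine formula and an assumed mean Earth radius of 6,371 km, compares the sub place centre with the coordinates of Thyme @ The Falls; the result should come out close to the 639 m reported above. The helper name `haversine_m` is illustrative.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

centre = (-26.1037823, 27.8824224)  # Aanwins AH, from the earlier output
venue = (-26.108729, 27.885677)     # Thyme @ The Falls, from the table above
distance = haversine_m(*centre, *venue)
print(round(distance))
```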


Creating a function to find food venues in all the subplaces of Roodepoort

In [82]:
def getNearbyVenues(subplace, lat, lon, radius=1000):
    
    venues_list = []
    # loop variables use their own names so they do not shadow the function arguments
    for place_name, place_lat, place_lng in zip(subplace, lat, lon):
        print(place_name)
            
        # create the API request URL (LIMIT and categoryId are the globals defined earlier)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            place_lat, 
            place_lng, 
            radius, 
            LIMIT,
            categoryId)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # keep only relevant information for each nearby venue
        venues_list.append([(
            place_name, 
            place_lat, 
            place_lng, 
            v['venue']['name'],
            v['venue']['location']['distance'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-subplace lists into one dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue',
                  'Distance from Centre',           
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
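
Direct indexing such as `['groups'][0]` or `['categories'][0]` raises an error if either list comes back empty. The sketch below shows one defensive way to turn a response into rows; the helper `venue_rows` and the hand-written `sample_payload` (including the invented "Example Venue") are assumptions for illustration and only mirror the fields accessed in the function above.

```python
def venue_rows(payload, subplace):
    """Flatten an explore-style payload into tuples, tolerating empty lists."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append((
            subplace,
            venue["name"],
            venue["location"]["distance"],
            categories[0]["name"] if categories else None,
        ))
    return rows

# Hand-written payload shaped like the fields used above
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Turn 'n Tender",
                           "location": {"distance": 627},
                           "categories": [{"name": "Steakhouse"}]}},
                # Invented entry with no category, to exercise the fallback
                {"venue": {"name": "Example Venue",
                           "location": {"distance": 900},
                           "categories": []}},
            ]
        }]
    }
}

rows = venue_rows(sample_payload, "Aanwins AH")
empty = venue_rows({"response": {"groups": []}}, "Aanwins AH")
print(rows, empty)
```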

In [83]:
roodepoort_venues = getNearbyVenues(subplace=roodepoort_data['subplace'],
                                   lat=roodepoort_data['lat'],
                                   lon=roodepoort_data['lon']
                                  )

Aanwins AH
Allen's Nek
Alsef AH
Ambot AH
Amorosa
Carenvale
Constantia Kloof
Cosmo City
Creswell Park
Davidsonville
Discovery
Durban Roodepoort Deep Gold Mine
Eagle Canyon
Floracliffe
Florida
Florida Glen
Florida Hills
Florida Lake
Florida North
Florida Park
Florida View
Georginia
Groblerpark
Hamberg
Harison Park
Harison View
Harveston AH
Helderkruin
Honeydew Ridge
Honey Hill
Horison
Jackal Creek Golf Estate
Kimbult AH
Kloofendal
Laser Park
Lindhaven
Little Falls
Manufacta
Matholesville
Mostyn Park AH
North Riding AH
Ontdekkers Park
Poortview AH
Princess
Princess AH
Radiokop
Reefhaven
Roodekrans
Roodekrans AH
Roodepoort SP
Roodeport North
Roodeport West
Ruimsig
Ruimsig AH
Selwyn
Sonnedal AH
Strubensvallei
Technikon
Tres Jolie AH
Weltevredenpark
Wilfordon AH
Wilgeheuwel
Willowbrook
Wilro Park
Witpoortjie
Zandspruit SP
Zonnehoewe AH


### Exploring the final data frame

In [84]:
print(roodepoort_venues.shape)
project.save_data("RoodepoortVenues.csv", roodepoort_venues.to_csv(), overwrite = True)
roodepoort_venues.head()

(305, 8)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Distance from Centre,Venue Latitude,Venue Longitude,Venue Category
0,Aanwins AH,-26.103782,27.882422,Thyme @ The Falls,639,-26.108729,27.885677,Restaurant
1,Aanwins AH,-26.103782,27.882422,Turn 'n Tender,627,-26.108698,27.885498,Steakhouse
2,Aanwins AH,-26.103782,27.882422,Plaashuis Coffee Shop,688,-26.109674,27.884525,Café
3,Aanwins AH,-26.103782,27.882422,Fish Aways Strubensvallei,818,-26.109695,27.887289,Seafood Restaurant
4,Allen's Nek,-26.130802,27.908789,Panarottis,501,-26.128111,27.904766,Pizza Place


Checking the number of venues for each neighborhood

In [88]:
# count venues per neighbourhood and rename the columns to match the places data
Counts_df = roodepoort_venues.groupby('Neighborhood')['Venue'].count().reset_index()
Counts_df.rename(columns={'Venue': 'restaurantcount', 'Neighborhood': 'subplace'}, inplace=True)
Counts_df

Unnamed: 0,subplace,restaurantcount
0,Aanwins AH,4
1,Allen's Nek,22
2,Alsef AH,1
3,Ambot AH,2
4,Amorosa,5
5,Carenvale,4
6,Constantia Kloof,5
7,Cosmo City,2
8,Creswell Park,2
9,Davidsonville,3


### 3.4 Analysing each neighbourhood

This analysis is included for completeness. Most neighbourhoods have too few venues to warrant it; the few exceptions can be analysed separately.

In [87]:
# one hot encoding
roodepoort_onehot = pd.get_dummies(roodepoort_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
roodepoort_onehot['Neighborhood'] = roodepoort_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [roodepoort_onehot.columns[-1]] + list(roodepoort_onehot.columns[:-1])
roodepoort_onehot = roodepoort_onehot[fixed_columns]

roodepoort_onehot.head()

Unnamed: 0,Neighborhood,African Restaurant,American Restaurant,Asian Restaurant,BBQ Joint,Bakery,Breakfast Spot,Burger Joint,Café,Chinese Restaurant,Deli / Bodega,Dim Sum Restaurant,Diner,Fast Food Restaurant,Fish & Chips Shop,Food,Food Truck,French Restaurant,Fried Chicken Joint,Gastropub,German Restaurant,Indian Restaurant,Italian Restaurant,Latin American Restaurant,Mexican Restaurant,Pizza Place,Portuguese Restaurant,Restaurant,Sandwich Place,Seafood Restaurant,Soup Place,Steakhouse,Sushi Restaurant,Thai Restaurant
0,Aanwins AH,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,Aanwins AH,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Aanwins AH,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Aanwins AH,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,Allen's Nek,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
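
One-hot rows like these are typically aggregated into the share of each category per neighbourhood. The sketch below does the same with plain counters on the five rows shown above; since only the first rows of the frame are displayed, the shares for Allen's Nek here are not its real figures. The helper name `category_shares` is illustrative.

```python
from collections import Counter, defaultdict

# (neighbourhood, category) pairs taken from the five rows displayed above
venues = [
    ("Aanwins AH", "Restaurant"),
    ("Aanwins AH", "Steakhouse"),
    ("Aanwins AH", "Café"),
    ("Aanwins AH", "Seafood Restaurant"),
    ("Allen's Nek", "Pizza Place"),
]

def category_shares(pairs):
    """Return, per neighbourhood, the fraction of venues in each category."""
    counts = defaultdict(Counter)
    for neighbourhood, category in pairs:
        counts[neighbourhood][category] += 1
    shares = {}
    for neighbourhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighbourhood] = {cat: n / total for cat, n in counter.items()}
    return shares

shares = category_shares(venues)
print(shares["Aanwins AH"])
```

On the full one-hot frame, `roodepoort_onehot.groupby('Neighborhood').mean()` should produce the equivalent table.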


## 4. Clustering the Neighbourhoods

Start by consolidating the dataframes

In [110]:
final_df = pd.merge(roodepoort_data, Counts_df, on='subplace', how='left').sort_values('restaurantcount' ,na_position='first').reset_index(drop=True)
final_df = final_df.fillna(0)
final_df.head(15)

Unnamed: 0,mainplace,subplace,population,area,lat,lon,restaurantcount
0,Roodepoort,Durban Roodepoort Deep Gold Mine,744,9.09,-26.174587,27.867128,0.0
1,Roodepoort,Kloofendal,1592,2.91,-26.131335,27.886817,0.0
2,Roodepoort,Matholesville,14905,1.3,-26.169962,27.847993,0.0
3,Roodepoort,Mostyn Park AH,211,1.33,-26.01084,27.9476,0.0
4,Roodepoort,Poortview AH,924,4.26,-26.090306,27.856052,0.0
5,Roodepoort,Princess AH,1988,1.92,-26.156927,27.834741,0.0
6,Roodepoort,Roodekrans,6457,4.21,-26.107365,27.845796,0.0
7,Roodepoort,Roodekrans AH,143,2.94,-26.107365,27.845796,0.0
8,Roodepoort,Roodeport North,3561,0.62,-26.159853,27.873633,0.0
9,Roodepoort,Witpoortjie,15368,5.6,-26.163183,27.825281,0.0


The output makes it clear which areas have few or no restaurants in close proximity, so it can serve as an immediate recommendation tool. However, the recommendations have limitations. For example, Witpoortjie is shown as having no restaurants, but this is because its business district lies further from the geocoded centre. That said, it is still worth noting that there is no restaurant within a 1 km radius of that centre.
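
A raw count ignores how many people live in each sub place. As a rough complementary view, the sketch below computes restaurants per 1,000 residents for a few rows taken from the `final_df` outputs in this section. The metric and the helper name `per_thousand` are illustrative additions, not part of the original analysis.

```python
# (subplace, population, restaurantcount) copied from the final_df outputs
rows = [
    ("Witpoortjie", 15368, 0),
    ("Matholesville", 14905, 0),
    ("Allen's Nek", 6373, 22),
    ("Strubensvallei", 5350, 28),
]

def per_thousand(population, restaurants):
    """Restaurants per 1,000 residents; None when the population is zero."""
    if population <= 0:
        return None
    return 1000.0 * restaurants / population

rates = {name: per_thousand(pop, n) for name, pop, n in rows}
# Sort from least to most served (all rates here are numeric)
ranked = sorted(rates, key=lambda name: rates[name])
print(ranked)
```

Populous sub places with a rate near zero, such as Witpoortjie and Matholesville in this sample, are the ones such a view would flag first.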

In [111]:
final_df.tail()

Unnamed: 0,mainplace,subplace,population,area,lat,lon,restaurantcount
62,Roodepoort,Wilgeheuwel,11822,4.24,-26.112614,27.898536,10.0
63,Roodepoort,Willowbrook,5291,2.34,-26.093547,27.879492,11.0
64,Roodepoort,Allen's Nek,6373,2.89,-26.130802,27.908789,22.0
65,Roodepoort,Strubensvallei,5350,2.58,-26.120135,27.901465,28.0
66,Roodepoort,Roodepoort SP,4904,1.44,-26.120135,27.901465,28.0


Strubensvallei and Roodepoort SP are probably not the best places to open a new restaurant, unless you can offer something radically different. Note that both were geocoded to the same coordinates, which is why their counts are identical.

__Running k-means to cluster the neighborhood into 5 clusters__

In [143]:
from sklearn import preprocessing

# set number of clusters
kclusters = 5

clus_df = final_df.drop(['mainplace', 'subplace', 'lat', 'lon'], axis=1)
names = clus_df.columns

# Create the Scaler object
scaler = preprocessing.StandardScaler()

# Fit your data on the scaler object
roodeport_clustering = scaler.fit_transform(clus_df)
roodeport_clustering = pd.DataFrame(roodeport_clustering, columns=names)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(roodeport_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 0, 4, 0, 3, 0, 4, 0, 0, 3], dtype=int32)
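
Scaling matters here because k-means works on Euclidean distances: population runs into the tens of thousands while area and restaurant counts stay in the single or double digits, so unscaled population would dominate. `StandardScaler` centres each column and divides it by its population standard deviation. The sketch below reproduces that calculation by hand for a single column, using the first five population values from `final_df`; the helper name `standardise` is illustrative.

```python
import statistics

def standardise(values):
    """Z-score a list: subtract the mean, divide by the population std (ddof=0)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# First five population values from final_df
populations = [744, 1592, 14905, 211, 924]
scaled = standardise(populations)
print([round(z, 2) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the distance calculation.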

In [144]:
# add clustering labels
final_df['Cluster Labels'] = kmeans.labels_

final_df.head(15) # check the last columns!

Unnamed: 0,mainplace,subplace,population,area,lat,lon,restaurantcount,Cluster Labels
0,Roodepoort,Durban Roodepoort Deep Gold Mine,744,9.09,-26.174587,27.867128,0.0,3
1,Roodepoort,Kloofendal,1592,2.91,-26.131335,27.886817,0.0,0
2,Roodepoort,Matholesville,14905,1.3,-26.169962,27.847993,0.0,4
3,Roodepoort,Mostyn Park AH,211,1.33,-26.01084,27.9476,0.0,0
4,Roodepoort,Poortview AH,924,4.26,-26.090306,27.856052,0.0,3
5,Roodepoort,Princess AH,1988,1.92,-26.156927,27.834741,0.0,0
6,Roodepoort,Roodekrans,6457,4.21,-26.107365,27.845796,0.0,4
7,Roodepoort,Roodekrans AH,143,2.94,-26.107365,27.845796,0.0,0
8,Roodepoort,Roodeport North,3561,0.62,-26.159853,27.873633,0.0,0
9,Roodepoort,Witpoortjie,15368,5.6,-26.163183,27.825281,0.0,3


Visualizing the resulting clusters

In [145]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(final_df['lat'], final_df['lon'], final_df['subplace'], final_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examining the Clusters

Cluster 1 (label 0)

In [146]:
final_df.loc[final_df['Cluster Labels'] == 0, final_df.columns[[1] + list(range(2, final_df.shape[1]))]]

Unnamed: 0,subplace,population,area,lat,lon,restaurantcount,Cluster Labels
1,Kloofendal,1592,2.91,-26.131335,27.886817,0.0,0
3,Mostyn Park AH,211,1.33,-26.01084,27.9476,0.0,0
5,Princess AH,1988,1.92,-26.156927,27.834741,0.0,0
7,Roodekrans AH,143,2.94,-26.107365,27.845796,0.0,0
8,Roodeport North,3561,0.62,-26.159853,27.873633,0.0,0
10,Wilfordon AH,17,2.23,-26.164188,27.85129,1.0,0
11,Alsef AH,299,1.68,-26.083618,27.897803,1.0,0
15,Georginia,2466,0.72,-26.164049,27.879492,1.0,0
17,Creswell Park,809,0.81,-26.172183,27.879492,2.0,0
18,Kimbult AH,1416,0.66,-26.087816,27.903662,2.0,0


Cluster 2 (label 1)

In [147]:
final_df.loc[final_df['Cluster Labels'] == 1, final_df.columns[[1] + list(range(2, final_df.shape[1]))]]

Unnamed: 0,subplace,population,area,lat,lon,restaurantcount,Cluster Labels
64,Allen's Nek,6373,2.89,-26.130802,27.908789,22.0,1
65,Strubensvallei,5350,2.58,-26.120135,27.901465,28.0,1
66,Roodepoort SP,4904,1.44,-26.120135,27.901465,28.0,1


Cluster 3 (label 2)

In [148]:
final_df.loc[final_df['Cluster Labels'] == 2, final_df.columns[[1] + list(range(2, final_df.shape[1]))]]

Unnamed: 0,subplace,population,area,lat,lon,restaurantcount,Cluster Labels
21,Cosmo City,44295,9.9,-26.021818,27.930758,2.0,2
56,Weltevredenpark,24429,10.02,-26.119422,27.930758,7.0,2
58,Florida,20082,6.32,-26.179417,27.916113,8.0,2


Cluster 4 (label 3)

In [149]:
final_df.loc[final_df['Cluster Labels'] == 3, final_df.columns[[1] + list(range(2, final_df.shape[1]))]]

Unnamed: 0,subplace,population,area,lat,lon,restaurantcount,Cluster Labels
0,Durban Roodepoort Deep Gold Mine,744,9.09,-26.174587,27.867128,0.0,3
4,Poortview AH,924,4.26,-26.090306,27.856052,0.0,3
9,Witpoortjie,15368,5.6,-26.163183,27.825281,0.0,3
13,Sonnedal AH,1487,4.95,-26.058556,27.913916,1.0,3
20,North Riding AH,1974,8.35,-26.051032,27.933687,2.0,3
28,Ruimsig,1591,4.11,-26.081996,27.863377,3.0,3


Cluster 5 (label 4)

In [150]:
final_df.loc[final_df['Cluster Labels'] == 4, final_df.columns[[1] + list(range(2, final_df.shape[1]))]]

Unnamed: 0,subplace,population,area,lat,lon,restaurantcount,Cluster Labels
2,Matholesville,14905,1.3,-26.169962,27.847993,0.0,4
6,Roodekrans,6457,4.21,-26.107365,27.845796,0.0,4
12,Radiokop,6751,2.07,-26.107887,27.914648,1.0,4
14,Zandspruit SP,31716,1.0,-26.010184,27.94101,1.0,4
16,Groblerpark,6774,2.29,-26.146561,27.839935,1.0,4
25,Lindhaven,4426,1.51,-26.145774,27.850191,2.0,4
29,Davidsonville,5343,1.62,-26.154959,27.851656,3.0,4
30,Ruimsig AH,3828,4.87,-26.077539,27.891212,3.0,4
32,Princess,6213,0.37,-26.134488,27.845796,4.0,4
38,Discovery,8110,3.39,-26.157221,27.892677,4.0,4
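
Reading the clusters row by row is easier with a per-cluster summary. The sketch below averages population, area and restaurant count for the rows shown under labels 1 and 2 in the tables above; the helper name `cluster_means` is illustrative.

```python
from collections import defaultdict

# (label, population, area, restaurantcount) copied from the tables for labels 1 and 2
rows = [
    (1, 6373, 2.89, 22.0),
    (1, 5350, 2.58, 28.0),
    (1, 4904, 1.44, 28.0),
    (2, 44295, 9.90, 2.0),
    (2, 24429, 10.02, 7.0),
    (2, 20082, 6.32, 8.0),
]

def cluster_means(records):
    """Return {label: [mean population, mean area, mean restaurant count]}."""
    grouped = defaultdict(list)
    for label, *features in records:
        grouped[label].append(features)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in grouped.items()
    }

means = cluster_means(rows)
print(means)
```

In this sample, label 1 groups small, restaurant-dense sub places, while label 2 groups large, populous ones with comparatively few restaurants. On the real frame, `final_df.groupby('Cluster Labels')[['population', 'area', 'restaurantcount']].mean()` should give the same summary for every cluster.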
