## Data Science Capstone Project 

### The Battle of the Neighbourhoods (Los Angeles) by [Prosper Nnaeto]

## <u>Data</u> <a name="data"></a>

Based on the criteria specified above, the factors that will influence the final decision are: -
* Number of existing restaurants in the neighbourhood (any type of restaurant) 
* Number of and distance to vegan/vegetarian restaurants in the neighbourhood
* Distance of neighbourhood from city center
* Average neighbourhood rent

The following data sources will be needed to extract/generate the required information: -
* List of all neighbourhoods in LA - https://en.wikipedia.org/wiki/List_of_districts_and_neighbourhoods_of_Los_Angeles
* Coordinates of all neighbourhoods and venues - **GeoPy Nominatim geocoding**
* Number of restaurants and their type and location in every neighbourhood - **Foursquare API** -  https://developer.foursquare.com
* LA rent data - https://www.rentcafe.com/average-rent-market-trends/us/ca/los-angeles/


## <u>Methodology</u><a name="methodology"></a>

<p style='text-align: justify;'>In this project the first step will be to collect data on the neighbourhoods of Los Angeles from the internet. There are no relevant datasets available for this and therefore, data will need to be scraped from a webpage. The location coordinates of each neighbourhood will then be obtained with the help of GeoPy Nominatim geolocator and appended to the neighbourhood data. Using this data, a folium map of the Los Angeles neighbourhoods will be created.</p>

<p style='text-align: justify;'>The second step will be to explore each of the neighbourhoods and their venues using Foursquare location data. The venues will be analyzed in detail to discover patterns, which will be done by grouping the neighbourhoods using k-means clustering. Following this, each cluster will be examined and a decision will be made regarding which cluster fits the shareholder's requirements. The factor that will determine this is the frequency of occurrence of restaurants and other food venues within the cluster.</p>

<p style='text-align: justify;'>Once a cluster is picked, the neighbourhoods in that cluster will be investigated with regard to the number of vegan/vegetarian restaurants in their vicinity. The ones that fit the requirements will be further explored and shortlisted based on how close they are to the center of Los Angeles. Finally, if multiple neighbourhoods fit these conditions, Los Angeles rent data can be used to inform the shareholder's decision.</p>

<p style='text-align: justify;'>The results of the analysis will highlight potential neighbourhoods where a vegan/vegetarian restaurant may be opened based on geographical location and proximity to competitors. This will only serve as a starting point since many other factors influence such a decision.</p>

## <u>Analysis</u><a name="analysis"></a>

* [Importing Libraries](#import)
* [Web Scraping Neighbourhood Data](#scrapenh)
* [Loading and Cleaning Neighbourhood Data](#clean)
* [Obtaining Neighbourhood Coordinates](#coordinates)
* [LA Neighbourhood Map](#lamap)
* [Defining Foursquare Credentials and Version](#foursquare)
* [Exploring the first Neighbourhood](#first)
* [Exploring all Neighbourhoods](#all)
* [Analyzing each Neighbourhood](#analyze)
* [Clustering Neighbourhoods](#cluster)
* [Examining the Clusters](#examine)
* [Visualizing Top 10 Venues for each Cluster](#visualize)
* [Investigating the chosen Cluster](#investigate)
* [Web Scraping Rent Data](#rent)

### Importing Libraries <a name="import"></a>

The first step in the analysis is importing the required libraries.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
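try:
    from pandas import json_normalize # transform JSON into a pandas data frame (pandas >= 1.0)
except ImportError:
    from pandas.io.json import json_normalize # import path used by older pandas releases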

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

from bs4 import BeautifulSoup

print('Libraries imported.')


In [4]:
import re

### Web Scraping Neighbourhood Data <a name="scrapenh"></a>

The list of all neighbourhoods in LA is obtained by scraping the relevant webpage. The data in the webpage is in the form of a list and not a table. Therefore, the data is obtained by searching for all list items and then using a particular characteristic that groups the required items.

In [5]:
url = requests.get('https://en.wikipedia.org/wiki/List_of_districts_and_neighbourhoods_of_Los_Angeles').text
soup = BeautifulSoup(url,"html.parser")

In [6]:
lis = []
for li in soup.find_all('li'):
    if li.find(href="/wiki/Portal:Los_Angeles"):
        break
    if li.find(href=re.compile("^/wiki/")):
        lis.append(li)
    if li.text=='Pico Robertson[34]': #Pico Robertson is the only item on the list that does not have a hyperlink reference
        lis.append(li)


### Loading and Cleaning Neighbourhood Data <a name="clean"></a>

In [7]:
neigh = []
for i in range(0,len(lis)):
    neigh.append(lis[i].text.strip())
    
df = pd.DataFrame(neigh)
df.columns = ['Neighbourhood']

In [8]:
df

|   | Neighbourhood |
|---|---------------|
| 0 | Angelino Heights[1] |
| 1 | Angeles Mesa |
| 2 | Angelus Vista |
| 3 | Arleta[2][1] |
| 4 | Arlington Heights[2] |
| 5 | Arts District[3] |
| 6 | Atwater Village[2] |
| 7 | Baldwin Hills[1] |
| 8 | Baldwin Hills/Crenshaw[2] |
| 9 | Baldwin Village[1] |


In [9]:
df['Neighbourhood'] = df.Neighbourhood.str.partition('[')[0] #Removes the citation and reference brackets
df['Neighbourhood'] = df.Neighbourhood.str.partition(',')[0] #Removes the alternatives for 'Bel Air'
df=df[df.Neighbourhood!='Baldwin Hills/Crenshaw'] #Removes redundancy as 'Baldwin Hills' and 'Crenshaw' exist already
df=df[df.Neighbourhood!='Hollywood Hills West'] #Removes redundancy as it has the same coordinates as 'Hollywood Hills'
df=df[df.Neighbourhood!='Brentwood Circle'] #Removes redundancy as it has the same coordinates as 'Brentwood'
df=df[df.Neighbourhood!='Wilshire Park'] #Removes redundancy as it has the same coordinates as 'Wilshire Center'
df.reset_index(inplace=True,drop=True)
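
The cleaning rules above can also be checked on plain strings, without pandas. The following is a minimal sketch; the helper names `clean_neighbourhood_name` and `clean_neighbourhoods` and the `REDUNDANT` set are illustrative and not part of the original notebook.

```python
# Hypothetical helpers mirroring the cleaning cell above on plain Python strings.
REDUNDANT = {
    'Baldwin Hills/Crenshaw',
    'Hollywood Hills West',
    'Brentwood Circle',
    'Wilshire Park',
}

def clean_neighbourhood_name(raw):
    """Strip citation brackets and comma-separated alternatives from a scraped name."""
    name = raw.partition('[')[0]   # 'Arleta[2][1]' -> 'Arleta'
    name = name.partition(',')[0]  # keep only the text before the first comma
    return name.strip()

def clean_neighbourhoods(raw_names):
    """Clean every name and drop the entries treated as redundant."""
    cleaned = [clean_neighbourhood_name(n) for n in raw_names]
    return [n for n in cleaned if n not in REDUNDANT]

sample = ['Angelino Heights[1]', 'Arleta[2][1]', 'Baldwin Hills/Crenshaw[2]', 'Angeles Mesa']
print(clean_neighbourhoods(sample))
```

Keeping the redundant names in one set makes it easier to review the exclusions than a chain of separate filters.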

### Obtaining Neighbourhood Coordinates  <a name="coordinates"></a>

In [10]:
# define the data frame columns
column_names = ['Neighbourhood', 'Latitude', 'Longitude'] 

# instantiate the data frame
nhoods = pd.DataFrame(columns=column_names)

Using GeoPy Nominatim geolocator with the user_agent "la_explorer".

In [11]:
geolocator = Nominatim(user_agent="la_explorer",timeout=5)
rows = []
for name in df.Neighbourhood:
    
    address = name+', Los Angeles'
    location = geolocator.geocode(address)
    if location is None: # geocoding failed; zeros mark the row for removal later
        latitude = 0
        longitude = 0
    else:
        latitude = location.latitude
        longitude = location.longitude

    rows.append({'Neighbourhood': name,
                 'Latitude': latitude,
                 'Longitude': longitude})

# DataFrame.append was removed in pandas 2.0, so the frame is built once from the collected rows
nhoods = pd.DataFrame(rows, columns=column_names)
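
The geocoding loop can be separated from the network call so that it can be tried offline. The sketch below assumes a hypothetical helper `geocode_all` that accepts any callable returning an object with `latitude` and `longitude` attributes (or `None`), such as `geolocator.geocode`; `fake_geocode` is a stand-in used only for illustration. Unlike the cell above, failed lookups are recorded as `None` rather than zero.

```python
from collections import namedtuple

# Stand-in for the object returned by a geocoder; only the two attributes used here.
Location = namedtuple('Location', ['latitude', 'longitude'])

def geocode_all(names, geocode, suffix=', Los Angeles'):
    """Return one dict per name; coordinates are None when the lookup fails."""
    rows = []
    for name in names:
        location = geocode(name + suffix)
        rows.append({
            'Neighbourhood': name,
            'Latitude': location.latitude if location is not None else None,
            'Longitude': location.longitude if location is not None else None,
        })
    return rows

# Fake geocoder for illustration; in the notebook this would be geolocator.geocode.
def fake_geocode(address):
    known = {'Arleta, Los Angeles': Location(34.241327, -118.432205)}
    return known.get(address)

rows = geocode_all(['Arleta', 'Angelus Vista'], fake_geocode)
print(rows)
```

Passing the geocoder in as an argument also makes it straightforward to swap in a rate-limited or cached version later.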

Clean neighbourhood data with the respective coordinates: -

In [12]:
nhoods

|   | Neighbourhood | Latitude | Longitude |
|---|---------------|----------|-----------|
| 0 | Angelino Heights | 34.070289 | -118.254796 |
| 1 | Angeles Mesa | 33.991402 | -118.31952 |
| 2 | Angelus Vista | 0.0 | 0.0 |
| 3 | Arleta | 34.241327 | -118.432205 |
| 4 | Arlington Heights | 34.043494 | -118.321374 |
| 5 | Arts District | 34.041239 | -118.23445 |
| 6 | Atwater Village | 34.118698 | -118.262392 |
| 7 | Baldwin Hills | 34.010989 | -118.337071 |
| 8 | Baldwin Village | 34.019456 | -118.34591 |
| 9 | Baldwin Vista | 0.0 | 0.0 |


Deleting neighbourhoods with missing (zero) values and obvious geocoding errors: -


In [13]:

nhoods['Latitude']=nhoods['Latitude'].astype(float)
nhoods['Longitude']=nhoods['Longitude'].astype(float)

nhoods=nhoods[(nhoods.Latitude>33.5) & (nhoods.Latitude<34.4) & (nhoods.Longitude<-118)] 
nhoods.reset_index(inplace=True,drop=True)
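
The same bounding-box check can be written as a small predicate, which makes the thresholds easy to test on individual coordinates. This is a minimal sketch; the function name `within_la_bounds` is illustrative, and the bounds are copied from the filter above. Note that the filter has no western limit on longitude.

```python
# Bounds copied from the data frame filter above.
LAT_MIN, LAT_MAX = 33.5, 34.4
LON_MAX = -118.0

def within_la_bounds(lat, lon):
    """True when a coordinate pair passes the same checks as the data frame filter."""
    return LAT_MIN < lat < LAT_MAX and lon < LON_MAX

print(within_la_bounds(34.070289, -118.254796))  # Angelino Heights
print(within_la_bounds(0.0, 0.0))                # placeholder for a failed geocode
```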

Complete neighbourhood data frame: -

In [14]:
nhoods

|   | Neighbourhood | Latitude | Longitude |
|---|---------------|----------|-----------|
| 0 | Angelino Heights | 34.070289 | -118.254796 |
| 1 | Angeles Mesa | 33.991402 | -118.31952 |
| 2 | Arleta | 34.241327 | -118.432205 |
| 3 | Arlington Heights | 34.043494 | -118.321374 |
| 4 | Arts District | 34.041239 | -118.23445 |
| 5 | Atwater Village | 34.118698 | -118.262392 |
| 6 | Baldwin Hills | 34.010989 | -118.337071 |
| 7 | Baldwin Village | 34.019456 | -118.34591 |
| 8 | Beachwood Canyon | 34.122292 | -118.321384 |
| 9 | Bel Air | 34.098883 | -118.459881 |


### LA Neighbourhood Map <a name="lamap"></a>

Obtaining the coordinates of the center of LA: -

In [15]:
address = 'Los Angeles, USA'

geolocator = Nominatim(user_agent="la_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address,latitude, longitude))

The geographical coordinates of Los Angeles, USA are 34.0536909, -118.2427666.


Creating a map of LA with neighbourhoods superimposed on top: -

In [16]:
# create map of LA using latitude and longitude values
map_la = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighbourhood in zip(nhoods['Latitude'], nhoods['Longitude'], nhoods['Neighbourhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3199cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_la)  
    
map_la

### Defining Foursquare Credentials and Version   <a name="foursquare"></a>

In [46]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # Foursquare ID (placeholder; keep the real value private)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # Foursquare Secret (placeholder; keep the real value private)
VERSION = '20200605' # Foursquare API version
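
One way to avoid storing credentials in the notebook is to read them from environment variables. The sketch below is an assumption about how that could be done: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are chosen for this example and are not mandated by Foursquare.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment.')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters so a value can be checked without leaking it."""
    return secret[:visible] + '*' * (len(secret) - visible)

# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {
    'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
    'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456',
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID:', mask(client_id))
```

In the notebook, calling `load_foursquare_credentials()` with no argument would read from the actual environment.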


### Exploring the first Neighbourhood <a name="first"></a>

In [47]:
neighbourhood_latitude = nhoods.loc[0, 'Latitude'] # neighbourhood latitude value
neighbourhood_longitude = nhoods.loc[0, 'Longitude'] # neighbourhood longitude value

neighbourhood_name = nhoods.loc[0, 'Neighbourhood'] # neighbourhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of Angelino Heights are 34.0702889, -118.2547965.


In [73]:
LIMIT = 100 # number of venues requested per call
cities = ['Los Angeles']
results = {}
for city in cities:
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&near={}&limit={}&categoryId={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        city,
        LIMIT,
        "4bf58dd8d48988d1d3941735") # Vegan/vegetarian restaurant category ID

In [53]:
results[city] = requests.get(url).json()
results 

{'Los Angeles': {'meta': {'code': 200,
   'requestId': '5edf4f3e0cc1fd001b9d1ddc'},
  'response': {'suggestedFilters': {'header': 'Tap to show:',
    'filters': [{'name': '$-$$$$', 'key': 'price'}]},
   'geocode': {'what': '',
    'where': 'los angeles',
    'center': {'lat': 34.05223, 'lng': -118.24368},
    'displayString': 'Los Angeles, CA, United States',
    'cc': 'US',
    'geometry': {'bounds': {'ne': {'lat': 34.337306, 'lng': -118.155289},
      'sw': {'lat': 33.703652, 'lng': -118.668176}}},
    'slug': 'los-angeles-california',
    'longId': '72057594043296297'},
   'headerLocation': 'Los Angeles',
   'headerFullLocation': 'Los Angeles',
   'headerLocationGranularity': 'city',
   'query': 'vegetarian vegan',
   'totalResults': 245,
   'suggestedBounds': {'ne': {'lat': 34.26382807081869,
     'lng': -118.17005466082605},
    'sw': {'lat': 33.94520505227639, 'lng': -118.6262885892188}},
   'groups': [{'type': 'Recommended Places',
     'name': 'recommended',
     'items': [{'re

In [63]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Nearby venues of the first neighbourhood: -

In [68]:
venues = results['Los Angeles']['response']['groups'][0]['items'] # results is keyed by city name
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns].copy()

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

### Exploring all Neighbourhoods  <a name="all"></a>

Function to get the nearby venues of all neighbourhoods and load the data into a data frame: -

In [69]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request; a failed call (e.g. invalid credentials or an exhausted quota)
        # comes back without the 'groups' key, so such neighbourhoods are skipped
        response = requests.get(url).json().get('response', {})
        if not response.get('groups'):
            print('No venue data returned for {}; skipping.'.format(name))
            continue
        results = response['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list],
                                 columns=['Neighbourhood', 
                                          'Neighbourhood Latitude', 
                                          'Neighbourhood Longitude', 
                                          'Venue', 
                                          'Venue Latitude', 
                                          'Venue Longitude', 
                                          'Venue Category'])
    
    return nearby_venues

In [70]:
la_venues = getNearbyVenues(names=nhoods['Neighbourhood'],
                                   latitudes=nhoods['Latitude'],
                                   longitudes=nhoods['Longitude']
                                  )
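
A `KeyError` on `'groups'` arises when the decoded response does not contain venue groups. The sketch below shows one way to build the request URL with encoded parameters and to read the response defensively. The helper names `build_explore_url` and `extract_items` are hypothetical, and the two sample payloads are assumed shapes for illustration, not captured API responses.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

def extract_items(payload):
    """Return the venue items from a decoded response, or [] when none are present."""
    groups = payload.get('response', {}).get('groups', [])
    return groups[0].get('items', []) if groups else []

url = build_explore_url('ID', 'SECRET', '20200605', 34.070289, -118.254796)
print(url)

# Assumed payload shapes, for illustration only.
failed = {'meta': {'code': 429}, 'response': {}}
ok = {'meta': {'code': 200},
      'response': {'groups': [{'items': [{'venue': {'name': 'Example'}}]}]}}
print(len(extract_items(failed)), len(extract_items(ok)))
```

Returning an empty list for a failed call lets the caller decide whether to skip the neighbourhood, retry, or stop.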

Data frame of all venues: -

In [34]:
print(la_venues.shape)
la_venues.head()


In [35]:
la_venues_count=la_venues.groupby('Neighbourhood').count()
la_venues_count.drop(la_venues_count.columns[[0,1,3,4,5]], axis=1,inplace=True)


In [None]:
la_venues_count.reset_index(inplace=True)

It makes sense to set up a restaurant in one of the more popular neighbourhoods so that the restaurant attracts the attention of a lot more people.

Therefore, a list of all the popular neighbourhoods i.e. the neighbourhoods with 10 or more venues is obtained: -

In [None]:
pop_neigh=la_venues_count[(la_venues_count.Venue>=10)]
pop_neigh.reset_index(drop=True,inplace=True)
pop_neigh

Updating the venues data frame to include only the venues which are in popular neighbourhoods: -

In [None]:
pop_list=pop_neigh['Neighbourhood'].values.tolist()

# keep only the venues located in popular neighbourhoods
la_venues=la_venues[la_venues.Neighbourhood.isin(pop_list)]
la_venues.reset_index(drop=True,inplace=True)

In [None]:
la_venues

In [None]:
print('There are {} unique categories.'.format(len(la_venues['Venue Category'].unique())))

### Analyzing each Neighbourhood  <a name="analyze"></a>

In [None]:
# one hot encoding
la_onehot = pd.get_dummies(la_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to data frame
la_onehot['Neighbourhood'] = la_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [la_onehot.columns[-1]] + list(la_onehot.columns[:-1])
la_onehot = la_onehot[fixed_columns]

la_onehot.head()

Grouping rows by neighbourhood by taking the mean of the frequency of occurrence of each category: -

In [None]:
la_grouped = la_onehot.groupby('Neighbourhood').mean().reset_index()
la_grouped
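
Taking the mean of one-hot columns per neighbourhood is equivalent to computing, for each category, its share of that neighbourhood's venues. The following minimal sketch reproduces that arithmetic with the standard library; the function `category_frequencies` and the sample venue list are made up for illustration.

```python
from collections import Counter, defaultdict

def category_frequencies(venues):
    """Map each neighbourhood to {category: share of that neighbourhood's venues}."""
    counts = defaultdict(Counter)
    for neighbourhood, category in venues:
        counts[neighbourhood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }

# Made-up venue list used only to show the arithmetic.
sample = [
    ('A', 'Coffee Shop'), ('A', 'Coffee Shop'), ('A', 'Park'), ('A', 'Gym'),
    ('B', 'Park'), ('B', 'Museum'),
]
freqs = category_frequencies(sample)
print(freqs['A']['Coffee Shop'])  # 2 of the 4 venues in 'A'
```

Because the shares in each neighbourhood sum to 1, neighbourhoods with very different venue counts become comparable before clustering.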

Printing each neighbourhood along with the top 5 most common venues: -

In [None]:
num_top_venues = 5

for hood in la_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = la_grouped[la_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['VENUE','FREQ']
    temp = temp.iloc[1:]
    temp['FREQ'] = temp['FREQ'].astype(float)
    temp = temp.round({'FREQ': 2})
    print(temp.sort_values('FREQ', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Creating a new data frame and displaying the top 10 venues for each neighbourhood: -

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new data frame
Neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
Neighbourhoods_venues_sorted['Neighbourhood'] = la_grouped['Neighbourhood']

for ind in np.arange(la_grouped.shape[0]):
    Neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(la_grouped.iloc[ind, :], num_top_venues)

Neighbourhoods_venues_sorted.head()

### Clustering Neighbourhoods  <a name="cluster"></a>

The first step is to determine the optimal value of K for the dataset using the **Silhouette Coefficient Method.**

The Silhouette Coefficient of a data point measures how well it is matched to its own cluster compared with neighbouring clusters. A higher mean score across all points therefore indicates a model with better-defined clusters.
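
For reference, the standard definition of the coefficient for a single point $i$ is given below, where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the mean distance from $i$ to the points of the nearest other cluster. The score reported by `silhouette_score` is the mean of $s(i)$ over all points.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values near 1 mean the point sits well inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may have been assigned to the wrong cluster.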

In [None]:
from sklearn.metrics import silhouette_score

la_grouped_clustering = la_grouped.drop('Neighbourhood', axis=1)

for n_cluster in range(2, 10):
    kmeans = KMeans(n_clusters=n_cluster, random_state=0).fit(la_grouped_clustering) # fixed seed for reproducible scores
    label = kmeans.labels_
    sil_coeff = silhouette_score(la_grouped_clustering, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))

The Silhouette Coefficient is the highest for n_clusters=4. Therefore, the neighbourhoods shall be grouped into 4 clusters (k=4) using ***k*-means clustering.**

In [None]:
# set number of clusters
kclusters = 4

la_grouped_clustering = la_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(la_grouped_clustering)

# check cluster labels generated for each row in the data frame
kmeans.labels_

Creating a new data frame that includes the cluster as well as the top 10 venues for each neighbourhood: -

In [None]:
# add clustering labels
Neighbourhoods_venues_sorted.insert(0, 'Cluster Label', kmeans.labels_.astype(int))
# Neighbourhoods_venues_sorted['Cluster Label']=kmeans.labels_.astype(int)
la_merged = nhoods

# merge la_grouped with nhoods to add latitude/longitude for each Neighbourhood
la_merged = la_merged.join(Neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
la_merged.dropna(inplace=True)
la_merged['Cluster Label'] = la_merged['Cluster Label'].astype(int)
la_merged.head() 

Visualizing the resulting neighbourhood clusters on the map: -

In [None]:
import matplotlib.colors as colors
from matplotlib.colors import rgb2hex
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
rainbow[2]='#006000'
rainbow[1]='#006ff6'
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(la_merged['Latitude'], la_merged['Longitude'], la_merged['Neighbourhood'], la_merged['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-2],
        fill=True,
        fill_color=rainbow[cluster-2],
        fill_opacity=0.7).add_to(map_clusters)
legend_html =   '''
                <div style="position: fixed; 
                            bottom: 100px; left: 50px; width: 120px; height: 80px; 
                            border:3px solid black; z-index:9999; font-size:13px;
                            ">&nbsp; Green - Cluster 0 <br>
                              &nbsp; Red - Cluster 1 <br>
                              &nbsp; Purple - Cluster 2 <br>
                              &nbsp; Blue - Cluster 3
                </div>
                ''' 

map_clusters.get_root().html.add_child(folium.Element(legend_html))
map_clusters
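
The marker colours above are selected with `rainbow[cluster-2]`, which relies on Python's negative indexing to line up with the legend. An explicit lookup table may be easier to keep in sync. In the sketch below, `'#006000'` and `'#006ff6'` come from the cell above, while the red and purple hex values are assumed stand-ins for the end points of the rainbow colormap; `CLUSTER_COLOURS` and `colour_for` are illustrative names.

```python
# Explicit label -> colour lookup matching the legend above.
CLUSTER_COLOURS = {
    0: '#006000',  # green
    1: '#ff0000',  # red (assumed value)
    2: '#8000ff',  # purple (assumed value)
    3: '#006ff6',  # blue
}

def colour_for(cluster):
    """Return the marker colour for a cluster label."""
    return CLUSTER_COLOURS[cluster]

# The indexing trick from the cell above, reproduced on a plain list for comparison.
rainbow = ['#8000ff', '#006ff6', '#006000', '#ff0000']
for cluster in range(4):
    assert rainbow[cluster - 2] == colour_for(cluster)
    print(cluster, colour_for(cluster))
```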

### Examining the Clusters <a name="examine"></a>

Creating a data frame for each cluster that includes the top 10 venues for each of its neighbourhoods: -

In [None]:
la_merged.loc[la_merged['Cluster Label'] == 0, la_merged.columns[[0] + list(range(4, la_merged.shape[1]))]]

In [None]:
la_merged.loc[la_merged['Cluster Label'] == 1, la_merged.columns[[0] + list(range(4, la_merged.shape[1]))]]

In [None]:
la_merged.loc[la_merged['Cluster Label'] == 2, la_merged.columns[[0] + list(range(4, la_merged.shape[1]))]]

In [None]:
la_merged.loc[la_merged['Cluster Label'] == 3, la_merged.columns[[0] + list(range(4, la_merged.shape[1]))]]

Creating a data frame grouped by clusters by taking the mean of the frequency of occurrence of each venue category: -

In [None]:
la_results = pd.DataFrame(kmeans.cluster_centers_)
la_results.columns = la_grouped_clustering.columns
la_results.index = ['Cluster 0','Cluster 1','Cluster 2','Cluster 3']
la_results['Total Sum'] = la_results.sum(axis = 1)
la_results

### Visualizing Top 10 Venues for each Cluster  <a name="visualize"></a>

In [74]:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

Function to generate a horizontal bar plot showing the top 10 venues for each cluster, highlighting the food venues: -

In [75]:
def generate_plot(clus,i):
    
    plt.style.use('default')

    tags=['Restaurant','Coffee','Food','Pizza','Sandwich']
    colors = []
    for value in clus.index: 
        if  any(t in value for t in tags):
            colors.append('#a80000')
        else:
            colors.append('#32069f')

    ax=clus.plot(kind='barh', figsize=(16,8), color=colors, alpha=0.7)

    plt.title('(in % of all venues)\n')
    ax.title.set_fontsize(14)
    plt.suptitle('Ten Most Prevalent Venues of Cluster {}'.format(i), fontsize=16)

    ax.spines['top'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)

    plt.xticks([])
    ax.tick_params(axis ='both', which ='both', length = 0)
    labels = [(item.get_text()+'  ') for item in ax.get_yticklabels()]
    ax.set_yticklabels(labels)

    for label in (ax.get_yticklabels()):
        label.set_fontsize(12)

    for index, value in enumerate(clus): 
        label = "%.1f " % round(value*100,1) + "%"
        # place text just past the end of the bar (adding 0.001 to x and subtracting 0.1 from y)
        plt.annotate(label, xy=(value + 0.001, index - 0.1), color='black',fontsize=12)

    legend_elements = [Patch(facecolor='#a80000', edgecolor='#a80000',
                             label='Food Venues',alpha=0.7),
                       Patch(facecolor='#32069f', edgecolor='#32069f',
                             label='Others',alpha=0.7)]

    ax.legend(handles=legend_elements, loc='best',fontsize=12)

    plt.show()
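
The bars are coloured by a substring match against a short list of tags. The sketch below isolates that rule so its behaviour can be checked on individual category names. `is_food_venue` and `count_food_venues` are hypothetical helper names, and the category list is illustrative rather than a result of the analysis.

```python
FOOD_TAGS = ('Restaurant', 'Coffee', 'Food', 'Pizza', 'Sandwich')

def is_food_venue(category, tags=FOOD_TAGS):
    """True when the category name contains any of the food-related tags."""
    return any(tag in category for tag in tags)

def count_food_venues(categories):
    """Count how many categories in a top-N list are flagged as food venues."""
    return sum(is_food_venue(c) for c in categories)

# Illustrative category names, not results from the analysis.
top = ['Korean Restaurant', 'Coffee Shop', 'Park', 'Food Truck', 'Gym', 'Bakery']
print(count_food_venues(top))
```

A substring rule like this does not flag categories such as 'Bakery', so the tag list may need extending depending on which categories appear in the data.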

#### *Cluster 0*

In [76]:
cluster0=pd.DataFrame(la_results.iloc[0,0:-1]).transpose()
cluster0.sort_values(by='Cluster 0',axis=1,ascending=False,inplace=True)
display(cluster0)

clus0=cluster0.iloc[0,9::-1]
generate_plot(clus0,0)


There are 6 food venues in the top 10 venues of Cluster 0 with Mexican Restaurants making up nearly 20% of all venues. These facts indicate that Cluster 0 would not be the best one to explore further in terms of setting up a new restaurant.


#### *Cluster 1*

In [None]:
cluster1=pd.DataFrame(la_results.iloc[1,0:-1]).transpose()
cluster1.sort_values(by='Cluster 1',axis=1,ascending=False,inplace=True)
display(cluster1)

clus1=cluster1.iloc[0,9::-1]
generate_plot(clus1,1)

There are 4 food venues in the top 10 venues of Cluster 1, with Korean Restaurants making up by far the largest share (nearly 30%) of all venues. This is unsurprising as Cluster 1 consists of only two neighbourhoods, one being Koreatown and the other (Mid-Wilshire) also having a lot of Korean Restaurants. While there are only 4 food venues in the top 10, the complete dominance of Korean Restaurants in the area indicates that Cluster 1 need not be looked into any further.

#### *Cluster 2*

In [None]:
cluster2=pd.DataFrame(la_results.iloc[2,0:-1]).transpose()
cluster2.sort_values(by='Cluster 2',axis=1,ascending=False,inplace=True)
cluster2.rename(columns={'Residential Building (Apartment / Condo)': 'Apartment / Condo'},inplace=True)
display(cluster2)

clus2=cluster2.iloc[0,9::-1]
generate_plot(clus2,2)

There are only 2 food venues in the top 10 venues of Cluster 2. Moreover, the two venues are Food Trucks and Coffee Shops as opposed to proper restaurants. There are a lot of public venues in this cluster, venues that see a lot of footfall such as parks, museums, gyms and department stores. The presence of condominium complexes in this list also suggests that the population density of these neighbourhoods is high. All of these observations point towards Cluster 2 as the cluster to explore further.

Having said that, the decision to explore Cluster 2 can only be confirmed after examining Cluster 3: -

#### *Cluster 3*

In [None]:
cluster3=pd.DataFrame(la_results.iloc[3,0:-1]).transpose()
cluster3.sort_values(by='Cluster 3',axis=1,ascending=False,inplace=True)
display(cluster3)

clus3=cluster3.iloc[0,9::-1]
generate_plot(clus3,3)

There are 8 food venues in the top 10 venues of Cluster 3, which is a huge proportion. Except for the number 1 venue (Coffee Shops), all other food venues are proper restaurants. This clearly indicates that the neighbourhoods in Cluster 3 are already saturated with restaurants and need not be considered when opening a new one.

It is now safe to confirm the decision of investigating **Cluster 2** further and eliminating all other clusters.

### Investigating the chosen Cluster    <a name="investigate"></a>

In [None]:
clus2neigh=la_merged.loc[la_merged['Cluster Label'] == 2, la_merged.columns[0]].values.tolist()
clus2neigh

In [None]:
# keep only the neighbourhoods that belong to Cluster 2
filtered_nhoods=nhoods[nhoods.Neighbourhood.isin(clus2neigh)].copy()

In [None]:
filtered_nhoods.reset_index(drop=True,inplace=True)

The neighbourhoods in Cluster 2 along with their coordinates: -

In [None]:
filtered_nhoods

Function to obtain and display the closest Vegan/vegetarian restaurants from each neighbourhood in Cluster 2 and the corresponding distances: -

In [None]:
def get_neigh_vegan(url1):
    
    results = requests.get(url1).json()

    # assign relevant part of JSON to venues
    venues = results['response']['venues']
    if not venues: # nothing found within the search radius
        print('No vegan/vegetarian restaurants found.')
        return

    # transform venues into a data frame
    dataframe = json_normalize(venues)

    # keep only columns that include venue name, and anything that is associated with location
    filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
    dataframe_filtered = dataframe.loc[:, filtered_columns].copy() # copy to avoid SettingWithCopyWarning

    # filter the category for each row
    dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

    # clean column names by keeping only last term
    dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
    
    display(dataframe_filtered.loc[:,['name','categories','distance','lat','lng']])

In [77]:
category='4bf58dd8d48988d1d3941735' # category ID for vegan/vegetarian restaurants, obtained from https://developer.foursquare.com/docs/resources/categories
radius = 700
LIMIT=30

In [None]:
for n in range(0,len(filtered_nhoods)):
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    filtered_nhoods.iloc[n,1], 
    filtered_nhoods.iloc[n,2], 
    VERSION, 
    category, 
    radius, 
    LIMIT)
    print('------------------------------------------------- '+ filtered_nhoods.iloc[n,0] + ' -------------------------------------------------')
    get_neigh_vegan(url)
    print('\n\n')

From the data frames above, it can be observed that Park La Brea has 7 vegan/vegetarian restaurants within 700 meters of its center. Hancock Park has fewer (3), but two of them are less than 250 meters from its center. This indicates that Park La Brea and Hancock Park would not be suitable neighbourhoods for a new vegan/vegetarian restaurant, so they can be eliminated. This leaves the following neighbourhoods: -

In [None]:
filter2_nhoods=filtered_nhoods[(filtered_nhoods.Neighbourhood !='Park La Brea') & (filtered_nhoods.Neighbourhood !='Hancock Park')]
filter2_nhoods.reset_index(drop=True,inplace=True)
filter2_nhoods

Computing the distance of each neighbourhood from the center of LA and adding it as a column to the existing data frame: -

In [None]:
filter2_nhoods = filter2_nhoods.reindex( columns = filter2_nhoods.columns.tolist() + ['Distance from LA center (in km)'])  #this way to avoid warnings

In [None]:
from math import radians, sin, cos, acos

slat = radians(34.0536909) #LA center Latitude obtained earlier
slon = radians(-118.2427666) #LA center Longitude obtained earlier

In [None]:
for n in range(0,len(filter2_nhoods)):
    
    elat = radians(filter2_nhoods.iloc[n,1])
    elon = radians(filter2_nhoods.iloc[n,2])

    dist = 6371.01 * acos(sin(slat)*sin(elat) + cos(slat)*cos(elat)*cos(slon - elon))
    filter2_nhoods.loc[n,'Distance from LA center (in km)']=dist

In [None]:
filter2_nhoods.sort_values(by='Distance from LA center (in km)',inplace=True)
filter2_nhoods.reset_index(drop=True,inplace=True)
filter2_nhoods

It is clear from the data frame above that **Exposition Park** (~6 km) and **Montecito Heights** (~6 km) are much closer to the center of Los Angeles than Wilshire Center (~17.5 km) and Playa Vista (~19 km). Since distance from the LA center is a criterion in choosing the optimal neighbourhood, Wilshire Center and Playa Vista would not be appropriate choices.
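The distance loop above uses the spherical law of cosines. For points that are very close together, floating-point rounding can push the argument of `acos` slightly outside its valid range. A commonly used alternative is the haversine formula, sketched below as a reusable function. The function name `haversine_km` is an assumption made for illustration; the Earth radius is the same value used in the loop above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.01  # same mean Earth radius as in the loop above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    # min() guards against rounding pushing sqrt(a) marginally above 1
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


# LA center coordinates as used earlier in the notebook
LA_LAT, LA_LON = 34.0536909, -118.2427666

# Sanity checks: zero distance to itself, and one degree of latitude at the equator
print(haversine_km(LA_LAT, LA_LON, LA_LAT, LA_LON))
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 2))
```

With pandas, such a function could be applied row by row to the latitude and longitude columns instead of filling the distance column inside an index-based loop.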

### Web Scraping Rent Data  <a name="rent"></a>

The average rent for each neighbourhood in LA can be obtained by scraping the relevant webpage. Because the data on the webpage is already laid out as a table, it can be extracted directly from the table's cells.

In [None]:
# Download the page and parse its HTML (the variable holds the page's HTML text, not a URL)
html = requests.get('https://www.rentcafe.com/average-rent-market-trends/us/ca/los-angeles/').text
soup = BeautifulSoup(html, "html.parser")

In [None]:
table = soup.find('table',id="MarketTrendsAverageRentTable")
pr = table.find_all('td')
nh = table.find_all('th')

In [None]:
price = []
neighbourhood = []

for i in range(0, len(pr)):
    price.append(pr[i].text.strip())
    neighbourhood.append(nh[i+2].text.strip())
        
df_rent = pd.DataFrame(data=[neighbourhood, price]).transpose()
df_rent.columns = ['Neighbourhood', nh[1].text]
df_rent.loc[32,'Neighbourhood']='Montecito Heights' #Correcting a spelling error
df_rent

The above data frame is already in ascending order of average rent. The two neighbourhoods still in contention can be identified from the table and their average rents displayed: -

In [None]:
df_rent[(df_rent['Neighbourhood']=='Exposition Park') | (df_rent['Neighbourhood']=='Montecito Heights')]

The average rent in Exposition Park is nearly twice the average rent in Montecito Heights. This means that Exposition Park is a significantly more expensive neighbourhood.
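The rent values scraped above are stored as text, so comparing them numerically requires a conversion step. The sketch below assumes the cells hold strings of the form `$1,234`, which has not been checked against the page. The helper name `parse_rent` and the two rent figures are hypothetical placeholders, not the scraped values.

```python
def parse_rent(text):
    """Convert a rent string such as '$1,234' to an integer number of dollars."""
    return int(text.replace('$', '').replace(',', '').strip())


# Placeholder rents for two unnamed neighbourhoods (NOT the scraped figures)
rent_x = parse_rent('$2,400')
rent_y = parse_rent(' $1,250 ')

ratio = rent_x / rent_y
print(round(ratio, 2))  # prints 1.92
```

Once the column is numeric, the ratio between any two neighbourhoods can be computed directly rather than read off the table.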

## <u>Results and Discussion</u><a name="results"></a>

<p style='text-align: justify;'>At the beginning of the analysis, the data frame of Los Angeles neighbourhoods was trimmed to include only those with 10 or more venues. This decision was taken because it made sense to set up a restaurant in one of the more popular neighbourhoods, where it would be visible to many more people.</p>

<p style='text-align: justify;'>When clustering the neighbourhoods, the optimal value of k (k=4) for the dataset was arrived at using the Silhouette Coefficient Method. As a consequence, all neighbourhoods were grouped into 4 clusters using k-means clustering. To examine the defining characteristics of each cluster, a data frame was created for each one listing its most frequently occurring venues in descending order. A horizontal bar plot was generated showing the top 10 venues for each cluster, with the food venues highlighted. This helped in determining the optimal cluster for further analysis. All of the observations pointed to Cluster 2 being that cluster. It had only 2 food venues amongst the top 10 - food trucks and coffee shops - neither of which is a full-fledged restaurant. The cluster also had apartment complexes and many public venues, which suggests that its neighbourhoods see a lot of people.</p>
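To illustrate what the silhouette coefficient measures, the sketch below computes the mean silhouette score for a given assignment of points to clusters, using only the standard library. It is not the code used in this notebook: the function name `silhouette_score` and the four toy points are assumptions made for illustration, and the sketch expects at least two clusters. In practice, k-means would be run for each candidate k and the k with the highest mean score kept.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (requires >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight pairs of points that are far apart from each other
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # pairs grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # pairs split across clusters
print(good > bad)  # prints True
```

A score near 1 means points sit much closer to their own cluster than to any other, while a negative score suggests points have been assigned to the wrong cluster.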

<p style='text-align: justify;'>The following step was to obtain and display the restaurants closest to each neighbourhood in Cluster 2, along with their distances. It was observed that Park La Brea had 7 restaurants within 700 meters of its center. Hancock Park had fewer (3), but two of them were less than 250 meters from its center. This indicated that Park La Brea and Hancock Park would not be suitable neighbourhoods in which to open a restaurant, and they were eliminated.</p>

<p style='text-align: justify;'>The next criterion was the distance of each remaining neighbourhood from the center of the city. It was found that Exposition Park (~6 km) and Montecito Heights (~6 km) are much closer to the center of Los Angeles than Wilshire Center (~17.5 km) and Playa Vista (~19 km). Wilshire Center and Playa Vista were therefore judged not to be appropriate choices.</p>

<p style='text-align: justify;'>The table of average rent for all neighbourhoods in LA was obtained by scraping the relevant webpage. The two neighbourhoods that remained in contention were identified from the table and their average rents displayed. It was found that the average rent in Exposition Park is nearly twice the average rent in Montecito Heights, implying that Exposition Park is a significantly more expensive neighbourhood. However, this does not automatically mean Montecito Heights is the better option. A factor to consider is the type of restaurant the shareholder is interested in setting up. If, for example, a high-end fine dining restaurant is planned, a neighbourhood with a low average rent may not work, since such a neighbourhood would generally be home to people with lower incomes and the restaurant may not see a healthy influx of customers. On the other hand, if a fast-casual or casual dining restaurant is planned, a high-rent neighbourhood would not be ideal because the restaurant may not be able to afford the rented space. While average rent can point in the direction of the right neighbourhood, a final decision cannot be made without all the required information.</p>

## <u>Conclusion</u><a name="conclusion"></a>

<div style="text-align: justify"> The objective of this project was to identify the best potential neighbourhoods in Los Angeles where a vegan/vegetarian restaurant can be set up. All the required neighbourhood data was either scraped from the internet or obtained using a geolocator. After the neighbourhoods were visualized on a folium map, their venues were explored using Foursquare location data. Based on the frequency of occurrence of different venue types, the neighbourhoods were divided into four groups with the help of k-means clustering. The clusters were examined and the one best suited to setting up a restaurant was chosen. The neighbourhoods were filtered further based on proximity to existing vegan restaurants and distance from the center of the city. The analysis brought the number of contenders down to two neighbourhoods - **Exposition Park** and **Montecito Heights**. Average neighbourhood rent data was then consulted and, while it provided interesting insights, it could not settle the decision with the information at hand alone. As touched upon earlier, the results of the analysis highlight potential neighbourhoods where a vegan/vegetarian restaurant may be opened based solely on geographical location and proximity to competitors. This will only serve as a starting point in the overall investigation, since many other factors - availability of commercial spaces, appeal of each location, proximity to major roads, access through public transport, etc. - influence such a decision. </div>

[Back to top of notebook](#top)