## Capstone Project - Battle of the Neighborhoods
Filipino Restaurant Analysis in Los Angeles, California

## Data

<b>Based on our criteria for identifying the best areas in LA for our audience, the following factors will influence the final decision:</b>

- Number of existing Filipino restaurants
- Number of and distance to Filipino restaurants in the neighborhood
- Distance of the neighborhood from the city center
- Number of hotels in the neighborhood
- Distance of the neighborhood to nearby hotels



<b>The following data sources will be needed to extract or generate the required information:</b>
- List of all districts and neighborhoods in LA: https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_of_Los_Angeles
- Coordinates of all neighborhoods and venues: GeoPy Nominatim geocoding
- Number of Filipino restaurants and their locations in every neighborhood: Foursquare API
- Number of hotels and their locations in every neighborhood: Foursquare API

Before we get the data and start exploring it, let's install and import all the dependencies we will need.

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
#Command to install OpenCage Geocoder for fetching Lat and Lng of Neighborhood
!pip install opencage

#Importing OpenCage Geocoder
from opencage.geocoder import OpenCageGeocode

# use the inline backend to generate the plots within the browser
%matplotlib inline 

#Importing Matplot lib and associated packages to perform Data Visualisation and Exploratory Data Analysis
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot') # optional: for ggplot-like style

# check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Importing folium to visualise Maps and plot based on Lat and Lng
import folium

#Requests to request web pages by making get requests to FourSquare REST Client
import requests

#To normalise data returned by FourSquare API (pandas >= 1.0; pandas.io.json.json_normalize is deprecated)
from pandas import json_normalize

#Importing KMeans from SciKit library to Classify neighborhoods into clusters
from sklearn.cluster import KMeans

print('Libraries imported')



Matplotlib version:  3.2.1
Libraries imported


In [2]:
import re 

<b>First, we pull the neighborhood data for LA by scraping the Wikipedia list of districts and neighborhoods.</b>

In [3]:
# download the page HTML and parse it
html = requests.get('https://en.wikipedia.org/wiki/List_of_districts_and_neighbourhoods_of_Los_Angeles').text
soup = BeautifulSoup(html,"html.parser")

In [4]:
lis = []
for li in soup.find_all('li'):
    # stop once the Los Angeles portal link is reached
    if li.find(href="/wiki/Portal:Los_Angeles"):
        break
    # keep list items that link to a Wikipedia article
    if li.find(href=re.compile("^/wiki/")):
        lis.append(li)
    # Pico Robertson is the only item on the list that does not have a hyperlink reference
    if li.text=='Pico Robertson[34]':
        lis.append(li)

In [5]:
# collect the stripped text of every list item into a single-column DataFrame
neigh = [li.text.strip() for li in lis]

df = pd.DataFrame(neigh, columns=['Neighborhood'])

In [6]:
df

| | Neighborhood |
|---|---|
| 0 | Angelino Heights[1] |
| 1 | Angeles Mesa |
| 2 | Angelus Vista |
| 3 | Arleta[2][1] |
| 4 | Arlington Heights[2] |
| ... | ... |
| 193 | Wilshire Park[51] |
| 194 | Windsor Square[2][1] |
| 195 | Winnetka[2][1] |
| 196 | Woodland Hills[2][1] |


<b>Now we need to clean the dataframe by removing citation markers, alternative names, and the duplicate records we are aware of.</b>

In [7]:
df['Neighborhood'] = df.Neighborhood.str.partition('[')[0] #Removes the citation and reference brackets
df['Neighborhood'] = df.Neighborhood.str.partition(',')[0] #Removes the alternatives for 'Bel Air'
df=df[df.Neighborhood!='Baldwin Hills/Crenshaw'] #Removes redundancy as 'Baldwin Hills' and 'Crenshaw' exist already
df=df[df.Neighborhood!='Hollywood Hills West'] #Removes redundancy as it has the same coordinates as 'Hollywood Hills'
df=df[df.Neighborhood!='Brentwood Circle'] #Removes redundancy as it has the same coordinates as 'Brentwood'
df=df[df.Neighborhood!='Wilshire Park'] #Removes redundancy as it has the same coordinates as 'Wilshire Center'
df.reset_index(inplace=True,drop=True)
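
`str.partition('[')` discards everything after the first bracket. As an alternative, the citation markers alone could be removed with a regular expression. The following is a minimal sketch on a few names taken from the table above; the helper name `strip_citations` is hypothetical and not part of the notebook.

```python
import re

# Matches Wikipedia-style citation markers such as [2] or [34]
CITATION = re.compile(r'\[\d+\]')

def strip_citations(name: str) -> str:
    """Remove every citation marker from a name and trim surrounding whitespace."""
    return CITATION.sub('', name).strip()

sample = ['Angelino Heights[1]', 'Arleta[2][1]', 'Angeles Mesa']
cleaned = [strip_citations(n) for n in sample]
print(cleaned)
```

In the notebook this could be applied with `df['Neighborhood'].map(strip_citations)`, assuming the column still holds the raw scraped strings.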

In [8]:
df

| | Neighborhood |
|---|---|
| 0 | Angelino Heights |
| 1 | Angeles Mesa |
| 2 | Angelus Vista |
| 3 | Arleta |
| 4 | Arlington Heights |
| ... | ... |
| 189 | Wilshire Center |
| 190 | Windsor Square |
| 191 | Winnetka |
| 192 | Woodland Hills |


<b>We use Nominatim to pull the coordinates of the LA neighborhoods.</b>

In [9]:
%pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [10]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [12]:
column_names = ['Neighborhood', 'Latitude', 'Longitude']

geolocator = Nominatim(user_agent="la_explorer",timeout=5)
rows = []
for name in df.Neighborhood:
    # geocode "<neighborhood>, Los Angeles"; geocode() returns None when nothing matches
    location = geolocator.geocode(name+', Los Angeles')
    if location is None:
        # placeholder (0, 0) marks a failed lookup; these rows are filtered out later
        latitude, longitude = 0, 0
    else:
        latitude, longitude = location.latitude, location.longitude
    rows.append({'Neighborhood': name, 'Latitude': latitude, 'Longitude': longitude})

# build the DataFrame once; DataFrame.append was removed in pandas 2.0
nhoods = pd.DataFrame(rows, columns=column_names)
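
The loop above sends one request per neighborhood with no pause between them, while the public Nominatim service asks clients to stay at or below one request per second. The sketch below shows one way the lookups could be throttled. It is an illustration rather than the notebook's method: `geocode_all` and `fake_geocode` are hypothetical names, the stub stands in for `geolocator.geocode` so the example runs offline, and failed lookups are recorded as `None` instead of the notebook's `0` placeholder.

```python
import time
from collections import namedtuple

def geocode_all(names, geocode, suffix=', Los Angeles', delay_seconds=1.0):
    """Geocode each name with the supplied callable, pausing between requests.

    `geocode` is any callable that takes an address string and returns an
    object with .latitude and .longitude, or None when nothing matches.
    """
    rows = []
    for i, name in enumerate(names):
        if i:  # pause before every request except the first
            time.sleep(delay_seconds)
        location = geocode(name + suffix)
        rows.append({
            'Neighborhood': name,
            'Latitude': None if location is None else location.latitude,
            'Longitude': None if location is None else location.longitude,
        })
    return rows

# Stand-in for geolocator.geocode (hypothetical): knows one address only.
Point = namedtuple('Point', 'latitude longitude')

def fake_geocode(address):
    known = {'Arleta, Los Angeles': Point(34.241327, -118.432205)}
    return known.get(address)

demo = geocode_all(['Arleta', 'Angelus Vista'], fake_geocode, delay_seconds=0)
print(demo)
```

With the real geocoder, the call would be `geocode_all(df.Neighborhood, geolocator.geocode)`, and `pd.DataFrame(rows)` would turn the result into a table. geopy also ships a `RateLimiter` wrapper in `geopy.extra.rate_limiter` that serves a similar purpose.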

In [13]:
nhoods

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Angelino Heights | 34.070289 | -118.254796 |
| 1 | Angeles Mesa | 33.991402 | -118.319520 |
| 2 | Angelus Vista | 0.000000 | 0.000000 |
| 3 | Arleta | 34.241327 | -118.432205 |
| 4 | Arlington Heights | 34.043494 | -118.321374 |
| ... | ... | ... | ... |
| 189 | Wilshire Center | 34.061515 | -118.432771 |
| 190 | Windsor Square | 34.072593 | -118.320810 |
| 191 | Winnetka | 34.205883 | -118.570934 |
| 192 | Woodland Hills | 34.168436 | -118.605838 |


<b>Next, we convert the coordinates to floats and remove any records that must be wrong because their latitude or longitude falls outside a bounding box around LA.</b>

In [14]:
nhoods['Latitude']=nhoods['Latitude'].astype(float)
nhoods['Longitude']=nhoods['Longitude'].astype(float)

nhoods=nhoods[(nhoods.Latitude>33.5) & (nhoods.Latitude<34.4) & (nhoods.Longitude<-118)] 
nhoods.reset_index(inplace=True,drop=True)

In [15]:
nhoods

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Angelino Heights | 34.070289 | -118.254796 |
| 1 | Angeles Mesa | 33.991402 | -118.319520 |
| 2 | Arleta | 34.241327 | -118.432205 |
| 3 | Arlington Heights | 34.043494 | -118.321374 |
| 4 | Arts District | 34.041239 | -118.234450 |
| ... | ... | ... | ... |
| 156 | Wilmington | 33.780016 | -118.262509 |
| 157 | Wilshire Center | 34.061515 | -118.432771 |
| 158 | Windsor Square | 34.072593 | -118.320810 |
| 159 | Winnetka | 34.205883 | -118.570934 |


In [16]:
address = 'Los Angeles, CA'

geolocator = Nominatim(user_agent="la_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Los Angeles are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Los Angeles are 34.0536909, -118.2427666.
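
One of the factors listed in the Data section is the distance of each neighborhood from the city center. With the center coordinates available, that distance could be estimated with the haversine (great-circle) formula. This is a minimal sketch, assuming a spherical Earth with a mean radius of 6371 km; the function name `haversine_km` is hypothetical and the notebook itself does not compute this distance in the cells shown.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption: spherical Earth)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Coordinates taken from the outputs above
la_center = (34.0536909, -118.2427666)
angelino_heights = (34.070289, -118.254796)

d = haversine_km(*la_center, *angelino_heights)
print(round(d, 2), 'km')
```

Applied row by row to `nhoods`, the same function would give a distance-to-center column that could be used when ranking neighborhoods.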


<b>Finally, we can use Folium to plot LA's neighborhoods on a map.</b>

In [17]:
map_la = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(nhoods['Latitude'], nhoods['Longitude'], nhoods['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='purple',
        fill=True,
        fill_color='#3199cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_la)  
    
map_la

<b>Next, we connect to Foursquare's API and pull venue data for our analysis of Filipino restaurants and nearby hotels.</b>

In [18]:
CLIENT_ID = 'ZEN5NXXEEMSND55V4OGNWWRUUKATAWXAGWYQANPUYVR00VTU' # your Foursquare ID
CLIENT_SECRET = 'RIT4GQ53DKMFYJOXVO5BBZ43WAWOYG4FKK0N4ZPRNSSKW0OX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 1000
radius = 5000

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: ZEN5NXXEEMSND55V4OGNWWRUUKATAWXAGWYQANPUYVR00VTU
CLIENT_SECRET:RIT4GQ53DKMFYJOXVO5BBZ43WAWOYG4FKK0N4ZPRNSSKW0OX


<b>We define two helper functions for the Foursquare API: one pulls the venues around a location, and the other pulls the details of a single venue.</b>

In [22]:
def get_venues(lat,lng):
    
    # CLIENT_ID, CLIENT_SECRET, VERSION, LIMIT and radius are the values set in the credentials cell above
    
    #url to fetch data from foursquare api
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
    
    # get all the data
    results = requests.get(url).json()
    venue_data=results["response"]['groups'][0]['items']
    venue_details=[]
    for row in venue_data:
        try:
            venue_id=row['venue']['id']
            venue_name=row['venue']['name']
            venue_category=row['venue']['categories'][0]['name']
            venue_details.append([venue_id,venue_name,venue_category])
        except KeyError:
            pass
        
    column_names=['ID','Name','Category']
    df = pd.DataFrame(venue_details,columns=column_names)
    return df

In [23]:
def get_venue_details(venue_id):
        
    # CLIENT_ID, CLIENT_SECRET and VERSION are the values set in the credentials cell above
    
    #url to fetch data from foursquare api
    url = 'https://api.foursquare.com/v2/venues/{}?&client_id={}&client_secret={}&v={}'.format(
            venue_id,
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)
    
    # get all the data
    results = requests.get(url).json()
    venue_data=results['response']['venue']
    venue_details=[]
    try:
        venue_id=venue_data['id']
        venue_name=venue_data['name']
        venue_likes=venue_data['likes']['count']
        venue_rating=venue_data['rating']
        venue_tips=venue_data['tips']['count']
        venue_details.append([venue_id,venue_name,venue_likes,venue_rating,venue_tips])
    except KeyError:
        pass
        
    column_names=['ID','Name','Likes','Rating','Tips']
    df = pd.DataFrame(venue_details,columns=column_names)
    return df

In [24]:
# prepare neighborhood list that contains Filipino restaurants
column_names=['Neighborhood', 'ID','Name']
filipino_rest_la=pd.DataFrame(columns=column_names)
for count, (neighborhood, lat, lng) in enumerate(nhoods.values.tolist(), start=1):
    venues = get_venues(lat, lng)
    filipino_rest=venues[venues['Category']=='Filipino Restaurant']
    print('(',count,'/',len(nhoods),')','Filipino Restaurants in '+neighborhood+':'+str(len(filipino_rest)))
    for venue_id, name, category in filipino_rest.values.tolist():
        # add the row in place so partial results survive if a later request fails
        # (DataFrame.append was removed in pandas 2.0)
        filipino_rest_la.loc[len(filipino_rest_la)] = [neighborhood, venue_id, name]

( 1 / 161 ) Filipino Restaurants in Angelino Heights:1
( 2 / 161 ) Filipino Restaurants in Angeles Mesa:0
( 3 / 161 ) Filipino Restaurants in Arleta:0
( 4 / 161 ) Filipino Restaurants in Arlington Heights:0
( 5 / 161 ) Filipino Restaurants in Arts District:1
( 6 / 161 ) Filipino Restaurants in Atwater Village:0
( 7 / 161 ) Filipino Restaurants in Baldwin Hills:0
( 8 / 161 ) Filipino Restaurants in Baldwin Village:0
( 9 / 161 ) Filipino Restaurants in Beachwood Canyon:0
( 10 / 161 ) Filipino Restaurants in Bel Air:0
( 11 / 161 ) Filipino Restaurants in Benedict Canyon:0
( 12 / 161 ) Filipino Restaurants in Beverly Crest:0
( 13 / 161 ) Filipino Restaurants in Beverly Glen:0
( 14 / 161 ) Filipino Restaurants in Beverly Grove:0
( 15 / 161 ) Filipino Restaurants in Beverly Hills Post Office:0
( 16 / 161 ) Filipino Restaurants in Beverly Park:1
( 17 / 161 ) Filipino Restaurants in Beverlywood:0
( 18 / 161 ) Filipino Restaurants in Boyle Heights:1
( 19 / 161 ) Filipino Restaurants in Brentwoo

KeyError: 'groups'
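
The `KeyError: 'groups'` above means that at least one response did not contain `response.groups`. The traceback does not show the cause; one plausible explanation, which is an assumption, is an error reply such as a rate-limit or quota message. The sketch below shows how the parsing step could tolerate such a payload instead of aborting the loop. `extract_venues` is a hypothetical helper, and both sample payloads are invented for illustration, not captured from the API.

```python
def extract_venues(payload):
    """Return [id, name, category] rows from an explore-style payload.

    Returns an empty list when the payload has no 'groups' entry, so the
    caller can log the problem and move on to the next neighborhood.
    """
    groups = payload.get('response', {}).get('groups') or []
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or [{}]
        try:
            rows.append([venue['id'], venue['name'], categories[0]['name']])
        except KeyError:
            continue  # skip venues missing any required field
    return rows

# Illustrative payloads (assumed shapes, not real API output)
ok_payload = {'response': {'groups': [{'items': [
    {'venue': {'id': 'abc', 'name': 'Example', 'categories': [{'name': 'Hotel'}]}},
    {'venue': {'id': 'def', 'name': 'No category', 'categories': []}},
]}]}}
error_payload = {'meta': {'code': 429}, 'response': {}}

print(extract_venues(ok_payload))
print(extract_venues(error_payload))
```

Returning an empty list keeps the loop running, at the cost of silently counting zero venues for that neighborhood, so a real run would probably also want to record which requests failed.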

In [25]:
filipino_rest_la.head()

| | Neighborhood | ID | Name |
|---|---|---|---|
| 0 | Angelino Heights | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA |
| 1 | Arts District | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA |
| 2 | Beverly Park | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA |
| 3 | Boyle Heights | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA |
| 4 | Bunker Hill | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA |


In [27]:
filipino_rest_la.shape

(25, 3)
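
The first five rows all carry the same venue ID, so the 25 rows do not correspond to 25 distinct restaurants. This is what we would expect when a 5000 m search radius is applied around neighborhood centers that lie close together. The sketch below, using only the five rows shown above, groups neighborhoods by venue to make the overlap explicit; `neighborhoods_per_venue` is a hypothetical helper.

```python
from collections import defaultdict

# The five rows displayed by filipino_rest_la.head()
rows = [
    ('Angelino Heights', '5956f0d0f5e9d7161f043456', 'Sari Sari Store LA'),
    ('Arts District', '5956f0d0f5e9d7161f043456', 'Sari Sari Store LA'),
    ('Beverly Park', '5956f0d0f5e9d7161f043456', 'Sari Sari Store LA'),
    ('Boyle Heights', '5956f0d0f5e9d7161f043456', 'Sari Sari Store LA'),
    ('Bunker Hill', '5956f0d0f5e9d7161f043456', 'Sari Sari Store LA'),
]

def neighborhoods_per_venue(rows):
    """Map each (venue_id, name) pair to the neighborhoods whose search returned it."""
    by_venue = defaultdict(list)
    for neighborhood, venue_id, name in rows:
        by_venue[(venue_id, name)].append(neighborhood)
    return dict(by_venue)

grouped = neighborhoods_per_venue(rows)
for (venue_id, name), hoods in grouped.items():
    print(name, 'appears in', len(hoods), 'neighborhood searches')
```

On the full dataframe, `filipino_rest_la['ID'].nunique()` would give the number of distinct restaurants directly.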

In [28]:
# prepare neighborhood list that contains hotels
column_names=['Neighborhood', 'ID','Name']
hotel_la=pd.DataFrame(columns=column_names)
for count, (neighborhood, lat, lng) in enumerate(nhoods.values.tolist(), start=1):
    venues = get_venues(lat, lng)
    hotel=venues[venues['Category']=='Hotel']
    print('(',count,'/',len(nhoods),')','Hotels in '+neighborhood+':'+str(len(hotel)))
    for venue_id, name, category in hotel.values.tolist():
        # add the row in place so partial results survive if a later request fails
        # (DataFrame.append was removed in pandas 2.0)
        hotel_la.loc[len(hotel_la)] = [neighborhood, venue_id, name]

( 1 / 161 ) Hotels in Angelino Heights:1
( 2 / 161 ) Hotels in Angeles Mesa:0
( 3 / 161 ) Hotels in Arleta:0
( 4 / 161 ) Hotels in Arlington Heights:1
( 5 / 161 ) Hotels in Arts District:4
( 6 / 161 ) Hotels in Atwater Village:0
( 7 / 161 ) Hotels in Baldwin Hills:0
( 8 / 161 ) Hotels in Baldwin Village:0
( 9 / 161 ) Hotels in Beachwood Canyon:2
( 10 / 161 ) Hotels in Bel Air:4
( 11 / 161 ) Hotels in Benedict Canyon:5
( 12 / 161 ) Hotels in Beverly Crest:2
( 13 / 161 ) Hotels in Beverly Glen:3
( 14 / 161 ) Hotels in Beverly Grove:4
( 15 / 161 ) Hotels in Beverly Hills Post Office:9
( 16 / 161 ) Hotels in Beverly Park:2
( 17 / 161 ) Hotels in Beverlywood:6
( 18 / 161 ) Hotels in Boyle Heights:0
( 19 / 161 ) Hotels in Brentwood:0
( 20 / 161 ) Hotels in Broadway-Manchester:0
( 21 / 161 ) Hotels in Bunker Hill:2
( 22 / 161 ) Hotels in Cahuenga Pass:1
( 23 / 161 ) Hotels in Canoga Park:1
( 24 / 161 ) Hotels in Canterbury Knolls:2
( 25 / 161 ) Hotels in Carthay:4
( 26 / 161 ) Hotels in Centr

KeyError: 'groups'

In [29]:
hotel_la.head()

| | Neighborhood | ID | Name |
|---|---|---|---|
| 0 | Angelino Heights | 4b6900daf964a52092962be3 | JW Marriott Los Angeles L.A. LIVE |
| 1 | Arlington Heights | 52be1afd11d2f12c879452db | The LINE Hotel |
| 2 | Arts District | 58c2563180e1af2b1ed0c9db | Tuck Hotel |
| 3 | Arts District | 584c5cdfd772f952a50c6aef | The NoMad Hotel Los Angeles |
| 4 | Arts District | 57eaefa5498eecc7ee748c22 | Freehand Los Angeles |


In [30]:
hotel_la.shape

(104, 3)

In [31]:
column_names=['Neighborhood', 'ID','Name','Likes','Rating','Tips']
filipino_rest_stats_la=pd.DataFrame(columns=column_names)

for count, (neighborhood, venue_id, venue_name) in enumerate(filipino_rest_la.values.tolist(), start=1):
    try:
        venue_details=get_venue_details(venue_id)
        print(venue_details)
        id_,name,likes,rating,tips=venue_details.values.tolist()[0]
    except IndexError:
        print('No data available for id=',venue_id)
        # assign 0 to these restaurants, as they may have opened recently
        # or their details may not exist in the Foursquare database
        id_,name,likes,rating,tips=[0]*5
    print('(',count,'/',len(filipino_rest_la),')','processed')
    # add the row in place so partial results survive if a later request fails
    filipino_rest_stats_la.loc[len(filipino_rest_stats_la)] = [neighborhood, id_, name, likes, rating, tips]

                         ID                Name  Likes  Rating  Tips
0  5956f0d0f5e9d7161f043456  Sari Sari Store LA     55     8.9    15
( 1 / 25 ) processed
                         ID                Name  Likes  Rating  Tips
0  5956f0d0f5e9d7161f043456  Sari Sari Store LA     55     8.9    15
( 2 / 25 ) processed
                         ID                Name  Likes  Rating  Tips
0  5956f0d0f5e9d7161f043456  Sari Sari Store LA     55     8.9    15
( 3 / 25 ) processed
                         ID                Name  Likes  Rating  Tips
0  5956f0d0f5e9d7161f043456  Sari Sari Store LA     55     8.9    15
( 4 / 25 ) processed
                         ID                Name  Likes  Rating  Tips
0  5956f0d0f5e9d7161f043456  Sari Sari Store LA     55     8.9    15
( 5 / 25 ) processed
                         ID                Name  Likes  Rating  Tips
0  5956f0d0f5e9d7161f043456  Sari Sari Store LA     55     8.9    15
( 6 / 25 ) processed
                         ID                Nam

In [32]:
filipino_rest_stats_la.head()

| | Neighborhood | ID | Name | Likes | Rating | Tips |
|---|---|---|---|---|---|---|
| 0 | Angelino Heights | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA | 55 | 8.9 | 15 |
| 1 | Arts District | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA | 55 | 8.9 | 15 |
| 2 | Beverly Park | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA | 55 | 8.9 | 15 |
| 3 | Boyle Heights | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA | 55 | 8.9 | 15 |
| 4 | Bunker Hill | 5956f0d0f5e9d7161f043456 | Sari Sari Store LA | 55 | 8.9 | 15 |
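
The log above shows the details of the same venue ID being requested repeatedly, once for every neighborhood that returned it. Since the details depend only on the ID, each distinct ID could be fetched once and the result reused. This is a minimal sketch of that idea; `fetch_details_once` and `fake_fetch` are hypothetical names, and the stub stands in for `get_venue_details` so the example runs without network access.

```python
def fetch_details_once(venue_ids, fetch):
    """Call `fetch` once per distinct venue ID and return {venue_id: result}."""
    cache = {}
    for venue_id in venue_ids:
        if venue_id not in cache:
            cache[venue_id] = fetch(venue_id)
    return cache

# Stand-in for get_venue_details (hypothetical); records how often it is called.
calls = []

def fake_fetch(venue_id):
    calls.append(venue_id)
    return {'ID': venue_id, 'Name': 'Sari Sari Store LA', 'Likes': 55, 'Rating': 8.9, 'Tips': 15}

ids = ['5956f0d0f5e9d7161f043456'] * 5  # the IDs of the five rows shown above
details = fetch_details_once(ids, fake_fetch)
print(len(ids), 'rows,', len(calls), 'request(s)')
```

In the notebook, the cached results could then be joined back to `filipino_rest_la` on `ID`, which would also reduce the number of calls counted against the API quota.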


In [33]:
column_names=['Neighborhood', 'ID','Name','Likes','Rating','Tips']
hotel_stats_la=pd.DataFrame(columns=column_names)

for count, (neighborhood, venue_id, venue_name) in enumerate(hotel_la.values.tolist(), start=1):
    try:
        venue_details=get_venue_details(venue_id)
        print(venue_details)
        id_,name,likes,rating,tips=venue_details.values.tolist()[0]
    except IndexError:
        print('No data available for id=',venue_id)
        # assign 0 to these hotels, as they may have opened recently
        # or their details may not exist in the Foursquare database
        id_,name,likes,rating,tips=[0]*5
    print('(',count,'/',len(hotel_la),')','processed')
    # add the row in place so partial results survive if a later request fails
    hotel_stats_la.loc[len(hotel_stats_la)] = [neighborhood, id_, name, likes, rating, tips]

                         ID                               Name  Likes  Rating  \
0  4b6900daf964a52092962be3  JW Marriott Los Angeles L.A. LIVE    466     8.6   

   Tips  
0   129  
( 1 / 104 ) processed
                         ID            Name  Likes  Rating  Tips
0  52be1afd11d2f12c879452db  The LINE Hotel    500     8.5    71
( 2 / 104 ) processed
                         ID        Name  Likes  Rating  Tips
0  58c2563180e1af2b1ed0c9db  Tuck Hotel     30     8.9     1
( 3 / 104 ) processed
                         ID                         Name  Likes  Rating  Tips
0  584c5cdfd772f952a50c6aef  The NoMad Hotel Los Angeles     79     9.0     8
( 4 / 104 ) processed
                         ID                  Name  Likes  Rating  Tips
0  57eaefa5498eecc7ee748c22  Freehand Los Angeles    106     9.0    15
( 5 / 104 ) processed
                         ID                            Name  Likes  Rating  \
0  4f10ae81e4b0253d4be8c453  Ace Hotel Downtown Los Angeles    520     8.7   



KeyError: 'venue'
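
The `KeyError: 'venue'` above is raised before the `try` block in `get_venue_details`, at the line that reads `results['response']['venue']`, so the existing `except KeyError` never sees it. A sketch of a more tolerant parsing step follows. `parse_venue_details` is a hypothetical helper, the payloads are invented for illustration, and treating a missing rating as `None` is a design choice of this sketch rather than the notebook's behavior.

```python
def parse_venue_details(payload):
    """Return a dict of venue statistics, or None when the payload has no venue."""
    venue = payload.get('response', {}).get('venue')
    if not venue:
        return None
    return {
        'ID': venue.get('id'),
        'Name': venue.get('name'),
        'Likes': venue.get('likes', {}).get('count', 0),
        'Rating': venue.get('rating'),  # None when the venue has no rating
        'Tips': venue.get('tips', {}).get('count', 0),
    }

# Illustrative payloads (assumed shapes, not real API output)
ok_payload = {'response': {'venue': {
    'id': 'abc', 'name': 'Example Hotel',
    'likes': {'count': 10}, 'tips': {'count': 2},
}}}
error_payload = {'meta': {'code': 429}, 'response': {}}

print(parse_venue_details(ok_payload))
print(parse_venue_details(error_payload))
```

A caller could skip or retry any venue for which the function returns `None`, rather than losing the rest of the loop.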