# Data Science Capstone Project
### Jimmy J.

## Introduction 

##### Frank Lloyd Wright said it best: "Tip the world over on its side and everything loose will land in Los Angeles." For decades, people from all over the world have come to this city in search of home, family, love, fame, and fortune. Los Angeles welcomes this multicultural migration with a suburban sprawl that encompasses over 88 cities and even more unincorporated neighborhoods. Each of these neighborhoods is characterized by geographical, economic, and cultural features that make it uniquely suited to different demographics. 

##### Neighborhoods like Downtown Los Angeles, Culver City, Long Beach, Century City, and West Hollywood offer urban-style living, with grocery stores, malls, public transportation, and entertainment venues within walking distance. On the other hand, neighborhoods like Baldwin Hills, Crenshaw, Echo Park, and Boyle Heights are quieter, with single-family dwellings that are not easily accessible by public transportation. 

##### Given the vast range of neighborhoods to choose from in LA, someone looking to move here might be overwhelmed. In this project, I characterize neighborhoods in LA by the kinds of venues in their immediate vicinity, using a clustering algorithm. The analysis groups LA neighborhoods into a small number of clusters, each defined by the mix of venues closest to it. This is of most interest to rental search apps like Westside Rentals or Rentpad, to real estate agents, and, more generally, to people looking to move to LA. The results can help them find neighborhoods that match what they are looking for in a place to live and provide a more satisfactory experience than choosing a neighborhood at random. 

## Data 

##### The list of neighborhoods in Los Angeles was scraped from the Los Angeles Times neighborhood list (maps.latimes.com) using BeautifulSoup. The names of these neighborhoods were then passed to Nominatim to obtain their geographical coordinates. 

##### The venues within a 500 m radius of each neighborhood were retrieved using the Foursquare API. Venues of five categories were chosen: Travel & Transport, Arts & Entertainment, Outdoors & Recreation, Nightlife Spot, and Food. The number of venues in each category was counted for each neighborhood and then divided by the total for that category across all neighborhoods, giving five parameters with which to cluster the neighborhoods. 
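
##### As a rough illustration of the counting and normalizing step, the sketch below builds a tiny venue table and turns it into per-category shares. The neighborhood names `A`, `B`, `C` and the rows are made up for illustration, and `pd.crosstab` is used here only for brevity; it is not necessarily how the notebook builds its table.

```python
import pandas as pd

# Toy venue table: one row per venue, tagged with its neighborhood and
# top-level category. The rows are illustrative, not real API output.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "Category": ["Food", "Food", "Nightlife Spot", "Food", "Nightlife Spot", "Food"],
})

# Count venues per neighborhood and category; missing pairs become 0.
counts = pd.crosstab(venues["Neighborhood"], venues["Category"])

# Divide each category column by its total across all neighborhoods,
# so every column sums to 1.
normalized = counts / counts.sum()

print(normalized)
```

##### Dividing by the column total puts sparse categories and dense ones (such as Food) on a comparable scale before clustering.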



## Methodology 

In [None]:
### Importing libraries

import json  # library to handle JSON files
import re  # regular expressions, used to build CSV file names
import urllib.request  # fetch the neighborhood list page

import bs4 as bs  # HTML parsing (BeautifulSoup)
import numpy as np  # library to handle data in a vectorized manner
import pandas as pd  # library for data analysis
import requests  # library to handle HTTP requests
from pandas import json_normalize  # transform JSON into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge geopy --yes # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values

#!conda install -c conda-forge folium --yes # uncomment this line if folium is not installed
import folium  # map rendering library

In [13]:
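# @hidden_cell
import os

# Credentials are read from environment variables so they are not stored in the notebook.
# The variable names below are placeholders (an assumption); use whatever names you export.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials: hidden')

Your credentials: hidden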


In [14]:
# Scraping list of neighborhoods in LA

source = urllib.request.urlopen('http://maps.latimes.com/neighborhoods/neighborhood/list/').read()
soup=bs.BeautifulSoup(source, 'lxml')

table = soup.find('table')

table_rows = table.find_all('tr')

ls = []
for tr in table_rows:
    links = tr.find_all('a')
    row = [link.text.strip() for link in links]
    ls.append(row)
    
LA_neighs = pd.DataFrame(ls, columns = ['Neighborhood','Region'])

LA_neighs=LA_neighs.drop(LA_neighs.index[0])

LA_neighs.tail()

Unnamed: 0,Neighborhood,Region
268,Willowbrook,South L.A.
269,Wilmington,Harbor
270,Windsor Square,Central L.A.
271,Winnetka,San Fernando Valley
272,Woodland Hills,San Fernando Valley


In [15]:
# Getting the latitude and longitude of neighborhoods in LA

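latitude = []
longitude = []

# Create the geocoder once instead of once per neighborhood
geolocator = Nominatim(user_agent="LA_explorer")

for name in LA_neighs['Neighborhood']:
    address = str(name) + ", Los Angeles, California"
    try:
        location = geolocator.geocode(address)
    except Exception:
        # Network or service error: treat the neighborhood as not found
        location = None
    # geocode() returns None when the address cannot be resolved
    latitude.append(location.latitude if location else None)
    longitude.append(location.longitude if location else None)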

LA_neighs['Latitude'] = latitude
LA_neighs['Longitude'] = longitude

In [17]:
LA_neighs.dropna(inplace = True)

In [18]:
LA_neighs[:50]

Unnamed: 0,Neighborhood,Region,Latitude,Longitude
1,Acton,Antelope Valley,34.480742,-118.186838
2,Adams-Normandie,South L.A.,34.018609,-118.287348
3,Agoura Hills,Santa Monica Mountains,34.14791,-118.765704
4,Agua Dulce,Northwest County,34.496382,-118.325635
5,Alhambra,San Gabriel Valley,34.093042,-118.12706
6,Alondra Park,South Bay,33.890134,-118.335133
7,Altadena,Verdugos,34.186316,-118.135233
8,Angeles Crest,Angeles Forest,34.234,-118.183386
9,Arcadia,San Gabriel Valley,34.136207,-118.04015
10,Arleta,San Fernando Valley,34.241327,-118.432205


In [26]:
### Saving data to csv to avoid repeated API calls
LA_neighs.to_csv('LA_neighborhoods_v0.3.csv')

In [77]:
LA_Neighs=pd.read_csv('LA_neighborhoods_v0.3.csv')
LA_Neighs=LA_Neighs.drop(columns = ['Unnamed: 0'])
LA_Neighs[:50]

Unnamed: 0,Neighborhood,Region,Latitude,Longitude
0,Acton,Antelope Valley,34.480742,-118.186838
1,Adams-Normandie,South L.A.,34.018609,-118.287348
2,Agoura Hills,Santa Monica Mountains,34.14791,-118.765704
3,Agua Dulce,Northwest County,34.496382,-118.325635
4,Alhambra,San Gabriel Valley,34.093042,-118.12706
5,Alondra Park,South Bay,33.890134,-118.335133
6,Altadena,Verdugos,34.186316,-118.135233
7,Angeles Crest,Angeles Forest,34.234,-118.183386
8,Arcadia,San Gabriel Valley,34.136208,-118.04015
9,Arleta,San Fernando Valley,34.241327,-118.432205


In [37]:
# Retrieving the coordinates for Los Angeles 

address = 'Los Angeles, CA'

geolocator = Nominatim(user_agent="LA_explorer")
location = geolocator.geocode(address)
LAlatitude = location.latitude
LAlongitude = location.longitude
print('The geographical coordinates of Los Angeles are {}, {}.'.format(LAlatitude, LAlongitude))

The geographical coordinates of Los Angeles are 34.0536909, -118.2427666.


In [29]:
# create map of LA using latitude and longitude values
map_LA = folium.Map(location=[LAlatitude, LAlongitude], zoom_start=11, width=800, height=600)

# add markers to map
for lat, lng, neighborhood, region in zip(LA_neighs['Latitude'], LA_neighs['Longitude'], LA_neighs['Neighborhood'], LA_neighs['Region']):
    label = '{}, {}'.format(neighborhood, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_LA)
map_LA

In [3]:
## Categories and IDs from Foursquare API

catIds = {'Travel & Transport': '4d4b7105d754a06379d81259',
          'Arts & Entertainment': '4d4b7104d754a06370d81259',
          'Outdoors & Recreation': '4d4b7105d754a06377d81259',
          'Nightlife Spot': '4d4b7105d754a06376d81259',
          'Food': '4d4b7105d754a06374d81259'}
catIds

{'Travel & Transport': '4d4b7105d754a06379d81259',
 'Arts & Entertainment': '4d4b7104d754a06370d81259',
 'Outdoors & Recreation': '4d4b7105d754a06377d81259',
 'Nightlife Spot': '4d4b7105d754a06376d81259',
 'Food': '4d4b7105d754a06374d81259'}

In [None]:
# More categories (kept for reference)
"""
'Travel & Transport': '4d4b7105d754a06379d81259',
'Arts & Entertainment': '4d4b7104d754a06370d81259',
'Outdoors & Recreation': '4d4b7105d754a06377d81259',
'Nightlife Spot': '4d4b7105d754a06376d81259',
'Food': '4d4b7105d754a06374d81259',
'Event': '4d4b7105d754a06373d81259',
'Professional & Other Places': '4d4b7105d754a06375d81259',
'Residence': '4e67e38e036454776db1fb3a',
'Shop & Service': '4d4b7105d754a06378d81259',
'College & University': '4d4b7105d754a06372d81259'
"""

In [31]:
for key in catIds:
    print(key)
    print(catIds[key])

Travel & Transport
4d4b7105d754a06379d81259
Arts & Entertainment
4d4b7104d754a06370d81259
Outdoors & Recreation
4d4b7105d754a06377d81259
Nightlife Spot
4d4b7105d754a06376d81259
Food
4d4b7105d754a06374d81259


In [38]:
# Function to extract nearby venues 

def getNearbyVenues(names, latitudes, longitudes, categoryIds, radius, limit):
    """Return a dict mapping each category name to a dataframe of nearby venues."""
    columns = ['Neighborhood', 
               'Neighborhood Latitude', 
               'Neighborhood Longitude', 
               'Venue', 
               'Venue Latitude', 
               'Venue Longitude', 
               'Venue Category']
    venues_dict = {}

    for key, categoryId in categoryIds.items():
        rows = []
        for name, lat, lng in zip(names, latitudes, longitudes):
            # query parameters for the Foursquare explore endpoint
            params = {
                'client_id': CLIENT_ID,
                'client_secret': CLIENT_SECRET,
                'v': VERSION,
                'll': '{},{}'.format(lat, lng),
                'radius': radius,  # search radius in meters
                'categoryId': categoryId,
                'limit': limit}  # maximum number of venues returned per request

            # make the GET request
            response = requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
            results = response.json()['response']['groups'][0]['items']

            # keep only relevant information for each nearby venue
            for v in results:
                rows.append((
                    name,
                    lat,
                    lng,
                    v['venue']['name'],
                    v['venue']['location']['lat'],
                    v['venue']['location']['lng'],
                    v['venue']['categories'][0]['name']))

        # passing columns here keeps the dataframe valid even when no venues were found
        venues_dict[key] = pd.DataFrame(rows, columns=columns)

    return venues_dict
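
##### The function above indexes straight into `["response"]["groups"][0]["items"]`, which raises an exception if a request comes back without that structure (for example, when a quota is exceeded). The sketch below shows one possible defensive parser. The helper name `parse_explore_items` and the error-style payload are hypothetical; the sample venue is taken from the Alhambra row shown later in this notebook.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore-style response into (neighborhood, venue) rows.

    Missing keys yield an empty list instead of raising KeyError/IndexError.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name") if categories else None,
        ))
    return rows


# Hand-written payload mimicking the nested structure the function expects.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Avis Car Rental",
               "location": {"lat": 34.091506, "lng": -118.124138},
               "categories": [{"name": "Rental Car Location"}]}},
]}]}}

rows = parse_explore_items(sample, "Alhambra", 34.093042, -118.127060)

# Hypothetical error-style payload with no "response" key.
empty = parse_explore_items({"meta": {"code": 429}}, "Alhambra", 34.093042, -118.127060)

print(rows)
print(empty)
```

##### With a parser like this, a failed request contributes zero venues for that neighborhood rather than stopping the whole loop; whether silently skipping is acceptable depends on how the counts are used.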

In [39]:
venues_dict = getNearbyVenues(LA_neighs['Neighborhood'], LA_neighs['Latitude'], LA_neighs['Longitude'], categoryIds = catIds, radius = 500, limit = 100)


In [40]:
for key in venues_dict:
    print(key)

Travel & Transport
Arts & Entertainment
Outdoors & Recreation
Nightlife Spot
Food


In [41]:
venues_dict['Travel & Transport']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Adams-Normandie,34.018609,-118.287348,Metro Rail - Expo Park/USC Station (E),34.018237,-118.286094,Light Rail Station
1,Adams-Normandie,34.018609,-118.287348,Natural History Museum (NHM) Metro Bus 102/550,34.018072,-118.288654,Bus Stop
2,Adams-Normandie,34.018609,-118.287348,Metro Rail - Expo/Vermont Station (E),34.018241,-118.291541,Light Rail Station
3,Alhambra,34.093042,-118.127060,Avis Car Rental,34.091506,-118.124138,Rental Car Location
4,Alhambra,34.093042,-118.127060,Days Inn Alhambra CA,34.095091,-118.128615,Hotel
...,...,...,...,...,...,...,...
591,Woodland Hills,34.168436,-118.605838,Topanga/Woodland Hills,34.169685,-118.605958,Intersection
592,Woodland Hills,34.168436,-118.605838,Topanga Canyon Boulevard & Ventura Boulevard,34.168519,-118.605850,Intersection
593,Woodland Hills,34.168436,-118.605838,Bus Stop Metro 150,34.168772,-118.605629,Bus Stop
594,Woodland Hills,34.168436,-118.605838,Glendevon Motors,34.167908,-118.606049,Rental Car Location


In [44]:
## Saving the nearby venues data as a csv to avoid repeated API calls 

for key in venues_dict:
    # strip spaces and punctuation, e.g. 'Travel & Transport' -> 'TravelTransport'
    key1 = re.sub(r'\W+', '', key)
    venues_dict[key].to_csv(key1 + ".csv")

In [6]:
LA_VENUES={}
for key in catIds:
    # rebuild the file name used when saving, e.g. 'Travel & Transport' -> 'TravelTransport'
    key1 = re.sub(r'\W+', '', key)
    LA_VENUES[key]=pd.read_csv(key1 + '.csv')

In [7]:
for key in LA_VENUES:
    print(key)
    print(LA_VENUES[key].shape)

Travel & Transport
(596, 8)
Arts & Entertainment
(733, 8)
Outdoors & Recreation
(1228, 8)
Nightlife Spot
(640, 8)
Food
(3181, 8)


In [8]:
LA_VENUES['Food']

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,Adams-Normandie,34.018609,-118.287348,The Habit Burger Grill,34.020192,-118.286326,Burger Joint
1,1,Adams-Normandie,34.018609,-118.287348,Seeds Marketplace,34.020547,-118.285982,Food Court
2,2,Adams-Normandie,34.018609,-118.287348,Moreton Fig,34.019775,-118.285827,American Restaurant
3,3,Adams-Normandie,34.018609,-118.287348,Chick-fil-A,34.016633,-118.282575,Fast Food Restaurant
4,4,Adams-Normandie,34.018609,-118.287348,Chipotle Mexican Grill,34.016956,-118.282584,Mexican Restaurant
...,...,...,...,...,...,...,...,...
3176,3176,Woodland Hills,34.168436,-118.605838,El Fuego Mexican Kitchen,34.168892,-118.602021,Mexican Restaurant
3177,3177,Woodland Hills,34.168436,-118.605838,Darna Meditaranean Cusine,34.171741,-118.605770,Mediterranean Restaurant
3178,3178,Woodland Hills,34.168436,-118.605838,Villa Piacere,34.168338,-118.610297,Italian Restaurant
3179,3179,Woodland Hills,34.168436,-118.605838,Savory Cafe,34.172049,-118.603941,Food


In [9]:
## Aggregate number of venues of each category for each neighborhood 

LA_grouped={}
for key in LA_VENUES:
    
    LA_grouped[key]= LA_VENUES[key].groupby('Neighborhood').count()[['Venue']]
    LA_grouped[key]=LA_grouped[key].rename(columns={'Venue':str(key)})
    
    
    
    print(LA_grouped[key])
    print(LA_grouped[key].shape)

                   Travel & Transport
Neighborhood                         
Adams-Normandie                     3
Alhambra                            4
Altadena                            2
Arcadia                             1
Arlington Heights                   2
...                               ...
Willowbrook                         1
Wilmington                          3
Windsor Square                      3
Winnetka                            2
Woodland Hills                      6

[149 rows x 1 columns]
(149, 1)
                  Arts & Entertainment
Neighborhood                          
Adams-Normandie                     42
Agoura Hills                         1
Alhambra                             9
Alondra Park                         1
Altadena                             3
...                                ...
Whittier                             6
Whittier Narrows                     1
Wilmington                           4
Windsor Square                       4
Woodl

In [10]:
LA_grouped['Food']

Unnamed: 0_level_0,Food
Neighborhood,Unnamed: 1_level_1
Adams-Normandie,37
Agoura Hills,6
Agua Dulce,5
Alhambra,37
Altadena,9
...,...
Willowbrook,2
Wilmington,13
Windsor Square,20
Winnetka,15


In [18]:
LA_Neighs=LA_Neighs.set_index('Neighborhood')
print(LA_Neighs.shape)
LA_Neighs.head()

(251, 4)


Unnamed: 0_level_0,Unnamed: 0,Region,Latitude,Longitude
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Acton,1,Antelope Valley,34.480742,-118.186838
Adams-Normandie,2,South L.A.,34.018609,-118.287348
Agoura Hills,3,Santa Monica Mountains,34.14791,-118.765704
Agua Dulce,4,Northwest County,34.496382,-118.325635
Alhambra,5,San Gabriel Valley,34.093042,-118.12706


In [19]:
# Joining to create a single dataframe of neighborhoods and number of nearby venues of each category

LA_params=LA_Neighs
for key in LA_grouped:
    print(key)
    LA_params = LA_params.join(LA_grouped[key])
    print(LA_params.shape)
LA_params=LA_params.fillna(0)

Travel & Transport
(251, 5)
Arts & Entertainment
(251, 6)
Outdoors & Recreation
(251, 7)
Nightlife Spot
(251, 8)
Food
(251, 9)


In [24]:
#LA_params=LA_params.drop(columns = ['Unnamed: 0'])
LA_params[:50]
LA_Params=LA_params.drop(columns = ['Region', 'Latitude', 'Longitude'])
LA_Params[:50]

Unnamed: 0_level_0,Travel & Transport,Arts & Entertainment,Outdoors & Recreation,Nightlife Spot,Food
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Acton,0.0,0.0,0.0,0.0,0.0
Adams-Normandie,3.0,42.0,16.0,4.0,37.0
Agoura Hills,0.0,1.0,8.0,0.0,6.0
Agua Dulce,0.0,0.0,0.0,0.0,5.0
Alhambra,4.0,9.0,5.0,10.0,37.0
Alondra Park,0.0,1.0,4.0,0.0,0.0
Altadena,2.0,3.0,6.0,7.0,9.0
Angeles Crest,0.0,0.0,1.0,0.0,0.0
Arcadia,1.0,0.0,4.0,3.0,5.0
Arleta,0.0,2.0,0.0,1.0,1.0


In [53]:
## Normalize each category by its total across all neighborhoods

for key in LA_Params:
    LA_Params[key]=LA_Params[key]/LA_Params[key].sum()
LA_Params[:50]

Unnamed: 0_level_0,Travel & Transport,Arts & Entertainment,Outdoors & Recreation,Nightlife Spot,Food
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Acton,0.0,0.0,0.0,0.0,0.0
Adams-Normandie,0.005034,0.057299,0.013029,0.00625,0.011632
Agoura Hills,0.0,0.001364,0.006515,0.0,0.001886
Agua Dulce,0.0,0.0,0.0,0.0,0.001572
Alhambra,0.006711,0.012278,0.004072,0.015625,0.011632
Alondra Park,0.0,0.001364,0.003257,0.0,0.0
Altadena,0.003356,0.004093,0.004886,0.010937,0.002829
Angeles Crest,0.0,0.0,0.000814,0.0,0.0
Arcadia,0.001678,0.0,0.003257,0.004687,0.001572
Arleta,0.0,0.002729,0.0,0.001563,0.000314


In [101]:
### K-means clustering 

# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(LA_Params)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 3, 0, 0, 2, 0, 2, 0, 0, 0], dtype=int32)
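
##### The number of clusters is fixed by hand at `kclusters = 4`. One common heuristic for sanity-checking such a choice is to compare the k-means inertia across several values of k and look for the point where it stops dropping sharply. The sketch below is not part of the original analysis and runs on synthetic data (three made-up blobs in five dimensions) rather than on `LA_Params`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the normalized venue table: three well-separated
# blobs in five dimensions. These numbers are made up for illustration.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [1.0] * 5, [2.0] * 5])
X = np.vstack([c + 0.05 * rng.standard_normal((30, 5)) for c in centers])

# Fit k-means for several values of k and record the inertia
# (within-cluster sum of squared distances to the nearest centroid).
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

##### On this toy data the inertia falls steeply up to k = 3 and changes little afterwards; on real data the bend is usually less clear, so the result is best treated as a hint rather than a rule.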

In [102]:
LA_Neighs=pd.read_csv('LA_neighborhoods_v0.3.csv')
LA_Neighs=LA_Neighs.drop(columns = ['Unnamed: 0'])
LA_Neighs[:50]

Unnamed: 0,Neighborhood,Region,Latitude,Longitude
0,Acton,Antelope Valley,34.480742,-118.186838
1,Adams-Normandie,South L.A.,34.018609,-118.287348
2,Agoura Hills,Santa Monica Mountains,34.14791,-118.765704
3,Agua Dulce,Northwest County,34.496382,-118.325635
4,Alhambra,San Gabriel Valley,34.093042,-118.12706
5,Alondra Park,South Bay,33.890134,-118.335133
6,Altadena,Verdugos,34.186316,-118.135233
7,Angeles Crest,Angeles Forest,34.234,-118.183386
8,Arcadia,San Gabriel Valley,34.136208,-118.04015
9,Arleta,San Fernando Valley,34.241327,-118.432205


In [103]:

LA_Neighs.insert(0, 'Cluster Labels', kmeans.labels_)

In [104]:
LA_Neighs.head()

Unnamed: 0,Cluster Labels,Neighborhood,Region,Latitude,Longitude
0,0,Acton,Antelope Valley,34.480742,-118.186838
1,3,Adams-Normandie,South L.A.,34.018609,-118.287348
2,0,Agoura Hills,Santa Monica Mountains,34.14791,-118.765704
3,0,Agua Dulce,Northwest County,34.496382,-118.325635
4,2,Alhambra,San Gabriel Valley,34.093042,-118.12706


## Results 

In [105]:
# create map
map_clusters = folium.Map(location=[LAlatitude, LAlongitude], zoom_start=11, width=800, height=900)

# set color scheme for the clusters: one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(LA_Neighs['Latitude'], LA_Neighs['Longitude'], LA_Neighs['Neighborhood'], LA_Neighs['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
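
##### The map shows where each cluster sits, but not what distinguishes one cluster from another. One way to read the clusters is to average the normalized venue shares within each label. The sketch below does this for the first five neighborhoods only, using the normalized values and cluster labels printed earlier in this notebook; the column name `Cluster` is an assumption, and five rows are far too few to draw conclusions from.

```python
import pandas as pd

# Normalized shares for five neighborhoods, copied from the normalized
# table printed earlier in the notebook.
profile = pd.DataFrame(
    {
        "Travel & Transport": [0.0, 0.005034, 0.0, 0.0, 0.006711],
        "Arts & Entertainment": [0.0, 0.057299, 0.001364, 0.0, 0.012278],
        "Outdoors & Recreation": [0.0, 0.013029, 0.006515, 0.0, 0.004072],
        "Nightlife Spot": [0.0, 0.00625, 0.0, 0.0, 0.015625],
        "Food": [0.0, 0.011632, 0.001886, 0.001572, 0.011632],
    },
    index=["Acton", "Adams-Normandie", "Agoura Hills", "Agua Dulce", "Alhambra"],
)

# Cluster labels for the same five rows, from the k-means output.
labels = [0, 3, 0, 0, 2]

# Average venue share per category within each cluster.
cluster_means = profile.assign(Cluster=labels).groupby("Cluster").mean()
print(cluster_means)
```

##### Applied to the full table, the same grouping gives one row per cluster, which makes it easier to describe each group by its dominant venue categories.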

## Discussion 

## Conclusion 