<a href="https://colab.research.google.com/github/jimmy-io/Coursera_Capstone/blob/master/IBM_CapStone_JimmyJ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Capstone Project

## Jimmy J.

### Introduction
Frank Lloyd Wright said it best: "Tip the world over on its side and everything loose will land in Los Angeles." For decades, people from all over the world have come to this city in search of home, family, love, fame, and fortune. Los Angeles welcomes this multicultural migration with a suburban sprawl that encompasses over 88 cities and even more unincorporated neighborhoods. Each of these neighborhoods is characterized by geographical, economic, and cultural features that make it uniquely poised to cater to different demographics.
Neighborhoods like Downtown Los Angeles, Culver City, Long Beach, Century City, and West Hollywood combine urban-style living with grocery stores, malls, public transportation, and entertainment venues within walking distance. On the other hand, suburbs like Baldwin Hills, Crenshaw, Echo Park, and Boyle Heights are quieter neighborhoods with single-family dwellings that are not easily accessible by public transportation.
Given the vast spectrum of neighborhoods to choose from in LA, someone looking to move here might be overwhelmed. In this project I've attempted to characterize neighborhoods in LA by the nature of the venues in their immediate vicinity, using a clustering algorithm. Results from this analysis show that neighborhoods in LA fall into a few groups, defined by the nature of the venues closest to them. This is of most interest to rental search apps like Westside Rentals or Rentpad, to real estate agents, and, more generally, to people looking to move to LA. The results of this project can help them find neighborhoods that are most aligned with what they are looking for in a place to live and, overall, provide a more satisfactory experience than choosing a neighborhood at random.

### Data
The list of neighborhoods in Los Angeles was scraped from the Los Angeles Times neighborhood list (maps.latimes.com) using BeautifulSoup. The names of these neighborhoods were then fed into Nominatim to obtain their geographical coordinates.
The venues in the vicinity of these neighborhoods were retrieved using the Foursquare API's venue explore endpoint. Venues of five different categories were chosen: Travel & Transport, Arts & Entertainment, Outdoors & Recreation, Nightlife Spot, and Food. The number of venues of each category for each neighborhood was counted and then normalized to give five features on which to cluster the neighborhoods.
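
The counting and normalizing step described above might look roughly like the sketch below. It is only an illustration: the toy tables stand in for the real per-category venue data, and min-max scaling is an assumption, since the text does not say which normalization was used.

```python
import pandas as pd

# Toy stand-in for the per-category venue tables (hypothetical data, not API output)
venues_by_category = {
    'Food': pd.DataFrame({'Neighborhood': ['A', 'A', 'A', 'B']}),
    'Nightlife Spot': pd.DataFrame({'Neighborhood': ['A', 'B', 'B', 'C']}),
}

# Count venues per neighborhood in each category; a neighborhood with no
# venues in a category gets 0 rather than NaN
counts = pd.DataFrame(
    {category: df['Neighborhood'].value_counts()
     for category, df in venues_by_category.items()}
).fillna(0).sort_index()

# Assumed normalization: min-max scaling of each category column to [0, 1]
col_min = counts.min()
col_range = (counts.max() - col_min).replace(0, 1)  # avoid dividing by zero
features = (counts - col_min) / col_range

print(features)
```

Each row of `features` then describes one neighborhood with one value per category, which is the shape a clustering algorithm expects.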

In [0]:
### Importing libraries

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import re # regular expressions, used to build file names
import json # library to handle JSON files
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
import bs4 as bs # BeautifulSoup, for parsing the scraped HTML
import urllib.request # fetch the page to scrape

In [2]:
#@hidden_cell

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder; keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version

print('Your credentials: hidden')

Your credentials: hidden


In [3]:
# Scraping the list of neighborhoods in LA

source = urllib.request.urlopen('http://maps.latimes.com/neighborhoods/neighborhood/list/').read()
soup = bs.BeautifulSoup(source, 'lxml')

table = soup.find('table')
table_rows = table.find_all('tr')

# each data row holds two links: the neighborhood and its region
ls = []
for tr in table_rows:
    links = tr.find_all('a')
    row = [link.text.strip() for link in links]
    ls.append(row)

LA_neighs = pd.DataFrame(ls, columns = ['Neighborhood','Region'])

# drop the first row (the table header)
LA_neighs = LA_neighs.drop(LA_neighs.index[0])

LA_neighs.tail()

Unnamed: 0,Neighborhood,Region
268,Willowbrook,South L.A.
269,Wilmington,Harbor
270,Windsor Square,Central L.A.
271,Winnetka,San Fernando Valley
272,Woodland Hills,San Fernando Valley


In [0]:
# Getting the latitude and longitude of neighborhoods in LA
from geopy.exc import GeopyError

latitude = []
longitude = []

# create the geocoder once instead of on every iteration
geolocator = Nominatim(user_agent="LA_explorer")

for name in LA_neighs['Neighborhood']:
    address = str(name)+", Los Angeles, California"
    try:
        location = geolocator.geocode(address)
    except GeopyError:
        location = None  # service error or timeout; treated as not found
    # geocode() returns None when the address cannot be resolved
    latitude.append(location.latitude if location else None)
    longitude.append(location.longitude if location else None)

LA_neighs['Latitude'] = latitude
LA_neighs['Longitude'] = longitude
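
The loop above sends one request per neighborhood with no pause, while the public Nominatim service asks clients to stay at or below one request per second. A minimal sketch of a throttled version follows. The helper name `geocode_names`, its parameters, and the `FakeLocation`/`fake_geocode` stand-ins are hypothetical and exist only so the example runs offline.

```python
import time

def geocode_names(names, geocode, suffix=", Los Angeles, California", delay=1.0):
    """Geocode each name with the supplied callable, pausing between calls.

    `geocode` is any function that takes an address string and returns an
    object with .latitude/.longitude, or None when nothing is found.
    """
    coords = {}
    for i, name in enumerate(names):
        if i and delay:
            time.sleep(delay)  # wait between consecutive requests
        location = geocode(str(name) + suffix)
        coords[name] = (location.latitude, location.longitude) if location else (None, None)
    return coords


# Offline stand-ins for a real geocoder, used only to show the call pattern
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

def fake_geocode(address):
    known = {"Acton, Los Angeles, California": FakeLocation(34.480741, -118.186838)}
    return known.get(address)

result = geocode_names(["Acton", "Nowhere"], fake_geocode, delay=0)
print(result)
```

With geopy, the same helper could be called as `geocode_names(LA_neighs['Neighborhood'], Nominatim(user_agent="LA_explorer").geocode)`. geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which wraps a geocode function with a minimum delay and may be a simpler option.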

In [0]:
LA_neighs.dropna(inplace = True)

In [11]:
LA_neighs[:50]

Unnamed: 0,Neighborhood,Region,Latitude,Longitude
1,Acton,Antelope Valley,34.480741,-118.186838
2,Adams-Normandie,South L.A.,34.015951,-118.283513
3,Agoura Hills,Santa Monica Mountains,34.14791,-118.765704
5,Alhambra,San Gabriel Valley,34.093042,-118.12706
6,Alondra Park,South Bay,33.890134,-118.335133
7,Altadena,Verdugos,34.186316,-118.135233
8,Angeles Crest,Angeles Forest,34.234,-118.183386
9,Arcadia,San Gabriel Valley,34.136207,-118.04015
10,Arleta,San Fernando Valley,34.241327,-118.432205
11,Arlington Heights,Central L.A.,34.128113,-118.158903


In [0]:
### Saving data to csv to avoid repeated API calls
LA_neighs.to_csv('/content/drive/My Drive/LA_neighborhoods.csv')

In [0]:
%cd '/content/drive/My Drive/'

/content/drive/My Drive


In [13]:
# reload the saved table under the same name used by the rest of the notebook
LA_neighs=pd.read_csv('/content/drive/My Drive/LA_neighborhoods.csv')
LA_neighs=LA_neighs.drop(columns = ['Unnamed: 0'])
LA_neighs[:50]

Unnamed: 0,Neighborhood,Region,Latitude,Longitude
0,Acton,Antelope Valley,34.480741,-118.186838
1,Adams-Normandie,South L.A.,34.015951,-118.283513
2,Agoura Hills,Santa Monica Mountains,34.14791,-118.765704
3,Alhambra,San Gabriel Valley,34.093042,-118.12706
4,Alondra Park,South Bay,33.890134,-118.335133
5,Altadena,Verdugos,34.186316,-118.135233
6,Angeles Crest,Angeles Forest,34.234,-118.183386
7,Arcadia,San Gabriel Valley,34.136208,-118.04015
8,Arleta,San Fernando Valley,34.241327,-118.432205
9,Arlington Heights,Central L.A.,34.128113,-118.158903


In [0]:
LA_neighs.shape

(252, 4)

In [8]:
# Retrieving the coordinates for Los Angeles

address = 'Los Angeles, CA'

geolocator = Nominatim(user_agent="LA_explorer")
location = geolocator.geocode(address)
LAlatitude = location.latitude
LAlongitude = location.longitude
print('The geographical coordinates of Los Angeles are {}, {}.'.format(LAlatitude, LAlongitude))

The geographical coordinates of Los Angeles are 34.0536909, -118.2427666.


In [14]:
# create map of LA using latitude and longitude values
map_LA = folium.Map(location=[LAlatitude, LAlongitude], zoom_start=11, width=800, height=600)

# add one marker per neighborhood, labelled "Neighborhood, Region"
for lat, lng, neighborhood, region in zip(LA_neighs['Latitude'], LA_neighs['Longitude'], LA_neighs['Neighborhood'], LA_neighs['Region']):
    label = '{}, {}'.format(neighborhood, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_LA)
map_LA

In [16]:
## Categories and IDs from the Foursquare API

catIds={'Travel & Transport': '4d4b7105d754a06379d81259', 
        'Arts & Entertainment': '4d4b7104d754a06370d81259',
        'Outdoors & Recreation':'4d4b7105d754a06377d81259',
        'Nightlife Spot':'4d4b7105d754a06376d81259',
        'Food':'4d4b7105d754a06374d81259'}
catIds

{'Arts & Entertainment': '4d4b7104d754a06370d81259',
 'Food': '4d4b7105d754a06374d81259',
 'Nightlife Spot': '4d4b7105d754a06376d81259',
 'Outdoors & Recreation': '4d4b7105d754a06377d81259',
 'Travel & Transport': '4d4b7105d754a06379d81259'}

In [0]:
# More Foursquare categories, not used in this analysis
# (in addition to the five already listed in catIds)
# 'Event': '4d4b7105d754a06373d81259',
# 'Professional & Other Places': '4d4b7105d754a06375d81259',
# 'Residence': '4e67e38e036454776db1fb3a',
# 'Shop & Service': '4d4b7105d754a06378d81259',
# 'College & University': '4d4b7105d754a06372d81259'

In [0]:
for key in catIds:
    print(key)
    print(catIds[key])

Travel & Transport
4d4b7105d754a06379d81259
Arts & Entertainment
4d4b7104d754a06370d81259
Outdoors & Recreation
4d4b7105d754a06377d81259
Nightlife Spot
4d4b7105d754a06376d81259
Food
4d4b7105d754a06374d81259


In [0]:

# Function to extract nearby venues, one DataFrame per category

def getNearbyVenues(names, latitudes, longitudes, categoryIds, radius, limit):
    """Return a dict mapping each category name to a DataFrame of nearby venues."""
    url = 'https://api.foursquare.com/v2/venues/explore'
    columns = ['Neighborhood', 
               'Neighborhood Latitude', 
               'Neighborhood Longitude', 
               'Venue', 
               'Venue Latitude', 
               'Venue Longitude', 
               'Venue Category']
    venues_dict = {}

    for key, categoryId in categoryIds.items():
        venues_list = []
        for name, lat, lng in zip(names, latitudes, longitudes):
            params = {
                'client_id': CLIENT_ID,
                'client_secret': CLIENT_SECRET,
                'v': VERSION,
                'll': '{},{}'.format(lat, lng),
                'radius': radius,          # search radius around the neighborhood
                'categoryId': categoryId,
                'limit': limit,            # maximum number of venues returned per request
            }

            # make the GET request
            results = requests.get(url, params=params).json()['response']['groups'][0]['items']

            # keep only the relevant information for each nearby venue
            venues_list.extend(
                (name,
                 lat,
                 lng,
                 v['venue']['name'],
                 v['venue']['location']['lat'],
                 v['venue']['location']['lng'],
                 v['venue']['categories'][0]['name'])
                for v in results)

        # passing the column names here also works when no venues were found
        venues_dict[key] = pd.DataFrame(venues_list, columns=columns)

    return venues_dict
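
The function above indexes straight into `response → groups → items`, so a single failed or rate-limited request would raise and lose everything collected so far. One possible way to make the parsing step tolerant is sketched below. The helper `extract_venue_rows` and both sample payloads are hypothetical: the first payload is hand-built to mirror the fields the function reads, and the second is an assumed shape for an error reply.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response into rows; return [] if the expected keys are missing."""
    try:
        items = payload['response']['groups'][0]['items']
    except (KeyError, IndexError, TypeError):
        return []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or [{}]  # tolerate venues without a category
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0].get('name')))
    return rows


# Hand-built payload mirroring the fields used above (not real API output)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'USC Hotel',
               'location': {'lat': 34.018980, 'lng': -118.281696},
               'categories': [{'name': 'Hotel'}]}},
]}]}}

rows = extract_venue_rows('Adams-Normandie', 34.015951, -118.283513, sample)
# Assumed shape of an error reply that carries no 'response' key
empty = extract_venue_rows('Adams-Normandie', 34.015951, -118.283513, {'meta': {'code': 429}})
print(rows, empty)
```

Returning an empty list keeps the neighborhood loop running; whether to skip, retry, or log such cases is a design choice left open here.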

In [0]:
venues_dict = getNearbyVenues(LA_neighs['Neighborhood'], LA_neighs['Latitude'], LA_neighs['Longitude'], categoryIds = catIds, radius = 500, limit = 100)

In [0]:
for key in venues_dict:
    print(key)


Travel & Transport
Arts & Entertainment
Outdoors & Recreation
Nightlife Spot
Food


In [0]:
venues_dict['Travel & Transport']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Adams-Normandie,34.015951,-118.283513,Metro Rail - Expo Park/USC Station (E),34.018237,-118.286094,Light Rail Station
1,Adams-Normandie,34.015951,-118.283513,Harbor Transit and 37th St,34.017649,-118.280218,Bus Station
2,Adams-Normandie,34.015951,-118.283513,USC Hotel,34.018980,-118.281696,Hotel
3,Alhambra,34.093042,-118.127060,Avis Car Rental,34.091506,-118.124138,Rental Car Location
4,Alhambra,34.093042,-118.127060,Days Inn Alhambra CA,34.095091,-118.128615,Hotel
...,...,...,...,...,...,...,...
618,Woodland Hills,34.168436,-118.605838,Topanga/Woodland Hills,34.169685,-118.605958,Intersection
619,Woodland Hills,34.168436,-118.605838,Topanga Canyon Boulevard & Ventura Boulevard,34.168519,-118.605850,Intersection
620,Woodland Hills,34.168436,-118.605838,Bus Stop Metro 150,34.168772,-118.605629,Bus Stop
621,Woodland Hills,34.168436,-118.605838,Glendevon Motors,34.167908,-118.606049,Rental Car Location


In [0]:
## Saving the nearby venues data as a csv to avoid repeated API calls 

for key in venues_dict:
    # strip spaces and punctuation so the category name is a safe file name
    key1=re.sub(r'[^A-Za-z0-9&]+ ', '', key)
    key1=re.sub(r'\W+','', key1)
    venues_dict[key].to_csv(str(key1)+".csv")

In [0]:
## Reading the nearby venues data as a csv
LA_VENUES={}
for key in catIds:
    # rebuild the same file name that was used when saving
    key1=re.sub(r'[^A-Za-z0-9&]+ ', '', key)
    key1=re.sub(r'\W+','', key1)
    LA_VENUES[key]=pd.read_csv('/content/drive/My Drive/'+str(key1)+'.csv')
    LA_VENUES[key].dropna(inplace = True)
    LA_VENUES[key].drop(columns = ['Unnamed: 0'], inplace = True)
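
The save and load cells both rebuild the file name from the category name with the same pair of substitutions. A small helper, sketched here under the hypothetical name `category_to_filename`, would keep the two in sync. For the five category names used in this notebook, a single `\W+` substitution gives the same result as the two-step version.

```python
import re

def category_to_filename(category, folder=''):
    """Turn a category name such as 'Travel & Transport' into 'TravelTransport.csv'."""
    stem = re.sub(r'\W+', '', category)  # drop spaces, '&' and other non-word characters
    return folder + stem + '.csv'

names = [category_to_filename(c) for c in
         ['Travel & Transport', 'Arts & Entertainment', 'Food']]
print(names)
```

Saving would then use `category_to_filename(key)` and loading `category_to_filename(key, '/content/drive/My Drive/')`.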

In [30]:
for key in LA_VENUES:
    print(key)
    print(LA_VENUES[key].shape)

Travel & Transport
(623, 7)
Arts & Entertainment
(759, 7)
Outdoors & Recreation
(1254, 7)
Nightlife Spot
(663, 7)
Food
(3344, 7)


In [31]:
LA_VENUES['Food']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Adams-Normandie,34.015951,-118.283513,Figueroa Philly Cheese Steak,34.014254,-118.282527,Sandwich Place
1,Adams-Normandie,34.015951,-118.283513,Chick-fil-A,34.016633,-118.282575,Fast Food Restaurant
2,Adams-Normandie,34.015951,-118.283513,Chipotle Mexican Grill,34.016956,-118.282584,Mexican Restaurant
3,Adams-Normandie,34.015951,-118.283513,Pizza Studio,34.018569,-118.281668,Pizza Place
4,Adams-Normandie,34.015951,-118.283513,Holbox,34.017515,-118.278439,Seafood Restaurant
...,...,...,...,...,...,...,...
3339,Woodland Hills,34.168436,-118.605838,El Fuego Mexican Kitchen,34.168892,-118.602021,Mexican Restaurant
3340,Woodland Hills,34.168436,-118.605838,Darna Meditaranean Cusine,34.171741,-118.605770,Mediterranean Restaurant
3341,Woodland Hills,34.168436,-118.605838,Villa Piacere,34.168338,-118.610297,Italian Restaurant
3342,Woodland Hills,34.168436,-118.605838,Savory Cafe,34.172049,-118.603941,Food
