# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we look for locations of interest for people traveling to **Los Angeles, California**. The report is aimed at families, sports lovers, nature lovers, and history lovers, and it also covers restaurants and nightlife in LA.

We will use our data science powers to generate a separate list for each category.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* which towns are located in and around Los Angeles
* what sorts of venues are located near these towns

The following data sources will be needed to extract or generate the required information:
* town names will be scraped from the Los Angeles Times neighborhood list, matched to zip codes, and converted to approximate latitude and longitude using **geocoding**
* venues in every neighborhood, with their type and location, will be obtained using the **Foursquare API**

In [3]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform a JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
import geocoder
from opencage.geocoder import OpenCageGeocode

print('Libraries imported.')

Libraries imported.


### Towns in Los Angeles
Let us get the list of towns in Los Angeles from the Los Angeles Times neighborhood list, parsing the page with **BeautifulSoup**.

In [4]:
import requests
from bs4 import BeautifulSoup

url = "https://maps.latimes.com/neighborhoods/neighborhood/list/"
page = requests.get(url)
soup = BeautifulSoup(page.content,'html.parser')

table = soup.find_all('table')
df = pd.read_html(str(table))[0]
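# Intended to drop names containing "/" or "-". Note that DataFrame.drop returns a
# new frame and the result is not assigned here, so as written no rows are removed
# (the output below still lists Adams-Normandie).
for i in range(len(df)):
    if "/" in df["Name"][i] or "-" in df["Name"][i]:
        df.drop(index = i)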
df.head()

Unnamed: 0,Name,Region
0,Acton,Antelope Valley
1,Adams-Normandie,South L.A.
2,Agoura Hills,Santa Monica Mountains
3,Agua Dulce,Northwest County
4,Alhambra,San Gabriel Valley


### Zip Code
Let us find the zip codes of each town using the `uszipcode` library.

In [5]:
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=True)

In [6]:
zipcode=[]
for i in range(len(df)):
    
    zipcodes = search.by_city_and_state(city=df["Name"][i],state="CA")
    zipcode.append(zipcodes)
print("Done")

Done


In [7]:
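list2 = [x for x in zipcode if x]  # keep only towns with at least one zip code match

In [8]:
for i in range(len(list2)):
    # take the first match and keep its city name and zip code
    lists = list2[i][0].values()
    zipandcity=[lists[2],int(lists[0])]
    list2[i] = zipandcity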

In [9]:
dfnew = pd.DataFrame(list2)
dfnew.rename(columns = {0:'City',1:'Zip Code'},inplace=True)
dfnew.drop_duplicates(inplace=True)
dfnew.reset_index(drop=True, inplace=True)  # renumber rows without keeping the old index as a column

### Latitude and Longitude

Let us now find the latitude and longitude of each town using **geocoding** (converting a place name into coordinates).

In [10]:
import os
key = os.environ['OPENCAGE_API_KEY']  # assumed environment variable name; keeps the API key out of the notebook
geocoder = OpenCageGeocode(key)

In [11]:
latitude = []
longitude = []
for i in range(len(dfnew)):
    
    query=str(dfnew['City'][i]+", CA")
    results = geocoder.geocode(query)
    latitude.append(results[0]['geometry']['lat'])
    longitude.append(results[0]['geometry']['lng'])

dfnew['Latitude']=latitude
dfnew["Longitude"] = longitude
print("Done")

Done
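
Because the queries are bare town names followed by "CA", it may be worth checking that each returned coordinate actually falls near Los Angeles. The sketch below is not part of the original notebook: the helper name `in_la_area` is hypothetical, and the bounding box values are rough assumptions chosen for illustration, not official boundaries.

```python
# Rough bounding box around the Los Angeles area (assumed values, for illustration only).
LA_BOUNDS = {"lat_min": 33.6, "lat_max": 34.9, "lng_min": -119.0, "lng_max": -117.6}

def in_la_area(lat, lng, bounds=LA_BOUNDS):
    """Return True when (lat, lng) lies inside the assumed bounding box."""
    return (bounds["lat_min"] <= lat <= bounds["lat_max"]
            and bounds["lng_min"] <= lng <= bounds["lng_max"])

# Neighborhood coordinates taken from the venue table shown later in this notebook.
print(in_la_area(34.14791, -118.765704))   # Agoura Hills -> True
print(in_la_area(39.747111, -122.191136))  # Orland -> False
```

In the notebook, one possible use would be to keep only the rows of `dfnew` whose `Latitude` and `Longitude` pass this check before querying Foursquare.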


### Foursquare
Now that we have our towns and their coordinates, let's use the Foursquare API to get venue information for each one.




In [12]:
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID (environment variable name is an assumption)
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret (environment variable name is an assumption)
VERSION = '20180605' # Foursquare API version

In [13]:
LIMIT=100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
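
The request construction and response parsing in `getNearbyVenues` can also be sketched with the standard library alone, which makes the two steps easy to test without calling the API. This is a minimal sketch rather than the notebook's method: `build_explore_url`, `extract_venues`, and the sample payload are hypothetical, and the payload mirrors only the fields that `getNearbyVenues` reads. Unlike the original, it tolerates a venue with no categories.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret, version, radius=500, limit=100):
    """Build the explore URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

def extract_venues(payload):
    """Pull (name, lat, lng, category) tuples out of a decoded explore response.

    Returns an empty list when no groups are present, and uses None as the
    category when a venue has no categories.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows

# Hypothetical payload that mirrors only the fields read by getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Park", "location": {"lat": 34.0, "lng": -118.0},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unlabeled Spot", "location": {"lat": 34.1, "lng": -118.1},
               "categories": []}},
]}]}}

print(build_explore_url(34.05, -118.24, "ID", "SECRET", "20180605"))
print(extract_venues(sample))
```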

In [14]:
la_venues = getNearbyVenues(names=dfnew['City'],
                                   latitudes=dfnew['Latitude'],
                                   longitudes=dfnew['Longitude']
                                  )

Acton
Orland
Agoura Hills
Alhambra
Oak Park
Altadena
Creston
Arcadia
Artesia
Catheys Valley
Atwater
Hacienda Heights
Azusa
Baldwin Park
March Air Reserve Base
Bellflower
Bell Gardens
Elk Grove
Beverly Hills
Lynwood
Doyle
Brentwood
Chester
Burbank
Calabasas
Canoga Park
Carson
Castaic
Alameda
Sun City
Cerritos
Oak Run
Chatsworth
Chino Hills
Citrus Heights
Claremont
Merced
Compton
Covina
Culver City
Cypress
Corona Del Mar
Del Rey
Highland
Diamond Bar
Downey
Winton
Duarte
West Hollywood
La Mirada
Los Angeles
Pasadena
San Gabriel
Bass Lake
El Monte
El Segundo
El Dorado Hills
Seiad Valley
Encino
Buena Park
Fairfax
Nice
Gardena
Glendale
Glendora
Granada Hills
Canyon Dam
Harbor City
Hawaiian Gardens
Hawthorne
Hermosa Beach
Hidden Valley Lake
South El Monte
North Hollywood
Lost Hills
Huntington Park
Inglewood
Orinda
Deer Park
Forbestown
La Canada Flintridge
La Crescenta
Rowland Heights
La Habra
Lake Hughes
Lakewood
Lancaster
La Puente
Lamont
La Verne
Lawndale
Glenn
Lincoln
Littlerock
Lomita
Lon

## Methodology <a name="methodology"></a>

In this project we classify each venue based on its Foursquare category. For each of our target groups, we define a list of keywords that may appear in a venue's category name. We then check each venue's category against these lists, and whenever a keyword appears in the category name, the venue is added to that group's list. This way we can split the data into separate groups, and a venue can belong to more than one of them.
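
The keyword matching described above can be sketched on plain strings as follows. The helper name `matching_groups` is hypothetical, the keyword lists are shortened versions of the ones used in the Analysis section, and "Beer Garden" and "Parking" are made-up inputs chosen to show two side effects of substring matching: a venue can land in several groups, and an unrelated category can match by accident.

```python
def matching_groups(category, keyword_groups):
    """Return the names of all groups with at least one keyword found in category."""
    return [group for group, keywords in keyword_groups.items()
            if any(word in category for word in keywords)]

# Shortened keyword lists, taken from the lists used in the Analysis section.
keyword_groups = {
    "Sports": ["Sports", "Gym", "Pool"],
    "Party": ["Pub", "Bar", "Beer"],
    "Nature": ["Garden", "Park"],
    "Food": ["Restaurant", "Pizza"],
}

for category in ["American Restaurant", "Beer Garden", "Parking", "ATM"]:
    print(category, "->", matching_groups(category, keyword_groups))
# American Restaurant -> ['Food']
# Beer Garden -> ['Party', 'Nature']
# Parking -> ['Nature']
# ATM -> []
```

Matching on whole words, or on an explicit mapping from category names to groups, would be one way to avoid accidental hits such as "Parking".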

## Analysis <a name="analysis"></a>

Let us take a look at the unique venue categories that are included and remove any repeated venues if they exist.

In [15]:
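# keep=False drops every row whose venue latitude appears more than once, including the first occurrence
lavenues = la_venues.drop_duplicates(subset='Venue Latitude',keep=False)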
lavenues.shape

(2768, 7)

In [16]:
lavenues = lavenues.reset_index(drop=True)
lavenues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Orland,39.747111,-122.191136,Farwood Bar & Grill,39.747169,-122.195256,American Restaurant
1,Orland,39.747111,-122.191136,U.S. Bank ATM,39.747878,-122.185397,ATM
2,Orland,39.747111,-122.191136,Orland Library Park,39.744249,-122.194052,Park
3,Orland,39.747111,-122.191136,Billy & Emily's Donuts,39.74794,-122.185473,Donut Shop
4,Agoura Hills,34.14791,-118.765704,LA Fitness,34.146057,-118.766854,Gym


In [17]:
len(lavenues['Venue Category'].unique())

303

Let us now extract the data into different categories.

In [18]:
sports=["Sports","Baseball","Hockey","Volleyball","Gym","Pool","Soccer", "Basketball","Football"]
family=["Theme Park","Bowling","Kids","City","Theater"]
party=["Pub","Bar","Beer","Nightlife","Hookah"]
nature=["Scenic","Garden","Park","Fountain"]
history=["Antique", "Museum","Concert","Art"]
food=["Food","Restaurant","Dessert","Bistro","Cupcake","Pizza","Ice Cream","Pie","Bubble Tea"]

sports_lovers = []
family_trip = []
party_places = []
nature_lovers = []
history_lovers = []
places_to_eat = []

for i in range(len(lavenues)):
    
    category = lavenues["Venue Category"][i]
    
    if any(ext in category for ext in food):
        
        places_to_eat.append(lavenues.iloc[i])
        
    if any(ext in category for ext in sports):
        
        sports_lovers.append(lavenues.iloc[i])
    
    if any(ext in category for ext in family):
        
        family_trip.append(lavenues.iloc[i])
    
    if any(ext in category for ext in party):
        
        party_places.append(lavenues.iloc[i])
    
    if any(ext in category for ext in nature):
        
        nature_lovers.append(lavenues.iloc[i])
    
    if any(ext in category for ext in history):
        
        history_lovers.append(lavenues.iloc[i])

Now let's merge the data into one dataframe and clean up the results a little bit.

In [19]:
# DataFrame.append was removed in pandas 2.0, so the matched rows are stacked
# with the DataFrame constructor instead of being appended one at a time.
def build_category_frame(rows, label, cluster_id):
    """Stack the matched rows into one dataframe tagged with a category label and cluster id."""
    frame = pd.DataFrame(rows)  # each element of rows is a Series taken from lavenues
    frame = frame.drop(columns=['Neighborhood', 'Neighborhood Latitude',
                                'Neighborhood Longitude', 'Venue Category'])
    frame['Category'] = label
    frame['Cluster'] = cluster_id
    return frame.reset_index(drop=True)

# Note: these assignments reuse the names of the keyword lists defined above.
sports = build_category_frame(sports_lovers, "Sports", 1)
food = build_category_frame(places_to_eat, "Food", 2)
family = build_category_frame(family_trip, "Family", 3)
party = build_category_frame(party_places, "Party", 4)
history = build_category_frame(history_lovers, "History", 5)
nature = build_category_frame(nature_lovers, "Nature", 6)

In [30]:
len(family)

51

In [20]:
df = pd.concat([sports,food,family,party,history,nature], ignore_index=True)

In [21]:
df.head()

Unnamed: 0,Venue,Venue Latitude,Venue Longitude,Category,Cluster
0,LA Fitness,34.1461,-118.767,Sports,1
1,Body Energy Club,49.2777,-123.127,Sports,1
2,Vancouver Aquatic Centre,49.277,-123.135,Sports,1
3,Robert Lee YMCA,49.2818,-123.125,Sports,1
4,Club Atwater,45.4928,-73.5892,Sports,1


Now let's see how it looks on a map.

In [22]:
# create map
map_clusters = folium.Map(location=[34.052235, -118.243683], zoom_start=11)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, 6))  # one color per cluster
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df['Venue Latitude'], df['Venue Longitude'], df['Venue'], df['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
display(map_clusters)

## Results and Discussion <a name="results"></a>

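As the map shows, and as we might expect, most of the venues are food-related. The next largest category is nightlife, which fits Los Angeles's reputation. The venues are fairly scattered, so no single area is dominated by one category. Note that some rows in the combined table above lie far from Los Angeles (for example, venues at latitudes 49.28 and 45.49), which suggests that the town lookup and geocoding steps returned some out-of-region matches; these should be filtered out before drawing firm conclusions.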
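
The ranking of categories described above can be checked by tallying the category labels. The sketch below uses a toy list of labels for illustration only; these are not the notebook's actual results. In the notebook itself, `df['Category'].value_counts()` would give the same kind of tally directly.

```python
from collections import Counter

# Toy labels for illustration only; these are not the notebook's actual results.
labels = ["Food", "Food", "Party", "Sports", "Food", "Party"]

counts = Counter(labels)
for category, n in counts.most_common():
    share = n / len(labels)
    print(f"{category}: {n} ({share:.0%})")
```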

## Conclusion <a name="conclusion"></a>

The goal of this project was to help people traveling to Los Angeles identify places to visit based on their interests. We first collected the venue data and organized it into the different categories, and then we plotted the venues on a map so they are easy to explore visually.

The final decision will be made by the people traveling; this only helps make things a little clearer.