# Capstone Project - The Battle of Neighborhoods
## IBM Data Science
###### by Matheus Faria

#### Opening a new Brazilian Restaurant in San Francisco, CA

### Introduction

San Francisco, officially the City and County of San Francisco and colloquially known as SF, San Fran, Frisco, or The City, is the cultural, commercial, and financial center of Northern California. San Francisco is the 15th most populous city in the United States, and the fourth most populous in California, with 881,549 residents as of 2019. It covers an area of about 46.89 square miles (121.4 km²), mostly at the north end of the San Francisco Peninsula in the San Francisco Bay Area, making it the second most densely populated large U.S. city, and the fifth most densely populated U.S. county, behind only four of the five New York City boroughs. San Francisco is the 12th-largest metropolitan statistical area in the United States by population, with 4.7 million people, and the fourth-largest by economic output, with GDP of $549 billion in 2018. With San Jose, it forms the fifth most populous combined statistical area in the United States, the San Jose–San Francisco–Oakland, CA Combined Statistical Area (9.67 million residents in 2018).

As of 2020, San Francisco has the highest salaries, disposable income, and median home prices in the world, the latter at 1.7 million dollars. In 2018, San Francisco was the seventh-highest-income county in the United States, with a per capita personal income of $130,696. In the same year, San Francisco proper had a GDP of 183.2 billion dollars and a GDP per capita of 207,371 dollars. The CSA San Francisco shares with San Jose and Oakland was the country's third-largest urban economy as of 2018, with a GDP of 1.03 trillion dollars. Of the 500+ primary statistical areas in the U.S., this CSA had among the highest GDP per capita in 2018, at 106,757 dollars. San Francisco was ranked 8th in the world and 2nd in the United States on the Global Financial Centres Index as of March 2020. As of 2016, the San Francisco metropolitan area had the highest GDP per capita, labor productivity, and household income levels in the OECD. As of 2019, it is the highest rated American city on world liveability rankings.

### Main Problem

A group of investors from Brazil, the stakeholders in this project, want to open a new Brazilian restaurant in San Francisco because they believe the city is the perfect place for it.

Brazilian food is famous around the world for its diversity and can please every taste, from salads to barbecue to the famous 'Feijoada', as it is called there.

Brazilian food is even better when it is homemade with fresh ingredients, so we are looking for a location with farmers markets nearby.

The main problem of this project is to find the best location for a new Brazilian restaurant. We are looking for areas that have no restaurant of this kind and that have complementary venues, such as farmers markets, so the food can always be kept as fresh as possible.

### Data

About the data

To complete this task we need the following data:

- List of neighborhoods in San Francisco
- Geographical coordinates of the neighborhoods to plot the data on a map
- San Francisco Brazilian restaurants data
- San Francisco Farmers market data

The list of neighborhoods in San Francisco will be extracted from https://en.wikipedia.org/wiki/Category:Neighborhoods_in_San_Francisco. The latitude and longitude coordinates of the neighborhoods will be retrieved using the Geocoder package. The data for Brazilian restaurants and farmers markets will be retrieved using the Foursquare API.
All of the data is free and available to everyone.

### Methodology section

In [1]:
# import libraries
import numpy as np 
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
!pip install geocoder 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
!pip install folium
import folium # map rendering library

print("Libraries imported.")

Libraries imported.


### Preparing the data

In [2]:
# Retrieve data from Wikipedia page
data = requests.get("https://en.wikipedia.org/wiki/Category:Neighborhoods_in_San_Francisco").text

# Parse data
soup = BeautifulSoup(data, 'html.parser')

# Create empty neighborhood list
neighborhoodList = []

# find_all is the current BeautifulSoup name for the deprecated findAll alias
for row in soup.find_all("div", class_="mw-content-ltr")[0].find_all("li"):
    neighborhoodList.append(row.text)
df = pd.DataFrame({"Neighborhood": neighborhoodList})
df

Unnamed: 0,Neighborhood
0,"► Barbary Coast, San Francisco‎ (17 P)"
1,"► Castro District, San Francisco‎ (28 P)"
2,"► Chinatown, San Francisco‎ (1 C, 88 P)"
3,"► Civic Center, San Francisco‎ (29 P)"
4,"► Financial District, San Francisco‎ (2 C, 11..."
5,"► Fisherman's Wharf, San Francisco‎ (3 C, 35 P)"
6,"► Haight-Ashbury, San Francisco‎ (34 P)"
7,"► Mission District, San Francisco‎ (126 P, 2 F)"
8,"► Nob Hill, San Francisco‎ (31 P)"
9,"► North Beach, San Francisco‎ (3 C, 69 P)"


In [3]:
# Create new dataframe with only the neighborhoods and removing irrelevant subcategories
sanfrancisco_df = df.iloc[20:].reset_index(drop=True)
sanfrancisco_df.head()

Unnamed: 0,Neighborhood
0,"Alamo Square, San Francisco"
1,Alta Plaza
2,"Anza Vista, San Francisco"
3,"Balboa Park, San Francisco"
4,"Balboa Terrace, San Francisco"


In [4]:
sanfrancisco_df.shape

(95, 1)
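
The positional slice `df.iloc[20:]` depends on the page layout at the time of scraping. As a hedged alternative, the sketch below filters by content instead: it assumes the subcategory entries are the ones prefixed with `►`, as in the output shown earlier, and it also strips the trailing `, San Francisco` so the city name is not repeated when the geocoding query is built. The function name `clean_neighborhood_names` and the sample list are illustrative, not part of the original notebook.

```python
import re

def clean_neighborhood_names(names):
    """Drop subcategory entries and strip the trailing ', San Francisco'."""
    cleaned = []
    for name in names:
        name = name.strip()
        # Assumption: subcategory links on the category page start with the '►' marker
        if name.startswith("►"):
            continue
        # Remove the city qualifier at the end of the name, if present
        name = re.sub(r",\s*San Francisco$", "", name)
        # Keep each neighborhood only once, preserving the original order
        if name not in cleaned:
            cleaned.append(name)
    return cleaned

# Illustrative sample modeled on the scraped list above
sample = [
    "► Nob Hill, San Francisco (31 P)",
    "Alamo Square, San Francisco",
    "Alta Plaza",
]
cleaned_sample = clean_neighborhood_names(sample)
print(cleaned_sample)
```

The cleaned list can then be wrapped with `pd.DataFrame({"Neighborhood": cleaned_sample})` in the same way as `neighborhoodList`.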

In [None]:
# define a function to get coordinates
def get_latlng(neighborhood):
    lat_lng_coords = None
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, San Francisco, United States'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
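
Note that the `while` loop in `get_latlng` never ends if the geocoding service keeps returning no result for a name. A minimal sketch of a bounded variant follows. The helper name `get_latlng_with_retries`, its `geocode` argument, and the retry settings are assumptions made for this example; the stand-in `flaky_geocode` only simulates a service that fails twice before answering.

```python
import time

def get_latlng_with_retries(neighborhood, geocode, max_attempts=5, delay_seconds=1.0):
    """Return [lat, lng] for a neighborhood, or None after max_attempts failures.

    `geocode` is any callable that takes a query string and returns
    [lat, lng] or None (a hypothetical hook used only in this sketch).
    """
    query = "{}, San Francisco, United States".format(neighborhood)
    for attempt in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return coords
        # Wait before retrying, but not after the final attempt
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None

# Stand-in geocoder for demonstration: returns None twice, then coordinates
calls = {"count": 0}

def flaky_geocode(query):
    calls["count"] += 1
    return [37.77722, -122.43146] if calls["count"] >= 3 else None

result = get_latlng_with_retries("Alamo Square", flaky_geocode, delay_seconds=0)
print(result, calls["count"])
```

With the package used in this notebook, the callable could be `lambda q: geocoder.arcgis(q).latlng`. Neighborhoods that come back as `None` then need to be handled explicitly before the coordinates are placed in a dataframe.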

In [None]:
coordinates = [ get_latlng(neighborhood) for neighborhood in sanfrancisco_df["Neighborhood"].tolist() ]

In [None]:
#create dataframe to put the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coordinates, columns=['Latitude', 'Longitude'])
    
# put the coordinates into the main dataframe
sanfrancisco_df['Latitude'] = df_coords['Latitude']
sanfrancisco_df['Longitude'] = df_coords['Longitude']

# look at the neighborhoods and the coordinates
print(sanfrancisco_df.shape)
sanfrancisco_df.head()

(95, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,"Alamo Square, San Francisco",37.77722,-122.43146
1,Alta Plaza,37.79101,-122.4393
2,"Anza Vista, San Francisco",37.78048,-122.44358
3,"Balboa Park, San Francisco",37.72493,-122.44314
4,"Balboa Terrace, San Francisco",37.7318,-122.4674


In [None]:
# getting the coordinates of San Francisco
address = 'San Francisco, United States'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Francisco are: {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Francisco are: 37.7790262, -122.4199061.


In [None]:
# create map of San Francisco using latitude and longitude values
sanfrancisco_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood in zip(sanfrancisco_df['Latitude'], sanfrancisco_df['Longitude'], sanfrancisco_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0).add_to(sanfrancisco_map)  
    
sanfrancisco_map

### Using Foursquare API

In [None]:
# Foursquare credentials and version
# Read the credentials from environment variables instead of hard-coding or printing them in the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders; use the names you export.
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605'


In [None]:
radius = 5000
LIMIT = 10000

venues = []

for lat, long, neighborhood in zip(sanfrancisco_df['Latitude'], sanfrancisco_df['Longitude'], sanfrancisco_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
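
The loop above indexes straight into the JSON response, so a failed request or a venue without categories raises a `KeyError` or `IndexError` and stops the whole collection. The sketch below shows one defensive way to flatten a response. It assumes the same JSON shape that the loop already relies on (`response` → `groups` → `items` → `venue`); the function name `extract_venues` and the sample payload are made up for illustration.

```python
def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one explore response into tuples of
    (neighborhood, lat, lng, name, venue_lat, venue_lng, category)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Some venues may have no category; record None instead of failing
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Made-up payload that mimics the structure indexed in the loop above
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Market",
                   "location": {"lat": 37.78, "lng": -122.41},
                   "categories": [{"name": "Farmers Market"}]}},
    ]}]}
}
rows = extract_venues(sample_payload, "Example Neighborhood", 37.77, -122.42)
print(rows)
```

Inside the loop, `venues.extend(extract_venues(requests.get(url).json(), neighborhood, lat, long))` would replace the direct indexing, and an empty or error response would simply contribute no rows.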

In [None]:
# convert the venues list into a new dataframe
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df

In [None]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

In [None]:
# print the list of unique categories
venues_df['VenueCategory'].unique()

### One Hot Encoding

In [None]:
# one hot encoding
sanfrancisco_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sanfrancisco_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sanfrancisco_onehot.columns[-1]] + list(sanfrancisco_onehot.columns[:-1])
sanfrancisco_onehot = sanfrancisco_onehot[fixed_columns]

print(sanfrancisco_onehot.shape)
sanfrancisco_onehot.head()

In [None]:
sanfrancisco_groupedby = sanfrancisco_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(sanfrancisco_groupedby.shape)
sanfrancisco_groupedby
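
To make the meaning of the grouped values concrete: taking the mean of one-hot columns per neighborhood gives the share of that neighborhood's venues that fall in each category, not a count. The toy example below uses invented neighborhoods `A` and `B` and invented venues purely to show the arithmetic; `dtype=float` is passed so the dummy columns are numeric regardless of the pandas version's default.

```python
import pandas as pd

# Invented data: 4 venues in neighborhood A, 2 venues in neighborhood B
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Farmers Market", "Park", "Park", "Cafe",
                      "Brazilian Restaurant", "Cafe"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy[["VenueCategory"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhoods", toy["Neighborhood"])

# Mean per neighborhood = fraction of its venues in each category
grouped = onehot.groupby("Neighborhoods").mean().reset_index()
print(grouped)
```

In this toy data, one of the four venues in `A` is a farmers market, so its value is 0.25, and one of the two venues in `B` is a Brazilian restaurant, so its value is 0.5.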

In [None]:
# Check how many neighborhoods have at least one Brazilian restaurant in the data gathered
len(sanfrancisco_groupedby[sanfrancisco_groupedby["Brazilian Restaurant"] > 0])

In [None]:
# Check how many neighborhoods have at least one Farmers market in the data gathered
len(sanfrancisco_groupedby[sanfrancisco_groupedby["Farmers Market"] > 0])

In [None]:
# Build a new dataframe with the concentration of Brazilian restaurants and Farmers market per neighborhood
concentration_df = sanfrancisco_groupedby[["Neighborhoods","Brazilian Restaurant","Farmers Market" ]]
concentration_df

### K-Means Clustering

In [None]:
# set number of clusters
kclusters = 3

sanfrancisco_clustering = concentration_df.drop(columns=["Neighborhoods"])

# run k-means clustering (n_init is set explicitly so the result does not depend on the scikit-learn version's default)
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(sanfrancisco_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

In [None]:
# create a new dataframe that will hold the cluster label next to the concentration values for each neighborhood
sanfrancisco_merged = concentration_df.copy()

# add clustering labels
sanfrancisco_merged["Cluster Labels"] = kmeans.labels_

In [None]:
sanfrancisco_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
sanfrancisco_merged

In [None]:
#merge dataframes to add latitude/longitude for each neighborhood
sanfrancisco_merged = sanfrancisco_merged.join(sanfrancisco_df.set_index("Neighborhood"), on="Neighborhood")

sanfrancisco_merged

In [None]:
# sort the results by Cluster Labels
print(sanfrancisco_merged.shape)
sanfrancisco_merged.sort_values(["Cluster Labels"], inplace=True)
sanfrancisco_merged

### Results section

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(sanfrancisco_merged['Latitude'], sanfrancisco_merged['Longitude'], sanfrancisco_merged['Neighborhood'], sanfrancisco_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0).add_to(map_clusters)
       
map_clusters

In [None]:
sanfrancisco_merged.loc[sanfrancisco_merged['Cluster Labels'] == 0]

In [None]:
sanfrancisco_merged.loc[sanfrancisco_merged['Cluster Labels'] == 1]

In [None]:
sanfrancisco_merged.loc[sanfrancisco_merged['Cluster Labels'] == 2]
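
Besides listing each cluster row by row, a per-cluster summary can make the comparison easier to read. The sketch below is only an illustration on invented numbers: the toy dataframe reuses the column names of `sanfrancisco_merged`, but the neighborhoods `A` to `D` and their values are hypothetical and are not results from this analysis.

```python
import pandas as pd

# Hypothetical values, using the same column names as sanfrancisco_merged
toy_merged = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Brazilian Restaurant": [0.0, 0.0, 0.02, 0.04],
    "Farmers Market": [0.03, 0.01, 0.0, 0.0],
    "Cluster Labels": [1, 1, 2, 2],
})

# Average concentration of each venue type within every cluster
summary = toy_merged.groupby("Cluster Labels")[["Brazilian Restaurant", "Farmers Market"]].mean()

# Number of neighborhoods that fall in each cluster
summary["Count"] = toy_merged.groupby("Cluster Labels").size()
print(summary)
```

Applied to `sanfrancisco_merged`, the same two `groupby` calls would show at a glance which cluster combines a low Brazilian restaurant share with a high farmers market share.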

### Discussion section

Looking at the neighborhoods in cluster 1, we can see that one neighborhood, Balboa Park, has a high concentration of farmers markets and no Brazilian restaurant.

Balboa Park is a residential area with a school and a park. That makes it an ideal place to open a new restaurant, since students and park visitors will need somewhere nearby to have lunch.

### Conclusion section

The goal of this project was to find the best areas to open a new Brazilian restaurant, as requested by a group of investors, prioritising areas that have no restaurant of this kind and that have a farmers market nearby, since Brazilian food is even better when made with fresh organic ingredients.

The analysis suggests that the neighborhoods in cluster 1 are suitable areas for a new Brazilian restaurant. The best neighborhood for our purpose, however, is Balboa Park, which currently has no Brazilian restaurants but has 3 farmers markets to supply the demand for fresh ingredients.

In cluster 2, all the neighborhoods already have a Brazilian restaurant, so we think it is better to choose a place where there is no competition.