# The Battle of Neighborhoods

### Introduction / Business Problem definition

In the US, one dense area stands out from the rest of the country in terms of innovation. The San Francisco Bay Area is often described as the most innovative place in the world, reportedly gathering more than 50% of all VC funds in the world in a single, small region. 
The most important and famous parts of the Bay Area include cities like San Francisco, Berkeley, San Jose and Palo Alto, but the region is not limited to these cities and also includes smaller cities a few kilometers away from the center of the Bay Area. 

Most of the highly successful tech companies were launched in this area, and the highest-paying jobs are located there. The San Francisco Bay Area is also well known for its diversity. In the area, it is often said that one is not from San Francisco if one has lived there for more than a couple of years. 

Based on these observations, we identified an important problem in the Bay Area: because of the high average salary and the quality of life in this area, housing prices are disproportionately high, and for a newcomer, finding a place to live can be a difficult puzzle to solve. 

In this analysis, we will get to know more about the Bay Area, its locations and its main venues. Based on that, we will try to identify areas that are similar (or not) to the city center of San Francisco, so that a newcomer can easily identify the best place to live based on their preferences and salary. Throughout this analysis, we will also try to understand why Silicon Valley is so different from the rest of the world and from the rest of the Bay Area and California.

### Data

#### Data Sources

To carry out this analysis, we need three types of information: 
- Information about the cities in the Bay Area: their postal codes and their associated coordinates (latitude and longitude). This information is available on government open-data websites: https://catalog.data.gov/dataset/bay-area-zip-codes/resource/6cacd1a1-6bff-4c7c-9094-49188ea29f85
- The venues associated with each city and neighborhood, obtained through the Foursquare API.
- The San Francisco JSON file, which provides a list of all Bay Area boroughs and neighborhoods with their coordinates (latitude and longitude), as a fallback if the government data does not contain them.

### Methodology 

Our methodology is divided into the following parts: 
1. Import packages and tools: we import all the tools and packages required throughout the analysis. These include, but are not limited to, csv, pandas, numpy, requests, scikit-learn's KMeans, json, geocoder, geopy, matplotlib and folium.
2. Data cleaning: using the government open-data link, we import the CSV file into a pandas dataframe. We then delete the unnecessary columns, rename columns and add empty columns for latitude and longitude. The objective is a table with columns for Neighborhood, ZIP code, Latitude and Longitude.
3. Use the geopy and geocoder libraries: these let us associate coordinates (latitude and longitude) with each ZIP code. 
4. Use the Foursquare API: with Foursquare credentials, we import all venues associated with the different locations. 
5. Clustering: once all the data is collected and combined into one grouped table, we run the k-means machine learning algorithm to determine the main clusters, iterating on the number of clusters until it gives a good understanding of the data.

#### Import & Data Cleaning

First, let's import some key tools that will be useful throughout our analysis. These include, but are not limited to, numpy, pandas and json. 

In [14]:
import csv
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np
import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Install geocoder
!pip install geocoder
import geocoder
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [15]:
df = pd.read_csv ('https://data.sfgov.org/api/views/f9wk-m4qb/rows.csv?accessType=DOWNLOAD')   
df.head(10)

| | PO_NAME | the_geom | ZIP | STATE | Area__ | Length__ |
|---|---|---|---|---|---|---|
| 0 | NAPA | MULTIPOLYGON (((-122.10329200180091 38.5132829... | 94558 | CA | 12313260000.0 | 995176.225313 |
| 1 | FAIRFIELD | MULTIPOLYGON (((-121.947475002335 38.301511000... | 94533 | CA | 991786100.0 | 200772.556587 |
| 2 | DIXON | MULTIPOLYGON (((-121.65335500334429 38.3133870... | 95620 | CA | 7236950000.0 | 441860.2014 |
| 3 | SONOMA | MULTIPOLYGON (((-122.406843003057 38.155681999... | 95476 | CA | 3001414000.0 | 311318.546326 |
| 4 | NAPA | MULTIPOLYGON (((-122.29368500225117 38.1552379... | 94559 | CA | 1194302000.0 | 359104.646602 |
| 5 | PETALUMA | MULTIPOLYGON (((-122.45766900253919 38.1168949... | 94954 | CA | 2006544000.0 | 267474.490552 |
| 6 | RIO VISTA | MULTIPOLYGON (((-121.8624620022998 38.06602999... | 94571 | CA | 4454446000.0 | 492056.752411 |
| 7 | TRAVIS AFB | MULTIPOLYGON (((-121.89653900297888 38.2865679... | 94535 | CA | 302939700.0 | 95232.008421 |
| 8 | AMERICAN CANYON | MULTIPOLYGON (((-122.20418700285576 38.2096949... | 94503 | CA | 693134100.0 | 136394.695137 |
| 9 | NOVATO | MULTIPOLYGON (((-122.48655900081091 38.1005269... | 94949 | CA | 431605400.0 | 119395.672078 |


Rename PO_NAME column with a new name : "Neighborhood"

In [16]:
df = df.rename(columns={'PO_NAME': 'Neighborhood'})

Drop all unnecessary columns to get a better view of our table:

In [17]:
neighborhoods = df.drop(['the_geom','STATE','Area__','Length__'], axis=1)
neighborhoods

| | Neighborhood | ZIP |
|---|---|---|
| 0 | NAPA | 94558 |
| 1 | FAIRFIELD | 94533 |
| 2 | DIXON | 95620 |
| 3 | SONOMA | 95476 |
| 4 | NAPA | 94559 |
| 5 | PETALUMA | 94954 |
| 6 | RIO VISTA | 94571 |
| 7 | TRAVIS AFB | 94535 |
| 8 | AMERICAN CANYON | 94503 |
| 9 | NOVATO | 94949 |


In [18]:
print('The dataframe has {} neighborhoods (not yet compiled).'.format(
        neighborhoods.shape[0])
    )

The dataframe has 187 neighborhoods (not yet compiled).


In [26]:
print('The dataframe has the following number of UNIQUE neighborhoods:')
neighborhoods['Neighborhood'].nunique()

The dataframe has the following number of UNIQUE neighborhoods:


97

In [27]:
print('The dataframe has the following number of UNIQUE ZIP codes:')
neighborhoods['ZIP'].nunique()

The dataframe has the following number of UNIQUE ZIP codes:


187

From this analysis, we can see that our dataset has 187 rows and 2 columns: Neighborhood and ZIP. The Neighborhood column contains several repetitions (e.g. NOVATO in rows 9 and 10) because there are only 97 unique neighborhoods in the Bay Area, while there are 187 unique ZIP codes. 
We can therefore conclude that a single neighborhood can have several ZIP codes associated with it. This should be kept in mind when we import the remaining data: we should import data based on the ZIP codes rather than on the neighborhood names.
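
The relationship between neighborhood names and ZIP codes can also be checked directly. The following is a minimal sketch, assuming a small hand-made table (`toy`, a name introduced here for illustration) with the same two columns as `neighborhoods`; its four rows are copied from the preview above.

```python
import pandas as pd

# Toy table with the same columns as `neighborhoods`; rows taken from the preview above
toy = pd.DataFrame({
    'Neighborhood': ['NAPA', 'FAIRFIELD', 'NAPA', 'NOVATO'],
    'ZIP': [94558, 94533, 94559, 94949],
})

# Number of distinct ZIP codes attached to each neighborhood name
zips_per_neighborhood = toy.groupby('Neighborhood')['ZIP'].nunique()

# Names that map to more than one ZIP code
multi_zip = zips_per_neighborhood[zips_per_neighborhood > 1]
print(multi_zip)
```

Running the same two lines on the full `neighborhoods` table would list every name that spans several ZIP codes.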


Our dataset is now cleaned, so we can start working with Geopy and run some analysis on our data. 

#### Use geopy library to get the latitude and longitude values of San Francisco.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>sf_explorer</em>, as shown below.


In [19]:
address = 'San Francisco, CA'

# geopy's default timeout is 1 second, which the public Nominatim server can exceed
geolocator = Nominatim(user_agent="sf_explorer", timeout=10)
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Francisco are {}, {}.'.format(latitude, longitude))

When this cell was originally run with the misspelled address 'San Fransisco, CA' and the default timeout, the request failed with the following error:

GeocoderUnavailable: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=San+Fransisco%2C+CA&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))

Defining longitude and latitude for each different postal code using geocoder

In [None]:
neighborhoods['Latitude'] = None
neighborhoods['Longitude'] = None
neighborhoods

In [None]:
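# Look up the coordinates of each ZIP code with the ArcGIS geocoder
max_attempts = 5  # bound the retries so a failing lookup cannot loop forever

for i, postal_code in enumerate(neighborhoods['ZIP']):
    lat_lng_coords = None
    attempts = 0
    
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.arcgis('{}, CA'.format(postal_code))
        lat_lng_coords = g.latlng
        attempts += 1
    
    # separate names keep the San Francisco latitude/longitude (used to center the maps) intact
    if lat_lng_coords:
        zip_lat, zip_lng = lat_lng_coords
        neighborhoods.loc[i, 'Latitude'] = zip_lat
        neighborhoods.loc[i, 'Longitude'] = zip_lng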

neighborhoods
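
The retry logic can also be factored into a small helper so that it can be tested without network access. This is only a sketch: `geocode_with_retry` and `fake_geocoder` are hypothetical names introduced here, and the coordinates returned by the stand-in are placeholders, not real lookup results. In the notebook, the callable passed in could be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def geocode_with_retry(query, geocode_fn, max_attempts=3, delay_seconds=1.0):
    """Call geocode_fn(query) until it returns coordinates or attempts run out.

    geocode_fn is any callable returning a (lat, lng) pair or None.
    Returns (None, None) if every attempt fails.
    """
    for attempt in range(max_attempts):
        coords = geocode_fn(query)
        if coords:
            return coords[0], coords[1]
        # wait before the next attempt, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None, None


# Stand-in geocoder that fails once and then succeeds (placeholder coordinates)
calls = {'n': 0}


def fake_geocoder(query):
    calls['n'] += 1
    return None if calls['n'] == 1 else [38.5, -122.3]


print(geocode_with_retry('94558, CA', fake_geocoder, delay_seconds=0))
```

Returning `(None, None)` instead of looping forever leaves the decision about missing coordinates (drop the row, retry later) to the caller.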

Printing Bay-Area Map

In [None]:
map_bay_area = folium.Map(location=[latitude, longitude], zoom_start=8)

for lat, lng, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='red',
        fill=True).add_to(map_bay_area)
    
map_bay_area

We managed to get a great map of the Bay area and all the key cities and neighborhoods composing the region. We will now use Foursquare in order to move further with our analysis. 

### Foursquare and importing credentials

In [None]:
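import os

# Read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names below are an assumption; use whatever names you export)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20201207'  # Foursquare API version
LIMIT = 100  # maximum number of venues returned per request

print('Credentials loaded.')
print('VERSION: ' + VERSION)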

Let's get all the venues for each neighborhood, starting by defining the functions that we will use.

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # rows coming from a flattened response use the 'venue.categories' key instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL and its query parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }
            
        # make the GET request
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        # (venues without any category get None instead of raising IndexError)
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None)
            for v in results])

    # passing the column names here also works when no venue was returned at all
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood', 
                 'Neighborhood Latitude', 
                 'Neighborhood Longitude', 
                 'Venue', 
                 'Venue Latitude', 
                 'Venue Longitude', 
                 'Venue Category'])
    
    return nearby_venues
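
The response-handling part of the function above can be exercised offline. The following is a minimal sketch, assuming only the nesting that the function itself accesses (`response` → `groups[0]` → `items` → `venue`). `extract_venues` is a hypothetical helper, and the sample payload is invented purely to illustrate the shape; it is not real Foursquare output.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples.

    Venues without any category get None instead of raising IndexError.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Invented sample payload, only to illustrate the expected nesting
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 37.77, 'lng': -122.42},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Place B', 'location': {'lat': 37.78, 'lng': -122.41},
               'categories': []}},
]}]}}

print(extract_venues(sample))
```

Separating parsing from the HTTP call makes it possible to check edge cases, such as an empty response or a venue without categories, without spending API quota.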

For each neighborhood, let's now run the above function to get all venues in the Bay Area.


In [None]:
bay_area_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                )

In [None]:
print(bay_area_venues.shape)
bay_area_venues.head()

Let's see how many venues appeared for each neighborhood

In [None]:
bay_area_venues.groupby('Neighborhood').count()


In [None]:
print('There are {} unique categories.'.format(bay_area_venues['Venue Category'].nunique()))

We can see that there is a very large number of unique categories among the venues of the Bay Area (378).

We can also see that we have a total of 94 neighborhoods in our dataset, each with an associated number of venues. Let's now analyze these venues further by determining the kind of venue (art, restaurant, monument, video, club, bar...) found in each neighborhood: 

In [None]:
# one hot encoding
bay_area_onehot = pd.get_dummies(bay_area_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
bay_area_onehot['Neighborhood'] = bay_area_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [bay_area_onehot.columns[-1]] + list(bay_area_onehot.columns[:-1])
bay_area_onehot = bay_area_onehot[fixed_columns]

bay_area_onehot.head()

In [None]:
bay_area_onehot.shape

The size of the new dataframe is consistent: it has 5,178 rows corresponding to the 5,178 venues and 378 columns corresponding to the 378 unique categories. 
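
To make the shapes easier to follow, here is a minimal sketch of one-hot encoding followed by a per-neighborhood average, on an invented three-venue table. The names `toy_venues`, `toy_onehot` and `toy_grouped` are assumptions introduced for illustration. Note that the encoded table has one column per category plus the Neighborhood column.

```python
import pandas as pd

# Invented toy venues: two neighborhoods, three venues, two categories
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Park', 'Bar', 'Park'],
})

# One indicator column per category (dtype=int keeps the values as 0/1)
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# Mean of the indicators = share of each category among a neighborhood's venues
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()

print(toy_onehot.shape, toy_grouped.shape)
print(toy_grouped)
```

One row per venue becomes one row per neighborhood, and each value becomes the frequency of that category within the neighborhood.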

In [None]:
bay_area_grouped = bay_area_onehot.groupby('Neighborhood').mean().reset_index()
bay_area_grouped

In [None]:
bay_area_grouped.shape

In this new dataframe, we still have the previous 378 columns, but this time we only have 94 rows. This is explained by the grouping of rows that belong to the same neighborhood.
Let's now take a look at the top 5 venues for each neighborhood:

In [None]:
num_top_venues = 5

for hood in bay_area_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = bay_area_grouped[bay_area_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
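
As a quick check of what this ranking does, here is a minimal sketch on a single invented row. `top_categories` is a hypothetical stand-in that follows the same idea as `return_most_common_venues` (skip the first entry, which holds the neighborhood name, then sort the remaining frequencies in descending order).

```python
import pandas as pd


def top_categories(row, n):
    """Return the n category names with the highest frequency in the row."""
    frequencies = row.iloc[1:].astype(float)  # drop the neighborhood name
    return list(frequencies.sort_values(ascending=False).index[:n])


# Invented row in the same layout as one row of the grouped table
toy_row = pd.Series({'Neighborhood': 'A', 'Bar': 0.2, 'Park': 0.5, 'Gym': 0.3})
print(top_categories(toy_row, 2))
```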

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 1st, 2nd and 3rd, every rank takes the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = bay_area_grouped['Neighborhood']

for ind in np.arange(bay_area_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(bay_area_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
# set number of clusters
k = 25

bay_area_grouped_clustering = bay_area_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=67).fit(bay_area_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:50] 
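
The number of clusters is set by hand above. One common way to iterate on it is to compare the silhouette score of several candidate values. This is a sketch of that technique on synthetic data, not the procedure used in this notebook: `X` is an invented stand-in for `bay_area_grouped_clustering`, and `k_candidate`, `scores` and `best_k` are names introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the grouped frequency table: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

# Fit k-means for several values of k and record the mean silhouette score of each fit
scores = {}
for k_candidate in range(2, 7):
    labels = KMeans(n_clusters=k_candidate, n_init=10, random_state=67).fit_predict(X)
    scores[k_candidate] = silhouette_score(X, labels)

# The candidate with the highest score is the best-separated clustering
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On real venue frequencies the scores are usually much less clear-cut than on this toy data, so the result is better read as a guide than as a definitive answer.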

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

bay_area_merged = neighborhoods
bay_area_merged = bay_area_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
bay_area_merged.dropna(inplace=True)


In [None]:
print(bay_area_merged.shape)
bay_area_merged['Cluster Labels'] = bay_area_merged['Cluster Labels'].astype(int)
bay_area_merged

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one color per cluster label (0 to k-1)
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(bay_area_merged['Latitude'], bay_area_merged['Longitude'], bay_area_merged['Neighborhood'], bay_area_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's now list the neighborhoods in each cluster. Only the first 5 clusters are shown here, but the others can be inspected by changing the cluster label in the filter. 

In [None]:
# columns 0 and 1 are Neighborhood and ZIP; columns 5 onwards are the ranked venue categories
bay_area_merged.loc[bay_area_merged['Cluster Labels'] == 0, bay_area_merged.columns[[0, 1] + list(range(5, bay_area_merged.shape[1]))]]


In [None]:
bay_area_merged.loc[bay_area_merged['Cluster Labels'] == 1, bay_area_merged.columns[[0, 1] + list(range(5, bay_area_merged.shape[1]))]]


In [None]:
bay_area_merged.loc[bay_area_merged['Cluster Labels'] == 2, bay_area_merged.columns[[0, 1] + list(range(5, bay_area_merged.shape[1]))]]


In [None]:
bay_area_merged.loc[bay_area_merged['Cluster Labels'] == 3, bay_area_merged.columns[[0, 1] + list(range(5, bay_area_merged.shape[1]))]]


In [None]:
bay_area_merged.loc[bay_area_merged['Cluster Labels'] == 4, bay_area_merged.columns[[0, 1] + list(range(5, bay_area_merged.shape[1]))]]
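
The five cells above differ only in the cluster label, so they could be wrapped in a small helper that selects columns by name rather than by position. This is a minimal sketch: `show_cluster` is a hypothetical helper, and `toy_merged` is an invented table that only mimics the column names of `bay_area_merged`.

```python
import pandas as pd


def show_cluster(merged, label):
    """Return Neighborhood, ZIP and the ranked venue columns for one cluster."""
    venue_cols = [c for c in merged.columns if c.endswith('Most Common Venue')]
    return merged.loc[merged['Cluster Labels'] == label,
                      ['Neighborhood', 'ZIP'] + venue_cols]


# Invented toy table with the same column names as bay_area_merged
toy_merged = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'ZIP': [1, 2, 3],
    'Latitude': [0.0, 0.0, 0.0],
    'Longitude': [0.0, 0.0, 0.0],
    'Cluster Labels': [0, 1, 0],
    '1st Most Common Venue': ['Park', 'Bar', 'Park'],
    '2nd Most Common Venue': ['Bar', 'Park', 'Gym'],
})

# Print every cluster in turn
for label in sorted(toy_merged['Cluster Labels'].unique()):
    print('Cluster', label)
    print(show_cluster(toy_merged, label))
```

Selecting by column name keeps the output stable if columns are added to or removed from the merged table.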
