In [25]:
import numpy as np
import pandas as pd

#!conda install -c conda-forge folium=0.5.0 --yes
import folium

from sklearn.cluster import KMeans

import matplotlib

import requests
import json

# Segmenting and Clustering Neighborhoods in San Jose, CA

## IBM Data Science Capstone Project - Jonny Erickson

# Introduction

San Jose is one of the largest cities in California and the seat of Santa Clara County, one of the most affluent counties in the country. The city is known for its history of innovation and is often called the capital of Silicon Valley because of the many technology companies in or near the city limits, including Apple, Facebook, Google, Netflix, Tesla, Twitter, and many others. However, San Jose is also notorious for its very high cost of living.

This project aims to answer a few questions by analyzing the different neighborhoods within San Jose. The intended audience is a high-end residential real estate developer. I hope to find the set of neighborhoods where the mean income of the top 5% of households is above the San Jose average but the mean housing price is below that of San Jose as a whole. Then, with these neighborhoods selected, I will use Foursquare data to find which is closest to the highest rated amenities.

The developer would then be able to buy underpriced land in San Jose, aiming to create a high-end community where the homes could easily be sold to people earning well above what is required to live in that neighborhood.

# Data

I will be using data from the following sources to examine the relationships described in the introduction:

1. Data for household incomes by neighborhood
https://statisticalatlas.com/place/California/San-Jose/Household-Income
2. Data for median housing prices by neighborhood
https://www.zillow.com/san-jose-ca/home-values/
3. Foursquare API Access
https://foursquare.com/developers/apps/

Using this data, I will create maps that show corresponding variable values in each category, as well as use dataframes in order to merge data and find those neighborhoods which have the conditions I am searching for.

# Methodology

The source code is attached in the Appendix at the bottom of this document, numbered to match the following steps:
1. Load all of the datasets listed above.
2. Sort neighborhoods in one dataframe by descending income, and in another by ascending mean housing price.
3. Using the Foursquare API, create a rating of amenity diversity for each neighborhood based on the number of unique venue categories nearby.
4. Combine these dataframes and examine them to find the optimal neighborhoods for the problem at hand.

## 1. 
I first gathered the necessary data using various sources mentioned above, as well as Google Maps in order to obtain the latitude and longitude coordinates for each respective neighborhood.

All of this data, except the Foursquare data, was compiled in Excel and then read into dataframes using Pandas. Then, using the neighborhoods with their coordinates, I ran a function that gathered the 100 nearest venues to each neighborhood.


## 2. 
The next step was mostly a sanity check: I sorted the values in each category to confirm that they were spread widely enough for an optimal solution to be found.

## 3. 
Given the diversity of San Jose as a whole, it could be argued that a new housing development needs reasonable access to a very diverse group of venues in order to sell well. This includes restaurants and stores from a variety of cultures.

By taking the nearest 100 venues to the neighborhood, I then counted only unique values of venue category types to create my 'Diversity Score.' Using this diversity score, we can see how many different types of venues there are in the first 100 venues that are found by Foursquare. 

I believe this statistic yields a diversity score that does not depend on the size of a neighborhood. If the score were instead the total number of unique venue types in the whole neighborhood, it would be biased toward larger neighborhoods, which can be expected to have more unique venues simply because they have more venues.

If instead a ratio were taken between the number of unique venue types and the total number of venues, neighborhoods with few venues would most likely score higher, even though having few venues does not make a neighborhood better for diversity.

For this reason, looking only at the 100 closest venues and measuring the diversity of options within that set avoids bias in either direction with respect to neighborhood size.
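
To make the comparison between the two candidate metrics concrete, here is a minimal sketch on made-up data. The locations `A` and `B` and their venue categories are hypothetical, not Foursquare output, and the column names simply mirror those used in the Appendix. The point is only that a unique-to-total ratio can favor a location with very few venues, while a count of unique categories does not.

```python
import pandas as pd

# Toy venue table (hypothetical data, not Foursquare output).
venues = pd.DataFrame({
    "Location": ["A"] * 6 + ["B"] * 2,
    "Venue Category": ["Cafe", "Cafe", "Park", "Gym", "Taqueria", "Cafe",
                       "Cafe", "Park"],
})

grouped = venues.groupby("Location")["Venue Category"]
summary = pd.DataFrame({
    "venues": grouped.size(),             # total venues per location
    "diversity_score": grouped.nunique()  # distinct categories per location
})
# Alternative metric discussed above: unique categories divided by total venues.
summary["unique_ratio"] = summary["diversity_score"] / summary["venues"]
print(summary)
```

In this toy case `B` gets the higher ratio even though `A` offers more distinct categories, which is the bias the unique-count score is meant to avoid.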

## 4. 
Then, by joining the datasets together, I was able to see the related statistics side by side. I first computed the average of each statistic across all neighborhoods, then used these averages to narrow down the neighborhoods.

First, I looked at the mean top 5% household income, and took only neighborhoods that had close to or higher than the mean across all neighborhoods.

Then, I looked at the mean housing value in each neighborhood, and only kept neighborhoods that had housing values close to or below this average across all neighborhoods. 

This left four neighborhoods, namely Blossom Valley, Downtown, Evergreen, and Santa Teresa. The question was now which of these neighborhoods would be the best.
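
The two filters described above can be sketched as follows. This is an illustrative reconstruction rather than the notebook's own code: the values are copied from the tables in the Appendix, the short column names `Income` and `Housing Price` are my own, and the 90% and 110% tolerances are the ones used in the Appendix to interpret "close to" the mean.

```python
import pandas as pd

# (Neighborhood, mean top 5% household income, housing price),
# copied from the Appendix tables.
rows = [
    ("Almaden Valley", 598.5, 1431400),
    ("Alum Rock-East Foothills", 292.9, 783400),
    ("Berryessa", 346.3, 1110600),
    ("Blossom Valley", 364.8, 923200),
    ("Cambrian Park", 478.5, 1198700),
    ("Downtown", 361.6, 858000),
    ("East San Jose", 285.9, 719800),
    ("Edenvale - Seven Trees", 260.9, 749700),
    ("Evergreen", 448.2, 1036100),
    ("Fairgrounds", 297.4, 735900),
    ("North San Jose", 326.4, 949400),
    ("North Valley", 332.1, 924700),
    ("Rose Garden", 508.2, 1153300),
    ("Santa Teresa", 375.8, 888700),
    ("West San Jose", 446.7, 1444100),
    ("Willow Glen", 629.1, 1234900),
]
nbrs = pd.DataFrame(rows, columns=["Neighborhood", "Income", "Housing Price"])

# Thresholds derived from the data instead of hard-coded numbers.
income_threshold = 0.9 * nbrs["Income"].mean()
housing_threshold = 1.1 * nbrs["Housing Price"].mean()

# Keep neighborhoods that pass both filters.
shortlist = nbrs[(nbrs["Income"] >= income_threshold)
                 & (nbrs["Housing Price"] <= housing_threshold)]
print(shortlist["Neighborhood"].tolist())
```

Deriving the thresholds from the column means keeps the filter in sync with the data if the input tables are updated.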

# Results

These analyses allowed me to observe quantitatively:
1. The various main neighborhoods across San Jose, CA
2. The various levels of income in each of those neighborhoods
3. The current average housing valuation in each of those neighborhoods
4. A diversity score, created using Foursquare data, for the venues surrounding each neighborhood

Using this data, I was able to narrow down a potential housing developer's search for possible new projects in San Jose. The final four neighborhoods in the resulting data had roughly average or lower housing prices, with roughly average or higher top 5% household incomes, and strong diversity scores for the surrounding venues.

These final four neighborhoods looked like this:

In [229]:
NbrsWIncomeAndHousing.iloc[:, [1,2,4,5]]

Unnamed: 0,Neighborhood,Mean Top 5% Household Income,"(Housing Price,)",Diversity Score
3,Blossom Valley,364.8,923200,60
5,Downtown,361.6,858000,59
8,Evergreen,448.2,1036100,48
13,Santa Teresa,375.8,888700,59


# Discussion and Recommendations

I would have very much liked to use crime data sectioned by neighborhoods in order to provide another statistic to evaluate. One potential flaw in looking at neighborhoods that have below average housing markets is that there may be an increased crime presence there, dragging down the values of the houses. I hope that finding neighborhoods with higher than average top 5% household incomes outweighed this issue.

It turns out that the San Jose Police Department has a new interactive tool for viewing crime data, but the data is not in a shareable format. I tried to pull this data into my own datasets but was unable to do so.

In the future, with more work put toward gathering crime data, we could gain an even better understanding of the neighborhoods we are dealing with.

Obviously a housing developer, hoping to build a luxury residential living community, would not want to do so in the middle of a neighborhood that is high in crime rates. 

# Conclusion

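Using this combination of data, as discussed earlier, the neighborhoods with roughly average or higher top 5% household income and roughly average or lower housing prices were Blossom Valley, Downtown, Evergreen, and Santa Teresa. Three of the four also have above average 'diversity scores'; Evergreen, at 48, falls below the mean of about 52.9.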

At this point, I was able to use some of my own subject matter expertise (having been a resident of San Jose for more than twenty years) to assess the viability of building a residential community in these neighborhoods. Downtown San Jose could be ruled out first for several reasons. The first is that it has one of the highest crime rates in San Jose. The second is that land is very hard to come by, so despite the low housing valuations, buying large plots of undeveloped land to build on would be very expensive.

Blossom Valley, despite having a reasonably low crime rate, is also industrialized, and the acquisition of land here would similarly be difficult.

This leaves two neighborhoods, Evergreen and Santa Teresa, as final contenders, and both would be fairly safe bets for a developer. Both lie on the edge of the developed regions of San Jose, meaning that buying land in those neighborhoods would be feasible. The main difference lies in the diversity score: Santa Teresa (59) has a wider variety of nearby venues than Evergreen (48), and could therefore be considered the better area to develop in. In addition, housing prices in Santa Teresa fall below the average by a larger margin than top 5% household incomes do, so we could infer that households in Santa Teresa could afford to move into a nicer housing development and still feel at home (since they would not be leaving their immediate neighborhood).
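
One way to read the price-versus-income comparison is as each neighborhood's percentage gap from the mean across all neighborhoods. The sketch below works that out for the two finalists using the figures in the Appendix tables; the helper `pct_vs_mean` is a hypothetical name introduced here for illustration, and this reading of the comparison is an assumption on my part.

```python
# Means across all 16 neighborhoods, from the describe() output in the Appendix.
mean_income, mean_price = 397.08125, 1008869.0

# (mean top 5% household income, housing price) from the Appendix tables.
candidates = {
    "Santa Teresa": (375.8, 888700),
    "Evergreen": (448.2, 1036100),
}

def pct_vs_mean(value, mean):
    """Percent difference from the mean (negative means below the mean)."""
    return 100 * (value - mean) / mean

# For each candidate: (income gap %, housing price gap %), rounded to one decimal.
gaps = {
    name: (round(pct_vs_mean(income, mean_income), 1),
           round(pct_vs_mean(price, mean_price), 1))
    for name, (income, price) in candidates.items()
}
print(gaps)
```

Under this reading, Santa Teresa's housing price sits further below the mean than its top 5% income does, while Evergreen's income sits further above the mean than its housing price does.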

The winner: 
# Santa Teresa


Given all of this, the final recommendation to a high-end residential developer would be to build a housing community in Santa Teresa that could attract a diverse crowd of people with higher than average incomes for San Jose.


# Appendix

# 1. Loading the Data

In [192]:
Incomes = pd.read_excel('HouseholdIncomeByNeighborhood.xlsx')
Housing = pd.read_excel('san-jose-ca-neighborhoods-Report.xls')
Location = pd.read_excel('LocationLatLong.xlsx')

In [168]:
import os
# Credentials are read from environment variables (placeholder names) rather than hard-coded.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20190801'

In [169]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000, LIMIT=1000):
    """Query Foursquare's explore endpoint around each location and return one row per venue."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng,            
            v['venue']['name'], 
            v['venue']['id'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one row per venue
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Location',
                 'Location Latitude',
                 'Location Longitude',
                 'Venue',
                 'Venue id',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues

In [170]:
Venues = getNearbyVenues(Location['Neighborhood'], Location['Latitude'], Location['Longitude'])
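
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` if a venue ever came back with an empty category list. The following is a hedged sketch of a more defensive parsing step. The helper `parse_explore_items` is hypothetical, and the sample payload is hand-written to match only the fields the function above reads; it is not real API output.

```python
import pandas as pd

def parse_explore_items(location, items):
    """Flatten explore 'items' into rows, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # nothing to count toward the diversity score
        rows.append({
            "Location": location,
            "Venue": venue["name"],
            "Venue id": venue["id"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0]["name"],
        })
    return pd.DataFrame(rows)

# Hand-written sample shaped like the fields read above (not real API output).
sample_items = [
    {"venue": {"name": "Sample Cafe", "id": "a1",
               "location": {"lat": 37.30, "lng": -121.80},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Uncategorized Spot", "id": "b2",
               "location": {"lat": 37.31, "lng": -121.81},
               "categories": []}},
]
parsed = parse_explore_items("Evergreen", sample_items)
print(parsed)
```

Skipping uncategorized venues is one possible design choice; another would be to label them with a placeholder category, which would then count as one extra category in the diversity score.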

# 2. Sorting the DataFrames

In [195]:
Incomes = Incomes[['Neighborhood', 'Mean Top 5% Household Income']]

In [196]:
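Housing = Housing[['Region Name', 'Current']][Housing['Region Name'] != 'San Jose']
# Note: the nested list makes the column labels a one-level MultiIndex, which is why later
# outputs show "(Neighborhood,)" and "(Housing Price,)" rather than plain names.
Housing.columns = [['Neighborhood', 'Housing Price']]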

In [197]:
# Passing a Series to `on` makes pandas add the join key as an extra 'key_0' column in the result.
Nbrs = pd.merge(Incomes, Housing, on = Incomes['Neighborhood'])
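
For comparison, here is a minimal sketch of the same preparation and merge using flat column names, which avoids the `key_0` and tuple-style columns seen in the outputs below. It is an alternative, not the code that produced those outputs; the later cells select columns by position and would need adjusting. The two-neighborhood frames are toy stand-ins for the Excel files, and the citywide `San Jose` value is a placeholder.

```python
import pandas as pd

# Toy stand-ins for the Excel inputs (neighborhood values from the tables below).
incomes = pd.DataFrame({
    "Neighborhood": ["Evergreen", "Santa Teresa"],
    "Mean Top 5% Household Income": [448.2, 375.8],
})
housing = pd.DataFrame({
    "Region Name": ["San Jose", "Evergreen", "Santa Teresa"],
    "Current": [1000000, 1036100, 888700],  # the San Jose value is a placeholder
})

# Drop the citywide row, keep the two columns of interest, and rename them
# with a flat mapping so the labels stay plain strings.
housing = (
    housing.loc[housing["Region Name"] != "San Jose", ["Region Name", "Current"]]
    .rename(columns={"Region Name": "Neighborhood", "Current": "Housing Price"})
)

# Merging on the shared column name yields a single 'Neighborhood' column.
nbrs = pd.merge(incomes, housing, on="Neighborhood")
print(nbrs.columns.tolist())
```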

# 3. Evaluating Diversity of Venues in Surrounding Area

In [None]:
# Diversity score: number of distinct venue categories returned for each neighborhood
Venue_Diversity = {}
for val in Venues['Location'].unique():
    diversity = Venues.loc[Venues['Location'] == val, 'Venue Category'].nunique()
    Venue_Diversity[str(val)] = diversity

In [208]:
Venue_Diversity

{'Almaden Valley': 37,
 'Alum Rock-East Foothills': 24,
 'Berryessa': 48,
 'Blossom Valley': 60,
 'Cambrian Park': 60,
 'Downtown': 59,
 'East San Jose': 52,
 'Edenvale - Seven Trees': 58,
 'Evergreen': 48,
 'Fairgrounds': 60,
 'North San Jose': 62,
 'North Valley': 37,
 'Rose Garden': 64,
 'Santa Teresa': 59,
 'West San Jose': 66,
 'Willow Glen': 53}

In [198]:
Nbrs['Diversity Score'] = None

In [200]:
for val in Nbrs['Neighborhood']:
    Nbrs.loc[Nbrs['Neighborhood'] == val, 'Diversity Score'] = Venue_Diversity[val]
Nbrs

Unnamed: 0,key_0,Neighborhood,Mean Top 5% Household Income,"(Neighborhood,)","(Housing Price,)",Diversity Score
0,Almaden Valley,Almaden Valley,598.5,Almaden Valley,1431400,37
1,Alum Rock-East Foothills,Alum Rock-East Foothills,292.9,Alum Rock-East Foothills,783400,24
2,Berryessa,Berryessa,346.3,Berryessa,1110600,48
3,Blossom Valley,Blossom Valley,364.8,Blossom Valley,923200,60
4,Cambrian Park,Cambrian Park,478.5,Cambrian Park,1198700,60
5,Downtown,Downtown,361.6,Downtown,858000,59
6,East San Jose,East San Jose,285.9,East San Jose,719800,52
7,Edenvale - Seven Trees,Edenvale - Seven Trees,260.9,Edenvale - Seven Trees,749700,58
8,Evergreen,Evergreen,448.2,Evergreen,1036100,48
9,Fairgrounds,Fairgrounds,297.4,Fairgrounds,735900,60


# 4. Combining and Evaluating the Data

In [206]:
Nbrs.sort_values('Mean Top 5% Household Income')

Unnamed: 0,key_0,Neighborhood,Mean Top 5% Household Income,"(Neighborhood,)","(Housing Price,)",Diversity Score
7,Edenvale - Seven Trees,Edenvale - Seven Trees,260.9,Edenvale - Seven Trees,749700,58
6,East San Jose,East San Jose,285.9,East San Jose,719800,52
1,Alum Rock-East Foothills,Alum Rock-East Foothills,292.9,Alum Rock-East Foothills,783400,24
9,Fairgrounds,Fairgrounds,297.4,Fairgrounds,735900,60
10,North San Jose,North San Jose,326.4,North San Jose,949400,62
11,North Valley,North Valley,332.1,North Valley,924700,37
2,Berryessa,Berryessa,346.3,Berryessa,1110600,48
5,Downtown,Downtown,361.6,Downtown,858000,59
3,Blossom Valley,Blossom Valley,364.8,Blossom Valley,923200,60
13,Santa Teresa,Santa Teresa,375.8,Santa Teresa,888700,59


In [207]:
Nbrs.describe()

Unnamed: 0,Mean Top 5% Household Income,"(Housing Price,)",Diversity Score
count,16.0,16.0,16.0
mean,397.08125,1008869.0,52.9375
std,110.9659,233563.0,11.601544
min,260.9,719800.0,24.0
25%,319.15,839350.0,48.0
50%,363.2,937050.0,58.5
75%,455.775,1164650.0,60.0
max,629.1,1444100.0,66.0


Here we see the statistics regarding each column of interest in our dataset.

Using these values, we can begin to break down the data and identify potential neighborhoods. The first data point we will look at is the average top 5% household income across San Jose, with each neighborhood weighted equally.

The average for this column is 397.08 (in thousands of U.S. dollars).

Let's look only at neighborhoods whose mean top 5% household income is at least 90% of this value.

In [213]:
IncomeThreshold = 0.9 * 397.08
NbrsWIncome = Nbrs.loc[Nbrs['Mean Top 5% Household Income'] >= IncomeThreshold]
NbrsWIncome

Unnamed: 0,key_0,Neighborhood,Mean Top 5% Household Income,"(Neighborhood,)","(Housing Price,)",Diversity Score
0,Almaden Valley,Almaden Valley,598.5,Almaden Valley,1431400,37
3,Blossom Valley,Blossom Valley,364.8,Blossom Valley,923200,60
4,Cambrian Park,Cambrian Park,478.5,Cambrian Park,1198700,60
5,Downtown,Downtown,361.6,Downtown,858000,59
8,Evergreen,Evergreen,448.2,Evergreen,1036100,48
12,Rose Garden,Rose Garden,508.2,Rose Garden,1153300,64
13,Santa Teresa,Santa Teresa,375.8,Santa Teresa,888700,59
14,West San Jose,West San Jose,446.7,West San Jose,1444100,66
15,Willow Glen,Willow Glen,629.1,Willow Glen,1234900,53


Now let's examine the average housing price across all neighborhoods in the dataset, equally weighted. That mean housing value is around $1.009 million. We will look only at neighborhoods whose housing value is at most 10% above this mean.

In [220]:
HousingThreshold = 1.1 * 1008869
NbrsWIncomeAndHousing = NbrsWIncome.loc[NbrsWIncome.iloc[:, 4] <= HousingThreshold]
NbrsWIncomeAndHousing

Unnamed: 0,key_0,Neighborhood,Mean Top 5% Household Income,"(Neighborhood,)","(Housing Price,)",Diversity Score
3,Blossom Valley,Blossom Valley,364.8,Blossom Valley,923200,60
5,Downtown,Downtown,361.6,Downtown,858000,59
8,Evergreen,Evergreen,448.2,Evergreen,1036100,48
13,Santa Teresa,Santa Teresa,375.8,Santa Teresa,888700,59


In [222]:
NbrsWIncomeAndHousing.sort_values('Diversity Score', ascending = False)

Unnamed: 0,key_0,Neighborhood,Mean Top 5% Household Income,"(Neighborhood,)","(Housing Price,)",Diversity Score
3,Blossom Valley,Blossom Valley,364.8,Blossom Valley,923200,60
5,Downtown,Downtown,361.6,Downtown,858000,59
13,Santa Teresa,Santa Teresa,375.8,Santa Teresa,888700,59
8,Evergreen,Evergreen,448.2,Evergreen,1036100,48
