## Battle of Neighborhoods
In this Notebook, we will develop a basis for comparing features of cities.  We will suppose that our interest is in measuring the diversity of Venue Categories in a City and how disperse this diversity is spread across the city.  To quantify these measures, we will measure the diversity of Venue Categories in a city with the Entropy Index 

H(p)=-pln(p), 

where p is a k by 1 vector.  Each element of p is the proportion of Venues in the City that are of the particular Venue Category.  

Next, to measure how disperse the Venues are across the city, we will calculate a weight vector (w) such that each element of w corresponds with a neighborhood.  For each neighborhood, we calculate the proportion of the cities Venues that are in that neighborhood.  We will again use the Enropy Index to calculate the level of dispersion of businesses across the city.  

H(w)=-wln(w).

We will suppose that our overall Satisfaction Index is:

S= -0.5pln(p) - 0.5wln(w).

We will measure our cities below to see how they perform on this Satisfaction Index.  If maximizing this index is associated with a higher level of well-being by city residents, a business person could identify the optimal location and venue category by determining which Venue Category and Location would boost this index the most.  Perhaps measuring this index over time could reveal if businesses do this on their own.

### Load Toronto Data from previous work

In [102]:
import math
import pandas as pd
## Load Toronto Borough Data 
tor_df=pd.read_csv("https://raw.githubusercontent.com/jcrooker/Coursera_Capstone/master/Toronto_Boroughs.csv")
tor_df=tor_df[tor_df['Borough']!="Not assigned"]
geo_df=pd.read_csv("http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv")

## Need to rename my 'Postal Code' column to match my tor_df dataframe.
geo_df.rename(columns={"Postal Code":"Postal_Code"},inplace=True)
geo_df.columns
tor_geo_df=pd.merge(tor_df,geo_df,how="inner",on="Postal_Code")
tor_geo_df.head()

Unnamed: 0,Postal_Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


### Load New York Data from previous Lab

In [103]:
## I saved the New York Boroughs/Neighborhood data assembled in
## 'DP0701EN-3-3-2-Neighborhoods-New-York-py-v1.0.ipynb'.  This 
## file is available on my Github repository for this Capstone
## Course.
## See 'newyork_data.csv' at https://github.com/jcrooker/Coursera_Capstone
##
newyork=pd.read_csv("https://raw.githubusercontent.com/jcrooker/Coursera_Capstone/master/newyork_data.csv")
newyork.head()

Unnamed: 0.1,Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,0,Bronx,Wakefield,40.894705,-73.847201
1,1,Bronx,Co-op City,40.874294,-73.829939
2,2,Bronx,Eastchester,40.887556,-73.827806
3,3,Bronx,Fieldston,40.895437,-73.905643
4,4,Bronx,Riverdale,40.890834,-73.912585


## Some key metrics for cities

In [104]:
########
CLIENT_ID = 'your Foursquare ID' # your Foursquare ID
CLIENT_SECRET = 'your Foursquare Secret' # your Foursquare Secret
VERSION = 'Foursquare API version' # Foursquare API version
LIMIT=100

In [105]:
### Some imports
import json
#!conda install -c conda-forge geopy --yes  # Previously installed geopy.  
from geopy.geocoders import Nominatim

import requests
from pandas.io.json import json_normalize

print("Libraries loaded!")

Libraries loaded!


In [106]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
            
        except:
            print("An exception occurred.  Skipping and moving on.")

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

def cityBreakdowns(city_df,radius=500):
    ## Loop across city_df rows and identify
    ## venues
    city_venues = getNearbyVenues(names=city_df['Neighborhood'],
                                   latitudes=city_df['Latitude'],
                                   longitudes=city_df['Longitude'],
                                  radius=radius
                                  )
    return city_venues
    

In [107]:
# The next few lines uses the 4-Square api to identify
# neighborhood venues in NYC.  I have ran this script and
# saved in a csv for faster replication of this code.

#ny_venus=cityBreakdowns(city_df=newyork)
#ny_venus.to_csv("nyc_venues.csv")
#ny_venus.head()

#######
## Load data from csv file rather than constructing from 
## api calls.

ny_venus=pd.read_csv("https://raw.githubusercontent.com/jcrooker/Coursera_Capstone/master/nyc_venues.csv")
ny_venus.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,1,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
2,2,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
3,3,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
4,4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


In [108]:
## For the first comparison city, we will consider
## Toronto.  The next few lines will construct the Toronto venues
## using the same approach as with New York.

#tor_venus=cityBreakdowns(city_df=tor_geo_df,radius=500)
#tor_venus.to_csv("toronto_venues.csv")
#tor_venus.head()

#######
## Load data from csv file rather than constructing from 
## api calls.

tor_venus=pd.read_csv("https://raw.githubusercontent.com/jcrooker/Coursera_Capstone/master/toronto_venues.csv")
tor_venus.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,3,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
4,4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


## Calculate City Satisfaction Index

In [109]:
def determine_Venue_Categories_in_2_Cities(c1_df,c2_df):
    venues=pd.Series(c1_df['Venue Category'].unique())
    vc2=pd.Series(c2_df['Venue Category'].unique())
    
    venues.append(vc2,ignore_index=True)
    return(venues)

def Venue_Category_Proportions_by_2_Cities(c1_df,c2_df):
    categories=determine_Venue_Categories_in_2_Cities(c1_df=c1_df,c2_df=c2_df)
    
    freq1 = []
    freq2 = []
    for category in categories:
        freq1.append(sum(c1_df['Venue Category']==category))
        freq2.append(sum(c2_df['Venue Category']==category))
    
    p1 = []
    p2 = []
    for i in range(0,len(freq1)):
        p1.append(freq1[i]/sum(freq1))
        p2.append(freq2[i]/sum(freq2))
        
    
    d = {'Category': categories, 'City1': p1, 'City2': p2}
    p_df = pd.DataFrame(d)
    p_df.set_index('Category')
    return(p_df)

def Neighborhood_Venue_Proportions_by_city(c_df):
    neigh1 = pd.Series(c_df['Neighborhood'].unique())
    
    freq1 = []
    for neigh in neigh1:
        freq1.append(sum(c_df['Neighborhood']==neigh))
    
    p1=[]
        
    for i in range(0,len(freq1)):
        p1.append(freq1[i]/sum(freq1))

    return(p1)

def Neighborhood_Venue_Proportion_Entropy_by_city(c_df):
    
    p=Neighborhood_Venue_Proportions_by_city(c_df=c_df)
    
    E1 = 0
    for i in range(0,len(p)):
        E1 = E1 - p[i]*math.log(p[i])

    return(E1)

def Neighborhood_Venue_Entropies_by_2_Cities(c1_df,c2_df):
    E1 = Neighborhood_Venue_Proportion_Entropy_by_city(c_df=c1_df)
    E2 = Neighborhood_Venue_Proportion_Entropy_by_city(c_df=c2_df)
    City_Entropies = {'City1':E1, 'City2':E2}
    return(City_Entropies)

def City_Venue_Entropies_by_2_Cities(c1_df,c2_df):
    proportion_df=Venue_Category_Proportions_by_2_Cities(c1_df=c1_df,c2_df=c2_df)
    
    eps = 0.0001
    E1 = 0
    E2 = 0
    for i in range(0,len(proportion_df.index)):
        pi1 = proportion_df.loc[i,'City1'] + eps
        pi2 = proportion_df.loc[i,'City2'] + eps
        E1=E1 - pi1*math.log(pi1)
        E2=E2 - pi2*math.log(pi2)
        
    City_Entropies = {'City1': E1, 'City2': E2}
    return(City_Entropies)

def Calculate_City_Satisfaction_Entropies(c1_df,c2_df,c1_name,c2_name):
    VEntropies=City_Venue_Entropies_by_2_Cities(c1_df=c1_df,c2_df=c2_df)
    NEntropies=Neighborhood_Venue_Entropies_by_2_Cities(c1_df=c1_df,c2_df=c2_df)
    
    Venue_Variety=[VEntropies['City1'],VEntropies['City2']]
    Venue_Dispersion=[NEntropies['City1'],NEntropies['City2']]
    City_Names = [c1_name,c2_name]
    
    Score = [0.5*VEntropies['City1']+0.5*NEntropies['City1'],
            0.5*VEntropies['City2']+0.5*NEntropies['City2']]
    
    d = {'Cities': City_Names, 'Venue Variety': Venue_Variety, 'Venue Dispersion': Venue_Dispersion,
        'Score': Score}
    df=pd.DataFrame(d)
    df.set_index('Cities')
    return(df)

Calculate_City_Satisfaction_Entropies(c1_df=ny_venus,c2_df=tor_venus,c1_name='NYC',c2_name='Toronto')

Unnamed: 0,Cities,Venue Variety,Venue Dispersion,Score
0,NYC,5.23183,5.25791,5.24487
1,Toronto,4.979008,3.931308,4.455158


New York has a greater diversity in Venues as well as wider dispersion of Venues across the City Neighborhoods relative to Toronto.  That is not likely too surprising.  In the next section, we ask 'adding a Venue Category' would enhance the Satisfaction score the most and 'adding a Venue to which Neighborhood?

## Marginal Analysis respective to Venue Category and Neighborhood

The marginal entropy relative to Venue Category 1 is:
$$\frac{\partial H(p)}{\partial p_i}=-ln(p)-1$$

Likewise, the marginal entropy relative to adding a Venue to neighborhood 1 is:

$$\frac{\partial H(w)}{\partial w_i}=-ln(w)-1$$

Identifying the Venue Category and the Neighborhood that produces the largest gain at the margin identifies the most appropriate location and Venue Category for expansion in a city given our Satisfaction Index.  The functions below will aid us in identifying this information for our City.

In [110]:
def optimal_Venue_Category_marginal_entropy(c1_df,c2_df):
    p_df=Venue_Category_Proportions_by_2_Cities(c1_df=c1_df,c2_df=c2_df)
    
    eps = 0.0001
    max1 = 0
    max2 = 0
    opt1 = 0
    opt2 = 0
    
    for i in range(0,len(p_df.index)):
        pi1 = p_df.loc[i,'City1'] + eps
        pi2 = p_df.loc[i,'City2'] + eps
        mpi1 = -math.log(pi1)-1
        mpi2 = -math.log(pi2)-1
        if mpi1>max1:
            max1 = mpi1
            opt1=i
            
        if mpi2>max2:
            max2 = mpi2
            opt2=i
            
    opt={'City1': p_df.loc[opt1,'Category'], 'City2': p_df.loc[opt2,'Category']}
    return(opt)

def optimal_Neighborhood_marginal_entropy(c_df):
    p = Neighborhood_Venue_Proportions_by_city(c_df=c_df)
    
    max1 = 0
    opt1 = 0
    for i in range(0,len(p)):
        mpi1 = -math.log(p[i])-1
        if mpi1>max1:
            max1=mpi1
            opt1 = i
    
    opt={'Neighborhood': c_df.loc[opt1,'Neighborhood']}
    return(opt)
    
def best_marginal_improvement(c1_df,c2_df,c1_name,c2_name):
    Venue_Category=optimal_Venue_Category_marginal_entropy(c1_df=c1_df,c2_df=c2_df)
    Neighborhood1 = optimal_Neighborhood_marginal_entropy(c_df=c1_df)
    Neighborhood2 = optimal_Neighborhood_marginal_entropy(c_df=c2_df)
    
    opt={'Venue Category': [Venue_Category['City1'],Venue_Category['City2']],
         'Neighborhood'  : [Neighborhood1['Neighborhood'],Neighborhood2['Neighborhood']],
         'City': [c1_name,c2_name]}
    d=pd.DataFrame(opt)
    d.set_index('City')
    return(d)

best_marginal_improvement(c1_df=ny_venus,c2_df=tor_venus,c1_name='NYC',c2_name='Toronto')
    



Unnamed: 0,Venue Category,Neighborhood,City
0,Platform,Kingsbridge,NYC
1,Food,Victoria Village,Toronto


Let's review the Neighborhoods that we have identified as providing the biggest increase in our Satisfaction index.  First, for New York.  Let's review the current venues in Kingsbridge

In [111]:
Kingsbridge=ny_venus[ny_venus['Neighborhood']=='Kingsbridge']
Kingsbridge['Venue Category']

54                  Gourmet Shop
55     Latin American Restaurant
56                   Pizza Place
57                Ice Cream Shop
58                   Pizza Place
                 ...            
117                  Pizza Place
118                Deli / Bodega
119            Outdoor Sculpture
120                         Park
121                     Bus Line
Name: Venue Category, Length: 68, dtype: object

Kingsbridge in New York seems to have 68 Venues!   Wow.  Now, let's explore Victoria Village in Toronto.

In [112]:
Victoria_Village=tor_venus[tor_venus['Neighborhood']=='Victoria Village']
Victoria_Village['Venue Category']

2             Hockey Arena
3    Portuguese Restaurant
4              Coffee Shop
5             Intersection
Name: Venue Category, dtype: object

Hmm, we see Victoria Village in Toronto has many fewer Venue Categories at 4.  Our analysis above indicates we should add a 'Food' Venue Category to Victoria Village.  Meanwhile, in Kingsbridge New York, we are instructed to add a 'Platform' Venue Category.  Let's see some examples of 'Platform' Venue Category establishments already in New York.

In [113]:
nyc_Platform=ny_venus[ny_venus['Venue Category']=='Platform']
nyc_Platform

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
37,37,Eastchester,40.887556,-73.827806,NYW&B Railway - Baychester,40.887882,-73.831163,Platform


Okay, there we have it-- a Train Stop in Kingsbridge would produce the greatest increase in Satisfaction according to our Satisfaction score.  In Toronto, we need to add a "Food" Venue to Victoria Village.