# Applied Data Science Capstone Project

# Which is the best neighborhood in NYC to run my new coffee shop?

The main objective of this project is to build a solution that takes location data from New York City and recommends the best places to open a new coffee shop, based on the density of businesses, the visitation in the area and the user ratings for this category in each neighborhood.

## Table of Contents

1. [Download and Prepare Dataset](#item1)<br>
2. [Explore Neighborhoods in New York City](#item2)<br>
3. [Predicting the best Neighborhood](#item3)<br>

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
from mpl_toolkits import mplot3d

import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline 

# import the k-Nearest Neighbors classifier
from sklearn.neighbors import KNeighborsClassifier

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

In [None]:
!conda install -c conda-forge plotly --yes

You will need an account on the Plotly website in order to run it. Go to https://plot.ly

In [None]:
# Plotly for interactive 3D plots
import plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Configure Plotly to be rendered inline in the notebook.
# Uncomment the following line and fill with plotly credentials
# plotly.tools.set_credentials_file(username=<YOUR_USERNAME_HERE>, api_key=<YOUR_API_KEY_HERE>)

<a name="item1"><h2>1. Download and Prepare Data</h2></a>

I downloaded the dataset from https://geo.nyu.edu/catalog/nyu_2451_34572 and uploaded it to the data folder of my environment on Cognitive Labs.

#### Load and explore the data
Next, let's load the data.

In [None]:
with open('data/newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [None]:
# Relevant data is in Feature key
neighborhoods_data = newyork_data['features']

#### Transform the data into a *pandas* dataframe
The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [None]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [None]:
# Loop through the features and collect one row per neighborhood.
# (DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list first.)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
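
To make the nested structure concrete, here is a minimal sketch of how one feature maps to one dataframe row. The `sample_feature` dictionary and the helper `feature_to_row` are hypothetical and the values are made up; only the key layout (`properties`, `geometry`, `coordinates`) follows the loop above.

```python
# Hypothetical feature shaped like the entries in newyork_data['features'];
# the name and coordinates are made up for illustration.
sample_feature = {
    "type": "Feature",
    "properties": {"name": "Example Neighborhood", "borough": "Example Borough"},
    "geometry": {"type": "Point", "coordinates": [-73.99, 40.73]},
}


def feature_to_row(feature):
    """Flatten one GeoJSON point feature into a flat dict (one dataframe row)."""
    # GeoJSON stores coordinates in [longitude, latitude] order
    lon, lat = feature["geometry"]["coordinates"]
    return {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


row = feature_to_row(sample_feature)
print(row)
```

A list of such dicts can be passed directly to `pd.DataFrame(...)` to build the whole table in one step.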

#### **I HAVE RUN OUT OF CALLS ON THE FREE FOURSQUARE API, SO NOW I GET SOME HELP FROM THE GOOGLE PLACES API, USING A FREE GOOGLE CLOUD ACCOUNT**

I am collecting all the data needed for this notebook and I will make it available as .csv files.

### Search for a specific venue category
> `https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=`**LAT, LNG**`&radius=`**RADIUS**`&type=`**TYPE**`&key=`**YOUR_API_KEY**

#### ...Let's try a search for cafes in the center of New York City

In [None]:
lat=40.7308619
lng=-73.9871558
RADIUS=500
TYPE='cafe'
YOUR_API_KEY='<YOUR API KEY>'

Test the API ...

In [None]:
url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?location={},{}&radius={}&type={}&key={}'.format(
        lat,
        lng,
        RADIUS,
        TYPE,
        YOUR_API_KEY)
url
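
As an alternative to formatting the string by hand, the same URL can be assembled from a dictionary of parameters. This is only a sketch using the standard library; the helper name `build_nearby_url` is hypothetical, and the key is a placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"


def build_nearby_url(lat, lng, radius, place_type, api_key):
    """Assemble a Nearby Search URL from its query parameters."""
    params = {
        "location": "{},{}".format(lat, lng),
        "radius": radius,
        "type": place_type,
        "key": api_key,
    }
    # safe="," keeps the comma between latitude and longitude unescaped
    return BASE_URL + "?" + urlencode(params, safe=",")


example_url = build_nearby_url(40.7308619, -73.9871558, 500, "cafe", "YOUR_API_KEY")
print(example_url)
```

Keeping the parameters in a dictionary makes it harder to mix up their order and takes care of escaping special characters.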

In [None]:
response = requests.get(url).json()
results = response['results']

In [None]:
response

If the response contains a *next_page_token* key, we can use it to request the next 20 results. A search can return up to three pages, for a maximum of 60 results.

**If we get a second page**...
<br>
if 'next_page_token' in response:<br>
>    url2 = url + '&pagetoken=' + response['next_page_token']<br>
>    response2 = requests.get(url2).json()<br>
>    results2 = response2['results']<br>
<br>
>    **If we get a third page**...
<br>
>    if 'next_page_token' in response2:<br>
>>        url3 = url + '&pagetoken=' + response2['next_page_token']<br>
>>        response3 = requests.get(url3).json()<br>
>>        results3 = response3['results']
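
The page-by-page logic above can also be written as a single loop. The sketch below is an illustration, not the code used to build the dataset: `fetch_all_pages` is a hypothetical helper, `fetch_json` stands for any callable that returns the decoded JSON (for example `lambda u: requests.get(u).json()`), and the pause before using a token is an assumption about how quickly the token becomes valid. A fake in-memory "API" is used so the example runs offline.

```python
import time


def fetch_all_pages(fetch_json, url, max_pages=3, delay_seconds=2.0):
    """Collect 'results' from up to max_pages pages of a paginated response."""
    results = []
    page_url = url
    for _ in range(max_pages):
        response = fetch_json(page_url)
        results.extend(response.get("results", []))
        token = response.get("next_page_token")
        if not token:
            break  # no further pages
        # Assumption: the token needs a short wait before it can be used
        time.sleep(delay_seconds)
        page_url = url + "&pagetoken=" + token
    return results


# Stand-in for the real API (fake data) so the sketch runs without a key.
fake_pages = {
    "base": {"results": [1, 2], "next_page_token": "t2"},
    "base&pagetoken=t2": {"results": [3, 4], "next_page_token": "t3"},
    "base&pagetoken=t3": {"results": [5]},
}
all_results = fetch_all_pages(fake_pages.__getitem__, "base", delay_seconds=0)
print(all_results)
```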

<a name="item2"><h2>2. Explore Neighborhoods in NYC</h2></a>

#### Let's create a function to repeat the same process to all the neighborhoods in NYC

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius, query, next_page_list):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL from the function arguments
        url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?location={},{}&radius={}&type={}&key={}'.format(
            lat,
            lng,
            radius,
            query,
            YOUR_API_KEY)

        try:
            response = requests.get(url).json()
            results = response['results']
            
            for v in results:
                venues_list.append([
                    name, 
                    lat,
                    lng,
                    v['name'], 
                    v['geometry']['location']['lat'], 
                    v['geometry']['location']['lng'],  
                    v['rating'], 
                    v['user_ratings_total'],  
                    ])
                
            if 'next_page_token' in response:
                next_page_list.append([name, lat, lng, url + '&pagetoken=' + response['next_page_token']])

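        except (requests.RequestException, KeyError, ValueError):
            # skip this neighborhood if the request fails or a field is missing
            pass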

    nearby_venues = pd.DataFrame(venues_list, columns=[
                        'Neighborhood', 
                        'Neighborhood Latitude', 
                        'Neighborhood Longitude', 
                        'Venue', 
                        'Venue Latitude', 
                        'Venue Longitude', 
                        'Venue Rating',
                        'Venue Ratings total'])

    
    return(nearby_venues)
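
Because the lookups `v['rating']` and `v['user_ratings_total']` sit inside the `try` block, a venue that lacks either key would raise a `KeyError` and the remaining venues of that neighborhood would be skipped. Whether the API actually omits these fields for some venues is an assumption here. A minimal sketch of a more tolerant row builder, with a hypothetical helper `venue_row` and made-up venues:

```python
def venue_row(neighborhood, lat, lng, venue):
    """Build one output row; missing optional fields get a default value."""
    location = venue["geometry"]["location"]
    return [
        neighborhood,
        lat,
        lng,
        venue["name"],
        location["lat"],
        location["lng"],
        venue.get("rating"),                 # None when the venue has no rating
        venue.get("user_ratings_total", 0),  # 0 when nobody has rated it
    ]


# Made-up venues, one of them without rating information
rated = {"name": "Cafe A", "geometry": {"location": {"lat": 1.0, "lng": 2.0}},
         "rating": 4.5, "user_ratings_total": 10}
unrated = {"name": "Cafe B", "geometry": {"location": {"lat": 1.5, "lng": 2.5}}}

rows = [venue_row("Example", 0.0, 0.0, v) for v in (rated, unrated)]
print(rows)
```

Rows with a missing rating become `NaN` in the dataframe and can be dropped or kept later, depending on the analysis.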

Let's create a function to get *page 2* venues ...

In [None]:
def getNearbyVenues2(next_page_list, page3):
    
    venues_list=[]
    
    for next_page in next_page_list:
        try:
            url = next_page[3]
            response = requests.get(url).json()
            results = response['results']
            print(next_page[0])
        
            for v in results:
                venues_list.append([
                    next_page[0], 
                    next_page[1],
                    next_page[2],
                    v['name'], 
                    v['geometry']['location']['lat'], 
                    v['geometry']['location']['lng'],  
                    v['rating'], 
                    v['user_ratings_total'],  
                    ])

            if 'next_page_token' in response:
                page3.append([next_page[0], next_page[1], next_page[2], url + '&pagetoken=' + response['next_page_token']])
          
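        except (requests.RequestException, KeyError, ValueError):
            # skip this page if the request fails or a field is missing
            pass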
        
    nearby_venues = pd.DataFrame(venues_list, columns=[
                        'Neighborhood', 
                        'Neighborhood Latitude', 
                        'Neighborhood Longitude', 
                        'Venue', 
                        'Venue Latitude', 
                        'Venue Longitude', 
                        'Venue Rating',
                        'Venue Ratings total'])
   
    return(nearby_venues)

... and a function for *page 3* ...

In [None]:
def getNearbyVenues3(page3_list):
    
    venues_list=[]
    
    for next_page in page3_list:
        try:
            url = next_page[3]
            response = requests.get(url).json()
            results = response['results']
            print(next_page[0])
        
            for v in results:
                venues_list.append([
                    next_page[0], 
                    next_page[1],
                    next_page[2],
                    v['name'], 
                    v['geometry']['location']['lat'], 
                    v['geometry']['location']['lng'],  
                    v['rating'], 
                    v['user_ratings_total'],  
                    ])
          
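        except (requests.RequestException, KeyError, ValueError):
            # skip this page if the request fails or a field is missing
            pass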
        
    nearby_venues = pd.DataFrame(venues_list, columns=[
                        'Neighborhood', 
                        'Neighborhood Latitude', 
                        'Neighborhood Longitude', 
                        'Venue', 
                        'Venue Latitude', 
                        'Venue Longitude', 
                        'Venue Rating',
                        'Venue Ratings total'])
   
    return(nearby_venues)

In [None]:
# Uncomment this cell for the first time
# page2_venues=[]
# newyork_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
#                                     latitudes=neighborhoods['Latitude'],
#                                     longitudes=neighborhoods['Longitude'],
#                                     radius=RADIUS,
#                                     query=TYPE,
#                                     next_page_list=page2_venues
#                                )
# newyork_venues.shape

If you have saved your newyork_venues dataframe, you now can read it ...

In [None]:
# newyork_venues = pd.DataFrame(columns=['Neighborhood', 
#                   'Neighborhood Latitude', 
#                   'Neighborhood Longitude', 
#                   'Venue', 
#                   'Venue Latitude', 
#                   'Venue Longitude', 
#                   'Venue Rating',
#                   'Venue Ratings total'])
# newyork_venues = pd.read_csv('data/newyork_venues.csv')
# newyork_venues.shape

Google Places API allows up to 60 venues per Nearby search. Now we will run a function to get venues for neighborhoods that have *next_page_token* ...

In [None]:
page3=[]
newyork_venues2 = getNearbyVenues2(next_page_list=page2_venues, page3=page3)

In [None]:
newyork_venues2.shape

In [None]:
newyork_venues3 = getNearbyVenues3(page3_list=page3)

In [None]:
newyork_venues3.shape

Concatenate the dataframes newyork_venues, newyork_venues2 and newyork_venues3

In [None]:
frame = [newyork_venues, newyork_venues2, newyork_venues3]

In [None]:
newyork_cafes = pd.concat(frame)

In [None]:
newyork_cafes.shape

In [None]:
newyork_cafes = pd.read_csv('data/newyork_cafes.csv')
newyork_cafes.head()

### Exploratory data analysis

Let's get initial insights about our data. Let's generate a histogram of the average venue rating and another of the total number of ratings per venue.

In [None]:
newyork_cafes[['Venue Rating']].plot(kind='hist', figsize=(8, 5), bins=50)

plt.title('Distribution of average ratings') # add a title to the histogram
plt.ylabel('Number of Cafes') # add y-label
plt.xlabel('Average ratings') # add x-label

plt.show()

In [None]:
newyork_cafes[['Venue Ratings total']].plot(kind='hist', figsize=(8, 5), bins=50)

plt.title('Distribution of number of ratings') # add a title to the histogram
plt.ylabel('Number of Cafes') # add y-label
plt.xlabel('# of ratings') # add x-label

plt.show()

In [None]:
newyork_cafes['Weighted Rating'] = newyork_cafes['Venue Rating'] * newyork_cafes['Venue Ratings total']

In [None]:
venue_count = newyork_cafes.groupby(['Neighborhood']).count()

In [None]:
venue_count.drop(columns=['Neighborhood Latitude', 'Neighborhood Longitude',
       'Venue Latitude', 'Venue Longitude', 'Venue Rating',
       'Venue Ratings total', 'Weighted Rating'], inplace=True)
venue_count.head()

In [None]:
venue_count = venue_count.rename(index=str, columns={'Venue':'VENUE_COUNT'})
venue_count.head()

In [None]:
s = newyork_cafes.groupby(['Neighborhood'])['Venue Ratings total'].sum()
user_count = pd.DataFrame(s)
user_count.head()

In [None]:
user_count = user_count.rename(index=str, columns={'Venue Ratings total':'USER_COUNT'})
user_count.head()

In [None]:
c = newyork_cafes.groupby(['Neighborhood'])['Weighted Rating'].sum()
category_avg = pd.DataFrame(c)
category_avg.head()

In [None]:
category_avg = category_avg.merge(user_count, left_index=True, right_index=True)
category_avg.head()

In [None]:
category_avg['CATEGORY_AVG'] = category_avg['Weighted Rating']/category_avg['USER_COUNT']
category_avg.head()

In [None]:
category_avg.drop(columns=['Weighted Rating', 'USER_COUNT'], inplace=True)
category_avg.head()
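
CATEGORY_AVG is therefore a ratings-weighted average: each cafe's rating counts in proportion to how many users rated it. The sketch below reproduces the same arithmetic in plain Python on made-up numbers (the `cafes` list is hypothetical).

```python
from collections import defaultdict

# Made-up data: (neighborhood, average rating, number of ratings)
cafes = [("A", 5.0, 10), ("A", 3.0, 30), ("B", 4.0, 5)]

weighted_sum = defaultdict(float)  # sum of rating * number of ratings
rating_count = defaultdict(int)    # sum of number of ratings (USER_COUNT)

for neighborhood, rating, n_ratings in cafes:
    weighted_sum[neighborhood] += rating * n_ratings
    rating_count[neighborhood] += n_ratings

category_avg = {n: weighted_sum[n] / rating_count[n] for n in weighted_sum}
print(category_avg)
```

For neighborhood "A" the result is (5.0 × 10 + 3.0 × 30) / 40 = 3.5, while the plain mean of the two ratings would be 4.0: the cafe with more ratings pulls the average toward its own rating.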

In [None]:
neighborhood_venues = venue_count.merge(user_count, left_index=True, right_index=True)
neighborhood_venues.head()

In [None]:
neighborhood_venues = neighborhood_venues.merge(category_avg, left_index=True, right_index=True)
neighborhood_venues.head()

In [None]:
neighborhood_venues = pd.read_csv('data/neighborhood_venues.csv')
neighborhood_venues.head()

In [None]:
print(neighborhood_venues['CATEGORY_AVG'].min())

Let's now analyze the data aggregated by neighborhood...
Let's plot a histogram of each of our variables: VENUE_COUNT, USER_COUNT and CATEGORY_AVG.

In [None]:
neighborhood_venues[['VENUE_COUNT']].plot(kind='hist', bins=60, figsize=(8, 5))

plt.title('Distribution of # of cafes by neighborhood') # add a title to the histogram
plt.ylabel('Neighborhoods') # add y-label
plt.xlabel('Number of Cafes') # add x-label

plt.show()

In [None]:
neighborhood_venues[['USER_COUNT']].plot(kind='hist', bins=200, figsize=(8, 5))

plt.title('Distribution of # of users by neighborhood') # add a title to the histogram
plt.ylabel('Neighborhoods') # add y-label
plt.xlabel('Number of users') # add x-label

plt.show()

In [None]:
neighborhood_venues[['CATEGORY_AVG']].plot(kind='hist', bins=50, figsize=(8, 5))

plt.title('Distribution of average category ratings by neighborhood') # add a title to the histogram
plt.ylabel('Neighborhoods') # add y-label
plt.xlabel('Average category ratings') # add x-label

plt.show()

We see summary statistics of our final dataframe ...

In [None]:
neighborhood_venues.describe()

<a name="item3"><h2>3. Predicting the best neighborhood</h2></a>

Now let's normalize the dataset. We divide each column by its maximum value, so every feature is rescaled to the interval [*minimum value / maximum value*, 1].

In [None]:
X_unscaled = neighborhood_venues[['VENUE_COUNT', 'USER_COUNT', 'CATEGORY_AVG']].values

In [None]:
# Divide each column by its maximum; this creates a new array and leaves X_unscaled untouched
X = X_unscaled / X_unscaled.max(axis=0)
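
To see what dividing by the maximum does, here is the same operation in plain Python on three made-up rows in the order (VENUE_COUNT, USER_COUNT, CATEGORY_AVG). The numbers are hypothetical and only illustrate the scaling.

```python
# Made-up rows: [VENUE_COUNT, USER_COUNT, CATEGORY_AVG]
rows = [
    [2, 100, 4.0],
    [8, 400, 4.5],
    [4, 50, 3.6],
]

# Maximum of each column
col_max = [max(col) for col in zip(*rows)]

# Divide every value by the maximum of its column
scaled = [[value / m for value, m in zip(row, col_max)] for row in rows]

for row in scaled:
    print([round(value, 3) for value in row])
```

The row holding the maximum of a column gets the value 1 in that column, and since all values are positive nothing falls to 0 or below.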

... and in order to predict a response, we need to replace our neighborhood labels with numbers ...

In [None]:
y = np.array(neighborhood_venues.index)

In [None]:
# Helper function to define colors to data points
def get_rgb(array3D):
    x_max = array3D[:,0].max()
    y_max = array3D[:,1].max()
    z_max = array3D[:,2].max()
    rgb_color = []
    r = 0
    g = 0
    b = 0
    for v in array3D:
        scale_r = lambda x: 255 if x * 1000/y_max > 255 else x * 1000/y_max
        r = scale_r(v[1])
        g = v[0]*255/x_max
        scale_b = lambda x: 0 if round(x - 0.745*z_max, 3) * 950/z_max < 0 else round(x - 0.745*z_max, 3) * 950/z_max
        b = scale_b(v[2])
        rgb_color.append('rgb(' + str(int(r)) + ',' + str(int(g)) + ',' + str(int(b)) + ')')
    return(rgb_color)

In [None]:
Points = X

# Append our target point to be plotted on the chart
Points = np.append(Points, [[0,1,0]], axis=0)

labels = neighborhood_venues['Neighborhood']
labels = pd.concat([labels, pd.Series(['Target point'])], ignore_index=True)
colors = get_rgb(Points)

In [None]:
# Configure the trace.
trace = go.Scatter3d(
    x=Points[:,0],  # <-- Put your data instead
    y=Points[:,1],  # <-- Put your data instead
    z=Points[:,2],  # <-- Put your data instead
    text=labels,
    mode='markers',
    marker={
        'size': 5,
        'opacity': 0.8,
        'color': colors
    }
)

In [None]:
# Configure the layout.
layout = go.Layout(margin={'l': 0, 'r': 0, 'b': 0, 't': 0},
                   scene = dict(
                       xaxis = dict(title='VENUE_COUNT'),
                       yaxis = dict(title='USER_COUNT'),
                       zaxis = dict(title='CATEGORY_AVG')
                                )
                  )

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)
# Render the plot.
plotly.plotly.iplot(plot_figure, filename='jupyter-scatter3D-plot1')

Run k-Nearest Neighbors with *k=1*, treating each neighborhood as its own class, so that a prediction returns the single neighborhood closest to the query point.

In [None]:
k=1

# fit a k-Nearest Neighbors classifier
neigh1 = KNeighborsClassifier(
            n_neighbors=k,
            metric='euclidean',
            algorithm='brute',
            p=2
            ).fit(X, y)

# display the fitted classifier and its parameters
neigh1

### Predicting
We can use the model to predict the neighborhood closest to a target point:<br>
#### Explaining the values to be predicted ...<br>

We want:<br>

- As few venues as possible, so **VENUE_COUNT** => 0<br>
- As many people as possible, so **USER_COUNT** => 1<br>
- The lowest average rating possible, so **CATEGORY_AVG** => 0<br>

In [None]:
X_test = np.array([[0, 1, 0]])
yhat = neigh1.predict(X_test)
yhat

In [None]:
neighborhood_venues.iloc[yhat[0]]
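
With *k=1* and one distinct label per row, the prediction amounts to finding the row with the smallest Euclidean distance to the target point. The sketch below shows that search in plain Python; `nearest_index` is a hypothetical helper and the three points are made up.

```python
import math


def nearest_index(points, target):
    """Return the index of the point with the smallest Euclidean distance to target."""
    distances = [math.dist(p, target) for p in points]
    return distances.index(min(distances))


# Made-up normalized rows: [VENUE_COUNT, USER_COUNT, CATEGORY_AVG]
points = [
    [0.9, 1.0, 0.95],   # many cafes, many users
    [0.2, 0.7, 0.9],    # few cafes, fairly many users
    [0.05, 0.01, 0.8],  # almost no cafes, almost no users
]
target = [0, 1, 0]  # no competition, maximum users, lowest ratings

best = nearest_index(points, target)
print(best)
```

Here the second point wins: it is not the best on any single axis, but it is the best compromise between few venues and many users.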

**Flatiron** has few coffee shops and lots of people walking around!!!

The variables are equally weighted, but if you think one variable should be more relevant, you can rescale it by multiplying it by a suitable weight.

Making **USER_COUNT** more relevant than others ...

In [None]:
X[:,1] = X[:,1]*2
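
Multiplying a column by a weight stretches that axis, so differences along it count more in the Euclidean distance. The sketch below, with a hypothetical helper `weighted_nearest` and made-up points, shows how doubling the USER_COUNT axis can change which point is nearest.

```python
import math


def weighted_nearest(points, target, weights):
    """Index of the nearest point after scaling every axis by its weight."""
    def scale(vector):
        return [w * x for w, x in zip(weights, vector)]

    scaled_target = scale(target)
    distances = [math.dist(scale(p), scaled_target) for p in points]
    return distances.index(min(distances))


# Made-up normalized rows: [VENUE_COUNT, USER_COUNT, CATEGORY_AVG]
points = [
    [0.7, 1.0, 0.9],    # many cafes, maximum users
    [0.2, 0.6, 0.9],    # few cafes, moderate users
    [0.05, 0.01, 0.8],  # almost no cafes, almost no users
]
target = [0, 1, 0]

balanced = weighted_nearest(points, target, [1, 1, 1])
user_heavy = weighted_nearest(points, target, [1, 2, 1])
print(balanced, user_heavy)
```

With equal weights the second point is nearest; once USER_COUNT counts double, the shortfall in users costs more than the extra competition and the first point takes over. Note that the target is scaled together with the data, which is why the test point becomes `[0, 2, 0]` after the column is doubled.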

In [None]:
k=1

# fit a k-Nearest Neighbors classifier
neigh1 = KNeighborsClassifier(
            n_neighbors=k,
            metric='euclidean',
            algorithm='brute',
            p=2
            ).fit(X, y)

# display the fitted classifier and its parameters
neigh1

In [None]:
X_test = np.array([[0, 2, 0]])
yhat = neigh1.predict(X_test)
yhat

In [None]:
neighborhood_venues.iloc[yhat[0]]

**Soho** has many more people than Flatiron, but many more coffee shops too!!!

In [None]:
Points = X

# Append our target point to be plotted on the chart. Now the maximum y axis value is 2.
Points = np.append(Points, [[0,2,0]], axis=0)

labels = neighborhood_venues['Neighborhood']
labels = pd.concat([labels, pd.Series(['Target point'])], ignore_index=True)
colors = get_rgb(Points)

In [None]:
# Configure the trace.
trace = go.Scatter3d(
    x=Points[:,0],  # <-- Put your data instead
    y=Points[:,1],  # <-- Put your data instead
    z=Points[:,2],  # <-- Put your data instead
    text=labels,
    mode='markers',
    marker={
        'size': 5,
        'opacity': 0.8,
        'color': colors
    }
)

In [None]:
# Configure the layout.
layout = go.Layout(margin={'l': 0, 'r': 0, 'b': 0, 't': 0},
                   scene = dict(
                       xaxis = dict(title='VENUE_COUNT'),
                       yaxis = dict(title='USER_COUNT'),
                       zaxis = dict(title='CATEGORY_AVG')
                                )
                  )

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)
# Render the plot.
plotly.plotly.iplot(plot_figure, filename='jupyter-scatter3D-plot2')

Making **CATEGORY_AVG** more relevant ...

In [None]:
X[:,2] = X[:,2]*2

In [None]:
k=1

# fit a k-Nearest Neighbors classifier
neigh1 = KNeighborsClassifier(
            n_neighbors=k,
            metric='euclidean',
            algorithm='brute',
            p=2
            ).fit(X, y)

# display the fitted classifier and its parameters
neigh1

In [None]:
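# The USER_COUNT target is 1 again and the CATEGORY_AVG target remains 0.
# Note: X[:,1] still carries the weight of 2 applied earlier unless the normalization cells are re-run.
X_test = np.array([[0, 1, 0]])
yhat = neigh1.predict(X_test)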
yhat

In [None]:
neighborhood_venues.iloc[yhat[0]]

**Travis** has one of the lowest average ratings for coffee shops

In [None]:
Points = X

# Append our target point to be plotted on the chart.
Points = np.append(Points, [[0,1,0]], axis=0)

labels = neighborhood_venues['Neighborhood']
labels = pd.concat([labels, pd.Series(['Target point'])], ignore_index=True)
colors = get_rgb(Points)

In [None]:
# Configure the trace.
trace = go.Scatter3d(
    x=Points[:,0],  # <-- Put your data instead
    y=Points[:,1],  # <-- Put your data instead
    z=Points[:,2],  # <-- Put your data instead
    text=labels,
    mode='markers',
    marker={
        'size': 5,
        'opacity': 0.8,
        'color': colors
    }
)

In [None]:
# Configure the layout.
layout = go.Layout(margin={'l': 0, 'r': 0, 'b': 0, 't': 0},
                   scene = dict(
                       xaxis = dict(title='VENUE_COUNT'),
                       yaxis = dict(title='USER_COUNT'),
                       zaxis = dict(title='CATEGORY_AVG')
                                )
                  )

data = [trace]

plot_figure = go.Figure(data=data, layout=layout)
# Render the plot.
plotly.plotly.iplot(plot_figure, filename='jupyter-scatter3D-plot3')

So, you should calibrate the weights in order to get a good choice ...

### Now we can see our predicted neighborhoods on the map ...

In [None]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

In [None]:
# create map
map_venue = folium.Map(location=[latitude, longitude], zoom_start=10)

In [None]:
# add markers to the map
Travis = neighborhoods.loc[neighborhoods['Neighborhood']=='Travis']
Flatiron = neighborhoods.loc[neighborhoods['Neighborhood']=='Flatiron']
Soho = neighborhoods.loc[neighborhoods['Neighborhood']=='Soho']
frame=[Travis, Flatiron, Soho]
markers = pd.concat(frame)
markers

In [None]:
colors=['black', 'green', 'red']
i=0

for lat, lng, borough, neighborhood in zip(markers['Latitude'], markers['Longitude'], markers['Borough'], markers['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
    [lat, lng],
    radius=5,
    popup=label,
    color=colors[i],
    fill=True,
    fill_color='#3186cc',
    fill_opacity=0.7,
    parse_html=False).add_to(map_venue)
    i=i+1
       
map_venue

#### The predictions are reasonable based on our variables. If you want no competition, go to Travis, but how many people there want a cup of coffee? If you go to Flatiron or Soho, be prepared to work hard, because a lot of people need a good coffee. In Flatiron, getting a space for your coffee shop may be a "little bit" expensive!!! In Soho, you should be prepared for a "Coffee War", because there are lots of good coffee shops (the average rating in the neighborhood is 4.18 out of 5). Of course, you can try any type of venue with this model. See the supported types at https://developers.google.com/places/web-service/supported_types.