# Capstone Project - The Battle of Neighborhoods (Week 1)

## 1. Introduction to the business problem

For restaurant investors, a few crucial factors determine whether a new business succeeds. For example: whether the restaurant sits in a place with plenty of potential guests; whether nearby restaurants already have a good reputation, which can bring you mature, high-value clients from the very beginning (restaurant clustering theory); and whether your menu offers attractive food.

So, assuming a new investor is planning to open a restaurant in New York (NY), an online intelligent estimation system would be valuable. The system takes a few simple inputs, such as the location of the targeted venue and some of the food you plan to offer, and predicts a score that estimates how successful the new business will be.

To implement this system, we will explore Foursquare geographic location data and pertinent social network data, train a machine learning model (classification), and finally run a back-end prediction service.

It will be a free online service open to anyone who is planning to open a restaurant in New York City. The user clicks a venue on the map and enters some of the special dishes they plan to offer. The investor then gets a score as a predicted success indicator, and can simply move the mouse to change the position or type in new food to get updated scores continuously. By trying this service, the investor can test different combinations of address and food to find the most promising one. Such a prediction system can help investors lower the risk of a restaurant investment from the very beginning.

## 2. Data exploration

To train the model, we explore the Foursquare data step by step and assemble the pertinent information into a reasonable feature dataset.

### 2.1 Explore the New York neighborhood location data

To get location data, we start from the belief that a good restaurant must have plenty of potential clients, i.e. it can't be too far from the neighborhoods where people cluster. So we loaded a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions import it from pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [4]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [5]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
neighborhoods_data = newyork_data['features']
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
# collect one row per neighborhood, then build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})
neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.shape

(306, 4)

In [6]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.
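
As a quick check of the parsing logic, here is a minimal sketch that flattens one GeoJSON-style feature into a row. The sample feature is made up for illustration (its name and coordinates are not taken from the dataset), and `feature_to_row` is a hypothetical helper name. The point to notice is that the coordinates are stored as [longitude, latitude], so the order has to be swapped.

```python
# Hypothetical sample shaped like one entry of newyork_data['features'];
# the values are illustrative and not copied from the real file.
sample_features = [
    {
        "properties": {"borough": "Bronx", "name": "Example Neighborhood"},
        "geometry": {"coordinates": [-73.85, 40.89]},  # [longitude, latitude]
    }
]

def feature_to_row(feature):
    """Turn one feature dict into a flat row for the neighborhoods dataframe."""
    lon, lat = feature["geometry"]["coordinates"]
    return {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lon,
    }

sample_rows = [feature_to_row(f) for f in sample_features]
print(sample_rows)
```

A list of such rows can be passed directly to `pd.DataFrame(...)` to obtain the neighborhoods dataframe.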


### 2.2 Explore the New York restaurants data
Now that we have all the neighborhoods listed, the next step is to collect the restaurants around them as training data. These restaurants are representative because they are close to residents, in other words, close to potential clients.

In [7]:
column_names = ['v_id', 'v_name', 'v_dist', 'v_cat']
df1 = pd.DataFrame(columns=column_names)
# v_id stays a regular column; only the index gets a label here
df1.index.name = 'v_id'

df1 will hold the incoming restaurant data. Through the Foursquare *"search"* API we first get all nearby restaurants for each of the above neighborhoods, to build a full list of all restaurants located near those 306 New York neighborhoods.
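
The request itself can be sketched as follows. This is a minimal illustration, not the notebook's exact code: it assembles the same search URL that appears (commented out) in the next cell, but uses `urllib.parse.urlencode` so the query string is escaped properly. `build_search_url` is a hypothetical helper name, the credentials are placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

def build_search_url(client_id, client_secret, version, lat, lng, radius, limit,
                     query="Restaurant,Coffee"):
    """Assemble the venue search URL without sending any request."""
    params = {
        "query": query,
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude of the neighborhood
        "radius": radius,
        "limit": limit,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials and an illustrative coordinate pair.
url = build_search_url("YOUR_ID", "YOUR_SECRET", "20190205", 40.89, -73.85, 1000, 1000)
print(url)
```

The resulting URL could then be passed to `requests.get(url).json()`, as the commented-out code does.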

In [9]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20190205' # Foursquare API version
LIMIT = 1000 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

rows = []  # one dict per (neighborhood, venue) pair
for i in range(len(neighborhoods)):
    neighborhood_latitude   = neighborhoods.loc[i,"Latitude"]
    neighborhood_longitude  = neighborhoods.loc[i,"Longitude"]
    neighborhood_name       = neighborhoods.loc[i,"Neighborhood"]
    neighborhood_borough    = neighborhoods.loc[i,"Borough"]
#     url = 'https://api.foursquare.com/v2/venues/search?&query=Restaurant,Coffee&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
#         CLIENT_ID,
#         CLIENT_SECRET,
#         VERSION,
#         neighborhood_latitude,
#         neighborhood_longitude,
#         radius,
#         LIMIT)
#     results = requests.get(url).json()
#     with open("rstlist_{}.json".format(i), "w") as f:
#         json.dump(results, f)

    # Don't uncomment the code above: a free Foursquare developer account only has limited
    # daily API access, so the responses were stored once in local files and are read back here.
    with open("rstlist_{}.json".format(i),'r') as load_f:
        results = json.load(load_f)
    # flatten the venue list of this response into rows
    for v in results['response']['venues']:
        if not v.get("categories"):
            continue  # skip venues without any category
        v_cat = None
        for ca in v.get("categories"):
            if ca.get("primary"):
                v_cat = ca.get("name")
        rows.append({'v_id': v.get("id"),
                     'v_name': v.get("name"),
                     'v_dist': v.get("location")["distance"],
                     'v_cat': v_cat})

# build the dataframe once (DataFrame.append was removed in pandas 2.0) and save it
df1 = pd.DataFrame(rows, columns=column_names)
df1.to_csv('rest.csv')

In [10]:
df1.shape

(9390, 4)

In [11]:
df1.head(5)

Unnamed: 0,v_id,v_name,v_dist,v_cat
0,4db03c875da32cf2df4509f4,Big Daddy's Caribbean Taste Restaurant,1008,Caribbean Restaurant
1,4c66e0068e9120a15929d964,Kaieteur Restaurant & Bakery,1011,Caribbean Restaurant
2,508af256e4b0578944c87392,Cooler Runnings Jamaican Restaurant Inc,479,Caribbean Restaurant
3,4be5f0eacf200f47d1fa133c,McDonald's,904,Fast Food Restaurant
4,4c994113a004a1cdc3393e6e,Bay 241 Restaurant & Lounge,792,Caribbean Restaurant


We have stored the basic restaurant information in a local csv file. The listed information includes:
* v_id: the venue ID of the restaurant; we will use this ID later to fetch more data about the restaurant
* v_name: the name of the restaurant
* v_dist: the distance from the restaurant to the related neighborhood
* v_cat: the restaurant's major category

Remember that each restaurant may be close to multiple neighborhoods, so we need to merge those rows. In the next step we will extend the restaurant data with more advanced features and, at the same time, remove duplicates.

A reminder here: the csv file dumped above contained some encoding errors that caused Python decoding exceptions, so the dirty data was cleaned (manually, in this case). We then renamed the csv file to "ny_restaurant.csv", and from here on we use this clean data file.
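
The cleanup in this project was done by hand. As a hedged alternative, and assuming the problem is invalid UTF-8 byte sequences in the dumped file (the notebook does not say exactly what the errors were), a minimal sketch could decode the raw bytes and replace whatever cannot be decoded. `clean_csv_bytes` is a hypothetical helper, and the input bytes are made up for illustration.

```python
def clean_csv_bytes(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing undecodable sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

# Illustrative input: one valid row, then a row containing a stray 0xFF byte.
raw = b"v_id,v_name\n1,Caf\xc3\xa9\n2,Big Mac\xff\n"
cleaned = clean_csv_bytes(raw)
print(cleaned)
```

Applied to the real file, the bytes would come from `open('rest.csv', 'rb').read()` and the cleaned text would be written to the new csv file; the replacement characters still need a manual look afterwards.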

### 2.3 Explore advanced restaurant data to construct the full feature set


__First, we come up with two new numerical features:__ 
* The average distance from one restaurant to all its nearby neighborhoods, 
* How many neighborhoods sit around a certain restaurant

In [3]:
# the full restaurant list for all NY neighborhoods has been collected; load the cleaned file
df2=pd.read_csv('ny_restaurant.csv')
# count how many neighborhoods each restaurant is close to
nc=df2.groupby("v_id")["v_name"].count()

# compute each restaurant's average distance to all its nearby neighborhoods
nda=df2.groupby("v_id")["v_dist"].mean()

df2.drop(["v_dist"],inplace=True,axis=1)
df2.set_index("v_id", drop=True, inplace=True)
# keep a single row per restaurant
df2=df2[~df2.index.duplicated(keep='last')]
# the aggregates are aligned to df2 through its v_id index
df2["avg_dist_2_neighborhood"] = nda
df2["cnt_near_neighborhood"] = nc
df2["restaurant_id"]=df2.index

df2.index = range(len(df2))
df2.to_csv("rest_features.csv",index=False)
df2.shape

(5290, 5)

In [14]:
df2.head(2)

Unnamed: 0,v_name,v_cat,avg_dist_2_neighborhood,cnt_near_neighborhood,restaurant_id
0,Cooler Runnings Jamaican Restaurant Inc,Caribbean Restaurant,479.0,1,508af256e4b0578944c87392
1,McDonald's,Fast Food Restaurant,904.0,1,4be5f0eacf200f47d1fa133c


From the data above, you can see two new columns named __"avg_dist_2_neighborhood"__ and __"cnt_near_neighborhood"__. The column "v_dist", which is no longer needed, has been dropped, and duplicated restaurants are merged into a single row in the dataframe.
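
To make the aggregation concrete, here is a small worked example on a toy table. The venue IDs, names, and distances are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy search results: venue "a" was returned for two neighborhoods, "b" for one.
toy = pd.DataFrame({
    "v_id":   ["a", "a", "b"],
    "v_name": ["Alpha Diner", "Alpha Diner", "Beta Cafe"],
    "v_dist": [400, 600, 250],
})

cnt = toy.groupby("v_id")["v_name"].count()   # neighborhoods per venue
avg = toy.groupby("v_id")["v_dist"].mean()    # mean distance per venue

# Keep one row per venue, then attach the aggregates by index alignment.
features = toy.drop(columns="v_dist").drop_duplicates("v_id").set_index("v_id")
features["avg_dist_2_neighborhood"] = avg
features["cnt_near_neighborhood"] = cnt
print(features)
```

Venue "a" ends up with an average distance of 500.0 and a neighborhood count of 2, while venue "b" keeps its single distance of 250.0 and a count of 1.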

Next, we __add more features__:
* Latitude of the restaurant
* Longitude of the restaurant
* Average rating of the venues clustered around this restaurant, with this restaurant as the centroid
* The number of recommended popular venues near this restaurant
* The label (or the true target of regression) of the restaurant
* The menu items (food) offered by this restaurant
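
The "average rating" feature can be sketched as a small function. This mirrors the logic of the commented-out loop in this section, where venues without a rating are not counted in the average; `mean_rating` is a hypothetical helper name and the sample ratings are invented.

```python
def mean_rating(ratings):
    """Average the ratings that are present; missing (None) or zero ratings are skipped.

    Returns 0.0 when no venue has a usable rating.
    """
    rated = [float(r) for r in ratings if r]
    return sum(rated) / len(rated) if rated else 0.0

print(mean_rating([6.5, None, 7.5]))
print(mean_rating([]))
```

Skipping unrated venues keeps a handful of missing ratings from dragging the neighborhood average toward zero.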

PS. Don't uncomment the following code: it would use up my entire daily Foursquare API quota. I will read the sample feature csv back in afterwards.


In [3]:
# Don't be surprised to see csv files stored and then loaded back: because I only have free developer access to
# Foursquare, I keep intermediate files to save my daily API quota.


# df2=pd.read_csv("rest_features.csv")
# we will add more features to the dataframe
# df2["lat"]=None  # Latitude of the restaurant
# df2["lng"]=None  # Longitude of the restaurant
# df2["avg_rate"]=None # average rating of clustered restaurants where this restaurant is the centroid
# df2["nearby_rec"]=None  # how many recommended popular venues are near this restaurant
# df2["rating"]=None      # this is the label (or the true target of regression) of the restaurant
# df2["menu"]=None        # the menu items (food) offered by this restaurant

# for i in range(len(df2)):
#     rid= df2.iloc[i]["restaurant_id"]
#     rname=df2.iloc[i]["v_name"]
#     url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(
#         rid,
#         CLIENT_ID,
#         CLIENT_SECRET,
#         VERSION)
#     # result= requests.get(url).json()    #
#     # with open("venues_{}.json".format(i), "w") as f:
#     #     json.dump(result, f)
#     try:
#         with open("venues_{}.json".format(i),'r') as load_f: result = json.load(load_f)
#     except:
#        break
#        # continue
#     # 2.0 get restaurant detail info
#     v_ll=[result['response']['venue']["location"]["lat"],result['response']['venue']["location"]["lng"]]
#     v_rating=result['response']['venue'].get('rating')
#     df2.loc[i,  "lat"] = v_ll[0]
#     df2.loc[i, "lng"]  = v_ll[1]
#     df2.loc[i, "rating"] = v_rating
#     # 2.1 now let's get the restaurant menu
#     hasmenu=result['response']['venue'].get('hasMenu')
#     v_menu=[]
#     if not(hasmenu):
#         pass
#     else:
#         # url = 'https://api.foursquare.com/v2/venues/{}/menu?client_id={}&client_secret={}&v={}'.format(
#         #     rid,
#         #     CLIENT_ID,
#         #     CLIENT_SECRET,
#         #     VERSION)
#         # menus = requests.get(url).json()
#         # with open("menu_{}.json".format(i), "w") as f:
#         #     json.dump(menus, f)
#         try:
#             with open("menu_{}.json".format(i), "r") as mf: menus = json.load(mf)
#             mc = menus['response']['menu']['menus']['count']
#             if mc!=0:
#                 v_menu=get_menu_item(menus['response']['menu']['menus']["items"])
#         except:pass
#     df2.loc[i,"menu"] = "|".join(v_menu)

#     # 2.2 to get how many hot spot nearby the restaurant(popular venues)
#     url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&\
#           query=food, drinks, coffee, shops, arts, outdoors, sights, trending'.format(
#         CLIENT_ID,
#         CLIENT_SECRET,
#         VERSION,
#         v_ll[0],
#         v_ll[1],
#         500,
#         10)
#     recommends = requests.get(url).json()
#     venues = recommends['response']['groups'][0]['items']
#     v_count_n = len(venues)
#     print(v_count_n,"recommends found")
#     df2.loc[i,"nearby_rec"]=v_count_n
#     print(df2.loc[i])

#     v_nearby_rt = 0.0
#     if v_count_n>0:
#         nearby_venues = json_normalize(venues)  # flatten JSON into dataframe
#         filtered_columns = ['venue.id','venue.name', 'venue.categories']
#         nearby_venues = nearby_venues.loc[:, filtered_columns]
#         nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
#         #to count the average rating near by this rest
#         nb=0
#         for j in range(v_count_n):
#             print(nearby_venues.loc[j])
#             r_id = nearby_venues.loc[j,"id"]
#             url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(
#                 r_id,
#                 CLIENT_ID,
#                 CLIENT_SECRET,
#                 VERSION)
#             rst= requests.get(url).json()
#             if not (rst): continue
#             try:
#                 rating = rst['response']['venue']['rating']
#                 if not (rating): rating=0
#                 else:nb+=1
#             except:
#                 rating=0
#             v_nearby_rt=v_nearby_rt+float(rating)
#         if not(nb):v_nearby_rt=0
#         else: v_nearby_rt=v_nearby_rt/nb
#     df2.loc[i,"avg_rate"] = v_nearby_rt
#     df2.to_csv("restaurant_data.csv",index=False)

df2=pd.read_csv("NY_restaurant_data.csv") # here we load part of the data as sample to have a look
df2.shape    

(35, 11)

In [5]:
df2.head(5)

Unnamed: 0,v_name,v_cat,avg_dist_2_neighborhood,cnt_near_neighborhood,restaurant_id,lat,lng,avg_rate,nearby_rec,rating,menu
0,Cooler Runnings Jamaican Restaurant Inc,Caribbean Restaurant,479.0,1,508af256e4b0578944c87392,40.898276,-73.850381,6.416667,10,6.5,
1,McDonald's,Fast Food Restaurant,904.0,1,4be5f0eacf200f47d1fa133c,40.902645,-73.849485,6.4,13,6.5,Big Mac??|Cheeseburger|Double Cheeseburger|Ham...
2,241 St Cafe & Restaurant,American Restaurant,1019.0,1,4c010e75cf3aa593825eccb0,40.903573,-73.850228,6.4,12,6.6,
3,Ripe Kitchen & Bar,Caribbean Restaurant,798.0,1,4d375ce799fe8eec99fd2355,40.898152,-73.838875,6.7,14,8.7,Cuban Plantain Boat|Jerk Chicken Quesadilla|St...
4,Townhouse Restaurant,Restaurant,218.0,1,4be2b79d660ec9284d04ca3b,40.876086,-73.828868,5.9,10,5.6,
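
The commented-out code above calls `get_menu_item`, which is not shown in this notebook. Below is a minimal sketch of what such a helper might look like. The nested layout it walks (each menu holding sections under `entries.items`, each section holding dishes under `entries.items`, each dish having a `name`) is an assumption about the menu response, not something the notebook confirms, and the sample data is made up.

```python
def get_menu_item(menus):
    """Collect dish names from a list of menu dicts.

    Assumed layout (not confirmed by the notebook):
    menu -> entries.items (sections) -> entries.items (dishes with a "name").
    """
    names = []
    for menu in menus:
        for section in menu.get("entries", {}).get("items", []):
            for dish in section.get("entries", {}).get("items", []):
                name = dish.get("name")
                if name:
                    names.append(name)
    return names

# Made-up sample following the assumed layout.
sample_menus = [{
    "name": "Main Menu",
    "entries": {"items": [{
        "name": "Burgers",
        "entries": {"items": [{"name": "Cheeseburger"},
                              {"name": "Double Cheeseburger"}]},
    }]},
}]
print("|".join(get_menu_item(sample_menus)))
```

Joining the names with "|" gives the same format as the `menu` column in the table above.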


We now have all the information about the restaurants and are almost ready to fit machine learning models. Let's define a data dictionary:

|Attribute Name|Data Description|Data Type|Potential Contribution|
|:-|-|-|-|
|v_name|Restaurant name|string|n/a|
|v_cat|Category|Foursquare category|certain categories may be especially popular and catch the eye more easily|
|avg_dist_2_neighborhood|Average distance from this restaurant to all its nearby neighborhoods|float|lower means closer to potential clients|
|cnt_near_neighborhood|Number of neighborhoods close to this restaurant, for example within 1 km|int|more means more potential clients|
|restaurant_id|Venue ID|Foursquare ID|n/a|
|lat|Latitude|float|n/a|
|lng|Longitude|float|n/a|
|avg_rate|Average rating of nearby popular sites (such as food, drinks, coffee, shops, arts, outdoors, etc.)|float|higher means a better chance of a stable client source|
|nearby_rec|Total number of popular sites recommended by Foursquare around the restaurant, with the restaurant as the centroid|int|higher means a better chance of a stable client source|
|menu|Special food offered|list of food items|text clustering is applied first, before use in the classification model|
|rating|The restaurant's rating (how good the restaurant is)|float/classification|the label or target value (y)|


Before training the model, let's take a quick look at summary statistics of the data:

In [8]:
df2.describe(include='all')

Unnamed: 0,v_name,v_cat,avg_dist_2_neighborhood,cnt_near_neighborhood,restaurant_id,lat,lng,avg_rate,nearby_rec,rating,menu
count,35,35,35.0,35.0,35,35.0,35.0,35.0,35.0,35.0,12
unique,35,19,,,35,,,,,,12
top,Irish Coffee Shop,Seafood Restaurant,,,4ba53e04f964a520a7f038e3,,,,,,Meatball Parmigiana|Sausage Parmigiana|Chicken...
freq,1,7,,,1,,,,,,1
mean,,,694.12381,1.257143,,40.878044,-73.847987,6.98619,14.228571,6.988571,
std,,,264.504109,0.505433,,0.020961,0.039913,0.777318,6.00042,0.923475,
min,,,137.0,1.0,,40.837673,-73.904301,5.8,10.0,5.2,
25%,,,501.5,1.0,,40.868684,-73.880813,6.4,10.0,6.25,
50%,,,746.0,1.0,,40.88262,-73.850381,6.9,11.0,6.8,
75%,,,882.5,1.0,,40.892583,-73.826776,7.55,17.0,7.7,


Note: the *"menu"* attribute needs additional processing, as we can't use it in a classification model directly. We need to vectorize it and cluster the menus to quantify them. This part will be introduced in the *Text Clustering* model in the next segment.
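
To give a feel for the vectorization step only, here is a minimal sketch that turns "|"-joined menu strings into 0/1 vectors, one column per distinct item. This is a deliberately simple encoding chosen for illustration; the Text Clustering segment may well use a different vectorizer. `menu_to_items` and `multi_hot` are hypothetical helper names, and the sample menus are shortened examples.

```python
def menu_to_items(menu):
    """Split a "|"-joined menu string into a list of lowercase item names."""
    if not isinstance(menu, str):
        return []  # missing menus show up as NaN/None
    return [item.strip().lower() for item in menu.split("|") if item.strip()]

def multi_hot(menus):
    """Return (vocabulary, vectors) with one 0/1 column per distinct item."""
    vocab = sorted({item for m in menus for item in menu_to_items(m)})
    index = {item: k for k, item in enumerate(vocab)}
    vectors = []
    for m in menus:
        row = [0] * len(vocab)
        for item in menu_to_items(m):
            row[index[item]] = 1
        vectors.append(row)
    return vocab, vectors

vocab, vectors = multi_hot(["Cheeseburger|Fries", None, "Fries|Jerk Chicken"])
print(vocab)
print(vectors)
```

Vectors of this kind could then be fed to a clustering algorithm such as the `KMeans` class imported at the top of the notebook.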