<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>The Battle of the Neighbourhoods</font></h1>

## Instructions


Now that you have been equipped with the skills and the tools to use location data to explore a geographical location, over the course of two weeks, you will have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve. If you cannot think of an idea or a problem, here are some ideas to get you started:

1. In Module 3, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. One interesting idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city? I will leave it to you to refine this idea.
2. In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they set up their office?

These are just a couple of the many ideas and problems that can be solved using location data in addition to other datasets. No matter what you decide to do, make sure to provide sufficient justification of why you think what you want to do or solve is important and why a client or a group of people would be interested in your project.

#### Review criteria
This capstone project will be graded by your peers. This capstone project is worth 70% of your total grade. The project will be completed over the course of 2 weeks. Week 1 submissions will be worth 30% whereas week 2 submissions will be worth 40% of your total grade.

For this week, you will be required to submit the following:

1. A description of the problem and a discussion of the background. (15 marks)
2. A description of the data and how it will be used to solve the problem. (15 marks)

For the second week, the final deliverables of the project will be:

1. A link to your Notebook on your Github repository, showing your code. (15 marks)
2. A full report consisting of all of the following components (15 marks):
 - Introduction where you discuss the business problem and who would be interested in this project.
 - Data where you describe the data that will be used to solve the problem and the source of the data.
 - Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.
 - Results section where you discuss the results.
 - Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
 - Conclusion section where you conclude the report.
3. Your choice of a presentation or blogpost. (10 marks)

Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

This submission will eventually become your Introduction/Business Problem section in your final report. So I recommend that you push the report (having your Introduction/Business Problem section only for now) to your Github repository and submit a link to it.

Describe the data that you will be using to solve the problem or execute your idea. Remember that you will need to use the Foursquare location data to solve the problem or execute your idea. You can absolutely use other datasets in combination with the Foursquare location data. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using, even if it is only Foursquare location data.

This submission will eventually become your Data section in your final report. So I recommend that you push the report (having your Data section) to your Github repository and submit a link to it.






## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Introduction</a>

2. <a href="#item2">Data</a>

3. <a href="#item3">Methodology</a>

4. <a href="#item4">Results</a>

5. <a href="#item5">Discussion</a>    

6. <a href="#item6">Conclusion</a>    
</font>
</div>

<a id='item1'></a>

## Introduction/Business Problem (Week 1 & 2)
A description of the problem and a discussion of the background.  Discuss the business problem and who would be interested in this project.

NYC is a multi-ethnic melting pot. The NYC food scene is vibrant, diverse and constantly changing. Restaurants change with each passing season. For someone who is new to NYC and homesick, it is difficult to find the neighbourhood whose cuisine is closest to home. Hence our target audience is people who are new to NYC and would like to find the neighbourhood whose food culture is most familiar compared with their home city. The project can also help foodies discover new neighbourhood cuisines.

The code prompts the user for their home city and displays it on a map so that the location can be confirmed visually. It then extracts the relevant restaurant information from Foursquare, uses K-means clustering to identify the cluster to which the home city belongs, and plots the NYC neighbourhoods in the same cluster on the map. The radius of each circle on the map matches the area from which the restaurant information is extracted. It can be changed via the `radius` setting below.
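
The final lookup step described above can be sketched without any clustering library. This is only a minimal illustration, assuming the cluster labels are already available as a plain dictionary. The function name `neighbourhoods_like_home` and the sample labels are hypothetical and are not part of the notebook.

```python
def neighbourhoods_like_home(cluster_labels, home):
    """Return the neighbourhoods that share a cluster label with `home`."""
    home_label = cluster_labels[home]
    return sorted(
        name for name, label in cluster_labels.items()
        if label == home_label and name != home
    )


# Hypothetical labels, shaped like the output of a clustering step.
sample_labels = {
    "Home City": 3,
    "Neighbourhood A": 3,
    "Neighbourhood B": 3,
    "Neighbourhood C": 7,
}
similar = neighbourhoods_like_home(sample_labels, "Home City")
print(similar)
```

The notebook performs the same filtering on a pandas dataframe; the dictionary form is used here only to keep the idea visible.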

<a id='item2'></a>

## Data (Week 1 & 2)
A description of the data and how it will be used to solve the problem.  Describe the data that will be used to solve the problem and the source of the data.

We plan to leverage Foursquare restaurant data, as well as its menu data, to identify the neighbourhoods whose cuisine is most similar to that of our home city. For each neighbourhood in the list, we extract the well-rated restaurants within a certain proximity of the center of the neighbourhood. We also extract the same information for the home city. We then use that data to identify which NYC neighbourhood's cuisine has the greatest similarity with the home city's.
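
One way to make "similarity" concrete is to compare the share of restaurants in each category. The sketch below is an illustration only: the notebook itself does not compute pairwise distances, but K-means groups points by Euclidean distance, so the same measure is used here. The function name `profile_distance` and the sample profiles are hypothetical.

```python
import math


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two category-frequency profiles.

    Categories missing from a profile count as frequency 0.
    """
    categories = set(profile_a) | set(profile_b)
    return math.sqrt(sum(
        (profile_a.get(c, 0.0) - profile_b.get(c, 0.0)) ** 2
        for c in categories
    ))


# Hypothetical profiles: share of restaurants per category.
home = {"Indian": 0.6, "Chinese": 0.2, "Pizza": 0.2}
area_1 = {"Indian": 0.5, "Chinese": 0.3, "Pizza": 0.2}
area_2 = {"Italian": 0.7, "Pizza": 0.3}

# A smaller distance means a more similar mix of restaurant categories.
print(profile_distance(home, area_1) < profile_distance(home, area_2))
```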

<a id='item3'></a>

## Methodology (Week 2)

##### The following user variables can be tweaked.

In [1]:
# set number of clusters.  If no similar neighbourhoods are identified within the same cluster as the home city, you can reduce the number of clusters and try again.
# Similarly, if too many results are returned, you can increase the number of clusters so as to get a better fit.
kclusters = 50

# radius of foursquare search.  This will apply to both the home city as well as NYC neighbourhoods. 
radius = 750

# limit of foursquare results, max 100
LIMIT = 100


In [None]:
# We first import the necessary libraries

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

#!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup # to extract information from websites

import lxml # library to handle xml parsing

from pathlib import Path

import os

print('Libraries imported.')


In [None]:
# Use your own foursquare login details below
CLIENT_ID = 'xxxx' # your Foursquare ID
CLIENT_SECRET = 'yyyy' # your Foursquare Secret
VERSION = '20180605'

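Hard-coding credentials in a shared notebook makes them easy to leak. As one possible alternative, the sketch below reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not something Foursquare or the course prescribes.

```python
import os


def load_foursquare_credentials(environ=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return (environ["FOURSQUARE_CLIENT_ID"],
                environ["FOURSQUARE_CLIENT_SECRET"])
    except KeyError as missing:
        raise RuntimeError(
            f"Environment variable {missing} is not set") from None


# Example with an explicit mapping instead of the real environment.
client_id, client_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "xxxx", "FOURSQUARE_CLIENT_SECRET": "yyyy"}
)
print(client_id)
```

In the notebook, the returned values would be assigned to `CLIENT_ID` and `CLIENT_SECRET`.
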
In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
# download dataset
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

In [None]:
# open dataset
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data) # parse the JSON file into a dictionary

In [None]:
# extract 'features' from dataset
neighborhoods_data = newyork_data['features']

In [None]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [None]:
# parse dataset to extract borough, name, latitude and longitude
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)

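The parsing step above can be tried on a single record without downloading the dataset. The sketch below assumes each feature carries `properties` and `geometry` keys as used in the cell above. The helper name `feature_to_row` and the sample feature (with rounded, illustrative coordinates) are hypothetical.

```python
def feature_to_row(feature):
    """Flatten one GeoJSON point feature into a row dictionary."""
    properties = feature["properties"]
    # GeoJSON stores coordinates as [longitude, latitude].
    longitude, latitude = feature["geometry"]["coordinates"][:2]
    return {
        "Borough": properties["borough"],
        "Neighborhood": properties["name"],
        "Latitude": latitude,
        "Longitude": longitude,
    }


# Hypothetical feature shaped like the entries in newyork_data['features'].
sample_feature = {
    "properties": {"borough": "Bronx", "name": "Wakefield"},
    "geometry": {"type": "Point", "coordinates": [-73.85, 40.89]},
}
row = feature_to_row(sample_feature)
print(row)
```
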
In [None]:
# get NYC latitude and longitude
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

In [None]:
# prompt user for home city, get latitude and longitude, and plot on map.

# prompt user for home city
target_address = input('Enter city, e.g. "Mumbai, India": ')

# get latitude and longitude
geolocator = Nominatim(user_agent="ny_explorer")
target_location = geolocator.geocode(target_address)
target_latitude = target_location.latitude
target_longitude = target_location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(target_address, target_latitude, target_longitude))

# create map of Target City using latitude and longitude values
map_targetcity = folium.Map(location=[target_latitude, target_longitude], zoom_start=15)

# add markers to map
label = '{}'.format(target_location)
label = folium.Popup(label, parse_html=True)
folium.Circle(
    [target_latitude, target_longitude],
    radius=radius,
    popup=label,
    color='blue',
    fill=True,
    fill_color='#3186cc',
    fill_opacity=0.7,
    parse_html=False).add_to(map_targetcity)  
    
map_targetcity

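geopy's `geocode` returns `None` when it cannot match the address, in which case reading `.latitude` in the cell above raises an `AttributeError`. The sketch below shows one way to turn that into a clearer error. The helper `resolve_coordinates` and the stand-in geocoder with approximate coordinates are hypothetical and exist only so the example runs without a network call.

```python
from types import SimpleNamespace


def resolve_coordinates(geocode, address):
    """Return (latitude, longitude) for `address` using a geocode callable.

    `geocode` is expected to behave like geopy's `Nominatim.geocode`,
    which returns None when no match is found.
    """
    location = geocode(address)
    if location is None:
        raise ValueError(
            f"Could not geocode {address!r}; try a more specific name")
    return location.latitude, location.longitude


def fake_geocode(address):
    """Stand-in geocoder with approximate values, for illustration only."""
    known = {"Mumbai, India": SimpleNamespace(latitude=19.08, longitude=72.88)}
    return known.get(address)


print(resolve_coordinates(fake_geocode, "Mumbai, India"))
```

With a real geocoder, the call would be `resolve_coordinates(geolocator.geocode, target_address)`.
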
In [None]:
# append home city to list of neighbourhoods
# (despite its name, manhattan_data covers all NYC neighbourhoods in the dataset)
home_row = pd.DataFrame([{'Borough':target_address, 'Neighborhood':target_address, 'Latitude':target_latitude, 'Longitude':target_longitude}])
manhattan_data = pd.concat([neighborhoods, home_row], ignore_index=True)

In [None]:
# define function to extract restaurants from Foursquare for the given latitudes, longitudes and radius
def getNearbyVenues(names, latitudes, longitudes, radius=radius):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?section=FOOD&client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            lat, 
            lng, 
            VERSION, 
            radius, 
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['shortName'],
            v['venue']['id']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue',
                  'Venue Category',          
                  'Venue ID']
    
    return nearby_venues

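The request URL in `getNearbyVenues` is assembled with string formatting. As a hedged alternative, the sketch below builds the same query from a dictionary so that each value is URL-encoded. The endpoint and parameter names are taken from the cell above, while the helper name `build_explore_url` is hypothetical.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      version, radius, limit):
    """Build the explore URL with encoded query parameters."""
    params = {
        "section": "FOOD",
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


url = build_explore_url("xxxx", "yyyy", 40.7, -74.0, "20180605", 750, 100)
print(url)
```

When using `requests`, passing the same dictionary as `requests.get(EXPLORE_ENDPOINT, params=params)` achieves the same encoding.
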
In [None]:
# get restaurant data for all neighbourhoods from Foursquare
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

In [None]:
# convert restaurant data to one-hot encoding

# one hot encoding
onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]


In [None]:
# group rows by neighbourhood and take the mean frequency of each restaurant category
manhattan_grouped = onehot.groupby('Neighborhood').mean().reset_index()

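Taking the mean of one-hot columns per neighbourhood is equivalent to computing the share of venues in each category. The dependency-free sketch below illustrates this on a few made-up rows. The helper `category_frequencies` and the sample data are hypothetical, and unlike the dataframe version it omits categories with a share of zero.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Map each neighbourhood to the share of its venues in each category.

    `venues` is an iterable of (neighbourhood, category) pairs.
    """
    counts = defaultdict(Counter)
    for neighbourhood, category in venues:
        counts[neighbourhood][category] += 1
    return {
        neighbourhood: {
            category: n / sum(counter.values())
            for category, n in counter.items()
        }
        for neighbourhood, counter in counts.items()
    }


# Hypothetical venue rows.
sample = [
    ("Area 1", "Pizza"), ("Area 1", "Pizza"),
    ("Area 1", "Sushi"), ("Area 1", "Deli"),
    ("Area 2", "Sushi"),
]
frequencies = category_frequencies(sample)
print(frequencies["Area 1"])
```
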
In [None]:
# define function to return most common restaurants
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
# show top 10 most common restaurants for each neighbourhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)


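The column names above rely on a `try`/`except` around a three-item suffix list, which is sufficient for ten columns. A small helper that computes the suffix directly is sketched below as an alternative. The function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Build the same column names as the cell above.
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 11)
]
print(columns[1], columns[4], columns[10], sep=" | ")
```
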
In [None]:
# use K-Means clustering to cluster neighbourhoods based on the frequency of each restaurant category
manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)



In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# merge manhattan_data with the sorted venue table to add latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')



In [None]:
# find the cluster of the home city and keep only the neighbourhoods in that cluster
Cluster = manhattan_merged.loc[manhattan_merged['Borough'] == target_address, 'Cluster Labels'].iloc[0]

manhattan_merged = manhattan_merged.drop(manhattan_merged[manhattan_merged['Cluster Labels'] != Cluster].index)


<a id='item4'></a>

## Results (Week 2)

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters+1)
ys = [i + x + (i*x)**2 for i in range(kclusters+1)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):

    # neighbourhoods for which no data was returned have a NaN cluster label; allocate them to cluster 6
    if pd.isna(cluster):
        cluster = 6
    else:
        cluster = int(cluster)
    
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.Circle(
        [lat, lon],
        radius=radius,
        height=800,
        width=600,
        popup=label,
        #color=rainbow[cluster-1],
        color='red',
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

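The notebook derives its palette from matplotlib's `cm.rainbow`. As a swapped-in alternative that needs only the standard library, the sketch below spaces hues evenly around the colour wheel. The helper `cluster_colours` and the chosen saturation and brightness values are assumptions made for this example.

```python
import colorsys


def cluster_colours(n_clusters):
    """Return one hex colour per cluster, evenly spaced around the hue wheel."""
    colours = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        colours.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colours


palette = cluster_colours(5)
print(palette)
```

The resulting strings can be passed to folium wherever a hex colour such as `fill_color` is expected.
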
In [None]:
# show list of neighbourhoods in the same cluster as the home city
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == Cluster, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

<a id='item5'></a>

## Discussion (Week 2)

<a id='item6'></a>

## Conclusion (Week 2)

### Thank you for completing this lab!

This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson/) and [Polong Lin](https://www.linkedin.com/in/polonglin/). I hope you found this lab interesting and educational. Feel free to contact us if you have any questions!

This notebook is part of a course on **Coursera** called *Applied Data Science Capstone*. If you accessed this notebook outside the course, you can take this course online by clicking [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2).

<hr>

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).