# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Analyze and Cluster Indian Restaurants around Boston, Massachusetts, USA


## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project, we try to find optimal locations for a new **Indian** restaurant in the neighborhood of **Boston**, in the state of **Massachusetts, USA**.

There is a large Asian-Indian population around Boston. It consists mostly of people employed in various sectors in Massachusetts (government, finance, healthcare, hospitals and colleges, to name a few), as well as students at prestigious colleges in the Boston area. Most of them live in the city of Boston or in suburban towns that are well connected to Boston by the subway, commuter rail and bus systems. Indian restaurants are in high demand, especially in areas with a larger concentration of Indian residents.

*We use the term Asian-Indians because "Indians" commonly refers to American Indians, the native tribes of America.*

We will identify the locations of existing Indian restaurants in the area around Boston. This data will be used to find locations that have a lower concentration of Indian restaurants.

We will use our data science powers to cluster the Indian restaurants around the area and locate places that have a lower concentration.

**Note: The project can be further expanded by clustering the Asian-Indian population in Massachusetts by city and plotting the population clusters against the restaurant clusters. This would lead to a much more accurate prediction. It cannot be done at this time because population data is not available.**

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing Indian restaurants in the neighborhood of Boston.
* location (latitude and longitude) of Indian restaurants.
* concentration of Asian-Indians in Boston and the towns around Boston. (dataset unavailable, hence not used)


The following data sources will be needed to extract/generate the required information:
* Geo coordinates of the cities Boston, Framingham and Braintree will be obtained using **Geopy Nominatim**
* number of restaurants and their type and location in every neighborhood will be obtained using **Foursquare API**
<br><br>
**Note: Since the Foursquare venues search API returns only 50 results per search, we use 2 more search queries centered on Framingham and Braintree, on the periphery of Boston, to get more coverage.**

### Import necessary Libraries

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transform JSON into a pandas dataframe (pandas.io.json.json_normalize is deprecated since pandas 1.0)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Folium and geopy installed')
print('Libraries imported.')

<a id="item1"></a>

## Analyze and Cluster Indian Restaurants around Boston

### Get the geographical coordinates of Boston, Framingham and Braintree using geopy

In [2]:
address = 'Boston, MA'

geolocator = Nominatim(user_agent="boston_explorer")
location = geolocator.geocode(address)
latitude_bos = location.latitude
longitude_bos = location.longitude
print('The geographical coordinates of Boston are {}, {}.'.format(latitude_bos, longitude_bos))

The geographical coordinates of Boston are 42.3602534, -71.0582912.


In [3]:
address = 'Framingham, MA'

geolocator = Nominatim(user_agent="boston_explorer")
location = geolocator.geocode(address)
latitude_frm = location.latitude
longitude_frm = location.longitude
print('The geographical coordinates of Framingham are {}, {}.'.format(latitude_frm, longitude_frm))

The geographical coordinates of Framingham are 42.2792625, -71.416172.


In [4]:
address = 'Braintree, MA'

geolocator = Nominatim(user_agent="boston_explorer")
location = geolocator.geocode(address)
latitude_brn = location.latitude
longitude_brn = location.longitude
print('The geographical coordinates of Braintree are {}, {}.'.format(latitude_brn, longitude_brn))

The geographical coordinates of Braintree are 42.2064195, -71.005067.
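
The three cells above repeat the same lookup with a different address. As a minimal sketch, the repetition could be folded into one helper that accepts any geocoding callable, such as `Nominatim(user_agent="boston_explorer").geocode`. The name `geocode_all` is hypothetical and not part of the original notebook.

```python
from typing import Callable, Dict, Iterable, Tuple


def geocode_all(addresses: Iterable[str], geocode: Callable) -> Dict[str, Tuple[float, float]]:
    """Return {address: (latitude, longitude)} using the supplied geocode callable."""
    coords = {}
    for address in addresses:
        location = geocode(address)
        if location is None:  # geopy returns None when nothing matches
            raise ValueError(f"Could not geocode {address!r}")
        coords[address] = (location.latitude, location.longitude)
    return coords


# Usage with geopy (requires network access):
#   from geopy.geocoders import Nominatim
#   geolocator = Nominatim(user_agent="boston_explorer")
#   centers = geocode_all(['Boston, MA', 'Framingham, MA', 'Braintree, MA'], geolocator.geocode)
```

Passing the callable in, rather than creating the geolocator inside the helper, keeps the function easy to exercise without network access.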


### Define Foursquare Credentials and Version

In [5]:
CLIENT_ID = 'B0M1MMLFMTIW1AHTNHNR2J2YCK0SPP2WVACTOPWBMFVNQZAY' # your Foursquare ID
CLIENT_SECRET = '0SUOUWMKLO3E45HD3ZCVAZDQ0X0EX55CYA1V1VXVWG2I0PTI' # your Foursquare Secret
VERSION = '20200605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: B0M1MMLFMTIW1AHTNHNR2J2YCK0SPP2WVACTOPWBMFVNQZAY
CLIENT_SECRET:0SUOUWMKLO3E45HD3ZCVAZDQ0X0EX55CYA1V1VXVWG2I0PTI


## Search for Indian Restaurants around Boston, Framingham and Braintree locations

Use the Foursquare API venues/search option <br>
search_query = 'Indian'<br>
category = '4d4b7105d754a06374d81259' # Food <br>
Refer to http://developer.foursquare.com/docs/api-reference/venues/search/ for documentation on Foursquare venues/search
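
The search parameters listed above are the same for every search center; only the coordinates change. The following is a sketch of how the request URL could be assembled once with `urllib.parse.urlencode` instead of a long format string. The function name `build_search_url`, the placeholder credentials and the default `limit=50` are assumptions for illustration.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'


def build_search_url(client_id, client_secret, version, lat, lng,
                     query='Indian', category_id='4d4b7105d754a06374d81259',
                     radius=100000, limit=50):
    """Build a venues/search URL; urlencode escapes every parameter value."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'query': query,
        'categoryId': category_id,
        'll': f'{lat},{lng}',
        'radius': radius,
        'limit': limit,
    }
    return f'{SEARCH_ENDPOINT}?{urlencode(params)}'


# Placeholder credentials; substitute your own Foursquare ID and secret
url = build_search_url('YOUR_ID', 'YOUR_SECRET', '20200605', 42.3602534, -71.0582912)
```

With a helper like this, the Boston, Framingham and Braintree searches differ only in the `lat` and `lng` arguments.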

In [6]:
# Foursquare venues->search -> Boston, MA

LIMIT = 100 # number of venues requested (the Foursquare API returns at most 50 responses)
search_query = 'Indian'
radius = 100000 # search radius in meters
category = '4d4b7105d754a06374d81259' # Food

# create URL
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&query={}&categoryId={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    search_query,
    category,
    latitude_bos, 
    longitude_bos,
    radius, 
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/search?&client_id=B0M1MMLFMTIW1AHTNHNR2J2YCK0SPP2WVACTOPWBMFVNQZAY&client_secret=0SUOUWMKLO3E45HD3ZCVAZDQ0X0EX55CYA1V1VXVWG2I0PTI&v=20200605&query=Indian&categoryId=4d4b7105d754a06374d81259&ll=42.3602534,-71.0582912&radius=100000&limit=100'

In [7]:
# get request for results
results = requests.get(url).json()
results  # display results

{'meta': {'code': 200, 'requestId': '5f3b606099578521cca33ded'},
 'response': {'venues': [{'id': '4cadebd2d1f8b60c19f372c6',
    'name': 'Indian Entrees',
    'location': {'address': 'Winter St',
     'lat': 42.35520589,
     'lng': -71.0600875,
     'labeledLatLngs': [{'label': 'display',
       'lat': 42.35520589,
       'lng': -71.0600875}],
     'distance': 580,
     'postalCode': '02110',
     'cc': 'US',
     'city': 'Boston',
     'state': 'MA',
     'country': 'United States',
     'formattedAddress': ['Winter St', 'Boston, MA 02110', 'United States']},
    'categories': [{'id': '4bf58dd8d48988d10f941735',
      'name': 'Indian Restaurant',
      'pluralName': 'Indian Restaurants',
      'shortName': 'Indian',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/indian_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1597727112',
    'hasPerk': False},
   {'id': '571e47bc498e219562bd3de8',
    'name': 'Divine Indian Food',
    'locatio

#### Get relevant part of JSON and transform it into a *pandas* dataframe

In [8]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_bos = json_normalize(venues)

# check the number of results returned
df_bos.shape

(50, 24)
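
`json_normalize` flattens the nested venue records into dotted column names, which is why the coordinates appear later as `location.lat` and `location.lng`. The sketch below illustrates the idea in plain Python. It is not pandas' implementation, and the helper name `flatten` is hypothetical. The sample record reuses a few fields from the Boston response above.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into a single dict with dotted keys (lists are left as-is)."""
    flat = {}
    for key, value in record.items():
        full_key = f'{parent_key}{sep}{key}' if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dicts, carrying the prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


venue = {
    'id': '4cadebd2d1f8b60c19f372c6',
    'name': 'Indian Entrees',
    'location': {'lat': 42.35520589, 'lng': -71.0600875, 'city': 'Boston'},
}
row = flatten(venue)
print(sorted(row))
```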

In [9]:
# Foursquare venues->search -> Framingham, MA

LIMIT = 100 # number of venues requested (the Foursquare API returns at most 50 responses)
search_query = 'Indian'
radius = 100000 # search radius in meters
category = '4d4b7105d754a06374d81259' # Food

# create URL
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&query={}&categoryId={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    search_query,
    category,
    latitude_frm, 
    longitude_frm,
    radius, 
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/search?&client_id=B0M1MMLFMTIW1AHTNHNR2J2YCK0SPP2WVACTOPWBMFVNQZAY&client_secret=0SUOUWMKLO3E45HD3ZCVAZDQ0X0EX55CYA1V1VXVWG2I0PTI&v=20200605&query=Indian&categoryId=4d4b7105d754a06374d81259&ll=42.2792625,-71.416172&radius=100000&limit=100'

In [10]:
# get request for results
results = requests.get(url).json()

# display results
results

{'meta': {'code': 200, 'requestId': '5f3b614d2f553b5623322f06'},
 'response': {'venues': [{'id': '4bba80c8b35776b0e10acb01',
    'name': 'Welcome Fine Indian Cuisine',
    'location': {'address': '770 Worcester Rd',
     'crossStreet': 'Curve St',
     'lat': 42.299047139676794,
     'lng': -71.42849553150079,
     'labeledLatLngs': [{'label': 'display',
       'lat': 42.299047139676794,
       'lng': -71.42849553150079},
      {'label': 'entrance', 'lat': 42.298953, 'lng': -71.42847}],
     'distance': 2424,
     'postalCode': '01702',
     'cc': 'US',
     'city': 'Framingham',
     'state': 'MA',
     'country': 'United States',
     'formattedAddress': ['770 Worcester Rd (Curve St)',
      'Framingham, MA 01702',
      'United States']},
    'categories': [{'id': '4bf58dd8d48988d10f941735',
      'name': 'Indian Restaurant',
      'pluralName': 'Indian Restaurants',
      'shortName': 'Indian',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/indian_',
       '

In [11]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_frm = json_normalize(venues)

# check the number of results returned
df_frm.shape

(50, 24)

In [12]:
# Foursquare venues->search -> Braintree, MA

LIMIT = 100 # number of venues requested (the Foursquare API returns at most 50 responses)
search_query = 'Indian'
radius = 100000 # search radius in meters
category = '4d4b7105d754a06374d81259' # Food

# create URL
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&query={}&categoryId={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    search_query,
    category,
    latitude_brn, 
    longitude_brn,
    radius, 
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/search?&client_id=B0M1MMLFMTIW1AHTNHNR2J2YCK0SPP2WVACTOPWBMFVNQZAY&client_secret=0SUOUWMKLO3E45HD3ZCVAZDQ0X0EX55CYA1V1VXVWG2I0PTI&v=20200605&query=Indian&categoryId=4d4b7105d754a06374d81259&ll=42.2064195,-71.005067&radius=100000&limit=100'

In [13]:
# get request for results
results = requests.get(url).json()
results   # display results

{'meta': {'code': 200, 'requestId': '5f3b619ed602790ae6c5bbcd'},
 'response': {'venues': [{'id': '4ab8fc16f964a520857d20e3',
    'name': 'Indian Delight',
    'location': {'address': '428 Washington St',
     'lat': 42.20921539638786,
     'lng': -70.95807252495109,
     'labeledLatLngs': [{'label': 'display',
       'lat': 42.20921539638786,
       'lng': -70.95807252495109},
      {'label': 'entrance', 'lat': 42.209267, 'lng': -70.958204}],
     'distance': 3887,
     'postalCode': '02188',
     'cc': 'US',
     'city': 'Weymouth',
     'state': 'MA',
     'country': 'United States',
     'formattedAddress': ['428 Washington St',
      'Weymouth, MA 02188',
      'United States']},
    'categories': [{'id': '4bf58dd8d48988d10f941735',
      'name': 'Indian Restaurant',
      'pluralName': 'Indian Restaurants',
      'shortName': 'Indian',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/indian_',
       'suffix': '.png'},
      'primary': True}],
    'referralId'

In [14]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_brn = json_normalize(venues)
df_brn.shape

(50, 24)

Append the restaurant search results for Boston, Framingham and Braintree, then remove the duplicate entries caused by overlapping searches.

In [25]:
# Concatenate the dataframes (DataFrame.append was removed in pandas 2.0)
df_append = pd.concat([df_bos, df_frm, df_brn])

# Remove duplicates
df_bos_frm_brn = df_append.drop_duplicates(subset=['id'])
print("No of unique Indian restaurants identified around Boston : " + str(df_bos_frm_brn.shape[0]))

No of unique Indian restaurants identified around Boston : 67
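
Because the three search circles overlap, the same venue can be returned by more than one search, and the Foursquare `id` is what identifies a repeat. As an alternative sketch, the same de-duplication can be done on the raw venue lists before they are normalized. The helper name `merge_unique_venues` is hypothetical, and the overlap in the sample lists is assumed purely for illustration.

```python
def merge_unique_venues(*venue_lists):
    """Concatenate venue lists, keeping the first occurrence of each venue id."""
    seen = set()
    unique = []
    for venues in venue_lists:
        for venue in venues:
            if venue['id'] not in seen:
                seen.add(venue['id'])
                unique.append(venue)
    return unique


# Illustrative sample: the overlap between the two searches is assumed
boston = [{'id': '4cadebd2d1f8b60c19f372c6', 'name': 'Indian Entrees'}]
framingham = [
    {'id': '4bba80c8b35776b0e10acb01', 'name': 'Welcome Fine Indian Cuisine'},
    {'id': '4cadebd2d1f8b60c19f372c6', 'name': 'Indian Entrees'},
]
unique = merge_unique_venues(boston, framingham)
print(len(unique))
```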


Select only the relevant information from the results and check the results.

In [27]:
# .copy() avoids a SettingWithCopyWarning when the cluster labels are inserted later
df_rest = df_bos_frm_brn[['name','location.address','location.city','location.state','location.lat','location.lng']].copy()

AttributeError: 'NoneType' object has no attribute 'items'

                                            name         location.address  \
0                                 Indian Entrees                Winter St   
1                             Divine Indian Food           187 Devonshire   
2                Surya Indian Kitchen N Catering          114 Magazine St   
3                        Indian Chili Restaurant     100 Cambridgeside Pl   
4                      Tanjore Indian Restaurant              18 Eliot St   
..                                           ...                      ...   
42                      Mayuri Indian Restaurant             5 Nagog Park   
44                      Swagat Indian Restaurant                      NaN   
47                     Tandori Indian Restaurant                      NaN   
8   Fishtail Kitchen - Indian & Nepalese Cuisine              532 Pond St   
9                       South Shore India Market  226 Quincy Ave (Rte 53)   

   location.city location.state  location.lat  location.lng  
0         Bos

We have gathered the list of Indian restaurants around the Boston area along with their location details. This concludes the data gathering phase.

## Methodology <a name="methodology"></a>

Now that we have the location details for all Indian restaurants around Boston, we will cluster them using the K-Means clustering algorithm. This gives an idea of the concentration of Indian restaurants around the Boston area and helps identify locations with a lower density of Indian restaurants. Combined with the population density of Asian-Indians in each area, once such data is available, the clusters can indicate good locations to open a new Indian restaurant.
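
K-Means measures Euclidean distance, so clustering raw coordinates treats a degree of longitude as equal to a degree of latitude, even though at Boston's latitude a degree of longitude covers roughly three quarters of the distance. One possible refinement, which is not part of the original analysis, is to project the coordinates to approximate kilometre offsets from Boston before clustering. The sketch below uses an equirectangular approximation. The name `to_local_km` and the mean Earth radius of 6371 km are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def to_local_km(lat, lng, lat0, lng0):
    """Approximate (east, north) offsets in km from (lat0, lng0); adequate for small regions."""
    east = math.radians(lng - lng0) * EARTH_RADIUS_KM * math.cos(math.radians(lat0))
    north = math.radians(lat - lat0) * EARTH_RADIUS_KM
    return east, north


# Framingham relative to Boston, using the coordinates printed earlier
east, north = to_local_km(42.2792625, -71.416172, 42.3602534, -71.0582912)
print(round(east, 1), round(north, 1))
```

Negative values mean west and south of the reference point. For an area of this size the effect on the clusters is likely modest, so this is an optional step.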

In [30]:
# columns keep Foursquare's dotted names, so use df_rest['location.lat'] rather than df_rest.lat
df_rest['location.lat'].head()

#### Visualize the Indian restaurants around Boston on a map using Folium

In [36]:
venues_map = folium.Map(location=[latitude_bos, longitude_bos], zoom_start=10) # generate map centred around Boston

# add a red circle marker to Boston
folium.features.CircleMarker(
    [latitude_bos, longitude_bos],
    radius=10,
    color='red',
    popup='Boston',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the Indian restaurants as blue circle markers
for lat, lng, label in zip(df_rest['location.lat'], df_rest['location.lng'], df_rest['location.city']):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map

#### Cluster Neighborhoods using K-Means clustering algorithm

Create a new dataframe with the latitudes and longitudes of the locations to be used for clustering

In [37]:
df_lat_lng = df_bos_frm_brn[['location.lat', 'location.lng']]

**K-Means clustering**

In [38]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_lat_lng)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:67] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 2, 0, 0, 0, 0, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       3, 0, 2, 0, 2, 2, 2, 2, 2, 4, 2, 1, 1, 2, 1, 0, 2, 2, 2, 4, 4, 3,
       3], dtype=int32)
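
A quick way to read the label array above is to count how many restaurants landed in each cluster; the sparsest clusters are the ones worth a closer look. In this sketch `labels` is copied by hand from the printed output. In the notebook itself one would pass `kmeans.labels_.tolist()` instead.

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output above
labels = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 2, 0, 0, 0, 0, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    3, 0, 2, 0, 2, 2, 2, 2, 2, 4, 2, 1, 1, 2, 1, 0, 2, 2, 2, 4, 4, 3,
    3,
]

sizes = Counter(labels)

# list clusters from least to most populated
for cluster, count in sorted(sizes.items(), key=lambda item: item[1]):
    print(f'Cluster {cluster}: {count} restaurants')
```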

Let's add cluster labels to the dataframe df_rest

In [39]:
# add clustering labels 
df_rest.insert(0, 'Cluster Labels', kmeans.labels_)

Finally, let's visualize the resulting clusters

In [41]:
# create map
map_clusters = folium.Map(location=[latitude_bos, longitude_bos], zoom_start=10)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(df_rest['location.lat'], df_rest['location.lng'], df_rest['location.city'], df_rest['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results <a name="results"></a>

Our analysis of the cluster map shows that there is a greater number of Indian restaurants in the Boston, Woburn, Framingham, Westborough and Quincy areas. This makes sense, as these towns have a larger Asian-Indian population.