# Data Science Capstone Project

## Mohammad Baig - Final (Week 4 + 5)

### Battle of Neighbourhoods: Restaurant Recommendation

## Introduction / Business Problem

### Location is one of the most important factors when starting a new business or setting up a new office or shop. In this assignment, we will try to build a recommendation engine that a businessman could use to look over the available locations for his new business.

### Use case: A restaurant owner wants to set up a new fast-food corner in the best location, where most of the crowd lives, in Pittsburgh, Pennsylvania, USA. He will ask his friends and business partners for recommendations. He plans to start his new business in an area where most college graduates stay, and he will rely on an advisor app to look for the better places to live. The happiest, most relaxed places are the ones young people look to for accommodation.

### Assumption:- The businessman already has a couple of friends or relatives in Allegheny County.

### Is it the booming businesses? The good food? The top-rated schools? The diversity?

## Data

### Now that we have understood the business requirements, it is time to gather and analyze data. We will use Pennsylvania municipalities data to help our restaurant owner find the best neighborhood. The data was collected from OpenData by Socrata (opendata.socrata.com), and we performed a preliminary analysis to understand it better and prepare it for modelling. Once the required data is collected, the next step is to clean it: removing null values and duplicate rows, and wrangling and formatting the data (standardization). Once the data is standardized, it is ready to be processed. We will look up the coordinates of each location with a geocoding library (geopy) and fetch nearby venues from the Foursquare API. Foursquare has no data for a few of the locations; we drop those rows as they are not useful for our analysis.
### Based on the venues Foursquare returns for each neighbourhood, we find the top ten amenities available at each location and extract those as features. We then cluster the neighbourhoods that have similar characteristics. Since we know the user's preferences, these clusters can be analyzed and recommended for living!


## Let's Explore Data

In [1]:
!conda install -c conda-forge geopy --yes
!pip install geocoder
!pip install folium

import csv
from urllib.request import urlopen

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import geocoder
import geopy
import tqdm
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.0.2p             |       h470a237_1         3.1 MB  conda-forge
    certifi-2018.10.15         |        py36_1000         138 KB  conda-forge
    geopy-1.17.0               |             py_0          49 KB  conda-forge
    ca-certificates-2018.10.15 |       ha4d7672_0         135 KB  conda-forge
    conda-4.5.11               |        py36_1000         651 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.1 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.49-py_0            conda-forge
    geopy:           

### Now Download Data for analysis

In [2]:
!wget -O Pennsylvania_Municipalities.csv https://opendata.socrata.com/api/views/vr4q-nrmd/rows.csv?accessType=DOWNLOAD
print("Data source downloaded")

--2018-10-17 07:03:19--  https://opendata.socrata.com/api/views/vr4q-nrmd/rows.csv?accessType=DOWNLOAD
Resolving opendata.socrata.com (opendata.socrata.com)... 52.206.140.199, 52.206.140.205, 52.206.68.26
Connecting to opendata.socrata.com (opendata.socrata.com)|52.206.140.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘Pennsylvania_Municipalities.csv’

Pennsylvania_Munici     [ <=>                  ]  42.14K  --.-KB/s   in 0.04s  

Last-modified header invalid -- time-stamp ignored.
2018-10-17 07:03:19 (1.16 MB/s) - ‘Pennsylvania_Municipalities.csv’ saved [43151]

Data source downloaded


In [3]:
df = pd.read_csv("Pennsylvania_Municipalities.csv")
df.head()

Unnamed: 0,Municipality,Liquor Ban,Beer Ban,Distributor Ban,State Store Ban
0,"Adams Township, Butler County",Allowed,1935,Allowed,Allowed
1,"Adams Township, Snyder County",1955,1955,Allowed,Allowed
2,"Akron Borough, Lancaster County",1953,Allowed,Allowed,Allowed
3,"Aldan Borough, Delaware County",1941,1941,Allowed,Allowed
4,"Alexandria Borough, Huntingdon County",Allowed,1934,Allowed,Allowed


### Keep only required data

In [4]:
penn_data = df[["Municipality"]]
penn_data.head()

Unnamed: 0,Municipality
0,"Adams Township, Butler County"
1,"Adams Township, Snyder County"
2,"Akron Borough, Lancaster County"
3,"Aldan Borough, Delaware County"
4,"Alexandria Borough, Huntingdon County"


In [5]:
# Work on an explicit copy so the column assignment does not trigger SettingWithCopyWarning
penn_data = penn_data.copy()
penn_data[["Borough", "County"]] = penn_data['Municipality'].str.split(',', expand=True)
penn_data = penn_data[["Borough", "County"]]
penn_data.head()


Unnamed: 0,Borough,County
0,Adams Township,Butler County
1,Adams Township,Snyder County
2,Akron Borough,Lancaster County
3,Aldan Borough,Delaware County
4,Alexandria Borough,Huntingdon County


In [6]:
len(penn_data)

684

### Drop Duplicate Entries

In [7]:
penn_data.drop_duplicates(subset=['Borough'], inplace=True)

In [8]:
penn_data.shape

(578, 2)
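One thing worth noting about the deduplication key: `drop_duplicates(subset=['Borough'])` compares the borough name only, so municipalities that share a name but sit in different counties (such as the two Adams Townships in the sample rows above) collapse into a single row. The sketch below contrasts that with deduplicating on the borough/county pair. The toy frame is made up for illustration and is not the full dataset.

```python
import pandas as pd

# Toy rows: two same-named townships in different counties, plus one true duplicate
toy = pd.DataFrame({
    "Borough": ["Adams Township", "Adams Township", "Akron Borough", "Akron Borough"],
    "County": ["Butler County", "Snyder County", "Lancaster County", "Lancaster County"],
})

# Deduplicate on the name alone: the Snyder County township is dropped
by_name = toy.drop_duplicates(subset=["Borough"])

# Deduplicate on the (name, county) pair: only the true duplicate is dropped
by_pair = toy.drop_duplicates(subset=["Borough", "County"])

print(len(by_name))  # 2
print(len(by_pair))  # 3
```

Which key is appropriate depends on whether same-named municipalities should count as distinct locations in the analysis.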

In [9]:
penn_data.describe()

Unnamed: 0,Borough,County
count,578,578
unique,578,55
top,Potter Township,York County
freq,1,40


In [10]:
penn_data_grp = penn_data.groupby(['County'], sort=False)['Borough'].apply(','.join).reset_index()
penn_data_grp.head()

Unnamed: 0,County,Borough
0,Butler County,"Adams Township,Allegheny Township,Brady Townsh..."
1,Lancaster County,"Akron Borough,Caernarvon Township,Drumore Town..."
2,Delaware County,"Aldan Borough,Bethel Township,Brookhaven Borou..."
3,Huntingdon County,"Alexandria Borough,Barree Township,Cass Townsh..."
4,Lycoming County,"Anthony Township,Cogan House Township,Gamble T..."


In [11]:
penn_data_grp.describe()

Unnamed: 0,County,Borough
count,55,55
unique,55,55
top,Erie County,"Center Township,Freeport Township,Gray Townshi..."
freq,1,1


In [12]:
column_names = ['Latitude', 'Longitude'] 
n_hood = pd.DataFrame(columns=column_names)
n_hood.shape

(0, 2)

In [13]:
from geopy.exc import GeocoderTimedOut
from time import sleep

# One shared geocoder; Nominatim expects an identifying user_agent (the value here is a placeholder)
geolocator = Nominatim(user_agent="penn_neighbourhood_explorer")

def do_geocode(address):
    """Geocode an address, retrying after a short pause if the request times out."""
    try:
        return geolocator.geocode(address)
    except GeocoderTimedOut:
        sleep(1)
        return do_geocode(address)


coords = []
for index, row in penn_data.iterrows():
    try:
        address = row['Borough'] + "," + row['County'] + "," + "Pennsylvania" + "," + "USA"
        location = do_geocode(address)
        if location is None:
            # Borough not found: fall back to the county's coordinates
            location = do_geocode(row['County'] + "," + "Pennsylvania,USA")
        coords.append({'Latitude': location.latitude, 'Longitude': location.longitude})
    except ValueError as error_message:
        print("Error")

# Build the frame once instead of appending row by row (DataFrame.append was removed in pandas 2.0)
n_hood = pd.DataFrame(coords, columns=column_names)


In [14]:
n_hood.head()

Unnamed: 0,Latitude,Longitude
0,40.717358,-80.012873
1,40.164064,-76.205099
2,39.919412,-75.400164
3,40.344633,-78.028119
4,41.138114,-79.738746


### Append Latitude and Longitude data

In [15]:
penn_data_geo = pd.concat([penn_data, n_hood[['Latitude', 'Longitude']]], axis=1)
penn_data_geo.shape

(657, 4)
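`pd.concat(..., axis=1)` aligns rows on index labels, not on position. After `drop_duplicates`, `penn_data` keeps its original, gapped index while `n_hood` is numbered from 0, which would explain why the row count above exceeds 578 and why the next output contains a row with no borough. A minimal sketch on toy data (the values are made up) shows the effect and one way to align by position instead, assuming there is exactly one coordinate row per input row, in the same order:

```python
import pandas as pd

# Gapped index, as left behind by drop_duplicates
names = pd.DataFrame({"Borough": ["A", "C", "D"]}, index=[0, 2, 3])
# Fresh 0..2 index, as produced by building the coordinates separately
coords = pd.DataFrame({"Latitude": [40.1, 40.2, 40.3]})

# Label alignment: the union of both indexes is used, so rows shift and NaNs appear
misaligned = pd.concat([names, coords], axis=1)

# Positional alignment: reset the gapped index first
aligned = pd.concat([names.reset_index(drop=True), coords], axis=1)

print(misaligned.shape)  # (4, 2)
print(aligned.shape)     # (3, 2)
```

In the misaligned frame, borough "C" ends up next to the latitude that belongs to "D", which is the kind of silent shift worth checking for before mapping the results.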

In [16]:
penn_data_geo.head()

Unnamed: 0,Borough,County,Latitude,Longitude
0,Adams Township,Butler County,40.717358,-80.012873
1,,,40.164064,-76.205099
2,Akron Borough,Lancaster County,39.919412,-75.400164
3,Aldan Borough,Delaware County,40.344633,-78.028119
4,Alexandria Borough,Huntingdon County,41.138114,-79.738746


### Drop Null values

In [16]:
penn_data_geo = penn_data_geo.dropna(subset = ['Borough', 'County', 'Latitude', 'Longitude'])
penn_data_geo.head()

Unnamed: 0,Borough,County,Latitude,Longitude
0,Adams Township,Butler County,40.717358,-80.012873
2,Akron Borough,Lancaster County,39.919412,-75.400164
3,Aldan Borough,Delaware County,40.344633,-78.028119
4,Alexandria Borough,Huntingdon County,41.138114,-79.738746
5,Allegheny Township,Butler County,41.347269,-77.022035


In [17]:
print('The dataframe has {} counties and {} boroughs.'.format(
        len(penn_data_geo['County'].unique()),
        penn_data_geo.shape[0]
    )
)

The dataframe has 55 counties and 499 boroughs.


In [18]:
address = 'Pennsylvania,USA'
geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Pennsylvania are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Pennsylvania are 40.9699889, -77.7278831.


In [19]:
map_penn = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, county, borough in zip(penn_data_geo['Latitude'], penn_data_geo['Longitude'], penn_data_geo['County'], penn_data_geo['Borough']):
    label = '{}, {}'.format(borough, county)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_penn) 
    
map_penn

In [20]:
penn_data_geo = penn_data_geo.reset_index(drop=True)
#penn_data_geo.columns = penn_data_geo.columns.str.lstrip()
penn_data_geo['County'] = penn_data_geo['County'].str.strip()

In [21]:
penn_data.describe()

Unnamed: 0,Borough,County
count,578,578
unique,578,55
top,Potter Township,York County
freq,1,40


Now let's explore neighborhoods in Allegheny County

In [22]:
penn_data_new = penn_data_geo[penn_data_geo.County == 'Allegheny County'].reset_index(drop=True)
len(penn_data_new)
penn_data_new.head()

Unnamed: 0,Borough,County,Latitude,Longitude
0,Bellevue Borough,Allegheny County,40.398036,-76.811517
1,Ben Avon Borough,Allegheny County,39.848541,-75.486389
2,Bradford Woods Borough,Allegheny County,39.876667,-75.391951
3,Edgewood Borough,Allegheny County,40.75296,-77.92825
4,Forest Hills Borough,Allegheny County,40.691662,-80.371


### Let's see the map

In [23]:
address = 'Allegheny County, Pennsylvania'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Allegheny County are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Allegheny County are 40.4597204, -79.9760405.


In [24]:
# create map of Allegheny County using latitude and longitude values
map_allegheny = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(penn_data_new['Latitude'], penn_data_new['Longitude'], penn_data_new['Borough']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_allegheny)  
    
map_allegheny

#### Now the Neighbourhoods

In [25]:
penn_data_new.loc[2, 'Borough']

'Bradford Woods Borough'

In [26]:
neighborhood_latitude = penn_data_new.loc[2, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = penn_data_new.loc[2, 'Longitude'] # neighborhood longitude value

neighborhood_name = penn_data_new.loc[2, 'Borough'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Bradford Woods Borough are 39.8766669, -75.3919513045786.


### Now let's set our Foursquare credentials

### Took out code for Foursquare credentials
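Since the credential cell was removed, here is a minimal sketch of one way to supply the values without hard-coding secrets in the notebook. `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION` are the names the later cells expect; the environment variable names and the version date below are assumptions chosen for illustration.

```python
import os

# Hypothetical environment variable names; set them in the shell before starting the notebook
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

# API version date in YYYYMMDD form (illustrative value, not taken from the original notebook)
VERSION = "20180605"

if not CLIENT_ID or not CLIENT_SECRET:
    print("Foursquare credentials are not set; API calls will fail.")
```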

In [None]:
radius = 1000
LIMIT = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL
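As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library, which takes care of escaping. This is a sketch under the assumption that the endpoint and parameter names are exactly those in the URL above; `build_explore_url` is a hypothetical helper and the credentials passed in the example are dummy strings.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=1000, limit=500):
    """Return a venues/explore URL with the same query parameters as the cell above."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,                # search radius in metres
        "limit": limit,                  # maximum number of venues requested
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Dummy credentials, for illustration only
example_url = build_explore_url("ID", "SECRET", "20180605", 39.8766669, -75.3919513)
```

Note that displaying such a URL in a shared notebook also displays the client secret, so it may be preferable not to echo it.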

In [29]:
import requests
import json
from pandas import json_normalize  # pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize

In [30]:
results = requests.get(url).json()

In [31]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [32]:
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print('{} venues by Foursquare.'.format(nearby_venues.shape[0]))

39 venues by Foursquare.


Now let's repeat the same process for all the neighbourhoods in Allegheny County

In [33]:
#Call this function on each neighborhood and create a new dataframe called allegheny_venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_venues
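`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which raises `IndexError` for a venue with an empty category list, and it assumes every response contains a `groups` entry. Below is a defensive sketch of just the row-extraction step. `extract_venue_rows` is a hypothetical helper, and the payload is hand-written to mirror only the fields the function above reads; it is not a real API response.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten an explore-style payload into row tuples.

    Tolerates a missing 'groups' list and venues without categories.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Use None when the venue has no category instead of failing
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Invented payload with one categorised and one uncategorised venue
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled", "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Toy Borough", 1.0, 2.0, sample)
```

Neighbourhoods for which the payload has no items simply contribute no rows, which matches the way boroughs without venues drop out of the results further down.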

In [34]:
allegheny_venues = getNearbyVenues(names=penn_data_new['Borough'],
                                   latitudes=penn_data_new['Latitude'],
                                   longitudes=penn_data_new['Longitude']
                                  )

Bellevue Borough
Ben Avon Borough
Bradford Woods Borough
Edgewood Borough
Forest Hills Borough
Ingram Borough


In [35]:
allegheny_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Ben Avon Borough,39.848541,-75.486389,Bethel Community Park,39.846563,-75.482629,Park
1,Bradford Woods Borough,39.876667,-75.391951,Firehouse Subs,39.877664,-75.394346,Sandwich Place
2,Bradford Woods Borough,39.876667,-75.391951,Lowe's Home Improvement,39.879045,-75.394831,Hardware Store
3,Bradford Woods Borough,39.876667,-75.391951,Wawa,39.876445,-75.39338,Convenience Store
4,Bradford Woods Borough,39.876667,-75.391951,Starbucks,39.876556,-75.395671,Coffee Shop


Identify the number of unique categories

In [36]:
print('There are {} unique categories.'.format(len(allegheny_venues['Venue Category'].unique())))
allegheny_venues.groupby('Neighborhood').count()

There are 23 unique categories.


Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Ben Avon Borough,1,1,1,1,1,1
Bradford Woods Borough,24,24,24,24,24,24
Forest Hills Borough,1,1,1,1,1,1
Ingram Borough,2,2,2,2,2,2


### Analyze each neighbourhood

In [37]:
# one hot encoding
allegheny_venues_onehot = pd.get_dummies(allegheny_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
allegheny_venues_onehot['Neighborhood'] = allegheny_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [allegheny_venues_onehot.columns[-1]] + list(allegheny_venues_onehot.columns[:-1])
allegheny_venues_onehot = allegheny_venues_onehot[fixed_columns]

allegheny_venues_onehot.head()

Unnamed: 0,Neighborhood,BBQ Joint,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega,Dessert Shop,...,Gym / Fitness Center,Hardware Store,Liquor Store,Optical Shop,Park,Pet Store,Pizza Place,Salon / Barbershop,Sandwich Place,Shopping Plaza
0,Ben Avon Borough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,Bradford Woods Borough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,Bradford Woods Borough,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,Bradford Woods Borough,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Bradford Woods Borough,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
allegheny_venues_onehot.shape

(28, 24)

In [39]:
allegheny_venues_grouped = allegheny_venues_onehot.groupby('Neighborhood').mean().reset_index()
allegheny_venues_grouped

Unnamed: 0,Neighborhood,BBQ Joint,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega,Dessert Shop,...,Gym / Fitness Center,Hardware Store,Liquor Store,Optical Shop,Park,Pet Store,Pizza Place,Salon / Barbershop,Sandwich Place,Shopping Plaza
0,Ben Avon Borough,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Bradford Woods Borough,0.0,0.041667,0.0,0.041667,0.041667,0.041667,0.041667,0.0,0.041667,...,0.041667,0.041667,0.125,0.041667,0.041667,0.041667,0.041667,0.041667,0.041667,0.041667
2,Forest Hills Borough,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Ingram Borough,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
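To make the grouping step concrete: averaging one-hot columns per neighbourhood yields the share of that neighbourhood's venues falling in each category, so every row sums to 1. A small sketch on made-up venues (the neighbourhood names and categories are illustrative only):

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["X", "X", "X", "X", "Y"],
    "Venue Category": ["Park", "Bank", "Park", "Diner", "Park"],
})

# One indicator column per category
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicators = relative frequency of each category per neighbourhood
freq = onehot.groupby("Neighborhood").mean()

print(freq.loc["X", "Park"])  # 0.5
print(freq.loc["Y", "Park"])  # 1.0
```

A neighbourhood with a single venue, like "Y" here, gets a frequency of 1.0 for that one category, which is why the single-venue boroughs above show a lone 1.0.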


### Print each neighbourhood with top 5 common venues

In [40]:
num_top_venues = 5

for hood in allegheny_venues_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = allegheny_venues_grouped[allegheny_venues_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Ben Avon Borough----
                venue  freq
0                Park   1.0
1           BBQ Joint   0.0
2       Grocery Store   0.0
3      Sandwich Place   0.0
4  Salon / Barbershop   0.0


----Bradford Woods Borough----
               venue  freq
0       Liquor Store  0.12
1      Grocery Store  0.08
2              Diner  0.08
3  Food & Drink Shop  0.04
4     Sandwich Place  0.04


----Forest Hills Borough----
                venue  freq
0       Deli / Bodega   1.0
1           BBQ Joint   0.0
2       Grocery Store   0.0
3      Sandwich Place   0.0
4  Salon / Barbershop   0.0


----Ingram Borough----
                venue  freq
0           BBQ Joint   0.5
1             Brewery   0.5
2       Grocery Store   0.0
3      Sandwich Place   0.0
4  Salon / Barbershop   0.0




In [41]:
#Function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

And top 10 venues of each neighbourhood

In [42]:
#function to display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = allegheny_venues_grouped['Neighborhood']

for ind in np.arange(allegheny_venues_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(allegheny_venues_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
len(neighborhoods_venues_sorted)

4
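`return_most_common_venues` always returns `num_top_venues` names, so a neighbourhood with only one or two venues still receives ten "most common" categories, most of them with a frequency of 0. The sketch below shows a hypothetical variant, `top_venues_nonzero`, that keeps only categories that actually occur. It is an illustration of the idea rather than a drop-in replacement, since the table above expects exactly ten values per row.

```python
import pandas as pd

def top_venues_nonzero(freq_row, num_top_venues=10):
    """Return category names sorted by frequency, keeping only those that occur."""
    present = freq_row[freq_row > 0]
    # A stable sort keeps the original column order among ties
    ranked = present.sort_values(ascending=False, kind="mergesort")
    return list(ranked.index[:num_top_venues])

row = pd.Series({"Park": 1.0, "Bank": 0.0, "Diner": 0.0})
print(top_venues_nonzero(row))  # ['Park']
```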

In [43]:
allegheny_venues_grouped.head()

Unnamed: 0,Neighborhood,BBQ Joint,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega,Dessert Shop,...,Gym / Fitness Center,Hardware Store,Liquor Store,Optical Shop,Park,Pet Store,Pizza Place,Salon / Barbershop,Sandwich Place,Shopping Plaza
0,Ben Avon Borough,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Bradford Woods Borough,0.0,0.041667,0.0,0.041667,0.041667,0.041667,0.041667,0.0,0.041667,...,0.041667,0.041667,0.125,0.041667,0.041667,0.041667,0.041667,0.041667,0.041667,0.041667
2,Forest Hills Borough,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Ingram Borough,0.5,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
# set number of clusters
kclusters = 4

allegheny_grouped_clustering = allegheny_venues_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(allegheny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 3, 2, 1], dtype=int32)
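With four neighbourhoods and `kclusters = 4`, k-means has exactly as many clusters as rows, so each neighbourhood ends up alone in its own cluster, consistent with the four distinct labels above. The sketch below illustrates this on made-up feature vectors and compares it with a smaller `k`. The data is invented, and `n_init=10` is set explicitly as an assumption so the result does not depend on a version-specific default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four toy rows standing in for four neighbourhoods' category frequencies
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])

# As many clusters as rows: every row becomes its own cluster
kmeans_full = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Fewer clusters than rows: some rows must share a cluster
kmeans_two = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(len(set(kmeans_full.labels_)))  # 4
print(kmeans_full.inertia_, kmeans_two.inertia_)
```

Grouping only becomes informative once there are noticeably more neighbourhoods than clusters, for example after including every borough that returned venues.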

In [45]:
neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Ben Avon Borough,Park,Shopping Plaza,Fast Food Restaurant,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega
1,Bradford Woods Borough,Liquor Store,Diner,Grocery Store,Shopping Plaza,Fast Food Restaurant,Bank,Burger Joint,Bus Station,Coffee Shop,Convenience Store
2,Forest Hills Borough,Deli / Bodega,Shopping Plaza,Fast Food Restaurant,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Dessert Shop
3,Ingram Borough,BBQ Joint,Brewery,Fast Food Restaurant,Bank,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega,Dessert Shop


In [46]:
penn_data_new

Unnamed: 0,Borough,County,Latitude,Longitude
0,Bellevue Borough,Allegheny County,40.398036,-76.811517
1,Ben Avon Borough,Allegheny County,39.848541,-75.486389
2,Bradford Woods Borough,Allegheny County,39.876667,-75.391951
3,Edgewood Borough,Allegheny County,40.75296,-77.92825
4,Forest Hills Borough,Allegheny County,40.691662,-80.371
5,Ingram Borough,Allegheny County,40.08067,-76.241128


In [47]:
penn_data_new_1 = penn_data_new.iloc[[1,2,4,5],:]

Attach the K-Means cluster labels to each neighbourhood

In [48]:
penn_data_merged = penn_data_new_1.copy()  # copy so the new column is not set on a slice

# add clustering labels
penn_data_merged['Cluster Labels'] = kmeans.labels_

# join the sorted venue table onto penn_data_merged to add the top venues for each neighborhood
penn_data_merged = penn_data_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Borough')

penn_data_merged.head() # check the last columns!


Unnamed: 0,Borough,County,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Ben Avon Borough,Allegheny County,39.848541,-75.486389,0,Park,Shopping Plaza,Fast Food Restaurant,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega
2,Bradford Woods Borough,Allegheny County,39.876667,-75.391951,3,Liquor Store,Diner,Grocery Store,Shopping Plaza,Fast Food Restaurant,Bank,Burger Joint,Bus Station,Coffee Shop,Convenience Store
4,Forest Hills Borough,Allegheny County,40.691662,-80.371,2,Deli / Bodega,Shopping Plaza,Fast Food Restaurant,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Dessert Shop
5,Ingram Borough,Allegheny County,40.08067,-76.241128,1,BBQ Joint,Brewery,Fast Food Restaurant,Bank,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega,Dessert Shop


In [49]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(penn_data_merged['Latitude'].round(4), penn_data_merged['Longitude'].round(4), penn_data_merged['Borough'], penn_data_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Cluster 1 - Place to relax and eat out - Main Junction

In [51]:
penn_data_merged.loc[penn_data_merged['Cluster Labels'] == 0, penn_data_merged.columns[[0] + list(range(5, penn_data_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Ben Avon Borough,Park,Shopping Plaza,Fast Food Restaurant,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega


Cluster 2 - Seems to be a tourist spot with restaurants, pubs and motels

In [52]:
penn_data_merged.loc[penn_data_merged['Cluster Labels'] == 1, penn_data_merged.columns[[0] + list(range(5, penn_data_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Ingram Borough,BBQ Joint,Brewery,Fast Food Restaurant,Bank,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Deli / Bodega,Dessert Shop


Cluster 3 - Shopping hub with lots of malls, stores and banks

In [53]:
penn_data_merged.loc[penn_data_merged['Cluster Labels'] == 2, penn_data_merged.columns[[0] + list(range(5, penn_data_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Forest Hills Borough,Deli / Bodega,Shopping Plaza,Fast Food Restaurant,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Dessert Shop


Cluster 4 - Looks like a happening place with everything

In [54]:
penn_data_merged.loc[penn_data_merged['Cluster Labels'] == 3, penn_data_merged.columns[[0] + list(range(5, penn_data_merged.shape[1]))]]

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Bradford Woods Borough,Liquor Store,Diner,Grocery Store,Shopping Plaza,Fast Food Restaurant,Bank,Burger Joint,Bus Station,Coffee Shop,Convenience Store



## Report
### Methodology:-

The methodology I used is fairly simple. As illustrated above, I used the K-Means clustering algorithm to group neighbourhoods that have similar amenities. User preferences such as banks, shopping malls, bus stations, grocery stores, restaurants, coffee shops, and delis or bakeries are then taken into consideration to find out which neighbourhoods are best for living. It seems to be a simple question, but it can definitely be approached in better ways, and the options are debatable; that is why we called it BON :)

### Results:-

Based on the amenities available in each area, the city-county combinations are divided into 4 clusters, each with a distinct set of amenities. Cluster 1 is the group of amenities found at any key junction of a neighborhood: parks, shopping, grocery stores, and banks. Cluster 2 covers the needs of tourists, such as restaurants and motels for accommodation. Cluster 3 is mostly about food joints and supermarkets. Cluster 4 groups the most happening places, with restaurants, nightlife, pubs, malls, banks, and entertainment centers.

### Discussion:-

Without a lot more work in the initial data exploration and methodology phases, it would not be possible to determine which top amenities in a neighborhood help in deciding whether to live in that area.

Though the data has gone through exploratory analysis, some issues only surface during the actual run. For example, a few locations didn't return a geolocation, and for a few others the Foursquare API returned no results, so no nearby venues could be found. My observation is that the data science methodology is a highly iterative process that requires going back and forth to tune the data as needed.

### Conclusion:-

After examining the clusters, we found that 'Forest Hills Borough' in Allegheny County is the closest match and the most suitable option for recommendation, so it will be shown as the top priority.

Note:- Assuming that a family member lives in Forest Hills, it would be easier for the student to use their car for commuting to college (sometimes); otherwise, a bus station is also available. Forest Hills is also affordable and has numerous part-time job vacancies for tutoring small children during free time.

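Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Forest Hills Borough,Deli / Bodega,Shopping Plaza,Grocery Store,Bank,Brewery,Burger Joint,Bus Station,Coffee Shop,Convenience Store,Diner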