# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

#### Introduction: Business Problem

Analyzing the venues available in a locality can give useful insights into the kinds of business thriving in that area. This profile can then be used to identify the type of business that is likely to succeed there.

This project aims to profile neighborhoods in order to find the best location for starting a new restaurant. It analyzes the localities of New York and Toronto, profiling each neighborhood by the number and category of venues present in the area.


#### Data

New York and Toronto are the two cities analyzed in this assignment.

The New York data was provided as part of the previous assignment in the course.

The neighborhood names of Toronto will be extracted from a Wikipedia article, as was done in the week 3 assignment.

Coordinates will be extracted using the Geocoder API and then used as input for Foursquare to obtain venue information and to generate maps.
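
As a minimal sketch of that last step, the request sent to Foursquare for one neighborhood could be assembled as shown below. The endpoint and parameter names mirror the `getNearbyVenues` function used later in this notebook. `build_explore_url` is a hypothetical helper, the credentials are placeholders, and the coordinates are those of Hudson Yards from the New York dataset.

```python
from urllib.parse import urlencode

# Endpoint taken from the getNearbyVenues function in this notebook
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, client_id, client_secret,
                      version='20180323', radius=1000, limit=200):
    """Hypothetical helper: build the venues/explore URL for one location."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # "latitude,longitude"
        'radius': radius,                # search radius in meters
        'limit': limit,                  # maximum number of venues returned
    }
    # urlencode escapes the values, so the comma in 'll' becomes %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials; replace with your own before sending the request
url = build_explore_url(40.756658, -74.000111,
                        'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET')
print(url)
```

Building the query string with `urlencode` rather than string formatting keeps the URL valid even if a value contains reserved characters.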

#### Methodology

Stage 1 - Business Understanding
As stated in the previous section of this report, our main goal is to create a reliable profile of the neighborhoods in New York City and Toronto. Our fictional business sponsors are two entrepreneurs, one looking to open a new restaurant in New York City and another one looking to open a new bar in Toronto.

Stage 2 - Analytic Approach
To decide the ideal neighborhood for the new business, we must classify the neighborhoods into three main kinds of regions based on the proportion of venue categories present in each one:
a) Residential
b) Services
c) "Going Out"
After the necessary data preparation (collection, encoding and normalization), the neighborhoods will be clustered into three groups using the k-means clustering algorithm. To solve our business problem, the third cluster, "Going Out", will be studied further: the venue categories in its neighborhoods will be expanded to give insight into the kinds of places that do not already exist there. This information can help our business sponsors decide which kinds of restaurant or bar are lacking and are therefore probable business opportunities.
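
The approach can be illustrated with a small, dependency-free sketch. The notebook itself imports scikit-learn's `KMeans`; the plain implementation below is only meant to show the idea. The neighborhood names, the venue counts and the reduction to three category groups are all made up for illustration.

```python
# Hypothetical venue counts per neighborhood: (residential, services, going out)
venue_counts = {
    'A': (8, 1, 1),
    'B': (1, 8, 1),
    'C': (1, 1, 8),
    'D': (7, 2, 1),
    'E': (2, 7, 1),
    'F': (1, 2, 7),
}

# Normalization: turn counts into proportions so that busy and quiet
# neighborhoods with the same venue mix look alike
names = list(venue_counts)
points = [[c / sum(counts) for c in counts] for counts in venue_counts.values()]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Plain Lloyd's algorithm, seeded with the first k points.

    scikit-learn's KMeans defaults to k-means++ seeding instead.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans(points, k=3)
for name, label in zip(names, labels):
    print(name, 'belongs to cluster', label)
```

The cluster numbers are arbitrary. Which cluster corresponds to "Going Out" has to be read from the centroids, by looking for the one with the highest proportion in the going-out column.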

Stage 3 - Data Requirements
As stated in the Data section, the data requirements for this research are the venue information for each neighborhood in Toronto and New York City. Consequently, information about the neighborhoods (names and geographical coordinates) is also necessary.

Stage 4 & 5 - Data Collection & Understanding
The required data is collected in the first parts of the Jupyter Notebook. Toronto boroughs and neighborhoods are scraped from the Wikipedia page using the BeautifulSoup package, and the New York City borough and neighborhood information is read from the JSON file. At this point the data is organized in a Pandas DataFrame like the following:


In [1]:
# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd
from scipy import stats
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# library to handle JSON files
import json

# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize

# useful time functions library
import time

# library to handle requests
import requests

# matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# seaborn and associated plotting modules
import seaborn as sns

# plotly and associated plotting modules
import plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
!pip install folium
import folium

# import beautifulsoup for html data scraping
from bs4 import BeautifulSoup

# import geocoder and geopy for geographic coordinates extraction
!pip install geocoder
import geocoder
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim 

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [2]:
!pip install cufflinks

Collecting cufflinks
  Downloading cufflinks-0.17.3.tar.gz (81 kB)
Collecting colorlover>=0.2.1
  Downloading colorlover-0.3.0-py3-none-any.whl (8.9 kB)
Building wheels for collected packages: cufflinks
  Building wheel for cufflinks (setup.py) ... done
  Created wheel for cufflinks: filename=cufflinks-0.17.3-py3-none-any.whl size=67921 sha256=4b51f7a17f3ca23aeb0634fe5f196074b9f2b18888b98ba99458f7bf13910b0f
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/e1/27/13/3fe67fa7ea7be444b831d117220b3b586b872c9acd4df480d0
Successfully built cufflinks
Installing collected packages: colorlover, cufflinks
Successfully installed colorlover-0.3.0 cufflinks-0.17.3


In [3]:
import cufflinks as cf

In [4]:
# Write your Google geocoder credentials in the variable below (do not publish a real key)
GEOCODER_GOOGLE_KEY = 'YOUR_GOOGLE_API_KEY'

In [5]:
!pip install chart-studio

Collecting chart-studio
  Downloading chart_studio-1.1.0-py3-none-any.whl (64 kB)
Installing collected packages: chart-studio
Successfully installed chart-studio-1.1.0


In [6]:
import plotly
import plotly.tools

In [7]:
# Write your Plotly credentials in the function below
#py.tools.set_credentials_file(username='levybuble', api_key='••••••••••')
import chart_studio
chart_studio.tools.set_credentials_file(username='levybuble', api_key='••••••••••')

In [8]:
# Write your Foursquare credentials in the variables below (do not publish real credentials)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180323' # Foursquare API version

In [9]:
# JSON file downloaded from link https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json
import json
import requests

url = 'https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json'
r = requests.get(url=url, headers={'Accept': 'application/json'})
r.raise_for_status()  # fail early if the download did not succeed
newyork_data = json.loads(r.content)

In [10]:
# Collect one record per neighborhood from the imported New York JSON data
ny_rows = []
for data in newyork_data['features']:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
    neighborhood_latlon = data['geometry']['coordinates']  # stored as [longitude, latitude]
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    ny_rows.append({'Borough': borough,
                    'Neighborhood': neighborhood_name,
                    'Latitude': neighborhood_lat,
                    'Longitude': neighborhood_lon})

# Build the DataFrame in one step; DataFrame.append in a loop is slow
# and was removed in pandas 2.0
ny_neighborhoods = pd.DataFrame(ny_rows, columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])
print('Pandas DataFrame populated with New York City data.')

# Export data to csv file
ny_neighborhoods.to_csv('ny_neighborhoods.csv', sep=',', encoding='utf-8')

ny_neighborhoods.tail()

Pandas DataFrame populated with New York City data.


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
301,Manhattan,Hudson Yards,40.756658,-74.000111
302,Queens,Hammels,40.587338,-73.80553
303,Queens,Bayswater,40.611322,-73.765968
304,Queens,Queensbridge,40.756091,-73.945631
305,Staten Island,Fox Hills,40.617311,-74.08174


In [11]:
# Alternatively, import data directly from local .csv file prepared

colnames = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
ny_neighborhoods = pd.read_csv('ny_neighborhoods.csv', skiprows=1, names=colnames)
ny_neighborhoods.tail()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
301,Manhattan,Hudson Yards,40.756658,-74.000111
302,Queens,Hammels,40.587338,-73.80553
303,Queens,Bayswater,40.611322,-73.765968
304,Queens,Queensbridge,40.756091,-73.945631
305,Staten Island,Fox Hills,40.617311,-74.08174


In [29]:
# Scrape Toronto data from Wikipedia
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Create BeautifulSoup object
soup = BeautifulSoup(source, "html.parser")

# Locate the postal code table and its rows
wiki_table = soup.find('table', {'class':'wikitable sortable'})
wiki_table_rows = wiki_table.find_all('tr')
res = []

# Get borough and neighborhood names from the Wikipedia table
for tr in wiki_table_rows:
    cells = tr.find_all('td')
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        # Skip postal codes without a borough; when no neighborhood is assigned,
        # fall back to the borough name
        if row[1] != 'Not assigned':
            if row[2] == 'Not assigned':
                row[2] = row[1]
            res.append(row)
#print (res)
post_df = pd.DataFrame(res, columns = ["PostalCode", "Borough", "Neighborhood"])
# Merge neighborhoods that share a postal code into one comma-separated row
post_df = post_df.groupby(["PostalCode", "Borough"])["Neighborhood"].apply(", ".join).reset_index()
#post_df.head()
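
The cleaning rules applied in the cell above can be sketched without pandas or a network connection. The rows below are illustrative examples in the shape of the Wikipedia table, not scraped output, and `clean_rows` is a hypothetical helper.

```python
# Illustrative rows shaped like the Wikipedia table: [postal code, borough, neighborhood]
rows = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M5A', 'Downtown Toronto', 'Harbourfront'],
    ['M5A', 'Downtown Toronto', 'Regent Park'],
    ['M7A', "Queen's Park", 'Not assigned'],
]


def clean_rows(rows):
    """Hypothetical helper mirroring the filtering and grouping rules."""
    merged = {}
    for postal_code, borough, neighborhood in rows:
        # Rule 1: ignore postal codes that have no borough
        if borough == 'Not assigned':
            continue
        # Rule 2: a missing neighborhood takes the borough name
        if neighborhood == 'Not assigned':
            neighborhood = borough
        # Rule 3: neighborhoods sharing a postal code are merged into one row
        merged.setdefault((postal_code, borough), []).append(neighborhood)
    return [[code, borough, ', '.join(names)]
            for (code, borough), names in merged.items()]


cleaned = clean_rows(rows)
for row in cleaned:
    print(row)
```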

In [None]:
import geocoder

def get_geocoder(postal_code_from_df):
    # initialize the coordinates to None
    lat_lng_coords = None
    # loop until the geocoder returns coordinates
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code_from_df), key=GEOCODER_GOOGLE_KEY)
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude

# Geocode every postal code once, then assign both columns in one step.
# Chained indexing such as post_df.iloc[i]['Latitude'] = ... writes to a copy and is lost.
coords = [get_geocoder(code) for code in post_df['PostalCode']]
post_df['Latitude'] = [lat for lat, _ in coords]
post_df['Longitude'] = [lng for _, lng in coords]

In [28]:
# Iterate through 'res' array and find coordinates for each row (postal code)
import geocoder
print('Importing Toronto neighborhoods geographical coordinates using geocoder...')
for j in range(0, len(res)):
    # send request, retrying until the geocoder returns coordinates
    lat_lng_coords = None
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(res[j][0]), key=GEOCODER_GOOGLE_KEY)
        lat_lng_coords = g.latlng

    # append coordinates to 'res' array
    res[j].append(lat_lng_coords[0])
    res[j].append(lat_lng_coords[1])

# Populate to_neighborhoods with the Toronto data scraped from Wikipedia
# (the mapping and Foursquare cells below expect this dataframe to exist)
to_neighborhoods = pd.DataFrame(res, columns=["Postcode", "Borough", "Neighborhood", "Latitude", "Longitude"])
# Drop "Postcode" column
to_neighborhoods = to_neighborhoods.drop(columns='Postcode')
# Print alert
print('Pandas DataFrame populated with Toronto data.')
# Export data to local .csv file
to_neighborhoods.to_csv('to_neighborhoods.csv', sep=',', encoding='utf-8')
to_neighborhoods.tail()

Importing Toronto neighborhoods geographical coordinates using geocoder...


KeyboardInterrupt: 

In [None]:
ny_neighborhoods

In [None]:
def generate_map_of_city_boroughs_data(city_name, city_neighborhoods):

    # Find city geographical coordinates using the Nominatim geocoder (geopy)
    from geopy.geocoders import Nominatim
    from geopy.exc import GeopyError

    geolocator = Nominatim(user_agent="my-application")
    try:
        location = geolocator.geocode(city_name)
    except GeopyError:
        location = None
    if location is None:
        # geocode() returns None when the address cannot be resolved
        raise ValueError('Could not geocode "{}".'.format(city_name))

    city_latitude = location.latitude
    city_longitude = location.longitude
    print('The geographical coordinates of "{}" are {}, {}.'.format(city_name, city_latitude, city_longitude))

    # Check number of Boroughs and Neighborhoods in the collected Dataset
    print('The "{}" dataframe has {} boroughs and {} neighborhoods.'.format(
          city_name,
          len(city_neighborhoods['Borough'].unique()),
          len(city_neighborhoods['Neighborhood'].unique())))

    # create map of city using latitude and longitude values
    map_city = folium.Map(location=[city_latitude, city_longitude], zoom_start=10)

    fg = folium.FeatureGroup(name="My_Map")

    # add one marker per neighborhood
    for lat, lng, borough, neighborhood in zip(city_neighborhoods['Latitude'], city_neighborhoods['Longitude'], city_neighborhoods['Borough'], city_neighborhoods['Neighborhood']):
        # parse_html=True escapes the label, so apostrophes need no manual replacement
        label = folium.Popup('{}, {}'.format(borough, neighborhood), parse_html=True)
        fg.add_child(folium.CircleMarker(
            location=[lat, lng],
            radius=5,
            popup=label,
            fill_color='#3186cc',
            color='blue',
            fill=True,
            fill_opacity=0.7))
    map_city.add_child(fg)
    display(map_city)

In [None]:
generate_map_of_city_boroughs_data('New York City, NY', ny_neighborhoods)

In [None]:
generate_map_of_city_boroughs_data('Toronto, ON', to_neighborhoods)

In [None]:
# getNearbyVenues() is a function made to get the top venues that are in each neighborhood within a radius of X meters
def getNearbyVenues(names, latitudes, longitudes, limit=200, radius=1000):

    venues_list = []
    last_index = len(names) - 1

    for j, (name, lat, lng) in enumerate(zip(names, latitudes, longitudes)):

        # print progress at every 10% of the neighborhoods
        for percent in range(10, 101, 10):
            if j == int(percent / 100 * last_index):
                print('Foursquare loop {}% Complete.'.format(percent))

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return(nearby_venues)

In [None]:
print('Importing Toronto neighborhoods nearby venues using Foursquare...')
# Get data from Foursquare
to_venues = getNearbyVenues(names=to_neighborhoods['Neighborhood'],
                            latitudes=to_neighborhoods['Latitude'],
                            longitudes=to_neighborhoods['Longitude'],
                            limit=200)

print('The "to_venues" dataframe has {} venues and {} unique venue types.'.format(
      len(to_venues['Venue Category']),
      len(to_venues['Venue Category'].unique())))
to_venues.to_csv('to_venues.csv', sep=',', encoding='UTF8')
to_venues.head()

In [None]:
def generate_map_of_city_venues_data(city_name, city_neighborhoods):
    
    # Find city geographical coordinates using the Nominatim geocoder (geopy)
    geolocator = Nominatim(user_agent="my_jupyter_notebook")
    # Route the request through the rate limiter so calls are at least 1 second apart
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
    city_location = geocode(city_name) #'New York City, NY'
    city_latitude = city_location.latitude
    city_longitude = city_location.longitude
    print('The geographical coordinates of "{}" are {}, {}.'.format(city_name, city_latitude, city_longitude))
    
    # Check number of Boroughs and Neighborhoods in the collected Dataset
    print('The "{}" dataframe has {} different venue types and {} neighborhoods.'.format(
          city_name,
          len(city_neighborhoods['Venue Category'].unique()),
          len(city_neighborhoods['Neighborhood'].unique())))
    
    # create map of city using latitude and longitude values
    map_city = folium.Map(location=[city_latitude, city_longitude], zoom_start=10)

    # add markers to map
    for lat, lng, venue, category in zip(city_neighborhoods['Venue Latitude'], city_neighborhoods['Venue Longitude'], city_neighborhoods['Venue'], city_neighborhoods['Venue Category']):
        label = '{}, {}'.format(category, venue)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=0.1,
            popup=label,
            color='red',
            fill=True,
            fill_color='#FF0000',
            fill_opacity=0.3).add_to(map_city)  

    return map_city

In [None]:
generate_map_of_city_venues_data('Toronto, ON', to_venues)

In [None]:
generate_map_of_city_venues_data('New York City, NY', ny_venues)

In [None]:
""" # Code used to extract all unique venue categories in Toronto
# Save unique categories list as a .csv file
# format as a block of csv text to do whatever you want
csv_rows = ["{}".format(i) for i in to_venues['Venue Category'].unique()]
csv_text = "\n".join(csv_rows)

# write it to a file
with open('to_unique_venues.csv', 'w') as f:
    f.write(csv_text)
"""

# Import the manually prepared data extracted with the code above
# encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'
colnames = ['BAR_CLUB', 'RESTAURANT', 'SERVICES', 'LEISURE_SPORTS', 'CULTURAL_SCHOOLS', 'PARKS_NATURE_RURAL', 'TRANSPORT_INFRASTRUCTURE', 'RESIDENTIAL']
to_unique_venues = pd.read_csv('to_unique_venues.csv', skiprows=1, names=colnames, encoding='latin1')

# Export columns to python lists, dropping the NaN padding of the shorter columns
to_BAR_CLUB = to_unique_venues.BAR_CLUB.dropna().tolist()
to_RESTAURANT = to_unique_venues.RESTAURANT.dropna().tolist()
to_SERVICES = to_unique_venues.SERVICES.dropna().tolist()
to_LEISURE_SPORTS = to_unique_venues.LEISURE_SPORTS.dropna().tolist()
to_CULTURAL_SCHOOLS = to_unique_venues.CULTURAL_SCHOOLS.dropna().tolist()
to_PARKS_NATURE_RURAL = to_unique_venues.PARKS_NATURE_RURAL.dropna().tolist()
to_TRANSPORT_INFRASTRUCTURE = to_unique_venues.TRANSPORT_INFRASTRUCTURE.dropna().tolist()
to_RESIDENTIAL = to_unique_venues.RESIDENTIAL.dropna().tolist()

to_unique_venues.head()

In [None]:
to_info = []
to_info.append(["Bars and Clubs", len(to_BAR_CLUB)])
to_info.append(["Restaurants", len(to_RESTAURANT)])
to_info.append(["Services", len(to_SERVICES)])
to_info.append(["Leisure and Sports", len(to_LEISURE_SPORTS)])
to_info.append(["Education and Culture", len(to_CULTURAL_SCHOOLS)])
to_info.append(["Nature and Parks", len(to_PARKS_NATURE_RURAL)])
to_info.append(["Transportation", len(to_TRANSPORT_INFRASTRUCTURE)])
to_info.append(["Residential", len(to_RESIDENTIAL)])

to_venues_info = pd.DataFrame(to_info, columns=["Category", "Unique Sub-Categories"])
to_venues_info

In [None]:
""" # Code used to extract all unique venue categories in New York City
# Save unique categories list as a .csv file
# format as a block of csv text to do whatever you want
csv_rows = ["{}".format(i) for i in ny_venues['Venue Category'].unique()]
csv_text = "\n".join(csv_rows)

# write it to a file
with open('ny_unique_venues.csv', 'w') as f:
    f.write(csv_text)
"""

# Import the manually prepared data extracted with the code above
# encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'
colnames = ['BAR_CLUB', 'RESTAURANT', 'SERVICES', 'LEISURE_SPORTS', 'CULTURAL_SCHOOLS', 'PARKS_NATURE_RURAL', 'TRANSPORT_INFRASTRUCTURE', 'RESIDENTIAL']
ny_unique_venues = pd.read_csv('ny_unique_venues.csv', skiprows=1, names=colnames, encoding='latin1')

# Export columns to python lists, dropping the NaN padding of the shorter columns
ny_BAR_CLUB = ny_unique_venues.BAR_CLUB.dropna().tolist()
ny_RESTAURANT = ny_unique_venues.RESTAURANT.dropna().tolist()
ny_SERVICES = ny_unique_venues.SERVICES.dropna().tolist()
ny_LEISURE_SPORTS = ny_unique_venues.LEISURE_SPORTS.dropna().tolist()
ny_CULTURAL_SCHOOLS = ny_unique_venues.CULTURAL_SCHOOLS.dropna().tolist()
ny_PARKS_NATURE_RURAL = ny_unique_venues.PARKS_NATURE_RURAL.dropna().tolist()
ny_TRANSPORT_INFRASTRUCTURE = ny_unique_venues.TRANSPORT_INFRASTRUCTURE.dropna().tolist()
ny_RESIDENTIAL = ny_unique_venues.RESIDENTIAL.dropna().tolist()

ny_unique_venues.head()

In [None]:
ny_info = []
ny_info.append(["Bars and Clubs", len(ny_BAR_CLUB)])
ny_info.append(["Restaurants", len(ny_RESTAURANT)])
ny_info.append(["Services", len(ny_SERVICES)])
ny_info.append(["Leisure and Sports", len(ny_LEISURE_SPORTS)])
ny_info.append(["Education and Culture", len(ny_CULTURAL_SCHOOLS)])
ny_info.append(["Nature and Parks", len(ny_PARKS_NATURE_RURAL)])
ny_info.append(["Transportation", len(ny_TRANSPORT_INFRASTRUCTURE)])
ny_info.append(["Residential", len(ny_RESIDENTIAL)])

ny_venues_info = pd.DataFrame(ny_info, columns=["Category", "Unique Sub-Categories"])
ny_venues_info

In [None]:
trace1 = go.Bar(x=to_venues_info['Category'],
                y=to_venues_info['Unique Sub-Categories'],
                opacity=0.3,
                name="Unique Sub-Categories in Toronto")
trace2 = go.Bar(x=ny_venues_info['Category'],
                y=ny_venues_info['Unique Sub-Categories'],
                opacity=0.3,
                name="Unique Sub-Categories in New York City")

data = [trace1, trace2]
layout = go.Layout(barmode='overlay')
fig = go.Figure(data=data, layout=layout)

# plotly.plotly moved to the chart_studio package in plotly 4
import chart_studio.plotly
chart_studio.plotly.iplot(fig)

In [None]:
def encode_venues_categories(dataframe):
    # Category lists, in the same order as the one-hot columns of the encoded dataframe
    category_lists = [
        to_BAR_CLUB + ny_BAR_CLUB,
        to_RESTAURANT + ny_RESTAURANT,
        to_SERVICES + ny_SERVICES,
        to_LEISURE_SPORTS + ny_LEISURE_SPORTS,
        to_CULTURAL_SCHOOLS + ny_CULTURAL_SCHOOLS,
        to_PARKS_NATURE_RURAL + ny_PARKS_NATURE_RURAL,
        to_TRANSPORT_INFRASTRUCTURE + ny_TRANSPORT_INFRASTRUCTURE,
        to_RESIDENTIAL + ny_RESIDENTIAL,
    ]
    services_position = 2  # venues that match no list are counted as Services

    res = []
    for index, row in dataframe.iterrows():
        one_hot = [0] * len(category_lists)
        # Flag the first category list that contains this venue's category
        for position, categories in enumerate(category_lists):
            if row["Venue Category"] in categories:
                one_hot[position] = 1
                break
        else:
            one_hot[services_position] = 1
        # The trailing 1 feeds the "Total Venues" column
        res.append([row["Neighborhood"], row["Neighborhood Latitude"], row["Neighborhood Longitude"],
                    row["Venue Latitude"], row["Venue Longitude"]] + one_hot + [1])
    return res

#Neighborhood
#Neighborhood Latitude
#Neighborhood Longitude
#Venue
#Venue Latitude
#Venue Longitude
#Venue Category

In [None]:
# Create encoded venues dataframe
to_encoded_venues = pd.DataFrame(encode_venues_categories(to_venues), 
                                 columns=["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude", 
                                          "Venue Latitude", "Venue Longitude", 
                                          "Bars and Clubs", "Restaurants", "Services", "Leisure and Sports",
                                          "Education and Culture", "Nature and Parks", "Transportation",
                                          "Residential", "Total Venues"])


# Create encoded grouped venues dataframe
to_encoded_grouped_venues = to_encoded_venues.groupby(['Neighborhood', 
                                                       'Neighborhood Latitude', 
                                                       'Neighborhood Longitude']).sum().sort_values(by=['Total Venues']).reset_index()
# Save Neighborhood column for later
to_encoded_grouped_venues_Neighborhood = to_encoded_grouped_venues['Neighborhood']
# Drop non-integer columns
to_encoded_grouped_venues = to_encoded_grouped_venues.drop(['Neighborhood Latitude', 
                                                            'Neighborhood Longitude',
                                                            'Venue Latitude',
                                                            'Venue Longitude',
                                                            'Neighborhood'], axis=1)

# Prepare encoded grouped venues dataframe for KMeans clustering
to_encoded_grouped_venues_std = to_encoded_venues.groupby(['Neighborhood', 
                                                           'Neighborhood Latitude', 
                                                           'Neighborhood Longitude']).mean().sort_values(by=['Total Venues']).reset_index()
# Save columns for later
to_encoded_grouped_venues_std_Neighborhood = to_encoded_grouped_venues_std['Neighborhood']
to_encoded_grouped_venues_std_Latitude = to_encoded_grouped_venues_std['Neighborhood Latitude']
to_encoded_grouped_venues_std_Longitude = to_encoded_grouped_venues_std['Neighborhood Longitude']
# Drop non-integer columns
to_encoded_grouped_venues_std = to_encoded_grouped_venues_std.drop(['Neighborhood Latitude', 
                                                                    'Neighborhood Longitude',
                                                                    'Venue Latitude',
                                                                    'Venue Longitude',
                                                                    'Neighborhood',
                                                                    'Total Venues'], axis=1)
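
The two aggregations above play different roles: `.sum()` over the one-hot rows counts the venues of each category in a neighborhood, while `.mean()` gives the proportion of each category, which is what the clustering works on. The sketch below reproduces both on a handful of made-up venues, with hypothetical neighborhood names and only three of the eight category columns, using plain Python so the arithmetic is easy to follow.

```python
from collections import defaultdict

# Shortened, illustrative subset of the category columns
GROUPS = ['Bars and Clubs', 'Restaurants', 'Services']

# Hypothetical venues: (neighborhood, category group)
venues = [
    ('Alpha', 'Restaurants'),
    ('Alpha', 'Restaurants'),
    ('Alpha', 'Bars and Clubs'),
    ('Alpha', 'Services'),
    ('Beta', 'Services'),
    ('Beta', 'Services'),
]


def one_hot(group):
    # One column per category group, 1 only in the matching column
    return [1 if group == g else 0 for g in GROUPS]


# Group the one-hot rows by neighborhood
rows_by_neighborhood = defaultdict(list)
for neighborhood, group in venues:
    rows_by_neighborhood[neighborhood].append(one_hot(group))

# Equivalent of groupby(...).sum(): venue counts per category
totals = {n: [sum(col) for col in zip(*rows)]
          for n, rows in rows_by_neighborhood.items()}

# Equivalent of groupby(...).mean(): share of each category
proportions = {n: [sum(col) / len(rows) for col in zip(*rows)]
               for n, rows in rows_by_neighborhood.items()}

print(totals)
print(proportions)
```

Because every row contains exactly one 1, the proportions of a neighborhood always add up to 1, regardless of how many venues it has.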

In [None]:
to_encoded_venues.tail()

In [None]:
to_encoded_grouped_venues.tail()

In [None]:
to_encoded_grouped_venues_std.tail()

In [None]:
# Create encoded venues dataframe
ny_encoded_venues = pd.DataFrame(encode_venues_categories(ny_venues), 
                                 columns=["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude", 
                                          "Venue Latitude", "Venue Longitude", 
                                          "Bars and Clubs", "Restaurants", "Services", "Leisure and Sports",
                                          "Education and Culture", "Nature and Parks", "Transportation",
                                          "Residential", "Total Venues"])

# Create encoded grouped venues dataframe
ny_encoded_grouped_venues = ny_encoded_venues.groupby(['Neighborhood', 
                                                       'Neighborhood Latitude', 
                                                       'Neighborhood Longitude']).sum().sort_values(by=['Total Venues']).reset_index()

# Save Neighborhood column for later
ny_encoded_grouped_venues_Neighborhood = ny_encoded_grouped_venues['Neighborhood']
# Drop non-integer columns
ny_encoded_grouped_venues = ny_encoded_grouped_venues.drop(['Neighborhood Latitude', 
                                                            'Neighborhood Longitude',
                                                            'Venue Latitude',
                                                            'Venue Longitude',
                                                            'Neighborhood'], axis=1)

# Prepare encoded grouped venues dataframe for KMeans clustering
ny_encoded_grouped_venues_std = ny_encoded_venues.groupby(['Neighborhood', 
                                                           'Neighborhood Latitude', 
                                                           'Neighborhood Longitude']).mean().sort_values(by=['Total Venues']).reset_index()
# Save Neighborhood column for later
ny_encoded_grouped_venues_std_Neighborhood = ny_encoded_grouped_venues_std['Neighborhood']
ny_encoded_grouped_venues_std_Latitude = ny_encoded_grouped_venues_std['Neighborhood Latitude']
ny_encoded_grouped_venues_std_Longitude = ny_encoded_grouped_venues_std['Neighborhood Longitude']
# Drop coordinate, label and total columns so only the category shares remain
ny_encoded_grouped_venues_std = ny_encoded_grouped_venues_std.drop(['Neighborhood Latitude', 
                                                                    'Neighborhood Longitude',
                                                                    'Venue Latitude',
                                                                    'Venue Longitude',
                                                                    'Neighborhood',
                                                                    'Total Venues'], axis=1)

In [None]:
ny_encoded_venues.tail()

In [None]:
ny_encoded_grouped_venues.tail()

In [None]:
ny_encoded_grouped_venues_std.tail()
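
The two grouped dataframes above differ only in the aggregation: `sum()` gives venue counts per category, while `mean()` gives each category's share of a neighborhood's venues. The following is a minimal sketch on made-up data, assuming one row per venue with one-hot category columns (the `toy_*` names and values are hypothetical, not taken from the Foursquare data):

```python
import pandas as pd

# Hypothetical toy data: one row per venue, one-hot encoded by category
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Restaurants": [1, 1, 0, 0],
    "Services": [0, 0, 1, 1],
})

# Sum of the one-hot columns: number of venues per category
toy_counts = toy_venues.groupby("Neighborhood").sum()

# Mean of the one-hot columns: fraction of venues per category
toy_shares = toy_venues.groupby("Neighborhood").mean()

print(toy_counts)
print(toy_shares)
```

Because every venue belongs to exactly one category in this sketch, each row of `toy_shares` sums to 1, which puts neighborhoods with few and many venues on the same scale before clustering.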

In [None]:
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

# Count venues per neighborhood (value_counts sorts in descending order)
series = to_encoded_venues['Neighborhood'].value_counts()

series.iplot(kind='bar', yTitle='Number of Venues', xTitle=None, title='Toronto numbers of venues per neighborhood',
             filename='toronto-bar-chart')

In [None]:
group_labels = ['Toronto Distplot']
fig = ff.create_distplot([np.array(to_encoded_grouped_venues['Total Venues'].tolist())], group_labels )
py.plotly.iplot(fig, filename='Basic Distplot')

In [None]:
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

# Count venues per neighborhood (value_counts sorts in descending order)
series = ny_encoded_venues['Neighborhood'].value_counts()

series.iplot(kind='bar', yTitle='Number of Venues', xTitle=None, title='New York City numbers of venues per neighborhood',
             filename='newyork-bar-chart')

In [None]:
group_labels = ['New York Distplot']
fig = ff.create_distplot([np.array(ny_encoded_grouped_venues['Total Venues'].tolist())], group_labels )
py.plotly.iplot(fig, filename='Basic Distplot')

In [None]:
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

# Count how many neighborhoods have each number of restaurants
series = to_encoded_grouped_venues['Restaurants'].value_counts()

series.iplot(kind='bar', xTitle='Number of Restaurants', yTitle='Number of Neighborhoods', 
             title='Number of Neighborhoods with X number of Restaurants in Toronto',
             filename='toronto_rest-bar-chart')

In [None]:
group_labels = ['Toronto Distplot']
fig = ff.create_distplot([np.array(to_encoded_grouped_venues['Restaurants'].tolist())], group_labels )
py.plotly.iplot(fig, filename='Basic Distplot')

In [None]:
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

# Count how many neighborhoods have each number of restaurants
series = ny_encoded_grouped_venues['Restaurants'].value_counts()

series.iplot(kind='bar', xTitle='Number of Restaurants', yTitle='Number of Neighborhoods', 
             title='Number of Neighborhoods with X number of Restaurants in New York City',
             filename='newyork_rest-bar-chart')

In [None]:
group_labels = ['New York Distplot']
fig = ff.create_distplot([np.array(ny_encoded_grouped_venues['Restaurants'].tolist())], group_labels )
py.plotly.iplot(fig, filename='Basic Distplot')

In [None]:
# Add histogram data
x1 = np.array(to_encoded_grouped_venues['Restaurants'].tolist())
x2 = np.array(ny_encoded_grouped_venues['Restaurants'].tolist())

# Group data together
hist_data = [x1, x2]

group_labels = ['Toronto', 'NYC']

# Create distplot with the default bin size
fig = ff.create_distplot(hist_data, group_labels)

# Plot!
py.plotly.iplot(fig, filename='Distplot with Multiple Datasets')

In [None]:
# Add histogram data
x1 = np.array(to_encoded_grouped_venues['Bars and Clubs'].tolist())
x2 = np.array(ny_encoded_grouped_venues['Bars and Clubs'].tolist())

# Group data together
hist_data = [x1, x2]

group_labels = ['Toronto', 'NYC']

# Create distplot with the default bin size
fig = ff.create_distplot(hist_data, group_labels)

# Plot!
py.plotly.iplot(fig, filename='Distplot with Multiple Datasets')

In [None]:
# Add histogram data
x1 = np.array(to_encoded_grouped_venues['Services'].tolist())
x2 = np.array(ny_encoded_grouped_venues['Services'].tolist())

# Group data together
hist_data = [x1, x2]

group_labels = ['Toronto', 'NYC']

# Create distplot with the default bin size
fig = ff.create_distplot(hist_data, group_labels)

# Plot!
py.plotly.iplot(fig, filename='Distplot with Multiple Datasets')

In [None]:
# Copy the grouped per-category means (the `_std` dataframe) so the source stays unchanged
ny_clustered_neighborhoods = ny_encoded_grouped_venues_std.copy()

# Columns list
clmns = ["Bars and Clubs", "Restaurants", "Services", "Leisure and Sports",
         "Education and Culture", "Nature and Parks", "Transportation", "Residential"]
    
# Cluster the data
kmeans = KMeans(n_clusters=5, random_state=0).fit(ny_clustered_neighborhoods)
labels = kmeans.labels_

# Make the new Cluster column
ny_clustered_neighborhoods['Cluster'] = labels

# Add the column into our list
clmns.append('Cluster')

# Let's analyze the clusters
ny_pie_clusters = ny_clustered_neighborhoods[clmns].groupby(['Cluster']).mean()
ny_pie_clusters

In [None]:
ny_pie_clusters.columns = [0,1,2,3,4,5,6,7]
ny_pie_clusters = ny_pie_clusters.T
ny_pie_clusters.columns = ['results_0', 'results_1', 'results_2', 'results_3', 'results_4']
ny_pie_clusters

In [None]:
llabels = np.array(['Bars and Clubs', 'Restaurants', 'Services', 'Leisure and Sports',
                    'Education and Culture', 'Nature and Parks', 'Transportation', 'Residential'])
colors = np.array(['#d5f4e6', '#80ced6', '#618685', '#ffef96', '#50394c', '#b2b2b2', '#f4e1d2', '#fefbd8'])
ny_pie_clusters['labels'] = llabels
ny_pie_clusters['colors'] = colors
ny_pie_clusters

In [None]:
trace = go.Pie(labels=ny_pie_clusters['labels'], 
               values=ny_pie_clusters['results_0'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=ny_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
trace = go.Pie(labels=ny_pie_clusters['labels'], 
               values=ny_pie_clusters['results_1'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=ny_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
trace = go.Pie(labels=ny_pie_clusters['labels'], 
               values=ny_pie_clusters['results_2'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=ny_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
trace = go.Pie(labels=ny_pie_clusters['labels'], 
               values=ny_pie_clusters['results_3'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=ny_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
trace = go.Pie(labels=ny_pie_clusters['labels'], 
               values=ny_pie_clusters['results_4'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=ny_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
# Copy the grouped per-category means (the `_std` dataframe) so the source stays unchanged
to_clustered_neighborhoods = to_encoded_grouped_venues_std.copy()

# Columns list
clmns = ["Bars and Clubs", "Restaurants", "Services", "Leisure and Sports",
         "Education and Culture", "Nature and Parks", "Transportation", "Residential"]
    
# Cluster the data
kmeans = KMeans(n_clusters=5, random_state=0).fit(to_clustered_neighborhoods)
labels = kmeans.labels_

# Make the new Cluster column
to_clustered_neighborhoods['Cluster'] = labels

# Add the column into our list
clmns.append('Cluster')

# Let's analyze the clusters
to_pie_clusters = to_clustered_neighborhoods[clmns].groupby(['Cluster']).mean()
to_pie_clusters

In [None]:
to_pie_clusters.columns = [0,1,2,3,4,5,6,7]
to_pie_clusters = to_pie_clusters.T
to_pie_clusters.columns = ['results_0', 'results_1', 'results_2', 'results_3', 'results_4']
to_pie_clusters

In [None]:
llabels = np.array(['Bars and Clubs', 'Restaurants', 'Services', 'Leisure and Sports',
                    'Education and Culture', 'Nature and Parks', 'Transportation', 'Residential'])
colors = np.array(['#92a8d1', '#034f84', '#f7cac9', '#f7786b', '#deeaee', '#b1cbbb', '#eea29a', '#c94c4c'])
to_pie_clusters['labels'] = llabels
to_pie_clusters['colors'] = colors
to_pie_clusters

In [None]:
trace = go.Pie(labels=to_pie_clusters['labels'], 
               values=to_pie_clusters['results_0'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=to_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
trace = go.Pie(labels=to_pie_clusters['labels'], 
               values=to_pie_clusters['results_1'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=to_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
trace = go.Pie(labels=to_pie_clusters['labels'], 
               values=to_pie_clusters['results_2'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=to_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
trace = go.Pie(labels=to_pie_clusters['labels'], 
               values=to_pie_clusters['results_3'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=to_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
trace = go.Pie(labels=to_pie_clusters['labels'], 
               values=to_pie_clusters['results_4'], 
               hoverinfo='label+percent', 
               textfont=dict(size=20),
               marker=dict(colors=to_pie_clusters['colors']))
py.plotly.iplot([trace], filename='basic_pie_chart')

In [None]:
# set color scheme for the clusters
ny_rainbow = ['#ffef96', '#d5f4e6', '#b2b2b2', '#618685', '#80ced6']
to_rainbow = ['#92a8d1', '#034f84', '#f7cac9', '#b1cbbb', '#c94c4c']

def generate_map_of_city_clustered_neighborhoods(city_name, city_neighborhoods, kclusters, rainbow):
    
    # Find the city's geographical coordinates with the Nominatim (OpenStreetMap) geocoder
    geolocator = Nominatim(user_agent="my_jupyter_notebook")
    # Rate-limit requests to at most one per second
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
    city_location = geocode(city_name)  # e.g. 'New York City, NY'
    city_latitude = city_location.latitude
    city_longitude = city_location.longitude
    print('The geographical coordinates of "{}" are {}, {}.'.format(city_name, city_latitude, city_longitude))
    
    # Report the number of clusters and neighborhoods in the clustered dataset
    print('The "{}" dataframe has {} clusters and {} neighborhoods.'.format(
          city_name,
          kclusters,
          len(city_neighborhoods['Neighborhood'].unique())))
    
    # create map of city using latitude and longitude values
    map_city = folium.Map(location=[city_latitude, city_longitude], zoom_start=10)

    # add markers to map
    for lat, lng, neighborhood, cluster in zip(city_neighborhoods['Neighborhood Latitude'], 
                                               city_neighborhoods['Neighborhood Longitude'], 
                                               city_neighborhoods['Neighborhood'], 
                                               city_neighborhoods['Cluster']):
        label = folium.Popup(str(neighborhood)+', Cluster: '+str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_city)  

    return map_city
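
One detail of the marker colors is worth spelling out. `KMeans` labels start at 0, so `rainbow[cluster-1]` sends cluster 0 to the last palette entry through Python's negative indexing and shifts every other cluster down by one. A minimal sketch using the NYC palette defined above (the names `palette` and `color_by_cluster` are illustrative, not part of the notebook):

```python
# NYC palette from the notebook; labels 0..4 are what KMeans(n_clusters=5) produces
palette = ['#ffef96', '#d5f4e6', '#b2b2b2', '#618685', '#80ced6']

# Same indexing as in the map function: cluster - 1
color_by_cluster = {cluster: palette[cluster - 1] for cluster in range(5)}

for cluster, color in color_by_cluster.items():
    print(cluster, color)
```

Each cluster still gets its own color, but cluster 0 is drawn with `palette[4]` and cluster 1 with `palette[0]`, which matters when matching map colors to a specific cluster.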

In [None]:
# Add the neighborhoods columns back to the clustered dataframe
ny_clustered_neighborhoods['Neighborhood'] = ny_encoded_grouped_venues_std_Neighborhood
ny_clustered_neighborhoods['Neighborhood Latitude'] = ny_encoded_grouped_venues_std_Latitude
ny_clustered_neighborhoods['Neighborhood Longitude'] = ny_encoded_grouped_venues_std_Longitude

# Drop the venues columns from the clustered dataframe
ny_clustered_neighborhoods = ny_clustered_neighborhoods.drop(["Bars and Clubs", 
                                                              "Restaurants", 
                                                              "Services", 
                                                              "Leisure and Sports",
                                                              "Education and Culture", 
                                                              "Nature and Parks", 
                                                              "Transportation", 
                                                              "Residential"], axis=1)
ny_clustered_neighborhoods.tail()

In [None]:
generate_map_of_city_clustered_neighborhoods('New York City, NY', ny_clustered_neighborhoods, 5, ny_rainbow)

In [None]:
# Add the neighborhoods columns back to the clustered dataframe
to_clustered_neighborhoods['Neighborhood'] = to_encoded_grouped_venues_std_Neighborhood
to_clustered_neighborhoods['Neighborhood Latitude'] = to_encoded_grouped_venues_std_Latitude
to_clustered_neighborhoods['Neighborhood Longitude'] = to_encoded_grouped_venues_std_Longitude

# Drop the venues columns from the clustered dataframe
to_clustered_neighborhoods = to_clustered_neighborhoods.drop(["Bars and Clubs", 
                                                              "Restaurants", 
                                                              "Services", 
                                                              "Leisure and Sports",
                                                              "Education and Culture", 
                                                              "Nature and Parks", 
                                                              "Transportation", 
                                                              "Residential"], axis=1)
to_clustered_neighborhoods.tail()

In [None]:
generate_map_of_city_clustered_neighborhoods('Toronto, ON', to_clustered_neighborhoods, 5, to_rainbow)

In [None]:
to_clustered_neighborhoods.loc[to_clustered_neighborhoods['Cluster'] == 4, 'Neighborhood']

In [None]:
to_clustered_neighborhoods.loc[to_clustered_neighborhoods['Cluster'] == 3, 'Neighborhood']

In [None]:
ny_clustered_neighborhoods.loc[(ny_clustered_neighborhoods['Cluster'] == 4)].Neighborhood

In [None]:
ny_clustered_neighborhoods.loc[(ny_clustered_neighborhoods['Cluster'] == 0)].Neighborhood