# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#Introduction)
* [Data](#Data)
* [Methodology](#Methodology)
* [Analysis](#Analysis)
* [Results and Discussion](#Results)
* [Conclusion](#Conclusion)

# Introduction

### Background
Toronto is Canada's largest city, the fourth largest in North America, and home to a diverse population of about 2.8 million people. It is a global centre for business, finance, arts and culture and is consistently ranked one of the world's most livable cities.
### Problem
Opening a restaurant raises the question of what makes one successful. Food and service matter, but location can be just as crucial. The target audience of this project is therefore people who are looking to open a new restaurant. The project segments the neighborhoods of Toronto into major clusters and examines the food venues in each. The results should help with choosing a location by revealing the culture and variety of each neighborhood.

# Data 

### Toronto City Dataset
The data will be scraped from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. After scraping, the Toronto City data will be preprocessed. The data consists of __Postal Code__, __Borough__, and __Neighborhood__.

In [1]:
from bs4 import BeautifulSoup
from pattern.web import download
import pandas as pd

In [4]:
html_doc = download('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(html_doc,'lxml')
wiki_table = soup.find('table',class_='wikitable').find_all('tr')
toronto_data = []
# skip the header row (index 0) and read the three cells of every data row
for row in wiki_table[1:]:
    data = row.find_all('td')
    postcode = data[0].text
    borough = data[1].text
    neighborhood = data[2].text.strip()
    toronto_data.append([postcode, borough, neighborhood])
toronto_df = pd.DataFrame(toronto_data, columns=['PostalCode','Borough','Neighborhood'])
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [6]:
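# copy the filtered rows so that later column assignments do not trigger SettingWithCopyWarning
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned'].copy()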
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


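More than one neighborhood can exist in one postal code area. Rows that share a postal code are combined into one row, with the neighborhoods separated by commas.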

In [7]:
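# select the Neighborhood column explicitly so that transform returns a Series, not a DataFrame
toronto_df['Neighborhood'] = toronto_df.groupby(['PostalCode','Borough'])['Neighborhood'].transform(lambda x: ', '.join(x))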
toronto_df = toronto_df.drop_duplicates().reset_index(drop=True)
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [8]:
toronto_df['Neighborhood'] = toronto_df.apply(lambda row: row['Borough'] if row['Neighborhood'] == 'Not assigned' else row['Neighborhood'], axis=1)
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


In [9]:
toronto_df.shape

(103, 3)
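The three cleaning rules above (drop unassigned boroughs, combine neighborhoods per postal code, fall back to the borough name) can be checked on a small hand-made table. This is a minimal sketch, not the notebook's own code: the rows in `raw` are hypothetical, and it uses `agg` and `Series.where` in place of the `transform` and `apply` calls used above.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped table (illustration only, not scraped output)
raw = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M6A', 'North York', 'Lawrence Heights'],
        ['M6A', 'North York', 'Lawrence Manor'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# Rule 1: keep only rows with an assigned borough
clean = raw[raw['Borough'] != 'Not assigned']

# Rule 2: one row per postal code, neighborhoods joined with ', '
clean = (
    clean.groupby(['PostalCode', 'Borough'], as_index=False, sort=False)['Neighborhood']
    .agg(', '.join)
)

# Rule 3: an unassigned neighborhood falls back to the borough name
clean['Neighborhood'] = clean['Neighborhood'].where(
    clean['Neighborhood'] != 'Not assigned', clean['Borough']
)

print(clean)
```

Aggregating directly yields one row per postal code, so no separate `drop_duplicates` step is needed in this variant.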

### Geographical Coordinates
The Toronto City data will be merged with the geographical coordinates of each postal code. The geographical coordinates data consists of __Postal Code__, __Latitude__, and __Longitude__. Link: http://cocl.us/Geospatial_data

In [5]:
geographical_coordinates_df = pd.read_csv('Geospatial_Coordinates.csv')
geographical_coordinates_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [10]:
toronto_df = pd.merge(toronto_df, geographical_coordinates_df, left_on='PostalCode', right_on='Postal Code')
toronto_df = toronto_df[['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494


In [11]:
toronto_df.shape

(103, 5)
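The row count stays at 103 after the merge, so every postal code found a match. An inner merge would drop unmatched codes silently, so it can be worth checking this explicitly. The sketch below is one possible check, assuming a left merge with `indicator=True`; the postal code `M0X` and its neighborhood name are hypothetical values added only to show an unmatched row.

```python
import pandas as pd

# Two real rows from the tables above plus one hypothetical code with no coordinates
neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Hypothetical Place'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every neighborhood; indicator=True adds a '_merge' column;
# validate='one_to_one' raises if a postal code is duplicated on either side
merged = pd.merge(
    neighborhoods, coordinates,
    how='left', left_on='PostalCode', right_on='Postal Code',
    validate='one_to_one', indicator=True,
)

# Postal codes that found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```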

### Foursquare API
The Foursquare API, a location data provider, will be used to find the venues in each postal code zone, searching within a fixed radius (500 metres by default in `getNearbyVenues`) of each neighborhood's coordinates. The data from the Foursquare API consists of __Venue Name__, __Venue Latitude__, __Venue Longitude__, and __Venue Category__.

In [19]:
import requests

Define Foursquare Credentials and Version

In [15]:
import os

# Read the credentials from environment variables rather than hardcoding them;
# the variable names below are an assumed convention, not part of the Foursquare API
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API


In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return one row per venue found within `radius` metres of each neighborhood."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # build the query parameters for the explore endpoint
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request and stop early on an HTTP error
        response = requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
        response.raise_for_status()
        results = response.json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single DataFrame
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])
    return nearby_venues

In [20]:
toronto_venues_df = getNearbyVenues(names=toronto_df['Neighborhood'], latitudes=toronto_df['Latitude'], 
                                    longitudes=toronto_df['Longitude'])
toronto_venues_df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [21]:
toronto_venues_df.to_csv('toronto_venues.csv')
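`getNearbyVenues` reads the first category of every venue with `categories[0]`, which raises an `IndexError` if a venue comes back with an empty category list. The following is a minimal sketch of a more tolerant extraction step. The helper name `extract_venue_rows` is hypothetical, and `sample_items` is a hand-written fragment that only mimics the fields the function above reads; it is not a real API response.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten explore 'items' into tuples, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # use None when the venue has no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Hand-written fragment shaped like the 'items' list (second venue is hypothetical)
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorized Venue',
               'location': {'lat': 43.7520, 'lng': -79.3330},
               'categories': []}},
]

rows = extract_venue_rows('Parkwoods', 43.753259, -79.329656, sample_items)
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or labeled before the category analysis, depending on how uncategorized venues should be treated.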

# Methodology

# Analysis

Let's check how many venues were returned for each neighborhood

In [22]:
toronto_venues_df.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Agincourt,5,5,5,5,5,5
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",2,2,2,2,2,2
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",9,9,9,9,9,9
"Alderwood, Long Branch",9,9,9,9,9,9
"Bathurst Manor, Downsview North, Wilson Heights",19,19,19,19,19,19
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
Berczy Park,55,55,55,55,55,55
"Birch Cliff, Cliffside West",4,4,4,4,4,4
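In the table above, "Adelaide, King, Richmond" returns exactly 100 venues, which equals `LIMIT`, so its true venue count within the radius may be higher. A short sketch of how such capped neighborhoods could be flagged follows. It rebuilds three of the counts from the table with placeholder venue names, so the `venues` frame is an assumption for illustration rather than the real `toronto_venues_df`.

```python
import pandas as pd

LIMIT = 100  # same cap as used for the API requests above

# Rebuild three counts from the table (100, 5 and 55 venues) with placeholder venue names
venues = pd.DataFrame({
    'Neighborhood': ['Adelaide, King, Richmond'] * 100
                    + ['Agincourt'] * 5
                    + ['Berczy Park'] * 55,
})
venues['Venue'] = ['venue {}'.format(i) for i in range(len(venues))]

# One count per neighborhood, largest first
counts = venues.groupby('Neighborhood').size().sort_values(ascending=False)

# Neighborhoods whose count reached the request limit and may be truncated
capped = counts[counts >= LIMIT].index.tolist()
print(counts)
print(capped)
```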


Let's find out how many unique categories can be curated from all the returned venues

In [24]:
print('There are {} unique categories.'.format(len(toronto_venues_df['Venue Category'].unique())))

There are 273 unique categories.
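Because the project targets people opening a restaurant, one natural next step is to separate the restaurant categories from the rest. The sketch below assumes that restaurant categories can be recognized by the word "Restaurant" in their name, which is a simplification. It uses the five categories shown in the venue table earlier plus one repeated value added for illustration.

```python
import pandas as pd

# Categories from the first venue rows shown earlier, plus one repeated 'Coffee Shop'
categories = pd.Series(
    ['Park', 'Food & Drink Shop', 'Hockey Arena',
     'Coffee Shop', 'Portuguese Restaurant', 'Coffee Shop'],
    name='Venue Category',
)

# Number of distinct categories (the duplicate is counted once)
n_unique = categories.nunique()

# Assumption: a category is a restaurant type if its name contains 'Restaurant'
restaurant_categories = sorted(
    categories[categories.str.contains('Restaurant')].unique()
)

print(n_unique)
print(restaurant_categories)
```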
