# Capstone Project

## Problem statement:

### Compare how similar cities such as New York, Toronto and Paris are, based on the overall mix of business types in their neighbourhoods.

## Approach:

### 1. Pull neighbourhood-level data for each city and aggregate it into a single summary per city (in the last assignment, summaries were created per neighbourhood).
### 2. Compute the Euclidean distance between each pair of cities from these summaries.
### 3. Compare the distances to assess how similar the cities are.
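
The three steps can be illustrated with a minimal sketch. It assumes each city summary is a dictionary that maps a venue category to its share of all venues in that city. The shares below are made-up illustrative numbers, and `euclidean_distance` is a hypothetical helper, not a function from this notebook.

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two category-share dictionaries."""
    # Use the union of categories so a category missing from one city counts as 0.
    categories = set(profile_a) | set(profile_b)
    return math.sqrt(sum((profile_a.get(c, 0.0) - profile_b.get(c, 0.0)) ** 2
                         for c in categories))

# Hypothetical category shares per city (illustrative numbers only).
city_profiles = {
    "Toronto":  {"Coffee Shop": 0.5, "Bakery": 0.3, "Park": 0.2},
    "New York": {"Coffee Shop": 0.5, "Bakery": 0.3, "Pizza Place": 0.2},
    "Paris":    {"Bakery": 0.6, "French Restaurant": 0.4},
}

# Distance for every unordered pair of cities.
distances = {
    (a, b): euclidean_distance(city_profiles[a], city_profiles[b])
    for a in city_profiles for b in city_profiles if a < b
}

# Smaller distance means more similar business mix.
for pair, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(pair, round(d, 3))
```

With these toy numbers the Toronto and New York profiles differ in only one category each, so that pair ends up closest.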


## Data sources:
### 1. Wikipedia
### 2. Foursquare

## Important Packages:
### 1. requests
### 2. folium
### 3. beautifulsoup4 (imported as bs4)





# Data:

Loading libraries:

In [1]:
## Loading libraries

import urllib.request as urllib2
from bs4 import BeautifulSoup
import pandas as pd
import json
import requests
import numpy as np

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, available from pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

### Collecting data for Toronto:

In [2]:
## Scraping data from Wikipedia

url = r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page,'html.parser')
table = soup.find("table",{"class":"wikitable sortable"})

# Converting to dataframe 

table_rows = table.find_all('tr')

rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # the header row has no <td> cells, so it yields an empty row that is dropped during cleaning
    rows.append([cell.text.strip() for cell in cells])
    
data = pd.DataFrame(rows, columns=["Postcode", "Borough", "Neighborhood"])
data['Neighborhood'] = data.Neighborhood.str.strip()

# Cleaning data

data.dropna(how='all', inplace = True)
data = data[data.Borough != 'Not assigned']

data['Neighborhood'] = data.apply(lambda x: x['Borough'] if x['Neighborhood']=='Not assigned'  else x['Neighborhood'], axis = 1 )

# join the neighbourhoods that share a postcode into one comma-separated string
data['Neighborhood'] = data.groupby(['Postcode','Borough'])['Neighborhood'].transform(lambda x: ','.join(x))
data = data.drop_duplicates()
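
The effect of the cleaning steps can be checked on a few toy rows. This is a sketch under assumptions: the rows are illustrative, and it swaps in `groupby(...).agg` followed by `reset_index` in place of `transform` plus `drop_duplicates`, which gives one row per postcode directly.

```python
import pandas as pd

# Toy rows shaped like the scraped table (values are illustrative).
rows = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Not assigned"],
})

# Fall back to the borough name when no neighbourhood is assigned.
mask = rows["Neighborhood"] == "Not assigned"
rows.loc[mask, "Neighborhood"] = rows.loc[mask, "Borough"]

# One row per postcode, with its neighbourhoods joined by commas.
combined = (rows.groupby(["Postcode", "Borough"])["Neighborhood"]
                .agg(",".join)
                .reset_index())
print(combined)
```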


### Coordinate data

In [3]:
coordinate_data_url = r'https://cocl.us/Geospatial_data'
coord_data = pd.read_csv(coordinate_data_url).rename(columns = {'Postal Code': 'Postcode'})
coord_data.head()

| | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### Merging data

In [4]:
neighborhoods = pd.merge(data,coord_data, on = 'Postcode', how = 'inner').drop('Postcode', axis = 1)
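
An inner merge silently drops postcodes that appear in only one of the two tables. The sketch below, on toy data, uses the `indicator=True` option of `pd.merge` to list such rows before merging. The postcode `M9Z` is a made-up value used only to force a mismatch.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinate file (values are illustrative).
postcodes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# An outer merge with indicator=True labels each row 'both', 'left_only' or 'right_only'.
check = pd.merge(postcodes, coords, on="Postcode", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Postcode", "_merge"]])

# An inner merge keeps only the rows labelled 'both'.
matched = pd.merge(postcodes, coords, on="Postcode", how="inner")
print(matched)
```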

In [5]:
toronto_data = neighborhoods[neighborhoods.Borough.str.contains('Toronto')]

### Foursquare data

In [6]:
# Credentials
# Read from environment variables rather than hard-coding secrets in the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are a naming choice made here; set them before running.
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


In [7]:
## Functions to pull data from Foursquare
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# function that extracts the category of the venue
def get_category_type(row):
    # the category list sits under 'categories' or, in flattened rows, under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT}
            
        # make the GET request and stop on HTTP errors
        response = requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
        response.raise_for_status()
        results = response.json()["response"]['groups'][0]['items']
        
        # keep only relevant information for each nearby venue
        for v in results:
            categories = v['venue']['categories']
            venues_list.append((
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                categories[0]['name'] if categories else None)) # some venues have no category

    nearby_venues = pd.DataFrame(venues_list, columns=['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'])
    
    return nearby_venues
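
The nesting that `getNearbyVenues` walks through (`response`, then `groups[0]`, then `items`, then `venue`) can be tried offline. The payload below is hand-written and contains only the fields the function reads, so it is an assumption about shape rather than a real API response. `extract_venues` is a hypothetical helper, and the second venue is invented to show a missing category.

```python
# Hand-written payload with only the fields read above; a real response has more keys.
sample_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Tandem Coffee",
                   "location": {"lat": 43.653559, "lng": -79.361809},
                   "categories": [{"name": "Coffee Shop"}]}},
        {"venue": {"name": "Unnamed Spot",  # hypothetical venue without a category
                   "location": {"lat": 43.65, "lng": -79.36},
                   "categories": []}},
    ]}]}
}

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; category is None when missing."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue["categories"]
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"] if categories else None,
        ))
    return rows

venues = extract_venues(sample_response)
for row in venues:
    print(row)
```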

In [8]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

print(toronto_venues.shape)
toronto_venues.head()

(1690, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Harbourfront,Regent Park | 43.65426 | -79.360636 | Roselle Desserts | 43.653447 | -79.362017 | Bakery |
| 1 | Harbourfront,Regent Park | 43.65426 | -79.360636 | Tandem Coffee | 43.653559 | -79.361809 | Coffee Shop |
| 2 | Harbourfront,Regent Park | 43.65426 | -79.360636 | Toronto Cooper Koo Family Cherry St YMCA Centre | 43.653191 | -79.357947 | Gym / Fitness Center |
| 3 | Harbourfront,Regent Park | 43.65426 | -79.360636 | Body Blitz Spa East | 43.654735 | -79.359874 | Spa |
| 4 | Harbourfront,Regent Park | 43.65426 | -79.360636 | Morning Glory Cafe | 43.653947 | -79.361149 | Breakfast Spot |


### Data for the other cities will be collected in the same way.
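
Once a venue table exists for each city, a city-level summary can be built from its `Venue Category` column. The following is a sketch under assumptions: the two small tables are illustrative stand-ins for frames like `toronto_venues`, and `city_profile` is a hypothetical helper that uses the share of each category as the summary.

```python
import pandas as pd

def city_profile(venues):
    """Share of each venue category among all venues collected for one city."""
    return venues["Venue Category"].value_counts(normalize=True)

# Toy venue tables with the same 'Venue Category' column (values are illustrative).
toronto = pd.DataFrame({"Venue Category": ["Bakery", "Coffee Shop", "Coffee Shop", "Spa"]})
paris = pd.DataFrame({"Venue Category": ["Bakery", "Bakery", "French Restaurant", "Coffee Shop"]})

# One row per city, one column per category; categories absent from a city get 0.
profiles = pd.DataFrame({"Toronto": city_profile(toronto),
                         "Paris": city_profile(paris)}).T.fillna(0.0)
print(profiles)
```

Each row of `profiles` is a vector over the same set of categories, so any pair of rows can be compared directly.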