<h1 align='center'>Data Collection and Preparation</h1>

## Part I - Neighborhood Data Collection and Cleaning

In [119]:
import pandas as pd
import numpy as np
import seaborn as sns

### Read Data from URL

In [3]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df=pd.read_html(url)[0]

### Remove unassigned values

In [4]:
df.replace("Not assigned",np.nan,inplace=True)  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
df.dropna(thresh=2,inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
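To see what `thresh=2` does in this step, here is a minimal sketch on a toy table. The three rows are made up for illustration and are not taken from the Wikipedia page. A row survives only if at least two of its three columns are non-null, so a postcode with neither borough nor neighbourhood is dropped, while a postcode with a borough but no neighbourhood is kept.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical), mimicking the layout of the scraped table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

toy = toy.replace("Not assigned", np.nan)
# Keep rows that have at least 2 non-null values out of the 3 columns
kept = toy.dropna(thresh=2)
print(kept)
```

The row with only a postcode disappears; the row that still lacks a neighbourhood is left for the next step to fill in.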


### Group Data by postcodes and aggregate Neighborhoods


In [5]:
# Neighbourhoods that are unassigned take the name of their borough
df["Neighbourhood"]=df["Neighbourhood"].fillna(df["Borough"])
# Assumes one borough per postcode, so "first" returns it as a plain string; neighbourhood names are joined
cleanData=df.groupby(["Postcode"]).agg({"Borough":"first",
                                        "Neighbourhood":lambda x: ', '.join(x)}) \
                                        .reset_index()

In [6]:
print("Shape of the Data :",cleanData.shape)
cleanData.head()

Shape of the Data : (103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
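The grouping step can be sketched on a toy table as follows. The rows are illustrative, and taking the `"first"` borough per postcode is an assumption that each postcode belongs to exactly one borough.

```python
import pandas as pd

# Toy rows (illustrative): one postcode with two neighbourhoods, one with none
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park", None],
})

# Fall back to the borough name where the neighbourhood is missing
toy["Neighbourhood"] = toy["Neighbourhood"].fillna(toy["Borough"])

# One row per postcode; neighbourhood names are joined into a single string
grouped = (toy.groupby("Postcode")
              .agg({"Borough": "first", "Neighbourhood": ", ".join})
              .reset_index())
print(grouped)
```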


## Part II - Add Location Data

In [7]:
geoData=pd.read_csv("https://cocl.us/Geospatial_data")
geoData.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [182]:
places=cleanData.merge(geoData,left_on="Postcode",right_on="Postal Code").drop("Postal Code",axis=1)
places=places.sort_values('Neighbourhood').reset_index().drop('index',axis=1)
places.shape

(103, 5)
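An inner merge silently drops postcodes that have no coordinates. One way to check for that, sketched here on toy frames, is a left merge with `indicator=True` and `validate="one_to_one"`. The two coordinate rows are copied from the table above; the missing row for M1E is deliberately left out for illustration.

```python
import pandas as pd

left = pd.DataFrame({"Postcode": ["M1B", "M1C", "M1E"],
                     "Borough": ["Scarborough"] * 3})
# M1E is omitted on purpose to simulate a postcode without coordinates
right = pd.DataFrame({"Postal Code": ["M1B", "M1C"],
                      "Latitude": [43.806686, 43.784535],
                      "Longitude": [-79.194353, -79.160497]})

# validate raises if a postcode appears more than once on either side
checked = left.merge(right, left_on="Postcode", right_on="Postal Code",
                     how="left", indicator=True, validate="one_to_one")
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print("Postcodes without coordinates:", missing)
```

In the notebook itself, the shape `(103, 5)` matching the 103 rows of `cleanData` already indicates that no postcode was lost.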

<h1 align='center'>Data Analysis</h1>

## Part III - Neighborhood Clustering

In [9]:
import folium
# Toronto Co-ordinates
lat=43.6532
long=-79.3832

### View Neighborhoods
In the folium display, we draw circles whose radius is given in meters, rather than fixed-size markers, to visualize the area that the radius option of the Foursquare API would cover. As the map below shows, a radius of 700 m gives a reasonable coverage area for most neighborhoods, except in downtown Toronto. The intersections between circles are not a problem, because we are interested only in the places that lie close to each neighborhood.
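Whether two 700 m circles intersect can also be checked numerically: they overlap when their centers are less than 1400 m apart. The sketch below uses the haversine formula with a mean Earth radius of 6371 km, which is an approximation, and the coordinates of M1G and M1H from the geospatial table above. The helper name `haversine_m` is introduced here for illustration.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Centers of M1G (Woburn) and M1H (Cedarbrae)
d = haversine_m(43.770992, -79.216917, 43.773136, -79.239476)
radius = 700
print(f"distance: {d:.0f} m, circles overlap: {d < 2 * radius}")
```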

In [308]:
torontoMap=folium.Map([lat,long],zoom_start=11)
# Loop variables are named nLat/nLong so the Toronto coordinates in lat/long are not overwritten
for nLat, nLong, borough, neighborhood in zip(places['Latitude'], places['Longitude'], places['Borough'], places['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.Circle(
        [nLat, nLong],
        radius=700,  # meters, same as the Foursquare search radius
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.55).add_to(torontoMap)
torontoMap

### Options for Foursquare API

In [27]:
import requests
CLIENT_ID = 'Foursquare ID' 
CLIENT_SECRET = 'Foursquare Secret' 
VERSION = '20180605' # Foursquare API version
radius=700
LIMIT=100
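Rather than pasting credentials into the notebook, they could be read from environment variables. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and must be set in the shell before the notebook is started.

```python
import os

# Hypothetical variable names; any naming scheme works
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

if not (CLIENT_ID and CLIENT_SECRET):
    print("Foursquare credentials are not set; API calls will fail.")
```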

### Functions to make calls to the Foursquare API and parse the JSON Data into a Data Frame

In [62]:
def makeAPICall(lat,long):
    # Query the Foursquare "explore" endpoint for venues around (lat, long)
    url="https://api.foursquare.com/v2/venues/explore"
    params={"client_id":CLIENT_ID,"client_secret":CLIENT_SECRET,"ll":"{},{}".format(lat,long),
            "v":VERSION,"radius":radius,"limit":LIMIT}
    result = requests.get(url,params=params).json()
    return result

In [64]:
from IPython.display import clear_output
def getVenues(lats,longs,places):
    venues=[]
    # Count from 1 so the progress message shows how many neighborhoods have been fetched
    for i,(lat,lng,place) in enumerate(zip(lats,longs,places),start=1):
        items=makeAPICall(lat,lng)['response']['groups'][0]['items']
        clear_output()
        print("Number of Neighborhood Data Obtained:",i)
        for item in items:
            venues.append([place,
                           item['venue']['name'],
                           item['venue']['categories'][0]['name'],
                           item['venue']['location']["lat"],
                           item['venue']['location']["lng"],
                           lat,
                           lng                           
                          ])
    return pd.DataFrame(data=venues,columns=["Neighborhood","Venue","Category",'Venue_lat','Venue_long',"lat","long"])
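The parsing above assumes that every venue has at least one category; a venue with an empty `categories` list would raise an `IndexError`. A more defensive variant might look like the sketch below. The helper `parse_items` and the sample data are hand-written for illustration and are not real API output.

```python
def parse_items(place, items):
    """Turn Foursquare 'explore' items into flat rows, skipping uncategorised venues."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # no category to cluster on
        rows.append([place, venue["name"], categories[0]["name"],
                     venue["location"]["lat"], venue["location"]["lng"]])
    return rows

# Hand-written sample shaped like response['groups'][0]['items']
sample = [
    {"venue": {"name": "Tim Hortons", "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 43.802, "lng": -79.198169}}},
    {"venue": {"name": "Unnamed place", "categories": [],
               "location": {"lat": 43.8, "lng": -79.2}}},
]
rows = parse_items("Rouge, Malvern", sample)
print(rows)
```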

In [57]:
venueData=getVenues(places["Latitude"],places["Longitude"],places["Neighbourhood"])

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, 

###  Venue data from Foursquare 

In [68]:
venueData.head()

Unnamed: 0,Neighborhood,Venue,Category,Venue_lat,Venue_long,lat,long
0,"Rouge, Malvern",Images Salon & Spa,Spa,43.802283,-79.198565,43.806686,-79.194353
1,"Rouge, Malvern",Wendy's,Fast Food Restaurant,43.807448,-79.199056,43.806686,-79.194353
2,"Rouge, Malvern",Wendy's,Fast Food Restaurant,43.802008,-79.19808,43.806686,-79.194353
3,"Rouge, Malvern",Tim Hortons,Coffee Shop,43.802,-79.198169,43.806686,-79.194353
4,"Rouge, Malvern",Lee Valley,Hobby Shop,43.803161,-79.199681,43.806686,-79.194353


### Data Preparation for analysis
<ul>
    <li>Perform one-hot encoding on the unique venue categories</li>
    <li>Group the data by neighborhood and prepare it for analysis</li>
</ul>

In [232]:
venueClasses=pd.get_dummies(venueData["Category"])
venueClasses["Neighbourhood"]=venueData["Neighborhood"]
clusterData=venueClasses.groupby("Neighbourhood").sum()
X=clusterData.reset_index().drop("Neighbourhood",axis=1)
X.head()

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,3,...,1,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
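On a toy table, the one-hot encoding and grouping work as sketched below. The neighbourhood names `A` and `B` and the venue rows are made up; `dtype=int` is used so that the dummy columns are integer counts regardless of the pandas version.

```python
import pandas as pd

# Toy venue list (illustrative)
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Category": ["Coffee Shop", "Coffee Shop", "Spa", "Spa"],
})

# One column per category, one row per venue
onehot = pd.get_dummies(toy["Category"], dtype=int)
onehot["Neighbourhood"] = toy["Neighborhood"]

# Summing the indicator columns gives venue counts per neighbourhood
counts = onehot.groupby("Neighbourhood").sum()
print(counts)
```

Each row of `counts` is the feature vector of one neighbourhood, which is what the clustering step consumes.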


### Identify Most common venues in each Neighborhood
We build a combined DataFrame that holds the location data and the ten most common venue categories of each neighborhood.

In [304]:
tmp=clusterData.T
tmpDict={}
for col in tmp.items():  # iteritems() was removed in pandas 2.0; items() is the replacement
    tmpDict.update({col[0]:list(col[1].sort_values(ascending=False)[:10].index)})
cols=["Venue_"+str(i) for i in range(1,11)]
finData=pd.DataFrame.from_dict(tmpDict,orient='index',columns=cols).reset_index().rename(columns={'index':'Neighbourhood'})
finData=places.merge(finData,on="Neighbourhood")
finData.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Venue_1,Venue_2,Venue_3,Venue_4,Venue_5,Venue_6,Venue_7,Venue_8,Venue_9,Venue_10
0,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,Coffee Shop,Café,Bar,Steakhouse,American Restaurant,Cosmetics Shop,Theater,Asian Restaurant,Restaurant,Sushi Restaurant
1,M1S,Scarborough,Agincourt,43.7942,-79.262029,Skating Rink,Fabric Shop,Badminton Court,Breakfast Spot,Pool Hall,Lounge,Shanghai Restaurant,Convenience Store,Sandwich Place,Motorcycle Shop
2,M1V,Scarborough,"Agincourt North, L'Amoreaux East, Milliken, St...",43.815252,-79.284577,Pizza Place,Chinese Restaurant,Pharmacy,Bubble Tea Shop,Bakery,Caribbean Restaurant,Noodle House,Shop & Service,BBQ Joint,Malay Restaurant
3,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437,Pizza Place,Grocery Store,Hardware Store,Sandwich Place,Fast Food Restaurant,Beer Store,Discount Store,Fried Chicken Joint,Japanese Restaurant,Pharmacy
4,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Pizza Place,Convenience Store,Gas Station,Pharmacy,Pool,Pub,Sandwich Place,Skating Rink,Coffee Shop,Gym
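The ranking step can be sketched on a small count table as follows. The table is made up, and `nlargest` is swapped in here for the `sort_values(...)[:10]` call used above.

```python
import pandas as pd

# Toy venue counts per neighbourhood (illustrative)
counts = pd.DataFrame(
    {"Coffee Shop": [5, 0], "Spa": [1, 2], "Pizza Place": [3, 4]},
    index=pd.Index(["A", "B"], name="Neighbourhood"),
)

top_n = 2
# For each neighbourhood, keep the names of the top_n most frequent categories
top = {name: row.nlargest(top_n).index.tolist() for name, row in counts.iterrows()}
cols = ["Venue_" + str(i) for i in range(1, top_n + 1)]
ranked = pd.DataFrame.from_dict(top, orient="index", columns=cols)
print(ranked)
```

Note that a neighbourhood with fewer than ten distinct categories will still get ten entries in the notebook's table; the trailing ones are categories with a count of zero.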


### Employ KMeans Clustering and add the cluster labels to the Data

In [270]:
from sklearn.cluster import KMeans
kmean=KMeans(n_clusters=7,random_state=0)
kmean.fit(X)

In [306]:
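# Match labels to rows by neighbourhood name (the index of clusterData) rather than by row position
labelMap=dict(zip(clusterData.index,kmean.labels_))
finData.insert(3,'Cluster',finData['Neighbourhood'].map(labelMap))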
finData.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Cluster,Latitude,Longitude,Venue_1,Venue_2,Venue_3,Venue_4,Venue_5,Venue_6,Venue_7,Venue_8,Venue_9,Venue_10
0,M5H,Downtown Toronto,"Adelaide, King, Richmond",4,43.650571,-79.384568,Coffee Shop,Café,Bar,Steakhouse,American Restaurant,Cosmetics Shop,Theater,Asian Restaurant,Restaurant,Sushi Restaurant
1,M1S,Scarborough,Agincourt,6,43.7942,-79.262029,Skating Rink,Fabric Shop,Badminton Court,Breakfast Spot,Pool Hall,Lounge,Shanghai Restaurant,Convenience Store,Sandwich Place,Motorcycle Shop
2,M1V,Scarborough,"Agincourt North, L'Amoreaux East, Milliken, St...",6,43.815252,-79.284577,Pizza Place,Chinese Restaurant,Pharmacy,Bubble Tea Shop,Bakery,Caribbean Restaurant,Noodle House,Shop & Service,BBQ Joint,Malay Restaurant
3,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",6,43.739416,-79.588437,Pizza Place,Grocery Store,Hardware Store,Sandwich Place,Fast Food Restaurant,Beer Store,Discount Store,Fried Chicken Joint,Japanese Restaurant,Pharmacy
4,M8W,Etobicoke,"Alderwood, Long Branch",6,43.602414,-79.543484,Pizza Place,Convenience Store,Gas Station,Pharmacy,Pool,Pub,Sandwich Place,Skating Rink,Coffee Shop,Gym
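KMeans returns labels in the row order of the matrix it was fitted on, so attaching them to another table is safest when done by neighbourhood name. The sketch below shows this on toy counts; the data are made up, and `n_init=10` is set explicitly only to keep the result independent of the scikit-learn default.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy counts (illustrative): A and B are coffee-heavy, C and D are pizza-heavy
counts = pd.DataFrame(
    {"Coffee Shop": [9, 8, 0, 1], "Pizza Place": [0, 1, 7, 9]},
    index=pd.Index(["A", "B", "C", "D"], name="Neighbourhood"),
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(counts)
labels = pd.Series(model.labels_, index=counts.index, name="Cluster")

# A table in a different row order still gets the right label per neighbourhood
info = pd.DataFrame({"Neighbourhood": ["D", "A", "C", "B"]})
info["Cluster"] = info["Neighbourhood"].map(labels)
print(info)
```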


### Plot the neighborhoods by clusters

In [289]:
#Color Schema for the plotting of markers
import matplotlib.colors as colors
colorList=[colors.to_hex(x) for x in['b','g','r','c','m','y','k']]  

In [307]:
# Plot using folium
# Center on the Toronto coordinates explicitly; the loop variables get their own names (nLat/nLong)
clusterMap=folium.Map([43.6532,-79.3832],zoom_start=11)
for nLat, nLong, cluster, neighborhood in zip(finData['Latitude'], finData['Longitude'], finData['Cluster'], finData['Neighbourhood']):
    label = '{}, {}'.format(neighborhood,cluster)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [nLat, nLong],
        radius=5,
        popup=label,
        color=colorList[cluster],
        fill=True,
        fill_color=colorList[cluster],
        fill_opacity=0.5).add_to(clusterMap)
clusterMap

## Inferences
<ul>
    <li>The clustering finds several distinct clusters within the Toronto urban area. This is plausible, because urban areas contain a wide variety of venue types, which leads to several clusters.</li>
    <li>The map also shows that most clusters can be identified visually: neighborhoods in the same cluster tend to lie in the same zone.</li>
    <li>Many neighborhoods fall into the cluster colored black, which suggests that they are suburban regions around the city with similar characteristics.</li>
</ul>
Overall, the clustering appears fairly accurate and sensible.
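These visual impressions could be backed up by counting how many neighborhoods fall into each cluster and which venue category dominates it. The sketch below uses a made-up table with the same `Cluster` and `Venue_1` columns as `finData`; the values are illustrative and not results from the notebook.

```python
import pandas as pd

# Toy stand-in for finData (values are illustrative)
toy = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D", "E"],
    "Cluster": [6, 6, 6, 4, 1],
    "Venue_1": ["Pizza Place", "Pizza Place", "Park", "Coffee Shop", "Café"],
})

# Number of neighbourhoods per cluster
sizes = toy["Cluster"].value_counts()
# Most frequent top venue category within each cluster
top_venue = toy.groupby("Cluster")["Venue_1"].agg(lambda s: s.mode().iloc[0])
print(sizes)
print(top_venue)
```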