<h1 align='center'>Data Collection and Preparation</h1>

## Part I - Neighborhood Data Collection and cleaning

In [309]:
import pandas as pd
import numpy as np
import seaborn as sns

### Read Data from URL

In [310]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df=pd.read_html(url)[0]

### Remove unassigned values

In [311]:
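# Treat the "Not assigned" placeholder as missing data
df.replace("Not assigned",np.nan,inplace=True)
# Keep rows with at least two non-missing values; this drops postcodes that have neither a borough nor a neighbourhood
df.dropna(thresh=2,inplace=True)
df.head()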

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
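
The effect of `thresh=2` is easy to misread, so here is a small sketch on a made-up three-row table (the rows are illustrative and not taken from the Wikipedia page). A row is kept only when it has at least two non-missing values, so a postcode with a borough but no neighbourhood survives, while a postcode with neither is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the three columns of the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

toy = toy.replace("Not assigned", np.nan)
# thresh=2 keeps rows that have at least two non-missing values.
kept = toy.dropna(thresh=2)
print(kept)
```

The surviving row with a missing neighbourhood is the case that the next step handles by falling back to the borough name.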


### Group Data by postcodes and aggregate Neighborhoods


In [312]:
# Where the neighbourhood is missing, fall back to the borough name
missing=df["Neighbourhood"].isnull()
df.loc[missing,'Neighbourhood']=df.loc[missing,'Borough']
# One row per postcode: keep the borough and join the neighbourhood names
cleanData=df.groupby("Postcode").agg({"Borough":"first",
                                      "Neighbourhood":lambda x: ', '.join(x)}) \
                                      .reset_index()

In [313]:
print("Shape of the Data :",cleanData.shape)
cleanData.head()

Shape of the Data : (103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


## Part II - Add location Data

In [314]:
geoData=pd.read_csv("https://cocl.us/Geospatial_data")
geoData.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [315]:
# Inner join on the postcode; the duplicate key column is dropped afterwards
places=cleanData.merge(geoData,left_on="Postcode",right_on="Postal Code").drop("Postal Code",axis=1)
places=places.sort_values('Neighbourhood').reset_index(drop=True)
places.shape

(103, 5)
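
An inner merge silently drops postcodes that are missing from either table, so the unchanged row count (103) is the only evidence here that every postcode found its coordinates. A more explicit check, sketched below on toy frames with illustrative coordinates and a made-up postcode `M9Z`, uses the `indicator` option of `merge` to list the rows that did not match.

```python
import pandas as pd

# Toy stand-ins for cleanData and geoData; "M9Z" is a made-up unmatched postcode.
left = pd.DataFrame({"Postcode": ["M1B", "M1C", "M9Z"]})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.80, 43.78],
    "Longitude": [-79.19, -79.16],
})

# A left join keeps every postcode; the _merge column records where each row came from.
checked = left.merge(right, left_on="Postcode", right_on="Postal Code",
                     how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty list would confirm that no postcode was lost in the join.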

<h1 align='center'>Data Analysis</h1>

## Part III - Neighborhood Clustering

In [316]:
import folium
# Toronto Co-ordinates
lat=43.6532
long=-79.3832

### View Neighborhoods
In the folium display, we draw circles with a radius given in meters rather than point markers, to visualize the area that the `radius` option of the Foursquare API would cover. As we can see below, a radius of 700 m gives a reasonable footprint for most neighborhoods, except in downtown Toronto. Overlap between circles is not a problem, because we are only interested in which places lie close to each neighborhood.

In [317]:
torontoMap=folium.Map([lat,long],zoom_start=11)
# Loop variables are named differently from the global lat/long so the map center is not overwritten
for nLat, nLong, borough, neighborhood in zip(places['Latitude'], places['Longitude'], places['Borough'], places['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.Circle(
        [nLat, nLong],
        radius=700,  # meters, the same value as the Foursquare search radius
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.55).add_to(torontoMap)
torontoMap
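
To put a number on the overlap between circles, the great-circle distance between two neighborhood centers can be compared with twice the radius. The sketch below uses the haversine formula with a mean Earth radius of 6371 km (an assumption of this example); the two coordinate pairs are the M1B and M1C rows of the geospatial table, and `haversine_m` is a helper name introduced here, not part of the notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

RADIUS_M = 700
d = haversine_m(43.806686, -79.194353, 43.784535, -79.160497)
# Two circles of equal radius overlap when their centers are closer than 2 * radius.
overlap = d < 2 * RADIUS_M
print(round(d), overlap)
```

Running the same check over all pairs of neighborhoods would show where the 700 m circles overlap most.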

### Options for Foursquare API

In [330]:
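import os
import requests
# Credentials are read from environment variables (the variable names are placeholders)
# so that they are not stored in the notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605' # Foursquare API version
radius=700 # search radius in meters
LIMIT=100 # maximum number of venues requested per call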

### Functions to make calls to the Foursquare API and parse the JSON Data into a Data Frame

In [331]:
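def makeAPICall(lat,long):
    # Query the Foursquare "explore" endpoint for venues around (lat, long)
    url="https://api.foursquare.com/v2/venues/explore"
    params={"client_id":CLIENT_ID,"client_secret":CLIENT_SECRET,
            "ll":"{},{}".format(lat,long),"v":VERSION,"radius":radius,"limit":LIMIT}
    response = requests.get(url,params=params)
    response.raise_for_status()  # fail early on HTTP errors
    return response.json()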

In [333]:
from IPython.display import clear_output
def getVenues(lats,longs,names):
    # Build one row per venue returned for each neighborhood
    venues=[]
    for i,(lat,lng,name) in enumerate(zip(lats,longs,names)):
        items=makeAPICall(lat,lng)['response']['groups'][0]['items']
        clear_output()
        # i is zero-based, so the last message shows one less than the number of neighborhoods
        print("Number of Neighborhood Data Obtained:",i)
        for item in items:
            venues.append([name,
                           item['venue']['name'],
                           item['venue']['categories'][0]['name'],  # first listed category
                           item['venue']['location']["lat"],
                           item['venue']['location']["lng"],
                           lat,
                           lng
                          ])
    return pd.DataFrame(data=venues,columns=["Neighborhood","Venue","Category",'Venue_lat','Venue_long',"lat","long"])

In [334]:
venueData=getVenues(places["Latitude"],places["Longitude"],places["Neighbourhood"])

Number of Neighborhood Data Obtained: 102


### Venue data from Foursquare

In [336]:
venueData.head()

Unnamed: 0,Neighborhood,Venue,Category,Venue_lat,Venue_long,lat,long
0,"Adelaide, King, Richmond",Four Seasons Centre for the Performing Arts,Concert Hall,43.650592,-79.385806,43.650571,-79.384568
1,"Adelaide, King, Richmond",The Keg Steakhouse & Bar,Steakhouse,43.649937,-79.384196,43.650571,-79.384568
2,"Adelaide, King, Richmond",Nathan Phillips Square,Plaza,43.65227,-79.383516,43.650571,-79.384568
3,"Adelaide, King, Richmond",Shangri-La Toronto,Hotel,43.649129,-79.386557,43.650571,-79.384568
4,"Adelaide, King, Richmond",Rosalinda,Vegetarian / Vegan Restaurant,43.650252,-79.385156,43.650571,-79.384568


### Data Preparation for analysis
<ul>
    <li>Perform one-hot encoding on the venue categories
    <li>Group the data by neighborhood and prepare it for analysis
</ul>

In [337]:
venueClasses=pd.get_dummies(venueData["Category"])
venueClasses["Neighbourhood"]=venueData["Neighborhood"]
clusterData=venueClasses.groupby("Neighbourhood").sum()
X=clusterData.reset_index().drop("Neighbourhood",axis=1)
X.head()

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,3,...,1,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
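
The encoding and grouping step can be illustrated on a toy table. The sketch below uses made-up neighborhoods `A` and `B` and two categories; it shows that summing the indicator columns per neighborhood yields venue counts, and that `pd.crosstab` gives the same table in one call.

```python
import pandas as pd

# Hypothetical venue rows: one row per venue.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Category": ["Bar", "Bar", "Park", "Park"],
})

one_hot = pd.get_dummies(toy["Category"])        # one indicator column per category
one_hot["Neighborhood"] = toy["Neighborhood"]
counts = one_hot.groupby("Neighborhood").sum()   # venue counts per neighborhood

# Equivalent count table built directly from the two columns.
direct = pd.crosstab(toy["Neighborhood"], toy["Category"])
print(counts)
```

Each row of the result is the feature vector that the clustering step later receives for that neighborhood.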


### Identify Most common venues in each Neighborhood
We build a combined DataFrame of the location data and the ten most common venue categories in each neighborhood.

In [338]:
tmp=clusterData.T
tmpDict={}
# For each neighborhood (a column of tmp), keep the ten categories with the highest counts
for name,counts in tmp.items():
    tmpDict[name]=list(counts.sort_values(ascending=False)[:10].index)
cols=["Venue_"+str(i) for i in range(1,11)]
finData=pd.DataFrame.from_dict(tmpDict,orient='index',columns=cols).reset_index().rename(columns={'index':'Neighbourhood'})
finData=places.merge(finData,on="Neighbourhood")
finData.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Venue_1,Venue_2,Venue_3,Venue_4,Venue_5,Venue_6,Venue_7,Venue_8,Venue_9,Venue_10
0,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,Coffee Shop,Café,Bar,Steakhouse,Sushi Restaurant,Gym,Cosmetics Shop,Hotel,Theater,American Restaurant
1,M1S,Scarborough,Agincourt,43.7942,-79.262029,Skating Rink,Badminton Court,Shanghai Restaurant,Pool Hall,Breakfast Spot,Lounge,Motorcycle Shop,Restaurant,Yoga Studio,Dive Bar
2,M1V,Scarborough,"Agincourt North, L'Amoreaux East, Milliken, St...",43.815252,-79.284577,Chinese Restaurant,Pizza Place,Noodle House,Bakery,Fast Food Restaurant,Gym,Pharmacy,Shop & Service,Park,Malay Restaurant
3,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437,Grocery Store,Fast Food Restaurant,Sandwich Place,Beer Store,Pharmacy,Fried Chicken Joint,Coffee Shop,Pizza Place,Hardware Store,Ethiopian Restaurant
4,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,Pizza Place,Convenience Store,Skating Rink,Coffee Shop,Gym,Pharmacy,Sandwich Place,Pub,Dance Studio,Athletics & Sports
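
One caveat with taking the first ten entries of a sorted row: a neighborhood with fewer than ten distinct categories still receives ten names, and the trailing ones have a count of zero. The sketch below, on a made-up count row, keeps only the categories that actually occur; `top_venues` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def top_venues(row, n=10):
    """Return up to n category names with a positive count, most frequent first."""
    present = row[row > 0].sort_values(ascending=False)
    return list(present.index[:n])

# Made-up counts for a single neighborhood.
row = pd.Series({"Skating Rink": 2, "Lounge": 1, "Yoga Studio": 0, "Dive Bar": 0})
print(top_venues(row, n=3))
```

With this variant the list can be shorter than `n`, so building fixed `Venue_1` to `Venue_10` columns from it would require padding the list first.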


### Employ KMeans Clustering and add the cluster labels to the Data

In [339]:
from sklearn.cluster import KMeans
# n_init is set explicitly so the run matches the parameters printed below
kmean=KMeans(n_clusters=7,n_init=10,random_state=0)
kmean.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=7, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
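
The notebook does not say why seven clusters were chosen. One common way to compare candidate values of k is the silhouette score; the sketch below runs it on synthetic blobs rather than on `X`, so the numbers say nothing about the Toronto data. The range of k and the blob parameters are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the venue-count matrix X.
data, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    # Mean silhouette over all samples; values closer to 1 mean tighter, better separated clusters.
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Applying the same loop to `X` would give one piece of evidence for or against k = 7, alongside the visual check on the map.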

In [340]:
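# Map labels by neighborhood name so they stay aligned with the rows of finData
clusterLabels=pd.Series(kmean.labels_,index=clusterData.index)
finData.insert(3,'Cluster',finData['Neighbourhood'].map(clusterLabels))
finData.head()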

Unnamed: 0,Postcode,Borough,Neighbourhood,Cluster,Latitude,Longitude,Venue_1,Venue_2,Venue_3,Venue_4,Venue_5,Venue_6,Venue_7,Venue_8,Venue_9,Venue_10
0,M5H,Downtown Toronto,"Adelaide, King, Richmond",0,43.650571,-79.384568,Coffee Shop,Café,Bar,Steakhouse,Sushi Restaurant,Gym,Cosmetics Shop,Hotel,Theater,American Restaurant
1,M1S,Scarborough,Agincourt,3,43.7942,-79.262029,Skating Rink,Badminton Court,Shanghai Restaurant,Pool Hall,Breakfast Spot,Lounge,Motorcycle Shop,Restaurant,Yoga Studio,Dive Bar
2,M1V,Scarborough,"Agincourt North, L'Amoreaux East, Milliken, St...",3,43.815252,-79.284577,Chinese Restaurant,Pizza Place,Noodle House,Bakery,Fast Food Restaurant,Gym,Pharmacy,Shop & Service,Park,Malay Restaurant
3,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",3,43.739416,-79.588437,Grocery Store,Fast Food Restaurant,Sandwich Place,Beer Store,Pharmacy,Fried Chicken Joint,Coffee Shop,Pizza Place,Hardware Store,Ethiopian Restaurant
4,M8W,Etobicoke,"Alderwood, Long Branch",3,43.602414,-79.543484,Pizza Place,Convenience Store,Skating Rink,Coffee Shop,Gym,Pharmacy,Sandwich Place,Pub,Dance Studio,Athletics & Sports


### Plot the neighborhoods by clusters

In [343]:
# Color scheme for the markers; the list index is the cluster label
import matplotlib.colors as colors
colorList=[colors.to_hex(x) for x in ['b','g','r','k','m','y','c']]

In [344]:
# Plot using folium, centered on the Toronto coordinates
clusterMap=folium.Map([43.6532,-79.3832],zoom_start=11)
for nLat, nLong, cluster, neighborhood in zip(finData['Latitude'], finData['Longitude'], finData['Cluster'], finData['Neighbourhood']):
    label = '{}, {}'.format(neighborhood,cluster)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [nLat, nLong],
        radius=5,  # marker size in pixels
        popup=label,
        color=colorList[cluster],
        fill=True,
        fill_color=colorList[cluster],
        fill_opacity=0.5).add_to(clusterMap)
clusterMap

## Inferences
<ul>
    <li> The clustering produces several distinct clusters within the Toronto urban area. This is plausible, because urban areas contain a wide variety of venue types, which gives rise to several clusters.
    <li> The map also shows that most clusters are geographically coherent: neighborhoods in the same cluster tend to lie in the same zone.
    <li> Many neighborhoods fall into the cluster colored black, which indicates that they are suburban regions around the city with similar characteristics.
        </ul>
Overall, the clustering appears sensible, although this is a visual judgment and not a quantitative evaluation.