# Segmenting and Clustering Neighborhoods in Toronto
### Part 2

## Section 3.2 - Analysis and clustering

In [3]:
import os

import pandas as pd

### Load our work from part 1

In [4]:
path = os.path.join(os.path.abspath('../data'), 'TorontoVenues.csv')
venues_df = pd.read_csv(path)
venues_df.head()

Unnamed: 0,Neighborhood,Neighborhood_lat,Neighborhood_long,Venue,Category,Venue_lat,Venue_long
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,Park,43.751976,-79.33214
1,Parkwoods,43.753259,-79.329656,Sun Life,Construction & Landscaping,43.75476,-79.332783
2,Parkwoods,43.753259,-79.329656,Variety Store,Food & Drink Shop,43.751974,-79.333114
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,Hockey Arena,43.723481,-79.315635
4,Victoria Village,43.725882,-79.315572,Portugril,Portuguese Restaurant,43.725819,-79.312785


### Calculate the proportions of venue categories in each neighborhood

One way to do this is to one-hot encode the venue categories, which gives one row per venue with a 1 in the column for its category and 0 everywhere else. Averaging those rows within each neighborhood then gives the proportion of each venue category found there.

In [5]:
# One hot encode the venue categories for each neighborhood
venues_onehot = pd.get_dummies(venues_df['Category'], prefix='', prefix_sep='')

# Add the neighborhood column to the new dataframe
venues_onehot['Neighborhood'] = venues_df['Neighborhood']

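# The column is not guaranteed to be first: pandas appends a new column at the end,
# or overwrites an existing category column that is also named 'Neighborhood'.
# Find out where it is and move it to be the first column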
cols = list(venues_onehot.columns)
index = cols.index('Neighborhood')
fixed_columns = [venues_onehot.columns[index]] + cols[:index] + cols[index + 1:]
venues_onehot = venues_onehot[fixed_columns]

# Now we can calculate the proportions by getting the means of each category
venues_grouped = venues_onehot.groupby('Neighborhood').mean().reset_index()
venues_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
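To see why the mean of one-hot columns is a proportion, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical and not taken from TorontoVenues.csv. It also compares the result with `pd.crosstab(..., normalize='index')`, which is an alternative route to the same table rather than the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: four venues in neighborhood A, two in B
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Category': ['Park', 'Coffee Shop', 'Coffee Shop', 'Bank', 'Park', 'Park'],
})

# Route 1: one-hot encode each venue, then average within each neighborhood
onehot = pd.get_dummies(toy['Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])
by_mean = onehot.groupby('Neighborhood').mean()

# Route 2: count venues per (neighborhood, category) and normalize each row
by_crosstab = pd.crosstab(toy['Neighborhood'], toy['Category'], normalize='index')

print(by_mean)
print(by_crosstab)
```

Both tables should agree: half of A's venues are coffee shops, and every venue in B is a park.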


If these values really are proportions, each neighborhood's row should sum to approximately 1. Check the first five neighborhoods to verify this assumption.

In [6]:
totals = []
for index, row in venues_grouped.iterrows():
    if index > 4:
        break
    totals.append({'Neighborhood': row['Neighborhood'],
                   'Sum': row[1:].sum()})
pd.DataFrame(totals)
    

Unnamed: 0,Neighborhood,Sum
0,Agincourt,1.0
1,"Alderwood, Long Branch",1.0
2,"Bathurst Manor, Wilson Heights, Downsview North",1.0
3,Bayview Village,1.0
4,"Bedford Park, Lawrence Manor East",1.0
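The same check can be run over every row at once instead of only the first five. The sketch below uses a hypothetical stand-in frame named `grouped` with the same layout as `venues_grouped` (one label column followed by proportion columns), and `np.isclose` to allow for floating-point rounding.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for venues_grouped
grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Bank': [0.25, 0.0, 1 / 3],
    'Coffee Shop': [0.50, 0.0, 1 / 3],
    'Park': [0.25, 1.0, 1 / 3],
})

# Sum the proportion columns across each row
row_sums = grouped.drop(columns='Neighborhood').sum(axis=1)

# True only if every row sums to 1 within floating-point tolerance
all_ok = bool(np.isclose(row_sums, 1.0).all())
print(all_ok)
```

Applied to the real data, the same two lines would replace `grouped` with `venues_grouped`.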


We can identify the top 10 venue types for each neighborhood. We will also include the significance of each type as its 'weight', which is simply the proportion of the neighborhood's listed venues that fall in that category. For a neighborhood with fewer venue types than num_venues, the remaining slots are filled with categories that do not actually occur there, shown with a weight of 0.

The function below generates one row holding the top num_venues venue types along with their weights. Because it returns a dictionary, the rows are simple to assemble into a dataframe.

In [7]:
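def mostCommon(row, num_venues):
    # Everything after the first column (Neighborhood) is a category proportion
    row_cats = row.iloc[1:]
    row_cats_sorted = row_cats.sort_values(ascending=False)
    # Build the result in its own dict instead of rebinding the `row` argument
    result = {'Neighborhood': row.iloc[0]}
    for i in range(num_venues):
        result[f'Top_{i + 1}'] = row_cats_sorted.index[i]
        # .iloc makes the positional lookup explicit on a label-indexed Series
        result[f'Weight_{i + 1}'] = round(row_cats_sorted.iloc[i], 2)
    return result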

Assemble the dataframe

In [8]:
num_venues = 10
rows = []
# Build one row per neighborhood, not just the first num_venues of them
for i in range(len(venues_grouped)):
    rows.append(mostCommon(venues_grouped.iloc[i], num_venues))
top_venues = pd.DataFrame(rows)

top_venues.head()

Unnamed: 0,Neighborhood,Top_1,Weight_1,Top_2,Weight_2,Top_3,Weight_3,Top_4,Weight_4,Top_5,...,Top_6,Weight_6,Top_7,Weight_7,Top_8,Weight_8,Top_9,Weight_9,Top_10,Weight_10
0,Agincourt,Skating Rink,0.25,Lounge,0.25,Breakfast Spot,0.25,Latin American Restaurant,0.25,Mexican Restaurant,...,Molecular Gastronomy Restaurant,0.0,Modern European Restaurant,0.0,Mobile Phone Shop,0.0,Miscellaneous Shop,0.0,Middle Eastern Restaurant,0.0
1,"Alderwood, Long Branch",Pizza Place,0.25,Skating Rink,0.12,Coffee Shop,0.12,Pub,0.12,Sandwich Place,...,Pharmacy,0.12,Gym,0.12,Hobby Shop,0.0,Movie Theater,0.0,Men's Store,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,0.1,Bank,0.1,Ice Cream Shop,0.05,Fried Chicken Joint,0.05,Chinese Restaurant,...,Bridal Shop,0.05,Gas Station,0.05,Diner,0.05,Sandwich Place,0.05,Intersection,0.05
3,Bayview Village,Japanese Restaurant,0.25,Café,0.25,Bank,0.25,Chinese Restaurant,0.25,Motel,...,Moroccan Restaurant,0.0,Monument / Landmark,0.0,Molecular Gastronomy Restaurant,0.0,Modern European Restaurant,0.0,Accessories Store,0.0
4,"Bedford Park, Lawrence Manor East",Sushi Restaurant,0.09,Italian Restaurant,0.09,Coffee Shop,0.09,Sandwich Place,0.09,Butcher,...,Thai Restaurant,0.04,Indian Restaurant,0.04,Café,0.04,Restaurant,0.04,Juice Bar,0.04
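The zero-weight entries in rows such as Agincourt are padding rather than real venue types. As a minimal sketch of one way to leave them out, the hypothetical helper `top_categories` below drops zero proportions before taking the largest values with `Series.nlargest`. It is an illustration on made-up numbers, not part of the notebook's pipeline.

```python
import pandas as pd

def top_categories(proportions, num_venues):
    """Return (category, weight) pairs for the largest nonzero proportions."""
    # Drop categories that do not occur in this neighborhood
    nonzero = proportions[proportions > 0]
    # Keep at most num_venues of the largest remaining proportions
    top = nonzero.nlargest(num_venues)
    return [(cat, round(float(weight), 2)) for cat, weight in top.items()]

# Hypothetical proportions for a single neighborhood
row = pd.Series({'Bank': 0.25, 'Coffee Shop': 0.5, 'Park': 0.25, 'Pub': 0.0})
result = top_categories(row, 10)
print(result)
```

Because the list can be shorter than `num_venues`, it does not fit the fixed `Top_n` / `Weight_n` column layout directly. A dataframe built from it would need missing slots filled, for example with `None`.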
