
# __Where should I move?__  
A moving guidance app based on Foursquare and clustering in Python


# Table of Contents
1. [Introduction/Business Problem](#intro)
1. [Python related setup](#setup)
1. [Data](#data)
1. [Methodology](#methodology)
1. [Results](#results)
1. [Discussion](#discussion)
1. [Conclusion](#conclusion)

<a id='intro'></a>

# Introduction/Business Problem

Many people move to an unfamiliar area for a job. These moves often happen on a tight schedule because of fixed start dates, which forces people to pick a neighborhood quickly, with limited time to see whether it is a good fit. What if there were a tool that related neighborhoods in a new city to those a person already knows?

In this project, I assume a user has lived in two regions and is considering a move to one of two others. I will create maps that label the neighborhoods of both the familiar and unfamiliar places by cluster. That way, the user can pick a new neighborhood that shares properties they liked in an old one.

This is intended as a proof of concept that could be extended to arbitrary old and new neighborhoods and, from there, to a broadly marketable application.

<a id='setup'></a>

# Setup

In [5]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library

In [27]:
# helper that queries Foursquare for venues near each neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    LIMIT = 100  # maximum number of venues requested per neighborhood
    
    venues_list=[]
    
    print_count = 0
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        if print_count <= 5:
            print(name, end=', ')
            print_count = print_count + 1
        else:
            print(name + '.')
            print_count = 0

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
            
    
    return nearby_venues
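
The request loop above indexes straight into the JSON response, so a neighborhood with no results or a venue without a category would raise an error. As a minimal sketch of a more defensive parsing step, the helper below flattens one response into row tuples. It assumes the same nesting that `getNearbyVenues` relies on; the name `parse_explore_items`, the `"Uncategorized"` fallback label, and the sample payload are all hypothetical and written by hand for illustration.

```python
def parse_explore_items(name, lat, lng, payload):
    """Flatten one venues/explore JSON payload into row tuples.

    Assumes the nesting used by getNearbyVenues:
    payload["response"]["groups"][0]["items"][i]["venue"].
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Hypothetical fallback label for venues that carry no category.
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written payload for illustration only; not real Foursquare data.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Pizza",
                   "location": {"lat": 40.0, "lng": -73.0},
                   "categories": [{"name": "Pizza Place"}]}},
        {"venue": {"name": "Unlabelled Spot",
                   "location": {"lat": 40.1, "lng": -73.1},
                   "categories": []}},
    ]}]}
}

rows = parse_explore_items("Example Hood", 40.05, -73.05, sample_payload)
empty_rows = parse_explore_items("Quiet Hood", 41.0, -74.0, {"response": {}})
```

Each tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues`.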

<a id='data'></a>

# Data

For this example, it is assumed that a person grew up in Toronto, then moved to Manhattan and is now considering a move to either Queens or Staten Island.

The New York data comes from the Week 3 lab. I have stored the neighborhoods table, which lists each neighborhood with its latitude, longitude, and borough. For Queens and Staten Island, Foursquare is queried to gather venue information. I have also stored the Manhattan venues table, which already contains venues from Foursquare.

The Toronto data also comes from the Week 3 lab; it lists each neighborhood with its latitude, longitude, and venues.

All of this data needs to be formatted for clustering, including one-hot encoding and label cleaning.

## Familiar Area 1: Toronto Data

In [7]:
# From week 3 lab
toronto_coordinates_df = pd.read_csv("Geospatial_Coordinates.csv")
toronto_coordinates_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [67]:
# From week 3 lab
toronto_venues_df = pd.read_pickle("toronto_venues.p")
print(f"The shape of the toronto table is: {toronto_venues_df.shape}")
toronto_venues_df.head()

The shape of the toronto table is: (2226, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
3,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place


## Familiar Area 2: Manhattan Data

In [9]:
# From week 3 tutorial
manhattan_venues_df = pd.read_pickle("manhattan_venues.p")
print(f"The shape of the manhattan table is: {manhattan_venues_df.shape}")
manhattan_venues_df.head()

The shape of the manhattan table is: (3303, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
4,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop


## Unfamiliar Place 1: Queens

In [11]:
all_new_york_df = pd.read_pickle("all_new_york_neighborhoods.p")
queens_df = all_new_york_df.loc[all_new_york_df.Borough == 'Queens']
queens_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
129,Queens,Astoria,40.768509,-73.915654
130,Queens,Woodside,40.746349,-73.901842
131,Queens,Jackson Heights,40.751981,-73.882821
132,Queens,Elmhurst,40.744049,-73.881656
133,Queens,Howard Beach,40.654225,-73.838138


## Unfamiliar Place 2: Staten Island

In [12]:
staten_island_df = all_new_york_df.loc[all_new_york_df.Borough == 'Staten Island']
staten_island_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
197,Staten Island,St. George,40.644982,-74.079353
198,Staten Island,New Brighton,40.640615,-74.087017
199,Staten Island,Stapleton,40.626928,-74.077902
200,Staten Island,Rosebank,40.615305,-74.069805
201,Staten Island,West Brighton,40.631879,-74.107182


## Foursquare Time!

In [13]:
info_file = "foursquare_info.sec"
info_df = pd.read_csv(info_file)
CLIENT_ID = info_df.ID.values[0] # your Foursquare ID
CLIENT_SECRET = info_df.SECRET.values[0] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [26]:
queens_venues_df = getNearbyVenues(names=queens_df.Neighborhood, 
                                   latitudes=queens_df.Latitude,
                                   longitudes=queens_df.Longitude)

Astoria Woodside Jackson Heights Elmhurst Howard Beach Corona Forest Hills
Kew Gardens Richmond Hill Flushing Long Island City Sunnyside East Elmhurst Maspeth
Ridgewood Glendale Rego Park Woodhaven Ozone Park South Ozone Park College Point
Whitestone Bayside Auburndale Little Neck Douglaston Glen Oaks Bellerose
Kew Gardens Hills Fresh Meadows Briarwood Jamaica Center Oakland Gardens Queens Village Hollis
South Jamaica St. Albans Rochdale Springfield Gardens Cambria Heights Rosedale Far Rockaway
Broad Channel Breezy Point Steinway Beechhurst Bay Terrace Edgemere Arverne
Rockaway Beach Neponsit Murray Hill Floral Park Holliswood Jamaica Estates Queensboro Hill
Hillcrest Ravenswood Lindenwood Laurelton Lefrak City Belle Harbor Rockaway Park
Somerville Brookville Bellaire North Corona Forest Hills Gardens Jamaica Hills Utopia
Pomonok Astoria Heights Hunters Point Sunnyside Gardens Blissville Roxbury Middle Village
Malba Hammels Bayswater Queensbridge 

In [28]:
staten_island_venues_df = getNearbyVenues(names=staten_island_df.Neighborhood, 
                                       latitudes=staten_island_df.Latitude,
                                       longitudes=staten_island_df.Longitude)

St. George, New Brighton, Stapleton, Rosebank, West Brighton, Grymes Hill, Todt Hill.
South Beach, Port Richmond, Mariner's Harbor, Port Ivory, Castleton Corners, New Springville, Travis.
New Dorp, Oakwood, Great Kills, Eltingville, Annadale, Woodrow, Tottenville.
Tompkinsville, Silver Lake, Sunnyside, Park Hill, Westerleigh, Graniteville, Arlington.
Arrochar, Grasmere, Old Town, Dongan Hills, Midland Beach, Grant City, New Dorp Beach.
Bay Terrace, Huguenot, Pleasant Plains, Butler Manor, Charleston, Rossville, Arden Heights.
Greenridge, Heartland Village, Chelsea, Bloomfield, Bulls Head, Richmond Town, Shore Acres.
Clifton, Concord, Emerson Hill, Randall Manor, Howland Hook, Elm Park, Manor Heights.
Willowbrook, Sandy Ground, Egbertville, Prince's Bay, Lighthouse Hill, Richmond Valley, Fox Hills.


## Confirm venues for new places
Queens and Staten Island now have their venues from Foursquare.

In [29]:
print(f"The shape of queens venues df is: {queens_venues_df.shape}")
queens_venues_df.head()

The shape of queens venues df is: (2106, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,40.768509,-73.915654,Favela Grill,40.767348,-73.917897,Brazilian Restaurant
1,Astoria,40.768509,-73.915654,Orange Blossom,40.769856,-73.917012,Gourmet Shop
2,Astoria,40.768509,-73.915654,Titan Foods Inc.,40.769198,-73.919253,Gourmet Shop
3,Astoria,40.768509,-73.915654,CrossFit Queens,40.769404,-73.918977,Gym
4,Astoria,40.768509,-73.915654,Simply Fit Astoria,40.769114,-73.912403,Gym


In [32]:
queens_venues_df['Venue Category'].value_counts()

Pizza Place                                 88
Deli / Bodega                               69
Chinese Restaurant                          61
Bakery                                      60
Donut Shop                                  55
Pharmacy                                    50
Bank                                        48
Sandwich Place                              44
Bar                                         44
Korean Restaurant                           41
Italian Restaurant                          39
Grocery Store                               37
Supermarket                                 36
Coffee Shop                                 36
Mexican Restaurant                          35
Gym / Fitness Center                        32
Bus Station                                 30
Thai Restaurant                             29
Beach                                       29
Latin American Restaurant                   29
Park                                        28
Ice Cream Sho

In [31]:
print(f"The shape of staten island venues df is: {staten_island_venues_df.shape}")
staten_island_venues_df.head()

The shape of staten island venues df is: (835, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,St. George,40.644982,-74.079353,A&S Pizzeria,40.64394,-74.077626,Pizza Place
1,St. George,40.644982,-74.079353,Beso,40.643306,-74.076508,Tapas Restaurant
2,St. George,40.644982,-74.079353,Richmond County Bank Ballpark,40.645056,-74.076864,Baseball Stadium
3,St. George,40.644982,-74.079353,Staten Island September 11 Memorial,40.646767,-74.07651,Monument / Landmark
4,St. George,40.644982,-74.079353,Nike Factory Store,40.645753,-74.077702,Sporting Goods Shop


In [33]:
staten_island_venues_df['Venue Category'].value_counts()

Pizza Place                                 50
Italian Restaurant                          43
Bus Stop                                    43
Deli / Bodega                               39
Pharmacy                                    23
Bagel Shop                                  22
Sandwich Place                              20
Bank                                        20
Grocery Store                               17
Coffee Shop                                 17
Chinese Restaurant                          17
Donut Shop                                  16
Bar                                         14
Cosmetics Shop                              14
Liquor Store                                13
Mexican Restaurant                          13
Park                                        13
American Restaurant                         13
Ice Cream Shop                              12
Train Station                               12
Sushi Restaurant                            11
Restaurant   

## Prepare for Clustering
Convert each of the four venues dataframes to one-hot format, average by neighborhood, and extract each neighborhood's top venues.

In [36]:
# Some helper functions

def get_grouped_df(venues_df):
    
    onehot_df = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")
    onehot_df['Neighborhood'] = venues_df['Neighborhood']
    fixed_columns = [onehot_df.columns[-1]] + list(onehot_df.columns[:-1])
    onehot_df = onehot_df[fixed_columns]
    
    grouped_df = onehot_df.groupby('Neighborhood').mean().reset_index()
    
    return grouped_df

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
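
To make the two helpers concrete, here is a small run of the same steps on a toy table. The neighborhoods and categories are made up; only the pandas calls mirror the functions above. Averaging the one-hot indicators gives each category's share of a neighborhood's venues, which is the feature used for clustering later.

```python
import pandas as pd

# Toy venue table; names and categories are made up for illustration.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Cafe", "Park", "Park", "Gym"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Averaging the indicators gives each category's share of a neighborhood's venues.
toy_grouped = onehot.groupby("Neighborhood").mean().reset_index()

# Rank categories for the first neighborhood, highest share first.
row_a = toy_grouped.iloc[0]
top_two_a = row_a.iloc[1:].sort_values(ascending=False).index[:2].tolist()
```

Neighborhood A has three cafes out of four venues, so its `Cafe` share is 0.75 and `Cafe` ranks first.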



In [39]:
toronto_grouped = get_grouped_df(toronto_venues_df)
manhattan_grouped = get_grouped_df(manhattan_venues_df)
queens_grouped = get_grouped_df(queens_venues_df)
staten_island_grouped = get_grouped_df(staten_island_venues_df)

In [41]:
# generate pretty column list for later
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
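
The loop above is fine for ten columns, but it would label the 21st column "21th" if `num_top_venues` grew. A possible alternative, sketched here with a hypothetical `ordinal` helper and a `pretty_columns` name that are not part of the notebook, computes the suffix from the number itself:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3.
    if n % 100 in (11, 12, 13):
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
pretty_columns = ["Neighborhood"] + [
    f"{ordinal(rank)} Most Common Venue" for rank in range(1, num_top_venues + 1)
]
```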

In [96]:
# new dfs with pretty columns from above
toronto_hoods_venues_sorted = pd.DataFrame(columns=columns)
toronto_hoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']
manhattan_hoods_venues_sorted = pd.DataFrame(columns=columns)
manhattan_hoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']
queens_hoods_venues_sorted = pd.DataFrame(columns=columns)
queens_hoods_venues_sorted['Neighborhood'] = queens_grouped['Neighborhood']
staten_island_hoods_venues_sorted = pd.DataFrame(columns=columns)
staten_island_hoods_venues_sorted['Neighborhood'] = staten_island_grouped['Neighborhood']

# pack these new dfs with their respective most common venues
for ind in np.arange(toronto_grouped.shape[0]):
    toronto_hoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
    
for ind in np.arange(manhattan_grouped.shape[0]):
    manhattan_hoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)
    
for ind in np.arange(queens_grouped.shape[0]):
    queens_hoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(queens_grouped.iloc[ind, :], num_top_venues)
    
for ind in np.arange(staten_island_grouped.shape[0]):
    staten_island_hoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(staten_island_grouped.iloc[ind, :], num_top_venues)
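
The four loops above repeat the same steps for each area. One way to fold them into a single reusable function is sketched below. The function name `top_venues_table` and the toy frame are assumptions for illustration, and the input is assumed to be shaped like the output of `get_grouped_df`.

```python
import pandas as pd


def top_venues_table(grouped_df, num_top_venues, column_names):
    """Build the 'most common venues' table for one area.

    grouped_df: a 'Neighborhood' column followed by one frequency
    column per venue category.
    column_names: labels for the ranked columns, one per rank.
    """
    freqs = grouped_df.set_index("Neighborhood")
    # For each neighborhood, sort categories by frequency and keep the top names.
    ranked = freqs.apply(
        lambda row: pd.Series(
            list(row.sort_values(ascending=False, kind="mergesort").index[:num_top_venues]),
            index=column_names,
        ),
        axis=1,
    )
    return ranked.reset_index()


# Toy frequencies; values are made up for illustration.
toy_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Cafe": [0.75, 0.0],
    "Gym": [0.0, 0.6],
    "Park": [0.25, 0.4],
})
toy_sorted = top_venues_table(
    toy_grouped, 2, ["1st Most Common Venue", "2nd Most Common Venue"]
)
```

With this helper, each area would need one call instead of a preallocated dataframe plus a loop.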

## Final form of raw data
Top venues for Toronto, Manhattan, Queens, and Staten Island are sorted by neighborhood.

In [97]:
toronto_hoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant,Restaurant,Burger Joint,Hotel,Sushi Restaurant,Asian Restaurant
1,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,Women's Store,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Bakery,Playground,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pizza Place,Fried Chicken Joint,Pharmacy,Video Store,Fast Food Restaurant,Beer Store,Sandwich Place,Women's Store,Dog Run
4,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Pool,Gym,Skating Rink,Pharmacy,Pub,Sandwich Place,Dessert Shop,Dim Sum Restaurant


In [99]:
manhattan_hoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Park,Coffee Shop,Hotel,Wine Shop,Women's Store,Gym,Memorial Site,Boat or Ferry,Shopping Mall,Pizza Place
1,Carnegie Hill,Coffee Shop,Pizza Place,Cosmetics Shop,Yoga Studio,Bakery,Gym,Bookstore,Grocery Store,Japanese Restaurant,Café
2,Central Harlem,Chinese Restaurant,African Restaurant,Cosmetics Shop,Bar,American Restaurant,French Restaurant,Seafood Restaurant,Salon / Barbershop,Tapas Restaurant,Gym
3,Chelsea,Coffee Shop,Bakery,Ice Cream Shop,Italian Restaurant,American Restaurant,Hotel,Nightclub,Theater,Bookstore,French Restaurant
4,Chinatown,Chinese Restaurant,Cocktail Bar,American Restaurant,Bakery,Hotpot Restaurant,Salon / Barbershop,Vietnamese Restaurant,Optical Shop,Spa,Mexican Restaurant


In [100]:
queens_hoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Arverne,Surf Spot,Metro Station,Sandwich Place,Donut Shop,Wine Shop,Playground,Beach,Pizza Place,Thai Restaurant,Board Shop
1,Astoria,Bar,Middle Eastern Restaurant,Hookah Bar,Greek Restaurant,Seafood Restaurant,Bakery,Pizza Place,Mediterranean Restaurant,Food & Drink Shop,Café
2,Astoria Heights,Chinese Restaurant,Deli / Bodega,Hostel,Plaza,Playground,Pizza Place,Italian Restaurant,Burger Joint,Bowling Alley,Bus Station
3,Auburndale,Ice Cream Shop,Italian Restaurant,Fast Food Restaurant,Supermarket,Furniture / Home Store,Noodle House,Korean Restaurant,Miscellaneous Shop,Athletics & Sports,Toy / Game Store
4,Bay Terrace,Clothing Store,Women's Store,Donut Shop,Lingerie Store,American Restaurant,Mobile Phone Shop,Cosmetics Shop,Kids Store,Deli / Bodega,Coffee Shop


In [101]:
staten_island_hoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Annadale,Pizza Place,Dance Studio,Train Station,Sushi Restaurant,Pharmacy,Restaurant,Sports Bar,Diner,Discount Store,Food
1,Arden Heights,Deli / Bodega,Bus Stop,Home Service,Pizza Place,Coffee Shop,Pharmacy,Yoga Studio,Fast Food Restaurant,Food & Drink Shop,Food
2,Arlington,Bus Stop,Deli / Bodega,Boat or Ferry,Grocery Store,Intersection,American Restaurant,Art Museum,Gastropub,Furniture / Home Store,French Restaurant
3,Arrochar,Deli / Bodega,Italian Restaurant,Bus Stop,Taco Place,Liquor Store,Mediterranean Restaurant,Middle Eastern Restaurant,Food Truck,Outdoors & Recreation,Pharmacy
4,Bay Terrace,Supermarket,Insurance Office,Plaza,Italian Restaurant,Sushi Restaurant,Donut Shop,Liquor Store,Salon / Barbershop,Shipping Store,Farmers Market


<a id='methodology'></a>

# Methodology

Each area is clustered separately with k-means (scikit-learn's `KMeans`, five clusters by default), using the per-neighborhood mean frequency of each venue category as features. The resulting cluster labels are then joined back to the neighborhood coordinates so that the clusters can be drawn on a map.

## Clustering

In [102]:
def cluster_my_data(grouped_df, venues_sorted_df, lat_lon_df, kclusters=5):
    # fit k-means on the per-neighborhood category frequencies, then merge the cluster labels with the coordinates
    
    grouped_clustering = grouped_df.drop(columns='Neighborhood')
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)
    
    print(f"There are {len(kmeans.labels_)} labels, {len(grouped_df)} groups, and {len(venues_sorted_df)} venues")
    
    if "Cluster Labels" not in venues_sorted_df.columns:
        venues_sorted_df.insert(0, 'Cluster Labels', kmeans.labels_)
        
    clean_lat_lon_df = lat_lon_df.drop_duplicates(subset="Neighborhood")
    
    merged_df = clean_lat_lon_df.join(venues_sorted_df.set_index('Neighborhood'), on='Neighborhood')
    
    merged_df.rename(columns={'Neighborhood Latitude': 'Latitude', 'Neighborhood Longitude': 'Longitude'}, inplace=True)
    
    # drop the per-venue columns inherited from lat_lon_df, leaving one row per neighborhood
    return merged_df.drop(columns=['Venue', 'Venue Category', 'Venue Latitude', 'Venue Longitude'])
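
Because `cluster_my_data` fits a separate k-means model for each area, cluster 2 in Toronto and cluster 2 in Queens come from different models and need not describe similar neighborhoods. One possible way to get labels that are comparable across areas is to fit a single model on all areas at once. The sketch below shows that idea under stated assumptions: `cluster_areas_jointly` is a hypothetical helper, the toy frequencies are made up, and inputs are assumed to be shaped like the output of `get_grouped_df` with no venue category literally named "Area".

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_areas_jointly(grouped_by_area, kclusters=5):
    """Fit one k-means model on several areas so labels share a meaning.

    grouped_by_area: dict mapping an area name to a frame with a
    'Neighborhood' column plus one frequency column per venue category.
    """
    frames = [df.assign(Area=area) for area, df in grouped_by_area.items()]
    # concat aligns on column names; categories absent from an area become NaN.
    combined = pd.concat(frames, ignore_index=True, sort=False)
    features = combined.drop(columns=["Area", "Neighborhood"]).fillna(0.0)

    model = KMeans(n_clusters=kclusters, random_state=0, n_init=10).fit(features)

    result = combined[["Area", "Neighborhood"]].copy()
    result["Cluster Labels"] = model.labels_
    return result


# Toy frequencies; values are made up for illustration.
toy_areas = {
    "Old City": pd.DataFrame({
        "Neighborhood": ["O1", "O2"],
        "Cafe": [0.9, 0.1],
        "Park": [0.1, 0.9],
    }),
    "New City": pd.DataFrame({
        "Neighborhood": ["N1", "N2"],
        "Cafe": [0.8, 0.1],
        "Park": [0.2, 0.8],
        "Gym": [0.0, 0.1],
    }),
}
toy_joint = cluster_areas_jointly(toy_areas, kclusters=2)
```

In the toy data, the cafe-heavy neighborhoods O1 and N1 end up in one cluster and the park-heavy O2 and N2 in the other, even though they sit in different areas.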

In [106]:
toronto_merged = cluster_my_data(toronto_grouped, toronto_hoods_venues_sorted, toronto_venues_df)
manhattan_merged = cluster_my_data(manhattan_grouped, manhattan_hoods_venues_sorted, manhattan_venues_df)
queens_merged = cluster_my_data(queens_grouped, queens_hoods_venues_sorted, queens_venues_df)
staten_island_merged = cluster_my_data(staten_island_grouped, staten_island_hoods_venues_sorted, staten_island_venues_df)

There are 99 labels, 99 groups, and 99 venues
There are 40 labels, 40 groups, and 40 venues
There are 81 labels, 81 groups, and 81 venues
There are 62 labels, 62 groups, and 62 venues


<a id='results'></a>

# Results


<a id='discussion'></a>

# Discussion



<a id='conclusion'></a>

# Conclusion