# Find My New City
### Many people who want to move are unsure which city to choose. One natural option is to move to the city most similar to their current one. This project addresses that problem: it takes the user's current US city as input, together with the cities they are considering, and clusters the cities by their similarity to each other. The cities that fall in the same cluster as the user's city are then displayed on a map of the US, and the single most similar city is highlighted.
### To complete the project I will use US cities data (found here https://simplemaps.com/data/us-cities) together with the FourSquare API to obtain information about the cities. Venues and venue types are extracted for each city under consideration. The venue-type frequencies are then used as the features for a machine learning algorithm that groups the cities into clusters.
### Target audience: Anyone wishing to move cities, for whatever reason. One example is someone who has received university offers in several different cities and wishes to find the one that most closely resembles their home town.
### Data used: https://simplemaps.com/data/us-cities, plus the FourSquare API (https://foursquare.com) to extract venue types within a fixed radius of each city.

## Fields for US cities data
* **city-**	The name of the city/town.
* **city_ascii-**	city as an ASCII string.
* **lat-**	The latitude of the city/town.
* **lng-**	The longitude of the city/town.
* **state_id-**	The state or territory's USPS postal abbreviation.
* **state_name-**	The name of the state or territory that contains the city/town.
* **county_fips-**	The 5-digit FIPS code for the county. The first two digits correspond to the state's FIPS code.
* **county_name-**	The name of the county (or equivalent) that contains the city/town.
* **population-**	An estimate of the city's urban population as measured by the Census. 2016 data (when available).
* **population_proper-**	An estimate of the city's municipal population as measured by the Census. 2016 data (when available).
* **density-**	The number of people per square kilometer. population / land_area (estimated when area unknown).
* **source-**	For some cities, our data is generated from a polygon representing the city, for others we simply have a point.
* **incorporated-**	TRUE if the place is a city/town. FALSE if the place is just a commonly known name for a populated area.
* **timezone-**	The city's timezone in the tz database format. (e.g. America/Los_Angeles)
* **zips-**	A string containing all five-digit zip codes in the city/town, delimited by a space.
* **id-**	A 10-digit unique id generated by SimpleMaps. It is consistent across releases and databases (e.g. World Cities Database).
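
Only a few of these fields are used later in the notebook (the city and state names, the coordinates, and the density). As a minimal sketch, assuming the CSV uses the column names listed above, the unused columns can be skipped at load time with the `usecols` parameter of `pd.read_csv`. The inline sample is a stand-in for the real file, trimmed to a handful of columns for illustration.

```python
import io
import pandas as pd

# Stand-in for us_cities.csv: a trimmed, illustrative sample, not the real file.
sample_csv = io.StringIO(
    "city,city_ascii,state_id,state_name,lat,lng,density\n"
    "Prairie Ridge,Prairie Ridge,WA,Washington,47.1443,-122.1408,1349.8\n"
    "Edison,Edison,WA,Washington,48.5602,-122.4311,127.4\n"
)

# Columns the rest of the notebook relies on.
needed = ["city", "state_name", "lat", "lng", "density"]

# usecols keeps only the listed columns, in the order they appear in the file.
us_cities_small = pd.read_csv(sample_csv, usecols=needed)
print(us_cities_small.columns.tolist())
```

Loading only the needed columns would make the later `drop` call unnecessary, at the cost of hard-coding the column list in one more place.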

In [1]:
import numpy as np
import pandas as pd
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.geocoders import Nominatim
import folium
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.spatial import distance

In [2]:
us_cities = pd.read_csv('C:/Users/james/Desktop/Projects_and_Datasets/us_cities.csv')

In [3]:
us_cities.head()

Unnamed: 0,city,city_ascii,state_id,state_name,county_fips,county_name,lat,lng,population,population_proper,density,source,incorporated,timezone,zips,id
0,Prairie Ridge,Prairie Ridge,WA,Washington,53053,Pierce,47.1443,-122.1408,,,1349.8,polygon,False,America/Los_Angeles,98360 98391,1840037882
1,Edison,Edison,WA,Washington,53057,Skagit,48.5602,-122.4311,,,127.4,polygon,False,America/Los_Angeles,98232,1840017314
2,Packwood,Packwood,WA,Washington,53041,Lewis,46.6085,-121.6702,,,213.9,polygon,False,America/Los_Angeles,98361,1840025265
3,Wautauga Beach,Wautauga Beach,WA,Washington,53035,Kitsap,47.5862,-122.5482,,,261.7,point,False,America/Los_Angeles,98366,1840037725
4,Harper,Harper,WA,Washington,53035,Kitsap,47.5207,-122.5196,,,342.1,point,False,America/Los_Angeles,98366,1840037659


In [4]:
us_cities.drop(  
    ['city_ascii','state_id','county_fips','county_name','population','population_proper','source',  
     'incorporated','timezone','zips','id'], axis=1, inplace=True)

In [5]:
# sample list for cities_moving (space after the comma doesn't make a difference)
# city you are from Miami, Florida
# New York,New York,Los Angeles,California,Chicago,Illinois,Houston,Texas,Phoenix,Arizona,Philadelphia,Pennsylvania,San Antonio,Texas,San Diego,California,Dallas,Texas,San Jose,California,Austin,Texas,Jacksonville,Florida,San Francisco,California,Columbus,Ohio,Fort Worth,Texas,Indianapolis,Indiana,Charlotte,North Carolina,Seattle,Washington,Denver,Colorado,Boston,Massachusetts,El Paso,Texas,Detroit,Michigan,Nashville,Tennessee,Memphis,Tennessee,Portland,Oregon,Oklahoma City,Oklahoma,Las Vegas,Nevada,Louisville,Kentucky,Baltimore,Maryland,Milwaukee,Wisconsin

In [6]:
city = input('Enter the current US city together with the state that you live in: \n (example: New York, New York) ')
cities_moving = input('Enter the US cities together with the state name you are considering moving to (3 or more) as a comma separated list:\n (example: Washington, Iowa, Washington, Texas) ')
print('Please wait, this may take a few moments depending on how many cities you are considering...')

Enter the current US city together with the state that you live in: 
 (example: New York, New York) Miami, Florida
Enter the US cities together with the state name you are considering moving to (3 or more) as a comma separated list:
 (example: Washington, Iowa, Washington, Texas) New York,New York,Los Angeles,California,Chicago,Illinois,Houston,Texas,Phoenix,Arizona,Philadelphia,Pennsylvania,San Antonio,Texas,San Diego,California,Dallas,Texas,San Jose,California,Austin,Texas,Jacksonville,Florida,San Francisco,California,Columbus,Ohio,Fort Worth,Texas,Indianapolis,Indiana,Charlotte,North Carolina,Seattle,Washington,Denver,Colorado,Boston,Massachusetts,El Paso,Texas,Detroit,Michigan,Nashville,Tennessee,Memphis,Tennessee,Portland,Oregon,Oklahoma City,Oklahoma,Las Vegas,Nevada,Louisville,Kentucky,Baltimore,Maryland,Milwaukee,Wisconsin
Please wait, this may take a few moments depending on how many cities you are considering...


In [7]:
city = city.split(',')
cities_moving = cities_moving.split(',')  

In [8]:
cities_list = city + cities_moving

In [9]:
states = cities_list[1::2]
cities = cities_list[::2]

In [10]:
# strip whitespace and title-case the names so they match the dataset's format
cities_final = [c.strip().title() for c in cities]
states_final = [s.strip().title() for s in states]
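
The splitting and cleaning steps above can be folded into one function that also catches an odd number of entries (a city without a state). This is only a sketch; `parse_city_state_pairs` is a hypothetical helper, not part of the original notebook.

```python
def parse_city_state_pairs(raw):
    """Split 'City, State, City, State' into a list of (city, state) tuples."""
    # Drop empty pieces (e.g. from a trailing comma) and normalise the rest.
    parts = [p.strip().title() for p in raw.split(",") if p.strip()]
    if len(parts) % 2 != 0:
        raise ValueError("Expected an even number of comma-separated values")
    # Even positions are cities, odd positions are their states.
    return list(zip(parts[::2], parts[1::2]))


pairs = parse_city_state_pairs("miami, florida,New York,New York")
print(pairs)
```

One caveat of `str.title()` is that it lowercases every letter after the first in each word, so a name such as "McAllen" becomes "Mcallen" and would no longer match the dataset.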

In [11]:
"""
In a final deployable program, this check would be incorporated to confirm that all cities
and states are entered correctly and in the required format.
"""

# Valid (city, state) pairs, so a city is only accepted together with its own state
valid_pairs = set(zip(us_cities['city'], us_cities['state_name']))

for city_number, (city, state) in enumerate(zip(cities_final, states_final), start=1):
    if (city, state) in valid_pairs:
        print('pair {} okay'.format(city_number))
    else:
        print('City: {} or State: {} not found'.format(city, state))
        ans = input('Would you like to try entering again (y/n)?')
        # re-entry is not implemented yet; on 'y' the loop simply moves on to the next pair
        if ans != 'y':
            break

pair 1 okay
pair 2 okay
pair 3 okay
pair 4 okay
pair 5 okay
pair 6 okay
pair 7 okay
pair 8 okay
pair 9 okay
pair 10 okay
pair 11 okay
pair 12 okay
pair 13 okay
pair 14 okay
pair 15 okay
pair 16 okay
pair 17 okay
pair 18 okay
pair 19 okay
pair 20 okay
pair 21 okay
pair 22 okay
pair 23 okay
pair 24 okay
pair 25 okay
pair 26 okay
pair 27 okay
pair 28 okay
pair 29 okay
pair 30 okay
pair 31 okay
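
Validation is more reliable when a city and its state are checked as a pair, because a city name and a state name can each appear in the dataset without ever appearing together. The sketch below uses a left merge with `indicator=True` to list the requested pairs that have no match. The two small dataframes are invented stand-ins for `us_cities` and the user's input.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are illustrative only).
us_cities_demo = pd.DataFrame({
    "city": ["Washington", "Washington", "Miami"],
    "state_name": ["Iowa", "Texas", "Florida"],
})
requested = pd.DataFrame({
    "city": ["Miami", "Washington", "Miami"],
    "state_name": ["Florida", "Iowa", "Texas"],
})

# A left merge keeps every requested pair; the _merge column records
# whether a matching row was found in the reference data.
checked = requested.merge(
    us_cities_demo, on=["city", "state_name"], how="left", indicator=True
)
missing = checked.loc[checked["_merge"] == "left_only", ["city", "state_name"]]
print(missing)
```

Here "Miami" and "Texas" both exist in their own columns, so a column-by-column membership test would accept the pair, while the merge reports it as missing.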


In [12]:
# CELL BELOW HIDDEN, CONTAINS CREDENTIALS FOR THE FOURSQUARE API

In [14]:
"""
Gets the venues and venue types within a certain radius of each city from the FourSquare API and returns a dataframe
"""

def getNearbyVenues(names, latitudes, longitudes, radius=4000): # radius in metres; change depending on speed
    
    LIMIT = 100 # max number of venues to get per city (depending on speed)
    venues_list = []
    print('Calculating for cities:')
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
    
        # make the GET request and stop with a clear error if it fails
        response = requests.get(url)
        response.raise_for_status()
        results = response.json()['response']['groups'][0]['items']
        
        # keep only the relevant information for each nearby venue
        venues_list.extend([(
            name,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # one row per venue, labelled with the city it was found near
    nearby_venues = pd.DataFrame(venues_list, columns=[
                  'City',
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'])
    
    return nearby_venues
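
`getNearbyVenues` reads `v['venue']['categories'][0]['name']` for every item, which would raise an `IndexError` if a venue came back with an empty category list. A possible guard is sketched below. `extract_venue_rows` is a hypothetical helper, and `sample_items` is hand-written data that only mimics the nesting the function indexes into; it is not a real API response.

```python
def extract_venue_rows(city, lat, lng, items):
    """Flatten explore-style items into tuples, skipping venues with no category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # nothing to one-hot encode later, so skip this venue
        rows.append((
            city, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"],
        ))
    return rows


# Invented sample shaped like the `results` list in getNearbyVenues.
sample_items = [
    {"venue": {"name": "Cafe A", "location": {"lat": 25.78, "lng": -80.21},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled", "location": {"lat": 25.79, "lng": -80.20},
               "categories": []}},
]
rows = extract_venue_rows("Miami", 25.784, -80.2102, sample_items)
print(rows)
```

Separating the parsing from the HTTP call also makes it possible to test the parsing without API credentials.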

In [15]:
cities_and_states_df = pd.concat([pd.Series(cities_final),pd.Series(states_final)], axis=1)
cities_and_states_df.head()

Unnamed: 0,0,1
0,Miami,Florida
1,New York,New York
2,Los Angeles,California
3,Chicago,Illinois
4,Houston,Texas


In [16]:
cities_and_states_df = cities_and_states_df.rename(columns={0:'city',1:'state_name'})

In [17]:
final_df = us_cities.merge(cities_and_states_df, left_on=['city','state_name'], right_on=['city','state_name'])

In [18]:
final_df.head()

Unnamed: 0,city,state_name,lat,lng,density
0,Seattle,Washington,47.6211,-122.3244,3337.0
1,Milwaukee,Wisconsin,43.064,-87.9669,2390.0
2,Miami,Florida,25.784,-80.2102,4971.0
3,Jacksonville,Florida,30.3322,-81.6749,460.0
4,Fort Worth,Texas,32.7814,-97.3473,978.0


In [19]:
city_venues = getNearbyVenues(final_df['city'], final_df['lat'], final_df['lng'])

Calculating for cities:
Seattle
Milwaukee
Miami
Jacksonville
Fort Worth
El Paso
Dallas
Austin
Houston
San Antonio
Charlotte
Memphis
Nashville
New York
Philadelphia
San Francisco
San Diego
San Jose
Los Angeles
Las Vegas
Denver
Chicago
Indianapolis
Oklahoma City
Phoenix
Baltimore
Boston
Columbus
Detroit
Louisville
Portland


In [20]:
city_venues_onehot = pd.get_dummies(city_venues[['Venue Category']],prefix='', prefix_sep='')
city_venues_onehot['City'] = city_venues['City']
city_venues_onehot['City Latitude'] = city_venues['City Latitude']
city_venues_onehot['City Longitude'] = city_venues['City Longitude']

In [21]:
cities_grouped = city_venues_onehot.groupby(['City','City Latitude','City Longitude']).mean().reset_index()

In [22]:
cities_grouped.head()

Unnamed: 0,City,City Latitude,City Longitude,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Animal Shelter,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Austin,30.3006,-97.7517,0.0,0.0,0.0,0.0,0.0,0.06,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.0
1,Baltimore,39.3051,-76.6144,0.0,0.0,0.01,0.01,0.0,0.03,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0
2,Boston,42.3188,-71.0846,0.02,0.0,0.0,0.0,0.0,0.07,0.0,...,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0
3,Charlotte,35.2079,-80.8303,0.0,0.0,0.0,0.0,0.0,0.06,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0
4,Chicago,41.8373,-87.6861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
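
Each value in the grouped table is the share of a city's returned venues that fall in that category, so every row sums to 1. A toy example with made-up cities and categories shows the same one-hot and `groupby(...).mean()` steps on a scale that can be checked by hand.

```python
import pandas as pd

# Made-up venues: city A has four venues, city B has two.
venues = pd.DataFrame({
    "City": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Park", "Zoo", "Bar", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot["City"] = venues["City"]

# The mean of the indicator columns is the per-city category frequency.
freq = onehot.groupby("City").mean()
print(freq)
```

Because the features are proportions, a city with 100 returned venues and one with 40 are compared on the mix of venue types rather than on raw counts.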


In [23]:
# silhouette scoring needs at least 2 clusters, so the lower bound is clamped to 2
if len(cities_grouped) >= 8:
    range_n_clusters = np.arange(max(2, int(np.floor(len(cities_grouped)*0.17))), int(np.floor(len(cities_grouped)*0.5)))
else:
    range_n_clusters = np.arange(2, len(cities_grouped)-1)

cities_grouped_clustering = cities_grouped.drop(['City','City Latitude','City Longitude'], axis=1)

cluster_scores = []

for n_clusters in range_n_clusters:
    
    # Initialize the clusterer with n_clusters value and a random generator seed of 12 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=12)
    cluster_labels = clusterer.fit_predict(cities_grouped_clustering)

    # This gives a perspective into the density and separation of the formed clusters
    silhouette_avg = silhouette_score(cities_grouped_clustering, cluster_labels)
    
    cluster_scores.append([n_clusters,silhouette_avg])
    
cluster_scores = pd.DataFrame(cluster_scores, columns=['Cluster','Score'])

# pick the cluster count with the highest silhouette score (first one if there is a tie)
best_num_clusters = int(cluster_scores.loc[cluster_scores['Score'].idxmax(), 'Cluster'])

best_num_clusters

5
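
The selection loop can be tried on data where the right answer is known in advance. The sketch below builds three well-separated synthetic blobs (an assumption made purely for illustration; real venue profiles are far less cleanly separated) and picks the cluster count with the highest average silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three tight, well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Score each candidate cluster count by its average silhouette.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=12).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest score is selected.
best_k = max(scores, key=scores.get)
print(best_k)
```

The silhouette score is only defined for at least 2 clusters and fewer clusters than samples, which is why the candidate range starts at 2.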

In [24]:
# set number of clusters to best found using silhouette scoring metric 
kclusters = best_num_clusters

cities_grouped_clustering = cities_grouped.drop(['City','City Latitude','City Longitude'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=12).fit(cities_grouped_clustering)

# check cluster labels generated for each row in the dataframe
cluster_labels = pd.Series(kmeans.labels_)

In [25]:
cities_with_clusters = pd.concat([cluster_labels,cities_grouped], axis=1).rename(columns={0:'Cluster Labels'})
cities_with_clusters.head()

Unnamed: 0,Cluster Labels,City,City Latitude,City Longitude,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,American Restaurant,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,4,Austin,30.3006,-97.7517,0.0,0.0,0.0,0.0,0.0,0.06,...,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.0
1,1,Baltimore,39.3051,-76.6144,0.0,0.0,0.01,0.01,0.0,0.03,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0
2,4,Boston,42.3188,-71.0846,0.02,0.0,0.0,0.0,0.0,0.07,...,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0
3,4,Charlotte,35.2079,-80.8303,0.0,0.0,0.0,0.0,0.0,0.06,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0
4,2,Chicago,41.8373,-87.6861,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
lat_lng_origin = final_df[['lat','lng']][(final_df['city']==cities_final[0]) & (final_df['state_name']==states_final[0])]
lat = lat_lng_origin['lat'].item()
lng = lat_lng_origin['lng'].item()

In [27]:
origin_point = (cities_with_clusters.drop(['City','City Latitude','City Longitude'], axis=1)
                [(cities_with_clusters['City Latitude']==lat) & (cities_with_clusters['City Longitude']==lng)])

dist_cluster = origin_point['Cluster Labels'].item()

others_with_cities = cities_with_clusters[cities_with_clusters['Cluster Labels']==dist_cluster]

other_cluster_points = (cities_with_clusters.drop(['City','City Latitude','City Longitude'], axis=1)
                [cities_with_clusters['Cluster Labels']==dist_cluster])

other_cluster_points = other_cluster_points.drop('Cluster Labels', axis=1).values

origin_point = origin_point.drop('Cluster Labels', axis=1).values

In [28]:
distances = distance.cdist(origin_point, other_cluster_points, 'euclidean')

min_distance = np.inf

# find the nearest city in the cluster, skipping the origin itself (distance 0)
for i, dist in enumerate(distances[0]):
    if (dist != 0) and (dist < min_distance):
        min_distance = dist
        close_index = i

        
closest_city = others_with_cities.iloc[close_index,:]

closest_city = closest_city[['City','City Latitude','City Longitude']]

closest_city = pd.DataFrame(closest_city).transpose()
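
The nearest-city search can also be written without an explicit loop. The sketch below uses plain NumPy rather than `cdist`, computing the same Euclidean distances, and excludes the origin by its row index. The city names and feature values are invented for illustration.

```python
import numpy as np

# Invented venue-frequency profiles; row 0 plays the role of the user's city.
names = ["Origin", "City A", "City B", "City C"]
features = np.array([
    [0.50, 0.25, 0.25],
    [0.10, 0.60, 0.30],
    [0.45, 0.30, 0.25],
    [0.00, 0.00, 1.00],
])
origin_idx = 0

# Euclidean distance from the origin row to every row.
dists = np.linalg.norm(features - features[origin_idx], axis=1)

# Exclude the origin by index so that it cannot be its own nearest neighbour.
dists[origin_idx] = np.inf

closest_idx = int(np.argmin(dists))
print(names[closest_idx])
```

Masking by index rather than testing for a distance of zero means that another city with an identical profile would still be returned as a match instead of being skipped.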

In [29]:
# The geographical coordinates of the US are:
latitude = 37.0902
longitude = -95.7129 

In [30]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=4)

# set color scheme for the clusters
x = np.arange(len(cluster_labels.unique()))
ys = [i + (i*x)**2 for i in range(len(cluster_labels.unique()))]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

folium.Marker([closest_city['City Latitude'].item(),closest_city['City Longitude'].item()], 
              popup='{}: Highest Similarity'.format(closest_city['City'].item())).add_to(map_clusters)


# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(cities_with_clusters['City Latitude'], cities_with_clusters['City Longitude'], cities_with_clusters['City'], cities_with_clusters['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=6,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.6).add_to(map_clusters)
    
       
print('\n\n\n\n{0} shown with the large blue marker, has the highest similarity to your city {1}.\nOther cities that are also similar to your own have the same colored circles as {0}\nClick the circles or markers on the map to see the city name.\n\n\n'.format(closest_city['City'].item(),cities_final[0]))
map_clusters





Columbus shown with the large blue marker, has the highest similarity to your city Miami.
Other cities that are also similar to your own have the same colored circles as Columbus
Click the circles or markers on the map to see the city name.
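
The colours on the map come from sampling the `rainbow` colormap once per cluster. A simplified sketch of that step is shown below, assuming 5 clusters as found earlier. Unlike the notebook cell, it indexes the palette directly by cluster label instead of by `cluster-1`.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters = 5  # assumed here; the notebook uses the silhouette-selected value

# Sample the colormap at evenly spaced points and convert each RGBA to hex.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, n_clusters))]

# Map each cluster label (0 .. n_clusters-1) to its own colour.
cluster_to_colour = dict(zip(range(n_clusters), palette))
print(cluster_to_colour)
```

A label-to-colour dictionary like this could also be used to print a small legend next to the map, so that readers can tell the clusters apart without clicking each circle.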



