#New country - Similar city
##This data visualization project is part of the IBM Applied Data Science Capstone on the Coursera platform.
*By Jesper Mølgaard<br>Some code is modified from the course*
<br>

---

Imagine you had the choice of living in a different country. Where would you like to go?
Let's imagine that you like the city you are currently living in, and would like to move to a new city that resembles it. That is the purpose of this tool.

For the project we will be using:
 - The Foursquare API
 - The World Cities Database from https://simplemaps.com/data/world-cities, containing about 15,000 cities
 - The folium data visualization package
 - The usual suspects for Python: numpy, pandas, matplotlib, scikit-learn and so forth

In [None]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transform json file into a pandas dataframe library
from pandas.io.json import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
print('Folium installed')

# import k-means for clustering stage
from sklearn.cluster import KMeans

!pip install geocoder
import geocoder # import geocoder

print('Libraries imported.')

print ("Hello Capstone Project Course!")

Folium installed
Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Libraries imported.
Hello Capstone Project Course!


In [None]:
#Foursquare credentials:
CLIENT_ID = 'JVD55THRFSX1BNI0EVJIPYJT3Z5KWU3JT3E3R3XWBPBY0RSN'
CLIENT_SECRET = 'CADRD0KQXUZ1IVE1JJGZKFV4H3DTKHU5EYGEQC3HTOYEPQ4C'
VERSION = '20200604'
LIMIT = 100
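
Hard-coding API keys in a shared notebook exposes them to anyone who reads it. A minimal sketch of an alternative, assuming the keys are stored in environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical and not part of the original notebook:

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are illustrative assumptions, not a Foursquare convention.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching the real environment.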

Next, we load and inspect the 'World Cities Database' from https://simplemaps.com/data/world-cities. <br>This database contains about 15,000 cities with latitude, longitude and population. Let's take a look at it:

In [None]:
url = 'https://github.com/moelgaardjesper/courseracapstoneproject/raw/master/worldcities.csv'

df = pd.read_csv(url)
print ('The full dataframe has the dimensions:\n',df.shape)

#The city names contain a variety of accents and special characters, so for simplicity we only keep the column with standard (ASCII) characters.
list_to_keep = ['city_ascii','lat','lng','country','population']
df = df[list_to_keep]
df.rename(columns = {'city_ascii': 'city'}, 
          inplace=True)

#Let's see which countries have most cities listed, and how many cities are in them:
df.groupby('country').count().sort_values(by='city',ascending=False)


The full dataframe has the dimensions:
 (15493, 11)


country,city,lat,lng,population
United States,7328,7328,7328,7328
Russia,569,569,569,567
China,392,392,392,392
Brazil,387,387,387,387
Canada,250,250,250,250
...,...,...,...,...
Grenada,1,1,1,1
Monaco,1,1,1,1
Saint Lucia,1,1,1,1
Mayotte,1,1,1,1


#Intermediate Conclusion:
Initially, I thought about comparing all cities in the entire dataframe.

However, since the dataset contains more than 15,000 cities, it is not feasible to cluster all of them at once.

This is because our access to the Foursquare API only allows for 99,500 lookups per day.<br>
Spread over every city, we could gather only about 6 locations per city, which is too few to give a trustworthy description of each one.
Similarly, the US alone has more than 7,300 cities, which would only allow for about 13 locations each.
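
The budget estimate can be checked with a few lines of arithmetic. This is only a sketch: it uses the daily quota quoted in the text and the row counts from the table above, and `lookups_per_city` is an illustrative helper name:

```python
DAILY_LOOKUPS = 99_500   # daily quota quoted in the text
ALL_CITIES = 15_493      # rows in the full dataframe
US_CITIES = 7_328        # rows for the United States

def lookups_per_city(quota, n_cities):
    """Whole number of lookups available per city under a daily quota."""
    return quota // n_cities

print(lookups_per_city(DAILY_LOOKUPS, ALL_CITIES))  # 6
print(lookups_per_city(DAILY_LOOKUPS, US_CITIES))   # 13
```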

Therefore we will do the following:

1. The person using the tool defines a starting location (city).
2. The user is then asked to choose a target country.
3. Cities in the target country are selected by how close their population is to that of the starting city.
4. The comparison is then limited to the cities matching those two criteria (country and population), which reduces the number of Foursquare API calls needed.
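
The selection steps can be sketched on a toy table. This is a simplified variant that ranks the target country's cities by absolute population difference and keeps the closest ones, rather than widening a percentage band step by step. The helper name `closest_by_population` and the sample data are made up for illustration:

```python
import pandas as pd

def closest_by_population(df, country, population, n=20):
    """Return up to n cities in `country` whose population is closest to `population`."""
    candidates = df[df["country"].str.lower() == country.lower()]
    candidates = candidates.dropna(subset=["population"])
    # Rank the candidates by absolute distance to the reference population
    order = (candidates["population"] - population).abs().sort_values().index
    return candidates.loc[order].head(n)

toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "country": ["Sweden", "Sweden", "Sweden", "Norway"],
    "population": [1_000_000, 500_000, None, 900_000],
})
print(closest_by_population(toy, "sweden", 600_000, n=2)["city"].tolist())  # ['B', 'A']
```

Cities with a missing population are dropped before ranking, so they can never end up in the shortlist.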

In [None]:
#Here is an example of how it would work.
#First we query the user for a starting city, and a destination country:
x = 0
while x==0:
  input_city = (input('Enter the starting city: ')).lower()
  if df['city'].str.lower().eq(input_city).any():
    print ('Starting city selected')
    input_df = df[df['city'].str.lower() == input_city]
    input_population = input_df.iloc[0]['population']
    lat = input_df.iloc[0]['lat']
    lng = input_df.iloc[0]['lng']
    x=1
  else:
    print ('Invalid starting city - Check your spelling - Not all cities are in database')
    x=0

y=0
while y==0:
  target_country = (input('Enter the destination country: ')).lower()
  if df['country'].str.lower().eq(target_country).any():
    print ('Destination country selected')
    y=1
  else:
    print ('Invalid destination country - Check your spelling - Not all countries are in database')
    y=0

Enter the starting city: copenhagen
Starting city selected
Enter the destination country: sweden
Destination country selected


In [None]:
#Add cities to target dataframe, until it contains 20 cities.
#We start by selecting similar sized cities. Then we take a step in each direction, and add those cities, we repeat until our target dataframe has size 20.

target_df = pd.DataFrame()
z=0
k=0
step = 0.01  #step could be increased if necessary for efficiency.

while z==0:
  if df[df['country'].str.lower()==target_country].shape[0] >= 20: #check if there are at least 20 cities in the target country.
    if target_df.shape[0] < 20:
      k +=step
      target_df = df[(df['country'].str.lower() == target_country) &
               (df['population'] <= input_population * (1+k)) &
               (df['population'] >= input_population * (1-k))
               ]
    else:
      z=1
  else:
    target_df = df[df['country'].str.lower() == target_country].dropna(subset=['population'])
    z=1

print (target_df)

target_df = pd.concat([target_df, input_df]) # We append the starting city to the dataframe, so we don't need to run the following code on 2 dataframes.

target_df.reset_index(inplace=True)

             city      lat      lng country  population
381     Stockholm  59.3508  18.0973  Sweden   1264000.0
999      Goteborg  57.7500  12.0000  Sweden    537797.0
1688        Malmo  55.5833  13.0333  Sweden    269349.0
2771      Uppsala  59.8601  17.6400  Sweden    133117.0
3132     Vasteras  59.6300  16.5400  Sweden    107194.0
3282       Orebro  59.2803  15.2200  Sweden     98573.0
3303    Linkoping  58.4100  15.6299  Sweden     96732.0
3362  Helsingborg  56.0505  12.7000  Sweden     91304.0
3376    Jonkoping  57.7713  14.1650  Sweden     89780.0
3505         Umea  63.8300  20.2400  Sweden     78197.0
3547     Karlstad  59.3671  13.4999  Sweden     74141.0
3621        Gavle  60.6670  17.1666  Sweden     68635.0
3746        Vaxjo  56.8837  14.8167  Sweden     59600.0
3822     Halmstad  56.6718  12.8556  Sweden     55657.0
3936        Lulea  65.5966  22.1584  Sweden     48638.0
3992    Ostersund  63.1833  14.6500  Sweden     46178.0
4022  Trollhattan  58.2671  12.3000  Sweden     

In [None]:
#The Foursquare API returns at most 50 results per query, so we add extra coordinates around each city, generated with sine and cosine functions, to look up more venues:

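def extra_coordinates(lat,lng,city,points):
  # One row for each city centre plus `points` surrounding locations per city
  d = pd.DataFrame(columns=['lat','lng','city'],index=range((points+1)*len(city)))
  x=0
  for p in range(len(city)):
    d.loc[x] = lat[p],lng[p],city[p]
    x+=1
    for i in range(points):
      # np.sin and np.cos expect radians, so convert the angle from degrees
      angle = np.deg2rad((360/points)*i)
      sin_fun = np.sin(angle)*0.02  # offset in degrees of latitude
      cos_fun = np.cos(angle)*0.02  # offset in degrees of longitude
      d.loc[x] = [lat[p]+sin_fun,
      lng[p]+cos_fun,
      city[p]]
      x+=1
  return d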

extra_locations_df = extra_coordinates(lat=target_df['lat'],lng=target_df['lng'],city=target_df['city'],points=6)

#So if the final dataframe has 21 cities, we will query the Foursquare API for up to:
#21*7*100 = 14,700 venues.
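
For a sense of scale, the 0.02 degree offset used above can be converted to kilometres. The sketch below assumes a spherical Earth with a radius of 6371 km and uses angles in radians so that the points are evenly spaced. `ring_offsets` and `offset_km` are illustrative helper names:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def ring_offsets(points, step=0.02):
    """Return (dlat, dlng) offsets in degrees, evenly spaced on a circle."""
    return [(step * math.sin(2 * math.pi * i / points),
             step * math.cos(2 * math.pi * i / points))
            for i in range(points)]

def offset_km(dlat, dlng, lat):
    """Approximate ground distance in km of a small offset at latitude `lat`."""
    north = math.radians(dlat) * EARTH_RADIUS_KM
    # A degree of longitude shrinks with the cosine of the latitude
    east = math.radians(dlng) * EARTH_RADIUS_KM * math.cos(math.radians(lat))
    return math.hypot(north, east)

# Distances of the six ring points from the Stockholm coordinate (latitude 59.35)
for dlat, dlng in ring_offsets(6):
    print(round(offset_km(dlat, dlng, 59.35), 2))
```

At Stockholm's latitude the ring points sit roughly 1.1 to 2.0 km from the city coordinate, so with a search radius of 500 (assumed here to be metres) the ring queries do not overlap the central one.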

In [None]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Center on target country:
address = target_country

geolocator = Nominatim(user_agent="location_comparison")
location = geolocator.geocode(target_country)
latitude = location.latitude
longitude = location.longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=6)

# add markers to the map
markers_colors = []
for lat, lon, poi, in zip(extra_locations_df['lat'], extra_locations_df['lng'], extra_locations_df['city']):
    label = folium.Popup(str(poi),
                         parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
def getNearbyVenues(name, latitudes, longitudes, radius=500):
    z=1
    def backline():        
      print ('\r', end='')

    venues_list=[]
    for name, lat, lng in zip(name, latitudes, longitudes):
        print('Progress: ',
              "[",
              '-'*int((z/extra_locations_df.shape[0])*20),
              ' '*(20-int((z/extra_locations_df.shape[0])*20)),
              ']',
              "{:.1f}".format(z/extra_locations_df.shape[0]*100),'%',
              end = '',
              flush=True)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        backline()
        z+=1

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print ('Finished importing - ',nearby_venues.shape[0],'locations imported.')    
    return(nearby_venues)
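
The list comprehension above assumes that every returned item has at least one category. A hedged sketch of a more defensive parser, run here on a hand-made payload that mimics only the fields the function reads (`venue.name`, `venue.location.lat`/`lng`, `venue.categories[0].name`). `parse_items` is a hypothetical helper and the payload is invented:

```python
def parse_items(city, lat, lng, items):
    """Flatten explore-style items into rows, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        if not categories:
            continue  # no category means nothing to one-hot encode later
        location = venue.get("location", {})
        rows.append((city, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0]["name"]))
    return rows

sample_items = [
    {"venue": {"name": "Cafe X", "location": {"lat": 59.35, "lng": 18.09},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unnamed spot", "location": {"lat": 59.36, "lng": 18.10},
               "categories": []}},
]
print(parse_items("Stockholm", 59.3508, 18.0973, sample_items))
```

The second sample venue has an empty category list and is skipped instead of raising an `IndexError`.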

In [None]:
venues = getNearbyVenues(name=extra_locations_df['city'],
                                   latitudes=extra_locations_df['lat'],
                                   longitudes=extra_locations_df['lng']
                                  )

Finished importing -  1205 locations imported.


In [None]:
#Drop duplicates from our list of venues.

venues.drop_duplicates(inplace=True,subset=['Venue','Venue Latitude','Venue Longitude','Venue Category'])
venues.shape[0]

1202

In [None]:
venues[['City','Venue']].groupby('City').count().sort_values(by='Venue',ascending=False)

City,Venue
Copenhagen,330
Stockholm,144
Uppsala,132
Vasteras,56
Helsingborg,55
Halmstad,49
Linkoping,44
Umea,39
Boras,38
Malmo,38


Now we have gathered a list of venues for each city.<br>
The next step is to convert this dataframe to a format that is useful for machine learning.

In this example, we gather considerably more venues for Copenhagen than for Stockholm.<br>This seems unreasonable, as Stockholm is actually more populous than Copenhagen.<br>The most likely explanation is that either the Foursquare database is less complete for Stockholm than for Copenhagen or, as you will see later, the coordinate for Stockholm is not in the center of the city. 
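
The conversion to a machine-learning-friendly format can be illustrated on a toy table before applying it to the real data. The sketch below uses invented venues and shows how one-hot encoding followed by a per-city mean turns a venue list into one frequency profile per city:

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "City": ["Uppsala", "Uppsala", "Malmo", "Malmo", "Malmo", "Malmo"],
    "Venue Category": ["Café", "Hotel", "Bus Stop", "Bus Stop", "Bakery", "Café"],
})

# One column per category, one row per venue
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "City", toy_venues["City"])

# Averaging the indicator columns gives each category's share per city
profile = onehot.groupby("City").mean()
print(profile)
```

Each row of `profile` sums to 1, so cities with very different venue counts end up on a comparable scale.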

In [None]:
#One-hot encoding
venues_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

#Add city column back to new one-hot encoded frame:
venues_onehot['City'] = venues['City']

# move the City column to the first position
new_cols = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])
venues_onehot = venues_onehot[new_cols]

venues_onehot

Unnamed: 0,City,Airport,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Australian Restaurant,Auto Garage,Auto Workshop,Automotive Shop,BBQ Joint,Bagel Shop,Bakery,Bar,Beach,Bed & Breakfast,Beer Bar,Beer Store,Bike Rental / Bike Share,Bistro,Boat or Ferry,Bookstore,Bowling Alley,Breakfast Spot,Brewery,Buffet,Burger Joint,Bus Line,Bus Station,Bus Stop,Business Service,Café,Campground,Candy Store,Capitol Building,Castle,...,Skate Park,Skating Rink,Ski Area,Smoke Shop,Snack Place,Soccer Field,Soccer Stadium,South American Restaurant,Spa,Spanish Restaurant,Sporting Goods Shop,Sports Bar,Sports Club,Stables,Stadium,Steakhouse,Student Center,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tennis Stadium,Thai Restaurant,Theater,Theme Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Tourist Information Center,Toy / Game Store,Trail,Train Station,Tram Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Water Park,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Stockholm,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Stockholm,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Stockholm,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Stockholm,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Stockholm,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200,Copenhagen,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1201,Copenhagen,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1202,Copenhagen,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1203,Copenhagen,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
venues_grouped = venues_onehot.groupby('City').mean().reset_index()

venues_grouped

Unnamed: 0,City,Airport,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Australian Restaurant,Auto Garage,Auto Workshop,Automotive Shop,BBQ Joint,Bagel Shop,Bakery,Bar,Beach,Bed & Breakfast,Beer Bar,Beer Store,Bike Rental / Bike Share,Bistro,Boat or Ferry,Bookstore,Bowling Alley,Breakfast Spot,Brewery,Buffet,Burger Joint,Bus Line,Bus Station,Bus Stop,Business Service,Café,Campground,Candy Store,Capitol Building,Castle,...,Skate Park,Skating Rink,Ski Area,Smoke Shop,Snack Place,Soccer Field,Soccer Stadium,South American Restaurant,Spa,Spanish Restaurant,Sporting Goods Shop,Sports Bar,Sports Club,Stables,Stadium,Steakhouse,Student Center,Supermarket,Sushi Restaurant,Tapas Restaurant,Tea Room,Tennis Stadium,Thai Restaurant,Theater,Theme Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Tourist Information Center,Toy / Game Store,Trail,Train Station,Tram Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Water Park,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Boras,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.026316,0.0,0.052632,0.026316,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.026316,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Copenhagen,0.0,0.006061,0.0,0.0,0.009091,0.00303,0.009091,0.006061,0.0,0.0,0.0,0.0,0.0,0.0,0.006061,0.039394,0.036364,0.0,0.0,0.009091,0.00303,0.0,0.00303,0.0,0.012121,0.00303,0.006061,0.00303,0.00303,0.009091,0.0,0.00303,0.0,0.0,0.060606,0.0,0.0,0.00303,0.0,...,0.00303,0.0,0.0,0.0,0.0,0.0,0.0,0.00303,0.00303,0.0,0.006061,0.0,0.00303,0.0,0.0,0.009091,0.0,0.015152,0.018182,0.0,0.00303,0.0,0.015152,0.009091,0.00303,0.0,0.0,0.00303,0.009091,0.0,0.0,0.0,0.0,0.0,0.009091,0.0,0.021212,0.009091,0.006061,0.0
2,Gavle,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Goteborg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.034483,0.0,0.034483,0.0,0.068966,0.034483,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068966,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Halmstad,0.020408,0.020408,0.0,0.0,0.0,0.0,0.0,0.020408,0.020408,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.020408,0.061224,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.020408,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.040816,0.040816,0.0,0.0,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Helsingborg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036364,0.018182,0.0,0.018182,0.0,0.0,0.018182,0.0,0.0,0.018182,0.0,0.036364,0.0,0.0,0.0,0.0,0.018182,0.0,0.054545,0.0,0.0,0.0,0.018182,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036364,0.0,0.0,0.054545,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Jonkoping,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Karlstad,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Linkoping,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.113636,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.022727,0.022727,0.0,0.045455,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Lulea,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.0,0.0,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
num_top_venues = 5

for city in venues_grouped['City']:
    print("----"+city+"----")
    temp = venues_grouped[venues_grouped['City'] == city].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Boras----
                 venue  freq
0    Electronics Store  0.05
1          Pizza Place  0.05
2         Burger Joint  0.05
3        Grocery Store  0.05
4  Sporting Goods Shop  0.05


----Copenhagen----
                     venue  freq
0  Scandinavian Restaurant  0.06
1                     Café  0.06
2                      Bar  0.04
3              Pizza Place  0.04
4                   Bakery  0.04


----Gavle----
                        venue  freq
0                 Pizza Place  0.42
1             Nature Preserve  0.08
2  Construction & Landscaping  0.08
3                 Supermarket  0.08
4        Other Great Outdoors  0.08


----Goteborg----
                        venue  freq
0           Electronics Store  0.10
1      Furniture / Home Store  0.07
2         Sporting Goods Shop  0.07
3  Construction & Landscaping  0.07
4                    Bus Stop  0.07


----Halmstad----
                  venue  freq
0                  Park  0.06
1                  Café  0.06
2       Harbor / 

In [None]:
#Function to return the most common venues from the frequency distribution in df: venues_grouped
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
cities_venues_sorted = pd.DataFrame(columns=columns)
cities_venues_sorted['City'] = venues_grouped['City']

for ind in np.arange(venues_grouped.shape[0]):
    cities_venues_sorted.iloc[ind, 1:] = return_most_common_venues(venues_grouped.iloc[ind, :], num_top_venues)
cities_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Boras,Pool,Burger Joint,Furniture / Home Store,Café,Sporting Goods Shop,Grocery Store,Electronics Store,Construction & Landscaping,Pizza Place,Fast Food Restaurant
1,Copenhagen,Scandinavian Restaurant,Café,Bakery,Pizza Place,Bar,Italian Restaurant,Coffee Shop,French Restaurant,Hotel,Cocktail Bar
2,Gavle,Pizza Place,Construction & Landscaping,Outdoors & Recreation,Nature Preserve,Supermarket,Other Great Outdoors,Thai Restaurant,Grocery Store,Dessert Shop,Diner
3,Goteborg,Electronics Store,Construction & Landscaping,Bus Stop,Sporting Goods Shop,Furniture / Home Store,Bus Line,Grocery Store,Supermarket,Scandinavian Restaurant,Sandwich Place
4,Halmstad,Park,Pizza Place,Café,Harbor / Marina,Supermarket,Theater,Sushi Restaurant,Pub,Restaurant,Bar


In [None]:
# set number of clusters
kclusters = int(venues_grouped.shape[0]/5)

venues_grouped_clustering = venues_grouped.drop(columns='City')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 1, 2, 0, 1, 1, 1, 3, 1, 1], dtype=int32)

In [None]:
# Try DBSCAN clustering as an alternative:
from sklearn.cluster import DBSCAN
db_clusters = DBSCAN().fit(venues_grouped_clustering)
db_clusters.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
from sklearn.cluster import MeanShift, estimate_bandwidth

# Compute clustering with MeanShift on the same feature matrix used for k-means
X = venues_grouped_clustering.values

# The bandwidth can be estimated automatically from the data:
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)

In [None]:
# Compute clustering with Ward hierarchical clustering.
import time
from sklearn.cluster import AgglomerativeClustering

print("Compute hierarchical clustering...")
st = time.time()
n_clusters = kclusters  # same number of clusters as for k-means
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
ward.fit(venues_grouped_clustering)
label = ward.labels_
print("Elapsed time: ", time.time() - st)
print("Number of cities: ", label.size)
print("Number of clusters: ", np.unique(label).size)

In [None]:
# add clustering labels
cities_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

cities_merged = target_df

# join the sorted venues onto target_df to add the cluster label and most common venues for each city
cities_merged = cities_merged.join(cities_venues_sorted.set_index('City'), on='city')

cities_merged # check the last columns!

Unnamed: 0,index,city,lat,lng,country,population,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,381,Stockholm,59.3508,18.0973,Sweden,1264000.0,1,Scandinavian Restaurant,Boat or Ferry,Italian Restaurant,Café,Grocery Store,Thai Restaurant,Museum,Park,Sushi Restaurant,Pizza Place
1,999,Goteborg,57.75,12.0,Sweden,537797.0,0,Electronics Store,Construction & Landscaping,Bus Stop,Sporting Goods Shop,Furniture / Home Store,Bus Line,Grocery Store,Supermarket,Scandinavian Restaurant,Sandwich Place
2,1688,Malmo,55.5833,13.0333,Sweden,269349.0,0,Bus Stop,Fast Food Restaurant,Pizza Place,Athletics & Sports,Falafel Restaurant,Bakery,Gym / Fitness Center,Turkish Restaurant,Hockey Rink,Food Truck
3,2771,Uppsala,59.8601,17.64,Sweden,133117.0,1,Café,Hotel,Coffee Shop,Restaurant,Sushi Restaurant,Italian Restaurant,Thai Restaurant,Bookstore,Scandinavian Restaurant,Supermarket
4,3132,Vasteras,59.63,16.54,Sweden,107194.0,1,Restaurant,Café,Pizza Place,Asian Restaurant,Park,Construction & Landscaping,Mountain,Coffee Shop,Scandinavian Restaurant,Bar
5,3282,Orebro,59.2803,15.22,Sweden,98573.0,1,Park,Train Station,Grocery Store,Beach,Soccer Field,Clothing Store,Restaurant,Gym / Fitness Center,Rental Car Location,Sushi Restaurant
6,3303,Linkoping,58.41,15.6299,Sweden,96732.0,1,Café,Hotel,Restaurant,Gym / Fitness Center,Pub,Grocery Store,Supermarket,Plaza,Italian Restaurant,Sports Bar
7,3362,Helsingborg,56.0505,12.7,Sweden,91304.0,1,Hotel,Restaurant,Supermarket,Café,Harbor / Marina,Pizza Place,Pet Store,Brewery,Stadium,Bar
8,3376,Jonkoping,57.7713,14.165,Sweden,89780.0,1,Pizza Place,Fast Food Restaurant,Burger Joint,Hotel,Park,Playground,Coffee Shop,Farm,Furniture / Home Store,Bookstore
9,3505,Umea,63.83,20.24,Sweden,78197.0,1,Hotel,Restaurant,Café,Supermarket,Italian Restaurant,Fast Food Restaurant,Pub,Shopping Mall,Coffee Shop,Train Station


In [None]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Center on target country:
address = target_country

geolocator = Nominatim(user_agent="location_comparison")
location = geolocator.geocode(target_country)
latitude = location.latitude
longitude = location.longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=6)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster,top in zip(cities_merged['lat'], cities_merged['lng'], cities_merged['city'], cities_merged['Cluster Labels'],cities_merged['1st Most Common Venue']):
    label = folium.Popup(str(poi) +
                         ' Cluster ' + 
                         str(cluster) + '\n'+
                         'Most common venue: '+
                         str(top),
                         parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters