# Capstone Project

## Problem definition

For this project, we are going to find out the similarities or dissimilarities between European and North American cities. At the beginning, it was supposed to compare the neighborhoods of one city of each continent but finally we have decided to explore the capital of each country and state.
This will be helpful for people who need to travel abroad and they are afraid of change so they prefer to move to cities that are alike to theirs.
We expect to find that American and European cities are very similar (with exceptions), unlike if we would compare eastern (i.e. Asian cities) and western cities.

## Data description

For the problem, we are going to need a list containing the coordinates of the capital of each European country, the name of the country and the capital name; and the same for North American states.
Location data from Foursquare will be used to discover the venues of each capital in order to analyze the (dis)similarities between cities.

## Code

In [1]:
import pandas as pd
import numpy as np

In [2]:
usa = pd.read_csv('statelatlong.csv')
usa.head()

Unnamed: 0,State,Latitude,Longitude,City
0,AL,32.601011,-86.680736,Alabama
1,AK,61.302501,-158.77502,Alaska
2,AZ,34.168219,-111.930907,Arizona
3,AR,34.751928,-92.131378,Arkansas
4,CA,37.271875,-119.270415,California


In [None]:
world = pd.read_csv('country-capitals.csv', error_bad_lines=False)

In [4]:
europe = world.loc[world.ContinentName == 'Europe'].copy()
europe.drop('ContinentName', axis = 1, inplace = True)
europe.rename(index=str, columns={"CapitalName": "City", "CapitalLatitude": "Latitude", "CapitalLongitude": "Longitude"}, inplace = True)
europe.reset_index(inplace = True)
europe.head()

Unnamed: 0,index,CountryName,City,Latitude,Longitude,CountryCode
0,4,Aland Islands,Mariehamn,60.116667,19.9,AX
1,10,Albania,Tirana,41.316667,19.816667,AL
2,13,Andorra,Andorra la Vella,42.5,1.516667,AD
3,18,Armenia,Yerevan,40.166667,44.5,AM
4,21,Austria,Vienna,48.2,16.366667,AT


In [5]:
print(usa.count())
print(europe.count())

State        51
Latitude     51
Longitude    51
City         51
dtype: int64
index          58
CountryName    58
City           58
Latitude       58
Longitude      58
CountryCode    57
dtype: int64


We have a NA value in europe.CountryCode

In [6]:
print(usa.duplicated().sum())
print(europe.duplicated().sum())

0
0


In [7]:
europe[europe.isnull().any(axis=1)] # Check rows with NA values

Unnamed: 0,index,CountryName,City,Latitude,Longitude,CountryCode
57,238,Northern Cyprus,North Nicosia,35.183333,33.366667,


### USA

The geographic center of the contiguous United States is the center of 48 U.S. states

In [8]:
us_latitude = 39.828175
us_longitude = -98.5795

In [9]:
import folium

map_usa = folium.Map(location=[us_latitude, us_longitude], zoom_start = 3)

# add markers to map
for lat, lng, state, city in zip(usa['Latitude'], usa['Longitude'], usa['State'], usa['City']):
    label = '{}, {}'.format(state, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_usa)  
    
map_usa

We need to start with zoom 3 because of Alaska and Hawaii

#### Foursquare

In [None]:
CLIENT_ID = "REMOVED"
CLIENT_SECRET = "REMOVED"
VERSION = "20180605"

In [11]:
usa.loc[0, 'City']

'Alabama'

In [12]:
usa_latitude = usa.loc[0, 'Latitude'] # neighborhood latitude value
usa_longitude = usa.loc[0, 'Longitude'] # neighborhood longitude value

usa_name = usa.loc[0, 'City'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(usa_name, 
                                                               usa_latitude, 
                                                               usa_longitude))

Latitude and longitude values of Alabama are 32.601011199999995, -86.6807365.


In [13]:
LIMIT = 500 # limit of number of venues returned by Foursquare API

radius = 100000 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    usa_latitude, 
    usa_longitude, 
    radius, 
    LIMIT)

In [None]:
import requests 
import json

results = requests.get(url).json()
results

In [15]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [16]:
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]         

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Pratt Park,Park,32.455921,-86.467692
1,Uncle Mick's Cajun Market & Cafe,Cajun / Creole Restaurant,32.460007,-86.474098
2,Durbin Farms Market,Sandwich Place,32.802761,-86.583011
3,Bruster's Ice Cream,Ice Cream Shop,32.460184,-86.424777
4,Sweet Frog,Frozen Yogurt Shop,32.460733,-86.413018


In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [18]:
usa_venues = getNearbyVenues(names=usa['City'],
                                   latitudes=usa['Latitude'],
                                   longitudes=usa['Longitude']
                                  )

Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming


In [19]:
print(usa_venues.shape)
usa_venues.head()

(513, 7)


Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Arizona,34.168219,-111.930907,Cay Evi 145,34.178977,-111.92859,Restaurant
1,Arkansas,34.751928,-92.131378,Main Theatre,34.7519,-92.13055,Movie Theater
2,Arkansas,34.751928,-92.131378,Stonelinks Golf Course,34.746026,-92.12557,Golf Course
3,Arkansas,34.751928,-92.131378,Reed Electric,34.737694,-92.131911,Construction & Landscaping
4,Arkansas,34.751928,-92.131378,Guloc-Roc,34.738866,-92.123549,Bookstore


In [20]:
usa_venues.groupby('City').count()

Unnamed: 0_level_0,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Arizona,1,1,1,1,1,1
Arkansas,4,4,4,4,4,4
California,1,1,1,1,1,1
Connecticut,36,36,36,36,36,36
Delaware,2,2,2,2,2,2
District of Columbia,100,100,100,100,100,100
Georgia,2,2,2,2,2,2
Indiana,12,12,12,12,12,12
Iowa,1,1,1,1,1,1
Kentucky,1,1,1,1,1,1


In [21]:
print('There are {} uniques categories.'.format(len(usa_venues['Venue Category'].unique())))

There are 156 uniques categories.


#### Analyzing each city

In [22]:
# one hot encoding
usa_onehot = pd.get_dummies(usa_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
usa_onehot['City'] = usa_venues['City']

# move neighborhood column to the first column
fixed_columns = [usa_onehot.columns[-1]] + list(usa_onehot.columns[:-1])
usa_onehot = usa_onehot[fixed_columns]

usa_onehot.head()

Unnamed: 0,City,Airport,American Restaurant,Art Gallery,Art Museum,Asian Restaurant,Austrian Restaurant,BBQ Joint,Bagel Shop,Bakery,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Winery,Yoga Studio
0,Arizona,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Arkansas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Arkansas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Arkansas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Arkansas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
usa_grouped = usa_onehot.groupby('City').mean().reset_index()
usa_grouped

Unnamed: 0,City,Airport,American Restaurant,Art Gallery,Art Museum,Asian Restaurant,Austrian Restaurant,BBQ Joint,Bagel Shop,Bakery,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Winery,Yoga Studio
0,Arizona,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Arkansas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,California,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Connecticut,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.027778,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0
4,Delaware,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,District of Columbia,0.0,0.02,0.0,0.05,0.01,0.0,0.01,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0
6,Georgia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Indiana,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Iowa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Kentucky,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
num_top_venues = 5

for hood in usa_grouped['City']:
    print("----"+hood+"----")
    temp = usa_grouped[usa_grouped['City'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Arizona----
                     venue  freq
0               Restaurant   1.0
1                  Airport   0.0
2                     Park   0.0
3  New American Restaurant   0.0
4           Nightlife Spot   0.0


----Arkansas----
                        venue  freq
0                   Bookstore  0.25
1                 Golf Course  0.25
2               Movie Theater  0.25
3  Construction & Landscaping  0.25
4                     Airport  0.00


----California----
                 venue  freq
0       Cosmetics Shop   1.0
1              Airport   0.0
2  Peruvian Restaurant   0.0
3       Nightlife Spot   0.0
4               Office   0.0


----Connecticut----
            venue  freq
0           Hotel  0.14
1  Baseball Field  0.08
2  Sandwich Place  0.06
3     Pizza Place  0.06
4        Pharmacy  0.06


----Delaware----
                     venue  freq
0                    Beach   1.0
1                  Airport   0.0
2            National Park   0.0
3  New American Restaurant   0.0
4     

In [25]:
def return_most_common_venues(row, num_top_venues):
    
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

In [26]:
# Let’s create the new dataframe and display the top 10 venues for each neighborhood.

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
usa_venues_sorted = pd.DataFrame(columns=columns)
usa_venues_sorted['City'] = usa_grouped['City']

for ind in np.arange(usa_grouped.shape[0]):
    usa_venues_sorted.iloc[ind, 1:] = return_most_common_venues(usa_grouped.iloc[ind, :], num_top_venues)

usa_venues_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Arizona,Restaurant,Yoga Studio,Farm,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fishing Spot,Filipino Restaurant,Fast Food Restaurant
1,Arkansas,Bookstore,Movie Theater,Construction & Landscaping,Golf Course,Yoga Studio,Filipino Restaurant,Food Court,Food & Drink Shop,Flower Shop,Flea Market
2,California,Cosmetics Shop,Yoga Studio,Fast Food Restaurant,Food Truck,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fishing Spot,Filipino Restaurant
3,Connecticut,Hotel,Baseball Field,Pharmacy,Donut Shop,Sandwich Place,Pizza Place,Liquor Store,Pool,Fast Food Restaurant,Mexican Restaurant
4,Delaware,Beach,Yoga Studio,Furniture / Home Store,Forest,Food Truck,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fishing Spot
5,District of Columbia,Coffee Shop,Art Museum,Pizza Place,Cocktail Bar,Theater,Ice Cream Shop,New American Restaurant,Monument / Landmark,Mediterranean Restaurant,Burger Joint
6,Georgia,Restaurant,Forest,Yoga Studio,Fast Food Restaurant,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fishing Spot,Filipino Restaurant
7,Indiana,Ice Cream Shop,Golf Course,Park,Optical Shop,Office,Sandwich Place,Fast Food Restaurant,Pizza Place,Breakfast Spot,Furniture / Home Store
8,Iowa,Cosmetics Shop,Yoga Studio,Fast Food Restaurant,Food Truck,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fishing Spot,Filipino Restaurant
9,Kentucky,Liquor Store,Yoga Studio,Fast Food Restaurant,Food Truck,Food Court,Food & Drink Shop,Flower Shop,Flea Market,Fishing Spot,Filipino Restaurant


Now we are gonna do the same with Europe and then merge both dataframes and cluster them so we can see whether european and north american cities are similar or not

### Europe

The location of the geographical centre of Europe depends on the definition of the borders of Europe, mainly whether remote islands are included to define the extreme points of Europe, and on the method of calculating the final result.

In [27]:
eu_latitude = 48.499998
eu_longitude = 23.3833318

In [28]:
map_europe = folium.Map(location=[eu_latitude, eu_longitude], zoom_start = 4)

# add markers to map
for lat, lng, country, city in zip(europe['Latitude'], europe['Longitude'], europe['CountryName'], europe['City']):
    label = '{}, {}'.format(state, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_europe)  
    
map_europe

In [29]:
europe.loc[0, 'City']

'Mariehamn'

In [30]:
europe_latitude = europe.loc[0, 'Latitude'] # neighborhood latitude value
europe_longitude = europe.loc[0, 'Longitude'] # neighborhood longitude value

europe_name = europe.loc[0, 'City'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(europe_name, 
                                                               europe_latitude, 
                                                               europe_longitude))

Latitude and longitude values of Mariehamn are 60.11666700000001, 19.9.


In [31]:
LIMIT = 500 # limit of number of venues returned by Foursquare API

radius = 100000 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    usa_latitude, 
    usa_longitude, 
    radius, 
    LIMIT)

In [None]:
results = requests.get(url).json()
results

In [33]:
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]         

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Pratt Park,Park,32.455921,-86.467692
1,Uncle Mick's Cajun Market & Cafe,Cajun / Creole Restaurant,32.460007,-86.474098
2,Durbin Farms Market,Sandwich Place,32.802761,-86.583011
3,Bruster's Ice Cream,Ice Cream Shop,32.460184,-86.424777
4,Sweet Frog,Frozen Yogurt Shop,32.460733,-86.413018


In [34]:
europe_venues = getNearbyVenues(names=europe['City'],
                                   latitudes=europe['Latitude'],
                                   longitudes=europe['Longitude']
                                  )

Mariehamn
Tirana
Andorra la Vella
Yerevan
Vienna
Baku
Minsk
Brussels
Sarajevo
Sofia
Zagreb
Nicosia
Prague
Copenhagen
Tallinn
Torshavn
Helsinki
Paris
Tbilisi
Berlin
Gibraltar
Athens
Saint Peter Port
Vatican City
Budapest
Reykjavik
Dublin
Douglas
Rome
Saint Helier
Pristina
Riga
Vaduz
Vilnius
Luxembourg
Skopje
Valletta
Chisinau
Monaco
Podgorica
Amsterdam
Oslo
Warsaw
Lisbon
Bucharest
Moscow
San Marino
Belgrade
Bratislava
Ljubljana
Madrid
Longyearbyen
Stockholm
Bern
Ankara
Kyiv
London
North Nicosia


In [35]:
print(europe_venues.shape)
europe_venues.head()

(5124, 7)


Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Mariehamn,60.116667,19.9,S-market,60.113724,19.914182,Supermarket
1,Mariehamn,60.116667,19.9,Avancia,60.105645,19.927145,Gym / Fitness Center
2,Mariehamn,60.116667,19.9,Feja,60.112267,19.912491,Paper / Office Supplies Store
3,Mariehamn,60.116667,19.9,Ramsholmen,60.111304,19.885456,Trail
4,Mariehamn,60.116667,19.9,Backage claim,60.124889,19.907579,Airport Terminal


In [36]:
europe_venues.groupby('City').count()

Unnamed: 0_level_0,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Amsterdam,100,100,100,100,100,100
Andorra la Vella,100,100,100,100,100,100
Ankara,100,100,100,100,100,100
Athens,100,100,100,100,100,100
Baku,100,100,100,100,100,100
Belgrade,39,39,39,39,39,39
Berlin,100,100,100,100,100,100
Bern,16,16,16,16,16,16
Bratislava,100,100,100,100,100,100
Brussels,100,100,100,100,100,100


In [37]:
print('There are {} uniques categories.'.format(len(europe_venues['Venue Category'].unique())))

There are 384 uniques categories.


In [38]:
# one hot encoding
europe_onehot = pd.get_dummies(europe_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
europe_onehot['Cities'] = europe_venues['City']

# move neighborhood column to the first column
fixed_columns = [europe_onehot.columns[-1]] + list(europe_onehot.columns[:-1])
europe_onehot = europe_onehot[fixed_columns]

europe_onehot.head()

Unnamed: 0,Cities,Accessories Store,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,...,Vietnamese Restaurant,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Mariehamn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Mariehamn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Mariehamn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Mariehamn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Mariehamn,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
europe_grouped = europe_onehot.groupby('Cities').mean().reset_index()
europe_grouped

Unnamed: 0,Cities,Accessories Store,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,...,Vietnamese Restaurant,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Amsterdam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01
1,Andorra la Vella,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
2,Ankara,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Athens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.0,0.0,0.0
4,Baku,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Belgrade,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Berlin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
7,Bern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Bratislava,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0
9,Brussels,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0


In [40]:
num_top_venues = 5

for hood in europe_grouped['Cities']:
    print("----"+hood+"----")
    temp = europe_grouped[europe_grouped['Cities'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Amsterdam----
                venue  freq
0               Hotel  0.09
1   French Restaurant  0.07
2                Café  0.06
3  Italian Restaurant  0.06
4         Coffee Shop  0.05


----Andorra la Vella----
                venue  freq
0          Restaurant  0.13
1               Hotel  0.11
2  Spanish Restaurant  0.05
3       Shopping Mall  0.04
4   French Restaurant  0.04


----Ankara----
            venue  freq
0            Café  0.12
1   Historic Site  0.07
2  History Museum  0.07
3     Art Gallery  0.06
4   Jewelry Store  0.05


----Athens----
          venue  freq
0           Bar  0.12
1          Café  0.11
2   Coffee Shop  0.08
3  Cocktail Bar  0.07
4      Wine Bar  0.04


----Baku----
         venue  freq
0   Restaurant  0.09
1  Coffee Shop  0.08
2         Café  0.07
3        Hotel  0.07
4       Lounge  0.06


----Belgrade----
         venue  freq
0  Supermarket  0.10
1         Café  0.08
2    Nightclub  0.08
3  Flower Shop  0.08
4   Restaurant  0.08


----Berlin----
      

                venue  freq
0              Bakery  0.07
1          Restaurant  0.07
2  Italian Restaurant  0.06
3        Dessert Shop  0.05
4         Coffee Shop  0.05


----Stockholm----
                     venue  freq
0                     Café  0.10
1              Coffee Shop  0.07
2                    Hotel  0.06
3  Scandinavian Restaurant  0.06
4              Pizza Place  0.04


----Tallinn----
                         venue  freq
0                         Café  0.08
1  Eastern European Restaurant  0.07
2                   Restaurant  0.06
3               Scenic Lookout  0.05
4                     Wine Bar  0.05


----Tbilisi----
                  venue  freq
0                 Hotel  0.37
1  Caucasian Restaurant  0.10
2            Restaurant  0.05
3                  Park  0.03
4       Bed & Breakfast  0.03


----Tirana----
                venue  freq
0                Café  0.11
1  Italian Restaurant  0.10
2               Hotel  0.07
3        Cocktail Bar  0.07
4         Coffee Sh

In [41]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
europe_venues_sorted = pd.DataFrame(columns=columns)
europe_venues_sorted['City'] = europe_grouped['Cities']

for ind in np.arange(europe_grouped.shape[0]):
    europe_venues_sorted.iloc[ind, 1:] = return_most_common_venues(europe_grouped.iloc[ind, :], num_top_venues)

europe_venues_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amsterdam,Hotel,French Restaurant,Italian Restaurant,Café,Coffee Shop,Restaurant,Bar,Ice Cream Shop,Park,Bakery
1,Andorra la Vella,Restaurant,Hotel,Spanish Restaurant,Shopping Mall,French Restaurant,Tapas Restaurant,Coffee Shop,Café,Clothing Store,Burger Joint
2,Ankara,Café,History Museum,Historic Site,Art Gallery,Jewelry Store,Coffee Shop,Theater,Restaurant,Bakery,Bookstore
3,Athens,Bar,Café,Coffee Shop,Cocktail Bar,Falafel Restaurant,Wine Bar,Greek Restaurant,Theater,Bookstore,Boutique
4,Baku,Restaurant,Coffee Shop,Hotel,Café,Lounge,Clothing Store,Park,Italian Restaurant,Supermarket,Pub
5,Belgrade,Supermarket,Restaurant,Nightclub,Café,Flower Shop,Gas Station,Clothing Store,Seafood Restaurant,Modern European Restaurant,Coffee Shop
6,Berlin,Hotel,History Museum,Coffee Shop,Bookstore,Ice Cream Shop,Art Gallery,Concert Hall,Steakhouse,Theater,Plaza
7,Bern,Train Station,Supermarket,Convenience Store,Japanese Restaurant,Swiss Restaurant,Bed & Breakfast,Tennis Court,Grocery Store,Bakery,Discount Store
8,Bratislava,Café,Coffee Shop,Vegetarian / Vegan Restaurant,Burger Joint,Hotel,Brewery,Restaurant,Farmers Market,Art Gallery,Bar
9,Brussels,Bar,French Restaurant,Italian Restaurant,Sandwich Place,Coffee Shop,Brasserie,Restaurant,Plaza,Park,Farmers Market


## Clustering

In [42]:
usa_grouped.head()

Unnamed: 0,City,Airport,American Restaurant,Art Gallery,Art Museum,Asian Restaurant,Austrian Restaurant,BBQ Joint,Bagel Shop,Bakery,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Winery,Yoga Studio
0,Arizona,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Arkansas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,California,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Connecticut,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.027778,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0
4,Delaware,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
europe_grouped.head()

Unnamed: 0,Cities,Accessories Store,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,...,Vietnamese Restaurant,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Amsterdam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01
1,Andorra la Vella,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
2,Ankara,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Athens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.0,0.0,0.0
4,Baku,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As we can see there is no way we can merge the two dataframes since they have differen number of columns because there are different venues depending on each country. 

So we have two ways now:
* We can cluster Europe and North America separately and assume that, cluster 1 in North America = Cluster 1 in Europe and so on
* We can merge both dataframes before using Foursquare so we are going to get the same number of features for both continents and then cluster them together, so we will be sure that countries in Cluster 1 will be similar to each other.

### World

First of all we are going to preprocess both data frames so they have the same amount of columnns so we can concatenate them

In [44]:
europe.drop(['index', 'CountryName'], axis = 1, inplace = True)
europe.rename(index=str, columns={"CountryCode": "Code"}, inplace = True)
usa.rename(index=str, columns={"State": "Code"}, inplace = True)
europe = europe[['Code', 'Latitude', 'Longitude', 'City']]

In [45]:
usa.head()

Unnamed: 0,Code,Latitude,Longitude,City
0,AL,32.601011,-86.680736,Alabama
1,AK,61.302501,-158.77502,Alaska
2,AZ,34.168219,-111.930907,Arizona
3,AR,34.751928,-92.131378,Arkansas
4,CA,37.271875,-119.270415,California


In [46]:
europe.head()

Unnamed: 0,Code,Latitude,Longitude,City
0,AX,60.116667,19.9,Mariehamn
1,AL,41.316667,19.816667,Tirana
2,AD,42.5,1.516667,Andorra la Vella
3,AM,40.166667,44.5,Yerevan
4,AT,48.2,16.366667,Vienna


In [47]:
world = [usa, europe]
world = pd.concat(world)
world.reset_index(drop = True, inplace = True)
world.head()

Unnamed: 0,Code,Latitude,Longitude,City
0,AL,32.601011,-86.680736,Alabama
1,AK,61.302501,-158.77502,Alaska
2,AZ,34.168219,-111.930907,Arizona
3,AR,34.751928,-92.131378,Arkansas
4,CA,37.271875,-119.270415,California


In [48]:
world.tail()

Unnamed: 0,Code,Latitude,Longitude,City
104,CH,46.916667,7.466667,Bern
105,TR,39.933333,32.866667,Ankara
106,UA,50.433333,30.516667,Kyiv
107,GB,51.5,-0.083333,London
108,,35.183333,33.366667,North Nicosia


Now we are gonna do the same analysis that we did with the previous data sets

In [49]:
latitude = 37.91331
longitude = -19.44808

In [50]:
map_world = folium.Map(location=[latitude, longitude], zoom_start = 2)

# add markers to map
for lat, lng, state, city in zip(world['Latitude'], world['Longitude'], world['Code'], world['City']):
    label = '{}, {}'.format(state, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_world)  
    
map_world

In [51]:
world_venues = getNearbyVenues(names=world['City'],
                                   latitudes=world['Latitude'],
                                   longitudes=world['Longitude']
                                  )

Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
Mariehamn
Tirana
Andorra la Vella
Yerevan
Vienna
Baku
Minsk
Brussels
Sarajevo
Sofia
Zagreb
Nicosia
Prague
Copenhagen
Tallinn
Torshavn
Helsinki
Paris
Tbilisi
Berlin
Gibraltar
Athens
Saint Peter Port
Vatican City
Budapest
Reykjavik
Dublin
Douglas
Rome
Saint Helier
Pristina
Riga
Vaduz
Vilnius
Luxembourg
Skopje
Valletta
Chisinau
Monaco
Podgorica
Amsterdam
Oslo
Warsaw
Lisbon
Bucharest
Moscow
San Marino
Belgrade
Bratislava
Ljubljana
Madrid
Longyearbyen
Stockholm
Bern
Ankara
Kyiv
London
North Nicosia


In [52]:
print(world_venues.shape)
world_venues.head()

(5637, 7)


Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Arizona,34.168219,-111.930907,Cay Evi 145,34.178977,-111.92859,Restaurant
1,Arkansas,34.751928,-92.131378,Main Theatre,34.7519,-92.13055,Movie Theater
2,Arkansas,34.751928,-92.131378,Stonelinks Golf Course,34.746026,-92.12557,Golf Course
3,Arkansas,34.751928,-92.131378,Reed Electric,34.737694,-92.131911,Construction & Landscaping
4,Arkansas,34.751928,-92.131378,Guloc-Roc,34.738866,-92.123549,Bookstore


In [53]:
world_venues.groupby('City').count()

Unnamed: 0_level_0,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Amsterdam,100,100,100,100,100,100
Andorra la Vella,100,100,100,100,100,100
Ankara,100,100,100,100,100,100
Arizona,1,1,1,1,1,1
Arkansas,4,4,4,4,4,4
Athens,100,100,100,100,100,100
Baku,100,100,100,100,100,100
Belgrade,39,39,39,39,39,39
Berlin,100,100,100,100,100,100
Bern,16,16,16,16,16,16


In [54]:
print('There are {} uniques categories.'.format(len(world_venues['Venue Category'].unique())))

There are 407 uniques categories.


In [55]:
# one hot encoding
world_onehot = pd.get_dummies(world_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
world_onehot['City'] = world_venues['City']

# move neighborhood column to the first column
fixed_columns = [world_onehot.columns[-1]] + list(world_onehot.columns[:-1])
world_onehot = world_onehot[fixed_columns]

world_onehot.head()

Unnamed: 0,City,Accessories Store,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,...,Vietnamese Restaurant,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Arizona,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Arkansas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Arkansas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Arkansas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Arkansas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
world_onehot.shape

(5637, 408)

In [57]:
world_grouped = world_onehot.groupby('City').mean().reset_index()
world_grouped

Unnamed: 0,City,Accessories Store,African Restaurant,Airport,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,...,Vietnamese Restaurant,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Amsterdam,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.00,0.0000,0.000000,0.00,0.0200,0.00,0.000000,0.0000,0.00,0.010000
1,Andorra la Vella,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.00,0.0000,0.000000,0.00,0.0100,0.00,0.000000,0.0000,0.00,0.000000
2,Ankara,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.03,0.01,0.00,...,0.00,0.0000,0.000000,0.00,0.0000,0.00,0.000000,0.0000,0.00,0.000000
3,Arizona,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.00,0.0000,0.000000,0.00,0.0000,0.00,0.000000,0.0000,0.00,0.000000
4,Arkansas,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.00,0.0000,0.000000,0.00,0.0000,0.00,0.000000,0.0000,0.00,0.000000
5,Athens,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.00,0.0000,0.000000,0.02,0.0400,0.00,0.000000,0.0000,0.00,0.000000
6,Baku,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.00,0.0000,0.000000,0.00,0.0000,0.00,0.000000,0.0000,0.00,0.000000
7,Belgrade,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.00,0.0000,0.000000,0.00,0.0000,0.00,0.000000,0.0000,0.00,0.000000
8,Berlin,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.01,0.0000,0.000000,0.00,0.0100,0.00,0.000000,0.0000,0.00,0.000000
9,Bern,0.00,0.00,0.000000,0.0000,0.0,0.000000,0.00,0.00,0.00,...,0.00,0.0625,0.000000,0.00,0.0000,0.00,0.000000,0.0000,0.00,0.000000


In [58]:
world_grouped.shape

(86, 408)

In [59]:
num_top_venues = 5

for hood in world_grouped['City']:
    print("----"+hood+"----")
    temp = world_grouped[world_grouped['City'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Amsterdam----
                venue  freq
0               Hotel  0.09
1   French Restaurant  0.07
2  Italian Restaurant  0.06
3                Café  0.06
4         Coffee Shop  0.05


----Andorra la Vella----
                venue  freq
0          Restaurant  0.13
1               Hotel  0.11
2  Spanish Restaurant  0.05
3       Shopping Mall  0.04
4   French Restaurant  0.04


----Ankara----
            venue  freq
0            Café  0.12
1   Historic Site  0.07
2  History Museum  0.07
3     Art Gallery  0.06
4   Jewelry Store  0.05


----Arizona----
               venue  freq
0         Restaurant   1.0
1  Accessories Store   0.0
2       Night Market   0.0
3  Outdoor Sculpture   0.0
4    Other Nightlife   0.0


----Arkansas----
                        venue  freq
0               Movie Theater  0.25
1                   Bookstore  0.25
2  Construction & Landscaping  0.25
3                 Golf Course  0.25
4                Night Market  0.00


----Athens----
          venue  freq
0   

                 venue  freq
0                 Park  0.08
1  American Restaurant  0.07
2          Coffee Shop  0.06
3         Cocktail Bar  0.04
4               Bridge  0.03


----Nicosia----
              venue  freq
0       Coffee Shop  0.15
1              Café  0.10
2  Greek Restaurant  0.09
3               Bar  0.08
4          Wine Bar  0.05


----North Carolina----
                 venue  freq
0       Scenic Lookout  0.33
1  Fried Chicken Joint  0.33
2       Cosmetics Shop  0.33
3         Night Market  0.00
4    Outdoor Sculpture  0.00


----North Nicosia----
           venue  freq
0           Café  0.16
1            Bar  0.12
2    Coffee Shop  0.06
3     Restaurant  0.06
4  Historic Site  0.04


----Ohio----
                        venue  freq
0                     Airport  0.33
1              Cosmetics Shop  0.33
2  Construction & Landscaping  0.33
3           Accessories Store  0.00
4                   Nightclub  0.00


----Oregon----
               venue  freq
0     Cosmetics 

In [60]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
world_venues_sorted = pd.DataFrame(columns=columns)
world_venues_sorted['City'] = world_grouped['City']

for ind in np.arange(world_grouped.shape[0]):
    world_venues_sorted.iloc[ind, 1:] = return_most_common_venues(world_grouped.iloc[ind, :], num_top_venues)

world_venues_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Amsterdam,Hotel,French Restaurant,Italian Restaurant,Café,Coffee Shop,Bar,Ice Cream Shop,Restaurant,Bakery,Park
1,Andorra la Vella,Restaurant,Hotel,Spanish Restaurant,Shopping Mall,Tapas Restaurant,French Restaurant,Café,Coffee Shop,Cocktail Bar,Bar
2,Ankara,Café,History Museum,Historic Site,Art Gallery,Jewelry Store,Coffee Shop,Theater,Bookstore,Restaurant,Antique Shop
3,Arizona,Restaurant,Yoga Studio,Filipino Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm
4,Arkansas,Bookstore,Movie Theater,Construction & Landscaping,Golf Course,Yoga Studio,Filipino Restaurant,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant
5,Athens,Bar,Café,Coffee Shop,Cocktail Bar,Wine Bar,Greek Restaurant,Falafel Restaurant,Boutique,Bookstore,Theater
6,Baku,Restaurant,Coffee Shop,Café,Hotel,Lounge,Clothing Store,Italian Restaurant,Park,Pub,Steakhouse
7,Belgrade,Supermarket,Nightclub,Flower Shop,Restaurant,Café,Seafood Restaurant,Gas Station,Clothing Store,Electronics Store,Bar
8,Berlin,Hotel,History Museum,Ice Cream Shop,Bookstore,Art Gallery,Coffee Shop,Steakhouse,Concert Hall,Cocktail Bar,Clothing Store
9,Bern,Train Station,Supermarket,Electronics Store,Italian Restaurant,Grocery Store,Bed & Breakfast,Bakery,Swiss Restaurant,Japanese Restaurant,Discount Store


## Clustering

Now we can cluster each capital as we have all in a single dataframe

In [61]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 3

world_grouped_clustering = world_grouped.drop('City', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(world_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([2, 2, 2, 1, 2, 2, 2, 2, 2, 2])

In [None]:
# As we don'have different shapes in both dataframens we need another one with the same cities

world_merged = world[world['City'].isin(world_grouped['City'].unique())]

# add clustering labels
world_merged['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
world_merged = world_merged.join(world_venues_sorted.set_index('City'), on='City')

world_merged.head() # check the last columns!

In [63]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=3)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(world_merged['Latitude'], world_merged['Longitude'], world_merged['City'], world_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Cluster 1

In [67]:
world_merged.loc[world_merged['Cluster Labels'] == 0, world_merged.columns[[3] + list(range(5, world_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,New Jersey,Pizza Place,Restaurant,American Restaurant,Supermarket,Pharmacy,Music Store,Bakery,Bar,Beach,General Entertainment
49,Wisconsin,Cosmetics Shop,Campground,Yoga Studio,Filipino Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm
73,Saint Peter Port,Hotel,Restaurant,Pub,Café,Coffee Shop,Boat or Ferry,Supermarket,Bar,Seafood Restaurant,Cocktail Bar
90,Podgorica,Hotel,Italian Restaurant,Pizza Place,Café,Restaurant,Lounge,Jazz Club,Dessert Shop,Park,Pub
102,Longyearbyen,Hotel,Scandinavian Restaurant,Bar,Grocery Store,Pub,Café,Campground,Boarding House,Liquor Store,Bakery
105,Ankara,Café,History Museum,Historic Site,Art Gallery,Jewelry Store,Coffee Shop,Theater,Bookstore,Restaurant,Antique Shop


### Cluster 2

In [68]:
world_merged.loc[world_merged['Cluster Labels'] == 1, world_merged.columns[[3] + list(range(5, world_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Connecticut,Hotel,Baseball Field,Donut Shop,Sandwich Place,Pharmacy,Pizza Place,Deli / Bodega,Mountain,Lake,Beach
40,South Carolina,Cosmetics Shop,Optical Shop,Yoga Studio,Filipino Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm


### Cluster 3

In [66]:
world_merged.loc[world_merged['Cluster Labels'] == 2, world_merged.columns[[3] + list(range(5, world_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Arizona,Restaurant,Yoga Studio,Filipino Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm
3,Arkansas,Bookstore,Movie Theater,Construction & Landscaping,Golf Course,Yoga Studio,Filipino Restaurant,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant
4,California,Cosmetics Shop,Yoga Studio,Filipino Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm
7,Delaware,Beach,Yoga Studio,Filipino Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm,Farmers Market
8,District of Columbia,Art Museum,Coffee Shop,Cocktail Bar,Pizza Place,Theater,Mediterranean Restaurant,Monument / Landmark,New American Restaurant,Burger Joint,Ice Cream Shop
10,Georgia,Forest,Restaurant,Filipino Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm
14,Indiana,Ice Cream Shop,Karaoke Bar,Breakfast Spot,Furniture / Home Store,Fast Food Restaurant,Office,Golf Course,Optical Shop,Park,Sandwich Place
15,Iowa,Cosmetics Shop,Yoga Studio,Filipino Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm
17,Kentucky,Liquor Store,Yoga Studio,Filipino Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Eye Doctor,Fabric Shop,Falafel Restaurant,Farm
20,Maryland,Pizza Place,Shopping Mall,Grocery Store,Chinese Restaurant,Fast Food Restaurant,Park,Convenience Store,Sandwich Place,Gym / Fitness Center,Greek Restaurant
