# A. Introduction
## A1. Description & Discussion of the Background

I have been acquiring data science skills by taking the IBM Data Science Professional Course on Coursera. The last course contains a capstone project, which is about applying the data science toolset and the skills obtained to analyze a real-world problem and create value. My project concerns a topic that I am really interested in: the gym and health industry. The analysis was performed in Python, and the details are pushed to GitHub.

## A2. Business problem

In recent years, the healthy living industry has boomed. The client, a gym operator, is interested in opening a new unit, which will focus on offering her clients a personalized routine according to their weight, age, expectations and available time. Taking into account the financial plan under which the gym will operate, the intention is to find an optimal location in an area of Buenos Aires. The following criteria should be considered:
- Accessibility for local citizens (transport)
- Nearby competitors
- Metropolitan area

The assumption behind the analysis is that unsupervised machine learning can group districts into clusters, which will give us a list of candidate areas for the gym. The goal is for the gym to be located near one of the most populated areas, with little competition and easy access.

## A3. Data requirements

To perform this analysis, we will need the following data:

- List of the districts of Buenos Aires, Argentina
- Geo-coordinates of the districts in Buenos Aires
- Top venues of the districts

The data will be obtained as follows:

- The list of districts will be obtained from the GeoNames postal code export (http://download.geonames.org/export/zip/AR.zip).

- Geo-coordinates of districts will be obtained with the help of the geocoder tool in the notebook.

- Top venues data will be obtained from Foursquare through an API.

# B. Solution

## B1. Cluster Neighborhoods by preferences
### Import libraries

In [1]:
from urllib.request import urlopen
from zipfile import ZipFile
from datetime import date, timedelta
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library
import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

### Download Location Data of Buenos Aires

In [2]:
zipurl = 'http://download.geonames.org/export/zip/AR.zip'
# Download the archive from the URL and write it to a new file on the hard drive
with urlopen(zipurl) as zipresp, open("/tmp/tempfile.zip", "wb") as tempzip:
    tempzip.write(zipresp.read())
# Read the second member of the archive as a tab-separated table without a header row
with ZipFile("/tmp/tempfile.zip", 'r') as zf:
    with zf.open(zf.namelist()[1]) as f:
        df = pd.read_csv(f, sep='\t', header=None)
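
The cell above picks the data file by its position in the archive (`zf.namelist()[1]`). As a hedged alternative, the sketch below selects the member by name and parses the tab-separated rows entirely in memory. The archive built here is a hypothetical stand-in with two made-up members (`readme.txt`, `AR.txt`); the member names of the real download are an assumption and should be checked with `namelist()` first.

```python
import csv
import io
import zipfile

def read_tsv_member(zip_bytes, member_name):
    """Return the rows of one tab-separated member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        if member_name not in zf.namelist():
            raise KeyError("{} not found in {}".format(member_name, zf.namelist()))
        with zf.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.reader(text, delimiter="\t"))

# Build a small hypothetical archive in memory (not the real download).
sample = (
    "AR\t4123\tLAS SALADAS\tSalta\tA\t\t\t\t\t-25.7833\t-64.5\t4\n"
    "AR\t4126\tCEIBAL\tSalta\tA\t\t\t\t\t-26.1\t-65.0167\t4\n"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("readme.txt", "hypothetical readme")
    zf.writestr("AR.txt", sample)

rows = read_tsv_member(buffer.getvalue(), "AR.txt")
print(len(rows), rows[0][2], rows[0][9], rows[0][10])
```

Selecting by name fails loudly if the archive layout changes, whereas a positional index would silently read a different file.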


In [3]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,AR,3636,"POZO CERCADO (EL CHORRO (F), DPTO. RIVADAVIA (S))",Salta,A,,,,,-23.4933,-61.9267,3
1,AR,4123,LAS SALADAS,Salta,A,,,,,-25.7833,-64.5,4
2,AR,4126,CEIBAL,Salta,A,,,,,-26.1,-65.0167,4
3,AR,4126,BARADERO,Salta,A,,,,,-26.0833,-65.263,3
4,AR,4126,CANDELARIA,Salta,A,,,,,-26.1,-65.1,4


In [4]:
# Drop columns not needed for the analysis (mostly NaN)
df = df.drop([4,5,6,7,8,11], axis = 1)
# Rename columns
df.columns = ['Country', 'Zip-code', 'Neighborhood','Borough','Latitude', 'Longitude']
# Filter Buenos Aires Borough
df = df[df.Borough=='Buenos Aires']

In [5]:
df_buenos = df[['Zip-code','Latitude','Longitude']]
df_nei = df[['Zip-code','Neighborhood']].groupby(['Zip-code'])['Neighborhood'].apply(','.join).reset_index()
df_buenos = df_buenos.groupby(['Zip-code']).mean()
df_buenos['Neighborhood']=df_nei['Neighborhood'].tolist()
df_buenos['Borough'] = 'Buenos Aires'
df_buenos = df_buenos.reset_index()
df_buenos.head()

Unnamed: 0,Zip-code,Latitude,Longitude,Neighborhood,Borough
0,1601,-34.5167,-58.5389,ISLA MARTIN GARCIA,Buenos Aires
1,1602,-34.5167,-58.5,"FLORIDA,PUENTE SAAVEDRA,JUAN B. JUSTO (ESTACIO...",Buenos Aires
2,1605,-34.5333,-58.55,"MUNRO,MUNRO ESTAFETA No.2,CARAPACHAY",Buenos Aires
3,1607,-34.5167,-58.5389,"JOSE MARTI,BARRIO OBRERO FERROVIARIO,BARRIO ARCA",Buenos Aires
4,1609,-34.5,-58.5667,"BOULOGNE,BOULOGNE SUR MER,BOULOGNE ESTAFETA No...",Buenos Aires


In [6]:
df_buenos.describe()

Unnamed: 0,Zip-code,Latitude,Longitude
count,548.0,548.0,548.0
mean,5092.560219,-35.682369,-59.972056
std,2497.415024,1.643029,1.711379
min,1601.0,-40.8,-64.175
25%,1908.5,-36.785395,-61.324994
50%,6430.5,-35.116675,-59.654175
75%,7172.5,-34.60835,-58.548175
max,8512.0,-26.1625,-56.6667


#### Use geopy library to get the latitude and longitude values of Buenos Aires

In [7]:
from geopy.geocoders import Nominatim
address = 'Buenos Aires, ARG'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Buenos Aires are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Buenos Aires are -34.4708742, -58.6544118.


In [8]:
# create map of Buenos Aires using latitude and longitude values
map_buenos = folium.Map(location=[latitude, longitude], zoom_start=7)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_buenos['Latitude'], df_buenos['Longitude'], df_buenos['Borough'], df_buenos['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_buenos)  
    
map_buenos

## Explore Neighborhoods in Buenos Aires

In [9]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 50 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

In [10]:
venues_error_list = []
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        if len(results) > 0:
            print(name)
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name'],0) for v in results])
        else:
            print('no venues {}'.format(name))
            venues_list.append([(
                name, 
                lat, 
                lng, 
                np.nan,
                np.nan,
                np.nan,
                np.nan,1)])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category','Error']
   
    return nearby_venues
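
The request in `getNearbyVenues` indexes straight into the JSON response, so a failed or rate-limited call raises a `KeyError` or `IndexError` and aborts the whole loop. The following sketch is one possible defensive variant. It assumes only the response layout that the cell above already relies on (`response`, then `groups`, then the first group's `items`); `extract_items` and `venue_row` are hypothetical helpers, and both sample payloads are made up for illustration.

```python
def extract_items(payload):
    """Return the list of venue items, or an empty list if the layout differs."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])

def venue_row(name, lat, lng, item):
    """Flatten one item into the tuple layout used by getNearbyVenues."""
    venue = item["venue"]
    categories = venue.get("categories") or [{}]
    return (
        name, lat, lng,
        venue.get("name"),
        venue.get("location", {}).get("lat"),
        venue.get("location", {}).get("lng"),
        categories[0].get("name"),
        0,
    )

# Made-up payloads: one with a venue, one shaped like an error reply.
ok_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Gym",
               "location": {"lat": -34.51, "lng": -58.54},
               "categories": [{"name": "Gym"}]}}
]}]}}
error_payload = {"meta": {"code": 429}}

print([venue_row(1601, -34.5167, -58.5389, i) for i in extract_items(ok_payload)])
print(extract_items(error_payload))
```

With this shape, an unexpected reply is treated like a zip code without venues instead of stopping the run.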

In [15]:
b_aires_venues = getNearbyVenues(names=df_buenos['Zip-code'],
                                   latitudes=df_buenos['Latitude'],
                                   longitudes=df_buenos['Longitude']
                                  )

1601
1602
1605
1607
1609
1611
1612
1613
1615
1617
1619
1621
1623
1625
no venues 1627
no venues 1629
1633
1635
1636
1640
1642
1643
1644
1646
no venues 1647
1648
1649
1650
1651
1655
1657
1659
1661
1663
1664
1665
1667
1669
1672
1674
1676
1678
1682
no venues 1684
1686
1688
1702
1704
1706
1708
1712
1713
1714
1716
1718
1722
1723
no venues 1727
no venues 1733
no venues 1735
no venues 1737
no venues 1739
no venues 1741
1742
1744
1746
1748
1752
1754
1755
1757
1759
1761
1763
1765
1766
1770
1772
1773
1776
1778
1802
1804
1806
1807
1808
1812
no venues 1814
no venues 1815
no venues 1816
1822
1824
1825
1826
1828
1832
1834
1836
1838
1842
1846
1847
1848
1849
1852
1854
1856
1858
1862
1864
no venues 1865
1870
1871
1872
1874
1876
1878
1879
1881
1882
1884
1885
1886
1888
1889
no venues 1890
1891
1893
1894
1895
1896
no venues 1897
1900
1901
1903
no venues 1905
no venues 1907
no venues 1909
1911
1913
no venues 1915
no venues 1917
no venues 1919
no venues 1921
1923
1925
no venues 1927
no venues 1929
1931
no ve

### Handling missing values

In [16]:
b_aires_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Error
0,1601,-34.5167,-58.5389,Sandwiches Fredy,-34.514734,-58.542054,Sandwich Place,0
1,1601,-34.5167,-58.5389,Acacia Pastelería,-34.514233,-58.54188,Deli / Bodega,0
2,1601,-34.5167,-58.5389,La Colón,-34.519342,-58.54413,Bakery,0
3,1601,-34.5167,-58.5389,La Reja,-34.518106,-58.54351,Ice Cream Shop,0
4,1601,-34.5167,-58.5389,Retaceria Mary,-34.518633,-58.540269,Department Store,0
5,1601,-34.5167,-58.5389,Vicente Lopez Futbol,-34.516129,-58.528349,Soccer Field,0
6,1601,-34.5167,-58.5389,El Retorno,-34.511655,-58.540818,Argentinian Restaurant,0
7,1601,-34.5167,-58.5389,City Bar,-34.510774,-58.532785,Rock Club,0
8,1601,-34.5167,-58.5389,Colonial Helados & Cafe,-34.512004,-58.541162,Ice Cream Shop,0
9,1601,-34.5167,-58.5389,Farmacia Selma,-34.513812,-58.530791,Pharmacy,0


In [17]:
# Checking null values
b_aires_venues.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1888 entries, 0 to 1887
Data columns (total 8 columns):
Neighborhood              1888 non-null int64
Neighborhood Latitude     1888 non-null float64
Neighborhood Longitude    1888 non-null float64
Venue                     1524 non-null object
Venue Latitude            1524 non-null float64
Venue Longitude           1524 non-null float64
Venue Category            1524 non-null object
Error                     1888 non-null int64
dtypes: float64(4), int64(2), object(2)
memory usage: 118.1+ KB


We drop the zip codes with null values from this analysis because Foursquare returned no notable venues for them within a radius of 1 km.

In [18]:
# Drop null values
df_b_aires = b_aires_venues[b_aires_venues.Error==0]
# Count the placeholder rows (one per zip code without venues) and the zip codes before and after filtering
n_dropped = int((b_aires_venues.Error==1).sum())
n_total = b_aires_venues['Neighborhood'].nunique()
n_kept = df_b_aires['Neighborhood'].nunique()
print('We drop {} (null values) of {} zip-codes'.format(n_dropped, n_total))
print('We have {} zip-codes (neighborhoods) with {} venues to analyze'.format(n_kept, len(df_b_aires)))

We drop 364 (null values) of 548 zip-codes
We have 184 zip-codes (neighborhoods) with 1524 venues to analyze


In [19]:
df_b_aires.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Error
0,1601,-34.5167,-58.5389,Sandwiches Fredy,-34.514734,-58.542054,Sandwich Place,0
1,1601,-34.5167,-58.5389,Acacia Pastelería,-34.514233,-58.54188,Deli / Bodega,0
2,1601,-34.5167,-58.5389,La Colón,-34.519342,-58.54413,Bakery,0
3,1601,-34.5167,-58.5389,La Reja,-34.518106,-58.54351,Ice Cream Shop,0
4,1601,-34.5167,-58.5389,Retaceria Mary,-34.518633,-58.540269,Department Store,0


## Analyze Each Neighborhood
### Create Dummies

In [20]:
# one hot encoding
ba_onehot = pd.get_dummies(df_b_aires[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
ba_onehot['zip number'] = df_b_aires['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ba_onehot.columns[-1]] + list(ba_onehot.columns[:-1])
ba_onehot = ba_onehot[fixed_columns]

ba_onehot.head()

Unnamed: 0,zip number,Accessories Store,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,Auto Garage,Auto Workshop,...,Theme Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Toy / Game Store,Train Station,Tunnel,Vegetarian / Vegan Restaurant,Warehouse Store,Weight Loss Center,Women's Store
0,1601,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1601,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1601,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1601,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1601,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
ba_grouped = ba_onehot.groupby('zip number').mean().reset_index()
ba_grouped.head()

Unnamed: 0,zip number,Accessories Store,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,Auto Garage,Auto Workshop,...,Theme Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Toy / Game Store,Train Station,Tunnel,Vegetarian / Vegan Restaurant,Warehouse Store,Weight Loss Center,Women's Store
0,1601,0.0,0.0,0.066667,0.0,0.0,0.0,0.066667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1602,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0
2,1605,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1607,0.0,0.0,0.066667,0.0,0.0,0.0,0.066667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
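
Averaging one-hot columns per zip code yields the share of each venue category among that zip code's venues. The dependency-free sketch below reproduces that arithmetic with `collections.Counter` on made-up records; it illustrates the idea and is not the notebook's pandas code. `category_shares` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# (zip code, venue category) pairs; values are made up for illustration.
records = [
    (1601, "Bakery"), (1601, "Gym"), (1601, "Bakery"), (1601, "Pharmacy"),
    (1602, "Gym"), (1602, "Gym"),
]

def category_shares(pairs):
    """Map each zip code to {category: fraction of its venues in that category}."""
    counts = defaultdict(Counter)
    for zip_code, category in pairs:
        counts[zip_code][category] += 1
    shares = {}
    for zip_code, counter in counts.items():
        total = sum(counter.values())
        shares[zip_code] = {cat: n / total for cat, n in counter.items()}
    return shares

shares = category_shares(records)
print(shares[1601])
print(shares[1602])
```

Because each row sums to 1, zip codes with very few venues get large shares for single categories, which is worth keeping in mind when the rows are clustered.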


In [22]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [23]:
# Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['zip number']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
zip_venues_sorted = pd.DataFrame(columns=columns)
zip_venues_sorted['zip number'] = ba_grouped['zip number']

for ind in np.arange(ba_grouped.shape[0]):
    zip_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ba_grouped.iloc[ind, :], num_top_venues)

zip_venues_sorted.head()

Unnamed: 0,zip number,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1601,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports
1,1602,Sports Club,Bakery,Gym,Butcher,Gym / Fitness Center,Tennis Court,Gymnastics Gym,Fruit & Vegetable Store,Deli / Bodega,Train Station
2,1605,Plaza,Dessert Shop,Pharmacy,Bakery,Women's Store,Event Space,Food & Drink Shop,Flower Shop,Fish Market,Financial or Legal Service
3,1607,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports
4,1609,Fast Food Restaurant,Grocery Store,Women's Store,Event Space,Food Court,Food & Drink Shop,Flower Shop,Fish Market,Financial or Legal Service,Farmers Market
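
Sorting a whole row means that zip codes with fewer than ten distinct categories appear to be padded with zero-frequency categories in the table above. A hedged sketch of a variant that keeps only categories actually present, with a general ordinal suffix, could look like this; `ordinal` and `top_categories` are hypothetical helpers and the shares are made up.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

def top_categories(shares, num_top):
    """Return up to num_top categories with a non-zero share, most common first."""
    present = [(cat, share) for cat, share in shares.items() if share > 0]
    # Sort by share descending, then by name so ties are deterministic.
    present.sort(key=lambda item: (-item[1], item[0]))
    return [cat for cat, _ in present[:num_top]]

row = {"Plaza": 0.4, "Bakery": 0.2, "Pharmacy": 0.2, "Dessert Shop": 0.2, "Tunnel": 0.0}
for rank, cat in enumerate(top_categories(row, 10), start=1):
    print("{} Most Common Venue: {}".format(ordinal(rank), cat))
```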


## Cluster Zip Codes

In [24]:
# set number of clusters
kclusters = 5

ba_grouped_clustering = ba_grouped.drop('zip number', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ba_grouped_clustering)

# check cluster labels generated for each row in the dataframe
zip_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
zip_venues_sorted.head()

Unnamed: 0,Cluster Labels,zip number,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,1601,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports
1,0,1602,Sports Club,Bakery,Gym,Butcher,Gym / Fitness Center,Tennis Court,Gymnastics Gym,Fruit & Vegetable Store,Deli / Bodega,Train Station
2,0,1605,Plaza,Dessert Shop,Pharmacy,Bakery,Women's Store,Event Space,Food & Drink Shop,Flower Shop,Fish Market,Financial or Legal Service
3,0,1607,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports
4,0,1609,Fast Food Restaurant,Grocery Store,Women's Store,Event Space,Food Court,Food & Drink Shop,Flower Shop,Fish Market,Financial or Legal Service,Farmers Market
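
For readers who want to see what `KMeans(...).fit(...)` does conceptually, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not scikit-learn's implementation: it seeds the centroids with the first `k` points instead of k-means++, and the feature vectors are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k points
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Made-up category-share vectors for four zip codes.
features = [
    [0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.8, 0.2, 0.0], [0.0, 0.9, 0.1],
]
labels, centroids = kmeans(features, k=2)
print(labels)
```

The result depends on the starting centroids, which is why the notebook fixes `random_state=0` to make its own run reproducible.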


In [25]:
ba_merged = df_b_aires.set_index('Neighborhood').iloc[:,:-1]

# join the cluster labels and top venues onto each venue row, matching on the zip number
ba_merged = ba_merged.join(zip_venues_sorted.set_index('zip number'))

# Reset index and change name to zip number (neighborhood)
ba_merged = ba_merged.reset_index(drop=False)
ba_merged = ba_merged.rename(columns={'index':'zip number', 'Neighborhood Latitude':'zip Latitude','Neighborhood Longitude':'zip Longitude' })

#ba_merged # check the last columns!
ba_merged.head()

Unnamed: 0,zip number,zip Latitude,zip Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1601,-34.5167,-58.5389,Sandwiches Fredy,-34.514734,-58.542054,Sandwich Place,0,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports
1,1601,-34.5167,-58.5389,Acacia Pastelería,-34.514233,-58.54188,Deli / Bodega,0,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports
2,1601,-34.5167,-58.5389,La Colón,-34.519342,-58.54413,Bakery,0,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports
3,1601,-34.5167,-58.5389,La Reja,-34.518106,-58.54351,Ice Cream Shop,0,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports
4,1601,-34.5167,-58.5389,Retaceria Mary,-34.518633,-58.540269,Department Store,0,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store,Gym / Fitness Center,Bakery,Hardware Store,Pharmacy,Athletics & Sports


In [26]:
# Check clusters
ba_merged.groupby('Cluster Labels').count()


Cluster Labels,zip number,zip Latitude,zip Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,359,359,359,359,359,359,359,359,359,359,359,359,359,359,359,359,359
1,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
3,1140,1140,1140,1140,1140,1140,1140,1140,1140,1140,1140,1140,1140,1140,1140,1140,1140
4,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19
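
Note that `ba_merged` has one row per venue, so the counts above are venue rows per cluster, not zip codes per cluster. A small sketch of counting distinct zip codes per cluster instead, using made-up `(zip number, cluster label)` pairs and a hypothetical helper `zips_per_cluster`:

```python
from collections import defaultdict

# One (zip number, cluster label) pair per venue row; values are made up.
venue_rows = [(1601, 0), (1601, 0), (1602, 0), (1611, 3), (1611, 3), (1611, 3), (8183, 1)]

def zips_per_cluster(rows):
    """Return {cluster label: number of distinct zip codes assigned to it}."""
    members = defaultdict(set)
    for zip_number, cluster in rows:
        members[cluster].add(zip_number)
    return {cluster: len(zips) for cluster, zips in members.items()}

print(zips_per_cluster(venue_rows))
```

In pandas the same figure should be obtainable with `ba_merged.groupby('Cluster Labels')['zip number'].nunique()`.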


## Map Clusters

In [27]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ba_merged['zip Latitude'], ba_merged['zip Longitude'], ba_merged['zip number'], ba_merged['Cluster Labels']):
    label = folium.Popup('Zip-Code: '+str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

## B2. Cluster by characteristics of nearby gyms

In [28]:
condition = ((b_aires_venues['Venue Category']=='Gym') | (b_aires_venues['Venue Category']=='Gym / Fitness Center')) |(b_aires_venues['Venue Category']=='Gym Pool')
df_gym = b_aires_venues[condition]
df_gym = df_gym.groupby('Venue').first()
df_gym.reset_index(drop=False, inplace=True)
print('There are {} Gyms near zip-codes selected'.format(df_gym['Venue'].nunique()))

There are 44 Gyms near zip-codes selected


### To build a larger gym dataset, we explore all the gyms around Buenos Aires (within about 50 km) using the Foursquare API

In [29]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 50000 # define radius

In [30]:
def getNearbyVenuesbyCat(names, latitudes, longitudes, cat, radius=10000):
    
    venues_list=[]
    for name, lat, lng, c in zip(names, latitudes, longitudes, cat):
        
            
        # create the API request URL
        #print(name)
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,c)
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['id'],
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['categories'][0]['name']) for v in results])
            

    nearby_venues_by_cat = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues_by_cat.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 'Venue Id',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues_by_cat)


In [31]:
b_aires_venues_by_cat = getNearbyVenuesbyCat(names=['Buenos Aires'],
                                   latitudes=[latitude],
                                   longitudes=[longitude],cat=['4bf58dd8d48988d175941735'],radius=10000,
                                  )

In [32]:
b_aires_venues_by_cat.head()

Unnamed: 0,Borough,Borough Latitude,Borough Longitude,Venue Id,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Buenos Aires,-34.470874,-58.654412,50203111e4b0ab4947270817,Vuelta al Hipódromo de San Isidro,-34.480489,-58.519612,Track
1,Buenos Aires,-34.470874,-58.654412,4bd4c2f629eb9c74115392e1,Club Nordelta,-34.404626,-58.655813,Gym
2,Buenos Aires,-34.470874,-58.654412,4df52479ae609e69dd9f7334,Corredor Aeróbico Muñiz,-34.559512,-58.693454,Track
3,Buenos Aires,-34.470874,-58.654412,59aad25b646e382e654d1f98,SportClub,-34.405893,-58.620206,Gym / Fitness Center
4,Buenos Aires,-34.470874,-58.654412,506d7a9fe4b0377aa9d05e47,San Fernando Centro,-34.441638,-58.555633,Gym / Fitness Center


In [33]:
b_aires_venues_by_cat.shape

(45, 8)

### Search Rating of every Gym in Buenos Aires

In [34]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

def getStarsVenue(names, names_id):
    
    venues_list=[]
    for name, name_id in zip(names, names_id):
        
            
        # create the API request URL
        #print(name)
        url = 'https://api.foursquare.com/v2/venues/{}?&client_id={}&client_secret={}&v={}'.format(
            name_id,
            CLIENT_ID, 
            CLIENT_SECRET,
            VERSION ) 
        # make the GET request
        print(name)
        results = requests.get(url).json()["response"]['venue']
        
        
        # return only relevant information for each nearby venue
        venues_list.append([
            results])
            
    return(venues_list)

In [35]:
gym_json_rating = getStarsVenue(b_aires_venues_by_cat['Venue'],b_aires_venues_by_cat['Venue Id'])

Vuelta al Hipódromo de San Isidro
Club Nordelta
Corredor Aeróbico Muñiz
SportClub
San Fernando Centro
Gimnasio Salud Física
Body Builders Gym
Jambie Gym
Activity-GYM
Fitness Blue Gym
Cat´s
GYM Estilo
Instituto Elian
E.N.E.R gym
New Gym Center
Gym
Gimnasio (Pacheco Golf Club)
Kyosei
Gimnasio Marina del Sol
Gimnasio NM
Pilates
Amaicha Vinyasa
World Hurlingham Gym
Club Atlético San Miguel (CASM)
Gimnasio Shark
Club Uno Gym
Campo de Deportes Nº 1
Viken Gym
Gimnasio Pow
Club Italiano
Estudio Pilates (Leticia Secilio)
Sociedad Alemana de Gimnasia de Villa Ballester
Ashtanga Baires
Little Ranch
Crossfit Nordelta
Aikido Palma Dojo
Gym St Andrews
Gimnasio Físico y Forma
Club La Calle
Boomerang Pilates & Entrenamiento Funcional
Al Trote
Long Life Gym
Gelmini
Mordor Elite Fitness
Gimnasio


### Extract info from the JSON; if a venue has no rating, we assign a neutral value (5)

In [36]:
# integrate ranking on DataFrame
venues_rating = []
venues_r = []
for v in gym_json_rating:
    venues_rating.append([v[0]['id'],v[0]['name'], v[0]['location']['lat'], v[0]['location']['lng'], v[0]['categories'][0]['name']])
for c in gym_json_rating:
    try:
        venues_r.append(c[0]['rating'])
    except KeyError:
        venues_r.append(5)
venues_rating = pd.DataFrame(venues_rating)

venues_rating.columns = ['Venue Id',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
venues_rating['Venue Rating'] = venues_r

venues_rating = pd.DataFrame(venues_rating)

In [37]:
venues_rating.head()

Unnamed: 0,Venue Id,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue Rating
0,50203111e4b0ab4947270817,Vuelta al Hipódromo de San Isidro,-34.480489,-58.519612,Track,8.6
1,4bd4c2f629eb9c74115392e1,Club Nordelta,-34.404626,-58.655813,Gym,8.1
2,4df52479ae609e69dd9f7334,Corredor Aeróbico Muñiz,-34.559512,-58.693454,Track,6.6
3,59aad25b646e382e654d1f98,SportClub,-34.405893,-58.620206,Gym / Fitness Center,5.0
4,506d7a9fe4b0377aa9d05e47,San Fernando Centro,-34.441638,-58.555633,Gym / Fitness Center,5.0
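
The neutral default can also be expressed with `dict.get` instead of exception handling. The venue dictionaries below are made-up stand-ins containing only keys that the cell above reads; the constant name `NEUTRAL_RATING` is introduced here for illustration.

```python
NEUTRAL_RATING = 5  # value assigned when a venue has no rating, as in the text above

# Made-up venue details; the second one has no "rating" key.
venue_details = [
    {"id": "a1", "name": "Example Track", "rating": 8.6},
    {"id": "b2", "name": "Example Gym"},
]

ratings = [venue.get("rating", NEUTRAL_RATING) for venue in venue_details]
# Share of venues that carry a real rating rather than the neutral default.
rated_share = sum("rating" in v for v in venue_details) / len(venue_details)

print(ratings)
print(rated_share)
```

Since every unrated gym receives 5, averages over nearby gyms are pulled toward 5 whenever few gyms are rated, so tracking the rated share alongside the mean may help when interpreting the results.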


In [38]:
df_b_aires_clustered=ba_merged.groupby('zip number').mean(numeric_only=True).drop(['Venue Latitude','Venue Longitude'], axis = 1)

In [39]:
df_b_aires_clustered.tail()

zip number,zip Latitude,zip Longitude,Cluster Labels
8103,-38.775,-62.258325,3
8111,-38.9222,-62.0778,3
8168,-38.15,-61.8,3
8183,-37.7,-63.1667,1
8504,-40.8,-62.9833,3


To calculate distances accurately we need them in meters, not in latitude/longitude degrees. The functions below use geopy's `distance`, which computes the geodesic distance between two WGS84 latitude/longitude points, so no projection to a Cartesian system such as UTM is required and the original coordinates can still be shown on the Folium map.
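
As a point of comparison, the distance in meters between two latitude/longitude pairs can also be approximated without any projection using the haversine formula. The sketch below is a swapped-in spherical approximation, assuming a mean Earth radius of 6,371,000 m; it is not the notebook's method, and its results differ slightly from an ellipsoidal calculation. `haversine_m` is a hypothetical helper name.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Zip-code centroid 1601 and the first gym listed in this notebook's tables.
d = haversine_m(-34.5167, -58.5389, -34.480489, -58.519612)
print(round(d))
```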

### Get Distance Gyms Matrix

In [40]:
import math
from geopy.distance import lonlat, distance

def calc_xy_distance(lon1, lat1, lon2, lat2):
    lonlat1 = (lon1, lat1)
    lonlat2 = (lon2, lat2)

    return round(distance(lonlat(*lonlat1), lonlat(*lonlat2)).meters,0)


In [41]:
def distance_gyms(zips_xy, gyms_xy, zips_numbers):
    distance_matrix = []
    for i in range(len(zips_xy)):
        #print(i)
        x1 = zips_xy.iloc[i,1]
        y1 = zips_xy.iloc[i,0]
        distance_array = []
        for j in range(len(gyms_xy)):
            x2 = gyms_xy.iloc[j,1]
            y2 = gyms_xy.iloc[j,0]

            distance_array.append(calc_xy_distance(x1, y1, x2, y2))
        distance_matrix.append(distance_array)
    return distance_matrix

In [42]:
df_distances = pd.DataFrame(distance_gyms(df_b_aires_clustered.iloc[:,:-1],venues_rating.iloc[:,2:4],df_b_aires_clustered.index),columns = venues_rating['Venue'])
df_distances = df_distances.set_index(df_b_aires_clustered.index)

In [43]:
df_distances.head()

zip number,Vuelta al Hipódromo de San Isidro,Club Nordelta,Corredor Aeróbico Muñiz,SportClub,San Fernando Centro,Gimnasio Salud Física,Body Builders Gym,Jambie Gym,Activity-GYM,Fitness Blue Gym,...,Aikido Palma Dojo,Gym St Andrews,Gimnasio Físico y Forma,Club La Calle,Boomerang Pilates & Entrenamiento Funcional,Al Trote,Long Life Gym,Gelmini,Mordor Elite Fitness,Gimnasio
1601,4390.0,16431.0,14962.0,14384.0,8467.0,5521.0,5568.0,6910.0,15926.0,6958.0,...,4054.0,6289.0,22405.0,3565.0,13932.0,3667.0,2172.0,15393.0,15225.0,19251.0
1602,4402.0,18961.0,18383.0,16525.0,9770.0,7027.0,5452.0,7107.0,19437.0,7761.0,...,6504.0,9222.0,25974.0,6203.0,17314.0,7060.0,5601.0,18896.0,18706.0,22627.0
1605,6489.0,17270.0,13485.0,15536.0,10181.0,7297.0,7656.0,8942.0,14641.0,8874.0,...,5517.0,7107.0,21542.0,1463.0,12395.0,4118.0,1275.0,14089.0,13865.0,19064.0
1607,4390.0,16431.0,14962.0,14384.0,8467.0,5521.0,5568.0,6910.0,15926.0,6958.0,...,4054.0,6289.0,22405.0,3565.0,13932.0,3667.0,2172.0,15393.0,15225.0,19251.0
1609,4836.0,13379.0,13379.0,11539.0,6554.0,3980.0,5629.0,6308.0,13999.0,5768.0,...,1903.0,3186.0,19857.0,4971.0,12496.0,802.0,2883.0,13508.0,13432.0,16229.0


In [44]:
walk_near = 2000
bike_near = 5000
car_near = 10000
df_b_aires_clustered2 = df_b_aires_clustered
df_b_aires_clustered2['nearby gym walking'] = df_distances[df_distances.iloc[:]<=walk_near].count(axis=1)
df_b_aires_clustered2['nearby gym bike'] = df_distances[df_distances.iloc[:]<=bike_near].count(axis=1) - df_distances[df_distances.iloc[:]<=walk_near].count(axis=1)
df_b_aires_clustered2['nearby gym car'] = df_distances[df_distances.iloc[:]<=car_near].count(axis=1) - df_distances[df_distances.iloc[:]<=bike_near].count(axis=1)
df_b_aires_clustered2['min gym distance'] = df_distances[df_distances.iloc[:]<=car_near].min(axis=1)
df_b_aires_clustered2['mean gym distance'] = df_distances[df_distances.iloc[:]<=car_near].mean(axis=1)

In [45]:
df_b_aires_clustered2.head()

zip number,zip Latitude,zip Longitude,Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance
1601,-34.5167,-58.5389,0,0,7,19,2172.0,6073.538462
1602,-34.5167,-58.5,0,0,3,19,3826.0,6976.409091
1605,-34.5333,-58.55,0,2,3,20,1275.0,6978.08
1607,-34.5167,-58.5389,0,0,7,19,2172.0,6073.538462
1609,-34.5,-58.5667,0,3,9,15,802.0,4906.407407


### Get Gyms Rating Matrix

In [46]:
# transpose the ratings so that each gym becomes a column
df_ratings = pd.DataFrame(venues_rating['Venue Rating']).T
df_ratings = df_ratings.reset_index(drop=True)

# 1/0 mask: is each gym within car_near of the zip code?
ratings_array = (df_distances <= car_near).astype(int)
ratings_array = ratings_array.reset_index(drop=True)
ratings_array.columns = df_ratings.columns

# keep the rating of every gym in range; gyms out of range become 0
df_ratings = ratings_array * df_ratings.iloc[0]

# mean rating of the gyms in range (zeros are ignored)
df_ratings = pd.DataFrame(df_ratings[df_ratings > 0].mean(axis=1))
df_ratings = df_ratings.set_index(df_b_aires_clustered2.index)

df_b_aires_clustered2['mean ratings nearby gyms'] = df_ratings[0]

df_b_aires_clustered2.head(20)

zip number,zip Latitude,zip Longitude,Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance,mean ratings nearby gyms
1601,-34.5167,-58.5389,0,0,7,19,2172.0,6073.538462,5.380769
1602,-34.5167,-58.5,0,0,3,19,3826.0,6976.409091,5.45
1605,-34.5333,-58.55,0,2,3,20,1275.0,6978.08,5.396
1607,-34.5167,-58.5389,0,0,7,19,2172.0,6073.538462,5.380769
1609,-34.5,-58.5667,0,3,9,15,802.0,4906.407407,5.366667
1611,-34.5,-58.6333,3,3,3,24,1278.0,7071.466667,5.193333
1612,-34.4696,-58.6713,3,0,5,17,3491.0,7331.272727,5.140909
1613,-34.5,-58.6833,0,0,3,17,2261.0,6312.1,5.08
1615,-34.4833,-58.7167,0,1,2,15,1486.0,6918.333333,5.2
1617,-34.4602,-58.634504,3,2,4,19,47.0,6517.24,5.192
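
The mean rating of the gyms in range can also be computed without the intermediate 1/0 mask. The following is a minimal sketch on toy data, not the notebook's own code: the names `distances` and `ratings` and all values are hypothetical, and `DataFrame.where` is swapped in for the multiply-then-filter approach used above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): distances in meters from 3 zip codes to 3 gyms.
distances = pd.DataFrame(
    [[1500.0, 8000.0, 12000.0],
     [11000.0, 15000.0, 20000.0],
     [4000.0, 9500.0, 3000.0]],
    index=[1601, 1602, 1605],
)
# Hypothetical rating of each gym, in the same order as the columns above.
ratings = pd.Series([6.0, 5.0, 7.0])

car_near = 10000

# Repeat the ratings row for every zip code.
rating_matrix = pd.DataFrame(
    np.tile(ratings.to_numpy(), (len(distances), 1)),
    index=distances.index,
    columns=distances.columns,
)
# Keep a rating only where the gym is in range; everything else becomes NaN.
in_range_ratings = rating_matrix.where(distances <= car_near)

# mean(axis=1) skips NaN, so a zip code with no gym in range yields NaN.
mean_rating = in_range_ratings.mean(axis=1)
print(mean_rating)
```

With this approach a gym whose rating is exactly 0 would still count toward the mean, which the `> 0` filter would silently drop.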


#### Null values are filled with 0, since a null means the zip code has no gyms nearby

In [47]:
df_b_aires_normalize = df_b_aires_clustered2.fillna(0)
df_b_aires_normalize = df_b_aires_normalize.drop(['zip Latitude', 'zip Longitude'],axis = 1)
df_b_aires_normalize = df_b_aires_normalize.reset_index(drop = True)
df_b_aires_normalize.head()

Unnamed: 0,Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance,mean ratings nearby gyms
0,0,0,7,19,2172.0,6073.538462,5.380769
1,0,0,3,19,3826.0,6976.409091,5.45
2,0,2,3,20,1275.0,6978.08,5.396
3,0,0,7,19,2172.0,6073.538462,5.380769
4,0,3,9,15,802.0,4906.407407,5.366667


### Normalize Data

In [48]:
from sklearn import preprocessing

x = df_b_aires_normalize.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_b_aires_normalize = pd.DataFrame(x_scaled)
df_b_aires_normalize.head()

Unnamed: 0,0,1,2,3,4,5,6
0,0.0,0.0,0.583333,0.791667,0.233047,0.63343,0.860923
1,0.0,0.0,0.25,0.791667,0.410515,0.727594,0.872
2,0.0,0.222222,0.25,0.833333,0.136803,0.727768,0.86336
3,0.0,0.0,0.583333,0.791667,0.233047,0.63343,0.860923
4,0.0,0.333333,0.75,0.625,0.086052,0.511706,0.858667
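
For reference, min-max scaling with the default range maps every column to [0, 1] through (x - min) / (max - min). The sketch below reproduces that formula with plain pandas on a toy table; the values and the `features` name are hypothetical and are not taken from the full dataset.

```python
import pandas as pd

# Toy feature table (hypothetical values).
features = pd.DataFrame({
    'nearby gym bike': [0, 3, 7, 12],
    'min gym distance': [802.0, 1275.0, 2172.0, 3826.0],
})

# Min-max scaling per column: (x - min) / (max - min).
col_min = features.min()
col_max = features.max()
scaled = (features - col_min) / (col_max - col_min)

print(scaled.round(3))
```

Because each column is scaled independently, a count of gyms and a distance in meters end up on the same scale before they are passed to the clustering step.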


## B3. Model 2nd Cluster

In [49]:
# set number of clusters
kclusters = 5

ba_grouped_clustering2 = df_b_aires_normalize

# run k-means clustering
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(ba_grouped_clustering2)

# attach the cluster label generated for each row
# (overwrite the column if the cell is re-run, insert it otherwise)
df_b_aires_clustered3 = df_b_aires_clustered2
if '2nd Cluster Labels' in df_b_aires_clustered3.columns:
    df_b_aires_clustered3['2nd Cluster Labels'] = kmeans2.labels_
else:
    df_b_aires_clustered3.insert(0, '2nd Cluster Labels', kmeans2.labels_)
df_b_aires_clustered3 = df_b_aires_clustered2.reset_index(drop=False)
df_b_aires_clustered3.head()

Unnamed: 0,zip number,2nd Cluster Labels,zip Latitude,zip Longitude,Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance,mean ratings nearby gyms
0,1601,1,-34.5167,-58.5389,0,0,7,19,2172.0,6073.538462,5.380769
1,1602,1,-34.5167,-58.5,0,0,3,19,3826.0,6976.409091,5.45
2,1605,1,-34.5333,-58.55,0,2,3,20,1275.0,6978.08,5.396
3,1607,1,-34.5167,-58.5389,0,0,7,19,2172.0,6073.538462,5.380769
4,1609,1,-34.5,-58.5667,0,3,9,15,802.0,4906.407407,5.366667


In [50]:
df_b_aires_clustered3.shape

(184, 11)

In [51]:
# create map
map_clusters2 = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_b_aires_clustered3['zip Latitude'], df_b_aires_clustered3['zip Longitude'], df_b_aires_clustered3['zip number'], df_b_aires_clustered3['2nd Cluster Labels']):
    label = folium.Popup('Zip-Code: '+str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters2)
    
map_clusters2

In [68]:
# Reorder the table columns
df_final = df_b_aires_clustered3.join(zip_venues_sorted.iloc[:,2:7])
df_final = df_final.drop(['zip Latitude','zip Longitude'], axis=1)
df_final = df_final[['zip number',
                    '2nd Cluster Labels',
                    'nearby gym walking',
                    'nearby gym bike',
                    'nearby gym car',
                    'min gym distance',
                    'mean gym distance',
                    'mean ratings nearby gyms',
                    'Cluster Labels',
                    '1st Most Common Venue',
                    '2nd Most Common Venue',
                    '3rd Most Common Venue',
                    '4th Most Common Venue',
                    '5th Most Common Venue']]

df_final = df_final.round(2)
df_final = df_final.rename(columns={'Cluster Labels':'1st Cluster Labels'})


# Cluster 0 - No nearby gyms and fast-food preferences

In [53]:
df_final[df_final['2nd Cluster Labels']==0]

Unnamed: 0,zip number,2nd Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance,mean ratings nearby gyms,1st Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
13,1625,0,0,0,0,,,,3,Construction & Landscaping,Bus Station,Department Store,Shopping Mall,Women's Store
42,1702,0,0,0,0,,,,3,Fast Food Restaurant,Café,Bus Stop,Motel,Intersection
46,1712,0,0,0,0,,,,3,Gym,Ice Cream Shop,Soccer Field,Café,Coffee Shop
47,1713,0,0,0,0,,,,3,Home Service,Hotel,Ice Cream Shop,Deli / Bodega,Locksmith
48,1714,0,0,0,0,,,,3,Gym,Ice Cream Shop,Soccer Field,Café,Coffee Shop
49,1716,0,0,0,0,,,,3,Pizza Place,Art Gallery,Construction & Landscaping,Music Venue,Dessert Shop
50,1718,0,0,0,0,,,,3,Pizza Place,Ice Cream Shop,Gym,Golf Course,Gastropub
51,1722,0,0,0,0,,,,3,Plaza,Pool,Construction & Landscaping,Asian Restaurant,Furniture / Home Store
52,1723,0,0,0,0,,,,3,Salon / Barbershop,Gym,Warehouse Store,Health Food Store,Flower Shop
55,1746,0,0,0,0,,,,3,Frame Store,Insurance Office,Mexican Restaurant,Candy Store,Factory


In [54]:
df_final[df_final['2nd Cluster Labels']==0].mean(numeric_only=True)

zip number                  3910.605505
2nd Cluster Labels             0.000000
nearby gym walking             0.000000
nearby gym bike                0.000000
nearby gym car                 0.000000
min gym distance                    NaN
mean gym distance                   NaN
mean ratings nearby gyms            NaN
1st Cluster Labels             3.027523
dtype: float64

# Cluster 1 - Various gyms at long distance

In [55]:
df_final[df_final['2nd Cluster Labels']==1]

Unnamed: 0,zip number,2nd Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance,mean ratings nearby gyms,1st Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,1601,1,0,7,19,2172.0,6073.54,5.38,0,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store
1,1602,1,0,3,19,3826.0,6976.41,5.45,0,Sports Club,Bakery,Gym,Butcher,Gym / Fitness Center
2,1605,1,2,3,20,1275.0,6978.08,5.4,0,Plaza,Dessert Shop,Pharmacy,Bakery,Women's Store
3,1607,1,0,7,19,2172.0,6073.54,5.38,0,Soccer Field,Ice Cream Shop,Department Store,Rock Club,Grocery Store
4,1609,1,3,9,15,802.0,4906.41,5.37,0,Fast Food Restaurant,Grocery Store,Women's Store,Event Space,Food Court
7,1613,1,0,3,17,2261.0,6312.1,5.08,0,Gym,Train Station,Park,Women's Store,Food & Drink Shop
8,1615,1,1,2,15,1486.0,6918.33,5.2,0,Construction & Landscaping,Rugby Pitch,Train Station,Ice Cream Shop,Supermarket
10,1619,1,0,2,3,2291.0,7107.0,6.02,0,Gym,Soccer Field,Sushi Restaurant,Event Space,Food Service
18,1642,1,9,8,8,830.0,4155.48,5.3,0,Bakery,BBQ Joint,Athletics & Sports,Theater,Sports Club
19,1643,1,8,9,9,435.0,4223.88,5.28,0,Gym / Fitness Center,Bakery,Coffee Shop,Sandwich Place,Candy Store


In [56]:
df_final[df_final['2nd Cluster Labels']==1].mean(numeric_only=True)

zip number                  1634.176471
2nd Cluster Labels             1.000000
nearby gym walking             1.705882
nearby gym bike                4.529412
nearby gym car                13.235294
min gym distance            1820.000000
mean gym distance           6078.031765
mean ratings nearby gyms       5.351176
1st Cluster Labels             0.000000
dtype: float64

# Cluster 2 - No nearby gyms - recreational preferences

In [57]:
df_final[df_final['2nd Cluster Labels']==2]

Unnamed: 0,zip number,2nd Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance,mean ratings nearby gyms,1st Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
14,1633,2,0,0,0,,,,0,Business Service,Gift Shop,BBQ Joint,Bakery,Women's Store
53,1742,2,0,0,0,,,,0,Spa,Bakery,Plaza,Locksmith,Gym / Fitness Center
60,1757,2,0,0,0,,,,0,Construction & Landscaping,Soccer Field,Department Store,Pharmacy,Women's Store
61,1759,2,0,0,0,,,,0,Convenience Store,Train Station,Supermarket,Gym / Fitness Center,Plaza
64,1765,2,0,0,0,,,,0,Construction & Landscaping,Pizza Place,BBQ Joint,Motorcycle Shop,Spa
66,1770,2,0,0,0,,,,0,Soccer Field,Ice Cream Shop,Pizza Place,Bar,Restaurant
67,1772,2,0,0,0,,,,0,Train Station,Pizza Place,Skating Rink,Business Service,Restaurant
69,1776,2,0,0,0,,,,0,Train Station,Pizza Place,Skating Rink,Business Service,Restaurant
70,1778,2,0,0,0,,,,0,Train Station,Portuguese Restaurant,Bakery,Empanada Restaurant,Food & Drink Shop
72,1804,2,0,0,0,,,,0,Train Station,Pharmacy,Supermarket,Electronics Store,Women's Store


In [58]:
df_final[df_final['2nd Cluster Labels']==2].mean(numeric_only=True)

zip number                  3035.225806
2nd Cluster Labels             2.000000
nearby gym walking             0.000000
nearby gym bike                0.000000
nearby gym car                 0.000000
min gym distance                    NaN
mean gym distance                   NaN
mean ratings nearby gyms            NaN
1st Cluster Labels             0.096774
dtype: float64

# Cluster 3 - Few gyms at long distance

In [59]:
df_final[df_final['2nd Cluster Labels']==3]

Unnamed: 0,zip number,2nd Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance,mean ratings nearby gyms,1st Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
11,1621,3,0,1,7,4955.0,6882.12,5.64,3,Sports Club,Gas Station,Clothing Store,Lounge,Women's Store
12,1623,3,0,0,5,8161.0,8737.6,6.02,0,Soccer Field,Athletics & Sports,Food Service,Women's Store,Event Space
15,1635,3,0,0,1,7534.0,7534.0,5.0,0,Convenience Store,Restaurant,Bakery,Women's Store,Event Service
20,1644,3,0,0,3,9072.0,9464.0,5.0,4,Restaurant,Gift Shop,Women's Store,Event Service,Food & Drink Shop
22,1648,3,0,0,3,9072.0,9464.0,5.0,4,Restaurant,Gift Shop,Women's Store,Event Service,Food & Drink Shop
23,1649,3,0,0,3,9072.0,9464.0,5.0,4,Restaurant,Gift Shop,Women's Store,Event Service,Food & Drink Shop
35,1672,3,0,0,3,6603.0,7430.67,5.83,0,Train Station,Social Club,Bakery,Sporting Goods Shop,Park
36,1674,3,0,0,3,6603.0,7430.67,5.83,0,Train Station,Social Club,Bakery,Sporting Goods Shop,Park
37,1676,3,0,0,3,6603.0,7430.67,5.83,0,Train Station,Social Club,Bakery,Sporting Goods Shop,Park
38,1678,3,0,0,3,6603.0,7430.67,5.83,0,Train Station,Social Club,Bakery,Sporting Goods Shop,Park


In [60]:
df_final[df_final['2nd Cluster Labels']==3].mean(numeric_only=True)

zip number                  1670.142857
2nd Cluster Labels             3.000000
nearby gym walking             0.000000
nearby gym bike                0.071429
nearby gym car                 2.928571
min gym distance            7728.357143
mean gym distance           8305.516429
mean ratings nearby gyms       5.504286
1st Cluster Labels             1.714286
dtype: float64

# Cluster 4 - High density of gyms and good rating


In [61]:
df_final[df_final['2nd Cluster Labels']==4]

Unnamed: 0,zip number,2nd Cluster Labels,nearby gym walking,nearby gym bike,nearby gym car,min gym distance,mean gym distance,mean ratings nearby gyms,1st Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
5,1611,4,3,3,24,1278.0,7071.47,5.19,3,Argentinian Restaurant,Café,Ice Cream Shop,Gym,Bakery
6,1612,4,0,5,17,3491.0,7331.27,5.14,3,Furniture / Home Store,Construction & Landscaping,Bus Station,Auto Workshop,Women's Store
9,1617,4,2,4,19,47.0,6517.24,5.19,3,Pizza Place,Plaza,Athletics & Sports,Café,Supermarket
16,1636,4,0,5,17,3877.0,6914.73,5.45,3,Ice Cream Shop,Argentinian Restaurant,Bar,Coffee Shop,Gym / Fitness Center
17,1640,4,4,8,10,1062.0,4694.64,5.45,3,Ice Cream Shop,Pizza Place,Deli / Bodega,Café,Plaza
24,1650,4,2,2,19,294.0,7298.96,5.43,3,Mexican Restaurant,Gastropub,Pool,Gym,Pizza Place
25,1651,4,0,3,13,2419.0,7490.81,5.62,3,Ice Cream Shop,Pizza Place,Intersection,Pet Store,Women's Store
27,1657,4,2,2,19,294.0,7298.96,5.43,3,Mexican Restaurant,Gastropub,Pool,Gym,Pizza Place
28,1659,4,2,2,19,294.0,7298.96,5.43,3,Mexican Restaurant,Gastropub,Pool,Gym,Pizza Place
32,1665,4,0,3,10,2484.0,6105.31,5.28,3,Home Service,Convenience Store,Park,Tennis Court,Event Space


In [62]:
df_final[df_final['2nd Cluster Labels']==4].mean(numeric_only=True)

zip number                  1650.307692
2nd Cluster Labels             4.000000
nearby gym walking             1.307692
nearby gym bike                3.076923
nearby gym car                15.769231
min gym distance            1649.846154
mean gym distance           6944.637692
mean ratings nearby gyms       5.346154
1st Cluster Labels             3.000000
dtype: float64

# C. Results and Discussion 

The analysis began with the zip codes of Buenos Aires, Argentina. For the first clustering, relevant venues were searched within a radius of 1,000 meters around each zip code, and the zip codes were grouped into 5 clusters with an unsupervised clustering model.
For a more in-depth analysis, all gyms, fitness centers and related businesses within 50,000 meters of Buenos Aires were searched. Foursquare, a geolocation service, returned data on 45 gyms: latitude, longitude, category, rating and zip code. From these, the following variables were derived:
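- 1st Cluster: cluster based on the most common venues of each zip code
- Nearby gyms walking: gyms at most 2,000 meters from the zip code
- Nearby gyms bike: gyms between 2,000 and 5,000 meters from the zip code
- Nearby gyms car: gyms between 5,000 and 10,000 meters from the zip code
- Min and mean gym distance: distance to the closest gym and average distance to the gyms within 10,000 meters of the zip code
- Rating: social rating, from 1 to 10, of each gym, averaged over the gyms within 10,000 meters of the zip code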
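
The three distance variables are counted in non-overlapping bands in the notebook code. A minimal sketch of that counting on toy data follows; the `distances` table and its values are hypothetical, while the thresholds are the ones used in the analysis.

```python
import pandas as pd

# Thresholds in meters, as used in the analysis.
walk_near, bike_near, car_near = 2000, 5000, 10000

# Toy data (hypothetical): distances in meters from 2 zip codes to 4 gyms.
distances = pd.DataFrame(
    [[1500.0, 4000.0, 8000.0, 12000.0],
     [2500.0, 6000.0, 9000.0, 9500.0]],
    index=[1601, 1602],
)

# Each gym falls in at most one band; gyms beyond car_near are not counted.
bands = pd.DataFrame({
    'nearby gym walking': (distances <= walk_near).sum(axis=1),
    'nearby gym bike': ((distances > walk_near) & (distances <= bike_near)).sum(axis=1),
    'nearby gym car': ((distances > bike_near) & (distances <= car_near)).sum(axis=1),
})

print(bands)
```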

With these variables and a k-means clustering model, the following clusters were obtained:
- Cluster 0 and Cluster 2: their main characteristic is that they have no gyms at short, medium or long distance. Investing in these areas is therefore a great risk, since there is no evidence of potential demand at those coordinates.
- Cluster 1: characterized by several gyms at medium and long distance. On average, the closest gym is about 1,800 meters away and the mean distance to the nearby gyms is about 6,100 meters. This group represents a good market opportunity, although there is strong competition at long distance.
- Cluster 3: characterized by no supply at short distance, almost none at medium distance and very low supply at long distance, with an average of 3 gyms at an average distance of 8,300 meters and a minimum distance of 7,700 meters.
- Cluster 4: a small group characterized by a high density of gyms, with an average of 1.3 gyms at short distance, 3.1 gyms at medium distance and 15.8 gyms at long distance. Given this high level of supply, investing in this group is risky due to the high level of competition.
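
The per-cluster averages behind these descriptions can be produced in a single call instead of one cell per cluster. This is a sketch on a toy table, assuming a frame shaped like `df_final`; the rows below are hypothetical.

```python
import pandas as pd

# Toy table (hypothetical rows) with the same kind of columns as df_final.
df = pd.DataFrame({
    '2nd Cluster Labels': [1, 1, 3, 3],
    'nearby gym car': [19, 15, 3, 5],
    'min gym distance': [2172.0, 802.0, 9072.0, 6603.0],
    '1st Most Common Venue': ['Soccer Field', 'Plaza', 'Restaurant', 'Train Station'],
})

# numeric_only=True skips the text columns, which cannot be averaged.
summary = df.groupby('2nd Cluster Labels').mean(numeric_only=True).round(1)

print(summary)
```

Having all clusters side by side in one table makes it easier to check the figures quoted in the text against the data.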

The recommended addresses in cluster 3 are detailed below:


In [64]:
from geopy.geocoders import Nominatim
print('==============================================================')
print('Addresses of centers of areas recommended for further analysis')
print('==============================================================\n')

# cluster 3 is the group recommended for further analysis
df_final_cluster3 = df_b_aires_clustered3[df_b_aires_clustered3['2nd Cluster Labels']==3]

# create the geocoder once and reuse it for every zip code
geolocator = Nominatim(user_agent="ny_explorer")
for lat, lon in zip(df_final_cluster3['zip Latitude'], df_final_cluster3['zip Longitude']):
    # reverse geocoding: coordinates -> street address
    location = geolocator.reverse('{}, {}'.format(lat, lon))
    print(location.address)


Addresses of centers of areas recommended for further analysis

Club Newman, Benavídez, Partido de Tigre, Buenos Aires, 1621, Argentina
Barrio Santa Isabel, Ingeniero Maschwitz, Partido de Escobar, Buenos Aires, B1623, Argentina
Presidente Derqui, Partido del Pilar, Buenos Aires, 1635, Argentina
Isla Nazar Anchorena, Primera Sección, Partido de Tigre, Buenos Aires, B1644BHH, Argentina
Isla Nazar Anchorena, Primera Sección, Partido de Tigre, Buenos Aires, B1644BHH, Argentina
Isla Nazar Anchorena, Primera Sección, Partido de Tigre, Buenos Aires, B1644BHH, Argentina
879, 411 - Beazley, Sáenz Peña, Partido de Tres de Febrero, Buenos Aires, B1674AVJ, Argentina
879, 411 - Beazley, Sáenz Peña, Partido de Tres de Febrero, Buenos Aires, B1674AVJ, Argentina
879, 411 - Beazley, Sáenz Peña, Partido de Tres de Febrero, Buenos Aires, B1674AVJ, Argentina
879, 411 - Beazley, Sáenz Peña, Partido de Tres de Febrero, Buenos Aires, B1674AVJ, Argentina
Colegio Ward, 599, Concejal Héctor Coucheiro, Villa Sa

# D. Conclusion

The purpose of this project was to identify Buenos Aires areas close to the center with a low number of gyms, in order to help stakeholders narrow down the search for an optimal location for a gym and fitness center. By calculating the gym density distribution from Foursquare data, we first identified the general areas that justify further analysis, and then generated an extensive collection of locations that satisfy some basic requirements regarding existing nearby gyms. These locations were then clustered to create major zones of interest (containing the greatest number of potential locations).
The final decision on the optimal gym location will be made by stakeholders based on the specific characteristics of the neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), noise levels and proximity to major roads, real estate availability, prices, and the social and economic dynamics of every neighborhood.