# **Capstone Project - The Battle of the Neighborhoods**
## **Applied Data Science Capstone by IBM/Coursera**

## **Table of Contents:**



*   ### Introduction: Business Problem
*   ### Data
*   ### Methodology
*   ### Analysis
*   ### Results and Discussion
*   ### Conclusion



## **Introduction: Business Problem**

This project is intended to inform Esoterics inc., a small group of entrepreneurs from North America interested in opening a **New Age store** in the city of Barranquilla, Colombia.

The city is relatively new to this kind of business and strongly attached to its traditions. To determine the right location, the stakeholders want an analysis that identifies the parts of the city where trends diverge from tradition.

It is therefore necessary to spot the activities in the city that are related to New Age trends and that could be a source of potential buyers of esoteric goods. The strategic location itself is to be determined through this study.

## Data

Data will be obtained from the **Foursquare API**. The idea is to extract the categories of establishments that are related to New Age trends and then determine where in the city they concentrate the most, so that area can be considered as a potential location for the new store.
Open data from local authorities and other sources, such as local contributors, will be used to obtain the location of the city and of its communities and 'barrios', the population subdivisions defined by the local administration.

### The City of Barranquilla

Barranquilla is a small city, capital of the Atlántico department, located in the Caribbean region of Colombia. It is known for its port activity and its most celebrated festivity: the carnival. With considerable commercial and industrial activity, it is considered one of the top cities of the region.

First, let's determine the location of Barranquilla and display it on a map:

In [66]:
import pandas as pd
import numpy as np

In [2]:
column_names = ['Code','Municipio', 'Department Code','Department','Latitude','Longitude']

colombia_df = pd.read_csv('http://blog.jorgeivanmeza.com/wp-content/uploads/2008/09/municipioscolombiacsv.txt',sep=',',header=None)
colombia_df.columns = column_names
colombia_df.head()

Unnamed: 0,Code,Municipio,Department Code,Department,Latitude,Longitude
0,5002,Abejorral,5,Antioquia,5.75,-75.416667
1,5004,Abriaquí,5,Antioquia,6.6666667,-76.083333
2,50006,Acacías,50,Meta,3.9166667,-73.833333
3,27006,Acandí,27,Chocó,8.3333333,-77.166667
4,41006,Acevedo,41,Huila,1.75,-75.916667


In [3]:
barranquilla_df = colombia_df[colombia_df['Municipio']=="Barranquilla"]
barranquilla_df

Unnamed: 0,Code,Municipio,Department Code,Department,Latitude,Longitude
89,8001,Barranquilla,8,Atlántico,10.9638889,-74.796389


The coordinates of the city are Lat: 10.963889 and Long: -74.796389. Now let's use those coordinates to show Barranquilla on a map.

In [4]:
!conda install -c conda-forge folium
import folium
print("Done!")


In [5]:
# Extract scalar coordinates from the single-row dataframe
latitud = float(barranquilla_df['Latitude'].iloc[0])
longitud = float(barranquilla_df['Longitude'].iloc[0])
map_barranquilla = folium.Map(location=[latitud, longitud], zoom_start=12)

map_barranquilla

Now it's time to find the **'Barrios'**:

In [6]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Comunidad,Barrio,latitud,longitud
0,Riomar,Adela de Char,11.034698,-74.868855
1,Riomar,Adelita de Char Etp. 2,11.03258,-74.874455
2,Riomar,Altamira,11.006025,-74.825973
3,Riomar,Altos de Riomar,11.015493,-74.824646
4,Riomar,Altos del Limón,11.014751,-74.828707


In [7]:
df_barrios.shape

(182, 4)

A dataframe has been built from the 'barrios' information; it lists 182 barrios. This information will now be used to retrieve the venues surrounding each barrio through the **Foursquare API**.
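
As a rough sketch of the request issued for each barrio, the snippet below assembles an `explore` URL using only the standard library. The endpoint and parameter names are the ones used in the following cells; the credential strings are placeholders, and `build_explore_url` is a hypothetical helper that is not part of the notebook. No request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint used by getNearbyVenues in this notebook.
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(lat, lng, client_id, client_secret,
                      version='20200707', radius=500, limit=100):
    """Hypothetical helper: build the explore URL for one point (nothing is sent)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the barrio
        'radius': radius,                # search radius in metres
        'limit': limit,                  # maximum number of venues returned
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# Coordinates of the first barrio in the table above (Adela de Char).
url = build_explore_url(11.034698, -74.868855,
                        'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET')
print(url)
```

Building the query string with `urlencode` escapes the comma in `ll` and avoids hand-written `&` separators.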

In [8]:
import requests
import json
import pandas.io.json

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'  # placeholder: credentials should not be published in a shared notebook
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  # placeholder
VERSION = '20200707'
LIMIT = 100
radius = 500

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Barrio', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

print('done')

done


In [12]:
barranquilla_vns = getNearbyVenues(df_barrios['Barrio'],df_barrios['latitud'],df_barrios['longitud'])

Adela de Char
Adelita de Char Etp. 2
Altamira
Altos de Riomar
Altos del Limón
Altos del Prado
Andalucía
Buenavista
El Castillo I
El Golf
El Limoncito
El Poblado
Eduardo Santos
La Castellana
La Floresta
Las Flores
Las Tres Avemarias
Paraiso
Riomar
San Marino
San Salvador
San Vicente
Santa Mónica
Siape
Solaire
Urbanización La Playa
Villa Campestre
Villa Carolina
Villa del Este
Villa Santos
Villamar
Villas del Puerto
Barlovento
Barranquillita
Barrio Abajo
Bellavista
Betania
Boston
Campo Alegre
Centro
Ciudad Jardín
Colombia
El Boliche
El Castillo
El Golf
El Porvenir
El Prado
El Recreo
El Rosario
El Tabor
Granadillo
Industrial Vía 40
La Bendición de Dios
La Campiña
La Concepción
La Cumbre
La Felicidad
La Loma
Las Colinas
Las Delicias
Las Mercedes
Los Alpes
Los Jobos
Los Nogales
Miramar
Modelo
Montecristo
Nuevo Horizonte
Parque Rosado
San Francisco
Santa Ana
Villa Country
Villa Tarel
Villanueva
7 de Abril
20 de Julio
Buenos Aires
Carrizal
Cevillar
Ciudadela 20 de Julio
El Santuario
Kennedy
L

In [13]:
barranquilla_vns.head()

Unnamed: 0,Barrio,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Adela de Char,11.034698,-74.868855,Aires Y Electricidad Nuñez,11.037871,-74.867362,Construction & Landscaping
1,Adela de Char,11.034698,-74.868855,Enterlock Colombia S.A.S.,11.032169,-74.865724,Furniture / Home Store
2,Adelita de Char Etp. 2,11.03258,-74.874455,Donde Jota,11.033551,-74.870806,Latin American Restaurant
3,Altamira,11.006025,-74.825973,Arabe Internacional,11.006065,-74.826594,Falafel Restaurant
4,Altamira,11.006025,-74.825973,Zahle,11.005083,-74.826653,Middle Eastern Restaurant


## Methodology

The goal of this project is to find the best location for a New Age store under certain criteria.

In the cells above, the location of Barranquilla and its barrios was obtained, as well as the location of the venues surrounding each 'Barrio'. The next step is to select the 'Barrios' that meet the desired criterion: being surrounded, within a radius of 500 m, by places whose activity is somehow aligned with New Age trends. The stakeholders considered that these activities could be:

* **Reading:** bookstores offer a variety of texts for the general public, including esoteric titles, so people who read about esoteric subjects may be potential customers for the new store.

* **Veganism:** some spiritual traditions hold that sacrificing animals is a crime against life and that eating their meat is therefore wrong. According to the stakeholders' criteria, vegan restaurants may be attended by people aligned with such philosophies, who could be potential customers for the store.

* **Fitness:** fitness centers offer yoga within their service packages. Although it is mostly offered for physical wellbeing, yoga has a spiritual background, so according to the stakeholders a yoga practitioner is a potential customer.

Once the filtering is done, the next step is to run a clustering analysis on the resulting barrios to determine how similar they are to each other and choose a good location among the options.
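
The 500 m criterion can be sanity-checked by hand. The sketch below computes the great-circle (haversine) distance between a barrio and one of its venues, using the coordinates from the first row of the venue table above. This is an independent check under a spherical-Earth assumption, not a description of how Foursquare applies the radius; `haversine_m` is a hypothetical helper.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Adela de Char and the venue 'Aires Y Electricidad Nuñez' (first row of the table).
barrio = (11.034698, -74.868855)
venue = (11.037871, -74.867362)

distance = haversine_m(barrio[0], barrio[1], venue[0], venue[1])
print(round(distance), distance <= 500)
```

The venue lies a few hundred metres from the barrio's reference point, which is consistent with the 500 m search radius.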

## Analysis

To perform the analysis, the first step is to search the results obtained for the categories that match the criteria described above.

Let's start by creating a function that retrieves the indexes of the dataframe rows where those categories can be found.

In [24]:
def get_index_cat(keyword,series):
    idx= []
    for a, b in series.items():  # Series.iteritems() was removed in pandas 2.0
        splt = b.split()
        transf = [f.lstrip(',./').lower() for f in splt]
        if keyword.lower() in transf:
            idx.append(a)
    return idx

In [49]:
print("indexes for vegan: ", get_index_cat('vegan', barranquilla_vns['Venue Category']))
print("indexes for bookstore: ", get_index_cat('bookstore', barranquilla_vns['Venue Category']))
print("indexes for fitness: ", get_index_cat('fitness', barranquilla_vns['Venue Category']))

indexes for vegan:  [772, 1057, 1098, 1216, 1256]
indexes for bookstore:  [37, 42, 96, 118, 769, 774, 1065, 1106]
indexes for fitness:  [182, 326, 338, 350, 406, 412, 481, 503, 542, 622, 776, 792, 948, 1203, 1235, 1245, 1297, 1299, 1394]


In [60]:
list_vegan = [772, 1057, 1098, 1216, 1256]
list_bookstore = [37, 42, 96, 118, 769, 774, 1065, 1106]
list_fitness = [182, 326, 338, 350, 406, 412, 481, 503, 542, 622, 776, 792, 948, 1203, 1235, 1245, 1297, 1299, 1394]

cat_indexes = list_vegan+list_bookstore+list_fitness
cat_indexes.sort()
print(cat_indexes)

[37, 42, 96, 118, 182, 326, 338, 350, 406, 412, 481, 503, 542, 622, 769, 772, 774, 776, 792, 948, 1057, 1065, 1098, 1106, 1203, 1216, 1235, 1245, 1256, 1297, 1299, 1394]


In [61]:
barranquilla_vns = barranquilla_vns.loc[cat_indexes,:]
barranquilla_vns

Unnamed: 0,Barrio,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
37,Altos de Riomar,11.015493,-74.824646,Librería Nacional S.A.,11.013726,-74.827062,Bookstore
42,Altos de Riomar,11.015493,-74.824646,Panamericana,11.013303,-74.828388,Bookstore
96,Altos del Limón,11.014751,-74.828707,Panamericana,11.013303,-74.828388,Bookstore
118,Altos del Limón,11.014751,-74.828707,Librería Nacional S.A.,11.013726,-74.827062,Bookstore
182,Andalucía,11.01552,-74.816503,Centro Médico Deportivo Body & Soul,11.013104,-74.813725,Gym / Fitness Center
326,La Floresta,11.023139,-74.812633,Springfield Gym,11.024397,-74.812476,Gym / Fitness Center
338,Las Tres Avemarias,11.020626,-74.810417,Springfield Gym,11.024397,-74.812476,Gym / Fitness Center
350,Paraiso,11.014722,-74.811007,Centro Médico Deportivo Body & Soul,11.013104,-74.813725,Gym / Fitness Center
406,San Marino,11.024231,-74.810243,Springfield Gym,11.024397,-74.812476,Gym / Fitness Center
412,San Salvador,11.023434,-74.809524,Springfield Gym,11.024397,-74.812476,Gym / Fitness Center


As can be seen above, venues meeting the criteria have been found. To analyze this information further, let's first group the dataframe by 'Barrio' and compute the mean frequency of occurrence of each venue category around each 'Barrio'.
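
To make the "mean of occurrence" concrete, here is a minimal standard-library sketch of the same calculation for a single barrio. The input list is hypothetical (two bookstores and one vegan restaurant), and `category_frequencies` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

CATEGORIES = ['Bookstore', 'Gym / Fitness Center', 'Vegetarian / Vegan Restaurant']

def category_frequencies(venue_categories):
    """Share of each category among the venues found around one barrio."""
    counts = Counter(venue_categories)
    total = len(venue_categories)
    return {cat: counts[cat] / total for cat in CATEGORIES}

# Hypothetical input: two bookstores and one vegan restaurant around a barrio.
sample = ['Bookstore', 'Vegetarian / Vegan Restaurant', 'Bookstore']
freqs = category_frequencies(sample)
print({cat: round(value, 6) for cat, value in freqs.items()})
```

Taking the mean of one-hot columns per barrio gives the same shares: each category's count divided by the number of matching venues around that barrio.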

In [85]:
# One-hot encode the venue categories
bqvenues_dummy = pd.get_dummies(barranquilla_vns['Venue Category'])

# Add 'Barrio' as the first column
bqvenues_dummy.insert(0, 'Barrio', barranquilla_vns['Barrio'])

# Mean frequency of each category per barrio
bq_grouped = bqvenues_dummy.groupby('Barrio').mean().reset_index()
bq_grouped

Unnamed: 0,Barrio,Bookstore,Gym / Fitness Center,Vegetarian / Vegan Restaurant
0,Alfonso López,0.0,1.0,0.0
1,Altos de Riomar,1.0,0.0,0.0
2,Altos del Limón,1.0,0.0,0.0
3,Andalucía,0.0,1.0,0.0
4,Betania,0.0,1.0,0.0
5,California,0.0,0.0,1.0
6,El Carmen,0.0,1.0,0.0
7,El Prado,0.666667,0.0,0.333333
8,El Recreo,0.0,1.0,0.0
9,El Romance,0.0,0.0,1.0


The above information makes it possible to rank the categories of each barrio by frequency of occurrence; the same frequencies will be used to build the clusters mentioned in the **Methodology** section.
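
The ranking step can be sketched on a single row with plain Python. The row below is copied from the grouped table above; `rank_categories` is a hypothetical helper that also keeps the frequencies, so zero-frequency positions stay visible.

```python
def rank_categories(frequencies, top_n=3):
    """Return (category, frequency) pairs sorted from most to least frequent."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Row for Alfonso López, copied from the grouped table above.
alfonso_lopez = {
    'Bookstore': 0.0,
    'Gym / Fitness Center': 1.0,
    'Vegetarian / Vegan Restaurant': 0.0,
}

ranked = rank_categories(alfonso_lopez)
for position, (category, freq) in enumerate(ranked, start=1):
    print(position, category, freq)
```

Only the first position carries information for this row: the 2nd and 3rd entries both have frequency 0, so their relative order depends on the sort implementation and should not be read as a real preference.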

In [81]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [89]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Barrio']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
barrios_venues_sorted = pd.DataFrame(columns=columns)
barrios_venues_sorted['Barrio'] = bq_grouped['Barrio']

for ind in np.arange(bq_grouped.shape[0]):
    barrios_venues_sorted.iloc[ind, 1:] = return_most_common_venues(bq_grouped.iloc[ind, :], num_top_venues)

barrios_venues_sorted.head()

Unnamed: 0,Barrio,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Alfonso López,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
1,Altos de Riomar,Bookstore,Vegetarian / Vegan Restaurant,Gym / Fitness Center
2,Altos del Limón,Bookstore,Vegetarian / Vegan Restaurant,Gym / Fitness Center
3,Andalucía,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
4,Betania,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore


K-Means is one of the most popular unsupervised machine learning algorithms and, since the barrios have no predefined labels, it suits the scope of this project well. Therefore, let's run the model.
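
K-Means assigns each row to the nearest cluster centre by Euclidean distance, so barrios with similar frequency vectors tend to end up together. The sketch below is not a run of K-Means; it only compares a few rows copied from the grouped table above to show what "similar" means here. The helper `euclidean` is illustrative.

```python
from math import sqrt

# Frequency vectors (Bookstore, Gym / Fitness Center, Vegetarian / Vegan Restaurant)
# copied from the grouped table above.
rows = {
    'Altos de Riomar': (1.0, 0.0, 0.0),
    'Altos del Limón': (1.0, 0.0, 0.0),
    'El Prado': (0.666667, 0.0, 0.333333),
    'Andalucía': (0.0, 1.0, 0.0),
}

def euclidean(a, b):
    """Euclidean distance between the frequency vectors of two barrios."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(rows[a], rows[b])))

d_same = euclidean('Altos de Riomar', 'Altos del Limón')
d_prado = euclidean('Altos de Riomar', 'El Prado')
d_gym = euclidean('Altos de Riomar', 'Andalucía')
print(round(d_same, 3), round(d_prado, 3), round(d_gym, 3))
```

Identical vectors are at distance 0, El Prado sits closer to the bookstore-only barrios than to the gym-only ones, and barrios with disjoint categories are the farthest apart.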

In [94]:
from sklearn.cluster import KMeans

kclusters = 5

bq_grouped_clustering = bq_grouped.drop(columns='Barrio')


kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bq_grouped_clustering)


kmeans.labels_[0:10] 


array([0, 3, 3, 0, 0, 2, 0, 4, 0, 2], dtype=int32)

In [107]:
barrios_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

bq_merged = df_barrios

bq_merged = bq_merged.join(barrios_venues_sorted.set_index('Barrio'), on='Barrio')

bq_merged = bq_merged.dropna()

bq_merged['Cluster Labels'] = bq_merged['Cluster Labels'].astype('int')

bq_merged.head()

Unnamed: 0,Comunidad,Barrio,latitud,longitud,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
3,Riomar,Altos de Riomar,11.015493,-74.824646,3,Bookstore,Vegetarian / Vegan Restaurant,Gym / Fitness Center
4,Riomar,Altos del Limón,11.014751,-74.828707,3,Bookstore,Vegetarian / Vegan Restaurant,Gym / Fitness Center
6,Riomar,Andalucía,11.01552,-74.816503,0,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
14,Riomar,La Floresta,11.023139,-74.812633,0,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
16,Riomar,Las Tres Avemarias,11.020626,-74.810417,0,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore


In [114]:
import matplotlib.cm as cm
import matplotlib.colors as colors

map_clusters = folium.Map(location=[latitud, longitud], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(bq_merged['latitud'], bq_merged['longitud'], bq_merged['Barrio'], bq_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [109]:
bq_merged.loc[bq_merged['Cluster Labels'] == 0, bq_merged.columns[[1] + list(range(5, bq_merged.shape[1]))]]

Unnamed: 0,Barrio,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
6,Andalucía,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
14,La Floresta,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
16,Las Tres Avemarias,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
17,Paraiso,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
19,San Marino,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
20,San Salvador,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
24,Solaire,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
28,Villa del Este,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
31,Villas del Puerto,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore
36,Betania,Gym / Fitness Center,Vegetarian / Vegan Restaurant,Bookstore


In [110]:
bq_merged.loc[bq_merged['Cluster Labels'] == 1, bq_merged.columns[[1] + list(range(5, bq_merged.shape[1]))]]

Unnamed: 0,Barrio,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
69,San Francisco,Vegetarian / Vegan Restaurant,Bookstore,Gym / Fitness Center
70,Santa Ana,Vegetarian / Vegan Restaurant,Bookstore,Gym / Fitness Center


In [111]:
bq_merged.loc[bq_merged['Cluster Labels'] == 2, bq_merged.columns[[1] + list(range(5, bq_merged.shape[1]))]]

Unnamed: 0,Barrio,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
104,California,Vegetarian / Vegan Restaurant,Gym / Fitness Center,Bookstore
120,El Romance,Vegetarian / Vegan Restaurant,Gym / Fitness Center,Bookstore


In [112]:
bq_merged.loc[bq_merged['Cluster Labels'] == 3, bq_merged.columns[[1] + list(range(5, bq_merged.shape[1]))]]

Unnamed: 0,Barrio,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
3,Altos de Riomar,Bookstore,Vegetarian / Vegan Restaurant,Gym / Fitness Center
4,Altos del Limón,Bookstore,Vegetarian / Vegan Restaurant,Gym / Fitness Center


In [113]:
bq_merged.loc[bq_merged['Cluster Labels'] == 4, bq_merged.columns[[1] + list(range(5, bq_merged.shape[1]))]]

Unnamed: 0,Barrio,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
46,El Prado,Bookstore,Vegetarian / Vegan Restaurant,Gym / Fitness Center


## Discussion

As can be seen, the 'Barrios' were segmented into 5 clusters, each with its own particularity.
There are several locations where the venues of interest converge, but further analysis should rely on in-city experience, since other factors may influence the location of a new store, such as socio-demographic factors, security, etc.
It can also be seen that Barranquilla is quite new to New Age trends: while other cities around the world have places directly categorized as yoga centers or similar, our city is just taking its first steps in that direction by including the discipline as a sport.
There are fewer places where bookstores are the most frequent category, which says something about the population's appeal for reading books on paper and may also be an indicator of its intellectual inclination.
In order to make a choice, it is necessary to determine which of the three categories should weigh the most in the final decision. Although gyms/fitness centers occur more often than bookstores, their occurrence shouldn't be the primary factor for locating the business, because such places do not always include yoga in their agenda, and not all of those that do offer it with even a slight spiritual bias.
Bookstores are not very frequent around the city, and neither are vegan/vegetarian restaurants. Given the variety of customers a bookstore has, it shouldn't be considered in the first place but in the second, after vegan/vegetarian restaurants, which are mostly attended either by people with health problems or by people with certain beliefs, which makes them more aligned with the type of store to be started.

## Conclusion
From this analysis, it can be concluded that there are many possibilities for starting the business. Additional information gathered about the city, including advice from locals, indicates that the most suitable areas for commercial activity are the center-north and the north, toward which the city has developed greatly in recent years and where the majority of the wealthy population is clustered.
This is aligned with the analysis run in this project: if we split the city into two halves (north and south), most of the clusters resulting from the study appear in the northern half rather than in the southern half.
Also, given what was stated in the discussion about which category should be primary for the final decision on which 'Barrio' is best for starting the business, *Altos de Riomar* and *Altos del Limón* appear as the best candidates. Another factor to consider is a socio-demographic one: how far the location is from the south of the city. In this regard *El Prado* appears as another good option. Like *Altos de Riomar* and *Altos del Limón*, it has bookstores and vegetarian/vegan restaurants in 1st and 2nd place respectively, but it has a special advantage: it is closer to the center of the city, a traditional commercial area frequented by people from both the north and the south.
Therefore, *El Prado* will be considered the most suitable location for the store.
