# __Capstone Project - Where to deploy an ATM (Automated Teller Machine)?__
### Applied Data Science Capstone

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## __Introduction: Business Problem__ <a name="introduction"></a>

#### The costs of acquiring, deploying and maintaining ATMs make it necessary to evaluate carefully where they are installed. A poor location results in excessive costs and poor service for end customers.
#### Any bank can take advantage of the data offered by geolocation tools, which provide up-to-date information on the businesses that consumers use or recommend. This data is useful for any financial institution that is considering expanding its ATM or branch network.
#### In the city of Cochabamba, Bolivia, there are several banks, some with many clients and some with few. This project is limited to evaluating the ATM coverage of a single bank (https://www.bg.com.bo/); however, the analysis can be extended to other institutions.

## __Data__ <a name="data"></a>

#### The first data source is the current location of the bank's ATMs. It is restricted to the city of Cochabamba; data from other Bolivian cities is not used. The locations are listed in the branches section of the bank's website (https://www.bg.com.bo/sucursales/).
#### The second source is the set of venues found around the city, all provided by Foursquare.
#### These two datasets are then combined to look for a relation between them.

## __Methodology__ <a name="methodology"></a>

#### __In this section we need to define the area of interest in Cochabamba city__

In [1]:
# import libraries
import folium
from geopy.geocoders import Nominatim
import pandas as pd
import json
import requests
import numpy as np
from pandas import json_normalize  # public import location since pandas 1.0
import math

#### Define functions to transform latitude, longitude to/from UTM (X, Y). These functions will be useful to calculate the distance between venues and ATM locations.

In [3]:
from pyproj import CRS, Transformer

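def latlon_to_xy(lat, lon):
    crs_latlon = CRS("EPSG:4326")  # latitude, longitude (WGS84)
    # NOTE: EPSG:32633 is WGS84 / UTM zone 33N. Cochabamba (longitude about -66.2)
    # falls in UTM zone 19S (EPSG:32719), so lengths computed in this projection are
    # distorted there. The outputs in this notebook were produced with EPSG:32633.
    crs_xy = CRS("EPSG:32633")
    transformer = Transformer.from_crs(crs_latlon, crs_xy)
    xy = transformer.transform(lat, lon)  # returns (easting, northing)
    return xy[1], xy[0]  # note the swap: (northing, easting)

def xy_to_latlon(x, y):
    crs_latlon = CRS("EPSG:4326")  # latitude, longitude (WGS84)
    crs_xy = CRS("EPSG:32633")  # must match the CRS used in latlon_to_xy
    transformer = Transformer.from_crs(crs_xy, crs_latlon)
    latlon = transformer.transform(y, x)  # undo the swap made in latlon_to_xy
    return latlon[1], latlon[0]  # returns (longitude, latitude)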

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)
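
#### UTM divides the globe into 6-degree zones, and projected distances are only close to true metres inside the zone that contains the data. The sketch below shows how the EPSG code of that zone could be derived from a coordinate; `utm_epsg` is a hypothetical helper (not part of pyproj) and it assumes the standard zone numbering, ignoring the special zones around Norway and Svalbard.

```python
import math

def utm_epsg(lat, lon):
    """Return the EPSG code of the WGS84 / UTM zone containing (lat, lon)."""
    # Zones are 6 degrees wide and numbered 1..60 starting at 180 degrees west.
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)          # keep lon == 180 inside zone 60
    # EPSG 326xx are the northern-hemisphere zones, 327xx the southern ones.
    base = 32600 if lat >= 0 else 32700
    return base + zone

# Center point used in this notebook (Puente Cobija, Cochabamba)
print(utm_epsg(-17.3876009, -66.1649141))  # -> 32719
```

#### The returned code could be passed to `CRS("EPSG:...")` in place of a hard-coded zone.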

#### Define a center point in Cochabamba. This is not the center of the city, but it will be the center of the geographical analysis.

In [2]:
# get latitude, longitude for Cochabamba City
address = 'Puente cobija, Cochabamba, Bolivia'
geolocator = Nominatim(user_agent='bolivia-agent')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
location

Location(Puente Cobija, Villa Galindo, Distrito 10, Adela Zamudio, Cochabamba, Kanata, Cercado, Cochabamba, 0, Bolivia, (-17.3876009, -66.1649141, 0.0))

#### Using the center point from the previous step, let's define the area of analysis: a circle of 4 km radius (8 km diameter) that covers the most important parts of the city and the places where most venues can be found.

In [4]:
cbba_center_x, cbba_center_y = latlon_to_xy(latitude,longitude) # Map center in Cartesian coordinates
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = cbba_center_x - 6000
x_step = 1000
y_min = cbba_center_y - 6000 - (int(21/k)*k*1000 - 12000)/2
y_step = x_step * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(cbba_center_x, cbba_center_y, x, y)
        if (distance_from_center <= 8001):
            lon, lat = xy_to_latlon(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)
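
#### The grid construction above can also be written as a small reusable function. This is a minimal sketch, not the notebook's exact grid: `hex_grid` is a hypothetical name, and it assumes a regular hexagonal lattice in which every other row is shifted by half a step (the cell above uses a fixed offset of 300 instead).

```python
import math

def hex_grid(center_x, center_y, radius, step):
    """Return (x, y) points of a hexagonal lattice within `radius` of the center."""
    row_height = step * math.sqrt(3) / 2        # vertical distance between rows
    n_rows = int(radius // row_height)
    n_cols = int(radius // step) + 1
    points = []
    for i in range(-n_rows, n_rows + 1):
        y = center_y + i * row_height
        x_offset = step / 2 if i % 2 else 0.0   # shift every other row by half a step
        for j in range(-n_cols, n_cols + 1):
            x = center_x + j * step + x_offset
            # keep only the points inside the circular area of interest
            if math.hypot(x - center_x, y - center_y) <= radius:
                points.append((x, y))
    return points

grid = hex_grid(0.0, 0.0, radius=4000, step=1000)
print(len(grid), 'grid points')
```

#### With this layout every point has its nearest neighbors at exactly `step` units, so a search radius slightly above `step / 2` leaves only small gaps between neighboring queries.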

In [5]:
print(len(latitudes), 'candidate neighborhood centers generated.')

218 candidate neighborhood centers generated.


In [6]:
# store the latitudes and longitudes found for the area of interest in a dataframe
df=pd.DataFrame()
df['Latitude']=latitudes
df['Longitude']=longitudes
df.head(3)

| | Latitude | Longitude |
|---|---|---|
| 0 | -17.367656 | -66.178763 |
| 1 | -17.366301 | -66.175918 |
| 2 | -17.364946 | -66.173074 |


#### Let's see the area covered by the analysis

In [151]:
map_cbba = folium.Map(location=[latitude,longitude], tiles='OpenStreetMap', zoom_start=14)  # default tiles; 'Stamen Terrain' may be unavailable in recent folium releases
for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=30, color='red', fill=True, popup=str(lat)).add_to(map_cbba)    
map_cbba

#### __It is time to work with Foursquare to find the venues in the area of interest. We will use a function defined in previous weeks of this course.__

In [7]:
# define some variables
CLIENT_ID = 'G35DEDCTV0G2QZHIJQYHJXWRJY1GAFWNFPWMMAFIEJ0L30' 
CLIENT_SECRET = 'N1KX1AWQCEQBWSYXMCJUT2JY204TO3F1O42BMKNKWRD0YJ'
VERSION = '20180605'
limit = 100
radius = 300

In [8]:
# Define a function that finds the venues around every grid point
def getNearbyVenues(latitudes, longitudes, radius):

    venues_list = []
    for lat, lng in zip(latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, limit)

        # make the GET request and keep the list of recommended items
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng']) for v in results])

    # flatten the per-point lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Venue', 'Venue Latitude', 'Venue Longitude']

    return nearby_venues
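
#### The function above indexes straight into `["response"]['groups'][0]['items']`, which raises an exception if a request fails or returns no groups. A more defensive way to read the same fields could look like the sketch below; `extract_venues` is a hypothetical helper and the sample payload is hand-written to mirror only the keys used in this notebook, not real API output.

```python
def extract_venues(payload):
    """Pull (name, lat, lng) tuples out of an explore-style JSON payload."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:                       # error responses or empty areas
        return []
    rows = []
    for item in groups[0].get('items', []):
        venue = item.get('venue', {})
        location = venue.get('location', {})
        # skip entries without coordinates instead of failing
        if 'lat' in location and 'lng' in location:
            rows.append((venue.get('name'), location['lat'], location['lng']))
    return rows

# Hand-written sample with the same nesting the notebook reads
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'La Cueva',
               'location': {'lat': -17.411785, 'lng': -66.156042}}},
]}]}}

print(extract_venues(sample))                    # one venue tuple
print(extract_venues({'meta': {'code': 429}}))   # no 'response' key -> []
```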

In [9]:
# Apply the function to Cochabamba and see how many venues were found within a radius of 300 m around every grid point
cbba_venues = getNearbyVenues(latitudes=df['Latitude'],longitudes=df['Longitude'],radius=300)
print('Total venues found:',len(cbba_venues))

Total venues found: 835


In [10]:
cbba_venues.tail(3)

| | Venue | Venue Latitude | Venue Longitude |
|---|---|---|---|
| 832 | La Cueva | -17.411785 | -66.156042 |
| 833 | Krel Electrónica | -17.41254 | -66.15654 |
| 834 | Celutron | -17.40765 | -66.15072 |


#### Neighboring search areas can return the same venue more than once, so these 835 rows have to be cleaned of duplicates

In [59]:
cbba2 = cbba_venues.copy()

In [60]:
cbba2[cbba2.duplicated(['Venue','Venue Latitude','Venue Longitude'])].shape

(493, 3)

In [61]:
cbba2.drop_duplicates(['Venue','Venue Latitude','Venue Longitude'],inplace=True)
cbba2.reset_index(drop=True, inplace =True)
cbba2.shape

(342, 3)

#### With the 342 unique venues that remain, we can visualize them on the map

In [130]:
map_cbba2 = folium.Map(location=[latitude,longitude], tiles='OpenStreetMap', zoom_start=13)  # default tiles; 'Stamen Terrain' may be unavailable in recent folium releases
for lat, lng, venue in zip(cbba2['Venue Latitude'],cbba2['Venue Longitude'],cbba2['Venue']):
    label = folium.Popup(venue, parse_html=True)
    folium.CircleMarker([lat,lng], radius=5, popup=label, color='red', fill=True).add_to(map_cbba2)
map_cbba2

### __But, what about ATMs?__
#### Let's take the ATM addresses from the institution's web page (https://www.bg.com.bo/sucursales/) and write them into a dictionary

In [15]:
# these are the addresses of the bank's ATMs
bank_addresses = {'HIPERMAXI CIRCUNVALACION':'Hipermaxi Circunvalacion', 'AG. AMERICA':'Banco Ganadero Central', 'AG. LA CANCHA':'Calle Honduras', 
                    'HIPERMAXI PRADO':'Avenida Ballivian 753','OF. CENTRAL':'Correos', 'TORRES SOFER':'Torres Sofer', 'BLANCO GALINDO':'Rotonda Peru', 
                    'SERVICIO DE CAMINOS':'Avenida Eliodoro Villazón', 'SURTIDOR EL CRISTO':'Clinica Los Angeles',
                    'IC NORTE':'Avenida América 817','AMERICA Y MELCHOR':'Melchor Perez Olguin', 'HIPER MAXI JUAN DE LA ROSA':'Hipermaxi Juan Rosa', 
                    'AEROPUERTO':'Aeropuerto', 'U. CATOLICA':'Plaza Tarija'}

#### Now we find the latitude and longitude of those addresses using the GeoPy library

In [18]:
# get the location of every ATM of the bank
city = ' Cochabamba, Bolivia'
offices,lats,longs=[],[],[]
geolocator = Nominatim(user_agent='bolivia-agent')
for office, address in bank_addresses.items():
    location = geolocator.geocode(address + city)  # one request per address instead of two
    offices.append(office)
    lats.append(location.latitude)
    longs.append(location.longitude)

# add office's location to df
atm = pd.DataFrame()
atm['Office'] = offices
atm['Latitude'] = lats
atm['Longitude'] = longs
atm.shape

(14, 3)
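
#### A geocoder can return `None` when it finds no match, which would make the loop above fail on `.latitude`. The sketch below shows one way to collect such failures instead; `geocode_offices` is a hypothetical helper, and `fake_geocode` with its coordinates is a made-up stand-in so the example runs offline (in the notebook, `geolocator.geocode` would be passed instead).

```python
from collections import namedtuple

def geocode_offices(addresses, geocode, suffix=' Cochabamba, Bolivia'):
    """Geocode each address once; return (rows, names_of_failed_offices)."""
    rows, failed = [], []
    for office, address in addresses.items():
        location = geocode(address + suffix)   # single lookup per address
        if location is None:                   # no match found for this query
            failed.append(office)
            continue
        rows.append((office, location.latitude, location.longitude))
    return rows, failed

# Stand-in geocoder with invented coordinates, for illustration only
FakeLocation = namedtuple('FakeLocation', 'latitude longitude')

def fake_geocode(query):
    known = {'Aeropuerto Cochabamba, Bolivia': FakeLocation(1.0, 2.0)}
    return known.get(query)                    # None when the query is unknown

rows, failed = geocode_offices(
    {'AEROPUERTO': 'Aeropuerto', 'NOWHERE': 'No such place'}, fake_geocode)
print(rows)
print('not found:', failed)
```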

In [132]:
atm.head(14)

| | Office | Latitude | Longitude |
|---|---|---|---|
| 0 | HIPERMAXI CIRCUNVALACION | -17.363888 | -66.165877 |
| 1 | AG. AMERICA | -17.372293 | -66.162408 |
| 2 | AG. LA CANCHA | -17.401421 | -66.152221 |
| 3 | HIPERMAXI PRADO | -17.384373 | -66.159049 |
| 4 | OF. CENTRAL | -17.392794 | -66.158624 |
| 5 | TORRES SOFER | -17.384613 | -66.151078 |
| 6 | BLANCO GALINDO | -17.393905 | -66.170656 |
| 7 | SERVICIO DE CAMINOS | -17.379576 | -66.122469 |
| 8 | SURTIDOR EL CRISTO | -17.378686 | -66.164538 |
| 9 | IC NORTE | -17.37267 | -66.151056 |
9,IC NORTE,-17.37267,-66.151056


#### __At this point we have two datasets: venue locations and ATM locations (14). With these in mind, the simplest relation we can imagine between them is the distance from every venue to every ATM, so the next step is to calculate those distances.__

In [21]:
# calculate the distance between every venue and every ATM
list0 = []
# loop over the venues dataframe
for vlat, vlon in zip(cbba2['Venue Latitude'],cbba2['Venue Longitude']):
    x0, y0 = latlon_to_xy(vlat,vlon)
    list1 = []
    # loop over the ATM dataframe
    for olat, olon in zip(atm['Latitude'],atm['Longitude']):
        x1, y1 = latlon_to_xy(olat,olon)
        calc_distance = int(calc_xy_distance(x0, y0, x1, y1))
        list1.append(calc_distance)
    list0.append(list1)
len(list0)

342
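
#### Projected distances can be cross-checked with the haversine formula, which works directly on latitude and longitude and does not depend on any map projection. This is a minimal sketch assuming a spherical Earth with a mean radius of 6,371 km; `haversine_m` is a hypothetical helper name, and the two coordinates are taken from this notebook's output tables (the venue Jala Gym and the ATM HIPERMAXI CIRCUNVALACION).

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))

# Jala Gym (venue) to HIPERMAXI CIRCUNVALACION (ATM)
d = haversine_m(-17.365922, -66.175388, -17.363888, -66.165877)
print(round(d), 'm')
```

#### For this pair the great-circle distance is a little over 1 km. If the projected distance for the same pair differs markedly, the UTM zone used for the projection is the first thing to check.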

#### With the previous data, create a proper dataframe

In [22]:
distances = pd.DataFrame(list0,columns=atm.Office.to_list())

#### Just validate the shapes of the dataframes

In [62]:
print(distances.shape)
print(cbba2.shape)

(342, 14)
(342, 3)


#### Finally, join this data

In [63]:
cbba3 = cbba2.join(distances)
cbba3.shape

(342, 17)

In [133]:
cbba3.tail(4)

| | Venue | Venue Latitude | Venue Longitude | HIPERMAXI CIRCUNVALACION | AG. AMERICA | AG. LA CANCHA | HIPERMAXI PRADO | OF. CENTRAL | TORRES SOFER | BLANCO GALINDO | SERVICIO DE CAMINOS | SURTIDOR EL CRISTO | IC NORTE | AMERICA Y MELCHOR | HIPER MAXI JUAN DE LA ROSA | AEROPUERTO | U. CATOLICA | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 338 | Eterna store | -17.404624 | -66.155805 | 13717 | 10792 | 1538 | 6709 | 3973 | 6718 | 5841 | 13306 | 8928 | 10572 | 12519 | 10484 | 5932 | 10886 | 12 |
| 339 | La Cueva | -17.411785 | -66.156042 | 15992 | 13088 | 3598 | 9025 | 6269 | 9032 | 7441 | 14916 | 11164 | 12905 | 14565 | 12533 | 5633 | 13114 | 9 |
| 340 | Krel Electrónica | -17.41254 | -66.15654 | 16206 | 13310 | 3883 | 9256 | 6497 | 9303 | 7544 | 15201 | 11368 | 13170 | 14724 | 12695 | 5511 | 13398 | 9 |
| 341 | Celutron | -17.40765 | -66.15072 | 15108 | 12150 | 2092 | 8059 | 5461 | 7543 | 7715 | 12779 | 10433 | 11456 | 14204 | 12176 | 7309 | 11364 | 12 |


#### __The K-Means algorithm was applied to this dataset, using the 14 venue-to-ATM distances as features. The number of clusters (k) was set to 14 because there are 14 ATMs, so in theory each ATM should sit at the center of one cluster.__

In [105]:
from sklearn.cluster import KMeans
kclusters = 14
kmeans = KMeans(n_clusters = kclusters, random_state=0).fit(distances)
kmeans.labels_[0:10]

array([11, 11, 11, 11, 11, 11, 11, 11, 11, 11])
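
#### A simple baseline for comparing against the K-Means labels is to assign every venue to its nearest ATM, which is just the column with the smallest value in each row of `distances`. The sketch below illustrates this on two rows copied from the `cbba3` output in this notebook; `nearest_atm` is a hypothetical helper, not part of the original analysis.

```python
def nearest_atm(distance_row, atm_names):
    """Return (name, distance) of the closest ATM for one venue's distance row."""
    idx = min(range(len(distance_row)), key=distance_row.__getitem__)
    return atm_names[idx], distance_row[idx]

# Column order of the distances dataframe
atm_names = ['HIPERMAXI CIRCUNVALACION', 'AG. AMERICA', 'AG. LA CANCHA',
             'HIPERMAXI PRADO', 'OF. CENTRAL', 'TORRES SOFER', 'BLANCO GALINDO',
             'SERVICIO DE CAMINOS', 'SURTIDOR EL CRISTO', 'IC NORTE',
             'AMERICA Y MELCHOR', 'HIPER MAXI JUAN DE LA ROSA', 'AEROPUERTO',
             'U. CATOLICA']

# Rows 0 (Jala Gym) and 338 (Eterna store) of cbba3
jala_gym = [3068, 4589, 13726, 7939, 10266, 9801, 9292,
            17244, 5401, 7973, 1841, 3728, 14420, 10525]
eterna_store = [13717, 10792, 1538, 6709, 3973, 6718, 5841,
                13306, 8928, 10572, 12519, 10484, 5932, 10886]

print(nearest_atm(jala_gym, atm_names))      # ('AMERICA Y MELCHOR', 1841)
print(nearest_atm(eterna_store, atm_names))  # ('AG. LA CANCHA', 1538)
```

#### On the full dataframe the same idea is available as `distances.idxmin(axis=1)`. Counting venues per nearest ATM gives a direct view of how evenly the existing ATMs cover the venues.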

In [106]:
cbba3['cluster'] = kmeans.labels_

In [154]:
cbba3.head(3)  # look at the last column of the dataframe

| | Venue | Venue Latitude | Venue Longitude | HIPERMAXI CIRCUNVALACION | AG. AMERICA | AG. LA CANCHA | HIPERMAXI PRADO | OF. CENTRAL | TORRES SOFER | BLANCO GALINDO | SERVICIO DE CAMINOS | SURTIDOR EL CRISTO | IC NORTE | AMERICA Y MELCHOR | HIPER MAXI JUAN DE LA ROSA | AEROPUERTO | U. CATOLICA | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jala Gym | -17.365922 | -66.175388 | 3068 | 4589 | 13726 | 7939 | 10266 | 9801 | 9292 | 17244 | 5401 | 7973 | 1841 | 3728 | 14420 | 10525 | 11 |
| 1 | Jala Game Room | -17.365832 | -66.175428 | 3075 | 4614 | 13758 | 7970 | 10298 | 9830 | 9323 | 17264 | 5433 | 7994 | 1870 | 3760 | 14450 | 10545 | 11 |
| 2 | Casona Mayorazgo | -17.365488 | -66.174592 | 2794 | 4438 | 13717 | 7890 | 10264 | 9699 | 9396 | 17040 | 5361 | 7775 | 2011 | 3817 | 14556 | 10319 | 11 |


#### A brief description of the cluster sizes shows the minimum (9), maximum (51) and mean (24.4) number of venues per cluster

In [153]:
cbba3['cluster'].value_counts().describe().round(1)

count    14.0
mean     24.4
std      10.9
min       9.0
25%      18.0
50%      23.5
75%      26.8
max      51.0
Name: cluster, dtype: float64

## __Results and Discussion__ <a name="results"></a>

#### Before looking at the clusters, a first result is to compare the locations of the ATMs with all the venues found; this distribution gives us a first approximation

In [131]:
for lat,lng,office in zip(atm['Latitude'],atm['Longitude'],atm['Office']):
    label = folium.Popup(office)
    folium.CircleMarker([lat,lng],radius=10,popup=label,color='black',fill=True).add_to(map_cbba2)
map_cbba2

#### In the map above, it is clear that some ATMs lie far from the cloud of venues. The ATM on the east side is an outlier with no venues nearby, but only because that location was not covered by the venue search. In the south there is another ATM with few venues around it, but it is at the city's airport, so that location is a good one. There is no ATM in the west. The center and the north of the city hold most of the venues and, as expected, most of the deployed ATMs.

#### Now let's visualize the clusters

In [76]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [134]:
clusters = folium.Map(location=[latitude,longitude], tiles='OpenStreetMap', zoom_start=13)  # default tiles; 'Stamen Terrain' may be unavailable in recent folium releases
# one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, venue, cluster in zip(cbba3['Venue Latitude'], cbba3['Venue Longitude'], cbba3['Venue'], cbba3['cluster']):
    label = folium.Popup(str(venue) + ', Cluster: ' + str(cluster), parse_html=True)
    # cluster labels run from 0 to kclusters-1, so they index the color list directly
    folium.CircleMarker([lat, lon], radius=5, popup=label,
        color=rainbow[cluster], fill=True, fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(clusters)
clusters

### __The results of the K-Means clustering are shown in the map above. K-Means grouped the venues sensibly even though the only variables used were the distances from each venue to the ATMs, not the distances between venues. It is interesting to see K-Means produce a result that was not expected.__

#### __Looking at the next image, it is possible to suggest a better distribution of ATMs, closer to the centroids of the clusters. This has to be evaluated by the stakeholders, because the current placement may follow other rules that only the bank knows.__

In [135]:
for lat,lng,office in zip(atm['Latitude'],atm['Longitude'],atm['Office']):
    label = folium.Popup(office)
    folium.CircleMarker([lat,lng],radius=10,popup=label,color='black',fill=True).add_to(clusters)
clusters
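
#### To make the idea of placing ATMs near cluster centroids concrete, the centroid of each cluster can be estimated as the mean latitude and longitude of its venues. This is a minimal sketch: `cluster_centroids` is a hypothetical helper, averaging raw coordinates is assumed to be accurate enough at city scale, and only five rows copied from the `cbba3` output are used here instead of the full dataframe.

```python
from collections import defaultdict

def cluster_centroids(rows):
    """rows: iterable of (lat, lon, cluster). Returns {cluster: (mean_lat, mean_lon)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])   # running lat sum, lon sum, count
    for lat, lon, cluster in rows:
        acc = sums[cluster]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {c: (s[0] / s[2], s[1] / s[2]) for c, s in sums.items()}

# (Venue Latitude, Venue Longitude, cluster) for a few venues from cbba3
rows = [
    (-17.365922, -66.175388, 11),   # Jala Gym
    (-17.365832, -66.175428, 11),   # Jala Game Room
    (-17.365488, -66.174592, 11),   # Casona Mayorazgo
    (-17.411785, -66.156042, 9),    # La Cueva
    (-17.412540, -66.156540, 9),    # Krel Electrónica
]

for cluster, (lat, lon) in sorted(cluster_centroids(rows).items()):
    print(cluster, round(lat, 6), round(lon, 6))
```

#### On the full dataframe the same result could be obtained with `cbba3.groupby('cluster')[['Venue Latitude', 'Venue Longitude']].mean()`, and each centroid could then be drawn on the map next to the existing ATMs.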

#### Another point of evaluation is the number of elements in each cluster. The three largest clusters contain 51, 38 and 34 venues. ATMs serving these clusters may see heavier use and require more maintenance, so it is recommended to relocate some ATMs or add new ones there.

In [144]:
cbba3['cluster'].value_counts()

10    51
1     38
5     34
0     27
4     26
8     25
3     25
6     22
13    21
12    18
11    18
7     16
2     12
9      9
Name: cluster, dtype: int64
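
#### The weight of the largest clusters can be put into numbers. The short sketch below uses only the cluster sizes printed above and computes what share of all venues falls into the three largest clusters; the variable names are illustrative.

```python
# Cluster sizes copied from the value_counts() output above
cluster_sizes = {10: 51, 1: 38, 5: 34, 0: 27, 4: 26, 8: 25, 3: 25,
                 6: 22, 13: 21, 12: 18, 11: 18, 7: 16, 2: 12, 9: 9}

total = sum(cluster_sizes.values())
# three largest clusters as (label, size) pairs
top3 = sorted(cluster_sizes.items(), key=lambda kv: kv[1], reverse=True)[:3]
share = sum(size for _, size in top3) / total

print(total)                    # 342
print(top3)                     # [(10, 51), (1, 38), (5, 34)]
print(round(100 * share, 1))    # 36.0
```

#### Three of the 14 clusters therefore hold about 36% of the venues, compared with roughly 21% if the venues were spread evenly.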

#### The final map presents the largest clusters (10, 1, 5) in relation to the ATM locations

In [150]:
cluster10 = folium.Map(location=[latitude,longitude], tiles='OpenStreetMap', zoom_start=13)  # default tiles; 'Stamen Terrain' may be unavailable in recent folium releases
# one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers for the three largest clusters only
for lat, lon, venue, cluster in zip(cbba3['Venue Latitude'], cbba3['Venue Longitude'], cbba3['Venue'], cbba3['cluster']):
    if cluster in (10, 5, 1):
        label = folium.Popup(str(venue) + ', Cluster: ' + str(cluster), parse_html=True)
        # cluster labels run from 0 to kclusters-1, so they index the color list directly
        folium.CircleMarker([lat, lon], radius=5, popup=label,
            color=rainbow[cluster], fill=True, fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(cluster10)

for lat,lng,office in zip(atm['Latitude'],atm['Longitude'],atm['Office']):
    label = folium.Popup(office)
    folium.CircleMarker([lat,lng],radius=10,popup=label,color='black',fill=True).add_to(cluster10)
cluster10

## __Conclusion__ <a name="conclusion"></a>

### In this study we looked for the best locations for existing or new ATMs. Using Foursquare data, we compared the current placement of ATMs with the distribution of venues frequented by people in Cochabamba city.

### Using an ML algorithm (K-Means) applied to the geographical distances between the venues and the ATMs, it was possible to create groups or zones that could be good places to deploy new ATMs.

### It is recommended to move ATMs or deploy new ones considering the clusters produced by K-Means. Some clusters are very large, which supports this recommendation.

### This study could be applied to other financial institutions with ATMs (or branches) that need to be relocated or that need to deploy new ones.


***