<h1 style="text-align:center;">Applied Data Science Capstone</h1>
<h3 style="text-align:center;">Capstone Project - The Battle of Neighborhoods</h3>
<h2 style="text-align:center;">Istanbul</h2>
<br>
<h3>Initial Libraries</h3>
<br>

In [31]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy  as np
import json
import requests
from pandas import json_normalize  # public location since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium # map rendering library
from sklearn.cluster import KMeans

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#First Step: Scraping this wikipedia page and getting list of neighborhoods
url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"

data  = requests.get(url).text 
soup = BeautifulSoup(data,"html5lib")


<h3>Finding List of Neighborhoods</h3>
<br>

In [3]:
# Creating the dataframe
df = pd.DataFrame(columns=['Neighborhood', 'Borough', 'Cluster Labels', 'Latitude', 'Longitude'])

# Finding the list of neighborhoods
span_under_h2_title = soup.find(id="Neighbourhoods_by_districts")
h2_title = span_under_h2_title.parent
h3_object = h2_title.next_sibling.next_sibling
neighborhood_count = {}

# Traversing through the list
while h3_object.name == 'h3' :
    counter = 0
    borough_span = h3_object.find(class_ = 'mw-headline')
    # Borough name
    borough_name = borough_span.string
    ol_object = h3_object.next_sibling.next_sibling
    if ol_object.name == 'ol' :
        for neighborhood_li in ol_object.find_all('li', recursive=False) :
            if neighborhood_li.name == 'li' :
                neighborhood_ahref = neighborhood_li.find('a')
                if neighborhood_ahref is None :
                    neighborhood = neighborhood_li.string
                else :
                    neighborhood = neighborhood_ahref.string
                if("," in neighborhood) :
                    neighborhood = neighborhood.split(",")[0]
                borough_name = borough_name.strip()
                # Neighborhood name
                neighborhood = neighborhood.strip()
                counter = counter + 1
                new_row = {
                    'Neighborhood'  : neighborhood,
                    'Borough'       : borough_name,
                    'Cluster Labels': 0,
                    'Latitude'      : 0.0,
                    'Longitude'     : 0.0
                  }
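                # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
                df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)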
    neighborhood_count[borough_name] = counter
    h3_object = ol_object.next_sibling.next_sibling
df.head()

Unnamed: 0,Neighborhood,Borough,Cluster Labels,Latitude,Longitude
0,Burgazada,Adalar,0,0.0,0.0
1,Heybeliada,Adalar,0,0.0,0.0
2,Kınalıada,Adalar,0,0.0,0.0
3,Maden,Adalar,0,0.0,0.0
4,Nizam,Adalar,0,0.0,0.0
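
The traversal above depends on `next_sibling.next_sibling` hops, which stop working if the page changes the whitespace or wrappers between a heading and its list. As a hedged illustration only, the same "heading, then list items" idea can be sketched with the standard library's `html.parser`. The `BoroughListParser` class and the inline `sample_html` are hypothetical and not part of the notebook; the real page may need extra handling (nested lists, other heading levels).

```python
from html.parser import HTMLParser


class BoroughListParser(HTMLParser):
    """Collect (borough, neighborhood) pairs from <h3> headings followed by <li> items."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._borough = None
        self._in_h3 = False
        self._in_li = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when a heading or a list item opens.
        if tag == "h3":
            self._in_h3, self._buffer = True, []
        elif tag == "li":
            self._in_li, self._buffer = True, []

    def handle_data(self, data):
        if self._in_h3 or self._in_li:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if tag == "h3" and self._in_h3:
            self._borough, self._in_h3 = text, False
        elif tag == "li" and self._in_li:
            self._in_li = False
            if self._borough and text:
                # Keep only the part before a comma, as the notebook does.
                self.rows.append((self._borough, text.split(",")[0].strip()))


# Hypothetical, simplified markup that mimics the structure the notebook expects.
sample_html = """
<h3><span class="mw-headline">Adalar</span></h3>
<ol><li><a href="#">Burgazada</a></li><li>Maden, Adalar</li></ol>
<h3><span class="mw-headline">Fatih</span></h3>
<ol><li>Balat</li></ol>
"""

parser = BoroughListParser()
parser.feed(sample_html)
print(parser.rows)
```

Because the parser reacts to tags rather than to sibling positions, blank text nodes between elements do not affect the result.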


In [5]:
# Check if everything holds up :)

anyProblems = False
counts = df['Borough'].value_counts()
for key in neighborhood_count.keys() :
    if neighborhood_count[key] != counts[key] :
        print (key)
        anyProblems = True
    if key == "" :
        print ("Empty key/borough!")
        anyProblems = True
if len(counts) != len(neighborhood_count):
    print ("Not same")
    anyProblems = True

if not anyProblems :
    print ("All good :)")

All good :)


<h3>Finding Coordinates of Neighborhoods</h3>
<br>

In [17]:
geolocator = Nominatim(user_agent="istanbul_explorer")
problematic_neighborhood_names = [] 

for index, row in df.iterrows():
    address = '{}, {}'.format(row['Neighborhood'], row['Borough'])
    print(address)
    location = geolocator.geocode(address)
    if location is None :
        print ("We have a problem : ", address)
        problematic_neighborhood_names.append((row['Neighborhood'], row['Borough']))
    else :
        #print ("{} is at : ({}, {})".format(address, location.latitude, location.longitude))
        df.loc[ ((df['Neighborhood'] == row['Neighborhood']) &
                (df['Borough']       == row['Borough'])),
                ['Latitude', 'Longitude'] ] = (location.latitude, location.longitude)

Burgazada, Adalar
Heybeliada, Adalar
Kınalıada, Adalar
Maden, Adalar
Nizam, Adalar
Anadolu, Arnavutköy
Arnavutköy İmrahor, Arnavutköy
We have a problem :  Arnavutköy İmrahor, Arnavutköy
Arnavutköy İslambey, Arnavutköy
We have a problem :  Arnavutköy İslambey, Arnavutköy
Arnavutköy Merkez, Arnavutköy
Arnavutköy Yavuzselim, Arnavutköy
We have a problem :  Arnavutköy Yavuzselim, Arnavutköy
Atatürk, Arnavutköy
Bahşayış, Arnavutköy
We have a problem :  Bahşayış, Arnavutköy
Boğazköy Atatürk, Arnavutköy
Boğazköy İstiklal, Arnavutköy
Boğazköy Merkez, Arnavutköy
We have a problem :  Boğazköy Merkez, Arnavutköy
Bolluca, Arnavutköy
Deliklikaya, Arnavutköy
Dursunköy, Arnavutköy
Durusu Cami, Arnavutköy
We have a problem :  Durusu Cami, Arnavutköy
Durusu Zafer, Arnavutköy
We have a problem :  Durusu Zafer, Arnavutköy
Hastane, Arnavutköy
İstasyon, Arnavutköy
We have a problem :  İstasyon, Arnavutköy
Sazlıbosna, Arnavutköy
Nakkaş, Arnavutköy
Karlıbayır, Arnavutköy
Haraççı, Arnavutköy
Hicret, Arnavutkö

Nişanca, Eyüp
Rami Cuma, Eyüp
Rami Yeni, Eyüp
Sakarya, Eyüp
Silahtarağa, Eyüp
Topçular, Eyüp
Yeşilpınar, Eyüp
Aksaray, Fatih
Akşemsettin, Fatih
Alemdar, Fatih
Ali Kuşçu, Fatih
Atikali, Fatih
Ayvansaray, Fatih
Balabanağa, Fatih
Balat, Fatih
Beyazıt, Fatih
Binbirdirek, Fatih
Cankurtaran, Fatih
Cerrahpaşa, Fatih
Cibali, Fatih
Demirtaş, Fatih
Derviş Ali, Fatih
Eminsinan, Fatih
Hacıkadın, Fatih
Hasekisultan, Fatih
Hırkaişerif, Fatih
We have a problem :  Hırkaişerif, Fatih
Hobyar, Fatih
Hoca Giyasettin, Fatih
Hocapaşa, Fatih
İskenderpaşa, Fatih
Kalenderhane, Fatih
Karagümrük, Fatih
Katip Kasım, Fatih
Kemalpaşa, Fatih
Kocamustafapaşa, Fatih
Küçükayasofya, Fatih
Mercan, Fatih
Mesihpaşa, Fatih
Mevlanakapı, Fatih
Mimar Hayrettin, Fatih
Mimar Kemalettin, Fatih
Mollafenari, Fatih
Mollagürani, Fatih
We have a problem :  Mollagürani, Fatih
Mollahüsrev, Fatih
Muhsinehatun, Fatih
We have a problem :  Muhsinehatun, Fatih
Nişanca, Fatih
Rüstempaşa, Fatih
Saraçishak, Fatih
Sarıdemir, Fatih
Seyyid Ömer, F

Dumlupınar, Ümraniye
Elmalıkent, Ümraniye
Esenevler, Ümraniye
Esenşehir, Ümraniye
Fatih Sultan Mehmet, Ümraniye
Hekimbaşı, Ümraniye
Huzur, Ümraniye
Ihlamurkuyu, Ümraniye
İnkılap, Ümraniye
İstiklal, Ümraniye
Kâzım Karabekir, Ümraniye
Mehmet Akif, Ümraniye
Madenler, Ümraniye
Namık Kemal, Ümraniye
Necip Fazıl, Ümraniye
Parseller, Ümraniye
Saray, Ümraniye
Site, Ümraniye
Şerifali, Ümraniye
Tantavi, Ümraniye
Tatlısu, Ümraniye
Tepeüstü, Ümraniye
Topağacı, Ümraniye
Yamanevler, Ümraniye
Yeni Sanayi, Ümraniye
We have a problem :  Yeni Sanayi, Ümraniye
Yukarıdudullu, Ümraniye
Acıbadem, Üsküdar
Ahmediye, Üsküdar
Altunizade, Üsküdar
Aziz Mahmud Hüdayi, Üsküdar
Bahçelievler, Üsküdar
Barbaros, Üsküdar
Beylerbeyi, Üsküdar
Bulgurlu, Üsküdar
Burhaniye, Üsküdar
Cumhuriyet, Üsküdar
Çengelköy, Üsküdar
Ferah, Üsküdar
Güzeltepe, Üsküdar
İcadiye, Üsküdar
Kandilli, Üsküdar
Kirazlıtepe, Üsküdar
Kısıklı, Üsküdar
Kuleli, Üsküdar
Kuzguncuk, Üsküdar
Küçük Çamlıca, Üsküdar
Küçüksu, Üsküdar
Küplüce, Üsküdar
Mehmet Ak
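
The loop above sends one geocoding request per row, and a single timeout would abort the whole run. A minimal sketch of a retry wrapper is shown below, assuming the geocoder is passed in as a plain callable (for example `geolocator.geocode`). The helper name `geocode_with_retry`, its parameters, and the `flaky_geocode` stand-in are hypothetical; catching the broad `Exception` is a simplification, and a real run would catch the geocoder's own timeout errors.

```python
import time


def geocode_with_retry(geocode, address, retries=3, delay_seconds=1.0, sleep=time.sleep):
    """Call geocode(address); retry on errors and return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return geocode(address)
        except Exception:
            # Wait a little longer after each failed attempt, except after the last one.
            if attempt < retries - 1:
                sleep(delay_seconds * (attempt + 1))
    return None


# Stand-in geocoder that fails twice and then succeeds (no network access needed).
calls = {"count": 0}


def flaky_geocode(address):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return (41.0, 29.0)


result = geocode_with_retry(flaky_geocode, "Balat, Fatih", sleep=lambda seconds: None)
print(result, calls["count"])
```

Injecting `sleep` keeps the sketch testable without waiting; in the notebook the default `time.sleep` would also help to stay within the geocoding service's request limits.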

In [18]:
df.head()

Unnamed: 0,Neighborhood,Borough,Cluster Labels,Latitude,Longitude
0,Burgazada,Adalar,0,40.882124,29.064212
1,Heybeliada,Adalar,0,40.876259,29.091027
2,Kınalıada,Adalar,0,40.908452,29.04842
3,Maden,Adalar,0,40.872361,29.130448
4,Nizam,Adalar,0,40.857676,29.118957


<h4>Fixing Problematic Cases</h4>
<br>
Namely:
 <ul>
  <li>Wrongly written names in the list</li>
  <li>Names for which no coordinates were returned</li>
  <li>Repeated coordinates caused by errors of the geolocation API</li>
</ul> 

In [24]:
for tuple_nb in problematic_neighborhood_names :
    neigh = tuple_nb[0]
    bor   = tuple_nb[1]
    print ('(\'{}\', \'{}\', \'{}\'),'.format(neigh, bor, neigh))

('Arnavutköy İmrahor', 'Arnavutköy', 'Arnavutköy İmrahor'),
('Arnavutköy İslambey', 'Arnavutköy', 'Arnavutköy İslambey'),
('Arnavutköy Yavuzselim', 'Arnavutköy', 'Arnavutköy Yavuzselim'),
('Bahşayış', 'Arnavutköy', 'Bahşayış'),
('Boğazköy Merkez', 'Arnavutköy', 'Boğazköy Merkez'),
('Durusu Cami', 'Arnavutköy', 'Durusu Cami'),
('Durusu Zafer', 'Arnavutköy', 'Durusu Zafer'),
('İstasyon', 'Arnavutköy', 'İstasyon'),
('Taşoluk Çilingir', 'Arnavutköy', 'Taşoluk Çilingir'),
('Aşıkveysel', 'Ataşehir', 'Aşıkveysel'),
('Yeniçamlıca', 'Ataşehir', 'Yeniçamlıca'),
('Arapcami', 'Beyoğlu', 'Arapcami'),
('Asmalımescit', 'Beyoğlu', 'Asmalımescit'),
('Çatmamescit', 'Beyoğlu', 'Çatmamescit'),
('Kamerhatun', 'Beyoğlu', 'Kamerhatun'),
('Kalyoncukulluğu', 'Beyoğlu', 'Kalyoncukulluğu'),
('Keçecipiri', 'Beyoğlu', 'Keçecipiri'),
('Kemankeş Kara Mustafa Paşa', 'Beyoğlu', 'Kemankeş Kara Mustafa Paşa'),
('Kılıçalipaşa', 'Beyoğlu', 'Kılıçalipaşa'),
('Küçükpiyale', 'Beyoğlu', 'Küçükpiyale'),
('Ömeravni', 'Beyoğlu',

In [31]:
fixed_versions = [
('Arnavutköy İmrahor', 'Arnavutköy', 'İmrahor'),
('Arnavutköy İslambey', 'Arnavutköy', 'İslambey'),
('Arnavutköy Yavuzselim', 'Arnavutköy', 'Yavuz Selim'),
('Bahşayış', 'Arnavutköy', 'Bahşayış'),
('Boğazköy Merkez', 'Arnavutköy', 'Boğazköy'),
('Durusu Cami', 'Arnavutköy', 'Duru Su Cami'),
('Durusu Zafer', 'Arnavutköy', 'Duru Su Zafer'),
('İstasyon', 'Arnavutköy', 'İstasyon'),
('Taşoluk Çilingir', 'Arnavutköy', 'Taşoluk Çilingir'),
('Aşıkveysel', 'Ataşehir', 'Aşık Veysel'),
('Yeniçamlıca', 'Ataşehir', 'Yeni Çamlıca'),
('Arapcami', 'Beyoğlu', 'Arap Cami'),
('Asmalımescit', 'Beyoğlu', 'Asmalı Mescit'),
('Çatmamescit', 'Beyoğlu', 'Çatma Mescit'),
('Kamerhatun', 'Beyoğlu', 'Kamer Hatun'),
('Kalyoncukulluğu', 'Beyoğlu', 'Kalyoncu Kulluğu'),
('Keçecipiri', 'Beyoğlu', 'Keçeci Piri'),
('Kemankeş Kara Mustafa Paşa', 'Beyoğlu', 'Kemankeş Kara Mustafa Paşa'),
('Kılıçalipaşa', 'Beyoğlu', 'Kılıç Ali Paşa'),
('Küçükpiyale', 'Beyoğlu', 'Küçük Piyale'),
('Ömeravni', 'Beyoğlu', 'Ömer Avni'),
('Piripaşa', 'Beyoğlu', 'Piri Paşa'),
('Muratbey', 'Büyükçekmece', 'Murat Bey'),
('Muratçeşme', 'Büyükçekmece', 'Murat Çeşme'),
('İzettin', 'Çatalca', 'İzzettin'),
('Mimarsinan', 'Esenler', 'Mimar Sinan'),
('Turgutreis', 'Esenler', 'Turgut Reis'),
('Ardıçlıevler', 'Esenyurt', 'Ardıçlı Evler'),
('Çakmaklı', 'Esenyurt', 'Çakmaklı'),
('Güzelyurt (Haramidere)', 'Esenyurt', 'Güzelyurt'),
('Sanayii', 'Esenyurt', 'Sanayi'),
('Mimarsinan', 'Eyüp', 'Mimar Sinan'),
('Hırkaişerif', 'Fatih', 'Hırka-i Şerif'),
('Mollagürani', 'Fatih', 'Molla Gürani'),
('Muhsinehatun', 'Fatih', 'Muhsine Hatun'),
('Mareşal Fevzi Çakmak', 'Güngören', 'Fevzi Çakmak'),
('Mehmet Nezih Özmen', 'Güngören', 'Mehmet Nezih Özmen'),
('Sahrayıcedid', 'Kadıköy', 'Sahrayı Cedid'),
('Ortamahalle', 'Kartal', 'Orta Mahalle'),
('Yukarımahalle', 'Kartal', 'Yukarı Mahalle'),
('Yenimahalle', 'Pendik', 'Yeni Mahalle'),
('Eyüpsultan', 'Sancaktepe', 'Eyüp Sultan'),
('Bahçeköy Yenimahalle', 'Sarıyer', 'Yeni Mahalle'),
('Çanta Fatih', 'Silivri', 'Fatih'),
('Çanta Mimarsinan', 'Silivri', 'Mimar Sinan'),
('Kavaklı Hürriyet', 'Silivri', 'Kavaklı Hürriyet'),
('Mimarsinan', 'Sultanbeyli', 'Mimar Sinan'),
('Eski Habibler', 'Sultangazi', 'Eski Habipler'),
('Zübeydehanım', 'Sultangazi', 'Zübeyde Hanım'),
('Aşağıdudullu', 'Ümraniye', 'Aşağı Dudullu'),
('Ataken', 'Ümraniye', 'Atakent'),
('Yeni Sanayi', 'Ümraniye', 'Yeni Sanayi')
]
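
Several of the manual fixes above follow two mechanical patterns: dropping a leading borough name ("Arnavutköy İmrahor" becomes "İmrahor") and dropping a parenthetical ("Güzelyurt (Haramidere)" becomes "Güzelyurt"). The sketch below generates such candidate spellings automatically. The function `candidate_names` is hypothetical and covers only these two patterns; splitting compound words such as "Mimarsinan" still needs the hand-written list.

```python
import re


def candidate_names(neighborhood, borough):
    """Return alternative spellings to try, in order, for a neighborhood name."""
    candidates = [neighborhood]

    # Pattern 1: remove a parenthetical such as "(Haramidere)".
    without_parenthetical = re.sub(r"\s*\([^)]*\)", "", neighborhood).strip()
    if without_parenthetical not in candidates:
        candidates.append(without_parenthetical)

    # Pattern 2: remove the borough name when it is used as a prefix.
    prefix = borough + " "
    if without_parenthetical.startswith(prefix):
        stripped = without_parenthetical[len(prefix):].strip()
        if stripped and stripped not in candidates:
            candidates.append(stripped)

    return candidates


imrahor = candidate_names("Arnavutköy İmrahor", "Arnavutköy")
guzelyurt = candidate_names("Güzelyurt (Haramidere)", "Esenyurt")
balat = candidate_names("Balat", "Fatih")
print(imrahor, guzelyurt, balat)
```

Each candidate could then be geocoded in turn, stopping at the first one that returns a location.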

In [32]:
print ("{} problematic cases were found".format(len(fixed_versions)))
problematic_neighborhood_names = [] 
for ne_bor_fixed in fixed_versions:
    neighborhood_in_df = ne_bor_fixed[0]
    borough_in_df      = ne_bor_fixed[1]
    neighborhood_fixed = ne_bor_fixed[2]
    address = '{}, {}'.format(neighborhood_fixed, borough_in_df)
    #print(address)
    location = geolocator.geocode(address)
    if location is None :
        print ("We have a problem : ", address)
        problematic_neighborhood_names.append((neighborhood_in_df, borough_in_df))
    else :
        #print ("{} is at : ({}, {})".format(address, location.latitude, location.longitude))
        df.loc[ ((df['Neighborhood'] == neighborhood_in_df) &
                (df['Borough']       == borough_in_df)),
                ['Latitude', 'Longitude'] ] = (location.latitude, location.longitude)
print ("{} problematic cases remain".format(len(problematic_neighborhood_names)))

52 problematic cases were found
We have a problem :  Bahşayış, Arnavutköy
We have a problem :  Duru Su Cami, Arnavutköy
We have a problem :  Duru Su Zafer, Arnavutköy
We have a problem :  İstasyon, Arnavutköy
We have a problem :  Taşoluk Çilingir, Arnavutköy
We have a problem :  Kemankeş Kara Mustafa Paşa, Beyoğlu
We have a problem :  Murat Bey, Büyükçekmece
We have a problem :  Çakmaklı, Esenyurt
We have a problem :  Mehmet Nezih Özmen, Güngören
We have a problem :  Kavaklı Hürriyet, Silivri
We have a problem :  Yeni Sanayi, Ümraniye
11 problematic cases remain


In [27]:
df.shape

(783, 5)

In [33]:
df.to_csv("intermediary_coordinate_df.csv")

In [32]:
df = pd.read_csv("intermediary_coordinate_df.csv")

In [33]:
duplicateDFRow = df[df.duplicated(subset=['Latitude', 'Longitude'], keep=False)]
duplicateDFRow.shape

(48, 6)

In [34]:
duplicateDFRow = duplicateDFRow.sort_values(['Latitude', 'Longitude'])
duplicateDFRow.head(duplicateDFRow.shape[0])

Unnamed: 0.1,Unnamed: 0,Neighborhood,Borough,Cluster Labels,Latitude,Longitude
11,11,Bahşayış,Arnavutköy,0,0.0,0.0
18,18,Durusu Cami,Arnavutköy,0,0.0,0.0
19,19,Durusu Zafer,Arnavutköy,0,0.0,0.0
21,21,İstasyon,Arnavutköy,0,0.0,0.0
32,32,Taşoluk Çilingir,Arnavutköy,0,0.0,0.0
216,216,Kemankeş Kara Mustafa Paşa,Beyoğlu,0,0.0,0.0
252,252,Muratbey,Büyükçekmece,0,0.0,0.0
303,303,Çakmaklı,Esenyurt,0,0.0,0.0
421,421,Mehmet Nezih Özmen,Güngören,0,0.0,0.0
615,615,Kavaklı Hürriyet,Silivri,0,0.0,0.0


In [35]:
df.drop(df[(df['Latitude'] == 0.0) & (df['Longitude'] == 0.0)].index, inplace=True)
df.shape

(772, 6)

In [36]:
df.drop(df[df.duplicated(subset=['Latitude', 'Longitude'], keep='first')].index, inplace=True)
df.shape

(751, 6)
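
The two `drop` calls above do different things: the first removes rows that were never geocoded (still at `0.0, 0.0`), the second keeps the first row for every repeated coordinate pair and removes the later ones. A plain-Python sketch of the same logic, using a small hypothetical sample (the repeated coordinates below are made up for illustration):

```python
def split_rows(rows):
    """Split (name, lat, lon) rows into failed, kept and dropped-as-duplicate lists."""
    failed, kept, dropped = [], [], []
    seen = set()
    for row in rows:
        key = (row[1], row[2])
        if key == (0.0, 0.0):
            failed.append(row)      # geocoding never filled in a coordinate
        elif key in seen:
            dropped.append(row)     # same coordinates as an earlier row (keep='first')
        else:
            seen.add(key)
            kept.append(row)
    return failed, kept, dropped


sample_rows = [
    ("Bahşayış", 0.0, 0.0),
    ("İstasyon", 0.0, 0.0),
    ("Burgazada", 40.882124, 29.064212),
    ("Mimarsinan (Esenler)", 41.05, 28.88),  # hypothetical coordinates
    ("Mimarsinan (Eyüp)", 41.05, 28.88),     # same pair, so treated as a repeat
]

failed, kept, dropped = split_rows(sample_rows)
print(len(failed), len(kept), len(dropped))
```

Keeping the three groups separate makes it easy to inspect what was discarded before committing to the drop.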

<h4>Final State of the Neighborhood Data Set</h4>
<br>

In [37]:
print('There are {} neighborhoods in the data.'.format(len(df['Neighborhood'].unique())))
print('There are {} boroughs in the data.'.format(len(df['Borough'].unique())))

There are 596 neighborhoods in the data.
There are 39 boroughs in the data.


In [38]:
df = df[['Neighborhood','Borough','Cluster Labels','Latitude','Longitude']]
df.to_csv("coordinates_df.csv")
df.head()

Unnamed: 0,Neighborhood,Borough,Cluster Labels,Latitude,Longitude
0,Burgazada,Adalar,0,40.882124,29.064212
1,Heybeliada,Adalar,0,40.876259,29.091027
2,Kınalıada,Adalar,0,40.908452,29.04842
3,Maden,Adalar,0,40.872361,29.130448
4,Nizam,Adalar,0,40.857676,29.118957


In [39]:
df = pd.read_csv("coordinates_df.csv")

<h3>Gathering Venue Info</h3>
<br>
Using Foursquare Places API

In [40]:
CLIENT_ID     = 'Nope'
CLIENT_SECRET = 'Nope'
VERSION       = '20180605'
radius        = 500
LIMIT         = 100

In [41]:
# From Capstone Project Labs
# for more info : https://www.coursera.org/learn/applied-data-science-capstone?specialization=ibm-data-science
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        if "," in name:
            name = name.split(",")[0]
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
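
The function above builds the request URL by string formatting and indexes straight into the JSON response, so one venue without a category or one empty response raises an exception. A hedged sketch of a more defensive variant follows. The helper names `build_explore_url` and `extract_venues` and the `fake_payload` are hypothetical; only the endpoint and the response keys already used in the notebook are assumed.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lng, client_id, client_secret, version, radius=500, limit=100):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, skipping venues without a category."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        if not categories:
            continue
        location = venue.get("location", {})
        rows.append((venue.get("name"), location.get("lat"),
                     location.get("lng"), categories[0]["name"]))
    return rows


# Hypothetical response with one complete venue and one venue lacking a category.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Burgazada Sahil",
               "location": {"lat": 40.881171, "lng": 29.0696},
               "categories": [{"name": "Beach"}]}},
    {"venue": {"name": "No Category Place",
               "location": {"lat": 40.88, "lng": 29.06},
               "categories": []}},
]}]}}

url = build_explore_url(40.882124, 29.064212, "ID", "SECRET", "20180605")
venues = extract_venues(fake_payload)
empty = extract_venues({})
print(url)
print(venues, empty)
```

Separating URL building from response parsing also lets the parsing be checked against saved responses without calling the API.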

In [66]:
istanbul_venues = getNearbyVenues(   names=df['Neighborhood'],
                                    latitudes=df['Latitude'],
                                    longitudes=df['Longitude']
                                )

Burgazada
Heybeliada
Kınalıada
Maden
Nizam
Anadolu
Arnavutköy İmrahor
Arnavutköy İslambey
Arnavutköy Merkez
Arnavutköy Yavuzselim
Atatürk
Boğazköy Atatürk
Bolluca
Deliklikaya
Dursunköy
Hastane
Sazlıbosna
Nakkaş
Karlıbayır
Haraççı
Hicret
Mavigöl
Nenehatun
Ömerli
Taşoluk
Taşoluk Adnan Menderes
Yeşilbayır
Aşıkveysel
Atatürk
Barbaros
Esatpaşa
Ferhatpaşa
Fetih
İçerenköy
İnönü
Kayışdağı
Küçükbakkalköy
Mevlana
Mimarsinan
Mustafa Kemal
Örnek
Yeniçamlıca
Yenişehir
Yenisahra
Ambarlı
Cihangir
Denizköşkler
Firuzköy
Gümüşpala
Merkez
Mustafa Kemal Paşa
Tahtakale
Üniversite
Yeşilkent
Bağlar
Barbaros
Çınar
Demirkapı
Evren
Fatih
Fevzi Çakmak
Göztepe
Güneşli
Hürriyet
İnönü
Kâzım Karabekir
Kemalpaşa
Kirazlı
Mahmutbey
Merkez
Sancaktepe
Yavuzselim
Yenigün
Yenimahalle
Yıldıztepe
Yüzyıl
Bahçelievler
Cumhuriyet
Çobançeşme
Fevzi Çakmak
Hürriyet
Kocasinan
Siyavuşpaşa
Soğanlı
Şirinevler
Yenibosna
Zafer
Ataköy 1. kısım
Basınköy
Cevizlik
Kartaltepe
Osmaniye
Sakızağacı
Şenlikköy
Yenimahalle
Yeşilköy
Yeşilyurt
Zeyti

In [67]:
istanbul_venues.to_csv("venues_df.csv")

In [42]:
istanbul_venues = pd.read_csv("venues_df.csv")

In [43]:
istanbul_venues.shape

(28288, 8)

In [44]:
istanbul_venues.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,Burgazada,40.882124,29.064212,Sait Faik Abasıyanık Müzesi,40.881015,29.067458,History Museum
1,1,Burgazada,40.882124,29.064212,Adalar Cemevi Çay Bahçesi,40.879195,29.068156,Tea Room
2,2,Burgazada,40.882124,29.064212,Burgazada Sahil,40.881171,29.0696,Beach
3,3,Burgazada,40.882124,29.064212,Sinem Dondurma,40.880984,29.069779,Ice Cream Shop
4,4,Burgazada,40.882124,29.064212,Burgazada Meydan,40.881099,29.06957,Plaza


In [45]:
print('There are {} unique categories.'.format(len(istanbul_venues['Venue Category'].unique())))

There are 504 unique categories.


In [46]:
print('Values returned for {} neighborhoods.'.format(len(istanbul_venues['Neighborhood'].unique())))

Values returned for 589 neighborhoods.


In [47]:
print('There were {} neighborhoods in the data.'.format(len(df['Neighborhood'].unique())))

There were 596 neighborhoods in the data.


Foursquare returned venue data for 589 of the 596 neighborhood names, so 7 neighborhoods have no venue data.
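
To see which names are missing rather than only how many, the two sets of names can be compared directly. A minimal sketch with a small hypothetical sample (in the notebook the sets would come from `df['Neighborhood']` and `istanbul_venues['Neighborhood']`):

```python
# Hypothetical subsets of the two name columns.
geocoded_names = {"Burgazada", "Maden", "Habibler", "Semizkumlar"}
names_with_venues = {"Burgazada", "Maden"}

# Names that were geocoded but came back without any venue.
missing_names = sorted(geocoded_names - names_with_venues)
print(missing_names)
```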

In [48]:
# one hot encoding
istanbul_onehot = pd.get_dummies(istanbul_venues[['Venue Category']], prefix="", prefix_sep="")

In [49]:
istanbul_onehot.head()

Unnamed: 0,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,...,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yemeni Restaurant,Yoga Studio,Zoo,Zoo Exhibit,Çöp Şiş Place
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
istanbul_onehot['Neighborhood'][0:5]

0    0
1    0
2    0
3    0
4    0
Name: Neighborhood, dtype: uint8

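One of the Foursquare venue categories is itself named "Neighborhood", so the one-hot table already has a column with that name. Let's drop it before adding the real neighborhood names.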

In [51]:
istanbul_onehot.drop(columns=['Neighborhood'], inplace=True)

In [52]:
istanbul_onehot['Neighborhood'][0:5]

KeyError: 'Neighborhood'

'Neighborhood' column in istanbul_onehot dataframe is dropped.

In [53]:
# add neighborhood column back to dataframe
istanbul_onehot['Neighborhood'] = istanbul_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [istanbul_onehot.columns[-1]] + list(istanbul_onehot.columns[:-1])
istanbul_onehot = istanbul_onehot[fixed_columns]

istanbul_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,...,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yemeni Restaurant,Yoga Studio,Zoo,Zoo Exhibit,Çöp Şiş Place
0,Burgazada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Burgazada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Burgazada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Burgazada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Burgazada,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's get the frequency of occurrence of each venue category.
(Mean of the one-hot encoded table grouped by neighborhood = number of venues of that category / total number of venues in that neighborhood)

In [54]:
istanbul_grouped = istanbul_onehot.groupby('Neighborhood').mean().reset_index()
istanbul_grouped.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,...,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yemeni Restaurant,Yoga Studio,Zoo,Zoo Exhibit,Çöp Şiş Place
0,19 Mayıs,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.007246,0.0,0.0,0.0
1,50. Yıl,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,75. Yıl,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0
3,Abbasağa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Abdurrahman Nafiz Gürman,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0
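
The mean of a one-hot column within a neighborhood equals the share of that neighborhood's venues that fall in the category. A small worked example in plain Python makes the arithmetic explicit; the `sample_venues` list is hypothetical and far smaller than the real data.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, category) pairs.
sample_venues = [
    ("Burgazada", "Tea Room"),
    ("Burgazada", "Beach"),
    ("Burgazada", "Tea Room"),
    ("Maden", "Hotel"),
]

# Count categories per neighborhood.
counts = defaultdict(Counter)
for neighborhood, category in sample_venues:
    counts[neighborhood][category] += 1

# Divide each count by the neighborhood's total number of venues.
frequencies = {
    neighborhood: {category: n / sum(counter.values()) for category, n in counter.items()}
    for neighborhood, counter in counts.items()
}
print(frequencies)
```

Here "Tea Room" accounts for two of Burgazada's three venues, so its frequency is 2/3, which is what the grouped mean of the one-hot column would also give.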


In [55]:
istanbul_grouped.shape

(589, 504)

In [56]:
# From Capstone Project Labs
# for more info : https://www.coursera.org/learn/applied-data-science-capstone?specialization=ibm-data-science
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [57]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = istanbul_grouped['Neighborhood']

for ind in np.arange(istanbul_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(istanbul_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,19 Mayıs,Café,Coffee Shop,Clothing Store,Gym / Fitness Center,Dessert Shop,Turkish Restaurant,Theater,Cosmetics Shop,Bakery,Sporting Goods Shop
1,50. Yıl,Café,Breakfast Spot,Department Store,Turkish Restaurant,Electronics Store,Steakhouse,Bookstore,Go Kart Track,BBQ Joint,Tram Station
2,75. Yıl,Café,Restaurant,Food Court,Ski Chalet,Gym / Fitness Center,Gym,Electronics Store,Shopping Mall,Park,Bar
3,Abbasağa,Coffee Shop,Turkish Restaurant,Café,Soccer Stadium,Gym / Fitness Center,Food,Art Studio,Historic Site,Lounge,Health & Beauty Service
4,Abdurrahman Nafiz Gürman,Café,Clothing Store,Electronics Store,Turkish Restaurant,Cosmetics Shop,Mobile Phone Shop,Breakfast Spot,Restaurant,Bagel Shop,Gym / Fitness Center
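
The column names above get their "st", "nd", "rd" suffixes from a list lookup that falls back to "th" when the index is out of range, which is correct only for 1 to 20. As a hedged alternative, a small helper (the name `ordinal` is hypothetical) handles any positive integer:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


column_names = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(column_names)
```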


#### Clustering Motive

K-means clustering is run on each neighborhood's venue-category frequencies and splits the neighborhoods into 5 groups.
Each group is then interpreted through the 10 most frequent venue categories of its neighborhoods.
Neighborhoods therefore end up in the same group when the same kinds of venues dominate them.

In [58]:
# set number of clusters
kclusters = 5

istanbul_grouped_clustering = istanbul_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(istanbul_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 2, 2, 1, 1, 1, 1, 1, 1])
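
To illustrate conceptually what the clustering step does, here is a minimal pure-Python sketch of the basic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not scikit-learn's implementation: the function `simple_kmeans`, the naive "first k points" initialization, and the two-dimensional `sample_vectors` are all assumptions made for the example.

```python
def simple_kmeans(points, k, iterations=20):
    """Basic k-means on a list of equal-length numeric tuples."""
    # Naive initialization: use the first k points as starting centroids.
    centroids = [list(point) for point in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centroids


# Hypothetical frequency vectors: (share of cafés, share of hotels) per neighborhood.
sample_vectors = [(0.8, 0.2), (0.1, 0.9), (0.7, 0.3), (0.2, 0.8)]
sample_labels, sample_centroids = simple_kmeans(sample_vectors, k=2)
print(sample_labels, sample_centroids)
```

Neighborhoods with similar category shares receive the same label, which is the property the notebook relies on when it reads the clusters through their most common venues.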

In [59]:
df.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood,Borough,Cluster Labels,Latitude,Longitude
0,0,Burgazada,Adalar,0,40.882124,29.064212
1,1,Heybeliada,Adalar,0,40.876259,29.091027
2,2,Kınalıada,Adalar,0,40.908452,29.04842
3,3,Maden,Adalar,0,40.872361,29.130448
4,4,Nizam,Adalar,0,40.857676,29.118957


In [60]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,19 Mayıs,Café,Coffee Shop,Clothing Store,Gym / Fitness Center,Dessert Shop,Turkish Restaurant,Theater,Cosmetics Shop,Bakery,Sporting Goods Shop
1,50. Yıl,Café,Breakfast Spot,Department Store,Turkish Restaurant,Electronics Store,Steakhouse,Bookstore,Go Kart Track,BBQ Joint,Tram Station
2,75. Yıl,Café,Restaurant,Food Court,Ski Chalet,Gym / Fitness Center,Gym,Electronics Store,Shopping Mall,Park,Bar
3,Abbasağa,Coffee Shop,Turkish Restaurant,Café,Soccer Stadium,Gym / Fitness Center,Food,Art Studio,Historic Site,Lounge,Health & Beauty Service
4,Abdurrahman Nafiz Gürman,Café,Clothing Store,Electronics Store,Turkish Restaurant,Cosmetics Shop,Mobile Phone Shop,Breakfast Spot,Restaurant,Bagel Shop,Gym / Fitness Center


In [71]:
# add clustering labels
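# drop labels left over from an earlier run of this cell; errors='ignore' keeps the first run from failing
neighborhoods_venues_sorted.drop(columns=['Cluster Labels'], inplace=True, errors='ignore')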
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_new = pd.DataFrame(columns=['Neighborhood', 'Borough', 'Cluster Labels', 'Latitude', 'Longitude'])
for label, row in df.iterrows():
    try:
        new_row = {
                    'Neighborhood'  : row['Neighborhood'],
                    'Borough'       : row['Borough'],
                    'Cluster Labels': neighborhoods_venues_sorted.loc[neighborhoods_venues_sorted['Neighborhood'] == row['Neighborhood'], 'Cluster Labels'].item(),
                    'Latitude'      : row['Latitude'],
                    'Longitude'     : row['Longitude']
                  }
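        # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
        df_new = pd.concat([df_new, pd.DataFrame([new_row])], ignore_index=True)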
    except ValueError :
        print('Foursquare data did not return any venues for : ', row['Neighborhood'], '. So, no cluster assigned...')    
    
df_new.head()

Foursquare data did not return any venues for :  Arnavutköy Merkez . So, no cluster assigned...
Foursquare data did not return any venues for :  Binkılıç . So, no cluster assigned...
Foursquare data did not return any venues for :  Çiftlikköy . So, no cluster assigned...
Foursquare data did not return any venues for :  Mithatpaşa . So, no cluster assigned...
Foursquare data did not return any venues for :  Bahçeköy Kemer . So, no cluster assigned...
Foursquare data did not return any venues for :  Semizkumlar . So, no cluster assigned...
Foursquare data did not return any venues for :  Habibler . So, no cluster assigned...


Unnamed: 0,Neighborhood,Borough,Cluster Labels,Latitude,Longitude
0,Burgazada,Adalar,1,40.882124,29.064212
1,Heybeliada,Adalar,1,40.876259,29.091027
2,Kınalıada,Adalar,2,40.908452,29.04842
3,Maden,Adalar,3,40.872361,29.130448
4,Nizam,Adalar,3,40.857676,29.118957


In [72]:
df_new.shape

(744, 5)
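
The label lookup above matches on the neighborhood name alone, while the data set has 751 rows but only 596 distinct names, so some names occur in more than one borough and those rows all receive the same label. A short sketch for listing such shared names is given below; the sample rows are a hypothetical subset, and keying on the `(Neighborhood, Borough)` pair instead is only a suggestion, not what the notebook does.

```python
from collections import Counter

# Hypothetical subset of (neighborhood, borough) rows.
sample_rows = [
    ("Maden", "Adalar"),
    ("Maden", "Sarıyer"),
    ("Yenişehir", "Ataşehir"),
    ("Yenişehir", "Beyoğlu"),
    ("Yenişehir", "Pendik"),
    ("Nizam", "Adalar"),
]

# Count in how many boroughs each name appears and keep the repeated ones.
name_counts = Counter(name for name, _ in sample_rows)
shared_names = {name: n for name, n in name_counts.items() if n > 1}
print(shared_names)
```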

In [73]:
address = 'İstanbul'

geolocator = Nominatim(user_agent="istanbul_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of İstanbul are {}, {}.'.format(latitude, longitude))

The geographical coordinates of İstanbul are 41.0096334, 28.9651646.


In [74]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_new['Latitude'], df_new['Longitude'], df_new['Neighborhood'], df_new['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
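
The marker colors above come from matplotlib's `rainbow` colormap sampled at `kclusters` points. As an assumption-labelled alternative that needs only the standard library, evenly spaced hues can be converted to hex strings with `colorsys`. The helper `cluster_palette` is hypothetical and produces different colors from the ones used in the map.

```python
import colorsys


def cluster_palette(k, saturation=0.8, value=0.9):
    """Return k hex colors with evenly spaced hues."""
    palette = []
    for i in range(k):
        red, green, blue = colorsys.hsv_to_rgb(i / k, saturation, value)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(red * 255), round(green * 255), round(blue * 255)))
    return palette


palette = cluster_palette(5)
print(palette)
```

Indexing such a list directly with the cluster label (`palette[cluster]`) gives each of the labels 0 to 4 its own color.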

GitHub does not render folium maps, so two static screenshots of the map are included instead:

![Folium Map](https://raw.githubusercontent.com/nech21/Coursera_Capstone/main/Week%205/folium3_close_shot.jpg "Folium Map of Clusters")

![Folium Map](https://raw.githubusercontent.com/nech21/Coursera_Capstone/main/Week%205/folium.PNG "Folium Map of Clusters")

In [78]:
s1 = pd.merge(df_new, neighborhoods_venues_sorted, how='inner', on=['Neighborhood', 'Cluster Labels'])
s1.drop(columns=['Latitude', 'Longitude'], inplace=True)
s1.head()

Unnamed: 0,Neighborhood,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Burgazada,Adalar,1,Seafood Restaurant,Café,Boat or Ferry,Hotel,Tea Room,Cafeteria,Mediterranean Restaurant,Boarding House,History Museum,Beach
1,Heybeliada,Adalar,1,Café,Surf Spot,Mountain,Museum,Beach,University,Scenic Lookout,Tennis Court,Bed & Breakfast,Hotel
2,Kınalıada,Adalar,2,Beach,Pool,Boat or Ferry,Beach Bar,Harbor / Marina,Forest,Café,Athletics & Sports,Church,Trail
3,Maden,Adalar,3,Hotel,Café,Breakfast Spot,Restaurant,Seafood Restaurant,Ice Cream Shop,Soccer Stadium,Gym,Grocery Store,Pier
4,Maden,Sarıyer,3,Hotel,Café,Breakfast Spot,Restaurant,Seafood Restaurant,Ice Cream Shop,Soccer Stadium,Gym,Grocery Store,Pier


In [82]:
cluster3 = s1.loc[s1['Cluster Labels'] == 3, :]
cluster3.head(50)

Unnamed: 0,Neighborhood,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Maden,Adalar,3,Hotel,Café,Breakfast Spot,Restaurant,Seafood Restaurant,Ice Cream Shop,Soccer Stadium,Gym,Grocery Store,Pier
4,Maden,Sarıyer,3,Hotel,Café,Breakfast Spot,Restaurant,Seafood Restaurant,Ice Cream Shop,Soccer Stadium,Gym,Grocery Store,Pier
5,Nizam,Adalar,3,Hotel,Café,Bed & Breakfast,Tea Room,Campground,Harbor / Marina,Mountain,Fountain,Restaurant,Forest
67,Yenişehir,Ataşehir,3,Café,Hotel,Turkish Restaurant,Restaurant,Convenience Store,Gym / Fitness Center,Steakhouse,Pool,Park,Doner Restaurant
68,Yenişehir,Beyoğlu,3,Café,Hotel,Turkish Restaurant,Restaurant,Convenience Store,Gym / Fitness Center,Steakhouse,Pool,Park,Doner Restaurant
69,Yenişehir,Pendik,3,Café,Hotel,Turkish Restaurant,Restaurant,Convenience Store,Gym / Fitness Center,Steakhouse,Pool,Park,Doner Restaurant
72,Cihangir,Avcılar,3,Hotel,Coffee Shop,Café,Dessert Shop,Pizza Place,Bakery,Bar,Restaurant,Kebab Restaurant,Seafood Restaurant
73,Cihangir,Beyoğlu,3,Hotel,Coffee Shop,Café,Dessert Shop,Pizza Place,Bakery,Bar,Restaurant,Kebab Restaurant,Seafood Restaurant
97,Evren,Bağcılar,3,Restaurant,Turkish Restaurant,Hotel,Coffee Shop,Department Store,Café,Motorcycle Shop,Breakfast Spot,Kebab Restaurant,Women's Store
130,Kemalpaşa,Bağcılar,3,Café,Hotel,Turkish Restaurant,Steakhouse,Mobile Phone Shop,Kebab Restaurant,Clothing Store,Electronics Store,Restaurant,Nightclub
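
One way to summarize a cluster is to count how often each category appears among the top-10 lists of its neighborhoods. The sketch below does this for three rows copied from the table above; the variable names are hypothetical, and a full summary would use every row of the cluster.

```python
from collections import Counter

# Top-10 venue categories for three cluster 3 neighborhoods, copied from the table.
top_venues = {
    "Maden": ["Hotel", "Café", "Breakfast Spot", "Restaurant", "Seafood Restaurant",
              "Ice Cream Shop", "Soccer Stadium", "Gym", "Grocery Store", "Pier"],
    "Nizam": ["Hotel", "Café", "Bed & Breakfast", "Tea Room", "Campground",
              "Harbor / Marina", "Mountain", "Fountain", "Restaurant", "Forest"],
    "Cihangir": ["Hotel", "Coffee Shop", "Café", "Dessert Shop", "Pizza Place",
                 "Bakery", "Bar", "Restaurant", "Kebab Restaurant", "Seafood Restaurant"],
}

# Count in how many neighborhoods each category reaches the top 10.
category_counts = Counter(
    category for categories in top_venues.values() for category in categories)
print(category_counts.most_common(4))
```

In this sample "Hotel", "Café" and "Restaurant" appear in all three lists, which matches the reading of cluster 3 as a hotel and restaurant cluster.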


## Conclusion

As the table above shows, the most common venue categories in cluster 3 neighborhoods are "Hotel", "Café", "Breakfast Spot" and "Restaurant", together with tourist attractions such as "Art Gallery", historical "Mosque"s and "History Museum". On the map, the cluster 3 neighborhoods (green markers) concentrate on and around the Historical Peninsula, where the Blue Mosque, the Hagia Sophia Mosque, Topkapı Palace and many of Istanbul's other tourist attractions are located. Cluster 3 therefore marks the most touristic parts of the city, and these are the neighborhoods suggested for opening a restaurant.

In conclusion, cluster 3 is the recommended cluster for opening a restaurant: besides at least 5 different types of restaurants and take-out places, its neighborhoods have hotels and tourist attractions. Because these neighborhoods are highly touristic, they are likely to see more foot traffic than most other parts of Istanbul. Thus, cluster 3 is suggested for opening a restaurant.