# IBM Applied Data Science Capstone Course by Coursera

## Final report

Opening a café in Cracow, Poland
1. Build a dataframe of the neighborhoods (districts) of Cracow
2. Get the geographical coordinates of the neighborhoods
3. Get venue data from the Foursquare API
4. Cluster the neighborhoods
5. Draw conclusions from the clusters

## 1. Import libraries

In [1]:
import numpy as np 
import pandas as pd 
import json
from geopy.geocoders import Nominatim 
import geocoder 
import requests
from bs4 import BeautifulSoup
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium 


## 2. Scrape data from Wikipedia

In [2]:
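# download the category page that lists the districts of Kraków
html = requests.get("https://en.wikipedia.org/wiki/Category:Districts_of_Krak%C3%B3w").text
soup = BeautifulSoup(html, 'html.parser')
neighborhoodList = []
# every <li> inside the first "mw-category" block is one entry of the category listing
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text)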

In [3]:
df = pd.DataFrame({"Neighborhood": neighborhoodList})
df.head()

Unnamed: 0,Neighborhood
0,Districts of Kraków
1,"Rakowice, Kraków"
2,Wola Justowska
3,"Bielany, Kraków"
4,Bieńczyce (Kraków)


In [4]:
df.shape

(31, 1)
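
The scraped list contains entries that are not districts (row 0 above is "Districts of Kraków", and "Template:Kraków districts" shows up in later outputs), and some names carry disambiguation suffixes such as ", Kraków" or " (Kraków)". The sketch below shows one possible cleanup step. `clean_neighborhoods` and `NON_DISTRICT_PREFIXES` are hypothetical names that are not part of the original notebook, and the prefix list is an assumption based only on the rows visible in this report.

```python
import re

# Hypothetical helper; the prefixes are an assumption based on the rows seen in the outputs.
NON_DISTRICT_PREFIXES = ("Districts of", "Template:")

def clean_neighborhoods(names):
    """Drop non-district entries and strip ', Kraków' / ' (Kraków)' suffixes."""
    cleaned = []
    for name in names:
        if name.startswith(NON_DISTRICT_PREFIXES):
            continue
        # remove a trailing ", Kraków" or " (Kraków)" disambiguation suffix
        name = re.sub(r"(,\s*Kraków|\s*\(Kraków\))$", "", name).strip()
        if name not in cleaned:
            cleaned.append(name)
    return cleaned

sample = ["Districts of Kraków", "Rakowice, Kraków", "Wola Justowska",
          "Bielany, Kraków", "Bieńczyce (Kraków)"]
print(clean_neighborhoods(sample))
```

Whether to apply such a step before geocoding is a judgment call: shorter names may geocode differently once ", Kraków, Poland" is appended in the next section.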

## 3. Getting geographical coordinates

In [5]:
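def get_latlng(neighborhood, max_retries=5):
    """Geocode a neighborhood with ArcGIS; return [lat, lng], or [None, None] if every attempt fails."""
    # retry a bounded number of times so a failing lookup cannot loop forever
    for _ in range(max_retries):
        g = geocoder.arcgis('{}, Kraków, Poland'.format(neighborhood))
        if g.latlng is not None:
            return g.latlng
    return [None, None]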

In [6]:
geo = [get_latlng(neighborhood) for neighborhood in df["Neighborhood"].tolist()]

In [7]:
df_geo = pd.DataFrame(geo, columns=['Latitude', 'Longitude'])
df['Latitude'] = df_geo['Latitude']
df['Longitude'] = df_geo['Longitude']

In [8]:
df.shape

(31, 3)

In [9]:
df

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Districts of Kraków,50.06045,19.93242
1,"Rakowice, Kraków",50.06045,19.93242
2,Wola Justowska,50.06692,19.86192
3,"Bielany, Kraków",50.034933,19.814603
4,Bieńczyce (Kraków),50.08725,20.02729
5,Bieżanów-Prokocim,50.01802,20.02803
6,Bronowice (Kraków),50.0821,19.88076
7,Bronowice Małe,50.0821,19.88076
8,Czyżyny,50.06494,20.01019
9,Dębniki (Kraków),50.01593,19.87315
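
Several rows in the output above share exactly the same coordinates (rows 0 and 1, and rows 6 and 7), which may mean the geocoder fell back to a coarser match for those names. A minimal sketch for flagging such rows, using values copied from the table above; the `coords` and `shared` names are illustrative only.

```python
import pandas as pd

# rows copied from the dataframe output above
coords = pd.DataFrame({
    "Neighborhood": ["Districts of Kraków", "Rakowice, Kraków", "Wola Justowska",
                     "Bronowice (Kraków)", "Bronowice Małe"],
    "Latitude": [50.06045, 50.06045, 50.06692, 50.0821, 50.0821],
    "Longitude": [19.93242, 19.93242, 19.86192, 19.88076, 19.88076],
})

# keep=False marks every member of a duplicated coordinate pair, not only the repeats
shared = coords[coords.duplicated(subset=["Latitude", "Longitude"], keep=False)]
print(shared["Neighborhood"].tolist())
```

Neighborhoods flagged this way will receive identical Foursquare results later on, so they are worth checking by hand before clustering.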


## 4. Create a map of Cracow

In [10]:
address = 'Kraków, Poland'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Kraków are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Kraków are 50.0469432, 19.997153435836697.


In [11]:
# create a map of Kraków using latitude and longitude values
map_Cracow = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Cracow)  
    
map_Cracow

## 5. Use the Foursquare API

In [12]:
import os

# read the credentials from environment variables instead of hard-coding or printing them
# (the variable names below are placeholders)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


#### Now, let's get the top 100 venues that are within a radius of 1000 meters.

In [13]:
limit = 100
radius = 1000

venues = []

for lat, long, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        limit)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return information from nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
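
The loop above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category, and it assumes every response contains at least one group. The sketch below isolates the parsing step into a function that tolerates both cases. `parse_explore_items` is a hypothetical helper, and `sample_payload` is hand-written in the shape the cell above indexes into; it is not real API output.

```python
def parse_explore_items(payload, neighborhood, lat, lng):
    """Flatten an explore response into
    (neighborhood, lat, lng, name, venue_lat, venue_lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # venues without a category are kept and labeled None instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# hand-written example payload (assumption), not real API output
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 50.0, "lng": 19.9},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 50.1, "lng": 20.0},
               "categories": []}},
]}]}}

rows = parse_explore_items(sample_payload, "Example", 50.06, 19.93)
print(rows)
```

Inside the loop, `venues.extend(parse_explore_items(requests.get(url).json(), neighborhood, lat, long))` would take the place of the inner `for venue in results` block.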

Convert the results to a dataframe

In [14]:
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']
venues_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Districts of Kraków,50.06045,19.93242,Mercy Brown,50.061475,19.931788,Lounge
1,Districts of Kraków,50.06045,19.93242,Massolit Bakery & Cafe,50.058975,19.930351,Bakery
2,Districts of Kraków,50.06045,19.93242,Radisson Blu Hotel Kraków,50.058299,19.933145,Hotel
3,Districts of Kraków,50.06045,19.93242,Kino Pod Baranami,50.061603,19.935257,Indie Movie Theater
4,Districts of Kraków,50.06045,19.93242,Rynek Główny,50.06147,19.936192,Plaza


In [15]:
venues_df.shape

(1187, 7)

Number of venues returned for each neighborhood

In [16]:
venues_df.groupby(["Neighborhood"]).count()

Unnamed: 0_level_0,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bielany, Kraków",4,4,4,4,4,4
Bieńczyce (Kraków),19,19,19,19,19,19
Bieżanów-Prokocim,5,5,5,5,5,5
Bronowice (Kraków),20,20,20,20,20,20
Bronowice Małe,20,20,20,20,20,20
Czyżyny,26,26,26,26,26,26
Districts of Kraków,100,100,100,100,100,100
Dzielnica I Stare Miasto,100,100,100,100,100,100
Dębniki (Kraków),3,3,3,3,3,3
Grzegórzki (Kraków),55,55,55,55,55,55


In [17]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 155 unique categories.


In [18]:
venues_df['VenueCategory'].unique()[:50]

array(['Lounge', 'Bakery', 'Hotel', 'Indie Movie Theater', 'Plaza',
       'Church', 'Café', 'Bookstore', 'Pizza Place', 'Concert Hall',
       'Beer Bar', 'Wine Bar', 'Italian Restaurant', 'Salon / Barbershop',
       'Park', 'Pub', 'Art Gallery', 'Cupcake Shop', 'French Restaurant',
       'Polish Restaurant', 'Juice Bar', 'Steakhouse', 'Coffee Shop',
       'Historic Site', 'Vegetarian / Vegan Restaurant',
       'Bed & Breakfast', 'Museum', 'IT Services', 'Tattoo Parlor',
       'Seafood Restaurant', 'Cocktail Bar', 'Nightclub', 'Tea Room',
       'Burger Joint', 'Theater', 'Hotel Bar', 'Castle', 'Ice Cream Shop',
       'Bar', 'Arts & Crafts Store', 'Sandwich Place', 'Diner',
       'Art Museum', 'Hostel', 'Sushi Restaurant', 'Jewelry Store',
       'Restaurant', 'Fast Food Restaurant', 'Spa', 'Dessert Shop'],
      dtype=object)

## 6. Analysis of the neighborhoods

In [19]:
df_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

df_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

df_onehot.head()

Unnamed: 0,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,...,Toy / Game Store,Train Station,Tram Station,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Water Park,Wine Bar,Neighborhoods
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Districts of Kraków
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,Districts of Kraków
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Districts of Kraków
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Districts of Kraków
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Districts of Kraków


In [20]:
#Shift the last column (Neighborhoods) to the first place
cols = list(df_onehot.columns)
cols = [cols[-1]] + cols[:-1]
df_onehot = df_onehot[cols]

In [21]:
print(df_onehot.shape)
df_onehot.head()

(1187, 156)


Unnamed: 0,Neighborhoods,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,...,Theme Park Ride / Attraction,Toy / Game Store,Train Station,Tram Station,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Water Park,Wine Bar
0,Districts of Kraków,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Districts of Kraków,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Districts of Kraków,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Districts of Kraków,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Districts of Kraków,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's group the rows by neighborhood and take the mean frequency of occurrence of each category

In [22]:
df_onehot_grouped = df_onehot.groupby(["Neighborhoods"]).mean().reset_index()
df_onehot_grouped

Unnamed: 0,Neighborhoods,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,...,Theme Park Ride / Attraction,Toy / Game Store,Train Station,Tram Station,Udon Restaurant,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Water Park,Wine Bar
0,"Bielany, Kraków",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bieńczyce (Kraków),0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,...,0.0,0.0,0.0,0.105263,0.0,0.0,0.0,0.0,0.0,0.0
2,Bieżanów-Prokocim,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bronowice (Kraków),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0
4,Bronowice Małe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0
5,Czyżyny,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0
6,Districts of Kraków,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01
7,Dzielnica I Stare Miasto,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01
8,Dębniki (Kraków),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Grzegórzki (Kraków),0.0,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,...,0.018182,0.0,0.0,0.0,0.018182,0.0,0.0,0.018182,0.0,0.0
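
To make the numbers in this table concrete: each value is the share of a neighborhood's returned venues that fall into a given category, so every row sums to 1. The toy example below reproduces the one-hot and group-mean steps on made-up data; the neighborhoods "A" and "B" and their venues are hypothetical.

```python
import pandas as pd

# made-up venues: neighborhood A has 4 venues, neighborhood B has 2
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Café", "Café", "Park", "Hotel", "Park", "Park"],
})

# one column per category, 1 where the venue belongs to that category
onehot = pd.get_dummies(toy["VenueCategory"], dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# the mean of a 0/1 column within a group is the category's share of that group's venues
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

In neighborhood A, 2 of 4 venues are cafés, so the "Café" value is 0.5; in B both venues are parks, so "Park" is 1.0. Note that a share says nothing about the absolute count: a neighborhood with only 3 returned venues can show a high share from a single venue.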


In [23]:
len(df_onehot_grouped[df_onehot_grouped["Coffee Shop"] > 0])

10

Check in which neighborhoods coffee shops occur by creating a dataframe with only the "Coffee Shop" column

In [24]:
df_cs= df_onehot_grouped[["Neighborhoods","Coffee Shop"]]
df_cs

Unnamed: 0,Neighborhoods,Coffee Shop
0,"Bielany, Kraków",0.0
1,Bieńczyce (Kraków),0.0
2,Bieżanów-Prokocim,0.0
3,Bronowice (Kraków),0.0
4,Bronowice Małe,0.0
5,Czyżyny,0.076923
6,Districts of Kraków,0.02
7,Dzielnica I Stare Miasto,0.04
8,Dębniki (Kraków),0.0
9,Grzegórzki (Kraków),0.054545


## 7. Clustering neighborhoods

Run the k-means algorithm with 3 clusters

In [25]:
df_clustering = df_cs.drop(columns=["Neighborhoods"])


kmeans3 = KMeans(n_clusters= 3, random_state=2021).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans3.labels_[0:10]

array([1, 1, 1, 1, 1, 2, 0, 0, 1, 2], dtype=int32)
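
The label numbers that k-means assigns are arbitrary: cluster 0 is not necessarily the "lowest" group. A minimal sketch of relabeling the clusters by their mean coffee shop frequency, so that 0 means the lowest frequency. It uses only the first ten frequencies and labels shown in the outputs above; `rank_of` and `ClusterRanked` are illustrative names that the notebook does not use.

```python
import pandas as pd

# first ten rows only, copied from the outputs above
coffee = pd.DataFrame({
    "Coffee Shop": [0.0, 0.0, 0.0, 0.0, 0.0, 0.076923, 0.02, 0.04, 0.0, 0.054545],
    "Cluster": [1, 1, 1, 1, 1, 2, 0, 0, 1, 2],
})

# mean frequency per original label, sorted from lowest to highest
means = coffee.groupby("Cluster")["Coffee Shop"].mean().sort_values()

# map each original label to its rank: 0 = lowest mean frequency
rank_of = {int(label): rank for rank, label in enumerate(means.index)}
coffee["ClusterRanked"] = coffee["Cluster"].map(rank_of)
print(rank_of)
```

On these ten rows, original label 1 (all zeros) becomes rank 0, label 0 becomes rank 1, and label 2 becomes rank 2. The rest of this report keeps the original labels.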

Add the cluster labels, latitude and longitude to the dataframe

In [26]:
df_merged = df_cs.copy()

df_merged["Cluster"] = kmeans3.labels_

In [27]:
df_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)

df_merged = df_merged.join(df.set_index("Neighborhood"), on="Neighborhood")

In [28]:
df_merged.sort_values(["Cluster"], inplace=True)
df_merged

Unnamed: 0,Neighborhood,Coffee Shop,Cluster,Latitude,Longitude
22,"Rakowice, Kraków",0.02,0,50.06045,19.93242
25,Tyniec,0.02,0,50.06045,19.93242
24,Template:Kraków districts,0.02,0,50.06045,19.93242
12,Kraków Old Town,0.02,0,50.06045,19.93242
6,Districts of Kraków,0.02,0,50.06045,19.93242
7,Dzielnica I Stare Miasto,0.04,0,50.06673,19.9382
11,Kleparz,0.03125,0,50.07422,19.93618
10,Kazimierz,0.02,0,50.04987,19.94301
19,Prądnik Biały,0.0,1,50.09438,19.91619
20,Prądnik Czerwony,0.0,1,50.08631,19.96766


Visualize our analysis on the map

In [29]:
kclusters = 3
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Neighborhood'], df_merged['Cluster']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Split the neighborhoods by cluster

**For cluster 0**

In [30]:
df_merged.loc[df_merged["Cluster"] == 0]

Unnamed: 0,Neighborhood,Coffee Shop,Cluster,Latitude,Longitude
22,"Rakowice, Kraków",0.02,0,50.06045,19.93242
25,Tyniec,0.02,0,50.06045,19.93242
24,Template:Kraków districts,0.02,0,50.06045,19.93242
12,Kraków Old Town,0.02,0,50.06045,19.93242
6,Districts of Kraków,0.02,0,50.06045,19.93242
7,Dzielnica I Stare Miasto,0.04,0,50.06673,19.9382
11,Kleparz,0.03125,0,50.07422,19.93618
10,Kazimierz,0.02,0,50.04987,19.94301


**For cluster 1**

In [31]:
df_merged.loc[df_merged["Cluster"] == 1]

Unnamed: 0,Neighborhood,Coffee Shop,Cluster,Latitude,Longitude
19,Prądnik Biały,0.0,1,50.09438,19.91619
20,Prądnik Czerwony,0.0,1,50.08631,19.96766
0,"Bielany, Kraków",0.0,1,50.034933,19.814603
18,Podgórze Duchackie,0.0,1,50.01378,19.96582
23,Swoszowice (Kraków),0.0,1,49.98832,19.94146
26,Wola Justowska,0.0,1,50.06692,19.86192
27,Wzgórza Krzesławickie,0.0,1,50.09865,20.07457
21,Płaszów,0.0,1,50.03447,19.97418
17,Podgórze,0.0,1,50.04034,20.00459
14,"Lubocza, Kraków",0.0,1,50.09636,20.08795


**For cluster 2**

In [32]:
df_merged.loc[df_merged["Cluster"] == 2]

Unnamed: 0,Neighborhood,Coffee Shop,Cluster,Latitude,Longitude
9,Grzegórzki (Kraków),0.054545,2,50.06262,19.96167
5,Czyżyny,0.076923,2,50.06494,20.01019
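
The three tables above can be condensed into one summary per cluster. The sketch below uses only the rows visible in those tables (the cluster 1 output may be truncated, so its count is a lower bound); `visible` and `summary` are illustrative names.

```python
import pandas as pd

# "Coffee Shop" values copied from the three cluster tables above, in order
freqs = ([0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.03125, 0.02]  # cluster 0
         + [0.0] * 10                                          # cluster 1 (visible rows)
         + [0.054545, 0.076923])                               # cluster 2
labels = [0] * 8 + [1] * 10 + [2] * 2

visible = pd.DataFrame({"Coffee Shop": freqs, "Cluster": labels})

# number of neighborhoods, mean and maximum coffee shop frequency per cluster
summary = visible.groupby("Cluster")["Coffee Shop"].agg(["count", "mean", "max"])
print(summary)
```

Five of the eight cluster 0 rows share the same coordinates and the same 0.02 value, so that cluster's mean is dominated by one repeated location.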


**Conclusions:**

Based on the clusters, the following conclusions can be drawn:
1. Coffee shops are concentrated in the central area of Kraków. Cluster 2 has the highest share of coffee shops among its venues but covers only 2 districts, while cluster 0 has a moderate share spread over more districts.
2. In cluster 1, coffee shops do not occur at all or occur very rarely.

This suggests that social life in Kraków is concentrated in the city center. The surrounding areas offer a good opportunity to open new coffee shops, because there is little to no competition from other coffee shops. On the other hand, coffee shops in clusters 0 and 2 likely suffer from intense competition due to oversupply.
Based on the analysis above, we recommend opening new coffee shops in the neighborhoods of cluster 1 and avoiding the neighborhoods in cluster 0.
