## Coursera Capstone project

### Problem statement
A car manufacturer is setting up an office in Cologne, Germany, and is recruiting staff from all over the world. The company plans to hire 2,000 people from many countries and cultures, most of whom are completely new to Germany. The new hires are turning to the HR department for guidance on the city and what it has to offer. Because answering each query individually became a huge task, the company decided to hire a data scientist to provide comprehensive data about the city's suburbs. I am the data scientist they hired.

### The source of data
The company had no data source of its own on which to base a solution. It simply stated the problem and gave me a free hand in choosing where to source the data. I proposed using public websites for basic details about Cologne and the Foursquare API for information about places of interest. The company approved this approach.

### Data collection
I could not find a comprehensive, ready-made data set for this purpose. I sourced the postal codes of Cologne from http://zip-code.en.mapawi.com/germany/10/kreisfreie-stadt-koln/2/269/koln/50667/9428/ and used the Nominatim geocoder from the geopy.geocoders package to look up the suburb, district, latitude and longitude for each postal code. The results are stored in a CSV file named KolnDetails.csv, available at https://github.com/lauvshree/Coursera_Capstone/blob/master/KolnDetails.csv. I then used the Foursquare API to find the venues of interest within one kilometre of the city centre. Finally, for each postal code in the list, I collected the places of interest in the corresponding suburb of Cologne. The Foursquare API returns venues together with their categories, which are then stored in a dataframe.
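The lookup step described above can be sketched roughly as follows. This is a minimal sketch, not the script that produced KolnDetails.csv: the helper names `build_postcode_rows` and `rows_to_csv`, the query format, and the address keys `suburb` and `city_district` are assumptions (the address fields Nominatim returns vary by place). The geocoder is passed in as a callable so the logic can be tried without network access, and the stand-in geocoder returns illustrative values only.

```python
import csv
import io
from types import SimpleNamespace

FIELDS = ["Postal Code", "Latitude", "Longitude", "Suburb", "District"]


def build_postcode_rows(postcodes, geocode):
    """Look up each postal code and keep the fields needed for the CSV.

    `geocode` is any callable that takes a query string and returns an object
    with `.latitude`, `.longitude` and a `.raw` dict (as geopy locations do),
    or None when nothing is found.
    """
    rows = []
    for postcode in postcodes:
        location = geocode("{}, Köln, Germany".format(postcode))
        if location is None:
            continue  # skip postal codes the geocoder cannot resolve
        address = location.raw.get("address", {})
        rows.append({
            "Postal Code": postcode,
            "Latitude": location.latitude,
            "Longitude": location.longitude,
            # assumed address keys; fall back to an empty string when absent
            "Suburb": address.get("suburb", ""),
            "District": address.get("city_district", ""),
        })
    return rows


def rows_to_csv(rows):
    """Serialise the collected rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_geocode(query):
    # Stand-in for a real geocoder; the coordinates are illustrative only.
    if query.startswith("50667"):
        return SimpleNamespace(
            latitude=50.94,
            longitude=6.96,
            raw={"address": {"suburb": "Innenstadt", "city_district": "Altstadt-Nord"}},
        )
    return None


rows = build_postcode_rows(["50667", "99999"], fake_geocode)
print(rows_to_csv(rows))
```

With geopy, the callable could be something like `lambda q: geolocator.geocode(q, addressdetails=True)`, where `geolocator` is a `Nominatim` instance. The public Nominatim service asks clients to stay at or below one request per second, so a real run over all postal codes would need some pacing between calls.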

### Import all the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Import and use Nominatim from the open-source geopy.geocoders package to get the location parameters

In [2]:
from geopy.geocoders import Nominatim

address = 'Cologne'

geolocator = Nominatim(user_agent="Exploring_Germany")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Cologne, Germany are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Cologne, Germany are 50.938361, 6.959974.


#### Using the Foursquare API to explore the area

In [3]:
CLIENT_ID = 'SECAGOPU1RPKJSGUPZL4FAS0GTTGZ5AW2KCVR2LFZ4EQP04H' # your Foursquare ID
CLIENT_SECRET = 'U4DTPVJFLMOE1TO32QFQ2IS10UNLBZDCBSEYQIPKL2XNRYK2' # your Foursquare Secret
VERSION = '20200519' # Foursquare API version

In [4]:
LIMIT = 100
radius = 1000
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=SECAGOPU1RPKJSGUPZL4FAS0GTTGZ5AW2KCVR2LFZ4EQP04H&client_secret=U4DTPVJFLMOE1TO32QFQ2IS10UNLBZDCBSEYQIPKL2XNRYK2&v=20200519&ll=50.938361,6.959974&radius=1000&limit=100'
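The same request URL can also be assembled from named parameters, which avoids pasting credentials into the notebook and lets the standard library handle the encoding. The following is a minimal sketch under stated assumptions: the function `build_explore_url` and the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical, and placeholder strings are used when the variables are not set.

```python
import os
from urllib.parse import parse_qs, urlencode, urlsplit

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, radius, limit, client_id, client_secret, version="20200519"):
    """Build the explore URL from named parameters instead of positional format slots."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "{}?{}".format(EXPLORE_ENDPOINT, urlencode(params))


# Hypothetical environment variable names; placeholders are used if they are unset.
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET")

example_url = build_explore_url(50.938361, 6.959974, radius=1000, limit=100,
                                client_id=client_id, client_secret=client_secret)

# Parse the query string back to confirm the parameters survived the encoding.
query = parse_qs(urlsplit(example_url).query)
print(query["ll"], query["radius"], query["limit"])
```

`urlencode` percent-encodes the comma in the `ll` value; parsing the query string back shows that the coordinates are preserved.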

In [5]:
import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

results = requests.get(url).json()


#### Borrowing the get_category_type function from the Foursquare lab

In [6]:
# function that extracts the name of a venue's first category from a row
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened explore results use the 'venue.categories' column instead
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']
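To make the input of this helper concrete, here is a small sketch on hand-written items. The item shape (`venue` → `categories` → list of dicts with a `name`) is inferred from how the notebook indexes the response; the helper `primary_category` and the second sample venue are hypothetical and only serve the illustration.

```python
def primary_category(venue):
    """Return the name of the first category of a venue dict, or None."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else None


# Hand-written items in the assumed shape of the explore response.
sample_items = [
    {"venue": {"name": "Papa Joe's Jazzlokal", "categories": [{"name": "Jazz Club"}]}},
    {"venue": {"name": "Uncategorised place", "categories": []}},
]

names = [primary_category(item["venue"]) for item in sample_items]
print(names)  # ['Jazz Club', None]
```

Returning `None` for an empty category list mirrors the behaviour of `get_category_type`, so such venues end up with a missing value in the `categories` column.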

In [7]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON


In [8]:
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng', 'venue.location.postalCode']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(50)


|   | name | categories | lat | lng | postalCode |
|---|------|------------|-----|-----|------------|
| 0 | Papa Joe's Jazzlokal | Jazz Club | 50.937882 | 6.962241 | 50667.0 |
| 1 | Rheinufer Altstadt | Pedestrian Plaza | 50.938827 | 6.96287 | 50667.0 |
| 2 | Craftbeer Corner | Beer Bar | 50.937222 | 6.958928 | 50667.0 |
| 3 | Rheingarten | Park | 50.938243 | 6.962875 | 50667.0 |
| 4 | Kölner Philharmonie | Concert Hall | 50.940537 | 6.960486 | 50667.0 |
| 5 | El Chango | Steakhouse | 50.936599 | 6.959978 | 50667.0 |
| 6 | Alter Markt | Plaza | 50.938623 | 6.96007 | 50667.0 |
| 7 | Fischmarkt | Plaza | 50.938363 | 6.962527 | 50667.0 |
| 8 | Servus Colonia Alpina | Bavarian Restaurant | 50.937423 | 6.959806 |  |
| 9 | LEGO Store | Toy / Game Store | 50.937042 | 6.95637 | 50667.0 |


In [9]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


### Visualizing the map of Cologne with markers on the places of interest

In [10]:
import folium

map_cologne = folium.Map(location=[latitude, longitude], zoom_start=14)

# add one marker per venue, addressing the coordinates by column name
for _, row in nearby_venues.iterrows():
    folium.CircleMarker(
        [row['lat'], row['lng']],
        radius=5, color="red").add_to(map_cologne)
    
map_cologne

In [11]:
def getNearbyVenues(postcodes, latitudes, longitudes, radius=100):
    """Collect the venues Foursquare returns around the coordinates of each postal code."""
    venues_list = []
    for postcode, lat, lng in zip(postcodes, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        json_resp = requests.get(url).json()

        # skip postal codes whose response contains no venue groups
        if 'groups' not in json_resp["response"]:
            continue
        results = json_resp["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            postcode,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-postcode lists into one dataframe
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues

### Read the KolnDetails file into a dataframe to explore all the suburbs in Cologne

In [12]:
koln_df = pd.read_csv("KolnDetails.csv")

koln_venues = getNearbyVenues(postcodes=koln_df['Postal Code'],
                                   latitudes=koln_df['Latitude'],
                                   longitudes=koln_df['Longitude']
                                  )


In [13]:
koln_venues.columns = ['Postal Code', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']


### Visualize all the suburbs in Cologne around the city centre

In [40]:
import folium
import numpy.random as random
map_cologne = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.CircleMarker(
    [latitude, longitude],
    radius=6, color="red").add_to(map_cologne)

for _, row in koln_df.iterrows():
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5, color="black").add_to(map_cologne)
map_cologne

In [15]:
# one hot encoding
koln_onehot = pd.get_dummies(koln_venues[['Venue Category']], prefix="", prefix_sep="")

koln_onehot.insert(0, "Postal Code", koln_venues['Postal Code'], True) 



In [16]:
koln_grouped = koln_onehot.groupby('Postal Code').sum().reset_index()
koln_grouped

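|   | Postal Code | African Restaurant | Art Museum | Bakery | Bar | Bed & Breakfast | Brazilian Restaurant | Bus Stop | Café | Chinese Restaurant | ... | Pub | Restaurant | Rhenisch Restaurant | Shoe Repair | Steakhouse | Supermarket | Sushi Restaurant | Theater | Tram Station | Water Park |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50667 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 50670 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 |
| 2 | 50672 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 50674 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 50676 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 50677 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 50733 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 50735 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 50765 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 50823 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |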
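The one-hot encoding and grouping steps can be followed on a toy example. This is a minimal sketch with made-up rows (the values are illustrative, not Foursquare results). It also contrasts `sum()`, which the notebook uses and which yields venue counts per postal code, with `mean()`, which would yield each category's share of the venues in a postal code.

```python
import pandas as pd

# Toy venue list: values are illustrative, not taken from the Foursquare results.
toy_venues = pd.DataFrame({
    "Postal Code": [50667, 50667, 50667, 50670],
    "Venue Category": ["Café", "Café", "Bar", "Bar"],
})

# One indicator column per category, then the postal code as first column.
toy_onehot = pd.get_dummies(toy_venues[["Venue Category"]], prefix="", prefix_sep="")
toy_onehot.insert(0, "Postal Code", toy_venues["Postal Code"])

toy_counts = toy_onehot.groupby("Postal Code").sum().reset_index()   # venues per category
toy_shares = toy_onehot.groupby("Postal Code").mean().reset_index()  # share per category

print(toy_counts)
print(toy_shares)
```

Which aggregation is preferable depends on the question: counts keep the information that one area has more venues than another, while shares compare the mix of venues independently of how many were returned.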


In [17]:
koln_df.drop("Unnamed: 0",axis=1,inplace=True)

In [18]:
num_top_venues = 5

for hood in koln_grouped['Postal Code']:
    area = koln_df[koln_df['Postal Code'] == hood]["Suburb"]
    print("----"+area.iloc[0]+"----")
    
    dist_name = koln_df[koln_df['Postal Code'] == hood]["District"]

    dist_name = dist_name.iloc[0]
    print("District - "+dist_name)
    temp = koln_grouped[koln_grouped['Postal Code'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Innenstadt----
District - Altstadt-Nord
                venue  freq
0          Art Museum   1.0
1  African Restaurant   0.0
2                 Pub   0.0
3  Mexican Restaurant   0.0
4           Multiplex   0.0


---- Innenstadt----
District - Neustadt/Nord
                venue  freq
0    Sushi Restaurant   2.0
1    Currywurst Joint   1.0
2  Chinese Restaurant   1.0
3               Hotel   1.0
4           Gastropub   1.0


---- Innenstadt----
District - Neustadt/Nord
                venue  freq
0  African Restaurant   1.0
1                Café   1.0
2  Mexican Restaurant   1.0
3              Bakery   1.0
4        Tram Station   0.0


---- Innenstadt----
District - Neustadt/Süd
                venue  freq
0                Café   1.0
1         Comedy Club   1.0
2  African Restaurant   0.0
3          Restaurant   0.0
4           Multiplex   0.0


---- Innenstadt----
District - Altstadt-Süd
                venue  freq
0          Water Park   1.0
1                 Pub   0.0
2  Mexican Re

In [19]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
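One caveat of ranking all categories is visible in the frequency listing: for postal code 50667 only Art Museum has a non-zero count, so any further ranks are necessarily filled with categories that do not occur there at all. A possible variant, sketched below, reports only categories with a positive count and pads the remaining ranks with `None`. The function name `top_venues_nonzero` and the toy row are hypothetical; this is an illustration, not the method used in the notebook.

```python
import pandas as pd


def top_venues_nonzero(row, num_top_venues):
    """Return the most frequent categories of one row, padded with None.

    Hypothetical variant of return_most_common_venues: categories with a
    count of zero are not reported.
    """
    counts = row.iloc[1:].astype(float)   # skip the 'Postal Code' entry
    counts = counts[counts > 0]           # keep only categories that occur
    ordered = counts.sort_values(ascending=False, kind="mergesort").index.tolist()
    padding = [None] * max(0, num_top_venues - len(ordered))
    return ordered[:num_top_venues] + padding


# Toy row in the layout of koln_grouped: postal code first, then category counts.
toy_row = pd.Series({"Postal Code": 50667, "Art Museum": 1, "Bakery": 0, "Water Park": 0})
print(top_venues_nonzero(toy_row, 3))
```

Downstream code that expects a fixed number of columns still works, because the result always has `num_top_venues` entries.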

### Extracting the top 5 venues in each of the suburbs

In [20]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# collect the n-th most common venue of every postal code in its own list
place1 = []
place2 = []
place3 = []
place4 = []
place5 = []

for ind in np.arange(koln_grouped.shape[0]):
    temp = return_most_common_venues(koln_grouped.iloc[ind, :], num_top_venues)
    place1.append(temp[0])
    place2.append(temp[1])
    place3.append(temp[2])
    place4.append(temp[3])
    place5.append(temp[4])



In [21]:
# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Postal Code'] = koln_grouped['Postal Code']


venues_sorted["1st Most Common Venue"] = place1
venues_sorted["2nd Most Common Venue"] = place2
venues_sorted["3rd Most Common Venue"] = place3
venues_sorted["4th Most Common Venue"] = place4
venues_sorted["5th Most Common Venue"] = place5

venues_sorted.head()



|   | Postal Code | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue |
|---|---|---|---|---|---|---|
| 0 | 50667 | Art Museum | Water Park | Comedy Club | Gastropub | Forest |
| 1 | 50670 | Sushi Restaurant | Hotel | Steakhouse | Bar | Diner |
| 2 | 50672 | African Restaurant | Bakery | Mexican Restaurant | Café | Cosmetics Shop |
| 3 | 50674 | Comedy Club | Café | Water Park | Gastropub | Forest |
| 4 | 50676 | Water Park | Comedy Club | Gastropub | Forest | Farmers Market |


In [22]:
from sklearn.cluster import KMeans

koln_grouped_clustering = koln_grouped.drop(columns='Postal Code')

kmeans = KMeans(n_clusters=5, random_state=0).fit(koln_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 4, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
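Before interpreting the clusters it can help to check how evenly the postal codes are spread across them. The sketch below counts the ten labels shown in the output above using only the standard library; the variable names are illustrative.

```python
from collections import Counter

# Labels of the first ten postal codes, copied from the output above.
first_ten_labels = [2, 1, 4, 2, 2, 2, 2, 2, 2, 2]

cluster_sizes = Counter(first_ten_labels)
for label, size in sorted(cluster_sizes.items()):
    print("cluster label {}: {} postal codes".format(label, size))
```

Applied to all labels, for example via `Counter(kmeans.labels_.tolist())`, the same count shows whether a few clusters hold most postal codes while others contain a single one.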

In [23]:
# add clustering labels
venues_sorted['Cluster Labels'] = kmeans.labels_

# merge the suburb details with the top venues and cluster label of each postal code
koln_merged = koln_df.join(venues_sorted.set_index('Postal Code'), on='Postal Code')
koln_merged.dropna(inplace=True)  # drop rows with missing values, e.g. postal codes without venues
koln_merged['Cluster Labels'] = koln_merged['Cluster Labels'].astype(int)



In [24]:
venues_sorted.head()

|   | Postal Code | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | Cluster Labels |
|---|---|---|---|---|---|---|---|
| 0 | 50667 | Art Museum | Water Park | Comedy Club | Gastropub | Forest | 2 |
| 1 | 50670 | Sushi Restaurant | Hotel | Steakhouse | Bar | Diner | 1 |
| 2 | 50672 | African Restaurant | Bakery | Mexican Restaurant | Café | Cosmetics Shop | 4 |
| 3 | 50674 | Comedy Club | Café | Water Park | Gastropub | Forest | 2 |
| 4 | 50676 | Water Park | Comedy Club | Gastropub | Forest | Farmers Market | 2 |


### Visualize each of the 5 clusters 

In [25]:
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighborhood, cluster in zip(koln_merged['Latitude'], koln_merged['Longitude'], koln_merged['Suburb'], koln_merged['Cluster Labels']):
    label = folium.Popup(str(neighborhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine Clusters

### Cluster 1

In [26]:
koln_merged.loc[koln_merged['Cluster Labels'] == 0, koln_merged.columns[[0] + list([1,6,7,8,9,10])]]


|   | Postal Code | Latitude | State | Country | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue |
|---|---|---|---|---|---|---|---|
| 34 | 51065 | 50.959901 | Nordrhein-Westfalen | Deutschland | Bakery | Drugstore | Restaurant |
| 40 | 51109 | 50.945011 | Nordrhein-Westfalen | Deutschland | Bed & Breakfast | Drugstore | Water Park |


### Cluster 2

In [27]:
koln_merged.loc[koln_merged['Cluster Labels'] == 1, koln_merged.columns[[0] + list([1,6,7,8,9,10])]]


|   | Postal Code | Latitude | State | Country | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue |
|---|---|---|---|---|---|---|---|
| 2 | 50670 | 50.946885 | Nordrhein-Westfalen | Deutschland | Sushi Restaurant | Hotel | Steakhouse |


### Cluster 3

In [28]:
koln_merged.loc[koln_merged['Cluster Labels'] == 2, koln_merged.columns[[0] + list([1,6,7,8,9,10])]]


|   | Postal Code | Latitude | State | Country | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue |
|---|---|---|---|---|---|---|---|
| 0 | 50667 | 50.939809 | Nordrhein-Westfalen | Deutschland | Art Museum | Water Park | Comedy Club |
| 4 | 50674 | 50.932359 | Nordrhein-Westfalen | Deutschland | Comedy Club | Café | Water Park |
| 5 | 50676 | 50.932472 | Nordrhein-Westfalen | Deutschland | Water Park | Comedy Club | Gastropub |
| 6 | 50677 | 50.922279 | Nordrhein-Westfalen | Deutschland | Theater | Water Park | Chinese Restaurant |
| 9 | 50733 | 50.969006 | Nordrhein-Westfalen | Deutschland | Playground | Water Park | Comedy Club |
| 10 | 50735 | 50.973839 | Nordrhein-Westfalen | Deutschland | Pet Service | Water Park | Comedy Club |
| 13 | 50765 | 51.022553 | Nordrhein-Westfalen | Deutschland | Mexican Restaurant | Water Park | Comedy Club |
| 16 | 50823 | 50.95082 | Nordrhein-Westfalen | Deutschland | Bar | Rhenisch Restaurant | Greek Restaurant |
| 18 | 50827 | 50.955841 | Nordrhein-Westfalen | Deutschland | Shoe Repair | Water Park | Chinese Restaurant |
| 19 | 50829 | 50.972112 | Nordrhein-Westfalen | Deutschland | Playground | Water Park | Comedy Club |


### Cluster 4

In [29]:
koln_merged.loc[koln_merged['Cluster Labels'] == 3, koln_merged.columns[[0] + list([1,6,7,8,9,10])]]


|   | Postal Code | Latitude | State | Country | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue |
|---|---|---|---|---|---|---|---|
| 36 | 51069 | 50.975662 | Nordrhein-Westfalen | Deutschland | Farmers Market | Tram Station | Bakery |


### Cluster 5

In [30]:
koln_merged.loc[koln_merged['Cluster Labels'] == 4, koln_merged.columns[[0] + list([1,6,7,8,9,10])]]


|   | Postal Code | Latitude | State | Country | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue |
|---|---|---|---|---|---|---|---|
| 3 | 50672 | 50.940788 | Nordrhein-Westfalen | Deutschland | African Restaurant | Bakery | Mexican Restaurant |


In [56]:
nearby_venues[nearby_venues['categories'].str.contains("Restaurant")][['name','categories']]

|   | name | categories |
|---|------|------------|
| 8 | Servus Colonia Alpina | Bavarian Restaurant |
| 13 | Beirut | Lebanese Restaurant |
| 22 | Sattgrün | Vegetarian / Vegan Restaurant |
| 26 | Via Sistina An Farina | Italian Restaurant |
| 36 | Poncho's | South American Restaurant |
| 37 | El Chango | Chinese Restaurant |
| 41 | Restaurant Heumarkt | German Restaurant |
| 42 | Rosendorn | Tapas Restaurant |
| 46 | Vinoteca da Rino | Italian Restaurant |
| 51 | Frites Belgique | Fast Food Restaurant |
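The filter above relies on every venue having a category. Because `get_category_type` returns `None` for venues without one, the mask produced by `str.contains` can contain missing values, and indexing with such a mask can raise an error. A minimal sketch of a more defensive filter, using a toy frame with one hypothetical uncategorised venue, passes `na=False` so that missing categories simply count as non-matches.

```python
import pandas as pd

# Toy frame; the missing category mimics a venue for which no category was returned.
toy = pd.DataFrame({
    "name": ["Beirut", "Rheingarten", "Unnamed spot"],
    "categories": ["Lebanese Restaurant", "Park", None],
})

# na=False treats missing categories as "does not contain the pattern".
is_restaurant = toy["categories"].str.contains("Restaurant", na=False)
restaurants = toy.loc[is_restaurant, ["name", "categories"]]
print(restaurants)
```

The result keeps the original row index, just like the filtered output shown above.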
