# IBM Applied Data Science Capstone Course by Coursera

## Week 5 Final Report
**Opening a New Supermarket in Casablanca, Morocco**

* Build a dataframe of neighborhoods in Casablanca, Morocco by scraping the data from a Wikipedia category page
* Get the geographical coordinates of the neighborhoods
* Obtain the venue data for the neighborhoods from Foursquare API
* Explore and cluster the neighborhoods
* Select the best cluster in which to open a new supermarket

## 1. Import libraries

In [1]:

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
!conda install -c conda-forge geocoder --yes

import geocoder # to get coordinates

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# for webscraping import Beautiful Soup 
from bs4 import BeautifulSoup

import xml

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('All Required Libraries imported!')

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.18.1-py_0 conda-forge

geographiclib- 100% |################################| Time: 0:00:00  23.27 MB/s
geopy-1.18.1-p 100% |################################| Time: 0:00:00  35.49 MB/s
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geocoder:   1.38.1-py_0  conda-forge
    orderedset: 2.0-py35_0   conda-forge
    ratelim:    0.1.6-py35_0 conda-forge

orderedset-2.0 100% |################################| Time: 0:00:00  51.10 MB/s
ratelim-0.1.6- 100% |################################| Time: 0:00:00  12.96 MB/s
geocoder-1.38. 100% |################################| Time: 0:00:00 

## 2. Scrape data from Wikipedia page into a DataFrame


In [3]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Neighbourhoods_of_Casablanca").text

In [4]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

# create a list to store neighborhood data
neighborhoodList = []

# append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text)

In [5]:
# create a new DataFrame from the list
kl_df = pd.DataFrame({"Neighborhood": neighborhoodList})

kl_df.head()

Unnamed: 0,Neighborhood
0,Ain Diab
1,Aïn Sebaâ
2,Anfa
3,Belvedere (Casablanca)
4,Bourgogne (Casablanca)


In [6]:
# print the number of rows of the dataframe
kl_df.shape

(24, 1)
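
Several scraped names carry Wikipedia disambiguation suffixes such as `(Casablanca)` or `, Morocco`, and these are later passed to the geocoder unchanged. The sketch below shows one possible way to strip them first. The helper `clean_neighborhood_name` is hypothetical and not part of the original notebook, and whether cleaned names actually geocode better has not been checked here.

```python
import re

def clean_neighborhood_name(name):
    """Strip Wikipedia disambiguation suffixes from a category entry."""
    # drop a trailing parenthetical such as " (Casablanca)" or " (neighborhood)"
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    # drop a trailing qualifier after a comma, e.g. ", Morocco"
    name = name.split(",")[0]
    return name.strip()

# names taken from the scraped list in this notebook
examples = ["Belvedere (Casablanca)", "California (neighborhood)",
            "Roches Noires, Morocco", "Ain Diab"]
cleaned = [clean_neighborhood_name(n) for n in examples]
print(cleaned)
```

If used, the cleaned names could be stored in a separate column so the original Wikipedia titles remain available.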

## 3. Get the geographical coordinates


In [7]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{},Casablanca, Morocco'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
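
Note that the `while` loop above never exits if the geocoder keeps returning `None` for a name. A minimal sketch of a bounded alternative follows. The `geocode` argument is a hypothetical callable (in the notebook it could wrap `geocoder.arcgis(query).latlng`), and `max_attempts` and `delay_seconds` are illustrative parameters, not values from the original.

```python
import time

def get_latlng_with_retries(neighborhood, geocode, max_attempts=5, delay_seconds=0.0):
    """Return [lat, lng] for a neighborhood, or None after max_attempts failures.

    `geocode` is a hypothetical callable that takes a query string and
    returns [lat, lng] or None.
    """
    query = '{}, Casablanca, Morocco'.format(neighborhood)
    for attempt in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return coords
        # wait before retrying so the service is not hammered
        time.sleep(delay_seconds)
    return None

# stand-in geocoder for demonstration only: fails twice, then succeeds
calls = {"n": 0}
def fake_geocode(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [33.5966, -7.6189]

result = get_latlng_with_retries("Ain Diab", fake_geocode)
never = get_latlng_with_retries("Nowhere", lambda q: None, max_attempts=3)
print(result, never)
```

Returning `None` after a fixed number of attempts lets the caller decide whether to drop or manually fix a neighborhood instead of hanging the notebook.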

In [8]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in kl_df["Neighborhood"].tolist() ]
coords

[[33.596610000000055, -7.618889999999965],
 [33.60996000000006, -7.542339999999967],
 [33.588310000000035, -7.61137999999994],
 [33.595120000000065, -7.58809999999994],
 [33.602670000000046, -7.645299999999963],
 [33.53281000000004, -7.6330899999999815],
 [33.596610000000055, -7.618889999999965],
 [33.57593000000003, -7.629709999999932],
 [33.596610000000055, -7.618889999999965],
 [33.6051620255491, -7.652691057320623],
 [33.57977000000005, -7.66757999999993],
 [33.57594000000006, -7.676739999999938],
 [33.596610000000055, -7.618889999999965],
 [33.60107000000005, -7.584429999999941],
 [33.57367000000005, -7.598109999999963],
 [33.596610000000055, -7.618889999999965],
 [33.57957000000005, -7.635999999999967],
 [33.55119000000008, -7.5515799999999444],
 [33.55741000000006, -7.6815299999999525],
 [33.58921000000004, -7.640609999999981],
 [33.59946000000008, -7.583719999999971],
 [33.53825000000006, -7.55350999999996],
 [33.546910000000025, -7.575049999999976],
 [33.524820000000034, -7.65

In [9]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [10]:
# merge the coordinates into the original dataframe
kl_df['Latitude'] = df_coords['Latitude']
kl_df['Longitude'] = df_coords['Longitude']

In [28]:
# check the neighborhoods and the coordinates
print(kl_df.shape)
kl_df

(24, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Ain Diab,33.59661,-7.61889
1,Aïn Sebaâ,33.60996,-7.54234
2,Anfa,33.58831,-7.61138
3,Belvedere (Casablanca),33.59512,-7.5881
4,Bourgogne (Casablanca),33.60267,-7.6453
5,California (neighborhood),33.53281,-7.63309
6,CIL (Casablanca),33.59661,-7.61889
7,Derb Ghallef,33.57593,-7.62971
8,Derb sultane,33.59661,-7.61889
9,Habous (Casablanca),33.605162,-7.652691


In [11]:
# save the DataFrame as CSV file
kl_df.to_csv("kl_df.csv", index=False)

## 4. Create a map of Casablanca with neighborhoods superimposed on top


In [12]:
# get the coordinates of Casablanca
address = 'Casablanca, Morocco'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Casablanca, Morocco are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Casablanca, Morocco are 33.5950627, -7.6187768.


In [14]:
# create map of Casablanca using latitude and longitude values
map_kl = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_kl)  
    
map_kl

In [15]:
# save the map as HTML file
map_kl.save('map_kl.html')

##  5. Use the Foursquare API to explore the neighborhoods


In [16]:
# define Foursquare Credentials and Version
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version



**Let's explore the top 100 venues within 2,000 meters of each neighborhood**

In [17]:
radius = 2000
LIMIT = 200

venues = []

for lat, long, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
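
The loop above assumes every response has at least one group and every venue has at least one category; an empty list in either place would raise an `IndexError`. The sketch below shows a more defensive extraction under that assumption. The function name `extract_venue_rows` and the sample payload are hypothetical: the payload is hand-written and mimics only the fields the loop reads, not a real Foursquare response.

```python
def extract_venue_rows(neighborhood, lat, lng, response_json):
    """Turn one explore-style response into row tuples, tolerating missing parts."""
    groups = response_json.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # use None when a venue has no category instead of failing
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# hand-written sample payload for illustration
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 33.6, "lng": -7.6},
               "categories": [{"name": "Hotel"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 33.5, "lng": -7.5},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Ain Diab", 33.59661, -7.61889, sample)
empty = extract_venue_rows("Ain Diab", 33.59661, -7.61889, {"response": {}})
print(rows, empty)
```

Rows with a `None` category can then be dropped or inspected before one-hot encoding.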

In [19]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(1442, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Ain Diab,33.59661,-7.61889,Casa Jose,33.597823,-7.615341,Tapas Restaurant
1,Ain Diab,33.59661,-7.61889,Sofitel Casablanca Tour Blanche,33.598251,-7.61396,Hotel
2,Ain Diab,33.59661,-7.61889,La Bodega,33.59522,-7.611576,Pub
3,Ain Diab,33.59661,-7.61889,La Sqala: Café Maure,33.602983,-7.61943,Moroccan Restaurant
4,Ain Diab,33.59661,-7.61889,Hyatt Regency Casablanca,33.596195,-7.618708,Hotel


**Let's check how many venues were returned for each neighborhood**



In [20]:
venues_df.groupby(["Neighborhood"]).count()


Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Ain Diab,100,100,100,100,100,100
Anfa,100,100,100,100,100,100
Aïn Sebaâ,28,28,28,28,28,28
Belvedere (Casablanca),32,32,32,32,32,32
Bourgogne (Casablanca),100,100,100,100,100,100
CIL (Casablanca),100,100,100,100,100,100
California (neighborhood),40,40,40,40,40,40
Derb Ghallef,100,100,100,100,100,100
Derb sultane,100,100,100,100,100,100
Habous (Casablanca),74,74,74,74,74,74


In [21]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))


There are 111 unique categories.


In [22]:
# print out the list of categories
venues_df['VenueCategory'].unique()

array(['Tapas Restaurant', 'Hotel', 'Pub', 'Moroccan Restaurant',
       'Hotel Bar', 'French Restaurant', 'Coffee Shop',
       'Seafood Restaurant', 'Mediterranean Restaurant', 'Lounge', 'Plaza',
       'Sandwich Place', 'Indie Movie Theater', 'Italian Restaurant',
       'Café', 'Pizza Place', 'Gastropub', 'Brazilian Restaurant',
       'Restaurant', 'Ice Cream Shop', 'Sushi Restaurant', 'Burger Joint',
       'Salad Place', 'Art Gallery', 'Steakhouse', 'Fast Food Restaurant',
       'Vegetarian / Vegan Restaurant', 'Japanese Restaurant',
       'Spanish Restaurant', 'Library', 'General Entertainment',
       'Cupcake Shop', 'Bakery', 'Latin American Restaurant', 'Diner',
       'American Restaurant', 'Middle Eastern Restaurant',
       'Clothing Store', 'Bar', 'Asian Restaurant', 'Farmers Market',
       'Vietnamese Restaurant', 'Wings Joint', 'Noodle House', 'Pool Hall',
       'Big Box Store', 'Food & Drink Shop', 'Flea Market', 'Gym',
       'Carpet Store', 'Hardware Store', 'Tr

In [23]:
# check if the results contain "Supermarket"
"Supermarket" in venues_df['VenueCategory'].unique()

True

## 6. Analyze Each Neighborhood


In [24]:

# one hot encoding
kl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kl_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [kl_onehot.columns[-1]] + list(kl_onehot.columns[:-1])
kl_onehot = kl_onehot[fixed_columns]

print(kl_onehot.shape)
kl_onehot.head()

(1442, 112)


Unnamed: 0,Neighborhoods,American Restaurant,Antique Shop,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Beach,Beach Bar,Beer Garden,Big Box Store,Boarding House,Brazilian Restaurant,Breakfast Spot,Burger Joint,Café,Carpet Store,Clothing Store,Cocktail Bar,Coffee Shop,Comedy Club,Comfort Food Restaurant,Creperie,Cupcake Shop,Department Store,Dessert Shop,Diner,Doner Restaurant,Electronics Store,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Flea Market,Flower Shop,Food & Drink Shop,French Restaurant,Garden Center,Gastropub,General Entertainment,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Hardware Store,History Museum,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Italian Restaurant,Japanese Restaurant,Jazz Club,Juice Bar,Latin American Restaurant,Library,Lighthouse,Lounge,Market,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Moroccan Restaurant,Multiplex,Neighborhood,Nightclub,Noodle House,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pool Hall,Pub,Racetrack,Resort,Restaurant,Rock Club,Salad Place,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shoe Store,Shopping Mall,Snack Place,Soccer Field,Soccer Stadium,Spa,Spanish Restaurant,Sporting Goods Shop,Sports Club,Steakhouse,Supermarket,Surf Spot,Sushi Restaurant,Taco Place,Tapas Restaurant,Tennis Stadium,Thai Restaurant,Theater,Theme Park,Train Station,Tram Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,Yoga Studio
0,Ain Diab,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1,Ain Diab,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Ain Diab,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Ain Diab,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Ain Diab,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**Next, let's group the rows by neighborhood and take the mean of each one-hot column, which gives the frequency of each category per neighborhood**

In [26]:
kl_grouped = kl_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(kl_grouped.shape)
kl_grouped

(24, 112)


Unnamed: 0,Neighborhoods,American Restaurant,Antique Shop,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Beach,Beach Bar,Beer Garden,Big Box Store,Boarding House,Brazilian Restaurant,Breakfast Spot,Burger Joint,Café,Carpet Store,Clothing Store,Cocktail Bar,Coffee Shop,Comedy Club,Comfort Food Restaurant,Creperie,Cupcake Shop,Department Store,Dessert Shop,Diner,Doner Restaurant,Electronics Store,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Flea Market,Flower Shop,Food & Drink Shop,French Restaurant,Garden Center,Gastropub,General Entertainment,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Hardware Store,History Museum,Hookah Bar,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Italian Restaurant,Japanese Restaurant,Jazz Club,Juice Bar,Latin American Restaurant,Library,Lighthouse,Lounge,Market,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Moroccan Restaurant,Multiplex,Neighborhood,Nightclub,Noodle House,Performing Arts Venue,Pharmacy,Pizza Place,Plaza,Pool Hall,Pub,Racetrack,Resort,Restaurant,Rock Club,Salad Place,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shoe Store,Shopping Mall,Snack Place,Soccer Field,Soccer Stadium,Spa,Spanish Restaurant,Sporting Goods Shop,Sports Club,Steakhouse,Supermarket,Surf Spot,Sushi Restaurant,Taco Place,Tapas Restaurant,Tennis Stadium,Thai Restaurant,Theater,Theme Park,Train Station,Tram Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,Yoga Studio
0,Ain Diab,0.01,0.0,0.01,0.01,0.0,0.0,0.03,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.03,0.11,0.0,0.02,0.0,0.05,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.04,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.01,0.02,0.0,0.01,0.04,0.02,0.0,0.0,0.01,0.01,0.0,0.03,0.0,0.01,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.01,0.01,0.04,0.0,0.0,0.04,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.01,0.0
1,Anfa,0.0,0.01,0.01,0.01,0.0,0.01,0.03,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.14,0.0,0.0,0.0,0.04,0.0,0.0,0.01,0.01,0.0,0.0,0.04,0.0,0.0,0.01,0.01,0.05,0.0,0.01,0.0,0.04,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.05,0.01,0.01,0.0,0.01,0.03,0.01,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.01,0.02,0.02,0.0,0.0,0.05,0.0,0.0,0.03,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.03,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0
2,Aïn Sebaâ,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,0.035714,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.107143,0.035714,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0
3,Belvedere (Casablanca),0.0,0.0,0.0,0.0,0.0,0.03125,0.0625,0.03125,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.3125,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.03125,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0
4,Bourgogne (Casablanca),0.02,0.0,0.0,0.01,0.0,0.0,0.03,0.02,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.04,0.08,0.0,0.0,0.0,0.1,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.05,0.0,0.04,0.0,0.0,0.04,0.01,0.0,0.01,0.01,0.0,0.01,0.03,0.0,0.01,0.01,0.04,0.0,0.03,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.01,0.03,0.0,0.0,0.01,0.01,0.03,0.03,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0
5,CIL (Casablanca),0.01,0.0,0.01,0.01,0.0,0.0,0.03,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.03,0.11,0.0,0.02,0.0,0.05,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.04,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.01,0.02,0.0,0.01,0.04,0.02,0.0,0.0,0.01,0.01,0.0,0.03,0.0,0.01,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.01,0.01,0.04,0.0,0.0,0.04,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.01,0.0
6,California (neighborhood),0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025,0.025,0.075,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,0.025,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.075,0.0,0.0,0.0,0.0,0.05,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.0,0.0,0.0,0.0,0.0,0.025,0.025,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0
7,Derb Ghallef,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.16,0.0,0.02,0.0,0.08,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.01,0.0,0.0,0.02,0.0,0.02,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.03,0.02,0.0,0.07,0.03,0.0,0.02,0.01,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.04,0.01,0.02,0.02,0.0,0.0,0.03,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0
8,Derb sultane,0.01,0.0,0.01,0.01,0.0,0.0,0.03,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.03,0.11,0.0,0.02,0.0,0.05,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.04,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.01,0.02,0.0,0.01,0.04,0.02,0.0,0.0,0.01,0.01,0.0,0.03,0.0,0.01,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.01,0.01,0.04,0.0,0.0,0.04,0.0,0.01,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.01,0.0
9,Habous (Casablanca),0.013514,0.0,0.0,0.013514,0.0,0.0,0.027027,0.027027,0.027027,0.013514,0.013514,0.0,0.0,0.0,0.0,0.013514,0.081081,0.0,0.0,0.0,0.081081,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.013514,0.0,0.0,0.027027,0.0,0.0,0.0,0.013514,0.0,0.0,0.013514,0.013514,0.013514,0.0,0.013514,0.027027,0.0,0.0,0.0,0.094595,0.0,0.013514,0.0,0.0,0.013514,0.0,0.0,0.0,0.0,0.0,0.027027,0.040541,0.0,0.013514,0.0,0.040541,0.0,0.027027,0.013514,0.0,0.040541,0.0,0.0,0.0,0.027027,0.0,0.013514,0.013514,0.013514,0.0,0.0,0.013514,0.013514,0.013514,0.013514,0.013514,0.0,0.013514,0.0,0.013514,0.0,0.0,0.013514,0.013514,0.0,0.013514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013514
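
Each value in the grouped table is the share of a neighborhood's venues that fall into a category. For example, Aïn Sebaâ has 28 venues, so a single venue shows up as 1/28 ≈ 0.035714. The sketch below reproduces the same calculation by hand on made-up data, using only the standard library instead of `pd.get_dummies` and `groupby().mean()`. The neighborhood names "A" and "B" and the venue list are illustrative, not Foursquare results.

```python
from collections import Counter, defaultdict

# made-up (neighborhood, category) pairs for illustration
venues = [("A", "Café"), ("A", "Café"), ("A", "Hotel"), ("A", "Supermarket"),
          ("B", "Café"), ("B", "Hotel")]

# count categories per neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# frequency = category count / total venues in that neighborhood
categories = sorted({category for _, category in venues})
frequencies = {
    neighborhood: {c: counter[c] / sum(counter.values()) for c in categories}
    for neighborhood, counter in counts.items()
}
print(frequencies)
```

Because the values are shares rather than counts, a neighborhood with few venues can show a high frequency from a single supermarket.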


In [27]:
len(kl_grouped[kl_grouped["Supermarket"] > 0])


5

**Create a new DataFrame for Supermarket data only**



In [28]:
kl_market = kl_grouped[["Neighborhoods","Supermarket"]]
kl_market

Unnamed: 0,Neighborhoods,Supermarket
0,Ain Diab,0.0
1,Anfa,0.0
2,Aïn Sebaâ,0.0
3,Belvedere (Casablanca),0.0
4,Bourgogne (Casablanca),0.0
5,CIL (Casablanca),0.0
6,California (neighborhood),0.025
7,Derb Ghallef,0.0
8,Derb sultane,0.0
9,Habous (Casablanca),0.0


## 7. Cluster Neighborhoods


Run k-means to cluster the neighborhoods in Casablanca into 3 clusters.



In [29]:
# set number of clusters
kclusters = 3

kl_clustering = kl_market.drop(["Neighborhoods"], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kl_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 2, 0, 0, 0], dtype=int32)
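
Since the clustering uses a single feature, k-means here effectively splits the supermarket frequencies into three ranges. The sketch below illustrates this with a plain Lloyd's-algorithm loop on the frequencies that appear in this notebook (19 zeros plus the five non-zero values). It is not scikit-learn's `KMeans`: the evenly spaced starting centroids are an assumption made for determinism, so label numbers may differ from the output above even when the grouping is the same.

```python
def kmeans_1d(values, k, max_iter=100):
    """Tiny 1-D k-means (Lloyd's algorithm) for illustration; requires k > 1."""
    lo, hi = min(values), max(values)
    # evenly spaced starting centroids (assumption; scikit-learn uses k-means++)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # assignment step: nearest centroid for each value
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                      for v in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids

# supermarket frequencies from this notebook: 19 zeros, then
# Oulfa, La Colline, California, Hay El Hanaa, Hay El Hassani
values = [0.0] * 19 + [0.071429, 0.017544, 0.025, 0.014493, 0.029412]
labels, centroids = kmeans_1d(values, k=3)
print(labels[19:], centroids)
```

In this sketch Oulfa ends up alone in one cluster, the four other non-zero neighborhoods share a second, and all zero-frequency neighborhoods form the third, which matches the grouping reported in the next section.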

In [30]:

# create a new dataframe that holds the supermarket frequency and the cluster label for each neighborhood
kl_merged = kl_market.copy()

# add clustering labels
kl_merged["Cluster Labels"] = kmeans.labels_

In [31]:
kl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
kl_merged.head()

Unnamed: 0,Neighborhood,Supermarket,Cluster Labels
0,Ain Diab,0.0,0
1,Anfa,0.0,0
2,Aïn Sebaâ,0.0,0
3,Belvedere (Casablanca),0.0,0
4,Bourgogne (Casablanca),0.0,0


In [32]:
# join kl_merged with kl_df to add latitude/longitude for each neighborhood
kl_merged = kl_merged.join(kl_df.set_index("Neighborhood"), on="Neighborhood")

print(kl_merged.shape)
kl_merged.head() # check the last columns!

(24, 5)


Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
0,Ain Diab,0.0,0,33.59661,-7.61889
1,Anfa,0.0,0,33.58831,-7.61138
2,Aïn Sebaâ,0.0,0,33.60996,-7.54234
3,Belvedere (Casablanca),0.0,0,33.59512,-7.5881
4,Bourgogne (Casablanca),0.0,0,33.60267,-7.6453


In [33]:
# sort the results by Cluster Labels
print(kl_merged.shape)
kl_merged.sort_values(["Cluster Labels"], inplace=True)
kl_merged

(24, 5)


Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
0,Ain Diab,0.0,0,33.59661,-7.61889
21,Salmia 2 (Casablanca),0.0,0,33.53825,-7.55351
20,"Roches Noires, Morocco",0.0,0,33.59946,-7.58372
19,Racine (Casablanca),0.0,0,33.58921,-7.64061
17,Oasis (Casablanca),0.0,0,33.55119,-7.55158
16,Maârif,0.0,0,33.57957,-7.636
15,"Lamkansa, Casablanca-Settat",0.0,0,33.59661,-7.61889
13,Inara (Casablanca),0.0,0,33.60107,-7.58443
12,Hay Salama,0.0,0,33.59661,-7.61889
22,Sbata,0.0,0,33.54691,-7.57505


In [34]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kl_merged['Latitude'], kl_merged['Longitude'], kl_merged['Neighborhood'], kl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [63]:

# save the map as HTML file
map_clusters.save('map_clusters.html')

## 8. Examine Clusters

**Cluster 0**


In [64]:
kl_merged.loc[kl_merged['Cluster Labels'] == 0]


Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
0,Ain Diab,0.0,0,33.59661,-7.61889
21,Salmia 2 (Casablanca),0.0,0,33.53825,-7.55351
20,"Roches Noires, Morocco",0.0,0,33.59946,-7.58372
19,Racine (Casablanca),0.0,0,33.58921,-7.64061
17,Oasis (Casablanca),0.0,0,33.55119,-7.55158
16,Maârif,0.0,0,33.57957,-7.636
15,"Lamkansa, Casablanca-Settat",0.0,0,33.59661,-7.61889
13,Inara (Casablanca),0.0,0,33.60107,-7.58443
12,Hay Salama,0.0,0,33.59661,-7.61889
22,Sbata,0.0,0,33.54691,-7.57505


**Cluster 1**

In [65]:
kl_merged.loc[kl_merged['Cluster Labels'] == 1]


Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
18,Oulfa,0.071429,1,33.55741,-7.68153


**Cluster 2**

In [66]:
kl_merged.loc[kl_merged['Cluster Labels'] == 2]


Unnamed: 0,Neighborhood,Supermarket,Cluster Labels,Latitude,Longitude
14,La Colline (Casablanca),0.017544,2,33.57367,-7.59811
6,California (neighborhood),0.025,2,33.53281,-7.63309
10,Hay El Hanaa,0.014493,2,33.57977,-7.66758
11,Hay El Hassani,0.029412,2,33.57594,-7.67674


**Observations:**


Supermarkets are concentrated in the outer areas of Casablanca. Cluster 1 (Oulfa) has the highest share of supermarkets among its venues (about 7.1%), and cluster 2 has a moderate share (roughly 1.4% to 2.9%). Cluster 0, which covers the remaining neighborhoods including most of the central area, has no supermarkets at all in the Foursquare results.

Cluster 0 therefore represents the greatest opportunity for opening new supermarkets, as there is little to no competition from existing ones. Supermarkets in cluster 1 are likely to face the most intense competition because of the high concentration of existing stores.

This project recommends that property developers open new supermarkets in cluster 0 neighborhoods. Developers with a unique selling proposition that sets them apart from the competition can also consider cluster 2, where competition is moderate. Developers are advised to avoid cluster 1, which already has the highest concentration of supermarkets and the most intense competition.
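
As a quick sanity check on the cluster ordering, the mean supermarket frequency per cluster can be computed directly from the tables above. The sketch below hard-codes those values. The count of 19 zero rows for cluster 0 is inferred from the 24 neighborhoods minus the 5 with a non-zero frequency, all of which fall in clusters 1 and 2.

```python
from statistics import mean

# supermarket frequencies per cluster, copied from the tables above
supermarket_share = {
    0: [0.0] * 19,                              # inferred: 24 total minus 5 non-zero
    1: [0.071429],                              # Oulfa
    2: [0.017544, 0.025, 0.014493, 0.029412],   # La Colline, California, Hay El Hanaa, Hay El Hassani
}

# mean frequency per cluster
means = {cluster: mean(values) for cluster, values in supermarket_share.items()}

# clusters ordered from most to least supermarket-dense
ranking = sorted(means, key=means.get, reverse=True)
print(means, ranking)
```

These figures are shares of the venues Foursquare returned, not counts of supermarkets, so they indicate relative concentration only.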