# Applied Data Science Capstone - IBM Professional Certificate
### Marcelo Porto

## Introduction

The skills learned during this course give us tools to solve some interesting problem. If you are deciding to open a new business, it might be interesting to explore a city and find a neighborhood where your business is needed.  
Let's say you want to open a coffee shop. It wouldn't be interesting to open in an area where there are 3 Starbucks' already. We can use the Foursquare API to find neighborhoods where there might be a need for your new coffee shop.

We can use clustering to find similar neighborhoods in the city that might support your plan: neighborhood X has a couple of coffee shops and it is similar to neighborhood Y. So it might be good to take our business to neighborhood Y if there isn't many coffee shops there already. Are neighborhoods X and Y similar enough if we remove coffee shops from the equation?

### Target Audience

Prospective business owners could use this to find the best location for their business.  
City officials or the appropriate city departments could use this project to identify areas to invest in and attract new business, enhancing the value of possible up-and-coming neighborhoods.

### Data Required

Some cities do not have postal codes. Abu Dhabi, in the United Arab Emirates, is one of the cities. So first I will need to search for a list of neighborhoods in Abu Dhabi. Then, I will need to find coordinates for each one of these neighborhoods with the geopy service. After this is done, I will be able to search the city for different venues with the Foursquare API, using this data to cluster and compare neighborhoods.

For Abu Dhabi, [Wikipedia](https://en.wikipedia.org/wiki/Abu_Dhabi#Neighborhoods) provides a list of neighborhoods, which I can use to search for coordinates.

### Structure

The Methodology section will explain how the process was done and methods applied. In Results, the findings will be presented along with the codes used. This reports ends with a brief discussion about the results and limitations of this exercise, and improvements that can be made.

## Methodology
 
The Foursquare API will be used to extract information from business venues in the city. Geopy will be used to turn neighborhood names into latitude and longitude coordinates. The folium package will allows us to visualize these findings in beautiful interactive maps. And talking about beautiful, I will use the BeautifulSoup package to scrape neighborhood names for Abu Dhabi from Wikipedia.  

Last but not least, I will apply K-means to cluster the neighborhoods based on what type of businesses are most common between them. After specifying how many clusters are to be created, this unsupervised machine learning algorithm chooses random points as cluster centers, and every other point is assigned to the closest center via the calculation of its Euclidean distances. The mean point of each cluster, or centroids, is calculated, and it becomes the new center for the cluster. This centroid minimizes the total squared distance of each point to the cluster center (Source: my thesis. Just trust me on this). There are limitations to this method, but this is out of the scope of this exercise.  

All the coding here is done in Python, in a Jupyter Notebook, using the [IBM Skill Network Labs](https://labs.cognitiveclass.ai) platform, which is free and I **highly** recommend, it will make your life easier. As you can see, a Github repository is used to store the notebook. Feel free to clone it and use for your own learning!  

(By the way, even though I am not using it here, I also suggest you to open an account on IBM Cloud. They have loads of free and cool resources for Data Science!)

## Results

Let's begin by loading and installing our tools!

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          92 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.21.0-py_0



Downloading and Extracting Packages
geopy-1.21.0         | 58 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ##################################### |

First, let's test with some location, to see if there are entries in the Foursquare database for Abu Dhabi.

In [2]:
#Test with Abu Dhabi
CLIENT_ID = 'yourid' # your Foursquare ID
CLIENT_SECRET = 'yoursecret' # your Foursquare Secret
VERSION = '20200401' # Foursquare API version
neighborhood_latitude =24.4539
neighborhood_longitude =54.3773
# type your answer here
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e9e8d9c6001fe001c91e8d2'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Abu Dhabi',
  'headerFullLocation': 'Abu Dhabi',
  'headerLocationGranularity': 'city',
  'totalResults': 5,
  'suggestedBounds': {'ne': {'lat': 24.458400004500007,
    'lng': 54.38223422930707},
   'sw': {'lat': 24.449399995499995, 'lng': 54.37236577069292}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '52958c2d11d2ce1a0539c550',
       'name': 'Umm Al Emarat Park (حديقة المشرف المركزية)',
       'location': {'address': 'Al Karamah St',
        'crossStreet': 'Mohammed bin Khalifa St',
        'lat': 24.453299953559355,
        'lng': 54.3810916845015,
        'label

In [3]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [4]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,name,categories,lat,lng
0,Umm Al Emarat Park (حديقة المشرف المركزية),Park,24.4533,54.381092
1,Home Bakery,Bakery,24.453337,54.381018
2,Café Arabia,Café,24.45576,54.379318
3,Mushrif palace park,Park,24.453375,54.374729
4,Murjan Asfar Hotel Apartment,Hotel,24.453511,54.377871


In [5]:
nearby_venues.shape

(5, 4)

Ok, apparently there are some. Let's plot these venues in a map of Abu Dhabi.

In [6]:
# create map of AD using latitude and longitude values
latitude = 24.4539
longitude = 54.3773
#neighborhood_latitude =24.4539
#neighborhood_longitude =54.3773
map_AD = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(nearby_venues['lat'], nearby_venues['lng'], nearby_venues['name'], nearby_venues['categories']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_AD)  
    
map_AD

Cool! Now, those are venues. There are no postal codes in Abu Dhabi.  
We can find a list of neighborhoods on Wikipedia.  
Let's see if I can find coordinates for one of these neighborhoods, Al Karama.

In [7]:
address = 'Al Karama, Abu Dhabi, United Arab Emirates'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

25.244402800000003 55.30475541735386


Alright, it works. There aren't that many, so I could probably copy the list to a spreadsheet and load here as a csv file.  
But that's no fun. Let's scrape the Wikipedia page to get the Neighborhood names, using the BeautifulSoup package.

In [8]:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/Abu_Dhabi').text
!pip install BeautifulSoup4
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'html.parser')

Collecting BeautifulSoup4
[?25l  Downloading https://files.pythonhosted.org/packages/e8/b5/7bb03a696f2c9b7af792a8f51b82974e51c268f15e925fc834876a4efa0b/beautifulsoup4-4.9.0-py3-none-any.whl (109kB)
[K     |████████████████████████████████| 112kB 6.1MB/s eta 0:00:01
[?25hCollecting soupsieve>1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/05/cf/ea245e52f55823f19992447b008bcbb7f78efc5960d77f6c34b5b45b36dd/soupsieve-2.0-py2.py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.0 soupsieve-2.0


The output is pretty long, so I suggest collapsing it for a better read of the notebook. But printing the whole thing is necessary to find what we want. Or you can check the source code of the page.

In [9]:
print(soup)


<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Abu Dhabi - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"XptYvApAEJcAAoQ3thkAAADU","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Abu_Dhabi","wgTitle":"Abu Dhabi","wgCurRevisionId":950793038,"wgRevisionId":950793038,"wgArticleId":18950756,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","All articles with dead external links","Articles with dead external links from August 2016","Webarchive template webcite links","Articles with dead extern

Sorry about that, that was long. By inspecting our "soup", we see that the list is under the class "div-col columns column-width".

In [10]:
#div-col columns column-width
hoods = soup.find(class_="div-col columns column-width")
print(hoods)

<div class="div-col columns column-width" style="-moz-column-width: 18em; -webkit-column-width: 18em; column-width: 18em;">
<ul><li><a class="new" href="/w/index.php?title=Al_Aman&amp;action=edit&amp;redlink=1" title="Al Aman (page does not exist)">Al Aman</a></li>
<li><a class="new" href="/w/index.php?title=Al_Bateen&amp;action=edit&amp;redlink=1" title="Al Bateen (page does not exist)">Al Bateen</a></li>
<li><a class="new" href="/w/index.php?title=Al_Dhafrah&amp;action=edit&amp;redlink=1" title="Al Dhafrah (page does not exist)">Al Dhafrah</a></li>
<li><a class="new" href="/w/index.php?title=Al_Falah&amp;action=edit&amp;redlink=1" title="Al Falah (page does not exist)">Al Falah</a></li>
<li><a href="/wiki/Al_Karama,_United_Arab_Emirates" title="Al Karama, United Arab Emirates">Al Karama</a></li>
<li><a class="new" href="/w/index.php?title=Al_Khubeirah&amp;action=edit&amp;redlink=1" title="Al Khubeirah (page does not exist)">Al Khubeirah</a></li>
<li><a href="/wiki/Al_Lulu_Island" tit

To get just the name of the neighborhood, we use this code:

In [11]:
hoods.a.text

'Al Aman'

I'll create a dataframe for the neighborhoods and iterate over our "soup" to get all the names.

In [12]:
header=["Neighborhood","Lat","Lon"]

df = pd.DataFrame(columns=header)
hoods_rows = hoods.find_all('a')
for tr in hoods_rows:
    row = tr.text
    
    df = df.append({'Neighborhood': row}, ignore_index=True)





In [13]:
df

Unnamed: 0,Neighborhood,Lat,Lon
0,Al Aman,,
1,Al Bateen,,
2,Al Dhafrah,,
3,Al Falah,,
4,Al Karama,,
5,Al Khubeirah,,
6,Al Lulu Island,,
7,Al Madina,,
8,Al Maryah Island,,
9,Al Manaseer,,


Great! Now we can put that into the geocoder to get our coordinates.

In [14]:
address = '{}, Abu Dhabi, United Arab Emirates'
geolocator = Nominatim(user_agent="foursquare_agent")

for i in range(0, df.shape[0]):
    hood = df.iloc[i,0]
    location = geolocator.geocode(address.format(hood))
    if location is not None:
        df.iloc[i,1] = location.latitude
        df.iloc[i,2] = location.longitude

In [15]:
df

Unnamed: 0,Neighborhood,Lat,Lon
0,Al Aman,24.432,54.4266
1,Al Bateen,24.2151,55.6263
2,Al Dhafrah,24.4761,54.3694
3,Al Falah,24.4447,54.7282
4,Al Karama,25.2444,55.3048
5,Al Khubeirah,24.4652,54.3368
6,Al Lulu Island,24.4996,54.3457
7,Al Madina,24.3409,54.4907
8,Al Maryah Island,24.5021,54.3902
9,Al Manaseer,,


We couldn't find all the neighborhoods, but that's ok, we still got plenty for this exercise.  
Let's remove the ones without coordinates.

In [16]:
df=df.dropna()

In [17]:
df.reset_index(inplace=True,drop=True)
df

Unnamed: 0,Neighborhood,Lat,Lon
0,Al Aman,24.432,54.4266
1,Al Bateen,24.2151,55.6263
2,Al Dhafrah,24.4761,54.3694
3,Al Falah,24.4447,54.7282
4,Al Karama,25.2444,55.3048
5,Al Khubeirah,24.4652,54.3368
6,Al Lulu Island,24.4996,54.3457
7,Al Madina,24.3409,54.4907
8,Al Maryah Island,24.5021,54.3902
9,Al Manhal,24.4666,54.366


Now we can plot these in a map:

In [18]:
# create map of Abu Dhabi using latitude and longitude values
latitude = 24.4539
longitude = 54.3773
map_AD = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df['Lat'], df['Lon'], df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_AD)  
    
map_AD

We use the Foursquare API to find venues in these neighborhoods.  
Let's use that trusty function that was provided in the course.

In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Running the function...

In [20]:
AD_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Lat'],
                                   longitudes=df['Lon']
                                  )

Al Aman
Al Bateen
Al Dhafrah
Al Falah
Al Karama
Al Khubeirah
Al Lulu Island
Al Madina
Al Maryah Island
Al Manhal
Al Maqtaa
Al Markaziyah
Al Mushrif
Al Nahyan
Al Reef
Al Reem Island
Al Rowdah
Al Shamkha
Al Zahiyah
Al Zahraa
Bain Al Jisrain
Khalifa City
Masdar City
Mohammed Bin Zayed City
Saadiyat Island
Shakhbout City
Officers City
Qasr El Bahr
Yas Island


Now we have a list of venues with coordinates for each neighborhood.  
Let's just look at the first 10 entries.

In [36]:
AD_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Al Aman,24.43199,54.42655,Subway,24.435156,54.425187,Sandwich Place
1,Al Aman,24.43199,54.42655,Novotel Abu Dhabi Al Bustan,24.429431,54.429166,Hotel
2,Al Aman,24.43199,54.42655,McDonald's,24.434646,54.424393,Fast Food Restaurant
3,Al Aman,24.43199,54.42655,Cafe Bonjour Bonsoir,24.429314,54.42825,Café
4,Al Aman,24.43199,54.42655,Adagio Aparthotel,24.429396,54.428604,Hotel
5,Al Aman,24.43199,54.42655,Coffee Planet,24.434725,54.423094,Coffee Shop
6,Al Dhafrah,24.476147,54.36936,Jumeirah Etihad Tower,24.476051,54.367716,Hotel
7,Al Dhafrah,24.476147,54.36936,Starbucks (ستاربكس),24.47743,54.371626,Coffee Shop
8,Al Dhafrah,24.476147,54.36936,Gudee Pizza & Café,24.47786,54.371012,Pizza Place
9,Al Dhafrah,24.476147,54.36936,Al Shater Hassan Restaurant,24.47875,54.369562,Falafel Restaurant


How many unique type of businesses do we have in Abu Dhabi?

In [37]:
print('There are {} unique businesses.'.format(len(AD_venues['Venue Category'].unique())))

There are 125 unique businesses.


In [58]:
AD_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Al Aman,6,6,6,6,6,6
Al Dhafrah,32,32,32,32,32,32
Al Karama,38,38,38,38,38,38
Al Khubeirah,17,17,17,17,17,17
Al Lulu Island,1,1,1,1,1,1
Al Madina,3,3,3,3,3,3
Al Manhal,7,7,7,7,7,7
Al Maqtaa,27,27,27,27,27,27
Al Markaziyah,39,39,39,39,39,39
Al Maryah Island,56,56,56,56,56,56


We see we have a problem with our data here. There are a few neighborhoods with very few entries.  
We could attempt to increase the radius of search. Or find another data source.  
For this exercise we will limit our analysis to those areas with at least 5 businesses.

In [70]:
temp = AD_venues.groupby('Neighborhood').count()
id = temp.index[temp.iloc[:,0] >= 5]

In [74]:
#rpt[rpt['STK_ID'].isin(stk_list)]
AD_venues = AD_venues[AD_venues['Neighborhood'].isin(id)]

What type of venue is more frequent in each neighborhood?  
How are we going to cluster these together?
To start to answer these questions, we turn the venues into dummy variables.  
We can get the dummies with one hot encoding.

In [75]:
# one hot encoding
AD_onehot = pd.get_dummies(AD_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
AD_onehot['Neighborhood'] = AD_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [AD_onehot.columns[-1]] + list(AD_onehot.columns[:-1])
AD_onehot = AD_onehot[fixed_columns]

AD_onehot.head()

Unnamed: 0,Women's Store,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Arcade,Asian Restaurant,BBQ Joint,Bakery,Beach,Bed & Breakfast,Bistro,Bookstore,Boutique,Bowling Alley,Breakfast Spot,Buffet,Burger Joint,Cafeteria,Café,Candy Store,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Convenience Store,Cosmetics Shop,Cupcake Shop,Department Store,Dessert Shop,Donut Shop,Electronics Store,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Flower Shop,Food & Drink Shop,Food Court,French Restaurant,Fried Chicken Joint,Furniture / Home Store,Gift Shop,Go Kart Track,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Harbor / Marina,Health & Beauty Service,Hookah Bar,Hostel,Hot Spring,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Korean Restaurant,Lebanese Restaurant,Lingerie Store,Lounge,Medical Center,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Moroccan Restaurant,Movie Theater,Multiplex,Nail Salon,Neighborhood,Nightclub,Optical Shop,Pakistani Restaurant,Park,Peruvian Restaurant,Pet Store,Pharmacy,Pizza Place,Playground,Plaza,Pool,Pub,Racetrack,Residential Building (Apartment / Condo),Resort,Restaurant,Sandwich Place,Seafood Restaurant,Shawarma Place,Shoe Store,Shopping Mall,Snack Place,South Indian Restaurant,Spa,Sporting Goods Shop,Sports Bar,Steakhouse,Supermarket,Sushi Restaurant,Tennis Court,Theater,Theme Park,Theme Park Ride / Attraction,Theme Restaurant,Toy / Game Store,Turkish Restaurant,Vegetarian / Vegan Restaurant,Wine Bar
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Al Aman,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Al Aman,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Al Aman,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Al Aman,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Al Aman,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [76]:
AD_grouped = AD_onehot.groupby('Neighborhood').mean().reset_index()
AD_grouped

Unnamed: 0,Neighborhood,Women's Store,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Arcade,Asian Restaurant,BBQ Joint,Bakery,Beach,Bed & Breakfast,Bistro,Bookstore,Boutique,Bowling Alley,Breakfast Spot,Buffet,Burger Joint,Cafeteria,Café,Candy Store,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Convenience Store,Cosmetics Shop,Cupcake Shop,Department Store,Dessert Shop,Donut Shop,Electronics Store,Ethiopian Restaurant,Falafel Restaurant,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Flower Shop,Food & Drink Shop,Food Court,French Restaurant,Fried Chicken Joint,Furniture / Home Store,Gift Shop,Go Kart Track,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Harbor / Marina,Health & Beauty Service,Hookah Bar,Hostel,Hot Spring,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Korean Restaurant,Lebanese Restaurant,Lingerie Store,Lounge,Medical Center,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Moroccan Restaurant,Movie Theater,Multiplex,Nail Salon,Nightclub,Optical Shop,Pakistani Restaurant,Park,Peruvian Restaurant,Pet Store,Pharmacy,Pizza Place,Playground,Plaza,Pool,Pub,Racetrack,Residential Building (Apartment / Condo),Resort,Restaurant,Sandwich Place,Seafood Restaurant,Shawarma Place,Shoe Store,Shopping Mall,Snack Place,South Indian Restaurant,Spa,Sporting Goods Shop,Sports Bar,Steakhouse,Supermarket,Sushi Restaurant,Tennis Court,Theater,Theme Park,Theme Park Ride / Attraction,Theme Restaurant,Toy / Game Store,Turkish Restaurant,Vegetarian / Vegan Restaurant,Wine Bar
0,Al Aman,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Al Dhafrah,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09375,0.0,0.0,0.03125,0.0,0.03125,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.03125,0.0625,0.03125,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.03125,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15625,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0
2,Al Karama,0.0,0.0,0.0,0.0,0.0,0.0,0.105263,0.0,0.078947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.026316,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.447368,0.0,0.026316,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.026316,0.0
3,Al Khubeirah,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.235294,0.0,0.0,0.0,0.0,0.235294,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.117647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Al Manhal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.428571,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Al Maqtaa,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.185185,0.0,0.0,0.0,0.0,0.074074,0.037037,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Al Markaziyah,0.0,0.0,0.0,0.0,0.025641,0.0,0.051282,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,0.025641,0.025641,0.0,0.0,0.0,0.076923,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.025641,0.0,0.0,0.0,0.128205,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.128205,0.025641,0.0,0.025641,0.025641,0.051282,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.025641,0.025641,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.025641,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Al Maryah Island,0.017857,0.017857,0.0,0.0,0.035714,0.0,0.0,0.0,0.035714,0.0,0.0,0.017857,0.0,0.017857,0.0,0.0,0.017857,0.017857,0.0,0.089286,0.0,0.035714,0.0,0.017857,0.071429,0.0,0.017857,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.017857,0.035714,0.017857,0.035714,0.0,0.0,0.0,0.0,0.035714,0.0,0.017857,0.035714,0.035714,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.017857,0.017857,0.0,0.0,0.017857,0.017857,0.0,0.0,0.0,0.0,0.017857,0.017857,0.0,0.0,0.017857,0.0,0.0,0.035714,0.017857,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017857
8,Al Nahyan,0.0,0.0,0.0,0.0,0.034483,0.0,0.034483,0.0,0.068966,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.172414,0.0,0.0,0.0,0.0,0.103448,0.034483,0.0,0.034483,0.0,0.068966,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.068966,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Al Reef,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.083333,0.0,0.0,0.25,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We now have the frequency for each neighborhood.  
What are the top 5 venues for each neighborhood?

In [77]:
num_top_venues = 5

for hood in AD_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = AD_grouped[AD_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Al Aman----
                  venue  freq
0                 Hotel  0.33
1  Fast Food Restaurant  0.17
2           Coffee Shop  0.17
3        Sandwich Place  0.17
4                  Café  0.17


----Al Dhafrah----
                       venue  freq
0  Middle Eastern Restaurant  0.16
1                       Café  0.09
2              Movie Theater  0.06
3        Fried Chicken Joint  0.06
4       Fast Food Restaurant  0.06


----Al Karama----
               venue  freq
0  Indian Restaurant  0.45
1   Asian Restaurant  0.11
2             Bakery  0.08
3        Supermarket  0.03
4               Park  0.03


----Al Khubeirah----
                  venue  freq
0           Coffee Shop  0.24
1                  Café  0.24
2  Fast Food Restaurant  0.12
3           Supermarket  0.06
4        Shawarma Place  0.06


----Al Manhal----
            venue  freq
0            Café  0.43
1           Hotel  0.14
2      Hot Spring  0.14
3  Medical Center  0.14
4     Coffee Shop  0.14


----Al Maqtaa----
    

Let's put this information in a dataframe, with the top 5 most common venues for each neighborhood.

In [78]:
#This function sorts the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [79]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = AD_grouped['Neighborhood']

for ind in np.arange(AD_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(AD_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Al Aman,Hotel,Coffee Shop,Fast Food Restaurant,Sandwich Place,Café
1,Al Dhafrah,Middle Eastern Restaurant,Café,Fried Chicken Joint,Furniture / Home Store,Fast Food Restaurant
2,Al Karama,Indian Restaurant,Asian Restaurant,Bakery,Ice Cream Shop,Cafeteria
3,Al Khubeirah,Café,Coffee Shop,Fast Food Restaurant,Donut Shop,Gym / Fitness Center
4,Al Manhal,Café,Hot Spring,Medical Center,Coffee Shop,Hotel


Let's have a look at the whole thing...

In [80]:
neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Al Aman,Hotel,Coffee Shop,Fast Food Restaurant,Sandwich Place,Café
1,Al Dhafrah,Middle Eastern Restaurant,Café,Fried Chicken Joint,Furniture / Home Store,Fast Food Restaurant
2,Al Karama,Indian Restaurant,Asian Restaurant,Bakery,Ice Cream Shop,Cafeteria
3,Al Khubeirah,Café,Coffee Shop,Fast Food Restaurant,Donut Shop,Gym / Fitness Center
4,Al Manhal,Café,Hot Spring,Medical Center,Coffee Shop,Hotel
5,Al Maqtaa,Café,Coffee Shop,Middle Eastern Restaurant,Pharmacy,Shopping Mall
6,Al Markaziyah,Fast Food Restaurant,Hotel,Café,Italian Restaurant,Asian Restaurant
7,Al Maryah Island,Café,Coffee Shop,Sushi Restaurant,Middle Eastern Restaurant,American Restaurant
8,Al Nahyan,Café,Coffee Shop,Dessert Shop,Flower Shop,Bakery
9,Al Reef,Pool,Gym,Pizza Place,Convenience Store,Coffee Shop


Now we can attempt to cluster these neighborhoods based on how similar its businesses are.

In [81]:
# set number of clusters
kclusters = 5

AD_grouped_clustering = AD_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(AD_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:5]

array([0, 1, 3, 1, 2], dtype=int32)

In [82]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster_Labels', kmeans.labels_)

AD_merged = df

# merge AD_grouped with AD_data to add latitude/longitude for each neighborhood
AD_merged = AD_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

AD_merged.head()

Unnamed: 0,Neighborhood,Lat,Lon,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Al Aman,24.432,54.4266,0.0,Hotel,Coffee Shop,Fast Food Restaurant,Sandwich Place,Café
1,Al Bateen,24.2151,55.6263,,,,,,
2,Al Dhafrah,24.4761,54.3694,1.0,Middle Eastern Restaurant,Café,Fried Chicken Joint,Furniture / Home Store,Fast Food Restaurant
3,Al Falah,24.4447,54.7282,,,,,,
4,Al Karama,25.2444,55.3048,3.0,Indian Restaurant,Asian Restaurant,Bakery,Ice Cream Shop,Cafeteria


In [83]:
AD_merged

Unnamed: 0,Neighborhood,Lat,Lon,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Al Aman,24.432,54.4266,0.0,Hotel,Coffee Shop,Fast Food Restaurant,Sandwich Place,Café
1,Al Bateen,24.2151,55.6263,,,,,,
2,Al Dhafrah,24.4761,54.3694,1.0,Middle Eastern Restaurant,Café,Fried Chicken Joint,Furniture / Home Store,Fast Food Restaurant
3,Al Falah,24.4447,54.7282,,,,,,
4,Al Karama,25.2444,55.3048,3.0,Indian Restaurant,Asian Restaurant,Bakery,Ice Cream Shop,Cafeteria
5,Al Khubeirah,24.4652,54.3368,1.0,Café,Coffee Shop,Fast Food Restaurant,Donut Shop,Gym / Fitness Center
6,Al Lulu Island,24.4996,54.3457,,,,,,
7,Al Madina,24.3409,54.4907,,,,,,
8,Al Maryah Island,24.5021,54.3902,1.0,Café,Coffee Shop,Sushi Restaurant,Middle Eastern Restaurant,American Restaurant
9,Al Manhal,24.4666,54.366,2.0,Café,Hot Spring,Medical Center,Coffee Shop,Hotel


Since we used our original dataframe some of the neighborhoods in which no venues were found on Foursquare reappeared here.  
Let's drop these NA's. Also we will change the cluster values to integer.

In [84]:
AD_merged=AD_merged.dropna()
AD_merged['Cluster_Labels'] = AD_merged.Cluster_Labels.astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [85]:
AD_merged

Unnamed: 0,Neighborhood,Lat,Lon,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Al Aman,24.432,54.4266,0,Hotel,Coffee Shop,Fast Food Restaurant,Sandwich Place,Café
2,Al Dhafrah,24.4761,54.3694,1,Middle Eastern Restaurant,Café,Fried Chicken Joint,Furniture / Home Store,Fast Food Restaurant
4,Al Karama,25.2444,55.3048,3,Indian Restaurant,Asian Restaurant,Bakery,Ice Cream Shop,Cafeteria
5,Al Khubeirah,24.4652,54.3368,1,Café,Coffee Shop,Fast Food Restaurant,Donut Shop,Gym / Fitness Center
8,Al Maryah Island,24.5021,54.3902,1,Café,Coffee Shop,Sushi Restaurant,Middle Eastern Restaurant,American Restaurant
9,Al Manhal,24.4666,54.366,2,Café,Hot Spring,Medical Center,Coffee Shop,Hotel
10,Al Maqtaa,24.4346,54.4544,1,Café,Coffee Shop,Middle Eastern Restaurant,Pharmacy,Shopping Mall
11,Al Markaziyah,24.4933,54.3667,1,Fast Food Restaurant,Hotel,Café,Italian Restaurant,Asian Restaurant
13,Al Nahyan,24.4684,54.3852,1,Café,Coffee Shop,Dessert Shop,Flower Shop,Bakery
14,Al Reef,24.4577,54.6737,4,Pool,Gym,Pizza Place,Convenience Store,Coffee Shop


Now we can map the neighborhoods differentiating the clusters with different colors.

In [86]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(AD_merged['Lat'], AD_merged['Lon'], AD_merged['Neighborhood'], AD_merged['Cluster_Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Let's inspect cluster 1, the most common one.

In [87]:
AD_merged.loc[AD_merged['Cluster_Labels'] == 1, AD_merged.columns[[0] + list(range(4, AD_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Al Dhafrah,Middle Eastern Restaurant,Café,Fried Chicken Joint,Furniture / Home Store,Fast Food Restaurant
5,Al Khubeirah,Café,Coffee Shop,Fast Food Restaurant,Donut Shop,Gym / Fitness Center
8,Al Maryah Island,Café,Coffee Shop,Sushi Restaurant,Middle Eastern Restaurant,American Restaurant
10,Al Maqtaa,Café,Coffee Shop,Middle Eastern Restaurant,Pharmacy,Shopping Mall
11,Al Markaziyah,Fast Food Restaurant,Hotel,Café,Italian Restaurant,Asian Restaurant
13,Al Nahyan,Café,Coffee Shop,Dessert Shop,Flower Shop,Bakery
18,Al Zahiyah,Middle Eastern Restaurant,Indian Restaurant,Hotel,Coffee Shop,Fast Food Restaurant
20,Bain Al Jisrain,Coffee Shop,Spa,Hotel,Italian Restaurant,Lebanese Restaurant
22,Masdar City,Sushi Restaurant,Italian Restaurant,Café,Supermarket,Fast Food Restaurant
28,Yas Island,Theme Park Ride / Attraction,Café,Coffee Shop,Clothing Store,Sporting Goods Shop


We can already get some insights from this table. We see that our business idea, a coffee shop, is very present is most areas. But let's take a look at Al Zahiyah.

In [88]:
AD_merged.loc[AD_merged['Neighborhood'] == 'Al Zahiyah']

Unnamed: 0,Neighborhood,Lat,Lon,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
18,Al Zahiyah,24.4933,54.3799,1,Middle Eastern Restaurant,Indian Restaurant,Hotel,Coffee Shop,Fast Food Restaurant


Ok, coffee shop appears only in 4th place. So perhaps there is an opportunity here.
This could give a potential business owner a starting point to hers/his research.  

I would like to remove coffee shops from the equation and recluster (is that a word?) the neighborhoods. It would be nice to see if maybe we get different neighborhoods that might be clustered together, which could suggest that a coffee shop would be a good investment in area X, because neighborhood Y has a couple, but it was clustered together with X when we removed coffee shops.  

However, our current database does not have enough venues to attemp this.  

Before we go, let's take a look at our other clusters.

In [89]:
AD_merged.loc[AD_merged['Cluster_Labels'] == 0, AD_merged.columns[[0] + list(range(4, AD_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Al Aman,Hotel,Coffee Shop,Fast Food Restaurant,Sandwich Place,Café


In [90]:
AD_merged.loc[AD_merged['Cluster_Labels'] == 2, AD_merged.columns[[0] + list(range(4, AD_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
9,Al Manhal,Café,Hot Spring,Medical Center,Coffee Shop,Hotel
16,Al Rowdah,Café,Coffee Shop,Cosmetics Shop,Wine Bar,Donut Shop


In [91]:
AD_merged.loc[AD_merged['Cluster_Labels'] == 3, AD_merged.columns[[0] + list(range(4, AD_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,Al Karama,Indian Restaurant,Asian Restaurant,Bakery,Ice Cream Shop,Cafeteria


In [92]:
AD_merged.loc[AD_merged['Cluster_Labels'] == 4, AD_merged.columns[[0] + list(range(4, AD_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
14,Al Reef,Pool,Gym,Pizza Place,Convenience Store,Coffee Shop


For this last cluster we see Al Reef, which is an up-and-coming residential area. Maybe it could use another coffee shop?

## Discussion and Final Thoughts...

As you can see, there a lot of tools to work with here and make some interesting analysis. Of course, we could always use more data. We can find the locations of the other neighborhoods, even if we have to do it manually.  

We showed that there aren't many venues listed on Foursquare for some areas. We could use other databases to see if we get better results. It would also be interesting to integrate this with information about if an area is a business area or a more residential one. As mentioned, we can also explore further the clustering method as well.  

However, even with limitations, even if we do not get clear cut conclusions for this exercise, we do get some things from it:
* We can at least get some insights and give a direction or focus for our research/work.
* This was a great learning experience. Even though I had seen most of these methods during my Master's, I had done everything with R. A refreshment on the subject with Python was a great way to learn a new programming language.
* Also, there was much more learned during the course that is not being used in this report. SQL, for example. Plus, as already mentioned in the Methodology, I was presented to a couple of really cool platforms to code and do analysis on, in IBM Skills Network Labs and IBM Cloud with Watson services. Have a look on those.

Thank you for reading this.