# Hipsterhood
### Applied Data Science Capstone Project (IBM Data Science Professional Certificate)
### *by Helder Reis, March 2019*

## Notebook

## 1. Introduction

**The goal of this project is to find a "Neighborhood Hipster Rating" ("Hipsterhood" for short) for neighborhoods in a city.** 
During the course we used the Foursquare API and clustering to segment the neighborhoods of the cities of New York and Toronto. <br />
For this project we will use similar techniques to derive that Hipsterhood rating from the frequency of venues in categories typically associated with the "hipster lifestyle" (Organic Grocery, Thrift / Vintage Store, Cupcake Shop, etc.). This part is tricky, since what counts as "hipster" is subjective. <br />
To better evaluate the results we will use the city of Madrid, where I have lived for a few years and which I therefore know well. <br />
We will find the rating of each neighborhood, then group the neighborhoods into 3 clusters (something like "Hipster Paradise", "Something for everyone", "Mostly Traditional"). <br />

Imports

In [1]:
import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup

!conda install -c conda-forge folium=0.5.0
import folium

from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  55.35 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  36.65 MB/s
vincent-0.4.4- 100% |################################| Time: 0:00:00  38.05 MB/s
folium-0.5.0-p 100% |################################| Time: 0:00:00  49.68 MB/s


## 2. Data

### 2.1 Madrid City data
According to [this Wikipedia entry](https://en.wikipedia.org/wiki/List_of_wards_of_Madrid), "Madrid, the capital city of Spain, is divided into 21 districts (distritos), which are further subdivided into 128 wards (barrios)." <br />
In that entry, each ward name links to a Wikipedia article that has the coordinates of that neighborhood.

In [2]:
#Get the HTML as text
html_doc = requests.get('https://en.wikipedia.org/wiki/List_of_wards_of_Madrid').text
#create a soup object of the whole page for html parsing
soup = BeautifulSoup(html_doc, 'html.parser')
#isolate the table we're interested in
table = soup.find('table', class_='wikitable sortable')

#### Get the list of neighborhoods
We are not interested in the Number and Image columns, but we want to keep the district, the ward and the coordinates we get by following each ward's link. The coordinates in Wikipedia are in DMS (degrees, minutes, seconds) format, for example 40°24′54″N, so we need to convert them into decimal degrees. At the end we create a pandas DataFrame with all the data.
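
As a worked example of the arithmetic involved, here is a minimal standalone sketch of a DMS to decimal conversion. The helper name `dms_to_decimal` and the second input string are hypothetical and only serve as illustration; this is not the notebook's own helper.

```python
import re

def dms_to_decimal(dms):
    """Convert a DMS string such as 40°24′54″N to decimal degrees."""
    # Pull out the numeric parts: degrees, minutes and (optionally) seconds
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", dms)]
    degrees, minutes = numbers[0], numbers[1]
    seconds = numbers[2] if len(numbers) > 2 else 0.0
    value = degrees + minutes / 60 + seconds / 3600
    # South and west coordinates are negative by convention
    return -value if dms.strip()[-1] in "SW" else value

# 40 + 24/60 + 54/3600 = 40.415
print(round(dms_to_decimal("40°24′54″N"), 6))
# Hypothetical input without seconds: 3 + 42/60 = 3.7, negated because of W
print(round(dms_to_decimal("3°42′W"), 6))
```

The first call reproduces the example from the text: 40 degrees plus 24/60 plus 54/3600 gives 40.415.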

In [3]:
#function that receives a relative link to wikipedia and returns coordinates
def getCoordinatesFromWikiLink(wikiLink):
    coordinates = None
    #Get the HTML as text (wikiLink is a relative link that already starts with '/')
    html_doc = requests.get('https://en.wikipedia.org'+wikiLink).text
    
    #create a soup object of the whole page for html parsing
    soup = BeautifulSoup(html_doc, 'html.parser')
    #isolate the parts we're interested in
    latitude = soup.find('span', class_='latitude')
    longitude = soup.find('span', class_='longitude')

    if latitude and longitude:
        coordinates = {'Latitude': dms2dd(latitude.find(text=True)), 'Longitude': dms2dd(longitude.find(text=True))}
    else:
        #some articles might not have coordinates
        coordinates = {'Latitude': 'None', 'Longitude': 'None'}
    return coordinates

#function that converts DMS to Decimal coordinates
def dms2dd(dms):
    #remove the symbols
    dms = dms.replace('°'," ")
    dms = dms.replace('′'," ")
    dms = dms.replace('″'," ")
    dmsList = dms.split()
    degrees = float(dmsList[0])
    minutes = float(dmsList[1])
    #sometimes seconds are omitted
    if len(dmsList) == 4:
        seconds = float(dmsList[2])
        direction = dmsList[3]
    else:
        seconds = 0
        direction = dmsList[2]
                
    dd = degrees + minutes/60 + seconds/(60*60)
    if direction == 'S' or direction == 'W':
        dd *= -1
    return dd


#parse the table entries into a list
cityData = []
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells)>0:
        #if first cell is District
        if cells[0].has_attr('rowspan'):
            currentDistrict = cells[0].find(text=True)
            ward = cells[2].find(text=True)
            link = cells[2].a['href']
            coords = getCoordinatesFromWikiLink(link)
        else:
            ward = cells[1].find(text=True)
            link = cells[1].a['href']
            coords = getCoordinatesFromWikiLink(link)

        #add only if latitude and longitude are not null
        if coords['Latitude'] != 'None' and coords['Longitude'] != 'None':
            cityData.append([currentDistrict, ward, coords['Latitude'], coords['Longitude']])
    
#transform into a pandas dataframe
column_names = ['District', 'Neighborhood', 'Latitude', 'Longitude']
dfCityData = pd.DataFrame(cityData, columns=column_names)

dfCityData.head()


| | District | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Centro | Palacio | 40.415 | -3.713333 |
| 1 | Centro | Embajadores | 40.408889 | -3.699722 |
| 2 | Centro | Cortes | 40.414167 | -3.698056 |
| 3 | Centro | Justicia | 40.423889 | -3.696389 |
| 4 | Centro | Universidad | 40.425278 | -3.708333 |


#### Draw city map

Let's visualize a map with our neighborhoods. We are also drawing a circle around the neighborhood center, to come up with an approximate area of influence of each neighborhood.

In [4]:
#get city coordinates
coords = getCoordinatesFromWikiLink('/wiki/Madrid')
latitude = coords['Latitude']
longitude = coords['Longitude']
city_map = folium.Map(location=[latitude, longitude], zoom_start=12)

#add markers of neighborhoods to map
for lat, lng, label in zip(dfCityData['Latitude'], dfCityData['Longitude'], dfCityData['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.Circle(
        [lat, lng],
        radius=400,
        popup=label,
        color='blue',
        fill=False,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(city_map)  

#show map
city_map

We can verify that the neighborhoods are where they are supposed to be (so our **city data is correct**) and a **400 meter radius** seems to give a good coverage without too much overlap.
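
The overlap claim can also be checked numerically. The sketch below uses the haversine formula (a standard great-circle approximation, not something the notebook itself uses) on two neighborhood centers taken from the table above; two 400 m circles overlap only if their centers are less than 800 m apart. The function name `haversine_m` is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Coordinates from the dfCityData preview
palacio = (40.415, -3.713333)
embajadores = (40.408889, -3.699722)

radius = 400
distance = haversine_m(*palacio, *embajadores)
overlap = distance < 2 * radius  # circles overlap if centers are closer than 2 radii
print(round(distance), overlap)
```

For this pair the centers are roughly 1.3 km apart, so the two circles do not overlap. Running the same check over all pairs would quantify the overlap for the whole city.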

### 2.2 Categories
Foursquare publishes a [list of categories](https://developer.foursquare.com/docs/resources/categories), which we will use to identify clearly "hipster" categories (Organic Grocery, Thrift / Vintage Store, Indie Theater, Cupcake Shop, etc.).

Foursquare credentials (**keys removed from this version**)

In [5]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20190301' # Foursquare API version

Get all the categories

In [6]:
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)
# make the GET request
results = requests.get(url).json()["response"]["categories"]

# let's create a list with the category names
categoriesList = []
for v in results:
    for category in v['categories']:
        categoriesList.append(category['name'])

print("We got", len(categoriesList), "categories")        
print(categoriesList)

We got 456 categories
['Amphitheater', 'Aquarium', 'Arcade', 'Art Gallery', 'Bowling Alley', 'Casino', 'Circus', 'Comedy Club', 'Concert Hall', 'Country Dance Club', 'Disc Golf', 'Exhibit', 'General Entertainment', 'Go Kart Track', 'Historic Site', 'Karaoke Box', 'Laser Tag', 'Memorial Site', 'Mini Golf', 'Movie Theater', 'Museum', 'Music Venue', 'Pachinko Parlor', 'Performing Arts Venue', 'Pool Hall', 'Public Art', 'Racecourse', 'Racetrack', 'Roller Rink', 'Salsa Club', 'Samba School', 'Stadium', 'Theme Park', 'Tour Provider', 'Water Park', 'Zoo', 'College Academic Building', 'College Administrative Building', 'College Auditorium', 'College Bookstore', 'College Cafeteria', 'College Classroom', 'College Gym', 'College Lab', 'College Library', 'College Quad', 'College Rec Center', 'College Residence Hall', 'College Stadium', 'College Theater', 'Community College', 'Fraternity House', 'General College & University', 'Law School', 'Medical School', 'Sorority House', 'Student Center', 'Tra
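
The loop above collects only the second level of the category tree. As a sketch, assuming deeper levels are nested under the same `categories` key (an assumption about the response shape, illustrated here with a hypothetical toy structure rather than real API output), a recursive helper could collect names at every depth:

```python
def flatten_categories(nodes):
    """Collect category names from a nested list of category dicts."""
    names = []
    for node in nodes:
        names.append(node["name"])
        # Recurse into sub-categories, if any
        names.extend(flatten_categories(node.get("categories", [])))
    return names

# Hypothetical toy structure shaped like the "categories" list in the response
sample = [
    {
        "name": "Arts & Entertainment",
        "categories": [
            {
                "name": "Movie Theater",
                "categories": [{"name": "Indie Movie Theater", "categories": []}],
            },
            {"name": "Museum", "categories": []},
        ],
    }
]

all_names = flatten_categories(sample)
print(all_names)
```

Unlike the notebook's loop, this sketch also includes the top-level names; whether that is desirable depends on how the ratings file is organized.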

Looking at the categories, there don't seem to be that many that could be clearly identified as a "hipster favorite". Hence we will add a "hipster rating" to each category, an integer from 1 to 3, where 1 is something clearly not "hipster" (Train Station, Country Dance Club), 2 is something that could be (Coffee Shop, Concert Hall) and 3 is most associated with the hipster lifestyle (Bistro, Vintage Store).
Since there is no way of doing this automatically, we will edit the values manually offline, then load a file with the rated categories.

In [7]:
#load the file from my github
!wget -q -O 'categoriesRated.csv' https://raw.githubusercontent.com/helder-a-reis/Coursera_Capstone/master/categoriesRated.csv
#convert it into a data frame
dfCategories = pd.read_csv('categoriesRated.csv')

Let's make the category the index and check some values

In [8]:
dfCategories.set_index('Category', inplace=True)
dfCategories.head(20)

| Category | Rating |
|---|---|
| Art Gallery | 2 |
| Bowling Alley | 1 |
| Casino | 1 |
| Circus | 1 |
| Comedy Club | 2 |
| Concert Hall | 2 |
| Country Dance Club | 1 |
| Disc Golf | 1 |
| Exhibit | 2 |
| General Entertainment | 2 |


### 2.3 Venue data
We need a list of all venues within a 400 m radius of each neighborhood center. We'll reuse the function from the course, but with the `search` endpoint instead of `explore`.

In [9]:
def getVenues(names, latitudes, longitudes, radius=400):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&intent=browse'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        for v in results:
            #some venues are Uncategorized, we want to exclude those
            if v['categories']:
                venues_list.append([(
                    name, 
                    v['name'], 
                    v['location']['lat'], 
                    v['location']['lng'],
                    v['categories'][0]['name'])
                ])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',  
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
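
As a side note, the request URL can also be assembled from a dictionary of parameters instead of a long format string, which keeps each parameter visible and handles escaping. This is a minimal sketch using only the standard library; the helper name `build_search_url` and the placeholder credentials are hypothetical, and the parameters are the same ones used in `getVenues`.

```python
from urllib.parse import urlencode

def build_search_url(lat, lng, client_id, client_secret, version, radius=400):
    """Build the venues/search URL with the same parameters as getVenues."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "intent": "browse",
    }
    # safe="," keeps the comma in the ll parameter readable
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params, safe=",")

# Placeholder credentials, for illustration only
url = build_search_url(40.415, -3.713333, "ID", "SECRET", "20190301")
print(url)
```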

Get the venues for our city

In [10]:
dfCityVenues = getVenues(names=dfCityData['Neighborhood'], latitudes=dfCityData['Latitude'], longitudes=dfCityData['Longitude'])
dfCityVenues.head()

| | Neighborhood | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|
| 0 | Palacio | Santa Iglesia Catedral de Santa María la Real ... | 40.415767 | -3.714516 | Church |
| 1 | Palacio | Palacio Real de Madrid | 40.41794 | -3.714259 | Palace |
| 2 | Palacio | Tienda La Rebelión de los Mandiles | 40.415133 | -3.713358 | Wine Shop |
| 3 | Palacio | Viaducto de Segovia | 40.41405 | -3.713542 | Monument / Landmark |
| 4 | Palacio | Cripta de la Catedral / Parroquia de Santa Mar... | 40.415356 | -3.713726 | Church |


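Check the size of our city data. Note that `DataFrame.size` counts cells (rows × columns), so with 5 columns the value below corresponds to 3682 venues.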

In [11]:
dfCityVenues.size

18410

## 3. Methodology

### 3.1 Frequency of venues
Similar to what we did for New York and Toronto, let's check the frequency of venue categories per neighborhood.

In [12]:
# one hot encoding
dfCity_onehot = pd.get_dummies(dfCityVenues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dfCity_onehot['Neighborhood'] = dfCityVenues['Neighborhood']

# group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
dfCity_grouped = dfCity_onehot.groupby('Neighborhood').mean().reset_index()

#### Let's print each neighborhood along with the top 10 most common venues
num_top_venues = 10

for hood in dfCity_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = dfCity_grouped[dfCity_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['category','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Villaverde Alto----
                      category  freq
0                          Bar  0.07
1               Farmers Market  0.07
2                         Café  0.03
3                 Burger Joint  0.03
4              Coworking Space  0.03
5                      Brewery  0.03
6                  Fish Market  0.03
7         Fast Food Restaurant  0.03
8  College Technology Building  0.03
9        General Entertainment  0.03


----Abrantes----
                    category  freq
0                   Bus Line  0.10
1                       Park  0.07
2               Optical Shop  0.07
3                       Bank  0.03
4                        Bar  0.03
5        Government Building  0.03
6          Other Repair Shop  0.03
7              Metro Station  0.03
8               Tech Startup  0.03
9  Middle Eastern Restaurant  0.03


----Acacias----
                    category  freq
0               Dance Studio  0.07
1                Pizza Place  0.07
2                       Park  0.07
3     

               category  freq
0                   Bar  0.10
1        General Travel  0.07
2                Office  0.03
3           Event Space  0.03
4                   Gym  0.03
5        Medical Center  0.03
6  Other Great Outdoors  0.03
7         Metro Station  0.03
8                Bakery  0.03
9     Electronics Store  0.03


----Canillas----
                       category  freq
0                        Office  0.10
1                      Bus Line  0.07
2  General College & University  0.07
3                           Pub  0.07
4                   Snack Place  0.07
5                Medical Center  0.03
6                   High School  0.03
7                General Travel  0.03
8                           Bar  0.03
9                 Metro Station  0.03


----Canillejas----
           category  freq
0    Airport Lounge  0.14
1            Office  0.07
2            School  0.07
3               Gym  0.07
4   Coworking Space  0.03
5  Asian Restaurant  0.03
6          Building  0.03
7   

                                   category  freq
0                                       Bar  0.25
1                        Spanish Restaurant  0.11
2                        Salon / Barbershop  0.07
3                                   Theater  0.07
4  Residential Building (Apartment / Condo)  0.07
5                                    Bakery  0.04
6                    Argentinian Restaurant  0.04
7                                    Church  0.04
8             Vegetarian / Vegan Restaurant  0.04
9                               Art Gallery  0.04


----Entrevías----
           category  freq
0     Train Station  0.14
1      Burger Joint  0.07
2       Coffee Shop  0.07
3            Office  0.07
4              Café  0.07
5          Bus Line  0.03
6              Bank  0.03
7    General Travel  0.03
8         Bike Shop  0.03
9  Sushi Restaurant  0.03


----Estrella----
                       category  freq
0                           Bar  0.10
1                        School  0.10
2          

            category  freq
0             Office  0.07
1            Brewery  0.07
2         Restaurant  0.07
3     Student Center  0.07
4  Elementary School  0.03
5    Doctor's Office  0.03
6     Emergency Room  0.03
7     Medical Center  0.03
8      Metro Station  0.03
9                Bar  0.03


----Los Ángeles----
                                   category  freq
0                                       Bar  0.07
1                        Miscellaneous Shop  0.07
2                            Student Center  0.07
3                          Tapas Restaurant  0.07
4                                   Brewery  0.03
5                                  Bus Line  0.03
6                                      Food  0.03
7                                    Office  0.03
8                                    Church  0.03
9  Residential Building (Apartment / Condo)  0.03


----Lucero----
             category  freq
0  Salon / Barbershop  0.10
1      Medical Center  0.07
2            Wine Bar  0.07
3 

               category  freq
0                   Bar  0.13
1                  Pool  0.07
2               Brewery  0.07
3        Student Center  0.07
4           Pizza Place  0.07
5  Other Great Outdoors  0.03
6           High School  0.03
7             BBQ Joint  0.03
8           College Gym  0.03
9    Mexican Restaurant  0.03


----Pradolongo----
              category  freq
0                 Park  0.14
1       Medical Center  0.07
2                  Bar  0.07
3    Food & Drink Shop  0.07
4      Automotive Shop  0.07
5                Plaza  0.03
6         Dessert Shop  0.03
7  Government Building  0.03
8        Grocery Store  0.03
9   Falafel Restaurant  0.03


----Prosperidad----
             category  freq
0                 Bar  0.07
1    Tapas Restaurant  0.07
2          Restaurant  0.07
3  Salon / Barbershop  0.07
4  Spanish Restaurant  0.07
5     Doctor's Office  0.03
6        Optical Shop  0.03
7        Trade School  0.03
8              Office  0.03
9         Supermarket  0.03


                      category  freq
0             Tapas Restaurant  0.14
1                  Coffee Shop  0.07
2                   Restaurant  0.07
3  Professional & Other Places  0.04
4                   Shoe Store  0.04
5                  Supermarket  0.04
6            College Classroom  0.04
7                    Bookstore  0.04
8                    Gastropub  0.04
9      Health & Beauty Service  0.04


----Universidad----
                category  freq
0      Food & Drink Shop  0.07
1            Coffee Shop  0.07
2          Metro Station  0.07
3                  Hotel  0.07
4                    Bar  0.03
5  Outdoors & Recreation  0.03
6        College Library  0.03
7            Supermarket  0.03
8     Falafel Restaurant  0.03
9       Stationery Store  0.03


----Valdeacederas----
             category  freq
0     Automotive Shop  0.13
1            Bus Line  0.07
2  Salon / Barbershop  0.03
3          Skate Park  0.03
4  Spanish Restaurant  0.03
5         Coffee Shop  0.03
6         

This already gives a good feel for what you can find in each neighborhood.
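
To make the frequency computation concrete, here is a small sketch on hypothetical toy data: one-hot encoding turns each venue into a row of 0s and 1s, and the per-neighborhood mean of those rows is the share of venues in each category.

```python
import pandas as pd

# Hypothetical toy data: 4 venues in neighborhood A, 2 in neighborhood B
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Café", "Park", "Park", "Bistro"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicators = frequency of each category per neighborhood
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

In neighborhood A, two of the four venues are bars, so the Bar frequency is 0.5; each row of frequencies sums to 1.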

### 3.2 Add category weight
So far we have the frequency of each venue category, but we want a weighted value: each frequency multiplied by the category rating created earlier.

This is our current data frame.

In [13]:
#dfCity_grouped.set_index('Neighborhood', inplace=True)
dfCity_grouped.head()

,Neighborhood,Accessories Store,Adult Education Center,Advertising Agency,Airport,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,Airport Tram,...,Warehouse,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Villaverde Alto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Abrantes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Acacias,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0
3,Adelfas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Aeropuerto,0.034483,0.0,0.0,0.034483,0.275862,0.068966,0.034483,0.034483,0.034483,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# transpose the dataframe so the categories become the rows
dfTranposed = dfCity_grouped.set_index('Neighborhood').transpose()
dfTranposed.index.names = ['Category']
dfTranposed.head()

Category,Villaverde Alto,Abrantes,Acacias,Adelfas,Aeropuerto,Alameda de Osuna,Almagro,Almenara,Almendrales,Aluche,...,Valdeacederas,Valdefuentes,Valdemarín,Valdezarza,Vallehermoso,Valverde,Ventas,Vinateros,Vista Alegre,Zofío
Accessories Store,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Adult Education Center,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Advertising Agency,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Airport,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,...,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Airport Gate,0.0,0.0,0.0,0.0,0.275862,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# multiply by matching index
dfWeighted = dfTranposed.mul(dfCategories['Rating'], axis=0)
#replace the NaN values
dfWeighted.fillna(0, inplace=True)
#transpose the df again
dfWeighted = dfWeighted.transpose()
dfWeighted.reset_index(inplace=True)

dfWeighted.head()

Category,Neighborhood,ATM,Accessories Store,Adult Boutique,Adult Education Center,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,Airport Gate,...,Well,Whisky Bar,Windmill,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Villaverde Alto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Abrantes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Acacias,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.068966,0.0,0.0,0.0,0.0
3,Adelfas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Aeropuerto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
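
The multiplication aligns the two objects on the category index, which explains why the weighted frame has columns (such as ATM) that the frequency frame did not. The sketch below shows this on hypothetical toy data: a category without a rating, or a rated category with no venues, yields NaN and therefore ends up as 0 after `fillna`.

```python
import pandas as pd

# Hypothetical frequencies: categories as rows, neighborhoods as columns
freq = pd.DataFrame(
    {"Hood A": [0.5, 0.25, 0.25], "Hood B": [0.0, 0.5, 0.5]},
    index=pd.Index(["Bar", "Bistro", "Mystery Spot"], name="Category"),
)
# Hypothetical ratings: "Mystery Spot" is unrated, "Zoo" has no venues
ratings = pd.Series({"Bar": 1, "Bistro": 3, "Zoo": 1}, name="Rating")

# Row-wise multiplication aligned on the category index
weighted_raw = freq.mul(ratings, axis=0)
weighted = weighted_raw.fillna(0)
print(weighted)
```

One consequence worth keeping in mind: any venue category missing from the ratings file contributes nothing to the weighted values.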


Let's print the frequencies with the new weighting

In [16]:
#before
print("----Justicia BEFORE----")
temp = dfCity_grouped[dfCity_grouped['Neighborhood'] == 'Justicia'].T.reset_index()
temp.columns = ['category','freq']
temp = temp.iloc[1:]
temp['freq'] = temp['freq'].astype(float)
temp = temp.round({'freq': 2})
print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
print('\n')

#after
print("----Justicia AFTER----")
temp = dfWeighted[dfWeighted['Neighborhood'] == 'Justicia'].T.reset_index()
temp.columns = ['category','freq']
temp = temp.iloc[1:]
temp['freq'] = temp['freq'].astype(float)
temp = temp.round({'freq': 2})
print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
print('\n')

----Justicia BEFORE----
                        category  freq
0                           Café  0.10
1                    Art Gallery  0.10
2                 Cosmetics Shop  0.07
3                    Music Store  0.07
4             Salon / Barbershop  0.07
5  Vegetarian / Vegan Restaurant  0.07
6                       Boutique  0.07
7                         Bistro  0.03
8                  Metro Station  0.03
9                     Restaurant  0.03


----Justicia AFTER----
                        category  freq
0  Vegetarian / Vegan Restaurant  0.21
1                    Art Gallery  0.21
2                           Café  0.21
3                    Music Store  0.14
4             Salon / Barbershop  0.14
5                 Cosmetics Shop  0.14
6                         Bistro  0.10
7                 Clothing Store  0.07
8            American Restaurant  0.07
9                         Market  0.07




We can see how "Vegetarian / Vegan Restaurant" (rating=3) jumped to the top: its frequency of 0.07 tripled to 0.21. Bistro, another of the "hipster" categories, also climbed, from 8th to 7th place.
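
The numbers can be checked by hand. The sketch below assumes Justicia returned 29 venues and uses venue counts inferred from the rounded frequencies printed above; these counts are assumptions, not values read from the data. The ratings for Art Gallery, Bistro and Vegetarian / Vegan Restaurant are the ones stated in this notebook.

```python
# Assumption: 29 venues in total, counts inferred from the rounded output
total_venues = 29
counts = {"Vegetarian / Vegan Restaurant": 2, "Art Gallery": 3, "Bistro": 1}
ratings = {"Vegetarian / Vegan Restaurant": 3, "Art Gallery": 2, "Bistro": 3}

before = {c: round(n / total_venues, 2) for c, n in counts.items()}
after = {c: round(n / total_venues * ratings[c], 2) for c, n in counts.items()}

for category in counts:
    print(category, before[category], "->", after[category])
```

Under these assumptions a category with 2 venues and rating 3 ends up level with a category with 3 venues and rating 2, which matches the tie at 0.21 in the weighted list.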

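Let's put the top 10 most common venue categories of each neighborhood in a dataframe. Note that this table is built from the unweighted frequencies in `dfCity_grouped`; the weighted values in `dfWeighted` are used for the clustering below.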

In [17]:
#A function to sort the categories in descending order.
def return_most_common_categories(row, num_top_categories):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_categories]

# create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_categories_sorted = pd.DataFrame(columns=columns)
neighborhoods_categories_sorted['Neighborhood'] = dfCity_grouped['Neighborhood']

for ind in np.arange(dfCity_grouped.shape[0]):
    neighborhoods_categories_sorted.iloc[ind, 1:] = return_most_common_categories(dfCity_grouped.iloc[ind, :], num_top_venues)

neighborhoods_categories_sorted.head()

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Villaverde Alto,Bar,Farmers Market,Pharmacy,Kids Store,Burger Joint,Fish Market,Brewery,Café,Food & Drink Shop,Men's Store
1,Abrantes,Bus Line,Park,Optical Shop,Middle Eastern Restaurant,Bar,Gas Station,Tech Startup,Lighthouse,Bookstore,Bowling Alley
2,Acacias,Pizza Place,Dance Studio,Park,Non-Profit,Trade School,College Academic Building,Office,General Entertainment,General College & University,Bus Line
3,Adelfas,Automotive Shop,Hotel,Spa,Supermarket,Auto Garage,Thrift / Vintage Store,Diner,Music Venue,Tapas Restaurant,Chinese Restaurant
4,Aeropuerto,Airport Gate,Office,Airport Lounge,Accessories Store,Travel Lounge,Coffee Shop,Duty-free Shop,Newsstand,Fast Food Restaurant,Men's Store


### 3.3 Create clusters
We will use *k*-means to cluster the neighborhood into 3 clusters.
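
For intuition about what *k*-means does, here is a minimal NumPy sketch of the basic iteration (Lloyd's algorithm) on hypothetical 2-D toy data. It is a simplified illustration, not scikit-learn's implementation: it starts from the first `k` rows, whereas scikit-learn uses a smarter initialization by default.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=50):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    # Simplification: use the first k rows as initial centers
    centers = X[:k].astype(float)
    for _ in range(n_iter):
        # Assignment step: each row goes to its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its rows
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: three well-separated groups of two points each
X = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0],
              [0.5, 0.0], [5.0, 5.5], [10.0, 0.5]])
labels, centers = kmeans_sketch(X, k=3)
print(labels)
```

The label numbers themselves carry no meaning; they only say which rows belong together. That is why the clusters have to be interpreted after the fact.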

In [18]:
# set number of clusters
kclusters = 3
city_grouped_clustering = dfWeighted.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
neighborhoods_categories_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

city_merged = dfCityData

# join the top-venue table with the city data to add latitude/longitude for each neighborhood
city_merged = city_merged.join(neighborhoods_categories_sorted.set_index('Neighborhood'), on='Neighborhood')

Let's look at the clusters

Cluster 0

In [19]:
city_merged.loc[city_merged['Cluster Labels'] == 0, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Palacio,Church,Monument / Landmark,Plaza,Government Building,Park,Spanish Restaurant,Museum,Wine Bar,Market,Bar
2,Cortes,Spanish Restaurant,Restaurant,Office,Hotel,Tapas Restaurant,Bookstore,Gift Shop,Laundromat,Bar,Bed & Breakfast
9,Legazpi,Bridge,Spanish Restaurant,Park,Bank,Plaza,Office,Bookstore,Salon / Barbershop,Design Studio,Metro Station
11,Palos de Moguer,Spanish Restaurant,Grocery Store,Hotel,Snack Place,School,Gastropub,Tech Startup,Medical Center,Spa,Cafeteria
13,Pacífico,Bakery,Spanish Restaurant,Pub,Bed & Breakfast,Organic Grocery,General Entertainment,Diner,Farmers Market,Bar,Church
16,Ibiza,Spanish Restaurant,Supermarket,Salon / Barbershop,Seafood Restaurant,Breakfast Spot,Restaurant,Coffee Shop,Bakery,College Academic Building,Mediterranean Restaurant
19,Recoletos,Boutique,Shoe Store,Spanish Restaurant,Restaurant,Brewery,Art Museum,Men's Store,Shoe Repair,Embassy / Consulate,Sandwich Place
22,La Guindalera,Spanish Restaurant,Building,Mediterranean Restaurant,Medical Center,Doctor's Office,Trade School,Motorcycle Shop,Storage Facility,Salon / Barbershop,Supermarket
24,Castellana,Office,Spanish Restaurant,Cocktail Bar,Building,Gym,Gastropub,Bar,Bank,Bakery,Miscellaneous Shop
40,Almagro,Spanish Restaurant,Hospital,Medical Center,Government Building,Doctor's Office,Boutique,Café,Embassy / Consulate,School,Salon / Barbershop


Cluster 1

In [20]:
city_merged.loc[city_merged['Cluster Labels'] == 1, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Justicia,Art Gallery,Café,Salon / Barbershop,Vegetarian / Vegan Restaurant,Music Store,Cosmetics Shop,Boutique,Bistro,Medical Center,Market
4,Universidad,Coffee Shop,Food & Drink Shop,Hotel,Metro Station,Sandwich Place,Stationery Store,Music Store,Lottery Retailer,Outdoors & Recreation,Supermarket
5,Sol,Bank,Government Building,Shoe Store,Plaza,Mobile Phone Shop,Fast Food Restaurant,Metro Station,Sporting Goods Shop,Mediterranean Restaurant,Event Space
7,Acacias,Pizza Place,Dance Studio,Park,Non-Profit,Trade School,College Academic Building,Office,General Entertainment,General College & University,Bus Line
8,Chopera,Laundromat,Coffee Shop,Park,Bank,Garden,Other Great Outdoors,Fair,Bar,Gas Station,Tapas Restaurant
10,Delicias,Office,Nursery School,Gym / Fitness Center,Restaurant,Bridge,Café,Tennis Court,Salon / Barbershop,Sandwich Place,Market
12,Atocha,Train Station,Office,Café,Pet Store,Bank,Language School,General Travel,General College & University,Bakery,Light Rail Station
14,Adelfas,Automotive Shop,Hotel,Spa,Supermarket,Auto Garage,Thrift / Vintage Store,Diner,Music Venue,Tapas Restaurant,Chinese Restaurant
17,Jerónimos,Art Gallery,Art Museum,Café,Monument / Landmark,Restaurant,Church,Fountain,Diner,Exhibit,Mediterranean Restaurant
20,Goya,Cosmetics Shop,Bookstore,Pharmacy,Metro Station,Shoe Store,Mobile Phone Shop,Restaurant,Lingerie Store,Tanning Salon,Food


Cluster 2

In [21]:
city_merged.loc[city_merged['Cluster Labels'] == 2, city_merged.columns[[1] + list(range(5, city_merged.shape[1]))]]

,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Embajadores,Bar,Spanish Restaurant,Theater,Residential Building (Apartment / Condo),Salon / Barbershop,Art Gallery,Vegetarian / Vegan Restaurant,Bakery,Miscellaneous Shop,Diner
6,Imperial,Bar,Office,General College & University,Plaza,Spa,Government Building,Cocktail Bar,Coffee Shop,Gas Station,Chinese Restaurant
15,Estrella,Bar,School,Sports Club,Pool,Tech Startup,Basketball Court,Student Center,Bus Line,Garden Center,Theme Park Ride / Attraction
18,Niño Jesús,Bar,Plaza,Other Great Outdoors,College Arts Building,Office,Park,Tapas Restaurant,Bank,Tech Startup,Restaurant
23,Lista,Bar,Bakery,Salon / Barbershop,Asian Restaurant,Office,Bank,Doctor's Office,Supermarket,Electronics Store,Market
26,Prosperidad,Restaurant,Bar,Salon / Barbershop,Tapas Restaurant,Spanish Restaurant,Medical Center,Laboratory,Coffee Shop,Trade School,Office
27,Ciudad Jardín,Café,Bar,Sporting Goods Shop,General Entertainment,Smoke Shop,Salon / Barbershop,Church,Diner,Burger Joint,Language School
31,Bellas Vistas,Bar,Grocery Store,Spanish Restaurant,Event Space,Fast Food Restaurant,General College & University,Market,Martial Arts Dojo,Auto Garage,Other Great Outdoors
34,Almenara,Building,Office,Café,Bar,Government Building,Salon / Barbershop,Flea Market,Library,Dive Bar,Dog Run
36,Berruguete,Tapas Restaurant,Bar,Deli / Bodega,Spanish Restaurant,Cocktail Bar,Automotive Shop,Bakery,Gastropub,Residential Building (Apartment / Condo),Nail Salon


Looking at our category ratings and the clusters created, we can say that:
* Cluster 1 has the most rating-3 categories
* Cluster 0 has the second most rating-3 categories
* Cluster 2 has the fewest rating-3 categories

So we can renumber the clusters to match our ratings (3 = most "hipster", 1 = least).
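
The ranking above was done by inspection. As a possible alternative (not the method used in this notebook), the clusters could be ordered automatically by the mean of a per-neighborhood score, here taken as the sum of the weighted frequencies. The sketch below uses hypothetical toy values and hypothetical cluster labels.

```python
import pandas as pd

# Hypothetical weighted frequencies (frequency x rating) per neighborhood
weighted = pd.DataFrame(
    {
        "Bar": [0.8, 0.7, 0.2, 0.1, 0.5, 0.4],
        "Bistro": [0.3, 0.6, 1.8, 2.1, 0.9, 1.2],
        "Art Gallery": [0.2, 0.2, 0.4, 0.4, 0.4, 0.4],
    },
    index=["Hood A", "Hood B", "Hood C", "Hood D", "Hood E", "Hood F"],
)
# Hypothetical k-means labels for the same neighborhoods
labels = pd.Series([2, 2, 0, 0, 1, 1], index=weighted.index)

# Score per neighborhood, then mean score per cluster
score = weighted.sum(axis=1)
cluster_mean = score.groupby(labels).mean()

# Rank clusters from lowest (1) to highest mean score and relabel
rank = cluster_mean.rank(method="first").astype(int)
relabeled = labels.map(rank)
print(relabeled)
```

With this toy data the original label 0 becomes 3 (highest mean score), 1 becomes 2 and 2 becomes 1. Whether a summed score agrees with the manual ranking on the real data would need to be checked.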

In [22]:
city_merged['Cluster Labels'] = city_merged['Cluster Labels'].map({0: 2, 1: 3, 2: 1})
city_merged.head()

,District,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Centro,Palacio,40.415,-3.713333,2,Church,Monument / Landmark,Plaza,Government Building,Park,Spanish Restaurant,Museum,Wine Bar,Market,Bar
1,Centro,Embajadores,40.408889,-3.699722,1,Bar,Spanish Restaurant,Theater,Residential Building (Apartment / Condo),Salon / Barbershop,Art Gallery,Vegetarian / Vegan Restaurant,Bakery,Miscellaneous Shop,Diner
2,Centro,Cortes,40.414167,-3.698056,2,Spanish Restaurant,Restaurant,Office,Hotel,Tapas Restaurant,Bookstore,Gift Shop,Laundromat,Bar,Bed & Breakfast
3,Centro,Justicia,40.423889,-3.696389,3,Art Gallery,Café,Salon / Barbershop,Vegetarian / Vegan Restaurant,Music Store,Cosmetics Shop,Boutique,Bistro,Medical Center,Market
4,Centro,Universidad,40.425278,-3.708333,3,Coffee Shop,Food & Drink Shop,Hotel,Metro Station,Sandwich Place,Stationery Store,Music Store,Lottery Retailer,Outdoors & Recreation,Supermarket


### 3.4 Visualize map


In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(city_merged['Latitude'], city_merged['Longitude'], city_merged['Neighborhood'], city_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

