# Introduction

Delhi and Mumbai are two of India's major cities. Both are well-known centres for housing, employment, tourism, education, shopping and sport, which makes them popular choices for local and foreign communities alike.<br/><br/>

<li><b>Delhi:</b> India’s capital territory is a massive metropolitan area in the country’s north. In Old Delhi, a neighborhood dating to the 1600s, stand the imposing Mughal-era Red Fort, a symbol of India, and the sprawling Jama Masjid mosque, whose courtyard accommodates 25,000 people. Nearby is Chandni Chowk, a vibrant bazaar filled with food carts, sweet shops and spice stalls. Delhi has a total area of 1,484 km² and had a population of 18.98 million as of 2012.</li><br/>

<li><b>Mumbai:</b> Also known as Bombay, Mumbai is the capital city of the Indian state of Maharashtra. A financial center, it is India's largest city. On the Mumbai Harbour waterfront stands the iconic Gateway of India stone arch, built by the British Raj in 1924. Offshore, nearby Elephanta Island holds ancient cave temples dedicated to the Hindu god Shiva. The city is also famous as the heart of the Bollywood film industry. As of 2011, it is the most populous city in India, with an estimated city-proper population of 12.4 million.</li>

# Objective
The purpose of this project is to compare how similar or dissimilar the districts of Delhi and Mumbai are. This is done by classifying areas using Foursquare venue data and machine-learning segmentation and clustering.

# Data

The data was gathered from Wikipedia and organized into CSV files for easier manipulation. The CSV files are attached to this project.<br/><br/>

<li><a>https://github.com/pareshprakash/capstone-project/blob/master/Delhi-District.csv</a></li>
<li><a>https://github.com/pareshprakash/capstone-project/blob/master/Mumbai-District.csv</a></li>

Keep in mind that the data returned by the Foursquare API is neither complete nor fully accurate.

In [1]:
#import required libraries
import numpy as np
import pandas as pd

#read data for Delhi
df_delhi = pd.read_csv(r"C:\Users\shiva\Downloads\Delhi-District.csv")
df_delhi.head()

Unnamed: 0,Pincode,Location,District,State
0,110001,Baroda House,CENTRAL DELHI,DELHI
1,110001,Bengali Market,CENTRAL DELHI,DELHI
2,110001,Bhagat Singh Market,CENTRAL DELHI,DELHI
3,110001,Connaught Place,CENTRAL DELHI,DELHI
4,110001,Constitution House,CENTRAL DELHI,DELHI


In [2]:
#examine data
print("Delhi dataframe has {} district and {} locations".format(
        len(df_delhi['District'].unique()),
        df_delhi.shape[0]
    )
)

#group the data to find the district with the highest number of locations
df_delhi.groupby('District').count()

Delhi dataframe has 7 district and 73 locations


Unnamed: 0_level_0,Pincode,Location,State
District,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CENTRAL DELHI,5,5,5
EAST DELHI,7,7,7
NORTH DELHI,12,12,12
NORTH EAST DELHI,12,12,12
SOUTH DELHI,7,7,7
SOUTH WEST DELHI,20,20,20
WEST DELHI,10,10,10


In [3]:
#read data for Mumbai
df_mumbai = pd.read_csv(r"C:\Users\shiva\Downloads\Mumbai-District.csv")
df_mumbai.head()

Unnamed: 0,Pincode,Location,District,State
0,400004,Ambewadi,Ambewadi,Maharashtra
1,400004,Charni Road,Ambewadi,Maharashtra
2,400004,Chaupati,Ambewadi,Maharashtra
3,400004,Girgaon,Ambewadi,Maharashtra
4,400004,Madhavbaug,Ambewadi,Maharashtra


In [4]:
#examine data
print("Mumbai dataframe has {} district and {} locations".format(
        len(df_mumbai['District'].unique()),
        df_mumbai.shape[0]
    )
)

#group the data to find the district with the highest number of locations
df_mumbai.groupby('District').count()

Mumbai dataframe has 12 district and 114 locations


Unnamed: 0_level_0,Pincode,Location,State
District,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ambewadi,6,6,6
Andheri,7,7,7
Bhawani Shankar,10,10,10
Churchgate,16,16,16
Colaba,9,9,9
Goregaon,9,9,9
Malad,10,10,10
Mumbai Central,9,9,9
Mumbai East,17,17,17
Navi Mumbai,6,6,6


In [5]:
!pip install geocoder
#import geocoder to add latitudes and longitudes to each location
print('geocoder has been installed before.')
import geocoder
print('geocoder has been successfully imported.')

geocoder has been installed before.
geocoder has been successfully imported.


In [6]:
#function to get the latitude and longitude of a place in India (despite the parameter name, it is called with location names below)
def get_latlng(postal_code):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, India'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords
    
get_latlng('M4G')

[23.379379735000043, 79.44332654800007]

In [7]:
#geocode every Delhi location name (the Location column, not the Pincode, is passed to get_latlng)
postal_delhi_codes = df_delhi['Location']
coords = [ get_latlng(postal_code) for postal_code in postal_delhi_codes.tolist() ]

#add latitude and longitude columns to the Delhi dataframe
df_delhi_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df_delhi['Latitude'] = df_delhi_coords['Latitude']
df_delhi['Longitude'] = df_delhi_coords['Longitude']
df_delhi.head(10)

Unnamed: 0,Pincode,Location,District,State,Latitude,Longitude
0,110001,Baroda House,CENTRAL DELHI,DELHI,28.61648,77.22925
1,110001,Bengali Market,CENTRAL DELHI,DELHI,28.62919,77.23216
2,110001,Bhagat Singh Market,CENTRAL DELHI,DELHI,28.97528,77.71057
3,110001,Connaught Place,CENTRAL DELHI,DELHI,28.63396,77.21979
4,110001,Constitution House,CENTRAL DELHI,DELHI,-33.92413,18.42088
5,110005,Election Commission,SOUTH DELHI,DELHI,30.74108,76.77884
6,110005,Anand Parbat Indl. Area,SOUTH DELHI,DELHI,28.66585,77.17347
7,110005,Anand Parbat,SOUTH DELHI,DELHI,28.66585,77.17347
8,110005,Bank Street,SOUTH DELHI,DELHI,43.65962,-70.25125
9,110005,Desh Bandhu Gupta Road,SOUTH DELHI,DELHI,28.64519,77.21281
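
Several coordinates in the table above fall far outside Delhi (for example, Constitution House has a negative latitude and Bank Street a negative longitude). One simple way to flag such rows is a bounding-box check, sketched below. The box limits are rough values assumed for illustration, not an official boundary, and `DELHI_BOX` and `in_box` are hypothetical names.

```python
# Approximate bounding box for the National Capital Territory of Delhi.
# These limits are an assumption for illustration, not an official boundary.
DELHI_BOX = {"lat_min": 28.40, "lat_max": 28.89, "lng_min": 76.83, "lng_max": 77.35}

def in_box(lat, lng, box):
    """True if the point lies inside the latitude/longitude box."""
    return (box["lat_min"] <= lat <= box["lat_max"]
            and box["lng_min"] <= lng <= box["lng_max"])

# First five rows of the table above: (Location, Latitude, Longitude)
rows = [
    ("Baroda House", 28.61648, 77.22925),
    ("Bengali Market", 28.62919, 77.23216),
    ("Bhagat Singh Market", 28.97528, 77.71057),
    ("Connaught Place", 28.63396, 77.21979),
    ("Constitution House", -33.92413, 18.42088),
]

# names of the rows whose coordinates fall outside the assumed box
suspect = [name for name, lat, lng in rows if not in_box(lat, lng, DELHI_BOX)]
print(suspect)
```

With the dataframe, the same check could be written as a boolean mask over `df_delhi['Latitude']` and `df_delhi['Longitude']`.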


In [8]:
#add latitude and longitude to the Mumbai dataframe
postal_codes_mumbai = df_mumbai['Location']    
coords = [ get_latlng(postal_code) for postal_code in postal_codes_mumbai.tolist() ]

df_mumbai_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
df_mumbai['Latitude'] = df_mumbai_coords['Latitude']
df_mumbai['Longitude'] = df_mumbai_coords['Longitude']
df_mumbai.head(10)

Unnamed: 0,Pincode,Location,District,State,Latitude,Longitude
0,400004,Ambewadi,Ambewadi,Maharashtra,18.01874,76.94887
1,400004,Charni Road,Ambewadi,Maharashtra,18.95719,72.82477
2,400004,Chaupati,Ambewadi,Maharashtra,21.18535,72.80715
3,400004,Girgaon,Ambewadi,Maharashtra,16.44751,74.52361
4,400004,Madhavbaug,Ambewadi,Maharashtra,23.07582,72.56212
5,400004,Opera House,Ambewadi,Maharashtra,21.2358,72.86974
6,400052,Danda,Andheri,Maharashtra,24.12784,83.94542
7,400052,Khar Colony,Andheri,Maharashtra,30.70523,76.24135
8,400052,Khar Delivery,Andheri,Maharashtra,18.52061,73.85731
9,400052,Andheri,Andheri,Maharashtra,30.38924,77.12491


In [9]:
from geopy.geocoders import Nominatim
import folium

address = 'Delhi, India'
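geolocator = Nominatim(user_agent="delhi_mumbai_explorer")  # recent geopy versions require a user_agent; this name is a placeholder
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Delhi using latitude and longitude values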
map_delhi = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_delhi['Latitude'], df_delhi['Longitude'], df_delhi['District'], df_delhi['Location']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_delhi)  
    
map_delhi



In [10]:
address = 'Mumbai, India'
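geolocator = Nominatim(user_agent="delhi_mumbai_explorer")  # recent geopy versions require a user_agent; this name is a placeholder
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Mumbai using latitude and longitude values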
map_mumbai = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_mumbai['Latitude'], df_mumbai['Longitude'], df_mumbai['District'], df_mumbai['Location']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_mumbai)  
    
map_mumbai

  


# Methodology

The location names listed above were converted into latitude and longitude coordinates. Next, the Foursquare API is used to explore the neighborhoods of Delhi and Mumbai and to find the most common venue categories in each one. These category frequencies serve as features for grouping the neighborhoods into clusters with the K-means algorithm. Finally, the Folium library is used to visualize the neighborhoods and their emerging clusters.

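The analysis below focuses on one district in each city: South Delhi in Delhi and Mumbai Central in Mumbai. Note that, by the group counts above, these are not the districts with the most locations: South Delhi has 7 locations (South West Delhi has 20) and Mumbai Central has 9 (Mumbai East has 17).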

In [11]:
#slice the original dataframe and create a new dataframe for the South Delhi district
sDelhi = df_delhi[df_delhi['District'] == 'SOUTH DELHI'].reset_index(drop=True)

#get the geographical coordinates of South Delhi
address = 'SOUTH DELHI, India'
geolocator = Nominatim(user_agent="delhi_mumbai_explorer")  # placeholder user_agent; recent geopy versions require one
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of South Delhi using latitude and longitude values
map_sDelhi = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(sDelhi['Latitude'], sDelhi['Longitude'], sDelhi['Location']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sDelhi)  
    
map_sDelhi

  


In [12]:
#slice the original dataframe and create a new dataframe for the Mumbai Central district
mCentral = df_mumbai[df_mumbai['District'] == 'Mumbai Central'].reset_index(drop=True)

#get the geographical coordinates of Mumbai Central
address = 'Mumbai Central, India'
geolocator = Nominatim(user_agent="delhi_mumbai_explorer")  # placeholder user_agent; recent geopy versions require one
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Mumbai Central using latitude and longitude values
map_mCentral = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(mCentral['Latitude'], mCentral['Longitude'], mCentral['Location']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_mCentral)  
    
map_mCentral

  


The Foursquare API is now used to get venues in the area surrounding the first location of each selected district, South Delhi and Mumbai Central.

In [13]:
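import os # read credentials from the environment
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas.io.json.json_normalize is deprecated)

#Foursquare Credentials and Version
#credentials are read from environment variables instead of being hard-coded; the variable names are placeholders
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'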

#explore the first neighborhood in our dataframe
#Get the neighborhood's latitude and longitude values.
neighborhood_latitude = sDelhi.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = sDelhi.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = sDelhi.loc[0, 'Location'] # neighborhood name

#get the top 100 venues within a radius of 500 meters of this location
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

#Send the GET request and examine the results
results = requests.get(url).json()

#borrow the get_category_type function from the Foursquare lab.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # the flattened JSON uses the prefixed column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items']    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('{} venues were returned by Foursquare for South Delhi, Delhi'.format(nearby_venues.shape[0]))
nearby_venues.head()


11 venues were returned by Foursquare for South Delhi, Delhi


Unnamed: 0,name,categories,lat,lng
0,Softy Corner,Ice Cream Shop,30.740414,76.781619
1,Sector 17,Miscellaneous Shop,30.739541,76.782158
2,Indian Coffee House,Coffee Shop,30.740343,76.780902
3,Hot Millions 2,Fast Food Restaurant,30.740557,76.782547
4,Ghazal,Indian Restaurant,30.739055,76.783358
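
The request URL in the cell above is assembled with `str.format`. As an alternative, the sketch below builds the same query string with `urllib.parse.urlencode`, which takes care of escaping. The parameter names are the ones used in the notebook's URL; `build_explore_url` is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, for illustration only.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 30.74108, 76.77884, 500, 100)
print(url)
```

Passing the same dictionary as `params=` to `requests.get` would let `requests` do the encoding instead.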


In [14]:
#explore the first neighborhood in our dataframe
#Get the neighborhood's latitude and longitude values.
neighborhood_latitude = mCentral.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = mCentral.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = mCentral.loc[0, 'Location'] # neighborhood name

#get the top 100 venues within a radius of 500 meters of this location
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

#Send the GET request and examine the results
results = requests.get(url).json()

#clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items']    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('{} venues were returned by Foursquare for Mumbai Central, Mumbai'.format(nearby_venues.shape[0]))
nearby_venues.head()

3 venues were returned by Foursquare for Mumbai Central, Mumbai


Unnamed: 0,name,categories,lat,lng
0,CPT Square,Plaza,8.53595,76.990825
1,RP Swamy,Astrologer,8.53386,76.991325
2,Rajappan Fireworks,Fireworks Store,8.534635,76.997696
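
The venues returned for the first Mumbai Central location sit at a latitude of about 8.5, a long way south of Mumbai, which suggests the location was geocoded to the wrong place. A great-circle distance check makes this kind of problem easy to spot. The sketch below uses the haversine formula; the city-centre coordinates are an approximate reference point assumed for illustration, and `haversine_km` is a hypothetical helper.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Approximate centre of Mumbai; an assumed reference point for illustration.
MUMBAI_CENTRE = (19.0760, 72.8777)

# First venue in the table above (CPT Square)
venue = (8.53595, 76.990825)

distance = haversine_km(*MUMBAI_CENTRE, *venue)
print(round(distance), "km from the assumed city centre")
```

Any location whose distance from the reference point is far larger than the size of the city is a candidate for re-geocoding or removal before clustering.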


In [15]:
#function to repeat the same process for all areas
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Area', 
                  'Area Latitude', 
                  'Area Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#run the above function on each neighborhood and create a new dataframe
sDelhi_venues = getNearbyVenues(names=sDelhi['Location'],
                                   latitudes=sDelhi['Latitude'],
                                   longitudes=sDelhi['Longitude']
                                  )

#check the size of the resulting dataframe
print(sDelhi_venues.shape)
sDelhi_venues.head()

Election Commission  
Anand Parbat Indl. Area  
Anand Parbat  
Bank Street  
Desh Bandhu Gupta Road  
Karol Bagh  
Master Prithvi Nath Marg  
(166, 7)


Unnamed: 0,Area,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Election Commission,30.74108,76.77884,Softy Corner,30.740414,76.781619,Ice Cream Shop
1,Election Commission,30.74108,76.77884,Sector 17,30.739541,76.782158,Miscellaneous Shop
2,Election Commission,30.74108,76.77884,Indian Coffee House,30.740343,76.780902,Coffee Shop
3,Election Commission,30.74108,76.77884,Hot Millions 2,30.740557,76.782547,Fast Food Restaurant
4,Election Commission,30.74108,76.77884,Ghazal,30.739055,76.783358,Indian Restaurant
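
`getNearbyVenues` indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a response without groups or a venue without categories would raise an exception. The sketch below shows one defensive way to parse a payload of the same shape. `extract_venues` is a hypothetical helper, the sample payload is hand-written to match the fields used in the notebook, and "Unnamed Spot" is made up.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response.

    Missing groups yield an empty list; venues without categories get None.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((venue.get("name"), location.get("lat"), location.get("lng"), category))
    return rows

# Hand-written sample shaped like the response fields used in the notebook.
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Softy Corner",
                           "location": {"lat": 30.740414, "lng": 76.781619},
                           "categories": [{"name": "Ice Cream Shop"}]}},
                {"venue": {"name": "Unnamed Spot",  # made-up venue with no category
                           "location": {"lat": 30.74, "lng": 76.78},
                           "categories": []}},
            ]
        }]
    }
}

print(extract_venues(sample))
print(extract_venues({}))  # a payload without a "response" key gives an empty list
```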


In [19]:
#run the above function on each neighborhood and create a new dataframe
mCentral_venues = getNearbyVenues(names=mCentral['Location'],
                                   latitudes=mCentral['Latitude'],
                                   longitudes=mCentral['Longitude']
                                  )

#check the size of the resulting dataframe
print(mCentral_venues.shape)
mCentral_venues.head()

Bharat Nagar
Grant Road
Swami Vivekand Road
Tardeo
Falkland Road
Kamathipura
Mumbai Central
Hajiali
Tulsiwadi
(73, 7)


Unnamed: 0,Area,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bharat Nagar,8.53727,76.99409,CPT Square,8.53595,76.990825,Plaza
1,Bharat Nagar,8.53727,76.99409,RP Swamy,8.53386,76.991325,Astrologer
2,Bharat Nagar,8.53727,76.99409,Rajappan Fireworks,8.534635,76.997696,Fireworks Store
3,Grant Road,18.95929,72.83108,Taj Ice Cream,18.960013,72.830779,Ice Cream Shop
4,Grant Road,18.95929,72.83108,Shalimar Restaurant,18.95818,72.832367,Indian Restaurant


In [20]:
#check how many venues were returned for each area
print('There are {} uniques categories in Delhi'.format(len(sDelhi_venues['Venue Category'].unique())))
sDelhi_venues.groupby('Area').count()

There are 73 uniques categories in Delhi


Unnamed: 0_level_0,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Area,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Anand Parbat,4,4,4,4,4,4
Anand Parbat Indl. Area,4,4,4,4,4,4
Bank Street,100,100,100,100,100,100
Desh Bandhu Gupta Road,35,35,35,35,35,35
Election Commission,11,11,11,11,11,11
Karol Bagh,6,6,6,6,6,6
Master Prithvi Nath Marg,6,6,6,6,6,6


In [21]:
#check how many venues were returned for each area
print('There are {} uniques categories in Mumbai.'.format(len(mCentral_venues['Venue Category'].unique())))
mCentral_venues.groupby('Area').count()

There are 35 uniques categories in Mumbai.


Unnamed: 0_level_0,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Area,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Bharat Nagar,3,3,3,3,3,3
Falkland Road,1,1,1,1,1,1
Grant Road,11,11,11,11,11,11
Hajiali,14,14,14,14,14,14
Kamathipura,5,5,5,5,5,5
Mumbai Central,17,17,17,17,17,17
Swami Vivekand Road,1,1,1,1,1,1
Tardeo,15,15,15,15,15,15
Tulsiwadi,6,6,6,6,6,6


# Analyze Delhi

In [22]:
# one hot encoding
sDelhi_onehot = pd.get_dummies(sDelhi_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sDelhi_onehot['Area'] = sDelhi_venues['Area'] 

# move neighborhood column to the first column
fixed_columns = [sDelhi_onehot.columns[-1]] + list(sDelhi_onehot.columns[:-1])
sDelhi_onehot = sDelhi_onehot[fixed_columns]

#examine the new dataframe size after one hot encoding
print('{} rows were returned after one hot encoding.'.format(sDelhi_onehot.shape[0]))

#group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
sDelhi_grouped = sDelhi_onehot.groupby('Area').mean().reset_index()

#examine the new dataframe size after grouping
print('{} rows were returned after grouping.'.format(sDelhi_grouped.shape[0]))

166 rows were returned after one hot encoding.
7 rows were returned after grouping.
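
One-hot encoding followed by `groupby('Area').mean()` gives, for each area, the share of its venues that fall in each category. The sketch below computes the same quantity with plain Python counters, which may make the arithmetic easier to follow. The function name `category_frequencies` is hypothetical and the input pairs are toy data, not API results.

```python
from collections import Counter, defaultdict

def category_frequencies(pairs):
    """Map each area to {category: share of that area's venues}."""
    counts = defaultdict(Counter)
    for area, category in pairs:
        counts[area][category] += 1
    return {
        area: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for area, counter in counts.items()
    }

# (Area, Venue Category) pairs; toy data, not taken from the API results.
pairs = [
    ("Area A", "Hotel"), ("Area A", "Hotel"), ("Area A", "Café"), ("Area A", "Pizza Place"),
    ("Area B", "Train Station"), ("Area B", "Convenience Store"),
]

freqs = category_frequencies(pairs)
print(freqs["Area A"]["Hotel"])  # 2 of 4 venues -> 0.5
```

Categories that never occur in an area are simply absent here, whereas the one-hot dataframe stores them as explicit zeros.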


In [23]:
#print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in sDelhi_grouped['Area']:
    print("----"+hood+"----")
    temp = sDelhi_grouped[sDelhi_grouped['Area'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Anand Parbat  ----
                       venue  freq
0              Train Station  0.25
1           Business Service  0.25
2          Convenience Store  0.25
3           Airport Terminal  0.25
4  Indian Chinese Restaurant  0.00


----Anand Parbat Indl. Area  ----
                       venue  freq
0              Train Station  0.25
1           Business Service  0.25
2          Convenience Store  0.25
3           Airport Terminal  0.25
4  Indian Chinese Restaurant  0.00


----Bank Street  ----
                 venue  freq
0   Seafood Restaurant  0.07
1  American Restaurant  0.05
2          Coffee Shop  0.05
3                Hotel  0.05
4   Italian Restaurant  0.04


----Desh Bandhu Gupta Road  ----
         venue  freq
0        Hotel  0.40
1   Restaurant  0.09
2  Pizza Place  0.06
3       Hostel  0.06
4         Café  0.06


----Election Commission  ----
                  venue  freq
0     Indian Restaurant  0.18
1        Ice Cream Shop  0.18
2           Coffee Shop  0.09
3         

In [24]:
#put into a pandas dataframe

#write a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#create the new dataframe and display the top 8 venues for each area
num_top_venues = 8

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Area']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
areas_venues_sorted = pd.DataFrame(columns=columns)
areas_venues_sorted['Area'] = sDelhi_grouped['Area']

for ind in np.arange(sDelhi_grouped.shape[0]):
    areas_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sDelhi_grouped.iloc[ind, :], num_top_venues)

areas_venues_sorted

Unnamed: 0,Area,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,Anand Parbat,Train Station,Airport Terminal,Convenience Store,Business Service,Farmers Market,Comic Shop,Department Store,Dessert Shop
1,Anand Parbat Indl. Area,Train Station,Airport Terminal,Convenience Store,Business Service,Farmers Market,Comic Shop,Department Store,Dessert Shop
2,Bank Street,Seafood Restaurant,Hotel,American Restaurant,Coffee Shop,Italian Restaurant,Brewery,Bar,Pizza Place
3,Desh Bandhu Gupta Road,Hotel,Restaurant,Hostel,Café,Pizza Place,Coffee Shop,Indian Restaurant,Indian Chinese Restaurant
4,Election Commission,Indian Restaurant,Ice Cream Shop,Fast Food Restaurant,Department Store,Café,Coffee Shop,Shoe Store,Miscellaneous Shop
5,Karol Bagh,Fast Food Restaurant,Indian Restaurant,Bakery,Snack Place,Train Station,Electronics Store,Comic Shop,Convenience Store
6,Master Prithvi Nath Marg,Gift Shop,Asian Restaurant,Indian Restaurant,Business Service,Coffee Shop,Electronics Store,Farmers Market,Convenience Store
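
The column labels in this table are built by indexing a three-item suffix list and falling back to "th" on an exception. A small helper that returns the ordinal directly is one alternative, sketched below; `ordinal` is a hypothetical name. It also handles values such as 11 and 21, which only matters if `num_top_venues` is raised well beyond 8.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

num_top_venues = 8
columns = ["Area"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```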


# K-Means Clustering for South Delhi, Delhi

In [25]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 3

sDelhi_grouped_clustering = sDelhi_grouped.drop(columns='Area')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sDelhi_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
areas_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
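
To make clear what the clustering step does with the category-frequency rows, here is a bare-bones version of Lloyd's algorithm, the iteration at the core of K-means. It is a teaching sketch and not scikit-learn's implementation: `KMeans` uses k-means++ initialisation by default, while this sketch starts from centroids supplied by the caller. The vectors are toy data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: returns (labels, centroids).

    points and centroids are lists of equal-length numeric lists.
    """
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy frequency vectors over (Hotel, Train Station, Indian Restaurant); not real data.
points = [
    [0.40, 0.00, 0.05],
    [0.35, 0.00, 0.10],
    [0.00, 0.25, 0.00],
    [0.00, 0.25, 0.00],
    [0.00, 0.05, 0.40],
]

labels, centroids = kmeans(points, centroids=[points[0], points[2], points[4]])
print(labels)
```

Areas with identical frequency rows, such as the two Anand Parbat entries in the notebook, always receive the same label, because they are the same distance from every centroid.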

In [26]:
#create a new dataframe that includes the cluster as well as the top 8 venues for each area.
sDelhi_merged = sDelhi

# join sDelhi with areas_venues_sorted to add the cluster label and top venues to each location
sDelhi_merged = sDelhi_merged.join(areas_venues_sorted.set_index('Area'), on='Location')

sDelhi_merged

Unnamed: 0,Pincode,Location,District,State,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,110005,Election Commission,SOUTH DELHI,DELHI,30.74108,76.77884,0,Indian Restaurant,Ice Cream Shop,Fast Food Restaurant,Department Store,Café,Coffee Shop,Shoe Store,Miscellaneous Shop
1,110005,Anand Parbat Indl. Area,SOUTH DELHI,DELHI,28.66585,77.17347,2,Train Station,Airport Terminal,Convenience Store,Business Service,Farmers Market,Comic Shop,Department Store,Dessert Shop
2,110005,Anand Parbat,SOUTH DELHI,DELHI,28.66585,77.17347,2,Train Station,Airport Terminal,Convenience Store,Business Service,Farmers Market,Comic Shop,Department Store,Dessert Shop
3,110005,Bank Street,SOUTH DELHI,DELHI,43.65962,-70.25125,0,Seafood Restaurant,Hotel,American Restaurant,Coffee Shop,Italian Restaurant,Brewery,Bar,Pizza Place
4,110005,Desh Bandhu Gupta Road,SOUTH DELHI,DELHI,28.64519,77.21281,0,Hotel,Restaurant,Hostel,Café,Pizza Place,Coffee Shop,Indian Restaurant,Indian Chinese Restaurant
5,110005,Karol Bagh,SOUTH DELHI,DELHI,28.65156,77.18858,1,Fast Food Restaurant,Indian Restaurant,Bakery,Snack Place,Train Station,Electronics Store,Comic Shop,Convenience Store
6,110005,Master Prithvi Nath Marg,SOUTH DELHI,DELHI,28.65611,77.20108,0,Gift Shop,Asian Restaurant,Indian Restaurant,Business Service,Coffee Shop,Electronics Store,Farmers Market,Convenience Store


In [27]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Finally, let's visualize the resulting clusters
# centre the map on the median coordinates of the South Delhi locations
# (the median limits the pull of geocoding outliers such as Bank Street)
sDelhi_clusters = folium.Map(
    location=[sDelhi_merged['Latitude'].median(), sDelhi_merged['Longitude'].median()],
    zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sDelhi_merged['Latitude'], sDelhi_merged['Longitude'], sDelhi_merged['Location'], sDelhi_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(sDelhi_clusters)
       
sDelhi_clusters

# Analyze Mumbai

In [28]:
# one hot encoding
mCentral_onehot = pd.get_dummies(mCentral_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
mCentral_onehot['Area'] = mCentral_venues['Area'] 

# move neighborhood column to the first column
fixed_columns = [mCentral_onehot.columns[-1]] + list(mCentral_onehot.columns[:-1])
mCentral_onehot = mCentral_onehot[fixed_columns]

#examine the new dataframe size after one hot encoding
print('{} rows were returned after one hot encoding.'.format(mCentral_onehot.shape[0]))

#group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
mCentral_grouped = mCentral_onehot.groupby('Area').mean().reset_index()

#examine the new dataframe size after grouping
print('{} rows were returned after grouping.'.format(mCentral_grouped.shape[0]))

73 rows were returned after one hot encoding.
9 rows were returned after grouping.


In [29]:
#print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in mCentral_grouped['Area']:
    print("----"+hood+"----")
    temp = mCentral_grouped[mCentral_grouped['Area'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bharat Nagar----
             venue  freq
0            Plaza  0.33
1       Astrologer  0.33
2  Fireworks Store  0.33
3              ATM  0.00
4      Men's Store  0.00


----Falkland Road----
         venue  freq
0     Pharmacy   1.0
1          ATM   0.0
2        Plaza   0.0
3       Market   0.0
4  Men's Store   0.0


----Grant Road----
               venue  freq
0  Indian Restaurant  0.36
1       Dessert Shop  0.18
2             Arcade  0.09
3       Antique Shop  0.09
4        Snack Place  0.09


----Hajiali----
               venue  freq
0      Shopping Mall  0.14
1  Indian Restaurant  0.07
2      Deli / Bodega  0.07
3             Market  0.07
4          Juice Bar  0.07


----Kamathipura----
                 venue  freq
0    Indian Restaurant   0.4
1         Antique Shop   0.2
2  Arts & Crafts Store   0.2
3               Bakery   0.2
4                Plaza   0.0


----Mumbai Central----
                           venue  freq
0  Vegetarian / Vegan Restaurant  0.12
1           Fast 

In [30]:
#create the new dataframe and display the top 8 venues for each area
num_top_venues = 8

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Area']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
areas_venues_sorted = pd.DataFrame(columns=columns)
areas_venues_sorted['Area'] = mCentral_grouped['Area']

for ind in np.arange(mCentral_grouped.shape[0]):
    areas_venues_sorted.iloc[ind, 1:] = return_most_common_venues(mCentral_grouped.iloc[ind, :], num_top_venues)

areas_venues_sorted.head()

Unnamed: 0,Area,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,Bharat Nagar,Fireworks Store,Astrologer,Plaza,Vegetarian / Vegan Restaurant,Chinese Restaurant,Fast Food Restaurant,Electronics Store,Dessert Shop
1,Falkland Road,Pharmacy,Vegetarian / Vegan Restaurant,Ice Cream Shop,Fireworks Store,Fast Food Restaurant,Electronics Store,Dessert Shop,Deli / Bodega
2,Grant Road,Indian Restaurant,Dessert Shop,Arcade,Restaurant,Ice Cream Shop,Antique Shop,Snack Place,Arts & Crafts Store
3,Hajiali,Shopping Mall,Indian Restaurant,Electronics Store,Golf Course,Ice Cream Shop,Deli / Bodega,Italian Restaurant,Juice Bar
4,Kamathipura,Indian Restaurant,Antique Shop,Arts & Crafts Store,Bakery,Coffee Shop,Fireworks Store,Fast Food Restaurant,Electronics Store


In [31]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 3

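# drop the label column so that only venue frequencies are clustered
# (the axis must be given by keyword; the positional form drop('Area', 1) fails in pandas 2.x)
mCentral_grouped_clustering = mCentral_grouped.drop(columns='Area')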

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mCentral_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
areas_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
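
The value `kclusters = 3` above is fixed by hand. One common way to check such a choice is to compare silhouette scores for several values of k. The following is a minimal sketch under stated assumptions, not part of the original analysis: the frame `freq` and its values are invented stand-ins for `mCentral_grouped_clustering`, and `scores` and `best_k` are illustrative names.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for mCentral_grouped_clustering:
# one row per area, one column per venue category, values are frequencies.
freq = pd.DataFrame({
    'Indian Restaurant': [0.9, 0.8, 0.1, 0.2, 0.0, 0.0],
    'Shopping Mall':     [0.1, 0.2, 0.9, 0.8, 0.1, 0.2],
    'Pharmacy':          [0.0, 0.0, 0.0, 0.0, 0.9, 0.8],
})

scores = {}
# silhouette_score is only defined for 2 .. n_samples - 1 clusters
for k in range(2, len(freq)):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(freq)
    scores[k] = silhouette_score(freq, labels)

# the k with the highest mean silhouette is the best-separated choice
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With only nine areas in the real data, the scores are noisy, so a result like this is a hint rather than a proof that a given k is right.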

In [32]:
# create a new dataframe that includes the cluster label as well as the top 8 venues for each area
mCentral_merged = mCentral

# join areas_venues_sorted onto mCentral so each area keeps its latitude/longitude
mCentral_merged = mCentral_merged.join(areas_venues_sorted.set_index('Area'), on='Location')

mCentral_merged

Unnamed: 0,Pincode,Location,District,State,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,400007,Bharat Nagar,Mumbai Central,Maharashtra,8.53727,76.99409,0,Fireworks Store,Astrologer,Plaza,Vegetarian / Vegan Restaurant,Chinese Restaurant,Fast Food Restaurant,Electronics Store,Dessert Shop
1,400007,Grant Road,Mumbai Central,Maharashtra,18.95929,72.83108,0,Indian Restaurant,Dessert Shop,Arcade,Restaurant,Ice Cream Shop,Antique Shop,Snack Place,Arts & Crafts Store
2,400007,Swami Vivekand Road,Mumbai Central,Maharashtra,28.24905,77.07279,1,ATM,Chinese Restaurant,Fireworks Store,Fast Food Restaurant,Electronics Store,Dessert Shop,Deli / Bodega,Coffee Shop
3,400007,Tardeo,Mumbai Central,Maharashtra,18.97243,72.81483,0,Chinese Restaurant,Vegetarian / Vegan Restaurant,Fast Food Restaurant,Automotive Shop,Ice Cream Shop,Italian Restaurant,Bengali Restaurant,Platform
4,400008,Falkland Road,Mumbai Central,Maharashtra,21.67448,87.55792,2,Pharmacy,Vegetarian / Vegan Restaurant,Ice Cream Shop,Fireworks Store,Fast Food Restaurant,Electronics Store,Dessert Shop,Deli / Bodega
5,400008,Kamathipura,Mumbai Central,Maharashtra,18.96172,72.82627,0,Indian Restaurant,Antique Shop,Arts & Crafts Store,Bakery,Coffee Shop,Fireworks Store,Fast Food Restaurant,Electronics Store
6,400008,Mumbai Central,Mumbai Central,Maharashtra,18.96972,72.81507,0,Vegetarian / Vegan Restaurant,Chinese Restaurant,Fast Food Restaurant,Ice Cream Shop,Train Station,Pizza Place,Bengali Restaurant,Coffee Shop
7,400034,Hajiali,Mumbai Central,Maharashtra,18.97834,72.81214,0,Shopping Mall,Indian Restaurant,Electronics Store,Golf Course,Ice Cream Shop,Deli / Bodega,Italian Restaurant,Juice Bar
8,400034,Tulsiwadi,Mumbai Central,Maharashtra,22.31646,73.20885,0,Sandwich Place,Indian Restaurant,Juice Bar,Pool Hall,Bakery,Arts & Crafts Store,Astrologer,Arcade
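
Several coordinates in the table above fall far from Mumbai (for example latitude 8.53727 for Bharat Nagar), which suggests that the geocoder matched same-named places elsewhere. A simple bounding-box check can flag such rows before they reach the venue search or the map. The sketch below copies the locations and coordinates from the table; the bounding-box limits are a rough assumption for illustration, not an official boundary, and `areas`, `in_box` and `suspect` are illustrative names.

```python
import pandas as pd

# Locations and coordinates copied from the merged table above.
areas = pd.DataFrame({
    'Location': ['Bharat Nagar', 'Grant Road', 'Swami Vivekand Road', 'Tardeo',
                 'Falkland Road', 'Kamathipura', 'Mumbai Central', 'Hajiali',
                 'Tulsiwadi'],
    'Latitude': [8.53727, 18.95929, 28.24905, 18.97243, 21.67448,
                 18.96172, 18.96972, 18.97834, 22.31646],
    'Longitude': [76.99409, 72.83108, 77.07279, 72.81483, 87.55792,
                  72.82627, 72.81507, 72.81214, 73.20885],
})

# Assumed, approximate bounding box around Mumbai (illustrative values only).
LAT_MIN, LAT_MAX = 18.85, 19.30
LON_MIN, LON_MAX = 72.75, 73.00

# True where a row lies inside the box on both axes
in_box = (areas['Latitude'].between(LAT_MIN, LAT_MAX)
          & areas['Longitude'].between(LON_MIN, LON_MAX))

# Rows outside the box are candidates for re-geocoding or manual correction
suspect = areas.loc[~in_box, 'Location'].tolist()
print(suspect)
```

Rows flagged this way would need to be re-geocoded or corrected by hand, since venues fetched for a wrong coordinate describe a different place.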


In [33]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Finally, let's visualize the resulting clusters
# center the map on the Mumbai Central coordinates listed in the table above
mCentral_clusters = folium.Map(location=[18.96972, 72.81507], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per Mumbai Central area, colored by its cluster label
markers_colors = []
for lat, lon, poi, cluster in zip(mCentral_merged['Latitude'], mCentral_merged['Longitude'], mCentral_merged['Location'], mCentral_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster) - 1],
        fill=True,
        fill_color=rainbow[int(cluster) - 1],
        fill_opacity=0.7).add_to(mCentral_clusters)
       
mCentral_clusters

# Results

In [34]:
#Cluster 1 for Delhi
sDelhi_merged.loc[sDelhi_merged['Cluster Labels'] == 0, sDelhi_merged.columns[[2] + list(range(5, sDelhi_merged.shape[1]))]]

Unnamed: 0,District,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,SOUTH DELHI,76.77884,0,Indian Restaurant,Ice Cream Shop,Fast Food Restaurant,Department Store,Café,Coffee Shop,Shoe Store,Miscellaneous Shop
3,SOUTH DELHI,-70.25125,0,Seafood Restaurant,Hotel,American Restaurant,Coffee Shop,Italian Restaurant,Brewery,Bar,Pizza Place
4,SOUTH DELHI,77.21281,0,Hotel,Restaurant,Hostel,Café,Pizza Place,Coffee Shop,Indian Restaurant,Indian Chinese Restaurant
6,SOUTH DELHI,77.20108,0,Gift Shop,Asian Restaurant,Indian Restaurant,Business Service,Coffee Shop,Electronics Store,Farmers Market,Convenience Store


In [35]:
#Cluster 2 for Delhi
sDelhi_merged.loc[sDelhi_merged['Cluster Labels'] == 1, sDelhi_merged.columns[[2] + list(range(5, sDelhi_merged.shape[1]))]]

Unnamed: 0,District,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
5,SOUTH DELHI,77.18858,1,Fast Food Restaurant,Indian Restaurant,Bakery,Snack Place,Train Station,Electronics Store,Comic Shop,Convenience Store


In [36]:
#Cluster 3 for Delhi
sDelhi_merged.loc[sDelhi_merged['Cluster Labels'] == 2, sDelhi_merged.columns[[2] + list(range(5, sDelhi_merged.shape[1]))]]

Unnamed: 0,District,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
1,SOUTH DELHI,77.17347,2,Train Station,Airport Terminal,Convenience Store,Business Service,Farmers Market,Comic Shop,Department Store,Dessert Shop
2,SOUTH DELHI,77.17347,2,Train Station,Airport Terminal,Convenience Store,Business Service,Farmers Market,Comic Shop,Department Store,Dessert Shop


In [37]:
#Cluster 1 for Mumbai
mCentral_merged.loc[mCentral_merged['Cluster Labels'] == 0, mCentral_merged.columns[[2] + list(range(5, mCentral_merged.shape[1]))]]

Unnamed: 0,District,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
0,Mumbai Central,76.99409,0,Fireworks Store,Astrologer,Plaza,Vegetarian / Vegan Restaurant,Chinese Restaurant,Fast Food Restaurant,Electronics Store,Dessert Shop
1,Mumbai Central,72.83108,0,Indian Restaurant,Dessert Shop,Arcade,Restaurant,Ice Cream Shop,Antique Shop,Snack Place,Arts & Crafts Store
3,Mumbai Central,72.81483,0,Chinese Restaurant,Vegetarian / Vegan Restaurant,Fast Food Restaurant,Automotive Shop,Ice Cream Shop,Italian Restaurant,Bengali Restaurant,Platform
5,Mumbai Central,72.82627,0,Indian Restaurant,Antique Shop,Arts & Crafts Store,Bakery,Coffee Shop,Fireworks Store,Fast Food Restaurant,Electronics Store
6,Mumbai Central,72.81507,0,Vegetarian / Vegan Restaurant,Chinese Restaurant,Fast Food Restaurant,Ice Cream Shop,Train Station,Pizza Place,Bengali Restaurant,Coffee Shop
7,Mumbai Central,72.81214,0,Shopping Mall,Indian Restaurant,Electronics Store,Golf Course,Ice Cream Shop,Deli / Bodega,Italian Restaurant,Juice Bar
8,Mumbai Central,73.20885,0,Sandwich Place,Indian Restaurant,Juice Bar,Pool Hall,Bakery,Arts & Crafts Store,Astrologer,Arcade


In [38]:
#Cluster 2 for Mumbai
mCentral_merged.loc[mCentral_merged['Cluster Labels'] == 1, mCentral_merged.columns[[2] + list(range(5, mCentral_merged.shape[1]))]]

Unnamed: 0,District,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
2,Mumbai Central,77.07279,1,ATM,Chinese Restaurant,Fireworks Store,Fast Food Restaurant,Electronics Store,Dessert Shop,Deli / Bodega,Coffee Shop


In [39]:
#Cluster 3 for Mumbai
mCentral_merged.loc[mCentral_merged['Cluster Labels'] == 2, mCentral_merged.columns[[2] + list(range(5, mCentral_merged.shape[1]))]]

Unnamed: 0,District,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue
4,Mumbai Central,87.55792,2,Pharmacy,Vegetarian / Vegan Restaurant,Ice Cream Shop,Fireworks Store,Fast Food Restaurant,Electronics Store,Dessert Shop,Deli / Bodega


# Discussion

Based on the clusters found for each city above, we believe that the clusters could be classified better by calculating the frequencies of the most common venue categories in each city. Looking at each cluster, we cannot clearly determine what it represents from the Foursquare "Most Common Venue" data alone.

However, we can make an assumption about each cluster as follows:

<b>Cluster 1:</b> <i>Delhi:</i> Gift Shop<br/>
<b>Cluster 2:</b> <i>Delhi:</i> Restaurants<br/>
<b>Cluster 3:</b> <i>Delhi:</i> Train Station<br/>
<b>Cluster 1:</b> <i>Mumbai:</i> Indian Restaurants<br/>
<b>Cluster 2:</b> <i>Mumbai:</i> Mixed-Cuisine Restaurants<br/>
<b>Cluster 3:</b> <i>Mumbai:</i> Vegetarian Restaurants<br/>
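
The labels above were assigned by eye. One way to derive them from the data is to average the venue-category frequencies within each cluster and take the category with the highest mean. This is a minimal sketch under stated assumptions: the frame `grouped`, its values and the `labels` list are invented for illustration and stand in for a grouped frequency table and `kmeans.labels_`.

```python
import pandas as pd

# Hypothetical frequency table: one row per area, one column per venue category.
grouped = pd.DataFrame({
    'Area': ['A', 'B', 'C', 'D'],
    'Indian Restaurant': [0.40, 0.36, 0.00, 0.05],
    'Train Station':     [0.00, 0.05, 0.50, 0.45],
    'Gift Shop':         [0.10, 0.09, 0.05, 0.00],
})
# Hypothetical cluster label per area, in the same row order as `grouped`.
labels = [0, 0, 1, 1]

# Attach the labels to the numeric columns only
features = grouped.drop(columns='Area')
features['Cluster Labels'] = labels

# Mean frequency of every category inside each cluster
cluster_profile = features.groupby('Cluster Labels').mean()

# The category with the highest mean frequency names the cluster
cluster_names = cluster_profile.idxmax(axis=1)
print(cluster_names)
```

Inspecting the full `cluster_profile` rather than only the top category also shows how close the runner-up categories are, which indicates how trustworthy a single-word label is.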

What is lacking at this point is a systematic, quantitative way to identify and distinguish different districts and to describe the correlation between the most common venues recorded in Foursquare. The reality is more complex, however: similar cities may or may not have similar common venues. A further step in this classification would be to find a method to extract these common venues and integrate the spatial correlations between different areas or districts.
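
One possible way to put a number on how alike two districts are is the cosine similarity between their venue-frequency vectors. The sketch below is an assumption about how such a measure could be set up, not the method used in this project; the frames `delhi` and `mumbai` and all their values are invented for illustration.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical venue-frequency tables (rows: areas, columns: venue categories).
delhi = pd.DataFrame(
    {'Indian Restaurant': [0.30, 0.00],
     'Train Station':     [0.00, 0.50],
     'Hotel':             [0.20, 0.10]},
    index=['Delhi area 1', 'Delhi area 2'])
mumbai = pd.DataFrame(
    {'Indian Restaurant': [0.40, 0.07],
     'Shopping Mall':     [0.00, 0.14]},
    index=['Mumbai area 1', 'Mumbai area 2'])

# Put both cities on the same set of category columns, filling gaps with 0
categories = sorted(set(delhi.columns) | set(mumbai.columns))
delhi_aligned = delhi.reindex(columns=categories, fill_value=0.0)
mumbai_aligned = mumbai.reindex(columns=categories, fill_value=0.0)

# Entry (i, j) is the cosine similarity between Delhi area i and Mumbai area j
similarity = pd.DataFrame(
    cosine_similarity(delhi_aligned, mumbai_aligned),
    index=delhi.index, columns=mumbai.index)
print(similarity.round(2))
```

Values near 1 mean two areas have a similar mix of venue categories, and values near 0 mean they share almost none. Averaging the matrix, or taking each row's maximum, would give a single city-level score.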

We believe that the classification we propose is an encouraging step towards a quantitative and systematic comparison of different cities. Further studies are needed to relate the acquired data to more meaningful and objective results.

# Conclusion

Using the Foursquare API, we can capture data about common places all around the world. With it, we return to our main objectives, which were to determine:

<li>The similarities or dissimilarities between the two cities</li>
<li>The classification of areas within each city, whether as restaurant, gift shop, or other</li>
In conclusion, both Delhi and Mumbai are centers of attraction in India. However, declaring the two cities similar or dissimilar based on commonly visited venues is difficult: they are similar in some venues and dissimilar in others. For classification based on common venues, we again need a more systematic, quantitative way to identify and declare this. A comparison can be made, but we have no method or quantitative data to settle it. We hope that such a method can be established and explored in future work.