### Introduction 
Starting a restaurant business involves many challenges. Of all the decisions to be made, location is a major discussion point and a prime concern. One needs to choose a location carefully, keeping an eye on the footfall available in the area and the competition already present in the locality.  
 
Hyderabad, popularly known as the “City of Pearls”, is a 400-year-old metropolis. It is the sixth most populous urban agglomeration in India. Hyderabad is a lively city in the south of India with rich scope for the food business. According to a survey, it was voted the second-best city in India for doing food business. It is famous for its food, shopping, technology, arts, and pearls.  
 
<u>This project will help the audience address the following questions</u>: 

1. Which are some of the most lucrative localities in Hyderabad to set up a restaurant? 

2. Which are some good areas in Hyderabad where a business can be set up with minimum competition? 

3. How many clusters can the neighborhoods in Hyderabad be divided into? 

Techniques Used: 

**1. Web Scraping:** A list of all neighborhoods in Hyderabad is fetched from Wikipedia using the Python requests library and parsed with BeautifulSoup. Their latitude and longitude details are then looked up with the geocoder package, and the Foursquare API is used to access venues around these neighborhoods.

**2. Data Preprocessing and Cleansing:** A data frame is created that maps each neighborhood to its geolocation details and venues, indicating how many venues of each category are present in that area.

**3. Clustering:** K-means clustering is then used to cluster the neighborhoods based on the number of venues in the Restaurant category. The Folium library is used to visualize the map of Hyderabad and superimpose the clustered data points onto it.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
!pip install geocoder
import geocoder
!pip install folium
import folium
import re
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize  # transform JSON file into a pandas dataframe

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
%matplotlib inline

print('Libraries have been imported')

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
Libraries have been imported


### Neighborhood Data: 
Hyderabad has many neighborhoods. A list of the localities in Hyderabad can be found on Wikipedia. A total of 200 neighborhoods were found when the data from this Wikipedia category page was fetched. 

In [2]:
# Send the GET request
hyd_data = requests.get("https://en.wikipedia.org/wiki/Category:Neighbourhoods_in_Hyderabad,_India").text

# Parse data from the html into a beautifulsoup object
soup = BeautifulSoup(hyd_data, 'html.parser')

# Create a list to store neighbourhood data
hyd_neighbourhood_list = []

# Append the data parsed from HTML into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    hyd_neighbourhood_list.append(row.text)

# Create a new DataFrame from the list
hyd_neighbourhood_df = pd.DataFrame({"Neighborhood": hyd_neighbourhood_list})
hyd_neighbourhood_df.head(10)

Unnamed: 0,Neighborhood
0,A. C. Guards
1,A. S. Rao Nagar
2,Abhyudaya Nagar
3,Abids
4,Adibatla
5,Adikmet
6,Afzal Gunj
7,Aghapura
8,"Aliabad, Hyderabad"
9,Alijah Kotla


In [3]:
# Define a function that gets coordinates data from the geocoder package
def get_latlng(neighborhood):
    # Initialize the result to None so that the loop runs at least once
    lat_lng_coords = None
    # Loop until coordinates are returned
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Hyderabad, India'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

# Call the function to get the coordinates, store in a new list using list comprehension
hyd_coords = [ get_latlng(neighborhood) for neighborhood in hyd_neighbourhood_df["Neighborhood"].tolist()]

print('Coordinates fetched')

Coordinates fetched


### Geolocation Data: 
No latitude and longitude data was found on the page. Therefore, the Python package Geocoder was used to map each neighborhood to its latitude and longitude. These details were then transformed into a data frame. 
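
The `get_latlng` helper above keeps retrying until the geocoder answers, which can loop indefinitely if a name never resolves. The following is a minimal sketch of a bounded retry, not the notebook's method. The names `geocode_with_retry`, `max_attempts`, and `flaky_lookup` are hypothetical, and the `lookup` argument is an assumed stand-in for a call such as `geocoder.arcgis(query).latlng`, so the sketch runs without network access.

```python
from typing import Callable, List, Optional


def geocode_with_retry(
    lookup: Callable[[str], Optional[List[float]]],
    query: str,
    max_attempts: int = 3,
) -> Optional[List[float]]:
    """Call `lookup` until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
    return None  # the caller decides how to handle an unresolved name


# Hypothetical stand-in for a real geocoder: fails once, then succeeds.
calls = {"count": 0}


def flaky_lookup(query: str) -> Optional[List[float]]:
    calls["count"] += 1
    return None if calls["count"] < 2 else [17.3898, 78.47658]


abids = geocode_with_retry(flaky_lookup, "Abids, Hyderabad, India")
missing = geocode_with_retry(lambda q: None, "Nowhere, Hyderabad, India")
print(abids, missing)
```

With a cap in place, unresolved names come back as `None` and can be dropped or reviewed before the coordinates are loaded into a data frame.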

In [4]:
# Populate the coordinates data into a Pandas DataFrame
coords_df = pd.DataFrame(hyd_coords, columns=['Latitude', 'Longitude'])

# Merge the two Data Frames into one
hyd_neighbourhood_df['Latitude'] = coords_df['Latitude']
hyd_neighbourhood_df['Longitude'] = coords_df['Longitude']

# Print the shape of the merged data frame
print(hyd_neighbourhood_df.shape)
hyd_neighbourhood_df

(200, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,A. C. Guards,17.395015,78.459812
1,A. S. Rao Nagar,17.411200,78.508240
2,Abhyudaya Nagar,17.337650,78.564140
3,Abids,17.389800,78.476580
4,Adibatla,17.235790,78.541300
...,...,...,...
195,Secunderabad,17.442000,78.501920
196,Serilingampally,17.482160,78.323000
197,Shah-Ali-Banda,17.357390,78.473200
198,Shahran Market,17.364890,78.476290


In [6]:
# Get the geographical coordinates of Hyderabad
address = 'Hyderabad, India'
geolocator = Nominatim(user_agent="hyderabad_restaurant_analysis") # any descriptive application name works as the user agent
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Hyderabad, India are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Hyderabad, India are 17.360589, 78.4740613.


### Data Visualization: 
Python’s folium package is then used to superimpose this data onto the map of Hyderabad.

In [7]:
# Visualize the data on a map after populating it into a Pandas Data Frame
hyd_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers on the map
for lat, lng, neighborhood in zip(hyd_neighbourhood_df['Latitude'], hyd_neighbourhood_df['Longitude'], hyd_neighbourhood_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='blue', fill=True, fill_color='#3186cc', fill_opacity=0.7).add_to(hyd_map)

# Display the map
hyd_map

In [15]:
#Foursquare Credentials
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own Foursquare client secret
VERSION = '20210203' # Foursquare API version

In [16]:
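radius = 9999 # search radius around each neighbourhood, in metres
LIMIT = 900 # maximum number of venues requested per call
venues = [] # collects one tuple per venue returned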


### Venue Data 
The next step was to fetch the details of the venues in each of these neighborhoods using the Foursquare API. The dataset contains the Venue, Venue Latitude, Venue Longitude, and Venue Category. A total of about 180 venues were extracted using the Foursquare API. A data frame was then created mapping all the venues against the neighborhoods in the format below. 
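
To make the shape of the venue data concrete, here is a small sketch that flattens a response into the tuples used in this section. The `sample_response` dictionary is hand-written to mirror only the fields this notebook reads; it is not real API output. The helper `extract_venues` and the "Unknown" fallback for venues without a category are assumptions added for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hand-written sample shaped like the fields the notebook reads (not real API output).
sample_response: Dict[str, Any] = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Cafe A",
                               "location": {"lat": 17.39, "lng": 78.47},
                               "categories": [{"name": "Cafe"}]}},
                    {"venue": {"name": "Unlabelled Spot",
                               "location": {"lat": 17.40, "lng": 78.48},
                               "categories": []}},
                ]
            }
        ]
    }
}


def extract_venues(payload: Dict[str, Any], neighborhood: str,
                   lat: float, lng: float) -> List[Tuple]:
    """Flatten a response into (neighborhood, lat, lng, name, vlat, vlng, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumed fallback: label venues that carry no category as "Unknown".
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


rows = extract_venues(sample_response, "Abids", 17.3898, 78.47658)
for row in rows:
    print(row)
```

Guarding against an empty `groups` or `categories` list keeps one malformed venue from stopping the whole loop.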

In [None]:
# Fetch venue details for all the neighbourhoods of Hyderabad from Foursquare
for lat, long, neighborhood in zip(hyd_neighbourhood_df['Latitude'], hyd_neighbourhood_df['Longitude'], hyd_neighbourhood_df['Neighborhood']):
    
    # Create the API request URL
    url = (
        "https://api.foursquare.com/v2/venues/explore"
        "?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}"
    ).format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, long, radius, LIMIT)

    # Make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']

    # Keep only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood, lat, long,
            venue['venue']['name'],
            venue['venue']['location']['lat'],
            venue['venue']['location']['lng'],
            venue['venue']['categories'][0]['name'],
        ))

### Create Neighborhood Data Frame with Venues: 
Populate all details (location, latitude and longitude) in a Pandas Data Frame 
 

In [None]:
# Create a dataframe for all the venues mapped against all the neighbourhoods
venues_df = pd.DataFrame(venues)

# Defining the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

# Print the venues Data Frame
print(venues_df.shape)

venues_df.head()

In [None]:
# The number of venues that were returned for each neighbourhood
venues_df.groupby(["Neighborhood"]).count()

# Total number of unique categories that can be curated from all the returned values
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

# Displaying the first 50 Venue Category names
venues_df['VenueCategory'].unique()[:50]

In [None]:
# One hot encoding
hyd_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# Adding neighborhood column back to dataframe
hyd_onehot['Neighborhoods'] = venues_df['Neighborhood']

# Moving neighbourhood column to the first column
fixed_columns = [hyd_onehot.columns[-1]] + list(hyd_onehot.columns[:-1])
hyd_onehot = hyd_onehot[fixed_columns]

print(hyd_onehot.shape)

### Create Venues Data Frame: 

Merge the venue details against each of the neighborhoods to create a larger data frame. In this data frame, each neighborhood is analyzed by the venue categories present in it, and the most popular locations with more venues are considered ideal.
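
The one-hot encoding and grouping steps can be followed on a toy example. The four rows below are invented for illustration; only the column names and the `get_dummies` / `groupby(...).sum()` pattern come from this notebook. Passing `dtype=int` is an added assumption that keeps the indicator columns numeric across pandas versions.

```python
import pandas as pd

# Invented toy data: one row per venue, as in venues_df.
toy_venues = pd.DataFrame({
    "Neighborhood": ["Abids", "Abids", "Abids", "Adibatla"],
    "VenueCategory": ["Restaurant", "Restaurant", "ATM", "ATM"],
})

# One indicator column per venue category.
toy_onehot = pd.get_dummies(
    toy_venues[["VenueCategory"]], prefix="", prefix_sep="", dtype=int
)

# Put the neighbourhood name back as the first column.
toy_onehot.insert(0, "Neighborhoods", toy_venues["Neighborhood"])

# Summing the indicators gives a count of each category per neighbourhood.
toy_grouped = toy_onehot.groupby("Neighborhoods").sum().reset_index()
print(toy_grouped)
```

Each row of the grouped frame is one neighborhood, and each remaining column holds how many venues of that category were returned for it.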

In [13]:
# Grouping rows of neighbourhood by taking the sum of the frequency of occurrence of each category.
hyd_grouped=hyd_onehot.groupby(["Neighborhoods"]).sum().reset_index()

print(hyd_grouped.shape)
hyd_grouped

(200, 123)


Unnamed: 0,Neighborhoods,ATM,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Sporting Goods Shop,Sports Bar,Stadium,Steakhouse,Taxi Stand,Tea Room,Thai Restaurant,Train Station,Vegetarian / Vegan Restaurant,Zoo
0,A. C. Guards,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
1,A. S. Rao Nagar,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,2,0
2,Abhyudaya Nagar,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,1,0
3,Abids,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,2,0
4,Adibatla,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,Secunderabad,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,1,0,0,2,0
196,Serilingampally,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,2,0
197,Shah-Ali-Banda,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,2,1
198,Shahran Market,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,1


In [15]:
# Since this analysis is being done for a restaurant setup, let's see the total number of neighbourhoods with Restaurants
len((hyd_grouped[hyd_grouped["Restaurant"] > 0]))

187

In [16]:
# Creating a dataframe for Restaurants data only
hyd_restaurant = hyd_grouped[["Neighborhoods","Restaurant"]]

### Clustering: 

Cluster the neighborhoods using the K-means clustering technique and identify localities that already have a higher number of restaurants, to help answer the question of which neighborhoods are most suitable for opening new restaurants.
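
K-means assigns cluster ids arbitrarily, so a label such as 2 carries no meaning until it is compared with the data. The sketch below, on invented restaurant counts, clusters a single Restaurant column and then ranks the clusters by their mean count. The column name "Saturation Rank" and the explicit `n_init=10` are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented counts with three well-separated groups.
toy = pd.DataFrame({
    "Neighborhood": list("ABCDEFGHI"),
    "Restaurant": [0, 0, 1, 10, 11, 10, 20, 21, 22],
})

# Cluster on the single Restaurant column, as the notebook does.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy[["Restaurant"]])
toy["Cluster Labels"] = model.labels_

# Label ids are arbitrary, so order the clusters by mean restaurant count.
cluster_means = toy.groupby("Cluster Labels")["Restaurant"].mean().sort_values()
rank_of_label = {label: rank for rank, label in enumerate(cluster_means.index)}

# Hypothetical helper column: 0 = fewest restaurants, 2 = most.
toy["Saturation Rank"] = toy["Cluster Labels"].map(rank_of_label)
print(toy)
```

Ranking by mean count makes it explicit which cluster is the saturated one, rather than relying on the label number.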

In [17]:
# Setting the number of clusters to three
kclusters = 3
hyd_clustering = hyd_restaurant.drop(columns=["Neighborhoods"])

# Run k-means clustering algorithm
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(hyd_clustering)

# Checking cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 0, 1, 0, 1, 1, 1, 1, 1], dtype=int32)

In [18]:
# Creating a new dataframe that includes the restaurant count and the cluster label for each neighborhood.
hyd_merged = hyd_restaurant.copy()

# Add the clustering labels
hyd_merged["Cluster Labels"] = kmeans.labels_

hyd_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
hyd_merged.head(10)

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels
0,A. C. Guards,3,1
1,A. S. Rao Nagar,3,1
2,Abhyudaya Nagar,0,0
3,Abids,3,1
4,Adibatla,0,0
5,Adikmet,3,1
6,Afzal Gunj,2,1
7,Aghapura,3,1
8,"Aliabad, Hyderabad",3,1
9,Alijah Kotla,2,1


In [19]:
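# Adding latitude and longitude values to the existing dataframe
# Note: this assignment aligns rows by index, so it assumes both frames list the neighbourhoods in the same order
hyd_merged['Latitude'] = hyd_neighbourhood_df['Latitude']
hyd_merged['Longitude'] = hyd_neighbourhood_df['Longitude']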

# Sorting the results by Cluster Labels
hyd_merged.sort_values(["Cluster Labels"], inplace=True)
hyd_merged

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
146,"Nagaram, Medchal–Malkajgiri district",0,0,17.609930,78.491220
145,Nacharam,1,0,17.433510,78.566730
45,Chengicherla,0,0,17.437070,78.606840
56,Dundigal,0,0,17.593680,78.404020
137,Miyapur,0,0,17.421010,78.582460
...,...,...,...,...,...
116,Macha Bollaram,5,2,17.525910,78.376330
115,Lothkunta,4,2,17.494050,78.515140
173,Quthbullapur,5,2,17.505370,78.467490
175,Raidurg,5,2,17.424852,78.457255
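
Assigning a column from one data frame to another, as in the cell above, matches rows by index rather than by neighborhood name. Whether the two frames in this notebook are actually out of step cannot be judged from the output shown here. As a hedged alternative, the sketch below joins on the Neighborhood key. The small frames `coords` and `clusters` are illustrative, with a few values borrowed from the tables in this notebook.

```python
import pandas as pd

# Illustrative coordinate table in its original order.
coords = pd.DataFrame({
    "Neighborhood": ["Abids", "Adibatla", "Kondapur"],
    "Latitude": [17.3898, 17.23579, 17.4666],
    "Longitude": [78.47658, 78.5413, 78.35685],
})

# Illustrative cluster table: different order, and one neighbourhood absent.
clusters = pd.DataFrame({
    "Neighborhood": ["Kondapur", "Abids"],
    "Restaurant": [6, 3],
    "Cluster Labels": [2, 1],
})

# Index-based assignment: row 0 (Kondapur) receives row 0 of coords (Abids).
by_index = clusters.copy()
by_index["Latitude"] = coords["Latitude"]

# Key-based join: each row receives the coordinates of its own neighbourhood.
by_key = clusters.merge(coords, on="Neighborhood", how="left")

print(by_index[["Neighborhood", "Latitude"]])
print(by_key[["Neighborhood", "Latitude", "Longitude"]])
```

A left join keeps every clustered neighborhood and leaves the coordinates empty where no match exists, which makes gaps visible instead of silently shifting rows.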


### Visualize: 

Use Python’s folium package to visualize the clusters, color coding the three clusters created in the earlier step.
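
The map cell in this section picks colors with `rainbow[cluster-1]`, which works because a negative index wraps around to the end of the list. A more explicit option, sketched here under the assumption of a fixed three-color palette, is to build a label-to-color dictionary first. `PALETTE` and `cluster_colors` are hypothetical names, and the hex values are an arbitrary choice.

```python
from typing import Dict, Sequence

# Arbitrary three-colour palette (blue, green, red) chosen for illustration.
PALETTE = ["#1f77b4", "#2ca02c", "#d62728"]


def cluster_colors(labels: Sequence[int],
                   palette: Sequence[str] = PALETTE) -> Dict[int, str]:
    """Map each distinct cluster label to one colour from the palette."""
    unique = sorted(set(labels))
    if len(unique) > len(palette):
        raise ValueError("palette has fewer colours than clusters")
    return {label: palette[i] for i, label in enumerate(unique)}


color_of = cluster_colors([1, 1, 0, 2, 0])
print(color_of)
```

The resulting dictionary could then be passed to the marker call as `color=color_of[cluster]`, so each cluster's color can be read directly from the mapping.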

In [21]:
# Creating the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Setting color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(hyd_merged['Latitude'], hyd_merged['Longitude'], hyd_merged['Neighborhood'], hyd_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat,lon],radius=5,popup=label,color=rainbow[cluster-1],fill=True,fill_color=rainbow[cluster-1],fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

In [23]:
len(hyd_merged.loc[hyd_merged['Cluster Labels'] == 0])


23

In [24]:
len(hyd_merged.loc[hyd_merged['Cluster Labels'] == 1])


139

In [25]:
len(hyd_merged.loc[hyd_merged['Cluster Labels'] == 2])

38
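
The three counts above (23, 139, and 38) add up to the 200 neighborhoods in the data frame. As a brief sketch, the same tally can be produced in one call with `value_counts`; the toy labels below are invented for illustration.

```python
import pandas as pd

# Invented cluster labels for six neighbourhoods.
toy_labels = pd.DataFrame({"Cluster Labels": [1, 1, 0, 1, 2, 2]})

# Count the rows in each cluster and order the result by cluster id.
sizes = toy_labels["Cluster Labels"].value_counts().sort_index()
print(sizes)
```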

In [30]:
# Detailed cluster tables: Cluster = 0
hyd_merged.loc[hyd_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
146,"Nagaram, Medchal–Malkajgiri district",0,0,17.60993,78.49122
145,Nacharam,1,0,17.43351,78.56673
45,Chengicherla,0,0,17.43707,78.60684
56,Dundigal,0,0,17.59368,78.40402
137,Miyapur,0,0,17.42101,78.58246
131,Meerpet–Jillelguda,1,0,17.32964,78.53301
65,Gautham Nagar,1,0,17.32528,78.53086
165,Patancheru,0,0,17.52677,78.25234
126,Mallapur,1,0,17.28864,78.49796
70,Gundlapochampalli,1,0,17.58123,78.47761


In [31]:
# Detailed cluster tables: Cluster = 1
hyd_merged.loc[hyd_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
139,Moghalpura,2,1,17.358680,78.477120
138,Moazzam Jahi Market,2,1,17.384480,78.474420
198,Shahran Market,2,1,17.364890,78.476290
102,"Koti, Hyderabad",3,1,17.385940,78.483380
136,Mir Alam Tank,3,1,17.366203,78.457983
...,...,...,...,...,...
60,"Fateh Nagar, Hyderabad",3,1,17.458410,78.451810
94,Kavadiguda,2,1,17.422620,78.489390
58,Edi Bazar,3,1,17.344380,78.494210
57,ECIL X Roads,3,1,17.462026,78.559603


In [32]:
# Detailed cluster tables: Cluster = 2
hyd_merged.loc[hyd_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
15,Anandbagh,5,2,17.45787,78.53882
140,Moosapet,4,2,17.46705,78.42858
190,Sainikpuri,5,2,17.477175,78.52848
12,Amberpet,4,2,17.38582,78.51836
104,Kukatpally,6,2,17.48735,78.42087
187,Safilguda,5,2,17.46643,78.53565
10,Allwyn Colony,5,2,17.50337,78.41602
101,"Kothapet, Hyderabad",4,2,17.368347,78.525018
100,Kondapur,6,2,17.4666,78.35685
91,"Karkhana, Secunderabad",4,2,17.4581,78.49908


### Conclusion 
The purpose of this project was to assess the neighborhoods in Hyderabad and build a K-means clustering model to suggest lucrative locations for restaurateurs to set up a restaurant business. The neighborhood data was extracted from multiple online sources, including Wikipedia and the Foursquare API. Of the three clusters created, cluster 2 contained the neighborhoods with the greatest number of restaurants. Setting up a restaurant in an already saturated locality would mean intense competition, which an owner would want to avoid. Therefore, this cluster was eliminated from the discussion.  
We identified several localities in cluster 0 and cluster 1 within the city limits that would be ideal for a restaurant setup. These areas had less competition and were well within the city boundaries, where population density is sufficient to ensure footfall.  
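
One way to make "within the city limits" concrete is to keep only candidates inside a chosen distance of the city center. The sketch below is an assumption-laden illustration, not the method used in this project. The 15 km cutoff, the helper `haversine_km`, and the variable names are hypothetical. The center coordinates and the three candidate rows are taken from outputs shown earlier in this notebook.

```python
import math
from typing import List, Tuple


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# City centre as printed by the geocoding cell.
CENTER = (17.360589, 78.4740613)
MAX_KM = 15.0  # assumed cutoff, chosen only for illustration

# (name, latitude, longitude) rows copied from the cluster 0 table.
candidates: List[Tuple[str, float, float]] = [
    ("Nacharam", 17.43351, 78.56673),
    ("Patancheru", 17.52677, 78.25234),
    ("Mallapur", 17.28864, 78.49796),
]

# Keep only the candidates that fall inside the assumed radius.
nearby = [name for name, lat, lon in candidates
          if haversine_km(CENTER[0], CENTER[1], lat, lon) <= MAX_KM]
print(nearby)
```

A different cutoff, or a real boundary polygon, would change which localities pass, so the threshold should be treated as a tunable assumption.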
 