## Bangalore: The Battle of Neighborhoods

- Building a dataframe of neighborhoods in Bangalore, Karnataka by web scraping from Wikipedia
- Getting the geographical coordinates of neighborhoods
- Obtaining the venue data for neighborhoods from Foursquare API
- Exploring and clustering the neighborhoods

***


### Acknowledgement

This project was inspired by Chia Lim's work on opening a shopping mall in Kuala Lumpur, Malaysia, which can be accessed [here](https://github.com/limchiahooi/Coursera_Capstone/).

Your results may differ from those posted on my blog, depending on the Foursquare data available when you run the notebook. You should still see a similar trend across the 6 clusters, especially at the peripheries of the map. Discrepancies are most likely in the centre of the city, where the Central zone is divided either into North & South or into East & West.


I'd also like to thank the IBM teaching staff for providing the required skills and tools to complete this project. This is an updated version of my submission in the last week of October, 2019. To use this notebook, you need to create a Foursquare API account [here](https://developer.foursquare.com/).

### Importing libraries

In [None]:
# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd 
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# library to handle JSON files
import json 

# installing geocoder and converting addresses into latitude and longitude values
!pip install geocoder
import geocoder

# package for getting coordinates
!pip install geopy
from geopy.geocoders import Nominatim

# library to handle requests
import requests 

# library to parse HTML and XML documents
!pip install beautifulsoup4
from bs4 import BeautifulSoup 

# transforming JSON file into a pandas dataframe
# (json_normalize is exposed at the top level since pandas 1.0)
from pandas import json_normalize

# plotting modules: Matplotlib and MPL
import matplotlib.cm as cm
import matplotlib.colors as colors
from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D 

# importing preprocessing tools to scale features
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# importing k-means from clustering stage
from sklearn.cluster import KMeans

# importing clustering visualizer
!pip install yellowbrick
from yellowbrick.cluster import KElbowVisualizer

# map rendering library
!pip install folium
import folium 

# importing package and its set of stopwords
!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS

print("Libraries imported.")

### Scraping data from Wikipedia into a DataFrame

In [None]:
# sending the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Neighbourhoods_in_Bangalore").text

# parsing data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [None]:
# creating a list to store neighborhood data
neighborhoodList = []

# appending the text of every list item in the category block to the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text)

In [None]:
# creating a new DataFrame from the list
bl_df = pd.DataFrame({"Neighborhood": neighborhoodList})
bl_df.head(5)

In [None]:
# dropping the first three rows since they don't contain neighborhood data
bl_df.drop(bl_df.index[0:3], inplace=True)
# drop=True discards the old index instead of adding it back as a column
bl_df.reset_index(drop=True, inplace=True)
bl_df.head(5)

In [None]:
# printing the number of rows of the dataframe
print("Rows in DataFrame [Number of Neighborhoods in Bangalore]:", bl_df.shape[0])

### Getting the geographical coordinates

In [None]:
# defining a function that keeps querying the ArcGIS geocoder until it returns coordinates
def get_ll(neighborhood):
    llc = None
    while(llc is None):
        str_neigh = neighborhood + ', Bengaluru, Karnataka'
        g = geocoder.arcgis(str_neigh)
        llc = g.latlng
        # print(llc)
    return llc
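
The loop in `get_ll` retries until the geocoder answers, so a neighborhood that never resolves would keep it running indefinitely. Below is a minimal sketch of a bounded variant. It assumes a hypothetical helper `geocode_with_retry` that accepts any lookup callable (for example `lambda q: geocoder.arcgis(q).latlng`). The helper name, `max_tries`, `pause` and the stand-in coordinates are illustrative and not part of the original notebook.

```python
import time


def geocode_with_retry(lookup, query, max_tries=5, pause=0.0):
    """Call lookup(query) until it returns a value or max_tries is reached.

    lookup is any callable returning [lat, lng] or None (assumed interface).
    Returns None when every attempt fails, so the caller decides what to do.
    """
    for attempt in range(max_tries):
        result = lookup(query)
        if result is not None:
            return result
        # wait a little longer after each failed attempt
        time.sleep(pause * (attempt + 1))
    return None


# stand-in lookup that fails twice before answering (placeholder coordinates)
calls = {"n": 0}


def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [12.97, 77.59]


coords = geocode_with_retry(flaky_lookup, "Indiranagar, Bengaluru, Karnataka")
print(coords)
```

Returning `None` after the last attempt lets the caller drop or flag the neighborhood rather than blocking the whole notebook.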

In [None]:
# calling the function to get coordinates and storing in a new list using list comprehension
coords_l = [get_ll(neighborhood) for neighborhood in bl_df["Neighborhood"].tolist()]

print("Coordinates of neighborhoods:", coords_l[0:5])

In [None]:
# creating temporary dataframe to populate coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords_l, columns=['Latitude', 'Longitude'])

In [None]:
# merging the coordinates into the original dataframe
bl_df['Latitude'] = df_coords['Latitude']
bl_df['Longitude'] = df_coords['Longitude']

In [None]:
# checking the neighborhoods and coordinates
bl_df.head(5)

In [None]:
# saving the DataFrame as CSV file
bl_df.to_csv("bl_df.csv", index=False)

### Creating a map of Bangalore with mapped neighborhoods

In [None]:
# getting the coordinates of Bengaluru
address = 'Bengaluru, Karnataka'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Bengaluru, Karnataka are {}, {}.'.format(latitude, longitude))

In [None]:
# creating a map of Bangalore using latitude and longitude values
map_bl = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# add markers to map
for lat, lng, neighborhood in zip(bl_df['Latitude'], bl_df['Longitude'], bl_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_bl)  
    
map_bl

### Using Foursquare API to explore neighborhoods

In [None]:
# defining Foursquare Credentials and Version including ID, Secret & API
CLIENT_ID = 'Enter your ID' 
CLIENT_SECRET = 'Enter your Secret' 
VERSION = '20180605'

print('Your credentials:\n')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

**Getting top 100 venues in a 2 km radius of each neighborhood**

In [None]:
# calling Foursquare API
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(bl_df['Latitude'], bl_df['Longitude'], bl_df['Neighborhood']):
    
    # creating the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # making the GET request
    results_mined = requests.get(url).json()
    
    results = results_mined['response']['groups'][0]['items']
    
    # returning only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
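
The loop above assumes every venue carries at least one category; `venue['venue']['categories'][0]` raises an `IndexError` otherwise. Below is a small sketch of a more defensive extraction, assuming the same response layout as above (`response`, then `groups`, then `items`). The helper `extract_venues`, the `"Uncategorized"` fallback and the sample payload are hypothetical and only serve as illustration.

```python
def extract_venues(payload, neighborhood, lat, lng, fallback="Uncategorized"):
    """Flatten one explore-style response into the tuples used for venues_df."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to a placeholder label when the venue has no category
        category = categories[0]["name"] if categories else fallback
        rows.append((
            neighborhood, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category))
    return rows


# made-up payload in the same shape as results_mined
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 12.90, "lng": 77.60},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Stall B", "location": {"lat": 12.91, "lng": 77.61},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Example", 12.90, 77.60)
print(rows)
```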

In [None]:
# converting the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# defining column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

venues_df.head(5)

In [None]:
# sorting the venues DataFrame (sort_values returns a new object, so the result is assigned back)
venues_df = venues_df.sort_values(by=['VenueLongitude'])

venues_df_l_l = venues_df[["Neighborhood","Latitude","Longitude"]]
venues_df_g = venues_df_l_l.groupby("Neighborhood").mean()

lati_data = venues_df_g["Latitude"]
long_data = venues_df_g["Longitude"]

venues_df_g.head(5)

In [None]:
# checking for duplicate rows
bool_series = venues_df.duplicated(subset=None, keep='first')
dupl_c = int(bool_series.sum())

if dupl_c == 0:
    print("No duplicates")
else:
    print("Duplicate rows found:", dupl_c)

In [None]:
# counting various venues after grouping by neighborhood
venues_df_count = venues_df.groupby(["Neighborhood"]).count()
print("Count of venues for each neighborhood:","\n\n", np.asanyarray(venues_df_count["VenueName"]))

**Checking if neighborhoods extracted from Wiki match those obtained from venue data**

In [None]:
# comparing values of sets
neigh_wiki=bl_df["Neighborhood"].tolist()
neigh_fsq=venues_df["Neighborhood"].tolist()

print("Neighborhood not scanned:", (set(neigh_wiki)-set(neigh_fsq)))

In [None]:
# storing the list of unique categories
unique_categ = venues_df['VenueCategory'].unique()

### Wordcloud to check for popular places

In [None]:
# creating DataFrame of venue categories
newTest_words = venues_df[['VenueCategory']]
newTest_words.head(5)

In [None]:
# writing categories to file and creating a cloud object
newTest_words.to_csv('blr_venue_categ.txt', sep=',', index=False)

# reading the file back inside a context manager so it is closed afterwards
with open('blr_venue_categ.txt', 'r') as f:
    myTest = f.read()
stopwords = set(STOPWORDS)

bl_v = WordCloud(
    background_color='white',
    max_words=2000,
    stopwords=stopwords)

bl_v.generate(myTest)

In [None]:
# displaying the word cloud
fig = plt.figure()

# setting width
fig.set_figwidth(15) 

# setting height
fig.set_figheight(20) 

plt.imshow(bl_v, interpolation='bilinear')
plt.axis('off')
plt.show()

### One hot encoding of Venue Categories

In [None]:
# one hot encoding
bl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# adding neighborhood column back to dataframe
bl_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# moving neighborhood column to the first column
fixed_columns = [bl_onehot.columns[-1]] + list(bl_onehot.columns[:-1])
bl_onehot = bl_onehot[fixed_columns]

bl_onehot.head(5)

In [None]:
# grouping venues of each category by neighborhood
bl_grouped = bl_onehot.groupby(["Neighborhoods"]).sum()
bl_grouped.head(5)

**Getting total number of venues classified as restaurants for each neighborhood**

In [None]:
# setting a string keyword to go through the venue titles and appending to DataFrame
str1 = 'Restaurant'
col_list = []

for column_id in bl_grouped.columns:
    if str1 in column_id:
        col_list.append(column_id)
        
# adding up count of restaurants in all neighborhoods 
t_r_f = 0
ven_val = venues_df.VenueCategory

for v_c in range(len(ven_val)):
    if str1 in ven_val[v_c]:
        t_r_f = t_r_f + 1

In [None]:
# printing types of restaurants after validation
bl_grouped['Total Restaurants'] = bl_grouped[col_list].sum(axis=1)

t_r_cc = np.asanyarray(bl_grouped['Total Restaurants'])

print("Total number of eateries categorized as restaurants:", t_r_f, "\n")  
if t_r_f == t_r_cc.sum():
    print("Restaurant division classification has been verified","\n") 
print("Types of Restaurants:", col_list)

bl_grouped['Latitude'] = lati_data
bl_grouped['Longitude'] = long_data

In [None]:
# resetting indexes
bl_grouped = bl_grouped.reset_index()
bl_grouped.head(5)

In [None]:
# extracting and grouping all venues classified as restaurants for each neighborhood
bl_r_1 = bl_grouped[["Neighborhoods","Total Restaurants","Latitude","Longitude"]]
bl_r_1.head(5)
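
One-hot encoding followed by `groupby(...).sum()` amounts to counting how many venues of each category fall in each neighborhood, and `Total Restaurants` adds up the categories whose name contains "Restaurant". Below is a minimal pure-Python sketch of the same tally, using made-up rows rather than the Foursquare data.

```python
from collections import Counter, defaultdict

# (neighborhood, category) pairs; illustrative values only
rows = [
    ("A", "Indian Restaurant"),
    ("A", "Indian Restaurant"),
    ("A", "Coffee Shop"),
    ("B", "Chinese Restaurant"),
]

# per-neighborhood category counts, the equivalent of bl_grouped
counts = defaultdict(Counter)
for neighborhood, category in rows:
    counts[neighborhood][category] += 1

# summing every category containing "Restaurant" mirrors 'Total Restaurants'
total_restaurants = {
    n: sum(c for cat, c in cats.items() if "Restaurant" in cat)
    for n, cats in counts.items()
}
print(total_restaurants)
```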

### Clustering of restaurants by neighborhood

In [None]:
# preprocessing for scaling
bl_clustering = bl_r_1.drop(["Neighborhoods"], axis=1)

# feature scaling
X = bl_clustering
X = np.nan_to_num(X)
Clus_dataSet = MinMaxScaler().fit_transform(X)
Clus_dataSet[0:5]
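
`MinMaxScaler` rescales each column independently to the range [0, 1] using `(x - min) / (max - min)`, so the restaurant count and the two coordinates contribute on a comparable scale. Below is a short sketch of that formula on a made-up column. The function `min_max_scale` is a hypothetical helper for illustration, not the scikit-learn implementation.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0 here."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


restaurant_counts = [5, 20, 35, 50]  # illustrative values
scaled = min_max_scale(restaurant_counts)
print(scaled)
```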

In [None]:
# finding the elbow region for optimal cluster size
krng = np.arange(2, 15)
sse = []

for k in krng:
    km = KMeans(init = "k-means++", n_clusters = k, n_init = 200, max_iter = 400)
    # fitting on the scaled features so the elbow matches the final model below
    km.fit(Clus_dataSet)
    sse.append(km.inertia_)
    
plt.figure(1, figsize=(6, 4), dpi=100, facecolor='w', edgecolor='k')

plt.scatter(krng, sse)
plt.plot(krng, sse)

plt.xlabel('Number of Clusters', fontsize=12)
plt.ylabel('Inertia', fontsize=12)

In [None]:
# setting the number of clusters chosen from the elbow region
# (the inflexion can also be checked with the Calinski-Harabasz score)
kclusters = 6

# running k-means clustering
kmeans = KMeans(init = "k-means++", n_clusters = kclusters, n_init = 200, max_iter = 400).fit(Clus_dataSet)
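
The Calinski-Harabasz score mentioned above is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom; higher values indicate better separated clusters. Below is a minimal hand-rolled sketch on toy points, assuming at least two clusters and a non-zero within-cluster dispersion. In practice `sklearn.metrics.calinski_harabasz_score(Clus_dataSet, kmeans.labels_)` would be the natural call. The function below is only illustrative.

```python
def calinski_harabasz(points, labels):
    """Between/within dispersion ratio for a labelled set of points."""
    n = len(points)
    dims = len(points[0])

    # group the points by cluster label
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        centre = mean(members)
        between += len(members) * sq_dist(centre, overall)
        within += sum(sq_dist(p, centre) for p in members)

    return (between / (k - 1)) / (within / (n - k))


# two well-separated toy clusters
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
score = calinski_harabasz(pts, [0, 0, 1, 1])
print(score)
```

Computing the score for each candidate `k` and picking the maximum gives a second opinion alongside the inertia elbow.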

In [None]:
# creating a new dataframe that includes cluster labels
bl_merged = bl_r_1.copy()
bl_merged["Labels"] = kmeans.labels_

# displaying cluster centers
kmeans.cluster_centers_

In [None]:
# renaming columns for merging
bl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
bl_merged.head(5)

In [None]:
# sorting results by Cluster Labels
bl_merged.sort_values(["Labels"], inplace=True)
bl_merged = bl_merged[["Neighborhood","Total Restaurants",
                       "Labels", "Latitude","Longitude"]]

t_r_c = np.asanyarray(bl_merged["Total Restaurants"])
c_c = []

# below 20 restaurants is "Low", 20 to 39 is "Medium", 40 and above is "High"
for count in t_r_c:
    if count < 20:
        c_c.append("Low")
    elif count < 40:
        c_c.append("Medium")
    else:
        c_c.append("High")
        
bl_merged["Restaurant Density"] = c_c

# setting index name
bl_merged.index.name="Index Number"
bl_merged.head(5)
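
The density labels come from two fixed cut-offs on `Total Restaurants`: fewer than 20 is Low, 20 to 39 is Medium, and 40 or more is High. Below is a compact sketch of the same rule as a reusable function. `density_label`, `THRESHOLDS` and `LABELS` are hypothetical names introduced here for illustration.

```python
from bisect import bisect_right

THRESHOLDS = [20, 40]               # cut-offs taken from the cell above
LABELS = ["Low", "Medium", "High"]


def density_label(total_restaurants):
    """Map a restaurant count to Low (<20), Medium (20-39) or High (>=40)."""
    # bisect_right counts how many cut-offs are less than or equal to the value
    return LABELS[bisect_right(THRESHOLDS, total_restaurants)]


labels_demo = [density_label(n) for n in (0, 19, 20, 39, 40, 75)]
print(labels_demo)
```

Keeping the cut-offs in one list makes it easier to experiment with different thresholds without touching the branching logic.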

In [None]:
# Counting occurrence of restaurant densities
rest_density=pd.get_dummies(bl_merged[["Restaurant Density"]], prefix="", prefix_sep="").sum()
rest_dens_c = rest_density.to_frame()

In [None]:
# Renaming columns for display purposes
rest_dens_c.reset_index(inplace=True)
rest_dens_c

In [None]:
# Tabulating for visualization
col_n_u = ['Restaurant Density','Count']
rest_dens_c.columns = col_n_u
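# get_dummies orders the categories alphabetically (High, Low, Medium),
# so these levels assume all three categories are present
rest_dens_c["Level"] = [2,0,1]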
rest_dens_c = rest_dens_c.sort_values("Level").drop("Level",axis=1)
rest_dens_c

In [None]:
# Visualization of restaurant density categories
plt.figure(1, figsize=(6, 4), dpi=100, facecolor='w', edgecolor='k')
plt.bar(rest_dens_c["Restaurant Density"], rest_dens_c["Count"])

plt.ylabel("Count", fontsize=12)
plt.xlabel("Category", fontsize=12)

plt.show()

### Visualization of Clusters

**Scatter plot of restaurants grouped by cluster**

In [None]:
# scatter plot in 3D to visualize clusters
fig = plt.figure(1, figsize=(16, 12), dpi=200, facecolor='w', edgecolor='k')

plt.clf()
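# creating the 3D axes through the figure so they are attached to it in recent Matplotlib versions
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)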

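# taking the labels in the original row order so they line up with Clus_dataSet
# (bl_merged has been re-sorted by label)
labels = kmeans.labels_

# one marker size per plotted point
ar = np.arange(1, len(Clus_dataSet) + 1)
area = np.pi * ( ar/(ar.max()))**2*150
colors_td = cm.rainbow(labels.astype(float)/kclusters)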

plt.cla()

ax.set_xlabel('Latitude', fontsize=16)
ax.set_ylabel('Longitude', fontsize=16)
ax.set_zlabel('Total Restaurants', fontsize=16)

# Turn off tick labels
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])

ax.scatter(Clus_dataSet[:, 1], Clus_dataSet[:, 2], Clus_dataSet[:, 0], s=area, c=colors_td)

**Folium plot of restaurants grouped by cluster**

In [None]:
# creating map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# setting color scheme for clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**3 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# adding markers to map
markers_colors = []
for lat, lon, poi, cluster in zip(bl_merged['Latitude'], bl_merged['Longitude'], bl_merged['Neighborhood'], bl_merged['Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color="black",
        weight=1,
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=1).add_to(map_clusters)
       
map_clusters

### Cluster Information

**Cluster 0**

In [None]:
# sorting a fresh copy avoids modifying a slice of bl_merged in place
bl_r_clus_1 = bl_merged[bl_merged.Labels==0].sort_values(["Total Restaurants"])
        
bl_r_clus_1

**Cluster 1**

In [None]:
bl_r_clus_2 = bl_merged[bl_merged.Labels==1].sort_values(["Total Restaurants"])

bl_r_clus_2

**Cluster 2**

In [None]:
bl_r_clus_3 = bl_merged[bl_merged.Labels==2].sort_values(["Total Restaurants"])

bl_r_clus_3

**Cluster 3**

In [None]:
bl_r_clus_4 = bl_merged[bl_merged.Labels==3].sort_values(["Total Restaurants"])

bl_r_clus_4

**Cluster 4**

In [None]:
bl_r_clus_5 = bl_merged[bl_merged.Labels==4].sort_values(["Total Restaurants"])

bl_r_clus_5

**Cluster 5**

In [None]:
bl_r_clus_6 = bl_merged[bl_merged.Labels==5].sort_values(["Total Restaurants"])

bl_r_clus_6

### Observations

There is a high demand for varied cuisine in the city of Bangalore, and the range of restaurant categories in the Foursquare data confirms this.