#   IBM Applied Data Science Capstone Course by Coursera
  **Week 5 Final Report**
  
**Opening a New Shopping Mall in Mumbai, India**

1. Build a dataframe of neighborhoods in Mumbai, India by scraping the data from a Wikipedia category page
2. Get the geographical coordinates of the neighborhoods
3. Obtain the venue data for the neighborhoods from the Foursquare API
4. Explore and cluster the neighborhoods
5. Select the best cluster in which to open a new shopping mall

In [None]:
#Import all libraries

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON file into a pandas dataframe (the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library




In [None]:
!pip install --user geocoder

In [None]:

# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Suburbs_of_Mumbai").text

In [None]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [None]:
# create a list to store neighborhood data
neighborhoodList = []

In [None]:
# append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text)

In [None]:
# create a new DataFrame from the list
mumbai_df = pd.DataFrame({"Neighborhood": neighborhoodList})
mumbai_df.head()

In [None]:
mumbai_df.shape

**Get the geographical coordinates**

In [None]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Mumbai, India'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
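
The `while` loop above keeps calling the geocoder until it returns a result, so a neighborhood that can never be resolved would stall the notebook. A minimal sketch of a bounded retry follows. The names `get_latlng_bounded`, `max_tries`, `pause`, and the injected `lookup` callable are illustrative and not part of the original notebook; `lookup` stands in for a call such as `geocoder.arcgis(...).latlng`.

```python
import time


def get_latlng_bounded(neighborhood, lookup, max_tries=5, pause=0.0):
    """Return [lat, lng] for a neighborhood, or [None, None] after max_tries.

    `lookup` is any callable that takes an address string and returns
    [lat, lng] or None (a hypothetical stand-in for geocoder.arcgis(...).latlng).
    """
    address = '{}, Mumbai, India'.format(neighborhood)
    for _ in range(max_tries):
        coords = lookup(address)
        if coords is not None:
            return coords
        time.sleep(pause)  # wait before the next attempt
    return [None, None]  # keeps the two-column shape expected downstream


# Usage with a stub lookup that fails twice before answering
# (the coordinates are made-up illustration values)
calls = {"n": 0}


def flaky_lookup(address):
    calls["n"] += 1
    return None if calls["n"] < 3 else [19.0, 72.8]


result = get_latlng_bounded("Andheri", flaky_lookup)
missing = get_latlng_bounded("Nowhere", lambda address: None, max_tries=2)
```

With the real library, the same helper could be called as `get_latlng_bounded(name, lambda a: geocoder.arcgis(a).latlng)`. Unresolved neighborhoods then show up as missing values in the Latitude and Longitude columns instead of blocking the run.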

In [None]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in mumbai_df["Neighborhood"].tolist() ]

In [None]:
coords

In [None]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [None]:
# merge the coordinates into the original dataframe
mumbai_df['Latitude'] = df_coords['Latitude']
mumbai_df['Longitude'] = df_coords['Longitude']

In [None]:
# check the neighborhoods and the coordinates
print(mumbai_df.shape)
mumbai_df

In [None]:
# save the DataFrame as CSV file
mumbai_df.to_csv("kl_df.csv", index=False)


**Create a map of Mumbai with neighbourhoods superimposed**

In [None]:
# get the coordinates of Mumbai
address = 'Mumbai, India'

geolocator = Nominatim(user_agent="Coursera Capstone")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Mumbai, India are {}, {}.'.format(latitude, longitude))

In [None]:
# create map of Mumbai using latitude and longitude values
map_mumbai = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(mumbai_df['Latitude'], mumbai_df['Longitude'], mumbai_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_mumbai)  
    
map_mumbai

In [None]:
# save the map as HTML file
map_mumbai.save('map_mumbai.html')

In [None]:
CLIENT_ID = '2ZA2NVZXQTI4DWWBTAC2UOE1V3HE3IRMFIDBSPQFR2Q1COVT' # your Foursquare ID
CLIENT_SECRET = 'YE0X2KY0PUWBWYGUU0ETQ3RHBCC4I5Z5HX1GQE3LQXCTKQJN' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

**Now, let's get the top 100 venues that are within a radius of 2000 meters.**

In [None]:
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(mumbai_df['Latitude'], mumbai_df['Longitude'], mumbai_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
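
The loop above interpolates every value into the URL string and indexes straight into the response. As a rough sketch that assumes the same response layout the loop relies on (`response`, then `groups[0]`, then `items`), the query can be expressed as a parameter dictionary and the parsing moved into a small function that tolerates a venue without a category. `build_explore_params` and `extract_venues` are hypothetical helper names, and the sample payload is made up for illustration.

```python
def build_explore_params(client_id, client_secret, version, lat, lng, radius, limit):
    """Query parameters for the venues/explore request built in the loop above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }


def extract_venues(payload, neighborhood, lat, lng):
    """Flatten one explore response into the tuples collected in `venues`."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # some venues may carry no category; record None instead of failing
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# made-up payload in the assumed layout
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Mall",
               "location": {"lat": 19.01, "lng": 72.81},
               "categories": [{"name": "Shopping Mall"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 19.02, "lng": 72.82},
               "categories": []}},
]}]}}

params = build_explore_params("ID", "SECRET", "20180604", 19.0, 72.8, 2000, 100)
rows = extract_venues(sample, "Example", 19.0, 72.8)
```

In the notebook, `params` could be passed as `requests.get("https://api.foursquare.com/v2/venues/explore", params=params)`, which lets `requests` handle the URL encoding.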

In [None]:

# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

**Number of venues returned for each neighborhood**

In [None]:
venues_df.groupby(["Neighborhood"]).count()

In [None]:
# number of unique venue categories
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

In [None]:
venues_df['VenueCategory'].unique()[:50]


In [None]:
# check if the results contain "Shopping Mall"
"Shopping Mall" in venues_df['VenueCategory'].unique()

**Analyze each neighbourhood**

In [None]:

# one hot encoding
mumbai_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
mumbai_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [mumbai_onehot.columns[-1]] + list(mumbai_onehot.columns[:-1])
mumbai_onehot = mumbai_onehot[fixed_columns]

print(mumbai_onehot.shape)
mumbai_onehot.head()

In [None]:
mumbai_onehot

In [None]:
mumbai_grouped = mumbai_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(mumbai_grouped.shape)
mumbai_grouped
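
Because each row of the one-hot table is a single venue, the group mean for a category equals the share of a neighborhood's venues that fall in that category. A toy illustration with made-up neighborhoods `A` and `B` (not taken from the Mumbai data):

```python
import pandas as pd

# six made-up venues in two hypothetical neighborhoods
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Shopping Mall", "Cafe", "Cafe", "Hotel", "Cafe", "Hotel"],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy[["VenueCategory"]], prefix="", prefix_sep="", dtype=float)
onehot["Neighborhoods"] = toy["Neighborhood"]

# mean of the indicator columns = share of venues per category
grouped = onehot.groupby("Neighborhoods").mean().reset_index()
mall_share = dict(zip(grouped["Neighborhoods"], grouped["Shopping Mall"]))
```

Here `A` has one mall among four venues, so its "Shopping Mall" value is 0.25, while `B` has none and gets 0.0. The clustering step later works on this share, not on a raw count of malls.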

In [None]:
len(mumbai_grouped[mumbai_grouped["Shopping Mall"] > 0])


**Create a new dataframe for shopping malls only**

In [None]:
mumbai_mall = mumbai_grouped[["Neighborhoods","Shopping Mall"]]

In [None]:
mumbai_mall.head()

**Cluster Neighbourhoods**

In [None]:
# set number of clusters
kclusters = 3

mumbai_clustering = mumbai_mall.drop(columns=["Neighborhoods"])  # the positional axis argument is not accepted by recent pandas

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mumbai_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
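
The integer labels that k-means assigns are arbitrary, so label 0 is not necessarily the cluster with the fewest or the most malls. One way to read them, sketched here on made-up frequencies and not on the notebook's data, is to rank the fitted `cluster_centers_`. Setting `n_init=10` explicitly is an assumption made to keep behavior consistent across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up "Shopping Mall" frequencies for seven hypothetical neighborhoods
freq = np.array([0.00, 0.01, 0.02, 0.05, 0.06, 0.12, 0.13]).reshape(-1, 1)

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(freq)

# order the raw cluster labels from lowest to highest centroid
order = np.argsort(km.cluster_centers_.ravel())
rank_of_label = {int(label): rank for rank, label in enumerate(order)}

# 0 = lowest mall frequency, 2 = highest mall frequency
ranked = [rank_of_label[int(label)] for label in km.labels_]
```

Applying the same ranking to `kmeans.cluster_centers_` in the notebook would show which of the three labels corresponds to low, moderate, and high mall frequency before any conclusions are drawn from them.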

In [None]:
# create a new dataframe that holds the shopping mall frequency and the cluster label for each neighborhood
mumbai_merged = mumbai_mall.copy()

# add clustering labels
mumbai_merged["Cluster Labels"] = kmeans.labels_

In [None]:
mumbai_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
mumbai_merged.head()

In [None]:
# join mumbai_merged with mumbai_df to add latitude/longitude for each neighborhood
mumbai_merged = mumbai_merged.join(mumbai_df.set_index("Neighborhood"), on="Neighborhood")

print(mumbai_merged.shape)
mumbai_merged.head() # check the last columns!

In [None]:
# sort the results by Cluster Labels
print(mumbai_merged.shape)
mumbai_merged.sort_values(["Cluster Labels"], inplace=True)
mumbai_merged

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mumbai_merged['Latitude'], mumbai_merged['Longitude'], mumbai_merged['Neighborhood'], mumbai_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

**Examining clusters**

In [None]:
# Cluster 0
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 0]

In [None]:
# Cluster 1
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 1]

In [None]:
# Cluster 2
mumbai_merged.loc[mumbai_merged['Cluster Labels'] == 2]

**Observations:**
Most of the shopping malls are concentrated in the western part of Mumbai, with the highest concentration in cluster 0 and a moderate concentration in cluster 2. Cluster 1, by contrast, contains neighborhoods with few or no shopping malls. These neighborhoods offer the strongest opportunity for new shopping malls, because there is little to no competition from existing ones. Shopping malls in cluster 0, meanwhile, are likely to face intense competition because of oversupply. The same pattern shows that the oversupply is mostly confined to the western part of the city, while the suburban areas still have very few shopping malls.

This project therefore recommends that property developers open new shopping malls in cluster 1 neighborhoods, where there is little to no competition. Developers with a unique selling proposition that sets them apart can also consider cluster 2 neighborhoods, where competition is moderate. Lastly, developers are advised to avoid cluster 0 neighborhoods, which already have a high concentration of shopping malls and intense competition.