<h1 align=center><font size = 5>Which Management District in Houston is Most Favorable for Opening a Restaurant?</font></h1>

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
!pip install geopy



In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas 1.0 or later)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

!pip install BeautifulSoup4
!pip install lxml
!pip install html5lib # parser used by BeautifulSoup in the scraping cell

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

Download the list of Houston management districts and clean it in the same way as in the previous notebook.

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import urllib.request
from bs4 import BeautifulSoup

# Specify the url
url = "https://en.wikipedia.org/wiki/List_of_Houston_neighborhoods"

# Open the url
page = urllib.request.urlopen(url)

# Parse the HTML from the URL into the parse tree format
soup = BeautifulSoup(page, "html5lib")

# Find the right table
right_table=soup.find_all("table", class_='wikitable sortable')[0]

# Collect the district names
A=[]

# Read each row of the table
for row in right_table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==3:
        # Check if the management district is not assigned
        if cells[1].find(text=True)!="Not assigned\n":
            # Find the name of the management district
            addas = cells[0].find(text=True)
            # Remove the \n at the end of each term
            adda = addas.replace("\n", "")
            A.append(adda)
                    
# Put into a dataframe and display
df=pd.DataFrame(A,columns=['Management District'])
df

Unnamed: 0,Management District
0,5 Corners District
1,Baybrook Management District
2,Downtown District
3,East Downtown Management District
4,Generation Park Management District
5,Greater East End Management District
6,Greater Northside Management District
7,Houston Southeast
8,International Management District
9,Memorial Management District


#### Load and explore the data

Next, let's geocode each district and compile the results into one dataframe with latitude and longitude.

In [4]:
# Geocode each management district

# Collect latitude and longitude values for df
lat=[]
long=[]

# Create the geocoder once and reuse it for every lookup
geolocator = Nominatim(user_agent="foursquare_agent")
# Result for a place that should not exist; used to detect fallback matches
test = geolocator.geocode('Nowhere, Houston, TX')

# Iterate through each district
for row in df['Management District']:
    # Find the associated latitude and longitude
    address = row + ', Houston, TX'
    location = geolocator.geocode(address)
    # geocode() returns None when nothing is found
    if location is not None and location != test:
        addlat = location.latitude
        addlong = location.longitude
    else: 
        addlat = 'N/A'
        addlong = 'N/A'
    # Add the latitude and longitude to the lists
    lat.append(addlat)
    long.append(addlong)
            
# Copy the dataframe, add latitude and longitude columns, and display it
newdf=df.copy()
newdf['Latitude']=lat
newdf['Longitude']=long
newdf

Unnamed: 0,Management District,Latitude,Longitude
0,5 Corners District,,
1,Baybrook Management District,,
2,Downtown District,32.7826,-96.8088
3,East Downtown Management District,,
4,Generation Park Management District,29.9071,-95.18
5,Greater East End Management District,,
6,Greater Northside Management District,,
7,Houston Southeast,29.7674,-95.3675
8,International Management District,,
9,Memorial Management District,29.9353,-95.4583
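
A geocoder can return nothing for an address it does not recognise, so it helps to handle a missing result explicitly. The following is a minimal sketch rather than the notebook's own cell: `geocode_districts` and `fake_geocode` are hypothetical names, and the stand-in geocoder simply returns the Generation Park coordinates from the table above so that the example runs offline.

```python
from types import SimpleNamespace

def geocode_districts(names, geocode, suffix=", Houston, TX"):
    """Return (name, latitude, longitude) tuples; unknown places get None."""
    rows = []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            # The geocoder found nothing for this address
            rows.append((name, None, None))
        else:
            rows.append((name, location.latitude, location.longitude))
    return rows

# Hypothetical stand-in for a real geocoder, used so the sketch runs offline
def fake_geocode(address):
    known = {
        "Generation Park Management District, Houston, TX":
            SimpleNamespace(latitude=29.9071, longitude=-95.18),
    }
    return known.get(address)

rows = geocode_districts(
    ["Generation Park Management District", "5 Corners District"],
    fake_geocode,
)
print(rows)
```

With geopy, the `geocode` argument could be `Nominatim(user_agent=...).geocode`, optionally wrapped in `geopy.extra.rate_limiter.RateLimiter` to space out the requests.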


In [5]:
# Generate a new dataframe, keeping only rows with known positional data
known = (newdf['Latitude'] != 'N/A') & (newdf['Longitude'] != 'N/A')
dfnew = newdf.loc[known, ['Management District', 'Latitude', 'Longitude']].reset_index(drop=True)
dfnew

Unnamed: 0,Management District,Latitude,Longitude
0,Downtown District,32.782611,-96.808781
1,Generation Park Management District,29.907106,-95.179968
2,Houston Southeast,29.767383,-95.367519
3,Memorial Management District,29.935331,-95.45826
4,Midtown Houston,29.741414,-95.353201
5,Near Northwest Management District,29.931694,-95.509396
6,North Houston District,29.944719,-95.416074
7,Southwest Management District,29.931694,-95.509396
8,Westchase District,29.729353,-95.573857
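
Geocoding by name can match the wrong place, so a quick sanity check on the coordinates is worthwhile before querying venues. In the table above, one row has a latitude near 32.78 while the others lie between roughly 29.7 and 29.95, and two rows share identical coordinates. The sketch below flags both cases in plain Python; the bounding box is an assumed rough outline of the Houston area, not a value taken from this notebook.

```python
# Coordinates copied from the table above
districts = [
    ("Downtown District", 32.782611, -96.808781),
    ("Generation Park Management District", 29.907106, -95.179968),
    ("Houston Southeast", 29.767383, -95.367519),
    ("Memorial Management District", 29.935331, -95.45826),
    ("Midtown Houston", 29.741414, -95.353201),
    ("Near Northwest Management District", 29.931694, -95.509396),
    ("North Houston District", 29.944719, -95.416074),
    ("Southwest Management District", 29.931694, -95.509396),
    ("Westchase District", 29.729353, -95.573857),
]

# Assumed rough bounding box for the Houston area (not from the notebook)
LAT_MIN, LAT_MAX = 29.4, 30.3
LON_MIN, LON_MAX = -96.0, -94.9

def outside_box(lat, lon):
    """True when a coordinate pair falls outside the assumed box."""
    return not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX)

suspect = [name for name, lat, lon in districts if outside_box(lat, lon)]

# Group district names by coordinate pair to spot repeated geocoding results
by_coords = {}
for name, lat, lon in districts:
    by_coords.setdefault((lat, lon), []).append(name)
duplicates = [names for names in by_coords.values() if len(names) > 1]

print("Outside the assumed box:", suspect)
print("Sharing coordinates:", duplicates)
```

Rows flagged this way are worth checking by hand, since venues fetched around a wrong coordinate describe a different location.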


In [6]:
# Startup Foursquare API
CLIENT_ID = 'ROUSWCDE02ULBTFWSNIGUWYZVHHGBZIRCAEDNIQOKNQHD0R3' # your Foursquare ID
CLIENT_SECRET = 'ZKAGVFWWI153Q3SRMNSVMDHIETIJJUMQGHNVUGHE3PGHJJ15' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

LIMIT=103

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: ROUSWCDE02ULBTFWSNIGUWYZVHHGBZIRCAEDNIQOKNQHD0R3
CLIENT_SECRET:ZKAGVFWWI153Q3SRMNSVMDHIETIJJUMQGHNVUGHE3PGHJJ15


In [7]:
# Use venues functions from previous project for each district
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Management District', 
                  'Management District Latitude', 
                  'Management District Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
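
`getNearbyVenues` indexes straight into the response, so a reply with no groups or a venue with no category would raise an error. A minimal, more defensive sketch of the same flattening step follows. `extract_venues` is a hypothetical helper, the payload shape is inferred only from the keys the function above reads, and the second venue in the sample is made up to show the fallback.

```python
def extract_venues(payload, district):
    """Flatten an explore-style response into (district, name, lat, lng, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to a placeholder when a venue has no category listed
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((district, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload mimicking the nesting that getNearbyVenues reads
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tutta's",
               "location": {"lat": 32.781305, "lng": -96.807423},
               "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"name": "Unnamed Spot",  # made-up venue without a category
               "location": {"lat": 32.78, "lng": -96.80},
               "categories": []}},
]}]}}

found = extract_venues(sample, "Downtown District")
empty = extract_venues({"response": {}}, "Nowhere")
print(found)
print(empty)
```

Keeping the parsing in its own function also makes it possible to test it without calling the API.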

In [8]:
# Generate a list of venues in each district
houston_venues = getNearbyVenues(names=dfnew['Management District'],latitudes=dfnew['Latitude'],longitudes=dfnew['Longitude'])
houston_venues

Downtown District
Generation Park Management District
Houston Southeast
Memorial Management District
Midtown Houston
Near Northwest Management District
North Houston District
Southwest Management District
Westchase District


Unnamed: 0,Management District,Management District Latitude,Management District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Downtown District,32.782611,-96.808781,Tutta's,32.781305,-96.807423,Pizza Place
1,Downtown District,32.782611,-96.808781,Ellen's,32.781989,-96.807682,Southern / Soul Food Restaurant
2,Downtown District,32.782611,-96.808781,House of Blues,32.785033,-96.808221,Music Venue
3,Downtown District,32.782611,-96.808781,Y. O. Ranch Steakhouse,32.781296,-96.806402,Steakhouse
4,Downtown District,32.782611,-96.808781,Wild Bill's Western,32.781294,-96.806734,Gift Shop
5,Downtown District,32.782611,-96.808781,Medina Oven & Bar,32.785579,-96.809745,Moroccan Restaurant
6,Downtown District,32.782611,-96.808781,The Sixth Floor Museum,32.779994,-96.808609,History Museum
7,Downtown District,32.782611,-96.808781,House of Blues Foundation Room,32.785145,-96.808369,Lounge
8,Downtown District,32.782611,-96.808781,la Madeleine Country French Café,32.784763,-96.807906,French Restaurant
9,Downtown District,32.782611,-96.808781,Tiff's Treats,32.782551,-96.8046,Bakery


#### Create a data set

In [9]:
# one hot encoding
houston_onehot = pd.get_dummies(houston_venues[['Venue Category']], prefix="", prefix_sep="")

# add management district column back to dataframe
houston_onehot['Management District'] = houston_venues['Management District'] 

# move management district column to the first column
fixed_columns = [houston_onehot.columns[-1]] + list(houston_onehot.columns[:-1])
houston_onehot = houston_onehot[fixed_columns]

# average the one-hot columns to get each category's frequency per district
houston_grouped = houston_onehot.groupby('Management District').mean().reset_index()

num_top_venues=10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Management District']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Management District'] = houston_grouped['Management District']

for ind in np.arange(houston_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(houston_grouped.iloc[ind, :], num_top_venues)
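
The cell above one-hot encodes the venue categories and averages them per district, so each value in `houston_grouped` is the share of a district's venues that fall in a category. A small sketch of the same idea in plain Python, using made-up districts and categories purely for illustration:

```python
from collections import Counter, defaultdict

# Made-up (district, category) pairs, for illustration only
venues = [
    ("District A", "Pizza Place"),
    ("District A", "Pizza Place"),
    ("District A", "Coffee Shop"),
    ("District A", "Gym"),
    ("District B", "Coffee Shop"),
    ("District B", "Park"),
]

# Count how often each category appears in each district
counts = defaultdict(Counter)
for district, category in venues:
    counts[district][category] += 1

# Share of each category per district; this equals the mean of the
# corresponding one-hot column within that district
frequencies = {
    district: {cat: n / sum(c.values()) for cat, n in c.items()}
    for district, c in counts.items()
}

def top_categories(district, n):
    """Return the n most frequent categories, most frequent first."""
    return [cat for cat, _ in counts[district].most_common(n)]

print(frequencies["District A"])
print(top_categories("District A", 1))
```

Sorting those shares in descending order is what produces the "Most Common Venue" columns.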

In [10]:
# Generate a restaurant score for each management district
length=len(neighborhoods_venues_sorted['Management District'])
restscores=np.zeros(length)  # accumulated score per district
restnum=np.zeros(length)     # number of food-related keyword matches per district
restpop=np.zeros(length)     # rank-weighted sum of those matches per district

# Keywords that mark a venue category as restaurant related
# (a category matching several keywords is counted once per keyword)
keywords = ['Restaurant', 'Food', 'Bar', 'Place', 'Joint', 'Coffee', 'Bakery']

# Column 0 holds the district name; columns 1-9 hold the 1st-9th most common venues.
# Stopping at index 10 means the 10th most common venue is not scored.
for index, column in enumerate(neighborhoods_venues_sorted.columns):
    if index >= 10:
        break
    for indexes, category in enumerate(neighborhoods_venues_sorted[column]):
        for keyword in keywords:
            if keyword in category:
                restnum[indexes] += 1
                restpop[indexes] += 11 - index  # higher-ranked venues weigh more
        # Add the average weight of the matches seen so far
        if restnum[indexes] != 0:
            restscores[indexes] += restpop[indexes] / restnum[indexes]

In [11]:
# add restaurant score to the dataframe (the column keeps the name 'Cluster Labels')
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', restscores)

houston_merged = dfnew

# join the scores and top venues onto dfnew, which holds latitude/longitude for each district
houston_merged = houston_merged.join(neighborhoods_venues_sorted.set_index('Management District'), on='Management District')

houston_merged # check the last columns!

Unnamed: 0,Management District,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown District,32.782611,-96.808781,35.166667,History Museum,Plaza,Liquor Store,Pizza Place,Gift Shop,Cocktail Bar,Coffee Shop,Convenience Store,French Restaurant,Fried Chicken Joint
1,Generation Park Management District,29.907106,-95.179968,,,,,,,,,,,
2,Houston Southeast,29.767383,-95.367519,56.0,Public Art,Park,Coffee Shop,Aquarium,Train Station,Lawyer,Theater,Monument / Landmark,Concert Hall,Seafood Restaurant
3,Memorial Management District,29.935331,-95.45826,78.433333,Fast Food Restaurant,Seafood Restaurant,Market,Intersection,Gas Station,Fried Chicken Joint,Food Truck,Vietnamese Restaurant,Café,BBQ Joint
4,Midtown Houston,29.741414,-95.353201,64.833333,Yoga Studio,Bar,Modern European Restaurant,Gym / Fitness Center,Gym,Flower Shop,Arts & Crafts Store,Bakery,Business Service,Italian Restaurant
5,Near Northwest Management District,29.931694,-95.509396,73.866667,Vietnamese Restaurant,Intersection,Mexican Restaurant,Bar,Business Service,Sandwich Place,Gym,Fried Chicken Joint,Gas Station,Gift Shop
6,North Houston District,29.944719,-95.416074,48.0,Shoe Store,Supplement Shop,American Restaurant,Gym,Mexican Restaurant,Movie Theater,Pizza Place,Café,Department Store,Theme Park
7,Southwest Management District,29.931694,-95.509396,73.866667,Vietnamese Restaurant,Intersection,Mexican Restaurant,Bar,Business Service,Sandwich Place,Gym,Fried Chicken Joint,Gas Station,Gift Shop
8,Westchase District,29.729353,-95.573857,2.0,Video Store,Flower Shop,Gas Station,Jewelry Store,Hotel,Massage Studio,Home Service,Pharmacy,French Restaurant,Fried Chicken Joint
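
To make the score in the `Cluster Labels` column easier to follow, here is a sketch that recomputes it for two rows of the table above. `restaurant_score` is a hypothetical helper that mirrors the scoring cell as written: only ranks 1 to 9 are scored, a venue at rank r carries weight 11 - r, and at every rank the average weight of the matches seen so far is added to the total. Unlike the cell, the sketch does not scan the district name itself for keywords.

```python
KEYWORDS = ["Restaurant", "Food", "Bar", "Place", "Joint", "Coffee", "Bakery"]

def restaurant_score(top_venues):
    """Score one district from its ranked venue categories (most common first)."""
    matches = 0      # keyword matches seen so far
    weight_sum = 0   # rank weights of those matches
    score = 0.0
    # Only ranks 1-9 are scored, mirroring the index < 10 guard in the scoring cell
    for rank, category in enumerate(top_venues[:9], start=1):
        for keyword in KEYWORDS:
            if keyword in category:
                matches += 1
                weight_sum += 11 - rank
        if matches:
            score += weight_sum / matches
    return score

# Rows copied from the table above
westchase = ["Video Store", "Flower Shop", "Gas Station", "Jewelry Store", "Hotel",
             "Massage Studio", "Home Service", "Pharmacy", "French Restaurant",
             "Fried Chicken Joint"]
southeast = ["Public Art", "Park", "Coffee Shop", "Aquarium", "Train Station",
             "Lawyer", "Theater", "Monument / Landmark", "Concert Hall",
             "Seafood Restaurant"]

westchase_score = restaurant_score(westchase)
southeast_score = restaurant_score(southeast)
print(westchase_score, southeast_score)
```

These match the 2.0 and 56.0 shown for Westchase District and Houston Southeast. An early match keeps contributing at every later rank, which is why a single Coffee Shop in third place outweighs a French Restaurant in ninth.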


In [21]:
# create map
latitude = 29.8
longitude = -95.3
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the restaurant scores
colors_array = cm.rainbow(np.linspace(0, 1, 5))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Convert restaurant scores to integers (districts with no score get 0)
houston_merged['Cluster Labels']=houston_merged['Cluster Labels'].fillna(0).astype('int')

# add markers to the map, colored by restaurant score
for lat, lon, poi, cluster in zip(houston_merged['Latitude'], houston_merged['Longitude'], houston_merged['Management District'], houston_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Restaurant Score ' + str(cluster), parse_html=True)
    # Map the score to a color bucket, clamped so low scores do not wrap around to the last color
    color_index = min(max(int(round(cluster/30))-1, 0), len(rainbow)-1)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[color_index],
        fill=True,
        fill_color=rainbow[color_index],
        fill_opacity=0.7).add_to(map_clusters)

# Recommend the district with the highest restaurant score
best = houston_merged['Cluster Labels'].idxmax()
recommend = houston_merged.loc[best, 'Management District']
print('The district most suitable for opening a restaurant is: ',recommend)

# Display the map; colors run from the violet end of the palette (low scores) toward the red end (high scores)
map_clusters

The district most suitable for opening a restaurant is:  Memorial Management District


<hr>

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).