# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

This project aims to recommend to a shop owner the neighborhoods in **Houston (Harris County), Texas, USA** that are best suited for opening a coffee shop.

The location should preferably be **close to restaurants**, since customers may want to grab a coffee after a meal, but it should **not already be crowded with other coffee shops or dessert shops**. In general, we want to open the coffee shop where the largest number of potential customers will come in throughout the year.

We will use data science methods to shortlist the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly stated so that the shop owner can choose the best possible final location.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing coffee shops in the neighborhood (any type of coffee shop, including boba tea shops)
* number of nearby restaurants in the neighborhood, if any, and the distance to them
* number and type of venues in the neighborhood

We use the centers of the Houston zip code areas as our candidate neighborhoods.

The following data sources will be needed to extract/generate the required information:
* centers of candidate areas (zip code coordinates) will be read from the zip code table at http://zipatlas.com/us/tx/houston/zip-code-comparison/median-household-income.htm
* number of existing coffee shops in every neighborhood will be obtained using **Foursquare API**
* number of nearby restaurants in every neighborhood, and the distance to them, will be obtained using **Foursquare API**
* coordinates of the Houston center will be obtained by geocoding with **Nominatim** (via `geopy`)

### Neighborhood Candidates

The geographical coordinates of each zip code in Houston are obtained from the following web page: http://zipatlas.com/us/tx/houston/zip-code-comparison/median-household-income.htm

In [9]:
import pandas as pd
import numpy
import requests

In [52]:
url = 'http://zipatlas.com/us/tx/houston/zip-code-comparison/median-household-income.htm'
html = requests.get(url).content
df_list = pd.read_html(html)
geo = df_list[-3]
geo.head()

new_header = geo.iloc[0] 
geo = geo[1:] 
geo.columns = new_header 

geo = geo[['#','Zip Code','Location','Population']]

geo.reset_index(inplace=True)
geo.drop(columns='index',inplace=True)
geo.head()

| | # | Zip Code | Location | Population |
| --- | --- | --- | --- | --- |
| 0 | 1.0 | 77010 | 29.754310, -95.361109 | 76 |
| 1 | 2.0 | 77094 | 29.769285, -95.681292 | 7779 |
| 2 | 3.0 | 77046 | 29.733084, -95.430659 | 471 |
| 3 | 4.0 | 77059 | 29.615219, -95.134960 | 16690 |
| 4 | 5.0 | 77005 | 29.718435, -95.423555 | 23338 |


In [276]:
ll = pd.DataFrame(geo['Location'].str.split(',').tolist(),columns=['Latitude','Longitude'])
ll = ll.astype('float64')
geo_ll = pd.concat([geo,ll],axis=1)
geo_ll.drop(columns='Location',inplace=True)
geo_ll = geo_ll.astype({'Zip Code':'int32'})


geo_ll.to_pickle('./locations.pkl')   

geo_ll.head()

| | # | Zip Code | Population | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 77010 | 76 | 29.75431 | -95.361109 |
| 1 | 2.0 | 77094 | 7779 | 29.769285 | -95.681292 |
| 2 | 3.0 | 77046 | 471 | 29.733084 | -95.430659 |
| 3 | 4.0 | 77059 | 16690 | 29.615219 | -95.13496 |
| 4 | 5.0 | 77005 | 23338 | 29.718435 | -95.423555 |
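
The `Location` column stores each coordinate pair as a single `'lat, lon'` string. As a minimal sketch of what the split above does for one value, assuming every entry contains exactly one comma (the helper name `parse_location` is hypothetical and not part of the notebook):

```python
def parse_location(text):
    """Split a 'lat, lon' string into a (latitude, longitude) pair of floats."""
    lat_text, lon_text = text.split(',')
    # float() ignores surrounding whitespace, so ' -95.361109' parses fine
    return float(lat_text), float(lon_text)


# Sample value taken from the first row of the zip code table
lat, lon = parse_location('29.754310, -95.361109')
print(lat, lon)
```

A malformed entry (no comma, or more than one) raises a `ValueError` here, which is usually preferable to silently producing a wrong coordinate.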


The table above has also been saved to a local file (`./locations.pkl`) for later reuse.

Let's visualize the data we have so far.

In [23]:
!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium


In [25]:
!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values


In [26]:
import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas.io.json.json_normalize is deprecated since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


In [348]:
# create map of Harris county using latitude and longitude values

address = 'Houston,TX'

geolocator = Nominatim(user_agent="HST_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Houston, TX are {}, {}.'.format(latitude, longitude))
map_HST = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, ZipCode in zip(geo_ll['Latitude'], geo_ll['Longitude'], geo_ll['Zip Code']):
    label = '{}'.format(ZipCode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_HST)  

map_HST

The geographical coordinates of Houston, TX are 29.7589382, -95.3676974.


### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get information on the restaurants and coffee shops in each neighborhood.


In [235]:
import os

# Read the Foursquare credentials from the environment instead of hard-coding them in the notebook.
# The variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are an assumption; use whatever names you export.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


In [93]:
!pip install shapely
import shapely.geometry

!pip install pyproj
import pyproj

import math



In [310]:
# NOTE: UTM zone 33 covers central Europe, while Houston lies in UTM zone 15.
# The outputs shown in this notebook were produced with zone=33, so the X/Y
# values fall far outside that zone and are not reliable metric coordinates.
# pyproj.transform is deprecated since pyproj 2.6.1 in favor of pyproj.Transformer.
def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
location = geolocator.geocode('Houston, TX')
latitude = location.latitude
longitude = location.longitude
houston = [latitude,longitude]
print('Houston, TX longitude={}, latitude={}'.format(houston[1], houston[0]))
x, y = lonlat_to_xy(houston[1], houston[0])
print('Houston, TX UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Houston, TX longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Houston, TX longitude=-95.3676974, latitude=29.7589382
Houston, TX UTM X=-6758703.768728998, Y=13474008.092953565
Houston, TX longitude=-95.36769740000382, latitude=29.758938199994205
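
The projection code above hard-codes `zone=33`. As a hedged aside, the standard 6-degree UTM zone for a given longitude can be computed as sketched below; the helper name `utm_zone` is hypothetical, and the formula ignores the special-case zones around Norway and Svalbard.

```python
import math


def utm_zone(longitude):
    """Return the standard 6-degree UTM zone number (1-60) for a longitude in degrees."""
    zone = int(math.floor((longitude + 180.0) / 6.0)) + 1
    # longitude == 180 would give 61; fold it back into zone 60
    return min(zone, 60)


houston_lon = -95.3676974  # longitude printed by the check above
print(utm_zone(houston_lon))
print(utm_zone(13.4))  # a longitude in central Europe, for comparison
```

For the Houston longitude this gives zone 15, whereas zone 33 corresponds to longitudes between 12°E and 18°E.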




In [270]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the flattened column name produced by json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [277]:
# Category IDs for the cafe-type venues were taken from the Foursquare web site (https://developer.foursquare.com/docs/resources/categories):

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

cafe_categories = ['4bf58dd8d48988d128941735','4bf58dd8d48988d16d941735','4bf58dd8d48988d1e0931735','4bf58dd8d48988d1d0941735',
                   '4bf58dd8d48988d1bc941735','512e7cae91d4cbb4e5efe0af','4bf58dd8d48988d1c9941735','5744ccdfe4b0c0459246b4e2',
                   '52e81612bcbc57f1066b7a0a']

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    # These two replacements are no-ops for US addresses, which end in ', United States'
    address = address.replace(', Deutschland', '')
    address = address.replace(', Germany', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except Exception:
        # any request or parsing failure yields an empty venue list
        venues = []
    return venues
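
`get_venues_near_location` assembles its request URL with `str.format`. A minimal sketch of building the same query string with the standard library, so that every value is URL-escaped, could look like this. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lon, category, client_id, client_secret,
                      version='20180724', radius=500, limit=100):
    """Build the venues/explore URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lon),
        'categoryId': category,
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials; the category ID is the food root category from above
url = build_explore_url(29.75431, -95.361109, '4d4b7105d754a06374d81259',
                        'MY_ID', 'MY_SECRET', radius=350)
query = parse_qs(urlparse(url).query)
print(query['ll'], query['radius'])
```

Parsing the URL back with `parse_qs` is only there to show that the parameters survive the round trip unchanged.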

First, let's collect the restaurants and **coffee shops** around the candidate areas.

In [290]:
# Let's now go over our neighborhood locations and get nearby restaurants; we'll also maintain a dictionary of all found restaurants and all found cafes

import pickle

def get_restaurants(lats, lons):
    restaurants = {}
    cafe = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=350 to make sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        foursquare_client_id = CLIENT_ID
        foursquare_client_secret = CLIENT_SECRET
        venues = get_venues_near_location(lat, lon, food_category, foursquare_client_id, foursquare_client_secret, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_cafe = is_restaurant(venue_categories, specific_filter=cafe_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_cafe, x, y)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_cafe:
                    cafe[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, cafe, location_restaurants

# Try to load from local file system in case we did this before
restaurants = {}
cafe = {}
location_restaurants = []
loaded = False
try:
    with open('restaurants_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('cafe_350.pkl', 'rb') as f:
        cafe = pickle.load(f)
    with open('location_restaurants_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    loaded = True
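except (OSError, pickle.UnpicklingError, EOFError):
    pass  # no usable cache on disk; fall back to the Foursquare API below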

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, cafe, location_restaurants = get_restaurants(geo_ll['Latitude'], geo_ll['Longitude'])
    
    # Let's persist this in the local file system
    with open('restaurants_350.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('cafe_350.pkl', 'wb') as f:
        pickle.dump(cafe, f)
    with open('location_restaurants_350.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)
        

Obtaining venues around candidate locations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


In [291]:
import numpy as np

print('Total number of restaurants:', len(restaurants))
print('Total number of cafes:', len(cafe))
print('Percentage of cafes: {:.2f}%'.format(len(cafe) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of restaurants: 132
Total number of cafes: 15
Percentage of cafes: 11.36%
Average number of restaurants in neighborhood: 1.0625
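
The total above counts each venue once even though the search circles of neighboring zip codes can overlap, because `get_restaurants` stores venues in dictionaries keyed by venue ID. A small sketch of that idea, using made-up venue IDs and names:

```python
# Venues returned for two overlapping search areas: (venue_id, venue_name).
# IDs and names are made up for illustration.
area_a = [('v1', 'Cafe One'), ('v2', 'Diner Two')]
area_b = [('v2', 'Diner Two'), ('v3', 'Bistro Three')]

unique_venues = {}
for area in (area_a, area_b):
    for venue_id, venue_name in area:
        # Re-inserting an existing key overwrites it, so duplicates collapse
        unique_venues[venue_id] = venue_name

print(len(unique_venues), sorted(unique_venues))
```

The per-area lists, by contrast, keep a venue in every area it falls into, which is why the per-area average and the unique total are computed from different structures.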


In [292]:
print('List of all restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))

List of all restaurants
-----------------------
('4ae106ebf964a520b48421e3', 'The Grove', 29.752581423767637, -95.36053197196186, '1611 Lamar St (Dallas), Houston, TX 77010, United States', 200, False, -6760385.146072027, 13473771.685396012)
('57ac51ed498ece561de831aa', 'Xochi', 29.754480182176135, -95.35908860025862, '1777 Walker St Ste A, Houston, TX 77010, United States', 196, False, -6760237.255227516, 13473363.73106907)
('58d1190bfb549a21c7b9269b', 'Bayou & Bottle', 29.75420655229634, -95.36297655317625, 'Houston, TX, United States', 180, False, -6759894.668541908, 13473913.973317979)
('58914bfd88cfcc6161498a46', 'Brasserie du Parc', 29.754073609294203, -95.36102153792176, '1440 Lamar St, Houston, TX 77010, United States', 27, False, -6760108.346069797, 13473667.956436105)
('4b7c8437f964a5203b982fe3', 'Quattro', 29.754122193587296, -95.36246896887286, '1300 Lamar St (Austin St), Houston, TX 77010, United States', 133, False, -6759957.774785712, 13473855.749917114)
('4b44cf2df964a5

In [293]:
print('List of cafes')
print('---------------------------')
for r in list(cafe.values())[:10]:
    print(r)
print('...')
print('Total:', len(cafe))

List of cafes
---------------------------
('4b106c98f964a520257023e3', 'The Lake House', 29.753505227121686, -95.35956500686406, '1600 McKinney St, Houston, TX 77010, United States', 174, True, -6760339.373623759, 13473537.845367301)
('569b99b9498e336a7ce2c82e', 'The Grove  Houston-Downtown', 29.752835, -95.362307, 'Houston, TX, United States', 200, True, -6760170.793579837, 13473980.003847552)
('5863369e3bd4ab627c721b65', 'Texas T @ The Marriott Marquis Houston', 29.754837266433082, -95.35867888295716, 'Houston, TX, United States', 242, True, -6760223.112077849, 13473268.545201348)
('4c73df673c26ef3b623ae7d5', 'Alonti', 29.755658304763408, -95.36417404278404, '1001 Fannin St (Lamar), Houston, TX 77002, United States', 332, True, -6759554.074411901, 13473909.352746297)
('4f107cece4b0253d4baefba6', 'Nestlé Toll House Café by Chip', 29.730884973548207, -95.43201330915794, '1 Greenway Plz Spc C-670, Houston, TX 77046, United States', 277, True, -6756624.072176383, 13485771.575700153)
('4c

In [294]:
print('Restaurants around location')
print('---------------------------')
for i in range(10, 40):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))

Restaurants around location
---------------------------
Restaurants around location 11: B.B. Italia
Restaurants around location 12: 
Restaurants around location 13: La Griglia, Brasserie 19, Epicure Cafe, la Madeleine French Bakery & Café River Oaks, Pesca, Izakaya Wa
Restaurants around location 14: 
Restaurants around location 15: 
Restaurants around location 16: Ouisie's Table, RA Sushi, Ouises
Restaurants around location 17: Hawi Hawaiian Bbq
Restaurants around location 18: 
Restaurants around location 19: Ciscos, Cisco's Mexican Restaurant & Deli
Restaurants around location 20: French Corner, Safina Mediterranean
Restaurants around location 21: 
Restaurants around location 22: 
Restaurants around location 23: 
Restaurants around location 24: Droubi's Mediterranean Grill, sarku japan
Restaurants around location 25: 
Restaurants around location 26: 
Restaurants around location 27: A&A Catering
Restaurants around location 28: Pappadeaux Seafood Kitchen, Fat Bao, Little Pappasito's Can

Let's now show all the collected restaurants in our area of interest on a map, with cafes in red and other restaurants in blue.

In [297]:
houston = location[1]
map_houston = folium.Map(location=houston, zoom_start=13)
folium.Marker(houston, popup='Houston, TX').add_to(map_houston)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_cafe = res[6]
    color = 'red' if is_cafe else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_houston)
map_houston

Now let's take a look at all the venues within 350 m of each zip code center.

Looking good. So now we have all the venue information for the areas within a few hundred meters of each zip code center in Harris County, and we know the type of each venue! This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for a new coffee shop!

## Methodology <a name="methodology"></a>

In this project we will direct our efforts toward detecting areas of Houston that have restaurants in the neighborhood but a low density of other coffee shops. We will limit our analysis to an area of roughly 350 m around the center of each zip code.

In the first step we collected the required **data: location of all restaurants and cafes in each zip code neighborhood; location and type (category) of every venue within 350 m of the center of each zip code**.

The second step in our analysis is to filter out the areas without any restaurants.

In the third step, we will calculate and explore the '**coffee shop density**' across different areas of Houston - we will use **heatmaps** to identify a few promising areas close to the center with a low number of coffee shops in general and focus our attention on those areas.

In the fourth and final step we will focus on the most promising areas and within those create **clusters of locations that meet some basic requirements** established in discussion with stakeholders. We will present a map of all such locations but also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses which should be a starting point for final 'street level' exploration and search for the optimal venue location by stakeholders.
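
As a rough illustration of the clustering idea in the final step, here is a pure-Python sketch of Lloyd's k-means iteration on 2D points. It is a stand-in for illustration only, not the `sklearn.cluster.KMeans` class imported earlier; the sample points and starting centers are made up.

```python
def kmeans(points, centers, iterations=10):
    """Run Lloyd's algorithm on 2D points from the given starting centers."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest center
        labels = [
            min(range(len(centers)),
                key=lambda k: (p[0] - centers[k][0]) ** 2 + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return centers, labels


# Two made-up groups of points in projected (x, y) coordinates
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0), (12.0, 10.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

The fixed starting centers keep the sketch deterministic; a library implementation would normally choose them with a randomized initialization.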

## Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis and derive some additional info from our raw data. First let's count the number of restaurants in every candidate area:

In [306]:
location_restaurants_count = [len(res) for res in location_restaurants]

geo_ll['Restaurants in area'] = location_restaurants_count

print('Average number of restaurants in every area with radius=350m:', np.array(location_restaurants_count).mean())

geo_ll.head(10)

Average number of restaurants in every area with radius=350m: 1.0625


| | # | Zip Code | Population | Latitude | Longitude | Restaurants in area |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 77010 | 76 | 29.75431 | -95.361109 | 20 |
| 1 | 2.0 | 77094 | 7779 | 29.769285 | -95.681292 | 0 |
| 2 | 3.0 | 77046 | 471 | 29.733084 | -95.430659 | 5 |
| 3 | 4.0 | 77059 | 16690 | 29.615219 | -95.13496 | 0 |
| 4 | 5.0 | 77005 | 23338 | 29.718435 | -95.423555 | 0 |
| 5 | 6.0 | 77024 | 32746 | 29.771991 | -95.515453 | 0 |
| 6 | 7.0 | 77068 | 9505 | 30.00883 | -95.487234 | 1 |
| 7 | 8.0 | 77095 | 39275 | 29.916055 | -95.663077 | 0 |
| 8 | 9.0 | 77062 | 26978 | 29.575781 | -95.134334 | 0 |
| 9 | 10.0 | 77056 | 14031 | 29.749035 | -95.469021 | 2 |


OK, now let's filter out the candidate areas that have no restaurants around their center.

In [329]:
# keep only areas with at least one restaurant; reset_index(drop=True) renumbers the rows without modifying a slice in place
df_filtered = geo_ll[geo_ll['Restaurants in area']!=0].reset_index(drop=True)
print('There are {} areas left.'.format(df_filtered.shape[0]))

There are 31 areas left.
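
The problem statement also lists distance to nearby venues as a factor. A minimal sketch of measuring the great-circle distance from an area center to its nearest cafe with the haversine formula is shown below. The function names are hypothetical, the Earth radius is the common mean value of 6,371 km, and the coordinates are samples copied from the outputs above.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_distance_m(center, places):
    """Distance in meters from center to the closest of the given (lat, lon) places."""
    return min(haversine_m(center[0], center[1], lat, lon) for lat, lon in places)


center_77010 = (29.75431, -95.361109)  # zip code 77010 center
sample_cafes = [
    (29.753505227121686, -95.35956500686406),  # The Lake House
    (29.730884973548207, -95.43201330915794),  # Nestlé Toll House Café by Chip
]
print(round(nearest_distance_m(center_77010, sample_cafes)))
```

For this sample the result should land close to the 174 m that Foursquare reports for The Lake House, which is a useful sanity check on the formula.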


Let's create a map showing a heatmap / density of restaurants and try to extract some meaningful info from it. Also, let's add a few circles indicating distances of 1 km, 2 km and 3 km from the Houston center.

In [330]:
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]

cafe_latlons = [[res[2], res[3]] for res in cafe.values()]

In [333]:
from folium import plugins
from folium.plugins import HeatMap

map_houston = folium.Map(location=houston, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_houston) #cartodbpositron cartodbdark_matter
HeatMap(restaurant_latlons).add_to(map_houston)
folium.Marker(houston).add_to(map_houston)
folium.Circle(houston, radius=1000, fill=False, color='white').add_to(map_houston)
folium.Circle(houston, radius=2000, fill=False, color='white').add_to(map_houston)
folium.Circle(houston, radius=3000, fill=False, color='white').add_to(map_houston)

map_houston

Looks like a few pockets of high restaurant density can be found west of the Houston center.

Let's create another **heatmap showing the density of cafes only.**

In [334]:
from folium import plugins
from folium.plugins import HeatMap

map_houston = folium.Map(location=houston, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_houston) #cartodbpositron cartodbdark_matter
HeatMap(cafe_latlons).add_to(map_houston)
folium.Marker(houston).add_to(map_houston)
folium.Circle(houston, radius=1000, fill=False, color='white').add_to(map_houston)
folium.Circle(houston, radius=2000, fill=False, color='white').add_to(map_houston)
folium.Circle(houston, radius=3000, fill=False, color='white').add_to(map_houston)

map_houston

This map is not so 'hot' (cafes represent a subset of ~11% of all restaurants found in Houston), but it also indicates a higher density of existing cafes directly **west of the Houston center.**

Since we want areas with both a high density of restaurants and a low density of cafes, we will now focus our analysis on the following zip codes:

In [351]:
df_filtered = df_filtered.sort_values(by=['Restaurants in area'],ascending=False).head(10)
df_filtered

| | # | Zip Code | Population | Latitude | Longitude | Restaurants in area |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 77010 | 76 | 29.75431 | -95.361109 | 20 |
| 12 | 28.0 | 77098 | 12179 | 29.734813 | -95.416098 | 14 |
| 13 | 36.0 | 77063 | 27200 | 29.736295 | -95.523292 | 9 |
| 20 | 57.0 | 77002 | 13289 | 29.756845 | -95.365652 | 7 |
| 5 | 13.0 | 77019 | 15640 | 29.752702 | -95.407379 | 6 |
| 1 | 3.0 | 77046 | 471 | 29.733084 | -95.430659 | 5 |
| 16 | 47.0 | 77099 | 43116 | 29.670869 | -95.58599 | 4 |
| 18 | 51.0 | 77007 | 22497 | 29.77153 | -95.414883 | 3 |
| 17 | 48.0 | 77086 | 19815 | 29.920445 | -95.489262 | 3 |
| 6 | 16.0 | 77027 | 14217 | 29.744002 | -95.443213 | 3 |


We will only look at the 10 areas with the most restaurants around the area center.

In [298]:
def getNearbyVenues(names, latitudes, longitudes, radius=350):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Zip Code', 
                  'Zip Code Latitude', 
                  'Zip Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
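
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a response without groups or a venue without a category would raise an error. The sketch below shows one defensive way to flatten such a payload. The helper `flatten_explore_items` and the sample payload are hypothetical; the payload only mimics the keys used in `getNearbyVenues` and is not real API output.

```python
def flatten_explore_items(name, lat, lng, payload):
    """Turn one explore-style JSON payload into flat venue rows.

    Returns an empty list when the payload has no groups or items, and
    labels venues that carry no category as 'Unknown' instead of raising.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Unknown"),
        ))
    return rows


# Hypothetical payload shaped like the response that getNearbyVenues indexes into.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Discovery Green",
                   "location": {"lat": 29.753238, "lng": -95.359483},
                   "categories": [{"name": "Park"}]}},
        {"venue": {"name": "Uncategorized Venue",
                   "location": {"lat": 29.7540, "lng": -95.3610},
                   "categories": []}},
    ]}]}
}

rows = flatten_explore_items(77010, 29.75431, -95.361109, sample_payload)
empty = flatten_explore_items(77010, 29.75431, -95.361109, {"response": {}})
```

Each returned tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues`.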

In [352]:
radius = 350
LIMIT = 100

houston_venues = getNearbyVenues(names=df_filtered['Zip Code'],
                                   latitudes=df_filtered['Latitude'],
                                   longitudes=df_filtered['Longitude'],
                                   radius=radius
                                  )

77010
77098
77063
77002
77019
77046
77099
77007
77086
77027


In [353]:
print(houston_venues.shape)
houston_venues.head()

(252, 7)


Unnamed: 0,Zip Code,Zip Code Latitude,Zip Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,77010,29.75431,-95.361109,Discovery Green,29.753238,-95.359483,Park
1,77010,29.75431,-95.361109,Phoenicia Specialty Foods,29.754502,-95.36176,Supermarket
2,77010,29.75431,-95.361109,Pappas Bros. Steakhouse,29.75527,-95.36304,Steakhouse
3,77010,29.75431,-95.361109,The Grove,29.752581,-95.360532,New American Restaurant
4,77010,29.75431,-95.361109,Embassy Suites by Hilton,29.752999,-95.361341,Hotel


In [354]:
houston_venues.groupby('Zip Code').count()

Unnamed: 0_level_0,Zip Code Latitude,Zip Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Zip Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
77002,28,28,28,28,28,28
77007,19,19,19,19,19,19
77010,68,68,68,68,68,68
77019,41,41,41,41,41,41
77027,19,19,19,19,19,19
77046,12,12,12,12,12,12
77063,17,17,17,17,17,17
77086,11,11,11,11,11,11
77098,32,32,32,32,32,32
77099,5,5,5,5,5,5


In [355]:
print('There are {} unique categories.'.format(len(houston_venues['Venue Category'].unique())))

There are 113 unique categories.


In [356]:
# one hot encoding
houston_onehot = pd.get_dummies(houston_venues[['Venue Category']], prefix="", prefix_sep="")

# add zip code column back to dataframe
houston_onehot['Zip Code'] = houston_venues['Zip Code'] 

# move zip code column to the first column
fixed_columns = [houston_onehot.columns[-1]] + list(houston_onehot.columns[:-1])
houston_onehot = houston_onehot[fixed_columns]

houston_onehot.head()

Unnamed: 0,Zip Code,Accessories Store,African Restaurant,Airport,American Restaurant,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,...,Supplement Shop,Sushi Restaurant,Taco Place,Thai Restaurant,Thrift / Vintage Store,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Store,Whisky Bar,Yoga Studio
0,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,77010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [357]:
houston_grouped = houston_onehot.groupby('Zip Code').mean().reset_index()
houston_grouped

Unnamed: 0,Zip Code,Accessories Store,African Restaurant,Airport,American Restaurant,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,...,Supplement Shop,Sushi Restaurant,Taco Place,Thai Restaurant,Thrift / Vintage Store,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Store,Whisky Bar,Yoga Studio
0,77002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.035714,0.0,0.035714,0.035714,0.0,0.0,0.0,0.0,0.0,0.0
1,77007,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,77010,0.0,0.0,0.0,0.029412,0.0,0.0,0.014706,0.0,0.014706,...,0.014706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014706,0.0
3,77019,0.02439,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0
4,77027,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,...,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,77046,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.083333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,77063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,...,0.0,0.0,0.0,0.058824,0.058824,0.058824,0.0,0.058824,0.0,0.0
7,77086,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,77098,0.0,0.0,0.0,0.0,0.03125,0.03125,0.0,0.0,0.0,...,0.0,0.03125,0.0,0.03125,0.0,0.0,0.03125,0.0,0.0,0.03125
9,77099,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
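
Each value in this table is the share of a zip code's venues that fall into a category. For example, the 0.2 for African Restaurant in 77099 corresponds to 1 of its 5 venues. The sketch below reproduces that behaviour on a tiny, hypothetical data set (the labels `zip_a` and `zip_b` and their venues are invented) and checks it against `pd.crosstab`, which yields the same shares without one-hot encoding.

```python
import pandas as pd

# Hypothetical mini data set: five venues in one area, four in another.
venues = pd.DataFrame({
    "Zip Code": ["zip_a"] * 5 + ["zip_b"] * 4,
    "Venue Category": [
        "African Restaurant", "Fast Food Restaurant", "Bar", "Café", "Pizza Place",
        "Coffee Shop", "Coffee Shop", "Hotel", "Airport",
    ],
})

# One indicator column per category, as in the one-hot encoding cell.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")
onehot.insert(0, "Zip Code", venues["Zip Code"])

# The mean of a 0/1 column is the share of rows where it equals 1.
grouped = onehot.groupby("Zip Code").mean()

# The same shares computed directly from the category labels.
direct = pd.crosstab(venues["Zip Code"], venues["Venue Category"], normalize="index")
```

Because the values are shares rather than counts, an area with few venues (such as 77099) weighs each venue more heavily than an area with many (such as 77010).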


In [375]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Zip Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Zip Code'] = houston_grouped['Zip Code']

for ind in np.arange(houston_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(houston_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(10)

Unnamed: 0,Zip Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,77002,Hotel,Burger Joint,Coffee Shop,Seafood Restaurant,Bar,Optical Shop,Mexican Restaurant,Doctor's Office,Pizza Place,Lounge
1,77007,Bar,Grocery Store,Nightclub,Pharmacy,Pizza Place,Dessert Shop,Gym / Fitness Center,Lounge,Mexican Restaurant,Athletics & Sports
2,77010,Hotel,New American Restaurant,Lounge,Mexican Restaurant,Sandwich Place,Spanish Restaurant,Dumpling Restaurant,Shipping Store,Bank,Steakhouse
3,77019,Men's Store,Coffee Shop,Cosmetics Shop,Café,French Restaurant,Clothing Store,Gym,Accessories Store,Ice Cream Shop,Kitchen Supply Store
4,77027,Bank,Cupcake Shop,Bistro,Electronics Store,Cosmetics Shop,Flower Shop,Clothing Store,Sandwich Place,Frozen Yogurt Shop,Furniture / Home Store
5,77046,Coffee Shop,Burger Joint,Hotel,Airport,Food Truck,Food Court,Office,Seafood Restaurant,BBQ Joint,Laundry Service
6,77063,Rental Car Location,Pharmacy,Gun Range,Martial Arts Dojo,Massage Studio,Coffee Shop,Department Store,Japanese Restaurant,BBQ Joint,Pizza Place
7,77086,Pizza Place,Mexican Restaurant,Cajun / Creole Restaurant,Gas Station,Clothing Store,Latin American Restaurant,Grocery Store,Food Truck,Hardware Store,Discount Store
8,77098,Mexican Restaurant,Italian Restaurant,Dessert Shop,Yoga Studio,Restaurant,Pizza Place,Paper / Office Supplies Store,Other Repair Shop,Nightclub,Mediterranean Restaurant
9,77099,African Restaurant,Fast Food Restaurant,Bar,Café,Pizza Place,Yoga Studio,Furniture / Home Store,Discount Store,Doctor's Office,Dumpling Restaurant
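
Note that the ranking runs over all category columns, including those with a frequency of zero. Zip code 77099 has only 5 venues, so at least five of its ten listed categories never occur there and appear only because of how ties at zero are ordered. A minimal sketch of a variant that lists only observed categories is shown below. The function name `most_common_present` and the sample row are hypothetical.

```python
import pandas as pd


def most_common_present(row, num_top_venues):
    """Return up to num_top_venues categories with non-zero frequency.

    `row` is one row of a frequency table whose first entry is the zip code.
    Slots that cannot be filled with an observed category are set to None.
    """
    freqs = row.iloc[1:].astype(float)
    present = freqs[freqs > 0].sort_values(ascending=False)
    top = list(present.index[:num_top_venues])
    return top + [None] * (num_top_venues - len(top))


# Hypothetical row: three observed categories and two that never occur.
row = pd.Series({
    "Zip Code": 77099,
    "African Restaurant": 0.5,
    "Bar": 0.25,
    "Café": 0.25,
    "Yoga Studio": 0.0,
    "Zoo": 0.0,
})

top5 = most_common_present(row, 5)
```

The padded `None` entries make it visible at a glance when an area has fewer distinct categories than requested.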


Let us now cluster those areas by their venue profiles to obtain a few groups of candidate locations.

In [376]:
# set number of clusters
kclusters = 4

houston_grouped_clustering = houston_grouped.drop(columns='Zip Code')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(houston_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 1, 1, 1, 2, 1, 1, 1, 3], dtype=int32)
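
The labels follow the row order of `houston_grouped`, which is sorted by zip code. As a quick check of how the ten areas are distributed, the sketch below tallies the printed labels with plain Python. The zip codes and labels are copied from the outputs above.

```python
from collections import Counter

# Zip codes in the row order of houston_grouped, with the labels printed above.
zip_codes = [77002, 77007, 77010, 77019, 77027, 77046, 77063, 77086, 77098, 77099]
labels = [2, 0, 1, 1, 1, 2, 1, 1, 1, 3]

# Number of areas per cluster label.
cluster_sizes = Counter(labels)

# Zip codes grouped by cluster label.
members = {}
for zip_code, label in zip(zip_codes, labels):
    members.setdefault(label, []).append(zip_code)
```

Cluster label 1 holds six of the ten areas, while labels 0 and 3 each contain a single zip code, so the clustering mostly separates a few outliers from one large group.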

In [377]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

houston_merged = df_filtered

# merge the ranked venue categories into houston_merged, which holds latitude/longitude for each zip code
houston_merged = houston_merged.join(neighborhoods_venues_sorted.set_index('Zip Code'), on='Zip Code')

houston_merged.dropna(axis=0,inplace=True) # check the last columns!

houston_merged

Unnamed: 0,#,Zip Code,Population,Latitude,Longitude,Restaurants in area,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1.0,77010,76,29.75431,-95.361109,20,1,Hotel,New American Restaurant,Lounge,Mexican Restaurant,Sandwich Place,Spanish Restaurant,Dumpling Restaurant,Shipping Store,Bank,Steakhouse
12,28.0,77098,12179,29.734813,-95.416098,14,1,Mexican Restaurant,Italian Restaurant,Dessert Shop,Yoga Studio,Restaurant,Pizza Place,Paper / Office Supplies Store,Other Repair Shop,Nightclub,Mediterranean Restaurant
13,36.0,77063,27200,29.736295,-95.523292,9,1,Rental Car Location,Pharmacy,Gun Range,Martial Arts Dojo,Massage Studio,Coffee Shop,Department Store,Japanese Restaurant,BBQ Joint,Pizza Place
20,57.0,77002,13289,29.756845,-95.365652,7,2,Hotel,Burger Joint,Coffee Shop,Seafood Restaurant,Bar,Optical Shop,Mexican Restaurant,Doctor's Office,Pizza Place,Lounge
5,13.0,77019,15640,29.752702,-95.407379,6,1,Men's Store,Coffee Shop,Cosmetics Shop,Café,French Restaurant,Clothing Store,Gym,Accessories Store,Ice Cream Shop,Kitchen Supply Store
1,3.0,77046,471,29.733084,-95.430659,5,2,Coffee Shop,Burger Joint,Hotel,Airport,Food Truck,Food Court,Office,Seafood Restaurant,BBQ Joint,Laundry Service
16,47.0,77099,43116,29.670869,-95.58599,4,3,African Restaurant,Fast Food Restaurant,Bar,Café,Pizza Place,Yoga Studio,Furniture / Home Store,Discount Store,Doctor's Office,Dumpling Restaurant
18,51.0,77007,22497,29.77153,-95.414883,3,0,Bar,Grocery Store,Nightclub,Pharmacy,Pizza Place,Dessert Shop,Gym / Fitness Center,Lounge,Mexican Restaurant,Athletics & Sports
17,48.0,77086,19815,29.920445,-95.489262,3,1,Pizza Place,Mexican Restaurant,Cajun / Creole Restaurant,Gas Station,Clothing Store,Latin American Restaurant,Grocery Store,Food Truck,Hardware Store,Discount Store
6,16.0,77027,14217,29.744002,-95.443213,3,1,Bank,Cupcake Shop,Bistro,Electronics Store,Cosmetics Shop,Flower Shop,Clothing Store,Sandwich Place,Frozen Yogurt Shop,Furniture / Home Store


In [378]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(houston_merged['Latitude'], houston_merged['Longitude'], houston_merged['Zip Code'], houston_merged['Cluster Labels']):
    cluster = int(cluster)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [379]:
# Group 1 (cluster label 0)
houston_merged[houston_merged['Cluster Labels'] == 0]

Unnamed: 0,#,Zip Code,Population,Latitude,Longitude,Restaurants in area,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,51.0,77007,22497,29.77153,-95.414883,3,0,Bar,Grocery Store,Nightclub,Pharmacy,Pizza Place,Dessert Shop,Gym / Fitness Center,Lounge,Mexican Restaurant,Athletics & Sports


In [380]:
# Group 2 (cluster label 1)
houston_merged[houston_merged['Cluster Labels'] == 1]

Unnamed: 0,#,Zip Code,Population,Latitude,Longitude,Restaurants in area,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1.0,77010,76,29.75431,-95.361109,20,1,Hotel,New American Restaurant,Lounge,Mexican Restaurant,Sandwich Place,Spanish Restaurant,Dumpling Restaurant,Shipping Store,Bank,Steakhouse
12,28.0,77098,12179,29.734813,-95.416098,14,1,Mexican Restaurant,Italian Restaurant,Dessert Shop,Yoga Studio,Restaurant,Pizza Place,Paper / Office Supplies Store,Other Repair Shop,Nightclub,Mediterranean Restaurant
13,36.0,77063,27200,29.736295,-95.523292,9,1,Rental Car Location,Pharmacy,Gun Range,Martial Arts Dojo,Massage Studio,Coffee Shop,Department Store,Japanese Restaurant,BBQ Joint,Pizza Place
5,13.0,77019,15640,29.752702,-95.407379,6,1,Men's Store,Coffee Shop,Cosmetics Shop,Café,French Restaurant,Clothing Store,Gym,Accessories Store,Ice Cream Shop,Kitchen Supply Store
17,48.0,77086,19815,29.920445,-95.489262,3,1,Pizza Place,Mexican Restaurant,Cajun / Creole Restaurant,Gas Station,Clothing Store,Latin American Restaurant,Grocery Store,Food Truck,Hardware Store,Discount Store
6,16.0,77027,14217,29.744002,-95.443213,3,1,Bank,Cupcake Shop,Bistro,Electronics Store,Cosmetics Shop,Flower Shop,Clothing Store,Sandwich Place,Frozen Yogurt Shop,Furniture / Home Store


In [381]:
# Group 3 (cluster label 2)
houston_merged[houston_merged['Cluster Labels'] == 2]

Unnamed: 0,#,Zip Code,Population,Latitude,Longitude,Restaurants in area,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,57.0,77002,13289,29.756845,-95.365652,7,2,Hotel,Burger Joint,Coffee Shop,Seafood Restaurant,Bar,Optical Shop,Mexican Restaurant,Doctor's Office,Pizza Place,Lounge
1,3.0,77046,471,29.733084,-95.430659,5,2,Coffee Shop,Burger Joint,Hotel,Airport,Food Truck,Food Court,Office,Seafood Restaurant,BBQ Joint,Laundry Service


In [382]:
# Group 4 (cluster label 3)
houston_merged[houston_merged['Cluster Labels'] == 3]

Unnamed: 0,#,Zip Code,Population,Latitude,Longitude,Restaurants in area,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,47.0,77099,43116,29.670869,-95.58599,4,3,African Restaurant,Fast Food Restaurant,Bar,Café,Pizza Place,Yoga Studio,Furniture / Home Store,Discount Store,Doctor's Office,Dumpling Restaurant


## Results and Discussion <a name="results"></a>

Our analysis shows that few areas in Harris County, Houston combine a high density of restaurants with a low density of cafes. Most of these areas lie west of the Houston center. After keeping only the 10 zip codes with the most restaurants within 350 m of the area center, we looked at all the venues available in those areas. The candidate areas were then clustered into 4 groups to create zones of interest. Group 2 contains the most candidates (6 of the 10 areas), including the three zip codes with the most restaurants.

We recommend choosing from the zip codes in group 2, which contains 6 areas. 77010 has the most restaurants but also the smallest population; judging by its most common venues, it is probably in the downtown area. 77063 and 77019 already list Coffee Shop among their most common venues, so they are not considered either. 77086 and 77027 do not have as many restaurants as 77098, which also has a population of about 12,000. Overall, 77098 is the most recommended area in which to open a coffee shop.
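
The reasoning in this paragraph can be retraced as a small filter-and-sort over the group 2 table. The sketch below is only an illustration: the population cut-off `MIN_POPULATION` is a hypothetical value chosen to exclude 77010, and the coffee flag only checks for Coffee Shop or Café among the ten most common venues (dessert shops are not counted here).

```python
# Rows copied from the "Group 2" table (cluster label 1).
group2 = [
    # zip, population, restaurants in area, coffee shop or café in top 10
    (77010, 76, 20, False),
    (77098, 12179, 14, False),
    (77063, 27200, 9, True),
    (77019, 15640, 6, True),
    (77086, 19815, 3, False),
    (77027, 14217, 3, False),
]

MIN_POPULATION = 1000  # hypothetical cut-off, not taken from the analysis


def shortlist(areas, min_population=MIN_POPULATION):
    """Keep areas without coffee shops and with enough residents,
    ordered by restaurant count, then population (both descending)."""
    kept = [a for a in areas if not a[3] and a[1] >= min_population]
    return sorted(kept, key=lambda a: (a[2], a[1]), reverse=True)


ranked = shortlist(group2)
best_zip = ranked[0][0]
```

Under these assumptions 77098 comes out first, followed by 77086 and 77027, which matches the recommendation above.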

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify areas of Harris County, Houston with a high number of restaurants but a low density of cafes, in order to help the shop owner narrow down the search for an optimal location for a new cafe. By calculating the restaurant density distribution from Foursquare data, we first identified 10 zip code areas that justify further analysis, and then generated an extensive collection of venues around those areas, which satisfy some basic requirements regarding existing nearby restaurants. Those locations were then clustered in order to create major zones of interest (containing the greatest number of potential locations).

The final decision on the optimal coffee shop location will be made by the shop owner based on the specific characteristics of the neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), noise levels and proximity to major roads, real estate availability, prices, and the social and economic dynamics of every neighborhood.