<a href="https://colab.research.google.com/github/mehlmanmichael/Coursera_Capstone/blob/main/Week4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction / Business Problem
## Motivation
My brother lives in Houston, Texas. He recently became unemployed and thinks this would be a great time to start a new company. I plan to help him figure out which types of businesses are under-represented or over-represented in different areas of the city near where he lives.

If a type of business is under-represented in an area relative to that area's peers, then it might be worth researching further to see whether it makes sense to pursue that type of venture.
## Approach
### Cluster
First we will need to cluster the area around my brother's house to see what areas are similar in order to generate a peer group.  I had several ideas on how to do this:
* Use neighborhoods: I found a list of Houston neighborhoods on Wikipedia; however, I could not find corresponding lat/long information.
* Use zip codes: Houston zip codes have the form 770XX. I did this project first using this approach, but the areas were too coarse (some zip codes are large).
* __Create a lat/long grid:__ I chose this approach because it gave sufficient resolution on the areas surrounding my brother's house.

I used k-means clustering to break a 10x10 lat/long grid of 100 areas into 10 different clusters, or "types", of areas. I then used these clusters as the peer groups for the next part.

### Find area characteristics relative to cluster means
For each cluster, I took the mean count of each venue type and compared it to the count of that venue type in each region of the cluster. I then picked a few venue types of interest and looked for the region where each was most under-represented. These are good candidate businesses.
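
To make the comparison concrete, here is a minimal sketch with made-up counts. The region ids, cluster labels, numbers, and the helper name `representation_gap` are all illustrative assumptions, not Foursquare data or code from the notebook.

```python
from collections import defaultdict

# Hypothetical toy data: venue counts per region and a cluster label per region.
counts = {
    0: {"Mexican Restaurant": 0, "Bar": 2},
    1: {"Mexican Restaurant": 3, "Bar": 1},
    2: {"Mexican Restaurant": 3, "Bar": 0},
    3: {"Mexican Restaurant": 1, "Bar": 5},
}
cluster_of = {0: "A", 1: "A", 2: "A", 3: "B"}


def representation_gap(counts, cluster_of):
    """Return {region: {category: cluster_mean - region_count}}.

    Positive values mean the region has fewer venues of that category
    than the average region in its cluster (under-represented).
    """
    totals = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for region, cats in counts.items():
        label = cluster_of[region]
        sizes[label] += 1
        for cat, n in cats.items():
            totals[label][cat] += n
    gaps = {}
    for region, cats in counts.items():
        label = cluster_of[region]
        gaps[region] = {
            cat: totals[label][cat] / sizes[label] - n for cat, n in cats.items()
        }
    return gaps


gaps = representation_gap(counts, cluster_of)
# Cluster A averages 2.0 Mexican restaurants and region 0 has none, so the gap is 2.0.
print(gaps[0]["Mexican Restaurant"])
```

A region that is alone in its cluster (region 3 here) always has a gap of zero, so small clusters carry little information in this comparison.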

## Required Data
* Corner points of the lat/long bounding box to consider (obtained manually from Google Maps)
* Foursquare data on top venues in each of these areas (used both for clustering and then to compare to average cluster properties).

# Data

## Import libraries and load API keys
API keys are kept out of the notebook for security and loaded from a separate module, as shown below.

In [None]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from IPython.display import Image 
from IPython.core.display import HTML 
import folium # mapping library
from folium.features import DivIcon
from sklearn.cluster import KMeans
import sklearn.utils
import matplotlib.cm as cm
import matplotlib.colors as colors


In [None]:
# Foursquare API credentials are stored on my google drive in a .py file.  Load them here.
from google.colab import drive
drive.mount('/content/drive')
import os
import sys
libpath = '/content/drive/My Drive/Programming/Python/Coursera Data Science Capstone/'
sys.path.append(libpath)
import keys
drive.flush_and_unmount()
#print(keys.CLIENT_ID+keys.CLIENT_SECRET+keys.ACCESS_TOKEN)
VERSION = '20180604'
LIMIT = 50

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
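
Outside of Colab, one alternative way to keep credentials out of the notebook is to read them from environment variables. The sketch below is only an assumption about how that could look: the function name `load_foursquare_keys` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and not part of this project.

```python
import os


def load_foursquare_keys(env=os.environ):
    """Read credentials from a mapping of environment variables.

    The variable names below are hypothetical placeholders.
    """
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]


# Usage with a stand-in mapping so the sketch runs without real credentials.
client_id, client_secret = load_foursquare_keys(
    {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
)
```

Failing early with a clear message is easier to debug than an authentication error from the API later on.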


## Get Houston neighborhoods and lat/long

### Zip code approach (too coarse)

In [None]:
geolocator = Nominatim(user_agent='explorer')
location = geolocator.geocode('Houston, Texas')
hlat = location.latitude
hlon = location.longitude
regs = ['770{:02d}'.format(i) for i in range(100)]
locs = [geolocator.geocode(reg+', Texas') for reg in regs]
lats = [loc.latitude if loc is not None else None for loc in locs]
lons = [loc.longitude if loc is not None else None for loc in locs]

### Lat/Long grid approach (this is what I ultimately used)

In [None]:
lat1, lon1 = 29.828408887699396, -95.38141035479127
lat2, lon2 = 29.703835072319947, -95.51242592009352
latlist = np.linspace(lat1, lat2, 10)
lonlist = np.linspace(lon1, lon2, 10)
lats, lons, regs = [], [], []
i = 0
for lat in latlist:
  for lon in lonlist:
    lats.append(lat)
    lons.append(lon)
    regs.append(i)
    i += 1
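
As a rough sanity check on the grid resolution, the spacing between neighboring grid points can be estimated with the haversine formula. This is a sketch under the usual spherical-Earth assumption (mean radius 6371 km); the helper `haversine_km` is not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a standard approximation


def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance between two lat/long points in kilometres."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    dphi = phi_b - phi_a
    dlmb = math.radians(lon_b - lon_a)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Same bounding-box corners and 10 points per axis as the grid above.
lat1, lon1 = 29.828408887699396, -95.38141035479127
lat2, lon2 = 29.703835072319947, -95.51242592009352
n_points = 10
lat_step_km = haversine_km(lat1, lon1, lat2, lon1) / (n_points - 1)
lon_step_km = haversine_km(lat1, lon1, lat1, lon2) / (n_points - 1)
print(round(lat_step_km, 2), round(lon_step_km, 2))
```

The spacing works out to roughly 1.5 km north-south and 1.4 km east-west. Since the venue search below uses a default `radius=2000` (meters), neighboring regions overlap and will share some venues.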

Convert whichever data source was used into a DataFrame.

In [None]:
df = pd.DataFrame({'reg':regs,'lat':lats,'lon':lons})
df = df.dropna().set_index('reg')

## Get the Foursquare data

In [None]:
LIMIT = 50
def getNearbyVenues(regs, lats, lons, radius=2000):
    # Query the Foursquare explore endpoint once per region and flatten the results into one DataFrame.
    venues_list = []
    for zc, lat, lon in zip(regs, lats, lons):
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(keys.CLIENT_ID, keys.CLIENT_SECRET, VERSION, lat, lon, radius, LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # One row per venue: region id, region center, venue name, venue location, primary category.
        venues_list.append([(zc, lat, lon, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['reg', 'lat', 'lon', 'venue', 'venue_lat', 'venue_lon', 'venue_category']
    return nearby_venues
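
The flattening step can be tried offline. The sketch below uses a hand-written payload that only mimics the fields `getNearbyVenues` reads (`response`, `groups`, `items`, `venue`, `location`, `categories`); the venue names and the helper `flatten_items` are made up for illustration. It also skips venues with an empty category list, which the list comprehension above would not tolerate.

```python
# Hypothetical payload shaped like the fields getNearbyVenues reads.
sample_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Taqueria Demo",
                               "location": {"lat": 29.8280, "lng": -95.4390},
                               "categories": [{"name": "Mexican Restaurant"}]}},
                    {"venue": {"name": "No-Category Demo",
                               "location": {"lat": 29.8300, "lng": -95.4400},
                               "categories": []}},
                ]
            }
        ]
    }
}


def flatten_items(reg, lat, lon, payload):
    """Turn one explore-style payload into rows, skipping venues without a category."""
    rows = []
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # indexing categories[0] would raise IndexError here
        rows.append((reg, lat, lon, venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], categories[0]["name"]))
    return rows


rows = flatten_items(4, 29.828409, -95.439639, sample_response)
print(rows)
```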

In [None]:
venues = getNearbyVenues(regs=df.index.tolist(), lats=df['lat'], lons=df['lon'])

## Prepare data for kmeans

In [None]:
onehot = pd.get_dummies(venues[['venue_category']], prefix="", prefix_sep="")
onehot['reg'] = venues['reg'] 
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]
grouped = onehot.groupby('reg').mean().reset_index()
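
Taking the mean of one-hot columns per region gives, for each region, the fraction of its venues that fall in each category. A small standard-library sketch with made-up `(region, category)` pairs shows the same arithmetic without pandas:

```python
from collections import Counter, defaultdict

# Hypothetical (region, venue_category) pairs standing in for the venues table.
pairs = [
    (0, "Bar"), (0, "Bar"), (0, "Pizza Place"), (0, "Hotel"),
    (1, "Pizza Place"), (1, "Pizza Place"),
]

by_region = defaultdict(Counter)
for reg, category in pairs:
    by_region[reg][category] += 1

categories = sorted({category for _, category in pairs})
# Each row holds the fraction of the region's venues in each category,
# which is what the mean of one-hot columns works out to.
frequencies = {
    reg: {cat: counter[cat] / sum(counter.values()) for cat in categories}
    for reg, counter in by_region.items()
}
print(frequencies[0])
```

Because each row sums to 1, the clustering compares regions by the mix of venue types rather than by how many venues they have. The later peer comparison uses raw counts (`sum`) instead.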

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10
columns = ['reg']
for ind in np.arange(num_top_venues):
    columns.append('{}_most_common_venue'.format(ind+1))
regs_venues_sorted = pd.DataFrame(columns=columns)
regs_venues_sorted['reg'] = grouped['reg']
for ind in np.arange(grouped.shape[0]):
    regs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)
grouped_clustering = grouped.drop(columns='reg')

## Run kmeans, clean up dataframe, and plot to see what it looks like

In [None]:
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(grouped_clustering)#fit clusters
regs_venues_sorted.insert(0, 'cluster_label', kmeans.labels_)
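
For intuition about what the fit does, here is a bare-bones version of Lloyd's algorithm on made-up frequency vectors. It is a simplified sketch, not scikit-learn's implementation: it uses random initial centroids and a single run, whereas `KMeans` defaults to k-means++ initialization.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Hypothetical venue-frequency rows: three bar-heavy and three hotel-heavy regions.
points = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.85, 0.1, 0.05],
    [0.0, 0.1, 0.9], [0.1, 0.0, 0.9], [0.05, 0.15, 0.8],
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only which regions share a label matters.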

In [None]:
merged = df
merged = merged.join(regs_venues_sorted.set_index('reg'))

The plot consists of annotated circles. Both the annotation and the circle color reflect the cluster number. From my knowledge of Houston, this breakdown seems pretty reasonable!

In [281]:
map_clusters = folium.Map(location=[hlat, hlon], zoom_start=11)
import matplotlib.cm as cm
import matplotlib.colors as colors
x = np.arange(n_clusters)
ys = [i + x + (i*x)**2 for i in range(n_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
for lat, lon, poi, cluster in zip(merged['lat'], merged['lon'], merged.index.tolist(), merged['cluster_label']):
    out = 'Region: '+str(poi)+', Cluster '+str(cluster)
    folium.map.Marker([lat, lon], icon=DivIcon(icon_size=(50,50),icon_anchor=(25,25),html='<div style="font-size: 8pt">%s</div>' % out)).add_to(map_clusters)
    folium.Circle([lat, lon], 500, color=rainbow[cluster-1], fill=True, fill_color=rainbow[cluster-1], fill_opacity=0.7).add_to(map_clusters)
map_clusters

Folium plots do not render on GitHub, so please click [here](https://colab.research.google.com/drive/15bY7XFCpoIYk-i8eOJrR_AsS8t8Wn5qB?usp=sharing) to see the notebook/plot.
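
One detail of the color lookup: k-means labels run from 0 to `n_clusters - 1`, so `rainbow[cluster-1]` sends label 0 to index -1, the last color in the list. The mapping is still one-to-one, so the plot is correct, just shifted. The sketch below illustrates this with a stand-in palette built from the standard library (`make_palette` is a hypothetical helper, not `cm.rainbow`).

```python
import colorsys


def make_palette(n):
    """Return n evenly spaced hues as hex strings (a stand-in for cm.rainbow)."""
    palette = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 1.0, 1.0)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


n_labels = 10
palette = make_palette(n_labels)
labels = list(range(n_labels))  # k-means labels run from 0 to n_labels - 1

shifted = [palette[label - 1] for label in labels]  # indexing used in the plot
direct = [palette[label] for label in labels]       # same colors, unshifted

# Label 0 picks up the last palette entry through the negative index.
print(shifted[0] == palette[-1])
```

Indexing with `palette[label]` directly would give the same set of colors without relying on negative indexing.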

## Compare representation of different venues among peers
* Find the average number of each type of venue among cluster group
* Find the difference of each region relative to its cluster peers

In [270]:
v_rep = onehot.groupby('reg').sum().astype(float)
cols = v_rep.columns
v_rep = regs_venues_sorted[['reg','cluster_label']].join(merged[['lat', 'lon']].join(v_rep))
v_rep_mean = v_rep.groupby('cluster_label').mean().iloc[:,3:]
v_rep = v_rep.merge(v_rep_mean, on='cluster_label', suffixes=['', '_avg']).set_index('reg')
for col in cols:
  # Positive values mean the region has fewer venues of this type than its cluster average (under-represented).
  v_rep[col] = v_rep[col+'_avg'] - v_rep[col]
v_rep = v_rep.drop([x+'_avg' for x in cols], axis=1)

## Pick a few venue types and explore / plot

Looks like Mexican Restaurants are under-represented in one region...

In [273]:
v_rep.iloc[:,3:].describe().loc['max'].sort_values(ascending=False).head(10)

Indian Restaurant       3.400000
Zoo Exhibit             3.000000
Clothing Store          2.428571
Mexican Restaurant      2.333333
Pizza Place             2.307692
Hotel                   2.285714
Fast Food Restaurant    2.250000
Food Service            2.250000
Discount Store          2.200000
Bar                     2.111111
Name: max, dtype: float64

Region 4 needs more Mexican Restaurants!

In [290]:
v_rep[['lat', 'lon', 'Mexican Restaurant']].sort_values('Mexican Restaurant', ascending=False).head(10)

| reg | lat | lon | Mexican Restaurant |
|---|---|---|---|
| 4 | 29.828409 | -95.439639 | 2.333333 |
| 56 | 29.759201 | -95.468754 | 2.1 |
| 52 | 29.759201 | -95.410525 | 2.1 |
| 61 | 29.74536 | -95.395968 | 2.0 |
| 22 | 29.800726 | -95.410525 | 2.0 |
| 95 | 29.703835 | -95.454197 | 1.538462 |
| 93 | 29.703835 | -95.425082 | 1.5 |
| 92 | 29.703835 | -95.410525 | 1.5 |
| 83 | 29.717677 | -95.425082 | 1.5 |
| 0 | 29.828409 | -95.38141 | 1.333333 |


Again, plots do not render on GitHub, so you can click [here](https://colab.research.google.com/drive/15bY7XFCpoIYk-i8eOJrR_AsS8t8Wn5qB?usp=sharing) to see it. For each region, this plot shows the cluster mean minus the region's own count of Mexican Restaurants, so larger values mean more under-represented (region 4 is in the middle of the top row).

In [284]:
poi = 'Mexican Restaurant'
map_clusters = folium.Map(location=[hlat, hlon], zoom_start=11)
x = np.arange(n_clusters)
ys = [i + x + (i*x)**2 for i in range(n_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
for lat, lon, cluster, val in zip(v_rep['lat'], v_rep['lon'], v_rep['cluster_label'], v_rep[poi]):
    # round() does not raise on NaN, so test for missing values explicitly.
    out = '--' if pd.isna(val) else str(round(val, 2))
    folium.map.Marker([lat, lon], icon=DivIcon(icon_size=(150,36),icon_anchor=(20,15),html='<div style="font-size: 18pt">%s</div>' % out)).add_to(map_clusters)
    folium.Circle([lat, lon], 500, color=rainbow[cluster-1], fill=True, fill_color=rainbow[cluster-1], fill_opacity=0.7).add_to(map_clusters)
map_clusters