# Capstone Project - The Battle of Neighborhoods (Week 2)

## Goals:

1. Introduction where you discuss the business problem and who would be interested in this project.
2. Data where you describe the data that will be used to solve the problem and the source of the data.
3. Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learnings were used and why.
4. Results section where you discuss the results.
5. Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
6. Conclusion section where you conclude the report.

---

## Intro

I'd like to open a restaurant in Texas, specifically a Vietnamese one.  My candidate cities are Dallas, Austin, and Houston.  I'd like to open in an area of town that already has a considerable Asian/Vietnamese restaurant presence.  Walkability and the likability of nearby businesses is important.  

First, I'll narrow down which city to build in.  Then, select a more specific neighborhood within that city.  This is important for anyone who has the flexibility to open in any nearby city and may eventually want to franchise.

---

## Data

#### Examples

- FS API Search: https://developer.foursquare.com/docs/api-reference/venues/search/
- FS API Likes: https://developer.foursquare.com/docs/api-reference/venues/likes/
- Cities DB: https://simplemaps.com/data/us-cities

#### Proposed Steps

- Collect location data for the main cities I selected from Google/Wiki.
- List neighborhood candidates for querying (pulled from external data).
- Search for Asian/Viet restaurants in each city using Foursquare's API
- Visualize on map
- Identify candidate neighborhood/areas where Asian/Viet restaurants are frequent
- Narrow down restaurants per city and re-visualize on map
- Add category data from Foursquare's API
- Add likes data from Foursquare's API
- Group by neighborhoods and run a k-means clutering alg. to score the neighborhoods
- Select where to build the restaurant

---

## Methodology

In [1]:
import pandas as pd
import requests
import bs4

### Gather coordinates of candidate cities

Use a CSV found onlne of cities with coordinates and extract the coordinates for our 3 cities

In [2]:
filename = 'uscities.csv'
csv_df = pd.read_csv(filename)

Cities = ['Dallas', 'Houston', 'Austin']
State = 'TX'

df = csv_df.loc[(csv_df['city'].isin(Cities)) & (csv_df['state_id'] == State)]
cities_df = df[['city','state_id','lat', 'lng']]
cities_df = cities_df.rename(columns={'city':'City','state_id':'State','lat':'Latitude', 'lng':'Longitude'})
cities_df

Unnamed: 0,City,State,Latitude,Longitude
4,Dallas,TX,32.7936,-96.7662
6,Houston,TX,29.7863,-95.3889
31,Austin,TX,30.3004,-97.7522


### Put the cities on a map of Texas

In [3]:
import folium

# Texas coordinates
tx_coord = [31.5132, -96.3832]

# create map of Toronto using latitude and longitude values
map_tx = folium.Map(location=tx_coord, zoom_start=8)

# add markers to map
for lat, lng, city in zip(cities_df['Latitude'], cities_df['Longitude'], cities_df['City']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tx)  
    
map_tx

### Configuration for using Foursquare API

In [4]:
# Foursquare API

CLIENT_ID = 'ZBY35LYFENJHYRKLT041HRBCRRUEIT1DGBSJJ1ULBK3EEVJA' # your Foursquare ID
CLIENT_SECRET = 'ARVZ5K1FHCUCSDBQ2M1XNCSNS2GKLSSPKCX5N32Y54FVP314' # your Foursquare Secret
VERSION = '20201112'

### Gather city coordinates from dataframe

In [5]:
# Find coordinates of each city

dfw_coord = cities_df.loc[cities_df['City'] == 'Dallas'][['Latitude', 'Longitude']]
dfw_lat, dfw_lng = dfw_coord.Latitude.values[0], dfw_coord.Longitude.values[0]

htx_coord = cities_df.loc[cities_df['City'] == 'Houston'][['Latitude', 'Longitude']]
htx_lat, htx_lng = htx_coord.Latitude.values[0], htx_coord.Longitude.values[0]

atx_coord = cities_df.loc[cities_df['City'] == 'Austin'][['Latitude', 'Longitude']]
atx_lat, atx_lng = atx_coord.Latitude.values[0], atx_coord.Longitude.values[0]

print(f'Dallas coordinates are: {dfw_lat, dfw_lng}')
print(f'Houston coordinates are: {htx_lat, htx_lng}')
print(f'Austin coordinates are: {atx_lat, atx_lng}')

Dallas coordinates are: (32.7936, -96.7662)
Houston coordinates are: (29.7863, -95.3889)
Austin coordinates are: (30.3004, -97.7522)


### Create Foursquare ./search API request

In [6]:
base_url = f'https://api.foursquare.com/v2/venues/search?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}'

LIMIT = 100
Cities = ['Dallas', 'Houston', 'Austin']
State = 'TX'
query_term = "Vietnamese"

dfw_url = f'{base_url}&ll={dfw_lat},{dfw_lng}&limit={LIMIT}&near={Cities[0]}, {State}&query={query_term}'
htx_url = f'{base_url}&ll={htx_lat},{htx_lng}&limit={LIMIT}&near={Cities[1]}, {State}&query={query_term}'
atx_url = f'{base_url}&ll={atx_lat},{atx_lng}&limit={LIMIT}&near={Cities[2]}, {State}&query={query_term}'

print(f'Example URL: {dfw_url}')

Example URL: https://api.foursquare.com/v2/venues/search?&client_id=ZBY35LYFENJHYRKLT041HRBCRRUEIT1DGBSJJ1ULBK3EEVJA&client_secret=ARVZ5K1FHCUCSDBQ2M1XNCSNS2GKLSSPKCX5N32Y54FVP314&v=20201112&ll=32.7936,-96.7662&limit=100&near=Dallas, TX&query=Vietnamese


### Gather responses

In [7]:
dfw_response = requests.get(dfw_url).json()
htx_response = requests.get(htx_url).json()
atx_response = requests.get(atx_url).json()

responses = [dfw_response, htx_response, atx_response]

### Move from JSON to DataFrame

In [9]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [23]:
# tranform JSON file into a pandas dataframe
from pandas import json_normalize

# Load JSON response into DF and clean
def clean_json_to_df(response):

    # JSON -> DF
    temp = json_normalize(response['response']['venues'])
    # Assign column names
    filtered_cols = ['name', 'location.lat', 'location.lng', 'categories', 'id']
    temp = temp.loc[:, filtered_cols]
    # Clean category columns
    temp['categories'] = temp.apply(get_category_type, axis=1)
    return(temp.rename(columns={'name':'Name','location.lat':'Latitude','location.lng':'Longitude', 'categories':'Categories'}))



### Combine disperate city dataframes into one

In [24]:
from pandas import concat

venue_dfs = []

for response in responses:
    venue_dfs.append(clean_json_to_df(response))

venue_df = concat(venue_dfs)

In [25]:
venue_df.head()

Unnamed: 0,Name,Latitude,Longitude,Categories,id
0,Malai Thai Vietnamese Kitchen & Bar,32.808763,-96.796722,Thai Restaurant,4d41e27533268cfa58df5201
1,Ngon Vietnamese Kitchen,32.813849,-96.770282,Vietnamese Restaurant,5f1cb638b5d99d73561aa7ab
2,Mai's Vietnamese Restaurant,32.803683,-96.77368,Vietnamese Restaurant,4a79ddf7f964a520d7e71fe3
3,Vietnamese Baptist Church,32.91408,-96.639169,Church,4c94132172dd224b50d09691
4,Dalat Vietnamese Restaurant,32.803661,-96.774386,Vietnamese Restaurant,4f32441e19836c91c7c6a350


In [26]:
print('{} venues were returned by Foursquare total.'.format(venue_df.shape[0]))

108 venues were returned by Foursquare total.


### Add empty column placeholder for 'Likes' count

In [27]:
venue_df['Likes'] = ""
venue_df.head()

Unnamed: 0,Name,Latitude,Longitude,Categories,id,Likes
0,Malai Thai Vietnamese Kitchen & Bar,32.808763,-96.796722,Thai Restaurant,4d41e27533268cfa58df5201,
1,Ngon Vietnamese Kitchen,32.813849,-96.770282,Vietnamese Restaurant,5f1cb638b5d99d73561aa7ab,
2,Mai's Vietnamese Restaurant,32.803683,-96.77368,Vietnamese Restaurant,4a79ddf7f964a520d7e71fe3,
3,Vietnamese Baptist Church,32.91408,-96.639169,Church,4c94132172dd224b50d09691,
4,Dalat Vietnamese Restaurant,32.803661,-96.774386,Vietnamese Restaurant,4f32441e19836c91c7c6a350,


### Collect likes for venues and add back into dataframe

In [28]:
for index, row in venue_df.iterrows():
    id = row['id']
    url = f'https://api.foursquare.com/v2/venues/{id}/likes?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}'
    results = requests.get(url).json()
    venue_df.loc[index, 'Likes'] = results['response']['likes']['count']

venue_df.head()

Unnamed: 0,Name,Latitude,Longitude,Categories,id,Likes
0,Malai Thai Vietnamese Kitchen & Bar,32.808763,-96.796722,Thai Restaurant,4d41e27533268cfa58df5201,382
1,Ngon Vietnamese Kitchen,32.813849,-96.770282,Vietnamese Restaurant,5f1cb638b5d99d73561aa7ab,0
2,Mai's Vietnamese Restaurant,32.803683,-96.77368,Vietnamese Restaurant,4a79ddf7f964a520d7e71fe3,14
3,Vietnamese Baptist Church,32.91408,-96.639169,Church,4c94132172dd224b50d09691,0
4,Dalat Vietnamese Restaurant,32.803661,-96.774386,Vietnamese Restaurant,4f32441e19836c91c7c6a350,0


In [29]:
venue_df.dtypes

Name           object
Latitude      float64
Longitude     float64
Categories     object
id             object
Likes          object
dtype: object

In [30]:
venue_df.head()

Unnamed: 0,Name,Latitude,Longitude,Categories,id,Likes
0,Malai Thai Vietnamese Kitchen & Bar,32.808763,-96.796722,Thai Restaurant,4d41e27533268cfa58df5201,382
1,Ngon Vietnamese Kitchen,32.813849,-96.770282,Vietnamese Restaurant,5f1cb638b5d99d73561aa7ab,0
2,Mai's Vietnamese Restaurant,32.803683,-96.77368,Vietnamese Restaurant,4a79ddf7f964a520d7e71fe3,14
3,Vietnamese Baptist Church,32.91408,-96.639169,Church,4c94132172dd224b50d09691,0
4,Dalat Vietnamese Restaurant,32.803661,-96.774386,Vietnamese Restaurant,4f32441e19836c91c7c6a350,0


### Drop unnecessary columns for clustering

In [31]:
# Remove non numerical values
# Alternatively: One-Hot encode them

venue_clustering = venue_df.drop(['Name', 'Categories', 'id'], 1)
venue_clustering

Unnamed: 0,Latitude,Longitude,Likes
0,32.808763,-96.796722,382
1,32.813849,-96.770282,0
2,32.803683,-96.773680,14
3,32.914080,-96.639169,0
4,32.803661,-96.774386,0
...,...,...,...
11,30.410900,-97.676118,6
12,30.434334,-97.615370,1
13,30.361853,-97.715468,38
14,30.378036,-97.687205,171


### Normalize the values in clustering DF

In [32]:
import pandas as pd
from sklearn import preprocessing

x = venue_clustering.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

### Implement k-means clustering

Features are likes and location (proximity)

In [33]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 10

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(df)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([4, 2, 2, 2, 2, 2, 2, 2, 8, 2, 2, 2, 8, 2, 9, 2, 8, 2, 2, 8, 2, 2,
       8, 8, 8, 2, 8, 8, 4, 8, 2, 9, 8, 2, 8, 2, 8, 8, 9, 8, 2, 8, 3, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 3, 1, 1, 6, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0],
      dtype=int32)

### Add Cluster Label back into DF

In [34]:
# venue_df.drop('Cluster Label', 1, inplace=True)
venue_df.insert(0, 'Cluster Label', kmeans.labels_)
venue_df.head()

Unnamed: 0,Cluster Label,Name,Latitude,Longitude,Categories,id,Likes
0,4,Malai Thai Vietnamese Kitchen & Bar,32.808763,-96.796722,Thai Restaurant,4d41e27533268cfa58df5201,382
1,2,Ngon Vietnamese Kitchen,32.813849,-96.770282,Vietnamese Restaurant,5f1cb638b5d99d73561aa7ab,0
2,2,Mai's Vietnamese Restaurant,32.803683,-96.77368,Vietnamese Restaurant,4a79ddf7f964a520d7e71fe3,14
3,2,Vietnamese Baptist Church,32.91408,-96.639169,Church,4c94132172dd224b50d09691,0
4,2,Dalat Vietnamese Restaurant,32.803661,-96.774386,Vietnamese Restaurant,4f32441e19836c91c7c6a350,0


### Function to add markers and focus on given coordinate

In [37]:
# Map venues Function

# Matplotlib and associated plotting modules
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

def plot_markers(coordinates, df):
    # Create map of Houston using latitude and longitude values
    map_tx = folium.Map(location=coordinates, zoom_start=10)

    # Set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # Add markers to map
    for lat, lng, name, likes, cluster in zip(venue_df['Latitude'], venue_df['Longitude'], venue_df['Name'], venue_df['Likes'], venue_df['Cluster Label']):
        label = 'Cluster {} - {}: {} likes'.format(cluster,name, likes)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7,
            parse_html=False).add_to(map_tx)  
    return map_tx

### Display each city's maps with clustering labels

In [44]:
# Dallas map
dfw_coord = [dfw_lat, dfw_lng]
dfw_map = plot_markers(dfw_coord, venue_df)
dfw_map

In [39]:
# Houston map
htx_coord = [htx_lat, htx_lng]
htx_map = plot_markers(htx_coord, venue_df)
htx_map

In [40]:
# Austin map
atx_coord = [atx_lat, atx_lng]
atx_map = plot_markers(atx_coord, venue_df)
atx_map

### Brief overview of cluster likes

In [42]:
for k in range(kclusters):
    print(f"Cluster {k} averages {round(venue_df.loc[venue_df['Cluster Label'] == k, 'Likes'].mean(),2)} likes")

Cluster 0 averages 14.5 likes
Cluster 1 averages 7.33 likes
Cluster 2 averages 10.38 likes
Cluster 3 averages 344.0 likes
Cluster 4 averages 344.0 likes
Cluster 5 averages 382.0 likes
Cluster 6 averages 150.67 likes
Cluster 7 averages 171.0 likes
Cluster 8 averages 6.75 likes
Cluster 9 averages 150.67 likes


---

## Results

Compiling the data across all 3 cities and implementing a k-means model based on location and score, we are able to identify clusters of venues.  Reminder, our venues are pre-filtered to show query results for "Vietnamese" based on the Foursquare API request's response.  

We targeted 10 clusters and results shows tier 1 clusters (3, 4, 5), tier 2 (6, 7, 9), and tier 3 (0, 1, 2, 8).  We want to be in proximity to well liked venues, so lets examine what the tier 1 and tier 2 clusters look like.

| cluster | city    | likes  |  tier |   |
|---------|---------|--------|-------|---|
| 0       | Houston | 14.5   |  3    |   |
| 1       | Austin  | 7.33   |  3    |   |
| 2       | Dallas  | 10.38  |  3    |   |
| 3       | Houston | 344.0  |  1    |   |
| 4       | Dallas  | 344.0  |  1    |   |
| 5       | Austin  | 382.0  |  1    |   |
| 6       | Houston | 150.67 |  2    |   |
| 7       | Austin  | 171.0  |  2    |   |
| 8       | Dallas  | 6.75   |  3    |   |
| 9       | Dallas  | 150.67 |  2    |   |
|         |         |        |       |   |

- Cluster 3 in Houston is close together and in a central location.
- Cluster 4 in Dallas is very spread out.
- Cluster 5 in Austin is only a single restaurant.

- Cluster 6 in Houston is spread out.
- Cluster 7 in Austin is only a single restaurant.
- Cluster 9 in Dallas is pretty close together.

**Cluster 3 in Houston** appears to be our best choice.


---

## Discussion

It's noticeable that using 'Likes' as a primary score isn't sufficient because many venues have scores <2 while the maximums in our case near 400.  Normalizing these values renders the low like counts indistinguishable.  Using proximity across 3 cities also should probably be broken up.  Proximity may have been over weighted as a feature since there are natural large differences across the cities.  Our initial query with Foursquare's /search endpoint may have turned up more results if we used a more general term such as "Asian".  More data/venues would help our model, especially if we split it into 3 different cities and then recompiled the cluster results together.

---

## Conclusion

We have shown that Foursquares data across 3 major Texas cities can help cluster and identify areas of each city to build a restaurant in.  More venue data per city would be preferable if we were to run this experiment again.  More features defined by the client would also help differentiate clusters.

For our case, we have decided to build in the downtown area of Houston, TX.