# Capstone Project - The Battle of Neighborhoods (Week 2)

## Goals:

1. Introduction where you discuss the business problem and who would be interested in this project.
2. Data where you describe the data that will be used to solve the problem and the source of the data.
3. Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and which machine learning techniques were used and why.
4. Results section where you discuss the results.
5. Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
6. Conclusion section where you conclude the report.

---

## Intro

I'd like to open a Vietnamese restaurant in Texas. My candidate cities are Dallas, Austin, and Houston. I want to open in an area of town that already has a considerable Asian/Vietnamese restaurant presence. Walkability and how well-liked the nearby businesses are also matter.

First, I'll narrow down which city to build in, then select a specific neighborhood within that city. This analysis is relevant to anyone who has the flexibility to open in any of several nearby cities and may eventually want to franchise.

---

## Data

#### Sources

- FS API Search: https://developer.foursquare.com/docs/api-reference/venues/search/
- FS API Likes: https://developer.foursquare.com/docs/api-reference/venues/likes/
- Cities DB: https://simplemaps.com/data/us-cities

#### Proposed Steps

- Collect location data for the main cities I selected from Google/Wiki
- List neighborhood candidates for querying (pulled from external data)
- Search for Asian/Vietnamese restaurants in each city using Foursquare's API
- Visualize on map
- Identify candidate neighborhoods/areas where Asian/Vietnamese restaurants are frequent
- Narrow down restaurants per city and re-visualize on map
- Add category data from Foursquare's API
- Add likes data from Foursquare's API
- Group by neighborhood and run a k-means clustering algorithm to score the neighborhoods
- Select where to build the restaurant

---

## Methodology

In [92]:
import pandas as pd
import requests
import bs4

### Gather coordinates of candidate cities

Use a CSV of US cities found online and extract the coordinates for our 3 cities.

In [135]:
filename = 'uscities.csv'
csv_df = pd.read_csv(filename)

Cities = ['Dallas', 'Houston', 'Austin']
State = 'TX'

df = csv_df.loc[(csv_df['city'].isin(Cities)) & (csv_df['state_id'] == State)]
cities_df = df[['city','state_id','lat', 'lng']]
cities_df = cities_df.rename(columns={'city':'City','state_id':'State','lat':'Latitude', 'lng':'Longitude'})
cities_df

| | City | State | Latitude | Longitude |
|---|---|---|---|---|
| 4 | Dallas | TX | 32.7936 | -96.7662 |
| 6 | Houston | TX | 29.7863 | -95.3889 |
| 31 | Austin | TX | 30.3004 | -97.7522 |


### Put the cities on a map of Texas

In [136]:
import folium

# Texas coordinates
tx_coord = [31.5132, -96.3832]

# create map of Texas using latitude and longitude values
map_tx = folium.Map(location=tx_coord, zoom_start=8)

# add markers to map
for lat, lng, city in zip(cities_df['Latitude'], cities_df['Longitude'], cities_df['City']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tx)  
    
map_tx

### Configuration for using Foursquare API

In [137]:
# Foursquare API

CLIENT_ID = 'ZBY35LYFENJHYRKLT041HRBCRRUEIT1DGBSJJ1ULBK3EEVJA' # your Foursquare ID
CLIENT_SECRET = 'ARVZ5K1FHCUCSDBQ2M1XNCSNS2GKLSSPKCX5N32Y54FVP314' # your Foursquare Secret
VERSION = '20201112'
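
Keys written directly into a notebook are visible to anyone who reads the file. One possible alternative, sketched below, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this illustration; they are not defined by Foursquare or by this notebook.

```python
import os

# Hypothetical variable names chosen for this sketch
REQUIRED_VARS = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    env = os.environ if env is None else env
    # Collect every missing or empty variable so the error names all of them at once
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[REQUIRED_VARS[0]], env[REQUIRED_VARS[1]]
```

With the two variables exported in the shell, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` could stand in for the two literal assignments.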

### Gather city coordinates from dataframe

In [138]:
# Find coordinates of each city

dfw_coord = cities_df.loc[cities_df['City'] == 'Dallas'][['Latitude', 'Longitude']]
dfw_lat, dfw_lng = dfw_coord.Latitude.values[0], dfw_coord.Longitude.values[0]

htx_coord = cities_df.loc[cities_df['City'] == 'Houston'][['Latitude', 'Longitude']]
htx_lat, htx_lng = htx_coord.Latitude.values[0], htx_coord.Longitude.values[0]

atx_coord = cities_df.loc[cities_df['City'] == 'Austin'][['Latitude', 'Longitude']]
atx_lat, atx_lng = atx_coord.Latitude.values[0], atx_coord.Longitude.values[0]

print(f'Dallas coordinates are: {dfw_lat, dfw_lng}')
print(f'Houston coordinates are: {htx_lat, htx_lng}')
print(f'Austin coordinates are: {atx_lat, atx_lng}')

Dallas coordinates are: (32.7936, -96.7662)
Houston coordinates are: (29.7863, -95.3889)
Austin coordinates are: (30.3004, -97.7522)


### Create Foursquare ./search API request

In [139]:
base_url = f'https://api.foursquare.com/v2/venues/search?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}'

LIMIT = 100
Cities = ['Dallas', 'Houston', 'Austin']
State = 'TX'
query_term = "Vietnamese"

dfw_url = f'{base_url}&ll={dfw_lat},{dfw_lng}&limit={LIMIT}&near={Cities[0]}, {State}&query={query_term}'
htx_url = f'{base_url}&ll={htx_lat},{htx_lng}&limit={LIMIT}&near={Cities[1]}, {State}&query={query_term}'
atx_url = f'{base_url}&ll={atx_lat},{atx_lng}&limit={LIMIT}&near={Cities[2]}, {State}&query={query_term}'

print(f'Example URL: {dfw_url}')

Example URL: https://api.foursquare.com/v2/venues/search?&client_id=ZBY35LYFENJHYRKLT041HRBCRRUEIT1DGBSJJ1ULBK3EEVJA&client_secret=ARVZ5K1FHCUCSDBQ2M1XNCSNS2GKLSSPKCX5N32Y54FVP314&v=20201112&ll=32.7936,-96.7662&limit=100&near=Dallas, TX&query=Vietnamese
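
The example URL above still contains a raw comma and space in `near=Dallas, TX`, because the query string is assembled by string interpolation. A minimal sketch of the same request built with the standard library's `urlencode`, which percent-encodes each value, follows. The helper name `build_search_url` and the placeholder credentials are assumptions; the parameter names are the ones already used in the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above
SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

def build_search_url(client_id, client_secret, version, lat, lng, near, query, limit=100):
    """Assemble a ./search URL with every parameter value percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "limit": limit,
        "near": near,
        "query": query,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials, not real keys
url = build_search_url("YOUR_ID", "YOUR_SECRET", "20201112",
                       32.7936, -96.7662, "Dallas, TX", "Vietnamese")
print(url)
```

`requests.get` also accepts a `params` dictionary and performs the same encoding, so the dictionary could be passed to it directly instead of building the URL by hand.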


### Gather responses

In [163]:
dfw_response = requests.get(dfw_url).json()
htx_response = requests.get(htx_url).json()
atx_response = requests.get(atx_url).json()

responses = [dfw_response, htx_response, atx_response]

In [164]:
responses

[{'meta': {'code': 429,
   'errorType': 'quota_exceeded',
   'errorDetail': 'Quota exceeded',
   'requestId': '5fb454754769b8507116a231'},
  'response': {}},
 {'meta': {'code': 429,
   'errorType': 'quota_exceeded',
   'errorDetail': 'Quota exceeded',
   'requestId': '5fb45475a5d23e3a411a69d4'},
  'response': {}},
 {'meta': {'code': 429,
   'errorType': 'quota_exceeded',
   'errorDetail': 'Quota exceeded',
   'requestId': '5fb45475556f3329953f4c1f'},
  'response': {}}]
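
All three responses above carry `meta.code` 429 (`quota_exceeded`) and an empty `response`, so any later lookup of `response['venues']` will fail. A small guard that surfaces the API's own error before parsing might look like the sketch below. The helper name `extract_venues` is hypothetical, and treating `meta.code == 200` as the success case is an assumption about the v2 API's response envelope.

```python
def extract_venues(payload):
    """Return the list of venues, or raise with the error reported in 'meta'."""
    meta = payload.get("meta", {})
    # Assumption: a successful v2 response reports code 200 in 'meta'
    if meta.get("code") != 200:
        raise RuntimeError(
            "Foursquare request failed ({}): {}".format(
                meta.get("code"), meta.get("errorType", "unknown error")
            )
        )
    return payload.get("response", {}).get("venues", [])

# Payload shaped like the quota-exceeded responses shown above
quota_payload = {
    "meta": {"code": 429, "errorType": "quota_exceeded", "errorDetail": "Quota exceeded"},
    "response": {},
}

try:
    extract_venues(quota_payload)
except RuntimeError as err:
    print(err)
```

Failing at this point, with the error type in the message, is easier to diagnose than a `KeyError` raised several cells later.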

### Move from JSON to DataFrame

In [165]:
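# function that extracts the category of the venue
def get_category_type(row):
    # Prefer the 'categories' column; fall back to 'venue.categories' when it is absent
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # A venue can have no categories at all
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']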

In [166]:
# transform JSON file into a pandas dataframe
from pandas import json_normalize

# Load JSON response into DF and clean
def clean_json_to_df(response):

    # JSON -> DF (nested keys become dotted column names)
    temp = json_normalize(response['response']['venues'])
    # Keep only the columns needed downstream
    filtered_cols = ['name', 'location.lat', 'location.lng', 'categories', 'id']
    temp = temp.loc[:, filtered_cols]
    # Replace each categories list with the name of its first category
    temp['categories'] = temp.apply(get_category_type, axis=1)
    return temp.rename(columns={'name':'Name','location.lat':'Latitude','location.lng':'Longitude', 'categories':'Categories'})



### Combine disparate city dataframes into one

In [144]:
from pandas import concat

venue_dfs = []

for response in responses:
    venue_dfs.append(clean_json_to_df(response))

venue_df = concat(venue_dfs)

KeyError: 'venues'

In [145]:
venue_df.head()

| | Name | Latitude | Longitude | Categories | id | Likes |
|---|---|---|---|---|---|---|
| 0 | Malai Thai Vietnamese Kitchen & Bar | 32.808763 | -96.796722 | Thai Restaurant | 4d41e27533268cfa58df5201 | |
| 1 | Ngon Vietnamese Kitchen | 32.813849 | -96.770282 | Vietnamese Restaurant | 5f1cb638b5d99d73561aa7ab | |
| 2 | Mai's Vietnamese Restaurant | 32.803683 | -96.77368 | Vietnamese Restaurant | 4a79ddf7f964a520d7e71fe3 | |
| 3 | Vietnamese Baptist Church | 32.91408 | -96.639169 | Church | 4c94132172dd224b50d09691 | |
| 4 | DaLat Late Night Vietnamese Comfort Food | 32.812642 | -96.784147 | Vietnamese Restaurant | 4f3b00b6e4b01e33457f2893 | |


In [130]:
print('{} venues were returned by Foursquare total.'.format(venue_df.shape[0]))

108 venues were returned by Foursquare total.


### Add empty column placeholder for 'Likes' count

In [132]:
venue_df['Likes'] = ""
venue_df.head()

| | Name | Latitude | Longitude | Categories | id | Likes |
|---|---|---|---|---|---|---|
| 0 | Malai Thai Vietnamese Kitchen & Bar | 32.808763 | -96.796722 | Thai Restaurant | 4d41e27533268cfa58df5201 | |
| 1 | Ngon Vietnamese Kitchen | 32.813849 | -96.770282 | Vietnamese Restaurant | 5f1cb638b5d99d73561aa7ab | |
| 2 | Mai's Vietnamese Restaurant | 32.803683 | -96.77368 | Vietnamese Restaurant | 4a79ddf7f964a520d7e71fe3 | |
| 3 | Vietnamese Baptist Church | 32.91408 | -96.639169 | Church | 4c94132172dd224b50d09691 | |
| 4 | DaLat Late Night Vietnamese Comfort Food | 32.812642 | -96.784147 | Vietnamese Restaurant | 4f3b00b6e4b01e33457f2893 | |


### Collect likes for venues and add back into dataframe

In [116]:
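# One ./likes request per venue; each request counts against the API quota
for index, row in venue_df.iterrows():
    venue_id = row['id']  # named venue_id to avoid shadowing the built-in id()
    url = f'https://api.foursquare.com/v2/venues/{venue_id}/likes?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}'
    results = requests.get(url).json()
    # .loc selects by index label, so this assumes venue_df has unique labels
    venue_df.loc[index, 'Likes'] = results['response']['likes']['count']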

venue_df.head()

KeyError: 'likes'
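
The `KeyError: 'likes'` above is what the lookup `results['response']['likes']['count']` raises when a response has no `likes` entry, for example an error response whose `response` object is empty. A more tolerant lookup could return `None` for such venues so that the loop finishes and the gaps stay visible. The helper name `like_count` is hypothetical; the key path is the one used in the loop above.

```python
def like_count(payload):
    """Return the like count from a ./likes payload, or None when it is absent."""
    likes = payload.get("response", {}).get("likes")
    # Error responses carry an empty 'response', so 'likes' may be missing
    if not isinstance(likes, dict):
        return None
    return likes.get("count")

# Two illustrative payloads: one successful, one shaped like a quota error
print(like_count({"meta": {"code": 200}, "response": {"likes": {"count": 7}}}))
print(like_count({"meta": {"code": 429}, "response": {}}))
```

Storing `None` rather than an empty string also keeps the `Likes` column convertible to a numeric dtype later on.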

In [13]:
venue_df.dtypes

Name           object
Latitude      float64
Longitude     float64
Categories     object
id             object
Likes          object
dtype: object

In [147]:
venue_df.head()

| | Name | Latitude | Longitude | Categories | id | Likes |
|---|---|---|---|---|---|---|
| 0 | Malai Thai Vietnamese Kitchen & Bar | 32.808763 | -96.796722 | Thai Restaurant | 4d41e27533268cfa58df5201 | |
| 1 | Ngon Vietnamese Kitchen | 32.813849 | -96.770282 | Vietnamese Restaurant | 5f1cb638b5d99d73561aa7ab | |
| 2 | Mai's Vietnamese Restaurant | 32.803683 | -96.77368 | Vietnamese Restaurant | 4a79ddf7f964a520d7e71fe3 | |
| 3 | Vietnamese Baptist Church | 32.91408 | -96.639169 | Church | 4c94132172dd224b50d09691 | |
| 4 | DaLat Late Night Vietnamese Comfort Food | 32.812642 | -96.784147 | Vietnamese Restaurant | 4f3b00b6e4b01e33457f2893 | |


### Drop unnecessary columns for clustering

In [146]:
# Remove non-numeric columns
# Alternatively: One-Hot encode them

# columns= is explicit; the positional axis argument is not accepted by pandas 2.x
venue_clustering = venue_df.drop(columns=['Name', 'Categories', 'id'])
venue_clustering

| | Latitude | Longitude | Likes |
|---|---|---|---|
| 0 | 32.808763 | -96.796722 | |
| 1 | 32.813849 | -96.770282 | |
| 2 | 32.803683 | -96.773680 | |
| 3 | 32.914080 | -96.639169 | |
| 4 | 32.812642 | -96.784147 | |
| ... | ... | ... | ... |
| 11 | 30.389847 | -97.658682 | |
| 12 | 30.361853 | -97.715468 | |
| 13 | 30.434334 | -97.615370 | |
| 14 | 30.378036 | -97.687205 | |


### Normalize the values in clustering DF

In [16]:
import pandas as pd
from sklearn import preprocessing

x = venue_clustering.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
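
With its default settings, `MinMaxScaler` rescales each column independently to the range [0, 1] using x' = (x - min) / (max - min). The dependency-free sketch below applies that formula to a single column so the arithmetic is easy to follow. The function name `min_max_scale` and the sample values are illustrative only, and mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every entry to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative like counts, not taken from the venue data
scaled = min_max_scale([2, 4, 10])
print(scaled)  # [0.0, 0.25, 1.0]
```

Scaling matters here because k-means uses Euclidean distance: without it, a column with a wide range, such as like counts, would outweigh latitude and longitude, which differ by only a few degrees across the three cities.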

### Implement k-means clustering

In [39]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 10

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(df)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 0, 0, 3,
       2, 0, 2, 4, 2, 4, 1, 0, 2, 2, 1, 2, 1, 1, 3, 4, 2, 4, 1, 2, 0, 1,
       2, 1, 4, 2, 4, 4], dtype=int32)

### Add Cluster Label back into DF

In [159]:
# venue_df.drop('Cluster Label', 1, inplace=True)
venue_df.insert(0, 'Cluster Label', kmeans.labels_)
venue_df.head()

ValueError: Length of values does not match length of index

### Function to add markers and focus on given coordinate

In [None]:
# Map venues Function

# Matplotlib and associated plotting modules
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

def plot_markers(coordinates, df):
    # Create a map centered on the given [latitude, longitude] pair
    map_city = folium.Map(location=coordinates, zoom_start=10)

    # Set color scheme for the clusters: one hex color per cluster label
    colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # Add one marker per venue in df, colored by its cluster label
    for lat, lng, name, likes, cluster in zip(df['Latitude'], df['Longitude'], df['Name'], df['Likes'], df['Cluster Label']):
        label = 'Cluster {} - {}: {} likes'.format(cluster, name, likes)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=rainbow[cluster],  # labels are 0-based, so index directly
            fill=True,
            fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(map_city)
    return map_city
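
The marker colors come from a palette with one entry per cluster label. As an alternative to the matplotlib-based palette, the sketch below builds one with the standard library's `colorsys` by spacing hues evenly around the color wheel. The helper name `cluster_palette` and the saturation and brightness values are assumptions chosen for illustration.

```python
import colorsys

def cluster_palette(n_clusters, saturation=0.85, value=0.9):
    """Return n_clusters hex colors with evenly spaced hues."""
    palette = []
    for i in range(n_clusters):
        # Hue runs over [0, 1); equal steps keep neighboring clusters distinguishable
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, saturation, value)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(5)
print(palette)
```

Since k-means labels start at 0, `palette[label]` gives each cluster its own color without any index offset.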

### Display each city's maps with clustering labels

In [149]:
# Dallas map
dfw_coord = [dfw_lat, dfw_lng]
dfw_map = plot_markers(dfw_coord, venue_df)
dfw_map

NameError: name 'plot_markers' is not defined

In [150]:
# Houston map
htx_coord = [htx_lat, htx_lng]
htx_map = plot_markers(htx_coord, venue_df)
htx_map

NameError: name 'plot_markers' is not defined

In [151]:
# Austin map
atx_coord = [atx_lat, atx_lng]
atx_map = plot_markers(atx_coord, venue_df)
atx_map

NameError: name 'plot_markers' is not defined

### Brief overview of cluster likes

In [154]:
for k in range(kclusters):
    print(f"Cluster {k} averages {round(htx_venues_df.loc[htx_venues_df['Cluster Label'] == k, 'Likes'].mean(),2)} likes")

Cluster 0 averages 11.83 likes
Cluster 1 averages 2.36 likes
Cluster 2 averages 11.33 likes
Cluster 3 averages 241.5 likes
Cluster 4 averages 4.0 likes
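
The same per-cluster averages can be computed in one step with a pandas `groupby`. The sketch below first coerces `Likes` to numbers, since that column has dtype `object` and may hold empty strings. The helper name `mean_likes_by_cluster` and the demo frame are assumptions for illustration and do not reproduce the venue data.

```python
import pandas as pd

def mean_likes_by_cluster(df, label_col="Cluster Label", likes_col="Likes"):
    """Return the mean like count per cluster, ignoring non-numeric entries."""
    # Empty strings and other non-numeric entries become NaN and are skipped by mean()
    likes = pd.to_numeric(df[likes_col], errors="coerce")
    return likes.groupby(df[label_col]).mean().round(2)

# Tiny made-up frame with one blank like count
demo = pd.DataFrame({"Cluster Label": [0, 0, 1, 1], "Likes": [10, "", 3, 5]})
summary = mean_likes_by_cluster(demo)
print(summary)
```

The result is a Series indexed by cluster label, so clusters with no venues simply do not appear instead of printing `nan`.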


---

## Results

Combining the venue data from all 3 cities and fitting a k-means model on location and 'Likes' count groups the venues into roughly 10 clusters and gives meaningful results. This assumes the venues are already vetted for category, since they were selected through the Foursquare API with the query 'Vietnamese'.

---

## Discussion

Using 'Likes' as the primary score isn't sufficient, because many venues share the same score of zero.

---

## Conclusion

It appears it would be best to open a restaurant at 