# Capstone Project - The Battle of the Neighborhoods (Week 2)
## Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>
### Opening a Basque Pintxos bar

The goal of this project is to find the best location for our **Pintxos bar/restaurant**. We need to look for restaurants serving similar cuisine in an affordable neighborhood. This is a family business that wants to expand to the United States, specifically to the **Miami** area. The goal is to find an inexpensive place to rent and to focus on sourcing local **quality ingredients**. The main ingredient on the menu will be **fish**, so we want to be near similar restaurants.

## Data <a name="data"></a>
### Find similar restaurants

It is important to build relationships with nearby restaurants that share the same goals. We need to find a neighborhood where the main ingredient is **fish**, which could help us get in touch with local suppliers.

We will have to use multiple **Foursquare** API endpoints:

* Venues **search**: query the restaurants in all Miami neighborhoods
* Venues **explore**: find a location where the main dish of the recommended restaurants is fish
* Venues **categories**: find restaurants of similar cuisine
* Venues **similar**: find a reference restaurant and then find similar restaurants
* Venues **details**: look for restaurants that rank high by likes

First, we cluster all **Miami** neighborhoods by cuisine and check whether a pattern emerges. If it does, we explore all recommended venues near a cluster centroid. If it does not, we query all the categories, pick the one closest to our cuisine, and select the neighborhood where that category is most frequent.

We can then find similar restaurants in other neighborhoods and sort them by likes, using their details.

All this data should help us choose a neighborhood with restaurants of similar cuisine and high ratings.
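
The fallback rule described above (select the neighborhood where the chosen category is most frequent) can be sketched as follows. This is a minimal illustration on made-up data: `toy_venues`, the neighborhood names `A`, `B`, `C`, and the helper `best_neighborhood_for` are assumptions for the example and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Made-up venues; the notebook builds a similar table from Foursquare later on.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "Venue Category": [
        "Seafood Restaurant", "Pizza Place", "Bakery",
        "Seafood Restaurant", "Seafood Restaurant",
        "Seafood Restaurant", "Pizza Place", "Pizza Place", "Café",
    ],
})

def best_neighborhood_for(venues, category):
    """Hypothetical helper: neighborhood where `category` has the highest share of venues."""
    is_match = venues["Venue Category"].eq(category)
    share = is_match.groupby(venues["Neighborhood"]).mean()
    return share.idxmax(), float(share.max())

best, share = best_neighborhood_for(toy_venues, "Seafood Restaurant")
print(best, share)
```

Using the share of venues rather than the raw count keeps a large neighborhood from winning only because it has more venues overall.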

In [None]:
!pip install html5lib
!pip install folium
!pip install selenium
!pip install tabulate

In [None]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import folium
import io
from PIL import Image
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats

### Obtain all Miami neighborhoods from Wikipedia

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Miami"
soup = BeautifulSoup(requests.get(url).text, "html5lib")  # name the parser explicitly (html5lib is installed above)
table = soup.find("table")
neighborhoods = []
for row in table.find_all("tr")[1:]:
    columns = row.find_all("td")
    if columns[0].a is None:
        continue
    geolocation = columns[5].find_all("span", {"class": "geo"})
    if geolocation:
        latitude = geolocation[0].text.split(';')[0].strip()
        longitude = geolocation[0].text.split(';')[1].strip()
    else:
        latitude = ""
        longitude = ""
    neighborhood = {
        "Neighborhood": columns[0].a.text,
        "Latitude": latitude,
        "Longitude": longitude
    }
    neighborhoods.append(neighborhood)

df = pd.DataFrame(neighborhoods)
df

### Data cleaning

* Fill empty data
* Use the correct data types
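
The two cleaning steps above can be sketched on a small made-up table. The coordinates and the `toy` frame are assumptions for illustration; the idea is to convert the text columns to numbers first, so that empty strings become `NaN` and the rows that still need a manual fix are easy to list.

```python
import pandas as pd

# Made-up rows shaped like the scraped table; an empty string means "no coordinates found".
toy = pd.DataFrame({
    "Neighborhood": ["Downtown", "Health District"],
    "Latitude": ["25.774", ""],
    "Longitude": ["-80.193", ""],
})

# errors="coerce" turns values that cannot be parsed (such as "") into NaN.
for col in ["Latitude", "Longitude"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Neighborhoods that still need coordinates filled in by hand.
missing = toy.loc[toy[["Latitude", "Longitude"]].isna().any(axis=1), "Neighborhood"].tolist()
print(missing)
```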


In [None]:
# Taken from Google Maps
# https://www.google.com/maps/place/Florida+Health+-+District+Center/@25.7870852,-80.2177615,15z......
idx = df[df["Neighborhood"] == "Health District"].index
df.loc[idx, "Latitude"] = "25.787"
df.loc[idx, "Longitude"] = "-80.217"
df.loc[idx]

In [None]:
df.dtypes

In [None]:
df = df.astype({"Neighborhood": "string", "Latitude": "float64", "Longitude": "float64"})
df.dtypes

In [None]:
df

### Show neighborhoods on OpenStreetMap with Folium

In [None]:
address = 'Miami, Florida'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Miami are {}, {}.'.format(latitude, longitude))

In [None]:
miami = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(miami)  
    
miami

### Foursquare API

In [None]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
ACCESS_TOKEN = '' # your FourSquare Access Token
VERSION = '20210501' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
SEARCH_LIMIT = 50
RADIUS = 500

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

venues_list = []
    
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    url = 'https://api.foursquare.com/v2/venues/search?categoryId=4d4b7105d754a06374d81259&client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, lat, lng, ACCESS_TOKEN, VERSION, RADIUS, SEARCH_LIMIT)
    results = requests.get(url).json()
    # assign relevant part of JSON to venues
    venues = results['response']['venues']
    # return only relevant information for each nearby venue
    venues_list.append([(
        label, lat, lng,
        v['name'],
        v['location']['lat'], v['location']['lng'],
        v['categories'][0]['name'] if v['categories'] else np.nan,  # NaN when the venue has no category
        v['id']) for v in venues])
    
venues_df = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
venues_df.columns = [
    'Neighborhood',
    'Neighborhood Latitude', 'Neighborhood Longitude',
    'Venue',
    'Venue Latitude', 'Venue Longitude',
    'Venue Category',
    'Venue ID'
]
venues_df.head()
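
The search URL above is assembled with `str.format`. As an alternative, here is a minimal sketch that uses the standard library's `urllib.parse.urlencode`, which keeps each parameter named and escapes the values. The parameter names are the ones used in the cell above; the helper `build_search_url`, the constant names, and the placeholder credentials are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/search"
CATEGORY_ID = "4d4b7105d754a06374d81259"  # same category ID as in the cell above

def build_search_url(lat, lng, client_id, client_secret, token,
                     version="20210501", radius=500, limit=50):
    """Hypothetical helper: build the venues/search URL from named parameters."""
    params = {
        "categoryId": CATEGORY_ID,
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "oauth_token": token,
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)

# Placeholder credentials, for illustration only.
example_url = build_search_url(25.787, -80.217, "ID", "SECRET", "TOKEN")
print(example_url)
```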

In [None]:
venues_df[venues_df.isna().any(axis=1)]

In [None]:
venues_df.shape

In [None]:
neighborhoods = venues_df['Neighborhood'].unique()
neighborhoods

In [None]:
len(neighborhoods)

### Show food venues on OpenStreetMap with Folium

In [None]:
miami_venues = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the neighborhoods
n_neighborhoods = df['Neighborhood'].count()
x = np.arange(n_neighborhoods)
ys = [i + x + (i*x)**2 for i in range(n_neighborhoods)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to map
for neighborhood, lat, lng, venue, category in zip(venues_df['Neighborhood'], venues_df['Venue Latitude'], venues_df['Venue Longitude'], venues_df['Venue'], venues_df['Venue Category']):
    label = folium.Popup("%s: %s (%s)" % (neighborhood, venue, category), parse_html=True)
    neighborhood_idx = df[df['Neighborhood'] == neighborhood].index[0]
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[neighborhood_idx],
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(miami_venues)
    
miami_venues

### Exploratory Data Analysis

After reviewing the neighborhoods, we drop those that have only a few food venues or a very different cuisine:

* Coconut Grove
* Coral Way
* The Roads
* Grapeland Heights
* Allapattah
* Liberty City
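
The "few food venues" criterion can be checked with a simple count per neighborhood. The sketch below uses made-up data, and the threshold `MIN_VENUES` is a hypothetical value rather than one used in this notebook. The "very different cuisine" criterion still needs a manual look at the categories.

```python
import pandas as pd

# Made-up venues: neighborhood "B" has only two.
toy_venues = pd.DataFrame({"Neighborhood": ["A"] * 6 + ["B"] * 2 + ["C"] * 5})

MIN_VENUES = 3  # hypothetical threshold, chosen only for this example

counts = toy_venues["Neighborhood"].value_counts()
sparse = sorted(counts[counts < MIN_VENUES].index)
print(sparse)
```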


In [None]:
drop_neighborhoods = ["Coconut Grove", "Coral Way", "The Roads", "Grapeland Heights", "Allapattah", "Liberty City"]
venues_df = venues_df[~venues_df['Neighborhood'].isin(drop_neighborhoods)]
venues_df.shape

### Show the five most frequent venues per neighborhood

In [None]:
num_top_venues = 5

onehot = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")
onehot['Neighborhood'] = venues_df['Neighborhood'] 
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

In [None]:
onehot.shape

In [None]:
grouped = onehot.groupby('Neighborhood').mean().reset_index()
grouped

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

columns = ['Neighborhood'] + [str(ind + 1) for ind in np.arange(num_top_venues)]
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

### Cluster neighborhoods

In [None]:
grouped_clustering = grouped.drop('Neighborhood', axis=1)

n_clusters = 5
kmeans = KMeans(init="k-means++", n_clusters=n_clusters, n_init=10, random_state=0).fit(grouped_clustering)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

merged = df
merged = merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

merged

In [None]:
merged = merged.dropna().reset_index(drop=True)  # assign the result so the new index is kept
merged

In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(n_clusters)
ys = [i + x + (i*x)**2 for i in range(n_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged['Latitude'], merged['Longitude'], merged['Neighborhood'], merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine Clusters

In [None]:
merged.loc[merged['Cluster Labels'] == 0, merged.columns[[0] + list(range(4, merged.shape[1]))]]

In [None]:
merged.loc[merged['Cluster Labels'] == 1, merged.columns[[0] + list(range(4, merged.shape[1]))]]

In [None]:
merged.loc[merged['Cluster Labels'] == 2, merged.columns[[0] + list(range(4, merged.shape[1]))]]

In [None]:
merged.loc[merged['Cluster Labels'] == 3, merged.columns[[0] + list(range(4, merged.shape[1]))]]

In [None]:
merged.loc[merged['Cluster Labels'] == 4, merged.columns[[0] + list(range(4, merged.shape[1]))]]

### Exploratory Data Analysis

As we can see in the data, the neighborhoods with similar cuisine are in the clusters labeled **0** and **3**.
The 4th cluster includes more neighborhoods that are similar to each other.
The 3rd cluster is far from crowded venues.

Since we are looking for crowded neighborhoods, the neighborhoods in the cluster labeled **3** seem like a good place to start.

Selected neighborhoods:

In [None]:
selected_neighborhoods = list(merged[merged["Cluster Labels"] == 3]['Neighborhood'])
selected_neighborhoods

### Get venue ratings for each neighborhood

In [None]:
selected_venues_df = venues_df[venues_df['Neighborhood'].isin(selected_neighborhoods)].reset_index(drop=True)
selected_venues_df.head()

In [None]:
selected_venues_df.shape

### Discard non-restaurant venues

* Bakery
* Cupcake Shop
* Café
* Ice Cream Shop
* Bagel Shop
* Smoothie Shop
* Coffee Shop
* Hotel
* Pie Shop
* Gift Shop
* Record Shop
* Cafeteria
* Event Space

In [None]:
selected_venues_df['Venue Category'].unique()

In [None]:
banned_categories = ["Bakery", "Cupcake Shop", "Café", "Ice Cream Shop", "Bagel Shop", "Smoothie Shop", "Coffee Shop", "Hotel", "Pie Shop", "Gift Shop", "Record Shop", "Cafeteria", "Event Space"]
selected_venues_df = selected_venues_df[~selected_venues_df['Venue Category'].isin(banned_categories)].reset_index(drop=True)
selected_venues_df.head()

In [None]:
selected_venues_df.shape

### Obtain Venue Rating and Likes from Foursquare API (Premium endpoint "/venues/X")

In [None]:
ratings = []
likes = []
    
for venue_id, venue, neighborhood in zip(selected_venues_df['Venue ID'], selected_venues_df['Venue'], selected_venues_df['Neighborhood']):
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&oauth_token={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN, VERSION)
    results = requests.get(url).json()
    try:
        venue_rating = results['response']['venue']['rating']
    except Exception:
        venue_rating = 0.0
    try:
        venue_likes = results['response']['venue']['likes']['count']
    except Exception:
        venue_likes = 0
    ratings.append(venue_rating)
    likes.append(venue_likes)

selected_venues_df['Venue Rating'] = ratings
selected_venues_df['Venue Likes'] = likes
selected_venues_df.to_csv("venues_ratings.csv", index=False)  # skip the index so no extra column appears on reload
selected_venues_df.head()
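
The loop above falls back to `0.0` or `0` when a field is missing from the response. A minimal sketch of the same fallback without `try`/`except` is shown below. The helper `dig` and the two sample responses are assumptions for illustration; only the key paths come from the cell above.

```python
def dig(data, path, default):
    """Hypothetical helper: follow `path` through nested dicts, or return `default`."""
    current = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Made-up responses containing only the fields read in the cell above.
full = {"response": {"venue": {"rating": 8.7, "likes": {"count": 120}}}}
empty = {"response": {}}

rating_full = dig(full, ["response", "venue", "rating"], 0.0)
likes_full = dig(full, ["response", "venue", "likes", "count"], 0)
rating_empty = dig(empty, ["response", "venue", "rating"], 0.0)
likes_empty = dig(empty, ["response", "venue", "likes", "count"], 0)
print(rating_full, likes_full, rating_empty, likes_empty)
```

Note that a missing rating and a genuinely low rating both end up near the bottom of the scale with this fallback, which matters for the normalization and binning steps later on.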

### Transform Venue Ratings to one decimal point float values and normalize

In [None]:
selected_venues_df['Venue Rating'] = MinMaxScaler().fit_transform(selected_venues_df[['Venue Rating']].round(1))
selected_venues_df['Venue Likes'] = MinMaxScaler().fit_transform(selected_venues_df[['Venue Likes']])
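
Min-max normalization rescales each value as `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below applies that formula with NumPy to made-up ratings; the helper `min_max` is an assumption for illustration, and returning zeros for a constant column is a choice made for this example.

```python
import numpy as np

def min_max(values):
    """Hypothetical helper: rescale values to the [0, 1] range."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are equal; return zeros (a choice made for this sketch).
        return np.zeros_like(values)
    return (values - values.min()) / span

# Made-up ratings; 0.0 stands for a venue whose rating was missing.
scaled = min_max([0.0, 6.5, 8.0, 9.0])
print(scaled)
```

Because missing ratings are stored as `0.0` earlier in the notebook, they set the minimum whenever at least one is present, and every real rating is scaled relative to zero.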

In [None]:
csv_venues_df = pd.read_csv("venues_ratings.csv")

In [None]:
csv_venues_df[["Neighborhood", "Venue"]].describe(include=['object'])

In [None]:
csv_venues_df[["Venue Rating", "Venue Likes"]].describe()

In [None]:
print(csv_venues_df[['Neighborhood', 'Venue Rating', 'Venue Likes']].groupby('Neighborhood').agg(['mean', 'count']).sort_values([('Venue Rating', 'mean')], ascending=False).to_markdown())

In [None]:
selected_venues_df[['Neighborhood', 'Venue Rating', 'Venue Likes']].groupby('Neighborhood').boxplot(fontsize=12, figsize=(15, 15))
plt.show()

### Correlation between Likes and Rating

In [None]:
selected_venues_df[['Venue Rating', 'Venue Likes']].corr()['Venue Rating'].sort_values()

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="Venue Rating", y="Venue Likes", data=selected_venues_df)
plt.ylim(0,)

In [None]:
pearson_coef, p_value = stats.pearsonr(selected_venues_df['Venue Likes'], selected_venues_df['Venue Rating'])
print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value ) 
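
For reference, the Pearson coefficient printed above is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand with NumPy on made-up values; `pearson_r` and the sample lists are assumptions for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Hypothetical helper: Pearson correlation computed from centered values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up normalized likes and ratings.
likes_sample = [0.0, 0.1, 0.4, 0.9]
rating_sample = [0.2, 0.5, 0.7, 0.95]

r = pearson_r(likes_sample, rating_sample)
print(r)
```

The coefficient only measures linear association, so a value near zero does not rule out a non-linear relationship between likes and rating.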

### Correlation between neighborhood and Rating

In [None]:
sns.set(rc={'figure.figsize':(18, 10)})
sns.boxplot(x="Neighborhood", y="Venue Rating", data=selected_venues_df)
plt.show()

In [None]:
neighborhood_rating_group = selected_venues_df[['Neighborhood', 'Venue Rating']].groupby(['Neighborhood'])

rating_groups = []
for neighborhood in selected_venues_df['Neighborhood'].unique():
    rating_groups.append(neighborhood_rating_group.get_group(neighborhood)['Venue Rating'])

# ANOVA
f_val, p_val = stats.f_oneway(*rating_groups)

print("ANOVA results: F-score =", f_val, "P-value =", p_val)
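
The F-score of a one-way ANOVA is the ratio of the variance between group means to the variance within groups. The sketch below computes it by hand with NumPy on made-up groups; `one_way_f` and the sample groups are assumptions for illustration.

```python
import numpy as np

def one_way_f(groups):
    """Hypothetical helper: F statistic of a one-way ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return float((ss_between / df_between) / (ss_within / df_within))

# Made-up ratings for three neighborhoods.
f_example = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_example)
```

A large F-score means the neighborhood means differ by more than the spread inside each neighborhood would explain.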

### Data Binning - create Venue Rating categorical column

In [None]:
selected_venues_df['Venue Rating'].describe()

In [None]:
bins = [-1, .0, .4, .6, 1.]
labels = ['Low', 'Low-Mid', 'Mid-High', 'High']
selected_venues_df['Venue Rating Categorical'] = pd.cut(selected_venues_df['Venue Rating'], bins=bins, labels=labels)
selected_venues_df.head()
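
With the default settings, `pd.cut` builds intervals that are open on the left and closed on the right, so the bins above are `(-1, 0]`, `(0, 0.4]`, `(0.4, 0.6]` and `(0.6, 1]`. A short sketch on made-up boundary values (the `sample` series is an assumption for illustration) shows where each edge lands:

```python
import pandas as pd

bins = [-1, .0, .4, .6, 1.]
labels = ['Low', 'Low-Mid', 'Mid-High', 'High']

# Made-up normalized ratings sitting on or between the bin edges.
sample = pd.Series([0.0, 0.4, 0.5, 1.0])
binned = pd.cut(sample, bins=bins, labels=labels).astype(str).tolist()
print(binned)
```

Only a value of exactly `0.0` falls into `Low`, so that category holds the venues at the minimum of the normalized scale, which includes any venue whose rating was missing and stored as `0.0`.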

In [None]:
neighborhood_rating_contingency = pd.crosstab(selected_venues_df['Neighborhood'], selected_venues_df['Venue Rating Categorical'], normalize='index')
plt.figure(figsize=(20,8))
sns.heatmap(neighborhood_rating_contingency, annot=True, cmap="YlGnBu")
plt.show()

In [None]:
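# Chi-square test of independence. The test needs observed counts,
# so build a separate contingency table without normalization.
neighborhood_rating_counts = pd.crosstab(selected_venues_df['Neighborhood'], selected_venues_df['Venue Rating Categorical'])
chi2, p, dof, expected = stats.chi2_contingency(neighborhood_rating_counts)
print("Chi-square =", chi2, " P-value =", p)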
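
The chi-square statistic is defined on observed counts: it compares each count with the count expected if rows and columns were independent. The sketch below computes it by hand with NumPy on a made-up table and also shows that feeding row proportions instead of counts gives a different value. `chi_square_statistic` and the sample table are assumptions for illustration.

```python
import numpy as np

def chi_square_statistic(observed):
    """Hypothetical helper: Pearson chi-square statistic and degrees of freedom."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected values under independence of rows and columns.
    expected = row_totals * col_totals / observed.sum()
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return chi2, dof

# Made-up counts: rows are neighborhoods, columns are rating categories.
counts = np.array([[10, 20, 30], [30, 20, 10]])
chi2_counts, dof_counts = chi_square_statistic(counts)

# The same table as row proportions loses the sample size.
proportions = counts / counts.sum(axis=1, keepdims=True)
chi2_props, _ = chi_square_statistic(proportions)

print(chi2_counts, dof_counts, chi2_props)
```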

### Display Venue Rating with Folium

* Markers outlined in red have the highest ratings
* Markers outlined in white have the lowest ratings
* Fill color corresponds to the neighborhood
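
The rating-to-color mapping can be sketched on its own. A Matplotlib colormap called with a float in `[0, 1]` returns an RGBA tuple, which is then converted to the hex string Folium expects. The helper `rating_to_hex` and the clipping step are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib import colors

cmap = plt.get_cmap("OrRd")

def rating_to_hex(rating):
    """Hypothetical helper: map a normalized rating in [0, 1] to a hex color."""
    # Clip so that values slightly outside the range still get a valid color.
    clipped = min(max(float(rating), 0.0), 1.0)
    return colors.to_hex(cmap(clipped))

for r in (0.0, 0.5, 1.0):
    print(r, rating_to_hex(r))
```

Passing a float matters here: a colormap called with an integer treats it as an index into its lookup table rather than as a position between 0 and 1.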

In [None]:
miami_venues = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme: ratings drive the outline, neighborhoods drive the fill
cmap = plt.get_cmap('OrRd')

# one fill color per neighborhood (not one per venue)
neighborhood_names = list(selected_venues_df['Neighborhood'].unique())
colors_array = cm.rainbow(np.linspace(0, 1, len(neighborhood_names)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to map
for neighborhood, lat, lng, venue, category, rating in zip(selected_venues_df['Neighborhood'], selected_venues_df['Venue Latitude'], selected_venues_df['Venue Longitude'], selected_venues_df['Venue'], selected_venues_df['Venue Category'], selected_venues_df['Venue Rating']):
    label = folium.Popup("(%s) %s > %s: %.2f" % (category, neighborhood, venue, rating), parse_html=True)
    neighborhood_idx = neighborhood_names.index(neighborhood)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colors.rgb2hex(cmap(rating)),
        fill=True,
        fill_color=rainbow[neighborhood_idx],
        fill_opacity=0.7,
        parse_html=False).add_to(miami_venues)
    
miami_venues