# Coursera Applied Data Science Capstone Project - Week 2 - Restaurants in Zurich

Let's start by importing our necessary libraries.

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import scipy.stats as stats

Using BeautifulSoup, we will parse the HTML of the <a href="https://en.wikipedia.org/wiki/Subdivisions_of_Zürich">Wikipedia page on the subdivisions of Zürich</a> and extract the neighborhood names from its first <b>table</b> element.
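
As a minimal, offline sketch of that extraction step, the snippet below parses a small hand-written HTML table instead of the live Wikipedia page. The markup and the names in it are made up for illustration, and `html.parser` is used here only so the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia markup (not the real page)
sample_html = """
<table>
  <tr><td><div>navigation box</div></td></tr>
  <tr><td><a href="/wiki/A">Alpha</a></td></tr>
  <tr><td><a href="/wiki/B">Beta</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.table   # first <table> in the document
sample_table.div.decompose()       # remove the unwanted <div> in place

# Collect the text of every link that remains in the table
sample_names = [a.get_text(strip=True) for a in sample_table.find_all('a')]
print(sample_names)
```

The same three steps (take the table, decompose the `<div>`, loop over the `<a>` tags) are what the cells below apply to the real page.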

In [None]:
#Extract list of neighborhoods from Wikipedia page
wiki_link = 'https://en.wikipedia.org/wiki/Subdivisions_of_Zürich'
raw_wiki = requests.get(wiki_link).text

#Import and parse the text from Wikipedia using BeautifulSoup
soup = BeautifulSoup(raw_wiki,'lxml')

In [None]:
#Data of the neighborhoods is inside an html <table> object.
soup_table = soup.table
#Decompose the <div> object inside the table to clean our data
soup_table.div.decompose()
soup_table

In [None]:
#Loop through the soup object to extract names of neighborhoods
neighborhoods = []

for name in soup_table.find_all('a'):
    neighborhoods.append(str(name.string))
    
print('There are',len(neighborhoods),'neighborhoods in Zürich')

In [None]:
#import Nominatim to get location data on the neighborhoods
from geopy.geocoders import Nominatim

In [None]:
#from geopy.exc import GeocoderTimedOut #Use this if Geocode 

def locationfinder(neighborhoods):
    geolocator = Nominatim(user_agent='kkha@lab-data.com')
    locations = []
    for nbhd in neighborhoods:
        address = nbhd + ', Canton Zürich'
        location = geolocator.geocode(address)
        if location is None:
            pass
        else:
            x = [nbhd, location.latitude, location.longitude]
            locations.append(x)
    return(locations)

In [None]:
import pickle

#load the cached locations if the pickle file exists; otherwise geocode the neighborhoods
try:
    with open('locations.pkl', 'rb') as f:
        locations = pickle.load(f)
    print('Data loaded.')
except FileNotFoundError:
    locations = locationfinder(neighborhoods)
    
#with open('locations.pkl', 'wb') as f:
#    pickle.dump(locations, f)

locations[0:5]

Let's create the Dataframe so we can begin to analyze the data.

Through trial and error, we found that the geocoder does not return coordinates for every neighborhood name, because some neighborhoods overlap or are small subdivisions of larger ones. It is okay to leave these neighborhoods out because our search radius will be large enough to cover the missing areas.
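
To see which names fail to resolve, one option is to collect them separately while geocoding. The sketch below is an assumption-laden variant of `locationfinder`: the helper name `geocode_all`, the `pause` parameter, and the fake lookup table are all hypothetical, and the geocoder is passed in as a plain function so the example runs without network access.

```python
import time
from collections import namedtuple

def geocode_all(names, geocode, suffix=', Canton Zürich', pause=0.0):
    """Return (found, missing): rows for resolved names and a list of unresolved names."""
    found, missing = [], []
    for name in names:
        result = geocode(name + suffix)   # expected to return None when nothing matches
        if result is None:
            missing.append(name)
        else:
            found.append([name, result.latitude, result.longitude])
        time.sleep(pause)                 # optional delay between requests
    return found, missing

# Fake geocoder for illustration only: a dict lookup that returns None for unknown addresses
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])
fake_index = {'Alpha, Canton Zürich': FakeLocation(1.0, 2.0)}

found, missing = geocode_all(['Alpha', 'Beta'], fake_index.get)
print(found)
print(missing)
```

With geopy, the `geocode` argument would be the `geocode` method of a `Nominatim` instance, and a non-zero `pause` helps space out the requests.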

In [None]:
df_loc = pd.DataFrame(data=locations)
df_loc.columns = ['Neighborhood', 'Latitude', 'Longitude']

#drop any rows with missing values (neighborhoods that failed to geocode were already skipped)
df_loc.dropna(axis=0, inplace=True)

Let's map the neighborhoods to visualize the area we will be working with.

In [None]:
!conda install -c conda-forge folium=0.5.0 --yes

import folium

In [None]:
geolocator = Nominatim(user_agent='kevinle.kha@gmail.com')
sd_loc = geolocator.geocode('Zurich')
sd_lat = sd_loc.latitude
sd_lng = sd_loc.longitude

map_SD = folium.Map(location=(sd_lat,sd_lng), zoom_start=10)

for nbhd, lat, lng in zip(df_loc['Neighborhood'], df_loc['Latitude'], df_loc['Longitude']):
    label = '{}, ZH'.format(nbhd)
    label = folium.Popup(label)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=False).add_to(map_SD)
    
map_SD

Note that neighborhoods that could not be geocoded in the previous step have no marker on this map.

Awesome! Now that we have the neighborhoods mapped out, let's use the Foursquare API to explore restaurant venues within each neighborhood. This will give us data on the types of restaurants and their density in each neighborhood.
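
The explore request used below can also be assembled from a dictionary of query parameters, which keeps each parameter name next to its value. This is only a sketch: `build_explore_url` is a hypothetical helper, the credentials are placeholders, and the coordinates are rough illustrative values. The endpoint, parameter names, and food category ID are the ones that appear in this notebook.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret, version,
                      radius=4000, limit=100,
                      category_id='4d4b7105d754a06374d81259'):
    """Build the venues/explore URL for one latitude/longitude pair."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
        'categoryId': category_id,
    }
    # urlencode escapes special characters, e.g. the comma in 'll'
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials and approximate coordinates, for illustration only
example_url = build_explore_url(47.37, 8.54, 'MY_ID', 'MY_SECRET', '20190321')
print(example_url)
```

With `requests`, the same dictionary could be passed directly as `requests.get(base_url, params=params)`.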

In [None]:
#Establish Foursquare credentials and version
CLIENT_ID = 'E1NJSUC205TKLJRJ0LLIGOGXRTPHU5G332HBJA00QLSIXIYP' # your Foursquare ID
CLIENT_SECRET = 'GF4TL3GCXPIBPUVOWYLG534XI3LYV3OKHXV1ZEW3YZXSAYVD' # your Foursquare Secret
VERSION = '20190321' # Foursquare API version

In [None]:
def getNearbyFoods(names, latitudes, longitudes, radius=4000, LIMIT=100):
    
    food_cat = '4d4b7105d754a06374d81259'
    rest_list=[]
    #seen = set()
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT,
            food_cat)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
         
        # return only relevant and unique information for each nearby venue
        for v in results:
            #if v['venue']['name'] not in seen:
                #seen.add(v['venue']['name'])
                rest_list.append([(
                    name, 
                    lat, 
                    lng, 
                    v['venue']['id'],
                    v['venue']['name'], 
                    v['venue']['location']['lat'], 
                    v['venue']['location']['lng'],  
                    v['venue']['categories'][0]['name'])])

    nearby_foods = pd.DataFrame([item for rest_list in rest_list for item in rest_list])
    nearby_foods.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue ID',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_foods)

In [None]:
#check if restaurant data pickle created, will load in data if we've already created it before
try:
    with open('sd_rests.pkl', 'rb') as f:
        sd_rests = pickle.load(f)
    print('Data loaded.')
except FileNotFoundError:
    sd_rests = getNearbyFoods(names=df_loc['Neighborhood'],
                            latitudes = df_loc['Latitude'],
                            longitudes = df_loc['Longitude'])

In [None]:
sd_rests.to_pickle('sd_rests.pkl')

In [None]:
print('There are {} unique restaurants and {} unique restaurant categories'.format(
    len(sd_rests['Venue'].unique()),
    len(sd_rests['Venue Category'].unique())))

In [None]:
sd_rests.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue ID', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

In [None]:
print(sd_rests.shape)
sd_rests.head()

We will now one-hot encode the venue categories with `pd.get_dummies`, so that it will be easier for our machine learning algorithm to cluster the data.
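
To make the encoding concrete, here is a small sketch on made-up data (the neighborhoods and categories are invented). Each category becomes a 0/1 column, and averaging those columns per neighborhood gives the share of each category among that neighborhood's venues.

```python
import pandas as pd

# Made-up venues, for illustration only
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Pizza Place', 'Cafe', 'Cafe'],
})

# One indicator column per category (dtype=int gives 0/1 instead of booleans)
toy_onehot = pd.get_dummies(toy['Venue Category'], dtype=int)
toy_onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# The mean of the indicators per neighborhood is each category's share
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

Neighborhood A ends up with 0.5 for both categories, while B has 1.0 for Cafe and 0.0 for Pizza Place.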

In [None]:
#one hot encoding
sd_onehot = pd.get_dummies(sd_rests[['Venue Category']], prefix="", prefix_sep="")

#add neighborhoods column back to dataframe
sd_onehot['Neighborhood'] = sd_rests['Neighborhood']

#move neighborhood column to first column
fixed_columns = [sd_onehot.columns[-1]] + list(sd_onehot.columns[:-1])
sd_onehot = sd_onehot[fixed_columns]

#just to make sure we moved the columns correctly
sd_onehot.head()

In [None]:
sd_grouped = sd_onehot.groupby('Neighborhood').mean().sort_values('Neighborhood').reset_index()
sd_grouped

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

In [None]:
Sum_of_squared_distances = []
sd_grouped_clustering = sd_grouped.drop('Neighborhood', axis=1)

K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(sd_grouped_clustering)
    Sum_of_squared_distances.append(km.inertia_)

In [None]:
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

In our elbow plot, we see that the sum of squared distances decreases continuously as K increases, so we will set the number of clusters to 10. That is where the change in the sum of squared distances becomes negligible as we increase K.
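
Reading the elbow by eye is subjective, so a simple rule of thumb can help. The sketch below is one possible heuristic, not part of the analysis above: `pick_k` and its `min_relative_drop` threshold are hypothetical, and the inertia values are made up. It returns the first k after which adding one more cluster reduces the sum of squared distances by less than the given fraction.

```python
def pick_k(ks, inertias, min_relative_drop=0.1):
    """Return the first k whose next step improves inertia by less than min_relative_drop."""
    for k, current, nxt in zip(ks, inertias, inertias[1:]):
        # Relative improvement gained by going from k to the next k
        if current == 0 or (current - nxt) / current < min_relative_drop:
            return k
    return ks[-1]   # no flattening found; fall back to the largest k tried

# Made-up inertia values, for illustration only
toy_ks = [1, 2, 3, 4, 5]
toy_inertias = [100.0, 50.0, 30.0, 28.0, 27.0]
print(pick_k(toy_ks, toy_inertias))
```

In this notebook the inputs would be `K` and `Sum_of_squared_distances`; the threshold is a judgment call and should be checked against the plot.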

In [None]:
#set number of clusters
kclusters = 10

#run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sd_grouped_clustering)

kmeans.labels_[0:10]

In [None]:
#Apply the cluster labels to our original location dataframe
sd_clustered = df_loc.sort_values('Neighborhood').reset_index(drop=True)
sd_clustered['Cluster Labels'] = kmeans.labels_

sd_clustered.head()

We are now going to map out the neighborhoods and color them for easier visualization of the grouping.

In [None]:
#import plot colors
import matplotlib.cm as cm
import matplotlib.colors as colors

In [None]:
# create map
map_clusters = folium.Map(location=[sd_lat, sd_lng], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.jet(np.linspace(0, 1, len(ys)))
jet = [colors.to_hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sd_clustered['Latitude'], sd_clustered['Longitude'], sd_clustered['Neighborhood'], sd_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=jet[cluster-1],
        fill=True,
        fill_color=jet[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
#Find the count of neighborhoods within each cluster to see how our model worked
sd_clustered[['Neighborhood','Cluster Labels']].groupby('Cluster Labels').count()

Now that we have our neighborhoods clustered, let's dive into each cluster and do some statistical testing to find out how similar the clusters are.

In [None]:
#Create a new dataframe of only Neighborhood, Venue data, and Cluster Labels
sd_rests_cat = pd.DataFrame(data=sd_rests[['Neighborhood','Venue ID','Venue','Venue Latitude', 'Venue Longitude', 'Venue Category']])

#Create a dictionary that maps each neighborhood to its assigned cluster from K-Means
cluster_nbhd = sd_clustered[['Neighborhood', 'Cluster Labels']]
nbhd_cluster_dict = dict(zip(cluster_nbhd['Neighborhood'], cluster_nbhd['Cluster Labels']))
nbhd_cluster_dict

#Map the dictionary to our new dataframe
sd_rests_cat['Cluster Labels'] = sd_rests_cat['Neighborhood'].map(nbhd_cluster_dict)
sd_rests_cat.head()

In [None]:
#list to span numbered clusters
c_list = list(np.arange(0,kclusters,1))

#dictionary of dataframes to separate out restaurants by cluster
d = {c: pd.DataFrame() for c in c_list}

# d[#] where # means the numbered cluster to make our data more easily understood
for c in c_list:
    d[c] = sd_rests_cat.loc[sd_rests_cat['Cluster Labels'] == c].drop(['Cluster Labels'], axis=1).reset_index(drop=True)

What wasn't obvious at the beginning was that our explore calls return the same restaurant more than once when the search radii of nearby neighborhoods overlap. The function below cleans the data so that each restaurant appears only once per cluster.
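
For comparison, pandas can do the same de-duplication directly with `drop_duplicates`. The sketch below uses made-up rows; it keeps the first row seen for each `Venue ID` and discards later repeats.

```python
import pandas as pd

# Made-up rows: venue 'v1' was returned for two overlapping neighborhoods
toy_cluster = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'B'],
    'Venue ID': ['v1', 'v1', 'v2'],
    'Venue': ['Cafe One', 'Cafe One', 'Pizza Two'],
})

# Keep only the first occurrence of each venue ID
toy_unique = (toy_cluster
              .drop_duplicates(subset='Venue ID', keep='first')
              .reset_index(drop=True))
print(toy_unique)
```

Matching on the venue ID rather than the venue name avoids merging distinct restaurants that happen to share a name.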

In [None]:
def get_unique_rests(data):
    #Keep only the first row seen for each Venue ID

    unique_list = []
    seen = set()

    for i in range(0,data.shape[0]):
        if data['Venue ID'][i] not in seen:
            seen.add(data['Venue ID'][i])
            unique_list.append(data.iloc[i])

    #Build the dataframe once, after the loop has finished
    df_unique = pd.DataFrame(data=unique_list).reset_index(drop=True)
    return(df_unique)

In [None]:
c = {c: pd.DataFrame() for c in c_list}
for x in c_list:
    c[x] = get_unique_rests(d[x]) 

In [None]:
c[0].head()

In [None]:
cf = {x: pd.DataFrame() for x in c_list}

for x in c_list:
    cf[x] = c[x][['Venue','Venue Category']].groupby('Venue Category').count()

In [None]:
from functools import reduce

cfct = reduce(lambda x,y: pd.merge(x,y, on='Venue Category', how='outer'), [cf[x] for x in c_list])

In [None]:
string = 'Cluster '
clusters = list(map(str, range(0, kclusters)))
cluster_list = [string + c for c in clusters]

In [None]:
cfct.columns = cluster_list
cfct.fillna(value=0, inplace=True)
cfct.head()

Using Student's t-test, we compute the p-value for each pair of clusters to measure how similar the clusters are. The null hypothesis of the t-test is that the two sample means are the same. If the p-value is close to 1, we fail to reject the null hypothesis, which means the data give no evidence that the means of the clusters differ. Conversely, if the p-value is close to 0, we can reject the null hypothesis, which means that the means of the two clusters are NOT similar (they are different).
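
As a small sanity check of how the p-value behaves, the sketch below runs `scipy.stats.ttest_ind` on made-up count vectors. Two identical samples give a p-value of 1, while two samples with clearly different means give a p-value near 0.

```python
import numpy as np
from scipy import stats

# Made-up per-category restaurant counts, for illustration only
counts_a = np.array([5, 3, 0, 1, 2])
counts_b = np.array([5, 3, 0, 1, 2])       # identical to counts_a
counts_c = np.array([40, 35, 50, 45, 38])  # much larger mean

same = stats.ttest_ind(counts_a, counts_b)
different = stats.ttest_ind(counts_a, counts_c)

print('identical samples p-value:', same.pvalue)
print('different samples p-value:', different.pvalue)
```

By default `ttest_ind` assumes equal variances (Student's t-test); passing `equal_var=False` switches to Welch's t-test, which may be a safer choice when cluster sizes differ a lot.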


In [None]:
p = []
pl = {}

for i in c_list:
    for j in c_list:
        p.append(stats.ttest_ind(cf[i],cf[j]).pvalue[0])
    pl[i] = p
    p=[]
    
pl_df = pd.DataFrame(pl)
pl_df

Comparing a cluster to itself gives a p-value of 1 (the means are identical when you compare a set with itself), so we will exclude those values by setting them to NaN.
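
The idea can be sketched on a made-up 3x3 p-value matrix. The diagonal is masked with NaN so that `idxmax` and `idxmin`, which skip NaN by default, return the most and least similar *other* cluster for each column. The variable names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up symmetric p-value matrix, for illustration only
toy_p = pd.DataFrame([[1.0, 0.8, 0.1],
                      [0.8, 1.0, 0.3],
                      [0.1, 0.3, 1.0]])

# Replace the diagonal (self-comparisons) with NaN
toy_masked = toy_p.mask(np.eye(len(toy_p), dtype=bool))

most_similar = toy_masked.idxmax()      # row label of the largest p-value per column
most_dissimilar = toy_masked.idxmin()   # row label of the smallest p-value per column
print(most_similar.tolist())
print(most_dissimilar.tolist())
```

Here cluster 0 is most similar to cluster 1 and most dissimilar to cluster 2.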

In [None]:
for i in c_list:
    pl_df.loc[i, i] = np.nan
    
pl_df

In [None]:
# idxmax() will find the index of the maximum of each column
most_similar = pl_df.idxmax()
# idxmin() will find the index of the minimum of each column
most_dissimilar = pl_df.idxmin()

In [None]:
for i in c_list:
    print('For Cluster',i,'the most similar Cluster is',most_similar[i])

In [None]:
for i in c_list:
    print('For Cluster',i,'the most dissimilar Cluster is',most_dissimilar[i])

## Methodology

In this project, we will begin by extracting a list of all the neighborhoods within Zürich. Wikipedia has a list we can use, so we will parse the raw HTML of the page using BeautifulSoup. Once the neighborhoods are extracted from the text, we will pass the neighborhood names to the Nominatim geocoder from geopy to find their latitude and longitude.

Once we have the locations of each neighborhood, we will begin to work with our Foursquare API. 