# Neighborhood for Best Breweries in Minneapolis - St. Paul

## Introduction

#### The goal is to identify the neighborhood with the best breweries and brewing companies in the Twin Cities area. The result should interest craft beer fans who are new to the area.

## Methodology

##### Get Foursquare data for breweries in the area and cluster them into neighborhoods. Since not all breweries have rating information, train a machine learning model (support vector regression was found to perform well) to predict the missing ratings. Then average the ratings within each cluster to suggest the neighborhood with the best breweries.

## Data Description

#### The data are obtained from Foursquare and contain each venue's location (latitude and longitude), ratings, rating counts, and likes. However, ratings exist for only about half of the venues. The available data are used to cluster the venues into neighborhoods and to train a model that predicts ratings for the venues that lack them. Finally, the neighborhood with the best set of breweries is suggested.

### 1. Get brewery data from Foursquare

In [23]:
CLIENT_ID = 'M0AMS10V3MXAFMKQGK3JY3TE0SDODCNBRLBVDAUQCJPTVS5X' # your Foursquare ID
CLIENT_SECRET = 'C1PYOJOMKZEDOJN143T3IVFRH5QNMYAK5QYMWBK2BDQPYD5G' # your Foursquare Secret
VERSION = '20180604'
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

radius = 10000 # in meters 
LIMIT = 1000  # Maximum number of results to obtain 
msp_lat = np.average([44.977,44.9537]) 
msp_lng = np.average([-93.2650,-93.0900]) 

search_query = 'brewing' 
print('Querying for',search_query) 
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, msp_lat, msp_lng, VERSION, search_query, radius, LIMIT) 
results = requests.get(url).json() 
mspvenues = results['response']['venues'] 
df_msp1 = json_normalize(mspvenues)

search_query = 'brewery'
print('Querying for',search_query) 
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, msp_lat, msp_lng, VERSION, search_query, radius, LIMIT) 
results = requests.get(url).json() 
mspvenues = results['response']['venues'] 
df_msp2 = json_normalize(mspvenues)
 
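# combine the two result sets into one dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_msp = pd.concat([df_msp1, df_msp2], ignore_index=True, sort=False)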
df_msp.drop_duplicates(subset=['name'], keep='first',inplace=True) 
print(df_msp.shape) 
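df_msp = df_msp[~df_msp['name'].str.contains("Coffee")]
df_msp = df_msp[~df_msp['name'].str.contains("Airport")]
# reset the index so that later positional lookups line up with the rows
df_msp = df_msp.reset_index(drop=True)
print(df_msp.shape)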



Your credentials:
CLIENT_ID: M0AMS10V3MXAFMKQGK3JY3TE0SDODCNBRLBVDAUQCJPTVS5X
CLIENT_SECRET:C1PYOJOMKZEDOJN143T3IVFRH5QNMYAK5QYMWBK2BDQPYD5G
Querying for brewing
Querying for brewery
(93, 25)
(91, 25)
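
The two search calls above differ only in the query string, so the URL construction could be factored into a helper. The following is a minimal sketch, not the code used for the results here: `build_search_url` is a hypothetical helper name, and reading the credentials from the `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` environment variables is an assumption made to keep keys out of the notebook.

```python
import os
from urllib.parse import urlencode

# Hypothetical helper (not part of the original notebook): builds the same
# v2 venues/search URL as above, with the parameters URL-encoded.
def build_search_url(query, lat, lng, radius=10000, limit=1000, version='20180604'):
    params = {
        # assumption: credentials are stored in environment variables
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# midpoint between the Minneapolis and St. Paul coordinates used above
msp_lat = (44.977 + 44.9537) / 2
msp_lng = (-93.2650 + -93.0900) / 2

# one URL per search term; each could then be passed to requests.get(...)
urls = [build_search_url(q, msp_lat, msp_lng) for q in ('brewing', 'brewery')]
```

Each URL can be fetched and normalized exactly as in the cell above; only the string formatting is replaced.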




### 2. Cluster the breweries using K-means

In [67]:
# set number of clusters 
kclusters = 9 # 4 6  
df_cluster = df_msp[['location.lat','location.lng']].copy()  # copy so the label column can be added safely
 
# run k-means clustering 
kmeans = KMeans(n_clusters=kclusters, n_init=50, random_state=None).fit(df_cluster) 
df_cluster.insert(0, 'Cluster Label', kmeans.labels_) 
 
# create map 
map_clusters = folium.Map(location=[msp_lat, msp_lng], zoom_start=11) 
 
# set color scheme for the clusters 
x = np.arange(kclusters) 
ys = [i + x + (i*x)**2 for i in range(kclusters)] 
colors_array = cm.rainbow(np.linspace(0, 1, len(ys))) 
rainbow = [colors.rgb2hex(i) for i in colors_array] 
 
# add markers to the map (df_rating is filled in Section 3, which was run before this cell)
markers_colors = [] 
for lat, lon, name, cluster, rating in zip(df_msp['location.lat'], df_msp['location.lng'], df_msp['name'], df_cluster['Cluster Label'], df_rating['rating']): 
    label = folium.Popup(str(name) + ' Cluster' + str(cluster) + ' Rating' + str(rating), parse_html=True) 
    folium.CircleMarker( 
         [lat, lon], 
         radius=5, 
         popup=label, 
         color=rainbow[cluster-1], 
         fill=True, 
         fill_color=rainbow[cluster-1], 
         fill_opacity=1).add_to(map_clusters) 
         
map_clusters 

#### The clustering result is fairly good: many of the clusters correspond quite well to well-defined neighborhoods in the Twin Cities area, such as Northeast Minneapolis, Central Minneapolis, Calhoun Isles, Powderhorn, University, West Seventh, Central St. Paul, and Roseville.
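
The number of clusters above is set by hand. One common way to compare candidate values of `kclusters` is the silhouette score. The sketch below is only an illustration under stated assumptions: the coordinates are synthetic (three made-up groups of points, not the Foursquare venues), and the range of k values tried is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the (lat, lng) columns: three tight groups of points
# (assumption: made-up coordinates, not the Foursquare data).
rng = np.random.RandomState(0)
centers = np.array([[44.98, -93.27], [44.95, -93.10], [45.01, -93.17]])
coords = np.vstack([c + 0.005 * rng.randn(30, 2) for c in centers])

# Score several candidate values of k; a higher silhouette means tighter,
# better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    scores[k] = silhouette_score(coords, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('k with the highest silhouette score:', best_k)
```

On real venue coordinates the score is only a guide; how well the clusters match known neighborhoods, as judged above, remains a sensible complementary check.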

### 3. Get rating and likes data from Foursquare

In [None]:
df_rating = pd.DataFrame(columns=['rating'],index=range(len(df_msp)))
df_sig = pd.DataFrame(columns=['sig'],index=range(len(df_msp)))
df_likes = pd.DataFrame(columns=['likes'],index=range(len(df_msp)))

In [48]:
CLIENT_ID = "14N4BFKHW2A03JRRS0DH12RGJBRK1CEAASOGVARJMDZI4JPT" # O2
CLIENT_SECRET = "TGLOLHAU34YOC4DSKW0CNWPCQYJEIQAIZKBD0ZGTEPLEO33Y" # O2

CLIENT_ID = 'M0AMS10V3MXAFMKQGK3JY3TE0SDODCNBRLBVDAUQCJPTVS5X' # gmail
CLIENT_SECRET = 'C1PYOJOMKZEDOJN143T3IVFRH5QNMYAK5QYMWBK2BDQPYD5G' # gmail

CLIENT_ID = '1DVM5IRVK51YANXBUJPV3LJJDBWOOFGEQAL1NG0OPW2TOVPE' # hotmail
CLIENT_SECRET = 'U4KJOFTGZ1M5HVMFBSTXKPSRML2M5V4VW2FMUV00Q4U4YKNM' # hotmail

CLIENT_ID = '1DVM5IRVK51YANXBUJPV3LJJDBWOOFGEQAL1NG0OPW2TOVPE' # ucla
CLIENT_SECRET = 'U4KJOFTGZ1M5HVMFBSTXKPSRML2M5V4VW2FMUV00Q4U4YKNM' # ucla

istart = 0
nvenues = len(df_rating)
nvenues = 11
# fetch likes, rating, and number of ratings for each venue
subset = df_msp.iloc[istart:nvenues]
for ii, venue_id, name in zip(range(istart, nvenues), subset['id'], subset['name']):
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)

    result = requests.get(url).json()
    venue = result['response']['venue']
    # use .loc instead of chained indexing so the assignment always hits the dataframe
    df_likes.loc[ii, 'likes'] = venue['likes']['count']
    try:
        print(name, '-', venue['rating'])
        df_rating.loc[ii, 'rating'] = venue['rating']
        df_sig.loc[ii, 'sig'] = venue['ratingSignals']
    except KeyError:
        # the rating fields are absent for venues that have not been rated
        print(name, '- This venue has not been rated yet.')
        df_rating.loc[ii, 'rating'] = np.nan
        df_sig.loc[ii, 'sig'] = 0

BlackStack Brewing - 7.9
Urban Growler Brewing Company - 8.6
Surly Brewing Company - 9.5
Burning Brothers Brewing - 7.3
Lake Monster Brewing - 8.5
Day Block Brewing Company - 8.2
Dual Citizen Brewing Company - 6.8
Indeed Brewing Company - 9.0
Insight Brewing - 8.1
St. Paul Brewing Cooperative - This venue has not been rated yet.
Tin Whiskers Brewing Co. - 8.3


### 4. Train a rating model and predict ratings for the venues that don't have them

In [147]:
# add likes and rating_signal columns to df_msp
df_msp['likes'] = df_likes
df_msp['sig'] = df_sig

# do one hot encoding on Cluster label and add to df_msp
df_onehot = pd.get_dummies(df_cluster['Cluster Label'], prefix="", prefix_sep="") 
df_onehot = df_onehot.sort_index()
df_msp[list(str(i) for i in range(kclusters))] = df_onehot

# Test-Training split using the availability of rating
filter = np.zeros(len(df_rating)) 
for i in range(len(df_rating)): 
    filter[i] = np.isnan(df_rating['rating'].iloc[i]) 
    
X = df_msp[list(str(i) for i in range(kclusters))+['likes','sig']] 
from sklearn import preprocessing 
poly = preprocessing.PolynomialFeatures(degree=2) 
#X = poly.fit_transform(X) 
#X = df_msp[['likes','location.lat','location.lng']] 
y = df_rating['rating'] 
X_proc = preprocessing.MinMaxScaler().fit(X).transform(X) 
#X_proc = np.array(X) 
X_pred = X_proc[df_msp[filter==1].index] 
X_train = X_proc[df_msp[filter!=1].index] 
y_train = y[df_msp[filter!=1].index] 
 
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet 
from sklearn.kernel_ridge import KernelRidge 
from sklearn.svm import SVR 
 
#model = LinearRegression()  
#model = ElasticNet(alpha = 0.001) 
#model = KernelRidge(kernel='sigmoid',gamma='auto', coef0=0) 
model = SVR(kernel='rbf',gamma='scale', C=1000000, coef0=-1, shrinking=False) 
model.fit(X_train, y_train) 
print('R2 score : ',model.score(X_train, y_train)) 
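try:
    print ('Coefficients: ', model.coef_) # only available for linear models / linear kernels
except AttributeError:
    print ('Dual Coefficients: ', model.dual_coef_) # kernel models expose dual coefficients instead
print ('Intercept: ',model.intercept_) # b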
 
# error analysis                                                                                           
from sklearn.metrics import r2_score 
y_hat = model.predict(X_train) # predicted data 
print("Rating Predictions : \n", y_hat) 
print("Error : ", np.round(list(y_hat - y_train),3)) 
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_hat - y_train))) 
print("Mean squared error (MSE): %.2f" % np.mean((y_hat - y_train)** 2))

y_pred = model.predict(X_pred) # predicted data 
y_pred[y_pred>10] = 10 # in case the predicted rating exceeds 10, just assign it 10 
y_pred[y_pred<0] = 0 # in case the predicted rating is negative, just assign it 0 
print('\n ########### \n') 
print("New Rating Predictions : \n", y_pred) 

R2 score :  0.9372513259707703
Dual Coefficients:  [[ 2.82500226e+01  3.64834649e+01  7.21442399e+05 -6.17425790e+04
   2.56148104e+04 -1.00000000e+06  4.97767348e+02  4.40154633e+04
   1.00000000e+06  5.37191461e+05  2.24318030e+05 -1.00000000e+06
   1.67453606e+04 -8.85672104e+05  2.71069331e+05 -1.47049070e+01
   7.44126456e+05 -4.33105757e+01 -3.29603936e+04 -1.69985086e+05
  -1.00000000e+06  2.24315264e+03 -1.95957761e+05 -8.21730834e+00
  -1.84538127e+04  6.54902052e+05  3.40254115e+05 -4.67335348e+01
   7.92746782e+01  7.82696734e+00  1.00000000e+06 -7.76980854e+00
   2.40374633e+04  9.07897397e+05 -1.00000000e+06 -9.97007876e+04
  -4.11174636e+05 -6.38739197e+05]]
Intercept:  [23.34104807]
Rating Predictions : 
 [7.92123221 8.5277076  9.40858424 7.19514184 8.61192101 8.07910687
 7.09649387 8.88292612 8.14640788 8.18752345 8.19047997 8.29316778
 6.25261476 7.24112836 8.46903731 8.39728407 8.68221594 8.49377378
 9.20310807 8.5398564  8.79556341 9.23022449 6.88924964 8.56157769
 8

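##### With R2 ≈ 0.94 on the training data, Support Vector Regression was the best of the models tried (in terms of R2). Because this score is computed on the same venues used for fitting, it does not measure accuracy on unseen venues. The predicted ratings nonetheless look broadly plausible.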
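
The R2 printed above is measured on the same venues the model was fit on, so it can overstate how well the model predicts unseen ratings. One common check is k-fold cross-validation. The sketch below uses synthetic features and ratings (made up here, not the Foursquare data) and an arbitrary, more moderate `C=10`; only the use of `SVR` with an RBF kernel and `MinMaxScaler` mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 40 "venues" with two features,
# loosely playing the role of likes and number of ratings.
rng = np.random.RandomState(0)
X = rng.uniform(0, 500, size=(40, 2))
y = 6.0 + 3.0 * X[:, 0] / 500 + 0.2 * rng.randn(40)

# Scaling lives inside the pipeline so each fold is scaled using only
# its own training part.
model = make_pipeline(MinMaxScaler(), SVR(kernel='rbf', gamma='scale', C=10))

train_r2 = model.fit(X, y).score(X, y)                    # R2 on the training data
cv_r2 = cross_val_score(model, X, y, cv=5, scoring='r2')  # R2 on held-out folds

print('training R2: %.2f' % train_r2)
print('cross-validated R2: %.2f +/- %.2f' % (cv_r2.mean(), cv_r2.std()))
```

A large gap between the two numbers would suggest that the model is memorizing the training venues rather than learning a pattern that carries over to the unrated ones.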

### 5. Recommend the best neighborhood for breweries

In [155]:
df_ratingall = pd.DataFrame(columns=['rating'],index=range(len(df_msp)))
# use .loc instead of chained indexing so the assignments always hit the dataframe
df_ratingall.loc[filter!=1, 'rating'] = df_rating['rating'][filter!=1]
df_ratingall.loc[filter==1, 'rating'] = y_pred

avgrating = np.zeros(kclusters) 
for clus in range(kclusters): 
    avgrating[clus] = df_ratingall[df_cluster['Cluster Label']==clus]['rating'].mean() 
    print('The avg. rating for cluster '+str(clus)+' is :', avgrating[clus])

best_rating = np.round(np.max(avgrating),2) 
best_clus = np.where(avgrating==np.max(avgrating))[0][0] 
print('The neighborhood with highest average ratings is Cluster '+str(best_clus)+' with '+str(best_rating)) 
 
# create map 
map_clusters = folium.Map(location=[msp_lat, msp_lng], zoom_start=11) 
 
# set color scheme for the clusters 
x = np.arange(kclusters) 
ys = [i + x + (i*x)**2 for i in range(kclusters)] 
colors_array = cm.rainbow(np.linspace(0, 1, len(ys))) 
rainbow = [colors.rgb2hex(i) for i in colors_array] 
 
# add markers to the map 
markers_colors = [] 
for lat, lon, name, cluster, rating in zip(df_msp[filter!=1]['location.lat'], df_msp[filter!=1]['location.lng'], df_msp[filter!=1]['name'], df_cluster[filter!=1]['Cluster Label'], df_ratingall[filter!=1]['rating']):
    label = folium.Popup(str(name) + ' Cluster' + str(cluster) + '\n Rating' + str(rating), parse_html=True) 
    if cluster == best_clus : 
        rad = 10 
    else: 
        rad = 5 
    folium.RegularPolygonMarker( 
        [lat, lon], 
        radius=rad, 
        popup=label, 
        color=rainbow[cluster-1], 
        fill=False, 
        fill_color=rainbow[cluster-1], 
        fill_opacity=0).add_to(map_clusters) 
     
for lat, lon, name, cluster, rating in zip(df_msp[filter==1]['location.lat'], df_msp[filter==1]['location.lng'], df_msp[filter==1]['name'], df_cluster[filter==1]['Cluster Label'], df_ratingall[filter==1]['rating']):
    label = folium.Popup(str(name) + ' Cluster' + str(cluster) + '\n Predicted Rating' + str(rating), parse_html=True) 
    if cluster == best_clus : 
        rad = 10 
    else: 
        rad = 5 
    folium.CircleMarker( 
        [lat, lon], 
        radius=rad, 
        popup=label, 
        color=rainbow[cluster-1], 
        fill=True, 
        fill_color=rainbow[cluster-1], 
        fill_opacity=1).add_to(map_clusters) 
        
map_clusters                                                                                                                                                                                            

The avg. rating for cluster 0 is : 7.955446433023519
The avg. rating for cluster 1 is : 7.747354516765963
The avg. rating for cluster 2 is : 7.365647438856208
The avg. rating for cluster 3 is : 8.12892824259271
The avg. rating for cluster 4 is : 8.656903910639818
The avg. rating for cluster 5 is : 7.905953150046296
The avg. rating for cluster 6 is : 7.401296490110462
The avg. rating for cluster 7 is : 7.95105735738477
The avg. rating for cluster 8 is : 7.48939044723375
The neighborhood with highest average ratings is Cluster 4 with 8.66


### Conclusion

#### The study predicts <span style="color:red">Cluster 4 (Nokomis - Highland Park) </span> to be the neighborhood with the best breweries. However, this conclusion is probably not reliable, because the study has two major problems. First, the model is trained on too few features, owing to the limited data available from Foursquare. The cluster labels make up most of the feature space (9 features), with the other two being the number of likes and the number of ratings. Unsurprisingly, the predicted ratings for venues within a given cluster are very similar. Second, in four of the nine clusters, including Cluster 4, only one or two venues had ratings, which makes the predictions for the remaining venues in those clusters highly questionable. Excluding those four clusters, the neighborhood with the best breweries would be <span style="color:lightgreen">Cluster 5 (University) </span>. Overall, more data is needed to reach a firm conclusion.
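
The exclusion step described above can be sketched as a small table operation: count the observed ratings per cluster, drop clusters with too few, and rank the rest. The data below are invented toy values (not the Foursquare results), and `min_rated` is a hypothetical threshold chosen to exclude clusters with only one or two rated venues.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: invented values, not the Foursquare results).
# NaN marks a venue without an observed rating.
df = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
    'rating':  [8.0, 7.5, 8.5, 8.0, 9.5, np.nan, np.nan, 7.0, 7.2, 7.4, np.nan],
})

min_rated = 3  # hypothetical threshold: require at least three observed ratings

# count() ignores NaN, so n_rated is the number of observed ratings per cluster
summary = df.groupby('cluster')['rating'].agg(n_rated='count', mean_rating='mean')
eligible = summary[summary['n_rated'] >= min_rated]
best_cluster = eligible['mean_rating'].idxmax()

print(summary)
print('Best cluster with enough observed ratings:', best_cluster)
```

In this toy example the cluster with the highest raw average has a single rated venue and is therefore excluded, which is the same situation described for Cluster 4.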