## Data-Driven Insights into Restaurant Markets Using Hybrid Clustering and Regression on Yelp Data

This project presents a hybrid machine learning system that
combines social computing with business analytics to provide
data-driven insights for restaurant planning. By analyzing
restaurant characteristics alongside social signals such as reviews and check-ins,
we demonstrate how social computing data can enhance business analytics.

Primary questions of exploration:
- Can we identify natural groupings of restaurant markets based on their characteristics?
- Can we predict how well a restaurant may perform in a particular area based on its features (category, price range, sentiment, and cluster type)?
- Which cities exhibit similar restaurant market patterns?

### Exploratory Data Analysis and Data Preparation

In [46]:
import json, pandas as pd, numpy as np
import gc
import warnings
warnings.filterwarnings("ignore")
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# import os
# os.system('pip install --upgrade ipywidgets')
from tqdm import tqdm
from sklearn.cluster import KMeans

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
import time

In [47]:
def load_restaurants(path='data/Yelp JSON/yelp_dataset/yelp_academic_dataset_business.json'):
    restaurants = []
    with open(path) as json_file:
        for line in tqdm(json_file, desc='Loading restaurants', colour='green'):
            b = json.loads(line)
            if b['categories'] and 'Restaurants' in b['categories'] and b['is_open'] == 1:
                restaurants.append({
                    'business_id': b['business_id'],
                    'name': b['name'],
                    'city': b['city'],
                    'state': b['state'],
                    'postal_code': b['postal_code'],
                    'latitude': b['latitude'],
                    'longitude': b['longitude'],
                    'stars': b['stars'],
                    'review_count': b['review_count'],
                    'categories': b['categories'],
                    'attributes': b['attributes'],
                    'hours': b['hours']
                })
    return pd.DataFrame(restaurants)

restaurants_df = load_restaurants()
print(f"Loaded {len(restaurants_df)} restaurants")
restaurants_df.head()

Loading restaurants: 150346it [00:00, 237632.87it/s]

Loaded 34987 restaurants





| | business_id | name | city | state | postal_code | latitude | longitude | stars | review_count | categories | attributes | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 1 | CF33F8-E6oudUQ46HnavjQ | Sonic Drive-In | Ashland City | TN | 37015 | 36.269593 | -87.058943 | 2.0 | 6 | Burgers, Fast Food, Sandwiches, Food, Ice Crea... | {'BusinessParking': 'None', 'BusinessAcceptsCr... | {'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '... |
| 2 | bBDDEgkFA1Otx9Lfe7BZUQ | Sonic Drive-In | Nashville | TN | 37207 | 36.208102 | -86.76817 | 1.5 | 10 | Ice Cream & Frozen Yogurt, Fast Food, Burgers,... | {'RestaurantsAttire': ''casual'', 'Restaurants... | {'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '... |
| 3 | eEOYSgkmpB90uNA7lDOMRA | Vietnamese Food Truck | Tampa Bay | FL | 33602 | 27.955269 | -82.45632 | 4.0 | 10 | Vietnamese, Food, Restaurants, Food Trucks | {'Alcohol': ''none'', 'OutdoorSeating': 'None'... | {'Monday': '11:0-14:0', 'Tuesday': '11:0-14:0'... |
| 4 | il_Ro8jwPlHresjw9EGmBg | Denny's | Indianapolis | IN | 46227 | 39.637133 | -86.127217 | 2.5 | 28 | American (Traditional), Restaurants, Diners, B... | {'RestaurantsReservations': 'False', 'Restaura... | {'Monday': '6:0-22:0', 'Tuesday': '6:0-22:0', ... |


In [48]:
# restaurants_df['price_range'] = restaurants_df['attributes'].apply(
#     lambda x: x.get('RestaurantsPriceRange2') if isinstance(x, dict) else None
# ).astype('float').fillna(2).astype(int)
#
# attr_cols = ['RestaurantsTakeOut', 'RestaurantsDelivery', 'OutdoorSeating',
#              'RestaurantsReservations', 'HasTV', 'Alcohol', 'WiFi', 'GoodForKids']
#
# for col in attr_cols:
#     restaurants_df[col] = restaurants_df['attributes'].apply(
#         lambda x: x.get(col) if isinstance(x, dict) else None
#     )
#     # convert the string values to boolean
#     restaurants_df[col] = restaurants_df[col].map({'True': True, 'False': False, 'None': False, None: False}).fillna(False)

# helper functions that replace the inline lambdas above and handle
# missing or malformed attribute values explicitly
def extract_price(attrs):
    if not attrs or not isinstance(attrs, dict):
        return None
    val = attrs.get('RestaurantsPriceRange2')
    if val in (None, 'None', ''):
        return None
    try:
        return int(val)
    except (ValueError, TypeError):
        return None

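def parse_bool(attrs, key, default=False):
    if not attrs or not isinstance(attrs, dict):
        return default
    val = attrs.get(key)
    if isinstance(val, str):
        # normalise wrapped strings such as "u'True'" or "'no'" to a bare lowercase token
        val = val.strip()
        if val.startswith("u'"):
            val = val[1:]
        val = val.strip("'").lower()
    # WiFi is stored as free/paid/no rather than True/False
    if val in (True, 'true', 'yes', 'free', 'paid'):
        return True
    if val in (False, None, 'false', 'no', 'none', ''):
        return False
    return default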

# apply them cleanly
restaurants_df['price_range'] = restaurants_df['attributes'].apply(extract_price).fillna(2).astype(int)

bool_attrs = [
    'RestaurantsTakeOut', 'RestaurantsDelivery', 'OutdoorSeating',
    'RestaurantsReservations', 'HasTV', 'WiFi', 'GoodForKids'
]

for attr in bool_attrs:
    restaurants_df[attr] = restaurants_df['attributes'].apply(lambda x: parse_bool(x, attr))

In [49]:
def load_checkin_counts(checkin_path='data/Yelp JSON/yelp_dataset/yelp_academic_dataset_checkin.json'):
    counts = {}

    with open(checkin_path, encoding='utf-8') as f:
        for line in tqdm(f, total=131930, desc="Processing check-ins"):
            obj = json.loads(line)
            bid = obj['business_id']
            # count the number of checkins
            count = len(obj['date'].split(',')) if obj['date'].strip() else 0
            counts[bid] = count

    return pd.DataFrame(list(counts.items()), columns=['business_id', 'checkin_count'])

checkin_df = load_checkin_counts()

# drop any check-in columns left over from a previous run so the merge
# below does not produce checkin_count_x / checkin_count_y duplicates
restaurants_df = restaurants_df.drop(
    columns=['checkin_count', 'checkin_count_x', 'checkin_count_y'],
    errors='ignore'
)

restaurants_df = restaurants_df.merge(checkin_df, on='business_id', how='left')

# handle missing with 0 and make integer
restaurants_df['checkin_count'] = restaurants_df['checkin_count'].fillna(0).astype(int)

print(f"{restaurants_df['checkin_count'].sum():,} total check-ins across all restaurants")
print(f"Restaurants with at least 100 check-ins: {(restaurants_df['checkin_count'] >= 100).sum():,}")
restaurants_df[['name', 'city', 'checkin_count']].sort_values('checkin_count', ascending=False).head(10)

Processing check-ins: 100%|██████████| 131930/131930 [00:00<00:00, 172401.45it/s]


6,757,093 total check-ins across all restaurants
Restaurants with at least 100 check-ins: 12,797


| | name | city | checkin_count |
|---|---|---|---|
| 7882 | Café Du Monde | New Orleans | 40109 |
| 7267 | Royal House | New Orleans | 28927 |
| 26163 | Oceana Grill | New Orleans | 21542 |
| 33302 | Reading Terminal Market | Philadelphia | 18615 |
| 26439 | Acme Oyster House | New Orleans | 15205 |
| 12193 | The Jug Handle Inn | Cinnaminson | 13244 |
| 34209 | Ruby Slipper - New Orleans | New Orleans | 10209 |
| 12281 | Pat O'Brien’s | New Orleans | 10135 |
| 9850 | Gumbo Shop | New Orleans | 9574 |
| 29074 | The Original Pierre Maspero's | New Orleans | 9464 |


In [50]:
cuisine_keywords = [
    'Mexican', 'Italian', 'Chinese', 'Japanese', 'Indian', 'Thai', 'French',
    'American', 'Mediterranean', 'Vietnamese', 'Korean', 'Greek', 'Cajun',
    'Spanish', 'Middle Eastern', 'Caribbean', 'German', 'Irish', 'Pizza'
]

def get_main_cuisine(categories):
    if not categories:
        return 'Other'
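    cats = [c.strip() for c in categories.split(',')]
    # exact match on category names, checked in keyword order: a restaurant tagged
    # only 'American (Traditional)' does not match 'American' and falls through to 'Other'
    for cuisine in cuisine_keywords: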
        if cuisine in cats:
            return cuisine
    return 'Other'

restaurants_df['cuisine'] = restaurants_df['categories'].apply(get_main_cuisine)

In [51]:
def kmeans_market_density(df, n_clusters=1200, new_col='market_cluster_density'):
    print(f"creating {n_clusters} micro markets using KMeans on location")

    coords = df[['longitude', 'latitude']].values
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    df['market_cluster'] = kmeans.fit_predict(coords)

    cluster_sizes = df['market_cluster'].value_counts()
    df[new_col] = df['market_cluster'].map(cluster_sizes)

    df[new_col] = df[new_col] - 1

    print(f"Most crowded micro market has {df[new_col].max() + 1} restaurants")
    return df

restaurants_df = kmeans_market_density(restaurants_df, new_col='cluster_density')

creating 1200 micro markets using KMeans on location
Most crowded micro market has 356 restaurants


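An earlier approach grouped restaurants into fixed 2 km ranges. KMeans on the coordinates instead partitions the restaurants
into 1,200 micro markets across North America; the count is set by `n_clusters` rather than discovered from the data.
The resulting `cluster_density` (the number of other restaurants in the same micro market) is used as a feature in later analysis.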
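
The density feature is simply the size of a restaurant's micro market minus one, i.e. the number of competitors assigned to the same KMeans cluster. A minimal sketch of that step, using made-up cluster labels and a hypothetical helper `competitor_counts` (neither comes from the notebook):

```python
from collections import Counter

def competitor_counts(labels):
    """For each item, count the other items that share its cluster label."""
    sizes = Counter(labels)
    return [sizes[label] - 1 for label in labels]

# Hypothetical cluster labels for six restaurants (not taken from the Yelp data)
labels = [0, 0, 0, 1, 2, 2]
density = competitor_counts(labels)
print(density)  # [2, 2, 2, 0, 1, 1]
```

A restaurant that is alone in its cluster gets a density of 0, which is why the notebook adds 1 back when reporting the size of the most crowded micro market.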


In [52]:
# success_score = percentile rank (0-100) of log review count within each city
restaurants_df['log_reviews'] = np.log1p(restaurants_df['review_count'])

restaurants_df['success_score'] = restaurants_df.groupby('city')['log_reviews']\
    .transform(lambda x: x.rank(pct=True) * 100)
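
Because the score is a within-city percentile rank, its average depends on how many restaurants the city has: with average ranking (the pandas default), a city with n restaurants has a mean score of 100 * (n + 1) / (2n). The sketch below reimplements the ranking in plain Python to make this visible; `percentile_scores` is a hypothetical helper and the review counts are toy values.

```python
def percentile_scores(values):
    """Average-rank percentile in (0, 100], mirroring Series.rank(pct=True) * 100."""
    n = len(values)
    scores = []
    for v in values:
        below = sum(1 for o in values if o < v)
        ties = sum(1 for o in values if o == v)
        avg_rank = below + (ties + 1) / 2  # mean of ranks below+1 .. below+ties
        scores.append(100 * avg_rank / n)
    return scores

# Toy review counts for cities with 1, 2 and 4 restaurants
print(percentile_scores([120]))             # [100.0]
print(percentile_scores([15, 300]))         # [50.0, 100.0]
print(percentile_scores([5, 40, 80, 900]))  # [25.0, 50.0, 75.0, 100.0]
```

A city's only restaurant therefore always scores 100, so averages taken over many small cities will sit well above 50.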

In [53]:
fig1 = px.scatter_mapbox(
    restaurants_df.sample(min(8000, len(restaurants_df))),
    lat="latitude", lon="longitude",
    color="stars", size="review_count",
    hover_name="name", hover_data=["city", "cuisine"],
    title="Yelp Restaurants: Size = Popularity, Color = Rating",
    zoom=3, height=600
)
fig1.update_layout(mapbox_style="carto-positron")
fig1.show()

fig2 = px.bar(restaurants_df['city'].value_counts().head(15),
              title="Top 15 cities by restaurant count",
              labels={'value': 'Number of Restaurants', 'index': 'City'},
              color=restaurants_df['city'].value_counts().head(15).values,
              color_continuous_scale="Blues")
fig2.update_xaxes(tickangle=45)
fig2.show()

fig3 = px.histogram(restaurants_df, x='success_score', nbins=50,
                    title="Success score distribution (by percentile within city)",
                    color_discrete_sequence=['#636EFA'])
fig3.update_layout(bargap=0.1)
fig3.show()

underserved = restaurants_df[restaurants_df['success_score'] < 30]
print(f"Underserved locations (bottom 30% in their city): {len(underserved):,}")

fig4 = px.scatter(restaurants_df.sample(10000),
                  x='cluster_density', y='success_score',
                  color='price_range', size='review_count',
                  hover_data=['name', 'city', 'cuisine'],
                  title="Market Hotspot Density vs Success Score",
                  labels={'cluster_density': 'Competitors in Natural Micro-Market',
                          'success_score': 'Success Score (percentile)'},
                  color_continuous_scale="Portland")
fig4.show()

Underserved locations (bottom 30% in their city): 10,127


In [54]:
# save to central csv
restaurants_df.to_csv('yelp_restaurants_final.csv', index=False)

In [55]:
# preprocessing pipeline
feature_cols = ['latitude', 'longitude', 'price_range', 'checkin_count',
                'cluster_density', 'cuisine', 'city']

X = restaurants_df[feature_cols]
y = restaurants_df['success_score']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['latitude', 'longitude', 'price_range',
                                'checkin_count', 'cluster_density']),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False),
     ['cuisine', 'city'])
], remainder='drop')

X_final = preprocessor.fit_transform(X)
print(f"preprocessed matrix shape: {X_final.shape}")
with open('../preprocessor.pkl', 'wb') as f:
    pickle.dump(preprocessor, f)
print("preprocessor.pkl saved!")

preprocessed matrix shape: (34987, 869)
preprocessor.pkl saved!
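
The `OneHotEncoder` in this pipeline uses `handle_unknown='ignore'`, so a city or cuisine that was not seen during `fit` is encoded as a row of zeros instead of raising an error. A pure-Python sketch of that behaviour follows; the helpers and city names are illustrative assumptions, not scikit-learn's implementation.

```python
def fit_categories(values):
    """Learn the sorted set of categories seen during fitting."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value; a category not seen during fitting becomes all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]

cities = fit_categories(["Tampa", "Nashville", "Tucson"])  # toy training cities
print(cities)                     # ['Nashville', 'Tampa', 'Tucson']
print(one_hot("Tampa", cities))   # [0.0, 1.0, 0.0]
print(one_hot("Boston", cities))  # [0.0, 0.0, 0.0]
```

For a saved preprocessor this means a prediction request for a new city still works, but the model receives no city information for it.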


In [56]:
kmeans_rest = KMeans(n_clusters=12, random_state=42, n_init=10)
restaurants_df['restaurant_cluster'] = kmeans_rest.fit_predict(X_final)
with open('../kmeans_restaurants.pkl', 'wb') as f:
    pickle.dump(kmeans_rest, f)

# city level clustering
city_agg = restaurants_df.groupby('city').agg({
    'stars': 'mean',
    'price_range': 'mean',
    'cluster_density': 'mean',
    'checkin_count': 'mean',
    'success_score': 'mean',
    'review_count': 'count'
}).reset_index()

city_numeric_correct_order = pd.DataFrame({
    'latitude': 0,                          # dummy
    'longitude': 0,                         # dummy
    'price_range': city_agg['price_range'],
    'checkin_count': city_agg['checkin_count'],
    'cluster_density': city_agg['cluster_density']
})

scaler = preprocessor.named_transformers_['num']
city_scaled = scaler.transform(city_numeric_correct_order)

kmeans_city = KMeans(n_clusters=6, random_state=42, n_init=10)
city_agg['city_cluster'] = kmeans_city.fit_predict(city_scaled)

restaurants_df = restaurants_df.merge(city_agg[['city', 'city_cluster']], on='city', how='left')
with open('../kmeans_cities.pkl', 'wb') as f:
    pickle.dump(kmeans_city, f)

fig = px.bar(
    x=range(12),
    y=restaurants_df['restaurant_cluster'].value_counts().sort_index(),
    title="Restaurant Cluster Distributions",
    labels={'x': 'Cluster ID', 'y': 'Number of Restaurants'},
    color=range(12),
    color_continuous_scale="Viridis"
)
fig.update_layout(showlegend=False, height=500)
fig.show()

# print(city_agg[['city', 'city_cluster']].sort_values('city_cluster').head(20))

In [57]:
print("discovered city market types:\n")
print(city_agg.groupby('city_cluster').agg({
    'city': 'count',
    'review_count': 'mean',
    'cluster_density': 'mean',
    'success_score': 'mean'
}).round(2))

print("\nTop cities in each cluster:")
for i in range(6):
    top = city_agg[city_agg['city_cluster'] == i][['city', 'review_count']].sort_values('review_count', ascending=False).head(5)
    print(f"\nCluster {i} (n={len(city_agg[city_agg['city_cluster']==i])} cities):")
    print(top.to_string(index=False))

discovered city market types:

              city  review_count  cluster_density  success_score
city_cluster                                                    
0              131          2.20            24.02          90.30
1               92        159.95            51.43          63.17
2              232         13.15            23.53          83.58
3              372         32.24            23.29          55.71
4                6        820.00           178.00          83.34
5               13          1.46            18.00          89.74

Top cities in each cluster:

Cluster 0 (n=131 cities):
           city  review_count
      Hazelwood            25
       Lawrence            15
Sun City Center            14
        Cahokia            12
          Salem             9

Cluster 1 (n=92 cities):
        city  review_count
       Tampa          1964
Indianapolis          1904
   Nashville          1681
      Tucson          1639
 Saint Louis           957

Cluster 2 (n=232 cities)

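The city-level KMeans clustering partitions the cities into 6 market types (the number of clusters is fixed by `n_clusters=6`). In the tables above, `review_count` is the number of restaurants per city, because it was aggregated with `count`.

**Clusters 0 and 5:** very small markets, averaging about 2 restaurants per city or fewer. Their mean success scores are near 90 in part because the score is a within-city percentile: a city's only restaurant always ranks at 100.

**Clusters 2 and 3:** small to mid-sized markets (about 13 and 32 restaurants per city on average) with micro-market density similar to Cluster 0.

**Cluster 1:** larger cities (about 160 restaurants on average), including Tampa, Indianapolis, Nashville, Tucson, and Saint Louis.

**Cluster 4:** six cities averaging 820 restaurants, with by far the highest mean density (178).

This suggests that market type carries signal for predicting restaurant success, although the high scores in the smallest clusters partly reflect how the score is constructed rather than stronger performance.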

In [58]:
from sklearn.tree import DecisionTreeRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X_final, y, test_size=0.2, random_state=42
)

results = []

print("Decision Tree")
start = time.time()

dt = DecisionTreeRegressor(max_depth=18, random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
results.append({
    'Model': 'Decision Tree',
    'R2': round(r2_score(y_test, dt_pred), 4),
    'Mean absolute error': round(mean_absolute_error(y_test, dt_pred), 2),
    'Time in seconds': round(time.time() - start, 1)
})

print("Gradient Boosting")
start = time.time()
gb = GradientBoostingRegressor(n_estimators=300, max_depth=6, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)
results.append({
    'Model': 'Gradient Boosting',
    'R2': round(r2_score(y_test, gb_pred), 4),
    'Mean absolute error': round(mean_absolute_error(y_test, gb_pred), 2),
    'Time in seconds': round(time.time() - start, 1)
})

print("Random Forest")
start = time.time()
rf = RandomForestRegressor(n_estimators=400, max_depth=22, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
results.append({
    'Model': 'Random Forest (Deployed)',
    'R2': round(r2_score(y_test, rf_pred), 4),
    'Mean absolute error': round(mean_absolute_error(y_test, rf_pred), 2),
    'Time in seconds': round(time.time() - start, 1)
})

comparison = pd.DataFrame(results).sort_values('R2', ascending=False)
print("*" * 20 + " FINAL MODEL COMPARISON " + "*" * 20)
print(comparison.to_string(index=False))

# save final model
with open('../final_model.pkl', 'wb') as f:
    pickle.dump(rf, f)
with open('../preprocessor.pkl', 'wb') as f:
    pickle.dump(preprocessor, f)
print("\nfinal_model.pkl + preprocessor.pkl saved")

Decision Tree
Gradient Boosting
Random Forest
******************** FINAL MODEL COMPARISON ********************
                   Model     R2  Mean absolute error  Time in seconds
       Gradient Boosting 0.6669                12.49            167.1
Random Forest (Deployed) 0.6445                12.74             31.7
           Decision Tree 0.5251                14.72              0.6

final_model.pkl + preprocessor.pkl saved
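
To help read the comparison table, the two metrics can be written out by hand. Since the target is a 0-100 percentile score, an MAE of about 12.5 means predictions are off by roughly 12.5 percentile points on average. The helper names and the four toy values below are assumptions for illustration; the notebook itself uses scikit-learn's `r2_score` and `mean_absolute_error`.

```python
def mae_by_hand(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """R^2: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy success scores and predictions (not from the notebook's test set)
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
print(mae_by_hand(y_true, y_pred))           # 2.5
print(round(r2_by_hand(y_true, y_pred), 3))  # 0.948
```

An R2 of 0.67 for Gradient Boosting therefore means the model explains about two thirds of the variance in the held-out success scores.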


In [60]:
# !jupyter nbconvert --to html "yelp data analysis.ipynb" \
#     --TemplateExporter.exclude_input=True \
#     --output "../yelp_data_analysis_report.html"
