# Yelp Dataset



The **Yelp dataset** is a comprehensive collection of business, review, and user data sourced from Yelp’s platform. Originally created for the Yelp Dataset Challenge, this dataset includes businesses from **11 metropolitan areas across the United States and Canada**, offering valuable insights into consumer interactions and business attributes.

**Dataset Components:**

1. **business.json**: Contains business details, including location, attributes, and categories.
2. **review.json**: Stores full review text along with the corresponding user ID and business ID.
3. **user.json**: Includes user metadata, such as friend connections and activity metrics.
4. **checkin.json**: Tracks aggregated check-ins for businesses.
5. **tip.json**: Features concise user-submitted tips, typically shorter than full reviews.
6. **photo.json**: Stores user-uploaded photos and their respective categories.

**Dataset Overview:**

*   **5.2 million reviews** across **174,000 businesses**
*   **1.3 million users** contributing **1.1 million tips**
*   **200,000 images** categorized by users
*   **11 metropolitan areas** covered
*   **1.2 million business attributes**, including parking, ambience, availability, and operating hours

## Setting up Environment and Dataset API



In [2]:
!pip install -q scikit-surprise

In [3]:
!pip install numpy==1.23.5 scikit-surprise --force-reinstall


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from tqdm import tqdm
from IPython.display import clear_output
import json

from surprise import Reader, Dataset, SVD, SVDpp, NMF, accuracy, Prediction
from surprise.model_selection import train_test_split, cross_validate, KFold

my_seed = 1234

In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [3]:
!pip install kaggle
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json



In [4]:
!kaggle datasets download -d yelp-dataset/yelp-dataset
!unzip yelp-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset
License(s): other
Archive:  yelp-dataset.zip
  inflating: Dataset_User_Agreement.pdf  
  inflating: yelp_academic_dataset_business.json  
  inflating: yelp_academic_dataset_checkin.json  
  inflating: yelp_academic_dataset_review.json  
  inflating: yelp_academic_dataset_tip.json  
  inflating: yelp_academic_dataset_user.json  
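
The next section loads pre-sampled pickles, but the sampling step itself is not shown in this notebook. The extracted files store one JSON object per line, so one possible approach is to stream them record by record and keep a random fraction. The sketch below is an assumption for illustration, not the procedure actually used here: the helper name `sample_json_lines`, the 10% rate, and the small stand-in file are all hypothetical.

```python
import json
import os
import random
import tempfile


def sample_json_lines(path, keep_fraction, seed=1234):
    """Stream a newline-delimited JSON file and keep a random fraction of records."""
    rng = random.Random(seed)
    kept = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if rng.random() < keep_fraction:
                kept.append(json.loads(line))
    return kept


# Demo on a small stand-in file; the real input would be a file such as
# yelp_academic_dataset_review.json from the archive above.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for i in range(1000):
        tmp.write(json.dumps({"review_id": f"r{i}", "stars": 1 + i % 5}) + "\n")
    demo_path = tmp.name

sampled = sample_json_lines(demo_path, keep_fraction=0.1)
os.remove(demo_path)
print(f"kept {len(sampled)} of 1000 records")
```

Streaming keeps memory use proportional to the sample rather than to the full file, which matters for the multi-gigabyte review file.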


## Read Sampled Data


In [2]:
# read pickle datasets
review_df = pd.read_pickle("reviews_sampled.pkl")
user_df = pd.read_pickle("user_sampled.pkl")
business_df = pd.read_pickle("business_sampled.pkl")

In [5]:
review_df.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | `KU_O5udG6zpxOg-VcAEodg` | `mh_-eMZ6K5RLWhZyISBhwA` | `XQfwVwDr-v0ZS3_CbbE5Xw` | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | `saUsX_uimxRlCVr67Z4Jig` | `8g_iMtfSiwikVnbP2etR0A` | `YjUWPpI6HXG530lwP-fb2A` | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | `Sx8TMOWLNuJBWer-0pcmoA` | `bcjbaE6dDog4jkNY91ncLQ` | `e4Vwtrqf-wpJfwesgvdgxQ` | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
| 9 | `8JFGBuHMoiNDyfcxuWNtrA` | `smOvOajNG0lS4Pq7d8g4JQ` | `RZtGWDLCAtuipwaZ-UfjmQ` | 4 | 0 | 0 | 0 | Good food--loved the gnocchi with marinara\nth... | 2009-10-14 19:57:14 |
| 10 | `UBp0zWyH60Hmw6Fsasei7w` | `4Uh27DgGzsp6PqrH913giQ` | `otQS34_MymijPTdNBoBdCw` | 4 | 0 | 2 | 0 | The bun makes the Sonoran Dog. It's like a snu... | 2011-10-27 17:12:05 |


In [6]:
user_df.head()

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,...,264,184,157,251,1847,7054,3131,3131,1521,1946
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,20092010201120122013,"LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,...,13,10,17,3,66,96,119,119,35,18
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,2007-01-05 19:40:59,29,15,7,,"PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA...",1,...,1,0,0,0,1,1,0,0,0,0
5,q_QQ5kBBwlCcbL1s4NVK3g,Jane,1221,2005-03-14 20:26:35,14953,9940,11211,200620072008200920102011201220132014,"xBDpTUbai0DXrvxCe3X16Q, 7GPNBO496aecrjJfW6UWtg...",1357,...,163,191,361,147,1212,5696,2543,2543,815,323
6,cxuxXkcihfCbqt5Byrup8Q,Rob,12,2009-02-24 03:09:06,6,1,0,,"HDAQ74AEznP-YsMk1B14CA, 6A6-aIX7fg_zRy9MiE6YyQ...",1,...,0,0,0,0,0,1,0,0,0,0


In [7]:
business_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,36.269593,-87.058943,2.0,6,1,"{'BusinessParking': 'None', 'BusinessAcceptsCr...","Burgers, Fast Food, Sandwiches, Food, Ice Crea...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '..."
9,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,TN,37207,36.208102,-86.76817,1.5,10,1,"{'RestaurantsAttire': ''casual'', 'Restaurants...","Ice Cream & Frozen Yogurt, Fast Food, Burgers,...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '..."
11,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,FL,33602,27.955269,-82.45632,4.0,10,1,"{'Alcohol': ''none'', 'OutdoorSeating': 'None'...","Vietnamese, Food, Restaurants, Food Trucks","{'Monday': '11:0-14:0', 'Tuesday': '11:0-14:0'..."
12,il_Ro8jwPlHresjw9EGmBg,Denny's,8901 US 31 S,Indianapolis,IN,46227,39.637133,-86.127217,2.5,28,1,"{'RestaurantsReservations': 'False', 'Restaura...","American (Traditional), Restaurants, Diners, B...","{'Monday': '6:0-22:0', 'Tuesday': '6:0-22:0', ..."


# Context-based Recommender

## Build CF recommender

### On all the reviews

In [4]:
from surprise import SVD, Dataset, Reader

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(review_df[['user_id', 'business_id', 'stars']], reader)

In [5]:
trainset, testset = train_test_split(data, test_size=.20, random_state=my_seed)

algo = SVD(n_factors=20, reg_all=0.1, random_state=1234)
algo.fit(trainset);

predictions = algo.test(testset)
predictions_df = pd.DataFrame(predictions)

In [6]:
accuracy.rmse(predictions);
accuracy.mae(predictions);

RMSE: 1.0779
MAE:  0.8474


In [211]:
predictions_df.head()

| | uid | iid | r_ui | est | details |
|---|---|---|---|---|---|
| 0 | `nlQGyti20c07rpTSYYoNZQ` | `qDheewhaZNDWLfpsrrrRiw` | 4.0 | 3.947565 | {'was_impossible': False} |
| 1 | `nraF3Nvr2VfJ-3EtQE1mnA` | `B3CY2NVEFySzbyhW9YuxQg` | 5.0 | 3.442246 | {'was_impossible': False} |
| 2 | `eWz41dtIs-H2JeV3JYQFsQ` | `0r37WEgkjlpioz0Ftdp6Zg` | 1.0 | 2.793227 | {'was_impossible': False} |
| 3 | `cMEtAiW60I5wE_vLfTxoJQ` | `iSRTaT9WngzB8JJ2YKJUig` | 4.0 | 3.794896 | {'was_impossible': False} |
| 4 | `VnmVTEsmQ2aBeY4XuLyvig` | `9xdXS7jtWjCVzL4_oPGv9A` | 2.0 | 4.389405 | {'was_impossible': False} |
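
For reference, RMSE is the square root of the mean squared difference between `r_ui` and `est`, and MAE is the mean absolute difference. The sketch below recomputes both by hand on only the five rows displayed above, so its numbers differ from the full test-set values reported by `accuracy.rmse` and `accuracy.mae`.

```python
import math

# (actual rating r_ui, estimated rating est) for the five rows shown above
pairs = [
    (4.0, 3.947565),
    (5.0, 3.442246),
    (1.0, 2.793227),
    (4.0, 3.794896),
    (2.0, 4.389405),
]

errors = [actual - est for actual, est in pairs]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # penalizes large misses more
mae = sum(abs(e) for e in errors) / len(errors)             # average size of a miss
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```

Because RMSE squares the errors before averaging, the two large misses in rows 2 and 4 pull it well above the MAE on this small sample.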


#### Training in each context value

In [7]:
review_df['date'] = pd.to_datetime(review_df['date'])

review_df['hour'] = review_df['date'].dt.hour

print(review_df[['date', 'hour']].head())

                  date  hour
0  2018-07-07 22:09:11    22
1  2014-02-05 20:30:30    20
3  2017-01-14 20:54:15    20
9  2009-10-14 19:57:14    19
10 2011-10-27 17:12:05    17


In [8]:
predictions_byuser_df = predictions_df.groupby('uid')[['iid', 'r_ui', 'est']].agg(lambda x: list(x))
predictions_byuser_df.head()

| uid | iid | r_ui | est |
|---|---|---|---|
| `---2PmXbF47D870stH1jqA` | `[hKameFsaXh9g8WQbv593UA, ReVpjIDupK_VMPn7ZxPvO...` | [5.0, 5.0, 5.0] | [4.6174665533133625, 4.893266889507578, 5.0] |
| `---UgP94gokyCDuB5zUssA` | `[GBTPC53ZrG1ZBY3DT8Mbcw, mV1UTSvEm-mhaPGFiIGhhQ]` | [4.0, 1.0] | [4.15245925478248, 3.8473591765838524] |
| `--3WaS23LcIXtxyFULJHTA` | `[Ih6_y2nnbg2Jw9Qdc876GA, GwNCQAGXJxM_YVnaG61-i...` | [4.0, 4.0, 4.0] | [4.060146449893285, 3.6992970371569633, 3.7377... |
| `--4AjktZiHowEIBCMd4CZA` | `[Yz0fJyBkUF8VZBvwFswkRQ, hCMqbFJyLczPk_qU3Ym2C...` | [5.0, 5.0, 4.0] | [4.7338309081131715, 3.98401622274251, 3.73340... |
| `--Al1VYjHegnOfTVotCHFw` | `[d3dw3YihB3WuU1Sb_i74-w]` | [4.0] | [2.943172919923764] |


#### Evaluation

In [9]:
import sklearn.metrics as metrics


def ndcg_multiple_users(relevant_items_all_users, predictions_ranking_all_users, k=5):
    ndcg_list = []
    # Loop over all users
    for i in range(len(relevant_items_all_users)):
        # Skip users with only 1 item in the test data, as no NDCG can be computed
        if len(relevant_items_all_users[i]) > 1:
            ndcg = metrics.ndcg_score([relevant_items_all_users[i]],
                                      [predictions_ranking_all_users[i]], k=k)
            ndcg_list.append(ndcg)

    return np.mean(ndcg_list)

In [10]:
ndcg_multiple_users(predictions_byuser_df['r_ui'].tolist(),
                    predictions_byuser_df['est'].tolist())

0.9488985805955051
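
To make the metric concrete, here is a minimal pure-Python sketch of NDCG@k for a single user, using linear gain and a log2 rank discount. It is an illustration, not the scikit-learn implementation: the helper names are hypothetical, and ties in the predicted scores are simply broken by input order. The first call uses the second user from `predictions_byuser_df.head()`; the second call is a made-up worst case with the same two items ranked the wrong way round.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(true_ratings, predicted_scores, k=5):
    """NDCG@k for one user; ties in predicted_scores are broken by input order."""
    order = sorted(range(len(true_ratings)), key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_ratings[i] for i in order]
    ideal = sorted(true_ratings, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Ratings and estimates for the second user in predictions_byuser_df.head()
good_order = ndcg_at_k([4.0, 1.0], [4.15245925478248, 3.8473591765838524])
# Hypothetical worst case: the same two items ranked the wrong way round
bad_order = ndcg_at_k([1.0, 4.0], [4.15245925478248, 3.8473591765838524])
print(round(good_order, 4), round(bad_order, 4))
```

Even the reversed two-item ranking scores about 0.76, and a user whose test ratings are all equal scores 1.0 whatever the order. With only a handful of test items per user, this helps explain why the averaged NDCG sits so close to 1.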

In [11]:
ratings_dawn_df = review_df[(review_df['hour'] >= 2) & (review_df['hour'] <= 8)]
ratings_morning_df = review_df[(review_df['hour'] >= 9) & (review_df['hour'] <= 13)]
ratings_afternoon_df = review_df[(review_df['hour'] >= 14) & (review_df['hour'] <= 21)]
ratings_night_df = review_df[(review_df['hour'] >= 22) | (review_df['hour'] <= 1)]

context_df_dict = {
    'dawn': ratings_dawn_df,
    'morning': ratings_morning_df,
    'afternoon': ratings_afternoon_df,
    'night': ratings_night_df
}

In [12]:
%%time
for context, ratings_subset_df in context_df_dict.items():
  data = Dataset.load_from_df(ratings_subset_df[['user_id', 'business_id', 'stars']], reader)
  trainset, testset = train_test_split(data, test_size=.20, random_state=1234)
  algo = SVD(n_factors=20, reg_all=0.1, random_state=1234)
  algo.fit(trainset);
  predictions = algo.test(testset)
  predictions_df = pd.DataFrame(predictions)
  predictions_byuser_df = predictions_df.groupby('uid')[['iid', 'r_ui', 'est']].agg(lambda x: list(x))
  ndcg = ndcg_multiple_users(predictions_byuser_df['r_ui'].tolist(),
                             predictions_byuser_df['est'].tolist())
  print(f'{context}: {ndcg}')

dawn: 0.9602012773051585
morning: 0.9619685350401687
afternoon: 0.9576593310582215
night: 0.9607852007596888
CPU times: user 2min 3s, sys: 751 ms, total: 2min 3s
Wall time: 2min 14s


When we train and evaluate a separate model for each time of day (dawn, morning, afternoon, and night), NDCG@5 is consistently around 0.96, slightly higher than the 0.95 we got when not considering time of day. This suggests that taking into account when a review was written (or when a recommendation is made) can fine-tune the recommendations, making them more aligned with what users actually find relevant. Each context model is scored on its own, smaller test split, so the comparison is indicative rather than exact. It’s a small bump, but in recommendation systems even a small improvement can make a difference in user satisfaction.

## Contextual pre-filtering

The objective here is to recommend the top N restaurants to a given user, based on their past behavior, preferences, and current context.

In [13]:
# see one sample of hours of a business
business_df['hours'].iloc[1]

{'Monday': '0:0-0:0',
 'Tuesday': '6:0-22:0',
 'Wednesday': '6:0-22:0',
 'Thursday': '6:0-22:0',
 'Friday': '9:0-0:0',
 'Saturday': '9:0-22:0',
 'Sunday': '8:0-22:0'}

Preprocessing already kept only restaurants that are marked as open (`is_open` = 1), so now we filter for restaurants that are open at our target time.

In [3]:
from datetime import datetime, time

def pad_time(t_str):
    parts = t_str.split(":")
    if len(parts) != 2:
        return t_str  # return as-is if unexpected format
    hour = parts[0].zfill(2)
    minute = parts[1].zfill(2)
    return f"{hour}:{minute}"

def is_open_at(hours_str, check_time):

    try:
        start_str, end_str = hours_str.split('-')
        # pad the time strings to ensure proper formatting ('7:0' -> '07:00')
        start_str = pad_time(start_str)
        end_str = pad_time(end_str)
        start = datetime.strptime(start_str, "%H:%M").time()
        end = datetime.strptime(end_str, "%H:%M").time()
        # handle overnight hours
        if end < start:
            return check_time >= start or check_time <= end
        return start <= check_time <= end
    except Exception:
        return False

def check_business_open(business_hours, dt):

    # convert datetime to weekday name
    day = dt.strftime("%A")
    if not isinstance(business_hours, dict):
        return False
    if day not in business_hours:
        return False
    return is_open_at(business_hours[day], dt.time())

In [4]:
target_datetime = datetime.strptime("2025-03-29 15:30", "%Y-%m-%d %H:%M")

# filter to keep only restaurants that have defined hours and are open at target_datetime
filtered_businesses = business_df[
    business_df['hours'].apply(lambda x: check_business_open(x, target_datetime))
]

print(f"Number of restaurants open at {target_datetime}: {len(filtered_businesses)}")

Number of restaurants open at 2025-03-29 15:30:00: 24701


Next we keep only restaurants that have at least 3.0 stars.

In [5]:
min_stars = 3.0
quality_filtered = filtered_businesses[
    (filtered_businesses['stars'] >= min_stars)
]
print(f"Restaurants open at {target_datetime} and with minimum 3.0 stars: {len(quality_filtered)}")

Restaurants open at 2025-03-29 15:30:00 and with minimum 3.0 stars: 19830


Now we will make sure they are all Restaurants

In [6]:
preferred_category = "Restaurant"
category_filtered = quality_filtered[
    quality_filtered['categories'].apply(lambda cats: preferred_category in cats if cats else False)
]
print(f"Restaurants open at {target_datetime} with minimum 3.0 stars in category '{preferred_category}': {len(category_filtered)}")

Restaurants open at 2025-03-29 15:30:00 with minimum 3.0 stars in category 'Restaurant': 19830


### CF with contextual pre-filtering

In [7]:
# Filter reviews to only include restaurants meeting the context criteria.
contextual_reviews = review_df[review_df['business_id'].isin(category_filtered['business_id'])]

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(contextual_reviews[['user_id', 'business_id', 'stars']], reader)

trainset, testset = train_test_split(data, test_size=0.20, random_state=my_seed)

algo = SVD(n_factors=20, reg_all=0.1, random_state=1234)
algo.fit(trainset)

predictions = algo.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)

RMSE: 1.0674
MAE:  0.8358


0.8357519575001126

When we train the model using only the contextually pre-filtered data, both errors improve slightly: RMSE drops from 1.0779 to 1.0674 and MAE from 0.8474 to 0.8358.

In [20]:
predictions_df = pd.DataFrame(predictions)
predictions_df.head()

| | uid | iid | r_ui | est | details |
|---|---|---|---|---|---|
| 0 | `h8L_oYRfB_fhpwPM9-pgpQ` | `dERS4qwl9v-Gwft-Ug8APA` | 3.0 | 4.294388 | {'was_impossible': False} |
| 1 | `Vix_wF49zbkxvl4_T-YlfA` | `Uw46n__imJ52D7Zh1vJVrQ` | 3.0 | 3.937611 | {'was_impossible': False} |
| 2 | `TttBEhT52jOzuCKobF3wUw` | `WTac09yyrmBDc5uvExV4yw` | 1.0 | 3.961175 | {'was_impossible': False} |
| 3 | `QTbKGM26pPl-jGrATlgzow` | `Eqfks4GEn5dsI4ZGiPrCVQ` | 5.0 | 4.408993 | {'was_impossible': False} |
| 4 | `mlHIwD4IGzM8B_RIGrYtXA` | `XJ1KN2kQGBIo8h1nWvZNzw` | 3.0 | 4.000283 | {'was_impossible': False} |


In [21]:
predictions_byuser_df = predictions_df.groupby('uid')[['iid', 'r_ui', 'est']].agg(lambda x: list(x))
predictions_byuser_df.head()

| uid | iid | r_ui | est |
|---|---|---|---|
| `---2PmXbF47D870stH1jqA` | `[1An4DxtMmvvSe0HX4viRCA, MxRZHZoDVVnN7EvMAHf1E...` | [5.0, 5.0, 5.0] | [4.480272824281376, 4.958602210677502, 4.84801... |
| `--3WaS23LcIXtxyFULJHTA` | `[OCzo8T-76iJ_QVB2UX5SEQ]` | [4.0] | [3.9915976659799393] |
| `--4AjktZiHowEIBCMd4CZA` | `[tYn8hGpZiRgJ8cP2FcI_YQ, Yz0fJyBkUF8VZBvwFswkR...` | [4.0, 5.0, 4.0] | [4.066329557838807, 4.730752866979406, 3.89247... |
| `--8r3pNaZiG1fN8LCHuL_g` | `[xp7IRO4FDLcHkAO59Qqehg]` | [4.0] | [3.5516237382978924] |
| `--B4MfqBxNuXX8ujyh8VXg` | `[ul03kWvQ22EadE7eq2hO_g, r1n7prN3Q2XQ9oPYstyr_...` | [5.0, 5.0, 5.0] | [4.096139142692183, 4.021225105332979, 3.90381... |


#### Evaluation

In [22]:
ndcg_multiple_users(predictions_byuser_df['r_ui'].tolist(),
                    predictions_byuser_df['est'].tolist())

0.9582416421341795

When we use only the pre-filtered reviews, our NDCG rises to about 0.9582, compared to around 0.9489 when we include all reviews. This improvement suggests that narrowing down the dataset to the most context-relevant interactions pays off. The model can focus on the ratings that matter for the current context, which leads to a better ranking of recommendations. It’s like clearing out the extra noise so that the signal stands out, making our recommendations more aligned with what users actually care about.

#### Get top 5 recommendations

In [8]:
from scipy.sparse import csr_matrix

user_ids = contextual_reviews['user_id'].unique()
business_ids = category_filtered['business_id'].unique()

user_id_to_index = {uid: idx for idx, uid in enumerate(user_ids)}
business_id_to_index_cf = {bid: idx for idx, bid in enumerate(business_ids)}

rows = contextual_reviews['user_id'].apply(lambda uid: user_id_to_index[uid]).tolist()
cols = contextual_reviews['business_id'].apply(lambda bid: business_id_to_index_cf[bid]).tolist()
ratings = contextual_reviews['stars'].tolist()

num_users = len(user_ids)
num_businesses = len(business_ids)
user_item_matrix = csr_matrix((ratings, (rows, cols)), shape=(num_users, num_businesses))
print("User-Item matrix shape:", user_item_matrix.shape)


User-Item matrix shape: (99797, 19830)


In [15]:
train_item_ids = {trainset.to_raw_iid(i) for i in trainset.all_items()}
candidate_business_ids = np.array(list(set(category_filtered['business_id'].unique()).intersection(train_item_ids)))
print("Number of candidate business IDs:", len(candidate_business_ids))

# Build mapping from business_id to index.
business_id_to_index_cf = {bid: idx for idx, bid in enumerate(candidate_business_ids)}


Number of candidate business IDs: 19777


In [21]:
def get_svd_recommendations(user_id, algo, category_filtered, filtered_reviews, top_n=5):

    candidate_ids = category_filtered['business_id'].unique()

    # businesses this user has already rated (a set gives fast membership tests)
    user_rated = set(filtered_reviews.loc[filtered_reviews['user_id'] == user_id, 'business_id'])

    predictions = []
    for item_id in candidate_ids:
        if item_id not in user_rated:
            pred = algo.predict(user_id, item_id)
            predictions.append((item_id, pred.est))

    predictions.sort(key=lambda x: x[1], reverse=True)
    return predictions[:top_n]

sample_user = contextual_reviews['user_id'].iloc[0]
recommendations = get_svd_recommendations(sample_user, algo, category_filtered, contextual_reviews, top_n=5)
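# note: isin() returns rows in category_filtered's order, so the names below are not in ranked order
restaurant_names = category_filtered[category_filtered['business_id'].isin([x[0] for x in recommendations])]['name'].tolist()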

print(f"Top SVD recommendations for user {sample_user}:")
for idx, name in enumerate(restaurant_names, start=1):
    print(f"{idx}. {name}")

Top SVD recommendations for user 4Uh27DgGzsp6PqrH913giQ:
1. Sierra Gold Seafood
2. Peachtree Grill
3. Haegeles Bakery
4. Mio’s Grill & Cafe
5. Pure Kitchen Organic Vegan
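
One caveat about the list above: the names are looked up with `isin`, which returns rows in DataFrame order, so the printed numbering does not necessarily match the CF ranking. A minimal sketch of an order-preserving lookup follows; the IDs, names, and scores are made up for illustration.

```python
# Hypothetical ranked output of a recommender: (business_id, estimated rating)
ranked = [("b3", 4.9), ("b1", 4.7), ("b2", 4.6)]

# Hypothetical business table, stored in a different order than the ranking
businesses = [
    {"business_id": "b1", "name": "Cafe One"},
    {"business_id": "b2", "name": "Diner Two"},
    {"business_id": "b3", "name": "Bistro Three"},
]

# Build an id -> name lookup once, then walk the ranking in order
name_by_id = {row["business_id"]: row["name"] for row in businesses}
ranked_names = [name_by_id[bid] for bid, _ in ranked]

for position, (name, (_, score)) in enumerate(zip(ranked_names, ranked), start=1):
    print(f"{position}. {name} (estimated rating {score:.2f})")
```

With pandas, the same idea could be expressed as `category_filtered.set_index('business_id')['name'].reindex(ids)`, assuming `business_id` is unique and `ids` holds the recommended IDs in ranked order.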


In [22]:
# lets inspect the restaurants we received as recommendations
business_df[business_df['name'].isin(restaurant_names)]


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
15447,TqezXFPh-f2cqIgkMvzHmQ,Sierra Gold Seafood,"1335 Greg St, Ste 105",Sparks,NV,89431,39.52081,-119.760232,5.0,95,1,"{'RestaurantsPriceRange2': '2', 'BusinessAccep...","Grocery, Specialty Food, Restaurants, Food, Se...","{'Tuesday': '10:0-18:0', 'Wednesday': '10:0-18..."
17894,meah5bJYnP24TH6b-8r_Eg,Peachtree Grill,329 Peachtree St,Nashville,TN,37210,36.120241,-86.749043,5.0,71,1,"{'RestaurantsTableService': 'False', 'Business...","American (New), Mediterranean, Middle Eastern,...","{'Monday': '0:0-0:0', 'Wednesday': '11:0-20:0'..."
64165,TiYGVQxNQ-wkr63o0JwaEA,Haegeles Bakery,4164 Barnett St,Philadelphia,PA,19135,40.026332,-75.057225,5.0,45,1,"{'BusinessParking': '{'garage': False, 'street...","Restaurants, Food, Bakeries","{'Tuesday': '7:0-17:0', 'Wednesday': '7:0-17:0..."
99968,YnGlopjmCYM6Pw07qt9bfw,Mio’s Grill & Cafe,119 2nd St N,St. Petersburg,FL,33701,27.772437,-82.635294,5.0,114,1,"{'BYOB': 'False', 'RestaurantsReservations': '...","Beer Bar, Vegetarian, Greek, Bars, Mediterrane...","{'Monday': '0:0-0:0', 'Tuesday': '11:0-21:0', ..."
149077,OVMQ5w9Qw96OfZ0e5nZtVA,Pure Kitchen Organic Vegan,3214 W Kennedy Blvd,Tampa,FL,33609,27.944519,-82.496623,5.0,130,1,"{'RestaurantsReservations': 'False', 'Business...","Vegetarian, Juice Bars & Smoothies, Organic St...","{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', ..."


We can see we got 5 restaurants that all have perfect 5.0-star ratings and are open at the target time we set earlier. However, they are scattered across the country: Nevada, Tennessee, Pennsylvania, and Florida!

### Contextual post-filtering

Let's filter out restaurants that will not stay open for at least 60 more minutes after our target datetime. Let's also compute each restaurant's distance from our target location, and then order the recommendations by distance, quality, and our CF score. This should change the top 5 for the better, because the ranking now reflects that our user wants somewhere close that stays open long enough for them to dine.

In [23]:
def haversine_distance(lat1, lon1, lat2, lon2):

    R = 6371  # earth radius in kilometers
    d_lat = np.radians(lat2 - lat1)
    d_lon = np.radians(lon2 - lon1)
    a = np.sin(d_lat / 2)**2 + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(d_lon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return R * c

# our target location will be Philadelphia
target_lat, target_lon = 39.95243, -75.16021

def is_within_radius(row, target_lat, target_lon, radius_km=20):

    dist = haversine_distance(target_lat, target_lon, row['latitude'], row['longitude'])
    return dist <= radius_km

filtered_by_location = filtered_businesses[
    filtered_businesses.apply(lambda row: is_within_radius(row, target_lat, target_lon), axis=1)
]
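
As a quick sanity check of the distance logic, here is a standard-library-only sketch of the same haversine formula, applied to two businesses from `business_df.head()`. The function name `haversine_km` is hypothetical, and it uses `asin` in place of the equivalent `arctan2` form above.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


target = (39.95243, -75.16021)        # target location used above (Philadelphia)
st_honore = (39.955505, -75.155564)   # St Honore Pastries, Philadelphia
sonic = (36.208102, -86.76817)        # Sonic Drive-In, Nashville

near = haversine_km(*target, *st_honore)
far = haversine_km(*target, *sonic)
print(f"{near:.2f} km, {far:.0f} km")
print(near <= 20, far <= 20)  # inside / outside the 20 km radius
```

The Philadelphia bakery lands well inside the 20 km radius, while the Nashville restaurant is roughly a thousand kilometers away and is excluded.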


In [28]:
from datetime import datetime, timedelta
def is_open_for_at_least(business_hours, dt, duration_minutes=60):
    day = dt.strftime("%A")
    if not isinstance(business_hours, dict):
        return False
    if day not in business_hours:
        return False
    try:
        start_str, end_str = business_hours[day].split('-')
        start_str = pad_time(start_str)
        end_str = pad_time(end_str)
        start_time = datetime.strptime(start_str, "%H:%M").time()
        end_time = datetime.strptime(end_str, "%H:%M").time()
    except Exception:
        return False

    if start_time <= dt.time() <= end_time:
        target_dt = dt + timedelta(minutes=duration_minutes)
        if target_dt.time() <= end_time:
            return True
    return False
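
Note that `is_open_for_at_least` compares clock times within a single day, so unlike `is_open_at` it does not treat hours such as `'18:0-2:0'` as running past midnight, and a check at 23:30 plus 60 minutes wraps around to 00:30. The sketch below shows one possible way to handle both cases by working in minutes after midnight. `open_for_at_least_minutes` and `to_minutes` are hypothetical helpers, not part of the notebook. Entries like `'0:0-0:0'` are treated as a zero-length interval here, since their intended meaning is not stated.

```python
from datetime import datetime


def to_minutes(clock_str):
    """Convert an 'H:M' string such as '7:0' or '22:30' to minutes after midnight."""
    hour, minute = clock_str.split(":")
    return int(hour) * 60 + int(minute)


def open_for_at_least_minutes(hours_str, dt, duration_minutes=60):
    """True if dt falls inside the interval and at least duration_minutes remain."""
    start_str, end_str = hours_str.split("-")
    start, end = to_minutes(start_str), to_minutes(end_str)
    now = dt.hour * 60 + dt.minute
    if end < start:
        end += 24 * 60       # interval runs past midnight
        if now < start:
            now += 24 * 60   # dt is in the after-midnight part of the interval
    return start <= now and now + duration_minutes <= end


afternoon = datetime(2025, 3, 29, 15, 30)
late = datetime(2025, 3, 29, 23, 30)
print(open_for_at_least_minutes("10:0-18:0", afternoon))  # closes 2.5 hours later
print(open_for_at_least_minutes("10:0-16:0", afternoon))  # closes in 30 minutes
print(open_for_at_least_minutes("18:0-2:0", late))        # overnight hours
```

For the 15:30 target used in this notebook the difference only matters for businesses whose listed hours cross midnight, so the results below were produced with the original function.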

In [25]:
def contextual_post_filtering_cf(recommendations, business_df, target_dt, target_lat, target_lon, duration_minutes=60):
    filtered = []
    for business_id, cf_score in recommendations:
        info = business_df[business_df['business_id'] == business_id]
        if info.empty:
            continue
        info = info.iloc[0]
        if not is_open_for_at_least(info['hours'], target_dt, duration_minutes):
            continue
        distance = haversine_distance(target_lat, target_lon, info['latitude'], info['longitude'])
        quality = info['stars']
        filtered.append((business_id, cf_score, distance, quality, info['name']))
    # Sort by: distance (asc), then stars (desc), then CF score (desc)
    return sorted(filtered, key=lambda x: (x[2], -x[3], -x[1]))
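
The tuple sort key compares distance first and only falls back to stars and CF score when distances are exactly equal. The sketch below uses made-up candidates to show the effect; the tuples follow the same `(business_id, cf_score, distance, quality, name)` layout as above.

```python
# Hypothetical candidates: (business_id, cf_score, distance_km, stars, name)
candidates = [
    ("a", 4.90, 3.10, 5.0, "Far but excellent"),
    ("b", 4.10, 1.20, 3.5, "Near, average"),
    ("c", 4.60, 1.20, 4.5, "Near, good"),
    ("d", 4.70, 1.20, 4.5, "Near, good, higher CF"),
]

# Same key as above: distance ascending, then stars and CF score descending
ordered = sorted(candidates, key=lambda x: (x[2], -x[3], -x[1]))
print([c[0] for c in ordered])
```

Since haversine distances are continuous, exact ties are rare in practice, so this key effectively orders the CF candidates by distance alone. Stars and CF score mainly influence which candidates reach this step at all.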

#### Get top 5 recommendations

In [32]:
recommendations = get_svd_recommendations(sample_user, algo, category_filtered, contextual_reviews, top_n=100)
filtered_recs = contextual_post_filtering_cf(recommendations, category_filtered, target_datetime, target_lat, target_lon)

print("Contextually post-filtered CF recommendations:")
# show only top 5
top_n = 5
for rec in filtered_recs[:top_n]:
    print(f"Business ID: {rec[0]}, Name: {rec[4]}, CF Score: {rec[1]:.2f}, Distance: {rec[2]:.2f} km, Stars: {rec[3]}")

Contextually post-filtered CF recommendations:
Business ID: RVLF2RaStLkJiQCqBHknDw, Name: Mom Mom's Kitchen and Polish Food Cart, CF Score: 4.60, Distance: 1.12 km, Stars: 5.0
Business ID: KTgZXj6xh8aN_tLfI-YZ1Q, Name: Bar Poulet, CF Score: 4.63, Distance: 1.19 km, Stars: 4.5
Business ID: TE2IEDNV0RcI6s1wTOP4fg, Name: Tortilleria San Roman, CF Score: 4.72, Distance: 1.65 km, Stars: 5.0
Business ID: gvD09Ev1aOmphtlq07zYEA, Name: El Rancho Viejo, CF Score: 4.62, Distance: 1.92 km, Stars: 5.0
Business ID: U7HYUH8SqZO6OQMNKCr5kQ, Name: Porco's Porchetteria, CF Score: 4.61, Distance: 2.24 km, Stars: 4.5


The top 5 is now different from what we got before post-filtering, and it makes much more sense: nobody wants a recommendation for a restaurant in another state. The recommendations now focus on restaurants that are close to the user's target location.

## Build CB recommender

### Similarity with categorical column

In [33]:
categorical_cols = ['categories']
items_cat_df = category_filtered[categorical_cols].copy()
items_cat_df = items_cat_df.apply(lambda x: x.str.lower())
items_cat_df = items_cat_df.apply(lambda x: x.str.replace(' ', ''))
items_cat_df = items_cat_df.apply(lambda x: x.str.replace(',', ' '))
items_cat_df['soup'] = items_cat_df[categorical_cols].agg(' '.join, axis=1)
pd.set_option('display.max_colwidth', 200)
items_cat_df.head()

| | categories | soup |
|---|---|---|
| 3 | restaurants food bubbletea coffee&tea bakeries | restaurants food bubbletea coffee&tea bakeries |
| 15 | sushibars restaurants japanese | sushibars restaurants japanese |
| 19 | korean restaurants | korean restaurants |
| 22 | steakhouses asianfusion restaurants | steakhouses asianfusion restaurants |
| 29 | pizza chickenwings sandwiches restaurants | pizza chickenwings sandwiches restaurants |


In [34]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(items_cat_df['soup'])
print("Count matrix shape:", count_matrix.shape)

count_matrix_df = pd.DataFrame(count_matrix.toarray(), columns=vectorizer.get_feature_names_out())
count_matrix_df.index = category_filtered.index
count_matrix_df.head()

Count matrix shape: (19830, 646)


Unnamed: 0,acaibowls,accessories,accountants,activelife,adult,adulteducation,adultentertainment,advertising,afghan,african,...,winebars,wineries,winetastingclasses,winetastingroom,winetours,women,wraps,yelpevents,yoga,yourselffood
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
22,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
from sklearn.neighbors import NearestNeighbors

K = 10
nn = NearestNeighbors(n_neighbors=K+1, metric='cosine', algorithm='brute')
nn.fit(count_matrix)

sample_idx = 0  # first restaurant in the count_matrix
distances, indices = nn.kneighbors(count_matrix[sample_idx])

print(f"Top {K} similar restaurants for restaurant at index {sample_idx}:")
for i in range(1, len(indices[0])):  # skip index 0 (the same restaurant)
    similar_idx = indices[0][i]
    sim_distance = distances[0][i]
    restaurant_name = category_filtered.iloc[similar_idx]['name']
    print(f"{i}. {restaurant_name} (Cosine distance: {sim_distance:.4f})")

Top 10 similar restaurants for restaurant at index 0:
1. The Foundry Bakery (Cosine distance: 0.0742)
2. Zhong Gang Bakery (Cosine distance: 0.0871)
3. WIT Cafe (Cosine distance: 0.1340)
4. Amelia's (Cosine distance: 0.1667)
5. Signature India (Cosine distance: 0.1667)
6. Starbucks (Cosine distance: 0.1667)
7. Be Well Bakery & Cafe (Cosine distance: 0.1667)
8. Starbucks (Cosine distance: 0.1667)
9. Krispy Kreme (Cosine distance: 0.1667)
10. Greek From Greece (Cosine distance: 0.1667)
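
The distances above can be read as token-overlap patterns. With each token appearing at most once, cosine distance reduces to one minus the number of shared tokens divided by the geometric mean of the two token counts. The sketch below illustrates this for the query restaurant's tokens, assuming the default `CountVectorizer` tokenizer splits `coffee&tea` into `coffee` and `tea`. The three neighbors are hypothetical and only chosen to show overlap patterns that would yield such values.

```python
import math


def cosine_distance(tokens_a, tokens_b):
    """Cosine distance between two binary bag-of-token vectors given as sets."""
    shared = len(tokens_a & tokens_b)
    return 1.0 - shared / math.sqrt(len(tokens_a) * len(tokens_b))


# Tokens for the query restaurant's soup string
query = {"restaurants", "food", "bubbletea", "coffee", "tea", "bakeries"}

# Hypothetical neighbors illustrating different overlap patterns
superset = query | {"desserts"}                 # all 6 tokens plus one extra
subset = query - {"bubbletea"}                  # 5 of the 6 tokens, nothing extra
swapped = (query - {"bubbletea"}) | {"cafes"}   # 5 shared tokens, 6 tokens in total

for label, tokens in [("superset", superset), ("subset", subset), ("swapped", swapped)]:
    print(f"{label}: {cosine_distance(query, tokens):.4f}")
```

These three patterns give 0.0742, 0.0871, and 0.1667, matching values in the list above. That would also explain why many neighbors share the exact distance 0.1667: several different category sets can have the same overlap with the query.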


### Content-based Feature Extraction

In [38]:
import ast
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack, csr_matrix

In [39]:
def clean_categories(cat):

    if isinstance(cat, list):
        return " ".join(cat)
    elif isinstance(cat, str):
        return " ".join(cat.split(", "))
    return ""

category_filtered = category_filtered.copy()

category_filtered['clean_categories'] = category_filtered['categories'].fillna("").apply(clean_categories)

vectorizer = CountVectorizer(binary=True, min_df=1)
category_features = vectorizer.fit_transform(category_filtered['clean_categories'])
print("Category feature matrix shape:", category_features.shape)


Category feature matrix shape: (19830, 692)


In [40]:
def parse_attributes(attr):

    try:
        if isinstance(attr, dict):
            return attr
        return ast.literal_eval(attr) if isinstance(attr, str) else {}
    except Exception:
        return {}

# parse attributes into a new column.
category_filtered['parsed_attributes'] = category_filtered['attributes'].fillna("{}").apply(parse_attributes)

attributes_df = pd.json_normalize(category_filtered['parsed_attributes'])
attributes_df.head()

Unnamed: 0,RestaurantsDelivery,OutdoorSeating,BusinessAcceptsCreditCards,BusinessParking,BikeParking,RestaurantsPriceRange2,RestaurantsTakeOut,ByAppointmentOnly,WiFi,Alcohol,...,DriveThru,BusinessAcceptsBitcoin,Corkage,BYOBCorkage,RestaurantsCounterService,Open24Hours,AgesAllowed,AcceptsInsurance,DietaryRestrictions,HairSpecializesIn
0,False,False,False,"{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}",True,1,True,False,u'free',u'none',...,,,,,,,,,,
1,True,True,True,"{u'valet': False, u'garage': None, u'street': True, u'lot': False, u'validated': None}",,2,True,,'free','full_bar',...,,,,,,,,,,
2,,,True,"{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}",True,1,True,,u'no',u'none',...,,,,,,,,,,
3,True,,,,,2,True,,,,...,,,,,,,,,,
4,,,,,,1,,,,,...,,,,,,,,,,


In [41]:
def convert_to_binary(val):

    if val in [True, 'True']:
        return 1
    elif val in [False, 'False']:
        return 0
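    try:
        return float(val)
    except (TypeError, ValueError):
        # non-numeric values (e.g. "u'free'" or a stringified dict) are encoded as 0
        return 0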

binary_attrs = attributes_df.map(convert_to_binary)
binary_attrs = binary_attrs.fillna(0)

# convert the attributes to a sparse matrix.
binary_attrs_sparse = csr_matrix(binary_attrs.values)
print("Attributes feature matrix shape:", binary_attrs_sparse.shape)

Attributes feature matrix shape: (19830, 39)


In [42]:
# stack the category and attribute features horizontally.
business_features = hstack([category_features, binary_attrs_sparse])
print("Combined feature matrix shape:", business_features.shape)

Combined feature matrix shape: (19830, 731)


To recap: we joined each restaurant's list of categories into a single space-separated string so the vectorizer can process it, then used CountVectorizer to turn those strings into a binary (multi-hot) vector per restaurant. Each row of the resulting sparse matrix is a restaurant and each column is a category. We then appended the 39 attribute columns, so the combined matrix has 19,830 rows (restaurants) and 731 columns, of which 731 - 39 = 692 come from categories.
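
To make the multi-hot idea concrete, here is a small library-free sketch. It treats each category label as one token, which may differ from how the notebook's vectorizer tokenizes multi-word labels. `multi_hot_encode` and the toy data are hypothetical.

```python
def multi_hot_encode(category_lists):
    """Return (vocabulary, matrix) where matrix[i][j] is 1 if item i has category j."""
    vocabulary = sorted({cat for cats in category_lists for cat in cats})
    column = {cat: j for j, cat in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in category_lists]
    for i, cats in enumerate(category_lists):
        for cat in cats:
            matrix[i][column[cat]] = 1
    return vocabulary, matrix

# Three toy restaurants with illustrative category lists.
toy_categories = [
    ["Pizza", "Italian"],
    ["Sushi Bars", "Japanese"],
    ["Pizza", "Bars"],
]
vocabulary, matrix = multi_hot_encode(toy_categories)
for row in matrix:
    print(row)
```

Each row has a 1 in every column whose category the restaurant belongs to and 0 elsewhere, which is the same shape of representation the sparse matrix stores.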

### Building User profiles

In [43]:
import numpy as np

business_id_to_index = dict(zip(category_filtered['business_id'], range(business_features.shape[0])))

user_reviews = review_df.groupby('user_id')['business_id'].apply(list)

user_profiles = {}

for user, biz_ids in user_reviews.items():
    indices = [business_id_to_index[biz_id] for biz_id in biz_ids if biz_id in business_id_to_index]
    if indices:
        profile_vector = business_features[indices].mean(axis=0)
        user_profiles[user] = np.asarray(profile_vector).flatten()

print("Number of user profiles created:", len(user_profiles))

Number of user profiles created: 99797
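
A user profile here is simply the mean of the feature rows of the restaurants that user reviewed. The loop above can also be written as a matrix product. The sketch below shows the idea on dense NumPy toy data; `build_user_profiles` and its inputs are hypothetical, and for the real 99,797-user data a sparse user-item matrix would be needed instead of a dense one.

```python
import numpy as np

def build_user_profiles(review_pairs, item_index, item_features):
    """Average the feature rows of each user's reviewed items via one matrix product."""
    users = sorted({user for user, _ in review_pairs})
    user_index = {user: i for i, user in enumerate(users)}
    # counts[u, i] = number of reviews user u wrote for item i (unknown items are skipped).
    counts = np.zeros((len(users), item_features.shape[0]))
    for user, item in review_pairs:
        if item in item_index:
            counts[user_index[user], item_index[item]] += 1
    totals = counts.sum(axis=1, keepdims=True)
    keep = totals[:, 0] > 0  # drop users with no known items, as the loop does
    profiles = (counts[keep] @ item_features) / totals[keep]
    kept_users = [user for user, flag in zip(users, keep) if flag]
    return dict(zip(kept_users, profiles))

# Toy data: three items with three features each.
toy_features = np.array([[1.0, 0.0, 1.0],
                         [0.0, 1.0, 1.0],
                         [1.0, 1.0, 0.0]])
toy_item_index = {"a": 0, "b": 1, "c": 2}
toy_reviews = [("u1", "a"), ("u1", "b"), ("u2", "c"), ("u3", "not_in_catalog")]
toy_profiles = build_user_profiles(toy_reviews, toy_item_index, toy_features)
```

Here `u1` reviewed items `a` and `b`, so the profile is the average of those two rows, and `u3` gets no profile because none of the reviewed items are in the catalog.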


#### Get top 5 recommendations

In [71]:
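from sklearn.metrics.pairwise import cosine_similarity

def content_recommend(user_id, user_profiles, business_features, business_ids, top_n=10):
    """Return the top_n business IDs most cosine-similar to the user's profile.

    Businesses the user has already reviewed are not excluded.
    """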
    if user_id not in user_profiles:
        return []

    user_vector = user_profiles[user_id].reshape(1, -1)

    sim_scores = cosine_similarity(user_vector, business_features)
    sim_scores = sim_scores.flatten()

    top_indices = np.argsort(sim_scores)[::-1][:top_n]

    recommendations = [business_ids[i] for i in top_indices]
    return recommendations

business_ids = category_filtered['business_id'].tolist()

In [63]:
sample_user = list(user_profiles.keys())[0]
recommended_restaurants = content_recommend(sample_user, user_profiles, business_features, business_ids, top_n=5)

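# Note: isin() keeps the DataFrame's row order, so the names printed below are not necessarily in ranked order.
restaurant_names = category_filtered[category_filtered['business_id'].isin(recommended_restaurants)]['name'].tolist()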

print(f"Top Content-based recommendations for user {sample_user}:")
for idx, name in enumerate(restaurant_names, start=1):
    print(f"{idx}. {name}")

Top Content-based recommendations for user ---2PmXbF47D870stH1jqA:
1. Iron Hill Brewery & Restaurant
2. Houlihan's
3. Not Your Average Joe's
4. Grindstone Charley's
5. Lucky Fins Seafood Grill
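
Because the names above are selected with `isin`, they come back in DataFrame order rather than in similarity order. A minimal sketch of a lookup that preserves the ranking, assuming a plain dict from business ID to name (for example built with `dict(zip(...))` over the `business_id` and `name` columns). `names_in_rank_order` and the toy data are hypothetical.

```python
def names_in_rank_order(ranked_ids, id_to_name):
    """Look up names one by one so the output follows the ranking, not the table order."""
    return [id_to_name.get(business_id, "Unknown") for business_id in ranked_ids]

# Toy mapping and ranking (illustrative IDs and names).
toy_id_to_name = {"b1": "Alpha Diner", "b2": "Beta Bistro", "b3": "Gamma Grill"}
toy_ranked_ids = ["b3", "b1", "b2"]
ranked_names = names_in_rank_order(toy_ranked_ids, toy_id_to_name)

for position, name in enumerate(ranked_names, start=1):
    print(f"{position}. {name}")
```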


### Contextual post-filtering

In [58]:
def contextual_post_filtering_cb(recommendations, business_df, target_dt, target_lat, target_lon, duration_minutes=60):
    filtered = []
    for business_id, cb_score in recommendations:
        info = business_df[business_df['business_id'] == business_id]
        if info.empty:
            continue
        info = info.iloc[0]
        if not is_open_for_at_least(info['hours'], target_dt, duration_minutes):
            continue
        distance = haversine_distance(target_lat, target_lon, info['latitude'], info['longitude'])
        quality = info['stars']
        filtered.append((business_id, cb_score, distance, quality, info['name']))
    # Sort by: distance (asc), then stars (desc), then CB score (desc)
    return sorted(filtered, key=lambda x: (x[2], -x[3], -x[1]))
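
The filter relies on `is_open_for_at_least`, which is defined outside this section. For readers following along, here is a hypothetical stand-in, assuming `hours` is already a dict in the format seen in this dataset, such as `{'Saturday': '15:30-21:30'}`. It is a sketch under that assumption, not the notebook's own helper.

```python
from datetime import datetime, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def parse_clock(text, day):
    """Turn 'H:M' (e.g. '11:0') into a datetime on the same calendar day as `day`."""
    hour, minute = (int(part) for part in text.split(":"))
    return day.replace(hour=hour, minute=minute, second=0, microsecond=0)

def is_open_for_at_least_sketch(hours, target_dt, duration_minutes=60):
    """True if the business is open at target_dt and stays open for duration_minutes."""
    if not isinstance(hours, dict):
        return False
    span = hours.get(DAY_NAMES[target_dt.weekday()])
    if not span:
        return False  # no entry for that weekday: treat as closed
    open_text, close_text = span.split("-")
    opens = parse_clock(open_text, target_dt)
    closes = parse_clock(close_text, target_dt)
    if closes <= opens:
        closes += timedelta(days=1)  # assume the closing time is past midnight
    leave_at = target_dt + timedelta(minutes=duration_minutes)
    return opens <= target_dt and leave_at <= closes

toy_hours = {"Saturday": "15:30-21:30", "Sunday": "12:15-13:0"}
saturday_afternoon = datetime(2025, 3, 29, 15, 30)
print(is_open_for_at_least_sketch(toy_hours, saturday_afternoon))
```

Treating a missing weekday as closed is a design choice in this sketch; a different default may suit businesses that simply did not list their hours.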

In [88]:
def content_recommend(user_id, user_profiles, business_features, business_ids, top_n=10):
    if user_id not in user_profiles:
        return []
    user_vector = user_profiles[user_id].reshape(1, -1)
    sim_scores = cosine_similarity(user_vector, business_features).flatten()
    top_indices = np.argsort(sim_scores)[::-1][:top_n]
    recommendations = [(business_ids[i], sim_scores[i]) for i in top_indices]
    return recommendations

#### Get top 5 recommendations

In [61]:
recommended_restaurants = content_recommend(sample_user, user_profiles, business_features, business_ids, top_n=50)
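# NOTE: `recommendations` is not the list computed on the line above (`recommended_restaurants`);
# it appears to be a leftover from the earlier CF section. Pass `recommended_restaurants` here
# to post-filter the content-based results instead.
filtered_recs = contextual_post_filtering_cb(recommendations, category_filtered, target_datetime, target_lat, target_lon)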

print("Contextually post-filtered CB recommendations:")
# show only top 5
top_n = 5
for rec in filtered_recs[:top_n]:
    print(f"Business ID: {rec[0]}, Name: {rec[4]}, CF Score: {rec[1]:.2f}, Distance: {rec[2]:.2f} km, Stars: {rec[3]}")

Contextually post-filtered CB recommendations:
Business ID: RVLF2RaStLkJiQCqBHknDw, Name: Mom Mom's Kitchen and Polish Food Cart, CF Score: 4.60, Distance: 1.12 km, Stars: 5.0
Business ID: KTgZXj6xh8aN_tLfI-YZ1Q, Name: Bar Poulet, CF Score: 4.63, Distance: 1.19 km, Stars: 4.5
Business ID: TE2IEDNV0RcI6s1wTOP4fg, Name: Tortilleria San Roman, CF Score: 4.72, Distance: 1.65 km, Stars: 5.0
Business ID: gvD09Ev1aOmphtlq07zYEA, Name: El Rancho Viejo, CF Score: 4.62, Distance: 1.92 km, Stars: 5.0
Business ID: U7HYUH8SqZO6OQMNKCr5kQ, Name: Porco's Porchetteria, CF Score: 4.61, Distance: 2.24 km, Stars: 4.5


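After post-filtering we get the same recommendations as we did with CF plus post-filtering. This is most likely an artifact of the cell above rather than a property of the model: it passes `recommendations` instead of the freshly computed `recommended_restaurants`, and the printed scores (around 4.6) are on the 1-5 rating scale, whereas cosine similarities cannot exceed 1. Passing `recommended_restaurants` would post-filter the content-based list and print similarity scores.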

## Building hybrid recommender

### Basic weighted hybrid recommender

In [73]:
def get_all_cb_scores(user_id, user_profiles, business_features, business_ids):
    if user_id not in user_profiles:
        return {}
    user_vector = user_profiles[user_id].reshape(1, -1)
    sim_scores = cosine_similarity(user_vector, business_features).flatten()
    return {business_ids[i]: sim_scores[i] for i in range(len(business_ids))}

def get_all_cf_scores(user_id, algo, candidate_business_ids, trainset):
    try:
        inner_uid = trainset.to_inner_uid(user_id)
        rated_businesses = {trainset.to_raw_iid(iid) for (iid, _) in trainset.ur[inner_uid]}
    except ValueError:
        rated_businesses = set()

    cf_scores = {}
    for business_id in candidate_business_ids:
        if business_id not in rated_businesses:
            pred = algo.predict(user_id, business_id)
            cf_scores[business_id] = pred.est
    return cf_scores

business_id_to_name = dict(zip(category_filtered['business_id'], category_filtered['name']))

def get_hybrid_recommendations(user_id, algo, user_profiles, business_features, business_ids, trainset, alpha=0.5, top_n=5, business_id_to_name=None):
    if user_id not in user_profiles:
        return []

    cb_scores = get_all_cb_scores(user_id, user_profiles, business_features, business_ids)
    cf_scores = get_all_cf_scores(user_id, algo, business_ids, trainset)

    hybrid_scores = {}
    for b in business_ids:
        score_cb = cb_scores.get(b, 0)
        score_cf = cf_scores.get(b, 0)
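        # CF estimates are on the rating scale while cosine similarities lie in [0, 1],
        # so with alpha=0.5 the CF term dominates unless the two scores are rescaled.
        hybrid_scores[b] = alpha * score_cf + (1 - alpha) * score_cb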

    sorted_hybrid = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
    if business_id_to_name is not None:
        sorted_hybrid = [(bid, business_id_to_name.get(bid, "Unknown"), score) for bid, score in sorted_hybrid]
    return sorted_hybrid

In [74]:
candidate_business_ids = category_filtered['business_id'].unique()

sample_user = list(user_profiles.keys())[0]

recommendations_hybrid = get_hybrid_recommendations(
    sample_user,
    algo,
    user_profiles,
    business_features,
    candidate_business_ids,
    trainset,
    alpha=0.5,
    top_n=5,
    business_id_to_name=business_id_to_name
)

print("Top hybrid recommendations for user", sample_user, ":")
for idx, (bid, name, score) in enumerate(recommendations_hybrid, start=1):
    print(f"{idx}. {name} (Business ID: {bid}, Hybrid Score: {score:.2f})")

Top hybrid recommendations for user ---2PmXbF47D870stH1jqA :
1. Oakleys Bistro (Business ID: idf-eiurCrbsLRcH7c9zmw, Hybrid Score: 2.95)
2. Pho T&N (Business ID: 9Elafazm4QcFRL5VCspPrw, Hybrid Score: 2.94)
3. Root & Bone Indianapolis (Business ID: PjjShgqNvX5Marv-RJ1oNg, Hybrid Score: 2.94)
4. Persis Biryani Indian Grill (Business ID: SwWfW3vBn5QkDE7T3urGAg, Hybrid Score: 2.94)
5. Five Points Pizza - West Nashville (Business ID: LxKrVb6A9vyGdx2gywpUXg, Hybrid Score: 2.93)
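
The hybrid scores above are all close to 2.9 because the weighted sum mixes a rating-scale CF estimate with a cosine similarity that is at most 1. One common remedy, shown here as a sketch and not as the notebook's method, is to min-max scale each score set across the candidates before blending. `min_max_scale`, `blend_scores`, and the toy scores are hypothetical.

```python
def min_max_scale(scores):
    """Rescale a {item: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {item: 0.0 for item in scores}
    return {item: (value - low) / (high - low) for item, value in scores.items()}

def blend_scores(cf_scores, cb_scores, alpha=0.5):
    """Weighted sum of rescaled CF and CB scores; missing scores count as 0."""
    cf_scaled = min_max_scale(cf_scores)
    cb_scaled = min_max_scale(cb_scores)
    items = set(cf_scaled) | set(cb_scaled)
    return {
        item: alpha * cf_scaled.get(item, 0.0) + (1 - alpha) * cb_scaled.get(item, 0.0)
        for item in items
    }

# Toy scores: CF on a 1-5 scale, CB as cosine similarity.
toy_cf = {"a": 4.9, "b": 4.2, "c": 4.0}
toy_cb = {"a": 0.10, "b": 0.60, "c": 0.20}

raw = {item: 0.5 * toy_cf[item] + 0.5 * toy_cb[item] for item in toy_cf}
blended = blend_scores(toy_cf, toy_cb, alpha=0.5)

raw_best = max(raw, key=raw.get)
blended_best = max(blended, key=blended.get)
print("best without scaling:", raw_best)
print("best with scaling:", blended_best)
```

In this toy case the unscaled blend follows the CF ranking, while the scaled blend lets the content-based signal change the winner. Min-max scaling is sensitive to outliers, so other rescalings may work better on real data.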


### Contextual post-filtering

In [75]:
def contextual_post_filtering(recommendations, business_df, target_datetime, target_lat, target_lon):
    filtered = []
    for business_id, name, hybrid_score in recommendations:
        business_info = business_df[business_df['business_id'] == business_id]
        if business_info.empty:
            continue
        business_info = business_info.iloc[0]
        if not is_open_for_at_least(business_info['hours'], target_datetime, duration_minutes=60):
            continue
        dist = haversine_distance(target_lat, target_lon, business_info['latitude'], business_info['longitude'])
        quality = business_info['stars']
        filtered.append((business_id, name, hybrid_score, dist, quality))
    # Sort by distance (asc), then by stars (desc), then hybrid score (desc)
    sorted_filtered = sorted(filtered, key=lambda x: (x[3], -x[4], -x[2]))
    return sorted_filtered

In [77]:
recommendations_hybrid = get_hybrid_recommendations(
    sample_user,
    algo,
    user_profiles,
    business_features,
    candidate_business_ids,
    trainset,
    alpha=0.5,
    top_n=100,
    business_id_to_name=business_id_to_name
)

#### Get top 5 recommendations

In [79]:
target_datetime = datetime.strptime("2025-03-29 15:30", "%Y-%m-%d %H:%M")
target_lat, target_lon = 39.95243, -75.16021

final_recommendations = contextual_post_filtering(
    recommendations_hybrid,
    category_filtered,
    target_datetime,
    target_lat,
    target_lon
)

# print only top 5
final_recommendations = final_recommendations[:5]
print("Final recommendations after contextual post-filtering:")
for rec in final_recommendations:
    print(f"Business ID: {rec[0]} | Name: {rec[1]} | Hybrid Score: {rec[2]:.2f} | Distance: {rec[3]:.2f} km | Stars: {rec[4]}")

Final recommendations after contextual post-filtering:
Business ID: 0OsR9lO16jxa0xWUY57s9g | Name: Koto Sushi | Hybrid Score: 2.91 | Distance: 0.73 km | Stars: 4.5
Business ID: V4Dr3ragKHKeUab96miyMA | Name: Mole Poblano Restaurant | Hybrid Score: 2.92 | Distance: 1.92 km | Stars: 4.5
Business ID: TKPAyOWcexkpVHPCdYTNmQ | Name: Spuntino Wood Fired Pizza | Hybrid Score: 2.91 | Distance: 1.96 km | Stars: 4.5
Business ID: hxPnlWZmirx7neooZykmtg | Name: Sutton's | Hybrid Score: 2.91 | Distance: 3.09 km | Stars: 5.0
Business ID: utk0M6RZXbtTYTbip40Bdw | Name: Meskerem Ethiopian Restaurant | Hybrid Score: 2.91 | Distance: 4.38 km | Stars: 4.5
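
The distances come from `haversine_distance`, which is defined outside this section. As a sanity check, the sketch below implements the standard haversine formula with a mean Earth radius of 6371 km (an assumption; the notebook's helper may use a slightly different constant) and applies it to the target location and Koto Sushi's coordinates from the business table. `haversine_km` is a hypothetical name.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Target location used above, and Koto Sushi's latitude/longitude.
distance_to_koto = haversine_km(39.95243, -75.16021, 39.948705, -75.15316)
print(f"{distance_to_koto:.2f} km")
```

The result should land at roughly 0.73 km, in line with the distance printed for Koto Sushi.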


In [84]:
# Let's inspect the restaurants from our final recommendations in a clean format.
business_df[business_df['business_id'].isin([x[0] for x in final_recommendations])]

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
42584,0OsR9lO16jxa0xWUY57s9g,Koto Sushi,719 Sansom St,Philadelphia,PA,19106,39.948705,-75.15316,4.5,280,1,"{'GoodForKids': 'True', 'RestaurantsDelivery': 'True', 'BikeParking': 'True', 'Caters': 'True', 'RestaurantsAttire': 'u'casual'', 'RestaurantsTakeOut': 'True', 'GoodForMeal': '{'dessert': None, 'l...","Japanese, Restaurants, Salad, Sushi Bars","{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0', 'Wednesday': '11:0-21:0', 'Thursday': '11:0-21:0', 'Friday': '11:0-21:30', 'Saturday': '15:30-21:30', 'Sunday': '15:30-21:0'}"
42829,V4Dr3ragKHKeUab96miyMA,Mole Poblano Restaurant,1144 S 9th St,Philadelphia,PA,19147,39.93522,-75.158815,4.5,149,1,"{'BikeParking': 'True', 'HasTV': 'True', 'RestaurantsAttire': 'u'casual'', 'Alcohol': 'u'none'', 'RestaurantsPriceRange2': '2', 'BusinessAcceptsCreditCards': 'True', 'WiFi': 'u'no'', 'NoiseLevel':...","Mexican, Restaurants","{'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0', 'Wednesday': '10:0-22:0', 'Thursday': '10:0-22:0', 'Friday': '10:0-22:0', 'Saturday': '9:0-22:0', 'Sunday': '9:0-22:0'}"
64279,hxPnlWZmirx7neooZykmtg,Sutton's,1706 N 5th St,Philadelphia,PA,19122,39.977159,-75.143673,5.0,74,1,"{'RestaurantsDelivery': 'False', 'BikeParking': 'True', 'WheelchairAccessible': 'True', 'BusinessAcceptsCreditCards': 'True', 'Ambience': '{'touristy': False, 'hipster': None, 'romantic': True, 'd...","American (Traditional), Cocktail Bars, Restaurants, Nightlife, Bars","{'Wednesday': '16:0-22:0', 'Thursday': '16:0-22:0', 'Friday': '14:0-23:0', 'Saturday': '13:0-23:0', 'Sunday': '12:15-13:0'}"
106445,utk0M6RZXbtTYTbip40Bdw,Meskerem Ethiopian Restaurant,225 S 45th St,Philadelphia,PA,19104,39.954276,-75.211595,4.5,105,1,"{'RestaurantsAttire': 'u'casual'', 'Caters': 'True', 'HasTV': 'True', 'RestaurantsTableService': 'True', 'WiFi': 'u'no'', 'DogsAllowed': 'False', 'HappyHour': 'False', 'Alcohol': 'u'none'', 'BikeP...","Ethiopian, Restaurants","{'Monday': '12:0-22:0', 'Tuesday': '12:0-22:0', 'Wednesday': '12:0-22:0', 'Thursday': '12:0-22:0', 'Friday': '12:0-22:0', 'Saturday': '12:0-22:0', 'Sunday': '12:0-22:0'}"
150085,TKPAyOWcexkpVHPCdYTNmQ,Spuntino Wood Fired Pizza,701 N 2nd St,Philadelphia,PA,19123,39.962006,-75.14095,4.5,209,1,"{'RestaurantsGoodForGroups': 'True', 'RestaurantsPriceRange2': '2', 'BikeParking': 'True', 'RestaurantsTableService': 'True', 'NoiseLevel': 'u'average'', 'Ambience': '{'touristy': False, 'hipster'...","Pizza, Restaurants","{'Tuesday': '12:0-21:0', 'Wednesday': '12:0-21:0', 'Thursday': '12:0-21:0', 'Friday': '12:0-21:0', 'Saturday': '12:0-21:0'}"


Now suppose the user wants a specific type of restaurant. We can add a category constraint to the post-filtering step so that the results match the request. For example, imagine our user wants to try local cuisine and is looking specifically for American restaurants.

In [94]:
def contextual_post_filtering(recommendations, business_df, target_datetime, target_lat, target_lon):
    filtered = []
    for business_id, name, hybrid_score in recommendations:
        business_info = business_df[business_df['business_id'] == business_id]
        if business_info.empty:
            continue
        business_info = business_info.iloc[0]

        # Filter out restaurants that do not have "american" in their categories.
        if 'american' not in str(business_info['categories']).lower():
            continue

        if not is_open_for_at_least(business_info['hours'], target_datetime, duration_minutes=60):
            continue

        dist = haversine_distance(target_lat, target_lon, business_info['latitude'], business_info['longitude'])
        quality = business_info['stars']
        filtered.append((business_id, name, hybrid_score, dist, quality))

    sorted_filtered = sorted(filtered, key=lambda x: (x[3], -x[4], -x[2]))
    return sorted_filtered

In [95]:
recommendations_hybrid = get_hybrid_recommendations(
    sample_user,
    algo,
    user_profiles,
    business_features,
    candidate_business_ids,
    trainset,
    alpha=0.5,
    top_n=100,
    business_id_to_name=business_id_to_name
)

In [96]:
target_datetime = datetime.strptime("2025-03-29 15:30", "%Y-%m-%d %H:%M")
target_lat, target_lon = 39.95243, -75.16021

final_recommendations = contextual_post_filtering(
    recommendations_hybrid,
    category_filtered,
    target_datetime,
    target_lat,
    target_lon
)

# print only top 5
final_recommendations = final_recommendations[:5]
print("Final recommendations after contextual post-filtering:")
for rec in final_recommendations:
    print(f"Business ID: {rec[0]} | Name: {rec[1]} | Hybrid Score: {rec[2]:.2f} | Distance: {rec[3]:.2f} km | Stars: {rec[4]}")

Final recommendations after contextual post-filtering:
Business ID: hxPnlWZmirx7neooZykmtg | Name: Sutton's | Hybrid Score: 2.91 | Distance: 3.09 km | Stars: 5.0
Business ID: 2EVCpJTmmjFeYzxjQnnaeg | Name: Maritsas Cuisine | Hybrid Score: 2.91 | Distance: 13.63 km | Stars: 4.5
Business ID: 8x5RiYBEuThT3KIqiBniUw | Name: Zinc Cafe | Hybrid Score: 2.91 | Distance: 31.35 km | Stars: 4.0
Business ID: n_zDBVMcBFU1bCOERIQ6HQ | Name: Ristorante Denicola | Hybrid Score: 2.91 | Distance: 31.80 km | Stars: 4.0
Business ID: F32Z7yfgklHbiumiI-EDZg | Name: Pats Select Pizza & Grill | Hybrid Score: 2.91 | Distance: 31.89 km | Stars: 4.5


In [97]:
# inspect
business_df[business_df['business_id'].isin([x[0] for x in final_recommendations])]

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
64279,hxPnlWZmirx7neooZykmtg,Sutton's,1706 N 5th St,Philadelphia,PA,19122,39.977159,-75.143673,5.0,74,1,"{'RestaurantsDelivery': 'False', 'BikeParking': 'True', 'WheelchairAccessible': 'True', 'BusinessAcceptsCreditCards': 'True', 'Ambience': '{'touristy': False, 'hipster': None, 'romantic': True, 'd...","American (Traditional), Cocktail Bars, Restaurants, Nightlife, Bars","{'Wednesday': '16:0-22:0', 'Thursday': '16:0-22:0', 'Friday': '14:0-23:0', 'Saturday': '13:0-23:0', 'Sunday': '12:15-13:0'}"
89329,8x5RiYBEuThT3KIqiBniUw,Zinc Cafe,679 Stokes Rd,Medford,NJ,8055,39.868752,-74.809164,4.0,198,1,"{'OutdoorSeating': 'True', 'RestaurantsTakeOut': 'True', 'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'Alcohol': 'u'none'', 'RestaurantsReservations': 'True', 'GoodForKids'...","American (New), Restaurants","{'Tuesday': '11:30-19:45', 'Wednesday': '11:30-19:45', 'Thursday': '11:30-19:45', 'Friday': '11:30-19:45', 'Saturday': '11:30-19:45', 'Sunday': '8:0-13:45'}"
105601,F32Z7yfgklHbiumiI-EDZg,Pats Select Pizza & Grill,855 Easton Rd,Warrington,PA,18976,40.238645,-75.135768,4.5,126,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsDelivery': 'True', 'Ambience': '{'touristy': False, 'hipster': False, 'romantic': False, 'divey': False, 'intimate': False, 'trendy': False, 'ups...","American (Traditional), Restaurants, Italian, Nightlife, Cocktail Bars, Pizza, Soup, Bars","{'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0', 'Wednesday': '10:0-22:0', 'Thursday': '10:0-22:0', 'Friday': '10:0-23:0', 'Saturday': '10:0-23:0', 'Sunday': '10:0-22:0'}"
107969,n_zDBVMcBFU1bCOERIQ6HQ,Ristorante Denicola,"130 Almshouse Rd, Ste 405",Richboro,PA,18954,40.216415,-75.016586,4.0,50,1,"{'NoiseLevel': 'u'average'', 'Alcohol': 'u'none'', 'RestaurantsTakeOut': 'True', 'Caters': 'True', 'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'RestaurantsAttire': 'u'casu...","Italian, American (New), Restaurants","{'Tuesday': '11:0-21:30', 'Wednesday': '11:0-21:30', 'Thursday': '11:0-21:30', 'Friday': '11:0-22:30', 'Saturday': '11:0-22:30', 'Sunday': '11:0-21:30'}"
132623,2EVCpJTmmjFeYzxjQnnaeg,Maritsas Cuisine,106 E Main St,Maple Shade,NJ,8052,39.955311,-75.000301,4.5,54,1,"{'WiFi': ''no'', 'RestaurantsPriceRange2': '2', 'NoiseLevel': 'u'average'', 'Caters': 'True', 'RestaurantsGoodForGroups': 'True', 'GoodForKids': 'True', 'BusinessAcceptsCreditCards': 'True', 'Rest...","American (New), Restaurants","{'Monday': '7:0-14:0', 'Tuesday': '7:0-20:0', 'Wednesday': '7:0-20:0', 'Thursday': '7:0-20:0', 'Friday': '7:0-20:0', 'Saturday': '7:0-20:0', 'Sunday': '7:0-14:0'}"


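Now the recommender returns only American restaurants, which is exactly what we wanted. Note the trade-off: after the first result, the remaining matches are 13 to 32 km away, because only a few of the 100 hybrid candidates are both American and open at the target time.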
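
The substring test `'american' in categories.lower()` would also match any label that merely contains the word, for example a hypothetical "Latin American" label. If that is not intended, a stricter check can compare whole labels, ignoring a parenthesized qualifier such as "(New)". The helper below is a sketch of that idea; `has_category` is a hypothetical name, and the comma-separated format is taken from the `categories` column shown above.

```python
def has_category(categories, wanted):
    """True if one comma-separated label equals `wanted`, ignoring case and a '(...)' suffix."""
    if not isinstance(categories, str):
        return False
    wanted = wanted.strip().lower()
    for label in categories.split(","):
        label = label.strip().lower()
        base = label.split("(")[0].strip()  # "american (new)" -> "american"
        if wanted in (label, base):
            return True
    return False

print(has_category("American (Traditional), Cocktail Bars, Restaurants", "american"))
print(has_category("Latin American, Restaurants", "american"))
```

Passing the wanted category as an argument to the post-filtering function, rather than hard-coding it, would also avoid redefining the function for each request.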