<a href="https://colab.research.google.com/github/michp15/Big_Data/blob/main/recommendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendation System

## 1. Project Setup & Environment

In [None]:
# Install & configure dependencies
!pip install pyspark
!pip install graphframes



In [None]:
# spark functionality
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, lit, year, month, split, explode, size, regexp_replace, rand
from pyspark.sql.functions import when, least, greatest, min, max
from pyspark.ml.recommendation import ALS # collaborative filtering
from pyspark.ml.feature import StringIndexer # encodes string IDs as numeric indices
from pyspark.ml.evaluation import RegressionEvaluator # for evaluation
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit # hyperparameter tuning

# for visualizations
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# initialize SparkSession with GraphFrames support
spark = SparkSession.builder \
    .appName("YelpEDA_GraphAnalysis") \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.1-s_2.12") \
    .getOrCreate()

# verify the session
spark


## 2. Data Ingestion

In [None]:
# loading only columns we need
reviews_df = (spark.read.json("/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json")
                 .select("user_id", "business_id", "stars", "date"))

reviews_df.show()

                                                                                                    

+--------------------+--------------------+-----+-------------------+
|             user_id|         business_id|stars|               date|
+--------------------+--------------------+-----+-------------------+
|mh_-eMZ6K5RLWhZyI...|XQfwVwDr-v0ZS3_Cb...|  3.0|2018-07-07 22:09:11|
|OyoGAe7OKpv6SyGZT...|7ATYjTIgM3jUlt4UM...|  5.0|2012-01-03 15:28:18|
|8g_iMtfSiwikVnbP2...|YjUWPpI6HXG530lwP...|  3.0|2014-02-05 20:30:30|
|_7bHUi9Uuf5__HHc_...|kxX2SOes4o-D3ZQBk...|  5.0|2015-01-04 00:01:03|
|bcjbaE6dDog4jkNY9...|e4Vwtrqf-wpJfwesg...|  4.0|2017-01-14 20:54:15|
|eUta8W_HdHMXPzLBB...|04UD14gamNjLY0IDY...|  1.0|2015-09-23 23:10:31|
|r3zeYsv1XFBRA4dJp...|gmjsEdUsKpj9Xxu6p...|  5.0|2015-01-03 23:21:18|
|yfFzsLmaWF2d4Sr0U...|LHSTtnW3YHCeUkRDG...|  5.0|2015-08-07 02:29:16|
|wSTuiTk-sKNdcFypr...|B5XSoSG3SfvQGtKEG...|  3.0|2016-03-30 22:46:33|
|59MxRhNVhU9MYndMk...|gebiRewfieSdtt17P...|  3.0|2016-07-25 07:31:06|
|1WHRWwQmZOZDAhp2Q...|uMvVYRgGNXf5boolA...|  5.0|2015-06-21 14:48:06|
|ZbqSHbgCjzVAqaa7N..

In [None]:
print("The number of rows: ", reviews_df.count())



The number of rows:  6990280


                                                                                                    

In [None]:
# we need business df to filter out closed businesses
business_df = (spark.read.json("/kaggle/input/yelp-dataset/yelp_academic_dataset_business.json")
                 .select("business_id", "name", "categories", "is_open"))

business_df.show()

                                                                                                    

+--------------------+--------------------+--------------------+-------+
|         business_id|                name|          categories|is_open|
+--------------------+--------------------+--------------------+-------+
|Pns2l4eNsfO8kk83d...|Abby Rappoport, L...|Doctors, Traditio...|      0|
|mpf3x-BjTdTEA3yCZ...|       The UPS Store|Shipping Centers,...|      1|
|tUFrWirKiKi_TAnsV...|              Target|Department Stores...|      0|
|MTSW4McQd7CbVtyjq...|  St Honore Pastries|Restaurants, Food...|      1|
|mWMc6_wTdE0EUBKIG...|Perkiomen Valley ...|Brewpubs, Breweri...|      1|
|CF33F8-E6oudUQ46H...|      Sonic Drive-In|Burgers, Fast Foo...|      1|
|n_0UpQx1hsNbnPUSl...|     Famous Footwear|Sporting Goods, F...|      1|
|qkRM_2X51Yqxk3btl...|      Temple Beth-El|Synagogues, Relig...|      1|
|k0hlBqXX-Bt0vf1op...|Tsevi's Pub And G...|Pubs, Restaurants...|      0|
|bBDDEgkFA1Otx9Lfe...|      Sonic Drive-In|Ice Cream & Froze...|      1|
|UJsufbvfyfONHeWdv...|           Marshalls|Departme

## 3. Preprocessing

In [None]:
# filter closed businesses

# keep businesses that are currently open
open_biz = business_df.filter(col("is_open") == 1).select("business_id")

# drop reviews that point to closed places by joining with `open_biz`
reviews_df = reviews_df.join(open_biz, "business_id")

print("The number of reviews after filtering: ", reviews_df.count())

The number of reviews after filtering:  5791234

Dropping reviews of closed businesses reduces the dataset from 6,990,280 to 5,791,234 reviews.
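
The inner join above works as a filter: a review survives only if its `business_id` appears in `open_biz`. Below is a minimal plain-Python sketch of that idea. The rows are made-up toy data, not taken from the Yelp dataset, and the names `open_ids` and `kept_reviews` are purely illustrative.

```python
# Hypothetical toy rows; the real data lives in Spark DataFrames.
businesses = [
    {"business_id": "b1", "is_open": 1},
    {"business_id": "b2", "is_open": 0},
    {"business_id": "b3", "is_open": 1},
]
reviews = [
    {"user_id": "u1", "business_id": "b1", "stars": 5.0},
    {"user_id": "u2", "business_id": "b2", "stars": 1.0},
    {"user_id": "u1", "business_id": "b3", "stars": 4.0},
    {"user_id": "u3", "business_id": "b2", "stars": 3.0},
]

# IDs of businesses that are still open
open_ids = {b["business_id"] for b in businesses if b["is_open"] == 1}

# keep only the reviews that point to an open business
kept_reviews = [r for r in reviews if r["business_id"] in open_ids]

print(len(reviews), "->", len(kept_reviews))
```

In Spark, passing `how="left_semi"` to `join` would express the same filtering intent explicitly; the default inner join gives the same rows here because `open_biz` holds only the `business_id` column.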

In [None]:
# filter very sparse users and businesses to reduce matrix size

min_user_ratings = 10 # at least 10 ratings from the user
min_biz_ratings = 20 # at least 20 ratings for the business

active_users = (reviews_df.groupBy("user_id").count()
                         .filter(col("count") >= min_user_ratings))

popular_biz  = (reviews_df.groupBy("business_id").count()
                         .filter(col("count") >= min_biz_ratings))

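# note: both counts were computed before either filter was applied, and each join
# carries its table's `count` column along, so the result has two `count` columns
reviews_df = (reviews_df.join(active_users, "user_id")
                  .join(popular_biz, "business_id"))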

print("The number of reviews after removing non-actives: ", reviews_df.count())

The number of reviews after removing non-actives:  2219427

After removing sparse users and businesses, **2,219,427** reviews remain.
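
One caveat of the filter above: the user and business counts are taken on the unfiltered reviews, so after both joins some users may end up with fewer than `min_user_ratings` reviews among the businesses that remain. The plain-Python sketch below illustrates this on made-up toy pairs; `filter_once` and `filter_until_stable` are hypothetical helper names, and repeating the filter until nothing changes is one possible remedy, not something the notebook does.

```python
from collections import Counter

def filter_once(reviews, min_user, min_biz):
    """Single pass: thresholds are checked against counts on the input as given."""
    user_counts = Counter(u for u, _ in reviews)
    biz_counts = Counter(b for _, b in reviews)
    return [(u, b) for u, b in reviews
            if user_counts[u] >= min_user and biz_counts[b] >= min_biz]

def filter_until_stable(reviews, min_user, min_biz):
    """Repeat the single pass until no more reviews are removed."""
    current = list(reviews)
    while True:
        filtered = filter_once(current, min_user, min_biz)
        if len(filtered) == len(current):
            return filtered
        current = filtered

# hypothetical toy data: (user_id, business_id)
toy_reviews = [("u1", "b1"), ("u1", "b2"), ("u2", "b1"), ("u2", "b3"), ("u3", "b1")]

kept = filter_once(toy_reviews, min_user=2, min_biz=2)
remaining_per_user = Counter(u for u, _ in kept)
print(kept)                # reviews that pass the single pass
print(remaining_per_user)  # u1 and u2 now have only one review each
print(filter_until_stable(toy_reviews, min_user=2, min_biz=2))
```

On real data the effect is usually much milder than in this toy case, where the repeated filter removes everything.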

In [None]:
# map string IDs to integer indices (required by ALS with collaborative filtering)

user_indexer = StringIndexer(inputCol="user_id", outputCol="uid").fit(reviews_df) # save indexed column as uid
user_indexed = user_indexer.transform(reviews_df)

biz_indexer = StringIndexer(inputCol="business_id", outputCol="bid").fit(reviews_df) # save indexed column as bid
data = biz_indexer.transform(user_indexed)

data.show()


+--------------------+--------------------+-----+-------------------+-----+-----+-------+-------+
|         business_id|             user_id|stars|               date|count|count|    uid|    bid|
+--------------------+--------------------+-----+-------------------+-----+-----+-------+-------+
|---kPU91CF4Lq2-Wl...|6SoUQtbIltsun0IIG...|  4.0|2021-11-28 16:40:02|   20|   24|30418.0|45749.0|
|---kPU91CF4Lq2-Wl...|qrCkKrEwQ-q9m1iWS...|  5.0|2020-03-18 01:34:18|   59|   24| 6617.0|45749.0|
|---kPU91CF4Lq2-Wl...|i48cHEyRBl5g9_npY...|  4.0|2020-06-04 12:27:25|  172|   24| 1427.0|45749.0|
|---kPU91CF4Lq2-Wl...|Q-ia5eY9smWBTwYOZ...|  5.0|2020-10-02 23:01:22|   15|   24|65789.0|45749.0|
|---kPU91CF4Lq2-Wl...|V8oYXtc0hMuYzG5Hf...|  3.0|2021-03-06 01:39:34|   12|   24|59056.0|45749.0|
|---kPU91CF4Lq2-Wl...|TIx1jZXl57mY-JnS3...|  5.0|2021-10-19 01:16:48|   98|   24| 2926.0|45749.0|
|--9osgUCSDUWUkoTL...|NXnWmsyvBx8hjmCTF...|  5.0|2019-01-06 01:26:39|  174|   30| 1062.0|40783.0|
|--9osgUCSDUWUkoTL..
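
To make the `uid` and `bid` columns less opaque, here is a rough plain-Python sketch of what the indexing step does, assuming `StringIndexer`'s default ordering (most frequent label gets index 0). The alphabetical tie-break and the helper name `fit_string_index` are assumptions made for this illustration, and the IDs are made up.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent label first."""
    counts = Counter(values)
    # sort by descending frequency; ties are broken alphabetically (assumed)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

# hypothetical user IDs, one entry per review
toy_user_ids = ["u_b", "u_a", "u_b", "u_c", "u_b", "u_a"]

mapping = fit_string_index(toy_user_ids)
indexed = [mapping[u] for u in toy_user_ids]
print(mapping)
print(indexed)
```

This matches the pattern in the table above, where users with many reviews tend to receive small `uid` values.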

                                                                                                    

### 3.1 Splitting Data

In [None]:
train, test = data.randomSplit([0.8, 0.2], seed=42)
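
`randomSplit` assigns each row to a split at random according to the weights, so the 80/20 proportions are approximate rather than exact, and the same seed reproduces the same split only when the input rows and their partitioning are unchanged. The sketch below mimics that behaviour in plain Python; `random_split` is a hypothetical helper written for illustration, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign every row to one split by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # cumulative upper bounds of each split's share, e.g. [0.8, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # the last split also catches any floating-point remainder
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

toy_rows = list(range(1000))
toy_train, toy_test = random_split(toy_rows, [0.8, 0.2], seed=42)
print(len(toy_train), len(toy_test))  # close to 800 / 200, not exactly
```

Because the split is per row, some users or businesses in the test set may never appear in the training set, which is why `coldStartStrategy="drop"` matters later on.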

## 4. Collaborative Filtering with ALS

### 4.1 Training Model

In [None]:
als_default = ALS(
    rank=10, # num of latent factors (default=10)
    regParam=0.1, # regularization parameter to avoid overfitting (default=0.1)
    maxIter=10, # num of ALS iterations (default=10)
    userCol="uid", # user id
    itemCol="bid", # business id
    ratingCol="stars", # based on the stars
    implicitPrefs=False, # treat stars as explicit feedback, not implicit
    coldStartStrategy="drop"  # avoids NaNs in predictions
)

# fit the model on the training set
model_default = als_default.fit(train)

# evaluate the model
preds_def = model_default.transform(test)
evaluator = RegressionEvaluator(
    metricName="rmse", # root-mean-square error metric
    labelCol="stars", # true rating column
    predictionCol="prediction" # predicted rating column
)
rmse_def = evaluator.evaluate(preds_def)
print(f"Default ALS RMSE = {rmse_def:.4f}")


Default ALS RMSE = 1.2767
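
For intuition about what the fitted model contains: ALS learns one latent vector of length `rank` per user and per business, and a predicted rating is the dot product of the two. The sketch below uses made-up factor values with `rank=3`; `predict_rating` is a hypothetical helper, not part of the PySpark API.

```python
def predict_rating(user_factors, item_factors):
    """Predicted rating = dot product of the user and item latent vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

# hypothetical latent factors for one user and one business (rank = 3)
user_vec = [1.0, 0.5, 2.0]
item_vec = [2.0, 1.0, 0.5]

pred = predict_rating(user_vec, item_vec)
print(pred)  # 1.0*2.0 + 0.5*1.0 + 2.0*0.5 = 3.5
```

Nothing in this product constrains the result to the 1 to 5 star range, so raw ALS predictions can fall outside it.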


                                                                                                    

### 4.2 Hyperparameter Tuning

In [None]:
from time import time

param_grid = [
    (rank, reg, iters)
    for rank in [20, 50, 100]
    for reg  in [0.05, 0.1, 0.5]
    for iters in [5, 10]
]

best_rmse = float("inf")
best_model = None

train_sub, val_sub = train.randomSplit([0.9, 0.1], seed=42)

for rank, reg, iters in param_grid:
    print(f"\nTraining rank={rank}, regParam={reg}, maxIter={iters} ...")
    start = time()


    als = ALS(
        userCol="uid", itemCol="bid", ratingCol="stars",
        rank=rank, regParam=reg, maxIter=iters,
        nonnegative=True, implicitPrefs=False,
        coldStartStrategy="drop"
    )

    model = als.fit(train_sub)
    rmse  = evaluator.evaluate(model.transform(val_sub))
    duration = time() - start

    print(f"Finished in {duration:.1f}s, RMSE={rmse:.4f}")

    if rmse < best_rmse:
        best_rmse, best_model = rmse, model

print(f"\n Best RMSE={best_rmse:.4f}")




Training rank=20, regParam=0.05, maxIter=5 ...
Finished in 411.7s, RMSE=1.2537

Training rank=20, regParam=0.05, maxIter=10 ...
Finished in 298.8s, RMSE=1.2478

Training rank=20, regParam=0.1, maxIter=5 ...
Finished in 259.0s, RMSE=1.2055

Training rank=20, regParam=0.1, maxIter=10 ...
Finished in 290.4s, RMSE=1.2007

Training rank=20, regParam=0.5, maxIter=5 ...
Finished in 232.5s, RMSE=1.2154

Training rank=20, regParam=0.5, maxIter=10 ...
Finished in 276.9s, RMSE=1.2086

Training rank=50, regParam=0.05, maxIter=5 ...
Finished in 293.5s, RMSE=1.2148

Training rank=50, regParam=0.05, maxIter=10 ...
Finished in 395.5s, RMSE=1.1961

Training rank=50, regParam=0.1, maxIter=5 ...
Finished in 277.3s, RMSE=1.1835

Training rank=50, regParam=0.1, maxIter=10 ...
Finished in 369.4s, RMSE=1.1750

Training rank=50, regParam=0.5, maxIter=5 ...
Finished in 253.7s, RMSE=1.2167

Training rank=50, regParam=0.5, maxIter=10 ...
Finished in 315.3s, RMSE=1.2086

Training rank=100, regParam=0.05, maxIter=5 ...
Finished in 1548.8s, RMSE=1.1978

Training rank=100, regParam=0.05, maxIter=10 ...
Finished in 2619.7s, RMSE=1.1781

Training rank=100, regParam=0.1, maxIter=5 ...
Finished in 1182.0s, RMSE=1.1743

Training rank=100, regParam=0.1, maxIter=10 ...
Finished in 2072.6s, RMSE=1.1656

Training rank=100, regParam=0.5, maxIter=5 ...
Finished in 673.8s, RMSE=1.2173

Training rank=100, regParam=0.5, maxIter=10 ...
Finished in 1114.9s, RMSE=1.2087

 Best RMSE=1.1656


                                                                                                    

In [None]:
# extract the best model and its params
print(">>> Best rank:", best_model._java_obj.parent().getRank())
print(">>> Best regParam:", best_model._java_obj.parent().getRegParam())
print(">>> Best maxIter:", best_model._java_obj.parent().getMaxIter())

>>> Best rank: 100
>>> Best regParam: 0.1
>>> Best maxIter: 10
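
An alternative to reading the parameters back through the private `_java_obj` handle is to record each parameter combination next to its score and take the minimum. The sketch below does this with the validation RMSE values printed by the tuning loop above, listed in loop order; the names `results` and `best_params` are illustrative.

```python
from itertools import product

ranks = [20, 50, 100]
reg_params = [0.05, 0.1, 0.5]
max_iters = [5, 10]

# validation RMSE values copied from the tuning log, in loop order
logged_rmse = [
    1.2537, 1.2478, 1.2055, 1.2007, 1.2154, 1.2086,  # rank=20
    1.2148, 1.1961, 1.1835, 1.1750, 1.2167, 1.2086,  # rank=50
    1.1978, 1.1781, 1.1743, 1.1656, 1.2173, 1.2087,  # rank=100
]

# product() iterates rank first, then regParam, then maxIter, like the loop
results = dict(zip(product(ranks, reg_params, max_iters), logged_rmse))

best_params = min(results, key=results.get)
best_score = results[best_params]
print("best (rank, regParam, maxIter):", best_params, "RMSE:", best_score)
```

Inside the loop itself, the same effect could be had by storing `(rank, reg, iters)` together with `best_rmse` whenever a new best model is found.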


### 4.3 Model Evaluation

In [None]:
als_best = ALS(
    rank=100, # num of latent factors (default=10)
    regParam=0.1, # regularization parameter to avoid overfitting (default=0.1)
    maxIter=10, # num of ALS iterations (default=10)
    userCol="uid", # user id
    itemCol="bid", # business id
    ratingCol="stars", # based on the stars
    implicitPrefs=False, # treat stars as explicit feedback, not implicit
    coldStartStrategy="drop"  # drop rows with NaN predictions (users/businesses unseen in training)
)

# refit on the training set using the best hyperparameters
best_model = als_best.fit(train)


In [None]:
evaluator = RegressionEvaluator(
    metricName="rmse", # root-mean-square error metric
    labelCol="stars", # true rating column
    predictionCol="prediction" # predicted rating column
)

In [None]:
# evaluate the best model on the held-out test set
preds_test = best_model.transform(test)
test_rmse = evaluator.evaluate(preds_test)
print(f"Test RMSE of best model = {test_rmse:.4f}")


Test RMSE of best model = 1.1902
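
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it in plain Python on made-up ratings (the numbers are illustrative and not taken from the Yelp data) and compares a model against a constant mean-rating baseline, which is one common way to judge whether an RMSE value is good. The function and variable names are hypothetical. Because the notebook's imports shadow Python's built-in `sum` with the Spark column function, the sketch uses `math.fsum` instead.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(math.fsum(squared_errors) / len(squared_errors))

# toy data (hypothetical, for illustration only)
true_stars = [5.0, 4.0, 1.0, 3.0]
model_preds = [4.5, 3.5, 2.0, 3.0]

# baseline: predict the same mean rating for every review
# (in practice the mean would come from the training set)
mean_star = math.fsum(true_stars) / len(true_stars)
baseline_preds = [mean_star] * len(true_stars)

model_rmse = rmse(true_stars, model_preds)
baseline_rmse = rmse(true_stars, baseline_preds)
print(f"model RMSE={model_rmse:.4f}, baseline RMSE={baseline_rmse:.4f}")
```

A model is only useful if its RMSE is clearly below that of such a baseline; the same comparison could be made for the ALS model by evaluating a constant prediction on `test`.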



### 4.4 Save Model

In [None]:
# save the best model to disk
best_model.write().overwrite().save("als_best_model")
print("Best ALS model saved to 'als_best_model'")


Best ALS model saved to 'als_best_model'



In [None]:
!zip -r model.zip als_best_model

  adding: als_best_model/ (stored 0%)
  adding: als_best_model/metadata/ (stored 0%)
  adding: als_best_model/metadata/part-00000 (deflated 38%)
  adding: als_best_model/metadata/.part-00000.crc (stored 0%)
  adding: als_best_model/metadata/_SUCCESS (stored 0%)
  adding: als_best_model/metadata/._SUCCESS.crc (stored 0%)
  adding: als_best_model/itemFactors/ (stored 0%)
  adding: als_best_model/itemFactors/.part-00002-0a650b85-cab2-491d-abd0-c74cfb35adf2-c000.snappy.parquet.crc (stored 0%)
  adding: als_best_model/itemFactors/.part-00005-0a650b85-cab2-491d-abd0-c74cfb35adf2-c000.snappy.parquet.crc (stored 0%)
  adding: als_best_model/itemFactors/.part-00001-0a650b85-cab2-491d-abd0-c74cfb35adf2-c000.snappy.parquet.crc (stored 0%)
  adding: als_best_model/itemFactors/.part-00006-0a650b85-cab2-491d-abd0-c74cfb35adf2-c000.snappy.parquet.crc (stored 0%)
  adding: als_best_model/itemFactors/part-00001-0a650b85-cab2-491d-abd0-c74cfb35adf2-c000.snappy.parquet (deflated 8%)
  adding: als_best_mo

### 4.5 Model Usage

#### 4.5.1 Business Recommendations for Users

In [None]:
data.show()


+--------------------+--------------------+-----+-------------------+-----+-----+-------+-------+
|         business_id|             user_id|stars|               date|count|count|    uid|    bid|
+--------------------+--------------------+-----+-------------------+-----+-----+-------+-------+
|---kPU91CF4Lq2-Wl...|6SoUQtbIltsun0IIG...|  4.0|2021-11-28 16:40:02|   20|   24|30418.0|45749.0|
|---kPU91CF4Lq2-Wl...|qrCkKrEwQ-q9m1iWS...|  5.0|2020-03-18 01:34:18|   59|   24| 6617.0|45749.0|
|---kPU91CF4Lq2-Wl...|i48cHEyRBl5g9_npY...|  4.0|2020-06-04 12:27:25|  172|   24| 1427.0|45749.0|
|---kPU91CF4Lq2-Wl...|Q-ia5eY9smWBTwYOZ...|  5.0|2020-10-02 23:01:22|   15|   24|65789.0|45749.0|
|---kPU91CF4Lq2-Wl...|V8oYXtc0hMuYzG5Hf...|  3.0|2021-03-06 01:39:34|   12|   24|59056.0|45749.0|
|---kPU91CF4Lq2-Wl...|TIx1jZXl57mY-JnS3...|  5.0|2021-10-19 01:16:48|   98|   24| 2926.0|45749.0|
|--9osgUCSDUWUkoTL...|NXnWmsyvBx8hjmCTF...|  5.0|2019-01-06 01:26:39|  174|   30| 1062.0|40783.0|
|--9osgUCSDUWUkoTL..


In [None]:
# demonstrate recommendations for a random user

# pick a random original user_id and its uid
random_user = data.select("user_id","uid") \
             .distinct() \
             .orderBy(rand()) \
             .limit(1) \
             .collect()[0]

random_user_id, random_uid = random_user["user_id"], random_user["uid"]
print(f"Random user: user_id={random_user_id}, uid={random_uid}")



Random user: user_id=RIXIoCTafjr1auUJkV3fLg, uid=6344.0



In [None]:
# join to have the business info to display
reviews_with_meta = data.join(business_df, on="business_id", how="left")

# show top 20 historical ratings
print("Top 20 ratings:")
hist = (reviews_with_meta.filter(col("user_id")==random_user_id)
               .orderBy(col("stars").desc())
               .select("name","categories",col("stars").alias("rating"))
               .limit(20))
hist.show()

Top 20 ratings:



+--------------------+--------------------+------+
|                name|          categories|rating|
+--------------------+--------------------+------+
|      El Charro Cafe|Tapas/Small Plate...|   5.0|
|Century Park Plac...|Venues & Event Sp...|   4.0|
|          The Parish|Gastropubs, Cajun...|   4.0|
|    Crave Coffee Bar|Food, Coffee & Te...|   4.0|
|        La Encantada|Shopping Centers,...|   4.0|
|        Frost Gelato|Desserts, Coffee ...|   4.0|
|Ghini's French Caffe|Restaurants, Food...|   4.0|
|         Time Market|Restaurants, Food...|   4.0|
|Enterprise Rent-A...|Hotels & Travel, ...|   4.0|
|         Smashburger|Sandwiches, Burge...|   4.0|
|Plunketts Office ...|Arts & Crafts, Sh...|   4.0|
|  Empire Pizza & Pub|Food, Pizza, Nigh...|   4.0|
|              Macy's|Department Stores...|   4.0|
|        World Market|Shopping, Home De...|   4.0|
|     AJ's Fine Foods|Delis, Bakeries, ...|   4.0|
|        Pottery Barn|Home & Garden, Fu...|   4.0|
|  Whole Foods Market|Health Ma

A random user is chosen and their 20 highest-rated businesses are displayed. This gives a picture of the user's preferences: in this case, mostly restaurants, plus some shops and local services, rated 4.0 or 5.0.

In [None]:
# generate & show top 10 recommendations
single = spark.createDataFrame([(random_uid,)], ["uid"])

# recommend using best model
recs = best_model.recommendForUserSubset(single, 10)
recs_flat = recs.select(explode("recommendations").alias("rec")) \
                .select(col("rec.bid").alias("bid"),
                        col("rec.rating").alias("pred_rating"))

# bring back original biz IDs & metadata
biz_meta = reviews_with_meta.select("bid", "business_id","name","categories").distinct()
top10 = recs_flat.join(biz_meta, on="bid") \
                 .select("name","categories","pred_rating")

print("Top 10 recommendations:")
top10.show()

Top 10 recommendations:



+--------------------+--------------------+-----------+
|                name|          categories|pred_rating|
+--------------------+--------------------+-----------+
|        Thai Day Spa|Beauty & Spas, Sk...|  4.6124053|
|     Sanctity Tattoo|Beauty & Spas, Ta...|  4.5933967|
|Story's Lock and Key|Automotive, Local...|  4.6610947|
| Karabu Pet Grooming|Pets, Pet Service...|  4.6053042|
|Native Seeds/ SEARCH|Shopping, Nurseri...|   4.577125|
|All Souls Procession|Arts & Entertainm...|  4.6964893|
|      Istari Studios|Tattoo, Beauty & ...|   4.599941|
|  La Mariposa Resort|Venues & Event Sp...|   4.755892|
|Firestone Complet...|Auto Parts & Supp...|   4.579238|
|   Distinctive Steel|Decks & Railing, ...|   4.569187|
+--------------------+--------------------+-----------+




Although the user's top historical ratings were mostly for restaurants, the recommendations include categories such as spas, tattoo studios, and auto services. This is expected: ALS is a collaborative-filtering model that learns latent factors purely from the user-business rating matrix and never sees the `categories` column. If users with similar rating patterns also rated these non-restaurant businesses highly, the model predicts that the target user will like them too. The recommendations therefore reflect co-rating patterns rather than content similarity.
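
To make the mechanism concrete, here is a minimal sketch of how a factorization model scores a (user, business) pair: each prediction is the dot product of a user vector and an item vector. The vectors are invented two-dimensional toy values (the model above uses `rank=100`), and the user and business names are placeholders. `math.fsum` is used because the notebook's imports shadow the built-in `sum`.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return math.fsum(a * b for a, b in zip(u, v))

# hypothetical 2-dimensional latent factors; real factors are learned by ALS
user_factors = {"user_a": [1.5, 0.25]}
item_factors = {
    "restaurant_x": [2.0, 1.0],
    "spa_y": [3.0, 0.5],
    "garage_z": [0.5, 2.0],
}

# score every business for the user, then rank by predicted score
scores = {
    name: dot(user_factors["user_a"], vec)
    for name, vec in item_factors.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nothing in the score depends on a business's category, so a spa can outrank a restaurant whenever its vector aligns better with the user's vector.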

#### 4.5.2 User Recommendations for Businesses

We randomly select a business from the dataset and retrieve the top 20 user ratings it has received.

In [None]:
# pick a random original business_id and its bid
random_business = (
    data.select("business_id", "bid")
        .distinct()
        .orderBy(rand())
        .limit(1)
        .collect()[0]
)

random_business_id, random_bid = random_business["business_id"], random_business["bid"]
print(f"Random business: business_id={random_business_id}, bid={random_bid}")


Random business: business_id=aHmRr6FCTxlJ7eD1bC4RUQ, bid=34796.0



In [None]:
print("Top 20 ratings by users for this business:")
hist_biz = (
    data
      .filter(col("business_id") == random_business_id)
      .orderBy(col("stars").desc())
      .select("user_id", col("stars").alias("rating"))
      .limit(20)
)
hist_biz.show()

Top 20 ratings by users for this business:



+--------------------+------+
|             user_id|rating|
+--------------------+------+
|4W49THS3wwMaWCcqq...|   5.0|
|-n4YtwxACJo8HFe4F...|   5.0|
|D2IUOetOVfjAkmohD...|   5.0|
|9TJMq58VJjvr0mDjw...|   5.0|
|4Pk295jW5RiHkfG9M...|   5.0|
|V9IcxLPr-2ipzfshm...|   5.0|
|8MkZ6bpdP7x8Vlm_u...|   5.0|
|5GqguHZj4OXbsNomb...|   5.0|
|GZgMcF-eRFWdoNjbJ...|   5.0|
|Q3Ht1BJCC7z3jvD9J...|   5.0|
|wMy-6JNoA0_AveVsq...|   5.0|
|zdmgX3KCWjoG8dYR2...|   1.0|
|iiJirn6ACeI9VvZBk...|   1.0|
+--------------------+------+




The table shows every rating this business has received in `data`. Although we requested the top 20, only 13 ratings exist: 11 users gave it 5.0 and 2 gave it 1.0, so the feedback is strongly positive overall.

Next, we use the ALS model to identify users who are most likely to rate this business highly, even if they have not interacted with it before.

In [None]:
# generate & show top 10 user recommendations for this business

single_biz_df = spark.createDataFrame([(random_bid,)], ["bid"])
user_recs = best_model.recommendForItemSubset(single_biz_df, 10)

# flatten out the struct and get uid + predicted rating
recs_flat = (
    user_recs
      .select(explode("recommendations").alias("rec"))
      .select(
         col("rec.uid").alias("uid"),
         col("rec.rating").alias("predicted_rating")
      )
)

# recover original user_id from uid
user_meta = data.select("uid", "user_id").distinct()
top_users = (
    recs_flat
      .join(user_meta, on="uid", how="left")
      .select("user_id", "predicted_rating")
      .orderBy(col("predicted_rating").desc())
)

print("Top 10 users likely to rate this business highly:")
top_users.show()

Top 10 users likely to rate this business highly:




+--------------------+----------------+
|             user_id|predicted_rating|
+--------------------+----------------+
|BbDE6GvqFgE_lJNg_...|       5.9048853|
|oF-ddAcFcYBpe1iJd...|       5.7885976|
|cF-bdU7IaLkgrk5ls...|       5.5619354|
|LqqI8VNFd30jzzUhy...|       5.5513825|
|rZWjUHMHHizDn1Hmv...|        5.539182|
|uyv47azu0RDhJLyWU...|       5.5383754|
|6yZKAnIU1v_JGhZT1...|       5.5318933|
|onS4xr-XkYCAT_V9q...|       5.5316005|
|4m6xbOBUwNIvIJyRk...|        5.504732|
|ne-yv1C8O_ppzaE62...|       5.4860926|
+--------------------+----------------+




ALS predicts a rating as the dot product of the user and item latent vectors. Even though the training labels are 1-5 stars, ALS fits these vectors by regularized least squares with no constraint on the output range, so a dot product can fall outside [1, 5]. In other words, the `predicted_rating` values above are unclipped raw scores, which is why several of them exceed 5.
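
As a reference point, the explicit-feedback objective is commonly written as below, where $x_u$ and $y_i$ are the user and item factor vectors, $r_{ui}$ is the observed star rating, $\mathcal{K}$ is the set of observed (user, business) pairs, and $\lambda$ corresponds to `regParam`. This is the textbook form, and the notation is introduced here for illustration. Spark's implementation additionally scales the regularization term by the number of ratings per user or item, so treat it as a sketch rather than the exact loss being minimized.

$$
\min_{X,\,Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right), \qquad \hat{r}_{ui} = x_u^\top y_i
$$

No term in this objective constrains $\hat{r}_{ui}$ to the interval $[1, 5]$.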

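We therefore rescale the scores to the 1-5 range so they are easier to read. Min-max scaling preserves the ranking, but the rescaled values are relative to the candidate set and should not be read as calibrated star predictions.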
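
Before the Spark version, here is a small pure-Python sketch of the same idea on made-up scores, with plain clipping alongside for comparison. The function names and the sample values are hypothetical. The notebook's imports shadow the built-in `min` and `max` with Spark column functions, so the sketch calls them through the `builtins` module.

```python
import builtins

def min_max_rescale(scores, lo=1.0, hi=5.0):
    """Linearly map scores so that min(scores) -> lo and max(scores) -> hi."""
    s_min, s_max = builtins.min(scores), builtins.max(scores)
    if s_max == s_min:
        # all scores identical: avoid division by zero, return the midpoint
        return [(lo + hi) / 2.0 for _ in scores]
    span = s_max - s_min
    return [lo + (s - s_min) / span * (hi - lo) for s in scores]

def clip(scores, lo=1.0, hi=5.0):
    """Alternative: cap each score to [lo, hi] without moving in-range values."""
    return [builtins.max(lo, builtins.min(hi, s)) for s in scores]

# hypothetical raw ALS scores, some outside the 1-5 star range
raw = [6.0, 5.5, 3.0, 0.5]

rescaled = min_max_rescale(raw)
clipped = clip(raw)
print(rescaled)
print(clipped)
```

Min-max scaling keeps the ordering but shifts every value relative to the extremes of the scored set. Clipping leaves in-range scores untouched but creates ties at the boundaries.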

In [None]:
# identify users who already rated that business
rated_uids = (
    data
      .filter(col("bid") == random_bid)
      .select("uid")
      .distinct()
)

# all candidate users = every uid minus those who have rated
all_uids = data.select("uid").distinct()
candidate_uids = all_uids.join(rated_uids, on="uid", how="left_anti")

# build (uid, bid) pairs for candidates only
biz_df = spark.createDataFrame([(random_bid,)], ["bid"])
cross_df = candidate_uids.crossJoin(biz_df)

# score every candidate user
all_scores = best_model.transform(cross_df) \
    .select("uid", col("prediction").alias("raw_score"))

In [None]:
# compute the global raw min and max
stats = all_scores.agg(
    min("raw_score").alias("min_pred"),
    max("raw_score").alias("max_pred")
).first()

min_pred, max_pred = stats["min_pred"], stats["max_pred"]

# apply min-max scaling so that min_pred → 1.0 and max_pred → 5.0
preds_rescaled = all_scores.withColumn(
    "prediction_rescaled",
    # (prediction - min)/(max - min) scales to [0,1], then *4+1 → [1,5]
    ((col("raw_score") - lit(min_pred)) / (lit(max_pred) - lit(min_pred)) * 4.0) + 1.0
)

# clip just in case of numerical drift
# (no sort here: the join below does not preserve order, so sorting is done once afterwards)
preds_rescaled = preds_rescaled.withColumn(
    "prediction_rescaled",
    least(greatest(col("prediction_rescaled"), lit(1.0)), lit(5.0))
)

# attach original user_id to the scored results
user_meta = data.select("uid", "user_id").distinct()

preds_rescaled_named = preds_rescaled.join(user_meta, on="uid", how="left") \
    .select("user_id", "prediction_rescaled") \
    .orderBy(col("prediction_rescaled").desc())

# show the top 10 candidate users (not yet rated this business) by rescaled score
preds_rescaled_named.show(10, truncate=False)


+----------------------+-------------------+
|user_id               |prediction_rescaled|
+----------------------+-------------------+
|BbDE6GvqFgE_lJNg_WU4fg|5.0                |
|oF-ddAcFcYBpe1iJd-zp-A|4.933731988565125  |
|cF-bdU7IaLkgrk5lskj5Yw|4.804565438092041  |
|LqqI8VNFd30jzzUhyG3e5A|4.79855256025386   |
|rZWjUHMHHizDn1HmvHUiiQ|4.79159949092435   |
|uyv47azu0RDhJLyWU-lTtg|4.791139449249547  |
|6yZKAnIU1v_JGhZT1W3R6Q|4.787445801000523  |
|onS4xr-XkYCAT_V9qOr24A|4.787278686038913  |
|4m6xbOBUwNIvIJyRkHUo2w|4.7719676947758    |
|ne-yv1C8O_ppzaE62kfJmQ|4.76134570477695   |
+----------------------+-------------------+
only showing top 10 rows




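After rescaling, we present the top 10 users who are most likely to give this business a high rating. The ranking matches the raw `recommendForItemSubset` output above; only the scale has changed. These users are natural candidates for personalized marketing or outreach campaigns.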