# **Content-Based Product Recommender System with PySpark & OpenAI**

### Task 1 - Set up the project

Installing the needed modules.

In [2]:
!pip install openai==1.16.2 python-dotenv pyspark findspark plotly pandas numpy



Importing the modules.

In [3]:
from dotenv import load_dotenv
import os
from openai import OpenAI
import pandas as pd
import numpy as np

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans
import findspark
import plotly.express as px


Set up the OpenAI API

In [4]:
# Load environment variables from the local env file
load_dotenv(dotenv_path='apikey.env.txt')
# Read the API key from the environment
APIKEY = os.getenv('apikey')

# Create an instance of the OpenAI client
client = OpenAI(api_key=APIKEY)
client

<openai.OpenAI at 0x22b3c3e41d0>
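
If `apikey` is missing from the env file, `os.getenv` silently returns `None`. A minimal sketch of a fail-fast check that gives a clearer error message; `require_env` is a hypothetical helper, not part of the notebook or of python-dotenv:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the env file path."
        )
    return value


# In the notebook this could replace the plain os.getenv call:
# APIKEY = require_env('apikey')
```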

Create a Spark session

In [5]:
#Create a Spark session
spark = SparkSession.builder.appName("ProductRecommenderSystem").getOrCreate()
spark

In [6]:
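# Note: if a Spark session already exists (for example, from the previous cell),
# getOrCreate() returns it, and spark.driver.memory may not take effect
# because the driver JVM is already running.
print("Creating a new Spark session with 12GB of driver memory...")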

try:
    spark = SparkSession.builder \
        .appName("RecommenderSystemWithMemory") \
        .master("local[*]") \
        .config("spark.driver.host", "127.0.0.1") \
        .config("spark.driver.memory", "12g") \
        .getOrCreate()

    print("\n✅ Spark session created successfully!")
    print(f"✅ Spark Version: {spark.version}")

except Exception as e:
    print("\n❌ Test failed. Error details:", e)

Creating a new Spark session with 12GB of driver memory...

✅ Spark session created successfully!
✅ Spark Version: 3.5.1


Loading the dataset

In [7]:
file_path = 'products_dataset.csv'
# Load the dataset
df = spark.read.csv(file_path, header=True, inferSchema=True, samplingRatio=1)
df.show(5)

+----------+--------------------+--------------------+
|product_id|               title|         description|
+----------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|
|        P1|Turmode 30 ft. RP...|If you need more ...|
|        P2|Large Tapestry Bo...|Polyester cover r...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|
+----------+--------------------+--------------------+
only showing top 5 rows



List of 8 products recently viewed by the user.

In [8]:
recently_viewed_products = [
    'P316',
    'P333',
    'P1115',
    'P1691',
    'P1082',
    'P397',
    'P1441',
    'P1054',
]

### Task 2 - Prepare the dataset

Combine `title` and `description` Columns

In [9]:
df = df.withColumn('combined_text', concat_ws("", df.title, df.description))
df.show(5)

+----------+--------------------+--------------------+--------------------+
|product_id|               title|         description|       combined_text|
+----------+--------------------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|Men's 3X Large Ca...|
|        P1|Turmode 30 ft. RP...|If you need more ...|Turmode 30 ft. RP...|
|        P2|Large Tapestry Bo...|Polyester cover r...|Large Tapestry Bo...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|16-Gauge-Sinks Ve...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|Men's Crazy Horse...|
+----------+--------------------+--------------------+--------------------+
only showing top 5 rows
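
Note that an empty separator joins the last word of the title directly to the first word of the description (for product P0 the combined text contains `SweatshirtThis`). A small sketch in plain Python of the difference a space separator makes; the strings are shortened stand-ins, and in Spark the equivalent change would be passing `" "` as the first argument to `concat_ws`:

```python
title = "Hooded Zip-Front Sweatshirt"
description = "This heavyweight hooded sweatshirt has a zip front."

# Mirrors concat_ws("", title, description): the words run together
fused = "".join([title, description])

# Mirrors concat_ws(" ", title, description): the word boundary is preserved
spaced = " ".join([title, description])

print(fused)
print(spaced)
```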



Get the `combined_text` column and convert it into a list.

In [10]:
list_combined_text = df.select('combined_text').rdd.flatMap(lambda x: x).collect()
print(list_combined_text[:3])

["Men's 3X Large Carbon Heather Cotton/Polyester Rain Defender Paxton Heavyweight Hooded Zip-Front SweatshirtThis heavyweight, water-repellent hooded sweatshirt has a zip front for fast layering. ORIGINAL FIT. 13 oz., 75% cotton/25% polyester blend with Rain Defender durable water repellent. Attached, jersey-lined three-piece hood with drawcord closure. Antique-finish brass front zipper. Two front hand-warmer pockets have a hidden security pocket inside. Stretchable, spandex-reinforced rib-knit cuffs and waistband. Locker loop facilitates hanging.", "Turmode 30 ft. RP TNC Female to RP TNC Male Adapter CableIf you need more length between your existing wireless device and Hi-Gain Antenna, this is the product for you. It's compatible with most Wi-Fi Antennas, so it is easy for you to extend your wireless network. Just replace your existing cable that runs between your wireless device and Antenna and you're ready to use your network with extended range.", 'Large Tapestry Bolster BedPolyes

Use an OpenAI text embedding model to create the vector embeddings.

In [11]:
# Use OpenAI's embeddings API to generate 512-dimensional embeddings for the combined text
response = client.embeddings.create(input=list_combined_text, model="text-embedding-3-small", dimensions=512)

embedding_vectors = [d.embedding for d in response.data]
embedding_vectors[:3]

[[0.037482116371393204,
  0.030425189062952995,
  -0.013667216524481773,
  -0.0015252791345119476,
  0.007516968995332718,
  -0.03558835759758949,
  0.021474501118063927,
  0.0818961039185524,
  0.0491662472486496,
  -0.05770602449774742,
  0.04201998934149742,
  0.04255595803260803,
  -0.0586707666516304,
  0.031104082241654396,
  0.022832291200757027,
  0.06453070044517517,
  0.10433535277843475,
  -0.037482116371393204,
  -0.08775603026151657,
  0.06442350149154663,
  -0.05638396739959717,
  0.05842065066099167,
  -0.026691269129514694,
  -0.07710810750722885,
  0.026262493804097176,
  0.03805381804704666,
  -0.048094309866428375,
  0.032551199197769165,
  0.055633608251810074,
  -0.015891488641500473,
  0.003608859609812498,
  -0.023582646623253822,
  0.018508804962038994,
  -0.02608383819460869,
  0.044163867831230164,
  -0.07074794173240662,
  -0.021563829854130745,
  0.10948065668344498,
  -0.03380179405212402,
  0.02860289253294468,
  0.03126487508416176,
  -0.05749163776636124
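
Sending the whole list in a single request works only while the dataset fits within the API's per-request limits on input count and tokens. A minimal sketch of batching, assuming hypothetical helpers `chunked` and `embed_in_batches` and an arbitrary batch size; the embedding call itself is passed in as a function so the sketch runs without network access:

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 500,  # arbitrary value chosen for illustration
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# With the notebook's client, embed_batch could look like this:
# def embed_batch(batch):
#     response = client.embeddings.create(
#         input=list(batch), model="text-embedding-3-small", dimensions=512)
#     return [d.embedding for d in response.data]
```

Because the batches are processed in order, the resulting list lines up with `list_combined_text` position by position.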

Let's put the embedding vectors into our original DataFrame.

Convert the list of embedding vectors into a PySpark DataFrame.

In [12]:
features_column_names = [f"embedding_{i}" for i in range(len(embedding_vectors[0]))]
embeddings_df = spark.createDataFrame(embedding_vectors, schema=features_column_names)
embeddings_df.show(5)

[Output truncated: a wide table with one column per embedding dimension, `embedding_0` to `embedding_511`]

Add a unique `id` to each row in the PySpark DataFrame of embeddings.

In [13]:
embeddings_df = embeddings_df.repartition(1).withColumn("id", F.monotonically_increasing_id())
embeddings_df.show(5)

[Output truncated: the same wide table of embedding columns, now with an additional `id` column]

Add a unique `id` to each row in our main PySpark DataFrame `df`.

In [14]:
df = df.repartition(1).withColumn("id", F.monotonically_increasing_id())
df.show(5)

+----------+--------------------+--------------------+--------------------+---+
|product_id|               title|         description|       combined_text| id|
+----------+--------------------+--------------------+--------------------+---+
|        P0|Men's 3X Large Ca...|This heavyweight,...|Men's 3X Large Ca...|  0|
|        P1|Turmode 30 ft. RP...|If you need more ...|Turmode 30 ft. RP...|  1|
|        P2|Large Tapestry Bo...|Polyester cover r...|Large Tapestry Bo...|  2|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|16-Gauge-Sinks Ve...|  3|
|        P4|Men's Crazy Horse...|This 9 in. black ...|Men's Crazy Horse...|  4|
+----------+--------------------+--------------------+--------------------+---+
only showing top 5 rows



Let's join the two dataframes

In [15]:
df = df.join(embeddings_df, on="id", how="inner").drop("id")
df.show(5)

[Output truncated: a wide table with `product_id`, `title`, `description`, `combined_text`, and the 512 embedding columns]
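
The join above relies on both DataFrames keeping the same row order after `repartition(1)`. A sketch of an alternative that pairs each product ID with its embedding in plain Python before any DataFrame is created, so the alignment does not depend on generated IDs. `build_rows` is a hypothetical helper, the data is made up, and the Spark calls are left as comments:

```python
def build_rows(product_ids, vectors):
    """Pair each product ID with its embedding values as one flat tuple per product."""
    if len(product_ids) != len(vectors):
        raise ValueError("product_ids and vectors must have the same length")
    return [(pid, *map(float, vec)) for pid, vec in zip(product_ids, vectors)]


# Toy stand-ins for the collected product IDs and the embedding vectors
ids = ["P0", "P1"]
vecs = [[0.1, 0.2], [0.3, 0.4]]

rows = build_rows(ids, vecs)
columns = ["product_id"] + [f"embedding_{i}" for i in range(len(vecs[0]))]

# In the notebook, the rows could then be joined on the real key:
# embeddings_df = spark.createDataFrame(rows, schema=columns)
# df = df.join(embeddings_df, on="product_id", how="inner")
```

This assumes the product IDs are collected together with the texts (for example, by selecting `product_id` and `combined_text` in the same `collect()` call), so both lists share one order.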

### Task 3 - Cluster products using K-means

Assemble the 512 Embedding Columns into a Single 'features' Column

In [16]:
# Create a VectorAssembler to combine the embedding columns into a single vector column
vector_assembler = VectorAssembler(inputCols=features_column_names, outputCol="features")
df_vector = vector_assembler.transform(df)
df_vector = df_vector.select("product_id", "title", "description", "features")
df_vector.show(5)

+----------+--------------------+--------------------+--------------------+
|product_id|               title|         description|            features|
+----------+--------------------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.03748211637139...|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.03538732230663...|
|        P2|Large Tapestry Bo...|Polyester cover r...|[0.03585880994796...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.0583366602659...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.02001740410923...|
+----------+--------------------+--------------------+--------------------+
only showing top 5 rows



Apply K-Means Clustering with 5 Clusters on the `features` Column

In [17]:
#Apply KMeans clustering to the vectorized data
kmeans = KMeans(k=5, featuresCol="features", predictionCol="cluster")
kmeans_model = kmeans.fit(df_vector)
# Make predictions
clustered_data = kmeans_model.transform(df_vector)
clustered_data.show(5)

+----------+--------------------+--------------------+--------------------+-------+
|product_id|               title|         description|            features|cluster|
+----------+--------------------+--------------------+--------------------+-------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.03748211637139...|      4|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.03538732230663...|      4|
|        P2|Large Tapestry Bo...|Polyester cover r...|[0.03585880994796...|      3|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.0583366602659...|      1|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.02001740410923...|      4|
+----------+--------------------+--------------------+--------------------+-------+
only showing top 5 rows
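
K-means alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its members. The initial centroids are chosen randomly, so cluster labels can change between runs unless a `seed` is passed to `KMeans`. A toy sketch of one iteration in plain Python, using made-up 2-D points rather than the 512-dimensional vectors; it illustrates the idea and is not Spark's implementation:

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def recompute_centroids(points, labels, centroids):
    """Move each centroid to the mean of its assigned points; keep it if it has none."""
    updated = []
    for idx, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == idx]
        if not members:
            updated.append(list(old))
            continue
        updated.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(old))]
        )
    return updated


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]

labels = [nearest_centroid(p, centroids) for p in points]   # assignment step
centroids = recompute_centroids(points, labels, centroids)  # update step
```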



### Task 4 - Visualize the clusters

Let's reduce the dimensionality of our features for visualization purposes.

`512 dimensions => 2 dimensions`

In [18]:
#Reduce the dimensionality of the data using PCA
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(clustered_data)
# Transform the data using PCA
pca_predictions = pca_model.transform(clustered_data)
pca_predictions.show(5)

+----------+--------------------+--------------------+--------------------+-------+--------------------+
|product_id|               title|         description|            features|cluster|        pca_features|
+----------+--------------------+--------------------+--------------------+-------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.03748211637139...|      4|[0.17182252527565...|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.03538732230663...|      4|[-0.1713116207975...|
|        P2|Large Tapestry Bo...|Polyester cover r...|[0.03585880994796...|      3|[-0.0156937687209...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.0583366602659...|      1|[0.00186577371176...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.02001740410923...|      4|[-0.0289821929329...|
+----------+--------------------+--------------------+--------------------+-------+--------------------+
only showing top 5 rows
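
PCA projects each vector onto the directions along which the data varies most. A minimal NumPy sketch on random stand-in data, assuming mean-centering followed by an SVD; `pca_2d` is a hypothetical helper, and its numbers will not necessarily match Spark's `PCA` output, since sign and centering conventions can differ:

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8))  # stand-in for the 512-dimensional embeddings
projected = pca_2d(data)
print(projected.shape)
```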



In [19]:
pca_df = pca_predictions.select("product_id", "cluster", "pca_features").toPandas()
pca_df.head(5)

| | product_id | cluster | pca_features |
|---|---|---|---|
| 0 | P0 | 4 | [0.17182252527565395, -0.03081721970236098] |
| 1 | P1 | 4 | [-0.17131162079758153, 0.13221598409066537] |
| 2 | P2 | 3 | [-0.015693768720934424, -0.31012649245082896] |
| 3 | P3 | 1 | [0.0018657737117699826, -0.05688727143212571] |
| 4 | P4 | 4 | [-0.028982192932957198, 0.03979818215450777] |


Let's plot the clusters. First, extract the two PCA components into separate `x` and `y` columns.

In [20]:
pca_df['x'] = pca_df['pca_features'].apply(lambda x: x[0])
pca_df['y'] = pca_df['pca_features'].apply(lambda x: x[1])
pca_df.head(5)

| | product_id | cluster | pca_features | x | y |
|---|---|---|---|---|---|
| 0 | P0 | 4 | [0.17182252527565395, -0.03081721970236098] | 0.171823 | -0.030817 |
| 1 | P1 | 4 | [-0.17131162079758153, 0.13221598409066537] | -0.171312 | 0.132216 |
| 2 | P2 | 3 | [-0.015693768720934424, -0.31012649245082896] | -0.015694 | -0.310126 |
| 3 | P3 | 1 | [0.0018657737117699826, -0.05688727143212571] | 0.001866 | -0.056887 |
| 4 | P4 | 4 | [-0.028982192932957198, 0.03979818215450777] | -0.028982 | 0.039798 |


In [21]:
def plot_clusters(pca_df, num_clusters=5):
    """
    Plots a 2D visualization of clusters using Plotly Express.

    Parameters:
    - pca_df (DataFrame): A Pandas DataFrame containing columns 'x', 'y', 'cluster', and 'product_id'.
      'x' and 'y' are the 2D PCA components, and 'cluster' indicates the cluster label.
    - num_clusters (int): The number of unique clusters to display.

    This function creates an interactive scatter plot where each point is colored according to its cluster.

    Returns:
    - fig (Figure): The Plotly figure object for the plot.
    """

    # Create the base cluster plot
    fig = px.scatter(
        pca_df,
        x='x',
        y='y',
        opacity=0.6,
        size_max=4,
        # Cast cluster labels to strings so Plotly treats them as discrete categories
        color=pca_df.cluster.astype(str),
        title='2D Visualization of Clusters',
        labels={'x': 'PCA Component 1', 'y': 'PCA Component 2'},
        # Category values must match the string labels used for coloring
        category_orders={'cluster': [str(i) for i in range(num_clusters)]},
        # Show the product id in the tooltip
        hover_data={'product_id': True}
    )

    # Update layout to add legend title and adjust plot settings
    fig.update_layout(legend_title_text='Clusters', legend=dict(x=1, y=1), width=600, height=500)

    return fig

fig = plot_clusters(pca_df)
fig.show()

### Task 5 - Highlight recently viewed products

In [22]:
print("The user has recently viewed the following products: ", recently_viewed_products)

The user has recently viewed the following products:  ['P316', 'P333', 'P1115', 'P1691', 'P1082', 'P397', 'P1441', 'P1054']


Let's have a look at the records in our `clustered_data` dataframe related to the recently viewed products.

In [23]:
filtered_data = clustered_data.filter(clustered_data.product_id.isin(recently_viewed_products))
filtered_data.show()

+----------+--------------------+--------------------+--------------------+-------+
|product_id|               title|         description|            features|cluster|
+----------+--------------------+--------------------+--------------------+-------+
|      P316|Mystic Fitz Roy B...|With its distress...|[-0.0203542355448...|      3|
|      P333|Florida Shag Beig...|Lavish natural mo...|[-0.0176313053816...|      3|
|      P397|1 gal. #M250-3 Ap...|BEHR ULTRA SCUFF ...|[-0.0042055491358...|      0|
|     P1054|1 gal. #HDPG60 Mi...|The improved PPG ...|[-0.0088454978540...|      0|
|     P1082|1 qt. #S220-7 Mol...|BEHR ULTRA SCUFF ...|[-0.0200847387313...|      0|
|     P1115|Modern Gray/Multi...|This Modern Gray/...|[-0.0280903037637...|      3|
|     P1441|1 qt. #PPU6-06 Ho...|BEHR PREMIUM PLUS...|[-0.0061709149740...|      0|
|     P1691|Genet Rust/Red-Br...|Add a refreshing ...|[-0.0352701097726...|      3|
+----------+--------------------+--------------------+--------------------+-------+

Let's see the distinct clusters of the recently viewed products.

In [24]:
unique_clusters = filtered_data.select("cluster").distinct().rdd.flatMap(lambda x: x).collect()
unique_clusters

[3, 0]

### Task 6 - Recommend products based on recently viewed products

Let's have a look at the titles of the recently viewed products.

In [25]:
filtered_data.select("title").rdd.flatMap(lambda x: x).collect()

["Mystic Fitz Roy Beige 9' 0 x 12' 0 Area Rug",
 'Florida Shag Beige/Multi 3 ft. x 5 ft. Floral Area Rug',
 '1 gal. #M250-3 Apple Turnover Extra Durable Flat Interior Paint & Primer',
 '1 gal. #HDPG60 Misty Emerald Lake Flat Interior Paint and Primer',
 '1 qt. #S220-7 Molasses Extra Durable Flat Interior Paint & Primer',
 'Modern Gray/Multi 9 ft. x 12 ft. Vibrant Abstract Polyester Area Rug',
 '1 qt. #PPU6-06 Honey Locust Eggshell Enamel Low Odor Interior Paint & Primer',
 'Genet Rust/Red-Brown 8 ft. x 11 ft. Abstract Wool Area Rug']

Let's find the possible products for the recommendation.

In [26]:
possible_recommendations = clustered_data.filter(clustered_data.cluster.isin(unique_clusters)).filter(~clustered_data.product_id.isin(recently_viewed_products))
possible_recommendations.show()

+----------+--------------------+--------------------+--------------------+-------+
|product_id|               title|         description|            features|cluster|
+----------+--------------------+--------------------+--------------------+-------+
|        P2|Large Tapestry Bo...|Polyester cover r...|[0.03585880994796...|      3|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|[0.00168563472107...|      0|
|       P11|1 qt. #350F-7 Wil...|BEHR PREMIUM PLUS...|[0.00113753252662...|      0|
|       P16|5 gal. #BL-W10 Ma...|BEHR PREMIUM PLUS...|[-0.0087030595168...|      0|
|       P18|1 qt. #M400-5 Bab...|BEHR PREMIUM PLUS...|[0.00725652417168...|      0|
|       P21|Whimsicle Blue Mu...|In true bohemian ...|[0.03497285023331...|      3|
|       P24|5-gal. #HDGO64U C...|The Glidden 5-gal...|[-0.0456367768347...|      0|
|       P26|5 gal. #W-B-320 W...|BEHR ULTRA SCUFF ...|[0.00333938119001...|      0|
|       P30|1 qt. Bermuda San...|This Glidden Exte...|[-0.0576894320547...| 

Let's perform a groupby and generate a list of product IDs that can be recommended for each of the clusters.
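
The cell that builds `recommendation_df` is not shown in this notebook. One way to approach the step described above, sketched in plain Python: group the candidate product IDs by cluster and draw a few at random from each group. `sample_recommendations`, the sample size, and the seed are assumptions for illustration; the toy candidates are taken from the first rows of `possible_recommendations`:

```python
import random
from collections import defaultdict


def sample_recommendations(candidates, n=5, seed=None):
    """Group (product_id, cluster) pairs by cluster and sample up to n IDs per cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for product_id, cluster in candidates:
        by_cluster[cluster].append(product_id)
    return {
        cluster: rng.sample(ids, min(n, len(ids)))
        for cluster, ids in by_cluster.items()
    }


candidates = [("P2", 3), ("P6", 0), ("P11", 0), ("P16", 0), ("P21", 3)]
recommendations = sample_recommendations(candidates, n=2, seed=42)
print(recommendations)
```

In the notebook, a result like this would then be turned into a pandas DataFrame with `cluster` and `random_recommendations` columns, which is the shape `display_recommendation` reads from.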

In [None]:
# Display the recommended product titles for one row of recommendation_df
def display_recommendation(row):
    # Look up the titles of the recommended products in df
    product_ids = row['random_recommendations']
    cluster = row['cluster']

    titles = df.filter(df["product_id"].isin(product_ids)).select("title").collect()

    print("\n")
    print("Recommendations for Cluster:", cluster)
    for title in titles:
        print(title[0])

# recommendation_df is expected to be a pandas DataFrame with
# 'cluster' and 'random_recommendations' columns
recommendation_df.apply(display_recommendation, axis=1)



Recommendations for Cluster: 4
Lyndhurst Multi/Green 10 ft. x 14 ft. Border Area Rug
Milana Home Ocean Blue 2 ft. 6 in. x 4 ft. Indoor/Outdoor Rug
Blossom Blue/Multi 8 ft. x 8 ft. Square Solid Floral Distressed Area Rug
Cora Fairway Oaks Blue Transitional Vintage 5 ft. x 7 ft. Area Rug
Vision Gray 2 ft. x 10 ft. Solid Runner Rug


Recommendations for Cluster: 1
8 oz. #S220-5 Nutshell One-Coat Hide Semi-Gloss Enamel Stain-Blocking Interior/Exterior Paint and Primer Sample
5 gal. #790F-5 Amazon Stone Semi-Gloss Enamel Exterior Paint & Primer
1 gal. #260F-6 Smokey Topaz Semi-Gloss Enamel Interior Paint & Primer
5 gal. #PPU24-07 Barnwood Gray Extra Durable Flat Interior Paint & Primer
5 gal. #390D-7 Marsh Grass Eggshell Interior Paint


0    None
1    None
dtype: object