<a href="https://colab.research.google.com/github/piyushgarg878/Recommender_system/blob/main/Recommender_System_with_openAPI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Welcome to the Notebook**

### Task 1 - Set up the project

Installing the needed modules.

In [1]:
!pip install openai==1.16.2 python-dotenv pyspark

Collecting openai==1.16.2
  Downloading openai-1.16.2-py3-none-any.whl.metadata (21 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading openai-1.16.2-py3-none-any.whl (267 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv, openai
  Attempting uninstall: openai
    Found existing installation: openai 1.59.9
    Uninstalling openai-1.59.9:
      Successfully uninstalled openai-1.59.9
Successfully installed openai-1.16.2 python-dotenv-1.0.1


In [7]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-1.61.0-py3-none-any.whl.metadata (27 kB)
Downloading openai-1.61.0-py3-none-any.whl (460 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.16.2
    Uninstalling openai-1.16.2:
      Successfully uninstalled openai-1.16.2
Successfully installed openai-1.61.0


Importing the modules

In [2]:
from dotenv import load_dotenv
import os
from openai import OpenAI
import pandas as pd
import numpy as np

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans
import plotly.express as px

Set up the OpenAI API

In [8]:
load_dotenv(dotenv_path='apikey.env.txt')

APIKEY=os.getenv('APIKEY')
# Do not print the key: cell output is saved and shared with the notebook

client=OpenAI(
    api_key=APIKEY
)
client


TypeError: Client.__init__() got an unexpected keyword argument 'proxies'
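
Echoing a secret into notebook output makes it easy to leak when the notebook is shared. Below is a minimal sketch of confirming that a key was loaded while showing only a short prefix. `mask_secret` is a hypothetical helper written for this illustration; it is not part of `openai` or `python-dotenv`.

```python
def mask_secret(secret, visible=4):
    """Return a display-safe version of a secret, keeping only a short prefix."""
    if not secret:
        return "<missing>"
    # Never reveal more than half of the secret, even for short values
    visible = min(visible, len(secret) // 2)
    return secret[:visible] + "*" * 8

# Example with a dummy value (not a real key)
print(mask_secret("sk-example-1234567890"))
```

In the cell above, `print(mask_secret(APIKEY))` would show whether the variable was read from the env file without exposing it.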

Create a Spark session

In [None]:
spark=SparkSession.builder.appName("ProductRecommenderSystem").getOrCreate()
spark

Loading the dataset

In [None]:
file_path='product_dataset.csv'
df=spark.read.csv(file_path,header=True,inferSchema=True,samplingRatio=1)
df.show()

List of 8 products recently viewed by the user.

In [None]:
recently_viewed_products = [
    'P316',
    'P333',
    'P1115',
    'P1691',
    'P1082',
    'P397',
    'P1441',
    'P1054',
]

### Task 2 - Prepare the dataset

Combine `title` and `description` Columns

In [None]:
# withColumn returns a new DataFrame, so the result must be assigned back to df
df=df.withColumn("combined_text",concat_ws(" ",df.title,df.description))
df.show()

Get the `combined_text` column and convert it into a list

In [None]:
list_combined_text=df.select('combined_text').rdd.flatMap(lambda x: x).collect()
print(list_combined_text)

Use OpenAI text embedding model to create the vector embeddings.

In [None]:
response=client.embeddings.create(
    input=list_combined_text,
    model="text-embedding-3-small",
    dimensions=512
)
embedding_vectors=[d.embedding for d in response.data]
print(response)
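
The cell above sends every product text in a single request. The embeddings endpoint limits how much input one request can carry, so a larger dataset may need to be split into batches. Below is a minimal sketch of that batching, assuming a batch size of 1000 (an arbitrary choice, not a documented limit). `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names; `fake_embed` stands in for the API call so the sketch runs offline.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=1000):
    """Call `embed_fn` on each batch and concatenate the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in for the API call: one fake 1-dimensional vector per text
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

With the real client, `embed_fn` could wrap `client.embeddings.create(input=batch, model="text-embedding-3-small", dimensions=512)` and return `[d.embedding for d in response.data]`.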

Let's put the embedding vectors into our original dataframe

Convert the embedding vectors list into a PySpark DataFrame

In [None]:
features_column_names = [f"embedding_{i}" for i in range(len(embedding_vectors[0]))]
embeddings_df=spark.createDataFrame(embedding_vectors,features_column_names)
embeddings_df.show()

Add unique `row_id` to each row in the PySpark dataframe

In [None]:
# Assign the result back; withColumn does not modify embeddings_df in place
embeddings_df=embeddings_df.repartition(1).withColumn("row_id",F.monotonically_increasing_id())
embeddings_df.show()

Add unique `row_id` to each row in our main pyspark dataframe `df`

In [None]:
df=df.repartition(1).withColumn("row_id",F.monotonically_increasing_id())
df.show()

Let's join the two dataframes

In [None]:
df=df.join(embeddings_df,on="row_id",how="inner").drop("row_id")
df.show()
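
The join above relies on both dataframes receiving the same `row_id` for the same product, which in turn depends on both keeping their original row order. Below is a sketch of an alternative that pairs each product id with its vector in plain Python, so the match no longer depends on generated ids. `attach_ids` is a hypothetical helper, and the sketch assumes the product ids were collected in the same order as `list_combined_text`.

```python
def attach_ids(product_ids, vectors):
    """Pair each product id with its embedding as one flat row: (id, v0, v1, ...)."""
    if len(product_ids) != len(vectors):
        raise ValueError("product_ids and vectors must have the same length")
    return [(pid, *vec) for pid, vec in zip(product_ids, vectors)]

# Toy data: two products with 2-dimensional vectors
rows = attach_ids(["P1", "P2"], [[0.1, 0.2], [0.3, 0.4]])
columns = ["product_id"] + [f"embedding_{i}" for i in range(2)]
print(columns)
print(rows)
```

The resulting `rows` and `columns` could then be passed to `spark.createDataFrame(rows, columns)` and joined to `df` on `product_id`.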

### Task 3 - Cluster products using K-means

Assemble the 512 Embedding Columns into a Single 'features' Column

In [None]:
assembler = VectorAssembler(inputCols=features_column_names, outputCol="features")
data = assembler.transform(df)
data=data.select("product_id","title","description","features")
data.show()

Apply K-Means Clustering with 5 Clusters on the `features` Column

In [None]:
kmeans=KMeans(k=5,featuresCol='features',predictionCol="cluster")
model=kmeans.fit(data)
clustered_data=model.transform(data)
clustered_data.show()
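
K-means alternates between two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. Below is a toy sketch of the assignment step on 2D points with Euclidean distance. It illustrates the idea only and is not Spark's implementation; `nearest_centroid` and the sample points are made up for the example.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)]

# Each point gets the index of its closest centroid, which is its cluster label
print([nearest_centroid(p, centroids) for p in points])  # [0, 1, 0]
```

The `cluster` column produced by `model.transform(data)` plays the same role as these indices, with 512-dimensional embedding vectors in place of the 2D points.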

### Task 4 - Visualize the clusters

Let's reduce the dimensionality of our features for visualization purposes

`512 dimensions => 2 dimensions`

In [None]:
pca=PCA(k=2,inputCol='features',outputCol='pcaFeatures')
pca_model=pca.fit(clustered_data)
pca_results=pca_model.transform(clustered_data)
pca_results.show()

In [None]:
pca_df=pca_results.select("product_id","cluster","pcaFeatures").toPandas()
# Use lowercase names: plot_clusters reads the 'x' and 'y' columns
pca_df['x']=pca_df.pcaFeatures.apply(lambda v: v[0])
pca_df['y']=pca_df.pcaFeatures.apply(lambda v: v[1])
pca_df.head()

Let's plot the Clusters

In [None]:
def plot_clusters(pca_df, num_clusters=5):
    """
    Plots a 2D visualization of clusters using Plotly Express.

    Parameters:
    - pca_df (DataFrame): A Pandas DataFrame containing columns 'x', 'y', 'cluster', and 'product_id'.
      'x' and 'y' are the 2D PCA components, and 'cluster' indicates the cluster label.
    - num_clusters (int): The number of unique clusters to display.

    This function creates an interactive scatter plot where each point is colored according to its cluster.

    Returns:
    - fig (Figure): The Plotly figure object for the plot.
    """

    # Create the base cluster plot
    fig = px.scatter(
        pca_df,
        x='x',
        y='y',
        opacity=0.6,
        size_max=4,
        color= pca_df.cluster.astype(str),
        title='2D Visualization of Clusters with Recently Viewed Products',
        labels={'x': 'PCA Component 1', 'y': 'PCA Component 2'},
        category_orders={'cluster': list(range(num_clusters))},
        # show the product id in the tooltip
        hover_data={'product_id': True}

    )

    # Update layout to add legend title and adjust plot settings
    fig.update_layout(legend_title_text='Clusters', legend=dict(x=1, y=1), width=600, height=500)

    return fig

fig = plot_clusters(pca_df)
fig.show()

### Task 5 - Highlight recently viewed products

In [None]:
print("The user has recently viewed the following products: ", recently_viewed_products)

Let's have a look at the records in our `clustered_data` dataframe related to the recently viewed products.

In [None]:
filtered_data=clustered_data.where(F.col("product_id").isin(recently_viewed_products))
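# collect() returns Row objects; extract the plain cluster ids so they can be passed to isin()
unique_clusters=[row.cluster for row in filtered_data.select("cluster").distinct().collect()]
unique_clusters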

### Task 6 - Recommend products based on recently viewed products

Let's have a look at the recently viewed products titles

In [None]:
filtered_data.select("title").rdd.flatMap(lambda x:x).collect()

Let's see the distinct clusters of the recently viewed products.

In [None]:
print(unique_clusters)

Let's find the possible products for the recommendation.

In [None]:
possible_recommendations=clustered_data.filter(clustered_data['cluster'].isin(unique_clusters)).filter(~clustered_data['product_id'].isin(recently_viewed_products))
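
The two filters above keep products that share a cluster with a recently viewed product and then drop the viewed products themselves. Below is a plain-Python sketch of the same selection on toy data. `candidate_recommendations` and `toy_clusters` are hypothetical names used only for this illustration.

```python
def candidate_recommendations(product_clusters, viewed_ids):
    """Return products in the viewed products' clusters, minus the viewed products."""
    viewed = set(viewed_ids)
    # Clusters that contain at least one viewed product
    viewed_clusters = {product_clusters[pid] for pid in viewed if pid in product_clusters}
    return sorted(
        pid for pid, cluster in product_clusters.items()
        if cluster in viewed_clusters and pid not in viewed
    )

# Toy mapping of product id -> cluster label
toy_clusters = {"P1": 0, "P2": 0, "P3": 1, "P4": 2, "P5": 2}
print(candidate_recommendations(toy_clusters, ["P1", "P4"]))  # ['P2', 'P5']
```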

Let's perform a groupby and generate a list of product IDs that can be recommended for each of the clusters.

In [None]:
recommendations=possible_recommendations.groupby("cluster").agg(F.collect_list("product_id").alias("recommendations"))
recommendations_df=recommendations.toPandas()
recommendations_df['random_recommendations']=recommendations_df.recommendations.apply(lambda x: np.random.choice(x,5,replace=False).tolist())
recommendations_df.head()
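
Sampling five items without replacement fails when a cluster has fewer than five candidates, and an unseeded draw changes on every run. Below is a sketch of a guarded, reproducible variant using the standard library. `sample_recommendations` and its `seed` parameter are hypothetical and not part of the notebook's code.

```python
import random

def sample_recommendations(product_ids, n=5, seed=None):
    """Pick up to `n` distinct product ids; returns all of them if fewer exist."""
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    ids = list(product_ids)
    return rng.sample(ids, min(n, len(ids)))

# Only three candidates are available, so all three are returned in random order
print(sample_recommendations(["P1", "P2", "P3"], n=5, seed=0))
```

Passing the same `seed` gives the same picks on every run, which makes the recommendations easier to compare while experimenting.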

In [None]:
# write a python function to display the recommendations
def display_recommendations(row):
  # find the title of the product in df
  product_ids = row['random_recommendations']
  cluster = row.cluster

  titles = data. \
          filter(data["product_id"]. \
          isin(product_ids)).select("title").collect()

  print("\n")
  print("Recommendations for Cluster:", cluster)
  for title in titles:
    print(title[0])

recommendations_df.apply(display_recommendations, axis=1)