[![Open in Layer](https://development.layer.co/assets/badge.svg)](https://app.layer.ai/layer/Recommendation_System_and_Product_Categorisation_Project/) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/layerai/examples/blob/main/recommendation-system/Recommendation_System_and_Product_Categorisation.ipynb) [![Layer Examples Github](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com/layerai/examples/tree/main/recommendation-system)


In this e-commerce walkthrough, we will build a recommendation system on Layer using a public clickstream dataset. For more information about the dataset, see its Kaggle page: https://www.kaggle.com/datasets/tunguz/clickstream-data-for-online-shopping

# Install Requirements

In [None]:
!pip install layer --upgrade

In [None]:
!rm -rf examples
!git clone -b recommendation_engine_and_product_categorisation https://github.com/layerai/examples/

Cloning into 'examples'...
remote: Enumerating objects: 886, done.
remote: Counting objects: 100% (434/434), done.
remote: Compressing objects: 100% (266/266), done.
remote: Total 886 (delta 224), reused 307 (delta 158), pack-reused 452
Receiving objects: 100% (886/886), 24.37 MiB | 22.56 MiB/s, done.
Resolving deltas: 100% (395/395), done.


# Getting Started with Layer

Layer is an MLOps platform that adds remote computation and tracking to ML pipelines.

In [None]:
from layer.decorators import dataset, model, resources
from layer.client import Dataset, Model
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import layer

## Login to Layer

---



Let's log in to Layer first.

In [None]:
layer.login()

Please open the following link in your web browser. Once logged in, copy the code and paste it here.
https://app.layer.ai/oauth/authorize?response_type=code&code_challenge=YxO_KQablXtFP-GUxNfWPicb_9s0s17RfkucDJXEG68&code_challenge_method=S256&client_id=0STDdcnpK48P8A429EAAn93WNuLmViLR&redirect_uri=https://app.layer.ai/oauth/code&scope=offline_access&audience=https://app.layer.ai
Code: __KkLKr4KUukoId_DfBTCnXh_sd2sr-exrEXB_r3jAKvP
Successfully logged into https://app.layer.ai


## Initialize Layer Project
Now we are ready to initialize our project. A Layer Project is essentially an ML repository hosted on Layer where you can store your datasets, models, and metrics.

In [None]:
layer.init("Recommendation_System_and_Product_Categorisation_Project", pip_packages=["gensim","scikit-learn","pandas","numpy"], fabric="f-medium")

Project(name='Recommendation_System_and_Product_Categorisation_Project', raw_datasets=[], derived_datasets=[], models=[], path=PosixPath('.'), project_files_hash='', readme='', account=Account(id=UUID('add1b570-c8e7-4187-b747-1d01104893a9'), name='layer'), _id=UUID('3335ab76-e638-4bb5-92c3-caabd647e298'), functions=[])

Your project is ready. Find your project here:

https://app.layer.ai

# **Data Transformation**

We will create a total of 4 Layer datasets in this project. Here they are, each with a short description.

*  **raw_session_based_clickstream_data:** This is identical to the source csv file, which is a public Kaggle dataset: https://www.kaggle.com/datasets/tunguz/clickstream-data-for-online-shopping

    It is simply the Layer dataset definition of the same raw clickstream data.

* **sequential_products:** This is a Layer dataset derived from the previous one. It holds, for each session, the sequence of products in the order they were viewed.


* **product_ids_and_vectors:** This is a Layer dataset which stores the product vectors (embeddings) produced by the Word2Vec algorithm.


* **final_product_clusters:** This is a Layer dataset which stores the cluster number assigned to each product, together with the other members of that cluster.

Once you are done with building datasets, you can see them all here:

https://app.layer.ai/layer/Recommendation_System_and_Product_Categorisation_Project/datasets/
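
To make the idea behind **sequential_products** concrete, here is a minimal sketch that uses only the Python standard library. The click rows are made up for illustration and are not taken from the Kaggle file. The sketch mirrors the steps used later in the notebook: order the clicks within each session, drop sessions with a single click, and collapse consecutive repeated views.

```python
from itertools import groupby

# Hypothetical clickstream rows: (session_id, order, product_id)
clicks = [
    (1, 1, "A13"), (1, 2, "A13"), (1, 3, "B4"), (1, 4, "A13"),
    (2, 1, "C17"),
    (3, 2, "P1"), (3, 1, "B8"), (3, 3, "P1"),
]

# Collect the products of each session in view order
sessions = {}
for session_id, order, product_id in sorted(clicks, key=lambda row: (row[0], row[1])):
    sessions.setdefault(session_id, []).append(product_id)

# Keep sessions with more than one click and collapse consecutive duplicates;
# itertools.groupby yields one key per run of equal neighbouring items
sequences = {
    session_id: [product for product, _ in groupby(products)]
    for session_id, products in sessions.items()
    if len(products) > 1
}

print(sequences)  # {1: ['A13', 'B4', 'A13'], 3: ['B8', 'P1']}
```

Note that only neighbouring repeats are removed: in session 1, `A13` appears again after `B4` and is kept, because returning to a product is part of the browsing sequence.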

## **Model**

We will train a K-Means model from sklearn, fitting it on the product vectors stored in the **product_ids_and_vectors** dataset.

Once you are done with training the model, you can find it, along with its experiments and logged data, here:

https://app.layer.ai/layer/Recommendation_System_and_Product_Categorisation_Project/models/
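
As a rough illustration of the clustering step, the sketch below fits scikit-learn's `KMeans` on six made-up 2-D vectors. The vectors, `n_clusters=2`, and `n_init=10` are assumptions chosen for this toy example; the project itself clusters 10-dimensional Word2Vec vectors into 10 clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical "product vectors": two well-separated groups of three points
vectors = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],
])

# Fit K-Means; random_state makes the run reproducible
toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(vectors)

# labels_ holds the cluster assigned to each training vector
labels = toy_kmeans.labels_

# predict() assigns a new vector to the nearest learned centroid without refitting
new_label = toy_kmeans.predict(np.array([[4.9, 5.2]]))[0]

print(labels, new_label)
```

Products that share a label are treated as belonging to the same category, which is what the recommendation step later relies on.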


In [None]:
@resources("examples/recommendation_system_and_product_categorisation/e-shop clothing 2008.csv")
@dataset("raw_session_based_clickstream_data")
def raw_session_based_clickstream_data():
  data = pd.read_csv("examples/recommendation_system_and_product_categorisation/e-shop clothing 2008.csv",delimiter = ';')

  layer.log({"Dataset Description": "Raw clickstream data of an e-commerce clothing company"})

  return data

In [None]:
# remove consecutive duplicates from list
def remove_consec_duplicates(raw_lst):
  previous_value = None
  new_lst = []

  for elem in raw_lst:
    if elem != previous_value:
        new_lst.append(elem)
        previous_value = elem
        
  return new_lst


@dataset("sequential_products",dependencies=[Dataset('raw_session_based_clickstream_data')])
def generate_sequential_products():
  # Load dataset
  raw_clickstream = layer.get_dataset("raw_session_based_clickstream_data").to_pandas()

  # Rename columns
  data = raw_clickstream.rename(columns={"session ID": "session_id", "page 2 (clothing model)": "product_id"})
  # Remove sessions that contain only a single click
  data = data.groupby("session_id").filter(lambda x: len(x) > 1)
  # Group product view sequences in order by session id
  data = data.sort_values("order").groupby("session_id")["product_id"].apply(list)
  # Remove consecutive duplicate product views from the sequences generated in the previous step
  data = data.apply(remove_consec_duplicates)

  # Convert the series to a data frame, since a function decorated as a Layer dataset can only return a data frame
  data = data.to_frame().reset_index()

  layer.log({"Dataset Description": "Session-based sequences of products in view order"})

  return data

In [None]:
@dataset("product_ids_and_vectors", dependencies=[Dataset("sequential_products")])
def create_product_embeddings():
  import gensim
  from gensim.models import Word2Vec
  # Fetch dataset from Layer
  df = layer.get_dataset("sequential_products").to_pandas()
  
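  # Create Gensim CBOW model
  # (`size` is the gensim 3.x parameter name; gensim 4.x renames it to `vector_size`)
  session_product_sequences = df['product_id'].apply(list)
  word2vec_model = gensim.models.Word2Vec(session_product_sequences, min_count = 1, size = 10, window = 5)
  
  # Product ids in the same row order as the vectors array.
  # `wv.vocab.keys()` is not guaranteed to follow that order, so use the index list instead
  # (`wv.index2word` in gensim 3.x; gensim 4.x renames it to `wv.index_to_key`)
  productID_list = word2vec_model.wv.index2word
  vector_list = word2vec_model.wv.vectors.tolist()
  data_tuples = list(zip(productID_list,vector_list))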
  product_ids_and_vectors = pd.DataFrame(data_tuples, columns=['Product_ID','Vectors'])

  layer.log({"Dataset Description": "Product Vectors (Embeddings) generated by word2vec algorithm."})

  return product_ids_and_vectors

def plot_cluster_distribution(kmeans_model):
  import matplotlib.pyplot as plt

  plt.hist(kmeans_model.labels_, rwidth=0.7)
  plt.ylabel("Number of Products")
  plt.xlabel("Cluster No")

  # Layer logs the plot (figure)
  fig = plt.gcf()
  layer.log({"Product Distribution over Clusters": fig})

  # Close the logged figure to free memory
  plt.close(fig)


def plot_cluster_scatter(product_vectors):
  import matplotlib.pyplot as plt
  from sklearn.decomposition import PCA
  from sklearn.cluster import KMeans

  pca = PCA(n_components=2)
  two_dimensions_vectors = pca.fit_transform(product_vectors)
  
  # For visualisation only: fit a separate K-Means on the 2-D PCA projection
  kmeans_model = KMeans(n_clusters=10, random_state=0)
  label = kmeans_model.fit_predict(two_dimensions_vectors)

  #Getting the Centroids
  centroids = kmeans_model.cluster_centers_
  u_labels = np.unique(kmeans_model.labels_)
  
  #plotting the results:
  for i in u_labels:
      plt.scatter(two_dimensions_vectors[label == i , 0] , two_dimensions_vectors[label == i , 1] , label = i)
  plt.scatter(centroids[:,0] , centroids[:,1] , s = 80, color = 'k')
  plt.legend(bbox_to_anchor =(1, 1))
  

  # Layer logs the plot (figure)
  fig = plt.gcf()
  layer.log({"Clusters Formation Scatter Plot": fig})

  # Close the logged figure to free memory
  plt.close(fig)

@model("clustering_model",dependencies=[Dataset("product_ids_and_vectors")])
def fit_kmeans():
  import gensim
  from gensim.models import Word2Vec
  from sklearn.cluster import KMeans
  import matplotlib.pyplot as plt
  
  # Get product vectors from Word2Vec
  df = layer.get_dataset("product_ids_and_vectors").to_pandas()
  array_product_vectors = np.array(df["Vectors"].values.tolist())

  # Fit K-Means algorithm on those embeddings
  kmeans_model = KMeans(n_clusters=10, random_state=0).fit(array_product_vectors)

  # Cluster Distribution Plot
  plot_cluster_distribution(kmeans_model)

  # Cluster Scatter Plot
  plot_cluster_scatter(array_product_vectors)
  
  return kmeans_model

@dataset("final_product_clusters", dependencies=[Model("clustering_model"), Dataset("product_ids_and_vectors")])
def final_results():
  model = layer.get_model("clustering_model").get_train()
  
  df = layer.get_dataset("product_ids_and_vectors").to_pandas()
  array_product_vectors = np.array(df["Vectors"].values.tolist())

  # Use the already trained model to assign clusters instead of refitting it
  assigned_cluster_no = model.predict(array_product_vectors).tolist()

  df["Cluster_No"] = assigned_cluster_no
  # Collect the members of each cluster into one list per cluster
  cluster_members_df = (
      df[["Product_ID","Cluster_No"]]
      .groupby("Cluster_No")['Product_ID']
      .apply(list)
      .to_frame()
      .reset_index()
      .rename(columns={'Product_ID': 'Cluster_Member_List'})
  )

  result_df = df[["Product_ID","Cluster_No"]].merge(cluster_members_df, how="left", on=["Cluster_No"])

  layer.log({"Dataset Description": "Product Clusters and Categorisation"})

  return result_df
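
The group-and-merge step in `final_results` can be hard to read at a glance, so here is a small standalone sketch of the same pattern. The product ids and cluster numbers are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical cluster assignments for five products
toy_df = pd.DataFrame({
    "Product_ID": ["A1", "A2", "B1", "B2", "C1"],
    "Cluster_No": [0, 1, 0, 1, 0],
})

# One row per cluster, holding the list of all products in that cluster
toy_members = (
    toy_df.groupby("Cluster_No")["Product_ID"]
    .apply(list)
    .rename("Cluster_Member_List")
    .reset_index()
)

# Attach each product's full cluster member list back onto its row
toy_result = toy_df.merge(toy_members, how="left", on="Cluster_No")

print(toy_result)
```

Every product in the same cluster ends up with the same `Cluster_Member_List`, which makes a later lookup by `Product_ID` a single row selection.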


# **Run your project on Layer in local mode**

To run your project locally, call your functions in dependency order. Layer will automatically log and save all the data you asked for.

To run it in remote mode instead, pass all your functions to layer.run(). Layer will then run them remotely on its infrastructure, in the order implied by their dependencies.

In [None]:
## LAYER LOCAL MODE
raw_session_based_clickstream_data()
generate_sequential_products()
create_product_embeddings()
fit_kmeans()
final_results()


## LAYER REMOTE MODE
# layer.run([raw_session_based_clickstream_data,
#            generate_sequential_products,
#            create_product_embeddings,
#            fit_kmeans,
#            final_results],debug=True)

| | Product_ID | Cluster_No | Cluster_Member_List |
|---|---|---|---|
| 0 | A13 | 4 | [A13, B8, A10, C17, P77, C15, A11, P60, P56, C... |
| 1 | A16 | 6 | [A16, B4, C56, C57, P67, P82, B31, B21, B24, B... |
| 2 | B4 | 6 | [A16, B4, C56, C57, P67, P82, B31, B21, B24, B... |
| 3 | B17 | 3 | [B17, A37, C53, C34, C42, P16, A2, A5, A7, A9,... |
| 4 | B8 | 4 | [A13, B8, A10, C17, P77, C15, A11, P60, P56, C... |
| ... | ... | ... | ... |
| 212 | P45 | 1 | [C59, B10, P2, P15, B14, C39, P6, P25, B22, C5... |
| 213 | P54 | 5 | [P34, A34, C50, B1, B3, B30, B20, B25, B33, B3... |
| 214 | P28 | 5 | [P34, A34, C50, B1, B3, B30, B20, B25, B33, B3... |
| 215 | P22 | 5 | [P34, A34, C50, B1, B3, B30, B20, B25, B33, B3... |



# **Let's fetch the final result dataset and pick 5 random recommendations for a given product!**

In [None]:
def get_5_recommendations(product_name):
  from random import sample
  df = layer.get_dataset("layer/Recommendation_System_and_Product_Categorisation_Project/datasets/final_product_clusters").to_pandas()

  # Members of the cluster that contains the given product
  cluster_members = df[df["Product_ID"]==product_name]['Cluster_Member_List'].iloc[0].tolist()
  # Exclude the given product itself
  candidates = [member for member in cluster_members if member != product_name]

  # Randomly select 5 product ids (fewer if the cluster is too small)
  five_recommendations = sample(candidates, min(5, len(candidates)))

  return five_recommendations  

  
get_5_recommendations("A13")

['B19', 'A10', 'C51', 'A6', 'C44']
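
The sampling logic can also be tried without a Layer account. The sketch below is a standalone approximation that takes the cluster member list directly. The helper name `recommend_from_cluster` and its `k` and `rng` parameters are hypothetical and not part of Layer or of the notebook above.

```python
import random

def recommend_from_cluster(product_id, cluster_members, k=5, rng=None):
    """Pick up to k random products from the same cluster, excluding product_id."""
    rng = rng or random.Random()
    # Drop the queried product so it is never recommended to itself
    candidates = [member for member in cluster_members if member != product_id]
    # Sample without replacement; shrink k if the cluster is too small
    return rng.sample(candidates, min(k, len(candidates)))

# Toy cluster built from a few ids shown in the output table above
toy_cluster = ["A13", "B8", "A10", "C17", "P77", "C15", "A11"]

# A seeded generator makes the picks repeatable
picks = recommend_from_cluster("A13", toy_cluster, k=5, rng=random.Random(0))

# A cluster with fewer than k other members returns everything that is left
small_picks = recommend_from_cluster("A13", ["A13", "B8", "A10"], k=5)

print(picks, small_picks)
```

Filtering before sampling avoids having to resample until the queried product happens to be left out, and it also handles small clusters without raising an error.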