# Product ranking :

The givens case study is to rank the product recommendations based on the relevance. The relevance is labeled into four categories Exact, Substitute, Complement and Irrelevant. The dataset is multilingual and has text in English, Spanish and Japanese.

## Approach

Given the time frame, I am considering only English queries in the dataset to build the model. I am implementing Listwise Ranking model using tensorflow's tensorflow recommenders and tensorflow ranking libraries. (https://www.tensorflow.org/recommenders/examples/listwise_ranking)

In this approach, a two tower recommender model is built to predict relevance scores and model's ranking of a list as a whole is optimized. Here, I am using the query and product title as features into query tower and product tower. Also, a fixed number of n products are passed for ranking. The model is trained to minimize listwise ListMLE loss. The data is split into 80-20 train-test ratio and trained for 30 epochs.

Normalized Discounted Cumulative Gain(NDCG) is used to evaluate the model. NDCG measures a predicted ranking by taking a weighted sum of the actual rating of each candidate. The ratings of movies that are ranked lower by the model would be discounted more. As a result, a good model that ranks highly-rated movies on top would have a high NDCG result








In [1]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [2]:
%cd drive/MyDrive/Colab_Notebooks

/content/drive/MyDrive/Colab_Notebooks


In [3]:
!pip install tensorflow-recommenders
!pip install tensorflow-ranking

Collecting tensorflow-recommenders
  Downloading tensorflow_recommenders-0.7.3-py3-none-any.whl (96 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/96.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━[0m [32m92.2/96.2 kB[0m [31m2.7 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m96.2/96.2 kB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tensorflow-recommenders
Successfully installed tensorflow-recommenders-0.7.3
Collecting tensorflow-ranking
  Downloading tensorflow_ranking-0.5.3-py2.py3-none-any.whl (151 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m151.2/151.2 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
Collecting tensorflow-serving-api<3.0.0,>=2.0.0 (from tensorflow-ranking)
  Downloading tensorflow_serving_api-2.13.0-py2.py3-none-any.whl (26 kB)
Collecting tensorflow<3,>=2.13.0 (from tensorflow

In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_ranking as tfr
import tensorflow_recommenders as tfrs
from sklearn.model_selection import train_test_split
import string
import random
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
# Read product catalogue data
prod_cat_path = 'product_catalogue-v0.3.csv'
prod_cat = pd.read_csv(prod_cat_path)
print(prod_cat.shape)
prod_cat.head()

(883868, 7)


Unnamed: 0,product_id,product_title,product_description,product_bullet_point,product_brand,product_color_name,product_locale
0,B0188A3QRM,"Amazon Basics Woodcased #2 Pencils, Unsharpene...",,144 woodcase #2 HB pencils made from high-qual...,Amazon Basics,Yellow,us
1,B075VXJ9VG,"BAZIC Pencil #2 HB Pencils, Latex Free Eraser,...",<p><strong>BACK TO BAZIC</strong></p><p>Our go...,&#11088; UN-SHARPENED #2 PREMIUM PENCILS. Each...,BAZIC Products,12-count,us
2,B07G7F6JZ6,Emraw Pre Sharpened Round Primary Size No 2 Ju...,<p><b>Emraw Pre-Sharpened #2 HB Wood Pencils -...,✓ PACK OF 8 NUMBER 2 PRESHARPENED BEGINNERS PE...,Emraw,Yellow,us
3,B07JZJLHCF,Emraw Pre Sharpened Triangular Primary Size No...,<p><b>Emraw Pre-Sharpened #2 HB Wood Pencils -...,✓ PACK OF 6 NUMBER 2 PRESHARPENED BEGINNERS PE...,Emraw,Yellow,us
4,B07MGKC3DD,"BIC Evolution Cased Pencil, #2 Lead, Gray Barr...",,Premium #2 HB lead pencils with break-resistan...,Design House,Gray,us


Preliminary data analysis is conducted on product catalogue data to check for nas, duplicate product ids and the texts in different languages. The product title column has some null values.

In [5]:
print(prod_cat.product_id.isna().sum())
print(prod_cat.product_title.isna().sum())
print(prod_cat.product_description.isna().sum())
print(prod_cat.product_bullet_point.isna().sum())
print(prod_cat.product_brand.isna().sum())
print(prod_cat.product_color_name.isna().sum())
print(prod_cat.product_locale.isna().sum())

0
149
437158
144672
75941
350208
0


In [6]:
prod_cat.product_locale.value_counts()

us    482198
jp    233852
es    167818
Name: product_locale, dtype: int64

In [8]:
prod_cat[prod_cat.duplicated('product_id')].head()

Unnamed: 0,product_id,product_title,product_description,product_bullet_point,product_brand,product_color_name,product_locale
317275,B00OVQUQHG,"Aunt Jackie's Curl La La, Crema para rizos - 4...",,Recomendado para cabello rizado y cabello afro...,Aunt Jackie's,Básico,es
317530,B00II020CU,Hape Estudio de Dibujo portátil - Juguete Gala...,,CABALLETE PARA NIÑOS CON DOS CARAS: Práctico c...,Hape,,es
317719,B07KXQ12CY,zdyCGTime Balun HD Cat5 RJ45 a BNC Video Balun...,"Tamaño: 16,3 cm y 17 cm.<br>Color:Negro.<br>Ca...","Tamaño: (16,3 cm) y 17 cm. Color: negro. Canti...",zdyCGTime,negro,es
317765,B015DU4VSI,VINCIGANT Candelabros de Crystal Portavelas Bo...,,★ DISEÑO ELEGANTG: este espectacular candelabr...,VINCIGANT,Plata,es
317886,B00ZY65OSI,"Bioderma, Gel y jabón - 1L / 33.80 fl.oz.",,Gel de baño Bioderma\nProducto de alta calidad...,Bioderma,Único,es


In [9]:
prod_cat[prod_cat.product_locale == 'es'].head()

Unnamed: 0,product_id,product_title,product_description,product_bullet_point,product_brand,product_color_name,product_locale
317269,B0038TVH3Y,"Shea Moisture Coco y Hibiscus Curl Smoothie, 1...",,Mascarilla para el cabello rizado\nOfrece prot...,SHEA MOISTURE,,es
317270,B00449W12S,Cantu 856017000126 acondicionador,,Dirigido a las mujeres\nEfecto acondicionador ...,CANTU,BLANCO,es
317271,B004JMXLLK,Revlon Professional ProYou Care Activador de R...,,Aporta definición\nProtege de la humedad y el ...,Revlon Professional ProYou Care,Purple,es
317272,B008D5I61Y,"Cantu, Mascarilla de pelo (manteca de karité, ...",,Define rizos sin pesarse el cabello\nHidrata y...,CANTU,BLANCO,es
317273,B009AZ3WH4,Cantu Crema Capilar para Cabello Rizado - 355 ...,,Marca - cantu\nTipo de producto - Crema capila...,CANTU,Ivory,es


In [10]:
prod_cat[prod_cat.product_locale == 'jp'].head()

Unnamed: 0,product_id,product_title,product_description,product_bullet_point,product_brand,product_color_name,product_locale
426936,B00FW60P84,ゼロの焦点,,,,,jp
426937,B00MFBKQKG,第3話,,,,,jp
426938,B075QW7SBN,第08話「総理暗殺未遂事件」10年前の真実,,,,,jp
426939,B07H7WY84W,第67話　さらば！ 我が師よ我が友よ,,,,,jp
426940,B07HR6BC88,第06話,,,,,jp


Similarly, preliminary data analysis is conducted on train data. The train data has no nulls. Using the chart feature on colab, the distributions of English data is checked.

In [6]:
data_path = 'train-v0.3.csv'
data = pd.read_csv(data_path)
print(data.shape)
data.head()

(781744, 5)


Unnamed: 0,query_id,query,query_locale,product_id,esci_label
0,0,# 2 pencils not sharpened,us,B0000AQO0O,exact
1,0,# 2 pencils not sharpened,us,B0002LCZV4,exact
2,0,# 2 pencils not sharpened,us,B00125Q75Y,exact
3,0,# 2 pencils not sharpened,us,B001AZ1D3C,exact
4,0,# 2 pencils not sharpened,us,B001B097KC,exact


In [22]:
# Check nulls
print(data.query_id.isna().sum())
print(data['query'].isna().sum())
print(data.query_locale.isna().sum())
print(data.product_id.isna().sum())
print(data.esci_label.isna().sum())

0
0
0
0
0


In [6]:
# Check the count of esci labels per class
data.esci_label.value_counts()

exact         341170
substitute    267963
irrelevant    132057
complement     40554
Name: esci_label, dtype: int64

In [18]:
# Check the count of rows per locale
data.query_locale.value_counts()

us    419730
jp    209094
es    152920
Name: query_locale, dtype: int64

In [6]:
# Check data for Spanish locale
data[data.query_locale == 'es'].head()

Unnamed: 0,query_id,query,query_locale,product_id,esci_label
419730,18848,. leave in pelo rizado,es,B000TFADXA,substitute
419731,18848,. leave in pelo rizado,es,B0038TVH3Y,substitute
419732,18848,. leave in pelo rizado,es,B00449W12S,exact
419733,18848,. leave in pelo rizado,es,B004JMXLLK,substitute
419734,18848,. leave in pelo rizado,es,B008D5I61Y,substitute


In [7]:
## English data
data[data.query_locale == 'us']

Unnamed: 0,query_id,query,query_locale,product_id,esci_label
0,0,# 2 pencils not sharpened,us,B0000AQO0O,exact
1,0,# 2 pencils not sharpened,us,B0002LCZV4,exact
2,0,# 2 pencils not sharpened,us,B00125Q75Y,exact
3,0,# 2 pencils not sharpened,us,B001AZ1D3C,exact
4,0,# 2 pencils not sharpened,us,B001B097KC,exact
...,...,...,...,...,...
419725,18848,zephyr polishing kit,us,B081SYK6R2,irrelevant
419726,18848,zephyr polishing kit,us,B087HZQY4V,complement
419727,18848,zephyr polishing kit,us,B08H4ZJ6Q1,substitute
419728,18848,zephyr polishing kit,us,B08LSN8MT8,substitute


In [47]:
#Japanese data
data[data.query_locale == 'jp'].head()

Unnamed: 0,query_id,query,query_locale,product_id,esci_label,score
572650,26004,0係,jp,B00FNJ1CTQ,substitute,0.1
572651,26004,0係,jp,B00FW60P84,irrelevant,0.0
572652,26004,0係,jp,B00GAYCNDM,substitute,0.1
572653,26004,0係,jp,B00MFBKQKG,substitute,0.1
572654,26004,0係,jp,B01G1YXU28,irrelevant,0.0


In [20]:
#Check number of unique queries
data.query_id.nunique()

33804

In [10]:
# Check number of recommendations to be ranked per query
data.groupby('query_id')['esci_label'].count().sort_values( ascending=False)

query_id
2712     188
20316    136
10069    109
3316      98
33159     96
        ... 
27712      8
3991       8
28727      8
18720      8
23604      8
Name: esci_label, Length: 33804, dtype: int64

In [14]:
# Examine query id 0 and corresponding product titles and esci labels
data[data['query_id'] == 0].merge(prod_cat,on='product_id')[['query_id','query','product_id','product_title','esci_label']]

Unnamed: 0,query_id,query,product_id,product_title,esci_label
0,0,# 2 pencils not sharpened,B0000AQO0O,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",exact
1,0,# 2 pencils not sharpened,B0002LCZV4,"TICONDEROGA Tri-Write Triangular Pencils, Stan...",exact
2,0,# 2 pencils not sharpened,B00125Q75Y,"TICONDEROGA Pencils, Wood-Cased, Unsharpened, ...",exact
3,0,# 2 pencils not sharpened,B001AZ1D3C,"Ticonderoga Pencils, Wood-Cased Graphite #2 HB...",exact
4,0,# 2 pencils not sharpened,B001B097KC,"Ticonderoga Laddie Pencils, Wood-Cased #2 HB S...",exact
5,0,# 2 pencils not sharpened,B003JFL1WY,"iScholar Gross Pack Pencils, #2, Yellow, Box o...",exact
6,0,# 2 pencils not sharpened,B004X4KRPM,"Ticonderoga Pencils, Wood-Cased, Graphite #2 H...",substitute
7,0,# 2 pencils not sharpened,B004X4KRW0,"Dixon No. 2 Yellow Pencils, Wood-Cased, Black ...",exact
8,0,# 2 pencils not sharpened,B00DZB6SIE,Maped Black'Peps Triangular Graphite #2 Pencil...,irrelevant
9,0,# 2 pencils not sharpened,B00OFNI9VK,"Ticonderoga Wood-Cased Pencils, #2 HB Soft, Pr...",exact


The labels need to converted to scores to be used in the model. Using the relevance score given on task page as below:
'exact':1,'substitute':0.1, 'complement':0.01,'irrelevant':0

In [7]:
score_dict = {'exact':1,'substitute':0.1, 'complement':0.01,'irrelevant':0}
data['score'] = data.esci_label.apply(lambda x: score_dict[x])
data.head()

Unnamed: 0,query_id,query,query_locale,product_id,esci_label,score
0,0,# 2 pencils not sharpened,us,B0000AQO0O,exact,1.0
1,0,# 2 pencils not sharpened,us,B0002LCZV4,exact,1.0
2,0,# 2 pencils not sharpened,us,B00125Q75Y,exact,1.0
3,0,# 2 pencils not sharpened,us,B001AZ1D3C,exact,1.0
4,0,# 2 pencils not sharpened,us,B001B097KC,exact,1.0


Filtering data on locale = 'us' to get only English queries and products. Merging the training with product data to create the dataset with required columns

In [8]:
en_data = data[data.query_locale == 'us']
en_data.shape

(419730, 6)

In [9]:
prod_en = prod_cat[prod_cat['product_locale'] == 'us']
prod_en.shape

(482198, 7)

In [10]:
en_data = en_data.merge(prod_en, on='product_id',how='inner')
en_data.head()

Unnamed: 0,query_id,query,query_locale,product_id,esci_label,score,product_title,product_description,product_bullet_point,product_brand,product_color_name,product_locale
0,0,# 2 pencils not sharpened,us,B0000AQO0O,exact,1.0,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",,Round wood pencil with latex-free eraser\nFini...,Ticonderoga,Yellow,us
1,8799,pencils for kindergarteners,us,B0000AQO0O,exact,1.0,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",,Round wood pencil with latex-free eraser\nFini...,Ticonderoga,Yellow,us
2,14844,#2 dixon oriole pencils not sharpened,us,B0000AQO0O,substitute,0.1,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",,Round wood pencil with latex-free eraser\nFini...,Ticonderoga,Yellow,us
3,15768,#2 pencils with erasers sharpened not soft,us,B0000AQO0O,substitute,0.1,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",,Round wood pencil with latex-free eraser\nFini...,Ticonderoga,Yellow,us
4,16972,classroom friendly supplies pencil sharpener,us,B0000AQO0O,irrelevant,0.0,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",,Round wood pencil with latex-free eraser\nFini...,Ticonderoga,Yellow,us


The product title has some nulls. We cannot use nulls in the model. But before deleting the rows, from analysis title is containing text from product brand and pproduct_color_name columns. So where ever available, combining these two columns and using as product title when null. Removing null title rows from data after imputation.

In [15]:
en_data[en_data.product_title.isna()].head()

Unnamed: 0,query_id,query,query_locale,product_id,esci_label,score,product_title,product_description,product_bullet_point,product_brand,product_color_name,product_locale
17402,330,11 assembly replacement screen and speaker wit...,us,B015MZJUXK,complement,0.01,,<b>Model:Only fit for iPhone 11 Pro screen rep...,1. Compatible with iPhone 11 Pro Screen Replac...,Oli & Ode,,us
19105,367,12 x 18 flag,us,B089FQ55NK,exact,1.0,,,,,Red,us
39300,861,air mattress,us,B07Q4M3HVH,exact,1.0,,,,,,us
40620,892,airpods pro,us,B089K2RHYB,exact,1.0,,,,Hegen,,us
68462,1719,black disposable towels,us,B07G4F7YQ1,exact,1.0,,,,,Black,us


In [11]:
en_data['product_meta'] = en_data['product_brand'].fillna('')+' ' +en_data['product_color_name'].fillna('')
en_data['product_title'] = en_data['product_title'].combine_first(en_data['product_meta'])
en_data = en_data[en_data['product_title'] != ' ']
en_data.shape

(419676, 13)

In [72]:
# Even with locale us texts from other languages are present

In [12]:
# Selecting the required columns
data_sub = en_data[['query_id','query','product_id','product_title','score']]
data_sub.head()

Unnamed: 0,query_id,query,product_id,product_title,score
0,0,# 2 pencils not sharpened,B0000AQO0O,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",1.0
1,8799,pencils for kindergarteners,B0000AQO0O,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",1.0
2,14844,#2 dixon oriole pencils not sharpened,B0000AQO0O,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",0.1
3,15768,#2 pencils with erasers sharpened not soft,B0000AQO0O,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",0.1
4,16972,classroom friendly supplies pencil sharpener,B0000AQO0O,"Ticonderoga Beginner Pencils, Wood-Cased #2 HB...",0.0


As the features are text, some common text pre-processing steps are needed before using in the model.


*   Text is converted to lower case
*   In product title, column some numbers are present indicating code like (133308), (14578) etc. These are removed
*   . is used as decimal point indicating prices or other units. Except . all other punctuations are replaced with space

In [13]:
#Text to lower case. Remove numbers in () from product title
data_sub['query'] = data_sub['query'].str.lower()
data_sub['product_title'] = data_sub['product_title'].str.lower().str.replace('\s*\([0-9]+\)\s*','')
data_sub[['query','product_title']].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_sub['query'] = data_sub['query'].str.lower()
  data_sub['product_title'] = data_sub['product_title'].str.lower().str.replace('\s*\([0-9]+\)\s*','')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_sub['product_title'] = data_sub['product_title'].str.lower().str.replace('\s*\([0-9]+\)\s*','')


Unnamed: 0,query,product_title
0,# 2 pencils not sharpened,"ticonderoga beginner pencils, wood-cased #2 hb..."
1,pencils for kindergarteners,"ticonderoga beginner pencils, wood-cased #2 hb..."
2,#2 dixon oriole pencils not sharpened,"ticonderoga beginner pencils, wood-cased #2 hb..."
3,#2 pencils with erasers sharpened not soft,"ticonderoga beginner pencils, wood-cased #2 hb..."
4,classroom friendly supplies pencil sharpener,"ticonderoga beginner pencils, wood-cased #2 hb..."


In [14]:
# Create custom punctuation - remove ., add ”’
custom_punctuation = string.punctuation
custom_punctuation = custom_punctuation.replace('.','')
custom_punctuation = custom_punctuation + '”’'
custom_punctuation

'!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~”’'

In [15]:
translator = str.maketrans(custom_punctuation, ' '*len(custom_punctuation))

In [16]:
# Remove punctuation
data_sub['query'] = data_sub['query'].apply(lambda x: x.translate(translator))
data_sub['product_title'] = data_sub['product_title'].apply(lambda x: x.translate(translator))
data_sub[['query','product_title']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_sub['query'] = data_sub['query'].apply(lambda x: x.translate(translator))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_sub['product_title'] = data_sub['product_title'].apply(lambda x: x.translate(translator))


Unnamed: 0,query,product_title
0,2 pencils not sharpened,ticonderoga beginner pencils wood cased 2 hb...
1,pencils for kindergarteners,ticonderoga beginner pencils wood cased 2 hb...
2,2 dixon oriole pencils not sharpened,ticonderoga beginner pencils wood cased 2 hb...
3,2 pencils with erasers sharpened not soft,ticonderoga beginner pencils wood cased 2 hb...
4,classroom friendly supplies pencil sharpener,ticonderoga beginner pencils wood cased 2 hb...
...,...,...
419725,zephyr polishing kit,15 buff rake for cleaning compound from buffi...
419726,zephyr polishing kit,buffing wheel rake remove residual compounds m...
419727,zephyr polishing kit,zephyr orange ruffy heavy cut clear dip airway...
419728,zephyr polishing kit,kshineni car foam drill 3 inch buffing pad 11 ...


We need unique words to be used as vocabulary to construct text embedding layers. Getting unque words in queries and products using TfidfVectorizer

In [17]:
query_words = data_sub['query'].tolist()
tfidf_query = TfidfVectorizer()
tfidf_query.fit_transform(query_words)
unique_query_words = tfidf_query.get_feature_names_out()
#query_idf_weights = tfidf_query.idf_
print(unique_query_words[:5])

['00' '000' '007' '00m' '01']


In [18]:
product_words = data_sub['product_title'].tolist()
tfidf_product = TfidfVectorizer()
tfidf_product.fit_transform(product_words)
unique_product_words = tfidf_product.get_feature_names_out()
#product_idf_weights = tfidf_product.idf_
print(unique_product_words[:5])

['00' '000' '0000' '00000' '000000']


In [19]:
# Check number of unique words
len(unique_query_words), len(unique_product_words)

(15488, 177417)

To prepare the dataset for tensorflow model, for each query, random n items from recommendations are collected into a list and these lists are optimized for ranking. Per query 50 lists are created.

In [20]:
#Group data by query and collect prouct title and score
group_data = data_sub.groupby('query').agg({'product_title':list,'score':list})
group_data.head()

Unnamed: 0_level_0,product_title,score
query,Unnamed: 1_level_1,Unnamed: 2_level_1
i m not myself these days josh kilmer purcell,[feelin good tees people should seriously sarc...,"[0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, ..."
m renee allen do not disturb,[sennheiser rs120 ii on ear wireless rf headph...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 1.0, ..."
10 self seal envelopes without window,[ 10 security tinted self seal envelopes no ...,"[1.0, 0.1, 1.0, 0.0, 0.0, 0.1, 1.0, 1.0, 1.0, ..."
2 pencils not sharpened,[ticonderoga beginner pencils wood cased 2 h...,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.1, 1.0, 0.0, ..."
34 bed swing without stand 34,[full motion tv wall mount bracket dual articu...,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ..."


In [31]:
# To create tf dataset
tensor_slices = {"query": [], "product": [], "score": []}

In [22]:
#collect product title and score into dictionaries with query as key
pdict = group_data['product_title'].to_dict()
sdict = group_data['score'].to_dict()
i = 0
for k,v in pdict.items():
  print(k, v)
  i += 1
  if i == 5:
    break

      i m not myself these days josh kilmer purcell ['feelin good tees people should seriously sarcastic funny t shirt xl black', 'feelin good tees my opinion offended you adult humor t shirt xl black', 'i m fine graphic novelty sarcastic funny t shirt l ash', 'currently unsupervised novelty graphic sarcastic funny t shirt xl black', 'i got your back graphic novelty sarcastic funny t shirt xl black1', 'apple cider vinegar gummy vitamins by goli nutrition   immunity   detox    1 pack  60 count  with the mother  gluten free  vegan  vitamin b9  b12  beetroot  pomegranate ', 'a word boring people use graphic novelty sarcastic funny t shirt xl black', 'just pretend i m not here graphic novelty sarcastic funny t shirt xl black', 'you know the little thing cool graphic sarcastic sarcasm novelty funny t shirt xl black', 'when this virus is over graphic novelty sarcastic funny t shirt l black', 'disagree graphic novelty sarcastic funny t shirt xl black', 'never forget graphic novelty sarcastic 

In [32]:
# Prepare dataset
lists_per_query = 50
list_size = 10  #5
for k, v in pdict.items():
  for _ in range(lists_per_query):
    if len(v) >= list_size:
      tensor_slices['query'].append(k)
      indices = random.sample(range(0, len(v)),list_size)
      tensor_slices["product"].append([v[i] for i in range(len(v)) if i in indices ])
      tensor_slices["score"].append([sdict[k][i] for i in range(len(v)) if i in indices ])

In [None]:
# Convert to tf dataset
ds = tf.data.Dataset.from_tensor_slices(tensor_slices)

In [33]:
for ex in ds.take(5):
  print(ex)

{'query': <tf.Tensor: shape=(), dtype=string, numpy=b'      i m not myself these days josh kilmer purcell'>, 'product': <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'feelin good tees people should seriously sarcastic funny t shirt xl black',
       b'currently unsupervised novelty graphic sarcastic funny t shirt xl black',
       b'i got your back graphic novelty sarcastic funny t shirt xl black1',
       b'you know the little thing cool graphic sarcastic sarcasm novelty funny t shirt xl black',
       b'when this virus is over graphic novelty sarcastic funny t shirt l black'],
      dtype=object)>, 'score': <tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 1., 0., 1., 0.], dtype=float32)>}
{'query': <tf.Tensor: shape=(), dtype=string, numpy=b'      i m not myself these days josh kilmer purcell'>, 'product': <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'i m fine graphic novelty sarcastic funny t shirt l ash',
       b'apple cider vinegar gummy vitamins by goli nut

Split the data into 80-20 train-test. Shuffle the data for training.


In [None]:
n = len(tensor_slices['query'])

In [None]:
train_size = round(n*0.8)
test_size = n - train_size
print(train_size, test_size)

In [None]:
tf.random.set_seed(42)

# Split between train and tests sets, as before.
shuffled = ds.shuffle(n, seed=42)

train = shuffled.take(train_size)
test = shuffled.skip(train_size).take(test_size)

#### Model
Create Query Model to generate query embeddings from query text. Dimension is 64.
Create Product Model to generate product embeddings from product title text. Dimension is 64
Ranking model of type tfrs model is created to predict relevance scores while minimizing ListMLE loss.

In [None]:
class QueryModel(tf.keras.Model):
  def __init__(self):
    super().__init__()
    embedding_dimension = 64
    max_tokens = 200000

    self.query_vectorizer = tf.keras.layers.TextVectorization(standardize=None, split='whitespace', max_tokens=max_tokens, ngrams=3,
                                                              output_mode='int', vocabulary=unique_query_words) #,idf_weights=query_idf_weights)
    self.query_embedding = tf.keras.Sequential([self.query_vectorizer,
                                                tf.keras.layers.Embedding(max_tokens,embedding_dimension,mask_zero=True),
                                                tf.keras.layers.GlobalAveragePooling1D(),])

  def call(self, inputs):
    return self.query_embedding(inputs)


In [None]:
class ProductModel(tf.keras.Model):
  def __init__(self):
    super().__init__()
    embedding_dimension = 64
    max_tokens = 200000

    self.product_vectorizer = tf.keras.layers.TextVectorization(standardize=None, split='whitespace', max_tokens=max_tokens, ngrams=3,
                                                              output_mode='int', vocabulary=unique_product_words, output_sequence_length=20)   #,idf_weights=product_idf_weights)
    self.product_embedding = tf.keras.Sequential([self.product_vectorizer,
                                                tf.keras.layers.Embedding(max_tokens,embedding_dimension,mask_zero=True), ])
                                                # tf.keras.layers.GlobalAveragePooling1D(),])

  def call(self, inputs, pool_size):
    avg_layer = tf.keras.layers.AveragePooling2D(pool_size=pool_size,strides=1,padding='valid',)
    len_inputs=tf.shape(inputs)[0]
    return avg_layer(self.product_embedding(tf.reshape(inputs,[len_inputs,5,1])))


In [None]:
class RankingModel(tfrs.Model):

  def __init__(self, loss):
    super().__init__()
    embedding_dimension = 64
    max_tokens = 200000

    # Compute embeddings for queries.

    self.query_embeddings = QueryModel()
    # Compute embeddings for products.

    self.product_embeddings = ProductModel()

    # Compute predictions.
    self.score_model = tf.keras.Sequential([
      # Learn multiple dense layers.
      tf.keras.layers.Dense(256, activation="relu"),
      tf.keras.layers.Dense(128, activation="relu"),
      # Make rating predictions in the final layer.
      tf.keras.layers.Dense(1)
    ])

    self.task = tfrs.tasks.Ranking(
      loss=loss,
      metrics=[
        tfr.keras.metrics.NDCGMetric(name="ndcg_metric"),
        tf.keras.metrics.RootMeanSquaredError()
      ]
    )

  def call(self, features):
    # We first convert the query features into embeddings.
    query_embeddings = self.query_embeddings(features["query"])

    # product features into embeddings
    product_embeddings = self.product_embeddings(features["product"], pool_size=(1,20))

    # We want to concatenate query embeddings with product emebeddings to pass
    # them into the ranking model. To do so, we need to reshape the query
    # embeddings to match the shape of product embeddings.
    list_length = features["product"].shape[1]
    query_embedding_repeated = tf.repeat(
        tf.expand_dims(tf.expand_dims(query_embeddings, 1),1), [list_length], axis=1)

    # Once reshaped, we concatenate and pass into the dense layers to generate
    # predictions.
    concatenated_embeddings = tf.concat(
        [query_embedding_repeated, product_embeddings], 3)

    return self.score_model(concatenated_embeddings)

  def compute_loss(self, features, training=False):
    labels = features.pop("score")

    scores = self(features)

    return self.task(
        labels=labels,
        predictions=tf.squeeze(tf.squeeze(scores, axis=-1),axis=-1),
    )

Batch size is 2048 for training. Trained for 30 epochs. Tested on validation set with batch size 4096

In [1]:
epochs = 30

cached_train = shuffled.shuffle(train_size).batch(2048).cache()
cached_test = test.batch(4096).cache()

NameError: ignored

In [43]:
listwise_model = RankingModel(tfr.keras.losses.ListMLELoss())
listwise_model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

In [44]:
listwise_model.fit(cached_train, epochs=epochs)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.src.callbacks.History at 0x7f81ac2f2980>

Model Evaluated on NDCG and RMSE.
The metrics are as below:
{'ndcg_metric': 0.9731355905532837,
 'root_mean_squared_error': 5.092366695404053,
 'loss': 3.4197535514831543,
 'regularization_loss': 0,
 'total_loss': 3.4197535514831543}

In [45]:
listwise_model.evaluate(cached_test, return_dict=True)



{'ndcg_metric': 0.9731355905532837,
 'root_mean_squared_error': 5.092366695404053,
 'loss': 3.4197535514831543,
 'regularization_loss': 0,
 'total_loss': 3.4197535514831543}