## Before you begin
- Connect to a GPU-accelerated instance if one is available
  - A CPU instance also works, but fine-tuning can take significantly longer
  - Colab offers GPU instances, but they typically have a usage time limit
- Make sure dependency versions are pinned correctly
  - Otherwise, libraries can be incompatible with one another
  - You may need to pip install specific versions of PyTorch (torch==1.13.0), Pandas (pandas==1.5.0), and NumPy (numpy==1.23.0)
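
One way to confirm the environment matches these pins before running anything else is to compare the installed versions against the expected ones. The sketch below is an illustration, not part of the original notebook: the helper names `check_pins` and `gpu_available` are hypothetical, and the pin values are simply the ones listed above.

```python
from importlib import metadata

# Pins taken from the list above; adjust to your environment.
PINS = {"torch": "1.13.0", "pandas": "1.5.0", "numpy": "1.23.0"}

def check_pins(pins):
    """Return {package: status}: 'ok', 'missing', or the mismatching installed version."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # torch builds often carry a local suffix such as "+cu116"; compare the public part only
        report[name] = "ok" if installed.split("+")[0] == wanted else installed
    return report

def gpu_available():
    """True if PyTorch is installed and sees a CUDA device, else False."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()

print(check_pins(PINS))
print("GPU available:", gpu_available())
```

Any entry that is not `ok` points at a package to reinstall with the pinned version.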

In [None]:
# Install dependencies (watch out for versions)
!pip install datasets==2.8.0
!pip install transformers==4.26.0
!pip install huggingface-hub==0.13.0
!pip install rouge_score==0.1.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets==2.8.0
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.13.1-py3-none-any.whl (199 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)

In [None]:
# Download the T5-large tokenizer and model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Downloading pytorch_model.bin:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [None]:
# If using Colab, run this to select CSVs to upload 
from google.colab import files
uploaded = files.upload()

Saving T5P5_training_data_Arabic.csv to T5P5_training_data_Arabic.csv
Saving T5P5_training_data_English.csv to T5P5_training_data_English.csv


In [None]:
# Read CSVs with training data (only one of the two is used, depending on the language of the recommendation system)
# Each row holds the list of sports equipment a customer has purchased (input) and the next item to recommend (output)
# Change file path as needed
import io
import pandas as pd
df_english_training = pd.read_csv(io.BytesIO(uploaded['T5P5_training_data_English.csv']))
df_arabic_training = pd.read_csv(io.BytesIO(uploaded['T5P5_training_data_Arabic.csv']))
# Dataset is now stored in a Pandas Dataframe

In [None]:
df_english_training

Unnamed: 0,Input,Output
0,"Soccer Jersey, Soccer Ball, Soccer Cleats, Goa...",Soccer Shin Guards
1,"Basketball Jersey, Basketball, Basketball Shoe...",Basketball Shorts
2,"Football Jersey, Football, Football Cleats, Fo...",Football Helmet
3,"Baseball Jersey, Baseball, Baseball Cleats, Ba...",Baseball Glove
4,"Tennis Shirt, Tennis Ball, Tennis Shoes",Tennis Racket
...,...,...
116,"Soccer Goal Post, Soccer Ball, Soccer Cleats, ...",Soccer Jersey
117,"Basketball Jersey, Basketball, Basketball Shoes",Basketball Arm Sleeve
118,"Basketball Jersey, Basketball, Basketball Arm ...",Basketball Shoes
119,"Basketball Jersey, Basketball Arm Sleeve, Bask...",Basketball


In [None]:
df_arabic_training

Unnamed: 0,Input,Output
0,قميص كرة القدم ، كرة كرة القدم ، مرابط كرة الق...,حراس كرة القدم
1,قميص كرة السلة وكرة السلة وأحذية كرة السلة وأك...,شورت كرة السلة
2,قميص كرة القدم ، كرة القدم ، مرابط كرة القدم ،...,خوذة كرة القدم
3,قميص البيسبول ، البيسبول ، مرابط البيسبول ، قب...,قفاز البيسبول
4,قميص التنس ، كرة تنس ، أحذية تنس,مضرب التنس
...,...,...
116,هدف كرة القدم ، كرة كرة القدم ، مرابط كرة القد...,قميص لكرة القدم
117,قميص كرة السلة ، كرة السلة ، أحذية كرة السلة,دقة ذراع كرة السلة
118,قميص كرة السلة وكرة السلة وأكمام ذراع كرة السلة,أحذية كرة السلة
119,قميص كرة السلة ، وأكمام ذراع كرة السلة ، وأحذي...,كرة سلة


In [None]:
# Full lists of inventory (sports equipment) available - in English and in Arabic

items_english_list = [
    "Soccer Jersey", "Basketball Jersey", "Football Jersey", "Baseball Jersey", "Tennis Shirt", "Hockey Jersey",
    "Soccer Ball", "Basketball", "Football", "Baseball", "Tennis Ball", "Hocket Puck",
    "Soccer Cleats", "Basketball Shoes", "Football Cleats", "Baseball Cleats", "Tennis Shoes", "Hockey Helmet",
    "Goalie Gloves", "Basketball Arm Sleeve", "Football Shoulder Pads", "Baseball Cap", "Tennis Racket", "Hockey Skates",
    "Soccer Goal Post", "Basketball Hoop", "Football Helmet", "Baseball Bat", "Hockey Stick",
    "Soccer Cones", "Basketball Shorts", "Baseball Glove", "Hockey Pads",
    "Soccer Shin Guards",
    "Soccer Shorts",
]

items_arabic_list = [
    "قميص كرة القدم", "قميص كرة السلة", "قميص كرة القدم الأمريكية", "قميص بيسبول", "قميص التنس", "قميص الهوكي",
    "كرة كرة القدم", "كرة سلة", "كرة القدم الأمريكية", "البيسبول", "كرة التنس", "قرص الهوكي",
    "مرابط كرة القدم", "أحذية كرة السلة", "المرابط كرة القدم الأمريكية", "مرابط البيسبول", "أحذية تنس", "خوذة الهوكي",
    "قفازات حارس المرمى", "الأكمام ذراع كرة السلة", "وسادات الكتف لكرة القدم الأمريكية", "قبعة البيسبول", "مضرب التنس", "الزلاجات الهوكي",
    "مرمى كرة القدم", "كرة السلة هوب", "خوذة كرة القدم الأمريكية", "مضرب البيسبول", "عصا الهوكي",
    "مخاريط كرة القدم", "شورت كرة السلة", "قفاز البيسبول", "وسادات الهوكي",
    "حراس كرة القدم",
    "شورت كرة القدم",
]

print(len(items_arabic_list))
print(items_arabic_list)

print(len(items_english_list))
print(items_english_list)

35
['قميص كرة القدم', 'قميص كرة السلة', 'قميص كرة القدم الأمريكية', 'قميص بيسبول', 'قميص التنس', 'قميص الهوكي', 'كرة كرة القدم', 'كرة سلة', 'كرة القدم الأمريكية', 'البيسبول', 'كرة التنس', 'قرص الهوكي', 'مرابط كرة القدم', 'أحذية كرة السلة', 'المرابط كرة القدم الأمريكية', 'مرابط البيسبول', 'أحذية تنس', 'خوذة الهوكي', 'قفازات حارس المرمى', 'الأكمام ذراع كرة السلة', 'وسادات الكتف لكرة القدم الأمريكية', 'قبعة البيسبول', 'مضرب التنس', 'الزلاجات الهوكي', 'مرمى كرة القدم', 'كرة السلة هوب', 'خوذة كرة القدم الأمريكية', 'مضرب البيسبول', 'عصا الهوكي', 'مخاريط كرة القدم', 'شورت كرة السلة', 'قفاز البيسبول', 'وسادات الهوكي', 'حراس كرة القدم', 'شورت كرة القدم']
35
['Soccer Jersey', 'Basketball Jersey', 'Football Jersey', 'Baseball Jersey', 'Tennis Shirt', 'Hockey Jersey', 'Soccer Ball', 'Basketball', 'Football', 'Baseball', 'Tennis Ball', 'Hocket Puck', 'Soccer Cleats', 'Basketball Shoes', 'Football Cleats', 'Baseball Cleats', 'Tennis Shoes', 'Hockey Helmet', 'Goalie Gloves', 'Basketball Arm Sleeve', 

In [None]:
df_english_training['Input'].iloc[0]

'Soccer Jersey, Soccer Ball, Soccer Cleats, Goalie Gloves, Soccer Goal Post, Soccer Cones'

In [None]:
# Functions to select the list of products (sports equipment) not yet purchased for the given customer...
# Will use these as candidates for next recommended item

def get_items_not_purchased_yet_english(purchase_history):
  items_not_purchased_yet = []

  for item in items_english_list:
    # Substring test against the purchase_history string; note that a short name
    # such as "Basketball" also matches inside "Basketball Jersey"
    if item not in purchase_history:
      items_not_purchased_yet.append(item)
  return items_not_purchased_yet

def get_items_not_purchased_yet_arabic(purchase_history):
  items_not_purchased_yet = []

  for item in items_arabic_list:
    # Same substring test as the English version, with the same caveat
    if item not in purchase_history:
      items_not_purchased_yet.append(item)
  return items_not_purchased_yet
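
Because the check above is a substring test, an item whose name is contained in another purchased item's name (for example "Basketball" inside "Basketball Jersey") is treated as already purchased. A minimal sketch of an exact-match alternative for the English data follows. It assumes purchases are separated by commas; the function name and the shortened inventory are hypothetical. The Arabic rows shown earlier mix separators, so they would need their own parsing.

```python
def get_unpurchased_exact(purchase_history, inventory):
    """Return inventory items whose full name is not among the comma-separated purchases."""
    purchased = {item.strip() for item in purchase_history.split(",")}
    return [item for item in inventory if item not in purchased]

# Shortened, illustrative inventory (the notebook's full list has 35 items)
inventory = ["Basketball Jersey", "Basketball", "Basketball Shoes", "Basketball Arm Sleeve"]
history = "Basketball Jersey, Basketball Arm Sleeve, Basketball Shoes"

# Exact matching keeps "Basketball" as a candidate
print(get_unpurchased_exact(history, inventory))

# Substring test for comparison: "Basketball" is dropped
print([item for item in inventory if item not in history])
```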

In [None]:
# Functions that will build strings for list of candidates for recommendation and list of items purchased

def modify_unpurchased_items_list_english(Unpurchased_items_list):
  concatenated_list = ', '.join(Unpurchased_items_list) # Join into a string with commas in between
  concatenated_list = concatenated_list.rstrip(',') # Safeguard only: join() never leaves a trailing comma
  concatenated_list = "CANDIDATES FOR RECOMMENDATION: {" + concatenated_list + "}"
  return concatenated_list


def modify_purchased_items_english(purchase_history):
  purchase_history = "ITEMS PURCHASED: {" + purchase_history + "}"
  return purchase_history

def modify_unpurchased_items_list_arabic(Unpurchased_items_list):
  concatenated_list = ', '.join(Unpurchased_items_list) # Join into a string with commas in between
  concatenated_list = concatenated_list.rstrip(',') # Safeguard only: join() never leaves a trailing comma
  concatenated_list = "المرشحين للتوصية: {" + concatenated_list + "}"
  return concatenated_list


def modify_purchased_items_arabic(purchase_history):
  purchase_history = "عناصر تم شراؤها: {" + purchase_history + "}"
  return purchase_history

In [None]:
# Add columns with strings of items not yet purchased by the customer

df_english_training['Unpurchased_items'] = df_english_training['Input'].apply(get_items_not_purchased_yet_english)

df_arabic_training['Unpurchased_items'] = df_arabic_training['Input'].apply(get_items_not_purchased_yet_arabic)

In [None]:
len(df_arabic_training['Unpurchased_items'].iloc[0])

29

In [None]:
len(df_english_training['Unpurchased_items'].iloc[0])

29

In [None]:
print(df_arabic_training['Unpurchased_items'].iloc[0])

['قميص كرة السلة', 'قميص كرة القدم الأمريكية', 'قميص بيسبول', 'قميص التنس', 'قميص الهوكي', 'كرة سلة', 'كرة القدم الأمريكية', 'البيسبول', 'كرة التنس', 'قرص الهوكي', 'أحذية كرة السلة', 'المرابط كرة القدم الأمريكية', 'مرابط البيسبول', 'أحذية تنس', 'خوذة الهوكي', 'الأكمام ذراع كرة السلة', 'وسادات الكتف لكرة القدم الأمريكية', 'قبعة البيسبول', 'مضرب التنس', 'الزلاجات الهوكي', 'كرة السلة هوب', 'خوذة كرة القدم الأمريكية', 'مضرب البيسبول', 'عصا الهوكي', 'شورت كرة السلة', 'قفاز البيسبول', 'وسادات الهوكي', 'حراس كرة القدم', 'شورت كرة القدم']


In [None]:
print(df_english_training['Unpurchased_items'].iloc[0])

['Basketball Jersey', 'Football Jersey', 'Baseball Jersey', 'Tennis Shirt', 'Hockey Jersey', 'Basketball', 'Football', 'Baseball', 'Tennis Ball', 'Hocket Puck', 'Basketball Shoes', 'Football Cleats', 'Baseball Cleats', 'Tennis Shoes', 'Hockey Helmet', 'Basketball Arm Sleeve', 'Football Shoulder Pads', 'Baseball Cap', 'Tennis Racket', 'Hockey Skates', 'Basketball Hoop', 'Football Helmet', 'Baseball Bat', 'Hockey Stick', 'Basketball Shorts', 'Baseball Glove', 'Hockey Pads', 'Soccer Shin Guards', 'Soccer Shorts']


In [None]:
print(df_arabic_training['Input'].iloc[0])

قميص كرة القدم ، كرة كرة القدم ، مرابط كرة القدم ، قفازات حارس المرمى ، مرمى كرة القدم ، مخاريط كرة القدم


In [None]:
print(df_english_training['Input'].iloc[0])

Soccer Jersey, Soccer Ball, Soccer Cleats, Goalie Gloves, Soccer Goal Post, Soccer Cones


In [None]:
df_english_training['Input'].apply(modify_purchased_items_english).iloc[0]

'ITEMS PURCHASED: {Soccer Jersey, Soccer Ball, Soccer Cleats, Goalie Gloves, Soccer Goal Post, Soccer Cones}'

In [None]:
df_arabic_training['Input'].apply(modify_purchased_items_arabic).iloc[0]

'عناصر تم شراؤها: {قميص كرة القدم ، كرة كرة القدم ، مرابط كرة القدم ، قفازات حارس المرمى ، مرمى كرة القدم ، مخاريط كرة القدم}'

In [None]:
df_english_training['Unpurchased_items'].apply(modify_unpurchased_items_list_english).iloc[0]

'CANDIDATES FOR RECOMMENDATION: {Basketball Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Basketball, Football, Baseball, Tennis Ball, Hocket Puck, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Basketball Arm Sleeve, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shin Guards, Soccer Shorts}'

In [None]:
df_arabic_training['Unpurchased_items'].apply(modify_unpurchased_items_list_arabic).iloc[0]

'المرشحين للتوصية: {قميص كرة السلة, قميص كرة القدم الأمريكية, قميص بيسبول, قميص التنس, قميص الهوكي, كرة سلة, كرة القدم الأمريكية, البيسبول, كرة التنس, قرص الهوكي, أحذية كرة السلة, المرابط كرة القدم الأمريكية, مرابط البيسبول, أحذية تنس, خوذة الهوكي, الأكمام ذراع كرة السلة, وسادات الكتف لكرة القدم الأمريكية, قبعة البيسبول, مضرب التنس, الزلاجات الهوكي, كرة السلة هوب, خوذة كرة القدم الأمريكية, مضرب البيسبول, عصا الهوكي, شورت كرة السلة, قفاز البيسبول, وسادات الهوكي, حراس كرة القدم, شورت كرة القدم}'

In [None]:
df_arabic_training

Unnamed: 0,Input,Output,Unpurchased_items
0,قميص كرة القدم ، كرة كرة القدم ، مرابط كرة الق...,حراس كرة القدم,"[قميص كرة السلة, قميص كرة القدم الأمريكية, قمي..."
1,قميص كرة السلة وكرة السلة وأحذية كرة السلة وأك...,شورت كرة السلة,"[قميص كرة القدم, قميص كرة القدم الأمريكية, قمي..."
2,قميص كرة القدم ، كرة القدم ، مرابط كرة القدم ،...,خوذة كرة القدم,"[قميص كرة السلة, قميص كرة القدم الأمريكية, قمي..."
3,قميص البيسبول ، البيسبول ، مرابط البيسبول ، قب...,قفاز البيسبول,"[قميص كرة القدم, قميص كرة السلة, قميص كرة القد..."
4,قميص التنس ، كرة تنس ، أحذية تنس,مضرب التنس,"[قميص كرة القدم, قميص كرة السلة, قميص كرة القد..."
...,...,...,...
116,هدف كرة القدم ، كرة كرة القدم ، مرابط كرة القد...,قميص لكرة القدم,"[قميص كرة القدم, قميص كرة السلة, قميص كرة القد..."
117,قميص كرة السلة ، كرة السلة ، أحذية كرة السلة,دقة ذراع كرة السلة,"[قميص كرة القدم, قميص كرة القدم الأمريكية, قمي..."
118,قميص كرة السلة وكرة السلة وأكمام ذراع كرة السلة,أحذية كرة السلة,"[قميص كرة القدم, قميص كرة القدم الأمريكية, قمي..."
119,قميص كرة السلة ، وأكمام ذراع كرة السلة ، وأحذي...,كرة سلة,"[قميص كرة القدم, قميص كرة القدم الأمريكية, قمي..."


In [None]:
# Combine purchased items and unpurchased items strings to build prompt
df_english_training['Prompt'] = df_english_training['Input'].apply(modify_purchased_items_english) + \
                               " - " + df_english_training['Unpurchased_items'].apply(modify_unpurchased_items_list_english) + \
                               " - RECOMMENDATION: "

In [None]:
# Combine purchased items and unpurchased items strings to build prompt
df_arabic_training['Prompt'] = df_arabic_training['Input'].apply(modify_purchased_items_arabic) + \
                               " - " + df_arabic_training['Unpurchased_items'].apply(modify_unpurchased_items_list_arabic) + \
                               " - توصية: "

In [None]:
df_english_training

Unnamed: 0,Input,Output,Unpurchased_items,Prompt
0,"Soccer Jersey, Soccer Ball, Soccer Cleats, Goa...",Soccer Shin Guards,"[Basketball Jersey, Football Jersey, Baseball ...","ITEMS PURCHASED: {Soccer Jersey, Soccer Ball, ..."
1,"Basketball Jersey, Basketball, Basketball Shoe...",Basketball Shorts,"[Soccer Jersey, Football Jersey, Baseball Jers...","ITEMS PURCHASED: {Basketball Jersey, Basketbal..."
2,"Football Jersey, Football, Football Cleats, Fo...",Football Helmet,"[Soccer Jersey, Basketball Jersey, Baseball Je...","ITEMS PURCHASED: {Football Jersey, Football, F..."
3,"Baseball Jersey, Baseball, Baseball Cleats, Ba...",Baseball Glove,"[Soccer Jersey, Basketball Jersey, Football Je...","ITEMS PURCHASED: {Baseball Jersey, Baseball, B..."
4,"Tennis Shirt, Tennis Ball, Tennis Shoes",Tennis Racket,"[Soccer Jersey, Basketball Jersey, Football Je...","ITEMS PURCHASED: {Tennis Shirt, Tennis Ball, T..."
...,...,...,...,...
116,"Soccer Goal Post, Soccer Ball, Soccer Cleats, ...",Soccer Jersey,"[Soccer Jersey, Basketball Jersey, Football Je...","ITEMS PURCHASED: {Soccer Goal Post, Soccer Bal..."
117,"Basketball Jersey, Basketball, Basketball Shoes",Basketball Arm Sleeve,"[Soccer Jersey, Football Jersey, Baseball Jers...","ITEMS PURCHASED: {Basketball Jersey, Basketbal..."
118,"Basketball Jersey, Basketball, Basketball Arm ...",Basketball Shoes,"[Soccer Jersey, Football Jersey, Baseball Jers...","ITEMS PURCHASED: {Basketball Jersey, Basketbal..."
119,"Basketball Jersey, Basketball Arm Sleeve, Bask...",Basketball,"[Soccer Jersey, Football Jersey, Baseball Jers...","ITEMS PURCHASED: {Basketball Jersey, Basketbal..."


In [None]:
df_arabic_training

Unnamed: 0,Input,Output,Unpurchased_items,Prompt
0,قميص كرة القدم ، كرة كرة القدم ، مرابط كرة الق...,حراس كرة القدم,"[قميص كرة السلة, قميص كرة القدم الأمريكية, قمي...",عناصر تم شراؤها: {قميص كرة القدم ، كرة كرة الق...
1,قميص كرة السلة وكرة السلة وأحذية كرة السلة وأك...,شورت كرة السلة,"[قميص كرة القدم, قميص كرة القدم الأمريكية, قمي...",عناصر تم شراؤها: {قميص كرة السلة وكرة السلة وأ...
2,قميص كرة القدم ، كرة القدم ، مرابط كرة القدم ،...,خوذة كرة القدم,"[قميص كرة السلة, قميص كرة القدم الأمريكية, قمي...",عناصر تم شراؤها: {قميص كرة القدم ، كرة القدم ،...
3,قميص البيسبول ، البيسبول ، مرابط البيسبول ، قب...,قفاز البيسبول,"[قميص كرة القدم, قميص كرة السلة, قميص كرة القد...",عناصر تم شراؤها: {قميص البيسبول ، البيسبول ، م...
4,قميص التنس ، كرة تنس ، أحذية تنس,مضرب التنس,"[قميص كرة القدم, قميص كرة السلة, قميص كرة القد...",عناصر تم شراؤها: {قميص التنس ، كرة تنس ، أحذية...
...,...,...,...,...
116,هدف كرة القدم ، كرة كرة القدم ، مرابط كرة القد...,قميص لكرة القدم,"[قميص كرة القدم, قميص كرة السلة, قميص كرة القد...",عناصر تم شراؤها: {هدف كرة القدم ، كرة كرة القد...
117,قميص كرة السلة ، كرة السلة ، أحذية كرة السلة,دقة ذراع كرة السلة,"[قميص كرة القدم, قميص كرة القدم الأمريكية, قمي...",عناصر تم شراؤها: {قميص كرة السلة ، كرة السلة ،...
118,قميص كرة السلة وكرة السلة وأكمام ذراع كرة السلة,أحذية كرة السلة,"[قميص كرة القدم, قميص كرة القدم الأمريكية, قمي...",عناصر تم شراؤها: {قميص كرة السلة وكرة السلة وأ...
119,قميص كرة السلة ، وأكمام ذراع كرة السلة ، وأحذي...,كرة سلة,"[قميص كرة القدم, قميص كرة القدم الأمريكية, قمي...",عناصر تم شراؤها: {قميص كرة السلة ، وأكمام ذراع...


In [None]:
df_english_training['Prompt'].iloc[0]

'ITEMS PURCHASED: {Soccer Jersey, Soccer Ball, Soccer Cleats, Goalie Gloves, Soccer Goal Post, Soccer Cones} - CANDIDATES FOR RECOMMENDATION: {Basketball Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Basketball, Football, Baseball, Tennis Ball, Hocket Puck, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Basketball Arm Sleeve, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shin Guards, Soccer Shorts} - RECOMMENDATION: '
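
At inference time the same layout has to be rebuilt for a new customer. A small helper that mirrors the English template shown above could look like the sketch below; the name `build_prompt` is hypothetical and the items are made-up inputs.

```python
def build_prompt(purchased, candidates):
    """Assemble a prompt in the same layout as the English 'Prompt' column.

    purchased: comma-separated string of purchased items
    candidates: list of item names not yet purchased
    """
    purchased_part = "ITEMS PURCHASED: {" + purchased + "}"
    candidates_part = "CANDIDATES FOR RECOMMENDATION: {" + ", ".join(candidates) + "}"
    return purchased_part + " - " + candidates_part + " - RECOMMENDATION: "

prompt = build_prompt("Tennis Shirt, Tennis Ball", ["Tennis Shoes", "Tennis Racket"])
print(prompt)
```

Keeping the template in one function makes it harder for the training and inference prompts to drift apart.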

In [None]:
df_arabic_training['Prompt'].iloc[0]

'عناصر تم شراؤها: {قميص كرة القدم ، كرة كرة القدم ، مرابط كرة القدم ، قفازات حارس المرمى ، مرمى كرة القدم ، مخاريط كرة القدم} - المرشحين للتوصية: {قميص كرة السلة, قميص كرة القدم الأمريكية, قميص بيسبول, قميص التنس, قميص الهوكي, كرة سلة, كرة القدم الأمريكية, البيسبول, كرة التنس, قرص الهوكي, أحذية كرة السلة, المرابط كرة القدم الأمريكية, مرابط البيسبول, أحذية تنس, خوذة الهوكي, الأكمام ذراع كرة السلة, وسادات الكتف لكرة القدم الأمريكية, قبعة البيسبول, مضرب التنس, الزلاجات الهوكي, كرة السلة هوب, خوذة كرة القدم الأمريكية, مضرب البيسبول, عصا الهوكي, شورت كرة السلة, قفاز البيسبول, وسادات الهوكي, حراس كرة القدم, شورت كرة القدم} - توصية: '

In [None]:
# Reformat training and eval dataframes: the first 100 rows are used for training, the remaining rows for evaluation
train_english_reformatted = df_english_training[['Prompt', 'Output']][0:100].reset_index().rename(columns={"Prompt":"source", "Output": "target", "index": "id"})
train_english_reformatted = train_english_reformatted.dropna()
train_english_reformatted

eval_english_reformatted = df_english_training[['Prompt', 'Output']][100:].reset_index().rename(columns={"Prompt":"source", "Output": "target", "index": "id"})
eval_english_reformatted = eval_english_reformatted.dropna()
eval_english_reformatted

Unnamed: 0,id,source,target
0,100,"ITEMS PURCHASED: {Tennis Ball, Basketball, Foo...",Soccer Ball
1,101,"ITEMS PURCHASED: {Soccer Cleats, Basketball Sh...",Tennis Shoes
2,102,"ITEMS PURCHASED: {Soccer Cleats, Basketball Sh...",Baseball Cleats
3,103,"ITEMS PURCHASED: {Soccer Cleats, Basketball Sh...",Football Cleats
4,104,"ITEMS PURCHASED: {Soccer Cleats, Tennis Shoes,...",Basketball Shoes
5,105,"ITEMS PURCHASED: {Tennis Shoes, Basketball Sho...",Soccer Cleats
6,106,ITEMS PURCHASED: {Football Helmet} - CANDIDATE...,Baseball Cap
7,107,ITEMS PURCHASED: {Baseball Cap} - CANDIDATES F...,Football Helmet
8,108,ITEMS PURCHASED: {Tennis Racket} - CANDIDATES ...,Hockey Stick
9,109,ITEMS PURCHASED: {Hockey Stick} - CANDIDATES F...,Tennis Racket


In [None]:
# Convert dataframes to Dataset objects (for use in Hugging Face model)
from datasets import Dataset

english_dataset_train = Dataset.from_pandas(train_english_reformatted)
english_dataset_eval = Dataset.from_pandas(eval_english_reformatted)

In [None]:
# Build DatasetDict from Dataset objects
import datasets

english_data_dict_dataset = datasets.DatasetDict({"train": english_dataset_train, "eval": english_dataset_eval})
english_data_dict_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'target', '__index_level_0__'],
        num_rows: 100
    })
    eval: Dataset({
        features: ['id', 'source', 'target', '__index_level_0__'],
        num_rows: 21
    })
})

In [None]:
# Preprocess function to tokenize input text (for fine-tuning)

max_input_length = tokenizer.model_max_length
max_target_length = 20 # Adjust as needed, should be relatively short as we expect 1 product to be recommended

def preprocess_function(examples):
  inputs = [doc for doc in examples["source"]]
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True)

  # Tokenize the targets; text_target replaces the deprecated as_target_tokenizer() context manager
  # Note: padding=True pads labels with the pad token id rather than -100
  labels = tokenizer(text_target=examples["target"], max_length=max_target_length, truncation=True, padding=True)

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs
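
Since the targets are padded inside `preprocess_function`, the label ids contain the pad token id at padded positions. A common convention with PyTorch cross-entropy losses is to replace those positions with -100 so that they are ignored. The sketch below shows that replacement on plain lists. The helper name is hypothetical, the ids are made up, and whether this changes results for this dataset has not been tested here.

```python
def mask_label_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace pad token ids in each label sequence with ignore_index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Made-up label ids, padded with 0 to a common length
labels = [[5, 6, 1, 0, 0], [7, 8, 9, 10, 1]]
masked = mask_label_padding(labels, pad_token_id=0)
print(masked)
```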

In [None]:
# Test preprocessing on first 2 rows
preprocess_function(english_data_dict_dataset["train"][:2])



{'input_ids': [[2344, 20804, 276, 5905, 20891, 134, 2326, 10, 3, 2, 134, 13377, 49, 5092, 6, 23375, 4155, 6, 23375, 4779, 1544, 7, 6, 17916, 23, 15, 9840, 162, 7, 6, 23375, 17916, 1844, 6, 23375, 1193, 15, 7, 2, 3, 18, 205, 9853, 26483, 21254, 5652, 4083, 6657, 329, 14920, 8015, 10, 3, 2, 14885, 8044, 3184, 5092, 6, 10929, 5092, 6, 22398, 5092, 6, 18539, 3, 16671, 6, 23127, 5092, 6, 21249, 6, 10929, 6, 22398, 6, 18539, 4155, 6, 1546, 8849, 17, 276, 4636, 6, 21249, 23548, 6, 10929, 4779, 1544, 7, 6, 22398, 4779, 1544, 7, 6, 18539, 23548, 6, 23127, 22887, 15, 17, 6, 21249, 5412, 31909, 6, 10929, 5066, 49, 10683, 7, 6, 22398, 4000, 6, 18539, 24688, 15, 17, 6, 23127, 6458, 6203, 6, 21249, 454, 6631, 6, 10929, 22887, 15, 17, 6, 22398, 8897, 6, 23127, 12422, 6, 21249, 7110, 7, 6, 22398, 9840, 162, 6, 23127, 10683, 7, 6, 23375, 14215, 12899, 7, 6, 23375, 7110, 7, 2, 3, 18, 4083, 6657, 329, 14920, 8015, 10, 1], [2344, 20804, 276, 5905, 20891, 134, 2326, 10, 3, 2, 14885, 8044, 3184, 5092, 6, 21

In [None]:
print(len(preprocess_function(english_data_dict_dataset["train"][:2])['input_ids'][0] ))
print(len(preprocess_function(english_data_dict_dataset["train"][:2])['attention_mask'][0] ))
print(len(preprocess_function(english_data_dict_dataset["train"][:2])['labels'][0] ))

175
175
5
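
In the token ids printed above, the value 2 appears to line up with the positions of `{` and `}` in the prompt. For T5 tokenizers, id 2 is usually the unknown token, which would mean the braces are not in the vocabulary. Treat this as an assumption and confirm it with `tokenizer.unk_token_id`. The sketch below counts such ids in a sequence; the helper name is hypothetical and the sample is the opening slice of the first encoded prompt.

```python
def count_token(ids, token_id):
    """Count how often token_id occurs in a list of token ids."""
    return sum(1 for tok in ids if tok == token_id)

# Opening slice of the first encoded prompt printed above
sample_ids = [2344, 20804, 276, 5905, 20891, 134, 2326, 10, 3, 2, 134, 13377, 49, 5092]

UNK_ID = 2  # assumption: T5's unknown-token id; confirm with tokenizer.unk_token_id
unk_count = count_token(sample_ids, UNK_ID)
print(unk_count)
```

If the count is non-zero for every prompt, delimiters that the tokenizer can represent may be worth trying.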


In [None]:
# Tokenize train and eval datasets
tokenized_datasets = english_data_dict_dataset.map(preprocess_function, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'target', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
    eval: Dataset({
        features: ['id', 'source', 'target', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 21
    })
})

In [None]:
# Instantiate Data Collator object
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [None]:
# Instantiate Data Loader for train and eval sets
# Adjust batch sizes as necessary

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)

eval_dataloader = DataLoader(
    tokenized_datasets["eval"], batch_size=8, collate_fn=data_collator
)

In [None]:
len(train_dataloader)

13

## Fine-tune T5 model

In [None]:
# Select optimizer and learning-rate scheduler

from transformers import AdamW, get_scheduler

learning_rate = 1e-4
optimizer = AdamW(model.parameters(), lr=learning_rate)

num_epochs = 5
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

65




In [None]:
# Run if you want to push to Hugging Face Hub (need account and API token)
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
# Instantiate training arguments object
batch_size = 4
args = Seq2SeqTrainingArguments(
    "./t5_recommendation_sports_equipment_english",
    push_to_hub=True, # Comment out if you don't want to push to Hugging Face Hub
    evaluation_strategy = "epoch",
    learning_rate = 1e-4,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 3,
    num_train_epochs = 10, # Try 5-10 epochs; results may vary
    predict_with_generate = True,
    gradient_accumulation_steps = 4,
    eval_accumulation_steps = 4,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
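
The scheduler created earlier was sized for 65 steps (5 epochs over 13 batches of 8), while these arguments use a batch size of 4, 4 gradient-accumulation steps, and 10 epochs. The sketch below estimates the number of optimizer updates those settings imply on a single device. The formula is an assumption about how the Trainer counts steps (batches per epoch, integer-divided by the accumulation steps), and the function name is hypothetical.

```python
import math

def estimate_optimizer_steps(num_examples, batch_size, grad_accum_steps, num_epochs):
    """Estimate optimizer updates for a single-device run with gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)

# Settings from the training arguments: 100 examples, batch size 4, 4 accumulation steps, 10 epochs
trainer_steps = estimate_optimizer_steps(100, 4, 4, 10)

# Steps the scheduler was built with: 5 epochs over batches of 8
scheduler_steps = 5 * math.ceil(100 / 8)

print(trainer_steps, scheduler_steps)
```

If the two numbers differ, the linear schedule does not reach zero exactly at the last update. Passing the estimated value as `num_training_steps` keeps them aligned.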


In [None]:
# Instantiate ROUGE metric object

from datasets import load_dataset, load_metric

metric = load_metric("rouge")
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

In [None]:
# Functions for further preprocessing and metrics computation
import numpy as np

def postprocess_text(preds, labels):
  preds = [pred.strip() for pred in preds]
  labels = [[label.strip()] for label in labels]

  return preds, labels

def compute_metrics(eval_preds):
  preds, labels = eval_preds
  if isinstance(preds, tuple):
    preds = preds[0]
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

  # Replace -100 in the labels as we can't decode them.
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  # Some simple post processing
  decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

  result = metric.compute(predictions = decoded_preds, references = decoded_labels)
  result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

  prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
  result["gen_len"] = np.mean(prediction_lens)

  return result
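
ROUGE gives partial credit for overlapping words, so a prediction such as "Tennis Shoes" still scores against the target "Tennis Racket". Because each target is a single product name, exact-match accuracy may be a useful complement. The sketch below is not part of the original metric code; the function name is hypothetical and the strings are made-up examples.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to their reference, ignoring case and outer whitespace."""
    if not references:
        return 0.0
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)

preds = ["Soccer Ball", "Tennis Shoes", "Baseball Bat"]
refs = ["Soccer Ball", "Tennis Racket", "Baseball Bat"]
accuracy = exact_match_accuracy(preds, refs)
print(round(accuracy, 2))
```

A value like this could be added to the `result` dictionary next to the ROUGE scores, using the decoded predictions and labels.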

In [None]:
# Instantiate Trainer object (for fine-tuning)
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["eval"],
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics,
    optimizers = (optimizer, lr_scheduler)
)

Cloning https://huggingface.co/mohammadhia/t5_recommendation_sports_equipment_english into local empty directory.


In [None]:
# Training should take a few minutes or less on a GPU
# It can take up to several hours on a CPU
trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: __index_level_0__, target, source, id. If __index_level_0__, target, source, id are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 100
  Num Epochs = 10
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 60
  Number of trainable parameters = 737668096
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
0,No log,6.737507,8.706611,0.952381,8.759812,8.601081,19.0
1,No log,2.808905,23.809524,9.52381,23.333333,23.333333,3.142857
2,No log,0.939374,9.52381,4.761905,9.52381,9.52381,3.190476
3,No log,0.667893,33.333333,14.285714,32.857143,32.539683,3.571429
4,No log,0.673615,26.507937,9.52381,25.079365,25.079365,4.238095
5,No log,0.665844,38.730159,23.809524,37.301587,37.460317,4.047619
6,No log,0.646018,46.349206,33.333333,45.634921,45.238095,3.857143
7,No log,0.559592,52.380952,42.857143,50.793651,50.793651,4.0
8,No log,0.5082,57.142857,47.619048,55.555556,55.555556,3.952381
9,No log,0.455381,57.142857,47.619048,55.555556,55.555556,3.904762


The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: __index_level_0__, target, source, id. If __index_level_0__, target, source, id are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 21
  Batch size = 4
Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.0"
}

TrainOutput(global_step=60, training_loss=2.6742706298828125, metrics={'train_runtime': 202.0142, 'train_samples_per_second': 4.95, 'train_steps_per_second': 0.297, 'total_flos': 753893148672000.0, 'train_loss': 2.6742706298828125, 'epoch': 9.96})
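
The step counts in the training log can be reproduced by hand. The sketch below assumes a single device and one optimizer step per full group of accumulated mini-batches in each epoch. This is one reading that matches the logged numbers, not a description of the library's internals, and the function name `training_schedule` is only illustrative.

```python
import math

def training_schedule(num_examples, per_device_batch_size, grad_accum_steps, num_epochs):
    """Reproduce the step counts from the training log (single device assumed)."""
    # Mini-batches the dataloader yields per epoch (the last one may be partial)
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # One optimizer step per full group of accumulated mini-batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_steps = updates_per_epoch * num_epochs
    effective_batch_size = per_device_batch_size * grad_accum_steps
    # Mini-batches consumed in the last epoch when the final optimizer step fires
    batches_used = min(updates_per_epoch * grad_accum_steps, batches_per_epoch)
    final_epoch = (num_epochs - 1) + batches_used / batches_per_epoch
    return {
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "total_steps": total_steps,
        "effective_batch_size": effective_batch_size,
        "final_epoch": final_epoch,
    }

# Values taken from the log: 100 examples, batch size 4, 4 accumulation steps, 10 epochs
schedule = training_schedule(100, 4, 4, 10)
print(schedule)
```

With these inputs there are 25 mini-batches per epoch, 6 optimizer steps per epoch, and 60 steps in total, and the last step lands at epoch 9 + 24/25 = 9.96, which agrees with the `TrainOutput` line.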

## Collect evaluation data predictions

In [None]:
%%time
# Try predictions on validation set for confirmation
predictions = trainer.predict(tokenized_datasets["eval"])

The following columns in the test set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: __index_level_0__, target, source, id. If __index_level_0__, target, source, id are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 21
  Batch size = 4
Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.0"
}



CPU times: user 2.89 s, sys: 383 ms, total: 3.27 s
Wall time: 3.42 s


In [None]:
predictions

In [None]:
# Convert tokens from data to text
def translate(tokens):
  """Convert a sequence of token IDs into a plain-text string."""
  pieces = tokenizer.convert_ids_to_tokens(tokens)
  # Drop special tokens such as <pad>, </s>, and <unk>
  pieces = [piece for piece in pieces if '<' not in piece]
  # SentencePiece marks the start of each word with "▁"; turn it back into a space
  text = ''.join(pieces).replace("▁", " ")
  return text.strip()
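
To see what the token-to-text conversion does without loading the tokenizer, here is a minimal sketch that applies the same filtering and joining to a hand-written list of pieces. The pieces are hypothetical; the real split depends on the tokenizer's vocabulary. The helper name `join_pieces` is also only illustrative.

```python
def join_pieces(pieces):
    """Join SentencePiece-style pieces into text, skipping special tokens."""
    # Special tokens (<pad>, </s>, ...) all contain "<"
    kept = [piece for piece in pieces if "<" not in piece]
    # "▁" marks the start of a word, so it becomes a space
    return "".join(kept).replace("▁", " ").strip()

# Hypothetical pieces for a padded prediction
example_pieces = ["<pad>", "▁Soccer", "▁Con", "es", "</s>", "<pad>"]
print(join_pieces(example_pieces))  # Soccer Cones
```

In practice, `tokenizer.decode(tokens, skip_special_tokens=True)` should give a similar result directly from the token IDs.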

In [None]:
# Print sample predicted output
index = 16
print(tokenized_datasets["eval"]["source"][index])
print("Target product: ", tokenized_datasets["eval"]["target"][index])
print("Recommended product: ", [translate(predictions.predictions[index])])

ITEMS PURCHASED: {Soccer Goal Post, Soccer Ball, Soccer Cleats, Goalie Gloves} - CANDIDATES FOR RECOMMENDATION: {Soccer Jersey, Basketball Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Basketball, Football, Baseball, Tennis Ball, Hocket Puck, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Basketball Arm Sleeve, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Soccer Cones, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shin Guards, Soccer Shorts} - RECOMMENDATION: 
Target product:  Soccer Jersey
Recommended product:  ['Soccer Cones']
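
For a sense of how a near miss like this one is scored, here is a simplified sketch of a ROUGE-1 F1 computation. It assumes lowercasing and whitespace tokenization, so it only approximates what the `rouge_score` package does. The function name `rouge1_f1` is illustrative.

```python
from collections import Counter

def rouge1_f1(target, prediction):
    """Unigram-overlap F1 between a target and a prediction (simplified ROUGE-1)."""
    target_counts = Counter(target.lower().split())
    prediction_counts = Counter(prediction.lower().split())
    # Number of unigrams shared by both strings (clipped by count)
    overlap = sum((target_counts & prediction_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(prediction_counts.values())
    recall = overlap / sum(target_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("Soccer Jersey", "Soccer Cones"))  # 0.5
```

The two strings share one of two words, so precision and recall are both 0.5. The Rouge columns in the training table appear to be reported on a 0 to 100 scale, so this example would show up there as 50.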


In [None]:
# Collect generated outputs and join with prompts and targets
model_generated = []
prompt_list = []
target_list = []

for i in range(len(predictions.predictions)):
  model_generated.append(translate(predictions.predictions[i]))

  prompt_list.append(english_dataset_eval['source'][i])
  target_list.append(english_dataset_eval['target'][i])

In [None]:
df_target_and_generated = pd.DataFrame()

df_target_and_generated['input'] = prompt_list
df_target_and_generated['target'] = target_list
df_target_and_generated['model_generated'] = model_generated

In [None]:
df_target_and_generated

index,input,target,model_generated
0,"ITEMS PURCHASED: {Tennis Ball, Basketball, Foo...",Soccer Ball,Soccer Ball
1,"ITEMS PURCHASED: {Soccer Cleats, Basketball Sh...",Tennis Shoes,Tennis Shoes
2,"ITEMS PURCHASED: {Soccer Cleats, Basketball Sh...",Baseball Cleats,Baseball Cleats
3,"ITEMS PURCHASED: {Soccer Cleats, Basketball Sh...",Football Cleats,Football Cleats
4,"ITEMS PURCHASED: {Soccer Cleats, Tennis Shoes,...",Basketball Shoes,Basketball Shoes
5,"ITEMS PURCHASED: {Tennis Shoes, Basketball Sho...",Soccer Cleats,Tennis Shoes
6,ITEMS PURCHASED: {Football Helmet} - CANDIDATE...,Baseball Cap,Hockey Helmet
7,ITEMS PURCHASED: {Baseball Cap} - CANDIDATES F...,Football Helmet,Basketball Arm Sleeve
8,ITEMS PURCHASED: {Tennis Racket} - CANDIDATES ...,Hockey Stick,Tennis Ball
9,ITEMS PURCHASED: {Hockey Stick} - CANDIDATES F...,Tennis Racket,Hockey Puck
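
A quick way to summarize this table is an exact-match rate between the `target` and `model_generated` columns. The sketch below uses only the ten rows displayed above, not the full 21-example evaluation set, and keeps to plain Python lists so that it runs on its own.

```python
# Targets and model outputs copied from the ten displayed rows
targets = [
    "Soccer Ball", "Tennis Shoes", "Baseball Cleats", "Football Cleats",
    "Basketball Shoes", "Soccer Cleats", "Baseball Cap", "Football Helmet",
    "Hockey Stick", "Tennis Racket",
]
generated = [
    "Soccer Ball", "Tennis Shoes", "Baseball Cleats", "Football Cleats",
    "Basketball Shoes", "Tennis Shoes", "Hockey Helmet", "Basketball Arm Sleeve",
    "Tennis Ball", "Hockey Puck",
]

# Count rows where the generated product equals the target exactly
matches = sum(t == g for t, g in zip(targets, generated))
exact_match_rate = matches / len(targets)
print(f"{matches}/{len(targets)} exact matches ({exact_match_rate:.0%})")
```

On the DataFrame itself, the same figure can be computed with `(df_target_and_generated['target'] == df_target_and_generated['model_generated']).mean()`.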


## Push fine-tuned model to Hugging Face Hub (optional)

In [None]:
trainer.push_to_hub("t5_recommendation_sports_equipment")

Saving model checkpoint to ./t5_recommendation_sports_equipment_english
Configuration saved in ./t5_recommendation_sports_equipment_english/config.json
Configuration saved in ./t5_recommendation_sports_equipment_english/generation_config.json
Model weights saved in ./t5_recommendation_sports_equipment_english/pytorch_model.bin
tokenizer config file saved in ./t5_recommendation_sports_equipment_english/tokenizer_config.json
Special tokens file saved in ./t5_recommendation_sports_equipment_english/special_tokens_map.json


Upload file pytorch_model.bin:   0%|          | 32.0k/2.75G [00:00<?, ?B/s]

Upload file runs/Mar13_02-46-03_df067a67bbfa/1678675577.1665006/events.out.tfevents.1678675577.df067a67bbfa.14…

Upload file training_args.bin: 100%|##########| 3.56k/3.56k [00:00<?, ?B/s]

Upload file runs/Mar13_02-46-03_df067a67bbfa/events.out.tfevents.1678675577.df067a67bbfa.146.0: 100%|#########…

remote: Scanning LFS files of refs/heads/main for validity...        
remote: LFS file scan complete.        
To https://huggingface.co/mohammadhia/t5_recommendation_sports_equipment_english
   c296479..3f30dbb  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Sequence-to-sequence Language Modeling', 'type': 'text2text-generation'}, 'metrics': [{'name': 'Rouge1', 'type': 'rouge', 'value': 57.14285714285714}]}
To https://huggingface.co/mohammadhia/t5_recommendation_sports_equipment_english
   3f30dbb..99484c5  main -> main



'https://huggingface.co/mohammadhia/t5_recommendation_sports_equipment_english/commit/3f30dbb7382798eb6dc349d3b34449923e3331d0'

## Test the Model on various examples!

In [None]:
# Load fine-tuned model from Hugging Face Hub
from transformers import pipeline

t5_recommender = pipeline(model="mohammadhia/t5_recommendation_sports_equipment_english")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/142 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.35k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [None]:
# Prepare sample customers (items purchased so far)
example_customer_1 = "ITEMS PURCHASED: {Soccer Shin Guards} - CANDIDATES FOR RECOMMENDATION: {Soccer Jersey, Basketball Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Soccer Ball, Basketball, Football, Baseball, Tennis Ball, Hocket Puck, Soccer Cleats, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Goalie Gloves, Basketball Arm Sleeve, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Soccer Goal Post, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Soccer Cones, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shorts} - RECOMMENDATION: "
example_customer_2 = "ITEMS PURCHASED: {Soccer Jersey, Soccer Goal Post, Soccer Cleats, Goalie Gloves} - CANDIDATES FOR RECOMMENDATION: {Basketball Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Soccer Ball, Basketball, Football, Baseball, Tennis Ball, Hocket Puck, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Basketball Arm Sleeve, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Soccer Cones, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shin Guards, Soccer Shorts} - RECOMMENDATION: "
example_customer_3 = "ITEMS PURCHASED: {Basketball Jersey, Basketball, Basketball Arm Sleeve} - CANDIDATES FOR RECOMMENDATION: {Soccer Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Soccer Ball, Football, Baseball, Tennis Ball, Hocket Puck, Soccer Cleats, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Goalie Gloves, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Soccer Goal Post, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Soccer Cones, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shin Guards, Soccer Shorts} - RECOMMENDATION: "
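
Writing these prompts by hand is error-prone, so a small helper can assemble them. The sketch below assumes the candidate list is simply the product catalog with the purchased items removed, which is what the three examples above look like. The name `build_prompt` and the short catalog are hypothetical.

```python
def build_prompt(purchased, catalog):
    """Build a prompt in the format used by the sample customers."""
    # Candidates are every catalog item the customer has not bought yet
    candidates = [item for item in catalog if item not in purchased]
    return (
        "ITEMS PURCHASED: {" + ", ".join(purchased) + "}"
        + " - CANDIDATES FOR RECOMMENDATION: {" + ", ".join(candidates) + "}"
        + " - RECOMMENDATION: "
    )

# Shortened, hypothetical catalog for illustration
small_catalog = ["Soccer Jersey", "Soccer Ball", "Soccer Shin Guards", "Soccer Shorts"]
print(build_prompt(["Soccer Shin Guards"], small_catalog))
```

The resulting string can be passed to `t5_recommender.predict` in the same way as the hand-written examples.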

In [None]:
# Generate model recommendations for products for each customer
model_output_1 = t5_recommender.predict(example_customer_1)
model_recommendation_1 = model_output_1[0]['generated_text']

model_output_2 = t5_recommender.predict(example_customer_2)
model_recommendation_2 = model_output_2[0]['generated_text']

model_output_3 = t5_recommender.predict(example_customer_3)
model_recommendation_3 = model_output_3[0]['generated_text']

Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.0"
}

In [None]:
print(example_customer_1)
print("RECOMMENDATION: ", model_recommendation_1)

ITEMS PURCHASED: {Soccer Shin Guards} - CANDIDATES FOR RECOMMENDATION: {Soccer Jersey, Basketball Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Soccer Ball, Basketball, Football, Baseball, Tennis Ball, Hocket Puck, Soccer Cleats, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Goalie Gloves, Basketball Arm Sleeve, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Soccer Goal Post, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Soccer Cones, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shorts} - RECOMMENDATION: 
RECOMMENDATION:  Basketball Arm Sleeve


In [None]:
print(example_customer_2)
print("RECOMMENDATION: ", model_recommendation_2)

ITEMS PURCHASED: {Soccer Jersey, Soccer Goal Post, Soccer Cleats, Goalie Gloves} - CANDIDATES FOR RECOMMENDATION: {Basketball Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Soccer Ball, Basketball, Football, Baseball, Tennis Ball, Hocket Puck, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Basketball Arm Sleeve, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Soccer Cones, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shin Guards, Soccer Shorts} - RECOMMENDATION: 
RECOMMENDATION:  Soccer Ball


In [None]:
print(example_customer_3)
print("RECOMMENDATION: ", model_recommendation_3)

ITEMS PURCHASED: {Basketball Jersey, Basketball, Basketball Arm Sleeve} - CANDIDATES FOR RECOMMENDATION: {Soccer Jersey, Football Jersey, Baseball Jersey, Tennis Shirt, Hockey Jersey, Soccer Ball, Football, Baseball, Tennis Ball, Hocket Puck, Soccer Cleats, Basketball Shoes, Football Cleats, Baseball Cleats, Tennis Shoes, Hockey Helmet, Goalie Gloves, Football Shoulder Pads, Baseball Cap, Tennis Racket, Hockey Skates, Soccer Goal Post, Basketball Hoop, Football Helmet, Baseball Bat, Hockey Stick, Soccer Cones, Basketball Shorts, Baseball Glove, Hockey Pads, Soccer Shin Guards, Soccer Shorts} - RECOMMENDATION: 
RECOMMENDATION:  Basketball Shoes


## Test the Model on various examples (in Arabic)!
- Unfortunately, the mT5 models did not reach the fine-tuning results shown above when trained on the same dataset translated into Arabic
  - Even the extra-large models produced the same irrelevant output across the board

- An alternative for now is to translate the Arabic input into English with the Google Translate API, get the model's recommendation, and translate it back into Arabic

In [None]:
!pip install googletrans==3.1.0a0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting googletrans==3.1.0a0
  Downloading googletrans-3.1.0a0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3
  Downloading httpx-0.13.3-py3-none-any.whl (55 kB)
Collecting sniffio
  Downloading sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting chardet==3.*
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting rfc3986<2,>=1.3
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl (31 kB)
Collecting httpcore==0.9.*
  Downloading httpcore-0.9.1-py3-none-any.whl (42 kB)

In [None]:
# Import Google Translate
from googletrans import Translator
translator = Translator()

In [None]:
# Convert Arabic input (customer purchases) into English, get recommendations, and translate back to Arabic

def recommendation_arabic(arabic_input):
  """Translate an Arabic prompt to English, run the recommender, and return the result in Arabic."""
  # Translate the input to English first
  translation = translator.translate(arabic_input, dest='en').text

  # Restore the upper-case field labels used in the English prompts
  translation = translation.replace("Items Purchased", "ITEMS PURCHASED")
  translation = translation.replace("Recommended Candidates", "CANDIDATES FOR RECOMMENDATION")
  translation = translation.replace("Candidates for Recommendation", "CANDIDATES FOR RECOMMENDATION")
  translation = translation.replace("Recommendation", "RECOMMENDATION")

  # Generate a recommendation and extract the generated text
  model_output = t5_recommender.predict(translation)
  model_recommendation = model_output[0]['generated_text']

  # Rebuild the labeled output string and translate it back to Arabic
  model_recommendation_string = "RECOMMENDATION: " + model_recommendation
  model_recommendation_arabic = translator.translate(model_recommendation_string, dest='ar')
  return model_recommendation_arabic.text
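
The label replacements in this function depend on the exact capitalization the translator returns. If that capitalization varies (an assumption, not something observed here), a case-insensitive variant may be more robust. The sketch below uses `re.sub` with `re.IGNORECASE`; the names `LABEL_PATTERNS` and `normalize_labels` are hypothetical.

```python
import re

# (pattern, replacement) pairs, applied in order; matching ignores case
LABEL_PATTERNS = [
    (r"items purchased", "ITEMS PURCHASED"),
    (r"recommended candidates", "CANDIDATES FOR RECOMMENDATION"),
    (r"candidates for recommendation", "CANDIDATES FOR RECOMMENDATION"),
    (r"recommendation", "RECOMMENDATION"),
]

def normalize_labels(text):
    """Upper-case the prompt field labels regardless of how they were capitalized."""
    for pattern, label in LABEL_PATTERNS:
        text = re.sub(pattern, label, text, flags=re.IGNORECASE)
    return text

sample = "Items purchased: {Soccer Ball} - Candidates for recommendation: {Soccer Cones} - Recommendation: "
print(normalize_labels(sample))
```

Like the original replacements, this would also upper-case the word "recommendation" if it appeared inside a product name, so it is only suitable for catalogs where that cannot happen.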

In [None]:
# Sample customer purchases (in Arabic)
example_customer_1_arabic = "عناصر تم شراؤها: {حراس كرة القدم} - المرشحين للتوصية: {قميص كرة القدم, قميص كرة السلة, قميص كرة القدم الأمريكية, قميص بيسبول, قميص التنس, قميص الهوكي, كرة كرة القدم, كرة سلة, كرة القدم الأمريكية, البيسبول, كرة التنس, قرص الهوكي, مرابط كرة القدم, أحذية كرة السلة, المرابط كرة القدم الأمريكية, مرابط البيسبول, أحذية تنس, خوذة الهوكي, قفازات حارس المرمى, الأكمام ذراع كرة السلة, وسادات الكتف لكرة القدم الأمريكية, قبعة البيسبول, مضرب التنس, الزلاجات الهوكي, مرمى كرة القدم, كرة السلة هوب, خوذة كرة القدم الأمريكية, مضرب البيسبول, عصا الهوكي, مخاريط كرة القدم, شورت كرة السلة, قفاز البيسبول, وسادات الهوكي, شورت كرة القدم} - توصية: "
example_customer_2_arabic = "عناصر تم شراؤها: {قميص لكرة القدم ، هدف كرة القدم ، مرابط كرة القدم ، قفازات حارس المرمى} - المرشحين للتوصية: {قميص كرة القدم, قميص كرة السلة, قميص كرة القدم الأمريكية, قميص بيسبول, قميص التنس, قميص الهوكي, كرة كرة القدم, كرة سلة, كرة القدم الأمريكية, البيسبول, كرة التنس, قرص الهوكي, أحذية كرة السلة, المرابط كرة القدم الأمريكية, مرابط البيسبول, أحذية تنس, خوذة الهوكي, الأكمام ذراع كرة السلة, وسادات الكتف لكرة القدم الأمريكية, قبعة البيسبول, مضرب التنس, الزلاجات الهوكي, مرمى كرة القدم, كرة السلة هوب, خوذة كرة القدم الأمريكية, مضرب البيسبول, عصا الهوكي, مخاريط كرة القدم, شورت كرة السلة, قفاز البيسبول, وسادات الهوكي, حراس كرة القدم, شورت كرة القدم} - توصية: "
example_customer_3_arabic = "عناصر تم شراؤها: {قميص كرة السلة وكرة السلة وأكمام ذراع كرة السلة} - المرشحين للتوصية: {قميص كرة القدم, قميص كرة القدم الأمريكية, قميص بيسبول, قميص التنس, قميص الهوكي, كرة كرة القدم, كرة سلة, كرة القدم الأمريكية, البيسبول, كرة التنس, قرص الهوكي, مرابط كرة القدم, أحذية كرة السلة, المرابط كرة القدم الأمريكية, مرابط البيسبول, أحذية تنس, خوذة الهوكي, قفازات حارس المرمى, الأكمام ذراع كرة السلة, وسادات الكتف لكرة القدم الأمريكية, قبعة البيسبول, مضرب التنس, الزلاجات الهوكي, مرمى كرة القدم, كرة السلة هوب, خوذة كرة القدم الأمريكية, مضرب البيسبول, عصا الهوكي, مخاريط كرة القدم, شورت كرة السلة, قفاز البيسبول, وسادات الهوكي, حراس كرة القدم, شورت كرة القدم} - توصية: "

In [None]:
example_customer_1_arabic

'عناصر تم شراؤها: {حراس كرة القدم} - المرشحين للتوصية: {قميص كرة القدم, قميص كرة السلة, قميص كرة القدم الأمريكية, قميص بيسبول, قميص التنس, قميص الهوكي, كرة كرة القدم, كرة سلة, كرة القدم الأمريكية, البيسبول, كرة التنس, قرص الهوكي, مرابط كرة القدم, أحذية كرة السلة, المرابط كرة القدم الأمريكية, مرابط البيسبول, أحذية تنس, خوذة الهوكي, قفازات حارس المرمى, الأكمام ذراع كرة السلة, وسادات الكتف لكرة القدم الأمريكية, قبعة البيسبول, مضرب التنس, الزلاجات الهوكي, مرمى كرة القدم, كرة السلة هوب, خوذة كرة القدم الأمريكية, مضرب البيسبول, عصا الهوكي, مخاريط كرة القدم, شورت كرة السلة, قفاز البيسبول, وسادات الهوكي, شورت كرة القدم} - توصية: '

In [None]:
recommendation_arabic(example_customer_1_arabic) # Get recommended product

'* توصية: كرة القدم'

In [None]:
example_customer_2_arabic

'عناصر تم شراؤها: {قميص لكرة القدم ، هدف كرة القدم ، مرابط كرة القدم ، قفازات حارس المرمى} - المرشحين للتوصية: {قميص كرة القدم, قميص كرة السلة, قميص كرة القدم الأمريكية, قميص بيسبول, قميص التنس, قميص الهوكي, كرة كرة القدم, كرة سلة, كرة القدم الأمريكية, البيسبول, كرة التنس, قرص الهوكي, أحذية كرة السلة, المرابط كرة القدم الأمريكية, مرابط البيسبول, أحذية تنس, خوذة الهوكي, الأكمام ذراع كرة السلة, وسادات الكتف لكرة القدم الأمريكية, قبعة البيسبول, مضرب التنس, الزلاجات الهوكي, مرمى كرة القدم, كرة السلة هوب, خوذة كرة القدم الأمريكية, مضرب البيسبول, عصا الهوكي, مخاريط كرة القدم, شورت كرة السلة, قفاز البيسبول, وسادات الهوكي, حراس كرة القدم, شورت كرة القدم} - توصية: '

In [None]:
recommendation_arabic(example_customer_2_arabic) # Get recommended product

'توصية: كرة القدم كرة القدم'

In [None]:
example_customer_3_arabic

'عناصر تم شراؤها: {قميص كرة السلة وكرة السلة وأكمام ذراع كرة السلة} - المرشحين للتوصية: {قميص كرة القدم, قميص كرة القدم الأمريكية, قميص بيسبول, قميص التنس, قميص الهوكي, كرة كرة القدم, كرة سلة, كرة القدم الأمريكية, البيسبول, كرة التنس, قرص الهوكي, مرابط كرة القدم, أحذية كرة السلة, المرابط كرة القدم الأمريكية, مرابط البيسبول, أحذية تنس, خوذة الهوكي, قفازات حارس المرمى, الأكمام ذراع كرة السلة, وسادات الكتف لكرة القدم الأمريكية, قبعة البيسبول, مضرب التنس, الزلاجات الهوكي, مرمى كرة القدم, كرة السلة هوب, خوذة كرة القدم الأمريكية, مضرب البيسبول, عصا الهوكي, مخاريط كرة القدم, شورت كرة السلة, قفاز البيسبول, وسادات الهوكي, حراس كرة القدم, شورت كرة القدم} - توصية: '

In [None]:
recommendation_arabic(example_customer_3_arabic) # Get recommended product

'* توصية: كرة القدم'