# Recommendation and Search
   - using  two years worth [real online retail transaction data](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II)
   - Key techniques applied:
       - word/sentence embeddings (vectors)
           - `man + king - woman = queen`
       - inverted index
       - cosine similarity
       
## Flow of Demo:

### 1. Recommender System:

   - Area for improvement: Recommend product to customer while browsing the app
   - Benefit: Increase in sales and customer retention
   - Approach: Using historical transaction data to train Word2Vec and LSTM models to give relevant/related product recommendations in app real time

### 2. Search Engine:

   - Area for improvement: Provide specific product options that user can buy instead of only showing the Merchants's name
   - Benefit: Customer retention and engagement
   - Approach: Train/Apply Doc2Vec and sentence transformer models to product and merchant's content data (name, description, reviews) to be used for calculating product relevance to the search query
                
### 3. Aspirations on things that can be done moving forward

In [2]:
datapath_='../data/ecommerce/'

from helpers.helper_funcs import (import_data,
                                  create_customer_sessions,
                                  make_prod_index,
                                  read_corpus
                                 )
import pandas as pd
from random import choices
import gensim.models
from random import randint
from collections import defaultdict
from sentence_transformers import SentenceTransformer, util
import torch
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = import_data()
data.tail(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
813655,573167,23264,set of 3 wooden sleigh decorations,36,2011-10-28 09:29:00,1.25,18287,United Kingdom
813656,573167,21824,painted metal star with holly bells,48,2011-10-28 09:29:00,0.39,18287,United Kingdom
813657,573167,21014,swiss chalet tree decoration,24,2011-10-28 09:29:00,0.29,18287,United Kingdom


In [3]:
print("\n\n===========================================================")
print("Number of unique transactions: ", data.InvoiceNo.nunique())
print("Number of unique customers: ", data.CustomerID.nunique())
print("Number of unique products: ", data.StockCode.nunique())
print("===========================================================")



Number of unique transactions:  22190
Number of unique customers:  4372
Number of unique products:  3684


### Using transaction sequences to generate related product recommendations
   Method: Word Embeddings - numeric representations of words (product/StockCode in this case)
   - Word2Vec (Continuous Bag of Words Model)
       - based from context (product bought before and after)
       - works with no labeled data
   - Long Short Term Memory (LSTM) 
       - based on sequence of products, remember important segments (short product items sequence), forgets those aren't
       - works when you have labeled data such as browsing behavior of user that results with a purchase or not.
       - technically can work without labeled data (seq2seq model)

### Recommender process:

 1. (offline) create sequence of products bought by user (items bought by user in a browsing session, session = 1hr) this can also be sequences like browsing behavior
 2. (offline) train embedding models (w2v/lstm)
 3. (online) use product "word" vectors learned from (2) to calculate cosine similarity

#### Step 1:

In [4]:
customer_sessions = create_customer_sessions()
customer_sessions.head(3)

Unnamed: 0,CustomerID,StockCode
0,12347,"[85116, 22375, 71477, 22492, 22771, 22772, 227..."
1,12347,"[84625A, 84625C, 85116, 20719, 22375, 22376, 2..."
2,12347,"[22376, 22374, 22371, 22375, 20665, 23076, 217..."


#### Look up dictionary of StockCode and corresponding product description

In [5]:
stock_descrip = make_prod_index()
{k:v for i, (k, v) in enumerate(stock_descrip.items()) if i < 3}

{'23166': 'medium ceramic top storage jar',
 '85116': 'black candelabra t-light holder',
 '22375': 'airline bag vintage jet set brown'}

#### Step 2:

#### Word2Vec Approach (CBOW)

In [6]:
seq_fname='stock_sequences.txt'
w2v_model = gensim.models.Word2Vec(corpus_file=datapath_+seq_fname, sg=1, min_count=1, vector_size=40)

p, d = w2v_model.wv.vectors.shape
print("\n{d}-D vectors for {p} products\n".format(p=p, d=d))

for index, word in enumerate(w2v_model.wv.index_to_key):
    if index == 2:
        break
    print("Index: {i}, StockCode: {s}, Descrip: {d}\n Vector: [{v} ... x_40] ".format(i=index,
                                                                          s=word, 
                                                                          d=stock_descrip[word],
                                                                          v=', '.join([str(i) for i in w2v_model.wv.vectors[index][:5]])
                                                                         ))


40-D vectors for 3677 products

Index: 0, StockCode: 85123A, Descrip: cream hanging heart t-light holder
 Vector: [-0.23690282, 0.67627496, 0.26765117, 0.39828238, 0.07714165 ... x_40] 
Index: 1, StockCode: 22423, Descrip: regency cakestand 3 tier
 Vector: [-0.11227249, 0.306507, 0.019814892, 0.4933716, -0.0047198436 ... x_40] 


### Step 3 (Online): 

### Recommend related items when an item is clicked or added to cart

In [7]:
# stockitems=choices(w2v_model.wv.index_to_key, k=1)
stockitems=['21088']
print("Find similar items to: StockCode: {s}, Descrip: {d}\n".format(s=stockitems[0], d=stock_descrip[stockitems[0]]))
print("==============================================================================")
for item in w2v_model.wv.most_similar(positive=stockitems, topn=10):
    print("\t StockCode: {s}, Descrip: {d}".format(s=item[0], d=stock_descrip[item[0]]))
print("==============================================================================")

Find similar items to: StockCode: 21088, Descrip: set/6 fruit salad paper cups

	 StockCode: 21096, Descrip: set/6 fruit salad  paper plates
	 StockCode: 21090, Descrip: set/6 collage paper plates
	 StockCode: 21087, Descrip: set/6 posies paper cups
	 StockCode: 21089, Descrip: set/6 green spring paper cups
	 StockCode: 20823, Descrip: gold wine goblet
	 StockCode: 21084, Descrip: set/6 collage paper cups
	 StockCode: 21634, Descrip: assorted mini madras notebook
	 StockCode: 47518F, Descrip: icon placemat pop art elvis
	 StockCode: 21394, Descrip: red polkadot beaker 
	 StockCode: 21402, Descrip: red  egg  spoon


### Adaptive recommendation when multiple similar items are clicked or added to the cart

In [8]:
# sim=0
# while sim < 0.7:
#     stockitems=choices(w2v_model.wv.index_to_key, k=2)
#     sim=w2v_model.wv.similarity(*stockitems)
stockitems=['90022', '90030C']
sim=w2v_model.wv.similarity(*stockitems)
print("\n\n==============================================================================")
print("""
Find similar items to:\n 
    StockCode: {s}, Descrip: {d}
    StockCode: {s1}, Descrip: {d1}
""".format(s=stockitems[0], d=stock_descrip[stockitems[0]],
           s1=stockitems[1], d1=stock_descrip[stockitems[1]]
          ))
print("==============================================================================")
for item in w2v_model.wv.most_similar(positive=stockitems, topn=10):
    print("\t StockCode: {s}, Descrip: {d}".format(s=item[0], d=stock_descrip[item[0]]))
print("==============================================================================")




Find similar items to:
 
    StockCode: 90022, Descrip: edwardian drop earrings jet black
    StockCode: 90030C, Descrip: brown kukui coconut seed necklace

	 StockCode: 90030A, Descrip: spotted white natural seed necklace
	 StockCode: 90064A, Descrip: white vintage crystal earrings
	 StockCode: 90059F, Descrip: diamante hair grip pack/2 lt rose
	 StockCode: 90011B, Descrip: black crystal drop earrings
	 StockCode: 90018C, Descrip: silver black orbit drop earrings
	 StockCode: 90001D, Descrip: antique olive green flower earrings
	 StockCode: 90170, Descrip: daisy hair band
	 StockCode: 90081C, Descrip: lily brooch olive colour
	 StockCode: 90059E, Descrip: diamante hair grip pack/2 ruby
	 StockCode: 90031, Descrip: bili nut and wood necklace


### LSTM (Use other notebook)

### Search capability from product descriptions

Methods:
   - inverted index
       - list the unique "words" from all documents and create an index where: 
       
           `word_1 -> (docid1, docid2, .. docid3)`
       - [big data infrastructure for inverted index](https://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/ditp/ditp_ch4.pdf)
   - document embedding (similar concept to word vectors but applied on product description)
       - doc2vec (PV-DBOW)
       - sentence transformers

## Search engine indexing and document vectorisation (Offline):
1. Create inverted index
2. Using product decriptions, train document/sentence embedding model (doc2vec/sentence transformers)
3. Each product item will have a corresponding vector learned from (2), this will be used when calculating relevance at query time

In [9]:
descrips = list(stock_descrip.values())
descrips[:3]

['medium ceramic top storage jar',
 'black candelabra t-light holder',
 'airline bag vintage jet set brown']

### Step 1:

In [10]:
corpus = descrips
inverted_index = defaultdict(set)

for docid, c in enumerate(corpus):
    for word in c.split():
        inverted_index[word].add(docid)

### Step 2:

#### Train Doc2Vec model

In [11]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=40)

train_corpus = read_corpus(corpus)

model.build_vocab(train_corpus)

print("Number of Document vectors: {l}, Dimension: 40".format(l=model.corpus_count))

Number of Document vectors: 3684, Dimension: 40


#### Infer document embedding using pre-trained sentence transformers

In [12]:
embedder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
descrip_embeddings = embedder.encode(descrips, convert_to_tensor=True)
docs, dim = descrip_embeddings.shape
print("Number of Document vectors: {l}, Dimension: {d}".format(l=docs, d=dim))

Number of Document vectors: 3684, Dimension: 384


### Step 3

## Search retrieval process (Online):
   1. split query in to words or ngrams:
       - "christmas wall decoration" -> `["christmas", "wall", "decoration"]
   2. Using inverted index, find the product candidates
       - candidates: product item that contains at least 1 word in the user's query
   3. Generate document relevance scores (cosine similarity of document vectors) for the particular query
       - relevance scores can also be generated using a ranking function (ML-learned)

In [13]:
def process_and_search(query):
    matched_documents = set()
    for word in query.split():
        matches = inverted_index.get(word)
        if matches:
            matched_documents |= matches
    return matched_documents

In [25]:
query="avocado chair"
# query="christmas wall decoration"

In [21]:
match_indx = list(process_and_search(query))

In [22]:
top_k = 10

### Doc2Vec

In [23]:
query_vector = model.infer_vector(query.split())

print("\n\n===========================================================")
print("Query:", query)
print("\nMost similar products:\n")

matched_docs = [(i,s) for i,s in zip(match_indx, model.dv.cosine_similarities(query_vector, model.dv.vectors[match_indx]))]
matched_docs = sorted(matched_docs, key=lambda tup: tup[1], reverse=True)
for item, sim_score in matched_docs[:top_k]:
    print("\t {d} (Score: {s:.4f})".format(d=corpus[item], s=sim_score))
print("===========================================================")



Query: avocado chair

Most similar products:

	 danish rose folding chair (Score: 0.0222)
	 school desk and chair  (Score: -0.0740)
	 flag of st george chair (Score: -0.0866)
	 blue painted kashmiri chair (Score: -0.1394)
	 pink painted kashmiri chair (Score: -0.1416)


### Pre-trained sentence transformer model

In [26]:
query_embedding = embedder.encode(query, convert_to_tensor=True)

search_embeddings = descrip_embeddings[match_indx]
cos_scores = util.pytorch_cos_sim(query_embedding, search_embeddings)[0]
idx_scores = [(i,s) for i,s in zip(match_indx, cos_scores)]
idx_scores = sorted(idx_scores, key=lambda tup: tup[1], reverse=True)[:top_k]

print("\n\n======================================================================\n\n")
print("Query:", query)
print("\nTop {n} most similar products:\n".format(n=top_k))

for idx, score in idx_scores:
    print('\t', corpus[idx], "(Score: {:.4f})".format(score))
print("======================================================================")





Query: avocado chair

Top 10 most similar products:

	 school desk and chair  (Score: 0.4560)
	 blue painted kashmiri chair (Score: 0.3810)
	 pink painted kashmiri chair (Score: 0.3591)
	 danish rose folding chair (Score: 0.3487)
	 flag of st george chair (Score: 0.3128)


In [27]:
cos_scores = util.pytorch_cos_sim(query_embedding, descrip_embeddings)[0]
top_results = torch.topk(cos_scores, k=top_k)

print("\n\n======================================================================\n\n")
print("Query:", query)
print("\nTop {n} most similar products:\n".format(n=top_k))

for idx, score in zip(top_results.indices.tolist(), top_results.values.tolist()):
    print('\t', corpus[idx], "(Score: {:.4f})".format(score))
print("======================================================================")





Query: avocado chair

Top 10 most similar products:

	 school desk and chair  (Score: 0.4560)
	 retrospot padded seat cushion (Score: 0.4412)
	 blue painted kashmiri chair (Score: 0.3810)
	 fuschia retro bar stool (Score: 0.3759)
	 tv dinner tray air hostess  (Score: 0.3697)
	 pink painted kashmiri chair (Score: 0.3591)
	 skull design tv dinner tray (Score: 0.3494)
	 danish rose folding chair (Score: 0.3487)
	 orange tv tray table  (Score: 0.3474)
	 spaceboy tv dinner tray (Score: 0.3413)


### Search pipeline
   1. Collect and store data which includes clickstream, transactions, search, text from product title, decription, review, etc.. image
   2. Build inverted index from products and merchants pages
   3. Use inverted index to create a database of word - product/merchant features 
       - store data to HDFS/Clickhouse/MySQL which are scalable for big data and can be easily accessible for Data Analysts and ML Engineers
   4. Using collected data, train models for product retrieval and recommendation using text/image/behavior data and other analytics and ML applications
         - database for `|ngram(word) | product id | product features |`
         - machine learning based product retrieval/recommendation
             - train a model that gives relevance score each words to corresponding products/merchants
             - on query time, aggregate word-document scores to produce page's query relevance score
         - product ranking by relevance (Learning to Rank) model using product titles, description, reviews, sales as features and clicks/ratings as labels
             - Gradient Boosted Trees (Catboost/XGboost/LightGBM) optimised using YetiRank/PairLogit/ and evaluated using Normatilised Cummulative Discounted Gain (NDCG) 
         - understand customer query better by incorporating semantic relevance inferred from using transfer learning and fine-tuning [pre-trained DNN models](https://www.sbert.net/docs/pretrained_models.html) like (BERT, GPT-3)
         
         <img src="../images/index.png" alt="index" width="500"/> <img src="../images/ranking_model.png" alt="ranking" width="500"/> 
   5. Continuously update the database and index of product, customer and merchant data and transactions
   6. Use customer feedback from clickstream and ratings to evaluate and improve models

<img src="../images/query_process.png" alt="query process" width="500" lenght="100"/>