# BERT Content Based Recommendation Engine

![](tha.jpg)
<center>(Don't mind me, the editor is just having fun)</center>

BERT is Google's open-source NLP technique; the name stands for Bidirectional Encoder Representations from Transformers. In practice, it makes it very easy to create vector representations of words and sentences. A vector representation is a numeric representation of a word or sentence that can be fed to a machine learning pipeline so we can do "stuff" with it. (Stuff is an adequate word, look it up.)

One of the most important steps in a Data Science / ML pipeline is feature engineering, and BERT is a great starting point.

Why is BERT different from other, better-known word embedding techniques?

>*"Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional." [Link](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)

OK, let's say we have BERT. Many interesting projects come to mind:
- Text classification 
- Text summarization
- Sentiment analysis
- **Recommendation engine** you knew I was going to choose this one, you are smart, I like you.

The idea is that, based on some text, we can recommend other similar texts. Say you read Herman Melville's Moby Dick; you might be interested in more aquatic adventures! Perhaps Jules Verne's Twenty Thousand Leagues Under the Sea, or The Old Man and the Sea by Ernest Hemingway, or...

![](aquaman.jpg)

# Setting everything up

I pretty much followed [this tutorial](https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/) to set up BERT's client and server on my computer. You can go there and follow it or stay here; I have better images, though. For this to be fast, you need a cool graphics card. I have a GTX 1080 that I use exclusively for Deep Learning and never for games... never.

You will also need to download the pretrained models. I downloaded a couple, one uncased and one cased. Here are the links:
- [uncased_L-12_H-768_A-12](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)
- [cased_L-24_H-1024_A-16](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip)

If you have not used your GPU to train models before, you might run into trouble. Here is an excellent tutorial on setting up most of what you need to run ML on a GPU on Windows 10: [Link](https://harangdev.github.io/tips/1/). Create a virtual environment and stuff everything in there.

OK, so you have everything set up and ready. Start the server from your CLI with one of the following commands (one per model):

> *bert-serving-start -model_dir uncased_L-12_H-768_A-12 -gpu_memory_fraction 0.75 -cors 1 -num_worker 1 -device_map 0 -max_seq_len 40*
>
> *bert-serving-start -model_dir cased_L-24_H-1024_A-16 -gpu_memory_fraction 0.75 -cors 1 -num_worker 1 -device_map 0 -max_seq_len 40*

![](server_start.PNG)

A quick explanation on what the parameters mean:
- **-model_dir** : the path to where you downloaded the pretrained models
- **-gpu_memory_fraction** : the fraction of your GPU's memory that you allocate to this process
- **-cors** : the value of the `Access-Control-Allow-Origin` header for HTTP requests (it only matters if you enable the HTTP interface)
- **-num_worker** : the number of worker processes running the model, which is what limits how many requests the server can handle at the same time
- **-device_map** : the ids of the GPUs you will use; I have one, and its id is 0
- **-max_seq_len** : the maximum number of tokens per sentence that the server uses to build the sentence vector; longer inputs are truncated. There are some transformations involved that I haven't read in detail.

Following are the rest of the parameters:

In [None]:
"""
           ckpt_name = bert_model.ckpt
         config_name = bert_config.json
                cors = 1
                 cpu = False
          device_map = [0]
       do_lower_case = True
  fixed_embed_length = False
                fp16 = False
 gpu_memory_fraction = 0.9
       graph_tmp_dir = None
    http_max_connect = 10
           http_port = None
        mask_cls_sep = False
      max_batch_size = 256
         max_seq_len = 25
           model_dir = cased_L-24_H-1024_A-16
    no_special_token = False
          num_worker = 1
       pooling_layer = [-2]
    pooling_strategy = REDUCE_MEAN
                port = 5555
            port_out = 5556
       prefetch_size = 10
 priority_batch_size = 16
show_tokens_to_client = False
     tuned_model_dir = None
             verbose = False
                 xla = False
"""
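
To make `max_seq_len` and `pooling_strategy = REDUCE_MEAN` from the listing a bit more concrete, here is a minimal NumPy sketch of the general idea: keep at most a fixed number of tokens, then average the per-token vectors into a single sentence vector. The function name `mean_pool`, the random token vectors, and the sizes are assumptions for illustration only. This is not the bert-as-service implementation, which also deals with special tokens and padding.

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray, max_seq_len: int) -> np.ndarray:
    """Average the first `max_seq_len` token vectors into one sentence vector."""
    truncated = token_vectors[:max_seq_len]  # tokens past the limit are dropped
    return truncated.mean(axis=0)

# Hypothetical input: 60 tokens, each a 1024-dimensional vector
rng = np.random.default_rng(0)
tokens = rng.normal(size=(60, 1024))

sentence_vector = mean_pool(tokens, max_seq_len=25)
print(sentence_vector.shape)  # (1024,)
```

Whatever comes after the first 25 tokens has no influence on the result, which is why long descriptions lose information when the limit is small.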

# Find data, use BERT server, create vectors

The first step is to read the data that we will use for this recommendation engine. It should be some text or group of texts, which will (probably) need some cleaning before being sent for encoding.

The tutorial I followed linked to a Twitter sentiment competition, which I used at first, but I wanted to see how this works with other text, and I found this dataset. I hope it works, since I am doing this one day before presenting it.

I will be using [this data](https://github.com/groveco/content-engine/blob/master/sample-data.csv), which I found while browsing [this post](http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html). They use TF-IDF to create the feature vectors, but that is so 2016...

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.decomposition import PCA
from bert_serving.client import BertClient
from vsm import *

In [2]:
text = pd.read_csv('sample-data.csv', encoding='iso-8859-1')

In [3]:
text.shape

(500, 2)

In [4]:
text.head()

| | id | description |
|---|---|---|
| 0 | 1 | Active classic boxers - There's a reason why o... |
| 1 | 2 | Active sport boxer briefs - Skinning up Glory ... |
| 2 | 3 | Active sport briefs - These superbreathable no... |
| 3 | 4 | Alpine guide pants - Skin in, climb ice, switc... |
| 4 | 5 | Alpine wind jkt - On high ridges, steep ice an... |


In [5]:
text.drop('id', axis=1, inplace=True)

Here we define some functions to clean the text before sending it for encoding. They keep only alphabetic characters (and apostrophes), remove non-ASCII characters if present, remove extra spaces, and change all-caps words to title case. I kept the casing and will be using the cased pretrained model. I could change everything to lowercase and use the uncased one instead; that is part of the thrill of research that I am not doing just yet.

In [6]:
def proper_case(word: str) -> str:
    if len(word) == 0:
        return word
    elif word[0].isupper():
        return word.capitalize()
    else:
        return word.lower()

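def clean_text(text):
    # Replace everything except letters and apostrophes with a space
    text = re.sub(r'[^a-zA-Z\']', ' ', text)
    
    # Drop any non-ASCII characters (the substitution above already removes them; kept as a safeguard)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    
    # Split on spaces, drop empty tokens, normalize each word's case, and rejoin
    text = text.strip().split(' ')
    text = ' '.join([proper_case(word) for word in text if len(word) > 0])
       
    return text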

text['clean_text'] = text.description.apply(clean_text)

In [7]:
text.head()

| | description | clean_text |
|---|---|---|
| 0 | Active classic boxers - There's a reason why o... | Active classic boxers There's a reason why our... |
| 1 | Active sport boxer briefs - Skinning up Glory ... | Active sport boxer briefs Skinning up Glory re... |
| 2 | Active sport briefs - These superbreathable no... | Active sport briefs These superbreathable no f... |
| 3 | Alpine guide pants - Skin in, climb ice, switc... | Alpine guide pants Skin in climb ice switch to... |
| 4 | Alpine wind jkt - On high ridges, steep ice an... | Alpine wind jkt On high ridges steep ice and a... |


Now, let's encode our text:

In [9]:
bc = BertClient()
clean_encoded_text = bc.encode(text.clean_text.tolist())
clean_encoded_text.shape

(500, 1024)

We now have our feature vectors: numerical representations of the text we submitted. One issue we can see from the shape of the dataset is that we have many more columns (1024) than rows (500). Since we don't have more data (always our best ally), we might resort to some dimensionality reduction techniques.

Let's try PCA here:

> Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables. [wiki](https://en.wikipedia.org/wiki/Principal_component_analysis)

In [10]:
n = 128
pca = PCA(n_components=n, whiten=False, random_state=0)
pca_text = pca.fit_transform(clean_encoded_text)

We can check how much of the total variance is explained by the transformation. Since we are reducing the dimensionality, we should expect some loss.

In [11]:
print('Loss from 1024 to {} = {:.2f}%'.format(n, 100*(1-np.sum(pca.explained_variance_ratio_))))

Loss from 1024 to 128 = 12.06%
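
Instead of fixing `n = 128` and then checking the loss, one could go the other way: choose a target for the explained variance and look up how many components it takes. The sketch below does that with plain NumPy on made-up data. The helper name `components_for_variance` and the synthetic matrix are assumptions, and the ratios it computes are meant to mirror what `pca.explained_variance_ratio_` reports, not to replace it.

```python
import numpy as np

def components_for_variance(X: np.ndarray, target: float) -> int:
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target` (a value between 0 and 1)."""
    centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratios = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= target, converted to a count
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative))

# Stand-in for the encoded text: random data whose columns have decaying variance
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64)) * np.linspace(5.0, 0.1, 64)

k = components_for_variance(X, target=0.90)
print(k)
```

With the real `clean_encoded_text` in place of `X`, a target of about 0.88 should land near the 128 components used above, given the 12.06% loss reported.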


So let's say that is good enough (again, this should be tested and researched). To create the recommendation engine, we need a way to measure the distance between two vectors, and there are plenty of ways to do so.

I am developing a library with tools for Natural Language Understanding, so I will use those. The code is in vsm.py, in the same folder.
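
Since vsm.py is not shown here, the following is only a sketch of what two common distance functions could look like. These are stand-ins written for this explanation, not the actual contents of vsm.py, and the use of `np.ravel` (so that both `(d,)` and `(1, d)` inputs are accepted) is my own assumption.

```python
import numpy as np

def cosine_distance(u, v) -> float:
    """1 - cosine similarity: 0 for the same direction, 2 for opposite directions."""
    u, v = np.ravel(u), np.ravel(v)
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v) -> float:
    """Straight-line distance between the two vectors."""
    u, v = np.ravel(u), np.ravel(v)
    return float(np.linalg.norm(u - v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(cosine_distance(a, a))     # 0.0, identical direction
print(cosine_distance(a, b))     # 1.0, orthogonal vectors
print(euclidean_distance(a, b))  # about 1.414, the square root of 2
```

Cosine distance ignores vector length and only looks at direction, which is usually what you want for sentence embeddings. Euclidean distance is sensitive to both.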

Once we select our metric, our task is to find the vectors closest to the one we selected. That can be very slow when we have many vectors, because we need to compare each vector against every other vector in the dataset.

If we wanted recommendations for every vector, we would end up with a full distance matrix, of which only one triangle is needed because distance is symmetric. The time complexity using, say, Euclidean distance would be O(m*n^2), where n is the number of vectors and m is their size.
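
The all-pairs comparison described above can be sketched in a vectorized way. The example below builds the full Euclidean distance matrix for a made-up dataset and counts how many distinct pairs there are in one triangle. The function name `pairwise_euclidean` and the random data are assumptions for illustration.

```python
import numpy as np

def pairwise_euclidean(X: np.ndarray) -> np.ndarray:
    """All-pairs Euclidean distances between the rows of X, shape (n, n)."""
    sq_norms = np.sum(X**2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for every pair at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives caused by rounding

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))  # stand-in for 500 encoded descriptions

D = pairwise_euclidean(X)
upper = np.triu_indices(len(X), k=1)  # indices of the strictly upper triangle
print(D.shape, len(upper[0]))  # (500, 500) 124750
```

For 500 vectors that is 124,750 distinct pairs, which is tiny. The quadratic growth is what hurts once the catalog reaches hundreds of thousands of items.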

Anyway...

In [26]:
class recommendation(object):
    """Content-based recommender built on BERT sentence vectors."""
    
    def __init__(self, text: pd.DataFrame, col: str):
        self.text = text
        self.col = col
        self.clean_text = None
        self.encoded_text = None
        self.encoded_pca = None
        self.pca = None
        # Defined here so recommend() never hits a missing attribute
        self.use_pca = False
        self.bc = BertClient()
    
    
    def __clean_text(self):
        self.clean_text = self.text[self.col].apply(clean_text).tolist()
    
    
    def encode(self, n_components=1, random_state=0, use_pca=False):
        """Clean and encode the corpus once; optionally fit PCA on the encodings."""
        self.use_pca = use_pca
        self.n_components = n_components
        self.random_state = random_state
        
        if self.clean_text is None:
            self.__clean_text()
        
        # The BERT encoding is cached, so calling encode() again only refits PCA
        if self.encoded_text is None:
            self.encoded_text = self.bc.encode(self.clean_text)
            
        if self.use_pca:
            self.pca = PCA(n_components=self.n_components, random_state=self.random_state)
            self.pca.fit(self.encoded_text)
            self.encoded_pca =  self.pca.transform(self.encoded_text)
        
    
    def recommend(self, text: str, distance=cosine_distance, n_recommend=10, ascending=True):
        """Return the n_recommend rows closest to `text` under `distance`."""
        if self.encoded_text is None:
            raise RuntimeError('call encode() before recommend()')
        
        enc_text = self.bc.encode([clean_text(text)])
        
        if self.use_pca:
            # Project the query into the same PCA space as the corpus
            enc_pca = self.pca.transform(enc_text)
            self.text[distance.__name__] = np.apply_along_axis(distance, 1, self.encoded_pca, enc_pca)
            
        else:
            self.text[distance.__name__] = np.apply_along_axis(distance, 1, self.encoded_text, enc_text)
        
        # Note: the distance column is written into the DataFrame passed to the constructor
        return self.text.sort_values(by=distance.__name__, ascending=ascending).head(n_recommend)

In [27]:
rec = recommendation(text, 'description')

In [28]:
rec.encode()

here is what you can do:
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)


In [29]:
rec.recommend(text['description'].loc[0])

| | description | clean_text | cosine_distance |
|---|---|---|---|
| 0 | Active classic boxers - There's a reason why o... | Active classic boxers There's a reason why our... | 5.960464e-08 |
| 28 | Continental shorts - Wrinkle-resistant travel ... | Continental shorts Wrinkle resistant travel sh... | 0.03168947 |
| 439 | Cap 3 bottoms - The unwavering foundation for ... | Cap bottoms The unwavering foundation for any ... | 0.03196377 |
| 399 | Retro grade shorts - As advantageous as a numb... | Retro grade shorts As advantageous as a number... | 0.03285807 |
| 493 | Active boxer briefs - A no-fuss travel compani... | Active boxer briefs A no fuss travel companion... | 0.03371626 |
| 462 | Custodian pants - short - The graveyard shift ... | Custodian pants short The graveyard shift has ... | 0.03381079 |
| 480 | Duck pants - reg - Essential wear for splittin... | Duck pants reg Essential wear for splitting lo... | 0.03393221 |
| 11 | Baggies shorts - Even Baggies, our most popula... | Baggies shorts Even Baggies our most popular s... | 0.03402412 |
| 131 | Stretch polo - Core to the nomadic lifestyle, ... | Stretch polo Core to the nomadic lifestyle our... | 0.03427577 |
| 61 | El cap jkt - Resistant to hard play but irresi... | El cap jkt Resistant to hard play but irresist... | 0.03459531 |


In [33]:
rec2 = recommendation(text, 'description')

In [34]:
rec.encode(use_pca=True, n_components=128)

In [35]:
rec.recommend(text['description'].loc[0])

here is what you can do:
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)


| | description | clean_text | cosine_distance |
|---|---|---|---|
| 0 | Active classic boxers - There's a reason why o... | Active classic boxers There's a reason why our... | -1.192093e-07 |
| 54 | Hip chest pack - Ready to go vest free? This i... | Hip chest pack Ready to go vest free This is t... | 0.6087426 |
| 11 | Baggies shorts - Even Baggies, our most popula... | Baggies shorts Even Baggies our most popular s... | 0.6328166 |
| 452 | Compound cargo shorts - With cargo pockets pul... | Compound cargo shorts With cargo pockets pulle... | 0.6739159 |
| 85 | M10 pants - Volatile climates don't rule out b... | M pants Volatile climates don't rule out big a... | 0.683771 |
| 229 | Synch vest - With the possible exception of du... | Synch vest With the possible exception of duct... | 0.6877334 |
| 26 | Compound cargo pants - long - The ultimate do-... | Compound cargo pants long The ultimate do ever... | 0.7161425 |
| 398 | Marlwalker pants - Veterans of the tropics kno... | Marlwalker pants Veterans of the tropics know ... | 0.7177995 |
| 451 | Compound cargo pants - short - The ultimate do... | Compound cargo pants short The ultimate do eve... | 0.7187109 |
| 439 | Cap 3 bottoms - The unwavering foundation for ... | Cap bottoms The unwavering foundation for any ... | 0.7195204 |
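
The cosine distances jump from around 0.03 without PCA to around 0.6 or more with it. One likely explanation (my reading, not something verified on this dataset) is that scikit-learn's PCA subtracts the column means before projecting, while the raw vectors share a large common component that makes every pair look almost parallel. The sketch below reproduces that effect on synthetic data; the sizes and the `shared` offset are made up.

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
shared = rng.normal(size=64) * 10.0      # large component common to every row
X = shared + rng.normal(size=(200, 64))  # each row = shared part + small noise

raw = cosine_distance(X[0], X[1])
centered = X - X.mean(axis=0)            # the centering step PCA applies
after = cosine_distance(centered[0], centered[1])

print(raw < after)  # True: the same pair looks much farther apart once centered
```

If this is what is happening, the two rankings are not directly comparable by their distance values, only by which items end up on top.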


And yeah, that is basically it. This is the first run, plenty of parameters to tune, models to train, and metrics to take care of to get better results. 

Thank you

Your friendly neighbor.

---