# Implementation of a search engine based on sBERT

This notebook contains a basic implementation of sBERT for searching a database of sentences with queries.

The goal is to increase the amount of labeled data available so that we can later fine-tune a model for sentence classification. First, we need a pool of queries that represent the six labels, one per policy instrument. With these queries we can retrieve a set of sentences that can be automatically labeled with the label of the query that retrieved them. In this way we increase the diversity of labeled sentences in each label category. This approach will be complemented with a manual curation step to produce a high-quality training data set.

The policy instruments that we want to find and that correspond to the different labels are:
* Direct payment (PES)
* Tax deduction
* Credit/guarantee
* Technical assistance
* Supplies
* Fines

This notebook is intended for the following purposes:
* Try different query strategies to find the optimal retrieval of sentences in each policy instrument category
* Try different transformers
* Be the starting point for further enhancements

## Import modules

This notebook is self-contained: it does not depend on any other class in the sBERT folder.

You only need to create an environment and install the external dependencies. The usual ones are:

**For the basic sentence similarity calculation**
*  pandas
*  boto3
*  sentence_transformers

**If you want to use ngrams to generate queries**
*  nltk
*  plotly
*  wordcloud

**If you want to do evaluation and plotting with pyplot**
*  matplotlib
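
Before running the rest of the notebook, it can help to check which of these packages are importable in the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_dependencies` is hypothetical, and the list contains the import names of the packages listed above.

```python
import importlib.util

def missing_dependencies(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names of the packages listed above (top-level modules only).
required = ["pandas", "boto3", "sentence_transformers", "nltk", "plotly", "wordcloud", "matplotlib"]
print("Missing:", missing_dependencies(required))
```

An empty list means that every dependency was found. Anything reported as missing has to be installed in the environment before the corresponding cells will run.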

In [None]:
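# If your environment is called nlp, run this cell; otherwise change the name of the environment.
# Note: "!" commands run in a temporary subshell, so the environment should also be activated before launching Jupyter.
!conda activate nlp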

In [1]:
# General purpose libraries
import time  # used to time the similarity search
import numpy as np
import pandas as pd
import boto3
import json

# Model libraries
from sentence_transformers import SentenceTransformer
from scipy.spatial import distance

# Libraries for plotting and model evaluation
import matplotlib.pyplot as plt  # used below for the words-per-sentence histogram
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score
# from sklearn.metrics import confusion_matrix

# Libraries to be used in the process of defining queries
import nltk # imports the natural language toolkit
import plotly
from wordcloud import WordCloud
from collections import Counter
from nltk.util import ngrams
import re

## Defining Queries

In the following cells, we take the Excel file with the selected phrases of each country, process the phrases, and extract n-grams to define basic queries for the sBERT model.

In [None]:
data = pd.read_excel(r'WRI_Policy_Tags (1).xlsx', sheet_name = None)

# With sheet_name = None, read_excel returns a dict of {sheet name: DataFrame}.
# DataFrame.append was removed in pandas 2.0, so the sheets are stacked with pd.concat.
if isinstance(data, dict):
    df = pd.concat(list(data.values()))
else:
    df = data
df.head()

In [None]:
sentences = df["relevant sentences"].apply(lambda x: x.split(";") if isinstance(x,str) else x)
sentence = []

for elem in sentences:
    if isinstance(elem,float) or len(elem) == 0:
        continue
    elif isinstance(elem,list):
        for i in elem:
            if len(i.strip()) == 0:
                continue
            else:
                sentence.append(i.strip())
    else:
        if len(elem.strip()) == 0:
            continue
        else:
            sentence.append(elem.strip())

sentence
words_per_sentence = [len(x.split(" ")) for x in sentence]
plt.hist(words_per_sentence, bins = 50)
plt.title("Histogram of number of words per sentence")

In [None]:
def top_k_ngrams(word_tokens,n,k):
    
    ## Getting them as n-grams
    n_gram_list = list(ngrams(word_tokens, n))

    ### Getting each n-gram as a separate string
    n_gram_strings = [' '.join(each) for each in n_gram_list]
    
    n_gram_counter = Counter(n_gram_strings)
    most_common_k = n_gram_counter.most_common(k)
    print(most_common_k)

noise_words = []
stopwords_corpus = nltk.corpus.stopwords
sp_stop_words = stopwords_corpus.words('spanish')
noise_words.extend(sp_stop_words)
print(len(noise_words))

if "no" in noise_words:
    noise_words.remove("no")

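# Join with a space so the last word of one sentence does not merge with the first word of the next
tokenized_words = nltk.word_tokenize(' '.join(sentence))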
word_freq = Counter(tokenized_words)
# word_freq.most_common(20)
# list(ngrams(tokenized_words, 3))

# [^\W\d_]+ matches runs of letters, including accented Spanish characters such as "é" and "ñ"
word_tokens_clean = [re.findall(r"[^\W\d_]+",each) for each in tokenized_words if each.lower() not in noise_words and len(each.lower()) > 1]
word_tokens_clean = [each[0].lower() for each in word_tokens_clean if len(each)>0]

### n-grams size
We define the size of the n-gram that we want to find. The larger it is, the less frequent it will be, unless we substantially increase the number of phrases.
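
The trade-off between n-gram size and frequency can be seen on a small example. This is a minimal sketch in plain Python: the helper `count_ngrams` is hypothetical (it is not the `top_k_ngrams` function of this notebook) and the token list is invented for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Invented token list, already lowercased and without accents.
tokens = "asistencia tecnica para productores y asistencia tecnica para cooperativas".split()

for n in (1, 2, 3, 4):
    top_gram, frequency = count_ngrams(tokens, n).most_common(1)[0]
    print(n, repr(top_gram), frequency)
```

In this toy list the most frequent 2-gram and 3-gram still appear twice, but no 4-gram repeats, which is the effect described above.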

In [None]:
n_grams = 2

top_k_ngrams(word_tokens_clean, n_grams, 20)

## Accessing documents in S3

All documents from El Salvador have been preprocessed and their contents saved in a JSON file, which contains the sentences of interest.

Use the JSON file with the access key and secret to access the S3 bucket if necessary.
If not, skip this section and use files in a local folder. 
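
For the local-folder route, a minimal sketch of loading the same kind of file from disk could look like the following. The helper name `load_policy_file` and the example path are assumptions, and the sketch assumes the local copy has the same structure as the file stored in S3.

```python
import json
from pathlib import Path

def load_policy_file(path):
    """Read a preprocessed policy JSON file from a local folder."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)

# Hypothetical usage, assuming a local copy of the file exists:
# policy_list = load_policy_file("ElSalvador.json")
```

The result can then be used as `policy_list` in place of the object downloaded from S3.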

In [3]:
# If you want to keep the credentials in a local folder out of GitHub, you can change the path to adapt it to your needs.
# Please comment out other users' lines and set your own
path = "C:/Users/jordi/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/" # Jordi's local path in desktop
# path = "C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/" # Jordi's local path in laptop
# If you put the credentials file in the same "notebooks" folder, you can use the following path
# path = ""
filename = "Omdena_key_S3.json"
file = path + filename
# Use a neutral handle name so the built-in dict is not shadowed
with open(file, 'r') as f:
    key_dict = json.load(f)

In [4]:
for key in key_dict:
    KEY = key
    SECRET = key_dict[key]

In [5]:
s3 = boto3.resource(
    service_name = 's3',
    region_name = 'us-east-2',
    aws_access_key_id = KEY,
    aws_secret_access_key = SECRET
)

### Loading the sentence database from El Salvador

In [6]:
filename = 'JSON/ElSalvador.json'

obj = s3.Object('wri-latin-talent',filename)
serializedObject = obj.get()['Body'].read()
policy_list = json.loads(serializedObject)

### Building a list of potentially relevant sentences

In [7]:
is_not_incentive = {"CONSIDERANDO:" : 0,
                    "POR TANTO" : 0,
                    "DISPOSICIONES GENERALES" : 0,
                    "OBJETO" : 0,
                    "COMPETENCIA, PROCEDIMIENTOS Y RECURSOS." : 0}
sentences = {}
for key, value in policy_list.items():
    for item in value:
        if item in is_not_incentive:
            continue
        else:
            for sentence in policy_list[key][item]['sentences']:
                sentences[sentence] = policy_list[key][item]['sentences'][sentence]

In [8]:
len(sentences)
# for sentence in sentences:
#     print(sentences[sentence]['text'])

40124
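
The cell that builds `sentences` assumes a nested layout: document, then section, then a `sentences` dictionary keyed by sentence ID. The following sketch shows the same filtering on toy data. The helper name `collect_sentences`, the section name `INCENTIVOS`, and all IDs and texts are invented for illustration.

```python
def collect_sentences(policy_list, skip_sections):
    """Flatten {document: {section: {"sentences": {id: data}}}} into {id: data}."""
    collected = {}
    for sections in policy_list.values():
        for section_name, section in sections.items():
            if section_name in skip_sections:
                continue
            collected.update(section["sentences"])
    return collected

# Toy data (invented) mirroring the assumed layout.
toy_policy_list = {
    "doc_1": {
        "OBJETO": {"sentences": {"doc_1_s0": {"text": "Objeto de la ley.", "labels": []}}},
        "INCENTIVOS": {"sentences": {"doc_1_s1": {"text": "Se otorgarán créditos.", "labels": []}}},
    }
}
toy_sentences = collect_sentences(toy_policy_list, skip_sections={"OBJETO"})
print(list(toy_sentences))
```

Only the sentence from the section that is not skipped survives, which is how headers such as "OBJETO" are kept out of the search pool.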

## Initializing the model

First, we load the sBERT model. Several transformers are available; the documentation is at https://github.com/UKPLab/sentence-transformers

Then we build a simple function that takes four inputs:
1. The model, as loaded by `SentenceTransformer(transformer_name)`
2. A dictionary that contains the sentences {"\<sentence_ID\>" : {"text" : "The actual sentence", "labels" : []}}
3. A query in the form of a string
4. A similarity threshold: a float used to limit the results list to the most relevant sentences

The function returns a list of rows sorted by decreasing similarity, each with three columns:
1. Column 1 contains the id of the sentence
2. Column 2 contains the similarity score
3. Column 3 contains the text of the sentence that has been compared with the query

In [11]:
# transformer_name='xlm-r-100langs-bert-base-nli-mean-tokens'
transformer_name = "distiluse-base-multilingual-cased"
model = SentenceTransformer(transformer_name)

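def highlight(model, sentences_dict, query, similarity_treshold):
    # Embed the query once, then compare it against every sentence in the dictionary
    query_embedding = model.encode(query)
    highlights = []
    for sentence_id in sentences_dict:
        # Encode the sentence text, not its ID (the dictionary key)
        sentence_text = sentences_dict[sentence_id]['text']
        sentence_embedding = model.encode(sentence_text)
        # distance.cosine is a distance, so 1 - distance is the cosine similarity
        score = 1 - distance.cosine(sentence_embedding, query_embedding)
        if score > similarity_treshold:
            highlights.append([sentence_id, score, sentence_text])
    # Most similar sentences first
    highlights = sorted(highlights, key = lambda x : x[1], reverse = True)
    return highlights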
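
The score used in `highlight` is the cosine similarity between two embeddings. As a dependency-free sketch of that computation and of the threshold-and-sort step, the following uses toy 2-dimensional vectors instead of real sBERT embeddings. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical and the vectors are invented.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vector, vectors_by_id, threshold):
    """Return [id, score] pairs scoring above the threshold, best first."""
    scored = [[key, cosine_similarity(vec, query_vector)] for key, vec in vectors_by_id.items()]
    kept = [row for row in scored if row[1] > threshold]
    return sorted(kept, key=lambda row: row[1], reverse=True)

# Toy 2-dimensional "embeddings", invented for illustration.
toy_vectors = {"s1": [1.0, 0.0], "s2": [1.0, 1.0], "s3": [0.0, 1.0]}
print(rank_by_similarity([1.0, 0.0], toy_vectors, threshold=0.2))
```

Here `s1` points in the same direction as the query and ranks first, `s2` is at 45 degrees and ranks second, and `s3` is orthogonal to the query, so it falls below the threshold and is dropped.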


## Running the search

In [None]:
Ti = time.perf_counter()

highlighter_query = "Todos aquellos agricultores que fueron beneficiados con el otorgamiento de hasta dos créditos"
highlighter_precision = 0.2

label_1 = highlight(model, sentences, highlighter_query, highlighter_precision)

Tf = time.perf_counter()

print(f"similarity search for El Salvador sentences done in {Tf - Ti:0.4f} seconds")

### Inspecting the results

In [None]:
print(len(label_1))
label_1[0:10]

#### Further filtering of the results by using the similarity score

In [None]:
similarity_treshold = 0.5
filtered = [row for row in label_1 if row[1] > similarity_treshold]
filtered
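
Once the results are filtered, each remaining sentence can be given the label of the query that retrieved it, as described in the introduction. The following is a minimal sketch of that step, assuming rows in the `[id, score, text]` layout returned by `highlight`. The helper name `label_rows`, the record field names, and the toy row are hypothetical.

```python
def label_rows(rows, label):
    """Attach a policy-instrument label to [sentence_id, score, text] rows."""
    return [
        {"sentence_id": sentence_id, "text": text, "score": score, "label": label}
        for sentence_id, score, text in rows
    ]

# Toy row in the [id, score, text] layout (values invented for illustration).
toy_rows = [["doc_1_s1", 0.83, "Se otorgarán créditos a los agricultores."]]
labeled = label_rows(toy_rows, label="Credit/guarantee")
print(labeled[0]["label"])
```

Keeping the score next to each label makes the later manual curation step easier, since low-scoring sentences can be reviewed first.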