# Assisted labeling

This notebook contains a basic implementation of sBERT for searching a database of sentences with queries.

The goal is to increase the amount of labeled data that we have in order to later fine-tune a model for sentence classification. First, we have to find a pool of queries that represent the six labels of the six policy instruments. With these queries we can pull a set of sentences that can be automatically labeled with the same label as the query. In this way we can increase the diversity of labeled sentences in each label category. This approach will be complemented with a manual curation step to produce a high-quality training data set.
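
As a rough illustration of the labeling idea, the sketch below gives each retrieved sentence the label of the query it matches best. The toy data, the `propagate_labels` helper, and the rule of keeping only the highest-scoring query per sentence are assumptions made for this example; they are not part of the notebook's code.

```python
# Hypothetical search results: {query: [[sentence_id, score, text], ...]}
toy_results = {
    "query about credit": [["s1", 0.81, "text 1"], ["s2", 0.40, "text 2"]],
    "query about fines": [["s2", 0.65, "text 2"], ["s3", 0.55, "text 3"]],
}
# Hypothetical mapping from each query to its policy-instrument label
toy_query_labels = {"query about credit": "Credit", "query about fines": "Fine"}


def propagate_labels(results, query_labels):
    """Give each sentence the label of the query it is most similar to."""
    best = {}  # sentence_id -> (score, label)
    for query, hits in results.items():
        for sentence_id, score, _text in hits:
            if sentence_id not in best or score > best[sentence_id][0]:
                best[sentence_id] = (score, query_labels[query])
    return {sentence_id: label for sentence_id, (_score, label) in best.items()}


auto_labels = propagate_labels(toy_results, toy_query_labels)
print(auto_labels)
```

Labels produced this way are only candidates: the manual curation step is still needed before they are used for training.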

The policy instruments that we want to find and that correspond to the different labels are:
* Direct payment (PES)
* Tax deduction
* Credit/guarantee
* Technical assistance
* Supplies
* Fines

This notebook is intended for the following purposes:
* Load a database of sentences
* Create sentence embeddings for further processing
* Define the queries corresponding to the labels
* Compute the cosine similarity score between the embeddings of the sentences and those of the queries

## Import modules

This notebook is self-contained; it does not depend on any other class in the sBERT folder.

You only have to create an environment and install the external dependencies. The dependencies that you usually have to install are:

**For the basic sentence similarity calculation**
*  pandas
*  boto3
*  pytorch
*  sentence_transformers

**If you want to do evaluation and plotting with pyplot**
*  matplotlib

In [None]:
# If your environment is called NLP, run this cell; otherwise change the name of the environment
!conda activate NLP

In [1]:
# General purpose libraries
import boto3
import copy
import csv
import datetime
import json
import numpy as np
import pandas as pd
import time
from pathlib import Path  # needed by the results-analysis cells further down
import re

In [2]:
# Model libraries
from sentence_transformers import SentenceTransformer
import sentencepiece
from scipy.spatial import distance

from json import JSONEncoder

class NumpyArrayEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return JSONEncoder.default(self, obj)

## Accessing documents in S3

All documents from El Salvador have been preprocessed and their contents saved in a JSON file, which contains the sentences of interest.

Use the JSON file with the key and password to access the S3 bucket if necessary.
If not, skip this section and use files in a local folder.

In [3]:
# If you want to keep the credentials in a local folder outside GitHub, you can change the path to suit your needs.
# Please comment out other users' lines and set your own
path = "C:/Users/jordi/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/" # Jordi's local path in desktop
# path = "C:/Users/user/Google Drive/Els_meus_documents/projectes/CompetitiveIntelligence/WRI/Notebooks/credentials/" # Jordi's local path in laptop
# path = ""
#If you put the credentials file in the same "notebooks" folder then you can use the following path
# path = ""
filename = "AWS_S3_keys_Omdena.json"
file = path + filename
with open(file, 'r') as f:  # "f" avoids shadowing the built-in dict
    key_dict = json.load(f)

In [4]:
for key in key_dict:
    KEY = key
    SECRET = key_dict[key]
region = 'us-east-2'

In [5]:
s3 = boto3.resource(
    service_name = 's3',
    region_name = region,
    aws_access_key_id = KEY,
    aws_secret_access_key = SECRET
)

### Loading the sentence database

Here we merge the document databases of all countries into a single dictionary.
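
The merge relies on dictionary unpacking, which can be sketched with toy data as follows. The document ids and contents are made up for the example. Note that when two files share a document id, the file loaded last wins.

```python
# Made-up per-country document dictionaries, keyed by document id
chile_docs = {"doc_a": {"title": "A"}, "doc_b": {"title": "B"}}
el_salvador_docs = {"doc_b": {"title": "B (El Salvador)"}, "doc_c": {"title": "C"}}

merged = {}
for country_docs in (chile_docs, el_salvador_docs):
    # Keys from country_docs overwrite keys already present in merged
    merged = {**merged, **country_docs}

print(list(merged))
print(merged["doc_b"])
```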

In [6]:
filename = ['Chile.json','ElSalvador.json']
filter_language = "_ES"
filter_folder = "JSON/" #TODO: move everything to a "Sentences" folder
filter_prefix = filter_folder + filter_language # TODO: change naming convention for files adding _ES or _En according to language. 

policy_dict = {}
for file in filename:
    filter_prefix  = filter_folder + file
    for obj in s3.Bucket('wri-latin-talent').objects.all().filter(Prefix=filter_prefix):
    #     obj = s3.Object('wri-latin-talent',filename)
        serializedObject = obj.get()['Body'].read()
        policy_dict = {**policy_dict, **json.loads(serializedObject)}

In [None]:
i = 0
for item in policy_dict:
    if i == 0:
        print(policy_dict[item])
    i += 1

### Building a list of potentially relevant sentences

The main purpose of the following code is to apply a series of filters to the sentence database to reduce the final number of sentences used:
* The most important filter keeps the sentences that are in relevant parts of the documents and leaves aside those in parts that, by nature, will not contain incentives.

* A second filter selects sentences by length.

* A third filter, for testing only, arbitrarily selects a sample of sentences, because running the sentence embedding function takes time. The variable "slim_by" is the reduction factor. If it is set to 1, there is no reduction and we work with the full dataset. If it is set to 2, we take one of every two sentences, and so on.
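
The length and sampling filters can be sketched on toy data as follows. The sentence ids, texts, and the `keep_every_nth` helper are made up for this example, and it assumes that each sentence record keeps its text under a "text" key, as the code in this notebook does.

```python
def keep_every_nth(counter, slim_factor):
    """Return True for one sentence out of every `slim_factor`."""
    return counter % slim_factor == 0


# Toy sentence database: ten records with texts of 40, 45, ..., 85 characters
toy_sentences = {f"doc_{i}": {"text": "x" * (40 + 5 * i)} for i in range(10)}

min_length = 50  # keep only texts longer than this many characters
slim_by = 2      # then keep one of every two surviving sentences

kept = {}
count = 0
for sentence_id, sentence in toy_sentences.items():
    if len(sentence["text"]) > min_length:
        count += 1
        if keep_every_nth(count, slim_by):
            kept[sentence_id] = sentence

print(list(kept))
```

With `slim_by = 1` every sentence that passes the length filter is kept.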

The output of the function is a dictionary in which each sentence id maps to its sentence record, with the text stored under the "text" key:

{"\<sentence id\>" : {"text" : "\<text of the sentence\>"}}.

<span style="color:red"><strong>REMEMBER</strong></span> that you have to re-run the function "get_sentences_dict" with the "slim_by" variable set to 1 when you want to do the final run.

In [7]:
# Shrink the sentences dict by a user-set factor: keep only one sentence out of every "slim_factor".
def slim_dict(counter, slim_factor):
    return counter % slim_factor == 0

# Trim sentences that are too short to be meaningful.
# The filter counts characters, but it can easily be adapted to count words.
# NOTE: maxLength is accepted but not enforced, because the upper bound is currently disabled.
def sentence_length_filter(sentence_text, minLength, maxLength):
    return len(sentence_text) > minLength  # and len(sentence_text) < maxLength

def get_sentences_dict(docs_dict, is_not_incentive_dict, slim_factor, minLength, maxLength):
    count = 0
    result = {}
    for key, value in docs_dict.items():
        for item in value:
            # Skip the document sections that cannot contain incentives
            if item in is_not_incentive_dict:
                continue
            for sentence_id, sentence in value[item]['sentences'].items():
                if sentence_length_filter(sentence["text"], minLength, maxLength):
                    count += 1
                    # Use the slim_factor argument instead of the global slim_by
                    if slim_dict(count, slim_factor):
                        result[sentence_id] = sentence
    return result

In [8]:
is_not_incentive = {}
# is_not_incentive = {"CONSIDERANDO:" : 0,
#                     "POR TANTO" : 0,
#                     "DISPOSICIONES GENERALES" : 0,
#                     "OBJETO" : 0,
#                     "COMPETENCIA, PROCEDIMIENTOS Y RECURSOS." : 0}
# is_not_incentive = {"CONSIDERANDO:" : 0,
#                     "POR TANTO" : 0,
#                     "DISPOSICIONES GENERALES" : 0,
#                     "OBJETO" : 0,
#                     "COMPETENCIA, PROCEDIMIENTOS Y RECURSOS." : 0,
#                    "VISTO" : 0,
#                    "HEADING" : 0}

slim_by = 10000 # REMEMBER to set this variable to the desired value.
min_length = 50 # Just to avoid short sentences which might be fragments or headings without a lot of value
max_length = 250 # Just to avoid long sentences which might be artifacts or long legal jargon separated by semicolons

sentences = get_sentences_dict(policy_dict, is_not_incentive, slim_by, min_length, max_length)


In [9]:
# Just to check if the results look ok
print("In this data set there are {} policies and {} sentences".format(len(policy_dict),len(sentences)))
# for sentence in sentences:
#     print(sentences[sentence]['text'])


In this data set there are 5191 policies and 14 sentences


In [None]:
sentences["70be962_99"]

## Computing the embeddings

First, we import the sBERT model. Several transformers are available and documentation is here: https://github.com/UKPLab/sentence-transformers <br>

Then we build a simple function that takes four inputs:
1. The model as we have set it in the previous line of code
2. A dictionary that contains the sentences {"\<sentence_ID\>" : {"text" : "The actual sentence", "labels" : []}}
3. A query in the form of a string
4. A similarity threshold. It is a float that we can use to limit the results list to the most relevant sentences.

The output of the function is a list of rows with three columns:
1. Column 1 contains the id of the sentence
2. Column 2 contains the similarity score
3. Column 3 contains the text of the sentence that has been compared with the query
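
The scoring and ranking step can be sketched without any model by using made-up vectors in place of real embeddings. The three-dimensional vectors, the texts, and the threshold below are assumptions chosen for the example; real sBERT embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" and texts for three sentences
toy_embeddings = {"s1": [1.0, 0.0, 0.0], "s2": [1.0, 1.0, 0.0], "s3": [0.0, 0.0, 1.0]}
toy_texts = {"s1": "first sentence", "s2": "second sentence", "s3": "third sentence"}
query_embedding = [1.0, 0.0, 0.0]
threshold = 0.5

ranked = []
for sentence_id, embedding in toy_embeddings.items():
    score = cosine_similarity(embedding, query_embedding)
    if score > threshold:
        ranked.append([sentence_id, score, toy_texts[sentence_id]])
# Most similar sentences first
ranked.sort(key=lambda row: row[1], reverse=True)

for row in ranked:
    print(row)
```

Here "s3" is orthogonal to the query, so its score is 0 and the threshold removes it from the list.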

### Modeling functions

There are currently two multilingual models available for sentence similarity:

* xlm-r-bert-base-nli-stsb-mean-tokens: Produces embeddings similar to those of the bert-base-nli-stsb-mean-tokens model. Trained on parallel data for 50+ languages.
<span style="color:red"><strong>Attention!</strong></span> The model "xlm-r-100langs-bert-base-nli-mean-tokens", which was the name used in the original Omdena-challenge script, has been renamed to "xlm-r-bert-base-nli-stsb-mean-tokens".

* distiluse-base-multilingual-cased-v2: Multilingual knowledge-distilled version of the multilingual Universal Sentence Encoder. While the original mUSE model only supports 16 languages, this distilled version supports 50+ languages.

In [25]:
# Create the embeddings of all sentences with one transformer and save them in a JSON file.
# INPUT PARAMETERS
# transformer: the name of the transformer to use
# sentences_dict: a dictionary with the sentences of the database with the form {"<sentence id>" : {"text" : "<sentence text>"}}
# file: the filepath and filename of the output json
# OUTPUT
# the embeddings of the sentences in a json with the following structure:
# {"<sentence id>" : [<sentence embedding>]}

def create_sentence_embeddings(transformer, sentences_dict, file):
    model = SentenceTransformer(transformer)
    embeddings = {}
    for sentence in sentences_dict:
        embeddings[sentence] = [model.encode(sentences_dict[sentence]['text'].lower(), show_progress_bar=False)]
    with open(file, 'w') as fp:
        json.dump(embeddings, fp, cls = NumpyArrayEncoder)
    # Upload a copy to S3 under "pre_embeddings/", keeping only the file name
    file_name = "pre_embeddings/" + file.split("/")[-1]
    with open(file, 'rb') as fp:
        s3.Object('wri-latin-talent', file_name).put(Body=fp)
     
   
def highlight(transformer_name, model, sentence_emb, sentences_dict, query, similarity_treshold):
    query_embedding = model.encode(query.lower())
    highlights = []
    for sentence in sentences_dict:
        sentence_embedding = np.asarray(sentence_emb[sentence])[0]#[transformer_name][sentence])[0]
        score = 1 - distance.cosine(sentence_embedding, query_embedding)
        if score > similarity_treshold:
            highlights.append([sentence, score, sentences_dict[sentence]['text']])
    highlights = sorted(highlights, key = lambda x : x[1], reverse = True)
    return highlights


### Create embeddings for sentences in the database with the pre-trained model

This piece of code is to be executed only once each time the database changes or when we want to get the embeddings of a new database. For example, we are going to use it once for El Salvador policies and we don't need to use it again until we add new policies to this database.

Instead, whenever we want to run experiments on this database, we will load the JSON files with the embeddings which are in the "input" folder.

So, the next cell will be kept commented out for safety reasons. Uncomment it and execute it whenever you need it.
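
The save-once, load-many pattern can be sketched with a local round trip through JSON. The vectors, the `ArrayEncoder` class (which mirrors the `NumpyArrayEncoder` defined earlier), and the file name are made up for the example, and the file is written to a temporary folder rather than to the "input" folder or S3.

```python
import json
import os
import tempfile

import numpy as np


class ArrayEncoder(json.JSONEncoder):
    """Serialize NumPy arrays as plain lists."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)


# Made-up embeddings, each wrapped in a one-element list as create_sentence_embeddings does
toy_embeddings = {"s1": [np.array([0.1, 0.2, 0.3])], "s2": [np.array([0.4, 0.5, 0.6])]}

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "toy_embeddings.json")  # hypothetical file name
    with open(cache_file, "w") as fp:
        json.dump(toy_embeddings, fp, cls=ArrayEncoder)
    with open(cache_file, "r") as fp:
        loaded = json.load(fp)

# JSON returns nested lists, so convert back to arrays and unwrap the outer list
restored = {key: np.asarray(value)[0] for key, value in loaded.items()}
print(restored["s1"])
```

The `[0]` when restoring is the same unwrapping that `highlight` applies to each stored embedding.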

In [26]:
Ti = time.perf_counter()

# We will use only one transformer to compute embeddings
transformer_name = 'xlm-r-bert-base-nli-stsb-mean-tokens'

path = "../../input/"
today = datetime.date.today()
today = today.strftime('%Y-%m-%d')
filename = "Embeddings_" + today + "_ES.json"
file = path + filename

create_sentence_embeddings(transformer_name, sentences, file)

Tf = time.perf_counter()

print(f"The building of a sentence embedding database for El Salvador in the two current models has taken {Tf - Ti:0.4f} seconds")

The building of a sentence embedding database for El Salvador in the two current models has taken 6.2777 seconds


### Loading the pre-trained model embeddings for database sentences

In [34]:
filter_language = "_ES"
filter_prefix = "pre_embeddings/" #TODO: move everything to a "Sentences" folder

sentence_embeddings = {}
for obj in s3.Bucket('wri-latin-talent').objects.all().filter(Prefix=filter_prefix):
    print("\n*** in 1", obj.key)
    if filter_language in obj.key:
        print("\n*** in 2", obj.key)
        serializedObject = obj.get()['Body'].read()
        sentence_embeddings = {**sentence_embeddings, **json.loads(serializedObject)}
    #     obj = s3.Object('wri-latin-talent',filename)


*** in 1 pre_embeddings/Embeddings_2021-03-04_ES.json

*** in 2 pre_embeddings/Embeddings_2021-03-04_ES.json


In [35]:
sentence_embeddings

{'1934416_15': [[-0.07183882594108582,
   -0.15796872973442078,
   1.2472082376480103,
   0.7328324913978577,
   0.21512727439403534,
   -0.008788103237748146,
   -0.77666175365448,
   0.43852102756500244,
   0.0241704061627388,
   -0.28062373399734497,
   0.6563766598701477,
   0.15529128909111023,
   0.3175541162490845,
   1.0527704954147339,
   -0.04091237112879753,
   0.03930460661649704,
   0.38285893201828003,
   0.6351944208145142,
   -0.6672925353050232,
   -0.6026720404624939,
   0.10112451016902924,
   0.3130311667919159,
   -0.2368570864200592,
   -0.2738688588142395,
   -0.13243883848190308,
   0.12706869840621948,
   0.1350613385438919,
   -0.33273398876190186,
   0.9802122116088867,
   -0.07314999401569366,
   0.21889004111289978,
   -0.4243481755256653,
   -0.3706037700176239,
   0.5283688902854919,
   -1.425404667854309,
   -0.3159946799278259,
   0.2442912459373474,
   0.8409515619277954,
   -0.19795754551887512,
   0.5083175301551819,
   -0.2524135112762451,
   -0.263

In [None]:
print(len(sentence_embeddings))
# for key in sentence_embeddings:
#     print(key)
#     print(len(sentence_embeddings[key]))

## Assisted labeling by query search

In [None]:
# The function below uses a set of queries to search a database for similar sentences with a given transformer.
# The input parameters are:

# transformer: The name of the transformer to be used. For multilingual similarity search we have two transformers
# queries: A list of the queries, as strings, that we want to use for searching the database

# similarity_limit: The results are in the form of a similarity coefficient where 1 is a perfect match between the query embedding
# and the sentence in the database (the two vectors point in the same direction). If the similarity coefficient is 0 the two vectors
# are orthogonal, they do not share anything in common. Thus, in order to restrict the number of results that are kept from the
# experiment we can set a similarity threshold. When we have a huge database a good threshold would be 0.3 to 0.5 or even higher.

# results_limit: Instead of or in addition to similarity_limit, we can limit our list of search results to the first sentences
# in the similarity ranking. We can set the limit to high numbers in an exploration phase and then reduce this number in a
# "production" phase

# filename: The results will be exported to the "output/" folder in JSON format; we need to give it a name without extension.

def sentence_similarity_search(transformer, queries, similarity_limit, results_limit, filename):
    # Load the model once; highlight() needs it to embed the queries
    model = SentenceTransformer(transformer)
    results = {}
    for query in queries:
        Ti = time.perf_counter()
        similarities = highlight(transformer, model, sentence_embeddings, sentences, query, similarity_limit)
        results[query] = similarities[0:results_limit]
        Tf = time.perf_counter()
        print(f"Similarity search for model {transformer} and query {query} was done in {Tf - Ti:0.4f} seconds")

    # Export the results as JSON to the "output" folder
    path = "../output/"
    filename = filename + ".json"
    file = path + filename
    with open(file, 'w') as fp:
        json.dump(results, fp, indent=4)
    return results

# This function helps debugging misspelling in the values of the dictionary
def check_dictionary_values(dictionary):
    check_country = {}
    check_incentive = {}
    for key, value in dictionary.items():
        incentive, country = value.split("-")
        check_incentive[incentive] = 0
        check_country[country] = 0
    print(check_incentive)
    print(check_country)

### Query building

The code to compute sentence similarity takes two inputs:

* The queries, which are input as a list of strings.
* The embeddings of the sentences in the database.

At this point everything we need to run the experiment is ready except the list of queries. One can write the list manually or build it from other data flows. The next cells are meant to do this.

Here we use the database of tagged sentences to define queries. The database is structured by country. From a list of model documents, the sentences were separated and tagged with a policy instrument label. The labels that were used are:

* Credit
* Direct payment
* Fine
* Guarantee
* Supplies
* Tax deduction
* Technical assistance

Not all countries have tagged sentences for each category, so we ended up with 26 queries.

The difference between this experiment and experiment 2 is that here we have reformulated the query sentences by extracting the core incentive meaning from the original sentences, eliminating all the vocabulary not strictly speaking about incentives.
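
Each query is stored together with a tag of the form "\<instrument\>-\<country\>". The sketch below shows how such tags can be split to group queries by policy instrument. It uses three entries copied from the queries dictionary of this notebook; the grouping itself is an illustration and is not part of the original workflow.

```python
from collections import defaultdict

# Three entries copied from the queries dictionary used in this notebook
toy_queries = {
    "Se asocia con créditos de enlace del Banco del Estado": "Credit-Chile",
    "Acceso a los fondos forestales para el pago de actividad": "Direct_payment-Perú",
    "Apoyo técnico y en formulación de proyectos y conexión con mercados": "Technical_assistance-El Salvador",
}

queries_by_instrument = defaultdict(list)
for query, tag in toy_queries.items():
    # Split on the first hyphen only, so the country part is kept whole
    instrument, country = tag.split("-", 1)
    queries_by_instrument[instrument].append((country, query))

for instrument, items in queries_by_instrument.items():
    print(instrument, [country for country, _query in items])
```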

In [None]:
queries_dict_exp3 = {
    "Otorgamiento de estímulos crediticios por parte de el estado" : "Credit-México",
"Estos créditos podrían beneficiar a sistemas productivos asociados a la pequeña y mediana producción" : "Credit-Perú",
"Se asocia con créditos de enlace del Banco del Estado" : "Credit-Chile", 
"Acceso al programa de garantía crediticia para la actividad económica" : "Credit-Guatemala",
"El banco establecerá líneas de crédito para que el sistema financiero apoye la pequeña, mediana y microempresa" : "Credit-El Salvador",
"Dentro de los incentivos económicos se podrá crear un bono para retribuir a los propietarios por los bienes y servicios generados." : "Direct_payment-México",
"Acceso a los fondos forestales para el pago de actividad" : "Direct_payment-Perú",
"Se bonificará el 90% de los costos de repoblación para las primeras 15 hectáreas y de un 75% respecto las restantes" : "Direct_payment-Chile",
"El estado dará un incentivo que se pagará una sola vez a los propietarios forestales" : "Direct_payment-Guatemala",
"Incentivos en dinero para cubrir los costos directos e indirectos del establecimiento y manejo de areas de producción" : "Direct_payment-El Salvador",
"Toda persona física o moral que cause daños estará obligada a repararlo o compensarlo" : "Fine-México",
"Disminuir los riesgos para el inversionista implementando mecanismos de aseguramiento" : "Guarantee-México",
"Podrá garantizarse el cumplimiento de la actividad mediante fianza otorgada a favor del estado por cualquiera de las afianzadoras legalmente autorizadas." : "Guarantee-Guatemala",
"El sujeto de derecho podrá recibir insumos para la instalación y operación de infraestructuras para la actividad económica." : "Supplies-México",
"Se facilitará el soporte técnico a  través de la utilización de guías, manuales, protocolos, paquetes tecnológicos, procedimientos, entre otros." : "Supplies-Perú",
"Se concederán incentivos en especie para fomentar la actividad en forma de insumos" : "Supplies-El Salvador",
"Se otorgarán incentivos fiscales para la actividad primaria y también la actividad de transformación" : "Tax_deduction-México",
"De acuerdo con los lineamientos aprobados se concederá un 25% de descuento en el pago del derecho de aprovechamiento" : "Tax_deduction-Perú",
"Las bonificaciones percibidas o devengadas se considerarán como ingresos diferidos en el pasivo circulante y no se incluirán para el cálculo de la tasa adicional ni constituirán renta para ningún efecto legal hasta el momento en que se efectúe la explotación o venta" : "Tax_deduction-Chile",
"Los contratistas que suscriban contratos de exploración y/o explotación, quedan exentos de cualquier impuesto sobre los dividendos, participaciones y utilidades" : "Tax_deduction-Guatemala",
"Exención de los derechos e impuestos, incluyendo el Impuesto a la Transferencia de Bienes Muebles y a la Prestación de Servicios, en la importación de sus bienes, equipos y accesorios, maquinaria, vehículos, aeronaves o embarcaciones" : "Tax_deduction-El Salvador",
"Se facilitará formación Permanente Además del acompañamiento técnico, los sujetos de derecho participarán en un proceso permanente de formación a lo largo de todo el año, que les permita enriquecer sus habilidades y capacidades " : "Technical_assistance-México",
"Contribuir en la promoción para la gestión, a través de la capacitación, asesoramiento, asistencia técnica y educación de los usuarios" : "Technical_assistance-Perú",
"Asesoría prestada al usuario por un operador acreditado, conducente a elaborar, acompañar y apoyar la adecuada ejecución técnica en terreno de aquellas prácticas comprometidas en el Plan de Manejo" : "Technical_assistance-Chile",
"Para la ejecución de programas de capacitación, adiestramiento y otorgamiento de becas para la preparación de personal , así como para el desarrollo de tecnología en actividades directamente relacionadas con las operaciones objeto del contrato" : "Technical_assistance-Guatemala",
"Apoyo técnico y en formulación de proyectos y conexión con mercados" : "Technical_assistance-El Salvador"}

# The queries are the keys of the dictionary, in insertion order
queries = list(queries_dict_exp3)
        
# print(queries)

The next cell just checks for misspellings in the values of the queries dictionary.

In [None]:
check_dictionary_values(queries_dict_exp3)

### Similarity search

In [None]:
transformer ='xlm-r-bert-base-nli-stsb-mean-tokens'
similarity_threshold = 0.2
search_results_limit = 1000
today = datetime.date.today()
today = today.strftime('%Y-%m-%d')
name = "Pre_tagged_" + today + "_" + filter_language

results_dict = sentence_similarity_search(transformer, queries, similarity_threshold, search_results_limit, name)

## Results analysis

This is a temporary section to explore how to analyze the results. It is organized with the same structure as the section <strong>Query building</strong>, as we are exploring the best search strategies based on different types of queries.

### N-grams approach

### Parts-of-speech approach

### Keyword approach

In [None]:
# Loading the results

# path = "../output/"
# filename = "Experiment_201215_jordi_1500.json"
# file = path + filename

# with open(file, "r") as f:
#     experiment_results = json.load(f)

#### Experiment 1

First we load the results and refactor data structures to better process them.

In [None]:
experiment_results = results

# Building a final dictionary of the results with an extra layer that has sentence IDs as keys of the last layer
experiment_results_full_dict = {}
for model in experiment_results:
    experiment_results_full_dict[model] = {}
    i = 0
    for keyword in experiment_results[model]:
        if i % len(experiment_results[model]) == 0:
            key = keyword + "_K"
            experiment_results_full_dict[model][key] = {}
            for result in experiment_results[model][keyword]:
                experiment_results_full_dict[model][key][result[0]] = result[1:len(result)]
        else:
            key = key[0:-2] + "_S"
            experiment_results_full_dict[model][key] = {}
            for result in experiment_results[model][keyword]:
                experiment_results_full_dict[model][key][result[0]] = result[1:len(result)]
        i += 1
            
# Building a dictionary with all sentences found by exact keyword matching. The dictionary is of the form:
# {"<incentive>" : {"<sentence_id_1>" : [], ... "<sentence_id_n>" : []}}

keyword_hits = {}
for item in keyword_in_sentences:
    # item[0] is the incentive and item[1] is the sentence id
    keyword_hits.setdefault(item[0], {})[item[1]] = []

In [None]:
transformer_names =['xlm-r-bert-base-nli-stsb-mean-tokens', 'distiluse-base-multilingual-cased-v2']


for incentive, sentence_list in keyword_hits.items():
#     print("\t", incentive.center(25)
    for sentence_id in sentence_list:
        for model_name in transformer_names:
            for key in experiment_results_full_dict[model_name]:
                if incentive in key:
                    if sentence_id in experiment_results_full_dict[model_name][key]:
                        keyword_hits[incentive][sentence_id].append(experiment_results_full_dict[model_name][key][sentence_id][2])
                        keyword_hits[incentive][sentence_id].append(round(experiment_results_full_dict[model_name][key][sentence_id][0], 2))
                    else:
                        keyword_hits[incentive][sentence_id].append(15000)
                        keyword_hits[incentive][sentence_id].append(0.0)
        i += 1
#         for keyword in 
#             print(experiment_results_full_dict[model_name].keys())

In [None]:
keyword_hits

In [None]:
results_csv = []
for key, value in keyword_hits.items():
    for sentence, res in value.items():
        results_csv.append([key, sentence, res[0], res[1], res[2], res[3], res[4], res[5], res[6], res[7]])

In [None]:
column_names = ["keyword", "sentence_ID", "xlm_K-rank", "xlm_K-sim", "xlm_S-rank", "xlm_S-sim", "dist_K-rank", "dist_K-sim", "dist_S-rank", "dist_S-sim"]
df= pd.DataFrame(results_csv, columns = column_names)
        
    

In [None]:
path = "../output/"
filename = "Experiment_201217_jordi_1500.csv"
file = path + filename

df.to_csv(file)
df.head()

### Tagged sentence approach

Below, we define the functions that are going to be used in the post-processing and in the analysis of the experiments.

In [None]:
# To show the contents of the results dict, particularly, the length of the first element and its contents
def show_results(results_dictionary):
    i = 0
    for key1 in results_dictionary:
        for key2 in results_dictionary[key1]:
            if i == 0:
                print(len(results_dictionary[key1][key2]))
                print(results_dictionary[key1][key2])
            i += 1

# Adding the rank to each result
def add_rank(results_dictionary):
#     for model in results_dictionary:
    for keyword in results_dictionary:#[model]:
        i = 1
        for result in results_dictionary[keyword]:#[model][keyword]:
            result.insert(1, i)
            i += 1
    return results_dictionary

# For experiments 2 and 3 this function is to save results in separate csv files
def save_results_as_separate_csv(results_dictionary, queries_dictionary, experiment_number, date):
    name = "Exp" + experiment_number
    path = "../output/" + name + "/" + date + "/"
#     for model, value in results_dictionary.items():
    name1 = name + "_" + "Mxlm_"
    for exp_title, result in results_dictionary.items():#value.items():
        filename = name1 + queries_dictionary[exp_title]
        file = path + filename + ".tsv"
        with open(file, 'w', newline='', encoding='utf-8') as f:
            write = csv.writer(f, delimiter='\t')
            write.writerows(result)
#             print(filename)
    

The results from the analysis are saved as a JSON file. To further process the information we can load the file contents into a dictionary.

After loading the results, a rank value is added to each result, from the highest similarity score to the lowest.
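
The ranking step can be sketched on toy results as follows. The data and the `with_rank` helper are made up for the example. Unlike `add_rank`, this variant works on a copy, which is one way to avoid inserting the rank twice if the cell is run again.

```python
import copy

# Made-up results in the {query: [[sentence_id, score, text], ...]} form, already sorted by score
toy_results = {"some query": [["s7", 0.91, "text a"], ["s2", 0.64, "text b"]]}


def with_rank(results):
    """Return a copy of the results with the rank inserted after the sentence id."""
    ranked = copy.deepcopy(results)
    for hits in ranked.values():
        for position, hit in enumerate(hits, start=1):
            hit.insert(1, position)
    return ranked


ranked_results = with_rank(toy_results)
print(ranked_results["some query"])
```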

In [None]:
# Load the json where there are the results that you want to analyze. CHANGE the file name accordingly.
path = "../output/"
filename = "Exp4_tagged_210105.json"
file = path + filename
with open(file, "r") as f:
    results_ = json.load(f)

In [None]:
len(results_)

In [None]:
# Adding the rank to a copy of the results dictionary, so that results_dict itself is not modified
results = add_rank(copy.deepcopy(results_dict))
# show_results(results_E2)

Now, to simplify the analysis process and to make it available to a broader range of analysts, the results are split into small "tsv" documents that can be easily imported into spreadsheets.

Each new file will contain only the results of a single query, that is, all the 100 (or whatever number has been retrieved) sentences from the database which have the highest similarity score with the query. It will have the following columns:

* Sentence ID
* Rank of the sentence in the similarity results
* Similarity score
* Text of the sentence
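
A file with these four columns can be written and read back with the standard `csv` module, as in the sketch below. The rows and the file name are made up for the example, and the file goes to a temporary folder rather than to "output/".

```python
import csv
import os
import tempfile

# Made-up rows: sentence id, rank, similarity score, text
rows = [["s7", 1, 0.91, "text a"], ["s2", 2, 0.64, "text b"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_file = os.path.join(tmp_dir, "Exp_toy_Credit-Chile.tsv")  # hypothetical file name
    with open(tsv_file, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
    with open(tsv_file, "r", newline="", encoding="utf-8") as f:
        read_back = list(csv.reader(f, delimiter="\t"))

print(read_back)
```

Every value comes back as a string, so ranks and scores have to be converted again before doing arithmetic on them.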

In [None]:
# Save the results as separate tsv files
queries_dict = queries_dict_exp4 # CHANGE the queries dict accordingly!
Experiment_number = "4" # CHANGE the experiment number accordingly!
Date = "210105" # CHANGE the date accordingly!
save_results_as_separate_csv(results, queries_dict, Experiment_number, Date)

The next cell retrieves the results that were saved in the previous cell for further analysis.

In [None]:
subfolder = "Exp4/210105/" # CHANGE the subfolder name accordingly!
# subfolder = "Exp3/201231/" # CHANGE the subfolder name accordingly!
paths = Path("../output/" + subfolder).glob('**/*.tsv')
transformers = ["Mxlm", "Mdistiluse"]
policy_instruments = ["Credit", "Direct_payment", "Fine", "Guarantee", "Supplies", "Tax_deduction", "Technical_assistance"]
countries = ["Chile", "El Salvador", "Guatemala", "México", "Perú"]
transformer = "Mxlm"
policy_instrument = "Technical_assistance"

In [None]:
sentences_dict = {}
unique_ids = {}

for path in paths:
    # path is a Path object, not a string
    path_in_str = str(path)
    if transformer in path_in_str:
        if policy_instrument in path_in_str:
            for country in countries:
                if country in path_in_str:
                    sentences_dict[country] = {}
                    print(path_in_str)
                    with open(path_in_str, "r", encoding = "utf-8") as f:
                        file = csv.reader(f, delimiter='\t')
                        for row in file:
                            sentences_dict[country][row[0]] = row
                            unique_ids[row[0]] = row

In [None]:
name = "Unique_sentence_IDs_" + policy_instrument + ".tsv"
path = "../output/Exp4/210105/"
file = path + name
with open(file, 'w', newline = '', encoding = 'utf-8') as f:
    write = csv.writer(f, delimiter='\t')
    for key, value in unique_ids.items():
        write.writerow(value)

In [None]:
print(len(sentences_dict))
print(len(sentences_dict[country]))
print(len(unique_ids))

In [None]:
sentences = {}
counts = 0
i = 0
for country in countries:
    i += 1
    j = 0
    for ref_country in countries:
        j += 1
#         if j > i:
        print(ref_country, "---", country)
        for sentence in sentences_dict[country]:
            if sentence in sentences_dict[ref_country]:
                if sentence in sentences:
                    sentences[sentence] = sentences[sentence] + 1
                else:
                    sentences[sentence] = 1
#                     print("hit")

In [None]:
print(counts)
print(len(sentences))
sentences

In [None]:
policy_instrument = "Direct_payment"
path = Path("../output/")
subfolder = Path("Exp3/201228/")  # CHANGE the subfolder name accordingly!
filename = "Unique_Ids_Tagged_" + policy_instrument + ".xlsx"
file = path / subfolder / filename
df = pd.read_excel(file)
tagged = df.values.tolist()
tagged_dict = {}
for item in tagged:
    tagged_dict[item[0]] = [item[4], item[5], item[6]]

for country in countries:
    updated_file = []
    filename = "Exp3_Mxlm_" + policy_instrument + "-" + country + ".tsv"
    file = path / subfolder / filename
    with open(file, "r", encoding = "utf-8") as f:
        file = csv.reader(f, delimiter='\t')
        for row in file:
            updated_file.append([row[0], row[1], row[2], row[3], tagged_dict[row[0]][0], tagged_dict[row[0]][1], tagged_dict[row[0]][2]])
    filename = "Exp3_Mxlm_" + policy_instrument + "-" + country + "_tagged.tsv"
    file = path / subfolder / filename
    with open(file, 'w', newline = '', encoding = 'utf-8') as f:
        write = csv.writer(f, delimiter='\t')
        write.writerows(updated_file)

In [None]:
updated_file

#### Exp2

In [None]:
path = "../output/"
filename = "Exp2_tagged_201228.json"
file = path + filename
with open(file, "r") as f:
    results_Exp2 = json.load(f)

In [None]:
for key1 in results_Exp2:
    print(key1)
    for key2 in results_Exp2[key1]:
        print(queries_dict_exp2[key2])

#### Exp3

In [None]:
path = "../output/"
filename = "Exp3_tagged_201228.json"
file = path + filename
with open(file, "r") as f:
    results_Exp3 = json.load(f)

### Retrieving the documents of selected sentences