# Project Context

I chose to use transformers, through a BERT-based model, so that the embeddings take the context of each word into account.
The embeddings of the contexts (paragraphs) are generated once and stored in a file. This file is then loaded by the algorithm to compare the contexts with the embedding of the question.

NB: I do not use the whole training set, so as not to exceed the capacity of my computer while generating the embeddings. I also did not fine-tune the BERT model on the training data, since this would require too much time and access to GPU resources.

## Importing the libraries

In [1]:
from zipfile import ZipFile 
import json
import pandas as pd
import numpy as np
from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer
import os
from functools import reduce

In [2]:
%load_ext autotime

## Reading the train data

In [3]:
def json_to_dataframe(input_file_path, record_path = ['data','paragraphs','qas','answers'],
                           verbose = 1):
    """
    input_file_path: path to the squad json file.
    record_path: path to deepest level in json file default value is
    ['data','paragraphs','qas','answers']
    verbose: 0 to suppress it default is 1
    """
    if verbose:
        print("Reading the json file")    
    with ZipFile(input_file_path, "r") as z:
        for filename in z.namelist():  
            print(filename)  
            with z.open(filename) as f:  
                data = f.read()  
                file = json.loads(data.decode("utf-8"))
    if verbose:
        print("processing...")
    # parsing the different levels of the json file
    # (pd.json_normalize replaces the deprecated pd.io.json.json_normalize)
    js = pd.json_normalize(file, record_path)
    m = pd.json_normalize(file, record_path[:-1])
    r = pd.json_normalize(file, record_path[:-2])

    #combining it into single dataframe
    idx = np.repeat(r['context'].values, r.qas.str.len())
    #     ndx  = np.repeat(m['id'].values,m['answers'].str.len())
    m['context'] = idx
    #     js['q_idx'] = ndx
    main = m[['id','question','context','answers']].set_index('id').reset_index()
    main['c_id'] = main['context'].factorize()[0]
    if verbose:
        print("shape of the dataframe is {}".format(main.shape))
        print("Done")
    return main

time: 2 ms


In [4]:
input_file_path = 'train.json.zip'
record_path = ['data','paragraphs','qas','answers']
verbose = 0
train_data = json_to_dataframe(input_file_path, record_path)

Reading the json file
train.json
processing...
shape of the dataframe is (20731, 5)
Done
time: 943 ms


In [5]:
len(train_data["context"])

20731

time: 6 ms


In [8]:
len(train_data["context"].unique())

4920

time: 29 ms


In [9]:
bis = train_data.drop_duplicates(subset=["context"])
len(bis)

4920

time: 37.8 ms


## Defining a function to store the embeddings and speed up the process

In [6]:
def store_data(data, name_output = "encoding.ann"):

    items = data
    
    model = SentenceTransformer('distiluse-base-multilingual-cased')

    f = 512
    t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
    
    items = items.set_index("c_id")
    
    for i, row in items.iterrows():
        v = model.encode(row["context"])
        t.add_item(i, v)

    t.build(10) # 10 trees
    t.save(name_output)

    print("Done")

time: 1 ms


In [10]:
store_data(bis[:100])

Done
time: 9.86 s


This operation only needs to be run occasionally: whenever new contexts are added, the index file must be rebuilt.

## Loading the embeddings and encoding the query to compare it to them

The contexts are ranked by cosine similarity to the query: the index was built with Annoy's "angular" metric, whose distance decreases as the cosine similarity increases, so the nearest neighbours are the most similar contexts.
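
As a small illustration, the Annoy documentation describes its "angular" distance as sqrt(2(1 - cos(u, v))), so a returned distance can be mapped back to a cosine similarity. The helper names `angular_to_cosine` and `cosine_to_angular` below are hypothetical and not part of Annoy; this is only a sketch under that documented relation.

```python
import math


def angular_to_cosine(distance):
    """Convert an Annoy 'angular' distance into a cosine similarity.

    Assumes the relation d = sqrt(2 * (1 - cos)) given in the Annoy README.
    """
    return 1.0 - distance ** 2 / 2.0


def cosine_to_angular(cosine):
    """Inverse mapping: cosine similarity to 'angular' distance."""
    return math.sqrt(2.0 * (1.0 - cosine))


# Identical, orthogonal and opposite vectors, respectively.
for d in (0.0, math.sqrt(2.0), 2.0):
    print(f"distance {d:.3f} -> cosine similarity {angular_to_cosine(d):+.3f}")
```

Under this relation, a distance of about 1.2 corresponds to a cosine similarity of roughly 0.28, which gives a more intuitive reading of the distances returned by `get_nns_by_vector`.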

In [11]:
def load_corpus(file_name = 'encoding.ann'):
    f = 512
    u = AnnoyIndex(f, 'angular')
    u.load(file_name)  # super fast, will just mmap the file
    return u

time: 963 µs


In [12]:
def import_enc(file_name = "encoding.ann"):
    model = SentenceTransformer('distiluse-base-multilingual-cased')
    embeddings = load_corpus(file_name)
    return(model, embeddings)

time: 1.01 ms


In [13]:
def produce_prediction(query_text, model, embeddings, top_n = 3):
    query = model.encode(query_text)
    nearest = embeddings.get_nns_by_vector(query, top_n, include_distances = True)
    return nearest


In [14]:
model, embeddings = import_enc()

time: 1.9 s


In [17]:
query_text = "Pour quelle raison on ne put plus observer Cérès ?"

time: 1.04 ms


In [18]:
nearest = produce_prediction(query_text, model, embeddings)
nearest

([17, 2, 18], [1.2273740768432617, 1.2467377185821533, 1.2482205629348755])

time: 37.7 ms


In [19]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
print(*pd.concat([bis[bis.c_id == i] for i in nearest[0]])["context"].tolist(), sep = '\n\n')

Lorsque Cérès est en opposition à proximité de son périhélie, il peut atteindre une magnitude apparente de +6,7. On considère généralement que cette valeur est trop faible pour que l'objet soit visible à l'œil nu, mais il est néanmoins possible pour une personne dotée d'une excellente vue et dans des conditions d'observation exceptionnelles de percevoir la planète naine. Les seuls astéroïdes pouvant atteindre une telle magnitude sont Vesta et, pendant de rares oppositions à leur périhélie, Pallas et Iris. Au maximum de sa luminosité, Cérès n'est pas l'astéroïde le plus brillant ; Vesta peut atteindre la magnitude +5,4, la dernière fois en mai et juin 2007. Aux conjonctions, Cérès atteint la magnitude de +9,3, ce qui correspond aux objets les moins lumineux qui puissent être visibles à l'aide de jumelles 10×50. La planète naine peut donc être vue aux jumelles dès qu'elle est au-dessus de l'horizon par une nuit noire. Pallas et Iris sont invisibles aux jumelles par de petites élongations

The retrieved contexts are close to the query in terms of main subject. However, comparing the nearest context with the context actually linked to this query shows that the top prediction is wrong: the right context only comes second. The distances of the right context (1.2467377185821533) and of the predicted one (1.2273740768432617) are very close.
The predicted context also contains the words "observation", "invisibles" and "percevoir", which are close to the word "observer" in the query and probably led the algorithm to rank it first.

## Evaluation

To evaluate the model, I use the F1 score. The score would probably be higher if the BERT model were fine-tuned on our data.
For each context, I compute the terms needed for the F1 score: true positives (TP), false positives (FP) and false negatives (FN).
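
The sketch below shows, on a toy example, one way to count true positives, false positives and false negatives per context and to combine them into a micro-averaged F1, F1 = TP / (TP + (FP + FN) / 2). The function name `micro_f1` and the toy labels are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def micro_f1(actual, predicted):
    """Micro-averaged F1 for single-label predictions (illustrative helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1   # the right context was retrieved
        else:
            fn[a] += 1   # the actual context was missed
            fp[p] += 1   # another context was wrongly retrieved
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denominator = total_tp + 0.5 * (total_fp + total_fn)
    return total_tp / denominator if denominator else 0.0


# Toy context ids: 3 correct predictions out of 5.
actual = [0, 0, 1, 1, 2]
predicted = [0, 1, 1, 2, 2]

score = micro_f1(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(score, accuracy)
```

When every question receives exactly one predicted context, each wrong prediction is counted once as a false negative and once as a false positive, so this micro-averaged F1 coincides with the plain accuracy of the top-1 prediction.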

In [20]:
input_file_path_valid = 'valid.json.zip'
record_path_valid = ['data','paragraphs','qas','answers']
verbose = 0
valid_data = json_to_dataframe(input_file_path_valid, record_path_valid)

Reading the json file
valid.json
processing...
shape of the dataframe is (3188, 5)
Done
time: 141 ms


In [21]:
valid_data.head()

| | id | question | context | answers | c_id |
|---|---|---|---|---|---|
| 0 | a524504f-0816-4f58-9f2d-27f82a85c73d | Que concerne principalement les documents ? | Les deux tableaux sont certes décrits par des ... | [{'answer_start': 161, 'text': 'La Vierge aux ... | 0 |
| 1 | 8a72ad1c-b2fe-4fe6-9f87-35fcb713cf38 | Par quoi sont décrit les deux tableaux ? | Les deux tableaux sont certes décrits par des ... | [{'answer_start': 46, 'text': 'documents conte... | 0 |
| 2 | b2db7f77-f6a7-4c3d-9274-9807f9764e97 | Quels types d'objets sont les deux tableaux au... | Les deux tableaux sont certes décrits par des ... | [{'answer_start': 204, 'text': 'objets de spéc... | 0 |
| 3 | 31ec5079-7d62-43b9-8a5c-49f8b7ad6f5a | Sur quelle jambe les personnages se tiennent-t... | Les deux panneaux présentent de nombreuses sim... | [{'answer_start': 242, 'text': 'droite'}] | 1 |
| 4 | 008a2ca6-a589-4751-ba59-f7f762874785 | Quel pied avancent les personnages ? | Les deux panneaux présentent de nombreuses sim... | [{'answer_start': 271, 'text': 'gauche'}] | 1 |


time: 85.3 ms


In [22]:
bis_valid = valid_data.drop_duplicates(subset=["context"])
print(len(valid_data), len(bis_valid))

3188 768
time: 11 ms


In [23]:
store_data(bis_valid, name_output = "encoding_valid.ann")

Done
time: 1min 4s


In [24]:
model, embeddings = import_enc("encoding_valid.ann")

time: 1.94 s


In [26]:
query_text = "Sur quelle jambe les personnages se tiennent-t-ils ?"

time: 963 µs


In [27]:
pred_val = produce_prediction(query_text, model, embeddings)

time: 32.4 ms


In [28]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
print(*pd.concat([bis_valid[bis_valid.c_id == i] for i in pred_val[0]])["context"].tolist(), sep = '\n\n')

Les deux panneaux présentent de nombreuses similitudes : chacun comporte un seul personnage exposé en pied, qui se tient dans une niche en trompe-l'œil proposant les mêmes dégradés de gris. Les personnages se tiennent en appui sur leur jambe droite et avancent leur pied gauche vers le spectateur. Des ailes s'ouvrent légèrement dans leur dos, qui indiquent leur nature d'ange. Les cheveux longs et bouclés, ils sont vêtus d'une longue robe de couleur dont le col est rond pour l'un et carré pour l'autre. Chacun tient un instrument de musique dont il semble jouer. Leurs différences tiennent dans l'instrument duquel ils jouent, leur posture pour le faire ainsi que l'aspect et la position de leur tête. Ainsi, l'ange en vert joue de la lira da braccio ; il semble frotter l'archet sur les cordes. Il incline la tête vers son instrument selon un port identique à celui de la Vierge dans La Vierge aux rochers de Léonard de Vinci. L'ange en rouge joue du luth ; sa main pince les cordes de l'instrume

### This prediction is accurate. Let's see what proportion of predictions is correct over all the questions of the validation set.

In [29]:
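def evaluate_model(data):
    """Return the micro-averaged F1 of top-1 context retrieval (uses the global model and embeddings)."""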
    
    real_vs_pred = {"actual_context_id":[], "predicted_context_id":[]}

    for i, row in data.iterrows():
        pred = produce_prediction(row["question"], model, embeddings, top_n = 1)[0][0]
        real_vs_pred["actual_context_id"].append(row["c_id"])
        real_vs_pred["predicted_context_id"].append(pred)
    predictions = pd.DataFrame(real_vs_pred, columns = real_vs_pred.keys())
    print("Done")
    
    li_id = predictions["actual_context_id"].unique().tolist()

    TP = []
    FN = []
    FP = []

    for id_context in li_id:

        pred_per_id = predictions[predictions["actual_context_id"] == id_context]
        TP.append(len([i for i in pred_per_id["predicted_context_id"].tolist() if i == id_context]))
        FN.append(len([i for i in pred_per_id["predicted_context_id"].tolist() if i != id_context]))
        
        pred_per_id = predictions[predictions["predicted_context_id"] == id_context]
        FP.append(len([i for i in pred_per_id["actual_context_id"].tolist() if i != id_context]))
        
    return sum(TP) / (sum(TP) + 1/2 * (sum(FP) + sum(FN)))
    

time: 2 ms


In [185]:
evaluate_model(valid_data)

Done


0.34598494353826853

time: 1min 17s


I got quite a low F1 score. One explanation is that the model was not fine-tuned on the training set (it would take too much time). Another is that the real context may well be the second or third prediction for a given question, which a top-1 evaluation counts as an error.
If I evaluated the model on 3 predictions per question (for example), I would probably get a higher score.
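
A minimal sketch of such an evaluation on several predictions per question is shown below. It assumes the ranked context ids for each question are already available, for example from the first element returned by `produce_prediction(question, model, embeddings, top_n = 3)`. The function name `hit_rate_at_k` and the toy data are hypothetical.

```python
def hit_rate_at_k(actual_ids, ranked_predictions, k=3):
    """Share of questions whose actual context is among the top-k predictions."""
    if not actual_ids:
        return 0.0
    hits = sum(
        actual in ranked[:k]
        for actual, ranked in zip(actual_ids, ranked_predictions)
    )
    return hits / len(actual_ids)


# Toy data: one list of ranked context ids per question, nearest first.
actual_ids = [2, 5, 7]
ranked_predictions = [[17, 2, 18], [5, 1, 3], [4, 6, 8]]

top1 = hit_rate_at_k(actual_ids, ranked_predictions, k=1)
top3 = hit_rate_at_k(actual_ids, ranked_predictions, k=3)
print(f"top-1 hit rate: {top1:.2f}, top-3 hit rate: {top3:.2f}")
```

By construction the top-3 hit rate can never be lower than the top-1 hit rate, so comparing the two indicates how often the right context is retrieved but not ranked first.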