# LLM Embedding

This notebook walks through embedding our corpus so that documents relevant to a given query can be retrieved later.

In [1]:
import warnings
import json
import torch

warnings.filterwarnings('ignore')
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import normalize_embeddings, dot_score, semantic_search

## Data Loading

In [2]:
file_path = '../clean_data/clean_corpus_fr.json'

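# Read the file as UTF-8 so that accented French characters decode correctly
with open(file_path, 'r', encoding='utf-8') as f:
    # The file is in JSON Lines format: parse each line as an independent JSON object
    data = [json.loads(line) for line in f]

print(len(data), 'documents loaded.')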

10676 documents loaded.
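
The loader above treats the file as JSON Lines, with one JSON object per line. As a minimal sketch of that format, the snippet below parses an in-memory sample. The two records and the `parse_json_lines` helper are hypothetical; only the `docid` and `text` keys are taken from this notebook. Unlike the cell above, it also skips blank lines, which is an added assumption about how stray empty lines should be handled.

```python
import io
import json

def parse_json_lines(stream):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    records = []
    for line in stream:
        line = line.strip()
        if line:  # tolerate empty lines, e.g. a trailing newline
            records.append(json.loads(line))
    return records

# Hypothetical sample mimicking the corpus layout (not real corpus data)
sample = io.StringIO(
    '{"docid": "doc-fr-0", "text": "premier document"}\n'
    '\n'
    '{"docid": "doc-fr-1", "text": "second document"}\n'
)

records = parse_json_lines(sample)
print(len(records), 'documents loaded.')
```

The same helper could be passed an open file handle instead of the `StringIO` sample, since both are iterated line by line.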


## Embedding

Next, vectorize the documents by embedding their text with a Sentence Transformers encoder. The cell below uses `all-MiniLM-L6-v2` and embeds only the first 2,000 documents.

In [3]:
# Collect the text and ID of the first 2,000 documents in the corpus
corpus = [document['text'] for document in data[:2000]]
docids = [document['docid'] for document in data[:2000]]

# Select the model used to compute the embeddings
# Basic model (English only): sentence-transformers/all-MiniLM-L6-v2
# Multilingual model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
# Larger model (primarily Chinese): BAAI/bge-large-zh-v1.5
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') 

# Encode all the text
corpus_embeddings = model.encode(
    corpus, 
    convert_to_tensor=True, 
    show_progress_bar=True
)

# L2-normalize the embeddings so that a dot product between them is a cosine similarity
corpus_embeddings = normalize_embeddings(corpus_embeddings)

# Display some information to the user
print("Information about the corpus embedding:\n"
      f"Number of documents embedded: {len(corpus_embeddings)}\n"
      f"Vector size of each embedding: {corpus_embeddings[0].shape}")

Batches:   0%|          | 0/63 [00:00<?, ?it/s]

Information about the corpus embedding:
Number of documents embedded: 2000
Vector size of each embedding: torch.Size([384])
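
The corpus embeddings are L2-normalized before searching. One reason this helps: for unit-length vectors the dot product equals the cosine similarity, so a dot-product score can serve as a cosine ranking. The toy example below checks this in plain Python. The vectors and helper names are illustrative assumptions, not values from the corpus or functions from `sentence_transformers`.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two arbitrary toy vectors (hypothetical, for illustration only)
a = [3.0, 4.0]
b = [1.0, 2.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))

# The two quantities agree up to floating-point error
print(math.isclose(cos_ab, dot_normalized))
```

Note that this equivalence only holds when both sides are normalized, so the query embedding needs the same treatment as the corpus embeddings.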


In [4]:
query = "Quelles sont les informations révélées par la Société norvégienne de radiodiffusion (NRK) concernant les adresses et les numéros des employés du complexe de l'abri antiatomique norvégien ?"  # wanted document: doc-fr-4878
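top_k = 10

# Encode the query as a normalized tensor so that its dot product with the
# normalized corpus embeddings is a cosine similarity
query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Use the built-in search method and let it return the top_k hits directly
search_hits = semantic_search(query_embedding, corpus_embeddings, top_k=top_k, score_function=dot_score)
scores = [hit['score'] for hit in search_hits[0]]
indices = [hit['corpus_id'] for hit in search_hits[0]]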

print(f"Query: {query}\n\n"
      f"{'#'*50}\n"
      f"Top {top_k} most similar sentences in corpus:\n"
      f"{'#'*50}\n")

for score, idx in zip(scores, indices):
    print(f"Docid: {docids[idx]}\n"
          f"Text:{corpus[idx][:500]}\n"
          f"(Score: {score:.4f})\n")

Query: Quelles sont les informations révélées par la Société norvégienne de radiodiffusion (NRK) concernant les adresses et les numéros des employés du complexe de l'abri antiatomique norvégien ?

##################################################
Top 10 most similar sentences in corpus:
##################################################

Docid: doc-fr-2248
Text:étude effets biologiques rayonnements notamment rayonnements ionisants êtres vivants sensibilité espèces radiosensibilité individuelle pathologique syndrome gorlin groupes vulnérables aussi étude moyens préserver effets délétères certains rayonnements étude traitements suivre cas contamination irradiation enfin branche biologie médicale emploie techniques radiologiques permettant investigation corps humain plus largement agit ensemble techniques analyses liquides biologiques utilisent radio isot
(Score: 0.5787)

Docid: doc-fr-1438
Text:guillaume radio anciennement guillaume radio 2 0 jusqu 2017 émission radio divertissement axé