# Text Encoding with LLMs

The goal of this notebook is to show examples of text encoding and similarity computation. The tokenizer section reviews a few specifics of tokenizers, i.e. the kinds of text they encode.

In [2]:
from transformers import AutoModel, AutoTokenizer
#from sklearn.metrics.pairwise import cosine_similarity




!huggingface-cli login

## 1- Tokenizer: examples of tokenizers
LLM tokenizers are special. They do not process whole words, as in the chapters on text representation, but sub-words (tokens).
The cells below show examples of tokens.
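
To make the idea of sub-word splitting concrete, here is a minimal sketch of the greedy longest-match-first strategy used by WordPiece-style tokenizers such as BERT's. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are hypothetical and exist only for illustration; a real tokenizer also normalizes the text first (for example, lowercasing for an uncased model) and uses a vocabulary of tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-words."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No sub-word matches: the whole word becomes the unknown token.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = {"capital", "##ization", "token", "##s", "eli", "##f"}

for word in ["capitalization", "tokens", "elif", "♫"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, a word missing from the vocabulary is still covered by smaller pieces, and only a string with no matching piece at all falls back to `[UNK]`.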

In [5]:
def show_tokens(text, tokenizer_name):
    # Load the tokenizer, encode the text, and print each token decoded individually
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(text).input_ids
    print("tokens_decoded:", " ; ".join([tokenizer.decode([t]) for t in token_ids]))
    print(token_ids)


def encode_decode(text, tokenizer_name):
    # Round trip: encode the text to token ids, then decode the ids back to a string
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(text).input_ids
    print(tokenizer.decode(token_ids))

In [6]:
text = """
English and CAPITALIZATION
):  ♫  􀀀
show _tokens False None elif == >= else :
Four spaces : "     " Two tabs : "  "
12 . 0 * 50 = 600
"""

In [7]:
show_tokens(text,"bert-base-uncased")

tokens_decoded: [CLS] ; english ; and ; capital ; ##ization ; ) ; : ; [UNK] ; show ; _ ; token ; ##s ; false ; none ; eli ; ##f ; = ; = ; > ; = ; else ; : ; four ; spaces ; : ; " ; " ; two ; tab ; ##s ; : ; " ; " ; 12 ; . ; 0 ; * ; 50 ; = ; 600 ; [SEP]
[101, 2394, 1998, 3007, 3989, 1007, 1024, 100, 2265, 1035, 19204, 2015, 6270, 3904, 12005, 2546, 1027, 1027, 1028, 1027, 2842, 1024, 2176, 7258, 1024, 1000, 1000, 2048, 21628, 2015, 1024, 1000, 1000, 2260, 1012, 1014, 1008, 2753, 1027, 5174, 102]




In [None]:
encode_decode(text,"bert-base-uncased")

In [None]:
show_tokens(text,"gpt2")
encode_decode(text,"gpt2")
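
One difference between the two tokenizers compared above is how they treat characters outside their vocabulary. In the `bert-base-uncased` output, a symbol is replaced by `[UNK]`, so it cannot be recovered when decoding. GPT-2 uses a byte-level BPE, which starts from the UTF-8 bytes of the text rather than from characters. The sketch below only illustrates that byte-level starting point with the standard library; it does not reproduce GPT-2's merge rules.

```python
symbol = "♫"

# A byte-level tokenizer starts from these raw byte values, so every
# character has a representation even if it was never seen in training.
utf8_bytes = symbol.encode("utf-8")
print(list(utf8_bytes))

# Decoding the same bytes restores the original symbol.
print(utf8_bytes.decode("utf-8"))
```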

## 2- Encoding text with transformers

In [None]:
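import os

from transformers import BertTokenizer, AutoModel, AutoTokenizer
#from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
# Assumption: the preprocessing helpers used in readfiles_from_dir come from Gensim
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess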

In [5]:
# Choose how the text is read and how the tokens are built

# Sample documents
def get_documents():
    documents = [
    "Text retrieval is the process of finding documents that are relevant to a user's query.",    
    "Gensim is a popular library for topic modeling and text retrieval in Python.",
    "Cosine similarity is a metric used to measure how similar two vectors are.",
    "Vectorization is the process of converting text data into numerical vectors.",
    "Python is a versatile programming language used for various applications.",
    "BM25 model is one of the main IR model,"
    ]  
    return documents

## Several files in a directory
def readfiles_from_dir(dir_path='./data'):
    texts = []
    for file_name in os.listdir(dir_path):
        if file_name.endswith(".txt"):
            # Accumulate the preprocessed lines of every .txt file
            with open(os.path.join(dir_path, file_name), encoding='utf-8') as f:
                texts.extend(simple_preprocess(remove_stopwords(sentence)) for sentence in f)
    return texts

# Function to read texts from a file
def read_texts_from_file(file_path,n):
    # Read the first n rows of a tab-separated file
    texts = pd.read_csv(file_path, encoding='utf-8', sep='\t', header=None, nrows=n)
    # Return only column 1, which contains the text to process
    return texts[1].tolist()


In [8]:
# Read the file
#file_path = './data/msmarco/collection.tsv'
#documents = read_texts_from_file(file_path,100)
#documents=documents
#.astype(str)
documents=get_documents()

In [None]:
documents[1]

In [6]:
def encode_text(texts, tokenizer,model, max_length=512):
    # Tokenize and encode the batch of texts
    tokens = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    # Build the embeddings
    outputs = model(**tokens)
    # Get the last hidden state 
    embeddings = outputs.last_hidden_state
    return get_cls(embeddings)

# Get the CLS embeddings
def get_cls(embeddings):
    return embeddings[:, 0, :]
    

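# Compute the mean of the token embeddings (mean pooling over the sequence dimension)
def get_mean(embeddings):
    # embeddings has shape (batch_size, seq_len, hidden_size); dim=1 averages over tokens
    return embeddings.mean(dim=1)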

# Extract the CLS token output (first token in the sequence)
#cls_output = last_hidden_state[:, 0, :]  # Shape: (batch_size, hidden_size)

# Display the CLS output
#print("CLS Output Shape:", cls_output.shape)  # Example: torch.Size([1, 768])
#print("CLS Output Vector:", cls_output)
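
The helpers above pool the token embeddings into one vector per text, either by taking the first (`[CLS]`) position or by averaging. When a batch is padded to a common length, a plain average also counts the padding positions. The sketch below shows one common way to handle this, a mean weighted by the attention mask. The function name `masked_mean_pool` and the toy tensors are assumptions made for illustration, not part of the notebook.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per text
    return summed / counts


# Toy batch: 2 texts, 3 positions, hidden size 2.
# The last position of the first text is padding and must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
print(pooled)
```

In `encode_text`, the equivalent inputs would be `outputs.last_hidden_state` and `tokens["attention_mask"]`.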


In [8]:
# Load pre-trained BERT model and tokenizer
from transformers import AutoModel
model_name = "bert-base-uncased"
#model_name ="distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
    
# Encode the batch of texts
doc_embeddings = encode_text(documents,tokenizer,model)

# Display the encoded representations
print("Encoded Batch Shape:", doc_embeddings.shape)
print("Encoded Batch Representations:")
#print(doc_embeddings[0])


Encoded Batch Shape: torch.Size([6, 768])
Encoded Batch Representations:


In [None]:
doc_embeddings.shape

## 3- Computing query-text similarity from embeddings

In [9]:
import torch
from torch.nn.functional import cosine_similarity

# Encode query
k=6
query = "bm25 model."
query_embedding = encode_text(query, tokenizer, model)

# Compute cosine similarity between query and documents
similarities = cosine_similarity(query_embedding,doc_embeddings, dim=1)
#reshape(-1, 1)

#If dim=1, cosine similarity is computed for each row of the tensors.
#If dim=0, cosine similarity is computed for each column of the tensors.
    
# Rank the documents based on similarity scores
ranked_documents = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)


# Get the indices of the top-k documents
top_k_indices = [index for index, _ in ranked_documents[:k]]

# Display the ranked list of documents
print("Ranked List of Documents:")
for idx in top_k_indices:
    print(f"Document {idx + 1}: {documents[idx]}", "Similarity:", similarities[idx])
    
    

Ranked List of Documents:
Document 6: BM25 model is one of the main IR model, Similarity: tensor(0.8291, grad_fn=<SelectBackward0>)
Document 2: Gensim is a popular library for topic modeling and text retrieval in Python. Similarity: tensor(0.7916, grad_fn=<SelectBackward0>)
Document 5: Python is a versatile programming language used for various applications. Similarity: tensor(0.7643, grad_fn=<SelectBackward0>)
Document 4: Vectorization is the process of converting text data into numerical vectors. Similarity: tensor(0.6679, grad_fn=<SelectBackward0>)
Document 1: Text retrieval is the process of finding documents that are relevant to a user's query. Similarity: tensor(0.6379, grad_fn=<SelectBackward0>)
Document 3: Cosine similarity is a metric used to measure how similar two vectors are. Similarity: tensor(0.4853, grad_fn=<SelectBackward0>)
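
The ranking above relies on broadcasting: a query of shape `(1, d)` is compared against `n` documents of shape `(n, d)`, which gives `n` scores. As a possible alternative to sorting a Python list of tensors, the sketch below uses `torch.topk` on toy embeddings. The names `query_emb` and `doc_embs` and their values are made up for illustration.

```python
import torch
from torch.nn.functional import cosine_similarity

# Toy embeddings standing in for query_embedding (1 x d) and doc_embeddings (n x d).
query_emb = torch.tensor([[1.0, 0.0]])
doc_embs = torch.tensor([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

# The query row is broadcast against every document row: one score per document.
scores = cosine_similarity(query_emb, doc_embs, dim=1)

# topk returns the k largest scores and their positions, already sorted.
top_scores, top_indices = torch.topk(scores, k=2)
for score, idx in zip(top_scores.tolist(), top_indices.tolist()):
    print(f"Document {idx + 1}: similarity {score:.4f}")
```

Converting with `.tolist()` prints plain floats. The `grad_fn` shown in the output above appears because the model ran with gradient tracking; wrapping the encoding call in `torch.no_grad()` would avoid it.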


In [None]:
query_embedding.shape
doc_embeddings.shape

## 4- Encoding text with a Sentence-BERT model
#### Ready-to-use model; see its model card on Hugging Face

In [3]:
from sentence_transformers import SentenceTransformer,util
model_sentence = SentenceTransformer('all-MiniLM-L6-v2')

In [9]:
doc_embeddings = model_sentence.encode(documents)

In [10]:
# Encode query
k=5
query = "This is the query text."
query_embedding = model_sentence.encode(query)

# Compute cosine similarity between query and documents
similarities = util.cos_sim(query_embedding,doc_embeddings).reshape(-1, 1)

#similarities = util.cos_sim(embeddings, embeddings)

# Rank the documents based on similarity scores
ranked_documents = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

# Get the indices of the top-k documents
top_k_indices = [index for index, _ in ranked_documents[:k]]

# Display the ranked list of documents
print("Ranked List of Documents:")
for idx in top_k_indices:
    print(f"Document {idx + 1}: {documents[idx]}", "Similarity:", similarities[idx])
    
    

Ranked List of Documents:
Document 1: Text retrieval is the process of finding documents that are relevant to a user's query. Similarity: tensor([0.5131])
Document 4: Vectorization is the process of converting text data into numerical vectors. Similarity: tensor([0.2193])
Document 3: Cosine similarity is a metric used to measure how similar two vectors are. Similarity: tensor([0.1479])
Document 6: BM25 model is one of the main IR model, Similarity: tensor([0.1276])
Document 2: Gensim is a popular library for topic modeling and text retrieval in Python. Similarity: tensor([0.1183])
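
The scores above come from `util.cos_sim`, which returns a matrix with one row per query and one column per document. The pure-Python sketch below computes the same quantity on toy vectors, to make the formula explicit: the dot product divided by the product of the two norms. The helper names `cosine` and `cosine_matrix` and the toy values are assumptions for illustration only.

```python
import math


def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(queries, docs):
    # One row per query, one column per document
    return [[cosine(q, d) for d in docs] for q in queries]


toy_queries = [[1.0, 0.0]]
toy_docs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

scores = cosine_matrix(toy_queries, toy_docs)
# Rank the document indices by decreasing similarity to the first query.
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[0][i], reverse=True)
print(scores)
print(ranking)
```

Vector length does not affect the score: `[2.0, 0.0]` points in the same direction as the query and scores 1, while the orthogonal `[0.0, 3.0]` scores 0.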
