<a href="https://colab.research.google.com/github/kj1103/cwiczenia1/blob/main/cwiczenie_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
import re

# Tekst źródłowy
text = """
The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race. It will commence on June 29, 2024, inrace will conclude, as tradition dictates, on the iconic Champs-Élysées in Paris on July 21, 2024.

The 2024 Tour de France will feature 21 stages, covering a total distance of approximately 3,500 kilometers. The route is designed to challenge riders with a mix of flat stages favoring sprinters, grueling mountain climbs testing endurance, and individual time trials demanding precision and speed. This year’s race highlights include demanding stages in the Alps and the Pyrenees, where riders will tackle legendary ascents such as the Col du Tourmalet and Alpe d’Huez.

Top cycling teams from around the globe will compete in the event, with cyclists vying for various prestigious jerseys. The yellow jersey (maillot jaune) is awarded to the overall leader of the general classification and symbolizes the ultimate achievement in professional cycling. The green jersey is contested by the best sprinters, the polka-dot jersey recognizes the king of the mountains, and the white jersey is reserved for the best young rider under the age of 25.

In addition to individual glory, the team classification remains a critical component of the race, highlighting collective strength and strategy. Fans worldwide eagerly anticipate dramatic moments of determination, teamwork, and unexpected surprises that are hallmarks of the Tour de France.

The 2024 edition also pays homage to the centenary of the Col du Galibier’s inclusion in the race, celebrating its historical significance as one of the most challenging and revered climbs in the Tour’s history.

With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.

"""

# Dzielenie tekstu na zdania
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text.strip())

# Konwersja na listę dokumentów
documents = [sentence.strip() for sentence in sentences if sentence.strip()]

# Wyświetlenie wyników
print("documents = [")
for doc in documents:
    print(f'    "{doc}",')
print("]")


documents = [
    "The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race.",
    "It will commence on June 29, 2024, inrace will conclude, as tradition dictates, on the iconic Champs-Élysées in Paris on July 21, 2024.",
    "The 2024 Tour de France will feature 21 stages, covering a total distance of approximately 3,500 kilometers.",
    "The route is designed to challenge riders with a mix of flat stages favoring sprinters, grueling mountain climbs testing endurance, and individual time trials demanding precision and speed.",
    "This year’s race highlights include demanding stages in the Alps and the Pyrenees, where riders will tackle legendary ascents such as the Col du Tourmalet and Alpe d’Huez.",
    "Top cycling teams from around the globe will compete in the event, with cyclists vying for various prestigious jerseys.",
    "The yellow jersey (maillot jaune) is awarded to the overall leader of the general classification and symbolizes the ul

In [12]:
## instalki
!pip install faiss-cpu
import faiss
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForCausalLM



In [25]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")

In [26]:
## tokenizacja
def embed_text(text, tokenizer, model):
  inputs = tokenizer(text,
                     return_tensors="pt",
                     padding=True,
                     truncation=True)
  with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state

    embeddings = embeddings.mean(dim = 1)

  return embeddings

In [27]:
document_embeddings = []

In [28]:
## pętla - przetwarzanie na ciąg numeryczny, zeby powstala lista wektorow z doc
for doc in documents:
 doc_embedding = embed_text(doc, tokenizer, model)

document_embeddings.append(doc_embedding)

In [29]:
document_embeddings = torch.cat(document_embeddings).cpu().numpy()
document_embeddings[:2]

array([[ 0.07769798,  0.47460827, -0.13931563, -0.4864253 ,  0.12234571,
         0.17999555, -0.35355103, -0.2709189 ,  0.18480417, -0.09523696,
         0.04799807, -0.06456344, -0.0108801 , -0.00629439, -0.28606415,
        -0.31379074,  0.111214  ,  0.12044472, -0.08158851, -0.02708341,
         0.04385093,  0.04706805,  0.12096468,  0.40602368, -0.2055401 ,
        -0.17566283, -0.03582388,  0.02791352, -0.15625432, -0.40180176,
         0.1454376 ,  0.26957852, -0.8640472 ,  0.28487498, -0.3704323 ,
         0.38211823, -0.23830302, -0.37298375,  0.25743622,  0.2596312 ,
         0.25586593,  0.19189428, -0.21039712, -0.00996509,  0.4361062 ,
         0.00349876,  0.14199835,  0.09943132, -0.32142574,  0.15318377,
         0.50866544, -0.3074038 ,  0.27586126, -0.31226096, -0.04540114,
         0.10431644,  0.01513937, -0.19612476,  0.01679823, -0.36686224,
         0.04040737, -0.26311016,  0.06876951,  0.03351145, -0.26630723,
         0.04735545, -0.33548695,  0.277425  ,  0.0

In [30]:
##faiss index
document_embeddings.shape[1]
index = faiss.IndexFlatL2(document_embeddings.shape[1])
index.add(document_embeddings)

In [32]:
#funkcja do zapytania
def retrieve(query, tokenizer, model, index, documents, top_k=3):
    query_embedding = embed_text(query, tokenizer, model)
    distances, indices = index.search(query_embedding, top_k)
    return [documents[i] for i in indices[0]], distances[0]

In [33]:
query = "Where and when does the Tour de France 2024 begin and end?"

retrieved_docs, distances = retrieve(query, tokenizer, model, index, documents)

print(retrieved_docs)
print(distances)

["The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race.", 'With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.', 'With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.']
[2.3267220e+01 3.4028235e+38 3.4028235e+38]


In [34]:
gen_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")

In [35]:
gen_tokenizer.pad_token = gen_tokenizer.eos_token

In [36]:
context = " ".join(retrieved_docs)
context

"The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race. With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race. With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race."

In [37]:
def generate_text(context, query, model, tokenizer, max_length=200):  # Increased max_length
    # Format the input text with context and query
    input_text = f"Context {context}\nQuestion: {query}\nAnswer:"

    # Tokenize the input text and prepare tensors for the model
    inputs = tokenizer(input_text,
                       return_tensors="pt",
                       padding=True,
                       truncation=True)

    input_ids = inputs['input_ids']
    attention_masks = (input_ids != tokenizer.pad_token_id).long()

    # Generate text using the model
    outputs = model.generate(
        input_ids,
        attention_mask=attention_masks,
        max_length=max_length,  # Using the increased max_length
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode the generated text to a readable format
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [38]:
generated_answer = generate_text(context, query, gen_model, gen_tokenizer)
print(f"Generated Answer: {generated_answer}")

Generated Answer: Context The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race. With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race. With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.
Question: Where and when does the Tour de France 2024 begin and end?
Answer: The Tour de France 2024 begins on the day of the Paris Games. The Tour de France 2024 ends on the day of the Paris Games.
Question: What is the Tour de France 2024 schedule?
Answer: The Tour de France 2024 schedule is a three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day,


RAG

In [39]:
def rag(query, retrieval_tokenizer, retrieval_model, retrival_index, gen_model, gen_tokenizer, documents, top_k):
  retrieved_docs, distances = retrieve(query,
                                       retrieval_tokenizer,
                                       retrieval_model,
                                       retrival_index,
                                       documents,
                                       top_k)

  context = " ".join(retrieved_docs)
  generated_answer = generate_text(context, query, gen_model, gen_tokenizer)

  return generated_answer

In [40]:
query = "Where and when does the Tour de France 2024 begin and end?"
answer = rag(query, tokenizer, model, index, gen_model, gen_tokenizer, documents, 3)
print(f"Answer: {answer}")

Answer: Context The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race. With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race. With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.
Question: Where and when does the Tour de France 2024 begin and end?
Answer: The Tour de France 2024 begins on the day of the Paris Games. The Tour de France 2024 ends on the day of the Paris Games.
Question: What is the Tour de France 2024 schedule?
Answer: The Tour de France 2024 schedule is a three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day,


In [41]:
queries = ["Where and when does the Tour de France 2024 begin and end?",
           "What types of stages are included in the 2024 Tour de France, and which legendary climbs are highlighted?",
           "What are the different jersey classifications in the Tour de France, and what do they represent?"]

In [42]:
for query in queries:
  answer = rag(query,
               tokenizer,
               model, index,
               gen_model,
               gen_tokenizer,
               documents,
               3)

  print(f"Query: {query}\nAnswer: {answer}\n")

Query: Where and when does the Tour de France 2024 begin and end?
Answer: Context The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race. With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race. With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.
Question: Where and when does the Tour de France 2024 begin and end?
Answer: The Tour de France 2024 begins on the day of the Paris Games. The Tour de France 2024 ends on the day of the Paris Games.
Question: What is the Tour de France 2024 schedule?
Answer: The Tour de France 2024 schedule is a three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day, three-day,

Query: What types of stages are inc

In [43]:
for query in queries:
  answer = rag(query,
               tokenizer,
               model, index,
               gen_model,
               gen_tokenizer,
               documents,
               top_k = 3)

  print(f"Query: {query}")
  print(f"Distance: {distances}")
  print(f"Document: {retrieved_docs}")

Query: Where and when does the Tour de France 2024 begin and end?
Distance: [2.3267220e+01 3.4028235e+38 3.4028235e+38]
Document: ["The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race.", 'With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.', 'With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.']
Query: What types of stages are included in the 2024 Tour de France, and which legendary climbs are highlighted?
Distance: [2.3267220e+01 3.4028235e+38 3.4028235e+38]
Document: ["The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race.", 'With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unf

In [44]:
def is_relevant(distance , threshold =40):
  return distance < threshold

In [45]:
def rag(query, retrieval_tokenizer, retrieval_model, retrival_index, gen_model, gen_tokenizer, documents, top_k):
  # Retrieve relevant documents and their distances
  retrieved_docs, distances = retrieve(query,
                                       retrieval_tokenizer,
                                       retrieval_model,
                                       retrival_index,
                                       documents,
                                       top_k)

  # Filter out documents that do not meet the relevance criteria
  relevant_docs = [doc for doc, distance in zip(retrieved_docs, distances) if is_relevant(distance, 40)]

  # Return a message if no relevant documents are found
  if not relevant_docs:
    return "I am sorry, there is no relevant information"

  # Combine relevant documents into a context
  context = " ".join(relevant_docs)
  # Generate an answer based on the context and query
  generated_answer = generate_text(context, query, gen_model, gen_tokenizer)

  return generated_answer

In [46]:
for query in queries:
  answer = rag(query,
               tokenizer,
               model, index,
               gen_model,
               gen_tokenizer,
               documents,
               3)

  print(f"Query: {query}\nAnswer: {answer}\n")

Query: Where and when does the Tour de France 2024 begin and end?
Answer: Context The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race.
Question: Where and when does the Tour de France 2024 begin and end?
Answer: The Tour de France 2024 begins on the day of the Paris Games. The Tour de France 2024 will be held in the city of Paris, France, on the day of the Games.
Question: What is the Tour de France 2024 schedule?
Answer: The Tour de France 2024 will be held in the city of Paris, France, on the day of the Games.
Question: What is the Tour de France 2024 schedule for the 2018 Tour de France?
Answer: The Tour de France 2024 will be held in the city of Paris, France, on the day of the Games.
Question: What is the Tour de France 2024 schedule for the 2019 Tour de France?
Answer: The Tour de France 2024 will be held in the city of Paris, France, on the day of the

Query: What types of stages are included in the 2024 Tour de France, and which legendar

In [48]:
def rag(query, retrieval_tokenizer, retrieval_model, retrival_index, gen_model, gen_tokenizer, documents, top_k):
  # Retrieve relevant documents and their distances
  retrieved_docs, distances = retrieve(query,
                                       retrieval_tokenizer,
                                       retrieval_model,
                                       retrival_index,
                                       documents,
                                       top_k)

  # Filter out documents that do not meet the relevance criteria
  relevant_docs = [doc for doc, distance in zip(retrieved_docs, distances) if is_relevant(distance, 40)]

  # Return a message if no relevant documents are found
  if not relevant_docs:
    return "I am sorry, there is no relevant information"

  # Combine relevant documents into a context
  context = " ".join(relevant_docs)

  # Generate an answer based on the context and query
  generated_answer = generate_text(context, query, gen_model, gen_tokenizer)

  return generated_answer

In [49]:
for query in queries:
  answer = rag(query,
               tokenizer,
               model, index,
               gen_model,
               gen_tokenizer,
               documents,
               3)

  print(f"Query: {query}\nAnswer: {answer}\n")

Query: Where and when does the Tour de France 2024 begin and end?
Answer: Context The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race.
Question: Where and when does the Tour de France 2024 begin and end?
Answer: The Tour de France 2024 begins on the day of the Paris Games. The Tour de France 2024 will be held in the city of Paris, France, on the day of the Games.
Question: What is the Tour de France 2024 schedule?
Answer: The Tour de France 2024 will be held in the city of Paris, France, on the day of the Games.
Question: What is the Tour de France 2024 schedule for the 2018 Tour de France?
Answer: The Tour de France 2024 will be held in the city of Paris, France, on the day of the Games.
Question: What is the Tour de France 2024 schedule for the 2019 Tour de France?
Answer: The Tour de France 2024 will be held in the city of Paris, France, on the day of the

Query: What types of stages are included in the 2024 Tour de France, and which legendar