<a href="https://colab.research.google.com/github/kj1103/cwiczenia/blob/main/RAG_with_GPT2_TDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [41]:
## Installs and imports
!pip install faiss-cpu
import faiss
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForCausalLM



In [52]:
import re

# Source text
text = """
The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race. It will commence on June 29, 2024, inrace will conclude, as tradition dictates, on the iconic Champs-Élysées in Paris on July 21, 2024.

The 2024 Tour de France will feature 21 stages, covering a total distance of approximately 3,500 kilometers. The route is designed to challenge riders with a mix of flat stages favoring sprinters, grueling mountain climbs testing endurance, and individual time trials demanding precision and speed. This year’s race highlights include demanding stages in the Alps and the Pyrenees, where riders will tackle legendary ascents such as the Col du Tourmalet and Alpe d’Huez.

Top cycling teams from around the globe will compete in the event, with cyclists vying for various prestigious jerseys. The yellow jersey (maillot jaune) is awarded to the overall leader of the general classification and symbolizes the ultimate achievement in professional cycling. The green jersey is contested by the best sprinters, the polka-dot jersey recognizes the king of the mountains, and the white jersey is reserved for the best young rider under the age of 25.

In addition to individual glory, the team classification remains a critical component of the race, highlighting collective strength and strategy. Fans worldwide eagerly anticipate dramatic moments of determination, teamwork, and unexpected surprises that are hallmarks of the Tour de France.

The 2024 edition also pays homage to the centenary of the Col du Galibier’s inclusion in the race, celebrating its historical significance as one of the most challenging and revered climbs in the Tour’s history.

With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race.

"""

# Split the text into sentences
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text.strip())

# Convert to a list of documents
documents = [sentence.strip() for sentence in sentences if sentence.strip()]

# Print the results
print("documents = [")
for doc in documents:
    print(f'    "{doc}",')
print("]")

documents = [
    "The Tour de France 2024 marks the 111th edition of the world's most prestigious cycling race.",
    "It will commence on June 29, 2024, inrace will conclude, as tradition dictates, on the iconic Champs-Élysées in Paris on July 21, 2024.",
    "The 2024 Tour de France will feature 21 stages, covering a total distance of approximately 3,500 kilometers.",
    "The route is designed to challenge riders with a mix of flat stages favoring sprinters, grueling mountain climbs testing endurance, and individual time trials demanding precision and speed.",
    "This year’s race highlights include demanding stages in the Alps and the Pyrenees, where riders will tackle legendary ascents such as the Col du Tourmalet and Alpe d’Huez.",
    "Top cycling teams from around the globe will compete in the event, with cyclists vying for various prestigious jerseys.",
    "The yellow jersey (maillot jaune) is awarded to the overall leader of the general classification and symbolizes the ul

In [53]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
gen_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")
gen_tokenizer.pad_token = gen_tokenizer.eos_token


In [54]:
## The index has to be built first

# Convert text into an embedding
def embed_text(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model(**inputs).last_hidden_state.mean(dim=1)  # Mean-pool the token vectors
    return embeddings

# Collect the document embeddings
document_embeddings = []
for doc in documents:
    doc_embedding = embed_text(doc, tokenizer, model)
    document_embeddings.append(doc_embedding)

# Concatenate the embeddings into a single matrix
document_embeddings = torch.cat(document_embeddings).cpu().numpy()

# Build the FAISS index
index = faiss.IndexFlatL2(document_embeddings.shape[1])  # Flat index based on L2 distance
index.add(document_embeddings)  # Add the embeddings to the index
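
`embed_text` averages `last_hidden_state` over every token position. That is harmless here because each text is embedded on its own, so no padding is added. If several texts were embedded in one batch, the padded positions would be averaged in as well. A minimal sketch of mask-aware mean pooling, written with plain Python lists so it runs without torch; the name `masked_mean_pool` and the toy numbers are illustrative assumptions, not part of the notebook:

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0.

    token_vectors: list of equal-length lists of floats, one per token position
    attention_mask: list of 0/1 ints, 1 for real tokens and 0 for padding
    """
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    real_tokens = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            real_tokens += 1
            for j, value in enumerate(vector):
                totals[j] += value
    if real_tokens == 0:
        raise ValueError("attention_mask has no real tokens")
    return [total / real_tokens for total in totals]


# Two real tokens followed by one padding position (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)           # padding ignored
plain_mean = masked_mean_pool(tokens, [1, 1, 1])  # what an unmasked mean would give
print(pooled)      # [2.0, 3.0]
print(plain_mean)  # pulled far away by the padding row
```

In torch the same idea is usually written by multiplying `last_hidden_state` by the expanded `attention_mask` and dividing by the mask sum.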

In [55]:
queries = ["Where and when does the Tour de France 2024 begin and end?",
           "What types of stages are included in the 2024 Tour de France, and which legendary climbs are highlighted?",
           "What are the different jersey classifications in the Tour de France, and what do they represent?"]

In [65]:
## Decide whether a distance counts as relevant by comparing it with a threshold.
## Distances at or above the threshold are treated as not relevant (the threshold defaults to 40).
def is_relevant(distance, threshold=40):
    return distance < threshold
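
`faiss.IndexFlatL2` reports squared Euclidean distances, so a fixed threshold such as 40 only makes sense for a particular embedding scale. The sketch below uses toy 2-D vectors (assumed values, not real MiniLM embeddings) to show that the same pair of vectors can pass or fail the check depending on scale, and that normalizing to unit length removes that dependence. The helpers `squared_l2` and `normalize` are illustrative names.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance, the quantity IndexFlatL2 reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def is_relevant(distance, threshold=40):
    # Same rule as in the cell above
    return distance < threshold


query, doc = [3.0, 4.0], [6.0, 8.0]            # toy vectors pointing the same way
scaled_query = [2 * x for x in query]          # the same pair at twice the scale
scaled_doc = [2 * x for x in doc]

raw = squared_l2(query, doc)                   # 9 + 16 = 25.0
scaled = squared_l2(scaled_query, scaled_doc)  # 36 + 64 = 100.0
unit = squared_l2(normalize(query), normalize(doc))

print(is_relevant(raw), is_relevant(scaled))   # True False
print(unit)                                    # close to 0 once scale is removed
```

For unit-length vectors the squared L2 distance equals `2 - 2 * cosine_similarity`, so it always falls between 0 and 4, and a threshold would have to be chosen on that scale instead.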

In [76]:
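def retrieve(query, tokenizer, model, index, documents, top_k=3):
    # FAISS works on float32 NumPy arrays of shape (n_queries, dim)
    query_embedding = embed_text(query, tokenizer, model).cpu().numpy()
    # search returns 2-D arrays; row 0 belongs to the single query
    distances, indices = index.search(query_embedding, top_k)
    return [documents[i] for i in indices[0]], distances[0]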
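
What a flat L2 index does during `search` can be sketched without FAISS: compute the squared distance from the query to every stored vector, sort, and keep the `top_k` closest. The function `flat_l2_search`, the toy vectors, and the placeholder document names below are assumptions for illustration; the real index returns 2-D arrays with one row per query, which is why `retrieve` reads `indices[0]` and `distances[0]`.

```python
def flat_l2_search(query, vectors, top_k):
    """Brute-force nearest-neighbour search by squared L2 distance.

    Returns (distances, indices) for a single query, closest first.
    """
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vector)), i)
        for i, vector in enumerate(vectors)
    )
    best = scored[:top_k]
    distances = [distance for distance, _ in best]
    indices = [i for _, i in best]
    return distances, indices


# Toy "document embeddings" and matching placeholder documents.
vectors = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [1.0, 1.0]]
docs = ["doc A", "doc B", "doc C", "doc D"]
query = [2.0, 0.5]

distances, indices = flat_l2_search(query, vectors, top_k=3)
retrieved = [docs[i] for i in indices]

print(indices)    # [1, 3, 0]
print(distances)  # [0.25, 1.25, 4.25]
print(retrieved)  # ['doc B', 'doc D', 'doc A']
```

A flat index is exact but scans every vector, which is fine for the few dozen sentences used here.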

In [77]:
# First retrieve the documents closest to the query, then keep only those that are relevant enough.
# Build a context from them and finally generate an answer from that context.

def rag(query, retrieval_tokenizer, retrieval_model, retrival_index, gen_model, gen_tokenizer, documents, top_k):
    # Retrieve the documents and their distances
    retrieved_docs, distances = retrieve(query,
                                         retrieval_tokenizer,
                                         retrieval_model,
                                         retrival_index,
                                         documents,
                                         top_k)

    # Keep only the documents whose L2 distance to the query is below 40, i.e. close enough to count as relevant
    relevant_docs = [doc for doc, distance in zip(retrieved_docs, distances) if is_relevant(distance, 40)]

    # If no document passes the relevance check, return a message saying so
    if not relevant_docs:
        return "No relevant information found."

    # Context
    context = " ".join(relevant_docs)

    # Generate the answer from the context and the question
    generated_answer = generate_text(context, query, gen_model, gen_tokenizer)

    return generated_answer
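
`rag` calls `generate_text`, which is not defined in the cells shown here. The printed output suggests a prompt built from the context, a `Question:` line and an `Answer:` line, and that GPT-2 keeps writing further question/answer pairs after its first answer. The following is a hypothetical sketch under those assumptions: `build_prompt`, `trim_answer`, `generate_text_sketch`, the prompt wording, `max_new_tokens=60` and `max_length=900` are all guesses, not the notebook's code.

```python
def build_prompt(context, query):
    """Assemble a context/question/answer prompt (the format is an assumption)."""
    return f"Context: {context}\nQuestion: {query}\nAnswer:"


def trim_answer(generated):
    """Keep only the first answer, dropping any follow-up 'Question:' the model invents."""
    return generated.split("\nQuestion:")[0].strip()


def generate_text_sketch(context, query, gen_model, gen_tokenizer, max_new_tokens=60):
    """Hypothetical stand-in for the notebook's generate_text."""
    prompt = build_prompt(context, query)
    # GPT-2 has a 1024-token window; leave room for the new tokens.
    inputs = gen_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=900)
    output_ids = gen_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, so runs are repeatable
        pad_token_id=gen_tokenizer.eos_token_id,
    )
    # Decode only the tokens produced after the prompt.
    prompt_length = inputs["input_ids"].shape[1]
    generated = gen_tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
    return trim_answer(generated)


# The helpers can be checked without loading a model.
demo_prompt = build_prompt("The race has 21 stages.", "How many stages are there?")
demo_answer = trim_answer(" It has 21 stages.\nQuestion: What else?\nAnswer: Nothing.")
print(demo_prompt)
print(demo_answer)  # It has 21 stages.
```

Decoding only the new tokens keeps the prompt out of the returned answer, and cutting at the first follow-up `Question:` drops the extra question/answer pairs that the printed output shows GPT-2 producing.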

In [78]:
## Process every query in `queries`: search the document set, then generate an answer from the retrieved documents.
## tokenizer: splits text into tokens, i.e. smaller units such as words or word pieces
## top_k=3: number of documents retrieved per query (fixed at 3 here)
for query in queries:
    answer = rag(query,
                 tokenizer,
                 model,
                 index,
                 gen_model,
                 gen_tokenizer,
                 documents,
                 3)

    print(f"Query: {query}\nAnswer: {answer}\n")

Query: Where and when does the Tour de France 2024 begin and end?
Answer: Context With its mix of tradition, innovation, and global participation, the Tour de France 2024 promises to deliver another unforgettable chapter in the storied legacy of this legendary race. It will commence on June 29, 2024, inrace will conclude, as tradition dictates, on the iconic Champs-Élysées in Paris on July 21, 2024. The 2024 Tour de France will feature 21 stages, covering a total distance of approximately 3,500 kilometers.
Question: Where and when does the Tour de France 2024 begin and end?
Answer: The Tour de France 2024 will begin on June 29, 2024, in race will conclude, as tradition dictates, on the iconic Champs-Élysées in Paris on July 21, 2024. The 2024 Tour de France will feature 21 stages, covering a total distance of approximately 3,500 kilometers.
Question: What is the Tour de France 2024 schedule?
Answer: The Tour de France 2024 schedule will be announced on June 29, 2024

Query: What types 