# **LOAD DATA**

**First, we need to load the dataset. I chose a small Wikipedia-derived dataset: "rahular/simple-wikipedia" from the Hugging Face Hub.**

In [1]:
from datasets import load_dataset

dataset = load_dataset("rahular/simple-wikipedia")

### **Key Information About Our Dataset**

In [2]:
print(f"Dataset type: {type(dataset)}")
print(f"Available splits: {list(dataset.keys())}")

Dataset type: <class 'datasets.dataset_dict.DatasetDict'>
Available splits: ['train']


In [3]:
print(f"Dataset length: {len(dataset['train'])}")

Dataset length: 769764


In [4]:
print(f"Dataset features: {dataset['train'].features}")

Dataset features: {'text': Value('string')}


In [5]:
print(f"Sample record (index 558): {dataset['train'][558]}")

Sample record (index 558): {'text': 'Plants are also multicellular eukaryotic organisms, but live by using light, water and basic elements to make their tissues.'}


In [6]:
empty = [i for i in range(len(dataset["train"])) if not dataset["train"][i]["text"].strip()]
print(f"Empty records: {len(empty)}")

Empty records: 0


In [7]:
import pandas as pd
df = pd.DataFrame(dataset["train"][:5])
print("Sample data as DataFrame:")
df

Sample data as DataFrame:


| | text |
|---|---|
| 0 | April |
| 1 | April is the fourth month of the year, and com... |
| 2 | April always begins on the same day of week as... |
| 3 | April's flowers are the Sweet Pea and Daisy. I... |
| 4 | April comes between March and May, making it t... |


# **Document Creation**

In [8]:
from langchain_core.documents import Document as LangChainDocument

langchain_documents = []
for i, record in enumerate(dataset["train"]):
    text_content = record["text"]
        
    if not text_content.strip():
        continue
        
    metadata = {
        "source": "simple_wikipedia",
        "original_index": i,
        "text_length": len(text_content),
    }
        
    doc = LangChainDocument(
        page_content=text_content,
        metadata=metadata
    )
        
    langchain_documents.append(doc)
    
print(f"✅ Created {len(langchain_documents)} LangChain documents")

✅ Created 769764 LangChain documents


# **Text Splitting**

In [9]:
# Get all text lengths from the dataset
text_lengths = []
for i, record in enumerate(dataset["train"]):
    text = record["text"]
    if text.strip():  # Skip empty records
        text_lengths.append(len(text))

print(f"Total non-empty documents: {len(text_lengths)}")

# =====================================================================
# LENGTH STATISTICS
# =====================================================================

import numpy as np

print("\nDOCUMENT LENGTH STATISTICS (in characters):")
print("-" * 50)
print(f"Minimum length: {min(text_lengths):,} characters")
print(f"Maximum length: {max(text_lengths):,} characters")
print(f"Average length: {np.mean(text_lengths):,.0f} characters")
print(f"Median length: {np.median(text_lengths):,.0f} characters")
print(f"Standard deviation: {np.std(text_lengths):,.0f} characters")

# Percentiles
percentiles = [10, 25, 50, 75, 90, 95, 99]
print(f"\nPercentiles:")
for p in percentiles:
    value = np.percentile(text_lengths, p)
    print(f"  {p}th percentile: {value:,.0f} characters")

Total non-empty documents: 769764

DOCUMENT LENGTH STATISTICS (in characters):
--------------------------------------------------
Minimum length: 1 characters
Maximum length: 10,570 characters
Average length: 183 characters
Median length: 127 characters
Standard deviation: 198 characters

Percentiles:
  10th percentile: 12 characters
  25th percentile: 24 characters
  50th percentile: 127 characters
  75th percentile: 271 characters
  90th percentile: 432 characters
  95th percentile: 558 characters
  99th percentile: 872 characters
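The percentile figures above come from `np.percentile`, which by default interpolates linearly between neighbouring sorted values. As a rough illustration of that calculation, here is a minimal pure-Python sketch. The helper name `percentile_linear` and the sample lengths are made up for the example and are not part of the notebook.

```python
def percentile_linear(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    # Interpolate between the two neighbouring values
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical document lengths, for illustration only
sample_lengths = [5, 12, 24, 127, 271, 432]
for p in (0, 50, 100):
    print(f"{p}th percentile: {percentile_linear(sample_lengths, p)}")
```

With an even number of values, the 50th percentile falls halfway between the two middle values, which is why the median can be a number that does not occur in the data.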


### Why Text Splitting is Not Necessary for Our Dataset

After analyzing our Simple Wikipedia dataset, we found that text splitting (chunking) is not necessary here. The dataset contains 769,764 non-empty records with an average length of only 183 characters (roughly 46 tokens, assuming about 4 characters per token) and a median of 127 characters (roughly 32 tokens). Even the 95th percentile is just 558 characters (roughly 140 tokens), which is below the 256-token default input limit of all-MiniLM-L6-v2, the embedding model used below, and well below the 512-8192 token limits typical of larger embedding models.

Since the embedding model can handle almost every record in full, splitting these already short records (each one is a title or a single paragraph, not a whole article) would be counterproductive. Text splitting is typically needed when documents are very long (>1000 characters) or exceed the model's token limit, and fewer than 1% of our records are longer than 872 characters. Only the rare outliers (the maximum is 10,570 characters) exceed the limit, and the model truncates those rather than failing. Keeping each record intact preserves the full context of the paragraph, avoids fragmenting short texts into less informative pieces, and simplifies our RAG pipeline. We can proceed directly to embedding generation without any chunking step.
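The token counts quoted above are consistent with a rule of thumb of about 4 characters per token. That ratio is an assumption, not something measured with the model's tokenizer. The sketch below reproduces the arithmetic and checks a few lengths from the statistics against a token limit. The helper names and the 512-token limit passed in are illustrative.

```python
import math

CHARS_PER_TOKEN = 4  # assumed rule of thumb, not measured with a tokenizer


def estimate_tokens(num_chars, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count of a text from its character count."""
    return math.ceil(num_chars / chars_per_token)


def needs_splitting(num_chars, max_tokens):
    """Return True if the estimated token count exceeds the model limit."""
    return estimate_tokens(num_chars) > max_tokens


# Character lengths taken from the statistics printed above
lengths = {"mean": 183, "median": 127, "95th percentile": 558, "maximum": 10570}
for label, chars in lengths.items():
    tokens = estimate_tokens(chars)
    print(f"{label}: {chars} chars, about {tokens} tokens, "
          f"over a 512-token limit: {needs_splitting(chars, 512)}")
```

For an exact count, the estimate could be replaced by the length of the token list produced by the embedding model's own tokenizer.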

# **Embedding Generation**

In [10]:
# from sentence_transformers import SentenceTransformer

# embedding_model = SentenceTransformer("all-MiniLM-L6-v2",device="cuda")

In [11]:
# print(f"Model device: {embedding_model.device}")
# print(f"Model is on GPU: {next(embedding_model.parameters()).is_cuda}")

In [12]:
# texts = [doc.page_content for doc in langchain_documents]
# embeddings = embedding_model.encode(texts)
# print(f"Embeddings shape: {embeddings.shape}")

In [13]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs={"device": "cuda"})

In [14]:
print(f"Model kwargs: {embedding_function.model_kwargs}")
print(f"Model name: {embedding_function.model_name}")

Model kwargs: {'device': 'cuda'}
Model name: all-MiniLM-L6-v2


In [15]:
texts = [doc.page_content for doc in langchain_documents]
embeddings = embedding_function.embed_documents(texts)

In [16]:
print(f"Number of embeddings: {len(embeddings)}")
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"Shape equivalent: ({len(embeddings)}, {len(embeddings[0])})")

Number of embeddings: 769764
Embedding dimension: 384
Shape equivalent: (769764, 384)
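To get a feel for what a (769764, 384) result means in memory, here is a back-of-the-envelope calculation. It assumes the vectors are stored as 32-bit floats (4 bytes per value). That is an assumption for the sketch: `embed_documents` returns plain Python lists, which take more space than a packed array.

```python
NUM_VECTORS = 769_764   # number of embeddings reported above
DIMENSION = 384         # embedding dimension reported above
BYTES_PER_FLOAT32 = 4   # assumption: values stored as 32-bit floats


def embedding_matrix_bytes(num_vectors, dimension, bytes_per_value=BYTES_PER_FLOAT32):
    """Size in bytes of a dense num_vectors x dimension matrix."""
    return num_vectors * dimension * bytes_per_value


total_bytes = embedding_matrix_bytes(NUM_VECTORS, DIMENSION)
print(f"{total_bytes:,} bytes = {total_bytes / 1024**3:.2f} GiB")
```

At this size the matrix still fits in RAM on most machines, but it is large enough that holding a second copy of the embeddings is worth avoiding.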


# TODO: COMPARE SOME SIMILAR SENTENCES
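One common way to compare two sentences is the cosine similarity of their embedding vectors. The sketch below shows only the calculation, using made-up 3-dimensional vectors in place of the real 384-dimensional embeddings. The vector values and variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence embeddings (illustrative values only)
vec_april_a = [0.9, 0.1, 0.2]
vec_april_b = [0.8, 0.2, 0.1]
vec_plants = [0.1, 0.9, 0.3]

print(f"similar pair:   {cosine_similarity(vec_april_a, vec_april_b):.3f}")
print(f"unrelated pair: {cosine_similarity(vec_april_a, vec_plants):.3f}")
```

In the notebook, the input vectors could be obtained with `embedding_function.embed_query(sentence)` for each sentence to compare.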

# **Vector Database Setup**

In [17]:
# # First, create a wrapper for SentenceTransformer
# from langchain_huggingface import HuggingFaceEmbeddings

# # Create a LangChain-compatible embedding function
# embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2",model_kwargs={'device': 'cuda'})

In [18]:
from langchain_chroma import Chroma
import shutil
import os

persist_dir = "./chroma_wikipedia_db"

# Remove the old database if it exists
if os.path.exists(persist_dir):
    shutil.rmtree(persist_dir)

vector_store = Chroma(
    collection_name="simple_wikipedia_collection",
    embedding_function=embedding_function,
    persist_directory=persist_dir, 
)

In [19]:
BATCH_SIZE = 5000  # Leave a margin below the limit of 5461

# Split the documents into batches and add them
for i in range(0, len(langchain_documents), BATCH_SIZE):
    batch = langchain_documents[i:i + BATCH_SIZE]
    vector_store.add_documents(batch)
    print(f"Added batch {i//BATCH_SIZE + 1}: documents {i} - {min(i + BATCH_SIZE, len(langchain_documents))}")


Added batch 1: documents 0 - 5000
Added batch 2: documents 5000 - 10000
Added batch 3: documents 10000 - 15000
Added batch 4: documents 15000 - 20000
Added batch 5: documents 20000 - 25000
Added batch 6: documents 25000 - 30000
Added batch 7: documents 30000 - 35000
Added batch 8: documents 35000 - 40000
Added batch 9: documents 40000 - 45000
Added batch 10: documents 45000 - 50000
Added batch 11: documents 50000 - 55000
Added batch 12: documents 55000 - 60000
Added batch 13: documents 60000 - 65000
Added batch 14: documents 65000 - 70000
Added batch 15: documents 70000 - 75000
Added batch 16: documents 75000 - 80000
Added batch 17: documents 80000 - 85000
Added batch 18: documents 85000 - 90000
Added batch 19: documents 90000 - 95000
Added batch 20: documents 95000 - 100000
Added batch 21: documents 100000 - 105000
Added batch 22: documents 105000 - 110000
Added batch 23: documents 110000 - 115000
Added batch 24: documents 115000 - 120000
Added batch 25: docum
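The batching loop above slices the document list into fixed-size chunks so that each call stays under the per-call limit mentioned in the cell's comment. The same pattern can be written as a small reusable generator. This is a sketch with a hypothetical helper name, `iter_batches`, and stand-in data instead of the real documents.

```python
import math


def iter_batches(items, batch_size):
    """Yield (start_index, batch) pairs covering items in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


# Stand-in data: 12 fake documents split into batches of 5
docs = [f"doc-{n}" for n in range(12)]
for start, batch in iter_batches(docs, 5):
    print(f"documents {start} - {start + len(batch)}: {len(batch)} items")

# Number of batches needed for the full dataset at a batch size of 5000
print(math.ceil(769_764 / 5000))
```

The last batch is simply shorter than the others, so no special handling is needed when the total is not a multiple of the batch size.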

In [20]:
print(vector_store._collection.count())  # Chroma

769764


# **Retrieval System**

In [21]:
query = "Between each month comes April?"

In [22]:
# The score returned here is a distance: a lower value means a closer match
results = vector_store.similarity_search_with_score(query, k=5)
for res, score in results:
    print(f"* [DIST={score:.6f}] {res.page_content}")

* [DIST=0.570056] April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.
* [DIST=0.588876] April is the fourth month of the year, and comes between March and May. It is one of four months to have 30 days.
* [DIST=0.675394] April begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.
* [DIST=0.679126] April is a spring month in the Northern Hemisphere and an autumn/fall month in the Southern Hemisphere. In each hemisphere, it is the seasonal equivalent of October in the other.
* [DIST=0.698874] April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.
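The scores printed above increase from the best match to the worst. That fits my understanding that LangChain's Chroma wrapper returns a distance, where a lower score means a closer match. Under that assumption, weak matches can be dropped with a maximum-distance cutoff. In the sketch below, the helper name, the shortened passage labels and the 0.6 threshold are all illustrative choices. Only the scores are copied from the output above.

```python
def filter_by_max_distance(scored_results, max_distance):
    """Keep (item, distance) pairs at or below max_distance, closest first."""
    kept = [(item, score) for item, score in scored_results if score <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Scores copied from the retrieval output; labels are shortened placeholders
scored = [
    ("April comes between March and May...", 0.570056),
    ("April is the fourth month of the year...", 0.588876),
    ("April begins on the same day of the week as July...", 0.675394),
    ("April is a spring month in the Northern Hemisphere...", 0.679126),
    ("April always begins on the same day of week as July...", 0.698874),
]

for text, distance in filter_by_max_distance(scored, max_distance=0.6):
    print(f"{distance:.6f}  {text}")
```

With this cutoff only the two passages that actually answer the question are kept. A suitable threshold depends on the embedding model and the distance metric, so it should be tuned on real queries.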


# **Generation (LLM)**

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_huggingface.llms import HuggingFacePipeline

model_id = "google/flan-t5-large"

# 1. Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Create the pipeline with the appropriate parameters
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0,  # GPU
    do_sample=True,
    max_new_tokens=200,
    temperature=0.7,
    repetition_penalty=1.0,
)

# 3. Pass the pipeline to LangChain
llm = HuggingFacePipeline(pipeline=pipe)


Device set to use cuda:0


In [6]:
# If you are using a notebook or Colab
import torch
torch.cuda.empty_cache()


In [7]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

retriever = vector_store.as_retriever(search_kwargs={"k": 5})

my_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
Use the following context to answer the question below as clearly and concisely as possible.

{context}

Question: {question}
Answer:"""
)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": my_prompt},
    return_source_documents=True,
    verbose=True
)

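As far as I understand it, `RetrievalQA.from_chain_type` defaults to the "stuff" strategy: the retrieved passages are joined into one block of text and substituted for `{context}` in the prompt. The sketch below imitates that step with plain string formatting so the final prompt can be inspected. The helper `build_prompt`, the `"\n\n"` separator and the example question are assumptions made for illustration, not the chain's actual internals.

```python
PROMPT_TEMPLATE = """
Use the following context to answer the question below as clearly and concisely as possible.

{context}

Question: {question}
Answer:"""


def build_prompt(passages, question, separator="\n\n"):
    """Join the retrieved passages and fill them into the prompt template."""
    context = separator.join(passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Two passages taken from the retrieval results earlier in the notebook
passages = [
    "April is the fourth month of the year, and comes between March and May. "
    "It is one of four months to have 30 days.",
    "April always begins on the same day of week as July, and additionally, "
    "January in leap years. April always ends on the same day of the week as December.",
]

print(build_prompt(passages, "Between which months does April come?"))
```

Printing the assembled prompt like this is a quick way to check how much text the model receives once all retrieved passages are included.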

In [114]:
# Example question
query = "What can u say about climate change?"
result = rag_chain.invoke({"query": query})

# Result
print("------------------------------------------------------------------------------------------------------------------------")
print("🔍 Answer:\n", result["result"])
print("------------------------------------------------------------------------------------------------------------------------")
print("\n📄 Sources:")
print("------------------------------------------------------------------------------------------------------------------------")
for doc in result["source_documents"]:
    print("-", doc.page_content, "...")


> Entering new RetrievalQA chain...

> Finished chain.
------------------------------------------------------------------------------------------------------------------------
🔍 Answer:
 climate change is now a big problem
------------------------------------------------------------------------------------------------------------------------

📄 Sources:
------------------------------------------------------------------------------------------------------------------------
- Climate change ...
- Climate change means the climate of Earth changing. Climate change is now a big problem. Climate change this century and last century is sometimes called global warming, because the surface of the Earth is getting hotter, because of humans. But thousands and millions of years ago sometimes it was very cold, like ice ages and snowball Earth. ...
- People in government and the Intergovernmental Panel on Climate Change (IPCC) are talking about global warming. But governments, co

In [16]:
import torch
print(torch.version.hip)  # checks whether ROCm is available
print(torch.cuda.is_available())  # may also work with ROCm


6.4.43483-a187df25c
True


In [17]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"Current device: {torch.cuda.current_device()}")

CUDA available: True
Device count: 1
Current device: 0


In [18]:
# Check GPU details
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")
print(f"PyTorch version: {torch.__version__}")

CUDA available: True
Device name: AMD Radeon RX 6750 XT
PyTorch version: 2.9.0.dev20250715+rocm6.4
