# **LOAD DATA**

**First, we load the dataset. I chose a small Simple English Wikipedia dataset, `rahular/simple-wikipedia`, from the Hugging Face Hub.**

In [3]:
from datasets import load_dataset

dataset = load_dataset("rahular/simple-wikipedia")

### **Key Information About Our Dataset**

In [4]:
print(f"Dataset type: {type(dataset)}")
print(f"Available splits: {list(dataset.keys())}")

Dataset type: <class 'datasets.dataset_dict.DatasetDict'>
Available splits: ['train']


In [5]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Dataset length: {len(dataset['train'])}")

Dataset length: 769764


In [6]:
print(f"Dataset features: {dataset['train'].features}")

Dataset features: {'text': Value('string')}


In [7]:
print(f"Sample record (index 558): {dataset['train'][558]}")

Sample record (index 558): {'text': 'Plants are also multicellular eukaryotic organisms, but live by using light, water and basic elements to make their tissues.'}


In [8]:
# Collect the indices of records whose text is empty or whitespace-only
empty = [i for i, record in enumerate(dataset["train"]) if not record["text"].strip()]
print(f"Empty records: {len(empty)}")

Empty records: 0


In [9]:
import pandas as pd
df = pd.DataFrame(dataset["train"][:5])
print("Sample data as DataFrame:")
df

Sample data as DataFrame:


| | text |
|---|---|
| 0 | April |
| 1 | April is the fourth month of the year, and com... |
| 2 | April always begins on the same day of week as... |
| 3 | April's flowers are the Sweet Pea and Daisy. I... |
| 4 | April comes between March and May, making it t... |


# **Document Creation**

In [10]:
# Document lives in langchain_core; the older langchain.schema path is a legacy alias
from langchain_core.documents import Document as LangChainDocument

langchain_documents = []
for i, record in enumerate(dataset["train"]):
    text_content = record["text"]
        
    if not text_content.strip():
        continue
        
    metadata = {
        "source": "simple_wikipedia",
        "original_index": i,
        "text_length": len(text_content),
    }
        
    doc = LangChainDocument(
        page_content=text_content,
        metadata=metadata
    )
        
    langchain_documents.append(doc)
    
print(f"✅ Created {len(langchain_documents)} LangChain documents")

✅ Created 769764 LangChain documents


# **Text Splitting**

In [11]:
# Get all text lengths from the dataset
text_lengths = []
for i, record in enumerate(dataset["train"]):
    text = record["text"]
    if text.strip():  # Skip empty records
        text_lengths.append(len(text))

print(f"Total non-empty documents: {len(text_lengths)}")

# =====================================================================
# LENGTH STATISTICS
# =====================================================================

import numpy as np

print("\nDOCUMENT LENGTH STATISTICS (in characters):")
print("-" * 50)
print(f"Minimum length: {min(text_lengths):,} characters")
print(f"Maximum length: {max(text_lengths):,} characters")
print(f"Average length: {np.mean(text_lengths):,.0f} characters")
print(f"Median length: {np.median(text_lengths):,.0f} characters")
print(f"Standard deviation: {np.std(text_lengths):,.0f} characters")

# Percentiles
percentiles = [10, 25, 50, 75, 90, 95, 99]
print(f"\nPercentiles:")
for p in percentiles:
    value = np.percentile(text_lengths, p)
    print(f"  {p}th percentile: {value:,.0f} characters")

Total non-empty documents: 769764

DOCUMENT LENGTH STATISTICS (in characters):
--------------------------------------------------
Minimum length: 1 characters
Maximum length: 10,570 characters
Average length: 183 characters
Median length: 127 characters
Standard deviation: 198 characters

Percentiles:
  10th percentile: 12 characters
  25th percentile: 24 characters
  50th percentile: 127 characters
  75th percentile: 271 characters
  90th percentile: 432 characters
  95th percentile: 558 characters
  99th percentile: 872 characters


### Why Text Splitting is Not Necessary for Our Dataset

After analyzing our Simple Wikipedia dataset, we found that text splitting (chunking) is not necessary here. The dataset contains 769,764 non-empty records with an average length of only 183 characters and a median of 127 characters, which is roughly 46 and 32 tokens if we assume about 4 characters per token. Even the 95th percentile is just 558 characters (about 140 tokens), well below the typical embedding model limits of 512-8192 tokens. The `all-MiniLM-L6-v2` model used below has a tighter default limit of 256 word pieces, but that still covers the 99th percentile (872 characters, about 218 tokens); only a small tail of longer records, up to the 10,570-character maximum, would be truncated.

Since the embedding model can handle records of this size, splitting our already short Wikipedia records would be counterproductive. Text splitting is typically needed when documents are very long (more than about 1,000 characters) or exceed the model's token limit, and almost none of our records do. Keeping the records intact preserves the full context of each one, avoids fragmenting short passages into less meaningful embeddings, and simplifies our RAG pipeline. We can proceed directly to embedding generation without a chunking step.
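
The character-to-token figures above can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes roughly 4 characters per token, a common rule of thumb for English text and not a property of any particular tokenizer; `CHARS_PER_TOKEN` and `estimate_tokens` are illustrative names, and exact counts would require running the embedding model's own tokenizer.

```python
# Rough token estimate from character counts.
# Assumption: ~4 characters per token (rule of thumb for English text).
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Return an approximate token count for a text of num_chars characters."""
    return round(num_chars / chars_per_token)


# Character statistics taken from the length analysis output above.
length_stats = {"mean": 183, "median": 127, "p95": 558, "p99": 872}

token_estimates = {name: estimate_tokens(chars) for name, chars in length_stats.items()}
for name, tokens in token_estimates.items():
    print(f"{name}: ~{tokens} tokens")
```

These are estimates only; word-piece tokenizers can produce noticeably more tokens for rare words, names, and numbers.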

# **Embedding Generation**

In [12]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

In [13]:
texts = [doc.page_content for doc in langchain_documents]
embeddings = embedding_model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")

Embeddings shape: (769764, 384)


In [14]:
#similarities = embedding_model.similarity(embeddings, embeddings)
#print(similarities)

# TODO: compare a few similar sentences
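
For the sentence comparison planned in the cell above, one option is to score a handful of vectors with cosine similarity instead of building the full 769,764 × 769,764 matrix, which would hold roughly 5.9 × 10^11 entries and not fit in memory. The sketch below is dependency-free and uses toy 3-dimensional vectors as stand-ins; in the notebook one would pass selected rows of `embeddings` instead. `cosine_similarity` here is an illustrative helper, not a sentence-transformers function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for embedding rows (hypothetical values, not model output).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_same = cosine_similarity(vec_a, vec_b)
sim_orth = cosine_similarity(vec_a, vec_c)
print(f"same direction: {sim_same:.3f}, orthogonal: {sim_orth:.3f}")
```

Picking a few records by index (for example, neighbouring sentences from the same article and one unrelated sentence) keeps the comparison cheap and easy to read.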

# **Vector Database Setup**

In [15]:
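from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings  # assumes langchain-huggingface is installed

# Chroma expects a LangChain Embeddings object, not a raw SentenceTransformer,
# so wrap the same model for the vector store.
lc_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2", model_kwargs={"device": "cpu"}
)

vector_store = Chroma(
    collection_name="simple_wikipedia_collection",
    embedding_function=lc_embeddings,
    persist_directory="./chroma_wikipedia_db",
)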
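
The store created above is still empty. Below is a minimal sketch of how the documents could be added in fixed-size batches. Batching is my assumption here, to keep each insert small with roughly 770k records; `batched` and `BATCH_SIZE` are illustrative names. The commented-out loop uses `add_documents`, the standard LangChain vector store method, and relies on `vector_store` and `langchain_documents` from the cells above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

BATCH_SIZE = 1000  # hypothetical value; tune to available memory and store limits


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# In the notebook (not run here):
# for batch in batched(langchain_documents, BATCH_SIZE):
#     vector_store.add_documents(batch)

# Small demonstration with plain integers.
demo_batches = list(batched(list(range(7)), 3))
print(demo_batches)
```

Because the store embeds each batch as it is added, this step would repeat the embedding work done earlier unless the precomputed vectors are passed to the underlying collection directly.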

# **Retrieval System**

# **Generation (LLM)**

In [16]:
import torch
print(torch.version.hip)  # ROCm/HIP version string; None if this is not a ROCm build
print(torch.cuda.is_available())  # ROCm builds also report the GPU through the CUDA API


6.4.43483-a187df25c
True


In [17]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"Current device: {torch.cuda.current_device()}")

CUDA available: True
Device count: 1
Current device: 0


In [18]:
# Check GPU details
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")
print(f"PyTorch version: {torch.__version__}")

CUDA available: True
Device name: AMD Radeon RX 6750 XT
PyTorch version: 2.9.0.dev20250715+rocm6.4
