# **Load Data**

**First, we load the dataset. I chose a small Wikipedia-derived dataset: "rahular/simple-wikipedia" from the Hugging Face Hub.**

In [4]:
from datasets import load_dataset

dataset = load_dataset("rahular/simple-wikipedia")

**Key Information About Our Dataset**

In [5]:
print(f"Dataset type: {type(dataset)}")
print(f"Available splits: {list(dataset.keys())}")

Dataset type: <class 'datasets.dataset_dict.DatasetDict'>
Available splits: ['train']


In [6]:
print(f"Dataset length: {len(dataset['train'])}")

Dataset length: 769764


In [7]:
print(f"Dataset features: {dataset['train'].features}")

Dataset features: {'text': Value('string')}


In [8]:
print(f"Sample record (index 558): {dataset['train'][558]}")

Sample record (index 558): {'text': 'Plants are also multicellular eukaryotic organisms, but live by using light, water and basic elements to make their tissues.'}


In [9]:
# Read the "text" column once instead of indexing the dataset row by row
empty = [i for i, text in enumerate(dataset["train"]["text"]) if not text.strip()]
print(f"Empty records: {len(empty)}")

Empty records: 0


In [10]:
import pandas as pd
df = pd.DataFrame(dataset["train"][:5])
print("Sample data as DataFrame:")
df

Sample data as DataFrame:


|   | text |
|---|------|
| 0 | April |
| 1 | April is the fourth month of the year, and com... |
| 2 | April always begins on the same day of week as... |
| 3 | April's flowers are the Sweet Pea and Daisy. I... |
| 4 | April comes between March and May, making it t... |


# **Document Creation**

In [11]:
# Document now lives in langchain_core; langchain.schema is the legacy import path
from langchain_core.documents import Document as LangChainDocument

langchain_documents = []
for i, record in enumerate(dataset["train"]):
    text_content = record["text"]

    # Skip records that are empty or whitespace-only
    if not text_content.strip():
        continue

    # Keep the original row index so each document can be traced back to the dataset
    metadata = {
        "source": "simple_wikipedia",
        "original_index": i,
        "text_length": len(text_content),
    }

    doc = LangChainDocument(
        page_content=text_content,
        metadata=metadata
    )

    langchain_documents.append(doc)

print(f"✅ Created {len(langchain_documents)} LangChain documents")

✅ Created 769764 LangChain documents


In [14]:
# Collect the character length of every non-empty record
text_lengths = []
for record in dataset["train"]:
    text = record["text"]
    if text.strip():  # Skip empty records
        text_lengths.append(len(text))

print(f"Total non-empty documents: {len(text_lengths)}")

# =====================================================================
# LENGTH STATISTICS
# =====================================================================

import numpy as np

print("\nDOCUMENT LENGTH STATISTICS (in characters):")
print("-" * 50)
print(f"Minimum length: {min(text_lengths):,} characters")
print(f"Maximum length: {max(text_lengths):,} characters")
print(f"Average length: {np.mean(text_lengths):,.0f} characters")
print(f"Median length: {np.median(text_lengths):,.0f} characters")
print(f"Standard deviation: {np.std(text_lengths):,.0f} characters")

# Percentiles
percentiles = [10, 25, 50, 75, 90, 95, 99]
print("\nPercentiles:")
for p in percentiles:
    value = np.percentile(text_lengths, p)
    print(f"  {p}th percentile: {value:,.0f} characters")

Total non-empty documents: 769764

DOCUMENT LENGTH STATISTICS (in characters):
--------------------------------------------------
Minimum length: 1 characters
Maximum length: 10,570 characters
Average length: 183 characters
Median length: 127 characters
Standard deviation: 198 characters

Percentiles:
  10th percentile: 12 characters
  25th percentile: 24 characters
  50th percentile: 127 characters
  75th percentile: 271 characters
  90th percentile: 432 characters
  95th percentile: 558 characters
  99th percentile: 872 characters
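
The percentiles above suggest that roughly a quarter of the records are 24 characters or shorter, and the first DataFrame row ("April") shows that at least one of them is a bare title. Below is a minimal sketch of how such records could be filtered out before document creation. The cutoff `MIN_CHARS = 25` and the helper `keep_substantive` are hypothetical (neither appears in the notebook), and the sketch runs on a small hand-made sample rather than the full dataset:

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical cutoff, chosen only for illustration (just above the 25th percentile)
MIN_CHARS = 25


def keep_substantive(
    records: Iterable[Dict[str, str]], min_chars: int = MIN_CHARS
) -> Iterator[Tuple[int, str]]:
    """Yield (original_index, text) for records with at least min_chars characters."""
    for i, record in enumerate(records):
        text = record["text"].strip()
        if len(text) >= min_chars:
            yield i, text


# Hand-made sample: a title, a whitespace-only record, and a full sentence
sample = [
    {"text": "April"},
    {"text": "   "},
    {"text": "Plants are also multicellular eukaryotic organisms, but live by "
             "using light, water and basic elements to make their tissues."},
]

kept = list(keep_substantive(sample))
print(f"Kept {len(kept)} of {len(sample)} records")
for index, text in kept:
    print(f"  index {index}: {len(text)} characters")
```

The cutoff is a judgment call: a higher value removes more heading-like records but may also drop short, valid sentences.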
