# Data Collection and Exploration: Building the Foundation of RAG Systems

## Introduction to Data Collection in RAG Systems

In this notebook, we will collect real data from Wikipedia and ArXiv to build our RAG (Retrieval-Augmented Generation) system. This is where we start working with actual documents that our system will need to understand and retrieve from.

### Why Data Collection Matters

Data collection is the first and most critical step in building any RAG system. The quality and diversity of your data directly impact:

- **Retrieval Accuracy**: Better data leads to more relevant search results
- **Response Quality**: Rich, diverse data enables more comprehensive answers
- **System Reliability**: Well-structured data ensures consistent performance
- **Domain Coverage**: Different data sources provide broader knowledge coverage

### Understanding Our Data Sources

We'll work with two primary data sources:

1. **Wikipedia Articles**: Encyclopedia-style content with structured information
2. **ArXiv Papers**: Scientific abstracts with technical, research-focused content

Each source has unique characteristics that affect how we process and use the data in our RAG system.

## Learning Objectives

By the end of this notebook, you will:
1. Understand how to collect data using our custom DataCollector
2. Explore the structure and characteristics of different data sources
3. Learn about data quality and filtering strategies
4. Get hands-on experience with real text data
5. Understand why chunking is essential for RAG systems
6. Compare fixed, semantic, and hierarchical chunking strategies
7. Preprocess text and save the resulting chunks for later notebooks


In [1]:
# Standard library and third-party imports
import json
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import numpy as np
from tqdm import tqdm
import os
import sys

# Add project root to path - multiple approaches for reliability
current_dir = os.getcwd()
project_root = os.path.dirname(current_dir) if current_dir.endswith('notebooks') else current_dir

# Add both current directory and project root to path
sys.path.insert(0, project_root)
sys.path.insert(0, current_dir)
sys.path.insert(0, '.')

print(f"Current directory: {current_dir}")
print(f"Project root: {project_root}")
print(f"Python path: {sys.path[:3]}")

# importlib lets you call importlib.reload() on a module after editing src/
# (nothing is reloaded automatically here)
import importlib

# Import our custom modules with error handling
try:
    from src.config import DATA_CONFIG, DATA_DIR
    from src.data.collect_data import DataCollector
    from src.data.preprocess_data import TextPreprocessor
    print("Successfully imported from src module")
except ImportError as e:
    print(f"Import error: {e}")
    print("Trying alternative import methods...")
    
    # Try importing directly from the file
    try:
        import importlib.util
        
        # Import config
        config_path = os.path.join(project_root, 'src', 'config.py')
        spec = importlib.util.spec_from_file_location("config", config_path)
        config_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(config_module)
        DATA_CONFIG = config_module.DATA_CONFIG
        DATA_DIR = config_module.DATA_DIR
        
        # Import collect_data
        collect_path = os.path.join(project_root, 'src', 'data', 'collect_data.py')
        spec = importlib.util.spec_from_file_location("collect_data", collect_path)
        collect_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(collect_module)
        DataCollector = collect_module.DataCollector
        
        # Import preprocess_data
        preprocess_path = os.path.join(project_root, 'src', 'data', 'preprocess_data.py')
        spec = importlib.util.spec_from_file_location("preprocess_data", preprocess_path)
        preprocess_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(preprocess_module)
        TextPreprocessor = preprocess_module.TextPreprocessor
        
        print("Successfully imported using direct file imports")
        
    except Exception as e2:
        print(f"Direct import also failed: {e2}")
        print("Please check that you're running this from the correct directory")
        raise

print("Libraries imported successfully!")
print(f"Data directory: {DATA_DIR}")

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")


Current directory: /Users/scienceman/Desktop/LLM/notebooks
Project root: /Users/scienceman/Desktop/LLM
Python path: ['.', '/Users/scienceman/Desktop/LLM/notebooks', '/Users/scienceman/Desktop/LLM']
Successfully imported from src module
Libraries imported successfully!
Data directory: /Users/scienceman/Desktop/LLM/data


In [2]:
# Diagnostic and Testing Cell
print("Running diagnostics...")

# Check current working directory and file structure
import os
print(f"Current working directory: {os.getcwd()}")
print(f"Contents of current directory: {os.listdir('.')}")

# Check if src directory exists
if os.path.exists('../src'):
    print("Found src directory in parent folder")
    print(f"Contents of src/: {os.listdir('../src')}")
    if os.path.exists('../src/data'):
        print(f"Contents of src/data/: {os.listdir('../src/data')}")
else:
    print("src directory not found in parent folder")

# Test the DataCollector method signatures
print("\nTesting DataCollector method signatures...")

try:
    import inspect
    
    # Check Wikipedia method
    wiki_sig = inspect.signature(DataCollector.collect_wikipedia_data)
    print(f"Wikipedia method parameters: {list(wiki_sig.parameters.keys())}")
    
    # Check ArXiv method  
    arxiv_sig = inspect.signature(DataCollector.collect_arxiv_data)
    print(f"ArXiv method parameters: {list(arxiv_sig.parameters.keys())}")
    
    # Test with sample data (should work)
    print("\nTesting with sample data...")
    test_collector = DataCollector()
    test_wiki = test_collector.collect_wikipedia_data(max_documents=1, use_real_data=False)
    print(f"Sample Wikipedia test: {len(test_wiki)} articles")
    
    test_arxiv = test_collector.collect_arxiv_data(max_documents=1, use_real_data=False)
    print(f"Sample ArXiv test: {len(test_arxiv)} papers")
    
    print("DataCollector methods are working correctly!")
    
except Exception as e:
    print(f"Error during testing: {e}")
    print("This might indicate an import or configuration issue.")


INFO:src.data.collect_data:Data collector initialized. Output directory: /Users/scienceman/Desktop/LLM/data/raw
INFO:src.data.collect_data:Collecting Wikipedia data...
INFO:src.data.collect_data:Using sample Wikipedia data...
INFO:src.data.collect_data:Collecting ArXiv data...
INFO:src.data.collect_data:Using sample ArXiv data...


Running diagnostics...
Current working directory: /Users/scienceman/Desktop/LLM/notebooks
Contents of current directory: ['07_llm_integration.ipynb', '05_vector_search.ipynb', '03_embeddings_and_vector_store.ipynb', '08_evaluation.ipynb', '06_retrieval_systems.ipynb', '01_understanding_rag.ipynb', '09_optimization.ipynb', '10_vector_database_production.ipynb', '04_text_preprocessing.ipynb', '02_data_collection.ipynb']
Found src directory in parent folder
Contents of src/: ['advanced', '.DS_Store', 'config.py', 'vector_db', 'optimization', '__init__.py', 'models', '__pycache__', 'retrieval', 'api', 'evaluation', 'data']
Contents of src/data/: ['preprocess_data.py', '__init__.py', '__pycache__', 'collect_data.py']

Testing DataCollector method signatures...
Wikipedia method parameters: ['self', 'max_documents', 'use_real_data']
ArXiv method parameters: ['self', 'max_documents', 'use_real_data']

Testing with sample data...
Sample Wikipedia test: 1 articles
Sample ArXiv test: 1 papers
DataCollector methods are working correctly!

## Understanding Our Data Sources

Let us start by collecting a small sample from each of our two sources, Wikipedia and ArXiv, to explore their characteristics.


In [3]:
# Initialize our data collector
collector = DataCollector()
print("DataCollector initialized successfully!")

# Choose data collection mode
USE_REAL_DATA = True  # Set to False to use sample data, True to fetch real data from APIs

print(f"\nData collection mode: {'Real data from APIs' if USE_REAL_DATA else 'Sample data'}")

# Collect Wikipedia data with error handling
print("\nCollecting Wikipedia data...")
try:
    wiki_data = collector.collect_wikipedia_data(max_documents=5, use_real_data=USE_REAL_DATA)
    print(f"Collected {len(wiki_data)} Wikipedia articles")
except TypeError as e:
    if "unexpected keyword argument" in str(e):
        print("Using fallback method (old API signature)")
        wiki_data = collector.collect_wikipedia_data(max_documents=5)
        print(f"Collected {len(wiki_data)} Wikipedia articles (sample data)")
    else:
        raise
except Exception as e:
    print(f"Error collecting Wikipedia data: {e}")
    print("Falling back to sample data...")
    wiki_data = collector.collect_wikipedia_data(max_documents=5, use_real_data=False)

# Collect ArXiv data with error handling
print("\nCollecting ArXiv data...")
try:
    arxiv_data = collector.collect_arxiv_data(max_documents=5, use_real_data=USE_REAL_DATA)
    print(f"Collected {len(arxiv_data)} ArXiv papers")
except TypeError as e:
    if "unexpected keyword argument" in str(e):
        print("Using fallback method (old API signature)")
        arxiv_data = collector.collect_arxiv_data(max_documents=5)
        print(f"Collected {len(arxiv_data)} ArXiv papers (sample data)")
    else:
        raise
except Exception as e:
    print(f"Error collecting ArXiv data: {e}")
    print("Falling back to sample data...")
    arxiv_data = collector.collect_arxiv_data(max_documents=5, use_real_data=False)

print(f"\nTotal documents collected: {len(wiki_data) + len(arxiv_data)}")

# Display sample data to verify collection
if wiki_data:
    print(f"\nSample Wikipedia article:")
    print(f"  Title: {wiki_data[0]['title']}")
    print(f"  Word count: {wiki_data[0]['word_count']}")
    print(f"  Preview: {wiki_data[0]['text'][:100]}...")
    if 'url' in wiki_data[0]:
        print(f"  URL: {wiki_data[0]['url']}")

if arxiv_data:
    print(f"\nSample ArXiv paper:")
    print(f"  Title: {arxiv_data[0]['title']}")
    print(f"  Word count: {arxiv_data[0]['word_count']}")
    print(f"  Authors: {arxiv_data[0]['authors']}")
    print(f"  Categories: {arxiv_data[0]['categories']}")
    print(f"  Preview: {arxiv_data[0]['abstract'][:100]}...")
    if 'arxiv_id' in arxiv_data[0]:
        print(f"  ArXiv ID: {arxiv_data[0]['arxiv_id']}")


INFO:src.data.collect_data:Data collector initialized. Output directory: /Users/scienceman/Desktop/LLM/data/raw
INFO:src.data.collect_data:Collecting Wikipedia data...
INFO:src.data.collect_data:Fetching real Wikipedia data using Wikipedia API...


DataCollector initialized successfully!

Data collection mode: Real data from APIs

Collecting Wikipedia data...


INFO:src.data.collect_data:Collected: Machine learning
INFO:src.data.collect_data:Collected: Artificial intelligence
INFO:src.data.collect_data:Collected: Deep learning
INFO:src.data.collect_data:Collected: Natural language processing
INFO:src.data.collect_data:Collected: Computer vision
INFO:src.data.collect_data:Successfully collected 5 Wikipedia articles
INFO:src.data.collect_data:Collecting ArXiv data...
INFO:src.data.collect_data:Fetching real ArXiv data using ArXiv API...
INFO:src.data.collect_data:Collected: GC-VLN: Instruction as Graph Constraints for Training-free
  Vision-and-Language Navigation
INFO:src.data.collect_data:Collected: SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and
  Adaptability Across Alzheimer's Prediction Tasks and Datasets


Collected 5 Wikipedia articles

Collecting ArXiv data...


INFO:src.data.collect_data:Collected: SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and
  Adaptability Across Alzheimer's Prediction Tasks and Datasets
INFO:src.data.collect_data:Collected: WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained
  Speech Recognition Transformers
INFO:src.data.collect_data:Successfully collected 4 ArXiv papers


Collected 4 ArXiv papers

Total documents collected: 9

Sample Wikipedia article:
  Title: Machine learning
  Word count: 68
  Preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
  URL: https://en.wikipedia.org/wiki/Machine_learning

Sample ArXiv paper:
  Title: GC-VLN: Instruction as Graph Constraints for Training-free
  Vision-and-Language Navigation
  Word count: 221
  Authors: ['Hang Yin', 'Haoyu Wei', 'Xiuwei Xu', 'Wenxuan Guo', 'Jie Zhou', 'Jiwen Lu']
  Categories: ['cs.RO', 'cs.CV']
  Preview: In this paper, we propose a training-free framework for vision-and-language
navigation (VLN). Existi...
  ArXiv ID: 2509.10454v1
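
The log above shows the SSL-AD paper being collected twice, and only 4 of the 5 requested ArXiv papers came back. Whether `DataCollector` filters duplicates internally is not visible here, so a minimal sketch of de-duplicating by normalized title before chunking might look like the following. `deduplicate_by_title` is a hypothetical helper, not part of the project's code, and the shortened titles are illustrative.

```python
import re
from typing import Any, Dict, List


def deduplicate_by_title(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first document per title, ignoring case and whitespace differences."""
    seen = set()
    unique = []
    for doc in docs:
        # Collapse line breaks and repeated spaces so wrapped titles compare equal
        key = re.sub(r"\s+", " ", doc["title"]).strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


papers = [
    {"title": "SSL-AD: Spatiotemporal Self-Supervised Learning"},
    {"title": "SSL-AD: Spatiotemporal  Self-Supervised\n  Learning"},
    {"title": "WhisTLE: Deeply Supervised, Text-Only Domain Adaptation"},
]
unique_papers = deduplicate_by_title(papers)
print(f"{len(papers)} collected, {len(unique_papers)} unique")
```

In the notebook, the same call could be applied to `arxiv_data` right after collection so that repeated abstracts do not produce repeated chunks.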


## Alternative Data Collection Methods

We've implemented several data collection methods to ensure our system works in different scenarios:

### 1. **Wikipedia API** (Recommended)
- Free and reliable with no authentication required
- Real-time data with rich metadata
- Built-in rate limiting for respectful usage

### 2. **ArXiv API**
- Scientific papers from the ArXiv preprint server
- Recent publications with authors, categories, and abstracts
- Structured data from ArXiv's Atom feed

### 3. **Sample Data Fallback**
- Offline mode for development and testing
- Consistent data for reproducible results
- Fast execution without API delays

### Usage Options:
- Set `USE_REAL_DATA = True` to fetch real data from APIs
- Set `USE_REAL_DATA = False` to use sample data
- The system automatically falls back to sample data if API calls fail


## Exploring Wikipedia Data Structure

Let us examine the structure and characteristics of our Wikipedia articles.


In [4]:
print("Wikipedia Data Structure:")
print("=" * 50)

if wiki_data:
    # Show structure of first article
    first_article = wiki_data[0]
    print(f"Article keys: {list(first_article.keys())}")
    print(f"\nFirst article:")
    print(f"  Title: {first_article['title']}")
    print(f"  Source: {first_article['source']}")
    print(f"  Length: {first_article['length']} characters")
    print(f"  Word count: {first_article['word_count']} words")
    print(f"  Text preview: {first_article['text'][:200]}...")
    
    # Show all articles
    print(f"\nAll Wikipedia articles:")
    for i, article in enumerate(wiki_data):
        print(f"  {i+1}. {article['title']} ({article['word_count']} words)")
    
    # Show statistics
    word_counts = [article['word_count'] for article in wiki_data]
    print(f"\nWikipedia Statistics:")
    print(f"  Total articles: {len(wiki_data)}")
    print(f"  Average words per article: {sum(word_counts) / len(word_counts):.1f}")
    print(f"  Min words: {min(word_counts)}")
    print(f"  Max words: {max(word_counts)}")
else:
    print("No Wikipedia data collected")


Wikipedia Data Structure:
Article keys: ['id', 'title', 'text', 'source', 'length', 'word_count', 'url', 'description']

First article:
  Title: Machine learning
  Source: wikipedia
  Length: 462 characters
  Word count: 68 words
  Text preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus...

All Wikipedia articles:
  1. Machine learning (68 words)
  2. Artificial intelligence (64 words)
  3. Deep learning (64 words)
  4. Natural language processing (44 words)
  5. Computer vision (90 words)

Wikipedia Statistics:
  Total articles: 5
  Average words per article: 66.0
  Min words: 44
  Max words: 90


## Exploring ArXiv Data Structure

Now let us examine the structure and characteristics of our ArXiv papers.


In [5]:
print("ArXiv Data Structure:")
print("=" * 50)

if arxiv_data:
    # Show structure of first paper
    first_paper = arxiv_data[0]
    print(f"Paper keys: {list(first_paper.keys())}")
    print(f"\nFirst paper:")
    print(f"  Title: {first_paper['title']}")
    print(f"  Source: {first_paper['source']}")
    print(f"  Length: {first_paper['length']} characters")
    print(f"  Word count: {first_paper['word_count']} words")
    print(f"  Authors: {first_paper['authors']}")
    print(f"  Categories: {first_paper['categories']}")
    print(f"  Abstract preview: {first_paper['abstract'][:200]}...")
    
    # Show all papers
    print(f"\nAll ArXiv papers:")
    for i, paper in enumerate(arxiv_data):
        print(f"  {i+1}. {paper['title']} ({paper['word_count']} words)")
    
    # Show statistics
    word_counts = [paper['word_count'] for paper in arxiv_data]
    print(f"\nArXiv Statistics:")
    print(f"  Total papers: {len(arxiv_data)}")
    print(f"  Average words per abstract: {sum(word_counts) / len(word_counts):.1f}")
    print(f"  Min words: {min(word_counts)}")
    print(f"  Max words: {max(word_counts)}")
else:
    print("No ArXiv data collected")


ArXiv Data Structure:
Paper keys: ['id', 'title', 'abstract', 'source', 'length', 'word_count', 'authors', 'categories', 'published', 'arxiv_id']

First paper:
  Title: GC-VLN: Instruction as Graph Constraints for Training-free
  Vision-and-Language Navigation
  Source: arxiv
  Length: 1657 characters
  Word count: 221 words
  Authors: ['Hang Yin', 'Haoyu Wei', 'Xiuwei Xu', 'Wenxuan Guo', 'Jie Zhou', 'Jiwen Lu']
  Categories: ['cs.RO', 'cs.CV']
  Abstract preview: In this paper, we propose a training-free framework for vision-and-language
navigation (VLN). Existing zero-shot VLN methods are mainly designed for
discrete environments or involve unsupervised train...

All ArXiv papers:
  1. GC-VLN: Instruction as Graph Constraints for Training-free
  Vision-and-Language Navigation (221 words)
  2. SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and
  Adaptability Across Alzheimer's Prediction Tasks and Datasets (175 words)
  3. SSL-AD: Spatiotemporal Self-Supervised L

## Text Preprocessing and Chunking: The Foundation of RAG Systems

### Why Do We Need Chunking?

Before we dive into the technical details, let's understand why chunking is absolutely essential for RAG (Retrieval-Augmented Generation) systems:

#### 1. **The Context Window Problem**
- **LLM Limitations**: Large Language Models have a limited "context window" - they can only process a certain amount of text at once
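- **Example**: The original GPT-3.5 handled ~4,000 tokens and the original GPT-4 ~8,000-32,000 tokens; newer models accept more, but the window is always finite
- **Reality Check**: A long Wikipedia article can easily exceed 10,000 words (roughly 13,000 tokens), which does not fit in a 4,000- or 8,000-token window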
- **Solution**: Break large documents into smaller, manageable "chunks"
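
To make the context window problem concrete, here is a rough sketch of checking whether a document fits. It assumes about 1.33 tokens per English word, which is a common rule of thumb and not an exact figure; real tokenizers will give different counts. `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.33) -> int:
    """Rough token estimate from the word count (rule of thumb, not a tokenizer)."""
    return int(len(text.split()) * tokens_per_word)


def fits_in_context(text: str, context_window: int = 4000) -> bool:
    """Return True if the estimated token count is within the window."""
    return estimate_tokens(text) <= context_window


long_article = "word " * 10_000  # stand-in for a 10,000-word article
short_summary = "word " * 68     # stand-in for a 68-word summary

print(fits_in_context(long_article))   # the estimate is far above 4,000 tokens
print(fits_in_context(short_summary))  # a short summary fits easily
```

For real budgeting you would count tokens with the tokenizer of the model you plan to call; the estimate here only shows the order of magnitude.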

#### 2. **Precision in Retrieval**
- **The Needle in Haystack Problem**: When you ask "What is machine learning?", you don't want the entire Wikipedia article about AI
- **Targeted Retrieval**: You want the specific section that explains machine learning concepts
- **Better Matches**: Smaller chunks allow for more precise matching between user queries and relevant content

#### 3. **Computational Efficiency**
- **Embedding Costs**: Each chunk needs to be converted to a vector (embedding) for similarity search, and embedding models have their own input length limits
- **Storage Trade-off**: Smaller chunks mean more vectors to store and search, while larger chunks mean fewer but less focused vectors
- **Prompt Size**: Passing a few focused chunks to the LLM is cheaper and faster than passing whole documents

#### 4. **Semantic Coherence**
- **Preserving Meaning**: Good chunking keeps related concepts together
- **Avoiding Fragmentation**: We don't want to split a sentence or paragraph in the middle
- **Context Preservation**: Each chunk should be meaningful on its own

### How Chunking Works: The Technical Deep Dive

Let's explore the different strategies our `TextPreprocessor` uses:

#### 1. **Fixed Chunking (Simple but Effective)**
```python
# How it works:
# - Split text into chunks of N characters (the last chunk may be shorter)
# - No overlap between chunks
# - Simple and predictable
```

**Pros:**
- Simple to implement and understand
- Consistent chunk sizes
- Fast processing
- Predictable behavior

**Cons:**
- Can break sentences mid-way
- May lose important context at boundaries
- Not semantically aware

**When to Use:**
- When you need consistent chunk sizes
- For simple documents with uniform structure
- When processing speed is critical
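
As an illustration of the idea, here is a minimal sketch of fixed-size chunking. It is not the code inside `TextPreprocessor`; the function name `chunk_fixed` is an assumption made for this example, and the 500-character default simply mirrors the default shown for `_chunk_by_fixed_size`.

```python
from typing import List


def chunk_fixed(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; slicing handles the short tail
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "a" * 1200
pieces = chunk_fixed(sample, chunk_size=500)
print([len(p) for p in pieces])  # [500, 500, 200]
```

Note that a document shorter than `chunk_size` comes back as a single chunk, and that nothing stops a boundary from landing in the middle of a word or sentence.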

#### 2. **Semantic Chunking (Smart and Context-Aware)**
```python
# How it works:
# - Analyzes text structure (sentences, paragraphs)
# - Tries to keep related concepts together
# - Uses natural language boundaries
# - May have variable chunk sizes
```

**Pros:**
- Preserves semantic meaning
- Keeps sentences and paragraphs intact
- Better for complex documents
- More natural text boundaries

**Cons:**
- More complex to implement
- Variable chunk sizes
- Slower processing
- May create very small or very large chunks

**When to Use:**
- For complex, technical documents
- When semantic coherence is critical
- For documents with varied structure
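
One simple way to respect natural boundaries is to split on sentence endings and pack whole sentences into chunks. The sketch below does that with a regular expression; it is an assumption about how such a strategy could work, not the actual `_chunk_by_semantic_boundaries` implementation, and `chunk_by_sentences` is a hypothetical name.

```python
import re
from typing import List


def chunk_by_sentences(text: str, max_chunk_size: int = 500) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_size characters.

    A single sentence longer than max_chunk_size is kept whole, so chunk
    sizes can vary.
    """
    # Split after ., ! or ? when followed by whitespace (a rough heuristic)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)  # close the current chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "RAG retrieves documents. It then generates an answer. Chunking helps retrieval."
sentence_chunks = chunk_by_sentences(text, max_chunk_size=55)
for chunk in sentence_chunks:
    print(repr(chunk))
```

The regex split will mis-handle abbreviations such as "e.g." or "Fig. 2"; a dedicated sentence tokenizer would be a more robust choice for technical text.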

#### 3. **Hierarchical Chunking (Multi-Level Organization)**
```python
# How it works:
# - Creates multiple levels of chunks (e.g., sections, paragraphs, sentences)
# - Maintains relationships between different levels
# - Allows for flexible retrieval strategies
```

**Pros:**
- Maintains document structure
- Flexible retrieval options
- Preserves relationships between concepts
- Good for structured documents

**Cons:**
- Most complex to implement
- Requires understanding of document structure
- More storage overhead
- Complex retrieval logic

**When to Use:**
- For highly structured documents (papers, reports)
- When you need multiple levels of detail
- For complex knowledge bases
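
A two-level version of this idea can be sketched as follows: paragraphs form the upper level, sentences the lower level, and each sentence records the index of its parent paragraph. The field names (`level`, `index`, `parent`) and the function name are assumptions for this example and may differ from what `_chunk_hierarchically` produces.

```python
import re
from typing import Any, Dict, List


def chunk_two_levels(text: str) -> List[Dict[str, Any]]:
    """Return paragraph-level and sentence-level chunks linked by a parent index."""
    chunks: List[Dict[str, Any]] = []
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs):
        chunks.append({"level": "paragraph", "index": p_idx, "parent": None, "text": paragraph})
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for s_idx, sentence in enumerate(sentences):
            # Each sentence points back to the paragraph it came from
            chunks.append({"level": "sentence", "index": s_idx, "parent": p_idx, "text": sentence})
    return chunks


doc = "Chunking splits text. It aids retrieval.\n\nEmbeddings map text to vectors."
levels = chunk_two_levels(doc)
for chunk in levels:
    print(chunk["level"], chunk["parent"], "->", chunk["text"])
```

With the parent link in place, a retriever could match on a precise sentence and then return the surrounding paragraph for context.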

### The TextPreprocessor: A Deep Dive into Implementation

Let's examine how our `TextPreprocessor` class works internally:

#### Key Methods and Their Purposes:

1. **`chunk_document(doc, strategy='semantic')`**
   - Main entry point for chunking
   - Takes a document and returns a list of chunks
   - Each chunk contains: text, metadata, source information

2. **`_chunk_by_fixed_size(text, chunk_size=500)`**
   - Implements fixed-size chunking
   - Splits text into equal-sized pieces
   - Handles edge cases (very short documents)

3. **`_chunk_by_semantic_boundaries(text, max_chunk_size=500)`**
   - Implements semantic chunking
   - Uses sentence and paragraph boundaries
   - Tries to preserve meaning

4. **`_chunk_hierarchically(text, max_chunk_size=500)`**
   - Implements hierarchical chunking
   - Creates multiple levels of organization
   - Maintains document structure

#### What Happens During Preprocessing:

1. **Text Cleaning**: Remove extra whitespace, normalize text
2. **Boundary Detection**: Find sentence and paragraph boundaries
3. **Chunk Creation**: Split text according to chosen strategy
4. **Metadata Addition**: Add source, position, and other metadata
5. **Quality Checks**: Ensure chunks meet minimum requirements
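
The first and last steps can be sketched in a few lines. This is a minimal illustration under assumed rules (collapse all whitespace, require at least 50 characters); the actual cleaning rules and thresholds in `TextPreprocessor` are not shown in this notebook, and both function names are hypothetical.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def passes_quality_check(chunk: str, min_chars: int = 50) -> bool:
    """Keep only chunks that reach a minimum length (the threshold is an assumption)."""
    return len(chunk) >= min_chars


# ArXiv abstracts arrive with hard line breaks, as seen in the preview earlier
raw = "In this paper, we propose a training-free framework for vision-and-language\nnavigation (VLN)."
cleaned = clean_text(raw)
print(cleaned)
print(passes_quality_check(cleaned))
```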

### Understanding the Final Data Structure

After preprocessing, each chunk looks like this (the `source` field is either `"wikipedia"` or `"arxiv"`):

```json
{
  "id": "unique_chunk_identifier",
  "text": "The actual text content of the chunk",
  "source": "wikipedia",
  "document_id": "original_document_id",
  "chunk_index": 0,
  "metadata": {
    "title": "Original document title",
    "word_count": 150,
    "char_count": 800,
    "chunk_type": "semantic"
  }
}
```
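
To show how such a record could be assembled, here is a small sketch that builds a dictionary with the same fields. The ID format (`<document_id>_chunk_<index>`), the document ID `wiki_0`, and the helper name `make_chunk_record` are assumptions for illustration only.

```python
import json
from typing import Any, Dict


def make_chunk_record(doc: Dict[str, Any], chunk_text: str, chunk_index: int,
                      chunk_type: str = "semantic") -> Dict[str, Any]:
    """Assemble a chunk dictionary mirroring the structure shown above."""
    return {
        "id": f"{doc['id']}_chunk_{chunk_index}",  # ID format is an assumption
        "text": chunk_text,
        "source": doc["source"],
        "document_id": doc["id"],
        "chunk_index": chunk_index,
        "metadata": {
            "title": doc["title"],
            "word_count": len(chunk_text.split()),
            "char_count": len(chunk_text),
            "chunk_type": chunk_type,
        },
    }


doc = {"id": "wiki_0", "title": "Machine learning", "source": "wikipedia"}
record = make_chunk_record(doc, "Machine learning (ML) is a field of study.", 0)
print(json.dumps(record, indent=2))
```

Keeping `document_id` and `chunk_index` on every record is what later lets a retriever cite the source and reassemble neighbouring chunks.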

### Best Practices for Chunking

1. **Choose the Right Strategy**:
   - Use fixed chunking for simple, uniform documents
   - Use semantic chunking for complex, varied content
   - Use hierarchical chunking for structured documents

2. **Optimal Chunk Size**:
   - Too small: Loses context, creates too many chunks
   - Too large: Exceeds context limits, less precise retrieval
   - Sweet spot: 200-800 characters (varies by use case)

3. **Overlap Considerations**:
   - Some chunking strategies use overlap to preserve context
   - Overlap helps with boundary issues
   - But increases storage and processing costs

4. **Metadata Preservation**:
   - Always keep track of source document
   - Maintain position information
   - Include relevant metadata for filtering
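
The overlap idea from point 3 can be sketched as a variation on fixed-size chunking: each chunk starts `chunk_size - overlap` characters after the previous one, so neighbouring chunks share a few characters. The function name and default values are assumptions; whether `TextPreprocessor` supports overlap is not shown in this notebook.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Fixed-size chunks where each chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the previous chunk has already reached the end of the text
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


text = "".join(str(i % 10) for i in range(25))  # "0123456789012345678901234"
overlapping = chunk_with_overlap(text, chunk_size=10, overlap=3)
print(overlapping)
```

Here every chunk begins with the final three characters of the one before it, which is the storage cost mentioned above: the shared characters are embedded and stored twice.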

Now let us apply these chunking strategies to our collected data and see how they break documents into pieces for retrieval.


## Understanding the Chunking Results

The chunking comparison in the cells below produces nearly identical results for all three strategies because our demo document (462 characters) is shorter than the default chunk size of 500 characters, so each strategy returns a single chunk. Strategy selection matters more with larger documents, where the strategies place chunk boundaries differently.


In [10]:
# Initialize our text preprocessor
preprocessor = TextPreprocessor()
print("TextPreprocessor initialized successfully!")

# Let us use one of our documents for demonstration
if wiki_data:
    demo_doc = wiki_data[0]
    print(f"\nDemo document: {demo_doc['title']}")
    print(f"Original text length: {len(demo_doc['text'])} characters")
    print(f"Original word count: {demo_doc['word_count']} words")
    print(f"\nOriginal text preview:")
    print(demo_doc['text'][:300] + "...")
elif arxiv_data:
    demo_doc = arxiv_data[0]
    print(f"\nDemo document: {demo_doc['title']}")
    print(f"Original text length: {len(demo_doc['abstract'])} characters")
    print(f"Original word count: {demo_doc['word_count']} words")
    print(f"\nOriginal text preview:")
    print(demo_doc['abstract'][:300] + "...")
else:
    print("No data available for demonstration")


INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed


TextPreprocessor initialized successfully!

Demo document: Machine learning
Original text length: 462 characters
Original word count: 68 words

Original text preview:
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Within a subdiscipline in machine learning, advances i...


In [7]:
# Compare different chunking strategies
if wiki_data or arxiv_data:
    # Use Wikipedia data if available, otherwise ArXiv
    if wiki_data:
        demo_doc = wiki_data[0]
        print(f"Using Wikipedia article: {demo_doc['title']}")
    else:
        demo_doc = arxiv_data[0]
        print(f"Using ArXiv paper: {demo_doc['title']}")
    
    print("Comparing Chunking Strategies:")
    print("=" * 50)
    
    strategies = ['fixed', 'semantic', 'hierarchical']
    
    for strategy in strategies:
        print(f"\n{strategy.upper()} Chunking:")
        chunks = preprocessor.chunk_document(demo_doc, strategy=strategy)
        print(f"  Number of chunks: {len(chunks)}")
        
        if chunks:
            # Show first chunk
            first_chunk = chunks[0]
            print(f"  First chunk length: {len(first_chunk['text'])} characters")
            print(f"  First chunk preview: {first_chunk['text'][:150]}...")
            
            # Show chunk statistics
            chunk_lengths = [len(chunk['text']) for chunk in chunks]
            print(f"  Chunk length stats: min={min(chunk_lengths)}, max={max(chunk_lengths)}, avg={np.mean(chunk_lengths):.1f}")
        else:
            print("  No chunks created (document too short)")
else:
    print("No data available for chunking demonstration")


Using Wikipedia article: Machine learning
Comparing Chunking Strategies:

FIXED Chunking:
  Number of chunks: 1
  First chunk length: 462 characters
  First chunk preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...
  Chunk length stats: min=462, max=462, avg=462.0

SEMANTIC Chunking:
  Number of chunks: 1
  First chunk length: 460 characters
  First chunk preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...
  Chunk length stats: min=460, max=460, avg=460.0

HIERARCHICAL Chunking:
  Number of chunks: 1
  First chunk length: 462 characters
  First chunk preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...
  Chunk length stats: min=462, max=462, avg=462.0


## Processing All Collected Data

Now let us process all our collected data using our preferred chunking strategy and save it for use in the next notebook.


In [11]:
# Process all documents
print("Processing all collected documents...")

all_chunks = []

# Process Wikipedia articles
print(f"\nProcessing {len(wiki_data)} Wikipedia articles...")
for article in tqdm(wiki_data, desc="Wikipedia"):
    chunks = preprocessor.chunk_document(article, strategy='semantic')
    all_chunks.extend(chunks)

# Process ArXiv papers
print(f"\nProcessing {len(arxiv_data)} ArXiv papers...")
for paper in tqdm(arxiv_data, desc="ArXiv"):
    # Convert ArXiv format to expected format
    doc = {
        'id': paper['id'],
        'title': paper['title'],
        'text': paper['abstract'],
        'source': paper['source']
    }
    chunks = preprocessor.chunk_document(doc, strategy='semantic')
    all_chunks.extend(chunks)

print(f"\nTotal chunks created: {len(all_chunks)}")

# Show chunk statistics
if all_chunks:
    chunk_lengths = [len(chunk['text']) for chunk in all_chunks]
    print(f"Chunk length statistics:")
    print(f"  Min: {min(chunk_lengths)} characters")
    print(f"  Max: {max(chunk_lengths)} characters")
    print(f"  Average: {np.mean(chunk_lengths):.1f} characters")
    print(f"  Median: {np.median(chunk_lengths):.1f} characters")
    
    # Show sources
    sources = [chunk['source'] for chunk in all_chunks]
    source_counts = Counter(sources)
    print(f"\nChunks by source:")
    for source, count in source_counts.items():
        print(f"  {source}: {count} chunks")
    
    # Show sample chunks
    print(f"\nSample chunks:")
    for i, chunk in enumerate(all_chunks[:3]):  # Show first 3 chunks
        print(f"  Chunk {i+1} ({chunk['source']}): {chunk['text'][:100]}...")
else:
    print("No chunks were created. This might indicate an issue with the chunking process.")


Processing all collected documents...

Processing 5 Wikipedia articles...


Wikipedia: 100%|| 5/5 [00:00<00:00, 6545.42it/s]



Processing 4 ArXiv papers...


ArXiv: 100%|| 4/4 [00:00<00:00, 3268.50it/s]


Total chunks created: 18
Chunk length statistics:
  Min: 189 characters
  Max: 494 characters
  Average: 415.5 characters
  Median: 439.0 characters

Chunks by source:
  wikipedia: 6 chunks
  arxiv: 12 chunks

Sample chunks:
  Chunk 1 (wikipedia): Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
  Chunk 2 (wikipedia): Artificial intelligence (AI) is the capability of computational systems to perform tasks typically a...
  Chunk 3 (wikipedia): In machine learning, deep learning focuses on utilizing multilayered neural networks to perform task...





In [13]:
# Save processed data
print("Saving processed data...")

# Save chunks
chunks_file = DATA_DIR / "processed" / "all_chunks.json"
chunks_file.parent.mkdir(parents=True, exist_ok=True)

with open(chunks_file, 'w', encoding='utf-8') as f:
    json.dump(all_chunks, f, indent=2)

print(f"Saved {len(all_chunks)} chunks to: {chunks_file}")

print("\nData processing complete!")


Saving processed data...
Saved 18 chunks to: /Users/scienceman/Desktop/LLM/data/processed/all_chunks.json

Data processing complete!
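
Before moving on, it can be worth confirming that the saved file reloads cleanly. The sketch below runs a round trip on a small stand-in list in a temporary directory so it is self-contained; the sample records and their IDs are made up for illustration. In the notebook you would point it at `chunks_file` and compare against `all_chunks`.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for all_chunks; the IDs and texts are illustrative only
sample_chunks = [
    {"id": "doc_0_chunk_0", "text": "Machine learning (ML) is a field of study.", "source": "wikipedia"},
    {"id": "doc_1_chunk_0", "text": "In this paper, we propose a framework.", "source": "arxiv"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    chunks_path = Path(tmp_dir) / "all_chunks.json"
    with open(chunks_path, "w", encoding="utf-8") as f:
        json.dump(sample_chunks, f, indent=2)

    # Reload and confirm nothing was lost or altered
    with open(chunks_path, encoding="utf-8") as f:
        reloaded = json.load(f)

required_keys = {"id", "text", "source"}
round_trip_ok = reloaded == sample_chunks and all(
    required_keys <= chunk.keys() for chunk in reloaded
)
print(f"Round trip OK: {round_trip_ok} ({len(reloaded)} chunks)")
```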


## Understanding Our Final Processed Data

Let's examine what we've created and understand the transformation from raw documents to RAG-ready chunks:

### The Data Transformation Journey

```
Raw Documents → Text Preprocessing → Chunks → Vector Embeddings → Vector Database
     ↓              ↓                ↓           ↓                ↓
  Wikipedia      Cleaning &        Semantic    Numerical        Searchable
  Articles       Chunking          Chunks      Vectors          Database
  ArXiv Papers
```

### What Our Final Data Looks Like

After processing, we have **18 chunks** from our 9 original documents:

- **Wikipedia**: 6 chunks (from 5 articles)
- **ArXiv**: 12 chunks (from 4 papers)
- **Chunk Size Range**: 189-494 characters
- **Average Size**: 415.5 characters

### Why ArXiv Produced More Chunks

ArXiv abstracts are typically longer and more complex than Wikipedia article previews, so semantic chunking created more meaningful segments from the technical content.
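
As a back-of-the-envelope check, the lengths printed earlier (462 characters for the first Wikipedia summary, 1657 for the first ArXiv abstract) already predict this difference. The sketch assumes a 500-character maximum chunk size and gives only a lower bound, since sentence-aware chunking can produce more chunks than a pure character count suggests; `min_chunks` is a hypothetical helper.

```python
import math


def min_chunks(char_count: int, max_chunk_size: int = 500) -> int:
    """Lower bound on the number of chunks needed for a text of char_count characters."""
    return max(1, math.ceil(char_count / max_chunk_size))


print(min_chunks(462))   # 1: the Wikipedia summary fits in a single chunk
print(min_chunks(1657))  # 4: the ArXiv abstract needs at least four
```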

### Understanding the Chunk Structure

Each chunk contains:
- **Unique ID**: For tracking and retrieval
- **Text Content**: The actual chunk text
- **Source Information**: Which document it came from
- **Metadata**: Word count, character count, chunk type
- **Position Data**: Where it appears in the original document

### Why This Structure Matters for RAG

1. **Retrieval Efficiency**: Each chunk can be independently searched
2. **Context Preservation**: Metadata helps maintain document context
3. **Source Tracking**: We know where each piece of information came from
4. **Scalability**: Because every chunk is a self-contained record, the same structure can be reused unchanged as the corpus grows

## Summary

In this notebook, we have learned:

1. **Data Collection**: How to collect data from different sources using our custom DataCollector
2. **Data Exploration**: How to analyze the structure and characteristics of our collected data
3. **Text Preprocessing**: How to chunk documents using different strategies (fixed, semantic, hierarchical)
4. **Data Processing**: How to process and save data for use in subsequent notebooks
5. **Chunking Deep Dive**: Understanding why chunking is essential and how different strategies work

### Key Takeaways:

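- **Wikipedia articles** are long and comprehensive in full, but the summaries we collected here are short (44-90 words)
- **ArXiv abstracts** are technical and, in our sample, longer than the Wikipedia summaries (the first two are 221 and 175 words)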
- **Semantic chunking** works well for preserving meaning while creating manageable pieces
- **Data preprocessing** is crucial for effective RAG systems
- **Chunking strategy** depends on document type and use case

### Next Steps:

In the next notebook, we will learn how to convert these text chunks into embeddings and build vector stores for efficient similarity search.
