# Data Collection and Exploration

In this notebook, we will collect real data from Wikipedia and ArXiv to build our RAG system. These are the documents the system will later need to index, understand, and retrieve from.

## Learning Objectives
By the end of this notebook, you will:
1. Understand how to collect data using our custom DataCollector
2. Explore the structure and characteristics of different data sources
3. Learn about data quality and filtering strategies
4. Get hands-on experience with real text data
5. Understand different chunking strategies for text preprocessing


## Setup and Imports

First, let us import the libraries we will need and set up our environment.
 

In [1]:
# Standard library and third-party imports
import json
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import numpy as np
from tqdm import tqdm
import os
import sys

# Add project root to path - multiple approaches for reliability
current_dir = os.getcwd()
project_root = os.path.dirname(current_dir) if current_dir.endswith('notebooks') else current_dir

# Add both current directory and project root to path
sys.path.insert(0, project_root)
sys.path.insert(0, current_dir)
sys.path.insert(0, '.')

print(f"Current directory: {current_dir}")
print(f"Project root: {project_root}")
print(f"Python path: {sys.path[:3]}")

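# importlib is imported for convenience; nothing is reloaded automatically here.
# After editing a module in src/, call importlib.reload(<module>) to pick up the changes.
import importlib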

# Import our custom modules with error handling
try:
    from src.config import DATA_CONFIG, DATA_DIR
    from src.data.collect_data import DataCollector
    from src.data.preprocess_data import TextPreprocessor
    print("Successfully imported from src module")
except ImportError as e:
    print(f"Import error: {e}")
    print("Trying alternative import methods...")
    
    # Try importing directly from the file
    try:
        import importlib.util
        
        # Import config
        config_path = os.path.join(project_root, 'src', 'config.py')
        spec = importlib.util.spec_from_file_location("config", config_path)
        config_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(config_module)
        DATA_CONFIG = config_module.DATA_CONFIG
        DATA_DIR = config_module.DATA_DIR
        
        # Import collect_data
        collect_path = os.path.join(project_root, 'src', 'data', 'collect_data.py')
        spec = importlib.util.spec_from_file_location("collect_data", collect_path)
        collect_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(collect_module)
        DataCollector = collect_module.DataCollector
        
        # Import preprocess_data
        preprocess_path = os.path.join(project_root, 'src', 'data', 'preprocess_data.py')
        spec = importlib.util.spec_from_file_location("preprocess_data", preprocess_path)
        preprocess_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(preprocess_module)
        TextPreprocessor = preprocess_module.TextPreprocessor
        
        print("Successfully imported using direct file imports")
        
    except Exception as e2:
        print(f"Direct import also failed: {e2}")
        print("Please check that you're running this from the correct directory")
        raise e2

print("Libraries imported successfully!")
print(f"Data directory: {DATA_DIR}")

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")


Current directory: /Users/scienceman/Desktop/LLM/notebooks
Project root: /Users/scienceman/Desktop/LLM
Python path: ['.', '/Users/scienceman/Desktop/LLM/notebooks', '/Users/scienceman/Desktop/LLM']
Successfully imported from src module
Libraries imported successfully!
Data directory: /Users/scienceman/Desktop/LLM/data
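
The path setup above depends on the working directory name ending in `notebooks`. A sturdier variation is sketched below, assuming the project root is the nearest ancestor directory that contains a `src/` folder. The helper name `find_project_root` and the `marker` parameter are hypothetical and not part of the project code.

```python
import sys
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no ancestor contains the marker directory


root = find_project_root(Path.cwd())
if root is not None and str(root) not in sys.path:
    # Put the project root first so `from src...` imports resolve
    sys.path.insert(0, str(root))
print(f"Project root: {root}")
```

This works the same way whether the notebook is started from the project root, from `notebooks/`, or from a deeper subfolder.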


In [2]:
# Diagnostic and Testing Cell
print("Running diagnostics...")

# Check current working directory and file structure
import os
print(f"Current working directory: {os.getcwd()}")
print(f"Contents of current directory: {os.listdir('.')}")

# Check if src directory exists
if os.path.exists('../src'):
    print("Found src directory in parent folder")
    print(f"Contents of src/: {os.listdir('../src')}")
    if os.path.exists('../src/data'):
        print(f"Contents of src/data/: {os.listdir('../src/data')}")
else:
    print("src directory not found in parent folder")

# Test the DataCollector method signatures
print("\nTesting DataCollector method signatures...")

try:
    import inspect
    
    # Check Wikipedia method
    wiki_sig = inspect.signature(DataCollector.collect_wikipedia_data)
    print(f"Wikipedia method parameters: {list(wiki_sig.parameters.keys())}")
    
    # Check ArXiv method  
    arxiv_sig = inspect.signature(DataCollector.collect_arxiv_data)
    print(f"ArXiv method parameters: {list(arxiv_sig.parameters.keys())}")
    
    # Test with sample data (should work)
    print("\nTesting with sample data...")
    test_collector = DataCollector()
    test_wiki = test_collector.collect_wikipedia_data(max_documents=1, use_real_data=False)
    print(f"Sample Wikipedia test: {len(test_wiki)} articles")
    
    test_arxiv = test_collector.collect_arxiv_data(max_documents=1, use_real_data=False)
    print(f"Sample ArXiv test: {len(test_arxiv)} papers")
    
    print("DataCollector methods are working correctly!")
    
except Exception as e:
    print(f"Error during testing: {e}")
    print("This might indicate an import or configuration issue.")


INFO:src.data.collect_data:Data collector initialized. Output directory: /Users/scienceman/Desktop/LLM/data/raw
INFO:src.data.collect_data:Collecting Wikipedia data...
INFO:src.data.collect_data:Using sample Wikipedia data...
INFO:src.data.collect_data:Collecting ArXiv data...
INFO:src.data.collect_data:Using sample ArXiv data...


Running diagnostics...
Current working directory: /Users/scienceman/Desktop/LLM/notebooks
Contents of current directory: ['03_embeddings_and_vector_store.ipynb', '01_understanding_rag.ipynb', '02_data_collection.ipynb', 'test.ipynb']
Found src directory in parent folder
Contents of src/: ['config.py', '__init__.py', 'models', '__pycache__', 'retrieval', 'evaluation', 'data']
Contents of src/data/: ['preprocess_data.py', '__init__.py', '__pycache__', 'collect_data.py']

Testing DataCollector method signatures...
Wikipedia method parameters: ['self', 'max_documents', 'use_real_data']
ArXiv method parameters: ['self', 'max_documents', 'use_real_data']

Testing with sample data...
Sample Wikipedia test: 1 articles
Sample ArXiv test: 1 papers
DataCollector methods are working correctly!


## Understanding Our Data Sources

Before we start collecting data, let us understand what we are working with:

- **Wikipedia**: Encyclopedia articles with structured content
- **ArXiv**: Scientific paper abstracts with technical content

Let us start by collecting a small sample of each to explore their characteristics.


In [3]:
# Initialize our data collector
collector = DataCollector()
print("DataCollector initialized successfully!")

# Choose data collection mode
USE_REAL_DATA = True  # Set to False to use sample data, True to fetch real data from APIs

print(f"\nData collection mode: {'Real data from APIs' if USE_REAL_DATA else 'Sample data'}")

# Collect Wikipedia data with error handling
print("\nCollecting Wikipedia data...")
try:
    wiki_data = collector.collect_wikipedia_data(max_documents=5, use_real_data=USE_REAL_DATA)
    print(f"Collected {len(wiki_data)} Wikipedia articles")
except TypeError as e:
    if "unexpected keyword argument" in str(e):
        print("Using fallback method (old API signature)")
        wiki_data = collector.collect_wikipedia_data(max_documents=5)
        print(f"Collected {len(wiki_data)} Wikipedia articles (sample data)")
    else:
        raise e
except Exception as e:
    print(f"Error collecting Wikipedia data: {e}")
    print("Falling back to sample data...")
    wiki_data = collector.collect_wikipedia_data(max_documents=5, use_real_data=False)

# Collect ArXiv data with error handling
print("\nCollecting ArXiv data...")
try:
    arxiv_data = collector.collect_arxiv_data(max_documents=5, use_real_data=USE_REAL_DATA)
    print(f"Collected {len(arxiv_data)} ArXiv papers")
except TypeError as e:
    if "unexpected keyword argument" in str(e):
        print("Using fallback method (old API signature)")
        arxiv_data = collector.collect_arxiv_data(max_documents=5)
        print(f"Collected {len(arxiv_data)} ArXiv papers (sample data)")
    else:
        raise e
except Exception as e:
    print(f"Error collecting ArXiv data: {e}")
    print("Falling back to sample data...")
    arxiv_data = collector.collect_arxiv_data(max_documents=5, use_real_data=False)

print(f"\nTotal documents collected: {len(wiki_data) + len(arxiv_data)}")

# Display sample data to verify collection
if wiki_data:
    print(f"\nSample Wikipedia article:")
    print(f"  Title: {wiki_data[0]['title']}")
    print(f"  Word count: {wiki_data[0]['word_count']}")
    print(f"  Preview: {wiki_data[0]['text'][:100]}...")
    if 'url' in wiki_data[0]:
        print(f"  URL: {wiki_data[0]['url']}")

if arxiv_data:
    print(f"\nSample ArXiv paper:")
    print(f"  Title: {arxiv_data[0]['title']}")
    print(f"  Word count: {arxiv_data[0]['word_count']}")
    print(f"  Authors: {arxiv_data[0]['authors']}")
    print(f"  Categories: {arxiv_data[0]['categories']}")
    print(f"  Preview: {arxiv_data[0]['abstract'][:100]}...")
    if 'arxiv_id' in arxiv_data[0]:
        print(f"  ArXiv ID: {arxiv_data[0]['arxiv_id']}")


INFO:src.data.collect_data:Data collector initialized. Output directory: /Users/scienceman/Desktop/LLM/data/raw
INFO:src.data.collect_data:Collecting Wikipedia data...
INFO:src.data.collect_data:Fetching real Wikipedia data using Wikipedia API...


DataCollector initialized successfully!

Data collection mode: Real data from APIs

Collecting Wikipedia data...


INFO:src.data.collect_data:Collected: Machine learning
INFO:src.data.collect_data:Collected: Artificial intelligence
INFO:src.data.collect_data:Collected: Deep learning
INFO:src.data.collect_data:Collected: Natural language processing
INFO:src.data.collect_data:Collected: Computer vision
INFO:src.data.collect_data:Successfully collected 5 Wikipedia articles
INFO:src.data.collect_data:Collecting ArXiv data...
INFO:src.data.collect_data:Fetching real ArXiv data using ArXiv API...


Collected 5 Wikipedia articles

Collecting ArXiv data...


INFO:src.data.collect_data:Collected: FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning
  Dataset and Comprehensive Benchmark
INFO:src.data.collect_data:Collected: ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable
  Orthogonal Butterfly Transforms
INFO:src.data.collect_data:Collected: ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable
  Orthogonal Butterfly Transforms
INFO:src.data.collect_data:Collected: SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
INFO:src.data.collect_data:Successfully collected 4 ArXiv papers


Collected 4 ArXiv papers

Total documents collected: 9

Sample Wikipedia article:
  Title: Machine learning
  Word count: 68
  Preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
  URL: https://en.wikipedia.org/wiki/Machine_learning

Sample ArXiv paper:
  Title: FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning
  Dataset and Comprehensive Benchmark
  Word count: 203
  Authors: ['Rongyao Fang', 'Aldrich Yu', 'Chengqi Duan', 'Linjiang Huang', 'Shuai Bai', 'Yuxuan Cai', 'Kun Wang', 'Si Liu', 'Xihui Liu', 'Hongsheng Li']
  Categories: ['cs.CV', 'cs.CL']
  Preview: The advancement of open-source text-to-image (T2I) models has been hindered
by the absence of large-...
  ArXiv ID: 2509.09680v1
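
The log above lists the ButterflyQuant title twice, which suggests that the collected ArXiv set can contain repeated entries. A minimal sketch of one way to drop such repeats follows, assuming that two records with the same whitespace-normalized title are the same paper. The helper names are hypothetical and are not part of `DataCollector`.

```python
from typing import Any, Dict, List


def normalize_title(title: str) -> str:
    """Collapse line breaks and repeated spaces, and ignore case."""
    return " ".join(title.split()).casefold()


def dedupe_by_title(papers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first record for each normalized title, preserving order."""
    seen = set()
    unique = []
    for paper in papers:
        key = normalize_title(paper["title"])
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique


# Titles copied from the log above; the records are trimmed to the one field used here.
papers = [
    {"title": "ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable\n  Orthogonal Butterfly Transforms"},
    {"title": "ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable\n  Orthogonal Butterfly Transforms"},
    {"title": "SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning"},
]
print(len(dedupe_by_title(papers)))  # 2
```

Deduplicating before chunking avoids storing the same abstract twice, which would otherwise return near-identical chunks for a single query.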


## Alternative Data Collection Methods

Because the Hugging Face `datasets` library no longer supports `trust_remote_code`, we have implemented several alternative collection methods:

### 1. **Wikipedia API** (Recommended)
- **Free and reliable**: No authentication required
- **Real-time data**: Always up-to-date articles
- **Rich metadata**: Includes URLs, descriptions, and full content
- **Rate limiting**: Built-in delays to be respectful to Wikipedia
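
The rate-limiting idea can be sketched as follows. This is an illustration only and not the code inside `DataCollector`: it assumes the Wikipedia REST page-summary endpoint is the data source, and the helper names `summary_url` and `fetch_politely` are hypothetical. The HTTP call is passed in as `fetch_fn` so the pacing logic can be tried without network access.

```python
import time
from typing import Any, Callable, List
from urllib.parse import quote

# Assumed endpoint: the Wikipedia REST API page summary route
SUMMARY_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"


def summary_url(title: str) -> str:
    """Build the summary URL for a page title (spaces become underscores)."""
    return SUMMARY_URL.format(title=quote(title.replace(" ", "_"), safe=""))


def fetch_politely(
    titles: List[str],
    fetch_fn: Callable[[str], Any],
    delay_seconds: float = 1.0,
) -> List[Any]:
    """Call fetch_fn once per title, sleeping between consecutive requests."""
    results = []
    for i, title in enumerate(titles):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch_fn(summary_url(title)))
    return results


# Offline demonstration: the "fetch" simply echoes the URL it was given.
urls = fetch_politely(["Machine learning", "Deep learning"], lambda url: url, delay_seconds=0)
print(urls[0])
```

In a real run, `fetch_fn` could be something like `lambda url: requests.get(url, timeout=10).json()`, ideally with a descriptive `User-Agent` header.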

### 2. **ArXiv API**
- **Scientific papers**: Real research papers from ArXiv
- **Recent publications**: Can fetch latest papers
- **Rich metadata**: Authors, categories, publication dates
- **XML format**: Parses ArXiv's Atom feed
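
To make the Atom parsing step concrete, here is a small sketch using only the standard library. It is not the project's parser: the function name `parse_arxiv_feed` is hypothetical, and `SAMPLE_FEED` is a made-up entry shaped like an ArXiv Atom response, so that the example runs offline.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Invented entry for illustration; real feeds contain more fields.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2025-01-01T00:00:00Z</published>
    <title>An Example
  Title</title>
    <summary>  A made-up abstract
used only for illustration. </summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <category term="cs.CL"/>
    <category term="cs.LG"/>
  </entry>
</feed>"""


def clean(text: str) -> str:
    """Collapse the line wrapping that appears inside Atom text fields."""
    return " ".join(text.split())


def parse_arxiv_feed(xml_text: str) -> List[Dict[str, Any]]:
    """Turn each Atom <entry> into a flat dictionary."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        abstract = clean(entry.findtext("atom:summary", default="", namespaces=ATOM))
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM)
        papers.append({
            "arxiv_id": entry_id.rsplit("/", 1)[-1],
            "title": clean(entry.findtext("atom:title", default="", namespaces=ATOM)),
            "abstract": abstract,
            "authors": [a.findtext("atom:name", default="", namespaces=ATOM)
                        for a in entry.findall("atom:author", ATOM)],
            "categories": [c.get("term") for c in entry.findall("atom:category", ATOM)],
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
            "word_count": len(abstract.split()),
        })
    return papers


print(parse_arxiv_feed(SAMPLE_FEED)[0]["title"])  # An Example Title
```

Collapsing whitespace in `clean` also removes the line breaks that show up in the middle of the paper titles in the output above.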

### 3. **Sample Data Fallback**
- **Offline mode**: Works without internet connection
- **Consistent data**: Same data for reproducible results
- **Fast execution**: No API delays

### Usage Options:
- Set `USE_REAL_DATA = True` to fetch real data from APIs
- Set `USE_REAL_DATA = False` to use sample data
- The system automatically falls back to sample data if API calls fail


## Exploring Wikipedia Data Structure

Let us examine the structure and characteristics of our Wikipedia articles.


In [4]:
print("Wikipedia Data Structure:")
print("=" * 50)

if wiki_data:
    # Show structure of first article
    first_article = wiki_data[0]
    print(f"Article keys: {list(first_article.keys())}")
    print(f"\nFirst article:")
    print(f"  Title: {first_article['title']}")
    print(f"  Source: {first_article['source']}")
    print(f"  Length: {first_article['length']} characters")
    print(f"  Word count: {first_article['word_count']} words")
    print(f"  Text preview: {first_article['text'][:200]}...")
    
    # Show all articles
    print(f"\nAll Wikipedia articles:")
    for i, article in enumerate(wiki_data):
        print(f"  {i+1}. {article['title']} ({article['word_count']} words)")
    
    # Show statistics
    word_counts = [article['word_count'] for article in wiki_data]
    print(f"\nWikipedia Statistics:")
    print(f"  Total articles: {len(wiki_data)}")
    print(f"  Average words per article: {sum(word_counts) / len(word_counts):.1f}")
    print(f"  Min words: {min(word_counts)}")
    print(f"  Max words: {max(word_counts)}")
else:
    print("No Wikipedia data collected")


Wikipedia Data Structure:
Article keys: ['id', 'title', 'text', 'source', 'length', 'word_count', 'url', 'description']

First article:
  Title: Machine learning
  Source: wikipedia
  Length: 462 characters
  Word count: 68 words
  Text preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus...

All Wikipedia articles:
  1. Machine learning (68 words)
  2. Artificial intelligence (64 words)
  3. Deep learning (64 words)
  4. Natural language processing (44 words)
  5. Computer vision (90 words)

Wikipedia Statistics:
  Total articles: 5
  Average words per article: 66.0
  Min words: 44
  Max words: 90
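
The Wikipedia and ArXiv exploration cells compute the same statistics with duplicated code. One possible refactoring is a shared helper, sketched below; the name `summarize_word_counts` is hypothetical. The input values are the five Wikipedia word counts printed above.

```python
from statistics import median
from typing import Dict, List, Union

Number = Union[int, float]


def summarize_word_counts(word_counts: List[int]) -> Dict[str, Number]:
    """Return count, mean, min, max, and median for a list of word counts."""
    if not word_counts:
        raise ValueError("word_counts must not be empty")
    return {
        "count": len(word_counts),
        "mean": sum(word_counts) / len(word_counts),
        "min": min(word_counts),
        "max": max(word_counts),
        "median": median(word_counts),
    }


wiki_word_counts = [68, 64, 64, 44, 90]  # from the article list above
print(summarize_word_counts(wiki_word_counts))
```

Calling the same helper on `[paper['word_count'] for paper in arxiv_data]` gives directly comparable numbers for both sources.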


## Exploring ArXiv Data Structure

Now let us examine the structure and characteristics of our ArXiv papers.


In [5]:
print("ArXiv Data Structure:")
print("=" * 50)

if arxiv_data:
    # Show structure of first paper
    first_paper = arxiv_data[0]
    print(f"Paper keys: {list(first_paper.keys())}")
    print(f"\nFirst paper:")
    print(f"  Title: {first_paper['title']}")
    print(f"  Source: {first_paper['source']}")
    print(f"  Length: {first_paper['length']} characters")
    print(f"  Word count: {first_paper['word_count']} words")
    print(f"  Authors: {first_paper['authors']}")
    print(f"  Categories: {first_paper['categories']}")
    print(f"  Abstract preview: {first_paper['abstract'][:200]}...")
    
    # Show all papers
    print(f"\nAll ArXiv papers:")
    for i, paper in enumerate(arxiv_data):
        print(f"  {i+1}. {paper['title']} ({paper['word_count']} words)")
    
    # Show statistics
    word_counts = [paper['word_count'] for paper in arxiv_data]
    print(f"\nArXiv Statistics:")
    print(f"  Total papers: {len(arxiv_data)}")
    print(f"  Average words per abstract: {sum(word_counts) / len(word_counts):.1f}")
    print(f"  Min words: {min(word_counts)}")
    print(f"  Max words: {max(word_counts)}")
else:
    print("No ArXiv data collected")


ArXiv Data Structure:
Paper keys: ['id', 'title', 'abstract', 'source', 'length', 'word_count', 'authors', 'categories', 'published', 'arxiv_id']

First paper:
  Title: FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning
  Dataset and Comprehensive Benchmark
  Source: arxiv
  Length: 1595 characters
  Word count: 203 words
  Authors: ['Rongyao Fang', 'Aldrich Yu', 'Chengqi Duan', 'Linjiang Huang', 'Shuai Bai', 'Yuxuan Cai', 'Kun Wang', 'Si Liu', 'Xihui Liu', 'Hongsheng Li']
  Categories: ['cs.CV', 'cs.CL']
  Abstract preview: The advancement of open-source text-to-image (T2I) models has been hindered
by the absence of large-scale, reasoning-focused datasets and comprehensive
evaluation benchmarks, resulting in a performanc...

All ArXiv papers:
  1. FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning
  Dataset and Comprehensive Benchmark (203 words)
  2. ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable
  Orthogonal Butterfly Transform

## Text Preprocessing and Chunking

Now let us learn about different chunking strategies for preprocessing our text data. Chunking is crucial for RAG systems as it determines how we break down large documents into manageable pieces for retrieval.
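
As a point of reference before using `TextPreprocessor`, the simplest strategy, fixed-size chunking with overlap, can be sketched in a few lines. This is a generic illustration and not the implementation inside `TextPreprocessor`, whose internals are not shown in this notebook. The function name and the default sizes are assumptions.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut by a boundary is still fully present in at least one chunk when the overlap is long enough.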


In [6]:
# Initialize our text preprocessor
preprocessor = TextPreprocessor()
print("TextPreprocessor initialized successfully!")

# Let us use one of our documents for demonstration
if wiki_data:
    demo_doc = wiki_data[0]
    print(f"\nDemo document: {demo_doc['title']}")
    print(f"Original text length: {len(demo_doc['text'])} characters")
    print(f"Original word count: {demo_doc['word_count']} words")
    print(f"\nOriginal text preview:")
    print(demo_doc['text'][:300] + "...")
elif arxiv_data:
    demo_doc = arxiv_data[0]
    print(f"\nDemo document: {demo_doc['title']}")
    print(f"Original text length: {len(demo_doc['abstract'])} characters")
    print(f"Original word count: {demo_doc['word_count']} words")
    print(f"\nOriginal text preview:")
    print(demo_doc['abstract'][:300] + "...")
else:
    print("No data available for demonstration")


INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed


TextPreprocessor initialized successfully!

Demo document: Machine learning
Original text length: 462 characters
Original word count: 68 words

Original text preview:
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Within a subdiscipline in machine learning, advances i...


In [7]:
# Compare different chunking strategies
if wiki_data or arxiv_data:
    # Use Wikipedia data if available, otherwise ArXiv
    if wiki_data:
        demo_doc = wiki_data[0]
        print(f"Using Wikipedia article: {demo_doc['title']}")
    else:
        demo_doc = arxiv_data[0]
        print(f"Using ArXiv paper: {demo_doc['title']}")
    
    print("Comparing Chunking Strategies:")
    print("=" * 50)
    
    strategies = ['fixed', 'semantic', 'hierarchical']
    
    for strategy in strategies:
        print(f"\n{strategy.upper()} Chunking:")
        chunks = preprocessor.chunk_document(demo_doc, strategy=strategy)
        print(f"  Number of chunks: {len(chunks)}")
        
        if chunks:
            # Show first chunk
            first_chunk = chunks[0]
            print(f"  First chunk length: {len(first_chunk['text'])} characters")
            print(f"  First chunk preview: {first_chunk['text'][:150]}...")
            
            # Show chunk statistics
            chunk_lengths = [len(chunk['text']) for chunk in chunks]
            print(f"  Chunk length stats: min={min(chunk_lengths)}, max={max(chunk_lengths)}, avg={np.mean(chunk_lengths):.1f}")
        else:
            print("  No chunks created (document too short)")
else:
    print("No data available for chunking demonstration")


Using Wikipedia article: Machine learning
Comparing Chunking Strategies:

FIXED Chunking:
  Number of chunks: 1
  First chunk length: 462 characters
  First chunk preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...
  Chunk length stats: min=462, max=462, avg=462.0

SEMANTIC Chunking:
  Number of chunks: 1
  First chunk length: 460 characters
  First chunk preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...
  Chunk length stats: min=460, max=460, avg=460.0

HIERARCHICAL Chunking:
  Number of chunks: 1
  First chunk length: 462 characters
  First chunk preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...
  Chunk length stats: min=462, max=462, avg=462.0
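
All three strategies return a single chunk here, most likely because the 462-character demo text is shorter than the preprocessor's chunk size (an assumption, since the configured size is not printed). The strategies only diverge on longer input. To illustrate what a boundary-aware strategy does differently from fixed windows, here is a sketch that groups whole sentences up to a character budget. It is a simplified stand-in and not the project's `semantic` strategy; the function name and the regex-based sentence splitter are assumptions.

```python
import re
from typing import List


def sentence_chunks(text: str, max_chars: int = 300) -> List[str]:
    """Group consecutive sentences into chunks of at most max_chars where possible."""
    # Naive splitter: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(sentence_chunks("One two. Three four. Five six.", max_chars=20))
# ['One two. Three four.', 'Five six.']
```

A single sentence longer than `max_chars` is kept whole in this sketch, so chunks can exceed the budget; a production splitter would need a fallback for that case.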


## Processing All Collected Data

Now let us process all our collected data using our preferred chunking strategy and save it for use in the next notebook.
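
Wikipedia records store their body under `text`, while ArXiv records store it under `abstract`, so the two sources need a common shape before chunking. A small helper for that conversion is sketched below. The name `to_document` is hypothetical, and the example records are invented; only the key names come from the outputs above.

```python
from typing import Any, Dict


def to_document(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map a Wikipedia or ArXiv record onto the keys id, title, text, source."""
    body = record.get("text") or record.get("abstract")
    if not body:
        raise ValueError(f"record {record.get('id')!r} has no 'text' or 'abstract'")
    return {
        "id": record["id"],
        "title": record["title"],
        "text": body,
        "source": record["source"],
    }


# Invented records that mimic the two key layouts
wiki_record = {"id": "example-1", "title": "Example", "text": "Body text.", "source": "wikipedia"}
arxiv_record = {"id": "example-2", "title": "Paper", "abstract": "Abstract text.", "source": "arxiv"}

print(to_document(wiki_record)["text"])
print(to_document(arxiv_record)["text"])
```

With such a helper, both sources can go through one loop, and a record with a missing body fails loudly instead of producing an empty chunk.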


In [8]:
# Process all documents
print("Processing all collected documents...")

all_chunks = []

# Process Wikipedia articles
print(f"\nProcessing {len(wiki_data)} Wikipedia articles...")
for article in tqdm(wiki_data, desc="Wikipedia"):
    chunks = preprocessor.chunk_document(article, strategy='semantic')
    all_chunks.extend(chunks)

# Process ArXiv papers
print(f"\nProcessing {len(arxiv_data)} ArXiv papers...")
for paper in tqdm(arxiv_data, desc="ArXiv"):
    # Convert ArXiv format to expected format
    doc = {
        'id': paper['id'],
        'title': paper['title'],
        'text': paper['abstract'],
        'source': paper['source']
    }
    chunks = preprocessor.chunk_document(doc, strategy='semantic')
    all_chunks.extend(chunks)

print(f"\nTotal chunks created: {len(all_chunks)}")

# Show chunk statistics
if all_chunks:
    chunk_lengths = [len(chunk['text']) for chunk in all_chunks]
    print(f"Chunk length statistics:")
    print(f"  Min: {min(chunk_lengths)} characters")
    print(f"  Max: {max(chunk_lengths)} characters")
    print(f"  Average: {np.mean(chunk_lengths):.1f} characters")
    print(f"  Median: {np.median(chunk_lengths):.1f} characters")
    
    # Show sources
    sources = [chunk['source'] for chunk in all_chunks]
    source_counts = Counter(sources)
    print(f"\nChunks by source:")
    for source, count in source_counts.items():
        print(f"  {source}: {count} chunks")
    
    # Show sample chunks
    print(f"\nSample chunks:")
    for i, chunk in enumerate(all_chunks[:3]):  # Show first 3 chunks
        print(f"  Chunk {i+1} ({chunk['source']}): {chunk['text'][:100]}...")
else:
    print("No chunks were created. This might indicate an issue with the chunking process.")


Processing all collected documents...

Processing 5 Wikipedia articles...


Wikipedia: 100%|██████████| 5/5 [00:00<00:00, 5814.12it/s]



Processing 4 ArXiv papers...


ArXiv: 100%|██████████| 4/4 [00:00<00:00, 336.16it/s]


Total chunks created: 22
Chunk length statistics:
  Min: 44 characters
  Max: 510 characters
  Average: 400.7 characters
  Median: 435.0 characters

Chunks by source:
  wikipedia: 6 chunks
  arxiv: 16 chunks

Sample chunks:
  Chunk 1 (wikipedia): Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
  Chunk 2 (wikipedia): Artificial intelligence (AI) is the capability of computational systems to perform tasks typically a...
  Chunk 3 (wikipedia): In machine learning, deep learning focuses on utilizing multilayered neural networks to perform task...





In [9]:
# Save processed data
print("Saving processed data...")

# Save chunks
chunks_file = DATA_DIR / "processed" / "all_chunks.json"
chunks_file.parent.mkdir(parents=True, exist_ok=True)

with open(chunks_file, 'w', encoding='utf-8') as f:
    json.dump(all_chunks, f, indent=2)

print(f"Saved {len(all_chunks)} chunks to: {chunks_file}")

print("\nData processing complete! Ready for the next notebook.")


Saving processed data...
Saved 22 chunks to: /Users/scienceman/Desktop/LLM/data/processed/all_chunks.json

Data processing complete! Ready for the next notebook.
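
Before moving on, it can be worth confirming that the saved file loads back correctly. The sketch below shows a save and load round trip with a basic check that every chunk has the `text` and `source` keys used in the cells above. The function names are hypothetical, the demo writes to a temporary directory instead of `DATA_DIR`, and `ensure_ascii=False` is an optional choice that keeps non-ASCII characters readable in the file.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

REQUIRED_KEYS = {"text", "source"}  # keys the cells above read from each chunk


def save_chunks(chunks: List[Dict[str, Any]], path: Path) -> None:
    """Write chunks as UTF-8 JSON, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, indent=2, ensure_ascii=False)


def load_chunks(path: Path) -> List[Dict[str, Any]]:
    """Read chunks back and verify that each one has the required keys."""
    with open(path, "r", encoding="utf-8") as f:
        chunks = json.load(f)
    for i, chunk in enumerate(chunks):
        missing = REQUIRED_KEYS - chunk.keys()
        if missing:
            raise ValueError(f"chunk {i} is missing keys: {sorted(missing)}")
    return chunks


# Round trip with an invented chunk in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "processed" / "all_chunks.json"
    demo_chunks = [{"text": "naïve Bayes example", "source": "wikipedia"}]
    save_chunks(demo_chunks, demo_path)
    reloaded = load_chunks(demo_path)

print(reloaded == demo_chunks)  # True
```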


## Summary

In this notebook, we have learned:

1. **Data Collection**: How to collect data from different sources using our custom DataCollector
2. **Data Exploration**: How to analyze the structure and characteristics of our collected data
3. **Text Preprocessing**: How to chunk documents using different strategies (fixed, semantic, hierarchical)
4. **Data Processing**: How to process and save data for use in subsequent notebooks

### Key Takeaways:

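- **Wikipedia text** was short in this run: the collector returned 44 to 90 words per article (66 on average), which looks like an introductory summary and not the full article
- **ArXiv abstracts** were longer and more technical: the first abstract has 203 words, and ArXiv accounts for 16 of the 22 chunks
- **Semantic chunking** was used for the final pass; on the short demo document all three strategies returned a single chunk, so their differences only appear on longer texts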
- **Data preprocessing** is crucial for effective RAG systems

### Next Steps:

In the next notebook, we will learn how to convert these text chunks into embeddings and build vector stores for efficient similarity search.
