# Data Collection and Exploration

In this notebook, we'll collect real data from Wikipedia and ArXiv to build our RAG system. This is where we start working with actual documents that our system will need to understand and retrieve from.

## Learning Objectives
By the end of this notebook, you will:
1. Understand how to collect data from HuggingFace datasets
2. Explore the structure and characteristics of different data sources
3. Learn about data quality and filtering strategies
4. Get hands-on experience with real text data


## Setup and Imports

First, let's import the libraries we'll need and set up our environment.


In [None]:
# Standard library imports
import json
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import numpy as np
from tqdm import tqdm

# Add project root to path
import sys
sys.path.append('.')

# Import our custom modules
from src.config import DATA_CONFIG, DATA_DIR
from src.data.collect_data import DataCollector
from src.data.preprocess_data import TextPreprocessor

print("Libraries imported successfully!")
print(f"Data directory: {DATA_DIR}")

# Create the shared collector and preprocessor used by the cells below.
# Assumption: TextPreprocessor, like DataCollector, can be constructed with no arguments.
collector = DataCollector()
preprocessor = TextPreprocessor()

## Understanding Our Data Sources

Before we start collecting data, let's understand what we're working with:

### Wikipedia
- **What it is**: General knowledge articles on a wide range of topics
- **Why it's good for RAG**: Diverse topics, well-structured text, factual content
- **Challenges**: Variable length; some articles are very long and others very short

### ArXiv
- **What it is**: Academic paper abstracts from various scientific fields
- **Why it's good for RAG**: Technical content, structured abstracts, domain-specific knowledge
- **Challenges**: Technical jargon, variable quality, specialized vocabulary

Let's start by exploring small samples of these datasets without downloading them in full.
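One way to peek at a large corpus without materializing it is to treat it as a lazy stream and pull only the first few records. The sketch below does not reflect how `DataCollector` works internally; `fake_article_stream` and `take` are hypothetical names, and the generator merely stands in for a streamed dataset.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def fake_article_stream() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    i = 0
    while True:  # conceptually unbounded, like a very large corpus
        yield {"title": f"Article {i}", "text": f"Body of article {i}. " * (i + 1)}
        i += 1


def take(stream: Iterable[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    """Pull only the first n records; the rest of the stream is never produced."""
    return list(islice(stream, n))


sample = take(fake_article_stream(), 5)
for record in sample:
    print(f"{record['title']}: {len(record['text'])} characters")
```

Because the generator is only advanced five times, the cost of sampling is independent of the size of the underlying source.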


In [61]:
# Use our DataCollector instead of direct load_dataset
collector = DataCollector()
wiki_sample_data = collector.collect_wikipedia_data(max_documents=5)
print(f'Collected {len(wiki_sample_data)} Wikipedia articles')

# Convert to the format expected by the rest of the notebook
wiki_sample = []
for article in wiki_sample_data:
    wiki_sample.append({
        'title': article['title'],
        'text': article['text']
    })

print(f'Wikipedia sample structure: {len(wiki_sample)} articles')
if wiki_sample:
    print(f'First article title: {wiki_sample[0]["title"]}')
    print(f'First article length: {len(wiki_sample[0]["text"])} characters')

In [50]:
# Use our DataCollector instead of direct load_dataset
collector = DataCollector()
arxiv_sample_data = collector.collect_arxiv_data(max_documents=5)
print(f'Collected {len(arxiv_sample_data)} ArXiv papers')

# Convert to the format expected by the rest of the notebook
arxiv_sample = []
for paper in arxiv_sample_data:
    arxiv_sample.append({
        'title': paper['title'],
        'abstract': paper['abstract']
    })

print(f'ArXiv sample structure: {len(arxiv_sample)} papers')
if arxiv_sample:
    print(f'First paper title: {arxiv_sample[0]["title"]}')
    print(f'First paper abstract length: {len(arxiv_sample[0]["abstract"])} characters')

Exploring ArXiv dataset structure...


RuntimeError: Dataset scripts are no longer supported, but found scientific_papers.py

## Collecting Real Data

Now let's use our data collection module to get some real data. We'll start with a small sample to test our approach, then you can decide if you want to collect more.


In [51]:
# Let's start with a small collection to test our approach
print("Collecting a small sample of Wikipedia data...")
wiki_data = collector.collect_wikipedia_data(max_documents=50)

print(f"\nCollected {len(wiki_data)} Wikipedia articles")
if wiki_data:
    print(f"\nSample article titles:")
    for i, article in enumerate(wiki_data[:5]):
        print(f"  {i+1}. {article['title']} ({article['word_count']} words)")
    
    # Show some statistics
    word_counts = [article['word_count'] for article in wiki_data]
    print(f"\nWikipedia Statistics:")
    print(f"  Average words: {np.mean(word_counts):.1f}")
    print(f"  Word range: {min(word_counts)} - {max(word_counts)}")
    print(f"  Total words: {sum(word_counts):,}")


In [52]:
# Now let's collect some ArXiv data
print("Collecting a small sample of ArXiv data...")
arxiv_data = collector.collect_arxiv_data(max_documents=25)

print(f"\nCollected {len(arxiv_data)} ArXiv abstracts")
if arxiv_data:
    print(f"\nSample paper titles:")
    for i, paper in enumerate(arxiv_data[:5]):
        print(f"  {i+1}. {paper['title']} ({paper['word_count']} words)")
        if 'categories' in paper and paper['categories']:
            print(f"     Categories: {paper['categories'][:3]}")  # Show first 3 categories
    
    # Show some statistics
    word_counts = [paper['word_count'] for paper in arxiv_data]
    print(f"\nArXiv Statistics:")
    print(f"  Average words: {np.mean(word_counts):.1f}")
    print(f"  Word range: {min(word_counts)} - {max(word_counts)}")
    print(f"  Total words: {sum(word_counts):,}")
    
    # Analyze categories
    all_categories = []
    for paper in arxiv_data:
        if 'categories' in paper and paper['categories']:
            all_categories.extend(paper['categories'])
    
    if all_categories:
        category_counts = Counter(all_categories)
        print(f"\nTop categories:")
        for cat, count in category_counts.most_common(5):
            print(f"  {cat}: {count} papers")


## Data Visualization and Analysis

Let's create some visualizations to understand our data better.


In [53]:
# Create visualizations of our collected data
if wiki_data and arxiv_data:
    # Combine data for comparison
    comparison_data = []
    
    for article in wiki_data:
        comparison_data.append({
            'source': 'Wikipedia',
            'word_count': article['word_count'],
            'title': article['title']
        })
    
    for paper in arxiv_data:
        comparison_data.append({
            'source': 'ArXiv',
            'word_count': paper['word_count'],
            'title': paper['title']
        })
    
    comparison_df = pd.DataFrame(comparison_data)
    
    # Create comparison plots
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Word count distribution
    axes[0, 0].hist(comparison_df[comparison_df['source'] == 'Wikipedia']['word_count'], 
                   bins=20, alpha=0.7, label='Wikipedia', color='blue')
    axes[0, 0].hist(comparison_df[comparison_df['source'] == 'ArXiv']['word_count'], 
                   bins=20, alpha=0.7, label='ArXiv', color='orange')
    axes[0, 0].set_title('Word Count Distribution')
    axes[0, 0].set_xlabel('Word Count')
    axes[0, 0].set_ylabel('Number of Documents')
    axes[0, 0].legend()
    
    # Box plot comparison
    comparison_df.boxplot(column='word_count', by='source', ax=axes[0, 1])
    fig.suptitle('')  # clear the "Boxplot grouped by source" title pandas adds to the figure
    axes[0, 1].set_title('Word Count by Source')
    axes[0, 1].set_xlabel('Source')
    axes[0, 1].set_ylabel('Word Count')
    
    # Scatter plot
    wiki_counts = comparison_df[comparison_df['source'] == 'Wikipedia']['word_count']
    arxiv_counts = comparison_df[comparison_df['source'] == 'ArXiv']['word_count']
    
    axes[1, 0].scatter(range(len(wiki_counts)), wiki_counts, alpha=0.6, label='Wikipedia', color='blue')
    axes[1, 0].scatter(range(len(arxiv_counts)), arxiv_counts, alpha=0.6, label='ArXiv', color='orange')
    axes[1, 0].set_title('Individual Document Word Counts')
    axes[1, 0].set_xlabel('Document Index')
    axes[1, 0].set_ylabel('Word Count')
    axes[1, 0].legend()
    
    # Summary statistics
    summary_stats = comparison_df.groupby('source')['word_count'].agg(['count', 'mean', 'std', 'min', 'max'])
    axes[1, 1].axis('off')
    
    # Create a text table
    table_text = "Summary Statistics:\n\n"
    for source in summary_stats.index:
        stats = summary_stats.loc[source]
        table_text += f"{source}:\n"
        # The aggregated row is float-typed, so format counts and ranges as whole numbers
        table_text += f"  Count: {stats['count']:.0f}\n"
        table_text += f"  Mean: {stats['mean']:.1f}\n"
        table_text += f"  Std: {stats['std']:.1f}\n"
        table_text += f"  Range: {stats['min']:.0f}-{stats['max']:.0f}\n\n"
    
    axes[1, 1].text(0.1, 0.9, table_text, transform=axes[1, 1].transAxes, 
                   fontsize=10, verticalalignment='top', fontfamily='monospace')
    
    plt.tight_layout()
    plt.show()
    
    print("Key Insights:")
    wiki_mean = comparison_df[comparison_df['source'] == 'Wikipedia']['word_count'].mean()
    arxiv_mean = comparison_df[comparison_df['source'] == 'ArXiv']['word_count'].mean()
    
    print(f"• Wikipedia articles are {wiki_mean/arxiv_mean:.1f}x longer on average")
    print(f"• Wikipedia: {len(wiki_data)} articles, average {wiki_mean:.0f} words")
    print(f"• ArXiv: {len(arxiv_data)} abstracts, average {arxiv_mean:.0f} words")
    print(f"• Wikipedia provides broader, general knowledge")
    print(f"• ArXiv provides focused, technical content")
    print(f"• Together, they offer both breadth and depth for RAG systems")
else:
    print("No data collected yet. Run the previous cells to collect data first.")


No data collected yet. Run the previous cells to collect data first.


## Text Preprocessing and Chunking

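Now let's use our preprocessing module to clean and chunk the data we collected. This step matters because embedding models accept only limited input lengths and retrieval works best over focused passages, so raw text has to be cleaned and split before we can create embeddings.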
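To make the idea of chunking concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the implementation inside `TextPreprocessor`, which is not shown in this notebook; the function name `chunk_fixed` and the `chunk_size` and `overlap` parameters are illustrative assumptions.

```python
from typing import Dict, List


def chunk_fixed(text: str, chunk_size: int = 50, overlap: int = 10) -> List[Dict]:
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        chunks.append({"text": " ".join(piece), "word_count": len(piece)})
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo_text = " ".join(f"w{i}" for i in range(120))
demo_chunks = chunk_fixed(demo_text, chunk_size=50, overlap=10)
print([c["word_count"] for c in demo_chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some words twice.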


In [54]:
# Let's test different chunking strategies on a sample document
if wiki_data:
    # Pick the longest article to demonstrate chunking
    sample_article = max(wiki_data, key=lambda x: x['word_count'])
    
    print(f"Testing chunking on: '{sample_article['title']}'")
    print(f"Original length: {sample_article['word_count']} words")
    print(f"Original text (first 300 chars): {sample_article['text'][:300]}...")
    
    # Test different chunking strategies
    strategies = ['fixed', 'semantic', 'hierarchical']
    
    for strategy in strategies:
        print(f"\n--- {strategy.upper()} CHUNKING ---")
        chunks = preprocessor.chunk_document(sample_article, strategy)
        
        print(f"Number of chunks: {len(chunks)}")
        if chunks:
            print(f"Average words per chunk: {np.mean([c['word_count'] for c in chunks]):.1f}")
            print(f"Chunk sizes: {[c['word_count'] for c in chunks]}")
            
            # Show first chunk
            print(f"First chunk ({chunks[0]['word_count']} words):")
            print(f"  {chunks[0]['text'][:150]}...")
            
            if len(chunks) > 1:
                print(f"Second chunk ({chunks[1]['word_count']} words):")
                print(f"  {chunks[1]['text'][:150]}...")
        
        print("-" * 50)
else:
    print("No Wikipedia data available. Run the data collection cells first.")


No Wikipedia data available. Run the data collection cells first.


## Processing All Collected Data

Now let's process all of our collected data with a single chunking strategy. We'll use semantic chunking, as it tends to work well for most use cases.
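What `strategy="semantic"` does inside `TextPreprocessor` is not shown here. A common approximation, sketched below under that caveat, is to keep sentences whole and group consecutive sentences until a word budget is reached. The names `chunk_by_sentences` and `max_words` are hypothetical, and the regex sentence splitter is deliberately naive.

```python
import re
from typing import Dict, List


def chunk_by_sentences(text: str, max_words: int = 40) -> List[Dict]:
    """Group consecutive sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_words = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_words + n > max_words:
            # Adding this sentence would exceed the budget: close the chunk first.
            chunks.append({"text": " ".join(current), "word_count": current_words})
            current, current_words = [], 0
        current.append(sentence)
        current_words += n
    if current:
        chunks.append({"text": " ".join(current), "word_count": current_words})
    return chunks


demo = ("RAG retrieves documents. It then generates an answer. "
        "Chunks should keep sentences whole! Does this one fit?")
for chunk in chunk_by_sentences(demo, max_words=10):
    print(chunk["word_count"], "|", chunk["text"])
```

Compared with fixed-size windows, this never cuts a sentence in half, but chunk sizes become uneven.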


In [55]:
# Process all our collected data
if wiki_data and arxiv_data:
    print("Processing all collected data...")
    
    # Process Wikipedia data
    print("\nProcessing Wikipedia data with semantic chunking...")
    wiki_chunks = preprocessor.process_documents(wiki_data, strategy="semantic")
    
    # Process ArXiv data
    print("\nProcessing ArXiv data with semantic chunking...")
    arxiv_chunks = preprocessor.process_documents(arxiv_data, strategy="semantic")
    
    # Combine all chunks
    all_chunks = wiki_chunks + arxiv_chunks
    
    print(f"\nProcessing Summary:")
    print(f"  Wikipedia: {len(wiki_data)} documents -> {len(wiki_chunks)} chunks")
    print(f"  ArXiv: {len(arxiv_data)} documents -> {len(arxiv_chunks)} chunks")
    print(f"  Total: {len(all_chunks)} chunks")
    
    if all_chunks:
        # Analyze chunk characteristics
        word_counts = [chunk['word_count'] for chunk in all_chunks]
        char_counts = [chunk['char_count'] for chunk in all_chunks]
        
        print(f"\nChunk Statistics:")
        print(f"  Average words per chunk: {np.mean(word_counts):.1f}")
        print(f"  Average chars per chunk: {np.mean(char_counts):.1f}")
        print(f"  Word count range: {min(word_counts)} - {max(word_counts)}")
        print(f"  Total words in chunks: {sum(word_counts):,}")
        
        # Show sample chunks
        print(f"\nSample chunks:")
        for i, chunk in enumerate(all_chunks[:3]):
            print(f"\nChunk {i+1} (Source: {chunk['source']}, {chunk['word_count']} words):")
            print(f"  Title: {chunk['source_title']}")
            print(f"  Text: {chunk['text'][:200]}...")
    
    # Save the processed data
    print(f"\nSaving processed data...")
    preprocessor.save_processed_data(wiki_chunks, "wikipedia_chunks.json")
    preprocessor.save_processed_data(arxiv_chunks, "arxiv_chunks.json")
    preprocessor.save_processed_data(all_chunks, "all_chunks.json")
    
    print(f"Processed data saved to: {preprocessor.output_dir}")
    
else:
    print("No data collected yet. Run the data collection cells first.")


No data collected yet. Run the data collection cells first.


## Interactive Data Exploration

Let's create some interactive tools to explore our data and understand what we've collected.


In [56]:
# Interactive exploration functions
def explore_documents(documents, doc_type="document"):
    """
    Interactive function to explore individual documents.
    """
    if not documents:
        print("No documents available")
        return
    
    print(f"\n=== Exploring {doc_type}s ===")
    print(f"Total {doc_type}s: {len(documents)}")
    
    while True:
        try:
            choice = input(f"\nEnter {doc_type} number (0-{len(documents)-1}) or 'q' to quit: ")
            
            if choice.lower() == 'q':
                break
            
            idx = int(choice)
            if 0 <= idx < len(documents):
                doc = documents[idx]
                print(f"\n--- {doc_type.upper()} {idx} ---")
                print(f"Title: {doc['title']}")
                print(f"Word count: {doc['word_count']}")
                print(f"Character count: {doc['length']}")
                
                # Show the text content
                text_key = 'text' if 'text' in doc else 'abstract'
                text = doc[text_key]
                
                print(f"\n{text_key.upper()} (first 500 characters):")
                print("-" * 50)
                print(text[:500])
                if len(text) > 500:
                    print("...")
                print("-" * 50)
                
                # Show additional info if available
                if 'categories' in doc and doc['categories']:
                    print(f"Categories: {doc['categories']}")
                if 'authors' in doc and doc['authors']:
                    print(f"Authors: {doc['authors'][:3]}")
            else:
                print(f"Please enter a number between 0 and {len(documents)-1}")
        except ValueError:
            print("Please enter a valid number or 'q'")
        except KeyboardInterrupt:
            print("\nExiting exploration...")
            break

def explore_chunks(chunks):
    """
    Interactive function to explore individual chunks.
    """
    if not chunks:
        print("No chunks available")
        return
    
    print(f"\n=== Exploring Chunks ===")
    print(f"Total chunks: {len(chunks)}")
    
    while True:
        try:
            choice = input(f"\nEnter chunk number (0-{len(chunks)-1}) or 'q' to quit: ")
            
            if choice.lower() == 'q':
                break
            
            idx = int(choice)
            if 0 <= idx < len(chunks):
                chunk = chunks[idx]
                print(f"\n--- CHUNK {idx} ---")
                print(f"Source: {chunk['source']}")
                print(f"Title: {chunk['source_title']}")
                print(f"Word count: {chunk['word_count']}")
                print(f"Chunk type: {chunk['type']}")
                
                print(f"\nCHUNK TEXT:")
                print("-" * 50)
                print(chunk['text'])
                print("-" * 50)
                
                # Show metadata
                if 'metadata' in chunk:
                    print(f"Metadata: {chunk['metadata']}")
            else:
                print(f"Please enter a number between 0 and {len(chunks)-1}")
        except ValueError:
            print("Please enter a valid number or 'q'")
        except KeyboardInterrupt:
            print("\nExiting exploration...")
            break

# Let's explore Wikipedia articles
if wiki_data:
    print("Wikipedia data exploration available!")
    print("Run: explore_documents(wiki_data, 'Wikipedia article')")
    
# Let's explore ArXiv abstracts
if arxiv_data:
    print("ArXiv data exploration available!")
    print("Run: explore_documents(arxiv_data, 'ArXiv abstract')")
    
# Let's explore chunks if we have them
if 'all_chunks' in locals():
    print("Chunk data exploration available!")
    print("Run: explore_chunks(all_chunks)")

print("\nTry these exploration functions:")
print("1. explore_documents(wiki_data, 'Wikipedia article')")
print("2. explore_documents(arxiv_data, 'ArXiv abstract')")
if 'all_chunks' in locals():
    print("3. explore_chunks(all_chunks)")



Try these exploration functions:
1. explore_documents(wiki_data, 'Wikipedia article')
2. explore_documents(arxiv_data, 'ArXiv abstract')


## Summary and Next Steps

Great! You've successfully collected and processed real data for your RAG system. Here's what we've accomplished:

### What We've Done
1. **Explored data sources** - Understood the structure of Wikipedia and ArXiv datasets
2. **Collected sample data** - Downloaded and processed real documents
3. **Analyzed data characteristics** - Understood word counts, distributions, and quality
4. **Compared data sources** - Saw how Wikipedia and ArXiv complement each other
5. **Processed and chunked data** - Prepared text for embedding generation
6. **Created interactive tools** - Built functions to explore your data

### Key Insights
- **Wikipedia** provides broad, general knowledge with longer articles
- **ArXiv** provides focused, technical content with shorter abstracts
- Both sources have good coverage for building a comprehensive RAG system
- **Semantic chunking** works well for preserving meaning while creating manageable pieces
- Data quality is generally good, with some variability in length
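
Variability in length is the simplest quality signal to act on. As a rough sketch, documents can be filtered by the `word_count` field used throughout this notebook. The function name and both thresholds below are illustrative assumptions, not values taken from `DATA_CONFIG`.

```python
from typing import Dict, List


def filter_by_word_count(docs: List[Dict], min_words: int = 50,
                         max_words: int = 5000) -> List[Dict]:
    """Keep documents whose word_count lies in [min_words, max_words]."""
    return [d for d in docs if min_words <= d["word_count"] <= max_words]


docs = [
    {"title": "Stub", "word_count": 12},
    {"title": "Typical article", "word_count": 850},
    {"title": "Very long article", "word_count": 20000},
]
kept = filter_by_word_count(docs)
print([d["title"] for d in kept])
```

Very short documents rarely carry enough context to answer a query, while very long ones may be better truncated or chunked than dropped, so the upper bound deserves more thought than the lower one.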

### Next Steps
Now that we have our processed data, the next steps in building our RAG system are:
1. **Generate embeddings** - Convert text chunks to vector representations
2. **Build vector store** - Create a searchable database of embeddings
3. **Implement retrieval** - Find relevant chunks for queries
4. **Connect LLM** - Generate answers based on retrieved context
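
As a preview of the first three steps, the toy sketch below ranks chunks against a query by cosine similarity. It uses plain word counts as a stand-in for embeddings, which is an assumption made purely for illustration; a real system would use a learned embedding model and a vector store. All function names here are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bow_vector(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[Dict], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query as (score, text) pairs."""
    q = bow_vector(query)
    scored = [(cosine(q, bow_vector(c["text"])), c["text"]) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


toy_chunks = [
    {"text": "neural networks learn representations from data"},
    {"text": "the french revolution began in 1789"},
    {"text": "transformers are neural networks built on attention"},
]
for score, text in retrieve("how do neural networks learn", toy_chunks):
    print(f"{score:.2f}  {text}")
```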

### Files Created
- Raw data: `data/raw/wikipedia_sample.json`, `data/raw/arxiv_sample.json`
- Processed chunks: `data/processed/wikipedia_chunks.json`, `data/processed/arxiv_chunks.json`, `data/processed/all_chunks.json`
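
Before moving on, it can help to confirm that a saved chunk file contains what later notebooks expect. The sketch below assumes each file is a JSON list of chunk dictionaries, which is not confirmed by this notebook, and checks for the keys that the cells above read from each chunk. `load_and_check_chunks` is a hypothetical helper, and the demo writes a throwaway file rather than touching `data/processed`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Keys the notebook cells read from each chunk; the on-disk schema is an assumption.
REQUIRED_KEYS = {"text", "word_count", "char_count", "source", "source_title", "type"}


def load_and_check_chunks(path: Path) -> List[Dict]:
    """Load a JSON list of chunks and fail loudly if any chunk lacks a required key."""
    chunks = json.loads(path.read_text(encoding="utf-8"))
    for i, chunk in enumerate(chunks):
        missing = REQUIRED_KEYS - set(chunk)
        if missing:
            raise ValueError(f"chunk {i} is missing keys: {sorted(missing)}")
    return chunks


# Demo on a temporary file so the sketch runs without the real data directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "all_chunks.json"
    demo_path.write_text(json.dumps([{
        "text": "Example chunk text.", "word_count": 3, "char_count": 19,
        "source": "wikipedia", "source_title": "Example", "type": "semantic",
    }]), encoding="utf-8")
    loaded = load_and_check_chunks(demo_path)
    print(f"Loaded {len(loaded)} valid chunk(s)")
```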

The data you've collected and processed will be the foundation for all these next steps. Each chunk will be converted into embeddings that can be searched and retrieved.

**Ready to move on to the next notebook?** The next step is embedding generation, where we'll learn how to convert our text chunks into mathematical representations for similarity search.


In [57]:
# Final summary and verification
print("=== Data Collection and Processing Summary ===")
# Count documents once, treating undefined or empty collections as zero
n_wiki = len(wiki_data) if 'wiki_data' in locals() and wiki_data else 0
n_arxiv = len(arxiv_data) if 'arxiv_data' in locals() and arxiv_data else 0
print(f"Wikipedia articles collected: {n_wiki}")
print(f"ArXiv abstracts collected: {n_arxiv}")
print(f"Total documents: {n_wiki + n_arxiv}")

if 'all_chunks' in locals() and all_chunks:
    print(f"Total chunks created: {len(all_chunks)}")
    word_counts = [chunk['word_count'] for chunk in all_chunks]
    print(f"Average words per chunk: {np.mean(word_counts):.1f}")
    print(f"Total words in all chunks: {sum(word_counts):,}")

print(f"\nData saved to:")
print(f"  Raw data: {collector.output_dir}")
print(f"  Processed data: {preprocessor.output_dir}")

print(f"\nFiles created:")
for file in collector.output_dir.glob("*.json"):
    print(f"  - {file.name}")
for file in preprocessor.output_dir.glob("*.json"):
    print(f"  - {file.name}")

print("\nData collection and processing completed successfully!")
print("\nNext: Open '03_embeddings_and_vector_store.ipynb' to learn about converting text to embeddings.")


=== Data Collection and Processing Summary ===
Wikipedia articles collected: 0
ArXiv abstracts collected: 0
Total documents: 0

Data saved to:
  Raw data: /Users/scienceman/Desktop/LLM/data/raw
  Processed data: /Users/scienceman/Desktop/LLM/data/processed

Files created:

Data collection and processing completed successfully!

Next: Open '03_embeddings_and_vector_store.ipynb' to learn about converting text to embeddings.
