# 📚 Data Loading Example - LLM Survey Generator

This notebook demonstrates how to load research papers from various sources for survey generation.

## Supported Data Sources
1. **Parquet Files** - Pre-indexed paper datasets
2. **ArXiv API** - Direct paper retrieval by ID or search
3. **Semantic Scholar** - Academic paper database (not demonstrated in this notebook)
4. **JSON/CSV Files** - Custom paper collections

---

## 📋 Setup and Imports

In [None]:
# Standard imports
import sys
import os
import json
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
from IPython.display import display, Markdown
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = Path('.').absolute().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f'✅ Project root: {project_root}')
print(f'✅ Python version: {sys.version.split()[0]}')

# Import project modules
from src.data.data_loader import DataLoader
from src.data.search_engine import SearchEngine

## 1️⃣ Loading from Parquet Files

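Parquet is a columnar format, so it is usually the fastest option here for loading large paper datasets. The cell below reads the dataset location from the `SCIMCP_DATA_PATH` environment variable and falls back to a two-paper sample when the variable is not set.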

In [None]:
# Initialize data loader
loader = DataLoader()

# Check if data path is configured
data_path = os.environ.get('SCIMCP_DATA_PATH')
if data_path:
    print(f'📂 Data path: {data_path}')
    
    # Load papers from parquet
    print('⏳ Loading papers from parquet file...')
    papers_df = loader.load_from_parquet(data_path)
    
    print(f'✅ Loaded {len(papers_df):,} papers')
    print(f'\n📊 Dataset Overview:')
    print(f'  • Date range: {papers_df["updated"].min()} to {papers_df["updated"].max()}')
    print(f'  • Columns: {list(papers_df.columns)}')
    
    # Display sample
    print('\n📄 Sample papers:')
    display(papers_df.head(3)[['title', 'authors', 'updated']])
else:
    print('⚠️ SCIMCP_DATA_PATH not set. Using sample data.')
    
    # Create sample data
    papers_df = pd.DataFrame([
        {
            'title': 'Attention Is All You Need',
            'abstract': 'The dominant sequence transduction models...',
            'authors': ['Vaswani et al.'],
            'updated': '2017-06-12'
        },
        {
            'title': 'BERT: Pre-training of Deep Bidirectional Transformers',
            'abstract': 'We introduce BERT, a new language representation model...',
            'authors': ['Devlin et al.'],
            'updated': '2018-10-11'
        }
    ])
    print(f'✅ Created {len(papers_df)} sample papers')

## 2️⃣ Topic-Based Filtering

Filter papers by topic using keyword search.
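
Plain substring matching can over-match: a short keyword such as `bert` also hits unrelated words that merely contain it. Below is a minimal sketch of a stricter matcher, assuming that whole-word matching with an optional plural `s` is acceptable for your topic. `build_keyword_pattern` is a hypothetical helper, not part of the project, and the titles are illustrative inputs.

```python
import re

# Hypothetical helper, not part of the project's DataLoader or SearchEngine.
def build_keyword_pattern(keywords):
    """Compile a case-insensitive pattern matching any keyword as a whole word."""
    # re.escape guards against keywords that contain regex metacharacters.
    alternatives = "|".join(re.escape(k) for k in keywords)
    # The trailing "s?" lets simple plurals ("transformers") match as well.
    return re.compile(rf"\b(?:{alternatives})s?\b", re.IGNORECASE)

pattern = build_keyword_pattern(["language model", "transformer", "gpt", "bert", "llm"])

titles = [
    "BERT: Pre-training of Deep Bidirectional Transformers",
    "Albert Einstein and the Photoelectric Effect",
    "Scaling Laws for Neural Language Models",
]
# Keep only the titles where some keyword occurs as a whole word.
matches = [t for t in titles if pattern.search(t)]
```

The string in `pattern.pattern` can be passed to pandas `str.contains` with `case=False` in place of the joined keyword list.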

In [None]:
# Filter papers by topic
topic = "Large Language Models"

print(f'🔍 Filtering papers on topic: "{topic}"')

# Method 1: Simple keyword filtering
keywords = ['language model', 'transformer', 'gpt', 'bert', 'llm']
pattern = '|'.join(keywords)

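# .copy() makes filtered_df an independent frame, so adding a column later
# does not trigger pandas' SettingWithCopyWarning
filtered_df = papers_df[
    papers_df['title'].str.lower().str.contains(pattern, na=False) |
    papers_df['abstract'].str.lower().str.contains(pattern, na=False)
].copy()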

print(f'✅ Found {len(filtered_df):,} papers matching keywords')

# Display distribution by year
if 'updated' in filtered_df.columns and len(filtered_df) > 0:
    filtered_df['year'] = pd.to_datetime(filtered_df['updated']).dt.year
    year_counts = filtered_df['year'].value_counts().sort_index()
    
    print('\n📈 Papers by year:')
    for year, count in year_counts.tail(5).items():
        bar = '█' * int(count / year_counts.max() * 20)
        print(f'  {year}: {bar} {count}')

## 3️⃣ Using BM25 Search Engine

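BM25 ranks papers by relevance to a query rather than by simple keyword presence: rare query terms count for more, and long documents are not favored just for their length.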
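
To make the scoring concrete, here is a minimal sketch of Okapi BM25 over whitespace tokens. It illustrates the general formula only; the internals of the project's `SearchEngine` are not shown in this notebook and may differ. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration, and the function expects a non-empty document list.

```python
import math
from collections import Counter

# Hypothetical function for illustration; not the project's SearchEngine.
def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1" keeps it positive even for common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are damped.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "attention is all you need",
    "bert pre-training of deep bidirectional transformers",
    "chain-of-thought prompting elicits reasoning",
]
scores = bm25_scores("attention transformers", docs)
# Document indices ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Both matching documents contain exactly one query term, so the shorter one ranks first; the third document shares no term with the query and scores zero.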

In [None]:
# Initialize search engine
print('🔧 Initializing BM25 search engine...')
search_engine = SearchEngine()

# Build index from papers
if len(papers_df) > 0:
    papers_list = papers_df.to_dict('records')
    search_engine.build_index(papers_list)
    print(f'✅ Indexed {len(papers_list):,} papers')
    
    # Perform searches
    queries = [
        "transformer attention mechanism",
        "few-shot learning prompting",
        "reinforcement learning from human feedback"
    ]
    
    print('\n🔍 Search Results:')
    for query in queries:
        results = search_engine.search(query, top_k=3)
        print(f'\nQuery: "{query}"')
        for i, paper in enumerate(results, 1):
            print(f'  {i}. {paper.get("title", "Untitled")[:60]}...')
else:
    print('⚠️ No papers to index')

## 4️⃣ Loading from ArXiv API

Fetch papers directly from ArXiv using their API.
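
ArXiv entry IDs are URLs that end in the paper identifier, optionally followed by a version suffix such as `v2`. When the same paper may arrive from several sources, it can help to separate the base identifier from the version. The sketch below assumes the ID follows `/abs/` in the URL; `split_arxiv_id` is a hypothetical helper and the URLs are illustrative.

```python
import re

# Hypothetical helper; expects a non-empty entry URL or bare ID.
def split_arxiv_id(entry_id):
    """Split an arXiv entry URL or ID into (base_id, version)."""
    # Splitting on "/abs/" keeps old-style IDs such as "hep-th/9901001" intact,
    # whereas taking the last "/"-separated piece would drop the archive name.
    short_id = entry_id.split("/abs/")[-1]
    match = re.fullmatch(r"(.+?)(?:v(\d+))?", short_id)
    base_id, version = match.group(1), match.group(2)
    return base_id, (int(version) if version else None)

new_style = split_arxiv_id("http://arxiv.org/abs/1706.03762v5")
old_style = split_arxiv_id("http://arxiv.org/abs/hep-th/9901001v1")
unversioned = split_arxiv_id("1706.03762")
```

Keying on the base identifier treats different versions of one paper as the same entry.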

In [None]:
# Demo: Loading from ArXiv (requires arxiv package)
try:
    import arxiv
    
    print('🌐 Fetching papers from ArXiv...')
    
    # Search for recent LLM papers
    search = arxiv.Search(
        # Quote multi-word phrases so the field prefix applies to the whole phrase
        query='ti:"language model" OR abs:transformer',
        max_results=5,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )
    
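    # Client.results() replaces Search.results(), which is deprecated in recent arxiv releases
    client = arxiv.Client()
    arxiv_papers = []
    for result in client.results(search):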
        arxiv_papers.append({
            'title': result.title,
            'abstract': result.summary,
            'authors': [author.name for author in result.authors],
            'updated': result.updated.strftime('%Y-%m-%d'),
            'arxiv_id': result.entry_id.split('/')[-1]
        })
    
    print(f'✅ Fetched {len(arxiv_papers)} papers from ArXiv')
    
    # Display results
    for paper in arxiv_papers:
        print(f'\n📄 {paper["title"]}')
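        # Show at most three authors; add an ellipsis only when some are omitted
        authors = paper["authors"]
        suffix = ', ...' if len(authors) > 3 else ''
        print(f'   Authors: {", ".join(authors[:3])}{suffix}')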
        print(f'   ArXiv ID: {paper["arxiv_id"]}')
        
except ImportError:
    print('⚠️ arxiv package not installed.')
    print('   Install with: pip install arxiv')

## 5️⃣ Loading from JSON/CSV Files

Load custom paper collections from structured files.
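
One thing to watch with CSV is that it has no list type, so a list-valued field such as `authors` is written as text and does not come back as a list. A minimal sketch of one workaround, assuming you control both the writing and the reading side, is to encode the list as a JSON string inside the CSV cell. The paper records are made-up placeholders.

```python
import csv
import io
import json

# Placeholder records for illustration only.
papers = [
    {"title": "Paper A", "authors": ["A. Author", "B. Author"]},
    {"title": "Paper B", "authors": ["C. Author"]},
]

# Write: encode the list-valued field as a JSON string so it survives CSV.
buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=["title", "authors"])
writer.writeheader()
for paper in papers:
    writer.writerow({**paper, "authors": json.dumps(paper["authors"])})

# Read: decode the JSON string back into a Python list.
buffer.seek(0)
restored = [
    {**row, "authors": json.loads(row["authors"])}
    for row in csv.DictReader(buffer)
]
```

JSON files avoid the issue entirely, which is one reason to prefer them for records with nested fields.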

In [None]:
# Create sample JSON file
sample_papers = [
    {
        "title": "Chain-of-Thought Prompting Elicits Reasoning",
        "abstract": "We explore how generating a chain of thought...",
        "authors": ["Wei et al."],
        "year": 2022,
        "venue": "NeurIPS"
    },
    {
        "title": "Constitutional AI: Harmlessness from AI Feedback",
        "abstract": "We present Constitutional AI, a method for training...",
        "authors": ["Bai et al."],
        "year": 2022,
        "venue": "arXiv"
    }
]

# Save to JSON
json_path = Path('../data/sample_papers.json')
# parents=True also creates missing intermediate directories
json_path.parent.mkdir(parents=True, exist_ok=True)

with open(json_path, 'w') as f:
    json.dump(sample_papers, f, indent=2)

print(f'💾 Saved sample papers to {json_path}')

# Load from JSON
with open(json_path) as f:
    loaded_papers = json.load(f)

print(f'\n✅ Loaded {len(loaded_papers)} papers from JSON')

# Convert to DataFrame
custom_df = pd.DataFrame(loaded_papers)
display(custom_df)

# Also demonstrate CSV loading
csv_path = Path('../data/sample_papers.csv')
custom_df.to_csv(csv_path, index=False)
print(f'\n💾 Saved to CSV: {csv_path}')

# Load from CSV (note: list columns such as 'authors' come back as plain strings)
csv_df = pd.read_csv(csv_path)
print(f'✅ Loaded {len(csv_df)} papers from CSV')

## 6️⃣ Combining Multiple Sources

Merge papers from different sources for comprehensive coverage.
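
Titles for the same paper often differ slightly between sources in case, punctuation, or whitespace, so exact matching on lowercased titles can miss duplicates. The sketch below normalizes titles before comparing them. `title_key` and `deduplicate` are hypothetical helpers, and the sketch assumes each title is either a string or `None`.

```python
import re
import unicodedata

# Hypothetical helpers for illustration; not part of the project.
def title_key(title):
    """Normalize a title so that trivial variations map to the same key."""
    text = unicodedata.normalize("NFKC", title or "").casefold()
    # Replace punctuation with spaces, then collapse whitespace runs.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def deduplicate(papers):
    """Keep the first paper seen for each normalized title; drop untitled ones."""
    seen, unique = set(), []
    for paper in papers:
        key = title_key(paper.get("title"))
        if key and key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique

papers = [
    {"title": "Attention Is All You Need", "source": "parquet"},
    {"title": "attention is all you need.", "source": "arxiv"},
    {"title": "  Attention  Is All\nYou Need ", "source": "json"},
    {"title": None, "source": "json"},
]
unique = deduplicate(papers)
```

Because the first occurrence wins, the order in which sources are combined decides which record is kept.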

In [None]:
# Combine papers from multiple sources
all_papers = []

# Add papers from parquet
if 'papers_df' in locals() and len(papers_df) > 0:
    all_papers.extend(papers_df.head(10).to_dict('records'))
    print(f'✅ Added {min(10, len(papers_df))} papers from parquet')

# Add papers from JSON
if 'loaded_papers' in locals():
    all_papers.extend(loaded_papers)
    print(f'✅ Added {len(loaded_papers)} papers from JSON')

# Add papers from ArXiv
if 'arxiv_papers' in locals():
    all_papers.extend(arxiv_papers)
    print(f'✅ Added {len(arxiv_papers)} papers from ArXiv')

print(f'\n📊 Total papers collected: {len(all_papers)}')

# Remove duplicates based on title
unique_papers = []
seen_titles = set()

for paper in all_papers:
    # Tolerate a missing or None title and ignore surrounding whitespace
    title = str(paper.get('title') or '').strip().lower()
    if title and title not in seen_titles:
        unique_papers.append(paper)
        seen_titles.add(title)

print(f'✅ Unique papers after deduplication: {len(unique_papers)}')

## 7️⃣ Data Preprocessing for Survey Generation

Prepare papers for input to the survey generation system.
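
Cutting an abstract at a fixed character offset can split a word in half. As a minimal sketch of an alternative, the standard library's `textwrap.shorten` truncates at a word boundary and appends a placeholder. `truncate_abstract` is a hypothetical helper, and the 1000-character limit mirrors the one used in the cell below.

```python
import textwrap

# Hypothetical helper for illustration; not part of the project.
def truncate_abstract(abstract, max_chars=1000):
    """Shorten an abstract to at most max_chars, cutting at a word boundary."""
    # Note: textwrap.shorten also collapses whitespace runs into single spaces.
    return textwrap.shorten(abstract or "", width=max_chars, placeholder="...")

long_abstract = "word " * 300  # 1,500 characters of filler text
short_abstract = truncate_abstract(long_abstract)
unchanged = truncate_abstract("A short abstract.")
```

Text that already fits within the limit is returned without a placeholder.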

In [None]:
def preprocess_papers(papers, max_papers=50):
    """Preprocess papers for survey generation."""
    
    # Ensure required fields
    processed = []
    for paper in papers[:max_papers]:
        # Clean and standardize fields
        processed_paper = {
            'title': paper.get('title', 'Untitled'),
            # Fall back to 'summary', and to '' when the field is missing or None
            'abstract': paper.get('abstract') or paper.get('summary') or '',
            'authors': paper.get('authors', []),
            'year': None,
            'venue': paper.get('venue', '')
        }
        
        # Extract year
        if 'year' in paper:
            processed_paper['year'] = paper['year']
        elif 'updated' in paper:
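            try:
                date = pd.to_datetime(paper['updated'])
                # Skip missing dates (NaT), which have no usable year
                if not pd.isna(date):
                    processed_paper['year'] = int(date.year)
            except (ValueError, TypeError):
                # Leave year as None when the date cannot be parsed
                pass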
        
        # Ensure authors is a list
        if isinstance(processed_paper['authors'], str):
            processed_paper['authors'] = [processed_paper['authors']]
        
        # Truncate abstract if too long
        if len(processed_paper['abstract']) > 1000:
            processed_paper['abstract'] = processed_paper['abstract'][:997] + '...'
        
        processed.append(processed_paper)
    
    return processed

# Preprocess papers
if unique_papers:
    survey_papers = preprocess_papers(unique_papers, max_papers=20)
    
    print(f'✅ Preprocessed {len(survey_papers)} papers for survey generation')
    print('\n📄 Sample preprocessed paper:')
    
    sample = survey_papers[0]
    print(f"  Title: {sample['title']}")
    print(f"  Authors: {', '.join(sample['authors'][:3]) if sample['authors'] else 'Unknown'}")
    print(f"  Year: {sample['year'] or 'Unknown'}")
    print(f"  Abstract: {sample['abstract'][:100]}..." if sample['abstract'] else "  Abstract: None")
    
    # Save for survey generation
    output_path = Path('../data/preprocessed_papers.json')
    with open(output_path, 'w') as f:
        json.dump(survey_papers, f, indent=2)
    
    print(f'\n💾 Saved preprocessed papers to {output_path}')
else:
    print('⚠️ No papers to preprocess')

## 8️⃣ Data Quality Checks

Validate data quality before survey generation.

In [None]:
def check_data_quality(papers):
    """Check data quality and completeness."""
    
    print('🔍 Data Quality Report')
    print('=' * 50)
    
    # Overall stats
    print(f'Total papers: {len(papers)}')
    
    # Check field completeness
    fields = ['title', 'abstract', 'authors', 'year']
    completeness = {}
    
    for field in fields:
        count = sum(1 for p in papers if p.get(field))
        # Guard against division by zero when the paper list is empty
        completeness[field] = count / len(papers) * 100 if papers else 0.0
    
    print('\n📊 Field Completeness:')
    for field, pct in completeness.items():
        bar = '█' * int(pct / 5)
        print(f'  {field:10s}: {bar:20s} {pct:.1f}%')
    
    # Check abstract lengths
    abstract_lengths = [len(p.get('abstract', '')) for p in papers]
    avg_length = np.mean(abstract_lengths) if abstract_lengths else 0
    
    print(f'\n📝 Abstract Statistics:')
    print(f'  Average length: {avg_length:.0f} characters')
    print(f'  Min length: {min(abstract_lengths) if abstract_lengths else 0}')
    print(f'  Max length: {max(abstract_lengths) if abstract_lengths else 0}')
    
    # Year distribution
    years = [p.get('year') for p in papers if p.get('year')]
    if years:
        print(f'\n📅 Year Distribution:')
        print(f'  Earliest: {min(years)}')
        print(f'  Latest: {max(years)}')
        print(f'  Median: {np.median(years):.0f}')
    
    # Quality score
    quality_score = np.mean(list(completeness.values()))
    
    print(f'\n⭐ Overall Quality Score: {quality_score:.1f}/100')
    
    if quality_score < 70:
        print('⚠️ Warning: Data quality is below recommended threshold (70%)')
        print('   Consider adding more complete paper metadata.')
    else:
        print('✅ Data quality is sufficient for survey generation')
    
    return quality_score

# Run quality check
if 'survey_papers' in locals() and survey_papers:
    quality_score = check_data_quality(survey_papers)
else:
    print('⚠️ No papers available for quality check')

## 🎯 Summary & Next Steps

### What We Covered
1. ✅ Loading papers from parquet files
2. ✅ Topic-based filtering
3. ✅ BM25 search engine usage
4. ✅ ArXiv API integration
5. ✅ JSON/CSV file handling
6. ✅ Combining multiple sources
7. ✅ Data preprocessing
8. ✅ Quality validation

### Key Takeaways
- **Flexibility**: Load papers from various sources
- **Scalability**: Handle datasets with 100,000+ papers
- **Quality**: Validate data before survey generation
- **Search**: Use BM25 for efficient relevance-based retrieval

### Next Steps
1. **Generate Survey**: Use loaded papers with survey generation systems
2. **Explore Trends**: Analyze temporal patterns in your dataset
3. **Custom Sources**: Add your own paper sources
4. **API Integration**: Use the FastAPI endpoints for programmatic access

### 📚 Related Notebooks
- **[02_survey_generation_comparison.ipynb](02_survey_generation_comparison.ipynb)** - Compare generation methods
- **[03_results_visualization.ipynb](03_results_visualization.ipynb)** - Visualize survey quality
- **[04_api_integration_example.ipynb](04_api_integration_example.ipynb)** - Use the REST API
- **[05_quick_start_tutorial.ipynb](05_quick_start_tutorial.ipynb)** - Quick start guide

### 💡 Tips
- For large datasets, use parquet files for best performance
- Combine multiple sources for comprehensive coverage
- Always validate data quality before generation
- Use BM25 search for topic-specific paper selection

Happy researching! 🚀