# Semantic Scholar Paper Collection Pipeline

This notebook downloads and filters academic papers from the [Semantic Scholar Academic Graph](https://www.semanticscholar.org/product/api) dataset. Papers are filtered by field of study and stored in a local DuckDB database for downstream analysis.

**Target Fields:** Physics, Computer Science, Psychology
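
To make the filtering rule concrete, here is a minimal sketch. It assumes each record exposes its fields of study as a list of dicts under `s2fieldsofstudy`, each with a `category` key, which is how the processing code in this notebook reads them. The helper `matched_categories` and the sample records are hypothetical and exist only for illustration.

```python
TARGET_FIELDS = {"Physics", "Computer Science", "Psychology"}

def matched_categories(record, targets=TARGET_FIELDS):
    """Return all of a record's categories if at least one is a target field, else an empty set."""
    entries = record.get("s2fieldsofstudy") or []
    # Keep only non-empty categories from well-formed (dict) entries
    categories = {
        entry["category"]
        for entry in entries
        if isinstance(entry, dict) and entry.get("category")
    }
    return categories if categories & targets else set()

# Hypothetical records, shaped the way this notebook assumes
sample_records = [
    {"corpusid": 1, "s2fieldsofstudy": [{"category": "Physics"}, {"category": "Engineering"}]},
    {"corpusid": 2, "s2fieldsofstudy": [{"category": "Medicine"}]},
    {"corpusid": 3, "s2fieldsofstudy": None},
]
kept = {r["corpusid"]: sorted(matched_categories(r)) for r in sample_records}
print(kept)
```

A paper is kept when any one of its categories is a target field, and all of its categories are retained, so a Physics paper that is also tagged Engineering keeps both labels.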


In [5]:
import pandas as pd
from typing import List, Dict, Optional
import requests
import duckdb
import time
from tqdm.notebook import tqdm
import os

## 1. Database Setup

Initialize the DuckDB connection and create two tables: one for the filtered papers and one for tracking per-file processing progress.

In [None]:
DB_DIR = 'db'                       # Directory that holds the database file(s)
os.makedirs(DB_DIR, exist_ok=True)  # No-op if the directory already exists
DB_FILENAME = 'ss_aws.db'           # Semantic Scholar data, served from AWS
DB_PATH = os.path.join(DB_DIR, DB_FILENAME)
conn = duckdb.connect(DB_PATH)
print(f"Connected to database at {DB_PATH}")

Connected to database at db/ss_aws.db


In [7]:
# Create table for filtered papers
conn.execute("""
    CREATE TABLE IF NOT EXISTS papers (
        corpusid VARCHAR PRIMARY KEY,
        title VARCHAR,
        publication_date VARCHAR,
        citation_count INTEGER,
        influential_citation_count INTEGER,
        field_of_study VARCHAR
    )
""")

# Create metadata table to track file processing status
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_metadata (
        file_index INTEGER PRIMARY KEY,
        file_url VARCHAR,
        status VARCHAR,  -- 'pending', 'processing', 'success', 'failed'
        records_processed INTEGER DEFAULT 0,
        records_inserted INTEGER DEFAULT 0,
        chunks_processed INTEGER DEFAULT 0,
        size_mb FLOAT DEFAULT 0,
        error_message VARCHAR,
        started_at TIMESTAMP,
        completed_at TIMESTAMP,
        last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

print("Tables 'papers' and 'file_metadata' created/verified")

# Show table schemas
print("\nPapers table schema:")
display(conn.execute("DESCRIBE papers").df())
print("\nFile metadata table schema:")
display(conn.execute("DESCRIBE file_metadata").df())

Tables 'papers' and 'file_metadata' created/verified

Papers table schema:


Unnamed: 0,column_name,column_type,null,key,default,extra
0,corpusid,VARCHAR,NO,PRI,,
1,title,VARCHAR,YES,,,
2,publication_date,VARCHAR,YES,,,
3,citation_count,INTEGER,YES,,,
4,influential_citation_count,INTEGER,YES,,,
5,field_of_study,VARCHAR,YES,,,



File metadata table schema:


Unnamed: 0,column_name,column_type,null,key,default,extra
0,file_index,INTEGER,NO,PRI,,
1,file_url,VARCHAR,YES,,,
2,status,VARCHAR,YES,,,
3,records_processed,INTEGER,YES,,0,
4,records_inserted,INTEGER,YES,,0,
5,chunks_processed,INTEGER,YES,,0,
6,size_mb,FLOAT,YES,,0,
7,error_message,VARCHAR,YES,,,
8,started_at,TIMESTAMP,YES,,,
9,completed_at,TIMESTAMP,YES,,,


## 2. Semantic Scholar API Configuration

Connect to the Semantic Scholar Datasets API to retrieve the papers dataset manifest.

In [None]:
API_KEY = "" #Semantic Scholar API Key
API_HEADERS = {"x-api-key": API_KEY}
S2_API_URL = "https://api.semanticscholar.org/datasets/v1/release/latest" # S2 = Semantic Scholar

The [dataset release endpoint](https://api.semanticscholar.org/api-docs/datasets) lists the datasets that are available for download in a given release.

In [None]:
release_info = requests.get(S2_API_URL, headers=API_HEADERS).json()
release_info

{'release_id': '2025-12-02',
 'README': 'Semantic Scholar Academic Graph Datasets\n\nThese datasets provide a variety of information about research papers taken from a snapshot in time of the Semantic Scholar corpus.\n\nThis site is provided by The Allen Institute for Artificial Intelligence (“AI2”) as a service to the\nresearch community. The site is covered by AI2 Terms of Use and Privacy Policy. AI2 does not claim\nownership of any materials on this site unless specifically identified. AI2 does not exercise editorial\ncontrol over the contents of this site. AI2 respects the intellectual property rights of others. If\nyou believe your copyright or trademark is being infringed by something on this site, please follow\nthe "DMCA Notice" process set out in the Terms of Use (https://allenai.org/terms).\n\nSAMPLE DATA ACCESS\nSample data files can be downloaded with the following UNIX command:\n\nfor f in $(curl https://s3-us-west-2.amazonaws.com/ai2-s2ag/samples/MANIFEST.txt)\n  do curl 

Here we fetch the manifest for the `papers` dataset, which contains the core attributes of each paper (title, authors, date, citation counts, fields of study, etc.). Note that this dataset does NOT include paper abstracts.

In [9]:
papers_dataset = requests.get('https://api.semanticscholar.org/datasets/v1/release/latest/dataset/papers', headers=API_HEADERS).json()
papers_dataset


{'name': 'papers',
 'description': 'The core attributes of a paper (title, authors, date, etc.).\n200M records in 30 1.5GB files.',
 'README': 'Semantic Scholar Academic Graph Datasets\n\nThe "papers" dataset provides core metadata about papers.\n\nSCHEMA\nSee https://api.semanticscholar.org/api-docs/graph#tag/Paper-Data\n\nThis dataset does not contain information about a paper\'s references or citations.\nInstead, join with citingPaperId/citedPaperId from the "citations" dataset.\n\nLICENSE\nThis collection is licensed under ODC-BY. (https://opendatacommons.org/licenses/by/1.0/)\n\nBy downloading this data you acknowledge that you have read and agreed to all the terms in this license.\n\nATTRIBUTION\nWhen using this data in a product or service, or including data in a redistribution, please cite the following paper:\n\nBibTex format:\n@misc{https://doi.org/10.48550/arxiv.2301.10140,\n  title = {The Semantic Scholar Open Data Platform},\n  author = {Kinney, Rodney and Anastasiades, Ch

As the response above shows, the dataset is split into multiple files stored in an AWS bucket.
These files are not organized by field of study, so we have to filter them ourselves.
Rather than downloading every file and filtering afterwards, we stream each file in chunks and filter on the fly.
This saves both bandwidth and storage.
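
The chunk-wise approach can be sketched in isolation as follows. This is an illustrative example rather than the pipeline itself: the helper name `iter_jsonl_from_gzip_chunks` is hypothetical, and the "download" is simulated by slicing an in-memory gzip payload into small pieces. It assumes each file is a single gzip stream of newline-delimited JSON.

```python
import gzip
import json
import zlib

def iter_jsonl_from_gzip_chunks(chunks):
    """Yield parsed JSON records from an iterable of gzip-compressed byte chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    buffer = b""
    for chunk in chunks:
        buffer += decompressor.decompress(chunk)
        # Everything before the last newline is a run of complete lines;
        # the remainder stays in the buffer until more data arrives
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    # Flush bytes still held by the decompressor, then parse a final unterminated line
    buffer += decompressor.flush()
    if buffer.strip():
        yield json.loads(buffer)

# Simulate a download: compress three records, then feed them back in 16-byte chunks
sample = [{"corpusid": i, "title": f"Paper {i}"} for i in range(3)]
payload = gzip.compress("\n".join(json.dumps(r) for r in sample).encode("utf-8"))
chunks = [payload[i:i + 16] for i in range(0, len(payload), 16)]
records = list(iter_jsonl_from_gzip_chunks(chunks))
print(len(records), records[-1]["title"])
```

Because only the current chunk and one partial line are held in memory, the memory footprint stays small regardless of file size.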

In [None]:
# Initialize file metadata for all files in the dataset
def initialize_file_metadata():
    """Initialize metadata entries for all files if not already present"""
    for idx, file_url in enumerate(papers_dataset['files']):
        conn.execute("""
            INSERT OR IGNORE INTO file_metadata (file_index, file_url, status)
            VALUES (?, ?, 'pending')
        """, [idx, file_url])
    
    total_files = len(papers_dataset['files'])
    print(f"Initialized metadata for {total_files} files")
    
    # Show status summary
    status_summary = conn.execute("""
        SELECT status, COUNT(*) as count 
        FROM file_metadata 
        GROUP BY status 
        ORDER BY count DESC
    """).df()
    print("\nCurrent status summary:")
    display(status_summary)

def update_expired_urls():
    """Update URLs in metadata table with fresh URLs from the API"""
    print("Updating URLs with fresh tokens...")
    global papers_dataset
    # Get fresh dataset with new URLs
    fresh_dataset = requests.get(
        'https://api.semanticscholar.org/datasets/v1/release/latest/dataset/papers', 
        headers=API_HEADERS
    ).json()
    
    if len(fresh_dataset['files']) != len(papers_dataset['files']):
        print(f" WARNING: File count mismatch!")
        print(f" Old: {len(papers_dataset['files'])}, New: {len(fresh_dataset['files'])}")
        print(f" URLs may not align correctly!")
    
    # Update URLs in the database
    updated_count = 0
    for idx, new_url in enumerate(fresh_dataset['files']):
        result = conn.execute("""
            UPDATE file_metadata 
            SET file_url = ?,
                last_updated = CURRENT_TIMESTAMP
            WHERE file_index = ?
        """, [new_url, idx])
        updated_count += 1
    
    print(f"Updated {updated_count} URLs in metadata table")
    
    # Update the global variable as well
    papers_dataset = fresh_dataset
    
    return fresh_dataset
    
initialize_file_metadata()

Initialized metadata for 60 files

Current status summary:


Unnamed: 0,status,count
0,success,36
1,failed,21
2,processing,3


## 3. Streaming Download & Processing

Download compressed JSONL files, _decompress on-the-fly_, filter by field of study, and insert matching records into DuckDB. Processing is batched and resumable (failed or pending files can be retried without reprocessing successful ones).
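
The batching and resume logic can be illustrated without any network or database access. The sketch below is a simplification: it treats every file not marked `success` as still needing work, and the helper names and the status snapshot are hypothetical.

```python
def select_unfinished(statuses):
    """Return indices of files not yet marked 'success', in ascending order."""
    return sorted(idx for idx, status in statuses.items() if status != "success")

def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical status snapshot keyed by file index
statuses = {
    0: "success", 1: "failed", 2: "pending", 3: "success",
    4: "failed", 5: "pending", 6: "failed",
}
todo = select_unfinished(statuses)
batches = make_batches(todo, batch_size=3)
print(batches)  # [[1, 2, 4], [5, 6]]
```

Successful files never re-enter the work list, so rerunning after a failure only repeats the unfinished files.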

In [None]:
import asyncio
import aiohttp
from pathlib import Path
import time as time_module
import gzip
import json
from io import BytesIO
from collections import deque

# Configuration
BATCH_SIZE = 3  # Number of parallel file downloads
DELAY_BETWEEN_BATCHES = 2  # Seconds to wait between batches
DELAY_BETWEEN_FILES = 0.5  # Seconds to wait between individual files
CHUNK_SIZE = 256 * 1024  # 256KB chunks for streaming
RECORDS_PER_BATCH = 5000  # Process records in batches
TARGET_FIELDS = {"Physics", "Computer Science", "Psychology"}  # Fields of study to keep

# To write directly to DB
processing_queue = asyncio.Queue()

print(f"Total files to download: {len(papers_dataset['files'])}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Chunk size: {CHUNK_SIZE/1024:.0f} KB")
print(f"Records processed per batch: {RECORDS_PER_BATCH}")
print(f"Target fields of study: {TARGET_FIELDS}")

Total files to download: 60
Batch size: 3
Chunk size: 256 KB
Records processed per batch: 5000
Target fields of study: {'Physics', 'Computer Science', 'Psychology'}


In [None]:
async def process_records_batch(records, file_index):
    """Process a batch of records and write filtered ones to DuckDB"""
    records_to_insert = []
    
    for record in records:
        # Get fields of study
        fields_of_study = record.get('s2fieldsofstudy', [])
        
        if not fields_of_study:
            continue
        title = record.get('title', '')
        if not title:
            continue
        
        all_fields = set(field.get('category', '') for field in fields_of_study if isinstance(field, dict))
        # remove empty strings
        all_fields.discard('')
        
        # Check if any field matches our target fields
        if not all_fields.intersection(TARGET_FIELDS):
            continue
        all_fields_str = ', '.join(sorted(all_fields))
        paper_record = {
            'corpusid': record.get('corpusid'),
            'title': title,
            'publication_date': record.get('publicationdate'),
            'citation_count': record.get('citationcount'),
            'influential_citation_count': record.get('influentialcitationcount'),
            'field_of_study': all_fields_str
        }
        records_to_insert.append(paper_record)
            
    # Write to DuckDB
    if records_to_insert:
        df_batch = pd.DataFrame(records_to_insert)
        conn.execute("""
            INSERT OR IGNORE INTO papers 
            SELECT * FROM df_batch
        """)
    
    return len(records_to_insert)

In [None]:
async def stream_and_process_file(session, url, index, semaphore, pbar_position):
    """Stream file, decompress on-the-fly, and process JSONL records as they arrive"""
    async with semaphore:
        pbar = None
        
        # Update status to 'processing' and set started_at
        conn.execute("""
            UPDATE file_metadata 
            SET status = 'processing', 
                started_at = CURRENT_TIMESTAMP,
                last_updated = CURRENT_TIMESTAMP
            WHERE file_index = ?
        """, [index])
        
        try:
            timeout = aiohttp.ClientTimeout(total=30000, connect=10, sock_read=60)
            async with session.get(url, timeout=timeout, headers = API_HEADERS) as response:
                if response.status != 200:
                    # Update metadata with failure
                    conn.execute("""
                        UPDATE file_metadata 
                        SET status = 'failed',
                            error_message = ?,
                            completed_at = CURRENT_TIMESTAMP,
                            last_updated = CURRENT_TIMESTAMP
                        WHERE file_index = ?
                    """, [f"HTTP {response.status}", index])
                    
                    return {
                        'index': index,
                        'status': 'failed',
                        'error': f"HTTP {response.status}",
                        'records_processed': 0,
                        'records_inserted': 0
                    }
                
                # Create progress bar for this file
                pbar = tqdm(
                    total=0,  # We don't know total size initially
                    desc=f"File {index}",
                    position=pbar_position,
                    leave=True,
                    unit='rec',
                    unit_scale=True,
                    ncols=100,   
                    mininterval=1.0  # Update at most once per second
                )
                
                # Stream and decompress in chunks. zlib is used rather than gzip.GzipFile
                # because a decompressobj accepts data incrementally as chunks arrive.
                import zlib
                # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
                decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
                buffer = b''
                records_processed = 0
                records_inserted = 0
                chunk_num = 0
                total_bytes = 0
                records_batch = []
                
                async for compressed_chunk in response.content.iter_chunked(CHUNK_SIZE):
                    chunk_num += 1
                    total_bytes += len(compressed_chunk)
                    
                    # Decompress chunk
                    try:
                        decompressed = decompressor.decompress(compressed_chunk)
                        buffer += decompressed
                        
                        # Process complete lines from buffer
                        while b'\n' in buffer:
                            line, buffer = buffer.split(b'\n', 1)
                            if line.strip():
                                try:
                                    record = json.loads(line.decode('utf-8'))
                                    records_batch.append(record)
                                    records_processed += 1
                                    
                                    # Process in batches
                                    if len(records_batch) >= RECORDS_PER_BATCH:
                                        inserted = await process_records_batch(records_batch, index)
                                        records_inserted += inserted
                                        
                                        # Update metadata periodically
                                        conn.execute("""
                                            UPDATE file_metadata 
                                            SET records_processed = ?,
                                                records_inserted = ?,
                                                chunks_processed = ?,
                                                size_mb = ?,
                                                last_updated = CURRENT_TIMESTAMP
                                            WHERE file_index = ?
                                        """, [records_processed, records_inserted, chunk_num, 
                                              total_bytes/1024/1024, index])
                                        
                                        # Update progress bar
                                        pbar.total = records_processed
                                        pbar.n = records_processed
                                        pbar.set_postfix({
                                            'inserted': f'{records_inserted:,}',
                                            'chunks': chunk_num,
                                            'MB': f'{total_bytes/1024/1024:.1f}'
                                        })
                                        pbar.refresh()
                                        
                                        records_batch = []
                                        
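                                except (json.JSONDecodeError, UnicodeDecodeError):
                                    # Skip malformed lines
                                    continue
                    except zlib.error as e:
                        # A corrupted gzip stream cannot be resumed mid-file, so fail the
                        # whole file (instead of marking partial data as success) and let
                        # resume mode retry it later
                        raise RuntimeError(f"Decompression failed at chunk {chunk_num}: {e}") from e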
                
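                # Flush any bytes the decompressor still holds, then process a
                # final record that has no trailing newline
                buffer += decompressor.flush()
                if buffer.strip():
                    try:
                        record = json.loads(buffer.decode('utf-8'))
                        records_batch.append(record)
                        records_processed += 1
                    except (json.JSONDecodeError, UnicodeDecodeError):
                        pass  # Ignore a truncated or malformed trailing line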
                
                if records_batch:
                    inserted = await process_records_batch(records_batch, index)
                    records_inserted += inserted
                
                # Final metadata update with success status
                conn.execute("""
                    UPDATE file_metadata 
                    SET status = 'success',
                        records_processed = ?,
                        records_inserted = ?,
                        chunks_processed = ?,
                        size_mb = ?,
                        completed_at = CURRENT_TIMESTAMP,
                        last_updated = CURRENT_TIMESTAMP,
                        error_message = NULL
                    WHERE file_index = ?
                """, [records_processed, records_inserted, chunk_num, 
                      total_bytes/1024/1024, index])
                
                # Final update
                pbar.total = records_processed
                pbar.n = records_processed
                pbar.set_postfix({
                    'inserted': f'{records_inserted:,}',
                    'chunks': chunk_num,
                    'MB': f'{total_bytes/1024/1024:.1f}',
                    'status': '✓'
                })
                pbar.refresh()
                pbar.close()
                
                return {
                    'index': index,
                    'status': 'success',
                    'records_processed': records_processed,
                    'records_inserted': records_inserted,
                    'chunks': chunk_num,
                    'size_mb': total_bytes / 1024 / 1024
                }
                
        except asyncio.TimeoutError:
            error_msg = 'Timeout'
            conn.execute("""
                UPDATE file_metadata 
                SET status = 'failed',
                    error_message = ?,
                    completed_at = CURRENT_TIMESTAMP,
                    last_updated = CURRENT_TIMESTAMP
                WHERE file_index = ?
            """, [error_msg, index])
            
            if pbar:
                pbar.set_postfix({'status': 'Timeout'})
                pbar.close()
            return {
                'index': index,
                'status': 'failed',
                'error': 'Timeout',
                'records_processed': 0,
                'records_inserted': 0
            }
        except Exception as e:
            error_msg = str(e)[:200]
            conn.execute("""
                UPDATE file_metadata 
                SET status = 'failed',
                    error_message = ?,
                    completed_at = CURRENT_TIMESTAMP,
                    last_updated = CURRENT_TIMESTAMP
                WHERE file_index = ?
            """, [error_msg, index])
            
            if pbar:
                pbar.set_postfix({'status': f'✗ {str(e)[:20]}'})
                pbar.close()
            return {
                'index': index,
                'status': 'failed',
                'error': str(e)[:200],
                'records_processed': 0,
                'records_inserted': 0
            }
        finally:
            await asyncio.sleep(DELAY_BETWEEN_FILES)

In [None]:
async def download_all_files(urls, file_indices):
    """Download and process all files in batches with delays"""
    semaphore = asyncio.Semaphore(BATCH_SIZE)
    results = []
    
    # Create session with connector settings
    connector = aiohttp.TCPConnector(limit=BATCH_SIZE, limit_per_host=BATCH_SIZE)
    timeout = aiohttp.ClientTimeout(total=300000)
    
    async with aiohttp.ClientSession(connector=connector, timeout=timeout, headers=API_HEADERS) as session:
        total_batches = (len(urls) + BATCH_SIZE - 1) // BATCH_SIZE
        
        for batch_num in range(total_batches):
            start_idx = batch_num * BATCH_SIZE
            end_idx = min((batch_num + 1) * BATCH_SIZE, len(urls))
            batch_urls = urls[start_idx:end_idx]
            batch_indices = file_indices[start_idx:end_idx]
            
            print(f"\n{'='*70}")
            print(f"Batch {batch_num + 1}/{total_batches} (files {batch_indices[0]}-{batch_indices[-1]})")
            print(f"{'='*70}\n")
            
            # Create tasks for this batch with proper pbar positioning
            tasks = [
                stream_and_process_file(session, url, file_idx, semaphore, i)
                for i, (url, file_idx) in enumerate(zip(batch_urls, batch_indices))
            ]
            
            # Download this batch
            batch_results = await asyncio.gather(*tasks, return_exceptions=True)
            
            # Handle any exceptions
            for i, result in enumerate(batch_results):
                if isinstance(result, Exception):
                    batch_results[i] = {
                        'index': batch_indices[i],
                        'status': 'failed',
                        'error': str(result)[:200],
                        'records_processed': 0,
                        'records_inserted': 0
                    }
            
            results.extend(batch_results)
            
            # Count records from this batch
            batch_processed = sum(r.get('records_processed', 0) for r in batch_results if isinstance(r, dict))
            batch_inserted = sum(r.get('records_inserted', 0) for r in batch_results if isinstance(r, dict))
            
            # Get total count from database
            total_in_db = conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
            
            # Get overall processing status
            status_counts = conn.execute("""
                SELECT status, COUNT(*) as count 
                FROM file_metadata 
                GROUP BY status
            """).df()
            
            # Show progress for this batch
            successful = sum(1 for r in batch_results if isinstance(r, dict) and r['status'] == 'success')
            failed = len(batch_results) - successful
            
            print(f"\n{'='*70}")
            print(f"Batch {batch_num + 1} Summary:")
            print(f"   Success: {successful}, Failed: {failed}")
            print(f"   Processed: {batch_processed:,}")
            print(f"   Inserted: {batch_inserted:,}")
            print(f"   Total in DB: {total_in_db:,}")
            print(f"\n   Overall Status:")
            for _, row in status_counts.iterrows():
                print(f"   {row['status']:>12}: {row['count']:>5}")
            print(f"{'='*70}")
            
            # Delay before next batch (except after the last batch)
            if batch_num < total_batches - 1:
                print(f"\nWaiting {DELAY_BETWEEN_BATCHES}s before next batch...\n")
                await asyncio.sleep(DELAY_BETWEEN_BATCHES)
    
    return results

### Execute Pipeline

In [None]:
# Start the download and processing
TEST_LIMIT = None  # Set to None to process all files
RESUME_MODE = True  # Set to True to skip already processed files
REFRESH_URLS = True  # Set to True to get fresh URLs before processing

print("Starting streaming download and processing...")
print(f"   Writing directly to DuckDB at: {DB_PATH}")
print(f"   Filtering for fields: {TARGET_FIELDS}")
print(f"   Resume mode: {RESUME_MODE}\n")

# Refresh URLs if needed (to handle expired tokens)
if REFRESH_URLS:
    print("Refreshing URLs with new tokens...")
    update_expired_urls()
    print()

# Get list of files to process based on their status
if RESUME_MODE:
    # Only process files that are pending or failed, plus any left in
    # 'processing' by an interrupted earlier run
    pending_files = conn.execute("""
        SELECT file_index, file_url 
        FROM file_metadata 
        WHERE status IN ('pending', 'failed', 'processing')
        ORDER BY file_index
    """).df()
    
    if TEST_LIMIT:
        pending_files = pending_files.head(TEST_LIMIT)
    
    file_indices = pending_files['file_index'].tolist()
    urls_to_process = pending_files['file_url'].tolist()
    
    print(f"Resume mode: Found {len(urls_to_process)} files to process")
    print(f"   (Skipping already successful files)\n")
else:
    # Process all files
    if TEST_LIMIT:
        urls_to_process = papers_dataset['files'][:TEST_LIMIT]
        file_indices = list(range(TEST_LIMIT))
    else:
        urls_to_process = papers_dataset['files']
        file_indices = list(range(len(papers_dataset['files'])))
    
    print(f"Processing {len(urls_to_process)} files from scratch\n")

if len(urls_to_process) == 0:
    print("All files already processed!")
else:
    start_time = time_module.time()
    
    results = await download_all_files(urls_to_process, file_indices)
    
    # Summary
    end_time = time_module.time()
    elapsed = end_time - start_time
    successful = sum(1 for r in results if r['status'] == 'success')
    failed = len(results) - successful
    total_size = sum(r.get('size_mb', 0) for r in results if r['status'] == 'success')
    total_processed = sum(r.get('records_processed', 0) for r in results if r['status'] == 'success')
    total_inserted = sum(r.get('records_inserted', 0) for r in results if r['status'] == 'success')
    
    # Get final count from database
    final_count = conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
    
    # Get overall status summary
    overall_status = conn.execute("""
        SELECT 
            status,
            COUNT(*) as count,
            SUM(records_processed) as total_processed,
            SUM(records_inserted) as total_inserted,
            SUM(size_mb) as total_mb
        FROM file_metadata
        GROUP BY status
        ORDER BY count DESC
    """).df()
    
    print(f"\n{'='*70}")
    print(f"Final Summary")
    print(f"{'='*70}")
    print(f"Files in this run: {len(results)}")
    print(f"Successful: {successful}")
    print(f"Failed: {failed}")
    print(f"Records processed: {total_processed:,}")
    print(f"Records inserted: {total_inserted:,}")
    print(f"Total papers in DB: {final_count:,}")
    print(f"Data processed: {total_size:.2f} MB")
    print(f"Time elapsed: {elapsed:.2f}s")
    if elapsed > 0 and total_processed > 0:
        print(f"Processing rate: {total_processed/elapsed:.1f} records/sec")
    if total_processed > 0:
        print(f"Filter rate: {total_inserted/total_processed*100:.1f}% matched target fields")
    
    print(f"\n{'='*70}")
    print(f"Overall Status (All Files)")
    print(f"{'='*70}")
    display(overall_status)
    
    # Show failed downloads if any
    if failed > 0:
        print(f"\nFailed downloads in this run:")
        for r in results:
            if r['status'] == 'failed':
                print(f"   - File {r['index']}: {r.get('error', 'Unknown error')}")

Starting streaming download and processing...
   Writing directly to DuckDB at: db/ss_aws.db
   Filtering for fields: {'Physics', 'Computer Science', 'Psychology'}
   Resume mode: True

Refreshing URLs with new tokens...
Updating URLs with fresh tokens...
Updated 60 URLs in metadata table

Resume mode: Found 6 files to process
   (Skipping already successful files)


Batch 1/2 (files 24-26)



File 26: 0.00rec [00:00, ?rec/s]

File 24: 0.00rec [00:00, ?rec/s]

File 25: 0.00rec [00:00, ?rec/s]


Batch 1 Summary:
   Success: 3, Failed: 0
   Processed: 10,260,081
   Inserted: 1,924,612
   Total in DB: 41,844,141

   Overall Status:
        success:    57
         failed:     3

Waiting 2s before next batch...


Batch 2/2 (files 39-41)




File 39: 0.00rec [00:00, ?rec/s]

File 40: 0.00rec [00:00, ?rec/s]

File 41: 0.00rec [00:00, ?rec/s]


Batch 2 Summary:
   Success: 3, Failed: 0
   Processed: 8,819,014
   Inserted: 1,653,557
   Total in DB: 43,337,660

   Overall Status:
        success:    60

Final Summary
Files in this run: 6
Successful: 6
Failed: 0
Records processed: 19,079,095
Records inserted: 3,578,169
Total papers in DB: 43,337,660
Data processed: 4004.50 MB
Time elapsed: 2835.61s
Processing rate: 6728.4 records/sec
Filter rate: 18.8% matched target fields

Overall Status (All Files)


Unnamed: 0,status,count,total_processed,total_inserted,total_mb
0,success,60,230861043.0,43337660.0,48463.997223


## 4. Data Validation & Inspection

Verify collected data and review processing status.
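
One caveat when inspecting the results: the `field_of_study` column stores each paper's categories as a single comma-joined string, so grouping by that column counts combinations of fields, not individual fields. A minimal sketch of counting individual fields is shown here on a handful of sample strings. The helper `count_individual_fields` is hypothetical, and the sketch assumes category names never contain commas.

```python
from collections import Counter

def count_individual_fields(field_strings):
    """Count how often each individual category appears across comma-joined strings."""
    counts = Counter()
    for joined in field_strings:
        if not joined:
            continue  # Skip NULL or empty values
        counts.update(part.strip() for part in joined.split(","))
    return counts

# Sample values in the format stored in the papers table
rows = [
    "Computer Science, Engineering",
    "Physics",
    "Computer Science",
    "Medicine, Psychology",
    "Engineering, Materials Science, Physics",
]
counts = count_individual_fields(rows)
print(counts.most_common())
```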

In [48]:
# Check the data in the database
print("Sample of inserted records:")
conn.execute("SELECT * FROM papers LIMIT 10").df()

Sample of inserted records:


Unnamed: 0,corpusid,title,publication_date,citation_count,influential_citation_count,field_of_study
0,15035112,Design of Low Power Monolithic DC-DC Buck Conv...,2005-06-16,41,3,"Engineering, Materials Science, Physics"
1,44907541,Mobile grid data access for high resolution we...,2012-09-01,0,0,"Computer Science, Engineering, Environmental S..."
2,250918402,Volume illumination for two-dimensional partic...,,0,0,"Engineering, Physics"
3,270742948,Distributed model predictive control for coope...,2024-06-25,0,0,"Computer Science, Engineering"
4,123623111,The global existence of weak solutions to bipo...,,0,0,"Mathematics, Physics"
5,145532803,Stressful life circumstances: Concepts and mea...,1990-07-01,60,2,"Medicine, Psychology"
6,13369830,ON THE FEASIBILITY OF THE LINEAR SAMPLING METH...,,35,1,"Computer Science, Engineering, Environmental S..."
7,8972695,StreamNF: Performance and Correctness for Stat...,2016-12-05,6,2,"Computer Science, Engineering"
8,61774153,Making Classifications (at) Work,2005-08-01,39,0,"Computer Science, Sociology"
9,151543243,Buddhist Coping as a Predictor of Psychologica...,,0,0,Psychology


In [49]:
# View file processing metadata
print("File processing status:")
conn.execute("""
    SELECT 
        file_index,
        status,
        records_processed,
        records_inserted,
        chunks_processed,
        ROUND(size_mb, 2) as size_mb,
        error_message,
        started_at,
        completed_at
    FROM file_metadata
    ORDER BY file_index
""").df()

File processing status:


Unnamed: 0,file_index,status,records_processed,records_inserted,chunks_processed,size_mb,error_message,started_at,completed_at
0,0,success,3158439,592820,5666,663.080017,,2025-12-09 08:28:27.928266,2025-12-09 08:36:07.748816
1,1,success,3195517,600589,5872,670.919983,,2025-12-09 08:28:27.929750,2025-12-09 08:34:14.811897
2,2,success,4879473,916414,9127,1024.060059,,2025-12-09 08:28:27.931202,2025-12-09 08:37:24.674371
3,3,success,2722340,510873,2901,571.669983,,2025-12-09 08:37:27.197114,2025-12-09 08:43:02.806653
4,4,success,4880159,916859,7466,1024.26001,,2025-12-09 08:37:27.201303,2025-12-09 08:45:47.305216
5,5,success,2970520,557718,3274,623.309998,,2025-12-09 08:37:27.203092,2025-12-09 08:43:17.866997
6,6,success,4876107,915513,4796,1024.060059,,2025-12-09 08:45:49.818770,2025-12-09 08:54:58.841907
7,7,success,4877910,916189,5060,1024.050049,,2025-12-09 08:45:49.823116,2025-12-09 08:55:20.351818
8,8,success,2868777,537999,2790,602.349976,,2025-12-09 08:45:49.825023,2025-12-09 08:52:10.969090
9,9,success,4876815,914391,5761,1024.040039,,2025-12-09 08:55:22.873233,2025-12-09 09:05:44.758571


In [50]:
# View failed files with details
print("Failed files (if any):")
conn.execute("""
    SELECT 
        file_index,
        error_message,
        records_processed,
        chunks_processed,
        started_at,
        completed_at
    FROM file_metadata
    WHERE status = 'failed'
    ORDER BY file_index
""").df()

Failed files (if any):


Unnamed: 0,file_index,error_message,records_processed,chunks_processed,started_at,completed_at


In [51]:
# Check distribution by field of study
print("Distribution by field of study:")
conn.execute("""
    SELECT field_of_study, COUNT(*) as count 
    FROM papers 
    GROUP BY field_of_study 
    ORDER BY count DESC
""").df()

Distribution by field of study:


Unnamed: 0,field_of_study,count
0,"Computer Science, Engineering",4835024
1,Physics,3252090
2,Computer Science,2820754
3,"Medicine, Psychology",2798358
4,"Engineering, Materials Science, Physics",2253801
...,...,...
12425,"Biology, Computer Science, Economics, Educatio...",1
12426,"Agricultural And Food Sciences, Art, Computer ...",1
12427,"Business, Computer Science, Engineering, Geogr...",1
12428,"Computer Science, Education, Environmental Sci...",1


In [None]:
# Look up specific papers by title
conn.execute("""
    SELECT * 
    FROM papers 
    WHERE LOWER(title) LIKE '%attention is all%'
""").df()

Unnamed: 0,corpusid,title,publication_date,citation_count,influential_citation_count,field_of_study
0,279319234,TransMLA: Multi-Head Latent Attention Is All Y...,,0,0,"Computer Science, Psychology"
1,238419336,Attention is All You Need? Good Embeddings wit...,2021-10-07,3,0,"Computer Science, Engineering"
2,245263755,Re-Attention Is All You Need: Memory-Efficient...,2021-09-27,3,0,Computer Science
3,252438864,Attention is All They Need: Exploring the Medi...,2022-09-22,2,0,Computer Science
4,13756489,Attention is All you Need,2017-06-12,155623,18678,Computer Science
5,268681594,Attention is all you need for boosting graph c...,2024-03-10,0,0,Computer Science
6,275134048,Attention Is All You Need For Mixture-of-Depth...,2024-12-30,4,0,Computer Science
7,274486518,TransferAttn: Transferable-guided Attention Is...,,0,0,Computer Science
8,274238417,Sparse attention is all you need for pre-train...,2024-11-20,4,0,"Business, Computer Science"
9,239012212,"Yes, ""Attention Is All You Need"", for Exemplar...",2021-10-17,35,4,Computer Science


In [53]:
conn.close()