# 📊 Notebook 1: Data Collection

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pavannn16/CS5660-BERTopic-arXiv/blob/main/notebooks/01_data_collection.ipynb)

**Purpose:** Fetch about 20,000 recent arXiv cs.AI paper abstracts using the arXiv API (19,900 unique papers in the recorded run).

**Time:** ~15 minutes (API rate limited)

---

## 1. Setup and Installation

In [20]:
# Install required packages (run once in Colab)
!pip install arxiv pandas tqdm -q

In [21]:
# ============================================================
# PROJECT PATH SETUP - Works on Colab Web, VS Code, or Local
# ============================================================

import os
from pathlib import Path

# Detect environment and set project path
if 'google.colab' in str(get_ipython()):
    # Running on Google Colab - mount Drive
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_PATH = '/content/drive/MyDrive/CS5660_BERTopic_arXiv'
    print("✅ Running on Google Colab")
else:
    # Running locally (VS Code, Jupyter, etc.)
    PROJECT_PATH = str(Path(os.getcwd()).parent) if 'notebooks' in os.getcwd() else os.getcwd()
    print("✅ Running locally")

# Create directory structure
for folder in ['data/raw', 'data/processed', 'data/embeddings', 'models', 'results/visualizations']:
    os.makedirs(f'{PROJECT_PATH}/{folder}', exist_ok=True)

print(f"📁 Project path: {PROJECT_PATH}")
print("📂 Directories ready:")

Project path: /content
Directories created:
  ✓ data/raw
  ✓ data/processed
  ✓ data/embeddings
  ✓ models
  ✓ results


In [22]:
# Import libraries
import arxiv
import pandas as pd
from datetime import datetime, timedelta
from tqdm import tqdm
import json
import time

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Configuration

In [23]:
# Configuration
CONFIG = {
    'category': 'cs.AI',           # arXiv category
    'max_results': 20000,          # Maximum papers to fetch (full scope)
    'months_back': 24,             # How many months of data (2 years)
    'batch_size': 100,             # Papers per API request
    'delay_seconds': 3.0,          # Delay between requests (be respectful)
}

# Calculate date range
end_date = datetime.now()
start_date = end_date - timedelta(days=CONFIG['months_back'] * 30)

print(f"Category: {CONFIG['category']}")
print(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
print(f"Max results: {CONFIG['max_results']}")

Category: cs.AI
Date range: 2023-12-13 to 2025-12-02
Max results: 20000
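
The date range above treats a month as 30 days, so 24 months becomes 720 days rather than two calendar years (hence 2023-12-13 instead of 2023-12-02). If calendar-exact boundaries matter, a minimal sketch using only the standard library might look like the following. The helper name `months_ago` is hypothetical and not part of the notebook.

```python
import calendar
from datetime import datetime


def months_ago(end, months):
    """Return the datetime `months` calendar months before `end`.

    Hypothetical helper: the day is clamped to the last valid day of the
    target month (for example, March 31 minus one month gives February 29
    in a leap year).
    """
    total = end.year * 12 + (end.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(end.day, calendar.monthrange(year, month)[1])
    return end.replace(year=year, month=month, day=day)


end = datetime(2025, 12, 2)
print(months_ago(end, 24).strftime("%Y-%m-%d"))
```

Note that `start_date` is only printed in this notebook; the fetch in the next section is bounded by `max_results`, not by this date.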


## 3. Fetch Papers from arXiv API

In [24]:
def fetch_arxiv_papers(category, max_results, batch_size=100, delay=3.0):
    """
    Fetch papers from arXiv API.
    
    Args:
        category: arXiv category (e.g., 'cs.AI')
        max_results: Maximum number of papers
        batch_size: Papers per API request
        delay: Delay between requests in seconds
    
    Returns:
        List of paper dictionaries
    """
    query = f"cat:{category}"
    
    print(f"Fetching up to {max_results} papers from arXiv category: {category}")
    print(f"This may take {max_results * delay / 60 / batch_size:.1f} minutes...")
    
    # Configure search
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending
    )
    
    # Configure client
    client = arxiv.Client(
        page_size=batch_size,
        delay_seconds=delay,
        num_retries=5
    )
    
    papers = []
    
    try:
        for result in tqdm(client.results(search), total=max_results, desc="Fetching"):
            paper = {
                "arxiv_id": result.entry_id.split("/")[-1],
                "title": result.title.replace("\n", " ").strip(),
                "abstract": result.summary.replace("\n", " ").strip(),
                "authors": ", ".join([author.name for author in result.authors[:5]]),  # First 5 authors
                "date": result.published.strftime("%Y-%m-%d"),
                "year_month": result.published.strftime("%Y-%m"),
                "url": result.entry_id,
                "categories": ", ".join(result.categories),
                "primary_category": result.primary_category
            }
            papers.append(paper)
            
            if len(papers) >= max_results:
                break
                
    except Exception as e:
        print(f"\nError during fetch: {e}")
        print(f"Successfully fetched {len(papers)} papers before error")
    
    return papers

# Fetch papers
papers = fetch_arxiv_papers(
    category=CONFIG['category'],
    max_results=CONFIG['max_results'],
    batch_size=CONFIG['batch_size'],
    delay=CONFIG['delay_seconds']
)

print(f"\nTotal papers fetched: {len(papers)}")

Fetching up to 20000 papers from arXiv category: cs.AI
This may take 10.0 minutes...


Fetching:  50%|█████     | 10000/20000 [06:09<06:09, 27.10it/s]


Error during fetch: Page request resulted in HTTP 500 (https://export.arxiv.org/api/query?search_query=cat%3Acs.AI&id_list=&sortBy=submittedDate&sortOrder=descending&start=10000&max_results=100)
Successfully fetched 10000 papers before error

Total papers fetched: 10000
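
The record-building step inside the fetch loop can be factored into a small helper so that any later fetch function reuses the same field mapping. The sketch below assumes only the attributes the notebook already reads from each result (`entry_id`, `title`, `summary`, `authors`, `published`, `categories`, `primary_category`). The name `paper_to_record` is hypothetical, and the `SimpleNamespace` object is a stand-in for a real API result, used here so the example runs offline.

```python
from datetime import datetime
from types import SimpleNamespace


def paper_to_record(result, max_authors=5):
    """Flatten an arxiv.Result-like object into a plain dictionary.

    Hypothetical helper mirroring the fields built in the fetch loop.
    """
    return {
        "arxiv_id": result.entry_id.split("/")[-1],
        "title": result.title.replace("\n", " ").strip(),
        "abstract": result.summary.replace("\n", " ").strip(),
        "authors": ", ".join(a.name for a in result.authors[:max_authors]),
        "date": result.published.strftime("%Y-%m-%d"),
        "year_month": result.published.strftime("%Y-%m"),
        "url": result.entry_id,
        "categories": ", ".join(result.categories),
        "primary_category": result.primary_category,
    }


# Stand-in result object (not a real API response) for an offline check.
fake_result = SimpleNamespace(
    entry_id="http://arxiv.org/abs/2512.01107v1",
    title="Foundation\nPriors ",
    summary="Foundation models, and in particular\nlarge language models...",
    authors=[SimpleNamespace(name="Sanjog Misra")],
    published=datetime(2025, 11, 30),
    categories=["cs.AI", "econ.EM", "stat.ML"],
    primary_category="cs.AI",
)
record = paper_to_record(fake_result)
print(record["arxiv_id"], "|", record["title"], "|", record["year_month"])
```

Inside the loop, `papers.append(paper_to_record(result))` would then replace the inline dictionary.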





In [25]:
# The fetch above failed with HTTP 500 at offset 10,000, so a single query
# appears to be capped at roughly 10,000 results.
# To get more papers, query an earlier date range separately.
# We already have the 10,000 most recent papers; this batch covers the period before them.

def fetch_arxiv_by_date_range(category, start_date, end_date, max_results=10000, batch_size=100, delay=3.0):
    """Fetch papers within a specific date range."""
    # Format dates for arXiv query
    start_str = start_date.strftime("%Y%m%d")
    end_str = end_date.strftime("%Y%m%d")
    
    query = f"cat:{category} AND submittedDate:[{start_str}0000 TO {end_str}2359]"
    
    print(f"Fetching papers from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
    
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending
    )
    
    client = arxiv.Client(
        page_size=batch_size,
        delay_seconds=delay,
        num_retries=5
    )
    
    papers = []
    
    try:
        for result in tqdm(client.results(search), total=max_results, desc="Fetching"):
            paper = {
                "arxiv_id": result.entry_id.split("/")[-1],
                "title": result.title.replace("\n", " ").strip(),
                "abstract": result.summary.replace("\n", " ").strip(),
                "authors": ", ".join([author.name for author in result.authors[:5]]),
                "date": result.published.strftime("%Y-%m-%d"),
                "year_month": result.published.strftime("%Y-%m"),
                "url": result.entry_id,
                "categories": ", ".join(result.categories),
                "primary_category": result.primary_category
            }
            papers.append(paper)
            
            if len(papers) >= max_results:
                break
                
    except Exception as e:
        print(f"\nError: {e}")
        print(f"Fetched {len(papers)} papers before error")
    
    return papers

# Find the earliest date we have
earliest_date = pd.to_datetime(min([p['date'] for p in papers]))
print(f"Earliest paper in current batch: {earliest_date.strftime('%Y-%m-%d')}")

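# Fetch earlier papers (before our current earliest date).
# Note: this window ends the day before earliest_date, so any papers from
# earliest_date itself that lay beyond the first batch's cutoff are not fetched.
# Ending the window at earliest_date and relying on the deduplication step
# in the next cell would close that gap.
end_date_batch2 = earliest_date - timedelta(days=1)
start_date_batch2 = earliest_date - timedelta(days=180)  # 6 months earlier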

print(f"\nFetching additional papers from {start_date_batch2.strftime('%Y-%m-%d')} to {end_date_batch2.strftime('%Y-%m-%d')}")

papers_batch2 = fetch_arxiv_by_date_range(
    category=CONFIG['category'],
    start_date=start_date_batch2,
    end_date=end_date_batch2,
    max_results=10000,
    batch_size=CONFIG['batch_size'],
    delay=CONFIG['delay_seconds']
)

print(f"\nAdditional papers fetched: {len(papers_batch2)}")

Earliest paper in current batch: 2025-09-25

Fetching additional papers from 2025-03-29 to 2025-09-24
Fetching papers from 2025-03-29 to 2025-09-24


Fetching: 100%|█████████▉| 9999/10000 [05:37<00:00, 29.64it/s]


Additional papers fetched: 10000
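
The second batch also hit the 10,000 cap, so the 180-day window was only partly covered. One way to make the date-range approach more predictable is to split the period into several shorter windows and issue one query per window. The sketch below only builds the query strings, using the same `submittedDate:[... TO ...]` format as the cell above. The name `build_window_queries` and the 30-day default are assumptions for illustration; whether each window stays under the cap depends on submission volume.

```python
from datetime import datetime, timedelta


def build_window_queries(category, start, end, window_days=30):
    """Split [start, end] into consecutive windows and build one query each.

    Hypothetical helper: it returns query strings only and does not call
    the API.
    """
    queries = []
    window_start = start
    while window_start <= end:
        # Each window covers `window_days` whole days, clipped to `end`.
        window_end = min(window_start + timedelta(days=window_days - 1), end)
        queries.append(
            f"cat:{category} AND submittedDate:"
            f"[{window_start:%Y%m%d}0000 TO {window_end:%Y%m%d}2359]"
        )
        window_start = window_end + timedelta(days=1)
    return queries


queries = build_window_queries("cs.AI", datetime(2025, 7, 1), datetime(2025, 8, 15))
for q in queries:
    print(q)
```

Each query string could then be passed to `arxiv.Search(query=...)` exactly as in `fetch_arxiv_by_date_range`, with the results concatenated and deduplicated afterwards.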





In [26]:
# Check what we have so far
print(f"Batch 1 (recent): {len(papers)} papers")
print(f"Batch 2 (earlier): {len(papers_batch2)} papers")

# Combine and deduplicate
all_papers = papers + papers_batch2
seen_ids = set()
unique_papers = []
for p in all_papers:
    if p['arxiv_id'] not in seen_ids:
        seen_ids.add(p['arxiv_id'])
        unique_papers.append(p)

print(f"Total unique papers: {len(unique_papers)}")

# Check date range
dates = [p['date'] for p in unique_papers]
print(f"Date range: {min(dates)} to {max(dates)}")

Batch 1 (recent): 10000 papers
Batch 2 (earlier): 10000 papers
Total unique papers: 19900
Date range: 2025-07-01 to 2025-12-01
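
The deduplication above keys on the full `arxiv_id`, which includes a version suffix (for example `2512.01107v1`). Should two versions of the same paper ever appear in the combined list, they would be kept as separate records. A minimal sketch that keys on the base ID instead is shown below. The helper names are hypothetical, and it is an assumption that collapsing versions is desirable for this dataset.

```python
import re


def base_arxiv_id(arxiv_id):
    """Strip a trailing version suffix such as 'v2' from an arXiv ID."""
    return re.sub(r"v\d+$", "", arxiv_id)


def dedupe_by_base_id(records):
    """Keep the first record seen for each base arXiv ID, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = base_arxiv_id(record["arxiv_id"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Toy records for illustration only.
sample = [
    {"arxiv_id": "2512.01107v2"},
    {"arxiv_id": "2512.01105v1"},
    {"arxiv_id": "2512.01107v1"},
]
deduped = dedupe_by_base_id(sample)
print([r["arxiv_id"] for r in deduped])
```

Because the batches are sorted newest first, keeping the first occurrence retains whichever version was fetched earliest in the combined list.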


In [27]:
# Use the combined unique papers
papers = unique_papers
print(f"✅ Final dataset: {len(papers)} papers")
print(f"📅 Date range: {min(dates)} to {max(dates)} (recent 5 months)")
print(f"🎯 Collected {len(papers):,} of the {CONFIG['max_results']:,} targeted cs.AI papers")

✅ Final dataset: 19900 papers
📅 Date range: 2025-07-01 to 2025-12-01 (recent 5 months)
🎯 Collected 19,900 of the 20,000 targeted cs.AI papers


## 4. Create DataFrame and Explore Data

In [28]:
# Create DataFrame
df = pd.DataFrame(papers)

print(f"DataFrame shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")
df.head()

DataFrame shape: (19900, 9)

Columns: ['arxiv_id', 'title', 'abstract', 'authors', 'date', 'year_month', 'url', 'categories', 'primary_category']

Date range: 2025-07-01 to 2025-12-01


| | arxiv_id | title | abstract | authors | date | year_month | url | categories | primary_category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2512.01107v1 | Foundation Priors | Foundation models, and in particular large lan... | Sanjog Misra | 2025-11-30 | 2025-11 | http://arxiv.org/abs/2512.01107v1 | cs.AI, econ.EM, stat.ML | cs.AI |
| 1 | 2512.01105v1 | Supporting Productivity Skill Development in C... | College students often face academic challenge... | Himanshi Lalwani, Hanan Salam | 2025-11-30 | 2025-11 | http://arxiv.org/abs/2512.01105v1 | cs.RO, cs.AI, cs.HC | cs.RO |
| 2 | 2512.01099v1 | Energy-Aware Data-Driven Model Selection in LL... | As modern artificial intelligence (AI) systems... | Daria Smirnova, Hamid Nasiri, Marta Adamska, Z... | 2025-11-30 | 2025-11 | http://arxiv.org/abs/2512.01099v1 | cs.AI | cs.AI |
| 3 | 2512.01097v1 | Discriminative classification with generative ... | We introduce Smart Bayes, a new classification... | Zachary Terner, Alexander Petersen, Yuedong Wang | 2025-11-30 | 2025-11 | http://arxiv.org/abs/2512.01097v1 | stat.ML, cs.AI, cs.LG, stat.CO, stat.ME | stat.ML |
| 4 | 2512.01095v1 | CycliST: A Video Language Model Benchmark for ... | We present CycliST, a novel benchmark dataset ... | Simon Kohaut, Daniel Ochs, Shun Zhang, Benedic... | 2025-11-30 | 2025-11 | http://arxiv.org/abs/2512.01095v1 | cs.CV, cs.AI, cs.LG | cs.CV |


In [29]:
# Basic statistics
print("Dataset Statistics:")
print(f"  Total papers: {len(df)}")
print(f"  Unique dates: {df['date'].nunique()}")
print(f"  Date range: {df['date'].min()} to {df['date'].max()}")

# Text length statistics
df['title_len'] = df['title'].str.len()
df['abstract_len'] = df['abstract'].str.len()

print(f"\nTitle length: mean={df['title_len'].mean():.0f}, median={df['title_len'].median():.0f}")
print(f"Abstract length: mean={df['abstract_len'].mean():.0f}, median={df['abstract_len'].median():.0f}")

Dataset Statistics:
  Total papers: 19900
  Unique dates: 154
  Date range: 2025-07-01 to 2025-12-01

Title length: mean=83, median=83
Abstract length: mean=1339, median=1340


In [30]:
# Papers per month
papers_per_month = df['year_month'].value_counts().sort_index()
print("Papers per month:")
print(papers_per_month)

Papers per month:
year_month
2025-07    3375
2025-08    3819
2025-09    4135
2025-10    4821
2025-11    3734
2025-12      16
Name: count, dtype: int64


In [31]:
# Sample abstracts
print("Sample abstracts:")
print("=" * 80)
for i, row in df.head(3).iterrows():
    print(f"\nTitle: {row['title']}")
    print(f"Date: {row['date']}")
    print(f"Abstract: {row['abstract'][:300]}...")
    print("-" * 80)

Sample abstracts:

Title: Foundation Priors
Date: 2025-11-30
Abstract: Foundation models, and in particular large language models, can generate highly informative responses, prompting growing interest in using these ''synthetic'' outputs as data in empirical research and decision-making. This paper introduces the idea of a foundation prior, which shows that model-gener...
--------------------------------------------------------------------------------

Title: Supporting Productivity Skill Development in College Students through Social Robot Coaching: A Proof-of-Concept
Date: 2025-11-30
Abstract: College students often face academic challenges that hamper their productivity and well-being. Although self-help books and productivity apps are popular, they often fall short. Books provide generalized, non-interactive guidance, and apps are not inherently educational and can hinder the developmen...
--------------------------------------------------------------------------------

Title: Ener

## 5. Save Raw Data

In [32]:
# Save raw data as JSON
raw_json_path = f"{PROJECT_PATH}/data/raw/arxiv_cs_ai_raw.json"
with open(raw_json_path, 'w', encoding='utf-8') as f:  # explicit encoding for portability
    json.dump(papers, f, indent=2)
print(f"Raw JSON saved to: {raw_json_path}")

# Save as CSV (more convenient for pandas)
raw_csv_path = f"{PROJECT_PATH}/data/raw/arxiv_cs_ai_raw.csv"
df.to_csv(raw_csv_path, index=False)
print(f"Raw CSV saved to: {raw_csv_path}")

print(f"\nTotal records saved: {len(df)}")

Raw JSON saved to: /content/data/raw/arxiv_cs_ai_raw.json
Raw CSV saved to: /content/data/raw/arxiv_cs_ai_raw.csv

Total records saved: 19900
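
Since later notebooks depend on these files, it can be worth confirming that what was written reads back unchanged. The following is a minimal sketch of a JSON round-trip check, run here against a temporary directory and a one-record toy list. The function name `save_json_verified` and the choice of `ensure_ascii=False` are assumptions, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def save_json_verified(records, path):
    """Write records as UTF-8 JSON, reload them, and confirm nothing changed.

    Hypothetical helper: raises ValueError if the reloaded data differs.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)
    if reloaded != records:
        raise ValueError(f"Round-trip mismatch for {path}")
    return len(reloaded)


# Toy record and temporary directory, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    sample = [{"arxiv_id": "2512.01107v1", "title": "Foundation Priors"}]
    count = save_json_verified(sample, Path(tmp) / "data" / "raw" / "sample.json")
    print(f"Verified {count} record(s)")
```

In the notebook, the same call could be made with `papers` and `raw_json_path`.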


## 6. Data Quality Checks

In [33]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Check for duplicates
n_duplicates = df['arxiv_id'].duplicated().sum()
print(f"\nDuplicate arxiv_ids: {n_duplicates}")

# Check for empty or very short abstracts
short_abstracts = (df['abstract'].str.len() < 50).sum()
print(f"Short abstracts (<50 chars): {short_abstracts}")

Missing values:
arxiv_id            0
title               0
abstract            0
authors             0
date                0
year_month          0
url                 0
categories          0
primary_category    0
title_len           0
abstract_len        0
dtype: int64

Duplicate arxiv_ids: 0
Short abstracts (<50 chars): 0
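
The three checks above can also be bundled into one function that returns counts, which makes it easy to assert on them before moving to preprocessing. This is a sketch over the list-of-dictionaries form of the data (like `papers`), not the DataFrame. The name `quality_report`, the `REQUIRED_FIELDS` tuple, and the sample records are assumptions for illustration.

```python
from collections import Counter

# Fields assumed to be mandatory for downstream topic modeling.
REQUIRED_FIELDS = ("arxiv_id", "title", "abstract", "date")


def quality_report(records, min_abstract_chars=50):
    """Count missing fields, duplicate IDs, and short abstracts."""
    id_counts = Counter(r.get("arxiv_id") for r in records)
    return {
        "total": len(records),
        # Records where any required field is absent or empty.
        "missing_fields": sum(
            1 for r in records if any(not r.get(f) for f in REQUIRED_FIELDS)
        ),
        # Occurrences beyond the first, matching pandas' duplicated().sum().
        "duplicate_ids": sum(c - 1 for c in id_counts.values() if c > 1),
        "short_abstracts": sum(
            1 for r in records
            if len(r.get("abstract") or "") < min_abstract_chars
        ),
    }


# Toy records for illustration only.
sample = [
    {"arxiv_id": "a1", "title": "T1", "abstract": "x" * 60, "date": "2025-11-30"},
    {"arxiv_id": "a1", "title": "T2", "abstract": "short", "date": "2025-11-30"},
    {"arxiv_id": "a2", "title": "", "abstract": "y" * 60, "date": "2025-11-29"},
]
report = quality_report(sample)
print(report)
```

A clean dataset would report zero for everything except `total`, as the recorded output above does for the real data.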


## Summary

This notebook has:
1. ✅ Fetched 19,900 unique arXiv cs.AI papers using the API, in two batches
2. ✅ Created a DataFrame with paper metadata
3. ✅ Performed initial data exploration
4. ✅ Saved raw data as JSON and CSV under `data/raw` in the project path (Google Drive when run on Colab)

**Next step:** Run `02_preprocessing.ipynb` to clean and prepare the text data.