# arXiv Version (Structured Like Your WOS Pipeline)

### Load arXiv Metadata (Streaming – Don’t Load 1.5GB Fully)

In [6]:
import json
import pandas as pd
from tqdm import tqdm

file_path = "../../../../arxiv-metadata-oai-snapshot.json"



records = []

# Stream line by line: the snapshot holds one JSON record per line and is too large to load at once
with open(file_path, "r", encoding="utf-8") as f:
    for line in tqdm(f):
        paper = json.loads(line)

        # Keep only papers whose first-listed (primary) category is CS or Physics (taxonomy goal)
        if paper["categories"].startswith(("cs.", "physics.")):
            records.append({
                "topic": paper["title"] + " " + paper["abstract"],
                "categories": paper["categories"]
            })

# Convert to DataFrame and display it
df_arxiv = pd.DataFrame(records)
df_arxiv

2951540it [00:23, 124315.35it/s]


Unnamed: 0,topic,categories
0,The evolution of the Earth-Moon system based o...,physics.gen-ph
1,Convergence of the discrete dipole approximati...,physics.optics physics.comp-ph
2,Convergence of the discrete dipole approximati...,physics.optics physics.comp-ph
3,The discrete dipole approximation for simulati...,physics.optics physics.comp-ph
4,The discrete dipole approximation: an overview...,physics.optics physics.comp-ph
...,...,...
924080,"Variational methods, multiprecision and nonrel...",physics.atom-ph physics.comp-ph
924081,Effective interaction between helical bio-mole...,physics.bio-ph physics.chem-ph physics.comp-ph...
924082,Atom-optics hologram in the time domain The ...,physics.atom-ph physics.optics
924083,A Second-Order Stochastic Leap-Frog Algorithm ...,physics.comp-ph
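
Because `str.startswith` is applied to the raw `categories` string, the filter tests only the first-listed category, so a cross-listed paper whose first category lies outside CS or Physics is dropped. The sketch below illustrates the difference on a few made-up records; the helpers `primary_matches` and `any_matches` are hypothetical names introduced here, not part of the original notebook.

```python
import json

PREFIXES = ("cs.", "physics.")

def primary_matches(categories: str) -> bool:
    """True if the first-listed category starts with one of PREFIXES."""
    return categories.startswith(PREFIXES)

def any_matches(categories: str) -> bool:
    """True if any listed category starts with one of PREFIXES."""
    return any(c.startswith(PREFIXES) for c in categories.split())

# Made-up JSON lines mimicking the one-record-per-line layout of the snapshot
lines = [
    '{"title": "A", "abstract": "a", "categories": "cs.LG cs.AI"}',
    '{"title": "B", "abstract": "b", "categories": "math.NA cs.LG"}',
    '{"title": "C", "abstract": "c", "categories": "hep-th"}',
]
papers = [json.loads(line) for line in lines]

kept_primary = [p["title"] for p in papers if primary_matches(p["categories"])]
kept_any = [p["title"] for p in papers if any_matches(p["categories"])]

print(kept_primary)  # ['A']
print(kept_any)      # ['A', 'B']
```

Which variant is appropriate depends on whether cross-listed papers should count toward the CS/Physics taxonomy; the notebook's filter corresponds to `primary_matches`.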


# arXiv Dataset Subsampling
The full arXiv metadata snapshot contains about 2.95 million records, of which over 900,000 remain after the CS + Physics filter. Generating embeddings for all of them would significantly increase runtime, memory usage, and API costs without adding meaningful evaluation value for this study.

To keep computation feasible while preserving hierarchical diversity, we randomly sampled 30,000 papers from the filtered CS + Physics subset. This sample size is sufficient to:

- Maintain a rich hierarchical structure across subject categories
- Enable robust clustering evaluation
- Provide statistically meaningful benchmarking results
- Keep embedding and PHATE computation tractable

The sampling used a fixed random seed (`random_state=42`) to ensure reproducibility.

In [8]:
df_arxiv = df_arxiv.sample(30000, random_state=42).reset_index(drop=True)
df_arxiv

Unnamed: 0,topic,categories
0,Optimized Cloud Resource Allocation Using Gene...,cs.DC cs.AI
1,Control of Rayleigh-like waves in thick plate ...,physics.geo-ph cond-mat.mtrl-sci
2,Deep Text Classification Can be Fooled In th...,cs.CR cs.LG
3,Clip-TTS: Contrastive Text-content and Mel-spe...,cs.SD cs.AI cs.CL cs.HC cs.LG eess.AS
4,Convex Cauchy Schwarz Independent Component An...,cs.IT math.IT
...,...,...
29995,HyperAttention: Long-context Attention in Near...,cs.LG cs.AI
29996,Maximum Likelihood Estimation of Power-law Deg...,cs.SI physics.data-an physics.soc-ph
29997,SCL(FOL) Can Simulate Non-Redundant Superposit...,cs.LO cs.AI cs.SC
29998,Submodlib: A Submodular Optimization Library ...,cs.LG cs.IR
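
One way to check that a uniform random sample roughly preserves the category mix is to compare top-level domain shares before and after sampling. The following is a minimal sketch on synthetic data (the 70/30 split, the frame `full`, and the helper `top_level_share` are assumptions made for illustration, not values from the notebook):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the filtered frame: about 70% "cs" and 30% "physics" primaries
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "categories": rng.choice(["cs.LG", "physics.optics"], size=100_000, p=[0.7, 0.3])
})

# Same sampling call as in the notebook, with a smaller n for the toy data
sample = full.sample(3_000, random_state=42).reset_index(drop=True)

def top_level_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of rows per top-level domain of the first-listed category."""
    top = df["categories"].str.split().str[0].str.split(".").str[0]
    return top.value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "full": top_level_share(full),
    "sample": top_level_share(sample),
})
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()
print(comparison)
```

Small values in `abs_diff` suggest the sample reflects the domain proportions of the full set; rare subcategories can still be under-represented in a uniform sample.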


In [9]:
df_arxiv.head()

Unnamed: 0,topic,categories
0,Optimized Cloud Resource Allocation Using Gene...,cs.DC cs.AI
1,Control of Rayleigh-like waves in thick plate ...,physics.geo-ph cond-mat.mtrl-sci
2,Deep Text Classification Can be Fooled In th...,cs.CR cs.LG
3,Clip-TTS: Contrastive Text-content and Mel-spe...,cs.SD cs.AI cs.CL cs.HC cs.LG eess.AS
4,Convex Cauchy Schwarz Independent Component An...,cs.IT math.IT

# Extract Primary Category + Top Domain

In [11]:
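def extract_categories(cat_string):
    """Return (top-level domain, primary category) from a space-separated category string."""
    primary = cat_string.split()[0]      # first listed category, e.g. "cs.DC"
    top_level = primary.split(".")[0]    # part before the dot, e.g. "cs"
    return top_level, primary

# Build both columns in one pass; avoids constructing a pd.Series for every row
df_arxiv[["category 0", "category 1"]] = pd.DataFrame(
    df_arxiv["categories"].map(extract_categories).tolist(),
    index=df_arxiv.index,
)

df_arxiv = df_arxiv[["topic", "category 0", "category 1"]]
df_arxiv.head()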


Unnamed: 0,topic,category 0,category 1
0,Optimized Cloud Resource Allocation Using Gene...,cs,cs.DC
1,Control of Rayleigh-like waves in thick plate ...,physics,physics.geo-ph
2,Deep Text Classification Can be Fooled In th...,cs,cs.CR
3,Clip-TTS: Contrastive Text-content and Mel-spe...,cs,cs.SD
4,Convex Cauchy Schwarz Independent Component An...,cs,cs.IT
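
With the two category columns in place, the shape of the hierarchy can be inspected directly. The sketch below uses a toy frame with the same column names as the notebook (the rows themselves are made up) to count distinct primary categories per top-level domain and papers per (domain, primary category) pair:

```python
import pandas as pd

# Toy frame with the same column names as the notebook's output
toy = pd.DataFrame({
    "topic": ["t1", "t2", "t3", "t4"],
    "category 0": ["cs", "cs", "cs", "physics"],
    "category 1": ["cs.DC", "cs.CR", "cs.DC", "physics.geo-ph"],
})

# Number of distinct second-level categories under each top-level domain
children_per_domain = toy.groupby("category 0")["category 1"].nunique()

# Number of papers for each (domain, primary category) pair
pair_counts = toy.groupby(["category 0", "category 1"]).size()

print(children_per_domain)
print(pair_counts)
```

Running the same two `groupby` calls on `df_arxiv` would show how many leaf categories the 30,000-paper sample covers and how unevenly they are populated.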


# Generate Embeddings (Same as WOS)
Important: embedding 30,000 documents with "text-embedding-3-large" incurs API costs.
If cost is a concern, use the cheaper "text-embedding-3-small" for arXiv instead.
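
Before calling the API, a back-of-the-envelope cost estimate can help choose between the two models. The sketch below is only a rough approximation: it assumes about four characters per token (a common heuristic, not an exact tokenizer count), and the price is a placeholder parameter rather than a real quote. The helper `estimate_embedding_cost` is a hypothetical name introduced here.

```python
import pandas as pd

def estimate_embedding_cost(texts, usd_per_million_tokens, chars_per_token=4.0):
    """Rough estimate: approximate tokens as characters / chars_per_token."""
    total_chars = sum(len(t) for t in texts)
    est_tokens = total_chars / chars_per_token
    est_cost = est_tokens / 1_000_000 * usd_per_million_tokens
    return est_tokens, est_cost

# Toy stand-in for df_arxiv["topic"]
topics = pd.Series(["a" * 400, "b" * 1_600])

# PLACEHOLDER price, not a real quote: substitute the provider's current rate
PLACEHOLDER_USD_PER_MILLION = 1.0

tokens, cost = estimate_embedding_cost(topics, PLACEHOLDER_USD_PER_MILLION)
print(f"~{tokens:,.0f} tokens, ~${cost:.6f}")
```

Passing `df_arxiv["topic"]` and each model's current per-token rate gives a comparable estimate for both options; an exact count would require the model's actual tokenizer.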