# ANN-based Novelty Computation Pipeline

This notebook constructs a temporally consistent **approximate nearest neighbor (ANN)**
pipeline to compute **novelty metrics for scientific papers** based on embedding similarity
to prior work.

It performs the following steps:
1. Downloads paper metadata and embedding files from Hugging Face Hub
2. Merges multiple metadata databases into a unified view
3. Joins metadata with embeddings to create an ANN-ready table
4. Sorts papers by publication date for forward-in-time indexing
5. Builds an incremental FAISS HNSW index over paper embeddings
6. Computes multiple novelty metrics using nearest-neighbor distances from papers indexed so far
7. Stores and uploads the resulting metrics database to Hugging Face Hub

**Input Databases:**
- `S2_papers_cleaned.db` – Cleaned Semantic Scholar metadata
- `S2_papers_cleaned_additional_papers.db` – Additional cleaned papers
- `embeddings/embeddings_*.db` – Sharded embedding databases

**Output:**
- `ann_pipeline.db` – ANN-ready papers with embeddings sorted by publication date
- `metrics.db` – Per-paper novelty metrics computed via ANN search
- Hugging Face Dataset:  
  [`lalit3c/S2_CS_PHY_PYSCH_papers`](https://huggingface.co/datasets/lalit3c/S2_CS_PHY_PYSCH_papers)

FAISS documentation:
- https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexHNSW.html
- https://github.com/facebookresearch/faiss/wiki


In [None]:
import duckdb
import numpy as np
import faiss
from datetime import date
import glob
from huggingface_hub import snapshot_download, upload_file
import os
import pandas as pd

In [2]:
os.makedirs("data", exist_ok=True)
snapshot_download(
    repo_id="lalit3c/S2_CS_PHY_PYSCH_papers",
    repo_type="dataset",
    local_dir="data",
    allow_patterns=[
        "S2_papers_cleaned.db",
        "S2_papers_cleaned_additional_papers.db",
        "embeddings/**"
    ]
)



'/Users/kanchanpoudel/uni/DataLit/DataLit-Scientific-Novelty-Impact/data'

Merge the Semantic Scholar metadata databases with the sharded embedding files into a single ANN-ready table.

In [None]:
# CONFIGS
META_DBS = [
    "./data/S2_papers_cleaned.db",
    "./data/S2_papers_cleaned_additional_papers.db"
]

META_TABLE = "papers_with_abstracts"

EMBED_FILES_PATTERN = "./data/embeddings/embeddings_*.db"
EMBED_TABLE = "embeddings"

OUTPUT_DB = "./data/ann_pipeline.db"
ANN_READY_TABLE = "papers_ann_ready"
ANN_SORTED_TABLE = "papers_ann_sorted"

# Columns to merge from metadata
MERGE_COLUMNS = ["corpusid", "title", "publication_date", "citation_count"]
MERGE_COLS_STR = ", ".join(MERGE_COLUMNS)

# Connect to output DuckDB
con = duckdb.connect(OUTPUT_DB)

# Attach metadata DBs
for idx, db_path in enumerate(META_DBS, start=1):
    con.execute(f"ATTACH DATABASE '{db_path}' AS meta{idx};")

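# Create a unified metadata VIEW across every attached metadata DB
# (built from META_DBS so the view stays in sync with the ATTACH loop above)
union_sql = "\nUNION ALL\n".join(
    f"SELECT {MERGE_COLS_STR} FROM meta{idx}.{META_TABLE}"
    for idx in range(1, len(META_DBS) + 1)
)
con.execute(f"CREATE OR REPLACE VIEW meta_all AS\n{union_sql};")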

# Create an empty ANN-ready table (replaced on every run so that
# re-running this cell does not append duplicate rows)
con.execute(f"""
CREATE OR REPLACE TABLE {ANN_READY_TABLE} (
    corpus_id BIGINT,
    pub_date DATE,
    embedding FLOAT[]
);
""")

# Process each embeddings file
embedding_files = sorted(glob.glob(EMBED_FILES_PATTERN))

for idx, emb_file in enumerate(embedding_files, 1):
    print(f"Processing embedding file {idx}/{len(embedding_files)}: {emb_file}")

    con.execute(f"ATTACH DATABASE '{emb_file}' AS emb;")

    con.execute(f"""
        INSERT INTO {ANN_READY_TABLE}
        SELECT
            m.corpusid,
            m.publication_date,
            e.embedding
        FROM meta_all m
        JOIN emb.{EMBED_TABLE} e
        USING (corpusid);
    """)

    con.execute("DETACH DATABASE emb;")

print("All embedding files merged.")

# Sort the final table chronologically (OR REPLACE lets the cell be re-run)
con.execute(f"""
CREATE OR REPLACE TABLE {ANN_SORTED_TABLE} AS
SELECT *
FROM {ANN_READY_TABLE}
ORDER BY pub_date, corpus_id;
""")

con.close()

print(f"Final ANN-ready DB created: {OUTPUT_DB} with table '{ANN_SORTED_TABLE}'")


Processing embedding file 1/30: ./data/embeddings/embeddings_1.db
Processing embedding file 2/30: ./data/embeddings/embeddings_10.db
Processing embedding file 3/30: ./data/embeddings/embeddings_11.db
Processing embedding file 4/30: ./data/embeddings/embeddings_12.db
Processing embedding file 5/30: ./data/embeddings/embeddings_13.db
Processing embedding file 6/30: ./data/embeddings/embeddings_14.db
Processing embedding file 7/30: ./data/embeddings/embeddings_15.db
Processing embedding file 8/30: ./data/embeddings/embeddings_16.db
Processing embedding file 9/30: ./data/embeddings/embeddings_17.db
Processing embedding file 10/30: ./data/embeddings/embeddings_18.db
Processing embedding file 11/30: ./data/embeddings/embeddings_19.db
Processing embedding file 12/30: ./data/embeddings/embeddings_2.db
Processing embedding file 13/30: ./data/embeddings/embeddings_20.db
Processing embedding file 14/30: ./data/embeddings/embeddings_21.db
Processing embedding file 15/30: ./data/embeddings/embeddin


Final ANN-ready DB created: ./data/ann_pipeline.db with table 'papers_ann_sorted'


In [3]:
# Sanity check: inspect the sorted ANN-ready table

con = duckdb.connect("./data/ann_pipeline.db", read_only=True)

# Load the table into a pandas DataFrame
df = con.execute("SELECT * FROM papers_ann_sorted").fetch_df()

# View the first few rows
print(df.head())

# Checking the shape and column names
print("Shape:", df.shape)
print("Columns:", df.columns)

con.close()



   corpus_id   pub_date                                          embedding
0      60282 2000-01-01  [-0.015204819, 0.7857948, -1.3321525, 0.037485...
1     755730 2000-01-01  [-0.32926, 0.587181, -0.51643664, 0.7690042, -...
2     757185 2000-01-01  [0.35842383, -0.14089696, 0.1665931, -0.651995...
3     858325 2000-01-01  [1.0989523, -0.2769418, -0.35953367, 0.3855169...
4    1119702 2000-01-01  [0.56514174, 0.09447052, 0.1302544, 0.22057152...
Shape: (2564613, 3)
Columns: Index(['corpus_id', 'pub_date', 'embedding'], dtype='object')


In [None]:
# CONFIGS

# Path to the DuckDB database containing the paper metadata and embeddings
DUCKDB_PATH = "./data/ann_pipeline.db"
TABLE_NAME = "papers_ann_sorted"

# Dimensionality of the embedding vector
EMBED_DIM = 768

# Number of nearest neighbors to retrieve per query
K = 100

# HNSW parameters:
INDEX_M = 32 # Maximum number of connections per node (higher = better recall, more memory)

EF_CONSTRUCTION = 200 # Controls accuracy vs. indexing time during index build

EF_SEARCH = 50 # Controls recall vs. query-time latency during ANN search

# Beginning publication date considered in experiments
START_DATE = date(2000, 1, 1)

# Last publication date considered in experiments
END_DATE   = date(2026, 1, 1)

# Read-only connection for the source paper data
source_con = duckdb.connect(DUCKDB_PATH, read_only=True)

# Writable connection for storing computed metrics
neighbors_con = duckdb.connect("./data/metrics.db")


In [8]:
tables = source_con.execute("SHOW TABLES").fetchall()
print(tables)

[('meta_all',), ('papers_ann_ready',), ('papers_ann_sorted',)]


In [9]:
# HNSW-based FAISS index (default metric: L2) wrapped in an ID map so that
# searches return corpus IDs instead of insertion positions
base_index = faiss.IndexHNSWFlat(EMBED_DIM, INDEX_M)
base_index.hnsw.efConstruction = EF_CONSTRUCTION
base_index.hnsw.efSearch = EF_SEARCH
index = faiss.IndexIDMap2(base_index)

In [10]:
# Table to store per-paper novelty metrics computed from nearest-neighbor distances
neighbors_con.execute("""
CREATE TABLE IF NOT EXISTS paper_neighbors_metrics (
    corpus_id BIGINT,
    novelty_mean FLOAT,
    novelty_max FLOAT,
    novelty_min FLOAT,
    novelty_median FLOAT,
    novelty_harmonic FLOAT
)
""")


<_duckdb.DuckDBPyConnection at 0x15b8432f0>

In [11]:
# Iteration utility: returns the first day of the next calendar month for a given date

def add_month(d):
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    else:
        return date(d.year, d.month + 1, 1)
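
As a quick illustration of how `add_month` can drive the monthly iteration, here is a minimal sketch that walks half-open month windows across a year boundary. The helper `month_windows` is a hypothetical name introduced only for this example; it is not part of the notebook.

```python
from datetime import date


def add_month(d):
    """Return the first day of the calendar month after d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def month_windows(start, end):
    """Yield half-open [month_start, next_month_start) windows (hypothetical helper)."""
    current = start
    while current < end:
        nxt = add_month(current)
        yield current, nxt
        current = nxt


# Three windows that cross the December -> January boundary
windows = list(month_windows(date(2000, 11, 1), date(2001, 2, 1)))
for lo, hi in windows:
    print(lo, "->", hi)
```

Because each window is half-open, a paper dated on the first of a month belongs to exactly one batch.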

**Monthly batch processing strategy**

- The iterator walks through papers in **monthly batches**, yielding corpus IDs and their corresponding embeddings.
- All papers published within the **same calendar month** are processed together.
- Novelty metrics for a batch are computed **only against papers from earlier months** (past batches). Papers within the same batch are not compared with one another, which keeps the computation efficient.
- Papers in the first batch have no earlier papers to compare against, so they are indexed but receive no metrics.
- This strategy relies on the assumption that papers published within the **same month of the same year** are unlikely to substantially influence one another.
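
The strategy above can be sketched on toy data with brute-force NumPy in place of FAISS. Everything in this example is an assumption made for illustration: the synthetic 4-dimensional embeddings, the batch sizes, and the use of exact (not approximate) search. It only shows the ordering "score against the past, then add to the past".

```python
import numpy as np


def unit(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Three synthetic "months" of unit-norm embeddings (toy data, not real papers)
batches = [unit(rng.normal(size=(n, 4))) for n in (3, 2, 4)]

past = np.empty((0, 4))  # stands in for the incremental index
scores = []
for batch in batches:
    if len(past) > 0:
        # Cosine similarity of every new paper to every earlier paper
        sims = batch @ past.T
        # Cosine distance to the closest earlier paper
        scores.append(1.0 - sims.max(axis=1))
    else:
        # The first month has nothing to compare against
        scores.append(None)
    # Only after scoring does the batch become part of the "past"
    past = np.vstack([past, batch])
```

Here `scores[0]` is `None`, while later entries hold one value per paper in that month, computed without ever comparing papers from the same month.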


In [None]:

def monthly_batch_iterator(con, table, start_date, end_date):
    current = start_date
    while current < end_date:
        next_month = add_month(current)
        df = con.execute(f"""
            SELECT corpus_id, embedding
            FROM {table}
            WHERE pub_date >= ?
              AND pub_date < ?
            ORDER BY pub_date
        """, [current, next_month]).fetch_df()
        if not df.empty:
            yield (
                df["corpus_id"].to_numpy(),
                np.vstack(df["embedding"].to_numpy())
            )
        current = next_month


In [13]:
epsilon = 1e-6

for corpus_ids, embeddings in monthly_batch_iterator(
        source_con, TABLE_NAME, START_DATE, END_DATE
    ):

    # Normalize embeddings
    embeddings = embeddings.astype("float32")
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    embeddings = embeddings / np.clip(norms, 1e-10, None)

    if index.ntotal > 0:
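        # Search neighbors (IndexIDMap2 makes FAISS return corpus IDs)
        distances, neighbors = index.search(embeddings, K)
        neighbors = neighbors.astype(int)

        # FAISS L2 indexes return squared distances. For unit vectors,
        # ||a - b||^2 = 2 - 2 * cos(a, b), so cos(a, b) = 1 - d^2 / 2.
        # Negative similarities are clipped to 0.
        cosine_sim = np.clip(1 - distances / 2, 0.0, 1.0)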

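        # Novelty metrics, each based on cosine distance (1 - cosine similarity).
        # Naming caveat: novelty_max uses the maximum similarity, so it is the
        # distance to the NEAREST neighbor; novelty_min uses the minimum
        # similarity, so it is the distance to the FARTHEST of the K neighbors.
        novelty_mean = 1 - cosine_sim.mean(axis=1)
        novelty_max = 1 - cosine_sim.max(axis=1)
        novelty_min = 1 - cosine_sim.min(axis=1)
        novelty_median = 1 - np.median(cosine_sim, axis=1)
        # Harmonic mean of the cosine distances (epsilon guards against division by zero)
        novelty_harmonic = K / np.sum(1 / np.maximum(1 - cosine_sim, epsilon), axis=1)
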

        # Prepare rows for DuckDB insert
        rows = [
            (
                int(corpus_ids[i]),
                float(novelty_mean[i]),
                float(novelty_max[i]),
                float(novelty_min[i]),
                float(novelty_median[i]),
                float(novelty_harmonic[i])
                
            )
            for i in range(len(corpus_ids))
        ]
        
        neighbors_con.executemany("""
            INSERT INTO paper_neighbors_metrics (
            corpus_id,                    
            novelty_mean,                               
            novelty_max,
            novelty_min,
            novelty_median,
            novelty_harmonic
            )
            VALUES (?, ?, ?, ?, ?, ?)
        """, rows)
        
        # Free per-batch memory before the next iteration
        import gc
        del neighbors, rows
        gc.collect()

    # Add current batch to FAISS index
    index.add_with_ids(
        embeddings,
        corpus_ids.astype("int64")
    )



    print(f"Indexed {index.ntotal:,} papers")


Indexed 1,211 papers
Indexed 2,397 papers
Indexed 3,744 papers
Indexed 4,925 papers
Indexed 6,228 papers
Indexed 7,591 papers
Indexed 8,954 papers
Indexed 10,153 papers
Indexed 11,517 papers
Indexed 13,638 papers
Indexed 15,028 papers
Indexed 16,583 papers
Indexed 17,898 papers
Indexed 19,160 papers
Indexed 20,709 papers
Indexed 22,043 papers
Indexed 23,577 papers
Indexed 25,160 papers
Indexed 26,684 papers
Indexed 28,016 papers
Indexed 30,037 papers
Indexed 31,626 papers
Indexed 33,168 papers
Indexed 34,740 papers
Indexed 36,290 papers
Indexed 37,560 papers
Indexed 39,088 papers
Indexed 40,602 papers
Indexed 42,145 papers
Indexed 43,692 papers
Indexed 45,414 papers
Indexed 47,072 papers
Indexed 49,214 papers
Indexed 50,852 papers
Indexed 52,504 papers
Indexed 54,309 papers
Indexed 55,843 papers
Indexed 57,360 papers
Indexed 59,000 papers
Indexed 60,649 papers
Indexed 62,380 papers
Indexed 64,202 papers
Indexed 66,114 papers
Indexed 67,539 papers
Indexed 69,998 papers
Indexed 71,993 pa
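
The loop above converts FAISS distances with `1 - distances / 2`. This works because an L2 index reports squared Euclidean distances, and for unit vectors the squared distance equals `2 - 2 * cos`. The following is a small NumPy check of that identity on synthetic vectors (no FAISS involved, and without the clipping step used in the loop):

```python
import numpy as np

rng = np.random.default_rng(42)

# One synthetic query and five synthetic neighbors, all unit-normalized
q = rng.normal(size=8)
q /= np.linalg.norm(q)
nbrs = rng.normal(size=(5, 8))
nbrs /= np.linalg.norm(nbrs, axis=1, keepdims=True)

# Squared L2 distance, the quantity an L2 index reports
sq_l2 = ((nbrs - q) ** 2).sum(axis=1)

# Cosine similarity recovered from squared L2 vs. computed directly
cos_from_l2 = 1.0 - sq_l2 / 2.0
cos_direct = nbrs @ q
identity_holds = np.allclose(cos_from_l2, cos_direct)

# Same naming as the notebook: "max" refers to similarity, not distance
novelty_max = 1.0 - cos_direct.max()  # distance to the nearest neighbor
novelty_min = 1.0 - cos_direct.min()  # distance to the farthest neighbor
print(identity_holds, novelty_max <= novelty_min)
```

This also makes the column naming concrete: `novelty_max` is never larger than `novelty_min`, which matches the ordering visible in the exported metrics.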

In [16]:
# Check the output: export the metrics to CSV and list papers by novelty_mean
con = duckdb.connect("./data/metrics.db", read_only=True)

con.execute("""
    COPY paper_neighbors_metrics
    TO 'papers.csv'
    (FORMAT CSV, HEADER)
""")
con.close()

df = pd.read_csv('papers.csv')
df.sort_values('novelty_mean', ascending=False)

         corpus_id  novelty_mean  novelty_max  novelty_min  novelty_median  novelty_harmonic
26         6331294      0.212569     0.154055     0.237483        0.217403          0.210313
52        11291405      0.211614     0.134824     0.232631        0.213135          0.210220
143       58975810      0.209327     0.141295     0.230281        0.216912          0.207048
133       46002453      0.206089     0.146562     0.244218        0.206014          0.203224
1387     144667741      0.203484     0.140061     0.223022        0.209135          0.201741
...            ...           ...          ...          ...             ...               ...
2097918  259283111      0.000414     0.000004     0.002197        0.000077          0.000020
2098106  259328480      0.000407     0.000011     0.002016        0.000235          0.000212
2097114  259107395      0.000406     0.000005     0.002167        0.000079          0.000022
2116443  259830204      0.000380     0.000015     0.002526        0.000065          0.000048
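
In the rows above, `novelty_harmonic` sits between `novelty_max` (nearest neighbor) and `novelty_mean`. That is the expected behavior of a harmonic mean, which is pulled toward the smallest distances. A minimal sketch with hypothetical cosine distances (the values are made up for illustration):

```python
from statistics import harmonic_mean, mean

# Hypothetical cosine distances from one paper to four neighbors:
# one near-duplicate and three moderately distant papers
dists = [0.001, 0.20, 0.22, 0.25]

hm = harmonic_mean(dists)  # dominated by the near-duplicate
am = mean(dists)           # barely affected by it

print(f"harmonic={hm:.4f} arithmetic={am:.4f}")
```

A single very close prior paper therefore lowers the harmonic novelty sharply while leaving the arithmetic mean almost unchanged, which is one reason to store both.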


In [None]:
# Upload ./data/metrics.db to Hugging Face Hub
upload_file(
    path_or_fileobj="./data/metrics.db",
    path_in_repo="metrics.db",
    repo_id="lalit3c/S2_CS_PHY_PYSCH_papers",
    repo_type="dataset",
    commit_message="Upload computed metrics database"
)


