# Bibliography Shortlisting Notebook

## Purpose
This notebook performs an initial triage of candidate academic references for the DS687 capstone project.
Its role is **narrowing**, not final selection.

The input dataset (`bib_candidates.csv`) contains citation metadata only
(year, title, venue, DOI). Abstracts are not included at this stage.

The output of this notebook is one or more CSV files intended for **manual
review in Excel**, where abstracts are read and final references are selected.

## What this notebook does NOT do
- It does not select final references
- It does not evaluate abstracts
- It does not attempt automated relevance decisions

Final judgment is intentionally human.

## Step 1: Load citation metadata

This step loads the pre-filtered candidate citation list produced earlier
in the workflow. At this point, the dataset includes only basic bibliographic
fields (no abstracts).

In [6]:
import re
import pandas as pd

pd.set_option("display.max_colwidth", 200)
pd.set_option("display.width", 160)

df = pd.read_csv("bib_candidates.csv")

print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head(3)

Rows: 312
Columns: ['year', 'title', 'venue', 'doi', 'key']


Unnamed: 0,year,title,venue,doi,key
0,2003,Collaborative filtering with decoupled models for preferences and ratings,Proceedings of the Twelfth International Conference on Information and Knowledge Management,10.1145/956863.956922,10.1145/956863.956922
1,2007,Case amazon: ratings and reviews as part of recommendations,Proceedings of the 2007 ACM Conference on Recommender Systems,10.1145/1297231.1297255,10.1145/1297231.1297255
2,2007,Dynamics of collaborative document rating systems,Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis,10.1145/1348549.1348555,10.1145/1348549.1348555
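
Before narrowing, it can help to confirm that the expected columns are present and that no DOI appears twice. The following is a minimal sketch of such a check and is not part of the original workflow. `REQUIRED` and `check_candidates` are hypothetical names, and the two-row sample (with placeholder titles and venues) stands in for `bib_candidates.csv`.

```python
import io
import pandas as pd

# Hypothetical: the columns that later steps rely on.
REQUIRED = ["year", "title", "venue", "doi"]

def check_candidates(frame):
    """Return (missing_columns, duplicate_doi_count) for a candidate table."""
    missing = [c for c in REQUIRED if c not in frame.columns]
    dupes = 0
    if "doi" in frame.columns:
        # Count repeated DOIs; missing values are ignored.
        dupes = int(frame["doi"].dropna().duplicated().sum())
    return missing, dupes

# Tiny in-memory sample standing in for bib_candidates.csv (placeholder rows).
sample = pd.read_csv(io.StringIO(
    "year,title,venue,doi\n"
    "2003,Paper A,Venue X,10.1145/956863.956922\n"
    "2007,Paper B,Venue Y,10.1145/956863.956922\n"
))

missing, dupes = check_candidates(sample)
print("Missing columns:", missing)
print("Duplicate DOIs:", dupes)
```

In the notebook itself, the same function could be called as `check_candidates(df)` right after loading.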


## Step 2: Create a searchable text field

Because abstracts are not available in the CSV, narrowing is performed using
title and venue text only. These fields are combined and lower-cased to support
keyword matching.

This is a *mechanical* filtering step, not a semantic judgment.

In [7]:
df["text_blob"] = (
    df["title"].fillna("").astype(str) + " " +
    df["venue"].fillna("").astype(str)
).str.lower()

df[["year", "title", "venue", "doi"]].head(5)

Unnamed: 0,year,title,venue,doi
0,2003,Collaborative filtering with decoupled models for preferences and ratings,Proceedings of the Twelfth International Conference on Information and Knowledge Management,10.1145/956863.956922
1,2007,Case amazon: ratings and reviews as part of recommendations,Proceedings of the 2007 ACM Conference on Recommender Systems,10.1145/1297231.1297255
2,2007,Dynamics of collaborative document rating systems,Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis,10.1145/1348549.1348555
3,2008,Boosting collaborative filtering based on statistical prediction errors,Proceedings of the 2008 ACM Conference on Recommender Systems,10.1145/1454008.1454011
4,2008,Improving top-n recommendation techniques using rating variance,Proceedings of the 2008 ACM Conference on Recommender Systems,10.1145/1454008.1454059
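
The `fillna("")` calls matter because concatenating a string with a missing value yields a missing value, which would silently remove that row from keyword matching. The sketch below illustrates this on two made-up rows (the titles and venue are placeholders, not dataset entries).

```python
import pandas as pd

# Illustrative rows only; the second has no venue.
demo = pd.DataFrame({
    "title": ["Some rating study", "Untitled draft"],
    "venue": ["Some Workshop", None],
})

# Same construction as the notebook's text_blob.
blob = (
    demo["title"].fillna("").astype(str) + " " +
    demo["venue"].fillna("").astype(str)
).str.lower()

# Without fillna, the row with a missing venue becomes NaN.
raw = demo["title"] + " " + demo["venue"]

print(blob.tolist())
print(raw.isna().tolist())
```

A missing venue leaves a trailing space in the blob, which is harmless for keyword matching.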


## Step 3: Broad relevance filter

This step applies an intentionally broad keyword filter to remove citations
that are clearly unrelated to the project domain (IMDb, reviews, ratings,
recommendation systems, text analysis).

At this stage, recall is favored over precision. It is acceptable for this
filter to return a large set.

In [8]:
include_terms = [
    r"\bimdb\b", r"\bmovie\b", r"\bmovies\b", r"\bfilm\b", r"\bcinema\b",
    r"\breview\b", r"\breviews\b", r"\brating\b", r"\bratings\b",
    r"\brecommend\b", r"\brecommender\b", r"\bcollaborative filtering\b",
    r"\bsentiment\b", r"\bopinion\b", r"\btext\b", r"\bnlp\b",
    r"\btopic\b", r"\blda\b", r"\bembedding\b", r"\btransformer\b", r"\bbert\b",
    r"\bclassification\b", r"\bprediction\b", r"\bevaluation\b", r"\bbenchmark\b"
]

pattern = re.compile("|".join(include_terms), flags=re.IGNORECASE)

df["include_hit"] = df["text_blob"].str.contains(pattern, regex=True, na=False)

short = df[df["include_hit"]].copy()

print("Shortlist rows after broad filter:", len(short))
short.sort_values(["year"], ascending=False).head(20)[["year", "title", "venue", "doi"]]

Shortlist rows after broad filter: 213


Unnamed: 0,year,title,venue,doi
304,2025,On the Cross-Graph Transferability of Dynamic Link Prediction,Proceedings of the ACM on Web Conference 2025,10.1145/3696410.3714712
305,2025,On the Role of Weight Decay in Collaborative Filtering: A Popularity Perspective,Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2,10.1145/3711896.3737068
307,2025,Popularity‑Bias Vulnerability: Semi‑Supervised Label Inference Attack on Federated Recommender Systems,Proceedings of the Nineteenth ACM Conference on Recommender Systems,10.1145/3705328.3748024
308,2025,Predicting Company ESG Ratings from News Articles Using Multivariate Timeseries Analysis,Companion Proceedings of the ACM on Web Conference 2025,10.1145/3701716.3717509
291,2025,Balancing Accuracy and Novelty with Sub-Item Popularity,Proceedings of the Nineteenth ACM Conference on Recommender Systems,10.1145/3705328.3759311
290,2025,A Multi-Factor Collaborative Prediction for Review-based Recommendation,Proceedings of the Nineteenth ACM Conference on Recommender Systems,10.1145/3705328.3748062
292,2025,Beyond Immediate Click: Engagement-Aware and MoE-Enhanced Transformers for Sequential Movie Recommendation,Proceedings of the Nineteenth ACM Conference on Recommender Systems,10.1145/3705328.3748076
294,2025,D2: Customizing Two-Stage Graph Neural Networks for Early Rumor Detection through Cascade Diffusion Prediction,Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining,10.1145/3701551.3703589
295,2025,Dual Pairwise Pre-training and Prompt-tuning with Aligned Prototypes for Interbank Credit Rating,Proceedings of the ACM on Web Conference 2025,10.1145/3696410.3714530
296,2025,Exploring the Effect of Context-Awareness and Popularity Calibration on Popularity Bias in POI Recommendations,Proceedings of the Nineteenth ACM Conference on Recommender Systems,10.1145/3705328.3748017
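
The `\b` anchors in `include_terms` match whole words only, which is why singular and plural forms are listed separately. The sketch below uses three of the patterns and a few made-up strings (not rows from the dataset) to show which variants hit and which do not.

```python
import re

# Three of the Step 3 patterns, compiled the same way as in the notebook.
pattern = re.compile(
    r"\brecommend\b|\brecommender\b|\brating\b", flags=re.IGNORECASE
)

# Illustrative strings, not rows from the dataset.
samples = [
    "a combined approach to recommend",       # exact word: hit
    "acm conference on recommender systems",  # listed form: hit
    "sequential recommendation with graphs",  # "recommendation" is not listed: miss
    "item ratings in context",                # "ratings" needs its own pattern: miss
]

hits = [bool(pattern.search(s)) for s in samples]
for s, h in zip(samples, hits):
    print(f"{h!s:5}  {s}")
```

A title containing only "recommendation" is therefore kept only if another term, or the venue name, produces a hit. A broader pattern such as `\brecommend\w*` would catch it, at the cost of changing the shortlist count reported above.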


In [14]:
# Derive the file name from the row count so it stays in sync with the filter
short[["year", "title", "venue", "doi"]].to_csv(
    f"doi_shortlist_{len(short)}.csv", index=False
)

## Optional: Automated ranking for reduced shortlist

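The following cells apply a simple keyword-weighted score to further reduce
the candidate set. Each keyword found in `text_blob` adds its weight once.
Unlike Step 3, matching here is by plain substring rather than whole word, so
"reviews" also counts as a hit for "review", and "text" also matches "context".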

This step is optional and exists as an alternative workflow.
Final selection still requires manual abstract review.

In [16]:
weights = {
    "imdb": 6,
    "movie": 4, "movies": 4, "film": 4, "cinema": 3,
    "review": 4, "reviews": 4,
    "rating": 3, "ratings": 3,
    "recommender": 4, "recommend": 3, "collaborative filtering": 4,
    "sentiment": 4, "opinion": 3, "nlp": 3, "text": 2,
    "topic": 2, "lda": 2,
    "embedding": 2, "transformer": 2, "bert": 2,
    "evaluation": 2, "benchmark": 2,
    "prediction": 2, "classification": 2
}

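def score_text(text):
    """Sum the weights of every keyword that occurs in the text."""
    t = str(text).lower()
    s = 0
    for k, w in weights.items():
        # Plain substring test, not a word-boundary match
        if k in t:
            s += w
    return s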

short["score"] = short["text_blob"].apply(score_text)

ranked = short.sort_values(["score", "year"], ascending=[False, False]).copy()

ranked.head(25)[["year", "score", "title", "venue", "doi"]]

Unnamed: 0,year,score,title,venue,doi
79,2014,21,"Ratings meet reviews, a combined approach to recommend",Proceedings of the 8th ACM Conference on Recommender Systems,10.1145/2645710.2645728
1,2007,21,Case amazon: ratings and reviews as part of recommendations,Proceedings of the 2007 ACM Conference on Recommender Systems,10.1145/1297231.1297255
85,2015,20,Incorporating Phrase-level Sentiment Analysis on Textual Reviews for Personalized Recommendation,Proceedings of the Eighth ACM International Conference on Web Search and Data Mining,10.1145/2684822.2697033
164,2019,19,"Exploiting Ratings, Reviews and Relationships for Item Recommendations in Topic Based Social Networks",The World Wide Web Conference,10.1145/3308558.3313473
8,2009,19,Context-based splitting of item ratings in collaborative filtering,Proceedings of the Third ACM Conference on Recommender Systems,10.1145/1639714.1639759
49,2013,18,Context-aware review helpfulness rating prediction,Proceedings of the 7th ACM Conference on Recommender Systems,10.1145/2507157.2507183
52,2013,18,Hidden factors and hidden topics: understanding rating dimensions with review text,Proceedings of the 7th ACM Conference on Recommender Systems,10.1145/2507157.2507163
162,2019,17,DAML: Dual Attention Mutual Learning between Ratings and Reviews for Item Recommendation,Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,10.1145/3292500.3330906
166,2019,17,Leveraging Ratings and Reviews with Gating Mechanism for Recommendation,Proceedings of the 28th ACM International Conference on Information and Knowledge Management,10.1145/3357384.3357919
135,2018,17,Coevolutionary Recommendation Model: Mutual Learning between Ratings and Reviews,Proceedings of the 2018 World Wide Web Conference,10.1145/3178876.3186158
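
To make the scores above easier to interpret, the sketch below reproduces the top-ranked score of 21 using a subset of `weights`, and compares it with a hypothetical whole-word variant. `score_whole_word` is an illustrative alternative, not something the notebook uses.

```python
import re

# Subset of the notebook's weights, enough to reproduce one score.
weights = {
    "review": 4, "reviews": 4,
    "rating": 3, "ratings": 3,
    "recommender": 4, "recommend": 3,
    "text": 2,
}

def score_substring(text):
    """Same rule as score_text: each keyword found as a substring adds its weight."""
    t = str(text).lower()
    return sum(w for k, w in weights.items() if k in t)

def score_whole_word(text):
    """Hypothetical variant: a keyword counts only when it appears as a whole word."""
    t = str(text).lower()
    return sum(
        w for k, w in weights.items()
        if re.search(r"\b" + re.escape(k) + r"\b", t)
    )

# text_blob for the first ranked row (title + venue, lower-cased).
blob = (
    "ratings meet reviews, a combined approach to recommend "
    "proceedings of the 8th acm conference on recommender systems"
)

print(score_substring(blob))   # 21
print(score_whole_word(blob))  # 14
```

The substring rule counts "ratings" under both "rating" and "ratings", and likewise for "reviews" and "recommender", which accounts for the difference of 7. Either rule gives a usable ordering; the point is only that the absolute scores include this double counting.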


In [17]:
ranked_doi = ranked[ranked["doi"].notna() & (ranked["doi"].astype(str).str.strip() != "")].copy()

print("Ranked rows with DOI:", len(ranked_doi))

top_n = 40  # change to 30 or 50 if you prefer
out = ranked_doi.head(top_n)[["year", "title", "venue", "doi", "score"]].copy()

out_file = f"doi_shortlist_top{top_n}.csv"  # file name follows top_n
out.to_csv(out_file, index=False)

out_file

Ranked rows with DOI: 212


'doi_shortlist_top40.csv'
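
Since the exported file is meant for manual review, a resolvable link per row can save some copy-pasting when looking up abstracts. The following is a minimal sketch, assuming the standard `https://doi.org/` resolver prefix. `doi_to_url` and the `doi_url` column mentioned below are hypothetical additions, not part of the original notebook.

```python
def doi_to_url(doi):
    """Return a resolver URL for a DOI, or an empty string if the DOI is blank."""
    d = "" if doi is None else str(doi).strip()
    # str(float("nan")) is "nan", which is how a missing pandas value would arrive here.
    if not d or d.lower() == "nan":
        return ""
    return "https://doi.org/" + d

print(doi_to_url("10.1145/2645710.2645728"))
```

In the notebook this could be applied as `out["doi_url"] = out["doi"].apply(doi_to_url)` before calling `to_csv`.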

## Final note

- This notebook completes the automated portion of bibliography triage.
- Downstream steps (abstract reading, final reference selection, and slide integration) are performed manually outside this notebook.