# bib_enrich_top40.ipynb

## Purpose
This notebook enriches the weighted DOI shortlist (`doi_shortlist_top40.csv`) by adding abstracts from local `.bib` files.

## Inputs
- `doi_shortlist_top40.csv` (produced by `bib_shortlist.ipynb`)
- One or more `.bib` files in the same folder

## Output
- `doi_shortlist_top40_with_abstracts.csv`

## Notes
- Abstracts are pulled from `.bib` entries, not the web
- Matching is done by DOI, normalized to reduce formatting differences

In [1]:
import re
from pathlib import Path
import pandas as pd

pd.set_option("display.max_colwidth", 200)
pd.set_option("display.width", 160)

## Step 1: Load the top 40 DOI shortlist

In [2]:
shortlist_path = Path("doi_shortlist_top40.csv")
df = pd.read_csv(shortlist_path)

print("Rows:", len(df))
print("Columns:", list(df.columns))
df.head(3)

Rows: 40
Columns: ['year', 'title', 'venue', 'doi', 'score']


Unnamed: 0,year,title,venue,doi,score
0,2014,"Ratings meet reviews, a combined approach to recommend",Proceedings of the 8th ACM Conference on Recommender Systems,10.1145/2645710.2645728,21
1,2007,Case amazon: ratings and reviews as part of recommendations,Proceedings of the 2007 ACM Conference on Recommender Systems,10.1145/1297231.1297255,21
2,2015,Incorporating Phrase-level Sentiment Analysis on Textual Reviews for Personalized Recommendation,Proceedings of the Eighth ACM International Conference on Web Search and Data Mining,10.1145/2684822.2697033,20


## Step 2: Locate `.bib` files in the current folder

In [3]:
bib_files = sorted(Path(".").glob("*.bib"))

print("Bib files found:", len(bib_files))
for f in bib_files:
    print(" -", f.name)

Bib files found: 5
 - CIKM.bib
 - KDD.bib
 - RecSys.bib
 - TheWebConf.bib
 - WSDM.bib


## Step 3: Normalize DOIs

DOIs can appear in slightly different formats:
- uppercase vs lowercase
- with a leading `https://doi.org/`
- with extra whitespace

This function standardizes DOIs so matching works reliably. The step may look redundant for the current files, but BibTeX files are not guaranteed to be clean: if I merge in another `.bib` file later, its DOI formatting might differ. Think of it as a future-proofing step.

In [4]:
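def normalize_doi(doi: str) -> str:
    # Treat None and pandas NaN as missing; str(NaN) would otherwise yield "nan".
    if pd.isna(doi):
        return ""
    s = str(doi).strip().lower()
    # Drop resolver prefixes so only the bare DOI remains.
    s = s.replace("https://doi.org/", "").replace("http://doi.org/", "")
    s = s.replace("https://dx.doi.org/", "").replace("http://dx.doi.org/", "")
    return s.strip()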

df["doi_norm"] = df["doi"].apply(normalize_doi)
df[["doi", "doi_norm"]].head(5)

Unnamed: 0,doi,doi_norm
0,10.1145/2645710.2645728,10.1145/2645710.2645728
1,10.1145/1297231.1297255,10.1145/1297231.1297255
2,10.1145/2684822.2697033,10.1145/2684822.2697033
3,10.1145/3308558.3313473,10.1145/3308558.3313473
4,10.1145/1639714.1639759,10.1145/1639714.1639759
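
A quick way to check the normalization idea is to feed several spellings of the same DOI through it and confirm they collapse to one value. The sketch below is not the function from the cell above: `normalize_doi_strict` is a hypothetical variant that also strips a `doi:` prefix, which is an assumption about formats a future `.bib` file might contain.

```python
import re

# Hypothetical variant: the prefix list (including "doi:") is an assumption,
# not something the current .bib files are known to need.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)

def normalize_doi_strict(doi) -> str:
    """Lowercase a DOI and strip whitespace and one leading resolver prefix."""
    if doi is None:
        return ""
    s = str(doi).strip()
    s = _DOI_PREFIX.sub("", s)
    return s.strip().lower()

# Four spellings of the same DOI taken from the shortlist.
samples = [
    "10.1145/2645710.2645728",
    " https://doi.org/10.1145/2645710.2645728 ",
    "HTTP://DX.DOI.ORG/10.1145/2645710.2645728",
    "doi:10.1145/2645710.2645728",
]
normalized = {normalize_doi_strict(s) for s in samples}
print(normalized)
```

If the set contains more than one element, some format is slipping through and the prefix pattern needs another alternative.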


## Step 4: Parse `.bib` files to build a DOI → abstract lookup

I do not need full BibTeX parsing for this job. I only need:
- DOI
- abstract

This parser:
- splits on BibTeX entry boundaries (an `@` at the start of a line)
- searches each entry for `doi = {...}` and `abstract = {...}`
- stores the abstract in a dictionary keyed by normalized DOI

In [5]:
doi_to_abstract = {}
doi_to_title = {}  # optional, helpful for debugging

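# These regex patterns handle common BibTeX formatting.
# Caveat: the non-greedy match stops at the first }, " or ) it meets, so a value
# that contains one of those characters (nested braces, parentheses) is cut short.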
doi_re = re.compile(r"\bdoi\s*=\s*[{\"\(]\s*(.+?)\s*[}\"\)]\s*,?", re.IGNORECASE | re.DOTALL)
abs_re = re.compile(r"\babstract\s*=\s*[{\"\(]\s*(.+?)\s*[}\"\)]\s*,?", re.IGNORECASE | re.DOTALL)
title_re = re.compile(r"\btitle\s*=\s*[{\"\(]\s*(.+?)\s*[}\"\)]\s*,?", re.IGNORECASE | re.DOTALL)

def clean_bib_value(s: str) -> str:
    # Light cleanup: collapse whitespace and remove common BibTeX line breaks.
    if s is None:
        return ""
    s = re.sub(r"\s+", " ", s).strip()
    return s

entries_scanned = 0
entries_with_doi = 0
entries_with_abstract = 0

for bib_path in bib_files:
    text = bib_path.read_text(encoding="utf-8", errors="ignore")

    # Split on the start of an entry. Keep it simple; this is enough for enrichment.
    chunks = re.split(r"\n@", "\n" + text)

    for chunk in chunks:
        if not chunk.strip():
            continue
        if "doi" not in chunk.lower():
            continue

        entries_scanned += 1

        doi_match = doi_re.search(chunk)
        if not doi_match:
            continue

        doi_raw = clean_bib_value(doi_match.group(1))
        doi_norm = normalize_doi(doi_raw)
        if not doi_norm:
            continue

        entries_with_doi += 1

        abs_match = abs_re.search(chunk)
        if abs_match:
            abstract = clean_bib_value(abs_match.group(1))
            if abstract:
                doi_to_abstract.setdefault(doi_norm, abstract)
                entries_with_abstract += 1

        # Optional title capture for sanity checks
        t_match = title_re.search(chunk)
        if t_match:
            doi_to_title.setdefault(doi_norm, clean_bib_value(t_match.group(1)))

print("Entries scanned (DOI present):", entries_scanned)
print("Entries with DOI extracted:", entries_with_doi)
print("Entries with abstract extracted:", entries_with_abstract)
print("Unique DOIs with abstracts:", len(doi_to_abstract))

Entries scanned (DOI present): 1344
Entries with DOI extracted: 1344
Entries with abstract extracted: 1338
Unique DOIs with abstracts: 1338
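
The regex patterns in the cell above end a field at the first `}`, `"` or `)`, so an abstract that contains a parenthesis or nested braces would be cut at that character. A minimal sketch of one alternative is to count brace depth instead. `extract_braced_field` is a hypothetical helper, the sample entry and its DOI are made up, and the sketch assumes brace-delimited values only (no quote-delimited fields, no escaped braces).

```python
import re

def extract_braced_field(entry: str, field: str) -> str:
    """Return the value of `field = {...}`, keeping nested braces intact."""
    m = re.search(rf"\b{re.escape(field)}\s*=\s*\{{", entry, re.IGNORECASE)
    if not m:
        return ""
    depth = 1          # we are just past the opening brace
    start = m.end()
    for i in range(start, len(entry)):
        ch = entry[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Collapse internal whitespace, as clean_bib_value does.
                return re.sub(r"\s+", " ", entry[start:i]).strip()
    return ""          # unbalanced braces: treat as missing

# Made-up entry for illustration only.
sample_entry = """@inproceedings{key1,
  title = {A {B}ayesian model},
  doi = {10.1000/example.1},
  abstract = {Research on recommender systems (RS) uses
    {LaTeX} markup.},
}"""

sample_abstract = extract_braced_field(sample_entry, "abstract")
sample_title = extract_braced_field(sample_entry, "title")
print(sample_abstract)
```

The depth counter is the whole trick: parentheses and quotes are treated as ordinary text, and only the brace that closes the field ends the value.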


## Step 5: Join abstracts onto the top 40 list

Add:
- `abstract` pulled from `.bib` when available
- `abstract_found` to quickly see coverage

In [6]:
df["abstract"] = df["doi_norm"].map(doi_to_abstract).fillna("")
df["abstract_found"] = df["abstract"].astype(str).str.strip().ne("")

coverage = df["abstract_found"].mean() * 100
print(f"Abstract coverage: {coverage:.1f}%")

df[["year", "title", "doi", "abstract_found"]].head(10)

Abstract coverage: 100.0%


Unnamed: 0,year,title,doi,abstract_found
0,2014,"Ratings meet reviews, a combined approach to recommend",10.1145/2645710.2645728,True
1,2007,Case amazon: ratings and reviews as part of recommendations,10.1145/1297231.1297255,True
2,2015,Incorporating Phrase-level Sentiment Analysis on Textual Reviews for Personalized Recommendation,10.1145/2684822.2697033,True
3,2019,"Exploiting Ratings, Reviews and Relationships for Item Recommendations in Topic Based Social Networks",10.1145/3308558.3313473,True
4,2009,Context-based splitting of item ratings in collaborative filtering,10.1145/1639714.1639759,True
5,2013,Context-aware review helpfulness rating prediction,10.1145/2507157.2507183,True
6,2013,Hidden factors and hidden topics: understanding rating dimensions with review text,10.1145/2507157.2507163,True
7,2019,DAML: Dual Attention Mutual Learning between Ratings and Reviews for Item Recommendation,10.1145/3292500.3330906,True
8,2019,Leveraging Ratings and Reviews with Gating Mechanism for Recommendation,10.1145/3357384.3357919,True
9,2018,Coevolutionary Recommendation Model: Mutual Learning between Ratings and Reviews,10.1145/3178876.3186158,True
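
Because this run reaches 100% coverage, the unmatched path of the join never shows up in the output. The toy example below sketches what `map` plus `fillna` does when a DOI has no entry in the lookup. The frame `toy`, the dictionary `lookup`, and the DOIs are made up for illustration.

```python
import pandas as pd

# Made-up data: one DOI has an abstract in the lookup, one does not.
toy = pd.DataFrame({"doi_norm": ["10.1000/a", "10.1000/b"]})
lookup = {"10.1000/a": "An abstract."}

# map() yields NaN for keys missing from the dict; fillna("") turns that into "".
toy["abstract"] = toy["doi_norm"].map(lookup).fillna("")
toy["abstract_found"] = toy["abstract"].str.strip().ne("")

coverage = toy["abstract_found"].mean() * 100
print(f"Abstract coverage: {coverage:.1f}%")
```

The mean of a boolean column is the share of `True` values, which is why multiplying by 100 gives the coverage percentage.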


## Step 6: Inspect missing abstracts

This helps me see which items did not match, so I can decide whether:
- the `.bib` entries do not include abstracts for those items, or
- the DOI formatting differs and needs normalization tweaks

In [7]:
missing = df[~df["abstract_found"]].copy()
print("Missing abstracts:", len(missing))

missing[["year", "title", "doi"]].head(20)

Missing abstracts: 0


Unnamed: 0,year,title,doi
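
Nothing is missing in this run, but if a later `.bib` file leaves gaps, the two explanations listed above can be told apart with a raw text search. This is a rough sketch: `diagnose_missing` is a hypothetical helper, the labels it returns are my own, and the sample text and DOIs are made up.

```python
def diagnose_missing(doi_norm: str, bib_text: str) -> str:
    """Classify why a normalized DOI has no abstract, using a plain substring search."""
    haystack = bib_text.lower()
    if doi_norm in haystack:
        # The DOI is there verbatim, so the entry probably has no usable abstract.
        return "doi_present"
    suffix = doi_norm.split("/", 1)[-1]
    if suffix and suffix in haystack:
        # Only the part after the slash appears: the DOI is likely formatted differently.
        return "suffix_only"
    return "absent"

# Made-up .bib text: one clean DOI, one with a URL-encoded slash.
sample_bib_text = (
    "@inproceedings{k1,\n  doi = {10.1000/Example.1},\n}\n"
    "@inproceedings{k2,\n  doi = {10.1000%2Fexample.2},\n}\n"
)

results = {
    d: diagnose_missing(d, sample_bib_text)
    for d in ["10.1000/example.1", "10.1000/example.2", "10.1000/example.3"]
}
print(results)
```

A `suffix_only` result points at the normalization step, while `doi_present` points at the `.bib` entry itself.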


## Step 7: Export enriched shortlist for Excel

This file is what I will review in Excel to select:
- 1 problem statement reference
- 3 key references

The remaining items stay as the literature pool for the full paper.

In [8]:
out_cols = [c for c in ["year", "title", "venue", "doi", "score", "hit_count", "abstract"] if c in df.columns]
out = df[out_cols].copy()

out_path = Path("doi_shortlist_top40_with_abstracts.csv")
out.to_csv(out_path, index=False)

print("Wrote:", out_path.name)
out.head(3)

Wrote: doi_shortlist_top40_with_abstracts.csv


Unnamed: 0,year,title,venue,doi,score,abstract
0,2014,"Ratings meet reviews, a combined approach to recommend",Proceedings of the 8th ACM Conference on Recommender Systems,10.1145/2645710.2645728,21,"Most existing recommender systems focus on modeling the ratings while ignoring the abundant information embedded in the review text. In this paper, we propose a unified model that combines content..."
1,2007,Case amazon: ratings and reviews as part of recommendations,Proceedings of the 2007 ACM Conference on Recommender Systems,10.1145/1297231.1297255,21,"We studied user behavior in a recommender-rich environment, Amazon online store, to see what role the algorithm-based and user-generated recommendations play in finding items of interest. We used ..."
2,2015,Incorporating Phrase-level Sentiment Analysis on Textual Reviews for Personalized Recommendation,Proceedings of the Eighth ACM International Conference on Web Search and Data Mining,10.1145/2684822.2697033,20,Previous research on Recommender Systems (RS
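
Since the export is meant for Excel, one thing to watch is text encoding: Excel may misread non-ASCII characters in a UTF-8 CSV that has no byte-order mark. A minimal sketch, assuming that is a concern for these abstracts, is to write with `encoding="utf-8-sig"`. The frame `demo` and the file name are made up, and the file goes to a temporary folder so nothing in the project is overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows with non-ASCII characters.
demo = pd.DataFrame({"title": ["Café recommender"], "abstract": ["Naïve baseline."]})

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_with_abstracts.csv"
    # "utf-8-sig" prepends a byte-order mark to the UTF-8 output.
    demo.to_csv(demo_path, index=False, encoding="utf-8-sig")
    raw = demo_path.read_bytes()
    round_trip = pd.read_csv(demo_path, encoding="utf-8-sig")

has_bom = raw.startswith(b"\xef\xbb\xbf")
print("BOM written:", has_bom)
```

Reading the file back with the same encoding strips the mark again, so the column names stay clean for any later pandas step.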
