# Building a Semantic Retrieval System for Book Discovery Using TF-IDF, NMF, and Cosine Similarity

**COSCI210 - Data Mining and Wrangling 1**

**Jose Miguel Bautista, PhD**
  
*Lab Report 2/Final Project*

**<u>MSDS 2026 Learning Team 8, Term 2</u>**

- Francis Erdey M. Capati
   
- Kevin Ansel S. Dy

- Jan Paolo V. Moreno

- Mia Cielo G. Oliveros

## Abstract
This paper presents an end-to-end semantic search pipeline designed to assist bookstore retailers in identifying fiction titles relevant to a customer’s requested themes or topics. Using Penguin Random House’s API as a metadata source, we construct a corpus of fiction titles and apply a multi-stage natural language processing (NLP) workflow. The flap-copy descriptions undergo TF-IDF vectorization, followed by Non-Negative Matrix Factorization (NMF) for topic extraction. A cosine similarity ranking layer enables semantic retrieval, allowing users to submit free-form text queries and receive the most relevant book titles. The resulting system enables accurate, interpretable topic-aligned book recommendations suitable for curation.

## Introduction and Problem Definition

A bookstore customer requests recommendations for titles aligned with specific themes or topics. The objective is to build a semantic retrieval system that returns relevant fiction books from the retailer’s inventory. The challenge is to interpret free-form text queries and identify the most semantically aligned titles using book metadata.

## Data Acquisition from Penguin Random House API

The corpus is constructed from Penguin Random House’s metadata service, accessed via a secure API key.

2.1. Collecting Fiction Category Identifiers

Fiction BISAC categories are retrieved from PRH endpoints and stored locally in a JSON file for reference.

These category IDs form the basis for selecting relevant books to include in the corpus.

2.2. Batch Retrieval of Title Metadata
A custom function accepts:

A list of category IDs

A record count per category

A set of API parameters

The function returns a list of data frames, each containing:

|ISBN|
|--|
|Title|
|Author|
|Flap-copy description|

These fields form the raw text corpus for downstream NLP processing.

In [2]:
import requests
import time
from datetime import datetime, timedelta
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse
import json
import re
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
API_KEY = "fh5hj47dynk4nvx4s9ewufj4"
BASE = "https://api.penguinrandomhouse.com/resources/v2/title/domains/PRH.US/titles/views/istca"
CAT = "https://api.penguinrandomhouse.com/resources/v2/domains/PRH.US/categories/"
TITLE_CAT = "https://api.penguinrandomhouse.com/resources/v2/title/domains/PRH.US/titles/"
session = requests.Session()

In [3]:
BISAC = pd.read_json("bisac_prefixes.json", typ="series").reset_index()
BISAC.columns = ["prefix", "category"]
BISAC

Unnamed: 0,prefix,category
0,ANT,Antiques & Collectibles
1,ARC,Architecture
2,BIB,Bibles
3,BIO,Biography & Autobiography
4,BOD,"Body, Mind & Spirit"
5,BUS,Business & Economics
6,CGN,Comics & Graphic Novels
7,COM,Computers
8,CKB,Cooking
9,CRA,Crafts & Hobbies


In [5]:
with open('fiction_BISAC.json', 'r') as f:
    fiction_bisac_codes = json.load(f)
fiction_bisac_codes = fiction_bisac_codes.get('data')
fiction_bisac_codes
f_cat_map = {
    c["catId"]: c["menuText"]
    # c["catId"]: {
    #     "BISAC": c["catUri"],
    #     "Description": c["menuText"]
        for c in fiction_bisac_codes["categories"]
}
f_cat_map

{3000001525: 'Fiction',
 3000001526: 'Absurdist',
 3000001527: 'Action & Adventure',
 3000001528: 'Adaptations & Pastiche',
 3000001529: 'African American & Black',
 3000001530: 'Christian',
 3000001531: 'Erotica',
 3000001532: 'Historical',
 3000001533: 'Mystery & Detective',
 3000001534: 'Urban & Street Lit',
 3000001535: 'Women',
 3000001536: 'Alternative History',
 3000001537: 'Amish & Mennonite',
 3000001538: 'Animals',
 3000001539: 'Anthologies (multiple authors)',
 3000001540: 'Asian American & Pacific Islander',
 3000001541: 'Biographical & Autofiction',
 3000001542: 'Books, Bookstores & Libraries',
 3000001543: 'Buddhist',
 3000001544: 'Christian',
 3000001545: 'Biblical',
 3000001546: 'Classic & Allegory',
 3000001547: 'Collections & Anthologies',
 3000001548: 'Contemporary',
 3000001549: 'Fantasy',
 3000001550: 'Futuristic',
 3000001551: 'Historical',
 3000001552: 'Romance',
 3000001553: 'Historical',
 3000001555: 'Suspense',
 3000001556: 'Western',
 3000001558: 'City Life',

In [6]:
len(f_cat_map)

369

In [7]:
f_cat_id = {}
for key, value in f_cat_map.items():
    if value not in f_cat_id.values():
        f_cat_id[key] = value

f_cat_id

{3000001525: 'Fiction',
 3000001526: 'Absurdist',
 3000001527: 'Action & Adventure',
 3000001528: 'Adaptations & Pastiche',
 3000001529: 'African American & Black',
 3000001530: 'Christian',
 3000001531: 'Erotica',
 3000001532: 'Historical',
 3000001533: 'Mystery & Detective',
 3000001534: 'Urban & Street Lit',
 3000001535: 'Women',
 3000001536: 'Alternative History',
 3000001537: 'Amish & Mennonite',
 3000001538: 'Animals',
 3000001539: 'Anthologies (multiple authors)',
 3000001540: 'Asian American & Pacific Islander',
 3000001541: 'Biographical & Autofiction',
 3000001542: 'Books, Bookstores & Libraries',
 3000001543: 'Buddhist',
 3000001545: 'Biblical',
 3000001546: 'Classic & Allegory',
 3000001547: 'Collections & Anthologies',
 3000001548: 'Contemporary',
 3000001549: 'Fantasy',
 3000001550: 'Futuristic',
 3000001552: 'Romance',
 3000001555: 'Suspense',
 3000001556: 'Western',
 3000001558: 'City Life',
 3000001559: 'Classics',
 3000001560: 'Coming Of Age',
 3000001561: 'Crime',
 3

In [8]:
f_cat_id[3000001771]

'Romantic Comedy'

In [9]:
def build_corpus(f_cat_id, rows=20):
    API_KEY = "fh5hj47dynk4nvx4s9ewufj4"
    BASE = "https://api.penguinrandomhouse.com/resources/v2/title/domains/PRH.US/titles/views/istca"
    CAT = "https://api.penguinrandomhouse.com/resources/v2/domains/PRH.US/categories/"
    session = requests.Session()

    results = []
    for catId in f_cat_id:

        params = {
            "formatFamily": "Paperback",
            "catId": catId,
            "showFlapCopy": "true",
            "showPublishedBooks": "true",
            "start": 0,
            "rows": rows,
            "api_key": API_KEY,
        }

        try:
            r = session.get(BASE, params=params, timeout=20)
            r.raise_for_status()

        except requests.exceptions.Timeout:
            #print(f"\n[TIMEOUT] Category {catId} timed out. Skipping.")
            continue

        except requests.exceptions.ConnectionError:
            #print(f"\n[CONNECTION ERROR] Failed to connect for category {catId}. Skipping.")
            continue

        except requests.exceptions.HTTPError as http_err:
            #print(f"\n[HTTP ERROR] Category {catId}: {http_err}. Skipping.")
            continue

        except Exception as e:
            #print(f"\n[UNKNOWN ERROR] Category {catId}: {e}. Skipping.")
            continue

        #print("\n---")
        #print("CATID:", catId)
        #print("STATUS:", r.status_code)
        #print("URL:", r.url)
        #print("RAW:", r.text[:200])

        r.raise_for_status()
        r_dict = r.json()
        for i in r_dict['data']:
            i['catId'] = catId
        results.append(r_dict)

    
    # Build the dataframe
    corpus = pd.DataFrame(results[0]['data'])
    #catid_corpus_dict = {}

    for catid in results:
        df = pd.DataFrame(catid['data'])
        try:
            corpus = pd.concat([corpus, df], ignore_index=True)
            #catid_corpus_dict[catid['params']['catId']] = len(catid['data'])
        except NameError:
            corpus = df

    corpus = corpus.drop(columns=['isbnHyphenated', 'workId', 'coverUrl',
        'format', 'subformat', 'binding', 'editionTarget', 'trim', 'edition',
        'onSaleDate', 'exportOnSaleDate', 'price', 'exportPrice',
        'globalDivision', 'publishingDivision', 'imprint', 'publishingStatus',
        'series', 'language', 'seq', 'titleBlock', 'authors'])

    return corpus

In [10]:
corpus = build_corpus(f_cat_id, rows=100)
# for catid in fetch:
#     df = pd.DataFrame(catid['data'])
#     try:
#         corpus = pd.concat([corpus, df], ignore_index=True)
#     except NameError:
#         corpus = df

# corpus = corpus.drop(columns=['isbnHyphenated', 'workId', 'coverUrl',
#        'format', 'subformat', 'binding', 'editionTarget', 'trim', 'edition',
#        'onSaleDate', 'exportOnSaleDate', 'price', 'exportPrice',
#        'globalDivision', 'publishingDivision', 'imprint', 'publishingStatus',
#        'series', 'language', 'seq', 'titleBlock', 'authors'])
# # corpus.set_index('isbn', inplace=True)
corpus

Unnamed: 0,isbn,title,author,description,catId
0,9780140014457,Under the Net,Iris Murdoch,<b>Iris Murdoch's debut&mdash;a comic novel ab...,3000001525
1,9780140014747,The Sandcastle,Iris Murdoch,<b>A sparklingly profound novel about the conf...,3000001525
2,9780140020038,A Severed Head,Iris Murdoch,<b>A novel about the frightfulness and ruthles...,3000001525
3,9780140024760,The Unicorn,Iris Murdoch,<b>A brilliant mythical drama about well-meani...,3000001525
4,9780140030341,The Nice and the Good,Iris Murdoch,From the Booker Prize-winning author of <i>The...,3000001525
...,...,...,...,...,...
11998,9788439740070,El resto del mundo rima / The Rest of The Worl...,Carolina Bello,<b>Con una escritura precisa e im&aacute;genes...,3000001916
11999,9788439742272,Un pianista de provincias / Provincial Pianist,Ramiro Sanchiz,<b>&laquo;Federico Stahl es uno de los grandes...,3000001916
12000,9788439743859,El cielo visible / The Invisible Sky,Diego Recoba,<b>&laquo;&iquest;C&oacute;mo se explica lo la...,3000001916
12001,9788439744566,El monte de las furias/ The Hill of Wrath,Fernanda Trías,"<b>La nueva novela de la autora de Mugre rosa,...",3000001916


In [4]:
corpus = pd.read_json('fiction_corpus.json')
corpus = corpus.drop(columns=['isbnHyphenated', 'workId', 'coverUrl',
        'format', 'subformat', 'binding', 'editionTarget', 'trim', 'edition',
        'onSaleDate', 'exportOnSaleDate', 'price', 'exportPrice',
        'globalDivision', 'publishingDivision', 'imprint', 'publishingStatus',
        'series', 'language', 'seq', 'titleBlock', 'authors'])
corpus

Unnamed: 0,isbn,title,author,description,catId
0,9780140014457,Under the Net,Iris Murdoch,<b>Iris Murdoch's debut&mdash;a comic novel ab...,3000001525
1,9780140014747,The Sandcastle,Iris Murdoch,<b>A sparklingly profound novel about the conf...,3000001525
2,9780140020038,A Severed Head,Iris Murdoch,<b>A novel about the frightfulness and ruthles...,3000001525
3,9780140024760,The Unicorn,Iris Murdoch,<b>A brilliant mythical drama about well-meani...,3000001525
4,9780140030341,The Nice and the Good,Iris Murdoch,From the Booker Prize-winning author of <i>The...,3000001525
...,...,...,...,...,...
11318,9781681377964,Waiting for the Fear,"Oguz Atay, translated from the Turkish by Ralp...","<b>Short stories about people on the margins, ...",3000001915
11319,9781892746931,The Prisoner of Ankara,Suat Dervis,<b>An idealistic young man attempts to find hi...,3000001915
11320,9781953861382,Dawn,Sevgi Soysal,<b>A searing autobiographical novel about a si...,3000001915
11321,9781939810465,A Dream Come True,Juan Carlos Onetti,<b><i>A Dream Come True </i>collects the compl...,3000001916


In [5]:
corpus.loc[:, 'description'] = (
    corpus['description']
        .str.replace(r"<.*?>", "", regex=True)            
        .str.replace(r"&[A-Za-z0-9#]+;", "", regex=True)  
        .str.replace(r"\d+", "", regex=True)              
        .str.strip()
)
corpus = corpus.drop_duplicates(subset=['title'], keep='first')
corpus.reset_index(drop=True, inplace=True)
corpus

Unnamed: 0,isbn,title,author,description,catId
0,9780140014457,Under the Net,Iris Murdoch,Iris Murdoch's debuta comic novel about work a...,3000001525
1,9780140014747,The Sandcastle,Iris Murdoch,A sparklingly profound novel about the conflic...,3000001525
2,9780140020038,A Severed Head,Iris Murdoch,A novel about the frightfulness and ruthlessne...,3000001525
3,9780140024760,The Unicorn,Iris Murdoch,A brilliant mythical drama about well-meaning ...,3000001525
4,9780140030341,The Nice and the Good,Iris Murdoch,From the Booker Prize-winning author of The Se...,3000001525
...,...,...,...,...,...
8459,9781681376769,The Wounded Age and Eastern Tales,"Ferit Edgü, translated from the Turkish by Aro...",One of Turkey's most celebrated writers explor...,3000001915
8460,9781681377964,Waiting for the Fear,"Oguz Atay, translated from the Turkish by Ralp...","Short stories about people on the margins, fro...",3000001915
8461,9781892746931,The Prisoner of Ankara,Suat Dervis,An idealistic young man attempts to find his p...,3000001915
8462,9781953861382,Dawn,Sevgi Soysal,A searing autobiographical novel about a singl...,3000001915


In [6]:
corpus.shape

(8464, 5)

#### Step 1. Establish corpus

In [7]:
# Focus on descriptions only
working_corpus = corpus['description']
working_corpus

0       Iris Murdoch's debuta comic novel about work a...
1       A sparklingly profound novel about the conflic...
2       A novel about the frightfulness and ruthlessne...
3       A brilliant mythical drama about well-meaning ...
4       From the Booker Prize-winning author of The Se...
                              ...                        
8459    One of Turkey's most celebrated writers explor...
8460    Short stories about people on the margins, fro...
8461    An idealistic young man attempts to find his p...
8462    A searing autobiographical novel about a singl...
8463    A Dream Come True collects the complete storie...
Name: description, Length: 8464, dtype: object

## TF-IDF Vectorization of Flap-Copy Descriptions

TF-IDF is applied to the book descriptions to numerically encode text.

Term Frequency (TF): Measures how often a word appears within a document, reflecting document-specific importance.

Inverse Document Frequency (IDF): Penalizes common words across the corpus while boosting rare, topic-specific terms.

The resulting sparse TF-IDF matrix serves as the foundation for topic modeling.

#### Step 2. Perform TF-IDF on corpus

In [8]:
# TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(working_corpus)

# Get the feature names (words in the vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense array for easier viewing (for small datasets)
dense_matrix = tfidf_matrix.toarray()

# Create a DataFrame for better readability
tf_idf_df = pd.DataFrame(dense_matrix, columns=feature_names)
tf_idf_df

Unnamed: 0,______________________________________________________________________through,aa,aabria,aaden,aalto,aambc,aames,aand,aap,aapi,...,zuni,zuri,zurich,zurichwith,zuzzo,zweig,zwelf,zydeco,zylas,zyme
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8459,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8460,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8461,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Topic Extraction via SVD

In [9]:
def truncated_svd(X):
    """
    Accepts a design matrix and returns q, sigma, p and the normalized sum of squared distance from the origin

    Inputs
    ------
    design matrix

    Returns
    -------
    q, sigma, p, normalized sum of squared distance from the origin
    """
    import numpy as np
    U, s, Vh = np.linalg.svd(X)
    
    # Get q
    q = U

    # Get sigma
    sigma = np.matmul(np.matmul(U.T, X), Vh.T)

    # Get p
    p = Vh.T

    # Get normalized sum of sqaured distance from the origin
    sq_singular_values = s**2
    nssd = sq_singular_values / sq_singular_values.sum()
    return q, sigma, p, nssd

In [None]:
q, sigma, p, nssd = truncated_svd(tf_idf_df)

In [None]:
fig, ax = plt.subplots()
ax.plot(range(1, len(nssd)+1), nssd, 'o-', label='individual')
ax.plot(range(1, len(nssd)+1), nssd.cumsum(), 'o-', label='cumulative')
ax.legend()
ax.set_ylim(0, 1)
ax.set_xlabel('SV')
ax.set_ylabel('variance explained');

## Topic Extraction via Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) decomposes the TF-IDF matrix into:

A document-topic matrix (each book’s topic weights)

A topic-term matrix (keywords characterizing each topic)

NMF provides:

Dimensionality reduction

Interpretability

Topic clusters reflecting underlying themes in fiction descriptions

Optimal topic count may be evaluated using reconstruction error or coherence metrics.

#### Step 3. Perform NMF

In [69]:
n_components = [5, 10, 15, 20, 30, 40, 50, 100]
errors = []
for n_component in n_components:
    nmf_ = NMF(n_component, max_iter=10000)
    nmf_.fit(tf_idf_df)
    errors.append(nmf_.reconstruction_err_)
plt.plot(n_components, errors, '-o')
plt.xlabel(r'$n_\text{components}$')
plt.ylabel('reconstruction error');

KeyboardInterrupt: 

In [70]:
n_topics = corpus['catId'].nunique()
n_topics

225

In [71]:
# Instantiate the NMF model & specify the number of topics
# Set random_state for reproducibility
nmf_model = NMF(n_components=n_topics, random_state=1).set_output(transform="pandas")

# Fit the NMF model to the TF-IDF matrix
# Note that the 'H' matrix (topic-term distribution) is in nmf_model.components_
# Note that the 'W' matrix (document-topic distribution) can be obtained with model.transform(tfidf_matrix)
nmf_model.fit(tf_idf_df)

# Function to print the top words for each topic (from search results)
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(nmf_model.components_):
        print(f"Topic #{topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()



In [73]:
# Print the top words per topic from search results
# print_top_words(nmf_model, feature_names, topic_no) # uncomment when using topic_no 
print_top_words(nmf_model, feature_names, 10)

Topic #0:
una en que las la novela se es los como
Topic #1:
su en el la se sus por para del vida
Topic #2:
penguin classics introductions disciplines translators authoritative scholars notes bookshelf enhanced
Topic #3:
shes hes doesnt knows isnt theres finally shell whos got
Topic #4:
men marines mutant magneto avengers mutants powerful kill west mr
Topic #5:
new york times bestseller review bestselling yorker instant timesbestselling motel
Topic #6:
short collection writers including volume fiction form anthology literature works
Topic #7:
stories collection published language magazines themes selected complete styles range
Topic #8:
christmas holiday season holidays spirit eve dickens festive gift cheer
Topic #9:
woman beautiful loves body louise womans joan skirt finds voice
Topic #10:
montalbano camilleri sicily inspector andrea live transporting finn blasts long
Topic #11:
war civil army ii soldiers soldier union shaara battle military
Topic #12:
julia miss hazel roof april raise

In [72]:
# Get the document-topic distribution
document_topic = nmf_model.transform(tf_idf_df)
document_topic

Unnamed: 0,nmf0,nmf1,nmf2,nmf3,nmf4,nmf5,nmf6,nmf7,nmf8,nmf9,...,nmf215,nmf216,nmf217,nmf218,nmf219,nmf220,nmf221,nmf222,nmf223,nmf224
0,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
1,0.000000,0.000000,0.000000,0.00000,0.000000,0.013443,0.000000,0.000000,0.0,0.027160,...,0.000000,0.000000,0.000000,0.000000,0.000327,0.000000,0.000000,0.000000,0.0,0.000000
2,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.034566,...,0.000000,0.000000,0.000354,0.001851,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
3,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.113835,...,0.000000,0.000000,0.003540,0.000693,0.000000,0.000000,0.001161,0.001499,0.0,0.000000
4,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.003516,0.0,0.004144
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8995,0.000000,0.063856,0.000000,0.02314,0.000000,0.000000,0.000000,0.000224,0.0,0.000000,...,0.000000,0.000000,0.000000,0.001845,0.000000,0.000326,0.000266,0.000000,0.0,0.000000
8996,0.001550,0.021172,0.000000,0.00000,0.000322,0.000000,0.000000,0.001337,0.0,0.011654,...,0.002232,0.000000,0.000000,0.000000,0.000000,0.000000,0.000404,0.000000,0.0,0.000000
8997,0.001201,0.002125,0.000902,0.00000,0.000000,0.003281,0.000061,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
8998,0.000901,0.000000,0.000000,0.00000,0.000000,0.002289,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000


## Semantic Ranking Using Cosine Similarity

Cosine similarity measures the angular distance between the query’s projected topic vector and each book’s topic vector from the NMF model

This produces relevance scores and enables ranking the entire catalog by semantic similarity to the user’s query.

A reusable query function is implemented:

>Input: A natural-language text query

>Processing: TF-IDF transform → NMF projection → cosine similarity scoring

>Output: Top-10 semantically relevant book titles with metadata

This module forms the end-user interface for book discovery.

#### Step 4. Create a query function

In [None]:
# Get top words per topic from search results
n_top_words = 10
topic_dict = {}
for topic_idx, topic in enumerate(nmf_model.components_):
        topic_dict[topic_idx] = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
topic_dict

{0: ['new',
  'york',
  'times',
  'bestselling',
  'author',
  'bestseller',
  'comes',
  'review',
  'just',
  'instant'],
 1: ['la', 'una', 'que', 'en', 'su', 'el', 'los', 'por', 'del', 'es'],
 2: ['penguin',
  'notes',
  'classics',
  'translators',
  'authoritative',
  'scholars',
  'disciplines',
  'introductions',
  'bookshelf',
  'enhanced'],
 3: ['hes',
  'shes',
  'knows',
  'derek',
  'doesnt',
  'nathan',
  'ashton',
  'sweet',
  'smile',
  'janie'],
 4: ['dimity',
  'lori',
  'aunt',
  'atherton',
  'nancy',
  'ransom',
  'latest',
  'shepherd',
  'viking',
  'watch'],
 5: ['war',
  'world',
  'ii',
  'soldier',
  'civil',
  'young',
  'europe',
  'german',
  'end',
  'england'],
 6: ['paul',
  'glass',
  'post',
  'trilogy',
  'austers',
  'room',
  'detective',
  'washington',
  'locked',
  'novels'],
 7: ['stories',
  'collection',
  'short',
  'tales',
  'volume',
  'writers',
  'language',
  'themes',
  'range',
  'including'],
 8: ['christmas',
  'holiday',
  'season

#### Cosine Similarity Query

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def nmf_query_search(query, tfidf, nmf, W, top_k=20):

    # 1. Transform query using the SAME TF-IDF model
    q_tfidf = tfidf.transform([query])

    # 2. Project query into SAME NMF topic space
    q_vec = nmf.transform(q_tfidf)

    # 3. Cosine similarity
    sims = cosine_similarity(q_vec, W).ravel()

    # 4. Sort
    top_idx = sims.argsort()[::-1][:top_k]

    return top_idx, sims[top_idx]


In [None]:
query = 'Romantic Comedy'

In [None]:
idx, scores =  nmf_query_search(query=query, tfidf=vectorizer, nmf=nmf_model, W=document_topic, top_k=20)
# idx, scores
q_result = corpus.iloc[idx][['isbn', 'title', 'description']]
q_result



Unnamed: 0,isbn,title,description
1647,9780593358955,How to Fake It in Hollywood,A talented Hollywood starlet and a reclusive A...
1873,9780425274798,Misunderstandings,Just when she thought things were going upTwo ...
1496,9781639107094,Here for the Wrong Reasons,"In this swoon-filled lesbian romcom, two datin..."
1955,9780143138563,Passion Project,Passion Project will charm your pants off and ...
419,9780593469330,Your Driver Is Waiting,"A BEST BOOK OF THE YEAR: Autostraddle, Shondal..."
1499,9781998341177,"Diamonds, Deception & Destiny",Readers who love books about rockstar romances...
1848,9780451490803,The Kiss Quotient,From the author of The Bride Test comes a roma...
1450,9780593441541,"Once Smitten, Twice Shy",AN INSTANT USA TODAY BESTSELLER!Star-crossed l...
1652,9780593733035,Alice Rue Evades the Truth,"Vibrant, voice-y, tender, genuinely funny, and..."
1455,9781998854608,Love at First Flight,"Pippa, a neurodiverse air traffic controller w..."


In [None]:
def get_title_categories(isbn):
    params = {
        "api_key": API_KEY,
        "catSetId": "BI",
    }
    cat_url = f"{TITLE_CAT}{isbn}/categories"
    r = session.get(cat_url, params=params, timeout=20)
    print(r)
    print("STATUS:", r.status_code)
    print("URL:", r.url)
    print("RAW:", r.text[:250])

    result = dict(r.json())['data']['categories']
    out = []
    for entry in result:
        out.append(entry['catId'])

    r.raise_for_status()
    return out

In [None]:
isbn = q_result.iloc[0]['isbn']
isbn

9780593358955

In [None]:
get_title_categories(isbn)

<Response [200]>
STATUS: 200
URL: https://api.penguinrandomhouse.com/resources/v2/title/domains/PRH.US/titles/9780593358955/categories?api_key=fh5hj47dynk4nvx4s9ewufj4&catSetId=BI
RAW: {"status":"ok","recordCount":5,"startTimestamp":"2025-11-22T15:37:44Z","endTimestamp":"2025-11-22T15:37:44Z","timeTaken":5,"data":{"categories":[{"catId":3000001525,"description":"Fiction","catSetId":"BI","catUri":"FIC000000","menuText":"Fiction","ha


[3000001525, 3000001715, 3000001724, 3000001728, 3000001771]

In [None]:
def build_catid_confusion_matrix(q_result, target_catid):
    """
    q_result: DataFrame containing ISBNs in column 'isbn'
    target_catid: int, the catId to evaluate

    Returns:
        confusion matrix (DataFrame with rows=Actual, cols=Predicted)
    """

    isbns = q_result["isbn"].tolist()

    rows = []

    for isbn in isbns:
        # Now returns a LIST of catIds
        cat_ids = get_title_categories(isbn)

        # Check if the target category is present
        has_cat = int(target_catid in cat_ids)

        rows.append({"isbn": isbn, "has_cat": has_cat})

    df = pd.DataFrame(rows)

    # All items in q_result are predicted relevant (1)
    df["predicted"] = 1

    # Build confusion matrix
    confusion = pd.crosstab(
        df["has_cat"],
        df["predicted"],
        rownames=["Actual"],
        colnames=["Predicted"]
    )

    return df, confusion

In [None]:
build_catid_confusion_matrix(q_result, 3000001771)

<Response [200]>
STATUS: 200
URL: https://api.penguinrandomhouse.com/resources/v2/title/domains/PRH.US/titles/9780593358955/categories?api_key=fh5hj47dynk4nvx4s9ewufj4&catSetId=BI
RAW: {"status":"ok","recordCount":5,"startTimestamp":"2025-11-22T15:37:44Z","endTimestamp":"2025-11-22T15:37:44Z","timeTaken":5,"data":{"categories":[{"catId":3000001525,"description":"Fiction","catSetId":"BI","catUri":"FIC000000","menuText":"Fiction","ha
<Response [200]>
STATUS: 200
URL: https://api.penguinrandomhouse.com/resources/v2/title/domains/PRH.US/titles/9780425274798/categories?api_key=fh5hj47dynk4nvx4s9ewufj4&catSetId=BI
RAW: {"status":"ok","recordCount":5,"startTimestamp":"2025-11-22T15:37:47Z","endTimestamp":"2025-11-22T15:37:47Z","timeTaken":9,"data":{"categories":[{"catId":3000001525,"description":"Fiction","catSetId":"BI","catUri":"FIC000000","menuText":"Fiction","ha
<Response [200]>
STATUS: 200
URL: https://api.penguinrandomhouse.com/resources/v2/title/domains/PRH.US/titles/9781639107094/cate

(             isbn  has_cat  predicted
 0   9780593358955        1          1
 1   9780425274798        1          1
 2   9781639107094        1          1
 3   9780143138563        1          1
 4   9780593469330        0          1
 5   9781998341177        0          1
 6   9780451490803        0          1
 7   9780593441541        1          1
 8   9780593733035        1          1
 9   9781998854608        1          1
 10  9780143138419        0          1
 11  9798892422352        0          1
 12  9780425275030        0          1
 13  9780451491992        1          1
 14  9780385702621        0          1
 15  9780593199237        1          1
 16  9780593596531        1          1
 17  9780593722961        1          1
 18  9780593200148        1          1
 19  9780425274804        0          1,
 Predicted   1
 Actual       
 0           8
 1          12)