# Attempting to Beat a Zero-Shot Classifier with Classical NLP/ML Techniques

## Overview

This notebook investigates whether a traditional supervised NLP pipeline can outperform a strong **zero-shot baseline** on our real-world book-classification task.

The reference model is the widely used
**`facebook/bart-large-mnli`**, applied in a zero-shot setting to infer book genres from each book's textual description, where it reached 77% accuracy.

Rather than relying on LLMs, we deliberately step back into classical machine learning to test a simpler hypothesis:

> With sufficient feature engineering and task-specific signal, classical models can compete with, and sometimes outperform, zero-shot LLM classifiers, and potentially SOTA LLMs such as GPT and DeepSeek.

In [1]:
# Basic Stack
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
# Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline


In [3]:
import sys
import types

import requests

In [4]:
books_cleaned_url = "https://raw.githubusercontent.com/jhlopesalves/classic_workflows/refs/heads/main/LLM/book_recommender/data/books_cleaned.csv"
books = pd.read_csv(books_cleaned_url)
books.head()

| Unnamed: 0 | isbn13 | isbn10 | title | authors | categories | thumbnail | description | published_year | average_rating | num_pages | ratings_count | title_and_subtitle | tagged_description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9780002005883 | 2005883 | Gilead | Marilynne Robinson | Fiction | http://books.google.com/books/content?id=KQZCP... | A NOVEL THAT READERS and critics have been eag... | 2004.0 | 3.85 | 247.0 | 361.0 | Gilead | 9780002005883 A NOVEL THAT READERS and critics... |
| 1 | 9780002261982 | 2261987 | Spider's Web | Charles Osborne;Agatha Christie | Detective and mystery stories | http://books.google.com/books/content?id=gA5GP... | A new 'Christie for Christmas' -- a full-lengt... | 2000.0 | 3.83 | 241.0 | 5164.0 | Spider's Web: A Novel | 9780002261982 A new 'Christie for Christmas' -... |
| 2 | 9780006178736 | 6178731 | Rage of angels | Sidney Sheldon | Fiction | http://books.google.com/books/content?id=FKo2T... | A memorable, mesmerizing heroine Jennifer -- b... | 1993.0 | 3.93 | 512.0 | 29532.0 | Rage of angels | 9780006178736 A memorable, mesmerizing heroine... |
| 3 | 9780006280897 | 6280897 | The Four Loves | Clive Staples Lewis | Christian life | http://books.google.com/books/content?id=XhQ5X... | Lewis' work on the nature of love divides love... | 2002.0 | 4.15 | 170.0 | 33684.0 | The Four Loves | 9780006280897 Lewis' work on the nature of lov... |
| 4 | 9780006280934 | 6280935 | The Problem of Pain | Clive Staples Lewis | Christian life | http://books.google.com/books/content?id=Kk-uV... | "In The Problem of Pain, C.S. Lewis, one of th... | 2002.0 | 4.09 | 176.0 | 37569.0 | The Problem of Pain | 9780006280934 "In The Problem of Pain, C.S. Le... |


## Loading Helper Functions from GitHub

To keep this notebook fully portable, helper functions are loaded directly from a raw GitHub file rather than from a local module.
This choice ensures that the notebook can be executed consistently across different environments:

* locally after cloning the repository,
* directly from GitHub,
* or inside runtimes such as Google Colab.

By fetching the utilities at runtime, we avoid fragile relative imports and problems caused by differences in local folder structure.
The helper module is loaded dynamically into memory and registered in `sys.modules`, so its functions can be accessed just like those of a regular imported module.


In [5]:
def import_from_url(url: str, module_name: str = "remote_module") -> types.ModuleType:
	"""
	Fetches a raw Python script from a URL and loads it as a module.

	Parameters
	----------
	url : str
	    The raw URL of the Python file (e.g., from raw.githubusercontent.com).
	module_name : str
	    The name to assign to the module in sys.modules.

	Returns
	-------
	types.ModuleType
	    The loaded module object.
	"""
	# Fetch the raw code (the timeout keeps the call from hanging indefinitely)
	response = requests.get(url, timeout=30)
	response.raise_for_status()  # Ensure the request succeeded
	source_code = response.text

	# Create a new, empty module object
	module = types.ModuleType(module_name)

	# Populate the module by executing the source code in its namespace
	# This keeps functions/variables contained within 'module', not global
	exec(source_code, module.__dict__)

	# Register it in sys.modules
	# This allows other imported modules to see it if necessary.
	sys.modules[module_name] = module

	return module

In [10]:
# GitHub URL
helper_utils_url = "https://raw.githubusercontent.com/jhlopesalves/classic_workflows/refs/heads/main/LLM/book_recommender/data/utils.py"

# Import it
utils = import_from_url(helper_utils_url, module_name="utils")

# It's now possible to access functions just like a normal package
# df_results = utils.evaluate_candidates_cls(candidates, X, y)
print(f"Module '{utils.__name__}' loaded successfully.")

Module 'utils' loaded successfully.
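
The registration step can be checked without any network access. The sketch below is only an illustration: `demo_helpers` and its `shout` function are hypothetical stand-ins for the fetched file, and the inline string plays the role of `response.text`.

```python
import sys
import types

# Hypothetical stand-in for the text that the HTTP request would return.
source_code = '''
def shout(text):
    """Return the text in upper case with an exclamation mark."""
    return text.upper() + "!"
'''

# Same steps as import_from_url, minus the network call.
demo = types.ModuleType("demo_helpers")
exec(source_code, demo.__dict__)
sys.modules["demo_helpers"] = demo

# Because the name is registered in sys.modules, a plain import statement
# resolves to the in-memory module instead of searching the disk.
import demo_helpers

print(demo_helpers is demo)
print(demo_helpers.shout("loaded"))
```

The `import` statement consults `sys.modules` before searching for files, which is why the in-memory module is returned here.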


## Simplifying the Target Categories

The original dataset contains a rich and heterogeneous set of genre labels. While informative, this level of granularity is not statistically practical for supervised modelling with the available data.

To ensure reliable learning and stable evaluation, categories are collapsed into a binary distinction:

**Fiction** vs **Nonfiction**.

This simplification is intentional. The book descriptions themselves remain semantically rich, and the `description` column will later provide sufficient signal for recommendation and similarity tasks. For classification, however, the fiction/nonfiction boundary is a meaningful, well-defined axis that is both realistic and learnable.

In simpler terms:
> semantic richness is preserved in the dataset, while categorical noise is reduced in the labels.
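
To make the collapsing step concrete, here is a minimal pure-Python sketch. The raw labels are a hypothetical handful chosen for illustration, and the mapping is only a subset; the notebook applies the same idea with pandas. Labels that are absent from the mapping come back as `None`, which is what separates labelled rows from unlabelled ones.

```python
from collections import Counter

# Small illustrative subset of a genre-to-binary mapping.
mapping = {
    "Fiction": "Fiction",
    "Juvenile Fiction": "Fiction",
    "History": "Nonfiction",
    "Philosophy": "Nonfiction",
}

# Hypothetical raw labels; "Christian life" is deliberately left unmapped.
raw_labels = ["Fiction", "History", "Christian life", "Juvenile Fiction", "Philosophy"]

# dict.get returns None for any label that has no entry in the mapping.
simple_labels = [mapping.get(label) for label in raw_labels]

# Rows with a mapped label are usable for supervised training.
labelled = [label for label in simple_labels if label is not None]
counts = Counter(labelled)
n_unlabelled = simple_labels.count(None)

print(counts, n_unlabelled)
```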


In [11]:
category_mapping = {
	"Fiction": "Fiction",
	"Juvenile Fiction": "Fiction",
	"Biography & Autobiography": "Nonfiction",
	"History": "Nonfiction",
	"Literary Criticism": "Nonfiction",
	"Philosophy": "Nonfiction",
	"Religion": "Nonfiction",
	"Comics & Graphic Novels": "Fiction",
	"Drama": "Fiction",
	"Juvenile Nonfiction": "Nonfiction",
	"Science": "Nonfiction",
	"Poetry": "Fiction",
}

books["simple_categories"] = books["categories"].map(category_mapping)

In [12]:
# Extracting rows that already have labels for training/testing
labelled = books[books["simple_categories"].notna()].copy()
# Extracting rows that need labels
unlabelled = books[books["simple_categories"].isna()].copy()

# Using description for the primary semantic signal
X_labelled = labelled["description"]
y_labelled = labelled["simple_categories"]

X_train, X_test, y_train, y_test = train_test_split(X_labelled, y_labelled, test_size=0.2, random_state=42, stratify=y_labelled)

### Baseline Candidates

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

base_candidates = {
	"baseline_logreg": Pipeline(
		steps=[
			("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=25000)),
			("model", LogisticRegression(C=100, max_iter=1000, random_state=42)),
		]
	),
	"baseline_svc": Pipeline(
		steps=[
			("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=25000)),
			("model", LinearSVC(C=1, max_iter=1000, random_state=42)),
		]
	),
	"baseline_tree": Pipeline(
		steps=[
			("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=25000)),
			("model", DecisionTreeClassifier(max_depth=5, random_state=42)),
		]
	),
}

# Use the function from utils (loaded from GitHub)
baseline_evaluation = utils.evaluate_candidates_cls(
	candidates=base_candidates,
	X=X_train,
	y=y_train,
	n_splits=5,
	sort_by="test_accuracy",
	n_jobs=-1,
)

display(baseline_evaluation)

for name, model in base_candidates.items():
	print(f"\n{'=' * 60}\n{name}\n{'=' * 60}")
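	# plot_classifier_metrics is not defined in this notebook; it is assumed to live in the loaded utils module
	metrics = utils.plot_classifier_metrics(X=X_train, y=y_train, estimator=model, cv=5, random_state=42)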
	print(f"AUC: {metrics['mean_auc']:.3f} ± {metrics['std_auc']:.3f}")
	print(f"AP:  {metrics['mean_ap']:.3f} ± {metrics['std_ap']:.3f}")

Evaluating baseline_logreg...


NameError: name 'cross_validate' is not defined
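
One plausible reading of this traceback, offered as an assumption because the helper file itself is not shown here: a function loaded through `exec(source_code, module.__dict__)` looks up global names in the module's own namespace, so names imported in the notebook (such as `cross_validate` in cell 2) are invisible to it. The sketch below reproduces that behaviour with a hypothetical `demo_utils` module and a stand-in `cross_validate`, then shows one possible workaround.

```python
import types

# Hypothetical helper source that uses cross_validate without importing it.
helper_source = '''
def evaluate(data):
    return cross_validate(data)
'''

demo_utils = types.ModuleType("demo_utils")
exec(helper_source, demo_utils.__dict__)


def cross_validate(data):
    """Stand-in defined in the caller's namespace, not in demo_utils."""
    return sum(data) / len(data)


# The helper resolves globals in demo_utils.__dict__, so the caller's
# cross_validate is not found and a NameError is raised.
try:
    demo_utils.evaluate([1, 2, 3])
    error_message = None
except NameError as exc:
    error_message = str(exc)
print(error_message)

# Possible workaround: place the missing name in the module's namespace.
demo_utils.cross_validate = cross_validate
result = demo_utils.evaluate([1, 2, 3])
print(result)
```

If this diagnosis holds, the equivalent line in the notebook would be `utils.cross_validate = cross_validate` before calling `utils.evaluate_candidates_cls`, although adding the import to the helper file itself would be the more durable fix.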