# **Zero-Shot Text Classification with LLMs**

In this section, we use a **Large Language Model (LLM)** to perform **text classification** and collapse the many book categories in our dataset into a small number of meaningful groups.  
Once we have these cleaner groups, we can add them as an additional filter option in our book recommender system.

Text classification is a branch of Natural Language Processing (NLP) focused on assigning text to predefined, discrete groups. In the **zero-shot** setting, the model assigns labels that it was never explicitly trained to predict.

In [1]:
import torch

# Check that the MPS backend (GPU acceleration on Apple silicon) is usable on macOS
print(torch.backends.mps.is_available())  # Should be True
print(torch.backends.mps.is_built())       # Should be True

True
True


In [19]:
import pandas as pd
import numpy as np

from tqdm import tqdm
from transformers import pipeline # Use a pipeline as a high-level helper

---

## Reading the dataset

In [3]:
books = pd.read_csv('datasets/books_cleaned.csv')
books.head(5)

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,short_description,title_and_subtitle,tagged_description
0,9780002005883,2005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,False,Gilead,9780002005883 A NOVEL THAT READERS and critics...
1,9780002261982,2261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,False,Spider's Web: A Novel,9780002261982 A new 'Christie for Christmas' -...
2,9780006178736,6178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0,False,Rage of angels,"9780006178736 A memorable, mesmerizing heroine..."
3,9780006280897,6280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0,False,The Four Loves,9780006280897 Lewis' work on the nature of lov...
4,9780006280934,6280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,37569.0,False,The Problem of Pain,"9780006280934 ""In The Problem of Pain, C.S. Le..."


---

## **Grouping Categories for Better Filtering**

In this section, we aim to create **broader and more useful categories** for our book recommender. We don't want categories that are **too specific** or **too rare**; otherwise, filters would not be meaningful during recommendations.


### **Checking Category Distribution**

This helps us **see all available categories** and **how frequently** they occur in the dataset.


In [4]:
books['categories'].value_counts().reset_index()

Unnamed: 0,categories,count
0,Fiction,2111
1,Juvenile Fiction,390
2,Biography & Autobiography,311
3,History,207
4,Literary Criticism,124
...,...,...
474,Conspiracies,1
475,Brothers and sisters,1
476,Rock musicians,1
477,Community life,1


---

### **Filtering for Broader, Popular Categories**

We now filter to keep only those categories that have **more than 50 books**. This ensures that our filters will have **enough data** to be useful:

In [5]:
books['categories'].value_counts().reset_index().query('count > 50')

Unnamed: 0,categories,count
0,Fiction,2111
1,Juvenile Fiction,390
2,Biography & Autobiography,311
3,History,207
4,Literary Criticism,124
5,Philosophy,117
6,Religion,117
7,Comics & Graphic Novels,116
8,Drama,86
9,Juvenile Nonfiction,57


---

### **Exploring Specific Examples**

Let's see examples of books under **'Juvenile Fiction'**, one of the broader categories:

In [6]:
print(len(books[books['categories'] == 'Juvenile Fiction']))
books[books['categories'] == 'Juvenile Fiction'].head()

390


Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,short_description,title_and_subtitle,tagged_description
30,9780006646006,000664600X,Ocean Star Express,Mark Haddon;Peter Sutton,Juvenile Fiction,http://books.google.com/books/content?id=I2QZA...,Joe and his parents are enjoying a summer holi...,2002.0,3.5,32.0,1.0,False,Ocean Star Express,9780006646006 Joe and his parents are enjoying...
79,9780020442608,0020442602,The voyage of the Dawn Treader,Clive Staples Lewis,Juvenile Fiction,http://books.google.com/books/content?id=fDD3C...,"The ""Dawn Treader"" is the first ship Narnia ha...",1970.0,4.09,216.0,2869.0,False,The voyage of the Dawn Treader,"9780020442608 The ""Dawn Treader"" is the first ..."
85,9780030547744,0030547741,Where the Red Fern Grows,Wilson Rawls,Juvenile Fiction,http://books.google.com/books/content?id=IHpRw...,A young boy living in the Ozarks achieves his ...,2000.0,4.37,288.0,95.0,False,Where the Red Fern Grows: The Story of Two Dog...,9780030547744 A young boy living in the Ozarks...
86,9780060000141,0060000147,Poppy's Return,Avi,Juvenile Fiction,http://books.google.com/books/content?id=XbcMJ...,"There's trouble at Gray House, the girlhood ho...",2006.0,3.99,256.0,1086.0,False,Poppy's Return,"9780060000141 There's trouble at Gray House, t..."
87,9780060001537,0060001534,Diary of a Spider,Doreen Cronin,Juvenile Fiction,http://books.google.com/books/content?id=UWvZo...,This is the diary ... of a spider. But don't b...,2005.0,4.25,40.0,7903.0,False,Diary of a Spider,9780060001537 This is the diary ... of a spide...


---

Similarly, let's look at another category, **'Juvenile Nonfiction'**:


In [7]:
print(len(books[books['categories'] == 'Juvenile Nonfiction']))
books[books['categories'] == 'Juvenile Nonfiction'].head()

57


Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,short_description,title_and_subtitle,tagged_description
107,9780060277406,60277408,The Secret Garden Cookbook,Amy Cotler,Juvenile Nonfiction,http://books.google.com/books/content?id=c7E_H...,Frances Hodgson Burnett's The Secret Garden de...,1999.0,4.28,128.0,142.0,False,The Secret Garden Cookbook: Recipes Inspired b...,9780060277406 Frances Hodgson Burnett's The Se...
108,9780060278427,60278420,Laura's Album,William Anderson,Juvenile Nonfiction,http://books.google.com/books/content?id=_zTkq...,Though best known as the author of the Little ...,1998.0,4.3,80.0,713.0,False,Laura's Album: A Remembrance Scrapbook of Laur...,9780060278427 Though best known as the author ...
228,9780060782139,60782137,Time For Kids: Butterflies!,Editors of TIME For Kids,Juvenile Nonfiction,http://books.google.com/books/content?id=OdZxn...,"Butterflies There are 20,000 different kinds o...",2006.0,4.0,32.0,20.0,False,Time For Kids: Butterflies!,"9780060782139 Butterflies There are 20,000 dif..."
267,9780060882600,60882603,The Annotated Charlotte's Web,E. B. White,Juvenile Nonfiction,http://books.google.com/books/content?id=vaYYH...,"Charlotte's Web, one of America's best-loved c...",2006.0,4.16,320.0,41.0,False,The Annotated Charlotte's Web,"9780060882600 Charlotte's Web, one of America'..."
434,9780064462044,64462048,My Little House Crafts Book,Carolyn Strom Collins,Juvenile Nonfiction,http://books.google.com/books/content?id=lTzrs...,Make the same pioneer crafts that Laura did! I...,1998.0,4.05,64.0,56.0,False,My Little House Crafts Book: 18 Projects from ...,9780064462044 Make the same pioneer crafts tha...


> 🧠 **Note:**  
Choosing broader categories with a decent number of books helps in **building powerful, balanced filters** when users search or explore recommendations.


---

### 📚 **Mapping Detailed Categories into Broader Groups**

In this step, we create a **category mapping** to simplify the large variety of book categories into just **four broader groups**.  
This helps in making our filters cleaner and the recommendations more effective.

The broader groups we are using are:
- **Fiction**
- **Nonfiction**
- **Children's Fiction**
- **Children's Nonfiction**

We map 12 of the most frequent categories into these broader groups using the following dictionary:

In [8]:
category_mapping = {
    'Fiction': "Fiction",
    'Juvenile Fiction': "Children's Fiction",
    'Biography & Autobiography': "Nonfiction",
    'History': "Nonfiction",
    'Literary Criticism': "Nonfiction",
    'Philosophy': "Nonfiction",
    'Religion': "Nonfiction",
    'Comics & Graphic Novels': "Fiction",
    'Drama': "Fiction",
    'Juvenile Nonfiction': "Children's Nonfiction",
    'Science': "Nonfiction",
    'Poetry': "Fiction"
}

We now use this mapping to create a new column in our dataset for **easier filtering and grouping** in our recommender system.

In [9]:
books['simple_categories'] = books['categories'].map(category_mapping)

We now look at how many books have known simple categories.

In [10]:
print('Number of books with known labels: ', len(books[~books['simple_categories'].isnull()]))

Number of books with known labels:  3743
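
The books without a label come from how `Series.map` treats values that are not keys in the dictionary. The toy example below is a minimal sketch of that behaviour; the `toy` DataFrame and `toy_mapping` are made-up illustrations, not part of our dataset.

```python
import pandas as pd

# Toy data: two categories are in the mapping, one is not, and one is missing.
toy = pd.DataFrame({"categories": ["Fiction", "History", "Rock musicians", None]})
toy_mapping = {"Fiction": "Fiction", "History": "Nonfiction"}

# Series.map with a dict returns NaN for every value that is not a key.
toy["simple_categories"] = toy["categories"].map(toy_mapping)

n_known = toy["simple_categories"].notna().sum()
n_unknown = toy["simple_categories"].isna().sum()
print(toy)
print("known:", n_known, "unknown:", n_unknown)
```

Rows with an unmapped or missing category end up with a missing `simple_categories` value, which is why only part of the dataset has a known label at this point.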


---

## 🤗 **Using Hugging Face Transformers for Zero-Shot Classification**

In this step, we introduce **Hugging Face**, a popular open-source platform that provides thousands of pre-trained models for tasks like text classification, summarization, translation, and much more.  
The **Transformers** library from Hugging Face makes it extremely easy to use state-of-the-art models with just a few lines of code.

For our **zero-shot text classification**, we will use:

In [11]:
pipe = pipeline(
    "zero-shot-classification", 
    model="facebook/bart-large-mnli",
    device="mps"
)

Device set to use mps


- `pipeline()` provides a simple API to perform complex tasks.
- We are specifying:
  - `"zero-shot-classification"` as the task.
  - `"facebook/bart-large-mnli"` as the model: BART-large fine-tuned on the MultiNLI dataset for Natural Language Inference (NLI), which makes it well suited to zero-shot tasks.
  - `device="mps"` so that inference runs on the Apple-silicon GPU we checked for earlier.
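
To make the NLI idea more concrete, here is a minimal sketch of how zero-shot scores can be derived. Each candidate label is turned into a hypothesis sentence, the model scores how strongly the text entails each hypothesis, and a softmax over those entailment scores gives one probability per label. The template string is assumed to match the pipeline's default wording, and the logits are made-up numbers for illustration, not real model output.

```python
import math

# Hypothesis template; assumed to match the pipeline's default wording.
HYPOTHESIS_TEMPLATE = "This example is {}."

def build_hypotheses(labels):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [HYPOTHESIS_TEMPLATE.format(label) for label in labels]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Fiction", "Nonfiction"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per hypothesis (NOT real model output).
entailment_logits = [2.0, 0.5]
scores = softmax(entailment_logits)

for hypothesis, score in zip(hypotheses, scores):
    print(f"{hypothesis!r}: {score:.3f}")
```

Because the labels only appear inside the hypothesis text, we can change the candidate labels freely without retraining the model.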

---


Now that we have our **broader categories** and a pre-trained **zero-shot classification** model loaded, let's try predicting the category for one of our book descriptions. For this test we use only two candidate labels, **Fiction** and **Nonfiction**.

1. First, we pick a description from a book that we know belongs to the "Fiction" category:
    - Here, we filter the `books` DataFrame to select only rows where `simple_categories` is `"Fiction"`.
    - We then **reset the index** so that we can easily pick the **first** (`0th`) description.

2. Now, we pass the selected book description into our **zero-shot classification pipeline** (`pipe`) along with the list of target categories:
    - `sequence` is the book description text.
    - `simple_categories` is the list of candidate labels (`["Fiction", "Nonfiction"]`).
    - The model will **analyze** the description and **assign probabilities** for each category.
    - The category with the highest score will be considered the model's prediction!

In [15]:
simple_categories = ['Fiction', 'Nonfiction']

sequence = books[books['simple_categories'] == 'Fiction']['description'].reset_index(drop=True)[0]

pipe(sequence, simple_categories)

{'sequence': 'A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best

In [17]:
prediction = pipe(sequence, simple_categories)

max_index = np.argmax(prediction['scores'])
max_label = prediction['labels'][max_index]
max_label

'Fiction'

---

### 🔧 **Wrapping Zero-Shot Classification into a Function**

Instead of running zero-shot classification manually each time, let's **wrap the logic into a reusable function**.  
This makes it easy to call the model on any book description and get the predicted category.

In [18]:
def generate_predictions(sequence, categories):
    prediction = pipe(sequence, categories)
    max_index = np.argmax(prediction['scores'])
    max_label = prediction['labels'][max_index]
    return max_label

- `sequence`: The text description we want to classify.
- `categories`: List of target categories (e.g., fiction, nonfiction, etc.)
- The function uses the model pipeline (`pipe`) to predict scores for each category.
- It **selects the label** with the **highest probability score** and returns it as the predicted category.
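
If we also want to know how confident the model was, a small variation can return the score alongside the label. This is only a sketch: `top_label_with_score` is a hypothetical helper, and `fake_prediction` is a hand-written dictionary in the same shape as the pipeline output (`sequence`, `labels`, `scores`), with made-up scores.

```python
def top_label_with_score(prediction):
    """Return (label, score) for the highest-scoring label in a result dict."""
    scores = prediction["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return prediction["labels"][best], scores[best]

# Hand-written stand-in for a pipeline result (scores are made up).
fake_prediction = {
    "sequence": "A wizard sets out on a quest.",
    "labels": ["Fiction", "Nonfiction"],
    "scores": [0.91, 0.09],
}

label, score = top_label_with_score(fake_prediction)
print(label, score)
```

Keeping the score makes it possible to flag low-confidence predictions for manual review later, although we do not do that in this notebook.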


---

### 🚀 **Generating Predictions for Multiple Descriptions**

Now let's **predict categories for multiple book descriptions** and **evaluate the model's performance**.

We will:
- Classify the first 300 descriptions from each of the **Fiction** and **Nonfiction** groups.
- Compare the model's predictions with the actual labels.

In [21]:
actual_categories = []
predicted_categories = []

for i in tqdm(range(0, 300)):
    sequence = books[books['simple_categories'] == 'Fiction']['description'].reset_index(drop=True)[i]
    predicted_categories += [generate_predictions(sequence=sequence, categories=simple_categories)]
    actual_categories += ['Fiction']
    
for i in tqdm(range(0, 300)):
    sequence = books[books['simple_categories'] == 'Nonfiction']['description'].reset_index(drop=True)[i]
    predicted_categories += [generate_predictions(sequence=sequence, categories=simple_categories)]
    actual_categories += ['Nonfiction']
    

100%|██████████| 300/300 [02:05<00:00,  2.39it/s]
100%|██████████| 300/300 [02:41<00:00,  1.86it/s]
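
The loops above re-filter the DataFrame on every iteration. One possible alternative, sketched below, filters once per label and takes the first `n_per_label` descriptions. `collect_predictions` and `keyword_classifier` are hypothetical names; the keyword rule is only a stand-in so that the sketch runs without loading the model.

```python
import pandas as pd

def collect_predictions(df, classify, labels, n_per_label):
    """Classify the first n_per_label descriptions of each label."""
    rows = []
    for label in labels:
        # Filter once per label instead of once per iteration.
        subset = df.loc[df["simple_categories"] == label, "description"].head(n_per_label)
        for text in subset:
            rows.append({
                "actual_categories": label,
                "predicted_categories": classify(text, labels),
            })
    return pd.DataFrame(rows)

# Stand-in classifier (hypothetical keyword rule), NOT the zero-shot model.
def keyword_classifier(text, labels):
    return "Fiction" if "novel" in text.lower() else "Nonfiction"

toy_books = pd.DataFrame({
    "simple_categories": ["Fiction", "Fiction", "Nonfiction", "Nonfiction"],
    "description": [
        "A novel about a preacher.",
        "A tale of two sisters.",
        "A history of Rome.",
        "A study of the novel as a form.",
    ],
})

toy_predictions = collect_predictions(
    toy_books, keyword_classifier, ["Fiction", "Nonfiction"], n_per_label=2
)
print(toy_predictions)
```

In the notebook, `generate_predictions` could be passed in place of `keyword_classifier`, since it takes the same two arguments.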


---

### 📋 Creating a Predictions DataFrame

This table shows the **true category** vs the **predicted category** for each book description.

In [22]:
predictions_df = pd.DataFrame({'actual_categories': actual_categories, 'predicted_categories': predicted_categories})
predictions_df

Unnamed: 0,actual_categories,predicted_categories
0,Fiction,Fiction
1,Fiction,Fiction
2,Fiction,Fiction
3,Fiction,Nonfiction
4,Fiction,Fiction
...,...,...
595,Nonfiction,Nonfiction
596,Nonfiction,Fiction
597,Nonfiction,Nonfiction
598,Nonfiction,Nonfiction


---

### ✅ Evaluating Prediction Accuracy

Now, let's **calculate the accuracy** of our zero-shot classification!

In [25]:
predictions_df['correct_prediction'] = np.where(
    predictions_df['actual_categories'] == predictions_df['predicted_categories'],
    1,
    0
)

print('Prediction accuracy: ', predictions_df['correct_prediction'].sum() / len(predictions_df) * 100, '%')

Prediction accuracy:  77.83333333333333 %


After running zero-shot classification on 600 book descriptions, we achieved a **prediction accuracy of 77.8%**.

✅ **77.8% accuracy** is a **solid result** for a two-class problem where random guessing on this balanced sample would score about 50%, especially considering that:
- We used a **zero-shot model** with **no fine-tuning**.
- The task involved real-world, diverse book descriptions, not simple textbook examples.
- Even humans might struggle to classify some books correctly based only on short descriptions.

📖 **Zero-shot learning** shines here because it shows that large pre-trained models (like `facebook/bart-large-mnli`) have **learned general semantic understanding** from massive datasets during training, even without specific domain adaptation to books!
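
A single accuracy number hides which class the errors fall in. As a hedged sketch, a confusion table and per-class accuracy can be computed as shown below. The `toy` DataFrame is made-up data with the same column names as `predictions_df`, so the numbers it prints say nothing about our actual results.

```python
import pandas as pd

# Made-up predictions with the same columns as predictions_df.
toy = pd.DataFrame({
    "actual_categories": ["Fiction", "Fiction", "Fiction",
                          "Nonfiction", "Nonfiction", "Nonfiction"],
    "predicted_categories": ["Fiction", "Fiction", "Nonfiction",
                             "Nonfiction", "Nonfiction", "Fiction"],
})

# Rows are the true labels, columns are the predicted labels.
confusion = pd.crosstab(toy["actual_categories"], toy["predicted_categories"])

# Share of correct predictions overall and within each true class.
correct = toy["actual_categories"] == toy["predicted_categories"]
overall_accuracy = correct.mean()
per_class_accuracy = correct.groupby(toy["actual_categories"]).mean()

print(confusion)
print(per_class_accuracy)
print(overall_accuracy)
```

Running the same three steps on `predictions_df` would show whether the model misclassifies Fiction or Nonfiction more often.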

---

### 🛠️ Using the Model for Books with Missing Categories

Now that we have validated the classification method, we can **apply this model to books that are missing a category**, or whose category is too specific or inconsistent to appear in our mapping.

For any book **without a simple category**, we:
- Feed its description into the zero-shot model.
- Predict one of our **broader, consistent categories**: `"Fiction"` or `"Nonfiction"`.

📚 This **ensures every book** in our recommender system **has a clean, useful category**, allowing:
- Better filtering options for users.
- More personalized and structured recommendations.

In [26]:
isbns = []
predicted_categories = []

missing_categories = books.loc[books['simple_categories'].isnull(), ['isbn13', 'description']].reset_index(drop=True)
# Books whose original category is missing or not covered by category_mapping
print('Number of books without simple category: ', len(missing_categories))

Number of books without simple category:  1454


This leaves 1454 books without a known simple category, while the remaining 3743 already have one assigned.

In [29]:
# Finding predicted categories for all the books with missing categories keeping isbns and predicted_categories in a list
tot_len = len(missing_categories)
for i in tqdm(range(0, tot_len)):
    sequence = missing_categories['description'][i]
    predicted_categories += [generate_predictions(sequence=sequence, categories=simple_categories)]
    isbns += [missing_categories['isbn13'][i]]

100%|██████████| 1454/1454 [10:13<00:00,  2.37it/s]
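
The loop above sends one description at a time to the model. The zero-shot pipeline also accepts a list of sequences and returns one result dictionary per sequence, so the descriptions could be processed in chunks instead. The sketch below shows only the chunking logic: `chunked`, `predict_in_batches` and `fake_pipe` are hypothetical names, and `fake_pipe` is a keyword-based stand-in for `pipe`. Whether batching is actually faster on MPS was not measured here.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def predict_in_batches(texts, classify_batch, labels, batch_size=16):
    """Classify texts chunk by chunk and keep the top label for each text."""
    predicted = []
    for batch in chunked(texts, batch_size):
        for result in classify_batch(batch, labels):
            # Pick the label paired with the highest score.
            best_score, best_label = max(zip(result["scores"], result["labels"]))
            predicted.append(best_label)
    return predicted

# Stand-in for `pipe` (hypothetical keyword rule), NOT the zero-shot model.
def fake_pipe(batch, labels):
    results = []
    for text in batch:
        ordered = list(labels) if "story" in text else list(reversed(labels))
        results.append({"sequence": text, "labels": ordered, "scores": [0.8, 0.2]})
    return results

texts = ["a story of love", "a guide to tax law", "a ghost story"]
batched_labels = predict_in_batches(texts, fake_pipe, ["Fiction", "Nonfiction"], batch_size=2)
print(batched_labels)
```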


Now, we build a DataFrame `missing_predicted_df` from `isbns` and `predicted_categories` and merge it into the original DataFrame.

In [30]:
missing_predicted_df = pd.DataFrame({'isbn13': isbns, 'predicted_categories': predicted_categories})
missing_predicted_df.head()

Unnamed: 0,isbn13,predicted_categories
0,9780002261982,Fiction
1,9780006280897,Nonfiction
2,9780006280934,Nonfiction
3,9780006380832,Nonfiction
4,9780006470229,Fiction


In [31]:
books = pd.merge(books, missing_predicted_df, on='isbn13', how='left')
books['simple_categories'] = np.where(
    books['simple_categories'].isnull(),
    books['predicted_categories'],
    books['simple_categories']
)
books = books.drop(columns = ['predicted_categories'])
books.head()

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,short_description,title_and_subtitle,tagged_description,simple_categories
0,9780002005883,2005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,361.0,False,Gilead,9780002005883 A NOVEL THAT READERS and critics...,Fiction
1,9780002261982,2261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,5164.0,False,Spider's Web: A Novel,9780002261982 A new 'Christie for Christmas' -...,Fiction
2,9780006178736,6178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,29532.0,False,Rage of angels,"9780006178736 A memorable, mesmerizing heroine...",Fiction
3,9780006280897,6280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,33684.0,False,The Four Loves,9780006280897 Lewis' work on the nature of lov...,Nonfiction
4,9780006280934,6280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,37569.0,False,The Problem of Pain,"9780006280934 ""In The Problem of Pain, C.S. Le...",Nonfiction
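
The merge-then-fill step can also be written with `fillna`, which some readers may find easier to follow than `np.where`. The toy example below is a sketch under the assumption that `isbn13` is unique in both tables; the `validate` argument makes pandas raise an error if that assumption does not hold. All data in it is made up.

```python
import pandas as pd

# Made-up books: one already labelled, two still missing a simple category.
toy_books = pd.DataFrame({
    "isbn13": [1, 2, 3],
    "simple_categories": ["Fiction", None, None],
})
# Made-up predictions for the two unlabelled books.
toy_predicted = pd.DataFrame({
    "isbn13": [2, 3],
    "predicted_categories": ["Nonfiction", "Fiction"],
})

# Left merge keeps every book; validate guards against duplicate ISBNs.
merged = toy_books.merge(toy_predicted, on="isbn13", how="left", validate="one_to_one")

# Fill only the missing labels with the predicted ones, then drop the helper column.
merged["simple_categories"] = merged["simple_categories"].fillna(merged["predicted_categories"])
merged = merged.drop(columns=["predicted_categories"])
print(merged)
```

Existing labels are left untouched, and only the rows that had no simple category receive the predicted one.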


In [32]:
def summary(df):
    """
    Extended describe() function:
    Inserts 'missing' and 'missing %' rows after the first two rows of the describe() output.
    
    Parameters:
    df (pd.DataFrame): The DataFrame to summarize.
    
    Returns:
    pd.DataFrame: Summary statistics including missing-value counts and percentages.
    """
    desc = df.describe(include='all')

    # Count missing values per column and express them as a rounded percentage
    missing = df.isnull().sum()
    missing_perc = round(missing / len(df) * 100)

    # Insert the 'missing' and 'missing %' rows after the first two rows of describe()
    desc = pd.concat(
        [desc.iloc[:2], 
         pd.DataFrame([missing], index=['missing']), 
         pd.DataFrame([missing_perc], index=['missing %']), 
         desc.iloc[2:]
        ],
        axis=0
    )

    return desc

summary(books)

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,ratings_count,short_description,title_and_subtitle,tagged_description,simple_categories
count,5197.0,5197.0,5197,5165,5167,5031,5197,5197.0,5197.0,5197.0,5197.0,5197,5197,5197,5197
unique,,5197.0,4969,3045,479,5031,5154,,,,,1,5056,5197,4
missing,0.0,0.0,0,32,30,166,0,0.0,0.0,0.0,0.0,0,0,0,0
missing %,0.0,0.0,0.0,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
top,,2005883.0,The Lord of the Rings,Agatha Christie,Fiction,http://books.google.com/books/content?id=KQZCP...,This is a reproduction of the original artefac...,,,,,False,The Lord of the Rings,9780002005883 A NOVEL THAT READERS and critics...,Fiction
freq,,1.0,9,30,2111,1,6,,,,,5197,6,1,2808
mean,9780667000000.0,,,,,,,1999.804118,3.922246,348.472195,21131.12,,,,
std,595105300.0,,,,,,,9.082979,0.324975,229.891672,144648.0,,,,
min,9780002000000.0,,,,,,,1876.0,0.0,0.0,0.0,,,,
25%,9780313000000.0,,,,,,,1998.0,3.75,213.0,183.0,,,,


Looking at the summary, the `simple_categories` column has 4 unique categories and no missing values, as expected.

---

### 🗂️ Creating and Saving the Final DataFrame

In [33]:
books.to_csv('datasets/books_with_categories.csv', index=False)