### Data Preparation

**Modules**

In [1]:
import numpy as np
import pandas as pd

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

from dotenv import load_dotenv

import warnings
warnings.filterwarnings("ignore")

In [31]:
load_dotenv()

True

**Load the Data**

In [12]:
books = pd.read_csv("books_cleaned.csv")

books.head(2)

| Unnamed: 0 | isbn13 | isbn10 | title | authors | categories | thumbnail | description | published_year | average_rating | num_pages | ratings_count | title_and_subtitle | tagged_description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9780002005883 | 2005883 | Gilead | Marilynne Robinson | Fiction | http://books.google.com/books/content?id=KQZCP... | A NOVEL THAT READERS and critics have been eag... | 2004.0 | 3.85 | 247.0 | 361.0 | Gilead | 9780002005883 A NOVEL THAT READERS and critics... |
| 1 | 9780002261982 | 2261987 | Spider's Web | Charles Osborne;Agatha Christie | Detective and mystery stories | http://books.google.com/books/content?id=gA5GP... | A new 'Christie for Christmas' -- a full-lengt... | 2000.0 | 3.83 | 241.0 | 5164.0 | Spider's Web: A Novel | 9780002261982 A new 'Christie for Christmas' -... |


### Vectorize the Tagged Description

* The `tagged_description` column is the concatenation of the `isbn13` and `description` columns.
* The `tagged_description` column is the source of truth for vector search: it is the text that gets embedded and searched.
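
As a rough illustration of how such a column can be built, the sketch below joins an ISBN and a description with a single space. The helper name `tag_description` and the sample record are hypothetical. The notebook that produced `books_cleaned.csv` is not shown here, so the pandas expression in the comment is an assumption about how the column was created.

```python
def tag_description(isbn13: int, description: str) -> str:
    """Prefix a description with its ISBN-13 so the book can be identified later."""
    return f"{isbn13} {description}"


# Made-up sample record, not taken from the dataset.
sample = {"isbn13": 9780000000002, "description": "A short example description."}
tagged = tag_description(sample["isbn13"], sample["description"])
print(tagged)

# Assumed pandas equivalent (not confirmed by the source):
# books["tagged_description"] = books["isbn13"].astype(str) + " " + books["description"]
```

Because the ISBN is the first whitespace-separated token, it can be read back from any search hit without extra metadata.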

In [13]:
books["tagged_description"].head()

0    9780002005883 A NOVEL THAT READERS and critics...
1    9780002261982 A new 'Christie for Christmas' -...
2    9780006178736 A memorable, mesmerizing heroine...
3    9780006280897 Lewis' work on the nature of lov...
4    9780006280934 "In The Problem of Pain, C.S. Le...
Name: tagged_description, dtype: object

In [24]:
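# length (in characters) of the first tagged description: this is the base number I split the text by
# note: .values[0] picks the first row, not the longest; .str.len().max() would give the maximum
chunk_size = books["tagged_description"].str.len().values[0]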
chunk_size

np.int64(1168)
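
The length of the first row and the length of the longest description are generally different numbers, which matters when choosing a chunk size. A minimal sketch with made-up strings (not the real data) shows the difference. In pandas, the corresponding calls would be `.str.len().values[0]` and `.str.len().max()`.

```python
# Made-up descriptions, not the real data.
descriptions = ["short text", "a considerably longer description", "mid-sized text"]

lengths = [len(text) for text in descriptions]
first_length = lengths[0]   # what indexing the first row gives
max_length = max(lengths)   # length of the longest description

print(first_length, max_length)
```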

#### Save Tagged Description as Text File

I use the isbn13 (the number at the start of each description in the output above) to identify the book that corresponds to each result of a description search.

In [14]:
books["tagged_description"].to_csv("tagged_description.txt", sep = "\n", index = False, header = False)
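
To make the role of the leading ISBN concrete, here is a small sketch of reading it back from a saved line. The function name `extract_isbn13` and the sample lines are hypothetical. The quote stripping assumes that the CSV writer may wrap some lines in double quotes, which I have not verified against the actual file.

```python
def extract_isbn13(line: str) -> int:
    """Return the leading ISBN-13 of a tagged description line."""
    # Strip whitespace and any surrounding quotes, then take the first token.
    return int(line.strip().strip('"').split()[0])


# Made-up lines in the same "<isbn13> <description>" shape.
lines = [
    '9780000000002 A plain description.',
    '"9780000000019 A description that was wrapped in quotes."',
]
isbns = [extract_isbn13(line) for line in lines]
print(isbns)
```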

#### Split Each Description into a Separate Document

With one description per document, each embedding and each search hit corresponds to exactly one book.

In [26]:
raw_documents = TextLoader("tagged_description.txt", encoding="utf-8").load()
text_splitter = CharacterTextSplitter(chunk_size=chunk_size+10, chunk_overlap=0, separator="\n")
documents = text_splitter.split_documents(raw_documents) # list of documents based on each split

Created a chunk of size 1214, which is longer than the specified 1178
Created a chunk of size 1189, which is longer than the specified 1178
Created a chunk of size 1267, which is longer than the specified 1178
Created a chunk of size 2010, which is longer than the specified 1178
Created a chunk of size 1225, which is longer than the specified 1178
Created a chunk of size 1184, which is longer than the specified 1178
Created a chunk of size 1214, which is longer than the specified 1178
Created a chunk of size 1191, which is longer than the specified 1178
Created a chunk of size 1270, which is longer than the specified 1178
Created a chunk of size 1635, which is longer than the specified 1178
Created a chunk of size 1325, which is longer than the specified 1178
Created a chunk of size 1195, which is longer than the specified 1178
Created a chunk of size 2012, which is longer than the specified 1178
Created a chunk of size 1286, which is longer than the specified 1178
Created a chunk of s
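
The warnings above show that lines longer than `chunk_size` are kept whole. As far as I understand `CharacterTextSplitter`, it may also merge adjacent short lines that fit together under `chunk_size`, in which case one document could hold several descriptions. The sketch below is a plain-Python stand-in, not the LangChain implementation: the hypothetical helper `split_one_per_line` always yields one chunk per line, whatever its length.

```python
def split_one_per_line(text: str) -> list[str]:
    """Return one chunk per non-empty line, regardless of line length."""
    return [line for line in text.split("\n") if line.strip()]


# Made-up file content: two short lines and one long line.
text = "\n".join([
    "9780000000002 Short description.",
    "9780000000019 Another short one.",
    "9780000000026 " + "long " * 300,
])
chunks = split_one_per_line(text)

# Every chunk should hold a single line, so it contains no line break.
assert all("\n" not in chunk for chunk in chunks)
print(len(chunks))
```

With the notebook's own objects, a comparable sanity check would be to compare `len(documents)` with `len(books)`.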

In [29]:
# see what each document looks like: the first document
documents[0]

Document(metadata={'source': 'tagged_description.txt'}, page_content='9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gi

#### Embed the Documents

In [32]:
db_books = Chroma.from_documents(documents, embedding=OpenAIEmbeddings())

Retrieve the top 5 documents from the vector database that are most similar to a query:

In [44]:
query = "A book about parenting"
docs = db_books.similarity_search(query, k = 5)  # top 5 similar 
docs

[Document(id='6e0eca3e-b011-4001-abf4-5899c8965ea0', metadata={'source': 'tagged_description.txt'}, page_content="9781889032207 The bookshelves in your home no doubt contain volumes of books, manuals, seminar notes, magazine articles, and video and audio cassettes purporting to address parenting from a Christian point of view. With rare exception, however, most of today's Christian parenting resources fail to emphasize what is perhaps the most important aspect of true biblical parenting -- how to relate the Bible to the disciplinary process in practical ways. Think about it. With all of your training, do you really know how to use the Bible for doctrine, reproof, correction, and instruction in righteousness with your children? If you do, read no further. If you don't, this little book will augment and strengthen your parenting skills as you learn how to use the scriptures more thoroughly and effectively in your child training. - Back cover."),
 Document(id='ed8fec0e-b1ba-4d25-97d8-8c62
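
Conceptually, a similarity search ranks stored vectors by how close they are to the query vector and returns the best `k`. The toy sketch below illustrates that idea with cosine similarity on made-up 2-dimensional vectors. The metric, the function names, and the vectors are all assumptions for illustration. Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 2-dimensional "embeddings"; real embeddings have many more dimensions.
doc_vecs = {
    "parenting guide": [0.9, 0.1],
    "pirate story": [0.1, 0.9],
    "family memoir": [0.7, 0.3],
}
query_vec = [1.0, 0.0]
result = top_k(query_vec, doc_vecs, k=2)
print(result)
```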

### Retrieval Function

In [43]:
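def retrieve_semantic_recommendations(query: str, top_k: int = 5) -> pd.DataFrame:
    # fetch a pool of 50 candidate documents from the vector store
    retrieved_docs = db_books.similarity_search(query, k=50)

    # the isbn13 is the first whitespace-separated token of each document;
    # strip('"') drops any surrounding quote characters left over from the text file
    books_list = [int(doc.page_content.strip('"').split()[0]) for doc in retrieved_docs]

    # note: top_k is not applied here, and isin() returns rows in the
    # original order of `books` rather than in similarity order
    return books[books["isbn13"].isin(books_list)]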
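
The function above filters `books` with `isin`, which keeps rows in the DataFrame's own order and does not apply `top_k`. One possible way to preserve the similarity ranking and cap the result is sketched below with plain Python records. `select_top_k` and the sample catalog are hypothetical, and this is not the notebook's implementation.

```python
def select_top_k(ranked_isbns: list[int], catalog: list[dict], top_k: int = 5) -> list[dict]:
    """Return up to top_k catalog records, ordered as in ranked_isbns."""
    by_isbn = {record["isbn13"]: record for record in catalog}
    selected = []
    seen = set()
    for isbn in ranked_isbns:
        if len(selected) >= top_k:
            break
        # Skip ISBNs missing from the catalog and ISBNs already selected.
        if isbn in by_isbn and isbn not in seen:
            selected.append(by_isbn[isbn])
            seen.add(isbn)
    return selected


# Made-up catalog and ranking (best match first).
catalog = [
    {"isbn13": 9780000000002, "title": "Book A"},
    {"isbn13": 9780000000019, "title": "Book B"},
    {"isbn13": 9780000000026, "title": "Book C"},
]
ranked_isbns = [9780000000026, 9780000000002, 9780000000026, 9780000000019]
top = select_top_k(ranked_isbns, catalog, top_k=2)
print([record["title"] for record in top])
```

The same idea could be applied to a DataFrame by reordering the filtered rows to follow the ranked ISBN list before taking the first `top_k` rows.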

In [45]:
query = "A book about parenting"
retrieve_semantic_recommendations(query)

| Unnamed: 0 | isbn13 | isbn10 | title | authors | categories | thumbnail | description | published_year | average_rating | num_pages | ratings_count | title_and_subtitle | tagged_description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 92 | 9780060004507 | 0060004509 | An Old-Fashioned Thanksgiving | Louisa May Alcott;James Bernardin | Juvenile Fiction | http://books.google.com/books/content?id=SKYr4... | An adaptation of the original story follows th... | 2005.0 | 3.7 | 32.0 | 599.0 | An Old-Fashioned Thanksgiving | 9780060004507 An adaptation of the original st... |
| 106 | 9780060177249 | 0060177241 | The thief of always | Clive Barker | Fiction | http://books.google.com/books/content?id=jKsrA... | After a mysterious stranger promises to end hi... | 1992.0 | 4.19 | 225.0 | 22123.0 | The thief of always: a fable | 9780060177249 After a mysterious stranger prom... |
| 155 | 9780060564780 | 0060564784 | The Kindness of Strangers | Katrina Kittle | Fiction | http://books.google.com/books/content?id=nDhTB... | A young widow raising two boys, Sarah Laden is... | 2007.0 | 4.01 | 400.0 | 11432.0 | The Kindness of Strangers | 9780060564780 A young widow raising two boys, ... |
| 289 | 9780060915186 | 0060915188 | An American Childhood | Annie Dillard | Biography & Autobiography | http://books.google.com/books/content?id=tRihT... | A book that instantly captured the hearts of r... | 1988.0 | 3.91 | 255.0 | 7086.0 | An American Childhood | 9780060915186 A book that instantly captured t... |
| 328 | 9780060936662 | 0060936665 | Smart Discipline(R) | Larry Koenig | Family & Relationships | http://books.google.com/books/content?id=Bpo2r... | Larry J. Koenig, Ph.D., creator of the hugely ... | 2004.0 | 3.99 | 208.0 | 12.0 | Smart Discipline(R): Fast, Lasting Solutions f... | 9780060936662 Larry J. Koenig, Ph.D., creator ... |
| 332 | 9780060956868 | 0060956860 | Joy in the Morning | Betty Smith | Fiction | http://books.google.com/books/content?id=w2uMG... | The story of a young couple from Brooklyn who ... | 2000.0 | 3.9 | 296.0 | 5559.0 | Joy in the Morning | 9780060956868 The story of a young couple from... |
| 386 | 9780061129735 | 0061129739 | The Art of Loving | Erich Fromm | Self-Help | http://books.google.com/books/content?id=TRMED... | The fiftieth Anniversary Edition of the ground... | 2006.0 | 4.03 | 192.0 | 35605.0 | The Art of Loving | 9780061129735 The fiftieth Anniversary Edition... |
| 399 | 9780061144899 | 0061144894 | When the Heart Waits | Sue Monk Kidd | Religion | http://books.google.com/books/content?id=JlP91... | From the Bestselling Author of The Secret Life... | 2006.0 | 4.17 | 240.0 | 2141.0 | When the Heart Waits: Spiritual Direction for ... | 9780061144899 From the Bestselling Author of T... |
| 934 | 9780152018481 | 0152018484 | How I Became a Pirate | Melinda Long;David Shannon | Juvenile Fiction | http://books.google.com/books/content?id=9Odoo... | Jeremy Jacob joins Braid Beard and his pirate ... | 2003.0 | 4.08 | 44.0 | 23328.0 | How I Became a Pirate | 9780152018481 Jeremy Jacob joins Braid Beard a... |
| 1150 | 9780240806082 | 0240806085 | Directing the Documentary | Michael Rabiger | Performing Arts | http://books.google.com/books/content?id=uoKli... | Michael Rabiger guides the reader through the ... | 2004.0 | 4.23 | 648.0 | 173.0 | Directing the Documentary | 9780240806082 Michael Rabiger guides the reade... |
