### Data Preparation

**Modules**

In [1]:
import numpy as np
import pandas as pd

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

from dotenv import load_dotenv

import warnings
warnings.filterwarnings("ignore")

In [31]:
load_dotenv()

True

**Load the Data**

In [12]:
books = pd.read_csv("books_cleaned.csv")

books.head(2)

| Unnamed: 0 | isbn13 | isbn10 | title | authors | categories | thumbnail | description | published_year | average_rating | num_pages | ratings_count | title_and_subtitle | tagged_description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9780002005883 | 2005883 | Gilead | Marilynne Robinson | Fiction | http://books.google.com/books/content?id=KQZCP... | A NOVEL THAT READERS and critics have been eag... | 2004.0 | 3.85 | 247.0 | 361.0 | Gilead | 9780002005883 A NOVEL THAT READERS and critics... |
| 1 | 9780002261982 | 2261987 | Spider's Web | Charles Osborne;Agatha Christie | Detective and mystery stories | http://books.google.com/books/content?id=gA5GP... | A new 'Christie for Christmas' -- a full-lengt... | 2000.0 | 3.83 | 241.0 | 5164.0 | Spider's Web: A Novel | 9780002261982 A new 'Christie for Christmas' -... |


### Vectorize the Tagged Description

* The `tagged_description` column is the concatenation of the `isbn13` and `description` columns.
* The `tagged_description` column is the source of truth for vector search.
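
As an illustration of how such a column could be built, here is a minimal plain-Python sketch. The sample records are made up, and the commented pandas line is an assumed equivalent, not necessarily the code that produced `books_cleaned.csv`.

```python
# Hypothetical sample records standing in for rows of the books table.
records = [
    {"isbn13": 9780000000001, "description": "A made-up story about a lighthouse."},
    {"isbn13": 9780000000002, "description": "A made-up guide to garden birds."},
]


def tag_description(record):
    """Prefix the description with its ISBN-13, separated by one space."""
    return f"{record['isbn13']} {record['description']}"


tagged = [tag_description(r) for r in records]

# Assumed pandas equivalent for a DataFrame with the same two columns:
# books["tagged_description"] = books["isbn13"].astype(str) + " " + books["description"]
```

Keeping the identifier at the very start of the text means it can later be recovered by reading the first whitespace-separated token.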

In [13]:
books["tagged_description"].head()

0    9780002005883 A NOVEL THAT READERS and critics...
1    9780002261982 A new 'Christie for Christmas' -...
2    9780006178736 A memorable, mesmerizing heroine...
3    9780006280897 Lewis' work on the nature of lov...
4    9780006280934 "In The Problem of Pain, C.S. Le...
Name: tagged_description, dtype: object

In [24]:
# length of the first description (note: .values[0] is the first row, not the maximum); used below as the splitter's chunk size
chunk_size = books["tagged_description"].str.len().values[0]
chunk_size

np.int64(1168)
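
The cell above reads the length of the first row, which is not necessarily the longest description. A small sketch of the difference, using made-up strings in place of the real column:

```python
# Made-up descriptions; the real data lives in books["tagged_description"].
descriptions = [
    "9780000000001 Short blurb.",
    "9780000000002 A considerably longer blurb that runs on for a while.",
    "9780000000003 Medium-length blurb here.",
]

first_length = len(descriptions[0])             # what .values[0] corresponds to
max_length = max(len(d) for d in descriptions)  # the longest entry

# Assumed pandas equivalent for the longest entry:
# books["tagged_description"].str.len().max()
```

If the chunk size is meant to cover every description, the maximum is the value to check against.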

#### Save Tagged Description as Text File

The ISBN-13 at the start of each tagged description (the number on the left in the output above) identifies which book a description-search result belongs to.

In [14]:
books["tagged_description"].to_csv("tagged_description.txt", sep = "\n", index = False, header = False)
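
Because each line of the text file is later treated as one book, a description that contains its own line break could end up split across two lines. The following is a hedged alternative sketch (not the notebook's `to_csv` call) that collapses internal whitespace before writing one description per line. The sample strings, the helper name `write_one_per_line`, and the demo file name are all hypothetical.

```python
import os
import tempfile

# Made-up descriptions; the first one contains an embedded line break.
tagged = [
    "9780000000001 A made-up story\nwith an embedded line break.",
    "9780000000002 A made-up guide to garden birds.",
]


def write_one_per_line(texts, path):
    """Write each text on its own line, collapsing any internal whitespace runs."""
    with open(path, "w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(" ".join(text.split()) + "\n")


path = os.path.join(tempfile.mkdtemp(), "tagged_description_demo.txt")
write_one_per_line(tagged, path)

# Read the file back to confirm there is exactly one line per description.
with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```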

#### Split Each Description into a Separate Document

The aim is one document per book description, so that each embedding, and each search hit, can be traced back to a single ISBN-13.

In [26]:
raw_documents = TextLoader("tagged_description.txt", encoding="utf-8").load()
text_splitter = CharacterTextSplitter(chunk_size=chunk_size+10, chunk_overlap=0, separator="\n")
documents = text_splitter.split_documents(raw_documents) # list of documents based on each split

Created a chunk of size 1214, which is longer than the specified 1178
Created a chunk of size 1189, which is longer than the specified 1178
Created a chunk of size 1267, which is longer than the specified 1178
Created a chunk of size 2010, which is longer than the specified 1178
Created a chunk of size 1225, which is longer than the specified 1178
Created a chunk of size 1184, which is longer than the specified 1178
Created a chunk of size 1214, which is longer than the specified 1178
Created a chunk of size 1191, which is longer than the specified 1178
Created a chunk of size 1270, which is longer than the specified 1178
Created a chunk of size 1635, which is longer than the specified 1178
Created a chunk of size 1325, which is longer than the specified 1178
Created a chunk of size 1195, which is longer than the specified 1178
Created a chunk of size 2012, which is longer than the specified 1178
Created a chunk of size 1286, which is longer than the specified 1178
Created a chunk of s
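
The warnings above appear when a single line is longer than the requested chunk size. To build intuition for how a separator-based splitter interacts with `chunk_size`, here is a simplified greedy-merge model in plain Python. It is an illustrative assumption, not LangChain's actual implementation, and `greedy_merge` is a hypothetical helper.

```python
def greedy_merge(lines, chunk_size, separator="\n"):
    """Pack consecutive lines into chunks without exceeding chunk_size.

    A line that is itself longer than chunk_size still becomes its own
    (oversized) chunk, because it cannot be divided at the separator.
    """
    chunks, current, current_len = [], [], 0
    for line in lines:
        extra = len(line) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # Adding this line would overflow: close the current chunk first.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(line)
        current.append(line)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


lines = ["a" * 5, "b" * 5, "c" * 30]

merged = greedy_merge(lines, chunk_size=12)
oversized = [chunk for chunk in merged if len(chunk) > 12]

one_per_line = greedy_merge(lines, chunk_size=0)
```

In this model, two short lines are packed into one chunk when the limit allows it, while a limit of 0 keeps every line separate. Whether the real splitter behaves the same way is worth confirming by comparing `len(documents)` with the number of rows in `books`.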

In [29]:
# see what each document looks like: the first document
documents[0]

Document(metadata={'source': 'tagged_description.txt'}, page_content='9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gi

#### Embed the Documents

In [32]:
db_books = Chroma.from_documents(documents, embedding=OpenAIEmbeddings())
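
Conceptually, a vector store keeps one embedding per document and ranks them by closeness to the query embedding. The sketch below shows that idea with tiny made-up vectors and cosine similarity. Both are assumptions for illustration: real embeddings have far more dimensions, and the metric a given vector store uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up (text, embedding) pairs standing in for embedded descriptions.
stored = [
    ("9780000000001 lighthouse story", [1.0, 0.0, 0.0]),
    ("9780000000002 garden birds guide", [0.0, 1.0, 0.2]),
    ("9780000000003 forest walks for kids", [0.1, 0.9, 0.4]),
]
query_vector = [0.0, 1.0, 0.3]

best = top_k(query_vector, stored, k=2)
```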

Retrieve the top 5 documents from the vector database that are most similar to a query:

In [None]:
query = "A book to teach children about nature"
docs = db_books.similarity_search(query, k = 5)  # top 5 similar 
docs
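
Each returned document starts with an ISBN-13, so results can be mapped back to books by parsing that leading token. A minimal sketch follows, assuming every result exposes its text as `page_content`. `FakeDocument`, the catalog dictionary, and `isbn_from_text` are hypothetical stand-ins so the example runs without a vector store.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a search result with a page_content field."""
    page_content: str


# Made-up catalog keyed by ISBN-13; in the notebook this role is played by `books`.
catalog = {
    9780000000001: "Lighthouse Story",
    9780000000002: "Garden Birds",
}


def isbn_from_text(text):
    """Read the leading ISBN-13, tolerating a leading double quote in the text."""
    return int(text.strip('"').split()[0])


docs = [
    FakeDocument("9780000000002 A made-up guide to garden birds."),
    FakeDocument('"9780000000001 A made-up story about a lighthouse.'),
]

titles = [catalog[isbn_from_text(doc.page_content)] for doc in docs]

# Assumed pandas equivalent for looking up full rows:
# books[books["isbn13"].isin([isbn_from_text(d.page_content) for d in docs])]
```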