<a href="https://colab.research.google.com/github/jasmeet0817/booklm/blob/main/booklm_usage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## INSTALL

In [2]:
!pip install EbookLib
!pip install --upgrade llama-index
!pip install -U sentence-transformers

Collecting EbookLib
  Downloading EbookLib-0.18.tar.gz (115 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: EbookLib
  Building wheel for EbookLib (setup.py) ... done
  Created wheel for EbookLib: filename=EbookLib-0.18-py3-none-any.whl size=38778 sha256=4e3304e290b35c5350a7d0b140dd97815fd20d97a58377ac19d3e5d4f3a4795f
  Stored in directory: /root/.cache/pip/wheels/0f/38/cc/a3728bb72a315d9d8766fb71d362136372066fc25ad838f8fa
Successfully built EbookLib
Installing collected packages: EbookLib
Successfully installed EbookLib-0.18
Collecting llama-index
  Downloading llama_index-0.10.15-py3-none-any.whl (5.6 kB)
Collecting llama-index-agent-openai<0.2.0,>=0.1.4 (from llama-index)
  Downloadin

In [None]:
OPENAI_KEY = ""

In [3]:
from google.colab import drive
drive.mount('data')
DATA_FOLDER = '/content/data/MyDrive/Colab Notebooks/book-llm/data/'

Mounted at data


## LOGGING

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.ERROR)

logger = logging.getLogger()
for handler in logger.handlers[:]:
    if isinstance(handler, logging.StreamHandler) and handler.stream == sys.stdout:
        logger.removeHandler(handler)

In [None]:
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Read book

In [4]:
import ebooklib

from bs4 import BeautifulSoup
from ebooklib import epub
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Document
from llama_index.core.schema import MetadataMode


def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()

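def get_book_content(file_path, search_str, chunk_size, chunk_overlap):
    """Read an EPUB and split its text into nodes.

    If search_str is given, reading stops after the first section that contains it.
    """
    book = epub.read_epub(file_path)
    documents = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        content = html_to_text(item.get_content().decode('utf-8')).strip()
        # Skip sections that contain no text (e.g. image-only pages).
        if content == '':
            continue
        documents.append(Document(text=content))
        # The section containing search_str is kept, then reading stops.
        if search_str is not None and search_str in content:
            break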

    parser = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    nodes = parser.get_nodes_from_documents(documents)
    return nodes

book_nodes = get_book_content(DATA_FOLDER + 'waybound.epub', 'I had it in my hands! Why did I have to give it back?', 512, 50)
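
To make the `chunk_size` and `chunk_overlap` arguments concrete, here is a deliberately simplified sketch of overlapping windows. It is not how `SentenceSplitter` works internally: that class counts tokens and tries to respect sentence boundaries, while this hypothetical `chunk_words` helper just counts whitespace-separated words.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last word of the previous one, which is the same idea as the 50-token overlap used above: a sentence that straddles a chunk boundary still appears whole in at least one chunk.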



In [None]:
len(book_nodes)

436

In [5]:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search_embeddings(embeddings_df, embedding_column, query_embedding, num_results):
    # Note: this adds a '<embedding_column>similarities' column to embeddings_df in place.
    embeddings_df[embedding_column + 'similarities'] = embeddings_df[embedding_column].apply(lambda process_embedding: cosine_similarity(process_embedding, query_embedding))
    return embeddings_df.sort_values(embedding_column + 'similarities', ascending=False).head(num_results)
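
As a sanity check on the ranking logic, the following dependency-free sketch repeats the same idea on toy 2-D vectors: score every row by cosine similarity against the query, sort in descending order, keep the top results. The `top_k` helper and the toy rows are illustrative assumptions, not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(rows, query_embedding, num_results):
    """Return the num_results (row_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(emb, query_embedding), row_id) for row_id, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(row_id, score) for score, row_id in scored[:num_results]]


# Hypothetical toy data: vector length does not matter, only direction.
toy_rows = [
    ("same_direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("opposite", [-1.0, 0.0]),
    ("diagonal", [1.0, 1.0]),
]
results = top_k(toy_rows, query_embedding=[1.0, 0.0], num_results=2)
for row_id, score in results:
    print(row_id, round(score, 4))
```

The row pointing the same way as the query ranks first even though it is twice as long, because cosine similarity normalizes by vector length.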

In [6]:
def print_node_text(node_id):
    # Print the text of the node with the given id, or None if no node matches.
    print(next((node.text for node in book_nodes if node.id_ == node_id), None))

## SENTENCE TRANSFORMER EMBEDDINGS IN A DF

Run only one of the two options below to get embeddings: either load precomputed embeddings from Drive, or compute them from the model.

#### Load embeddings from Drive

In [None]:
import pandas as pd

df = pd.read_csv(DATA_FOLDER + 'waybound_model_v3_df.csv')
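
One caveat when loading from CSV: a column of NumPy arrays written with `to_csv` is stored as text (for example `"[ 0.5 -1.25 ...]"`), so `read_csv` returns strings, and `cosine_similarity` would likely fail on them. The sketch below shows one way to convert such a cell back into numbers. It assumes the whitespace-separated bracket format that NumPy uses when printing a full 1-D array; `parse_embedding` is a hypothetical helper, not part of the notebook.

```python
def parse_embedding(cell):
    """Turn a stringified vector (brackets, whitespace-separated) into a list of floats."""
    return [float(value) for value in cell.strip().strip("[]").split()]


# Line breaks inside the cell are handled because split() treats any whitespace alike.
raw = "[ 0.5 -1.25\n  3.0e-01]"
vector = parse_embedding(raw)
print(vector)
```

In the notebook this could be applied with something like `df['embeddings_v3'] = df['embeddings_v3'].apply(lambda s: np.array(parse_embedding(s)))`, assuming `numpy` is imported as `np` and the stored strings follow the format above.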

#### Compute embeddings from model

In [7]:
from sentence_transformers import SentenceTransformer

model_v3 = SentenceTransformer(DATA_FOLDER + 'finetuned_bge_small_v3')
model_v4 = SentenceTransformer(DATA_FOLDER + 'finetuned_bge_small_v4')
model_v7 = SentenceTransformer(DATA_FOLDER + 'finetuned_bge_small_v7')

In [8]:
model = model_v3

In [9]:
import pandas as pd

df = pd.DataFrame([(node.id_, node.text) for node in book_nodes], columns=['id', 'text'])

In [10]:
df['embeddings_v3'] = df.text.apply(lambda x: model.encode(x))

In [18]:
df.to_csv(DATA_FOLDER + 'waybound_model_v3_df.csv', index=False)

### Search

In [22]:
top_results = search_embeddings(df, 'embeddings_v3', model.encode('What did Reigan Shen create out of Subject one\'s core binding?'), 3)

In [23]:
top_results[['text', 'embeddings_v3similarities']].sort_index()

| index | text | embeddings_v3similarities |
|---|---|---|
| 37 | 3\n\n\n\n\n\nReigan Shen didn’t do his own Sou... | 0.437798 |
| 275 | Before the Dragon Icon could form, Lindon reaf... | 0.395442 |
| 390 | From inside his body, his madra was fading to ... | 0.418516 |


In [24]:
for _, row in top_results.iterrows():
    print('-----------------------------')
    print(row['text'])

-----------------------------
3





Reigan Shen didn’t do his own Soulsmithing. He had people for that.
But the skills of Ozmanthus Arelius, one of the greatest Soulsmiths of all time, still flowed through his mind and spirit. Instincts honed by years of practice, the insight of a genius, and decades if not centuries of weapons-crafting experience now lurked inside Reigan Shen. Now and then, he even felt a shadow of the human’s arrogance bubbling up.
It was the one thing he appreciated about the man.
The core binding of Subject One was too valuable a material for Reigan to trust to others, but it was also unique and irreplaceable, and thus unsuitable for amateurs. His teams of expert Soulsmiths had labored ceaselessly for days while he breathed down their necks, giving them direction filtered through the talents of his greatest enemy.
They finally turned it into the form he wanted, and they had certainly earned their reputations. If they weren’t fine craftsmen, he wouldn’t have retain

In [None]:
prompt = """
You have a user query and several contexts (extracted from a book) below. Not all contexts are relevant to the user query, but some might be. Contexts are ordered chronologically. Use the contexts and no prior knowledge to answer the query.

Procedure to answer:

1. Discard irrelevant contexts that do not answer the query
2. If no context is relevant then call the question unanswerable (a query is likely to be unanswerable so do not be afraid to call it so)
3. Use the relevant contexts to formulate your answer. If multiple contexts are relevant, use all of them

Format of the answer:
- First answer the query below
- Follow the answer with just the number of the context that you used to answer

Query: Did all the Eight Man empire survive dreadgod battles?

Context 1:

"""
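
The prompt above ends at `Context 1:` with no passages filled in. A minimal sketch of assembling the final prompt from a list of retrieved passages might look like this. The `build_prompt` helper and the placeholder passages are assumptions for illustration; in the notebook the passages would presumably come from `top_results`, sorted by index so that they stay in chronological order.

```python
def build_prompt(instructions, query, contexts):
    """Append the query and numbered contexts to the instruction text."""
    parts = [instructions.strip(), "", f"Query: {query}", ""]
    for number, context in enumerate(contexts, start=1):
        parts.append(f"Context {number}:")
        parts.append(context.strip())
        parts.append("")  # blank line between contexts
    return "\n".join(parts)


# Placeholder inputs; real passages would be the retrieved chunk texts.
filled = build_prompt(
    "Use the context and no prior knowledge to answer the query.",
    "Did all the Eight Man empire survive dreadgod battles?",
    ["First passage.", "Second passage."],
)
print(filled)
```

Numbering the contexts from 1 matches the answer format requested in the prompt, which asks the model to cite the number of the context it used.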

**Prompt Key**: response_synthesizer:text_qa_template

**Text:**

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 



**Prompt Key**: response_synthesizer:refine_template

**Text:**

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 
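
The two templates suggest a create-then-refine loop: the first context produces an initial answer, and each later context is used to refine it. The sketch below illustrates that pattern with plain string formatting and a stand-in `llm` callable. It is an assumption-laden simplification, not LlamaIndex's actual implementation, and `answer_with_refine` and `fake_llm` are hypothetical names.

```python
TEXT_QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

REFINE_TEMPLATE = (
    "The original query is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better answer the query. "
    "If the context isn't useful, return the original answer.\n"
    "Refined Answer: "
)


def answer_with_refine(llm, query, contexts):
    """Answer from the first context, then refine once per remaining context."""
    answer = None
    for context in contexts:
        if answer is None:
            prompt = TEXT_QA_TEMPLATE.format(context_str=context, query_str=query)
        else:
            prompt = REFINE_TEMPLATE.format(
                query_str=query, existing_answer=answer, context_msg=context
            )
        answer = llm(prompt)
    return answer


# Stand-in for a real model: records each prompt and returns a numbered reply.
calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"answer {len(calls)}"


final = answer_with_refine(fake_llm, "q?", ["chunk A", "chunk B", "chunk C"])
print(final, len(calls))
```

With three contexts the model is called three times, once with the QA template and twice with the refine template, which is worth keeping in mind when estimating cost per query.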

