# Book Reading Assignment Pipeline 

## Prerequisites

In [1]:
!pip install llama-index --upgrade --force
!pip install langchain --upgrade
!pip install langchain_community
!pip install boto3
!pip install pinecone-client
!pip install python-dotenv
!pip install cohere
!pip install langchain sentence_transformers

Python 3.11.5


## Data Preparation and ingestion pipeline

The first step of the assignment is reading the booking right?! So let's start reading then! We will use Pinecone as vector store. Llamaindex will help us load and index our raw text into Pinecone. We will also use it to access and query the indexed data.

Let's prepare some metadata first. The book pdf was split into separate files, one for reach chapter. We'll use this breakdown to assign metadata to each node inserted as part of our index. We won't cover that step in details but please check out the BookSplitter notebook if you are curious.

In [36]:
from dotenv import load_dotenv
import os 
load_dotenv()

environment = os.environ["PINECONE_ENV"]
index_name = os.environ["PINECONE_INDEX_NAME"]

In [37]:
# Initializing API Keys, credentials, imports and context variables
import logging
import sys
import os

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [38]:
from llama_index import (
    VectorStoreIndex,
    SimpleKeywordTableIndex,
    SimpleDirectoryReader,
    LLMPredictor,
    ServiceContext,
    Document,
    download_loader,
    set_global_service_context,
    get_response_synthesizer,
)
from llama_index.vector_stores import PineconeVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index.indices.document_summary import DocumentSummaryIndex
from llama_index.llms import Bedrock
from pathlib import Path

In [39]:
import json

#book and chapters metadata are pre-stored into a a JSON Configuration file
with open("../config/book_2.json", "r") as read_content: 
	books_config = json.load(read_content)

file_name = books_config['books'][0]['fileName']
book_title = books_config['books'][0]['title']
chapter_titles = books_config['books'][0]['chapters']

Initialize Pinecone client

In [None]:
from langchain_community.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(credentials_profile_name="genai-demo", region_name="us-east-1", model_id="cohere.embed-english-v3")
llm = Bedrock(model="anthropic.claude-v2:1", profile_name="genai-demo", max_tokens=1024)
set_global_service_context(None)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings)
vector_store = PineconeVectorStore(
        index_name=index_name,
        environment=environment
    )
storage_context = StorageContext.from_defaults(vector_store=vector_store)

Let's start indexing! Notice how llama-index abstracts all the communication with the Vector store so far ...


In [35]:

PDFReader = download_loader("PDFReader")
chapter_docs = []
loader = PDFReader()

#Let's index each chapter
for pos in range(0, len(chapter_titles)):
    documents = loader.load_data(file=Path(f"{file_name}-{pos}.pdf"))
    chapter_title = chapter_titles[pos]['title']
    for doc in documents:
        doc.metadata['book_title'] = book_title
        doc.metadata['chapter_number'] = pos
        doc.metadata['chapter_title'] = chapter_title
    chapter_docs.append(documents)

#Let's generate chapter summaries
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", 
    use_async=True, 
    service_context=service_context,
    summary_template=summary_template,
)
summary_docs = []
for chapter_pos in range(0, len(chapter_titles)):
    doc_index = VectorStoreIndex.from_documents(
        chapter_docs[chapter_pos],
        storage_context=storage_context,
        service_context=service_context,
    )

    if chapter_pos>0:
        query_summary_engine = doc_index.as_query_engine(response_synthesizer=response_synthesizer,)
        summary = query_summary_engine.query("Could you summarize the given context?")
        print("Summary " + str(chapter_pos) + ": " + summary.response)
        chapter_title = chapter_titles[chapter_pos]['title']
        summary_doc = Document(text=summary.response)
        summary_doc.metadata['book_title'] = book_title
        summary_doc.metadata['chapter_number'] = chapter_pos
        summary_doc.metadata['chapter_title'] = chapter_title
        summary_doc.metadata['is_summary'] = "Y"
        summary_docs.append(summary_doc)

summary_index = VectorStoreIndex.from_documents(
    summary_docs,
    storage_context=storage_context,
    service_context=service_context,
)


Summary 1:  Here is a summary of the key points in the passage:

- The passage is from the book "Ishmael" by Daniel Quinn. It's from chapter 1.

- It discusses how animals in captivity often pace around endlessly and seem to be asking themselves "Why?" They sense something is wrong with their confined way of living. 

- The narrator began wondering the same thing about why life in captivity is boring and unpleasant compared to the interesting and pleasant life in the wild.

- Ishmael, a character, points out to the narrator that humans have made themselves captives of a system that compels them to destroy the world, and have also made the world itself captive. 

- The narrator feels a personal sense of being a captive but can't explain why. Ishmael refers to a time when young people tried to escape captivity but failed because they couldn't find what was confining them.

- Ishmael indicates the world won't survive much longer as humanity's captive. He thinks many humans would release t

Upserted vectors:   0%|          | 0/14 [00:00<?, ?it/s]

## Let's run a few tests
### Simple query

In [None]:
from IPython.display import Markdown, display
from llama_index.prompts import PromptTemplate

text_qa_template_str = (
    "Human: The <passage></passage> xml tags below contain a passage of the book.\n"
    " <passage>{context_str}</passage>\n"
    "Using the provided passage answer the following question:\n"
    "{query_str}\nIf the context isn't helpful, please reply that you don't know.\n\n"
    "Assistant:"
)
text_qa_template = PromptTemplate(text_qa_template_str)

refine_template_str = (
    "Human: The original question is within tghe <original_question></original_question> XML tags below.\nWe have provided an"
    " existing answer within the <existing_answer>{context_str}</existing_answer> XML tags below. \nWe have the opportunity to refine"
    " the existing answer (only if needed) with some more context available within the <passage></passage> XML tags below."
    " <original_question>{query_str}</original_question>\n"
    " <existing_answer>{context_str}</existing_answer>\n"
    " <passage>{context_str}</passage>\n"
    "Rewrite a new answer. Use the existing answer and context as new context to expand the existing answer.\n\n"
    "Assistant:"
)
refine_template = PromptTemplate(refine_template_str)

summary_template_str = (
    "Human: The <passage></passage> xml tags below contain a passage of the book.\n"
    " <passage>{context_str}</passage>\n"
    "Please summarize the passage. Provide as many details as possible.\n"
    "Assistant:"
)
summary_template = PromptTemplate(summary_template_str)

def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))


In [43]:
from llama_index.postprocessor.cohere_rerank import CohereRerank

api_key = os.environ["COHERE_API_KEY"]
cohere_rerank = CohereRerank(api_key=api_key, top_n=100)

book_index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)
query_engine = book_index.as_query_engine(
    similarity_top_k=10,
    text_qa_template=text_qa_template,
    refine_template = refine_template,
    node_postprocessors=[cohere_rerank],
    )

prompts_dict = query_engine.get_prompts()
display_prompt_dict(prompts_dict)

response = query_engine.query("Who are the main characters of the book?")
#response = query_engine.query("What does the book talk about?") #may work better by putting entire book in context
#response = query_engine.query("What is the significance of the jellyfish story?")
print(str(response))
print(response.get_formatted_sources())

**Prompt Key**: response_synthesizer:text_qa_template<br>**Text:** <br>

Human: The <passage></passage> xml tags below contain a passage of the book.
 <passage>{context_str}</passage>
Using the provided passage answer the following question:
{query_str}
If the context isn't helpful, please reply that you don't know.

Assistant:


<br><br>

**Prompt Key**: response_synthesizer:refine_template<br>**Text:** <br>

Human: The original question is within tghe <original_question></original_question> XML tags below.
We have provided an existing answer within the <existing_answer>{context_str}</existing_answer> XML tags below. 
We have the opportunity to refine the existing answer (only if needed) with some more context available within the <passage></passage> XML tags below. <original_question>{query_str}</original_question>
 <existing_answer>{context_str}</existing_answer>
 <passage>{context_str}</passage>
Rewrite a new answer. Use the existing answer and context as new context to expand the existing answer.

Assistant:


<br><br>

 Based on the passages provided, the two main characters of the book "Ishmael" by Daniel Quinn appear to be:

1) Ishmael: He is described as the narrator's "benefactor" and seems to serve as a teacher or guide, imparting wisdom and challenging perspectives. 

2) The narrator: He is the student or pupil of Ishmael. His name is not provided in the passages. He struggles to grasp Ishmael's teachings fully and shows little emotion.

The passages center around dialogues and debates between Ishmael and the unnamed narrator, so they seem to be the central characters of the book.
> Source (Doc id: ef396427-65b2-4077-af54-06cf3ab175c4): “Oh, hu ndreds, I suppose. Growers, harvesters, truckers, cleaners
at t he canning plant, people ...

> Source (Doc id: d5a0e659-eeb7-47e7-8125-7a65eebaa35e): 3
“What happens  t o peop le who  live in the hands of the
gods?”
“What do you mean?”
“I mean , w...

> Source (Doc id: fc9711e6-bffa-4d55-a689-97330a773151): 1
“Oddly enough,” he said, “it was my benefact

### Using metadata to summarize/query a specific chapter

In [44]:
import pinecone

from llama_index.vector_stores.types import (
    MetadataFilters,
    ExactMatchFilter,
)

query_chapter_title = "SIX"
query_book_title = book_title
query_engine_six = book_index.as_query_engine(
    similarity_top_k=20,
    text_qa_template=text_qa_template,
    refine_template = refine_template,
    node_postprocessors=[cohere_rerank],
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(key="chapter_title", value=query_chapter_title),
            ExactMatchFilter(key="book_title", value=query_book_title),
        ]
    ),
)
response = query_engine_six.query("What does the passage talk about?")
print("Response = "+ repr(response))

Response = Response(response=' Based on the context provided in the passage, it seems to talk about the history and development of human civilization, using an analogy of a person trying unsuccessfully to fly using an aircraft not properly designed for flight.\n\nSpecifically, the passage discusses how early human civilizations were able to thrive and grow for some time, like a person joyriding in a faulty aircraft, but eventually they failed and collapsed because they were not living in accordance with some fundamental "law" that governs sustainability. Just as the laws of aerodynamics govern whether manmade aircraft can achieve flight, the passage argues there seems to be a natural law governing whether human civilizations can be sustainable. \n\nThe passage goes on to criticize modern industrial civilization for the same shortsightedness and ignorance of this sustainability "law", predicting it will eventually collapse just as earlier civilizations did. It uses the analogy of the pe

## Time to do some homework

In [47]:
import pandas as pd

assignment = pd.read_csv('../data/assignment_questions.csv')

role_prompt = "You are a 12th grader doing an English assignment. Based on the provided context, please answer the following question.\n QUESTION:"

def get_query_engine_by_chapter(book_index, query_book_title, query_chapter_title="ALL"):
    similarity_top_k_val=20
    if query_chapter_title == "ALL":
        chapter_filter = ExactMatchFilter(key="is_summary", value="Y")
        similarity_top_k_val = 100
    else:
        chapter_filter = ExactMatchFilter(key="chapter_title", value=query_chapter_title)
        
    query_engine = book_index.as_query_engine(
        similarity_top_k=similarity_top_k_val,
        text_qa_template=text_qa_template,
        refine_template = refine_template,
        node_postprocessors=[cohere_rerank],
        filters=MetadataFilters(
            filters=[
                chapter_filter,
                ExactMatchFilter(key="book_title", value=query_book_title),
            ]
        ),
    )
    return query_engine

book_index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

questions_one = assignment[assignment['chapter'] == "ONE"]
query_engine_one = get_query_engine_by_chapter(book_index, book_title, "ONE")
questions_one['answer'] = questions_one['question'].map(lambda q: query_engine_one.query(role_prompt + q).response, na_action='ignore')

questions_two = assignment[assignment['chapter'] == "TWO"]
query_engine_two = get_query_engine_by_chapter(book_index, book_title, "TWO")
questions_two['answer'] = questions_two['question'].map(lambda q: query_engine_two.query(role_prompt + q).response, na_action='ignore')

#General questions
questions_all = assignment[assignment['chapter'] == "ALL"]
query_engine_all = get_query_engine_by_chapter(book_index, book_title)
questions_all['answer'] = questions_all['question'].map(lambda q: query_engine_all.query(role_prompt + q).response, na_action='ignore') #re-initialize query_engine if needed by running beginning of previous section

questions = pd.concat([questions_one, questions_two, questions_all], sort=False)

pd.options.display.max_colwidth = 1000
questions.head(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_two['answer'] = questions_two['question'].map(lambda q: query_engine_two.query(role_prompt + q).response, na_action='ignore')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  questions_all['answer'] = questions_all['question'].map(lambda q: query_engine_all.query(role_prompt + q).response, na_action='ignore') #re-initialize query_engine if needed by running beginning of previous section


Unnamed: 0,question,chapter,answer
1,"Explain all Ishmael has to say about captivity. In your opinion, is there any truth in what he says?",TWO,"Ishmael has quite a bit to say about captivity in this passage. Here is a summary of his main points:\n\n1. Ishmael's benefactor (Mr. Sokolow) was obsessed with the events in Nazi Germany and wanted to show Ishmael that the entire German nation was held captive under Hitler, including his supporters. Some detested Hitler's actions but still had to play their part in Hitler's story.\n\n2. Ishmael draws a parallel to the narrator's own culture, saying ""the people of your culture are in much the same situation. Like the people of Nazi Germany, they are the captives of a story."" The narrator protests that he knows of no such story captivating his culture.\n\n3. Ishmael argues that there is an ""explaining story"" that people in the narrator's culture know that tells them how things came to be the way they are. This story pacifies them and allows them to accept things like environmental destruction without becoming alarmed. \n\n4. Ishmael says everyone is fed the same story from childhoo..."
2,Find Ishmael’s definitions for the following word: Takers,TWO,"Based on the passage, Ishmael defines ""Takers"" as:\n\n""Using them in this sense, do the words takers and leavers have any heavy connotation for you?""\n""I'm not sure what you mean.""\n""I mean, if I call one group Takers and the other group Leavers, will this sound like I'm setting up one to be good guys and the other to be bad guys?""\n""No. They sound pretty neutral to me.""\n\nSo Ishmael defines ""Takers"" as a neutral term to refer to the people of the narrator's culture, as opposed to ""Leavers"" which refers to the people of all other cultures. He wants to use these terms without attaching positive or negative connotations to them."
3,Find Ishmael’s definitions for the following word: Leavers,TWO,"Based on the passage, Ishmael provides the following definition for ""Leavers"":\n\n""Using them in this sense, do the words takers and leavers have any heavy connotation for you?""\n""I mean, if I call one group Takers and the other group Leavers, will this sound like I’m setting up one to be good guys and the other to be bad guys?""\n""No. They sound pretty neutral to me.""\n\nSo Ishmael defines ""Leavers"" as a neutral term to refer to ""the people of all other cultures"" besides the narrator's own. He sets them up as a broad group, not specifically good or bad, that can be contrasted with the ""Takers""."
4,Find Ishmael’s definitions for the following word: Mother Culture,TWO,"Unfortunately the passage does not contain Ishmael's explicit definition of the term ""Mother Culture"". The term is used several times, but is not clearly defined. Without additional context, I don't have enough information to determine Ishmael's intended definition of this term."
0,What does the book Ishmael talk about?,ALL,"Unfortunately, the passage provided does not contain any actual text from the book ""Ishmael"" that I can use to summarize what the book talks about. The XML tags simply provide some metadata like the book title, chapter numbers, etc. but do not include passage content. Without any excerpt from the book itself, I do not have enough context to reliably state what the book is about. If you are able to provide a sample passage from the book ""Ishmael"", I would be happy to analyze it and answer questions about the book's key themes and topics. But with just the limited metadata here, I don't believe I can confidently answer the question ""What does the book Ishmael talk about?"". Please let me know if you have any passages from the book you can share for analysis."


In [48]:
questions.to_csv('../data/assignment_answers.csv', index=False)