# Query a book 

#### Pinecone Vector Store

I want to ask questions about a book and get answers back from an LLM that uses the book as context.

In [1]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
import os

load_dotenv()

True

In [2]:
hugging_face_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
langchain_token = os.getenv("LANGCHAIN_API_KEY")
pinecone_api_key = os.getenv("PINECONE_API_KEY")
pinecone_env = os.getenv("PINECONE_ENV")
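
`os.getenv` returns `None` when a variable is missing, so a bad `.env` file only shows up later as an authentication error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper written for this notebook, not part of `python-dotenv`, and the sample values are placeholders.

```python
import os


def require_env(names, env=None):
    """Return {name: value} for the requested variables.

    Raises RuntimeError listing every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Example with an explicit mapping instead of the real environment.
sample_env = {"HUGGINGFACEHUB_API_TOKEN": "hf_example", "PINECONE_API_KEY": "pc_example"}
keys = require_env(["HUGGINGFACEHUB_API_TOKEN", "PINECONE_API_KEY"], env=sample_env)
```

In the notebook itself, calling `require_env([...])` without `env` right after `load_dotenv()` would check the real environment.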

Get the model

In [3]:
from langchain_huggingface import HuggingFaceEndpoint

repo_id = "mistralai/Mistral-7B-Instruct-v0.2"


llm = HuggingFaceEndpoint(repo_id=repo_id,
                          huggingfacehub_api_token=hugging_face_token,
                          temperature=0.1)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\Hori\.cache\huggingface\token
Login successful


Load the data

In [4]:
loader = PyPDFLoader("../../data/field-guide-to-data-science.pdf")
data = loader.load()

Ignoring wrong pointing object 1221 0 (offset 0)
Ignoring wrong pointing object 1309 0 (offset 0)
Ignoring wrong pointing object 1388 0 (offset 0)
Ignoring wrong pointing object 1412 0 (offset 0)
Ignoring wrong pointing object 2082 0 (offset 0)
Ignoring wrong pointing object 2429 0 (offset 0)


In [5]:
# Note: PyPDFLoader already splits the PDF into one document per page
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your sample document')
print(f'Here is a sample: {data[0].page_content[:150]}')

You have 126 document(s) in your data
There are 0 characters in your sample document
Here is a sample: 
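
The sample above is empty because the first page has 0 extractable characters (possibly an image-only cover, though that is a guess). A small sketch for picking the first page that actually contains text follows. `Page` is a hypothetical stand-in that models only the `page_content` attribute, and `first_non_empty` is a helper name invented here.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded document; only page_content is modelled."""
    page_content: str


def first_non_empty(pages):
    """Return (index, page) for the first page with non-whitespace text, or None."""
    for i, page in enumerate(pages):
        if page.page_content.strip():
            return i, page
    return None


# Toy input: two blank pages followed by one page with text.
pages = [Page(""), Page("   \n"), Page("placeholder text for a page")]
found = first_non_empty(pages)
if found is not None:
    idx, page = found
    print(f"First page with text: index {idx}, {len(page.page_content)} characters")
```

Since the helper only reads `page_content`, passing the notebook's `data` list to it should work the same way.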


Split the text into smaller chunks

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

Let's see how many small chunks we have

In [8]:
print(f'Now you have {len(texts)} documents')

Now you have 258 documents


In [9]:
num_total_characters = sum([len(x.page_content) for x in texts])
print(f"Now you have {len(texts)} documents that have an average of {num_total_characters / len(texts):,.0f}  characters (smaller pieces)")

Now you have 258 documents that have an average of 786  characters (smaller pieces)
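
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally: that class prefers to break at separators such as paragraphs and lines, and each page is split on its own, which is presumably why the average chunk above (786 characters) is below the 1000-character limit. `split_fixed` is a hypothetical name used only for this illustration.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


text = "abcdefghij"
no_overlap = split_fixed(text, chunk_size=4)
with_overlap = split_fixed(text, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_overlap=0`, as used in this notebook, no text is repeated between chunks, so a sentence cut at a boundary appears in neither chunk whole.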


Create embeddings and store them in Pinecone

In [10]:
from langchain.vectorstores import Pinecone as LangchainPinecone
from langchain_huggingface import HuggingFaceEmbeddings
from pinecone import Pinecone, ServerlessSpec

# With no model_name given, HuggingFaceEmbeddings falls back to its default
# sentence-transformers model, which produces 768-dimensional vectors.
embeddings = HuggingFaceEmbeddings()

pc = Pinecone(api_key=pinecone_api_key)

index_name = "langchain1"


Create the index in Pinecone

In [11]:
import time

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=768,  # must match the embedding vector length
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # Wait until the new index is ready to accept upserts
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)

Upload the embeddings to Pinecone

In [12]:
docsearch = LangchainPinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

Ask questions

In [16]:
query = "What are examples of good data science teams?"
docs = docsearch.similarity_search(query=query, k=5)

In [17]:
print(docs)

[Document(page_content='Data Science teams need a broad view of the organization. Leaders \nmust be key advocates who meet with stakeholders to ferret out \nthe hardest challenges, locate the data, connect disparate parts of \nthe business, and gain widespread buy-in.››\n17 The Short Version 17'), Document(page_content='43 Start Here for the Basics 43 Start Here for the BasicsShaping the Culture\nIt is no surprise—building a culture is hard and there is just as \nmuch art to it as there is science. It is about deliberately creating the \nconditions for Data Science to flourish (for both Data Scientists and \nthe average employee). You can then step back to empower collective \nownership of an organic transformation. \nData Scientists are fundamentally curious and imaginative. We have \na saying on our team, “We’re not nosy, we’re Data Scientists.” These \nqualities are fundamental to the success of the project and to gaining \nnew dimensions on challenges and questions. Often Data Scie
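
The index was created with `metric="cosine"`, and `similarity_search` returns the `k` stored chunks whose embeddings are closest to the query embedding. The sketch below shows that ranking idea with a brute-force loop over toy 3-dimensional vectors. Pinecone runs the search server-side with its own indexing, so this is only an illustration; the function names, texts, and vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "index": (text, embedding) pairs.
stored = [
    ("chunk A", [1.0, 0.0, 0.0]),
    ("chunk B", [0.0, 1.0, 0.0]),
    ("chunk C", [1.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], stored, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Cosine similarity ignores vector length and compares direction only, which is why "chunk C" still ranks second here despite having a larger norm than the query.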

Pass the retrieved documents to the LLM and get an answer based on them

In [19]:
from langchain.chains.question_answering import load_qa_chain

# "stuff" places all retrieved documents into a single prompt
chain = load_qa_chain(llm, chain_type="stuff")
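
The `"stuff"` chain type puts every retrieved document into one prompt together with the question. A rough sketch of that idea follows. The template wording is an assumption made for illustration, not LangChain's actual prompt, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(chunks, question):
    """Join all non-empty chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["first retrieved chunk", "second retrieved chunk"],
    "What does the book say?",
)
print(prompt)
```

Because everything goes into a single prompt, `k` and `chunk_size` together bound its size: with `k=5` and chunks of at most 1000 characters, the context is at most about 5000 characters, which has to fit in the model's context window.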

In [32]:
query = "What are The Stages of Data Science Maturity?"
docs = docsearch.similarity_search(query=query, k=5)

In [33]:
chain.run(input_documents=docs, question=query)

' The Stages of Data Science Maturity are Collect, Describe, Discover, Predict, and Advise. Each stage represents an increasing level of maturity and analytic capability. The Collect stage focuses on collecting internal or external datasets. The Describe stage seeks to enhance or refine raw data and leverage basic analytic functions. The Discover stage identifies hidden relationships or patterns. The Predict stage utilizes past observations to make predictions. The Advise stage applies insights gained from data analysis to inform decision making. The proportion of time spent on each stage changes as an organization matures, with less time spent on earlier stages and more time spent on later, more mature stages.'