# The notebook "Workflow of PDF Chatbot" offers several options for the chain type. This notebook investigates how the chain type parameter influences the answers

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.document_loaders import DirectoryLoader
import os 
from dotenv import load_dotenv
import pinecone

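# Read the API keys from a local .env file
load_dotenv()
openai_key = os.getenv('OPENAI_KEY')
# OpenAIEmbeddings reads the key from the OPENAI_API_KEY environment variable
os.environ['OPENAI_API_KEY'] = openai_key
embeddings = OpenAIEmbeddings()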


pinecone_api_key = os.getenv('PINECONE_KEY')
pinecone_env_name = os.getenv('PINECONE_ENV')
pinecone_index_name = os.getenv('PINECONE_INDEX')

pinecone_config = {
    "api_key":pinecone_api_key,
    "env_name":pinecone_env_name,
    "index_name":pinecone_index_name
}



# Load PDFs

In [2]:
def upload_pdf_to_pinecone(pdf_directory, pinecone_config, my_namespace,
                           chunk_size=1000, chunk_overlap=0):
    """Load the PDFs in a directory, split them into chunks, and index them in Pinecone."""
    # Load every PDF under pdf_directory (recursively) and split the text into chunks
    my_loader = DirectoryLoader(pdf_directory, glob='**/*.pdf')
    documents = my_loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)

    # initialize pinecone
    pinecone.init(
        api_key=pinecone_config['api_key'],  # find at app.pinecone.io
        environment=pinecone_config['env_name']  # next to api key in console
    )

    # Embed the chunks with the module-level `embeddings` and store them in the given namespace
    docsearch = Pinecone.from_documents(docs, embeddings, index_name=pinecone_config['index_name'], namespace=my_namespace)
    return docsearch
    

In [3]:
# Upload the PDFs the first time you run this notebook;
# this operation can take a while if the documents are long

# pdf_directory = "../docs"
# my_namespace = 'Unilever-2018-2019'
# upload_pdf_to_pinecone(pdf_directory, pinecone_config, my_namespace, chunk_size=1000, chunk_overlap=200)

# Load Vectorstore

In [4]:
# Load vector store
my_namespace = 'Unilever-2018-2019'
pinecone.init(api_key=pinecone_api_key,environment=pinecone_env_name)
index = pinecone.Index(pinecone_index_name)
vectorstore = Pinecone(index, embeddings.embed_query, "text", namespace=my_namespace)

# Option 1. Stuff Chain

ref: https://python.langchain.com/docs/modules/chains/document/stuff

<font size=4 color=green>  The stuff documents chain ("stuff" as in "to stuff" or "to fill") is the most straightforward of the document chains. It takes a list of documents, inserts them all into a prompt and passes that prompt to an LLM. This chain is well-suited for applications where documents are small and only a few are passed in for most calls.
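
The idea can be illustrated without any library. The sketch below is not LangChain's implementation; `stuff_answer`, `fake_llm`, and the prompt wording are hypothetical names chosen for illustration, and `fake_llm` merely stands in for a real model call.

```python
from typing import Callable, List


def stuff_answer(docs: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Put every document into one prompt and make a single LLM call."""
    context = "\n\n".join(docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)


# Hypothetical stand-in for a real model: it records each prompt it receives.
received_prompts: List[str] = []


def fake_llm(prompt: str) -> str:
    received_prompts.append(prompt)
    return "stub answer"


answer = stuff_answer(["chunk one", "chunk two"], "What changed?", fake_llm)
print(answer, len(received_prompts))
```

Because all chunks share one prompt, the total text has to fit in the model's context window, which is why this approach suits a small number of short chunks.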

In [6]:
from langchain.llms import OpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
query = "In what ways could a slowed technological change risk affect our data management enhancement programmes?"
ref_docs = vectorstore.similarity_search(query, k=10)

chain({"input_documents": ref_docs, "question": query}, return_only_outputs=True)

{'output_text': ' A slowed technological change could risk affecting our data management enhancement programmes by making it difficult to manage the business, increasing the cost of recycled plastic or other alternative packaging materials, and making products less affordable or less available for our consumers.\nSOURCES: ../docs/unilever-annual-report-and-accounts-2019.pdf'}

# Option 2. Map_reduce
ref: https://python.langchain.com/docs/modules/chains/document/map_reduce

<font size=4 color=green> The map reduce documents chain first applies an LLM chain to each document individually (the Map step), treating the chain output as a new document. It then passes all the new documents to a separate combine documents chain to get a single output (the Reduce step). It can optionally first compress, or collapse, the mapped documents to make sure that they fit in the combine documents chain (which will often pass them to an LLM). This compression step is performed recursively if necessary.
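
A minimal sketch of the map, collapse, and reduce steps, assuming a plain callable in place of the model. This is not LangChain's implementation: `map_reduce_answer`, `fake_llm`, the `max_chars` limit, the pairwise collapse, and the prompt wording are all assumptions made for illustration.

```python
from typing import Callable, List


def map_reduce_answer(docs: List[str], question: str,
                      llm: Callable[[str], str], max_chars: int = 2000) -> str:
    # Map step: one LLM call per document.
    notes = [llm(f"Extract text relevant to '{question}':\n{doc}") for doc in docs]

    # Optional collapse step: while the notes are too long to combine,
    # condense them in pairs. The loop repeats as often as necessary.
    while len(notes) > 1 and sum(len(n) for n in notes) > max_chars:
        pairs = [notes[i:i + 2] for i in range(0, len(notes), 2)]
        notes = [llm("Condense these notes:\n" + "\n".join(pair)) for pair in pairs]

    # Reduce step: one final call over all remaining notes.
    return llm(f"Answer '{question}' using these notes:\n" + "\n".join(notes))


# Hypothetical stand-in for a real model: it numbers its replies.
call_log: List[str] = []


def fake_llm(prompt: str) -> str:
    call_log.append(prompt)
    return f"note {len(call_log)}"


final = map_reduce_answer(["doc a", "doc b", "doc c"], "What changed?", fake_llm)
print(final, len(call_log))
```

With three documents and no collapse, the stub is called four times: three map calls plus one reduce call.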

In [7]:
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_reduce", return_intermediate_steps=True)
query = "In what ways could a slowed technological change risk affect our data management enhancement programmes?"
ref_docs = vectorstore.similarity_search(query, k=10)

chain({"input_documents": ref_docs, "question": query}, return_only_outputs=True)

{'intermediate_steps': [' No relevant text.',
  ' None.',
  ' No relevant text.',
  ' None.',
  ' None.',
  ' Technology continues to change the fabric of life and business. Enhanced AI, robotics and the internet of things (IoT) are reshaping how people live, work and interact with the world – and with brands. Intelligent technologies are optimising manufacturing and agriculture, connecting global businesses like ours inside and out, and changing how people shop.',
  ' None',
  ' None',
  ' None',
  ' No relevant text.'],
 'output_text': " I don't know.\nSOURCES: None."}

# Option 3. Refine
ref: https://python.langchain.com/docs/modules/chains/document/refine

<font size=4 color=green> The refine documents chain constructs a response by looping over the input documents and iteratively updating its answer. For each document, it passes all non-document inputs, the current document, and the latest intermediate answer to an LLM chain to get a new answer.

<font size=4 color=green> Since the Refine chain only passes a single document to the LLM at a time, it is well-suited for tasks that require analyzing more documents than can fit in the model's context. The obvious tradeoff is that this chain will make far more LLM calls than, for example, the Stuff documents chain. There are also certain tasks which are difficult to accomplish iteratively. For example, the Refine chain can perform poorly when documents frequently cross-reference one another or when a task requires detailed information from many documents.
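
The loop can be sketched as follows. This is an illustration under stated assumptions rather than LangChain's implementation: `refine_answer`, `fake_llm`, and the prompt wording are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def refine_answer(docs: List[str], question: str,
                  llm: Callable[[str], str]) -> Tuple[Optional[str], List[str]]:
    """Answer from the first document, then update the answer once per further document."""
    answer: Optional[str] = None
    intermediate_steps: List[str] = []
    for doc in docs:
        if answer is None:
            # First document: produce an initial answer.
            prompt = f"Context:\n{doc}\n\nQuestion: {question}"
        else:
            # Later documents: pass the question, the latest answer, and one new document.
            prompt = (
                f"Question: {question}\n"
                f"Existing answer: {answer}\n"
                f"Refine the existing answer with this new context, if useful:\n{doc}"
            )
        answer = llm(prompt)
        intermediate_steps.append(answer)
    return answer, intermediate_steps


# Hypothetical stand-in for a real model: it numbers its drafts.
prompt_log: List[str] = []


def fake_llm(prompt: str) -> str:
    prompt_log.append(prompt)
    return f"draft {len(prompt_log)}"


final_answer, steps = refine_answer(["doc a", "doc b", "doc c"], "What changed?", fake_llm)
print(final_answer, steps)
```

Each call depends on the previous answer, so the calls cannot run in parallel; that sequential dependency is one reason this chain type is slow.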

In [8]:
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="refine", return_intermediate_steps=True)
query = "In what ways could a slowed technological change risk affect our data management enhancement programmes?"
ref_docs = vectorstore.similarity_search(query, k=4)

chain({"input_documents": ref_docs, "question": query}, return_only_outputs=True)

{'intermediate_steps': ['\nA slowed technological change could risk affecting our data management enhancement programmes by reducing the speed at which new products can be developed and deployed. This could lead to a delay in the implementation of new technologies and data management strategies, which could result in a lack of competitive advantage and a decrease in consumer satisfaction. Additionally, a slowed technological change could lead to a decrease in the quality of data collected, as well as a decrease in the accuracy of data analysis. This could lead to a decrease in the effectiveness of data management enhancement programmes.',
  '\n\nA slowed technological change could risk affecting our data management enhancement programmes by reducing the speed at which new products can be developed and deployed. This could lead to a delay in the implementation of new technologies and data management strategies, which could result in a lack of competitive advantage and a decrease in cons

# Option 4. Map_Rerank
ref: https://python.langchain.com/docs/modules/chains/document/map_rerank

<font size=4 color=green> The map re-rank documents chain runs an initial prompt on each document, that not only tries to complete a task but also gives a score for how certain it is in its answer. The highest scoring response is returned.
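
A minimal sketch of the score-and-pick logic, assuming the model replies with an answer followed by a `Score: <number>` line. The reply format, `map_rerank_answer`, and `fake_scoring_llm` are assumptions made for illustration, not LangChain's actual prompt or parser.

```python
import re
from typing import Callable, List, Tuple


def map_rerank_answer(docs: List[str], question: str,
                      llm: Callable[[str], str]) -> Tuple[str, int]:
    """Ask the question against each document separately and keep the best-scored answer."""
    if not docs:
        raise ValueError("at least one document is required")
    scored: List[Tuple[str, int]] = []
    for doc in docs:
        raw = llm(
            f"Context:\n{doc}\n\nQuestion: {question}\n"
            "Reply with an answer, then a line 'Score: <0-100>'."
        )
        match = re.search(r"Score:\s*(\d+)", raw)
        if match:
            scored.append((raw[:match.start()].strip(), int(match.group(1))))
        else:
            # No parsable score: treat the reply as the lowest-ranked answer.
            scored.append((raw.strip(), 0))
    # On ties, max() keeps the first answer with the highest score.
    return max(scored, key=lambda pair: pair[1])


# Hypothetical stand-in for a real model: confident only when the prompt mentions technology.
def fake_scoring_llm(prompt: str) -> str:
    if "technolog" in prompt.lower():
        return "Slower technology change is a risk.\nScore: 90"
    return "This document does not answer the question.\nScore: 0"


best_answer, best_score = map_rerank_answer(
    ["Cafeteria opening hours.", "Technological change is a principal risk."],
    "What risk is described?",
    fake_scoring_llm,
)
print(best_answer, best_score)
```

Only one document's answer is returned, so information spread across several documents is never combined.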

In [9]:
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_rerank", return_intermediate_steps=True)
query = "In what ways could a slowed technological change risk affect our data management enhancement programmes?"
ref_docs = vectorstore.similarity_search(query, k=4)

chain({"input_documents": ref_docs, "question": query}, return_only_outputs=True)

{'intermediate_steps': [{'answer': ' Slowed technological change could affect our data management enhancement programmes by reducing the speed at which we can develop and deploy new communication technologies, reducing the speed at which we can develop and deploy new products to meet changing consumer trends, and reducing the speed at which we can convert category strategies into projects.',
   'score': '90'},
  {'answer': ' This document does not answer the question.', 'score': '0'},
  {'answer': ' This document does not answer the question.', 'score': '0'},
  {'answer': ' Slowed technological change could limit the effectiveness of data management enhancement programmes, as the programmes may not be able to keep up with the latest technology.',
   'score': '80'}],
 'output_text': ' Slowed technological change could affect our data management enhancement programmes by reducing the speed at which we can develop and deploy new communication technologies, reducing the speed at which we c

<font size =4 color="blue"> 1. "Stuff" puts all referenced documents into a single prompt and gets the response in one call. (ALL)

<font size =4 color="blue"> 2. "Map-reduce" applies an LLM chain to each document individually, treats each output as a new document, and then passes all the new documents to a separate combine documents chain to get a single output (Compression).

<font size =4 color="blue"> 3. "Refine" passes one document at a time and iteratively updates its answer (Iteration)

<font size =4 color="blue"> 4. "Re-rank" answers the question from each referenced document separately, scores each answer, and returns the highest-scoring one (Ranking)

<font size =4 color = green > In terms of application, "stuff" is suitable when the documents are divided into small chunks. "Map-reduce" is suitable when the chunks are large, because compression is applied. "Refine" is suitable for high-level questions or when a more precise response is needed, as it iteratively updates its answer one chunk at a time, but it requires multiple LLM calls, which is extremely time-consuming. "Re-rank" scores the response obtained from each chunk; the response is slightly improved, but it also takes a lot of time.
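
The time cost follows largely from how many LLM calls each chain type makes. The sketch below is a rough count under simplifying assumptions (no collapse step for map-reduce, one call per document otherwise); `estimated_llm_calls` is a hypothetical helper, not part of LangChain.

```python
def estimated_llm_calls(chain_type: str, n_docs: int) -> int:
    """Rough number of LLM calls for n_docs retrieved chunks (assumes no collapse step)."""
    if n_docs < 1:
        raise ValueError("n_docs must be at least 1")
    counts = {
        "stuff": 1,                # everything in one prompt
        "map_reduce": n_docs + 1,  # one map call per chunk, plus one reduce call
        "refine": n_docs,          # one initial answer, then one refinement per further chunk
        "map_rerank": n_docs,      # one scored answer per chunk
    }
    return counts[chain_type]


# The k values used in the cells above: k=10 for stuff and map_reduce, k=4 for the others.
for chain_type, k in [("stuff", 10), ("map_reduce", 10), ("refine", 4), ("map_rerank", 4)]:
    print(chain_type, estimated_llm_calls(chain_type, k))
```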
    
<font size =4 color = red > Overall, we can stick with the "stuff" chain type, but consider offering "refine" as an option for users who need deeper analysis over more documents.