### This script focuses on document Q&A using document embeddings, LLMs, and a vector store that holds those embeddings. We explore one solution built on GPT from OpenAI and one that is completely open source.

****

**This document assumes you have some understanding of the following:**
* Python and object-oriented programming
* NLP concepts such as embeddings
* Roughly how LLMs work
* The basics of vector DBs

**Explicitly used packages:**
* LangChain
* Transformers

**Implicitly used packages:**
* ChromaDB
* openai
* sentence-transformers


Implicit packages: these must be installed so that the explicitly used packages can perform some of their functions.
The explicitly used packages wrap the functionality of the implicit packages and simplify it for the end user by abstracting many complicated lines of code into a single function call (read: wrapper classes). If you are a beginner, you do not need to understand the inner workings of the implicit packages.



Create an OpenAI account and set up your OpenAI API key.


In [None]:
import os
os.environ["OPENAI_API_KEY"] = "enter your OPEN AI API Key here"
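
Pasting the key into a cell makes it easy to leak when the notebook is shared. A minimal sketch of the alternative, assuming the key was exported in the shell before the notebook started; `require_api_key` is a hypothetical helper written for this example, not part of any library.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


# Demo with an explicit mapping so the example runs without a real key.
print(require_api_key(env={"OPENAI_API_KEY": "sk-demo"}))
```

Calling `require_api_key()` with no arguments reads from the real environment and fails early if the key is missing, rather than at the first API call.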

Import the OpenAI-related class from LangChain.

In [1]:
from langchain.llms import OpenAI


Let us start by creating an object of the OpenAI class and setting some parameters. Here, we set the parameter called temperature. A lower temperature makes the LLM output more deterministic and less random or "creative"; a higher temperature does the opposite. The cell below uses 0.9, a fairly high value, which suits an open-ended task such as inventing a company name.

Let us call the instance (object) of the OpenAI class "llm".

In [2]:
llm = OpenAI(temperature=0.9)
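
To see why temperature changes how "creative" the output is, here is a small sketch of temperature as it is commonly defined for sampling: the model's raw scores are divided by the temperature before being turned into probabilities. The scores below are made up, and this illustrates the general idea rather than OpenAI's internal implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.1)
high = softmax_with_temperature(logits, 0.9)
print([round(p, 3) for p in low])   # almost all probability on the top token
print([round(p, 3) for p in high])  # probability spread across the candidates
```

At 0.1 the top-scoring token is picked almost every time; at 0.9 the other candidates keep a real chance of being sampled.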

Let us define a variable called text to hold the question we will ask GPT.
The llm object takes a string as input and returns the GPT model's response as a string, which we can print.

In [3]:
text = "What would be a good company name for a company that makes colorful socks?"
print(llm(text))



Socktastic.


****
Now that we have an answer from the GPT model, let us move on to loading a document into a DB and then running a query over it.
The pipeline for this task is as follows:

1. Load the document
2. Split lengthy documents into chunks
3. Get embeddings for the chunks
4. Store the embeddings in a vector DB (Chroma DB)
5. Query the DB to retrieve the chunk that best matches the question
6. Pass this chunk along with your question to the LLM to obtain the answer

Several of these steps are abstracted away by the LangChain functions that we will see below.
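
Before handing these steps to a library, it can help to see the whole pipeline in miniature. The sketch below uses only the standard library: a bag-of-words counter stands in for a real embedding model, a plain list stands in for the vector DB, and the final prompt is printed instead of being sent to an LLM. The chunk texts and function names are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Steps 1-2: a "document" that has already been split into chunks.
chunks = [
    "Chia seeds are a good source of calcium and fiber.",
    "Regular walking is a safe form of exercise.",
    "Drink plenty of water throughout the day.",
]

# Steps 3-4: embed each chunk and keep (vector, text) pairs as an in-memory store.
store = [(embed(c), c) for c in chunks]

# Step 5: retrieve the chunk most similar to the question.
question = "What is a good source of calcium?"
q_vec = embed(question)
best_chunk = max(store, key=lambda item: cosine(q_vec, item[0]))[1]

# Step 6: combine the chunk and the question into one prompt for the LLM.
prompt = f"Context: {best_chunk}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

Real embedding models capture meaning rather than word overlap, but the retrieve-then-ask shape is the same.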

Let us load the documents using TextLoader from LangChain.
Let us also define a variable to store the documents (ideally a list, because we can load many documents and keep them all in it).

In [4]:
from langchain.document_loaders import TextLoader
data = []

loader.load() returns a list, which in this case contains a single document.

Since we have already defined a list, we append that single document (the first element returned by loader.load()) to it.

In [5]:
# TextLoader was already imported in the previous cell.
loader = TextLoader(r'C:\Users\JkReddy\Desktop\Weill Cornell Medicine\Subjects\Capstone\LangChain.txt')
# load() returns a one-element list here; keep only the document itself.
data.append(loader.load()[0])
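
The list pays off once there is more than one file. As a rough sketch of the idea using only the standard library, the code below reads several text files into simple objects that mirror the two fields of a LangChain document, page_content and metadata. `SimpleDocument` and `load_text_files` are hypothetical names for this example and are not LangChain classes.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two fields as a LangChain document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_files(paths):
    """Read each file into one SimpleDocument, recording its path as the source."""
    docs = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        docs.append(SimpleDocument(page_content=text, metadata={"source": str(path)}))
    return docs


# Demo with two throwaway files so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    first.write_text("first file", encoding="utf-8")
    second.write_text("second file", encoding="utf-8")
    docs = load_text_files([first, second])

print(len(docs), docs[0].page_content)
```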

Import the Chroma vector DB, the text splitter, and the QA chain from LangChain, along with the class for OpenAI embeddings.

In [6]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA

We split the lengthy document into smaller chunks. This is done because LLMs have a limit on the number of tokens (roughly, words) they can take in as input. They also tend to recall details and give more accurate responses when they have fewer words to work with.

Experiment with the chunk_size value below. The chunk_overlap parameter dictates how much overlap should exist between consecutive chunks. More overlap helps ensure that important information is not lost at chunk boundaries during splitting.

The split_documents function takes a list as input. Each list element must be a document loaded by LangChain.

In [8]:
text_splitter = CharacterTextSplitter(chunk_size = 1000, chunk_overlap=200)
texts = text_splitter.split_documents(data)
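
To make chunk_size and chunk_overlap concrete, here is a simplified sliding-window splitter. It cuts at fixed character positions, which is an assumption made to keep the example short: CharacterTextSplitter splits on a separator first, so its chunk boundaries will differ. The function name is made up for this sketch.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk begins with the last four characters of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.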

Now that we have loaded and split the text, we can create the database, set up the embedding function, and store the document embeddings in the vector DB.

In the cell below, we name the directory in which we want our vector DB to reside. We then create an embeddings object from LangChain's OpenAIEmbeddings class, which is used to create the document embeddings.

In [9]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'myvectordb'
embeddings = OpenAIEmbeddings()

By default, the model used for embeddings is text-embedding-ada-002.
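This model cost about $0.0004 per 1,000 tokens when this was written, so be mindful of how much text you embed and check OpenAI's pricing page for the current rate.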
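
Since embedding cost scales with token count, a back-of-the-envelope estimate is easy to compute before embedding a large corpus. The sketch below assumes roughly four characters per token, a common rule of thumb for English text, and takes the price as a parameter because rates change. The rate in the example is an assumed placeholder.

```python
def estimate_embedding_cost(texts, price_per_1k_tokens, chars_per_token=4):
    """Rough cost estimate using a characters-per-token rule of thumb."""
    total_chars = sum(len(t) for t in texts)
    approx_tokens = total_chars / chars_per_token
    return approx_tokens / 1000 * price_per_1k_tokens


# 50 hypothetical chunks of 1,000 characters each.
sample_chunks = ["x" * 1000] * 50
ASSUMED_RATE_USD = 0.0004  # placeholder; substitute the current published price
cost = estimate_embedding_cost(sample_chunks, ASSUMED_RATE_USD)
print(f"~${cost:.4f}")
```

For an exact count you would tokenize the text with the model's own tokenizer instead of dividing by four.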

The vector DB we are using is Chroma DB, which is integrated into LangChain.

**The Chroma.from_documents function takes these parameters:**
1. The split-up texts
2. The embeddings instance from LangChain
3. The directory in which the DB should be persisted

In [11]:
vectordb = Chroma.from_documents(texts, embeddings, persist_directory = persist_directory)

Using embedded DuckDB with persistence: data will be stored in: myvectordb


Write the embeddings to disk using vectordb.persist(), then clear the in-memory object. We reload it in the next cell to check that the data has been stored.

In [12]:
vectordb.persist()
vectordb = None

Reload VectorDB

In [13]:
vectordb = Chroma(persist_directory=persist_directory, embedding_function = embeddings)

Using embedded DuckDB with persistence: data will be stored in: myvectordb
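
The persist, clear, and reload steps can be pictured with a much simpler store. The sketch below writes text and vector pairs to a JSON file and reads them back; the file name store.json and both helper functions are made up for illustration, and Chroma's actual on-disk format is different.

```python
import json
import tempfile
from pathlib import Path


def save_store(records, directory):
    """Write (text, vector) records to a JSON file inside `directory`."""
    Path(directory).mkdir(parents=True, exist_ok=True)
    with open(Path(directory) / "store.json", "w", encoding="utf-8") as fh:
        json.dump(records, fh)


def load_store(directory):
    """Read the records back from disk."""
    with open(Path(directory) / "store.json", encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical records: each chunk's text next to its embedding vector.
records = [
    {"text": "chunk one", "vector": [0.1, 0.2]},
    {"text": "chunk two", "vector": [0.3, 0.4]},
]

with tempfile.TemporaryDirectory() as tmp:
    save_store(records, tmp)
    in_memory = None             # drop the in-memory copy, as the notebook does
    in_memory = load_store(tmp)  # rebuild it purely from what is on disk

print(in_memory == records)
```

The point of the round trip is the same as in the notebook: embeddings are paid for once and reused across sessions.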


Create the QA object, passing in the LLM along with a temperature parameter that controls the nature of the LLM output. The OpenAI class also takes model_name as a parameter. Here we are going with text-davinci-003, which is a bit pricey. For a cheaper alternative, go with OpenAI's ada or curie models.

In [14]:
gpt_qa = RetrievalQA.from_chain_type(llm=OpenAI(temperature = 0.1, model_name = "text-davinci-003"), 
                                 chain_type = "stuff", 
                                 retriever = vectordb.as_retriever())

Our newly created QA object has a run method with which we can run a query over the DB. The retriever first fetches the most relevant document chunks, which are then passed to the LLM along with the query. This helps the model give the correct answer.

In [17]:
query = "What can I eat?"
gpt_qa.run(query)

' You can eat collard greens, broccoli, yogurt, cheese, chia seeds, almonds, and sardines and other high-quality seafood to boost your calcium intake during pregnancy.'
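
The "stuff" chain type used above can be pictured as stuffing every retrieved chunk into a single prompt and making one model call. A minimal sketch of that idea follows; the prompt wording, the chunk texts, and `fake_llm` are all made up, and LangChain's own prompt template is different.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then ask."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def fake_llm(prompt):
    """Hypothetical stand-in for a model call: reports the prompt size."""
    return f"(model saw {len(prompt)} characters)"


retrieved = ["Chunk A about calcium.", "Chunk B about seafood."]  # hypothetical
stuff_prompt = build_stuff_prompt("What can I eat?", retrieved)
print(stuff_prompt)
print(fake_llm(stuff_prompt))
```

Because all chunks share one prompt, this approach is cheap (a single call) but only works while the combined chunks fit within the model's input limit.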

****
An open-source solution for the same task, using LLMs available on Hugging Face.
We use sentence-transformers embeddings to embed the documents.

Create an instance of an open-source embedding model.

In [18]:
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

In [19]:
from langchain.text_splitter import CharacterTextSplitter

In [20]:
text_splitter = CharacterTextSplitter(chunk_size = 500, chunk_overlap=50)
texts = text_splitter.split_documents(data)

Created a chunk of size 510, which is longer than the specified 500
Created a chunk of size 979, which is longer than the specified 500
Created a chunk of size 518, which is longer than the specified 500
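
The warnings above appear because the splitter cuts on a separator (a blank line by default) and does not break a single paragraph that is already longer than chunk_size. The sketch below reproduces that behaviour in simplified form; it ignores chunk_overlap and is an illustration of the idea, not LangChain's code.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Merge separator-delimited pieces into chunks of at most `chunk_size`
    characters; a single piece longer than that is kept whole."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


# Two short paragraphs followed by one that is longer than the limit.
text = "short one\n\nshort two\n\n" + "x" * 40
result = split_on_separator(text, chunk_size=25)
print([len(c) for c in result])
# [20, 40]
```

The 40-character paragraph comes out as one oversized chunk, which is the situation the warnings report. Choosing a finer separator or a splitter that recurses into long pieces avoids it.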


In [21]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma


persist_directory = 'myvectordb_opensource'
vectordb = Chroma.from_documents(texts, embeddings, persist_directory = persist_directory)

Using embedded DuckDB with persistence: data will be stored in: myvectordb_opensource


In [22]:
vectordb.persist()

Currently, LangChain's Hugging Face pipeline only supports Hub models that work as text2text-generation or text-generation models, so be mindful when picking a model from Hugging Face.

In [23]:
from langchain import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(model_id="declare-lab/flan-gpt4all-xl", 
                                        task="text2text-generation", 
                                        model_kwargs={"temperature":0, "max_length":50, "min_length":10})

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device has 1 GPUs available. Provide device={deviceId} to `from_model_id` to use availableGPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.


In [24]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm, 
                                 chain_type = "refine", 
                                 retriever = vectordb.as_retriever())

In [25]:
query = "Can I have fruits?"
qa.run(query)



'Yes, you can have fruits. Chia seeds are a good source of calcium, fiber, and omega-3 fatty acids. There are many ways to use chia seeds, including chia pudding, added to energy bites'
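
The "refine" chain type used here differs from "stuff": instead of one big prompt, the model drafts an answer from the first chunk and is then asked to revise it once for each further chunk. A minimal sketch of that loop follows, with a stub in place of the model; the prompt wording and `counting_llm` are made up for illustration.

```python
def refine_answer(question, chunks, llm):
    """Draft an answer from the first chunk, then revise it once per extra chunk."""
    answer = llm(f"Context: {chunks[0]}\nQuestion: {question}\nAnswer:")
    for chunk in chunks[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context: {chunk}\n"
            "Refine the existing answer if the new context helps:"
        )
    return answer


calls = []


def counting_llm(prompt):
    """Hypothetical stand-in for a model: records prompts, returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


final = refine_answer("Can I have fruits?", ["chunk A", "chunk B", "chunk C"], counting_llm)
print(final, len(calls))
# draft 3 3
```

The trade-off is one model call per chunk, which suits models with a short input limit but makes the final answer depend on the order in which chunks arrive.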

Truth be told, GPT by OpenAI still seems to be better than most models at being chatty and accurate. However, if you do have the LLaMA weights, I implore you to try the Vicuna model, as I have heard that it delivers promising results in this particular use case.
****