### "CEPI-gpt" Proof of Concept

Why do this?

Data is not a business asset unless it is actually used. A ChatGPT-like experience is one way to encourage people to use data to inform and accelerate their work, because it lets them query that data in simple, accessible natural language.

What do we want to evidence?

We want to create a ChatGPT-like experience for querying, and getting answers from, a corpus of first-party data. Technically, we want to learn how text embeddings, a vector database and natural language queries come together to create this application.
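As a rough illustration of how these three pieces fit together, here is a minimal, dependency-free sketch. It is not what the notebook runs: `embed` is a toy bag-of-words stand-in for a learned embedding model, `ToyVectorStore` and `build_prompt` are hypothetical names, and the sample sentences are invented purely for the demo.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Stores (chunk, vector) pairs and returns the chunks most similar to a query."""

    def __init__(self, chunks):
        self.items = [(chunk, embed(chunk)) for chunk in chunks]

    def search(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """'Stuff' the retrieved chunks into a single prompt for a language model."""
    context = "\n\n".join(context_chunks)
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


# Invented sample chunks, for illustration only.
chunks = [
    "Vaccines were developed quickly by running trial phases in parallel.",
    "The office cafeteria opens at nine.",
    "Regulators reviewed data on vaccines on a rolling basis.",
]
question = "How were vaccines developed so quickly?"
store = ToyVectorStore(chunks)
top = store.search(question, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

In the notebook itself, `OpenAIEmbeddings` plays the role of `embed`, `Chroma` plays the role of the store, and `RetrievalQA` with `chain_type="stuff"` builds the prompt and calls the language model.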


In [1]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma # note capital letter
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI
# VectorDBQA is deprecated, so RetrievalQA is imported instead of
# `from langchain import OpenAI, VectorDBQA`. See DEV NOTES.
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
import os
import nltk
import magic

# the below is used to manage the API keys in the .env
import openai
from dotenv import load_dotenv
# Load environment variables from inside the .env file
load_dotenv()



True

In [2]:
openai.api_key = os.getenv("OPENAI_API_KEY")

# if not using .env for some reason, use: os.environ["OPENAI_API_KEY"] = "KEY" 
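Because `os.getenv` returns `None` when a variable is missing, a misconfigured `.env` only surfaces later as an authentication error. A small guard like the one below can fail earlier with a clearer message. `require_env` is a hypothetical helper, not part of `dotenv` or `openai`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in the shell")
    return value


# Possible use in the cell above (after load_dotenv()):
#   openai.api_key = require_env("OPENAI_API_KEY")
```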


In [3]:
# The second argument is a glob pattern; here it matches a single file in the directory.
loader = DirectoryLoader('/Users/joellim/Desktop/coda_work_mac_2022/openai-project-3/embeddings1', glob='cepi100days.txt')

docs = loader.load()

In [4]:
#print(docs)

In [5]:
char_text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

In [6]:
doc_texts = char_text_splitter.split_documents(docs)


Created a chunk of size 1220, which is longer than the specified 1000
Created a chunk of size 1628, which is longer than the specified 1000
Created a chunk of size 1208, which is longer than the specified 1000
Created a chunk of size 1198, which is longer than the specified 1000
Created a chunk of size 1040, which is longer than the specified 1000
Created a chunk of size 1213, which is longer than the specified 1000
Created a chunk of size 1186, which is longer than the specified 1000
Created a chunk of size 1542, which is longer than the specified 1000
Created a chunk of size 1227, which is longer than the specified 1000
Created a chunk of size 1149, which is longer than the specified 1000
Created a chunk of size 1093, which is longer than the specified 1000
Created a chunk of size 1359, which is longer than the specified 1000
Created a chunk of size 1039, which is longer than the specified 1000
Created a chunk of size 1018, which is longer than the specified 1000
Created a chunk of s
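The warnings above say that some chunks exceed `chunk_size`. A likely explanation is that `CharacterTextSplitter` splits only on one separator (a blank line, `"\n\n"`, by default as far as I know) and never cuts inside a piece, so a long paragraph stays whole. The sketch below is a simplified imitation of that behaviour, not the library's implementation; `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Merge separator-delimited pieces into chunks of at most chunk_size characters.

    A piece that is itself longer than chunk_size is kept whole, which is how
    oversized chunks can arise.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # start a new chunk, even if this piece is too long
    if current:
        chunks.append(current)
    return chunks


# Two short paragraphs followed by one paragraph longer than the limit.
sample = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 150
chunks = split_on_separator(sample, chunk_size=100)
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

If strict size limits matter, `RecursiveCharacterTextSplitter` from `langchain.text_splitter` tries several separators in turn and may be worth testing on this corpus.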

In [7]:
#openAI_embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])

openAI_embeddings = OpenAIEmbeddings(openai_api_key=openai.api_key)

In [8]:
#vStore = Chroma.from_documents(doc_texts, OpenAIEmbeddings)

vStore = Chroma.from_documents(doc_texts, openAI_embeddings)

In [9]:

model = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vStore.as_retriever())

# The deprecated equivalent was:
#   model = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=vStore)
# Running it produced this warning:
#   /Users/joellim/miniforge3/envs/tf_m1/lib/python3.8/site-packages/langchain/chains/retrieval_qa/base.py:201: UserWarning: `VectorDBQA` is deprecated - please use `from langchain.chains import RetrievalQA`
# Reference: https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html


In [10]:
#query = 'what is the fundamental shift towards preparedness?'
query = 'write a tweet about Key areas of innovation contributing to accelerated development and authorisation of COVID-19 vaccines'
#query = 'Whats the link for "Facts and figures: Ending violence against women"?'
#query = 'Who wrote "Applying lessons from the Ebola vaccine experience for SARS-CoV-2 and other epidemic pathogens"?'
#query = 'summarize the 100-day vaccine creation plan in less than 30 words'
#query = 'What are the barriers to vaccine creation - as a 6 verse poem'
#query = 'What are barriers to 100-day vaccine development?'
#query = 'What is CEPI\'s budget?' 
#query = 'What is the main mission of CEPI and how?'

print(model.run(query))

 A new study identified 37 innovations that enabled the accelerated development and authorisation of #COVID19 vaccines. It found that leveraging pre-existing insights, multiple processes running in parallel, and collaboration between stakeholders were key factors. #VaccinesWork


### To check the accuracy of the model's answers, the questions below were asked; each answer was correct and came from the page listed.

Q: 'what is the fundamental shift towards preparedness?'
A: Page 5

Q: 'write a tweet about Key areas of innovation contributing to accelerated development and authorisation of COVID-19 vaccines'
A: Page 4

Q: 'Whats the link for "Facts and figures: Ending violence against women"?'
A: Page 62

Q: 'Who wrote "Applying lessons from the Ebola vaccine experience for SARS-CoV-2 and other epidemic pathogens"?'
A: Page 63





------------------------------------------------------------------------------------------------

### Ask the model to show where in the corpus it got its answer:

In [11]:
query_for_source = 'Whats the link for "Facts and figures: Ending violence against women"?'

In [12]:
model_source_doc_return = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vStore.as_retriever(), return_source_documents=True)


In [13]:
source_doc_return = model_source_doc_return({'query': query_for_source})

source_doc_return['source_documents']

[Document(page_content='58', metadata={'source': '/Users/joellim/Desktop/coda_work_mac_2022/openai-project-3/embeddings1/cepi100days.txt'}),
 Document(page_content='42', metadata={'source': '/Users/joellim/Desktop/coda_work_mac_2022/openai-project-3/embeddings1/cepi100days.txt'}),
 Document(page_content='40', metadata={'source': '/Users/joellim/Desktop/coda_work_mac_2022/openai-project-3/embeddings1/cepi100days.txt'}),
 Document(page_content='2\tAllen, T. et al., 2017. Global hotspots and correlates of emerging zoonotic diseases. Nature Communications, p. 8:1124. 3\tThe concept of a stringent regulatory authority or SRA was developed by the WHO Secretariat and the Global Fund to Fight AIDS, Tuberculosis and Malaria to guide medicine procurement decisions and is now widely recognized by the international regulatory and procurement community. A list of stringent regulatory authorities can be consulted on: https://www.who.int/initiatives/who-listed-authority-reg-authorities/SRAs 4\tWHO, 2
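The first three source documents returned above contain only `'58'`, `'42'` and `'40'`, which look like bare page numbers that ended up as their own chunks. One possible cleanup is to drop such chunks before building the vector store. This is a sketch under assumptions: `Doc` is a stand-in for a LangChain `Document`, and `drop_trivial_chunks` and its `min_chars` threshold are hypothetical and would need tuning.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: only page_content and metadata are used."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_trivial_chunks(docs, min_chars=20):
    """Keep only chunks that are not purely numeric and have at least min_chars characters."""
    kept = []
    for doc in docs:
        text = doc.page_content.strip()
        if len(text) < min_chars or text.isdigit():
            continue  # likely a page number or other layout residue
        kept.append(doc)
    return kept


docs = [
    Doc("58"),
    Doc("42"),
    Doc("A list of stringent regulatory authorities can be consulted online."),
]
filtered = drop_trivial_chunks(docs)
print([d.page_content for d in filtered])

# Possible use in the notebook, before embedding:
#   vStore = Chroma.from_documents(drop_trivial_chunks(doc_texts), openAI_embeddings)
```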

### DEV NOTES

1. I had to install several new libraries, including langchain and nltk (Natural Language Toolkit).

2. VectorDBQA is deprecated. Use 'from langchain.chains import RetrievalQA' instead.

3. Details of the VectorDBQA deprecation:

    - When I ran: model = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=vStore),
    I got the following warning message: /Users/joellim/miniforge3/envs/tf_m1/lib/python3.8/site-packages/langchain/chains/retrieval_qa/base.py:201: UserWarning: `VectorDBQA` is deprecated - please use `from langchain.chains import RetrievalQA`
    - The replacement is documented here: https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html

4. For the 'magic' module ('import magic') I first had to install libmagic with 'brew install libmagic' via terminal (within the tf_m1 environment).

5. Does the model draw on information from outside the corpus, i.e. from the underlying OpenAI model's own training data? When asked 'Which countries fund CEPI?', it sometimes answers "I don't know" and at other times gives an answer like "CEPI is funded by multiple countries, including the UK, Germany, Norway, Japan, and Canada." The source document I used does not mention some of these nations, e.g. Canada, Norway and Germany. When asked which nations founded CEPI, it answered "CEPI was founded by the governments of Germany, Norway, Japan, Canada, France, India, Italy, the Netherlands, the United Kingdom and the Bill & Melinda Gates Foundation." I checked the CEPI website and the correct answer is "CEPI was founded in Davos by the governments of Norway and India, the Bill & Melinda Gates Foundation, Wellcome, and the World Economic Forum." At other times it answered "I don't know". This is plausible because the retrieved chunks are only added to the prompt; the language model can still fall back on what it learned in training. I therefore suspect that the model sometimes answers from knowledge outside the corpus or vector database, and users should verify an answer whenever they suspect this is happening. That said, the general accuracy of the answers is impressive for this ChatGPT-like experience.
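One way to reduce the behaviour described in note 5 is to tell the model explicitly to answer only from the retrieved context. The sketch below just builds such a prompt as a string. The template wording and the names `GROUNDED_TEMPLATE` and `build_grounded_prompt` are hypothetical, and this was not tested against the CEPI corpus, so treat it as a starting point rather than a fix.

```python
GROUNDED_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, reply exactly: I don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_grounded_prompt(context_chunks, question):
    """Combine retrieved chunks and the question into a context-only prompt."""
    return GROUNDED_TEMPLATE.format(context="\n\n".join(context_chunks), question=question)


prompt = build_grounded_prompt(
    ["(retrieved chunk 1)", "(retrieved chunk 2)"],
    "Which countries fund CEPI?",
)
print(prompt)

# Assumption to verify against the installed LangChain version: a template like
# this can be wrapped in a PromptTemplate and passed to RetrievalQA.from_chain_type
# via chain_type_kwargs={"prompt": ...}. Using OpenAI(temperature=0) should also
# make repeated answers to the same query more consistent.
```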






