# Data Connections

## Create a Chatbot for any Wikipedia topic

* We will connect to the internet and fetch the Wikipedia data for a topic of our choice
* Split the data into chunks (you choose the size) and encode the chunks into embeddings
* Write these embeddings to a ChromaDB vector store
* Get a response to our query from the LLM by providing it the Wikipedia data as context. Do we need to provide all of the data as context?
* Use contextual compression to return only the portion of each document that is relevant to the question
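On the question of whether all of the data can be passed as context, a rough back-of-the-envelope estimate helps. The sketch below uses the loader settings from this notebook (10 documents, up to 4000 characters each). The figure of about 4 characters per token and the 4,096-token context window are assumptions for illustration, not values taken from this notebook.

```python
# Rough context-budget estimate for the loader settings used in this notebook.
# Assumptions (not from the notebook): ~4 characters per token for English text,
# and a 4,096-token context window for the model.
MAX_DOCS = 10                 # load_max_docs
CHARS_PER_DOC = 4000          # doc_content_chars_max
CHARS_PER_TOKEN = 4           # rule-of-thumb assumption
CONTEXT_WINDOW_TOKENS = 4096  # assumed model limit


def estimate_tokens(num_chars, chars_per_token=CHARS_PER_TOKEN):
    """Approximate a token count from a character count."""
    return num_chars // chars_per_token


total_chars = MAX_DOCS * CHARS_PER_DOC
total_tokens = estimate_tokens(total_chars)
fits = total_tokens <= CONTEXT_WINDOW_TOKENS

print(f"~{total_tokens} tokens of context; fits in window: {fits}")
```

Under these assumptions the full text does not fit, which is the motivation for retrieving only the relevant chunks and then compressing them.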

In [3]:
from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
# https://openai.com/blog/introducing-text-and-code-embeddings - OpenAI embeddings beat SOTA embeddings for multiple tasks like Similarity Search
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor 
from dotenv import load_dotenv, find_dotenv

import os
load_dotenv(find_dotenv(), override=True)
api_key = os.getenv("OPENAI_API_KEY")

In [69]:
def wiki_bot(topic, question):
    # PART ONE
    # Load up to 10 Wikipedia pages for the topic, truncating each to 4000 characters.
    loader = WikipediaLoader(query=topic, load_max_docs=10, doc_content_chars_max=4000)
    documents = loader.load()

    # PART TWO
    # Split the documents into chunks (you choose how and what size).
    # With from_tiktoken_encoder, chunk_size is measured in tokens, not characters.
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
    docs = text_splitter.split_documents(documents)

    # PART THREE
    # Embed the documents (now in chunks) into a persisted ChromaDB.
    embedding_function = OpenAIEmbeddings()
    db = Chroma.from_documents(docs, embedding_function, persist_directory='./OpenAI')
    db.persist()

    # PART FOUR
    # Use ChatOpenAI, MultiQueryRetriever and ContextualCompressionRetriever to return
    # the most relevant part of the documents.
    # Embeddings capture semantic relationships between items.
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    # Rephrase the question into several queries and merge the retrieved chunks.
    retriever_from_llm = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)

    # Let the LLM extract only the passages that are relevant to the question.
    compressor = LLMChainExtractor.from_llm(llm)
    compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                           base_retriever=retriever_from_llm)

    compressed_docs = compression_retriever.get_relevant_documents(question)

    # Guard against an empty result so the function does not raise an IndexError.
    if not compressed_docs:
        return "No relevant passage found."
    return compressed_docs[0].page_content
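
The comment "Embeddings capture semantic relationships between items" can be made concrete with a toy example. The sketch below ranks a few hand-made vectors by cosine similarity to a query vector. The 3-dimensional vectors and their labels are hypothetical, and cosine similarity is used only as one common measure; it is not necessarily the distance function the Chroma store above is configured with.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real embedding vectors are far longer.
toy_store = {
    "board removes CEO": [0.9, 0.1, 0.0],
    "image generation model": [0.0, 0.2, 0.9],
    "leadership dispute": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded question

# Rank stored texts from most to least similar to the query.
ranked = sorted(
    toy_store,
    key=lambda text: cosine_similarity(query_vector, toy_store[text]),
    reverse=True,
)
print(ranked)
```

Texts whose vectors point in a similar direction to the query rank first, which is the intuition behind retrieving chunks by embedding similarity.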

In [70]:
print(wiki_bot("OpenAI", "What was the controversy between Sam Altman and the board of OpenAI"))

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide details about the dispute between Sam Altman and the board of OpenAI?', '2. What were the main points of contention between Sam Altman and the board of OpenAI?', '3. Could you explain the controversy that arose between Sam Altman and the board of OpenAI?']


The board announced that it had made the decision to remove Altman as CEO. The board said that Altman "was not consistently candid in his communications."


### Behind the Scenes

In [45]:
loader = WikipediaLoader(query='OpenAI',load_max_docs=10, doc_content_chars_max = 4000)
documents = loader.load()

print(documents)

[Document(page_content='OpenAI is an American artificial intelligence (AI) research organization consisting of the non-profit OpenAI, Inc. registered in Delaware and its for-profit subsidiary OpenAI Global, LLC. OpenAI researches artificial intelligence with the declared intention of developing "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work". OpenAI has also developed several large language models, such as ChatGPT and GPT-4, as well as advanced image generation models like DALL-E 3, and in the past published open-source models.The organization was founded in December 2015 by Ilya Sutskever, Greg Brockman, Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, Jessica Livingston, John Schulman, Pamela Vagata, and Wojciech Zaremba, with Sam Altman and Elon Musk serving as the initial board members. Microsoft provided OpenAI Global LLC with a $1 billion investment in 201

In [46]:
len(documents)

10
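
Inside `wiki_bot`, each of these documents is split into chunks before embedding. As a rough illustration of the idea only, the sketch below chunks text by whitespace-separated words with an optional overlap. This is a simplified stand-in: the `CharacterTextSplitter.from_tiktoken_encoder` call used in this notebook measures length in tokens, and the helper name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text, chunk_size=500, overlap=0):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
chunks = split_into_chunks(sample, chunk_size=5, overlap=1)
for chunk in chunks:
    print(chunk)
```

Smaller chunks give more precise retrieval hits but less surrounding context per hit, so the chunk size is a trade-off to tune for the topic.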

In [59]:
print(documents[0].page_content)

OpenAI is an American artificial intelligence (AI) research organization consisting of the non-profit OpenAI, Inc. registered in Delaware and its for-profit subsidiary OpenAI Global, LLC. OpenAI researches artificial intelligence with the declared intention of developing "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work". OpenAI has also developed several large language models, such as ChatGPT and GPT-4, as well as advanced image generation models like DALL-E 3, and in the past published open-source models.The organization was founded in December 2015 by Ilya Sutskever, Greg Brockman, Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, Jessica Livingston, John Schulman, Pamela Vagata, and Wojciech Zaremba, with Sam Altman and Elon Musk serving as the initial board members. Microsoft provided OpenAI Global LLC with a $1 billion investment in 2019 and a $10 billion inve

In [58]:
print(documents[0].metadata)

{'title': 'OpenAI', 'summary': 'OpenAI is an American artificial intelligence (AI) research organization consisting of the non-profit OpenAI, Inc. registered in Delaware and its for-profit subsidiary OpenAI Global, LLC. OpenAI researches artificial intelligence with the declared intention of developing "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work". OpenAI has also developed several large language models, such as ChatGPT and GPT-4, as well as advanced image generation models like DALL-E 3, and in the past published open-source models.The organization was founded in December 2015 by Ilya Sutskever, Greg Brockman, Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, Jessica Livingston, John Schulman, Pamela Vagata, and Wojciech Zaremba, with Sam Altman and Elon Musk serving as the initial board members. Microsoft provided OpenAI Global LLC with a $1 billion investmen

In [47]:
for doc in documents:
    print(doc.metadata['source'])

https://en.wikipedia.org/wiki/OpenAI
https://en.wikipedia.org/wiki/OpenAI_Codex
https://en.wikipedia.org/wiki/OpenAI_Five
https://en.wikipedia.org/wiki/ChatGPT
https://en.wikipedia.org/wiki/Greg_Brockman
https://en.wikipedia.org/wiki/Mira_Murati
https://en.wikipedia.org/wiki/Sam_Altman
https://en.wikipedia.org/wiki/Ilya_Sutskever
https://en.wikipedia.org/wiki/Artificial_general_intelligence
https://en.wikipedia.org/wiki/Bard_(chatbot)


In [12]:
# Set logging for the queries to understand what's happening behind the scenes. 
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [20]:
from langchain.retrievers.multi_query import MultiQueryRetriever

In [68]:
topic = "OpenAI"
question = 'What was the controversy between Sam Altman and the board of OpenAI'

embedding_function = OpenAIEmbeddings()
db = Chroma(persist_directory='./OpenAI',embedding_function=embedding_function)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0) 
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=db.as_retriever(),llm=llm)

retriever_from_llm.get_relevant_documents(query = question)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide details about the dispute between Sam Altman and the board of OpenAI?', '2. What were the main points of contention between Sam Altman and the board of OpenAI?', '3. Could you explain the controversy that arose between Sam Altman and the board of OpenAI?']


[Document(page_content='==== OpenAI ====\nOpenAI was initially funded by Altman, Greg Brockman, Elon Musk, Jessica Livingston, Peter Thiel, Microsoft, Amazon Web Services, Infosys, and YC Research. When OpenAI launched in 2015, it had raised $1 billion. In March 2019, Sam Altman left Y Combinator to focus full-time on OpenAI as CEO. By the summer of 2019, he had helped raise $1 billion from Microsoft. Altman testified before the United States Senate Judiciary Subcommittee on Privacy, Technology and the Law on 16 May 2023 about issues of AI oversight. On November 17, 2023, OpenAI\'s board announced that it had made the decision to remove Altman as CEO. The board said that Altman "was not consistently candid in his communications."The Verge reported that a day after Altman was removed, the board was in discussion to bring him back. It has also been said that before Altman was removed, he was', metadata={'source': 'https://en.wikipedia.org/wiki/Sam_Altman', 'summary': 'Samuel Harris Altma
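
The log line above shows the question being rephrased into three queries. The retriever then runs each query and returns the unique union of the results. A minimal sketch of that merging step follows, with a hypothetical keyword lookup (`toy_index`, `toy_search`) standing in for the vector store; it illustrates the idea and is not the library's implementation.

```python
def retrieve_unique(queries, search):
    """Run every query through `search` and merge the hits, dropping duplicates."""
    seen = set()
    merged = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical index mapping a keyword to the chunks it would retrieve.
toy_index = {
    "dispute": ["chunk A", "chunk B"],
    "contention": ["chunk B", "chunk C"],
    "controversy": ["chunk A", "chunk C", "chunk D"],
}


def toy_search(query):
    """Return the chunks of every keyword that appears in the query."""
    hits = []
    for keyword, chunks in toy_index.items():
        if keyword in query.lower():
            hits.extend(chunks)
    return hits


toy_queries = [
    "Can you provide details about the dispute between Sam Altman and the board?",
    "What were the main points of contention?",
    "Could you explain the controversy that arose?",
]
merged = retrieve_unique(toy_queries, toy_search)
print(merged)
```

Different phrasings surface different chunks, so the merged set covers more of the relevant material than any single query would.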

In [65]:

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=db.as_retriever(), verbose=False)
compressed_docs = compression_retriever.get_relevant_documents(question)
print(compressed_docs[0].page_content)
print(compressed_docs[0].metadata['summary'])



The board announced that it had made the decision to remove Altman as CEO. The board said that Altman "was not consistently candid in his communications."
Samuel Harris Altman ( AWLT-mən; born April 22, 1985) is an American entrepreneur and investor, who has been the chief executive officer of OpenAI since 2019 (being briefly fired in November 2023). Prior to OpenAI, Altman was president of Y Combinator from 2014 until he was fired by Paul Graham in 2019.
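
Note how the first line of output is a single extracted passage rather than the full retrieved chunk. The sketch below imitates that behaviour with a plain word-overlap filter over sentences. This is a crude stand-in chosen so that the example runs without an API key: `LLMChainExtractor` asks the LLM to do the extraction, and the helper `extract_relevant` and its `min_overlap` threshold are hypothetical.

```python
import re


def extract_relevant(text, question, min_overlap=3):
    """Keep only sentences sharing at least min_overlap words with the question."""
    question_words = set(re.findall(r"[a-z]+", question.lower()))
    kept = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        sentence_words = set(re.findall(r"[a-z]+", sentence.lower()))
        if len(question_words & sentence_words) >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)


toy_chunk = (
    "When OpenAI launched in 2015, it had raised $1 billion. "
    "The board announced that it had made the decision to remove Altman as CEO. "
    "Altman testified before the Senate about AI oversight."
)
toy_question = "Why did the board remove Altman?"

compressed = extract_relevant(toy_chunk, toy_question)
print(compressed)
```

The metadata (such as `summary`) is untouched by compression, which is why the second line of the output above still shows the full page summary.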


In [25]:
compressed_docs

[Document(page_content='The board announced that it had made the decision to remove Altman as CEO. The board said that Altman "was not consistently candid in his communications."', metadata={'source': 'https://en.wikipedia.org/wiki/Sam_Altman', 'summary': 'Samuel Harris Altman ( AWLT-mən; born April 22, 1985) is an American entrepreneur and investor, who has been the chief executive officer of OpenAI since 2019 (being briefly fired in November 2023). Prior to OpenAI, Altman was president of Y Combinator from 2014 until he was fired by Paul Graham in 2019.', 'title': 'Sam Altman'}),
 Document(page_content='Samuel Harris Altman has been the chief executive officer of OpenAI since 2019 (being briefly fired in November 2023).', metadata={'source': 'https://en.wikipedia.org/wiki/Sam_Altman', 'summary': 'Samuel Harris Altman ( AWLT-mən; born April 22, 1985) is an American entrepreneur and investor, who has been the chief executive officer of OpenAI since 2019 (being briefly fired in November