# Keeping Knowledge Organized

* [1. Utilizing Deep Lake](#deeplake)
    * [1.1. Using Text Loaders and Text Splitters](#loaders_splitters) 
    * [1.2. Exploring DeepLake - adding and retrieving data](#deeplake_explore) 
    * [1.3. Question-Answering Example](#q_a) 
    * [1.4. Using Document Compressors](#compressors) 
* [2. Streamlined Data Ingestion](#ingestion)
* [3. Text Splitters](#splitters)
* [4. Embeddings](#embeddings)
* [5. Customer Support Question Answering Chatbot](#cs)
* [6. Gong.io Open-Source Alternative AI Sales Assistant](#gong_io)
* [7. Creating Picture Books with OpenAI, Replicate, and Deep Lake](#picture_books)
* [8. Additional Resources](#resources)

In [1]:
import os
from keys import OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

<hr>
<a class="anchor" id="deeplake">
    
## 1. Utilizing Deep Lake
    
</a>

In LangChain, **indexes and retrievers** play a crucial role in structuring documents and fetching relevant data for LLMs. An `index` is a data structure that organizes and stores documents to enable efficient searching, while a `retriever` uses the index to find and return relevant documents in response to user queries.

<hr>
<a class="anchor" id="loaders_splitters">
    
### 1.1. Using Text Loaders and Text Splitters
    
</a>

In [2]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

In [3]:
# Sample text, taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text = """Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
it says will help businesses “generate text, images, code, videos, audio, and more from
simple natural language prompts.”

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or
Meta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,
PaLM is a flexible system that can potentially carry out all sorts of text generation and
editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for
example, or you could use it for tasks like summarizing text or even writing code.
(It’s similar to features Google also announced today for its Workspace apps like Google
Docs and Gmail.)
"""

# Write text to a local file (create the output directory first if it is missing)
os.makedirs("output", exist_ok=True)
with open("output/my_file.txt", "w") as file:
    file.write(text)

In [4]:
# Use TextLoader to load text from the local file
loader = TextLoader("output/my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))

1


In [5]:
docs_from_file

[Document(page_content='Google opens up its AI language model PaLM to challenge OpenAI and GPT-3\nGoogle is offering developers access to one of its most advanced AI language models: PaLM.\nThe search giant is launching an API for PaLM alongside a number of AI enterprise tools\nit says will help businesses “generate text, images, code, videos, audio, and more from\nsimple natural language prompts.”\n\nPaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or\nMeta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,\nPaLM is a flexible system that can potentially carry out all sorts of text generation and\nediting tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for\nexample, or you could use it for tasks like summarizing text or even writing code.\n(It’s similar to features Google also announced today for its Workspace apps like Google\nDocs and Gmail.)\n', metadata={'source': 'output/my_file.txt'})]

In [6]:
# Use CharacterTextSplitter to split the docs into texts
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20) # create a text splitter
docs = text_splitter.split_documents(docs_from_file) # split documents into chunks
print(len(docs))

Created a chunk of size 373, which is longer than the specified 200


2
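The warning above appears because `CharacterTextSplitter` splits on a single separator (a blank line, `"\n\n"`, by default) and then merges neighbouring pieces up to `chunk_size`; a paragraph that is already longer than `chunk_size` is kept whole. The sample text has two paragraphs, hence two chunks. The following is a minimal pure-Python sketch of that idea, not LangChain's actual implementation: the function name `split_on_separator` is hypothetical, and overlap handling is deliberately left out.

```python
def split_on_separator(text, separator="\n\n", chunk_size=200):
    """Split on a separator, then greedily merge neighbouring pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{separator}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # still fits: keep merging
        else:
            if current:
                chunks.append(current)   # flush what has been merged so far
            current = piece              # an oversized piece is kept whole
    if current:
        chunks.append(current)
    return chunks


# Three "paragraphs": one longer than chunk_size, two short ones
sample = ("A" * 300) + "\n\n" + ("B" * 50) + "\n\n" + ("C" * 50)
chunks = split_on_separator(sample)
print([len(c) for c in chunks])  # [300, 102]
```

The 300-character paragraph exceeds the limit but cannot be split without a separator inside it, which mirrors the 373-character chunk reported above.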


<hr>
<a class="anchor" id="deeplake_explore">
    
### 1.2. Exploring DeepLake - adding and retrieving data
    
</a>

In [7]:
# Specify an embedder model
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

Let's explore Deep Lake and how it can be used to retrieve relevant documents that serve as context for the LLM.

**Deep Lake** is a vector store that provides several advantages:

- It’s **multimodal**, which means that it can be used to store items of diverse modalities (texts, images, audio, and video, along with their vector representations).
- It’s **serverless**, which means that we can create and manage cloud datasets without the need to create and manage a database instance. 
- It’s possible to create a *streaming data loader* out of the data loaded into a Deep Lake dataset, which is convenient for fine-tuning machine learning models using common frameworks like PyTorch and TensorFlow.
- Data can be **queried and visualized** from the web.

In [10]:
# Load the Activeloop key 
from keys import ACTIVELOOP_TOKEN
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN

# Import DeepLake
from langchain.vectorstores import DeepLake

In [11]:
# Create DeepLake dataset
my_activeloop_org_id = "iryna"
my_activeloop_dataset_name = "langchain_course_indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

Your Deep Lake dataset has been successfully created!




In [12]:
# Add documents to the DeepLake dataset
db.add_documents(docs)

Dataset(path='hub://iryna/langchain_course_indexers_retrievers', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (2, 1536)  float32   None   
    id        text      (2, 1)      str     None   
 metadata     json      (2, 1)      str     None   
   text       text      (2, 1)      str     None   


 

['e7da2916-385d-11ee-9111-12ee7aa5dbdc',
 'e7da2a10-385d-11ee-9111-12ee7aa5dbdc']

In [14]:
# Create retriever from db
retriever = db.as_retriever()

<hr>
<a class="anchor" id="q_a">
    
### 1.3. Question-Answering Example
    
</a>

In [15]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="text-davinci-003"),
    chain_type="stuff",
    retriever=retriever
)

In [16]:
query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)

 Google is offering access to its AI language model, PaLM, to developers. It is launching an API for PaLM which will help businesses generate text, images, code, videos, audio, and more from natural language prompts. PaLM is a large language model, similar to the GPT series created by OpenAI, which can be used for tasks like summarizing text or writing code.


Under the hood, the question-answering example above runs a similarity search: the embeddings are used to identify the documents that best match the query, and those documents become the context for the LLM. Preselecting the most suitable documents based on semantic similarity lets us provide the model with meaningful knowledge through the prompt while remaining within the allowed context size.

A "stuff" chain was used to supply the information to the LLM. In this technique, we "stuff" all the retrieved information into the LLM's prompt.

**Note:** Stuffing is effective only with shorter documents because of the context length limit that most LLMs have.
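To make the "stuffing" idea concrete, here is a minimal sketch in plain Python. It is not the prompt that `RetrievalQA` actually sends: the `Doc` class, the prompt wording, and the `max_chars` limit are all hypothetical, and a character count is only a crude stand-in for a real token limit.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def build_stuff_prompt(docs, question, max_chars=4000):
    """Concatenate every retrieved document into one prompt (the 'stuff' technique)."""
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    # Assumed limit: stuffing fails once the combined text outgrows the context window
    if len(prompt) > max_chars:
        raise ValueError("Stuffed prompt exceeds the assumed context limit")
    return prompt


docs = [
    Doc("Google is launching an API for PaLM."),
    Doc("PaLM is a large language model."),
]
prompt = build_stuff_prompt(docs, "How Google plans to challenge OpenAI?")
print(prompt)
```

With many or long documents the length check fails, which is the situation the note above warns about.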

<hr>
<a class="anchor" id="compressors">
    
### 1.4. Using Document Compressors
    
</a>

Including unrelated information in the LLM prompt is detrimental because it can divert the LLM's focus from important details and occupy valuable prompt space.

To address this issue and improve the retrieval process, let's use a wrapper named `ContextualCompressionRetriever` that will wrap the base retriever with an `LLMChainExtractor`. The `LLMChainExtractor` iterates over the initially returned documents and extracts only the content relevant to the query. 

In [17]:
# An example of how to use ContextualCompressionRetriever with LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Create GPT3 wrapper
llm = OpenAI(model="text-davinci-003", temperature=0)

# Create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

In [22]:
# Retrieve compressed (relevant) documents 
retrieved_docs = compression_retriever.get_relevant_documents("How Google plans to challenge OpenAI?")
print(retrieved_docs[0].page_content)

Google is offering developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses “generate text, images, code, videos, audio, and more from simple natural language prompts.”
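The shape of contextual compression can be sketched without an LLM. The example below is a toy under stated assumptions: where `LLMChainExtractor` asks an LLM to extract the relevant passages, this sketch uses simple word overlap as a stand-in, and the names `extract_relevant`, `compress_documents`, and `min_overlap` are hypothetical.

```python
import re


def extract_relevant(text, query, min_overlap=2):
    """Keep only sentences sharing at least min_overlap words with the query.

    Word overlap is a toy stand-in for the LLM call made by a real extractor.
    """
    query_terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [
        sentence for sentence in sentences
        if len(query_terms & set(re.findall(r"\w+", sentence.lower()))) >= min_overlap
    ]
    return " ".join(kept)


def compress_documents(docs, query):
    """Run the extractor over each retrieved document and drop empty results."""
    compressed = []
    for doc in docs:
        extract = extract_relevant(doc, query)
        if extract:
            compressed.append(extract)
    return compressed


retrieved = [
    "Google is launching an API for PaLM to challenge OpenAI. The weather was sunny that day.",
    "Cats sleep most of the day.",
]
result = compress_documents(retrieved, "How Google plans to challenge OpenAI?")
print(result)  # ['Google is launching an API for PaLM to challenge OpenAI.']
```

The unrelated sentence and the unrelated document are both dropped, so only query-relevant text would reach the prompt.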


<hr>
<a class="anchor" id="ingestion">
    
## 2. Streamlined Data Ingestion
    
</a>

The LangChain library offers a variety of helpers designed to facilitate data loading and extraction from diverse sources: 
- Text
- PyPDF 
- Selenium URL Loaders
- Google Drive Sync

Regardless of whether the information originates from a PDF file or website content, these classes streamline the process of handling different data formats.

<hr>
<a class="anchor" id="splitters">
    
## 3. Text Splitters
    
</a>


The length of the contents may vary depending on their source and may exceed the input window size of the model. Splitting a large text into smaller segments allows us to use only the most relevant chunks as context instead of expecting the model to comprehend the entire textual input.
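As a rough illustration of chunking with overlap, the sketch below cuts a string into fixed-size windows that share a few characters with their neighbours. It is a simplification: LangChain's splitters cut on separators rather than raw character offsets, and the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=200, chunk_overlap=20):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


text_sample = "".join(str(i % 10) for i in range(500))
windows = chunk_with_overlap(text_sample)
print([len(w) for w in windows])  # [200, 200, 140]
```

The overlap means a sentence that straddles a boundary still appears intact in at least one chunk, at the cost of storing some text twice.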

<hr>
<a class="anchor" id="embeddings">
    
## 4. Embeddings
    
</a>


LLMs can map textual data into an embedding space, which yields representations that work across languages.

Embeddings are high-dimensional vectors that capture semantic information. They also help identify relevant information by quantifying the distance between data points: the closer two points are, the closer their meaning.

The LangChain integration provides the functions needed both to transform text into embeddings and to calculate similarities between them.
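A small worked example may help here. The sketch below ranks documents by cosine similarity, one common way to measure closeness between embeddings. The three-dimensional vectors and document names are made up for illustration (the `text-embedding-ada-002` vectors stored earlier have 1536 dimensions), and the choice of cosine similarity is an assumption rather than a statement about what a given vector store uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings (hypothetical values)
doc_vecs = {
    "palm_api": [0.9, 0.1, 0.0],
    "workspace": [0.6, 0.4, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
best = top_k(query_vec, doc_vecs)
print(best)  # ['palm_api', 'workspace']
```

A retriever follows the same pattern at scale: embed the query, score it against the stored vectors, and return the top matches.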

<hr>
<a class="anchor" id="cs">
    
## 5. Customer Support Question Answering Chatbot
    
</a>

Let's demonstrate how to use a website's content as supplementary context for a chatbot to respond to user queries effectively. 

The code implementation below involves:
- employing data loaders, 
- storing the corresponding embeddings in the Deep Lake dataset, 
- and retrieving the most relevant documents corresponding to the user's question.

<hr>
<a class="anchor" id="gong_io">
    
## 6. Gong.io Open-Source Alternative AI Sales Assistant
    
</a>

Let's explore how LangChain, Deep Lake, and GPT-4 can be used to develop a sales assistant that gives advice to salespeople while taking internal guidelines into consideration.

<hr>
<a class="anchor" id="picture_books">
    
## 7. Creating Picture Books with OpenAI, Replicate, and Deep Lake
</a>

Let's look at a use case of AI technology in the creative domain: creating children's picture books, using the OpenAI GPT-3.5 LLM to write the story and Stable Diffusion to generate the images for it.

<hr>
<a class="anchor" id="resources">
    
## 8. Additional Resources
</a>

- [Improving Document Retrieval with Contextual Compression](https://blog.langchain.dev/improving-document-retrieval-with-contextual-compression/)