# Keeping Knowledge Organized

* [1. Utilizing Deep Lake](#deeplake)
    * [1.1. Using Text Loaders and Text Splitters](#loaders_splitters) 
    * [1.2. Exploring DeepLake - adding and retrieving data](#deeplake_explore) 
    * [1.3. Question-Answering Example](#q_a) 
    * [1.4. Using Document Compressors](#compressors) 
* [2. Streamlined Data Ingestion](#ingestion)
    * [2.1. TextLoader](#TextLoader)
    * [2.2. PyPDFLoader](#PyPDFLoader) 
    * [2.3. SeleniumURLLoader](#SeleniumURLLoader) 
    * [2.4. GoogleDriveLoader](#GoogleDriveLoader) 
* [3. Text Splitters](#splitters)
* [4. Embeddings](#embeddings)
* [5. Customer Support Question Answering Chatbot](#cs)
* [6. Gong.io Open-Source Alternative AI Sales Assistant](#gong_io)
* [7. Creating Picture Books with OpenAI, Replicate, and Deep Lake](#picture_books)
* [8. Additional Resources](#resources)

In [1]:
import os
from keys import OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

<hr>
<a class="anchor" id="deeplake">
    
## 1. Utilizing Deep Lake
    
</a>

In LangChain, **indexes and retrievers** play a crucial role in structuring documents and fetching relevant data for LLMs. An `index` is a data structure that organizes and stores documents to enable efficient searching, while a `retriever` uses the index to find and return relevant documents in response to user queries.
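
To make the distinction concrete, here is a minimal, library-free sketch. `ToyIndex` and `ToyRetriever` are hypothetical names, not LangChain classes, and the keyword matching is only a stand-in for the embedding-based search used later in this notebook.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical inverted index: maps each lowercase word to document ids."""

    def __init__(self):
        self.documents = []
        self.word_to_ids = defaultdict(set)

    def add(self, text):
        doc_id = len(self.documents)
        self.documents.append(text)
        for word in text.lower().split():
            self.word_to_ids[word.strip(".,?!:")].add(doc_id)
        return doc_id


class ToyRetriever:
    """Hypothetical retriever: ranks documents by how many query words they contain."""

    def __init__(self, index):
        self.index = index

    def get_relevant_documents(self, query, k=1):
        hits = defaultdict(int)
        for word in query.lower().split():
            for doc_id in self.index.word_to_ids.get(word.strip(".,?!:"), ()):
                hits[doc_id] += 1
        # Highest hit count first; ties are broken by insertion order
        ranked = sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))
        return [self.index.documents[doc_id] for doc_id in ranked[:k]]


toy_index = ToyIndex()
toy_index.add("Google is launching an API for PaLM.")
toy_index.add("Deep Lake is a vector store.")

toy_retriever = ToyRetriever(toy_index)
print(toy_retriever.get_relevant_documents("What is Deep Lake?"))
```

The index only stores and organizes; the retriever is the part that turns a query into a ranked list of documents.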

<hr>
<a class="anchor" id="loaders_splitters">
    
### 1.1. Using Text Loaders and Text Splitters
    
</a>

In [2]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

In [3]:
# Sample of text, taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text = """Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
it says will help businesses “generate text, images, code, videos, audio, and more from
simple natural language prompts.”

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or
Meta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,
PaLM is a flexible system that can potentially carry out all sorts of text generation and
editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for
example, or you could use it for tasks like summarizing text or even writing code.
(It’s similar to features Google also announced today for its Workspace apps like Google
Docs and Gmail.)
"""

# Write text to local file
with open("output/my_file.txt", "w") as file:
    file.write(text)

In [4]:
# Use TextLoader to load text from the local file
loader = TextLoader("output/my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))

1


In [5]:
docs_from_file

[Document(page_content='Google opens up its AI language model PaLM to challenge OpenAI and GPT-3\nGoogle is offering developers access to one of its most advanced AI language models: PaLM.\nThe search giant is launching an API for PaLM alongside a number of AI enterprise tools\nit says will help businesses “generate text, images, code, videos, audio, and more from\nsimple natural language prompts.”\n\nPaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or\nMeta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,\nPaLM is a flexible system that can potentially carry out all sorts of text generation and\nediting tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for\nexample, or you could use it for tasks like summarizing text or even writing code.\n(It’s similar to features Google also announced today for its Workspace apps like Google\nDocs and Gmail.)\n', metadata={'source': 'output/my_file.txt'})]

In [6]:
# Use CharacterTextSplitter to split the docs into texts
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20) # create a text splitter
docs = text_splitter.split_documents(docs_from_file) # split documents into chunks
print(len(docs))

Created a chunk of size 373, which is longer than the specified 200


2
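
The warning above appears because `CharacterTextSplitter` cuts only at its separator (a blank line, `"\n\n"`, by default), and the sample text contains just two paragraphs. The sketch below is a simplified illustration of that behaviour, not LangChain's implementation; `split_on_separator` is a hypothetical helper and chunk overlap is ignored.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Simplified sketch of separator-based splitting (overlap is ignored).

    Pieces are only ever cut at `separator`, so a single piece longer than
    `chunk_size` is kept whole instead of being cut mid-paragraph.
    """
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow: close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "A" * 30 + "\n\n" + "B" * 10 + "\n\n" + "C" * 5
sample_chunks = split_on_separator(sample, chunk_size=20)
print([len(chunk) for chunk in sample_chunks])
```

The 30-character piece survives intact even though `chunk_size` is 20, which mirrors the 373-character chunk reported above. To get chunks closer to the requested size, passing `separator="\n"` to `CharacterTextSplitter` or switching to `RecursiveCharacterTextSplitter` are two options worth trying.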


<hr>
<a class="anchor" id="deeplake_explore">
    
### 1.2. Exploring DeepLake - adding and retrieving data
    
</a>

In [7]:
# Specify an embedder model
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

Let's explore Deep Lake and how it can be utilized to retrieve pertinent documents for contextual use. 

**Deep Lake** is a vector store that provides several advantages:

- It’s **multimodal**, which means that it can be used to store items of diverse modalities (texts, images, audio, and video, along with their vector representations).
- It’s **serverless**, which means that we can create and manage cloud datasets without the need to create and manage a database instance. 
- It’s possible to create a *streaming data loader* out of the data loaded into a Deep Lake dataset, which is convenient for fine-tuning machine learning models using common frameworks like PyTorch and TensorFlow.
- Data can be **queried and visualized** from the web.

In [8]:
# Load the Activeloop key 
from keys import ACTIVELOOP_TOKEN
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN

# Import DeepLake
from langchain.vectorstores import DeepLake

In [9]:
# Create DeepLake dataset
my_activeloop_org_id = "iryna"
my_activeloop_dataset_name = "langchain_course_indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

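# Note: `embedding_function` is deprecated (see the warning in the output); newer versions expect `embedding=embeddings`
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)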

Using embedding function is deprecated and will be removed in the future. Please use embedding instead.


Deep Lake Dataset in hub://iryna/langchain_course_indexers_retrievers already exists, loading from the storage


In [10]:
# Add documents to the DeepLake dataset
db.add_documents(docs)

 

Dataset(path='hub://iryna/langchain_course_indexers_retrievers', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (4, 1536)  float32   None   
    id        text      (4, 1)      str     None   
 metadata     json      (4, 1)      str     None   
   text       text      (4, 1)      str     None   


['15cdcbe2-3882-11ee-a6a7-12ee7aa5dbdc',
 '15cdccd2-3882-11ee-a6a7-12ee7aa5dbdc']

In [11]:
# Create retriever from db
retriever = db.as_retriever()

<hr>
<a class="anchor" id="q_a">
    
### 1.3. Question-Answering Example
    
</a>

In [12]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="text-davinci-003"),
    chain_type="stuff",
    retriever=retriever
)

In [13]:
query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)

 Google plans to challenge OpenAI by offering developers access to their most advanced AI language model, PaLM, and launching an API for PaLM alongside a number of AI enterprise tools. PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta's LLaMA family of models.


What happened under the hood in the question-answering example above is a similarity search. It was conducted using the embeddings to identify matching documents to be used as context for the LLM. Preselecting the most suitable documents based on semantic similarity enables us to provide the model with meaningful knowledge through the prompt while remaining within the allowed context size.

Also, a "stuff" chain was used to supply information to the LLM. In this technique, we "stuff" all the retrieved information into the LLM's prompt. 

**Note:** Stuffing is effective only with shorter documents because of the context length limit that most LLMs have.
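
A rough sketch of the stuffing idea follows. `build_stuff_prompt` and its prompt wording are hypothetical, not the template LangChain uses, and the budget is counted in characters for simplicity, whereas real context limits are measured in tokens.

```python
def build_stuff_prompt(question, documents, max_context_chars=1000):
    """Concatenate documents into one prompt, stopping before the budget is exceeded."""
    context_parts, used = [], 0
    for document in documents:
        if used + len(document) > max_context_chars:
            break  # the next document would not fit, so stuffing stops here
        context_parts.append(document)
        used += len(document)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


toy_docs = [
    "Google is launching an API for PaLM.",
    "PaLM is a large language model, or LLM.",
]
stuffed_prompt = build_stuff_prompt("What is PaLM?", toy_docs, max_context_chars=50)
print(stuffed_prompt)
```

With a 50-character budget only the first document fits, which illustrates why stuffing breaks down once the retrieved documents grow long.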

<hr>
<a class="anchor" id="compressors">
    
### 1.4. Using Document Compressors
    
</a>

Including unrelated information in the LLM prompt is detrimental because it can divert the LLM's focus from important details and occupy valuable prompt space.

To address this issue and improve the retrieval process, let's use a wrapper named `ContextualCompressionRetriever` that will wrap the base retriever with an `LLMChainExtractor`. The `LLMChainExtractor` iterates over the initially returned documents and extracts only the content relevant to the query. 

In [14]:
# An example of how to use ContextualCompressionRetriever with LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Create GPT3 wrapper
llm = OpenAI(model="text-davinci-003", temperature=0)

# Create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

In [15]:
# Retrieve compressed (relevant) documents 
retrieved_docs = compression_retriever.get_relevant_documents("How Google plans to challenge OpenAI?")
print(retrieved_docs[0].page_content)



Google is offering developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses “generate text, images, code, videos, audio, and more from simple natural language prompts.”
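
Conceptually, the compressor filters each retrieved document down to the parts that matter for the query. The toy sketch below imitates this with a keyword-overlap heuristic instead of an LLM call; `extract_relevant_sentences` is a hypothetical helper and is not how `LLMChainExtractor` decides relevance.

```python
import re


def extract_relevant_sentences(document, query, min_shared_words=1):
    """Keep only sentences sharing at least `min_shared_words` words with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        sentence
        for sentence in sentences
        if len(query_words & set(re.findall(r"\w+", sentence.lower()))) >= min_shared_words
    ]
    return " ".join(kept)


toy_document = (
    "Google is launching an API for PaLM. "
    "The weather was sunny on the day of the announcement. "
    "The API is meant to challenge OpenAI."
)
toy_compressed = extract_relevant_sentences(
    toy_document, "How does Google plan to challenge OpenAI?"
)
print(toy_compressed)
```

The sentence about the weather shares no words with the query and is dropped, so less irrelevant text reaches the prompt.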


<hr>
<a class="anchor" id="ingestion">
    
## 2. Streamlined Data Ingestion
    
</a>

The LangChain library offers a variety of helpers designed to facilitate data loading and extraction from diverse sources: 
- TextLoader (handling plain text files);
- PyPDFLoader (dealing with PDF files);
- SeleniumURLLoader (loading HTML documents from URLs that require JavaScript rendering);
- GoogleDriveLoader (importing data from Google Drive docs or folders).

Regardless of whether the information originates from a PDF file or website content, these classes streamline the process of handling different data formats.

<hr>
<a class="anchor" id="TextLoader">
    
### 2.1. TextLoader
    
</a>

In [16]:
from langchain.document_loaders import TextLoader

file_path = 'data/my_file.txt'
loader = TextLoader(file_path) # optional argument: encoding="ISO-8859-1"
documents = loader.load()

documents

[Document(page_content='Google opens up its AI language model PaLM to challenge OpenAI and GPT-3\nGoogle is offering developers access to one of its most advanced AI language models: PaLM.\nThe search giant is launching an API for PaLM alongside a number of AI enterprise tools\nit says will help businesses “generate text, images, code, videos, audio, and more from\nsimple natural language prompts.”\n\nPaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or\nMeta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,\nPaLM is a flexible system that can potentially carry out all sorts of text generation and\nediting tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for\nexample, or you could use it for tasks like summarizing text or even writing code.\n(It’s similar to features Google also announced today for its Workspace apps like Google\nDocs and Gmail.)\n', metadata={'source': 'data/my_file.txt'})]

<hr>
<a class="anchor" id="PyPDFLoader">
    
### 2.2. PyPDFLoader
    
</a>

In [17]:
!pip install -q pypdf

In [18]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/article.pdf")
pages = loader.load_and_split()

print(pages[0])

page_content='Simulated Annealing in Early Layers Leads to Better Generalization\nAmir M. Sarfi1,2Zahra Karimpour1Muawiz Chaudhary1,2Nasir M. Khalid1,2\nMirco Ravanelli1,2Sudhir Mudur1Eugene Belilovsky1,2\n1Concordia University2Mila – Quebec AI Institute\nAbstract\nRecently, a number of iterative learning methods have\nbeen introduced to improve generalization. These typically\nrely on training for longer periods of time in exchange for\nimproved generalization. LLF (later-layer-forgetting) is a\nstate-of-the-art method in this category. It strengthens learn-\ning in early layers by periodically re-initializing the last\nfew layers of the network. Our principal innovation in this\nwork is to use Simulated annealing in EArly Layers (SEAL)\nof the network in place of re-initialization of later layers.\nEssentially, later layers go through the normal gradient de-\nscent process, while the early layers go through short stints\nof gradient ascent followed by gradient descent. Extensive\nexp

<hr>
<a class="anchor" id="SeleniumURLLoader">
    
### 2.3. SeleniumURLLoader
    
</a>

In [19]:
!pip install -q unstructured selenium

In [20]:
from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.boost.ai/blog/llms-large-language-models",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load() # .load() returns the list of document instances containing 'page_content' and 'metadata'

print(data[0])

page_content="With your permission, we use cookies to personalize content and ads, provide social media features and analyze our traffic. Learn more about our cookies policy.\n\nBy clicking ‘Accept’, you give your consent to the aforementioned and accept that we share this information with third parties. If you do not give us your consent, we will continue to use only essential cookies to enable core functionality of the website.\n\nAccept\n\nDecline\n\nProduct\n\nConversational AI\n\nChat Automation\n\nVoice Call Automation\n\nIntegrations\n\nLarge Language Models\n\nSolutions\n\nCustomer self-service\n\nInternal virtual agent\n\nFinancial Services\n\ninsurance\n\ntelecom\n\nPublic Sector\n\nResources\n\nCase studies\n\nWebinars\n\nguides\n\nblog\n\nAnnouncements\n\nAcademy\n\nPartners\n\nCompany\n\nAbout us\n\nCareers\n\nSuppliers\n\nSecurity\n\nAccessibility\n\nPrivacy Policy\n\nCookies Policy\n\nContact\n\nSecurity\n\nAbout\n\nCareer\n\nSuppliers\n\nProduct\n\nConversational AI\n\n

The `SeleniumURLLoader` class has the following attributes:
- `urls` (List[str]): List of URLs to load from;
- `continue_on_failure` (bool, default=True): If set to True, continues loading other URLs on failure;
- `browser` (str, default="chrome"): Browser selection, either "chrome" or "firefox";
- `executable_path` (Optional[str], default=None): Browser executable path;
- `headless` (bool, default=True): Browser runs in headless mode if True.

<hr>
<a class="anchor" id="GoogleDriveLoader">
    
### 2.4. GoogleDriveLoader
    
</a>

In [21]:
from langchain.document_loaders import GoogleDriveLoader

By default, the GoogleDriveLoader searches for the "credentials.json" file in "~/.credentials/credentials.json". Use the `credentials_path` keyword argument to modify this path.
The "token.json" file follows the same principle (default "~/.credentials/token.json", configurable via `token_path`) and will be created automatically upon the loader's first use.

Steps to set up the credentials file:

1. Create a new Google Cloud Platform project (or use an existing one) by visiting the Google Cloud Console. Ensure that billing is enabled for your project.
2. Enable the Google Drive API by navigating to its dashboard in the Google Cloud Console and clicking "Enable."
3. Create a service account by going to the Service Accounts page in the Google Cloud Console. Follow the prompts to set up a new service account.
4. Assign necessary roles to the service account, such as "Google Drive API - Drive File Access" and "Google Drive API - Drive Metadata Read/Write Access," depending on your needs.
5. After creating the service account, access the "Actions" menu next to it, select "Manage keys," click "Add Key," and choose "JSON" as the key type. This generates a JSON key file and downloads it to your computer; this file serves as your credentials file.

To retrieve the folder or document ID from the URL:
- Folder: https://drive.google.com/drive/u/0/folders/{folder_id}
- Document: https://docs.google.com/document/d/{document_id}/edit
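
The two URL shapes above can be parsed with a small helper. This is a sketch under the assumption that IDs consist of letters, digits, hyphens, and underscores; `extract_drive_id` is a hypothetical function and the IDs in the example are made-up placeholders.

```python
import re


def extract_drive_id(url):
    """Return ("folder" | "document", id) for the two URL shapes above, else None."""
    patterns = {
        "folder": r"drive\.google\.com/drive/(?:u/\d+/)?folders/([\w-]+)",
        "document": r"docs\.google\.com/document/d/([\w-]+)",
    }
    for kind, pattern in patterns.items():
        match = re.search(pattern, url)
        if match:
            return kind, match.group(1)
    return None


print(extract_drive_id("https://drive.google.com/drive/u/0/folders/abc123_XYZ"))
print(extract_drive_id("https://docs.google.com/document/d/doc-456/edit"))
```

The extracted folder ID is what gets passed as `folder_id` to the loader.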

In [22]:
loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False  # Optional: Fetch files from subfolders recursively. Defaults to False.
)

In [None]:
docs = loader.load()

<hr>
<a class="anchor" id="splitters">
    
## 3. Text Splitters
    
</a>


The length of the content may vary depending on its source and may exceed the input window size of the model. Splitting a large text into smaller segments allows us to use only the most relevant chunks as context instead of expecting the model to comprehend the entire textual input.
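
As a minimal illustration of chunking with overlap, the sketch below cuts text into fixed-size character windows. `split_with_overlap` is a hypothetical helper; LangChain's splitters additionally try to cut at separators such as paragraph or line breaks rather than at arbitrary character positions.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into character windows; neighbouring windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means that a sentence falling on a chunk boundary is still partly visible in both neighbouring chunks.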

<hr>
<a class="anchor" id="embeddings">
    
## 4. Embeddings
    
</a>


LLMs can transform textual data into an embedding space, allowing for versatile representations across languages.  

Embeddings are high-dimensional vectors that capture semantic information. They also serve to identify relevant information by quantifying the distance between data points: points that lie closer together are closer in semantic meaning.

The LangChain integration provides the necessary functions for both transforming text into embeddings and calculating similarities.
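
The idea of "closer means more similar" can be sketched without any model. In the example below, `toy_embed` is a hypothetical stand-in that counts words over a tiny fixed vocabulary; real embedding models produce dense vectors (1536 dimensions for `text-embedding-ada-002`, as seen in the Deep Lake output earlier), but the similarity computation is the same in spirit.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


toy_vocabulary = ["google", "palm", "api", "weather", "sunny"]
toy_texts = ["google palm api", "sunny weather"]
toy_query_vector = toy_embed("palm api", toy_vocabulary)

toy_scores = [
    cosine_similarity(toy_query_vector, toy_embed(text, toy_vocabulary))
    for text in toy_texts
]
best_text = toy_texts[toy_scores.index(max(toy_scores))]
print(best_text)
```

With the `embeddings` object created earlier, `embeddings.embed_query(...)` and `embeddings.embed_documents(...)` return the real vectors that a vector store compares in this way.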

<hr>
<a class="anchor" id="cs">
    
## 5. Customer Support Question Answering Chatbot
    
</a>

Let's demonstrate how to use a website's content as supplementary context for a chatbot to respond to user queries effectively. 

The implementation involves:
- employing data loaders, 
- storing the corresponding embeddings in the Deep Lake dataset, 
- and retrieving the most relevant documents corresponding to the user's question.

<hr>
<a class="anchor" id="gong_io">
    
## 6. Gong.io Open-Source Alternative AI Sales Assistant
    
</a>

Let's explore how LangChain, Deep Lake, and GPT-4 can be used to develop a sales assistant that can give advice to salespeople, taking internal guidelines into consideration.

<hr>
<a class="anchor" id="picture_books">
    
## 7. Creating Picture Books with OpenAI, Replicate, and Deep Lake
</a>

Let's have a look at a use case of AI technology in the creative domain: creating children's picture books, using the OpenAI GPT-3.5 LLM to write the story and Stable Diffusion to generate the images for it.

<hr>
<a class="anchor" id="resources">
    
## 8. Additional Resources
</a>

- [Improving Document Retrieval with Contextual Compression](https://blog.langchain.dev/improving-document-retrieval-with-contextual-compression/)