# Keeping Knowledge Organized

* [1. Utilizing Deep Lake](#deeplake)
    * [1.1. Using Text Loaders and Text Splitters](#loaders_splitters) 
    * [1.2. Exploring DeepLake - adding and retrieving data](#deeplake_explore) 
    * [1.3. Question-Answering Example](#q_a) 
    * [1.4. Using Document Compressors](#compressors) 
* [2. Streamlined Data Ingestion](#ingestion)
    * [2.1. TextLoader](#TextLoader)
    * [2.2. PyPDFLoader](#PyPDFLoader) 
    * [2.3. SeleniumURLLoader](#SeleniumURLLoader) 
    * [2.4. GoogleDriveLoader](#GoogleDriveLoader) 
* [3. Text Splitters](#splitters)
    * [3.1. Character Text Splitter](#Character_Splitter)
    * [3.2. Recursive Character Text Splitter](#Recursive_Splitter) 
    * [3.3. NLTK Text Splitter](#NLTK_Splitter) 
    * [3.4. Spacy Text Splitter](#Spacy_Splitter) 
    * [3.5. Markdown Text Splitter](#Markdown_Splitter) 
    * [3.6. Token Text Splitter](#Token_Splitter) 
* [4. Embeddings](#embeddings)
    * [4.1. Similarity search and vector embeddings](#sim_search)
    * [4.2. Embedding Models](#emb_models)
    * [4.3. Cohere embeddings](#cohere)
    * [4.4. DeepLake Vector Store Embeddings](#vector_store)
* [5. Customer Support Question Answering Chatbot](#cs)
* [6. Additional Resources (including sales assistant and book generator)](#resources)

In [1]:
import os
from keys import OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

<hr>
<a class="anchor" id="deeplake">
    
## 1. Utilizing Deep Lake
    
</a>

In LangChain, a crucial role in structuring documents and fetching relevant data for LLMs belongs to **indexes and retrievers**. An `index` is a data structure that organizes and stores documents to enable efficient searching, while a `retriever` uses the index to find and return relevant documents in response to user queries.

<hr>
<a class="anchor" id="loaders_splitters">
    
### 1.1. Using Text Loaders and Text Splitters
    
</a>

In [2]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

In [3]:
# Sample of text, taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text = """Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
it says will help businesses “generate text, images, code, videos, audio, and more from
simple natural language prompts.”

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or
Meta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,
PaLM is a flexible system that can potentially carry out all sorts of text generation and
editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for
example, or you could use it for tasks like summarizing text or even writing code.
(It’s similar to features Google also announced today for its Workspace apps like Google
Docs and Gmail.)
"""

# Write text to local file
with open("output/my_file.txt", "w") as file:
    file.write(text)

In [4]:
# Use TextLoader to load text from the local file
loader = TextLoader("output/my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))

1


In [5]:
docs_from_file

[Document(page_content='Google opens up its AI language model PaLM to challenge OpenAI and GPT-3\nGoogle is offering developers access to one of its most advanced AI language models: PaLM.\nThe search giant is launching an API for PaLM alongside a number of AI enterprise tools\nit says will help businesses “generate text, images, code, videos, audio, and more from\nsimple natural language prompts.”\n\nPaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or\nMeta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,\nPaLM is a flexible system that can potentially carry out all sorts of text generation and\nediting tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for\nexample, or you could use it for tasks like summarizing text or even writing code.\n(It’s similar to features Google also announced today for its Workspace apps like Google\nDocs and Gmail.)\n', metadata={'source': 'output/my_file.txt'})]

In [6]:
# Use CharacterTextSplitter to split the docs into texts
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20) # create a text splitter
docs = text_splitter.split_documents(docs_from_file) # split documents into chunks
print(len(docs))

Created a chunk of size 373, which is longer than the specified 200


2


<hr>
<a class="anchor" id="deeplake_explore">
    
### 1.2. Exploring DeepLake - adding and retrieving data
    
</a>

In [7]:
# Specify an embedder model
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

Let's explore Deep Lake and how it can be utilized to retrieve pertinent documents for contextual use. 

**Deep Lake** is a vector store that provides several advantages:

- It’s **multimodal**, which means that it can be used to store items of diverse modalities (texts, images, audio, and video, along with their vector representations).
- It’s **serverless**, which means that we can create and manage cloud datasets without the need to create and manage a database instance. 
- It’s possible to create a *streaming data loader* out of the data loaded into a Deep Lake dataset, which is convenient for fine-tuning machine learning models using common frameworks like PyTorch and TensorFlow.
- Data can be **queried and visualized** from the web.

In [8]:
# Load the Activeloop key 
from keys import ACTIVELOOP_TOKEN
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN

# Import DeepLake
from langchain.vectorstores import DeepLake

In [9]:
# Create DeepLake dataset
my_activeloop_org_id = "iryna"
my_activeloop_dataset_name = "langchain_course_knowledge_organized"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

Using embedding function is deprecated and will be removed in the future. Please use embedding instead.


Your Deep Lake dataset has been successfully created!


 

In [10]:
# Add documents to the DeepLake dataset
db.add_documents(docs)

|

Dataset(path='hub://iryna/langchain_course_knowledge_organized', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (2, 1536)  float32   None   
    id        text      (2, 1)      str     None   
 metadata     json      (2, 1)      str     None   
   text       text      (2, 1)      str     None   


 

['d0c2411c-38b9-11ee-a540-12ee7aa5dbdc',
 'd0c24234-38b9-11ee-a540-12ee7aa5dbdc']

In [11]:
# Create retriever from db
retriever = db.as_retriever()

<hr>
<a class="anchor" id="q_a">
    
### 1.3. Question-Answering Example
    
</a>

In [12]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="text-davinci-003"),
    chain_type="stuff",
    retriever=retriever
)

In [13]:
query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)

 Google is offering developers access to its PaLM AI language model, which is similar to the GPT series created by OpenAI. It can be used for tasks like summarizing text or writing code, which could potentially challenge OpenAI's models.


What happened under the hood in the question-answering example above is a similarity search. It was conducted using the embeddings to identify matching documents to be used as context for the LLM. Preselecting the most suitable documents based on semantic similarity enables us to provide the model with meaningful knowledge through the prompt while remaining within the allowed context size.

Also, "stuff chain" was used to supply information to the LLM. In this technique, we "stuff" all the information into the LLM's prompt. 

**Note:** Stuffing is effective only with shorter documents because of context length limit that most LLMs have.

<hr>
<a class="anchor" id="compressors">
    
### 1.4. Using Document Compressors
    
</a>

Including unrelated information in the LLM prompt is detrimental, because it can divert the LLM's focus from important details and occupies valuable prompt space.

To address this issue and improve the retrieval process, let's use a wrapper named `ContextualCompressionRetriever` that will wrap the base retriever with an `LLMChainExtractor`. The `LLMChainExtractor` iterates over the initially returned documents and extracts only the content relevant to the query. 

In [14]:
# An example of how to use ContextualCompressionRetriever with LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Create GPT3 wrapper
llm = OpenAI(model="text-davinci-003", temperature=0)

# Create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

In [15]:
# Retrieve compressed (relevant) documents 
retrieved_docs = compression_retriever.get_relevant_documents("How Google plans to challenge OpenAI?")
print(retrieved_docs[0].page_content)



Google is offering developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses “generate text, images, code, videos, audio, and more from simple natural language prompts.”


<hr>
<a class="anchor" id="ingestion">
    
## 2. Streamlined Data Ingestion
    
</a>

The LangChain library offers a variety of helpers designed to facilitate data loading and extraction from diverse sources: 
- TextLoader (handling plain text files);
- PyPDFLoader (dealing with PDF files);
- SeleniumURLLoaders (loading HTML documents from URLs that require JavaScript rendering);
- GoogleDriveLoader (importing data from Google Drive docs or folders).

Regardless of whether the information originates from a PDF file or website content, these classes streamline the process of handling different data formats.

<hr>
<a class="anchor" id="TextLoader">
    
### 2.1. TextLoader
    
</a>

In [16]:
from langchain.document_loaders import TextLoader

file_path = 'data/my_file.txt'
loader = TextLoader(file_path) # optional argument: encoding="ISO-8859-1"
documents = loader.load()

documents

[Document(page_content='Google opens up its AI language model PaLM to challenge OpenAI and GPT-3\nGoogle is offering developers access to one of its most advanced AI language models: PaLM.\nThe search giant is launching an API for PaLM alongside a number of AI enterprise tools\nit says will help businesses “generate text, images, code, videos, audio, and more from\nsimple natural language prompts.”\n\nPaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or\nMeta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,\nPaLM is a flexible system that can potentially carry out all sorts of text generation and\nediting tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for\nexample, or you could use it for tasks like summarizing text or even writing code.\n(It’s similar to features Google also announced today for its Workspace apps like Google\nDocs and Gmail.)\n', metadata={'source': 'data/my_file.txt'})]

<hr>
<a class="anchor" id="PyPDFLoader">
    
### 2.2. PyPDFLoader
    
</a>

In [17]:
!pip install -q pypdf

In [18]:
from langchain.document_loaders import PyPDFLoader

# PDF file has been taken here:
# https://arxiv.org/pdf/2304.04858.pdf
loader = PyPDFLoader("data/article.pdf")
pages = loader.load_and_split()

print(pages[0])

page_content='Simulated Annealing in Early Layers Leads to Better Generalization\nAmir M. Sarfi1,2Zahra Karimpour1Muawiz Chaudhary1,2Nasir M. Khalid1,2\nMirco Ravanelli1,2Sudhir Mudur1Eugene Belilovsky1,2\n1Concordia University2Mila – Quebec AI Institute\nAbstract\nRecently, a number of iterative learning methods have\nbeen introduced to improve generalization. These typically\nrely on training for longer periods of time in exchange for\nimproved generalization. LLF (later-layer-forgetting) is a\nstate-of-the-art method in this category. It strengthens learn-\ning in early layers by periodically re-initializing the last\nfew layers of the network. Our principal innovation in this\nwork is to use Simulated annealing in EArly Layers (SEAL)\nof the network in place of re-initialization of later layers.\nEssentially, later layers go through the normal gradient de-\nscent process, while the early layers go through short stints\nof gradient ascent followed by gradient descent. Extensive\nexp

<hr>
<a class="anchor" id="SeleniumURLLoader">
    
### 2.3. SeleniumURLLoader
    
</a>

In [19]:
!pip install -q unstructured selenium

In [20]:
from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.boost.ai/blog/llms-large-language-models",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load() # .load() returns the list of document instances containing 'page_content' and 'metadata'

# print(data[0])

The `SeleniumURLLoader` class has the following attributes:
- `urls` (List[str]): List of URLs to load from;
- `continue_on_failure` (bool, default=True): If set to True, continues loading other URLs on failure;
- `browser` (str, default="chrome"): Browser selection, either 'Chrome' or 'Firefox';
- `executable_path` (Optional[str], default=None): Browser executable path;
- `headless` (bool, default=True): Browser runs in headless mode if True.

<hr>
<a class="anchor" id="GoogleDriveLoader">
    
### 2.4. GoogleDriveLoader
    
</a>

In [21]:
from langchain.document_loaders import GoogleDriveLoader

By default, the GoogleDriveLoader searches for the "credentials.json" file in "~/.credentials/credentials.json". Use the `credentials_file` keyword argument to modify this path.
The "token.json" file follows the same principle and will be created automatically upon the loader's first use.

Steps to set up the `credentials_file`:

1. Create a new Google Cloud Platform project (or use an existing one) by visiting the Google Cloud Console. Ensure that billing is enabled for your project.
2. Enable the Google Drive API by navigating to its dashboard in the Google Cloud Console and clicking "Enable."
3. Create a service account by going to the Service Accounts page in the Google Cloud Console. Follow the prompts to set up a new service account.
4. Assign necessary roles to the service account, such as "Google Drive API - Drive File Access" and "Google Drive API - Drive Metadata Read/Write Access," depending on your needs.
5. After creating the service account, access the "Actions" menu next to it, select "Manage keys," click "Add Key," and choose "JSON" as the key type. This generates a JSON key file and downloads it to your computer, which serves as your credentials_file.

To retrieve the folder or document ID from the URL:
- Folder: https://drive.google.com/drive/u/0/folders/{folder_id}
- Document: https://docs.google.com/document/d/{document_id}/edit

In [22]:
loader = GoogleDriveLoader(
    folder_id="the_folder_id",
    recursive=False  # default
)

In [23]:
# docs = loader.load()

<hr>
<a class="anchor" id="splitters">
    
## 3. Text Splitters
    
</a>


The length of the contents may vary depending on their source and may exceed the input window size of the model. Splitting the large text into smaller segments allows to use the most relevant chunk as the context instead of expecting the model to comprehend entire documents. 

Using a Text Splitter can also improve vector store search results: smaller segments might be more likely to match a query. That is why experimenting with different chunk sizes and overlaps can be beneficial.

Two primary dimensions to consider:
- the way the text is split
- the way the chunk size (and overlap size) is measured

No universal approach for chunking text will fit all scenarios - what's effective for one case might not be suitable for another. 

Main steps when using text splitters:

1. Clean up data, get rid of anything unnecessary, like HTML tags.
2. Divide the text into small, semantically meaningful chunks (often sentences).
3. Combine these small chunks into a larger one until a specific size is reached (determined by a particular function).
4. Once the desired size is attained, separate that chunk as an individual piece of text, then start forming a new chunk with some overlap to maintain context between segments.
5. Run some queries to test the chosen chunk size.
5. Repeat steps 2-5 to test a few different chunk sizes
6. Compare results (the best size will depend on what kind of data and model are used). 

<hr>
<a class="anchor" id="Character_Splitter">
    
### 3.1. Character Text Splitter
    
</a>

In [24]:
from langchain.document_loaders import PyPDFLoader

# PDF file has been taken here: 
# https://assets.super.so/0d43acc6-3340-4118-9744-d03a54c41788/files/d3bbdb7c-5122-4066-800c-f44e043ecf43.pdf
loader = PyPDFLoader("data/linux_manual.pdf")
pages = loader.load_and_split()

In [25]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(pages)

#print(texts[0])

print (f"You have {len(texts)} documents")

#print ("Preview:", texts[0].page_content)

You have 2 documents


<hr>
<a class="anchor" id="Recursive_Splitter">
    
### 3.2. Recursive Character Text Splitter
    
</a>

The Recursive Splitter allows to split the text into chunks based on a list of characters provided. It attempts to split text using the characters from a list in that particular order until the resulting chunks are small enough. 

The default list of characters is **`["\n\n", "\n", " ", "]`**, meaning the splitter tries to keep paragraphs, sentences, and words together as long as possible, as they are generally the most semantically related pieces of text.

In [26]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("data/article.pdf")
pages = loader.load_and_split()

In [27]:
# Create an instance of the RecursiveCharacterTextSplitter class with the desired parameters,
# the default list of characters to split by is ["\n\n", "\n", " ", ""]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    length_function=len,
)

In [28]:
texts = text_splitter.create_documents([pages[0].page_content])

print(texts[0])
print(texts[1])
print(texts[2])

page_content='Simulated Annealing in Early Layers Leads to' metadata={}
page_content='Leads to Better Generalization' metadata={}
page_content='Amir M. Sarfi1,2Zahra Karimpour1Muawiz' metadata={}


<hr>
<a class="anchor" id="NLTK_Splitter">
    
### 3.3. NLTK Text Splitter
    
</a>

The NLTKTextSplitter in LangChain uses the Natural Language Toolkit (NLTK) library to split text based on tokenizers.

In [29]:
from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500)
texts = text_splitter.split_text(pages[0].page_content)


# Print the length of the first chunk and the chunk
print(len(texts[0]))
print(texts[0])

445
Simulated Annealing in Early Layers Leads to Better Generalization
Amir M. Sarfi1,2Zahra Karimpour1Muawiz Chaudhary1,2Nasir M. Khalid1,2
Mirco Ravanelli1,2Sudhir Mudur1Eugene Belilovsky1,2
1Concordia University2Mila – Quebec AI Institute
Abstract
Recently, a number of iterative learning methods have
been introduced to improve generalization.

These typically
rely on training for longer periods of time in exchange for
improved generalization.


<hr>
<a class="anchor" id="Spacy_Splitter">
    
### 3.4. Spacy Text Splitter
    
</a>

SpacyTextSplitter leverages the popular SpaCy library to split texts based on linguistic features. 

In [30]:
#import en_core_web_sm
#en_core_web_sm.load(disable=["tagger", "ner", "lemmatizer"])

In [31]:
from langchain.text_splitter import SpacyTextSplitter

# Instantiate the SpacyTextSplitter with the desired chunk size
text_splitter = SpacyTextSplitter(chunk_size=500, chunk_overlap=20)

# Split the text using SpacyTextSplitter
texts = text_splitter.split_text(pages[0].page_content)

# Print the length of the first chunk and the chunk
print(len(texts[0]))
print(texts[0])

445
Simulated Annealing in Early Layers Leads to Better Generalization
Amir M. Sarfi1,2Zahra Karimpour1Muawiz Chaudhary1,2Nasir M. Khalid1,2
Mirco Ravanelli1,2Sudhir Mudur1Eugene Belilovsky1,2
1Concordia University2Mila – Quebec AI Institute
Abstract
Recently, a number of iterative learning methods have
been introduced to improve generalization.

These typically
rely on training for longer periods of time in exchange for
improved generalization.




<hr>
<a class="anchor" id="Markdown_Splitter">
    
### 3.5. Markdown Text Splitter
    
</a>

The `MarkdownTextSplitter` is used to split text written using Markdown languages like headers, code blocks, or dividers. It is implemented as a simple subclass of `RecursiveCharacterSplitter` with Markdown-specific separators and helps divide text while preserving the structure and meaning provided by Markdown formatting.

In [32]:
markdown_text = """
# 

# Welcome to My Blog!

## Introduction
Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python, Java, and JavaScript.

Here's a list of my favorite programming languages:

1. Python
2. JavaScript
3. Java

You can check out some of my projects on [GitHub](https://github.com).

## About this Blog
In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on the latest technology trends, and occasional book reviews.

Here's a small piece of Python code to say hello:

\``` python
def say_hello(name):
    print(f"Hello, {name}!")

say_hello("John")
\```

Stay tuned for more updates!

## Contact Me
Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at johndoe@email.com.

"""

In [33]:
from langchain.text_splitter import MarkdownTextSplitter

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])

print(docs[0])
print(docs[1])
print(docs[2])

page_content='# \n\n# Welcome to My Blog!' metadata={}
page_content='## Introduction' metadata={}
page_content='Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python,' metadata={}


**Note**: The default separators that are determined by the Markdown syntax can be customized when MarkdownTextSplitter instance is initialized.

<hr>
<a class="anchor" id="Token_Splitter">
    
### 3.6. Token Text Splitter
    
</a>

TokenTextSplitter respects the token boundaries, ensuring that the chunks do not split tokens in the middle. This type of splitter breaks down raw text strings into smaller pieces by initially converting the text into BPE (Byte Pair Encoding) tokens, and subsequently dividing these tokens into chunks.

In [34]:
from langchain.text_splitter import TokenTextSplitter

# Initialize the TokenTextSplitter with desired chunk size and overlap
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=10)

# Split into smaller chunks
texts = text_splitter.split_text(pages[0].page_content)
print(texts[0])

Simulated Annealing in Early Layers Leads to Better Generalization
Amir M. Sarfi1,2Zahra Karimpour1Muawiz Chaudhary1,2Nasir M. Khalid1,2
Mirco Ravanelli1,2Sudhir Mudur1Eugene Belilovsky1,2
1Concordia University2Mila – Quebec AI Institute
Abstract
Recently, a number of iterative learning


<hr>
<a class="anchor" id="embeddings">
    
## 4. Embeddings
    
</a>


LLMs can transform textual data into embedding space, allowing for versatile representations across languages.  

Embeddings are high-dimensional vectors that capture semantic information. Embeddings also serve to identify relevant information by quantifying the distance between data points (by indicating closer semantic meaning for points being closer together).

The LangChain integration provides necessary functions for both transforming and calculating similarities.

<hr>
<a class="anchor" id="sim_search">
    
### 4.1. Similarity search and vector embeddings 
    
</a>

GPT-3 is a powerful language model offered by OpenAI. It can be used for various tasks, such as generating embeddings and performing similarity searches.

In [56]:
import openai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.embeddings import OpenAIEmbeddings

In [57]:
# Define the documents
documents = [
    "The cat is on the mat.",
    "There is a cat on the mat.",
    "The dog is in the yard.",
    "There is a dog in the yard.",
]

In [58]:
# Initialize the OpenAIEmbeddings instance
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Generate embeddings for the documents
document_embeddings = embeddings.embed_documents(documents)

In [59]:
# Perform a similarity search for a given query
query = "A cat is sitting on a mat."
query_embedding = embeddings.embed_query(query)

# Calculate similarity scores
similarity_scores = cosine_similarity([query_embedding], document_embeddings)[0]

# Find the most similar document
most_similar_index = np.argmax(similarity_scores)
most_similar_document = documents[most_similar_index]

print(f"Most similar document to the query '{query}':")
print(most_similar_document)

Most similar document to the query 'A cat is sitting on a mat.':
The cat is on the mat.


<hr>
<a class="anchor" id="emb_models">
    
### 4.2. Embedding Models
    
</a>

Embedding models are ML models that convert discrete data into continuous vectors. In the context of NLP, these discrete data points can be words, sentences, or even entire documents. The generated vectors (embeddings) are supposed to capture the semantic meaning of the original data.

In [60]:
# !pip install sentence_transformers

In [61]:
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings

In [62]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
hf = HuggingFaceEmbeddings(model_name=model_name, 
                           model_kwargs=model_kwargs)

documents = ["It is really hot in Portugal during August", 
             "Summertime is too hot in Lisbon", 
             "Something else here"]

doc_embeddings = hf.embed_documents(documents)

The obtained embeddings (doc_embeddings) are ready for any downstream tasks: classification, clustering, or similarity analysis. They represent our original documents in a form that machines can understand and process.

In [63]:
print(cosine_similarity([doc_embeddings[0]], [doc_embeddings[1]]))
print(cosine_similarity([doc_embeddings[0]], [doc_embeddings[2]]))
print(cosine_similarity([doc_embeddings[1]], [doc_embeddings[2]]))

[[0.70272776]]
[[0.0454873]]
[[0.04992708]]


<hr>
<a class="anchor" id="cohere">
    
### 4.3. Cohere embeddings
    
</a>

In [2]:
from keys import COHERE_API_KEY ## API key obtained from https://dashboard.cohere.ai/api-keys

# !pip install cohere

In [4]:
import cohere
from langchain.embeddings import CohereEmbeddings

In [5]:
# Initialize the CohereEmbeddings object
cohere = CohereEmbeddings(
    model="embed-multilingual-v2.0",
    cohere_api_key=COHERE_API_KEY
)

# Define a list of texts
texts = [
    "Hello from Cohere!", 
    "مرحبًا من كوهير!", 
    "Hallo von Cohere!",  
    "Bonjour de Cohere!", 
    "¡Hola desde Cohere!", 
    "Olá do Cohere!",  
    "Ciao da Cohere!", 
    "您好，来自 Cohere！", 
    "कोहेरे से नमस्ते!"
]

In [6]:
# Generate embeddings for the texts
document_embeddings = cohere.embed_documents(texts)

# Print the embeddings
for text, embedding in zip(texts, document_embeddings):
    print(f"Text: {text}")
    print(f"Embedding: {embedding[:7]}")  # print first 7 dimensions of each embedding

Text: Hello from Cohere!
Embedding: [0.23449707, 0.50097656, -0.04876709, 0.14001465, -0.1796875, 0.39575195, 0.2836914]
Text: مرحبًا من كوهير!
Embedding: [0.25341797, 0.30004883, 0.01083374, 0.12573242, -0.1821289, 0.39160156, 0.31201172]
Text: Hallo von Cohere!
Embedding: [0.10205078, 0.28320312, -0.0496521, 0.2364502, -0.0715332, 0.34643555, 0.34179688]
Text: Bonjour de Cohere!
Embedding: [0.15161133, 0.28222656, -0.057281494, 0.11743164, -0.044189453, 0.29467773, 0.29125977]
Text: ¡Hola desde Cohere!
Embedding: [0.25146484, 0.43139648, -0.08642578, 0.24682617, -0.117004395, 0.29956055, 0.33691406]
Text: Olá do Cohere!
Embedding: [0.18676758, 0.390625, -0.04550171, 0.14562988, -0.11230469, 0.25732422, 0.34716797]
Text: Ciao da Cohere!
Embedding: [0.11590576, 0.4333496, -0.025772095, 0.14538574, 0.0703125, 0.11468506, 0.31713867]
Text: 您好，来自 Cohere！
Embedding: [0.24645996, 0.3083496, -0.111816406, 0.26586914, -0.05102539, 0.17785645, 0.36328125]
Text: कोहेरे से नमस्ते!
Embedding: [0.

Given a list of multilingual texts, the embed_documents() method in LangChain's CohereEmbeddings class, connected to Cohere’s embedding endpoint and generated unique semantic embeddings for each text.

<hr>
<a class="anchor" id="vector_store">
    
### 4.4. DeepLake Vector Store Embeddings
    
</a>

DeepLake is a Vector Store for creating, storing, and querying vector representations (also known as embeddings) of data.

The pipeline below demonstrates how to leverage the power of the LangChain, OpenAI, and Deep Lake libraries and products to create a conversational AI model capable of retrieving and answering questions based on the content of a given repository.

In [7]:
from keys import OPENAI_API_KEY, ACTIVELOOP_TOKEN

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["ACTIVELOOP_TOKEN"] = ACTIVELOOP_TOKEN

In [8]:
#!pip install deeplake

In [9]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

In [10]:
# Create documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963"
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

In [14]:
# Initialize embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Create DeepLake dataset
my_activeloop_org_id = "iryna"
my_activeloop_dataset_name = "course_embeddings"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(dataset_path=dataset_path, embedding=embeddings)

# Add documents to our Deep Lake dataset
db.add_documents(docs)

Using embedding function is deprecated and will be removed in the future. Please use embedding instead.


Your Deep Lake dataset has been successfully created!


-

Dataset(path='hub://iryna/course_embeddings', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (4, 1536)  float32   None   
    id        text      (4, 1)      str     None   
 metadata     json      (4, 1)      str     None   
   text       text      (4, 1)      str     None   


 

['9b4f40c6-3a34-11ee-aa34-12ee7aa5dbdc',
 '9b4f4486-3a34-11ee-aa34-12ee7aa5dbdc',
 '9b4f4558-3a34-11ee-aa34-12ee7aa5dbdc',
 '9b4f45d0-3a34-11ee-aa34-12ee7aa5dbdc']

In [15]:
# Create retriever from db
retriever = db.as_retriever() # transforming the DeepLake dataset into a LangChain retriever object

In [16]:
# Create a RetrievalQA chain and run it
model = ChatOpenAI(model='gpt-3.5-turbo')  ## istantiate the LLM wrapper 
qa_chain = RetrievalQA.from_llm(model, retriever=retriever) ## create the question-answering chain
qa_chain.run("When was Michael Jordan born?") ## ask a question to the chain

'Michael Jordan was born on 17 February 1963.'

**Note**: The underlying OpenAI model in this chain is responsible for both the question embedding and the answer generation.

<hr>
<a class="anchor" id="cs">
    
## 5. Customer Support Question Answering Chatbot
    
</a>

When users interact with the chatbot, their queries are matched to the most similar intent, generating the associated response. As LLMs continue to evolve, chatbot development is shifting toward more sophisticated and dynamic solutions capable of handling a broader range of user inquiries with greater precision.

Let's demonstrate how to use a **website's content as supplementary context for a chatbot** to respond to user queries effectively. 

The code implementation below involves:
- employing data loaders to scrape some content from online articles;
- employing data splitters to split content into small chunks;
- computting and storing the corresponding embeddings in the Deep Lake dataset; 
- retrieving the most relevant documents corresponding to the user's question.

**Note**: There is always a risk of generating hallucinations or false information when using LLMs. It might not be acceptable for many customers support use cases, however the chatbot can still be helpful for assisting operators in drafting answers that they can double-check before sending them to the user. 

In [18]:
#!pip install unstructured selenium
#!pip install langchain==0.0.208 deeplake openai tiktoken

In [19]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI
from langchain.document_loaders import SeleniumURLLoader
from langchain import PromptTemplate

In [24]:
# Our chatbot will use information from the following articles
urls = ['https://beebom.com/what-is-nft-explained/',
        'https://beebom.com/how-delete-spotify-account/',
        'https://beebom.com/how-download-gif-twitter/',
        'https://beebom.com/how-delete-spotify-account/',
        'https://beebom.com/how-save-instagram-story-with-music/',
        'https://beebom.com/how-install-pip-windows/',
        'https://beebom.com/how-check-disk-usage-linux/']

In [25]:
# Scrape and split the documents into chunks
loader = SeleniumURLLoader(urls=urls)
docs_not_splitted = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(docs_not_splitted)

In [27]:
# Compute the embeddings using OpenAIEmbeddings 
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Create DeepLake dataset (store embeddings in a DeepLake vector store on the cloud)
my_activeloop_org_id = "iryna"
my_activeloop_dataset_name = "langchain_course_customer_support"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

Using embedding function is deprecated and will be removed in the future. Please use embedding instead.


Your Deep Lake dataset has been successfully created!


/

Dataset(path='hub://iryna/langchain_course_customer_support', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (102, 1536)  float32   None   
    id        text      (102, 1)      str     None   
 metadata     json      (102, 1)      str     None   
   text       text      (102, 1)      str     None   


 

['c3726fb6-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37271fa-3a47-11ee-aa34-12ee7aa5dbdc',
 'c372724a-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727290-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37272cc-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727308-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727344-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727380-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37273b2-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37273ee-3a47-11ee-aa34-12ee7aa5dbdc',
 'c372742a-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727466-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37274a2-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37274d4-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727510-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727556-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727592-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37275c4-3a47-11ee-aa34-12ee7aa5dbdc',
 'c372760a-3a47-11ee-aa34-12ee7aa5dbdc',
 'c372763c-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727678-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37276b4-3a47-11ee-aa34-12ee7aa5dbdc',
 'c37276e6-3a47-11ee-aa34-12ee7aa5dbdc',
 'c3727722-3a47-11ee-aa34-12ee7aa5dbdc',
 'c372775e-3a47-

In [30]:
# Retrieve the most relevant documents to a specific query
query = "how to check disk usage in linux?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Home  Tech  How to Check Disk Usage in Linux (4 Methods)

How to Check Disk Usage in Linux (4 Methods)

Beebom Staff

Last Updated: June 19, 2023 5:14 pm

There may be times when you need to download some important files or transfer some photos to your Linux system, but face a problem of insufficient disk space. You head over to your file manager to delete the large files which you no longer require, but you have no clue which of them are occupying most of your disk space. In this article, we will show some easy methods to check disk usage in Linux from both the terminal and the GUI application.

Monitor Disk Usage in Linux (2023)

Table of Contents

Check Disk Space Using the df Command
		
Display Disk Usage in Human Readable FormatDisplay Disk Occupancy of a Particular Type

Check Disk Usage using the du Command
		
Display Disk Usage in Human Readable FormatDisplay Disk Usage for a Particular DirectoryCompare Disk Usage of Two Directories


In [31]:
# Create a prompt for a customer support chatbot
template = """You are an exceptional customer support chatbot that gently answer questions.

You know the following context information.

{chunks_formatted}

Answer to the following question from a customer. Use only information from the previous context information. Do not invent stuff.

Question: {query}

Answer:"""

prompt = PromptTemplate(
    input_variables=["chunks_formatted", "query"],
    template=template,
)

In [32]:
# FULL PIPELINE

query = "How to check disk usage in linux?"  # user question

# Retrieve relevant chunks
docs = db.similarity_search(query)
retrieved_chunks = [doc.page_content for doc in docs]

# Format the prompt
chunks_formatted = "\n\n".join(retrieved_chunks)
prompt_formatted = prompt.format(chunks_formatted=chunks_formatted, query=query)

# Generate answer
llm = OpenAI(model="text-davinci-003", temperature=0)
answer = llm(prompt_formatted)
print(answer)

 You can check disk usage in Linux using the df command or by using a GUI tool such as the GDU Disk Usage Analyzer or the Gnome Disks Tool. The df command is used to check the current disk usage and the available disk space in Linux. The syntax for the df command is: df <options> <file_system>. The options to use with the df command are: a, h, t, and x. To install the GDU Disk Usage Analyzer, use the command: sudo snap install gdu-disk-usage-analyzer. To install the Gnome Disks Tool, use the command: sudo apt-get -y install gnome-disk-utility.


**Note**: it can be challenging to ensure that the model generates answers solely based on the context, as it has a tendency to generate new, potentially false information. Generating false information might have various severity levels depending on the use case.

<hr>
<a class="anchor" id="resources">
    
## 6. Additional Resources
</a>

- [Improving Document Retrieval with Contextual Compression](https://blog.langchain.dev/improving-document-retrieval-with-contextual-compression/)
- [Split by character](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter)
- [Split code](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter)
- [Recursively split by character](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
- [Chatbot + Knowledge Base](https://learnprompting.org/docs/applied_prompting/build_chatbot_from_kb)
- [Conversation Intelligence: Gong.io Open-Source Alternative AI Sales Assistant](https://www.activeloop.ai/resources/conversation-intelligence-gong-io-open-source-alternative-ai-sales-assistant/)
- [AI Story Generator: OpenAI Function Calling, LangChain, & Stable Diffusion](https://www.activeloop.ai/resources/ai-story-generator-open-ai-function-calling-lang-chain-stable-diffusion/)