In [None]:
%%capture
!pip install langchain openai tiktoken faiss-cpu

In [15]:
import os
import getpass

In [16]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")

Enter Your OpenAI API Key:··········


In [None]:
!wget -O "golden_hymns_of_epictetus.txt" https://www.gutenberg.org/cache/epub/871/pg871.txt

--2023-09-29 10:46:35--  https://www.gutenberg.org/cache/epub/871/pg871.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 152365 (149K) [text/plain]
Saving to: ‘golden_hymns_of_epictetus.txt’


2023-09-29 10:46:36 (1021 KB/s) - ‘golden_hymns_of_epictetus.txt’ saved [152365/152365]



In [1]:
filename = "/content/golden_hymns_of_epictetus.txt"

# Keep only the body of the book: start at the first saying and stop at the
# Project Gutenberg end-of-book marker.
start_marker = "Are these the only works of Providence within us?"
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK THE GOLDEN SAYINGS OF EPICTETUS, WITH THE HYMN OF CLEANTHES ***"

start_saving = False
lines_to_save = []

with open(filename, 'r', encoding='utf-8') as file:
    for line in file:
        if start_marker in line:
            start_saving = True
        if end_marker in line:
            break
        if start_saving:
            lines_to_save.append(line)

# Overwrite the file with only the kept lines
with open(filename, 'w', encoding='utf-8') as file:
    file.writelines(lines_to_save)

In [2]:
word_count = 0

with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        word_count += len(words)

print(f"The total number of words in the file is: {word_count}")

The total number of words in the file is: 23503


# Retrieval Overview

Retrieval in LangChain refers to the process of fetching relevant data or documents from external sources.

It is a crucial step in many language model applications, especially in Retrieval Augmented Generation (RAG) tasks.

Retrieval is useful because it allows you to incorporate external data into your language model, providing additional context and information that may not be present in the model's training data.

By retrieving relevant documents, you can enhance the generation process and improve the quality and relevance of the generated responses.

# Document Loaders

Document loaders in LangChain are used to load data from various sources as Document objects.

A Document is a piece of text with associated metadata.

Document loaders provide a convenient way to fetch data from different sources such as text files, web pages, or even transcripts of videos.
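
To make the idea concrete, here is a minimal sketch of what a Document holds. `SimpleDocument` is a hypothetical stand-in written for this illustration, not LangChain's actual class; the field names `page_content` and `metadata` mirror the ones that appear in the outputs later in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str                             # the text itself
    metadata: dict = field(default_factory=dict)  # e.g. where the text came from


doc = SimpleDocument(
    page_content="Great is God, for that He hath given us hands.",
    metadata={"source": "golden_hymns_of_epictetus.txt"},
)
print(doc.page_content)
print(doc.metadata["source"])
```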

## Text loader

This is the simplest loader. It reads in a file as text and places it all into one Document.

In [3]:
from langchain.document_loaders import TextLoader
loader = TextLoader("/content/golden_hymns_of_epictetus.txt")
golden_sayings = loader.load()

In [None]:
type(golden_sayings)

list

In [None]:
type(golden_sayings[0])

langchain.schema.document.Document

# There are SO MANY document loaders in LangChain

I won't go over every single one in this notebook. But you can check out [the documentation](https://github.com/langchain-ai/langchain/tree/master/libs/langchain/langchain/document_loaders) to see just how many are available to you.

They all follow the same pattern. Using `DirectoryLoader` as an example:

1) Import the `DirectoryLoader` class from the `langchain.document_loaders` module.

2) Create an instance of the `DirectoryLoader` class, providing the path to the directory as the argument.

3) Use the `load()` method of the `DirectoryLoader` instance to load all the files in the directory and convert them into LangChain's Document format.
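
The import / instantiate / `load()` pattern can be sketched without LangChain at all. `ToyDirectoryLoader` below is a hypothetical class written for this illustration, and it returns plain dicts rather than real Document objects; it only shows the shape of the pattern, not how LangChain implements it.

```python
import tempfile
from pathlib import Path


class ToyDirectoryLoader:
    """Hypothetical loader that mimics the instantiate-then-load() pattern."""

    def __init__(self, path, glob="*.txt"):
        self.path = Path(path)
        self.glob = glob

    def load(self):
        # Return one dict per file, shaped like a Document: text plus metadata.
        docs = []
        for file_path in sorted(self.path.glob(self.glob)):
            docs.append({
                "page_content": file_path.read_text(encoding="utf-8"),
                "metadata": {"source": str(file_path)},
            })
        return docs


# Usage: build a throwaway directory with two files, then load it.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    loaded = ToyDirectoryLoader(tmp).load()

print(len(loaded))                # 2
print(loaded[0]["page_content"])  # first file
```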

# Document transformers

Document transformers in LangChain are used to manipulate and transform documents to better suit your application's needs.

You need document transformers when you want to perform operations on your documents, such as splitting a long document into smaller chunks that can fit into your model's context window.

# Text splitters

Text splitters in LangChain are used to split long pieces of text into smaller, semantically meaningful chunks.

They are particularly useful when you want to keep related pieces of text together or when you need to process text in smaller segments.

### At a high level, text splitters work as follows:

1) Split the text up into small, semantically meaningful chunks (often sentences).

2) Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).

3) Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
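
The three steps above can be sketched in plain Python. This is a toy, not LangChain's implementation: `merge_sentences` and its `overlap_sentences` parameter are hypothetical names, and the sentence splitting is a crude regex.

```python
import re


def merge_sentences(text, chunk_size, overlap_sentences=1, length_function=len):
    """Toy splitter: split into sentences, then pack them into chunks."""
    # Step 1: split into small units (here: sentences ending in . ! or ?).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current = [], []
    for sentence in sentences:
        # Step 2: keep adding sentences while the combined chunk still fits.
        if current and length_function(" ".join(current + [sentence])) > chunk_size:
            # Step 3: emit the chunk, then start the next one with some overlap.
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop overlap that would push the new chunk over the limit.
            while current and length_function(" ".join(current + [sentence])) > chunk_size:
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One. Two two. Three three three. Four."
for chunk in merge_sentences(sample, chunk_size=30):
    print(repr(chunk))
```

With `chunk_size=30`, each chunk after the first starts with the last sentence of the previous chunk, which is the overlap described in step 3.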

### The default recommended text splitter is the `RecursiveCharacterTextSplitter`.

`RecursiveCharacterTextSplitter` is like a smart text splitting tool.

Instead of just blindly splitting text by a single separator, it tries multiple separators in a specific order, and if a piece of text is too big, it'll try to split it again using a different separator. For code or text in various programming languages, it has predefined lists of separators that make sense for each language, ensuring the text is split in a logical way.


It first splits on the first separator in the list (by default the list is `["\n\n", "\n", " ", ""]`); if any chunks are still too large, it moves on to the next separator, and so forth.

### The Splitting Process:

 - The method tries to find the best separator from the list to split the given text.

 - Once a separator is found, it splits the text.

 - If any of the resulting chunks are too big, it'll try to split that chunk again using the next separator in the list.

 - This process is recursive, which means it keeps trying to split chunks until they're small enough or there are no more separators to try.
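
Here is a minimal sketch of that recursive process, assuming the separator list `"\n\n"`, `"\n"`, `" "`, `""`. The function `recursive_split` is a hypothetical toy: unlike the real splitter, it does not merge small pieces back together or add overlap, so it only shows the "try the next separator" recursion.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Toy recursive splitter: try separators in order until pieces fit."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        separators = ("",)

    # Find the first separator in the list that actually occurs in the text.
    index = 0
    for index, separator in enumerate(separators):
        if separator == "" or separator in text:
            break
    separator = separators[index]
    remaining = separators[index + 1:]

    # The empty separator means: fall back to fixed-size character slices.
    if separator == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks = []
    for piece in text.split(separator):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Still too big: recurse using the next separators in the list.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    return chunks


text = "Alpha beta gamma.\n\nDelta epsilon zeta\neta theta.\n\nIota."
print(recursive_split(text, chunk_size=20))
```

The first and last paragraphs already fit, so they come back whole; only the middle paragraph is split again, this time on the single newline.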

The splitter takes a few key parameters:

• `length_function`: how the length of chunks is calculated. Defaults to just counting number of characters, but it's pretty common to pass a token counter here.

• `chunk_size`: the maximum size of your chunks (as measured by the length function).

• `chunk_overlap`: the maximum overlap between chunks. It can be nice to have some overlap to maintain some continuity between chunks (e.g. do a sliding window).

• `add_start_index`: whether to include the starting position of each chunk within the original document in the metadata.
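
To see how `chunk_size`, `chunk_overlap`, and `add_start_index` interact, here is a deliberately simple fixed-size sliding window. `sliding_window` is a hypothetical helper for illustration only; it ignores separators entirely and is not how `RecursiveCharacterTextSplitter` works internally.

```python
def sliding_window(text, chunk_size, chunk_overlap, add_start_index=True):
    """Toy fixed-size chunker showing chunk_size, chunk_overlap and start_index."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = {"page_content": text[start:start + chunk_size], "metadata": {}}
        if add_start_index:
            # Record where this chunk begins in the original text.
            chunk["metadata"]["start_index"] = start
        chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


for chunk in sliding_window("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk shares its first character with the last character of the previous one, and `start_index` tells you where the chunk sits in the source text.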

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap = 50,
    length_function = len,
    add_start_index = True
)

In [10]:
texts = text_splitter.split_documents(golden_sayings)

In [11]:
print(texts[0])
print(texts[1])

page_content='Are these the only works of Providence within us? What words suffice to\npraise or set them forth? Had we but understanding, should we ever\ncease hymning and blessing the Divine Power, both openly and in secret,\nand telling of His gracious gifts? Whether digging or ploughing or\neating, should we not sing the hymn to God:—\n\n_Great is God_, for that He hath given us such instruments to till the\nground withal:\n\n\n_Great is God_, for that He hath given us hands and the power of\nswallowing and digesting; of unconsciously growing and breathing while\nwe sleep!\n\n\nThus should we ever have sung; yea and this, the grandest and divinest\nhymn of all:—\n\n_Great is God_, for that He hath given us a mind to apprehend these\nthings, and duly to use them!' metadata={'source': '/content/golden_hymns_of_epictetus.txt', 'start_index': 0}
page_content='What then! seeing that most of you are blinded, should there not be\nsome one to fill this place, and sing the hymn to God on be

# Text Embeddings

Text embedding models for retrieval in LangChain are used to represent text documents in a high-dimensional vector space, where the similarity between vectors corresponds to the semantic similarity between the corresponding documents.

These models capture the semantic meaning of text and allow for efficient retrieval of similar documents based on their embeddings.
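
A common way to compare two embeddings is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (real embeddings have far more dimensions, and these numbers do not come from any model).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings": the first two point in a similar direction.
anger = [0.9, 0.1, 0.0]
rage = [0.8, 0.2, 0.1]
ploughing = [0.0, 0.2, 0.9]

print(cosine_similarity(anger, rage) > cosine_similarity(anger, ploughing))  # True
```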

In [17]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

To construct a vector store retriever, you first need to load the documents using a document loader.

Then, you can split the documents into smaller chunks using a text splitter. Next, you can generate vector embeddings for the text chunks using an embedding model like OpenAIEmbeddings. Finally, you can create a vector store using the generated embeddings.

Once the vector store is constructed, you can use it as a retriever to query the texts.

In [18]:
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_documents(documents=texts, embedding=embeddings)

# Retrieve

The vector store retriever supports various search methods, including similarity search and maximum marginal relevance search.

You can also set a similarity score threshold or specify the number of top documents to retrieve.

Below you will use the `similarity_search` method of the vector store.

In simpler terms, think of this function as a search tool. You give it a piece of text, tell it how many results you want, and it returns a list of documents that are most similar to your given text.

If you have specific requirements, like only wanting documents from a certain author, you can use the filter option to specify that.
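
Conceptually, a similarity search with a result count and a metadata filter looks something like the sketch below. Everything here is a toy: `toy_similarity_search`, `metadata_filter`, the 2-dimensional vectors, and the document titles are all made up for illustration, and a real vector store like FAISS uses an index rather than sorting every vector.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def toy_similarity_search(query_vector, store, k=4, metadata_filter=None):
    """Return the k documents whose vectors are most similar to the query."""
    # Keep only documents whose metadata matches every key in the filter.
    candidates = [
        (vector, doc) for vector, doc in store
        if metadata_filter is None
        or all(doc["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    # Rank the remaining documents by similarity, most similar first.
    ranked = sorted(candidates, key=lambda pair: cosine(query_vector, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]


# Made-up store of (vector, document) pairs.
store = [
    ([0.9, 0.1], {"page_content": "On anger", "metadata": {"author": "Epictetus"}}),
    ([0.8, 0.3], {"page_content": "On habit", "metadata": {"author": "Epictetus"}}),
    ([0.95, 0.05], {"page_content": "Hymn to Zeus", "metadata": {"author": "Cleanthes"}}),
    ([0.1, 0.9], {"page_content": "On ploughing", "metadata": {"author": "Epictetus"}}),
]
query_vector = [1.0, 0.0]

top_two = toy_similarity_search(query_vector, store, k=2)
only_epictetus = toy_similarity_search(
    query_vector, store, k=2, metadata_filter={"author": "Epictetus"}
)
print([d["page_content"] for d in top_two])
print([d["page_content"] for d in only_epictetus])
```

Without the filter, the closest vector wins regardless of author; with the filter, documents by other authors are dropped before ranking.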

In [19]:
query = "How can I practice mindfulness if I am always so busy and distracted?"

vectorstore.similarity_search(query)

[Document(page_content='One who has had fever, even when it has left him, is not in the same\ncondition of health as before, unless indeed his cure is complete.\nSomething of the same sort is true also of diseases of the mind.\nBehind, there remains a legacy of traces and blisters: and unless these\nare effectually erased, subsequent blows on the same spot will produce\nno longer mere blisters, but sores. If you do not wish to be prone to\nanger, do not feed the habit; give it nothing which may tend its\nincrease. At first, keep quiet and count the days when you were not\nangry: “I used to be angry every day, then every other day: next every\ntwo, next every three days!” and if you succeed in passing thirty days,\nsacrifice to the Gods in thanksgiving.\n\nLXXVI\n\nHow then may this be attained?—Resolve, now if never before, to approve\nthyself to thyself; resolve to show thyself fair in God’s sight; long\nto be pure with thine own pure self and God!\n\nLXXVII', metadata={'source': '/co

# Generate

In [22]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI

template = """

Use the following pieces of context to answer the question at the end.

If you don't know the answer, just say 'Ah snap homie, I ain't gonna front. I don't know.' Don't try to make up an answer.

Use three sentences maximum, relevant analogies, and keep the answer as concise as possible.

Use the active voice, and speak directly to the reader using concise language.
{context}

Question: {question}

Helpful Answer:

"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)


query = "What do grief, fear, envy, and desire stem from?"

result = qa_chain({"query": query})

result["result"]

"Grief, fear, envy, and desire stem from the mind's attachment to worldly desires and emotions, such as Fear, Desire, Envy, Malignity, Avarice, Effeminacy, and Intemperance. These negative emotions can only be cast out by looking to God alone, fixing one's affections on Him, and consecrating oneself to His commands."