# RAG (Retrieval Augmented Generation)
- RAG is a technique that retrieves data from private or real-time sources and passes it to the LLM as context, which expands what the LLM can answer.
![image](./images/Architecture.png)

## Retrieval

![image](./images/data_connection-95ff2033a8faa5f3ba41376c0f6dd32a.jpg)

1. Load the data from the source file with a document loader (UnstructuredFileLoader).
2. Transform it by splitting the data, because it is better for the LLM to search multiple smaller documents than a single big document.
3. Embed the data. An embedding is a vector representation of the meaning behind a text or document (OpenAIEmbeddings).
4. Use CacheBackedEmbeddings to cache the embeddings, because computing them is not free.
5. Store the vectors (the numbers) in a vector store.
6. Perform a search using the vector store.

In [5]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()



showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

- Document Loaders
    - A loader is a piece of code that extracts data from a source and brings it into LangChain.
    - https://python.langchain.com/docs/integrations/document_loaders/unstructured_file

In [10]:
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
# from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import UnstructuredFileLoader

# loader = TextLoader("./files/chapter_one.txt")
loader = UnstructuredFileLoader("./files/chapter_one.pdf")

# loader.load()
len(loader.load())

1

Now we will split the document. loader.load() returns a list, and here the whole chapter is a single document. Splitting it into smaller chunks makes it more efficient to store, embed, and pass to the language model.

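RecursiveCharacterTextSplitter tries to split at paragraph breaks first, then at line breaks, then at spaces, so paragraphs and sentences stay together where possible and keep their semantic meaning. The cell below uses CharacterTextSplitter instead, which splits on a single separator.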
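To make the idea of recursive splitting concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split`, the separator order, and the sample text are assumptions chosen for illustration, and unlike the real splitter it does not re-merge small neighbouring pieces or handle overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first (paragraphs) and only falls back
    to finer ones (lines, then words) for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry with a finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


# Invented sample text, used only to exercise the function.
sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer and has two lines.\nSecond line here."
)
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives as one chunk, while the long second paragraph is broken at the line break and then at a space, which is the behaviour the description above is getting at.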

## Tiktoken
- A token is not just a letter; it can be a word, part of a word, or a short chunk of text.
- To see the difference between tokens and characters, try the OpenAI tokenizer: https://platform.openai.com/tokenizer

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=100,
    separator="\n",
)

loader = UnstructuredFileLoader("./files/chapter_one.docx")

loader.load_and_split(text_splitter=splitter)

len(loader.load_and_split(text_splitter=splitter))


25
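The effect of chunk_size and chunk_overlap can be pictured with a simple sliding window over a list of tokens. This is only a sketch of the idea: the helper `sliding_chunks` is hypothetical, the integers stand in for tokens, and LangChain's splitter merges separator-delimited pieces rather than cutting at fixed offsets.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Cut tokens into windows of chunk_size that share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


# Integers stand in for tokens so the overlap is easy to see.
toy_tokens = list(range(10))
windows = sliding_chunks(toy_tokens, chunk_size=4, chunk_overlap=2)
for window in windows:
    print(window)
```

Each window repeats the last two items of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. A larger overlap gives more context per chunk at the cost of more chunks to embed.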

## Embedding

#### Vectors
The cell below is just an example of how each string in the list is converted to a vector (a list of numbers).

In [12]:
from langchain.embeddings import OpenAIEmbeddings

embedder = OpenAIEmbeddings()

# The embedding vector (all dimensions) for the text "Hi"
hi = embedder.embed_query("Hi")
len(hi)  # 1536 dimensions in total, just for the text "Hi"

vector = embedder.embed_documents(
    [
    "hi",
    "how",
    "are",
    "you longer sentences",
    ]
)
print(len(vector), len(vector[0]))

4 1536


 A vector store is a kind of database for vectors. The workflow is:  
 1) Create the vectors  
 2) Cache those vectors  
 3) Put those vectors into the vector store  
 4) Perform searches to find relevant docs

Without a cache, re-running the cell would embed every chunk again and cost more money, so it is better to cache the embeddings.
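The caching idea can be sketched without any external service. The class `CachedEmbedder` and the function `fake_embed` below are hypothetical stand-ins, not LangChain's CacheBackedEmbeddings or the OpenAI API; the sketch only shows that a repeated text does not trigger a second call to the (normally paid) embedding function.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function and remember results by text hash."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}   # in-memory here; a real cache would persist to disk
        self.misses = 0    # number of calls made to the underlying function

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]


def fake_embed(text):
    # Hypothetical stand-in for a paid embedding API call.
    return [float(len(text)), float(text.count(" "))]


cached = CachedEmbedder(fake_embed)
first = cached.embed("hello world")
second = cached.embed("hello world")   # served from the cache
cached.embed("another text")           # new text, so one more real call
print(cached.misses)
```

Three requests result in only two calls to the underlying function. Keying on a hash of the text keeps the cache keys short and uniform regardless of the text length.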

In [23]:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI()

cache_dir = LocalFileStore("./.cache/")

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator="\n",
    chunk_size=300,
    chunk_overlap=100,
)

# Load 
loader = UnstructuredFileLoader("./files/chapter_one.docx")

# Transform 
docs = loader.load_and_split(text_splitter=splitter)

# Embed 
embeddings = OpenAIEmbeddings()

# When we embed the file, first, we check if those embeddings already exist in our cache.
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    embeddings, cache_dir
)

# Build the FAISS vector store; texts missing from the cache are embedded via OpenAI and then cached
vectorstore = FAISS.from_documents(docs, cached_embeddings)

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=vectorstore.as_retriever(),
)

Now we can search the vector space.

In [22]:
results = vectorstore.similarity_search("where does Mr.Jones live")
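Conceptually, the similarity search above embeds the query and returns the stored chunks whose vectors are closest to it. Below is a minimal sketch of that ranking step using cosine similarity. The three-dimensional vectors, the chunk labels, and the helper names are invented for illustration, and the metric a real vector store uses depends on the store and its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy store: (text, vector) pairs with made-up 3-dimensional embeddings.
toy_store = [
    ("text about location", [0.9, 0.1, 0.0]),
    ("text about age", [0.1, 0.9, 0.0]),
    ("text about weather", [0.2, 0.3, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # pretend this is the embedded question

hits = top_k(toy_query, toy_store, k=2)
print(hits)
```

Real embeddings have many more dimensions (1536 in the cell above), but the ranking works the same way: score every stored vector against the query vector and keep the best matches.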

In [26]:
chain.run("How old is Major?")

"The new context does not provide any information about Major's age. Therefore, the original answer of Major's age being twelve years old remains the same."