<a href="https://colab.research.google.com/github/nathalyAlarconT/GenAI_Workshops/blob/main/Intro_to_RAG_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup Notebook

In [1]:
# Show the public IP address of the Colab VM
!curl ipecho.net/plain
# To look up its location, open https://ipinfo.io/ followed by the IP printed below

34.145.60.126

## Libraries and Utilities installation

In [None]:
# LangChain core packages and the prompt hub client
!pip install --upgrade -q langchain langchain-community langchainhub
# Gemini SDK and its LangChain integration
!pip install -q google-generativeai langchain-google-genai
# Vector store, PDF parsing, and the local embedding model
!pip install -q chromadb pypdf pypdf2 python-dotenv sentence-transformers

### General Libraries

In [3]:
from google.colab import userdata
import os
from IPython.display import Markdown

### API Keys Configuration

In [4]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = userdata.get('SMITH_APIKEY')
GOOGLE_API_KEY = userdata.get('GoogleAIStudio')
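
`google.colab.userdata` can only be imported inside Colab. As a minimal sketch, assuming the same secret names are exported as environment variables when the notebook runs elsewhere, a fallback could look like this (the `get_secret` helper is hypothetical and not part of Colab or LangChain):

```python
import os
from typing import Optional


def get_secret(name: str) -> Optional[str]:
    """Read a secret from Colab's store, or from the environment outside Colab."""
    try:
        # This import only succeeds inside a Google Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the secret is exported as an environment variable.
        return os.environ.get(name)
    return userdata.get(name)
```

With this helper, `GOOGLE_API_KEY = get_secret('GoogleAIStudio')` would behave the same in Colab and return `None` elsewhere when the variable is not set.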

### Initial Folder Setup

We will create two folders:

1. MyData: This will store the additional files that we will use to expand the model's knowledge base.

2. VectorDB: This folder will store the Vector database.

In [5]:
# -p avoids an error when the cell is re-run and the folders already exist
!mkdir -p /content/MyData
!mkdir -p /content/VectorDB

Upload the PDFs you want the generated responses to draw on into the MyData folder.

The file used in this demo is:
https://www.lostiempos.com/sites/default/files/edicion_online/las_delicias_de_mi_llajta.pdf




# 1. INDEXING

## Required Libraries

In [6]:
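# Third-party integrations live in langchain_community; the old langchain.* paths are deprecated
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings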

In [7]:
# @title Data Source Folder
source_data_folder = "/content/MyData" # @param {type:"string"}


## Data Preparation

In [9]:
# Read PDFs from the configured folder
loader = PyPDFDirectoryLoader(source_data_folder)
data_on_pdf = loader.load()
# Number of documents loaded (the PDF loader yields one document per page)
len(data_on_pdf)

20

In [10]:
# Split the documents into chunks of at most 1000 characters, with 200 characters
# of overlap between consecutive chunks to preserve context across boundaries
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(data_on_pdf)
# Number of Chunks generated
len(splits)

38
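
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at the listed separators, so its chunks vary in length); the `chunk_with_overlap` function is a hypothetical illustration of the overlap idea only.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.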

In [11]:
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
embeddings_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")


## Vector Database

In [12]:
# @title Database folder path
path_db = "/content/VectorDB" # @param {type:"string"}


In [13]:
# Embed the chunks and store them in a Chroma database persisted at path_db
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings_model, persist_directory=path_db)
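
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query's embedding. The sketch below illustrates that idea with cosine similarity over a plain list. The three-dimensional vectors and the `top_k` helper are made up for illustration (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and the index structure and distance metric Chroma actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored items most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, embedding) pairs standing in for embedded chunks
toy_store = [
    ("silpancho recipe", [0.9, 0.1, 0.0]),
    ("pique macho recipe", [0.1, 0.9, 0.0]),
    ("chef biography", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print(results)  # ['silpancho recipe', 'pique macho recipe']
```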

# 2. RETRIEVAL

## Required Libraries

In [14]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

## LLM configuration

We will use Gemini (`gemini-1.5-pro`) as the LLM, through the `langchain-google-genai` integration.

In [15]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", google_api_key=GOOGLE_API_KEY)

## Prompt and Retriever



In [16]:
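# Expose the vector store as a retriever that returns the chunks most similar to a query
retriever = vectorstore.as_retriever()

# Pull a ready-made RAG prompt template from the LangChain Hub
# https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")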

In [17]:
def format_docs(docs):
    """Format the documents for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)
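
# Build the RAG chain: retrieve context, fill the prompt, call the LLM, parse the reply to text
rag_chain = (
    # The question is sent to the retriever for context and also passed through unchanged
    {"context": retriever | format_docs, "question": RunnablePassthrough()}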
    | prompt
    | llm
    | StrOutputParser()
)
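
The chain above can be read as three plain steps: retrieve and format the context, fill the prompt template, and send the result to the model. The following sketch mirrors that data flow with stand-ins so it runs without any API key. `Doc`, `fake_retriever`, `fake_llm`, and `PROMPT_TEMPLATE` are all hypothetical placeholders; in particular, the template text is not the actual `rlm/rag-prompt`.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain document; only page_content is needed here."""
    page_content: str


def format_docs(docs):
    """Join the retrieved chunks into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Hypothetical template, for illustration only
PROMPT_TEMPLATE = (
    "Answer the question using only the context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)


def fake_retriever(question):
    # A real retriever would run a similarity search; this one returns fixed chunks.
    return [
        Doc("Silpancho is served with rice and potatoes."),
        Doc("Pique Macho is a spicy dish."),
    ]


def fake_llm(prompt_text):
    # A real LLM would generate an answer; this one only reports the prompt size.
    return f"(model answer based on {len(prompt_text)} prompt characters)"


def build_prompt(question):
    """Steps 1 and 2: retrieve context, then fill the template."""
    context = format_docs(fake_retriever(question))
    return PROMPT_TEMPLATE.format(context=context, question=question)


def run_chain(question):
    """Step 3: send the filled prompt to the model and return its text."""
    return fake_llm(build_prompt(question))


print(build_prompt("How is Silpancho served?"))
```

Printing the filled prompt is a quick way to check what context the model actually receives before attributing an answer to the document.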



# RAG Execution

In [21]:
# @title Questions to the document
question = "What recipes exist on the document?" # @param {type:"string"}
response = rag_chain.invoke(question)
Markdown(response)


The document contains recipes for Silpancho, Pique Macho, and Chicharrón de Surubí. Each recipe includes a list of ingredients, preparation instructions, and information about the chef who provided the recipe. The document seems to be about Bolivian cuisine. 


In [22]:
# @title Questions to the document
question = "How to prepare a Silpancho? " # @param {type:"string"}
response = rag_chain.invoke(question)
Markdown(response)


To prepare Silpancho, season a steak with salt, pepper, and lemon zest, then flatten it. After that, coat the steak with breadcrumbs and pan-fry it in oil. Finally, serve the steak with rice, potatoes, and a salad made of tomato, onion, and locoto. 


In [25]:
# @title Questions to the document
question = "Give me the ingredients of the Chicharrón de Surubí on table format" # @param {type:"string"}
response = rag_chain.invoke(question)
Markdown(response)


| Ingredient | Quantity |
|---|---|
| Surubí (fish) | 1 kilo |
| Lemon juice | 6 tablespoons |
| Salt | To taste |
| Pepper | To taste |
| Flour (for dredging) | 1 cup |
| Oil (for frying) | 2 cups | 


In [None]:
# Cleanup: uncomment to delete the Chroma collection and its stored embeddings
# vectorstore.delete_collection()

**Sources:**

In [None]:
# https://dev.to/timesurgelabs/how-to-use-googles-gemini-pro-with-langchain-1eje

# https://python.langchain.com/v0.2/docs/tutorials/rag/

# https://smith.langchain.com/o/6467f92b-9dac-5816-964f-8abcfa4e4456/projects/p/d35fb5ce-7bac-4627-858c-621aa689239f?timeModel=%7B%22duration%22%3A%227d%22%7D