<a href="https://colab.research.google.com/github/nathalyAlarconT/GenAI_Workshops/blob/main/Intro_to_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook Setup

In [None]:
# IP address assigned to Google Colab
!curl ipecho.net/plain
# https://ipinfo.io/IP

## Installing the Required Libraries

In [None]:
!pip install --upgrade -q langchain
!pip install google-generativeai langchain-google-genai
!pip install chromadb pypdf python-dotenv
!pip install -U langchain-community
!pip install sentence-transformers
!pip install langchainhub

### General Libraries

In [None]:
from google.colab import userdata
import os
from IPython.display import Markdown

### Configuring Credentials

In [None]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = userdata.get('SMITH_APIKEY')
GOOGLE_API_KEY = userdata.get('GoogleAIStudio')
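
The cell above reads secrets through `google.colab.userdata`, which is only importable inside Colab. As a minimal sketch, assuming you also want to run the notebook elsewhere, a hypothetical helper named `get_secret` (not part of the original notebook) could fall back to environment variables:

```python
import os


def get_secret(name, default=None):
    """Return a secret from Colab when available, else from the environment.

    `get_secret` is a hypothetical helper written for illustration only.
    """
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Outside Colab: read an environment variable with the same name.
        return os.environ.get(name, default)
    return userdata.get(name)


# Same secret name as in the cell above; returns None if it is not set locally.
GOOGLE_API_KEY = get_secret("GoogleAIStudio")
```

Outside Colab you would export `GoogleAIStudio` in your shell before starting the notebook.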

### Creating the Required Folders

We will use two folders:
- `MisDatos`, which stores the PDFs containing your own information
- `VectorDB`, where the vector database will be created

In [None]:
# -p avoids an error if the folder already exists when the cell is re-run
!mkdir -p /content/MisDatos
!mkdir -p /content/VectorDB
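
The same two folders can also be created from Python instead of shell commands. This is a sketch only: `create_workspace` is a hypothetical helper, and the demo runs in a temporary directory rather than `/content`.

```python
import tempfile
from pathlib import Path


def create_workspace(base_dir):
    """Create the two working folders under base_dir and return their paths."""
    base = Path(base_dir)
    data_dir = base / "MisDatos"  # PDFs with your own information
    db_dir = base / "VectorDB"    # persisted vector database
    for folder in (data_dir, db_dir):
        # exist_ok=True makes the call safe to repeat when the cell is re-run
        folder.mkdir(parents=True, exist_ok=True)
    return data_dir, db_dir


# Demo in a temporary directory; in Colab you would pass "/content".
demo_base = tempfile.mkdtemp()
data_dir, db_dir = create_workspace(demo_base)
print(data_dir.is_dir(), db_dir.is_dir())
```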

Add the PDF files to the `MisDatos` folder.

The file used in the demo is:
https://www.lostiempos.com/sites/default/files/edicion_online/las_delicias_de_mi_llajta.pdf
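
Instead of uploading the file by hand, the PDF could be fetched from code. The sketch below uses only the standard library; `pdf_filename_from_url` and `download_pdf` are hypothetical helpers, and it is an assumption that the server allows scripted downloads. The download call itself is left commented out because it needs network access.

```python
import os
import urllib.request
from urllib.parse import urlparse


def pdf_filename_from_url(url):
    """Return the last path component of the URL, used as the local file name."""
    return os.path.basename(urlparse(url).path)


def download_pdf(url, target_folder):
    """Download url into target_folder and return the local path (needs network)."""
    local_path = os.path.join(target_folder, pdf_filename_from_url(url))
    urllib.request.urlretrieve(url, local_path)
    return local_path


DEMO_PDF_URL = (
    "https://www.lostiempos.com/sites/default/files/"
    "edicion_online/las_delicias_de_mi_llajta.pdf"
)
print(pdf_filename_from_url(DEMO_PDF_URL))
# In Colab, with network access:
# download_pdf(DEMO_PDF_URL, "/content/MisDatos")
```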




# 1. INDEXING

### Required Libraries

In [None]:
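# The integrations below live in langchain-community; importing them from
# the top-level `langchain` package is deprecated.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings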

In [None]:
# @title Configure the data source
source_data_folder = "/content/MisDatos" # @param {type:"string"}


## Preparing and Formatting the Content

In [None]:
# Read the PDFs from the configured directory
loader = PyPDFDirectoryLoader(source_data_folder)
data_on_pdf = loader.load()
# Number of documents loaded (the PDF loader returns one document per page)
len(data_on_pdf)

20

In [None]:
# Split the data into chunks of at most 1000 characters, with 200 characters of overlap to preserve context
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(data_on_pdf)
# Number of chunks obtained
len(splits)

38
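
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on the listed separators); `chunk_with_overlap` is a hypothetical function used only to illustrate the overlap idea.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters, where each window
    shares chunk_overlap characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each printed chunk starts with the last three characters of the previous one, which is the role the 200-character overlap plays in the cell above.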

In [None]:
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
embeddings_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

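
The embedding model maps each chunk to a vector (384 dimensions for `all-MiniLM-L6-v2`), and chunks with similar meaning end up with vectors pointing in similar directions. A common way to compare two such vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors rather than real embeddings, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (real ones have 384 components).
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

The first score is close to 1 and the second is 0, regardless of vector length.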

## Vector Database

In [None]:
# @title Configure the path to the database
path_db = "/content/VectorDB" # @param {type:"string"}


In [None]:
# Store the chunks in the vector database
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings_model, persist_directory=path_db)

# 2. RETRIEVAL

## Required Libraries

In [None]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

## LLM Selection / Configuration

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", google_api_key=GOOGLE_API_KEY)

## Prompt and Retriever Configuration



In [None]:
retriever = vectorstore.as_retriever()

# https://smith.langchain.com/hub/rlm/rag-prompt?organizationId=6467f92b-9dac-5816-964f-8abcfa4e4456
prompt = hub.pull("rlm/rag-prompt")
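
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are most similar to it. The toy version below illustrates that ranking step in plain Python. It is not how Chroma is implemented; `top_k`, the store contents, and the 2-dimensional vectors are all made up for illustration.

```python
import heapq


def top_k(query_vector, store, k=2):
    """Return the k texts whose vectors score highest against the query.

    store is a list of (text, vector) pairs. Vectors are assumed to be
    unit-length, so the dot product equals the cosine similarity.
    """
    def score(item):
        _, vector = item
        return sum(q * v for q, v in zip(query_vector, vector))

    best = heapq.nlargest(k, store, key=score)
    return [text for text, _ in best]


# Made-up store: each entry pairs a chunk label with a toy unit vector.
toy_store = [
    ("pique macho recipe", (1.0, 0.0)),
    ("silpancho recipe", (0.8, 0.6)),
    ("newspaper masthead", (0.0, 1.0)),
]
print(top_k((1.0, 0.0), toy_store, k=2))
```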

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)




rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
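
The `|` pipeline above can be read as a sequence of ordinary function calls: retrieve chunks, join them into a context string, fill the prompt, call the model, and return its text. The sketch below mirrors that flow with stand-ins; `fake_retriever`, `fill_prompt`, and `fake_llm` are hypothetical placeholders, not the LangChain objects used in the notebook.

```python
def fake_retriever(question):
    """Stand-in for the retriever: returns fixed chunks for any question."""
    return ["Chunk about ingredients.", "Chunk about preparation."]


def format_chunks(chunks):
    """Same idea as format_docs: join chunk texts with blank lines."""
    return "\n\n".join(chunks)


def fill_prompt(inputs):
    """Stand-in for the prompt template: one string from context and question."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    """Stand-in for the model: reports the prompt size instead of answering."""
    return f"(answer based on a prompt of {len(prompt_text)} characters)"


def run_rag(question):
    # Step 1: the dict sends the question down two branches.
    inputs = {
        "context": format_chunks(fake_retriever(question)),
        "question": question,  # RunnablePassthrough forwards the input unchanged
    }
    # Steps 2 and 3: fill the prompt, then call the model.
    return fake_llm(fill_prompt(inputs))


print(run_rag("What are the ingredients?"))
```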



# Running the RAG Chain

In [None]:
# @title Questions for the Document
pregunta = "Cuáles son los ingredientes del pique macho ?" # @param {type:"string"}
response = rag_chain.invoke(pregunta)
Markdown(response)


Los ingredientes del pique macho son carne suave (lomo o pulpa), chorizos, papas imilla, tomates, locoto verde, cebolla colorada, cerveza, salsa de soya y pimienta negra. También lleva sal, pero esa es la única cantidad que no especifican. Este plato típico es fácil de preparar y rinde cuatro porciones. 


In [None]:
# @title Questions for the Document
pregunta = "Qué recetas usan carne?" # @param {type:"string"}
response = rag_chain.invoke(pregunta)
Markdown(response)


La receta de Silpancho y la receta de Pique Macho usan carne. El Silpancho usa carne de res, mientras que el Pique Macho usa carne de res y chorizo. Ambas recetas son platos tradicionales bolivianos. 


In [None]:
# @title Questions for the Document
pregunta = "Dame los ingredientes del CHICHARRÓN DE SURUBÍ en formato de tabla" # @param {type:"string"}
response = rag_chain.invoke(pregunta)
Markdown(response)


| Ingrediente | Cantidad |
|---|---|
| Surubí | 1 kilo |
| Zumo de limón | 6 cucharadas |
| Sal | Al gusto |
| Pimienta | Al gusto |
| Harina | 1 taza |
| Aceite | 2 tazas | 


In [None]:
# cleanup
# vectorstore.delete_collection()

**Sources:**

In [None]:
# https://dev.to/timesurgelabs/how-to-use-googles-gemini-pro-with-langchain-1eje

# https://python.langchain.com/v0.2/docs/tutorials/rag/

# https://smith.langchain.com/o/6467f92b-9dac-5816-964f-8abcfa4e4456/projects/p/d35fb5ce-7bac-4627-858c-621aa689239f?timeModel=%7B%22duration%22%3A%227d%22%7D