# RAG with Hugging Face, FAISS, and LangChain

In [None]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain_huggingface import HuggingFaceEmbeddings

from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

from transformers import AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain.chains import RetrievalQA
import pandas as pd

from langchain.schema import Document
import torch



## 1. Data Preparation
The code works with a dataset hosted on Hugging Face (`databricks-dolly-15k`) that contains instruction and response examples. Each row of the dataset is turned into a `Document` object so the LangChain tools can handle it: the `context` column becomes the `page_content`, and the remaining columns are kept as metadata.

In [2]:
df = pd.read_json("hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl", lines=True)
data = [
    Document(
        metadata={
            "instruction": row["instruction"],
            "response": row["response"],
            "category": row["category"]
        },
        page_content=row["context"]
    )
    for _, row in df.iterrows()
]
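Many rows in this dataset have an empty `context`, so their `Document` would carry an empty `page_content` and add nothing useful to the index. The following is a minimal sketch of filtering such rows before building documents. It uses plain dictionaries with placeholder context text instead of the real DataFrame, and `has_context` is a hypothetical helper, not part of the notebook.

```python
# Toy rows that mimic the dataset's columns; the context strings are placeholders.
rows = [
    {"instruction": "When did Virgin Australia start operating?",
     "context": "placeholder reference passage", "category": "closed_qa"},
    {"instruction": "Which is a species of fish? Tope or Rope",
     "context": "", "category": "classification"},
    {"instruction": "Why can camels survive for long without water?",
     "context": "   ", "category": "open_qa"},
]


def has_context(row):
    """Return True when the row carries non-blank context text."""
    return bool(row["context"].strip())


# Keep only the rows that would yield a non-empty page_content.
rows_with_context = [row for row in rows if has_context(row)]
print(len(rows), "rows ->", len(rows_with_context), "rows with context")
```

With the real DataFrame, the same condition could be applied inside the list comprehension that builds `data`.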

In [3]:
df[:10]

| | instruction | context | response | category |
|---|---|---|---|---|
| 0 | When did Virgin Australia start operating? | Virgin Australia, the trading name of Virgin A... | Virgin Australia commenced services on 31 Augu... | closed_qa |
| 1 | Which is a species of fish? Tope or Rope | | Tope | classification |
| 2 | Why can camels survive for long without water? | | Camels use the fat in their humps to keep them... | open_qa |
| 3 | Alice's parents have three daughters: Amy, Jes... | | The name of the third daughter is Alice | open_qa |
| 4 | When was Tomoaki Komorida born? | Komorida was born in Kumamoto Prefecture on Ju... | Tomoaki Komorida was born on July 10,1981. | closed_qa |
| 5 | If I have more pieces at the time of stalemate... | Stalemate is a situation in chess where the pl... | No. \nStalemate is a drawn position. It doesn'... | information_extraction |
| 6 | Given a reference text about Lollapalooza, whe... | Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an ann... | Lollapalooze is an annual musical festival hel... | closed_qa |
| 7 | Who gave the UN the land in NY to build their HQ | | John D Rockerfeller | open_qa |
| 8 | Why mobile is bad for human | | We are always engaged one phone which is not g... | brainstorming |
| 9 | Who was John Moses Browning? | John Moses Browning (January 23, 1855 – Novemb... | John Moses Browning is one of the most well-kn... | information_extraction |


## 2. Text Splitting
A text splitter (`RecursiveCharacterTextSplitter`) breaks the documents into manageable chunks. Each chunk is capped at 1000 characters so it is not too large for the model, and consecutive chunks overlap by 150 characters to preserve context across chunk boundaries.

In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
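To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-width chunking with overlap. It is not how `RecursiveCharacterTextSplitter` is implemented (that class also prefers to cut at separators such as paragraph breaks, newlines, and spaces); `split_with_overlap` is a hypothetical helper that only illustrates the sliding-window idea.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role the 150-character overlap plays in the cell above.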

## 3. Generating Embeddings
Embeddings are vector representations of text that make it possible to compare passages by similarity. Here, a pretrained Hugging Face model converts the text chunks into embeddings:

In [5]:
modelPath = "sentence-transformers/all-mpnet-base-v2"
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'},
    encode_kwargs={'normalize_embeddings': False},
)
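The sketch below illustrates what "comparing similarities" means for vectors and what the `normalize_embeddings` option changes. It uses two made-up 2-dimensional vectors rather than real model output, and all helper names are hypothetical. Once vectors are scaled to unit length, the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`, so ranking by either measure gives the same order.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [3.0, 4.0]  # made-up "embedding"
b = [4.0, 3.0]  # made-up "embedding"
cos = cosine_similarity(a, b)
a_n, b_n = normalize(a), normalize(b)

print("cosine similarity:", cos)
print("dot product of unit vectors:", dot(a_n, b_n))
print("squared L2 of unit vectors:", squared_l2(a_n, b_n), "vs 2 - 2*cos:", 2 - 2 * cos)
```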

## 4. Indexing with FAISS
The generated embeddings are stored in a FAISS index. FAISS is a library designed for fast similarity search over large sets of vectors.

In [6]:
docs_sample = docs[:100]  # Work with only the first 100 chunks
db = FAISS.from_documents(docs_sample, embeddings)
db.embeddings

HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={'device': 'cpu'}, encode_kwargs={'normalize_embeddings': False}, multi_process=False, show_progress=False)
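Conceptually, an exact index answers a query by comparing the query vector with every stored vector and returning the closest ones. The sketch below shows that brute-force search with squared Euclidean distance in plain Python. It assumes an exact (flat) L2 index, which I believe is what the LangChain wrapper builds by default; the `search` helper and the toy vectors are illustrative only. FAISS performs the same kind of computation with optimized native code and also offers approximate index types.

```python
def search(index_vectors, query, k):
    """Return the k (position, squared L2 distance) pairs closest to query."""
    scored = [
        (position, sum((x - q) ** 2 for x, q in zip(vector, query)))
        for position, vector in enumerate(index_vectors)
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Made-up 2-dimensional vectors standing in for chunk embeddings.
index_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
hits = search(index_vectors, query=[0.9, 0.2], k=2)
for position, distance in hits:
    print(position, round(distance, 2))
```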

## 5. Retrieving Relevant Documents
When a question is asked, the system uses FAISS to search the index for related text chunks.

In [7]:
retriever = db.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke("What is Machine learning?")  # Try the retriever with a sample question



+ Turns the FAISS index into a retriever.

+ Searches the index for the 4 most relevant documents for the question "What is Machine learning?".

    **Why do this?** The goal is to limit how much information the model has to process. Instead of receiving every document, it is given only the 4 most relevant ones.

    This improves:

    + Efficiency: the language model works with less text.

    + Accuracy: the model focuses on the most relevant context when generating answers.

+ Returns a list of documents containing related information, which can then be used as context to answer the question.

This step is crucial in Retrieval-Augmented Generation (RAG) systems, where the quality of the retrieved context directly affects the accuracy of the generated answers.
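The function defined later in this notebook passes only the top-ranked chunk to the model. When more than one retrieved chunk should be used, one option is to join them in ranked order under a size budget. The sketch below assumes a character budget; `build_context` and `max_chars` are hypothetical names, and the chunk strings are placeholders.

```python
def build_context(chunks, max_chars):
    """Join chunks in ranked order until adding another would exceed max_chars."""
    selected = []
    used = 0
    for chunk in chunks:
        # Count the two-character "\n\n" separator for every chunk after the first.
        extra = len(chunk) + (2 if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return "\n\n".join(selected)


ranked_chunks = ["first chunk", "second chunk", "third chunk"]  # best match first
context = build_context(ranked_chunks, max_chars=30)
print(context)
```

Stopping at the first chunk that does not fit keeps the best-ranked material and never reorders it.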

## 6. Generating Answers
Once the context has been retrieved, a language model (here, `Intel/dynamic_tinybert`) answers the question. The Hugging Face pipeline is configured for the `question-answering` task, so it extracts the answer from the retrieved context:

In [8]:
question = "What is cheesemaking?"
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)

# Specify the model name you want to use
model_name = "Intel/dynamic_tinybert"


# Load the tokenizer associated with the specified model
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)

# Define a question-answering pipeline using the model and tokenizer
question_answerer = pipeline(
    "question-answering",
    model=model_name,
    tokenizer=tokenizer,
    return_tensors="pt"
)

# Create an instance of HuggingFacePipeline, which wraps the question-answering pipeline
# with additional model-specific arguments (temperature and max_length).
# Note: HuggingFacePipeline is designed for text-generation-style tasks; this wrapper is
# only used to build the chain below, and the answers in this notebook come from calling
# question_answerer directly.
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)

Wine is an alcoholic drink typically made from fermented grapes. Yeast consumes the sugar in the grapes and converts it to ethanol and carbon dioxide, releasing heat in the process. Different varieties of grapes and strains of yeasts are major factors in different styles of wine. These differences result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the grape's growing environment (terroir), and the wine production process. Many countries enact legal appellations intended to define styles and qualities of wine. These typically restrict the geographical origin and permitted varieties of grapes, as well as other aspects of wine production. Wines can be made by fermentation of other fruit crops such as plum, cherry, pomegranate, blueberry, currant and elderberry.
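An extractive question-answering model does not write new text. It scores every token of the context as a possible start and as a possible end of the answer, and the answer is the best-scoring span. The toy sketch below shows that selection step with made-up scores. Real pipelines also turn scores into probabilities and handle subword tokens; `best_span` is a hypothetical helper, not the Transformers implementation.

```python
def best_span(start_scores, end_scores, max_answer_len):
    """Return (start, end, score) maximizing start_scores[s] + end_scores[e] with s <= e."""
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the allowed answer length.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


tokens = ["Komorida", "was", "born", "in", "Kumamoto", "Prefecture"]
start_scores = [0.1, 0.0, 0.2, 0.3, 2.5, 0.4]  # made-up scores
end_scores = [0.0, 0.1, 0.1, 0.2, 0.9, 2.8]    # made-up scores

s, e, _ = best_span(start_scores, end_scores, max_answer_len=4)
answer = " ".join(tokens[s:e + 1])
print(answer)
```

Because the answer is always a slice of the context, the pipeline can only be as good as the chunk the retriever hands it.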




In [9]:
# Build a retriever from the FAISS index ('db') that returns up to 4 relevant chunks per query.
retriever = db.as_retriever(search_kwargs={"k": 4})

docs = retriever.invoke("What is Machine learning?")
print(f'Top retrieved chunk: {docs[0].page_content}')

# Kind of text expected: "... improves performance on some set of tasks. It is seen as a part of artificial intelligence.
# Machine learning algorithms build a model based on ..."

print(f'retriever {retriever}')

# Create a question-answering chain (qa) with the RetrievalQA class.
# It is configured with the language model (llm), the "refine" chain type, the retriever
# built above, and return_source_documents=False so that source documents are not returned.
# The chain is only constructed here; the cells below call question_answerer directly.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever, return_source_documents=False)
print(f'qa {qa}')

Invalid model-index. Not loading eval results into CardData.


Top retrieved chunk: In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues (Boser et al., 1992, Guyon et al., 1993, Cortes and Vapnik, 1995, Vapnik et al., 1997[citation needed]) SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik (1982, 1995) and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). SVM maps training examples to points in space so as to max

In [10]:
# Retrieve the relevant documents
question = "Who is Thomas Jefferson?"
docs = retriever.invoke(question)

# Use the content of the top-ranked document as context
if docs:
    context_str = docs[0].page_content
    print("Context:", context_str)
else:
    print("No relevant documents found.")
    context_str = ""

Context: Thomas Jefferson (April 13, 1743 – July 4, 1826) was an American statesman, diplomat, lawyer, architect, philosopher, and Founding Father who served as the third president of the United States from 1801 to 1809. Among the Committee of Five charged by the Second Continental Congress with authoring the Declaration of Independence, Jefferson was the Declaration's primary author. Following the American Revolutionary War and prior to becoming the nation's third president in 1801, Jefferson was the first United States secretary of state under George Washington and then the nation's second vice president under John Adams.


In [11]:
if context_str:  # Make sure a context is available
    result = question_answerer(question=question, context=context_str)
    print("Answer:", result["answer"])
else:
    print("No context available for answering the question.")

Answer: an American statesman, diplomat, lawyer, architect, philosopher, and Founding Father


## 7. Integration and Final Function
The `answer_question` function wraps the whole process:

1. Retrieves the relevant documents.
2. Uses the content of the most relevant document as context.
3. Generates the answer from that context.

In [12]:
def answer_question(question, retriever, question_answerer):
    """
    Answer a question using a retriever to obtain the context and a question_answerer to produce the answer.

    Args:
        question (str): The question to answer.
        retriever: The retriever object used to fetch relevant documents.
        question_answerer: The Hugging Face question-answering pipeline.

    Returns:
        str: The extracted answer, or a message indicating that no context was found.
    """
    # Retrieve the relevant documents
    docs = retriever.invoke(question)

    # Use the content of the top-ranked document as context
    if docs:
        context_str = docs[0].page_content
        # print("Context:", context_str)
    else:
        print("No relevant documents found.")
        context_str = ""

    # Produce an answer only if there is context
    if context_str:
        result = question_answerer(question=question, context=context_str)
        return result["answer"]
    else:
        return "No context available for answering the question."
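`answer_question` relies entirely on the top-ranked chunk. A possible variant, sketched below, runs the question-answering step on every retrieved chunk and keeps the answer with the highest confidence. It assumes the pipeline returns a dictionary with `"answer"` and `"score"` keys, as the Transformers `question-answering` pipeline does. `answer_from_all_chunks` is a hypothetical helper, and `fake_question_answerer` is a stand-in used only so the example runs without downloading a model.

```python
def answer_from_all_chunks(question, chunks, question_answerer):
    """Run the QA callable on every non-blank chunk; return the highest-scoring result or None."""
    results = [
        question_answerer(question=question, context=chunk)
        for chunk in chunks
        if chunk.strip()
    ]
    return max(results, key=lambda result: result["score"], default=None)


def fake_question_answerer(question, context):
    """Stand-in for the real pipeline: favors contexts that mention 'Jefferson'."""
    score = 0.9 if "Jefferson" in context else 0.1
    return {"answer": context.split(".")[0], "score": score}


# Placeholder chunks standing in for retriever results.
chunks = [
    "Wine is an alcoholic drink. It is made from grapes.",
    "Thomas Jefferson was the third president. He served from 1801 to 1809.",
    "   ",
]
best = answer_from_all_chunks("Who is Thomas Jefferson?", chunks, fake_question_answerer)
print(best["answer"], best["score"])
```

With the notebook's objects, the chunk texts would come from the `page_content` of the documents returned by the retriever.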

## 8. Final Run
Finally, the system answers questions such as these:

In [16]:
question = "Who is Thomas Jefferson?"
answer = answer_question(question, retriever, question_answerer)
print("Answer:", answer)

Answer: an American statesman, diplomat, lawyer, architect, philosopher, and Founding Father


In [14]:
question = "Where Tomoaki Komorida born?"
answer = answer_question(question, retriever, question_answerer)
print("Answer:", answer)

Answer: Kumamoto Prefecture


In [15]:
question = "When did Virgin Australia starts?"
answer = answer_question(question, retriever, question_answerer)
print("Answer:", answer)

Answer: 31 August 2000
