In this notebook, we implement a RAG (retrieval-augmented generation) system that builds a vector database from data obtained by scraping a web page. The system retrieves the relevant passages and uses them to generate answers to questions about that data.

Import the dependencies.

In [15]:
import requests
from bs4 import BeautifulSoup
from langchain_ollama.chat_models import ChatOllama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

Scrape the page (in this case, from Wikipedia) and split the text into chunks.

In [None]:
url = 'https://en.wikipedia.org/wiki/Lionel_Messi'

# Fetch the page; the timeout keeps the cell from hanging on a slow connection
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Strip the HTML tags and keep the text content
    soup = BeautifulSoup(response.text, 'html.parser')
    page_text = soup.get_text(separator=' ', strip=True)
    # Split into chunks of up to 400 characters with a 20-character overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20)
    web_texts = text_splitter.split_text(page_text)
    print(web_texts)
else:
    print(f'Could not access the page: {response.status_code}')
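
To make the roles of `chunk_size` and `chunk_overlap` more concrete, the following is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces); `sliding_chunks` is a hypothetical helper written only for illustration, and the tiny sizes are chosen so the overlap is easy to see.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=2)
print(demo_chunks)
# ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, which is the same idea as the 20-character overlap used above: a sentence cut at a chunk boundary still appears partly in both neighbours.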

In [17]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

Create the vector database from the text chunks collected from our web page.

In [18]:
bbdd_vector = Chroma.from_texts(
    texts=web_texts,
    collection_name="players",
    embedding=embeddings,
)

In [19]:
retriever = bbdd_vector.as_retriever()
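
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. The vectors, the `toy_index` data, and the `top_k` helper are all assumptions made for illustration; Chroma uses its own index and distance metric, and real embeddings from the model above have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to the query (hypothetical helper)."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "vector database": (chunk text, made-up embedding)
toy_index = [
    ("chunk about career", [1.0, 0.0, 0.0]),
    ("chunk about awards", [0.0, 1.0, 0.0]),
    ("chunk about early life", [0.7, 0.7, 0.0]),
]
toy_query = [0.9, 0.1, 0.0]  # made-up embedding of a question

nearest = top_k(toy_query, toy_index, k=2)
print(nearest)
# ['chunk about career', 'chunk about early life']
```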

Use the retrieved context so that the model answers our questions.

In [20]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

llm = "llama3.2"
modelo = ChatOllama(model=llm)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | modelo
    | StrOutputParser()
)
chain.invoke("Who is Lionel Messi?")

"Lionel Messi is an Argentine international who plays as a forward, widely regarded as one of the greatest players of all time. He has won numerous records for individual accolades, including eight Ballon d'Or awards and eight times being named the world's best player by FIFA."
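
In the chain above, the retriever's output (a list of document objects) is placed into the prompt as-is, so the model sees the list's string representation rather than clean text. One common variation is to join the chunks' text before filling the template. The sketch below shows such a helper; `format_docs` is a name chosen here for illustration, and `SimpleNamespace` objects stand in for the retrieved documents, on the assumption that each one exposes its text as `page_content`.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for retrieved documents (assumption: text lives in `page_content`)
fake_docs = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content="Second chunk."),
]

context_text = format_docs(fake_docs)
print(context_text)
```

With a helper like this, the first step of the chain could be written as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, which should give the model plain text as context instead of raw document objects.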