# LLMs e IA Generativa

## Chatbot de CVs

En este ejercicio se desarrolla un chatbot basado en un sistema de Retrieval Augmented Generation (RAG). Este utiliza todos los archivos PDF almacenados en el directorio "/docs" como fuente de datos, los cuales se emplean para proporcionar contexto al modelo de lenguaje (LLM) y permitirle responder de manera más precisa a las preguntas planteadas.

Para la codificación de los documentos, se emplea una estrategia de chunking recursivo. Este enfoque divide los documentos en fragmentos más pequeños, optimizando así la eficiencia en el procesamiento de texto. Se utiliza el modelo "all-MiniLM-L6-v2" para realizar el embedding de los fragmentos. A continuación, tanto embeddings como fragmentos son persistidos en una base de datos Pinecone. Para la generación de respuestas, se utiliza el modelo de lenguaje "llama3-8b-8192", que aprovecha el contexto proporcionado por los datos almacenados en Pinecone para generar las respuestas.

In [1]:
# Configuración del entorno
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import PyPDF2
import os
from groq import Groq
from dotenv import load_dotenv

load_dotenv()

True

Configuración de la base de datos de Pinecone.

In [2]:
# Configuración de Pinecone
pinecone_api_key = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key=pinecone_api_key)


Carga y procesamiento de los PDFs presentes en el directorio "/docs".

In [3]:
# Carga y procesamiento de PDFs
pdf_dir = "docs"  # Directorio donde se almacenan los PDFs
pdf_files = [f for f in os.listdir(pdf_dir) if f.endswith(".pdf")]

In [4]:
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
        return text

In [5]:
# Procesar todos los PDFs
documents = []
for pdf in pdf_files:
    text = extract_text_from_pdf(os.path.join(pdf_dir, pdf))
    documents.append({"filename": pdf, "text": text})

In [6]:
embedding_model_name="all-MiniLM-L6-v2"
embedding_model_dim=384

embedding_model = SentenceTransformer(embedding_model_name)

def generate_embedding(text):
    return embedding_model.encode(text).tolist()

Generación del index y posterior upserting de los embeddings en Pinecone.

In [7]:
# Valida si el index existe o no

index_name = "cvs-embeddings"

existing_indexes = pc.list_indexes()
index_names = [index.name for index in existing_indexes]

if index_name not in index_names:
    print("Index doesn't exist. Creating!")
    pc.create_index(
        name=index_name,
        dimension=embedding_model_dim,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ) 
    )
else:
    print("Index with name " + index_name + " already created. Skipping!")

Index doesn't exist. Creating!


In [8]:
# Cargando el index de Pinecone
index = pc.Index(index_name)

# Documentos cargados en el directorio
documents_names = [document['filename'] for document in documents]

# Documentos cargados en Pinecone
existing_indexes = pc.list_indexes()
index_names = [index.name for index in existing_indexes]


for document_name in documents_names:
    print("Processing document: " + document_name)

    dummy_vector = [0.0] * 384

    # Realiza la búsqueda con el filtro en el campo 'filename' de la metadata
    query_results = index.query(
        vector=dummy_vector,
        top_k=10,  # Número máximo de resultados a devolver
        filter={'filename': {'$eq': document_name}},  # Filtro de metadata
        include_metadata=True  # Incluir metadata en los resultados
    )

    # # Imprime los resultados
    # for result in query_results['matches']:
    #     print(f"ID: {result['id']}")
    #     print(f"Filename: {result['metadata']['filename']}")
    #     print(f"Score: {result['score']}")

    if query_results['matches']:
        print(f"File '{document_name}' found on index. Upserting aborted...")
    else:
        print(f"File '{document_name}' not found. Initializing upserting!!..")

        # Chunking recursivo
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
        chunks = []

        doc = next((doc for doc in documents if doc['filename'] == document_name), None)

        doc_chunks = text_splitter.split_text(doc["text"])
        chunks.extend([{"filename": doc["filename"], "chunk": chunk} for chunk in doc_chunks])
        
        for chunk in chunks:
            chunk["embedding"] = generate_embedding(chunk["chunk"])

        # Subir los vectores a Pinecone
        for i, chunk in enumerate(chunks):
            index.upsert([(str(i), chunk["embedding"], {"filename": chunk["filename"], "chunk": chunk["chunk"]})])



Processing document: Javier Villagra - Resume.pdf
File 'Javier Villagra - Resume.pdf' not found. Initializing upserting!!..


Consulta por similaridad coseno sobre la base de datos de Pinecone para obtener el contexto.

In [10]:
# Probar una consulta
def find_similar(query):
    query_embedding = generate_embedding(query)
    response = index.query(vector=query_embedding, top_k=1, include_metadata=True)
    # sample: index.query(vector=[0.1, 0.2, 0.3], top_k=10, namespace='my_namespace')
    return response


query = "Does he know Java?"
response = find_similar(query)

print("Closest chunk:")
print(response)


Closest chunk:
{'matches': [{'id': '8',
              'metadata': {'chunk': "with the channel's ESB.\n"
                                    'Conducted effective troubleshooting and '
                                    'resolved issues with coding, design and '
                                    'infrastructure.Banco Macro, Argentina\n'
                                    'Page 1Universidad de Buenos AiresSYSTEMS '
                                    'ENGINEERING2001 - 2008Microservices\n'
                                    'Java EEAgile Management\n'
                                    'Fintech\n'
                                    'Universidad de Buenos AiresPOSTGRADUATE '
                                    'DEGREE,\n'
                                    'AI SPECIALIST2023 - Machine '
                                    'LearningArtificial Intelligence\n'
                                    'Deep Learning\n'
                                    'PythonEXPERIENCE\n'
               

Generación de la consulta al LLM empleando como contexto los datos devueltos por Pinecone.

In [12]:
groq_api_key = os.getenv('GROQ_API_KEY')

client = Groq(
    api_key=groq_api_key,
)

context = response['matches'][0]['metadata']['chunk']

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Answer this question: '"+query+"' using this information: " + context,
        }
    ],
    model="llama3-8b-8192",
)

print(chat_completion.choices[0].message.content)

Based on the provided information, we can conclude that the individual has some experience with Java, specifically with Java EE, which is a Java technology developed by Sun Microsystems and now owned by Oracle. He has also listed "Java" as a skill under his "Languages" section, but without specifying the level of proficiency.

It is important to note that the individual also has experience with Python and has listed it as a language under his "EXPERIENCE" section. However, we cannot conclude that he does not know Java based on this information.

To answer the original question, we can say that:
"Based on the provided information, it is not explicitly stated that he does not know Java, and he has listed it as a skill, but the level of his proficiency is not specified. It is also worth noting that he has experience with Python and has listed it as a language."
