
# Progress Report 4

The main goal of the project is to implement RAG (Retrieval-Augmented Generation) and Graph RAG techniques to address a current limitation of Large Language Models (LLMs): their answers lose relevance when a question involves information that was not part of their training data. As part of the project, a chatbot will be developed that lets users access relevant information extracted from a knowledge base.

In this phase of the project, the first components of the data flow have been established: **loading the document that contains the information base**, **integrating a pretrained embedding model**, **building a vectorized knowledge base**, and **implementing baseline models with RAG**.

## Encoder model and similarity search

In [None]:
!pip install -r requirements.txt

In [None]:
#!pip install langchain

In [1]:
import time

In [2]:
import torch
import math

# Check for GPU support on macOS (MPS backend)
print(torch.backends.mps.is_available())
print(torch.backends.mps.is_built())

False
False


In [3]:
print(torch.cuda.is_available())

True
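
The encoder cells below hard-code `"cuda"` and keep the CPU and MPS variants as comments. A minimal sketch of picking the device from the availability checks above is shown here; `pick_device` is a hypothetical helper, not part of the notebook, and it takes plain booleans so it can be run without PyTorch installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string: CUDA first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Values printed by the checks above on this machine: MPS False, CUDA True.
print(pick_device(cuda_available=True, mps_available=False))
```

In the notebook, the arguments would presumably come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the result could be passed as `model_kwargs={'device': ...}`.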


In [4]:
encoder_model_name = 'avsolatorio/GIST-Embedding-v0'

encoder_model_name_2 = 'sentence-transformers/all-MiniLM-L12-v2'

In [5]:
from langchain_community.embeddings import HuggingFaceEmbeddings

encoder = HuggingFaceEmbeddings(
    model_name = encoder_model_name,
    #model_kwargs = {'device': "cpu"}
    model_kwargs = {'device': "cuda"}
    #model_kwargs = {'device': "mps"} #mac
)

encoder_2 = HuggingFaceEmbeddings(
    model_name = encoder_model_name_2,
    #model_kwargs = {'device': "cpu"}
    model_kwargs = {'device': "cuda"}
    #model_kwargs = {'device': "mps"} #mac
)




In [6]:
embeddings = encoder.embed_query("How are you?")

In [7]:
print(len(embeddings))

768


In [8]:
import numpy as np

q = encoder.embed_query("What is earnings?")
z1 = encoder.embed_query(
    'In finance, "earnings" refer to the net profits of a company after deducting all expenses from its revenues. It indicates a company profitability and is essential for assessing its financial health. Earnings are often reported quarterly or annually.'
)  # from wikipedia
z2 = encoder.embed_query(
    'A smartphone is a handheld electronic device that combines the functions of a mobile phone and a computer, enabling features such as internet browsing, app usage, multimedia consumption, and communication.'
)  # from wikipedia

print(np.dot(q, z1) / (np.linalg.norm(q) * np.linalg.norm(z1)))

print(np.dot(q, z2) / (np.linalg.norm(q) * np.linalg.norm(z2)))

0.9030869571349992
0.5814245683213912
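
The two numbers above are cosine similarities: the dot product of the query and passage embeddings divided by the product of their norms. The same computation can be written as a small reusable function. This is a sketch in plain Python (no NumPy) so that it runs anywhere; `cosine_similarity` is a name introduced here for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Identical directions score 1.0, orthogonal directions score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Applied to the embeddings above, `cosine_similarity(q, z1)` should reproduce the first printed value up to floating-point rounding.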


## Document loading and text splitting

In [9]:
from langchain_community.document_loaders import PyPDFLoader

loaders = [
    PyPDFLoader("goog023-alphabet-2023-annual-report-web-1.pdf"),
]
pages = []
for loader in loaders:
    pages.extend(loader.load())

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=AutoTokenizer.from_pretrained(encoder_model_name),
    chunk_size=256,
    chunk_overlap=32,
    strip_whitespace=True,
)

docs = text_splitter.split_documents(pages)



In [11]:
print(len(pages))
print(len(docs))

108
451
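
The 108 pages become 451 chunks because each page is cut into pieces of at most 256 tokens, with neighbouring pieces sharing up to 32 tokens. To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window illustration. It is an assumption-laden sketch, not the recursive, separator-based algorithm that `RecursiveCharacterTextSplitter` uses, and `window_chunks` is a hypothetical name.

```python
from typing import List, Sequence


def window_chunks(tokens: Sequence[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in window_chunks(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.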


## Vectorized knowledge base

In [12]:
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

faiss_db = FAISS.from_documents(
    docs, encoder, distance_strategy=DistanceStrategy.COSINE
)

faiss_db_2 = FAISS.from_documents(
    docs, encoder_2, distance_strategy=DistanceStrategy.COSINE
)

In [13]:
faiss_db

<langchain_community.vectorstores.faiss.FAISS at 0x7fdb5c872aa0>

In [14]:
def similarity_search(question: str, faiss_db, k):
  retrieved_docs = faiss_db.similarity_search(question, k=k)
  context = "".join(doc.page_content + "\n" for doc in retrieved_docs)
  return context
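
Conceptually, `similarity_search` ranks the stored chunks by how close their embeddings are to the question embedding, keeps the best `k`, and joins their text into one context string. The brute-force sketch below illustrates that idea with toy two-dimensional vectors. It is not how FAISS is implemented internally, and `top_k` and the sample data are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query: Sequence[float], doc_vectors: List[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, _cosine(query, vec)) for i, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy corpus: texts and hand-written "embeddings" (illustrative values only).
texts = ["earnings report", "smartphone review", "revenue summary"]
vectors = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]

hits = top_k([1.0, 0.0], vectors, k=2)
context = "".join(texts[i] + "\n" for i, _ in hits)
print(context)
```

A dedicated vector index matters once the corpus is large; for the 451 chunks here, an exhaustive scan like this one would also be feasible.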

## LLMs

In [15]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from dotenv import load_dotenv


In [16]:
load_dotenv()  # reads a .env file containing ACCESS_TOKEN=<your Hugging Face access token>
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN")  # never hard-code the token in the notebook

model_id_1 = "google/gemma-2b-it"
tokenizer_gemma = AutoTokenizer.from_pretrained(model_id_1, token=ACCESS_TOKEN)
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.bfloat16)

model_gemma = AutoModelForCausalLM.from_pretrained(model_id_1,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             token=ACCESS_TOKEN)


model_id_2 = "google/gemma-2b-it"

tokenizer_gemma_2 = AutoTokenizer.from_pretrained(model_id_2, token=ACCESS_TOKEN)
model_gemma_2 = AutoModelForCausalLM.from_pretrained(model_id_2,
                                             device_map="auto",
                                             torch_dtype=torch.bfloat16,
                                             token=ACCESS_TOKEN)



model_id_3 = "meta-llama/Meta-Llama-3-8B"

tokenizer_llama_3 = AutoTokenizer.from_pretrained(model_id_3, token=ACCESS_TOKEN)
model_gemma_llama_3 = AutoModelForCausalLM.from_pretrained(model_id_3,
                                             device_map="auto",
                                             torch_dtype=torch.bfloat16,
                                             token=ACCESS_TOKEN)

tokenizers=[]
tokenizers.append(tokenizer_gemma)
tokenizers.append(tokenizer_gemma_2)
tokenizers.append(tokenizer_llama_3)

models=[]

models.append(model_gemma.eval())
models.append(model_gemma_2.eval())
models.append(model_gemma_llama_3.eval())

model_names = ['gemma_quantization', 'gemma', 'llama_3']

device = 'cuda' if torch.cuda.is_available() else 'cpu'


Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
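
The first model is loaded with `load_in_4bit=True`, the others in `bfloat16`. A rough back-of-the-envelope estimate shows why this matters for GPU memory. The sketch below counts only the weights and assumes about 2.5 billion parameters for the 2B Gemma model (an approximation, not a figure taken from this notebook). It ignores activations, the KV cache, quantization metadata, and any layers kept in higher precision.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 2.5e9  # assumed parameter count, for illustration only

print(f"bfloat16: {weight_memory_gib(N_PARAMS, 16):.2f} GiB")
print(f"4-bit:    {weight_memory_gib(N_PARAMS, 4):.2f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of the `bfloat16` weights, which is the trade-off being compared between `gemma_quantization` and `gemma`.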



In [18]:
def generate(question: str, context: str, tokenizer, model):

  start_time = time.time()  # Start the timer

  if not context:  # covers both None and the empty string
      prompt = f"""Give a detailed answer to the following question. Question: {question}"""
  else:
      prompt = f"""Using the information contained in the context, give a detailed answer to the question.
          Context: {context}.
          Question: {question}"""
  chat = [{"role": "user", "content": prompt}]
  formatted_prompt = tokenizer.apply_chat_template(
      chat,
      tokenize=False,
      add_generation_prompt=True,
  )
  inputs = tokenizer.encode(
      formatted_prompt, add_special_tokens=False, return_tensors="pt"
  ).to(device)
  with torch.no_grad():
      outputs = model.generate(
          input_ids=inputs,
          max_new_tokens=250,
          do_sample=False,
      )
  response = tokenizer.decode(outputs[0], skip_special_tokens=False)
  response = response[len(formatted_prompt) :]  # remove input prompt from response
  response = response.replace("<eos>", "")  # remove eos token

  end_time = time.time()  # End the timer
  print(f"Running time: {end_time - start_time:.2f} seconds")  # Print the running time

  return response
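
The prompt-building step inside `generate` can be separated from the model call, which makes it possible to inspect or test the prompt without loading an LLM. The sketch below is one possible way to do this; `build_chat` is a hypothetical helper. It keeps the same two prompt templates but avoids the leading indentation that the triple-quoted f-string above carries into the prompt.

```python
from typing import Dict, List, Optional


def build_chat(question: str, context: Optional[str] = None) -> List[Dict[str, str]]:
    """Build the single-turn chat message used for generation.

    Without context the model is asked directly; with context it is told
    to answer using the retrieved passages.
    """
    if not context:
        prompt = f"Give a detailed answer to the following question. Question: {question}"
    else:
        prompt = (
            "Using the information contained in the context, "
            "give a detailed answer to the question.\n"
            f"Context: {context}\n"
            f"Question: {question}"
        )
    return [{"role": "user", "content": prompt}]


print(build_chat("What is earnings?")[0]["content"])
print(build_chat("What is earnings?", context="Earnings are net profits.")[0]["content"])
```

The returned list has the shape that `tokenizer.apply_chat_template` is called with in `generate`.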

## Model comparison

In [54]:
question = "What are the customers five key capabilities from Google Cloud?"


context = similarity_search(question, faiss_db, k=5)
context_2 = similarity_search(question,faiss_db_2, k=5)


for i, llm_model in enumerate(models):

  print(model_names[i],'-'*100)
  print('*'*100)
  print(model_names[i],'--Sin contexto:--')
  print(generate(question=question, context='', tokenizer=tokenizers[i], model=llm_model ))
  print('*'*100)
  print(model_names[i],'--Contexto (GIST-Embedding-v0):--')
  print(generate(question=question, context=context, tokenizer=tokenizers[i], model=llm_model ))
  print('*'*100)
  print(model_names[i],'--Contexto (all-MiniLM-L12-v2):--')
  print(generate(question=question, context=context_2, tokenizer=tokenizers[i], model=llm_model ))



gemma_quantization ----------------------------------------------------------------------------------------------------
****************************************************************************************************
gemma_quantization --Sin contexto:--
Running time: 9.83 seconds
Sure, here are the five key capabilities from Google Cloud:

1. **Compute:** Google Cloud provides a wide range of compute options to meet your specific needs, from virtual machines and serverless functions to containerized services and serverless AI.


2. **Storage:** Google Cloud offers a variety of storage options to ensure that your data is always accessible and reliable. These options include Cloud Storage, Cloud SQL, Cloud Spanner, and Cloud CDN.


3. **Networking:** Google Cloud provides a robust and secure network that connects you to the world. This includes services such as Cloud CDN, Cloud DNS, Cloud Load Balancing, and Cloud VPN.


4. **Identity and Access Management:** Google Cloud provides a 
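
The comparison loop is repeated verbatim for every question in this section. One way to factor it out is sketched below: a function that runs every generator against every context variant and collects the answers. `compare` and `stub_generator` are hypothetical names, and the stub stands in for a real model so the example runs without a GPU.

```python
from typing import Callable, Dict, Optional, Tuple

# A generator takes (question, context) and returns the answer text.
Generator = Callable[[str, Optional[str]], str]


def compare(
    question: str,
    contexts: Dict[str, Optional[str]],
    generators: Dict[str, Generator],
) -> Dict[Tuple[str, str], str]:
    """Run each generator on each context variant and collect the answers."""
    results: Dict[Tuple[str, str], str] = {}
    for model_name, generate_fn in generators.items():
        for context_name, context in contexts.items():
            results[(model_name, context_name)] = generate_fn(question, context)
    return results


def stub_generator(question: str, context: Optional[str]) -> str:
    """Placeholder for a real LLM call."""
    return "grounded answer" if context else "ungrounded answer"


results = compare(
    "example question",
    contexts={"no context": None, "with context": "retrieved text"},
    generators={"stub_model": stub_generator},
)
for (model_name, context_name), answer in results.items():
    print(f"{model_name} | {context_name}: {answer}")
```

In the notebook, each entry in `generators` could wrap the existing function, for example `lambda q, c: generate(q, c, tokenizer, model)`, and `contexts` could hold the outputs of `similarity_search` for each encoder.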

In [55]:
question = "What occurred in the dispute between Epic Games and Google in December 2023?"


context = similarity_search(question, faiss_db, k=5)
context_2 = similarity_search(question,faiss_db_2, k=5)


for i, llm_model in enumerate(models):

  print(model_names[i],'-'*100)
  print('*'*100)
  print(model_names[i],'--Sin contexto:--')
  print(generate(question=question, context='', tokenizer=tokenizers[i], model=llm_model ))
  print('*'*100)
  print(model_names[i],'--Contexto (GIST-Embedding-v0):--')
  print(generate(question=question, context=context, tokenizer=tokenizers[i], model=llm_model ))
  print('*'*100)
  print(model_names[i],'--Contexto (all-MiniLM-L12-v2):--')
  print(generate(question=question, context=context_2, tokenizer=tokenizers[i], model=llm_model ))



gemma_quantization ----------------------------------------------------------------------------------------------------
****************************************************************************************************
gemma_quantization --Sin contexto:--
Running time: 2.28 seconds
I am unable to access real-time information, therefore I cannot provide a detailed answer to the question. For the most up-to-date information on the Epic Games and Google dispute, please refer to reputable news sources or legal publications.
****************************************************************************************************
gemma_quantization --Contexto (GIST-Embedding-v0):--
Running time: 3.19 seconds
Sure, here's a summary of the dispute between Epic Games and Google in December 2023:

Epic Games v. Google found that Google violated antitrust laws related to Google Play's business practices. The presiding judge will determine remedies in 2024 and the range of potential remedies vary wid

In [56]:
question = "How do Google's financial results for 2023 compare with those of 2022?"


context = similarity_search(question, faiss_db, k=5)
context_2 = similarity_search(question,faiss_db_2, k=5)


for i, llm_model in enumerate(models):

  print(model_names[i],'-'*100)
  print('*'*100)
  print(model_names[i],'--Sin contexto:--')
  print(generate(question=question, context='', tokenizer=tokenizers[i], model=llm_model ))
  print('*'*100)
  print(model_names[i],'--Contexto (GIST-Embedding-v0):--')
  print(generate(question=question, context=context, tokenizer=tokenizers[i], model=llm_model ))
  print('*'*100)
  print(model_names[i],'--Contexto (all-MiniLM-L12-v2):--')
  print(generate(question=question, context=context_2, tokenizer=tokenizers[i], model=llm_model ))



gemma_quantization ----------------------------------------------------------------------------------------------------
****************************************************************************************************
gemma_quantization --Sin contexto:--
Running time: 11.40 seconds
**Google's Financial Results for 2023 Compared to 2022**

**Revenue:**

* Google's revenue in 2023 reached **$292.7 billion**, a 13% increase from 2022.
* This growth was driven by increased search queries, advertising revenue, and other services.

**Operating Income:**

* Google's operating income in 2023 was **$16.5 billion**, a 10% decrease from 2022.
* This decline was primarily due to increased expenses for research and development, cloud infrastructure, and other operating costs.

**Net Income:**

* Google's net income in 2023 was **$10.1 billion**, a 10% decrease from 2022.
* This decrease was also attributed to increased expenses and a decline in tax rates.

**Key Financial Metrics:**

| Metric | 

In [57]:
question = "what is the number google's of employees in 2023?"


context = similarity_search(question, faiss_db, k=5)
context_2 = similarity_search(question,faiss_db_2, k=5)


for i, llm_model in enumerate(models):

  print(model_names[i],'-'*100)
  print('*'*100)
  print(model_names[i],'--Sin contexto:--')
  print(generate(question=question, context='', tokenizer=tokenizers[i], model=llm_model ))
  print('*'*100)
  print(model_names[i],'--Contexto (GIST-Embedding-v0):--')
  print(generate(question=question, context=context, tokenizer=tokenizers[i], model=llm_model ))
  print('*'*100)
  print(model_names[i],'--Contexto (all-MiniLM-L12-v2):--')
  print(generate(question=question, context=context_2, tokenizer=tokenizers[i], model=llm_model ))



gemma_quantization ----------------------------------------------------------------------------------------------------
****************************************************************************************************
gemma_quantization --Sin contexto:--
Running time: 2.29 seconds
As of October 26, 2023, Google has approximately 125,000 employees worldwide. This number is based on publicly available data from sources such as Glassdoor, LinkedIn, and Statista.
****************************************************************************************************
gemma_quantization --Contexto (GIST-Embedding-v0):--
Running time: 1.63 seconds
According to the context, as of December 31, 2023, Google had 182,502 employees.
****************************************************************************************************
gemma_quantization --Contexto (all-MiniLM-L12-v2):--
Running time: 1.59 seconds
According to the context, as of December 31, 2023, Alphabet had 182,502 employees.
gemma -