### Pre-task Work

All we really need to do to get started is to get our prerequisites!

We'll be leveraging `langchain` and `llama 2` today.

Check out the docs:
- [LangChain](https://docs.langchain.com/docs/)
- [LLaMA 2](https://huggingface.co/blog/llama2)

### Task 1: Data Preparation

In this task we'll be collecting, and then parsing, our data.

In [None]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/oppenheimer.csv

In [None]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/barbie.csv

#### Data Parsing

Now that we have our data - let's go ahead and start parsing it into a more usable format for LangChain!

We'll be using the `CSVLoader` for this application.

Check out the docs here:
- [CSVLoader](https://python.langchain.com/docs/integrations/document_loaders/csv)
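
To build some intuition for what a row-oriented CSV loader produces, here is a rough standard-library approximation. It is not LangChain's implementation: the sample data, the `rows_to_documents` helper, and the dictionary layout are all hypothetical. It turns each row into one text blob of `column: value` lines, and it also shows a detail worth knowing about `fieldnames`: when column names are passed explicitly, `csv.DictReader` treats the first line of the file as data.

```python
import csv
import io

# Hypothetical sample data; the real directory CSV is not shown here.
SAMPLE_CSV = 'Oficina,Nombre,Descripcion,Telefono\n"101","Ana Perez","Secretaria","555-0100"\n'

def rows_to_documents(csv_text, fieldnames=None):
    """Turn each CSV row into a dict with `page_content` and `metadata`."""
    reader = csv.DictReader(
        io.StringIO(csv_text), fieldnames=fieldnames, delimiter=",", quotechar='"'
    )
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text blob.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({"page_content": content, "metadata": {"row": row_number}})
    return documents

# Without explicit fieldnames, the first line is consumed as the header.
docs_header_inferred = rows_to_documents(SAMPLE_CSV)

# With explicit fieldnames, the first line is treated as data,
# so a header row in the file becomes a document of its own.
docs_explicit_names = rows_to_documents(
    SAMPLE_CSV, fieldnames=["Oficina", "Nombre", "Descripcion", "Telefono"]
)

print(len(docs_header_inferred), len(docs_explicit_names))
print(docs_header_inferred[0]["page_content"])
```

If your file already has a header row, it may be worth checking whether the first loaded document is just the header repeated back.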

In [None]:
import torch
import transformers

In [None]:
from langchain.document_loaders.csv_loader import CSVLoader

In [None]:
directory_loader = CSVLoader(file_path="../dataset/directorioL.csv", csv_args={
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': [ 'Oficina', 'Nombre', 'Descripcion', 'Telefono']
})

directory_data = directory_loader.load()

In [None]:
len(directory_data)

Now that we have loaded our directory information with the loader - we can go ahead and chunk the documents into more manageable pieces.

We'll be leveraging the `RecursiveCharacterTextSplitter` for this task today.

While splitting our text seems like a simple enough task - whether you get it right or wrong can have massive downstream impacts on your application's performance.

You can read the docs here:
- [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
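
To see what chunk size and chunk overlap mean in practice, here is a deliberately simplified sketch. It is a plain fixed-window splitter, not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` uses, and the `chunk_text` helper is a hypothetical name introduced for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role overlap plays: context that straddles a boundary appears in both neighbours.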

> ### HINT:
>It's always worth it to check out the LangChain source code if you're ever in a bind - for instance, if you want to know how to transform a set of documents, check it out [here](https://github.com/langchain-ai/langchain/blob/5e9687a196410e9f41ebcd11eb3f2ca13925545b/libs/langchain/langchain/text_splitter.py#L268C18-L268C18)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000, # the character length of the chunk
    chunk_overlap = 100, # the character length of the overlap between chunks
    length_function = len, # the length function - in this case, character length (aka the python len() fn.)
)

In [None]:
directory_documents = text_splitter.transform_documents(directory_data)

In [None]:
len(directory_documents)

In [None]:
#in case we need more data
#oppenheimer_documents = text_splitter.transform_documents(oppenheimer_data)

In [None]:
#len(oppenheimer_documents)

In [None]:
#combined_documents = barbie_documents + oppenheimer_documents

With our documents transformed into more manageable sizes, and with the correct metadata set-up, we're now ready to move on to creating our VectorStore!

### Task 2: Creating an "Index"

The term "index" is used largely to mean: Structured documents parsed into a useful format for querying, retrieving, and use in the LLM application stack.

#### Selecting Our VectorStore

There are a number of different VectorStores, and a number of different strengths and weaknesses to each.

In this notebook, we will be keeping it very simple by leveraging [Facebook AI Similarity Search](https://ai.meta.com/tools/faiss/#:~:text=FAISS%20(Facebook%20AI%20Similarity%20Search,more%20scalable%20similarity%20search%20functions.), or `FAISS`.

We're going to be setting up our VectorStore with a Hugging Face sentence-transformers embeddings model (`paraphrase-multilingual-MiniLM-L12-v2`). While this embeddings model does not need to be consistent with the LLM selection, it does need to be consistent between embedding our index and embedding our queries over that index.

While we don't have to worry too much about that in this example - it's something to keep in mind for more complex applications.

We're going to leverage a [`CacheBackedEmbeddings`](https://python.langchain.com/docs/modules/data_connection/caching_embeddings) flow to prevent us from re-embedding the same queries over and over again.

Not only will this save time, it will also save us precious embedding tokens, which will reduce the overall cost for our application.

>#### Note:
>The overall cost savings needs to be compared against the additional cost of storing the cached embeddings for a true cost/benefit analysis. If your users are submitting the same queries often, though, this pattern can be a massive reduction in cost.
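
The caching idea itself is small enough to sketch in plain Python. The following is an illustration of the pattern only, not the `CacheBackedEmbeddings` implementation: `CachedEmbedder` and `toy_embedding` are hypothetical names, the cache is an in-memory dictionary rather than a persistent store, and the "model" is a stand-in that counts characters.

```python
import hashlib

class CachedEmbedder:
    """Wrap an embedding function and reuse vectors for texts seen before."""

    def __init__(self, embed_fn, namespace):
        self.embed_fn = embed_fn
        self.namespace = namespace  # keeps vectors from different models apart
        self.cache = {}             # in-memory stand-in for a persistent store
        self.misses = 0

    def _key(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed(self, text):
        key = self._key(text)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

def toy_embedding(text):
    # Hypothetical stand-in for a real model: two character-count features.
    return [float(len(text)), float(text.count("a"))]

cached = CachedEmbedder(toy_embedding, namespace="toy-model")
first = cached.embed("hola mundo")
second = cached.embed("hola mundo")   # served from the cache
third = cached.embed("hola  mundo")   # different text, so a new embedding
print(cached.misses)
```

Note that the key is derived from the exact text, so only identical inputs hit the cache; a query that is merely similar is embedded again.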

In [None]:
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

vector_store = FAISS.from_documents(directory_documents, embedder)

Now that we've created the VectorStore, we can check that it's working by embedding a query and retrieving passages from our directory that are close to it.

In [None]:
query = "Quien es Vanessa Cedeño Mieles?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)
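
Conceptually, a similarity search by vector ranks stored vectors by their distance to the query vector and keeps the closest `k`. The sketch below does this exhaustively in plain Python, assuming Euclidean (L2) distance; the `top_k` helper and the toy vectors are hypothetical, and FAISS performs the same kind of search far more efficiently.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vector, indexed_vectors, k):
    """Return the k (distance, doc_id) pairs closest to the query."""
    scored = [
        (l2_distance(query_vector, vector), doc_id)
        for doc_id, vector in indexed_vectors.items()
    ]
    return sorted(scored)[:k]

# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = {
    "doc_a": [0.0, 0.0],
    "doc_b": [1.0, 0.0],
    "doc_c": [3.0, 4.0],
}

results = top_k([0.9, 0.1], toy_index, k=2)
for distance, doc_id in results:
    print(doc_id, round(distance, 3))
```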

Let's see how much time the `CacheBackedEmbeddings` pattern saves us:

In [None]:
%%timeit -n 1 -r 1
query = "Quien es Jorge Aragundy?"
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

In [None]:
%%timeit
query = "Quien es Jorge Aragundy?"
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

As we can see, averaged over many runs, the cached query is significantly faster than the first instance of the query!

With that, we're ready to move onto Task 3!

### Task 3: Building a Retrieval Chain

In this task, we'll be making a Retrieval Chain which will allow us to ask semantic questions over our data.

LangChain abstracts most of this part away from us, which makes it very powerful but also easy to use without understanding it.

Be sure to check the documentation, the source code, and other provided resources to build a deeper understanding of what's happening "under the hood"!

#### A Basic RetrievalQA Chain

We're going to leverage `return_source_documents=True` to ensure we have proper sources for our answers - should the end user want to verify them themselves.

Hallucinations [are](https://arxiv.org/abs/2202.03629) [a](https://arxiv.org/abs/2305.15852) [massive](https://arxiv.org/abs/2303.16104) [problem](https://arxiv.org/abs/2305.18248) in LLM applications.

Though it has been tenuously shown that using Retrieval Augmentation [reduces hallucination in conversations](https://arxiv.org/pdf/2104.07567.pdf), one surefire way to ensure your model is not hallucinating in a non-transparent way is to provide sources with your responses. This way the end user can verify the output.
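
To make the "answer plus sources" idea concrete, here is a minimal sketch of a retrieve-then-generate flow under stated assumptions: the retriever and the LLM are hypothetical stand-ins (`toy_retrieve`, `toy_generate`), the prompt wording is made up, and the whole thing only mirrors the general shape of a retrieval QA chain rather than LangChain's actual internals.

```python
def answer_with_sources(question, retrieve, generate, k=4):
    """Retrieve context, build a prompt, and return the answer with its sources."""
    source_documents = retrieve(question)[:k]
    context = "\n\n".join(doc["page_content"] for doc in source_documents)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return {
        "query": question,
        "result": generate(prompt),
        "source_documents": source_documents,
    }

# Hypothetical documents standing in for the chunked directory entries.
toy_documents = [
    {"page_content": "Nombre: Ana Perez\nOficina: 101", "metadata": {"row": 0}},
    {"page_content": "Nombre: Luis Mora\nOficina: 202", "metadata": {"row": 1}},
]

def toy_retrieve(question):
    # Keyword overlap instead of vector search, to keep the sketch dependency-free.
    words = {word.strip("?¿.,").lower() for word in question.split()}
    return [
        doc for doc in toy_documents
        if words & set(doc["page_content"].lower().split())
    ]

def toy_generate(prompt):
    # A real LLM call would go here.
    return f"(model output for a prompt of {len(prompt)} characters)"

response = answer_with_sources(
    "Cual es la oficina de Ana Perez?", toy_retrieve, toy_generate
)
print(response["result"])
for doc in response["source_documents"]:
    print(doc["metadata"])
```

Returning the retrieved documents next to the generated text is what lets the end user check the answer against the underlying records.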

#### Our LLM

In this notebook, we're going to leverage Meta's LLaMA 2!

Specifically, we'll be using: `meta-llama/Llama-2-13b-chat-hf`

That's right, a 13B parameter model that we're going to run on *less than* 15GB of GPU RAM.

More information on this model can be found [here](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
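
A quick back-of-the-envelope calculation shows why quantization makes this plausible. The sketch below estimates the memory taken by the weights alone at different precisions; it assumes a nominal 13 billion parameters and ignores activations, the KV cache, and any quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9

PARAMETERS_13B = 13e9  # nominal parameter count, an approximation

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMETERS_13B, bits):.1f} GB")
```

At 16 bits per parameter the weights alone would not fit in 15GB, while at 4 bits they take roughly a quarter of that space.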

In [1]:
import os
from huggingface_hub import login
# Token read from the environment instead of hardcoded; the HF_TOKEN variable name is an assumption
login(token=os.environ["HF_TOKEN"], add_to_git_credential=True)


We will be leveraging Tim Dettmers' `bitsandbytes` as well as `accelerate` and `transformers` from Hugging Face to make our model as small as possible. The overall quality of the model is fairly well retained!

In [None]:
from llama_cpp import Llama

# Make sure to adjust this path to where you saved the model file
model_path = "/Users/johanjairgilcesreyes/Desktop/ESPOL/INTEGRADORA/models/llama-2-7b-chat.ggmlv3.q8_0.bin"

# Load the model
LLM = Llama(model_path=model_path)

In [None]:
model_id = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)


In [None]:


model_id = "meta-llama/Llama-2-13b-chat-hf"

# Model configuration
model_config = transformers.AutoConfig.from_pretrained(model_id)

# Load the model targeting the CPU
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    resume_download=True,
    device_map='cpu'  # Force CPU usage
)

model.eval()

# Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)


In [11]:
import os

# Set the environment variable before importing torch and transformers
os.environ['PYTORCH_MPS_HIGH_WATERMARK_RATIO'] = '0.0'

import torch
import transformers

model_id = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,
    load_in_8bit_fp32_cpu_offload=True,  # Enable 8-bit loading with offload to CPU
    llm_int8_enable_fp32_cpu_offload=True
)

model_config = transformers.AutoConfig.from_pretrained(
    model_id
)
device_map = {
    'model': 'mps',
    'input_ids': 'cpu',
    'attention_mask': 'cpu',
    'lm_head': 'cpu',
}
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map=device_map,
    cache_dir='/Users/johanjairgilcesreyes/Desktop/ESPOL/INTEGRADORA/PROJECT/production/model',
    resume_download=True,
)

model.eval()


Now we need to pack it into a `pipeline` for compatibility with `langchain`!

In [None]:
# Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

In [None]:
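# Wrap the model and tokenizer in a text-generation pipeline
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,
    do_sample=False,  # greedy decoding
    temperature=0.3,  # sampling temperature ("creativity"); only applied when do_sample=True
    max_new_tokens=256
)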
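
The relationship between `do_sample` and `temperature` is easy to miss: temperature rescales the token distribution that sampling draws from, while greedy decoding always takes the single most likely token. The following toy sketch, with hypothetical helper names and made-up logits, shows both effects without calling any model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def greedy_choice(logits):
    """Greedy decoding picks the highest logit, so temperature plays no role."""
    return max(range(len(logits)), key=lambda index: logits[index])

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cool = softmax_with_temperature(toy_logits, temperature=0.3)
warm = softmax_with_temperature(toy_logits, temperature=1.0)

print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
print(greedy_choice(toy_logits))
```

With a lower temperature the top token takes a larger share of the probability mass, but the greedy choice stays the same either way, which is why a temperature setting has no effect while sampling is disabled.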

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

Now we can set up our chain.

In [None]:
retriever = vector_store.as_retriever()

In [None]:
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler

handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
    return_source_documents=True
)

Now that it's set-up, let's test it out!

In [None]:
qa_with_sources_chain({"query" : "Quien es Pamela Defaz?"})

In [None]:
qa_with_sources_chain({"query" : "Cual es la oficina de Erika Mendoza?"})

In [None]:
qa_with_sources_chain({"query" : "Dame información sobre Allan Avendaño"})

In [None]:
import locale
print(locale.getpreferredencoding())

In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
!pip install -q streamlit pyngrok



In [None]:
%%writefile streamlit_app.py
import streamlit as st

# NOTE: Streamlit runs this file in its own process, so the notebook's
# `qa_with_sources_chain` is not defined here; build or import the chain
# in this file before calling it.

# Streamlit interface
st.title("Preguntas y Respuestas con RAG")

# Input field for the question
question = st.text_input("Introduce tu pregunta:")

# Button to get the answer
if st.button("Obtener Respuesta"):
    # Call the chain here to get the answer
    response = qa_with_sources_chain({"query": question})
    st.write(response)


In [None]:
!ngrok authtoken YOUR_NGROK_AUTHTOKEN  # Replace 'YOUR_NGROK_AUTHTOKEN' with your real ngrok token




In [None]:
from pyngrok import ngrok

# Stop any existing ngrok tunnel
ngrok.kill()

# Set up an ngrok tunnel to port 8501 (Streamlit's default port)
public_url = ngrok.connect(8501)
print('Public URL:', public_url)

# Run Streamlit in the background
get_ipython().system_raw('streamlit run streamlit_app.py &')


In [None]:
!streamlit run streamlit_app.py &>/content/logs.txt &


In [None]:
!npx localtunnel --port 8501


This Notebook is a companion to the event put on by [AIMS](https://www.linkedin.com/company/ai-maker-space/) and [Deci](https://deci.ai/), and is authored by [Chris Alexiuk](https://www.linkedin.com/in/csalexiuk/).