## ⚡ Further Notebooks In This Course ⚡

**Notebooks:**
1. [LLM 01 - How to use LLMs with Hugging Face](https://www.kaggle.com/code/aliabdin1/llm-01-llms-with-hugging-face)
2. [LLM 02 - Embeddings, Vector Databases, and Search](https://www.kaggle.com/code/aliabdin1/llm-02-embeddings-vector-databases-and-search)
3. [LLM 03 - Building LLM Chain](https://www.kaggle.com/code/aliabdin1/llm-03-building-llm-chain)
4. [LLM 04a - Fine-tuning LLMs](https://www.kaggle.com/code/aliabdin1/llm-04a-fine-tuning-llms)
4. [LLM 04b - Evaluating LLMs](https://www.kaggle.com/code/aliabdin1/llm-04b-evaluating-llms)
5. [LLM 05 - Biased LLMs and Society](https://www.kaggle.com/code/aliabdin1/llm-05-llms-and-society)
6. [LLM 06 - LLMOps](https://www.kaggle.com/code/aliabdin1/llm-06-llmops)

**Hands-on Lab Notebooks:**
1. [LLM 01L - How to use LLMs with Hugging Face Lab](https://www.kaggle.com/code/aliabdin1/llm-01l-llms-with-hugging-face-lab)
2. [LLM 02L - Embeddings, Vector Databases, and Search Lab](https://www.kaggle.com/code/aliabdin1/llm-02l-embeddings-vector-databases-and-search)
3. [LLM 03L - Building LLM Chains Lab](https://www.kaggle.com/code/aliabdin1/llm-03l-building-llm-chains-lab)
4. [LLM 04L - Fine-tuning LLMs Lab](https://www.kaggle.com/code/aliabdin1/llm-04l-fine-tuning-llms-lab)
5. [LLM 05L - Biased LLMs and Society Lab](https://www.kaggle.com/code/aliabdin1/llm-05l-llms-and-society-lab)

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

Reference
- https://www.kaggle.com/code/aliabdin1/llm-02-embeddings-vector-databases-and-search
- https://youtu.be/GKqtMBkFotA?si=mGxXnpDdFf1pBlUn
- https://github.com/databricks-academy/large-language-models

# Embeddings, Vector Databases, and Search

Converting text into embedding vectors is the first step in many text processing pipelines. As the amount of text grows, it often makes sense to save these embedding vectors in a dedicated vector index or library, so that the embeddings don't have to be recomputed and retrieval is faster. We can then search for documents that match a query and pass these relevant documents to a language model (LM) as additional context. We also refer to this context as supplying the LM with "state" or "memory". The LM then generates a response based on the additional context it receives!

In this notebook, we will implement the full workflow of text vectorization, vector search, and question answering. While we use [FAISS](https://faiss.ai/) (a vector library), [ChromaDB](https://docs.trychroma.com/) (a vector database), and a Hugging Face model, you can easily swap these out for your preferred tools or models!

<img src="https://files.training.databricks.com/images/llm/updated_vector_search.png" width=1000 target="_blank" > 

### ![Dolly](https://files.training.databricks.com/images/llm/dolly_small.png) Learning Objectives
1. Implement the workflow of reading text, converting text to embeddings, saving them to FAISS and ChromaDB 
2. Query for similar documents using FAISS and ChromaDB 
3. Apply a Hugging Face language model for question answering!

Restart kernel after running the following cell

In [None]:
# ! pip install -U git+https://github.com/huggingface/transformers.git --quiet
# ! pip install -U git+https://github.com/huggingface/accelerate.git --quiet

In [None]:
# %pip install faiss-cpu==1.7.4 sentence_transformers --quiet

## Step 1: Reading data

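The original course notebook uses the data on <a href="https://newscatcherapi.com/" target="_blank">news topics collected by the NewsCatcher team</a>, who collect and index news articles and release them to the open-source community. That dataset can be downloaded from <a href="https://www.kaggle.com/kotartemiy/topic-labeled-news-dataset" target="_blank">Kaggle</a>. The cell below instead loads a CSV of Omdena FAQ chatbot training data, whose `Questions` and `Answers` columns are used throughout the rest of this notebook; the original `read_csv` call is left commented out.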

In [None]:
import pandas as pd

#pdf = pd.read_csv("../input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv", sep=";")
pdf = pd.read_csv("/kaggle/input/omdena-faq-chatbot-training-data/omdena_faq_training_data_1 - Sheet1 (6).csv")
pdf["id"] = pdf.index
display(pdf)

In [None]:
pdf.iloc[2,4]

## Vector Library: FAISS

Vector libraries are often sufficient for small, static data. Because a vector library is not a full-fledged database, it does not offer CRUD (Create, Read, Update, Delete) support. Once the index has been built, adding, removing, or editing vectors typically requires rebuilding the index from scratch.

That said, vector libraries are easy, lightweight, and fast to use. Examples of vector libraries are [FAISS](https://faiss.ai/), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), [ANNOY](https://github.com/spotify/annoy), and [HNSW](https://arxiv.org/abs/1603.09320).

FAISS supports several similarity measures, including L2 (Euclidean) distance and inner product. Cosine similarity is obtained by L2-normalizing the vectors and searching with inner product, which is the approach taken in this notebook. You can read more about the implementation on the FAISS [GitHub](https://github.com/facebookresearch/faiss/wiki/Getting-started#searching) page or in this [blog post](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/). The FAISS team has also published a [best practice guide](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index).
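
To make the link between inner product, cosine similarity, and L2 distance concrete, here is a minimal sketch that uses only the Python standard library (no FAISS). The vectors `a` and `b` and all helper names are made up for illustration. It checks that, on unit-length vectors, the inner product equals the cosine similarity and the squared L2 distance equals `2 - 2 * cosine`.

```python
import math

def l2_normalize(vec):
    """Return a unit-length copy of vec (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, computed on the raw vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Hypothetical 2-D vectors, chosen only to keep the arithmetic easy to follow
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos_ab = cosine_similarity(a, b)            # cosine on the raw vectors
ip_unit = inner_product(a_unit, b_unit)     # inner product on the normalized vectors
sq_l2_unit = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

print(cos_ab, ip_unit, sq_l2_unit)
```

On normalized vectors, ranking by largest inner product and ranking by smallest L2 distance therefore give the same order.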

If you'd like to read up more on the comparisons between vector libraries and databases, [here is a good blog post](https://weaviate.io/blog/vector-library-vs-vector-database#feature-comparison---library-versus-database).

The overall workflow of FAISS is captured in the diagram below. 

<img src="https://miro.medium.com/v2/resize:fit:1400/0*ouf0eyQskPeGWIGm" width=700>

Source: [How to use FAISS to build your first similarity search by Asna Shafiq](https://medium.com/loopio-tech/how-to-use-faiss-to-build-your-first-similarity-search-bf0f708aa772).

In [None]:
pdf_subset = pdf.dropna(subset=['Answers']).copy()
pdf_subset['QNA'] = pdf_subset.apply(lambda row: f"Question: {row['Questions']}, Answer: {row['Answers']}", axis=1)


In [None]:
# Positional access: the row labelled 0 may have been removed by dropna
pdf_subset['QNA'].iloc[0]

In [None]:
from sentence_transformers import InputExample

def example_create_fn(doc1: str) -> InputExample:
    """
    Helper function that wraps a single text in a sentence_transformers InputExample
    """
    return InputExample(texts=[doc1])

faiss_train_examples = pdf_subset.apply(
    lambda x: example_create_fn(x["Questions"]), axis=1
).tolist()

In [None]:
pdf_subset.shape

In [None]:
pdf.shape

In [None]:
type(faiss_train_examples[0])

### Step 2: Vectorize text into embedding vectors
We will be using the `Sentence-Transformers` [library](https://www.sbert.net/) to load a language model that vectorizes our text into embeddings. The library hosts some of the most popular transformers on the [Hugging Face Model Hub](https://huggingface.co/sentence-transformers).
Here, we use `SentenceTransformer("all-distilroberta-v1")` to generate embeddings; `all-MiniLM-L6-v2` is left in the code as a commented-out alternative.

In [None]:
mkdir cache

In [None]:
pdf_subset.QNA.values.tolist()[:3]

All Sentence Models list: https://www.sbert.net/docs/pretrained_models.html

In [None]:
from sentence_transformers import SentenceTransformer

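model = SentenceTransformer(
    # "all-MiniLM-L6-v2",
    "all-distilroberta-v1",
    cache_folder="../working/cache/",
)
# save() writes a model directory (despite the .pkl suffix), which load() reads back by path
model.save("all-distilroberta-v1-model.pkl")
model = SentenceTransformer.load("all-distilroberta-v1-model.pkl")

# Embed the questions; the result is a 2-D array of shape (n_questions, embedding_dim)
faiss_title_embedding = model.encode(pdf_subset.Questions.values.tolist())
len(faiss_title_embedding), len(faiss_title_embedding[0])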

In [None]:
faiss_title_embedding.shape

### Step 3: Saving embedding vectors to FAISS index
Below, we create the FAISS index object based on our embedding vectors, normalize vectors, and add these vectors to the FAISS index.

In [None]:
import numpy as np
import faiss

pdf_to_index = pdf_subset.set_index(["id"], drop=False)
id_index = np.array(pdf_to_index.id.values).flatten().astype("int")

content_encoded_normalized = faiss_title_embedding.copy()
faiss.normalize_L2(content_encoded_normalized)

# IndexIDMap translates search results to IDs: https://faiss.ai/cpp_api/file/IndexIDMap_8h.html#_CPPv4I0EN5faiss18IndexIDMapTemplateE
# The IndexFlatIP below builds an exact (exhaustive) inner-product index
index_content = faiss.IndexIDMap(faiss.IndexFlatIP(len(faiss_title_embedding[0])))
index_content.add_with_ids(content_encoded_normalized, id_index)

In [None]:
# Save the index
faiss.write_index(index_content, "index.faiss")

# Load the index
index_content = faiss.read_index("index.faiss")

In [None]:
content_encoded_normalized.shape

In [None]:
faiss_title_embedding.shape

### Step 4: Search for relevant documents

We define a search function below to first vectorize our query text, and then search for the vectors with the closest distance.

In [None]:
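def search_content(query, pdf_to_index, k=5):
    """Return the k rows of pdf_to_index most similar to query, with their similarity scores."""
    query_vector = model.encode([query])
    faiss.normalize_L2(query_vector)

    # We set k to limit the number of vectors we want to return
    # search() returns (similarities, ids), each of shape (n_queries, k)
    top_k = index_content.search(query_vector, k)
    ids = top_k[1][0].tolist()
    similarities = top_k[0][0].tolist()
    # Copy so that adding a column never touches (or warns about) the source frame
    results = pdf_to_index.loc[ids].copy()
    results["similarities"] = similarities
    return results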

Tada! Now you can query for similar content! Notice that you did not have to configure any database networks beforehand nor pass in any credentials. FAISS works locally with your code.
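
A flat inner-product index performs an exhaustive search: it scores every stored vector against the query and keeps the best `k`. The sketch below reproduces that idea in plain Python so the mechanics are visible without FAISS. The function name, toy vectors, and IDs are hypothetical, and this is an illustration of the technique rather than how FAISS is implemented internally.

```python
def top_k_inner_product(query, vectors, ids, k=5):
    """Score every stored vector against the query and return the k best.

    Returns (scores, matched_ids), both ordered from most to least similar.
    """
    scored = [
        (sum(q * v for q, v in zip(query, vec)), doc_id)
        for vec, doc_id in zip(vectors, ids)
    ]
    # Highest inner product first, as with a similarity (not distance) metric
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = scored[:k]
    return [score for score, _ in best], [doc_id for _, doc_id in best]

# Hypothetical, already-normalized 2-D "embeddings" with made-up document IDs
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_ids = [10, 20, 30]

scores, matched = top_k_inner_product([0.8, 0.6], toy_vectors, toy_ids, k=2)
print(matched, scores)
```

The returned IDs play the same role as the `ids` list in `search_content`: they are the keys used to look the matching rows up in the DataFrame.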

In [None]:
results = search_content("What is the process to join Omdena School?", pdf_to_index)
results['Answers']

In [None]:
print(results['Answers'].iloc[0])

Up until now, we haven't done the last step of conducting Q/A with a language model. We demonstrate it below, using the FAISS search results as context.

## Prompt engineering for question answering 

Now that we have identified the FAQ entries most relevant to our query, we can pass them as additional context for a language model to generate a response based on them!

We first need to pick a `text-generation` model. Below, we use a Hugging Face model. You can also use OpenAI, but you will need an OpenAI API key and will [pay based on the number of tokens](https://openai.com/pricing).

In [None]:
#!pip uninstall -y accelerate

In [None]:
# !pip install --upgrade accelerate

Here's where prompt engineering, the practice of developing prompts, comes in. We pass in the context in our `prompt_template`, but there are numerous ways to write a prompt. Some prompts generate better results than others, and it takes some experimentation to figure out how best to talk to the model. Each language model responds differently to prompts.

Our prompt template below is inspired by a [2023 paper on program-aided language models](https://arxiv.org/pdf/2211.10435.pdf). The authors have provided their sample prompt template [here](https://github.com/reasoning-machines/pal/blob/main/pal/prompt/date_understanding_prompt.py).

The following links also provide some helpful guidance on prompt engineering: 
- [Prompt engineering with OpenAI](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
- [GitHub repo that compiles best practices to interact with ChatGPT](https://github.com/f/awesome-chatgpt-prompts)
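
One way to keep prompt experiments tidy is to build the prompt in a small helper, so that the context format and length budget can be changed in one place. The sketch below is one possible layout, not a recommended template: `build_prompt`, its `max_context_chars` parameter, and the example passages are all hypothetical.

```python
def build_prompt(question, passages, max_context_chars=2000):
    """Number each passage, join them, and cap the context at a character budget."""
    numbered = [f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1)]
    # Truncate by characters as a rough stand-in for the model's token limit
    context = "\n".join(numbered)[:max_context_chars]
    return f"Relevant context:\n{context}\n\nAnswer the question in detail: {question}"

# Made-up passages standing in for retrieved answers
example_passages = ["First retrieved answer.", "  Second retrieved answer.  "]
prompt = build_prompt("how can i join", example_passages)
print(prompt)
```

Numbering the passages makes it easier to see which retrieved text the model drew on when comparing prompt variants.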

In [None]:
results['Answers']

In [None]:
question = "how can i join"
results = search_content(question, pdf_to_index)
# Prefix each retrieved answer with "#" as a separator, then cap the context at 2014 characters
context = " ".join([f"#{answer}" for answer in results['Answers']])[:2014]
#prompt_template = f"Relevant context: {context}\n\n provide Answer to the question in a paragraph: {question}"
prompt_template = f"Relevant context: {context}\n\n Answer the question in detail: {question}"


In [None]:
prompt_template

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
model_text = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

input_text = prompt_template
#input_text = "translate English to German: How old are you?"
#input_text = "Summarize text:"+text#prompt_template
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

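# Low-temperature sampling keeps the output close to the most likely tokens
outputs = model_text.generate(input_ids, max_new_tokens=2024, temperature=0.1, do_sample=True)
# skip_special_tokens drops markers such as <pad> and </s> from the printed answer
print(tokenizer.decode(outputs[0], skip_special_tokens=True))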

In [None]:
# The cells below use the legacy (pre-1.0) openai interface, so pin the SDK accordingly
!pip install "openai<1"

In [None]:
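import openai
from kaggle_secrets import UserSecretsClient

# Read the API key from the Kaggle secret named "openai"
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("openai")
openai.api_key = secret_value_0

model_engine = "text-davinci-002"
# Note: this overrides the retrieval-augmented prompt built earlier with a plain question
prompt_template = "what is omdena project"#prompt_template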

response = openai.Completion.create(
    engine=model_engine,
    prompt=prompt_template,
    max_tokens=124,
    temperature=0.8,
    n=1,
    stop=None,
)

print(response.choices[0].text)

Yay, you have just completed the implementation of your first text vectorization, search, and question answering workflow (that requires prompt engineering)!

In the lab, you will apply your newly gained knowledge to a different dataset. You can also check out the optional modules on Pinecone and Weaviate to learn how to set up vector databases with enterprise offerings.

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>

In [None]:
import openai
from kaggle_secrets import UserSecretsClient

# Read the API key from the Kaggle secret named "openai"
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("openai")
openai.api_key = secret_value_0

# The system message sets the task; the user messages are successive turns of one conversation
messages = [
    {"role": "system", "content": "Your job is to extract the user intent and create a single question"},
    {"role": "user", "content": "1 what is omdena school"},
    {"role": "user", "content": "2 what is local chapter "},
    {"role": "user", "content": "3 how do i join"},
]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0,
    max_tokens=10,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)
print(response)

In [None]:
response

In [None]:
response.choices[0]['message']['content'].strip()

In [None]:
import json

# Convert the response object to a JSON string
response_json = json.dumps(response)

# Parse the string back into a dict, then extract the content of the first choice
content = json.loads(response_json)["choices"][0]["message"]["content"]
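
Note that `json.dumps` returns a string, and a string cannot be indexed by key; it has to be parsed back with `json.loads` before fields can be read. The sketch below shows the round trip on a hypothetical payload that mirrors only the fields accessed above, so it runs without an API key. The message text is made up.

```python
import json

# Hypothetical payload mirroring only the fields accessed above
mock_response = {
    "choices": [
        {"message": {"role": "assistant", "content": " How do I join Omdena School? "}}
    ]
}

as_text = json.dumps(mock_response)   # a str: as_text["choices"] would raise TypeError
as_dict = json.loads(as_text)         # back to a dict that can be indexed by key
content = as_dict["choices"][0]["message"]["content"].strip()
print(content)
```

If the response object already behaves like a dict, the fields can also be read from it directly, and the JSON round trip is only needed when the text form has to be stored or sent elsewhere.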