
# Fine-tuning Embedding Models

In this Notebook we'll explore one of the most powerful techniques for taking a single-domain RAG pipeline to the next level: fine-tuning the embeddings model.

But before any of that, we need to grab some dependencies and set up some boilerplate!

## Dependencies and Boilerplate

We'll apply `nest_asyncio` so we can run nested async event loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

This notebook will require the use of GPT-4, and the final evaluation piece might exceed the standard rate limit. If it does, you will need to modify the evaluation pipeline so that its requests stay under your limit!
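
One common way to stay under a rate limit is to retry failed calls with exponential backoff. The block below is a minimal sketch of that idea, not part of the original pipeline: `call_with_backoff` and its parameters are hypothetical names, and it only uses the standard library.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A little random jitter keeps parallel callers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

In practice you would wrap each API call in a zero-argument function (for example a `lambda`) and narrow `retry_on` to the rate-limit exception your client library raises, so that genuine bugs still fail fast.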

### Nest Asyncio

In [None]:
import nest_asyncio

nest_asyncio.apply()
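
Why is this needed? A notebook kernel already runs an event loop, and plain `asyncio` refuses to start a second one inside it. The sketch below reproduces that failure with only the standard library; `notebook_cell` is a hypothetical stand-in for code running inside an already-active loop. Patching `asyncio` with `nest_asyncio.apply()` is what lets such nested calls go through.

```python
import asyncio


async def fetch_answer():
    """Placeholder coroutine standing in for an async API call."""
    await asyncio.sleep(0)
    return "done"


async def notebook_cell():
    """Simulates code that runs while an event loop is already active."""
    coro = fetch_answer()
    try:
        # Starting a new loop from inside a running loop is not allowed.
        asyncio.run(coro)
    except RuntimeError as err:
        coro.close()  # the coroutine never ran, so close it explicitly
        return str(err)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```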

### Install Dependencies

In [None]:
!pip install -qU llama-index-llms-openai llama-index-embeddings-openai llama-index-finetuning

In [None]:
!pip install -qU llama-index-readers-file llama-index-embeddings-huggingface

### Provide OpenAI API Key

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Loading Data

The data can be found in [this GitHub repo](https://github.com/AI-Maker-Space/DataRepository/tree/main/high-performance-rag).

In [None]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 62, done.
remote: Counting objects: 100% (54/54), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 62 (delta 16), reused 29 (delta 8), pack-reused 8
Receiving objects: 100% (62/62), 51.51 MiB | 11.33 MiB/s, done.
Resolving deltas: 100% (16/16), done.


In [None]:
%cd DataRepository

/content/DataRepository


In [None]:
%mkdir ElonData

In [None]:
%mv MuskComplaint.pdf ElonData/

Now we can load the documents in our training directory and split them into nodes.

We will use LlamaIndex's `SimpleNodeParser` to achieve this!

In [None]:
TRAIN_FILES = "ElonData"  # directory that holds the training PDF

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import MetadataMode

def load_corpus(directory, verbose=False):
    """Load every file in `directory` and split the resulting documents into nodes."""
    if verbose:
        print(f"Loading files in {directory}")

    reader = SimpleDirectoryReader(directory)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes

In [None]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)

Loading files in ElonData
Loaded 46 docs


Parsing nodes:   0%|          | 0/46 [00:00<?, ?it/s]

Parsed 54 nodes
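
Notice that 46 documents became 54 nodes: some documents were too long for a single chunk and were split into several. To make the idea of chunking with overlap concrete, here is a toy sketch. It is an illustration only: `chunk_words` is a hypothetical helper that counts whitespace-separated words, whereas the LlamaIndex parser works with tokens and sentence boundaries and uses much larger chunk sizes.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of at most chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


page = " ".join(f"w{i}" for i in range(20))
for chunk in chunk_words(page):
    print(chunk)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary is less likely to lose its context.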


Now that we've split our source documents into a number of nodes, we can move on to constructing a fine-tuning dataset.

## Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-3.5-turbo`.

We'll start by using LlamaIndex's `generate_qa_embedding_pairs` and storing the result in an `EmbeddingQAFinetuneDataset`.

The basic idea here is straightforward enough:

1. We look at a node
2. We generate a question that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.
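
The two steps above can be sketched in plain Python. This is a simplified illustration, not the LlamaIndex implementation: `build_qa_pairs` and `fake_llm` are hypothetical names, and `fake_llm` returns placeholder questions instead of calling a model. The `queries` / `corpus` / `relevant_docs` layout mirrors, as far as we know, the fields of `EmbeddingQAFinetuneDataset`.

```python
import uuid


def build_qa_pairs(nodes, ask, questions_per_node=2):
    """Build question/context pairs from {node_id: text} using a question generator."""
    queries, relevant_docs = {}, {}
    for node_id, text in nodes.items():
        for question in ask(text, questions_per_node):
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # Each generated question is answered by the node it came from.
            relevant_docs[query_id] = [node_id]
    return {"queries": queries, "corpus": dict(nodes), "relevant_docs": relevant_docs}


def fake_llm(text, n):
    """Hypothetical stand-in for the LLM call: returns n placeholder questions."""
    return [f"Placeholder question {i + 1} about: {text[:30]}" for i in range(n)]


nodes = {
    "node-1": "Text of the first chunk of the source document.",
    "node-2": "Text of the second chunk of the source document.",
}
dataset = build_qa_pairs(nodes, fake_llm)
print(len(dataset["queries"]), "questions for", len(dataset["corpus"]), "nodes")
```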

In [None]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [None]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0.0, model="gpt-3.5-turbo")

In [None]:
train_dataset = generate_qa_embedding_pairs(train_nodes, llm=llm)
train_dataset.save_json("train_dataset.json")

100%|██████████| 54/54 [02:05<00:00,  2.33s/it]


In [None]:
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
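
Before training on generated data, it is worth spot-checking a few question/context pairs. The sketch below assumes the saved JSON has top-level `queries`, `corpus`, and `relevant_docs` keys (our understanding of what `save_json` writes; verify against your own file). `iter_pairs` is a hypothetical helper, and the tiny sample file exists only so the block runs on its own.

```python
import json
import os
import tempfile


def iter_pairs(path):
    """Yield (question, context) pairs from a saved dataset file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for query_id, question in data["queries"].items():
        for node_id in data["relevant_docs"][query_id]:
            yield question, data["corpus"][node_id]


# Tiny stand-in file so the sketch runs without the real train_dataset.json.
sample = {
    "queries": {"q1": "What does chunk A say?"},
    "corpus": {"a": "Text of chunk A.", "b": "Text of chunk B."},
    "relevant_docs": {"q1": ["a"]},
}
path = os.path.join(tempfile.mkdtemp(), "sample_dataset.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

pairs = list(iter_pairs(path))
for question, context in pairs:
    print(question, "->", context)
```

Pointing `iter_pairs` at `train_dataset.json` instead of the sample path would let you read through the real generated questions.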

## Fine-tuning `snowflake-arctic-embed-m`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-m`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) as a base embeddings model.

It is a well-performing embeddings model by itself, but our corpus contains a lot of very specific domain terms and vocabulary, so let's fine-tune it and see what that can do for us!

In [None]:
!pip install sentence_transformers -q -U

We'll be leveraging LlamaIndex's `SentenceTransformersFinetuneEngine` to make fine-tuning our embeddings model a breeze.

In [None]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset, # Dataset to be trained on
    model_id="Snowflake/snowflake-arctic-embed-m", # HuggingFace reference to base embeddings model
    model_output_path="snowflake_finetune", # Output directory for fine-tuned embeddings model
    epochs=2 # Number of Epochs to train for
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]






README.md:   0%|          | 0.00/83.9k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/738 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/296 [00:00<?, ?B/s]
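
What will training actually optimize? To our knowledge, when no loss is passed the engine falls back to `sentence-transformers`' `MultipleNegativesRankingLoss`: each question should be most similar to its own context, and every other context in the batch acts as a negative. The block below is a minimal standard-library sketch of that objective on toy 2-D vectors. The function names are hypothetical, and the `scale=20.0` value is assumed to match that loss's default.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_negatives_loss(query_vecs, context_vecs, scale=20.0):
    """Mean cross-entropy where context i is the positive for query i."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, c) for c in context_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[i]
    return total / len(query_vecs)


contexts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query points at its own context
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query points at the wrong context

print("aligned loss:", in_batch_negatives_loss(aligned, contexts))
print("swapped loss:", in_batch_negatives_loss(swapped, contexts))
```

The aligned batch yields a loss near zero and the swapped batch a large one, which is the signal fine-tuning uses to pull questions toward the passages that answer them.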

All that's left to do now is call `.finetune()`!

In [None]:
finetune_engine.finetune()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/11 [00:00<?, ?it/s]

Iteration:   0%|          | 0/11 [00:00<?, ?it/s]
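
The 11 iterations per epoch in the progress bars are consistent with the size of our dataset. The arithmetic below is a sanity check under two assumptions that are not stated in this notebook: that `generate_qa_embedding_pairs` produced its default of 2 questions per node, and that the engine used its default batch size of 10.

```python
import math

num_nodes = 54            # "Parsed 54 nodes" in the output above
questions_per_node = 2    # assumed default of generate_qa_embedding_pairs
batch_size = 10           # assumed default of SentenceTransformersFinetuneEngine
epochs = 2                # set explicitly when creating the engine

num_pairs = num_nodes * questions_per_node
steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * epochs

print(num_pairs, steps_per_epoch, total_steps)  # 108 11 22
```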

Now that we've fine-tuned our embeddings model, let's grab the model out of the engine so we can use it later!

In [None]:
finetuned_embedding_model = finetune_engine.get_finetuned_model()




Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from sentence_transformers import SentenceTransformer

fine_tuned_embedding = SentenceTransformer(
    "snowflake_finetune"
)




Some weights of BertModel were not initialized from the model checkpoint at snowflake_finetune and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
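
Once the model is loaded, retrieval comes down to embedding a query and the candidate passages and ranking the passages by cosine similarity. Here is a minimal sketch of that ranking step. The toy vectors stand in for what `fine_tuned_embedding.encode(...)` would return, and `rank_passages` is a hypothetical helper rather than part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted from most to least similar to the query."""
    scores = {pid: cosine(query_vec, vec) for pid, vec in passage_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings; real ones would come from the fine-tuned model.
passage_vecs = {
    "node-a": [0.9, 0.1, 0.0],
    "node-b": [0.0, 1.0, 0.0],
    "node-c": [0.5, 0.5, 0.7],
}
query_vec = [1.0, 0.0, 0.0]

ranking = rank_passages(query_vec, passage_vecs)
print(ranking)
```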


In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
fine_tuned_embedding.save_to_hub(repo_id="ai-maker-space/snowflake-ft")



model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

'https://huggingface.co/ai-maker-space/snowflake-ft/commit/5698386c22562b35ce69536cac0a96041e48c619'