## Generate Synthetic Data for Fine-tuning RAG

#### References:
- https://betterprogramming.pub/fine-tuning-your-embedding-model-to-maximize-relevance-retrieval-in-rag-pipeline-2ea3fa231149
- https://gpt-index.readthedocs.io/en/latest/examples/finetuning/embeddings/finetune_embedding.html


In [42]:
import json
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import MetadataMode
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

from llama_index.llms import OpenAI
import os
import sys

# Make the project's local helper library importable
sys.path.insert(0, '../../libs')
from utils import load_json

### Load all API keys 
keys = load_json('/home/chengyu.huang/project/Fund_projects/openai_key.json') 
os.environ['OPENAI_API_KEY'] = keys['ChatGPT1']['API_KEY']
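
`load_json` comes from a local `utils` module that is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it simply parses a JSON file into a Python object (the real implementation may differ). `resolve_api_key` is a hypothetical addition that prefers a key already present in the environment, so the key file is only read when needed:

```python
import json
import os


def load_json(path):
    """Hypothetical stand-in for utils.load_json: parse a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def resolve_api_key(key_file, env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, else read it from key_file.

    Assumes the same layout the notebook uses: keys['ChatGPT1']['API_KEY'].
    """
    existing = os.environ.get(env_var)
    if existing:
        return existing
    keys = load_json(key_file)
    return keys["ChatGPT1"]["API_KEY"]
```

Reading the key through a helper like this keeps the machine-specific path in one place and avoids overwriting a key that was exported in the shell.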

In [43]:
OpenAI??

Init signature:
OpenAI(
    model: str = 'gpt-3.5-turbo',
    temperature: float = 0.1,
    max_tokens: Optional[int] = None,
    additional_kwargs: Optional[Dict[str, Any]] = None,
    max_retries: int = 10,
    api_key: Optional[str] = None,
    api_type: Optional[str

In [45]:
root_folder = '/data/shared_data/Language_Model_Training_Data/Data/Raw_LM_Data'
raw_txt_folder = os.path.join(root_folder,'CLEAN_All')
reader = SimpleDirectoryReader(input_dir=raw_txt_folder,num_files_limit=1000) ## set file limits for testing purpose 
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 1000 docs


For more options to customize the node parser and metadata extraction, see:
- https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/node_parsers/usage_pattern.html
- https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/documents_and_nodes/usage_metadata_extractor.html

In [4]:
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(docs, show_progress=True)


In [5]:
print('total nodes : {}'.format(len(nodes)))
print(nodes[1].text[:200])

total nodes : 29010
To enhance Mexico's capacity to implement the adjustment policies, negotiations were undertaken with foreign creditors to refinance the public sector's short-term debt and its medium- and long­ term o
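
The parser above is configured with `chunk_size=512` and `chunk_overlap=20`. As a rough illustration of what those two numbers mean, here is a sliding-window sketch over a plain token list. This is not how `SimpleNodeParser` is implemented (it uses a real tokenizer and tries to respect sentence boundaries); `chunk_tokens` is a hypothetical helper that only shows how consecutive chunks share `chunk_overlap` tokens:

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=20):
    """Split tokens into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy usage: whitespace "tokens" from a short string
words = "a b c d e f g h i j".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

A small overlap like 20 tokens keeps a sentence that straddles a boundary at least partly visible in both neighbouring chunks.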


In [27]:
train_nodes = nodes[:20000]
val_nodes = nodes[20000:]
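
The positional split above depends on the order in which files were read, and a document whose chunks straddle index 20000 ends up in both sets. A hedged alternative sketch is to split at the document level with a fixed seed. `split_by_document` and its parameters are hypothetical names, and the usage note assumes each node exposes a `ref_doc_id` attribute pointing at its source document:

```python
import random


def split_by_document(items, doc_id_of, val_fraction=0.3, seed=42):
    """Assign whole documents to train or validation, reproducibly."""
    doc_ids = sorted({doc_id_of(x) for x in items})
    rng = random.Random(seed)  # local RNG so the global state is untouched
    rng.shuffle(doc_ids)
    n_val = max(1, int(len(doc_ids) * val_fraction))
    val_ids = set(doc_ids[:n_val])
    train = [x for x in items if doc_id_of(x) not in val_ids]
    val = [x for x in items if doc_id_of(x) in val_ids]
    return train, val
```

With the nodes from this notebook, a call might look like `split_by_document(nodes, lambda n: n.ref_doc_id)`, provided that attribute is available in the installed version.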

### Generate synthetic queries 

In [11]:
generate_qa_embedding_pairs??

Signature:
generate_qa_embedding_pairs(
    nodes: List[llama_index.schema.TextNode],
    llm: Optional[llama_index.llms.base.LLM] = None,
    qa_generate_prompt_tmpl: str = 'Context information is below.\n\n---------------------\n{context_str}\n---------------------\n\nGiven the context information and not prior knowledge.\ngenerate only questions based on the below query.\n\nYou are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context inf

train_dataset = generate_qa_embedding_pairs(train_nodes)
val_dataset = generate_qa_embedding_pairs(val_nodes)

- See https://github.com/run-llama/finetune-embedding/blob/main/generate_dataset.ipynb for how to customize the query generation function.

In [46]:
# Custom question-generation prompt. The {context_str} and {num_questions_per_chunk}
# placeholders are filled in by generate_qa_embedding_pairs.
qa_generate_prompt = (
    'Context information is below.\n\n---------------------\n{context_str}\n---------------------\n\n'
    'Given the context information and not prior knowledge.\ngenerate only questions based on the below query.\n\n'
    'You are an IMF Economist. Your task is to set up {num_questions_per_chunk} questions that economists are likely to ask based on the context provided. '
    'The questions should be relatively concise and diverse in nature across the document. '
    'Restrict the questions to the context information provided.\n'
)

llm = OpenAI(
    api_key=os.environ['OPENAI_API_KEY'],
    model='gpt-3.5-turbo',
    temperature=0.0
)
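
To make the role of the two placeholders concrete, the sketch below shows roughly what happens to the template for each node: it is formatted with the chunk text and the question count, sent to the LLM, and the numbered reply is split into individual questions. `render_prompt` and `parse_questions` are hypothetical names, and the parsing rule is an approximation of the library's behaviour, not a copy of it:

```python
import re


def render_prompt(template, context_str, num_questions_per_chunk):
    """Fill the two placeholders the template is expected to contain."""
    return template.format(
        context_str=context_str,
        num_questions_per_chunk=num_questions_per_chunk,
    )


def parse_questions(raw_response):
    """Split an LLM reply into questions, dropping list numbering and blanks."""
    lines = raw_response.strip().split("\n")
    questions = [re.sub(r"^\d+[\).\s]", "", line).strip() for line in lines]
    return [q for q in questions if q]


# Toy usage with a shortened template and a made-up reply
toy_template = "Context:\n{context_str}\n\nWrite {num_questions_per_chunk} questions."
print(render_prompt(toy_template, "Some chunk text.", 2))
print(parse_questions("1. What changed?\n2. Why did it change?"))
```

Because the template goes through `str.format`, any literal braces in a custom prompt would need to be doubled (`{{` and `}}`).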

In [73]:
train_dataset = generate_qa_embedding_pairs(train_nodes[:100],
                                          llm=llm,
                                          qa_generate_prompt_tmpl=qa_generate_prompt,
                                          num_questions_per_chunk=3)

val_dataset = generate_qa_embedding_pairs(val_nodes[:10],
                                          llm=llm,
                                          qa_generate_prompt_tmpl=qa_generate_prompt,
                                          num_questions_per_chunk=2)

100%|██████████| 100/100 [06:42<00:00,  4.02s/it]
100%|██████████| 10/10 [00:25<00:00,  2.58s/it]


In [77]:
list(val_dataset.queries.values())[:2]

['How did the simultaneous movements in the fiscal and external deficits in the mid-1990s impact private savings in Hungary?',
 'What factors contribute to the anticipated decline in the private savings-investment balance and the aim to reduce the external current account deficit towards its target level in Hungary?']

In [78]:
val_nodes[0].text[:500]

'1 Ricardian effects may also contribute to a reduction in private savings if there is a further fiscal consolidation, but their magnitude is unclear in Hungary, though the strong simultaneous movements in the fiscal and external deficits in the mid-1990s would suggest that they were relatively weak at that time.\nThe estimation of the appropriate fiscal stance starts from the budget for 2000, which implies a modest fiscal tightening such that such that there may be a small rise in the external cu'
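
The generated dataset holds three mappings: `queries` (query id to question), `corpus` (node id to chunk text) and `relevant_docs` (query id to a list of node ids). A small sketch for viewing each question next to the text it was generated from, rather than inspecting the two separately. `pair_queries_with_context` is a hypothetical helper and works on plain dicts:

```python
def pair_queries_with_context(queries, corpus, relevant_docs, max_chars=200):
    """Return (question, context snippet) tuples for every relevant node."""
    pairs = []
    for query_id, question in queries.items():
        for node_id in relevant_docs[query_id]:
            pairs.append((question, corpus[node_id][:max_chars]))
    return pairs


# Toy usage with made-up ids and text
toy_queries = {"q1": "What was refinanced?"}
toy_corpus = {"n1": "Negotiations were undertaken to refinance short-term debt."}
toy_relevant = {"q1": ["n1"]}
for question, snippet in pair_queries_with_context(toy_queries, toy_corpus, toy_relevant):
    print(question, "<-", snippet)
```

With the objects in this notebook the call would be `pair_queries_with_context(val_dataset.queries, val_dataset.corpus, val_dataset.relevant_docs)`.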

#### Save to file

In [79]:
out_path = os.path.join(root_folder,"train_dataset.json")
train_dataset.save_json(out_path)
out_path = os.path.join(root_folder,"val_dataset.json")
val_dataset.save_json(out_path)
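
Before spending time on training, it can be worth checking that a saved file is internally consistent. The sketch below assumes the JSON on disk is an object with `queries`, `corpus` and `relevant_docs` keys, which is an assumption about the file layout and may vary between library versions. `check_dataset_file` is a hypothetical helper:

```python
import json


def check_dataset_file(path):
    """Return a list of human-readable problems found in a saved dataset."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    queries = data["queries"]
    corpus = data["corpus"]
    relevant = data["relevant_docs"]
    problems = []
    for query_id in queries:
        doc_ids = relevant.get(query_id, [])
        if not doc_ids:
            problems.append(f"query {query_id} has no relevant docs")
        for doc_id in doc_ids:
            if doc_id not in corpus:
                problems.append(f"query {query_id} points to missing doc {doc_id}")
    return problems
```

An empty list means every question points to at least one chunk that is present in the corpus.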

### Run Embedding Finetuning

In [80]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

In [83]:
## loss = losses.MultipleNegativesRankingLoss(model)
## Passing MultipleNegativesRankingLoss explicitly is problematic here because of the
## way the training pairs are batched; the batching would need further customization.
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path=os.path.join(root_folder,"test_model"),
    val_dataset=val_dataset,
    epochs=2,
    batch_size=16,
    evaluation_steps=50
)
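
The comment in the cell above does not say exactly what goes wrong with the batching. One plausible reading, stated here as an assumption: MultipleNegativesRankingLoss treats every other passage in a batch as a negative for a given query, and since several questions are generated per chunk, the same chunk can appear more than once in a batch and be penalized as a false negative. A minimal sketch of a batcher that avoids this by keeping each node id at most once per batch (`batches_without_duplicate_docs` is a hypothetical name):

```python
def batches_without_duplicate_docs(pairs, batch_size):
    """Group (query_id, node_id) pairs so no node_id repeats within a batch."""
    remaining = list(pairs)
    batches = []
    while remaining:
        batch, seen, leftover = [], set(), []
        for query_id, node_id in remaining:
            if len(batch) < batch_size and node_id not in seen:
                batch.append((query_id, node_id))
                seen.add(node_id)
            else:
                # Defer pairs that would duplicate a node or overflow the batch
                leftover.append((query_id, node_id))
        batches.append(batch)
        remaining = leftover
    return batches
```

The sentence-transformers package also ships a `NoDuplicatesDataLoader` built for this loss; whether it can be plugged into the finetune engine used here has not been checked.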

In [84]:
finetune_engine.finetune()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/19 [00:00<?, ?it/s]

Iteration:   0%|          | 0/19 [00:00<?, ?it/s]

In [85]:
embed_model = finetune_engine.get_finetuned_model()
embed_model

LangchainEmbedding(model_name='/data/shared_data/Language_Model_Training_Data/Data/Raw_LM_Data/test_model', embed_batch_size=10, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7f1967098130>)
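
A natural follow-up is to check whether the fine-tuned model retrieves the source chunk for each validation question. The sketch below computes a simple hit rate at k over precomputed vectors using only the standard library. `hit_rate_at_k` is a hypothetical helper and the toy vectors are made up; in practice the vectors would come from the embedding model (for example via `embed_model.get_query_embedding` and `embed_model.get_text_embedding`, assuming those methods are available in the installed version):

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(query_vecs, corpus_vecs, relevant_docs, k=5):
    """Fraction of queries with at least one relevant node in the top k."""
    if not query_vecs:
        return 0.0
    hits = 0
    for query_id, qvec in query_vecs.items():
        ranked = sorted(
            corpus_vecs,
            key=lambda node_id: cosine(qvec, corpus_vecs[node_id]),
            reverse=True,
        )
        if any(node_id in ranked[:k] for node_id in relevant_docs[query_id]):
            hits += 1
    return hits / len(query_vecs)


# Toy usage with 2-dimensional made-up vectors
toy_corpus = {"n1": [1.0, 0.0], "n2": [0.0, 1.0]}
toy_queries = {"q1": [0.9, 0.1], "q2": [0.8, 0.2]}
toy_relevant = {"q1": ["n1"], "q2": ["n2"]}
print(hit_rate_at_k(toy_queries, toy_corpus, toy_relevant, k=1))
```

Running the same function with vectors from the base model and from the fine-tuned model gives a like-for-like comparison on the validation questions.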
