# RAG Model
Reference: https://medium.com/@mourya.dwarapudi/implementing-the-rag-model-a-step-by-step-guide-5deb4e7b9dde

In [1]:
import os
import faiss
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
from tqdm import tqdm

## Load corpus

In [13]:
directory = "../utils/investopedia-dictionary"
corpus = []
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            corpus.append(file.read())

# Limit the corpus to 100 documents (os.listdir returns files in arbitrary order)
corpus = corpus[:100]

# Print the first few entries as a sample
for i, text in enumerate(corpus[:5]):
    print(f"Text {i+1}:\n{text[:200]}...\n")  # Print the first 200 characters for preview

Text 1:
What Is the Volcker Rule? The Volcker Rule is a federal regulation that generally prohibits banks from conducting certain investment activities with their own accounts and limits their dealings with h...

Text 2:
What Is a Global Registered Share (GRS)? A global registered share (GRS), or a global share, is a security that is issued in the United States, but it is registered in multiple markets around the worl...

Text 3:
Volatility is a statistical measure of returns for a given security or market index.What Is Volatility? Volatility is a statistical measure of the dispersion of returns for a given security or market ...

Text 4:
What Is a Bid? The term bid refers to an offer made by an individual orcorporationto purchase an asset. Buyers commonly make bids at auctions and in various markets, such as the stock market. Bids may...

Text 5:
What Is Stock Compensation? Stock compensation is a way corporations usestock optionsto reward employees. Employees with stock options need 

## Set up the retriever (dense vector indexing with Faiss)

In [3]:
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

### Encode the corpus using the pre-trained model

In [5]:
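corpus_embeddings = []
for doc in tqdm(corpus, total=len(corpus)):
    # Tokenize one document at a time; truncation=True clips it to BERT's 512-token limit
    inputs = tokenizer(doc, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
        # Mean-pool the token vectors into a single 768-dimensional document embedding
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()
        corpus_embeddings.append(embeddings)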

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:16<00:00,  5.95it/s]
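The encoding loop processes one document per forward pass, so a plain mean over token positions is enough. If the documents were instead encoded in padded batches, the padding positions would leak into that mean. The sketch below shows one common remedy, weighting by the attention mask. It uses NumPy on toy arrays rather than the real BERT outputs, and the helper name `masked_mean_pool` is hypothetical.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len) holding 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for rows that contain only padding
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy batch: the second sequence has one real token followed by one padded position
toy_hidden = np.array([[[1.0, 3.0], [3.0, 5.0]],
                       [[2.0, 4.0], [100.0, 100.0]]], dtype="float32")
toy_mask = np.array([[1, 1], [1, 0]])
pooled = masked_mean_pool(toy_hidden, toy_mask)
```

The padded vector `[100.0, 100.0]` does not affect the second row of `pooled`, which is the property a batched version of the loop would need.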


### Build the Faiss index

In [6]:
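# Stack the vectors into a (num_docs, dim) float32 matrix, the layout Faiss expects
embedding_matrix = np.stack(corpus_embeddings).astype("float32")
# Exact inner-product index; IndexIDMap stores each vector under its corpus position
index = faiss.IndexIDMap(faiss.IndexFlatIP(embedding_matrix.shape[1]))
index.add_with_ids(embedding_matrix, np.arange(len(corpus), dtype="int64"))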
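`IndexFlatIP` ranks documents by raw inner product, so vectors with a larger norm can score higher regardless of direction. A common adjustment, not part of the original notebook, is to L2-normalize both the corpus and query vectors so that the inner product equals cosine similarity. The sketch below illustrates the effect on toy vectors with NumPy; `l2_normalize` is a hypothetical helper, and Faiss also ships an in-place `faiss.normalize_L2` that could play the same role.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length so that inner product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return (vectors / np.maximum(norms, eps)).astype("float32")

# Toy data: the second document has a large norm but points away from the query
toy_docs = np.array([[3.0, 4.0], [10.0, 0.0], [0.0, 0.5]], dtype="float32")
toy_query = np.array([[1.0, 1.0]], dtype="float32")

raw_scores = toy_query @ toy_docs.T
cosine_scores = l2_normalize(toy_query) @ l2_normalize(toy_docs).T

raw_best = int(raw_scores.argmax())        # winner by raw inner product
cosine_best = int(cosine_scores.argmax())  # winner by cosine similarity
```

On this toy data the two rankings disagree: the raw inner product picks the long vector, while cosine similarity picks the vector closest in direction to the query.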

## Set up the generative language model

In [7]:
gen_model_name = 't5-base'
gen_tokenizer = AutoTokenizer.from_pretrained(gen_model_name)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(gen_model_name)

## Query

In [14]:
query = "regulation that generally prohibits banks from conducting certain investment"

## Retrieve relevant documents

In [15]:
k = 5
inputs = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
    query_embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Search the Faiss index for the top-k documents by inner product
scores, doc_ids = index.search(np.array([query_embedding], dtype="float32"), k)
# Map the returned IDs (corpus positions) back to the document texts
documents = [corpus[doc_id] for doc_id in doc_ids[0]]
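For a corpus this small, the search result can be cross-checked without Faiss, because a flat inner-product index performs an exhaustive comparison. The following is a minimal NumPy sketch of that computation on toy vectors; `top_k_inner_product` is a hypothetical helper, not a Faiss API.

```python
import numpy as np

def top_k_inner_product(query_vec, doc_matrix, k):
    """Return (scores, ids) of the k rows of doc_matrix with the largest inner product."""
    scores = doc_matrix @ query_vec
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return scores[order], order

toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]], dtype="float32")
toy_query = np.array([0.8, 0.6], dtype="float32")

top_scores, top_ids = top_k_inner_product(toy_query, toy_docs, k=2)
```

Running the same function on the real embedding matrix and query embedding should return the same IDs as `index.search`, assuming the index was built from those exact vectors.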

In [16]:
print(f'Found {len(documents)} documents')
# Preview each retrieved document
for i, text in enumerate(documents):
    print(f"Text {i+1}:\n{text[:200]}...\n")  # Print the first 200 characters for preview

Found 5 documents
Text 1:
What Is Bank Capital? Bank capital is the difference between abank's assets and its liabilities, and it represents the net worth of the bank or its equity value to investors. The asset portion of a ba...

Text 2:
What Is a Loan Shark? A loan shark is a person whoor an entity thatloans money at extremely high interest rates and often uses threats of violence to collect debts. The interest rates are generally we...

Text 3:
What Is a Variable Interest Entity (VIE)? A variable interest entity (VIE) is a legal structure in which controlling interest is determined by something other than majority voting rights. Controlling ...

Text 4:
What Is the Earnings Credit Rate (ECR)? The earnings credit rate (ECR) is a daily calculation of interest that a bank pays on customer deposits. The earnings credit rate is often correlated with the U...

Text 5:
What Is an Investment Banker? Investment bankers are financial professionals who advise corporations, as well as governmen

## Generate the response

In [17]:
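# Pass the query and the joined documents as a text pair; truncation=True clips
# the pair to the tokenizer's maximum length, so later documents may be cut off
inputs = gen_tokenizer(query, '\n'.join(documents), return_tensors="pt", truncation=True)
with torch.no_grad():
    # Beam search with 4 beams, capped at 512 generated tokens
    outputs = gen_model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)
response = gen_tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)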

its assets and liabilities.Tier 2 capital is the total capital of the bank. This is the difference between a bank's assets and its liabilities.Tier 1 capital is the net worth of the bank or its equity value to investors.Tier 2 capital is the total capital of the bank's assets and liabilities.Tier 1 capital is the book value of shareholders' equity and any non-interest-bearing assets.Tier 2 capital is the net worth of the bank or its equity.
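Because the tokenizer truncates its input, joining five full articles means that mostly the first retrieved document reaches the generator, which may be part of why the response above echoes the first retrieved entry. One possible mitigation is to give each document a fixed share of the input. The sketch below is an assumption-laden illustration: `build_prompt` and `chars_per_doc` are hypothetical names, the budget is measured in characters rather than tokens, and the `question: ... context: ...` layout is an assumed prompt format that the original notebook does not use.

```python
def build_prompt(question, documents, chars_per_doc=300):
    """Build one prompt string that keeps a short snippet from every document."""
    snippets = []
    for doc in documents:
        # Collapse runs of whitespace, then keep only the leading characters
        snippet = " ".join(doc.split())[:chars_per_doc]
        snippets.append(snippet)
    context = " ".join(snippets)
    return f"question: {question} context: {context}"

toy_documents = ["Alpha   text\nthat is long enough to be cut.", "Beta text."]
prompt = build_prompt("What is alpha?", toy_documents, chars_per_doc=10)
```

The resulting string could then be passed to `gen_tokenizer` as a single text instead of a query and document pair; whether that improves the generated answer would need to be checked empirically.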
