# RAG Implementation using Llama-3 model

This is a simple RAG implementation using all-mpnet-base-v2 embedding model, chromadb vector database and Llama-3 8B model.

Note: Download pytorch with CUDA to use GPU

In [5]:
import PyPDF2
from sentence_transformers import SentenceTransformer
import chromadb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoModelForSequenceClassification
import torch
import os
from pdfminer.high_level import extract_text
import time
import re
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

  from tqdm.autonotebook import tqdm, trange


## Chunking

In [7]:
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
chroma_client = chromadb.Client()



In [8]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [9]:
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
def get_llama3_chat_reponse(messages):
    input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt").to(model.device)
    outputs = model.generate(
    input_ids,
    max_new_tokens=500,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.0001,
    top_p=0.9,)
    response = outputs[0][input_ids.shape[-1]:]
    return (tokenizer.decode(response, skip_special_tokens=True))

In [10]:
# pdf_path = "Research_Papers/paper1.pdf"
# pdf_file = open(pdf_path, 'rb')
# reader = PyPDF2.PdfReader(pdf_file)
# page = reader.getPage(0)
# text = page.extract_text()

In [11]:
# query = "Provide keywords, authors, and other metadata information in a list."
# context = text
# messages = [
#     {"role": "system", "content": "You are a chatbot who creates metadata based on the provided context."},
#     {"role": "system", "content": {context}},
#     {"role": "user", "content": {query}},
# ]
# start_time = time.time()
# response = get_llama3_chat_reponse(messages)
# print(response)
# print("--- %.2f seconds ---" % (time.time() - start_time))

In [12]:
pdf_dir = "Research_Papers"
start_time = time.time()
metadatas = []
file_number = 1
for filename in os.listdir(pdf_dir):
    if (not filename.endswith('.pdf')):
        continue
    chunks = []
    collection = chroma_client.create_collection(
        name=("paper" + f'{file_number:02d}'),
        metadata={"hnsw:space": "cosine"}
    )
    
    pdf_path = os.path.join(pdf_dir, filename)
    print(f"Extracting text from {pdf_path}")
    pdf_file = open(pdf_path, 'rb')
    
    # Create a PDF reader object
    reader = PyPDF2.PdfReader(pdf_file)
    first_page = reader.getPage(0).extract_text()
    query = "Provide keywords, authors, and other metadata information in a list."
    messages = [
        {"role": "system", "content": "You are a chatbot who creates metadata based on the provided context."},
        {"role": "system", "content": {first_page}},
        {"role": "user", "content": {query}},
    ]
    response = get_llama3_chat_reponse(messages)
    metadatas.append(response)
    # Iterate through each page
    for page_num in range(reader.getNumPages()):
        # Get the page
        page = reader.getPage(page_num)
        
        # Extract text from the page
        text = page.extract_text()
        
        # Store the text in the dictionary with the page number as the key
        chunks.append(text)
        
    chunk_embeddings = embedding_model.encode(chunks, normalize_embeddings=True)
    collection.upsert(
    embeddings = chunk_embeddings,
    documents=chunks,
    ids= [str(i) for i in range(len(chunks))]
    )
    # Close the PDF file
    pdf_file.close()
    file_number += 1

print("--- %.2f seconds ---" % (time.time() - start_time))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Extracting text from Research_Papers\paper01.pdf


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
  attn_output = torch.nn.functional.scaled_dot_product_attention(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper02.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper03.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper04.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper05.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper06.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper07.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper08.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper09.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper10.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper11.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper12.pdf


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Extracting text from Research_Papers\paper13.pdf
--- 106.50 seconds ---


In [13]:
# embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', cache_folder = '/data/base_models')
# chunk_embeddings = embedding_model.encode(chunks)
# chunk_embeddings.shape

## Indexing

In [15]:
print(metadatas[5])

assistant

Here is the metadata information extracted from the provided context:

**Metadata Information:**

* **Keywords:** Fault detection, knowledge distillation, multiple failure modes, cross fault mode
* **Authors:** Jinghao Zheng, Chongdang Liu, Linxuan Zhang
* **Title:** Cross-Modal Knowledge Distillation for Fault Detection under Multiple Failure Modes
* **Conference:** China Automation Congress (CAC)
* **Year:** 2021
* **DOI:** 10.1109/CAC53003.2021.9728306
* **Publisher:** IEEE
* **ISBN:** 978-1-6654-2647-3
* **License:** Authorized licensed use limited to: University of Maryland College Park. Downloaded on September 08, 2023 at 05:39:04 UTC from IEEE Xplore. Restrictions apply.


In [16]:
metadata_collection = chroma_client.create_collection(name="metadata", metadata={"hnsw:space": "cosine"})

In [20]:
metadata_embeddings = embedding_model.encode(metadatas)
metadata_collection.add(
    embeddings = metadata_embeddings,
    documents = metadatas,
    ids = [f'paper{i:02d}' for i in range(1, len(metadatas)+1)]
)

In [48]:
query = "What are the two types of models for prognostics?"
results = metadata_collection.query(
    query_embeddings = embedding_model.encode(query).tolist(),
    n_results=2
    )
print(results)
context = '\n\n'.join(results['documents'][0])
print(context)

{'ids': [['paper03', 'paper11']], 'distances': [[0.5564844608306885, 0.5907408595085144]], 'metadatas': [[None, None]], 'embeddings': None, 'documents': [["assistant\n\nHere is the metadata information extracted from the provided text:\n\n**Keywords:**\n\n* Remaining Useful Life (RUL) estimation\n* Abrupt failures\n* Data-driven approaches\n* Long Short Term Memory (LSTM) neural network\n* Prognostics and health management systems\n* Condition-based maintenance\n* System degradation\n* Machine learning\n\n**Authors:**\n\n* Wei Huang\n* Hamed Khorasgani\n* Chetan Gupta\n* Ahmed Farahat\n* Shuai Zheng\n\n**Journal/Conference:**\n\n* Industrial AI Laboratory, Hitachi America Ltd.\n\n**Year:**\n\n* 2018\n\n**Paper Title:**\n\n* Remaining Useful Life Estimation for Systems with Abrupt Failures\n\n**Abstract:**\n\n* A brief summary of the paper's main contributions and findings.\n\n**Categories:**\n\n* Artificial Intelligence\n* Machine Learning\n* Predictive Maintenance\n* Condition-Based M

## Retrieving

In [58]:
def retrieve_vector_db(query):
    results = []
    metadata_results = metadata_collection.query(
    query_embeddings = embedding_model.encode(query).tolist(),
    n_results=2
    )
    for id in metadata_results['ids'][0]:
        collection = chroma_client.get_collection(name=id)
        doc_results = collection.query(
        query_embeddings = embedding_model.encode(query).tolist(),
        n_results=2
        )
        for distance in doc_results['distances'][0]:
            if distance < 0.8:
                new_data = doc_results
                results.append(new_data)
    # return '\n\n'.join(results[0])
    return results

In [62]:
def retrieve_vector_db(query, n_results=2):
    return collection.query(
    query_embeddings = embedding_model.encode(query).tolist(),
    n_results=n_results)

In [52]:
query = "What are the two types of models for prognostics?"
retrieved_results = retrieve_vector_db(query)
print(retrieved_results)

{'ids': [['1', '4']], 'distances': [[0.6661484837532043, 0.7068995237350464]], 'metadatas': [[None, None]], 'embeddings': None, 'documents': [['186 IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING, VOL. 34, NO. 2, MAY 2021\nof expertise knowledge about the physics-of-failure, hold\npotential to learn the degradation patterns from the time seriessensor data and provide accurate TTF predictions. This study\naims to develop an effective data-driven prognostic approach\nthat can achieve reliable multi-mode failure predictions withmulti-sensor data collected from the IME process.\nDespite recent advances in data-driven prognostics, there\nare still two challenging issues in fault prognosis of IME pro-cess: 1) the inherent data discrepancy among different tools.\nIn IME process, the sensor data collected from different tools\nmay have certain distribution discrepancy due to their vari-ous operating conditions or settings, which makes it difﬁcult\nto generalize the learned prognostic knowledg

## Answer Generation

In [None]:
# model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# )

In [None]:
# terminators = [
#     tokenizer.eos_token_id,
#     tokenizer.convert_tokens_to_ids("<|eot_id|>")
# ]
# def get_llama3_chat_reponse(messages):
#     input_ids = tokenizer.apply_chat_template(
#     messages,
#     return_tensors="pt").to(model.device)
#     outputs = model.generate(
#     input_ids,
#     max_new_tokens=500,
#     eos_token_id=terminators,
#     do_sample=True,
#     temperature=0.01,
#     top_p=0.9,)
#     response = outputs[0][input_ids.shape[-1]:]
#     return (tokenizer.decode(response, skip_special_tokens=True))

## RAG

In [184]:
query = "What kind of vectors does one-hot encoding produce?"
retrieved_results = retrieve_vector_db(query)
context = '\n\n'.join(retrieved_results['documents'][0])
# context = '\n\n'.join(retrieved_results[0])
messages = [
    {"role": "system", "content": "You are a chatbot who gives an answer strictly based on the content provided"},
    {"role": "system", "content": {context}},
    {"role": "user", "content": {query}},
]
start_time = time.time()
response = get_llama3_chat_reponse(messages)
print(response)
print("--- %.2f seconds ---" % (time.time() - start_time))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


assistant

One-hot encoding produces binary vectors where all elements are 0, except for one element which is 1. The position of the 1 element corresponds to the class or category that the original value represents.

For example, if you have a categorical variable with three classes: A, B, and C, the one-hot encoding would produce the following vectors:

* For A: [1, 0, 0]
* For B: [0, 1, 0]
* For C: [0, 0, 1]

In this example, the binary vector has three elements, and only one element is 1, which corresponds to the class that the original value represents.
--- 5.56 seconds ---


In [None]:
# query = "What are the two types of models for prognostics?"
# retrieved_results = retrieve_vector_db(query)
# pairs = []
# for i in retrieved_results['documents'][0]:
#     pairs.append([query, i])
# with torch.no_grad():
#     inputs = rerank_tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
#     scores = rerank_model(**inputs, return_dict=True).logits.view(-1, ).float()
# numbers = scores.tolist()
# numbers_array = np.array(numbers)
# indices = np.argsort(numbers_array)[-2:][::-1]
# results = [pairs[i][1] for i in indices.tolist()]
# context = '\n\n'.join(results)
# messages = [
#     {"role": "system", "content": "You are a chatbot who gives an answer and confidence score out of 100 based on the content provided"},
#     {"role": "system", "content": {context}},
#     {"role": "user", "content": {query}},
# ]
# start_time = time.time()
# response = get_llama3_chat_reponse(messages)
# print(response)
# print("--- %.2f seconds ---" % (time.time() - start_time))

In [None]:
print(context)

In [None]:
[int(s) for s in re.findall(r'\b\d+\b', response)]