<a href="https://colab.research.google.com/github/prad69/LLM/blob/main/RAG_LLM_Nvidia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# AI-Assisted Learning for NVIDIA SDKs and Toolkits



---



In [None]:
#@title Install Modules

!pip install accelerate transformers tokenizers
!pip install bitsandbytes einops
!pip install xformers
!pip install langchain
!pip install faiss-gpu
!pip install sentence_transformers
!pip install -q langchain-openai langchain playwright beautifulsoup4
!pip install chromadb

In [None]:
#@title Imports
from torch import cuda, bfloat16
import pickle
import transformers
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from transformers import StoppingCriteria, StoppingCriteriaList
from langchain.llms import HuggingFacePipeline
import torch
from langchain.vectorstores import Chroma
import langchain
from langchain.prompts import PromptTemplate
import time
import os
langchain.debug = False

#Settings for wrap text output for colab
from IPython.display import HTML, display

def my_css():
   display(HTML("""<style>table.dataframe td{white-space: nowrap;}</style>"""))

get_ipython().events.register('pre_run_cell', my_css)



In [None]:
#Mount gdrive to save/load scraped data and Vector DB

from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Helper Functions

## Helper Function - Load Model from Huggingface
Helper function that downloads a pretrained model from Hugging Face and saves it to disk.
On subsequent runs, the saved model is loaded from disk.
Note: uncomment the download-and-save code when running for the first time.

In [None]:

def loadModel(model_id, hf_auth_token ,model_save_path):

  model_config = transformers.AutoConfig.from_pretrained(
      model_id,
      use_auth_token=hf_auth_token
  )

  bnb_config = transformers.BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type='nf4',
      bnb_4bit_use_double_quant=True,
      bnb_4bit_compute_dtype=bfloat16
  )

  # Uncomment the code below to download the model from HF for the first time and save it to disk.

  # model = transformers.AutoModelForCausalLM.from_pretrained(
  #     model_id,
  #     trust_remote_code=True,
  #     config=model_config,
  #     quantization_config=bnb_config,
  #     device_map='auto',
  #     use_auth_token=hf_auth_token
  # )

  # model.save_pretrained(model_save_path)   # Save the model to the specified path
  # print(f"Model saved to {model_save_path}")


  # Load the model from the specified path
  model = transformers.AutoModelForCausalLM.from_pretrained(model_save_path)

  # enable evaluation mode to allow model inference
  model.eval()

  # Report the model's own device instead of relying on a global defined in a later cell
  print(f"Model loaded on {model.device}")
  return model

#Define stopping criteria for the model
def getStoppingCriteria(tokenizer):
  stop_list = ['\nHuman:', '\n```\n']

  # Tokenize each stop string, then move the id tensors to the target device
  stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
  stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]

  # define custom stopping criteria object
  class StopOnTokens(StoppingCriteria):
      def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
          for stop_ids in stop_token_ids:
              if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                  return True
          return False

  stopping_criteria = StoppingCriteriaList([StopOnTokens()])
  return stopping_criteria
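
The check inside `StopOnTokens` is a suffix comparison: generation stops once the tail of the generated token ids equals one of the stop sequences. A minimal torch-free sketch of that logic using plain lists follows. The function name `ends_with_stop_sequence` and the token ids are made up for illustration; real ids come from the tokenizer.

```python
def ends_with_stop_sequence(generated_ids, stop_sequences):
    """Return True if generated_ids ends with any of the stop sequences."""
    for stop_ids in stop_sequences:
        n = len(stop_ids)
        # Compare only when enough tokens have been generated.
        if n and len(generated_ids) >= n and generated_ids[-n:] == stop_ids:
            return True
    return False


# Hypothetical token ids, not the real Llama-2 ids for the stop strings.
stop_sequences = [[13, 29950, 7889], [13, 28956, 13]]

print(ends_with_stop_sequence([5, 9, 13, 29950, 7889], stop_sequences))
print(ends_with_stop_sequence([5, 9, 13, 29950], stop_sequences))
```

The length guard matters: without it, a generated sequence shorter than the stop sequence would be compared against a mismatched slice.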

## Helper Functions - VectorDB
Helper functions that split the scraped documents into chunks and save their embeddings to a vector database (ChromaDB or FAISS).

In [None]:

def saveVectorstoFAISS(docs ,embeddings,  save_path , text_splitter):
  print("saving vectors to FAISS")
  all_splits = text_splitter.split_documents(docs)
  vectorstore = FAISS.from_documents(all_splits, embeddings)
  vectorstore.save_local(save_path)
  print(f"Vectors saved to {save_path}")


def saveVectorstoChroma(docs ,embeddings,  save_path , text_splitter):
  print("saving vectors to ChromaDB")
  all_splits = text_splitter.split_documents(docs)
  print(len(all_splits))
  Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory=save_path)
  print(f"Vectors saved to {save_path}")
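
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It uses a fixed-width sliding window, which is an assumption for illustration only: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its chunks will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last character of the previous one, so a sentence cut at a chunk boundary still appears with some surrounding context in the next chunk.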

## Helper functions to save/load scraped documents to file

In [None]:

def scrapeData(urls ,max_depth,file_path) :

  for url in urls:
    print("Scraping url ---->", url)
    loader = RecursiveUrlLoader(
        url=url, max_depth=max_depth, extractor=lambda x: Soup(x, "html.parser").text
    )
    docs =loader.load()
    print(f"Loaded {len(docs)} documents from {url}")
    saveNewData(docs, file_path)

def saveNewData(new_data, file_path):
  print(f"Saving {len(new_data)} documents to {file_path}")

  # Check if the pickle file exists
  if not os.path.exists(file_path):
    with open(file_path, "wb") as f:
      print("Creating file as it doesn't exist:", file_path)
      pickle.dump(new_data, f)
  else :

    # Load the existing pickle file
    with open(file_path, "rb") as f:
        existing_data = pickle.load(f)

    # Combine the existing data with the new data
    new_data = existing_data + new_data
    with open(file_path, "wb") as f:
        pickle.dump(new_data, f)


def saveScrapedDataToFile(data, file_path):
  with open(file_path, 'wb') as f:
    pickle.dump(data, f)

def loadScrapedDatafromFile(file_path):
  with open(file_path, 'rb') as f:
    docs = pickle.load(f)
  return docs
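
The append behaviour of `saveNewData` (load the existing pickle, concatenate, rewrite) can be exercised in isolation. The sketch below is a self-contained variant that uses a temporary directory and plain strings in place of scraped documents; the name `append_to_pickle` is hypothetical. Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile


def append_to_pickle(new_items, file_path):
    """Append new_items to the list stored in file_path, creating it if needed."""
    existing = []
    if os.path.exists(file_path):
        with open(file_path, "rb") as f:
            existing = pickle.load(f)
    combined = existing + new_items
    with open(file_path, "wb") as f:
        pickle.dump(combined, f)
    return len(combined)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "docs.pkl")
    append_to_pickle(["doc-a", "doc-b"], path)   # first URL's documents
    total = append_to_pickle(["doc-c"], path)    # second URL's documents
    with open(path, "rb") as f:
        stored = pickle.load(f)

print(total, stored)
```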


## Helper function to query the model and print the response and execution time

In [None]:
def queryModel(chain,query,chat_history):
  start_time = time.perf_counter ()

  result = chain({"question": query, "chat_history": chat_history})
  print("Question--->",result['question'])
  print("Answer-->",result['answer'])
  print("Source Docs-->",result['source_documents'])
  chat_history.append((query, result['answer']))
  print("Response time--- %s seconds ---" % (time.perf_counter () - start_time))
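
The way `chat_history` grows across calls can be checked without loading a model. In this sketch, `fake_chain` is a hypothetical stand-in for the retrieval chain that only reports how many earlier turns it received, and `query` mirrors the timing and history bookkeeping of `queryModel`.

```python
import time


def fake_chain(inputs):
    """Stand-in for the retrieval chain: reports how many prior turns it saw."""
    turns = len(inputs["chat_history"])
    return {
        "question": inputs["question"],
        "answer": f"answer after {turns} prior turn(s)",
        "source_documents": [],
    }


def query(chain, question, chat_history):
    """Run one turn, record it in chat_history, and return (answer, seconds)."""
    start = time.perf_counter()
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"], time.perf_counter() - start


history = []
first, _ = query(fake_chain, "What is Nsight Compute?", history)
second, _ = query(fake_chain, "Tell me more about this", history)
print(first)
print(second)
```

Because the same list object is passed on every call, the second question reaches the chain together with the first question and its answer.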


# Testing Helper Functions

## Function to test the model
A custom prompt shapes the responses. A QA chain is built from the vectorstore and the model, then queried to get the results.

In [None]:
def testNvidiaQAModel(model,vectorstore):

  template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use five sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
  {context}
  Question: {question}
  Helpful Answer:"""


  qa_prompt = PromptTemplate(template=template, input_variables=["context", "question"])

  # creating chain
  chain = ConversationalRetrievalChain.from_llm(model, vectorstore.as_retriever(), return_source_documents=True
                                                ,combine_docs_chain_kwargs={"prompt": qa_prompt})

  chat_history = []

  print("-------------------------------")
  query_1 = "what is NVIDIA Nsight compute?"
  queryModel(chain,query_1,chat_history)

  print("-------------------------------")
  query_2 = "What is the NVIDIA CUDA Toolkit?"
  queryModel(chain,query_2,chat_history)

  print("-------------------------------")
  query_3 = "How can I install NVIDIA CUDA Toolkit on Windows?"
  queryModel(chain,query_3,chat_history)


  print("-------------------------------")
  query_4 = "What is the difference between NVIDIA's BioMegatron and Megatron530B LLM?"
  queryModel(chain,query_4,chat_history)
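
To see what the model receives, the prompt can be filled by hand. The sketch below uses plain `str.format` with a shortened copy of the template; it assumes the retrieved chunks are joined with blank lines before being substituted for `{context}`, which may differ from what the chain does internally. The name `build_prompt` is hypothetical.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(context_chunks, question):
    """Join retrieved chunks and substitute them into the QA template."""
    context = "\n\n".join(context_chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    ["The CUDA Toolkit provides a development environment.", "It includes a C/C++ compiler."],
    "What is the NVIDIA CUDA Toolkit?",
)
print(prompt)
```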



## Function to test the model with follow-up questions
Helper function to test the model. A custom prompt shapes the responses. A QA chain is built from the vectorstore and the model.
Follow-up questions are asked to test the model's memory and the relevance of its responses.

In [None]:

def testNvidiaQAModelMemory(llm , vectorstore):

  template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
  Use five sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
  {context}
  Question: {question}
  Helpful Answer:"""


  qa_prompt = PromptTemplate(template=template, input_variables=["context", "question"])

  # creating chain
  qa_chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True
                                                ,combine_docs_chain_kwargs={"prompt": qa_prompt})

  chat_history = []
  query_1 = "Shouldn’t gpu__cycles_active >= sm__cycles_active.max since SM active means the GPU is active? Or is my understanding incorrect?"
  queryModel(qa_chain,query_1,chat_history)

  print("-------------------------------")
  followup_query_1 = "Tell me more about this"
  queryModel(qa_chain,followup_query_1,chat_history)


  print("-------------------------------")
  followup_query_2 = "What are the advantages of this?"
  queryModel(qa_chain,followup_query_2,chat_history)

# Model Execution

## Scrape data
Scrape a list of URLs and extract the text from the web documents.
The scraped documents are saved to a file and later loaded for conversion to vectors.


In [None]:

scraped_data_file_path = '/content/drive/MyDrive/ColabNotebooks/datasets/NvidiaData/nvidia_scraped_docs.pkl'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

#Scrape Data Settings
max_depth_to_scrape = 3 # set to 2 for faster scraping
urls_to_scrape = [ "https://docs.nvidia.com/",
                  "https://forums.developer.nvidia.com/c/developer-tools/106",
                  "https://developer.nvidia.com/blog",
                  "https://medium.com/search?q=nvidia+sdk"]

#Note: scrape data only during the first run. Scraping is time-consuming, so all scraped data is stored on disk.
#scrapeData(urls_to_scrape, max_depth_to_scrape , scraped_data_file_path)

#load scraped data from disk
docs = loadScrapedDatafromFile(scraped_data_file_path)  #the data file is shared in the zip file submitted
print(f"Loaded {len(docs)} documents")

## Save/Load - Vector DB

In [None]:
vectorDB_file_path = '/content/drive/MyDrive/ColabNotebooks/datasets/NvidiaData/FAISS_index'

#Embedding model
embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=model_kwargs)

#Text splitter settings. These hyperparameters can impact the model.
chunk_size=1000
chunk_overlap=20
text_splitter = RecursiveCharacterTextSplitter(chunk_size= chunk_size, chunk_overlap=chunk_overlap)


#save embeddings to the FAISS Vector data store
#saveVectorstoFAISS(docs, embeddings, vectorDB_file_path, text_splitter) ##uncomment if vectorDB needs to be created for the first time ,else load from the existing Vector DB
#load saved data from VectorDB from file
vectorstore = FAISS.load_local(vectorDB_file_path,embeddings)


#save embeddings to the Chroma Vector data store
# saveVectorstoChroma(docs, embeddings, vectorDB_file_path, text_splitter)
# #load previously saved Docs from Chroma DB
# vectorstore = Chroma(persist_directory=vectorDB_file_path, embedding_function=embeddings)
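
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that nearest-neighbour lookup with toy three-dimensional vectors and Euclidean distance. The vectors, document ids, and the name `top_k` are invented for illustration; real embeddings from `all-mpnet-base-v2` have many more dimensions, and the distance metric used by a given vector store is an assumption here.

```python
import math


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k vectors closest to query_vec (Euclidean distance)."""
    scored = sorted(
        (math.dist(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()
    )
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings: hypothetical values, one vector per stored chunk.
doc_vecs = {
    "cuda-install": [0.9, 0.1, 0.0],
    "nsight": [0.1, 0.9, 0.0],
    "forum": [0.0, 0.2, 0.9],
}

nearest = top_k([0.8, 0.2, 0.0], doc_vecs, k=2)
print(nearest)
```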

## Load Language Model


In [None]:

#Model used for the project
model_id = 'meta-llama/Llama-2-7b-chat-hf'
#Huggingface token
hf_auth_token = 'xxxxxxxxxx'
# Define the path where you want to save the model
model_save_path = "/content/drive/MyDrive/ColabNotebooks/Models/Llama-2-7b-chat-hf.pt"

pretrained_model = loadModel(model_id, hf_auth_token,model_save_path)
tokenizer = transformers.AutoTokenizer.from_pretrained(
      model_id,
      use_auth_token=hf_auth_token
  )


qa_pipeline = transformers.pipeline(
      model=pretrained_model,
      tokenizer=tokenizer,
      return_full_text=True,
      #task='document-question-answering',
      task='text-generation',
      stopping_criteria=getStoppingCriteria(tokenizer),
      temperature=0.1,  # 'randomness' of outputs; lower values are more deterministic (only applied when sampling is enabled)
      max_new_tokens=512,  # max number of tokens to generate in the output
      repetition_penalty=1.1  # without this output begins repeating
  )

nvidia_qa_model_pipeline = HuggingFacePipeline(pipeline=qa_pipeline)


`low_cpu_mem_usage` was None, now set to True since model is quantized.


Model loaded on cuda:0
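
The effect of a low `temperature` such as 0.1 can be illustrated with a small softmax example. Dividing the logits by the temperature before normalising sharpens the distribution when the temperature is below 1. The logits and the name `softmax_with_temperature` are made up for illustration; this is a sketch of the general idea, not the pipeline's internal code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.1)
flat = softmax_with_temperature(logits, 1.0)
print([round(p, 4) for p in sharp])
print([round(p, 4) for p in flat])
```

At a temperature of 0.1 almost all probability mass lands on the highest-scoring token, so sampled output stays close to greedy decoding.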




## Test the Nvidia QA Model

In [None]:
testNvidiaQAModel(nvidia_qa_model_pipeline , vectorstore)

#test with Initial queries and followup queries
testNvidiaQAModelMemory(nvidia_qa_model_pipeline, vectorstore)

-------------------------------




Question---> what is NVIDIA Nsight compute?
Answer-->  NVIDIA Nsight Compute is a system-wide performance analysis tool designed to visualize an application's algorithms. It provides various features such as kernel profiling, customization options, and training resources. Thanks for asking!
Source Docs--> [Document(page_content='Nsight Compute Documentation\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNVIDIANsight Compute Documentation\n\nSearch In:\nEntire Site\nJust This Document\nclear search\nsearch\n\n\n\nNsight Compute\n\nRelease Notes\nKernel Profiling Guide\nNsight Compute\nNsight Compute CLI\n\nDeveloper Interfaces\n\nCustomization Guide\nNvRules API\n\nTraining\n\nTraining\n\nRelease Information\n\nArchives\n\nCopyright And Licenses\n\nCopyright and Licenses\n\n\n\n\nSearch Results\n\n\n\n\nNsight Compute', metadata={'source': 'https://docs.nvidia.com/nsight-compute/', 'title': 'Nsight Compute Documentation', 'language': 'en'}), Document(page_content='Release Notes\n\r\n                 



Question---> What is the NVIDIA CUDA Toolkit?
Answer-->   The NVIDIA CUDA Toolkit is a comprehensive development environment for C and C++ developers building GPU-accelerated applications. It provides a range of tools and libraries to help developers develop, optimize, and deploy their applications on various hardware platforms, including embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. Thanks for asking!
Source Docs--> [Document(page_content='The NVIDIA® CUDA® Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your appl



Question---> How can I install NVIDIA CUDA Toolkit on Windows?
Answer-->   Thanks for asking! To install NVIDIA CUDA Toolkit on Windows, you can follow these steps:

1. Download the NVIDIA driver from the official website and install it on your computer.
2. Install the CUDA Toolkit by running the installer and following the on-screen instructions.
3. Once the installation is complete, open a command prompt and type "cuda-toolkit-11-0" to verify the installation.

If you have any further questions or concerns, feel free to ask!
Source Docs--> [Document(page_content='CUDA Toolkit\n\n\n                           After installing the NVIDIA driver, Fabric Manager and NSCQ, you can proceed to \n                           install the CUDA Toolkit on the system to build CUDA applications. Note that if you \n                           are deploying CUDA applications only, then the CUDA Toolkit is not necessary as \n                           the CUDA application should include the dependencies



Question---> What is the difference between NVIDIA's BioMegatron and Megatron530B LLM?
Answer-->  Thanks for asking! NVIDIA's BioMegatron and Megatron530B LLMs are both large language models, but they have some differences. BioMegatron is a variant of the Megatron model that uses a different training strategy, called "biological" training, which involves fine-tuning the model on a variety of biological datasets. This approach can result in better performance on certain tasks, such as text classification or sentiment analysis. On the other hand, Megatron530B is a more recent version of the Megatron model that has been further optimized for improved performance on a wide range of NLP tasks. So while BioMegatron may have some advantages in certain areas, Megatron530B is likely to be a more versatile and powerful option for many use cases.
Source Docs--> [Document(page_content='All of these products (nvidia-smi, NVML, and the NVML language bindings) are updated with each new CUDA release a