# RAG
- RAG is a technique for augmenting LLM knowledge with additional data.
- LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific cutoff date.
- If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs.
- The process of retrieving the appropriate information and inserting it into the model prompt is known as retrieval-augmented generation (RAG).
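
The retrieve-then-insert flow described above can be sketched without any libraries. This is a toy illustration, not how the rest of this notebook does it: the `DOCS` list, the word-overlap scoring, and the `retrieve` and `build_prompt` helpers are all hypothetical stand-ins for an embedding model and a vector store.

```python
import re
from collections import Counter

# Hypothetical in-memory knowledge base; a real system would use a vector store.
DOCS = [
    "NVIDIA reported record data center revenue in the third quarter.",
    "Milvus is an open-source vector database.",
    "Mixtral 8x7B is a sparse mixture-of-experts language model.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, docs, k=1):
    """Return the k documents sharing the most word tokens with the question."""
    question_counts = Counter(tokenize(question))

    def overlap(doc):
        # Multiset intersection counts the tokens both texts have in common.
        return sum((question_counts & Counter(tokenize(doc))).values())

    return sorted(docs, key=overlap, reverse=True)[:k]

def build_prompt(question, docs):
    """Insert the retrieved text into the prompt ahead of the question."""
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What is Milvus?", DOCS)
print(prompt)
```

The model only sees what `build_prompt` puts in front of it, so retrieval quality directly bounds answer quality. That is why the notebook replaces the word-overlap scoring with embeddings and a vector database.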

# LangChain
LangChain is a framework for developing applications powered by large language models (LLMs).

https://python.langchain.com/docs/introduction/

# NIM
NIM is a set of optimized cloud-native microservices designed to shorten time-to-market and simplify deployment of generative AI models anywhere, across cloud, data center, and GPU-accelerated workstations. It expands the developer pool by abstracting away the complexities of AI model development and packaging for production using industry-standard APIs.

https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/

https://docs.api.nvidia.com/nim/reference/llm-apis

![image.png](images/NIM.png)
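
To make "industry-standard APIs" concrete, here is a minimal sketch of a chat-completions style request of the kind the LLM API reference above describes. The endpoint URL, the `build_chat_request` helper, and the placeholder key are assumptions for illustration; check the linked reference for the exact fields. Nothing is sent over the network here.

```python
import json

# Assumed endpoint for the hosted API catalog; verify against the linked reference.
NIM_CHAT_URL = "https://integrate.api.nvidia.com/v1/chat/completions"

def build_chat_request(api_key, question,
                       model="mistralai/mixtral-8x7b-instruct-v0.1",
                       max_tokens=1024):
    """Build the headers and JSON body for a chat-completions style request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }
    return headers, json.dumps(body)

# "nvapi-PLACEHOLDER" is a dummy value, not a real key.
headers, body = build_chat_request("nvapi-PLACEHOLDER", "What is RAG?")
print(NIM_CHAT_URL)
print(body)
```

In this notebook the `ChatNVIDIA` class from `langchain_nvidia_ai_endpoints` handles this plumbing, so the request is never built by hand.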

# NVIDIA API Catalog
https://docs.api.nvidia.com/
 
- NVIDIA API Catalog is a hosted platform for accessing a wide range of microservices online.
- You can test models on the catalog and then export them with an NVIDIA AI Enterprise license for on-premises or cloud deployment.
  
# Milvus vector store
https://milvus.io/docs

Milvus is a high-performance, highly scalable vector database that runs efficiently across a wide range of environments, from a laptop to large-scale distributed systems. It is available as both open-source software and a cloud service.

Milvus is an open-source project under LF AI & Data Foundation distributed under the Apache 2.0 license. Most contributors are experts from the high-performance computing (HPC) community, specializing in building large-scale systems and optimizing hardware-aware code.
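
At its core, a vector database answers one question: which stored vectors are closest to a query vector? The sketch below shows that idea as a brute-force cosine-similarity search in plain Python. The `store` contents and the `top_k` helper are made up for illustration, and Milvus itself relies on specialized indexes rather than a full scan like this one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return (id, score) pairs for the k stored vectors most similar to query."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional embeddings; real embedding models produce far larger vectors.
store = {
    "q3_revenue": [0.9, 0.1, 0.0],
    "q1_revenue": [0.7, 0.3, 0.1],
    "gaming_news": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], store)
print(results)
```

A full scan costs time proportional to the number of stored vectors for every query, which is the cost a dedicated vector database is designed to avoid at scale.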

# Mistral mixtral-8x7b-instruct

https://docs.api.nvidia.com/nim/reference/mistralai-mixtral-8x7b-instruct


Mixtral 8x7B Instruct is a language model that can follow instructions, complete requests, and generate creative text formats. Mixtral 8x7B is a high-quality sparse mixture of experts model (SMoE) with open weights.

This model has been optimized through supervised fine-tuning and direct preference optimization (DPO) for careful instruction following. On MT-Bench, it reaches a score of 8.30, which made it the best open-weight model at the time of its release, with performance comparable to GPT-3.5.

According to Mistral AI's release announcement, Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference, and it matches or outperforms GPT-3.5 on most standard benchmarks. At release, it was described as the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs.

Mixtral has the following capabilities.

- It gracefully handles a context of 32k tokens.
- It handles English, French, Italian, German and Spanish.
- It shows strong performance in code generation.
- It can be fine-tuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.
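
The "sparse mixture of experts" design mentioned above means that only a few experts run for each token. The following is a minimal sketch of top-2 routing over eight experts, the configuration Mixtral 8x7B uses. The toy experts, the router scores, and the helper names `top2_route` and `moe_layer` are invented for illustration and are not the model's actual implementation.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:2]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(x, experts, router_logits):
    """Combine the outputs of only the selected experts, weighted by the router."""
    return sum(weight * experts[i](x) for i, weight in top2_route(router_logits))

# Eight toy "experts": each one just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]

# Made-up router scores for a single token.
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]

routes = top2_route(logits)
output = moe_layer(1.0, experts, logits)
print(routes)
print(output)
```

Because six of the eight experts are skipped for each token, the compute per token is much lower than the total parameter count suggests, which is consistent with the inference-speed claim above.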

In [None]:
from dotenv import dotenv_values
import os
# read env file
ROOT_DIR = os.getcwd()
config = dotenv_values(os.path.join(ROOT_DIR, "keys", ".env"))

In [None]:
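# fail early with a clear message if the key is missing from keys/.env
api_key = config.get('NVIDIA_API_KEY')
if api_key is None:
    raise KeyError("NVIDIA_API_KEY not found in keys/.env")
os.environ['NVIDIA_API_KEY'] = api_key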

In [None]:
# test run to check that you can generate a response successfully
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", max_tokens=1024)
embedder_document = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")

In [None]:
import requests

urls_content = []

url_template1 = "https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-{quarter}-quarter-fiscal-{year}"
url_template2 = "https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-{quarter}-quarter-and-fiscal-{year}"

for quarter in ["first", "second", "third", "fourth"]:
    for year in range(2020, 2025):
        # fourth-quarter press releases use the "...-quarter-and-fiscal-..." URL pattern
        template = url_template2 if quarter == "fourth" else url_template1
        url = template.format(quarter=quarter, year=year)
        # a timeout keeps the loop from hanging on an unresponsive server
        urls_content.append(requests.get(url, timeout=30).content)

In [None]:
# extract the url, title, text content, and tables in the html
from bs4 import BeautifulSoup
import markdownify

def extract_url_title_time(soup):
    """Return (url, title, text_content, tables) parsed from a BeautifulSoup page."""
    url = ""
    title = ""
    tables = []
    try:
        if soup.find("title"):
            title = str(soup.find("title").string)

        og_url_meta = soup.find("meta", property="og:url")
        if og_url_meta:
            url = og_url_meta.get("content", "")

        for table in soup.find_all("table"):
            tables.append(markdownify.markdownify(str(table)))
            table.decompose()

        text_content = soup.get_text(separator=' ', strip=True)
        text_content = ' '.join(text_content.split())

        return url, title, text_content, tables
    except Exception as e:
        # return the same 4-tuple shape so callers can always unpack it
        print(f"parse error: {e}")
        return "", "", "", []

parsed_htmls = []
for url_content in urls_content:
    soup = BeautifulSoup(url_content, 'html.parser')
    url, title, content, tables = extract_url_title_time(soup)
    parsed_htmls.append({"url":url, "title":title, "content":content, "tables":tables})

In [None]:
parsed_htmls[0]["url"]

In [None]:
parsed_htmls[0]["tables"][0]

In [None]:
# summarize tables
def get_table_summary(table, title, llm):
    res = ""
    try:
        #table = markdownify.markdownify(table)
        prompt = f"""
                    [INST] You are a virtual assistant.  Your task is to understand the content of TABLE in the markdown format.
                    TABLE is from "{title}".  Summarize the information in TABLE into SUMMARY. SUMMARY MUST be concise. Return SUMMARY only and nothing else.
                    TABLE: ```{table}```
                    Summary:
                    [/INST]
                """
        result = llm.invoke(prompt)
        res = result.content
    except Exception as e:
        print(f"Error: {e} while getting table summary from LLM")
        if not os.getenv("NVIDIA_API_KEY"):
            print("NVIDIA_API_KEY not set")
    # fall back to an empty summary if the LLM call failed
    return res


for parsed_item in parsed_htmls:
    title = parsed_item['title']
    for idx, table in enumerate(parsed_item['tables']):
        print(f"parsing tables in {title}...")
        table = get_table_summary(table, title, llm)
        parsed_item['tables'][idx] = table

In [None]:
parsed_item.keys()

In [None]:
len(parsed_htmls)

In [None]:
parsed_htmls[0]['tables']

In [None]:
parsed_item['url']

In [None]:
parsed_item['title']

In [None]:
#parsed_item['content']

In [None]:
parsed_item['tables'][0]

# Splitter Model
- https://huggingface.co/intfloat/e5-large-v2
- https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html
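
The splitter cuts each document into fixed-size token windows that overlap, so a sentence falling on a chunk boundary still appears whole in at least one chunk. Here is a minimal sketch of that sliding-window idea. The `chunk_tokens` helper is hypothetical, and simple string tokens stand in for the model tokenizer the real splitter uses.

```python
def chunk_tokens(tokens, tokens_per_chunk, chunk_overlap):
    """Slide a window of tokens_per_chunk tokens, stepping by size minus overlap."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

# Ten placeholder tokens; the notebook uses 200-token chunks with a 50-token overlap.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, tokens_per_chunk=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

A larger overlap preserves more context across boundaries but increases the number of chunks to embed and store.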

In [None]:
from langchain_milvus import Milvus
from langchain_core.documents import Document
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

TEXT_SPLITTER_MODEL = "intfloat/e5-large-v2"
TEXT_SPLITTER_CHUNK_SIZE = 200     # tokens per chunk
TEXT_SPLITTER_CHUNK_OVERLAP = 50   # tokens shared between neighboring chunks

text_splitter = SentenceTransformersTokenTextSplitter(
    model_name=TEXT_SPLITTER_MODEL,
    tokens_per_chunk=TEXT_SPLITTER_CHUNK_SIZE,
    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,
)

documents = []

for parsed_item in parsed_htmls:
    title = parsed_item['title']
    url =  parsed_item['url']
    text_content = parsed_item['content']
    documents.append(Document(page_content=text_content, metadata = {'title':title, 'url':url}))

    # index each table summary as its own document
    for table in parsed_item['tables']:
        documents.append(Document(page_content=table, metadata={'title': title, 'url': url}))

documents = text_splitter.split_documents(documents)
print(f"obtained {len(documents)} chunks")

In [None]:
documents[0]

In [None]:
URI = "./milvus_example.db"

In [None]:
COLLECTION_NAME = "NVIDIA_Finance"
vectorstore = Milvus.from_documents(
    documents,
    embedder_document,
    collection_name=COLLECTION_NAME,
    # URI points to a local Milvus Lite database file; to use a Milvus server instead,
    # replace it with the server address, e.g. "http://<host>:19530"
    connection_args={"uri": URI},
    drop_old=True,  # drop any existing collection with the same name
)

In [None]:
docs = vectorstore.similarity_search_with_score("what are 2024 Q3 revenues?")

In [None]:
docs

In [None]:
from langchain_core.prompts import PromptTemplate

PROMPT_TEMPLATE = """[INST]You are a friendly virtual assistant and maintain a conversational, polite, patient, friendly and gender neutral tone throughout the conversation.

Your task is to understand the QUESTION, read the Content list from the DOCUMENT delimited by ```, generate an answer based on the Content, and provide references used in answering the question in the format "[Title](URL)".
Do not depend on outside knowledge or fabricate responses.
DOCUMENT: ```{context}```

Your response should follow these steps:

1. The answer should be short, concise, and clear.
    * If detailed instructions are required, present them in an ordered list or bullet points.
2. If the answer to the question is not available in the provided DOCUMENT, ONLY respond that you couldn't find any information related to the QUESTION, and do not show references and citations.
3. Citation
    * ALWAYS start the citation section with "Here are the sources used to generate this response:" and follow with references in markdown link format [Title](URL) to support the answer.
    * Use Bullets to display the reference [Title](URL).
    * You MUST ONLY use the URL extracted from the DOCUMENT as the reference link. DO NOT fabricate or use any link outside the DOCUMENT as reference.
    * Avoid over-citation. Only include references that were directly used in generating the response.
    * If no reference URL can be provided, remove the entire citation section.
    * The Citation section can include one or more references. DO NOT include same URL as multiple references. ALWAYS append the citation section at the end of your response.
    * You MUST follow the below format as an example for this citation section:
      Here are the sources used to generate this response:
      * [Title](URL)
[/INST]
[INST]
QUESTION: {question}
FINAL ANSWER:[/INST]"""

prompt_template = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["context", "question"])

In [None]:
def build_context(chunks):
    context = ""
    for chunk in chunks:
        context = context + "\n  Content: " + chunk.page_content + " | Title: (" + chunk.metadata["title"] + ") | URL: (" + chunk.metadata.get("url", "source") + ")"
    return context


def generate_answer(llm, vectorstore, prompt_template, question):
    retrieved_chunks = vectorstore.similarity_search(question)
    context = build_context(retrieved_chunks)
    args = {"context":context, "question":question}
    prompt = prompt_template.format(**args)
    ans = llm.invoke(prompt)
    return ans.content


question = "what are 2024 Q1 revenues?"

In [None]:
generate_answer(llm, vectorstore, prompt_template, question)