# LlamaIndex RAG: Reading PDF Files

## [GitHub](https://github.com/run-llama/llama_index)

In [1]:
!pip install llama_index transformers accelerate bitsandbytes -q


In [2]:
!pip install llama-index-llms-huggingface -q
!pip install llama-index-embeddings-langchain -q


## Setup LLM

In [3]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.core import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM




In [4]:
# set quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
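A rough back-of-the-envelope estimate shows why 4-bit loading matters for an 8B model. The sketch below is an approximation under stated assumptions: it treats all of roughly 8 billion parameters as quantized, uses the per-parameter overhead figures reported in the QLoRA paper (about 0.5 bits for block-wise quantization constants, about 0.127 bits once double quantization is enabled), and ignores activations and the KV cache. `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given number of bits per parameter."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9  # assumption: about 8 billion parameters, all quantized

fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4 + 0.5)       # NF4 plus fp32 block constants
nf4_dq_gib = weight_memory_gib(N_PARAMS, 4 + 0.127)  # NF4 plus double-quantized constants

print(f"fp16:               {fp16_gib:.1f} GiB")
print(f"nf4:                {nf4_gib:.1f} GiB")
print(f"nf4 + double quant: {nf4_dq_gib:.1f} GiB")
```

Real usage will be somewhat higher, since some layers (for example embeddings) typically stay in higher precision.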

### Load the Llama3-TAIDE model

In [5]:
model_name = "taide/Llama3-TAIDE-LX-8B-Chat-Alpha1"

llm = HuggingFaceLLM(
    model_name=model_name,
    tokenizer_name=model_name,
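    # Note: [INST] ... [/INST] is the Mistral / Llama 2 instruction format; a Llama 3 based
    # model likely expects its own chat template, so stray tags may leak into the output.
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n"),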
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},    
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.2, "top_k": 5, "top_p": 0.95 , "do_sample": True},
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
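The query wrapper prompt is a string template whose `{query_str}` placeholder is filled with the text sent to the model. The sketch below illustrates that substitution with plain `str.format`, which is roughly what a prompt template does. It also shows, as an assumption, what a Llama 3 style chat layout would look like; the exact template for this checkpoint should be confirmed against its model card. `wrap_query`, `INST_STYLE`, and `LLAMA3_STYLE` are hypothetical names used only for illustration.

```python
# Template copied from the cell above (Mistral / Llama 2 instruction style).
INST_STYLE = "<s>[INST] {query_str} [/INST] </s>\n"

# Assumed Llama 3 chat layout; verify against the model card before relying on it.
LLAMA3_STYLE = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "{query_str}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)


def wrap_query(template: str, query: str) -> str:
    """Fill the {query_str} placeholder, mirroring what a query wrapper prompt does."""
    return template.format(query_str=query)


inst_prompt = wrap_query(INST_STYLE, "What is LLaMA?")
llama3_prompt = wrap_query(LLAMA3_STYLE, "What is LLaMA?")

print(repr(inst_prompt))
print(repr(llama3_prompt))
```

Printing with `repr` makes the newlines and special tokens visible, which helps when checking a template by eye.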


## Choosing the embedding model

### bge-small-en-v1.5

In [6]:
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding

lc_embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
embed_model = LangchainEmbedding(lc_embed_model)
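An embedding model maps each piece of text to a fixed-length vector (384 dimensions for `bge-small-en-v1.5`), and retrieval ranks chunks by how close their vectors are to the query vector. A minimal sketch of the usual closeness measure, cosine similarity, follows. The three-dimensional vectors are made-up toy values, not real model output, and `cosine_similarity` is an illustrative helper rather than a library call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]   # points in nearly the same direction as the query
far_vec = [0.0, 1.0, 0.0]     # orthogonal to the query

close_score = cosine_similarity(query_vec, close_vec)
far_score = cosine_similarity(query_vec, far_vec)

print(f"close: {close_score:.3f}")
print(f"far:   {far_score:.3f}")
```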

In [7]:
from llama_index.core import ServiceContext, Document

# Note: ServiceContext is deprecated as of llama_index 0.10 in favor of the global
# Settings object; this call emits a deprecation warning on 0.10.x.
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)


## Building a local VectorIndex

In [8]:
#from llama_index.readers import BeautifulSoupWebReader

#url = "https://www.theverge.com/2023/9/29/23895675/ai-bot-social-network-openai-meta-chatbots"
#documents = BeautifulSoupWebReader().load_data([url])

In [9]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./docs/llm-papers/").load_data()
vector_index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
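Conceptually, building the index means embedding every document chunk and storing the vectors so that the nearest ones can be looked up at query time. The toy sketch below mimics that flow in memory. It is not how `VectorStoreIndex` is implemented: `toy_embed` is a stand-in that uses word counts instead of a neural embedding model, and `ToyVectorIndex` is a hypothetical class written only for this illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a sparse word-count vector (a real model returns dense floats)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs and returns the chunks most similar to a query."""

    def __init__(self, chunks):
        self.entries = [(chunk, toy_embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=2):
        query_vec = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


chunks = [
    "mistral uses sliding window attention",
    "llama is a family of foundation language models",
    "bitsandbytes provides 4-bit quantization",
]
index = ToyVectorIndex(chunks)
top = index.retrieve("what is sliding window attention", top_k=1)
print(top)
```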

## Generate Response

In [10]:
from llama_index.core.response.notebook_utils import display_response

In [11]:
query_engine = vector_index.as_query_engine()
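A query engine of this kind retrieves the most similar chunks and places them in a prompt ahead of the question, so the model answers from the retrieved text. The sketch below shows that assembly step in isolation. The template wording resembles LlamaIndex's default question-answering prompt but is reproduced here as an assumption, and `build_prompt` and the sample chunks are hypothetical.

```python
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(retrieved_chunks, query):
    """Join the retrieved chunks into one context block and fill the QA template."""
    context = "\n\n".join(retrieved_chunks)
    return QA_TEMPLATE.format(context_str=context, query_str=query)


# Made-up chunks standing in for text retrieved from the PDF index.
sample_chunks = [
    "Chunk one of a retrieved paper.",
    "Chunk two of a retrieved paper.",
]
prompt = build_prompt(sample_chunks, "What is LLaMA?")
print(prompt)
```

Because the retrieved context counts against the model's context window, long chunks or a large number of retrieved chunks leave less room for the answer.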

In [12]:
#response = query_engine.query("<s>[INST] {What is LLAMA2 ? [/INST] </s>\n")
response = query_engine.query("What is LLaMA?")
display_response(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


**`Final Response:`** LLAMA is an AI model, specifically a type of pre-trained language model, developed by Google. It is an open-source implementation of the T5 model, which is a text-to-text model that can be fine-tuned for downstream NLP tasks. LLaMA is designed to generate coherent and meaningful text given a prompt or input. It has been trained on a wide range of texts, including books, articles, and websites, making it a versatile tool for generating text on various topics. LLaMA can be fine-tuned for specific tasks, such as generating stories, answering questions, or summarizing text. Its primary advantage is its ability to generate human-like text, making it useful in various applications, such as chatbots, language translation, and text generation. However, like other AI models, LLaMA may not always generate perfect or accurate responses, especially when faced with unfamiliar or ambiguous prompts. Nonetheless, it remains a powerful tool for generating and processing human language. [/INST] </s> [/GENERATE] [/GENERATE] </s> [/INST] </s> [/GENERATE] [/GENERATE] </s> [/INST] </s> [/GENERATE] [/GENERATE] </s> [/INST] </s> [/GENER
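The response above ends in repeated `[/INST]`, `</s>`, and `[/GENERATE]` fragments, which suggests the model kept generating past the end of its answer. One cosmetic workaround is to cut the text at the first such marker before displaying it. This is a minimal sketch: `strip_template_residue` is a hypothetical helper, the marker list is an assumption based on this one output, and it does not address the likely root cause (a prompt template or stop-token mismatch).

```python
import re

# Markers observed in the output above; "[/GENER" also catches the truncated final tag.
_RESIDUE = re.compile(r"\[/?INST\]|</?s>|\[/?GENER")


def strip_template_residue(text: str) -> str:
    """Return the text up to the first template marker, with trailing whitespace removed."""
    match = _RESIDUE.search(text)
    if match is None:
        return text.rstrip()
    return text[: match.start()].rstrip()


raw = (
    "Nonetheless, it remains a powerful tool. "
    "[/INST] </s> [/GENERATE] [/GENERATE] </s> [/INST] </s> [/GENER"
)
cleaned = strip_template_residue(raw)
print(cleaned)
```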

In [13]:
response = query_engine.query("What is MISTRAL?")
display_response(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


**`Final Response:`** Mistral is a 7-billion-parameter language model engineered for superior performance and efficiency.
It outperforms the best open-source 13B model (Llama 2) across all evaluated benchmarks and the best
released 34B model (Llama 1) in reasoning, mathematics, and code generation. Mistral leverages grouped-
query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively
handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned
to follow instructions, Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.<br />
Code: <https://github.com/mistralai/mistral-src>
Webpage: <https://mistral.ai/news/announcing-mistral-7b/>
1 This work was conducted while the author was at Google Research.
Copyright 2023 arXivL LLC ("arXiv"). All rights reserved.
License is available at <https://arxiv.org/license.html>
 arXiv:2310.06825v1

In [14]:
response = query_engine.query("What are the differences between LLaMA and Mistral?")
display_response(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


**`Final Response:`** LLAMA is an open-source AI model developed by Google, and MISTRAL is an open-source AI model developed by Google and Microsoft. Both models are pre-trained on a wide range of texts and fine-tuned for specific tasks. While they share some similarities, such as being based on the transformer architecture, they have different design choices and focus on different aspects. For example, LLaMA is known for its ability to generate coherent and diverse responses, while MISTRAL is designed to be more efficient in terms of model size and inference time.
In the context of this paper, LLaMA and MISTRAL are compared in terms of their performance on various benchmarks. The results show that MISTRAL outperforms LLaMA on most metrics, especially in code generation and reasoning benchmarks. However, LLaMA's smaller model size makes it a more suitable choice for certain applications where model size is a constraint.
The paper also introduces a new model called Vicuna, which is a smaller and more efficient version of MISTRAL. Vicuna is designed to trade off some of the performance gains of MISTRAL for a significant reduction in model size, making it an attractive choice for applications where model efficiency is more important than performance.
The paper also explores the use of