### Installing Necessary Libraries

In [1]:
!pip install pypdf
!pip install -q transformers
!pip install sentence_transformers
!pip install llama_index
!pip install llama-index-llms-huggingface
!pip install accelerate
!pip install bitsandbytes
!pip install einops
!pip install langchain
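
After the installs finish, it can help to confirm that each package actually resolved. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are hypothetical names introduced here, with the distribution names copied from the pip commands above.

```python
from importlib import metadata

# Distribution names copied from the pip commands in the cell above.
REQUIRED = [
    "pypdf", "transformers", "sentence_transformers", "llama_index",
    "llama-index-llms-huggingface", "accelerate", "bitsandbytes",
    "einops", "langchain",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry usually means the corresponding pip command failed or the runtime was restarted before the import cell ran.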



In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts import PromptTemplate




### Loading the data

In [3]:
documents = SimpleDirectoryReader("/content/")

In [4]:
doc = documents.load_data()
doc

[Document(id_='0ffca711-27a2-45f3-b08c-6c84a3f86f2e', embedding=None, metadata={'page_label': '1', 'file_name': 'lora.pdf', 'file_path': '/content/lora.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2024-05-09', 'last_modified_date': '2024-05-09'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='LORA: L OW-RANK ADAPTATION OF LARGE LAN-\nGUAGE MODELS\nEdward Hu∗Yelong Shen∗Phillip Wallis Zeyuan Allen-Zhu\nYuanzhi Li Shean Wang Lu Wang Weizhu Chen\nMicrosoft Corporation\n{edwardhu, yeshe, phwallis, zeyuana,\nyuanzhil, swang, luw, wzchen }@microsoft.com\nyuanzhil@andrew.cmu.edu\n(Version 2)\nABSTRACT\nAn important paradigm of natural language processing consists of large-scale pre-\ntraining on general domain data and adap
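
Printing the whole list dumps every page of the PDF into the cell output. A compact summary is often easier to read. The sketch below assumes only that each loaded object exposes `.metadata` (a dict) and `.text` (a string), as the output above suggests; `summarize_documents` is a hypothetical helper, not part of llama_index.

```python
from collections import Counter


def summarize_documents(docs, preview_chars=80):
    """Return (pages_per_file, previews) for objects exposing .metadata and .text."""
    # Count how many page-level documents were produced for each source file.
    pages_per_file = Counter(
        d.metadata.get("file_name", "<unknown>") for d in docs
    )
    # Keep a short single-line preview of each page instead of the full text.
    previews = [
        (
            d.metadata.get("file_name"),
            d.metadata.get("page_label"),
            d.text[:preview_chars].replace("\n", " "),
        )
        for d in docs
    ]
    return pages_per_file, previews


# In the notebook this could be called as:
# pages, previews = summarize_documents(doc)
```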

### Prompt

In [5]:
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""


In [6]:
!huggingface-cli login



    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: read).
[1m[31mCannot authenticate through g

## Importing the model and embeddings

In [7]:
import torch

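llm = HuggingFaceLLM(
    context_window=4096,   # maximum number of tokens the model handles per call
    max_new_tokens=300,    # cap on the number of generated tokens per answer
    # Greedy decoding: with do_sample=False the temperature value is not used.
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    system_prompt=system_prompt,
    tokenizer_name="meta-llama/Meta-Llama-3-8B",
    model_name="meta-llama/Meta-Llama-3-8B",
    device_map="auto",     # let accelerate place the layers on available devices
    # Load the weights quantized to 8-bit via bitsandbytes.
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)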


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
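
The deprecation message in the output above asks for a `BitsAndBytesConfig` object passed as `quantization_config` instead of `load_in_8bit`. A minimal sketch of the corresponding `model_kwargs` follows; `build_model_kwargs` is a hypothetical helper, and the sketch assumes that `HuggingFaceLLM` forwards `model_kwargs` to the underlying `from_pretrained` call.

```python
def build_model_kwargs():
    """Build model_kwargs that request 8-bit loading via quantization_config."""
    # Imported lazily so that defining the helper does not need the GPU stack.
    import torch
    from transformers import BitsAndBytesConfig

    return {
        "torch_dtype": torch.float16,
        # Replaces the deprecated "load_in_8bit": True entry.
        "quantization_config": BitsAndBytesConfig(load_in_8bit=True),
    }


# In the notebook this would replace the model_kwargs argument, for example:
# llm = HuggingFaceLLM(..., model_kwargs=build_model_kwargs())
```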


In [8]:
!pip install llama-index-embeddings-langchain



In [9]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.core import ServiceContext
from llama_index.embeddings.langchain import LangchainEmbedding

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

In [10]:
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embed_model,
)
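
`ServiceContext` is deprecated in recent llama_index releases in favor of the global `Settings` object. The sketch below shows how the same three values could be set that way; `configure_global_settings` is a hypothetical helper, and whether it applies depends on the installed llama_index version.

```python
def configure_global_settings(llm, embed_model, chunk_size=1024):
    """Store the LLM, embedding model, and chunk size on llama_index's Settings."""
    # Imported lazily so the helper can be defined without llama_index loaded.
    from llama_index.core import Settings

    Settings.llm = llm
    Settings.embed_model = embed_model
    Settings.chunk_size = chunk_size
    return Settings


# With the settings configured globally, the index could then be built
# without a service_context argument:
# configure_global_settings(llm, embed_model)
# index = VectorStoreIndex.from_documents(doc)
```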



## Converting the documents into vectors

In [11]:
index = VectorStoreIndex.from_documents(doc, service_context=service_context)
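
Before embedding, the documents are split into chunks bounded by the `chunk_size` configured earlier. The toy sketch below illustrates the idea with whitespace-separated words and a fixed overlap. It is only an illustration: llama_index counts tokens and respects sentence boundaries, and `chunk_words` is a hypothetical function.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into word chunks of at most chunk_size words with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Larger chunks give the model more context per retrieved passage, while smaller chunks make retrieval more precise; the overlap keeps sentences that straddle a boundary visible in both neighbors.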


## Retrieving answers to user queries

In [12]:
query_engine = index.as_query_engine()
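
Conceptually, a query engine over a vector index embeds the question, ranks the stored chunks by similarity, and passes the best matches to the LLM as context. The sketch below shows only the ranking step, using a toy bag-of-words "embedding" in place of the sentence-transformers model. The functions `embed`, `cosine`, and `top_k` are hypothetical and are not llama_index's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "LoRA freezes the pre-trained weights",
    "adapters add inference latency",
    "the weather is sunny",
]
print(top_k("what does lora freeze", chunks, k=1))
```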

In [13]:
response = query_engine.query("what is lora")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [14]:
print(response)

 LoRa is a low-power wide-area network (LPWAN) technology that is designed to enable long-range communications at a low bit rate among battery-operated devices. LoRa is a modulation technique that is used in conjunction with the chirp spread spectrum (CSS) modulation technique. LoRa is a modulation technique that is used in conjunction with the chirp spread spectrum (CSS) modulation technique. LoRa is a modulation technique that is used in conjunction with the chirp spread spectrum (CSS) modulation technique. LoRa is a modulation technique that is used in conjunction with the chirp spread spectrum (CSS) modulation technique. LoRa is a modulation technique that is used in conjunction with the chirp spread spectrum (CSS) modulation technique. LoRa is a modulation technique that is used in conjunction with the chirp spread spectrum (CSS) modulation technique. LoRa is a modulation technique that is used in conjunction with the chirp spread spectrum (CSS) modulation technique. LoRa is a mod
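
The answer above describes the LoRa radio technology rather than the paper in the index, and it repeats the same sentence until the token limit is reached. Loops like this are common with greedy decoding, and `repetition_penalty` is a standard Hugging Face generation parameter that could be tried in `generate_kwargs`. Separately, the sketch below is a post-processing helper that drops sentences already seen. It only tidies the display and does not address the cause; `collapse_repeated_sentences` is a hypothetical function.

```python
import re


def collapse_repeated_sentences(text):
    """Keep the first occurrence of each sentence, preserving order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)


# In the notebook this could be applied as:
# print(collapse_repeated_sentences(str(response)))
```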

In [15]:
response = query_engine.query("can you explain me in simple words what is meant by lora")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [16]:
print(response)

 LoRA takes a step further and does not require the accumulated
gradient update to weight matrices to have full-rank during adaptation. This means that when
applying LoRA to all weight matrices and training all biases2, we roughly recover the expressiveness
of full fine-tuning by setting the LoRA rank rto the rank of the pre-trained weight matrices. In
other words, as we increase the number of trainable parameters3, training LoRA roughly converges
to training the original model, while adapter-based methods converges to an MLP and prefix-based
methods to a model that cannot take long input sequences.
No Additional Inference Latency. When deployed in production, we can explicitly compute and
storeW=W0+BA and perform inference as usual. Note that both W0andBA are inRd×k.
When we need to switch to another downstream task, we can recover W0by subtracting BAand
then adding a different B′A′, a quick operation with very little memory overhead. Critically, this
2They represent a negl
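
This second answer reads like a retrieved passage copied verbatim rather than a simplified explanation. To check which chunks were passed to the model, the response's `source_nodes` can be inspected. The sketch below assumes each entry exposes `.score` and `.node` with `.metadata` and `.get_content()`, as llama_index responses generally do; `describe_sources` is a hypothetical helper.

```python
def describe_sources(response, preview_chars=100):
    """Summarize the retrieved chunks attached to a query response."""
    rows = []
    for item in response.source_nodes:
        meta = item.node.metadata
        rows.append({
            "score": item.score,
            "file": meta.get("file_name"),
            "page": meta.get("page_label"),
            # Single-line preview of the chunk text.
            "preview": item.node.get_content()[:preview_chars].replace("\n", " "),
        })
    return rows


# In the notebook this could be used as:
# for row in describe_sources(response):
#     print(row)
```

Comparing the previews with the printed answer shows whether the model paraphrased the context or simply echoed it.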