In this notebook, we use 4-bit quantization to run Llama-7B Chat model. This code uses only 10 GB of VRAM. It can run on a free instance of Google Colab or on a local GPU (e.g., RTX 3060 12GB).
[More details here.](https://open.substack.com/pub/kaitchup/p/run-llama-2-chat-models-on-your-computer?r=2kp66c&utm_campaign=post&utm_medium=web)


We only need the following libraries:


*   transformers
*   accelerate (for device_map)
*   bitsandbytes (for 4-bit quantization)




In [None]:
# List of JSON objects with explicit _id field
data = [
    {
        "id": 5,#key ID is required by Azure AI Search for document injection
        "question": "What is Microsoft Word?",
        "answer": "Microsoft Word is a word processing software developed by Microsoft. It allows users to create, edit, and format documents such as letters, reports, resumes, and more."
    },
    {
        "id": 6,
        "question": "What is Microsoft Excel?",
        "answer": "Microsoft Excel is a spreadsheet software developed by Microsoft. It is used for tasks such as storing, organizing, and manipulating data, as well as performing calculations, creating charts, and generating reports."
    },
    {
        "id": 7,
        "question": "What is Microsoft PowerPoint?",
        "answer": "Microsoft PowerPoint is a presentation software developed by Microsoft. It enables users to create slideshows with text, images, videos, and animations, making it suitable for presentations, lectures, and meetings."
    },
    {
        "id": 8,
        "question": "What is Microsoft Outlook?",
        "answer": "Microsoft Outlook is an email client and personal information manager developed by Microsoft. It allows users to manage email, contacts, calendars, tasks, and notes, and integrates with other Microsoft Office applications."
    },
    {
        "id": 9,
        "question": "What is Microsoft Teams?",
        "answer": "Microsoft Teams is a collaboration platform developed by Microsoft. It combines workplace chat, video meetings, file storage, and application integration, providing a hub for teamwork within organizations."
    },
    {
        "id": 10,
        "question": "What is Microsoft Azure?",
        "answer": "Microsoft Azure is a cloud computing platform and services developed by Microsoft. It offers a wide range of cloud services, including computing, storage, analytics, networking, and more, enabling businesses to build, deploy, and manage applications and services through Microsoft's global network of data centers."
    }
]


In [None]:
!pip install pymongo

In [None]:
import pymongo
import json

# MongoDB connection string
connection_string = "m"
client = pymongo.MongoClient(connection_string)

# Database name and collection name
db = client["azure-rag-production"]
collection = db["qa-pairs"]

# Upload each JSON object to MongoDB
for document in data:
  print(document)
  collection.insert_one(document)

print("Documents uploaded successfully.")

In [None]:
# List all databases
print("Databases:")
for db in client.list_databases():
    print(f"- {db['name']}")

# List all collections in each database
for db_name in client.list_database_names():
    print(f"\nCollections in database '{db_name}':")
    db = client[db_name]
    for collection_name in db.list_collection_names():
        print(f"- {collection_name}")

In [1]:
!pip install transformers accelerate bitsandbytes

Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.6/302.6 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.8/119.8 MB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelera

In [2]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Note that to run the following code, you must have got access to Llama 2's weights and have an access token from Hugging Face. You can find instructions on the model cards on the hugging face hub: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf


In [3]:
!pip install azure-search-documents==11.6.0b3 azure-identity

Collecting azure-search-documents==11.6.0b3
  Downloading azure_search_documents-11.6.0b3-py3-none-any.whl (317 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m317.8/317.8 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting azure-identity
  Downloading azure_identity-1.16.0-py3-none-any.whl (166 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m166.1/166.1 kB[0m [31m14.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting azure-core>=1.28.0 (from azure-search-documents==11.6.0b3)
  Downloading azure_core-1.30.1-py3-none-any.whl (193 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m193.4/193.4 kB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting azure-common>=1.1 (from azure-search-documents==11.6.0b3)
  Downloading azure_common-1.1.28-py2.py3-none-any.whl (14 kB)
Collecting isodate>=0.6.0 (from azure-search-documents==11.6.0b3)
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
[2K     [90m━━━━━━━━━━━━━

In [4]:
# Pure Vector Search
from azure.search.documents.models import VectorizedQuery
import requests
from azure.core.credentials import AzureKeyCredential

from azure.search.documents import SearchClient

# The following variables from your .env file are used in this notebook
endpoint = ''
credential = AzureKeyCredential('') #if len(os.environ["AZURE_SEARCH_ADMIN_KEY"]) > 0 else DefaultAzureCredential()
index_name = 'aisearch-rag-production'

search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)
query = "Microsoft Word?"
hf_token = "" #"get your token in http://hf.co/settings/tokens"
model_id = "sentence-transformers/all-MiniLM-L6-v2"

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}
response=requests.post(api_url, headers=headers, json={"inputs": query, "options":{"wait_for_model":True}})
embedding=response.json()
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3,exhaustive=True, fields="questionVector, answerVector")

results = search_client.search(
    search_text=None,
    vector_queries= [vector_query],
    select=["question", "answer"],
)

for result in results:
    print(f"Title: {result['question']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['answer']}")

def nonewlines(s: str) -> str:
    return s.replace("\n", " ").replace("\r", " ")

sources_content=[nonewlines(doc.content or "") for doc in results]
documents = "\n".join(sources_content)

Title: What is Microsoft Word?
Score: 0.03333333507180214
Content: Microsoft Word is a word processing software developed by Microsoft. It allows users to create, edit, and format documents such as letters, reports, resumes, and more.
Title: What is Microsoft PowerPoint?
Score: 0.032786883413791656
Content: Microsoft PowerPoint is a presentation software developed by Microsoft. It enables users to create slideshows with text, images, videos, and animations, making it suitable for presentations, lectures, and meetings.
Title: What is Microsoft Excel?
Score: 0.032258063554763794
Content: Microsoft Excel is a spreadsheet software developed by Microsoft. It is used for tasks such as storing, organizing, and manipulating data, as well as performing calculations, creating charts, and generating reports.


In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"


prompt  = f"""
        <<SYS>>
        Only respond with "Not in the text." if the information needed to answer the question is not contained in the document. \n
        Answer the question using only the information from the attached document below. \n
        Ensure that the questions are answered fully and effectively. \n
        Respond in short and concise yet fully formulated sentences, being precise and accurate
        <</SYS>>
        [INST]
        User:{query}
        [/INST]\
        [INST]
        User:{documents}
        [/INST]\n

        Assistant:
    """
# access_token = "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"



model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

output = model.generate(**model_inputs)

print(tokenizer.decode(output[0], skip_special_tokens=True))


tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]




        <<SYS>>
        Only respond with "Not in the text." if the information needed to answer the question is not contained in the document. 

        Answer the question using only the information from the attached document below. 

        Ensure that the questions are answered fully and effectively. 

        Respond in short and concise yet fully formulated sentences, being precise and accurate
        <</SYS>>
        [INST]
        User:Microsoft Word?
        [/INST]        [INST]
        User:
        [/INST]


        Assistant:
     Microsoft Word is a word processing software developed by Microsoft. It is widely used for creating documents, reports, and letters. The software offers a range of features such as formatting tools, spell check, and collaboration features.

        Is there anything else you would like to know about Microsoft Word?
