With 4-bit quantization, `HuggingFaceH4/zephyr-7b-beta` uses about 8 GB of VRAM. RAM usage spiked to 14 GB while the model was loading, then settled at around 5 GB.
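
As a rough sanity check on these figures, the weight footprint can be estimated from the parameter count alone. The sketch below assumes about 7.24 billion parameters (an approximate figure for Mistral-7B-based models, not measured in this notebook). It ignores quantization constants, activations, the KV cache, and the CUDA context, so it is a lower bound rather than a prediction.

```python
def weight_footprint_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Assumed parameter count (approximate, not measured in this notebook).
N_PARAMS = 7.24e9

fp16_gib = weight_footprint_gib(N_PARAMS, 16)
nf4_gib = weight_footprint_gib(N_PARAMS, 4)
print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the 16-bit checkpoint is on the order of 13 GiB, in the same range as the RAM spike seen while loading. The 4-bit weights come to only a few GiB, so the rest of the VRAM usage is presumably runtime overhead, the embedding model, and layers left unquantized.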

In [None]:
!pip install -q llama-index==0.8.61
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q accelerate bitsandbytes

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyterlab 4.0.5 requires jupyter-lsp>=2.0.0, but you have jupyter-lsp 1.5.1 which is incompatible.
jupyterlab-lsp 5.0.0 requires jupyter-lsp>=2.0.0, but you have jupyter-lsp 1.5.1 which is incompatible.
jupyterlab-lsp 5.0.0 requires jupyterlab<5.0.0a0,>=4.0.6, but you have jupyterlab 4.0.5 which is incompatible.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.8.1 which is incompatible.

In [None]:
!pip install -q "openai<1.0"

## Setup

In [None]:
import llama_index
llama_index.__version__

### LLM


In [None]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

### Helpful Imports / Logging

In [None]:
from llama_index.response.notebook_utils import display_response

In [None]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Basic Query Engine

### Compact (default)

In [None]:
query = r"""You were given a large knowledge base about Laser Powder Bed Fusion additive manufacturing, and from it you will generate JSON triplets, each consisting of an INSTRUCTION, an INPUT, and an OUTPUT, based on a random technical fact from the knowledge you are provided.
Here's an example of the format you must adhere to:

{"instruction": X,
"input": Y,
"output": Z}

An example of the intended outcome is the following:

{"instruction": "What are the expected stress levels when the SMCT demo coil is powered to 15 kA in a 2L mirror configuration?",
"input": "With powering only SMCT demo coil to 15 kA (2L mirror) the peak stress in the SMCT structure in average increases to ~450 MPa with the maximum stress point of 599 MPa in the corner of second mid-plane block of the outer layer.",
"output": "When the SMCT demo coil is powered to 15 kA in a 2L mirror configuration, the average peak stress in the SMCT structure is expected to increase to approximately 450 MPa. The maximum stress point is anticipated to be 599 MPa, and this will be located in the corner of the second mid-plane block of the outer layer."}


The instruction should pose a specific question or ask for an explanation concerning a technical aspect covered in your knowledge base. The input should contain relevant and brief information, theories, or methodologies from the paper that pertain to the instruction. The output should provide a detailed answer to the instruction, adhering to the following guidelines: 
Incorporate mathematical formulations, equations, or symbols where applicable to substantiate your claims. Make use of precise technical terminology and definitions found in the paper. In your generations, never mention the sources. Be absolute. For example, instead of saying "the paper says", say "it is said". Do not reference images or tables from the context.
When possible, embed mathematical intuitions/formulas in your triplets.
Make sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.
Start by generating one triplet."""

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

def messages_to_prompt(messages):
  # dolphin-2.1-mistral-7b is trained on ChatML, so chat messages use the
  # same <|im_start|>/<|im_end|> markers as query_wrapper_prompt below
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|im_start|>system\n{message.content}<|im_end|>\n"
    elif message.role == 'user':
      prompt += f"<|im_start|>user\n{message.content}<|im_end|>\n"
    elif message.role == 'assistant':
      prompt += f"<|im_start|>assistant\n{message.content}<|im_end|>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith("<|im_start|>system\n"):
    prompt = "<|im_start|>system\n<|im_end|>\n" + prompt

  # add final assistant prompt
  prompt = prompt + "<|im_start|>assistant\n"

  return prompt

llm = HuggingFaceLLM(
    model_name="ehartford/dolphin-2.1-mistral-7b",
    tokenizer_name="ehartford/dolphin-2.1-mistral-7b",
    query_wrapper_prompt=PromptTemplate("<|im_start|>system\n<|im_end|>\n<|im_start|>user\n{query_str}<|im_end|>\n<|im_start|>assistant"),
    context_window=3900,
    max_new_tokens=800,
    model_kwargs={"quantization_config": quantization_config},
    generate_kwargs={"temperature": 0.35, "top_k": 10, "top_p": 0.95, "do_sample" :True},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-base-en-v1.5")
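
The `context_window` and `max_new_tokens` settings together bound how much retrieved text fits in a single LLM call. A back-of-the-envelope sketch, assuming the prompt helper reserves `max_new_tokens` for the answer and subtracts the prompt's own tokens. The `prompt_tokens` value is a made-up illustration, not a measurement of the query used here, and `available_context_tokens` is a hypothetical helper rather than a llama_index function.

```python
def available_context_tokens(context_window: int, max_new_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for retrieved text after reserving room for the prompt and the answer."""
    remaining = context_window - max_new_tokens - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt and output budget exceed the context window")
    return remaining

# context_window and max_new_tokens come from the HuggingFaceLLM call above;
# prompt_tokens is hypothetical.
budget = available_context_tokens(context_window=3900, max_new_tokens=800, prompt_tokens=600)
print(budget)  # 2500
```

A long instruction prompt such as the one defined earlier eats directly into this budget, which is worth keeping in mind when raising `max_new_tokens`.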

In [None]:
import os

processed_2_dir = '/kaggle/input/md-processed/papers_txt_organized/papers_txt_organized'
examples = []
count = 0
threshold = 2685


if not os.path.exists(processed_2_dir):
    print(f"The directory {processed_2_dir} does not exist.")
else:
    # Loop through each subfolder, skipping those handled in earlier runs
    for folder_name in os.listdir(processed_2_dir):
        count += 1
        if count < threshold:
            print(f"skipping {folder_name}, already processed in previous runs")
            continue

        subfolder_path = os.path.join(processed_2_dir, folder_name)

        # Build a fresh index for this folder and query it for one triplet
        documents = SimpleDirectoryReader(subfolder_path).load_data()
        vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)
        query_engine = vector_index.as_query_engine(response_mode="refine", streaming=True)

        # Stream the answer to the notebook, then keep the full text
        response_stream = query_engine.query(query)
        response_stream.print_response_stream()
        examples.append(response_stream.response_txt)
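
The model's reply is free text that should contain one JSON object in the requested format, but nothing guarantees it. A minimal sketch of extracting and checking a triplet before keeping it. `extract_triplet` is a hypothetical helper, not part of llama_index, and the sample string is invented for illustration.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def extract_triplet(text: str):
    """Return the first JSON object in `text` that has the three required keys, or None."""
    # strict=False tolerates raw control characters (e.g. newlines) inside strings
    decoder = json.JSONDecoder(strict=False)
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            return obj
        # Not a valid triplet here; try the next opening brace
        start = text.find("{", start + 1)
    return None

sample = 'Here is one triplet:\n{"instruction": "Q?", "input": "context", "output": "A."}\nDone.'
triplet = extract_triplet(sample)
print(triplet["instruction"] if triplet else "no valid triplet found")
```

Replies for which this returns `None` could be logged and regenerated rather than appended to `examples`.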


In [12]:
boundary = len(examples) + threshold
boundary

2685
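
The `boundary` value is meant to become the next run's `threshold`. That bookkeeping only holds if folders are visited in the same order every time, and `os.listdir` returns names in arbitrary order. A sketch of one way to make the skip deterministic by sorting the names first. `folders_to_process` is a hypothetical helper, and switching to sorted order partway through a multi-run job would change which folders count as already processed.

```python
def folders_to_process(folder_names, threshold):
    """Skip the first `threshold - 1` names in sorted order, mirroring `count < threshold`."""
    ordered = sorted(folder_names)
    return ordered[max(threshold - 1, 0):]

# Invented folder names for illustration
names = ["paper_c", "paper_a", "paper_d", "paper_b"]
remaining = folders_to_process(names, threshold=3)
print(remaining)  # ['paper_c', 'paper_d']
```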

### Save to file

In [13]:
with open(f'my_list_first{boundary}.txt', 'w', encoding='utf-8') as f:
    for item in examples:
        f.write("%s\n" % item)
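
Because a generated example may itself contain line breaks, one item per line in a plain text file can be ambiguous to read back. A sketch of an alternative that writes one JSON-encoded string per line (JSON Lines). The file name and sample data are placeholders.

```python
import json
import os
import tempfile

def save_jsonl(items, path):
    """Write each item as one JSON-encoded line so embedded newlines are escaped."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read the items back, one JSON value per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Placeholder data: the first item spans two lines on purpose
samples = ["first example\nspanning two lines", "second example"]
out_path = os.path.join(tempfile.mkdtemp(), "examples.jsonl")
save_jsonl(samples, out_path)
restored = load_jsonl(out_path)
print(restored == samples)  # True
```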