In [47]:
import os

import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain_community.vectorstores import Chroma
from langchain_core.embeddings import Embeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

In [48]:
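load_dotenv()  # load GROQ_API_KEY from a local .env file, if one exists
if not os.environ.get("GROQ_API_KEY"):
    # Fail early with a clear message instead of a TypeError on a missing key.
    raise RuntimeError("GROQ_API_KEY is not set; add it to .env or the environment.")
llm = init_chat_model("llama3-8b-8192", model_provider="groq")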

In [49]:
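# Chroma's default embedding function.
ef = embedding_functions.DefaultEmbeddingFunction()


class DefChromaEF(Embeddings):
    """Adapt a Chroma embedding function to LangChain's Embeddings interface."""

    def __init__(self, ef):
        self.ef = ef

    def embed_documents(self, texts):
        # Return plain lists of floats in case the function yields NumPy arrays.
        return [[float(x) for x in vector] for vector in self.ef(texts)]

    def embed_query(self, query):
        return self.embed_documents([query])[0]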
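
The adapter only has to satisfy two calls: `embed_documents` returns one vector per text, and `embed_query` returns a single vector. The sketch below checks that contract without loading a real model. `FakeEF` and `ListAdapter` are hypothetical stand-ins invented for illustration (they are not part of Chroma or LangChain), and the "embedding" is a toy feature vector.

```python
from typing import List, Sequence


class FakeEF:
    """Hypothetical stand-in for a Chroma embedding function (illustration only)."""

    def __call__(self, input: Sequence[str]) -> List[List[float]]:
        # Toy embedding: [character count, vowel count] for each text.
        return [
            [float(len(text)), float(sum(ch in "aeiou" for ch in text.lower()))]
            for text in input
        ]


class ListAdapter:
    """Mirrors the adapter's two methods without depending on LangChain."""

    def __init__(self, ef):
        self.ef = ef

    def embed_documents(self, texts):
        return [list(vector) for vector in self.ef(texts)]

    def embed_query(self, query):
        return self.embed_documents([query])[0]


adapter = ListAdapter(FakeEF())
doc_vectors = adapter.embed_documents(["fine-tuning", "data selection"])
query_vector = adapter.embed_query("fine-tuning")
print(doc_vectors, query_vector)
```

A query and a document with the same text should map to the same vector, which is what lets the vector store match them.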

# Get vector store and retriever

In [56]:
client = chromadb.PersistentClient(path="chromadb/")
db = Chroma(client=client, collection_name="2402.05123.pdf", embedding_function=DefChromaEF(ef))
retriever = db.as_retriever()
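
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The following is a minimal pure-Python sketch of that idea using cosine similarity. It is an assumption made for illustration: the actual index and distance metric depend on how the Chroma collection was created, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], toy_docs, k=2))  # [0, 2]
```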

# Create prompt template

In [57]:
# No model-specific control tokens here: the chat API applies the model's own chat format.
template = """Answer the question based only on the following context and extract a meaningful answer. \
Please write in full sentences with correct spelling and punctuation. If it makes sense, use lists. \
If the context doesn't contain the answer, just respond that you are unable to find an answer.

CONTEXT: {context}

QUESTION: {question}

ANSWER:"""

In [58]:
prompt = ChatPromptTemplate.from_template(template)

# Create the RAG chain using LCEL

In [59]:
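def format_docs(docs):
    # Join retrieved chunks into plain text rather than passing Document reprs to the prompt.
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)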
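
To make the data flow of the chain explicit, here is a plain-Python sketch with stubs in place of the real components. The question is sent both to the retriever (to produce `context`) and straight through (as `question`), then the filled prompt goes to the model. All `fake_*` names and `run_chain` are hypothetical, and the stubs return canned text rather than calling Chroma or Groq.

```python
def fake_retriever(question):
    # Stub: a real retriever would return the chunks most similar to the question.
    return ["chunk A about instruction tuning", "chunk B about data selection"]


def fake_prompt(inputs):
    # Stub for the prompt template: fill the two placeholders.
    return f"CONTEXT: {inputs['context']}\n\nQUESTION: {inputs['question']}\n\nANSWER:"


def fake_llm(prompt_text):
    # Stub: a real model would generate an answer from the prompt.
    return f"(model reply to a prompt of {len(prompt_text)} characters)"


def run_chain(question):
    # Step 1: fan the question out to both keys of the input mapping.
    inputs = {
        "context": "\n\n".join(fake_retriever(question)),
        "question": question,
    }
    # Steps 2 and 3: format the prompt, then call the model.
    return fake_llm(fake_prompt(inputs))


print(run_chain("what is fine-tuning?"))
```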

# Function to ask questions

In [60]:
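def ask_question(question):
    """Stream the answer to stdout and return the full text."""
    print("Answer:\n\n", end=" ", flush=True)
    parts = []
    for chunk in rag_chain.stream(question):
        print(chunk.content, end="", flush=True)
        parts.append(chunk.content)
    print("\n")
    return "".join(parts)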
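
Streaming means the answer arrives as a sequence of small pieces that are printed as they come in. The sketch below imitates that pattern with a generator so it runs without a model. `fake_stream` and `print_and_collect` are hypothetical helpers, and fixed-size slicing is only a stand-in for how a real model emits tokens.

```python
from typing import Iterable, Iterator


def fake_stream(text: str, size: int = 8) -> Iterator[str]:
    """Yield the text in fixed-size pieces, imitating streamed chunks."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def print_and_collect(chunks: Iterable[str]) -> str:
    """Print each piece as it arrives and return the assembled text."""
    parts = []
    for piece in chunks:
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


answer = print_and_collect(fake_stream("Fine-tuning adapts a pretrained model to a task."))
```

Collecting the pieces while printing them keeps the live output and still leaves the full answer available afterwards.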

# Example usage

In [None]:
if __name__ == "__main__":
    while True:
        user_question = input("Ask a question (or type 'quit' to exit): ")
        if user_question.strip().lower() == 'quit':
            break
        ask_question(user_question)

Ask a question (or type 'quit' to exit):  what is fine-tuning?


Answer:

 Based on the provided context, fine-tuning refers to the process of adjusting a large language model (LLM) to a specific task or dataset. This is done by training the LLM on a small, high-quality subset of the dataset, which is carefully selected to improve the model's performance on that task.

In the context of instruction tuning, fine-tuning involves training the LLM on a subset of the instruction dataset to enhance its ability to follow human instructions. This process is crucial in aligning the LLM with human intent and improving its controllability and safety.

The fine-tuning process typically involves several steps, including:

1. Selecting a high-quality subset of the instruction dataset using a data selection method.
2. Training the LLM on the selected subset.
3. Evaluating the LLM's performance on a testing set.

Fine-tuning is a crucial step in the instruction tuning process, as it enables the LLM to adapt to specific domains and tasks, reducing the need for large

Ask a question (or type 'quit' to exit):  summarize the text


Answer:

 The text discusses various methods for selecting and filtering instruction data for large language models (LLMs) to improve their performance and efficiency. The methods include using powerful LLMs like ChatGPT, generating and filtering instructions, and selecting high-quality instruction data. The text also mentions several papers and conferences related to this topic.

The methods for selecting instruction data include:

1. Using powerful LLMs like ChatGPT to assess the quality of each data tuple.
2. Generating and filtering instructions using techniques like instruction backtranslation and self-alignment.
3. Selecting high-quality instruction data using methods like dataset quantization and core-set approach.

The text also mentions several papers and conferences related to this topic, including papers on instruction mining, dataset quantization, and model-oriented data selection.

Overall, the text provides an overview of the current state of research in selecting and fil

Ask a question (or type 'quit' to exit):  whar are the main findings mention in the text?


Answer:

 Based on the provided context, I was unable to find an answer to the question "what are the main findings mentioned in the text?" because there is no specific mention of findings or results in the text. The text appears to be an introduction to a research paper and presents an overview of the topic, methodology, and approaches used in the research.



Ask a question (or type 'quit' to exit):  what are the main findings mention in the text?


Answer:

 Based on the provided context, the main findings mentioned in the text are:

1. The performances of different selection methods on winning rates are summarized in Table 1.
2. The inner comparisons of LLMs tuned on subsets with themselves tuned on full sets are summarized in Table 2.
3. The external comparisons of LLMs with external LLMs on different testing benchmarks are summarized in Table 3.
4. The INSTRUCTMINING method uses a linear equation to calculate the quality of instructions and can be used for data selection.
5. The InstructionGPT-4 method uses a trainable data selector to process and assign scores to each piece of instruction fine-tuning data.
6. The DQ method uses a gain function to partition the dataset into non-overlapping subsets and select a representative sample from each subset.
7. The IFD method uses an LLM to initially learn to recognize instructions from carefully selected data and enhance its ability to assess broader dataset quality and estimate instr

Ask a question (or type 'quit' to exit):  give overview on the model-oriented data selection mention in the text.


Answer:

 Based on the provided context, the model-oriented data selection methods mentioned in the text are:

1. **System of Indicators**: This method employs a metric set system directly to identify multiple metrics, such as instruction length, complexity, and reward scores. These metrics are used to calculate individual scores for each data instance, and then the scores are used to develop a more robust metric set system.
2. **Trainable LLMs**: This method uses a trainable LLM, such as LLaMa, to develop computation formulas in data selection processes. The LLM processes and assigns scores to each piece of instruction fine-tuning data.
3. **Powerful LLMs**: This method uses powerful external LLMs, such as ChatGPT, to select instructions.
4. **Small Models**: This method uses smaller models, such as MoDS, and designs a comprehensive process to achieve a balanced effect across all aspects.

These model-oriented data selection methods are used to select high-quality instruction data for

Ask a question (or type 'quit' to exit):  what are the measures for indicator use in the text?


Answer:

 Based on the provided context, the measures for indicator use are:

1. **Wining Rate**: Calculated as `(Num(win)−Num(lose))/Num(all)+1` (Formula 10), it evaluates the effectiveness of the dataset selection methods.
2. **Prompt Uncertainty**: Determined by generating k perturbed instructions through the random deletion of words from an original instruction, it serves as a measure of task uncertainty.
3. **KLIP Score**: Used in [Radford et al., 2021] to assess the quality of instruction data.
4. **Complexity**: Described by factors such as length, difficulty, and complexity of the instruction.
5. **Quality**: Captures the accuracy of the output.
6. **Quality labels**: Assigned post fine-tuning and evaluation of each subset.
7. **IFD (Instruction Following Difﬁculty)**: A metric introduced by [Liet al., 2023a] to assess the challenge of responding to specific instructions.

These measures are used to evaluate the quality, diversity, and complexity of the instruction data and to 
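
The winning-rate expression quoted in the answer above can be checked with a small calculation. This sketch takes the formula exactly as it appears in the model's output, `(Num(win) − Num(lose)) / Num(all) + 1`, and has not been verified against the source paper. The function name and the counts are hypothetical.

```python
def winning_rate(num_win: int, num_lose: int, num_all: int) -> float:
    """Winning rate as quoted in the answer: (wins - losses) / all + 1."""
    if num_all <= 0:
        raise ValueError("num_all must be positive")
    return (num_win - num_lose) / num_all + 1


# Hypothetical counts: 50 wins, 25 losses, 25 ties out of 100 comparisons.
print(winning_rate(50, 25, 100))  # 1.25
```

Under this formula, a value of 1.0 means wins and losses are balanced, and values above 1.0 mean more wins than losses.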