# A quick guide to dataset generation for fine-tuning small language models

## Introduction

### Why do we need to know this

   -- Foundation models are designed to be highly versatile and to perform a wide range of tasks without modification. That breadth has a downside: they can know too much, and in settings such as retrieval-augmented generation (RAG) they can fall back on their innate knowledge, which may be wrong. Fine-tuning them, however, can compromise their broad applicability, making them less effective at the diverse array of tasks they are meant to perform, and it does not significantly decrease the hallucination risk.

   >We test foundation models like Mistral, Llama, MosaicML, and their fine-tuned versions. Our research shows that fine-tuning and quantization reduces jailbreak resistance significantly, leading to increased LLM vulnerabilities. 
   [Increased LLM vulnerabilities from Fine-Tuning and Quantization](<references/Increased LLM vunlnderabilities from fine-tuning and quantization.pdf>)

   -- Fine-tuning smaller pre-trained models such as LLaMA or Phi-3, however, can significantly enhance their performance on specific tasks by adapting them to domain-specific data, which improves accuracy and relevance. This customization can turn a general-purpose model into a specialized domain expert that can be used in many different settings while decreasing compute cost.
   
   ![alt text](references/ForbesQuoteSLM.png)

   [Forbes - Small language Models for Enterprise AI](<references/Small Language Models – More Effective And Efficient For Enterprise AI.pdf>)

   -- Fine-tuning smaller, domain-specific models can significantly reduce the risk of hallucinations, which are factually incorrect or misleading outputs. This is because these models are trained on high-quality, relevant data, allowing them to produce outputs more aligned with the factual nuances of the domain. By focusing on targeted learning, these models are less likely to generate inaccurate information than larger, general-purpose models. This targeted approach helps mitigate the generation of misinformation, toxicity, and stereotypes.
   
   ![alt text](references/ComparisionOfHallucinations.png "Comparison of model size vs hallucination risk")

   [Knowledge Mismatch Hypothesis](<references/Knowledge Mismatch Hypothosis.pdf>)

   -- However, fine-tuning also carries risks, such as overfitting to the fine-tuning data, which can reduce the model's generalization capability. There is also the potential for introducing biases from the fine-tuning dataset, resulting in unethical or unreliable outputs.

   <strong>At the end of the day, especially with small language models, the datasets matter!</strong>
   
 ![alt text](references/ModelCollapse.png)  

   > Our evaluation suggests a ‘first mover advantage’ when it comes to training models such as LLMs. In our work, we demonstrate that 
training on samples from another generative model can induce 
a distribution shift, which—over time—causes model collapse. This in 
turn causes the model to mis-perceive the underlying learning task. 
To sustain learning over a long period of time, we need to make sure 
that access to the original data source is preserved and that further 
data not generated by LLMs remain available over time. The need to 
distinguish data generated by LLMs from other data raises questions 
about the provenance of content that is crawled from the Internet: 
it is unclear how content generated by LLMs can be tracked at scale. 
One option is community-wide coordination to ensure that different 
parties involved in LLM creation and deployment share the information needed to resolve questions of provenance. Otherwise, it may 
become increasingly difficult to train newer versions of LLMs without 
access to data that were crawled from the Internet before the mass 
adoption of the technology or direct access to data generated by 
humans at scale

[Nature-AI Model Collapse When Trained on Recursively Generated Data](references/Nature_model_collapse.pdf)

## What to look for in a dataset

### Quality

- <strong>Don't just use a dataset from HuggingFace </strong>

![alt text](references/MalwareHuggingface.png)

[Forbes: Malware uploaded to HuggingFace](<references/Hackers Have Uploaded Thousands Of Malicious Files To AI’s Biggest Online Repository.pdf>)

- For every dataset you download, perform due diligence:
    
    - Always source datasets from reputable and trusted repositories to minimize the risk of malicious content. 
    
    - Before using a dataset, conduct thorough data cleaning and validation to identify and remove any anomalies or potentially harmful data. 
    
    - Implement version control and proper documentation to track changes and maintain a clear record of data usage.
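The cleaning and tracking steps above can be sketched in a few lines. This is a minimal illustration and not a complete security check: it assumes the dataset arrives as JSON Lines, and the helper names `sha256_of` and `clean_records` as well as the sample rows are hypothetical.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return a hex digest that can be stored alongside the dataset version."""
    return hashlib.sha256(data).hexdigest()


def clean_records(lines, required_keys):
    """Parse JSONL lines and drop malformed, incomplete, and duplicate records."""
    seen = set()
    kept, dropped = [], 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            dropped += 1  # not valid JSON
            continue
        if not isinstance(record, dict) or not all(
            str(record.get(key, "")).strip() for key in required_keys
        ):
            dropped += 1  # missing or empty required field
            continue
        fingerprint = sha256_of(json.dumps(record, sort_keys=True).encode("utf-8"))
        if fingerprint in seen:
            dropped += 1  # exact duplicate of an earlier record
            continue
        seen.add(fingerprint)
        kept.append(record)
    return kept, dropped


# Hypothetical sample rows, for illustration only.
raw_lines = [
    '{"text": "Great film", "label": 1}',
    '{"text": "Great film", "label": 1}',
    '{"text": "", "label": 0}',
    "not json at all",
    '{"text": "Dull plot", "label": 0}',
]

kept, dropped = clean_records(raw_lines, required_keys=("text", "label"))
dataset_version = sha256_of("\n".join(raw_lines).encode("utf-8"))
print(f"kept={len(kept)} dropped={dropped} version={dataset_version[:12]}")
```

Recording the digest next to each dataset snapshot gives a simple way to tell whether the data changed between training runs.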

### Model and Task

- The dataset you use also depends on the task at hand and the model you plan to train.

[Comparative analysis of encoder and Decoder Language Models on multilingual NLU Tasks](<references/Comparative analysis of encoders vs decoders.pdf>)

# Bidirectional Encoder Models

- These models excel at extraction and classification. Because they are bidirectional, they use context on both sides of each token and can perform tasks such as extractive question answering and multi-label classification, where they can outperform decoder models that are much larger in size:

![alt text](references/GPTvsDistilBERT.png)

[Comparison of DistilBERT vs GPT-3 in multilabel classification](<references/A Comparative Analysis_ Fine-tuning Multilabel Classification Models Using DistilBERT vs. GPT-3 – AlgoBlog.pdf>)

## Question Answering: 
    
### Archetype:
   <a href="https://rajpurkar.github.io/SQuAD-explorer/">SQuAD dataset</a>

### Format:

"title": {

  "context": <strong>"Description paragraph"</strong>,

  "question": <strong>"Question whose answer is in the context"</strong>,

  "answers": [

   {
    
  "text": <strong>"Answer copied verbatim from the context"</strong>,
    
  "answer_start": <strong>Character index where the answer starts within the context</strong>
    
  }

  ],

  "is_impossible": true | false

}
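As a rough sketch of how a record in this shape could be built, the helper below computes `answer_start` as a character offset and marks unanswerable questions. The function name `make_qa_record` and the sample text are hypothetical, and the layout mirrors the simplified format shown here, not the full nested SQuAD file structure.

```python
def make_qa_record(title, context, question, answer_text=None):
    """Build one simplified SQuAD-style record keyed by its title."""
    if answer_text is None:
        # SQuAD 2.0 style: an unanswerable question has no answers.
        answers, is_impossible = [], True
    else:
        start = context.find(answer_text)
        if start == -1:
            raise ValueError("answer text must appear verbatim in the context")
        answers = [{"text": answer_text, "answer_start": start}]
        is_impossible = False
    return {
        title: {
            "context": context,
            "question": question,
            "answers": answers,
            "is_impossible": is_impossible,
        }
    }


# Hypothetical example text, for illustration only.
context = "The warranty covers parts and labor for 24 months from the date of purchase."
record = make_qa_record(
    "Warranty terms", context, "How long does the warranty last?", "24 months"
)
answer = record["Warranty terms"]["answers"][0]
start = answer["answer_start"]

# The offset must point back at the exact answer span.
print(context[start:start + len(answer["text"])])
```

Checking that the slice at `answer_start` reproduces the answer text is a cheap way to catch offset errors before training.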

### Leaderboard

<a href="https://paperswithcode.com/sota/question-answering-on-squad20">Squad 2.0 Model Leaderboard </a>

## Extractive Summarization

### Archetype

NYTimes Dataset - no longer available

<a href="https://paperswithcode.com/dataset/billsum">BillSum</a>

### Format

{

"title": <strong> "Article title" </strong>,

"text": <strong> "Article text" </strong>,

"summary": <strong> "Summarization of the article" </strong>

}
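For an extractive task, one useful sanity check is whether each summary sentence really appears in the source text. The sketch below is an assumption about how such a check might look: `split_sentences` uses a naive punctuation-based splitter, `extractive_coverage` is a hypothetical helper, and the sample record is invented.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_coverage(record):
    """Fraction of summary sentences that occur verbatim in the source text."""
    sentences = split_sentences(record["summary"])
    if not sentences:
        return 0.0
    found = sum(1 for sentence in sentences if sentence in record["text"])
    return found / len(sentences)


# Hypothetical record, for illustration only.
record = {
    "title": "Rural broadband bill",
    "text": (
        "The bill funds rural broadband. It also creates a grant program. "
        "Reporting is required annually."
    ),
    "summary": "The bill funds rural broadband. Reporting is required annually.",
}

print(extractive_coverage(record))
```

A low score suggests the summary was paraphrased, which makes the record a poor fit for training an extractive model.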


## Classification

### Archetype

<a href="https://paperswithcode.com/dataset/imdb-movie-reviews">IMDB movie Reviews</a>

### Format

{

"text": <strong> "Review text" </strong>,

"label": <strong> 0 (negative) | 1 (positive) </strong>

}
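Before fine-tuning a classifier on records like this, it can help to look at how the labels are distributed, since a heavily skewed set tends to bias the model. A minimal sketch, with a hypothetical helper name and invented sample reviews:

```python
from collections import Counter


def label_distribution(records):
    """Return the share of records carrying each label."""
    counts = Counter(record["label"] for record in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical reviews, for illustration only.
reviews = [
    {"text": "Loved every minute.", "label": 1},
    {"text": "A joy to watch.", "label": 1},
    {"text": "Would see it again.", "label": 1},
    {"text": "Fell asleep halfway.", "label": 0},
]

distribution = label_distribution(reviews)
print(distribution)
```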


# Decoder models

- These models excel at text generation and chat templates. They are unidirectional and work by causally predicting the next token.

- These models can perform all the tasks above, but typically need to be larger to do so.

-- They may still be worthwhile for tasks that require a larger context window, as most bidirectional encoders max out at about 512 tokens.


## Chain of thought prompting

- One of the most promising prompting techniques, chain-of-thought (CoT) prompting encourages models to break down complex tasks into a series of logical steps or smaller questions, promoting step-by-step reasoning.

- This approach can significantly enhance the model's ability to solve complex problems by mimicking human problem-solving processes, leading to more accurate and coherent responses. 

- The benefits of chain of thought prompting include improved task performance, better handling of multi-step reasoning tasks, and increased interpretability of the model's thought process. 

- However, there are drawbacks as well, such as increased computational costs and the potential for the model to produce verbose outputs that may not always be efficient. Additionally, if not properly managed, the breakdown of tasks can sometimes lead to over-complication or irrelevant steps, which could hinder the model's effectiveness.

### Getting small models to reason

- CoT was long considered the domain of large language models only, since small language models have limited capacity for memorization and lack the same integrated abilities in question understanding and knowledge reasoning.

- However, it is possible to fine-tune a small language model for domain-specific chain-of-thought reasoning:

> Language models (LMs) with less than 100B
parameters are known to perform poorly on
chain-of-thought (CoT) reasoning in contrast
to large LMs when solving unseen tasks. In this
work, we aim to equip smaller LMs with the
step-by-step reasoning capability by instruction
tuning with CoT rationales.

[The CoT Collection](<references/COT Collection.pdf>)



### COT format

{

  "Instruction and Question": <strong>"Question and Context if needed"</strong>,
  
  "Answer": <strong>"Final answer given by the model"</strong>,
  
  "Rationale": <strong>"Why the model should arrive at that answer"</strong>

}
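One way to turn a record in this format into a fine-tuning example is to place the rationale before the answer, so the model learns to reason first and answer last. The prompt template, the `to_training_example` helper, and the sample record below are assumptions for illustration, not the template used by the CoT Collection.

```python
def to_training_example(record):
    """Convert a CoT record into a prompt/completion pair (assumed template)."""
    prompt = f"{record['Instruction and Question'].strip()}\nLet's think step by step.\n"
    completion = (
        f"{record['Rationale'].strip()}\n"
        f"Therefore, the answer is {record['Answer'].strip()}."
    )
    return {"prompt": prompt, "completion": completion}


# Hypothetical record, for illustration only.
cot_record = {
    "Instruction and Question": "A crate holds 12 bottles. How many bottles are in 3 crates?",
    "Answer": "36",
    "Rationale": "Each crate holds 12 bottles, so 3 crates hold 3 * 12 = 36 bottles.",
}

example = to_training_example(cot_record)
print(example["prompt"] + example["completion"])
```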



## Beyond COT - note that all of these use more than one model or one model call

We can offload much of the manual work of multi-shot prompting to a pool of models.

- While this allows for better answers and fewer hallucinations, it increases compute cost.

- <strong>May not want to just use a GPT for this whole pipeline!</strong>

## Multihop Question Answering with small models

Traditional Chain-of-thought Distillation (CoTD) methods, which fine-tune SLMs using rationales generated by large language models (LLMs), face limitations in knowledge-intensive multi-hop question answering due to SLMs' limited knowledge memorization and integrated reasoning abilities.

D&R Distillation addresses these limitations by training two separate student models: a Decomposer and a Responser. The Decomposer breaks down complex questions into simpler subquestions, while the Responser answers these subquestions using external knowledge sources. This interactive process allows SLMs to access comprehensive knowledge and reduces task complexity.

Key advantages of D&R Distillation include:

- Enhanced knowledge access: SLMs can retrieve and utilize external knowledge for each subquestion, providing a more thorough understanding of multi-hop questions.

- Simplified reasoning: By decomposing complex questions into simpler subquestions, the method reduces the overall task difficulty and data requirements.

Experimental results on three datasets (HotpotQA, StrategyQA, and 2WikiMultiHopQA) demonstrate that D&R Distillation significantly outperforms previous CoTD methods, even with much less training data. For instance, with only 1/10 of the training data, D&R Distillation with two 220M SLMs (T5-base) surpasses the performance of an 11B LLM (Flan-T5-XXL) on HotpotQA and 2WikiMultiHopQA.
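The interaction between the two student models can be pictured as a simple loop. The sketch below is a toy illustration under stated assumptions, not the paper's implementation: the two models are replaced by hypothetical stub functions, the external knowledge source is a small dictionary, and the facts in it are invented.

```python
def decompose_and_respond(question, decomposer, responser, max_hops=4):
    """Alternate between asking a subquestion and answering it."""
    history = []  # list of (subquestion, answer) pairs gathered so far
    for _ in range(max_hops):
        subquestion = decomposer(question, history)
        if subquestion is None:  # the decomposer signals it has enough information
            break
        history.append((subquestion, responser(subquestion)))
    return history


# Invented facts standing in for an external knowledge source.
KNOWLEDGE = {
    "Who directed the film Alpha?": "Jane Doe",
    "Where was Jane Doe born?": "Springfield",
}


def toy_decomposer(question, history):
    """Stub for the Decomposer model: scripted subquestions for one example."""
    if not history:
        return "Who directed the film Alpha?"
    if len(history) == 1:
        director = history[0][1]
        return f"Where was {director} born?"
    return None


def toy_responser(subquestion):
    """Stub for the Responser model: looks up the answer in the knowledge source."""
    return KNOWLEDGE.get(subquestion, "unknown")


trace = decompose_and_respond(
    "Where was the director of the film Alpha born?", toy_decomposer, toy_responser
)
final_answer = trace[-1][1]
print(trace)
```

Note how the second subquestion depends on the answer to the first, which is what makes the question multi-hop.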

[Reference](<references/Teaching small models to reason with multi hop QA.pdf>)

## Tree of thought 

The Tree-of-Thought (ToT) framework enhances large language models (LLMs) by mimicking human reasoning processes. Unlike traditional linear token generation, ToT lets an LLM build a tree of intermediate thoughts, enabling self-evaluation and error correction.

This method significantly improves the quality of outputs for complex tasks. Unlike Chain-of-Thought (CoT) prompting, which follows a single reasoning path, ToT uses search algorithms like breadth-first search (BFS) and depth-first search (DFS) to explore multiple reasoning paths.
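A breadth-first version of this search can be sketched as follows. In a real ToT setup, proposing and scoring thoughts would be LLM calls. Here they are replaced by hypothetical toy functions on a number puzzle (pick three numbers from 1, 3, 5 that sum to 11), so the names `propose`, `evaluate`, and `tree_of_thought_bfs` are assumptions for illustration only.

```python
TARGET = 11
CHOICES = (1, 3, 5)


def propose(state):
    """Toy stand-in for the LLM proposing next thoughts: extend the partial solution."""
    return [state + (choice,) for choice in CHOICES]


def evaluate(state):
    """Toy stand-in for the LLM scoring a thought: closer to the target is better."""
    return -abs(TARGET - sum(state))


def tree_of_thought_bfs(depth, beam_width):
    """Expand every kept state one level at a time, keeping the best few per level."""
    frontier = [()]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        frontier = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
    return frontier[0]


best = tree_of_thought_bfs(depth=3, beam_width=2)
print(best, sum(best))
```

Keeping more than one state per level is what separates this from a single chain: a branch that looks slightly worse early on stays available if it leads to a better result later.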

Research by Yao et al. (2023), Long et al. (2023), and Hulbert (2023) demonstrates ToT's effectiveness in tasks such as the Game of 24, creative writing, and mini crosswords. In the Game of 24 experiments by Yao et al., GPT-4 achieved a 74% success rate with ToT versus 4% with CoT prompting.

[Unlocking LLMs’ Potential with Tree-of-Thought Prompting](<references/Unlocking LLMs’ Potential with Tree-of-Thought Prompting _ by Albert _ Medium.pdf>)

## Iteration of thought

The Iteration of Thought (IoT) framework enhances LLM responses through dynamic, context-sensitive prompts.

Unlike static methods like Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path based on evolving context without generating and discarding alternate thoughts. 

The framework consists of three components: an Inner Dialogue Agent (IDA) that generates context-specific prompts, an LLM Agent (LLMA) that refines responses, and an iterative prompting loop that facilitates a conversation between the two. The paper referenced below describes two main variants:

- Autonomous Iteration of Thought (AIoT), where the LLM decides when to stop iterating, and 

- Guided Iteration of Thought (GIoT), which enforces a fixed number of iterations. 
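The difference between the two variants comes down to the stopping rule of the loop. The sketch below is a simplified illustration, not the paper's code: the IDA and LLMA are hypothetical stub functions, and the "LLM" simply reports itself finished after a set number of calls.

```python
def iterate_thought(query, ida, llm_agent, max_iterations, autonomous):
    """Run the IDA/LLMA loop; `autonomous=True` lets the LLM stop early (AIoT)."""
    response, is_final = llm_agent(query, None)  # initial answer, no guidance yet
    for _ in range(max_iterations - 1):
        if autonomous and is_final:
            break  # AIoT: the LLM has declared its answer final
        guidance = ida(query, response)  # IDA writes a context-specific prompt
        response, is_final = llm_agent(query, guidance)
    return response


def make_toy_llm(final_after):
    """Stub LLMA that marks its answer final from call number `final_after` onward."""
    calls = {"n": 0}

    def llm_agent(query, guidance):
        calls["n"] += 1
        return f"draft {calls['n']}", calls["n"] >= final_after

    return llm_agent, calls


def toy_ida(query, response):
    """Stub IDA that asks the LLM to re-check its previous response."""
    return f"Re-check '{response}' against the question: {query}"


aiot_llm, aiot_calls = make_toy_llm(final_after=2)
aiot_answer = iterate_thought("toy question", toy_ida, aiot_llm, 5, autonomous=True)

giot_llm, giot_calls = make_toy_llm(final_after=2)
giot_answer = iterate_thought("toy question", toy_ida, giot_llm, 5, autonomous=False)

print(aiot_answer, aiot_calls["n"], giot_answer, giot_calls["n"])
```

In this toy run the autonomous variant stops as soon as the stub reports a final answer, while the guided variant always spends its full iteration budget.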

Experiments across various datasets, including GPQA, Game of 24, Mini Crosswords, and HotpotQA, demonstrate that IoT significantly improves LLM performance over CoT by enabling more adaptive and efficient reasoning. AIoT, in particular, shows a 14.11% improvement in accuracy over the baseline on the GPQA dataset.

>A human user’s interaction with in an LLM often proceeds as follows: the user poses a question
to the LLM, receives an initial response, and, if the answer is incomplete or suboptimal, provides
additional guidance to the LLM by reiterating contextual clues (e.g. by reminding the LLM of its
role, suggesting additional information to consider, or highlighting specific parts of the response that
need refinement). This back-and-forth process helps narrow the focus of the LLM while reducing
the research effort required from the user, since the LLM is responsible for the bulk of the reasoning
and information retrieval.
We identify two predominant forms of human-LLM interaction. In the first form of interaction,
the user simply guides an LLM through its own internal knowledge base. For example, consider a
scenario where an LLM generates code that is syntactically incorrect due to a missing bracket. The
user might prompt it to "verify the syntax," leading the LLM to correct the error in a subsequent
response. In the second for of interaction, the user introduces new information to improve the
LLM’s response. For example, an LLM may be asked to provide up-to-date weather information for
a specific city, but lacks access to real-time data. In this case, the user can supply this information
(using a tool or API), then prompt the LLM to e.g. recommend weather-appropriate clothing or
destination to visit in that locale. All together, the first form an interaction leads the LLM to
better utilize its internal knowledge, whereas the second form of interaction involves augmenting
the LLM’s knowledge with new information.

[Iteration of thought - leveraging inner dialog for reasoning](<references/Iteration of thought - leveraging inner dialogue.pdf>)

<a href="https://paperswithcode.com/paper/iteration-of-thought-leveraging-inner">Papers With Code - link to repo and datasets used</a>

# Resources

https://github.com/janani-ravi-loony/hands-on-prompt-engineering