# Crash Course in Generative AI Worked Examples - Sentiment Analysis with Flan-T5

*Authors: Yash Gopalji Pankhania, Jeel Kanzaria, Megha Patel*

# Introduction

Welcome to this hands-on Colab notebook, where we will explore the dynamic realm of sentiment analysis using the power of Generative AI. Throughout this interactive journey, you will gain insights into the intricate relationship between input text and model output, mastering the art of prompt engineering to tailor responses for specific tasks.

In our exploration, we will compare zero-shot, one-shot, and few-shot inference, showing how prompt engineering can improve the output of Large Language Models (LLMs). To improve on these results, we'll fully fine-tune the model and evaluate it using ROUGE metrics. This comparison will show what tailored model training can add. We'll then turn to efficiency with Parameter Efficient Fine-Tuning (PEFT). Although PEFT may score slightly lower on these metrics, we'll see how its benefits can outweigh that trade-off, providing valuable insights into resource-efficient fine-tuning.

# Overview of the Labs

### **Lab 1: Generative AI use cases, project lifecycle, and model pre-training**

Learning Objectives:

**1. Discuss model pre-training and the value of continued pre-training vs fine-tuning:**

Model Pre-training:

Definition:

Model pre-training is the initial phase in training a large language model. During this stage, the model is exposed to a diverse and extensive dataset containing a wide range of linguistic patterns and structures. The primary goal is for the model to learn general language patterns, grammar, and semantics, creating a foundation of linguistic knowledge.

Purpose:

The purpose of model pre-training is to build a versatile language model that can understand and generate coherent text across various language-related tasks. By exposing the model to a broad dataset, it learns to capture general language features, enabling it to perform well on a range of applications.

Continued Pre-training vs. Fine-tuning:

Continued Pre-training:

Definition: Continued pre-training involves further training the pre-trained model, typically with the same self-supervised objective, on text from a specific dataset or domain.

Purpose: This approach is beneficial when more domain-specific knowledge is required. It allows the model to delve deeper into the intricacies of a specific context, capturing nuances and patterns that might be crucial for tasks within that domain.

Fine-tuning:

Definition: Fine-tuning refers to adapting the pre-trained model to a specific task, typically by training on labeled examples of that task.

Purpose: Fine-tuning is effective when there is limited task-specific data available. Instead of training the model from scratch, fine-tuning leverages the general language knowledge gained during pre-training and refines the model's parameters to make it more suitable for the targeted application.

Decision Factors:

Data Specificity: Continued pre-training is advantageous when there's a substantial amount of domain-specific data available. Fine-tuning is preferred when task-specific data is limited.

Task Requirements: If the target task is closely related to the pre-training data, fine-tuning may suffice. If the task requires specialized knowledge, continued pre-training might be more beneficial.

Computational Resources: Fine-tuning is generally more computationally efficient than continued pre-training, making it preferable in resource-constrained scenarios.

In summary, the choice between continued pre-training and fine-tuning depends on factors such as the availability of task-specific data, the relevance of pre-training data to the target task, and the computational resources at hand. The decision is a balance between leveraging general language knowledge and adapting the model for specific applications.

**2. Define the terms Generative AI, large language models, prompt, and describe the transformer architecture that powers LLMs:**

**Generative AI:**

Definition:

Generative AI refers to a category of artificial intelligence systems designed to generate new, original content. These systems, instead of being explicitly programmed for specific tasks, are trained on data and can produce outputs such as text, images, or other forms of creative content. Generative AI models are known for their ability to autonomously create content that resembles the patterns and styles present in their training data.

**Large Language Models (LLMs):**

Definition:

Large Language Models are advanced neural network architectures, often containing millions or even billions of parameters, designed to understand and generate human-like language. These models, like OpenAI's GPT (Generative Pre-trained Transformer) series, are pre-trained on massive datasets, enabling them to learn complex language patterns, semantics, and contextual relationships. LLMs are versatile and can be fine-tuned for various natural language processing tasks.

**Prompt:**

Definition:

A prompt is a specific set of instructions or input provided to a generative AI model, especially large language models. It guides the model in generating a desired output. For instance, in the context of text generation, a prompt could be a sentence or a series of words that instruct the model on the type of text or information to generate in response.
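
As a small illustration of what a prompt looks like in practice, the sketch below assembles zero-shot and one-shot sentiment prompts as plain strings. The helper name `build_prompt`, the `Review:` / `Sentiment:` layout, and the sample reviews are assumptions made for this example, not a format required by any particular model.

```python
def build_prompt(review, examples=()):
    """Build a sentiment prompt; `examples` holds (review, label) pairs shown to the model."""
    parts = []
    # Each in-context example is a fully worked review/label pair.
    for example_review, example_label in examples:
        parts.append(f"Review: {example_review}\nSentiment: {example_label}\n")
    # The final block leaves the label empty for the model to complete.
    parts.append(f"Review: {review}\nSentiment:")
    return "\n".join(parts)


# Zero-shot: the model sees only the review to classify.
zero_shot = build_prompt("The battery died after two days.")

# One-shot: one worked example precedes the review to classify.
one_shot = build_prompt(
    "The battery died after two days.",
    examples=[("I love this phone.", "positive")],
)

print(zero_shot)
print("---")
print(one_shot)
```

Passing more pairs in `examples` turns the same helper into a few-shot prompt builder.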

**Transformer Architecture:**

Description:

The transformer architecture is a type of neural network architecture introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). It revolutionized natural language processing tasks, providing an efficient way to capture long-range dependencies in sequential data. Transformers use self-attention mechanisms to weigh the importance of different parts of the input sequence, and because every position is processed at once rather than step by step as in recurrent networks, computation can be parallelized. The transformer architecture is the backbone of many state-of-the-art models, including large language models like GPT-3. Its ability to capture contextual relationships efficiently makes it well-suited for tasks requiring an understanding of language semantics and structure.
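
To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The dimensions and the random projection matrices `w_q`, `w_k`, and `w_v` are arbitrary placeholders chosen for illustration; a real transformer learns these weights and adds multiple heads, masking, and feed-forward layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Similarity of every position with every other position, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Each row becomes a probability distribution over input positions.
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)
print(weights.sum(axis=-1))
```

All positions are handled in a few matrix multiplications, which is what allows the computation to run in parallel.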

**3. Describe the steps in an LLM-based generative AI model lifecycle and the constraining factors that drive decisions at each step:**

**LLM-Based Generative AI Model Lifecycle:**

**1. Data Gathering:**

Objective: Collecting diverse and relevant datasets to train the LLM.

Factors:

Data Quality: The quality of training data significantly impacts model performance.

Data Diversity: Ensuring the dataset covers a broad range of language patterns and contexts.

Ethical Considerations: Avoiding biases and ensuring fair representation in the training data.

**2. Model Selection:**

Objective: Choosing an appropriate LLM architecture based on the task requirements.

Factors:

Model Size: Balancing between model complexity and computational resources.

Task-specific Features: Selecting a model that aligns with the specific requirements of the intended tasks.

Pre-training Data: Considering models pre-trained on relevant or diverse datasets.

**3. Training:**

Objective: Pre-training the selected model on a large dataset to learn general language patterns.

Factors:

Computational Resources: Availability of GPUs or TPUs for efficient training.

Training Time: Balancing between training time and the depth of model pre-training.

Data Volume: Utilizing a sufficiently large dataset for effective pre-training.

**4. Performance Evaluation:**

Objective: Assessing the model's performance on validation datasets.

Factors:

Task-specific Metrics: Choosing metrics relevant to the intended tasks.

Generalization: Ensuring the model performs well on diverse data.

Overfitting: Identifying and mitigating overfitting issues.

**5. Fine-tuning (Optional):**

Objective: Adapting the pre-trained model for specific tasks or domains.

Factors:

Task-specific Data: The availability and quality of data for fine-tuning.

Computational Resources: The capacity to fine-tune the model efficiently.

Task Complexity: The intricacy of the target tasks influencing the need for fine-tuning.

**6. Deployment:**

Objective: Integrating the trained model into real-world applications.

Factors:

Inference Speed: Balancing between model accuracy and real-time inference requirements.

Scalability: Ensuring the model can handle varying workloads.

Ethical and Regulatory Compliance: Complying with ethical guidelines and legal requirements.

**Constraining Factors:**

**1. Resource Constraints:**

Impact: Limited computational resources can restrict the size and complexity of the chosen model.

Decision Influence: Determines the feasibility of training large models or fine-tuning for specific tasks.

**2. Data Limitations:**

Impact: Insufficient or biased training data can hinder model generalization.

Decision Influence: Influences the effectiveness of model training and its ability to perform well across diverse scenarios.

**3. Task Complexity:**

Impact: Complex tasks may require more extensive pre-training or fine-tuning efforts.

Decision Influence: Determines the depth and specificity of training needed for optimal performance.

**4. Ethical Considerations:**

Impact: Biased or unfair representation in training data can lead to ethical concerns.

Decision Influence: Guides decisions in data gathering, model training, and deployment to ensure fairness and avoid ethical pitfalls.

**5. Inference Requirements:**

Impact: Real-time applications may require trade-offs between model complexity and inference speed.

Decision Influence: Shapes decisions during model selection, fine-tuning, and deployment to meet real-time demands.

The lifecycle of an LLM-based generative AI model is a dynamic process, and the decisions made at each step are influenced by a combination of technical, resource, and ethical considerations. Striking the right balance is crucial for developing effective and responsible generative AI systems.

**4. Discuss the computational challenges of model pre-training and how to reduce the memory footprint:**

**Computational Challenges during Model Pre-training:**

1. **High Memory Usage:**
  - **Challenge:** Pre-training large language models (LLMs) with millions or billions of parameters demands substantial GPU or TPU memory.
  - **Impact:** Limited availability of memory can hinder the size and complexity of models that can be pre-trained.
2. **Long Training Time:**
  - **Challenge:** Pre-training on extensive datasets can be time-consuming, affecting overall training efficiency.
  - **Impact:** Lengthy training times may limit the frequency of model experimentation and development.
3. **Resource Intensiveness:**
  - **Challenge:** Training large models requires significant computational resources, including powerful GPUs or TPUs.
  - **Impact:** Resource-intensive pre-training can be costly and may be impractical for smaller research or development environments.
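
A back-of-the-envelope calculation shows why memory becomes the bottleneck. The sketch below estimates the memory needed just to hold the weights of a 1-billion-parameter model, and then a rough training footprint. The 16 bytes per parameter used for training is an assumption (float32 weights, float32 gradients, and two float32 Adam optimizer states); activations and temporary buffers are ignored, so real usage would be higher.

```python
def weights_memory_gb(num_params, bytes_per_param):
    """Memory in gigabytes (10^9 bytes) for num_params values of the given width."""
    return num_params * bytes_per_param / 1e9


num_params = 1_000_000_000  # hypothetical 1B-parameter model

# Storing the weights alone.
fp32_gb = weights_memory_gb(num_params, 4)  # 32-bit floats
fp16_gb = weights_memory_gb(num_params, 2)  # 16-bit floats

# Assumed training footprint per parameter:
# 4 bytes weight + 4 bytes gradient + 8 bytes for two Adam moment estimates.
train_gb = weights_memory_gb(num_params, 4 + 4 + 8)

print(f"weights in float32: {fp32_gb} GB")
print(f"weights in float16: {fp16_gb} GB")
print(f"rough training footprint: {train_gb} GB")
```

Even under these simplified assumptions, training needs several times the memory of inference, which motivates the memory-saving strategies in this section.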

**Strategies to Optimize Memory Usage and Reduce Footprint:**

1. **Gradient Accumulation:**
  - **Description:** Instead of processing one large batch at once, split it into several smaller micro-batches, accumulate their gradients, and update the model parameters only after the last micro-batch.
  - **Benefits:** Keeps the effective batch size large while only one micro-batch has to fit in memory at a time, allowing training with limited memory.
2. **Model Parallelism:**
  - **Description:** Distribute the model's parameters across multiple GPUs, enabling parallel processing.
  - **Benefits:** Reduces the memory requirement on individual GPUs and facilitates training larger models.
3. **Data Parallelism:**
  - **Description:** Distribute training data across multiple devices and synchronize updates to the model.
  - **Benefits:** Enables the use of larger batch sizes without increasing memory requirements on each device.
4. **Mixed-Precision Training:**
  - **Description:** Use lower-precision data types (e.g., float16 or bfloat16) for most computations during training, typically while keeping a float32 master copy of the weights.
  - **Benefits:** Reduces memory usage without significant loss of model accuracy.
5. **Gradient Checkpointing:**
  - **Description:** Store only a subset of intermediate activations during the forward pass and recompute the rest during the backward pass, trading computation for memory.
  - **Benefits:** Enables training of deeper models with lower memory requirements at the cost of increased computation.
6. **Memory-Efficient Architectures:**
  - **Description:** Choose model architectures designed for memory efficiency without compromising performance.
  - **Benefits:** Allows for the training of larger models within constrained memory environments.
7. **Reduced Model Size:**
  - **Description:** Prune or compress model parameters to reduce overall memory usage.
  - **Benefits:** Facilitates pre-training of models on less powerful hardware or in resource-constrained environments.
8. **Progressive Loading of Data:**
  - **Description:** Load and process data progressively, reducing the need to store the entire dataset in memory.
  - **Benefits:** Allows pre-training on datasets that do not fit entirely in memory.

Efficiently addressing the computational challenges during model pre-training involves a combination of architectural choices, distributed computing strategies, and memory optimization techniques. By employing these strategies, researchers and practitioners can navigate resource constraints and pre-train large language models more effectively.
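
Gradient accumulation is the easiest of these strategies to demonstrate in isolation. The NumPy sketch below uses a toy linear regression (an assumption made to keep the example self-contained, not the setup of an actual LLM) to check that averaging gradients over four equal-sized micro-batches reproduces the gradient of the full batch, while only one micro-batch is processed at a time.

```python
import numpy as np


def mse_gradient(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w."""
    residual = x @ w - y
    return x.T @ residual / len(y)


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))  # toy inputs: 32 samples, 5 features
y = rng.normal(size=32)       # toy targets
w = np.zeros(5)

# Reference: gradient computed on the full batch of 32 samples.
full_grad = mse_gradient(w, x, y)

# Accumulation: 4 micro-batches of 8 samples, processed one at a time.
accumulation_steps = 4
accumulated = np.zeros_like(w)
for x_micro, y_micro in zip(np.split(x, accumulation_steps),
                            np.split(y, accumulation_steps)):
    # Divide by the number of steps so the sum equals the full-batch mean.
    accumulated += mse_gradient(w, x_micro, y_micro) / accumulation_steps

# A single parameter update happens only after all micro-batches are seen.
learning_rate = 0.1
w_new = w - learning_rate * accumulated

print(np.allclose(full_grad, accumulated))
```

The equality holds here because the micro-batches have equal size; with uneven splits the per-batch gradients would need to be weighted by batch size.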

**5. Define the term scaling law and describe the laws related to LLMs:**

Scaling Law:

Definition: A scaling law is an empirical relationship that describes how the performance of a system changes as a function of certain factors, such as size, volume, or resources. In the context of machine learning, scaling laws provide insights into how the performance of models, particularly large language models (LLMs), is affected by changes in factors like model size, dataset size, and computational resources, which in turn shape inference requirements.

Scaling Laws Related to LLMs:

1. Dataset Size Scaling:
  - Principle: Increasing the size of the training dataset can improve model performance, but with diminishing returns.
  - Insights: Initially, as more data is added, models benefit from increased diversity and coverage. However, beyond a certain point, the marginal improvement decreases, and the computational cost of processing larger datasets becomes significant.
2. Computational Resource Scaling:
  - Principle: Increasing computational resources, such as GPU or TPU power, allows training larger and more complex models, improving performance.
  - Insights: Larger models with more parameters tend to benefit from additional computation. However, the relationship is not always linear, and there's a point of diminishing returns where further increases in resources may not proportionally improve performance.
3. Model Size Scaling:
  - Principle: Enlarging the size of the model, measured by the number of parameters, can enhance performance.
  - Insights: Larger models have the capacity to capture more complex patterns and relationships in the data. However, the computational cost increases, and there is a trade-off between model size and training efficiency.
4. Inference Scaling:
  - Principle: The computational requirements for model inference, especially for deployment in real-world applications, can be influenced by model size.
  - Insights: Smaller models are often preferred for inference due to lower computational requirements. Efficient model architectures and optimization techniques become crucial for deployment in resource-constrained environments.
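
Scaling relationships of this kind are often modeled as power laws. The sketch below evaluates a hypothetical power law for loss as a function of parameter count to show the diminishing-returns pattern described above. The functional form is a common modeling choice; the constants `scale` and `exponent` are illustrative assumptions, not fitted values for any real model.

```python
def power_law_loss(num_params, scale=1e13, exponent=0.08):
    """Hypothetical loss curve: loss = (scale / num_params) ** exponent."""
    return (scale / num_params) ** exponent


# Model sizes growing by a factor of 10 each step.
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [power_law_loss(n) for n in sizes]

# Absolute improvement gained from each 10x increase in size.
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]

for n, loss in zip(sizes, losses):
    print(f"{n:.0e} parameters -> loss {loss:.3f}")
print("gain per 10x step:", [round(g, 3) for g in gains])
```

Each tenfold increase in size still lowers the loss, but by a smaller absolute amount than the previous one, while the cost of training grows with every step.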

Key Insights from Scaling Laws:

1. Diminishing Returns:
  - Increasing certain factors, such as dataset size or model complexity, may lead to diminishing returns in terms of improved performance.
2. Trade-offs:
  - There are trade-offs between factors like model size, computational resources, and inference speed. Finding the right balance is crucial for practical implementation.
3. Efficiency Concerns:
  - Efficient use of resources becomes paramount, especially when dealing with large datasets and complex models. Optimization techniques and algorithmic improvements play a significant role.
4. Domain-Specific Considerations:
  - The optimal scaling strategy may vary based on the specific requirements of the task or domain. Understanding the nuances is essential for effective model development.

Scaling laws guide practitioners and researchers in optimizing large language models for efficiency and effectiveness, considering factors that impact performance across different stages of model development and deployment.

### **Lab 2: Fine-tuning and evaluating large language models**

Learning Objectives:

**1. Describe how fine-tuning with instructions using prompt datasets can improve performance on one or more tasks:**

**Fine-tuning with Instructions Using Prompt Datasets:**

**Objective:** The objective of fine-tuning with instructions using prompt datasets is to adapt a pre-trained large language model (LLM) to perform specific tasks or excel in certain domains. By providing task-specific instructions through carefully crafted prompts during the fine-tuning process, the model can be guided to generate contextually relevant outputs for the desired applications.

**Key Concepts:**

1. **Fine-tuning:**
  - **Definition:** Fine-tuning involves taking a pre-trained model and adjusting its parameters on a task-specific dataset.
  - **Purpose:** It allows the model to specialize for particular applications, leveraging the general language knowledge gained during pre-training.
2. **Instructions using Prompt Datasets:**
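  - **Definition:** A prompt dataset is a collection of examples that pair an instruction prompt with the desired response; each prompt is a set of input instructions given to the model to guide its output generation.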
  - **Purpose:** By providing specific instructions through prompts during fine-tuning, the model can be tailored to respond effectively to particular tasks or domains.
3. **Enhancing Performance:**
  - **Mechanism:** The prompts act as cues for the model, influencing its understanding and generation of contextually relevant content.
  - **Benefits:**
    - **Task-Specific Adaptation:** The model becomes more adept at tasks outlined in the prompts.
    - **Improved Contextual Understanding:** Fine-tuning helps the model grasp nuances and specific patterns related to the given instructions.
4. **Versatility:**
  - **Applicability:** This fine-tuning approach makes large language models versatile, allowing them to be used for a wide range of tasks beyond their initial pre-training objectives.
  - **Customization:** Tailoring the model through prompt-based fine-tuning enables it to adapt to diverse application scenarios.
5. **Examples of Tasks:**
  - **Text Completion:** Instructing the model to complete sentences or paragraphs based on given prompts.
  - **Summarization:** Fine-tuning for summarizing longer texts with specific guidelines.
  - **Translation:** Adapting the model for translating text from one language to another.
  - **Question-Answering:** Training the model to generate accurate responses to specific types of questions.

**Significance:**

- **Efficiency:** Fine-tuning with prompt datasets is an efficient way to leverage pre-trained models for task-specific applications without starting the training process from scratch.
- **Domain Adaptation:** This approach is particularly powerful for adapting models to specialized domains, where the language nuances may differ from the general pre-training data.
- **Transfer Learning:** Fine-tuning extends the principles of transfer learning, allowing models to transfer knowledge gained in one context to excel in another.

In summary, fine-tuning with instructions using prompt datasets is a crucial step in the lifecycle of large language models, enabling them to perform effectively on specific tasks or domains by providing targeted guidance during the training process.
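
A prompt dataset for instruction fine-tuning can be as simple as a list of prompt and completion pairs. The sketch below converts raw labeled reviews into that shape. The template wording, the `prompt` / `completion` field names, and the sample reviews are assumptions for illustration; actual datasets and training libraries use their own schemas.

```python
# Hypothetical instruction template for a sentiment task.
INSTRUCTION_TEMPLATE = (
    "Classify the sentiment of the following review as positive or negative.\n\n"
    "Review: {text}\n\n"
    "Sentiment:"
)


def to_instruction_example(text, label):
    """Turn one labeled review into a prompt/completion training pair."""
    return {
        "prompt": INSTRUCTION_TEMPLATE.format(text=text),
        "completion": f" {label}",
    }


# Made-up labeled data standing in for a real sentiment dataset.
raw_data = [
    ("The plot was gripping from start to finish.", "positive"),
    ("The service was slow and the food was cold.", "negative"),
]

instruction_dataset = [to_instruction_example(text, label) for text, label in raw_data]

for example in instruction_dataset:
    print(repr(example["prompt"]), "->", repr(example["completion"]))
```

During fine-tuning, the model is trained to produce each `completion` when given the matching `prompt`, so the same instruction wording can later be reused at inference time.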

**2. Define catastrophic forgetting and explain techniques that can be used to overcome it:**

**Catastrophic Forgetting:**

**Definition:** Catastrophic forgetting is a phenomenon observed in machine learning, particularly in neural networks, where a model tends to forget previously learned information when it is trained on new tasks. This can be problematic when the model needs to continually adapt to new data without losing knowledge of its previously acquired skills.

**Techniques to Overcome Catastrophic Forgetting:**

1. **Rehearsal or Experience Replay:**
  - **Description:** Store and periodically revisit old data during training.
  - **Purpose:** By interleaving new and old data, the model is forced to retain knowledge of both tasks over time.
2. **Elastic Weight Consolidation (EWC):**
  - **Description:** Introduce a regularization term in the loss function that penalizes changes to important weights learned during previous tasks.
  - **Purpose:** Prioritizes the preservation of important knowledge while allowing adaptation to new tasks.
3. **Dual Memory Networks:**
  - **Description:** Introduce a separate memory module dedicated to storing information from the past.
  - **Purpose:** Allows the model to selectively access and update information in the memory module, preventing interference with previous knowledge.
4. **Progressive Neural Networks (PNN):**
  - **Description:** Add a new network column for each task, freeze the columns trained on earlier tasks, and connect them to the new column through lateral connections.
  - **Purpose:** Ensures that the knowledge learned for each task is retained in its dedicated network, preventing interference with other tasks.
5. **Meta-learning or Learning to Learn:**
  - **Description:** Train the model to quickly adapt to new tasks with minimal forgetting.
  - **Purpose:** Enables the model to efficiently learn and adapt to new information while retaining knowledge of previous tasks.
6. **Parameter Regularization:**
  - **Description:** Regularize the model parameters during training to constrain changes that may lead to forgetting.
  - **Purpose:** Encourages the model to adapt to new information without excessively modifying parameters crucial for retaining previous knowledge.
7. **Knowledge Distillation:**
  - **Description:** Transfer knowledge from an existing model (teacher) to the model being fine-tuned (student).
  - **Purpose:** Helps the model retain knowledge from the teacher model while adapting to new tasks.
8. **Task-specific Heads or Modules:**
  - **Description:** Design the model with task-specific heads or modules for different tasks.
  - **Purpose:** Allows the model to focus on learning new tasks without disrupting the representations associated with previous tasks.

**Key Insights:**

- Catastrophic forgetting is a challenge, especially in scenarios where a model needs to adapt to new tasks without losing knowledge of previously learned tasks.
- Techniques to overcome catastrophic forgetting aim to strike a balance between adapting to new information and preserving knowledge from the past.
- The choice of a specific technique depends on the nature of the tasks, the architecture of the model, and the available computational resources.

By incorporating these techniques, practitioners can address the issue of catastrophic forgetting and build models that continually learn and adapt to new tasks without compromising their previously acquired knowledge.
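
Of the techniques listed, Elastic Weight Consolidation has a particularly compact form: a quadratic penalty that grows when important weights move away from their old values. The NumPy sketch below computes that penalty for two hypothetical updates. The parameter values and the importance scores in `fisher` are invented for illustration; in practice the importance is estimated from data on the earlier task.

```python
import numpy as np


def ewc_penalty(params, old_params, fisher, strength):
    """EWC regularizer: 0.5 * strength * sum(fisher * (params - old_params) ** 2)."""
    return 0.5 * strength * np.sum(fisher * (params - old_params) ** 2)


# Weights learned on the previous task (hypothetical values).
old_params = np.array([1.0, -2.0, 0.5])
# Hypothetical importance of each weight for the previous task.
fisher = np.array([10.0, 0.1, 1.0])

# Two candidate updates of the same size, applied to different weights.
shift_important = old_params + np.array([0.5, 0.0, 0.0])
shift_unimportant = old_params + np.array([0.0, 0.5, 0.0])

penalty_important = ewc_penalty(shift_important, old_params, fisher, strength=1.0)
penalty_unimportant = ewc_penalty(shift_unimportant, old_params, fisher, strength=1.0)

print(penalty_important, penalty_unimportant)
```

The penalty would be added to the new task's loss, so moving a weight that mattered for the old task costs far more than moving one that did not.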

**3. Define the term Parameter-efficient Fine Tuning (PEFT):**

**Parameter-efficient Fine Tuning (PEFT):**

**Definition:** Parameter-efficient Fine Tuning (PEFT) is a fine-tuning methodology that emphasizes computational efficiency when adapting pre-trained models to specific tasks. Rather than updating every weight, PEFT trains only a small subset of the existing parameters or a small number of newly added ones while the rest stay frozen, thereby reducing the computational resources required for the fine-tuning process.

**Key Concepts:**

1. **Fine-tuning with Minimal Parameter Updates:**
  - **Objective:** PEFT aims to fine-tune models with only essential updates to the pre-trained parameters, avoiding extensive modifications.
  - **Rationale:** By minimizing parameter updates, PEFT seeks to retain the valuable knowledge gained during pre-training while adapting the model to task-specific requirements.
2. **Efficient Use of Computational Resources:**
  - **Objective:** Focus on achieving fine-tuning with reduced computational cost.
  - **Rationale:** Computational efficiency is a critical consideration, especially when dealing with resource-constrained environments or when deploying models at scale.
3. **Balancing Task-specific Adaptation and Model Preservation:**
  - **Objective:** Strike a balance between adapting the model to new tasks and preserving knowledge from pre-training.
  - **Rationale:** PEFT seeks to avoid overfitting to task-specific data while ensuring that the model maintains its generalization capabilities acquired during pre-training.
4. **Optimizing Memory Usage:**
  - **Objective:** Reduce the memory footprint required for fine-tuning.
  - **Rationale:** Optimizing memory usage is crucial for efficient training, especially when dealing with large models or limited computational resources.
5. **Adaptability to Various Tasks:**
  - **Objective:** Design PEFT to be adaptable across a range of tasks.
  - **Rationale:** The method should be versatile, allowing models to be efficiently fine-tuned for different applications without the need for extensive computational resources.

**Significance:**

- **Resource Efficiency:** PEFT is particularly valuable in scenarios where computational resources are limited, enabling the fine-tuning of models without excessive demands on hardware.
- **Scalability:** The efficiency of PEFT makes it scalable for deploying models in large-scale applications, where the cost of fine-tuning and model adaptation is a critical factor.
- **Rapid Model Deployment:** PEFT facilitates quick adaptation of pre-trained models to new tasks, enabling rapid deployment in real-world applications.
- **Versatility:** PEFT is applicable to a variety of tasks and domains, making it a versatile approach for fine-tuning models for different applications while considering resource constraints.

Parameter-efficient Fine Tuning addresses the need for adapting pre-trained models with a focus on computational efficiency, making it a valuable methodology in scenarios where resource utilization is a crucial consideration.
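
To give a sense of scale, the sketch below counts trainable parameters for one weight matrix under full fine-tuning and under a low-rank adapter in the style of LoRA. LoRA is used here only as one well-known PEFT method (the text above does not name a specific one), and the 768 x 768 matrix size and rank 8 are assumed values chosen for illustration.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in a low-rank update B @ A with B: (d_out, rank), A: (rank, d_in)."""
    return rank * (d_out + d_in)


# Hypothetical dimensions for a single projection matrix.
d_out, d_in, rank = 768, 768, 8

full = d_out * d_in                              # full fine-tuning updates every entry
lora = lora_trainable_params(d_out, d_in, rank)  # adapter trains only the two small factors
fraction = lora / full

print(f"full fine-tuning: {full} trainable parameters")
print(f"low-rank adapter: {lora} trainable parameters")
print(f"fraction trained: {fraction:.2%}")
```

Because gradients and optimizer states are only needed for the trainable parameters, shrinking that count also shrinks the memory needed during fine-tuning.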

**4. Explain how PEFT decreases computational cost and overcomes catastrophic forgetting:**

**How Parameter-efficient Fine Tuning (PEFT) Decreases Computational Cost:**

1. **Minimal Parameter Updates:** PEFT aims to fine-tune models with minimal updates to the pre-trained parameters. This reduces the amount of computation required during the fine-tuning process. Instead of making extensive modifications to the model, PEFT selectively adjusts parameters relevant to the new task.
2. **Efficient Memory Usage:** PEFT optimizes memory usage during the fine-tuning process. Because most parameters stay frozen, gradients and optimizer states need to be kept only for the small trainable subset. This is crucial, especially when dealing with large models that may have memory constraints.
3. **Adaptability Across Tasks:** PEFT is designed to be adaptable across various tasks. This adaptability allows a single pre-trained model to be fine-tuned for different applications without the need for separate resource-intensive fine-tuning processes. This versatility contributes to overall computational efficiency.
4. **Reduced Computational Overhead:** Since PEFT focuses on retaining the knowledge gained during pre-training while adapting to new tasks, it avoids unnecessary computations that may lead to increased computational overhead. This results in a more resource-efficient fine-tuning process.

**How PEFT Overcomes Catastrophic Forgetting:**

1. **Balancing Task-specific Adaptation:** PEFT seeks to strike a balance between adapting the model to new tasks and preserving knowledge from pre-training. This prevents the model from overly specializing on the new task, reducing the risk of catastrophic forgetting.
2. **Regularization Techniques:** PEFT may incorporate regularization techniques that penalize drastic changes to important parameters learned during pre-training. This helps in retaining essential knowledge and preventing catastrophic forgetting when adapting to new tasks.
3. **Efficient Transfer of Knowledge:** PEFT focuses on efficient transfer of knowledge from pre-training to fine-tuning. By emphasizing minimal parameter updates, it ensures that the model maintains a significant portion of its previously acquired knowledge while adapting to new information.
4. **Selective Parameter Updates:** PEFT selectively updates parameters that are relevant to the new task, minimizing interference with existing knowledge. This selective approach prevents catastrophic forgetting by ensuring that the model retains its ability to perform well on a range of tasks.
5. **Task-specific Modules:** If applicable, PEFT may involve the use of task-specific modules or heads, allowing the model to focus on learning new tasks without disrupting representations associated with previous tasks. This modular approach contributes to preventing catastrophic forgetting.

In summary, PEFT decreases computational costs by optimizing memory usage, minimizing parameter updates, and providing adaptability across tasks. Simultaneously, it overcomes catastrophic forgetting by carefully balancing task-specific adaptation and preserving knowledge from pre-training through selective updates and regularization techniques.
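
The link between selective updates and forgetting can be illustrated with a toy low-rank adapter in the style of LoRA, used here as an assumed example of a PEFT method. In the NumPy sketch below, the base weight matrix is never modified; only the small adapter matrices change, so switching the adapter off recovers the original behavior exactly. The random update stands in for real gradient-based training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 2

# Frozen pre-trained weights (random stand-ins) and a copy for later comparison.
w_frozen = rng.normal(size=(d_out, d_in))
w_before = w_frozen.copy()

# Trainable low-rank adapter; B starts at zero so the adapter is initially a no-op.
lora_a = rng.normal(scale=0.01, size=(rank, d_in))
lora_b = np.zeros((d_out, rank))


def forward(x, use_adapter=True):
    """Linear layer: frozen base weights plus an optional low-rank correction."""
    y = x @ w_frozen.T
    if use_adapter:
        y = y + x @ lora_a.T @ lora_b.T
    return y


x = rng.normal(size=(4, d_in))
base_output = forward(x, use_adapter=False)

# Stand-in for task-specific training: only the adapter parameters change.
lora_b += rng.normal(scale=0.1, size=lora_b.shape)
adapted_output = forward(x)

print(np.array_equal(w_frozen, w_before))
print(np.allclose(forward(x, use_adapter=False), base_output))
```

One base model can therefore serve several tasks by swapping small adapters, with the pre-trained knowledge in `w_frozen` left intact.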

**5. Explain how fine-tuning with instructions using prompt datasets can increase LLM performance on one or more tasks:**

**Fine-tuning with Instructions using Prompt Datasets: Enhancing LLM Performance**

1. **Task-Specific Adaptation:**

- **Importance:** Fine-tuning with instructions using prompt datasets enables the model to adapt to specific tasks by providing explicit guidance through prompts.
- **Effect:** The model learns to generate contextually relevant responses based on the task-specific instructions, enhancing its performance for targeted applications.

2. **Guided Context Understanding:**

- **Importance:** Prompts serve as cues, guiding the model's understanding of the desired context for generating outputs.
- **Effect:** This guided context understanding ensures that the model produces outputs that align more closely with the intended meaning of the given instructions, leading to improved task performance.

3. **Customization for Diverse Tasks:**

- **Importance:** Fine-tuning with prompt datasets allows large language models (LLMs) to be customized for a variety of tasks beyond their initial pre-training objectives.
- **Effect:** The model becomes versatile, capable of adapting to diverse applications by leveraging task-specific instructions, making it a valuable tool for a broad range of natural language processing tasks.

4. **Improved Contextual Relevance:**

- **Importance:** The inclusion of task-specific instructions refines the model's contextual understanding, improving the relevance of generated content.
- **Effect:** The model produces more accurate and contextually appropriate responses, addressing the specific requirements outlined in the prompts and enhancing its overall performance on various tasks.

5. **Nuanced Adaptation to Instructions:**

- **Importance:** Fine-tuning allows the model to capture nuanced instructions and patterns within the prompts, leading to more nuanced adaptations.
- **Effect:** The model learns to generate outputs that align not only with explicit instructions but also with subtle nuances present in the provided prompts, making it more adept at understanding complex language patterns.

6. **Reduced Ambiguity:**

- **Importance:** Task-specific instructions help clarify ambiguous scenarios and guide the model in generating more precise responses.
- **Effect:** The model becomes less prone to generating ambiguous or generic outputs, resulting in improved clarity and accuracy tailored to the specific task at hand.

7. **Efficient Model Adaptation:**

- **Importance:** Fine-tuning with prompt datasets provides an efficient way to adapt pre-trained models to new tasks without retraining from scratch.
- **Effect:** The model leverages its pre-existing language knowledge and adapts more quickly and efficiently to new tasks, making the fine-tuning process resource-efficient.

8. **Enhanced Generalization:**

- **Importance:** The fine-tuning process with specific instructions contributes to the model's ability to generalize well across a range of tasks.
- **Effect:** The model retains its adaptability to different tasks, ensuring that the learned knowledge from prompts contributes to improved generalization capabilities.

In conclusion, fine-tuning with instructions using prompt datasets is crucial for enhancing the performance of large language models by providing task-specific guidance, refining contextual understanding, and enabling efficient adaptation to diverse applications. The incorporation of explicit instructions through prompts is a powerful approach to tailoring models for specific tasks and domains, contributing to their versatility and effectiveness.
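
To make the idea of a prompt dataset concrete, the following minimal sketch turns labeled examples into instruction-style (prompt, target) pairs of the kind used for instruction fine-tuning. The template wording, the function name `to_instruction_pair`, and the sample records are assumptions made for illustration; they are not taken from any particular dataset.

```python
# Hypothetical instruction template; the wording is an illustrative assumption.
INSTRUCTION = (
    "Classify the sentiment of the following text as positive, negative or neutral."
)


def to_instruction_pair(text, label):
    """Wrap a raw (text, label) record in an instruction prompt."""
    prompt = f"{INSTRUCTION}\n\nText: {text.strip()}\n\nSentiment:"
    return {"prompt": prompt, "target": label}


# Made-up records standing in for a labeled dataset.
records = [
    ("I am sooo tired", "negative"),
    ("What a lovely morning!", "positive"),
]

pairs = [to_instruction_pair(text, label) for text, label in records]
for pair in pairs:
    print(pair["prompt"])
    print("->", pair["target"])
    print()
```

During fine-tuning, the model would see each `prompt` as input and be trained to produce the matching `target`.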

## **Lab 3: Reinforcement learning and LLM-powered applications**

Learning Objectives:

**1. Describe how RLHF uses human feedback to improve the performance and alignment of large language models:**

**Reinforcement Learning from Human Feedback (RLHF):**

**Definition:** Reinforcement Learning from Human Feedback (RLHF) is a methodology that leverages human-generated feedback to enhance the performance and alignment of large language models (LLMs). It involves using human-provided signals, such as comparisons or rankings, to refine the model's behavior over time.

**Key Concepts:**

1. **Performance Improvement:**
  - **Objective:** RLHF aims to improve the model's performance by iteratively incorporating feedback from human evaluators.
  - **Process:** The model receives feedback signals based on its generated outputs, allowing it to adapt and refine its behavior in subsequent iterations.
2. **Alignment with Human Intent:**
  - **Objective:** RLHF seeks to align the model's outputs more closely with human preferences and intentions.
  - **Human Feedback Types:** Evaluators provide feedback in the form of rankings, preferences, or comparisons, guiding the model to generate content that better meets human expectations.
3. **Iterative Learning Process:**
  - **Process:** RLHF involves multiple iterations where the model generates outputs, receives human feedback, and updates its parameters to improve alignment and performance.
  - **Feedback Refinement:** The iterative process allows the model to learn from nuanced human feedback, progressively refining its understanding and generating more desirable content.
4. **Diversity of Human Feedback:**
  - **Importance:** RLHF benefits from diverse human perspectives and preferences.
  - **Adaptability:** The model can adapt to a wide range of preferences, ensuring that it aligns well with the varied expectations of users.
5. **Reward Signal Design:**
  - **Objective:** Designing an effective reward signal is crucial in RLHF.
  - **Feedback Types:** Rewards can be based on binary preferences, numerical scores, or more complex signals, depending on the nature of the task and the quality of generated outputs.
6. **Addressing Ethical Considerations:**
  - **Consideration:** RLHF should be designed with ethical considerations to avoid biases and ensure fair representation in the feedback process.
  - **Transparency:** Transparent communication about the model's learning process and the use of human feedback is essential for ethical RLHF.
7. **Application to LLMs:**
  - **Relevance:** RLHF is particularly relevant for large language models, where aligning generated text with human preferences is crucial for diverse applications.
  - **Versatility:** LLMs can be fine-tuned and adapted for various tasks using RLHF, making them more versatile and capable of delivering contextually relevant content.

**Significance:**

- **Human-Centric Improvement:** RLHF puts human evaluators at the center of the improvement process, ensuring that the model's outputs align with human expectations and preferences.
- **Adaptability to Dynamic Preferences:** RLHF allows models to adapt to evolving user preferences and expectations, making them more responsive to changes in language use and cultural context.
- **Ethical Considerations:** RLHF emphasizes ethical considerations in the training process, promoting fairness, transparency, and responsible AI practices.

In summary, RLHF is a powerful approach to enhance the performance and alignment of large language models by leveraging human feedback, facilitating iterative learning, and ensuring adaptability to diverse user preferences.
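
The generate, score, update loop described above can be illustrated with a deliberately tiny example. In this sketch the "policy" is a softmax over three canned responses, and fixed reward values stand in for scores derived from human feedback. All names and numbers are made up for illustration; real RLHF pipelines update a full language model, commonly with an algorithm such as PPO, rather than three logits.

```python
import math

# Toy setup: three canned responses and made-up rewards that stand in
# for scores derived from human feedback.
responses = ["rude reply", "vague reply", "helpful reply"]
rewards = [-1.0, 0.0, 1.0]
logits = [0.0, 0.0, 0.0]
learning_rate = 0.5


def softmax(values):
    """Convert logits into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


initial_probs = softmax(logits)
for step in range(20):
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # Exact gradient of the expected reward for a softmax policy:
    # d/d logit_i = p_i * (r_i - expected_reward)
    logits = [
        logit + learning_rate * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, rewards)
    ]

final_probs = softmax(logits)
for response, before, after in zip(responses, initial_probs, final_probs):
    print(f"{response}: {before:.2f} -> {after:.2f}")
```

After the updates, probability mass shifts toward the response with the highest reward, which is the behavior the feedback signal is meant to encourage.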

**2. Explain how data gathered from human labelers is used to train a reward model for RLHF:**

**Training a Reward Model for RLHF using Data from Human Labelers:**

Reinforcement Learning from Human Feedback (RLHF) involves training a reward model that guides the learning process of a model based on human-provided signals. The process typically involves collecting data from human labelers who evaluate the model's outputs, and this data is then used to construct a reward model. Here's an overview of how this process works:

1. **Data Collection:**

- **Human Labelers:** Gather a group of human labelers who can assess the quality of the model's outputs. These labelers may provide rankings, preferences, or comparisons based on their judgment.

2. **Task Definition:**

- **Define the Task:** Clearly define the task for which the model is seeking improvement. This could be generating coherent text, answering questions, or any other task relevant to the model's purpose.

3. **Model Output Generation:**

- **Model Inference:** Have the model generate outputs for a set of inputs relevant to the defined task. These outputs serve as the candidate solutions that will be evaluated by human labelers.

4. **Human Feedback Collection:**

- **Feedback Signals:** Human labelers assess the model-generated outputs and provide feedback in the form of comparisons, rankings, or preferences. For example, they may rank different outputs in terms of quality or indicate preferences between pairs of generated content.

5. **Reward Model Construction:**

- **Mapping Human Feedback to Rewards:** Convert the human-provided feedback into a reward signal. This involves mapping the feedback types (e.g., preferences, rankings) to numerical values that represent the desirability of the generated outputs.
- **Reward Function:** Define a reward function that captures the desired characteristics of the outputs based on the human feedback. For instance, if the task is language generation, a reward function might assign higher scores to outputs that are more coherent, contextually relevant, or grammatically correct.

6. **Supervised Learning:**

- **Training the Reward Model:** Train the reward model on the collected preference data in a supervised learning setting, so that it learns to assign higher scores to the outputs that labelers preferred. The LLM is then fine-tuned with reinforcement learning to generate outputs that earn higher scores from this reward model.

7. **Iterative Feedback Loop:**

- **Iterative Learning Process:** Repeat the process iteratively. Generate new outputs, collect feedback from human labelers, update the reward model, and fine-tune the LLM again. This iterative loop allows the LLM to progressively improve its performance based on refined human guidance.

8. **Ethical Considerations:**

- **Bias Mitigation:** Pay attention to potential biases in the human feedback and take steps to mitigate them. Ensure diverse perspectives are considered and that the training process adheres to ethical guidelines.

9. **Validation and Testing:**

- **Validation Set:** Reserve a portion of the collected data as a validation set to assess the generalization of the model to new, unseen inputs.
- **Testing Set:** Use a separate testing set to evaluate the model's performance on inputs that were not part of the training or validation sets.

10. **Deployment and Monitoring:**

- **Deployment:** Once the model achieves satisfactory performance, deploy it for the intended task.
- **Monitoring:** Continuously monitor the model's outputs and gather ongoing feedback from users to address any issues and further improve its performance over time.

11. **Documentation and Transparency:**

- **Documentation:** Document the entire process, including the task definition, feedback collection, reward model construction, and training details.
- **Transparency:** Be transparent about the use of human feedback, the reward model, and the iterative learning process to build trust and ensure accountability.

12. **Feedback Refinement:**

- **Refining Feedback Mechanisms:** Based on the model's ongoing performance, refine the feedback mechanisms used by human labelers. Continuous improvement in feedback collection contributes to the model's adaptability.
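
Steps 4 to 6 above can be sketched in a few lines. The example below converts a labeler's ranking into (preferred, rejected) pairs and evaluates a Bradley-Terry style pairwise loss, a common choice for training reward models on preference data. The function names, the ranking, and the scores are hypothetical; in a real setup the scores would come from a trainable reward model.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_outputs, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Pairwise loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-margin))


# Hypothetical labeler ranking for one prompt, best first.
ranking = ["answer_a", "answer_b", "answer_c"]
pairs = ranking_to_pairs(ranking)
print(pairs)

# Made-up reward-model scores for the three answers.
scores = {"answer_a": 2.0, "answer_b": 0.5, "answer_c": -1.0}
losses = [pairwise_loss(scores[p], scores[r]) for p, r in pairs]
mean_loss = sum(losses) / len(losses)
print(f"mean pairwise loss: {mean_loss:.3f}")
```

The loss is small when the reward model scores the preferred output well above the rejected one, and large when the order is reversed, so minimizing it pushes the reward model toward the labelers' preferences.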

**3. Define chain-of-thought prompting and describe how it can be used to improve LLMs' reasoning and planning abilities:**

**Chain-of-Thought Prompting:**

**Definition:** Chain-of-thought prompting is a technique for large language models (LLMs) in which the prompt encourages the model to write out intermediate reasoning steps before giving its final answer, typically by including worked examples that show such steps or by asking the model to reason step by step. Each step builds on the previous ones, and the same idea extends to a sequence of prompts in which the model builds on its own earlier responses, creating a coherent chain of thought.

**Key Concepts:**

1. **Sequential Prompting:**
  - **Approach:** Provide a series of prompts or questions to the model in a sequential manner.
  - **Intent:** Each prompt builds upon the context established by the previous one, guiding the model to maintain a coherent and logical chain of thought.
2. **Progressive Expansion of Ideas:**
  - **Purpose:** Encourage the model to progressively expand on its initial responses, developing a more nuanced and detailed line of reasoning.
  - **Continuity:** The prompts maintain continuity in the thought process, facilitating the generation of logically connected ideas.
3. **Contextual Understanding:**
  - **Emphasis:** Chain-of-thought prompting emphasizes the development of a deeper contextual understanding within the model.
  - **Context Retention:** The model is guided to retain and build upon the context established in earlier prompts, contributing to more informed and contextually relevant responses.
4. **Reasoning and Planning:**
  - **Application:** The technique is particularly useful for enhancing the model's reasoning and planning abilities.
  - **Structured Thought Process:** By guiding the model through a structured sequence of prompts, it learns to reason through problems and plan responses in a more organized manner.
5. **Task-Specific Adaptation:**
  - **Flexibility:** Chain-of-thought prompting is adaptable to various tasks, allowing the model to reason and plan in domains such as problem-solving, decision-making, and narrative generation.
  - **Task-Specific Context:** The prompts can be tailored to elicit responses aligned with specific task requirements.

**Application to Improve LLMs' Reasoning and Planning Abilities:**

1. **Scenario-Based Problem Solving:**
  - **Prompt Sequence:** Present the model with a scenario or problem in the initial prompt. Subsequent prompts can ask for step-by-step solutions or elaborations, guiding the model through a reasoning process.
  - **Result:** The model learns to approach complex problems systematically, demonstrating improved reasoning abilities.
2. **Long-Form Narrative Generation:**
  - **Prompt Sequence:** Initiate the narrative with an initial scenario or idea. Subsequent prompts guide the model to add details, develop characters, and progress the storyline.
  - **Result:** The model produces coherent and logically structured long-form narratives, showcasing enhanced planning capabilities.
3. **Strategic Decision-Making:**
  - **Prompt Sequence:** Present a decision-making scenario in the initial prompt. Follow up with prompts that seek justifications, alternative choices, and potential outcomes.
  - **Result:** The model learns to reason through strategic decisions, considering various factors and consequences, leading to improved planning abilities.
4. **Argumentative Essay Construction:**
  - **Prompt Sequence:** Begin with a prompt introducing a topic. Subsequent prompts guide the model to present arguments, counterarguments, and evidence.
  - **Result:** The model develops a well-structured argumentative essay, demonstrating its ability to reason and plan the presentation of ideas cohesively.

**Benefits:**

- **Contextual Continuity:** Chain-of-thought prompting ensures that the model maintains contextual continuity, reducing the risk of generating disjointed or inconsistent responses.
- **Structured Reasoning:** The technique promotes structured reasoning, aiding the model in breaking down complex tasks into a series of logical steps.
- **Adaptability:** Chain-of-thought prompting is adaptable to various applications, allowing LLMs to improve their reasoning and planning abilities across diverse tasks.

By employing chain-of-thought prompting, developers and researchers can guide large language models through a structured thinking process, enhancing their reasoning and planning capabilities in a wide range of applications.
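
As a rough illustration, a chain-of-thought prompt can be assembled from worked examples whose answers spell out the reasoning, followed by the new question and a cue to reason step by step. The helper name `build_cot_prompt`, the exemplar, and the question below are assumptions made for this sketch.

```python
def build_cot_prompt(question, exemplars):
    """Build a few-shot prompt whose exemplars show reasoning before the answer."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The new question ends with a cue inviting step-by-step reasoning.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


# Made-up worked example that demonstrates the reasoning style.
exemplars = [
    {
        "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "reasoning": "Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
        "answer": "5",
    }
]

prompt = build_cot_prompt(
    "A shelf holds 4 books and 3 are added. How many books are there?",
    exemplars,
)
print(prompt)
```

The resulting string would be passed to the model in place of the bare question, so that the model imitates the exemplar and produces its reasoning before the answer.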

**4. Discuss the challenges that LLMs face with knowledge cut-offs, and explain how information retrieval and augmentation techniques can overcome these challenges:**

**Challenges of Knowledge Cut-offs in Large Language Models (LLMs):**

1. **Temporal Limitations:**
  - **Challenge:** LLMs have a knowledge cut-off, meaning they are trained on a fixed dataset up to a certain point in time. They may lack awareness of events or information that occurred after the cut-off, leading to temporal limitations in their knowledge.
2. **Dynamic Information:**
  - **Challenge:** Real-world information is dynamic and continuously evolving. LLMs may not be aware of the latest developments, leading to outdated or inaccurate responses when queried about recent events.
3. **Limited Domain Expertise:**
  - **Challenge:** LLMs may not have expertise in specialized or rapidly evolving domains. Their knowledge is confined to the general patterns learned during training, and they may struggle with in-depth understanding of specific topics.
4. **Emerging Trends and Terminology:**
  - **Challenge:** New trends, technologies, and terminology may emerge after the knowledge cut-off, causing LLMs to be unaware of or misinterpret current terminology and concepts.

**Information Retrieval and Augmentation Techniques to Overcome Knowledge Cut-offs:**

1. **Continuous Training:**
  - **Approach:** Periodically update the LLM by retraining it on more recent data to extend its knowledge beyond the initial cut-off.
  - **Benefit:** Allows the model to stay informed about evolving information and adapt to changes in language use.
2. **External Knowledge Bases:**
  - **Approach:** Integrate information retrieval from external knowledge bases or databases during inference.
  - **Benefit:** Enables the model to access real-time information and domain-specific knowledge beyond its training data, overcoming cut-off limitations.
3. **Pre-training on Broad Datasets:**
  - **Approach:** Pre-train LLMs on diverse and extensive datasets to capture a broader understanding of language patterns.
  - **Benefit:** Increases the chances that the model has encountered a wide range of topics during training, improving its ability to generate contextually relevant responses even after the knowledge cut-off.
4. **Federated Learning:**
  - **Approach:** Implement federated learning where models are trained collaboratively on decentralized data sources.
  - **Benefit:** By aggregating knowledge from multiple sources, the model gains insights from a broader set of data, reducing the impact of individual knowledge cut-offs.
5. **Active Learning:**
  - **Approach:** Allow the model to actively seek user feedback and incorporate it into its knowledge base.
  - **Benefit:** Users can correct and update the model's knowledge, helping it adapt to changes and overcome limitations imposed by the cut-off.
6. **Fine-tuning on Task-Specific Data:**
  - **Approach:** Fine-tune the LLM on task-specific datasets relevant to the domain in question.
  - **Benefit:** Improves the model's performance in specific domains or tasks by providing targeted training data beyond the initial cut-off.
7. **Semantic Augmentation:**
  - **Approach:** Introduce semantic augmentation techniques that generate paraphrased versions of queries or responses.
  - **Benefit:** Helps the model generalize its understanding and address gaps in knowledge by presenting information in varied ways.
8. **Active Context Integration:**
  - **Approach:** Allow the model to maintain an active context window, dynamically updating it with recent information during conversations.
  - **Benefit:** Keeps the model informed about the context of ongoing interactions, compensating for knowledge cut-offs in a conversational context.
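
The external knowledge base approach can be sketched as follows: retrieve the most relevant document for a query, then place it in the prompt so the model can answer from information it was never trained on. The documents, function names, and word-overlap scoring here are simplifying assumptions; practical systems usually retrieve with embeddings and a vector index.

```python
import re

# Made-up documents standing in for an external knowledge base.
knowledge_base = [
    "The library opens at 9 am on weekdays.",
    "The cafeteria serves lunch from noon until 2 pm.",
    "Parking permits are renewed every January.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=1):
    """Rank documents by the number of words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_augmented_prompt(query, documents):
    """Prepend the retrieved context to the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


query = "When does the library open?"
augmented_prompt = build_augmented_prompt(query, knowledge_base)
print(augmented_prompt)
```

Because the knowledge base can be updated without retraining the model, this pattern addresses the knowledge cut-off at inference time.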

# Sentiment Analysis


After completing the lab on generating summaries from dialogues, we turned to sentiment analysis with the Flan-T5 model. In each section, we aim to extract the sentiment of a text, categorizing it as positive, negative, or neutral. Using pretrained models, we replicate the process on a Twitter dataset and refine the prompts to guide the model toward our specific criteria. Let's demonstrate this in the next section as we walk through the labs again.

# Walkthrough of the Lab Using Sentiment Analysis

In this notebook, we will do sentiment analysis using generative AI. We will explore how the input text affects the output of the model, and perform prompt engineering to direct it towards our task. By comparing zero shot, one shot, and few shot inferences, we will take the first step towards prompt engineering and see how it can enhance the generative output of Large Language Models.

<a name='1.1'></a>
## 1 - Install Required Dependencies

In [None]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

Collecting pip
  Downloading pip-23.3.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.1

In [None]:
!pip install git+https://github.com/lvwerra/trl.git@25fa1bd
!pip install tensorflow

Collecting git+https://github.com/lvwerra/trl.git@25fa1bd
  Cloning https://github.com/lvwerra/trl.git (to revision 25fa1bd) to /tmp/pip-req-build-g55a65vb
  Running command git clone --filter=blob:none --quiet https://github.com/lvwerra/trl.git /tmp/pip-req-build-g55a65vb
  Running command git checkout -q 25fa1bd
  Resolved https://github.com/lvwerra/trl.git to commit 25fa1bd
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: trl
  Building wheel for trl (setup.py) ... done
  Created wheel for trl: filename=trl-0.4.2.dev0-py3-none-any.whl size=67532 sha256=f5933ee651713ae901a7e4d465fef4d47e992ceacf13960577d48563b05d871b
  Stored in directory: /tmp/pip-ephem-wheel-cache-lciq9t1l/wheels/24/b4/20/2fa3a1e47c0411c39e198029315e3af2a2c1d59132913f136f
Successfully built trl
Installing collected packages: trl
Successfully installed trl-0.4.2.dev0

Import the classes needed to load the dataset, the Large Language Model (LLM), the tokenizer, and the generation configuration.

In [None]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig
import torch

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

<a name='1.2'></a>
## 2 - Sentiment Analysis without Prompt Engineering

In this use case, you will be generating the sentiment of a text with the pre-trained Large Language Model (LLM) FLAN-T5 from Hugging Face. The list of available models in the Hugging Face `transformers` package can be found [here](https://huggingface.co/docs/transformers/index).

Let's load some tweets from the [tweet_sentiment_extraction](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction) Hugging Face dataset. Each tweet in this dataset comes with a sentiment label: positive, negative, or neutral.

In [None]:
huggingface_dataset_name = "mteb/tweet_sentiment_extraction"

dataset = load_dataset(huggingface_dataset_name)


Here we print a couple of texts with their labeled sentiments.

In [None]:
example_indices = [50, 78]

dash_line = '-' * 99  # separator line printed between sections

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('Text:')
    print(dataset['train'][index]['text'])
    print(dash_line)
    print('Sentiment:')
    print(dataset['train'][index]['label_text'])
    print(dash_line)
    print()

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
Text:
 Then you should check out http://twittersucks.com and connect with other tweeple who hate twitter
---------------------------------------------------------------------------------------------------
Sentiment:
neutral
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  2
---------------------------------------------------------------------------------------------------
Text:
I am sooo tired
---------------------------------------------------------------------------------------------------
Sentiment:
negative
---------------------------------------------------------------------------------------------------



Load the [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5), creating an instance of the `AutoModelForSeq2SeqLM` class with the `.from_pretrained()` method.

In [None]:
model_name='google/flan-t5-base'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [None]:
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

To perform encoding and decoding, you need to work with text in a tokenized form. **Tokenization** is the process of splitting texts into smaller units that can be processed by LLMs.

Download the tokenizer for the FLAN-T5 model using the `AutoTokenizer.from_pretrained()` method. The `use_fast` parameter switches on the fast tokenizer. At this stage, there is no need to go into the details of that, but you can find the tokenizer parameters in the [documentation](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer).

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)


Test the tokenizer by encoding and decoding a simple sentence:

In [None]:
sentence = "What time is it, Tom?"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0],
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

ENCODED SENTENCE:
tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

DECODED SENTENCE:
What time is it, Tom?
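
The encoded tensor above ends with id 1, which is the end-of-sequence token in T5 vocabularies; `skip_special_tokens=True` drops it when decoding. As a rough illustration of this encode/decode round trip, here is a toy word-level tokenizer. Its vocabulary and functions are assumptions made for this sketch and do not reflect the subword tokenizer that FLAN-T5 actually uses.

```python
import re

# Toy vocabulary (an assumption for illustration; not the FLAN-T5 vocabulary).
EOS = "</s>"
vocab = [EOS, "What", "time", "is", "it", ",", "Tom", "?"]
token_to_id = {token: i for i, token in enumerate(vocab)}


def encode(sentence):
    """Split into words and punctuation, map to ids, append end-of-sequence."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return [token_to_id[t] for t in tokens] + [token_to_id[EOS]]


def decode(ids, skip_special_tokens=True):
    """Map ids back to tokens and rebuild the sentence."""
    tokens = [vocab[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t != EOS]
    text = " ".join(tokens)
    # Re-attach punctuation to the preceding word.
    return re.sub(r"\s+([,?.!])", r"\1", text)


ids = encode("What time is it, Tom?")
print(ids)
print(decode(ids))
```

A real subword tokenizer differs mainly in that it can split unknown words into smaller known pieces instead of failing on them.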


Now it's time to explore how well the base LLM assigns sentiment without any prompt engineering. **Prompt engineering** is the act of changing the **prompt** (input) to improve the response for a given task.

In [None]:
for i, index in enumerate(example_indices):
    text = dataset['train'][index]['text']
    sentiment = dataset['train'][index]['label_text']

    inputs = tokenizer(text, return_tensors='pt').to(device)
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{text}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{sentiment}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
 Then you should check out http://twittersucks.com and connect with other tweeple who hate twitter
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
neutral
---------------------------------------------------------------------------------------------------
MODEL GENERATION - WITHOUT PROMPT ENGINEERING:
If you're not a twitter user, then you should check out http://twitterucks.com and connect with other twitter users who hate twitter.

---------------------------------------------------------------------------------------------------
Example  2
---------------------------------------------------------------------------------------------------
INPUT PROMPT:
I am sooo tired
---------------------

You can see that the model's output makes some sense, but the model doesn't seem to know what task it is supposed to accomplish. It appears to simply rephrase or continue the input text. Prompt engineering can help here.

<a name='3'></a>
## 3 - Sentiment Analysis with an Instruction Prompt

Prompt engineering is an important concept in using foundation models for text generation. You can check out [this blog](https://www.amazon.science/blog/emnlp-prompt-engineering-is-the-new-feature-engineering) from Amazon Science for a quick introduction to prompt engineering.
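
One convenient way to experiment with prompts is to treat each one as a template with a slot for the input text, so that different wordings can be swapped in and compared. The sketch below assumes a hypothetical `TEMPLATES` dictionary and `build_prompt` helper; neither is part of the `transformers` library.

```python
# Hypothetical prompt templates; each has a {text} slot for the input.
TEMPLATES = {
    "no_instruction": "{text}",
    "instruction_first": (
        "Provide sentiment of the following text as either "
        "positive, negative or neutral\n\n{text}\n\nSentiment:"
    ),
}


def build_prompt(template_name, text):
    """Fill the chosen template with the (stripped) input text."""
    return TEMPLATES[template_name].format(text=text.strip())


tweet = "I am sooo tired"
for name in TEMPLATES:
    print(f"[{name}]")
    print(build_prompt(name, tweet))
    print()
```

Each filled template would then be tokenized and passed to `model.generate()` in the same way as the raw text.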

<a name='3.1'></a>
### 3.1 - Zero Shot Inference with an Instruction Prompt

In order to instruct the model to perform a task - classify the sentiment of a text - you can take the text and convert it into an instruction prompt. This is often called **zero shot inference**. You can check out [this blog from AWS](https://aws.amazon.com/blogs/machine-learning/zero-shot-prompting-for-the-flan-t5-foundation-model-in-amazon-sagemaker-jumpstart/) for a quick description of what zero shot learning is and why it is an important concept for LLMs.

Wrap the text in a descriptive instruction and see how the generated text will change:

In [None]:
for i, index in enumerate(example_indices):
    text = dataset['train'][index]['text']
    sentiment = dataset['train'][index]['label_text']

    prompt = f"""
Provide sentiment of the following text as either positive, negative or neutral

{text}

Sentiment:
    """

    # Input the constructed prompt instead of the raw text.
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{sentiment}')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Provide sentiment of the following text as either positive, negative or neutral

 Then you should check out http://twittersucks.com and connect with other tweeple who hate twitter

Sentiment:
    
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
neutral
---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
negative

---------------------------------------------------------------------------------------------------
Example  2
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Provide sentiment of the following text as either positive, negative or neutral

This is much better! However, the model still does not pick up on the nuance of the texts: in Example 1 it answers `negative` where the human label is `neutral`.

<a name='3.2'></a>
### 3.2 - Zero Shot Inference with the Prompt Template from FLAN-T5

Let's use a slightly different prompt. FLAN-T5 has many prompt templates that are published for certain tasks [here](https://github.com/google-research/FLAN/tree/main/flan/v2). In the following code, you will use one of the [pre-built FLAN-T5 prompts](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py):

In [None]:
for i, index in enumerate(example_indices):
    text = dataset['train'][index]['text']
    sentiment = dataset['train'][index]['label_text']

    prompt = f"""
Text:

{text}

Provide sentiment of the following text as either positive, negative or neutral
"""

    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{sentiment}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Text:

 Then you should check out http://twittersucks.com and connect with other tweeple who hate twitter

Provide sentiment of the following text as either positive, negative or neutral 

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
neutral

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
negative

---------------------------------------------------------------------------------------------------
Example  2
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Text:

I am sooo tired

Provide sentiment of the following text as either positive, ne

Notice that this prompt from FLAN-T5 did help a bit, but the model still struggles to pick up on the nuance of the text. This is what you will try to solve with few shot inference.

<a name='4'></a>
## 4 - Classify Sentiment with One Shot and Few Shot Inference

**One shot and few shot inference** are the practices of providing an LLM with either one or more full examples of prompt-response pairs that match your task - before your actual prompt that you want completed. This is called "in-context learning" and puts your model into a state that understands your specific task.  You can read more about it in [this blog from HuggingFace](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api).
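
The idea of in-context learning can be sketched without a model or dataset. The helper below is a hypothetical illustration (the name `build_few_shot_prompt` and the hard-coded examples are assumptions, not part of the notebook); it only shows how solved examples are placed ahead of the unsolved query, using the same template wording as section 3.2.

```python
# Hypothetical helper: builds an in-context prompt from (text, label) pairs.
INSTRUCTION = "Provide sentiment of the following text as either positive, negative or neutral"


def build_few_shot_prompt(examples, query_text):
    """Concatenate solved examples, then the unsolved query, into one prompt."""
    parts = []
    for text, label in examples:
        # Each shot shows the model a complete prompt-response pair.
        parts.append(f"Text:\n\n{text}\n\n{INSTRUCTION}\n{label}\n\n\n")
    # The query repeats the template but leaves the label for the model to fill in.
    parts.append(f"Text:\n\n{query_text}\n\n{INSTRUCTION}\n")
    return "".join(parts)


# One solved example makes this a one shot prompt; an empty list gives zero shot.
shots = [("THANK YYYYYYYYYOOOOOOOOOOUUUUU!", "positive")]
one_shot = build_few_shot_prompt(shots, "St joe is dirty.")
zero_shot = build_few_shot_prompt([], "St joe is dirty.")
print(one_shot)
```

Passing more pairs in `shots` turns the same call into few shot inference; nothing about the model changes, only the prompt.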

<a name='4.1'></a>
### 4.1 - One Shot Inference

Let's build a function that takes a list of `example_indices_full`, generates a prompt with full examples, then at the end appends the prompt which you want the model to complete (`example_index_to_summarize`).  You will use the same FLAN-T5 prompt template from section [3.2](#3.2).

In [None]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        text = dataset['train'][index]['text']
        sentiment = dataset['train'][index]['label_text']

        # The stop sequence '{sentiment}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Text:

{text}

Provide sentiment of the following text as either positive, negative or neutral
{sentiment}


"""

    text = dataset['test'][example_index_to_summarize]['text']

    prompt += f"""
Text:

{text}

Provide sentiment of the following text as either positive, negative or neutral
"""

    return prompt

Construct the prompt to perform one shot inference:

In [None]:
example_indices_full = [40]
example_index_to_summarize = 200

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Text:

 Car not happy, big big dent in boot! Hoping theyre not going to write it off, crossing fingers and waiting

Provide sentiment of the following text as either positive, negative or neutral 
neutral



Text:

St joe is dirty.

Provide sentiment of the following text as either positive, negative or neutral



Now pass this prompt to perform the one shot inference:

In [None]:
# The query text comes from the test split, so read its label from the same split.
sentiment = dataset['test'][example_index_to_summarize]['label_text']

inputs = tokenizer(one_shot_prompt, return_tensors='pt').to(device)
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{sentiment}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
negative

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
negative


<a name='4.2'></a>
### 4.2 - Few Shot Inference

Let's explore few shot inference by adding two more full text-sentiment pairs to your prompt.

In [None]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Text:

 Car not happy, big big dent in boot! Hoping theyre not going to write it off, crossing fingers and waiting

Provide sentiment of the following text as either positive, negative or neutral 
neutral



Text:

 THANK YYYYYYYYYOOOOOOOOOOUUUUU!

Provide sentiment of the following text as either positive, negative or neutral 
positive



Text:

 I had it! On my itunes, but then I lost all my songs.

Provide sentiment of the following text as either positive, negative or neutral 
neutral



Text:

St joe is dirty.

Provide sentiment of the following text as either positive, negative or neutral



Now pass this prompt to perform a few shot inference:

In [None]:
# The query text comes from the test split, so read its label from the same split.
sentiment = dataset['test'][example_index_to_summarize]['label_text']

inputs = tokenizer(few_shot_prompt, return_tensors='pt').to(device)
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{sentiment}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
negative

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
negative


In this case, few shot did not provide much of an improvement over one shot inference: both return `negative`. Adding more than 5 or 6 shots will typically not help much, either. Also, make sure that the prompt does not exceed the model's input-context length which, in our case, is 512 tokens. Anything beyond the context length will be ignored.

However, you can see that feeding in at least one full example (one shot) provides the model with more information and qualitatively improves the sentiment generated overall.
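
The context-length caveat above can be sketched as a budget check that keeps only as many shots as fit. This is a rough illustration under stated assumptions: `select_shots_within_budget` is a hypothetical helper, and `count_tokens_approx` is a whitespace stand-in for the real tokenizer, so its counts will differ from FLAN-T5's actual token counts.

```python
def count_tokens_approx(text):
    # Stand-in tokenizer (assumption): whitespace split. A real run would use
    # len(tokenizer(text).input_ids) with the FLAN-T5 tokenizer instead.
    return len(text.split())


def select_shots_within_budget(shots, query, max_tokens=512, count_tokens=count_tokens_approx):
    """Keep as many leading shots as fit, always reserving room for the query."""
    budget = max_tokens - count_tokens(query)
    kept = []
    for shot in shots:
        cost = count_tokens(shot)
        if cost > budget:
            break  # adding this shot would push the prompt past the limit
        kept.append(shot)
        budget -= cost
    return kept


# Three 200-"token" shots and a 50-"token" query against a 512-token limit.
shots = ["word " * 200, "word " * 200, "word " * 200]
query = "word " * 50
kept = select_shots_within_budget(shots, query)
print(len(kept))  # 2
```

Reserving room for the query first matters because the query sits at the end of the prompt, which is the part that would be lost if the input were cut at the limit.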

<a name='5'></a>
## 5 - Generative Configuration Parameters for Inference

You can change the configuration parameters of the `generate()` method to see a different output from the LLM. So far the only parameter that you have been setting was `max_new_tokens=50`, which defines the maximum number of tokens to generate. A full list of available parameters can be found in the [Hugging Face Generation documentation](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig).

A convenient way of organizing the configuration parameters is to use the `GenerationConfig` class.

In [None]:
# generation_config = GenerationConfig(max_new_tokens=50)
# generation_config = GenerationConfig(max_new_tokens=10)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
# generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt').to(device)
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{sentiment}\n')

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
negative
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
negative



Comments related to the choice of the parameters in the code cell above:
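- `max_new_tokens` caps the length of the generated output. A one-word sentiment label fits comfortably within `max_new_tokens=10`; longer outputs would be cut off at that limit.
- Setting `do_sample=True` makes generation sample from the token distribution instead of always picking the most likely token, and raising the temperature makes the output more varied.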
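
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are made-up numbers for three candidate tokens, not values from FLAN-T5; the point is only that dividing the scores by a small temperature concentrates the probability on the top token, while a temperature of 1.0 leaves the distribution unchanged.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens (illustrative only).
labels = ["negative", "neutral", "positive"]
logits = [2.0, 1.0, 0.5]

for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, dict(zip(labels, (round(p, 3) for p in probs))))
```

With sampling enabled, the next token is drawn from a distribution like this one, so a low temperature behaves almost like greedy decoding and a higher one lets less likely tokens through more often.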

As you can see, prompt engineering can take you a long way for this use case, but there are some limitations. Next, you will start to explore how you can use fine-tuning to help your LLM to understand a particular use case in better depth!

# Lab 2: Fine-Tune a Generative AI Model for Sentiment Analysis

In this section, we will fine-tune an existing LLM from Hugging Face for enhanced sentiment analysis. We will use the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, which provides a high quality instruction tuned model and can summarize text out of the box. To improve the inferences, we will explore a full fine-tuning approach and evaluate the results with ROUGE metrics. Then we will perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model and see that the benefits of PEFT outweigh the slightly-lower performance metrics.

<a name='1.1'></a>

---


### 1.1 - Install Required Dependencies

Import the necessary components. Some of them are new for this section and will be discussed later in the notebook.

In [None]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

It is possible to pull out the number of model parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to go into its details.

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


<a name='2'></a>
## 2 - Perform Full Fine-Tuning

<a name='2.1'></a>
### 2.1 - Preprocess the Text-Sentiment Dataset

You need to convert the text-sentiment (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Provide sentiment of the following text as either positive, negative or neutral.` to the start of the text and append `Sentiment: ` to its end, as follows:

Training prompt (text):
```
Provide sentiment of the following text as either positive, negative or neutral.

THANK YYYYYYYYYOOOOOOOOOOUUUUU!

Sentiment:
```

Training response (sentiment):
```
positive
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [None]:
def tokenize_function(example):
    start_prompt = 'Provide sentiment of the following text as either positive, negative or neutral.\n\n'
    end_prompt = '\n\nSentiment: '
    # Build one prompt per text in the batch.
    prompt = [start_prompt + text + end_prompt for text in example["text"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["label_text"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset contains two splits: train and test.
# The tokenize_function code is handling all data across both splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'label', 'text', 'label_text',])

Map:   0%|          | 0/27481 [00:00<?, ? examples/s]

Map:   0%|          | 0/3534 [00:00<?, ? examples/s]
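
As a rough illustration of what a batched mapping function receives, the sketch below builds the prompts for a tiny hand-written batch in plain Python. The `build_prompts` helper and the two-row batch are assumptions for illustration; the relevant detail is that with `batched=True` the function is handed a dict whose values are lists, one entry per row.

```python
START_PROMPT = 'Provide sentiment of the following text as either positive, negative or neutral.\n\n'
END_PROMPT = '\n\nSentiment: '


def build_prompts(batch):
    """Wrap every text in the batch with the instruction and the answer cue."""
    return [START_PROMPT + text + END_PROMPT for text in batch["text"]]


# Hand-written stand-in for one batch: a dict of equally long lists.
batch = {
    "text": ["St joe is dirty.", "THANK YYYYYYYYYOOOOOOOOOOUUUUU!"],
    "label_text": ["negative", "positive"],
}
prompts = build_prompts(batch)
print(prompts[0])
```

Each row gets its own prompt, so the list returned for the batch has the same length as `batch["text"]`.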

To save some time in the lab, you will subsample the dataset:

In [None]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Filter:   0%|          | 0/27481 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3534 [00:00<?, ? examples/s]

Check the shapes of both parts of the dataset:

In [None]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (275, 2)
Test: (36, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 275
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 36
    })
})


The output dataset is ready for fine-tuning.
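
The subsampled sizes follow directly from keeping every 100th row. A small sketch of that arithmetic (the helper name `subsampled_size` is hypothetical) reproduces the 275 and 36 rows reported above from the original 27481 and 3534 rows:

```python
def subsampled_size(num_rows, step=100):
    # Indices 0, step, 2*step, ... below num_rows survive the filter.
    return len(range(0, num_rows, step))


print(subsampled_size(27481))  # 275
print(subsampled_size(3534))   # 36
```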

<a name='2.2'></a>
### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

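Full fine-tuning with the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)) takes the preprocessed dataset and the original model as inputs, but it needs considerable time and compute. To save time, the cell below skips training and instead loads an already fine-tuned FLAN-T5 checkpoint, `philschmid/flan-t5-base-samsum`, which serves as the instruct model in the rest of this section. As its name suggests, this checkpoint was fine-tuned on the SAMSum dialogue-summarization dataset rather than on the sentiment data prepared above.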

In [None]:
tokenizer = AutoTokenizer.from_pretrained("philschmid/flan-t5-base-samsum")
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("philschmid/flan-t5-base-samsum")
instruct_model.to(device)

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.53k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same test example used in the one shot and few shot prompts above), compare the label predicted by the original model and by the fine-tuned model with the human label.

In [None]:
index = 200
text = dataset['test'][index]['text']
human_baseline_summary = dataset['test'][index]['label_text']

prompt = f"""
Provide sentiment of the following text as either positive, negative or neutral.

{text}

Sentiment:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
negative
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
negative
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
negative


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

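The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) quantifies how closely generated text matches a reference, usually one written by a human, by measuring n-gram overlap; it is most commonly used for summaries. Here the references are one-word sentiment labels, so when the model also answers with a single word, ROUGE-1 reduces to the fraction of predictions that match the label, and ROUGE-2 is always 0 because a single word contains no bigrams. While not perfect, it gives a simple way to compare the original and the fine-tuned model.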

In [None]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Generate the outputs for the sample of the test dataset (only 10 texts and sentiments to save time), and save the results.

In [None]:
texts = dataset['test'][0:10]['text']
human_baseline_summaries = dataset['test'][0:10]['label_text']

original_model_summaries = []
instruct_model_summaries = []

# Name the loop variable `text` so the prompt below is built from the current text.
for text in texts:
    prompt = f"""
Provide sentiment of the following text as either positive, negative or neutral.

{text}

Sentiment: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries
0,neutral,negative,negative
1,positive,negative,negative
2,negative,negative,negative
3,positive,negative,negative
4,positive,negative,negative
5,positive,negative,negative
6,negative,negative,negative
7,negative,negative,negative
8,neutral,negative,negative
9,neutral,negative,negative


Evaluate the models by computing ROUGE metrics. In the recorded run below, both models receive identical scores.

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

ORIGINAL MODEL:
{'rouge1': 0.3, 'rouge2': 0.0, 'rougeL': 0.3, 'rougeLsum': 0.3}
INSTRUCT MODEL:
{'rouge1': 0.3, 'rouge2': 0.0, 'rougeL': 0.3, 'rougeLsum': 0.3}
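
One way to read these scores is to recompute ROUGE-1 by hand. The sketch below is a simplified unigram-overlap F1 in plain Python (no stemming and no bootstrap aggregation, so it is not the `evaluate` implementation); `rouge1_f1` is a hypothetical helper, and the labels and predictions are copied from the table above.

```python
def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count each reference token at most as often as it appears.
    remaining = list(ref_tokens)
    overlap = 0
    for tok in pred_tokens:
        if tok in remaining:
            remaining.remove(tok)
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Human labels and model predictions copied from the table above.
references = ["neutral", "positive", "negative", "positive", "positive",
              "positive", "negative", "negative", "neutral", "neutral"]
predictions = ["negative"] * 10

scores = [rouge1_f1(p, r) for p, r in zip(predictions, references)]
print(sum(scores) / len(scores))  # 0.3
```

Three of the ten human labels are `negative`, so a model that always answers `negative` scores 0.3, which matches the reported `rouge1` value for both models.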


Compute the absolute difference in each ROUGE metric between the two models. In the recorded run there is no difference:

In [None]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL
rouge1: 0.00%
rouge2: 0.00%
rougeL: 0.00%
rougeLsum: 0.00%


<a name='3'></a>
## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
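
How an adapter is "reunited" with the base model can be sketched with tiny matrices. The example below assumes the standard LoRA formulation, in which the adapted weight is `W + (alpha / r) * B @ A`; the matrix values are made up, and the pure-Python helpers (`matmul`, `add`, `scale`) are written out only to keep the sketch dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Toy sizes: a 3x3 frozen weight with a rank-1 adapter (values are made up).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
A = [[0.5, -1.0, 0.0]]       # rank x in_features
B = [[1.0], [0.0], [2.0]]    # out_features x rank
alpha, rank = 2.0, 1
scaling = alpha / rank

x = [[1.0], [2.0], [3.0]]    # one input column vector

# Serving with a separate adapter: base output plus the low-rank correction.
separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Serving with merged weights: fold the adapter into W once, then one matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(separate)  # [[4.0], [2.0], [0.0]]
print(merged)    # [[4.0], [2.0], [0.0]]
```

Both routes give the same output, which is why one frozen `W` can be shared while only the small `A` and `B` matrices are swapped per task.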

<a name='3.1'></a>
### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [None]:
peft_model = get_peft_model(model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
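
The trainable-parameter count above can be reproduced with a little arithmetic. The sketch assumes FLAN-T5 base has 12 encoder and 12 decoder blocks (the layer counts are an assumption here; the hidden size of 768 is visible in the printed module list) and that each decoder block has both a self-attention and a cross-attention module containing `q` and `v` projections.

```python
def lora_params_per_module(in_features, out_features, rank):
    # LoRA adds two low-rank matrices: A (rank x in) and B (out x rank).
    return rank * in_features + out_features * rank


d_model = 768          # hidden size shown in the printed FLAN-T5 base modules
rank = 32              # r in LoraConfig
encoder_layers = 12    # assumption: FLAN-T5 base has 12 encoder blocks
decoder_layers = 12    # assumption: and 12 decoder blocks

# One attention module per encoder layer, two per decoder layer
# (self-attention and cross-attention); "q" and "v" are adapted in each.
attention_modules = encoder_layers + 2 * decoder_layers
adapted_projections = attention_modules * 2

trainable = adapted_projections * lora_params_per_module(d_model, d_model, rank)
base = 247_577_856     # all-parameter count printed before adding the adapter
total = base + trainable

print(trainable)                          # 3538944
print(total)                              # 251116800
print(f"{100 * trainable / total:.2f}%")  # 1.41%
```

Because the count scales linearly with `r`, halving the rank to 16 would halve the number of trainable parameters.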


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [None]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1  # Overrides num_train_epochs: a single training step keeps the demo fast.
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

Now everything is ready to train the PEFT adapter and save the model.



In [None]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



Step,Training Loss
1,51.028


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       peft_model_path,
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

peft_model.to(device)

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 768)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 768)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(
                    in_features=768, out_features=768, bias=False
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B):

The number of trainable parameters will be `0` due to the `is_trainable=False` setting:

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.00%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Make inferences for the same example as in sections [1.3](#1.3) and [2.3](#2.3), with the original model, the fully fine-tuned model, and the PEFT model.

In [None]:
index = 200
# Use the names that the prompt and the print statements below refer to.
text = dataset['test'][index]['text']
human_baseline_summary = dataset['test'][index]['label_text']

prompt = f"""
Provide sentiment of the following text as either positive, negative or neutral.

{text}

Sentiment: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
negative
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
negative
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
negative
---------------------------------------------------------------------------------------------------
PEFT MODEL: negative


<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inferences for the sample of the test dataset (only 10 texts and sentiment labels to save time).

In [None]:
dialogues = dataset['test'][0:10]['text']
human_baseline_summaries = dataset['test'][0:10]['label_text']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

# Name the loop variable `text` so the prompt below is built from the current text.
for idx, text in enumerate(dialogues):
    prompt = f"""
Provide sentiment of the following text as either positive, negative or neutral.

{text}

Sentiment: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
df

Unnamed: 0,human_baseline_summaries,original_model_summaries,instruct_model_summaries,peft_model_summaries
0,neutral,negative,negative,negative
1,positive,negative,negative,negative
2,negative,negative,negative,negative
3,positive,negative,negative,negative
4,positive,negative,negative,negative
5,positive,negative,negative,negative
6,negative,negative,negative,negative
7,negative,negative,negative,negative
8,neutral,negative,negative,negative
9,neutral,negative,negative,negative


In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.3, 'rouge2': 0.0, 'rougeL': 0.3, 'rougeLsum': 0.3}
INSTRUCT MODEL:
{'rouge1': 0.3, 'rouge2': 0.0, 'rougeL': 0.3, 'rougeLsum': 0.3}
PEFT MODEL:
{'rouge1': 0.3, 'rouge2': 0.0, 'rougeL': 0.3, 'rougeLsum': 0.3}
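
Because every reference and prediction here is a single word, a unigram-overlap score such as ROUGE-1 is 1 for an exact match and 0 otherwise, so the aggregate should sit at or very near the plain exact-match rate. The sketch below recomputes that rate from the ten rows shown in the comparison table; the variable and function names are illustrative and not part of the notebook.

```python
# Labels copied from the ten-row comparison table above.
human_baseline = ["neutral", "positive", "negative", "positive", "positive",
                  "positive", "negative", "negative", "neutral", "neutral"]
# All three models predicted "negative" for every sample in this run.
model_predictions = ["negative"] * len(human_baseline)

def exact_match_rate(predictions, references):
    """Fraction of predictions that equal their reference label."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

accuracy = exact_match_rate(model_predictions, human_baseline)
print(f"exact-match rate: {accuracy:.1f}")
```

Only three of the ten references are `negative`, so the rate is 0.3, in line with the reported `rouge1`. `rouge2` is 0.0 because single-word outputs contain no bigrams.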


You already computed ROUGE score on the full dataset, after loading the results from the `data/dialogue-summary-training-results.csv` file. Load the values for the PEFT model now and check its performance compared to other models.

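On the full dataset, PEFT typically shows less of an improvement than full fine-tuning, but its benefits usually outweigh the slightly lower performance metrics. In the ten-sample run above, all three models produce identical predictions, so their ROUGE scores are the same.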

Calculate the improvement of PEFT over the original model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 0.00%
rouge2: 0.00%
rougeL: 0.00%
rougeLsum: 0.00%


Now calculate the improvement of PEFT over a full fine-tuned model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL
rouge1: 0.00%
rouge2: 0.00%
rougeL: 0.00%
rougeLsum: 0.00%


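On the full dataset you would typically see a small percentage decrease in the ROUGE metrics compared with the fully fine-tuned model; in this ten-sample run the difference is 0.00% on every metric. Either way, PEFT training requires far less compute and memory (often just a single GPU).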

# Lab 3: Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and PEFT to Generate Less-Toxic Summaries

In this notebook, you will fine-tune a FLAN-T5 model to generate less toxic content with Meta AI's hate speech reward model. The reward model is a binary classifier that predicts either "not hate" or "hate" for the given text. You will use Proximal Policy Optimization (PPO) to fine-tune and reduce the model's toxicity.

Import the necessary components.

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

<a name='2'></a>
## 2 - Load FLAN-T5 Model, Prepare Reward Model and Toxicity Evaluator

<a name='2.1'></a>
### 2.1 - Load Data and FLAN-T5 Model Fine-Tuned with Summarization Instruction

The next step will be to preprocess the dataset. You will take only a part of it, then filter the dialogues of a particular length (just to make those examples long enough and, at the same time, easy to read). Then wrap each dialogue with the instruction and tokenize the prompts. Save the token ids in the field `input_ids` and decoded version of the prompts in the field `query`.

You could do all of that step by step in the cell below, but it is a good habit to organize it in a function `build_dataset`:

In [None]:
def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """

    # load dataset (only "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")

    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["text"]) > input_min_text_length and len(x["text"]) <= input_max_text_length, batched=False)

    # Prepare tokenizer. Setting device_map="auto" allows switching between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

    def tokenize(sample):

        # Wrap each dialogue with the instruction.
        prompt = f"""
Provide sentiment of the following text as either positive, negative or neutral.

{sample["text"]}

Sentiment:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)



DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 0
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 0
    })
})
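
The output above reports `num_rows: 0` for both splits, which means no text passed the 200 to 1000 character filter. A minimal sketch of the same predicate on made-up strings shows how the bounds decide what survives; `toy_texts` and `filter_by_length` are hypothetical names used only for illustration.

```python
def filter_by_length(texts, min_len, max_len):
    """Keep texts whose character count is in (min_len, max_len],
    mirroring the predicate passed to dataset.filter above."""
    return [t for t in texts if min_len < len(t) <= max_len]

# Made-up examples: two short texts and one longer passage.
toy_texts = [
    "I love this!",
    "Worst day ever, nothing went right.",
    "x" * 250,
]

print([len(t) for t in toy_texts])
print(len(filter_by_length(toy_texts, 200, 1000)))  # only the 250-character text passes
print(len(filter_by_length(toy_texts, 5, 1000)))    # all three texts pass
```

If the dataset consists mostly of short texts, checking the length distribution first and then lowering `input_min_text_length` would be one way to obtain non-empty splits.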


In the previous lab, you fine-tuned the PEFT model with summarization instructions. The training in the notebook was done on a subset of data. Then you downloaded the checkpoint of the fully trained PEFT model from S3.

Before loading the same model checkpoint here, prepare a helper function that reports the number of trainable model parameters:

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

Now load that checkpoint, attaching the LoRA adapter weights to the base model:

In [None]:
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model,
                                       peft_model_path,
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=True)


print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

PEFT model parameters to be updated:

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
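
The trainable-parameter count can be reproduced by hand. The sketch assumes the base checkpoint is FLAN-T5-base (hidden size 768, 12 encoder and 12 decoder blocks), which this section does not state explicitly. Each LoRA-wrapped linear layer adds two low-rank matrices with $r \times (d_{in} + d_{out})$ parameters in total.

```python
# Assumed architecture (FLAN-T5-base); not stated in this section.
d_model = 768
encoder_blocks = 12
decoder_blocks = 12
r = 32                     # LoRA rank from lora_config
targets_per_attention = 2  # target_modules=["q", "v"]

# Each encoder block has one attention layer (self-attention);
# each decoder block has two (self-attention and cross-attention).
attention_layers = encoder_blocks * 1 + decoder_blocks * 2
wrapped_linears = attention_layers * targets_per_attention

# LoRA adds A (r x d_in) and B (d_out x r) to every wrapped linear layer.
params_per_linear = r * (d_model + d_model)
lora_params = wrapped_linears * params_per_linear

all_params = 251_116_800   # "all model parameters" printed above
percentage = f"{100 * lora_params / all_params:.2f}%"
print(lora_params)
print(percentage)
```

Under these assumptions the count comes to 3538944, or 1.41% of all parameters, which agrees with the cell output.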



Next, prepare the PPO model by passing the PEFT model to `AutoModelForSeq2SeqLMWithValueHead`, which adds a value head on top of it:

In [None]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


During PPO, only a few parameters will be updated: the LoRA adapter weights that were already trainable, plus the parameters of the `ValueHead`. More information about this class of models can be found in the [documentation](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model). The number of trainable `ValueHead` parameters can be computed as $(n+1)*m$, where $n$ is the number of input units (here $n=768$) and $m$ is the number of output units (you have $m=1$). The $+1$ term in the equation takes into account the bias term.
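
As a quick check of that formula, the arithmetic below reproduces the 769 value-head parameters and the new trainable total printed above. It uses only numbers taken from the cell outputs.

```python
n = 768  # input units of the value head's linear layer
m = 1    # output units

value_head_params = (n + 1) * m   # weights plus one bias per output unit
lora_trainable = 3_538_944        # trainable parameters of the PEFT model
ppo_trainable = lora_trainable + value_head_params

print(value_head_params)  # 769
print(ppo_trainable)      # 3539713
```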

Now create a frozen copy of the PPO model which will not be fine-tuned - a reference model. The reference model will represent the LLM before detoxification. None of the parameters of the reference model will be updated during PPO training. This is on purpose.

In [None]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%



Everything is set. It is time to prepare the reward model!

<a name='2.2'></a>
### 2.2 - Prepare Reward Model

**Reinforcement Learning (RL)** is one type of machine learning where agents take actions in an environment aimed at maximizing their cumulative rewards. The agent's behavior is defined by the **policy**. And the goal of reinforcement learning is for the agent to learn an optimal, or nearly-optimal, policy that maximizes the **reward function**.

In the [previous section](#2.1), the original policy is based on the instruct PEFT model, that is, the LLM before detoxification. You could ask human labelers to give feedback on the toxicity of the outputs, but using them for the entire fine-tuning process can be expensive. A practical way to avoid that is to use feedback generated by a model: a reward model that encourages the agent to detoxify the dialogue summaries. The intuitive approach is to do some form of sentiment analysis across two classes (`nothate` and `hate`) and give a higher reward when there is a higher chance of getting class `nothate` as an output.

You will use [Meta AI's RoBERTa-based hate speech model](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target) for the reward model. This model will output **logits** and then predict probabilities across two classes: `nothate` and `hate`. The logits of the output `nothate` will be taken as a positive reward. Then, the model will be fine-tuned with PPO using those reward values.

Create the instance of the required model class for the RoBERTa model. You also need to load a tokenizer to test the model. Notice that the model label `0` will correspond to the class `nothate` and label `1` to the class `hate`.

In [None]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


Take some non-toxic text, tokenize it, and pass it to the model. Print the output logits, probabilities, and the corresponding reward that will be used for fine-tuning.

In [None]:
non_toxic_text = "I really hate cookies."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids.to(device)

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

logits [not hate, hate]: [4.685029983520508, -4.2312469482421875]
probabilities [not hate, hate]: [0.9998657703399658, 0.00013416889123618603]
reward (high): [4.685029983520508]


Let's show a more toxic comment. This will have a lower reward because it is more toxic, although in this run the difference is small (4.60 versus 4.69).

In [None]:
toxic_text = "I think bananas are horrible and dreadful."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids.to(device)

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (low): {nothate_reward}')

logits [not hate, hate]: [4.600322723388672, -4.2066216468811035]
probabilities [not hate, hate]: [0.9998502731323242, 0.00014966761227697134]
reward (low): [4.600322723388672]


<a name='2.3'></a>
### 2.3 - Evaluate Toxicity

To evaluate the model before and after fine-tuning/detoxification you need to set up the [toxicity evaluation metric](https://huggingface.co/spaces/evaluate-measurement/toxicity). The **toxicity score** is a decimal value between 0 and 1 where 1 is the highest toxicity.

Try to calculate toxicity for the same sentences as in section [2.2](#2.2). It's no surprise that the toxicity scores are the probabilities of the `hate` class returned directly from the reward model.
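
One way to see the link between the reward and the toxicity score is to recompute the `hate` probability from the logits printed in section 2.2. The sketch below uses plain Python rather than the evaluator itself, and the helper name `softmax` is introduced here only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits [not hate, hate] copied from the two outputs in section 2.2.
examples = {
    "I really hate cookies.": [4.685029983520508, -4.2312469482421875],
    "I think bananas are horrible and dreadful.": [4.600322723388672, -4.2066216468811035],
}

for text, logits in examples.items():
    not_hate_prob, hate_prob = softmax(logits)
    reward = logits[0]  # the "not hate" logit is used as the reward
    print(f"{text!r}: reward={reward:.3f}, hate probability={hate_prob:.6f}")
```

Both probabilities come out near 0.0001, in line with the values printed earlier, so an evaluator built on the same model would be expected to score both sentences as non-toxic.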

In [None]:
def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):

    """
    Generate a completion for each sample and measure the toxicity of the results.

    Parameters:
    - model (trl model): Model to be evaluated.
    - toxicity_evaluator (evaluate_modules toxicity metrics): Toxicity evaluator.
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Maximum number of samples for the evaluation.

    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples toxicity.
    - std (numpy.float64): Standard deviation of the samples toxicity.
    """

    max_new_tokens=100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        if i >= num_samples:
            break

        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids

        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)

        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)

    return mean, std

This evaluator can be used to compute the toxicity of the dialogues prepared in section [2.1](#2.1). You will need to pass the test dataset (`dataset["test"]`), the same tokenizer which was used in that section, the frozen reference model prepared in section [2.1](#2.1), and the toxicity evaluator. The function `evaluate_toxicity` defined above wraps the required steps.

<a name='3'></a>
## 3 - Perform Fine-Tuning to Detoxify the Summaries
Optimize an RL policy against the reward model using Proximal Policy Optimization (PPO).

<a name='3.1'></a>
### 3.1 - Initialize `PPOTrainer`

For the `PPOTrainer` initialization, you will need a collator. Here it will be a function that turns a list of dictionaries into a single dictionary of lists, one list per key. You can define and test it:

In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}
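
With more than one sample the effect is easier to see: the collator turns a list of per-sample dictionaries into one dictionary of per-field lists. The example re-declares `collator` so that it runs on its own, and the sample values are made up.

```python
def collator(data):
    # Same logic as above: list of dicts -> dict of lists.
    return {key: [d[key] for d in data] for key in data[0]}

batch = [
    {"input_ids": [101, 7, 9], "query": "first prompt"},
    {"input_ids": [101, 4], "query": "second prompt"},
]

collated = collator(batch)
print(collated["input_ids"])  # [[101, 7, 9], [101, 4]]
print(collated["query"])      # ['first prompt', 'second prompt']
```

Note that the sequences keep their original lengths; this collator does not pad them.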


# Navigating the Difficulties in the Process

## Optimizing Hyperparameters

### Exploring Hyperparameter Combinations
- **Utilize techniques** such as grid search or random search to test various hyperparameter settings (e.g., learning rate, batch size).

### Leveraging Automated Optimization Tools
- **Employ tools** like Hyperopt or Bayesian Optimization for a more streamlined hyperparameter tuning process.
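
A minimal random-search sketch, assuming a hypothetical `validation_loss` function that stands in for a real train-and-evaluate run. The search space and the shape of the loss are illustrative only.

```python
import random

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and returning its
    validation loss; a simple bowl with its minimum at (1e-3, 16)."""
    return (learning_rate - 1e-3) ** 2 * 1e6 + (batch_size - 16) ** 2 / 256

def random_search(n_trials, seed=0):
    """Sample n_trials configurations and return (best_loss, best_config)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {
            # Sample the learning rate log-uniformly between 1e-5 and 1e-2.
            "learning_rate": 10 ** rng.uniform(-5, -2),
            "batch_size": rng.choice([4, 8, 16, 32]),
        }
        loss = validation_loss(**config)
        if best is None or loss < best[0]:
            best = (loss, config)
    return best

best_loss, best_config = random_search(n_trials=20)
print(best_loss, best_config)
```

Grid search would replace the sampling step with a loop over every combination of a fixed set of values; tools such as Hyperopt replace it with a model-guided choice of the next configuration.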

## Balancing Model Capacity in PEFT

### Defining Model Capacity
- **This is the model's ability** to recognize intricate patterns in data. In PEFT, it's crucial to find the right balance to prevent both underfitting (insufficient capacity) and overfitting (excessive capacity).

### Fine-Tuning Strategy
- **Modify the degree of fine-tuning** in PEFT, including choosing which layers to adjust and to what extent. Sometimes, only fine-tuning the top layers or a subset of parameters can be adequate.

## Strategies to Counter Overfitting

### Implementing Regularization Methods
- **Apply techniques** such as dropout, weight decay, or early stopping to combat overfitting during training.
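
Early stopping can be sketched in a few lines. The function below is an illustrative helper, not part of any library, and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would stop,
    or the last index if the patience budget is never exhausted."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: reset the patience counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: improvement stalls after the third epoch.
losses = [0.90, 0.72, 0.65, 0.66, 0.67, 0.64]
print(early_stopping_epoch(losses, patience=2))
```

With `patience=2` training stops at index 4, before the late improvement at index 5; a larger patience trades extra compute for the chance to catch such late gains.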

### Data Partitioning Approach
- **Ensure a proper distribution** of data across training, validation, and test sets to accurately assess the model's ability to generalize.
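
One simple way to partition data is to shuffle the indices once and cut them into three disjoint parts. The sketch below assumes a 80/10/10 split; the function name and the fractions are illustrative choices.

```python
import random

def split_indices(n_samples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle indices once and cut them into disjoint train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so validation and test scores stay comparable across runs.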

# MIT License

Copyright (c) 2023

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.