In [None]:
!pip install -U -q transformers langchain peft bitsandbytes trl datasets notebook accelerate evaluate

In [1]:
from IPython.core.display import HTML
table_css = """
    table {
        align: left; display: block
    }
"""
HTML('<style>{}</style>'.format(table_css))

# Text summarization with Gemma and LangChain

# Table of Contents
1. [Introduction](#Introduction)<br>
    1.1. [Aim of the project](#Aim-of-the-project)<br>
2. [Setup and important aspects](#Setup-and-important-aspects)<br>
    2.1. [Chat template](#Chat-template)<br>
    2.2. [Prompt engineering](#Prompt-engineering)<br>
    2.3. [Pipeline parameters](#Pipeline-parameters)<br>
3. [Text summarization: Methods and strategies](#Text-summarization:-Methods-and-strategies)<br>
    3.1. [Stuffing](#Stuffing)<br>
    3.2. [MapReduce](#MapReduce)<br>
    3.3. [Refine](#Refine)<br>
    3.4. [Document splitting strategies](#Document-splitting-strategies)<br>
4. [Experiments](#Experiments)<br>
5. [Fine-tuning Gemma with LoRa](#Fine-tuning-Gemma-with-LoRa)
6. [Conclusions and next steps](#Conclusions-and-next-steps)

# Introduction

Over the years, the amount of data produced, copied and consumed in the world has grown exponentially, soaring from 2 Zettabytes in 2010 to projected estimates of [181 Zettabytes in 2025](https://www.statista.com/statistics/871513/worldwide-data-created/). <br>
To put these numbers into context, envision each byte as a grain of rice; one zettabyte (10^21 bytes) would be somewhat equivalent to filling [the Pacific Ocean with rice](https://www.oldcolony.us/wp-content/uploads/2014/11/whatisbigdata-DKB-v2.pdf).

In this context, one of the most common and useful tasks in the NLP field is **text summarization.** <br>
Summarization is the process of extracting the **most important information from a text** and presenting it in a condensed form. Having quality condensed information can help individuals and organizations **reduce this information overload**, [optimizing processes while saving time and resources](https://nowigence.com/importance-benefits-of-auto-text-summarization/).

There are **mainly two** text summarization techniques: abstractive and extractive.

- **Abstractive**: The text is summarized using the available context but using different words. This is a process that requires "understanding" the information contained in the text.
- **Extractive**: The most relevant phrases or words are selected and extracted from the text, without rephrasing or generating new words.
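
To make the distinction concrete, here is a minimal sketch of a naive *extractive* summarizer. It is only an illustration: the function name `extractive_summary` and the word-frequency scoring are my own assumptions and are not used anywhere else in this notebook. Sentences are scored by how frequent their words are in the whole text, and the best ones are returned verbatim, without any rephrasing.

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 1) -> list:
    """Naive extractive summary: return the highest-scoring sentences verbatim."""
    # Split on sentence-ending punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased)
    freqs = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> int:
        # A sentence scores the sum of the frequencies of its words
        return sum(freqs[w] for w in re.findall(r"\w+", sentence.lower()))

    best = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Return the selected sentences in their original order
    return [s for s in sentences if s in best]

toy_text = "Gemma is a language model. Gemma can summarize text. Cats sleep."
toy_summary = extractive_summary(toy_text, n_sentences=1)
print(toy_summary)
```

An abstractive summarizer, such as the LLM used in this notebook, would instead be free to produce a sentence that does not appear in the input at all.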

The advent of LLMs (like Gemma) has **significantly improved** the [quality of the summaries produced](https://arxiv.org/pdf/2310.10449.pdf), so much so that LLMs are now considered the [gold standard for text summarization](https://arxiv.org/pdf/2305.14239.pdf).

However, there are **some considerations** to bear in mind when undertaking this task, in particular:

1. **Quality**: although there are metrics like ROUGE to assess quality, a summary is often subjective and varies based on the target audience. Is the summary intended for a technical or non-technical audience? Should it provide detailed information or offer a high-level overview?
2. **Hallucinations**: LLMs are known to [hallucinate](https://arxiv.org/pdf/2401.11817.pdf), that is, to generate plausible but factually or logically incorrect information. This can have a significant impact, especially with large documents whose content is not known in advance, which makes the output hard to verify.
3. **Text length**: every LLM has a maximum context window, and large documents may not fit in a single pass, so different approaches are required.
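
Since ROUGE is mentioned above, here is a simplified sketch of the idea behind ROUGE-1 (unigram overlap between a reference summary and a candidate one). This is a hand-written approximation for illustration, with whitespace tokenization and no stemming; the function name `rouge1` and the example sentences are assumptions of mine, and a real evaluation should rely on a maintained implementation.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: precision, recall and F1 over unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1(
    reference="the model was trained on one fold",
    candidate="the model was trained on eight folds",
)
print(scores)
```

Note how the candidate gets a fairly high score even though it states a different number of folds: word overlap alone says little about factual correctness, which is one reason why summary quality remains hard to measure.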


## Aim of the project

In this notebook, I will demonstrate the process of **text summarization using Gemma**, with a dedicated emphasis on **Kaggle writeups.** <br>
The key aspects I will discuss are:
- Establishing a text summarization pipeline using Gemma and LangChain
- Providing an overview of the crucial parameters and methods one should keep in mind while working with an LLM
- Exploring summarization techniques, such as Stuffing, MapReduce and Refine
- Fine-tuning Gemma using Parameter Efficient Fine-Tuning (PEFT)
- Future considerations and next steps

This work aims to build a **comprehensive understanding of the task** and develop a pipeline that can serve as a **good starting point** for individuals interested in approaching summarization tasks using open-source models like Gemma on Kaggle.

---

# Setup and important aspects

This notebook will use the following building blocks:
- **Gemma 2B**, so that the notebook can run even on consumer laptops with common GPUs
- **HuggingFace**, whose high-level abstractions make it easy to work with LLMs
- **LangChain**, to build summarization pipelines even in the presence of large documents

Let's start by importing the libraries and loading the model and pipeline that we will use with HuggingFace:

In [2]:
from transformers import pipeline, set_seed
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from accelerate.utils import release_memory
import torch
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login
from datasets import Dataset
from trl import SFTTrainer
from peft import LoraConfig, PeftModel
import pandas as pd
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import CharacterTextSplitter, HTMLHeaderTextSplitter
from langchain.prompts import PromptTemplate
from langchain.docstore.document import Document
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
import evaluate
import transformers
from langchain.llms.base import LLM
from typing import Any
import warnings
import gc
import random
import numpy as np

warnings.filterwarnings('ignore')

# Set seed for reproducibility
set_seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Read writeups dataset
writeups = pd.read_csv('/kaggle/input/kaggle-winning-solutions-methods/kaggle_winning_solutions_methods.csv')
writeups = writeups.drop_duplicates(subset=['link', 'writeup']).reset_index(drop=True)

# Logging in HF
hf_access_token = UserSecretsClient().get_secret("hf_token")
login(token = hf_access_token)

model = "/kaggle/input/gemma/transformers/2b-it/1"

# Load the HF pipeline using Gemma 2B from Kaggle
pipe = pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.float16},
    device='cuda',
    max_new_tokens=512
)

2024-03-04 00:04:13.452683: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-04 00:04:13.452779: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-04 00:04:13.600307: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Pipelines provide an efficient and user-friendly way to leverage models for inference. They consist of:
- A tokenizer, which, if not explicitly specified, is automatically imported from the model configurations on HuggingFace
- The model itself
- Parameters for controlling and fine-tuning the output

Considering the code above, several crucial parameters have been configured:
- `max_new_tokens` - controls the **maximum number of newly generated tokens**. If not specified, the default value may not be sufficient to generate enough text (therefore summaries).
- `model_kwargs` - here we control the **precision** using `torch.float16`, with beneficial effects on [memory](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#:~:text=Mixed%20precision%20training%20achieves%20all,bit%20floating%20point%20everywhere%20else.).
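
To get a feeling for the memory benefit of `torch.float16`, here is a back-of-the-envelope calculation. It assumes roughly 2.5 billion parameters for Gemma 2B (an approximation of mine; check the model card for the exact figure) and only counts the weights, ignoring activations and any cache used during generation.

```python
def weights_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the model weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3

N_PARAMS = 2.5e9  # assumption: approximate parameter count of Gemma 2B

fp32_gib = weights_memory_gib(N_PARAMS, bytes_per_param=4)  # float32 = 4 bytes
fp16_gib = weights_memory_gib(N_PARAMS, bytes_per_param=2)  # float16 = 2 bytes
print(f"float32: {fp32_gib:.1f} GiB, float16: {fp16_gib:.1f} GiB")
```

Halving the bytes per parameter halves the memory needed for the weights, which is what makes the model comfortable to load on a single common GPU.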

Now let's test our pipeline on the first writeup from our dataset, taken from [here](https://www.kaggle.com/c/asl-signs/discussion/406306):

In [3]:
# Import the first writeup from the dataset and inspect the first 1000 chars
writeup = writeups.iloc[0, 9]
print('Number of characters:', len(writeup))
writeup[:1000]

Number of characters: 9864


'<h2>TLDR</h2>\n<p>We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.</p>\n<p>We used only competition data.</p>\n<h2>1. Data Preprocessing</h2>\n<h3>1.1 CNN Preprocessing</h3>\n<ul>\n<li>We extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points, resulting in a total of 80 points.</li>\n<li>During training, we applied various augmentations.</li>\n<li>We implemented standard normalization.</li>\n<li>Instead of dropping NaN values, we filled them with zeros after normalization.</li>\n<li>We interpolated the time ax

In [4]:
messages = [
    {
        "role": "user",
        "content": "Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:\n\n{}".format(writeup)
    }
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3,
    add_special_tokens=True
)
print(outputs[0]["generated_text"][len(prompt):])

Sure, here's a summary of the text in a technical way:

**1. Data Preprocessing**

* Extract 80 points from the image, including lip and pose points.
* Apply various augmentations and normalizations.
* Fill NaN values with zeros and use nearest interpolation for the time axis.

**2. Augmentation**

* Use common and CNN specific augmentations.
* Implement a mixup augmentation that only works with CNNs.

**3. Training**

* Train EfficientNet-B0 and BERT on a single fold with 0.1 warm-up.
* Train a transformer model with a ranger optimizer and 4-layer transformer.
* Tune hyperparameters with Optuna.

**4. Submissions**

* Aggregate models in a tf.Module.
* Calculate ensemble weights for fold 0 and apply to the full dataset.

**5. PS. Need BETTER TFlite DepthwiseConv2D**

* Explore different ways to implement depthwise convolution in tflite.
* Experiment with different FLOP configurations.

**6. Conclusion**

* EfficientNet-B0 achieved a leaderboard score of 0.8.
* Transformers improved th

The output looks great, and this was possible thanks to **key features that we will now explore.**

## Chat template
The main use of LLMs is in a **chat style setup**. This means that instead of continuing a text string, the model receives messages in which **"roles" are present.** <br>
Just as there are different tokenizers for different models, each model expects different chat templates.

[Gemma's technical documentation](https://ai.google.dev/gemma/docs/formatting) outlines the necessity of employing specific tokens to indicate roles:
- Token to indicate a user turn: `user`
- Token to indicate a model turn: `model`
- Token to indicate the beginning of dialogue turn: `<start_of_turn>`
- Token to indicate the end of dialogue turn: `<end_of_turn>`

In the code above, we achieved that thanks to the presence of the following piece of code:

```
messages = [
    {"role": "user",
     "content": "...."}
     ]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```
To better understand what is happening, let's try a test chat:

In [5]:
test_messages = [
    {"role": "user",
     "content": "This is a test"},
    {"role": "assistant",
     "content": "Good for you!"},
    {"role": "user",
     "content": "Ah ah"},
]

test_prompt = pipe.tokenizer.apply_chat_template(test_messages, tokenize=False, add_generation_prompt=True)
print(test_prompt)

<start_of_turn>user
This is a test<end_of_turn>
<start_of_turn>model
Good for you!<end_of_turn>
<start_of_turn>user
Ah ah<end_of_turn>
<start_of_turn>model



Given that LLMs generate text by predicting the next token, [HuggingFace](https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts) provides the `apply_chat_template` function to **ensure the model generates text as a response to an input** rather than as a continuation of the user's prompt. This is achieved using the `add_generation_prompt` parameter, which appends the start of a model turn (`<start_of_turn>model`, visible at the end of the output above) to the prompt, **reducing the possibility of generating text that continues the user's message.**
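
To demystify what the template does, here is a hand-rolled approximation that rebuilds the same string for the test chat. It is a sketch for understanding only: the function name `format_gemma_chat` is hypothetical, it ignores any special token the real template may prepend (such as a beginning-of-sequence token), and in practice `apply_chat_template` should always be preferred.

```python
def format_gemma_chat(messages, add_generation_prompt=True):
    """Approximate Gemma's chat formatting by hand (illustrative only)."""
    # In the output above, the "assistant" role is rendered as "model"
    role_map = {"assistant": "model"}
    parts = []
    for message in messages:
        role = role_map.get(message["role"], message["role"])
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that the next generated tokens are the reply
        parts.append("<start_of_turn>model\n")
    return "".join(parts)

demo_messages = [
    {"role": "user", "content": "This is a test"},
    {"role": "assistant", "content": "Good for you!"},
    {"role": "user", "content": "Ah ah"},
]
demo_prompt = format_gemma_chat(demo_messages)
print(demo_prompt)
```

With `add_generation_prompt=False` the string would stop after the last `<end_of_turn>`, leaving the model free to start a new turn of any kind.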

## Prompt engineering

Obtaining quality output also depends on the prompt structure. <br>
The query used was:
```
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

[writeup]
```
Depending on the audience, this **prompt may vary** to allow us to generate summaries more **aligned with our target.** <br>
For example, let's assume we need to create a writeup summary for a less technical audience:

In [6]:
messages_eli5 = [
    {"role": "user",
     "content": "Summarize the following text while avoiding difficult jargon using bullet points chapters. Explain it like I am a 5 years old:\n\n{}".format(writeup)},
]

prompt_eli5 = pipe.tokenizer.apply_chat_template(messages_eli5, tokenize=False, add_generation_prompt=True)
outputs_eli5 = pipe(
    prompt_eli5,
    add_special_tokens=True,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3
)
print(outputs_eli5[0]["generated_text"][len(prompt_eli5):])

Sure, here's a summary of the text in a way that a 5-year-old might understand:

<h2>Introduction</h2>
We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model.

<h3>Data Preprocessing</h3>
We extracted 80 points from the image, including lip and body points. We also used augmentation to make the data more diverse.

<h2>Training</h2>
We trained the model on one fold with a random split. We used a one-cycle scheduler and weighted cross-entropy loss.

<h3>Hyperparameter Tuning</h2>
We tuned the learning rate, dropout rate, and other parameters to find the best settings for the model.

<h2>Results</h2>
We got a leaderboard score of 0.81, which is pretty good.

<h2>Conclusion</h2>
We used a combination of EfficientNet-B0, BERT, and DeBERTa to achieve this result.


The summary is easier to understand, simply by adding `Explain it like I am a 5 years old`.

It's worth mentioning that there are **other techniques** that can help achieve better results; one of these is called [Few-Shot Prompting](https://www.promptingguide.ai/techniques/fewshot). <br>
Few-shot prompting involves providing **examples as conditioning for subsequent inputs, guiding the model in generating the desired responses.**

Let's demonstrate few shot conditioning using the pipeline we developed so far:

In [7]:
messages_few_shot = [
    {"role": "user",
     "content": "This film was great, rich of details and with great actors."},
    {"role": "assistant",
     "content": "SENTIMENT: Positive.\nSUBJECT: Film"},
    {"role": "user",
     "content": "This park is dirty."},
    {"role": "assistant",
     "content": "SENTIMENT: Negative.\nSUBJECT: Park"},
    {"role": "user",
     "content": "This notebook is fantastic. I'm learning a lot"},
]

prompt_few_shot = pipe.tokenizer.apply_chat_template(messages_few_shot, tokenize=False, add_generation_prompt=True)
outputs_few_shot = pipe(
    prompt_few_shot,
    add_special_tokens=True,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3
)
print(outputs_few_shot[0]["generated_text"][len(prompt_few_shot):])

SENTIMENT: Positive.
SUBJECT: Notebook<end_of_turn>


The model picked up the task **from just a few examples passed through `apply_chat_template`.** Without any explicit instruction, it correctly inferred that we wanted two lines, one with the `SENTIMENT` and one with the `SUBJECT` of the sentence.
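
If the examples live in a list of (input, expected output) pairs, the message list can be assembled programmatically. The helper below is a small sketch; the name `build_few_shot_messages` is hypothetical and not part of any library.

```python
def build_few_shot_messages(examples, query):
    """Turn (user_text, assistant_text) pairs plus a final query into chat messages."""
    messages = []
    for user_text, assistant_text in examples:
        # Each example becomes one user turn followed by the desired answer
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real query goes last, as an unanswered user turn
    messages.append({"role": "user", "content": query})
    return messages

few_shot_examples = [
    ("This film was great, rich of details and with great actors.", "SENTIMENT: Positive.\nSUBJECT: Film"),
    ("This park is dirty.", "SENTIMENT: Negative.\nSUBJECT: Park"),
]
few_shot_demo = build_few_shot_messages(few_shot_examples, "This notebook is fantastic. I'm learning a lot")
print(len(few_shot_demo), few_shot_demo[-1]["role"])
```

The resulting list has the same shape as the one written by hand above and can be passed to `apply_chat_template` in the same way.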

While Few-Shot Prompting could be viable for our scenario, **it would require real summaries of Kaggle writeups passed as examples**, potentially resulting in **overly large prompts**. Considering that the length of a text can be problematic (more on that later), **we will stick to a simpler approach** by working on our prompt engineering skills as we did before, which already yielded good results.

## Pipeline parameters

You may have noticed that the pipeline uses a few parameters that are essential for controlling our output. <br>
Here's the list along with some rationale:

- `do_sample`: this parameter enables **sampling-based decoding**, in which the next token is drawn from the probability distribution over the entire vocabulary instead of always taking the most likely one. Together with `num_beams`, it selects among [different strategies](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin). I opted for `True` and `num_beams=1` (default), which corresponds to multinomial sampling. More on decoding strategies [here](https://deci.ai/blog/llm-evaluation-and-how-decoding-strategies-impact-instruction-following/) and [here](https://huggingface.co/docs/transformers/main/en/generation_strategies#decoding-strategies).
- `temperature`: this parameter controls the **randomness** of sampling. The lower the temperature, the more deterministic the results, because the most probable token is picked more and more often. I opted for a very low value because we need to encourage **more factual responses** and not creative ones.
- `top_p`: according to the [documentation](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.top_p), this parameter controls the **sampling of tokens**: only the smallest set of most probable tokens whose cumulative probability reaches `top_p` is kept. The higher the value, the more candidate words the model considers, including less likely ones. Again, I opted for a relatively low value **to maintain coherence** given the task of summarization.
- `top_k`: in simple terms, [it caps the number of highest-probability tokens](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.top_k) kept for sampling, and works together with `top_p`. Once again, a low value will favour **less creative responses**, which is exactly what we are looking for in this notebook.
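
To see how these parameters interact, here is a toy re-implementation of the sampling step over a four-token vocabulary. It is a simplified sketch written with the standard library only, not the actual `transformers` code; the function name `sample_next_token` and the toy logits are assumptions of mine.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=0):
    """Toy sampling step: temperature scaling, then top-k, then top-p filtering."""
    # 1. Temperature: dividing the logits by a small value sharpens the distribution
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # 2. Softmax numerators (subtracting the max keeps the exponentials stable)
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    # 3. top-k: keep only the k most likely tokens (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(w for _, w in ranked)
    ranked = [(tok, w / total) for tok, w in ranked]
    # 4. top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # 5. Renormalise what is left and draw one token from it
    total = sum(p for _, p in kept)
    dist = {tok: p / total for tok, p in kept}
    token = random.Random(seed).choices(list(dist), weights=list(dist.values()), k=1)[0]
    return token, dist

toy_logits = {"the": 5.0, "a": 4.0, "cat": 2.0, "zebra": 0.5}
# Same settings as the pipeline calls in this notebook
token, dist = sample_next_token(toy_logits, temperature=0.1, top_k=20, top_p=0.3)
print(token, dist)
```

With the settings used in this notebook the distribution collapses onto the single most likely token, so sampling behaves almost like greedy decoding. Raising `temperature` and `top_p` keeps more candidates alive and makes the output more varied.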

<div class="alert alert-block alert-info">
<b>Key learnings </b><br> <br>
    - Following a <b>chat template</b> is highly recommended in order to <b>mimic the model's training process</b> (and therefore make the best use of the model's knowledge).<br>
    - <b>Prompt engineering</b> is mandatory: a poor prompt will lead to poor results. Techniques such as <b>Few-Shot learning</b> can be useful tools in our arsenal. <br>
    - Controlling the generation <b>parameters</b> is important and depends on the task, whether <b>we seek creativity or factual responses.</b>
</div>

Now it's time to talk about some **important aspects** of text summarization and **LangChain!**

---

# Text summarization: Methods and strategies

There are many techniques and ways to perform text summarization, but I will focus only on **three strategies: Stuffing, MapReduce and Refine.** These ideas are based on this amazing blog post, which you can find [here](https://medium.com/@onkarmishra/using-langchain-for-question-answering-on-own-data-3af0a82789ed).
The methods are simple and easy to understand, yet quite powerful.

## Stuffing

<img src="https://i.imgur.com/28BXaOG.png" width="600">

Stuffing is pretty straightforward: we **pass the entire data** to the LLM by stuffing it into the prompt as context. **This is exactly what we did so far using Gemma.**

| Pros ✅| Cons ❌|
|---|---|
| Only a single call to the LLM | The data can exceed the model's context length, so this method may not be feasible <br> for larger text files |
| Comprehensive context, since the model has access to the <br> entire information | The quality might not be ideal for extensive documents, <br> as some information might be skipped |


## MapReduce

<img src="https://i.imgur.com/6TJbU8V.png" width="600">

MapReduce introduces a **multi-stage summarization**:
- First we split the document into chunks
- We perform text summarization for each chunk
- One final call to the LLM is used to create a comprehensive final summary, using all the summaries as input

| Pros ✅ | Cons ❌|
|---|---|
| The context limit is no longer a problem, since the document <br> is broken in manageable chunks | Multiple LLM calls are required, affecting processing time |
| Each chunk can be processed in parallel, thus speeding up <br> the summarization task | Potential loss of information, since the LLM sees each chunk <br> independently, without the surrounding context |
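
The MapReduce steps listed above can be sketched in a few lines of plain Python. This is a conceptual illustration, not LangChain's implementation: `map_reduce_summarize` is a hypothetical helper, and `toy_summarize` is a stand-in for a real LLM call that simply keeps the first three words.

```python
from typing import Callable, List

def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk independently (map), then summarize the summaries (reduce)."""
    # Map step: one independent call per chunk (these could run in parallel)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: one final call over all the partial summaries
    return summarize("\n".join(partial_summaries))

llm_calls = []

def toy_summarize(text: str) -> str:
    # Stand-in for an LLM call: record the input and keep the first three words
    llm_calls.append(text)
    return " ".join(text.split()[:3])

toy_chunks = ["one two three four", "five six seven eight", "nine ten eleven twelve"]
map_reduce_result = map_reduce_summarize(toy_chunks, toy_summarize)
print(map_reduce_result, "| number of calls:", len(llm_calls))
```

The number of calls is always the number of chunks plus one, which is where the extra processing time in the table above comes from.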


## Refine

<img src="https://i.imgur.com/CBmwG10.png" width="600">

The last method is called Refine, and follows an **iterative approach**:
- The document is split into chunks, just like the MapReduce
- The first chunk is summarized
- For each following chunk, the previous output (summary) is combined with the new information
- The LLM is instructed to improve (refine!) the previous summary

| Pros ✅| Cons ❌|
|---|---|
| It solves MapReduce's potential loss of information while retaining <br> the ability to process very large files| Multiple LLM calls are required affecting processing time |
| It processes chunks sequentially, thus potentially improving summary quality | LLM errors and hallucinations can propagate through the iterations, <br>thus affecting the final quality |
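
The iterative nature of Refine can be sketched in the same spirit. Again, this is a conceptual illustration and not LangChain's code: `refine_summarize` is a hypothetical helper, and the two toy functions stand in for the "initial summary" and "refine the summary" LLM calls.

```python
from typing import Callable, List

def refine_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> str:
    """Summarize the first chunk, then refine that summary with each following chunk."""
    if not chunks:
        raise ValueError("At least one chunk is required")
    summary = summarize(chunks[0])
    for chunk in chunks[1:]:
        # Each step sees the running summary plus the new chunk, so calls are sequential
        summary = refine(summary, chunk)
    return summary

def toy_initial(text: str) -> str:
    # Stand-in for the first LLM call: keep the first two words
    return " ".join(text.split()[:2])

def toy_refine(previous_summary: str, new_chunk: str) -> str:
    # Stand-in for the refine call: append the first two words of the new chunk
    return previous_summary + " | " + " ".join(new_chunk.split()[:2])

refine_chunks = ["one two three four", "five six seven eight", "nine ten eleven twelve"]
refine_result = refine_summarize(refine_chunks, toy_initial, toy_refine)
print(refine_result)
```

Because every step depends on the previous output, the calls cannot be parallelized, and a mistake introduced early is carried along to the end.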


## Document splitting strategies

MapReduce and Refine depend heavily on document splitting. While the concept of splitting is easy to understand, **this step is often intricate.**

First of all, when we deal with LLMs we deal with **tokens**. Tokens are [pieces of words](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them): when we write a prompt, the input is transformed into tokens. One token doesn't correspond to one word, but we can generally approximate **1 token ~ 4 characters in English ~ 3/4 of a word**. Simply put, 75 words ~ 100 tokens. <br>
Depending on the model used, we can accommodate a certain number of tokens **shared between the prompt and the model's generation**, which forces us to **split the input if the context is too large**. [Gemma](https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf) has a **maximum context length of 8192 tokens**, which roughly translates to about **6100 words.**
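
As a quick sanity check, the rule of thumb above can be turned into a rough budget estimate. This is only an approximation under the "1 token ~ 4 characters" assumption; the helper names are hypothetical, and for an exact count one would tokenize the text with the model's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4          # rule of thumb quoted above (English text)
GEMMA_CONTEXT_TOKENS = 8192  # context length quoted above

def estimate_tokens(text: str) -> int:
    """Rough token estimate from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def fits_in_context(text: str, max_new_tokens: int = 512, context: int = GEMMA_CONTEXT_TOKENS) -> bool:
    """The context is shared between the prompt and the generated tokens."""
    return estimate_tokens(text) + max_new_tokens <= context

# A text as long as the first writeup (9864 characters)
writeup_sized_text = "x" * 9864
print(estimate_tokens(writeup_sized_text), fits_in_context(writeup_sized_text))
```

Under this estimate the first writeup needs roughly 2,500 tokens, which leaves plenty of room for the 512 new tokens requested in the pipeline. HTML markup and code snippets usually tokenize less efficiently than plain English, so the real figure is likely higher.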

Document splitting can follow different strategies. For example, it can be based on a **character** (`'\n'` or `'\n\n'` which often delimit new sentences or paragraphs) or on the **document structure** (such as chapters or sections etc). <br>
The choice of the splitting strategy depends on the task and on the document we need to analyze. For our task, it is reasonable to assume the following scenarios:
- **Scenario A - Stuffing is feasible**: This is the best case scenario, we can simply pass the entire context to the model
- **Scenario B - Stuffing is feasible but output is poor**: Given that it's unlikely that a writeup exceeds 6100 words, this scenario could arise if the quality of the Stuffing summarization is poor. Our model might skip some useful information if provided with the entire text, therefore a possible approach could be splitting writeups based on sections. The Kaggle template works well because it is divided into clear sections, allowing for a clean split while keeping semantically related sentences in the same chunk.
- **Scenario C - Output is poor and writeups don't follow a clear structure**: This is the most difficult, yet plausible, scenario, in which our model struggles with the winner's stream of consciousness and lack of a clear document structure. Moreover, if the writeup is lengthy, the situation could be particularly challenging. In such cases, a character splitting strategy could be ideal.

Let's see it in action, testing a nicely formatted example writeup from the [documentation](https://www.kaggle.com/solution-write-up-documentation) and a messy one.

In [8]:
example_clean_writeup = """
# Context section
This section only contains 2 links 
# Data context
link to the competition data page
# Overview of the Approach
this section should describe the models or algorithms used, describe the data preprocessing, feature engineering, and/or feature selection strategy, described the validation strategy.
# Details of the submission
this section should include what was special, creative, important, and/or impactful about the submission. And also, what was tried and didn’t work.
# Sources
this section should include links to helpful resources like research papers, past winning write-up solutions, forum posts, helpful notebooks, etc."""

example_messy_writeup = """
This section only contains 2 links, and here the link to the competition data page.
Partial section.
Another partial section.
Extensive model secondi which describes the models or algorithms used, describe the data preprocessing, feature engineering, and/or feature selection strategy, described the validation strategy.
Special section should include what was special, creative, important, and/or impactful about the submission. And also, what was tried and didn’t work. Last section should include links to helpful resources like research papers, past winning write-up solutions, forum posts, helpful notebooks, etc."""

We'll now use LangChain [text splitters methods](https://python.langchain.com/docs/modules/data_connection/document_transformers/) to demonstrate document splitting. [LangChain](https://python.langchain.com/docs/get_started/introduction) is a framework for developing applications powered by language models, and it's great because it **simplifies common tasks** when building LLM-powered applications, which lets us easily experiment with the above techniques **without re-inventing the wheel.**

In [9]:
# Split the clean writeup based on sections
text_splitter = CharacterTextSplitter(separator='#', chunk_size=100, chunk_overlap=10)
texts_clean_writeup = text_splitter.split_text(example_clean_writeup)

# Print the first characters of each split
print([i[:50] for i in texts_clean_writeup])

['# Context section\nThis section only contains 2 lin', 'Data context\nlink to the competition data page', 'Overview of the Approach\nthis section should descr', 'Details of the submission\nthis section should incl', 'Sources\nthis section should include links to helpf']


Some important parameters to consider:
- **separator**: the character to split the text on
- **chunk_size**: maximum number of characters in each chunk (a single piece longer than this is kept whole)
- **chunk_overlap**: maximum number of characters shared between consecutive chunks

What `CharacterTextSplitter` does in this example is first **split the text on every separator** and then **merge adjacent pieces** into chunks as long as they stay within `chunk_size` (100 characters here).
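
As a rough mental model, this kind of splitter can be thought of as "split on the separator, then greedily merge the pieces". The sketch below illustrates that idea in plain Python; it is a deliberate simplification that ignores `chunk_overlap`, it is not LangChain's actual implementation, and the name `split_and_merge` is hypothetical.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list:
    """Split on the separator, then greedily merge pieces up to chunk_size characters."""
    # Split on every separator and drop empty pieces
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

demo_text = "Short line.\nAnother short line.\n" + "A much longer line that goes on and on. " * 3
demo_chunks = split_and_merge(demo_text, separator="\n", chunk_size=50)
print(demo_chunks)
```

In this toy run the two short lines end up in the same chunk, while the long line becomes a chunk of its own even though it exceeds `chunk_size`, since the text is only ever cut at separators.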

In case of a messy writeup, we could opt for a sentence separation:

In [10]:
# Split the messy writeup based on newlines
text_splitter = CharacterTextSplitter(separator='\n', chunk_size=100, chunk_overlap=50)
texts_messy_writeup = text_splitter.split_text(example_messy_writeup)

print([i[:50] for i in texts_messy_writeup])

['This section only contains 2 links, and here the l', 'Partial section.\nAnother partial section.', 'Extensive model secondi which describes the models', 'Special section should include what was special, c']


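In this case, the second chunk includes a `\n` in its body: the two short lines are **merged into a single chunk** because together they still fit within `chunk_size`. The `chunk_overlap` of 50 characters has no visible effect here, since overlap is carried over in whole pieces and only when the resulting chunk still fits within `chunk_size`, which is not the case for these lines.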

These examples are simplified; actual writeups tend to be more intricate, but **we can leverage HTML formatting**. LangChain comes with `HTMLHeaderTextSplitter`, let's test it using the previous writeup:

In [11]:
# Split on HTML headers
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2")
]

# Split the real HTML writeup based on headers
text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on, return_each_element=False)
texts_html_writeup = text_splitter.split_text(writeup)

print('Length writup:', len(writeup))
print('Number of splits:', len(texts_html_writeup))
print('Element returned:', type(texts_html_writeup[0]))
print('Length of each split:', [len(i.page_content) for i in texts_html_writeup])

# Print the first characters for each split
print(); print([(i.page_content[:50], i.metadata) for i in texts_html_writeup])

Length writup: 9864
Number of splits: 6
Element returned: <class 'langchain_core.documents.base.Document'>
Length of each split: [511, 1567, 1040, 1203, 1031, 1260]

[('We used an approach similar to audio spectrogram c', {'Header 2': 'TLDR'}), ('We extracted 18 lip points, 20 pose points (includ', {'Header 2': '1. Data Preprocessing'}), ('These augmentations are used in both CNN training ', {'Header 2': '2. Augmentation'}), ('Train on one fold with a random split (8 folds in ', {'Header 2': '3. Training'}), ('We rewrote all our models in Keras and transferred', {'Header 2': '4. Submissions, Conversion and Ensemble'}), ('Depthwise convolution models performed very well f', {'Header 2': '5. PS. Need BETTER TFlite DepthwiseConv2D'})]


Let's clarify what we got:
- The returned structure contains **6 splits** based on HTML headers (h1 and h2, but we could specify more)
- Each split is a [Document](https://api.python.langchain.com/en/v0.0.339/schema/langchain.schema.document.Document.html), a structure that stores text and metadata: each Document holds the **page content and the headers (as metadata).**
    
We could **add the metadata information** back in the page content so that our model will have **access to the section titles for better context.**

In [12]:
for i, text in enumerate(texts_html_writeup):
    # Join the metadata and the content together
    final_content = '\n'.join(text.metadata.values()) + '\n' + text.page_content
    # Replace the old content with the enriched one
    text.page_content = final_content
    
    # Print some examples
    if i < 2:
        print(final_content); print()

TLDR
We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.  
We used only competition data.

1. Data Preprocessing
We extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points, resulting in a total of 80 points. During training, we applied various augmentations. We implemented standard normalization. Instead of dropping NaN values, we filled them with zeros after normalization. We interpolated the time axis to a size of 160 using 'nearest' interpolation: yy = F.interpolate(yy[None, None, :], size=self.new_size, mode='n

Now we could use `CharacterTextSplitter` again if we want to further **split within each document** once it exceeds a certain chunk size.

In [13]:
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=100)

# Split
splits = text_splitter.split_documents(texts_html_writeup)
print('Number of final splits:', len(splits))
print('Length of each final split:', [len(i.page_content) for i in splits])

print(); print([(i.page_content[:50], i.metadata) for i in splits])

Number of final splits: 6
Length of each final split: [516, 1589, 1056, 1215, 1071, 1302]

[('TLDR\nWe used an approach similar to audio spectrog', {'Header 2': 'TLDR'}), ('1. Data Preprocessing\nWe extracted 18 lip points, ', {'Header 2': '1. Data Preprocessing'}), ('2. Augmentation\nThese augmentations are used in bo', {'Header 2': '2. Augmentation'}), ('3. Training\nTrain on one fold with a random split ', {'Header 2': '3. Training'}), ('4. Submissions, Conversion and Ensemble\nWe rewrote', {'Header 2': '4. Submissions, Conversion and Ensemble'}), ('5. PS. Need BETTER TFlite DepthwiseConv2D\nDepthwis', {'Header 2': '5. PS. Need BETTER TFlite DepthwiseConv2D'})]


Given that we selected a chunk size (2000 characters) larger than every section, we got the same six splits as before. <br>
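
To make this concrete, here is a simplified sketch of fixed-size character chunking with overlap. It is not LangChain's implementation (`CharacterTextSplitter` first splits on a separator and then merges the pieces); `chunk_text` is a hypothetical helper used only for illustration.

```python
def chunk_text(text, chunk_size=2000, chunk_overlap=100):
    """Naive sliding-window chunking by characters (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


short_section = "x" * 1589   # same length as the longest split above
long_section = "x" * 4500    # hypothetical longer section

print(len(chunk_text(short_section)))               # 1
print([len(c) for c in chunk_text(long_section)])   # [2000, 2000, 700]
```

A section shorter than `chunk_size` comes back as a single chunk, which is why the six splits are unchanged here.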

Now that everything is clear, let's run some experiments! We'll use the `splits` in a moment with MapReduce and Refine!

<div class="alert alert-block alert-info">
<b>Key learnings </b><br> <br>
    - <b>Stuffing, MapReduce and Refine</b> are three different techniques that can be used to summarize documents. <br>
    - Given the task, document structure and model capabilities, we might need to <b>split our document into chunks</b> to fit the context in our prompt or to improve the summary<br>
    - Kaggle writeups can potentially all fit within <b>Gemma's context length</b>, but other strategies, such as splitting into sections based on HTML formatting, are worth testing
</div>
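
As a rough illustration of how Stuffing and MapReduce differ in the number of model calls, here is a minimal sketch in plain Python. The "LLM" is a toy stand-in that returns the first four words of its input, and `make_counting_llm`, `stuff` and `map_reduce` are hypothetical helpers, not LangChain APIs.

```python
def make_counting_llm():
    """Return a toy 'LLM' plus the list that records every prompt it receives."""
    calls = []

    def llm(prompt):
        calls.append(prompt)
        text = prompt.split("TEXT:", 1)[1]
        # Toy "summary": the first four words of the text
        return " ".join(text.split()[:4])

    return llm, calls


def stuff(llm, chunks):
    # Stuffing: one call on the concatenation of all chunks
    return llm("Summarize. TEXT: " + " ".join(chunks))


def map_reduce(llm, chunks):
    # Map: summarize each chunk independently
    partial = [llm("Summarize. TEXT: " + chunk) for chunk in chunks]
    # Reduce: one extra call that combines the partial summaries
    return llm("Combine. TEXT: " + " ".join(partial))


demo_chunks = [
    "Preprocessing extracts 80 landmark points per frame.",
    "Augmentation uses mixup for the CNN models.",
    "Training runs on one of 8 folds.",
]

llm, calls = make_counting_llm()
stuff(llm, demo_chunks)
print(len(calls))   # 1 call

llm, calls = make_counting_llm()
map_reduce(llm, demo_chunks)
print(len(calls))   # 4 calls: 3 map + 1 reduce
```

With a real model, the map step never sees more than one chunk at a time, which is where the loss of global context comes from.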

---

# Experiments

To set up our experiments, we'll first wrap our HuggingFace pipeline in LangChain [following this guide](https://python.langchain.com/docs/modules/model_io/llms/custom_llm):

In [14]:
with torch.no_grad():
    torch.cuda.empty_cache()
gc.collect()

class GemmaLLM(LLM):
    hf_pipe: Any = None
    pipe_kwargs: Any = None
        
    def __init__(self, hf_pipeline, pipe_kwargs):
        super(GemmaLLM, self).__init__()
        self.hf_pipe = hf_pipeline
        self.pipe_kwargs = pipe_kwargs

    @property
    def _llm_type(self):
        return "Gemma pipeline"

    def _call(self, prompt, **kwargs):
        
        outputs = self.hf_pipe(
            prompt,
            do_sample=self.pipe_kwargs['do_sample'],
            temperature=self.pipe_kwargs['temperature'],
            top_k=self.pipe_kwargs['top_k'],
            top_p=self.pipe_kwargs['top_p'],
            add_special_tokens=self.pipe_kwargs['add_special_tokens']
        )
        return outputs[0]["generated_text"][len(prompt):]  

    @property
    def _identifying_params(self):
        """Pipeline params"""
        return {"n": self.pipe_kwargs}

hf = GemmaLLM(hf_pipeline=pipe,
              pipe_kwargs={
                    'do_sample':True,
                    'temperature':0.1,
                    'top_k':20,
                    'top_p':0.3,
                    'add_special_tokens':True
                })
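
A note on the last line of `_call`: the text-generation pipeline returns the prompt followed by the completion, so slicing with `len(prompt)` keeps only the newly generated text. The sketch below shows the idea with a fake pipeline; `fake_pipeline` and `demo_prompt` are hypothetical names used only for this illustration.

```python
def fake_pipeline(prompt, **generate_kwargs):
    """Stand-in for a text-generation pipeline that echoes the prompt first."""
    return [{"generated_text": prompt + " Sure, here's a summary."}]


demo_prompt = "<start_of_turn>user\nSummarize this.<end_of_turn>\n<start_of_turn>model"
demo_outputs = fake_pipeline(demo_prompt, do_sample=True, temperature=0.1)

# Drop the echoed prompt and keep only the completion
completion = demo_outputs[0]["generated_text"][len(demo_prompt):]
print(repr(completion))
```

The `transformers` text-generation pipeline also accepts `return_full_text=False`, which, if available in your installed version, makes the slicing unnecessary.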

To see it in action, let's test it on the prompt we created at the beginning:

In [15]:
prompt[:350]

'<start_of_turn>user\nSummarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:\n\n<h2>TLDR</h2>\n<p>We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models s'

In [16]:
out = hf.invoke(prompt)
print(out)

Sure, here's a summary of the text in a technical way:

**1. Data Preprocessing**

* Extract 80 points from the image, including lip and pose points.
* Apply various augmentations and normalizations.
* Fill NaN values with zeros and use nearest interpolation for the time axis.

**2. Augmentation**

* Use common and CNN specific augmentations.
* Implement a mixup augmentation that only works with CNNs.

**3. Training**

* Train EfficientNet-B0 and BERT on a single fold with 0.1 warm-up.
* Train a transformer model with a ranger optimizer and 4-layer transformer.
* Tune hyperparameters with Optuna.

**4. Submissions**

* Aggregate models in a tf.Module.
* Calculate ensemble weights for fold 0 and apply to the full dataset.

**5. PS. Need BETTER TFlite DepthwiseConv2D**

* Explore different ways to implement depthwise convolution in tflite.
* Experiment with different FLOP configurations.

**6. Conclusion**

* EfficientNet-B0 achieved a leaderboard score of 0.8.
* Transformers improved th

Everything is still working as expected: we wrapped our pipeline and passed the same parameters. Once again, **this was an example of the Stuffing method.**

Let's see how MapReduce and Refine methods perform using the split based on HTML tags we created before.

In [17]:
# MapReduce strategy

# Define prompt for summarization of each chunk
prompt_template = """<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

{text}<end_of_turn>
<start_of_turn>model"""
prompt_init = PromptTemplate.from_template(prompt_template)

# Define prompt for final output, the summary of summaries
combine_template = """<bos><start_of_turn>user
You are given a text containing summaries of different part of a document.
Create one single summary combining all the information of the chapters. Divide the summary in chapters, be impersonal and use bullet points:

{text}<end_of_turn>
<start_of_turn>model"""
combine_prompt = PromptTemplate.from_template(combine_template)

# Create the chain of summarization, using map_reduce
chain = load_summarize_chain(hf, chain_type='map_reduce', map_prompt=prompt_init, combine_prompt=combine_prompt)

# Run the chain on the chunks
out_summary = chain.invoke(splits)

Token indices sequence length is longer than the specified maximum sequence length for this model (1749 > 1024). Running this sequence through the model will result in indexing errors


In [18]:
print(out_summary['output_text'].replace('\n\n','\n'))

 1: EfficientNet-B0
* The EfficientNet-B0 model is a deep neural network architecture that is designed to be efficient.
* The model consists of a series of depthwise convolutions, followed by a global average pooling layer.
* The model is trained using a single fold of 8 randomly split folds.
**Chapter 1: Data Preparation**
* The dataset consists of 10,000 images with 10 classes.
* The EfficientNet-B0 model is trained on a single fold with the following settings:
    * Input size: 160x80
    * Number of filters: 512
    * Number of layers: 19
    * Batch size: 32
    * Learning rate: 0.001
**Chapter 2: Model Training**
* The EfficientNet-B0 model is trained on the single fold with the following settings:
    * Input size: 160x80
    * Number of filters: 512
    * Number of layers: 19
    * Batch size: 32
    * Learning rate: 0.001
**Chapter 3: Evaluation**
* The model is evaluated on the single fold with the following metrics:
    * CV score: 0.898
    * Leaderboard score: ~0.8
**Chapt

Considering that we divided our writeup into chunks, the result is still decent. We lost some coherence as the model can't access the entire document at once. <br>
By setting `verbose=True`, we can also see the steps of the chain (**expand the output cell!**):

In [19]:
# Repeat the process above, with verbose True
chain = load_summarize_chain(hf, chain_type='map_reduce', verbose=True, map_prompt=prompt_init, combine_prompt=combine_prompt)

# Run the chain on the chunks
out_summary = chain.invoke(splits)



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

TLDR
We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.  
We used only competition data.<end_of_turn>
<start_of_turn>model
Prompt after formatting:
<bos><start_of_turn>user
Summarize the following text in a technical way. Focus

Let's see the **Refine method** in action!

In [20]:
# Refine strategy 

# Define prompt for the first summarization
prompt_template = """<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

{text}<end_of_turn>
<start_of_turn>model"""
prompt_init = PromptTemplate.from_template(prompt_template)

# Define prompt for the refine phase, enhancing the previous summary with the new information
refine_template = """<bos><start_of_turn>user
Your job is to produce a final document divided in chapters and bullet points.
You are given a text containing an existing summary to a certain point:

{existing_answer}

You can now refine it (if necessary) with more context below.

{text}

Given the new context, refine the original summary.<end_of_turn>
<start_of_turn>model"""
prompt_refine = PromptTemplate.from_template(refine_template)


chain = load_summarize_chain(hf, chain_type='refine',
                             return_intermediate_steps=True,
                             input_key='input_documents',
                             output_key='output_text',
                             question_prompt=prompt_init,
                             refine_prompt=prompt_refine)

out_summary = chain.invoke(splits, return_only_outputs=True)

In [21]:
print(out_summary['output_text'])

 training

The EfficientNet-B0 model is trained on the single fold with the following settings:

* Input size: 160x80
* Number of filters: 512
* Number of layers: 19
* Batch size: 32
* Learning rate: 0.001

The BERT and DeBERTa models are trained on the full dataset with the following settings:

* Input size: 128
* Number of filters: 512
* Number of layers: 12
* Batch size: 16
* Learning rate: 0.001

**Chapter 2: Model Training**

* The EfficientNet-B0 model is trained on the single fold with the following settings:
    * Input size: 160x80
    * Number of filters: 512
    * Number of layers: 19
    * Batch size: 32
    * Learning rate: 0.001
* The BERT and DeBERTa models are trained on the full dataset with the following settings:
    * Input size: 128
    * Number of filters: 512
    * Number of layers: 12
    * Batch size: 16
    * Learning rate: 0.001

**Chapter 3: Evaluation**

The model is evaluated on the validation set with the following metrics:

* Accuracy
* Precision
* Recal

We can still set `verbose=True` as before, or we can directly inspect the intermediate steps created (**expand the cell output**):

In [22]:
print("\n###############################\n".join(out_summary["intermediate_steps"]))

 architecture:

* EfficientNet-B0 model
* Transformer models (BERT and DeBERTa) as helper models
* Single fold training with 8 splits
* Use of competition data only
</start_of_turn>

**Chapter 1: Data Preparation**

* The dataset consists of 10,000 images with 10 classes.
* The EfficientNet-B0 model is used as the base model.
* The BERT and DeBERTa models are used as helper models.
* The model is trained on a single fold from 8 randomly split folds.

**Chapter 2: Model Training**

* The EfficientNet-B0 model is trained on the single fold with the following settings:
    * Input size: 160x80
    * Number of filters: 512
    * Number of layers: 19
    * Batch size: 32
    * Learning rate: 0.001
* The BERT and DeBERTa models are trained on the full dataset with the following settings:
    * Input size: 128
    * Number of filters: 512
    * Number of layers: 12
    * Batch size: 16
    * Learning rate: 0.001

**Chapter 3: Evaluation**

* The model is evaluated on the single fold with the 

We can see how the summary **progressively grows** as each new chunk is passed to the model together with the previous summary.
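
The loop behind this behaviour can be sketched in a few lines of plain Python. This is a toy illustration, not LangChain's implementation: `keyword_llm` is a hypothetical stand-in for the model that simply keeps every capitalized word it sees, and `refine` is a hypothetical helper.

```python
def keyword_llm(prompt):
    """Toy 'LLM': returns the capitalized words of the text, in order, without repeats."""
    text = prompt.split("TEXT:", 1)[1]
    seen = []
    for word in text.split():
        if word[0].isupper() and word not in seen:
            seen.append(word)
    return " ".join(seen)


def refine(llm, chunks):
    # First chunk: plain summary
    summary = llm("Summarize. TEXT: " + chunks[0])
    steps = [summary]
    # Following chunks: update the running summary with the new context
    for chunk in chunks[1:]:
        summary = llm("Refine. TEXT: " + summary + " " + chunk)
        steps.append(summary)
    return summary, steps


demo_chunks = [
    "we used EfficientNet",
    "we also trained BERT",
    "hyperparameters were tuned with Optuna",
]
final_summary, steps = refine(keyword_llm, demo_chunks)
for step in steps:
    print(step)
```

Each step receives the previous summary plus one new chunk, so the calls are sequential and an error introduced early can propagate to the final output.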

Considering the current capabilities of Gemma and the length of the writeup, I would likely choose **the Stuffing strategy**. However, with some prompt tuning, the **other methods could be a viable approach.**

<div class="alert alert-block alert-warning">
MapReduce and Refine <b>heavily depend on the prompts you provide!</b>
</div>

<div class="alert alert-block alert-info">
<b>Key learnings </b><br><br>
    - Dividing into <b>chunks based on HTML sections</b> seems to be a reasonable approach to preserve the logic of the document.<br>
    - Gemma is able to handle a large context window, therefore <b>Stuffing summarization</b> seems the go-to strategy for Kaggle writeups. <br>
    - <b>Refine method is a valid alternative</b> given the progressive growth of the final output
</div>

---

# Fine-tuning Gemma with LoRa

Earlier we saw that we can achieve **great results by optimizing our prompts**. However, there might be times where [fine-tuning a model would work better](https://huggingface.co/docs/transformers/tasks/prompting#prompting-vs-fine-tuning):

- The domain is wildly different from what LLMs were pre-trained on and prompt optimization did not yield sufficient results.
- We need our model to work well in a low-resource language.
- We need the model to be trained on sensitive data that is under strict regulations.
- We have to use a small model due to cost, privacy, infrastructure or other limitations.

In such scenarios, we will need to fine-tune our model on a **domain-specific dataset**. <br> 

Fine-tuning an LLM is **computationally demanding**. To **train it on consumer-grade hardware**, we need to cut memory usage, for instance by **reducing the model size** through quantization. Building on this, the idea is to **fine-tune only a few parameters** rather than the entire model, a widely used family of methods called **Parameter-Efficient Fine-Tuning or PEFT.**
For more details about PEFT and, specifically, **LoRa**, the technique we are going to use now, I encourage you to read this beautiful [blog post](https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-tune-an-LLM-Part-3-The-HuggingFace-Trainer--Vmlldzo1OTEyNjMy#parameter-efficient-fine-tuning-(peft)).
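
To get a feel for the savings: LoRA keeps a weight matrix `W` (shape `d_out x d_in`) frozen and learns a low-rank update `B @ A`, with `B` of shape `d_out x r` and `A` of shape `r x d_in`. The sketch below only counts parameters; the 2048 x 2048 layer size and the rank are illustrative assumptions, not values read from Gemma's configuration, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in          # updating W directly
    lora = r * (d_out + d_in)    # B is d_out x r, A is r x d_in
    return full, lora


full, lora = lora_param_counts(d_out=2048, d_in=2048, r=6)
print(full, lora, f"{lora / full:.2%}")   # 4194304 24576 0.59%
```

For this single matrix, the adapter trains well under 1% of the parameters that full fine-tuning would update.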

Let's now see how **Gemma can be fine-tuned** using HuggingFace [TRL](https://huggingface.co/docs/trl/index), [Transformers](https://huggingface.co/docs/transformers/index) and [datasets](https://huggingface.co/docs/datasets/index). 

To do so, I'll use the [CNN/Daily Mail dataset](https://huggingface.co/datasets/cnn_dailymail#dataset-card-for-cnn-dailymail-dataset), also available on Kaggle. This dataset comprises **news articles written by journalists at CNN and the Daily Mail**. For each instance, there is a string for the article and a **string for the highlights (summary).**

<div class="alert alert-block alert-warning">
Currently, there is <b>no domain-specific dataset</b> to fine-tune our models on <b>Kaggle writeups</b>. <br> We will use the CNN news dataset simply to <b>demonstrate the benefits of fine-tuning, which could potentially apply to our specific use case as well</b>.
</div>

In [23]:
# Make space in memory
hf = release_memory(hf)

with torch.no_grad():
    torch.cuda.empty_cache()
gc.collect()

0

In [24]:
# Import of the validation set which contains fewer examples than training
validation = pd.read_csv('/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/validation.csv')[['article', 'highlights']]
validation.head()

|   | article | highlights |
|---|---|---|
| 0 | Sally Forrest, an actress-dancer who graced th... | Sally Forrest, an actress-dancer who graced th... |
| 1 | A middle-school teacher in China has inked hun... | Works include pictures of Presidential Palace ... |
| 2 | A man convicted of killing the father and sist... | Iftekhar Murtaza, 29, was convicted a year ago... |
| 3 | Avid rugby fan Prince Harry could barely watch... | Prince Harry in attendance for England's crunc... |
| 4 | A Triple M Radio producer has been inundated w... | Nick Slater's colleagues uploaded a picture to... |


Let's now set up the LoRa configuration, mainly following this [blog post](https://huggingface.co/blog/gemma-peft#low-rank-adaptation-for-large-language-models) from HuggingFace.

In [25]:
model_path = "/kaggle/input/gemma/transformers/2b-it/1"

# LoRA: rank-6 adapters on the attention and MLP projection layers
lora_config = LoraConfig(
    r=6,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Load the base model in 4-bit NF4 to reduce memory; compute in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = "right" # Fixing overflow issue ref: source code
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", quantization_config=bnb_config)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Once again, we need to pay attention to how we [format our examples](https://huggingface.co/docs/trl/sft_trainer#dataset-format-support) (remember the chat template?). This time, each example already contains the model response, so we set `add_generation_prompt=False` to avoid appending an empty model turn at the end of the string:

In [26]:
train_data = Dataset.from_pandas(validation)

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['article'])):
        messages = [
            {"role": "user",
             "content": "Given the following article, write a short summary of the article in 2-3 sentences:\n\nArticle: {}".format(example['article'][i])},
            {"role": "assistant",
             "content": "{}".format(example['highlights'][i])}
        ]
        output_texts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))
        
    return output_texts

# Print the first training example
print(formatting_prompts_func(train_data[:1])[0])

<start_of_turn>user
Given the following article, write a short summary of the article in 2-3 sentences:

Article: Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films such as the 1956 noir While the City Sleeps died on March 15 at her home in Beverly Hills, California. Forrest, whose birth name was Katherine Feeney, was 86 and had long battled cancer. Her publicist, Judith Goffin, announced the news Thursday. Scroll down for video . Actress: Sally Forrest was in the 1951 Ida Lupino-directed film 'Hard, Fast and Beautiful' (left) and the 1956 Fritz Lang movie 'While the City Sleeps' A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films including the critical and commercial success Not Wanted, Never Fear and Hard, Fast and Beautiful. Some of Forrest's other film credits included Bannerline, Son of Sinbad, and Excuse My Dust, according to her iMDB page. The p
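
To see what `add_generation_prompt` changes, here is a simplified re-implementation of the turn markup in plain Python. `format_gemma_turns` is a hypothetical helper written for this illustration; it mirrors the `<start_of_turn>` / `<end_of_turn>` markers shown in this notebook but leaves out details of the real template (such as the `<bos>` token and whitespace trimming), so treat it as a sketch rather than the tokenizer's actual behaviour.

```python
def format_gemma_turns(messages, add_generation_prompt=False):
    """Hypothetical, simplified version of the chat markup used in this notebook."""
    # In this sketch the "assistant" role is written as "model"
    role_names = {"user": "user", "assistant": "model"}
    out = ""
    for message in messages:
        role = role_names[message["role"]]
        out += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so that generation starts with the answer
        out += "<start_of_turn>model\n"
    return out


demo_messages = [
    {"role": "user", "content": "Summarize X"},
    {"role": "assistant", "content": "Short summary"},
]

# Training example: the answer is already there, no open model turn at the end
print(format_gemma_turns(demo_messages))

# Inference prompt: only the user turn, followed by an open model turn
print(format_gemma_turns(demo_messages[:1], add_generation_prompt=True))
```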

Now we can launch the actual training phase:

In [27]:
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    max_seq_length=512,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=12,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        report_to='none',
        output_dir='logs',
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_prompts_func,
)
trainer.train()

Map:   0%|          | 0/13368 [00:00<?, ? examples/s]

| Step | Training Loss |
|---|---|
| 1 | 3.3872 |
| 2 | 3.4999 |
| 3 | 3.503 |
| 4 | 3.5833 |
| 5 | 3.074 |
| 6 | 3.0885 |
| 7 | 3.1425 |
| 8 | 3.1157 |
| 9 | 2.5704 |
| 10 | 2.7566 |


TrainOutput(global_step=12, training_loss=3.100179115931193, metrics={'train_runtime': 55.0034, 'train_samples_per_second': 0.873, 'train_steps_per_second': 0.218, 'total_flos': 288551021051904.0, 'train_loss': 3.100179115931193, 'epoch': 0.0})
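
The reported `'epoch': 0.0` follows from the training arguments. A quick back-of-the-envelope check, assuming a single training process (the dataset size comes from the `Map` progress bar above):

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 12
num_examples = 13368  # rows in the validation split used for training

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
# Examples consumed over the whole run
examples_seen = effective_batch * max_steps
# Fraction of one pass over the dataset
epoch_fraction = examples_seen / num_examples

print(effective_batch, examples_seen, round(epoch_fraction, 4))   # 4 48 0.0036
```

In other words, this demo run only sees 48 examples, so it should be read as a smoke test of the pipeline rather than a full fine-tuning run.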

In [28]:
trainer.model.save_pretrained('lora_adapter')

We have now saved the LoRa adapter, which can later be used either to load the model or to continue training if necessary.

In [29]:
# Make space in memory
trainer, model, tokenizer = release_memory(trainer, model, tokenizer)

We could continue with our model, but for the sake of this tutorial I'll show how to [load our adapter and merge it into the pretrained Gemma model](https://huggingface.co/docs/trl/use_model#use-adapters-peft).

In [30]:
# Load the pretrained model and our LoRa adapter
base_model_name = "/kaggle/input/gemma/transformers/2b-it/1"
adapter_model_name = "/kaggle/working/lora_adapter"

model = AutoModelForCausalLM.from_pretrained(base_model_name, device_map='auto', torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, adapter_model_name, device_map='auto', torch_dtype=torch.float16)

# Merge the adapters into the base model so you can use the model like a normal transformers model
model = model.merge_and_unload()
model.save_pretrained('final_model')

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
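
Conceptually, merging folds the low-rank update back into the frozen weight, so that inference no longer needs the adapter. The toy sketch below illustrates the idea in plain Python on a 2x2 matrix. It is not PEFT's implementation: `matmul` and `merge_lora` are hypothetical helpers, and `scale` stands in for the adapter scaling factor (commonly `lora_alpha / r`).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_b, lora_a, scale=1.0):
    """Return W + scale * (B @ A) as a new matrix."""
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
lora_b = [[1.0], [2.0]]        # 2x1
lora_a = [[0.5, -0.5]]         # 1x2, so the update has rank 1

merged = merge_lora(w, lora_b, lora_a)
print(merged)   # [[1.5, -0.5], [1.0, 0.0]]
```

After merging, the result has the same shape as the original weight, which is why the merged model can be saved and loaded like a regular `transformers` model.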

In [31]:
model = "/kaggle/working/final_model"

# Load the HF pipeline using our newly fine-tuned Gemma 2B
pipe_finetuned = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    model_kwargs={"torch_dtype": torch.float16},
    device_map='auto',
    max_new_tokens=512
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Let's see it in action on the same writeup we analyzed before:

In [32]:
outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3,
    add_special_tokens=True
)
print(outputs[0]["generated_text"][len(prompt):])

Sure, here's a summary of the text in a technical way:

**Chapter 1: Data Preprocessing**

* Extract 18 lip points, 20 pose points, and all hand points.
* Apply various augmentations and normalization techniques.
* Fill NaN values with zeros and use nearest interpolation for the time axis.

**Chapter 2: Augmentation**

* Implement common augmentations like random affine, replace augmentation, and time and frequency masking.
* Use finger tree rotation for specific finger pairs.

**Chapter 3: Training**

* Train CNN and transformer models on one fold with a random split.
* Use onecycle scheduler, weighted cross-entropy loss, and hypercolumn tuning.
* Train EfficientNet-B0 and BERT on the full dataset.

**Chapter 4: Submissions and Ensemble**

* Aggregate models in a tf.Module class.
* Calculate ensemble weights for fold 0 and apply to the full dataset.
* Achieve leaderboard scores with EfficientNet-B0 and BERT ensemble.

**Chapter 5: PS. Need Better TFlite DepthwiseConv2D**

* Explore th

As we can see, our fine-tuned model is **still able to follow our instructions**. But how can we assess if the extra training worked? 

Given that we **requested concise summaries** during the fine-tuning process, let's compare our base model with the fine-tuned model:

In [33]:
messages = [
    {
        "role": "user",
        "content": "Write a short summary of 2-3 sentences of the following text:\n\n{}".format(writeup)
    }
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3,
    add_special_tokens=True
)
print(outputs[0]["generated_text"][len(prompt):])

Sure, here's a summary of the text:

The text describes the training of an EfficientNet-B0 model and a transformer model on a dataset of lip and pose data. The models are trained using an ensemble approach, which combines models trained on different folds.

**EfficientNet-B0:**

* Uses a CNN architecture with 5 blocks.
* Each block has a depthwise convolution operation followed by a residual connection.
* The model is trained with a weighted cross-entropy loss function and an optimizer called Adam.

**Transformer:**

* Uses a transformer architecture with 4 layers and 256 hidden units.
* The model is trained with a ranger optimizer and a loss function that encourages the model to predict the same labels as the training data.

**Ensemble:**

* The models are trained on different folds and then combined using an ensemble approach.
* The weights for each model are learned from the corresponding fold.
* The final ensemble includes EfficientNet-B0, BERT, and DeBERTa.

**Key takeaways:**

* 

In [34]:
outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3,
    add_special_tokens=True
)
print(outputs[0]["generated_text"][len(prompt):])

Sure, here's a summary of the text:

The text describes the training of an EfficientNet-B0 model and a BERT model on a dataset of lip and pose data. The EfficientNet-B0 model achieved a leaderboard score of approximately 0.8, while the BERT model achieved a score of 0.81. The ensemble of these models achieved a leaderboard score of approximately 0.82.

The text also provides some insights into the training process, including the use of augmentation, the optimization of hyperparameters, and the conversion of DepthwiseConv2D operations to tflite.


In this example, **fine-tuning noticeably improved our output**: the model now **adheres to our instructions** and summarizes the writeup in **a few sentences**, whereas in its **previous** state the output was **lengthy.**
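
A single example is only anecdotal, so one way to make the comparison less subjective is to count sentences over several outputs. The sketch below is a rough heuristic, not an established metric: `count_sentences` and `follows_length_instruction` are hypothetical helpers, and the two strings are placeholders rather than real model outputs.

```python
import re


def count_sentences(text):
    """Very rough sentence count: split on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def follows_length_instruction(text, max_sentences=3):
    """Check whether a summary respects the requested '2-3 sentences' limit."""
    return count_sentences(text) <= max_sentences


long_placeholder = "One. Two. Three. Four. Five."
short_placeholder = "One. Two."

print(count_sentences(long_placeholder), follows_length_instruction(long_placeholder))
print(count_sentences(short_placeholder), follows_length_instruction(short_placeholder))
```

Markdown headings and bullet points without final punctuation are not handled, so real outputs would need some cleaning before applying this check.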

> Looking ahead, it would be beneficial for the Kaggle community to develop a **domain-specific dataset** for fine-tuning Gemma on Kaggle write-ups. The curated summaries could adhere to a defined format, such as highlighting key aspects and identifying any omissions in the [Kaggle template](https://www.kaggle.com/solution-write-up-documentation). <br>


<div class="alert alert-block alert-info">
<b>Key learnings </b><br><br>
    - <b>Fine-tuning</b> proves beneficial in cases where <b>optimizing the prompt does not yield satisfactory results</b>. In these instances, a <b>domain-specific dataset</b> can greatly improve the desired output.<br>
    - Full-weight LLMs are <b>memory and compute-intensive</b>, which may render full fine-tuning impractical. Parameter Efficient Fine-Tuning or <b>PEFT</b> is a potent approach that freezes the original weights and can achieve <b>performance comparable to full fine-tuning with only a limited number of trainable parameters.</b> <br>
</div>

---

# Conclusions and next steps

In this notebook, we explored the intricate world of text summarization using **Gemma capabilities** together with **HuggingFace and LangChain abstractions.** <br>
We progressed from merely requesting a summary to understanding the **essential setup and parameters** required for meaningful results. This includes prompt engineering, chat templates, chunking and summarization strategies, and fine-tuning. <br>

>Why is this useful? <br>
From a Kaggle perspective, I foresee the possibility of having a personalized assistant on the website, helping users navigate through Kaggle complexity both on forum and on competition cards.

There is still one aspect I intend to explore in the coming days: implementing the findings outlined in the [QAGS paper](https://aclanthology.org/2020.acl-main.450.pdf). This poses an intriguing question: **how can we ensure factual consistency in our summary?**

The paper talks about a framework called **QAGS (Question Answering and Generation for Summarization)** and it can be summarized with one image:

<img src="https://i.imgur.com/NX6P37e.png" width="600">

The idea I'll try to implement is the following:
- Use a model to generate Q&A on the summary produced by Gemma
- Collect answers for both the summary and the input text by querying a model
- Examine consistency and identify hallucinations, that is, facts that appear only in the summary and not in the original text.
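
The three steps above can be sketched as follows. This is only a skeleton under stated assumptions: the question generation and question answering models are replaced by hard-coded toy dictionaries, token-level F1 is used as the answer-similarity measure, and `token_f1` and `qags_score` are hypothetical helper names.

```python
from collections import Counter


def token_f1(answer_a, answer_b):
    """Token-level F1 between two answer strings."""
    tokens_a, tokens_b = answer_a.lower().split(), answer_b.lower().split()
    if not tokens_a or not tokens_b:
        return float(tokens_a == tokens_b)
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(tokens_a), overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def qags_score(questions, answer_from_summary, answer_from_source):
    """Average answer agreement between the summary and the source text."""
    if not questions:
        raise ValueError("at least one question is required")
    scores = [
        token_f1(answer_from_summary(q), answer_from_source(q)) for q in questions
    ]
    return sum(scores) / len(scores), scores


# Toy stand-ins for the QA model: answers when reading the summary vs. the source
summary_answers = {"Which model was used?": "EfficientNet-B0", "What was the CV score?": "0.95"}
source_answers = {"Which model was used?": "EfficientNet-B0", "What was the CV score?": "0.898"}

score, per_question = qags_score(
    list(summary_answers), summary_answers.get, source_answers.get
)
print(score, per_question)   # 0.5 [1.0, 0.0]
```

Questions whose two answers disagree (the second one here) point to statements in the summary that the source text does not support.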

**Stay tuned, any feedback will be appreciated!**