In [None]:
# IF USING GOOGLE COLABORATORY -> RUN FIRST!!!
# OTHERWISE -> IGNORE ;-)

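try:
    from google.colab import drive  # only importable on Google Colab

    drive.mount('/content/gdrive')
except ImportError:
    pass  # not running on Google Colab; nothing to mount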

# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>
# <font color="#003660">Lesson 9: Hands-On Training: Developing Summarization Models in Resource-Limited Settings</font>

<center><br><img width=256 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... build and run your own summarization pipeline;<br>
        ... critically analyze the performance of a summarization model; and<br>
        ... optimize resource usage.<br>
    </font>
</div>
</center>
</p>

<h2>Task</h2>

<p>In today's session, we will build a text summarization model with an emphasis on efficient resource utilization. This includes preprocessing and batching the text data so that the input to our model stays streamlined and manageable, even when documents vary widely in length and size. We will also implement mixed-precision training, which combines the FP16 and FP32 data types to balance training speed and model accuracy.</p>
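
<p>To make the FP16/FP32 trade-off concrete, the following minimal sketch uses only the Python standard library (not PyTorch) to round values to half precision. It illustrates one reason why mixed-precision setups usually scale the loss and keep part of the computation in higher precision: a very small gradient underflows to zero in FP16 but survives if it is scaled first. The gradient value and the scale factor of 1024 are illustrative assumptions, not values from a real training run.</p>

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Hypothetical small gradient, chosen to lie below the FP16 range.
tiny_gradient = 1e-8
# Assumed loss-scale factor (illustrative only).
loss_scale = 1024.0

# Stored directly in FP16, the gradient underflows to zero.
unscaled = to_fp16(tiny_gradient)

# Scaled before the FP16 round trip, then unscaled in higher precision.
recovered = to_fp16(tiny_gradient * loss_scale) / loss_scale

print(f"direct FP16 storage : {unscaled}")
print(f"with loss scaling   : {recovered:.3e}")
```

<p>With the Hugging Face <code>Trainer</code>, mixed precision is typically switched on through the <code>fp16</code> argument of <code>Seq2SeqTrainingArguments</code>, which expects a CUDA-capable GPU.</p>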

<p>Given the limited GPU resources typically available in academic settings, this exercise aims to provide a realistic experience in training NLP models under computational constraints. You will learn to adjust batch sizes, manage document lengths, and choose an appropriate number of samples, all while ensuring the model runs smoothly on the available hardware. This practical approach not only bolsters your coding skills but also instills an understanding of crucial efficiency considerations in model training.</p>
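
<p>One way to see how document length and batch composition interact is to count the tokens a model has to process once padding is included. The sketch below compares padding every sequence to a fixed global maximum with padding each batch only to its own longest sequence, which is the per-batch ("dynamic") padding that <code>DataCollatorForSeq2Seq</code> can apply. The token counts, batch size, and <code>max_length</code> are made-up values for illustration, and sorting by length is only one possible strategy.</p>

```python
from typing import List


def make_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Split a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padded_tokens_fixed(lengths: List[int], max_length: int) -> int:
    """Total tokens when every sequence is padded (or truncated) to max_length."""
    return len(lengths) * max_length


def padded_tokens_dynamic(batches: List[List[int]], max_length: int) -> int:
    """Total tokens when each batch is padded to its own longest sequence."""
    total = 0
    for batch in batches:
        longest = min(max(batch), max_length)  # truncate overly long documents
        total += longest * len(batch)
    return total


# Hypothetical token counts for eight documents (illustrative values only).
doc_lengths = [120, 480, 95, 512, 130, 60, 300, 450]
max_length = 512
batch_size = 4

fixed = padded_tokens_fixed(doc_lengths, max_length)
dynamic = padded_tokens_dynamic(make_batches(doc_lengths, batch_size), max_length)
# Grouping documents of similar length reduces padding further.
sorted_dynamic = padded_tokens_dynamic(
    make_batches(sorted(doc_lengths), batch_size), max_length
)

print(f"fixed padding:            {fixed} tokens")
print(f"dynamic padding:          {dynamic} tokens")
print(f"dynamic + length sorting: {sorted_dynamic} tokens")
```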

<p>A significant part of today's session is dedicated to the critical evaluation of the summarization model. We will select suitable metrics for assessing model performance and consider the accuracy, coherence, and relevance of the generated summaries. This evaluation is crucial for judging the real-world applicability of the model and for learning how to critically assess outputs, an essential skill for any machine learning practitioner.</p>
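
<p>A common starting point for summarization is an n-gram overlap metric such as ROUGE. The following is a deliberately simplified ROUGE-1 F1 score computed on lowercased whitespace tokens, written from scratch to show what such a metric measures. It is a sketch rather than a reference implementation and omits the stemming and normalization found in established packages; the function name <code>rouge1_f1</code> and the example sentences are hypothetical.</p>

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference summary and model output.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.3f}")
```

<p>Overlap scores of this kind reward shared words, not coherence or factual correctness, so they are best read alongside a manual inspection of the generated summaries.</p>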

<h2>Guidance</h2>

<ul>
  <li>Start by understanding the structure of the base model you are working with.</li>
  <li>Identify the special tokens of the chosen model architecture and use them correctly.</li>
  <li>For models that require or recommend it, add prompts to your documents before training.</li>
  <li>Assess the memory demands of the models and experiment with different batch sizes.</li>
  <li>Investigate the appropriate metric(s) for evaluation and critically assess how meaningful they are.</li>
  <li>If necessary, concentrate on documents that are similar in length or subject matter to enhance the performance and reliability of this smaller, proof-of-concept model.</li>
</ul>
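
<p>To assess memory demands before experimenting with batch sizes, a rough back-of-the-envelope estimate can help. The sketch below assumes full FP32 training with an Adam-style optimizer: 4 bytes per parameter for the weights, 4 for the gradients, and 8 for the two optimizer states. It ignores activations, which grow with batch size and sequence length, so the result is only a lower bound. The parameter count of 60 million is an illustrative assumption for a small sequence-to-sequence model, not a measured value.</p>

```python
BYTES_WEIGHTS = 4      # FP32 weights
BYTES_GRADIENTS = 4    # FP32 gradients
BYTES_OPTIMIZER = 8    # two FP32 Adam moment estimates per parameter


def training_memory_gib(num_parameters: int) -> float:
    """Lower-bound estimate of training memory in GiB (activations excluded)."""
    bytes_per_param = BYTES_WEIGHTS + BYTES_GRADIENTS + BYTES_OPTIMIZER
    return num_parameters * bytes_per_param / 1024**3


# Assumed parameter count (illustrative only).
num_parameters = 60_000_000
estimate = training_memory_gib(num_parameters)
print(f"about {estimate:.2f} GiB before activations")
```

<p>For a loaded PyTorch model, the parameter count can be obtained with <code>sum(p.numel() for p in model.parameters())</code>.</p>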

<h2>Useful Links</h2>

<ul>
  <li><a href="https://huggingface.co/learn/nlp-course/chapter1/7?fw=pt">Hugging Face | Sequence-to-Sequence Models</a></li>
  <li><a href="https://huggingface.co/docs/transformers/tasks/summarization">Hugging Face | Summarization</a></li>
  <li><a href="https://huggingface.co/docs/transformers/main_classes/data_collator">Hugging Face | Data Collator</a></li>
  <li><a href="https://huggingface.co/docs/transformers/v4.35.2/en/main_classes/trainer#transformers.Trainer">Hugging Face | Trainer</a></li>
  <li><a href="https://huggingface.co/docs/transformers/training#train-a-tensorflow-model-with-keras">Hugging Face | Train with PyTorch Trainer (<i>under Show PyTorch Content</i>)</a></li>
  <li><a href="https://huggingface.co/docs/datasets/index">Hugging Face | Datasets</a>
    <ul>
      <li><a href="https://huggingface.co/datasets/cnn_dailymail">CNN / DailyMail</a></li>
      <li><a href="https://huggingface.co/datasets/newsroom">Newsroom</a></li>
      <li><a href="https://huggingface.co/datasets/samsum">SAMSum</a></li>
      <li>...</li>
    </ul>
  </li>
  <li><a href="https://paperswithcode.com/datasets?task=text-summarization">Papers With Code (Meta) | Text Summarization Datasets</a></li>
  <li><a href="https://metatext.io/datasets-list/summarization-task">Metatext | Summarization Datasets</a></li>
</ul> 

In [None]:
# Import

import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer

# Continue...

In [None]:
# Have fun ;-)
# ...