# Task-specific knowledge distillation for Longformer-Encoder-Decoder (LED)
### Abstractive Summarization using Longformer-Encoder-Decoder (LED) as Teacher and ??? as Student
#### Group 22: Anh Le, Kasra Sohrab, and Katie Krupczak

! INTRO HERE !

## Installation

In [2]:
%pip install transformers datasets tensorboard --upgrade


Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install accelerate -U
%pip install transformers[torch]
%pip install prettytable


## Setup & Configuration


In [None]:
student_id = ???
teacher_id = ???

The check below makes sure the `Teacher` and `Student` tokenizers produce the same output for the same input.

In [None]:
from transformers import AutoTokenizer

# init tokenizer
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_id)
student_tokenizer = AutoTokenizer.from_pretrained(student_id)

# sample input
sample = "This is a basic example, with different words to test."

# assert results
assert teacher_tokenizer(sample) == student_tokenizer(sample), "Tokenizers haven't created the same output"
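
A single sentence is a narrow test. The following is a minimal sketch of a broader check over several samples, assuming both tokenizers are callables that return a mapping with an `input_ids` entry (as Hugging Face tokenizers do). The helper name `find_tokenizer_mismatches` and the two toy tokenizers are hypothetical; they exist only so the block runs without downloading a model.

```python
def find_tokenizer_mismatches(tokenizer_a, tokenizer_b, samples):
    """Return the samples for which the two tokenizers produce different input_ids."""
    mismatches = []
    for text in samples:
        ids_a = list(tokenizer_a(text)["input_ids"])
        ids_b = list(tokenizer_b(text)["input_ids"])
        if ids_a != ids_b:
            mismatches.append(text)
    return mismatches


# Toy stand-ins for real tokenizers (hypothetical, for illustration only)
def toy_tokenizer(text):
    # map each whitespace-separated word to its length
    return {"input_ids": [len(word) for word in text.split()]}


def toy_drop_upper_tokenizer(text):
    # same mapping, but skips fully upper-case words, so it can disagree
    return {"input_ids": [len(word) for word in text.split() if not word.isupper()]}


samples = ["a basic example", "LED handles long inputs"]
same = find_tokenizer_mismatches(toy_tokenizer, toy_tokenizer, samples)
different = find_tokenizer_mismatches(toy_tokenizer, toy_drop_upper_tokenizer, samples)
print(same)
print(different)
```

With the real tokenizers, `find_tokenizer_mismatches(teacher_tokenizer, student_tokenizer, samples)` should return an empty list if the two are compatible.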

## Dataset & Pre-processing
We will use the ["scientific_papers"](https://huggingface.co/datasets/scientific_papers) dataset available from Hugging Face. It comprises two sets of scientific papers, one obtained from the arXiv repository and one from the PubMed OpenAccess repository.

This dataset suits our purposes because it plays to LED's design strength: processing long sequences.

In [6]:
from datasets import load_dataset

# two configurations are available within scientific_papers: "arxiv" and "pubmed"
dataset = load_dataset("scientific_papers", "arxiv")
dataset

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 203037
    })
    validation: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 6436
    })
    test: Dataset({
        features: ['article', 'abstract', 'section_names'],
        num_rows: 6440
    })
})
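
To see how long the documents are relative to their summaries, a rough sketch like the following can help. It counts whitespace-separated words rather than model tokens, so the numbers are only an approximation. The helper `length_stats` and the toy records are hypothetical; the field names `article` and `abstract` are taken from the features listed in the output above.

```python
from statistics import mean, median


def length_stats(records, field, limit):
    """Word-count statistics for one text field, plus the share of records over `limit` words."""
    lengths = [len(record[field].split()) for record in records]
    over = sum(1 for n in lengths if n > limit)
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
        "share_over_limit": over / len(lengths),
    }


# Toy records standing in for rows of the dataset (not real data)
toy_records = [
    {"article": "word " * 10, "abstract": "word " * 2},
    {"article": "word " * 30, "abstract": "word " * 4},
]

article_stats = length_stats(toy_records, "article", limit=20)
abstract_stats = length_stats(toy_records, "abstract", limit=20)
print(article_stats)
print(abstract_stats)
```

On the real data, a call such as `length_stats(dataset["train"].select(range(1000)), "article", limit=512)` would apply the same idea to a sample of the training split.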

### Pre-processing & Tokenization

To distill our model, we need to convert the natural-language text into token IDs. We use the tokenizer of the `Teacher`, but since both tokenizers produce the same output, the `Student` tokenizer would work equally well.

### Data Cleaning Methods
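
One possible cleaning step, offered as a sketch rather than the method used here: the text in this dataset appears to contain placeholder tokens such as `@xcite` and `@xmath0` in place of citations and formulas, and stripping them before tokenization is one option. The function name `clean_text` and the exact patterns are assumptions.

```python
import re

# Placeholder tokens such as "@xcite" or "@xmath12" (assumed pattern)
PLACEHOLDER_RE = re.compile(r"@x(?:cite|math\d*)")
# Runs of whitespace, including newlines
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove placeholder tokens and collapse repeated whitespace."""
    text = PLACEHOLDER_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


example = "additive models @xcite use @xmath0 terms .\n they are flexible ."
cleaned = clean_text(example)
print(cleaned)
```

In a batched `map` function, this could be applied to each text before tokenization, for example `[clean_text(t) for t in examples["article"]]`.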


In [None]:
### Data Cleaning and Tokenizing

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(teacher_id)

In [None]:
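def process(examples):
    # tokenize the article body as the model input
    tokenized_inputs = tokenizer(
        examples["article"], truncation=True, max_length=512
    )
    # tokenize the reference abstract as the summarization target
    # (reusing max_length=512 here is an assumption; tune as needed)
    targets = tokenizer(
        text_target=examples["abstract"], truncation=True, max_length=512
    )
    tokenized_inputs["labels"] = targets["input_ids"]
    return tokenized_inputs

# the dataset has no "sentence" or "label" column, so labels are built from "abstract" above
tokenized_datasets = dataset.map(process, batched=True)

tokenized_datasets["test"].features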

In [None]:
## Vocabulary Building