<a href="https://colab.research.google.com/github/insublee/GAS_summarization/blob/main/BART_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#0. Install and import

In [None]:
!pip install transformers pytorch-lightning wandb datasets rouge_score hydra-core

Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting pytorch-lightning
  Downloading pytorch_lightning-1.4.9-py3-none-any.whl (925 kB)
Collecting wandb
  Downloading wandb-0.12.4-py2.py3-none-any.whl (1.7 MB)
Collecting datasets
  Downloading datasets-1.13.3-py3-none-any.whl (287 kB)
Collecting rouge_score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Collecting hydra-core
  Downloading hydra_core-1.1.1-py3-none-any.whl (145 kB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting to

In [None]:
import os
from typing import Optional, List, Dict
from itertools import chain
import hydra
from omegaconf import DictConfig, OmegaConf
import pandas as pd
import datasets
import math
import pickle

from sklearn.model_selection import train_test_split
import pytorch_lightning as pl
from pytorch_lightning import LightningDataModule, LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint

import torch
from torch.utils.data import DataLoader
from transformers import (
    BartForConditionalGeneration,
    AdamW,
    AutoModelForCausalLM,
    AutoTokenizer,
    get_cosine_schedule_with_warmup,
)
os.chdir('./drive/MyDrive/codes/Dacon/gas/')

#1. Config

In [None]:
config = DictConfig({
    "project" : "gas summary",
    "cache_path" : "data/processed_dataset",
    "vaild_path" : "data/vaild_original.json",
    "train_path" : "data/train_original.json",
    "test_path" : "data/test.jsonl",
    "sample_submission_path" : "data/sample_submission.csv",
    "submission_path" : 'data/BART_baseline.csv',
    'train_vaild_split' : 0.01,
    "field" : 'documents',
    "text_fields" : ['text', 'abstractive'],
    "input_colums" : ['input_ids', 'attention_mask', 'decoder_input_ids', 'decoder_attention_mask', 'labels'],
    "model_name_or_path":'hyunwoongko/kobart',
    "text_max_length":800,
    "abstractive_max_length":200,
    "train_batch_size":10,
    "valid_batch_size":10,
    "test_batch_size":20,
    "learning_rate" : 3e-5,
    "adam_epsilon" : 1e-8,
    "warmup_steps" : 500,
    "total_steps" : 600000,
    "train_examples" : 298202,
    "weight_decay" : 1e-2,
    'gpus': 1,
    'precision': 16,
    'accumulate_grad_batches': 1,
    'val_check_interval': 0.25, # default 1.0
    'gradient_clip_val': 1.0,
    'num_save_ckpt': 5,
    'num_beams' : 4, 
    'no_repeat_ngram_size' : 3, 
    'length_penalty' : 0.8,
    'repetition_penalty' : 2.0,
    'accelerator' : None,#'ddp'
    }
)
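
As a rough sanity check on the schedule settings above, the step budget can be related to the dataset size. This is a minimal sketch using values copied from the config. The names `batches_per_epoch`, `steps_per_epoch`, `approx_epochs`, and `warmup_fraction` are illustrative only, and it assumes a single GPU and that the last, smaller batch of each epoch is kept.

```python
import math

# Values copied from the config above.
train_examples = 298202
train_batch_size = 10
accumulate_grad_batches = 1
total_steps = 600000
warmup_steps = 500

# Batches per epoch, counting the final partial batch (assumption: it is not dropped).
batches_per_epoch = math.ceil(train_examples / train_batch_size)

# One optimizer step is taken every `accumulate_grad_batches` batches.
steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)

# How many passes over the training split the step budget corresponds to.
approx_epochs = total_steps / steps_per_epoch

# Share of the schedule spent in linear warmup.
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"epochs covered by total_steps: {approx_epochs:.2f}")
print(f"warmup fraction: {warmup_fraction:.4%}")
```

With these settings the cosine schedule is stretched over roughly twenty passes through the training split, so a run stopped early never leaves the high-learning-rate part of the curve.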

#2. Data

In [None]:
# !unzip data/'문서요약 텍스트'/Validation/신문기사_vaild_original.zip -d data
# !unzip data/'문서요약 텍스트'/Training/신문기사_train_original.zip -d data
# !unzip data/'235829_가스에너지분야 문서요약 모델개발_data.zip' -d data

In [None]:
# !du -h ./data
# !rm data/processed_dataset
!ls -al ./data

total 1428839
drwx------ 2 root root       4096 Oct 17 13:15 '문서요약 텍스트'
-rw------- 1 root root    4157527 Oct 18 02:11 '235829_가스에너지분야 문서요약 모델개발_data.zip'
-rw------- 1 root root    6306423 Oct 19 11:04  BART_baseline.csv
-rw------- 1 root root       5644 Oct 19 05:27  processed_dataset
-rw------- 1 root root      83231 Oct 11 08:22  sample_submission.csv
-rw------- 1 root root   20039720 Oct 11 08:22  test.jsonl
-rw------- 1 root root 1285766826 Oct 18 01:08  train_original.json
-rw------- 1 root root  146765156 Oct 18 01:08  vaild_original.json


In [None]:
class SummaryDataModule(LightningDataModule):
    """Processes the stored data and prepares the dataloaders.
    
    BART is an encoder-decoder model, so each example needs five inputs:
    'input_ids', 'attention_mask', 'decoder_input_ids', 'decoder_attention_mask', 'labels'.
    The document "text" is encoded with the tokenizer to produce input_ids and attention_mask,
    and the abstractive summary "abstractive" is encoded to produce decoder_input_ids and decoder_attention_mask.
    labels are set to the same token ids as the decoder input.
    
    The processed splits are finally served through dataloaders.

    Args:
        config: A DictConfig that keeps the hyperparameters in one place.

    Attributes:
        save : Saves the preprocessed data with pickle.
        load : Loads the preprocessed data from pickle.
        prepare_data : Reads the JSON and JSONL files, keeps only the required fields, and joins each field into one string per example.
        setup : Tokenizes the data and converts it to tensors.
        convert_to_features : Tokenization function used by setup.
        train_dataloader : Returns the train dataloader.
        val_dataloader : Returns the validation dataloader.
        test_dataloader : Returns the test dataloader.
        _dataloader : Builds the dataloader for a given split.

    Example:
        train_batch = next(iter(dm.train_dataloader()))
        for k in train_batch.keys():
            print(k, train_batch[k].size())

    Reference:
        https://colab.research.google.com/github/PytorchLightning/lightning-tutorials/blob/publication/.notebooks/lightning_examples/text-transformers.ipynb

    TODO:
        Fix save & load: loading currently fails with an error because datasets uses its own cache.
    """

    def __init__(
        self,
        config:DictConfig
    ) -> None:
        super().__init__()
        self.config = config
        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name_or_path, use_fast=True)
        self.prepare_data()
        self.setup("fit")


    def save(self):
        # Persist the processed DatasetDict so later runs can skip preprocessing.
        with open(self.config.cache_path, "wb") as cached_file:
            pickle.dump(self.dataset, cached_file)
        self.done = True


    def load(self):
        # Restore the processed DatasetDict if a non-empty cache file exists.
        cache_path = self.config.cache_path
        if os.path.exists(cache_path) and os.path.getsize(cache_path) > 0:
            with open(cache_path, "rb") as cached_file:
                self.dataset = pickle.load(cached_file)
            self.done = True


    def prepare_data(self):
        """
        Loads the train, validation, and test data with datasets.Dataset.from_json,
        then drops every field except ['text', 'abstractive'].
        Each example's field is also joined into a single string, so every column is a List[str].
        """
        train_news = datasets.Dataset.from_json(self.config.train_path, field = self.config.field)
        val_news = datasets.Dataset.from_json(self.config.vaild_path, field = self.config.field)
        test_news = datasets.Dataset.from_json(self.config.test_path)

        train_valid = datasets.concatenate_datasets([train_news, val_news]).train_test_split(self.config.train_vaild_split)
        
        self.dataset = datasets.DatasetDict({
            'train': train_valid['train'],
            'valid': train_valid['test'],
            'test': test_news,
        })

        for split in self.dataset.keys():
            if split !='test':
                self.dataset[split] = self.dataset[split].map(
                    lambda x:{'abstractive':'\n'.join(x['abstractive']),
                              'text':'\n'.join(list(chain(*[[j['sentence'] for j in i] for i in x['text']])))}
                )
            else:
                self.dataset[split] = self.dataset[split].map(
                    lambda x:{'text':'\n'.join(x['article_original'])}
                )
            remove_column = [i for i in self.dataset[split].column_names if i not in self.config.text_fields]
            self.dataset[split] = self.dataset[split].remove_columns(remove_column)

        self.sample_submission = pd.read_csv(self.config.sample_submission_path)


    def setup(self, stage: str)->None:
        for split in self.dataset.keys():
            self.dataset[split] = self.dataset[split].map(
                self.convert_to_features,
                batched=True,
            )
            columns = [c for c in self.dataset[split].column_names if c in self.config.input_colums]
            self.dataset[split].set_format(type="torch", columns=columns)


    def convert_to_features(
        self,
        example_batch:Dict[str, List[str]],
        indices=None
    ) -> Dict[str, List]:
        return_dict = {}
        encoded_text = self.tokenizer.batch_encode_plus(
                example_batch[self.config.text_fields[0]],
                truncation=True,
                padding='max_length',
                max_length=self.config.text_max_length,
            )
        return_dict['input_ids'] = encoded_text.input_ids
        return_dict['attention_mask'] = encoded_text.attention_mask
        
        if self.config.text_fields[1] in example_batch.keys():
            encoded_text = self.tokenizer.batch_encode_plus(
                example_batch[self.config.text_fields[1]],
                padding='max_length',
                truncation=True,
                max_length=self.config.abstractive_max_length,
            )
            return_dict['decoder_input_ids'] = encoded_text.input_ids
            return_dict['decoder_attention_mask'] = encoded_text.attention_mask
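            # NOTE: labels are identical to decoder_input_ids here. With teacher forcing,
            # the decoder input is normally the labels shifted right by one token.
            return_dict['labels'] = encoded_text.input_ids

            # Optional: replace padding with -100 so padded positions are ignored by the loss.
            # return_dict['labels'] = [[j if j!= self.tokenizer.pad_token_id else -100 for j in i] for i in return_dict['labels']]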

        return return_dict


    def train_dataloader(self):
        return self._dataloader(
            dataset=self.dataset["train"],
            batch_size=self.config.train_batch_size,
            shuffle=True
        )


    def val_dataloader(self):
        return self._dataloader(
            dataset=self.dataset["valid"],
            batch_size=self.config.valid_batch_size
        )


    def test_dataloader(self):
        return self._dataloader(
            dataset=self.dataset["test"],
            batch_size=self.config.test_batch_size
        )


    def _dataloader(
        self,
        dataset:datasets.Dataset,
        batch_size: int,
        shuffle: bool=False,
    ) -> DataLoader:

        return DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            num_workers=os.cpu_count(),
            drop_last=False,
            shuffle=shuffle,
            pin_memory=True,
        )


dm = SummaryDataModule(config)

Using custom data configuration default-46b91eb94e06dc1d
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-46b91eb94e06dc1d/0.0.0...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-46b91eb94e06dc1d/0.0.0. Subsequent calls will reuse this data.

Using custom data configuration default-833fd5127ff3672a
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-833fd5127ff3672a/0.0.0...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-833fd5127ff3672a/0.0.0. Subsequent calls will reuse this data.

Using custom data configuration default-8b90d36c768af83c
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-8b90d36c768af83c/0.0.0...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-8b90d36c768af83c/0.0.0. Subsequent calls will reuse this data.
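
The lambda in `prepare_data` flattens a nested list of sentence records into one newline-joined string per example. A minimal sketch on a toy record may make the shape clearer. The record layout (a list of paragraphs, each a list of dicts with a `'sentence'` key) is inferred from that lambda, the sentences are made up, and `flatten_record` is a hypothetical helper name.

```python
from itertools import chain

# Toy record; the nesting mirrors what the lambda in prepare_data expects (assumption).
record = {
    "text": [
        [{"sentence": "First sentence."}, {"sentence": "Second sentence."}],
        [{"sentence": "Third sentence."}],
    ],
    "abstractive": ["A one-line summary."],
}


def flatten_record(example):
    """Join all document sentences, and all summary lines, with newlines."""
    sentences = chain.from_iterable(
        (item["sentence"] for item in paragraph) for paragraph in example["text"]
    )
    return {
        "text": "\n".join(sentences),
        "abstractive": "\n".join(example["abstractive"]),
    }


flat = flatten_record(record)
print(flat["text"])
```

Paragraph boundaries are lost in this flattening: sentences from different paragraphs end up separated by the same single newline.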

In [None]:
dm.dataset

DatasetDict({
    train: Dataset({
        features: ['abstractive', 'attention_mask', 'decoder_attention_mask', 'decoder_input_ids', 'input_ids', 'labels', 'text'],
        num_rows: 298202
    })
    valid: Dataset({
        features: ['abstractive', 'attention_mask', 'decoder_attention_mask', 'decoder_input_ids', 'input_ids', 'labels', 'text'],
        num_rows: 3013
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'text'],
        num_rows: 4161
    })
})
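
The `labels` and `decoder_input_ids` columns listed above hold the same token ids. In the usual teacher-forcing setup for encoder-decoder models, the decoder input is instead the label sequence shifted right by one position, and padded label positions are replaced by -100 so the loss skips them. The following plain-Python sketch illustrates that convention. It is not the library's code; the token ids `PAD`, `EOS`, and `START` and the helper names `mask_padding` and `shift_right` are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids: List[int], pad_token_id: int) -> List[int]:
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]


def shift_right(
    label_ids: List[int], pad_token_id: int, decoder_start_token_id: int
) -> List[int]:
    """Build decoder inputs: prepend the start token, drop the last label,
    and turn ignored positions back into padding."""
    shifted = [decoder_start_token_id] + label_ids[:-1]
    return [pad_token_id if t == IGNORE_INDEX else t for t in shifted]


# Hypothetical special-token ids, for illustration only.
PAD, EOS, START = 3, 1, 2

summary_ids = [11, 12, 13, EOS, PAD, PAD]  # a tokenized, padded summary
labels = mask_padding(summary_ids, PAD)
decoder_input_ids = shift_right(labels, PAD, START)

# At position t the decoder sees decoder_input_ids[: t + 1] and must predict labels[t].
for t, (seen, target) in enumerate(zip(decoder_input_ids, labels)):
    print(t, seen, target)
```

When the two sequences are identical, the target at each position is already visible in the decoder input, so a low training loss says little about generation quality.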

#3. Model

In [None]:
class AbstractiveBART(LightningModule):
    """Loads the BART model and defines the training, validation, and test steps.

    Args:
        config: A DictConfig that keeps the hyperparameters in one place.

    Attributes:
        forward : Runs a forward pass through the wrapped model.
        training_step : Takes the loss from the model's forward output and logs it.
        validation_step : Compares the generated summaries with the reference summaries using ROUGE scores.
        test_step : Generates a summary for each input document.
        test_epoch_end : Writes the generated summaries to the submission file.
        _generate : Generates summary text for the given inputs.
        configure_optimizers : Excludes bias and LayerNorm.weight from weight decay, and returns the optimizer and scheduler.

    Reference:
        https://huggingface.co/transformers/main_classes/model.html?highlight=generate#generation
        https://github.com/huggingface/datasets/blob/master/metrics/rouge/rouge.py
        https://huggingface.co/blog/how-to-generate
        https://colab.research.google.com/github/PytorchLightning/lightning-tutorials/blob/publication/.notebooks/lightning_examples/text-transformers.ipynb#scrollTo=647c7ec9

    TODO:
        Generation quality is currently poor.
        Experiment with different models, data, optimizers, and learning rates.
    """
    def __init__(
        self,
        config:DictConfig
        ) -> None:

        super().__init__()
        self.config = config
        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name_or_path, use_fast=True)
        self.model = BartForConditionalGeneration.from_pretrained(self.config.model_name_or_path)
        self.metric = datasets.load_metric('rouge')


    def forward(
        self,
        **inputs:Dict[str, torch.Tensor]
        ):

        return self.model(**inputs)


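    def training_step(
        self,
        batch:Dict[str, torch.Tensor],
        batch_idx:int,
        )->torch.Tensor:

        # The batch carries decoder_input_ids that are identical to labels, which
        # lets the decoder copy its own input. Pass only labels so that
        # BartForConditionalGeneration builds the decoder input by shifting them right.
        outputs = self(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            labels=batch['labels'],
        )
        loss = outputs.loss
        self.log("train_loss", loss)
        return loss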


    def validation_step(
        self,
        batch:Dict[str, torch.Tensor],
        batch_idx:int,
        dataloader_idx:int=0
        ):

        #https://huggingface.co/transformers/main_classes/model.html?highlight=generate#generation
        #https://github.com/huggingface/datasets/blob/master/metrics/rouge/rouge.py
        #https://huggingface.co/blog/how-to-generate
        # Generate summaries from the encoder inputs only.
        pred = self._generate(
            batch['input_ids'],
            batch['attention_mask'],
            )
        # Decode the reference summaries back to text for scoring.
        ref = self.tokenizer.batch_decode(
            batch['decoder_input_ids'],
            skip_special_tokens=True
        )
        rouge = self.metric.compute(predictions=pred, references=ref)
        self.log("rouge1", rouge['rouge1'].mid.fmeasure)
        self.log("rouge2", rouge['rouge2'].mid.fmeasure)
        self.log("rougeL", rouge['rougeL'].mid.fmeasure)


    def test_step(
        self,
        batch:Dict[str, torch.Tensor],
        batch_idx:int,
        ) -> list:

        pred = self._generate(batch['input_ids'],batch['attention_mask'])
        return pred


    def test_epoch_end(self, outputs) -> None:
        sample = pd.read_csv(self.config.sample_submission_path)
        sample['summary'] = list(chain(*outputs))
        sample.to_csv(self.config.submission_path,index=False)
        return


    def _generate(
        self,
        input_ids:torch.Tensor,
        attention_mask:torch.Tensor,
        )->List[str]:

        # https://huggingface.co/transformers/main_classes/model.html?highlight=generate
        generated_out = self.model.generate(
            input_ids,
            attention_mask = attention_mask,
            max_length=self.config.abstractive_max_length,
            # config.num_beams was defined but never used; early_stopping only
            # affects beam search, so it needs num_beams > 1 to have any effect.
            num_beams=self.config.num_beams,
            num_return_sequences=1,
            use_cache=True,
            early_stopping=True, 
            no_repeat_ngram_size = self.config.no_repeat_ngram_size,
            length_penalty = self.config.length_penalty,
            repetition_penalty = self.config.repetition_penalty,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
            bos_token_id=self.tokenizer.bos_token_id,
        )
        pred = self.tokenizer.batch_decode(
            generated_out,
            skip_special_tokens=True
        )
        return pred


    def configure_optimizers(self):
        #https://colab.research.google.com/github/PytorchLightning/lightning-tutorials/blob/publication/.notebooks/lightning_examples/text-transformers.ipynb#scrollTo=647c7ec9
        model = self.model
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.config.weight_decay,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        self.optimizer = AdamW(optimizer_grouped_parameters, lr=self.config.learning_rate, eps=self.config.adam_epsilon)

        self.scheduler = get_cosine_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=self.config.warmup_steps,
            num_training_steps=self.config.total_steps,
        )
        scheduler = {"scheduler": self.scheduler, "interval": "step", "frequency": 1}
        return [self.optimizer], [scheduler]


model = AbstractiveBART(config)

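
`configure_optimizers` pairs AdamW with a cosine schedule that has linear warmup. The sketch below re-implements the learning-rate multiplier of such a schedule in plain Python to show its shape with the config values (`warmup_steps=500`, `total_steps=600000`, `learning_rate=3e-5`). It follows the commonly used half-cosine form and is an illustration, not a copy of the library function; `cosine_with_warmup` is a hypothetical name.

```python
import math


def cosine_with_warmup(
    step: int,
    num_warmup_steps: int,
    num_training_steps: int,
    num_cycles: float = 0.5,
) -> float:
    """Return the factor that multiplies the base learning rate at `step`."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half a cosine wave from 1 down to 0 when num_cycles is 0.5.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


base_lr = 3e-5
for step in (0, 250, 500, 300250, 600000):
    factor = cosine_with_warmup(step, num_warmup_steps=500, num_training_steps=600000)
    print(step, base_lr * factor)
```

Because the decay is spread over all 600000 steps, the learning rate stays close to its peak for the early part of training.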

#4. Train

In [None]:
trainer = Trainer(
    **{
        "logger": WandbLogger(
            name=config.model_name_or_path,
            project=config.project,
        ),
        "callbacks": [
            LearningRateMonitor(
                logging_interval="step",
                log_momentum=False,
            ),
            ModelCheckpoint(
                monitor="rougeL",
                save_top_k=config.num_save_ckpt,
                mode="max",  # ROUGE is a score where higher is better
            ),
        ],
        'max_steps' : config.total_steps,
        'gpus': config.gpus,
        'precision': config.precision, 
        'accumulate_grad_batches': config.accumulate_grad_batches,
        'val_check_interval': config.val_check_interval,
        'gradient_clip_val': config.gradient_clip_val,
    }
)

trainer.fit(
    model=model,
    train_dataloaders=dm.train_dataloader(),
    val_dataloaders=dm.val_dataloader(),
)

Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc



  | Name  | Type                         | Params
-------------------------------------------------------
0 | model | BartForConditionalGeneration | 123 M 
-------------------------------------------------------
123 M     Trainable params
0         Non-trainable params
123 M     Total params
495.440   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: -1it [00:00, ?it/s]

  torch.nn.utils.clip_grad_norm_(parameters, clip_val)


Validating: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
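
The checkpoint callback monitors `rougeL`, an F-measure for which higher values are better. As a rough illustration of what such a score measures, here is a simplified unigram-overlap (ROUGE-1 style) F-measure on whitespace tokens. It is a sketch only, not the `rouge_score` implementation, which applies its own tokenization and may treat Korean text differently; `rouge1_f` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure between two whitespace-tokenized strings."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f("gas prices rose sharply", "gas prices rose last week"))
```

A checkpoint selection rule built on this kind of score should keep the runs with the largest values.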


#5. Test

In [None]:
# model = AbstractiveBART.load_from_checkpoint('AbstractiveBART/checkpointepoch=2.ckpt', hparams=args).eval()
trainer = Trainer(gpus=config.gpus)
trainer.test(
    model=model,
    dataloaders=dm.test_dataloader(),  # `test_dataloaders` is deprecated since v1.4
)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  "`trainer.test(test_dataloaders)` is deprecated in v1.4 and will be removed in v1.6."
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{}
--------------------------------------------------------------------------------


[{}]
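
The empty `{}` result is expected here, since `test_step` logs no metrics and the useful output is the submission file written by `test_epoch_end`. That hook receives one list of generated strings per batch and flattens them before filling the `summary` column. The sketch below mimics the flattening with the standard `csv` module instead of pandas; the batch contents and the `id` column are made up for illustration.

```python
import csv
import io
from itertools import chain

# One list of generated summaries per test batch (made-up values).
outputs = [["summary A", "summary B"], ["summary C"]]
# Hypothetical row ids standing in for the sample submission file.
ids = ["id_0", "id_1", "id_2"]

# Flatten the per-batch lists into one list, preserving order.
summaries = list(chain.from_iterable(outputs))
if len(summaries) != len(ids):
    raise ValueError("number of summaries does not match number of rows")

# Write the submission table to an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "summary"])
writer.writerows(zip(ids, summaries))
print(buffer.getvalue())
```

The order of the rows only matches the sample submission if the test dataloader is not shuffled, which is the case in this notebook.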