Customized Pre-Training

This repository provides code for building customized pre-trained models tailored to specific tasks. The pre-training objectives consist of three components: Masked Language Modeling, Causal Language Modeling, and Masked Causal Language Modeling. To comprehensively evaluate the natural language generation capabilities of each approach, three tasks have been chosen: machine translation, dialogue generation, and document summarization. The goal is to examine how performance varies across these diverse tasks when the pre-trained models are used.



Pre-training Objectives

Masked Language Modeling

Following logic similar to that used in BERT, we implement the Masked Language Modeling approach. Masking is applied to 20% of the tokens in the corpus, excluding special tokens. The objective is to strengthen bidirectional understanding of the given corpus. The model for Masked Language Modeling consists of a Transformer Encoder with an added Pooler, and the pre-trained Encoder is reused in the subsequent training process.
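As a concrete illustration, the snippet below sketches in PyTorch how the 20% masking described above could be applied to a batch of token ids. The function name, the mask_token_id argument, and the list of special token ids are illustrative assumptions, not identifiers taken from this repository.

import torch

def apply_mlm_mask(input_ids, mask_token_id, special_token_ids, mask_ratio=0.2):
    # Hypothetical helper: candidate positions are all tokens that are not special tokens
    candidates = ~torch.isin(input_ids, torch.tensor(special_token_ids))
    # Randomly pick roughly 20% of the candidate positions
    selected = (torch.rand(input_ids.shape) < mask_ratio) & candidates
    # Labels keep the original ids at masked positions; -100 is ignored by cross entropy
    labels = torch.where(selected, input_ids, torch.full_like(input_ids, -100))
    # Replace the selected input positions with the [MASK] token id
    masked_inputs = torch.where(selected, torch.full_like(input_ids, mask_token_id), input_ids)
    return masked_inputs, labels

# e.g. masked_inputs, labels = apply_mlm_mask(batch, mask_token_id=4, special_token_ids=[0, 1, 2, 3])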


Causal Language Modeling

This Causal Language Modeling approach uses both an encoder and a decoder. The model's inputs and labels are identical, and the model learns the distribution of the next word given the preceding words. Both the encoder and decoder are trained during pre-training, and both are carried over as-is into the subsequent training process.
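A minimal sketch of the training loss under this objective is shown below, assuming a model that returns vocabulary logits of shape (batch, seq_len, vocab) for the input sequence; the shift by one position expresses "predict the next word given the previous words". The names and signature are assumptions, not this repository's API.

import torch.nn.functional as F

def clm_loss(model, input_ids, pad_id=0):
    # Inputs and labels are the same sequence: the logits at position t
    # are trained to predict the token at position t + 1.
    logits = model(input_ids)                  # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=pad_id,
    )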


Masked Causal Language Modeling

This approach combines the previously discussed Masked Language Modeling and Causal Language Modeling. It is essentially Causal Language Modeling, but with some of the input tokens masked. Both the encoder and decoder are trained during pre-training, and both are carried over as-is into the subsequent training process.
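Combining the two sketches above, Masked Causal Language Modeling can be expressed as the same next-token loss, but computed on partially masked inputs while the labels remain the original, unmasked sequence. This sketch reuses the hypothetical apply_mlm_mask helper from the Masked Language Modeling section.

import torch.nn.functional as F

def masked_clm_loss(model, input_ids, mask_token_id, special_token_ids, pad_id=0):
    # Mask part of the input tokens (as in the MLM sketch), but keep the
    # labels as the original sequence for next-token prediction.
    masked_inputs, _ = apply_mlm_mask(input_ids, mask_token_id, special_token_ids)
    logits = model(masked_inputs)              # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=pad_id,
    )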



How to Use

├── ckpt                    --this dir saves model checkpoints and training logs
├── pt_ckpt                 --this dir saves pre-trained model checkpoints and training logs
├── config.yaml             --this file sets up arguments for the model, training, and tokenizer
├── data                    --this dir stores the Training, Validation, and Test datasets
├── model                   --this dir contains files for the Deep Learning model
│   ├── __init__.py
│   └── transformer.py
├── module                  --this dir contains a series of modules
│   ├── data.py
│   ├── generate.py
│   ├── __init__.py
│   ├── model.py
│   ├── test.py
│   └── train.py
├── README.md
├── run.py                  --this file contains the code for the actual tasks: training, testing, and inference
└── setup.py                --this file contains code for preprocessing data, training a tokenizer, and saving the datasets

First, clone the git repo to your local environment

git clone https://github.com/moon23k/CPT_Training

Download and process the datasets via setup.py

python3 setup.py -task [all, translation, dialogue, summarization]

Execute the run file according to your purpose (the search argument is optional); a concrete example follows the argument list below.

python3 run.py -task [translation, dialogue, summarization] \
               -mode [pretrain, train, test, inference] \
               -lm_type [mask, casual, masked_casual] \
               -search [greedy, beam]
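For example, assuming the arguments behave as listed above (choose one value per bracket), a full workflow for the translation task with the masked_casual objective might look like the following:

python3 run.py -task translation -mode pretrain -lm_type masked_casual
python3 run.py -task translation -mode train -lm_type masked_casual
python3 run.py -task translation -mode test -lm_type masked_casual
python3 run.py -task translation -mode inference -lm_type masked_casual -search beam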



Reference