# DeBERTa-v3-SMALL Regression Starter
This notebook is a fork of Yuto_H's great notebook [here][1]. If you like my notebook, remember to upvote Yuto's notebook too. In this notebook we add the following modifications, which are explained in my discussion post [here][3]:
* Change model to `DeBERTa-v3-small` for fast experiments (Note that `xsmall` works well too)
* Increase token `max_length to 1024` (instead of 512 to include all essay text)
* Use total `train batch size = 8`, valid batch size 16 (Note `batch per gpu = 4` and we have 2xT4 GPU)
* Train `4 epochs linear` with start `LR = 1e-5` and `no warmup`
* Remove seed everything (I like randomness)
* Add `QWK metric for regression`
* Add `new tokens` to tokenizer because DeBERTa removes "new paragraph" and "double space" from essay
* `Remove dropout` for regression
* Save `full OOF` predictions
* Add `test inference` and `LB submit`
* Achieves surprising `CV = 0.822` WOW! and LB = ??? (submitting now, let's see what LB is...)

Training averages 1 hour per fold, which is 15 minutes per epoch on Kaggle's 2xT4 GPUs. (Training is done in version 1; inference and LB submission are done in version 2.)
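
The `4 epochs linear` schedule with `no warmup` from the list above can be pictured with a small sketch. The helper below is hypothetical (it is not part of this notebook or of `transformers`) and assumes the usual linear decay from the starting LR down to zero; the exact per-step values used by the Trainer may differ slightly.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 1e-5,
              warmup_steps: int = 0) -> float:
    """Hypothetical helper: linear warmup (optional) followed by linear decay to 0."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup (skipped when warmup_steps == 0)
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly so that the LR reaches 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# With no warmup, training starts at the full LR and decays from the first step
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

With `warmup_steps = 0` the very first update already uses the full `1e-5`, which is what "no warmup" means here.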

# Version Notes
In version 1, we finetune a new DeBERTa-v3-SMALL and save it to the Kaggle dataset [here][4]. Therefore if you want to see training epoch details, view notebook version 1. This took 6 hours using Kaggle's 2xT4 GPU.

In notebook version 2, we load the saved fold models and infer test data and submit to LB. Version 2 runs quickly because it is only inference. It will run in either 6 minutes or 1 minute depending on whether we infer OOF and compute CV score again.

If we want version 2 inference to run more quickly, we can set `COMPUTE_CV = False` below, then we will not use 5 minutes to predict OOF and compute CV score. Instead we will only infer test data.

[1]: https://www.kaggle.com/code/hashidoyuto/deberta-baseline-aes2-0-train
[2]: https://www.kaggle.com/code/hashidoyuto/deberta-v3-base-aes2-0-infer
[3]: https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/497832
[4]: https://www.kaggle.com/datasets/cdeotte/deberta-v3-small-finetuned-v1

## Kaggle Data Download 

In [None]:
# !kaggle datasets download -d cdeotte/deberta-v3-small-finetuned-v1

In [None]:
# !unzip deberta-v3-small-finetuned-v1.zip -d ./kaggle/input/deberta-v3-small-finetuned-v1/

In [None]:
# !kaggle datasets download -d verracodeguacas/huggingfacedebertav3variants

In [None]:
# !unzip huggingfacedebertav3variants.zip -d ./kaggle/input/huggingfacedebertav3variants/

In [None]:
# !kaggle datasets download nbroad/persaude-corpus-2

In [None]:
# !unzip persaude-corpus-2.zip -d ./kaggle/input/persaude-corpus-2/

## Imports and Config
Import libraries and define configuration parameters here.

In [33]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="1" # TODO: Kaggle provides 2 T4 GPUs, so set this to "0,1" there

# True USES REGRESSION, False USES CLASSIFICATION
USE_REGRESSION = True

# VERSION NUMBER FOR NAMING OF SAVED MODELS
VER=1

# IF "LOAD_FROM" IS None, THEN WE TRAIN NEW MODELS
# LOAD_FROM = "./kaggle/input/deberta-v3-small-finetuned-v1/"
# LOAD_FROM = None
LOAD_FROM = "./kaggle/input/deberta-small-two-stage-v1/" # TODO: when submit to kaggle, activate this line

# WHEN TRAINING NEW MODELS SET COMPUTE_CV = True
# WHEN LOADING MODELS, WE CAN CHOOSE True or False
COMPUTE_CV = False #TODO: When submit to kaggle, modify it to False

# WHEN TWO-STAGE TRAINING, SET PSEUDO_LABEL = True AND RETRAIN = True
PSEUDO_LABEL= False # TODO: when submit to kaggle, modify it to False
RETRAIN = False  # TODO: when submit to kaggle, modify it to False
INFERENCE = True #TODO: when submit to kaggle, modify it to True

In [34]:
import warnings
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import optuna
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import Dataset
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import f1_score, accuracy_score
from tokenizers import AddedToken
warnings.simplefilter('ignore')

In [35]:
class PATHS: # TODO: modify path (.)
    train_path = './kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv'
    test_path = './kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv'
    sub_path = './kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv'
    model_path = "./kaggle/input/huggingfacedebertav3variants/deberta-v3-small"
    # model_path = "./kaggle/input/huggingfacedebertav3variants/deberta-v3-large"
    persuade_path = './kaggle/input/persaude-corpus-2/persuade_2.0_human_scores_demo_id_github.csv'
    finetune_path = './kaggle/input/deberta-small-two-stage-v1'

In [36]:
class CFG:
    n_splits = 5 # 5 or 7 are the best
    seed = 42 
    max_length = 1024
    lr_phase1 = 1e-5 # 1e-5 is the best
    lr_phase2 = 1e-5
    train_batch_size = 8 # total 8: 4 * 2
    eval_batch_size = 16 # total 16: 8 * 2
    train_phase1_epochs = 4 # 4 is the best
    train_phase2_epochs = 2
    weight_decay = 0.01
    warmup_ratio = 0.0
    num_labels = 6

## Data Tokenization
We use `max_length = 1024` to avoid truncating the majority of essays.

In [37]:
class Tokenize(object):
    def __init__(self, train, valid, tokenizer):
        self.tokenizer = tokenizer
        self.train = train
        self.valid = valid
        
    def get_dataset(self, df):
        ds = Dataset.from_dict({
                'essay_id': [e for e in df['essay_id']],
                'full_text': [ft for ft in df['full_text']],
                'label': [s for s in df['label']],
            })
        return ds
        
    def tokenize_function(self, example):
        tokenized_inputs = self.tokenizer(
            example['full_text'], truncation=True, max_length=CFG.max_length
        )
        return tokenized_inputs
    
    def __call__(self):
        train_ds = self.get_dataset(self.train)
        valid_ds = self.get_dataset(self.valid)
        
        tokenized_train = train_ds.map(
            self.tokenize_function, batched=True
        )
        tokenized_valid = valid_ds.map(
            self.tokenize_function, batched=True
        )
        
        return tokenized_train, tokenized_valid, self.tokenizer

## Compute Metrics
Below we provide metric functions for both regression and classification. In this notebook we use regression.
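
For intuition about what `cohen_kappa_score(..., weights='quadratic')` measures, here is a minimal pure-Python sketch of quadratic weighted kappa. The function names and the fixed 0 to 5 rating range are assumptions made for illustration; the notebook itself relies on scikit-learn.

```python
def to_label(pred: float, low: int = 0, high: int = 5) -> int:
    """Clip a regression output to the label range, then round to an integer."""
    return int(round(min(max(pred, low), high)))


def quadratic_weighted_kappa(y_true, y_pred, min_rating=0, max_rating=5):
    """Illustrative QWK: 1 - (weighted observed disagreement / weighted expected)."""
    k = max_rating - min_rating + 1
    n = len(y_true)
    # observed[i][j] counts essays with true rating i and predicted rating j
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n  # expected under independence
    if den == 0:
        return 1.0  # both sides use one identical rating: perfect agreement
    return 1.0 - num / den


labels = [2, 3, 1, 4]
preds = [to_label(x) for x in (2.2, 3.4, 0.6, 4.9)]
print(quadratic_weighted_kappa(labels, preds))
```

Because the penalty grows with the squared distance, predicting 5 for a true 2 costs far more than predicting 3.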

In [38]:
def compute_metrics_for_regression(eval_pred):
    
    predictions, labels = eval_pred
    qwk = cohen_kappa_score(labels, predictions.clip(0,5).round(0), weights='quadratic')
    results = {
        'qwk': qwk
    }
    return results

In [39]:
def compute_metrics_for_classification(eval_pred):
    
    predictions, labels = eval_pred
    qwk = cohen_kappa_score(labels, predictions.argmax(-1), weights='quadratic')
    results = {
        'qwk': qwk
    }
    return results

## Load Data and Set Fold
For our label, we will use `label = score - 1`. Then the labels will range from 0 to 5. For regression, we convert the label to `float32`. For classification, we would convert to `int32`.

In [40]:
data = pd.read_csv(PATHS.train_path)
data['label'] = data['score'].apply(lambda x: x-1)
if USE_REGRESSION: data["label"] = data["label"].astype('float32') 
else: data["label"] = data["label"].astype('int32')
print("Origin data length:", len(data))

persuade = pd.read_csv(PATHS.persuade_path)
intersection = pd.merge(data, persuade, on="full_text", how="inner")[["essay_id", "full_text", "score", "label", "prompt_name"]].reset_index(drop=True)
difference = data[~data["essay_id"].isin(intersection["essay_id"])].reset_index(drop=True)
print("Persuade data length(intersection):", len(intersection))
print("Non-persuade data length(difference):", len(difference))

Origin data length: 17307
Persuade data length(intersection): 12871
Non-persuade data length(difference): 4436


Use the persuade data as the phase 1 training dataset, and the non-persuade data as the phase 2 training dataset.
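
The phase 1 folds are stratified on a combined score-and-prompt key so that every fold sees a similar mix of scores and prompts. The sketch below shows the idea with a hypothetical pure-Python helper. It is a simplified stand-in for `StratifiedKFold`, not its actual algorithm: indices are shuffled within each key and dealt round-robin to the folds.

```python
import random
from collections import Counter, defaultdict


def assign_stratified_folds(keys, n_splits=5, seed=42):
    """Hypothetical helper: spread the rows of each stratum evenly over folds."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for idx, key in enumerate(keys):
        by_key[key].append(idx)
    folds = [None] * len(keys)
    for key in sorted(by_key):
        indices = by_key[key]
        rng.shuffle(indices)  # randomize which rows land in which fold
        for pos, idx in enumerate(indices):
            folds[idx] = pos % n_splits  # deal round-robin
    return folds


# Toy keys in the same "score-prompt" form as the score_and_prompt column
keys = ["3-Car-free cities"] * 10 + ["4-Exploring Venus"] * 5
folds = assign_stratified_folds(keys)
print(Counter(zip(keys, folds)))
```

Each fold ends up with two "3-Car-free cities" rows and one "4-Exploring Venus" row, so validation folds mirror the overall distribution.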

In [41]:
intersection["score_and_prompt"] = intersection["score"].astype(str) + "-" + intersection["prompt_name"]

intersection.head()

Unnamed: 0,essay_id,full_text,score,label,prompt_name,score_and_prompt
0,000d118,Many people have car where they live. The thin...,3,2.0,Car-free cities,3-Car-free cities
1,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3,2.0,Does the electoral college work?,3-Does the electoral college work?
2,0030e86,If I were to choose between keeping the electo...,4,3.0,Does the electoral college work?,4-Does the electoral college work?
3,0033bf4,What is the Seagoing Cowboys progam?\n\nIt was...,3,2.0,"""A Cowboy Who Rode the Waves""","3-""A Cowboy Who Rode the Waves"""
4,0036253,The challenge of exploring Venus\n\nThis stori...,2,1.0,Exploring Venus,2-Exploring Venus


In [42]:
skf_42 = StratifiedKFold(n_splits=CFG.n_splits, shuffle=True, random_state=CFG.seed)

# phase 1
if PSEUDO_LABEL:
    persuade_data = pd.read_csv('./kaggle/input/pseudo_labeling.csv')
else:
    persuade_data = intersection.copy(deep=True)
    for i, (_, val_index) in enumerate(skf_42.split(persuade_data, persuade_data["score_and_prompt"])):
        persuade_data.loc[val_index, "fold"] = i
        
persuade_data.head()

Unnamed: 0,essay_id,full_text,score,label,prompt_name,score_and_prompt,fold
0,000d118,Many people have car where they live. The thin...,3,2.0,Car-free cities,3-Car-free cities,1.0
1,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3,2.0,Does the electoral college work?,3-Does the electoral college work?,0.0
2,0030e86,If I were to choose between keeping the electo...,4,3.0,Does the electoral college work?,4-Does the electoral college work?,3.0
3,0033bf4,What is the Seagoing Cowboys progam?\n\nIt was...,3,2.0,"""A Cowboy Who Rode the Waves""","3-""A Cowboy Who Rode the Waves""",3.0
4,0036253,The challenge of exploring Venus\n\nThis stori...,2,1.0,Exploring Venus,2-Exploring Venus,3.0


In [43]:
# phase 2
skf_6 = StratifiedKFold(n_splits=CFG.n_splits, shuffle=True, random_state=CFG.seed)

non_persuade_data = difference.copy(deep=True)
for i, (_, val_index) in enumerate(skf_6.split(non_persuade_data, non_persuade_data["score"])):
    non_persuade_data.loc[val_index, "fold"] = i
non_persuade_data.head()

Unnamed: 0,essay_id,full_text,score,label,fold
0,000fe60,I am a scientist at NASA that is discussing th...,3,2.0,1.0
1,001ab80,People always wish they had the same technolog...,4,3.0,3.0
2,001bdc0,"We all heard about Venus, the planet without a...",4,3.0,2.0
3,0033037,The posibilty of a face reconizing computer wo...,2,1.0,1.0
4,0065bd6,Driverless cars should not exsist it can cause...,3,2.0,2.0


## Set Training Args
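We use `fp16=True`, which enables mixed precision, reduces GPU VRAM usage, and speeds up training. We want a total train batch size of 8, so `per_device_train_batch_size` should equal `8 / number of GPUs`. Note that the cells below pass `CFG.train_batch_size = 8` directly as the per-device value, which gives a total of 8 on the single GPU selected by `CUDA_VISIBLE_DEVICES` above. On Kaggle's 2xT4 setup, use `per_device_train_batch_size = 4` to keep the total at 8.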
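
The batch-size arithmetic above can be checked with a tiny sketch. Both helpers are hypothetical names introduced for illustration, and the gradient accumulation factor is an assumption that defaults to 1 (this notebook does not set it).

```python
def effective_batch_size(per_device: int, n_devices: int,
                         grad_accum_steps: int = 1) -> int:
    """Total number of samples contributing to one optimizer step."""
    return per_device * n_devices * grad_accum_steps


def per_device_for_total(total: int, n_devices: int) -> int:
    """Per-device batch size needed to reach a desired total batch size."""
    if total % n_devices:
        raise ValueError("total batch size must be divisible by the device count")
    return total // n_devices


print(effective_batch_size(per_device=4, n_devices=2))  # 2xT4 setup
print(effective_batch_size(per_device=8, n_devices=1))  # single visible GPU
print(per_device_for_total(total=8, n_devices=2))
```

Keeping the total fixed matters because the learning rate was tuned for a total batch size of 8.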

In [44]:
training_args_phase1 = TrainingArguments(
    output_dir=f'output_v{VER}_phase1',
    fp16=True,
    learning_rate=CFG.lr_phase1,
    per_device_train_batch_size=CFG.train_batch_size,
    per_device_eval_batch_size=CFG.eval_batch_size,
    num_train_epochs=CFG.train_phase1_epochs,
    weight_decay=CFG.weight_decay,
    evaluation_strategy='epoch',
    metric_for_best_model='qwk',
    save_strategy='epoch',
    save_total_limit=1,
    load_best_model_at_end=True,
    report_to='none',
    warmup_ratio=CFG.warmup_ratio,
    lr_scheduler_type='linear', # "cosine" or "linear" or "constant"
    optim='adamw_torch',
    logging_first_step=True,
)

In [45]:
training_args_phase2 = TrainingArguments(
    output_dir=f'output_v{VER}_phase2',
    fp16=True,
    learning_rate=CFG.lr_phase2,
    per_device_train_batch_size=CFG.train_batch_size,
    per_device_eval_batch_size=CFG.eval_batch_size,
    num_train_epochs=CFG.train_phase2_epochs,
    weight_decay=CFG.weight_decay,
    evaluation_strategy='epoch',
    metric_for_best_model='qwk',
    save_strategy='epoch',
    save_total_limit=1,
    load_best_model_at_end=True,
    report_to='none',
    warmup_ratio=CFG.warmup_ratio,
    lr_scheduler_type='linear', # "cosine" or "linear" or "constant"
    optim='adamw_torch',
    logging_first_step=True,
)

## K Fold Training
We add new tokens for ("\n") new paragraph and (" "*2) double space because the default DeBERTa tokenizer removes them, yet they are helpful for scoring essays. We remove dropout from our model because it does not work well with regression. See the discussion [here][1].

[1]: https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/497832
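
To see what layout information is at stake, here is a toy illustration. `collapse_whitespace` is a hypothetical stand-in that collapses every whitespace run to one space; it is not DeBERTa's actual normalization code, only a way to show how paragraph breaks and double spaces can vanish before the model sees the text.

```python
def collapse_whitespace(text: str) -> str:
    """Stand-in normalizer: collapse every whitespace run to a single space."""
    return " ".join(text.split())


def count_layout_features(text: str) -> dict:
    """Count the two layout signals that the added tokens are meant to preserve."""
    return {"newlines": text.count("\n"), "double_spaces": text.count("  ")}


essay = ("Dear Senator,\n\nI support the change.  It helps voters."
         "\n\nSincerely, a student")
before = count_layout_features(essay)
after = count_layout_features(collapse_whitespace(essay))
print("before:", before)
print("after: ", after)
```

Registering `"\n"` and the double space as added tokens with `normalized=False`, as the training cell does, is how this notebook keeps those signals in the token sequence.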

### Phase 1

In [46]:
# phase 1
if COMPUTE_CV:
    for fold in range(len(persuade_data['fold'].unique())):

        # GET TRAIN AND VALID DATA
        train = persuade_data[persuade_data['fold'] != fold]
        valid = persuade_data[persuade_data['fold'] == fold].copy()

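        # PSEUDO-LABEL
        if PSEUDO_LABEL:
            print("Using Pseudo-label")
            # pseudo_label is on the 1-6 score scale; shift it to the 0-5 label scale
            train = train.copy()
            train['label'] = (train['pseudo_label'] - 1).astype('float32')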

        # ADD NEW TOKENS for ("\n") new paragraph and (" "*2) double space 
        if RETRAIN:
            tokenizer_path = PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}'
            tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)    
        else:
            tokenizer = AutoTokenizer.from_pretrained(PATHS.model_path)
            tokenizer.add_tokens([AddedToken("\n", normalized=False)])
            tokenizer.add_tokens([AddedToken(" "*2, normalized=False)])

        tokenize = Tokenize(train, valid, tokenizer)
        tokenized_train, tokenized_valid, _ = tokenize()

        if RETRAIN:
            config_path = PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}'
        else:
            # REMOVE DROPOUT FROM REGRESSION
            config = AutoConfig.from_pretrained(PATHS.model_path)
            if USE_REGRESSION:
                config.attention_probs_dropout_prob = 0.0 
                config.hidden_dropout_prob = 0.0 
                config.num_labels = 1 
            else: config.num_labels = CFG.num_labels 

        if LOAD_FROM and INFERENCE:
            model = AutoModelForSequenceClassification.from_pretrained(LOAD_FROM + f'deberta-v3-small_AES2_fold_{fold}_v{VER}')
        elif RETRAIN:
            model = AutoModelForSequenceClassification.from_pretrained(config_path)
        else:
            model = AutoModelForSequenceClassification.from_pretrained(PATHS.model_path, config=config)
            model.resize_token_embeddings(len(tokenizer))

        # TRAIN WITH TRAINER
        data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
        if USE_REGRESSION: compute_metrics = compute_metrics_for_regression
        else: compute_metrics = compute_metrics_for_classification
        trainer_phase1 = Trainer( 
            model=model,
            args=training_args_phase1,
            train_dataset=tokenized_train,
            eval_dataset=tokenized_valid,
            data_collator=data_collator,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics
        )
        if LOAD_FROM is None or RETRAIN is True:
            print(f"Training fold {fold}:")
            trainer_phase1.train()
        
        # PLOT CONFUSION MATRIX
        y_true = valid['score'].values
        predictions0 = trainer_phase1.predict(tokenized_valid).predictions
        if USE_REGRESSION: predictions = predictions0.round(0) + 1
        else: predictions = predictions0.argmax(axis=1) + 1 
        cm = confusion_matrix(y_true, predictions, labels=[x for x in range(1,7)])
        draw_cm = ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=[x for x in range(1,7)])
        draw_cm.plot()
        plt.show()

        # SAVE FOLD MODEL AND TOKENIZER
        if LOAD_FROM is None:
            trainer_phase1.save_model(PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}_phase1')
            tokenizer.save_pretrained(PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}_phase1')

        # SAVE OOF PREDICTIONS
        if USE_REGRESSION: 
            valid['pred'] = predictions0 + 1 
        else:
            COLS = [f'p{x}' for x in range(CFG.num_labels)] 
            valid[COLS] = predictions0 
        valid.to_csv(f'phase1_valid_df_fold_{fold}_v{VER}.csv', index=False)

In [47]:
# phase 2
if COMPUTE_CV:
    for fold in range(len(non_persuade_data['fold'].unique())):

        # GET TRAIN AND VALID DATA
        train = non_persuade_data[non_persuade_data['fold'] != fold]
        valid = non_persuade_data[non_persuade_data['fold'] == fold].copy()

        # ADD NEW TOKENS for ("\n") new paragraph and (" "*2) double space 
        tokenizer_path = PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}_phase1'
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        tokenize = Tokenize(train, valid, tokenizer)
        tokenized_train, tokenized_valid, _ = tokenize()

        # LOAD CONFIG AND MODEL FROM PHASE 1
        config_path = PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}_phase1'
        model = AutoModelForSequenceClassification.from_pretrained(config_path)

        # TRAIN WITH TRAINER
        data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
        if USE_REGRESSION: compute_metrics = compute_metrics_for_regression
        else: compute_metrics = compute_metrics_for_classification
        trainer_phase2 = Trainer( 
            model=model,
            args=training_args_phase2,
            train_dataset=tokenized_train,
            eval_dataset=tokenized_valid,
            data_collator=data_collator,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics
        )
        if LOAD_FROM is None:
            trainer_phase2.train()
        
        # PLOT CONFUSION MATRIX
        y_true = valid['score'].values
        predictions0 = trainer_phase2.predict(tokenized_valid).predictions
        if USE_REGRESSION: predictions = predictions0.round(0) + 1
        else: predictions = predictions0.argmax(axis=1) + 1 
        cm = confusion_matrix(y_true, predictions, labels=[x for x in range(1,7)])
        draw_cm = ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=[x for x in range(1,7)])
        draw_cm.plot()
        plt.show()

        # SAVE FOLD MODEL AND TOKENIZER
        if LOAD_FROM is None:
            trainer_phase2.save_model(PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}')
            tokenizer.save_pretrained(PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}')

        # SAVE OOF PREDICTIONS
        if USE_REGRESSION: 
            valid['pred'] = predictions0 + 1 
        else:
            COLS = [f'p{x}' for x in range(CFG.num_labels)] 
            valid[COLS] = predictions0 
        valid.to_csv(f'phase2_valid_df_fold_{fold}_v{VER}.csv', index=False)

## Generating Pseudo-Label
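
The pseudo-label built in the next cell is a weighted average of the human score and the averaged fold predictions, with weights 8 and 2. A minimal worked sketch of that blend follows; `blend_pseudo_label` is a hypothetical helper name, and the default weights are copied from the cell below.

```python
def blend_pseudo_label(score: float, avg_pred: float,
                       weight_label: float = 8, weight_pred: float = 2) -> float:
    """Weighted average of the human score and the model's averaged prediction."""
    return (weight_label * score + weight_pred * avg_pred) / (weight_label + weight_pred)


# Human score 3, models on average say 4: the label moves 20% of the way to 4
print(blend_pseudo_label(3, 4))
# When the models agree with the human score, the label is unchanged
print(blend_pseudo_label(3, 3))
```

Both inputs are on the 1 to 6 score scale, so the blend is too; subtract 1 to compare it against the 0 to 5 `label` column.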

In [48]:
# Prepare pseudo-label creation function
def create_pseudo_labels(dataset):
    predictions_list = []
    
    for fold in range(len(dataset['fold'].unique())):
        
        # Tokenize data
        tokenizer_path = PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}'
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        tokenize = Tokenize(dataset, dataset, tokenizer)
        tokenized_dataset, _, _ = tokenize()
        
        # Load fold model
        config_path = PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}'
        model = AutoModelForSequenceClassification.from_pretrained(config_path)

        # make prediction
        trainer = Trainer(
            model=model,
            args=TrainingArguments(
                per_device_eval_batch_size=CFG.eval_batch_size,
                output_dir='./output/pseudo_labeling',
                report_to='none'
            ),
            data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
            tokenizer=tokenizer,
        )
        predictions = trainer.predict(tokenized_dataset).predictions
        
        if USE_REGRESSION:
            predicted_labels = predictions.round(0) + 1
        else:
            predicted_labels = predictions.argmax(axis=1) + 1
            
        predictions_list.append(predicted_labels)
        print(f"Fold {fold}: Predictions shape = {predicted_labels.shape}")
    
    avg_predictions = sum(predictions_list) / len(predictions_list)

    # Create pseudo-label using weighted average with original score
    weight_label = 8
    weight_pred = 2
    dataset['pseudo_label'] = (weight_label * dataset['score'] + weight_pred * avg_predictions) / (weight_label + weight_pred)
    
    # save pseudo-labels dataset
    dataset.to_csv('./kaggle/input/pseudo_labeling.csv', index=False)

    return dataset

In [49]:
if not INFERENCE:
    persuade_data = create_pseudo_labels(persuade_data)
    PSEUDO_LABEL = True
    RETRAIN = True

In [50]:
persuade_data.head()

Unnamed: 0,essay_id,full_text,score,label,prompt_name,score_and_prompt,fold
0,000d118,Many people have car where they live. The thin...,3,2.0,Car-free cities,3-Car-free cities,1.0
1,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3,2.0,Does the electoral college work?,3-Does the electoral college work?,0.0
2,0030e86,If I were to choose between keeping the electo...,4,3.0,Does the electoral college work?,4-Does the electoral college work?,3.0
3,0033bf4,What is the Seagoing Cowboys progam?\n\nIt was...,3,2.0,"""A Cowboy Who Rode the Waves""","3-""A Cowboy Who Rode the Waves""",3.0
4,0036253,The challenge of exploring Venus\n\nThis stori...,2,1.0,Exploring Venus,2-Exploring Venus,3.0


## Overall CV Score

In [51]:
if COMPUTE_CV:
    # Initialize a list to store all folds' results
    all_folds = []

    # Read Phase 1 predictions
    for fold in range(CFG.n_splits):
        phase1_path = f'phase1_valid_df_fold_{fold}_v{VER}.csv'
        phase1_df = pd.read_csv(phase1_path)
        all_folds.append(phase1_df)

    # Read Phase 2 predictions
    for fold in range(CFG.n_splits):
        phase2_path = f'phase2_valid_df_fold_{fold}_v{VER}.csv'
        phase2_df = pd.read_csv(phase2_path)
        all_folds.append(phase2_df)

    # Concatenate all folds
    overall_df = pd.concat(all_folds, axis=0).reset_index(drop=True)

    # Save the overall predictions to a CSV file (optional)
    overall_df.to_csv(f'overall_valid_df_v{VER}.csv', index=False)

In [52]:
overall_df.head()

Unnamed: 0,essay_id,full_text,score,label,prompt_name,score_and_prompt,fold,pred
0,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3,2.0,Does the electoral college work?,3-Does the electoral college work?,0.0,2.569336
1,005a72e,I agree that driverless cars are a developing ...,4,3.0,Driverless cars,4-Driverless cars,0.0,4.365234
2,006d0e1,"Have you ever seen Europe? What about China, o...",4,3.0,"""A Cowboy Who Rode the Waves""","4-""A Cowboy Who Rode the Waves""",0.0,4.173828
3,00aa6de,This system could be very benificial in classr...,2,1.0,Facial action coding system,2-Facial action coding system,0.0,2.166016
4,00b2fe2,"I think the idea of ""driverless""'cars might no...",4,3.0,Driverless cars,4-Driverless cars,0.0,3.582031


In [53]:
if COMPUTE_CV:
    # Calculate Overall CV (QWK)
    if USE_REGRESSION:
        overall_score = cohen_kappa_score(
            overall_df['score'].values,
            overall_df['pred'].values.clip(1, 6).round(0),
            weights='quadratic'
        )
    else:
        overall_score = cohen_kappa_score(
            overall_df['score'].values,
            overall_df.iloc[:, -CFG.num_labels:].values.argmax(axis=1) + 1,
            weights='quadratic'
        )

    print(f'Overall QWK CV = {overall_score:.4f}')

## Optimize Threshold

In [54]:
class OptunaRounder:
    def __init__(self, y_true, y_pred):
        self.y_true = y_true
        self.y_pred = y_pred
        self.labels = np.unique(y_true)

    def __call__(self, trial):
        thresholds = []
        for i in range(len(self.labels) - 1):
            low = max(thresholds) if i > 0 else min(self.labels)
            high = max(self.labels)
            # suggest_float replaces the deprecated suggest_uniform
            t = trial.suggest_float(f't{i}', low, high)
            thresholds.append(t)
        try:
            opt_y_pred = self.adjust(self.y_pred, thresholds)
        except ValueError:
            # pd.cut rejects duplicate bin edges; score such trials as 0
            return 0
        return cohen_kappa_score(self.y_true, opt_y_pred, weights='quadratic')

    def adjust(self, y_pred, thresholds):
        opt_y_pred = pd.cut(y_pred,
                            [-np.inf] + thresholds + [np.inf],
                            labels=self.labels)
        return opt_y_pred

In [55]:
def predict_dataset(dataset):
    predictions_list = []
    
    for fold in range(len(dataset['fold'].unique())):
        
        # Tokenize data
        tokenizer_path = PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}'
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        tokenize = Tokenize(dataset, dataset, tokenizer)
        tokenized_dataset, _, _ = tokenize()
        
        # Load fold model
        config_path = PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}'
        model = AutoModelForSequenceClassification.from_pretrained(config_path)

        # make prediction
        trainer = Trainer(
            model=model,
            args=TrainingArguments(
                per_device_eval_batch_size=CFG.eval_batch_size,
                output_dir='./output/pseudo_labeling',
                report_to='none'
            ),
            data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
            tokenizer=tokenizer,
        )
        predictions = trainer.predict(tokenized_dataset).predictions
            
        predictions_list.append(predictions)
        print(f"Fold {fold}: Predictions shape = {predictions.shape}")
    
    avg_predictions = sum(predictions_list) / len(predictions_list)
    dataset['avg_pred'] = avg_predictions
    
    return dataset

In [56]:
for i, (_, val_index) in enumerate(skf_6.split(data, data["score"])):
    data.loc[val_index, "fold"] = i
        
data.head()

Unnamed: 0,essay_id,full_text,score,label,fold
0,000d118,Many people have car where they live. The thin...,3,2.0,3.0
1,000fe60,I am a scientist at NASA that is discussing th...,3,2.0,4.0
2,001ab80,People always wish they had the same technolog...,4,3.0,1.0
3,001bdc0,"We all heard about Venus, the planet without a...",4,3.0,0.0
4,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3,2.0,2.0


In [59]:
opt_dataset = data.copy()
if not INFERENCE:
    opt_dataset = predict_dataset(data)
    
opt_dataset.head()

Unnamed: 0,essay_id,full_text,score,label,fold
0,000d118,Many people have car where they live. The thin...,3,2.0,3.0
1,000fe60,I am a scientist at NASA that is discussing th...,3,2.0,4.0
2,001ab80,People always wish they had the same technolog...,4,3.0,1.0
3,001bdc0,"We all heard about Venus, the planet without a...",4,3.0,0.0
4,002ba53,"Dear, State Senator\n\nThis is a letter to arg...",3,2.0,2.0


In [60]:
if not INFERENCE:
    optuna.logging.set_verbosity(optuna.logging.WARNING) 
    objective = OptunaRounder(opt_dataset['label'].values, opt_dataset['avg_pred'].values)
    study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
    study.optimize(objective, timeout=200)
    best_thresholds = sorted(study.best_params.values())
    print(f'Optimized thresholds: {best_thresholds}')

    preds_opt = objective.adjust(opt_dataset['avg_pred'].values, best_thresholds)
    preds_opt = preds_opt.astype(int)

    qwk = cohen_kappa_score(opt_dataset['label'], preds_opt, weights='quadratic')
    f1 = f1_score(opt_dataset['label'], preds_opt, average='macro')
    acc = accuracy_score(opt_dataset['label'], preds_opt)
    print("QWK: %.5f"%qwk, "F1: %.5f"%f1, "Accuracy: %.5f"%acc)

## Infer Test Data
We infer test data using Hugging Face trainer and load our saved best fold models.
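
Each fold model scores every test essay, and the fold outputs are then averaged per essay. The sketch below shows that averaging in plain Python. `average_fold_predictions` is a hypothetical helper and the numbers are made up for illustration; the notebook itself uses `np.mean(all_pred, axis=0)`.

```python
def average_fold_predictions(fold_preds):
    """Average predictions across folds, essay by essay."""
    n_folds = len(fold_preds)
    # zip(*fold_preds) groups the predictions that belong to the same essay
    return [sum(per_essay) / n_folds for per_essay in zip(*fold_preds)]


# Two hypothetical folds, three essays each
fold_preds = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 4.0],
]
print(average_fold_predictions(fold_preds))
```

Averaging the raw regression outputs before rounding keeps more information than rounding each fold first.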

In [61]:
test = pd.read_csv(PATHS.test_path)
print('Test shape:', test.shape )
test.head()

Test shape: (3, 2)


Unnamed: 0,essay_id,full_text
0,000d118,Many people have car where they live. The thin...
1,000fe60,I am a scientist at NASA that is discussing th...
2,001ab80,People always wish they had the same technolog...


In [62]:
all_pred = []
test['label'] = 0.0

for fold in range(CFG.n_splits):
    
    # LOAD TOKENIZER
    if LOAD_FROM:
        tokenizer = AutoTokenizer.from_pretrained(LOAD_FROM + f'deberta-v3-small_AES2_fold_{fold}_v{VER}')
    else:
        # Models trained in this notebook are saved under PATHS.finetune_path
        tokenizer = AutoTokenizer.from_pretrained(PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}')
    tokenize = Tokenize(test, test, tokenizer)
    tokenized_test, _, _ = tokenize()

    # LOAD MODEL
    if LOAD_FROM:
        model = AutoModelForSequenceClassification.from_pretrained(LOAD_FROM + f'deberta-v3-small_AES2_fold_{fold}_v{VER}')
    else:
        # Models trained in this notebook are saved under PATHS.finetune_path
        model = AutoModelForSequenceClassification.from_pretrained(PATHS.finetune_path + f'/deberta-v3-small_AES2_fold_{fold}_v{VER}')
    
    # CREATE TRAINING ARGS FOR INFERENCE
    inference_args = TrainingArguments(
        per_device_eval_batch_size=CFG.eval_batch_size,  
        fp16=True,                      
        do_train=False,                 
        do_eval=False,                  
        logging_dir='./logs',          
        report_to="none",
        output_dir='output_v1'                
    )

    # INFER WITH TRAINER
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    trainer = Trainer( 
        model=model,
        args=inference_args,
        train_dataset=tokenized_test,
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    # SAVE PREDICTIONS
    predictions = trainer.predict(tokenized_test).predictions
    all_pred.append( predictions )

Map: 100%|██████████| 3/3 [00:00<00:00, 406.41 examples/s]
Map: 100%|██████████| 3/3 [00:00<00:00, 531.58 examples/s]


In [63]:
preds = np.mean(all_pred, axis=0)
print('Predictions shape:',preds.shape)
print(preds)

Predictions shape: (3,)
[1.800586  2.0132813 3.4617188]


## Apply Best Threshold
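
Thresholding turns a continuous prediction into a score by counting how many cut points it exceeds. The sketch below reproduces that logic with `bisect` from the standard library. `apply_thresholds` is a hypothetical helper, and the thresholds are rounded copies of the optimized values used in this notebook.

```python
from bisect import bisect_left


def apply_thresholds(predictions, thresholds):
    """Score = 1 + number of thresholds strictly below the prediction."""
    ordered = sorted(thresholds)
    # bisect_left returns how many thresholds are strictly less than p
    return [bisect_left(ordered, p) + 1 for p in predictions]


thresholds = [0.7815, 1.6011, 2.5871, 3.2059, 3.6492]  # rounded for readability
preds = [1.800586, 2.0132813, 3.4617188]  # averaged test predictions from above
print(apply_thresholds(preds, thresholds))  # [3, 3, 5]
```

The first prediction exceeds two thresholds, so it maps to score 3; the last exceeds four and maps to score 5, where plain rounding of `3.46 + 1` would give 4.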

In [64]:
def apply_opt_thresholds(predictions, thresholds):
    labels = np.zeros_like(predictions, dtype=int)
    for i, threshold in enumerate(thresholds):
        labels += (predictions > threshold).astype(int)
        
    return labels + 1

In [65]:
best_thresholds = [0.781520185283719, 1.6011450107222063, 2.5870851259767407, 3.205894677704853, 3.649166723300159]

opt_pred = apply_opt_thresholds(preds, best_thresholds)
print(opt_pred)

[3 3 5]


## Create Submission CSV

In [66]:
sub = pd.read_csv(PATHS.sub_path)
# if USE_REGRESSION: sub["score"] = preds.clip(0,5).round(0)+1
if USE_REGRESSION: sub['score'] = opt_pred
else: sub["score"] = preds.argmax(axis=1)+1
sub.score = sub.score.astype('int32')
sub.to_csv('submission.csv',index=False)
print('Submission shape:', sub.shape )
sub.head()

Submission shape: (3, 2)


Unnamed: 0,essay_id,score
0,000d118,3
1,000fe60,3
2,001ab80,5
