# RAPIDS SVR - CV 0.450 - LB 0.44x !!

This notebook uses code and ideas from Noufal's great notebook [here][1]. In his notebook he extracts embeddings from one NLP transformer and trains Sklearn's multi-output regressor with a gradient boosting regressor on CPU with 5 folds.

In this notebook, we use [RAPIDS SVR][3] to train and predict. Because [RAPIDS cuML's SVR][3] runs on GPU, it is very fast, which lets us train on more extracted embeddings and with more folds. We use 25 folds and extract embeddings from 5 NLP transformers. Afterward we concatenate them and have 4864 columns of features! [RAPIDS SVR][3] handles this many feature columns without a separate feature-reduction step, so it learns to use the most informative features from all the NLP transformers!

![](https://raw.githubusercontent.com/cdeotte/Kaggle_Images/main/Sep-2022/svr.png)

Note that we do not finetune the NLP transformers. The Deberta transformers in this notebook are the same pretrained transformers that we download from Hugging Face. They have not been finetuned on Kaggle's competition data. This demonstrates that the pretrained embeddings already carry enough information about writing quality for a simple regressor to score well.

This is similar to Giba's 1st place solution in Kaggle's Pet Competition [here][2]. That competition was computer vision regression. Giba extracted embeddings from dozens of image CNNs and image transformers. The models were pretrained (most likely on ImageNet data) but not finetuned (on Kaggle competition data). He concatenated the embeddings and trained a [RAPIDS SVR][3] on tens of thousands of feature columns!

[1]: https://www.kaggle.com/code/kvsnoufal/lb0-46-gb-debertaembedding
[2]: https://www.kaggle.com/competitions/petfinder-pawpularity-score/discussion/301686
[3]: https://docs.rapids.ai/api/cuml/stable/api.html#support-vector-machines

# Load Libraries and Data

In [1]:
import numpy as np 
import pandas as pd 
import os, gc, re, warnings
warnings.filterwarnings("ignore")

In [2]:
dftr = pd.read_csv("/kaggle/input/feedback-prize-english-language-learning/train.csv")
dftr["src"]="train"
dfte = pd.read_csv("/kaggle/input/feedback-prize-english-language-learning/test.csv")
dfte["src"]="test"
print('Train shape:',dftr.shape,'Test shape:',dfte.shape,'Test columns:',dfte.columns)
df = pd.concat([dftr,dfte],ignore_index=True)

dftr.head()

Train shape: (3911, 9) Test shape: (3, 3) Test columns: Index(['text_id', 'full_text', 'src'], dtype='object')


Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions,src
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0,train
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5,train
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5,train
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0,train
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5,train


In [3]:
target_cols = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions',]

# Make 25 Stratified Folds!

In [4]:
import sys
sys.path.append('../input/iterativestratification')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
FOLDS = 25
skf = MultilabelStratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)
for i,(train_index, val_index) in enumerate(skf.split(dftr,dftr[target_cols])):
    dftr.loc[val_index,'FOLD'] = i
print('Train samples per fold:')
dftr.FOLD.value_counts()

Train samples per fold:


11.0    157
21.0    157
1.0     157
7.0     157
20.0    157
14.0    157
19.0    157
12.0    157
23.0    157
3.0     157
6.0     157
5.0     156
13.0    156
22.0    156
0.0     156
15.0    156
24.0    156
17.0    156
9.0     156
8.0     156
4.0     156
2.0     156
18.0    156
10.0    156
16.0    156
Name: FOLD, dtype: int64
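
The fold sizes above follow directly from the row count: 3911 training rows do not divide evenly by 25, so some folds receive one extra row. A quick check of that arithmetic (a sketch that only reproduces the counts, not the stratification itself):

```python
n_rows, n_folds = 3911, 25  # values taken from the outputs above

base, extra = divmod(n_rows, n_folds)
# `extra` folds hold base + 1 rows, the remaining folds hold `base` rows
sizes = [base + 1] * extra + [base] * (n_folds - extra)

print(base, extra)   # 156 11
print(sum(sizes))    # 3911
```

This agrees with the listing: 11 folds of 157 rows and 14 folds of 156 rows.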

# Generate Embeddings

In [5]:
from transformers import AutoModel,AutoTokenizer
import torch
import torch.nn.functional as F
from tqdm import tqdm

In [6]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state.detach().cpu()
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
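
To see what `mean_pooling` computes, here is a minimal NumPy sketch of the same masked average on a toy batch. The function name `masked_mean` and the toy values are illustrative assumptions, not part of the notebook; the point is that padded positions (mask 0) do not affect the result.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over the sequence axis, ignoring padded tokens."""
    mask = attention_mask[..., None].astype(float)        # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)        # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid division by zero
    return summed / counts

# One sequence of three tokens; the third token is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean(tokens, mask)
print(pooled)  # [[2. 3.]]
```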

In [7]:
BATCH_SIZE = 4

class EmbedDataset(torch.utils.data.Dataset):
    def __init__(self,df):
        self.df = df.reset_index(drop=True)
    def __len__(self):
        return len(self.df)
    def __getitem__(self,idx):
        text = self.df.loc[idx,"full_text"]
        tokens = tokenizer(
                text,
                None,
                add_special_tokens=True,
                padding='max_length',
                truncation=True,
                max_length=MAX_LEN,return_tensors="pt")
        tokens = {k:v.squeeze(0) for k,v in tokens.items()}
        return tokens

ds_tr = EmbedDataset(dftr)
embed_dataloader_tr = torch.utils.data.DataLoader(ds_tr,\
                        batch_size=BATCH_SIZE,\
                        shuffle=False)
ds_te = EmbedDataset(dfte)
embed_dataloader_te = torch.utils.data.DataLoader(ds_te,\
                        batch_size=BATCH_SIZE,\
                        shuffle=False)
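
With `BATCH_SIZE = 4` and no `drop_last`, each loader yields a final, smaller batch whenever the row count is not a multiple of 4. A small sketch of the batch arithmetic, using the row counts printed earlier (3911 train, 3 test); `batch_plan` is a hypothetical helper for illustration only:

```python
import math

BATCH_SIZE = 4

def batch_plan(n_rows, batch_size=BATCH_SIZE):
    """Return (number of batches, size of the last batch) for a non-dropping loader."""
    n_batches = math.ceil(n_rows / batch_size)
    last = n_rows - (n_batches - 1) * batch_size
    return n_batches, last

print(batch_plan(3911))  # (978, 3)
print(batch_plan(3))     # (1, 3)
```

The 978 matches the length of the training progress bars in the embedding outputs.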

# Extract Embeddings

In [8]:
tokenizer = None
MAX_LEN = 640

def get_embeddings(MODEL_NM='', MAX=640, BATCH_SIZE=4, verbose=True):
    global tokenizer, MAX_LEN
    DEVICE="cuda"
    model = AutoModel.from_pretrained( MODEL_NM )
    tokenizer = AutoTokenizer.from_pretrained( MODEL_NM )
    MAX_LEN = MAX
    
    model = model.to(DEVICE)
    model.eval()
    all_train_text_feats = []
    for batch in tqdm(embed_dataloader_tr,total=len(embed_dataloader_tr)):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        with torch.no_grad():
            model_output = model(input_ids=input_ids,attention_mask=attention_mask)
        sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
        # Normalize the embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # Keep the batch dimension so a final batch of size 1 still adds exactly one row
        sentence_embeddings = sentence_embeddings.detach().cpu().numpy()
        all_train_text_feats.extend(sentence_embeddings)
    all_train_text_feats = np.array(all_train_text_feats)
    if verbose:
        print('Train embeddings shape',all_train_text_feats.shape)
        
    te_text_feats = []
    for batch in tqdm(embed_dataloader_te,total=len(embed_dataloader_te)):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        with torch.no_grad():
            model_output = model(input_ids=input_ids,attention_mask=attention_mask)
        sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
        # Normalize the embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # Keep the batch dimension so a final batch of size 1 still adds exactly one row
        sentence_embeddings = sentence_embeddings.detach().cpu().numpy()
        te_text_feats.extend(sentence_embeddings)
    te_text_feats = np.array(te_text_feats)
    if verbose:
        print('Test embeddings shape',te_text_feats.shape)
        
    return all_train_text_feats, te_text_feats
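
The `F.normalize(..., p=2, dim=1)` step rescales every pooled embedding to unit length. A minimal NumPy sketch of the same row-wise operation (the name `l2_normalize` and the toy vectors are assumptions for illustration):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalization, mirroring F.normalize(x, p=2, dim=1)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

emb = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = l2_normalize(emb)
print(unit)                          # rows become [0.6, 0.8] and [0.0, 1.0]
print(np.linalg.norm(unit, axis=1))  # every row now has length 1
```

Because each model's embeddings are normalized before they are combined, every model contributes a block of the same overall scale regardless of its hidden size.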

# Get Base Embeddings

In [9]:
MODEL_NM = '../input/huggingface-deberta-variants/deberta-base/deberta-base'
all_train_text_feats, te_text_feats = get_embeddings(MODEL_NM)

Some weights of the model checkpoint at ../input/huggingface-deberta-variants/deberta-base/deberta-base were not used when initializing DebertaModel: ['lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'config', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.dense.weight']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 978/978 [03:02<00:00,  5.35it/s]


Train embeddings shape (3911, 768)


100%|██████████| 1/1 [00:00<00:00,  7.31it/s]

Test embeddings shape (3, 768)





# Get Large V3 Embeddings

In [10]:
MODEL_NM = '../input/deberta-v3-large/deberta-v3-large'
all_train_text_feats2, te_text_feats2 = get_embeddings(MODEL_NM)

Some weights of the model checkpoint at ../input/deberta-v3-large/deberta-v3-large were not used when initializing DebertaV2Model: ['mask_predictions.classifier.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.dense.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Train embeddings shape (3911, 1024)


100%|██████████| 1/1 [00:00<00:00,  2.40it/s]

Test embeddings shape (3, 1024)





# Get Large Embeddings

In [11]:
MODEL_NM = '../input/huggingface-deberta-variants/deberta-large/deberta-large'
all_train_text_feats3, te_text_feats3 = get_embeddings(MODEL_NM)

Some weights of the model checkpoint at ../input/huggingface-deberta-variants/deberta-large/deberta-large were not used when initializing DebertaModel: ['lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'config', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.dense.bias']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 978/978 [08:06<00:00,  2.01it/s]


Train embeddings shape (3911, 1024)


100%|██████████| 1/1 [00:00<00:00,  2.49it/s]

Test embeddings shape (3, 1024)





# Get Large MNLI Embeddings

In [12]:
MODEL_NM = '../input/huggingface-deberta-variants/deberta-large-mnli/deberta-large-mnli'
all_train_text_feats4, te_text_feats4 = get_embeddings(MODEL_NM, MAX=512)

Some weights of the model checkpoint at ../input/huggingface-deberta-variants/deberta-large-mnli/deberta-large-mnli were not used when initializing DebertaModel: ['config', 'classifier.weight', 'classifier.bias', 'pooler.dense.bias', 'pooler.dense.weight']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 978/978 [06:22<00:00,  2.56it/s]


Train embeddings shape (3911, 1024)


100%|██████████| 1/1 [00:00<00:00,  3.37it/s]

Test embeddings shape (3, 1024)





# Get XLarge Embeddings

In [13]:
MODEL_NM = '../input/huggingface-deberta-variants/deberta-xlarge/deberta-xlarge'
all_train_text_feats5, te_text_feats5 = get_embeddings(MODEL_NM, MAX=512)

Some weights of the model checkpoint at ../input/huggingface-deberta-variants/deberta-xlarge/deberta-xlarge were not used when initializing DebertaModel: ['lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.dense.bias']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 978/978 [12:26<00:00,  1.31it/s]


Train embeddings shape (3911, 1024)


100%|██████████| 1/1 [00:00<00:00,  1.75it/s]

Test embeddings shape (3, 1024)





# Combine Feature Embeddings

In [14]:
all_train_text_feats = np.concatenate([all_train_text_feats,all_train_text_feats2,
                                       all_train_text_feats3,all_train_text_feats4,
                                       all_train_text_feats5],axis=1)

te_text_feats = np.concatenate([te_text_feats,te_text_feats2,
                                te_text_feats3,te_text_feats4,
                                te_text_feats5],axis=1)

del all_train_text_feats2, te_text_feats2
del all_train_text_feats3, te_text_feats3
del all_train_text_feats4, te_text_feats4
del all_train_text_feats5, te_text_feats5
gc.collect()

print('Our concatenated embeddings have shape', all_train_text_feats.shape )

Our concatenated embeddings have shape (3911, 4864)
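
The 4864 columns are simply the sum of the five hidden sizes reported in the "embeddings shape" outputs above. A small sketch with toy unit-norm blocks standing in for the real embeddings (the 2-row arrays are made up for illustration):

```python
import numpy as np

# Hidden sizes as reported by the "embeddings shape" outputs above
hidden_sizes = {
    "deberta-base": 768,
    "deberta-v3-large": 1024,
    "deberta-large": 1024,
    "deberta-large-mnli": 1024,
    "deberta-xlarge": 1024,
}
total = sum(hidden_sizes.values())
print(total)  # 4864

# Toy stand-in: 2 rows per block, each row scaled to unit L2 norm
blocks = [np.ones((2, h)) / np.sqrt(h) for h in hidden_sizes.values()]
features = np.concatenate(blocks, axis=1)  # join along the feature axis
print(features.shape)                      # (2, 4864)
print(np.linalg.norm(features, axis=1))    # sqrt(5) for every row
```

Since each block has unit-norm rows, every concatenated row has norm sqrt(5), about 2.236.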


# Train RAPIDS cuML SVR
Documentation for RAPIDS SVM is [here][3].

In [15]:
from cuml.svm import SVR
import cuml
print('RAPIDS version',cuml.__version__)

RAPIDS version 21.10.02
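
The training cell that follows builds `SVR(C=1)` and leaves the kernel at its default, which the cuML documentation lists as RBF. As a rough illustration of what that kernel measures, here is a NumPy sketch (not cuML's implementation; the `gamma` value and the toy rows are arbitrary assumptions):

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """K[i, j] = exp(-gamma * ||a[i] - b[j]||^2)."""
    sq_dists = (
        (a ** 2).sum(axis=1)[:, None]
        + (b ** 2).sum(axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
K = rbf_kernel(x, x, gamma=0.5)  # gamma chosen only for illustration
print(K.round(4))
```

Identical rows get similarity 1, and similarity decays with squared distance. The kernel only sees distances between rows, so adding feature columns changes how those distances are computed rather than adding model parameters per column.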


In [16]:
from sklearn.metrics import mean_squared_error

preds = []
scores = []
def comp_score(y_true,y_pred):
    rmse_scores = []
    for i in range(len(target_cols)):
        rmse_scores.append(np.sqrt(mean_squared_error(y_true[:,i],y_pred[:,i])))
    return np.mean(rmse_scores)

#for fold in tqdm(range(FOLDS),total=FOLDS):
for fold in range(FOLDS):
    print('#'*25)
    print('### Fold',fold+1)
    print('#'*25)
    
    dftr_ = dftr[dftr["FOLD"]!=fold]
    dfev_ = dftr[dftr["FOLD"]==fold]
    
    tr_text_feats = all_train_text_feats[list(dftr_.index),:]
    ev_text_feats = all_train_text_feats[list(dfev_.index),:]
    
    ev_preds = np.zeros((len(ev_text_feats),6))
    test_preds = np.zeros((len(te_text_feats),6))
    for i,t in enumerate(target_cols):
        print(t,', ',end='')
        clf = SVR(C=1)
        clf.fit(tr_text_feats, dftr_[t].values)
        ev_preds[:,i] = clf.predict(ev_text_feats)
        test_preds[:,i] = clf.predict(te_text_feats)
    print()
    score = comp_score(dfev_[target_cols].values,ev_preds)
    scores.append(score)
    print("Fold : {} RMSE score: {}".format(fold,score))
    preds.append(test_preds)
    
print('#'*25)
print('Overall CV RMSE =',np.mean(scores))

#########################
### Fold 1
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , 
Fold : 0 RMSE score: 0.45941574428793963
#########################
### Fold 2
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , 
Fold : 1 RMSE score: 0.44688430931942386
#########################
### Fold 3
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , 
Fold : 2 RMSE score: 0.4599025031310042
#########################
### Fold 4
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , 
Fold : 3 RMSE score: 0.44741512751324436
#########################
### Fold 5
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , 
Fold : 4 RMSE score: 0.45173666632901305
#########################
### Fold 6
#########################
cohesion , syntax , vocabulary , phraseology , grammar , c

# Overall CV Score 0.4505
Wow, nice! Our overall CV score (the mean column-wise RMSE over the six targets, averaged across folds) using RAPIDS SVR without finetuning any NLP transformers is 0.4505! And our LB is 0.44x!
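
The `comp_score` function in the training cell computes an RMSE per target column and then averages those values. A minimal NumPy sketch of the same metric on toy data (the name `mcrmse` and the numbers are illustrative assumptions):

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean over columns of the per-column RMSE."""
    per_column = np.sqrt(((y_true - y_pred) ** 2).mean(axis=0))
    return per_column.mean()

y_true = np.array([[3.0, 3.0], [4.0, 2.0]])
y_pred = np.array([[3.0, 4.0], [4.0, 3.0]])
# Column 0 is predicted exactly (RMSE 0); column 1 is off by 1 on both rows (RMSE 1)
print(mcrmse(y_true, y_pred))  # 0.5
```

The overall figure reported here is the mean of the per-fold scores, which can differ slightly from the same metric computed once on pooled out-of-fold predictions.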

# Create Submission CSV

In [17]:
sub = dfte.copy()

sub.loc[:,target_cols] = np.average(np.array(preds),axis=0) #,weights=[1/s for s in scores]
sub_columns = pd.read_csv("../input/feedback-prize-english-language-learning/sample_submission.csv").columns
sub = sub[sub_columns]
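
The submission averages the test predictions from all folds with equal weight; the commented-out `weights` argument hints at an alternative that favors folds with lower validation error. A toy sketch of both options (the two folds, their predictions and their scores are made up for illustration):

```python
import numpy as np

# Two hypothetical folds, one test row, two targets
preds = [np.array([[2.0, 3.0]]), np.array([[4.0, 5.0]])]
scores = [0.5, 1.0]  # made-up fold RMSEs, lower is better

plain = np.average(np.array(preds), axis=0)
weighted = np.average(np.array(preds), axis=0, weights=[1 / s for s in scores])

print(plain)     # [[3. 4.]]
print(weighted)  # the first fold counts twice as much as the second
```

With fold scores as close together as the ones in this notebook, the two averages would be nearly identical.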


In [18]:
sub.to_csv("submission.csv",index=None)
sub.head()

Unnamed: 0,text_id,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0000C359D63E,2.920062,2.798517,3.135469,2.941101,2.661195,2.668508
1,000BAD50D026,2.688524,2.443526,2.692254,2.304748,2.013087,2.642344
2,00367BB2546B,3.648225,3.471449,3.58472,3.656427,3.407419,3.358812
