
# RAPIDS SVR - Edited by Bryan

This notebook uses code and ideas from Noufal's great notebook [here][1]. In his notebook he extracts embeddings from one NLP transformer and trains scikit-learn's multi-output regressor with a gradient boosting regressor on CPU with 5 folds.

In this notebook, we use [RAPIDS SVR][3] to train and predict. Because [RAPIDS cuML's SVR][3] runs on the GPU, training is very fast, which makes it practical to train on more extracted embeddings and more folds. Here we use 5 folds and extract embeddings from one NLP transformer. [RAPIDS SVR][3] is described as having built-in feature reduction, so it should learn to rely on the most informative embedding features.

Note that we do not fine-tune the NLP transformer. The DeBERTa model in this notebook is the same pretrained checkpoint that we download from Hugging Face; it has not been fine-tuned on Kaggle's competition data. The results therefore show how much useful signal the pretrained embeddings carry on their own.

# Load Libraries and Data

In [None]:
import numpy as np 
import pandas as pd 
import os, gc, re, warnings
warnings.filterwarnings("ignore")

In [None]:
train = pd.read_csv("/kaggle/input/feedback-prize-english-language-learning/train.csv")
#train["src"]="train"
test = pd.read_csv("/kaggle/input/feedback-prize-english-language-learning/test.csv")
#test["src"]="test"
print('Train shape:',train.shape,'Test shape:',test.shape,'Test columns:',test.columns)
#df = pd.concat([train,test],ignore_index=True)

train.head()

Train shape: (3911, 8) Test shape: (3, 2) Test columns: Index(['text_id', 'full_text'], dtype='object')


Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5


In [None]:
target_cols = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions',]

In [None]:
# split into train/validation (80-20)
from sklearn.model_selection import train_test_split

train, validation, train_labels, validation_labels = train_test_split(train, train[target_cols], test_size=0.20)

In [None]:
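# Reset to a 0..n-1 index so row positions line up with the embedding arrays.
# reset_index() keeps the old index as an extra "index" column (hence 9 columns below).
train = train.reset_index()
validation = validation.reset_index()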

In [None]:
print(train.shape)
print(validation.shape)
print(test.shape)

(3128, 9)
(783, 9)
(3, 2)
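
The split sizes printed above can be checked by hand. The sketch below assumes that the held-out share is rounded up and the remainder is kept for training, which is consistent with the 3128 / 783 rows shown; the ninth column comes from `reset_index()` adding an `index` column to the original eight.

```python
import math

n_samples = 3911   # rows in train.csv, from the shape printed earlier
test_size = 0.20   # same value passed to train_test_split

# Assumption: the validation share is rounded up, the rest is used for training.
n_validation = math.ceil(test_size * n_samples)
n_train = n_samples - n_validation

print(n_train, n_validation)
```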


# Make 5 Stratified Folds!

In [None]:
import sys
sys.path.append('../input/iterativestratification')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
FOLDS = 5 # CHANGED
skf = MultilabelStratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)
for i,(train_index, val_index) in enumerate(skf.split(train,train[target_cols])):
    train.loc[val_index,'FOLD'] = i
print('Train samples per fold:')
train.FOLD.value_counts()

Train samples per fold:


4.0    626
1.0    626
2.0    626
3.0    625
0.0    625
Name: FOLD, dtype: int64
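
The `FOLD` column is used later to hold out one fold at a time. A minimal NumPy sketch of that selection, with made-up fold labels and a toy feature matrix standing in for the real embeddings (`split_by_fold` is a hypothetical helper, not part of the notebook):

```python
import numpy as np

# Hypothetical fold labels for 10 rows; in the notebook these come from train["FOLD"].
fold_labels = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
# Toy stand-in for the embedding matrix: 10 rows, 2 features.
features = np.arange(20, dtype=float).reshape(10, 2)

def split_by_fold(fold_labels, fold):
    """Return (train_idx, dev_idx) row positions for one held-out fold."""
    dev_idx = np.flatnonzero(fold_labels == fold)
    train_idx = np.flatnonzero(fold_labels != fold)
    return train_idx, dev_idx

train_idx, dev_idx = split_by_fold(fold_labels, fold=0)
print(dev_idx)                    # rows held out for evaluation
print(features[train_idx].shape)  # rows used for fitting
```

Because the row positions index directly into the feature matrix, the DataFrame index has to run from 0 to n-1, which is why the index was reset before the folds were assigned.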

In [None]:
train.head()

Unnamed: 0,index,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions,FOLD
0,2084,9C7E82646A0F,I remember when I was having a difficult time ...,2.0,3.0,3.0,3.0,3.0,3.0,3.0
1,1594,7999A063F80E,I agree that praising a students work is an ou...,3.5,3.5,3.0,3.5,3.5,3.5,4.0
2,3469,ED57CFEC1CF1,"I agree, because I think the influence helpen ...",2.5,2.0,2.5,2.0,2.0,2.0,1.0
3,2021,988BFF516FA0,We have many parks that provide a wide of vari...,2.5,2.5,3.0,3.0,2.5,3.0,3.0
4,3127,DCE82F61BE10,Is good idea for students to recive online cou...,3.0,3.5,3.5,3.5,3.0,3.0,1.0


In [None]:
# drop index columns
train = train.drop(columns=["index"])

# Generate Embeddings

In [None]:
from transformers import AutoModel,AutoTokenizer
import torch
import torch.nn.functional as F
from tqdm import tqdm

In [None]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state.detach().cpu()
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
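
To make the effect of `mean_pooling` concrete, here is a small NumPy sketch of the same idea with toy numbers: token vectors are averaged over real tokens only, and padded positions are ignored. The name `masked_mean` and the values are illustrative, not taken from the notebook.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over positions where the mask is 1."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # guard against division by zero
    return summed / counts

# One sequence with 3 positions and hidden size 2; the last position is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(tokens, mask)
print(pooled)  # the padding vector does not contribute to the average
```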

In [None]:
BATCH_SIZE = 4

class EmbedDataset(torch.utils.data.Dataset):
    def __init__(self,df):
        self.df = df.reset_index(drop=True)
    def __len__(self):
        return len(self.df)
    def __getitem__(self,idx):
        text = self.df.loc[idx,"full_text"]
        tokens = tokenizer(
                text,
                None,
                add_special_tokens=True,
                padding='max_length',
                truncation=True,
                max_length=MAX_LEN,return_tensors="pt")
        tokens = {k:v.squeeze(0) for k,v in tokens.items()}
        return tokens

# Training split
ds_tr = EmbedDataset(train)
embed_dataloader_tr = torch.utils.data.DataLoader(ds_tr, batch_size=BATCH_SIZE, shuffle=False)

# Held-out validation split (the "_te" loader name is kept because get_embeddings uses it)
ds_va = EmbedDataset(validation)
embed_dataloader_te = torch.utils.data.DataLoader(ds_va, batch_size=BATCH_SIZE, shuffle=False)

# Competition test set; this loader is not consumed by get_embeddings
ds_te = EmbedDataset(test)
embed_dataloader_te_old = torch.utils.data.DataLoader(ds_te, batch_size=BATCH_SIZE, shuffle=False)
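
The tokenizer call in `EmbedDataset` pads or truncates every essay to exactly `MAX_LEN` tokens, so that all items in a batch share one shape. The sketch below is a simplified stand-in for that behavior: `pad_or_truncate`, the token ids, and the pad id are hypothetical, and a real tokenizer additionally handles special tokens.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                      # truncation
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad             # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_len=5)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_len=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask produced here is what the pooling step relies on to skip padded positions.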

In [None]:
ds_tr.df.head()

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions,FOLD
0,9C7E82646A0F,I remember when I was having a difficult time ...,2.0,3.0,3.0,3.0,3.0,3.0,3.0
1,7999A063F80E,I agree that praising a students work is an ou...,3.5,3.5,3.0,3.5,3.5,3.5,4.0
2,ED57CFEC1CF1,"I agree, because I think the influence helpen ...",2.5,2.0,2.5,2.0,2.0,2.0,1.0
3,988BFF516FA0,We have many parks that provide a wide of vari...,2.5,2.5,3.0,3.0,2.5,3.0,3.0
4,DCE82F61BE10,Is good idea for students to recive online cou...,3.0,3.5,3.5,3.5,3.0,3.0,1.0


In [None]:
ds_va.df.head()

Unnamed: 0,index,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,2812,CD7201C66BBC,Students being able to graduate a year early f...,4.0,4.5,5.0,5.0,4.5,4.5
1,277,144671E445D0,A British naturalist and politician John Lubbo...,2.5,3.0,3.0,2.0,2.5,2.5
2,2138,9FDD8A146288,i think it not good for students because it di...,3.0,2.5,2.5,2.5,2.5,2.5
3,1714,81FB30DE2618,"To Whom It May Concern, I heard that you are l...",2.5,2.5,3.0,3.0,3.0,3.0
4,3285,E429DAB69106,what would happen if older and younger student...,4.0,3.5,3.5,3.5,3.0,3.0


# Extract Embeddings

In [None]:
tokenizer = None
MAX_LEN = 640

def get_embeddings(MODEL_NM='', MAX=640, BATCH_SIZE=4, verbose=True):
    global tokenizer, MAX_LEN
    DEVICE="cuda"
    model = AutoModel.from_pretrained( MODEL_NM )
    tokenizer = AutoTokenizer.from_pretrained( MODEL_NM )
    MAX_LEN = MAX
    
    model = model.to(DEVICE)
    model.eval()
    all_train_text_feats = []
    for batch in tqdm(embed_dataloader_tr,total=len(embed_dataloader_tr)):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        with torch.no_grad():
            model_output = model(input_ids=input_ids,attention_mask=attention_mask)
        sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
        # Normalize the embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # Keep the batch axis so a final batch of one row is not flattened into scalars
        sentence_embeddings = sentence_embeddings.detach().cpu().numpy()
        all_train_text_feats.extend(sentence_embeddings)
    all_train_text_feats = np.array(all_train_text_feats)
    if verbose:
        print('Train embeddings shape',all_train_text_feats.shape)
        
    te_text_feats = []
    for batch in tqdm(embed_dataloader_te,total=len(embed_dataloader_te)):
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        with torch.no_grad():
            model_output = model(input_ids=input_ids,attention_mask=attention_mask)
        sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
        # Normalize the embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # Keep the batch axis so a final batch of one row is not flattened into scalars
        sentence_embeddings = sentence_embeddings.detach().cpu().numpy()
        te_text_feats.extend(sentence_embeddings)
    te_text_feats = np.array(te_text_feats)
    if verbose:
        print('Test embeddings shape',te_text_feats.shape)
        
    return all_train_text_feats, te_text_feats
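
The `F.normalize(..., p=2, dim=1)` step in `get_embeddings` rescales each embedding row to unit Euclidean length. A NumPy sketch of the same operation on toy vectors (`l2_normalize` is an illustrative name, and the `eps` floor is there to keep an all-zero row from dividing by zero):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Divide each row by its Euclidean norm, with a small floor on the norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

embeddings = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = l2_normalize(embeddings)

print(unit)
print(np.linalg.norm(unit, axis=1))  # every row now has length 1
```

After this step only the direction of each embedding matters, which can help kernel methods such as SVR that are sensitive to feature scale.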

# Get Base Embeddings

In [None]:
MODEL_NM = '../input/huggingface-deberta-variants/deberta-large/deberta-large'
all_train_text_feats, te_text_feats = get_embeddings(MODEL_NM)

Some weights of the model checkpoint at ../input/huggingface-deberta-variants/deberta-large/deberta-large were not used when initializing DebertaModel: ['lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'config']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 782/782 [06:25<00:00,  2.03it/s]


Train embeddings shape (3128, 1024)


100%|██████████| 196/196 [01:36<00:00,  2.03it/s]

Test embeddings shape (783, 1024)





In [None]:
#import pickle

#all_train_text_feats = []
#te_text_feats = []
#with open('../input/deberta-pickle-v2/dataset_dict_v2.pickle', 'rb') as pickleFile:
    #x = pickle.load(pickleFile)
    #all_train_text_feats = np.array(x['X_train'])
    #te_text_feats = np.array(x['X_test'])

In [None]:
all_train_text_feats.shape

(3128, 1024)

In [None]:
te_text_feats.shape

(783, 1024)

# Train RAPIDS cuML SVR
Documentation for RAPIDS SVM is [here][1]

[1]: https://docs.rapids.ai/api/cuml/stable/api.html#support-vector-machines

In [None]:
#!pip3 install cuml
import cuml
from cuml.svm import SVR
print('RAPIDS version',cuml.__version__)

RAPIDS version 21.10.02


In [None]:
from sklearn.metrics import mean_squared_error

preds = []
scores = []
def comp_score(y_true,y_pred):
    rmse_scores = []
    for i in range(len(target_cols)):
        rmse_scores.append(np.sqrt(mean_squared_error(y_true[:,i],y_pred[:,i])))
    return np.mean(rmse_scores)

#for fold in tqdm(range(FOLDS),total=FOLDS):
for fold in range(FOLDS):
    print('#'*5)
    print('### Fold',fold+1)
    print('#'*5)
    
    train_ = train[train["FOLD"]!=fold]
    dev_ = train[train["FOLD"]==fold]
    
    tr_text_feats = all_train_text_feats[list(train_.index),:]
    ev_text_feats = all_train_text_feats[list(dev_.index),:]
    
    ev_preds = np.zeros((len(ev_text_feats),6))
    test_preds = np.zeros((len(te_text_feats),6))
    for i,t in enumerate(target_cols):
        print(t,', ',end='')
        clf = SVR(C=1)
        clf.fit(tr_text_feats, train_[t].values)
        ev_preds[:,i] = clf.predict(ev_text_feats)
        test_preds[:,i] = clf.predict(te_text_feats)
    score = comp_score(dev_[target_cols].values,ev_preds)
    scores.append(score)
    print("Fold : {} RSME score: {}".format(fold,score))
    preds.append(test_preds)
    
print('#'*5)
print('Overall CV RSME =',np.mean(scores))

#####
### Fold 1
#####
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 0 RSME score: 0.45890545346957384
#####
### Fold 2
#####
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 1 RSME score: 0.4616110043568627
#####
### Fold 3
#####
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 2 RSME score: 0.45542986676007735
#####
### Fold 4
#####
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 3 RSME score: 0.4578918158315441
#####
### Fold 5
#####
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 4 RSME score: 0.47466910281366737
#####
Overall CV RSME = 0.4617014486463451
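
The metric computed by `comp_score` is the mean of the per-target RMSE values. The sketch below reimplements it with NumPy only, on a two-target toy example, and then re-derives the overall figure from the five fold scores printed above; `mean_columnwise_rmse` is an illustrative name.

```python
import numpy as np

def mean_columnwise_rmse(y_true, y_pred):
    """RMSE per target column, then the mean across columns."""
    per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return per_column.mean(), per_column

# Toy example with two targets and two essays.
y_true = np.array([[3.0, 2.0], [4.0, 2.0]])
y_pred = np.array([[3.0, 3.0], [2.0, 3.0]])
score, per_column = mean_columnwise_rmse(y_true, y_pred)
print(per_column, score)

# Fold scores copied from the output above; their mean is the overall CV score.
fold_scores = [
    0.45890545346957384,
    0.4616110043568627,
    0.45542986676007735,
    0.4578918158315441,
    0.47466910281366737,
]
overall = float(np.mean(fold_scores))
print(overall)
```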


In [None]:
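# Refit on the full training split and predict on the held-out validation set
# (te_text_feats holds the validation embeddings, not the competition test set)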
preds = np.zeros((len(te_text_feats),6))
for i,t in enumerate(target_cols):
    clf = SVR(C=1)
    clf.fit(all_train_text_feats, train_labels[t])
    print("Predicting on validation set for " + str(t))
    preds[:, i] = clf.predict(te_text_feats)

Predicting on validation set for cohesion
Predicting on validation set for syntax
Predicting on validation set for vocabulary
Predicting on validation set for phraseology
Predicting on validation set for grammar
Predicting on validation set for conventions


In [None]:
score = comp_score(validation_labels.values, preds)

# Overall CV Score 0.4882

# Create Submission CSV

In [None]:
sub = test.copy()

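# Note: preds was reassigned in an earlier cell to a single (n_validation, 6) array,
# so this average over axis 0 yields one value per target, repeated on every test row.
sub.loc[:,target_cols] = np.average(np.array(preds),axis=0) #,weights=[1/s for s in scores]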
sub_columns = pd.read_csv("../input/feedback-prize-english-language-learning/sample_submission.csv").columns
sub = sub[sub_columns]

In [None]:
sub.to_csv("submission.csv",index=None)
sub.head()

Unnamed: 0,text_id,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0000C359D63E,3.129556,3.025909,3.237558,3.119506,3.026942,3.072633
1,000BAD50D026,3.129556,3.025909,3.237558,3.119506,3.026942,3.072633
2,00367BB2546B,3.129556,3.025909,3.237558,3.119506,3.026942,3.072633
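
All three rows above carry identical predictions. A likely explanation, given the code as written, is the shape of `preds` at the point of averaging: a list of per-fold test predictions averages to one value per row and target, whereas a single 2-D array averages to one value per target that is then repeated on every row. The sketch below illustrates both cases with random numbers, not the notebook's actual predictions.

```python
import numpy as np

n_test, n_targets, n_folds = 3, 6, 5
rng = np.random.default_rng(0)

# Case 1: a list of per-fold test predictions, each of shape (n_test, n_targets).
fold_preds = [rng.uniform(1.0, 5.0, size=(n_test, n_targets)) for _ in range(n_folds)]
per_row = np.average(np.array(fold_preds), axis=0)
print(per_row.shape)     # one prediction per test row and target

# Case 2: a single 2-D array of predictions for a different set of rows.
flat_preds = rng.uniform(1.0, 5.0, size=(783, n_targets))
per_target = np.average(flat_preds, axis=0)
print(per_target.shape)  # one number per target

# Assigning a length-6 vector into a (3, 6) block repeats it on every row.
filled = np.empty((n_test, n_targets))
filled[:, :] = per_target
print(filled)
```

Producing distinct predictions for the three test essays would require embedding the test set and predicting on those embeddings, which this notebook does not do.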
