# Pretrained embeddings inference

In my experiments, I have extracted embeddings from publicly available NLP transformer models and implemented a hill climbing approach to finding the best combination of all the extracted embeddings. This is explored in the notebooks below

Step 1: Extract embeddings (https://www.kaggle.com/code/illidan7/fb3-save-pretrained-embeddings/notebook)

Step 2: Hill Climb (https://www.kaggle.com/illidan7/fb3-hill-climb-pretrained-embeddings/edit)

Step 3: Submission (You are here)

I have taken one of the combinations found in the Hill Climb process (debertav3large + distilrobertabase) and implemented it here.


_____

References:

- https://www.kaggle.com/code/cdeotte/rapids-svr-cv-0-450-lb-0-44x
- https://www.kaggle.com/competitions/petfinder-pawpularity-score/discussion/301686
- https://www.kaggle.com/code/electro/deberta-layerwiselr-lastlayerreinit-tensorflow

# Load libraries

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

import tensorflow as tf
import tensorflow_addons as tfa

import transformers

from transformers import AutoTokenizer, AutoModel
from transformers import DataCollatorWithPadding
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()

from datasets import Dataset

import os
import gc
import sys
from tqdm.notebook import tqdm

# Set Configs

In [2]:
CONFIG = {
        'folds': 5,
        'seed': 101,
        
        # Kaggle embeds
        'debertav3large_npy': '../input/fb3-save-pretrained-embeddings/debertav3large_FB3.npy',
        'distilrobertabase_npy': '../input/fb3-save-pretrained-embeddings/distilrobertabase_FB3.npy',
        
        'debertav3large': '../input/deberta-v3-large/deberta-v3-large/',
        'distilrobertabase': '../input/huggingface-roberta-variants/distilroberta-base/distilroberta-base',

        'batch_size': 4,
        'max_len': 512
        }

# Read in data

In [3]:

#df = pd.read_csv('/kaggle/input/580-final-train/new_train.csv')
#df = df[['text_id','full_text','cohesion', 'syntax', 'vocabulary','phraseology', 'grammar', 'conventions',
#           'text_length','verb','wh','connection','adb','pronoun','noun','adj','fw']]
#df['split'] = np.random.randn(df.shape[0], 1)

#msk = np.random.rand(len(df)) <= 0.7

#train = df[msk]
#test = df[~msk]

#tgtCols = ['cohesion', 'syntax', 'vocabulary','phraseology', 'grammar', 'conventions',
#           'text_length','verb','wh','connection','adb','pronoun','noun','adj','fw']
#print(train.shape)
#print(test.shape)

In [4]:
df = pd.read_csv("../input/feedback-prize-english-language-learning/train.csv")
msk = np.random.rand(len(df)) <= 0.7

train = df[msk]
test = df[~msk]
#test = pd.read_csv("../input/feedback-prize-english-language-learning/test.csv")

tgtCols = ['cohesion', 'syntax', 'vocabulary','phraseology', 'grammar', 'conventions']

print(train.shape)
print(test.shape)

(2711, 8)
(1200, 8)


In [5]:
test

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0
6,005661280443,Imagine if you could prove other people that y...,3.5,4.0,3.5,3.5,4.0,4.0
8,009BCCC61C2A,positive attitude is the key to success. I agr...,3.0,3.0,3.5,3.5,3.0,3.0
9,009F4E9310CB,Asking more than one person for and advice hel...,3.0,3.0,3.5,2.5,3.0,2.5
11,00BCADB373EF,A positive attitude is the key to success for ...,3.5,3.0,4.0,3.5,3.0,3.0
...,...,...,...,...,...,...,...,...
3897,FF8BAC351714,I disagree with the students attend classes fo...,2.5,2.0,3.0,3.0,2.0,2.5
3899,FF9469424ED0,All schools have different educational activit...,3.5,3.5,4.5,3.5,4.0,4.0
3901,FF9E0379CD98,Some school offer distence learning as a optio...,3.5,2.5,3.0,3.0,2.0,2.5
3906,FFD29828A873,I believe using cellphones in class for educat...,2.5,3.0,3.0,3.5,2.5,2.5


In [6]:
my_list = df.columns.values.tolist()
my_list

['text_id',
 'full_text',
 'cohesion',
 'syntax',
 'vocabulary',
 'phraseology',
 'grammar',
 'conventions']

In [7]:
train

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5
5,004AC288D833,"Dear Principal,\r\n\r\nOur school should have ...",3.5,4.0,4.0,3.5,3.5,4.0
...,...,...,...,...,...,...,...,...
3904,FFAEAF8D0C90,"Soccer, all people like to play soccer, and ot...",2.5,2.0,2.5,1.5,2.0,2.0
3905,FFCDB2524616,"I agree with Ralph Waldo Emerson's ""\n\nTo be ...",2.5,3.0,3.0,4.0,3.5,3.0
3907,FFD9A83B0849,"Working alone, students do not have to argue w...",4.0,4.0,4.0,4.0,3.5,3.0
3909,FFE16D704B16,Many people disagree with Albert Schweitzer's ...,4.0,4.5,4.5,4.0,4.5,4.5


# Create folds

In [8]:
train.loc[:, 'kfold'] = -1 # Create a new column `fold` containing `-1`s.
train = train.sample(frac=1).reset_index(drop=True) # Shuffle the rows.
data_labels = train[tgtCols].values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value


In [9]:
import sys
sys.path.append('../input/iterativestratification')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

In [10]:
mskf = MultilabelStratifiedKFold(n_splits=5)
for fold_, (train_, valid_) in enumerate(mskf.split(X=train, y=data_labels)):
    train.loc[valid_, 'kfold'] = fold_ + 1

In [11]:
train

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions,kfold
0,06C9FC1267BB,Do people accomplish more if they are always d...,3.5,3.0,3.5,3.0,3.0,3.0,3
1,77D210B9EC7E,Students will benefit from being able to do th...,3.5,3.5,2.5,3.5,3.0,3.0,4
2,51CE1DCC5307,"Some people might have heard this before, ""The...",4.0,4.0,4.0,4.0,3.5,3.5,2
3,9725E454418F,"Inteoduction, I planned my paper befoer writin...",2.0,2.0,2.0,2.0,2.0,2.5,5
4,6F0330A6C504,This shows how students could think and have t...,3.0,3.5,4.0,4.0,3.5,3.5,1
...,...,...,...,...,...,...,...,...,...
2706,221BA383246D,The job I want to pursue is to play in a profe...,4.5,3.0,3.5,3.5,3.5,4.0,2
2707,EDDAE2D4080B,first i did think about in then i writer on pa...,1.5,1.0,2.0,1.5,2.0,1.5,3
2708,E8795A1AA404,When making life decisions it is good to ask s...,3.0,2.5,3.5,3.5,3.0,3.5,1
2709,EB1500B8EA75,Some students who live far away from their sch...,2.5,2.0,3.0,3.0,2.0,2.0,5


In [12]:
train['kfold'].value_counts().sort_index()

1    542
2    542
3    542
4    542
5    543
Name: kfold, dtype: int64

# Custom functions

## Competition metric

## Data process functions

In [13]:
def hf_encode(texts, chkpt):
    
    tokenizer = transformers.AutoTokenizer.from_pretrained(CONFIG[chkpt])
    tokenizer.save_pretrained('./tokenizer/')

    input_ids = []
    attention_mask = []
    
    for text in texts.tolist():
        token = tokenizer(text, 
                          add_special_tokens=True, 
                          max_length=CONFIG['max_len'], 
                          return_attention_mask=True, 
                          return_tensors="np", 
                          truncation=True, 
                          padding='max_length')
        input_ids.append(token['input_ids'][0])
        attention_mask.append(token['attention_mask'][0])
    return np.array(input_ids, dtype="int32"), np.array(attention_mask, dtype="int32")

In [14]:
def pickle_dump(path, saveobj):
    import pickle
    filehandler = open(path,"wb")
    pickle.dump(saveobj,filehandler)
#     print("File pickled")
    filehandler.close()

In [15]:
def pickle_load(path):
    import pickle
    file = open(path,'rb')
    loadobj = pickle.load(file)
    file.close()
    return loadobj

## Transformer embeddings

In [16]:
"""
class MeanPool(tf.keras.layers.Layer):
    def call(self, inputs, mask=None):
        broadcast_mask = tf.expand_dims(tf.cast(mask, "float32"), -1)
        embedding_sum = tf.reduce_sum(inputs * broadcast_mask, axis=1)
        mask_sum = tf.reduce_sum(broadcast_mask, axis=1)
        mask_sum = tf.math.maximum(mask_sum, tf.constant([1e-9]))
        return embedding_sum / mask_sum
"""

'\nclass MeanPool(tf.keras.layers.Layer):\n    def call(self, inputs, mask=None):\n        broadcast_mask = tf.expand_dims(tf.cast(mask, "float32"), -1)\n        embedding_sum = tf.reduce_sum(inputs * broadcast_mask, axis=1)\n        mask_sum = tf.reduce_sum(broadcast_mask, axis=1)\n        mask_sum = tf.math.maximum(mask_sum, tf.constant([1e-9]))\n        return embedding_sum / mask_sum\n'

In [17]:
def pretrain_embeddings(chkpt, df):
    cfg = transformers.AutoConfig.from_pretrained(CONFIG[chkpt], output_hidden_states=True)
    cfg.hidden_dropout_prob = 0
    cfg.attention_probs_dropout_prob = 0
    cfg.save_pretrained('./tokenizer/')
    
    input_ids = tf.keras.layers.Input(
        shape=(CONFIG['max_len'],), dtype=tf.int32, name="input_ids"
    )
    
    attention_masks = tf.keras.layers.Input(
        shape=(CONFIG['max_len'],), dtype=tf.int32, name="attention_masks"
    )
    
    try:
        model = transformers.TFAutoModel.from_pretrained(CONFIG[chkpt], config=cfg)
    except:
        model = transformers.TFAutoModel.from_pretrained(CONFIG[chkpt], config=cfg, from_pt=True)
        
    output = model(
        input_ids, attention_mask=attention_masks
    )
    hidden_states = output.hidden_states
    mean_pool = []
    for hidden_s in hidden_states[-1:]:
        #def call(self, inputs, mask=None):
        broadcast_mask = tf.expand_dims(tf.cast(attention_masks, "float32"), -1)
        embedding_sum = tf.reduce_sum(hidden_s * broadcast_mask, axis=1)
        mask_sum = tf.reduce_sum(broadcast_mask, axis=1)
        mask_sum = tf.math.maximum(mask_sum, tf.constant([1e-9]))
        tmp = embedding_sum / mask_sum
        mean_pool.append(tmp)
    output = tf.stack(mean_pool,axis=2)
   
    #output = tf.stack(
    #    [MeanPool()(hidden_s, mask=attention_masks) for hidden_s in hidden_states[-1:]], 
    #    axis=2)
    
    output = tf.squeeze(output, axis=-1)
    
    model = tf.keras.Model(inputs=[input_ids, attention_masks], outputs=output)

    model.compile(optimizer="adam",
                 loss='huber_loss',
                 metrics=[tf.keras.metrics.RootMeanSquaredError()],
                 )
    print(model.summary())
    dataset = hf_encode(df['full_text'], chkpt)
    preds = model.predict(dataset, batch_size=CONFIG['batch_size'])
    
    del model, dataset
    _ = gc.collect()
    
    return preds

# Model training

In [18]:
train_dataset = np.load(CONFIG['debertav3large_npy'])
train_dataset = np.concatenate([train_dataset, np.load(CONFIG['distilrobertabase_npy'])], axis=1)

train_dataset.shape

(3911, 1792)

In [19]:
scores = []
rmse_scores = []

for fold in range(1,CONFIG['folds']):

    print('#'*25)
    print(f'## Fold {fold}')
    print('#'*25)

    trn_idx = train[train['kfold']!=fold].index.values
    val_idx = train[train['kfold']==fold].index.values

    X_train = train_dataset[trn_idx,:]
    X_valid = train_dataset[val_idx,:]

    y_train = train[train['kfold']!=fold][tgtCols].copy()
    y_valid = train[train['kfold']==fold][tgtCols].copy()

    val_preds = np.zeros((len(val_idx),6))

    for i, tgt in enumerate(tgtCols):

        print(tgt,', ',end='')
        clf = SVR(C=1)
        clf.fit(X_train, y_train[tgt].values)
        pickle_dump(f"./SVR_tgt{tgt}_fold{fold}.pkl", clf)
        val_preds[:,i] = clf.predict(X_valid)
   
    
    for i in range(len(tgtCols)):
        rmse_scores.append(np.sqrt(mean_squared_error(y_valid[tgtCols].values[:,i],val_preds[:,i])))
        score = np.mean(rmse_scores)
    #score = mcrmse(y_valid[tgtCols].values, val_preds)
        scores.append(score)
    print("Fold : {} RMSE score: {}".format(fold,score))

    print('#'*25)
    print('Overall CV RMSE =',np.mean(scores))

#########################
## Fold 1
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 1 RMSE score: 0.6566137647188007
#########################
Overall CV RMSE = 0.656107845037741
#########################
## Fold 2
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 2 RMSE score: 0.6779639445043969
#########################
Overall CV RMSE = 0.6630906569204559
#########################
## Fold 3
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 3 RMSE score: 0.6647338592916149
#########################
Overall CV RMSE = 0.6646033469011701
#########################
## Fold 4
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 4 RMSE score: 0.660425160143737
#########################
Overall CV RMSE = 0.6634618713166004


In [20]:
del train_dataset
_ = gc.collect()

# Model inference

In [21]:
#test = pd.read_csv("../input/feedback-prize-english-language-learning/test.csv")

In [22]:
test_dataset = pretrain_embeddings('debertav3large', test)
test_dataset = np.concatenate([test_dataset, pretrain_embeddings('distilrobertabase', test)], axis=1)

test_dataset.shape

2022-11-28 21:41:42.768811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-28 21:41:42.769926: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-28 21:41:42.770638: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-28 21:41:42.771505: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compil

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
attention_masks (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
input_ids (InputLayer)          [(None, 512)]        0                                            
__________________________________________________________________________________________________
tf.cast (TFOpLambda)            (None, 512)          0           attention_masks[0][0]            
__________________________________________________________________________________________________
tf_deberta_v2_model (TFDebertaV TFBaseModelOutput(la 434012160   input_ids[0][0]                  
                                                                 attention_masks[0][0]        

  "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
2022-11-28 21:42:23.816500: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
attention_masks (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
input_ids (InputLayer)          [(None, 512)]        0                                            
__________________________________________________________________________________________________
tf.cast_1 (TFOpLambda)          (None, 512)          0           attention_masks[0][0]            
__________________________________________________________________________________________________
tf_roberta_model (TFRobertaMode TFBaseModelOutputWit 82118400    input_ids[0][0]                  
                                                                 attention_masks[0][0]      

(1200, 1792)

In [23]:
fold_preds = []

for fold in range(1,CONFIG['folds']):

    print('#'*25)
    print(f'## Fold {fold}')
    print('#'*25)
    
    test_preds = np.zeros((len(test_dataset),6))
    for i, tgt in enumerate(tgtCols):

        print(tgt,', ',end='')
        model = pickle_load(f"./SVR_tgt{tgt}_fold{fold}.pkl")
        test_preds[:,i] = model.predict(test_dataset)
    
    fold_preds.append(test_preds)
    
    for i in range(len(tgtCols)):
        rmse_scores.append(np.sqrt(mean_squared_error(y_valid[tgtCols].values[:,i],val_preds[:,i])))
        score = np.mean(rmse_scores)
    #score = mcrmse(y_valid[tgtCols].values, val_preds)
        scores.append(score)
    print("Fold : {} RMSE score: {}".format(fold,score))

    print('#'*25)
    print('Overall CV RMSE =',np.mean(scores))
    
    del model
    _ = gc.collect()

#########################
## Fold 1
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 1 RMSE score: 0.6578399406550103
#########################
Overall CV RMSE = 0.6622252104589444
#########################
## Fold 2
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 2 RMSE score: 0.656116460995859
#########################
Overall CV RMSE = 0.6611072815790241
#########################
## Fold 3
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 3 RMSE score: 0.6548854040964652
#########################
Overall CV RMSE = 0.6601341263026252
#########################
## Fold 4
#########################
cohesion , syntax , vocabulary , phraseology , grammar , conventions , Fold : 4 RMSE score: 0.65396211142192
#########################
Overall CV RMSE = 0.6592919441217241


In [24]:
preds = np.mean(fold_preds, axis=0)
preds = np.clip(preds, 1, 5)

In [25]:
output_df = test[['text_id']].reset_index()
output_df

Unnamed: 0,index,text_id
0,0,0016926B079C
1,6,005661280443
2,8,009BCCC61C2A
3,9,009F4E9310CB
4,11,00BCADB373EF
...,...,...
1195,3897,FF8BAC351714
1196,3899,FF9469424ED0
1197,3901,FF9E0379CD98
1198,3906,FFD29828A873


In [26]:
preds_df = pd.DataFrame(preds, columns = tgtCols)
preds_df

Unnamed: 0,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,3.150271,3.008212,3.202044,3.152295,3.031673,3.090434
1,3.131481,3.041934,3.113607,3.091941,2.971744,3.082702
2,3.095260,3.060563,3.120512,3.100017,3.020904,3.085684
3,3.106777,2.971765,3.111706,3.053807,3.046731,3.065468
4,3.114892,3.100257,3.201369,3.124506,3.112098,3.075693
...,...,...,...,...,...,...
1195,3.077533,2.927974,3.101727,3.077560,2.942753,3.068244
1196,3.127624,3.000497,3.172509,3.117813,3.003409,3.052569
1197,3.067304,2.965205,3.088830,3.072609,2.978793,3.086108
1198,3.007043,2.938843,3.073669,3.046476,2.920017,3.034785


In [27]:
preds_df['text_id'] = output_df['text_id']
preds_df = preds_df.reindex(['text_id', *preds_df.columns], axis=1).iloc[: , :-1]
preds_df

Unnamed: 0,text_id,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,3.150271,3.008212,3.202044,3.152295,3.031673,3.090434
1,005661280443,3.131481,3.041934,3.113607,3.091941,2.971744,3.082702
2,009BCCC61C2A,3.095260,3.060563,3.120512,3.100017,3.020904,3.085684
3,009F4E9310CB,3.106777,2.971765,3.111706,3.053807,3.046731,3.065468
4,00BCADB373EF,3.114892,3.100257,3.201369,3.124506,3.112098,3.075693
...,...,...,...,...,...,...,...
1195,FF8BAC351714,3.077533,2.927974,3.101727,3.077560,2.942753,3.068244
1196,FF9469424ED0,3.127624,3.000497,3.172509,3.117813,3.003409,3.052569
1197,FF9E0379CD98,3.067304,2.965205,3.088830,3.072609,2.978793,3.086108
1198,FFD29828A873,3.007043,2.938843,3.073669,3.046476,2.920017,3.034785
