_____
**Credits:**<br>
This notebook demonstrates a simple ensemble method for ranking problems. It is **based on the two incredible notebooks:**
- **[Stronger baseline with code cells](https://www.kaggle.com/code/suicaokhoailang/stronger-baseline-with-code-cells)** by [suicaokhoailang](https://www.kaggle.com/suicaokhoailang)
- **[AI4Code Pairwise BertSmall inference](https://www.kaggle.com/code/yuanzhezhou/ai4code-pairwise-bertsmall-inference)** by [yuanzhezhou](https://www.kaggle.com/yuanzhezhou)<br>

All credit for the models themselves (both training and prediction) belongs to the original authors! I simply cloned their code and retrained my own version.
_____





# Ensembling Rank Based Submissions

We are a month away from the finish line, and yet there is no high-scoring ensemble in sight!

Let's fix this. 

But how do you actually combine rank-based predictions, let alone predictions that come from completely different approaches (direct rank prediction vs. pairwise)?<br>
Actually, this is pretty simple: **Average the indices of the elements.**<br>
This way, we can sort the final prediction by the ensembled indices, which act as an aggregated estimate of each element's location.<br>
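
As a toy illustration of the idea (the cell ids and both orderings below are made up for the example, not taken from the competition data), averaging the positions of two predicted orderings could look like this:

```python
# Two hypothetical predicted orderings of the same four cells.
order_a = ["m1", "c1", "c2", "m2"]
order_b = ["c1", "c2", "m1", "m2"]

# Position (index) of every cell in each ordering.
pos_a = {cell: i for i, cell in enumerate(order_a)}
pos_b = {cell: i for i, cell in enumerate(order_b)}

# Average the two positions, then sort the cells by the averaged position.
avg_pos = {cell: (pos_a[cell] + pos_b[cell]) / 2 for cell in pos_a}
ensembled = sorted(avg_pos, key=avg_pos.get)

print(avg_pos)
print(ensembled)
```

When two cells end up with the same averaged position, Python's stable sort keeps them in the order of the first list, so with equal weights the first submission effectively acts as the tie-breaker.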

_____

### **[Stronger baseline with code cells](https://www.kaggle.com/code/suicaokhoailang/stronger-baseline-with-code-cells)**
#### By [suicaokhoailang](https://www.kaggle.com/suicaokhoailang)

In [1]:
def read_notebook(path):
    import pandas as pd
    return (
        pd.read_json(
            path,
            dtype={'cell_type': 'category', 'source': 'str'})
        .assign(id=path.stem)
        .rename_axis('cell_id')
    )

In [2]:
def clean_code(cell): return str(cell).replace("\\n", "\n")

def sample_cells(cells, n):
    import numpy as np
    cells = [clean_code(cell) for cell in cells]
    if n >= len(cells): return [cell[:200] for cell in cells]
    else:
        results = []
        step = len(cells) / n
        idx = 0
        while int(np.round(idx)) < len(cells):
            results.append(cells[int(np.round(idx))])
            idx += step        
        if cells[-1] not in results: results[-1] = cells[-1]
        return results

def get_features(df):
    from tqdm import tqdm
    features = dict()
    df = df.sort_values("rank").reset_index(drop=True)
    for idx, sub_df in tqdm(df.groupby("id")):
        features[idx] = dict()
        total_md = sub_df[sub_df.cell_type == "markdown"].shape[0]
        code_sub_df = sub_df[sub_df.cell_type == "code"]
        total_code = code_sub_df.shape[0]
        codes = sample_cells(code_sub_df.source.values, 20)
        features[idx]["total_code"] = total_code
        features[idx]["total_md"] = total_md
        features[idx]["codes"] = codes
    return features

In [33]:
df

In [3]:
def read_data(data): return tuple(d.cuda() for d in data[:-1]), data[-1].cuda()

def validate(model, val_loader):    
    import sys
    import torch    
    import numpy as np
    from tqdm import tqdm    
    model.eval()    
    tbar = tqdm(val_loader, file=sys.stdout)    
    preds = []
    labels = []
    with torch.no_grad():
        for idx, data in enumerate(tbar):
            inputs, target = read_data(data)
            pred = model(*inputs)
            preds.append(pred.detach().cpu().numpy().ravel())
            labels.append(target.detach().cpu().numpy().ravel())    
    return np.concatenate(labels), np.concatenate(preds)

def predict_caller(args): return predict(args[0], args[1])
    
def predict(model_path, ckpt_path):
    
    import gc
    import json
    import sys, os
    import numpy as np
    import pandas as pd
    from tqdm import tqdm
    from pathlib import Path
    from scipy import sparse

    data_dir = Path('../input/AI4Code')
    paths_test = list((data_dir / 'test').glob('*.json'))
    notebooks_test = [
        read_notebook(path) for path in tqdm(paths_test, desc='Test NBs')
    ]
    test_df = (
        pd.concat(notebooks_test)
        .set_index('id', append=True)
        .swaplevel()
        .sort_index(level='id', sort_remaining=False)
    ).reset_index()
    test_df["rank"] = test_df.groupby(["id", "cell_type"]).cumcount()
    test_df["pred"] = test_df.groupby(["id", "cell_type"])["rank"].rank(pct=True)

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, Dataset
    from transformers import AutoModel, AutoTokenizer

    class MarkdownModel(nn.Module):
        def __init__(self, model_path):
            super(MarkdownModel, self).__init__()
            self.model = AutoModel.from_pretrained(model_path)
            self.top = nn.Linear(769, 1)

        def forward(self, ids, mask, fts):
            x = self.model(ids, mask)[0]
            x = self.top(torch.cat((x[:, 0, :], fts),1))
            return x


    class MarkdownDataset(Dataset):

        def __init__(self, df, model_name_or_path, total_max_len, md_max_len, fts):
            super().__init__()
            self.df = df.reset_index(drop=True)
            self.md_max_len = md_max_len
            self.total_max_len = total_max_len  # maxlen allowed by model config
            self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
            self.fts = fts

        def __getitem__(self, index):
            row = self.df.iloc[index]

            inputs = self.tokenizer.encode_plus(
                row.source,
                None,
                add_special_tokens=True,
                max_length=self.md_max_len,
                padding="max_length",
                return_token_type_ids=True,
                truncation=True
            )
            code_inputs = self.tokenizer.batch_encode_plus(
                [str(x) for x in self.fts[row.id]["codes"]],
                add_special_tokens=True,
                max_length=23,
                padding="max_length",
                truncation=True
            )
            n_md = self.fts[row.id]["total_md"]
            n_code = self.fts[row.id]["total_code"]
            if n_md + n_code == 0:
                fts = torch.FloatTensor([0])
            else:
                fts = torch.FloatTensor([n_md / (n_md + n_code)])

            ids = inputs['input_ids']
            for x in code_inputs['input_ids']:
                ids.extend(x[:-1])
            ids = ids[:self.total_max_len]
            if len(ids) != self.total_max_len:
                ids = ids + [self.tokenizer.pad_token_id, ] * (self.total_max_len - len(ids))
            ids = torch.LongTensor(ids)

            mask = inputs['attention_mask']
            for x in code_inputs['attention_mask']:
                mask.extend(x[:-1])
            mask = mask[:self.total_max_len]
            if len(mask) != self.total_max_len:
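                # NOTE: the attention mask is padded with pad_token_id rather than 0. For RoBERTa-style
                # tokenizers such as CodeBERT's (pad_token_id == 1), the padded positions end up attended to.
                # Kept unchanged on the assumption that the checkpoint was trained with this same padding.
                mask = mask + [self.tokenizer.pad_token_id, ] * (self.total_max_len - len(mask))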
            mask = torch.LongTensor(mask)

            assert len(ids) == self.total_max_len

            return ids, mask, fts, torch.FloatTensor([row.pct_rank])

        def __len__(self):
            return self.df.shape[0]
    
    model = MarkdownModel(model_path)
    model = model.cuda()
    model.eval()
    model.load_state_dict(torch.load(ckpt_path))
    BS = 32
    NW = 8
    MAX_LEN = 64
    test_df["pct_rank"] = 0
    test_fts = get_features(test_df)
    test_ds = MarkdownDataset(test_df[test_df["cell_type"] == "markdown"].reset_index(drop=True), md_max_len=64,total_max_len=512, model_name_or_path=model_path, fts=test_fts)
    test_loader = DataLoader(test_ds, batch_size=BS, shuffle=False, num_workers=NW,
                              pin_memory=False, drop_last=False)
    _, y_test = validate(model, test_loader)
    model.to(torch.device('cpu'))
    torch.cuda.empty_cache()    
    del model, test_loader, test_ds
    gc.collect()      
    
    test_df.loc[test_df["cell_type"] == "markdown", "pred"] = y_test
    sub_df = test_df.sort_values("pred").groupby("id")["cell_id"].apply(lambda x: " ".join(x)).reset_index()
    sub_df.rename(columns={"cell_id": "cell_order"}, inplace=True)
    sub_df.head()
    sub_df.to_csv("submission_1.csv", index=False)

    del test_df, paths_test, notebooks_test, test_fts, model_path, ckpt_path, sub_df
    del json, np, pd, tqdm, Path, sparse, torch, sys, os, nn, F, AutoModel, AutoTokenizer
    gc.collect()
    

In [4]:
import gc
ckpt_path = "../input/ai4code-model/model.bin"
model_path = "../input/codebert-base/codebert-base/"

from tqdm.contrib.concurrent import process_map
process_map(predict_caller, [(model_path, ckpt_path)])[0]
gc.collect()

_____

### **[AI4Code Pairwise BertSmall inference](https://www.kaggle.com/code/yuanzhezhou/ai4code-pairwise-bertsmall-inference)**
#### By [yuanzhezhou](https://www.kaggle.com/yuanzhezhou)

In [5]:
import json
import numpy as np
import pandas as pd
from tqdm import tqdm
from pathlib import Path
from scipy import sparse

pd.options.display.width = 180
pd.options.display.max_colwidth = 120

BERT_PATH = "../input/huggingface-bert-variants/distilbert-base-uncased/distilbert-base-uncased"

data_dir = Path('../input/AI4Code')
NUM_TRAIN = 200

def read_notebook(path):
    return (
        pd.read_json(
            path,
            dtype={'cell_type': 'category', 'source': 'str'})
        .assign(id=path.stem)
        .rename_axis('cell_id')
    )

paths_train = list((data_dir / 'train').glob('*.json'))[:NUM_TRAIN]
notebooks_train = [
    read_notebook(path) for path in tqdm(paths_train, desc='Train NBs')
]
df = (
    pd.concat(notebooks_train)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
)

df

In [6]:
# Get an example notebook
nb_id = df.index.unique('id')[6]
print('Notebook:', nb_id)

print("The disordered notebook:")
nb = df.loc[nb_id, :]
display(nb)
print()

In [7]:

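# `squeeze=True` was deprecated in pandas 1.4 and removed in 2.0;
# squeezing the single-column frame afterwards gives the same Series.
df_orders = pd.read_csv(
    data_dir / 'train_orders.csv',
    index_col='id',
).squeeze("columns").str.split()  # Split the string representation of cell_ids into a list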

df_orders

In [8]:
len(df_orders.loc["002ba502bdac45"])

In [9]:
cell_order = df_orders.loc[nb_id]

print("The ordered notebook:")
nb.loc[cell_order, :]

In [10]:
def get_ranks(base, derived):
    return [base.index(d) for d in derived]

cell_ranks = get_ranks(cell_order, list(nb.index))
nb.insert(0, 'rank', cell_ranks)

nb

In [11]:
df_orders_ = df_orders.to_frame().join(
    df.reset_index('cell_id').groupby('id')['cell_id'].apply(list),
    how='right',
)

ranks = {}
for id_, cell_order, cell_id in df_orders_.itertuples():
    ranks[id_] = {'cell_id': cell_id, 'rank': get_ranks(cell_order, cell_id)}

df_ranks = (
    pd.DataFrame
    .from_dict(ranks, orient='index')
    .rename_axis('id')
    .apply(pd.Series.explode)
    .set_index('cell_id', append=True)
)

df_ranks

In [12]:
df_ancestors = pd.read_csv(data_dir / 'train_ancestors.csv', index_col='id')
df_ancestors

In [13]:
df = df.reset_index().merge(df_ranks, on=["id", "cell_id"]).merge(df_ancestors, on=["id"])
df

In [14]:
df["pct_rank"] = df["rank"] / df.groupby("id")["cell_id"].transform("count")
df["pct_rank"].hist(bins=10)

In [15]:
dict_cellid_source = dict(zip(df['cell_id'].values, df['source'].values))
import os
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import WordNetLemmatizer
import nltk; nltk.download('wordnet')

stemmer = WordNetLemmatizer()

def preprocess_text(document):
        # Remove all the special characters
        document = re.sub(r'\W', ' ', str(document))

        # remove all single characters
        document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

        # Remove single characters from the start
        document = re.sub(r'\^[a-zA-Z]\s+', ' ', document)

        # Substituting multiple spaces with single space
        document = re.sub(r'\s+', ' ', document, flags=re.I)

        # Removing prefixed 'b'
        document = re.sub(r'^b\s+', '', document)

        # Converting to Lowercase
        document = document.lower()
        #return document

        # Lemmatization
        tokens = document.split()
        tokens = [stemmer.lemmatize(word) for word in tokens]
        tokens = [word for word in tokens if len(word) > 3]

        preprocessed_text = ' '.join(tokens)
        return preprocessed_text

    
def preprocess_df(df):
    """
    This function is for processing the source of a notebook
    returns preprocessed dataframe
    """
    return [preprocess_text(message) for message in df.source]

df.source = df.source.apply(preprocess_text)

In [16]:
from sklearn.model_selection import GroupShuffleSplit

NVALID = 0.1  # size of validation set

splitter = GroupShuffleSplit(n_splits=1, test_size=NVALID, random_state=0)

train_ind, val_ind = next(splitter.split(df, groups=df["ancestor_id"]))

train_df = df.loc[train_ind].reset_index(drop=True)
val_df = df.loc[val_ind].reset_index(drop=True)

In [17]:
from tqdm.notebook import tqdm

def generate_triplet(df, mode='train'):
  triplets = []
  ids = df.id.unique()
  random_drop = np.random.random(size=10000)>0.9
  count = 0

  for id, df_tmp in tqdm(df.groupby('id')):
    df_tmp_markdown = df_tmp[df_tmp['cell_type']=='markdown']

    df_tmp_code = df_tmp[df_tmp['cell_type']=='code']
    df_tmp_code_rank = df_tmp_code['rank'].values
    df_tmp_code_cell_id = df_tmp_code['cell_id'].values

    for cell_id, rank in df_tmp_markdown[['cell_id', 'rank']].values:
      labels = np.array([(r==(rank+1)) for r in df_tmp_code_rank]).astype('int')

      for cid, label in zip(df_tmp_code_cell_id, labels):
        count += 1
        if label==1:
          triplets.append( [cell_id, cid, label] )
          # triplets.append( [cid, cell_id, label] )
        elif mode == 'test':
          triplets.append( [cell_id, cid, label] )
          # triplets.append( [cid, cell_id, label] )
        elif random_drop[count%10000]:
          triplets.append( [cell_id, cid, label] )
          # triplets.append( [cid, cell_id, label] )
    
  return triplets

triplets = generate_triplet(train_df)
val_triplets = generate_triplet(val_df, mode = 'test')

In [18]:
val_df.head()

In [19]:
from bisect import bisect


def count_inversions(a):
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions


def kendall_tau(ground_truth, predictions):
    total_inversions = 0
    total_2max = 0  # twice the maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max
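
For intuition about the metric, here is a brute-force sketch for a single notebook. `kendall_tau_single` is a hypothetical helper that is not part of the original notebooks; it counts inversions pair by pair in O(n²) instead of using `bisect`, and it assumes the notebook has at least two cells.

```python
from itertools import combinations


def kendall_tau_single(gt, pred):
    """Kendall tau for one notebook, using an O(n^2) pairwise inversion count."""
    rank = {cell: i for i, cell in enumerate(gt)}
    ranks = [rank[cell] for cell in pred]  # predicted order expressed in ground-truth ranks
    # A pair is an inversion when the prediction puts it in the opposite order.
    inversions = sum(1 for a, b in combinations(ranks, 2) if a > b)
    n = len(gt)
    return 1 - 4 * inversions / (n * (n - 1))


gt = ["a", "b", "c", "d"]
perfect = kendall_tau_single(gt, ["a", "b", "c", "d"])    # no inversions
reversed_ = kendall_tau_single(gt, ["d", "c", "b", "a"])  # every pair inverted
one_swap = kendall_tau_single(gt, ["b", "a", "c", "d"])   # a single inverted pair
print(perfect, reversed_, one_swap)
```

A perfect ordering scores 1, a fully reversed one scores -1, and each additional inverted pair lowers the score by `4 / (n * (n - 1))`.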

In [20]:
# Only the imports actually used below; the deprecated AutoModelWithLMHead is dropped.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MAX_LEN = 128

    
class MarkdownModel(nn.Module):
    def __init__(self):
        super(MarkdownModel, self).__init__()
        self.distill_bert = AutoModel.from_pretrained("../input/mymodelpairbertsmallpretrained/models/checkpoint-18000")
        self.top = nn.Linear(512, 1)

        self.dropout = nn.Dropout(0.2)
        
    def forward(self, ids, mask):
        x = self.distill_bert(ids, mask)[0]
        x = self.dropout(x)
        x = self.top(x[:, 0, :])
        x = torch.sigmoid(x) 
        return x

In [21]:
from torch.utils.data import DataLoader, Dataset



class MarkdownDataset(Dataset):
    
    def __init__(self, df, max_len, mode='train'):
        super().__init__()
        self.df = df
        self.max_len = max_len
        self.tokenizer = AutoTokenizer.from_pretrained("../input/mymodelpairbertsmallpretrained/my_own_tokenizer", do_lower_case=True)
        self.mode=mode

    def __getitem__(self, index):
        row = self.df[index]

        label = row[-1]

        txt = dict_cellid_source[row[0]] + '[SEP]' + dict_cellid_source[row[1]]

        inputs = self.tokenizer.encode_plus(
            txt,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding="max_length",
            return_token_type_ids=True,
            truncation=True
        )
        ids = torch.LongTensor(inputs['input_ids'])
        mask = torch.LongTensor(inputs['attention_mask'])

        return ids, mask, torch.FloatTensor([label])

    def __len__(self):
        return len(self.df)


In [22]:
def adjust_lr(optimizer, epoch):
    if epoch < 1:
        lr = 5e-5
    elif epoch < 2:
        lr = 1e-3
    elif epoch < 5:
        lr = 1e-4
    else:
        lr = 1e-5

    for p in optimizer.param_groups:
        p['lr'] = lr
    return lr
    
def get_optimizer(net):
    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=3e-4, betas=(0.9, 0.999), eps=1e-08)
    return optimizer

BS = 128
NW = 8

In [23]:
import sys

def read_data(data):
    return tuple(d.cuda() for d in data[:-1]), data[-1].cuda()

def validate(model, val_loader, mode='train'):
    """Run inference. Returns preds in 'test' mode, otherwise (labels, preds)."""
    model.eval()
    
    tbar = tqdm(val_loader, file=sys.stdout)
    
    preds = np.zeros(len(val_loader.dataset), dtype='float32')
    labels = []
    count = 0

    with torch.no_grad():
        for idx, data in enumerate(tbar):
            inputs, target = read_data(data)

            pred = model(inputs[0], inputs[1]).detach().cpu().numpy().ravel()

            preds[count:count+len(pred)] = pred
            count += len(pred)
            
            # Labels are only needed when we are not in test mode
            if mode != 'test':
              labels.append(target.detach().cpu().numpy().ravel())
    if mode=='test':
      return preds
    else:
      # preds is already a single flat array, so only the labels need concatenating
      return np.concatenate(labels), preds

In [24]:
paths_test = list((data_dir / 'test').glob('*.json'))
notebooks_test = [
    read_notebook(path) for path in tqdm(paths_test, desc='Test NBs')
]
test_df = (
    pd.concat(notebooks_test)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
).reset_index()

In [25]:
test_df.source = test_df.source.apply(preprocess_text)
dict_cellid_source = dict(zip(test_df['cell_id'].values, test_df['source'].values))
test_df["rank"] = test_df.groupby(["id", "cell_type"]).cumcount()
test_df["pred"] = test_df.groupby(["id", "cell_type"])["rank"].rank(pct=False)
test_triplets = generate_triplet(test_df, mode = 'test')

In [26]:
test_df["pct_rank"] = 0
test_ds = MarkdownDataset(test_triplets, max_len=MAX_LEN)
test_loader = DataLoader(test_ds, batch_size=BS * 4, shuffle=False, num_workers=NW, pin_memory=False, drop_last=False)

import gc 
gc.collect()
len(test_ds), test_ds[0]

In [27]:
import sys 

model = MarkdownModel()
model = model.cuda()
model.load_state_dict(torch.load('../input/mymodelbertsmallpretrained120000/my_own_model.bin'))
y_test = validate(model, test_loader, mode='test')

In [28]:
preds_copy = y_test
pred_vals = []
count = 0
for id, df_tmp in tqdm(test_df.groupby('id')):
  df_tmp_mark = df_tmp[df_tmp['cell_type']=='markdown']
  df_tmp_code = df_tmp[df_tmp['cell_type']!='markdown']
  df_tmp_code_rank = df_tmp_code['rank'].rank().values
  N_code = len(df_tmp_code_rank)
  N_mark = len(df_tmp_mark)

  preds_tmp = preds_copy[count:count+N_mark * N_code]

  count += N_mark * N_code

  for i in range(N_mark):
    pred = preds_tmp[i*N_code:i*N_code+N_code] 

    # Sharpened softmax over the pairwise scores (the factor 20 acts as an inverse temperature)
    weights = np.exp((pred - np.mean(pred)) * 20)
    softmax = weights / np.sum(weights)

    rank = np.sum(softmax * df_tmp_code_rank)
    pred_vals.append(rank)

del model
del test_triplets[:]
del dict_cellid_source
gc.collect()
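
The cell above turns the pairwise scores of one markdown cell into a single position by taking a softmax-weighted average of the code-cell ranks. A minimal sketch of that step on made-up scores follows; `expected_code_rank` and `sharpness` are illustrative names that do not appear in the original notebook, and the default of 20 simply mirrors the factor used above.

```python
import numpy as np


def expected_code_rank(scores, code_ranks, sharpness=20.0):
    """Softmax-weighted average of code-cell ranks for one markdown cell."""
    scores = np.asarray(scores, dtype=float)
    code_ranks = np.asarray(code_ranks, dtype=float)
    logits = (scores - scores.mean()) * sharpness
    logits -= logits.max()  # numerical stability only; the softmax is unchanged
    weights = np.exp(logits)
    weights /= weights.sum()
    return float(np.sum(weights * code_ranks))


code_ranks = [1, 2, 3, 4]
# One code cell clearly preferred: the result lands very close to its rank.
confident = expected_code_rank([0.01, 0.02, 0.98, 0.03], code_ranks)
# No preference at all: the result falls back to the mean rank.
unsure = expected_code_rank([0.5, 0.5, 0.5, 0.5], code_ranks)
print(confident, unsure)
```

A larger `sharpness` pushes the result towards the rank of the single best-scoring code cell, while a smaller one pulls every markdown cell towards the middle of the notebook.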

In [29]:
test_df.loc[test_df["cell_type"] == "markdown", "pred"] = pred_vals
sub_df = test_df.sort_values("pred").groupby("id")["cell_id"].apply(lambda x: " ".join(x)).reset_index()
sub_df.rename(columns={"cell_id": "cell_order"}, inplace=True)
sub_df.to_csv("submission_2.csv", index=False)

_____

## Rank Ensemble

###### (finally)

And now for the moment we have all been waiting for: **Ensembling rank-based submissions.**

But how are we going to do this?

- Let's say that we have two different submissions, "submission_1.csv" and "submission_2.csv", each containing one ordered, space-separated list of cell ids per row.
- We would like to create a new submission in which each row contains an ordered list that aggregates the ordered lists in the same row of both submissions.
- To do this, we simply ensemble the indices. The index of a string is nothing but its rank: its position in the predicted order, from first to last.
- Then we sort the strings by their ensembled index.
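
The steps above can be sketched as a small function. This is only an illustration under stated assumptions: `ensemble_orders` and `weight_a` are hypothetical names, the toy frames are made up, and unlike the ensembling cell in this notebook (which pairs rows by position) the sketch pairs rows by `id` so that the two files do not need to share a row order.

```python
import pandas as pd


def ensemble_orders(sub_a, sub_b, weight_a=0.5):
    """Blend two submissions (columns: id, cell_order) by weighted cell position."""
    # Inner merge on id; ids missing from either submission are dropped.
    merged = sub_a.merge(sub_b, on="id", suffixes=("_a", "_b"), validate="one_to_one")
    blended = []
    for order_a, order_b in zip(merged["cell_order_a"], merged["cell_order_b"]):
        pos_a = {cell: i for i, cell in enumerate(order_a.split())}
        pos_b = {cell: i for i, cell in enumerate(order_b.split())}
        if pos_a.keys() != pos_b.keys():
            raise ValueError("submissions disagree on the set of cells")
        # Weighted average of the two positions, then sort by it.
        score = {c: weight_a * pos_a[c] + (1 - weight_a) * pos_b[c] for c in pos_a}
        blended.append(" ".join(sorted(score, key=score.get)))
    return pd.DataFrame({"id": merged["id"], "cell_order": blended})


# Toy submissions; note that the rows are deliberately in a different order.
sub_a = pd.DataFrame({"id": ["nb1", "nb2"], "cell_order": ["m1 c1 c2", "c1 m1"]})
sub_b = pd.DataFrame({"id": ["nb2", "nb1"], "cell_order": ["m1 c1", "c1 c2 m1"]})
result = ensemble_orders(sub_a, sub_b, weight_a=0.25)
print(result)
```

With `weight_a=0.25` the second submission dominates, so both toy notebooks end up in the order proposed by `sub_b`.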

**Reading the submissions**

In [30]:
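# Note the swapped numbering: df_1 holds the pairwise model's output (submission_2.csv),
# df_2 holds the CodeBERT baseline's output (submission_1.csv).
df_1 = pd.read_csv('submission_2.csv')
df_2 = pd.read_csv('submission_1.csv')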

**Averaging the indices and sorting the resulting submission by the aggregated ensembled indices**

In [31]:
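# Rows are paired by position, so both files must list the notebook ids in the same order.
new_samples = []
for sample_idx in range(len(df_1)):
    # Map each cell id to its position in the respective submission
    sample_1 = {k: v for v, k in enumerate(df_1.iloc[sample_idx]['cell_order'].split(' '))}
    sample_2 = {k: v for v, k in enumerate(df_2.iloc[sample_idx]['cell_order'].split(' '))}
    # Weighted average of the two positions: 0.252 for df_1, 0.748 for df_2
    for key in sample_1: sample_1[key] = ( (sample_1[key] * 0.252) + (sample_2[key] * 0.748) )
    # Sort the cell ids by their blended position
    new_samples.append(' '.join([i[0] for i in list(sorted(sample_1.items(), key=lambda x:x[1]))]))
df_1['cell_order'] = new_samples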

**Saving as output so we can submit**

In [32]:
df_1.to_csv('submission.csv', index = False)
df_1

In [34]:
df