**This notebook determines the parameters of the ensemble model via cross-validation.**
--------------------
## 1) Model details:
| Index| Model Flag    | Method |   Pretrain step | Finetune step | DCG on leaderboard | 
| --------| -------- | ------- | ------- | ------- | ------- | 
| 1| large_group2_wwm_from_unw4625K | M1 | 1700000 | 5130 | 11.96214 |
| 2| large_group2_wwm_from_unw4625K | M1 | 1700000 | 4180 | NAN |
| 3| base_group2_wwm | M2 | 2150000 | 5130 | ~11.32363 |
| 4| large_group2_wwm_from_unw4625K | M1 | 590000 | 5130 | 11.94845 |
| 5| large_group2_wwm_from_unw4625K | M1 | 1700000 | 4180 | NAN |
| 6| large_group2_mt_pretrain | M3 | 1940000 | 5130 | NAN |

## 2) Method details

| Method  | Model Layers |   Details |
| -------- | ------- | ------- |
| M1 | 24 | WWM & CTR prediction as pretraining tasks|
| M2 | 12 | WWM & CTR prediction as pretraining tasks |
| M3 | 24 | WWM & Multi-task CTR prediction as pretraining tasks|


In [9]:
valid_sparse_feat_addr = './features/sparse_features/annotation_data_0522.sparse.feat.undup'#.undup
test_sparse_feat_addr = './features/sparse_features/wsdm_test_2_all.sparse.feat.undup'

In [10]:
def get_bert_feat_addr(model_index, data_type):
    prefix = 'validation.' if data_type == 'valid' else ''
    addr = f'./features/bert_features/model{model_index}/{prefix}result.csv'
    return addr
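
As a quick check of the naming convention, the sketch below repeats the helper and prints the paths it builds for model 1. Any `data_type` other than `'valid'` yields the un-prefixed file name; the string `'test'` is only an illustrative choice here.

```python
def get_bert_feat_addr(model_index, data_type):
    # Validation scores carry a 'validation.' prefix; everything else does not.
    prefix = 'validation.' if data_type == 'valid' else ''
    return f'./features/bert_features/model{model_index}/{prefix}result.csv'

valid_addr = get_bert_feat_addr(1, 'valid')
test_addr = get_bert_feat_addr(1, 'test')  # 'test' is an arbitrary non-'valid' value
print(valid_addr)  # ./features/bert_features/model1/validation.result.csv
print(test_addr)   # ./features/bert_features/model1/result.csv
```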

In [11]:
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from pandas import DataFrame
from collections import defaultdict

In [12]:
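def load_as_csv(data_addr, size, with_label=True):
    """Load a `label qid:<id> <idx>:<value> ...` feature file into a DataFrame.

    `size` is the number of sparse features (columns f1..f{size}).
    Note: `with_label` is currently unused; the label column is always parsed.
    """
    data = {'label': [], 'qid': []}
    data.update({f'f{i+1}': [] for i in range(size)})
    with open(data_addr) as fin:
        for line in tqdm(fin):
            splits = line.strip().split()
            label = splits[0]
            qid = splits[1].split(':')[1]
            data['label'].append(int(label))
            data['qid'].append(int(qid))
            # Remaining tokens are `<feature index>:<value>` pairs.
            for i, f in (s.split(':') for s in splits[2:]):
                data[f'f{i}'].append(float(f))

    return DataFrame(data)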
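
To make the expected file layout concrete, here is a minimal sketch that parses a single row the same way the loader does. The row format (`label qid:<id> <idx>:<value> ...`) is inferred from the parsing code above; the example row and the helper name `parse_sparse_line` are made up for illustration.

```python
def parse_sparse_line(line):
    """Parse one `label qid:<id> <idx>:<value> ...` row into a flat dict."""
    splits = line.strip().split()
    row = {'label': int(splits[0]), 'qid': int(splits[1].split(':')[1])}
    # Remaining tokens are `<feature index>:<value>` pairs.
    for idx, value in (s.split(':') for s in splits[2:]):
        row[f'f{idx}'] = float(value)
    return row

# Made-up example row with three features (not taken from the real data).
example = "2 qid:17 1:0.5 2:3.0 3:0.0"
row = parse_sparse_line(example)
print(row)  # {'label': 2, 'qid': 17, 'f1': 0.5, 'f2': 3.0, 'f3': 0.0}
```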
            
    

In [13]:
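def get_merge_feat_csv(sparse_feat_addr, *args):
    """Merge the sparse features with one BERT score file per model in `args`.

    Sparse features fill columns f1..f6; the score from the k-th file in `args`
    (k starting at 1) goes to column f{6+k}. Each score file must hold one float
    per line, aligned with the rows of the sparse feature file.
    """
    n_sparse = 6  # number of sparse features per row
    # data[k][i] is the score from the k-th file for row i
    data = []
    for addr in args:
        with open(addr) as fin:
            data.append([float(line.strip()) for line in fin])

    df = defaultdict(list)
    df.update({f'f{i+1}': [] for i in range(n_sparse)})
    with open(sparse_feat_addr) as fin:
        for i, line in enumerate(fin):
            splits = line.strip().split()
            label = splits[0]
            qid = splits[1].split(':')[1]
            if '.' in qid:
                # Sanity check: a non-integer qid means a malformed row; stop reading.
                print(splits)
                break
            df['label'].append(int(label))
            df['qid'].append(int(qid))
            for j, f in (s.split(':') for s in splits[2:]):
                df[f'f{j}'].append(float(f))
            for idx in range(len(data)):
                df[f'f{n_sparse + idx + 1}'].append(data[idx][i])
    return DataFrame(df)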
    
                              

In [14]:
feat_addrs = [get_bert_feat_addr(i, 'valid') for i in [1,2,3,4,6]]
all_data_df = get_merge_feat_csv(valid_sparse_feat_addr,*feat_addrs)

In [19]:
import lightgbm
def split_train_valid(df, train_ratio=0.8):
    qids = df['qid'].unique()
    np.random.seed(2023)
    np.random.shuffle(qids)
    split_point = int(len(qids) * train_ratio)
    train_qids = set(qids[:split_point])
    valid_qids = set(qids[split_point:])
    
    train_df = df.loc[df['qid'].isin(train_qids)]
    valid_df = df.loc[df['qid'].isin(valid_qids)]
    return train_df,valid_df

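def prepare_data(df):
    # LightGBM expects group sizes in row order, so keep the rows of each
    # query contiguous and count them in that same (sorted) order.
    df = df.sort_values("qid", kind="stable")
    qids_group = df.groupby("qid")["qid"].count().to_numpy()
    X = df.drop(["qid", "label"], axis=1)
    # Relevance labels
    y = df['label'].astype(int)
    return X, y, qids_group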

def select_train(df, train_ratio):
    # If random_state is None, default seeds in the C++ code are used.
    ranker = lightgbm.LGBMRanker(
        objective="lambdarank",
        boosting_type="gbdt",
        n_estimators=500,
        importance_type="gain",
        metric="ndcg",
        num_leaves=30,
        learning_rate=0.05,
        max_depth=-1,
    )

    train_df, valid_df = split_train_valid(df, train_ratio)
    x_train, y_train, qid_group_train = prepare_data(train_df)
    x_valid, y_valid, qid_group_valid = prepare_data(valid_df)
    ranker.fit(
        X=x_train,
        y=y_train,
        group=qid_group_train,
        eval_set=[(x_valid, y_valid)],
        eval_group=[qid_group_valid],
        eval_at=[10],
    )
    return ranker
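
`LGBMRanker.fit` takes `group` as a list of consecutive group sizes rather than per-row query ids, so the sizes have to follow the row order and sum to the number of rows. The plain-Python sketch below illustrates that contract on made-up qids; the helper name `group_sizes` is hypothetical and not part of the notebook.

```python
from itertools import groupby

def group_sizes(qids):
    """Return the sizes of consecutive runs of equal qids, in row order."""
    return [sum(1 for _ in run) for _, run in groupby(qids)]

# Made-up qids, one per row, with rows of the same query kept contiguous.
qids = [7, 7, 7, 3, 3, 9]
sizes = group_sizes(qids)
print(sizes)  # [3, 2, 1]

# The sizes must cover every row exactly once.
assert sum(sizes) == len(qids)
```

Counting rows per qid in sorted qid order would give `[2, 3, 1]` for this example, which only matches the row order if the rows are sorted by `qid` first.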

In [20]:
train_df,valid_df = split_train_valid(all_data_df, train_ratio=0.8)

In [21]:
sranker = select_train(valid_df, 0.8)
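# Results for different settings (epoch = n_estimators, leave = num_leaves, lr = learning_rate);
# *** marks the best run for each seed.
#seed = 123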
#epoch=500, leave=30, lr=0.05, 0.821512
#epoch=100, leave=30, lr=0.05, 0.821748
#epoch=100, leave=10, lr=0.05, 0.826392 ***
#epoch=100, leave=20, lr=0.05, 0.826137
#epoch=100, leave=10, lr=0.01, 0.825981

#seed = 2023
#epoch=500, leave=30, lr=0.05, 0.811326
#epoch=100, leave=30, lr=0.05, 0.817943
#epoch=100, leave=10, lr=0.05, 0.820679 ***
#epoch=100, leave=20, lr=0.05, 0.814458
#epoch=100, leave=10, lr=0.01, 0.816119

[1]	valid_0's ndcg@10: 0.79007
[2]	valid_0's ndcg@10: 0.807151
[3]	valid_0's ndcg@10: 0.805741
[4]	valid_0's ndcg@10: 0.807598
[5]	valid_0's ndcg@10: 0.808061
[6]	valid_0's ndcg@10: 0.809606
[7]	valid_0's ndcg@10: 0.808922
[8]	valid_0's ndcg@10: 0.806076
[9]	valid_0's ndcg@10: 0.806653
[10]	valid_0's ndcg@10: 0.810297
[11]	valid_0's ndcg@10: 0.809622
[12]	valid_0's ndcg@10: 0.811707
[13]	valid_0's ndcg@10: 0.813054
[14]	valid_0's ndcg@10: 0.812945
[15]	valid_0's ndcg@10: 0.815191
[16]	valid_0's ndcg@10: 0.812636
[17]	valid_0's ndcg@10: 0.813891
[18]	valid_0's ndcg@10: 0.815498
[19]	valid_0's ndcg@10: 0.815169
[20]	valid_0's ndcg@10: 0.816988
[21]	valid_0's ndcg@10: 0.817538
[22]	valid_0's ndcg@10: 0.815596
[23]	valid_0's ndcg@10: 0.81484
[24]	valid_0's ndcg@10: 0.815743
[25]	valid_0's ndcg@10: 0.815495
[26]	valid_0's ndcg@10: 0.816304
[27]	valid_0's ndcg@10: 0.814687
[28]	valid_0's ndcg@10: 0.8168
[29]	valid_0's ndcg@10: 0.816603
[30]	valid_0's ndcg@10: 0.816264
[31]	valid_0's ndcg@10:
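
For readers unfamiliar with the `ndcg@10` metric in the log, here is a minimal sketch of NDCG@k for a single query. It assumes an exponential gain of `2**label - 1` and a `log2` position discount, which is a common definition; treat the exact gain, the tie handling, and the score for queries without any relevant document as assumptions rather than a reproduction of LightGBM's implementation. The function names are hypothetical.

```python
import math

def dcg_at_k(labels_in_ranked_order, k=10):
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(labels_in_ranked_order[:k]))

def ndcg_at_k(labels, scores, k=10):
    # Rank documents by descending model score, then compare with the ideal order.
    ranked = [rel for _, rel in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    # Convention chosen for this sketch: 0.0 when no document is relevant.
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

labels = [3, 2, 0]  # made-up relevance labels for one query
perfect = ndcg_at_k(labels, scores=[0.9, 0.5, 0.1])
reversed_order = ndcg_at_k(labels, scores=[0.1, 0.5, 0.9])
print(perfect, reversed_order)
```

The reported value is then the mean of the per-query scores over all validation queries.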

**Similarly, we choose the models based on cross-validation. Our experimental results are listed below for reference. In the end, we ignore model 5 because it has a negative effect. Model indices in the table are zero-based, so model 5 corresponds to index 4.**
-------------

| Leaves    | Models |   NDCG@10 | 
| -------- | ------- | ------- |
| 10  | [0]    | 0.7790606982681543    |
| 10 | [1]     | 0.7791298174754135    |
| 10  | [5]    | 0.781309490583906    |
| 10    | [0,1]    | 0.7884139940363432    |
| 10    | [0,1,2]    | 0.7883411766547597   |
| 10    | [0,1,2,3]    | 0.7925544769079671  |
| 10    | [0,1,3]    | 0.792395820505957  |
| 10    | [0,1,3,4]    | 0.7921549115058365  |
| 10    | [0,1,2,3,4]    | 0.792328979445144  |
| 10    | [0,1,3,5]    |  0.7955011289453335 |
| 10    | [0,1,2,3,5]    | 0.7957985512498594  |
| 12    | [0,1,2,3,5] | **0.7959779139881844** |
| 20    | [0,1,2,3,5] | **0.7964698912399729** |
| 30    | [0,1,2,3,5] | **0.7966145678395755** |
| 10    | [0,1,2,3,4,5]    | 0.7956787267202614  |
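
The selection step can be expressed as picking the subset with the highest NDCG@10. The sketch below copies the 10-leaf rows from the table above into a dictionary and takes the maximum; the variable names are illustrative, and the scores are the table values, not recomputed.

```python
# NDCG@10 for each model subset, copied from the rows with 10 leaves.
results_10_leaves = {
    (0,): 0.7790606982681543,
    (1,): 0.7791298174754135,
    (5,): 0.781309490583906,
    (0, 1): 0.7884139940363432,
    (0, 1, 2): 0.7883411766547597,
    (0, 1, 2, 3): 0.7925544769079671,
    (0, 1, 3): 0.792395820505957,
    (0, 1, 3, 4): 0.7921549115058365,
    (0, 1, 2, 3, 4): 0.792328979445144,
    (0, 1, 3, 5): 0.7955011289453335,
    (0, 1, 2, 3, 5): 0.7957985512498594,
    (0, 1, 2, 3, 4, 5): 0.7956787267202614,
}

best_subset = max(results_10_leaves, key=results_10_leaves.get)
print(best_subset)  # (0, 1, 2, 3, 5)

# Effect of adding index 4 on top of the best subset (negative means it hurts).
delta = results_10_leaves[(0, 1, 2, 3, 4, 5)] - results_10_leaves[best_subset]
print(delta < 0)  # True
```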