# Yahoo Data Processing

Testing the performance of various feature configurations when using DeltaMART with the Yahoo LETOR dataset (train/validation/test split):

https://github.com/QingyaoAi/Unbiased-Learning-to-Rank-with-Unbiased-Propensity-Estimation

Only a small random subset of queries is used (10 for training, 5 for validation).

In [1]:
import data_utils  # To load Yahoo dataset
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from scipy.special import expit  # Logistic function
from rank_metrics import ndcg_at_k, mean_average_precision

## Exploration

### Raw data

#### Features: train.feature

Description: "test_2_5" means the 5th document for the query with identifier "2" in the original test set of the Yahoo letor data.

Interpretation: first value is test_queryNum_docNum, rest are feature values (svm_light format?)
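
As a small illustration of the identifier convention described above, the sketch below splits a document id into its parts. `parse_did` is a hypothetical helper, not part of `data_utils`, and it assumes every id has the form `<split>_<query>_<doc>`.

```python
def parse_did(did):
    """Split an id such as 'test_2_5' into (split, query_id, doc_index)."""
    split, query_id, doc_index = did.split("_")
    return split, int(query_id), int(doc_index)


print(parse_did("test_2_5"))
```

The same middle field is what the notebook later extracts to align feature rows with query ids.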

****

#### Labels: train.weights

Description: The annotated relevance value for documents in the initial list of each query.

Interpretation: first value is queryNum (query_id), rest are labels for URLs at corresponding indexes

In [None]:
data = data_utils.read_data(data_path='/Users/Ashtekar15/Desktop/Thesis/MGBoost/other/test_data/generate_dataset/',
                            file_prefix='valid')

### Stats for validation set

**dids** (71083): valid_19945_15..., stores query/URL id info

**qids** (2994): stores query id info

**features** (71083): list of lists, each sublist is a given query/URL pair (sublist len 700)

**gold_weights** (2994): list of lists, each sublist is the labels for URLs of a single query (sublist len varies)
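
The counts above suggest one feature row per query/URL pair and one label list per query. A quick consistency check along those lines might look like the sketch below. `check_alignment` is a hypothetical helper, and the toy data only mimics the layout of the loaded dataset; whether the real files pass this check is not verified here.

```python
def check_alignment(dids, qids, gold_weights):
    """Return True if every query has as many labels as feature rows."""
    if len(qids) != len(gold_weights):
        return False
    # Count feature rows per query id, taken from the middle field of each did
    rows_per_query = {}
    for did in dids:
        qid = did.split("_")[1]
        rows_per_query[qid] = rows_per_query.get(qid, 0) + 1
    # Compare the row count with the label count, query by query
    return all(rows_per_query.get(str(q), 0) == len(w)
               for q, w in zip(qids, gold_weights))


# Toy data in the same layout as the loaded dataset
toy_dids = ["valid_7_0", "valid_7_1", "valid_9_0"]
toy_qids = ["7", "9"]
toy_gold = [[2.0, 0.0], [1.0]]
print(check_alignment(toy_dids, toy_qids, toy_gold))
```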

### Planning

In [None]:
# Num queries in train/val/test (should be 29921)
19944 + 2994 + 6983

In [None]:
# Get info on number of URLs/query
lls = [len(ls) for ls in data.gold_weights]

np.mean(lls), np.std(lls), min(lls), max(lls)

In [None]:
# Solve for num_queries
num_queries = 10

mean = np.mean(lls)

# (Estimated) total size in GB with given num_queries
# 8 bytes per float64 value
(700 * 3) * (mean ** 2) * num_queries * 8 / (10 ** 9)

In [None]:
# To get total number of values FOR ENTIRE TRAIN SET thru feature generation
total = 0
for ls in data.gold_weights:
    total += (700 * 3) * (len(ls) ** 2)

# Estimation of total size in GB (8 bytes per float64 value)
(total * 8) / (10 ** 9)
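
Both estimates above rest on the same arithmetic: a query with m URLs yields m² ordered pairs, and each pair stores three blocks of 700 feature values. The sketch below wraps that in a function. `pairwise_size_gb` is a hypothetical helper; it counts bytes per value (8 for float64, 4 for float32) and ignores the query id and label columns.

```python
import numpy as np


def pairwise_size_gb(list_lengths, n_features=700, n_blocks=3, bytes_per_value=8):
    """Estimate the size in GB of the pairwise feature matrix."""
    lengths = np.asarray(list_lengths, dtype=np.int64)
    n_pairs = int(np.sum(lengths ** 2))          # m^2 ordered pairs per query
    n_values = n_pairs * n_features * n_blocks   # feature values across all pairs
    return n_values * bytes_per_value / 1e9


# Toy example: two queries with 10 and 20 URLs
print(pairwise_size_gb([10, 20]))
print(pairwise_size_gb([10, 20], bytes_per_value=4))
```

Summing the squared list lengths is exact. Using the squared mean length instead underestimates the total whenever list lengths vary.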

**Plan:**
- Convert np arrays to np.float32 (?)
    - To save memory/use more training data
- Choose 10 queries randomly
    - Use seed
- Get features and labels corresponding to these queries
    - Include query id in features (?)
    - data.features, data.gold_weights
- Generate pairwise features
    - Include option for delta_features
- Build model/make predictions on validation data

## Data Preparation

In [None]:
# Convert lists to np arrays for faster access

# String
dids = np.array(data.dids)

# String -> int
qids = np.array(data.qids, dtype=int)

# float64 -> float32
features = np.array(data.features, dtype=np.float32)

# Sublists differ in length, so store them as an object array of float32 arrays
gold_weights = np.array([np.array(x, dtype=np.float32) for x in data.gold_weights], dtype=object)

In [None]:
""" SAVING ENTIRE DATASET """
# folder = 'Yahoo_Train//'

# np.save(folder + 'dids.npy', dids)
# np.save(folder + 'qids.npy', qids)
# np.save(folder + 'features.npy', features)
# np.save(folder + 'gold_weights.npy', gold_weights)

### Run imports and cells below

In [139]:
""" LOADING ENTIRE DATASET """
folder = 'Yahoo_Val//'

dids = np.load(folder + 'dids.npy')
qids = np.load(folder + 'qids.npy')
features = np.load(folder + 'features.npy')
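# Ragged label lists are stored as an object array, which needs allow_pickle=True to load
gold_weights = np.load(folder + 'gold_weights.npy', allow_pickle=True)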
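
Because the label lists have different lengths, `gold_weights` is a ragged array. Recent NumPy releases only build such arrays when `dtype=object` is requested, and `np.load` only reads them back with `allow_pickle=True`. The round trip can be sketched as follows, using a temporary directory and toy data rather than the real files:

```python
import os
import tempfile

import numpy as np

# Toy ragged labels: one sub-array per query, lengths differ
ragged = np.empty(2, dtype=object)
ragged[0] = np.array([2, 0, 1], dtype=np.float32)
ragged[1] = np.array([1, 3], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gold_weights.npy")
    np.save(path, ragged)                        # object arrays are pickled on save
    restored = np.load(path, allow_pickle=True)  # np.load refuses pickles by default

print([w.tolist() for w in restored])
```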

In [140]:
""" SET SIZE, SEED FOR RANDOM QUERY SELECTION HERE """
np.random.seed(2)
size = 5

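# Randomly select 10 queries (train) or 5 queries (val)
# Note: np.random.choice samples with replacement by default, so a query can be drawn twice;
# passing replace=False avoids that but changes which queries a given seed selects
q_choice = np.random.choice(qids, size=size)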

# Get query id aligned with features
query_id = np.array([int(ele.split("_")[1]) for ele in dids])

In [141]:
# Get relevant queries, features, and labels
q_rel = query_id[np.isin(query_id, q_choice)]
feat_rel = features[np.isin(query_id, q_choice)]
label_rel = gold_weights[np.isin(qids, q_choice)]

# Join subarrays
label_rel = np.concatenate(label_rel)

# Include query id in features
feat_rel = np.hstack((q_rel.reshape(-1, 1), feat_rel))
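
The selection above relies on boolean masks built with `np.isin`: one mask over feature rows (one entry per query/URL pair) and one over queries. A toy version with made-up ids and two features per row shows how the pieces line up. It assumes, as the notebook does, that label lists are stored in the same query order as the feature rows.

```python
import numpy as np

# Toy data: three queries with 2, 1 and 2 URLs
query_id = np.array([7, 7, 8, 9, 9])               # one entry per feature row
qids = np.array([7, 8, 9])                         # one entry per query
features = np.arange(10, dtype=np.float32).reshape(5, 2)
gold_weights = np.empty(3, dtype=object)
gold_weights[0] = np.array([2.0, 0.0])
gold_weights[1] = np.array([1.0])
gold_weights[2] = np.array([3.0, 1.0])

q_choice = np.array([7, 9])                        # queries to keep

row_mask = np.isin(query_id, q_choice)
q_rel = query_id[row_mask]
feat_rel = features[row_mask]
label_rel = np.concatenate(gold_weights[np.isin(qids, q_choice)])

# Prepend the query id as column 0
feat_rel = np.hstack((q_rel.reshape(-1, 1), feat_rel))

print(feat_rel.shape, label_rel)
```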

In [142]:
# Important to free up memory
# del data
del dids, qids, features, gold_weights

## Feature Generation

In [143]:
""" SET DELTA FEATURES, REPEAT IMPORTANCE HERE """
delta_features = True
repeat_importance = False

n_rows = 0
max_diff = 4
n_features = 700

# Find max possible number of rows: n_queries * (n_urls_per_query ^ 2) * max_repeat_factor
for qid in q_choice:
    urls_per_query = np.sum(np.isin(q_rel, qid))
    
    # If not repeating importance, then every query-URL pair only appears once
    if repeat_importance:
        n_rows += (urls_per_query ** 2) * max_diff
    else:
        n_rows += (urls_per_query ** 2)
    
# Add extra set of columns if delta_features, + 2 for (query_id, label)
if delta_features:
    n_columns = (n_features * 3) + 2
else:
    n_columns = (n_features * 2) + 2

# Create array to fill in later (faster), step thru with idx
features = np.full(shape=(n_rows, n_columns), fill_value=np.nan)
idx = 0

# Iter thru queries
for progress, qid in enumerate(q_choice):
    
    temp_feat = feat_rel[np.isin(q_rel, qid)]
    temp_label = label_rel[np.isin(q_rel, qid)]
    
    m = temp_feat.shape[0]
    
    # First URL
    for i in range(m):
        
        # Second URL
        for j in range(m):
            
            label_diff = temp_label[i] - temp_label[j]
            
            # Repeat importance: emit the row |label_diff| + 1 times
            if repeat_importance:
                end_k = int(abs(label_diff)) + 1
            else:
                end_k = 1

            for k in range(end_k):

                # Delta features: for feature (a, b), represent as (a, b, a-b)
                # Format: (qid, feat[i], feat[j], feat[i] - feat[j], label_diff)
                if delta_features:
                    new_row = np.hstack((temp_feat[i], 
                                         temp_feat[j, 1:], 
                                         temp_feat[i, 1:] - temp_feat[j, 1:],
                                         label_diff))
                else:
                    new_row = np.hstack((temp_feat[i], 
                                         temp_feat[j, 1:], 
                                         label_diff))

                features[idx] = new_row
                idx += 1

    print(progress + 1)
    
# Originally allocated array is likely too large, only save relevant rows
features = features[~np.isnan(features[:, 0])]

1
2
3
4
5
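
The nested loops above build one row per ordered URL pair (i, j) within a query. As a sketch of the same row layout without Python-level loops, the function below uses `np.repeat` and `np.tile` for a single query. `pairwise_rows` is a hypothetical helper; it covers only the `repeat_importance = False` case and expects the query id in column 0, as in `feat_rel`.

```python
import numpy as np


def pairwise_rows(feat, labels, delta_features=True):
    """Build (qid, feat[i], feat[j], [feat[i] - feat[j]], label_diff) rows for one query."""
    m = feat.shape[0]
    i = np.repeat(np.arange(m), m)   # first URL index: 0,0,...,1,1,...
    j = np.tile(np.arange(m), m)     # second URL index: 0,1,...,0,1,...
    blocks = [feat[i], feat[j, 1:]]  # feat[i] keeps the query id in column 0
    if delta_features:
        blocks.append(feat[i, 1:] - feat[j, 1:])
    blocks.append((labels[i] - labels[j]).reshape(-1, 1))
    return np.hstack(blocks)


# Toy query with id 7, two URLs and two features each
feat = np.array([[7, 0.5, 1.0],
                 [7, 0.2, 3.0]])
labels = np.array([2.0, 0.0])
rows = pairwise_rows(feat, labels)
print(rows.shape)
print(rows[1])
```

The row order matches the loops (i outer, j inner), which is what the row-major reshape in the model cell relies on.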


In [144]:
# # Save train/val features
# train_feat = features
val_feat = features

In [145]:
train_feat.shape, val_feat.shape

((19795, 2102), (5817, 2102))

In [146]:
# To work with model
test_feat = val_feat

## Model

In [147]:
""" TRAIN """
# Features do not include i, j; they do include query_id
X_train = train_feat[:, :-1]
y_train = train_feat[:, -1]

# Same parameters for all calls to ensure consistency
xgbr = XGBRegressor(max_depth=6, 
                    learning_rate=0.1,
                    n_estimators=100, # Change to make faster OR more powerful (?)
                    objective='reg:squarederror')

xgbr.fit(X_train, y_train)

print('Model fitted')

""" TEST """
# Want to make predictions on every URL pair within a query, for all queries
X_test = test_feat[:, :-1]
y_test = test_feat[:, -1]
y_pred = xgbr.predict(X_test)

# Record results over all queries
MAP = 0
NDCG1, NDCG3, NDCG5, NDCG10, NDCGM = 0, 0, 0, 0, 0

# Save rankings (to visually compare)
r_ls = []

# For each query, make a prediction array (scores)
for qid in np.unique(X_test[:, 0]):

    # m will be the number of URLs per given query ID
    m = int(np.sqrt(np.sum(X_test[:, 0] == qid)))

    # Save y_pred only for query of interest as y_pq, reshape in order to sum across rows
    # Note that the default order='C' in reshape is fine (row-major)
    # Setting order='F' will result in roughly the same result, just reversed since the 
    # learned labels correspond to (URLi - URLj)
    y_pq = y_pred[X_test[:, 0] == qid]
    y_pq = y_pq.reshape(m, m, order='C')

    # Apply logistic function
    y_pq = expit(y_pq)

    # Sum over i (axis=0): scores[j] accumulates how strongly the other URLs are predicted to beat URL j
    # Ascending argsort therefore places the most relevant URL first
    scores = np.sum(y_pq, axis=0)
    order = np.argsort(scores)

    # Apply order to original labels
    y_orig = label_rel[feat_rel[:, 0] == qid]
    r = y_orig[order]
    
    # Save ranking
    r_ls.append(r)

    # Get results
    m_a_p = mean_average_precision([r])
    n1, n3, n5, n10, nm = (ndcg_at_k(r=r, k=k) for k in (1, 3, 5, 10, m))

    # Update overall results
    MAP += m_a_p
    NDCG1 += n1
    NDCG3 += n3
    NDCG5 += n5
    NDCG10 += n10
    NDCGM += nm

    # Results for query
    print('Query %d, m=%d:' % (qid, m))
    print('\tMAP:     %.4f' % m_a_p)
    print('\tNDCG@1:  %.4f' % n1)
    print('\tNDCG@3:  %.4f' % n3)
    print('\tNDCG@5:  %.4f' % n5)
    print('\tNDCG@10: %.4f' % n10)
    print('\tNDCG@m:  %.4f' % nm)

# Results over all queries
print('\nOverall:')
print('\tMAP:     %.4f' % (MAP / size))
print('\tNDCG@1:  %.4f' % (NDCG1 / size))
print('\tNDCG@3:  %.4f' % (NDCG3 / size))
print('\tNDCG@5:  %.4f' % (NDCG5 / size))
print('\tNDCG@10: %.4f' % (NDCG10 / size))
print('\tNDCG@m:  %.4f' % (NDCGM / size))

Model fitted
Query 21553, m=6:
	MAP:     1.0000
	NDCG@1:  1.0000
	NDCG@3:  1.0000
	NDCG@5:  1.0000
	NDCG@10: 1.0000
	NDCG@m:  1.0000
Query 22292, m=31:
	MAP:     0.9087
	NDCG@1:  0.6667
	NDCG@3:  0.5867
	NDCG@5:  0.5189
	NDCG@10: 0.6442
	NDCG@m:  0.8195
Query 22459, m=50:
	MAP:     0.9207
	NDCG@1:  0.2500
	NDCG@3:  0.5087
	NDCG@5:  0.5072
	NDCG@10: 0.5055
	NDCG@m:  0.7855
Query 22486, m=32:
	MAP:     0.5441
	NDCG@1:  0.0000
	NDCG@3:  0.2398
	NDCG@5:  0.3276
	NDCG@10: 0.4653
	NDCG@m:  0.6350
Query 22520, m=36:
	MAP:     0.7870
	NDCG@1:  0.5000
	NDCG@3:  0.4319
	NDCG@5:  0.6168
	NDCG@10: 0.7063
	NDCG@m:  0.8188

Overall:
	MAP:     0.8321
	NDCG@1:  0.4833
	NDCG@3:  0.5534
	NDCG@5:  0.5941
	NDCG@10: 0.6643
	NDCG@m:  0.8118
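
To see why column sums followed by an ascending `argsort` give a best-first ranking, consider a toy query where the pairwise predictions are exact. Entry (i, j) of the reshaped matrix estimates label[i] - label[j], so after the logistic function, column j collects how strongly the other URLs beat URL j, and the smallest column sum belongs to the most relevant URL. The labels below are made up for illustration, and the logistic function is written out with NumPy (it is equivalent to `scipy.special.expit`).

```python
import numpy as np


def logistic(x):
    """Logistic function, equivalent to scipy.special.expit."""
    return 1.0 / (1.0 + np.exp(-x))


labels = np.array([1.0, 3.0, 0.0, 2.0])
m = len(labels)

# Perfect pairwise predictions in the same row-major (i, j) order as the feature rows
y_pred = (labels[:, None] - labels[None, :]).ravel()

y_pq = logistic(y_pred.reshape(m, m, order='C'))
scores = np.sum(y_pq, axis=0)      # scores[j]: how strongly other URLs beat URL j
order = np.argsort(scores)         # ascending: least-beaten URL first

print(order)
print(labels[order])
```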


In [148]:
r_ls

[array([2., 1., 1., 1., 1., 1.], dtype=float32),
 array([2., 2., 1., 0., 1., 2., 2., 3., 1., 2., 1., 1., 1., 0., 2., 1., 1.,
        1., 3., 3., 1., 2., 1., 2., 1., 1., 2., 1., 2., 1., 2.],
       dtype=float32),
 array([1., 1., 4., 1., 1., 1., 1., 1., 1., 1., 2., 3., 1., 1., 2., 1., 2.,
        0., 2., 2., 1., 2., 1., 0., 0., 2., 1., 0., 0., 0., 1., 1., 1., 1.,
        1., 1., 1., 1., 3., 1., 1., 2., 1., 1., 1., 1., 1., 1., 1., 1.],
       dtype=float32),
 array([0., 0., 2., 1., 1., 2., 0., 1., 0., 2., 0., 1., 2., 1., 0., 0., 0.,
        0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.],
       dtype=float32),
 array([1., 1., 0., 2., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 0., 1., 1.,
        1., 0., 0., 1., 0., 1., 1., 1., 0., 2., 1., 0., 1., 1., 1., 1., 1.,
        0., 1.], dtype=float32)]
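
The NDCG values reported earlier can be checked by hand against these saved rankings. The sketch below is a standalone reimplementation, not the `rank_metrics` code itself. It assumes linear gain with position p ≥ 2 discounted by log2(p), which is consistent with the printed values for query 22292 (the second ranking in `r_ls`). The names `dcg_linear` and `ndcg_linear` are chosen to avoid shadowing the imported `ndcg_at_k`.

```python
import math


def dcg_linear(r, k):
    """DCG with linear gain: r[0] + sum(r[p-1] / log2(p)) for positions p = 2..k."""
    top = list(r)[:k]
    return sum(rel if pos == 1 else rel / math.log2(pos)
               for pos, rel in enumerate(top, start=1))


def ndcg_linear(r, k):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_linear(sorted(r, reverse=True), k)
    return dcg_linear(r, k) / ideal if ideal > 0 else 0.0


# Ranking for query 22292, copied from r_ls
r = [2, 2, 1, 0, 1, 2, 2, 3, 1, 2, 1, 1, 1, 0, 2, 1, 1,
     1, 3, 3, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2]

print(round(ndcg_linear(r, 1), 4), round(ndcg_linear(r, 3), 4))
```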

## Testing

    ----------------------------------------
    Seed = 1, 10/5 train/test query split
    
    delta_features = False
    Overall:
        MAP:     0.8772
        NDCG@1:  0.7167
        NDCG@3:  0.6580
        NDCG@5:  0.6551
        NDCG@10: 0.7511
        NDCG@m:  0.8404
        
    delta_features = True
    Overall:
        MAP:     0.9184
        NDCG@1:  0.8000
        NDCG@3:  0.8254
        NDCG@5:  0.7619
        NDCG@10: 0.8207
        NDCG@m:  0.8969
        
    ----------------------------------------
    Seed = 2, 10/5 train/test query split
    
    delta_features = False
    Overall:
        MAP:     0.8719
        NDCG@1:  0.6833
        NDCG@3:  0.7657
        NDCG@5:  0.7592
        NDCG@10: 0.7537
        NDCG@m:  0.8828
        
    delta_features = True
    Overall:
        MAP:     0.8321
        NDCG@1:  0.4833
        NDCG@3:  0.5534
        NDCG@5:  0.5941
        NDCG@10: 0.6643
        NDCG@m:  0.8118

    ----------------------------------------
    Seed = 2 (train), Seed = 1 (test), 10/5 train/test query split
    
    delta_features = False
    Overall:
        MAP:     0.8830
        NDCG@1:  0.7667
        NDCG@3:  0.6900
        NDCG@5:  0.6284
        NDCG@10: 0.6952
        NDCG@m:  0.8401
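
The same-seed runs can be compared more directly by tabulating the change from adding delta features. The numbers below are copied from the Seed = 1 and Seed = 2 results above; the dictionary layout is just one convenient choice.

```python
# Overall MAP and NDCG@10 per seed, copied from the recorded results
results = {
    1: {"no_delta": {"MAP": 0.8772, "NDCG@10": 0.7511},
        "delta":    {"MAP": 0.9184, "NDCG@10": 0.8207}},
    2: {"no_delta": {"MAP": 0.8719, "NDCG@10": 0.7537},
        "delta":    {"MAP": 0.8321, "NDCG@10": 0.6643}},
}

# Change in each metric when delta features are switched on
changes = {}
for seed, runs in results.items():
    changes[seed] = {metric: round(runs["delta"][metric] - runs["no_delta"][metric], 4)
                     for metric in runs["delta"]}
    print("Seed", seed, changes[seed])
```

The sign of the change flips between the two seeds, so with only five evaluation queries per run these results alone do not settle whether delta features help.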