# Yahoo Data Processing

This notebook tests the performance of various feature configurations for DeltaMART on the Yahoo LETOR dataset, using the preprocessed train/validation/test split from:

https://github.com/QingyaoAi/Unbiased-Learning-to-Rank-with-Unbiased-Propensity-Estimation

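Only a small random subset of queries is used (10 in this notebook), since the pairwise feature generation below grows quadratically with the number of URLs per query.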

In [1]:
import data_utils  # To load Yahoo dataset
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from scipy.special import expit  # Logistic function
from rank_metrics import ndcg_at_k, mean_average_precision

## Exploration

### Raw data

#### Features: train.feature

Description: "test_2_5" denotes the 5th document for the query with identifier "2" in the original test set of the Yahoo LETOR data.

Interpretation: the first value is the document id (split_queryNum_docNum), and the rest are feature values (SVMlight-style format?)
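A minimal sketch of this interpretation, for illustration only. The `index:value` token layout, the 0-based feature indices, the sample line, and the helper name `parse_feature_line` are all assumptions; the notebook itself relies on `data_utils.read_data` to do the actual loading.

```python
import numpy as np

def parse_feature_line(line, n_features=700):
    """Split one line into (doc_id, query_id, doc_num, dense feature vector)."""
    tokens = line.split()
    doc_id = tokens[0]                            # e.g. "test_2_5"
    _split, query_id, doc_num = doc_id.split("_")
    vec = np.zeros(n_features, dtype=np.float32)
    for tok in tokens[1:]:
        idx, val = tok.split(":")                 # ASSUMED "index:value" tokens
        vec[int(idx)] = float(val)                # ASSUMED 0-based feature indices
    return doc_id, int(query_id), int(doc_num), vec

# Hypothetical line, shortened to 5 features for readability
doc_id, qid, doc_num, vec = parse_feature_line("test_2_5 0:0.5 3:0.25", n_features=5)
print(doc_id, qid, doc_num, vec)
```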

****

#### Labels: train.weights

Description: The annotated relevance value for documents in the initial list of each query.

Interpretation: first value is queryNum (query_id), rest are labels for URLs at corresponding indexes
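Under the same interpretation, one line of the labels file could be read as below. The whitespace-separated layout, the sample line, and the helper name `parse_weights_line` are assumptions made for illustration, not the loader used in this notebook.

```python
import numpy as np

def parse_weights_line(line):
    """Split one line into (query_id, labels); labels[k] belongs to the k-th URL."""
    tokens = line.split()
    query_id = int(tokens[0])
    labels = np.array([float(t) for t in tokens[1:]], dtype=np.float32)
    return query_id, labels

# Hypothetical line: query 2 with three labelled URLs
query_id_example, labels_example = parse_weights_line("2 0 1 3")
print(query_id_example, labels_example)
```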

In [2]:
data = data_utils.read_data(data_path='/Users/Ashtekar15/Desktop/Thesis/MGBoost/other/test_data/generate_dataset/',
                            file_prefix='train')

### Stats for validation set

**dids** (71083): valid_19945_15..., stores query/URL id info

**qids** (2994): stores query id info

**features** (71083): list of lists, each sublist is a given query/URL pair (sublist len 700)

**gold_weights** (2994): list of lists, each sublist is the labels for URLs of a single query (sublist len varies)

### Planning

In [3]:
# Num queries in train/val/test (should be 29921)
19944 + 2994 + 6983

29921

In [4]:
# Get info on number of URLs/query
lls = [len(ls) for ls in data.gold_weights]

np.mean(lls), np.std(lls), min(lls), max(lls)

(23.723124749298034, 18.310095072710077, 1, 139)

In [5]:
# Solve for num_queries
num_queries = 10

mean = np.mean(lls)

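# (Estimated) total size with given num_queries, based on the mean list length.
# NOTE: 64 is bits per float64 value, so the result is in gigabits; divide by 8 for GB.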
(700 * 3) * (mean ** 2) * num_queries * 64 / (10 ** 9)

0.7563852547382973

In [6]:
# Total number of values produced by feature generation FOR THE ENTIRE TRAIN SET
total = 0
for ls in data.gold_weights:
    total += (700 * 3) * (len(ls) ** 2)

# Estimated total size: 64 bits per float64 value, so this is gigabits (divide by 8 for GB)
(total * 64) / (10 ** 9)

2407.1892096
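The two estimates above are built differently: the first uses the squared mean list length, the second sums the squared length of every query. Since the mean of squares equals the squared mean plus the variance, the first form underestimates when list lengths vary. The sketch below redoes the count in bytes, assuming 8 bytes per float64 value and 3 blocks of 700 columns; the function name `pairwise_size_gb` and the toy lengths are illustrative only.

```python
import numpy as np

def pairwise_size_gb(urls_per_query, n_features=700, blocks=3, bytes_per_value=8):
    """Estimated size in gigabytes of all (i, j) rows, before any row repetition."""
    n = np.asarray(urls_per_query, dtype=np.int64)
    n_values = blocks * n_features * np.sum(n ** 2)
    return n_values * bytes_per_value / 1e9

toy_lengths = [1, 2, 3]                      # toy URLs-per-query counts
size_gb = pairwise_size_gb(toy_lengths)
print(size_gb)

# Mean of squares = squared mean + variance
n = np.asarray(toy_lengths, dtype=float)
mean_of_squares = np.mean(n ** 2)
mean_sq_plus_var = np.mean(n) ** 2 + np.var(n)
print(mean_of_squares, mean_sq_plus_var)
```

With the statistics printed earlier (mean about 23.7, standard deviation about 18.3), the mean of squares is roughly 1.6 times the squared mean, so an estimate based on the squared mean alone should be read as a lower bound.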

**Plan:**
- Convert np arrays to np.float32 (?)
    - To save memory/use more training data
- Choose 10 queries randomly
    - Use seed
- Get features and labels corresponding to these queries
    - Include query id in features (?)
    - data.features, data.gold_weights
- Generate pairwise features
    - Include option for delta_features
- Build model/make predictions on validation data

## Data Preparation

In [7]:
# Convert lists to np arrays for faster access

# String
dids = np.array(data.dids)

# String -> int
qids = np.array(data.qids, dtype=int)

# float64 -> float32
features = np.array(data.features, dtype=np.float32)

# Sublists differ in length, so store them as an object array of float32 arrays
gold_weights = np.array([np.array(x, dtype=np.float32) for x in data.gold_weights], dtype=object)

In [8]:
""" SET SEED FOR RANDOM QUERY SELECTION HERE """
np.random.seed(1)

# Randomly select 10 distinct queries (replace=False avoids picking the same query twice)
q_choice = np.random.choice(qids, size=10, replace=False)

# Get query id aligned with features
query_id = np.array([int(ele.split("_")[1]) for ele in dids])
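`np.random.choice` samples with replacement by default, so a draw of 10 can contain the same query twice. A small sketch of a seeded draw of distinct queries, plus the id parsing, on toy data. The toy ids and the use of `np.random.default_rng` are assumptions for illustration; a `Generator` seeded with 1 does not reproduce the draws of `np.random.seed(1)`.

```python
import numpy as np

rng = np.random.default_rng(1)               # seeded generator for reproducibility
toy_qids = np.array([3, 7, 11, 19, 23, 42])  # toy query ids

# replace=False guarantees that every selected query is distinct
picked = rng.choice(toy_qids, size=4, replace=False)

# Document ids look like "<split>_<queryNum>_<docNum>"; keep the query number
toy_dids = ["train_7_0", "train_7_1", "train_19_0", "train_42_0"]
toy_query_id = np.array([int(d.split("_")[1]) for d in toy_dids])

print(picked, toy_query_id)
```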

In [9]:
# Get relevant queries, features, and labels
q_rel = query_id[np.isin(query_id, q_choice)]
feat_rel = features[np.isin(query_id, q_choice)]
label_rel = gold_weights[np.isin(qids, q_choice)]

# Join subarrays
label_rel = np.concatenate(label_rel)

# Include query id in features
feat_rel = np.hstack((q_rel.reshape(-1, 1), feat_rel))
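The selection above only lines up if the feature rows are grouped by query in the same order as the query ids, so that the concatenated labels match the selected rows one to one. That ordering is an assumption about the loaded data. A toy sketch of the same masking with an explicit length check (all arrays here are made up):

```python
import numpy as np

toy_qids = np.array([1, 2, 3])
toy_gold = [np.array([0.0, 2.0]), np.array([1.0]), np.array([4.0, 0.0, 3.0])]
toy_query_id = np.array([1, 1, 2, 3, 3, 3])            # one entry per feature row
toy_features = np.arange(12, dtype=np.float32).reshape(6, 2)
toy_choice = np.array([1, 3])

row_mask = np.isin(toy_query_id, toy_choice)            # rows of selected queries
q_sel = toy_query_id[row_mask]
feat_sel = toy_features[row_mask]

query_mask = np.isin(toy_qids, toy_choice)              # selected label lists
label_sel = np.concatenate([g for g, keep in zip(toy_gold, query_mask) if keep])

# Fails loudly if labels and feature rows are not aligned in count
assert len(label_sel) == len(feat_sel)

# Cast the id column to float32 so hstack does not upcast everything to float64
feat_sel = np.hstack((q_sel.reshape(-1, 1).astype(np.float32), feat_sel))
print(feat_sel.shape, feat_sel.dtype, label_sel)
```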

In [10]:
# Important to free up memory
del data, dids, qids, features, gold_weights

## Feature Generation

In [12]:
""" SET DELTA FEATURES HERE """
delta_features = False

n_rows = 0
max_diff = 4
n_features = 700

# Upper bound on the number of rows: sum over queries of (n_urls_per_query ** 2) * max_diff
for qid in q_choice:
    n_rows += (np.sum(np.isin(q_rel, qid)) ** 2) * max_diff
    
# Add extra set of columns if delta_features, + 2 for (query_id, label)
if delta_features:
    n_columns = (n_features * 3) + 2
else:
    n_columns = (n_features * 2) + 2

# Create array to fill in later (faster), step thru with idx
features = np.full(shape=(n_rows, n_columns), fill_value=np.nan)
idx = 0

# Iter thru queries
for progress, qid in enumerate(q_choice):
    
    temp_feat = feat_rel[np.isin(q_rel, qid)]
    temp_label = label_rel[np.isin(q_rel, qid)]
    
    m = temp_feat.shape[0]
    
    # First URL
    for i in range(m):
        
        # Second URL
        for j in range(m):
            
            label_diff = temp_label[i] - temp_label[j]
            
            # Repeat importance: emit the row |label_diff| + 1 times in the next loop
            # (pairs with equal labels therefore appear exactly once)
            end_k = int(abs(label_diff)) + 1

            for k in range(end_k):

                # Delta features: for feature (a, b), represent as (a, b, a-b)
                # Format: (qid, feat[i], feat[j], feat[i] - feat[j], label_diff)
                if delta_features:
                    new_row = np.hstack((temp_feat[i], 
                                         temp_feat[j, 1:], 
                                         temp_feat[i, 1:] - temp_feat[j, 1:],
                                         label_diff))
                else:
                    new_row = np.hstack((temp_feat[i], 
                                         temp_feat[j, 1:], 
                                         label_diff))

                features[idx] = new_row
                idx += 1

    print(progress)  # 0-based index of the query just processed
    
# Originally allocated array is likely too large, only save relevant rows
features = features[~np.isnan(features[:, 0])]

0
1
2
3
4
5
6
7
8
9
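The triple loop above can be slow for queries with many URLs. One possible alternative for a single query is sketched below with NumPy index arrays and `np.repeat`. This is a sketch rather than the notebook's implementation: the function name `pairwise_rows` is hypothetical, and it leaves out the query-id column, which would have to be prepended separately.

```python
import numpy as np

def pairwise_rows(feat, labels, delta_features=False):
    """Build all ordered (i, j) rows for one query.

    feat: (m, d) feature array without the query-id column; labels: (m,) array.
    Row layout: (feat[i], feat[j], [feat[i] - feat[j]], label_diff).
    """
    m = feat.shape[0]
    i_idx = np.repeat(np.arange(m), m)       # 0,0,...,1,1,... (first URL)
    j_idx = np.tile(np.arange(m), m)         # 0,1,...,0,1,... (second URL)
    label_diff = labels[i_idx] - labels[j_idx]

    blocks = [feat[i_idx], feat[j_idx]]
    if delta_features:
        blocks.append(feat[i_idx] - feat[j_idx])
    blocks.append(label_diff.reshape(-1, 1))
    rows = np.hstack(blocks)

    # Same repeat rule as the loop: each pair appears |label_diff| + 1 times
    repeats = np.abs(label_diff).astype(int) + 1
    return np.repeat(rows, repeats, axis=0)

toy_feat = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
toy_labels = np.array([0.0, 2.0], dtype=np.float32)

with_delta = pairwise_rows(toy_feat, toy_labels, delta_features=True)
without_delta = pairwise_rows(toy_feat, toy_labels, delta_features=False)
print(with_delta.shape, without_delta.shape)  # (8, 7) (8, 5)
```

Because the rows come out in the same (i, j, repeat) order as the loop, the result for each query could be written into the preallocated array in one slice assignment.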


## Model

## Testing