# Dataset Utilities

This notebook demonstrates several utilities provided by RankEval for loading and manipulating datasets in the SVMLight format. In particular, it shows how to:
 - Load standard LtR datasets, together with several pre-trained models
 - Fork a dataset by selecting only a subset of the features
 - Fork a dataset by selecting only some queries
 - Dump a dataset to a file in the SVMLight format
 - Split a dataset into train, validation (optional) and test sets
 - Access low-level dataset information manually
 - Iterate over each query of a dataset

#### Essential imports

In [1]:
%matplotlib inline
# Useful to reload the modules without having to restart the notebook kernel
%load_ext autoreload
%autoreload 2

import os
import numpy as np
import pandas as pd

from rankeval.dataset.datasets_fetcher import load_dataset
from rankeval.dataset import Dataset
from rankeval.model import RTEnsemble

# Loading datasets and models

Standard LtR datasets can be loaded by calling the `load_dataset` utility. It fetches datasets and several pre-trained models from a central repository, which simplifies setting up the workspace.

In [2]:
# Dataset container
dataset_container = load_dataset(dataset_name='msn10k', 
                                fold='1', 
                                download_if_missing=True, 
                                force_download=False,
                                with_models=True)

Loading files. This may take a few minutes.
done loading dataset!


## Datasets

In [3]:
# Remapping Datasets Names
msn_train = dataset_container.train_dataset
msn_validation = dataset_container.validation_dataset
msn_test = dataset_container.test_dataset

## Choose and load models

In [4]:
# View available models
for item, file_name in enumerate(dataset_container.model_filenames):
    print("{} {}".format(item, file_name))

0 /Users/salvatore/rankeval_data/msn10k/models/Fold1/quickrank/msn1.quickrank.LAMBDAMART.20000.32.T15000.xml
1 /Users/salvatore/rankeval_data/msn10k/models/Fold1/quickrank/msn1.quickrank.LAMBDAMART.20000.32.T5000.xml
2 /Users/salvatore/rankeval_data/msn10k/models/Fold1/quickrank/msn1.quickrank.LAMBDAMART.20000.32.T20000.xml
3 /Users/salvatore/rankeval_data/msn10k/models/Fold1/quickrank/msn1.quickrank.LAMBDAMART.20000.32.T10000.xml
4 /Users/salvatore/rankeval_data/msn10k/models/Fold1/quickrank/msn1.quickrank.LAMBDAMART.20000.32.T1000.xml
5 /Users/salvatore/rankeval_data/msn10k/models/Fold1/xgboost/XGBOOST.msn10k.fold-1.pairwise.d-5.lr-10.trees-1000.model
6 /Users/salvatore/rankeval_data/msn10k/models/Fold1/lightgbm/LGBM.msn10k.fold-1.lambdarank.leaves-32.lr-5.trees-1000.model
7 /Users/salvatore/rankeval_data/msn10k/models/Fold1/catboost/msn1.catboost.LAMBDAMART.1000.5.T1000.json
8 /Users/salvatore/rankeval_data/msn10k/models/Fold1/catboost/msn1.catboost.LAMBDAMART.1000.5.T1000.model


In [5]:
# Model files
msn_qr_lmart_1Ktrees_file = dataset_container.model_filenames[4]
# Loading model into RankEval
msn_qr_lmart_1Ktrees = RTEnsemble(msn_qr_lmart_1Ktrees_file, name="QR_lmart_1K", format="QuickRank")

# Fork a dataset by selecting only a subset of the features

Starting from a dataset, it is possible to create a new dataset instance with only a subset of the features appearing in the original dataset. 
This is particularly useful when analyzing feature importance, i.e., when trying to reduce the number of features needed by a LtR model as much as possible without affecting the quality of the learned model. 
An example of such analysis is reported in the notebook: [Feature Analysis](Feature%20Analysis.ipynb).

In this notebook we select 20% of all the features, chosen at random:

In [6]:
feature_permutation = np.random.permutation(msn_train.n_features)
selected_features = feature_permutation[:int(msn_train.n_features*0.2)]

msn_train_subset_features = msn_train.subset_features(selected_features)

In [7]:
d = {'# Queries': [msn_train.n_queries, msn_train_subset_features.n_queries], 
     '# Instances': [msn_train.n_instances, msn_train_subset_features.n_instances], 
     '# Features': [msn_train.n_features, msn_train_subset_features.n_features],}
df = pd.DataFrame(data=d, index=["Full", "Sampled by Feature"])
df

| | # Features | # Instances | # Queries |
|---|---|---|---|
| Full | 136 | 723412 | 6000 |
| Sampled by Feature | 27 | 723412 | 6000 |


Recall that models have to be used in consistent conditions: if you train a model on this sampled dataset, the test dataset (and the validation dataset, if any) has to be reduced to the same features.

In [8]:
msn_test_subset_features = msn_test.subset_features(selected_features)
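
The consistency requirement can be illustrated without RankEval. The following is a minimal sketch in plain Python: `select_columns` is a hypothetical helper (not part of the RankEval API) and the tiny matrices are made-up data. The point is only that the feature indices are drawn once and then applied to every split.

```python
import random

def select_columns(matrix, columns):
    """Return a new matrix keeping only the given column indices, in order."""
    return [[row[c] for c in columns] for row in matrix]

# Toy feature matrices (rows = instances, columns = features); made-up values.
train_X = [[0.1, 0.2, 0.3, 0.4, 0.5],
           [1.1, 1.2, 1.3, 1.4, 1.5]]
test_X = [[2.1, 2.2, 2.3, 2.4, 2.5]]

# Draw 40% of the feature indices at random, once.
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
n_features = len(train_X[0])
selected = rng.sample(range(n_features), int(n_features * 0.4))

# Apply the *same* selection to every split.
train_subset = select_columns(train_X, selected)
test_subset = select_columns(test_X, selected)
```

Drawing a fresh random selection for the test set would silently misalign the columns, so a model trained on `train_subset` would read the wrong features at test time.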

# Fork a dataset by selecting only a subset of the queries

Starting from a dataset, it is possible to create a new dataset instance by selecting only some of the queries of the original dataset. The queries to select are identified by their qid (query id).

In this notebook we select only 20% of all the queries, chosen at random:

In [9]:
qid_permutation = np.random.permutation(msn_train.query_ids)
selected_qid = qid_permutation[:int(msn_train.query_ids.size*0.2)]

msn_train_subset_queries = msn_train.subset(query_ids=selected_qid, name="MSN Train Fold1 - 20% of the queries")

In [10]:
d = {'# Queries': [msn_train.n_queries, msn_train_subset_queries.n_queries], 
     '# Instances': [msn_train.n_instances, msn_train_subset_queries.n_instances], 
     '# Features': [msn_train.n_features, msn_train_subset_queries.n_features],}
df = pd.DataFrame(data=d, index=["Full", "Sampled by Query"])
df

| | # Features | # Instances | # Queries |
|---|---|---|---|
| Full | 136 | 723412 | 6000 |
| Sampled by Query | 136 | 142441 | 1200 |
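
Note that sampling 20% of the queries keeps roughly 19.7% of the instances (142441 out of 723412), because queries have different numbers of documents. The idea behind query-based sampling can be sketched in plain Python, assuming the compact layout described later in this notebook (one id per query plus `n_queries + 1` offsets). `subset_by_queries` is a hypothetical helper written for illustration only; it is not how RankEval implements `subset`, and keeping the original query order is an assumption of this sketch.

```python
def subset_by_queries(rows, query_ids, query_offsets, selected_qids):
    """Keep only the rows of the selected queries; rebuild ids and offsets."""
    wanted = set(selected_qids)
    new_rows, new_ids, new_offsets = [], [], [0]
    for i, qid in enumerate(query_ids):
        if qid in wanted:
            start, end = query_offsets[i], query_offsets[i + 1]
            new_rows.extend(rows[start:end])   # copy the whole query block
            new_ids.append(qid)
            new_offsets.append(len(new_rows))  # end of this query in the new layout
    return new_rows, new_ids, new_offsets

# Toy dataset: 3 queries with 2, 3 and 1 documents respectively (made-up data).
rows = ["a0", "a1", "b0", "b1", "b2", "c0"]
query_ids = [10, 20, 30]
query_offsets = [0, 2, 5, 6]

sub_rows, sub_ids, sub_offsets = subset_by_queries(
    rows, query_ids, query_offsets, selected_qids=[30, 10])
print(sub_rows, sub_ids, sub_offsets)
# ['a0', 'a1', 'c0'] [10, 30] [0, 2, 3]
```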


# Dump a dataset on file

If you modify a dataset (by selecting only a subset of the queries or of the features), you can save the modified version to a file in the standard SVMLight format. RankEval provides this operation to simplify manipulating and working with LtR datasets.

In [11]:
destination_file = "msn_fold1_27_random_features.txt"
msn_train_subset_features.dump(destination_file, format="svmlight")

Read the first 10 lines of the file the dataset has been written to (and then delete the file):

In [12]:
!head -10 $destination_file
!rm $destination_file

2 qid:1 1:3.078917 2:0 3:0.0064099999 4:2 5:11089534 6:0.25 7:1 8:0 9:167 10:3 11:6 12:0 13:-20.203779 14:7 15:-13.581932 16:0 17:1 18:0 19:13.853103 20:0.75 21:2 22:0.011976 23:0 24:20.59276 25:6.9265509 26:1 27:0 
2 qid:1 1:30.789171 2:0 3:0.022988999 4:10.333333 5:11089534 6:0 7:0 8:0 9:416 10:3 11:31 12:0 13:-16.208809 14:5 15:-11.411068 16:0.2 17:0 18:0.88888901 19:70.792755 20:0 21:9 22:0.026442001 23:0 24:0 25:6.9265509 26:0 27:0 
0 qid:1 1:18.473503 2:0 3:0.031962998 4:5.333333 5:3 6:0 7:0 8:0 9:156 10:3 11:16 12:0 13:-18.589542 14:7 15:-11.436378 16:0 17:0 18:9.5555563 19:33.436523 20:0 21:1 22:0.051282 23:0 24:0 25:6.9265509 26:0 27:0 
2 qid:1 1:6.1578341 2:0 3:0.00813 4:3.333333 5:11089534 6:0 7:0 8:0 9:299 10:3 11:10 12:0 13:-19.180737 14:7 15:-13.825417 16:0.25 17:0 18:0.222222 19:21.928251 20:0 21:3 22:0.013378 23:0 24:0 25:6.9265509 26:0 27:0 
1 qid:1 1:8.100687 2:0 3:0.001327 4:3.666667 5:5 6:0 7:0 8:0 9:2022 10:3 11:11 12:0 13:-20.589939 14:7 15:-19.226044 16:0.25 17:0
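
In the output above, each line has the form `<label> qid:<qid> <index>:<value> ...`, and the 27 selected features appear renumbered from 1 to 27. As a rough illustration of this line layout (not of RankEval's own writer), the hypothetical helper below formats a single instance in plain Python; the number formatting is an assumption and may differ from what `dump` produces.

```python
def to_svmlight_line(label, qid, features):
    """Format one instance as '<label> qid:<qid> 1:<v1> 2:<v2> ...'."""
    # Feature indices are 1-based in the SVMLight format.
    pairs = " ".join("{}:{}".format(i, v)
                     for i, v in enumerate(features, start=1))
    return "{} qid:{} {}".format(label, qid, pairs)

# Label 2, query 1, three feature values taken from the first dumped line.
line = to_svmlight_line(2, 1, [3.078917, 0, 0.25])
print(line)  # 2 qid:1 1:3.078917 2:0 3:0.25
```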

# Split a dataset into train, validation (optional) and test sets

It is often necessary to split a LtR dataset into train, validation and test sets, so as to train a LtR model on one partition and test its effectiveness on a different one.
RankEval provides a utility to split a dataset into partitions. It shuffles the query ids before partitioning.

In this notebook, for simplicity, we split the training set of the MSN-Fold1 dataset according to the 60%/20%/20% ratios.

In [13]:
train, vali, test = msn_train.split(train_size=0.6, vali_size=0.2)

In [14]:
d = {'# Queries': [train.n_queries, vali.n_queries, test.n_queries], 
     '# Instances': [train.n_instances, vali.n_instances, test.n_instances], 
     '# Features': [train.n_features, vali.n_features, test.n_features],}
df = pd.DataFrame(data=d, index=["Train", "Vali", "Test"])
df

| | # Features | # Instances | # Queries |
|---|---|---|---|
| Train | 136 | 437968 | 3600 |
| Vali | 136 | 142682 | 1200 |
| Test | 136 | 142762 | 1200 |


If `vali_size` is set to 0, the method will skip the creation of a validation set.
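
The table shows that the ratios hold exactly on the number of queries (3600/1200/1200), while the number of instances only approximates them (the train partition holds about 60.5% of the instances), since the split works on whole queries. A query-level split of this kind can be sketched as follows. `split_query_ids` is a hypothetical function written for illustration; its signature, rounding and seeding are assumptions, and RankEval's actual `split` may differ in all of them.

```python
import random

def split_query_ids(query_ids, train_size, vali_size, seed=None):
    """Shuffle the query ids, then cut them into train/vali/test lists."""
    rng = random.Random(seed)
    shuffled = list(query_ids)
    rng.shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_size))
    n_vali = int(round(len(shuffled) * vali_size))
    train = shuffled[:n_train]
    vali = shuffled[n_train:n_train + n_vali]  # empty when vali_size == 0
    test = shuffled[n_train + n_vali:]         # whatever is left
    return train, vali, test

train_q, vali_q, test_q = split_query_ids(range(1, 6001), 0.6, 0.2, seed=42)
print(len(train_q), len(vali_q), len(test_q))  # 3600 1200 1200
```

Because every query id lands in exactly one partition, all the documents of a query stay together, which is what a LtR evaluation needs.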

# Accessing low-level information

The `Dataset` class stores the dataset information and provides utilities for easily accessing it. In some situations, however, it is necessary to access the low-level information stored there. In this notebook, we describe how this information is stored, and how to iterate over the queries of a dataset accessing features, labels and query ids in order.

We start by describing the basic components of a dataset instance.

These are the main characteristics of the dataset we are using:

In [15]:
d = {'# Queries': [msn_train.n_queries], 
     '# Instances': [msn_train.n_instances], 
     '# Features': [msn_train.n_features],}
df = pd.DataFrame(data=d, index=["Dataset"])
df

| | # Features | # Instances | # Queries |
|---|---|---|---|
| Dataset | 136 | 723412 | 6000 |


Each dataset is described first of all by its feature matrix (numpy 2d array). The rows of this matrix are the instances, whereas the columns are the features. The shape of the matrix is thus (n_instances, n_features):

In [16]:
msn_train.X.shape

(723412, 136)

Each cell of this matrix holds the value of a single feature/document pair. For example, the value of the 10th feature of the 5th instance is (remember that indices start from 0):

In [17]:
msn_train.X[4, 9]

1.0

The ground truth labels are stored in a separate vector (numpy 1d array), with a single value for each dataset instance. The shape of the vector is thus (n_instances):

In [18]:
msn_train.y.shape

(723412,)

Each cell of this vector holds the ground truth value of a single instance with respect to its query. For example, the ground truth value of the 5th instance is (recall that ground truth labels usually range in [0, 4], with 0 meaning completely irrelevant and 4 perfectly relevant):

In [19]:
msn_train.y[4]

1.0

The query ids of the dataset are stored in a separate vector. This information is not strictly needed, but when you manipulate a dataset it is sometimes important to preserve it (e.g., for comparison).
It is stored in a vector (numpy 1d array) with a single value for each dataset query. The shape of the vector is thus (n_queries):

In [20]:
msn_train.query_ids.shape

(6000,)

The query id of the first query is:

In [21]:
msn_train.query_ids[0]

1

The offsets of each query are stored in another vector (numpy 1d array). These offsets allow each dataset instance to be associated with the correct query. Recall that the features and ground truth labels of all the queries are stored contiguously, so a mechanism is needed to discriminate and reconstruct the information regarding each query.

This array stores the starting offset of each query (starting from the first instance, identified by the first row). Its shape is thus (n_queries+1): the last element is not strictly needed but is added for simplicity (its value is always the number of instances in the dataset).

In [22]:
msn_train.query_offsets.shape

(6001,)

Given this information, the instances of the i-th query have indices ranging from query_offsets[i] up to query_offsets[i+1], with the latter excluded. For example, the first query starts at index 0 and contains 86 documents, up to index 86 (excluded):

In [23]:
msn_train.query_offsets[0:2]

array([ 0, 86])
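
The slicing rule can be tried out on a toy dataset. This is a plain-Python sketch with made-up values: `X`, `y` and `query_offsets` only mimic the roles of the corresponding RankEval attributes, and `query_block` is a hypothetical helper.

```python
# Toy dataset with 3 queries of 2, 3 and 1 documents (made-up values).
X = [[0.1, 1.0], [0.2, 0.0],              # query at position 0
     [0.3, 1.0], [0.4, 1.0], [0.5, 0.0],  # query at position 1
     [0.6, 0.0]]                          # query at position 2
y = [2, 0, 1, 1, 0, 3]
query_offsets = [0, 2, 5, 6]  # n_queries + 1 entries; the last one equals len(y)

def query_block(i):
    """Return (features, labels) of the query at 0-based position i."""
    start, end = query_offsets[i], query_offsets[i + 1]
    return X[start:end], y[start:end]

features, labels = query_block(1)
print(labels)  # [1, 1, 0]
```

The extra trailing offset is what lets the last query be sliced with the same expression as all the others.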

The number of documents in each query can thus be derived from the `query_offsets` vector. However, RankEval provides a simple utility for accessing this information, the `get_query_sizes` method:

In [24]:
msn_train.get_query_sizes()

array([ 86, 106,  92, ...,  79, 180,  40])

This method returns a vector (numpy 1d array) with a single element for each query, holding the number of documents belonging to that query. As noted above, the first query has 86 documents, the second has 106 documents, and so on. The shape of this vector is thus (n_queries):

In [25]:
msn_train.get_query_sizes().shape

(6000,)
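
The relation between offsets and sizes is a simple difference of consecutive offsets. As a plain-Python sketch (not RankEval's implementation), the offsets below are reconstructed from the first three sizes reported above (86, 106 and 92), so only the values 0 and 86 come directly from the notebook output.

```python
# Offsets of the first three queries, derived by accumulating the sizes
# 86, 106 and 92 shown above: 0, 0+86, 86+106, 192+92.
query_offsets = [0, 86, 192, 284]

# Size of query i = query_offsets[i + 1] - query_offsets[i]
query_sizes = [end - start
               for start, end in zip(query_offsets, query_offsets[1:])]
print(query_sizes)  # [86, 106, 92]
```

With numpy arrays the same computation is usually written as `np.diff(query_offsets)`.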

In some situations we need the list of query ids with one entry for each instance (in place of the compact version adopted by RankEval). For example, this is needed by LightGBM when you create a dataset. RankEval provides this information with the `get_qids_dataset` utility, which returns a vector (numpy 1d array) with shape (n_instances):

In [26]:
msn_train.get_qids_dataset().shape

(723412,)

The first document has qid:

In [27]:
msn_train.get_qids_dataset()[0]

1
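
Expanding the compact representation into one qid per instance amounts to repeating each query id as many times as the query has documents. The sketch below shows the idea in plain Python on made-up data; `expand_qids` is a hypothetical helper and is not how `get_qids_dataset` is implemented.

```python
def expand_qids(query_ids, query_offsets):
    """Repeat each query id once per document of that query."""
    expanded = []
    for i, qid in enumerate(query_ids):
        size = query_offsets[i + 1] - query_offsets[i]
        expanded.extend([qid] * size)
    return expanded

# Toy data: 3 queries with 2, 3 and 1 documents (made-up ids).
query_ids = [1, 7, 9]
query_offsets = [0, 2, 5, 6]

qids_per_instance = expand_qids(query_ids, query_offsets)
print(qids_per_instance)  # [1, 1, 7, 7, 7, 9]
```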

# Iterate over each query of a dataset

Sometimes it is necessary to iterate over the low-level information stored in the dataset instance. RankEval allows you to do that with the `query_iterator` utility, which provides an iterator over the queries of the dataset. In particular, each element of the iterator is a tuple (qid, start_offset, end_offset), giving the qid of the query and the range of row indices of the instances belonging to it. For example, the first element of the iterator is related to the query with `qid=1`, with offsets ranging from 0 to 86 (excluded):

In [28]:
next(msn_train.query_iterator())

(1, 0, 86)
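
An iterator of this shape is easy to mimic, which helps to see how the tuples are meant to be used. The generator below is a plain-Python sketch, not RankEval's `query_iterator`: `iter_queries` is a hypothetical name, and apart from the first query (qid 1, 86 documents) all values are made up.

```python
def iter_queries(query_ids, query_offsets):
    """Yield (qid, start_offset, end_offset) for each query, in order."""
    for i, qid in enumerate(query_ids):
        yield qid, query_offsets[i], query_offsets[i + 1]

# The first query matches the notebook output (qid=1, 86 documents);
# the second qid and all the labels are made up for illustration.
query_ids = [1, 16]
query_offsets = [0, 86, 192]
y = [0] * 192
y[3] = 2      # one relevant document in the first query
y[100] = 1    # one relevant document in the second query

first = next(iter_queries(query_ids, query_offsets))
print(first)  # (1, 0, 86)

# Typical use: compute a per-query statistic by slicing with the offsets.
relevant_per_query = {
    qid: sum(1 for label in y[start:end] if label > 0)
    for qid, start, end in iter_queries(query_ids, query_offsets)
}
print(relevant_per_query)  # {1: 1, 16: 1}
```

The same loop pattern applies to the feature matrix: slicing rows `start:end` gives the documents of one query.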