# Exploration of ranking datasets

**Plan**
- Load sample.tsv data
- Get sense of data format
    - May be same as https://www.microsoft.com/en-us/research/project/mslr/
- Figure out how to represent URL pairs
- Write function to create features from dataset
    - Parameters:
        - Data path (string)
        - Repeat pairs (bool)
        - Two sided (bool)
        - Delta features (bool)
    - Return:
        - Features (np array?)

In [1]:
import numpy as np
import pandas as pd
import re

## Sample Data

In [2]:
# data = pd.read_csv('/Users/Ashtekar15/Desktop/Thesis/MGBoost/other/test_data/ranking/sample.tsv',
#                    sep='\t')

In [3]:
# data.shape

In [4]:
# for i in list(data): print(i)

There are almost 2,000 columns, and it is unclear which ones correspond to the query id, URL id, or score...

In [5]:
# # Explicitly delete to save memory
# del data

## Microsoft Data
MSLR-WEB10K/Fold1/train.txt

https://www.microsoft.com/en-us/research/project/mslr/

In [6]:
path = '/Users/Ashtekar15/Desktop/Thesis/MGBoost/other/test_data/ranking/MSLR-WEB10K/Fold1/'

# Use the validation set for prototyping since it is smaller (1/3 the size of the train set)
data = pd.read_csv(path + 'vali.txt',
                   sep=' ',
                   header=None)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,129,130,131,132,133,134,135,136,137,138
0,0,qid:10,1:2,2:0,3:0,4:0,5:2,6:0.666667,7:0,8:0,...,128:1,129:0,130:117,131:55115,132:7,133:2,134:0,135:0,136:0,
1,0,qid:10,1:1,2:0,3:1,4:3,5:3,6:0.333333,7:0,8:0.333333,...,128:0,129:0,130:153,131:3866,132:17,133:104,134:0,135:0,136:0,
2,1,qid:10,1:3,2:0,3:3,4:0,5:3,6:1,7:0,8:1,...,128:0,129:9,130:266,131:56137,132:5,133:2,134:0,135:0,136:0,
3,0,qid:10,1:3,2:0,3:2,4:0,5:3,6:1,7:0,8:0.666667,...,128:8,129:0,130:541,131:12621,132:11,133:11,134:0,135:0,136:0,
4,1,qid:10,1:3,2:0,3:3,4:0,5:3,6:1,7:0,8:1,...,128:6,129:0,130:14687,131:40205,132:5,133:3,134:0,135:0,136:0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235254,2,qid:29995,1:1,2:0,3:0,4:0,5:1,6:0.50000,7:0,8:0,...,128:103,129:62,130:5131,131:65535,132:2,133:1,134:0,135:0,136:0,
235255,2,qid:29995,1:1,2:0,3:1,4:0,5:1,6:0.50000,7:0,8:0.50000,...,128:428,129:2,130:1940,131:54880,132:7,133:4,134:0,135:0,136:0,
235256,1,qid:29995,1:1,2:0,3:0,4:0,5:1,6:0.50000,7:0,8:0,...,128:24242,129:27,130:6135,131:51819,132:2,133:4,134:0,135:0,136:0,
235257,2,qid:29995,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,...,128:100,129:0,130:3121,131:61234,132:49,133:6,134:0,135:0,136:0,


In [7]:
# Remove last column of NaN
data = data.iloc[:, :-1]

In [8]:
# Save hand-labeled scores
scores = data.iloc[:, 0]

# Use regex to get number after colon for every column other than score
features = data.iloc[:, 1:].applymap(lambda x: float(re.findall(r':(.*)', x)[0]))

# Put features and scores in same dataframe
data = features
data['score'] = scores

data

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,129,130,131,132,133,134,135,136,137,score
0,10.0,2.0,0.0,0.0,0.0,2.0,0.666667,0.0,0.000000,0.0,...,1.0,0.0,117.0,55115.0,7.0,2.0,0.0,0.0,0.0,0
1,10.0,1.0,0.0,1.0,3.0,3.0,0.333333,0.0,0.333333,1.0,...,0.0,0.0,153.0,3866.0,17.0,104.0,0.0,0.0,0.0,0
2,10.0,3.0,0.0,3.0,0.0,3.0,1.000000,0.0,1.000000,0.0,...,0.0,9.0,266.0,56137.0,5.0,2.0,0.0,0.0,0.0,1
3,10.0,3.0,0.0,2.0,0.0,3.0,1.000000,0.0,0.666667,0.0,...,8.0,0.0,541.0,12621.0,11.0,11.0,0.0,0.0,0.0,0
4,10.0,3.0,0.0,3.0,0.0,3.0,1.000000,0.0,1.000000,0.0,...,6.0,0.0,14687.0,40205.0,5.0,3.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235254,29995.0,1.0,0.0,0.0,0.0,1.0,0.500000,0.0,0.000000,0.0,...,103.0,62.0,5131.0,65535.0,2.0,1.0,0.0,0.0,0.0,2
235255,29995.0,1.0,0.0,1.0,0.0,1.0,0.500000,0.0,0.500000,0.0,...,428.0,2.0,1940.0,54880.0,7.0,4.0,0.0,0.0,0.0,2
235256,29995.0,1.0,0.0,0.0,0.0,1.0,0.500000,0.0,0.000000,0.0,...,24242.0,27.0,6135.0,51819.0,2.0,4.0,0.0,0.0,0.0,1
235257,29995.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,100.0,0.0,3121.0,61234.0,49.0,6.0,0.0,0.0,0.0,2
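To make the parsing step easier to check, here is a minimal sketch on a made-up two-row frame in the same `key:value` layout. It uses pandas string methods instead of the regex, which is one possible alternative rather than the approach used above; the names `raw`, `labels`, and `parsed` are hypothetical. (Note that `DataFrame.applymap` is deprecated in favor of `DataFrame.map` from pandas 2.1 onward, so the cell above may emit a warning on recent versions.)

```python
import pandas as pd

# Made-up rows in the same layout: label, qid:<id>, <feature_idx>:<value>, ...
raw = pd.DataFrame([
    [0, 'qid:10', '1:2', '2:0.666667'],
    [1, 'qid:10', '1:3', '2:1'],
])

# First column holds the relevance label
labels = raw.iloc[:, 0]

# For every other column, keep the text after the first colon and cast to float
parsed = raw.iloc[:, 1:].apply(
    lambda col: col.str.split(':', n=1).str[1].astype(float)
)
parsed['score'] = labels

print(parsed)
```

Splitting column by column keeps the work vectorized, which may matter on the full 235k-row frame.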


In [9]:
data = data.rename(columns={1: 'query_id'})
data

Unnamed: 0,query_id,2,3,4,5,6,7,8,9,10,...,129,130,131,132,133,134,135,136,137,score
0,10.0,2.0,0.0,0.0,0.0,2.0,0.666667,0.0,0.000000,0.0,...,1.0,0.0,117.0,55115.0,7.0,2.0,0.0,0.0,0.0,0
1,10.0,1.0,0.0,1.0,3.0,3.0,0.333333,0.0,0.333333,1.0,...,0.0,0.0,153.0,3866.0,17.0,104.0,0.0,0.0,0.0,0
2,10.0,3.0,0.0,3.0,0.0,3.0,1.000000,0.0,1.000000,0.0,...,0.0,9.0,266.0,56137.0,5.0,2.0,0.0,0.0,0.0,1
3,10.0,3.0,0.0,2.0,0.0,3.0,1.000000,0.0,0.666667,0.0,...,8.0,0.0,541.0,12621.0,11.0,11.0,0.0,0.0,0.0,0
4,10.0,3.0,0.0,3.0,0.0,3.0,1.000000,0.0,1.000000,0.0,...,6.0,0.0,14687.0,40205.0,5.0,3.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235254,29995.0,1.0,0.0,0.0,0.0,1.0,0.500000,0.0,0.000000,0.0,...,103.0,62.0,5131.0,65535.0,2.0,1.0,0.0,0.0,0.0,2
235255,29995.0,1.0,0.0,1.0,0.0,1.0,0.500000,0.0,0.500000,0.0,...,428.0,2.0,1940.0,54880.0,7.0,4.0,0.0,0.0,0.0,2
235256,29995.0,1.0,0.0,0.0,0.0,1.0,0.500000,0.0,0.000000,0.0,...,24242.0,27.0,6135.0,51819.0,2.0,4.0,0.0,0.0,0.0,1
235257,29995.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,100.0,0.0,3121.0,61234.0,49.0,6.0,0.0,0.0,0.0,2


## Checking size of potential features

In [10]:
# Number of queries, specific query ids
data.query_id.unique().shape, data.query_id.unique()[:10]

((2000,), array([ 10.,  25.,  40.,  55.,  70.,  85., 100., 115., 130., 145.]))

In [11]:
# Want: average number of URLs for each query
# (count the rows in each query group, then average the counts)
data.groupby('query_id').size().mean()

117.6295

In [12]:
# Estimated size of generated features in GB
# (with two_sided=False, repeat_pairs=False)
# ~m^2 / 2 pairs per query, 2000 queries, 138 * 2 values per pair, 4 bytes per value
((((117.6295 ** 2) / 2) * 2000) * (138 * 2) * 4) / (10 ** 9)

15.275715994356

The generated features will be much too large to fit in memory.
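As a rough aid for comparing options, the estimate from the cell above can be wrapped in a small helper. This is only a sketch under the same assumptions as that cell (about m²/2 pairs per query, 138 columns per URL, 4 bytes per value); the function name and its parameters are hypothetical.

```python
def estimate_pair_features_gb(avg_urls, n_queries, n_cols=138,
                              bytes_per_value=4, two_sided=False):
    """Rough size in GB of the pairwise features, mirroring the estimate above."""
    pairs_per_query = avg_urls ** 2       # ordered pairs, including self-pairs
    if not two_sided:
        pairs_per_query /= 2              # keep each unordered pair only once
    values_per_pair = n_cols * 2          # columns of both URLs side by side
    total_bytes = pairs_per_query * n_queries * values_per_pair * bytes_per_value
    return total_bytes / 10 ** 9


full_gb = estimate_pair_features_gb(117.6295, 2000)   # full validation set
subset_gb = estimate_pair_features_gb(117.6295, 30)   # a 30-query subset
print(full_gb, subset_gb)
```

Because it squares the average number of URLs per query, this figure is likely a lower bound when query sizes vary; summing m² over the actual per-query counts would give a tighter number.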

Possible solutions:
- Choose some queries (by query id) and run the experiment
    - Choose ~30 (?); by the estimate above this is roughly 0.23 GB (15.28 GB * 30 / 2000)
- Do something completely different (?)
    - Run on PSU cluster (?)
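The first option (running on a subset of queries) could look like the sketch below. It assumes a frame with a `query_id` column, as built earlier; the helper name `sample_queries`, the seed, and the toy frame are hypothetical and only there to make the example self-contained.

```python
import numpy as np
import pandas as pd


def sample_queries(data, n_queries=30, seed=0):
    """Return the rows belonging to a random subset of query ids."""
    rng = np.random.default_rng(seed)
    qids = data['query_id'].unique()
    # Sample without replacement, capped at the number of available queries
    chosen = rng.choice(qids, size=min(n_queries, len(qids)), replace=False)
    return data[data['query_id'].isin(chosen)]


# Toy frame with 5 queries and 2 URLs each (made-up values)
toy = pd.DataFrame({
    'query_id': np.repeat([10, 25, 40, 55, 70], 2),
    'score': [0, 1, 2, 0, 1, 1, 0, 2, 1, 0],
})
subset = sample_queries(toy, n_queries=2)
print(subset['query_id'].unique())
```

Sampling whole queries rather than individual rows keeps every URL of a chosen query together, which the pairwise features need.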

## Algorithm for generating features from data

    for each unique query id:

        temp = rows of data with the given query id (m URLs)

        for each URL i in temp (i -> m):

            (use a start_idx variable so both cases share one inner loop)
            if two_sided:
                start_idx = 0      (j runs over all m URLs)
            else:
                start_idx = i      (j runs over i -> m only)

            for each URL j in temp (start_idx -> m):

                (use a repeat count so both cases share one append)
                if repeat_pairs:
                    n_repeats = y_diff + 1
                else:
                    n_repeats = 1  (only do computation once)

                repeat n_repeats times:

                    if delta_features:
                        append features[i], features[j], features[i] - features[j],
                               scores[i] - scores[j]
                    else:
                        append features[i], features[j], scores[i] - scores[j]
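
One way the outline above might translate to Python is sketched below. It is not a final implementation and makes several assumptions that the outline leaves open: it takes a DataFrame rather than a data path, it skips self-pairs (i == j), and it reads `y_diff` as the absolute score difference of the pair. The function name `make_pair_features` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def make_pair_features(data, repeat_pairs=False, two_sided=False,
                       delta_features=False):
    """Build one row per URL pair from a frame with 'query_id' and 'score'."""
    feature_cols = [c for c in data.columns if c not in ('query_id', 'score')]
    rows = []

    for _, temp in data.groupby('query_id'):
        feats = temp[feature_cols].to_numpy(dtype=np.float32)
        scores = temp['score'].to_numpy(dtype=np.float32)
        m = len(temp)

        for i in range(m):
            # two_sided visits (i, j) and (j, i); otherwise only j > i
            start_idx = 0 if two_sided else i + 1
            for j in range(start_idx, m):
                if i == j:
                    continue  # assumption: self-pairs carry no information
                score_diff = scores[i] - scores[j]
                parts = [feats[i], feats[j]]
                if delta_features:
                    parts.append(feats[i] - feats[j])
                parts.append(np.array([score_diff], dtype=np.float32))
                row = np.concatenate(parts)
                # assumption: y_diff is the absolute score difference
                n_repeats = int(abs(score_diff)) + 1 if repeat_pairs else 1
                rows.extend([row] * n_repeats)

    width = len(feature_cols) * (3 if delta_features else 2) + 1
    if not rows:
        return np.empty((0, width), dtype=np.float32)
    return np.vstack(rows)


# Toy frame: two queries with 3 and 2 URLs (made-up values)
toy = pd.DataFrame({
    'query_id': [10, 10, 10, 25, 25],
    'f1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'f2': [0.0, 1.0, 0.0, 1.0, 0.0],
    'score': [0, 2, 1, 1, 1],
})
pairs = make_pair_features(toy)
print(pairs.shape)
```

Building rows in a Python list is simple but slow and memory-hungry; given the size estimate earlier, it is probably only practical on a small subset of queries.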