# Task 1 Alignment

This notebook computes the target distributions and retrieved page alignments for **Task 1**.
It depends on the output of the PageAlignments notebook.

This notebook can be run in two modes: 'train', to process the training topics, and 'eval' for the eval topics.

In [1]:
DATA_MODE = 'train'

## Setup

We begin by loading necessary libraries:

In [2]:
import sys
import warnings
from collections import namedtuple
from functools import reduce
from itertools import product
import operator
from pathlib import Path

In [3]:
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gzip
import json
from natural.size import binarysize
from natural.number import number

Set up progress bar and logging support:

In [4]:
from tqdm.auto import tqdm
tqdm.pandas(leave=False)



In [5]:
import sys, logging
logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger('Task1Alignment')

And set up an output directory:

In [6]:
from wptrec.save import OutRepo
output = OutRepo('data/metric-tables')

## Data and Helpers

Most data loading is outsourced to `MetricInputs`.  First we save the data mode where metric inputs can find it:

In [7]:
import wptrec
wptrec.DATA_MODE = DATA_MODE

In [8]:
from MetricInputs import *

data/trec_2022_train_reldocs.jsonl


INFO:MetricInputs:reading data\metric-tables\page-sub-geo-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-src-geo-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-gender-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-occ-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-alpha-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-age-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-pop-align.parquet
INFO:MetricInputs:reading data\metric-tables\page-langs-align.parquet


In [9]:
dimensions

[<dimension "sub-geo": 21 levels>,
 <dimension "src-geo": 21 levels>,
 <dimension "gender": 4 levels>,
 <dimension "occ": 33 levels>,
 <dimension "alpha": 4 levels>,
 <dimension "age": 4 levels>,
 <dimension "pop": 4 levels>,
 <dimension "langs": 3 levels>]

### qrel join

We want a function to join alignments with qrels:

In [10]:
def qr_join(align):
    return qrels.join(align, on='page_id').set_index(['topic_id', 'page_id'])
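
As a small illustration of what this join does, here is a sketch on made-up data. The names `toy_qrels` and `toy_align` are hypothetical stand-ins for `qrels` and an alignment table indexed by `page_id`; the real tables come from `MetricInputs`.

```python
import pandas as pd

# Hypothetical toy data: two topics sharing one page.
toy_qrels = pd.DataFrame({
    'topic_id': [1, 1, 2],
    'page_id': [10, 11, 10],
})
# Hypothetical alignment table, indexed by page_id like the real ones are assumed to be.
toy_align = pd.DataFrame(
    {'@UNKNOWN': [0.0, 1.0], 'known': [1.0, 0.0]},
    index=pd.Index([10, 11], name='page_id'),
)

# Same pattern as qr_join: look up each qrel's page_id in the alignment index,
# then index the result by (topic, page).
toy_joined = toy_qrels.join(toy_align, on='page_id').set_index(['topic_id', 'page_id'])
print(toy_joined)
```

A page that is relevant to several topics simply appears once per topic in the result.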

### norm_dist

And a function to normalize to a distribution:

In [11]:
def norm_dist_df(mat):
    sums = mat.sum('columns')
    return mat.divide(sums, 'rows')
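
A quick check of the same row-normalization idea on a toy frame (the frame and its values are made up for illustration; the explicit `axis=` arguments are equivalent to the positional ones used above):

```python
import pandas as pd

# Hypothetical 2x2 matrix of non-negative weights.
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 2.0]}, index=['r1', 'r2'])

# Sum across columns to get one total per row, then divide each row by its total.
row_sums = toy.sum(axis=1)
toy_dist = toy.div(row_sums, axis=0)
print(toy_dist)
```

Rows that sum to zero would produce NaN here, so this sketch assumes every row has some mass.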

## Prep Overview

Now that we have our alignments and qrels, we are ready to prepare the Task 1 metrics.

We're first going to prepare the target distributions; then we will compute the alignments for the retrieved pages.

## Subject Geography

Subject geography targets the average of the relevant set alignments and the world population.

In [13]:
qr_sub_geo_align = qr_join(sub_geo_align)
qr_sub_geo_align

topic_id,page_id,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
84,572,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,627,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,678,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,903,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84,1193,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2859,69878035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2859,69879576,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2859,69882349,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2859,69887896,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


For purely geographic fairness, we average the known-region part of each topic's alignment with the world population, scaled to the known mass; the `@UNKNOWN` column is left unchanged:

In [14]:
qr_sub_geo_tgt = qr_sub_geo_align.groupby('topic_id').mean()
qr_sub_geo_fk = qr_sub_geo_tgt.iloc[:, 1:].sum('columns')
qr_sub_geo_tgt.iloc[:, 1:] *= 0.5
qr_sub_geo_tgt.iloc[:, 1:] += qr_sub_geo_fk.apply(lambda k: world_pop * k * 0.5)
qr_sub_geo_tgt.head()

topic_id,@UNKNOWN,Antarctica,Caribbean,Central America,Central Asia,Eastern Africa,Eastern Asia,Eastern Europe,Middle Africa,Northern Africa,...,Northern Europe,Oceania,South America,South-eastern Asia,Southern Africa,Southern Asia,Southern Europe,Western Africa,Western Asia,Western Europe
84,0.565557,3.35441e-08,0.002502,0.006862,0.00303,0.015496,0.057007,0.01512,0.004712,0.007335,...,0.035066,0.038202,0.016755,0.023619,0.004326,0.064602,0.022839,0.012731,0.010956,0.02533
111,0.835637,1.269079e-08,0.00196,0.003607,0.000739,0.021889,0.018752,0.003447,0.00246,0.002528,...,0.001532,0.017619,0.031333,0.01217,0.004777,0.020544,0.001931,0.003872,0.002891,0.00237
265,0.831659,0.0008998958,0.000604,0.00249,0.001123,0.004972,0.02337,0.007559,0.002558,0.004063,...,0.009939,0.003705,0.008164,0.008057,0.001746,0.022624,0.00673,0.004274,0.003843,0.01313
323,0.308798,7.05938e-05,0.005224,0.012907,0.004569,0.023592,0.092005,0.032874,0.011654,0.013631,...,0.044531,0.017033,0.039143,0.038852,0.006107,0.094145,0.021056,0.020223,0.01847,0.050314
396,0.107112,6.894146e-08,0.003881,0.01591,0.004326,0.023964,0.128285,0.032485,0.008756,0.015524,...,0.068992,0.016482,0.03587,0.048727,0.005273,0.151294,0.034219,0.022984,0.021168,0.050547


Make sure the rows are distributions:

In [15]:
qr_sub_geo_tgt.sum('columns').describe()

count    5.000000e+01
mean     1.000000e+00
std      1.197429e-16
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
dtype: float64

Everything is 1, we're good to go!
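
Why the rows stay distributions can be checked on a toy vector. This is a sketch with made-up numbers, not the real data: `toy_tgt` stands in for one topic's mean alignment (first entry is `@UNKNOWN`) and `toy_bg` for a background distribution such as `world_pop`.

```python
import numpy as np

# Hypothetical per-topic mean alignment: entry 0 is @UNKNOWN, the rest are known regions.
toy_tgt = np.array([0.4, 0.3, 0.2, 0.1])
# Hypothetical background over the three known regions (sums to 1).
toy_bg = np.array([0.5, 0.25, 0.25])

# Fraction of the topic's mass that has a known region.
known_mass = toy_tgt[1:].sum()

# Average the known part with the background, scaled to the known mass.
toy_out = toy_tgt.copy()
toy_out[1:] = 0.5 * toy_tgt[1:] + 0.5 * known_mass * toy_bg

print(toy_out, toy_out.sum())
```

Both halves of the average carry half of the known mass, so the known columns still sum to the original known mass and the whole row still sums to 1.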

In [16]:
output.save_table(qr_sub_geo_tgt, f'task1-{DATA_MODE}-sub-geo-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-sub-geo-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-sub-geo-target.csv.gz: 10.71 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-sub-geo-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-sub-geo-target.parquet: 25.97 KiB


## Gender

Now we're going to grab the gender alignments.  Again, only the known categories are averaged with the background; the `@UNKNOWN` mass is left as-is.

In [12]:
qr_gender_align = qr_join(gender_align)
qr_gender_align.head()

topic_id,page_id,@UNKNOWN,female,male,NB
84,572,1.0,0.0,0.0,0.0
84,627,1.0,0.0,0.0,0.0
84,678,1.0,0.0,0.0,0.0
84,903,1.0,0.0,0.0,0.0
84,1193,1.0,0.0,0.0,0.0


In [18]:
qr_gender_tgt = qr_gender_align.groupby('topic_id').mean()
qr_gender_fk = qr_gender_tgt.iloc[:, 1:].sum('columns')
qr_gender_tgt.iloc[:, 1:] *= 0.5
qr_gender_tgt.iloc[:, 1:] += qr_gender_fk.apply(lambda k: gender_tgt * k * 0.5)
qr_gender_tgt.head()

topic_id,@UNKNOWN,female,male,NB
84,0.905943,0.03379,0.059797,0.00047
111,0.996106,0.001344,0.002531,1.9e-05
265,0.883099,0.038968,0.077328,0.000647
323,0.890183,0.033058,0.07621,0.000549
396,0.007847,0.428546,0.558768,0.005349


In [19]:
output.save_table(qr_gender_tgt, f'task1-{DATA_MODE}-gender-target', parquet=True)

INFO:wptrec.save:saving CSV to data\metric-tables\task1-train-gender-target.csv.gz
INFO:wptrec.save:data\metric-tables\task1-train-gender-target.csv.gz: 2.24 KiB
INFO:wptrec.save:saving Parquet to data\metric-tables\task1-train-gender-target.parquet
INFO:wptrec.save:data\metric-tables\task1-train-gender-target.parquet: 6.80 KiB


### Dimension Definitions

Let's define background distributions for some of our dimensions:

In [20]:
dim_backgrounds = {
    'sub-geo': world_pop,
    #'gender': gender_tgt,
}

Now we'll make a list of dimensions to treat with averaging:

In [21]:
DR = namedtuple('DimRec', ['name', 'align', 'background'], defaults=[None])
avg_dims = [
    DR(d.name, d.page_align_xr, xr.DataArray(dim_backgrounds[d.name], dims=[d.name]))
    for d in dimensions
    if d.name in dim_backgrounds
]
[d.name for d in avg_dims]

['sub-geo']
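
For readers unfamiliar with `defaults` on a `namedtuple`: the defaults apply to the rightmost fields, so `background` may be omitted. A minimal sketch with placeholder strings instead of real alignment arrays (`ToyRec` is a hypothetical name used only here):

```python
from collections import namedtuple

# Same field layout as DimRec; one default, so it applies to the last field only.
ToyRec = namedtuple('ToyRec', ['name', 'align', 'background'], defaults=[None])

with_bg = ToyRec('sub-geo', 'alignment-placeholder', 'background-placeholder')
without_bg = ToyRec('gender', 'alignment-placeholder')

print(with_bg.background, without_bg.background)
```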

Now: these dimensions are in the original order - `dimensions` has the averaged dimensions before the non-averaged ones. **This is critical for the rest of the code to work.**

### Demo

To demonstrate how the logic works, let's first work it out in cells for one query (the first topic in the qrels).

What are its documents?

In [22]:
qno = qrels['topic_id'].iloc[0]
qdf = qrels[qrels['topic_id'] == qno]
qdf.name = qno
qdf

,topic_id,page_id
0,84,572
1,84,627
2,84,678
3,84,903
4,84,1193
...,...,...
7416,84,69689018
7417,84,69730264
7418,84,69738629
7419,84,69846681


We can use these page IDs to get its alignments.

In [23]:
q_pages = qdf['page_id'].values

#### Accumulating Initial Targets

We're now going to grab the dimensions that have targets, and create a single xarray with all of them:

In [24]:
q_xta = reduce(operator.mul, [d.align.loc[q_pages] for d in avg_dims])
q_xta

In [25]:
from wptrec.dimension import mean_outer

Now, we need to combine this with the other matrix to produce a complete alignment matrix, which we then will collapse into a query target matrix.  However, we don't have memory to do the whole thing at one go. Therefore, we will do it page by page.

The `mean_outer` function does this. With only one averaged dimension here, the cell below takes a plain mean over pages instead:

In [26]:
q_tam = q_xta
q_tam = q_tam.mean(dim='page')

In [27]:
q_tam.sum()
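
With more than one averaged dimension, multiplying the per-page alignments yields a per-page outer product, and the mean over pages is then a joint distribution. The sketch below uses plain NumPy broadcasting as a stand-in for the xarray multiplication; the two alignment matrices are made up.

```python
import numpy as np

# Hypothetical per-page alignments for two dimensions (rows are pages, each row sums to 1).
geo = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])                     # 3 pages x 2 levels
gender = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.5, 0.5]])   # 3 pages x 3 levels

# Per-page outer product: shape (page, geo, gender).
per_page = geo[:, :, None] * gender[:, None, :]

# Averaging over pages gives the joint target; it still sums to 1.
joint = per_page.mean(axis=0)
print(joint.shape, joint.sum())
```

The intermediate `per_page` array grows with the product of all dimension sizes, which is why the full computation cannot be materialized for every page at once.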

In 2021, we ignored fully-unknown for Task 1. However, it isn't clear how to properly do that with some attributes that are never fully unknown - they still need to be counted. Therefore, we consistently treat fully-unknown as a distinct category for both Task 1 and Task 2 metrics.

#### Data Subsetting

Before we average, we need to be able to select data by its known/unknown status.

Let's start by making a list of cases - the known/unknown status of each dimension.

In [28]:
avg_cases = list(product(*[[True, False] for d in avg_dims]))
avg_cases

[(True,), (False,)]

The last entry is the all-unknown case - remove it:

In [29]:
avg_cases.pop()
avg_cases

[(True,)]
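
With a single averaged dimension the case list is trivial. To see the general shape, here is the same construction for a hypothetical pair of averaged dimensions:

```python
from itertools import product

# Hypothetical: two averaged dimensions instead of one.
toy_dims = ['sub-geo', 'gender']

# One True/False flag per dimension; product enumerates every combination.
toy_cases = list(product(*[[True, False] for _ in toy_dims]))

# The last combination is all-False, i.e. every dimension unknown.
all_unknown = toy_cases.pop()
print(toy_cases, all_unknown)
```

Because `product` varies the last position fastest and each flag list starts with `True`, the all-unknown case is always generated last, which is what makes `pop()` safe here.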

We now want the ability to create an indexer to look up the subset of the alignment frame corresponding to a case. Let's write that function:

In [30]:
def case_selector(case):
    def mksel(known):
        if known:
            # select all but 1st column
            return slice(1, None, None)
        else:
            # select 1st column
            return 0
    
    return tuple(mksel(k) for k in case)

Let's quickly test this function:

In [31]:
case_selector(avg_cases[0])

(slice(1, None, None),)

In [32]:
case_selector(avg_cases[-1])

(slice(1, None, None),)

And make sure we can use it:

In [33]:
q_tam[case_selector(avg_cases[0])]

Fantastic! Given a case (known and unknown statuses), we can select the subset of the target matrix with exactly those.
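
The selector becomes more interesting with two dimensions. The sketch below re-implements the same idea compactly (as the hypothetical `toy_case_selector`) and applies it to a small NumPy array, assuming index 0 along each axis is the unknown level:

```python
import numpy as np

def toy_case_selector(case):
    # known -> everything except position 0; unknown -> position 0 only
    return tuple(slice(1, None) if known else 0 for known in case)

# Hypothetical 3x4 target matrix; row 0 / column 0 play the role of @UNKNOWN.
toy = np.arange(12).reshape(3, 4)

# First dimension known, second unknown: rows 1.., column 0.
known_unknown = toy[toy_case_selector((True, False))]
# First dimension unknown, second known: row 0, columns 1..
unknown_known = toy[toy_case_selector((False, True))]
print(known_unknown, unknown_known)
```

Since slices return views in NumPy, assigning through such a selector modifies the original array in place, which is the behaviour the averaging step relies on.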

#### Averaging

Ok, now we have to - very carefully - average with our target modifier.  For each dimension that is not fully-unknown, we average with the intersectional target defined over the known dimensions.

At all times, we also need to respect the fraction of the total it represents.

We'll use the selection capabilities above to handle this.

First, let's make sure that our target matrix sums to 1 to start with:

In [34]:
q_tam.sum()

Fantastic.  This means that if we sum up a subset of the data, it will give us the fraction of the distribution that has that combination of known/unknown status.

For each condition, we are going to proceed as follows:

1. Compute an appropriate intersectional background distribution (based on the dimensions that are "known")
2. Select the subset of the target matrix with this known status
3. Compute the sum of this subset
4. Re-normalize the subset to sum to 1
5. Compute a normalization table such that each coordinate in the distributions to correct sums to 1 (so multiplying this by the background distribution spreads the background across the other dimensions appropriately), and use this to spread the background distribution
6. Average with the spread background distribution
7. Re-normalize to preserve the original sum
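
As a rough illustration of these steps, here is a NumPy sketch with one averaged dimension (rows, row 0 being unknown) and one hypothetical extra "tail" dimension (columns). All numbers are made up, and the sketch assumes every row has non-zero mass, so no imputation is needed.

```python
import numpy as np

# Hypothetical target matrix summing to 1: rows = averaged dimension, columns = tail dimension.
tm = np.array([[0.2, 0.2],
               [0.3, 0.1],
               [0.1, 0.1]])

# Step 1: background over the known levels of the averaged dimension.
bg = np.array([0.5, 0.5])

# Steps 2-3: select the known rows and record their total mass.
known = slice(1, None)
c_sum = tm[known].sum()

# Step 5: per-row distribution over the tail dimension, used to spread the background.
tail_scale = tm / tm.sum(axis=1, keepdims=True)
bg_spread = bg[:, None] * tail_scale[known] * c_sum

# Steps 4, 6, 7: average; both halves carry mass c_sum, so the subset total is preserved.
out = tm.copy()
out[known] = 0.5 * tm[known] + 0.5 * bg_spread
print(out, out.sum())
```

Scaling the spread background by `c_sum` up front avoids a separate normalize-then-rescale pass.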

Let's define the whole process as a function:

In [35]:
import gc

def avg_with_bg(tm, verbose=False):
    tm = tm.copy()
    
    tail_names = []
    
    # compute the tail mass for each coordinate (can be done once)
    tail_mass = tm.sum(tail_names)
    
    # now some things don't have any mass, but we still need to distribute background distributions.
    # solution: we impute the marginal tail distribution
    # first compute it
    tail_marg = tm.sum([d.name for d in avg_dims])
    # then impute that where we don't have mass
    tm_imputed = xr.where(tail_mass > 0, tm, tail_marg)
    # and re-compute the tail mass
    tail_mass = tm_imputed.sum(tail_names)
    # and finally we compute the rescaled matrix
    tail_scale = tm_imputed / tail_mass
    del tm_imputed
    
    for case in avg_cases:
        # for debugging: get names
        known_names = [d.name for (d, known) in zip(avg_dims, case) if known]
        if verbose:
            print('processing known:', known_names)
        
        # Step 1: background
        bg = reduce(operator.mul, [
            d.background
            for (d, known) in zip(avg_dims, case)
            if known
        ])
        if not np.allclose(bg.sum(), 1.0):
            warnings.warn('background distribution for {} sums to {}, expected 1'.format(known_names, bg.values.sum()))
        
        # Step 2: selector
        sel = case_selector(case)
        
        # Step 3: sum in preparation for normalization
        c_sum = tm[sel].sum()
        
        # Step 5: spread the background
        bg_spread = bg * tail_scale[sel] * c_sum
        if not np.allclose(bg_spread.sum(), c_sum):
            warnings.warn('rescaled background sums to {}, expected {}'.format(bg_spread.values.sum(), c_sum))
        
        # Step 4 & 6: average with the background
        tm[sel] *= 0.5
        bg_spread *= 0.5
        tm[sel] += bg_spread
                        
        if not np.allclose(tm[sel].sum(), c_sum):
            warnings.warn('target distribution for {} sums to {}, expected {}'.format(known_names, tm[sel].values.sum(), c_sum))
        gc.collect()
    return tm

And apply it:

In [36]:
q_target = avg_with_bg(q_tam, True)
print(q_target.shape)
q_target.sum()

processing known: ['sub-geo']
(21,)


In [37]:
q_target
#sourceFile = open('task1-target-test.txt', 'w')
new_numpy_ndarray = q_target.values.ravel()
np.set_printoptions(threshold=np.inf)  # np.Inf was removed in NumPy 2.0
print(len(new_numpy_ndarray))#, file = sourceFile)
#sourceFile.close()

21


In [38]:
print(number(q_target.values.size), 'values taking', binarysize(q_target.nbytes))

21 values taking 168.00 iB


Is it still a distribution?

In [39]:
q_target.sum()

We can unravel this value into a single-dimensional array representing the multidimensional target:

In [40]:
array = q_target.values.ravel()
print(array.shape)

(21,)
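
With one dimension the ravel is a no-op, but for a multidimensional target the flattening order matters. A small NumPy sketch on a made-up 2x3 array shows the row-major layout and how to map between flat positions and coordinates:

```python
import numpy as np

# Hypothetical 2x3 target; values equal their flat position for easy reading.
toy = np.arange(6).reshape(2, 3)

# Default ravel is row-major (C order): the last axis varies fastest.
flat = toy.ravel()

# Coordinate (1, 2) -> flat position, and flat position 4 -> coordinate.
pos = np.ravel_multi_index((1, 2), toy.shape)
coord = np.unravel_index(4, toy.shape)
print(flat, pos, coord)
```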


Now we have all the pieces to compute this for each of our queries.

### Implementing Function

To perform this combination for every query, we'll use a function that takes a data frame for a query's relevant docs and performs all of the above operations:

In [41]:
import gc

gc.collect()

def query_xalign(pages):
    # compute targets to average
    avg_pages = reduce(operator.mul, [d.align.loc[pages] for d in avg_dims])

    # convert to query distribution
    tgt =  avg_pages.mean(dim='page')

    # average with background distributions
    tgt = avg_with_bg(tgt)
    
    # and return the result
    gc.collect()
    return tgt

Make sure it works:

In [None]:
print(qdf.page_id.values)

In [43]:
query_xalign(qdf.page_id.values)

### Computing Query Targets

Now with that function, we can compute the alignment vector for each query.  Extract queries into a dictionary:

In [44]:
print(qrels)

         topic_id   page_id
0              84       572
1              84       627
2              84       678
3              84       903
4              84      1193
...           ...       ...
2088301      2859  69878035
2088302      2859  69879576
2088303      2859  69882349
2088304      2859  69887896
2088305      2859  69891491

[2088306 rows x 2 columns]


In [45]:
queries = {
    t: df['page_id'].values
    for (t, df) in qrels.groupby('topic_id')
}
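
The same comprehension on a made-up qrels frame, to show the shape of the resulting dictionary (`toy_qrels` is hypothetical):

```python
import pandas as pd

# Hypothetical qrels: topic 1 has two relevant pages, topic 2 has one.
toy_qrels = pd.DataFrame({'topic_id': [1, 1, 2], 'page_id': [10, 11, 12]})

# groupby yields (key, sub-frame) pairs; keep just the page IDs as an array.
toy_queries = {t: df['page_id'].values for t, df in toy_qrels.groupby('topic_id')}
print(toy_queries)
```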

Make an index that we'll need later for setting up the XArray dimension:

In [46]:
q_ids = pd.Index(queries.keys(), name='topic_id')
q_ids

Index([  84,  111,  265,  323,  396,  397,  403,  409,  426,  475,  594,  604,
        620,  666,  677,  716,  724,  726,  765,  770,  785,  805,  893,  956,
       1055, 1102, 1225, 1319, 1343, 1368, 1371, 1412, 1457, 1509, 1565, 1625,
       1630, 1656, 1715, 1773, 1970, 2006, 2213, 2230, 2272, 2365, 2429, 2465,
       2741, 2859],
      dtype='int64', name='topic_id')

Now let's create targets for each of these:

In [47]:
import gc
q_tgts = []
q_ids_new = []
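for i, q in enumerate(tqdm(q_ids)):
    try:
        q_tgts.append(query_xalign(queries[q]))
        q_ids_new.append(i)
    except Exception as e:
        # keep going, but record which topic failed and why
        log.warning('topic %s (position %d) failed: %s', q, i, e)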
    gc.collect()

100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [00:17<00:00,  2.86it/s]


Assemble a composite xarray:

In [None]:
q_tgts

In [49]:
q_tgts_conc = xr.concat(q_tgts, q_ids_new)#q_ids)

In [50]:
gc.collect()

0

In [51]:
q_tgts_conc.astype(np.int64) 
q_tgts_conc[0]
#q_tgts_conc

Save this to NetCDF (xarray's recommended format):

In [52]:
output.save_xarray(q_tgts_conc, f'task1-{DATA_MODE}-int-targets-test')

INFO:wptrec.save:saving NetCDF to data\metric-tables\task1-train-int-targets-test.nc
