# Annif Fusion experiment

**Osma Suominen, October 2018**

This notebook is an experiment in evaluating different methods for combining the results of multiple subject indexing algorithms. We will use a document collection where gold standard subjects have been manually assigned and compare the subjects assigned by three different algorithms, first individually, then in combinations created by different fusion approaches. In particular, we will test simple mean-of-subject-scores and union-of-top-K approaches as well as more advanced methods such as score normalization, isotonic regression and Learning to Rank style machine learning. Ideally, we would like to find a method for combining results from multiple algorithms that gives us the best quality results, combining the strengths of individual algorithms (both statistical/associative and lexical) while eliminating the effects of their weaknesses. The findings will then inform the further development of Annif.

The experiment was inspired by Martin Toepfer's paper [Fusion architectures for automatic subject indexing under concept drift](https://link.springer.com/article/10.1007/s00799-018-0240-3) as well as discussions with him during and after the NKOS2018 workshop.

If you want to run this yourself, you need Python 3.5+ with the following libraries:

* jupyter (obviously)
* scikit-learn
* pyltr
* matplotlib (for the final plots)

This document is (c) Osma Suominen. It may be shared and reused according to the terms of the [CC Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license. Attribution should include the name of the author and a link to the original source document. However, code snippets in this notebook may be freely reused according to the [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/) license. Attribution for code is requested but not required.

## Document corpus

We will use the ["Ask a Librarian" document corpus](https://github.com/NatLibFi/Annif-corpora/tree/master/fulltext/kirjastonhoitaja) for the experiment. It contains 3150 short question-answer pairs in Finnish. The collection is subdivided into three subsets: train (n=2625), validate (n=213) and test (n=312). All the final evaluations will be performed on the test set, which contains the most recent questions asked during the year 2017. However, for some of the fusion methods we will make use of the train and validate subsets in order to fine-tune the way results are combined.

All documents in the collection have been assigned gold standard subjects by librarians, stored in `*.tsv` files where the basenames correspond to the original `*.txt` files (which are not read by this code at all and are not included in this repository). Librarians have assigned at least 4 subjects per document; the average is 4.4-4.8 subjects per document, depending on the subset.

In addition, the documents have been analyzed using the [Annif](https://github.com/NatLibFi/Annif/) tool (development version v0.31.0) with three independent automated subject indexing algorithms: TF-IDF vector similarity, fastText and Maui. Each algorithm was asked to suggest up to 1000 subjects per document (in practice we will only load a fraction of these), with a score between 0.0 and 1.0 given to each suggested subject. The subjects suggested by these algorithms are stored in `*.tfidf`, `*.fasttext` and `*.maui` files, respectively.

First we load some basic modules and define the locations of the data files.

In [1]:
import pyltr
import os
import os.path
import glob
import numpy as np

SUBJECTFILE='data/yso.tsv'
TRAIN_DIR='data/kirjastonhoitaja/train/'
VALI_DIR='data/kirjastonhoitaja/validate/'
TEST_DIR='data/kirjastonhoitaja/test/'
FILE_SIZES='data/kirjastonhoitaja/file-sizes.txt'
ALGORITHMS = ('tfidf','fasttext','maui')  # these correspond to file extensions

## Subject vocabulary

All the subjects have been chosen from the Finnish General Ontology YSO, which contains around 28000 concepts.
Here we load the vocabulary from a TSV file where the first column is the concept URI and the second column is the concept label in Finnish. This has been extracted from the full YSO SKOS file.

Hereafter we will only use integer concept IDs which range from 0 to (n_concepts-1). All calculations are performed using the concept IDs so we just need to map the concept URIs which appear in files to their IDs.

In addition, we store the concept labels for later use as features for the Learning to Rank algorithm.

In [2]:
uri_to_cid = {}
subject_label = {}
with open(SUBJECTFILE) as subjf:
    for cid, line in enumerate(subjf):
        uri, label = line.strip().split("\t")
        uri_to_cid[uri] = cid
        subject_label[cid] = label
n_concepts = len(uri_to_cid)
print("Vocabulary loaded with", n_concepts, "concepts")

Vocabulary loaded with 27760 concepts


## Loading the documents

We define a function to load the data from a directory of files in the format explained above. To be able to use both `scikit-learn` and `pyltr` libraries, we will need to express the data in two somewhat different formats, both based on multidimensional NumPy arrays:
    
1. pyltr ranking format, where the NumPy array rows correspond to document-subject pairs, i.e. there are multiple rows per document, one row per subject that has been suggested by at least one algorithm. A separate qids array ("query IDs", actually document IDs when applied this way) maps the document-subject pairs to the individual documents.
2. scikit-learn multilabel format, where the NumPy array has one row per document and the columns are individual concepts (either True/False for the gold standard subjects, or subject scores for the predicted subjects)

We will define this function once and then use it to parse all three document subsets (train, validate, test). We will only load up to 50 suggested subjects per document and algorithm (the default value of `max_pred`) to keep the size of the rankings manageable.
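
To make the difference between the two formats concrete, here is a minimal sketch on made-up data (two documents, six concepts, two algorithms; all numbers are hypothetical and not taken from the corpus). It builds the multilabel arrays first and then derives the ranking arrays from them.

```python
import numpy as np

# Toy setup (hypothetical numbers): 2 documents, a vocabulary of 6 concepts, 2 algorithms.
n_docs, n_concepts, n_algos = 2, 6, 2

# scikit-learn multilabel format: one row per document, one column per concept.
scores = np.zeros((n_docs, n_concepts, n_algos))
scores[0, 1] = [0.9, 0.0]   # doc 0, concept 1: suggested by algorithm 0 only
scores[0, 4] = [0.2, 0.7]   # doc 0, concept 4: suggested by both algorithms
scores[1, 3] = [0.0, 0.5]   # doc 1, concept 3: suggested by algorithm 1 only
gold = np.zeros((n_docs, n_concepts), dtype=bool)
gold[0, 4] = True
gold[1, 2] = True           # a gold standard subject that no algorithm suggested

# pyltr ranking format: one row per (document, suggested concept) pair.
suggested = scores.any(axis=2)          # True where at least one algorithm gave a nonzero score
doc_idx, cids = np.nonzero(suggested)   # row-major order keeps each document's rows together
X = scores[doc_idx, cids]               # features, shape (n_pairs, n_algos)
y = gold[doc_idx, cids]                 # relevance of each pair
qids = doc_idx                          # the document index doubles as the query ID here

print(X.shape, y.tolist(), qids.tolist(), cids.tolist())
```

Note that the gold standard subject of the second document never appears in the ranking arrays because no algorithm suggested it; it is only visible in the multilabel `gold` matrix.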

In [3]:
def load_data(directory, max_pred=50):
    """Reads document corpus with gold standard and predicted subjects for each document.
    Only the top max_pred predictions per document and algorithm are considered for ranking.
    Returns a tuple with
     - predictions of individual algorithms as a 2D NumPy array with the shape (doc-suggestions, n_algos)
     - relevant (gold standard) concepts as a 1D NumPy array with the shape (doc-suggestions)
     - qids (actually document ids) as a 1D NumPy array with length doc-suggestions
     - concept IDs of suggested concepts as a 1D NumPy array with shape (doc-suggestions)
     - gold standard subjects in sklearn multilabel format as a 2D NumPy array with shape (n_docs, n_concepts) 
     - predictions of individual algorithms in sklearn multilabel format as a 3D NumPy array with shape (n_docs, n_concepts, n_algos)
    """
    
    # temporary lists to be converted into NumPy arrays at the end
    docdatalist = []
    docscorelist = []
    docgoldlist = []
    doctruelist = []
    qidlist = []
    cidlist = []
    
    for tsvfilename in glob.glob(os.path.join(directory, '*.tsv')):
        basename = tsvfilename[:-4]
        
        # Initialize binary mask containing the top max_pred concepts that were suggested by any algorithm.
        # This will be used to pare down the doc-suggestions for ranking so that we will skip concepts
        # that were not suggested by any algorithm. They will still be considered in F1 score evaluation,
        # which is not based only on the ranked suggestions but uses the full gold standard subjects.
        mask = np.zeros(n_concepts, dtype=bool)  # plain bool: the np.bool alias was removed in NumPy 1.24

        docdata = np.zeros((n_concepts, len(ALGORITHMS)))
        for algoid, fileext in enumerate(ALGORITHMS):
            with open(basename + "." + fileext) as algooutput:
                for lineno, line in enumerate(algooutput):
                    uri, _, score = line.strip().split("\t")
                    if uri not in uri_to_cid:
                        continue  # ignore unknown URIs
                    cid = uri_to_cid[uri]
                    docdata[cid,algoid] = float(score)
                    if lineno < max_pred:
                        mask[cid] = True
        docdatalist.append(docdata[mask])
        docscorelist.append(docdata)
    
        docgold = np.zeros(n_concepts, dtype=bool)
        with open(tsvfilename) as tsvfile:
            for line in tsvfile:
                uri = line.split("\t")[0]
                if uri in uri_to_cid:
                    cid = uri_to_cid[uri]
                    docgold[cid] = True
        docgoldlist.append(docgold[mask])
        doctruelist.append(docgold)
        
        fileid = int(os.path.basename(basename).split('-')[-1])
        qidlist.append(np.full(mask.sum(), fileid))
        cidlist.append(np.arange(n_concepts)[mask])

    return np.concatenate(docdatalist), np.concatenate(docgoldlist), np.concatenate(qidlist), \
           np.concatenate(cidlist), np.array(doctruelist), np.array(docscorelist)

# now load the train, validate and test documents
TX, Ty, Tqids, Tcids, Ttrue, Tscores = load_data(TRAIN_DIR)
VX, Vy, Vqids, Vcids, Vtrue, Vscores = load_data(VALI_DIR)
EX, Ey, Eqids, Ecids, Etrue, Escores = load_data(TEST_DIR)

# Show some information about the shapes
print("TX shape:", TX.shape)
print("Ty shape:", Ty.shape)
print("Tqids shape:", Tqids.shape)
print("Tcids shape:", Tcids.shape)
print("Ttrue shape:", Ttrue.shape)
print("Tscores shape:", Tscores.shape)
print()
print("VX shape:", VX.shape)
print("Vy shape:", Vy.shape)
print("Vqids shape:", Vqids.shape)
print("Vcids shape:", Vcids.shape)
print("Vtrue shape:", Vtrue.shape)
print("Vscores shape:", Vscores.shape)
print()
print("EX shape:", EX.shape)
print("Ey shape:", Ey.shape)
print("Eqids shape:", Eqids.shape)
print("Ecids shape:", Ecids.shape)
print("Etrue shape:", Etrue.shape)
print("Escores shape:", Escores.shape)

TX shape: (283089, 3)
Ty shape: (283089,)
Tqids shape: (283089,)
Tcids shape: (283089,)
Ttrue shape: (2625, 27760)
Tscores shape: (2625, 27760, 3)

VX shape: (22604, 3)
Vy shape: (22604,)
Vqids shape: (22604,)
Vcids shape: (22604,)
Vtrue shape: (213, 27760)
Vscores shape: (213, 27760, 3)

EX shape: (33318, 3)
Ey shape: (33318,)
Eqids shape: (33318,)
Ecids shape: (33318,)
Etrue shape: (312, 27760)
Escores shape: (312, 27760, 3)


## Evaluation metrics

For evaluating the quality of results, we will use two metrics: [NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG), which is based on ranked suggestions (if the top ranked suggestion is correct it will contribute more to the score than getting the 2nd or 3rd suggestion right), and [F1 score](https://en.wikipedia.org/wiki/F1_score), which only considers binary choices: either a concept is a subject of a document or it is not; there are no fuzzy options.

For the **NDCG score**, we can simply use the implementation from the `pyltr` library. Here we limit the NDCG evaluation to the top 20 most highly ranked subjects (i.e. NDCG@20), meaning that subjects ranked 21st, 22nd etc. won't contribute to the score (their effect would be quite small anyway due to the discounting performed in NDCG calculations).
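
As an illustration of how the discounting works, the following is a minimal pure-Python sketch of NDCG@k for binary relevance, using the common log2 discount. It is not the `pyltr` implementation, the function names are made up for this example, and the ideal ranking is computed only from the items in the list, which is a simplification.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of 0/1 relevance values."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal ordering (all relevant items first)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# two hypothetical rankings with the same two correct subjects at different positions
hit_first = [1, 0, 0, 0, 1]   # correct subjects ranked 1st and 5th
hit_third = [0, 0, 1, 0, 1]   # correct subjects ranked 3rd and 5th
print(ndcg_at_k(hit_first, k=20), ndcg_at_k(hit_third, k=20))
```

Both rankings contain the same correct subjects, but the first one scores higher because its best hit sits at the top of the list.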

For the **F1 score**, we can use the `scikit-learn` implementation. However, its usage is complicated by the fact that we are mostly dealing with ranked suggestions in the pyltr format, so we need to convert those into the multilabel prediction format understood by scikit-learn and further binarize the predictions using a thresholding strategy. For simplicity we will pick the top 5 suggested concepts, so we get a number of subjects that is similar to the librarian-assigned ones (around 4.5 subjects per document), and compare those to the gold standard. The score for each document will be calculated individually and the final result will be the average of those scores (i.e. sample based average).
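
The sample based average can be sketched in plain Python as follows. The helper name `f1_for_document` and the concept IDs are hypothetical; the notebook itself relies on `f1_score` from scikit-learn with `average='samples'`.

```python
def f1_for_document(suggested, gold):
    """F1 score for one document, given sets of suggested and gold standard concept IDs."""
    hits = len(suggested & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(suggested)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

# hypothetical documents: (top 5 suggestions, gold standard subjects)
docs = [
    ({1, 2, 3, 4, 5}, {1, 2, 8, 9}),            # 2 hits: precision 2/5, recall 2/4
    ({10, 11, 12, 13, 14}, {20, 21, 22, 23}),   # no hits at all
]
per_doc = [f1_for_document(suggested, gold) for suggested, gold in docs]
sample_avg = sum(per_doc) / len(per_doc)
print(per_doc, sample_avg)
```

Because each document is scored separately before averaging, a document with no correct suggestions pulls the average down by a full zero, regardless of how many subjects it has.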

Finally, we will define an evaluation function that is given ranked suggestions as well as gold standard subjects (in both pyltr and scikit-learn formats) and some auxiliary information (qids and cids). It will calculate both NDCG and F1 score, print them out, and also return the scores so that they can be stored for later plotting etc.

In [4]:
import collections
from sklearn.metrics import f1_score

# initialize the NDCG metric implementation
ndcg_metric = pyltr.metrics.NDCG(k=20)

def ranking_to_multilabel(pred, qids, cids):
    """convert from pyltr ranking prediction format to sklearn multilabel prediction format"""
    doc_labels = collections.OrderedDict()  # key: qid, val: NumPy 1D array with length n_concepts
    for score, qid, cid in zip(pred, qids, cids):
        if qid not in doc_labels:
            doc_labels[qid] = np.zeros(n_concepts)
        doc_labels[qid][cid] = score
    return np.array(list(doc_labels.values()))

def binarize_multilabel_pred(pred, k=5):
    """convert predictions with scores to a boolean matrix, taking the top K predictions per document"""
    # I'm pretty sure this could be done easier, perhaps in a single NumPy operation, but I can't figure it out
    # Using a loop instead, it's not going to take long anyway
    binary_labels = np.zeros(pred.shape, dtype=bool)
    pred_order = pred.argsort()[:,::-1]
    for i, mask in enumerate(pred_order[:,:k]):
        binary_labels[i,mask] = True
    return binary_labels & (pred > 0.0)  # make sure not to include predictions with 0.0 score

def evaluate(name, qids, y, y_true, pred, cids):
    """evaluate a ranking, calculating its NDCG and F1 score. Print and return the scores"""
    # calculate NDCG score
    ndcg = ndcg_metric.calc_mean(qids, y, pred)

    # calculate F1 score
    pred_labels = ranking_to_multilabel(pred, qids, cids)
    f1 = f1_score(y_true, binarize_multilabel_pred(pred_labels), average='samples')
    
    print("{:>20}:\tNDCG={:.4f}\tF1={:.4f}".format(name, ndcg, f1))
    return ndcg, f1
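
The comment in `binarize_multilabel_pred` wonders whether the top K selection could be done without a loop. One possible vectorized variant, sketched here under the assumption that NumPy 1.15 or later is available (for `np.put_along_axis`), uses `np.argpartition`. The function name `binarize_top_k` is made up for this example, and when several scores are tied at the cutoff it may pick different concepts than the `argsort` based version.

```python
import numpy as np

def binarize_top_k(pred, k=5):
    """Boolean matrix marking the k highest-scoring columns of each row, ignoring zero scores."""
    k = min(k, pred.shape[1])
    # argpartition places the indices of the k largest values in the last k positions of each row
    top_idx = np.argpartition(pred, -k, axis=1)[:, -k:]
    mask = np.zeros(pred.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)
    return mask & (pred > 0.0)  # make sure not to include predictions with 0.0 score

# hypothetical scores for 2 documents and 4 concepts
toy_pred = np.array([[0.1, 0.9, 0.0, 0.5],
                     [0.0, 0.0, 0.3, 0.0]])
print(binarize_top_k(toy_pred, k=2))
```

Partitioning avoids fully sorting each row of the vocabulary-sized matrix, which may matter when the number of concepts is large.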

## Baseline evaluation

Now we are ready for some baseline evaluations: we can evaluate the individual algorithms as well as a simplistic combination method using the mean of assigned scores (the only one implemented in Annif v0.31.0). We will store the results for later too.

In [5]:
# initialize dictionary for collecting evaluation results
eval_results = collections.OrderedDict()

# evaluate the individual algorithms
for algoid, name in enumerate(ALGORITHMS):
    eval_results[name] = evaluate(name, Eqids, Ey, Etrue, EX[:,algoid], Ecids)

# evaluate the combination of algorithms using mean of scores
eval_results["mean"] = evaluate("mean", Eqids, Ey, Etrue, EX.mean(1), Ecids)

               tfidf:	NDCG=0.4223	F1=0.2220
            fasttext:	NDCG=0.2737	F1=0.1329
                maui:	NDCG=0.5133	F1=0.2949
                mean:	NDCG=0.5701	F1=0.3142


We can see that out of individual algorithms, Maui performed best, followed by TF-IDF and fastText. The mean of scores gave better results than any algorithm alone. The NDCG and F1 measures seem to agree on which methods are better than others, as they both ranked the methods in the same order.

Note that the F1 scores may seem quite low (in some classification tasks F1 scores of 0.95 and more are not uncommon), but keep in mind that the task given to the algorithms is quite difficult: for each document, an algorithm should pick the 5 subjects (out of nearly 28000) that best describe that document. There are numerous ways of getting this wrong and only a few ways of picking a good set of subjects. So even a rather low F1 score of 0.2 means that around one out of five subjects was right, which is much better than chance.

## Union of top K concepts

Let's try a variation of averaging the scores of the different algorithms. This time we only pick the top K subjects (where K is a small number, e.g. 2 or 3) suggested by each algorithm and use their union as the suggested concepts. We will still take the mean of scores in order to rank the results (otherwise the NDCG score would not be well defined as it requires the results to have some stable ranking order) but will drastically reduce the number of subjects being considered.

This is a bit challenging to implement within the ranked suggestions format of pyltr, so what we will do instead is to implement it using sklearn multilabel prediction format and then convert that to a list of ranking predictions using the qids and cids information.

In [6]:
def multilabel_to_ranking_preds(scores, qids, cids):
    """convert a sklearn multilabel prediction to pyltr ranking according to the given qids and cids"""
    
    qid_to_docid = collections.OrderedDict()
    for qid in qids:
        if qid not in qid_to_docid:
            qid_to_docid[qid] = len(qid_to_docid)
    return np.array([scores[qid_to_docid[qid],cid] for qid, cid in zip(qids, cids)])

def select_top_k(scores, k):
    mask = binarize_multilabel_pred(scores, k=k)
    return scores * mask

def union_of_top_k(scores, qids, cids, k):
    top_scores = np.zeros(scores.shape)
    for algoid in range(len(ALGORITHMS)):
        top_scores[:,:,algoid] = select_top_k(scores[:,:,algoid], k=k)
    
    preds_union = multilabel_to_ranking_preds(top_scores, qids, cids)
    name = "union-{}".format(k)
    eval_results[name] = evaluate(name, Eqids, Ey, Etrue, preds_union.mean(1), Ecids)

for k in (1,2,3,5,10,20,40):
    union_of_top_k(Escores, Eqids, Ecids, k)

             union-1:	NDCG=0.3026	F1=0.2289
             union-2:	NDCG=0.3988	F1=0.2672
             union-3:	NDCG=0.4410	F1=0.2773
             union-5:	NDCG=0.5091	F1=0.2947
            union-10:	NDCG=0.5588	F1=0.3036
            union-20:	NDCG=0.5560	F1=0.3053
            union-40:	NDCG=0.5635	F1=0.3096


The results were not encouraging. For every K tested, both NDCG and F1 scores are lower than for the mean strategy (NDCG=0.5701, F1=0.3142). With higher K values, the results approach those of the mean method, which is expected since the low limit on the number of results considered is the only aspect that differentiates this method from using the mean of all scores.
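
To show what the union of top K does for a single document, here is a small dictionary-based sketch. The function below is a simplified, hypothetical stand-in for the array-based code of this section, and the concept names and scores are invented.

```python
def union_of_top_k_single(suggestions, k):
    """suggestions: one {concept: score} dict per algorithm, for a single document.
    Keep each algorithm's k best concepts, then rank the union by mean score,
    counting a concept outside an algorithm's top k as 0.0 for that algorithm."""
    trimmed = []
    for algo_scores in suggestions:
        best = sorted(algo_scores, key=algo_scores.get, reverse=True)[:k]
        trimmed.append({concept: algo_scores[concept] for concept in best})
    union = set().union(*trimmed)
    mean = {c: sum(t.get(c, 0.0) for t in trimmed) / len(trimmed) for c in union}
    return sorted(mean.items(), key=lambda item: item[1], reverse=True)

# invented suggestions from two algorithms for the same document
algo_a = {"cats": 0.8, "pets": 0.6, "dogs": 0.1}
algo_b = {"pets": 0.9, "animals": 0.5, "cats": 0.4}
print(union_of_top_k_single([algo_a, algo_b], k=1))
print(union_of_top_k_single([algo_a, algo_b], k=2))
```

With a small K, a concept that both algorithms rate moderately well can lose the support of one of them, which is one plausible reason for the lower scores seen with low K values.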

## Normalization of prediction scores

Each algorithm returns its scores on the scale 0.0 to 1.0, but they may be using different ranges within that scale. Because of this, simple averaging of scores may cause one algorithm which tends to report high scores to have much more impact on the result than another one that uses low scores. We can counteract that by normalizing the scores before taking the mean of scores. `scikit-learn` implements several normalization methods: L1, L2 and max-value. Let's see if they have an effect on merging results.
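
The effect of the three norms can be seen on a toy example. The sketch below re-implements row-wise normalization in plain NumPy for illustration only (the notebook itself uses `sklearn.preprocessing.normalize`); the helper name `normalize_rows` and the scores are hypothetical.

```python
import numpy as np

def normalize_rows(scores, norm):
    """Scale each row (document) by its L1 norm, L2 norm or maximum absolute value."""
    if norm == 'l1':
        denom = np.abs(scores).sum(axis=1, keepdims=True)
    elif norm == 'l2':
        denom = np.sqrt((scores ** 2).sum(axis=1, keepdims=True))
    elif norm == 'max':
        denom = np.abs(scores).max(axis=1, keepdims=True)
    else:
        raise ValueError("unknown norm: " + norm)
    denom[denom == 0.0] = 1.0   # leave all-zero rows untouched
    return scores / denom

# hypothetical scores from two algorithms for the same document and the same 3 concepts
high_scorer = np.array([[0.9, 0.8, 0.6]])
low_scorer = np.array([[0.03, 0.02, 0.01]])
for norm in ('l1', 'l2', 'max'):
    print(norm,
          normalize_rows(high_scorer, norm).round(3),
          normalize_rows(low_scorer, norm).round(3))
```

After normalization the two algorithms occupy comparable ranges, so neither dominates the mean simply by reporting larger raw numbers.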

In [7]:
from sklearn.preprocessing import normalize

def normalized_scores(scores, norm):
    """normalize the given predictions (in sklearn multilabel format)
    using the given normalization method ('l1', 'l2' or 'max')"""
    # we will have to perform the normalization separately for each algorithm's output
    scores_norm = np.zeros(scores.shape)
    for algoid in range(len(ALGORITHMS)):
        scores_norm[:,:,algoid] = normalize(scores[:,:,algoid], norm=norm)
    return scores_norm

for norm in ('l1', 'l2', 'max'):
    Escores_norm = normalized_scores(Escores, norm)
    Epreds_norm = multilabel_to_ranking_preds(Escores_norm, Eqids, Ecids)
    eval_results["mean-" + norm] = evaluate("mean-" + norm, Eqids, Ey, Etrue, Epreds_norm.mean(1), Ecids)

             mean-l1:	NDCG=0.5489	F1=0.3095
             mean-l2:	NDCG=0.5544	F1=0.2991
            mean-max:	NDCG=0.5526	F1=0.2941


This does not seem to help either. All three normalization methods give scores slightly below those of the plain mean method (NDCG between 0.5489 and 0.5544, compared to 0.5701 for the mean).

## TODO: PAV aka Isotonic Regression

## Learning to Rank

Let's try a true machine learning based fusion approach. We will apply the state of the art Learning to Rank algorithm LambdaMART, which is typically used for ranking results in a search engine. In this case, we consider each document a "query" and the predicted subjects as "results" and will try to come up with the ideal ranking of those subjects so that gold standard subjects are ranked as high as possible while non-relevant subjects are ranked lower. The algorithm is given a ranking measure that it then attempts to optimize; we will use NDCG as the measure to optimize.



The LambdaMART algorithm is conveniently implemented in the `pyltr` library. This implementation has some advanced features: it can be connected to a monitor that periodically evaluates the learned model on validation data and keeps track of the learning progress. If the monitor notices that the learning has reached a plateau (no improvement in validation scores has been made in the last K training rounds), it can stop the learning as well as roll back the model to the state when it reached that plateau. These are called *early stopping* and *trimming*, respectively, and they should help guard against overfitting the model on the training data.
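
The logic of early stopping and trimming can be sketched independently of any library. The function below is a hypothetical simulation over a list of made-up validation scores, not the `pyltr` monitor; the parameter name `stop_after` is borrowed from the notebook code for readability.

```python
def simulate_early_stopping(validation_scores, stop_after):
    """validation_scores: validation metric after each training round (higher is better).
    Returns (number of rounds kept after trimming, number of rounds actually run)."""
    best_score, best_round = float("-inf"), 0
    rounds_run = 0
    for round_no, score in enumerate(validation_scores, start=1):
        rounds_run = round_no
        if score > best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= stop_after:
            break   # early stopping: no improvement in the last stop_after rounds
    return best_round, rounds_run   # trimming: keep only the rounds up to the best one

# made-up validation scores; the late improvement in round 8 is never reached
toy_scores = [0.52, 0.57, 0.59, 0.60, 0.598, 0.599, 0.597, 0.61]
print(simulate_early_stopping(toy_scores, stop_after=3))
```

The example also shows the trade-off of a small patience value: training stops before the late improvement is seen, whereas a larger `stop_after` would have reached it at the cost of more training rounds.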

### Feature engineering

Any Learning to Rank algorithm requires features for the "results" (i.e. subjects in our case); here we will simply use the raw scores predicted by the subject indexing algorithms as the features. In addition, we will use the lengths of the original input documents as features, as well as some concept features such as label length. These are all quite easy for us to generate, but we don't know whether the LTR algorithm has any use for them - let's give it whatever features we can think of and let it make its own decisions on whether to use them and how.

First the document-level features (currently just size):

In [8]:
# read information about document lengths from a data file generated using the `du -b` command
doc_length = {}
with open(FILE_SIZES) as file_sizes:
    for line in file_sizes:
        size, filename = line.strip().split()
        basename = filename[:-4].split('-')[-1]
        doc_length[int(basename)] = int(size)

# define a function that looks up document lengths based on qids and returns a single-column 2D feature matrix
def doc_features(qids):
    """return a NumPy array of document features (in practice, file sizes) corresponding to the given qids"""
    
    return np.array([np.array([doc_length[qid]]) for qid in qids])

# let's also make note of what document features we used, for later display
DOCUMENT_FEATURES = ('d-length',)

# generate document features for train, validate and test sets
Tdocf = doc_features(Tqids)
Vdocf = doc_features(Vqids)
Edocf = doc_features(Eqids)

Then we generate some concept level features. We will generate several features per concept, based mainly on concept labels: label length (characters and words), number and proportion of capital letters etc. Also we will count the number of occurrences of each concept in the gold standard subjects of the validate set and use that as a feature (frequencies counted from the train set gave poor results on the test set; see the run notes further below).

In [35]:
# count the per-concept frequencies from the validation(!) set into a 1D NumPy array with size n_concepts

# To (try to) avoid overfitting, we can use just the base 10 or natural logarithms of the frequencies (incremented by
# one to avoid taking the logarithm of zero) and round them to the nearest integer to indicate rough magnitude
#concept_freq = np.log(Vtrue.sum(axis=0) + 1).round()    # natural logarithm
#concept_freq = np.log10(Vtrue.sum(axis=0) + 1).round()  # base 10 logarithm
concept_freq = Vtrue.sum(axis=0)                         # raw frequency value

def features_of_concept(cid):
    """return a 1D NumPy array of features for a particular concept"""
    
    features = []
    features.append(concept_freq[cid])  # frequency in the validate set
    label = subject_label[cid]
    features.append(len(label))  # label length in characters
    features.append(len(label.split()))  # label length in words
    caps = sum(1 for c in label if c.isupper())
    features.append(caps)  # number of capital letters
    features.append(caps / len(label))  # proportion of capital letters
    return np.array(features)

def concept_features(cids):
    """return a 2D NumPy array of features for the given concepts"""
    
    return np.array([features_of_concept(cid) for cid in cids])

# let's also make note of what concept features we used, for later display
CONCEPT_FEATURES = ('c-freq', 'c-chars', 'c-words', 'c-caps', 'c-capprop')

# generate concept features for train, validate and test sets
Tconcf = concept_features(Tcids)
Vconcf = concept_features(Vcids)
Econcf = concept_features(Ecids)

Then we can combine the features into matrices for feeding into the LTR algorithm:

In [36]:
TXall = np.hstack((TX, Tdocf, Tconcf))
print(TXall.shape)

VXall = np.hstack((VX, Vdocf, Vconcf))
print(VXall.shape)

EXall = np.hstack((EX, Edocf, Econcf))
print(EXall.shape)

# make note what features we used
ALL_FEATURES = ALGORITHMS + DOCUMENT_FEATURES + CONCEPT_FEATURES

(283089, 9)
(22604, 9)
(33318, 9)


Finally we can run the LTR algorithm itself.

In [37]:
%%time

# create a monitor for early stopping and trimming, using validation data
monitor = pyltr.models.monitors.ValidationMonitor(
    VXall, Vy, Vqids, metric=ndcg_metric, stop_after=50)

# create a LambdaMART model for learning a suitable ranking
model = pyltr.models.LambdaMART(
    metric=ndcg_metric,
    n_estimators=1000,
    learning_rate=0.2,
    query_subsample=0.1,
    max_leaf_nodes=10,
    min_samples_leaf=64,
    verbose=1,
)

# fit the model to training data
model.fit(TXall, Ty, Tqids, monitor=monitor)

 Iter  Train score  OOB Improve    Remaining                           Monitor Output 
    1       0.5718       0.5618       70.48m      C:      0.5217 B:      0.5217 S:  0
    2       0.6155       0.0594       73.20m      C:      0.5733 B:      0.5733 S:  0
    3       0.6318       0.0148       74.47m      C:      0.5929 B:      0.5929 S:  0
    4       0.6455       0.0031       74.60m      C:      0.5969 B:      0.5969 S:  0
    5       0.6525       0.0042       74.89m      C:      0.6067 B:      0.6067 S:  0
    6       0.6740       0.0062       75.13m      C:      0.6100 B:      0.6100 S:  0
    7       0.6598       0.0124       75.23m      C:      0.6232 B:      0.6232 S:  0
    8       0.6636       0.0025       75.32m      C:      0.6281 B:      0.6281 S:  0
    9       0.6633       0.0017       75.45m      C:      0.6365 B:      0.6365 S:  0
   10       0.6820       0.0109       75.54m      C:      0.6432 B:      0.6432 S:  0
   15       0.7031       0.0006       75.64m      C: 

Now we have a model trained on the training documents and validated on the validation documents. First, since we spent so much time creating the model, let's save it for later:

In [12]:
# Save the learned model for later use
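try:
    import joblib  # standalone joblib package (used by newer scikit-learn releases)
except ImportError:
    from sklearn.externals import joblib  # bundled copy in older scikit-learn versions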
joblib.dump(model, 'ltr-model.joblib')

['ltr-model.joblib']

Let's use the model to predict subject rankings on the evaluation documents and evaluate how well it did compared to the other fusion approaches.

In [38]:
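# the model was trained on TXall (all 9 features), so predict using the matching test feature matrix
Epred = model.predict(EXall)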
eval_results["ltr"] = evaluate("ltr", Eqids, Ey, Etrue, Epred, Ecids)

                 ltr:	NDCG=0.5952	F1=0.3199



## LTR results for test set

### old, bad Maui scores; using only raw scores as features (3 features)
* max_pred=25, n_estimators=1000, query_subsample=0.5, max_leaf_nodes=10, min_samples_leaf=64: 
  * after 52(+50) iterations: NDCG=0.6097	F1=0.3108
* max_pred=50, n_estimators=1000, query_subsample=0.5, max_leaf_nodes=10, min_samples_leaf=64: 
  * after 98(+30) iterations: NDCG=0.5745	F1=0.3056
* max_pred=50, n_estimators=1000, query_subsample=0.25, max_leaf_nodes=10, min_samples_leaf=64: 
  * after 90(+30) iterations: NDCG=0.5828	F1=0.3135
* max_pred=50, n_estimators=1000, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64: 
  * after 54(+30) iterations: NDCG=0.5825	F1=0.3053 (6min 15s)

### old, bad Maui scores; using raw score + doc length + concept label length&caps (=8 features)

* max_pred=50, n_estimators=1000, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64: 
  * after 78(+30) iterations: NDCG=0.5848	F1=0.3076 (8min 28s)
* max_pred=50, n_estimators=1000, learning_rate=0.2, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64: 
  * after 66(+50) iterations: NDCG=0.5715	F1=0.3039 (8min 54s)

### new Maui scores; using 8 features

* max_pred=50, n_estimators=1000, learning_rate=0.2, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64: 
  * after 40(+50) iterations: NDCG=0.5925	F1=0.3159 (7min 13s)
  
### new Maui scores: using 9 features (all above + train concept freq)

* max_pred=50, n_estimators=1000, learning_rate=0.2, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64: 
  * after 135(+50) iterations: NDCG=0.3777	F1=0.0069 (15min 37s) ??? **what went wrong?** Apparently the model got too hooked on c-freq (importance 0.2477 vs. Maui 0.3144) and this didn't generalize so well...
  
### new Maui scores: using 9 features (8 basic + log10 of train concept freq)

* max_pred=50, n_estimators=1000, learning_rate=0.2, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64:
  * after 58(+50) iterations: NDCG=0.5722	F1=0.2821 (9min 21s) -- **better than raw freq but still not good**

### new Maui scores: using 9 features (8 basic + logN of train concept freq)
* max_pred=50, n_estimators=1000, learning_rate=0.2, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64:
  * after 69(+50) iterations: NDCG=0.3641	F1=0.0000 (9min 11s) -- **horrible**
  
### new Maui scores: using 9 features (8 basic + logN of validate concept freq)
* max_pred=50, n_estimators=1000, learning_rate=0.2, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64:
  * after 57(+50) iterations: NDCG=0.5973	F1=0.3234 (8min 52s) -- **new record**
  
### new Maui scores: using 9 features (8 basic + log10 of validate concept freq)
* max_pred=50, n_estimators=1000, learning_rate=0.2, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64:
  * after 72(+50) iterations: NDCG=0.5984	F1=0.3147 (9min 46s)

### new Maui scores: using 9 features (8 basic + validate concept freq)
* max_pred=50, n_estimators=1000, learning_rate=0.2, query_subsample=0.1, max_leaf_nodes=10, min_samples_leaf=64:
  * after 35(+50) iterations: NDCG=0.5952	F1=0.3199 (6min 44s)

We can also check how much each feature, i.e. the scores from each algorithm as well as the document and concept features, contributed to the rankings.

In [16]:
for feature, importance in zip(ALL_FEATURES, model.feature_importances_):
    print("{:>20} importance: {:.4f}".format(feature, importance))

               tfidf importance: 0.1672
            fasttext importance: 0.1224
                maui importance: 0.3144
            d-length importance: 0.0771
              c-freq importance: 0.2477
             c-chars importance: 0.0619
             c-words importance: 0.0094
              c-caps importance: 0.0000
           c-capprop importance: 0.0000


## TODO: plot the results using matplotlib