In this article, I want to introduce you to one of the key algorithms behind successful learning to rank implementations: gradient boosting. Gradient boosting is the "brain" behind learning to rank. It's the core idea behind LambdaMART. I want to walk you through why it's such a powerful machine-learning algorithm for search relevance in particular. You'll see that its ability to capture subtle, non-linear relationships while keeping some level of interpretability makes gradient boosting a really powerful solution for search, personalization, and recommendation systems.

## Learning to Rank as a Regression Problem
To introduce you to gradient boosting, we're NOT going to look at traditional learning to rank beyond this section. So I want to map learning to rank, as you might know it from [previous articles](http://opensourceconnections.com/blog/2017/02/24/what-is-learning-to-rank/) and [documentation](https://github.com/o19s/elasticsearch-learning-to-rank#building-a-learning-to-rank-system-with-elasticsearch), to a more general problem: regression. *Regression* trains a model to map a set of numerical features to a predicted numerical value.

For example, what if you wanted to be able to predict a company's profit? You might have, on hand, historical data about public corporations including number of employees, stock market price, revenue, cash on hand, etc. Given data you know about existing companies, your model could be trained to predict profit as a function of these variables (or a subset thereof). For a new company you could use your function to arrive at a prediction of the company's profit.

Just the same, learning to rank can be a regression problem. You have on hand a series of *judgments* that grade how relevant a document is for a query. Our relevance grades could range from A to F. More commonly they range from 0 (not at all relevant) to 4 (exactly relevant). If we treat a keyword search as the query, the judgments might look like this:


    grade,movie,keywordquery
    4,Rambo,rambo
    0,Turner and Hootch,rambo
    4,First Blood,rambo
    1,Rocky,rambo
    ...
    
    
Learning to Rank becomes a regression problem when you build a model to predict the *grade* as a function of ranking-time *signals*. Recall from [Relevant Search](http://manning.com/books/relevant-search) that we use the term signal to mean any measurement about the relationship between the query and a document. Often signals are *query-dependent*: they result from measuring how a keyword (or other part of the query) relates to the document. Other times they are query-only or document-only, such as a document's publication date, or whether a "company name" could be extracted from the query using an NLP method.

In other words, let's consider the movie example above. You might have 2 query-dependent signals you suspect could help predict relevance:

1. How many times a search keyword occurs in the **title** field
2. How many times a search keyword occurs in the **overview** field
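To make these two signals concrete, here's a rough sketch of how such counts could be computed in plain Python. The `count_occurrences` helper, the naive whitespace tokenization, and the sample overview text are all made up for illustration; a real search engine would compute term frequency using its own text analysis.

```python
def count_occurrences(keyword, field_text):
    """Count how many times a keyword appears in a field (naive tokenization)."""
    tokens = field_text.lower().split()
    return tokens.count(keyword.lower())

# Hypothetical document: the overview text below is invented for this example
movie = {
    "title": "Rambo",
    "overview": "Rambo returns to the jungle for one last mission",
}

query = "rambo"
title_hits = count_occurrences(query, movie["title"])
overview_hits = count_occurrences(query, movie["overview"])
print(title_hits, overview_hits)  # 1 1
```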

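Augmenting the *judgments* above with these two signals, you might arrive at a training set like the one below, where each row holds the grade, then the title count, then the overview count: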

    4,1,1
    0,0,0
    4,0,3
    1,0,1
    
You can apply a regression process (really any regression process, including linear regression), to predict the first column using the other columns. You can build such a system on top of an existing search engine like [Solr](https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank) or [Elasticsearch](http://opensourceconnections.com/blog/2017/02/14/elasticsearch-learning-to-rank/).
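As a small sketch of that "first column versus the other columns" split, assuming the four rows above are available as comma-separated text, NumPy can separate the grades from the signals. The variable names here are just for illustration.

```python
from io import StringIO
import numpy as np

# The four training rows from above: grade, title count, overview count
training_csv = StringIO("4,1,1\n0,0,0\n4,0,3\n1,0,1\n")
rows = np.loadtxt(training_csv, delimiter=",")

grades = rows[:, 0]    # what we want to predict
signals = rows[:, 1:]  # what we predict it from
print(grades)
print(signals.shape)
```

Any regression library can then be handed `signals` as its inputs and `grades` as its targets.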

Learning to Rank comes with a few extra complications we'll save for a future article:

1. Examples are provided in groups, grouped by queries. A document can be a grade 4 for query "rambo" and a 0 for query "beauty and the beast."
2. How do you arrive at these "grades"? 
3. How does all this work in practice?

Yes yes, but for now I want to look at a specific *kind* of regression. It's the magic under learning to rank and I want to teach you all about it.



## What *kind* of regression works best for Learning to Rank?

If you've learned any statistics, you may be familiar with Linear Regression. *Linear Regression* defines the regression problem as a simple linear function. For example, if in learning to rank we call the first signal above (how many times a search keyword occurs in the **title** field) `t` and the second signal (the same for the **overview** field) `o`, our model might generate a function `s` to score relevance as follows:

    s(t, o) = c0 + c1 * t + c2 * o
    

We can estimate the best fit coefficients `c0, c1, c2...` that predict our training data using a procedure known as [least squares fitting](http://mathworld.wolfram.com/LeastSquaresFitting.html). We won't cover that here, but the gist is we can find the `c0, c1, c2, ...` that minimize the squared error between the actual grade, `g`, and the prediction `s(t,o)`.
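As a rough illustration of least squares fitting, here's a sketch using NumPy's `np.linalg.lstsq` on the four training rows from earlier. Four rows is far too little data for a meaningful model, so treat the resulting coefficients as a demonstration of the mechanics only.

```python
import numpy as np

# Signals and grades from the four-row training set
t = np.array([1.0, 0.0, 0.0, 0.0])  # title counts
o = np.array([1.0, 0.0, 3.0, 1.0])  # overview counts
g = np.array([4.0, 0.0, 4.0, 1.0])  # actual grades

# Design matrix: a column of ones for c0, then t and o
A = np.column_stack([np.ones_like(t), t, o])

# Find the c0, c1, c2 that minimize the squared error between g and s(t, o)
coeffs, residuals, rank, _ = np.linalg.lstsq(A, g, rcond=None)
c0, c1, c2 = coeffs

def s(t, o):
    """Score a document from its title and overview counts."""
    return c0 + c1 * t + c2 * o

print(c0, c1, c2)
print(s(0, 3))  # predicted grade for the "First Blood" row
```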

You can get fancier with linear regression. You might decide there's really a third ranking signal, `a`, which we can define as `t*o`. Or you might add another signal called `t2`, which could in reality be `t^2` or `log(t)` or whatever formulation you suspect could best help predict relevance.
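A quick sketch of what building such derived signals might look like, reusing the same four rows. The names `a` and `t2` follow the text; the `features` array is just an illustrative way to stack them for a regression.

```python
import numpy as np

t = np.array([1.0, 0.0, 0.0, 0.0])  # title counts
o = np.array([1.0, 0.0, 3.0, 1.0])  # overview counts

a = t * o    # nonzero only when the keyword appears in both fields
t2 = t ** 2  # squared title count; a log transform is another option

# The model stays linear in its coefficients; only the inputs change
features = np.column_stack([t, o, a, t2])
print(features)
```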

There's a deeper art to designing, testing, and evaluating models of any flavor. I heartily recommend [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) if you'd like to learn more.


## Playing with Linear Regression in sklearn

To give you a taste, Python's [sklearn](http://scikit-learn.org/stable/) family of libraries is a convenient way to play with regression. If we want to try out the simple learning to rank training set above with linear regression, we can express the relevance grades we're trying to predict as `S`, and the signals we feel will predict those grades as `X`.
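Here's a minimal sketch of that setup, assuming scikit-learn is installed. It fits `LinearRegression` to the four example rows, with `S` holding the grades and `X` holding the title and overview counts. The document scored at the end is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Relevance grades we're trying to predict
S = np.array([4, 0, 4, 1])
# Signals: [title count, overview count] for each judged document
X = np.array([[1, 1],
              [0, 0],
              [0, 3],
              [0, 1]])

model = LinearRegression()
model.fit(X, S)

print(model.intercept_)  # plays the role of c0
print(model.coef_)       # plays the role of c1 and c2

# Score a hypothetical new document: keyword once in title, twice in overview
print(model.predict(np.array([[1, 2]])))
```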




    from sklearn.tree import DecisionTreeRegressor
    from sklearn.linear_model import LinearRegression
    from math import sin
    import numpy as np

    numFeatures = 10
    numDatapoints = 1000

    # Generate 1000 samples with 10 random features each
    X = np.random.random((numDatapoints, numFeatures))
    # The target is a smooth, non-linear function of the features
    Y = np.arctan(X).sum(axis=1)

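    # Alternative target: 1000 purely random values, unrelated to X.
    # Left commented out, since running it would overwrite the Y computed from X.
    # Y = np.random.random(numDatapoints)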

    # Step 1. Fit a regression tree to the X and Y values.
    # With the defaults (max_depth=None, min_samples_leaf=1) the tree is free
    # to keep splitting until each training point sits in its own leaf.
    tree = DecisionTreeRegressor()
    tree.fit(X, Y)

    # Step 2. Make predictions using X, and see how far we are from Y
    # (sum of squared errors over the training data)
    Ypredicted = tree.predict(X)
    print(np.power(Y - Ypredicted, 2).sum())
    # 1.5491898320864383e-08 in the original run

    # Wait, what? The error is near 0! We've perfectly predicted the training data.
    # This isn't surprising, but it's likely an example of overfitting. Let's train on all but the
    # first 100 data points and then calculate the *test error* on those held-out points
    Xtrain = X[100:]
    Xtest = X[:100]
    Ytrain = Y[100:]
    Ytest = Y[:100]
    tree.fit(Xtrain, Ytrain)
    Ypredicted = tree.predict(Xtest)
    print(np.power(Ytest - Ypredicted, 2).sum())
    # roughly 37.24 in the original run; the data is random, so your value will differ
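The same hold-out idea can be sketched with scikit-learn's own helpers, `train_test_split` and `mean_squared_error`. This is an alternative to the manual slicing above, not what the notebook ran; the fixed seeds and the 10% test size are assumptions chosen so the split is repeatable.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same kind of synthetic data as above, but seeded for repeatability
rng = np.random.RandomState(0)
X = rng.random_sample((1000, 10))
Y = np.arctan(X).sum(axis=1)

# Hold out 10% of the samples; the tree never sees them during training
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.1, random_state=0)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(Xtrain, Ytrain)

# Mean squared error on data the tree has seen versus data it has not
train_mse = mean_squared_error(Ytrain, tree.predict(Xtrain))
test_mse = mean_squared_error(Ytest, tree.predict(Xtest))
print(train_mse, test_mse)
```

A training error near zero next to a much larger test error is the usual signature of overfitting.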

In [58]:
X[1]

array([ 0.09914091,  0.31433702,  0.34429422,  0.55603666,  0.83777067,
        0.38102517,  0.85519233,  0.8633129 ,  0.50484428,  0.76361867])

In [59]:
X[:10]

array([[ 0.5206042 ,  0.45071729,  0.94032958,  0.68446533,  0.70176207,
         0.75528753,  0.72025058,  0.22522763,  0.18514977,  0.84070608],
       [ 0.09914091,  0.31433702,  0.34429422,  0.55603666,  0.83777067,
         0.38102517,  0.85519233,  0.8633129 ,  0.50484428,  0.76361867],
       [ 0.41816377,  0.53497742,  0.83310324,  0.05345249,  0.39873131,
         0.11529311,  0.56652605,  0.04872523,  0.42650035,  0.41389597],
       [ 0.32663508,  0.30300935,  0.57256465,  0.45474402,  0.37723288,
         0.63898176,  0.25567442,  0.14961114,  0.73494424,  0.29039654],
       [ 0.27869052,  0.07218927,  0.96985285,  0.65412681,  0.10789258,
         0.10191945,  0.4291287 ,  0.61700272,  0.64590024,  0.07351153],
       [ 0.48881607,  0.37273356,  0.16319504,  0.83005925,  0.67285752,
         0.0910396 ,  0.42477796,  0.70907164,  0.04713346,  0.86128398],
       [ 0.67287971,  0.14537552,  0.68258379,  0.21359162,  0.66374476,
         0.57542011,  0.35296409,  0.22235011