## XGBoost Demo for Ranking

This notebook demonstrates how to use XGBoost as a ranking classifier. You may use and/or modify this code for the Project (Part-2).

In [1]:
import numpy as np
import xgboost as xgb

In [2]:
xgb.__version__

'0.90'

### Generate Random Features and labels for Training Data

First, we generate some random features to train the XGBoost classifier. For the project, you will be required to use the data provided (explained in `6714_proj_part2.ipynb`) to generate your features.

For this example, we assume:<br>

* We have 5 mentions in the training data, with total number of candidate entities for each mention as follows: [5, 4, 4, 3, 4].

* We form pairs of the form $(mention, candidate\_entity)$, so we will have 20 pairs (for 5 mentions) in total, i.e., $\sum_{i=1}^{N}{\#c_{m_i}}$, where $N$ is the number of mentions and $\#c_{m_i}$ is the number of candidates of the mention $m_{i}$. We consider the candidate entities corresponding to each mention as a separate group.

* For each <mention, entity> pair, we may generate some features using men_docs ($men\_docs.pickle$) and the entity description pages ($parsed\_candidate\_entities.pickle$). For illustration, we randomly generate $d$-dimensional features. For 20 <mention, entity> pairs, we will have a feature matrix of shape $(20 \times d)$.
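The bookkeeping in the list above can be sketched in plain Python. This is only an illustration: `group_sizes` mirrors the candidate counts assumed in this example, and `row_ranges` is a hypothetical helper name, not part of the project code.

```python
from itertools import accumulate

# Number of candidate entities per mention (the values assumed above).
group_sizes = [5, 4, 4, 3, 4]

# Total number of (mention, candidate_entity) pairs,
# i.e. the number of rows in the feature matrix.
total_pairs = sum(group_sizes)


def row_ranges(sizes):
    """Return the half-open row range (start, end) occupied by each group."""
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


ranges = row_ranges(group_sizes)
print(total_pairs)  # 20
print(ranges)       # [(0, 5), (5, 9), (9, 13), (13, 16), (16, 20)]
```

Rows 0 to 4 of the feature matrix would then belong to the first mention, rows 5 to 8 to the second, and so on.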

In [3]:
## Randomly Generate Features for Training....

### Set Numpy Seed
np.random.seed(23)

### We generate random features (13-dim). The feature matrix will be of the shape: (20,13)
train_data = np.random.rand(20, 13)
train_data.shape
train_data

array([[5.17297884e-01, 9.46962604e-01, 7.65459759e-01, 2.82395844e-01,
        2.21045363e-01, 6.86222085e-01, 1.67139203e-01, 3.92442466e-01,
        6.18052347e-01, 4.11930095e-01, 2.46488120e-03, 8.84032182e-01,
        8.84947538e-01],
       [3.00409689e-01, 5.89581865e-01, 9.78426916e-01, 8.45093822e-01,
        6.50754391e-02, 2.94744465e-01, 2.87934441e-01, 8.22466339e-01,
        6.26183038e-01, 1.10477714e-01, 5.28811169e-04, 9.42166233e-01,
        1.41500758e-01],
       [4.21596526e-01, 3.46489440e-01, 8.69785084e-01, 4.28601812e-01,
        8.28751484e-01, 7.17851838e-01, 1.19226694e-01, 5.96384173e-01,
        1.29756298e-01, 7.75340917e-02, 8.31205256e-01, 4.64385615e-01,
        1.62012479e-01],
       [5.47975292e-01, 5.88485822e-01, 7.73613169e-01, 6.55845458e-01,
        5.57706759e-01, 1.78247267e-01, 2.40583531e-01, 5.06054632e-01,
        3.96745699e-01, 4.83055185e-01, 9.55739841e-01, 9.01602193e-01,
        5.05759322e-01],
       [8.20701485e-01, 8.27715926e-

### Labels for the Training data

* Next, we assign labels to each <mention, entity> pair in the training data, such that:
> * The ground-truth entity is assigned the label (1) and is positioned at the start of the group (strictly speaking, you may place the ground-truth entity at any position within the group; we put it first to simplify the explanation). <br>
> * The rest of the <mention, entity> pairs are assigned the label (0).

**Note:** The features generated from each <mention, entity> pair should also follow the same order as that of the labels in each group.

In [4]:
## Labels for training data...
train_labels = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
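As a rough sketch, the label vector above could also be built from the group sizes rather than typed by hand. The helper `build_labels` and its `gold_positions` argument are hypothetical names introduced here for illustration; the default assumes the ground-truth entity sits at the start of each group, as in this demo.

```python
def build_labels(group_sizes, gold_positions=None):
    """Return a flat 0/1 label list with exactly one 1 per group.

    gold_positions[i] is the index of the ground-truth entity inside
    group i. It defaults to 0 (start of the group) for every group.
    """
    if gold_positions is None:
        gold_positions = [0] * len(group_sizes)
    labels = []
    for size, gold in zip(group_sizes, gold_positions):
        if not 0 <= gold < size:
            raise ValueError("gold position lies outside its group")
        group_labels = [0] * size
        group_labels[gold] = 1
        labels.extend(group_labels)
    return labels


example_labels = build_labels([5, 4, 4, 3, 4])
print(example_labels)
# [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
```

Whatever positions are used, the feature rows must be arranged in the same order as these labels.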

### Groups:

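Here, we form groups for the training data, i.e., we record the total number of candidate entities corresponding to each mention in the training data. [5, 4, 4, 3, 4] means that the first mention has 5 candidate entities, the second mention has 4 candidate entities, and so on. The code below recovers the group sizes from the positions of the label 1, which works only because each group starts with its ground-truth entity.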

In [5]:
## Form Groups...

idxs = np.where(train_labels == 1)[0]
train_groups = np.append(np.delete(idxs, 0), len(train_labels)) - idxs
print(train_groups)

[5 4 4 3 4]
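Before handing the groups to XGBoost, a quick consistency check may help catch ordering mistakes. The sketch below uses plain Python lists; `check_groups` is a hypothetical helper, and it assumes (as in this demo) that every group contains exactly one ground-truth entity.

```python
def check_groups(labels, groups):
    """Return True if the groups cover all labels and each holds one positive."""
    # The group sizes must add up to the number of rows.
    if sum(groups) != len(labels):
        return False
    start = 0
    for size in groups:
        # Each group is assumed to contain exactly one label equal to 1.
        if sum(labels[start:start + size]) != 1:
            return False
        start += size
    return True


demo_labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(check_groups(demo_labels, [5, 4, 4, 3, 4]))  # True
print(check_groups(demo_labels, [5, 4, 4, 4, 3]))  # False
```

The second call fails because the fourth group would then contain two ground-truth entities.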


### Transform Data for XGBoost...

For model training, $XGBoost$ expects a `DMatrix`. Here, we transform our training data into XGBoost's `DMatrix` format. For details, you may check the $XGBoost$ Python API documentation: https://xgboost.readthedocs.io/en/latest/python/python_api.html

In [6]:
def transform_data(features, groups, labels=None):
    xgb_data = xgb.DMatrix(data=features, label=labels)
    xgb_data.set_group(groups)
    return xgb_data


xgboost_train = transform_data(train_data, train_groups, train_labels)

### Generate Features for the Test data

We follow the same steps, as described previously, to randomly generate some features for testing.

In [7]:
## Randomly Generate Features for Testing....

## Set Numpy Random seed...
np.random.seed(53)

## Generate features of same dimensionality as that of training features...
test_data = np.random.rand(10, 13)

## Assign Groups, assuming there are 3 mentions, with 3, 3 and 4 candidate entities...
test_groups = np.array([3, 3, 4])

# Transform the features to XGBoost DMatrix...
xgboost_test = transform_data(test_data, test_groups)

### Model Training + Prediction

After feature generation and data transformation, the next step is to set the hyper-parameters of the $XGBoost$ classifier and train our model. Once the model is trained, we use it to generate predictions for the testing data.

**Note:** We use `rank:pairwise` as the objective function of our model.

In [8]:
## Parameters for XGBoost, you can fine-tune these parameters according to your settings...

param = {'max_depth': 8, 'eta': 0.05, 'silent': 1, 'objective': 'rank:pairwise',
         'min_child_weight': 0.01, 'lambda': 100}

## Train the classifier...
classifier = xgb.train(param, xgboost_train, num_boost_round=4900)
##  Predict test data...
preds = classifier.predict(xgboost_test)
preds

array([ 1.9814533 ,  1.4072076 , -0.5223563 ,  2.2223825 ,  0.3374607 ,
       -1.1113675 , -1.0744805 ,  2.9586015 ,  2.495078  , -0.91634274],
      dtype=float32)
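To give some intuition for a pairwise objective, the sketch below computes a RankNet-style pairwise logistic loss over within-group pairs. This is an assumption made for illustration, not XGBoost's internal implementation of `rank:pairwise`; the function name and the toy inputs are hypothetical.

```python
import math


def pairwise_logistic_loss(scores, labels, groups):
    """Average log(1 + exp(-(s_i - s_j))) over within-group pairs
    where candidate i has a higher label than candidate j."""
    total, n_pairs, start = 0.0, 0, 0
    for size in groups:
        rows = range(start, start + size)
        for i in rows:
            for j in rows:
                if labels[i] > labels[j]:
                    total += math.log1p(math.exp(-(scores[i] - scores[j])))
                    n_pairs += 1
        start += size
    return total / n_pairs if n_pairs else 0.0


toy_labels = [1, 0, 0]
good = pairwise_logistic_loss([2.0, 0.5, -1.0], toy_labels, [3])
bad = pairwise_logistic_loss([-1.0, 0.5, 2.0], toy_labels, [3])
print(good < bad)  # True: ranking the ground-truth entity first gives a lower loss
```

Only candidates within the same group are compared, which is why the group sizes have to be supplied alongside the features and labels.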

### Prediction scores of Each Testing Group...

We can consider the prediction scores of each group separately to get the final entity corresponding to each mention. Based on the prediction scores within a group, you may select the best candidate entity for the testing mention, e.g., the candidate with the highest score.

In [9]:
idx = 0

for iter_, group in enumerate(test_groups):
    print("Prediction scores for Group {} = {}".format(iter_, preds[idx:idx + group]))
    idx += group

Prediction scores for Group 0 = [ 1.9814533  1.4072076 -0.5223563]
Prediction scores for Group 1 = [ 2.2223825  0.3374607 -1.1113675]
Prediction scores for Group 2 = [-1.0744805   2.9586015   2.495078   -0.91634274]
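One possible way to turn these scores into a decision is to take the highest-scoring candidate in each group. The sketch below uses plain Python lists populated with the scores printed above; `best_per_group`, `example_scores`, and `example_groups` are hypothetical names used only for illustration.

```python
def best_per_group(scores, groups):
    """Return, for each group, the within-group index of the top score."""
    best = []
    start = 0
    for size in groups:
        group_scores = scores[start:start + size]
        # Index of the maximum score inside this group.
        best.append(max(range(size), key=group_scores.__getitem__))
        start += size
    return best


example_scores = [1.9814533, 1.4072076, -0.5223563, 2.2223825, 0.3374607,
                  -1.1113675, -1.0744805, 2.9586015, 2.495078, -0.91634274]
example_groups = [3, 3, 4]
print(best_per_group(example_scores, example_groups))  # [0, 0, 1]
```

Each returned index is relative to its own group, so mapping it back to an entity assumes the candidate entities of each mention are kept in the same order as the feature rows.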


In [11]:
for a, b in enumerate(test_groups):
    print(a, b)

0 3
1 3
2 4
