# napkinXC demo @ Prosus Global AI Marketplace

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification. It lets you train a classifier on very large datasets in a few lines of code with minimal resources. This notebook demonstrates how to train and evaluate an extreme multi-label classifier using napkinXC.

napkinXC GitHub: [https://github.com/mwydmuch/napkinXC](https://github.com/mwydmuch/napkinXC)

napkinXC authors:
- Marek Wydmuch (Data Scientist @ OLX Group, PhD Student @ Poznań University of Technology)
- Kalina Jasinska-Kobus (Research Engineer @ Allegro ML Research, PhD Student @ Poznań University of Technology)
- Krzysztof Dembczynski (Senior Research Scientist @ Yahoo! Research)
- Robert Busa-Fekete (Research Scientist @ Google Research)

In [1]:
!pip install napkinxc



In [2]:
!system_profiler SPHardwareDataType | head -n 14 | tail -n 10

      Model Name: MacBook Pro
      Model Identifier: MacBookPro15,2
      Processor Name: Quad-Core Intel Core i7
      Processor Speed: 2,7 GHz
      Number of Processors: 1
      Total Number of Cores: 4
      L2 Cache (per Core): 256 KB
      L3 Cache: 8 MB
      Hyper-Threading Technology: Enabled
      Memory: 16 GB


### Benchmark dataset

Let's use the `load_dataset` function to load the AmazonCat-13K dataset, one of the benchmark datasets from the [XML Repository](http://manikvarma.org/downloads/XC/XMLRepository.html).

The task in the AmazonCat-13K dataset is to assign topics to books from the Amazon catalog based on their descriptions.

In [3]:
from napkinxc.datasets import load_dataset

X_train, Y_train = load_dataset("amazoncat-13k", "train", format='tf-idf')
X_test, Y_test = load_dataset("amazoncat-13k", "test", format='tf-idf')

Let's calculate the most basic statistics of this dataset:

In [4]:
print("features:", X_train.shape[1])
print("labels:", max([max((0,) + y) for y in Y_train]))
print("train data points:", X_train.shape[0])
print("test data points:", X_test.shape[0])
print("avg. labels per data point:", sum([len(y) for y in Y_train]) / X_train.shape[0])

features: 203882
labels: 13329.0
train data points: 1186239
test data points: 306782
avg. labels per data point: 5.04066971327026
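
The statistics cell above relies on a small trick: `(0,) + y` prepends a sentinel so that `max` never receives an empty tuple for a data point without labels. The sketch below reproduces the same computations on toy data. `Y_toy` and its values are hypothetical, and the step from the largest index to a label count assumes labels are 0-indexed.

```python
# Toy label data shaped like the notebook's Y_train is assumed to be:
# one tuple of positive label indices per data point (hypothetical values).
Y_toy = [(0, 3), (), (2,), (1, 3, 4)]

# `(0,) + y` prepends a sentinel, so max() works even for the empty tuple.
max_label = max(max((0,) + y) for y in Y_toy)

# Assumption: labels are 0-indexed, so the count is the largest index plus one.
n_labels = max_label + 1

# Average number of positive labels per data point.
avg_labels = sum(len(y) for y in Y_toy) / len(Y_toy)

print("largest label index:", max_label)
print("number of labels (assuming 0-indexing):", n_labels)
print("avg. labels per data point:", avg_labels)
```

Under that assumption, the `labels:` value printed above is the largest label index rather than the label count.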


### Training

Let's import the Probabilistic Label Trees (`PLT`) model from napkinXC and construct it.
Because napkinXC stores unused data on the local drive to reduce its memory footprint during training, the constructor requires a path to the model directory as its first argument.

In [5]:
from napkinxc.models import PLT

plt = PLT("amazoncat-model", optimizer="adagrad", epochs=1, threads=4)

Now we are ready to fit the model on the training data. napkinXC follows sklearn conventions for method names.
Currently, X can be a dense NumPy matrix or a sparse SciPy CSR matrix, while Y can be a SciPy sparse matrix or a list of lists of positive labels.

In [6]:
import time

train_start = time.time()
plt.fit(X_train, Y_train)
train_end = time.time()
print("train time:", train_end - train_start, "s")

train time: 94.39669394493103 s
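
To see how the two label formats accepted by `fit` relate to each other, here is a minimal pure-Python sketch that converts a list of positive-label lists into the three arrays of the CSR layout. The helper name `labels_to_csr_arrays` and the toy data are hypothetical; this is not code from napkinXC.

```python
def labels_to_csr_arrays(Y):
    """Convert a list of positive-label lists into CSR (data, indices, indptr) arrays."""
    data, indices, indptr = [], [], [0]
    for labels in Y:
        # One stored entry (value 1) per distinct positive label in this row.
        for label in sorted(set(labels)):
            indices.append(int(label))
            data.append(1)
        # indptr[i + 1] marks where row i ends in `indices` and `data`.
        indptr.append(len(indices))
    return data, indices, indptr


# Hypothetical toy labels: four data points, the second has no labels.
Y_toy = [[0, 3], [], [2], [1, 3, 4]]
data, indices, indptr = labels_to_csr_arrays(Y_toy)

# Number of columns, assuming 0-indexed labels.
n_cols = max(indices) + 1 if indices else 0

print("data:   ", data)
print("indices:", indices)
print("indptr: ", indptr)
```

If SciPy is installed, `scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(Y_toy), n_cols))` should build the corresponding sparse label matrix from these arrays.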


### Evaluation
Now let's evaluate our model. We first predict the single label with the highest probability for each data point in the test set.

In [7]:
prediction_start = time.time()
Y_pred = plt.predict(X_test, top_k=1)
prediction_end = time.time()
print("prediction time:", prediction_end - prediction_start, "s")
print("prediction time / data point:", (prediction_end - prediction_start) / len(Y_pred) * 1000, "ms")

prediction time: 26.69292902946472 s
prediction time / data point: 0.0870094367644279 ms
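
The timing pattern used in the cells above can be wrapped in a small helper. This is only a sketch: `timed`, `per_item_ms`, and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in for a real model's `predict` method. It uses `time.perf_counter`, which is generally better suited to measuring intervals than `time.time`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def per_item_ms(total_seconds, n_items):
    """Average time per item in milliseconds."""
    return total_seconds / n_items * 1000


def fake_predict(X, top_k=1):
    """Hypothetical stand-in for a predict call: returns top_k dummy labels per row."""
    return [[0] * top_k for _ in X]


preds, elapsed = timed(fake_predict, range(1000), top_k=1)
print("prediction time:", elapsed, "s")
print("prediction time / data point:", per_item_ms(elapsed, len(preds)), "ms")

# The same arithmetic applied to the numbers reported above
# (26.69 s over 306782 test points).
reported_ms = per_item_ms(26.69292902946472, 306782)
print("reported time / data point:", reported_ms, "ms")
```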


Finally, we can evaluate the model's predictive performance using `precision_at_k` (precision at k-th place), one of the measures implemented in napkinXC.

Precision at k is defined as:

$$
\mathrm{p}@k(\boldsymbol{y}, \boldsymbol{x}, \boldsymbol{h}) = \frac{1}{k} \sum_{j \in \hat {\mathcal{Y}_k}} [[ y_j = 1 ]],
$$

where $\hat {\mathcal{Y}_k}$ is a set of $k$ labels predicted by classifier $\boldsymbol{h}$ for $\boldsymbol{x}$.

In [8]:
from napkinxc.measures import precision_at_k

print("p@1:", precision_at_k(Y_test, Y_pred, k=1))

p@1: 0.9308792562797035
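
The definition of precision at k given above can be sketched in plain Python. This is an illustrative re-implementation, not napkinXC's `precision_at_k`. The function name `precision_at_k_sketch` and the toy data are hypothetical, and averaging the per-point values over the dataset is an assumption about how the reported number is aggregated.

```python
def precision_at_k_sketch(Y_true, Y_pred, k):
    """Mean over data points of (number of true labels among the top-k predictions) / k."""
    total = 0.0
    for true_labels, ranked in zip(Y_true, Y_pred):
        true_set = set(true_labels)
        # Count predicted labels in the top k that are truly positive.
        hits = sum(1 for label in ranked[:k] if label in true_set)
        total += hits / k
    return total / len(Y_true)


# Hypothetical toy data: true label sets and predictions ranked by probability.
Y_true = [(0, 3), (2,), (1, 3, 4)]
Y_pred = [[3, 1], [0, 2], [4, 1]]

p1 = precision_at_k_sketch(Y_true, Y_pred, k=1)  # per-point hits: 1, 0, 1
p2 = precision_at_k_sketch(Y_true, Y_pred, k=2)  # per-point values: 1/2, 1/2, 2/2
print("p@1:", p1)
print("p@2:", p2)
```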


The current state-of-the-art methods based on transformer language models obtain a p@1 of ~0.95 on this dataset. However, they require hours of training on much larger machines. Using napkinXC, you can train a model with comparable performance in about a minute and a half on a laptop.

Training a napkinXC PLT model on the Amazon-3M dataset, a product-to-product recommendation task with 1.7 million training examples and 2.8 million labels, takes ~2h on a similar machine. It achieves ~0.47 almost out of the box, while the current state of the art achieves ~0.50.

### napkinXC allows you to quickly apply extreme classification to your problem!
### napkinXC is the only library available that allows online prediction within milliseconds!

napkinXC is currently under heavy development. We plan to add support for using any type of binary classifier from Python.