## A first example running SMURFF

In this notebook we will run the Bayesian Probabilistic Matrix Factorization (BPMF)
algorithm using SMURFF on compound-activity data.

This notebook is available on [SMURFF's GitHub repository](https://github.com/ExaScience/smurff/tree/master/python/notebooks).

### Downloading the Data Files

In these examples we use the ChEMBL dataset of compound-protein activities (IC50). The cell below downloads the IC50 values; the ECFP fingerprints are not needed for this first example.

In [None]:
%%bash
wget http://homes.esat.kuleuven.be/~jsimm/chembl-IC50-346targets.mm

### Load the downloaded files

The `scipy.io.mmread` function loads Matrix Market `.mm` files. The `ic50` matrix is
a sparse matrix containing interactions between chemical compounds (in the rows)
and protein targets (called assays, in the columns). We will use this `ic50` matrix
as the source of our train and test data.
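
As a small illustration of what a Matrix Market file holds, the sketch below writes a toy sparse matrix to an in-memory buffer and reads it back with `scipy.io.mmread`. The matrix `toy` and its values are made up for this example; they are not part of the ChEMBL data.

```python
import io

import scipy.io
import scipy.sparse

# Toy 3x4 activity matrix with three observed entries (values are made up).
rows = [0, 1, 2]
cols = [1, 3, 0]
vals = [5.2, 6.1, 4.8]
toy = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(3, 4))

# Write the matrix in Matrix Market format to a binary in-memory buffer,
# then rewind the buffer and read the matrix back.
buf = io.BytesIO()
scipy.io.mmwrite(buf, toy)
buf.seek(0)
loaded = scipy.io.mmread(buf)

# Only the observed entries are stored, so nnz counts the measurements.
print(loaded.shape, loaded.nnz)
```

The loaded object is a sparse matrix in coordinate format, which is why the plotting cell further down converts it with `tocsr()` before slicing rows.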

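SMURFF has a `make_train_test` function that splits the loaded matrix into a train set and a test set. Below, the second argument `0.2` asks for 20% of the observed values to be held out as test data.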

In [None]:
import smurff
import scipy.io

import scipy.sparse
import numpy

## loading data
ic50 = scipy.io.mmread("chembl-IC50-346targets.mm")

## creating train and test sets
ic50_train, ic50_test = smurff.make_train_test(ic50, 0.2)
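
To make the idea of splitting a sparse matrix concrete, here is a minimal sketch that randomly assigns the observed entries of a toy matrix to a train or a test matrix of the same shape. The helper `split_observed_entries` is hypothetical and written only for illustration; it is not SMURFF's implementation of `make_train_test`, which may differ in its details.

```python
import numpy
import scipy.sparse


def split_observed_entries(matrix, test_fraction, seed=0):
    """Hypothetical helper: randomly split stored entries into train and test."""
    coo = scipy.sparse.coo_matrix(matrix)
    rng = numpy.random.default_rng(seed)

    # Shuffle the indices of the stored entries and cut off the test part.
    n_test = int(round(test_fraction * coo.nnz))
    order = rng.permutation(coo.nnz)
    test_idx, train_idx = order[:n_test], order[n_test:]

    def subset(idx):
        # Keep the original shape so row and column indices stay comparable.
        return scipy.sparse.coo_matrix(
            (coo.data[idx], (coo.row[idx], coo.col[idx])), shape=coo.shape)

    return subset(train_idx), subset(test_idx)


# Toy data: a random 50x20 sparse matrix with 20% of the cells filled.
toy = scipy.sparse.random(50, 20, density=0.2, random_state=0, format="coo")
toy_train, toy_test = split_observed_entries(toy, 0.2)
print(toy_train.nnz, toy_test.nnz)
```

Every stored entry ends up in exactly one of the two matrices, so the test values are never seen during training.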

### Having a look at the data

The `spy` function in `matplotlib` is handy for plotting the sparsity pattern of a matrix. Below we plot the pattern for the first 1000 compounds.

In [None]:
%matplotlib notebook

from matplotlib.pyplot import figure, show
from scipy.sparse import coo_matrix

fig = figure()
ax = fig.add_subplot(111)
ax.spy(ic50.tocsr()[0:1000,:].T, markersize = 1)
show()

In [None]:
%load_ext wurlitzer

### Running SMURFF

Finally we create a SMURFF BPMF training session and call `run`. The `run` function
returns the `predictions` for the test data.

We can use the `calc_rmse` function to calculate the RMSE.

In [None]:
session = smurff.BPMFSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       num_latent = 16,
                       burnin     = 40,
                       nsamples   = 200,
                       verbose    = 1,)

predictions = session.run()

In [None]:
rmse = smurff.calc_rmse(predictions)
rmse
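
For reference, the RMSE is the square root of the mean squared difference between predicted and actual values. The sketch below computes it with plain `numpy` on made-up numbers; the function name `rmse_from_arrays` and the toy arrays are assumptions for illustration and not part of SMURFF.

```python
import numpy


def rmse_from_arrays(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = numpy.asarray(actual, dtype=float)
    predicted = numpy.asarray(predicted, dtype=float)
    return float(numpy.sqrt(numpy.mean((predicted - actual) ** 2)))


# Made-up values: two perfect predictions and one that is off by 1.
toy_actual = [5.0, 6.0, 7.0]
toy_predicted = [5.0, 6.0, 8.0]
toy_rmse = rmse_from_arrays(toy_actual, toy_predicted)
print(toy_rmse)
```

With the SMURFF predictions, the two arrays could be built from `p.val` and `p.pred_avg`, as done in the plotting cell of this notebook.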

### Plotting predictions versus actual values

Besides the RMSE, we can also plot the predicted versus the actual values to see how well the model performs.

In [None]:
%matplotlib notebook

from matplotlib.pyplot import subplots, show

y = numpy.array([ p.val for p in predictions ])
predicted = numpy.array([ p.pred_avg for p in predictions ])

fig, ax = subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
show()