# MNIST using scikit-learn and SuperDuperDB

In a [previous example](mnist_torch.html) we discussed how to implement MNIST classification with CNNs in `torch`
using SuperDuperDB. Here we tackle the same task with a support vector machine from `scikit-learn`.

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
from sklearn import svm

As before, we'll connect to MongoDB through a connection URI
and "wrap" our database to convert it to a SuperDuper `Datalayer`:

In [2]:
import os

# Uncomment one of the following lines to use a bespoke MongoDB deployment
# For testing the default connection is to mongomock

mongodb_uri = os.getenv("MONGODB_URI","mongomock://test")
# mongodb_uri = "mongodb://localhost:27017"
# mongodb_uri = "mongodb://pinnacle:pinnacle@mongodb:27017/documents"
# mongodb_uri = "mongodb://<user>:<pass>@<mongo_cluster>/<database>"
# mongodb_uri = "mongodb+srv://<username>:<password>@<atlas_cluster>/<database>"

# Super-Duper your Database!
from pinnacledb import pinnacle
db = pinnacle(mongodb_uri)

INFO:numexpr.utils:NumExpr defaulting to 8 threads.


As before, we can add data to SuperDuperDB in a way that is very similar to using `pymongo`.
This time, we'll add the data as `numpy.array` objects, using the `Document-Encoder` formalism:

In [3]:
from pinnacledb.ext.numpy.array import array
from pinnacledb.container.document import Document as D
from pinnacledb.db.mongodb.query import Collection

mnist = fetch_openml('mnist_784')
ix = np.random.permutation(10000)
X = np.array(mnist.data)[ix, :]
y = np.array(mnist.target)[ix].astype(int)

a = array('float64', shape=(784,))

collection = Collection(name='mnist')

data = [D({'img': a(X[i]), 'class': int(y[i])}) for i in range(len(X))]

db.execute(
    collection.insert_many(data, encoders=[a])
)

INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x7febed1af280>,
 TaskWorkflow(database=<pinnacledb.db.base.db.DB object at 0x7feced51f8b0>, G=<networkx.classes.digraph.DiGraph object at 0x7fecf817b700>))

In [4]:
db.execute(collection.find_one())

Document({'img': Encodable(encoder=Encoder(identifier='numpy.float64[784]', decoder=<Artifact artifact=<pinnacledb.ext.numpy.array.DecodeArray object at 0x7fec49ed0130> serializer=dill>, encoder=<Artifact artifact=<pinnacledb.ext.numpy.array.EncodeArray object at 0x7fec49ed0190> serializer=dill>, shape=[784], version=0), x=array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,
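
The `Encodable` in the output above pairs the raw array with an encoder that records its dtype and shape (`numpy.float64[784]`). As a rough illustration of what such an encode/decode round trip could look like, here is a minimal sketch using only NumPy. The helpers `encode_array` and `decode_array` are hypothetical names introduced for this example and are not the library's implementation.

```python
import numpy as np


def encode_array(x: np.ndarray) -> bytes:
    """Serialize an array to raw bytes; dtype and shape must be tracked separately."""
    return x.tobytes()


def decode_array(raw: bytes, dtype: str, shape: tuple) -> np.ndarray:
    """Rebuild the array from raw bytes using the known dtype and shape."""
    return np.frombuffer(raw, dtype=dtype).reshape(shape)


# A stand-in for one flattened 28x28 MNIST image
img = np.zeros(784, dtype="float64")
img[100] = 255.0

raw = encode_array(img)
restored = decode_array(raw, "float64", (784,))

# The round trip preserves every value
assert np.array_equal(img, restored)
```

Because the bytes alone carry no dtype or shape, the encoder has to keep that metadata, which is why it appears in the encoder's identifier.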

Models are built in the same way as the `Datalayer`, by wrapping a model from the standard Python AI ecosystem, here a `scikit-learn` support vector classifier:

In [5]:
model = pinnacle(
    svm.SVC(gamma='scale', class_weight='balanced', C=100, verbose=True),
    postprocess=lambda x: int(x)
)

Now let's fit the model. The optimization uses `scikit-learn`'s built-in training procedures.
Unlike in a standard `sklearn` use case, we don't need to fetch the data client side. Instead,
we simply name the fields in the MongoDB collection that we'd like to use.

In [6]:
model.fit(X='img', y='class', db=db, select=collection.find())

100%|████████████████████████████████████████████| 9522/9522 [00:00<00:00, 122257.31it/s]


[LibSVM]*
optimization finished, #iter = 244
obj = -22.563081, rho = -0.511772
nSV = 104, nBSV = 0
*
optimization finished, #iter = 653
obj = -84.729413, rho = 0.218175
nSV = 246, nBSV = 0
*
optimization finished, #iter = 604
obj = -70.892050, rho = 0.088456
nSV = 233, nBSV = 0
*
optimization finished, #iter = 414
obj = -48.010933, rho = -0.193460
nSV = 167, nBSV = 0
*
optimization finished, #iter = 784
obj = -89.431867, rho = 0.071219
nSV = 274, nBSV = 0
*
optimization finished, #iter = 608
obj = -87.859848, rho = -0.075461
nSV = 208, nBSV = 0
*
optimization finished, #iter = 445
obj = -50.251100, rho = -0.103000
nSV = 173, nBSV = 0
*
optimization finished, #iter = 557
obj = -70.816661, rho = -0.004518
nSV = 224, nBSV = 0
*
optimization finished, #iter = 558
obj = -73.233331, rho = -0.241526
nSV = 203, nBSV = 0
*
optimization finished, #iter = 639
obj = -128.556342, rho = 0.754778
nSV = 183, nBSV = 0
*
optimization finished, #iter = 492
obj = -92.165363, rho = 0.876759
nSV = 163, nBSV
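
For comparison, the client-side workflow that the call above replaces can be sketched with plain `scikit-learn`. This sketch makes two assumptions that are not part of the notebook: it uses the small `load_digits` dataset bundled with `scikit-learn` as an offline stand-in for MNIST, and it creates its own train/validation split with `train_test_split`.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data: 8x8 digit images shipped with scikit-learn (not MNIST)
X, y = load_digits(return_X_y=True)

# Client side, we have to fetch and split the data ourselves
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same estimator settings as the wrapped model, without verbose logging
clf = svm.SVC(gamma='scale', class_weight='balanced', C=100)
clf.fit(X_train, y_train)

valid_accuracy = accuracy_score(y_valid, clf.predict(X_valid))
print(valid_accuracy)
```

With the wrapped model, the fetching and splitting steps are handled against the collection, so only the field names `X='img'` and `y='class'` need to be supplied.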

Installed models and functionality can be viewed using `db.show`:

In [7]:
db.show('model')

['svc']

The model may be reloaded in another session from the database. 
As with `.fit`, the model may be applied to data in the database with `.predict`:

In [None]:
m = db.load('model', 'svc')
m.predict(X='img', db=db, select=collection.find(), max_chunk_size=3000)
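
The `max_chunk_size` argument suggests that predictions are computed in batches rather than on the whole result set at once. The sketch below shows that general idea with plain `scikit-learn` and NumPy. The helper `predict_in_chunks` is a hypothetical name, the `load_digits` data is a stand-in for MNIST, and how the library batches internally is an assumption here.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits


def predict_in_chunks(model, X, max_chunk_size):
    """Run model.predict over X in slices of at most max_chunk_size rows."""
    outputs = []
    for start in range(0, len(X), max_chunk_size):
        chunk = X[start:start + max_chunk_size]
        outputs.append(model.predict(chunk))
    return np.concatenate(outputs)


# Stand-in data and model (not the notebook's MNIST collection)
X, y = load_digits(return_X_y=True)
clf = svm.SVC(gamma='scale').fit(X, y)

chunked = predict_in_chunks(clf, X, max_chunk_size=500)

# Chunking changes memory use, not the predictions themselves
assert np.array_equal(chunked, clf.predict(X))
```

Smaller chunks keep fewer decoded arrays in memory at a time, at the cost of more calls to `predict`.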

We can verify that the predictions make sense by fetching a random data point from the validation fold and comparing its label with the model output:

In [None]:
r = next(db.execute(collection.aggregate([
    {'$match': {'_fold': 'valid'}},
    {'$sample': {'size': 1}},
])))
print(r['class'])
print(r['_outputs'])
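
A single sample is only a spot check. To summarize quality over many validation records, the `accuracy_score` and `classification_report` functions imported at the top could be used. The sketch below assumes the labels and predictions have already been collected into plain dictionaries; the `records` list is hand-written illustrative data, the `prediction` key is a hypothetical name, and the real layout of `_outputs` may differ.

```python
from sklearn.metrics import accuracy_score, classification_report

# Illustrative records only: not real model output
records = [
    {"class": 3, "prediction": 3},
    {"class": 7, "prediction": 7},
    {"class": 1, "prediction": 1},
    {"class": 0, "prediction": 0},
    {"class": 4, "prediction": 9},
]

y_true = [r["class"] for r in records]
y_pred = [r["prediction"] for r in records]

# Fraction of records whose prediction matches the label
acc = accuracy_score(y_true, y_pred)
print(acc)

# Per-class precision, recall and F1; zero_division=0 avoids warnings
# for classes that appear in only one of the two lists
report = classification_report(y_true, y_pred, zero_division=0)
print(report)
```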