# MNIST using scikit-learn and SuperDuperDB

In a [previous example](mnist_torch.html) we discussed how to implement MNIST classification with CNNs in `torch`
using SuperDuperDB. Here we tackle the same task with a support vector machine from scikit-learn.

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
from sklearn import svm

As before, we'll import the Python MongoDB client `pymongo`
and "wrap" our database to convert it to a SuperDuper `Datalayer`:

In [2]:
import pymongo
from pinnacledb import pinnacle

db = pymongo.MongoClient().documents

db = pinnacle(db)

INFO:faiss.loader:Loading faiss.
INFO:faiss.loader:Successfully loaded faiss.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.


In [3]:
# Start from a clean slate: drop the database and its file store from any previous run
db.db.client.drop_database('documents')
db.db.client.drop_database('_filesystem:documents')

As last time, we can add data to SuperDuperDB in a way that is very similar to using `pymongo`.
This time, we'll add the data as `numpy.array` objects, using the `Document-Encoder` formalism:

In [4]:
from pinnacledb.ext.numpy.array import array
from pinnacledb.container.document import Document as D
from pinnacledb.db.mongodb.query import Collection

mnist = fetch_openml('mnist_784')
ix = np.random.permutation(10000)
X = np.array(mnist.data)[ix, :]
y = np.array(mnist.target)[ix].astype(int)

a = array('float64', shape=(784,))

collection = Collection(name='mnist')

data = [D({'img': a(X[i]), 'class': int(y[i])}) for i in range(len(X))]

db.execute(
    collection.insert_many(data, encoders=[a])
)

INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x206bbe200>,
 TaskWorkflow(database=<pinnacledb.db.base.db.DB object at 0x19d769b90>, G=<networkx.classes.digraph.DiGraph object at 0x19d831410>))
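
To make the encoder idea concrete, here is a minimal sketch of what an array encoder conceptually does: it fixes a dtype and shape, turns an array into bytes for storage, and rebuilds the array on the way out. The `ArrayCodec` class is hypothetical and written only for illustration; it is not the implementation behind `array('float64', shape=(784,))`.

```python
import numpy as np


class ArrayCodec:
    """Hypothetical stand-in for an encoder: fixed dtype and shape, bytes for storage."""

    def __init__(self, dtype, shape):
        self.dtype = np.dtype(dtype)
        self.shape = tuple(shape)

    def encode(self, x):
        # Validate the input, then serialize it to raw bytes a document store can hold.
        x = np.asarray(x, dtype=self.dtype)
        if x.shape != self.shape:
            raise ValueError(f"expected shape {self.shape}, got {x.shape}")
        return x.tobytes()

    def decode(self, raw):
        # Rebuild the array from bytes using the dtype and shape kept by the codec.
        return np.frombuffer(raw, dtype=self.dtype).reshape(self.shape)


codec = ArrayCodec("float64", (784,))
original = np.arange(784, dtype="float64")
restored = codec.decode(codec.encode(original))
```

Keeping the dtype and shape with the codec rather than with each document means the stored payload can stay a plain byte string.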

In [5]:
db.execute(collection.find_one())

Document({'_id': ObjectId('64c96ab0a31b3b21c9d05a7f'), 'img': Encodable(encoder=Encoder(identifier='numpy.float64[784]', decoder=<Artifact artifact=<pinnacledb.ext.numpy.array.DecodeArray object at 0x19d83a310> serializer=dill>, encoder=<Artifact artifact=<pinnacledb.ext.numpy.array.EncodeArray object at 0x19d83a190> serializer=dill>, shape=[784], version=0), x=array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0

Models are built in the same way as the `Datalayer`, by wrapping a model from the standard Python AI ecosystem, here scikit-learn's `SVC`:

In [6]:
model = pinnacle(
    svm.SVC(gamma='scale', class_weight='balanced', C=100, verbose=True),
    postprocess=lambda x: int(x)
)
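
One plausible reason for the `postprocess=lambda x: int(x)` step is that scikit-learn predictions come back as NumPy scalars, which generic serializers often reject. The sketch below illustrates this with the standard `json` module, used here only as an assumed stand-in for a document serializer; the helper `is_json_serializable` is hypothetical.

```python
import json

import numpy as np

# A NumPy integer scalar, like a single element of a predict() result.
prediction = np.array([7])[0]


def is_json_serializable(value):
    """Return True if the standard json module can serialize `value`."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


raw_ok = is_json_serializable(prediction)             # NumPy scalar as-is
converted_ok = is_json_serializable(int(prediction))  # after the int() conversion
```

Converting to a built-in `int` before storage sidesteps the problem without changing the predicted value.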

Now let's fit the model. The optimization uses scikit-learn's built-in training procedures.
Unlike in a standard `sklearn` use case, we don't need to fetch the data client-side. Instead,
we simply name the fields in the MongoDB collection that we'd like to use.

In [7]:
model.fit(X='img', y='class', db=db, select=collection.find())

100%|██████████| 9500/9500 [00:00<00:00, 252304.14it/s]


[LibSVM]*
optimization finished, #iter = 236
obj = -23.399893, rho = -0.463885
nSV = 105, nBSV = 0
*
optimization finished, #iter = 609
obj = -81.768282, rho = 0.193953
nSV = 242, nBSV = 0
*
optimization finished, #iter = 613
obj = -67.032301, rho = 0.117926
nSV = 230, nBSV = 0
*
optimization finished, #iter = 419
obj = -46.100537, rho = -0.205733
nSV = 166, nBSV = 0
*
optimization finished, #iter = 722
obj = -88.213315, rho = 0.142346
nSV = 265, nBSV = 0
*
optimization finished, #iter = 594
obj = -88.259112, rho = -0.059735
nSV = 214, nBSV = 0
*
optimization finished, #iter = 460
obj = -49.869525, rho = -0.106536
nSV = 177, nBSV = 0
*
optimization finished, #iter = 624
obj = -72.297982, rho = -0.023123
nSV = 237, nBSV = 0
*
optimization finished, #iter = 549
obj = -71.066525, rho = -0.211969
nSV = 210, nBSV = 0
*
optimization finished, #iter = 623
obj = -130.642691, rho = 0.679970
nSV = 179, nBSV = 0
*
optimization finished, #iter = 513
obj = -100.402195, rho = 0.729825
nSV = 166, nBS
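
For contrast, a conventional client-side workflow with the same estimator settings might look like the sketch below. It assumes scikit-learn's bundled `load_digits` dataset (8x8 digit images) instead of MNIST so that it runs without a download, and the 5% hold-out is an assumption chosen to mirror the 9500 of 10000 documents reported by the progress bar above.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Fetch the data client-side: features and labels live in local arrays.
X, y = load_digits(return_X_y=True)

# Hold out 5% of the samples for validation (assumed split, see lead-in).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.05, random_state=0
)

# Same hyperparameters as the wrapped model, minus the verbose output.
clf = svm.SVC(gamma="scale", class_weight="balanced", C=100)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_valid, clf.predict(X_valid))
```

Here the caller owns data loading, splitting, and bookkeeping, which is the work that `model.fit(X='img', y='class', ...)` delegates to the database layer.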

Installed models and functionality can be viewed using `db.show`:

In [8]:
db.show('model')

['svc']

The model may be reloaded from the database in another session.
As with `.fit`, the model may be applied to data in the database with `.predict`:

In [9]:
m = db.load('model', 'svc')
m.predict(X='img', db=db, select=collection.find(), max_chunk_size=3000)



Computing chunk 0/3
Computing chunk 1/3
Computing chunk 2/3
Computing chunk 3/3
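
The four chunks in the output are consistent with splitting 10000 documents into pieces of at most 3000. A minimal sketch of such chunking follows; `iter_chunks` is a hypothetical helper, and the library's internal logic may differ.

```python
def iter_chunks(items, max_chunk_size):
    """Yield consecutive slices of `items` with at most `max_chunk_size` elements."""
    for start in range(0, len(items), max_chunk_size):
        yield items[start:start + max_chunk_size]


ids = list(range(10000))  # stand-in for the ids selected by the query
chunks = list(iter_chunks(ids, 3000))
sizes = [len(chunk) for chunk in chunks]
```

Bounding the chunk size keeps only part of the collection in memory at any one time during prediction.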


We can verify that the predictions make sense by sampling a random data point from the validation fold and comparing its label with the stored model output:

In [18]:
r = next(db.execute(collection.aggregate([{'$match': {'_fold': 'valid'}}, {'$sample': {'size': 1}}])))
print(r['class'])
print(r['_outputs'])

0
{'img': {'svc': 0}}
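
A single sample is only a spot check. The `accuracy_score` and `classification_report` functions imported at the top of the notebook can summarize many records at once. The sketch below uses a handful of made-up records shaped like the document printed above; in the notebook they would instead come from a query over the validation fold.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical records with the same layout as the sampled document:
# the true label under 'class', the prediction under _outputs -> img -> svc.
records = [
    {"class": 0, "_outputs": {"img": {"svc": 0}}},
    {"class": 7, "_outputs": {"img": {"svc": 7}}},
    {"class": 4, "_outputs": {"img": {"svc": 9}}},
    {"class": 1, "_outputs": {"img": {"svc": 1}}},
]

y_true = [r["class"] for r in records]
y_pred = [r["_outputs"]["img"]["svc"] for r in records]

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids warnings for labels that are never predicted or never occur.
report = classification_report(y_true, y_pred, zero_division=0)
print(report)
```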
