# MNIST using scikit-learn and SuperDuperDB

In a [previous example](mnist_torch.html) we discussed how to implement MNIST classification with CNNs in `torch`
using SuperDuperDB. Here we tackle the same task with `scikit-learn`.

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn import svm

As before, we'll import the Python MongoDB client `pymongo`
and "wrap" our database to convert it to a SuperDuperDB `Datalayer`:

In [2]:
import pymongo
from pinnacledb import pinnacle

db = pymongo.MongoClient().documents

db = pinnacle(db)

INFO:faiss.loader:Loading faiss.
INFO:faiss.loader:Successfully loaded faiss.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.


In [3]:
# Drop any existing data so the example starts from a clean state
db.db.client.drop_database('documents')
db.db.client.drop_database('_filesystem:documents')

As last time, we can add data to SuperDuperDB in a way which is very similar to using `pymongo`.
This time, we'll add the data as `numpy.array` to SuperDuperDB, using the `Document-Encoder` formalism:

In [4]:
from pinnacledb.ext.numpy.array import array
from pinnacledb.container.document import Document as D
from pinnacledb.db.mongodb.query import Collection

mnist = fetch_openml('mnist_784')
ix = np.random.permutation(10000)
X = np.array(mnist.data)[ix, :]
y = np.array(mnist.target)[ix].astype(int)

a = array('float64', shape=(784,))

collection = Collection(name='mnist')

data = [D({'img': a(X[i]), 'class': int(y[i])}) for i in range(len(X))]

db.execute(
    collection.insert_many(data, encoders=[a])
)

INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x1ab48cca0>,
 TaskWorkflow(database=<pinnacledb.db.base.datalayer.Datalayer object at 0x19be72790>, G=<networkx.classes.digraph.DiGraph object at 0x19bf3dd10>))
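
To make the encoder idea concrete, here is a minimal sketch of the kind of round trip an array encoder has to perform: turn a `numpy` array into bytes that a database can store, then rebuild the array from those bytes. The helper names `encode_array` and `decode_array` are hypothetical and illustrate the concept only; they are not the library's implementation.

```python
import numpy as np

# The dtype and shape mirror the encoder declared above: float64, 784 values
DTYPE = 'float64'
SHAPE = (784,)

def encode_array(x):
    """Serialize an array to raw bytes (hypothetical helper)."""
    return np.asarray(x, dtype=DTYPE).tobytes()

def decode_array(payload):
    """Rebuild the array from raw bytes using the known dtype and shape."""
    return np.frombuffer(payload, dtype=DTYPE).reshape(SHAPE)

original = np.random.rand(*SHAPE)
payload = encode_array(original)
restored = decode_array(payload)

print(len(payload))                        # 784 values * 8 bytes each
print(np.array_equal(original, restored))  # the round trip is lossless
```

Because the dtype and shape are fixed ahead of time, the bytes alone are enough to reconstruct the array, which is why the encoder is declared once and reused for every document.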

In [5]:
db.execute(collection.find_one())

Document({'_id': ObjectId('64c8e8d74c309285038096ec'), 'img': Encodable(encoder=Encoder(identifier='numpy.float64[784]', decoder=<Artifact artifact=<pinnacledb.ext.numpy.array.DecodeArray object at 0x19bf414d0> serializer=dill>, encoder=<Artifact artifact=<pinnacledb.ext.numpy.array.EncodeArray object at 0x19bf45d10> serializer=dill>, shape=[784], version=0), x=array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0

Models are built similarly to the `Datalayer`, by wrapping a model from the standard Python AI ecosystem, here a scikit-learn `SVC`:

In [6]:
model = pinnacle(
    svm.SVC(gamma='scale', class_weight='balanced', C=100, verbose=True),
    postprocess=lambda x: int(x)
)
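
One plausible reason for the `postprocess` step (an assumption; the source does not say) is that scikit-learn typically returns `numpy` scalars, which generic serializers tend to reject. The sketch below uses the standard-library `json` module as a stand-in for a database serializer to show the difference `int(x)` makes.

```python
import json
import numpy as np

raw = np.int64(7)   # e.g. a single element of a classifier's prediction array
clean = int(raw)    # what `postprocess=lambda x: int(x)` would produce

try:
    json.dumps({'prediction': raw})
    raw_serializable = True
except TypeError:
    # numpy integers are not plain Python ints, so json refuses them
    raw_serializable = False

print(raw_serializable)
print(json.dumps({'prediction': clean}))
```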

Now let's fit the model. The optimization uses scikit-learn's built-in training procedures.
Unlike in a standard `sklearn` use case, we don't need to fetch the data client-side ourselves. Instead,
we simply name the fields in the MongoDB collection which we'd like to use.

In [7]:
model.fit(X='img', y='class', db=db, select=collection.find())

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9494/9494 [00:00<00:00, 250764.95it/s]


[LibSVM]*
optimization finished, #iter = 243
obj = -23.942314, rho = -0.479301
nSV = 105, nBSV = 0
*
optimization finished, #iter = 667
obj = -83.842827, rho = 0.220104
nSV = 239, nBSV = 0
*
optimization finished, #iter = 605
obj = -70.804738, rho = 0.088037
nSV = 228, nBSV = 0
*
optimization finished, #iter = 415
obj = -48.058209, rho = -0.156508
nSV = 164, nBSV = 0
*
optimization finished, #iter = 718
obj = -87.742649, rho = 0.070272
nSV = 262, nBSV = 0
*
optimization finished, #iter = 587
obj = -88.853650, rho = -0.067306
nSV = 215, nBSV = 0
*
optimization finished, #iter = 433
obj = -51.209997, rho = -0.121251
nSV = 178, nBSV = 0
*
optimization finished, #iter = 591
obj = -73.952459, rho = -0.047918
nSV = 238, nBSV = 0
*
optimization finished, #iter = 542
obj = -73.099406, rho = -0.256044
nSV = 200, nBSV = 0
*
optimization finished, #iter = 652
obj = -130.136954, rho = 0.720547
nSV = 180, nBSV = 0
*
optimization finished, #iter = 543
obj = -101.068817, rho = 0.773049
nSV = 170, nBS
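
For comparison, the following sketch shows roughly what the client-side equivalent looks like in plain scikit-learn: collect the named fields from a list of records, stack them into arrays, and call `fit`. The `records` list is synthetic stand-in data (two well-separated clusters), not MNIST, and the field-gathering logic is an illustration rather than the library's internals.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)

# Synthetic stand-in for documents returned by a query
records = []
for label, center in [(0, -5.0), (1, 5.0)]:
    for _ in range(20):
        records.append({'img': rng.normal(center, 1.0, size=4), 'class': label})

# Gather the two named fields into the arrays scikit-learn expects
X = np.stack([r['img'] for r in records])
y = np.array([r['class'] for r in records])

clf = svm.SVC(gamma='scale', class_weight='balanced', C=100)
clf.fit(X, y)

train_accuracy = float((clf.predict(X) == y).mean())
print(X.shape, train_accuracy)
```

Naming the fields in `model.fit(X='img', y='class', ...)` spares the caller this gathering step.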

Installed models and functionality can be viewed using `db.show`:

In [8]:
db.show('model')

['svc']

The model may be reloaded in another session from the database. 
As with `.fit`, the model may be applied to data in the database with `.predict`:

In [9]:
m = db.load('model', 'svc')
m.predict(X='img', db=db, select=collection.find(), max_chunk_size=3000)



Computing chunk 0/3
Computing chunk 1/3
Computing chunk 2/3
Computing chunk 3/3
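
The chunk counter in this output is consistent with splitting the 10,000 inserted documents into pieces of at most `max_chunk_size` items. A minimal sketch of that kind of chunking, using a hypothetical `iter_chunks` helper that is not part of the library:

```python
def iter_chunks(items, max_chunk_size):
    """Yield consecutive slices of at most `max_chunk_size` items."""
    for start in range(0, len(items), max_chunk_size):
        yield items[start:start + max_chunk_size]

ids = list(range(10000))
chunks = list(iter_chunks(ids, max_chunk_size=3000))
last = len(chunks) - 1

for i, chunk in enumerate(chunks):
    print(f'Computing chunk {i}/{last} ({len(chunk)} items)')
```

With 10,000 items and a chunk size of 3,000 this yields four chunks, indexed 0 to 3, the last holding the remaining 1,000 items. Smaller chunks bound how much data is held in memory at once.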


We can verify that the predictions make sense by fetching a random data point from the validation fold:

In [10]:
r = next(db.execute(collection.aggregate([{'$match': {'_fold': 'valid'}}, {'$sample': {'size': 1}}])))
print(r['class'])
print(r['_outputs'])

0
{'img': {'svc': 0}}
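
A single sample is only a spot check. To summarize agreement over many documents, one could compare the `class` field with the stored prediction for each record. The sketch below assumes records shaped like the output above, with the prediction under `_outputs.img.svc`; the `records` list is made-up illustrative data, not real results, and `accuracy` is a hypothetical helper.

```python
def accuracy(records, key='img', model='svc'):
    """Fraction of records whose stored prediction matches the label."""
    if not records:
        raise ValueError('no records to score')
    hits = sum(r['_outputs'][key][model] == r['class'] for r in records)
    return hits / len(records)

# Made-up records for illustration only
records = [
    {'class': 0, '_outputs': {'img': {'svc': 0}}},
    {'class': 7, '_outputs': {'img': {'svc': 7}}},
    {'class': 4, '_outputs': {'img': {'svc': 9}}},
    {'class': 1, '_outputs': {'img': {'svc': 1}}},
]

print(accuracy(records))  # 3 of 4 match -> 0.75
```

In practice the records would come from a query restricted to the validation fold, such as the `$match` on `_fold` used above.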
