# Transfer learning using Sentence Transformers and Scikit-Learn

In this example, we demonstrate how to implement transfer learning using SuperDuperDB.
You'll find related examples on vector-search, and simple training examples using scikit-learn, in 
the notebooks directory of the project. Transfer learning uses similar components and can be combined with vector-search. Vectors are, after all, also featurizations of 
data, and may be used in downstream learning tasks.
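
As a minimal sketch of the idea, the snippet below pairs a frozen featurizer with a small classifier trained on its outputs. It does not use SuperDuperDB: `toy_embed`, `fit_centroids` and `predict_centroid` are hypothetical names, and the character-trigram hashing only stands in for a pretrained sentence encoder.

```python
import math
from collections import defaultdict


def toy_embed(text, dim=8):
    """Hypothetical frozen featurizer: hash character trigrams into `dim` buckets."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        # Sum of code points gives a hash that is stable across runs
        bucket = sum(ord(c) for c in text[i:i + 3]) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fit_centroids(vectors, labels):
    """Train the downstream model: one mean vector per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        total = sums.setdefault(label, [0.0] * len(vec))
        for j, v in enumerate(vec):
            total[j] += v
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict_centroid(centroids, vec):
    """Return the label whose centroid is closest to `vec`."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vec))
    return min(centroids, key=sq_dist)


texts = ["a great and moving film", "dull, boring and far too long"]
labels = [1, 0]
# The featurizer is never updated; only the centroids are learned
centroids = fit_centroids([toy_embed(t) for t in texts], labels)
prediction = predict_centroid(centroids, toy_embed("a great film"))
```

The rest of this notebook follows the same pattern, with a `sentence_transformers` model as the featurizer and a scikit-learn `SVC` as the downstream model.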

Let's first connect to MongoDB via SuperDuperDB. You can read explanations of how to do this in 
the docs and in the `notebooks/` directory.

In [1]:
from pinnacledb import pinnacle
from pinnacledb.db.mongodb.query import Collection
import pymongo

db = pinnacle(
    pymongo.MongoClient().documents
)

collection = Collection('transfer')

INFO:faiss.loader:Loading faiss.
INFO:faiss.loader:Successfully loaded faiss.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.


We'll use textual data labelled with sentiment to test the functionality. Transfer learning 
can be used on any data that can be processed with SuperDuperDB models.

In [3]:
import numpy
from datasets import load_dataset

from pinnacledb.container.document import Document as D

data = load_dataset("imdb")

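# Shuffle the indices and keep the first 5000, tagging each document with its fold
train_data = [
    D({'_fold': 'train', **data['train'][int(i)]}) 
    for i in numpy.random.permutation(len(data['train']))[:5000]
]

# Same for 500 held-out documents (built here, but not inserted below)
valid_data = [
    D({'_fold': 'valid', **data['test'][int(i)]}) 
    for i in numpy.random.permutation(len(data['test']))[:500]
]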

db.execute(collection.insert_many(train_data))

r = db.execute(collection.find_one())
r




Document({'_id': ObjectId('64c96de74107fec93297b820'), '_fold': 'train', 'text': 'Most folks might say that if one were to spend a Saturday night watching a movie,you must be really bored. Actually,I had just gotten back home from being out and turned on the TV and there it was,"Paulie". <br /><br />I had missed the opening credits,so I didn\'t know the name of it but I saw that it had Cheech Marin in it,so I naturally thought I had tuned into "Born In East L.A." When I saw him talking to a talking parrot,I was ready to dismiss this as the kind of flop movie they show late in the night.<br /><br />Happy to say,it was better than that. As you know,if you don\'t already Paulie is lost and trying to get back to his original owner. Seems it\'s taken years to find her. What should be Paulie\'s advantage is actually a dis-advantage in ways. People come across a literate parrot and all they see is a way to make money or benefit themselves. <br /><br />While Cheech Marin\'s character "is" maki
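
The sampling and fold-tagging step above can be sketched with plain dictionaries and the standard library. This is an illustration only: `sample_with_fold` is a hypothetical helper, and `random.Random.shuffle` stands in for `numpy.random.permutation`.

```python
import random


def sample_with_fold(records, fold, n, seed=None):
    """Pick `n` records in random order and tag each copy with a `_fold` field."""
    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)
    # Slice the shuffled indices first, so only `n` tagged copies are built
    return [{'_fold': fold, **records[i]} for i in order[:n]]


records = [{'text': f'review {i}', 'label': i % 2} for i in range(10)]
train = sample_with_fold(records, 'train', 4, seed=0)
```

Each returned dictionary is a new object, so the source records are left without a `_fold` key.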

Let's create a SuperDuperDB model based on a `sentence_transformers` model.
You'll notice that we don't necessarily need a native SuperDuperDB integration with a model library 
in order to leverage its power with SuperDuperDB. In this case, we just need 
to configure the `Model` wrapper to interoperate correctly with the `SentenceTransformer` class. After doing this, we can link the model to a collection, and daemonize the model using the `listen=True` keyword:

In [4]:
from pinnacledb.container.model import Model
import sentence_transformers

from pinnacledb.ext.numpy.array import array

m = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=array('float32', shape=(384,)),
    predict_method='encode',
    batch_predict=True,
)

m.predict(
    X='text',
    db=db,
    select=collection.find(),
    listen=True
)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /var/folders/y9/b74b9yj906s_wtj0rrh2lf7c0000gn/T/tmpafi9hkj6
INFO:torch.distributed.nn.jit.instantiator:Writing /var/folders/y9/b74b9yj906s_wtj0rrh2lf7c0000gn/T/tmpafi9hkj6/_remote_module_non_scriptable.py
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu


Batches:   0%|          | 0/313 [00:00<?, ?it/s]
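
To make the role of `predict_method` and `batch_predict` more concrete, here is a rough sketch of how a generic wrapper could dispatch to an arbitrary object. It is not the `pinnacledb` `Model` implementation: `ToyEncoder` and `ToyModelWrapper` are hypothetical classes, and the behaviour of the two keywords is an assumption based on their names.

```python
class ToyEncoder:
    """Stand-in for a third-party model exposing a batch `encode` method."""

    def encode(self, texts):
        # Two toy features per text: length and number of spaces
        return [[float(len(t)), float(t.count(' '))] for t in texts]


class ToyModelWrapper:
    """Hypothetical wrapper that calls a named method on the wrapped object."""

    def __init__(self, identifier, wrapped, predict_method='predict', batch_predict=False):
        self.identifier = identifier
        self.wrapped = wrapped
        self.predict_method = predict_method
        self.batch_predict = batch_predict

    def predict(self, inputs):
        # Look up the method by name, e.g. `encode` instead of `predict`
        method = getattr(self.wrapped, self.predict_method)
        if self.batch_predict:
            # Assumed meaning: the method accepts the whole batch at once
            return method(inputs)
        # Otherwise call it once per input
        return [method(x) for x in inputs]


wrapper = ToyModelWrapper('toy-encoder', ToyEncoder(), predict_method='encode', batch_predict=True)
features = wrapper.predict(['a b', 'abc d e'])
```

Looking the method up by name is what lets a wrapper of this kind work with libraries that have no dedicated integration.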

Now that we've created and added the model which computes features for the `"text"` field, we can train a 
downstream model using Scikit-Learn:

In [5]:
from sklearn.svm import SVC

model = pinnacle(
    SVC(gamma='scale', class_weight='balanced', C=100, verbose=True),
    # Cast each prediction to a plain Python int
    postprocess=lambda x: int(x)
)

model.fit(
    X='text',
    y='label',
    db=db,
    select=collection.find().featurize({'text': 'all-MiniLM-L6-v2'}),
)



[LibSVM]..................*.........*
optimization finished, #iter = 27996
obj = -7423.055031, rho = 0.220510
nSV = 6557, nBSV = 0
Total nSV = 6557
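
The `featurize` step in the query above can be pictured as swapping a field's raw value for a model output stored on the same document. The sketch below assumes outputs live under `_outputs.<field>.<model identifier>`, a layout inferred from the `r['_outputs']['text']['svc']` lookup used in this notebook. `featurize` and `build_xy` here are hypothetical helpers, not the library's code.

```python
def featurize(record, features):
    """Return a copy of `record` with each listed field replaced by a stored model output."""
    out = dict(record)
    for key, model_id in features.items():
        out[key] = record['_outputs'][key][model_id]
    return out


def build_xy(records, X, y, features):
    """Collect the featurized inputs and the labels that an estimator's `fit` expects."""
    rows = [featurize(r, features) for r in records]
    return [r[X] for r in rows], [r[y] for r in rows]


docs = [
    {'text': 'good', 'label': 1, '_outputs': {'text': {'enc': [0.9, 0.1]}}},
    {'text': 'bad', 'label': 0, '_outputs': {'text': {'enc': [0.2, 0.8]}}},
]
inputs, targets = build_xy(docs, X='text', y='label', features={'text': 'enc'})
```

Under this reading, the downstream classifier is trained on stored vectors and never sees the raw strings.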


Now that the model has been trained, we can apply it to the database, again daemonizing the model 
with `listen=True`:

In [6]:
model.predict(
    X='text',
    db=db,
    select=collection.find().featurize({'text': 'all-MiniLM-L6-v2'}),
    listen=True,
)
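
The text describes `listen=True` as daemonizing the model. One plausible reading, assumed here rather than taken from the library's documentation, is that the model is applied to existing documents and then again to every document inserted later. `ToyCollection` is a hypothetical in-memory stand-in that illustrates that behaviour.

```python
class ToyCollection:
    """Hypothetical in-memory stand-in for a collection with listening models."""

    def __init__(self):
        self.records = []
        self.listeners = []  # (key, model_id, predict_fn) triples

    def _apply(self, record, key, model_id, predict_fn):
        # Store the output next to the data it was computed from
        outputs = record.setdefault('_outputs', {}).setdefault(key, {})
        outputs[model_id] = predict_fn(record[key])

    def predict(self, key, model_id, predict_fn, listen=False):
        # Compute outputs for everything already stored
        for record in self.records:
            self._apply(record, key, model_id, predict_fn)
        # With listen=True, remember the model for future inserts
        if listen:
            self.listeners.append((key, model_id, predict_fn))

    def insert_many(self, records):
        for record in records:
            record = dict(record)  # avoid mutating the caller's data
            for key, model_id, predict_fn in self.listeners:
                self._apply(record, key, model_id, predict_fn)
            self.records.append(record)


coll = ToyCollection()
coll.insert_many([{'text': 'good'}])
coll.predict('text', 'length', len, listen=True)
coll.insert_many([{'text': 'terrible'}])
```

Here `len` plays the role of the model; both the earlier and the later document end up with an entry under `_outputs`.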



To verify that this process has worked, we can sample a record and check that its prediction looks sensible:

In [13]:
r = next(db.execute(collection.aggregate([{'$sample': {'size': 1}}])))
print(r['text'][:100])
print(r['_outputs']['text']['svc'])

I, as a teenager really enjoyed this movie! Mary Kate and Ashley worked great together and everyone 
1
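
Inspecting single records only goes so far. As a rough sketch, predictions could also be compared with the stored labels across many records. The helper below is hypothetical and works on plain dictionaries shaped like the record above, with predictions assumed to sit under `_outputs.text.svc` and the true class under `label`.

```python
def accuracy(records, key='text', model_id='svc', label='label'):
    """Fraction of records whose stored prediction equals the stored label."""
    # Skip records that have no prediction for this model yet
    scored = [r for r in records if model_id in r.get('_outputs', {}).get(key, {})]
    if not scored:
        return None
    correct = sum(r['_outputs'][key][model_id] == r[label] for r in scored)
    return correct / len(scored)


sample = [
    {'label': 1, '_outputs': {'text': {'svc': 1}}},
    {'label': 0, '_outputs': {'text': {'svc': 1}}},
    {'label': 0, '_outputs': {'text': {'svc': 0}}},
    {'label': 1},  # not yet processed by the model
]
score = accuracy(sample)
```

Restricting the input to records whose `_fold` is `'valid'` would give a held-out estimate, provided those records have been inserted and processed.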
