# Transfer Learning with Sentence Transformers and Scikit-Learn

## Introduction

In this notebook, we explore transfer learning with SuperDuperDB. We connect to a MongoDB datastore, load a dataset, create a SuperDuperDB model based on Sentence Transformers, train a downstream model with Scikit-Learn, and apply the trained model to the database. Transfer learning reuses the representations of a pretrained model, here sentence embeddings, for applications such as vector search and downstream learning tasks.

## Prerequisites

Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:

In [None]:
!pip install pinnacledb
!pip install ipython numpy datasets sentence-transformers
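
After installing, it can help to confirm that the packages are importable before running the remaining cells. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the list of import names is an assumption based on the imports used later in this notebook.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed from the imports that appear later in this notebook.
    required = ["pinnacledb", "numpy", "datasets", "sentence_transformers", "sklearn"]
    print("Missing packages:", missing_packages(required))
```

An empty list means every package was found; any name that is printed still needs to be installed.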

## Connect to datastore 

First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. Set the `MONGODB_URI` environment variable to match your setup; if it is unset, the code below falls back to `mongomock://test`.
Here are some examples of MongoDB URIs:

* For testing (default connection): `mongomock://test`
* Local MongoDB instance: `mongodb://localhost:27017`
* MongoDB with authentication: `mongodb://pinnacle:pinnacle@mongodb:27017/documents`
* MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`
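
To see what each of these URIs encodes, the sketch below splits a connection string into its parts with the standard library. The helper `describe_uri` is hypothetical and not part of pinnacledb; it is only meant to illustrate the scheme, host, port, and database components of the examples above.

```python
from urllib.parse import urlsplit


def describe_uri(uri: str) -> dict:
    """Split a connection URI into scheme, host, port, and database name.

    Hypothetical helper for illustration only; not part of pinnacledb.
    """
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        # The path carries the database name, if one is given.
        "database": parts.path.lstrip("/") or None,
        # Report only whether credentials are present, never their values.
        "has_credentials": parts.username is not None,
    }


if __name__ == "__main__":
    for uri in (
        "mongomock://test",
        "mongodb://localhost:27017",
        "mongodb://pinnacle:pinnacle@mongodb:27017/documents",
    ):
        print(describe_uri(uri))
```

Printing only a `has_credentials` flag, rather than the username and password, keeps secrets out of notebook output.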

In [1]:
from pinnacledb import pinnacle
from pinnacledb.backends.mongodb import Collection
import os

mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")

# Wrap the MongoDB database in a SuperDuperDB datalayer;
# artifacts are kept on the local filesystem under ./data/
db = pinnacle(mongodb_uri, artifact_store='filesystem://./data/')

# Reference a collection called transfer
collection = Collection('transfer')

2023-Dec-13 11:12:52.51| DEBUG    | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.base.build:50   | Parsing data connection URI:mongomock://test
2023-Dec-13 11:12:52.52| INFO     | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.base.build:137  | Data Client is ready. mongomock.MongoClient('localhost', 27017)
2023-Dec-13 11:12:52.53| INFO     | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.base.datalayer:79   | Building Data Layer


## Load Dataset

Transfer learning can be applied to any data that can be processed with SuperDuperDB models.
For our example, we will use a labeled text dataset for sentiment analysis: a random subset of the IMDb dataset.
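
The subsampling idea can be sketched on toy data without downloading anything. This is a minimal illustration, assuming records are plain dicts with `text` and `label` fields like the IMDb rows; the helper `sample_fold` is hypothetical, and `random.sample` from the standard library stands in for slicing a `numpy.random.permutation`.

```python
import random


def sample_fold(records, n, fold, seed=0):
    """Pick n records without replacement and tag each with a '_fold' value."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    # Sample the indices first so that only the n chosen records are copied.
    indices = rng.sample(range(len(records)), n)
    return [{"_fold": fold, **records[i]} for i in indices]


if __name__ == "__main__":
    # Toy stand-in for the IMDb rows, which have 'text' and 'label' fields.
    toy = [{"text": f"review {i}", "label": i % 2} for i in range(50)]
    train = sample_fold(toy, n=10, fold="train", seed=0)
    valid = sample_fold(toy, n=1, fold="valid", seed=1)
    print(len(train), len(valid), train[0]["_fold"])
```

Fixing the seed is a design choice for repeatable experiments; the notebook cell itself draws a fresh random permutation on every run.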

In [2]:
import numpy
from datasets import load_dataset
from pinnacledb import Document as D

# Load IMDb dataset
data = load_dataset("imdb")

# Set the number of data points for training (adjust as needed)
N_DATAPOINTS = 100

# Prepare training data: slice the permutation first so that only
# N_DATAPOINTS documents are built instead of the whole split
train_data = [
    D({'_fold': 'train', **data['train'][int(i)]})
    for i in numpy.random.permutation(len(data['train']))[:N_DATAPOINTS]
]

# Prepare validation data (one tenth of the training size)
valid_data = [
    D({'_fold': 'valid', **data['test'][int(i)]})
    for i in numpy.random.permutation(len(data['test']))[:N_DATAPOINTS // 10]
]

# Insert the training documents into the 'transfer' collection
db.execute(collection.insert_many(train_data))

2023-Dec-13 11:12:55.72| DEBUG    | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.base.datalayer:716  | Building task workflow graph. Query:<pinnacledb.backends.mongodb.query.MongoCompoundSelect[
    transfer.find({}, {})}
] object at 0x15630f490>
2023-Dec-13 11:12:55.72| INFO     | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.backends.local.compute:32   | Submitting job. function:<function callable_job at 0x110919120>
2023-Dec-13 11:12:55.72| DEBUG    | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.misc.download:337  | {'cls': 'MongoCompoundSelect', 'dict': {'table_or_collection': {'cls': 'Collection', 'dict': {'identifier': 'transfer'}, 'module': 'pinnacledb.backends.mongodb.query'},

([ObjectId('657983a788bef09ab851d97a'),
  ObjectId('657983a788bef09ab851d97b'),
  ObjectId('657983a788bef09ab851d97c'),
  ObjectId('657983a788bef09ab851d97d'),
  ObjectId('657983a788bef09ab851d97e'),
  ObjectId('657983a788bef09ab851d97f'),
  ObjectId('657983a788bef09ab851d980'),
  ObjectId('657983a788bef09ab851d981'),
  ObjectId('657983a788bef09ab851d982'),
  ObjectId('657983a788bef09ab851d983'),
  ObjectId('657983a788bef09ab851d984'),
  ObjectId('657983a788bef09ab851d985'),
  ObjectId('657983a788bef09ab851d986'),
  ObjectId('657983a788bef09ab851d987'),
  ObjectId('657983a788bef09ab851d988'),
  ObjectId('657983a788bef09ab851d989'),
  ObjectId('657983a788bef09ab851d98a'),
  ObjectId('657983a788bef09ab851d98b'),
  ObjectId('657983a788bef09ab851d98c'),
  ObjectId('657983a788bef09ab851d98d'),
  ObjectId('657983a788bef09ab851d98e'),
  ObjectId('657983a788bef09ab851d98f'),
  ObjectId('657983a788bef09ab851d990'),
  ObjectId('657983a788bef09ab851d991'),
  ObjectId('657983a788bef09ab851d992'),


## Run Model

We'll create a SuperDuperDB model based on the `sentence_transformers` library. This demonstrates that you don't need a native SuperDuperDB integration with a model library to use it. We configure the `Model` wrapper to call the `encode` method of the `SentenceTransformer` class on the `text` field, and `predict_select` links the model to the documents in the collection. A second model, an `Estimator` wrapping a Scikit-Learn `SVC`, is trained on the resulting embeddings with the `label` field as the target. Both models are bundled into a `Stack` so they can be added to the database together.

In [3]:
import sentence_transformers
from sklearn.svm import SVC

from pinnacledb import Model
from pinnacledb.ext.numpy import array
from pinnacledb.ext.sklearn import Estimator
from pinnacledb.components.stack import Stack


# Create a SuperDuperDB Model using Sentence Transformers
m1 = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=array('float32', shape=(384,)),
    predict_method='encode',
    batch_predict=True,
    predict_X='text',
    predict_select=collection.find(),
    predict_kwargs={'show_progress_bar': True},
)


# Create a SuperDuperDB model with an SVC classifier
m2 = Estimator(
    'svc',
    object=SVC(gamma='scale', class_weight='balanced', C=100, verbose=True),
    postprocess=lambda x: int(x),
    train_X='_outputs.text.all-MiniLM-L6-v2.0',
    train_y='label',
    train_select=collection.find(),
    predict_X='_outputs.text.all-MiniLM-L6-v2.0',
    predict_select=collection.find({'_fold': 'valid'})
)

stack = Stack('my-stack', components=[m1, m2])
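
Conceptually, the stack chains a frozen encoder with a trainable downstream classifier. The sketch below shows that pattern with standard-library stand-ins so it runs without downloading a model: a hashed bag-of-words function plays the role of the sentence encoder, and a nearest-centroid classifier plays the role of the `SVC`. All names here (`embed`, `fit_centroids`, `predict`, `DIM`) are hypothetical and are not part of pinnacledb, Sentence Transformers, or Scikit-Learn.

```python
import math
import zlib

DIM = 16  # toy embedding size; the encoder in this notebook outputs 384 dimensions


def embed(text: str) -> list:
    """Frozen stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fit_centroids(embeddings, labels):
    """Downstream model: compute one mean vector per class label."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label of the centroid closest to vec (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], vec)),
    )


if __name__ == "__main__":
    texts = ["a wonderful, moving film", "dull and badly acted", "wonderful acting", "dull plot"]
    labels = [1, 0, 1, 0]
    # Step 1: embed once with the frozen encoder. Step 2: train only the classifier.
    centroids = fit_centroids([embed(t) for t in texts], labels)
    print(predict(centroids, embed("a wonderful film")))
```

Only the second step learns anything from the labels; the encoder is never updated, which is what makes this transfer learning rather than fine-tuning.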

In [4]:
db.add(stack)

2023-Dec-13 11:12:58.50| INFO     | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.components.model:231  | Adding model all-MiniLM-L6-v2 to db
2023-Dec-13 11:12:58.50| DEBUG    | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.base.datalayer:873  | model/all-MiniLM-L6-v2/0 already exists - doing nothing

100it [00:00, 49531.22it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

2023-Dec-13 11:13:13.25| DEBUG    | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.base.datalayer:873  | model/svc/0 already exists - doing nothing

100%|██████████| 92/92 [00:00<00:00, 58263.02it/s]

[LibSVM].*
optimization finished, #iter = 178
obj = -51.025101, rho = 0.245493
nSV = 89, nBSV = 0
Total nSV = 89
2023-Dec-13 11:13:13.26| INFO     | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.components.model:231  | Adding model svc to db
2023-Dec-13 11:13:13.26| DEBUG    | Duncans-MacBook-Pro.local| 81014ec8-a24e-4ebd-96b6-651004cd7edb| pinnacledb.base.datalayer:873  | model/svc/0 already exists - doing nothing

8it [00:00, 13628.93it/s]

([None, SVC(C=100, class_weight='balanced', verbose=True), None],
 Stack(identifier='my-stack', components=[Model(identifier='all-MiniLM-L6-v2', encoder=Encoder(identifier='numpy.float32[384]', decoder=<Artifact artifact=<pinnacledb.ext.numpy.encoder.DecodeArray object at 0x156324190> serializer=dill>, encoder=<Artifact artifact=<pinnacledb.ext.numpy.encoder.EncodeArray object at 0x156324e50> serializer=dill>, shape=(384,), load_hybrid=True), output_schema=None, flatten=False, preprocess=None, postprocess=None, collate_fn=None, batch_predict=True, takes_context=False, metrics=(), model_update_kwargs={}, validation_sets=None, predict_X='text', predict_select=<pinnacledb.backends.mongodb.query.MongoCompoundSelect[
     transfer.find({'_id': "{'$in': '[657983a788bef09ab851d97a, 657983a788bef09ab851d97b, 657983a788bef09ab851d97c, 657983a788bef09ab851d97d, 657983a788bef09ab851d97e, 657983a788bef09ab851d97f, 657983a788bef09ab851d980, 657983a788bef09ab851d981, 657983a788bef09ab851d98

## Verification

To verify that the process has worked, we can sample a record from the validation fold and sanity-check its prediction.

In [10]:
# Sample one random document from the validation fold of the 'transfer' collection
r = next(db.execute(collection.aggregate([
    {'$match': {'_fold': 'valid'}},
    {'$sample': {'size': 1}},
])))

# Print a portion of the 'text' field from the random document
print(r['text'][:100])

# Print the prediction made by the SVC model stored in '_outputs'
print(r['_outputs']['text']['svc']['0'])

this was a favorite Christmas Special that I wish that they would release on vhs or dvd , since my 3
1
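
Beyond spot-checking a single record, the same fields can be used to compute accuracy over several validation records. The sketch below assumes records are plain dicts that carry a `label` field and a prediction stored under the nested `_outputs` path read in the cell above; the helper `accuracy` and the toy records are hypothetical.

```python
def accuracy(records, model="svc", key="text", version="0"):
    """Fraction of records whose stored prediction equals their 'label'."""
    if not records:
        raise ValueError("no records to evaluate")
    correct = sum(
        1 for r in records
        if r["_outputs"][key][model][version] == r["label"]
    )
    return correct / len(records)


if __name__ == "__main__":
    # Hand-written toy records that mimic the nested '_outputs' layout read above.
    toy = [
        {"label": 1, "_outputs": {"text": {"svc": {"0": 1}}}},
        {"label": 0, "_outputs": {"text": {"svc": {"0": 1}}}},
        {"label": 0, "_outputs": {"text": {"svc": {"0": 0}}}},
        {"label": 1, "_outputs": {"text": {"svc": {"0": 1}}}},
    ]
    print(accuracy(toy))  # 3 of 4 predictions match -> 0.75
```

With only a handful of validation documents, as in this notebook, the resulting number is a rough indication rather than a reliable estimate.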
