# Transfer Learning with Sentence Transformers and Scikit-Learn

## Introduction

In this notebook, we explore transfer learning with SuperDuperDB. We connect to a MongoDB datastore, load a dataset, create a SuperDuperDB model based on Sentence Transformers, train a downstream model with Scikit-Learn, and apply the trained model to the database. Transfer learning reuses the features computed by a pretrained model, which makes it useful for applications such as vector search and downstream learning tasks.

## Prerequisites

Before diving into the implementation, ensure that you have the necessary libraries installed by running the following commands:

In [None]:
# !pip install pinnacledb
!pip install ipython numpy datasets sentence-transformers

## Connect to datastore 

First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. The code below reads the connection string from the `MONGODB_URI` environment variable, which you can set to match your setup.
Here are some examples of MongoDB URIs:

* For testing (default connection): `mongomock://test`
* Local MongoDB instance: `mongodb://localhost:27017`
* MongoDB with authentication: `mongodb://pinnacle:pinnacle@mongodb:27017/documents`
* MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`
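
As a minimal sketch of how these URIs differ, the helper below splits a connection string into its parts using only the standard library. The name `describe_uri` is hypothetical and not part of the library; the environment variable and its fallback value mirror the connection cell that follows.

```python
import os
from urllib.parse import urlsplit


def describe_uri(uri: str) -> dict:
    """Split a connection URI into scheme, host, port, database and credentials flag."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
        "has_credentials": parts.username is not None,
    }


# Same lookup as the notebook: use MONGODB_URI if set, else the in-memory test URI
uri = os.getenv("MONGODB_URI", "mongomock://test")
info = describe_uri(uri)
print(info["scheme"], info["host"])
```

Printing the scheme before connecting is a quick way to confirm whether a run targets the in-memory test backend or a real MongoDB instance.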

In [1]:
from pinnacledb import pinnacle
from pinnacledb.backends.mongodb import Collection
import os

mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")

# SuperDuperDB now handles your MongoDB database
# It just super dupers your database
db = pinnacle(mongodb_uri, artifact_store='filesystem://./data/')

# Reference a collection called transfer
collection = Collection('transfer')



## Load Dataset

Transfer learning can be applied to any data that can be processed with SuperDuperDB models.
For our example, we will use a labeled text dataset for sentiment analysis: a random subset of the IMDb dataset.

In [2]:
import numpy
from datasets import load_dataset
from pinnacledb import Document as D

# Load IMDb dataset
data = load_dataset("imdb")

# Set the number of data points for training (adjust as needed)
N_DATAPOINTS = 500

# Prepare training data: draw N_DATAPOINTS random indices, then build only those documents
train_data = [
    D({'_fold': 'train', **data['train'][int(i)]})
    for i in numpy.random.permutation(len(data['train']))[:N_DATAPOINTS]
]

# Prepare validation data (one tenth the size of the training sample)
valid_data = [
    D({'_fold': 'valid', **data['test'][int(i)]})
    for i in numpy.random.permutation(len(data['test']))[:N_DATAPOINTS // 10]
]

# Insert training data into the 'collection' SuperDuperDB collection
db.execute(collection.insert_many(train_data))


([ObjectId('656b457a19b8a68a37a24306'),
  ObjectId('656b457a19b8a68a37a24307'),
  ObjectId('656b457a19b8a68a37a24308'),
  ObjectId('656b457a19b8a68a37a24309'),
  ObjectId('656b457a19b8a68a37a2430a'),
  ObjectId('656b457a19b8a68a37a2430b'),
  ObjectId('656b457a19b8a68a37a2430c'),
  ObjectId('656b457a19b8a68a37a2430d'),
  ObjectId('656b457a19b8a68a37a2430e'),
  ObjectId('656b457a19b8a68a37a2430f'),
  ObjectId('656b457a19b8a68a37a24310'),
  ObjectId('656b457a19b8a68a37a24311'),
  ObjectId('656b457a19b8a68a37a24312'),
  ObjectId('656b457a19b8a68a37a24313'),
  ObjectId('656b457a19b8a68a37a24314'),
  ObjectId('656b457a19b8a68a37a24315'),
  ObjectId('656b457a19b8a68a37a24316'),
  ObjectId('656b457a19b8a68a37a24317'),
  ObjectId('656b457a19b8a68a37a24318'),
  ObjectId('656b457a19b8a68a37a24319'),
  ObjectId('656b457a19b8a68a37a2431a'),
  ObjectId('656b457a19b8a68a37a2431b'),
  ObjectId('656b457a19b8a68a37a2431c'),
  ObjectId('656b457a19b8a68a37a2431d'),
  ObjectId('656b457a19b8a68a37a2431e'),
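
The subsampling step in the cell above can also be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's method: `sample_with_fold` is a hypothetical helper, `rows` is made-up stand-in data, and a fixed seed is added so the subset is reproducible. Unlike the notebook, which draws validation data from the separate test split, the sketch draws both folds from the same list.

```python
import random


def sample_with_fold(records, n, fold, seed=0):
    """Pick n distinct records at random and tag each with a '_fold' value."""
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    indices = rng.sample(range(len(records)), n)
    return [{"_fold": fold, **records[i]} for i in indices]


# Made-up stand-in for the dataset: rows with 'text' and 'label' fields
rows = [{"text": f"review {i}", "label": i % 2} for i in range(1000)]

train = sample_with_fold(rows, 50, "train", seed=0)
valid = sample_with_fold(rows, 50 // 10, "valid", seed=1)
print(len(train), len(valid))
```

Sampling indices first and building documents only for those indices avoids constructing a document for every row in the dataset.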


## Run Model

We'll create a SuperDuperDB model based on the `sentence_transformers` library. This demonstrates that you don't need a native SuperDuperDB integration with a model library to use it. We configure the `Model` wrapper to work with the `SentenceTransformer` class by naming the method to call (`predict_method='encode'`) and the encoder for its output. After configuration, we link the model to a collection and, with the `listen=True` keyword, daemonize it so that it keeps listening on that collection.

In [3]:
from pinnacledb import Model
import sentence_transformers
from pinnacledb.ext.numpy import array

# Create a SuperDuperDB Model using Sentence Transformers
m = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=array('float32', shape=(384,)),
    predict_method='encode',
    batch_predict=True,
)

# Make predictions on 'text' data from the 'collection' SuperDuperDB collection
m.predict(
    X='text',
    db=db,
    select=collection.find(),
    listen=True,
    show_progress_bar=True,
)


[None]
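
The idea behind `predict_method` and `batch_predict` can be illustrated with a toy wrapper. This is a sketch of the general pattern only, not the library's implementation: `SimpleWrapper` and `FakeEncoder` are hypothetical classes invented for the example, and the "embedding" is a made-up two-number feature vector.

```python
class SimpleWrapper:
    """Toy stand-in for a model wrapper: forwards predict() to a named method."""

    def __init__(self, identifier, obj, predict_method="predict", batch_predict=False):
        self.identifier = identifier
        self.obj = obj
        self.predict_method = predict_method
        self.batch_predict = batch_predict

    def predict(self, inputs):
        # Look up the configured method by name on the wrapped object
        method = getattr(self.obj, self.predict_method)
        if self.batch_predict:
            return list(method(inputs))      # one call for the whole batch
        return [method(x) for x in inputs]   # one call per input


class FakeEncoder:
    """Hypothetical encoder that exposes encode() instead of predict()."""

    def encode(self, texts):
        # Made-up features: text length and number of spaces
        return [[float(len(t)), float(t.count(" "))] for t in texts]


wrapper = SimpleWrapper(
    "fake-encoder", FakeEncoder(), predict_method="encode", batch_predict=True
)
vectors = wrapper.predict(["a good film", "bad"])
print(vectors)
```

Because the wrapper only needs a method name, any object with a suitable callable can be plugged in without a dedicated integration.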

In [4]:
db.execute(collection.find_one())

Document({'_fold': 'train', 'text': "Like The Jeffersons, Good Times was one of the those classic American sitcoms which was never aired in the UK, not to mention it came out in the 1970s- a decade where of which I wasn't born yet.<br /><br />But like most fans of the show, I watched a few episodes on You Tube- and afterwards, I loved it.<br /><br />The Evans family are headed by James and Florida- two parents trying to make ends meet, and who despite their lack of qualifications, encourage their children, who have their own aspirations in life to fulfil them and to take their chances. James was the strict but loving dad, who didn't dare hesitate in disciplining J.J, Michael and Thelma- should they over-step the line. Whilst Florida, in contrast was a fair, kind- hearted and considerate mother and loving wife, although she was in many ways similar to James, with regards to their attitudes to parenthood and family values from an Afro- American perspective.<br /><br />The kids were just 

## Train Downstream Model
Now that we've created and added the model that computes features for the `text` field, we can train a downstream model on those features using Scikit-Learn.

In [None]:
# Import necessary modules and classes
from sklearn.svm import SVC
from pinnacledb import pinnacle

# Create a SuperDuperDB model with an SVC classifier
model = pinnacle(
    SVC(gamma='scale', class_weight='balanced', C=100, verbose=True),
    postprocess=lambda x: int(x)
)

# Train the classifier on the stored sentence embeddings of 'text', using 'label' as the target
model.fit(
    X='_outputs.text.all-MiniLM-L6-v2.0',
    y='label',
    db=db,
    select=collection.find(),
)
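
The `X` argument above is a dot-separated path into each document. As a minimal sketch, assuming the model outputs are stored as nested fields (as the lookup in the verification cell at the end suggests), such a path can be resolved like this. The helper `get_by_path` and the example document are hypothetical.

```python
from functools import reduce


def get_by_path(document, dotted_key):
    """Follow a dot-separated key through nested dictionaries."""
    return reduce(lambda node, key: node[key], dotted_key.split("."), document)


# Made-up document shaped like a record after the embedding step
doc = {
    "text": "an example review",
    "label": 1,
    "_outputs": {"text": {"all-MiniLM-L6-v2": {"0": [0.1, 0.2, 0.3]}}},
}

features = get_by_path(doc, "_outputs.text.all-MiniLM-L6-v2.0")
print(features)
```

Read this way, the path names the source field (`text`), the model identifier (`all-MiniLM-L6-v2`), and what appears to be the model version (`0`).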


## Run Downstream Model

With the model trained, we can now apply it to the database. 

In [None]:
# Apply the trained classifier to the stored sentence embeddings of 'text'
model.predict(
    X='_outputs.text.all-MiniLM-L6-v2.0',
    db=db,
    select=collection.find(),
    listen=True,
)

## Verification

To verify that the process has worked, we can sample a record and sanity-check its prediction against the text.

In [None]:
# Query a random document from the 'collection' SuperDuperDB collection
r = next(db.execute(collection.aggregate([{'$sample': {'size': 1}}])))

# Print a portion of the 'text' field from the random document
print(r['text'][:100])

# Print the prediction made by the SVC model stored in '_outputs'
print(r['_outputs']['text']['svc']['0'])
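
Beyond eyeballing single records, the same fields can be used for a rough accuracy check. The sketch below is an illustration under stated assumptions: `accuracy` is a hypothetical helper, the records are made up, and the prediction path copies the lookup used in the cell above.

```python
def accuracy(records, output_path=("_outputs", "text", "svc", "0")):
    """Fraction of records whose stored prediction equals the 'label' field."""
    hits = 0
    for record in records:
        prediction = record
        for key in output_path:  # walk down to the stored prediction
            prediction = prediction[key]
        hits += int(prediction == record["label"])
    return hits / len(records)


# Made-up records shaped like the documents queried above
sample = [
    {"label": 1, "_outputs": {"text": {"svc": {"0": 1}}}},
    {"label": 0, "_outputs": {"text": {"svc": {"0": 0}}}},
    {"label": 1, "_outputs": {"text": {"svc": {"0": 0}}}},
    {"label": 0, "_outputs": {"text": {"svc": {"0": 0}}}},
]
print(accuracy(sample))
```

For a meaningful estimate, the records should come from data the classifier was not trained on, such as documents tagged with the `valid` fold.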