# Sentiment analysis with transformers

In this notebook we implement a classic NLP use case, sentiment analysis, using Hugging Face's `transformers` library.
We show that this use case can be implemented directly in the SuperDuperDB `Datalayer`, using MongoDB as the
data backend.

In [None]:
!pip install datasets

In [1]:
from datasets import load_dataset
import numpy
import pymongo
from transformers import AutoTokenizer, AutoModelForSequenceClassification

import pinnacledb
from pinnacledb.misc.pinnacle import pinnacle
from pinnacledb.core.document import Document as D
from pinnacledb.datalayer.mongodb.query import Collection
from pinnacledb.models.transformers.wrapper import TransformersTrainerConfiguration, Pipeline
from pinnacledb.core.dataset import Dataset



In [2]:
pymongo.MongoClient().drop_database('documents')
pymongo.MongoClient().drop_database('_filesystem:documents')

SuperDuperDB supports MongoDB as a data backend.
Accordingly, we use the Python MongoDB client `pymongo` to connect to the database, then "wrap" the database
to convert it to a SuperDuperDB `Datalayer`:

In [3]:
db = pymongo.MongoClient().documents
db = pinnacle(db)
collection = Collection('imdb')

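We use the IMDB dataset for training the model. Only four randomly chosen reviews from each of the train and test splits are inserted, and each document is tagged with a `_fold` field (`train` or `valid`):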

In [4]:
data = load_dataset("imdb")

db.execute(collection.insert_many([
    D({'_fold': 'train', **data['train'][int(i)]}) for i in numpy.random.permutation(len(data['train']))[:4]
]))

db.execute(collection.insert_many([
    D({'_fold': 'valid', **data['test'][int(i)]}) for i in numpy.random.permutation(len(data['test']))[:4]
]))



  0%|          | 0/3 [00:00<?, ?it/s]

INFO:root:found 0 uris
INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x199e1f4f0>,
 TaskWorkflow(database=<pinnacledb.datalayer.base.datalayer.Datalayer object at 0x1924e61d0>, G=<networkx.classes.digraph.DiGraph object at 0x199e4ae10>))
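
The insertion pattern above draws a random subset without replacement and tags each record with a `_fold` field. Here is a minimal, database-free sketch of the same idea. The toy dataset `toy_reviews` and the helper `sample_with_fold` are hypothetical names introduced for illustration and are not part of any library:

```python
import numpy

# Hypothetical stand-in for data['train']: records with 'text' and 'label' keys.
toy_reviews = [{'text': f'review {i}', 'label': i % 2} for i in range(10)]


def sample_with_fold(records, fold, n, seed=0):
    """Pick n distinct records at random and tag each with a '_fold' field."""
    rng = numpy.random.default_rng(seed)
    # permutation() shuffles the indices 0..len-1, so the first n are distinct.
    indices = rng.permutation(len(records))[:n]
    # int(i) converts the NumPy integer to a plain Python int before indexing.
    return [{'_fold': fold, **records[int(i)]} for i in indices]


train_docs = sample_with_fold(toy_reviews, 'train', n=4)
valid_docs = sample_with_fold(toy_reviews, 'valid', n=4, seed=1)

# Selecting one fold later is a simple filter on the tag.
only_valid = [d for d in train_docs + valid_docs if d['_fold'] == 'valid']
print(len(train_docs), len(only_valid))
```

In the notebook, the equivalent filter is expressed as a query, `collection.find({'_fold': 'valid'})`, which is used later to build the validation set.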

Check a sample from the database:

In [5]:
r = db.execute(collection.find_one())
r

Document({'_id': ObjectId('64be898b579e06012525858b'), '_fold': 'train', 'text': "Comes this heartwarming tale of hope. Hope that you'll never have to endure anything this awful again. *cough* Razzie award *cough*<br /><br />I disliked this movie because it was unfunny, predictable and inane. While watching I felt like I was in a psychology experiment to determine how low movie standards could get before people complained. When I requested my money back at the end of the movie I was informed that because I watched the whole thing 'I wasn't entitled to reimbursement'. I was told by the assistant manager that several people had complained and gotten refunds already though.<br /><br />The movie summary is pretty basic. The midget thief steals a diamond and the poses as a baby to elude police. Underneath this clever outline however, lies a repertoire of original, fresh and hilarious skits. Or not.<br /><br />Ask yourself the following: Do you like to see people getting hit by pans? Do you 

Create a tokenizer and a sequence-classification model, then wrap both in a `Pipeline`, with the tokenizer serving as the preprocessing step:

In [6]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
model = Pipeline(
    identifier='my-sentiment-analysis',
    task='text-classification',
    preprocess=tokenizer,
    object=model,
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.we

We'll evaluate the model using a simple accuracy metric, which gets logged in the
model's metadata during training. First, we configure the training run:

In [7]:
training_args = TransformersTrainerConfiguration(
    identifier='sentiment-analysis',
    output_dir='sentiment-analysis',
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    use_mps_device=False,
    evaluation_strategy='epoch',
    do_eval=True,
)

Now we're ready to train the model:

In [8]:
from pinnacledb.core.metric import Metric

model.fit(
    X='text',
    y='label',
    db=db,
    select=collection.find(),
    configuration=training_args,
    validation_sets=[
        Dataset(
            identifier='my-eval',
            select=collection.find({'_fold': 'valid'}),
        )
    ],
    data_prefetch=False,
    metrics=[Metric(
        identifier='acc',
        object=lambda x, y: sum([xx == yy for xx, yy in zip(x, y)]) / len(x)
    )]
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,My-eval/acc
1,No log,0.704457,0.5
2,No log,0.706399,0.5


Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
INFO:root:Saving model...
INFO:root:Saving model...
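
The `acc` metric passed to `fit` above is plain accuracy: the fraction of predictions that equal their targets. Below is a standalone sketch of the same computation. The function name `accuracy` and the input checks are additions for illustration and are not part of the notebook's `Metric` object:

```python
def accuracy(predictions, targets):
    """Fraction of positions where the prediction equals the target."""
    if len(predictions) != len(targets):
        raise ValueError('predictions and targets must have the same length')
    if not predictions:
        raise ValueError('cannot compute accuracy of an empty sequence')
    return sum(p == t for p, t in zip(predictions, targets)) / len(predictions)


# Two of the four predictions match their targets.
print(accuracy([1, 1, 0, 0], [1, 0, 0, 1]))  # 0.5
```

Assuming the validation set holds the four `valid` documents inserted earlier, the only possible accuracy values are 0, 0.25, 0.5, 0.75 and 1, so the reported 0.5 corresponds to two correct predictions out of four.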


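Finally, we can query the model for a prediction on a new piece of text. With only four training examples and a validation accuracy of 0.5, treat the output as a check that the pipeline runs end to end rather than as a reliable prediction: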

In [11]:
model.predict("This movie sucks!", one=True)

1
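
The returned integer is a class index. In the IMDB dataset, label 0 marks a negative review and label 1 a positive one, so the `1` above reads as "positive", which is the wrong label for this sentence. Here is a small sketch for turning the index into a readable name. The names `IMDB_LABELS` and `label_name` are hypothetical helpers, not part of the notebook's API:

```python
# IMDB convention: 0 = negative review, 1 = positive review.
IMDB_LABELS = {0: 'negative', 1: 'positive'}


def label_name(label_id):
    """Translate an integer class index into a readable sentiment name."""
    return IMDB_LABELS[int(label_id)]


print(label_name(1))  # positive
```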