### 

### Natasha Seelam (natasha@mindsdb.com), Patricio Cerda-Mardini, Cosmo Jenytin, George Hosu, Jorge Torres
### 2021.04.21

This notebook gives users a hands-on introduction to MindsDB and shows how to build a model.


In [1]:
import pandas as pd
import numpy as np

# Machine learning
import torch
import mindsdb_native as mdb


import os
import time

S3 Datasource is not available by default. If you wish to use it, please install mindsdb_native[extra_data_sources]
Athena Datasource is not available by default. If you wish to use it, please install mindsdb_native[extra_data_sources]
Google Cloud Storage Datasource is not available by default. If you wish to use it, please install mindsdb_native[extra_data_sources]


pip3 install mindsdb_native --upgrade


In [2]:
# Setup torch seed for reproducibility
torch.manual_seed(1234)

<torch._C.Generator at 0x7efb6091acb0>

### 01. Load the dataset

In the following, we will use the "Stanford IMDB" dataset, which is commonly used to benchmark sentiment analysis and text classification in natural language processing (NLP).

The leaderboard results suggest that roughly 95% accuracy is the state of the art, as indicated [here](https://paperswithcode.com/sota/text-classification-on-imdb). We're going to build a predictor for text classification with just a few lines of code.

First, we load the Stanford IMDB dataset. Because of the way the data was read, we shuffle the rows so that positive (1) and negative (0) examples are roughly evenly distributed throughout the dataset.

The dataset has two columns: "Comments", which holds the text reviews of the movies, and "Polarity", which indicates whether the sentiment is positive or negative. Our task will be to convert the comments into a featurized form and train a model to predict the polarity.

In [3]:
data_dir = "datasets/"
dataset = "stanford_movie_review"
filename = os.path.join(data_dir, dataset, "data.csv")


# Load the data and shuffle the row order (fixed seed for reproducibility)
data = pd.read_csv(filename).sample(frac=1, random_state=1234)

data.head()

| Unnamed: 0 | Comments | Polarity |
|---|---|---|
| 20308 | I had high hopes going to see this, as I alway... | 0 |
| 37706 | This is an enjoyable movie. Its very realistic... | 1 |
| 6041 | This should be re-titled "The Curious Case Of ... | 0 |
| 42143 | Lucasarts have pulled yet another beauty out o... | 1 |
| 23202 | Blackwater Valley Exorcism is a movie about a ... | 0 |
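
To illustrate what the shuffle does, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration; only the column names mirror the dataset above. If a file stores all negative reviews before all positive ones, `sample(frac=1, random_state=...)` reorders the rows without changing the overall label counts, and a fixed seed makes the ordering repeatable.

```python
import pandas as pd

# Toy stand-in for the real data: labels sorted, as they might be after
# reading one class at a time (hypothetical rows, not the IMDB data).
toy = pd.DataFrame({
    "Comments": [f"review {i}" for i in range(10)],
    "Polarity": [0] * 5 + [1] * 5,
})

# frac=1 returns every row exactly once, in a random order;
# random_state fixes the permutation so the shuffle is repeatable.
shuffled = toy.sample(frac=1, random_state=1234)
reshuffled = toy.sample(frac=1, random_state=1234)

# Same rows and same label counts; only the order differs.
print(shuffled["Polarity"].value_counts().to_dict())
print(shuffled.index.equals(reshuffled.index))
```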


Below is one example each of a positive and a negative review from the data:

In [4]:
# Rating examples
print("\033[1m" + "Positive (Label=1) Review:\n" + "\033[0m", data[data["Polarity"] == 1]["Comments"].iloc[0])

print("\033[1m" + "\n\nNegative (Label=0) Review: \n" + "\033[0m", data[data["Polarity"] == 0]["Comments"].iloc[0])


Positive (Label=1) Review:
This is an enjoyable movie. Its very realistic to the "wonderful world of music" I've been there and done that. It shows a human element in each character and the realism that nobody is perfect. These amateur musicians weren't all that bad players. Cleavon Little's character, Marshall Tucker, was played very well. Marshall was no saint himself. Here he was getting paid to do a job and he's giving these guys a hard time about everything in the van on the way up there. You don't bite the hands that feed you. I do find it hard to believe that a player with the jazz experience he has, claims he does not know any of the dixieland tunes. He has a tremendous sense of predicting chord changes to tunes he does not know. Not common, but not unheard of either. He delivers a true and harsh message at the end of the movie when he tells the clarinet player, "its not a religion, devotion is not enough." On that level, he is correct, although I think the clarinet pl

We will now split our data into 80% training and 20% testing.

In [5]:
# We'll split our training/testing fraction manually

ptrain = 0.8
Ntrain = int(data.shape[0] * ptrain)
Ntest = int(data.shape[0] - Ntrain)

train = data.iloc[:Ntrain, :]
test = data.iloc[Ntrain:, :]
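
As a quick sanity check on the split logic, the sketch below applies the same positional slicing to a toy DataFrame (hypothetical data) and confirms that the two pieces are disjoint and together cover every row. Because `iloc` slices by position, this kind of split only gives a representative test set if the rows were shuffled beforehand, as was done above.

```python
import pandas as pd

# Hypothetical toy data standing in for the shuffled reviews.
toy = pd.DataFrame({
    "Comments": [f"review {i}" for i in range(10)],
    "Polarity": [0, 1] * 5,
})

ptrain = 0.8
n_train = int(toy.shape[0] * ptrain)  # number of rows used for training

# iloc slices by position: the first n_train rows go to training,
# the remainder to testing.
toy_train = toy.iloc[:n_train, :]
toy_test = toy.iloc[n_train:, :]

# The pieces should not share rows and should cover the whole table.
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), overlap)
```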

### 02. Train a model!

We'll train a model on our training data with just two lines of code!

**NOTE** Without a GPU, training will likely take a very long time. With a GPU, it takes roughly 30 minutes.

In [6]:
# Create a predictor instance
model = mdb.Predictor(name="text_reviews")

# Train your model!
model.learn(from_data=train, to_predict="Polarity", use_gpu=True)

pip3 install mindsdb_native --upgrade
INFO:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:51 - [START] DataExtractor
INFO:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:56 - [END] DataExtractor, execution time: 0.052 seconds
INFO:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:51 - [START] DataCleaner
INFO:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:56 - [END] DataCleaner, execution time: 0.072 seconds
INFO

INFO:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:51 - [START] DataTransformer
INFO:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:56 - [END] DataTransformer, execution time: 0.065 seconds
INFO:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:51 - [START] ModelInterface
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expecte

INFO:lightwood.6038:Subtest test error: 0.19307737691061838 on subset 2, overall test error: 0.19380428418517112
INFO:lightwood.6038:Model predictions and decoding completed
DEBUG:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/model_interface/lightwood_backend.py:328 - We've reached training epoch nr 14 with an accuracy of 92.62% on the testing dataset
INFO:lightwood.6038:Subtest test error: 0.1926073282957077 on subset 2, overall test error: 0.19278974421322345
INFO:lightwood.6038:Model predictions and decoding completed
DEBUG:mindsdb-logger-8822bc82-a1d7-11eb-b180-495c3e517957---51d83821-9541-4234-bb11-6859bb465cbd:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/model_interface/lightwood_backend.py:328 - We've reached training epoch nr 21 with an accuracy of 92.65% on the testing dataset
INFO:lightwood.6038:Subtest test erro

We can now predict on the test set and see how we do:

In [7]:
result = model.predict(test)

INFO:mindsdb-logger-core-logger---:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:51 - [START] DataExtractor
INFO:mindsdb-logger-core-logger---:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:56 - [END] DataExtractor, execution time: 0.021 seconds
INFO:mindsdb-logger-core-logger---:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:51 - [START] DataTransformer
INFO:mindsdb-logger-core-logger---:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:56 - [END] DataTransformer, execution time: 0.013 seconds
INFO:mindsdb-logger-core-logger---:/home/natasha/mdb/lib/python3.8/site-packages/mindsdb_native/libs/phases/base_module.py:51 - [START] ModelInterface
INFO:lightwood.6038:Computing device used: cuda
INFO:lightwood.6038:Model predictions and decoding completed
INFO:mindsdb-logger-cor

In [8]:
ypred = result._data["Polarity"]
ytrue = test["Polarity"].astype(str).tolist()

acc = [ypred[j] == ytrue[j] for j in range(len(ypred))]

In [9]:
print("Accuracy is =", sum(acc)/len(acc))

Accuracy is = 0.9276
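
The accuracy computation above compares predictions and labels element by element. The sketch below reproduces that logic on made-up values. It assumes, as the `astype(str)` cast above suggests, that predictions come back as strings, so the integer labels must be cast before comparing.

```python
# Made-up predictions and labels, for illustration only.
ypred = ["1", "0", "1", "1", "0"]
ytrue_raw = [1, 0, 0, 1, 0]

# Without the cast, a string never equals an integer in Python
# ("1" == 1 is False), so every comparison fails.
naive = sum(p == t for p, t in zip(ypred, ytrue_raw)) / len(ypred)

# Casting the labels to strings makes the comparison meaningful.
ytrue = [str(t) for t in ytrue_raw]
accuracy = sum(p == t for p, t in zip(ypred, ytrue)) / len(ypred)

print("Naive accuracy:", naive)
print("Accuracy:", accuracy)
```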


### 03. Studying what goes on under the hood

The above model required only two lines of code to train. What is the underlying model?

We can find out by probing the model instance. By looking into the backend, we can see which encoders feed the mixer.

First, let's look at the configuration used to construct the predictor:

In [10]:
model.transaction.model_backend.predictor.config

{'input_features': [{'name': 'Comments',
   'type': 'text',
   'grouped_by': False,
   'encoder_attrs': {'original_type': None, 'secondary_type': None}}],
 'output_features': [{'name': 'Polarity',
   'type': 'categorical',
   'grouped_by': False,
   'weights': {'0': 5.047445992327882e-05, '1': 5.02209722780233e-05},
   'encoder_attrs': {'predict_proba': True,
    'original_type': None,
    'secondary_type': None}}],
 'data_source': {'cache_transformed_data': True},
 'mixer': {'class': lightwood.mixers.nn.NnMixer,
  'kwargs': {'callback_on_iter': None,
   'eval_every_x_epochs': 15.5,
   'stop_training_after_seconds': 15955.125}}}

As you can see, the "Polarity" column is modeled as categorical (it's a binary label), while the "Comments" column is modeled as text. In `lightwood/api/data_sources.py`, you can see how each of these data types is mapped to a particular encoder.
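
To read the feature types out of such a configuration programmatically, a loop like the following works. It operates on a hand-copied, abbreviated version of the dictionary printed above (the weights, encoder attributes, and mixer entry are omitted), so it runs without a trained predictor.

```python
# Abbreviated, hand-copied version of the configuration shown above.
config = {
    "input_features": [
        {"name": "Comments", "type": "text", "grouped_by": False},
    ],
    "output_features": [
        {"name": "Polarity", "type": "categorical", "grouped_by": False},
    ],
}

# Collect a {column name: data type} mapping across inputs and outputs.
feature_types = {
    feature["name"]: feature["type"]
    for key in ("input_features", "output_features")
    for feature in config[key]
}

for name, dtype in feature_types.items():
    print(f"{name}: {dtype}")
```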

We can also query these encoders directly and access their underlying models:



In [11]:
model.transaction.model_backend.predictor._mixer.encoders

{'Polarity': <lightwood.encoders.categorical.autoencoder.CategoricalAutoEncoder at 0x7ef8a74c7ee0>,
 'Comments': <lightwood.encoders.text.pretrained.PretrainedLang at 0x7ef8a74c7f40>}

Each encoder has a "prepared" flag that indicates whether it has been trained. You can choose not to prepare an encoder, but by default this happens during the `learn()` call on the predictor above.

In [12]:
print("Prepared encoder [Comments]?", 
      model.transaction.model_backend.predictor._mixer.encoders["Comments"]._prepared
     )

Prepared encoder [Comments]? True


Lastly, we can see the nature of the mixer below:

In [13]:
model.transaction.model_backend.predictor._mixer

<lightwood.mixers.nn.NnMixer at 0x7ef8a74c7f10>

The default mixer here is `NnMixer` (specifically, it uses our `DefaultNet`). You can choose a different model architecture if you prefer.

We can also inspect the details of the mixer's network, a PyTorch module stored in its `net` attribute:

In [14]:
model.transaction.model_backend.predictor._mixer.net

DefaultNet(
  (net): Sequential(
    (0): Linear(in_features=768, out_features=130, bias=True)
    (1): SELU()
    (2): Dropout(p=0.0, inplace=False)
    (3): Linear(in_features=130, out_features=3, bias=True)
  )
  (_foward_net): Sequential(
    (0): Linear(in_features=768, out_features=130, bias=True)
    (1): SELU()
    (2): Dropout(p=0.0, inplace=False)
    (3): Linear(in_features=130, out_features=3, bias=True)
  )
)
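
For readers who want to experiment with this architecture outside of MindsDB, the layer stack printed above can be rebuilt in plain PyTorch. This is a sketch based only on the printed layer sizes, not lightwood's actual `DefaultNet` implementation; `rebuilt_net` and the random input batch are illustrative names and data.

```python
import torch
from torch import nn

torch.manual_seed(1234)

# Same layer sizes as the printout: 768 input features,
# a 130-unit hidden layer with SELU activation, and 3 outputs.
rebuilt_net = nn.Sequential(
    nn.Linear(in_features=768, out_features=130),
    nn.SELU(),
    nn.Dropout(p=0.0),
    nn.Linear(in_features=130, out_features=3),
)

# Count trainable parameters: (768*130 + 130) + (130*3 + 3).
n_params = sum(p.numel() for p in rebuilt_net.parameters())

# Forward pass on a random batch of 4 stand-in input vectors.
with torch.no_grad():
    out = rebuilt_net(torch.randn(4, 768))

print(n_params)
print(tuple(out.shape))
```

The weights here are randomly initialized, so the outputs are meaningless; the point is only to show the shape of the computation the mixer performs on the encoded inputs.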