## Sparse Tsetlin Machine training on IMDB

This notebook shows how to train the Sparse Tsetlin Machine from green-tsetlin on the **IMDB sentiment dataset**.

In [1]:
import numpy as np

seed = 42
rng = np.random.default_rng(seed)

### Sklearn CountVectorizer

sklearn's CountVectorizer transforms text into a bag-of-words vector. With `binary=True`, each position is 1 if the corresponding vocabulary entry occurs in the text and 0 otherwise.

E.g., the input text "I love swimming in the ocean" is transformed to: [0, 1, 1, 1, 0, 0] \
This vector is based on the vocabulary of the CountVectorizer, e.g., ["dogs", "love", "ocean", "swimming", "biking", "movie"] \
We obtain the vocabulary by fitting the vectorizer to the data, which gives us the words / tokens that occur in it.
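
To make the encoding concrete, here is a minimal pure-Python sketch of a binary bag-of-words lookup for the example above. It is not how CountVectorizer is implemented; the helper `binary_bow` is hypothetical, and the regular expression is assumed to mirror the vectorizer's default tokenization, which lowercases the text and keeps tokens of two or more word characters.

```python
import re

def binary_bow(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `text`."""
    # Lowercase and keep tokens of two or more word characters (assumed to
    # mirror CountVectorizer's default tokenization; "I" is therefore dropped).
    tokens = set(re.findall(r"\b\w\w+\b", text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]

vocabulary = ["dogs", "love", "ocean", "swimming", "biking", "movie"]
vector = binary_bow("I love swimming in the ocean", vocabulary)
print(vector)  # [0, 1, 1, 1, 0, 0]
```

With `ngram_range=(1, 3)`, as used in the next cell, the vocabulary also contains two- and three-word phrases, but the lookup idea stays the same.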

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import datasets

imdb = datasets.load_dataset('imdb')
x, y = imdb['train']['text'], imdb['train']['label']

vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True, lowercase=True, max_features=30000)
vectorizer.fit(x)

x_bin = vectorizer.transform(x).toarray().astype(np.uint8)
y = np.array(y).astype(np.uint32)

# Shuffle samples and labels with the same random permutation
shuffle_index = list(range(len(x)))
rng.shuffle(shuffle_index)

x_bin = x_bin[shuffle_index]
y = y[shuffle_index]


print(np.unique(y, return_counts=True))

train_x_bin, val_x_bin, train_y, val_y = train_test_split(x_bin, y, test_size=0.2, random_state=seed, shuffle=True)

(array([0, 1], dtype=uint32), array([12500, 12500]))


### Convert train_x_bin and val_x_bin to sparse CSR matrices

In [3]:
from scipy.sparse import csr_matrix

train_x = csr_matrix(train_x_bin)
val_x = csr_matrix(val_x_bin)
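
The CSR (compressed sparse row) format stores only the non-zero entries, which suits bag-of-words data where most of the 30,000 features are absent from any given review. As a rough illustration of the layout, the sketch below builds the three CSR arrays (`data`, `indices`, `indptr`) by hand for a tiny dense matrix. `dense_to_csr` is a hypothetical helper written for this example, not part of scipy.

```python
def dense_to_csr(rows):
    """Build CSR arrays (data, indices, indptr) from a list of dense rows."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # its column index
        indptr.append(len(data))      # offset in `data` where the next row starts
    return data, indices, indptr

dense = [
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
]
data, indices, indptr = dense_to_csr(dense)
print(data)     # [1, 1, 1, 1, 1]
print(indices)  # [1, 2, 3, 0, 5]
print(indptr)   # [0, 3, 3, 5]
```

A `scipy.sparse.csr_matrix` exposes arrays with the same names as attributes (`.data`, `.indices`, `.indptr`), so the same structure can be inspected on `train_x`.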

### Install the green-tsetlin package using pip

In [4]:
# pip install green-tsetlin

In [5]:
from green_tsetlin.sparse_tsetlin_machine import SparseTsetlinMachine
from green_tsetlin.trainer import Trainer

n_clauses = 500
s = 2.0
threshold = 2000
boost_true_positive_feedback = True
literal_budget = 10
dynamic_AL = True
AL_size = 100
clause_size = 50
lower_ta_threshold = -40

tm = SparseTsetlinMachine(n_literals=train_x_bin.shape[1],
                          n_clauses=n_clauses,
                          n_classes=2,
                          s=s,
                          threshold=threshold,
                          boost_true_positives=boost_true_positive_feedback,
                          literal_budget=literal_budget,
                          dynamic_AL=dynamic_AL)

# Set sparse-specific parameters if needed
tm.lower_ta_threshold = lower_ta_threshold
tm.active_literals_size = AL_size
tm.clause_size = clause_size

trainer = Trainer(tm=tm, n_jobs=5, n_epochs=5, seed=seed, progress_bar=True)

trainer.set_train_data(train_x, train_y)
trainer.set_test_data(val_x, val_y)

trainer.train()

Processing epoch 5 of 5, train acc: 0.823, best test score: 0.813 (epoch: 4): 100%|██████████| 5/5 [02:14<00:00, 26.98s/it]


{'train_time_of_epochs': [29.243809992000024,
  24.66695534200062,
  23.12717687300028,
  23.223710567000126,
  22.50294175199997],
 'best_test_score': 0.813,
 'best_test_epoch': 4,
 'n_epochs': 5,
 'train_log': [0.73115, 0.79825, 0.8196, 0.82465, 0.823],
 'test_log': [0.7898, 0.8076, 0.8086, 0.803, 0.813],
 'did_early_exit': False}
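
The logged values can be inspected directly. As a small sketch, the snippet below copies the two accuracy logs from the output above into a hypothetical `results` variable (assumed here to stand in for the dictionary returned by `trainer.train()`) and recovers the best epoch from `test_log`. The reported `best_test_epoch` of 4 matches a zero-based index into the log, i.e. the fifth and last epoch.

```python
# Values copied from the training output above; `results` is a hypothetical
# name standing in for the dictionary shown there.
results = {
    "train_log": [0.73115, 0.79825, 0.8196, 0.82465, 0.823],
    "test_log": [0.7898, 0.8076, 0.8086, 0.803, 0.813],
}

test_log = results["test_log"]

# Index of the highest test accuracy (zero-based epoch number).
best_epoch = max(range(len(test_log)), key=test_log.__getitem__)
best_score = test_log[best_epoch]
print(best_epoch, best_score)  # 4 0.813

# Per-epoch gap between train and test accuracy, a rough overfitting check.
gaps = [round(tr - te, 5) for tr, te in zip(results["train_log"], test_log)]
print(gaps)
```

Since the best test score occurs in the final epoch, training for more epochs may still improve it, although this run alone does not show that.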