## Sparse Tsetlin Machine training on IMDB

This notebook shows how to train the Sparse Tsetlin Machine from green-tsetlin on the **IMDB sentiment dataset**.

With sklearn's **CountVectorizer**, we transform each review into a binary bag-of-words vector.

E.g. the input text "I love swimming in the ocean" is transformed to: [0, 1, 1, 1, 0, 0] \
Each position in this vector corresponds to one entry in the CountVectorizer vocabulary, e.g. ["dogs", "love", "ocean", "swimming", "biking", "movie"] \
We obtain the vocabulary by fitting the vectorizer on the data, which collects the words / tokens that occur in it. In this notebook the vocabulary is limited to the 30,000 most frequent 1- to 3-grams.
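
The mapping above can be reproduced by hand. The sketch below is a hand-rolled illustration of the idea, not CountVectorizer's implementation: `binary_bow` is a hypothetical helper, and its tokenization rule (lowercase, tokens of at least two word characters) is an assumption meant to approximate the vectorizer's defaults.

```python
import re

# Toy vocabulary from the example above; a fitted CountVectorizer would learn it from data.
VOCABULARY = ["dogs", "love", "ocean", "swimming", "biking", "movie"]


def binary_bow(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `text`."""
    # Lowercase and keep tokens of two or more word characters,
    # so the single-letter "I" is dropped.
    tokens = set(re.findall(r"\b\w\w+\b", text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]


vector = binary_bow("I love swimming in the ocean", VOCABULARY)
print(vector)  # [0, 1, 1, 1, 0, 0]
```

Words that are not in the vocabulary ("in", "the") leave no trace in the vector, which is why most entries of a real 30,000-dimensional vector are zero.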

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import numpy as np
import datasets

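# Load the IMDB training split: raw review texts and their 0/1 sentiment labels
imdb = datasets.load_dataset('imdb')
x, y = imdb['train']['text'], imdb['train']['label']

# Learn a vocabulary of the 30,000 most frequent 1- to 3-grams; binary=True records presence, not counts
vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True, lowercase=True, max_features=30_000)
vectorizer.fit(x)

# Encode every review as a dense uint8 bag-of-words matrix
x_bin = vectorizer.transform(x).toarray().astype(np.uint8)
y = np.array(y).astype(np.uint32)

# Hold out 20% of the data for validation
train_x_bin, val_x_bin, train_y, val_y = train_test_split(x_bin, y, test_size=0.2, random_state=42, shuffle=True)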

Convert `train_x_bin` and `val_x_bin` to **sparse CSR matrices**.

In [None]:
from scipy.sparse import csr_matrix

train_x = csr_matrix(train_x_bin)
val_x = csr_matrix(val_x_bin)
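
To make the CSR (compressed sparse row) layout concrete, here is a minimal sketch on a toy matrix. The two rows are hypothetical bag-of-words vectors, not IMDB data; the point is only to show which arrays `csr_matrix` keeps in place of the dense grid.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two toy bag-of-words rows (hypothetical data, not taken from IMDB).
dense = np.array([[0, 1, 1, 1, 0, 0],
                  [1, 0, 0, 0, 0, 1]], dtype=np.uint8)

sparse = csr_matrix(dense)

print(sparse.data)     # values of the non-zero entries
print(sparse.indices)  # column index of each non-zero entry
print(sparse.indptr)   # row i spans indices[indptr[i]:indptr[i + 1]]

# Converting back recovers the original matrix.
assert np.array_equal(sparse.toarray(), dense)
```

Only the non-zero entries are stored, so the memory cost grows with the number of tokens present in each review instead of with the vocabulary size.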

Install the **green-tsetlin** package using **pip**:

```shell
pip install green-tsetlin
```

In [None]:
from green_tsetlin.sparse_tsetlin_machine import SparseTsetlinMachine
from green_tsetlin.trainer import Trainer

n_clauses = 500
s = 2.0
threshold = 2000
boost_true_positive_feedback = True
literal_budget = 10
dynamic_AL = True
AL_size = 100
clause_size = 50
lower_ta_threshold = -40

tm = SparseTsetlinMachine(n_literals=train_x_bin.shape[1],
                          n_clauses=n_clauses,
                          n_classes=2,
                          s=s,
                          threshold=threshold,
                          boost_true_positives=boost_true_positive_feedback,
                          literal_budget=literal_budget,
                          dynamic_AL=dynamic_AL)

# Set the sparse-specific parameters if needed
tm.lower_ta_threshold = lower_ta_threshold
tm.active_literals_size = AL_size
tm.clause_size = clause_size

trainer = Trainer(tm=tm, n_jobs=5, n_epochs=5, seed=42, progress_bar=True)

trainer.set_train_data(train_x, train_y)
trainer.set_eval_data(val_x, val_y)

trainer.train()