## Training a Tsetlin Machine on IMDB
This notebook shows how to train the green-tsetlin Tsetlin Machine on the IMDB sentiment dataset. The dataset is loaded with the Hugging Face datasets library, which provides it in preprocessed form, with symbols and white space removed.

In [1]:
import numpy as np

seed = 42
rng = np.random.default_rng(seed)

In [2]:
import datasets

imdb = datasets.load_dataset('imdb')
x, y = imdb['train']['text'], imdb['train']['label']



### Sklearn CountVectorizer

With sklearn's CountVectorizer, we can transform the text into a binary bag-of-words representation.

For example, the input text "I love swimming in the ocean" is transformed to [0, 1, 1, 1, 0, 0]. \
This vector is based on the vocabulary of the CountVectorizer, e.g. ["dogs", "love", "ocean", "swimming", "biking", "movie"]: each position is 1 if the corresponding word occurs in the text and 0 otherwise. \
We obtain the vocabulary by fitting the vectorizer on the data, which gives us the words / tokens that occur in it.
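
The encoding described above can be sketched in plain Python. This is only an illustration of the idea, not how CountVectorizer is implemented: the helper `binary_bag_of_words` is a hypothetical name, and the tokenization (lowercasing, words of two or more characters) is an approximation of the vectorizer's defaults.

```python
import re

def binary_bag_of_words(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `text`."""
    # Lowercase the text and keep word tokens of two or more characters
    # (an assumption that roughly mirrors CountVectorizer's default tokenizer).
    tokens = set(re.findall(r"\b\w\w+\b", text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]

vocabulary = ["dogs", "love", "ocean", "swimming", "biking", "movie"]
vector = binary_bag_of_words("I love swimming in the ocean", vocabulary)
print(vector)  # [0, 1, 1, 1, 0, 0]
```

Words outside the vocabulary ("in", "the") are simply ignored, which is why the vector length always equals the vocabulary size.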

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True, lowercase=True, max_features=5000)
vectorizer.fit(x)

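# Binary bag-of-words features (uint8) and integer labels (uint32)
x_bin = vectorizer.transform(x).toarray().astype(np.uint8)
y = np.array(y).astype(np.uint32)

# Shuffle features and labels with the same index order
shuffle_index = list(range(len(x)))
rng.shuffle(shuffle_index)

x_bin = x_bin[shuffle_index]
y = y[shuffle_index]

# Keep only the first 1000 shuffled samples
x_bin = x_bin[:1000]
y = y[:1000]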

train_x_bin, val_x_bin, train_y, val_y = train_test_split(x_bin, y, test_size=0.2, random_state=seed, shuffle=True)

### Install the green-tsetlin package using pip
The green-tsetlin library comes in a CPU-heavy and a less CPU-heavy implementation, which gives systems with older CPUs a plug-and-play version of the library.

In [17]:
#pip install green-tsetlin
#pip install green-tsetlin[cpu]

### Green Tsetlin hpsearch

The TM has a number of hyperparameters to set. We can tune them with the built-in Optuna optimizer.

With this optimizer, we can set the desired search space for each parameter. We can also optimize for minimizing the number of literals in each clause.
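
To make the idea of a search space concrete, here is a minimal sketch that draws candidate configurations at random from `(low, high)` ranges. It assumes each tuple is read as an inclusive range, and the helper `sample_trial` is hypothetical. It is not the green-tsetlin or Optuna API, and the real optimizer may use a different sampler and draw different value types.

```python
import random

# Example search spaces, each read as a (low, high) range (an assumption).
search_space = {
    "s": (2.0, 10.0),
    "n_clauses": (100, 500),
    "threshold": (50, 1000),
    "literal_budget": (5, 10),
}

def sample_trial(space, rng):
    """Hypothetical helper: draw one candidate configuration at random."""
    trial = {}
    for name, (low, high) in space.items():
        if isinstance(low, float):
            trial[name] = rng.uniform(low, high)  # continuous parameter
        else:
            trial[name] = rng.randint(low, high)  # integer parameter, bounds inclusive
    return trial

rng = random.Random(42)
candidates = [sample_trial(search_space, rng) for _ in range(3)]
for candidate in candidates:
    print(candidate)
```

A hyperparameter search then trains one model per candidate and keeps the configuration with the best validation score.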

In [18]:
from green_tsetlin.hpsearch import HyperparameterSearch

s_space = (2.0, 10.0)
clause_space = (100, 500)
threshold_space = (50, 1000)
max_epoch_per_trial = 15
literal_budget = (5, 10)
minimize_literal_budget = False
n_jobs = 5
k_folds = 4

imdb_hpsearch = HyperparameterSearch(s_space=s_space,
    clause_space=clause_space,
    threshold_space=threshold_space,
    max_epoch_per_trial=max_epoch_per_trial,
    literal_budget=literal_budget,
    minimize_literal_budget=minimize_literal_budget,
    seed=seed,
    n_jobs=n_jobs,
    k_folds=k_folds
    )

imdb_hpsearch.set_train_data(train_x_bin, train_y)
imdb_hpsearch.set_test_data(val_x_bin, val_y)

imdb_hpsearch.optimize(n_trials=10, study_name="IMDB hpsearch", show_progress_bar=True)

[I 2024-03-18 14:27:13,898] A new study created in memory with name: IMDB hpsearch
Processing trial 9 of 10, best score: [0.815]: 100%|██████████| 10/10 [01:08<00:00,  6.84s/it]


In [19]:
params = imdb_hpsearch.best_trials[0].params
print(params)

performance = imdb_hpsearch.best_trials[0].value 
print(performance)

{'s': 9.607623603636291, 'n_clauses': 487, 'threshold': 497.09954100791236, 'literal_budget': 7}
0.815


In [7]:
test_x, test_y = imdb['test']['text'], imdb['test']['label']

test_x_bin = vectorizer.transform(test_x).toarray().astype(np.uint8)
test_y = np.array(test_y).astype(np.uint32)

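# Draw a random 500-sample subset rather than the first 500 rows,
# in case the test split is ordered by label
test_index = rng.permutation(len(test_y))[:500]
test_x_bin = test_x_bin[test_index]
test_y = test_y[test_index]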

In [8]:
from green_tsetlin.tsetlin_machine import TsetlinMachine
from green_tsetlin.trainer import Trainer


tm = TsetlinMachine(n_literals=test_x_bin.shape[1],
                    n_clauses=params["n_clauses"],
                    s=params["s"],
                    threshold=int(params["threshold"]),
                    literal_budget=params["literal_budget"],
                    n_classes=2,
                    )

trainer = Trainer(tm=tm, n_jobs=5, n_epochs=15, seed=seed, progress_bar=True, k_folds=4)

trainer.set_train_data(train_x_bin, train_y)
trainer.set_test_data(val_x_bin, val_y)

trainer.train()

Processing epoch 15 of 15, train acc: 0.851, best test score: 0.788 (epoch: 11): 100%|██████████| 15/15 [00:02<00:00,  5.03it/s]
Processing epoch 15 of 15, train acc: 0.885, best test score: 0.860 (epoch: 0): 100%|██████████| 15/15 [00:03<00:00,  4.83it/s]
Processing epoch 15 of 15, train acc: 0.884, best test score: 0.860 (epoch: 2): 100%|██████████| 15/15 [00:03<00:00,  4.94it/s]
Processing epoch 15 of 15, train acc: 0.900, best test score: 0.876 (epoch: 2): 100%|██████████| 15/15 [00:03<00:00,  4.92it/s]


{'best_test_score': 0.876, 'k_folds': 4}