## Tsetlin Machine trains on IMDB
This notebook shows how the green-tsetlin Tsetlin Machine trains on the IMDB sentiment dataset. The dataset is loaded with the Hugging Face `datasets` library, which provides it in preprocessed form, with symbols and white space removed.

In [4]:
import numpy as np

seed = 42
rng = np.random.default_rng(seed)

In [5]:
import datasets

imdb = datasets.load_dataset('imdb')
x, y = imdb['train']['text'], imdb['train']['label']

### Sklearn CountVectorizer

With sklearn's CountVectorizer, we can transform the text into a binary bag-of-words representation.

For example, the input text "I love swimming in the ocean" is transformed to: [0, 1, 1, 1, 0, 0] \
This vector is based on the vocabulary of the CountVectorizer, e.g. ["dogs", "love", "ocean", "swimming", "biking", "movie"] \
We obtain the vocabulary by fitting the vectorizer to the data, which gives us the words / tokens that occur in it.
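
To make the mapping concrete, here is a minimal plain-Python sketch that reproduces the vector above. It is not CountVectorizer's implementation: the helper `binary_bag_of_words` is a hypothetical name, the vocabulary is fixed by hand instead of being fitted, and the regular expression mirrors CountVectorizer's default token pattern, which keeps only tokens of two or more characters (so "I" is dropped).

```python
import re


def binary_bag_of_words(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    # Lowercase, then keep tokens of two or more word characters.
    tokens = set(re.findall(r"\b\w\w+\b", text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]


# Hand-written vocabulary from the example above (not fitted from data).
vocabulary = ["dogs", "love", "ocean", "swimming", "biking", "movie"]
vector = binary_bag_of_words("I love swimming in the ocean", vocabulary)
print(vector)  # [0, 1, 1, 1, 0, 0]
```

A fitted CountVectorizer assigns feature indices in alphabetical order, so the column order of a real vocabulary will differ from this hand-written one.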

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Binary bag-of-words over the 5000 most frequent unigrams
vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True, lowercase=True, max_features=5000)
vectorizer.fit(x)

x_bin = vectorizer.transform(x).toarray().astype(np.uint8)
y = np.array(y).astype(np.uint32)

# Shuffle samples and labels with the same index so the pairs stay aligned
shuffle_index = list(range(len(x)))
rng.shuffle(shuffle_index)

x_bin = x_bin[shuffle_index]
y = y[shuffle_index]

# Keep only the first 1000 shuffled samples
x_bin = x_bin[:1000]
y = y[:1000]

# 80/20 split into training and validation sets
train_x_bin, val_x_bin, train_y, val_y = train_test_split(x_bin, y, test_size=0.2, random_state=seed, shuffle=True)

### Install the green-tsetlin package using pip
The green-tsetlin library offers both a CPU-heavy and a less CPU-heavy implementation, giving systems with older CPUs a plug-and-play version of the library.

In [None]:
#pip install green-tsetlin
#pip install green-tsetlin[cpu]

### Green Tsetlin hpsearch

The TM has several hyperparameters to set, and we can tune them with the built-in Optuna optimizer.

With this optimizer, we can set the desired search space for each parameter. We can also optimize for minimizing the number of literals in each clause.
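
As a rough illustration of what searching over these spaces means, the sketch below draws parameters at random from the same ranges used in the next cell and keeps the best-scoring draw. It is a stand-in under stated assumptions: Optuna's samplers are more sophisticated than uniform random draws, the functions `sample_params`, `toy_score` and `random_search` are hypothetical names, and `toy_score` is a made-up objective that replaces the validation accuracy a real trial would measure.

```python
import random

# Search spaces matching the ranges used in this notebook.
SPACES = {
    "s": (2.0, 10.0),
    "n_clauses": (100, 500),
    "threshold": (50, 1000),
    "literal_budget": (5, 10),
}


def sample_params(rng):
    """Draw one candidate configuration from the search spaces."""
    return {
        "s": rng.uniform(*SPACES["s"]),
        "n_clauses": rng.randint(*SPACES["n_clauses"]),
        "threshold": rng.uniform(*SPACES["threshold"]),
        "literal_budget": rng.randint(*SPACES["literal_budget"]),
    }


def toy_score(params):
    """Made-up objective (assumption); a real trial would train a TM and
    return its validation accuracy."""
    return -abs(params["s"] - 6.0) - abs(params["n_clauses"] - 300) / 100.0


def random_search(n_trials, seed):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = toy_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=10, seed=42)
print(best_params, best_score)
```

Fixing the seed makes the search repeatable, which is the same reason `seed` is passed to `HyperparameterSearch` below.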

In [7]:
from green_tsetlin.hpsearch import HyperparameterSearch

s_space = (2.0, 10.0)
clause_space = (100, 500)
threshold_space = (50, 1000)
max_epoch_per_trial = 15
literal_budget = (5, 10)
minimize_literal_budget = False
n_jobs = 5
k_folds = 4

imdb_hpsearch = HyperparameterSearch(s_space=s_space,
    clause_space=clause_space,
    threshold_space=threshold_space,
    max_epoch_per_trial=max_epoch_per_trial,
    literal_budget=literal_budget,
    minimize_literal_budget=minimize_literal_budget,
    seed=seed,
    n_jobs=n_jobs,
    k_folds=k_folds
    )

imdb_hpsearch.set_train_data(train_x_bin, train_y)
imdb_hpsearch.set_test_data(val_x_bin, val_y)

imdb_hpsearch.optimize(n_trials=10, study_name="IMDB hpsearch", show_progress_bar=True)

[I 2024-03-18 09:49:24,327] A new study created in memory with name: IMDB hpsearch
[I 2024-03-18 09:49:51,957] Trial 0 finished with value: 0.8763636363636363 and parameters: {'s': 7.368522014683706, 'n_clauses': 458, 'threshold': 485.16728766509294, 'literal_budget': 6}. Best is trial 0 with value: 0.8763636363636363.
[I 2024-03-18 09:50:20,214] Trial 1 finished with value: 0.92 and parameters: {'s': 9.632100354894577, 'n_clauses': 440, 'threshold': 919.2502354964952, 'literal_budget': 6}. Best is trial 1 with value: 0.92.
[I 2024-03-18 09:50:44,917] Trial 2 finished with value: 0.8727272727272727 and parameters: {'s': 4.392909625401906, 'n_clauses': 332, 'threshold': 509.2398133442047, 'literal_budget': 7}. Best is trial 1 with value: 0.92.
[I 2024-03-18 09:51:05,081] Trial 3 finished with value: 0.8763636363636363 and parameters: {'s': 2.294396725277722, 'n_clauses': 189, 'threshold': 986.1529681100911, 'literal_budget': 7}. Best is trial 1 with value: 0.92.
[I 2024-03-18 09:51:24,996] Trial 4 finished with value: 0.8945454545454545 and parameters: {'s': 5.675778881812062, 'n_clauses': 152, 'threshold': 627.7328118311426, 'literal_budget': 7}. Best is trial 1 with value: 0.92.
[I 2024-03-18 09:51:46,412] Trial 5 finished with value: 0.9054545454545454 and parameters: {'s': 8.37097355486566, 'n_clauses': 184, 'threshold': 794.7288988595533, 'literal_budget': 9}. Best is trial 1 with value: 0.92.
[I 2024-03-18 09:52:09,584] Trial 6 finished with value: 0.8472727272727273 and parameters: {'s': 2.7469447175891704, 'n_clauses': 286, 'threshold': 771.8432494320348, 'literal_budget': 5}. Best is trial 1 with value: 0.92.
[I 2024-03-18 09:52:30,216] Trial 7 finished with value: 0.8945454545454545 and parameters: {'s': 4.674085069920422, 'n_clauses': 219, 'threshold': 847.2469101054708, 'literal_budget': 8}. Best is trial 1 with value: 0.92.
[I 2024-03-18 09:52:51,432] Trial 8 finished with value: 0.9163636363636364 and parameters: {'s': 6.358222931528444, 'n_clauses': 237, 'threshold': 748.3979007438049, 'literal_budget': 7}. Best is trial 1 with value: 0.92.
[I 2024-03-18 09:53:09,002] Trial 9 finished with value: 0.7890909090909091 and parameters: {'s': 5.539293447590991, 'n_clauses': 100, 'threshold': 184.75922366287574, 'literal_budget': 10}. Best is trial 1 with value: 0.92.

In [21]:
params = imdb_hpsearch.best_trials[0].params
print(params)

performance = imdb_hpsearch.best_trials[0].value 
print(performance)

{'s': 9.632100354894577, 'n_clauses': 440, 'threshold': 919.2502354964952, 'literal_budget': 6}
0.92
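
To help read `n_clauses` and `literal_budget` in the output above, here is a simplified sketch of how a Tsetlin Machine is generally described as classifying an input: each clause is an AND over a few literals (a feature or its negation), and the clauses that fire vote for a class. This is an illustration only and not green-tsetlin's internal representation; the feature positions, the clauses, and the names `clause_fires` and `class_votes` are hypothetical. `s` and `threshold` steer the training feedback and are not shown.

```python
LITERAL_BUDGET = 6  # cap on literals per clause, as in the best trial above


def clause_fires(literals, x):
    """A clause is a conjunction: every literal must hold for it to fire.

    Each literal is (feature_index, negated); negated=True means the
    feature must be 0, otherwise it must be 1.
    """
    return all((x[i] == 0) if negated else (x[i] == 1) for i, negated in literals)


def class_votes(clauses, x):
    """Count how many of a class's clauses fire on input x."""
    return sum(1 for literals in clauses if clause_fires(literals, x))


# Hypothetical feature positions: 0="love", 1="boring", 2="great"
negative_clauses = [[(1, False)], [(0, True), (2, True)]]
positive_clauses = [[(0, False)], [(2, False), (1, True)]]

# Every clause respects the literal budget.
assert all(len(c) <= LITERAL_BUDGET for c in negative_clauses + positive_clauses)

x = [1, 0, 1]  # "love" and "great" present, "boring" absent
votes = [class_votes(negative_clauses, x), class_votes(positive_clauses, x)]
predicted = max(range(len(votes)), key=lambda c: votes[c])
print(votes, predicted)  # [0, 2] 1
```

With 440 clauses and at most 6 literals each, every clause stays short enough to be read as a small rule over vocabulary words.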


In [16]:
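test_x, test_y = imdb['test']['text'], imdb['test']['label']

# Vectorize the test split with the vocabulary fitted on the training text
test_x_bin = vectorizer.transform(test_x).toarray().astype(np.uint8)
test_y = np.array(test_y).astype(np.uint32)

# Keep only the first 500 test samples
test_x_bin = test_x_bin[:500]
test_y = test_y[:500]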

In [25]:
from green_tsetlin.tsetlin_machine import TsetlinMachine
from green_tsetlin.trainer import Trainer


tm = TsetlinMachine(n_literals=test_x_bin.shape[1],
                    n_clauses=params["n_clauses"],
                    s=params["s"],
                    threshold=int(params["threshold"]),
                    literal_budget=params["literal_budget"],
                    n_classes=2,
                    )

trainer = Trainer(tm=tm, n_jobs=5, n_epochs=15, seed=seed, progress_bar=True, k_folds=4)

trainer.set_train_data(train_x_bin, train_y)
trainer.set_test_data(val_x_bin, val_y)

trainer.train()

Processing epoch 15 of 15, train acc: 0.847, best test score: 0.785 (epoch: 8): 100%|██████████| 15/15 [00:05<00:00,  2.82it/s]
Processing epoch 15 of 15, train acc: 0.879, best test score: 0.836 (epoch: 1): 100%|██████████| 15/15 [00:05<00:00,  2.57it/s]
Processing epoch 15 of 15, train acc: 0.897, best test score: 0.862 (epoch: 1): 100%|██████████| 15/15 [00:06<00:00,  2.18it/s]
Processing epoch 15 of 15, train acc: 0.892, best test score: 0.895 (epoch: 1): 100%|██████████| 15/15 [00:06<00:00,  2.17it/s]


{'best_test_score': 0.8945454545454545, 'k_folds': 4}