## IMDB
This notebook shows how to train a green-tsetlin Tsetlin Machine (TM) on the **IMDB sentiment dataset**.

In [None]:
import numpy as np

seed = 42
rng = np.random.default_rng(seed)

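With scikit-learn's **CountVectorizer**, we can transform the text into a binary bag-of-words representation.

For example, the input text "I love swimming in the ocean" is transformed to `[0, 1, 1, 1, 0, 0]`. \
This vector is defined by the vocabulary of the CountVectorizer, e.g. `["dogs", "love", "ocean", "swimming", "biking", "movie"]`: each position is 1 if the corresponding word occurs in the text and 0 otherwise. \
We obtain the vocabulary by fitting the vectorizer to the data, which collects the words / tokens that occur in it (here limited to the 5000 most frequent via `max_features=5000`).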
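
The toy example above can be reproduced without scikit-learn. The following is a minimal sketch, not the library's implementation: `binary_bag_of_words` is a hypothetical helper that assumes a fixed, hand-picked vocabulary and mimics `CountVectorizer`'s default tokenization (lowercasing, tokens of two or more word characters).

```python
import re

# Same pattern as CountVectorizer's default token_pattern: a token needs at
# least two word characters, so single-letter words such as "I" are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def binary_bag_of_words(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `text`."""
    tokens = set(TOKEN_PATTERN.findall(text.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]


vocabulary = ["dogs", "love", "ocean", "swimming", "biking", "movie"]
vector = binary_bag_of_words("I love swimming in the ocean", vocabulary)
print(vector)  # [0, 1, 1, 1, 0, 0]
```

Note that a fitted `CountVectorizer` orders its vocabulary alphabetically, so the real column positions will differ from this hand-picked order.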

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import datasets

imdb = datasets.load_dataset('imdb')
x, y = imdb['train']['text'], imdb['train']['label']

vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True, lowercase=True, max_features=5000)
vectorizer.fit(x)

x_bin = vectorizer.transform(x).toarray().astype(np.uint8)
y = np.array(y).astype(np.uint32)

# Shuffle features and labels with the same random permutation
shuffle_index = rng.permutation(len(x))

x_bin = x_bin[shuffle_index]
y = y[shuffle_index]

# Keep only a 2500-review subset
x_bin = x_bin[:2500]
y = y[:2500]

train_x_bin, val_x_bin, train_y, val_y = train_test_split(x_bin, y, test_size=0.2, random_state=seed, shuffle=True)

The green-tsetlin library comes in two variants, a CPU-heavy and a less CPU-heavy implementation, so that systems with older CPUs also get a plug-and-play version of the library:
- **pip install green-tsetlin**
- **pip install green-tsetlin[cpu]**

The TM has a number of hyperparameters to set. We can optimize them with the built-in Optuna-based optimizer, `green_tsetlin.hpsearch.HyperparameterSearch()`.

HyperparameterSearch:

- **Search spaces**: Set the desired search space for each parameter. A tuple such as `(1, 4)` searches between 1 and 4; a single value such as `4` fixes the parameter to 4. For example, `clause_space=(50, 250)` or `clause_space=125`.

- **Literal budget**: Optimize for a minimum literal budget by setting `minimize_literal_budget=True`.

- **Cross-validation**: Set `k_folds=k` to an integer $k > 2$ to run $k$-fold cross-validation on each trial.
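
To make the tuple-versus-single-value convention concrete, here is a minimal sketch of how such a search-space argument could be interpreted. `normalize_space` is a hypothetical helper written for illustration; it is not part of green-tsetlin and makes no claim about how the library handles the argument internally.

```python
def normalize_space(space):
    """Turn a search-space argument into an inclusive (low, high) pair.

    A tuple (low, high) is kept as a range; a single value v becomes (v, v),
    i.e. the parameter is fixed and not searched.
    """
    if isinstance(space, tuple):
        low, high = space
        if low > high:
            raise ValueError(f"invalid search space: {space}")
        return low, high
    return space, space


print(normalize_space((50, 250)))  # (50, 250)
print(normalize_space(125))        # (125, 125)
```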

HyperparameterSearch.optimize:

- Run the optimization for `n_trials` trials. To persist the study, pass a database URL as `storage`, e.g. `"sqlite:///my_database.db"`.

See the [Optuna documentation for `create_study`](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.create_study.html).

In [None]:
from green_tsetlin.hpsearch import HyperparameterSearch


hpsearch = HyperparameterSearch(s_space=(2.0, 20.0),
                                clause_space=(100, 1000),
                                threshold_space=(100, 1500),
                                max_epoch_per_trial=15,
                                literal_budget=(5, 10),
                                k_folds=2,
                                n_jobs=5,
                                seed=42,
                                minimize_literal_budget=False)

hpsearch.set_train_data(train_x_bin, train_y)
hpsearch.set_test_data(val_x_bin, val_y)

hpsearch.optimize(n_trials=30, study_name="IMDB hpsearch", show_progress_bar=True, storage=None)

We get the results by reading `best_trials` from the `HyperparameterSearch` instance, here `hpsearch.best_trials`.

See the [Optuna documentation for `Study.best_trials`](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#optuna.study.Study.best_trials).

In [None]:
params = hpsearch.best_trials[0].params
performance = hpsearch.best_trials[0].values

print("best parameters: ", params)
print("best score: ", performance)
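
In Optuna, `Study.best_trials` returns the trials on the Pareto front, i.e. the trials that no other trial dominates. The sketch below illustrates that idea in plain Python. It is not Optuna's or green-tsetlin's implementation: the helper names are hypothetical, the two objectives (a score to maximize and a literal budget to minimize) are an assumption, and the values are made up.

```python
def dominates(a, b, directions):
    """True if `a` is at least as good as `b` on every objective and better on one."""
    at_least_as_good = all(
        (x >= y) if d == "maximize" else (x <= y)
        for x, y, d in zip(a, b, directions)
    )
    strictly_better = any(
        (x > y) if d == "maximize" else (x < y)
        for x, y, d in zip(a, b, directions)
    )
    return at_least_as_good and strictly_better


def pareto_front(trial_values, directions):
    """Keep the trials that no other trial dominates."""
    return [
        v for v in trial_values
        if not any(dominates(other, v, directions) for other in trial_values)
    ]


# Made-up (score, literal budget) pairs, purely illustrative.
trial_values = [(0.80, 10), (0.82, 8), (0.78, 5), (0.79, 9)]
directions = ("maximize", "minimize")
front = pareto_front(trial_values, directions)
print(front)  # [(0.82, 8), (0.78, 5)]
```

With more than one objective, the first entry of such a front is only one of several Pareto-optimal trials, so it can be worth inspecting all entries of `best_trials` rather than only index 0.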