MNIST: 28x28 grayscale image multi-class classification
=========================================================

In this tutorial we show how the green\_tsetlin Tsetlin Machine (TM) can be used to train on the **MNIST dataset**. MNIST is a benchmark for digit recognition
that contains 70,000 images of handwritten digits. Each image is a 28x28 pixel grayscale image with values between 0 and 255.

For further documentation head over to the original source:

https://github.com/ooki/green_tsetlin

https://green-tsetlin.readthedocs.io/en/latest/

And have a look at the definition of the MNIST dataset:

https://botpenguin.com/glossary/mnist-dataset

# Training of a TM model

Install the Python library.


In [1]:
!pip install green-tsetlin

Collecting green-tsetlin
  Downloading green_tsetlin-1.0.1.tar.gz (4.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting optuna (from green-tsetlin)
  Downloading optuna-4.0.0-py3-none-any.whl.metadata (16 kB)
Collecting alembic>=1.5.0 (from optuna->green-tsetlin)
  Downloading alembic-1.13.3-py3-none-any.whl.metadata (7.4 kB)
Collecting colorlog (from optuna->green-tsetlin)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting Mako (from alembic>=1.5.0->optuna->green-tsetlin)
  Downloading Mako-1.3.6-py3-none-any.whl.metadata (2.9 kB)
Downloading optuna-4.0.0-py3-none-any.whl (362 kB)

Fetch the MNIST dataset and split it into training and test sets (80% / 20%).

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split as split
import numpy as np

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

X_train, X_test, y_train, y_test = split(X, y, test_size=0.2, random_state=42, shuffle=True)
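
As a quick sanity check of the 80% / 20% split, the expected set sizes can be worked out by hand. The sketch below assumes the full 70,000-image dataset; `split_sizes` is a hypothetical helper written for this illustration and is not part of scikit-learn.

```python
def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size (hypothetical helper)."""
    n_test = round(n_samples * test_size)  # share reserved for testing
    n_train = n_samples - n_test           # everything else is training data
    return n_train, n_test

n_train, n_test = split_sizes(70_000, 0.2)
print(n_train, n_test)  # 56000 14000
```

Comparing these numbers with `X_train.shape[0]` and `X_test.shape[0]` is a cheap way to confirm that the split behaved as intended.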

Optionally restrict the training and test datasets to the first 10,000 and 1,000 samples respectively.

In [4]:
# OPTIONAL: THIS ACCELERATES TRAINING BUT REDUCES INFERENCE ACCURACY.
# NOT RECOMMENDED: SKIP THIS CELL TO TRAIN AND EVALUATE ON THE FULL SPLIT.

X_train = X_train[:10000]
y_train = y_train[:10000]
X_test = X_test[:1000]
y_test = y_test[:1000]

Convert the arrays to binary format using a threshold of 75: pixel values above 75 become 1, all others become 0.

In [3]:
X_train = np.where(X_train > 75, 1, 0)
X_train = X_train.astype(np.uint8)

X_test = np.where(X_test > 75, 1, 0)
X_test = X_test.astype(np.uint8)

y_train = y_train.astype(np.uint32)
y_test = y_test.astype(np.uint32)
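
To make the thresholding rule concrete, here is a plain-Python sketch of what a `np.where(X > 75, 1, 0)` call does to a single row of pixel values. The helper name `binarize` and the sample values are made up for illustration. Note that the comparison is strict, so a pixel of exactly 75 maps to 0.

```python
THRESHOLD = 75  # same threshold as used in the notebook

def binarize(pixels, threshold=THRESHOLD):
    """Map each grayscale value (0-255) to 1 if it is above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in pixels]

sample_row = [0, 12, 75, 76, 200, 255]  # illustrative values, not real MNIST data
binary_row = binarize(sample_row)
print(binary_row)  # [0, 0, 0, 1, 1, 1]
```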

We can run a hyperparameter search, but this step is optional.

In [None]:
# NOT REQUIRED FOR TYPICAL TRAINING!
from green_tsetlin.hpsearch import HyperparameterSearch


hpsearch = HyperparameterSearch(s_space=(3.0, 40.0),
                                clause_space=(1000, 8000),
                                threshold_space=(1000, 8000),
                                max_epoch_per_trial=20,
                                literal_budget=(5, 10),
                                k_folds=3,
                                n_jobs=4,
                                seed=42,
                                minimize_literal_budget=False)

hpsearch.set_train_data(X_train, y_train)
hpsearch.set_eval_data(X_test, y_test)

hpsearch.optimize(n_trials=10, study_name="MNIST hpsearch", show_progress_bar=True, storage=None)

[I 2024-10-25 13:14:21,803] A new study created in memory with name: MNIST hpsearch
Processing trial 9 of 10, best score: [0.9825422804146209]: 100%|██████████| 10/10 [49:32<00:00, 297.26s/it]


**TODO: How to output best parameters?!**

### Best parameters

best parameters:  {'s': 21.627727185060525, 'n_clauses': 6154, 'threshold': 1218, 'literal_budget': 10}

best score:  0.9937278429233706

Initialize a Tsetlin machine.

In [4]:
import green_tsetlin as gt

# The number of clauses (n_clauses) directly affects the size of the trained model.

# Original parameters:
best_params = {'s': 21.627727185060525, 'n_clauses': 6154, 'threshold': 1218, 'literal_budget': 10}

# Half the number of clauses (6154/2 -> 3077):
#best_params = {'s': 21.627727185060525, 'n_clauses': 3077, 'threshold': 1218, 'literal_budget': 10}

tm = gt.TsetlinMachine(n_literals=28*28,
                        n_clauses=best_params['n_clauses'],
                        s=best_params['s'],
                        threshold=int(best_params['threshold']),
                        n_classes=10,
                        literal_budget=best_params['literal_budget'])

Run training.

In [5]:
# The number of folds (k_folds) affects execution time.

# Halve the training time:
trainer = gt.Trainer(tm, k_folds=1, n_epochs=20, seed=42, n_jobs=7, progress_bar=True)

# Original code:
#trainer = gt.Trainer(tm, k_folds=2, n_epochs=20, seed=42, n_jobs=7, progress_bar=True)

trainer.set_train_data(X_train, y_train)
trainer.set_eval_data(X_test, y_test)

res = trainer.train()

Processing epoch 20 of 20, train acc: 0.992, best eval score: 0.975 (epoch: 19): 100%|██████████| 20/20 [14:57<00:00, 44.89s/it]


Print training results.

TODO: Parameterize dict when using more than one training run (i.e. k_folds>=2).

In [6]:
#res
print(f"Number of epochs: {res['n_epochs']}")
print(f"Best eval score: {res['best_eval_score']}")
print(f"Best eval epoch: {res['best_eval_epoch']}")

Number of epochs: 20
Best eval score: 0.9749285714285715
Best eval epoch: 19
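
The reported score can be read as the fraction of correctly classified test images. The sketch below is plain arithmetic and assumes the evaluation set is the full 20% split of 70,000 images (14,000 samples); under that assumption the score corresponds to a whole number of correct predictions.

```python
best_eval_score = 0.9749285714285715  # value printed above
n_eval = 14_000                       # assumption: full 20% test split of 70,000 images

correct = round(best_eval_score * n_eval)
misclassified = n_eval - correct

print(f"Correct: {correct} of {n_eval}")  # Correct: 13649 of 14000
print(f"Misclassified: {misclassified}")  # Misclassified: 351
```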


**Saving and reloading the TM**

Save as .npz state file (optional)

In [7]:
tm.save_state("tsetlin_state.npz")

Download .npz state file (optional)

In [9]:
from google.colab import files
files.download('tsetlin_state.npz')

Upload .npz state file (optional) - please select the file called 'tsetlin_state.npz'.

In [None]:
from google.colab import files
uploaded = files.upload()
for filename in uploaded.keys():
    print(f'Uploaded file: {filename}')

Reload state from .npz file (optional)

In [8]:
tm.load_state("tsetlin_state.npz")

**Export trained model as trained_votes.h file**

Copy trained_votes.h (see below) to the local directory where you cloned this project from GitHub.

Run 'make predict' after creating a pixel_data.h with the pixel_painter.py script.

Check pixel_data.h "with your own eyes" by opening it with your favourite editor.

Export as trained_votes.h

According to: https://green-tsetlin.readthedocs.io/en/latest/userguide.html

In [10]:
predictor = tm.get_predictor(explanation="literals", exclude_negative_clauses=False)
predictor.export_as_program("trained_votes.h")

Display file size of trained_votes.h

In [13]:
import os
print(f"'trained_votes.h' - file size: {os.path.getsize('trained_votes.h') / 1024:.2f} KB")

'trained_votes.h' - file size: 1429.46 KB


Download trained_votes.h

In [14]:
from google.colab import files
files.download('trained_votes.h')

# Classifying .pkl (Pickle) images with Colab

**Either** - Use a demo image from GitHub - pixel_data.pkl

In [None]:
# TODO
!git clone https://github.com/mongoq/Green-Tsetlin-Machine
#!cp file to /content/ **TODO**
#!file test.pkl

**Or**  - Upload pixel_data.pkl created with pixel_painter.py script.

In [15]:
from google.colab import files
uploaded = files.upload()
for filename in uploaded.keys():
    print(f'Uploaded file: {filename}')

Saving pixel_data.pkl to pixel_data.pkl
Uploaded file: pixel_data.pkl


Display binary pixels parsed from pixel_data.pkl.

In [16]:
import pickle

try:
    with open("pixel_data.pkl", "rb") as pfile:
        pixel_data = pickle.load(pfile)
        for row in pixel_data:
            print(" ".join(str(value) for value in row))
except Exception as e:
    print(e)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 
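
Before running inference it can help to confirm that the loaded image has the expected layout. The following sketch assumes, based on the printout above, that `pixel_data.pkl` holds a list of 28 rows with 28 binary values each. The helper `check_pixel_grid` is hypothetical, and the example writes its own temporary file rather than relying on one produced by pixel_painter.py.

```python
import os
import pickle
import tempfile

SIDE = 28  # MNIST images are 28x28 pixels

def check_pixel_grid(grid, side=SIDE):
    """Return True if grid is side x side and contains only 0/1 values."""
    if len(grid) != side:
        return False
    return all(len(row) == side and all(v in (0, 1) for v in row) for row in grid)

# Build a dummy image: a single vertical stroke in column 14.
dummy = [[1 if col == 14 else 0 for col in range(SIDE)] for _ in range(SIDE)]

# Round-trip through a pickle file, as the notebook does with pixel_data.pkl.
path = os.path.join(tempfile.mkdtemp(), "pixel_data.pkl")
with open(path, "wb") as pfile:
    pickle.dump(dummy, pfile)
with open(path, "rb") as pfile:
    loaded = pickle.load(pfile)

is_valid = check_pixel_grid(loaded)
flat = [v for row in loaded for v in row]  # 784 values, row by row
print(is_valid, len(flat), sum(flat))
```

A check like this catches truncated or mis-sized drawings early, before they reach the predictor.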

Run inference (and time it!).

In [17]:
import time

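# Note: the timed region below covers only tm.get_predictor().
# predictor.predict() is called inside the print statement, after end_time
# is taken, so the prediction itself is not part of the reported duration.
start_time = time.time()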

predictor = tm.get_predictor()

end_time = time.time()
duration = end_time - start_time

print(f"I can see a digit: {predictor.predict(pixel_data)}")
print(f"Inference took: {duration:.4f} sec.")

I can see a digit: 2
Inference took: 0.0855 sec.
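
When timing inference, only the call being measured should sit between the two timestamps. The sketch below shows that pattern with `time.perf_counter()`, which is intended for measuring short durations. `fake_predict` and `timed_call` are hypothetical names: the former stands in for `predictor.predict` so the example runs without a trained model.

```python
import time

def fake_predict(pixels):
    """Stand-in for predictor.predict (hypothetical): returns a fixed digit."""
    return 2

def timed_call(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

pixels = [0] * (28 * 28)  # dummy flattened image
digit, duration = timed_call(fake_predict, pixels)
print(f"I can see a digit: {digit}")
print(f"Inference took: {duration:.6f} sec.")
```

With a real model, passing `predictor.predict` and the pixel data to `timed_call` would report the prediction time separately from the time needed to build the predictor.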
