<a href="https://www.kaggle.com/code/mikedelong/rf-acc-0-74-with-img2vec-cpu?scriptVersionId=164030749" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
!pip install --quiet img2vec_pytorch
print('pip installed img2vec')

pip installed img2vec


In [2]:
from warnings import filterwarnings
filterwarnings(action='ignore', category=FutureWarning) # quiet a plotly issue
filterwarnings(action='ignore', category=UserWarning) # quiet an img2vec issue

In [3]:
from img2vec_pytorch import Img2Vec
from PIL import Image
from arrow import now
from glob import glob
import pandas as pd
from os.path import basename

SIZE = 512

# https://stackoverflow.com/a/952952
def flatten(arg):
    return [x for xs in arg for x in xs]

# we're going to just read a few pictures while we're building
def get_from_glob(arg: str, tag: str, stop: int) -> list:
    time_get = now()
    result = []
    for index, input_file in enumerate(glob(pathname=arg)):
        if index < stop:
            name = basename(input_file)
            try:
                with Image.open(fp=input_file, mode='r') as image:
                    vector = img2vec.get_vec(image, tensor=True).numpy().reshape(SIZE,)
                    result.append(pd.Series(data=[tag, name, vector], index=['tag', 'name', 'value']))
            except RuntimeError:
                # we only have a few failures so we're just going to discard them
                print('runtime failure: {} {}'.format(tag, name))
    print('encoded {} data {} rows in {}'.format(tag, len(result), now() - time_get))
    return result

STOP = 500 # we will load all our data with this limit on the number of instances per class

img2vec = Img2Vec(cuda=False, model='resnet-18', layer='default', layer_output_size=SIZE)

time_start = now()
train = {basename(folder) : folder + '/*.jpg' for folder in glob('/kaggle/input/pakistan-currency-dataset/Pakistan/Training/*')}
train_data = [get_from_glob(arg=value, tag=key, stop=STOP) for key, value in train.items()]
df = pd.DataFrame(data=flatten(arg=train_data))
    
print('done in {}'.format(now() - time_start))

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 170MB/s]


encoded 10Rs data 385 rows in 0:00:23.930624
encoded 1000Rsback data 356 rows in 0:00:23.453235
encoded 50Rs data 355 rows in 0:00:24.764226
encoded 5000Rs data 376 rows in 0:00:28.287364
encoded 10Rsback data 399 rows in 0:00:23.926458
encoded 500Rsback data 377 rows in 0:00:21.853393
encoded 20Rs data 367 rows in 0:00:29.456464
encoded 100Rs data 393 rows in 0:00:21.546029
encoded 5000Rsback data 381 rows in 0:00:27.645780
encoded 1000Rs data 347 rows in 0:00:22.028797
encoded 50Rsback data 369 rows in 0:00:25.886356
encoded 500Rs data 378 rows in 0:00:22.304083
encoded 100Rsback data 399 rows in 0:00:23.222542
encoded others data 500 rows in 0:00:59.927267
encoded 20Rsback data 378 rows in 0:00:28.067759
done in 0:06:46.752553
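The loader above keeps walking the whole glob result even after it has hit the limit, and glob returns files in arbitrary order, so which images make it under STOP can change between runs. A minimal sketch of one alternative, assuming we are happy to take the first files in sorted order; first_n is a name introduced here for illustration, and the file names are stand-ins:

```python
from itertools import islice

def first_n(paths, stop):
    """Return an iterator over the first `stop` paths in sorted order."""
    return islice(sorted(paths), stop)

# stand-in file names (hypothetical); in the notebook this would be glob(pathname=arg)
paths = ['c.jpg', 'a.jpg', 'b.jpg', 'd.jpg']
print(list(first_n(paths, stop=2)))
```

Sorting first makes the sample reproducible; islice then stops the loop as soon as the limit is reached instead of checking the index on every remaining file.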


In [4]:
from plotly.express import histogram
histogram(data_frame=df, x='tag')

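Our classes are unbalanced but not severely so: the bill classes have between 347 and 399 images each, and the others class reaches the STOP limit of 500.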
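To put a number on the imbalance, here is a quick sketch using the per-class row counts copied from the encoding log above. The imbalance_ratio helper is a name introduced here for illustration; it is not part of any library.

```python
# Row counts per class, copied from the "encoded ... rows" log lines above.
counts = {
    '10Rs': 385, '1000Rsback': 356, '50Rs': 355, '5000Rs': 376, '10Rsback': 399,
    '500Rsback': 377, '20Rs': 367, '100Rs': 393, '5000Rsback': 381, '1000Rs': 347,
    '50Rsback': 369, '500Rs': 378, '100Rsback': 399, 'others': 500, '20Rsback': 378,
}

def imbalance_ratio(class_counts: dict) -> float:
    """Largest class size divided by smallest class size (1.0 means perfectly balanced)."""
    return max(class_counts.values()) / min(class_counts.values())

bills_only = {tag: n for tag, n in counts.items() if tag != 'others'}
print('all classes: {:.2f}'.format(imbalance_ratio(counts)))
print('bills only:  {:.2f}'.format(imbalance_ratio(bills_only)))
```

The ratio is about 1.44 with the others class included and about 1.15 for the bill classes alone.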

In [5]:
from arrow import now
from umap import UMAP

time_start = now()
umap = UMAP(random_state=2024, verbose=True, n_jobs=1, low_memory=False, n_epochs=500)
plot_df = pd.concat(objs=[df, pd.DataFrame(data=umap.fit_transform(X=df['value'].apply(pd.Series)), columns=['x', 'y'])], axis=1)
print('done with UMAP in {}'.format(now() - time_start))

2024-02-23 17:28:08.764516: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-23 17:28:08.764668: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-23 17:28:08.928798: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


UMAP(low_memory=False, n_epochs=500, n_jobs=1, random_state=2024, verbose=True)
Fri Feb 23 17:28:21 2024 Construct fuzzy simplicial set
Fri Feb 23 17:28:21 2024 Finding Nearest Neighbors
Fri Feb 23 17:28:21 2024 Building RP forest with 9 trees
Fri Feb 23 17:28:25 2024 NN descent for 12 iterations
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	Stopping threshold met -- exiting after 7 iterations
Fri Feb 23 17:28:40 2024 Finished Nearest Neighbor Search
Fri Feb 23 17:28:43 2024 Construct embedding


Epochs completed:   0%|            0/500 [00:00]

	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Fri Feb 23 17:28:49 2024 Finished embedding
done with UMAP in 0:00:28.927592


In [6]:
from plotly.express import scatter
scatter(data_frame=plot_df, x='x', y='y', color='tag', hover_name='name', height=900).show()

This is not encouraging: UMAP clusters our pictures, but not by tag. Let's be hopeful and build a model anyway.
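One way to back up the visual impression with a number is to ask how often a point's nearest neighbors in the 2-D embedding share its tag. The sketch below is a plain-Python illustration on made-up coordinates, not the notebook's data; neighbor_purity is a hypothetical helper name.

```python
from math import dist

def neighbor_purity(points, tags, k=3):
    """Mean fraction of each point's k nearest neighbors (Euclidean) that share its tag."""
    total = 0.0
    for i, p in enumerate(points):
        # rank every other point by distance to p and keep the k closest
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        nearest = others[:k]
        total += sum(tags[j] == tags[i] for j in nearest) / k
    return total / len(points)

# toy 2-D embedding: two tight groups far apart (made-up coordinates)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
tags_by_group = ['a', 'a', 'a', 'b', 'b', 'b']   # tags follow the groups
tags_mixed = ['a', 'b', 'a', 'b', 'a', 'b']      # tags ignore the groups

print(neighbor_purity(points, tags_by_group, k=2))
print(neighbor_purity(points, tags_mixed, k=2))
```

A purity near 1 means the clusters line up with the tags; a value closer to the chance level means they don't. This version compares every pair of points, so for the full training set a library nearest-neighbor search would likely be a better fit.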

In [7]:
from arrow import now

# we don't have test data so let's use the validation set as our test data
test = {basename(folder) : folder + '/*.jpg' for folder in glob('/kaggle/input/pakistan-currency-dataset/Pakistan/Valid/*')}

time_start = now()
test_data = [get_from_glob(arg=value, tag=key, stop=STOP) for key, value in test.items()]
test_df = pd.DataFrame(data=flatten(arg=test_data))
print('done in {}'.format(now() - time_start))

encoded 10Rs data 9 rows in 0:00:00.808911
encoded 1000Rsback data 9 rows in 0:00:00.798720
encoded 50Rs data 10 rows in 0:00:01.311455
encoded 5000Rs data 10 rows in 0:00:00.940826
encoded 10Rsback data 8 rows in 0:00:01.057041
encoded 500Rsback data 10 rows in 0:00:00.597212
encoded 20Rs data 10 rows in 0:00:01.798703
encoded 100Rs data 10 rows in 0:00:01.054656
encoded 5000Rsback data 10 rows in 0:00:00.696922
encoded 1000Rs data 9 rows in 0:00:01.022298
encoded 50Rsback data 8 rows in 0:00:01.231272
encoded 500Rs data 10 rows in 0:00:00.544999
encoded 100Rsback data 10 rows in 0:00:01.594859
encoded others data 52 rows in 0:00:06.680688
encoded 20Rsback data 10 rows in 0:00:01.253949
done in 0:00:21.421536


In [8]:
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from arrow import now

best_k = 1
best = 0
# let's step through a range of neighbor counts to find the k that gives us the best weighted F1 on the validation set
for n_neighbors in range(2, 8):
    current = KNeighborsClassifier(n_neighbors=n_neighbors)
    current.fit(X=df['value'].apply(pd.Series), y=df['tag'])
    score = f1_score(average='weighted', labels=test_df['tag'].unique().tolist(), y_true=test_df['tag'], y_pred=current.predict(X=test_df['value'].apply(pd.Series)))
    if score > best:
        best = score
        best_k = n_neighbors
    print('neighbors: {} score: {:5.4f}'.format(n_neighbors, score))
        
time_start = now()
print('building best-k model with k = {}'.format(best_k))
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X=df['value'].apply(pd.Series), y=df['tag'])
print(classification_report(labels=test_df['tag'].unique().tolist(), y_true=test_df['tag'], y_pred=knn.predict(X=test_df['value'].apply(pd.Series))))
print('model time: {}'.format(now() - time_start))

neighbors: 2 score: 0.5879
neighbors: 3 score: 0.6654
neighbors: 4 score: 0.6559
neighbors: 5 score: 0.6705
neighbors: 6 score: 0.6674
neighbors: 7 score: 0.6741
building best-k model with k = 7
              precision    recall  f1-score   support

        10Rs       0.56      0.56      0.56         9
  1000Rsback       0.42      0.89      0.57         9
        50Rs       0.86      0.60      0.71        10
      5000Rs       0.54      0.70      0.61        10
    10Rsback       0.33      0.25      0.29         8
   500Rsback       0.45      0.50      0.48        10
        20Rs       0.64      0.70      0.67        10
       100Rs       0.27      0.40      0.32        10
  5000Rsback       0.60      0.30      0.40        10
      1000Rs       0.62      0.89      0.73         9
    50Rsback       0.86      0.75      0.80         8
       500Rs       0.88      0.70      0.78        10
   100Rsback       0.70      0.70      0.70        10
      others       1.00      0.81      0.89     

A weighted F1 of about 0.67 overall isn't bad, but a lot of that is just determining which pictures are currency and which ones aren't: the others class alone scores an F1 of 0.89. Let's try again without the others class, and let's look at the confusion matrix.
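A rough check of how much the others class lifts the overall number. This is a reconstruction, not a rerun: correct counts are recovered by rounding recall × support from the reports in this notebook, the 20Rsback row (cut off in the report above) is taken from the report without the others class further down, and the others support of 52 comes from the validation encoding log.

```python
# (recall, support) per bill class, copied from the classification reports
bills = {
    '10Rs': (0.56, 9), '1000Rsback': (0.89, 9), '50Rs': (0.60, 10), '5000Rs': (0.70, 10),
    '10Rsback': (0.25, 8), '500Rsback': (0.50, 10), '20Rs': (0.70, 10), '100Rs': (0.40, 10),
    '5000Rsback': (0.30, 10), '1000Rs': (0.89, 9), '50Rsback': (0.75, 8), '500Rs': (0.70, 10),
    '100Rsback': (0.70, 10), '20Rsback': (0.60, 10),
}
others = (0.81, 52)

def correct(recall: float, support: int) -> int:
    # the report rounds recall to two decimals, so round back to a whole count
    return round(recall * support)

bills_correct = sum(correct(r, s) for r, s in bills.values())
bills_total = sum(s for _, s in bills.values())
all_correct = bills_correct + correct(*others)
all_total = bills_total + others[1]

print('bills only:  {}/{} = {:.3f}'.format(bills_correct, bills_total, bills_correct / bills_total))
print('with others: {}/{} = {:.3f}'.format(all_correct, all_total, all_correct / all_total))
```

Under these assumptions the others class adds roughly five or six points of accuracy on top of what the bill classes manage on their own.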

In [9]:
from plotly.express import imshow
from sklearn.metrics import confusion_matrix
# drop the others class, and fix the label order so the matrix rows and columns match their names
bills_df = test_df[test_df['tag'] != 'others']
labels = bills_df['tag'].unique().tolist()
y_pred = knn.predict(X=bills_df['value'].apply(pd.Series))
confusion_df = pd.DataFrame(data=confusion_matrix(y_true=bills_df['tag'], y_pred=y_pred, labels=labels), columns=labels, index=labels)

print(classification_report(labels=labels, y_true=bills_df['tag'], y_pred=y_pred))
imshow(img=confusion_df).show()

              precision    recall  f1-score   support

        10Rs       0.56      0.56      0.56         9
  1000Rsback       0.47      0.89      0.62         9
        50Rs       0.86      0.60      0.71        10
      5000Rs       0.54      0.70      0.61        10
    10Rsback       0.33      0.25      0.29         8
   500Rsback       0.71      0.50      0.59        10
        20Rs       0.70      0.70      0.70        10
       100Rs       0.31      0.40      0.35        10
  5000Rsback       0.60      0.30      0.40        10
      1000Rs       0.62      0.89      0.73         9
    50Rsback       1.00      0.75      0.86         8
       500Rs       0.88      0.70      0.78        10
   100Rsback       0.70      0.70      0.70        10
    20Rsback       0.67      0.60      0.63        10

    accuracy                           0.61       133
   macro avg       0.64      0.61      0.61       133
weighted avg       0.64      0.61      0.61       133
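For reference, the two averages at the bottom of the report can be reproduced by hand: the macro average is the plain mean of the per-class scores, and the weighted average weights each class by its support. A small sketch using the f1-score and support columns copied from the report above:

```python
# (f1-score, support) per class, copied from the report above
rows = {
    '10Rs': (0.56, 9), '1000Rsback': (0.62, 9), '50Rs': (0.71, 10), '5000Rs': (0.61, 10),
    '10Rsback': (0.29, 8), '500Rsback': (0.59, 10), '20Rs': (0.70, 10), '100Rs': (0.35, 10),
    '5000Rsback': (0.40, 10), '1000Rs': (0.73, 9), '50Rsback': (0.86, 8), '500Rs': (0.78, 10),
    '100Rsback': (0.70, 10), '20Rsback': (0.63, 10),
}

total_support = sum(n for _, n in rows.values())
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)
weighted_f1 = sum(f1 * n for f1, n in rows.values()) / total_support

print('support: {}'.format(total_support))
print('macro f1: {:.2f}  weighted f1: {:.2f}'.format(macro_f1, weighted_f1))
```

The two averages agree here because every bill class has between 8 and 10 validation images; they would drift apart with a large class like others in the mix.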



This is a tough problem and there doesn't seem to be anything systematic about the errors in this result.
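A heatmap can hide patterns, so one way to double-check for systematic errors is to rank the (true, predicted) pairs that the model gets wrong. A minimal sketch on made-up labels; top_confusions is a hypothetical helper, and the labels below are for illustration only:

```python
from collections import Counter

def top_confusions(y_true, y_pred, n=5):
    """Return the n most frequent (true, predicted) pairs where the prediction was wrong."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)

# made-up labels for illustration only
y_true = ['10Rs', '10Rs', '10Rsback', '100Rs', '100Rs', '100Rs']
y_pred = ['10Rs', '10Rsback', '10Rs', '1000Rs', '1000Rs', '100Rs']

for (true_tag, predicted_tag), count in top_confusions(y_true, y_pred):
    print('{} -> {}: {}'.format(true_tag, predicted_tag, count))
```

If the same few pairs dominated the list (front versus back of the same note, say), that would point to a systematic error; a long flat list would support the reading above.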

In [10]:
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from arrow import now

best_estimators = 0
best = 0
# let's step through a range of estimator counts to find the one that gives us the best weighted F1 on the validation set
for n_estimators in range(285, 315, 5):
    current = RandomForestClassifier(random_state=2024, n_estimators=n_estimators)
    current.fit(X=df['value'].apply(pd.Series), y=df['tag'])
    score = f1_score(average='weighted', labels=test_df['tag'].unique().tolist(), y_true=test_df['tag'], y_pred=current.predict(X=test_df['value'].apply(pd.Series)))
    if score > best:
        best = score
        best_estimators = n_estimators
    print('estimators: {} score: {:5.4f}'.format(n_estimators, score))
        
time_start = now()
print('building best model with estimators count = {}'.format(best_estimators))
forest = RandomForestClassifier(verbose=0, random_state=2024, n_estimators=best_estimators)
forest.fit(X=df['value'].apply(pd.Series), y=df['tag'])
print(classification_report(labels=test_df['tag'].unique().tolist(), y_true=test_df['tag'], y_pred=forest.predict(X=test_df['value'].apply(pd.Series))))
print('model time: {}'.format(now() - time_start))

estimators: 285 score: 0.7203
estimators: 290 score: 0.7278
estimators: 295 score: 0.7284
estimators: 300 score: 0.7341
estimators: 305 score: 0.7344
estimators: 310 score: 0.7314
building best model with estimators count = 305
              precision    recall  f1-score   support

        10Rs       0.50      0.56      0.53         9
  1000Rsback       0.62      0.56      0.59         9
        50Rs       1.00      0.30      0.46        10
      5000Rs       0.50      0.90      0.64        10
    10Rsback       0.43      0.38      0.40         8
   500Rsback       0.70      0.70      0.70        10
        20Rs       0.82      0.90      0.86        10
       100Rs       0.54      0.70      0.61        10
  5000Rsback       0.71      0.50      0.59        10
      1000Rs       0.58      0.78      0.67         9
    50Rsback       1.00      0.62      0.77         8
       500Rs       0.77      1.00      0.87        10
   100Rsback       0.67      0.60      0.63        10
      others       0.

We can do marginally better overall by using random forests and throwing lots of estimators at the problem (weighted F1 of 0.73 versus 0.67 for KNN), but the F1 scores for some of our bill classes are still hovering around 0.5.
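It helps to remember that F1 is the harmonic mean of precision and recall, so a class with perfect precision can still land below 0.5 if recall is poor. A short sketch that recomputes two rows of the random forest report from their precision and recall columns:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# precision and recall copied from the random forest report above
print('50Rs:   {:.2f}'.format(f1(1.00, 0.30)))
print('5000Rs: {:.2f}'.format(f1(0.50, 0.90)))
```

The 50Rs class is never predicted wrongly, but the forest only finds 3 of its 10 validation images, which is what drags its F1 down to 0.46.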