<a href="https://www.kaggle.com/code/mikedelong/knn-accuracy-0-5-with-img2vec-cpu?scriptVersionId=163629030" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
!pip install --quiet img2vec_pytorch
print('pip installed img2vec')

pip installed img2vec


In [2]:
from warnings import filterwarnings
filterwarnings(action='ignore', category=FutureWarning) # quiet a plotly issue
filterwarnings(action='ignore', category=UserWarning) # quiet an img2vec issue

Let's load our training data and use img2vec with resnet-18 to project it into a 512-dimensional space.

In [3]:
from img2vec_pytorch import Img2Vec
from PIL import Image
from arrow import now
from glob import glob
import pandas as pd
from os.path import basename

img2vec = Img2Vec(cuda=False, model='resnet-18', layer='default', layer_output_size=512)


# https://stackoverflow.com/a/952952
def flatten(arg):
    return [x for xs in arg for x in xs]

def get_from_glob(arg: str, tag: str) -> list:
    time_get = now()
    result = []
    for input_file in glob(pathname=arg):
        name = basename(input_file)
        try:
            with Image.open(fp=input_file, mode='r') as image:
                vector = img2vec.get_vec(image, tensor=True).numpy().reshape(512,)
                result.append(pd.Series(data=[tag, name, vector], index=['tag', 'name', 'value']))
        except RuntimeError:
            # we only have a few failures so we're just going to discard them
            print('runtime failure: {} {}'.format(tag, name))
    print('encoded {} data in {}'.format(tag, now() - time_get))
    return result

time_start = now()
train = {
    'dry' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/train/dry/*.jpg',
    'normal' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/train/normal/*.jpg',
    'oily' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/train/oily/*.jpg'
}
train_data = [get_from_glob(arg=value, tag=key) for key, value in train.items()]
# we can lump the validation data here too because otherwise we will not use it
valid = {
    'dry' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/valid/dry/*.jpg',
    'normal' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/valid/normal/*.jpg',
    'oily' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/valid/oily/*.jpg'
}
valid_data = [get_from_glob(arg=value, tag=key) for key, value in valid.items()]
df = pd.DataFrame(data=flatten(arg=train_data) + flatten(arg=valid_data))
    
print('done in {}'.format(now() - time_start))

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 170MB/s]


encoded dry data in 0:00:31.386060
encoded normal data in 0:00:53.554957
encoded oily data in 0:00:48.015466
encoded dry data in 0:00:03.582526
encoded normal data in 0:00:05.214893
encoded oily data in 0:00:03.834714
done in 0:02:25.790695
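
Each row of `df` now holds its embedding as a single 512-element array in the `value` column, and the later cells expand that column with `df['value'].apply(pd.Series)`. As a rough sketch of an alternative, assuming every vector has the same length, the arrays can be stacked directly into a matrix with NumPy. The toy frame and its 4-element vectors below are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# toy stand-in for df (hypothetical data): three rows, each with a fixed-length vector
toy_df = pd.DataFrame(data={
    'tag': ['dry', 'normal', 'oily'],
    'name': ['a.jpg', 'b.jpg', 'c.jpg'],
    'value': [np.arange(4, dtype=np.float32) + i for i in range(3)],
})

# stack the per-row arrays into one (n_rows, n_features) matrix
features = np.stack(toy_df['value'].tolist())
print(features.shape)
```

The resulting matrix could be passed anywhere the expanded frame is used, such as a classifier's `fit` call; whether it is noticeably faster on this dataset has not been measured here.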


Now let's get our test data.

In [4]:
test = {
    'dry' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/test/dry/*.jpg',
    'normal' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/test/normal/*.jpg',
    'oily' : '/kaggle/input/oily-dry-and-normal-skin-types-dataset/Oily-Dry-Skin-Types/test/oily/*.jpg'
}

time_start = now()
test_data = [get_from_glob(arg=value, tag=key) for key, value in test.items()]
test_df = pd.DataFrame(data=flatten(arg=test_data))
print('done in {}'.format(now() - time_start))

encoded dry data in 0:00:01.758551
encoded normal data in 0:00:02.785400
encoded oily data in 0:00:01.849297
done in 0:00:06.404133


How are our classes distributed in our training data?

In [5]:
from plotly.express import histogram
histogram(data_frame=df, x='tag')

In [6]:
print(df['tag'].value_counts(normalize=True).to_dict())

{'normal': 0.4025844930417495, 'oily': 0.35785288270377735, 'dry': 0.23956262425447317}


We have unbalanced classes, but not terribly so.
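
To put a number on "not terribly so", here is a small sketch that takes the proportions printed above and works out the majority-class share and the ratio between the largest and smallest class. The variable names are ours and only the printed proportions come from the notebook.

```python
# class proportions copied from the value_counts output above
proportions = {
    'normal': 0.4025844930417495,
    'oily': 0.35785288270377735,
    'dry': 0.23956262425447317,
}

majority = max(proportions, key=proportions.get)
minority = min(proportions, key=proportions.get)

# share of rows a constant "always guess the majority" rule would get right on this data
baseline = proportions[majority]
# how many times larger the biggest class is than the smallest
imbalance_ratio = proportions[majority] / proportions[minority]

print('majority: {} baseline: {:5.4f} ratio: {:4.2f}'.format(majority, baseline, imbalance_ratio))
```

The largest class is well under twice the size of the smallest, which is consistent with calling the imbalance mild.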

We have a high-dimensional space; let's use a short UMAP run (200 epochs) to project it into two dimensions and take a look.

In [7]:
from umap import UMAP
umap = UMAP(random_state=2024, verbose=True, n_jobs=1, low_memory=False, n_epochs=200)
plot_df = pd.concat(objs=[df, pd.DataFrame(data=umap.fit_transform(X=df['value'].apply(pd.Series)), columns=['x', 'y'])], axis=1)
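# time_start was last set in the test-data cell, so this elapsed time also covers the cells in between
print('done with UMAP in {}'.format(now() - time_start))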

2024-02-20 22:40:28.949626: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-20 22:40:28.949767: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-20 22:40:29.125747: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


UMAP(low_memory=False, n_epochs=200, n_jobs=1, random_state=2024, verbose=True)
Tue Feb 20 22:40:41 2024 Construct fuzzy simplicial set
Tue Feb 20 22:40:46 2024 Finding Nearest Neighbors
Tue Feb 20 22:40:49 2024 Finished Nearest Neighbor Search
Tue Feb 20 22:40:51 2024 Construct embedding


	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Tue Feb 20 22:40:54 2024 Finished embedding
done with UMAP in 0:00:51.189215


In [8]:
from plotly.express import scatter
scatter(data_frame=plot_df, x='x', y='y', color='tag', hover_name='name', height=900)

We don't see much separation between our classes, but we do see a lot of local clustering in twos and threes, so there is some signal here that UMAP can find. Let's see how we do with a simple classifier.

In [9]:
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from arrow import now

# expand the vector column once instead of on every iteration
X_train = df['value'].apply(pd.Series)
X_test = test_df['value'].apply(pd.Series)
labels = test_df['tag'].unique().tolist()

best_k = 1
best = 0
# try each k and keep the one with the best weighted F1 on the test data
for n_neighbors in range(2, 15):
    current = KNeighborsClassifier(n_neighbors=n_neighbors)
    current.fit(X=X_train, y=df['tag'])
    score = f1_score(average='weighted', labels=labels, y_true=test_df['tag'], y_pred=current.predict(X=X_test))
    if score > best:
        best = score
        best_k = n_neighbors
    print('neighbors: {} score: {:5.4f}'.format(n_neighbors, score))
        
time_start = now()
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X=df['value'].apply(pd.Series), y=df['tag'])
print(classification_report(labels=test_df['tag'].unique().tolist(), y_true=test_df['tag'], y_pred=knn.predict(X=test_df['value'].apply(pd.Series))))
print('model time: {}'.format(now() - time_start))

neighbors: 2 score: 0.4943
neighbors: 3 score: 0.4880
neighbors: 4 score: 0.4994
neighbors: 5 score: 0.4525
neighbors: 6 score: 0.4324
neighbors: 7 score: 0.4174
neighbors: 8 score: 0.4615
neighbors: 9 score: 0.4513
neighbors: 10 score: 0.4290
neighbors: 11 score: 0.4213
neighbors: 12 score: 0.4484
neighbors: 13 score: 0.4429
neighbors: 14 score: 0.3843
              precision    recall  f1-score   support

         dry       0.39      0.34      0.36        35
      normal       0.62      0.61      0.62        59
        oily       0.42      0.47      0.45        40

    accuracy                           0.50       134
   macro avg       0.48      0.48      0.48       134
weighted avg       0.50      0.50      0.50       134

model time: 0:00:00.274466
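
The two average rows in the report can be checked by hand from the per-class rows. The sketch below redoes that arithmetic for the f1-score column, using only the rounded values printed above, so the results agree with the report to about two decimal places.

```python
# per-class (f1-score, support) pairs copied from the classification report above
report = {'dry': (0.36, 35), 'normal': (0.62, 59), 'oily': (0.45, 40)}

total = sum(support for _, support in report.values())

# macro average: plain mean of the per-class scores
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
# weighted average: each class score weighted by its share of the test rows
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print('support: {} macro: {:4.2f} weighted: {:4.2f}'.format(total, macro_f1, weighted_f1))
```

The weighted figure sits above the macro figure because the best-scoring class, normal, is also the largest.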


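Our accuracy is about 50%. For comparison, always guessing the largest class (normal) would be correct about 44% of the time on the test data (59 of 134 images), or about 40% going by the training distribution.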
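
One caveat: the loop above chooses the number of neighbors by scoring on the test data, so the reported score may be a little optimistic. A possible alternative, sketched here and not what this notebook does, is to choose it by cross-validation on the training data alone. The synthetic data from `make_classification` is a stand-in for the embeddings and tags, and the fold count of 5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic three-class stand-in for the training embeddings (illustrative only)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=2024)

cv_scores = {}
for n_neighbors in range(2, 15):
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    # mean weighted F1 over 5 folds of the training data; the test data is never touched
    cv_scores[n_neighbors] = cross_val_score(model, X, y, cv=5, scoring='f1_weighted').mean()

best_k = max(cv_scores, key=cv_scores.get)
print('best k by cross-validation: {}'.format(best_k))
```

With the real data, `X` and `y` would be the training features and `df['tag']`, and the test data would be scored once with the chosen value.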