# Midterm Assignment Report

In this notebook I compare the performance of the 1-NN, 3-NN and nearest centroid classifiers on the `CIFAR-10` dataset.

## Load the data

- I import the `numpy` library because the dictionaries returned by `unpickle()` contain numpy arrays, which I will need to manipulate.

- I use the `unpickle(file)` function recommended on the CIFAR-10 site: https://www.cs.toronto.edu/~kriz/cifar.html

- In this cell I store the batch names in a list called `file_names`, and the 6 dictionaries returned by `unpickle()` in a list called `files`.

- At the end I display the number of files that were loaded.

In [1]:
import numpy as np

file_names = []
files = []

def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')  # keys come back as bytes, e.g. b'data'
    return batch


for i in range(1,6):
    file = f"data_batch_{i}"
    file_names.append(file)

file_names.append("test_batch")

for file in file_names:
    cifar10_dict = unpickle(file)
    files.append(cifar10_dict)

n_of_files = len(file_names)
n_of_files

6

## Inspect the list `files`
* This list holds 6 dictionaries.
* Below I print the keys of every dictionary.
* There are 5 training batches and 1 test batch, each with 10000 images (50000 training images in total).

In [2]:
for i in range(n_of_files):
    print(f"Dictionary {file_names[i]}: \nKeys: {files[i].keys()}\nBatch label: {files[i][b'batch_label']}\nData size: {files[i][b'data'].shape} \n")

Dictionary data_batch_1: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 1 of 5'
Data size: (10000, 3072) 

Dictionary data_batch_2: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 2 of 5'
Data size: (10000, 3072) 

Dictionary data_batch_3: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 3 of 5'
Data size: (10000, 3072) 

Dictionary data_batch_4: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 4 of 5'
Data size: (10000, 3072) 

Dictionary data_batch_5: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 5 of 5'
Data size: (10000, 3072) 

Dictionary test_batch: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'testing batch 1 of 1'
Data size: (10000, 3072) 
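
Each row of 3072 values is one flattened 32x32 colour image. According to the CIFAR-10 page linked above, the first 1024 values are the red channel, the next 1024 the green and the last 1024 the blue, each stored row by row. A minimal sketch of turning one row back into an image array, using a random row as a stand-in for a real one (the names `row` and `image` are only for illustration):

```python
import numpy as np

# Synthetic stand-in for one row of files[i][b'data'] (3072 uint8 values).
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=3072, dtype=np.uint8)

# Split into 3 colour planes of 32x32, then move the channel axis last
# so the array has the usual (height, width, channel) layout.
image = row.reshape(3, 32, 32).transpose(1, 2, 0)

print(image.shape)
```

An array in this layout can be passed to an image viewer such as `matplotlib.pyplot.imshow` to check a sample visually.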



## Keep only the data and the labels

- I print the name and the keys of each of the 6 dictionaries to confirm that only the data and the labels remain.

In [3]:
for i in range(n_of_files):
    # try/except: re-running this cell on its own raises KeyError, because the keys are already deleted
    try:
        del files[i][b'batch_label']
        del files[i][b'filenames']
    except KeyError:
        print("These keys have already been deleted!\nRun all the cells again")
    print(f"Dictionary {file_names[i]}: \nKeys: {files[i].keys()}\n")

    

    

Dictionary data_batch_1: 
Keys: dict_keys([b'labels', b'data'])

Dictionary data_batch_2: 
Keys: dict_keys([b'labels', b'data'])

Dictionary data_batch_3: 
Keys: dict_keys([b'labels', b'data'])

Dictionary data_batch_4: 
Keys: dict_keys([b'labels', b'data'])

Dictionary data_batch_5: 
Keys: dict_keys([b'labels', b'data'])

Dictionary test_batch: 
Keys: dict_keys([b'labels', b'data'])



## Create the X_train, y_train, X_test, y_test

* First I initialize 4 arrays with zeros.
* Then `X_train` and `y_train` are filled batch by batch: each training batch contributes 10000 rows of 3072 values (10000 images) and 10000 labels, for 50000 images and 50000 labels in total.
* The test batch sits at index 5 (the sixth and last element) of the `files` list; it fills `X_test` with 10000 rows of 3072 values (10000 images) and `y_test` with the 10000 labels.
* I print the shape of each array to show that the sizes are correct.

In [4]:
#create X_train, y_train, X_test, y_test
X_train = np.full((50000,3072),0,dtype=int)
X_test = np.full((10000,3072),0,dtype=int)
y_train = np.full((50000,),0,dtype=int)
y_test = np.full((10000,),0,dtype=int)

for i in range(n_of_files):
    if i != 5:
        # fill this batch's slice of X_train, y_train
        X_train[i*10000:(i+1)*10000,:] = files[i][b'data']
        y_train[i*10000:(i+1)*10000] = files[i][b'labels']
    else:
        # X_train, y_train are complete at this point
        print(f"Shape X_train: {X_train.shape}\nShape y_train: {y_train.shape}")
        # the last file is the test batch: fill X_test, y_test
        X_test[:,:] = files[i][b'data']
        y_test[:] = files[i][b'labels']
        print(f"Shape X_test: {X_test.shape}\nShape y_test: {y_test.shape}\n")




Shape X_train: (50000, 3072)
Shape y_train: (50000,)
Shape X_test: (10000, 3072)
Shape y_test: (10000,)
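
The same stacking can also be written without pre-allocated arrays. A minimal sketch with `np.concatenate`, using tiny synthetic dictionaries in place of the real batches (`toy_files`, `X_toy` and `y_toy` are hypothetical names, and the sizes are shrunk for illustration):

```python
import numpy as np

# Synthetic stand-ins for the five training dictionaries:
# 4 "images" of 6 values each instead of 10000 x 3072.
rng = np.random.default_rng(0)
toy_files = [
    {
        b"data": rng.integers(0, 256, size=(4, 6), dtype=np.uint8),
        b"labels": [int(v) for v in rng.integers(0, 10, size=4)],
    }
    for _ in range(5)
]

# Stack the batches along the first axis, in file order.
X_toy = np.concatenate([d[b"data"] for d in toy_files], axis=0)
y_toy = np.concatenate([np.asarray(d[b"labels"]) for d in toy_files])

print(X_toy.shape, y_toy.shape)
```

One difference from the cell above: `np.concatenate` keeps the `uint8` dtype of the inputs, while the pre-allocated arrays are created with `dtype=int`.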



## Preprocessing the data

In the cell below I try some preprocessing methods to improve the accuracy of my classifiers.

- `fit_transform` is used on `X_train` to calculate the mean and standard deviation of each feature of the training data (the "fit" part) and then scale `X_train` with those values (the "transform" part).

- `transform` only applies the mean and standard deviation already computed from `X_train` to scale `X_test`, so no statistics are taken from the test data.

- With PCA, the dataset is reduced to 100 dimensions: the data is projected onto the 100 principal components, which are the directions with the most variance, and the directions with the least variance are discarded. Each component is a linear combination of the original features rather than a subset of them.

- Adding some Gaussian noise to the training data improved the accuracy and the robustness of my model. In more detail, the noise is drawn with mean = 0 and standard deviation = 2 and then multiplied by `noise_factor = 0.1`, so its effective standard deviation is 0.2. With bigger values I get worse accuracy. Because the noisy copy is appended to the original data, the training set becomes twice as big as before.

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#Feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)


#PCA
pca = PCA(n_components=100)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

#adding gaussian noise to the dataset
noise_factor = 0.1  # Adjust the noise level as needed
X_train_noisy = X_train + noise_factor * np.random.normal(loc=0.0, scale=2.0, size=X_train.shape)


X_train_combined = np.concatenate((X_train,X_train_noisy),axis=0)
y_train_combined = np.concatenate((y_train,y_train),axis=0)   # the noisy copies keep the labels of the originals

len(X_train_combined), len(y_train_combined)

(100000, 100000)
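
Two of the preprocessing points above can be checked on toy data. The sketch below assumes small random arrays in place of the CIFAR-10 features (`X_tr`, `X_te` and `scaler` are illustrative names). It shows that `transform` reuses the training statistics stored in `mean_` and `scale_`, and that scaling `scale=2.0` noise by `noise_factor = 0.1` gives a standard deviation close to 0.2.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # toy training data
X_te = rng.normal(loc=5.0, scale=3.0, size=(200, 4))   # toy test data

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # learns mean_ and scale_ from X_tr
X_te_s = scaler.transform(X_te)      # reuses them, learns nothing from X_te

# Scaling the test data by hand with the training statistics gives the same result.
manual = (X_te - scaler.mean_) / scaler.scale_
print(np.allclose(X_te_s, manual))

# Effective noise level: noise_factor times the standard deviation of the draw.
noise_factor = 0.1
noise = noise_factor * rng.normal(loc=0.0, scale=2.0, size=100_000)
print(noise.std())
```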

## Turn train and test data into tensors

I do this because I might try to use my GPU for acceleration.

In [6]:
import torch
X_train_combined = torch.from_numpy(X_train_combined).type(torch.float)  # cast explicitly to float32 so all tensors share one dtype
y_train_combined = torch.from_numpy(y_train_combined).type(torch.float)
X_test = torch.from_numpy(X_test).type(torch.float)
y_test = torch.from_numpy(y_test).type(torch.float)

X_train_combined[:5,:2], y_train_combined[:5], y_test[:5], X_test[:5,:2]

(tensor([[-22.0557,  12.2849],
         [  4.0135,  -5.0492],
         [ 21.1123, -47.6872],
         [-39.2313,   2.3340],
         [-15.5716, -16.6879]]),
 tensor([6., 9., 9., 4., 1.]),
 tensor([3., 8., 8., 0., 6.]),
 tensor([[-11.3506,   3.4136],
         [ 33.1129, -42.5072],
         [ 11.9410, -35.2011],
         [ 30.2431, -18.6372],
         [-16.7043,  18.7802]]))
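
To my knowledge, scikit-learn's `KNeighborsClassifier` runs on the CPU, so the tensors only help if the distance computation itself is done in PyTorch. A minimal sketch of a cosine 1-NN lookup on whichever device is available, with random tensors standing in for the real data (`train`, `train_labels` and `queries` are hypothetical names, and this is not the method used in the cells below):

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
train = torch.randn(1000, 100, device=device)             # toy stand-in for the training features
train_labels = torch.randint(0, 10, (1000,), device=device)
queries = torch.randn(5, 100, device=device)              # toy stand-in for test features

# Cosine similarity is the dot product of L2-normalised vectors.
train_n = torch.nn.functional.normalize(train, dim=1)
queries_n = torch.nn.functional.normalize(queries, dim=1)
similarity = queries_n @ train_n.T                        # shape (5, 1000)

# 1-NN: take the label of the most similar training row.
nearest = similarity.argmax(dim=1)
predicted = train_labels[nearest]
print(predicted.cpu())
```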

## Implement 1NN

* The concept is this:
  after PCA I am working in a 100-dimensional space (the raw images have 3072 dimensions), and the classifier has seen all the training batches.
  For a new image, KNN finds the K nearest training images under a chosen metric (Euclidean distance, cosine, ...).
  The majority class (label) among those K nearest images becomes the class (label) of the new image.

* By trying all the available `weights` and `metric` options, I noticed that `weights="distance"` and `metric="cosine"` give the best accuracy. With a single neighbor the weighting cannot change the vote, so for 1-NN the gain comes from the metric.


In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Define the model: init 1-NN
classifier_1NN = KNeighborsClassifier(n_neighbors=1,weights="distance",metric="cosine")  # in my tests the cosine metric gave higher accuracy than the euclidean

# Train the model
# Only the training batches
classifier_1NN.fit(X_train_combined,y_train_combined)

# Predict the test set results
# predict the labels from the test batch data
y_pred_labels = classifier_1NN.predict(X_test)

# Evaluate the model using accuracy (y_pred_labels == y_test) / number of tests
# number of tests = 10k
print(f"Accuracy of 1-NN: {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 1-NN: 0.4239
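
The majority-vote idea described above can be sketched from scratch in a few lines. This is an illustration on toy 2-D points with Euclidean distance, not the scikit-learn implementation; `knn_predict`, `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points, nearest first
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]                      # label with the most votes

# Two well-separated toy classes.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array([0, 0, 0, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.05, 0.1]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.2]), k=1))   # 1
```

With `k=1` the vote reduces to copying the label of the single closest training point, which is the 1-NN case used in the cell above.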


## Implement the 3NN

In [8]:
# Define the model: init 3-NN
classifier_3NN = KNeighborsClassifier(n_neighbors=3,weights="distance",metric="cosine")  # in my tests the cosine metric gave higher accuracy than the euclidean

# Train the model
# Only the training batches
classifier_3NN.fit(X_train_combined,y_train_combined)

# Predict the test set results
# predict the labels from the test batch data
y_pred_labels = classifier_3NN.predict(X_test)

# Evaluate the model using accuracy (y_pred_labels == y_test) / number of tests
# number of tests = 10k
print(f"Accuracy of 3-NN: {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 3-NN: 0.4253


I tried several variations:
* I tried the other metrics (`cityblock`, `haversine`, `l1`, `l2`, `manhattan`, `nan_euclidean`); all of them gave an accuracy below 0.35.

* I also used `weights = "distance"` and got a slightly better accuracy. With this setting, neighbors that are nearer to the query point have a greater influence on the predicted class.
The default value is `weights = "uniform"` (each neighbor contributes equally to the decision).
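
The difference between the two weighting schemes can be seen on a single hand-made query. The sketch below assumes inverse-distance weights for the `"distance"` case; `weighted_vote` is a hypothetical helper, and the small `1e-12` guard against division by zero is my own simplification.

```python
import numpy as np

def weighted_vote(labels, distances, weights="uniform"):
    """Return the winning label among the neighbors under the given weighting scheme."""
    labels = np.asarray(labels)
    distances = np.asarray(distances, dtype=float)
    if weights == "distance":
        w = 1.0 / np.maximum(distances, 1e-12)  # closer neighbors get larger weights
    else:
        w = np.ones_like(distances)             # every neighbor counts once
    classes = np.unique(labels)
    scores = np.array([w[labels == c].sum() for c in classes])
    return classes[np.argmax(scores)]

# One very close neighbor of class 7, two farther neighbors of class 2.
labels = [7, 2, 2]
distances = [0.1, 0.9, 1.0]

print(weighted_vote(labels, distances, "uniform"))   # 2
print(weighted_vote(labels, distances, "distance"))  # 7
```

With uniform weights the two farther neighbors outvote the close one. With distance weights the close neighbor scores 10 against roughly 2.1, so it wins.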

## Implement the Nearest Centroid classifier

In [10]:
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import accuracy_score

classifier_KNC = NearestCentroid(shrink_threshold=1.4)

#training the classifier
classifier_KNC.fit(X_train_combined,y_train_combined)

y_preds_labels = classifier_KNC.predict(X_test)

accuracy_score(y_test,y_preds_labels)

0.277
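
For reference, the basic idea behind a nearest centroid classifier can be sketched from scratch: average the training points of each class, then assign a new point to the class whose average is closest. The sketch uses toy 2-D data and Euclidean distance, and it leaves out the centroid shrinking controlled by `shrink_threshold`; `fit_centroids` and `predict_nearest_centroid` are hypothetical names.

```python
import numpy as np

def fit_centroids(X, y):
    """Compute one centroid (mean vector) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest_centroid(X, classes, centroids):
    """Assign each row of X to the class of its closest centroid."""
    # distances[i, j] = Euclidean distance between sample i and centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

classes, centroids = fit_centroids(X_toy, y_toy)
print(centroids)  # class 0 -> (0, 1), class 1 -> (5, 4)
print(predict_nearest_centroid(np.array([[1.0, 1.0], [5.0, 5.0]]), classes, centroids))
```

Because each class is summarized by a single point, this classifier is much cheaper than KNN at prediction time, but it cannot represent classes whose images form several separate clusters.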