# Report exercise 0

In this notebook I compare the performance of the 1NN, 3NN and Nearest Centroid classifiers on the `CIFAR-10` dataset.

## Load the data

- I import the `numpy` library because the dictionaries returned by `unpickle()` contain NumPy arrays that I need to manipulate.

- I use the `unpickle(file)` function recommended on the CIFAR-10 site: https://www.cs.toronto.edu/~kriz/cifar.html

- In this cell I store the batch names in a list called `file_names`, and the 6 dictionaries returned by `unpickle()` in a list called `files`.

- At the end I print the number of batch files, which equals the number of dictionaries in `files`.

In [1]:
import os
os.environ["LOKY_MAX_CPU_COUNT"] = "8"  # This line suppresses a warning that I was getting

import numpy as np

file_names = []
files = []

def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        # Named `batch` to avoid shadowing the built-in `dict`; the keys are bytes, e.g. b'data'
        batch = pickle.load(fo, encoding='bytes')
    return batch


for i in range(1,6):
    file = f"data_batch_{i}"
    file_names.append(file)

file_names.append("test_batch")

for file in file_names:
    cifar10_dict = unpickle(file)
    files.append(cifar10_dict)

n_of_files = len(file_names)
n_of_files

6

## Inspect the list `files`
* This list holds 6 dictionaries
* Below I print the keys of every dictionary
* There are 5 training batches with 10000 images each, and 1 test batch with 10000 images

In [2]:
for i in range(n_of_files):
    print(f"Dictionary {file_names[i]}: \nKeys: {files[i].keys()}\nBatch label: {files[i][b'batch_label']}\nData size: {files[i][b'data'].shape} \n")

Dictionary data_batch_1: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 1 of 5'
Data size: (10000, 3072) 

Dictionary data_batch_2: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 2 of 5'
Data size: (10000, 3072) 

Dictionary data_batch_3: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 3 of 5'
Data size: (10000, 3072) 

Dictionary data_batch_4: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 4 of 5'
Data size: (10000, 3072) 

Dictionary data_batch_5: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'training batch 5 of 5'
Data size: (10000, 3072) 

Dictionary test_batch: 
Keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
Batch label: b'testing batch 1 of 1'
Data size: (10000, 3072) 
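
Each row of `data` has 3072 values because a CIFAR-10 image is 32 × 32 pixels with 3 colour channels. As a minimal sketch, a row can be turned back into an image array, assuming the layout described on the CIFAR-10 page: 1024 red values, then 1024 green, then 1024 blue, each in row-major order. The helper name `row_to_image` and the synthetic `fake_row` are illustrative and not part of the notebook.

```python
import numpy as np

def row_to_image(row):
    """Convert one flat CIFAR-10 row (3072 values) into a 32x32x3 image array."""
    # Assumed layout: 1024 red, then 1024 green, then 1024 blue values,
    # each channel stored in row-major order.
    channels_first = np.asarray(row).reshape(3, 32, 32)
    # Move the channel axis last, the layout most plotting libraries expect.
    return channels_first.transpose(1, 2, 0)

# Synthetic stand-in for one row of files[0][b'data'] (not real CIFAR data).
fake_row = np.arange(3072) % 256
image = row_to_image(fake_row)
print(image.shape)  # (32, 32, 3)
```

In the notebook the same helper could be called as `row_to_image(files[0][b'data'][0])` to inspect the first training image.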



## Keep only the data and the labels

- I print the name and the remaining keys of each of the 6 dictionaries to confirm that only the data and the labels are kept.

In [3]:
for i in range(n_of_files):
    # try/except: running this cell a second time would raise a KeyError, because the keys are already deleted
    try:
        del files[i][b'batch_label']
        del files[i][b'filenames']
    except KeyError:
        print("These keys have already been deleted!\nRun all the cells again")
    print(f"Dictionary {file_names[i]}: \nKeys: {files[i].keys()}\n")

    

    

Dictionary data_batch_1: 
Keys: dict_keys([b'labels', b'data'])

Dictionary data_batch_2: 
Keys: dict_keys([b'labels', b'data'])

Dictionary data_batch_3: 
Keys: dict_keys([b'labels', b'data'])

Dictionary data_batch_4: 
Keys: dict_keys([b'labels', b'data'])

Dictionary data_batch_5: 
Keys: dict_keys([b'labels', b'data'])

Dictionary test_batch: 
Keys: dict_keys([b'labels', b'data'])



## Create the X_train, y_train, X_test, y_test

* First, I initialize 4 arrays with zeros.
* Then `X_train` and `y_train` are filled: the first with 50000 × 3072 values (10000 images from each training batch) and the second with 50000 labels (10000 from each training batch).
* The test batch is at index 5 of the `files` list (its last element). From it I fill `X_test` with 10000 × 3072 values (10000 images) and `y_test` with 10000 labels.
* I print the shape of each array to confirm that the sizes are correct.

In [4]:
# Create X_train, y_train, X_test, y_test, initialized with zeros
X_train = np.zeros((50000, 3072), dtype=int)
X_test = np.zeros((10000, 3072), dtype=int)
y_train = np.zeros((50000,), dtype=int)
y_test = np.zeros((10000,), dtype=int)

for i in range(n_of_files):
    if i != 5:
        # training batches: fill the matching slice of X_train, y_train
        X_train[i*10000:(i+1)*10000,:] = files[i][b'data']
        y_train[i*10000:(i+1)*10000] = files[i][b'labels']
    else:
        # X_train, y_train are now complete
        print(f"Shape X_train: {X_train.shape}\nShape y_train: {y_train.shape}")
        # test batch: fill X_test, y_test
        X_test[:,:] = files[i][b'data']
        y_test[:] = files[i][b'labels']
        print(f"Shape X_test: {X_test.shape}\nShape y_test: {y_test.shape}\n")




Shape X_train: (50000, 3072)
Shape y_train: (50000,)
Shape X_test: (10000, 3072)
Shape y_test: (10000,)
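
The same arrays could also be built without preallocating, by concatenating the batches. The sketch below is an alternative under stated assumptions: `stack_batches` is a hypothetical helper, and the tiny synthetic batches only mimic the structure of the CIFAR-10 dictionaries. Unlike the cell above, the result keeps the dtype of the stored data instead of converting it to `int`.

```python
import numpy as np

def stack_batches(batches):
    """Concatenate the data and labels of several batch dictionaries."""
    X = np.concatenate([np.asarray(b[b'data']) for b in batches], axis=0)
    y = np.concatenate([np.asarray(b[b'labels']) for b in batches], axis=0)
    return X, y

# Tiny synthetic batches standing in for the CIFAR-10 dictionaries.
toy_batches = [
    {b'data': np.full((4, 3072), i, dtype=np.uint8), b'labels': [i] * 4}
    for i in range(5)
]
X_toy, y_toy = stack_batches(toy_batches)
print(X_toy.shape, y_toy.shape)  # (20, 3072) (20,)
```

With the notebook's variables this would read `stack_batches(files[:5])` for the training set and `stack_batches(files[5:])` for the test set.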



## Preprocessing the data

In the cell below, I try some preprocessing methods to improve the accuracy of the classifiers.

- `fit_transform` is applied to `X_train`: it computes the mean and standard deviation of each feature of the training data (the "fit" part) and then scales `X_train` with those values (the "transform" part). `transform` only applies the previously computed training mean and standard deviation to `X_test`. This gives `X_train_sc` and `X_test_sc`.

- PCA reduces the dimensionality of the dataset to 100 features. These are not a subset of the original pixels: each one is a linear combination of all of them, taken along the directions of largest variance, while the directions with the least variance are discarded. This gives `X_train_PCA` and `X_test_PCA`, and, applied to the scaled data, `X_train_sc_PCA` and `X_test_sc_PCA`.

- Adding some Gaussian noise to the training data can improve the accuracy and the robustness of the model. In more detail, the noise has mean = 0 and is drawn with standard deviation 2 and then multiplied by `noise_factor = 0.1`, so its effective standard deviation is 0.2 (variance 0.04). With bigger values I get worse accuracy. This gives `X_train_sc_PCA_noisy`.

- So right now I have:
    * the initial data
    * the scaled data
    * the PCA transformed data
    * the scaled PCA transformed data
    * the scaled PCA transformed noisy data

This way I can monitor how the accuracy of the algorithms changes with each step.
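
To illustrate the fit/transform split for the scaling step, here is a minimal NumPy sketch on synthetic data. It is not the `StandardScaler` code itself, only the same arithmetic written out, and the array names are illustrative. The point is that the test set is scaled with statistics taken from the training set only.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # synthetic training data
toy_test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))    # synthetic test data

# "fit": per-feature statistics computed from the training data only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": the same training statistics are applied to both sets
toy_train_sc = (toy_train - mean) / std
toy_test_sc = (toy_test - mean) / std

print(np.allclose(toy_train_sc.mean(axis=0), 0.0))  # True
print(np.allclose(toy_train_sc.std(axis=0), 1.0))   # True
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: no information from the test set is used during fitting.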

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#Feature scaling
sc_X = StandardScaler()
X_train_sc = sc_X.fit_transform(X_train)
X_test_sc = sc_X.transform(X_test)


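#PCA (the same `pca` object is refit on the scaled data further down, after X_test_PCA has been computed)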
pca = PCA(n_components=100)
X_train_PCA = pca.fit_transform(X_train)
X_test_PCA = pca.transform(X_test)

X_train_sc_PCA = pca.fit_transform(X_train_sc)
X_test_sc_PCA = pca.transform(X_test_sc)

# Add Gaussian noise to the training set only (effective std = noise_factor * scale = 0.1 * 2.0 = 0.2)
noise_factor = 0.1  # Adjust the noise level as needed
X_train_sc_PCA_noisy = X_train_sc_PCA + noise_factor * np.random.normal(loc=0.0, scale=2.0, size=X_train_sc_PCA.shape)
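
To make the PCA step more concrete, the sketch below implements a plain PCA projection with NumPy's SVD on synthetic data. It is a simplified stand-in for `sklearn.decomposition.PCA`, not its actual implementation, and the helper names `pca_fit` and `pca_transform` are hypothetical. It shows that the components are fitted on the training data only and that each new feature mixes all original features.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Center X with the training mean and project it onto the components."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
toy_train = rng.normal(size=(100, 10))  # synthetic data, not CIFAR-10
toy_test = rng.normal(size=(20, 10))

mean, components = pca_fit(toy_train, n_components=3)
train_proj = pca_transform(toy_train, mean, components)
test_proj = pca_transform(toy_test, mean, components)
print(train_proj.shape, test_proj.shape)  # (100, 3) (20, 3)
```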


## Implement KNN

* The concept is this:
  - I work in the 3072-dimensional pixel space (3072 = 32 × 32 × 3) and I have stored all the training batches.
  - For a new image, KNN finds the K nearest training images under a chosen metric (Euclidean distance, cosine distance, ...).
  - The majority class (label) among those K nearest images becomes the class (label) of the new image.
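
The concept above can be sketched in a few lines of NumPy. This is a brute-force illustration with the Euclidean metric on toy 2D points, not the algorithm that `KNeighborsClassifier` uses internally. The function name `knn_predict` and the toy data are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters with labels 0 and 1
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(points, labels, np.array([0.2, 0.2]), k=3))  # 0
print(knn_predict(points, labels, np.array([5.5, 5.5]), k=1))  # 1
```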

## Implement 1NN

* Using `weights = "distance"` and `metric = "cosine"` improved the accuracy. I found this by trying all the available `weights` and `metric` options.


I have tried many different things:
* I tried the other metrics (`cityblock`, `haversine`, `l1`, `l2`, `manhattan`, `nan_euclidean`); all of them stayed below roughly 0.35 accuracy.

* I also used `weights = "distance"`, which gives neighbors that are nearer to the query point a greater influence on the predicted class. The default value is `weights = "uniform"` (each neighbor contributes equally to the decision). With a single neighbor (k = 1) there is nothing to weigh, so this setting can only change the predictions for k > 1; for 1NN the gain comes from the metric and the preprocessing.

* I also ran the classifier on the initial data, the scaled data, the PCA transformed data, the scaled and PCA transformed data, and the scaled, PCA transformed, noisy data.
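
One possible reason the metric matters is that cosine distance ignores the overall magnitude of a vector, while Euclidean distance does not. The sketch below illustrates this on three toy vectors. Whether this fully explains the accuracy gain on CIFAR-10 is an assumption, and the function names are illustrative.

```python
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means the vectors point in the same direction
    return 1.0 - float(a @ b) / (float(np.linalg.norm(a)) * float(np.linalg.norm(b)))

pattern = np.array([1.0, 2.0, 3.0])
brighter = 2.0 * pattern                 # same pattern, doubled intensity
different = np.array([3.0, 2.0, 1.0])    # reversed pattern, similar intensity

# Euclidean distance ranks the reversed pattern as the closer one
print(euclidean_distance(pattern, brighter) > euclidean_distance(pattern, different))  # True
# Cosine distance ranks the brighter copy as the closer one
print(cosine_distance(pattern, brighter) < cosine_distance(pattern, different))        # True
```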

Some of these attempts are shown below.

### 1NN using:

* the initial data 
* weights = `uniform`
* metric = `euclidean`

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Define the model: init 1-NN
classifier_1NN = KNeighborsClassifier(n_neighbors=1,weights="uniform",metric="euclidean")  # baseline; I noticed that the cosine metric gives a higher accuracy than the euclidean one

# Train the model
# Only the training batches
classifier_1NN.fit(X_train,y_train)

# predict the labels from the test batch data
y_pred_labels = classifier_1NN.predict(X_test)

# Evaluate the model using accuracy (y_pred_labels == y_test) / number of tests
# number of tests = 10k
print(f"Accuracy of 1-NN: {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 1-NN: 0.3539


### 1NN using

* scaled data
* metric = `cosine`
* weights = `distance`


In [7]:
classifier_1NN = KNeighborsClassifier(n_neighbors=1,weights="distance",metric="cosine")
classifier_1NN.fit(X_train_sc,y_train)
y_pred_labels = classifier_1NN.predict(X_test_sc)
print(f"Accuracy of 1-NN (2): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 1-NN (2): 0.4102


### 1NN using:

* PCA transformed data
* weights = `distance`
* metric = `cosine`

In [8]:
classifier_1NN.fit(X_train_PCA,y_train)
y_pred_labels = classifier_1NN.predict(X_test_PCA)
print(f"Accuracy of 1-NN (3): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 1-NN (3): 0.4221


### 1NN using:

* scaled and PCA transformed data
* weights = `distance`  
* metric = `cosine`

In [9]:
classifier_1NN.fit(X_train_sc_PCA,y_train)
y_pred_labels = classifier_1NN.predict(X_test_sc_PCA)
print(f"Accuracy of 1-NN (4): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 1-NN (4): 0.4244


### 1NN using:

* scaled, PCA transformed and noisy data
* weights = `distance`
* metric = `cosine`

In [10]:
classifier_1NN.fit(X_train_sc_PCA_noisy,y_train)
y_pred_labels = classifier_1NN.predict(X_test_sc_PCA)
print(f"Accuracy of 1-NN (5): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 1-NN (5): 0.4247


Now let's repeat the above cell with weights = `uniform` and metric = `euclidean`.

In [11]:
classifier_1NN = KNeighborsClassifier(n_neighbors=1,weights="uniform",metric="euclidean")
classifier_1NN.fit(X_train_sc_PCA_noisy,y_train)
y_pred_labels = classifier_1NN.predict(X_test_sc_PCA)
print(f"Accuracy of 1-NN (6): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 1-NN (6): 0.3844


### Conclusion for 1NN

The best accuracy I achieved for 1NN was 0.4247, using the scaled, PCA transformed, noisy data together with tuned `KNeighborsClassifier` parameters (`metric = cosine` and `weights = distance`).
The margin over the scaled and PCA transformed data without noise (0.4244) is only 3 test images out of 10000, and the noise is random, so in some of my runs the scaled and PCA transformed data gave the best accuracy instead.
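
To put the gap between 0.4244 and 0.4247 into perspective, the following sketch computes a rough binomial standard error for an accuracy measured on 10000 test images. It assumes the test images are independent draws, which is a simplification. The function name is illustrative.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated on n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

n_test = 10000
acc_plain = 0.4244   # scaled + PCA data (1NN, cosine)
acc_noisy = 0.4247   # scaled + PCA + noisy data (1NN, cosine)

se = accuracy_standard_error(acc_noisy, n_test)
difference = acc_noisy - acc_plain
print(f"standard error: {se:.4f}")
print(f"difference:     {difference:.4f}")
```

Under this assumption the difference is more than ten times smaller than the standard error, so these two configurations cannot be ranked from a single run.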

## Implement the 3NN

Let's repeat the same experiments for 3NN.
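
With three neighbors the `weights` setting can change the outcome. The sketch below is a simplified vote, not scikit-learn's implementation. It compares a uniform vote with an inverse-distance vote on made-up neighbors. The `vote` helper, the labels and the distances are all illustrative, and the distances are assumed to be strictly positive.

```python
from collections import defaultdict

def vote(neighbor_labels, neighbor_distances, weights="uniform"):
    """Return the winning label under a uniform or inverse-distance vote."""
    scores = defaultdict(float)
    for label, dist in zip(neighbor_labels, neighbor_distances):
        # Assumes dist > 0; a zero distance would need special handling.
        scores[label] += 1.0 if weights == "uniform" else 1.0 / dist
    return max(scores, key=scores.get)

neighbor_labels = ["cat", "dog", "dog"]
neighbor_distances = [0.1, 0.8, 0.9]

print(vote(neighbor_labels, neighbor_distances, "uniform"))   # dog
print(vote(neighbor_labels, neighbor_distances, "distance"))  # cat
```

With a single neighbor both schemes necessarily return the same label, so the setting only matters from k = 2 upwards.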

### 3NN using:

* initial data
* weights = `uniform`
* metric = `euclidean`


In [12]:
# Define the model: init 3-NN
classifier_3NN = KNeighborsClassifier(n_neighbors=3,weights="uniform",metric="euclidean")  # baseline; I noticed that the cosine metric gives a higher accuracy than the euclidean one

# Train the model
# Only the training batches
classifier_3NN.fit(X_train,y_train)

# Predict the test set results
# predict the labels from the test batch data
y_pred_labels = classifier_3NN.predict(X_test)

# Evaluate the model using accuracy (y_pred_labels == y_test) / number of tests
# number of tests = 10k
print(f"Accuracy of 3-NN: {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 3-NN: 0.3303


### 3NN using:

* initial data
* weights = `distance`
* metric = `cosine` 

In [13]:
classifier_3NN = KNeighborsClassifier(n_neighbors=3,weights="distance",metric="cosine")  
classifier_3NN.fit(X_train,y_train)
y_pred_labels = classifier_3NN.predict(X_test)
print(f"Accuracy of 3-NN (2): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 3-NN (2): 0.3813


### 3NN using:

* scaled data
* same weights and metric as the above cell

In [14]:
classifier_3NN.fit(X_train_sc,y_train)
y_pred_labels = classifier_3NN.predict(X_test_sc)
print(f"Accuracy of 3-NN (3): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 3-NN (3): 0.4272


### 3NN using:

* PCA transformed data
* same weights and metric as the above cell

In [15]:
classifier_3NN.fit(X_train_PCA,y_train)
y_pred_labels = classifier_3NN.predict(X_test_PCA)
print(f"Accuracy of 3-NN (4): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 3-NN (4): 0.4422


### 3NN using:

* scaled and PCA transformed data
* same weights and metric as the above cell

In [16]:
classifier_3NN.fit(X_train_sc_PCA,y_train)
y_pred_labels = classifier_3NN.predict(X_test_sc_PCA)
print(f"Accuracy of 3-NN (5): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 3-NN (5): 0.4412


### 3NN using:

* scaled, PCA transformed and noisy data
* same weights and metric as above

In [17]:
classifier_3NN.fit(X_train_sc_PCA_noisy,y_train)
y_pred_labels = classifier_3NN.predict(X_test_sc_PCA)
print(f"Accuracy of 3-NN (6): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of 3-NN (6): 0.4425


### Conclusion for 3NN

The best accuracy I achieved for 3NN was 0.4425, using the scaled, PCA transformed, noisy data together with tuned `KNeighborsClassifier` parameters (`metric = cosine` and `weights = distance`).
In some other runs, the best accuracy came from the PCA transformed data or the scaled and PCA transformed data.

## Implement the Nearest Centroid classifier

* The concept of the NC is the following:
  - For each class in the training data, compute the centroid (the mean of all training points that belong to that class).
  - For a new data point, calculate the distance to each class centroid.
  - The label of the nearest centroid is assigned as the new data point's predicted class.
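
The three steps above can be sketched with NumPy as follows. This is a bare Euclidean nearest-centroid rule on toy 2D data. It does not include the `shrink_threshold` option that the notebook passes to scikit-learn's `NearestCentroid`, and the helper names are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    """Compute one centroid (feature-wise mean) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest_centroid(X, classes, centroids):
    """Assign each sample the label of its closest centroid."""
    # distances[i, j] = Euclidean distance from sample i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]

# Toy data: class 0 near the origin, class 1 near (10, 11)
toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
toy_labels = np.array([0, 0, 1, 1])

classes, centroids = fit_centroids(toy_points, toy_labels)
queries = np.array([[1.0, 1.0], [9.0, 9.0]])
print(predict_nearest_centroid(queries, classes, centroids))  # [0 1]
```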

### NC using:

* initial data


In [18]:
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import accuracy_score

classifier_KNC = NearestCentroid(shrink_threshold=1.4)

#training the classifier
classifier_KNC.fit(X_train,y_train)

y_pred_labels = classifier_KNC.predict(X_test)

print(f"Accuracy of NC : {accuracy_score(y_test,y_pred_labels)}")


Accuracy of NC : 0.2737


### NC using:
* scaled data

In [19]:
classifier_KNC.fit(X_train_sc,y_train)
y_pred_labels = classifier_KNC.predict(X_test_sc)
print(f"Accuracy of NC (2): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of NC (2): 0.2781


### NC using:
* PCA transformed data

In [20]:
classifier_KNC.fit(X_train_PCA,y_train)
y_pred_labels = classifier_KNC.predict(X_test_PCA)
print(f"Accuracy of NC (3) : {accuracy_score(y_test,y_pred_labels)}")

Accuracy of NC (3) : 0.2723


### NC using:
* Scaled and PCA transformed data

In [21]:
classifier_KNC.fit(X_train_sc_PCA,y_train)
y_pred_labels = classifier_KNC.predict(X_test_sc_PCA)
print(f"Accuracy of NC (4): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of NC (4): 0.276


### NC using:
* Scaled, PCA transformed and noisy data

In [22]:
classifier_KNC.fit(X_train_sc_PCA_noisy,y_train)
y_pred_labels = classifier_KNC.predict(X_test_sc_PCA)
print(f"Accuracy of NC (5): {accuracy_score(y_test,y_pred_labels)}")

Accuracy of NC (5): 0.2762


### Conclusion for NC

The best accuracy I achieved for NC was 0.2781, obtained with the data only scaled.

## Conclusion

Having seen all the above, I can say that neither KNN (k=1 or k=3) nor NC is a good classifier for the images of the CIFAR-10 dataset. I tried many different methods to improve the accuracy of the algorithms: preprocessing and parameter tuning raised KNN from about 0.35 to about 0.44, while NC stayed between 0.27 and 0.28, so none of them became a reliable classifier.
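
As a compact summary, the sketch below collects the best accuracy reported above for each classifier and compares it with the chance level. The chance level of 0.1 assumes the 10 CIFAR-10 classes are equally represented in the test batch, as described on the dataset page. The variable names are illustrative.

```python
# Best accuracies reported in the sections above
best_accuracy = {"1NN": 0.4247, "3NN": 0.4425, "Nearest Centroid": 0.2781}
chance_level = 1.0 / 10  # assumption: 10 equally represented classes

for name, acc in best_accuracy.items():
    print(f"{name:<17} accuracy={acc:.4f}  ({acc / chance_level:.1f}x chance)")

# Classifiers ordered from best to worst
ranking = sorted(best_accuracy, key=best_accuracy.get, reverse=True)
print(ranking)  # ['3NN', '1NN', 'Nearest Centroid']
```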

```
ΧΡΗΣΤΟΣ ΚΟΥΝΣΟΛΑΣ
AEM : 10345
```