# Cifar10


In [None]:
# The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
# For Cifar10 database -
# You are given:
# 1. The Readme file contains a template to download the CIFAR-10 dataset (training and test sets). Download the dataset using it.
# Your task is to:
# 1. Predict correct class for every image in the test dataset.
# Read Instructions carefully -
# 1. Submit a csv file with only predictions for X test data. File should not have any headers and should only have one column i.e. predictions.
# 2. Predictions should be class names and not numbers.
# 3. Your score is based on number of accurate predictions.

In [18]:
# Libraries we are going to use

import cifar10
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
cifar10.data_path = "data/CIFAR-10/"

In [3]:
# cifar10.maybe_download_and_extract()

In [4]:
class_names = cifar10.load_class_names()
class_names

Loading data: data/CIFAR-10/cifar-10-batches-py/batches.meta


['airplane',
 'automobile',
 'bird',
 'cat',
 'deer',
 'dog',
 'frog',
 'horse',
 'ship',
 'truck']

In [5]:
images_train, cls_train, labels_train = cifar10.load_training_data()
images_test, cls_test, labels_test = cifar10.load_test_data()

Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_1
Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_2
Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_3
Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_4
Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_5
Loading data: data/CIFAR-10/cifar-10-batches-py/test_batch


### Initial steps to get started with PCA

PCA works only on the input data (the features). The output data (the labels) is not used until we train a model and make predictions; until then we are only preparing the input, whether for feature engineering or to cut training time.

Applying PCA therefore means transforming the input data to reduce the number of features.
Why reduce the features, when doing so can make the predictions worse?

Because we usually trade some precision for time. When training takes very long and the time can be cut substantially without losing much precision, we accept the small loss, provided the application is not critical.
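To make the idea concrete, here is a minimal NumPy sketch of what PCA does to an input matrix. It is not the scikit-learn implementation used in this notebook; `pca_reduce` and the synthetic `X_demo` are hypothetical names introduced only for illustration. Note that no labels appear anywhere in it.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components (labels are never used)."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, sorted by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    return X_centered @ components.T, components, mean

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))  # synthetic stand-in: 200 samples, 50 features
X_small, components, mean = pca_reduce(X_demo, 10)
print(X_demo.shape, "->", X_small.shape)
```

A model trained on `X_small` sees 10 columns instead of 50, which is where the time saving comes from.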

In [6]:
reshaped_images = images_train.reshape(images_train.shape[0], -1) # Flatten each 32x32x3 RGB image into one row of 3072 features
reshaped_images_test = images_test.reshape(images_test.shape[0], -1) # Flattening the test input

In [7]:
pca = PCA() # Initializing PCA
pca.fit(reshaped_images) # Fit the original data to PCA to get the eigen values and vectors to determine components

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

### Getting to the optimized input data with optimal k value

In the initial steps above we prepared the data for PCA. Every pixel value of an image can be treated as a feature, so we flattened each image into a row with one column per pixel value and fitted PCA on it.

In other words, the pixels are the features of each image. Each image has thousands of them (32 x 32 x 3 = 3072), so fitting a model on the raw data takes a long time. We therefore try to drop the directions that contribute very little to the images.

This is what the eigenvalues tell us: each eigenvalue is the amount of variance along its eigenvector. The eigenvalue measures how much of the data a component explains, and the eigenvector gives the direction in feature space along which that component lies.

Below we take the share of variance explained by each component and add the shares up until we reach a chosen threshold. The threshold is up to us; here we keep 95% of the variance (roughly, 95% of the information in the data). `k` is the number of components needed to reach it.
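The same selection can be written with a cumulative sum. The sketch below is one possible vectorized form, run on made-up ratios; `components_for_variance` and `ratios` are hypothetical names, and in the notebook the input would be `pca.explained_variance_ratio_`.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest k whose first k ratios sum to at least `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= threshold, plus one to get a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding that leaves the total slightly below the threshold
    return min(k, len(cumulative))

ratios = np.array([0.5, 0.3, 0.1, 0.06, 0.04])  # made-up, sorted like PCA output
k_demo = components_for_variance(ratios, 0.95)
print(k_demo)
```

scikit-learn can also make this choice itself: passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.95)`) asks it to keep enough components to explain that fraction of the variance.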

In [8]:
k = 0
total = 0
while(total < 0.95):
    total += pca.explained_variance_ratio_[k]
    k += 1

In [9]:
pca = PCA(k, whiten = True) # Giving the component value to PCA
transformed_data = pca.fit_transform(reshaped_images) # Transforming the train data
test_images_transformed = pca.transform(reshaped_images_test) #Transforming the test data

In the cell below we take the components, that is, the eigenvector of each retained component, and reshape them to the shape of the original images.

Why?

So that we can plot them as images and inspect each component if needed. An image can only be displayed when it has its pixel dimensions (height, width, channels); a flattened vector cannot be shown as an image.

In [10]:
features = pca.components_ # Eigen Vectors
each_feature = np.reshape(features, (k, 32, 32, 3)) # Reshaping the k eigenvectors (217 in this run) to visualize them (if needed)
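Eigenvectors contain negative and very small values, so they usually need rescaling before they can be displayed. The sketch below shows one way to do that, assuming simple min-max scaling is acceptable; `component_to_image` is a hypothetical helper and `fake_component` is random data standing in for one row of `pca.components_`.

```python
import numpy as np

def component_to_image(component, shape=(32, 32, 3)):
    """Reshape a flat eigenvector and min-max scale it to [0, 1] for display."""
    img = np.asarray(component, dtype=float).reshape(shape)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # A constant vector carries no contrast; return a black image
        return np.zeros(shape)
    return (img - lo) / (hi - lo)

rng = np.random.default_rng(0)
fake_component = rng.normal(size=32 * 32 * 3)  # stand-in for pca.components_[0]
eigen_image = component_to_image(fake_component)
print(eigen_image.shape, eigen_image.min(), eigen_image.max())
```

The result can then be passed to `plt.imshow`, which accepts float RGB arrays with values in the range 0 to 1.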

## Classification Part

In [14]:
clf = RandomForestClassifier() # Initializing the classifier
clf.fit(transformed_data, cls_train) # Training the input data

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [16]:
predicted = clf.predict(test_images_transformed) # Predicting the data

### Analysing the predictions

In [17]:
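# Note: scikit-learn expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped, support counts the predictions per class,
# and the rows of the confusion matrix are the predicted classes.
print(classification_report(predicted, cls_test))
print(confusion_matrix(predicted, cls_test))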

              precision    recall  f1-score   support

           0       0.55      0.53      0.54      1055
           1       0.56      0.50      0.53      1122
           2       0.29      0.34      0.31       850
           3       0.27      0.29      0.28       924
           4       0.39      0.41      0.40       964
           5       0.35      0.37      0.36       954
           6       0.53      0.47      0.49      1129
           7       0.40      0.49      0.44       811
           8       0.59      0.51      0.55      1160
           9       0.46      0.44      0.45      1031

    accuracy                           0.44     10000
   macro avg       0.44      0.43      0.44     10000
weighted avg       0.45      0.44      0.44     10000

[[554  43 114  42  53  33  17  48 101  50]
 [ 48 558  35  44  15  35  31  55  87 214]
 [ 53  12 290  88 132  86  96  61  17  15]
 [ 29  36  87 266  72 193  81  83  32  45]
 [ 34  14 150  75 392  74 116  74  23  12]
 [ 22  39  82 183  57 353 

In [41]:
classes = ['airplane',
 'automobile',
 'bird',
 'cat',
 'deer',
 'dog',
 'frog',
 'horse',
 'ship',
 'truck']
# Map the predicted class indices (not the true labels in cls_test) to class names
predicted_names = np.array(classes)[predicted]


In [45]:
# One prediction per line, no header, as the instructions require
np.savetxt("predictions.csv", predicted_names, fmt='%s')
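As a quick check of the submission format (one column, no header, class names rather than numbers), the sketch below maps a few integer predictions to names, writes them out and reads the file back. The values in `predicted_demo` and the temporary file path are made up for illustration.

```python
import os
import tempfile
import numpy as np

class_names_demo = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                    'dog', 'frog', 'horse', 'ship', 'truck']
predicted_demo = np.array([3, 8, 0, 9])  # hypothetical integer predictions

# Indexing an array of names with an array of indices maps them in one step
predicted_names_demo = np.array(class_names_demo)[predicted_demo]

# Write to a temporary file so the real predictions.csv is not touched
path = os.path.join(tempfile.mkdtemp(), "predictions_demo.csv")
np.savetxt(path, predicted_names_demo, fmt='%s')

with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```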
