<img src="res/itm_logo.jpg" width="300px">

## Artificial Intelligence - IAI84
### Instituto Tecnológico Metropolitano
#### Pedro Atencio Ortiz - 2019

This notebook walks through an application of the k-NN algorithm to classify images into two categories, dogs or cats, using SKLearn. It also covers several ways of evaluating a classifier.

<hr>
## k-Nearest Neighbors (k-NN)

A more sophisticated approach, k-NN classification, finds the group of $k$ objects in the training set that are closest to the test object and assigns it a class based on which class predominates in that neighborhood.
<img src="res/knn/knn.png" width="400">
Given a training set $(X,Y)$ and a test object $x_i$, the algorithm computes the distance or similarity between $x_i$ and every training object in $(X,Y)$ to determine the list of nearest neighbors. Once that list is obtained, $x_i$ is assigned the class that appears most often in its neighborhood (majority vote).
<img src="res/knn/knn_example.png" width="700">
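
The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation scikit-learn uses: the function name `knn_predict` and the toy data are hypothetical, Euclidean distance is assumed, and ties are broken in favor of the smallest label.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify a single sample x by majority vote among its k nearest neighbors."""
    # Euclidean distance from x to every training sample
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Count the labels in the neighborhood and return the most frequent one
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 2-D data (hypothetical): class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [5.0, 5.0], [5.2, 4.8], [4.7, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3))  # prints 0
print(knn_predict(X_toy, y_toy, np.array([4.9, 5.0]), k=3))  # prints 1
```

Every prediction scans the whole training set, so the cost of classifying one sample grows with the number of training samples and the number of features.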

<hr>
# Dog or Cat?

<img src="res/knn/clasificacion.png" width="500">

<hr>
## Features...

One way to approach this problem is to use the pixels as the features of the images to be classified. This approach is naive, since an image contains more information than the plain sequence of its pixels. For this example, however, we proceed this way.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from skimage.io import imread

cat_image = imread("dogscats_dataset/train/cats/cat.18.jpg")
print("image shape: ", cat_image.shape)

# Show the RGB image next to each of its three color channels
f, ax = plt.subplots(1,4, figsize=(20,10))
ax[0].imshow(cat_image)
ax[0].set_title("original image")
ax[1].imshow(cat_image[:,:,0], cmap='gray')
ax[1].set_title("R channel")
ax[2].imshow(cat_image[:,:,1], cmap='gray')
ax[2].set_title("G channel")
ax[3].imshow(cat_image[:,:,2], cmap='gray')
ax[3].set_title("B channel")

plt.show()

## Reshape...

In [None]:
from skimage.transform import resize

cat_image_resize = resize(cat_image, (64,64))
print("new shape: ", cat_image_resize.shape)

plt.imshow(cat_image_resize)
plt.show()

'''
a = np.random.randn(3,3,3)
print(a.shape)
print(a)
a_flat = a.flatten()
print(a_flat.shape)
print(a_flat)
'''

# Flatten the image into a vector of shape (1, 64*64*3) = (1, 12288)
cat_image_x = cat_image_resize.flatten().reshape(1, 12288)
print("flattened shape: ",cat_image_x.shape)

In [None]:
from os import listdir
from os.path import join


def is_image_file(filename):
    # Skip hidden files and keep only .jpg images
    return not filename.startswith(".") and filename.endswith(".jpg")

def get_dataset_size(path):
    # Number of .jpg images found in the folder
    return sum(1 for f in listdir(path) if is_image_file(f))

def load_dataset(folder_path, imsize=(64,64,3), class_index=0):
    # Load every .jpg in folder_path, resize it to imsize and store it
    # flattened as one row of X. Y holds class_index for every row.
    image_files = [f for f in listdir(folder_path) if is_image_file(f)]
    
    flattened_size = imsize[0]*imsize[1]*imsize[2]
    
    X = np.zeros([len(image_files), flattened_size])
    Y = np.ones([len(image_files), 1]) * class_index
    
    for i, f in enumerate(image_files):
        t = imread(join(folder_path, f))
        t_reshape = resize(t, imsize)
        X[i, :] = t_reshape.flatten()
    
    return (X, Y)

<hr>
## Creating the training dataset (train set)

In this example we have separate images for training (train) and testing (test). Below we load both datasets, train k-NN and Naive Bayes, and then measure the performance of each classifier using several evaluation metrics.

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
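
As a rough guide to what the metrics used later in this notebook measure, the sketch below derives accuracy, precision, recall and F1 from the four confusion-matrix counts of a binary problem. The helper `binary_metrics` and the toy labels are hypothetical, and class 1 (dogs in this notebook) is assumed to be the positive class.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute basic binary classification metrics, treating label 1 as positive."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()

    # The four cells of the confusion matrix
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # dogs predicted as dogs
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # cats predicted as cats
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # cats predicted as dogs
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # dogs predicted as cats

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels (hypothetical): 4 cats followed by 4 dogs
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

print(binary_metrics(y_true_toy, y_pred_toy))
```

Here the counts are 2 true positives, 3 true negatives, 1 false positive and 2 false negatives, which gives accuracy 5/8, precision 2/3, recall 1/2 and F1 4/7.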


In [None]:
root_path = "dogscats_dataset/train/"
cats_path = "cats/"
dogs_path = "dogs/"

(X_cats, Y_cats) = load_dataset(root_path+cats_path, class_index=0)
(X_dogs, Y_dogs) = load_dataset(root_path+dogs_path, class_index=1)

X_train = np.concatenate((X_cats, X_dogs))
Y_train = np.concatenate((Y_cats, Y_dogs))

print(X_train.shape)

## Creating the test dataset (test set)

In [None]:
root_path = "dogscats_dataset/test/"
cats_path = "cats/"
dogs_path = "dogs/"

(X_cats, Y_cats) = load_dataset(root_path+cats_path, class_index=0)
(X_dogs, Y_dogs) = load_dataset(root_path+dogs_path, class_index=1)

X_test = np.concatenate((X_cats, X_dogs))
Y_test = np.concatenate((Y_cats, Y_dogs))

print(X_test.shape)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

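# scikit-learn expects a 1-D label array, so the (n, 1) column is flattened with ravel()
neigh = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
neigh.fit(X_train, Y_train.ravel())

naive_bayes = GaussianNB()
naive_bayes.fit(X_train, Y_train.ravel())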

## Evaluating on the test set

In [None]:
def accuracy(y, y_pred):
    # Fraction of samples whose predicted label matches the true label.
    # y and y_pred must have the same shape.
    correctly_predicted = np.count_nonzero(y == y_pred)
    
    return float(correctly_predicted) / len(y)

In [None]:
from sklearn.metrics import accuracy_score, average_precision_score, f1_score, confusion_matrix

print("kNN Classifier")

y_pred = neigh.predict(X_test).reshape(len(Y_test),1)

# Stored under its own name so it does not shadow sklearn's accuracy_score
own_accuracy = accuracy(Y_test, y_pred)

print("accuracy score (own implementation): ", own_accuracy)

print("SKLearn Metrics")
print("accuracy score: ", accuracy_score(Y_test, y_pred))
print("average precision score: ", average_precision_score(Y_test, y_pred))
print("f1-score: ", f1_score(Y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(Y_test, y_pred))

In [None]:
from sklearn.metrics import accuracy_score, average_precision_score, f1_score, confusion_matrix

print("Gaussian Naive Bayes Classifier")

y_pred = naive_bayes.predict(X_test).reshape(len(Y_test),1)

# Stored under its own name so it does not shadow sklearn's accuracy_score
own_accuracy = accuracy(Y_test, y_pred)

print("accuracy score (own implementation): ", own_accuracy)

print("SKLearn Metrics")
print("accuracy score: ", accuracy_score(Y_test, y_pred))
print("average precision score: ", average_precision_score(Y_test, y_pred))
print("f1-score: ", f1_score(Y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(Y_test, y_pred))

In [None]:
#image = imread("dogscats_dataset/train/cats/cat.0.jpg")
image = imread("dogscats_dataset/test/dogs/dog.110.jpg")

plt.imshow(image)
plt.show()

x = resize(image, (64,64)).reshape(1, 12288)

y_hat = int(neigh.predict(x)[0])

classes = ["cat", "dog"]

print("It's a: ",classes[y_hat])

<hr>
## Cross-validation

Although splitting the dataset 70-30 or 80-20 is useful for validating model performance, we cannot guarantee that the data in each partition is **representative**.

For this reason, one strategy is cross-validation, which repeats the partitioning process multiple times.

Below we implement **k-fold**. **Leave-one-out** is the special case in which the number of folds equals the number of samples.
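
Leave-one-out can be sketched directly in NumPy: each sample is held out once, the classifier is built from the remaining samples, and the held-out sample is predicted. The function `leave_one_out_accuracy`, the inlined k-NN vote and the toy data are hypothetical and only meant as an illustration under those assumptions.

```python
import numpy as np

def leave_one_out_accuracy(X, y, k=1):
    """Estimate k-NN accuracy by holding out each sample exactly once."""
    n = len(X)
    correct = 0
    for i in range(n):
        # Every sample except the i-th is used for training
        mask = np.arange(n) != i
        X_tr, y_tr = X[mask], y[mask]

        # k-NN majority vote for the held-out sample
        distances = np.linalg.norm(X_tr - X[i], axis=1)
        nearest = np.argsort(distances)[:k]
        labels, counts = np.unique(y_tr[nearest], return_counts=True)
        correct += int(labels[np.argmax(counts)] == y[i])

    return correct / n

# Toy 2-D data (hypothetical): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [5.0, 5.0], [5.2, 4.8], [4.7, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(leave_one_out_accuracy(X_toy, y_toy, k=1))  # prints 1.0
```

scikit-learn provides the same splitting strategy as `sklearn.model_selection.LeaveOneOut`. On an image dataset it requires one model fit per sample, so it is considerably more expensive than k-fold with a small number of folds.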

In [None]:
from sklearn.model_selection import KFold

# Concatenate the TRAIN and TEST datasets into a single dataset X, Y
X = np.concatenate((X_test, X_train))
Y = np.concatenate((Y_test, Y_train))

In [None]:
splits = 5
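# Shuffle before splitting: X is ordered by class, so unshuffled folds would not be representative.
# random_state is fixed only to make the folds reproducible.
kf = KFold(n_splits=splits, shuffle=True, random_state=0)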

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

neigh = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
naive_bayes = GaussianNB()

accuracy_score_NB = 0
accuracy_score_kNN = 0

for train_index, test_index in kf.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    
    # Refit both classifiers on the training folds (labels flattened to 1-D)
    neigh.fit(X_train, Y_train.ravel())
    naive_bayes.fit(X_train, Y_train.ravel())

    y_pred_NB = naive_bayes.predict(X_test).reshape(len(Y_test),1)
    y_pred_kNN = neigh.predict(X_test).reshape(len(Y_test), 1)
    
    # sklearn metrics take the true labels first, then the predictions
    fold_accuracy_NB = accuracy_score(Y_test, y_pred_NB)
    fold_accuracy_kNN = accuracy_score(Y_test, y_pred_kNN)
    
    print(fold_accuracy_NB, fold_accuracy_kNN)
    
    accuracy_score_NB += fold_accuracy_NB
    accuracy_score_kNN += fold_accuracy_kNN

print("Mean test accuracy Naive Bayes: ", accuracy_score_NB / splits)
print("Mean test accuracy k-NN: ", accuracy_score_kNN / splits)

<hr>
## Fine-tuning our model

Some classification models have hyperparameters, which means the practitioner must determine the set of values that achieves the best performance. One way to find them is to run multiple experiments manually until a good setting is found. However, SKLearn lets us automate this search with **GridSearchCV**.
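
To make concrete what an automated search involves, the sketch below tries each candidate number of neighbors and scores it by mean k-fold accuracy, keeping the best one. It is an illustration in plain NumPy and not how scikit-learn implements its search: the functions `knn_predict_batch` and `grid_search_k`, the seeds and the synthetic data are all hypothetical.

```python
import numpy as np

def knn_predict_batch(X_train, y_train, X_query, k):
    """Predict a label for each row of X_query by k-NN majority vote."""
    preds = []
    for x in X_query:
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def grid_search_k(X, y, candidates, n_folds=5, seed=0):
    """Return (best_k, scores); scores maps each k to its mean fold accuracy."""
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices once and cut them into n_folds parts
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = {}
    for k in candidates:
        fold_acc = []
        for i in range(n_folds):
            test_idx = folds[i]
            train_idx = np.concatenate(
                [folds[j] for j in range(n_folds) if j != i])
            y_hat = knn_predict_batch(X[train_idx], y[train_idx],
                                      X[test_idx], k)
            fold_acc.append(np.mean(y_hat == y[test_idx]))
        scores[k] = float(np.mean(fold_acc))
    # Keep the candidate with the highest mean accuracy
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic 2-D data (hypothetical): two Gaussian clusters of 30 points each
data_rng = np.random.default_rng(42)
X_toy = np.vstack([data_rng.normal(0, 1, (30, 2)),
                   data_rng.normal(4, 1, (30, 2))])
y_toy = np.array([0] * 30 + [1] * 30)

best_k, scores = grid_search_k(X_toy, y_toy, candidates=[3, 5, 7])
print("best k:", best_k)
print("mean accuracy per k:", scores)
```

The cost is one cross-validation run per candidate, so the number of model fits is the number of candidates times the number of folds.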

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [{'n_neighbors':[3,5,7]}]

In [None]:
neigh = KNeighborsClassifier(metric='euclidean')

grid_search = GridSearchCV(neigh, param_grid=param_grid, cv=5, scoring='accuracy')

In [None]:
# Labels are flattened to 1-D, the shape scikit-learn expects
grid_search.fit(X, Y.ravel())

In [None]:
grid_search.best_params_

In [None]:
grid_search.cv_results_