![image](https://drive.google.com/u/0/uc?id=15DUc09hFGqR8qcpYiN1OajRNaASmiL6d&export=download)

# **Workshop No. 10 - ISIS4825**

## **Support Vector Machines, Ensemble Models, and Feature Extraction**

## **Contents**
1. [**Objectives**](#id1)
2. [**Problem**](#id2)
3. [**Importing the Libraries Needed for the Lab**](#id3)
4. [**Visualization and Exploratory Analysis**](#id4)
5. [**Data Preparation**](#id5)
6. [**Modeling**](#id6)
7. [**Prediction**](#id7)
8. [**Validation**](#id8)
9. [**Asynchronous Work**](#id9)

## **Objectives**<a name="id1"></a>
- Become familiar with support vector machines and ensemble models.
- Take a basic tour of medical images.
- Extract basic features from images.

## **Problem**<a name="id2"></a>
- Given a dataset of assorted images, we want to classify the images that belong to two of its classes.

## **Notebook Configuration**

In [None]:
!shred -u setup_colab.py
!shred -u setup_colab_general.py
!wget -q "https://github.com/jpcano1/python_utils/raw/main/setup_colab_general.py" -O setup_colab_general.py
!wget -q "https://github.com/jpcano1/python_utils/raw/main/ISIS_4825/setup_colab.py" -O setup_colab.py
import setup_colab as setup
setup.setup_workshop_10()

## **Importing the Libraries Needed for the Lab**<a name="id3"></a>

In [None]:
from utils import general as gen

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             confusion_matrix, f1_score)
from sklearn.utils import resample

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
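# "seaborn-dark" was renamed "seaborn-v0_8-dark" in Matplotlib 3.6; use whichever exists
plt.style.use("seaborn-v0_8-dark" if "seaborn-v0_8-dark" in plt.style.available else "seaborn-dark")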
import seaborn as sns

from skimage import io

from tqdm.auto import tqdm

In [None]:
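def cat_frequencies(labels):
    # Fraction of positive (1) and negative (0) labels in a binary vector
    freq_p = labels.mean()
    freq_n = 1 - freq_p
    return freq_p, freq_n

def reshaped(data, batch=False):
    # CIFAR-10 rows are flat and channel-first (3 x 32 x 32); return channel-last images
    if batch:
        return np.moveaxis(data.reshape(-1, 3, 32, 32), 1, -1)
    return np.moveaxis(data.reshape(3, 32, 32), 0, -1)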
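
As a quick sanity check of `reshaped`, the sketch below builds a synthetic row in the CIFAR-10 layout (1024 red values, then 1024 green, then 1024 blue) and confirms that the result is a 32×32×3 image with channels last. The array `fake_row` is an illustrative assumption, not part of the workshop data.

```python
import numpy as np

def reshaped(data, batch=False):
    # Same helper as in the notebook: flat channel-first vector -> channel-last image
    if batch:
        return np.moveaxis(data.reshape(-1, 3, 32, 32), 1, -1)
    return np.moveaxis(data.reshape(3, 32, 32), 0, -1)

# Hypothetical row: red plane all 10, green plane all 20, blue plane all 30
fake_row = np.concatenate([
    np.full(1024, 10, dtype=np.uint8),
    np.full(1024, 20, dtype=np.uint8),
    np.full(1024, 30, dtype=np.uint8),
])

image = reshaped(fake_row)
batch = reshaped(np.stack([fake_row, fake_row]), batch=True)

print(image.shape)   # (32, 32, 3)
print(image[0, 0])   # [10 20 30]
print(batch.shape)   # (2, 32, 32, 3)
```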

### **Data Loading**

In [None]:
# Load the five CIFAR-10 training batches
batch_set = [
    gen.unpickle(f"data/cifar-10-batches-py/data_batch_{i}")
    for i in range(1, 6)
]

In [None]:
total_data = []
total_targets = []
for batch in batch_set:
    total_data.append(batch[b"data"])
    total_targets.append(batch[b"labels"])

In [None]:
total_data = np.array(total_data).reshape(-1, 3072)
total_targets = np.array(total_targets).reshape(-1)

In [None]:
mask = (total_targets == 6) | (total_targets == 7)
total_data = total_data[mask]
total_targets = total_targets[mask]

In [None]:
# Recode to a binary target: frog (6) -> 0, horse (7) -> 1
total_targets[total_targets == 6] = 0
total_targets[total_targets == 7] = 1
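
The filtering and recoding steps above can be sketched on a small made-up label vector. In the standard CIFAR-10 label order, 6 is frog and 7 is horse; the arrays `labels` and `features` below are illustrative stand-ins for `total_targets` and `total_data`.

```python
import numpy as np

# Hypothetical labels drawn from the 10 CIFAR-10 classes
labels = np.array([0, 6, 7, 3, 7, 6, 9])
features = np.arange(len(labels) * 4).reshape(len(labels), 4)  # 4 fake pixels per row

# Keep only frogs (6) and horses (7)
mask = np.isin(labels, [6, 7])
features, labels = features[mask], labels[mask]

# Recode to a binary target: frog -> 0, horse -> 1
binary = (labels == 7).astype(int)

print(binary)          # [0 1 1 0]
print(features.shape)  # (4, 4)
```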

## **Visualization and Exploratory Analysis**<a name="id4"></a>
- In this lab we use the well-known `CIFAR-10` dataset and classify the images that belong to the frog class (original label 6, recoded as 0) and the horse class (original label 7, recoded as 1).

In [None]:
total_data.shape

In [None]:
np.random.seed(5678)
random_sample = np.random.choice(range(total_data.shape[0]), size=9)
gen.visualize_subplot(
    reshaped(total_data[random_sample], batch=True),
    total_targets[random_sample], (3, 3), (6, 6)
)

## **Data Preparation**<a name="id5"></a>

### **Reshaping**

In [None]:
# Reshape the first row into a 32 x 32 x 3 image (it is not a random draw)
first_image = reshaped(total_data[0])

In [None]:
gen.imshow(first_image, title=f"{total_targets[0]}")

In [None]:
total_data.shape

### **Train Set, Validation Set, Test Set**

In [None]:
# A single stratified 70/30 split is enough, so n_splits=1
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=1234)

In [None]:
train_index, test_index = next(sss.split(total_data, total_targets))
full_X_train, X_test = total_data[train_index], total_data[test_index]
full_y_train, y_test = total_targets[train_index], total_targets[test_index]

In [None]:
# Split the remaining 70 % again into training and validation sets
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=5678)

In [None]:
train_index, val_index = next(sss.split(full_X_train, full_y_train))
X_train, X_val = full_X_train[train_index], full_X_train[val_index]
y_train, y_val = full_y_train[train_index], full_y_train[val_index]
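
For a single stratified split, `train_test_split` with `stratify` is a common alternative to `StratifiedShuffleSplit`. The sketch below applies the same 70/30 scheme twice on synthetic data. The arrays `X` and `y` and the variable names are assumptions for illustration, and the resulting indices will not match those produced by the cells above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(1000, 8))   # synthetic "pixels"
y = np.repeat([0, 1], 500)                 # balanced binary labels

# 70 % train + validation, 30 % test
X_full, X_test, y_full, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1234
)
# 70 % of the remainder for training, 30 % for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.3, stratify=y_full, random_state=5678
)

print(X_train.shape, X_val.shape, X_test.shape)  # (490, 8) (210, 8) (300, 8)
# Stratification keeps the class proportions close to the original 50/50
print(y_train.mean(), y_val.mean(), y_test.mean())
```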

In [None]:
fp, fn = cat_frequencies(total_targets)
fp, fn

In [None]:
fp, fn = cat_frequencies(y_train)
fp, fn

In [None]:
fp, fn = cat_frequencies(y_val)
fp, fn

In [None]:
fp, fn = cat_frequencies(y_test)
fp, fn

## **Modeling**<a name="id6"></a>
- This time we use support vector machines (SVM).
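
The cells in this section fit `SVC` directly on raw pixel intensities in the range 0 to 255. SVMs are sensitive to feature scale, so a frequent variation is to standardize the features inside a pipeline first. The following is a minimal sketch of that variation on synthetic data from `make_classification`. It is not the configuration used in this workshop, and every name in it is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem standing in for the frog/horse images
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=10, class_sep=2.0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The scaler is fitted on the training fold only, then reused at predict time
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_tr, y_tr)

val_accuracy = scaled_svm.score(X_va, y_va)
print(f"validation accuracy: {val_accuracy:.3f}")
```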

In [None]:
svm_clf = SVC(kernel="linear")

In [None]:
%%time
svm_clf.fit(X_train, y_train)

In [None]:
%%time
y_pred = svm_clf.predict(X_val)

In [None]:
accuracy_score(y_val, y_pred)

In [None]:
precision_score(y_val, y_pred)

In [None]:
recall_score(y_val, y_pred)

In [None]:
svm_clf = SVC(kernel="rbf")

In [None]:
%%time
svm_clf.fit(X_train, y_train)

In [None]:
%%time
y_pred = svm_clf.predict(X_val)

In [None]:
accuracy_score(y_val, y_pred)

In [None]:
precision_score(y_val, y_pred)

In [None]:
recall_score(y_val, y_pred)
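
The imports at the top also bring in `f1_score`, which the cells above do not use. A small helper that gathers the four metrics makes the linear and RBF results easier to compare side by side. The helper name `summarize` and the toy label vectors are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def summarize(y_true, y_hat):
    """Return the main binary-classification metrics in one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision_score(y_true, y_hat),
        "recall": recall_score(y_true, y_hat),
        "f1": f1_score(y_true, y_hat),
    }

# Toy example: 3 true positives, 1 false negative, 1 false positive, 3 true negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0]

scores = summarize(y_true, y_hat)
print(scores)  # every metric equals 0.75 for this example
```

In the notebook, the same helper could be called as `summarize(y_val, y_pred)` after each kernel is fitted.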

## **Prediction**<a name="id7"></a>

In [None]:
np.random.seed(1234)
random_sample = np.random.choice(range(X_test.shape[0]), size=9)
y_pred = svm_clf.predict(X_test[random_sample])

In [None]:
gen.visualize_subplot(
    reshaped(X_test[random_sample], batch=True),
    y_pred, (3, 3), (6, 6)
)

## **Validation**<a name="id8"></a>

In [None]:
%%time
y_pred = svm_clf.predict(X_test)

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)

In [None]:
plt.matshow(conf_matrix, cmap="gray")
plt.show()
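
`plt.matshow` shows the matrix only as gray levels, so it helps to remember the layout scikit-learn uses: rows are true classes and columns are predicted classes. For a binary problem the four cells can be unpacked with `ravel()`. The label vectors below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = frog, 1 = horse
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)
# Row-major order for binary labels: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

print(cm)
# [[3 1]
#  [1 3]]
print(tn, fp, fn, tp)  # 3 1 1 3
```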

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
recall_score(y_test, y_pred)

In [None]:
precision_score(y_test, y_pred)

## **Asynchronous Work**<a name="id9"></a>
1. First, use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to determine the best hyperparameter values. To do so, read up on the following hyperparameters:
    - `C`
    - `kernel`

2. Next, perform a multiclass classification on this same dataset, using all of its classes and an `rbf` kernel. To control model complexity, run a [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) over the regularization parameter `C` while keeping the `kernel` fixed. See the Support Vector Machine documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC).

3. Finally, compare the results of the best model obtained in step 2 with a `RandomForest` model and with an `Ensemble` model of your choice. To do so, read chapter 7 of the book: **Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn and TensorFlow. Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, Inc.** For the comparison, use the metrics covered throughout the course and build a precision-recall curve for each model.
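
As a starting point for the tasks above, the following sketch runs `GridSearchCV` over `C` and `kernel` and then computes a precision-recall curve from the decision scores of the best estimator. It uses a small synthetic binary dataset so that it runs quickly. The grid values, the scoring choice, and the variable names are assumptions to adapt, not recommended settings. In the multiclass case the curve would have to be computed per class, for example one-vs-rest.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Illustrative grid: regularization strength and kernel type
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="f1")
search.fit(X_tr, y_tr)
print(search.best_params_)

# decision_function returns signed scores that
# precision_recall_curve can threshold directly
scores = search.best_estimator_.decision_function(X_te)
precision, recall, thresholds = precision_recall_curve(y_te, scores)
print(precision.shape, recall.shape, thresholds.shape)
```

The `precision` and `recall` arrays can then be passed to `plt.plot(recall, precision)` to draw one curve per model on the same axes.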