Consider a classification problem with $N$ patterns and $M$ classes, where the patterns are indexed by $i=1,\ldots,N$ and the classes by $l=1,\ldots,M$.

# **Representations**

**Matrix representation**: As a binary matrix of size $M \times N$, where cell $(l, i)$ equals 1 if $x_i \in C_l$ and 0 otherwise.
$$ y = [ y_{l, i} ], \qquad y_{l,i} = \begin{cases} 1 & \text{if}  \quad x_i\in C_l\\
0 & \text{if} \quad x_i \not\in C_l \end{cases}, \qquad l=1,\ldots, M, \ i=1,\ldots, N$$

**Set of sets representation**: As a set of $M$ sets, one per class, where each set contains the indices of the patterns that belong to that class:
$$y = \left\{\begin{matrix} y_1 = \{i \ \mid \ x_i\in C_1, \ i=1,\ldots, N\}\\
y_2 = \{i \ \mid\ x_i\in C_2, \ i=1,\ldots, N\}\\
\vdots\\
y_M = \{i \ \mid\ x_i\in C_M, \ i=1,\ldots, N\}\\
\end{matrix}\right\}$$

**Vector representation**: As a vector of length $N$ where element $i$ contains the label $l$ of the class to which pattern $x_i$ belongs.
$$ y =[y_1, \  y_2, \  \ldots, \  y_N ], \qquad \text{where} \ \  y_i = l \ | \ x_i \in C_l $$

## **Example**

Vector of true labels

$$Y = (1, 2, 2, 3, 3, 3, 4, 4, 4, 4)$$

Class membership matrix (one row per pattern, one column per class)

$$
Y =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

Partition of the samples into classes

$$
Y = \left\{ \{1\}, \{2, 3\}, \{4, 5, 6\}, \{7, 8, 9, 10\} \right\}
$$
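The three representations carry the same information, so each can be derived from the label vector. The following is a minimal sketch using only the Python standard library; the function names `to_membership_matrix` and `to_partition` are illustrative and not part of the original notes. The matrix is built in the $M \times N$ orientation of the definition, so the matrix displayed in the example is its transpose.

```python
def to_membership_matrix(labels, num_classes):
    """Return the M x N binary matrix: entry [l][i] is 1 iff pattern i+1 has label l+1."""
    return [[1 if y == l else 0 for y in labels]
            for l in range(1, num_classes + 1)]


def to_partition(labels, num_classes):
    """Return a list of M sets; set l holds the 1-based indices of the patterns in class l+1."""
    return [{i for i, y in enumerate(labels, start=1) if y == l}
            for l in range(1, num_classes + 1)]


Y = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

matrix = to_membership_matrix(Y, 4)   # 4 rows (classes), 10 columns (patterns)
partition = to_partition(Y, 4)

# Transposing gives the N x M layout used in the displayed example.
transposed = [list(column) for column in zip(*matrix)]

for class_index, members in enumerate(partition, start=1):
    print(class_index, sorted(members))
```

Each pattern belongs to exactly one class, so every column of the $M \times N$ matrix sums to 1 and the sets in the partition are disjoint.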


### *Predictions*

Vector of predictions

$$\hat{Y} = (1, 1, 1, 1, 2, 2, 2, 3, 3, 4)$$

Class membership matrix (one row per pattern, one column per class)

$$
\hat{Y} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

Partition of the samples into classes

$$
\hat{Y} = \left\{ \{1, 2, 3, 4\}, \{5, 6, 7\}, \{8, 9\}, \{10\} \right\}
$$

\begin{align*}
\hat{Y}_1 &= \{1, 2, 3, 4\} \\
\hat{Y}_2 &= \{5, 6, 7\} \\
\hat{Y}_3 &= \{8, 9\} \\
\hat{Y}_4 &= \{10\} 
\end{align*}

# **Confusion Matrix**

General layout, with row totals $s_i$ and column totals $p_i$:

| True \ Predicted | $\hat{Y}_1$ | ... | $\hat{Y}_m$ | Total |
|------------------|-------------|-----|-------------|-------|
| $Y_1$            |             |     |             | $s_1$ |
| ...              |             |     |             |       |
| $Y_m$            |             |     |             | $s_m$ |
| Total            | $p_1$       | ... | $p_m$       | $N$   |

$$
p_i = \sum_{j=1}^m | Y = j, \hat{Y} = i |, \quad i \in \{1, 2, \dots, m\}
$$

$$
\text{$p_i$ is the number of patterns that we predict as class $i$}
$$


$$
s_i =  \sum_{j=1}^m | Y = i, \hat{Y} = j |, \quad i \in \{1, 2, \dots, m\}
$$

$$
\text{$s_i$ is the number of patterns that truly belong to class $i$}
$$




$$
M_c = 
\begin{bmatrix}
|Y = 1, \hat{Y} = 1|   & &  |Y = 1, \hat{Y} = 2|   & &  \dots & &  |Y = 1, \hat{Y} = m|   \\
|Y = 2, \hat{Y} = 1|   & &  |Y = 2, \hat{Y} = 2|   & &  \dots & &  |Y = 2, \hat{Y} = m|   \\
|Y = 3, \hat{Y} = 1|   & &  |Y = 3, \hat{Y} = 2|   & &  \dots & &  |Y = 3, \hat{Y} = m|   \\
\vdots                 & &  \vdots                 & &  \ddots& &  \vdots                 \\
|Y = m, \hat{Y} = 1|   & &  |Y = m, \hat{Y} = 2|   & &  \dots & &  |Y = m, \hat{Y} = m|   \\
\end{bmatrix}
$$

#### Example

$$Y = (1, 2, 2, 3, 3, 3, 4, 4, 4, 4)$$
$$\hat{Y} = (1, 1, 1, 1, 2, 2, 2, 3, 3, 4)$$

$$
M_c = 
\begin{bmatrix}
1 & 0 & 0 & 0 \\
2 & 0 & 0 & 0 \\
1 & 2 & 0 & 0 \\
0 & 1 & 2 & 1
\end{bmatrix}
$$

$$
M_c = 
\begin{bmatrix}
\text{true class 1, classified as 1} & \text{true class 1, classified as 2} & \dots \\
\text{true class 2, classified as 1} & \text{true class 2, classified as 2} & \dots \\
\text{true class 3, classified as 1} & \text{true class 3, classified as 2} & \dots \\
\vdots & \vdots & \ddots
\end{bmatrix}
$$
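The counting that produces $M_c$ can be sketched in a few lines of plain Python. The function name `confusion_matrix` below is illustrative, and the sketch assumes the labels are the integers $1, \ldots, m$, as in the example.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Entry [i][j] counts the patterns with true class i+1 and predicted class j+1."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        counts[true_label - 1][predicted_label - 1] += 1
    return counts


Y = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
Y_hat = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

M_c = confusion_matrix(Y, Y_hat, 4)
row_totals = [sum(row) for row in M_c]            # true class sizes
column_totals = [sum(col) for col in zip(*M_c)]   # predicted class sizes

for row in M_c:
    print(row)
```

Rows correspond to the true class and columns to the predicted class, so the row totals give the true class sizes and the column totals give the predicted class sizes.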

## Metrics

$$
M_{cT}  = 
\begin{bmatrix}
|Y = 1, \hat{Y} = 1|   & &  |Y = 1, \hat{Y} = 2|   & &  \dots & &  |Y = 1, \hat{Y} = m|  & | & s_1 \\
|Y = 2, \hat{Y} = 1|   & &  |Y = 2, \hat{Y} = 2|   & &  \dots & &  |Y = 2, \hat{Y} = m|  & | & s_2 \\
|Y = 3, \hat{Y} = 1|   & &  |Y = 3, \hat{Y} = 2|   & &  \dots & &  |Y = 3, \hat{Y} = m|  & | & s_3 \\
\vdots                 & &  \vdots                 & &  \ddots& &  \vdots                 \\
|Y = m, \hat{Y} = 1|   & &  |Y = m, \hat{Y} = 2|   & &  \dots & &  |Y = m, \hat{Y} = m|  & | & s_m \\
\hline
p_1   & &  p_2   & &  \dots & &  p_m  & | & N &    \\
\end{bmatrix}
$$

$$
M_{cT}  = 
\begin{bmatrix}
A(1, 1)   &  \dots   &  \dots   &  | & \text{truly in } C_1 \\
\dots   &  A(2, 2)    &  \dots   &  | & \text{truly in } C_2 \\
\vdots  & \vdots  & \ddots  & | & \vdots \\
\dots   &  \dots   &  A(M, M)    &  | & \text{truly in } C_M \\
\hline
\text{predicted as } C_1 & \text{predicted as } C_2 & \dots \ \text{predicted as } C_M  & | & \text{Total}
\end{bmatrix}
$$

$$
s_i =  \sum_{j=1}^m | Y = i, \hat{Y} = j |, \quad i \in \{1, 2, \dots, m\}
$$

$$
\text{$s_i$ is the number of patterns that truly belong to class $i$; it is the sum of row $i$}
$$




$$
p_i = \sum_{j=1}^m | Y = j, \hat{Y} = i |, \quad i \in \{1, 2, \dots, m\}
$$

$$
\text{$p_i$ is the number of patterns that we predict as class $i$; it is the sum of column $i$}
$$


$$
\text{Accuracy} = \frac{\sum_{i=1}^m \left| Y = i, \hat{Y} = i \right|}{N} 
$$

$$ \text{Balanced Accuracy} = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y = i, \hat{Y} = i|}{s_i} $$

$$
\begin{aligned}
\text{Precision}_i &= \frac{|Y = i, \hat{Y} = i|}{p_i} \\
\text{Sensitivity}_i &= \frac{|Y = i, \hat{Y} = i|}{s_i} \\
\text{F1}_i &= 2 \frac{P_i S_i}{P_i + S_i} \\
F\beta_i &= (1+\beta^2) \frac{P_i S_i}{\beta^2 P_i + S_i}
\end{aligned}
$$

Here $P_i = \text{Precision}_i$ and $S_i = \text{Sensitivity}_i$. Sensitivity is also called recall.

$$
\begin{aligned}
\text{Macro-Precision} &= \frac{1}{m} \sum_{i=1}^m \text{Precision}_i \\
\text{Macro-Recall} &= \frac{1}{m} \sum_{i=1}^m \text{Recall}_i \\
\text{Macro-F1} &= \frac{1}{m} \sum_{i=1}^m \text{F1}_i \\
\end{aligned}
$$

$$
\begin{aligned}
\text{Micro-Precision} &= \frac{\sum_{i=1}^m \text{True Positives}_i}{\sum_{i=1}^m \left(\text{True Positives}_i + \text{False Positives}_i\right)} \\
\text{Micro-Recall} &= \frac{\sum_{i=1}^m \text{True Positives}_i}{\sum_{i=1}^m \left(\text{True Positives}_i + \text{False Negatives}_i\right)} \\
\text{Micro-F1} &= 2 \cdot \frac{\text{Micro-Precision} \cdot \text{Micro-Recall}}{\text{Micro-Precision} + \text{Micro-Recall}}
\end{aligned}
$$

Confusion matrix with totals

$$
M_{cT} = 
\begin{bmatrix}
1 & 0 & 0 & 0 & | & 1 \\
2 & 0 & 0 & 0 & | & 2 \\
1 & 2 & 0 & 0 & | & 3 \\
0 & 1 & 2 & 1 & | & 4 \\
\hline
4 & 3 & 2 & 1 & | & 10
\end{bmatrix}
$$


$$
\begin{aligned}
\text{Accuracy} &= \frac{1+0+0+1}{10} = 0.2 \\
\text{Balanced Accuracy} &= \frac{1}{4} \left( \frac{1}{1} + \frac{0}{2} + \frac{0}{3} + \frac{1}{4} \right) = 0.3125 \\
\text{Precision}_1 &= \frac{1}{4} \\
\text{Sensitivity}_1 &= 1 \\
\text{Precision}_2 &= \frac{0}{3} \\
\text{Sensitivity}_2 &= \frac{0}{2} \\
\text{Precision}_3 &= \frac{0}{2} \\
\text{Sensitivity}_3 &= \frac{0}{3} \\
\text{Precision}_4 &= 1 \\
\text{Sensitivity}_4 &= \frac{1}{4}
\end{aligned}
$$
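The worked values can be reproduced, and extended to the F1, macro and micro scores, with a short script. This is a sketch under one stated assumption: a ratio with a zero denominator is reported as 0, a convention the notes do not specify. The helper name `safe_div` is illustrative.

```python
# Confusion matrix of the example: rows = true class, columns = predicted class.
M_c = [
    [1, 0, 0, 0],
    [2, 0, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 2, 1],
]
m = len(M_c)
N = sum(sum(row) for row in M_c)
s = [sum(row) for row in M_c]            # row sums: true class sizes
p = [sum(col) for col in zip(*M_c)]      # column sums: predicted class sizes
tp = [M_c[i][i] for i in range(m)]       # diagonal: correctly classified patterns


def safe_div(numerator, denominator):
    """Assumption: a ratio with a zero denominator is reported as 0.0."""
    return numerator / denominator if denominator else 0.0


accuracy = sum(tp) / N
balanced_accuracy = sum(safe_div(tp[i], s[i]) for i in range(m)) / m

precision = [safe_div(tp[i], p[i]) for i in range(m)]
sensitivity = [safe_div(tp[i], s[i]) for i in range(m)]
f1 = [safe_div(2 * P * S, P + S) for P, S in zip(precision, sensitivity)]

macro_precision = sum(precision) / m
macro_recall = sum(sensitivity) / m
macro_f1 = sum(f1) / m

# With one label per pattern, sum(TP + FP) = sum(p) = N and sum(TP + FN) = sum(s) = N.
micro_precision = safe_div(sum(tp), sum(p))
micro_recall = safe_div(sum(tp), sum(s))
micro_f1 = safe_div(2 * micro_precision * micro_recall,
                    micro_precision + micro_recall)

print(accuracy, balanced_accuracy)
print(precision, sensitivity, f1)
print(macro_f1, micro_f1)
```

In this single-label setting the micro-averaged precision, recall and F1 all coincide with the accuracy, because every misclassified pattern is one false positive for some class and one false negative for another.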

### **Errors**

> The **probability of a false positive**, known as a **type I error**, is denoted by $\alpha$ (in which case the probability of a true negative is $1-\alpha$).

> The **probability of a false negative**, known as a **type II error**, is denoted by $\beta$ (and the probability of a true positive is $1-\beta$).

<table border="1" style="border-collapse: collapse; text-align: center;">
   <caption style="font-weight: bold; margin-bottom: 10px;"> <b>Confusion Matrix for the two-class problem.</b></caption>
  <tr>
    <th rowspan="2">True Class</th>
    <th colspan="2">Predicted Class</th>
    <th rowspan="2">Total</th>
  </tr>
  <tr>
    <th>Positive</th>
    <th>Negative</th>
  </tr>
  <tr>
    <td><b>Positive</b></td>
    <td>tp : true positive</td>
    <td>fn : false negative</td>
    <td>p</td>
  </tr>
  <tr>
    <td><b>Negative</b></td>
    <td>fp : false positive</td>
    <td>tn : true negative</td>
    <td>n</td>
  </tr>
  <tr>
    <td><b>Total</b></td>
    <td>p'</td>
    <td>n'</td>
    <td>N</td>
  </tr>
</table>

**Main performance measures**

> **Total Error**
$$e = \frac{fp+fn}{N}=\alpha\,\frac{n}{N} + \beta\,\frac{p}{N}$$

> **Accuracy** (*exactitud*), measures the proportion of correct predictions in relation to the total number of elements.
$$\frac{tp+tn}{N} = 1- e$$

> **False alarm rate** (**false positive fraction**) $fp$-rate is the proportion of impostors we wrongly accept.
$$\frac{fp}{n}=\alpha$$

> **True positive rate**, **recall** or **sensitivity** $tp$-rate, measures what proportion of valid users (positive patterns) we authenticate. Intuitively, it is the ability of the classifier to find all the positive samples.
$$ \frac{tp}{p}=1-\beta$$

> **Precision** measures the proportion of records classified as positive that actually are positive. Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative.
$$\frac{tp}{p'}$$

> **Specificity** is how well we detect the negatives.
$$\frac{tn}{n} = 1 - fp\text{-rate}$$
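The two-class measures follow directly from the four cells of the table. The sketch below uses hypothetical counts (`tp`, `fn`, `fp`, `tn` are made-up numbers chosen only for illustration, not data from the notes) and takes $\alpha = fp/n$ and $\beta = fn/p$.

```python
# Hypothetical counts for a two-class problem (illustrative values only).
tp, fn = 40, 10   # true class positive: p = tp + fn
fp, tn = 5, 45    # true class negative: n = fp + tn

p = tp + fn            # number of truly positive patterns
n = fp + tn            # number of truly negative patterns
p_pred = tp + fp       # p' in the table: patterns predicted positive
N = p + n

alpha = fp / n         # false alarm rate (fp-rate)
beta = fn / p          # false negative rate

total_error = (fp + fn) / N
accuracy = (tp + tn) / N
recall = tp / p        # true positive rate, sensitivity
precision = tp / p_pred
specificity = tn / n

print(total_error, accuracy, recall, precision, specificity)
```

Because `alpha` and `beta` are rates within each true class, they have to be weighted by the class proportions `n / N` and `p / N` before they add up to the total error.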


# **Comparing Models with Information Theory**

### **Joint Distribution**

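$$p(l, l') = P(y_l \cap y'_{l'}) = \frac{|y_l \cap y'_{l'}|}{N}$$

where $y_l$ is the set of patterns in class $l$ of the first labeling and $y'_{l'}$ is the set of patterns in class $l'$ of the second labeling (here, the predictions).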

$$
p(Y, \hat{Y} ) = 
\frac{1}{N}
M_c 
$$

$$
p(l, l') = 
\begin{bmatrix}
.1 & 0 & 0 & 0 \\
.2 & 0 & 0 & 0 \\
.1 & .2 & 0 & 0 \\
0 & .1 & .2 & .1
\end{bmatrix}
$$

$$
p(l, l') =
\begin{bmatrix}
.1 & 0 & 0 & 0 & | & .1 \\
.2 & 0 & 0 & 0 & | & .2 \\
.1 & .2 & 0 & 0 & | & .3 \\
0 & .1 & .2 & .1 & | & .4 \\
\hline
.4 & .3 & .2 & .1 & | & 1
\end{bmatrix}
$$
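Dividing the confusion matrix by $N$ and summing its rows and columns gives the joint and marginal distributions shown above. A minimal sketch, with illustrative variable names:

```python
# Confusion matrix of the example: rows = true class, columns = predicted class.
M_c = [
    [1, 0, 0, 0],
    [2, 0, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 2, 1],
]
N = sum(sum(row) for row in M_c)

# Joint distribution p(l, l') = M_c / N.
joint = [[count / N for count in row] for row in M_c]

# Marginals: row sums give the true-label distribution,
# column sums give the predicted-label distribution.
p_true = [sum(row) for row in joint]
p_pred = [sum(col) for col in zip(*joint)]

print(p_true)
print(p_pred)
```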


### **Entropy**

$$ \mathrm{H}(X) = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{1}{p(x)}= -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

where $p(x) := \mathbb{P}[X = x]$

$$ \mathrm{H}(X) = \sum_{i=1}^m \mathbb{P}(X = i) \log_2 \frac{1}{\mathbb{P}(X = i)}= -\sum_{i=1}^m \mathbb{P}(X = i) \log_2 \mathbb{P}(X = i)$$

$$Y = (1, 2, 2, 3, 3, 3, 4, 4, 4, 4)$$

$$
\mathrm{H}(Y) 
=\frac{1}{10}\log_2\frac{10}{1} 
+\frac{2}{10}\log_2\frac{10}{2} 
+\frac{3}{10}\log_2\frac{10}{3} 
+\frac{4}{10}\log_2\frac{10}{4} 
\approx 1.85 \ \text{bits}$$

$$
\mathrm{H}(Y) 
=
-\frac{1}{10}\log_2\frac{1}{10} 
-\frac{2}{10}\log_2\frac{2}{10} 
-\frac{3}{10}\log_2\frac{3}{10} 
-\frac{4}{10}\log_2\frac{4}{10} 
\approx 1.85 \ \text{bits}$$
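The same entropy can be computed from the label vector by counting class frequencies. A minimal sketch using the standard library; the function name `entropy` is illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())


Y = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
h_y = entropy(Y)
print(round(h_y, 2))  # 1.85
```

Only classes that actually occur are counted, so the convention $0 \log_2 0 = 0$ never has to be handled explicitly.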

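When one of the two variables is constant, the mutual information is 0.  
When the two variables change together (identically, or under a one-to-one relabeling), the mutual information is maximal and equals the entropy of either variable; for example, it is 1 bit for a balanced binary variable.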

### **Mutual Information**

$$I(X, Y) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y} }p(x,y) \log_2 \frac{p(x,y)}{p(x) p(y)}$$

where $p(x,y) := \mathbb{P}[X = x, Y = y]$, $p(x) := \mathbb{P}[X = x]$ and $p(y) := \mathbb{P}[Y = y]$. 

$$I(X, Y) = \sum_{i=1}^{M} \sum_{j=1}^{M} \mathbb{P}(X = i,Y = j) \log_2 \frac{\mathbb{P}(X = i,Y = j)}{\mathbb{P}(X = i)\mathbb{P}(Y = j)}$$ 

\begin{align}
I(X, Y) &= \sum_{x \in \mathcal{X}, y \in \mathcal{Y} }p(x,y) \log_2 \frac{p(x,y)}{p(x) p(y)}
\\
&=
\sum_{i=1}^{M} \sum_{j=1}^{M} 
\frac{|x_i \cap y_j|}{N}
\log_2
\frac{\frac{|x_i \cap y_j|}{N}}
{\frac{|x_i|}{N} \frac{|y_j|}{N}}
\\
&=
\sum_{i=1}^{M} \sum_{j=1}^{M} 
\frac{|x_i \cap y_j|}{N}
\log_2
\frac{N\,|x_i \cap y_j|}{|x_i||y_j|}
\end{align}

$$I(X, Y) = \mathrm{H}(Y) - \mathrm{H}(Y|X) = \mathrm{H}(X) - \mathrm{H}(X|Y)$$
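For two label vectors, the counting form of the sum can be evaluated directly from the pair frequencies. The sketch below applies it to the example vectors $Y$ and $\hat{Y}$; the function name `mutual_information` is illustrative, and pairs that never occur are skipped, following the convention $0 \log_2 0 = 0$.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information in bits between two label vectors of equal length."""
    n = len(x)
    joint_counts = Counter(zip(x, y))   # |x_i ∩ y_j| for every pair that occurs
    x_counts = Counter(x)               # |x_i|
    y_counts = Counter(y)               # |y_j|
    return sum(
        (count / n) * log2(n * count / (x_counts[a] * y_counts[b]))
        for (a, b), count in joint_counts.items()
    )


Y = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
Y_hat = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

mi = mutual_information(Y, Y_hat)
print(round(mi, 4))
```

Comparing a vector with itself returns its entropy, and comparing it with a constant vector returns 0, which gives two quick sanity checks for the implementation.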

### **Bayes' Theorem**

$$
P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A,B)}{P(B)} = \frac{P(B|A) P(A)}{P(B)}
$$
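As a quick numerical check of the identity, the conditional probability $P(Y = 3 \mid \hat{Y} = 1)$ can be computed from the example confusion matrix in two ways: directly from the joint distribution, and through Bayes' theorem. The sketch uses exact fractions; the helper names `p_joint`, `p_true` and `p_pred` are illustrative.

```python
from fractions import Fraction

# Confusion matrix of the example: rows = true class, columns = predicted class.
M_c = [
    [1, 0, 0, 0],
    [2, 0, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 2, 1],
]
N = sum(sum(row) for row in M_c)


def p_joint(y, y_hat):
    """P(Y = y, Y_hat = y_hat), with 1-based class labels."""
    return Fraction(M_c[y - 1][y_hat - 1], N)


def p_true(y):
    """P(Y = y): row sum divided by N."""
    return Fraction(sum(M_c[y - 1]), N)


def p_pred(y_hat):
    """P(Y_hat = y_hat): column sum divided by N."""
    return Fraction(sum(row[y_hat - 1] for row in M_c), N)


# Direct definition: P(A | B) = P(A, B) / P(B).
direct = p_joint(3, 1) / p_pred(1)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B).
likelihood = p_joint(3, 1) / p_true(3)
via_bayes = likelihood * p_true(3) / p_pred(1)

print(direct, via_bayes)  # 1/4 1/4
```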


### **Conditional Entropy**

One may also define the conditional entropy of two variables $X$ and $Y$ taking values from sets $\mathcal{X}$ and $\mathcal{Y}$ respectively, as:

$$
\mathrm{H}(X|Y) = -\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x,y) \log_2 \frac{p_{X,Y}(x,y)}{p_{Y}(y)},
$$

where $p_{X,Y}(x,y) := \mathbb{P}[X = x, Y = y]$ and $p_{Y}(y) = \mathbb{P}[Y = y]$. This quantity should be understood as the remaining randomness in the random variable $X$ given the random variable $Y$.

\begin{align}
\mathrm{H}(X|Y) 
&= -\sum_{y} \sum_{x} p(x,y) \log_2 p(x| y) \\
&= -\sum_{y} \sum_{x} p(x,y) \log_2 \frac{p(x,y) }{p(y)} \\
&= -\sum_{i=1}^M\sum_{j=1}^M \frac{|y_i \cap x_j|}{N} \log_2 \frac{ \frac{|y_i \cap x_j|}{N} }{ \frac{|y_i|}{N} } \\
&= -\sum_{i=1}^M\sum_{j=1}^M \frac{|y_i \cap x_j|}{N} \log_2\left( \frac{|y_i \cap x_{j}|}{|y_i|} \right)
\end{align}
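The last line of the derivation translates directly into a count-based computation. A minimal sketch on the example vectors; the function name `conditional_entropy` is illustrative, and pairs that never occur are skipped.

```python
from collections import Counter
from math import log2


def conditional_entropy(x, y):
    """H(X | Y) in bits for two label vectors of equal length."""
    n = len(x)
    joint_counts = Counter(zip(x, y))   # |y_i ∩ x_j| for every pair that occurs
    y_counts = Counter(y)               # |y_i|
    return -sum(
        (count / n) * log2(count / y_counts[b])
        for (a, b), count in joint_counts.items()
    )


Y = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
Y_hat = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

h_y_given_pred = conditional_entropy(Y, Y_hat)   # H(Y | Y_hat)
h_pred_given_y = conditional_entropy(Y_hat, Y)   # H(Y_hat | Y)
print(round(h_y_given_pred, 4), round(h_pred_given_y, 4))
```

Knowing a variable leaves no uncertainty about itself, so `conditional_entropy(Y, Y)` evaluates to 0.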

### **Variation of Information**

$$VI(y, y') = H(y) + H(y') -2I(y, y')$$

This is the sum of two non-negative terms:
$$VI(y, y') = [H(y) -I(y,y')] + [H(y')-I(y,y')] = H(y|y') + H(y'|y)$$
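Putting the pieces together, the variation of information between the true labels and the predictions of the example can be computed from entropies and mutual information. This is a self-contained sketch; the function names are illustrative and the definitions repeat the count-based formulas used in these notes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information in bits between two label vectors of equal length."""
    n = len(x)
    joint_counts = Counter(zip(x, y))
    x_counts, y_counts = Counter(x), Counter(y)
    return sum(
        (c / n) * log2(n * c / (x_counts[a] * y_counts[b]))
        for (a, b), c in joint_counts.items()
    )


def variation_of_information(x, y):
    """VI(x, y) = H(x) + H(y) - 2 I(x, y), in bits."""
    return entropy(x) + entropy(y) - 2 * mutual_information(x, y)


Y = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
Y_hat = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

vi = variation_of_information(Y, Y_hat)
print(round(vi, 4))
```

Two identical labelings give a value of 0, and the value grows as the two partitions share less information.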