## Function to convert integer labels to one-hot encoding

This conversion is also known as encoding categorical data. We have 3 classes of
flowers: Iris setosa, Iris virginica and Iris versicolor. These classes can be coded
as the integers 0, 1 and 2 (numeric labels) or with 3 binary variables, one per class:

<table border="1">
<tr>
<td>Species</td>
<td>Y</td>
<td>Y_oh[0]</td>
<td>Y_oh[1]</td>
<td>Y_oh[2]</td>
</tr>
<tr>
<td>Iris setosa</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Iris virginica</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>Iris versicolor</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</table>
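
As a small sketch of the table above, the two encodings can be written down as arrays. The four sample labels are made up for illustration; the class numbering follows the table.

```python
import numpy as np

# Numeric labels, following the table: 0 = setosa, 1 = virginica, 2 = versicolor
Y = np.array([0, 1, 2, 1])

# One row per sample, one column per class; row k has a single 1 at column Y[k]
Y_oh = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1],
                 [0, 1, 0]])

# The numeric label can be recovered as the position of the 1 in each row
recovered = Y_oh.argmax(axis=1)
print(recovered)
```

Each row sums to 1, which is what makes the encoding "one-hot".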



The `oneHotIt` function below implements this conversion efficiently. Since the result
is mostly zeros, it starts from an array of zeros and sets a single 1 in each row.

The function takes the label vector `Y` and the number of classes `n_classes`. The output is an
array with as many rows as `Y` has elements and `n_classes` columns, so `n_classes` must be
greater than the largest label in `Y`.

As an illustration and as a matrix programming exercise, two implementations of the function that converts labels to "one-hot" encoding are presented below:

In [2]:
import numpy as np

First solution: What is the technique?

In [None]:
def oneHotIt2(Y, n_classes):
    Y = Y.reshape(-1, 1)  # column matrix, shape (n_samples, 1)
    i = np.arange(n_classes).reshape(1, n_classes)  # row matrix, shape (1, n_classes)
    Y_oh = (Y == i).astype(int)  # 1 where the label equals the column index
    return Y_oh
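
One way to read `oneHotIt2` is as a broadcast comparison between a column of labels and a row of class indices. The sketch below walks through the intermediate shapes; the three labels and the names `col`, `row` and `mask` are chosen here for illustration only.

```python
import numpy as np

Y = np.array([2, 0, 1])   # three example labels (made up for illustration)
n_classes = 3

col = Y.reshape(-1, 1)                             # shape (3, 1): one label per row
row = np.arange(n_classes).reshape(1, n_classes)   # shape (1, 3): [[0, 1, 2]]

# Comparing shapes (3, 1) and (1, 3) broadcasts both to (3, 3):
# mask[k, c] is True exactly when Y[k] == c
mask = (col == row)
Y_oh = mask.astype(int)

print(mask.shape)
print(Y_oh)
```

The comparison visits every (sample, class) pair, so the work grows with `n_samples * n_classes`.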

Second solution: What is the technique?

In [1]:
def oneHotIt(Y, n_classes):
    n_samples = Y.size  # number of samples
    i = np.arange(n_samples)  # row index of each sample
    Y_oh = np.zeros(shape=(n_samples, n_classes))  # float array of zeros
    Y_oh[i, Y] = 1  # set column Y[k] of row k to 1
    return Y_oh
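
`oneHotIt` can be read as an application of NumPy integer array indexing: two index arrays of the same length select one (row, column) pair per sample. The sketch below uses three made-up labels and the illustrative name `rows` to show which cells are written.

```python
import numpy as np

Y = np.array([2, 0, 1])   # three example labels (made up for illustration)
n_classes = 3

rows = np.arange(Y.size)              # [0, 1, 2]: one row index per sample
Y_oh = np.zeros((Y.size, n_classes))  # all zeros, float64 by default

# The index arrays are paired element by element, so the cells
# (0, 2), (1, 0) and (2, 1) are set to 1; every other cell is untouched
Y_oh[rows, Y] = 1

print(Y_oh)
```

Only `n_samples` cells are written, and the result is a float array because `np.zeros` defaults to `float64`, whereas `oneHotIt2` returns integers.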

In [8]:
Y = np.arange(10)%7
Y_oh = oneHotIt2(Y,7)
Y_oh

array([[1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0]])

In [9]:
Y = np.arange(10000)%10
%timeit oneHotIt(Y,10000)

10 loops, best of 3: 34.9 ms per loop


In [10]:
%timeit oneHotIt2(Y,10000)

1 loop, best of 3: 933 ms per loop
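
In the timings above, the indexing version takes 34.9 ms against 933 ms for the broadcasting version, roughly 27 times faster in that measurement. As a sanity check, the sketch below repeats both definitions so that it runs on its own and confirms that they produce the same values; the names `A`, `B` and `same_values` are introduced here for illustration.

```python
import numpy as np

def oneHotIt(Y, n_classes):
    # Indexing version: start from zeros and set one cell per row
    Y_oh = np.zeros(shape=(Y.size, n_classes))
    Y_oh[np.arange(Y.size), Y] = 1
    return Y_oh

def oneHotIt2(Y, n_classes):
    # Broadcasting version: compare a column of labels with a row of class indices
    return (Y.reshape(-1, 1) == np.arange(n_classes).reshape(1, n_classes)).astype(int)

Y = np.arange(10) % 7
A = oneHotIt(Y, 7)
B = oneHotIt2(Y, 7)

# Same values element by element, although the dtypes differ (float vs. int)
same_values = np.array_equal(A, B)
print(same_values, A.dtype, B.dtype)
```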
