<a href="https://colab.research.google.com/github/johnsanterre/santerreAI/blob/main/002.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 2: One-hot Encoding

One-hot encoding represents a categorical variable as numerical data. Each category gets its own position in a vector: for a given value, the position belonging to its category is set to 1 (the "hot" element) and every other position is set to 0. One-hot encoding is a common preprocessing step, because many machine learning algorithms expect numerical input and cannot handle categorical data directly.
This tutorial shows how to one-hot encode a list of words.
First, we create the list of words that we want to encode:


In [2]:
words = ['cat', 'dog', 'mouse', 'bird', 'lizard']
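
Before bringing in scikit-learn, the idea can be sketched by hand with NumPy. This is only an illustration, not the method used in the rest of the tutorial; the names vocabulary, index_of and manual_onehot are hypothetical, and the alphabetical column order is an assumption chosen for readability.

```python
import numpy as np

words = ['cat', 'dog', 'mouse', 'bird', 'lizard']

# Assumption: one column per distinct word, in alphabetical order.
vocabulary = sorted(set(words))
index_of = {word: i for i, word in enumerate(vocabulary)}

# Start from all zeros, then set exactly one 1 per row.
manual_onehot = np.zeros((len(words), len(vocabulary)))
for row, word in enumerate(words):
    manual_onehot[row, index_of[word]] = 1.0

print(vocabulary)
print(manual_onehot)
```

Each row contains a single 1, in the column that belongs to that row's word.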

Next, we will use the OneHotEncoder class from the sklearn.preprocessing module to perform the one-hot encoding. In this tutorial we first convert our list of words to a list of integers with the LabelEncoder class from the same module, and then one-hot encode those integers. (Since scikit-learn 0.20, OneHotEncoder also accepts string categories directly, so the integer step is not strictly required.)


In [3]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(words)
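
To see which integer each word received, it can help to inspect the fitted encoder. The sketch below repeats the cell above so that it runs on its own; word_to_code is a hypothetical helper name introduced only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

words = ['cat', 'dog', 'mouse', 'bird', 'lizard']
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(words)

# classes_ holds the distinct labels in sorted order;
# a label's position in classes_ is its integer code.
print(label_encoder.classes_.tolist())  # ['bird', 'cat', 'dog', 'lizard', 'mouse']
print(integer_encoded.tolist())         # [1, 2, 4, 0, 3]

# Hypothetical convenience mapping from each word to its code.
word_to_code = {w: int(c) for w, c in zip(words, integer_encoded)}
print(word_to_code)
```

The codes follow alphabetical order rather than the order of the original list, which is why cat maps to 1 and bird maps to 0.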

Now that we have converted our list of words to a list of integers, we can use the OneHotEncoder to perform the one-hot encoding.

In [4]:
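# sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2;
# older releases used the keyword sparse=False instead).
onehot_encoder = OneHotEncoder(sparse_output=False)
# OneHotEncoder expects a 2-D array: one row per sample, one column per feature.
integer_encoded = integer_encoded.reshape(-1, 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)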



The onehot_encoded variable now contains the one-hot encoded version of our list of words. It is a NumPy array with shape (5, 5), where each row represents a word from the list and each column represents one of the integer codes. Because LabelEncoder assigns codes in alphabetical order, the columns correspond to bird, cat, dog, lizard and mouse. The value at position (i, j) is 1 if the ith word in the list has integer code j, and 0 otherwise.
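
The properties described above can be checked with a few assertions. This is a minimal sketch that rebuilds the encoding so it runs on its own; the names codes and check are hypothetical. It uses the encoder's default sparse output and converts it with toarray(), so it does not depend on which keyword a given scikit-learn version uses for dense output.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

words = ['cat', 'dog', 'mouse', 'bird', 'lizard']
codes = LabelEncoder().fit_transform(words)

# Default output is a SciPy sparse matrix; toarray() makes it dense.
check = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

# One row per word, one column per distinct word.
assert check.shape == (len(words), len(set(words)))
# Exactly one "hot" entry in every row.
assert (check.sum(axis=1) == 1).all()
# The hot column of each row is that word's integer code.
assert (check.argmax(axis=1) == codes).all()
```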


In [5]:
print(onehot_encoded)

[[0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]]
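
As an aside, here is a sketch of encoding the words directly, without the integer step, and then decoding the result back to words. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories; the names word_column, direct_encoder, direct_encoded, column_words and decoded are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

words = ['cat', 'dog', 'mouse', 'bird', 'lizard']

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
word_column = np.array(words).reshape(-1, 1)

direct_encoder = OneHotEncoder()
direct_encoded = direct_encoder.fit_transform(word_column).toarray()

# categories_ lists, per input feature, the category behind each output column.
column_words = direct_encoder.categories_[0].tolist()
print(column_words)  # ['bird', 'cat', 'dog', 'lizard', 'mouse']

# inverse_transform maps each one-hot row back to its original category.
decoded = direct_encoder.inverse_transform(direct_encoded).ravel().tolist()
print(decoded)       # ['cat', 'dog', 'mouse', 'bird', 'lizard']
```

The resulting array matches the output printed above, and categories_ makes the meaning of each column explicit.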
