<!--BOOK_INFORMATION-->
*This notebook contains an excerpt from the upcoming book Machine Learning for OpenCV by Michael Beyeler.
The code is released under the [MIT license](https://opensource.org/licenses/MIT),
and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*

*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.
If you find this content useful, please consider supporting the work by
[buying the book](https://github.com/mbeyeler/opencv-machine-learning)!*

<!--NAVIGATION-->
< [Preprocessing Data](04.01-Preprocessing-Data.ipynb) | [Contents](../README.md) | [Representing images](04.03-Representing-images.ipynb) >

# Representing Categorical Variables

Consider the following data containing a list of some of the forefathers of machine learning and artificial intelligence:

In [1]:
data = [
    {'name': 'Alan Turing', 'born': 1912, 'died': 1954},
    {'name': 'Herbert A. Simon', 'born': 1916, 'died': 2001},
    {'name': 'Jacek Karpinski', 'born': 1927, 'died': 2010},
    {'name': 'J.C.R. Licklider', 'born': 1915, 'died': 1990},
    {'name': 'Marvin Minsky', 'born': 1927, 'died': 2016},
]

You might be tempted to encode the `name` strings with a straightforward numerical mapping:

In [2]:
{'Alan Turing': 1,
 'Herbert A. Simon': 2,
 'Jacek Karpinski': 3,
 'J.C.R. Licklider': 4,
 'Marvin Minsky': 5};
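
To make the concern with such a mapping concrete, here is a small sketch in plain Python. The variable names (`codes`, `mean_code`) are illustrative and not part of the original notebook. It shows that integer codes carry an order and distances that the names themselves do not have:

```python
# Illustrative integer codes, mirroring the mapping above.
codes = {
    'Alan Turing': 1,
    'Herbert A. Simon': 2,
    'Jacek Karpinski': 3,
    'J.C.R. Licklider': 4,
    'Marvin Minsky': 5,
}

# A model that treats the code as a number sees an ordering...
print(codes['Alan Turing'] < codes['Marvin Minsky'])          # True

# ...and distances that depend only on the arbitrary numbering.
print(abs(codes['Alan Turing'] - codes['Herbert A. Simon']))  # 1
print(abs(codes['Alan Turing'] - codes['Marvin Minsky']))     # 4

# Averaging two codes even lands on a third, unrelated name.
mean_code = (codes['Alan Turing'] + codes['Marvin Minsky']) / 2
print(mean_code == codes['Jacek Karpinski'])                  # True
```

None of these relationships say anything about the people involved, yet a learning algorithm that consumes the codes as numbers would be free to exploit them.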

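A better way is *one-hot encoding*, which gives each name its own binary column so that no order between names is implied. Scikit-learn's `DictVectorizer` does this for string-valued features and passes numeric features such as `born` and `died` through unchanged: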

In [3]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[1912, 1954,    1,    0,    0,    0,    0],
       [1916, 2001,    0,    1,    0,    0,    0],
       [1927, 2010,    0,    0,    0,    1,    0],
       [1915, 1990,    0,    0,    1,    0,    0],
       [1927, 2016,    0,    0,    0,    0,    1]], dtype=int32)

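To see the meaning of each column, you can inspect the feature names (from scikit-learn 1.0 on, call `vec.get_feature_names_out()` instead; `get_feature_names()` was removed in version 1.2):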

In [4]:
vec.get_feature_names()

['born',
 'died',
 'name=Alan Turing',
 'name=Herbert A. Simon',
 'name=J.C.R. Licklider',
 'name=Jacek Karpinski',
 'name=Marvin Minsky']
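
The column layout above can be reproduced by hand, which may help clarify what the encoding does. The following is a minimal sketch in plain Python; the helper `one_hot_encode` is hypothetical and is not how scikit-learn implements `DictVectorizer`. It assumes that string values become `key=value` indicator columns, that numeric values keep their own column, and that columns are sorted by name:

```python
data = [
    {'name': 'Alan Turing', 'born': 1912, 'died': 1954},
    {'name': 'Herbert A. Simon', 'born': 1916, 'died': 2001},
    {'name': 'Jacek Karpinski', 'born': 1927, 'died': 2010},
    {'name': 'J.C.R. Licklider', 'born': 1915, 'died': 1990},
    {'name': 'Marvin Minsky', 'born': 1927, 'died': 2016},
]


def one_hot_encode(records):
    """Hand-rolled illustration of one-hot encoding a list of dicts."""
    # First pass: collect one column per numeric key and one per
    # distinct (key, string value) pair.
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f'{key}={value}' if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: j for j, name in enumerate(feature_names)}

    # Second pass: fill one row per record.
    rows = []
    for rec in records:
        row = [0] * len(feature_names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f'{key}={value}']] = 1  # indicator column
            else:
                row[index[key]] = value           # numeric pass-through
        rows.append(row)
    return feature_names, rows


feature_names, rows = one_hot_encode(data)
print(feature_names)
for row in rows:
    print(row)
```

Each row contains exactly one `1` among the `name=...` columns, so every pair of names is equally far apart and no ordering is introduced.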

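If a categorical feature has many possible values, most entries of the one-hot matrix are zero, so it is more memory-efficient to return a sparse matrix, which stores only the non-zero entries: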

In [5]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

<5x7 sparse matrix of type '<class 'numpy.int32'>'
	with 15 stored elements in Compressed Sparse Row format>
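
The count of 15 stored elements can be checked by hand: each of the 5 rows has three non-zero entries (`born`, `died`, and one name indicator). The sketch below uses a simple dictionary keyed by `(row, column)` purely for illustration; this is not the Compressed Sparse Row layout that SciPy uses, and the helper `dense_vs_sparse` is a hypothetical name introduced here:

```python
# Dense matrix copied from the DictVectorizer output earlier in this section.
dense = [
    [1912, 1954, 1, 0, 0, 0, 0],
    [1916, 2001, 0, 1, 0, 0, 0],
    [1927, 2010, 0, 0, 0, 1, 0],
    [1915, 1990, 0, 0, 1, 0, 0],
    [1927, 2016, 0, 0, 0, 0, 1],
]

# Keep only the non-zero entries, keyed by (row, column).
stored = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}
total = len(dense) * len(dense[0])
print(len(stored), 'of', total, 'entries are non-zero')  # 15 of 35 entries are non-zero


def dense_vs_sparse(n_names):
    """Entry counts for n_names records, each with a distinct name.

    Assumes the same layout as above: two numeric columns plus one
    indicator column per name, and three non-zero entries per row.
    """
    dense_entries = n_names * (n_names + 2)
    sparse_entries = 3 * n_names
    return dense_entries, sparse_entries


print(dense_vs_sparse(10_000))  # (100020000, 30000)
```

With only five names the saving is small, but the dense size grows quadratically with the number of distinct values while the number of stored entries grows linearly.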
