# DictVectorizer

Hello, welcome back to this Machine Learning book with scikit-learn. It's time to start talking about feature engineering and data preprocessing. In the vast majority of cases, data preprocessing consists of transforming our variables into numbers so that our model can process them. Let's begin.

It is not uncommon to work with information stored in Python dictionaries; after all, they are one of the language's built-in types.

To handle this type of data, scikit-learn provides us with a data transformer called `DictVectorizer` to convert dictionaries with categorical and numerical features into vector representations.

To demonstrate an example, we are going to create a dataset in the form of a list of dictionaries:

In [None]:
data = [
    {'name': 'Hugo', 'age': 25, 'city': 'Bogotá'},
    {'name': 'Paco', 'age': 30, 'city': 'Tlaxcala'},
    {'name': 'Luis', 'age': 22, 'city': 'Buenos Aires'}
]

We import the class:

In [None]:
from sklearn.feature_extraction import DictVectorizer

We initialize an object:

In [None]:
vectorizer = DictVectorizer(sparse=False)

We fit the vectorizer on our input data and then immediately transform that same list of dictionaries:

In [None]:
vectorizer.fit(data)
vectorized_data = vectorizer.transform(data)

By doing this, and thanks to the `sparse=False` argument, we obtain a two-dimensional NumPy array:

In [None]:
vectorized_data

If you're curious about the order of the columns in this two-dimensional array, you can inspect the `feature_names_` attribute or the `vocabulary_` attribute:

In [None]:
print(vectorizer.feature_names_)
print(vectorizer.vocabulary_)

The first gives you an ordered list of the columns, while the second gives you a dictionary that maps each column name to its index within the resulting two-dimensional array.

We can thus see that the text columns have been encoded using the *one-hot encoding* technique: each distinct value gets its own column (named `key=value`, for example `city=Tlaxcala`), and a row holds a one in the column that matches its value and zeros in the other columns for that key. The "age" property, on the other hand, has remained as the numerical value it already was.
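To make that layout easier to read, here is a small sketch that rebuilds the same dataset and prints each row next to its column names. The variable names `age_col` and `tlaxcala_col` are only illustrative; the lookups rely on the `vocabulary_` and `feature_names_` attributes described above.

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'name': 'Hugo', 'age': 25, 'city': 'Bogotá'},
    {'name': 'Paco', 'age': 30, 'city': 'Tlaxcala'},
    {'name': 'Luis', 'age': 22, 'city': 'Buenos Aires'}
]

vectorizer = DictVectorizer(sparse=False)
vectorized_data = vectorizer.fit_transform(data)

# Look up column positions by feature name instead of hard-coding them
age_col = vectorizer.vocabulary_['age']
tlaxcala_col = vectorizer.vocabulary_['city=Tlaxcala']

# Pair every value in a row with the name of its column
for record, row in zip(data, vectorized_data):
    print(record['name'], dict(zip(vectorizer.feature_names_, row)))

# The numeric feature keeps its value; the string feature becomes 0/1 flags
print(vectorized_data[:, age_col])
print(vectorized_data[:, tlaxcala_col])
```

Only Paco's row has a one in the `city=Tlaxcala` column, while the `age` column holds the original ages untouched.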

## Extra parameters

Regarding the parameters we pass to the constructor, the most relevant one is one we have already used: `sparse`, which specifies the output type. It defaults to `True`, in which case the vectorizer returns a SciPy sparse matrix instead of a NumPy array:

In [None]:
vectorizer = DictVectorizer()
vectorized_data = vectorizer.fit_transform(data)
vectorized_data
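If you want to convince yourself that both output types carry the same information, the following sketch compares them. It assumes the same three-record dataset and uses the sparse matrix's `toarray()` method and `nnz` attribute from SciPy; the names `sparse_output` and `dense_output` are just for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

data = [
    {'name': 'Hugo', 'age': 25, 'city': 'Bogotá'},
    {'name': 'Paco', 'age': 30, 'city': 'Tlaxcala'},
    {'name': 'Luis', 'age': 22, 'city': 'Buenos Aires'}
]

# Default behaviour (sparse=True) versus the dense NumPy output
sparse_output = DictVectorizer().fit_transform(data)
dense_output = DictVectorizer(sparse=False).fit_transform(data)

# Confirm that the default output really is a SciPy sparse matrix
print(sparse.issparse(sparse_output))

# The sparse matrix only stores the non-zero entries
print(sparse_output.nnz, "stored values out of", dense_output.size)

# Converting to a dense array gives back the same numbers
print((sparse_output.toarray() == dense_output).all())
```

The sparse format pays off as the number of one-hot columns grows, because most entries in each row are zeros that never need to be stored.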

```{hint} For homework, I suggest you experiment by passing dictionaries with keys and values that the vectorizer hasn't seen before. Tell me in the comments: what happens?
```

## Conclusion

`DictVectorizer` is a powerful tool; however, it is not always the best way to encode your data.

Use it when dealing with structured data in the form of dictionaries whose values are either categorical strings or plain numbers.

You should also be careful when using it on categorical values with high cardinality, because every distinct value becomes its own column. In our previous example, you could consider the "name" property as one with high cardinality; after all, the number of possible names is practically unlimited.
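As a rough illustration of that effect, the sketch below builds a hypothetical dataset of 1000 people (the `person_0`, `person_1`, ... names are made up for this example) who each have a unique name but share only three cities, and then checks how many columns the vectorizer produces.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: every name is unique, but there are only three cities
cities = ['Bogotá', 'Tlaxcala', 'Buenos Aires']
people = [
    {'name': f'person_{i}', 'age': 20 + i % 30, 'city': cities[i % 3]}
    for i in range(1000)
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(people)

# One column for age, three for the cities, and one per distinct name
print(X.shape)

name_columns = [f for f in vectorizer.feature_names_ if f.startswith('name=')]
print(len(name_columns))
```

Almost all of the resulting columns come from the "name" key, and each of them is non-zero in a single row, which is rarely useful to a model.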

Another thing to keep in mind is that `DictVectorizer` is somewhat generic, and there are times when you'll need more control over how the transformation between input data and features occurs.

Let's continue exploring other ways to prepare our data. I'll see you in the next chapter.