# DictVectorizer

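`DictVectorizer` converts a list of feature dictionaries into a numeric matrix. String values are one-hot encoded, while numeric values are passed through unchanged. It is not always the best way to vectorize data: avoid it when a feature has high cardinality (a large number of unique categories), because every unique value becomes its own column.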
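As a rough illustration of the high-cardinality concern, the sketch below builds a hypothetical dataset (the `user_id` field and the record count are assumptions, not part of this notebook's data) in which every record has a distinct string value, and checks how wide the resulting matrix becomes.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data: 1,000 records, each with a distinct 'user_id' string.
records = [{'user_id': f'user_{i}', 'age': 20 + i % 50} for i in range(1000)]

vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(records)

# One column per distinct 'user_id' value, plus one column for numeric 'age'.
print(matrix.shape)
```

With a dense output, almost every entry in such a matrix is zero, which is why a sparse output or a different encoding is usually preferable in this situation.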

In [2]:
data = [
    {'name': 'Hugo', 'age': 25, 'city': 'Bogota'},
    {'name': 'Paco', 'age': 30, 'city': 'Tlaxcala'},
    {'name': 'Luis', 'age': 22, 'city': 'Buenos Aires'}
]

In [3]:
from sklearn.feature_extraction import DictVectorizer

In [4]:
vectorizer = DictVectorizer(sparse=False)

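- **sparse**:
  * If True (the default) -> `transform` returns a SciPy sparse matrix
  * If False -> `transform` returns a NumPy array

- `DictVectorizer` uses **one-hot encoding** for the categorical (string-valued) features; numeric features such as `age` keep their value in a single column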

In [5]:
vectorizer.fit(data)

In [6]:
vectorizer_data = vectorizer.transform(data)

In [7]:
vectorizer_data

array([[25.,  1.,  0.,  0.,  1.,  0.,  0.],
       [30.,  0.,  0.,  1.,  0.,  0.,  1.],
       [22.,  0.,  1.,  0.,  0.,  1.,  0.]])

In [8]:
type(vectorizer_data)

numpy.ndarray

In [9]:
vectorizer.feature_names_

['age',
 'city=Bogota',
 'city=Buenos Aires',
 'city=Tlaxcala',
 'name=Hugo',
 'name=Luis',
 'name=Paco']

In [10]:
vectorizer.vocabulary_

{'age': 0,
 'city=Bogota': 1,
 'city=Buenos Aires': 2,
 'city=Tlaxcala': 3,
 'name=Hugo': 4,
 'name=Luis': 5,
 'name=Paco': 6}

## Using `sparse=True`

In [11]:
vectorizer = DictVectorizer(sparse=True)

In [12]:
vectorizer.fit(data)

In [13]:
vectorizer_data = vectorizer.transform(data)
vectorizer_data

<3x7 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Row format>
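The sparse result stores only the non-zero entries. A minimal sketch, reusing the same three records, of how to inspect it and convert it back to a dense array (the variable names `sparse_vectorizer`, `sparse_data` and `dense` are chosen here for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'name': 'Hugo', 'age': 25, 'city': 'Bogota'},
    {'name': 'Paco', 'age': 30, 'city': 'Tlaxcala'},
    {'name': 'Luis', 'age': 22, 'city': 'Buenos Aires'}
]

sparse_vectorizer = DictVectorizer(sparse=True)
sparse_data = sparse_vectorizer.fit_transform(data)

# Number of explicitly stored (non-zero) entries: 3 per record here.
print(sparse_data.nnz)

# Convert to a regular NumPy array when a dense view is needed.
dense = sparse_data.toarray()
print(dense)
```

Calling `toarray()` is fine for small examples like this one, but it gives up the memory savings of the sparse format on large, mostly-zero matrices.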

----

Add more keys to the dictionaries to see how `DictVectorizer` behaves when some records lack a key. Here the third record has no `parents` key.

In [14]:
data = [
    {'name': 'Hugo', 'age': 25, 'city': 'Bogota', 'parents': 'Cesar'},
    {'name': 'Paco', 'age': 30, 'city': 'Tlaxcala', 'parents': 'Cielo'},
    {'name': 'Luis', 'age': 22, 'city': 'Buenos Aires'}
]

In [15]:
vectorizer = DictVectorizer(sparse=False)

In [16]:
vectorizer.fit(data)

In [17]:
vectorizer_data = vectorizer.transform(data)

In [18]:
vectorizer_data

array([[25.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.],
       [30.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.],
       [22.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])

In [19]:
type(vectorizer_data)

numpy.ndarray

In [20]:
vectorizer.feature_names_

['age',
 'city=Bogota',
 'city=Buenos Aires',
 'city=Tlaxcala',
 'name=Hugo',
 'name=Luis',
 'name=Paco',
 'parents=Cesar',
 'parents=Cielo']

In [21]:
vectorizer.vocabulary_

{'age': 0,
 'city=Bogota': 1,
 'city=Buenos Aires': 2,
 'city=Tlaxcala': 3,
 'name=Hugo': 4,
 'name=Luis': 5,
 'name=Paco': 6,
 'parents=Cesar': 7,
 'parents=Cielo': 8}
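The third record has no `parents` key, so both `parents=...` columns are zero in its row. A short sketch that checks this directly by looking up the column indices in `vocabulary_` (the helper name `parent_columns` is only illustrative):

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'name': 'Hugo', 'age': 25, 'city': 'Bogota', 'parents': 'Cesar'},
    {'name': 'Paco', 'age': 30, 'city': 'Tlaxcala', 'parents': 'Cielo'},
    {'name': 'Luis', 'age': 22, 'city': 'Buenos Aires'}
]

vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(data)

# Column indices of every one-hot column derived from the 'parents' key.
parent_columns = sorted(
    index for name, index in vectorizer.vocabulary_.items()
    if name.startswith('parents=')
)

# Row 2 is the record without a 'parents' key: all of these entries are 0.
print(matrix[2, parent_columns])
```

Note that a missing key is therefore indistinguishable from "none of the known categories" in the encoded output.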

## Using `sparse=True`

In [22]:
vectorizer = DictVectorizer(sparse=True)

In [23]:
vectorizer.fit(data)

In [24]:
vectorizer_data = vectorizer.transform(data)
vectorizer_data

<3x9 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>
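A related behaviour worth checking is what happens at transform time when a record contains a value that was not seen during `fit`. In the sketch below, the new record (with the made-up name `Ana`) is an assumption added for illustration; per the scikit-learn documentation, features not encountered during fitting are silently ignored.

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'name': 'Hugo', 'age': 25, 'city': 'Bogota', 'parents': 'Cesar'},
    {'name': 'Paco', 'age': 30, 'city': 'Tlaxcala', 'parents': 'Cielo'},
    {'name': 'Luis', 'age': 22, 'city': 'Buenos Aires'}
]

vectorizer = DictVectorizer(sparse=False)
vectorizer.fit(data)

# Hypothetical new record: 'name=Ana' was never seen during fit.
new_record = [{'name': 'Ana', 'age': 41, 'city': 'Bogota'}]
row = vectorizer.transform(new_record)

# Only 'age' and 'city=Bogota' are set; the unseen name adds no column.
print(row)
```

The output keeps the nine columns learned during fitting, so downstream models always receive a matrix of the same width.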