# Creating Embeddings for Categorical Variables

* A categorical variable represents categories or labels.

* Machine learning (ML) and deep learning (DL) models work only with numeric inputs. We therefore need to convert a categorical variable into numeric values before feeding it to an ML or DL model.

* Traditionally, categorical variables are converted to numbers with ***one-hot encoding*** or ***label encoding***.



## One-hot Encoding

* In one-hot encoding, we create one new feature per unique category of the original feature. For each row, the feature that represents that row's category is set to 1 and all the others are set to 0.

* This technique becomes problematic when a feature has many categories (unique values), because the resulting data is very sparse. In addition, every one-hot vector is equidistant from every other, so any relationship between categories is lost.
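
As a minimal sketch of the two points above, the snippet below one-hot encodes a made-up `transport` column with `pandas.get_dummies` and then checks that all distinct one-hot vectors are the same distance apart. The column name and values are illustrative and are not part of the insurance data.

```python
import numpy as np
import pandas as pd

# Toy column with three categories (illustrative data only)
df_demo = pd.DataFrame({"transport": ["bus", "car", "bicycle", "car"]})

# One new 0/1 column per unique category
one_hot = pd.get_dummies(df_demo["transport"], prefix="transport", dtype=int)
print(one_hot)

# Pairwise Euclidean distances between the distinct one-hot vectors
unique_vectors = one_hot.drop_duplicates().to_numpy(dtype=float)
diffs = unique_vectors[:, None, :] - unique_vectors[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))
print(distances)  # every off-diagonal entry is sqrt(2)
```

With thousands of categories the same call would produce thousands of mostly-zero columns, which is the sparsity problem described above.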


## Label Encoding

* Label encoding simply converts each value in the column to an integer. The technique is very simple, but the numeric ordering invites comparisons between categories.

* For example, if we have three modes of transport (bus, car, and bicycle) and label them 1, 2, and 3 respectively, we implicitly assume an order or weight associated with each mode, which may not be what we want.
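
The transport example can be made concrete with a few lines of plain Python. This is only an illustration of the ordering problem; the `trips` list is invented for the example.

```python
# Integer labels taken from the example in the text
labels = {"bus": 1, "car": 2, "bicycle": 3}

trips = ["bus", "bicycle", "car", "bus"]  # made-up observations
encoded = [labels[trip] for trip in trips]
print(encoded)  # [1, 3, 2, 1]

# A model that treats the codes as numbers sees relations that do not exist:
# the "average" of bus (1) and bicycle (3) equals the code for car (2).
midpoint = (labels["bus"] + labels["bicycle"]) / 2
print(midpoint == labels["car"])  # True
```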

## Categorical Embedding

* In a categorical embedding, each category of a categorical variable is mapped to an n-dimensional vector. This mapping is learned by a neural network during a standard supervised training process.

* Afterwards, we replace each category in our data with its corresponding vector.

* Categorical embeddings have two advantages: (1) we can limit the number of columns needed per variable, which is useful when a variable has many categories; and (2) the embeddings learned by the neural network reveal intrinsic properties of the categorical variable, meaning that similar categories will have similar embeddings.
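
To make "learned during supervised training" concrete, here is a deliberately small NumPy sketch. It is not how `categorical_embedder` is implemented: it assumes a toy model made of one embedding table followed by a single linear layer, trained with plain gradient descent on made-up data. All names (`embedding_matrix`, `weights`, `category_ids`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 3 categories whose targets cluster around different values
n_categories, embedding_dim, n_rows = 3, 2, 300
category_ids = rng.integers(0, n_categories, size=n_rows)
targets = np.array([1.0, 5.0, 9.0])[category_ids] + rng.normal(0, 0.1, size=n_rows)

# Trainable parameters: one embedding row per category plus a linear output layer
embedding_matrix = rng.normal(0, 0.1, size=(n_categories, embedding_dim))
weights = rng.normal(0, 0.1, size=embedding_dim)
bias = 0.0

def mean_squared_error(table, w, b):
    predictions = table[category_ids] @ w + b
    return float(np.mean((predictions - targets) ** 2))

initial_loss = mean_squared_error(embedding_matrix, weights, bias)
learning_rate = 0.02
for _ in range(3000):
    looked_up = embedding_matrix[category_ids]      # embedding lookup, shape (300, 2)
    errors = looked_up @ weights + bias - targets    # shape (300,)
    grad_weights = 2 * looked_up.T @ errors / n_rows
    grad_bias = 2 * errors.mean()
    grad_embedding = np.zeros_like(embedding_matrix)
    # Accumulate each row's gradient into the embedding row of its category
    np.add.at(grad_embedding, category_ids, 2 * errors[:, None] * weights / n_rows)
    weights -= learning_rate * grad_weights
    bias -= learning_rate * grad_bias
    embedding_matrix -= learning_rate * grad_embedding
final_loss = mean_squared_error(embedding_matrix, weights, bias)
print(initial_loss, final_loss)

# After training, each category id is replaced by its learned vector
encoded_rows = embedding_matrix[category_ids[:5]]
print(encoded_rows.shape)  # (5, 2)
```

The embedding rows are ordinary trainable weights: they receive gradients from the prediction error like any other layer, which is why categories that behave alike with respect to the target tend to end up with nearby vectors.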

See the [article](https://medium.com/analytics-vidhya/categorical-embedder-encoding-categorical-variables-via-neural-networks-b482afb1409d) for more details.

This tutorial shows how to create categorical embeddings for ML or DL models.


**Medical Cost Personal Datasets**

Source: https://www.kaggle.com/mirichoi0218/insurance

Variable definitions:

*age:* age of the primary beneficiary

*sex:* sex of the insurance contractor (female or male)

*bmi:* body mass index, defined as kg / m^2

*children:* number of children/dependents covered by the health insurance

*smoker:* smoking status (yes or no)

*region:* the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)

*charges:* individual medical costs billed by the health insurance

In [5]:
# categorical_embedder works with older versions of keras and tensorflow,
# so we need to downgrade keras and tensorflow accordingly.
# The runtime must be restarted before the downgraded versions can be imported.
!pip install tensorflow_addons==0.8.3 --quiet
!pip install tqdm==4.41.1 --quiet
!pip install keras==2.3.1 --quiet
!pip install tensorflow==2.2.0 --quiet

ERROR: Could not find a version that satisfies the requirement tensorflow_addons==0.8.3 (from versions: 0.10.0, 0.11.0, 0.11.1, 0.11.2, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.0, 0.16.1, 0.17.0, 0.17.1, 0.18.0, 0.19.0)
ERROR: No matching distribution found for tensorflow_addons==0.8.3


In [6]:
# Install categorical_embedder
!pip install categorical-embedder --quiet

In [7]:
# Import libraries
import tensorflow as tf
import keras
import categorical_embedder as ce
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [22]:
# Load the medical insurance dataset
df = pd.read_csv('insurance.csv')
print(df.shape)
df.head()

(1338, 7)


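|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |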


In [9]:
# Separate the features from the target
X = df.drop(['charges'], axis = 1)
y = df['charges']

In [10]:
# ce.get_embedding_info identifies the categorical variables.
# It returns a dictionary that maps each one to a tuple of
# (number of categories, embedding size).
# Note: by default, the embedding size is half the number of categories.
# We can also override the default by building the dictionary by hand.
embedding_info = ce.get_embedding_info(X)
embedding_info

{'sex': (2, 1), 'smoker': (2, 1), 'region': (4, 2)}
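
The rule described in the comments can be sketched with a hypothetical helper. `sketch_embedding_info` is not part of `categorical_embedder`; it assumes that every non-numeric column is categorical and that the embedding size is half the number of categories, rounded up, and the library's exact rule may differ. The sample rows are illustrative.

```python
import pandas as pd

def sketch_embedding_info(frame):
    """Hypothetical helper: (n_categories, embedding_size) per non-numeric column.

    Assumes embedding_size = ceil(n_categories / 2).
    """
    info = {}
    for column in frame.select_dtypes(exclude="number").columns:
        n_categories = frame[column].nunique()
        info[column] = (n_categories, (n_categories + 1) // 2)
    return info

# A few illustrative rows shaped like the insurance features
sample = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})
info = sketch_embedding_info(sample)
print(info)  # {'sex': (2, 1), 'smoker': (2, 1), 'region': (4, 2)}
```

Because the result is a plain dictionary, a different size can be requested by editing an entry before training, for example `info["region"] = (4, 3)`.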

In [23]:
# ce.get_label_encoded_data integer-encodes the categorical variables
# and prepares the data to be fed to the neural network.
X_encoded, encoders = ce.get_label_encoded_data(X)
X_encoded.head()

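|   | age | sex | bmi | children | smoker | region |
|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 3 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 2 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 2 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 1 |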


In [24]:
# Show the fitted encoders
encoders

{'sex': LabelEncoder(),
 'smoker': LabelEncoder(),
 'region': LabelEncoder()}
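
The printed objects look like scikit-learn `LabelEncoder` instances. Assuming that is what they are, the sketch below fits one on a handful of region values to show how integer codes map back to category names. The `regions` list is illustrative.

```python
from sklearn.preprocessing import LabelEncoder

regions = ["southwest", "southeast", "southeast", "northwest", "northwest", "northeast"]

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ holds the categories in sorted order; the position is the integer code
print(list(encoder.classes_))  # ['northeast', 'northwest', 'southeast', 'southwest']
print(list(codes))             # [3, 2, 2, 1, 1, 0]

# inverse_transform recovers the original names from the codes
print(list(encoder.inverse_transform([3, 2, 1])))
```

This matches the encoded table above, where southwest appears as 3, southeast as 2, and northwest as 1.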

In [13]:
# Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y)
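
The call above uses the defaults, so the split changes on every run and the learned embeddings change with it. A small sketch on made-up arrays shows how `test_size` and `random_state` make the split explicit and repeatable; the values 0.25 and 42 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)  # 10 made-up rows, 2 columns
target = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# The same random_state reproduces the same split
X_tr_again, _, _, _ = train_test_split(
    features, target, test_size=0.25, random_state=42
)
print(np.array_equal(X_tr, X_tr_again))  # True
```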

In [14]:
# ce.get_embeddings trains a neural network,
# extracts the embeddings, and returns them in a dictionary
embeddings = ce.get_embeddings(
  # Provide the training set
  X_train, y_train, 
  # Provide the embedding info
  categorical_embedding_info = embedding_info, 
  # Our target is a cost, so this is a regression problem
  is_classification = False,  
  # Specify the number of epochs and the batch size
  epochs = 100, batch_size = 32)


In [25]:
# Take a look at the learned embeddings
embeddings

{'sex': array([[-0.3206883 ],
        [-0.02961533]], dtype=float32), 'smoker': array([[-1.4210148],
        [ 1.2758241]], dtype=float32), 'region': array([[-0.14003518, -0.09384121],
        [ 0.24018909,  0.24068332],
        [ 0.00924863, -0.02692768],
        [ 0.33370847,  0.31401742]], dtype=float32)}

In [26]:
# Embedding shapes
print(embeddings['sex'].shape)
print(embeddings['smoker'].shape)
print(embeddings['region'].shape)

(2, 1)
(2, 1)
(4, 2)


In [27]:
# If you don't like the dictionary format,
# we can convert it to dataframes for readability.
dfs = ce.get_embeddings_in_dataframe(
  embeddings = embeddings, 
  encoders = encoders)



In [28]:
# Embeddings for region
dfs['region']

|   | region_embedding_0 | region_embedding_1 |
|---|---|---|
| northeast | -0.140035 | -0.093841 |
| northwest | 0.240189 | 0.240683 |
| southeast | 0.009249 | -0.026928 |
| southwest | 0.333708 | 0.314017 |
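
One way to inspect whether "similar categories get similar embeddings" is to compare distances between the learned vectors. The sketch below copies the region values printed earlier and finds the closest pair by Euclidean distance. These numbers come from a single training run on this dataset, so the result is an illustration and not a general finding.

```python
import numpy as np

region_names = ["northeast", "northwest", "southeast", "southwest"]
# Values copied from the embeddings output shown earlier
region_vectors = np.array([
    [-0.14003518, -0.09384121],
    [0.24018909, 0.24068332],
    [0.00924863, -0.02692768],
    [0.33370847, 0.31401742],
])

# Pairwise Euclidean distances between all region vectors
diffs = region_vectors[:, None, :] - region_vectors[None, :, :]
distances = np.linalg.norm(diffs, axis=-1)

# Ignore the zero distance of each vector to itself, then find the closest pair
np.fill_diagonal(distances, np.inf)
i, j = np.unravel_index(np.argmin(distances), distances.shape)
print(region_names[i], region_names[j], round(float(distances[i, j]), 3))
# northwest southwest 0.119
```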


In [19]:
# Embeddings for sex
dfs['sex']

|   | sex_embedding_0 |
|---|---|
| female | -0.320688 |
| male | -0.029615 |


In [20]:
# Embeddings for smoker
dfs['smoker']

|   | smoker_embedding_0 |
|---|---|
| no | -1.421015 |
| yes | 1.275824 |
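
The dataframes above pair each row of an embedding array with a category name. A rough equivalent can be built by hand with pandas, assuming the row order of the array follows the sorted category names (as a label encoder would produce). The column naming below mirrors the output shown and is otherwise an assumption.

```python
import numpy as np
import pandas as pd

smoker_categories = ["no", "yes"]  # sorted order, assumed to match the encoder
smoker_vectors = np.array([[-1.4210148], [1.2758241]])  # copied from the output above

smoker_df = pd.DataFrame(
    smoker_vectors,
    index=smoker_categories,
    columns=[f"smoker_embedding_{k}" for k in range(smoker_vectors.shape[1])],
)
print(smoker_df)
```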


In [30]:
# Add these embeddings to the dataset
data = ce.fit_transform(
  X, 
  embeddings = embeddings, 
  encoders = encoders, 
  # Drop the original categorical variables
  drop_categorical_vars = True)
data.head()


|   | age | bmi | children | sex_embedding_0 | smoker_embedding_0 | region_embedding_0 | region_embedding_1 |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.9 | 0 | -0.320688 | 1.275824 | 0.333708 | 0.314017 |
| 1 | 18 | 33.77 | 1 | -0.029615 | -1.421015 | 0.009249 | -0.026928 |
| 2 | 28 | 33.0 | 3 | -0.029615 | -1.421015 | 0.009249 | -0.026928 |
| 3 | 33 | 22.705 | 0 | -0.029615 | -1.421015 | 0.240189 | 0.240683 |
| 4 | 32 | 28.88 | 0 | -0.029615 | -1.421015 | 0.240189 | 0.240683 |
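
The replacement step can also be sketched with plain pandas, which is useful when the embeddings need to be applied to new rows without the library. This is an assumption-based illustration and not the library's implementation: the `lookup` tables hold the rounded values printed above, keyed by category name, and `rows` is a small made-up subset of the features.

```python
import pandas as pd

rows = pd.DataFrame({
    "age": [19, 18, 33],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Learned vectors copied from the outputs above, indexed by category name
lookup = {
    "smoker": pd.DataFrame(
        {"smoker_embedding_0": [-1.421015, 1.275824]},
        index=["no", "yes"],
    ),
    "region": pd.DataFrame(
        {
            "region_embedding_0": [-0.140035, 0.240189, 0.009249, 0.333708],
            "region_embedding_1": [-0.093841, 0.240683, -0.026928, 0.314017],
        },
        index=["northeast", "northwest", "southeast", "southwest"],
    ),
}

transformed = rows.copy()
for column, table in lookup.items():
    # Look up one embedding row per observation, then swap it in for the column
    vectors = table.loc[transformed[column].tolist()].reset_index(drop=True)
    transformed = pd.concat([transformed.drop(columns=column), vectors], axis=1)
print(transformed)
```

Dropping the original column after the lookup plays the same role as `drop_categorical_vars = True` in the call above.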
