<a href="https://colab.research.google.com/github/isa-ulisboa/greends-pml/blob/main/tests/categorical_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here is an example of using a categorical embedding with PyTorch. A common way to turn categorical data into continuous values is `nn.Embedding`, which stores one learnable vector per class. The vectors are randomly initialized; during training they are adjusted so that two classes that are similar (in a specific context) end up closer to each other than two dissimilar classes¹.

```python
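import torch.nn as nn
import torch

num_classes = 4       # number of distinct categories
embedding_size = 10   # length of the vector learned for each category
embedding = nn.Embedding(num_classes, embedding_size)

# Each entry is the integer index (0 .. num_classes - 1) of a sample's category
class_vector = torch.tensor([1, 0, 3, 3, 2])

# Look up one embedding vector per entry; the result has shape (5, 10)
embedded_classes = embedding(class_vector)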
```

In this example, we create an `nn.Embedding` object with `num_classes` (the number of distinct categories) and `embedding_size` (the length of each vector) as arguments. Then we create a tensor `class_vector` that holds the integer index of each sample's category. Finally, we pass this tensor to our embedding object to get the embedded representation of our categorical data, one 10-dimensional vector per sample¹.
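To make the lookup behaviour concrete, the following minimal sketch repeats the setup above and checks three things: the output shape, that the layer acts as a table lookup into its weight matrix, and that repeated class indices receive the same vector. The variable names `same_as_lookup` and `same_vector` are introduced here for illustration only.

```python
import torch
import torch.nn as nn

num_classes = 4
embedding_size = 10
embedding = nn.Embedding(num_classes, embedding_size)
class_vector = torch.tensor([1, 0, 3, 3, 2])
embedded_classes = embedding(class_vector)

# One row per input index, one column per embedding dimension
print(embedded_classes.shape)

# The layer is a lookup table: indexing the weight matrix returns the same rows
same_as_lookup = torch.equal(embedded_classes, embedding.weight[class_vector])
print(same_as_lookup)

# Both samples of class 3 (positions 2 and 3) receive the identical vector
same_vector = torch.equal(embedded_classes[2], embedded_classes[3])
print(same_vector)
```

Because the values are randomly initialized, the numbers differ from run to run unless a seed is set, but the shape and the two equality checks do not.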


Source: Conversation with Bing, 4/19/2023

(1) deep learning - How to create a Pytorch network with mixed categorical .... https://stackoverflow.com/questions/62242396/how-to-create-a-pytorch-network-with-mixed-categorical-and-continuous-matrix-inp Accessed 4/19/2023.
(2) Embedding — PyTorch 2.0 documentation. https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html Accessed 4/19/2023.
(3) Pytorch: encode categorical feature by using nn.Embedding. https://stackoverflow.com/questions/63844428/pytorch-encode-categorical-feature-by-using-nn-embedding Accessed 4/19/2023.

In [7]:
import torch.nn as nn
import torch
import string

num_classes = 4
embedding_size = 10
embedding = nn.Embedding(num_classes, embedding_size)
class_vector = torch.tensor([1, 0, 3, 3, 2])
embedded_classes = embedding(class_vector)

In [2]:
embedded_classes

tensor([[ 0.2130, -1.0381,  0.0809, -0.9300,  1.2288,  0.4709,  0.6693, -1.2708,
         -0.5324,  0.7186],
        [-0.3471,  0.5338,  1.0348, -0.7506,  0.5978,  1.0767, -0.8734, -1.7563,
          1.5826,  2.2270],
        [ 1.3843,  0.4443, -0.6798,  1.2588,  0.2433, -0.0122,  1.3599,  2.1358,
         -0.9531, -0.5399],
        [ 1.3843,  0.4443, -0.6798,  1.2588,  0.2433, -0.0122,  1.3599,  2.1358,
         -0.9531, -0.5399],
        [-0.5726,  0.5357,  0.4664, -0.4321, -0.9876,  0.0523, -1.4601,  0.2823,
         -0.8536, -0.0932]], grad_fn=<EmbeddingBackward0>)
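The `grad_fn=<EmbeddingBackward0>` in the output indicates that the embedding vectors take part in backpropagation. As a rough sketch of what that means in practice, the block below feeds the embedding, concatenated with two continuous features, into a linear layer and takes a single optimizer step. The `MixedInputNet` class, the random continuous features, and the random targets are all hypothetical and serve only to show that the embedding weights change during training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class MixedInputNet(nn.Module):
    """Hypothetical model combining one categorical and several continuous inputs."""

    def __init__(self, num_classes, embedding_size, num_continuous):
        super().__init__()
        self.embedding = nn.Embedding(num_classes, embedding_size)
        self.linear = nn.Linear(embedding_size + num_continuous, 1)

    def forward(self, class_idx, continuous):
        embedded = self.embedding(class_idx)                  # (batch, embedding_size)
        features = torch.cat([embedded, continuous], dim=1)  # append continuous columns
        return self.linear(features)


model = MixedInputNet(num_classes=4, embedding_size=10, num_continuous=2)
class_vector = torch.tensor([1, 0, 3, 3, 2])
continuous = torch.randn(5, 2)  # made-up continuous features
targets = torch.randn(5, 1)     # made-up regression targets

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
weights_before = model.embedding.weight.detach().clone()

# One training step: forward pass, loss, backward pass, parameter update
loss = nn.functional.mse_loss(model(class_vector, continuous), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

weights_changed = not torch.equal(weights_before, model.embedding.weight.detach())
print(weights_changed)
```

With real data and many such steps, this update is what moves the vectors of similar classes closer together.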

In [8]:
string.ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
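The notebook does not say how this alphabet is used. One plausible use, shown here purely as an assumption, is to treat the 26 letters as categories: each letter is mapped to an integer index, and the indices are passed to an embedding layer. The names `letter_to_index`, `letter_embedding`, and the input `word` are hypothetical.

```python
import string

import torch
import torch.nn as nn

letters = string.ascii_uppercase

# Map each letter to an integer index: 'A' -> 0, 'B' -> 1, ..., 'Z' -> 25
letter_to_index = {letter: idx for idx, letter in enumerate(letters)}

embedding_size = 10  # arbitrary choice, same as in the example above
letter_embedding = nn.Embedding(len(letters), embedding_size)

word = "PYTORCH"  # hypothetical input
indices = torch.tensor([letter_to_index[ch] for ch in word])
embedded_word = letter_embedding(indices)

print(indices.tolist())     # [15, 24, 19, 14, 17, 2, 7]
print(embedded_word.shape)  # torch.Size([7, 10])
```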

One-hot encoding converts a categorical variable into a set of binary indicator columns, one per category, so that it can be passed to machine learning algorithms that expect numeric input. In Python, one way to perform one-hot encoding is by using the `pandas` library. Here's an example:

```python
import pandas as pd

data = {'fruits': ['apple', 'banana', 'orange', 'banana']}
df = pd.DataFrame(data)

# dtype=int yields 0/1 columns; pandas 2.0 and later default to True/False
one_hot_df = pd.get_dummies(df, columns=['fruits'], dtype=int)
print(one_hot_df)
```

This will output:

```
   fruits_apple  fruits_banana  fruits_orange
0             1              0              0
1             0              1              0
2             0              0              1
3             0              1              0
```

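In this example, we create a `DataFrame` with a column `fruits` containing categorical data. Then we use the `pd.get_dummies` function to perform one-hot encoding on the `fruits` column: each category becomes its own column, named with the `fruits_` prefix, and each row has a 1 in exactly one of these columns.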
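The two encodings in this notebook are closely related. As a small sketch, assuming the same `class_vector` as in the embedding example, `torch.nn.functional.one_hot` builds the one-hot matrix in PyTorch, and multiplying that matrix by the embedding weights gives the same result as the embedding lookup. The names `projected` and `matches` are introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 4
embedding_size = 10
embedding = nn.Embedding(num_classes, embedding_size)
class_vector = torch.tensor([1, 0, 3, 3, 2])

# One row per sample, one column per class, with a single 1 in each row
one_hot = F.one_hot(class_vector, num_classes=num_classes)
print(one_hot)

# Multiplying a one-hot row by the weight matrix selects one row of weights,
# which is what the embedding lookup does directly
projected = one_hot.float() @ embedding.weight
matches = torch.allclose(projected, embedding(class_vector))
print(matches)
```

In practice the lookup is preferred over the matrix product because it avoids building the one-hot matrix, which grows with the number of classes.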
