# Categorical encoding 

## References

1. [Kaggle notebook: how to train your custom word embedding](https://www.kaggle.com/code/chewzy/tutorial-how-to-train-your-custom-word-embedding/notebook)

1. [Machine Learning Mastery: develop word embeddings in Python with gensim](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/)

1. [gensim Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html)

1. [dirty_cat documentation: dirty categories example](https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#sphx-glr-auto-examples-01-dirty-categories-py)

In [1]:
# Use labelled training data. Each pair in `train` is [wuc_as_coded, wuc_manual].

# A Word2Vec model trained on these pairs should learn that A is closer to B than to C.
# That is, when A is miscoded, it is more likely to be miscoded as B than as C.

# The embedding should also learn that B and C are very close to each other,
# closer than A and B are.

# The resulting embedding can then be passed as a feature to the WUC classifier.
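
As a rough illustration of that last point, the sketch below turns a list of codes into fixed-length feature rows by looking each code up in an embedding table. The vector values, the `codes_to_features` helper, and the zero-vector fallback for unseen codes are all assumptions made for illustration; in practice the vectors would come from the trained model's `wv`.

```python
# Hypothetical 3-dimensional embeddings; real ones would come from model.wv.
toy_embeddings = {
    'A': [0.1, 0.0, -0.2],
    'B': [0.3, 0.4, 0.0],
    'C': [0.3, 0.5, 0.1],
}


def codes_to_features(codes, table, dim):
    """Map each code to its vector; unseen codes get a zero vector (an assumption)."""
    zero = [0.0] * dim
    return [list(table.get(code, zero)) for code in codes]


# 'Z' is not in the table, so it falls back to the zero vector.
features = codes_to_features(['B', 'A', 'Z'], toy_embeddings, dim=3)
print(features)
```

Each row could then be concatenated with the classifier's other input features. How unseen codes should be handled is a design choice; a zero vector is only one option.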

In [2]:
from gensim.models import Word2Vec

train = [
    ['A', 'B'], 
    
    ['B', 'C'], 
    ['B', 'C'], 
    ['B', 'C'], 
    
    ['C', 'B'], 
    ['C', 'B'], 
    ['C', 'B'], 
]

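# sg=1 selects skip-gram; min_count=1 keeps every code, even one seen only once;
# window=1 means each code's context is just the other code in its pair.
model = Word2Vec(sentences=train, sg=1, vector_size=10, min_count=1, window=1)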

print(f'indexes: {model.wv.key_to_index}')

embeddings = {}
for word in ['A', 'B', 'C']: 
    vec = model.wv.get_vector(word)
    embeddings[word] = vec

print(f'embeddings: {embeddings} \n')

for word in ['A', 'B', 'C']: 
    print(f'{word} most similar to {model.wv.most_similar(word)}')

indexes: {'B': 0, 'C': 1, 'A': 2}
embeddings: {'A': array([ 0.07311766,  0.05070262,  0.06757693,  0.00762866,  0.06350889,
       -0.03405366, -0.00946403,  0.05768573, -0.07521639, -0.03936105],
      dtype=float32), 'B': array([-0.00536227,  0.0023643 ,  0.0510335 ,  0.09009273, -0.0930295 ,
       -0.07116809,  0.06458871,  0.08972988, -0.05015428, -0.03763373],
      dtype=float32), 'C': array([ 0.07380505, -0.01533473, -0.04536615,  0.06554051, -0.0486016 ,
       -0.01816018,  0.0287658 ,  0.00991874, -0.08285215, -0.09448819],
      dtype=float32)} 

A most similar to [('C', 0.32937222719192505), ('B', 0.3004249632358551)]
B most similar to [('C', 0.5436005592346191), ('A', 0.3004249632358551)]
C most similar to [('B', 0.5436005592346191), ('A', 0.3293722867965698)]
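
The scores returned by `most_similar` are cosine similarities between embedding vectors. As a sanity check, the sketch below recomputes the A/B score from the vectors printed above, using only the standard library. The `cosine` helper is written here for illustration and is not part of gensim.

```python
import math

# Vectors copied from the printed embeddings above.
vec_a = [0.07311766, 0.05070262, 0.06757693, 0.00762866, 0.06350889,
         -0.03405366, -0.00946403, 0.05768573, -0.07521639, -0.03936105]
vec_b = [-0.00536227, 0.0023643, 0.0510335, 0.09009273, -0.0930295,
         -0.07116809, 0.06458871, 0.08972988, -0.05015428, -0.03763373]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


sim_ab = cosine(vec_a, vec_b)
# Should agree with the A/B score reported by most_similar, up to rounding.
print(f'cosine(A, B) = {sim_ab:.4f}')
```

In this particular run, A scores slightly higher against C (0.329) than against B (0.300), so the first expectation stated earlier is not met here, while B and C (0.544) are indeed the closest pair. With only seven two-code training pairs, the vectors probably stay close to their random initialisation, so the ordering may differ between runs.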
