# Idea

Use the ```tf.lookup``` module to map strings to indices.

1. Use [tf.lookup.KeyValueTensorInitializer](https://www.tensorflow.org/api_docs/python/tf/lookup/KeyValueTensorInitializer) as the backend of the (key, value) lookup table.
```
tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=None, value_dtype=None, name=None
)
```

2. Use [tf.lookup.StaticVocabularyTable](https://www.tensorflow.org/api_docs/python/tf/lookup/StaticVocabularyTable) for the string-to-index map. It handles unknown vocabulary tokens using OOV (out-of-vocabulary) buckets.
```
tf.lookup.StaticVocabularyTable(
    initializer,                     # <--- KeyValueTensorInitializer instance
    num_oov_buckets,                 # <--- Number of buckets to manage unknown vocabulary
    lookup_key_dtype=None,
    name=None,
    experimental_is_anonymous=False
)
```

3. Run the string-to-index lookup to get the ```indices``` of the strings.
4. Run [tf.one_hot](https://www.tensorflow.org/api_docs/python/tf/one_hot) to get the One Hot Encoding. 
```
tf.one_hot(
    indices=indices,
    depth=len(vocabulary) + num_oov_buckets,  # Add num_oov_buckets to encode unknown strings
    on_value=None,
    off_value=None,
    axis=None,
    dtype=None,
    name=None
)
```


# tf.lookup module

> The tf.lookup module provides several types of tables, including:
> * tf.lookup.StaticHashTable: This is a read-only table that is initialized from a set of keys and values at graph construction time.
> * tf.lookup.experimental.MutableHashTable: This is a mutable table that allows you to insert, delete, and update entries at runtime.
> * tf.lookup.StaticVocabularyTable: This is a read-only table that maps strings to integers using a fixed vocabulary.
> * tf.lookup.TextFileInitializer: This is a table initializer that loads key-value pairs from a text file.
> 
> Common use cases for tf.lookup include:
> 
> * **Indexing into embeddings**:   
> You can use tf.lookup.StaticHashTable to map keys (for example, feature IDs) to integer indices, which can then be used to look up rows in an embedding matrix for large sparse inputs.
> * **Vocabulary lookup**:  
> You can use tf.lookup.StaticVocabularyTable to map words to integers in a fixed vocabulary, which is often used in natural language processing tasks like text classification or sequence modeling.
> * **Label mapping**:  
> You can use tf.lookup.experimental.MutableHashTable to map labels or categories to integers, allowing you to perform classification or regression tasks.
> * **Data preprocessing**:  
> You can use tf.lookup.TextFileInitializer to load preprocessed data from a text file, which can be useful for tasks like data augmentation or data filtering.
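
As a point of comparison with the vocabulary table used in the rest of this notebook, here is a minimal sketch of ```tf.lookup.StaticHashTable```. It has no OOV buckets, so every unknown key maps to a single ```default_value```. The two-word vocabulary and the choice of ```-1``` as the default are assumptions made for illustration only.

```python
import tensorflow as tf

# Hypothetical keys and values, used only for this illustration.
keys = tf.constant(["INLAND", "ISLAND"])
values = tf.constant([0, 1], dtype=tf.int64)

hash_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, values),
    default_value=-1,   # returned for any key that is not in the table
)

result = hash_table.lookup(tf.constant(["ISLAND", "UNKNOWN"]))
print(result.numpy().tolist())   # [1, -1]
```

Because ```-1``` is not a valid index for ```tf.one_hot```, unknown strings would end up as all-zero rows here. ```StaticVocabularyTable``` instead gives them indices of their own.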

In [2]:
from typing import (
    List,
)
import numpy as np
import tensorflow as tf

# (key, value) lookup table backend

* keys = vocabulary
* values = indices (sequential numbers from 0 to len(vocabulary) - 1).

Note that [StaticVocabularyTable](https://github.com/tensorflow/tensorflow/blob/v2.11.0/tensorflow/python/ops/lookup_ops.py#L1298-L1300) requires ```tf.int64``` as the value dtype.

```
if initializer.value_dtype != dtypes.int64:
    raise TypeError(
        "Invalid `value_dtype`, expected %s but got %s." %
        (dtypes.int64, initializer.value_dtype)
    )
```
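
To see this check in action, the sketch below builds an initializer whose values are ```tf.int32``` on purpose and confirms that the ```StaticVocabularyTable``` constructor rejects it. The two-word vocabulary and the ```raised``` flag are illustrative assumptions, not part of the original notebook.

```python
import tensorflow as tf

# Illustrative initializer: the values are int32 on purpose, to trigger the dtype check.
int32_initializer = tf.lookup.KeyValueTensorInitializer(
    keys=["INLAND", "ISLAND"],
    values=tf.range(2, dtype=tf.int32),
)

raised = False
try:
    tf.lookup.StaticVocabularyTable(int32_initializer, num_oov_buckets=1)
except TypeError as error:
    # The constructor refuses any value dtype other than int64.
    raised = True
    print(error)
```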


In [11]:
vocabulary: List[str] = ["INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(
    len(vocabulary),
    dtype=tf.int64     # <--- must be int64, as StaticVocabularyTable requires
)

In [12]:
table_initializer: tf.lookup.KeyValueTensorInitializer = tf.lookup.KeyValueTensorInitializer(
    keys=vocabulary, 
    values=indices, 
    key_dtype=tf.dtypes.string, 
    value_dtype=tf.dtypes.int64, 
    name=None
)

# String to index mapping with OOV

In [19]:
num_oov_buckets: int = 5
    
lookup_table = tf.lookup.StaticVocabularyTable(
    initializer=table_initializer,
    num_oov_buckets=num_oov_buckets,
    lookup_key_dtype=tf.dtypes.string,    
    name=None,
    experimental_is_anonymous=False
)

In [20]:
indices = lookup_table.lookup(tf.constant(["INLAND", "INVALID"]))
indices

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 7])>
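
"INVALID" is not in the vocabulary, so it is hashed into one of the OOV buckets, which take the indices right after the vocabulary (4 to 8 here). The sketch below checks that range for a few unknown strings. It rebuilds the table so that it runs on its own, and the unknown strings are arbitrary examples.

```python
import tensorflow as tf

vocabulary = ["INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
num_oov_buckets = 5

table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=vocabulary,
        values=tf.range(len(vocabulary), dtype=tf.int64),
    ),
    num_oov_buckets=num_oov_buckets,
)

# Arbitrary strings that are not in the vocabulary.
unknown = tf.constant(["INVALID", "???", "MOUNTAIN"])
oov_indices = table.lookup(unknown).numpy()

# Every unknown string falls in [len(vocabulary), len(vocabulary) + num_oov_buckets).
in_range = all(
    len(vocabulary) <= i < len(vocabulary) + num_oov_buckets
    for i in oov_indices
)
print(in_range)            # True
print(int(table.size()))   # 9: vocabulary entries plus OOV buckets
```

Different unknown strings may share a bucket, so an OOV index identifies a bucket rather than a specific string.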

# One Hot Encode

In [21]:
tf.one_hot(
    indices=indices,
    depth=len(vocabulary)+num_oov_buckets
)

<tf.Tensor: shape=(2, 9), dtype=float32, numpy=
array([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.]], dtype=float32)>
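
The steps above can be bundled into one reusable function. This is a minimal sketch rather than a definitive implementation: ```build_one_hot_encoder``` and ```encode``` are hypothetical names, and the three-word vocabulary is only for illustration.

```python
from typing import Callable, List

import tensorflow as tf


def build_one_hot_encoder(
    vocabulary: List[str], num_oov_buckets: int = 1
) -> Callable[[tf.Tensor], tf.Tensor]:
    """Return a function that one-hot encodes a string tensor (hypothetical helper)."""
    table = tf.lookup.StaticVocabularyTable(
        tf.lookup.KeyValueTensorInitializer(
            keys=vocabulary,
            values=tf.range(len(vocabulary), dtype=tf.int64),  # int64 is required
        ),
        num_oov_buckets=num_oov_buckets,
    )
    # One column per vocabulary entry plus one per OOV bucket.
    depth = len(vocabulary) + num_oov_buckets

    def encode(strings: tf.Tensor) -> tf.Tensor:
        return tf.one_hot(table.lookup(strings), depth=depth)

    return encode


encode = build_one_hot_encoder(["INLAND", "NEAR BAY", "ISLAND"], num_oov_buckets=2)
encoded = encode(tf.constant(["ISLAND", "INVALID"]))
print(encoded.shape)   # (2, 5)
```

Building the table once and closing over it avoids re-creating the lookup table on every call.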