Portfolio Assignment 21-1
This notebook demonstrates how to compile a model using TensorFlow and Keras.

# Deep Learning for Text - Encoder Models
## Data Preprocessing

The code below demonstrates how to suppress various types of warning messages in Python using the built-in `warnings` module. This is a common practice in data science and machine learning projects, where libraries often generate numerous warnings that clutter the output without indicating actual problems.

## Warning Suppression Strategy

The code uses `warnings.simplefilter()` with `action="ignore"` to suppress warnings. The first six calls each target a specific category: `FutureWarning` (upcoming changes in library behavior), `UserWarning` (general warnings for users), `RuntimeWarning` (dubious runtime behavior), `DeprecationWarning` (deprecated features), `SyntaxWarning` (dubious syntax), and `ImportWarning` (problems during module import). The final call passes no category, so it ignores every warning and makes the earlier calls redundant. With warnings ignored, the notebook output stays cleaner and focuses on the actual results rather than on non-critical messages.

## When and Why to Use Warning Suppression

This approach is particularly valuable in educational or demonstration contexts, like this deep learning assignment, where the focus should be on understanding the core concepts rather than dealing with library maintenance issues. Many machine learning libraries like TensorFlow, scikit-learn, and pandas frequently emit warnings about future API changes or deprecated functions, which can be distracting when learning. However, it's important to note that suppressing warnings should be done judiciously - in production code, you'd typically want to address the underlying causes of warnings rather than simply hiding them.

## Best Practices and Considerations

While this blanket suppression is convenient for notebooks and learning environments, **be cautious about using this in production code**. Warnings often provide valuable information about potential issues or upcoming breaking changes that could affect your application. A better practice in production would be to address specific warnings individually or use more targeted filtering. Additionally, consider that suppressing `DeprecationWarning` might mask important information about functions that will be removed in future library versions, potentially leading to code that breaks when dependencies are updated.
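The more targeted filtering mentioned above can be sketched with the standard library alone. The example below silences one warning category only inside a `with` block and shows that the warning is still raised elsewhere. `noisy_function` is a hypothetical stand-in for a library call that emits a `DeprecationWarning`; it is not part of this notebook.

```python
import warnings


def noisy_function():
    # Hypothetical stand-in for a library call that warns about a deprecated API
    warnings.warn("old_api() is deprecated", DeprecationWarning)
    return 42


# Suppress DeprecationWarning only inside this block; the previous filters are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    quiet_result = noisy_function()

# In a second block the warning is shown again; record=True captures it in a list
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_function()

print(quiet_result, loud_result, len(caught))
```

Scoping the filter to a block keeps warnings from the rest of the program visible, which is usually what production code wants.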

In [1]:
import warnings 
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=RuntimeWarning)
warnings.simplefilter(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action="ignore", category=SyntaxWarning)
warnings.simplefilter(action="ignore", category=ImportWarning)  
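# Blanket filter: ignores every warning category, which makes the category-specific calls above redundant
warnings.simplefilter(action="ignore")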

The code below demonstrates hardware detection for TensorFlow, with Apple Silicon machines running the Metal plugin (`tensorflow-metal`) in mind. It is a system diagnostic that lists the available compute devices and, if a GPU is present, reports its memory-growth setting.

## Device Discovery and Backend Information

The code begins by importing TensorFlow and the Keras backend module, then prints an environment summary (operating system, TensorFlow, Keras, and Python versions). The line `all_devices = tf.config.list_physical_devices()` captures every physical device (CPUs, GPUs, etc.) on the system, and the devices are then grouped and printed by type. `backend.backend()` confirms which backend Keras is using (here "tensorflow"). This information is crucial for understanding what computational resources are available and ensuring the correct backend is loaded.

## Metal GPU Configuration Attempt

The final block checks for a GPU, which on Apple Silicon is exposed through the `tensorflow-metal` plugin. If `tf.config.list_physical_devices('GPU')` returns at least one device, the code reads the current memory-growth setting with `tf.config.experimental.get_memory_growth(gpu_devices[0])` and then enters a `tf.device()` context manager that targets the first GPU. Memory growth is a setting that allows TensorFlow to allocate GPU memory incrementally rather than claiming all available memory at startup, which is especially important on unified memory architectures like Apple Silicon.

## Error Handling and Diagnostics

The error handling matters because GPU support isn't available on all systems and configurations. If no GPU is listed (perhaps because it's not an Apple Silicon machine or `tensorflow-metal` isn't properly installed), the `else` branch reports that no GPU devices are available. If a GPU is listed but querying it fails, the exception is caught and a diagnostic message is printed. Either way, the notebook can continue running on the CPU. In the run recorded below, only a CPU device was found.

**Key gotcha**: The `if gpu_devices:` guard is what makes `gpu_devices[0]` safe; on a system with no GPUs, indexing the empty list would raise an `IndexError`. Also, memory growth can only be changed (with `tf.config.experimental.set_memory_growth`) before the GPU has been initialized, so any such configuration should run early in your notebook, before other TensorFlow operations.

In [2]:
#%pip install --quiet --upgrade pip
#%pip install --quiet --upgrade keras
#%pip install --quiet --upgrade tensorflow
#%pip install --quiet --upgrade tensorflow-macos
#%pip install --quiet --upgrade tensorflow-metal
#%pip install --quiet --upgrade scikit-learn
#%pip install --quiet --upgrade pydot
#%pip install --quiet --upgrade graphviz
import os
# Set the log level before importing TensorFlow so that it takes effect when the library loads
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as     tf
from   keras      import backend
print('=== TensorFlow Environment Summary ===')
print('Operating System          :', os.name )
print('Platform Version          :', os.sys.platform )
print('Operating System Version  :', os.uname().version if hasattr(os, 'uname') else 'N/A')
print('TensorFlow version        :', tf.__version__)
print('Keras version             :', tf.keras.__version__)
print('Python version            :', os.sys.version)
print('Python executable         :', os.sys.executable)
print('Python path               :', os.sys.path)
print('TensorFlow backend        :', backend.backend())
print('TensorFlow device list    :', tf.config.list_physical_devices())
print('TensorFlow eager execution:', tf.executing_eagerly())
print('TensorFlow GPU available  :', tf.config.list_physical_devices('GPU'))
print('TensorFlow TPU available  :', tf.config.list_physical_devices('TPU'))
print('VS Code Version           :', os.environ.get('TERM_PROGRAM_VERSION', 'N/A'))
all_devices = tf.config.list_physical_devices()
print('=== Device Discovery Summary ===')
print(f'Total devices found: {len(all_devices)}')
print()
# Group devices by type
device_types = {}
for device in all_devices:
    device_type = device.device_type
    if device_type not in device_types:
        device_types[device_type] = []
    device_types[device_type].append(device)
# Display each device type separately
for device_type, devices in device_types.items():
    print(f'{device_type} Devices ({len(devices)} found):')
    for i, device in enumerate(devices):
        print(f'  [{i}] {device.name}')
    print()
# Alternative: Query specific device types
print('=== Detailed Device Breakdown ===')
cpu_devices = tf.config.list_physical_devices('CPU')
gpu_devices = tf.config.list_physical_devices('GPU')
print(f'CPU devices: {len(cpu_devices)}')
for i, cpu in enumerate(cpu_devices):
    print(f'  CPU[{i}]: {cpu.name}')
print(f'GPU devices: {len(gpu_devices)}')
for i, gpu in enumerate(gpu_devices):
    print(f'  GPU[{i}]: {gpu.name}')
# Check for other common device types
try:
    tpu_devices = tf.config.list_physical_devices('TPU')
    if tpu_devices:
        print(f'TPU devices: {len(tpu_devices)}')
        for i, tpu in enumerate(tpu_devices):
            print(f'  TPU[{i}]: {tpu.name}')
except Exception:
    print('TPU devices: Not available')
# Get GPU devices specifically
gpu_devices = tf.config.list_physical_devices('GPU')
print('GPU devices found:', len(gpu_devices))
if gpu_devices:
    try:
        # Check memory growth for first GPU
        memory_growth = tf.config.experimental.get_memory_growth(gpu_devices[0])
        print(f"Memory growth enabled: {memory_growth}")
        # Try GPU access (the Metal GPU on Apple Silicon); '/GPU:0' is the documented device string
        with tf.device('/GPU:0'):
            print("Successfully accessed GPU device")
    except Exception as e:
        print(f"GPU access error: {e}")
else:
    print("No GPU devices available")

=== TensorFlow Environment Summary ===
Operating System          : posix
Platform Version          : darwin
Operating System Version  : Darwin Kernel Version 25.0.0: Fri Jul 11 23:59:44 PDT 2025; root:xnu-12377.0.154.0.2~26/RELEASE_ARM64_T6041
TensorFlow version        : 2.19.0
Keras version             : 3.10.0
Python version            : 3.11.13 | packaged by conda-forge | (main, Jun  4 2025, 14:52:34) [Clang 18.1.8 ]
Python executable         : /Users/eneas/anaconda3/envs/tf-env/bin/python
Python path               : ['/Users/eneas/anaconda3/envs/tf-env/lib/python311.zip', '/Users/eneas/anaconda3/envs/tf-env/lib/python3.11', '/Users/eneas/anaconda3/envs/tf-env/lib/python3.11/lib-dynload', '', '/Users/eneas/anaconda3/envs/tf-env/lib/python3.11/site-packages']
TensorFlow backend        : tensorflow
TensorFlow device list    : [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
TensorFlow eager execution: True
TensorFlow GPU available  : []
TensorFlow TPU available  : []

The code below demonstrates the essential imports and reproducibility setup for a machine learning project using TensorFlow and Keras. It establishes the foundation for data manipulation, deep learning model development, and ensures consistent results across multiple runs.

## Core Library Imports

The code imports the fundamental libraries needed for machine learning workflows. `numpy` provides the foundation for numerical computing with efficient array operations, while `pandas` offers powerful data manipulation and analysis capabilities through DataFrames and Series. The `tensorflow` import brings in the main deep learning framework, and `from tensorflow import keras` specifically imports Keras, which is TensorFlow's high-level API for building and training neural networks. This import structure is standard practice in modern TensorFlow projects where Keras serves as the primary interface for model development.

## Reproducibility Through Seed Setting

The `tf.random.set_seed(42)` call is crucial for ensuring reproducible results in machine learning experiments. It sets the global random seed for TensorFlow operations, so TensorFlow's random processes (like weight initialization and dropout) produce the same sequence of "random" numbers each time you run the code. Randomness that comes from NumPy or Python's `random` module is seeded separately; `keras.utils.set_random_seed` seeds all three at once. The choice of 42 is arbitrary but commonly used in programming examples. This reproducibility is essential for debugging, comparing model performance, and ensuring that research results can be replicated.

## Importance in Machine Learning Workflows

Setting a random seed is particularly important in deep learning because neural networks rely heavily on randomness - from initial weight values to training batch order. Without a fixed seed, you might get different results each time you train the same model, making it difficult to determine whether performance improvements come from actual model changes or just lucky random initialization. This is especially critical in educational contexts like this assignment, where consistent results help students understand the impact of different techniques and hyperparameters.

**Key consideration**: While setting seeds ensures reproducibility, remember that this only works within the same hardware and software environment. Results may still vary across different versions of TensorFlow or when switching between CPU and GPU execution.
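The effect of a fixed seed can be illustrated without TensorFlow. The sketch below uses Python's standard `random` module as a stand-in for a framework's random number generator (an assumption made to keep the example dependency-free). `init_weights` is a hypothetical helper, not something defined in this notebook.

```python
import random


def init_weights(n, seed=None):
    # Hypothetical helper: draw n pseudo-random "initial weights" between -1 and 1
    rng = random.Random(seed)  # a private generator, so global random state is untouched
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


run_a = init_weights(3, seed=42)
run_b = init_weights(3, seed=42)
run_c = init_weights(3, seed=7)

print(run_a == run_b)  # same seed, same sequence
print(run_a == run_c)  # different seed, different sequence
```

Two runs with the same seed start from identical values, so any difference in their results can be attributed to a change in the model rather than to chance.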

In [3]:
import numpy          as np
import pandas         as pd
import tensorflow     as tf
from   tensorflow import keras
tf.random.set_seed(42)

Let's begin by extracting the data from the ATIS dataset and turning it into a form that we can use in our Deep Learning models.

The ATIS dataset is a standard benchmark dataset widely used to build models for intent classification and slot filling tasks (we will explain all this shortly). You can find a very detailed explanation [here](https://catalog.ldc.upenn.edu/docs/LDC93S4B/corpus.html).

We will begin by loading the training and test sets, which are stored in separate files.

In [4]:
df_train = pd.read_csv('files/atis_train_data.csv') # type: ignore
df_test  = pd.read_csv('files/atis_test_data.csv')  # type: ignore

Let's visualize all of this in a DataFrame. Below we display an example query for each intent class in a nice layout.

The first column of the DataFrame below contains the actual query that was asked. The second column indicates the intent (flight, flight time, etc.), whereas the last column contains the slot filling structure.

In [5]:
pd.set_option('display.max_colwidth', None) # type: ignore
df_small = pd.DataFrame(columns=['query','intent','slot filling']) # type: ignore
# Take the first example of each intent class, one row per intent
for j, intent in enumerate(df_train.intent.unique()): # type: ignore
  df_small.loc[j] = df_train[df_train.intent == intent].iloc[0] # type: ignore
df_small

| | query | intent | slot filling |
|---|---|---|---|
| 0 | i want to fly from boston at 838 am and arrive in denver at 1110 in the morning | flight | O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day |
| 1 | what is the arrival time in san francisco for the 755 am flight leaving washington | flight_time | O O O B-flight_time I-flight_time O B-fromloc.city_name I-fromloc.city_name O O B-depart_time.time I-depart_time.time O O B-fromloc.city_name |
| 2 | cheapest airfare from tacoma to orlando | airfare | B-cost_relative O O B-fromloc.city_name O B-toloc.city_name |
| 3 | what kind of aircraft is used on a flight from cleveland to dallas | aircraft | O O O O O O O O O O B-fromloc.city_name O B-toloc.city_name |
| 4 | what kind of ground transportation is available in denver | ground_service | O O O O O O O O B-city_name |
| 5 | what 's the airport at orlando | airport | O O O O O B-city_name |
| 6 | which airline serves denver pittsburgh and atlanta | airline | O O O B-fromloc.city_name B-fromloc.city_name O B-fromloc.city_name |
| 7 | how far is it from orlando airport to orlando | distance | O O O O O B-fromloc.airport_name I-fromloc.airport_name O B-toloc.city_name |
| 8 | what is fare code h | abbreviation | O O O O B-fare_basis_code |
| 9 | how much does the limousine service cost within pittsburgh | ground_fare | O O O O B-transport_type O O O B-city_name |


Let's see how many different types of "intent" are present in the data.

In [6]:
intent_counts = df_train['intent'].value_counts() # type: ignore
intent_counts

intent
flight                        3666
airfare                        423
ground_service                 255
airline                        157
abbreviation                   147
aircraft                        81
flight_time                     54
quantity                        51
flight+airfare                  21
airport                         20
distance                        20
city                            19
ground_fare                     18
capacity                        16
flight_no                       12
meal                             6
restriction                      6
airline+flight_no                2
ground_service+ground_fare       1
airfare+flight_time              1
cheapest                         1
aircraft+flight+flight_no        1
Name: count, dtype: int64

This code snippet demonstrates data extraction from pandas DataFrames, converting structured tabular data into NumPy arrays for machine learning preprocessing. It systematically separates the different components of the ATIS dataset into individual arrays for both training and testing purposes.

## Data Extraction Strategy

The code extracts three distinct types of information from each DataFrame: queries (the actual user input text), intents (the classification labels indicating what the user wants to accomplish), and slot filling data (the structured entity labels for each word in the query). By using the `.values` attribute on pandas Series, the code converts the DataFrame columns directly into NumPy arrays, which are the preferred data format for most machine learning libraries. This conversion eliminates pandas-specific metadata and index information, leaving only the raw data values.

## Training and Testing Data Separation

The extraction follows a consistent pattern for both training and testing datasets, creating six separate arrays total. The training arrays (`query_data_train`, `intent_data_train`, `slot_data_train`) contain the data that will be used to train the model, while the testing arrays (`query_data_test`, `intent_data_test`, `slot_data_test`) contain held-out data for evaluating model performance. This separation is crucial for unbiased model evaluation, as the testing data represents unseen examples that the model hasn't learned from.

## Preparing for Multi-Task Learning

This data structure sets up the foundation for a multi-task natural language understanding system. The queries serve as input features, while both intents and slot filling information can serve as target labels for different prediction tasks. In the context of this assignment, the model will primarily focus on slot filling prediction, but having the intent data available allows for potential expansion into joint intent classification and slot filling tasks. The consistent naming convention and parallel structure make it easy to feed this data into TensorFlow/Keras models later in the pipeline.

**Key consideration**: The three arrays taken from one DataFrame stay aligned row by row, so element `i` of the query, intent, and slot arrays all describe the same example. The training and test sets are independent of each other and do not need the same number of rows.

In [7]:
# Extract 
#   query_data_train, intent_data_train, slot_data_train,
#   query_data_test,  intent_data_test,  slot_data_test 
# from the train and test dataframes
query_data_train  = df_train['query'].values        # type: ignore
intent_data_train = df_train['intent'].values       # type: ignore
slot_data_train   = df_train['slot filling'].values # type: ignore
query_data_test   = df_test['query'].values         # type: ignore
intent_data_test  = df_test['intent'].values        # type: ignore
slot_data_test    = df_test['slot filling'].values  # type: ignore

We briefly mentioned the difference between slot filling and intent in the introduction, but it is worth going into more detail.

As an example, let’s consider the user query “*i want to fly from boston at 838 am and arrive in denver at 1110 in the morning*”. The model should classify this user query as “**flight**” intent. It should also parse the query and identify and fill all slots necessary for understanding it. Although the words “i”, “want”, “to”, “fly”, “from”, “at”, “and”, “arrive”, “in”, “the” contribute to understanding the context of the intent, they are labeled O (outside any slot); the model should correctly label the entities needed to fulfill the user’s goal of taking a flight. These are “boston” as departure city (B-fromloc.city_name), “838 am” as departure time (B-depart_time.time followed by I-depart_time.time), “denver” as destination city (B-toloc.city_name), “1110” as arrival time (B-arrive_time.time) and “morning” as arrival period of day (B-arrive_time.period_of_day). The 123 slot categories are shown below.
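The word-by-word labeling described above can be checked directly on this example. The sketch below pairs each word of the query with its label, using the query and slot strings from the first training row, and keeps only the words that fill a slot.

```python
query = "i want to fly from boston at 838 am and arrive in denver at 1110 in the morning"
slots = ("O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time "
         "O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day")

tokens = query.split()
labels = slots.split()
assert len(tokens) == len(labels)  # one label per word

# Keep only the words that fill a slot (anything not labelled "O")
filled = [(tok, lab) for tok, lab in zip(tokens, labels) if lab != "O"]
for tok, lab in filled:
    print(f"{tok:>8}  {lab}")
```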

In [8]:
unique_slots = set()
for s in slot_data_train: # type: ignore
  unique_slots.update(s.split())  # add this row's labels to the set in place
unique_slots

{'B-aircraft_code',
 'B-airline_code',
 'B-airline_name',
 'B-airport_code',
 'B-airport_name',
 'B-arrive_date.date_relative',
 'B-arrive_date.day_name',
 'B-arrive_date.day_number',
 'B-arrive_date.month_name',
 'B-arrive_date.today_relative',
 'B-arrive_time.end_time',
 'B-arrive_time.period_mod',
 'B-arrive_time.period_of_day',
 'B-arrive_time.start_time',
 'B-arrive_time.time',
 'B-arrive_time.time_relative',
 'B-city_name',
 'B-class_type',
 'B-connect',
 'B-cost_relative',
 'B-day_name',
 'B-day_number',
 'B-days_code',
 'B-depart_date.date_relative',
 'B-depart_date.day_name',
 'B-depart_date.day_number',
 'B-depart_date.month_name',
 'B-depart_date.today_relative',
 'B-depart_date.year',
 'B-depart_time.end_time',
 'B-depart_time.period_mod',
 'B-depart_time.period_of_day',
 'B-depart_time.start_time',
 'B-depart_time.time',
 'B-depart_time.time_relative',
 'B-economy',
 'B-fare_amount',
 'B-fare_basis_code',
 'B-flight_days',
 'B-flight_mod',
 'B-flight_number',
 'B-flight_st

The code below demonstrates how to count the number of unique slot categories in the ATIS dataset by determining the size of the `unique_slots` set. It provides a quick way to understand the complexity and scope of the slot filling task.

## Counting Unique Elements with len()

The `len()` function is a built-in Python function that returns the number of items in a collection. When applied to a set like `unique_slots`, it efficiently counts the number of unique elements without any duplicates. This is particularly useful here because sets automatically eliminate duplicate entries, so `len(unique_slots)` gives us the exact count of distinct slot categories present in the training data. The function works by calling the object's `__len__()` method internally, making it compatible with various Python data structures including lists, tuples, strings, and sets.

## Understanding Dataset Complexity

In the context of this natural language understanding task, knowing the number of unique slot categories is crucial for understanding the model's classification challenge. The slot filling task requires the model to assign each word in a query to one of these categories (plus non-slot tokens). A higher number of categories indicates a more complex classification problem, as the model needs to distinguish between many different entity types like cities, times, airlines, and airports. This count directly influences model architecture decisions, such as the size of the final output layer in the neural network.

## Practical Implications for Model Design

The result of `len(unique_slots)` (which shows 123 slot categories in this dataset) reveals that this is a fairly complex multi-class classification problem. The `slot_vocab_size` parameter used later, which determines the output dimension of the model's final classification layer, is taken from the slot vectorizer's `vocabulary_size()`; that value covers these labels plus the vectorizer's special tokens (padding and unknown). Understanding this complexity upfront helps in setting appropriate model hyperparameters and managing expectations about training time and potential accuracy levels.

In [9]:
len(unique_slots) # type: ignore

123

**123 slot categories!!**
## Transformers

The code below imports the `TransformerEncoder` and `PositionalEmbedding` classes from the local `HODL` module (`HODL.py`). They are the building blocks of the transformer encoder architecture used in this notebook to process sequential data like text.

### Explain the attention mechanism

The attention mechanism is a fundamental component of transformer models that allows the model to focus on different parts of the input sequence when making predictions. Unlike traditional recurrent neural networks (RNNs), which process sequences in order, transformers use self-attention to compute a representation of the entire sequence at once.
The self-attention mechanism works by projecting each input embedding into query, key, and value vectors. For each word, the dot products between its query and every key in the sequence, scaled and passed through a softmax, give a set of attention weights; the word's new representation is the weighted sum of the value vectors. This allows the model to capture long-range dependencies and relationships between words, regardless of their position in the sequence, and to adjust its focus dynamically based on the context.
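To make the computation concrete, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, written with plain Python lists. It is an illustration under simplifying assumptions, not the code in `HODL.py`: there is a single head, and the learned query, key, and value projections are omitted, so the toy vectors are used directly.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors; in self-attention Q, K and V all come from the same input
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
print(weights[0])
```

Each row of `weights` sums to 1, and the first token attends more strongly to the tokens whose vectors overlap with its own than to the orthogonal one.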

### Encoder Model

The `TransformerEncoder` class implements the encoder part of the transformer architecture, which consists of multiple layers of self-attention and feed-forward neural networks. Each layer applies self-attention to the input sequence, allowing the model to weigh the importance of different words in the context of each other. This is particularly useful for understanding relationships between words in a sentence, regardless of their position. 

### Positional Embedding

The `PositionalEmbedding` class adds positional information to the input embeddings, which is crucial because it helps the model understand the order of words in the sequence. Unlike RNNs, transformers process all words simultaneously, so they need a way to incorporate word order information. A position-dependent vector is added to each token embedding, allowing the model to differentiate between words based on their position in the sequence. Two common choices for that vector are fixed sinusoidal encodings and learned position embeddings; `HODL.py` shows which one this class uses.
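For reference, sinusoidal position encodings can be sketched in a few lines. This is the formulation from the original transformer paper, written in plain Python; whether `PositionalEmbedding` in `HODL.py` uses exactly this scheme is an assumption, and `d_model=8` is an arbitrary illustrative size.

```python
import math


def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_encoding(seq_len=30, d_model=8)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Each row would be added to the embedding of the token at that position, so identical words at different positions end up with different input vectors.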

Because the code for transformer encoder architecture is a bit complicated to write, we have decided to package it. This means that you can import it directly from our own "library" (in the same way you do it for Keras layers).

Import `TransformerEncoder` and `PositionalEmbedding` from the `HODL` file in the current directory.

Hint: Take a look at the `HODL.py` file on the left-sidebar menu.

In [10]:
from HODL import TransformerEncoder, PositionalEmbedding

This code demonstrates basic NumPy array slicing to examine the first five elements of the training query data. It's a common exploratory data analysis technique used to quickly inspect the structure and content of your dataset before proceeding with machine learning preprocessing.

## Array Slicing for Data Inspection

The syntax `query_data_train[:5]` uses Python's slice notation to extract the first five elements from the `query_data_train` array. The colon (`:`) indicates slicing, and the number `5` specifies that we want elements from index 0 up to (but not including) index 5. This is equivalent to writing `query_data_train[0:5]` but more concise. Since `query_data_train` contains the raw text queries from the ATIS dataset, this operation will display the first five user queries as strings.

## Understanding Your Input Data

This inspection step is crucial for understanding what your model will be working with. By examining these sample queries, you can see the natural language patterns, vocabulary, and sentence structures that your transformer model will need to process. The queries typically contain travel-related requests like flight bookings, airport information, or schedule inquiries. This preview helps verify that the data extraction from the DataFrame was successful and gives you insight into the complexity and variety of the input text.

## Best Practice for Data Exploration

Displaying a small sample of your data is a fundamental step in any machine learning pipeline. It allows you to catch potential issues early, such as unexpected data formats, encoding problems, or missing values. In the context of this natural language processing task, seeing the actual text helps you understand the linguistic challenges your model will face and can inform decisions about preprocessing steps, tokenization strategies, and model architecture choices.

In [11]:
query_data_train[:5] # type: ignore

array([' i want to fly from boston at 838 am and arrive in denver at 1110 in the morning ',
       ' what flights are available from pittsburgh to baltimore on thursday morning ',
       ' what is the arrival time in san francisco for the 755 am flight leaving washington ',
       ' cheapest airfare from tacoma to orlando ',
       ' round trip fares from pittsburgh to philadelphia under 1000 dollars '],
      dtype=object)

This code snippet demonstrates array slicing to inspect the first five elements of the slot filling training data. It serves as a crucial data exploration step to understand the structure and format of the target labels that the model will need to predict.

## Examining Slot Filling Labels

The syntax `slot_data_train[:5]` uses Python's slice notation to extract the first five slot filling sequences from the training dataset. Unlike the query data, which contains natural language text, `slot_data_train` contains structured label sequences that correspond to each word in the queries. These labels follow the BIO (Begin-Inside-Outside) tagging scheme: a "B-" label such as "B-fromloc.city_name" marks the first word of a slot (here a departure city), an "I-" label marks a continuation of the same slot, and "O" marks tokens that don't belong to any slot.

## Understanding Label Structure

This inspection reveals the complexity of the slot filling task by showing how each word in a query is systematically labeled with its semantic role. The slot labels provide fine-grained entity recognition, identifying not just that "boston" is a city, but specifically that it's a departure location ("B-fromloc.city_name"). This level of detail allows the model to distinguish between different types of entities that might appear similar but serve different purposes in travel queries, such as departure cities versus destination cities.

## Preparing for Model Training

By examining these sample slot sequences, you can verify that the data extraction process worked correctly and understand the exact format your model will need to output. This preview also helps you appreciate the alignment between queries and their corresponding slot labels - each position in the slot sequence corresponds to a word in the query. This one-to-one correspondence is essential for training the sequence labeling model and ensures that the transformer can learn to map input words to their appropriate semantic categories.
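Because the labels follow the BIO scheme, consecutive "B-" and "I-" labels can be merged back into multi-word slots. The sketch below does this for the fifth training example shown in this notebook. `bio_to_spans` is a hypothetical helper written for illustration, not a function from the assignment.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labelled tokens into (slot_type, text) pairs."""
    spans = []
    current_type, current_words = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("I-") and current_type == label[2:]:
            # Continuation of the slot that is currently open
            current_words.append(token)
            continue
        # Any other label closes the open slot, if there is one
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if label != "O":
            # "B-..." opens a new slot; a stray "I-..." is treated the same way
            current_type, current_words = label[2:], [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_words)))
    return spans


tokens = "round trip fares from pittsburgh to philadelphia under 1000 dollars".split()
labels = ("B-round_trip I-round_trip O O B-fromloc.city_name O B-toloc.city_name "
          "B-cost_relative B-fare_amount I-fare_amount").split()
for slot_type, text in bio_to_spans(tokens, labels):
    print(f"{slot_type}: {text}")
```

The same grouping can be applied to a model's predicted labels to turn per-word outputs into the filled slots an application would consume.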

In [12]:
slot_data_train[:5] # type: ignore

array([' O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day ',
       ' O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name B-depart_time.period_of_day ',
       ' O O O B-flight_time I-flight_time O B-fromloc.city_name I-fromloc.city_name O O B-depart_time.time I-depart_time.time O O B-fromloc.city_name ',
       ' B-cost_relative O O B-fromloc.city_name O B-toloc.city_name ',
       ' B-round_trip I-round_trip O O B-fromloc.city_name O B-toloc.city_name B-cost_relative B-fare_amount I-fare_amount '],
      dtype=object)

This code demonstrates the text preprocessing pipeline using Keras TextVectorization layers to convert raw text data into numerical representations suitable for deep learning models. It establishes two separate vectorization pipelines: one for slot labels (targets) and another for query text (inputs), creating the foundation for training a sequence labeling model.

## Setting Up Text Vectorization Parameters

The code begins by defining `max_query_length = 30`, which sets a fixed sequence length for all inputs and outputs. This standardization is crucial for batch processing in neural networks, as all sequences in a batch must have the same dimensions. The TextVectorization layers will either truncate longer sequences or pad shorter ones with zeros to reach this exact length. This parameter should be chosen based on the dataset characteristics - too short and you'll lose important information, too long and you'll waste computational resources on padding.

## Slot Label Vectorization Pipeline

The first TextVectorization layer processes the slot filling labels with `standardize=None`, which is important because slot labels like "B-fromloc.city_name" and "O" should remain exactly as they are without any text preprocessing. The `adapt()` method analyzes the training slot data to build a vocabulary mapping each unique slot label to an integer index. Index 0 is reserved for padding and index 1 for unknown tokens, and the remaining labels are ordered by frequency, so the most frequent label "O" maps to index 2, the next most frequent to index 3, and so on. The `vocabulary_size()` call returns the total number of unique slot categories plus these special tokens, which will be used later to define the output layer size of the neural network.

## Query Text Vectorization Pipeline

The second TextVectorization layer handles the input queries using the default standardization, which lowercases the text and strips punctuation before the layer splits it on whitespace. This preprocessing helps the model generalize better by treating "Boston" and "boston" as the same token. After adapting to the query training data, this vectorizer can convert any text input into a sequence of integer indices representing words in its learned vocabulary. The resulting `query_vocab_size` determines the size of the embedding layer that will convert these integer indices into dense vector representations.
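
As a rough illustration of this standardization, the following sketch lowercases a string, removes punctuation, and splits on whitespace. It uses Python's `string.punctuation` as a stand-in for the character set Keras strips, so treat it as an approximation; `standardize_and_split` is a hypothetical helper.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def standardize_and_split(text):
    """Lowercase, strip punctuation, then split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()

tokens = standardize_and_split("Flights from Boston to Los Angeles, please!")
label_tokens = standardize_and_split("B-fromloc.city_name O")
```

The second call shows why the slot vectorizer turns standardization off: applied to labels, this kind of cleanup would mangle "B-fromloc.city_name" into an unrecognizable token.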

## Creating Numerical Datasets

The final step applies both vectorizers to transform the raw text data into numerical tensors. The slot vectorizer converts `slot_data_train` and `slot_data_test` into `target_train` and `target_test` - integer sequences where each position corresponds to the slot label for that word position. Similarly, the query vectorizer transforms the text queries into `source_train` and `source_test` - integer sequences representing the input words. These numerical representations are now ready to be fed into the transformer model for training and evaluation.

**Key consideration**: The order of operations matters here - you must call `adapt()` on the training data before applying the vectorizer to both training and test sets to ensure consistent vocabulary mappings.

In [13]:
max_query_length = 30
# Text vectorization of slots
text_vectorization_slots = keras.layers.TextVectorization( # type: ignore
    output_sequence_length = max_query_length,
    standardize            = None
)
text_vectorization_slots.adapt(slot_data_train) # type: ignore
# Number of slots
slot_vocab_size = text_vectorization_slots.vocabulary_size() 
target_train    = text_vectorization_slots(slot_data_train) # type: ignore
target_test     = text_vectorization_slots(slot_data_test) # type: ignore
# Text vectorization of query
text_vectorization_query = keras.layers.TextVectorization( # type: ignore
    output_sequence_length = max_query_length
)
text_vectorization_query.adapt(query_data_train) # type: ignore
# Number of unique query words
query_vocab_size = text_vectorization_query.vocabulary_size()
source_train     = text_vectorization_query(query_data_train) # type: ignore
source_test      = text_vectorization_query(query_data_test) # type: ignore

This code demonstrates the construction of a transformer-based neural network architecture for slot filling, combining modern attention mechanisms with traditional dense layers to create a sophisticated sequence labeling model. The architecture follows a clear pipeline from input processing through transformer encoding to final classification.

## Model Architecture and Hyperparameters

The code begins by defining key hyperparameters that control the model's capacity and behavior. The `embedding_dim = 512` sets a relatively large embedding space that allows the model to capture rich semantic representations of words. The `encoder_units = 64` parameter controls the size of the feed-forward network within the transformer encoder, while `units = 128` defines the hidden layer size in the final classifier. The `num_heads = 5` specifies the number of attention heads in the multi-head attention mechanism, allowing the model to focus on different aspects of the input sequence simultaneously.

## Input Processing and Positional Embedding

The model starts with a `keras.Input` layer that accepts sequences of length `max_query_length`, representing tokenized query text as integer indices. The `PositionalEmbedding` layer then performs two crucial transformations: it converts the integer tokens into dense vector representations using token embeddings, and adds positional information so the model understands word order. This custom layer combines both token and position embeddings by simply adding them together, creating rich input representations that encode both semantic meaning and sequential structure essential for understanding natural language.
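
The `PositionalEmbedding` class is defined elsewhere in the notebook, so the following is only a NumPy sketch of the idea described here: two lookup tables, one indexed by token id and one by position, whose outputs are summed. The tables are random and the sizes are small made-up values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab_size, embed_dim = 30, 1000, 8   # small illustrative sizes

token_table = rng.normal(size=(vocab_size, embed_dim))   # one row per token id
position_table = rng.normal(size=(seq_len, embed_dim))   # one row per position

token_ids = np.zeros(seq_len, dtype=int)
token_ids[:3] = [12, 7, 45]                   # a 3-token query, padded with 0

token_vectors = token_table[token_ids]        # (seq_len, embed_dim) lookup
position_vectors = position_table[np.arange(seq_len)]
embedded = token_vectors + position_vectors   # element-wise sum of the two
```

Two positions holding the same token id still receive different vectors, because their position rows differ. That is how word order reaches the encoder.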

## Transformer Encoder Processing

The heart of the model is the `TransformerEncoder` layer, which applies the self-attention mechanism that has revolutionized natural language processing. This layer allows each word in the sequence to attend to all other words, capturing complex dependencies and relationships regardless of distance. The transformer encoder uses residual connections and layer normalization to ensure stable training, while the multi-head attention mechanism enables the model to focus on different types of relationships simultaneously. The output `encoder_out` contains contextualized representations where each word's embedding has been enriched with information from the entire sequence.
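
The core attention computation can be sketched in a few lines of NumPy. This is a single-head version with random projection matrices, intended only to show how each position mixes information from every other position; it leaves out the multiple heads, residual connections, layer normalization, and feed-forward sublayer that a full encoder block would include.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (seq_len, dim) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, dim = 6, 8                                  # made-up sizes
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

contextualized, attention_weights = self_attention(x, w_q, w_k, w_v)
```

Row `i` of `attention_weights` says how much position `i` draws from each position in the sequence, and `contextualized[i]` is the corresponding weighted mix of value vectors.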

## Classification Head and Output

The final portion of the architecture consists of a classification head that transforms the transformer's contextualized representations into slot predictions. The first `Dense` layer with ReLU activation serves as a feature transformation layer, followed by `Dropout(0.5)` for regularization to prevent overfitting. The final `Dense` layer with softmax activation produces a probability distribution over the `slot_vocab_size` categories at each word position, enabling the model to predict slot labels like "B-fromloc.city_name" or "O" for each token in the input sequence.

**Key architectural insight**: This design effectively combines the global context modeling capabilities of transformers with task-specific classification layers, creating a model that can understand both local word meanings and global sentence structure while making precise slot filling predictions at each position.

In [14]:
# Params
embedding_dim = 512
encoder_units = 64
units         = 128
num_heads     = 5
# Embedding and Masking
inputs        = keras.Input(shape=(max_query_length,)) # type: ignore
embedding     = PositionalEmbedding(max_query_length, query_vocab_size, embedding_dim) # type: ignore
x             = embedding(inputs)
# Transformer Encoding
encoder_out   = TransformerEncoder(embedding_dim, encoder_units, num_heads)(x) # type: ignore
# Classifier
x             = keras.layers.Dense(units, activation='relu')(encoder_out) # type: ignore
x             = keras.layers.Dropout(0.5)(x) # type: ignore
outputs       = keras.layers.Dense(slot_vocab_size, activation="softmax")(x) # type: ignore
model         = keras.Model(inputs, outputs) # type: ignore
model.summary()

This code snippet demonstrates the model compilation step in Keras, which configures the transformer model for training by specifying the optimization algorithm, loss function, and evaluation metrics. Model compilation is a crucial step that bridges the gap between model architecture definition and actual training.

## Optimizer Selection and Adam Algorithm

The code uses `optimizer="adam"`, which specifies the Adam optimization algorithm for updating the model's weights during training. Adam is an adaptive learning rate optimizer that combines the benefits of two other popular optimizers: AdaGrad (which adapts learning rates based on historical gradients) and RMSprop (which uses a moving average of squared gradients). Adam is particularly well-suited for transformer models because it handles sparse gradients effectively and automatically adjusts learning rates for each parameter individually, making it robust across different types of neural network architectures and datasets.
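
For intuition, here is a NumPy sketch of a single Adam update in its textbook form. The hyperparameter values are commonly used defaults and are assumptions here, and the Keras implementation may differ in minor numerical details; `adam_step` is a hypothetical helper.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, -10.0, 0.001])             # very different gradient scales
m = np.zeros_like(param)
v = np.zeros_like(param)

new_param, m, v = adam_step(param, grad, m, v, t=1)
step = new_param - param
```

Although the gradients differ by four orders of magnitude, each parameter moves by roughly the learning rate in the direction opposite its gradient. This is the per-parameter scaling the paragraph refers to.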

## Loss Function for Multi-Class Classification

The `loss="sparse_categorical_crossentropy"` parameter defines how the model measures the difference between predicted and actual slot labels during training. This loss function is specifically designed for multi-class classification problems where each sample belongs to exactly one category out of many possible categories (in this case, the 123+ slot categories). The "sparse" variant is used because the target labels are provided as integer indices rather than one-hot encoded vectors, which is more memory-efficient for problems with large vocabularies like slot filling tasks.
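
The relationship between the sparse and one-hot forms of the loss can be checked directly. The sketch below uses made-up probabilities for three token positions and three classes; both helper functions are simplified illustrations, not the Keras implementations.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true class index."""
    picked = probs[np.arange(len(labels)), labels]   # probability of the true class
    return -np.log(picked).mean()

def categorical_crossentropy(one_hot, probs):
    """Same loss computed from one-hot targets."""
    return -(one_hot * np.log(probs)).sum(axis=-1).mean()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])      # predicted distributions for 3 tokens
labels = np.array([0, 1, 2])             # integer class indices
one_hot = np.eye(3)[labels]              # equivalent one-hot encoding

sparse_loss = sparse_categorical_crossentropy(labels, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
```

The two values agree. The sparse form simply avoids materializing a one-hot array whose width equals the number of slot categories.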

## Performance Monitoring with Metrics

The `metrics=["sparse_categorical_accuracy"]` parameter specifies that the model should track accuracy during training and validation. This metric calculates the percentage of predictions where the model correctly identifies the slot category for each token position. Using sparse categorical accuracy is consistent with the sparse categorical crossentropy loss function and provides an intuitive measure of model performance that's easy to interpret - a value of 0.95 means the model correctly predicts 95% of slot labels.

**Key consideration**: The compilation step doesn't change the model's architecture but prepares it for training by defining the mathematical framework for learning. These choices (Adam optimizer, sparse categorical crossentropy loss, and accuracy metric) are well-established best practices for sequence labeling tasks in natural language processing.

In [15]:
# Compile your model
with tf.device(':GPU:0'): # type: ignore
    print('Using GPU for training')
    model.compile(optimizer="adam", # type: ignore
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])

Using GPU for training


This code demonstrates the model training phase in Keras, where the transformer model begins learning to perform slot filling by processing the ATIS dataset. The training process involves setting key hyperparameters and calling the `fit()` method to start the iterative learning process.

## Training Hyperparameter Configuration

The code begins by defining two critical training hyperparameters. The `BATCH_SIZE = 128` parameter determines how many training examples the model processes simultaneously before updating its weights. A batch size of 128 strikes a good balance between computational efficiency and gradient stability - larger batches provide more stable gradient estimates but require more memory, while smaller batches update weights more frequently but with noisier gradients. The `epochs = 10` parameter specifies that the model will see the entire training dataset 10 times during training, allowing it to gradually learn the patterns in the data through repeated exposure.
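
The batch size also determines how many weight updates happen in each epoch. The arithmetic below is a small sketch; the training-set size of 4978 is an assumed value used only for illustration, and `steps_per_epoch` is a hypothetical helper that assumes the final partial batch is kept.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of weight updates per epoch when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)

BATCH_SIZE = 128
epochs = 10
n_examples = 4978   # assumed training-set size, for illustration only

steps = steps_per_epoch(n_examples, BATCH_SIZE)
total_updates = steps * epochs
```

Under this assumption there are 39 updates per epoch and 390 in total. Any training-set size from 4865 to 4992 gives the same 39 steps at this batch size.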

## Model Training Execution

The `model.fit()` method is the core function that orchestrates the training process. It takes `source_train` (the vectorized query text) as input features and `target_train` (the vectorized slot labels) as target outputs, establishing the supervised learning relationship between queries and their corresponding slot annotations. During each epoch, the model processes the training data in batches of 128 examples, computing predictions, calculating loss using the sparse categorical crossentropy function defined during compilation, and updating weights using the Adam optimizer to minimize prediction errors.

## Training History and Monitoring

The function returns a `history` object that contains detailed logs of the training process, including loss values and accuracy metrics for each epoch. This history is invaluable for understanding how the model's performance evolves during training - you can observe whether the loss is decreasing consistently, whether accuracy is improving, and whether the model might be overfitting or underfitting. The training process will show real-time updates of these metrics, allowing you to monitor the transformer's progress as it learns to map natural language queries to their corresponding slot filling annotations.
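
In Keras the per-epoch values live in `history.history`, a dictionary that maps each metric name to a list with one entry per epoch. The sketch below uses a plain dictionary as a stand-in, filled with the first six epochs of the log printed by the training cell in this notebook, to show the kind of checks this makes possible.

```python
# Stand-in for `history.history`: metric name -> one value per epoch
# (values copied from the first six epochs of the training log).
history_dict = {
    "loss": [1.1685, 0.1693, 0.1192, 0.0965, 0.0826, 0.0650],
    "sparse_categorical_accuracy": [0.7887, 0.9538, 0.9633, 0.9694, 0.9736, 0.9802],
}

losses = history_dict["loss"]
accuracies = history_dict["sparse_categorical_accuracy"]

# Did the loss fall at every epoch?
loss_always_decreasing = all(b < a for a, b in zip(losses, losses[1:]))

# Epoch (1-based) with the highest training accuracy.
best_epoch = max(range(len(accuracies)), key=accuracies.__getitem__) + 1

# Improvement in loss from one epoch to the next.
loss_deltas = [round(a - b, 4) for a, b in zip(losses, losses[1:])]
```

Most of the improvement happens between the first and second epochs, with much smaller gains afterwards. Since no validation data is passed to `fit()`, these values describe training performance only and cannot reveal overfitting by themselves.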

**Key insight**: This training configuration represents the model's first exposure to the slot filling task, where 10 epochs should provide sufficient learning iterations for the transformer to capture the relationship between travel-related queries and their semantic slot structures.

The output `Epoch 1/10` indicates that the model training has just begun and is currently processing the first epoch out of the total 10 epochs specified in the training configuration.

## Understanding Epoch Output

An epoch represents one complete pass through the entire training dataset. When you see `Epoch 1/10`, it means the model is starting to process all the training examples for the first time. During this epoch, the transformer model will:

1. **Process all training batches**: The model will go through the `source_train` data (vectorized queries) in batches of 128 examples each
2. **Make predictions**: For each batch, the model will predict slot labels for every word position
3. **Calculate loss**: Using sparse categorical crossentropy to measure prediction error
4. **Update weights**: The Adam optimizer will adjust model parameters to minimize prediction errors

## What Happens During Training

As the epoch progresses, you would typically see additional output showing:
- Current batch number and total batches
- Real-time loss values decreasing as the model learns
- Accuracy metrics improving as predictions become more accurate
- Time estimates for completion

## Training Progress Monitoring

The fact that training has started successfully indicates that:
- The model architecture is correctly defined
- Input and target data shapes are compatible
- The compilation settings (optimizer, loss function, metrics) are working properly
- The transformer is ready to learn the relationship between travel queries and their corresponding slot filling annotations

**Expected behavior**: After this first epoch completes, you'll see the final loss and accuracy values for epoch 1, then the training will continue through epochs 2-10, with the model gradually improving its slot filling performance with each pass through the data.

In [16]:
BATCH_SIZE = 128
epochs     = 10
tf.config.set_soft_device_placement(True) # type: ignore
with tf.device(':GPU:0'): # type: ignore
    print('Using GPU for training')
    # Train the model
    history = model.fit(source_train,  # type: ignore
                    target_train, # type: ignore
                    batch_size = BATCH_SIZE,
                    epochs     = epochs)

Using GPU for training
Epoch 1/10
39/39 ━━━━━━━━━━━━━━━━━━━━ 21s 516ms/step - loss: 1.1685 - sparse_categorical_accuracy: 0.7887
Epoch 2/10
39/39 ━━━━━━━━━━━━━━━━━━━━ 20s 522ms/step - loss: 0.1693 - sparse_categorical_accuracy: 0.9538
Epoch 3/10
39/39 ━━━━━━━━━━━━━━━━━━━━ 21s 526ms/step - loss: 0.1192 - sparse_categorical_accuracy: 0.9633
Epoch 4/10
39/39 ━━━━━━━━━━━━━━━━━━━━ 21s 530ms/step - loss: 0.0965 - sparse_categorical_accuracy: 0.9694
Epoch 5/10
39/39 ━━━━━━━━━━━━━━━━━━━━ 20s 517ms/step - loss: 0.0826 - sparse_categorical_accuracy: 0.9736
Epoch 6/10
39/39 ━━━━━━━━━━━━━━━━━━━━ 20s 522ms/step - loss: 0.0650 - sparse_categorical_accuracy: 0.9802
Epoch 7/10
39/39 ━━━━━━━━━━━━━━━━━━━━ 20s 512ms/step - loss: 0.0520 - sparse_categorical_accuracy: 0.98

This code demonstrates a comprehensive evaluation pipeline for the slot filling model, including a custom accuracy calculation function and model prediction on test data. The evaluation provides two different accuracy metrics to understand both overall performance and slot-specific performance.

## Custom Accuracy Function Design

The `slot_filling_accuracy` function implements a sophisticated accuracy calculation that handles the unique challenges of sequence labeling tasks. The `not_padding` mask uses `np.not_equal(actual, 0)` to exclude padding tokens (represented by zeros) from accuracy calculations, ensuring that only meaningful positions contribute to the final metric. The function offers two evaluation modes through the `only_slots` parameter: when `False`, it calculates accuracy across all tokens including non-slot "O" labels; when `True`, it focuses exclusively on actual slot entities by filtering out the "O" tokens using `text_vectorization_slots(['O']).numpy()[0, 0]` to get the integer representation of the non-slot token.

## Prediction Generation and Device Management

The code uses `tf.config.set_soft_device_placement(True)` to enable soft device placement, which lets TensorFlow fall back to the CPU when an operation cannot be placed on the requested device. The `with tf.device(':GPU:0'):` context manager then asks for prediction to run on the GPU, matching the device used for training. The prediction process involves calling `model.predict(source_test)` to get probability distributions for each token position, then using `np.argmax(axis=-1)` to convert these probabilities into predicted class indices. The `.reshape(-1)` operation flattens both predicted and actual arrays into 1D vectors for easier comparison.

## Dual Accuracy Metrics and Performance Insights

The evaluation calculates two distinct accuracy metrics that provide complementary insights into model performance. The overall accuracy (`only_slots=False`) measures how well the model performs across all token positions, including the common "O" (outside) labels that represent non-slot tokens. The slot-specific accuracy (`only_slots=True`) focuses exclusively on the model's ability to correctly identify and classify actual slot entities, providing a more challenging metric that excludes the easier-to-predict non-slot tokens. This dual approach is crucial for understanding slot filling performance because models might achieve high overall accuracy by correctly predicting many "O" tokens while still struggling with the more complex slot classification task.

**Key insight**: The custom accuracy function's use of `np.dot(correct_predictions, weights) / sample_length` is mathematically equivalent to a simple mean calculation but demonstrates a more general framework that could easily be extended to support weighted accuracy metrics if certain token types were considered more important than others in the evaluation.
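
A toy example makes the two masks easier to follow. The arrays below are made up, and the index values are assumptions for illustration: 0 stands for padding and 2 stands in for the "O" label.

```python
import numpy as np

PAD, O = 0, 2   # assumed indices: 0 is padding, 2 stands in for the "O" label

actual    = np.array([2, 3, 2, 4, 0, 0])
predicted = np.array([2, 3, 2, 2, 2, 0])

not_padding = actual != PAD
is_slot = actual != O

correct_all = (actual == predicted)[not_padding]              # 4 real tokens
correct_slots = (actual == predicted)[not_padding & is_slot]  # 2 slot tokens

acc = correct_all.mean()
acc_slots = correct_slots.mean()

# The dot product with a vector of ones is the same as the mean.
acc_via_dot = np.dot(correct_all, np.ones(len(correct_all))) / len(correct_all)
```

Here three of the four real tokens are correct (0.75), but only one of the two slot tokens is (0.5), which shows how the easy "O" positions can lift the overall figure. The wrong prediction at a padding position is ignored by both metrics.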

In [17]:
def slot_filling_accuracy(actual, predicted, only_slots=False):
  not_padding = np.not_equal(actual, 0) # type: ignore #+ np.not_equal(predicted, 0)
  if only_slots:
    non_slot_token      = text_vectorization_slots(['O']).numpy()[0, 0] # type: ignore
    slots               = np.not_equal(actual, non_slot_token) # type: ignore
    correct_predictions = np.equal(actual, predicted)[not_padding * slots] # type: ignore
  else:
    correct_predictions = np.equal(actual, predicted)[not_padding] # type: ignore
  sample_length = len(correct_predictions)
  weights       = np.ones(sample_length) # type: ignore
  return np.dot(correct_predictions, weights) / sample_length # type: ignore
tf.config.set_soft_device_placement(True) # type: ignore
with tf.device(':GPU:0'): # type: ignore
  print('Using GPU for inference' )
  predicted = np.argmax(model.predict(source_test), axis=-1).reshape(-1) # type: ignore
actual    = target_test.numpy().reshape(-1) # type: ignore
acc       = slot_filling_accuracy(actual, predicted, only_slots=False)
acc_slots = slot_filling_accuracy(actual, predicted, only_slots=True)
print(f'Accuracy = {acc:.3f}')
print(f'Accuracy on slots = {acc_slots:.3f}')

Using GPU for inference
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step
Accuracy = 0.961
Accuracy on slots = 0.910


Now we get about 91% accuracy on the slots and 96% accuracy overall. This is so much better!!

Let's see some examples:


This code demonstrates a practical inference pipeline that showcases the trained transformer model's slot filling capabilities on real-world travel queries. It combines prediction generation with human-readable output formatting to create an interactive demonstration of the model's performance.

## Prediction Function Architecture

The `predict_slots_query` function encapsulates the complete inference workflow for converting raw text queries into slot predictions. The function begins by applying the same text vectorization process used during training through `text_vectorization_query([query])`, which converts the input string into a numerical sequence that matches the model's expected input format. The model then generates probability distributions for each token position via `model.predict(sentence)`, and `np.argmax(axis=-1)[0]` converts these probabilities into the most likely slot category indices for each word position in the sequence.

## Vocabulary Inversion and Decoding

A crucial component of the function is the creation of an inverse vocabulary mapping using `dict(enumerate(text_vectorization_slots.get_vocabulary()))`. This creates a dictionary that maps integer indices back to their corresponding slot labels, essentially reversing the encoding process performed during preprocessing. The `enumerate` function pairs each vocabulary entry with its index position, creating key-value pairs like `{0: '', 1: '[UNK]', 2: 'O', ...}`, where index 0 is the padding token and index 1 is the out-of-vocabulary token. The final step uses this inverse mapping to decode the numerical predictions back into human-readable slot labels, joining them with spaces to create a formatted output string that aligns with the original query structure.
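
The decoding step can be tried in isolation with a miniature vocabulary. Everything below is hypothetical sample data: a five-entry slot vocabulary in the layout `get_vocabulary()` uses, and a pretend argmax result padded to six positions instead of 30.

```python
# Hypothetical miniature slot vocabulary: index 0 is the padding token
# (empty string) and index 1 is "[UNK]".
slot_vocab = ["", "[UNK]", "O", "B-toloc.city_name", "I-toloc.city_name"]
inverse_vocab = dict(enumerate(slot_vocab))

# Pretend argmax output for "to los angeles", padded out to 6 positions.
prediction = [2, 3, 4, 0, 0, 0]

decoded = " ".join(inverse_vocab[int(i)] for i in prediction)
labels_only = decoded.split()   # drops the empty strings left by padding
```

Padding positions decode to empty strings, so `decoded` ends with trailing spaces. Calling `split()` on it recovers just the three real labels.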

## Comprehensive Testing Examples

The examples list provides a diverse set of test cases that demonstrate different aspects of the model's slot filling capabilities. Simple directional phrases like "from los angeles" and "to los angeles" test basic location recognition, while complex queries like "cheapest flight from boston to los angeles tomorrow" challenge the model's ability to handle multiple entities and modifiers. The inclusion of information-seeking queries such as "what is the airport at orlando" tests the model's understanding of different intent patterns, and the variation between "flight from boston to santiago" and "flight boston to santiago" evaluates robustness to different grammatical constructions.

**Key insight**: This demonstration effectively bridges the gap between technical model output and practical application, showing how the transformer's numerical predictions translate into meaningful semantic annotations that could power real travel booking systems or conversational AI interfaces.

In [18]:
# Define the predict_slots_query() function which takes in a query
def predict_slots_query(query):
  sentence           = text_vectorization_query([query]) # type: ignore
  prediction         = np.argmax(model.predict(sentence), axis=-1)[0] # type: ignore
  inverse_vocab      = dict(enumerate(text_vectorization_slots.get_vocabulary())) # type: ignore
  decoded_prediction = " ".join(inverse_vocab[int(i)] for i in prediction)
  return decoded_prediction
examples = [
            'from los angeles',
            'to los angeles',
            'from boston',
            'to boston',
            'cheapest flight from boston to los angeles tomorrow',
            'what is the airport at orlando',
            'what are the air restrictions on flights from pittsburgh to atlanta for the airfare of 416 dollars',
            'flight from boston to santiago',
            'flight boston to santiago'
]
for e in examples:
    print(e)
    print(predict_slots_query(e))
    print()

from los angeles
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
O B-fromloc.city_name I-fromloc.city_name

to los angeles
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step
O B-fromloc.city_name I-toloc.city_name

from boston
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step
O B-fromloc.city_name

to boston
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
O B-toloc.city_name

cheapest flight from boston to los angeles tomorrow
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step
B-cost_relative O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name B-depart_date.today_relative

what is the airport at orlando
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step
O O O O O B-airport_name

what are the 

Even though 'Santiago' does not appear as a city in the training data set, the model is still capable of recognizing it as a destination city name from context alone! This is the power of the transformer's attention mechanism.

Can we get even better accuracy if we train for longer? Let's try!


This code demonstrates extended training of the transformer model, where the learning process is continued for an additional 20 epochs to explore whether longer training can improve slot filling performance. This represents a common experimental approach in deep learning where researchers investigate the relationship between training duration and model accuracy.

## Extended Training Configuration

The code sets `epochs = 20`, which means the model will process the entire training dataset 20 more times beyond the initial 10 epochs completed earlier. This extended training period allows the model to continue refining its understanding of the slot filling task through additional exposure to the training examples. The decision to train for 20 additional epochs represents a hypothesis that the model hasn't yet reached its optimal performance and can benefit from more learning iterations. The same `BATCH_SIZE` from the previous training session is maintained, ensuring consistency in the training process and allowing for direct comparison of results.

## Continued Learning Process

The `model.fit()` call resumes training from the current state of the model weights, rather than starting from scratch. This continuation approach is valuable because it builds upon the knowledge already acquired during the initial 10 epochs, allowing the transformer to make incremental improvements to its slot classification abilities. During these additional epochs, the Adam optimizer will continue adjusting the model parameters based on the sparse categorical crossentropy loss, potentially discovering more nuanced patterns in the relationship between travel queries and their corresponding slot annotations.

## Performance Optimization Strategy

This extended training approach reflects a fundamental trade-off in machine learning between training time and model performance. While longer training can lead to better accuracy, it also increases the risk of overfitting, where the model becomes too specialized to the training data and performs poorly on new, unseen examples. The `history` object returned from this extended training will contain valuable information about whether the additional epochs resulted in continued improvement or if the model has reached a performance plateau, helping to inform decisions about optimal training duration for this slot filling task.

**Key consideration**: Extended training beyond the initial epochs allows experimentation with the training duration to find the sweet spot between underfitting (too little training) and overfitting (too much training), ultimately seeking to maximize the model's ability to generalize to new travel-related queries.

In [19]:
epochs  = 20
history = model.fit(source_train,  # type: ignore
                    target_train, # type: ignore
                    batch_size = BATCH_SIZE, # type: ignore
                    epochs     = epochs)

Epoch 1/20
39/39 ━━━━━━━━━━━━━━━━━━━━ 20s 522ms/step - loss: 0.0255 - sparse_categorical_accuracy: 0.9924
Epoch 2/20
39/39 ━━━━━━━━━━━━━━━━━━━━ 21s 529ms/step - loss: 0.0220 - sparse_categorical_accuracy: 0.9936
Epoch 3/20
39/39 ━━━━━━━━━━━━━━━━━━━━ 21s 526ms/step - loss: 0.0201 - sparse_categorical_accuracy: 0.9938
Epoch 4/20
39/39 ━━━━━━━━━━━━━━━━━━━━ 21s 527ms/step - loss: 0.0168 - sparse_categorical_accuracy: 0.9947
Epoch 5/20
39/39 ━━━━━━━━━━━━━━━━━━━━ 21s 528ms/step - loss: 0.0162 - sparse_categorical_accuracy: 0.9953
Epoch 6/20
39/39 ━━━━━━━━━━━━━━━━━━━━ 21s 532ms/step - loss: 0.0149 - sparse_categorical_accuracy: 0.9956
Epoch 7/20
39/39 ━━━━━━━━━━━━━━━━━━━━ 22s 555ms/step - loss: 0.0131 - sparse_categorical_accuracy: 0.9959
Epoch 8/20
39/39

This code demonstrates a comprehensive evaluation pipeline for the slot filling model after extended training, featuring an updated custom accuracy function and dual-metric assessment to measure the transformer's improved performance on the test dataset.

## Simplified Accuracy Function Implementation

The `slot_filling_accuracy` function has been streamlined compared to the earlier version, replacing the previous `np.dot(correct_predictions, weights) / sample_length` calculation with the more direct `np.sum(correct_predictions) / sample_length`. This change maintains identical functionality while improving code readability and eliminating the unnecessary creation of a weights array filled with ones. The function still implements the same sophisticated logic for handling padding tokens through the `not_padding` mask and provides the dual evaluation modes via the `only_slots` parameter, but now uses the more intuitive sum-and-divide approach for calculating the mean accuracy.

## Model Prediction and Post-Processing Pipeline

The prediction generation follows the same pipeline as before, except that the `with tf.device('/CPU:0'):` block now places prediction on the CPU rather than the GPU. The `model.predict(source_test)` call generates probability distributions for each token position, which are then converted to class predictions using `np.argmax(axis=-1)`. The subsequent `.reshape(-1)` operations flatten both the predicted and actual label arrays into one-dimensional vectors, creating aligned sequences that can be directly compared element-by-element for accuracy calculation.

## Performance Assessment After Extended Training

The evaluation computes both overall accuracy and slot-specific accuracy metrics to provide a comprehensive view of how the additional 20 epochs of training affected model performance. The overall accuracy includes all token predictions (including the common "O" non-slot labels), while the slot-specific accuracy focuses exclusively on the model's ability to correctly classify actual slot entities. This dual assessment is particularly valuable after extended training because it reveals whether the additional epochs improved the model's ability to handle the more challenging slot classification task, or if gains were primarily in the easier non-slot token predictions.

**Key insight**: The simplified accuracy calculation using `np.sum()` directly demonstrates that in NumPy, when working with boolean arrays, summing gives the count of `True` values, making this approach both more readable and computationally efficient than the previous dot product implementation while maintaining mathematical equivalence.

In [20]:
def slot_filling_accuracy(actual, predicted, only_slots=False):
  not_padding = np.not_equal(actual, 0) # type: ignore
  if only_slots:
    non_slot_token      = text_vectorization_slots(['O']).numpy()[0, 0] # type: ignore
    slots               = np.not_equal(actual, non_slot_token) # type: ignore
    correct_predictions = np.equal(actual, predicted)[not_padding * slots] # type: ignore
  else:
    correct_predictions = np.equal(actual, predicted)[not_padding]
  sample_length = len(correct_predictions)
  return np.sum(correct_predictions) / sample_length
with tf.device('/CPU:0'): # type: ignore
  predicted = np.argmax(model.predict(source_test), axis=-1).reshape(-1) # type: ignore
actual    = target_test.numpy().reshape(-1) # type: ignore
acc       = slot_filling_accuracy(actual, predicted, only_slots=False)
acc_slots = slot_filling_accuracy(actual, predicted, only_slots=True)
print(f'Accuracy          = {acc:.3f}')
print(f'Accuracy on slots = {acc_slots:.3f}')

28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 44ms/step
Accuracy          = 0.967
Accuracy on slots = 0.922
