# Building a Dialogue Language Model using DLA Sample Code

**Author**: Deep Learning Academy  
**Date**: {}

This notebook demonstrates how to build a sequence-to-sequence dialogue model using the DLA (Deep Learning Academy) sample code for data processing. We'll use the `DialogueParser` class from `src/data_processor.py` to prepare our dialogue data and then build a transformer-based model for dialogue generation.

## 1.0 Background

In this notebook, we will:
1. Use the DLA DialogueParser to process dialogue data from various formats
2. Build a sequence-to-sequence model for dialogue generation
3. Train the model on context-response pairs
4. Generate responses given dialogue context

The DLA sample code (`src/data_processor.py`) provides a flexible parser that can handle:
- Context-response pair format
- Speaker-labeled dialogue format
- Mixed formats with scene descriptions
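
To make the first of these formats concrete, the following pure-Python sketch shows one way `context:` / `response:` blocks could be turned into pairs. `Turn` and `parse_pairs` are hypothetical names used only for illustration; the actual `DialogueParser` in `src/data_processor.py` may be implemented quite differently.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    """Hypothetical stand-in for one parsed context-response pair."""
    context: str
    response: str


def parse_pairs(text: str) -> List[Turn]:
    """Parse 'context:' / 'response:' lines into Turn records."""
    turns = []
    context = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.lower().startswith("context:"):
            context = line[len("context:"):].strip()
        elif line.lower().startswith("response:") and context is not None:
            response = line[len("response:"):].strip()
            turns.append(Turn(context, response))
            context = None  # a response closes the current pair
    return turns


example = """
context: Hello, how are you today?
response: I'm doing well, thank you for asking!

context: What is machine learning?
response: Machine learning is a subset of artificial intelligence.
"""

pairs = parse_pairs(example)
print(len(pairs))
print(pairs[0].context)
```

A response line without a preceding context line is skipped in this sketch; a real parser would probably want to report it instead.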

## 2.0 Setup and Import Dependencies

In [None]:
# Install required packages if needed
# !pip install tensorflow keras numpy

In [None]:
import os
import sys
import numpy as np
import random
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")

## 3.0 Import DLA Sample Code

We'll import the DialogueParser from the DLA sample code to process our dialogue data.

In [None]:
# Add src directory to path to import the DLA sample code
import sys
sys.path.append('./src')

# Import the DLA DialogueParser
from data_processor import DialogueParser, DialogueTurn, DatasetStatistics

print("DLA DialogueParser imported successfully!")

## 4.0 Prepare Sample Dialogue Data

For demonstration purposes, we'll create sample dialogue data. In production, you would use the DialogueParser to load data from files.

In [None]:
# Create sample dialogue data in context-response format
sample_dialogues = """
context: Hello, how are you today?
response: I'm doing well, thank you for asking! How about you?

context: I'm doing well, thank you for asking! How about you?
response: I'm great! What brings you here today?

context: What brings you here today?
response: I wanted to learn about building dialogue models.

context: I wanted to learn about building dialogue models.
response: That's wonderful! Dialogue models are fascinating.

context: Can you help me with my project?
response: Of course! I'd be happy to help. What do you need?

context: I'd be happy to help. What do you need?
response: I need to understand sequence-to-sequence models.

context: What is machine learning?
response: Machine learning is a subset of artificial intelligence.

context: How does deep learning work?
response: Deep learning uses neural networks with multiple layers.

context: What are transformers in NLP?
response: Transformers are neural network architectures for sequence processing.

context: Tell me about attention mechanisms.
response: Attention helps models focus on relevant parts of the input.
"""

# Save sample data to a file
with open('sample_dialogues.txt', 'w') as f:
    f.write(sample_dialogues)

print("Sample dialogue data created!")

## 5.0 Use DLA DialogueParser to Process Data

In [None]:
# Initialize the DLA DialogueParser
parser = DialogueParser()

# Parse the sample dialogue file
dialogue_turns = parser.parse_file('sample_dialogues.txt')

print(f"Parsed {len(dialogue_turns)} dialogue turns")
print(f"\nFirst dialogue turn:")
print(f"Context: {dialogue_turns[0].context}")
print(f"Response: {dialogue_turns[0].response}")

## 6.0 Calculate Dataset Statistics

In [None]:
# Use DLA DatasetStatistics to analyze the data
stats = DatasetStatistics.calculate_stats(dialogue_turns)

print("Dataset Statistics:")
for key, value in stats.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.2f}")
    else:
        print(f"  {key}: {value}")
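
The keys returned by `DatasetStatistics.calculate_stats` depend on the DLA implementation. As a rough idea of what such statistics can look like, here is a small sketch that works on plain `(context, response)` tuples. The function name `simple_stats` and its keys are assumptions made for this example, not the DLA API.

```python
from statistics import mean


def simple_stats(pairs):
    """Compute a few illustrative statistics over (context, response) pairs."""
    context_lengths = [len(context.split()) for context, _ in pairs]
    response_lengths = [len(response.split()) for _, response in pairs]
    vocabulary = {
        word.lower()
        for context, response in pairs
        for word in (context + " " + response).split()
    }
    return {
        "num_pairs": len(pairs),
        "avg_context_words": mean(context_lengths),
        "avg_response_words": mean(response_lengths),
        "max_response_words": max(response_lengths),
        "vocab_size": len(vocabulary),
    }


demo_pairs = [
    ("What is machine learning?",
     "Machine learning is a subset of artificial intelligence."),
    ("How does deep learning work?",
     "Deep learning uses neural networks with multiple layers."),
]
stats_sketch = simple_stats(demo_pairs)
for key, value in stats_sketch.items():
    print(f"{key}: {value}")
```

Word counts like these are useful for choosing `sequence_length` later: a value well below the longest response would truncate training targets.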

## 7.0 Prepare Data for Model Training

In [None]:
# Extract contexts and responses
contexts = [turn.context for turn in dialogue_turns]
responses = ["[start] " + turn.response + " [end]" for turn in dialogue_turns]

# Create text pairs
text_pairs = list(zip(contexts, responses))

# Shuffle and split data
random.seed(42)
random.shuffle(text_pairs)

val_samples = int(0.2 * len(text_pairs))
train_samples = len(text_pairs) - val_samples

train_pairs = text_pairs[:train_samples]
val_pairs = text_pairs[train_samples:]

print(f"Training samples: {len(train_pairs)}")
print(f"Validation samples: {len(val_pairs)}")
print(f"\nExample training pair:")
print(f"Context: {train_pairs[0][0]}")
print(f"Response: {train_pairs[0][1]}")
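
The shuffle-and-split step can also be wrapped in a small helper so that it does not touch the global random state. This is only a sketch; `split_pairs` and its parameters are names introduced here for illustration.

```python
import random


def split_pairs(pairs, val_fraction=0.2, seed=42):
    """Shuffle a copy of `pairs` and split it into train and validation lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    num_val = int(val_fraction * len(shuffled))
    split_at = len(shuffled) - num_val
    return shuffled[:split_at], shuffled[split_at:]


demo_pairs = [(f"context {i}", f"[start] response {i} [end]") for i in range(10)]
train_demo, val_demo = split_pairs(demo_pairs)
print(len(train_demo), len(val_demo))
```

With ten pairs and a 20% validation fraction this yields eight training pairs and two validation pairs, far too few for the validation metrics to be meaningful.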

## 8.0 Vectorize Text Data

In [None]:
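import re
import string

from tensorflow.keras.layers import TextVectorization

# Configuration
vocab_size = 5000
sequence_length = 20

# The default standardization strips all punctuation, which would turn
# "[start]" and "[end]" into the ordinary words "start" and "end".
# Keep the square brackets so the markers survive as distinct tokens.
strip_chars = string.punctuation.replace("[", "").replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# Create text vectorization layers
context_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length,
)

response_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    # One extra position: inputs and targets are offset by one step in training
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)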

# Adapt to the training data
train_contexts = [pair[0] for pair in train_pairs]
train_responses = [pair[1] for pair in train_pairs]

context_vectorization.adapt(train_contexts)
response_vectorization.adapt(train_responses)

print(f"Context vocabulary size: {context_vectorization.vocabulary_size()}")
print(f"Response vocabulary size: {response_vectorization.vocabulary_size()}")
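
To show what a text vectorizer does under the hood, here is a pure-Python sketch of the same pipeline: standardize, split on whitespace, build a frequency-ordered vocabulary, then map to integers and pad. The standardizer in this sketch deliberately keeps square brackets so that markers such as `[start]` remain single tokens. All function names are illustrative and this is not how Keras implements the layer internally.

```python
import re
import string
from collections import Counter

# Strip all punctuation except square brackets, so "[start]" stays one token
STRIP_CHARS = string.punctuation.replace("[", "").replace("]", "")


def standardize(text):
    """Lowercase and remove punctuation other than '[' and ']'."""
    return re.sub(f"[{re.escape(STRIP_CHARS)}]", "", text.lower())


def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is out-of-vocabulary, then words by frequency."""
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, length):
    """Map tokens to indices, then truncate or zero-pad to `length`."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:length]
    return ids + [0] * (length - len(ids))


texts = ["[start] Of course! [end]", "[start] Of course not. [end]"]
vocab = build_vocab(texts, max_tokens=50)
ids = vectorize("[start] Of course, maybe [end]", vocab, length=8)
print(vocab)
print(ids)
```

The word "maybe" was never seen while building the vocabulary, so it maps to index 1, and the three trailing zeros are padding.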

## 9.0 Create Training Dataset

In [None]:
def format_dataset(contexts, responses):
    contexts = context_vectorization(contexts)
    responses = response_vectorization(responses)
    return ({
        "context": contexts,
        "response": responses[:, :-1],
    }, responses[:, 1:])

def make_dataset(pairs, batch_size=64):
    contexts_list = [pair[0] for pair in pairs]
    responses_list = [pair[1] for pair in pairs]
    dataset = tf.data.Dataset.from_tensor_slices((contexts_list, responses_list))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
    # Cache first so that shuffling produces a new batch order every epoch
    return dataset.cache().shuffle(2048).prefetch(tf.data.AUTOTUNE)

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

print("Training dataset created successfully!")
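
The slicing in `format_dataset` implements teacher forcing: the decoder receives the response without its last position and is trained to predict the same response shifted one step to the left. The toy example below illustrates the offset on a single sequence; the token ids are made up for the example.

```python
def teacher_forcing_split(token_ids):
    """Return (decoder_input, target) for one vectorized response."""
    decoder_input = token_ids[:-1]  # everything except the last position
    target = token_ids[1:]          # the same sequence shifted left by one
    return decoder_input, target


# Toy response with 6 positions: [start] a b [end] pad pad
START, A, B, END, PAD = 2, 7, 8, 3, 0
response_ids = [START, A, B, END, PAD, PAD]

decoder_input, target = teacher_forcing_split(response_ids)
for step, (seen, wanted) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: decoder sees {seen}, should predict {wanted}")
```

This is why the response vectorizer produces `sequence_length + 1` positions: after the split, both the decoder input and the target have exactly `sequence_length` positions.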

## 10.0 Build Sequence-to-Sequence Model with Transformer

In [None]:
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim, activation='relu'),
            layers.Dense(embed_dim),
        ])
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "dense_dim": self.dense_dim,
            "num_heads": self.num_heads,
        })
        return config

print("TransformerEncoder defined")
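
At the core of the encoder is scaled dot-product attention. The sketch below implements a single head in pure Python on tiny lists, including a key mask like the one derived from padding tokens. It leaves out the learned projections and the multiple heads that `layers.MultiHeadAttention` adds, so treat it as an illustration of the idea rather than an equivalent of the Keras layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, key_mask=None):
    """Single-head scaled dot-product attention on lists of vectors.

    key_mask[j] == 0 hides key j from every query (e.g. a padding token).
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        if key_mask is not None:
            scores = [s if keep else -1e9 for s, keep in zip(scores, key_mask)]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # last row plays the padding token
out = attention(tokens, tokens, tokens, key_mask=[1, 1, 0])
for row in out:
    print([round(x, 3) for x in row])
```

Each output row is a weighted average of the unmasked value vectors, with more weight on the keys that are most similar to the query.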

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
            "output_dim": self.output_dim,
        })
        return config

print("PositionalEmbedding defined")
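
The effect of `PositionalEmbedding` can be illustrated with two small lookup tables: one indexed by token id and one indexed by position. The tables and values below are invented for the example; in the real layer both tables are learned during training.

```python
def positional_embed(token_ids, token_table, position_table):
    """Add the embedding of each token to the embedding of its position."""
    embedded = []
    for position, token_id in enumerate(token_ids):
        tok_vec = token_table[token_id]
        pos_vec = position_table[position]
        embedded.append([t + p for t, p in zip(tok_vec, pos_vec)])
    return embedded


def padding_mask(token_ids):
    """True where a real token is present, False at padding (id 0)."""
    return [token_id != 0 for token_id in token_ids]


# Toy tables: 4 token ids and 3 positions, embedding size 2
token_table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
position_table = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.0]]

ids = [2, 2, 0]
vectors = positional_embed(ids, token_table, position_table)
mask = padding_mask(ids)
print(vectors)
print(mask)
```

The same token id at positions 0 and 1 ends up with different vectors, which is how the model can tell word order apart. Note also that the position table has a fixed number of rows, so an input longer than that has no embedding for its last positions.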

In [None]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim, activation='relu'),
            layers.Dense(embed_dim),
        ])
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype='int32')
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
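        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            # Combine the target padding mask with the causal mask
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype='int32')
            self_attention_mask = tf.minimum(padding_mask, causal_mask)
        else:
            self_attention_mask = causal_mask
        # Masked self-attention: each position sees only earlier,
        # non-padded positions of the response
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=self_attention_mask,
        )
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        # Cross-attention: every response position may attend to the whole
        # encoded context, so the causal mask is not applied here
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
        )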
        attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "dense_dim": self.dense_dim,
            "num_heads": self.num_heads,
        })
        return config

print("TransformerDecoder defined")
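
The causal mask built by `get_causal_attention_mask` is easier to read on a small example. The following pure-Python sketch constructs the same lower-triangular pattern and combines it with a padding mask using an element-wise minimum; the helper names are illustrative.

```python
def causal_mask(length):
    """mask[i][j] == 1 when query position i may attend to key position j."""
    return [[1 if i >= j else 0 for j in range(length)] for i in range(length)]


def combine_with_padding(causal, key_is_real):
    """Element-wise minimum of the causal mask and a per-key padding mask."""
    return [
        [min(allowed, int(real)) for allowed, real in zip(row, key_is_real)]
        for row in causal
    ]


mask = causal_mask(4)
for row in mask:
    print(row)

# Pretend the last position is padding
combined = combine_with_padding(mask, [True, True, True, False])
print(combined[3])
```

Row `i` of the mask lists which positions step `i` may look at: itself and everything before it. This keeps the decoder from seeing the tokens it is being trained to predict.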

## 11.0 Build Complete Model

In [None]:
# Model configuration
embed_dim = 256
dense_dim = 2048
num_heads = 8

# Encoder
encoder_inputs = keras.Input(shape=(None,), dtype='int64', name='context')
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

# Decoder
decoder_inputs = keras.Input(shape=(None,), dtype='int64', name='response')
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation='softmax')(x)

# Create model
dialogue_model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

print("Model built successfully!")
dialogue_model.summary()
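
The numbers printed by `summary()` can be sanity-checked by hand. The sketch below estimates parameter counts for the main building blocks, assuming that `MultiHeadAttention` uses biased projections and that its value dimension defaults to `key_dim`. Compare the results against the actual summary rather than treating them as authoritative.

```python
def embedding_params(rows, dim):
    """An embedding table has one vector of size `dim` per row."""
    return rows * dim


def dense_params(inputs, units):
    """Kernel plus bias of a fully connected layer."""
    return inputs * units + units


def multi_head_attention_params(embed_dim, num_heads, key_dim):
    """Query, key and value projections map embed_dim -> num_heads * key_dim;
    the output projection maps num_heads * key_dim back to embed_dim."""
    inner = num_heads * key_dim
    return 3 * dense_params(embed_dim, inner) + dense_params(inner, embed_dim)


vocab, seq_len, dim, heads, ff = 5000, 20, 256, 8, 2048

positional = embedding_params(vocab, dim) + embedding_params(seq_len, dim)
attention = multi_head_attention_params(dim, heads, key_dim=dim)
feed_forward = dense_params(dim, ff) + dense_params(ff, dim)
output_layer = dense_params(dim, vocab)

print(f"positional embedding: {positional:,}")
print(f"one attention layer:  {attention:,}")
print(f"feed-forward block:   {feed_forward:,}")
print(f"output Dense layer:   {output_layer:,}")
```

Because `key_dim` is the size of each head, setting `key_dim=embed_dim` gives every one of the eight heads the full 256 dimensions. A common alternative is `key_dim = embed_dim // num_heads`, which makes each attention layer considerably smaller.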

## 12.0 Compile and Train the Model

In [None]:
# Compile the model
dialogue_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print("Model compiled!")
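
Sparse categorical cross-entropy takes integer targets and, at each position, penalizes the model by the negative log of the probability it assigned to the correct token. A minimal sketch of the computation follows; `ignore_id` is a hypothetical parameter added here to show the effect of leaving padding positions out of the average.

```python
import math


def sparse_categorical_crossentropy(target_ids, predicted_probs, ignore_id=None):
    """Mean of -log(p[target]) over positions, optionally skipping ignore_id."""
    losses = [
        -math.log(probs[target])
        for target, probs in zip(target_ids, predicted_probs)
        if target != ignore_id
    ]
    return sum(losses) / len(losses)


targets = [2, 1, 0]  # the last target is padding (id 0)
probs = [
    [0.10, 0.10, 0.80],
    [0.25, 0.50, 0.25],
    [0.90, 0.05, 0.05],
]

loss_all = sparse_categorical_crossentropy(targets, probs)
loss_no_pad = sparse_categorical_crossentropy(targets, probs, ignore_id=0)
print(f"all positions:      {loss_all:.4f}")
print(f"padding positions skipped: {loss_no_pad:.4f}")
```

Padding is easy to predict, so including it in the average makes the loss (and the accuracy metric) look better than the model's performance on real tokens.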

In [None]:
# Train the model
# Note: For better results, train for more epochs with more data
history = dialogue_model.fit(
    train_ds,
    epochs=10,
    validation_data=val_ds,
)

print("Training complete!")

## 13.0 Generate Responses

Now we'll create a function to generate dialogue responses given a context.

In [None]:
def decode_sequence(input_context):
    """
    Greedily generate a response given an input context.
    """
    # Look up the vocabulary once rather than on every step
    response_vocab = response_vectorization.get_vocabulary()

    # Resolve the end marker exactly as the vectorizer tokenizes it
    end_token = response_vocab[response_vectorization(["[end]"]).numpy()[0, 0]]

    # Vectorize the input context
    tokenized_input = context_vectorization([input_context])

    # Initialize the response with the start token
    decoded_response = "[start]"

    for i in range(sequence_length):
        # Vectorize the current decoded response and drop the last position so
        # the decoder input has length sequence_length, as it did in training
        tokenized_target = response_vectorization([decoded_response])[:, :-1]

        # Predict the next token
        predictions = dialogue_model.predict(
            [tokenized_input, tokenized_target],
            verbose=0
        )

        # Get the most likely next token
        sampled_token_index = np.argmax(predictions[0, i, :])

        # Get the word from vocabulary
        sampled_token = response_vocab[sampled_token_index]

        # Stop if we hit the end token (before appending it to the output)
        if sampled_token == end_token:
            break

        # Append to the decoded response
        decoded_response += " " + sampled_token

    # Remove the start token
    return decoded_response.replace("[start]", "", 1).strip()

print("Response generation function defined!")
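
The decoding strategy used here is greedy: at every step the single most probable token is appended and fed back in. The sketch below isolates that loop from TensorFlow by replacing the trained model with a hypothetical lookup table (`toy_model`), which makes the control flow easy to follow and test.

```python
def greedy_decode(next_token_probs, vocab, max_steps, start="[start]", end="[end]"):
    """Repeatedly append the most probable next token until `end` or max_steps."""
    tokens = [start]
    for _ in range(max_steps):
        probs = next_token_probs(tokens)
        best_index = max(range(len(probs)), key=probs.__getitem__)
        token = vocab[best_index]
        if token == end:
            break
        tokens.append(token)
    return " ".join(tokens[1:])


VOCAB = ["[end]", "hello", "there"]


def toy_model(tokens):
    """Hypothetical stand-in for the trained model: depends on the last token only."""
    table = {
        "[start]": [0.1, 0.7, 0.2],
        "hello": [0.2, 0.1, 0.7],
        "there": [0.8, 0.1, 0.1],
    }
    return table[tokens[-1]]


reply = greedy_decode(toy_model, VOCAB, max_steps=10)
print(reply)
```

The `max_steps` limit plays the same role as `sequence_length` in `decode_sequence`: it guarantees termination even if the model never produces the end token.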

## 14.0 Test the Model

In [None]:
# Test the model with various contexts
test_contexts = [
    "Hello, how are you today?",
    "What brings you here today?",
    "Can you help me with my project?",
    "What is machine learning?",
    "Tell me about deep learning."
]

print("\n" + "="*80)
print("GENERATING DIALOGUE RESPONSES")
print("="*80 + "\n")

for context in test_contexts:
    response = decode_sequence(context)
    print(f"Context:  {context}")
    print(f"Response: {response}")
    print("-" * 80)

## 15.0 Using DLA Sample Code for Production Data

The above example uses sample data. For production use, you can use the DLA DialogueParser to load and process real dialogue datasets:

In [None]:
# Example: Load dialogue data from a directory using DLA sample code
"""
# Initialize parser
parser = DialogueParser()

# Parse all dialogue files in a directory
dialogue_turns = parser.parse_directory(
    directory='/path/to/dialogue/dataset',
    pattern='*.txt'
)

# Calculate statistics
stats = DatasetStatistics.calculate_stats(dialogue_turns)
print('Dataset Statistics:', stats)

# Convert to training format
jsonl_output = parser.to_training_format(dialogue_turns, 'jsonl')

# Extract context-response pairs for model training
contexts = [turn.context for turn in dialogue_turns]
responses = ['[start] ' + turn.response + ' [end]' for turn in dialogue_turns]

# Continue with model training as shown above...
"""

print("See the code above for production usage with real dialogue datasets.")
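
The exact output of `to_training_format(..., 'jsonl')` depends on the DLA implementation. As an illustration of the general idea, the sketch below writes context-response pairs as one JSON object per line and reads them back. The key names `context` and `response` and both helper functions are assumptions made for this example.

```python
import io
import json


def write_jsonl(pairs, stream):
    """Write one JSON object per line with 'context' and 'response' keys."""
    for context, response in pairs:
        record = {"context": context, "response": response}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Read the records back as (context, response) tuples, skipping blank lines."""
    pairs = []
    for line in stream:
        if line.strip():
            record = json.loads(line)
            pairs.append((record["context"], record["response"]))
    return pairs


demo_pairs = [
    ("What is machine learning?",
     "Machine learning is a subset of artificial intelligence."),
    ("Tell me about attention mechanisms.",
     "Attention helps models focus on relevant parts of the input."),
]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
write_jsonl(demo_pairs, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
print(restored == demo_pairs)
```

One record per line means large datasets can be streamed without loading the whole file into memory.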

## 16.0 Summary

In this notebook, we:
1. ✅ Imported and used the DLA DialogueParser sample code
2. ✅ Processed dialogue data using the context-response parser
3. ✅ Built a transformer-based sequence-to-sequence model
4. ✅ Trained the model on dialogue pairs
5. ✅ Generated responses given dialogue contexts

### Key Takeaways:
- The DLA sample code (`src/data_processor.py`) provides flexible dialogue parsing
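- The Transformer encoder-decoder architecture is well suited to dialogue generation
- With only ten sample pairs the model can do little more than memorize its training data; contextually relevant responses require a much larger corpus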
- For better results, use larger datasets and train for more epochs

### Next Steps:
- Use the DialogueParser to load your own dialogue datasets
- Experiment with different model architectures
- Fine-tune hyperparameters for better performance
- Implement beam search for better response generation
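
As a starting point for the beam search item, here is a minimal sketch that keeps the `beam_width` best partial responses at every step instead of only one. The model is again replaced by a hypothetical lookup table, and the sketch omits refinements such as length normalization, so it is an outline of the technique rather than a production implementation.

```python
import math


def beam_search(next_token_probs, vocab, beam_width, max_steps,
                start="[start]", end="[end]"):
    """Return the highest log-probability sequence found with a fixed-width beam."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            for index, prob in enumerate(next_token_probs(tokens)):
                if prob > 0.0:
                    candidates.append(
                        (tokens + [vocab[index]], score + math.log(prob))
                    )
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Sequences that just produced the end token leave the beam
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    best_tokens, _ = max(finished or beams, key=lambda item: item[1])
    return " ".join(t for t in best_tokens if t not in (start, end))


VOCAB = ["[end]", "a", "b"]


def toy_model(tokens):
    """Hypothetical model in which the greedy first choice leads to a worse result."""
    table = {
        (): [0.0, 0.6, 0.4],
        ("a",): [0.4, 0.3, 0.3],
        ("b",): [0.9, 0.05, 0.05],
    }
    return table.get(tuple(tokens[1:]), [1.0, 0.0, 0.0])


print(beam_search(toy_model, VOCAB, beam_width=1, max_steps=5))
print(beam_search(toy_model, VOCAB, beam_width=2, max_steps=5))
```

With a beam width of 1 the search is identical to greedy decoding and commits to "a" (total probability 0.6 × 0.4 = 0.24). With a width of 2 it also keeps "b" alive and finds the more probable complete response (0.4 × 0.9 = 0.36).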