In [21]:
!git clone https://github.com/ishreya09/ProTACT

fatal: destination path 'ProTACT' already exists and is not an empty directory.


In [32]:
!pip install gdown

Collecting gdown
  Downloading gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Downloading gdown-5.2.0-py3-none-any.whl (18 kB)
Installing collected packages: gdown
Successfully installed gdown-5.2.0


In [22]:
!cp -r /kaggle/working/ProTACT/* /kaggle/working/


In [33]:
import gdown
import os

# Create directory if it doesn't exist
os.makedirs('embeddings', exist_ok=True)

# https://drive.google.com/file/d/1qUImUtKmpspJCxAQY2OYD5DFlvvZA4IZ/view?usp=sharing

# Google Drive file ID
file_id = "1qUImUtKmpspJCxAQY2OYD5DFlvvZA4IZ"

# Download file to the specified directory
gdown.download(f"https://drive.google.com/uc?id={file_id}", 'embeddings/glove.6B.50d.txt', quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1qUImUtKmpspJCxAQY2OYD5DFlvvZA4IZ
From (redirected): https://drive.google.com/uc?id=1qUImUtKmpspJCxAQY2OYD5DFlvvZA4IZ&confirm=t&uuid=65c7f381-ab6d-423a-9428-90037e5dcdb2
To: /kaggle/working/embeddings/glove.6B.50d.txt
100%|██████████| 171M/171M [00:05<00:00, 30.8MB/s] 


'embeddings/glove.6B.50d.txt'

In [23]:
import os
import time
import argparse
import random
import numpy as np
from models.ProTACT import build_ProTACT
import tensorflow as tf
from configs.configs import Configs
from utils.read_data_pr import read_pos_vocab, read_word_vocab, read_prompts_we, read_essays_prompts, read_prompts_pos
from utils.general_utils import get_scaled_down_scores, pad_hierarchical_text_sequences, get_attribute_masks, load_word_embedding_dict, build_embedd_table
from evaluators.multitask_evaluator_all_attributes import Evaluator as AllAttEvaluator
from tensorflow import keras
import matplotlib.pyplot as plt

class CustomHistory(keras.callbacks.Callback):
    """Records per-epoch training and validation metrics."""

    def __init__(self):
        super().__init__()
        self.train_loss = []
        self.val_loss = []
        self.train_acc = []
        self.val_acc = []

    def on_epoch_end(self, epoch, logs=None):
        # Keras passes the metrics of the finished epoch in `logs`
        logs = logs or {}
        self.train_loss.append(logs.get('loss'))
        self.val_loss.append(logs.get('val_loss'))
        self.train_acc.append(logs.get('acc'))
        self.val_acc.append(logs.get('val_acc'))


In [24]:
test_prompt_id = 1
seed = 12
num_heads = 2
features_path = "/kaggle/working/data/LDA/hand_crafted_final_1.csv"

np.random.seed(seed)
tf.random.set_seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

print("Test prompt id is {} of type {}".format(test_prompt_id, type(test_prompt_id)))
print("Seed: {}".format(seed))


Test prompt id is 1 of type <class 'int'>
Seed: 12


In [25]:
configs = Configs()

data_path = configs.DATA_PATH
train_path = data_path + str(test_prompt_id) + '/train.pk'
dev_path = data_path + str(test_prompt_id) + '/dev.pk'
test_path = data_path + str(test_prompt_id) + '/test.pk'
pretrained_embedding = configs.PRETRAINED_EMBEDDING
embedding_path = configs.EMBEDDING_PATH
readability_path = configs.READABILITY_PATH
prompt_path = configs.PROMPT_PATH
vocab_size = configs.VOCAB_SIZE
epochs = configs.EPOCHS
batch_size = configs.BATCH_SIZE
print("Numhead : ", num_heads, " | Features : ", features_path, " | Pos_emb : ", configs.EMBEDDING_DIM)

read_configs = {
    'train_path': train_path,
    'dev_path': dev_path,
    'test_path': test_path,
    'features_path': features_path,
    'readability_path': readability_path,
    'vocab_size': vocab_size
}

Numhead :  2  | Features :  /kaggle/working/data/LDA/hand_crafted_final_1.csv  | Pos_emb :  50


In [26]:
# contains all the paths in a dict
read_configs

{'train_path': 'data/cross_prompt_attributes/1/train.pk',
 'dev_path': 'data/cross_prompt_attributes/1/dev.pk',
 'test_path': 'data/cross_prompt_attributes/1/test.pk',
 'features_path': '/kaggle/working/data/LDA/hand_crafted_final_1.csv',
 'readability_path': 'data/allreadability.pickle',
 'vocab_size': 4000}


In `prompt_pos_data`, `max_sentnum` and `max_sentlen` define limits on the structure of prompts by standardizing the number of sentences and words in each sentence across all prompts.

### 1. `max_sentnum`: 8

- This means that each prompt has been limited to a maximum of **8 sentences**. 
- If a prompt has **fewer than 8 sentences**, all-padding sentences (rows of zeros) are appended until it has 8.
- If a prompt has **more than 8 sentences**, it will be truncated to only include the first 8 sentences.
- This standardization ensures that the model always sees prompts with 8 sentences, making it easier to handle prompts consistently.

### 2. `max_sentlen`: 18

- This means that each sentence has been limited to a maximum of **18 words** (or POS tags, in this case).
- If a sentence has **fewer than 18 words**, it will be padded to reach 18 words.
- If a sentence has **more than 18 words**, it will be truncated to keep only the first 18 words.
  
### Example

Let’s say we have a prompt with 3 sentences:

1. Sentence 1: `['NNS', 'VB', 'NN', 'JJ']` (4 words)
2. Sentence 2: `['NN', 'DT', 'VB', 'NNS', 'IN', 'NN', 'VBZ']` (7 words)
3. Sentence 3: `['DT', 'JJ', 'NN']` (3 words)

Without standardization, we would represent this prompt as:
```
[
    [3, 18, 5, 7],            # Sentence 1
    [5, 2, 18, 3, 4, 5, 10],  # Sentence 2
    [2, 7, 5]                 # Sentence 3
]
```

After applying `max_sentnum=8` and `max_sentlen=18`, the prompt will be transformed into a consistent shape, such as:

```
[
    [3, 18, 5, 7, 0, 0, 0, ..., 0],  # Sentence 1 padded to 18 words
    [5, 2, 18, 3, 4, 5, 10, 0, ..., 0],  # Sentence 2 padded to 18 words
    [2, 7, 5, 0, ..., 0],          # Sentence 3 padded to 18 words
    [0, 0, 0, ..., 0],             # Padding for 8 sentences
    [0, 0, 0, ..., 0],
    [0, 0, 0, ..., 0],
    [0, 0, 0, ..., 0],
    [0, 0, 0, ..., 0]
]
```

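Now each prompt is represented as an 8x18 matrix in which each cell is a POS tag ID, so the input shape is uniform across all prompts in the dataset. This uniform shape is what allows the data to be batched during training. Note that later in this notebook the prompt arrays are padded with the essay-level `max_sentnum` and `max_sentlen` (97 and 50) rather than 8 and 18, so the arrays fed to the model are larger than this example suggests.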
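
The padding step described above can be sketched with NumPy. This is a minimal illustration, not the repository's `pad_hierarchical_text_sequences`; the helper name `pad_prompt` and its signature are assumptions made for this example.

```python
import numpy as np

def pad_prompt(sentences, max_sentnum, max_sentlen, pad_id=0):
    """Pad or truncate a list of ID lists to a (max_sentnum, max_sentlen) array.

    Hypothetical helper for illustration only.
    """
    padded = np.full((max_sentnum, max_sentlen), pad_id, dtype=np.int32)
    # Keep at most max_sentnum sentences and max_sentlen tokens per sentence
    for i, sent in enumerate(sentences[:max_sentnum]):
        trimmed = sent[:max_sentlen]
        padded[i, :len(trimmed)] = trimmed
    return padded

# The three-sentence prompt from the example above, as POS tag IDs
prompt = [
    [3, 18, 5, 7],
    [5, 2, 18, 3, 4, 5, 10],
    [2, 7, 5],
]

matrix = pad_prompt(prompt, max_sentnum=8, max_sentlen=18)
print(matrix.shape)
print(matrix[:3, :8])
```

Rows 3 to 7 of `matrix` contain only the padding ID 0, which is the ID that `pos_vocab` assigns to `<pad>`.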

In [27]:
# read POS for prompts
"""
This dictionary pos_vocab maps part-of-speech (POS) tags to integer IDs. The POS tags represent parts of speech such as determiners, nouns, verbs, and adjectives. Here’s a closer look:

'<pad>': 0, '<unk>': 1 — Special tokens for padding and unknown words.
The integer mappings for tags like 'DT': 2, 'NNS': 3, etc., are used to represent words by their grammatical roles in the prompt data.
For example:

NN: Noun (e.g., 5)
VBD: Verb, past tense (e.g., 6)
JJ: Adjective (e.g., 7)
These mappings allow the model to work with POS tags as numeric data during processing.

The prompt_pos_data dictionary contains the POS-encoded prompt data and metadata:

prompt_pos: This is a list of prompts, where each prompt is encoded as a list of sentences, and each sentence is represented as a list of integers corresponding to POS tags.

Example: [[3, 20, 3, 5, 10, 3, 5, 8], ...]
In this sentence, the integers represent a sequence of POS tags for words in that sentence. For instance, [3, 20, 3, 5, 10, 3, 5, 8] would correspond to the POS tags like NNS, VBP, etc., in the sequence.
Each prompt has multiple sentences, with each sentence containing POS-encoded words.

prompt_ids: List of IDs for each prompt. Each integer in this list corresponds to a prompt, allowing for easy reference during training and evaluation.

max_sentnum: 8 — The maximum number of sentences in a prompt. This ensures all prompts have a consistent number of sentences, either through truncation or padding.

max_sentlen: 18 — The maximum length of a sentence across all prompts. Each sentence will be either padded or truncated to 18 tokens.




"""
pos_vocab = read_pos_vocab(read_configs)
prompt_pos_data = read_prompts_pos(prompt_path, pos_vocab) # for prompt POS embedding

print(pos_vocab)
print(prompt_pos_data)

 prompt_pos size: 8
{'<pad>': 0, '<unk>': 1, 'DT': 2, 'NNS': 3, 'IN': 4, 'NN': 5, 'VBD': 6, 'JJ': 7, '.': 8, 'CD': 9, 'VBZ': 10, 'RB': 11, 'VBG': 12, 'TO': 13, 'PRP$': 14, 'PRP': 15, 'VBN': 16, 'WDT': 17, 'VB': 18, 'WRB': 19, 'VBP': 20, 'MD': 21, 'CC': 22, 'EX': 23, ',': 24, 'JJR': 25, 'RP': 26, 'WP': 27, ':': 28, 'NNP': 29, 'RBR': 30, 'FW': 31, 'POS': 32, 'JJS': 33, "''": 34}
{'prompt_pos': [[[3, 20, 3, 5, 10, 3, 5, 8], [5, 3, 5, 20, 3, 7, 5, 3, 8], [5, 5, 5, 5, 7, 3, 5, 20, 11, 3, 3, 11, 20, 3, 20, 7, 3, 8], [3, 7, 3, 8], [3, 7, 3, 12, 7, 5, 3, 30, 5, 12, 12, 5, 12, 5, 3, 8], [7, 5, 7, 5, 5, 5, 3, 3, 3, 8], [5, 3, 20, 8]], [[5, 3, 8], [15, 20, 5, 20, 5, 3, 3, 16, 5, 8], [11, 18, 5, 5, 5, 5, 11, 11, 11, 5, 11, 8], [3, 6, 5, 15, 8], [5, 5, 5, 8], [11, 7, 20, 5, 12, 3, 7, 3, 8], [18, 7, 3, 3, 5, 3, 3, 11, 8], [16, 3, 6, 7, 5, 5, 12, 3, 5, 3, 5, 8]], [[7, 5, 10, 3, 12, 7, 5, 8], [5, 20, 3, 20, 5, 5, 8]], [[16, 7, 7, 5, 8], [18, 11, 5, 6, 11, 12, 3, 6, 7, 5, 5, 12, 18, 5, 8], [7, 5, 10, 7

The next cell builds the word vocabulary and the word-ID version of the prompts. Its components and output variables are broken down below.

### Explanation of Code and Output

1. **Vocabulary (`word_vocab`):**
   - The `word_vocab` dictionary maps words (like `'the'`, `'to'`, and special tokens like `<pad>`, `<unk>`, `<num>`) to unique IDs.
   - This is used to convert each word in the prompts to its corresponding ID for model processing.

2. **Prompt Data (`prompt_data`):**
   - `prompt_words`: Contains the tokenized prompts in word-ID format (lists of integers). Each word has been converted to its ID from `word_vocab`. Each prompt has at most 8 sentences (`max_sentnum`), and each sentence has at most 18 tokens (`max_sentlen`).
   - `prompt_ids`: This is a list of IDs for each prompt, where each number represents a unique prompt.

3. **Example Breakdown:**
   - `prompt_words`: An example list of tokenized prompts.
     ```python
     'prompt_words': [
         [
             [37, 272, 2141, 160, 3829, 2873, 621, 4],  # First sentence of the first prompt
             [1023, 1, 2232, 201, 2141, 837, 724, 37, 4],  # Second sentence of the first prompt
             ...
         ],
         [
             [218, 125, 4],  # First sentence of the second prompt
             ...
         ],
         ...
     ]
     ```
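   - At this stage the list is still ragged: prompts have up to 8 sentences and sentences have up to 18 tokens, as the printed output shows. Padding to a fixed-size matrix (short sentences filled with 0, missing sentences added as all-zero rows) happens later, in `pad_hierarchical_text_sequences`.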

4. **Prompt Word Sizes (`max_sentnum` and `max_sentlen`):**
   - The `max_sentnum: 8` and `max_sentlen: 18` settings specify that each prompt should contain 8 sentences with a maximum of 18 tokens each.
   - Shorter sentences or prompts are padded to match this size, ensuring all data is of consistent shape, which is critical for efficient model training.
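
The word-to-ID conversion described above can be sketched as follows. The toy vocabulary, the helper name `encode_sentence`, and the lowercasing step are assumptions for illustration; the actual logic lives in `read_prompts_we` and may differ (for example in how numbers are mapped to `<num>`).

```python
def encode_sentence(tokens, vocab):
    """Map tokens to vocabulary IDs, falling back to <unk> for unknown words.

    Hypothetical helper; lowercasing is an assumption of this sketch.
    """
    unk_id = vocab['<unk>']
    return [vocab.get(tok.lower(), unk_id) for tok in tokens]

# Toy vocabulary, not the notebook's real word_vocab
toy_vocab = {'<pad>': 0, '<unk>': 1, '<num>': 2, '.': 4, 'the': 5, 'to': 6}

encoded = encode_sentence(['Write', 'to', 'the', 'newspaper', '.'], toy_vocab)
print(encoded)  # [1, 6, 5, 1, 4]
```

Words missing from the vocabulary ('write' and 'newspaper' here) collapse to the `<unk>` ID, which is why the value 1 recurs in the printed `prompt_words`.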


In [28]:
    
# read words for prompts 
word_vocab = read_word_vocab(read_configs)
prompt_data = read_prompts_we(prompt_path, word_vocab) # for prompt word embedding 

print(word_vocab)
print(prompt_data)

 prompt_words size: 8
{'prompt_words': [[[37, 272, 2141, 160, 3829, 2873, 621, 4], [1023, 1, 2232, 201, 2141, 837, 724, 37, 4], [825, 600, 1317, 1, 270, 37, 1091, 312, 1, 870, 37, 94, 250, 37, 377, 2185, 37, 4], [292, 175, 698, 4], [2917, 2441, 37, 2237, 170, 76, 2141, 772, 76, 1, 1773, 342, 1, 105, 174, 4], [662, 2066, 1273, 1779, 107, 384, 1337, 2141, 37, 4], [1, 824, 594, 4]], [[218, 125, 4], [122, 72, 73, 340, 1007, 124, 124, 288, 262, 4], [143, 409, 73, 262, 275, 2714, 79, 949, 143, 160, 304, 4], [54, 261, 262, 122, 4], [917, 938, 74, 4], [662, 1, 704, 1779, 1, 1, 218, 125, 4], [201, 211, 256, 54, 114, 130, 192, 428, 4], [289, 214, 243, 82, 1023, 2378, 1, 2800, 650, 3272, 229, 4]], [[662, 2547, 736, 281, 165, 319, 106, 4], [2547, 1715, 1274, 704, 1023, 413, 4]], [[90, 271, 131, 84, 4], [190, 108, 150, 814, 820, 309, 864, 739, 341, 483, 210, 472, 96, 148, 4], [662, 2547, 736, 74, 308, 84, 131, 4], [2547, 1715, 2252, 1274, 84, 1023, 698, 4]], [[1526, 92, 245, 74, 151, 4], [1023, 985

In [29]:
!pip freeze > /kaggle/working/requirements.txt

In [30]:
# read essays and prompts
train_data, dev_data, test_data = read_essays_prompts(read_configs, prompt_data, prompt_pos_data, pos_vocab) 

 pos_x size: 9513
 readability_x size: 9513
 pos_x size: 1680
 readability_x size: 1680
 pos_x size: 1783
 readability_x size: 1783


The log line `OOV number =190, OOV ratio = 0.047512` printed by the next cell indicates that 190 words in the vocabulary have no corresponding entry in the GloVe embeddings, which is about 4.75% of the vocabulary.
These out-of-vocabulary (OOV) words cannot take a pretrained vector, so they will likely receive a placeholder vector, such as zeros or a random initialization.
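
A minimal sketch of how such an embedding table could be built and how the OOV count arises. This is not the repository's `build_embedd_table`: the helper name `build_table`, the random-uniform initialization for OOV words, and the choice to leave the `<pad>` row at zero are all assumptions of this example.

```python
import numpy as np

def build_table(vocab, embeddings, dim, seed=0):
    """Build a (len(vocab), dim) table; OOV words get small random vectors.

    Hypothetical helper. The <pad> row stays all-zero and is not counted as OOV.
    """
    rng = np.random.default_rng(seed)
    scale = np.sqrt(3.0 / dim)
    table = np.zeros((len(vocab), dim), dtype=np.float32)
    oov = 0
    for word, idx in vocab.items():
        if word == '<pad>':
            continue
        if word in embeddings:
            table[idx] = embeddings[word]
        else:
            table[idx] = rng.uniform(-scale, scale, dim)
            oov += 1
    return table, oov

# Toy data: only 'essay' has a pretrained vector
toy_vocab = {'<pad>': 0, '<unk>': 1, 'essay': 2, 'zzzz': 3}
toy_embeddings = {'essay': np.ones(4, dtype=np.float32)}

table, oov = build_table(toy_vocab, toy_embeddings, dim=4)
print(table.shape, oov)
```

In this toy case `<unk>` and `zzzz` are counted as OOV, while `essay` keeps its pretrained vector.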

In [34]:
if pretrained_embedding:
    embedd_dict, embedd_dim, _ = load_word_embedding_dict(embedding_path)
    embedd_matrix = build_embedd_table(word_vocab, embedd_dict, embedd_dim, caseless=True)
    embed_table = [embedd_matrix]
else:
    embed_table = None

Loading GloVe ...
OOV number =190, OOV ratio = 0.047512


The input and output arrays built in the following cells are structured as follows:

### Input Data Explanation

1. **`X_train_pos`, `X_dev_pos`, `X_test_pos`**:
   - These represent Part-of-Speech (POS) tagged sequences for the essays.
   - Each is structured as `(number of samples, max_sentnum * max_sentlen)`, where:
     - `max_sentnum` is the maximum number of sentences in any essay.
     - `max_sentlen` is the maximum number of words in any sentence.
   - This padding ensures that all essays are represented by sequences of the same length, allowing batch processing.

2. **`X_train_prompt`, `X_dev_prompt`, `X_test_prompt`**:
   - These represent the prompt words, with the same padding structure as the essay POS tags.
   - Since the model may need to account for the prompt (e.g., the essay topic) when scoring, the prompt words are treated as an input feature.

3. **`X_train_prompt_pos`, `X_dev_prompt_pos`, `X_test_prompt_pos`**:
   - These are POS-tagged versions of the prompt words, structured similarly to the essay data.
   - They help the model learn any linguistic patterns associated with prompts and how these might influence essay scoring.

4. **`X_train_readability`, `X_dev_readability`, `X_test_readability`**:
   - These are readability scores for each essay, with 35 features each.
   - They offer additional essay characteristics that reflect various readability metrics, providing insights into linguistic complexity, sentence structure, vocabulary usage, etc.

5. **`X_train_linguistic_features`, `X_dev_linguistic_features`, `X_test_linguistic_features`**:
   - These are handcrafted linguistic features with 52 values per essay.
   - Examples could include sentence length, word variety, spelling errors, or grammar complexity metrics, all of which can influence an essay’s quality.

6. **`X_train_attribute_rel`, `X_dev_attribute_rel`, `X_test_attribute_rel`**:
   - These are attribute masks shaped as `(samples, number of attributes)`, derived from the score arrays by `get_attribute_masks`.
   - Not every prompt is scored on every attribute (unscored attributes appear as `-1` in `y_scaled`), so the mask presumably marks which attributes are available for each essay.

### Output Data Explanation

1. **`Y_train`, `Y_dev`, `Y_test`**:
   - These are the scaled scores for each essay across multiple scoring attributes, represented as `(samples, number of attributes)`.
   - Each attribute might represent a different scoring criterion, such as grammar, cohesion, or creativity.
   - The scores have been scaled down to fall within a consistent range (e.g., between 0 and 1), allowing the model to predict values in a normalized range. Attributes that are not scored for an essay's prompt are stored as `-1`.

### How These Inputs and Outputs Are Used in the Model

In a typical model architecture for this setup:
- **Embedding Layers**: For `X_train_pos`, `X_train_prompt`, and `X_train_prompt_pos`, embeddings or POS embeddings can be applied, translating each word/POS tag into a vector space.
- **Feature Concatenation**: The different inputs (readability, linguistic features, prompt, POS tags) are concatenated to form a comprehensive feature set for each essay.
- **Attribute Masking**: `X_train_attribute_rel` marks which attributes apply to each essay. In this notebook it is not part of `train_features_list`; the loss functions instead ignore targets equal to `-1`.
- **Output Layer**: The model outputs predicted scores for each attribute. These are fit against `Y_train` during training and compared with `Y_test` in evaluation.

This setup allows the model to capture both linguistic and structural essay features, enhancing the ability to predict scores across multiple grading dimensions.
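
The role of the `-1` entries can be sketched in NumPy. The mask line is an assumption about what `get_attribute_masks` computes, and the loss mirrors the structure of `masked_loss_function` defined later in this notebook; the target row has the form that `dev_data['y_scaled'][0]` takes, and the prediction row is made up for illustration.

```python
import numpy as np

MASK_VALUE = -1

# One essay's scaled targets; -1 marks attributes not scored for its prompt
y_true = np.array([[0.75, 0.5, -1, -1, -1, -1, 0.5, 0.5, 0.5]])
# Made-up predictions for illustration
y_pred = np.full_like(y_true, 0.5)

# 1 where a real score exists, 0 where the attribute is masked
attribute_mask = np.not_equal(y_true, MASK_VALUE).astype('int32')

# Masked MSE: masked positions contribute zero error but still count in the mean
masked_mse = float(np.mean((y_true * attribute_mask - y_pred * attribute_mask) ** 2))

print(attribute_mask)
print(masked_mse)
```

Because the mean runs over all nine positions, essays with many masked attributes produce a smaller loss for the same per-attribute error.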

In [35]:
max_sentlen = max(train_data['max_sentlen'], dev_data['max_sentlen'], test_data['max_sentlen'])
max_sentnum = max(train_data['max_sentnum'], dev_data['max_sentnum'], test_data['max_sentnum'])
prompt_max_sentlen = prompt_data['max_sentlen']
prompt_max_sentnum = prompt_data['max_sentnum']

print('max sent length: {}'.format(max_sentlen))
print('max sent num: {}'.format(max_sentnum))
print('max prompt sent length: {}'.format(prompt_max_sentlen))
print('max prompt sent num: {}'.format(prompt_max_sentnum))


max sent length: 50
max sent num: 97
max prompt sent length: 18
max prompt sent num: 8


In [36]:
# scale final scores

train_data['y_scaled'] = get_scaled_down_scores(train_data['data_y'], train_data['prompt_ids'])
dev_data['y_scaled'] = get_scaled_down_scores(dev_data['data_y'], dev_data['prompt_ids'])
test_data['y_scaled'] = get_scaled_down_scores(test_data['data_y'], test_data['prompt_ids'])
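
For intuition, min-max scaling per prompt can be sketched as below. This is not the repository's `get_scaled_down_scores`: the helper `scale_scores`, the `SCORE_RANGES` table, and the use of a single range per prompt (rather than per attribute) are simplifying assumptions of this example.

```python
import numpy as np

# Hypothetical (min, max) score range per prompt ID, for illustration only
SCORE_RANGES = {1: (2, 12), 2: (1, 6)}

def scale_scores(scores, prompt_ids, ranges, mask_value=-1):
    """Min-max scale each row by its prompt's range, leaving masked entries as-is."""
    scaled = []
    for row, pid in zip(scores, prompt_ids):
        lo, hi = ranges[pid]
        scaled.append([
            mask_value if s == mask_value else (s - lo) / (hi - lo)
            for s in row
        ])
    return np.array(scaled, dtype=np.float64)

raw = [[8, -1], [4, 6]]
scaled = scale_scores(raw, prompt_ids=[1, 2], ranges=SCORE_RANGES)
print(scaled)
```

Scaling by prompt puts scores from prompts with different rubrics on a common 0 to 1 scale, while `-1` entries pass through unchanged so the loss can still mask them.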



In [37]:
X_train_pos = pad_hierarchical_text_sequences(train_data['pos_x'], max_sentnum, max_sentlen)
X_dev_pos = pad_hierarchical_text_sequences(dev_data['pos_x'], max_sentnum, max_sentlen)
X_test_pos = pad_hierarchical_text_sequences(test_data['pos_x'], max_sentnum, max_sentlen)

X_train_pos = X_train_pos.reshape((X_train_pos.shape[0], X_train_pos.shape[1] * X_train_pos.shape[2]))
X_dev_pos = X_dev_pos.reshape((X_dev_pos.shape[0], X_dev_pos.shape[1] * X_dev_pos.shape[2]))
X_test_pos = X_test_pos.reshape((X_test_pos.shape[0], X_test_pos.shape[1] * X_test_pos.shape[2]))

X_train_prompt = pad_hierarchical_text_sequences(train_data['prompt_words'], max_sentnum, max_sentlen)
X_dev_prompt = pad_hierarchical_text_sequences(dev_data['prompt_words'], max_sentnum, max_sentlen)
X_test_prompt = pad_hierarchical_text_sequences(test_data['prompt_words'], max_sentnum, max_sentlen)

X_train_prompt = X_train_prompt.reshape((X_train_prompt.shape[0], X_train_prompt.shape[1] * X_train_prompt.shape[2]))
X_dev_prompt = X_dev_prompt.reshape((X_dev_prompt.shape[0], X_dev_prompt.shape[1] * X_dev_prompt.shape[2]))
X_test_prompt = X_test_prompt.reshape((X_test_prompt.shape[0], X_test_prompt.shape[1] * X_test_prompt.shape[2]))

X_train_prompt_pos = pad_hierarchical_text_sequences(train_data['prompt_pos'], max_sentnum, max_sentlen)
X_dev_prompt_pos = pad_hierarchical_text_sequences(dev_data['prompt_pos'], max_sentnum, max_sentlen)
X_test_prompt_pos = pad_hierarchical_text_sequences(test_data['prompt_pos'], max_sentnum, max_sentlen)

X_train_prompt_pos = X_train_prompt_pos.reshape((X_train_prompt_pos.shape[0], X_train_prompt_pos.shape[1] * X_train_prompt_pos.shape[2]))
X_dev_prompt_pos = X_dev_prompt_pos.reshape((X_dev_prompt_pos.shape[0], X_dev_prompt_pos.shape[1] * X_dev_prompt_pos.shape[2]))
X_test_prompt_pos = X_test_prompt_pos.reshape((X_test_prompt_pos.shape[0], X_test_prompt_pos.shape[1] * X_test_prompt_pos.shape[2]))
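
The flattening above turns each `(max_sentnum, max_sentlen)` matrix into one row of length `max_sentnum * max_sentlen` (97 * 50 = 4850 here), and the model's `Reshape` layers later restore the sentence structure. A small NumPy sketch of that round trip, using a made-up two-essay batch:

```python
import numpy as np

max_sentnum, max_sentlen = 97, 50

# Made-up batch of two padded essays, shape (2, 97, 50)
batch = np.zeros((2, max_sentnum, max_sentlen), dtype=np.int32)
batch[0, 0, :3] = [4, 2, 5]   # first sentence of essay 0
batch[0, 1, :2] = [16, 7]     # second sentence of essay 0

# Flatten sentences into one long row per essay
flat = batch.reshape((batch.shape[0], -1))

# Sentence i starts at offset i * max_sentlen in the flat row
second_sentence = flat[0, max_sentlen:max_sentlen + 2]

# Reshaping back recovers the original layout exactly
restored = flat.reshape((-1, max_sentnum, max_sentlen))

print(flat.shape)
print(second_sentence)
```

Since reshaping only reinterprets the memory layout, no token moves relative to its sentence, which is why the model can safely undo the flattening.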


In [38]:
X_train_linguistic_features = np.array(train_data['features_x'])
X_dev_linguistic_features = np.array(dev_data['features_x'])
X_test_linguistic_features = np.array(test_data['features_x'])

X_train_readability = np.array(train_data['readability_x'])
X_dev_readability = np.array(dev_data['readability_x'])
X_test_readability = np.array(test_data['readability_x'])


In [39]:
Y_train = np.array(train_data['y_scaled'])
Y_dev = np.array(dev_data['y_scaled'])
Y_test = np.array(test_data['y_scaled'])

In [40]:
X_train_attribute_rel = get_attribute_masks(Y_train)
X_dev_attribute_rel = get_attribute_masks(Y_dev)
X_test_attribute_rel = get_attribute_masks(Y_test)

print('================================')
print('X_train_pos: ', X_train_pos.shape)
print('X_train_prompt_words: ', X_train_prompt.shape)
print('X_train_prompt_pos: ', X_train_prompt_pos.shape)
print('X_train_readability: ', X_train_readability.shape)
print('X_train_ling: ', X_train_linguistic_features.shape)
print('X_train_attribute_rel: ', X_train_attribute_rel.shape)
print('Y_train: ', Y_train.shape)

print('================================')
print('X_dev_pos: ', X_dev_pos.shape)
print('X_dev_prompt_words: ', X_dev_prompt.shape)
print('X_dev_prompt_pos: ', X_dev_prompt_pos.shape)
print('X_dev_readability: ', X_dev_readability.shape)
print('X_dev_ling: ', X_dev_linguistic_features.shape)
print('X_dev_attribute_rel: ', X_dev_attribute_rel.shape)
print('Y_dev: ', Y_dev.shape)

print('================================')
print('X_test_pos: ', X_test_pos.shape)
print('X_test_prompt_words: ', X_test_prompt.shape)
print('X_test_prompt_pos: ', X_test_prompt_pos.shape)
print('X_test_readability: ', X_test_readability.shape)
print('X_test_ling: ', X_test_linguistic_features.shape)
print('X_test_attribute_rel: ', X_test_attribute_rel.shape)
print('Y_test: ', Y_test.shape)
print('================================')


X_train_pos:  (9513, 4850)
X_train_prompt_words:  (9513, 4850)
X_train_prompt_pos:  (9513, 4850)
X_train_readability:  (9513, 35)
X_train_ling:  (9513, 52)
X_train_attribute_rel:  (9513, 9)
Y_train:  (9513, 9)
X_dev_pos:  (1680, 4850)
X_dev_prompt_words:  (1680, 4850)
X_dev_prompt_pos:  (1680, 4850)
X_dev_readability:  (1680, 35)
X_dev_ling:  (1680, 52)
X_dev_attribute_rel:  (1680, 9)
Y_dev:  (1680, 9)
X_test_pos:  (1783, 4850)
X_test_prompt_words:  (1783, 4850)
X_test_prompt_pos:  (1783, 4850)
X_test_readability:  (1783, 35)
X_test_ling:  (1783, 52)
X_test_attribute_rel:  (1783, 9)
Y_test:  (1783, 9)


In [53]:
print(X_dev_pos[0])
print(X_dev_prompt_pos[0])

[4 2 5 ... 0 0 0]
[16  7  7 ...  0  0  0]


In [56]:
dev_data['prompt_pos'][0]

[[16, 7, 7, 3, 3, 20, 5, 5, 6, 12, 7, 3, 5, 8], [5, 5, 5, 7, 5, 5, 8]]

In [48]:
dev_data['y_scaled'][0]

[0.75, 0.5, -1, -1, -1, -1, 0.5, 0.5, 0.5]

In [44]:
dev_data['readability_x'][0]

array([0.59565176, 0.41567585, 0.52881241, 0.31776822, 0.46624325,
       0.478611  , 0.48412292, 0.32608696, 0.35442793, 0.48915323,
       0.71380471, 0.39204545, 0.35      , 0.40467261, 0.33896757,
       0.33544304, 0.34565217, 0.46575342, 0.35      , 0.        ,
       0.28205128, 0.24193548, 0.29577465, 0.22727273, 0.26666667,
       0.07692308, 0.21428571, 0.32307692, 0.22222222, 0.16666667,
       0.        , 0.3       , 0.        , 0.        , 0.14285714])

In [46]:
dev_data['features_x'][0] # linguistic features / handcrafted features

[0.9113381674357284,
 0.37864864864864844,
 0.13036962365591392,
 0.20262390670553934,
 0.02884128858154832,
 0.3386581469648562,
 0.3481152993348115,
 0.27272727272727276,
 0.5,
 0.21428571428571425,
 0.08333333333333333,
 0.25,
 0.3225806451612903,
 0.41558441558441556,
 0.4382207939063992,
 0.14100185528756956,
 0.17666666666666667,
 0.455641592920354,
 0.5714285714285714,
 0.4285714285714285,
 0.0,
 0.9737648985143633,
 0.0,
 0.06629834254143603,
 0.27071823204419854,
 0.32320441988950255,
 0.0,
 0.0,
 0.0,
 0.17403314917127063,
 0.06629834254143603,
 0.5318600368324126,
 0.03867403314917102,
 0.19521178637200623,
 0.06156274664561914,
 0.2817679558011043,
 0.2900552486187845,
 0.3845303867403317,
 0.1988950276243093,
 0.0,
 0.23941068139963156,
 0.07213014119091456,
 0.0,
 0.0,
 0.5805556019206267,
 0.2394106813996303,
 0.6332544067350698,
 0.30386740331491713,
 0.15469613259668505,
 0.17403314917127063,
 0.0911602209944745,
 0.2917127071823204]

In [61]:
train_features_list = [X_train_pos, X_train_prompt, X_train_prompt_pos, X_train_linguistic_features, X_train_readability]
dev_features_list = [X_dev_pos, X_dev_prompt, X_dev_prompt_pos, X_dev_linguistic_features, X_dev_readability]
test_features_list = [X_test_pos, X_test_prompt, X_test_prompt_pos, X_test_linguistic_features, X_test_readability]
     


In [62]:
dev_features_list

[array([[ 4,  2,  5, ...,  0,  0,  0],
        [16,  4,  2, ...,  0,  0,  0],
        [ 2,  5, 10, ...,  0,  0,  0],
        ...,
        [ 2,  5, 10, ...,  0,  0,  0],
        [ 2,  5,  4, ...,  0,  0,  0],
        [ 9,  5,  5, ...,  0,  0,  0]], dtype=int32),
 array([[ 756,  445, 1526, ...,    0,    0,    0],
        [ 756,  445, 1526, ...,    0,    0,    0],
        [ 662, 2547,  736, ...,    0,    0,    0],
        ...,
        [  90,  271,  131, ...,    0,    0,    0],
        [ 662,  248,    4, ...,    0,    0,    0],
        [ 662,  248,    4, ...,    0,    0,    0]], dtype=int32),
 array([[16,  7,  7, ...,  0,  0,  0],
        [16,  7,  7, ...,  0,  0,  0],
        [ 7,  5, 10, ...,  0,  0,  0],
        ...,
        [16,  7,  7, ...,  0,  0,  0],
        [ 7,  5,  8, ...,  0,  0,  0],
        [ 7,  5,  8, ...,  0,  0,  0]], dtype=int32),
 array([[0.91133817, 0.37864865, 0.13036962, ..., 0.17403315, 0.09116022,
         0.29171271],
        [0.97709295, 0.44629847, 0.12943879, .

In [63]:
import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as layers
from tensorflow import keras
import tensorflow.keras.backend as K
from custom_layers.zeromasking import ZeroMaskedEntries
from custom_layers.attention import Attention
from custom_layers.multiheadattention_pe import MultiHeadAttention_PE
from custom_layers.multiheadattention import MultiHeadAttention

# Correlation Coefficient: Computes the Pearson correlation coefficient, while ignoring masked values.
# Cosine Similarity: Measures the cosine of the angle between two non-zero vectors, effectively quantifying their similarity.

def correlation_coefficient(trait1, trait2):
    x = trait1
    y = trait2
    
    # masking if either x or y is a masked value
    mask_value = -0.
    mask_x = K.cast(K.not_equal(x, mask_value), K.floatx())
    mask_y = K.cast(K.not_equal(y, mask_value), K.floatx())
    
    mask = mask_x * mask_y
    x_masked, y_masked = x * mask, y * mask
    
    mx = K.sum(x_masked) / K.sum(mask) # ignore the masked values when obtaining the mean
    my = K.sum(y_masked) / K.sum(mask) # ignore the masked values when obtaining the mean
    
    xm, ym = (x_masked-mx) * mask, (y_masked-my) * mask # re-mask after centering so masked entries stay zero
    
    r_num = K.sum(xm * ym)
    r_den = K.sqrt(K.sum(K.square(xm)) * K.sum(K.square(ym)))
    r = 0.
    r = tf.cond(r_den > 0, lambda: r_num / (r_den), lambda: r+0)
    return r

def cosine_sim(trait1, trait2):
    x = trait1
    y = trait2
    
    mask_value = 0.
    mask_x = K.cast(K.not_equal(x, mask_value), K.floatx())
    mask_y = K.cast(K.not_equal(y, mask_value), K.floatx())
    
    mask = mask_x * mask_y
    x_masked, y_masked = x*mask, y*mask
    
    normalize_x = tf.nn.l2_normalize(x_masked,0) * mask # apply the mask
    normalize_y = tf.nn.l2_normalize(y_masked,0) * mask # apply the mask
        
    cos_similarity = tf.reduce_sum(tf.multiply(normalize_x, normalize_y))
    return cos_similarity
    

# Trait Similarity Loss: This function calculates a similarity loss based on the correlation coefficient and cosine similarity 
# between different traits. It encourages the model to produce predictions that are similar for traits that have a high correlation.

# Masked Loss Function: This function computes the mean squared error while ignoring certain masked values in 
# the target and predicted outputs.

# Total Loss Function: Combines the masked loss and trait similarity loss, allowing for a balance between prediction 
# accuracy and trait similarity.

def trait_sim_loss(y_true, y_pred):
    mask_value = -1
    mask = K.cast(K.not_equal(y_true, mask_value), K.floatx())
    
    # masking
    y_trans = tf.transpose(y_true * mask)
    y_pred_trans = tf.transpose(y_pred * mask)
    
    sim_loss = 0.0
    cnt = 0.0
    ts_loss = 0.
    #trait_num = y_true.shape[1]
    trait_num = 9
    print('trait num: ', trait_num)
    
    # start from idx 1, since we ignore the overall score 
    for i in range(1, trait_num):
        for j in range(i+1, trait_num):
            corr = correlation_coefficient(y_trans[i], y_trans[j])
            sim_loss = tf.cond(corr>=0.7, lambda: tf.add(sim_loss, 1-cosine_sim(y_pred_trans[i], y_pred_trans[j])), 
                            lambda: tf.add(sim_loss, 0))
            cnt = tf.cond(corr>=0.7, lambda: tf.add(cnt, 1), 
                            lambda: tf.add(cnt, 0))
    ts_loss = tf.cond(cnt > 0, lambda: sim_loss/cnt, lambda: ts_loss+0)
    return ts_loss
    
def masked_loss_function(y_true, y_pred):
    mask_value = -1
    mask = K.cast(K.not_equal(y_true, mask_value), K.floatx())
    mse = keras.losses.MeanSquaredError()
    return mse(y_true * mask, y_pred * mask)

def total_loss(y_true, y_pred):
    alpha = 0.7
    mse_loss = masked_loss_function(y_true, y_pred)
    ts_loss = trait_sim_loss(y_true, y_pred)
    return alpha * mse_loss + (1-alpha) * ts_loss

def build_ProTACT(pos_vocab_size, vocab_size, maxnum, maxlen, readability_feature_count,
                  linguistic_feature_count, configs, output_dim, num_heads, embedding_weights):
    embedding_dim = configs.EMBEDDING_DIM
    dropout_prob = configs.DROPOUT
    cnn_filters = configs.CNN_FILTERS
    cnn_kernel_size = configs.CNN_KERNEL_SIZE
    lstm_units = configs.LSTM_UNITS
    
    ### 1. Essay Representation
    
    # Input layer for the essay's part-of-speech (POS) tag IDs, flattened to maxnum * maxlen
    pos_input = layers.Input(shape=(maxnum * maxlen,), dtype='int32', name='pos_input')
    
    # Embedding layer that maps POS tag IDs to dense vectors
    pos_x = layers.Embedding(output_dim=embedding_dim, input_dim=pos_vocab_size, input_length=maxnum * maxlen,
                             weights=None, mask_zero=True, name='pos_x')(pos_input)
    
    # Masking out the padding in the embeddings
    pos_x_maskedout = ZeroMaskedEntries(name='pos_x_maskedout')(pos_x)
    
    # Applying dropout to the POS embeddings to prevent overfitting
    pos_drop_x = layers.Dropout(dropout_prob, name='pos_drop_x')(pos_x_maskedout)
    
    # Reshaping the embeddings for CNN processing
    pos_resh_W = layers.Reshape((maxnum, maxlen, embedding_dim), name='pos_resh_W')(pos_drop_x)
    
    # Convolutional layer to extract local features from the POS embeddings
    pos_zcnn = layers.TimeDistributed(layers.Conv1D(cnn_filters, cnn_kernel_size, padding='valid'), name='pos_zcnn')(pos_resh_W)
    
    # Applying attention to summarize the feature maps generated by the CNN
    pos_avg_zcnn = layers.TimeDistributed(Attention(), name='pos_avg_zcnn')(pos_zcnn)

    # Input layer for linguistic features
    linguistic_input = layers.Input((linguistic_feature_count,), name='linguistic_input')
    # Input layer for readability features
    readability_input = layers.Input((readability_feature_count,), name='readability_input')

    # Applying Multi-Head Attention to the sentence-level POS representations, once per output trait
    pos_MA_list = [MultiHeadAttention(100, num_heads)(pos_avg_zcnn) for _ in range(output_dim)]
    # LSTM layers to capture sequential dependencies in attention outputs
    pos_MA_lstm_list = [layers.LSTM(lstm_units, return_sequences=True)(pos_MA) for pos_MA in pos_MA_list]
    # Attention mechanism to summarize LSTM outputs
    pos_avg_MA_lstm_list = [Attention()(pos_hz_lstm) for pos_hz_lstm in pos_MA_lstm_list]

    ### 2. Prompt Representation
    # word embedding

    # Input layer for word indices in the prompt
    prompt_word_input = layers.Input(shape=(maxnum * maxlen,), dtype='int32', name='prompt_word_input')
    # Word embedding for the prompt, using pre-trained weights
    prompt = layers.Embedding(output_dim=embedding_dim, input_dim=vocab_size, input_length=maxnum * maxlen,
                              weights=embedding_weights, mask_zero=True, name='prompt')(prompt_word_input)
    # Masking out the padding in the prompt embeddings
    prompt_maskedout = ZeroMaskedEntries(name='prompt_maskedout')(prompt)

    # POS embedding
    # Input layer for POS tag IDs in the prompt
    prompt_pos_input = layers.Input(shape=(maxnum * maxlen,), dtype='int32', name='prompt_pos_input')
    # Position embedding for the prompt
    prompt_pos = layers.Embedding(output_dim=embedding_dim, input_dim=pos_vocab_size, input_length=maxnum * maxlen,
                                  weights=None, mask_zero=True, name='pos_prompt')(prompt_pos_input)
    # Masking out the padding in the position embeddings of the prompt
    prompt_pos_maskedout = ZeroMaskedEntries(name='prompt_pos_maskedout')(prompt_pos)
    
    # add word + pos embedding
    prompt_emb = tf.keras.layers.Add()([prompt_maskedout, prompt_pos_maskedout])

    # Applying dropout to the combined embeddings
    prompt_drop_x = layers.Dropout(dropout_prob, name='prompt_drop_x')(prompt_emb)
    # Reshaping for CNN processing
    prompt_resh_W = layers.Reshape((maxnum, maxlen, embedding_dim), name='prompt_resh_W')(prompt_drop_x)
    # Convolutional layer to extract features from the prompt
    prompt_zcnn = layers.TimeDistributed(layers.Conv1D(cnn_filters, cnn_kernel_size, padding='valid'), name='prompt_zcnn')(prompt_resh_W)
    # Applying attention to summarize the prompt feature maps
    prompt_avg_zcnn = layers.TimeDistributed(Attention(), name='prompt_avg_zcnn')(prompt_zcnn)

    # Applying Multi-Head Attention to prompt embeddings
    prompt_MA_list = MultiHeadAttention(100, num_heads)(prompt_avg_zcnn)
    # LSTM to capture sequential dependencies in the prompt attention outputs
    prompt_MA_lstm_list = layers.LSTM(lstm_units, return_sequences=True)(prompt_MA_list)
    # Attention to summarize the outputs from the LSTM
    prompt_avg_MA_lstm_list = Attention()(prompt_MA_lstm_list)

    # Query
    query = prompt_avg_MA_lstm_list

    # Attention between each trait-specific essay representation and the prompt representation (query)
    es_pr_MA_list = [MultiHeadAttention_PE(100, num_heads)(pos_avg_MA_lstm_list[i], query) for i in range(output_dim)]
    # LSTM layers to process the results from attention
    es_pr_MA_lstm_list = [layers.LSTM(lstm_units, return_sequences=True)(pos_hz_MA) for pos_hz_MA in es_pr_MA_list]
    # Summarizing the LSTM outputs with attention
    es_pr_avg_lstm_list = [Attention()(pos_hz_lstm) for pos_hz_lstm in es_pr_MA_lstm_list]
    # Concatenating representations with linguistic and readability features
    es_pr_feat_concat = [layers.Concatenate()([rep, linguistic_input, readability_input])
                         for rep in es_pr_avg_lstm_list]

    # Wrapping tf.concat inside a Lambda layer to handle concatenation
    pos_avg_hz_lstm = layers.Lambda(lambda reps: tf.concat(
        [layers.Reshape((1, lstm_units + linguistic_feature_count + readability_feature_count))(rep)
         for rep in reps], axis=-2))(es_pr_feat_concat)

    final_preds = []
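    for index in range(output_dim):
        # Boolean mask that keeps every trait except the current target
        mask = np.array([True for _ in range(output_dim)])
        mask[index] = False
        
        # Wrapping tf.boolean_mask inside a Lambda layer; the default argument binds
        # this iteration's mask so the lambda cannot pick up a later loop value
        non_target_rep = layers.Lambda(lambda x, keep=mask: tf.boolean_mask(x, keep, axis=-2))(pos_avg_hz_lstm)
        target_rep = pos_avg_hz_lstm[:, index:index+1]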
        
        # Applying attention to the target representation and the non-target representations
        att_attention = layers.Attention()([target_rep, non_target_rep])
        # Concatenating the target and attended representations
        attention_concat = layers.Concatenate(axis=-1)([target_rep, att_attention])
        attention_concat = layers.Flatten()(attention_concat)
        # Final prediction layer
        final_pred = layers.Dense(units=1, activation='sigmoid')(attention_concat)
        final_preds.append(final_pred)

    # Concatenating all final predictions
    y = layers.Concatenate()(final_preds)

    model = keras.Model(inputs=[pos_input, prompt_word_input, prompt_pos_input, linguistic_input, readability_input], outputs=y)
    model.summary()
    model.compile(loss=total_loss, optimizer='rmsprop')

    return model
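
The prediction loop above creates one lambda per trait, and each lambda depends on that iteration's `mask`. Python closures look up free variables when they are called, not when they are defined, so a per-iteration value is safest when bound through a default argument. The sketch below shows the difference in plain Python; `build_getters` is a hypothetical helper, and whether the issue surfaces in Keras depends on when the Lambda function is invoked, so treat this as a precaution rather than a diagnosis.

```python
def build_getters(num_traits=3):
    """Create getters in a loop, once with late binding and once with a bound default."""
    late, bound = [], []
    for index in range(num_traits):
        mask = [i != index for i in range(num_traits)]
        late.append(lambda: mask)              # reads `mask` when called
        bound.append(lambda keep=mask: keep)   # freezes this iteration's mask
    return late, bound

late, bound = build_getters()
print([f() for f in late])   # every getter returns the last mask
print([f() for f in bound])  # each getter returns its own mask
```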


The `build_ProTACT` function defines the ProTACT neural network, which scores essays on several traits at once from the essay's part-of-speech (POS) sequence, the prompt text, and handcrafted features. Here’s a detailed explanation of each formal parameter in the function signature:

### Function Signature

```python
def build_ProTACT(pos_vocab_size, vocab_size, maxnum, maxlen, readability_feature_count,
                  linguistic_feature_count, configs, output_dim, num_heads, embedding_weights):
```

### Parameter Descriptions

1. **`pos_vocab_size`**:
   - **Type**: `int`
   - **Description**: The size of the part-of-speech (POS) tag vocabulary. In this code `pos` stands for part of speech, not position: the essay and the prompt are fed as sequences of POS tag indices, and this value sets `input_dim` of the POS embedding layer that turns those indices into dense vectors.

2. **`vocab_size`**:
   - **Type**: `int`
   - **Description**: The size of the word vocabulary. This indicates the total number of unique words or tokens in the dataset. It is crucial for the embedding layer that handles word embeddings.

3. **`maxnum`**:
   - **Type**: `int`
   - **Description**: The maximum number of segments (or sentences) per input example. This parameter determines how many sentences the model will consider for each essay or document.

4. **`maxlen`**:
   - **Type**: `int`
   - **Description**: The maximum length of each segment (or sentence) in terms of the number of words. It specifies how many words will be included from each sentence during training.

5. **`readability_feature_count`**:
   - **Type**: `int`
   - **Description**: The number of features related to the readability of the text. These could include various readability metrics that quantify how easy or difficult the text is to read.

6. **`linguistic_feature_count`**:
   - **Type**: `int`
   - **Description**: The number of linguistic features derived from the text. These features could represent syntactic, semantic, or stylistic aspects of the essays that might help improve the model's performance.

7. **`configs`**:
   - **Type**: `Configs` object
   - **Description**: A configuration object that holds the model hyperparameters. The variables `embedding_dim`, `dropout_prob`, `cnn_filters`, `cnn_kernel_size`, and `lstm_units` used in the function body are presumably read from it.

8. **`output_dim`**:
   - **Type**: `int`
   - **Description**: The number of output dimensions or traits that the model will predict. This is critical as it defines the shape of the model's output layer and represents the different traits associated with the essays.

9. **`num_heads`**:
   - **Type**: `int`
   - **Description**: The number of attention heads used in the multi-head attention mechanism. Using multiple heads allows the model to attend to different parts of the input simultaneously, capturing various relationships and interactions more effectively.

10. **`embedding_weights`**:
    - **Type**: `ndarray` (e.g., `numpy` array)
    - **Description**: A pre-trained embedding matrix (such as GloVe or Word2Vec) that initializes the word embedding layer. This matrix provides dense vector representations for the words in the vocabulary, which can enhance the model’s performance by leveraging learned semantic relationships.

### Summary
Together, these parameters fix the input shapes (`maxnum`, `maxlen`, and the two feature counts), the embedding tables (`vocab_size`, `pos_vocab_size`, `embedding_weights`), and the number of trait-specific branches and outputs (`output_dim`).
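
As a rough illustration of how the size parameters translate into input shapes, the sketch below maps them to the per-example shapes of the five `Input` layers. The helper name `protact_input_shapes` is hypothetical, and the example values (97, 50, 52, 35) are taken from the model summary discussed later in this notebook.

```python
def protact_input_shapes(maxnum, maxlen, linguistic_feature_count, readability_feature_count):
    """Return per-example input shapes keyed by the Input layer names in build_ProTACT."""
    flat_len = maxnum * maxlen  # sentences are flattened into one token sequence
    return {
        "pos_input": (flat_len,),
        "prompt_word_input": (flat_len,),
        "prompt_pos_input": (flat_len,),
        "linguistic_input": (linguistic_feature_count,),
        "readability_input": (readability_feature_count,),
    }

shapes = protact_input_shapes(maxnum=97, maxlen=50,
                              linguistic_feature_count=52,
                              readability_feature_count=35)
for name, shape in shapes.items():
    print(name, shape)
```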

In [64]:
model = build_ProTACT(len(pos_vocab), len(word_vocab), max_sentnum, max_sentlen, 
                      X_train_readability.shape[1],
                      X_train_linguistic_features.shape[1],
                      configs, Y_train.shape[1], num_heads, embed_table)




The call above instantiates the ProTACT model with `build_ProTACT`, which also prints the Keras architecture summary through `model.summary()`. The sections below walk through that summary layer by layer.

### Model Summary Breakdown

1. **Input Layers**:
   - **`prompt_word_input`, `prompt_pos_input`, `pos_input`**: 
     - **Shape**: `(None, 4850)` 
     - **Description**: These are the input layers for the word indices of the prompt, the part-of-speech (POS) tag indices of the prompt, and the POS tag indices of the essay, respectively. `None` is the batch dimension, which Keras leaves unspecified. `4850` is the flattened input length `maxnum * maxlen`, here 97 sentences of 50 tokens each.

2. **Embedding Layers**:
   - **`prompt (Embedding)`**:
     - **Output Shape**: `(None, 4850, 50)`
     - **Parameters**: `200,000`
     - **Description**: This layer transforms the prompt word indices into dense vector representations (embeddings) of size `50`, initialized from the pre-trained embedding table. The 200,000 trainable weights correspond to a vocabulary of 4,000 entries times 50 dimensions.
   
   - **`pos_prompt (Embedding)` and `pos_x (Embedding)`**:
     - **Output Shape**: `(None, 4850, 50)`
     - **Parameters**: `1,750` each
     - **Description**: These layers transform POS tag indices into dense vectors, also of size `50`. The small parameter count reflects the small POS vocabulary: 1,750 parameters correspond to 35 entries times 50 dimensions.

3. **Zero Masked Entries**:
   - **`prompt_maskedout`, `prompt_pos_maskedout`, `pos_x_maskedout`**:
     - **Output Shape**: `(None, 4850, 50)`
     - **Description**: These are instances of the custom `ZeroMaskedEntries` layer. The output shape equals the input shape, so no entries are removed; instead, the layer masks out the padded (index 0) positions so that padding does not contribute to the calculations in the layers that follow.
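
The embedding parameter counts can be checked by hand, since an embedding table has one row per vocabulary entry and one column per dimension. A minimal sketch, where `embedding_params` is a hypothetical helper and the vocabulary sizes are derived from the counts in the summary rather than read from the data:

```python
def embedding_params(vocab_size, embedding_dim):
    """An embedding layer stores one vector of length embedding_dim per vocabulary entry."""
    return vocab_size * embedding_dim

embedding_dim = 50
word_vocab_size = 200_000 // embedding_dim  # implied by the `prompt` layer
pos_vocab_size = 1_750 // embedding_dim     # implied by `pos_prompt` and `pos_x`

print(word_vocab_size, pos_vocab_size)
print(embedding_params(word_vocab_size, embedding_dim))
print(embedding_params(pos_vocab_size, embedding_dim))
```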

The next layers in the summary combine, regularize, and reshape these embeddings, then summarize each sentence.

1. **Add Layer**:
   - **`add (Add)`**:
     - **Output Shape**: `(None, 4850, 50)`
     - **Description**: This layer performs an element-wise addition of the masked prompt word embeddings and the masked prompt POS embeddings (`prompt_maskedout` and `prompt_pos_maskedout`). The result keeps the shape of its inputs, so each prompt token is represented by the sum of its word vector and its POS vector.

2. **Dropout Layers**:
   - **`pos_drop_x (Dropout)`** and **`prompt_drop_x (Dropout)`**:
     - **Output Shape**: `(None, 4850, 50)`
     - **Description**: `pos_drop_x` applies dropout to the masked essay POS embeddings (`pos_x_maskedout`), and `prompt_drop_x` applies it to the output of the `add` layer. Dropout helps prevent overfitting by randomly setting a fraction of input units to zero during training, which forces the model to learn more robust features.

3. **Reshape Layers**:
   - **`pos_resh_W (Reshape)`** and **`prompt_resh_W (Reshape)`**:
     - **Output Shape**: `(None, 97, 50, 50)`
     - **Description**: These layers reshape the flat `(4850, 50)` token sequence into `(maxnum, maxlen, embedding_dim)`: `97` sentences, `50` tokens per sentence, and an embedding size of `50`. This lets the following `TimeDistributed` layers process each sentence separately.

4. **TimeDistributed Layers**:
   - **`pos_zcnn (TimeDistributed)`** and **`prompt_zcnn (TimeDistributed)`**:
     - **Output Shape**: `(None, 97, 46, 100)`
     - **Parameters**: `25,100` each
     - **Description**: These layers apply the same `Conv1D` to each of the 97 sentences independently, which is what the `TimeDistributed` wrapper does. With `padding='valid'`, the 50 token positions shrink to `46`, which is consistent with a kernel size of 5, and `100` is the number of filters. The 25,100 parameters match 5 × 50 × 100 kernel weights plus 100 biases.

5. **Attention Pooling Layers**:
   - **`pos_avg_zcnn (TimeDistributed)`** and **`prompt_avg_zcnn (TimeDistributed)`**:
     - **Output Shape**: `(None, 97, 100)`
     - **Description**: Despite the `avg` in their names, these layers are `TimeDistributed(Attention())` in the code, so they pool with attention rather than with a plain average. Within each sentence, the 46 convolution positions are collapsed into a single 100-dimensional vector, which leaves one vector per sentence.

6. **Multi-Head Attention Layers**:
   - **`multi_head_attention...`** (multiple layers):
     - **Output Shape**: `(None, None, 100)`
     - **Parameters**: `40,400` each
     - **Description**: These are instances of the custom `MultiHeadAttention` layer, applied over the 97 sentence vectors. The essay branch creates one per trait (`output_dim` of them), and the prompt branch adds one more. Each head can learn a different attention distribution over the sentences. The output keeps `100` features per sentence; the `None` in the second position means the sequence length is not statically known in the summary.
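
The convolution numbers in the summary can be reproduced with two standard formulas. This is a sketch under one assumption: the kernel size of 5 is inferred from the output width (46) and the parameter count (25,100), not read from `configs`. Both helper names are hypothetical.

```python
def conv1d_valid_output_len(input_len, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding (no padding added)."""
    return (input_len - kernel_size) // stride + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

out_len = conv1d_valid_output_len(input_len=50, kernel_size=5)
n_params = conv1d_params(kernel_size=5, in_channels=50, filters=100)
print(out_len, n_params)
```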

### Summary of Architecture Flow
The layers covered so far form the first stage of the network, which turns token indices into one vector per sentence:

1. The input layers accept word indices for the prompt and POS tag indices for both the prompt and the essay.
2. Embedding layers convert these indices into dense representations.
3. Masked entries help to focus the model on valid inputs, avoiding the influence of padding.
4. Dropout layers mitigate overfitting by randomly deactivating units during training.
5. Reshape layers prepare the data for convolutional operations.
6. CNN layers capture local features within each sentence, which are then pooled with attention into one vector per sentence.
7. Finally, multi-head attention layers help the model learn relationships and dependencies between the sentences.
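
The flow above can be followed as a sequence of per-example shapes (batch dimension omitted). The sketch below is plain arithmetic, not Keras code; `trace_shapes` is a hypothetical helper, and the kernel size of 5 is again inferred from the summary.

```python
def trace_shapes(maxnum, maxlen, embedding_dim, cnn_filters, cnn_kernel_size):
    """List the per-example tensor shape after each stage of the sentence encoder."""
    return [
        ("input", (maxnum * maxlen,)),
        ("embedding", (maxnum * maxlen, embedding_dim)),
        ("reshape", (maxnum, maxlen, embedding_dim)),
        ("conv1d_valid", (maxnum, maxlen - cnn_kernel_size + 1, cnn_filters)),
        ("attention_pool", (maxnum, cnn_filters)),
    ]

trace = trace_shapes(maxnum=97, maxlen=50, embedding_dim=50,
                     cnn_filters=100, cnn_kernel_size=5)
for stage, shape in trace:
    print(f"{stage:15s} {shape}")
```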

The summary continues with the multi-head attention and LSTM layers, which appear many times because the essay branch is replicated once per trait.

### Model Layers Breakdown (Continued)

7. **Multi-Head Attention Layers**:
   - **`multi_head_attenti… (MultiHeadAttention)`**:
     - **Output Shape**: `(None, None, 100)`
     - **Parameters**: `40,400` each
     - **Description**: The model has several instances of the multi-head attention layer, each with its own weights and each processing its input independently. The `100` denotes the number of output features per sentence, and the `None` in the second position means the sequence length is not statically known in the summary.
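
One way to arrive at 40,400 parameters is four dense projections from 100 to 100 features with biases (query, key, value, and output). That decomposition is an assumption about the custom layer, whose code is not shown in this section; the sketch only checks that the arithmetic fits. `dense_params` is a hypothetical helper.

```python
def dense_params(in_dim, out_dim, use_bias=True):
    """Weights of a fully connected projection, plus optional biases."""
    return in_dim * out_dim + (out_dim if use_bias else 0)

# Assumed layout: query, key, value, and output projections, each 100 -> 100.
mha_params = 4 * dense_params(100, 100)
print(mha_params)
```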

### LSTM Layers
8. **LSTM Layers**:
   - **`lstm (LSTM)`**, **`lstm_9 (LSTM)`**, **`lstm_1 (LSTM)`**, **`lstm_2 (LSTM)`**, **`lstm_3 (LSTM)`**, **`lstm_4 (LSTM)`**, **`lstm_5 (LSTM)`**, **`lstm_6 (LSTM)`**, **`lstm_7 (LSTM)`**, **`lstm_8 (LSTM)`**:
     - **Output Shape**: `(None, None, 100)`
     - **Parameters**: `80,400` each
     - **Description**: There are ten LSTM layers at this stage: nine in the essay branch (one per trait) and one in the prompt branch. Each takes the output of a multi-head attention layer and, with `return_sequences=True`, emits one 100-dimensional state per sentence, so the order of the sentences is modeled explicitly. The `100` is the number of LSTM units (`lstm_units`).
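
The 80,400 parameters per LSTM follow from the standard LSTM layout: four gates, each with an input kernel, a recurrent kernel, and a bias. A short check, assuming the input size is the 100 features produced by the attention layers; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

n_lstm_params = lstm_params(input_dim=100, units=100)
print(n_lstm_params)
```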

### Summary of Model Flow
Here’s a brief overview of how these layers interact:

1. **Multi-Head Attention**: The multiple attention layers learn to focus on different parts of the input sequences. Each layer can attend to the context from different sequences and extract various features.

2. **LSTM Processing**: Following the attention mechanisms, the LSTM layers process the output to capture sequential dependencies. LSTMs are particularly useful in sequence modeling tasks because they can remember information from earlier time steps, which is vital for understanding context in text.

### Implications for Your Model
- **Complex Interactions**: By stacking multiple attention and LSTM layers, your model can learn complex interactions between words and their contexts across sequences.
- **Overfitting and Generalization**: With so many layers, it's essential to monitor for overfitting. Consider using techniques like dropout (which you already have) and regularization to help with generalization.
- **Hyperparameter Tuning**: The number of layers, attention heads, and LSTM units are all hyperparameters that could significantly affect performance. It may be beneficial to experiment with different configurations based on your task requirements.

The next part of the summary covers the attention pooling after these LSTMs and the layers that relate each essay representation to the prompt.

### Attention Layers
1. **Attention Layers**:
   - **`attention_1` to `attention_9`**:
     - **Output Shape**: `(None, 100)`
     - **Parameters**: `0`
     - **Description**: Each of these attention layers takes the sequence output of its corresponding LSTM layer (e.g., `lstm`, `lstm_9`, etc.), weights the sentence-level states by importance, and collapses them into a single fixed-size vector of `100` features.
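
Attention pooling of this kind can be pictured as a softmax-weighted sum over the steps of a sequence. The sketch below is a generic version in plain Python, not the notebook's custom `Attention` layer: the per-step scores are passed in as hypothetical values, whereas a real layer computes them from the inputs.

```python
import math

def attention_pool(sequence, scores):
    """Softmax the per-step scores, then return the weighted sum of the step vectors."""
    peak = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sequence[0])
    pooled = [sum(w * step[d] for w, step in zip(weights, sequence)) for d in range(dim)]
    return pooled, weights

# Three steps with two features each; equal scores reduce to a plain average.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(sequence, scores=[0.0, 0.0, 0.0])
print(pooled, weights)
```

With unequal scores the pooled vector moves toward the steps with the highest weights, which is how the layer turns a `(steps, 100)` sequence into a single `(100,)` vector.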

### Multi-Head Attention Layers
2. **Multi-Head Attention Layers**:
   - **`multi_head_attenti… (MultiHeadAttention)`** (multiple instances):
     - **Output Shape**: `(None, None, 100)`
     - **Parameters**: `40,400` each
     - **Description**: In the code, this stage is built from the custom `MultiHeadAttention_PE` layer. Each instance receives one trait-specific essay representation together with the prompt representation (`query` in the code), so the essay is related to the prompt it answers; the layers do not merge the nine attention outputs with one another. The output has `100` features, and the `None` in the second position means the sequence length is not statically known in the summary.

### LSTM Layers
3. **LSTM Layers**:
   - **`lstm_10 (LSTM)`** to **`lstm_18 (LSTM)`**:
     - **Output Shape**: `(None, None, 100)`
     - **Parameters**: `80,400` each
     - **Description**: These layers follow the multi-head attention layers and continue processing the output sequences. Similar to the previous LSTM layers, they help to maintain temporal dependencies and manage long-range contexts in the data, effectively making the model capable of understanding sequences over time.

### Summary of Model Flow
The structure of your model is quite intricate, with a series of attention mechanisms followed by LSTM layers. Here’s a simplified flow of how data moves through the model:

1. **Initial LSTM Layers**: These layers process the outputs of the first multi-head attention layers, one sentence vector at a time, and so keep the sentence order.
   
2. **Attention Layers**: Each LSTM output is passed through corresponding attention layers, which weigh the importance of various parts of the sequence. The outputs of these layers are fixed in size (100).

3. **Multi-Head Attention**: Each pooled essay representation is then passed, together with the pooled prompt representation, through a second multi-head attention stage. This makes every trait-specific essay representation prompt-aware.

4. **Final LSTM Layers**: The processed sequences from the multi-head attention layers are then fed into additional LSTM layers to refine the representation and maintain temporal dependencies.


The next part of the summary shows where the handcrafted features enter the model and how the per-trait representations are assembled. Here’s a brief overview of the components involved:

1. **Input Layers**: 
   - `linguistic_input`: Accepts a vector of 52 linguistic features per essay.
   - `readability_input`: Accepts a vector of 35 readability features per essay.

2. **LSTM Layers**: 
   - Several LSTM layers (e.g., `lstm_10`, `lstm_11`, etc.) are present, processing the sequences with attention mechanisms applied on their outputs.

3. **Attention Mechanisms**:
   - Attention layers (e.g., `attention_12`, `attention_13`, etc.) are connected to the LSTM outputs. This indicates that the model is designed to focus on different parts of the input sequences when making predictions.

4. **Concatenation Layers**:
   - Multiple concatenation layers (e.g., `concatenate`, `concatenate_1`, etc.) each merge one trait's 100-dimensional attention output with the 52 linguistic and 35 readability features, giving a 187-dimensional vector per trait.

5. **Lambda Layers**: 
   - One lambda layer stacks the nine 187-dimensional vectors into a single tensor, and one lambda layer per trait uses `tf.boolean_mask` to select the vectors of all other traits.

6. **Output Preparation**:
   - For each trait, a `layers.Attention()` call lets the target trait's vector attend to the other traits' vectors, and the result is concatenated with the target vector before the final prediction layer.

### Considerations
- **Batch Size**: The `(None, ...)` dimension in the input shape indicates that the model can handle variable batch sizes, which is standard in Keras.
- **Dimensionality**: The final concatenation layers output tensors of shape `(None, 1, 374)`. The 374 is 2 × 187: the target trait's 187-dimensional vector (100 LSTM units + 52 linguistic + 35 readability features) concatenated with the equally sized attention output over the other traits.
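
The shape arithmetic can be checked in a few lines. All values come from the summary; the variable names are chosen for this sketch.

```python
lstm_units = 100          # size of each trait's attention-pooled LSTM output
linguistic_features = 52  # width of linguistic_input
readability_features = 35 # width of readability_input

# One trait representation after concatenating the handcrafted features.
per_trait_dim = lstm_units + linguistic_features + readability_features

# Target representation concatenated with the attention output of the same width.
final_concat_dim = 2 * per_trait_dim

print(per_trait_dim, final_concat_dim)
```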
 
 
The last part of the model summary contains the prediction heads: a `Flatten` layer and a `Dense` layer per trait, followed by a single `Concatenate` layer. Here’s a breakdown:

1. **Flatten Layers**: There are nine `Flatten` layers, one per trait. Each converts the `(None, 1, 374)` output of the preceding concatenation into a tensor of shape `(None, 374)` so that it can be fed to a fully connected (Dense) layer.

2. **Dense Layers**: Each `Flatten` layer is followed by a `Dense` layer with a single sigmoid unit (shape `(None, 1)`), so every head predicts one trait score between 0 and 1.

3. **Concatenate Layer**: Finally, a `Concatenate` layer merges the outputs of the nine `Dense` layers into a single tensor of shape `(None, 9)`, one column per trait. This is the model output `y`.

4. **Total Parameters**: The model has a total of **2,552,275** parameters, all of which are trainable. This suggests a significant amount of complexity, which might lead to overfitting if the dataset is not large enough.
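
To make the per-trait heads concrete, the sketch below builds the leave-one-out masks used to separate the target trait from the others and counts the parameters of one prediction head. It is plain Python rather than Keras; `leave_one_out_masks` is a hypothetical helper, and the sizes (9 traits, 374 inputs per head) are taken from the summary.

```python
def leave_one_out_masks(num_traits):
    """For each target trait, mark every other trait True and the target False."""
    return [[i != target for i in range(num_traits)] for target in range(num_traits)]

masks = leave_one_out_masks(9)
non_target_counts = [sum(mask) for mask in masks]

# A Dense layer with one unit on 374 inputs has 374 weights and 1 bias.
dense_head_params = 374 * 1 + 1
all_heads_params = 9 * dense_head_params

print(non_target_counts)
print(dense_head_params, all_heads_params)
```

Each trait therefore attends over eight other traits, and the nine heads together account for only a small share of the total parameter count.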

