## Representation of the BERT Architecture Implementation

```mermaid
graph LR
    subgraph Input [Inputs]
    end

    subgraph Embeddings ["Input Embedding"]
        word_embeddings(["word_embeddings"]) 
        position_embeddings(["position_embeddings"])
        token_type_embeddings(["token_type_embeddings"]) 
        LayerNorm(["LayerNorm"])
        Dropout(["Dropout"])
    end

    subgraph Encoder [Encoder]
        subgraph BertLayersEncoder["12 x BertLayer"]
            subgraph BertAttention
                subgraph BertSelfAttention
                    query(["Query Linear"])
                    key(["Key Linear"])
                    value(["Value Linear"])
                    Dropout_self_attention(["Dropout Self Attention"])
                end
                subgraph BertSelfOutput
                    dense_self_output(["Linear Dense"])
                    LayerNorm_self_output(["Layer Norm"])
                    Dropout_self_output(["Dropout Self Output"])
                end
            end
            subgraph BertIntermediate
                dense_intermediate(["Linear Dense"])
                intermediate_act_fn(["GELUActivation"])
            end
            subgraph BertOutput
                dense_output(["Dense Linear Output"])
                LayerNorm_output(["LayerNorm Output"])
                Dropout_bert_output(["Dropout Output"])
            end
        end
    end

    subgraph Saídas [Outputs]
        attentions
        pooler_output
        hidden_states
        last_hidden_state
    end

    subgraph Pooler [Optional Pooler]
        dense_pooler( Linear )
        activation( Tanh )
        output( pooler_output )
    end

    %% BERT is encoder-only: the encoder feeds the outputs directly
    Input --> Embeddings
    Embeddings --> Encoder
    Encoder -.- Pooler
    Encoder --> Saídas

    word_embeddings -.- position_embeddings
    position_embeddings -.- token_type_embeddings
    token_type_embeddings -.- LayerNorm 
    LayerNorm -.- Dropout
    
    attentions -.- pooler_output
    pooler_output -.- hidden_states
    hidden_states -.- last_hidden_state    
```
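
The nesting in the diagram above can also be read as a tree of dotted submodule paths. The sketch below is illustrative only: it assumes the attribute names used by the Hugging Face Transformers `BertModel` (the source does not name the library) and a 12-layer encoder. The function `bert_module_paths` is a hypothetical helper, not part of any library.

```python
from __future__ import annotations


def bert_module_paths(num_layers: int = 12) -> list[str]:
    """List dotted submodule paths in the order the diagram nests them.

    Attribute names are an assumption based on Hugging Face's BertModel.
    """
    paths = [
        f"embeddings.{name}"
        for name in (
            "word_embeddings",
            "position_embeddings",
            "token_type_embeddings",
            "LayerNorm",
            "dropout",
        )
    ]
    # Submodules repeated inside every BertLayer.
    per_layer = (
        "attention.self.query",
        "attention.self.key",
        "attention.self.value",
        "attention.self.dropout",
        "attention.output.dense",
        "attention.output.LayerNorm",
        "attention.output.dropout",
        "intermediate.dense",
        "intermediate.intermediate_act_fn",
        "output.dense",
        "output.LayerNorm",
        "output.dropout",
    )
    for i in range(num_layers):
        paths += [f"encoder.layer.{i}.{name}" for name in per_layer]
    # The pooler is optional in the diagram; it is listed last here.
    paths += ["pooler.dense", "pooler.activation"]
    return paths


if __name__ == "__main__":
    for path in bert_module_paths(num_layers=1):
        print(path)
```

Calling `bert_module_paths(num_layers=1)` prints one layer's worth of paths, which is usually enough to compare against the diagram box by box.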

```mermaid
graph LR
    subgraph Input [Input]
    end

    subgraph Embeddings
        word_embeddings(["word_embeddings"]) 
        position_embeddings(["position_embeddings"])
        token_type_embeddings(["token_type_embeddings"]) 
        LayerNorm(["LayerNorm"])
        Dropout(["Dropout"])
    end

    subgraph Encoder [Encoder]
        BertLayers[12 x BertLayer]
    end

    subgraph Pooler [Pooler]
        dense_pooler( Linear )
        activation( Tanh )
    end

    subgraph BertLayer
        subgraph BertAttention
            subgraph BertSelfAttention
                query(["Linear"])
                key(["Linear"])
                value(["Linear"])
                Dropout_self_attention(["Dropout"])
            end

            subgraph BertSelfOutput
                dense_self_output(["Linear"])
                Dropout_self_output(["Dropout"])
                LayerNorm_self_output(["LayerNorm"])
            end
        end

        subgraph BertIntermediate
            dense_intermediate(["Linear"])
            intermediate_act_fn(( GELUActivation ))
        end

        subgraph BertOutput
            dense_output(["Linear"])
            Dropout_bert_output(["Dropout"])
            LayerNorm_output(["LayerNorm"])
        end
    end

    Input --> Embeddings
    Embeddings --> Encoder
    Encoder --> Pooler
    Pooler --> pooler_output(( pooler_output ))
    Encoder --> last_hidden_state(( last_hidden_state ))

    %% Each of the 12 stacked layers expands to one BertLayer
    BertLayers -.-> BertLayer

    %% Data flow between the blocks of a BertLayer
    BertSelfAttention --> BertSelfOutput
    BertAttention --> BertIntermediate
    BertIntermediate --> BertOutput
    %% Residual connection: the attention output is added back in BertOutput
    BertAttention -.-> BertOutput

    %% Data flow inside the blocks
    dense_self_output --> Dropout_self_output
    Dropout_self_output --> LayerNorm_self_output
    dense_intermediate --> intermediate_act_fn
    dense_output --> Dropout_bert_output
    Dropout_bert_output --> LayerNorm_output

    %% Embeddings - vertical arrangement
    word_embeddings --> position_embeddings
    position_embeddings --> token_type_embeddings
    token_type_embeddings --> LayerNorm
    LayerNorm --> Dropout

    %% BertLayer Outputs
    BertLayer --> hidden_states(( hidden_states ))
    BertAttention --> attentions(( attentions ))
```
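
Because every box in the diagram is a `Linear`, `LayerNorm`, or embedding table, the trainable parameter count follows directly from the layer sizes. The sketch below is a rough illustration, not the author's code: the default sizes (vocabulary of 30522, hidden size 768, 512 positions, 2 token types, feed-forward width 3072) are assumptions taken from the commonly used BERT-base configuration and do not appear in the diagram. Only the 12-layer depth comes from the diagram itself.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(hidden: int) -> int:
    """LayerNorm holds one scale vector and one shift vector."""
    return 2 * hidden


def count_bert_parameters(
    vocab_size: int = 30522,    # assumed: BERT-base uncased vocabulary
    hidden: int = 768,          # assumed hidden size
    max_positions: int = 512,   # assumed maximum sequence length
    type_vocab_size: int = 2,   # assumed number of token types
    intermediate: int = 3072,   # assumed feed-forward width
    num_layers: int = 12,       # "12 x BertLayer"
    with_pooler: bool = True,
) -> int:
    """Sum the trainable parameters of the blocks shown in the diagram."""
    embeddings = (
        vocab_size * hidden          # word_embeddings
        + max_positions * hidden     # position_embeddings
        + type_vocab_size * hidden   # token_type_embeddings
        + layer_norm_params(hidden)  # LayerNorm (Dropout has no parameters)
    )
    self_attention = 3 * linear_params(hidden, hidden)  # query, key, value
    self_output = linear_params(hidden, hidden) + layer_norm_params(hidden)
    bert_intermediate = linear_params(hidden, intermediate)
    bert_output = linear_params(intermediate, hidden) + layer_norm_params(hidden)
    per_layer = self_attention + self_output + bert_intermediate + bert_output
    pooler = linear_params(hidden, hidden) if with_pooler else 0
    return embeddings + num_layers * per_layer + pooler


if __name__ == "__main__":
    print(count_bert_parameters())  # 109482240 under the assumed sizes
```

Under these assumed sizes, roughly 85 million of the 109 million parameters sit in the 12 encoder layers, and most of the remainder is the word embedding table.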