# 🚀 GrammaticalBERT Training - Google Colab

This notebook trains GrammaticalBERT on Google Colab using a free GPU!

## Quick Setup:
1. **Runtime → Change runtime type → GPU (T4)**
2. Run the cells in order (Shift+Enter)
3. Allow ~20 minutes for training on SST-2

## What we will do:
- Fine-tune GrammaticalBERT on the SST-2 dataset (sentiment analysis)
- Compare it with vanilla BERT
- Measure accuracy, F1, and the reduction in hallucinations

**Dataset**: Downloads automatically (no manual preparation needed!)
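
The two classification metrics named above can be sketched in plain Python. This is a generic illustration for binary labels, not the benchmark scripts' own implementation (which is not shown in this notebook); the function names `accuracy` and `binary_f1` and the toy labels are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    # Return 0.0 when the positive class never appears in labels or predictions.
    return 2 * tp / denom if denom else 0.0


# Toy labels for illustration only (not model output).
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"F1       = {binary_f1(gold, pred):.3f}")
```

F1 is reported alongside accuracy because it stays informative when the two classes are imbalanced.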

## 1️⃣ Check the GPU

In [None]:
# Check whether a GPU is available
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU available: {gpu_name}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("❌ GPU not available! Go to Runtime → Change runtime type → GPU")

## 2️⃣ Clone the Repository

In [None]:
# Clone the repository
!git clone https://github.com/nooa-ai/nooa-transformers.git
%cd nooa-transformers/grammatical_transformers

## 3️⃣ Install Dependencies

In [None]:
# Install the package
!pip install -e . -q
!pip install datasets accelerate -q

print("✅ Installation complete!")

## 4️⃣ Quick Test (Optional)

In [None]:
# Quick test to verify that everything works
from grammatical_transformers import GrammaticalBertModel, GrammaticalBertConfig
import torch

print("🔧 Creating test model...")
config = GrammaticalBertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=4,  # Small, for a quick test
    num_attention_heads=12,
    constituency_penalty=0.5
)
model = GrammaticalBertModel(config)

# Test the forward pass
input_ids = torch.randint(0, 30522, (2, 32))
outputs = model(input_ids=input_ids)

print(f"✅ Model works! Output shape: {outputs.last_hidden_state.shape}")
print(f"   Constituency trees: {len(outputs.constituency_trees)} examples")
print(f"   Symmetry scores: {outputs.symmetry_scores.shape if outputs.symmetry_scores is not None else 'N/A'}")

## 5️⃣ Option A: Simple Fine-tuning on SST-2 (Recommended)

**SST-2**: Sentiment Analysis (positive/negative)
- 67K training examples
- Estimated time: ~20 minutes on a T4
- The dataset downloads automatically
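
To put the ~20 minute estimate in context, here is a rough back-of-the-envelope sketch of how many optimizer steps the training command in this section implies. It assumes the GLUE SST-2 training split of 67,349 examples, that the final partial batch is kept, and that the script uses no gradient accumulation; none of these are confirmed by the notebook. The helper name `training_steps` is hypothetical.

```python
import math

# Size of the GLUE SST-2 training split (the "67K examples" above).
NUM_TRAIN_EXAMPLES = 67_349
BATCH_SIZE = 32          # same value as --batch_size in the training command
EPOCHS = 3               # same value as --epochs in the training command
ESTIMATED_MINUTES = 20   # the notebook's estimate for a T4


def training_steps(num_examples: int, batch_size: int, epochs: int) -> tuple[int, int]:
    """Return (steps per epoch, total steps), counting the final partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = training_steps(NUM_TRAIN_EXAMPLES, BATCH_SIZE, EPOCHS)
# Throughput needed to finish within the estimated time.
steps_per_second = total_steps / (ESTIMATED_MINUTES * 60)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"required rate:   {steps_per_second:.2f} steps/s")
```

If your run processes noticeably fewer steps per second than this, expect training to take longer than the estimate.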

In [None]:
# Train on SST-2
!python benchmarks/glue_test.py \
  --task sst2 \
  --epochs 3 \
  --batch_size 32 \
  --learning_rate 2e-5 \
  --constituency_penalty 0.5 \
  --device cuda

print("\n✅ Training complete!")
print("📊 Check the results above (accuracy, F1 score)")

## 5️⃣ Option B: GrammaticalBERT vs. Vanilla BERT Comparison

Runs a full benchmark comparing the two models.

In [None]:
# Compare with vanilla BERT
!python benchmarks/compare_vanilla.py \
  --task sst2 \
  --batch_size 32 \
  --num_samples 1000

## 6️⃣ Hallucination Detection Test

Tests the model's ability to detect inconsistencies (hallucinations).

In [None]:
# Hallucination test
!python benchmarks/hallucination_test.py

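## 7️⃣ Interactive Use (Python)

In [None]:
# Build the model for interactive use.
# Note: as written, the model is constructed from a config rather than from a
# saved checkpoint, so its weights are presumably freshly initialized. Load your
# fine-tuned weights before relying on the predictions.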
from grammatical_transformers import (
    GrammaticalBertForSequenceClassification,
    GrammaticalBertConfig
)
from transformers import BertTokenizer
import torch

# Config
config = GrammaticalBertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_labels=2,  # SST-2: positive/negative
    constituency_penalty=0.5
)

# Model and tokenizer
model = GrammaticalBertForSequenceClassification(config)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()

print("✅ Model ready for inference!")

In [None]:
# Function to classify sentiment
def classify_sentiment(text):
    # Tokenize
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=128
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=-1)
        pred = torch.argmax(probs, dim=-1)
    
    label = "Positive" if pred.item() == 1 else "Negative"
    confidence = probs[0, pred.item()].item()
    
    print(f"\n📝 Text: {text}")
    print(f"🎯 Sentiment: {label}")
    print(f"📊 Confidence: {confidence:.2%}")
    
    # Show the constituency tree if available
    if hasattr(outputs, 'constituency_trees') and outputs.constituency_trees:
        tree = outputs.constituency_trees[0]
        print(f"🌳 Constituency Tree: {tree}")
    
    return label, confidence

# Examples
examples = [
    "This movie is absolutely amazing!",
    "I hated every minute of it.",
    "The plot was confusing but the acting was great.",
]

for text in examples:
    classify_sentiment(text)

## 8️⃣ Visualize Constituency Trees (Optional)

In [None]:
# Visualize the grammatical structure
def visualize_constituency(text):
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=128
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    if hasattr(outputs, 'constituency_trees') and outputs.constituency_trees:
        tree = outputs.constituency_trees[0]
        tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
        
        print(f"\n📝 Sentence: {text}")
        print(f"🔤 Tokens: {' '.join(tokens)}")
        print(f"🌳 Constituency Tree:\n{tree}")
    else:
        print("No constituency tree available")

# Try it out
visualize_constituency("The quick brown fox jumps over the lazy dog")

## 9️⃣ Save the Trained Model

In [None]:
# Save the model currently in memory (the `model` object from section 7)
output_dir = "./grammatical_bert_sst2"
model.save_pretrained(output_dir)
config.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"✅ Model saved to {output_dir}")
print("\n📦 To download, open Files (left sidebar) → right-click → Download")

## 🔟 Upload to Google Drive (Optional)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy the model to Drive
!cp -r ./grammatical_bert_sst2 /content/drive/MyDrive/

print("✅ Model copied to Google Drive!")
print("📁 Location: MyDrive/grammatical_bert_sst2")

## 📊 Next Steps

Now you can:

1. **Try other GLUE tasks**:
   ```python
   !python benchmarks/glue_test.py --task cola  # Grammaticality
   !python benchmarks/glue_test.py --task mnli  # Entailment
   ```

2. **Compare results**: Run `compare_vanilla.py` to see the differences

3. **Publish results**: Update `RESULTS.md` on GitHub

4. **Upload to the Hugging Face Hub**: Share the trained model

5. **Write a paper**: Document your findings
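
For step 1, a small helper can generate one command per GLUE task so the hyperparameters stay consistent across runs. This is only a sketch: it reuses the flags from the SST-2 training cell, and it is an assumption that `benchmarks/glue_test.py` accepts the same flags for other tasks or that these values suit them. The names `SHARED_FLAGS` and `build_glue_command` are hypothetical.

```python
import shlex

# Hyperparameters copied from the SST-2 training cell. Whether they are
# appropriate for other tasks is an assumption; they may need tuning.
SHARED_FLAGS = {
    "epochs": 3,
    "batch_size": 32,
    "learning_rate": "2e-5",
    "constituency_penalty": 0.5,
    "device": "cuda",
}


def build_glue_command(task: str, flags: dict = SHARED_FLAGS) -> list[str]:
    """Build the argument list for one benchmarks/glue_test.py run."""
    argv = ["python", "benchmarks/glue_test.py", "--task", task]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


# Print the commands instead of running them, so they can be reviewed first.
for task in ("sst2", "cola", "mnli"):
    print(shlex.join(build_glue_command(task)))
```

To execute a command from Python rather than pasting it into a `!` cell, the list can be passed to `subprocess.run(argv, check=True)`.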

---

**Problems?** See: https://github.com/nooa-ai/nooa-transformers/issues

**LFG!** 🚀