<a href="https://colab.research.google.com/github/munib-ehman/aura-ai-agent-engine/blob/main/DOER_MODEL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install simpletransformers**

In [1]:
!pip install -q simpletransformers

**FLAN-T5 model trained on a custom dataset**
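As a rough illustration of the data layout the training cell below expects: each line of `doer_dataset.jsonl` is one JSON object with an `input` object and an `output.workflow` array. The sketch converts one such line into the `prefix` / `input_text` / `target_text` row used for training. The keys and values inside `input` and the workflow step (`goal`, `action`, `send_email`) are hypothetical placeholders, not taken from the real dataset.

```python
import json

# Hypothetical example line; only the "input" and "output.workflow" keys
# mirror what the training code reads.
sample_line = (
    '{"input": {"goal": "email the report"}, '
    '"output": {"workflow": [{"action": "send_email"}]}}'
)

def record_to_row(line):
    """Convert one JSONL line into a T5 training row (prefix, input, target)."""
    record = json.loads(line)
    return {
        "prefix": "workflow",
        "input_text": json.dumps(record.get("input", {})),
        # Only the workflow array is used as the target, not the whole output.
        "target_text": json.dumps(record.get("output", {}).get("workflow", [])),
    }

row = record_to_row(sample_line)
print(row["input_text"])
print(row["target_text"])
```

Both text columns are JSON strings, so anything the model generates has to be parsed back with `json.loads` before it can be used as a workflow.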

In [2]:
import json
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args
import os

# --- Configuration ---
MODEL_TYPE = 't5'
MODEL_NAME = 'google/flan-t5-small'
DATASET_FILE = 'doer_dataset.jsonl'
OUTPUT_DIR = 'outputs/doer_flan_model'
TRAIN_EPOCHS = 12

def create_training_dataframe(file_path):
    """
    Loads the .jsonl file and prepares it for training.
    The target_text is ONLY the workflow array.
    """
    records = []
    print(f"Loading and preparing dataset from '{file_path}'...")
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                clean_line = line.strip()
                if clean_line:
                    records.append(json.loads(clean_line))

        print(f"Dataset loaded. Found {len(records)} training examples.")

        data_for_df = []
        for record in records:
            input_data = record.get('input', {})
            output_workflow = record.get('output', {}).get('workflow', [])

            data_for_df.append({
                "prefix": "workflow",
                "input_text": json.dumps(input_data),
                "target_text": json.dumps(output_workflow)
            })

        return pd.DataFrame(data_for_df)

    except FileNotFoundError:
        print(f"Error: Dataset file not found at '{file_path}'")
        return None
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON from the dataset file: {e}")
        return None

def train_doer_model():
    """
    Trains the T5 model on our custom dataset and saves it.
    """
    print("--- Starting Training for Custom 'Doer' Model ---")
    train_df = create_training_dataframe(DATASET_FILE)

    if train_df is None or train_df.empty:
        print("Training cannot proceed without a valid dataset.")
        return

    model_args = T5Args()
    model_args.max_seq_length = 512
    model_args.train_batch_size = 2
    model_args.eval_batch_size = 2
    model_args.num_train_epochs = TRAIN_EPOCHS
    model_args.overwrite_output_dir = True
    model_args.output_dir = OUTPUT_DIR
    # Disable step-interval checkpoints; epoch-end checkpoints are still saved.
    model_args.save_steps = -1

    # Train on a single GPU (for example, the T4 runtime in Colab).
    model_args.n_gpu = 1

    model_args.learning_rate = 1e-4
    model_args.warmup_steps = 50

    print(f"Initializing '{MODEL_NAME}' model...")
    # use_cuda is a T5Model constructor argument rather than a T5Args field,
    # so pass it explicitly instead of setting it on model_args.
    model = T5Model(MODEL_TYPE, MODEL_NAME, args=model_args, use_cuda=True)

    print("\n--- Starting Model Fine-Tuning ---")
    model.train_model(train_df)

    print("\n--- Training Complete ---")
    print(f"Your custom 'Doer' model has been saved to the '{OUTPUT_DIR}' directory.")

# Run the training function
train_doer_model()

--- Starting Training for Custom 'Doer' Model ---
Loading and preparing dataset from 'doer_dataset.jsonl'...
Dataset loaded. Found 54 training examples.
Initializing 'google/flan-t5-small' model...

--- Starting Model Fine-Tuning ---

--- Training Complete ---
Your custom 'Doer' model has been saved to the 'outputs/doer_flan_model' directory.
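The step counts behind this run can be sanity-checked with a little arithmetic. A minimal sketch, assuming one optimizer step per batch (no gradient accumulation), using the 54 examples reported in the log, the batch size of 2, and the 12 epochs configured above:

```python
import math

num_examples = 54   # "Found 54 training examples" in the log
batch_size = 2      # model_args.train_batch_size
epochs = 12         # TRAIN_EPOCHS

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
steps_after_epoch_9 = steps_per_epoch * 9

print(steps_per_epoch)      # 27
print(total_steps)          # 324
print(steps_after_epoch_9)  # 243
```

The last value matches the `checkpoint-243-epoch-9` directory that shows up when the output folder is zipped.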


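Because the model is trained to emit the workflow as a JSON string, generated text is best parsed defensively before use. A minimal sketch, assuming the model object exposes Simple Transformers' `predict(list_of_strings)` method and that prompts carry the same `workflow: ` prefix used in training. `predict_workflow` and `FakeModel` are hypothetical names used only for illustration, and the sample input is made up.

```python
import json

def predict_workflow(model, input_data):
    """Run one prediction and parse it as a JSON workflow.

    Returns the parsed list, or None if the output is not a valid JSON array.
    """
    # The prefix is prepended by hand at prediction time.
    prompt = "workflow: " + json.dumps(input_data)
    raw = model.predict([prompt])[0]
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None

class FakeModel:
    """Stand-in for a trained T5Model so the sketch runs without a GPU."""
    def __init__(self, reply):
        self.reply = reply

    def predict(self, prompts):
        return [self.reply for _ in prompts]

print(predict_workflow(FakeModel('[{"action": "send_email"}]'), {"goal": "email the report"}))
print(predict_workflow(FakeModel("not json"), {"goal": "email the report"}))
```

With the real model, `FakeModel(...)` would be replaced by something like `T5Model("t5", "outputs/doer_flan_model")`. The `None` path matters because nothing guarantees that a generated string is valid JSON.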
**Make Zip File of Trained Model**
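The `zip -r` command in the next cell archives everything under the output directory, including intermediate checkpoints with their optimizer state (`optimizer.pt`, `scheduler.pt`). If only the final model is needed, a sketch like the following skips the `checkpoint-*` directories. It assumes the final weights sit at the top level of the output directory, which is worth verifying for your run. `zip_without_checkpoints` and `doer_model_final.zip` are hypothetical names.

```python
import os
import zipfile

def zip_without_checkpoints(src_dir, zip_path):
    """Zip src_dir, skipping any 'checkpoint-*' subdirectories.

    Returns the list of archive member names that were written.
    """
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            # Prune in place so os.walk does not descend into checkpoints.
            dirs[:] = [d for d in dirs if not d.startswith("checkpoint-")]
            for name in sorted(files):
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, src_dir)
                zf.write(full_path, arcname)
                added.append(arcname)
    return added

# Only runs when the training output directory actually exists.
if os.path.isdir("outputs/doer_flan_model"):
    print(zip_without_checkpoints("outputs/doer_flan_model", "doer_model_final.zip"))
```

The archive members are stored relative to the model directory, so unzipping yields the model files directly rather than an `outputs/...` tree.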

In [3]:
!zip -r doer_model.zip outputs/doer_flan_model/

  adding: outputs/doer_flan_model/ (stored 0%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/ (stored 0%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/spiece.model (deflated 48%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/config.json (deflated 62%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/added_tokens.json (deflated 83%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/optimizer.pt (deflated 52%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/scheduler.pt (deflated 62%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/generation_config.json (deflated 29%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/tokenizer_config.json (deflated 94%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/special_tokens_map.json (deflated 85%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/model_args.json (deflated 63%)
  adding: outputs/doer_flan_model/checkpoint-243-epoch-9/model.safetensors (defla