{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "header_cell"
      },
      "source": [
        "# Recipe Model Training on Google Colab\n",
        "\n",
        "This notebook fine-tunes a Flan-T5 Small model on recipe data so that it generates ingredients and directions from a recipe title."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "setup_section"
      },
      "source": [
        "## 1. Mount Google Drive & Set Up Environment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mount_drive"
      },
      "outputs": [],
      "source": [
        "# Mount Google Drive\n",
        "from google.colab import drive\n",
        "drive.mount('/content/drive')\n",
        "\n",
        "# Change this path to match your actual folder structure\n",
        "BASE_PROJECT_PATH = \"/content/drive/MyDrive/ml_pipeline\"\n",
        "\n",
        "# Print files to verify\n",
        "!ls -la $BASE_PROJECT_PATH"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "install_requirements"
      },
      "outputs": [],
      "source": [
        "# Install required packages (sentencepiece is needed by T5Tokenizer in the test script)\n",
        "%cd $BASE_PROJECT_PATH\n",
        "!pip install transformers datasets torch sacrebleu rouge-score pandas pyyaml sentencepiece"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "create_symlink"
      },
      "outputs": [],
      "source": [
        "# Create a symlink for easier access (-n replaces an existing link instead of nesting a new one inside it)\n",
        "!ln -sfn $BASE_PROJECT_PATH /content/recipe-monk\n",
        "%cd /content/recipe-monk\n",
        "\n",
        "# Verify files are accessible\n",
        "!ls -la\n",
        "!cat params.yaml"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "data_section"
      },
      "source": [
        "## 2. Verify Raw Data & Preprocess"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "check_raw_data"
      },
      "outputs": [],
      "source": [
        "# Check raw data exists\n",
        "!ls -la data/raw/\n",
        "\n",
        "# View first few lines of the raw data\n",
        "!head -n 5 data/raw/recipes_data.csv"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "run_preprocessing"
      },
      "outputs": [],
      "source": [
        "# Run the preprocessing script\n",
        "!python scripts/preprocess.py"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "check_processed_data"
      },
      "outputs": [],
      "source": [
        "# Check processed data\n",
        "!ls -la data/processed/\n",
        "\n",
        "# View first few lines of processed data\n",
        "!head -n 5 data/processed/train.csv\n",
        "\n",
        "# Count lines in the train and test files (includes the header row; multi-line quoted fields count more than once)\n",
        "!echo \"Train file lines: $(wc -l < data/processed/train.csv)\"\n",
        "!echo \"Test file lines: $(wc -l < data/processed/test.csv)\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "debug_data_section"
      },
      "source": [
        "## 3. Debug Data Files (if needed)\n",
        "\n",
        "If dataset loading fails, use this section to identify and fix the problem."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "debug_data_loading"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Try loading the data with pandas to verify format\n",
        "try:\n",
        "    train_df = pd.read_csv('data/processed/train.csv')\n",
        "    test_df = pd.read_csv('data/processed/test.csv')\n",
        "    \n",
        "    print(f\"Train data shape: {train_df.shape}\")\n",
        "    print(f\"Test data shape: {test_df.shape}\")\n",
        "    print(\"\\nTrain data columns:\")\n",
        "    print(train_df.columns.tolist())\n",
        "    \n",
        "    # Check for required columns\n",
        "    if 'input_text' in train_df.columns and 'target_text' in train_df.columns:\n",
        "        print(\"\\n✅ Required columns 'input_text' and 'target_text' found.\")\n",
        "        \n",
        "        # Show a few examples\n",
        "        print(\"\\nExample inputs and targets:\")\n",
        "        for i in range(min(3, len(train_df))):\n",
        "            print(f\"\\nExample {i+1}:\")\n",
        "            print(f\"Input: {train_df['input_text'].iloc[i]}\")\n",
        "            # str() guards against missing (NaN) targets, which pandas loads as floats\n",
        "            print(f\"Target (first 100 chars): {str(train_df['target_text'].iloc[i])[:100]}...\")\n",
        "    else:\n",
        "        print(\"\\n❌ Required columns not found. Available columns:\")\n",
        "        print(train_df.columns.tolist())\n",
        "        \n",
        "except Exception as e:\n",
        "    print(f\"Error loading data: {e}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fix_data_if_needed"
      },
      "outputs": [],
      "source": [
        "# This cell is for fixing data issues if needed.\n",
        "# Uncomment and modify the lines below if your data needs fixes.\n",
        "\n",
        "# Example: if the column names are wrong, rename them\n",
        "# (replace 'some_other_column' and 'another_column' with your actual column names)\n",
        "# if 'input_text' not in train_df.columns and 'some_other_column' in train_df.columns:\n",
        "#     # Rename columns\n",
        "#     train_df = train_df.rename(columns={'some_other_column': 'input_text', 'another_column': 'target_text'})\n",
        "#     test_df = test_df.rename(columns={'some_other_column': 'input_text', 'another_column': 'target_text'})\n",
        "#\n",
        "#     # Save fixed data\n",
        "#     train_df.to_csv('data/processed/train_fixed.csv', index=False)\n",
        "#     test_df.to_csv('data/processed/test_fixed.csv', index=False)\n",
        "#\n",
        "#     print(\"Fixed data saved to train_fixed.csv and test_fixed.csv\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "training_section"
      },
      "source": [
        "## 4. Training the Model\n",
        "\n",
        "Now let's train the Flan-T5 Small model on our recipe data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "run_training"
      },
      "outputs": [],
      "source": [
        "# First, create the model output directory (this path should match model.output_dir in params.yaml)\n",
        "!mkdir -p models/flan_t5_small\n",
        "\n",
        "# Run the training script\n",
        "!python scripts/train_model_colab.py"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "testing_section"
      },
      "source": [
        "## 5. Testing the Trained Model\n",
        "\n",
        "Let's test our model with some example inputs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "create_test_script"
      },
      "outputs": [],
      "source": [
        "%%writefile scripts/test_model_colab.py\n",
        "import torch\n",
        "from transformers import T5Tokenizer, T5ForConditionalGeneration\n",
        "import yaml\n",
        "\n",
        "# Load parameters\n",
        "with open(\"params.yaml\", \"r\") as f:\n",
        "    params = yaml.safe_load(f)\n",
        "\n",
        "# Load model and tokenizer\n",
        "MODEL_DIR = params[\"model\"][\"output_dir\"]\n",
        "tokenizer = T5Tokenizer.from_pretrained(MODEL_DIR)\n",
        "model = T5ForConditionalGeneration.from_pretrained(MODEL_DIR)\n",
        "\n",
        "# Check for GPU availability\n",
        "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
        "model = model.to(device)\n",
        "print(f\"Using device: {device}\")\n",
        "\n",
        "# Test with sample inputs\n",
        "sample_inputs = [\n",
        "    \"Generate ingredients and directions for: Chicken Curry\",\n",
        "    \"Generate ingredients and directions for: Chocolate Cake\",\n",
        "    \"Generate ingredients and directions for: Vegetable Soup\"\n",
        "]\n",
        "\n",
        "for input_text in sample_inputs:\n",
        "    # Tokenize and generate\n",
        "    inputs = tokenizer(input_text, return_tensors=\"pt\").to(device)\n",
        "    # Unpack the encoding so input_ids and attention_mask are passed together\n",
        "    outputs = model.generate(\n",
        "        **inputs,\n",
        "        max_length=512,\n",
        "        num_beams=4,\n",
        "        early_stopping=True\n",
        "    )\n",
        "    \n",
        "    # Decode and print\n",
        "    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
        "    \n",
        "    print(f\"\\n{'='*50}\")\n",
        "    print(f\"INPUT: {input_text}\")\n",
        "    print(f\"{'='*50}\")\n",
        "    print(f\"OUTPUT:\\n{generated_text}\")\n",
        "    print(f\"{'='*50}\\n\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "run_test_script"
      },
      "outputs": [],
      "source": [
        "# Run the test script\n",
        "!python scripts/test_model_colab.py"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "saving_model_section"
      },
      "source": [
        "## 6. Saving and Downloading the Model\n",
        "\n",
        "Because the project folder lives on Google Drive, the trained model is already saved there. You can also download a zipped copy."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "download_model"
      },
      "outputs": [],
      "source": [
        "# Zip the model for easier download\n",
        "!zip -r /content/flan_t5_recipe_model.zip models/flan_t5_small/\n",
        "\n",
        "# Download the zipped model\n",
        "from google.colab import files\n",
        "files.download('/content/flan_t5_recipe_model.zip')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "troubleshooting_section"
      },
      "source": [
        "## Troubleshooting\n",
        "\n",
        "### Common Issues and Solutions\n",
        "\n",
        "1. **Memory issues**: Reduce the batch size in params.yaml, or reduce the maximum sequence length.\n",
        "2. **Dataset loading errors**: Load the CSV files with pandas and convert them to a dataset (for example with `datasets.Dataset.from_pandas`).\n",
        "3. **Training too slow**: Reduce the number of epochs, or train on a subset of the data.\n",
        "4. **Model loading errors**: Check that the model was saved correctly (look for files in models/flan_t5_small/).\n",
        "5. **Path issues**: Make sure BASE_PROJECT_PATH points to your project folder in Google Drive."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "Recipe_Model_Training.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU",
    "gpuClass": "standard"
  },
  "nbformat": 4,
  "nbformat_minor": 0
}