# 🌾 Cotton Weed Detection Challenge - Starter Notebook

## Quick Start Guide

This notebook walks you through:
1. **Environment Setup** - Install 3LC and required packages
2. **Dataset Registration** - Create 3LC Tables for data management
3. **Baseline Training** - Train YOLOv8n with run tracking
4. **Generate Predictions** - Create Kaggle submission
5. **Iterative Improvement** - Use 3LC Dashboard to improve data quality

### About 3LC (Three Lines of Code)
3LC is a data-centric AI platform that enables the **train–fix–retrain loop**:
- **Train** - Track experiments automatically
- **Analyze** - Use Dashboard to find data issues
- **Fix** - Correct labels and improve quality
- **Retrain** - Iterate with better data

**Let's begin!**

---
# Environment Setup

## Before You Begin

This notebook uses **3LC (Three Lines of Code)** for data-centric AI workflows. Follow these steps to set up your environment.

### Video Guide
Watch the full setup walkthrough: [3LC Quickstart Video](https://www.youtube.com/watch?v=zdIq1QpeSI8&list=PLFOZfHCPrAhDbgmxYcu9Qq5UUVMf7YLFy)

---

## Step 1: Create a 3LC Account

1. Go to [https://account.3lc.ai](https://account.3lc.ai)
2. Create your account
   - **Note:** A default workspace is automatically created for you (visible after login)
   - Your workspace name is what others on your team will see when collaborating
3. Get your API key from [https://account.3lc.ai/api-key](https://account.3lc.ai/api-key) and save it for the next step

   <div align="left">
   <img src="content/api.png" alt="Description" width="600">
   </div>

---

## Step 2: Set Up Python Environment (Recommended)

### Create a Virtual Environment

**Windows:**
```bash
python -m venv cotton-weed-env
cotton-weed-env\Scripts\activate
```

**Linux/MacOS:**
```bash
python -m venv cotton-weed-env
source cotton-weed-env/bin/activate
```

**Note:** You can skip this if you prefer to use your current Python environment.

---

## Step 3: Install 3LC and Dependencies

### ⚠️ Important: PyTorch GPU Setup (For GPU Training)

The install command below pulls in PyTorch, but **the default build may be CPU-only** (this is typically the case on Windows).

**If you have a GPU and want to use it for training:**
1. First, install 3LC: `pip install 3lc-ultralytics`
2. Then, **reinstall PyTorch with CUDA support:**
   ```bash
   # For CUDA 11.8
   pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
   
   # For CUDA 12.1
   pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
   ```
   Visit [PyTorch Get Started](https://pytorch.org/get-started/locally/) to find the correct command for your CUDA version.

**If you're fine with CPU training (slower but works):**
```bash
pip install 3lc-ultralytics
```

**What gets installed:**
- `3lc-ultralytics` - 3LC integration with Ultralytics YOLO
- `3lc` - Core 3LC library
- `ultralytics`, `torch`, `pandas`, `numpy`, `pillow`, `pycocotools`, and other dependencies

**System Requirements:**
- Python 3.8+ 
- Windows 10+, Linux, or macOS
- GPU with CUDA (optional but recommended for faster training)

---

## Step 4: Login to 3LC

Once installation has finished, log in from the same terminal using your API key ([create one](https://account.3lc.ai/api-key) if you don't have one yet).
Replace `<your_api_key>` with your actual API key:

```bash
3lc login <your_api_key>
```

This saves your API key locally. **For future sessions, you don't need to run this again.**

---

## Step 5: Start the 3LC Service (For Dashboard visualization)

**Important Clarification:**
- The 3LC service is **NOT required for training** - you can train models without it
- It **IS required** if you want to use the 3LC Dashboard to visualize and analyze your data/runs
- Since the Dashboard is a key part of the data-centric workflow, we recommend starting it

**To start the service, run in the terminal:**
```bash
3lc service
```
   <div align="left">
   <img src="content/service.png" alt="Description" width="600">
   </div>

**What happens:**
- Starts the local service for Dashboard connectivity
- Loads some preloaded example projects to help you learn 3LC
- These examples are useful for getting familiar with the Dashboard features

**Keep this terminal open while using the Dashboard.** To stop: Press `Q` or `Ctrl+C`.

---

## Step 6: Open 3LC Dashboard (For visualization)

Once the service is running, open [https://dashboard.3lc.ai](https://dashboard.3lc.ai) in your browser.

**Browser Requirements:**
- Chrome (recommended), Firefox, or Edge (latest versions)
- Hardware acceleration enabled (GPU) for smoother experience
  - Setup guide: [3LC GPU Acceleration Guide](https://docs.3lc.ai/3lc/latest/user-guide/dashboard/gpu-acceleration.html)

**Tip:** Explore the preloaded example projects to learn Dashboard features before working on competition data!

---

## Verification Checklist

Before running the notebook, ensure:

1. ✅ 3LC account created and API key obtained
2. ✅ Python environment activated (if using venv)
3. ✅ `3lc-ultralytics` installed
4. ✅ PyTorch GPU version installed (if you have a GPU)
5. ✅ Logged in to 3LC (`3lc login <api_key>`)
6. ✅ 3LC service running and Dashboard open

**For future sessions:** Only steps 2 and 6 are needed (activate environment and optionally start service/Dashboard).
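
Part of this checklist can be checked from Python. The sketch below is a hypothetical helper (it is not part of 3LC) that uses only the standard library. It assumes the CLI is named `3lc` and that the packages import as `tlc`, `tlc_ultralytics`, and `torch`, which are the names used elsewhere in this notebook. It cannot confirm that you are logged in or that the service is running.

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("tlc", "tlc_ultralytics", "torch"), cli="3lc"):
    """Return a dict mapping each checklist item to True (found) or False (missing)."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    # The CLI is only on PATH when the environment that installed it is active.
    report[f"cli:{cli}"] = shutil.which(cli) is not None
    for name in modules:
        # find_spec locates a top-level module without importing it.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {item}")
```

A `MISSING` entry for the CLI or a module usually points back to step 2 or 3 of the checklist.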

---

## Additional Resources

- [3LC Documentation](https://docs.3lc.ai/)
- [3LC Example Notebooks](https://github.com/3lc-ai/3lc-examples?tab=readme-ov-file)
- [Getting Started Video Playlist](https://www.youtube.com/watch?v=zdIq1QpeSI8&list=PLFOZfHCPrAhDbgmxYcu9Qq5UUVMf7YLFy)

---

## Troubleshooting

**"3lc: command not found"**
→ Activate your Python environment or reinstall 3lc

**"API key invalid"**
→ Check your API key at [https://account.3lc.ai/api-key](https://account.3lc.ai/api-key)

**"No GPU detected" during training**
→ Reinstall PyTorch with CUDA support (see Step 3)

**"Cannot connect to Dashboard"**
→ Make sure `3lc service` is running in a terminal


**Ready? Let's begin!**


---
## Phase 1: Environment Setup & Dataset Registration

First, let's verify our environment and register the dataset with 3LC Tables.

In [None]:
# Import required packages
import torch
import tlc
from pathlib import Path
import pandas as pd
from IPython.display import display

# Check environment
print("Environment Check:")
print("=" * 50)
print(f"PyTorch version: {torch.__version__}")
print(f"3LC version: {tlc.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(
        f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB"
    )
else:
    print("!!! No GPU detected - training will be slower on CPU")

print("\n All systems ready! Let's begin.")

---
## Step 1: Dataset Configuration

The competition dataset is organized in YOLO format with train/val/test splits already prepared.

### Dataset Structure:
```
cotton_weed_dataset/
├── train/images/        # 542 training images
├── train/labels/        # 542 YOLO label files  
├── val/images/          # 133 validation images
├── val/labels/          # 133 YOLO label files
├── test/images/         # 170 test images (no labels provided)
└── dataset.yaml         # YOLO dataset configuration
```
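
Given this layout, a quick way to spot structural problems is to compare file stems between `images/` and `labels/`. The following is a minimal sketch under the assumption that images use the `.jpg` extension (as in the counting cell below); `find_unpaired` is an illustrative name, not a 3LC or Ultralytics function.

```python
from pathlib import Path


def find_unpaired(split_dir, image_ext=".jpg"):
    """Return (images_without_labels, labels_without_images) as sorted lists of stems."""
    split_dir = Path(split_dir)
    image_stems = {p.stem for p in (split_dir / "images").glob(f"*{image_ext}")}
    label_stems = {p.stem for p in (split_dir / "labels").glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


for split in ("train", "val"):
    no_label, no_image = find_unpaired(Path(".") / split)
    print(f"{split}: {len(no_label)} images without labels, {len(no_image)} labels without images")
```

An image without a label file is not necessarily an error (it may simply contain no weeds), so treat the output as a list of files worth a second look rather than a list of defects.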

### Test Set Information:
The test set contains **170 images** split as follows:
- **Public leaderboard**: 85 images (50%)
- **Private leaderboard**: 85 images (50%)

**Note**: All 170 test images are included in the download for convenience. The public/private split only determines which images count toward each leaderboard score.

### YOLO Label Format:
Each `.txt` file contains bounding boxes: `class_id x_center y_center width height`  
All coordinates are normalized to [0, 1]
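
To make the format concrete, here is a small sketch that converts one label line into pixel corner coordinates for an image of a given size. The function name is illustrative; the arithmetic mirrors the conversion used later in the visualization cell.

```python
def yolo_line_to_pixel_box(line, image_width, image_height):
    """Convert 'class_id x_center y_center width height' to (class_id, x1, y1, x2, y2)."""
    parts = line.split()
    class_id = int(parts[0])
    x_center, y_center, box_w, box_h = map(float, parts[1:5])
    # Normalized center/size -> pixel coordinates of the top-left and bottom-right corners.
    x1 = int((x_center - box_w / 2) * image_width)
    y1 = int((y_center - box_h / 2) * image_height)
    x2 = int((x_center + box_w / 2) * image_width)
    y2 = int((y_center + box_h / 2) * image_height)
    return class_id, x1, y1, x2, y2


print(yolo_line_to_pixel_box("2 0.5 0.5 0.25 0.5", 640, 480))  # (2, 240, 120, 400, 360)
```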

### ⚠️ Data Quality Note:
This dataset includes labeling imperfections. You'll need to identify and fix these issues to maximize performance.
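
Some imperfections are purely mechanical and can be caught before training. The sketch below is a hypothetical format check for a single label line, assuming the three classes (IDs 0 to 2) described in this notebook. It does not detect wrong classes or missing boxes; those need visual review in the Dashboard.

```python
def check_label_line(line, num_classes=3):
    """Return a list of format problems found in one YOLO label line (empty if none)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        x_center, y_center, width, height = map(float, parts[1:])
    except ValueError:
        return ["could not parse fields as one int and four floats"]
    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class_id {class_id} out of range")
    if not all(0.0 <= v <= 1.0 for v in (x_center, y_center, width, height)):
        problems.append("coordinate outside [0, 1]")
    if width <= 0 or height <= 0:
        problems.append("box has zero or negative size")
    return problems


print(check_label_line("1 0.5 0.5 0.2 0.2"))  # []
print(check_label_line("7 0.5 1.5 0.2 0.2"))
```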

In [None]:
# Set up file paths
WORK_DIR = Path(".")  # Current directory
DATASET_YAML = WORK_DIR / "dataset.yaml"

# Verify paths exist
print("Verifying dataset structure...")
print("=" * 50)

if not DATASET_YAML.exists():
    print(f"Could not find {DATASET_YAML}")
    print(f"Current directory: {Path.cwd()}")
    print("Please make sure dataset.yaml is in the current directory")
    raise FileNotFoundError(f"Dataset config not found: {DATASET_YAML}")

print(f"✅ Dataset config: {DATASET_YAML}")
print(f"✅ Working directory: {WORK_DIR.resolve()}")

# Display dataset configuration
print("\n Dataset Configuration:")
print("-" * 50)
with open(DATASET_YAML, "r") as f:
    config_content = f.read()
    print(config_content)

# Count dataset files
train_images = list((WORK_DIR / "train" / "images").glob("*.jpg"))
train_labels = list((WORK_DIR / "train" / "labels").glob("*.txt"))
val_images = list((WORK_DIR / "val" / "images").glob("*.jpg"))
val_labels = list((WORK_DIR / "val" / "labels").glob("*.txt"))
test_images = list((WORK_DIR / "test" / "images").glob("*.jpg"))

print("\n Dataset Statistics:")
print("-" * 50)
print(f"✅ Training:   {len(train_images)} images, {len(train_labels)} labels")
print(f"✅ Validation: {len(val_images)} images, {len(val_labels)} labels")
print(f"✅ Test: {len(test_images)} images")

---
## Visual Guide: Target Weed Species

Before we dive into model development, let's familiarize ourselves with the three target weed species. Understanding their visual characteristics is crucial for effective model training and data quality assessment.

### 🌿 The Three Target Weeds:

**Class 0: Carpetweed (*Mollugo verticillata*)**
- Mat-forming low-growing weed
- Small, spoon-shaped leaves arranged in whorls
- Forms dense ground cover competing with cotton seedlings
- Light green color, spreads horizontally

**Class 1: Morning Glory (*Ipomoea* species)**
- Climbing/twining vine that wraps around cotton plants
- Heart-shaped or lobed leaves
- Major yield impact - strangles cotton plants
- Can have purple, white, or pink flowers

**Class 2: Palmer Amaranth (*Amaranthus palmeri*)**
- Tall, upright, fast-growing "super weed"
- Lance-shaped leaves with prominent veins
- Herbicide-resistant strain causing major agricultural problems
- Reddish stems, can grow several feet tall

**Why Visual Familiarity Matters:**
- Helps identify mislabeled samples during error analysis
- Enables better understanding of class confusion patterns
- Assists in recognizing missing annotations
- Improves data quality decisions in the train-fix-retrain loop

Let's view example images from our dataset to see what these weeds actually look like!


In [None]:
# Example images for each weed class
import cv2
from pathlib import Path
import matplotlib.pyplot as plt
from collections import defaultdict

print("Finding example images for each weed class...")
print("=" * 70)

# Set up paths
TRAIN_IMAGES = WORK_DIR / "train" / "images"
TRAIN_LABELS = WORK_DIR / "train" / "labels"
CLASS_NAMES = ["Carpetweed", "Morning Glory", "Palmer Amaranth"]

# Find images containing each class
class_examples = defaultdict(list)

for label_file in TRAIN_LABELS.glob("*.txt"):
    if label_file.stat().st_size > 0:
        with open(label_file, "r") as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) >= 5:
                    class_id = int(parts[0])
                    image_file = TRAIN_IMAGES / f"{label_file.stem}.jpg"
                    if image_file.exists():
                        class_examples[class_id].append(image_file)

# Select one clear example per class (first occurrence)
examples_to_show = {}
for class_id in range(len(CLASS_NAMES)):
    if class_examples[class_id]:
        examples_to_show[class_id] = class_examples[class_id][0]
        print(
            f"✓ Found example for {CLASS_NAMES[class_id]}: {examples_to_show[class_id].name}"
        )
    else:
        print(f"!!!  No examples found for {CLASS_NAMES[class_id]}")

# Display the examples
if examples_to_show:
    print("\n" + "=" * 70)
    print("Displaying example images with bounding boxes...")
    print("=" * 70)

    fig, axes = plt.subplots(1, len(examples_to_show), figsize=(15, 5))
    if len(examples_to_show) == 1:
        axes = [axes]

    # RGB colors: the image is converted to RGB before the boxes are drawn
    colors = [(0, 255, 0), (255, 0, 0), (0, 0, 255)]

    for idx, (class_id, image_path) in enumerate(sorted(examples_to_show.items())):
        # Read image
        img = cv2.imread(str(image_path))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        h, w = img.shape[:2]

        # Read corresponding label
        label_file = TRAIN_LABELS / f"{image_path.stem}.txt"
        with open(label_file, "r") as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) >= 5:
                    cls = int(parts[0])
                    if cls == class_id:  # Only draw boxes for the target class
                        # Convert YOLO format to pixel coordinates
                        x_center, y_center, box_w, box_h = map(float, parts[1:5])
                        x1 = int((x_center - box_w / 2) * w)
                        y1 = int((y_center - box_h / 2) * h)
                        x2 = int((x_center + box_w / 2) * w)
                        y2 = int((y_center + box_h / 2) * h)

                        # Draw bounding box
                        color = colors[class_id]
                        cv2.rectangle(img, (x1, y1), (x2, y2), color, 3)

                        # Add class label
                        label_text = f"{CLASS_NAMES[class_id]}"
                        cv2.putText(
                            img,
                            label_text,
                            (x1, y1 - 10),
                            cv2.FONT_HERSHEY_SIMPLEX,
                            0.8,
                            color,
                            2,
                        )

        # Display
        axes[idx].imshow(img)
        axes[idx].set_title(
            f"Class {class_id}: {CLASS_NAMES[class_id]}", fontsize=12, fontweight="bold"
        )
        axes[idx].axis("off")

    plt.tight_layout()
    plt.show()

    print("\n✅ Example images displayed!")
    print("\n Pro Tip: Keep these visual characteristics in mind when:")
    print("   • Analyzing model predictions in the 3LC Dashboard")
    print("   • Identifying mislabeled or missing annotations")
    print("   • Understanding class confusion patterns")

else:
    print("\n⚠️  Could not find example images for visualization")

---
## Step 2: Introduction to 3LC - Data-Centric AI Platform

### What is 3LC?
**3LC (Three Lines of Code)** is your toolkit for data-centric AI. It enables the critical **train–fix–retrain loop**:

#### 📊 Tables - Dataset Registration & Analysis
- Register your datasets with structured metadata
- Automatic quality analysis and statistics
- Visual exploration in browser dashboard
- Track data versions and changes

#### 🏃 Runs - Experiment Tracking & Model Feedback
- Automatically track all training experiments
- **Analyze model errors to identify data problems**
- Compare runs to find what actually improves performance
- Export predictions for detailed failure analysis

### The Train–Fix–Retrain Workflow

**Traditional Approach (Doesn't Work Here):**
- ❌ Train bigger models → **Not allowed (hardware constraints)**
- ❌ Ensemble multiple models → **Prohibited by rules**
- ❌ Use Test-Time Augmentation (TTA) → **Prohibited (slows inference and violates edge device constraints)**

**Data-Centric Approach (The Solution):**
1. ✅ **Train** baseline model with 3LC Run tracking
2. ✅ **Analyze** errors using 3LC Dashboard  
3. ✅ **Fix** data issues (labels, annotations, missing weeds)
4. ✅ **Retrain** with improved data
5. ✅ **Repeat** - Continuous improvement loop

### Why This Matters
In production AI, model feedback reveals data problems you'd never find manually. 3LC makes this systematic and reproducible - exactly how production teams work.

**This is the core skill of this competition!** 

---
## Step 3: Register Your Dataset with 3LC

Now we'll register your dataset with 3LC, creating **Tables** that track your data.

### What's a 3LC Table?
A Table is like a smart spreadsheet for your dataset - it tracks images, labels, and metadata, enabling:
- Visual exploration in the Dashboard
- Data versioning (track edits over time)
- Integration with training for automatic error analysis

**⚠️ Important:** You only need to run the cell below **once** to register your dataset. Rerunning it is harmless (it detects existing tables and reuses them), and you can skip it when retraining.

In [None]:
# ============================================================================
# Create 3LC Tables from YOLO Format Dataset
# ⚠️ RUN THIS CELL ONLY ONCE (Initial Setup)
# ============================================================================
# This cell registers your dataset with 3LC for version control and analysis.
#
# ⚠️ IMPORTANT FOR RETRAINING:
#    - First time: Run this cell to create tables
#    - Retraining: SKIP this cell and go directly to the next cell
#                  (it loads tables independently without needing this)

# Import required packages
import tlc
from pathlib import Path

# Define constants for 3LC registration
PROJECT_NAME = "kaggle_cotton_weed_detection"
DATASET_NAME = "cotton_weed_det3"
WORK_DIR = Path(".")
DATASET_YAML = WORK_DIR / "dataset.yaml"

print("=" * 70)
print("DATA REGISTRATION")
print("=" * 70)

# ============================================================================
# IDEMPOTENCY CHECK - Safe to run multiple times
# ============================================================================
try:
    # Check if tables already exist
    existing_train = tlc.Table.from_names(
        project_name=PROJECT_NAME,
        dataset_name=DATASET_NAME,
        table_name=f"{DATASET_NAME}-train1",
    )
    existing_val = tlc.Table.from_names(
        project_name=PROJECT_NAME,
        dataset_name=DATASET_NAME,
        table_name=f"{DATASET_NAME}-val1",
    )

    print("\n⚠️  Tables already exist!")
    print(f" Training: {len(existing_train)} samples")
    print(f" Validation: {len(existing_val)} samples")
    print("\n✅ Using existing tables (no duplicates created)")
    print(" This cell is safe to run multiple times!")

    # Set variables for compatibility
    train_table = existing_train
    val_table = existing_val

except Exception:
    # Tables don't exist, create them
    print("\n✅ No existing tables - creating new ones...")

    # Create training table
    print("\n Creating training table...")
    train_table = tlc.Table.from_yolo(
        dataset_yaml_file=str(DATASET_YAML),
        split="train",
        task="detect",
        dataset_name=DATASET_NAME,
        project_name=PROJECT_NAME,
        table_name=f"{DATASET_NAME}-train1",
    )

    # Create validation table
    print(" Creating validation table...")
    val_table = tlc.Table.from_yolo(
        dataset_yaml_file=str(DATASET_YAML),
        split="val",
        task="detect",
        dataset_name=DATASET_NAME,
        project_name=PROJECT_NAME,
        table_name=f"{DATASET_NAME}-val1",
    )

# Display registration results (tables were either reused or newly created)
print("\n✅ Tables ready!")
print("=" * 70)
print("\n Training Table:")
print(f"   Samples: {len(train_table)}")
print(f"   URL: {train_table.url}")

print("\n Validation Table:")
print(f"   Samples: {len(val_table)}")
print(f"   URL: {val_table.url}")

print("\n" + "=" * 70)
print("✅ Phase 1 Complete: Dataset Registered with 3LC!")
print("=" * 70)

print("\n Next Steps:")
print("  (Optional) Explore tables in Dashboard: https://dashboard.3lc.ai")

---
## Explore Your Data in the Dashboard 

Before training, you can explore your dataset visually:

1. Open Dashboard: [https://dashboard.3lc.ai](https://dashboard.3lc.ai)
2. Navigate to your project: `kaggle_cotton_weed_detection`
3. Click on a table to view images, annotations, and statistics
4. To overlay bounding boxes on your images, click the `IMAGE` column, Ctrl-click the `BBOX` column, and press `2`. The result is a 2D chart that looks something like this:

<div align="left">
  <img src="content/dashboard.png" alt="Description" width="600">
</div>

**This is optional during the first iteration** - you can skip ahead to training and come back to the Dashboard later for error analysis.


---
## Step 4: Train Your Baseline Model

Time to train your first model! We'll use YOLOv8n with 3LC tracking.

**The workflow:**
1. **Load tables** - Get your registered data 

    1.1 To load tables via URL, go to the Tables tab, hover over the table, and click the icon in the URL column (shown in the screenshot below) to copy the table URL to the clipboard:
![3LC Dashboard](content/table_url.png)


2. **Configure training** - Set epochs, batch size, run name
3. **Train** - YOLOv8n trains with automatic tracking

**For retraining later:** Just rerun the training cells below. They automatically load the latest table version (including any edits you make in the Dashboard).
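
Runs are easier to tell apart in the Dashboard when each retraining iteration gets its own run name (the `RUN_NAME` constant in the configuration cell). A minimal sketch of a hypothetical helper, not part of 3LC or Ultralytics, that appends a timestamp so names are unlikely to collide:

```python
from datetime import datetime


def make_run_name(prefix="yolov8n", note="baseline", now=None):
    """Build a run name such as 'yolov8n_baseline_20250102_030405'."""
    now = now or datetime.now()
    return f"{prefix}_{note}_{now:%Y%m%d_%H%M%S}"


print(make_run_name(note="fixed_labels"))
```

You could then assign the result to `RUN_NAME` before training; a short `note` describing what changed makes later comparisons easier.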


---
### Competition Rules Reminder:
- ✅ **YOLOv8n only** (3M parameters, 6MB)
- ✅ **640 input size** (fixed)
- ✅ **No ensembles**
- ✅ **Hyperparameter tuning allowed**

In [None]:
# ============================================================================
# Load Tables + Configure Training
# ============================================================================
# This cell:
#   1. Loads your registered tables (includes any Dashboard edits)
#   2. Sets up training configuration (RUN_NAME, EPOCHS, etc.)
#
# For retraining: Just modify RUN_NAME/EPOCHS and rerun this + next cell!

# Import required packages
import tlc
from tlc_ultralytics import YOLO, Settings

# ============================================================================
# STEP 1: Load Tables for Training
# ============================================================================
# Define 3LC project constants
PROJECT_NAME = "kaggle_cotton_weed_detection"
DATASET_NAME = "cotton_weed_det3"

print("=" * 70)
print("LOADING TABLES FOR TRAINING")
print("=" * 70)

try:
    # ========================================================================
    # OPTION 1: Load by Name (Recommended - Automatic Latest Version)
    # ========================================================================
    # This automatically loads the latest table version (includes Dashboard edits)

    train_table_latest = tlc.Table.from_names(
        project_name=PROJECT_NAME,
        dataset_name=DATASET_NAME,
        table_name=f"{DATASET_NAME}-train1",
    ).latest()

    val_table_latest = tlc.Table.from_names(
        project_name=PROJECT_NAME,
        dataset_name=DATASET_NAME,
        table_name=f"{DATASET_NAME}-val1",
    ).latest()

    print(
        f"\n✅ Training table loaded: {len(train_table_latest)} samples (latest version)"
    )
    print(
        f"✅ Validation table loaded: {len(val_table_latest)} samples (latest version)"
    )

    # Prepare tables dictionary for training
    tables = {"train": train_table_latest, "val": val_table_latest}

    # ========================================================================
    # OPTION 2: Load by URL (Alternative - Specific Table Version)
    # ========================================================================
    # Comment above and Uncomment below to load specific table URLs from Dashboard instead
    # Use this when you want a specific edited table version, not the latest

    """
    # Get URLs from Dashboard: Click on the Tables tab → Copy URL from the specific table info panel to clipboard
    TRAIN_TABLE_URL = "paste_your_train_table_url_here"
    VAL_TABLE_URL = "paste_your_val_table_url_here"
    
    train_table_latest = tlc.Table.from_url(TRAIN_TABLE_URL)
    val_table_latest = tlc.Table.from_url(VAL_TABLE_URL)
    
    tables = {"train": train_table_latest, "val": val_table_latest}
    
    print(f"\n✅ Training table loaded from URL: {len(tables['train'])} samples")
    print(f"✅ Validation table loaded from URL: {len(tables['val'])} samples")
    """

    print("\n" + "=" * 70)
    print("✅ Tables Ready!")
    print("=" * 70)

except Exception as e:
    print(f"\n Error loading tables: {e}")
    print("\n💡 Troubleshooting:")
    print("   1. Make sure you ran Data Registration Cell at least once")
    print("   2. Check that PROJECT_NAME and DATASET_NAME match your setup")
    print("   3. Verify tables exist in Dashboard: https://dashboard.3lc.ai")
    raise

# ============================================================================
# STEP 2: Training Configuration
# ============================================================================

print("\n" + "=" * 70)
print("YOLOV8N TRAINING WITH 3LC TRACKING")
print("=" * 70)

# ============================================================================
# TRAINING CONSTANTS - Change these for each iteration
# ============================================================================
RUN_NAME = "yolov8n_baseline"  # Change for each run (e.g., "v2_fixed_labels")
RUN_DESCRIPTION = "Baseline YOLOv8n with default hyperparameters"

# Hyperparameters (customize these!)
EPOCHS = 5  # Number of training epochs
BATCH_SIZE = 16  # Batch size (adjust based on GPU memory)
IMAGE_SIZE = 640  # Input image size (FIXED by competition rules)
DEVICE = 0  # GPU device (0 for first GPU); set to "cpu" if no CUDA GPU is available
WORKERS = 4  # Number of dataloader workers

# Display configuration
print("\n Training Configuration:")
print(f"   Run name: {RUN_NAME}")
print("   Model: YOLOv8n (ONLY model allowed)")
print(f"   Epochs: {EPOCHS}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Image size: {IMAGE_SIZE} (FIXED)")
print(f"   Device: GPU {DEVICE}" if DEVICE != "cpu" else "   Device: CPU")

# Display dataset info (already loaded in STEP 1 above)
print("\n Dataset:")
print(f"   Training: {len(tables['train'])} samples")
print(f"   Validation: {len(tables['val'])} samples")

# Create 3LC Settings for run tracking
settings = Settings(
    project_name=PROJECT_NAME,
    run_name=RUN_NAME,
    run_description=RUN_DESCRIPTION,
    image_embeddings_dim=2,
)

print("\n" + "=" * 70)
print("✅ CONFIGURATION COMPLETE!")
print("=" * 70)

print("\n💡 Configuration Summary:")
print(f"   • Tables loaded: {len(tables['train'])} train, {len(tables['val'])} val")
print(f"   • Run name: {RUN_NAME}")
print(f"   • Training for: {EPOCHS} epochs")
print(f"   • Batch size: {BATCH_SIZE}")
print(f"   • Device: GPU {DEVICE}" if DEVICE != "cpu" else "   • Device: CPU")

print("\n Next: Run the cell below to start training!")
print("   (Review the configuration above before proceeding)")

In [None]:
# ============================================================================
# Train the Model
# ============================================================================
# This cell loads YOLOv8n and starts training.
# Make sure you ran the cell above first!


print("=" * 70)
print("STARTING TRAINING")
print("=" * 70)

# Load YOLOv8n pretrained model
print("\nLoading YOLOv8n pretrained weights...")
model = YOLO("yolov8n.pt")
print("✅ Model loaded (3M parameters, 6MB size)")

# Train the model with 3LC tracking
print("\n Training in progress...")
print("=" * 70)

results = model.train(
    tables=tables,  # Use 3LC Tables
    name=RUN_NAME,  # Name for saving results (creates runs/detect/{RUN_NAME}/)
    epochs=EPOCHS,
    imgsz=IMAGE_SIZE,
    batch=BATCH_SIZE,
    device=DEVICE,
    workers=WORKERS,
    settings=settings,  # 3LC tracking
    val=True,  # Validate during training
    # AUGMENTATION - Uncomment for better performance in later iterations:
    # mosaic=1.0,              # Mosaic augmentation - helps with scale variation
    # copy_paste=0.1,          # Copy-paste - helps with occlusion
    # mixup=0.05,              # Mixup - improves generalization
    # patience=20,             # Early stopping patience
)

print("\n" + "=" * 70)
print("✅ TRAINING COMPLETE!")
print("=" * 70)

print("\n📁 Model Weights Saved:")
print(f"   Best model: runs/detect/{RUN_NAME}/weights/best.pt")
print(f"   Last model: runs/detect/{RUN_NAME}/weights/last.pt")
print("\n Use 'best.pt' for predictions and submissions (highest validation mAP)")

print("\n Next Steps:")
print("   1. Visit 3LC Dashboard: https://dashboard.3lc.ai/")
print("   2. Open your Run to analyze model errors")
print("   3. Identify data issues:")
print("      • False negatives (missed detections)")
print("      • False positives (incorrect predictions)")
print("      • Class confusion")
print("      • Poor localization")
print("   4. Fix data issues in Dashboard")
print("   5. Retrain with improved data!")
print(
    "\nLearn more: https://docs.3lc.ai/3lc/latest/how-to/basics/open-project-table-run.html"
)

---
## What's Next? The Improvement Loop

**Congratulations!** Your baseline model is trained. Now comes the data-centric improvement cycle:

1. **Analyze** - Open Dashboard ([https://dashboard.3lc.ai/](https://dashboard.3lc.ai/)) to see your training run and identify errors
2. **Fix** - Edit problematic samples directly in Dashboard (fix labels, remove bad images)
3. **Retrain** - Rerun the two cells above - they automatically load your edits
4. **Compare** - Check if your mAP improved
5. **Repeat** - Keep iterating!

**💡 Tip:** Each time you edit data in the Dashboard, just rerun the training cells. The `.latest()` method automatically picks up your changes.
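
For the **Compare** step, the Dashboard is the main tool, but a quick numeric check is also possible from the notebook. The sketch below assumes that each run folder (`runs/detect/<RUN_NAME>/`) contains an Ultralytics-style `results.csv` with a `metrics/mAP50(B)` column. Both the file name and the column name are assumptions about your installed Ultralytics version, so adjust them if your file differs, and use the column that matches the competition metric.

```python
import csv
from pathlib import Path


def best_map50(results_csv, column="metrics/mAP50(B)"):
    """Return the highest value of `column` in a results.csv file, or None if absent."""
    values = []
    with open(results_csv, newline="") as f:
        for row in csv.DictReader(f):
            for key, value in row.items():
                # Some versions pad header names with spaces, so compare stripped keys.
                if key is not None and key.strip() == column and value and value.strip():
                    values.append(float(value))
    return max(values) if values else None


# Example run names; replace them with the RUN_NAME values you actually used.
for run in ("yolov8n_baseline", "v2_fixed_labels"):
    path = Path("runs/detect") / run / "results.csv"
    if path.exists():
        print(run, best_map50(path))
```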


---
## Step 5.5: Load Model Weights (Optional)

**IMPORTANT:** This cell allows you to load trained weights instead of using the model from the previous training session.

### When to use this:
- ✅ Loading a previously trained model from the `runs` folder
- ✅ Continuing work in a new notebook session
- ✅ Testing predictions from your best training run
- ✅ Comparing different model versions

### Options:
1. **Use current model**: Skip this cell if you just trained a model above
2. **Load latest weights**: Automatically finds the most recent training run
3. **Load specific weights**: Provide a custom path to your best model

### Where are model weights saved?
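After training, Ultralytics saves weights in the run's folder. The training cell above passes `name=RUN_NAME`, so the path is `runs/detect/<RUN_NAME>/weights/` (or `runs/detect/train/weights/` when no name is given):
```
runs/detect/<RUN_NAME>/weights/
├── best.pt      # Best checkpoint (highest mAP)
└── last.pt      # Last epoch checkpoint
```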

**Pro tip:** Always use `best.pt` for final submissions!
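
Because the training cell passes `name=RUN_NAME`, run folders are not necessarily called `train`. A minimal sketch, assuming the `runs/detect/<run>/weights/best.pt` layout described above, that finds the most recently modified `best.pt` regardless of the run name (`latest_best_weights` is an illustrative helper, not a library function):

```python
from pathlib import Path


def latest_best_weights(runs_dir="runs/detect"):
    """Return the most recently modified best.pt under runs_dir, or None if there is none."""
    candidates = list(Path(runs_dir).glob("*/weights/best.pt"))
    return max(candidates, key=lambda p: p.stat().st_mtime) if candidates else None


print(latest_best_weights())
```

Note that "most recent" is not the same as "best scoring"; if an older run had a higher validation mAP, load that run's weights explicitly.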


In [None]:
# ============================================================================
# OPTIONAL: Load Model Weights from Previous Training
# ============================================================================
# Uncomment ONE of the options below to load weights

# OPTION 1: Use the model from the training cell above (DEFAULT)
# If you just ran the training cell, the 'model' variable is already loaded
# → No action needed, skip this cell!
"""
print("Current model status:")
try:
    print(f"✅ Model loaded: {type(model).__name__}")
    print(f"  Using model from training session above")
except NameError:
    print("  No model found from training session")
    print("  You must load weights using one of the options below!")
"""
# OPTION 2: Load the LATEST trained model from runs folder
# Uncomment the ENTIRE block below to auto-load the most recent training run

"""
from tlc_ultralytics import YOLO
from pathlib import Path

# Find the most recent training run
runs_dir = Path("runs/detect")
if runs_dir.exists():
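    # Runs trained with name=RUN_NAME are not named "train*", so scan every run folder
    train_dirs = sorted((d for d in runs_dir.iterdir() if d.is_dir()), key=lambda x: x.stat().st_mtime, reverse=True)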
    if train_dirs:
        latest_weights = train_dirs[0] / "weights" / "best.pt"
        if latest_weights.exists():
            print(f"\nLoading latest model: {latest_weights}")
            print(f"Using weights from Ultralytics run folder: {train_dirs[0]}")
            model = YOLO(str(latest_weights))
            print("✅ Model loaded successfully!")
        else:
            print(f"!!! Weights not found: {latest_weights}")
    else:
        print("!!! No training runs found in runs/detect/")
else:
    print("!!! runs/detect/ directory not found")
"""

# By default, use the model from the training cell above
print("✓ Using model from the training cell above")
print("  (To load different weights, uncomment one of the options above)")


# OPTION 3: Load SPECIFIC weights (custom path)
# Replace the path with your best model
"""
from tlc_ultralytics import YOLO

# Example paths:
# - "runs/detect/train/weights/best.pt"           # First training run
# - "runs/detect/train2/weights/best.pt"          # Second training run
# - "runs/detect/yolov8n_v3/weights/best.pt"      # Named run

CUSTOM_WEIGHTS_PATH = "runs/detect/train/weights/best.pt"

print(f"\nLoading custom weights: {CUSTOM_WEIGHTS_PATH}")
model = YOLO(CUSTOM_WEIGHTS_PATH)
print("✅ Model loaded successfully!")
"""

# OPTION 4: Load pretrained YOLOv8n (no custom training)
# Use this if you want to test the baseline pretrained model
"""
from tlc_ultralytics import YOLO

print("\nLoading pretrained YOLOv8n (COCO weights)")
model = YOLO("yolov8n.pt")
print("✅ Model loaded successfully!")
print("!!!  Note: This is the pretrained model, not trained on cotton weeds!")
"""

print("\n" + "=" * 70)
print("Ready to generate predictions!")
print("=" * 70)

---
## Step 6: Generate Test Predictions

Generate predictions on the test set for Kaggle submission.

In [None]:
# Import required packages
from pathlib import Path
import shutil

# Define paths and constants
WORK_DIR = Path(".")
TEST_DIR = WORK_DIR / "test" / "images"
PRED_DIR = Path("predictions")
IMAGE_SIZE = 640  # Competition requirement

# Get list of test images
test_images = list(TEST_DIR.glob("*.jpg"))

# ============================================================================
# SAFER FILE MANAGEMENT - Backup instead of delete
# ============================================================================
if PRED_DIR.exists():
    from datetime import datetime

    # Create timestamped backup instead of deleting
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = Path(f"predictions_backup_{timestamp}")

    print("⚠️  Predictions folder exists. Creating backup...")
    print(f"   Moving to: {backup_dir}")

    shutil.move(str(PRED_DIR), str(backup_dir))

    print(f"✅ Previous predictions backed up: {backup_dir}")
    print("   (Delete old backups manually if not needed)")
else:
    print("✅ No existing predictions found")

print("Generating predictions on test set...")
print("=" * 50)
print(f"Test images: {TEST_DIR}")
print(f"Output directory: {PRED_DIR}")
print(f"Test set size: {len(test_images)} images")

# Run inference
print("\nRunning inference...")
test_results = model.predict(
    source=str(TEST_DIR),
    save=False,  # Don't save annotated images (faster, prevents duplication)
    save_txt=True,  # Save YOLO format predictions
    save_conf=True,  # Include confidence scores
    conf=0,  # Confidence threshold (adjust as needed)
    imgsz=IMAGE_SIZE,
    project=str(PRED_DIR.parent),
    name=PRED_DIR.name,
    exist_ok=False,  # Don't allow overwriting (ensures clean predictions)
)

print("\n----Predictions generated!")

---
## Step 7: Analyze Test Predictions (Optional)

In [None]:
# Import required packages
from pathlib import Path

# Define constants
CLASS_NAMES = ["Carpetweed", "Morning Glory", "Palmer Amaranth"]

# Analyze predictions
PRED_DIR = Path("predictions")  # Must match PRED_DIR in Step 6
labels_dir = PRED_DIR / "labels"

if labels_dir.exists():
    print("Test Set Prediction Analysis:")
    print("=" * 50)

    pred_files = list(labels_dir.glob("*.txt"))

    class_counts = {i: 0 for i in range(len(CLASS_NAMES))}
    images_with_preds = 0
    total_detections = 0

    for pred_file in pred_files:
        if pred_file.stat().st_size > 0:
            images_with_preds += 1
            with open(pred_file, "r") as f:
                for line in f:
                    if line.strip():
                        parts = line.strip().split()
                        if len(parts) >= 6:
                            class_id = int(parts[0])
                            class_counts[class_id] += 1
                            total_detections += 1

    print(f"Total test images: {len(test_images)}")
    print(f"Images with detections: {images_with_preds}")
    print(f"Images with no detections: {len(test_images) - images_with_preds}")
    print(f"Total detections: {total_detections}")

    print("\n Detections by class:")
    for class_id, count in class_counts.items():
        percentage = (count / total_detections * 100) if total_detections > 0 else 0
        print(f"   {CLASS_NAMES[class_id]:20s}: {count:4d} ({percentage:5.1f}%)")

    print("\n----Analysis complete!")
else:
    print("!!!!No predictions found.")

---
## Step 8: Convert to Kaggle Submission Format (Required)

Convert YOLO predictions to Kaggle CSV format.

### ✅ Submission Format Requirements:
**IMPORTANT:** Use exact column names and format below!

**Columns:** `image_id,prediction_string` (lowercase!)

**Prediction String Format:**
- Each box: `class_id confidence x_center y_center width height` (6 values, space-separated)
  - `class_id`: 0 (carpetweed), 1 (morningglory), 2 (palmer_amaranth)
  - `confidence`: 0.0-1.0 (model confidence score)
  - `x_center, y_center, width, height`: normalized coordinates (0-1)
- Multiple boxes: Space-separated on same line
- No detections: Use `"no box"` (not empty string!)

### Example:
```csv
image_id,prediction_string
img_001,1 0.95 0.5 0.5 0.2 0.3 2 0.87 0.7 0.4 0.15 0.2
img_002,no box
```
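
A minimal sketch of the format rules above, using only the standard library. The helper names `yolo_line_to_kaggle` and `is_valid_prediction_string` are illustrative and not part of the competition tooling. The sketch assumes the YOLO label lines were saved with `save_conf=True`, so the confidence is the last value on each line.

```python
def yolo_line_to_kaggle(line: str) -> str:
    """Reorder 'class xc yc w h conf' (YOLO) into 'class conf xc yc w h' (Kaggle)."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 values, got {len(parts)}: {line!r}")
    class_id, xc, yc, w, h, conf = parts
    return f"{class_id} {conf} {xc} {yc} {w} {h}"


def is_valid_prediction_string(pred: str, num_classes: int = 3) -> bool:
    """Accept 'no box' or one or more groups of 6 space-separated values."""
    if pred == "no box":
        return True
    values = pred.split()
    if not values or len(values) % 6 != 0:
        return False
    for i in range(0, len(values), 6):
        class_id, *rest = values[i : i + 6]
        # class_id must be an integer in [0, num_classes)
        if not class_id.isdigit() or int(class_id) >= num_classes:
            return False
        try:
            numbers = [float(v) for v in rest]
        except ValueError:
            return False
        # Confidence and normalized coordinates must all lie in [0, 1]
        if any(n < 0.0 or n > 1.0 for n in numbers):
            return False
    return True


box = yolo_line_to_kaggle("1 0.5 0.5 0.2 0.3 0.95")
print(box)  # 1 0.95 0.5 0.5 0.2 0.3
print(is_valid_prediction_string(box))  # True
```

Unlike a spot check of the first few rows, a helper like this can be applied to every row of the submission before uploading.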

In [None]:
# ============================================================================
# STEP 8: Generate the Kaggle submission (run this cell)
# ============================================================================
# Import required packages
from pathlib import Path

import pandas as pd

# Define paths
WORK_DIR = Path(".")  # Current directory
# Prediction directory (change this to convert a different predictions folder)
PRED_DIR = Path("predictions")
# Test images directory (change this if your test images are stored elsewhere)
TEST_DIR = WORK_DIR / "test" / "images"


print("=" * 70)
print("GENERATING KAGGLE SUBMISSION")
print("=" * 70)

labels_dir = PRED_DIR / "labels"
output_csv = "submission.csv"

# Get all test images (deduplicate by stem to avoid duplicates from case-insensitive file systems)
test_images_dict = {}  # Use dict to automatically deduplicate by image_id (stem)
for ext in ["*.jpg", "*.jpeg", "*.JPG", "*.JPEG", "*.png", "*.PNG"]:
    for img_path in TEST_DIR.glob(ext):
        image_id = img_path.stem  # filename without extension
        if image_id not in test_images_dict:
            test_images_dict[image_id] = img_path

# Convert to sorted list
test_images_list = [
    test_images_dict[img_id] for img_id in sorted(test_images_dict.keys())
]

print(f"\n✓ Found {len(test_images_list)} test images")
print(f"✓ Looking for predictions in: {labels_dir}")

# Create submission data
submission_data = []
images_with_preds = 0
images_without_preds = 0
total_boxes = 0

for img_path in test_images_list:
    image_id = img_path.stem
    pred_file = labels_dir / f"{image_id}.txt"

    # Check if prediction file exists and has content
    if pred_file.exists() and pred_file.stat().st_size > 0:
        prediction_boxes = []

        with open(pred_file, "r") as f:
            for line in f:
                line = line.strip()
                if line:
                    parts = line.split()

                    # YOLO saves as: class xc yc w h conf (confidence is LAST!)
                    # Kaggle needs: class conf xc yc w h (confidence is SECOND!)
                    if len(parts) >= 6:
                        # Reorder values: move confidence from position 5 to position 1
                        class_id = parts[0]
                        conf = parts[5]  # Confidence is at the end in YOLO format
                        xc, yc, w, h = parts[1], parts[2], parts[3], parts[4]
                        box_str = f"{class_id} {conf} {xc} {yc} {w} {h}"
                        prediction_boxes.append(box_str)
                        total_boxes += 1

        if prediction_boxes:
            # Join all boxes with spaces
            prediction_string = " ".join(prediction_boxes)
            images_with_preds += 1
        else:
            prediction_string = "no box"
            images_without_preds += 1
    else:
        # No prediction file or empty file
        prediction_string = "no box"
        images_without_preds += 1

    submission_data.append(
        {"image_id": image_id, "prediction_string": prediction_string}
    )

# Create DataFrame with correct column names (lowercase!)
submission_df = pd.DataFrame(submission_data)
submission_df = submission_df[["image_id", "prediction_string"]]

# Save to CSV
submission_df.to_csv(output_csv, index=False)

# Print statistics
print("\n" + "=" * 70)
print("SUBMISSION STATISTICS")
print("=" * 70)
print(f"Total images:               {len(submission_df)}")
print(f"Images with predictions:    {images_with_preds}")
print(f"Images without predictions: {images_without_preds}")
print(f"Total bounding boxes:       {total_boxes}")
if len(submission_df) > 0:
    print(f"Average boxes per image:    {total_boxes / len(submission_df):.2f}")

# Show sample
print("\n" + "=" * 70)
print("SAMPLE PREDICTIONS")
print("=" * 70)
display(submission_df.head(10))

# Validation
print("\n" + "=" * 70)
print("FORMAT VALIDATION")
print("=" * 70)

# Check format
errors = []
if list(submission_df.columns) != ["image_id", "prediction_string"]:
    errors.append(f"!!! Wrong columns: {list(submission_df.columns)}")
else:
    print("✓ Columns correct: image_id, prediction_string")

if len(submission_df) != len(test_images_list):
    errors.append("!!! Row count mismatch")
else:
    print(f"✓ Row count correct: {len(submission_df)}")

# Validate prediction format (sample first 20)
format_ok = True
for idx in range(min(20, len(submission_df))):
    pred_str = str(submission_df.iloc[idx]["prediction_string"])

    if pred_str == "no box":
        continue

    values = pred_str.split()
    if len(values) % 6 != 0:
        format_ok = False
        break

if format_ok:
    print("✓ All sampled predictions properly formatted (6 values per box)")
else:
    errors.append("!!! Some predictions have wrong format")

if errors:
    print("\n!!! VALIDATION FAILED:")
    for err in errors:
        print(f"  {err}")
else:
    # Only report success when every validation check passed
    print("\n" + "=" * 70)
    print("✅ SUBMISSION READY FOR KAGGLE!")
    print("=" * 70)
    print(f"\nFile: {output_csv}")
    print(f"\n Upload '{output_csv}' to Kaggle!")
    print("\n Tips:")
    print("   - Check your score on the public leaderboard")
    print("   - You have 3 submissions per day (use them wisely!)")
    print("   - Select up to 2 final submissions for judging")

---
## Phase 2 Complete! Now Begins the Real Competition

### 🏆 What You've Accomplished:
✅ **Phase 1**: Registered dataset with 3LC Tables  
✅ **Phase 2**: Trained baseline YOLOv8n, inspected the Run in the 3LC Dashboard and made first submission  

### The baseline is just your starting point. Now comes the actual competition work!

---

## Phase 3: Iterative Optimization (The Train–Fix–Retrain Loop)

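This is where you'll actually improve your score. **The model architecture is fixed (YOLOv8n), so the largest gains have to come from the data.** Hyperparameter and augmentation tuning (Phase 4) are the only other levers the rules allow.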

### Step-by-Step Iterative Workflow:

#### 1. Analyze Data Gaps (Use 3LC Dashboard)
Visit your 3LC Dashboard and inspect your Run:

**Look for these error patterns:**

##### A. False Negatives 
- **What it is**: Model fails to detect a weed that's actually present in the image
- **How to identify in 3LC**: 
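  - Ground-truth boxes that no prediction overlaps (the model missed a labeled weed)
  - High-confidence predictions with low/zero IoU against every ground-truth box; when visual inspection shows a real weed there, the object is present but unlabeled in the training data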
- **Common causes**: 
  - **Missing labels**: Objects exist but weren't annotated during labeling (data quality issue)
  - Insufficient training examples of that weed type or appearance
- **Fix**: Add missing annotations to training data using Dashboard editor
- **Learn more**: [Edit tables in Dashboard](https://docs.3lc.ai/3lc/latest/how-to/basics/edit-table.html)

##### B. False Positives
- **What it is**: Model predicts a weed where there isn't one
- **How to identify in 3LC**: 
  - Predictions with low IoU to any ground truth box
  - Visual inspection shows no weed at predicted location
- **Common causes**: 
  - Similar-looking objects mislabeled as weeds in training data
  - Model overfitting to spurious patterns (soil, shadows, debris)
- **Fix**: Remove incorrect labels, add negative examples (images without weeds)

##### C. Class Confusion (Wrong Class Predicted)
- **What it is**: Model detects a weed but assigns wrong class (e.g., predicts "carpetweed" when it's "morningglory")
- **How to identify in 3LC**: 
  - High IoU but wrong class assignment
  - Compare predicted class vs ground truth class
- **Common causes**: 
  - Visually similar weed species
  - Inconsistent labeling in training data (same weed labeled differently)
- **Fix**: Correct mislabeled classes, add more distinguishing examples

##### D. Poor Localization (Inaccurate Bounding Boxes)
- **What it is**: Model detects weed and correct class, but bounding box is inaccurate
- **How to identify in 3LC**: 
  - Correct class but low IoU (e.g., IoU between 0.3-0.5)
  - Bounding box too large, too small, or misaligned
- **Common causes**: 
  - Inconsistent annotation conventions (tight vs loose boxes)
  - Partially labeled objects in training data
- **Fix**: Standardize bounding box annotations, correct misaligned boxes
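
The four patterns above can be sketched in plain Python for a single prediction. This is an illustration, not how the 3LC Dashboard computes its metrics: the function names and the thresholds (`match_iou=0.5`, `loose_iou=0.3`) are assumptions chosen to mirror the IoU ranges mentioned above, and boxes are assumed to be normalized `(x_center, y_center, width, height)` tuples.

```python
def xywh_to_xyxy(box):
    """Convert (xc, yc, w, h) to corner format (x1, y1, x2, y2)."""
    xc, yc, w, h = box
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


def iou(box_a, box_b):
    """Intersection over union of two (xc, yc, w, h) boxes."""
    ax1, ay1, ax2, ay2 = xywh_to_xyxy(box_a)
    bx1, by1, bx2, by2 = xywh_to_xyxy(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def categorize(pred_class, pred_box, gt_boxes, match_iou=0.5, loose_iou=0.3):
    """Label one prediction against gt_boxes, a list of (class_id, box) pairs."""
    best_iou, best_class = 0.0, None
    for gt_class, gt_box in gt_boxes:
        score = iou(pred_box, gt_box)
        if score > best_iou:
            best_iou, best_class = score, gt_class
    if best_iou >= match_iou:
        return "true_positive" if pred_class == best_class else "class_confusion"
    if best_iou >= loose_iou and pred_class == best_class:
        return "poor_localization"
    # Either a genuine false positive or a weed that is missing from the labels
    return "false_positive"


ground_truth = [(0, (0.5, 0.5, 0.2, 0.2))]
print(categorize(0, (0.5, 0.5, 0.2, 0.2), ground_truth))
print(categorize(1, (0.5, 0.5, 0.2, 0.2), ground_truth))
print(categorize(0, (0.58, 0.5, 0.2, 0.2), ground_truth))
print(categorize(0, (0.9, 0.9, 0.1, 0.1), ground_truth))
```

False negatives are the complement of this per-prediction view: any ground-truth box that no prediction matches at `match_iou` or above.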

#### 2. Find and Fix Data Issues in Dashboard
Use 3LC Dashboard workflows:

**Step 1: Filter problematic samples**
- Sort predictions by confidence, IoU, or class
- Create custom filters (e.g., "high confidence + low IoU" = likely missing labels)
- Group by error type

**Step 2: Visual review**
- Compare model predictions vs ground truth side-by-side
- Look for patterns in failures (certain lighting, weed sizes, occlusions)

**Step 3: Edit data directly in Dashboard**
- Add missing bounding boxes
- Correct wrong class labels
- Remove incorrect annotations
- Adjust misaligned bounding boxes

**Step 4: Export edited table**
- Dashboard creates a new table version automatically
- Copy the table URL for retraining (see Step 3.5 above)

**Learn more**: [Dashboard editing guide](https://docs.3lc.ai/3lc/latest/how-to/basics/edit-table.html)

#### 3. Load Edited Table for Retraining

After editing your data in the Dashboard:

**Option A: Use table URL (Recommended)**
```python
# Copy the edited table URL from Dashboard
train_table_v2 = tlc.Table.from_url("your_edited_table_url_here")
val_table_unchanged = val_table.latest()  # Keep val unchanged for fair comparison

tables_v2 = {
    "train": train_table_v2,
    "val": val_table_unchanged
}
```

**Option B: Use .latest() to get most recent version**
```python
# Automatically loads the newest version of your table
train_table_v2 = train_table.latest()
val_table_unchanged = val_table.latest()

tables_v2 = {
    "train": train_table_v2,
    "val": val_table_unchanged
}
```

#### 4. Retrain with Improved Data
```python
# Create new run with improved data
RUN_NAME = "yolov8n_with_fixed_labels_v2"
RUN_DESCRIPTION = "Fixed class confusion between carpetweed and morningglory"

settings_v2 = Settings(
    project_name=PROJECT_NAME,
    run_name=RUN_NAME,
    run_description=RUN_DESCRIPTION,
)

tables_v2 = {
    "train": train_table_v2,
    "val": val_table  # Keep val unchanged for fair comparison
}

# Retrain
model_v2 = YOLO("yolov8n.pt")
results_v2 = model_v2.train(
    tables=tables_v2,
    epochs=EPOCHS,
    imgsz=IMAGE_SIZE,
    batch=BATCH_SIZE,
    device=DEVICE,
    workers=WORKERS,
    settings=settings_v2,
    val=True,
)
```

#### 5. Compare Runs in 3LC Dashboard
- View both runs side-by-side
- Compare mAP@0.5 improvements
- Identify which data fixes helped most
- Plan next iteration
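
Outside the Dashboard, a quick numeric comparison is also possible. The sketch below assumes each run folder contains the `results.csv` file that Ultralytics typically writes during training, with a `metrics/mAP50(B)` column; check your own run folder to confirm the file and column name. The helper names and the CSV contents in the demo are synthetic and purely illustrative.

```python
import csv
import tempfile
from pathlib import Path


def best_map50(results_csv, column="metrics/mAP50(B)"):
    """Return the highest value of `column` across all epochs in a results.csv."""
    best = None
    with open(results_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Some Ultralytics versions pad header names and values with spaces
            cleaned = {
                key.strip(): value.strip()
                for key, value in row.items()
                if key is not None and value is not None
            }
            value = float(cleaned[column])
            best = value if best is None else max(best, value)
    if best is None:
        raise ValueError(f"no epochs found in {results_csv}")
    return best


def compare_runs(baseline_csv, candidate_csv):
    """Compare the best mAP@0.5 of two runs."""
    base = best_map50(baseline_csv)
    cand = best_map50(candidate_csv)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Demo on synthetic files; real paths would look like
# "runs/detect/train/results.csv" and "runs/detect/train2/results.csv".
with tempfile.TemporaryDirectory() as tmp:
    base_csv = Path(tmp) / "baseline.csv"
    cand_csv = Path(tmp) / "candidate.csv"
    base_csv.write_text("epoch, metrics/mAP50(B)\n1, 0.40\n2, 0.55\n")
    cand_csv.write_text("epoch, metrics/mAP50(B)\n1, 0.50\n2, 0.62\n")
    summary = compare_runs(base_csv, cand_csv)

print(summary)
```

A positive `delta` on an unchanged validation table suggests the data fix helped; differences of a fraction of a point may be run-to-run noise.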

#### 6. Repeat the Loop
Each iteration should:
1. Fix a specific data issue identified from previous run
2. Document what you changed (in run description)
3. Retrain and measure improvement
4. Generate new Kaggle submission (3 per day limit)

---

## Phase 4: Advanced Data-Centric Techniques

Once you've fixed obvious label issues, try these advanced approaches:

### 1. Hard Example Mining
Use 3LC to identify:
- Low-confidence correct predictions (model unsure but right)
- High-confidence wrong predictions (model confident but wrong)

These are your most valuable training samples!
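
As a rough sketch of the selection logic, assuming you have exported per-prediction records with a confidence score and a correctness flag. The record layout, the `mine_hard_examples` name, and both thresholds are hypothetical; in practice the same filtering can be done interactively in the Dashboard.

```python
def mine_hard_examples(records, low_conf=0.4, high_conf=0.7):
    """Split records into (unsure but right, confident but wrong).

    Each record is a dict with 'image_id', 'confidence', and 'correct' keys.
    """
    unsure_but_right = sorted(
        (r for r in records if r["correct"] and r["confidence"] < low_conf),
        key=lambda r: r["confidence"],  # least confident first
    )
    confident_but_wrong = sorted(
        (r for r in records if not r["correct"] and r["confidence"] >= high_conf),
        key=lambda r: r["confidence"],
        reverse=True,  # most confident first
    )
    return unsure_but_right, confident_but_wrong


# Made-up records for illustration only
records = [
    {"image_id": "img_a", "confidence": 0.31, "correct": True},
    {"image_id": "img_b", "confidence": 0.92, "correct": False},
    {"image_id": "img_c", "confidence": 0.88, "correct": True},
    {"image_id": "img_d", "confidence": 0.75, "correct": False},
    {"image_id": "img_e", "confidence": 0.22, "correct": True},
]

unsure, overconfident = mine_hard_examples(records)
print([r["image_id"] for r in unsure])         # ['img_e', 'img_a']
print([r["image_id"] for r in overconfident])  # ['img_b', 'img_d']
```

High-confidence wrong predictions are worth reviewing first, since they often point to label errors rather than model errors.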

### 2. Class Rebalancing & Sample Weighting

If one class performs poorly:
- Add more diverse examples of that class
- Oversample underrepresented classes in your training data
- Apply class-specific augmentation
- Use 3LC Dashboard to apply sample weights and rebalance the training distribution

**Pro Tip**: 3LC Dashboard allows you to:
- Weight samples by confidence, IoU, or custom metrics
- Create stratified training splits
- Oversample hard examples automatically
- Balance class distributions without duplicating files

### 3. Augmentation Tuning (Allowed!)
```python
# Experiment with augmentation hyperparameters:
model.train(
    tables=tables,
    epochs=EPOCHS,
    # Augmentation params:
    hsv_h=0.015,      # Hue shifts (lighting changes)
    hsv_s=0.7,        # Saturation (weather variations)
    hsv_v=0.4,        # Brightness
    degrees=10,       # Rotation (camera angles)
    translate=0.1,    # Position shifts
    scale=0.5,        # Size variation
    fliplr=0.5,       # Horizontal flip
    mosaic=1.0,       # Mosaic augmentation
    mixup=0.1,        # Mixup augmentation
    copy_paste=0.1,   # Copy-paste augmentation
)
```

### 4. Hyperparameter Tuning
```python
# Experiment with:
# - Learning rate: lr0=0.001, 0.005, 0.01, 0.02
# - Batch size: 8, 16, 32 (GPU memory permitting)
# - Epochs: 30, 50, 75, 100 (watch for overfitting)
# - Warmup epochs: warmup_epochs=3, 5, 10
# - Optimizer: optimizer='Adam', 'AdamW', 'SGD'
```
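
One way to organize these experiments is to enumerate the combinations up front and give each a run name. The sketch below only builds the configurations. The `search_space`, `iter_configs`, and `run_name_for` names are illustrative, and the training call is left as a comment because it depends on the tables and settings defined earlier in the notebook.

```python
from itertools import product

# Values taken from the list above; trim the grid to fit your compute budget
search_space = {
    "lr0": [0.001, 0.005, 0.01, 0.02],
    "batch": [8, 16, 32],
    "optimizer": ["Adam", "AdamW", "SGD"],
}


def iter_configs(space):
    """Yield one dict per combination of the values in `space`."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def run_name_for(config):
    """Build a readable run name such as 'lr0.001_b8_Adam'."""
    return f"lr{config['lr0']}_b{config['batch']}_{config['optimizer']}"


configs = list(iter_configs(search_space))
print(len(configs))  # 36
print(run_name_for(configs[0]))  # lr0.001_b8_Adam

# For each config, a training call would look roughly like:
# model.train(tables=tables, epochs=EPOCHS, imgsz=IMAGE_SIZE, **config)
```

A full grid of 36 runs is expensive, so changing one parameter at a time from the baseline is usually a more practical starting point.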

---

## Your Success Strategy
1. Visit your 3LC Dashboard
2. Analyze where your model fails
3. Fix those data issues
4. Retrain with better data (updated tables)
5. Submit improved predictions
6. Repeat!

---

## ⚠️ Competition Rules Reminder

### ✅ ALLOWED:
- YOLOv8n model only (REQUIRED)
- 640 input size (FIXED)
- Hyperparameter tuning
- Data augmentation (training-time)
- Label corrections and improvements
- Multiple training runs
- Confidence threshold adjustment
- 3 submissions per day

### ❌ PROHIBITED:
- Larger models (YOLOv8s, YOLOv8m, etc.)
- Model ensembles
- Model stacking
- Different architectures
- Post-processing beyond standard NMS
- External data sources
- Test-Time Augmentation (TTA) - prohibited due to edge device inference speed requirements

---

## Professional Skills You're Developing

### Data-Centric AI Mastery:
✅ **Using model feedback** to identify data problems  
✅ **Systematic error analysis** via 3LC Dashboard  
✅ **Iterative improvement loops** (train–fix–retrain)  
✅ **Version control for datasets** (tracking data changes)  
✅ **Reproducible experiments** (every run documented)  

### Production AI Reality:
✅ **Working within strict constraints** (model, compute, memory)  
✅ **Data quality as primary lever** (when model can't scale)  
✅ **Systematic rather than random** improvements  
✅ **Documentation and tracking** for team collaboration  

### These are the skills production AI teams use every day!

---

## Resources & Support

### 3LC Documentation:
- **Tables**: https://docs.3lc.ai/3lc/latest/user-guide/python-package/core-concepts/tables.html
- **Runs**: https://docs.3lc.ai/3lc/latest/user-guide/python-package/core-concepts/runs.html
- **YOLO Integration**: https://github.com/3lc-ai/3lc-ultralytics

### YOLOv8 Documentation:
- **Training Guide**: https://docs.ultralytics.com/modes/train/
- **Hyperparameters**: https://docs.ultralytics.com/usage/cfg/
- **Augmentation**: https://docs.ultralytics.com/guides/preprocessing_annotated_data/

### Competition Support:
- **Discussion Forum**: Share insights and strategies
- **Leaderboard**: 50% public, 50% private (discourages overfitting to the public leaderboard)
- **Submission Limit**: 3 per day, choose 2 final


**Good luck! May your data improve with every iteration!** 

---
## Ready for Faster Iteration? Use the Scripts!

Congratulations! You've completed the starter notebook and learned the data-centric AI workflow. 

### Switch to Production-Ready Scripts

The training and prediction workflow you just learned has been packaged into simple, easy-to-edit scripts for faster experimentation:

#### 📄 Available Scripts

**`train.py`** - Train models with 3LC tracking
- Simple edit-in-place configuration at the top of the file
- Just modify variables (like RUN_NAME, EPOCHS, TRAIN_TABLE_URL) and run
- Auto-loads latest table versions or specific edited tables from Dashboard

**`predict.py`** - Generate predictions and submissions
- Edit configuration section to set model weights path
- Creates Kaggle-ready CSV with automatic validation
- Adjustable confidence thresholds

---

### How to Use

Both scripts work the same way:

1. **Open the script** in your editor (VS Code, PyCharm, Notepad++, etc.)
2. **Edit the CONFIGURATION section** at the top:
   - Change run names, epochs, table URLs, etc.
   - All settings are clearly labeled with comments
3. **Run the script** - that's it!

**No command-line arguments to remember!** Just edit and run.

---

### Quick Examples

#### Training Example

**Step 1:** Open `train.py` and edit the configuration:
```python
# ============================================================================
# CONFIGURATION - Edit these values for your training run
# ============================================================================

# 3LC Table URLs (get these from Dashboard)
TRAIN_TABLE_URL = "your/train/table/url"
VAL_TABLE_URL = "your/val/table/url"

# Run configuration
RUN_NAME = "v1_baseline"  # Change for each experiment
RUN_DESCRIPTION = "Baseline YOLOv8n training run"

# Training hyperparameters
EPOCHS = 5  # Number of training epochs
BATCH_SIZE = 16  # Batch size
DEVICE = 0  # GPU device (0 for first GPU, 'cpu' for CPU)

# Data augmentation
USE_AUGMENTATION = False  # Set to True to enable mosaic, mixup, copy_paste
```

**Step 2:** Run the script:
```bash
python train.py
```

---

#### Prediction Example

**Step 1:** Open `predict.py` and edit the configuration:
```python
# ============================================================================
# CONFIGURATION - Edit these values
# ============================================================================

# Model weights path (from training)
MODEL_WEIGHTS = "runs/detect/yolov8n_baseline/weights/best.pt"

# Inference settings
CONFIDENCE_THRESHOLD = 0.25  # Confidence threshold for detections
IMAGE_SIZE = 640  # Input image size (FIXED by competition)
DEVICE = 0  # GPU device (0 for first GPU, 'cpu' for CPU)

# Output
OUTPUT_CSV = "submission.csv"  # Output submission file
```

**Step 2:** Run the script:
```bash
python predict.py
```

---

### Why Use Scripts?

| Notebook | Scripts |
|----------|---------|
| ✅ Learn concepts | ✅ Fast iteration |
| ✅ Visual explanations | ✅ Easy to customize |
| ✅ Step-by-step guide | ✅ No cell dependencies |
| ✅ Interactive exploration | ✅ One command runs all |
|  Slower (run cells) |  Faster workflow |

**You've learned the "why" in this notebook. Now use the scripts for the "how" in your iterations!**

---

### Typical Workflow

1. **Train baseline** (you just did this in the notebook!)

2. **Switch to scripts** for faster iteration:
   ```bash
   # Edit train.py configuration, then:
   python train.py
   ```

3. **Generate predictions:**
   ```bash
   # Edit predict.py configuration if needed, then:
   python predict.py
   ```

4. **Analyze in Dashboard:**
   - Open your run in 3LC Dashboard
   - Identify data issues (missing labels, mislabels, etc.)
   - Edit the table directly in Dashboard
   - Copy the edited table URL

5. **Retrain with fixed data:**
   - Paste the edited table URL into `train.py`
   - Change RUN_NAME to track the iteration (e.g., "v3_fixed_labels")
   - Run `python train.py`

6. **Submit and iterate!**
   - Update MODEL_WEIGHTS in `predict.py` to your latest model
   - Run `python predict.py`
   - Upload `submission.csv` to Kaggle

---

### Script Features

Both scripts include:
- ✅ **Clear configuration section** - all settings in one place at the top
- ✅ **Helpful comments** - explains what each setting does
- ✅ **Auto-validation** - checks for missing files and invalid settings
- ✅ **Progress tracking** - clear output showing what's happening
- ✅ **Error messages** - tells you exactly what went wrong and how to fix it
- ✅ **Easy to customize** - modify the pipeline code if needed

---

### Learn More

- **Script source code**: Both scripts are well-commented - read them to understand implementation details
- **README.md**: Complete documentation with more examples
- **3LC Dashboard workflow**: The scripts integrate seamlessly with Dashboard editing

---

**🎯 Bottom Line:** The notebook taught you the concepts. The scripts help you iterate faster with less friction!

**Ready to iterate? Open `train.py`, make your edits, and run!** 🌾
