## **Current Experiment: Revised Experiment #1 (Aggregated Sparse-Synthetic Data)**

This run of the notebook focuses on **Experiment #1 (Revised)**. In this experiment, Stage 1 of the LLM is trained on data that is:
* Derived from **sparse individual synthetic pavement sections** (simulating how infrequently real pavements are inspected).
* **Aggregated by ClimateZone, SurfaceType, and Age** to create dense, averaged degradation curves.

This approach aims to teach the LLM general, continuous degradation patterns from averaged data. In Stage 2, the model will adapt those patterns to specific individual sections.
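
The aggregation step can be illustrated with a small sketch. It pools sparse per-section observations and averages them per (ClimateZone, SurfaceType, Age), producing one age-sorted curve per group. The `PCI` and `SectionID` field names, the sample values, and the helper `aggregate_curves` are assumptions for illustration, not the notebook's actual generator. The sketch uses only the standard library; with pandas, `df.groupby(["ClimateZone", "SurfaceType", "Age"])["PCI"].mean()` would express the same idea.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical sparse inspections: each section is observed at only a few ages.
records = [
    {"SectionID": "A", "ClimateZone": "Wet", "SurfaceType": "Asphalt", "Age": 0, "PCI": 100.0},
    {"SectionID": "A", "ClimateZone": "Wet", "SurfaceType": "Asphalt", "Age": 4, "PCI": 80.0},
    {"SectionID": "B", "ClimateZone": "Wet", "SurfaceType": "Asphalt", "Age": 2, "PCI": 90.0},
    {"SectionID": "B", "ClimateZone": "Wet", "SurfaceType": "Asphalt", "Age": 4, "PCI": 84.0},
    {"SectionID": "C", "ClimateZone": "Dry", "SurfaceType": "Concrete", "Age": 2, "PCI": 95.0},
]


def aggregate_curves(rows):
    """Average PCI over all sections sharing (ClimateZone, SurfaceType, Age)."""
    buckets = defaultdict(list)
    for row in rows:
        key = (row["ClimateZone"], row["SurfaceType"], row["Age"])
        buckets[key].append(row["PCI"])

    # Regroup into one age-sorted curve per (ClimateZone, SurfaceType).
    curves = defaultdict(list)
    for (zone, surface, age), values in sorted(buckets.items()):
        curves[(zone, surface)].append((age, fmean(values)))
    return dict(curves)


curves = aggregate_curves(records)
for group, curve in curves.items():
    print(group, curve)
```

In this toy input, sections A and B each have only two inspections, yet the pooled Wet/Asphalt curve has points at ages 0, 2, and 4. That is the sense in which aggregation turns sparse individual histories into a denser average curve.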

---

### **Workflow Sections:**

1.  **Colab Session Setup:** Initial environment setup (Drive, GitHub, Libraries).
2.  **Data Generation & Preparation:** Functions to create and preprocess synthetic data.
3.  **Data Preprocessing & DataLoader Setup:** Preparing data for PyTorch.
4.  **Model Definition & Initialization:** Defining and instantiating the LLM architecture.
5.  **Stage 1 Training Loop:** The core learning phase.
6.  **Model Saving & Evaluation:** Saving progress and checking performance (future steps).

In [6]:
# Cell 1.1: Mount Google Drive (Optional but Recommended for Larger Files)
import os # Make sure os is imported

# Check if Drive is already mounted
if not os.path.exists('/content/drive'):
    from google.colab import drive
    print("Mounting Google Drive...")
    drive.mount('/content/drive')
    print("Google Drive mounted.")
else:
    print("Google Drive is already mounted.")


Mounting Google Drive...
Mounted at /content/drive
Google Drive mounted.


In [15]:
# Cell 1.2: Set Up GitHub Credentials & Clone/Update Repository (Enhanced Robustness)
import os

# --- 1. Set Git User Identity (Persists for the session) ---
!git config --global user.email "anthony@icatalystinc.com"
!git config --global user.name "Anthony Meyer"
print("Git user identity configured.")

# --- 2. Set GitHub Personal Access Token (PAT) as an environment variable ---
# IMPORTANT SECURITY WARNING:
# Direct embedding of PATs in notebooks is NOT secure for shared or public notebooks.
# For production or shared environments, use Colab's "Secrets" feature (key icon on left sidebar).
# For personal projects, you can use this for convenience.
# Replace "your_pat_here" with your actual PAT.
os.environ["GITHUB_PAT"] = "your_pat_here"  # Placeholder only: never commit a real token
print("GITHUB_PAT environment variable set.")

# --- 3. Define Repository Info ---
# Construct the URL with PAT for cloning (only needed for private repos)
repo_url_with_pat = f"https://{os.environ['GITHUB_PAT']}@github.com/iCatalyst-D3M/pci-forecasting.git"
repo_name = "pci-forecasting" # Your repository's folder name

# --- 4. Navigate to /content and Robustly Clone/Update Repository ---
# All clones typically go into /content in Colab
%cd /content

# Proactive check and cleanup:
# If the directory exists but is not a valid Git repo, or if it looks like a nested one, remove it.
if os.path.exists(repo_name):
    # Check if .git subdirectory (indicates a git repo) or if a nested repo is found
    if not os.path.exists(os.path.join(repo_name, '.git')) or os.path.exists(os.path.join(repo_name, repo_name)):
        print(f"Detected problematic or non-Git '{repo_name}' directory. Removing it to re-clone cleanly.")
        !rm -rf {repo_name}
        print("Removed existing problematic directory.")
    else:
        print(f"'{repo_name}' directory exists and appears to be a valid Git repo. Proceeding to pull.")

# Now, perform the clone or pull (after potential cleanup)
if not os.path.exists(repo_name):
    print(f"\nCloning {repo_name}...")
    !git clone {repo_url_with_pat}
    print("Repository cloned.")
else:
    # This block is executed if repo_name exists AND is likely valid (not problematic)
    print(f"\nRepository {repo_name} exists. Pulling latest changes...")
    %cd {repo_name}
    !git pull
    %cd ..
    print("Repository updated.")

# --- 5. Navigate into your Repository Directory for all subsequent work ---
# This is the final step to ensure your notebook is operating from your repo's root
%cd {repo_name}

# Verify current working directory and list contents
print(f"\nCurrently in: {os.getcwd()}")
!ls

Git user identity configured.
GITHUB_PAT environment variable set.
/content
Currently in:
'pci-forecasting' directory exists and appears to be a valid Git repo. Proceeding to pull.

Repository pci-forecasting exists. Pulling latest changes...
/content/pci-forecasting
Already up to date.
/content
Repository updated.
/content/pci-forecasting

Currently in: /content/pci-forecasting
aggregated_synthetic_pci_data.csv  predict_all_2.csv
pci_forecasting.ipynb		   synthetic_pci_data.csv
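
Cell 1.2 warns against embedding a PAT in the notebook source. One way to follow that advice is sketched below: the token is read from an environment variable set outside the notebook, with a non-echoing prompt as a fallback, and it is masked before any URL is printed. The helper names `get_github_token`, `build_clone_url`, and `mask_token` are hypothetical, and the demo token is a dummy string.

```python
import os
from getpass import getpass


def get_github_token(env_var="GITHUB_PAT"):
    """Return the token from the environment, prompting only if it is absent."""
    token = os.environ.get(env_var)
    if not token:
        # getpass does not echo the input, so the token never lands in cell output.
        token = getpass(f"Enter {env_var}: ")
    return token


def build_clone_url(token, owner, repo):
    """Build an HTTPS clone URL that authenticates with the token."""
    return f"https://{token}@github.com/{owner}/{repo}.git"


def mask_token(text, token):
    """Replace the token with asterisks so the text is safe to print or log."""
    return text.replace(token, "****") if token else text


# Demonstration with a dummy value; in the notebook, call get_github_token() instead.
demo_token = "example-token"  # placeholder, not a real credential
url = build_clone_url(demo_token, "iCatalyst-D3M", "pci-forecasting")
print("Cloning from:", mask_token(url, demo_token))
```

Keeping the token out of the source also keeps it out of version control when the notebook itself is committed. Colab's Secrets panel, which the warning in Cell 1.2 mentions, is another way to supply the value without a prompt.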


In [17]:
# Cell 1.3: Create src/ directory and make it a Python package (Run this once per session)

import os

# Define the path for your source directory
src_dir = 'src'
init_file = os.path.join(src_dir, '__init__.py')

# Check if the src directory exists, if not, create it
if not os.path.exists(src_dir):
    os.makedirs(src_dir)
    print(f"Created '{src_dir}' directory.")
else:
    print(f"'{src_dir}' directory already exists.")

# Ensure __init__.py exists to make 'src' a Python package
if not os.path.exists(init_file):
    with open(init_file, 'w') as f:
        f.write('') # Write an empty file
    print(f"Created empty '{init_file}' to make 'src' a package.")
else:
    print(f"'{init_file}' already exists.")

print("File structure setup for 'src/' complete locally.")
print("\nNEXT STEPS: Manually create/paste your code into src/dataset.py and src/models.py using the Colab file browser, then run the next cell.")

Created 'src' directory.
Created empty 'src/__init__.py' to make 'src' a package.
File structure setup for 'src/' complete locally.

NEXT STEPS: Manually create/paste your code into src/dataset.py and src/models.py using the Colab file browser, then run the next cell.
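
The same setup can be written more compactly with `pathlib`. The sketch below is an alternative, not the notebook's own code. `ensure_package` is a hypothetical helper, and it is idempotent, so rerunning it in a later session leaves an existing `__init__.py` untouched.

```python
from pathlib import Path
import tempfile


def ensure_package(root, name="src"):
    """Create root/name with an __init__.py; safe to call repeatedly."""
    package_dir = Path(root) / name
    package_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    init_file = package_dir / "__init__.py"
    init_file.touch(exist_ok=True)  # creates the file, or keeps existing content
    return init_file


# Demonstrate in a throwaway directory rather than the real repo.
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_package(tmp)
    second = ensure_package(tmp)  # second call changes nothing on disk but the mtime
    print(first == second, first.exists())
```

In the notebook, calling `ensure_package(".")` from the repository root would correspond to what Cell 1.3 does.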


In [None]:
# Cell 1.4: Commit Files to GitHub

import os

# --- 1. Define Notebook Name and Paths ---
# Replace 'pci_llm_main_workflow.ipynb' with the exact name you saved your notebook as in Google Drive.
notebook_name = "pci_llm_main_workflow.ipynb"

# Assuming your notebook is in the default 'Colab Notebooks' folder in your Drive.
# If you saved it elsewhere in Drive, adjust this path.
notebook_path_in_drive = f"/content/drive/MyDrive/Colab Notebooks/{notebook_name}"

# Your current working directory is the root of your cloned repo (e.g., /content/pci-forecasting/)
# We will copy the notebook to this directory.
destination_path_in_repo = "." # '.' means current directory

print(f"Attempting to copy notebook from: {notebook_path_in_drive}")
print(f"To current repo directory: {os.getcwd()}")


# --- 2. Copy the Notebook from Drive into your Cloned Repository ---
# Use !cp to copy the file. -f means force overwrite if it exists.
!cp -f "{notebook_path_in_drive}" "{destination_path_in_repo}"
print(f"\nNotebook '{notebook_name}' copied into the local repository folder.")

# Verify it's in the current directory now
!ls # You should now see your notebook_name.ipynb listed here


# --- 3. Stage, Commit, and Push the Notebook to GitHub ---
print("\n--- Committing Notebook to GitHub ---")

# Stage the notebook file for commit
!git add {notebook_name}

# Stage any other pending changes (e.g., the src/ directory and __init__.py created in Cell 1.3).
# Re-adding files that are already staged is harmless.
!git add src/ # Adds src/ and its contents, including __init__.py

# Commit your changes with a descriptive message
!git commit -m "Add main LLM workflow notebook to repo; ensure src/ is tracked"

# Push changes to GitHub (Colab might prompt for credentials if not configured with PAT)
# Ensure GITHUB_PAT environment variable is set from Cell 1.2
!git push

print(f"\nNotebook '{notebook_name}' and other changes pushed to GitHub!")
print("You can now verify this in your GitHub repository.")
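
One caveat with Cell 1.4: a failed `!cp` does not raise a Python exception, so the "copied" message prints even when the Drive path is wrong. A minimal sketch of a checked copy follows. `copy_notebook` is a hypothetical helper, and the temporary directories stand in for the Drive folder and the cloned repository.

```python
import shutil
import tempfile
from pathlib import Path


def copy_notebook(source, destination_dir):
    """Copy source into destination_dir, raising if the source is missing."""
    source = Path(source)
    if not source.is_file():
        raise FileNotFoundError(f"Notebook not found: {source}")
    # copy2 overwrites an existing file and preserves timestamps.
    return Path(shutil.copy2(source, destination_dir))


# Demonstrate with temporary stand-ins for the Drive folder and the repo.
with tempfile.TemporaryDirectory() as drive, tempfile.TemporaryDirectory() as repo:
    notebook = Path(drive) / "pci_llm_main_workflow.ipynb"
    notebook.write_text("{}")
    copied = copy_notebook(notebook, repo)
    print("Copied to:", copied.name)
```

Because the helper raises on a missing source, the staging and commit steps that follow it would not run against a notebook that was never copied.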