# PCA_V9_bin_orig.py — PCA & Export Guide

Documentation for `PCA_rodrigo/file_script/PCA_V9_bin_orig.py`, highlighting the PCA visualisation workflow (first stage) and the Excel/TXT metrics export (second stage).

## How the script is triggered

- Called directly from the GUI: pressing **Run** in `PCA-GUI.py` ultimately executes `plot_combined_layout(...)` with the GUI inputs (directories, labels, colours, Raman shift span, chart title).
- Can be run standalone for quick tests, either by editing the lists at the top of the file and executing the module as a script, or by importing it from a Python session:

```python
# Example: run from a Python session
from PCA_rodrigo.file_script import PCA_V9_bin_orig as pca_orig

directories = [r'path/to/groupA/Processed', r'path/to/groupB/Processed']
labels = ['Group A', 'Group B']
colours = ['red', 'blue']
raman_range = (700, 900)
manual_offset = 0
title = 'Demo PCA'

pca_orig.plot_combined_layout(directories, labels, raman_range, manual_offset, colours, title)
```

## Input data flow

1. **Binary files (`orig_*.bin`)** are produced by `csv_to_bin_orig.py`. They store a dictionary with keys `Raman Shifts` (wavenumber axis) and `Intensity Data` (mapping scan point IDs to raw spectra).
2. `load_group_data(directory, label, raman_shift_range)` walks the folder, reads each binary file, and filters the wavenumber axis to the requested range. The result is a `pandas.DataFrame` where rows correspond to acquisition points and columns to the filtered Raman shifts.
3. During the combined layout run, the script loads each group twice:
   - **Restricted range** for the PCA fit (`raman_shift_range`).
   - **Full spectrum** later on for the representative spectra panel and exports.
4. A deterministic colour map is derived from the provided labels (`build_color_map`).
5. Global font scaling and bold styling are applied via `scale_fonts` and `apply_bold_to_axes`.
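
The range filtering in step 2 can be sketched as follows. This is a minimal illustration, not the script's own `load_group_data`: it assumes the payload dictionary has already been read from disk (the on-disk serialisation of `orig_*.bin` is not covered here), it assumes the range bounds are inclusive, and the helper name `filter_to_range` is hypothetical. It returns NumPy arrays rather than the `pandas.DataFrame` the script builds.

```python
import numpy as np


def filter_to_range(payload, raman_shift_range):
    """Keep only the columns whose Raman shift lies inside the range.

    `payload` is assumed to look like the dictionary described above:
    {"Raman Shifts": [...], "Intensity Data": {point_id: [...], ...}}.
    Bounds are treated as inclusive (an assumption).
    """
    low, high = raman_shift_range
    shifts = np.asarray(payload["Raman Shifts"], dtype=float)
    mask = (shifts >= low) & (shifts <= high)

    point_ids = list(payload["Intensity Data"])
    # One row per acquisition point, one column per retained Raman shift.
    spectra = np.array(
        [np.asarray(payload["Intensity Data"][pid], dtype=float)[mask]
         for pid in point_ids]
    )
    return point_ids, shifts[mask], spectra


demo_payload = {
    "Raman Shifts": [650.0, 700.0, 800.0, 900.0, 950.0],
    "Intensity Data": {
        "p1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "p2": [5.0, 4.0, 3.0, 2.0, 1.0],
    },
}
ids, kept_shifts, matrix = filter_to_range(demo_payload, (700, 900))
print(kept_shifts)   # shifts inside the requested range
print(matrix.shape)  # (number of points, number of retained shifts)
```

Calling the same helper with a range that spans the whole axis mirrors the second, full-spectrum load described in step 3.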

## PCA canvas (first stage)

`plot_combined_layout` orchestrates an interactive 2 × 2 grid inside a 16 × 10 inch figure:

- **Data preparation** — The filtered spectra from all groups are concatenated into `all_data_combined`; missing values are filled with zero. A 3-component PCA is fitted (`PCA(n_components=3)`) to support both 2D and 3D projections.
- **Left pane: PCA scatter** — Initially 2D (PC1 vs PC2) with dense annotations:
  - Each group’s points are coloured according to the GUI selection.
  - A grey baseline at PC2 = 0 shows signed distance along PC1. The script computes group centroids and a global centroid: $$d_{\sigma} = \frac{\mu_{\text{group}} - \mu_{\text{global}}}{\sigma_{\text{global}}}.$$
  - Labels are staggered vertically and slightly shifted horizontally to avoid overlaps; translucent bands enhance readability.
  - The **2D/3D** button swaps between the 2D view and a 3D scatter (PC1, PC2, PC3) while preserving the baseline logic.
- **Top-right: representative spectra** — For each group the spectrum closest to its PCA centroid is plotted across the full Raman axis. Curves are vertically offset (`manual_offset`) and a yellow band highlights the evaluated interval.
- **Bottom-right: explained variance** — Displays the PCA variance ratio per component with bold axes and grid.
- **Toolbar buttons** — Besides the projection toggle there is a **Copy** button that saves the canvas to a temporary PNG file and opens it so the user can copy it to the clipboard.
- **Layout helpers** — `shrink_axes_width` and `shift_axes_right` create breathing room between subplots so the annotations remain legible even when many groups are drawn.
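
The numerical core of the canvas (the 3-component fit, the signed σ-distance, and the choice of a representative spectrum) can be sketched without any plotting. The snippet below is an illustration under stated assumptions, not the script's code: it uses a plain NumPy SVD in place of scikit-learn's `PCA(n_components=3)`, it takes $\sigma_{\text{global}}$ to be the sample standard deviation of all PC1 scores, and it measures "closest to the centroid" in the PC1/PC2 plane. The function names are hypothetical.

```python
import numpy as np


def pca_scores(data, n_components=3):
    """Project mean-centred rows onto the leading principal components."""
    centred = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:n_components].T
    variance = singular_values ** 2
    return scores, (variance / variance.sum())[:n_components]


def signed_sigma_distance(pc1, group_mask):
    """(group mean - global mean) / global standard deviation along PC1."""
    return (pc1[group_mask].mean() - pc1.mean()) / pc1.std(ddof=1)


def representative_index(scores, group_mask):
    """Row index of the group's point closest to its (PC1, PC2) centroid."""
    idx = np.flatnonzero(group_mask)
    centroid = scores[idx, :2].mean(axis=0)
    distances = np.linalg.norm(scores[idx, :2] - centroid, axis=1)
    return int(idx[np.argmin(distances)])


# Synthetic stand-in for two groups of filtered spectra (20 points x 6 shifts).
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(20, 6))
group_b = rng.normal(0.0, 1.0, size=(20, 6)) + 5.0  # shifted group
all_data = np.nan_to_num(np.vstack([group_a, group_b]))  # missing values -> 0
is_a = np.arange(40) < 20

scores, ratio = pca_scores(all_data)
d_a = signed_sigma_distance(scores[:, 0], is_a)
d_b = signed_sigma_distance(scores[:, 0], ~is_a)
rep_a = representative_index(scores, is_a)
print(ratio, d_a, d_b, rep_a)
```

With two equally sized groups the two signed distances are equal in magnitude and opposite in sign, which is what the staggered labels on either side of the baseline convey.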

## Metrics computed before exporting

For each group the script measures both PCA-space statistics and spectral features around the representative peak:

- **Centroid scores** — Mean PC1/PC2 coordinates (`PC1 Centroid`, `PC2 Centroid`) and their signed σ-distance relative to the global PC1 baseline.
- **Within-group spread** — Sample standard deviations along PC1/PC2 (`PC1 Std`, `PC2 Std`) and, when it can be computed, the determinant of the 2 × 2 covariance matrix.
- **Peak identification** — For the centroid spectrum the index of the maximum intensity within the filtered range determines `Centroid Peak Shift (cm^-1)` and `Centroid Peak Intensity`.
- **RSD at the peak** — Using all raw spectra, the relative standard deviation is: $$\mathrm{RSD}_{\text{peak}} = \frac{\sigma_{\text{peak}}}{\mu_{\text{peak}}} \times 100.$$
- **Robust windowed RSD** — The script inspects a window ±`PEAK_WINDOW_CM` around the peak (default ±5 cm⁻¹). For each spectrum it records the maximum within that window, then computes $$\mathrm{RSD}_{\text{window}} = \frac{\sigma_{\max}}{\mu_{\max}} \times 100,$$ yielding `Mean Intensity in Window`, `RSD% in Window (raw)`, and the actual `Window Span Used (cm^-1)`.
- **Fallback handling** — Whenever a calculation fails (e.g., missing data or single-spectrum groups) the script records `NaN` rather than raising an exception, keeping the export consistent.
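
The peak and window metrics above can be reproduced on a toy data set. The sketch below is an assumption-laden illustration rather than the script's implementation: it uses the sample standard deviation (`ddof=1`), expects the shift axis to be already restricted to the evaluated range, and returns `NaN` for groups with fewer than two spectra. The names `rsd_percent` and `peak_metrics` and the dictionary keys are hypothetical.

```python
import numpy as np

PEAK_WINDOW_CM = 5.0  # half-width of the window, matching the documented default


def rsd_percent(values):
    """Relative standard deviation in percent; NaN when it cannot be computed."""
    values = np.asarray(values, dtype=float)
    if values.size < 2 or values.mean() == 0:
        return float("nan")
    return float(values.std(ddof=1) / values.mean() * 100.0)


def peak_metrics(shifts, spectra, centroid_spectrum, window=PEAK_WINDOW_CM):
    """RSD at the centroid peak and over a +/- window around it."""
    shifts = np.asarray(shifts, dtype=float)
    spectra = np.asarray(spectra, dtype=float)
    peak_idx = int(np.argmax(centroid_spectrum))
    peak_shift = shifts[peak_idx]

    in_window = np.abs(shifts - peak_shift) <= window
    window_max = spectra[:, in_window].max(axis=1)  # one maximum per spectrum
    return {
        "peak_shift": float(peak_shift),
        "rsd_peak": rsd_percent(spectra[:, peak_idx]),
        "mean_in_window": float(window_max.mean()),
        "rsd_window": rsd_percent(window_max),
        "window_span": (float(shifts[in_window].min()),
                        float(shifts[in_window].max())),
    }


shifts = np.array([790.0, 795.0, 800.0, 805.0, 810.0])
spectra = np.array([
    [1.0, 2.0, 10.0, 3.0, 1.0],
    [1.0, 3.0, 12.0, 2.0, 1.0],
    [1.0, 2.0, 14.0, 2.0, 1.0],
])
metrics = peak_metrics(shifts, spectra, centroid_spectrum=spectra[1])
print(metrics)
```

Taking the per-spectrum maximum inside the window makes the second RSD tolerant of small peak shifts between acquisitions, which is presumably why the text calls it robust.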

## Excel workbook (second stage)

Exports live in `PCA_rodrigo/file_script/outputs/` with timestamped names. Each run produces:

1. **`results_<title>_<timestamp>.xlsx`** — generated via `pandas.ExcelWriter` with up to three sheets:
   - `Summary`: chart title, Raman shift span, variance ratio for PC1/PC2, and the configured peak window width.
   - `GroupMetrics`: one row per group with all metrics listed above (centroids, signed σ, spreads, peaks, RSDs).
   - `CentroidSpectra`: optional wide table containing the full Raman axis in the first column and one centroid spectrum per group (only written when the script successfully aligns all spectra on a shared axis).
2. **`centroid_spectra_all_<title>_<timestamp>.txt`** — a long-format text file (semicolon-delimited) that records every centroid intensity value per group and Raman shift. The header encodes metadata such as the evaluated range and generation time.
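
A long-format, semicolon-delimited file of this kind can be written with the standard library alone. The sketch below only illustrates the shape of such an export: the column names, the `#`-prefixed metadata lines, and the timestamp format are assumptions, since the actual header layout is not specified here, and `write_centroid_txt` is a hypothetical helper.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path


def write_centroid_txt(out_dir, title, raman_range, centroid_spectra):
    """Write one row per (group, Raman shift) pair, semicolon-delimited.

    `centroid_spectra` maps a group label to a list of (shift, intensity) pairs.
    """
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    path = Path(out_dir) / f"centroid_spectra_all_{title}_{stamp}.txt"
    with path.open("w", newline="", encoding="utf-8") as handle:
        # Metadata lines; the real header layout may differ.
        handle.write(f"# Evaluated range: {raman_range[0]}-{raman_range[1]} cm^-1\n")
        handle.write(f"# Generated: {stamp}\n")
        writer = csv.writer(handle, delimiter=";", lineterminator="\n")
        writer.writerow(["Group", "Raman Shift (cm^-1)", "Intensity"])
        for label, points in centroid_spectra.items():
            for shift, intensity in points:
                writer.writerow([label, shift, intensity])
    return path


demo_dir = tempfile.mkdtemp()
demo_path = write_centroid_txt(
    demo_dir, "Demo", (700, 900),
    {"Group A": [(700.0, 1.5), (701.0, 1.7)], "Group B": [(700.0, 0.9)]},
)
lines = demo_path.read_text(encoding="utf-8").splitlines()
print("\n".join(lines))
```

Long format keeps the file valid even when groups were acquired on different Raman axes, a case in which the wide `CentroidSpectra` sheet is skipped.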

## Error handling and logging

- Console messages announce successful loads, skipped files, and the paths of the exported artefacts.
- When an export step fails (for example due to file permissions) the script prints a warning but keeps the PCA window open so results can be inspected on screen.
- The GUI wraps the call inside a `try/except` to surface any issues via Tk message boxes.
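
The "warn but keep going" behaviour for export steps follows a common pattern, sketched below. This is an illustration of the idea only; `run_export_step` and the two demo exporters are hypothetical, and the message wording is not taken from the script.

```python
def run_export_step(name, export_fn, *args, **kwargs):
    """Run one export step; on failure print a warning and return None."""
    try:
        result = export_fn(*args, **kwargs)
    except Exception as exc:  # broad on purpose: any export failure is non-fatal
        print(f"[warning] {name} export failed: {exc}")
        return None
    print(f"[ok] {name} exported to {result}")
    return result


def failing_export(path):
    """Stand-in for an export that hits a permissions problem."""
    raise PermissionError(f"cannot write to {path}")


def working_export(path):
    """Stand-in for an export that succeeds and reports its path."""
    return path


failed = run_export_step("Excel", failing_export, "outputs/results.xlsx")
saved = run_export_step("TXT", working_export, "outputs/centroids.txt")
```

Because the wrapper returns `None` instead of raising, the code that shows the figure can run regardless of whether either file was written.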

## Customisation pointers

- Adjust `PEAK_WINDOW_CM` near the top of the script to widen or shrink the robust window analysis.
- Change the default PCA span (`raman_shift_range`) before calling `plot_combined_layout` or expose it through the GUI if operators frequently tweak the range.
- `manual_offset` controls vertical separation between representative spectra; the GUI currently fixes it to 0.
- Add more colours to `color_order` or override `build_color_map` to harmonise palettes across teams.
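
For the last pointer, one way a deterministic label-to-colour mapping could work is sketched below. The real signature and behaviour of `build_color_map` are not documented here, so the function name, its parameters, and the palette are all assumptions made for illustration.

```python
DEFAULT_COLOR_ORDER = ["red", "blue", "green", "orange", "purple"]  # illustrative


def build_color_map_sketch(labels, colors=None, color_order=DEFAULT_COLOR_ORDER):
    """Map each label to a colour, deterministically.

    Explicit `colors` (for example from the GUI) take precedence; labels
    without one fall back to `color_order`, indexed by label position.
    """
    color_map = {}
    for position, label in enumerate(labels):
        if colors is not None and position < len(colors) and colors[position]:
            color_map[label] = colors[position]
        else:
            color_map[label] = color_order[position % len(color_order)]
    return color_map


palette = build_color_map_sketch(
    ["Group A", "Group B", "Group C"], colors=["black", None]
)
print(palette)
```

Because the fallback depends only on the position of each label, the same label list always yields the same palette, so figures from different runs stay comparable.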