# Getting started with pyglotaran

> Pyglotaran is an open-source modeling framework, written in Python and designed for global and target analysis of time-resolved spectroscopy data. It provides a flexible framework for analyzing complex spectrotemporal datasets, supporting a wide range of kinetic models including sequential, parallel, and target analysis schemes.

This getting-started notebook exists to get you started with using pyglotaran, hopefully in 20 minutes or less.

What you are viewing now may be a *static* rendering of an otherwise **interactive** notebook. This guide is most useful if you either follow along in a *new* notebook or download the original notebook from the repository.

<details>
<summary>Click here for more details on notebooks.</summary>

A Python notebook, also known as a Jupyter Notebook, is an interactive computational environment in which you can run code, explore data, and present your results, all in a single file (with the file extension `.ipynb`).

There are three main ways to run a Jupyter Notebook:
- `[local]` Using Jupyter Notebook Directly
- `[local|github]` Using VS Code with the Python and Jupyter extensions
- `[cloud]` Using Google Colab  

</details>

This guide assumes you already know how to work with notebooks; if not, there are plenty of tutorials online.

## Preface

If you are going through this guide, you most likely have some dataset burning a hole in your pocket. 

Please rest assured, we'll get to that. But **first**, we'd like to take you through a typical modeling workflow.

## Working with data 

Let's start with the premise that we already have some data imported in our notebook.

<sub><i>To make this guide self-contained, we'll make use of some simulated data from the glotaran.testing package</i></sub>

In [None]:
# For the purpose of illustrating the workflow, we will just use
# some simulated data from the glotaran.testing package.
from glotaran.testing.simulated_data.sequential_spectral_decay import DATASET as my_dataset

# ending the cell with the variable name will display the content of the variable
my_dataset

Like all data in `pyglotaran`, the dataset is an [xarray.Dataset](https://xarray.pydata.org/en/stable/api.html#dataset), for which you can find more information on the [xarray homepage](https://xarray.pydata.org/en/stable/).

From the output cell, we can quickly see that the dataset `my_dataset` has the following properties:
- It has the `Dimensions` `time` and `spectral` (and they *must* be named like that).
- For *these* data the time coordinate starts at `-1.0` and runs until `19.99`.
- For *these* data the spectral coordinate starts at `600` and runs until `699.4`.
- The dataset (currently) has a single Data variable called `data` with (2100 time x 72 spectral) = 151200 data points. More variables may be added to the dataset later.
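To get a feel for this layout without loading anything, here is a minimal sketch that builds a much smaller dataset with the same structure and inspects it. The name `toy_dataset`, the axis values, and the all-zero data are illustrative assumptions, not part of pyglotaran.

```python
import numpy as np
import xarray as xr

# Hypothetical toy axes; the simulated dataset used in this guide is much larger
time = np.linspace(-1.0, 19.0, 21)  # 21 time points
spectral = np.linspace(600.0, 690.0, 10)  # 10 spectral points

# Same layout as described above: dimensions `time` and `spectral`,
# and a single data variable called `data`
toy_dataset = xr.DataArray(
    np.zeros((time.size, spectral.size)),
    dims=["time", "spectral"],
    coords={"time": time, "spectral": spectral},
).to_dataset(name="data")

# Inspect the properties listed above
print(dict(toy_dataset.sizes))  # size of each dimension
print(float(toy_dataset["time"].min()), float(toy_dataset["time"].max()))
print(toy_dataset["data"].shape, toy_dataset["data"].size)
```

The same attribute and item lookups should work on `my_dataset`, since it is an ordinary `xarray.Dataset`.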

<sub>Towards the end of this notebook you will find out how to read in your data and transform it into an xarray.Dataset, but for now, let's continue.</sub>

## Plotting raw data

The pyglotaran framework has built-in functionality to create useful plots. These are provided by the `pyglotaran_extras` package. If you followed the installation instructions, you should already have it installed.

In there, we have a number of plotting functions we can import and use in our notebook. Let's start with `plot_data_overview`. 

In [None]:
from pyglotaran_extras import plot_data_overview

plot_data_overview(my_dataset);

Isn't it pretty? We see our 2D intensity map with the time coordinate (labeled Time (ps)) along the x-axis and the spectral coordinate (labeled Wavelength (nm)) along the y-axis.

Just below that we see our singular value decomposition (SVD) of the data matrix, with the first four (4) singular vectors plotted. 
- The leftmost plot (data. LSV) shows the left singular vectors, which represent the evolution along the time coordinate.
- The rightmost plot (data. RSV) shows the right singular vectors, which reflect the spectral shapes associated with that evolution.
- The middle plot shows the first 10 singular *values*, on a logarithmic scale. 

From this we can deduce the rank of our data matrix is three (3), which is how many decay components we'll need for our model later.

<details>
<summary><i>Q: What? How do you know that? And what's a S.V.D.?</i>

... A: (click here)</summary>

Very simply put, the Singular Value Decomposition (SVD) is a mathematical technique used to decompose a matrix (our data) into a number of left and right singular vectors and their corresponding 'weights' (the singular values). It allows us to quickly visualize the 'rank' of the matrix (which gives us a rough approximation of how many decay components we might need in our model to fit the data).

More precisely, the SVD decomposes a (data) matrix \(A\) into three matrices \(U\), \(Σ\), and \(V^T\) such that \(A = UΣV^T\). It is often used in signal processing and statistics to identify patterns in data. Here, \(U\) and \(V\) are orthogonal matrices, and \(Σ\) is a diagonal matrix containing the singular values. The left singular vectors (columns of \(U\)) describe the evolution along the time coordinate, while the right singular vectors (columns of \(V\)) describe the shapes along the spectral coordinate. The singular values in \(Σ\) provide a measure of the importance of each corresponding pair of singular vectors.

Put simply: if we take the first left singular vector (LSV), multiply it by the first singular value (SV), and then multiply by the first right singular vector (RSV), we obtain the first approximation of the data matrix. Repeating this process for the second and third singular vectors provides an even better approximation, especially since the rank of this particular matrix appears to be 3. By summing the products of all the left and right singular vectors, each weighted by their corresponding singular value, we can reconstruct the original data matrix.
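The reconstruction described above can be tried out with plain NumPy. This is only a sketch on hypothetical, noise-free toy data (three exponential decays multiplied by three Gaussian bands). The rates, band positions, and variable names are assumptions chosen for illustration and are unrelated to the simulated dataset used in this guide.

```python
import numpy as np

# Hypothetical toy data: 3 exponential decays (time) times 3 Gaussian bands (spectral)
time = np.linspace(0.0, 20.0, 200)
spectral = np.linspace(600.0, 700.0, 50)
decays = np.stack([np.exp(-k * time) for k in (0.5, 0.3, 0.1)], axis=1)  # (200, 3)
bands = np.stack(
    [np.exp(-(((spectral - c) / 10.0) ** 2)) for c in (620.0, 650.0, 680.0)], axis=0
)  # (3, 50)
data = decays @ bands  # (time x spectral), rank 3 by construction

# A = U Σ V^T; `s` holds the singular values (the diagonal of Σ)
U, s, Vt = np.linalg.svd(data, full_matrices=False)

# Count the singular values that stand clearly above numerical noise
rank = int(np.sum(s > 1e-10 * s[0]))


def rank_k_approximation(k):
    """Sum of the first k terms: LSV_i * SV_i * RSV_i."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# The approximation error shrinks as more components are included
errors = [np.linalg.norm(data - rank_k_approximation(k)) for k in (1, 2, 3)]
print(rank, errors)
```

With real (noisy) data the singular values never drop to zero, so the rank has to be judged from where they level off, which is what the logarithmic singular value plot is for.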

That's more than you need to know about Singular Value Decompositions at this time. 
</details>

## Starting a project

Once we have decided that our data is good enough to attempt to model it using `pyglotaran`, we can start our adventure.

To start using `pyglotaran` in your analysis, you only have to import the `Project` class and open a project (folder).

In [None]:
from glotaran.project import Project

my_project = Project.open("my_project")

If the project does not already exist, this will create a new project and its folder structure for
you. In our case only the `models` and `parameters` folders existed beforehand, so the `data` and
`results` folders were created when opening the project.

In [None]:
# This so-called shell command allows us to 'list' (ls) the content of the project folder.
%ls my_project

### Import the data into your project

As long as your data can be transformed into a well-structured (`time`, `spectral`, `data`) `xarray.Dataset` (or `xarray.DataArray`), you can import it directly into your project.

This will then save your data as a `NetCDF` (`.nc`) file in the data folder inside your project, under the name that you gave it (here `my_project/data/my_data.nc`).

In [None]:
my_project.import_data(my_dataset, dataset_name="my_data")

After importing, our `my_project` is aware of the data that we named `my_data`.

### Working with models

> **⚠️** 
> Please note that the *exact* way in which a model is defined may still change slightly in future versions of pyglotaran. But no worries, there will always be a clear procedure to upgrade any existing models you may have created in the meantime.

After importing our data into the project, to analyse them, we need a `model` (or analysis `scheme`).

If it does not already exist, create a file called `my_model.yaml` in your project's `models` folder and fill it with the following content.

> **📝**
> Don't let this file extension (`.yaml` or `.yml`) scare you, it's just another **plain text file**, which you can open with literally any text editor.

<sub>In our case the file already exists, so we can just *show* you the content, which you can then copy paste.</sub>

In [None]:
my_project.show_model_definition("my_model")
# copy the model definition from the output below

<sub>⬆️ copy into `models`/`my_model.yaml`</sub>

The above model reads (from bottom to top) as:
- We have a dataset named `my_data`, which we model with a single (kinetic) megacomplex `m1`, with initial_concentration vector `input` and instrument response function (IRF) `irf1`. 
- The IRF `irf1` is defined as being of type `gaussian`, with its center at `irf.center` and its width given by `irf.width`.
- The megacomplex `m1` is composed of just a single kinetic matrix `k1` of type `decay` (short for exponential decays).
- This k_matrix is composed of just 3 rate constants describing the sequential kinetic scheme: "s1->s2->s3->ground".
- It is sequential because the initial_concentration `input` defines all of the input (1) going to s1, and none of it going to s2 or s3.
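To make the sequential scheme "s1->s2->s3->ground" more concrete, the following sketch computes the concentrations of the three species over time with plain NumPy. It uses assumed, illustrative rate constants, ignores the IRF entirely, and is not meant to represent how pyglotaran evaluates a model internally.

```python
import numpy as np

# Illustrative rate constants (assumed values, not fitted results)
k1, k2, k3 = 0.5, 0.3, 0.1

# Rate matrix for s1 -> s2 -> s3 -> ground.
# Column j describes where the population of species j goes.
K = np.array(
    [
        [-k1, 0.0, 0.0],  # s1 decays with k1
        [k1, -k2, 0.0],  # s2 is fed by s1 and decays with k2
        [0.0, k2, -k3],  # s3 is fed by s2 and decays with k3 (to ground)
    ]
)

# Sequential input: all of the input (1) goes to s1, none to s2 or s3
c0 = np.array([1.0, 0.0, 0.0])


def concentrations(t):
    """Solve dc/dt = K c with c(0) = c0 via the eigendecomposition of K."""
    eigenvalues, eigenvectors = np.linalg.eig(K)
    weights = np.linalg.solve(eigenvectors, c0)
    # c(t) = sum_i weights_i * exp(eigenvalue_i * t) * eigenvector_i
    return (np.exp(np.outer(t, eigenvalues)) * weights) @ eigenvectors.T


time = np.linspace(0.0, 20.0, 201)
c = concentrations(time)  # shape (time, species)
print(c[0], c[-1])
```

Changing `c0` to put some of the input directly into s2 or s3 is what would make the scheme no longer purely sequential.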

You can check your model for problems with the `validate` method.

In [None]:
my_project.validate("my_model")

### Working with parameters

A model by itself is not sufficient; we also need *starting values* for the parameters we define in the model.

For this, we use a parameters file. Create a file called `my_parameters.yaml` in your `parameters` folder with the following content.

<sub>Again, in our case the file already exists, so we just *show* the content for you to copy.</sub>

In [None]:
my_project.show_parameters_definition("my_parameters")

This reads as: 
- There are two `input` parameters `input.1` with (fixed) value 1, and `input.0` with (fixed) value 0.
- There are 3 kinetic rates, with starting values `0.51`, `0.31`, `0.11`.
- There are 2 IRF related parameters `irf.center` with starting value `0.31` and `irf.width` with starting value `0.11`.

All parameters are implicitly 'free', unless specified with `{ "vary": False }` to be fixed.

You can also use the `validate` method to check for missing parameters.

In [None]:
my_project.validate("my_model", "my_parameters")

Since not all problems in the model can be detected automatically, it is wise to visually inspect the model.

For this purpose, you can just load the model and inspect its markdown rendered version.

In [None]:
my_project.load_model("my_model")

You should inspect your parameters in the same way.

In [None]:
my_project.load_parameters("my_parameters")

## Optimizing data

Now we have everything together to optimize our parameters.

In [None]:
result = my_project.optimize("my_model", "my_parameters")
result

What you see from the optimization 'log' is that the optimization took about 5 iterations to converge, going from a cost of 11.2k down to 7.56.

The Optimization Result table gives us a nice 'statistics' table and if we click on details we can view the optimized model.

Since we are satisfied that our fit has converged, we'll proceed to look at the outcome in more detail.

Speaking of outcomes, please note that each time you run an optimization the result will be saved in the project's results folder.

<sub>You may occasionally want to clean these up, especially with larger projects or datasets.</sub>

In [None]:
# To view the saved results, look at the content of the project's results folder
%ls "my_project/results"

One way to look at our results in more detail is to query the `optimized_parameters` object in the results.

This will also print (if it can be computed) the standard error for each *estimated* parameter.

In [None]:
result.optimized_parameters

You can further inspect the data of the `result` by accessing the `data` attribute. In our example it only
contains a single `my_data` dataset, but it could contain many datasets in a multi-dataset analysis.

In [None]:
result.data

Finally, knowing that your optimization converged and that the optimized parameters seem reasonable is half the battle, but the real question is of course ... what does it **LOOK** like? Let's plot!

## Visualize the Result

The results can be visualized in a similar way as the dataset, using a function from the pyglotaran_extras part of the framework, in this case `plot_overview`. 

In [None]:
from pyglotaran_extras import plot_overview

plot_overview(result);

In this overview you see the following.

**first row**:
- `[left]`    The concentrations corresponding to the components in the kinetic scheme
- `[middle]`  The species associated spectra (SAS), in this case equivalent to the evolution associated spectra (EAS)
- `[right]`   The decay associated spectra (DAS)

**second row**:
- `[left]`    The residual matrix, with the IRF (dispersion) curve plotted on top
- `[middle]`  The normalized SAS
- `[right]`   The normalized DAS

**third row**:  the SVD of the *residual* matrix, with
- `[left]`   The first 2 components (black, red) of the left singular vectors (time)
- `[middle]` The first 2 components (black, red) of the right singular vectors (spectral)
- `[right]`  The first 10 singular values on a logarithmic scale. 

**fourth row**:  the SVD of the *data* matrix, with
- `[left]`   The first 4 components (black, red, blue, green) of the left singular vectors (time)
- `[middle]` The first 4 components (black, red, blue, green) of the right singular vectors (spectral)
- `[right]`  The first 10 singular values on a logarithmic scale. 

## Conclusion

In this guide we showed how to
- plot your data (once it's in the right format)
- start a project, import the data into it
- work with models (and a model `.yaml` file)
- specify starting values for our parameters (in a parameters `.yaml` file)
- optimize the parameters and inspect the result
- visualize the result

Welcome to the future of global and target analysis. We hope you like it here!

## And now for something completely different

Right, right. We were going to talk about _your_ data. That .csv file did end up burning a hole in your pocket, didn't it? 😉

Well, there is a reason we have standardized around [xarray](https://xarray.pydata.org/en/stable/)'s [Dataset](https://xarray.pydata.org/en/stable/api.html#dataset)s: we know what to expect.

We don't know that with `.csv`. 🙈

- Some people save something as .csv, but it is space or tab delimited. 
- Some people put in a header, of a single line, two lines, 4 lines. 
- Some people put in a footer, of *many* lines. 
- Some people pad their data matrix with a zero, some don't. 
- Some people include their spectral and time coordinates, some don't.
- Some do, but call it something completely different. 
- Some people use monotonically increasing coordinates (as one should), but some don't 😱.

In short, you can never be sure with `.csv`.
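A few of these uncertainties can at least be checked before trusting a file. The sketch below uses only the Python standard library to guess the delimiter and to verify that the coordinates are strictly increasing. The file content, the leading `wavelength` label in the header, and the helper name `is_strictly_increasing` are hypothetical, and `csv.Sniffer` is a heuristic that can fail on irregular files.

```python
import csv
import io

# Hypothetical file content: tab-delimited, despite being called ".csv"
raw_text = (
    "wavelength\t-2.0\t0.0\t2.0\n"
    "420\t0\t10\t15\n"
    "520\t1\t100\t25\n"
    "620\t0\t25\t15\n"
)

# Let the sniffer guess the delimiter from a few candidates
dialect = csv.Sniffer().sniff(raw_text, delimiters=",;\t ")
rows = list(csv.reader(io.StringIO(raw_text), dialect))

# First row holds the time points, first column the spectral coordinate
timepoints = [float(value) for value in rows[0][1:]]
wavelengths = [float(row[0]) for row in rows[1:]]


def is_strictly_increasing(values):
    """Return True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


print(repr(dialect.delimiter))
print(is_strictly_increasing(timepoints), is_strictly_increasing(wavelengths))
```

Checks like these do not replace looking at the file yourself, but they catch the most common surprises early.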

All of that being said ... if you have some fairly clean .csv data lying around that you want to import, you can use the pandas library to read the data and then create an xarray Dataset out of it.

How is left as an exercise to the reader, but we left some tips below.

<details>
<summary>Click to reveal tips!</summary>

Assume a csv file with the time points in the first row and the spectral coordinate (e.g. wavelengths) in the first column.

```csv
-2.0,0.0,2.0,10.0,100.0,1000.0
420,0,10,15,5,1,0
520,1,100,25,7,2,0
620,0,25,15,2,1,0
```

Then this bit of Python code could read in your data.

```py
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

filepath = Path(r"file_that_burned_a_hole_in_your_pocket.csv")

# Load the CSV file into a pandas DataFrame.
# The header row (time points) has one field fewer than the data rows,
# so pandas uses the first column (wavelengths) as the index.
df = pd.read_csv(filepath, delimiter=",")

# Convert index and columns to numeric; non-numeric labels become NaN
df.index = pd.to_numeric(df.index, errors="coerce")
df.columns = pd.to_numeric(df.columns, errors="coerce")

# Remove any rows or columns whose labels couldn't be converted to numeric
df = df.loc[df.index.notnull(), df.columns.notnull()]

# Extract the coordinates (assuming they were in the .csv file)
timepoints = np.asarray(df.columns, dtype=float)
wavelengths = np.asarray(df.index, dtype=float)

# The DataFrame is (spectral x time), so transpose it to (time x spectral)
dataset = xr.DataArray(
    df.values.T,
    dims=["time", "spectral"],
    coords={"time": timepoints, "spectral": wavelengths},
).to_dataset(name="data")
dataset
```

```py
# Pro tip: plot the data to check that it was read in correctly.
plot_data_overview(dataset, linlog=True);
```

</details>

At some point we will add a more robust and battle-tested `csv_to_dataset` function to the `pyglotaran_extras` package, but even then the above continues to hold true.