# Coarse-grain a simple transition matrix

To illustrate the application of *pyGPCCA* we will coarse-grain a simple transition matrix *P* calculated from a toy adjacency matrix *W* found in *[S. Roeblitz and M. Weber, Fuzzy spectral clustering by PCCA+: application to Markov state models and data classification. Adv Data Anal Classif 7, 147-179 (2013)](https://doi.org/10.1007/s11634-013-0134-6)*. 

First, we import the required packages, ``numpy`` and ``pygpcca``:

In [1]:
import numpy as np
import pygpcca as gp

Next, we define the aforementioned adjacency matrix *W* and calculate a row-stochastic (meaning that the rows of *P* each sum up to one) transition matrix *P* from it:

In [2]:
# Choose the perturbation mu of the adjacency matrix.
mu = 100
W = np.array(
        [
            [1000, 100, 100, 10, 0, 0, 0, 0, 0],
            [100, 1000, 100, 0, 0, 0, 0, 0, 0],
            [100, 100, 1000, 0, mu, 0, 0, 0, 0],
            [10, 0, 0, 1000, 100, 100, 10, 0, 0],
            [0, 0, mu, 100, 1000, 100, 0, 0, 0],
            [0, 0, 0, 100, 100, 1000, 0, mu, 0],
            [0, 0, 0, 10, 0, 0, 1000, 100, 100],
            [0, 0, 0, 0, 0, mu, 100, 1000, 100],
            [0, 0, 0, 0, 0, 0, 100, 100, 1000],
        ],
        dtype=np.float64,
    )
# Only make non-zero rows stochastic, otherwise we might divide by zero later.
# (This is just a general precaution and not explicitly necessary in the special case considered here.)
row = np.sum(W, axis=1) > 0.0001
P = W.copy()
W_ = W[row, :]
# Calculate the transition matrix from W.
P[row, :] = np.diag(1.0 / np.sum(W_, axis=1)) @ W_
P

array([[0.82644628, 0.08264463, 0.08264463, 0.00826446, 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.08333333, 0.83333333, 0.08333333, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.07692308, 0.07692308, 0.76923077, 0.        , 0.07692308,
        0.        , 0.        , 0.        , 0.        ],
       [0.00819672, 0.        , 0.        , 0.81967213, 0.08196721,
        0.08196721, 0.00819672, 0.        , 0.        ],
       [0.        , 0.        , 0.07692308, 0.07692308, 0.76923077,
        0.07692308, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.07692308, 0.07692308,
        0.76923077, 0.        , 0.07692308, 0.        ],
       [0.        , 0.        , 0.        , 0.00826446, 0.        ,
        0.        , 0.82644628, 0.08264463, 0.08264463],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.07692308, 0.07692308, 0.76923077, 0.07692308],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.08333333, 0.08333333, 0.83333333]])

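The row normalization above can also be written with broadcasting instead of an explicit diagonal matrix. The following is a minimal sketch on a hypothetical 3x3 weight matrix (`W_demo` and its values are made up for illustration and are not part of the example above); it includes an all-zero row to show why the mask is useful.

```python
import numpy as np

# Hypothetical 3x3 weight matrix, used only for illustration;
# its last row is all zeros to show why the mask is needed.
W_demo = np.array(
    [
        [10.0, 1.0, 0.0],
        [1.0, 10.0, 1.0],
        [0.0, 0.0, 0.0],
    ]
)

row_sums = W_demo.sum(axis=1)
nonzero = row_sums > 0.0001

# Divide each non-zero row by its sum via broadcasting;
# zero rows are left untouched.
P_demo = W_demo.copy()
P_demo[nonzero] = W_demo[nonzero] / row_sums[nonzero, None]

# Every normalized row should now sum to one.
print(np.allclose(P_demo[nonzero].sum(axis=1), 1.0))
```

The same kind of check, `np.allclose(P.sum(axis=1), 1.0)`, can be applied to the matrix *P* computed above to confirm that it is row-stochastic.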

Following this, we initialize a *GPCCA* object from the transition matrix *P*:

In [3]:
gpcca = gp.GPCCA(P)

Next, we compute the *minChi* values for each number of macrostates *m* in the interval *\[2, 8\]*. These values help determine an interval *\[m_min, m_max\]* of (nearly) optimal numbers of macrostates for clustering:

In [4]:
gpcca.minChi(2, 8)

[-1.4898919131105394e-15,
 -0.027309451028471734,
 -0.7684172018023119,
 -0.6014286769355208,
 -0.9999999999998638,
 -0.3755280421712998,
 -0.8142820838213831]

This might result in a warning:

```
UserWarning: The Schur vectors aren't D-orthogonal so they are D-orthogonalized.
warnings.warn("The Schur vectors aren't D-orthogonal so they are D-orthogonalized.")
```
  
This warning is harmless and arises for purely technical reasons.
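For context, a set of vectors *X* is commonly called D-orthogonal (more precisely, D-orthonormal) when *X^T D X* is the identity for a diagonal weight matrix *D*. The sketch below illustrates this property with plain ``numpy``. It is not the routine used inside *pyGPCCA*: the uniform weight vector `eta`, the random matrix `X`, and the QR-based construction are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: a weight vector eta (here uniform) and an
# arbitrary full-rank n x m matrix X standing in for Schur vectors.
n, m = 9, 3
eta = np.full(n, 1.0 / n)
X = rng.standard_normal((n, m))

D = np.diag(eta)
sqrt_eta = np.sqrt(eta)

# Orthonormalize sqrt(D) @ X in the ordinary sense, then undo the scaling.
Q, _ = np.linalg.qr(sqrt_eta[:, None] * X)
X_d = Q / sqrt_eta[:, None]

# X_d is D-orthonormal: X_d.T @ D @ X_d equals the identity.
print(np.allclose(X_d.T @ D @ X_d, np.eye(m)))
```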

The *minChi* criterion states that cluster numbers *m* (i.e. clustering into *m* clusters) with a *minChi* value close to zero will potentially result in an optimal (meaning especially *crisp* or sharp) clustering.
Here, only *m*=3 (with a *minChi* value of about -0.027) qualifies as non-trivially (potentially) optimal, since *m*=2 is always (trivially) optimal.
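As a rough sketch of how this criterion could be applied programmatically, the snippet below pairs the *minChi* values printed above with their cluster numbers and keeps those whose absolute value lies below a tolerance. The tolerance `tol` is a hypothetical choice made for this illustration; *pyGPCCA* does not prescribe one here.

```python
# minChi values as printed above, for m = 2, ..., 8.
min_chi = [
    -1.4898919131105394e-15,
    -0.027309451028471734,
    -0.7684172018023119,
    -0.6014286769355208,
    -0.9999999999998638,
    -0.3755280421712998,
    -0.8142820838213831,
]
m_values = range(2, 9)

# Hypothetical tolerance for "close to zero"; adjust as needed.
tol = 0.05

candidates = [m for m, chi in zip(m_values, min_chi) if abs(chi) < tol]
print(candidates)  # [2, 3]
```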

Now, we would typically optimize the clustering for numbers of macrostates *m* in the previously determined interval *\[m_min, m_max\]* and find the optimal number of macrostates *n_metastable* within it. Since this interval contains only a single value here, i.e. \[3\], we will instead optimize the clustering over the whole range of possible cluster numbers to see what happens:

In [5]:
gpcca.optimize({'m_min':2, 'm_max':8})

GPCCA[n=9, n_metastable=3]

The call returns the optimized *GPCCA* object, whose properties we can now access.

The optimal number of macrostates *n_metastable* can be accessed via:

In [6]:
gpcca.n_metastable

3

The optimal number of clusters or macrostates is *n_metastable*=3 as expected.

The optimal coarse-grained matrix can be accessed via:

In [7]:
gpcca.coarse_grained_transition_matrix

array([[ 0.95616082,  0.02191723,  0.02192196],
       [ 0.0241382 ,  0.97795318, -0.00209138],
       [ 0.02413837, -0.00208985,  0.97795148]])

The memberships are available via:

In [8]:
gpcca.memberships

array([[4.09710517e-02, 2.30713891e-02, 9.35957559e-01],
       [3.29275691e-06, 2.73491397e-02, 9.72647568e-01],
       [2.48693197e-01, 1.18222120e-17, 7.51306803e-01],
       [9.57170619e-01, 2.14386162e-02, 2.13907652e-02],
       [7.99131815e-01, 2.66020686e-05, 2.00841583e-01],
       [7.99131115e-01, 2.00868885e-01, 0.00000000e+00],
       [4.09678719e-02, 9.35885841e-01, 2.31462868e-02],
       [2.48690579e-01, 7.51262274e-01, 4.71470082e-05],
       [0.00000000e+00, 9.72570489e-01, 2.74295114e-02]])
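Each row of the membership matrix gives the degree to which one of the nine states belongs to each of the three macrostates. As a minimal sketch using the values printed above (the names `chi` and `crisp` are chosen here for illustration), the rows can be checked to sum to one, and a crisp assignment can be obtained by taking the macrostate with the largest membership per state.

```python
import numpy as np

# Membership matrix as printed above (9 states x 3 macrostates).
chi = np.array(
    [
        [4.09710517e-02, 2.30713891e-02, 9.35957559e-01],
        [3.29275691e-06, 2.73491397e-02, 9.72647568e-01],
        [2.48693197e-01, 1.18222120e-17, 7.51306803e-01],
        [9.57170619e-01, 2.14386162e-02, 2.13907652e-02],
        [7.99131815e-01, 2.66020686e-05, 2.00841583e-01],
        [7.99131115e-01, 2.00868885e-01, 0.00000000e+00],
        [4.09678719e-02, 9.35885841e-01, 2.31462868e-02],
        [2.48690579e-01, 7.51262274e-01, 4.71470082e-05],
        [0.00000000e+00, 9.72570489e-01, 2.74295114e-02],
    ]
)

# The memberships of each state sum to one.
print(np.allclose(chi.sum(axis=1), 1.0))

# Crisp assignment: index of the largest membership in each row.
crisp = chi.argmax(axis=1)
print(crisp.tolist())  # [2, 2, 2, 0, 0, 0, 1, 1, 1]
```

The crisp assignment groups the states into three blocks of three, which matches the block structure of the adjacency matrix *W*.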

Many more properties can be accessed; they are listed in the <a href="https://pygpcca.readthedocs.io/en/latest/api/pygpcca.GPCCA.html" target="_blank">API documentation</a>.