# Python HIV Dataset Simulation

This notebook attempts to recreate my original R/C++ implementation for simulating an HIV patient dataset. The original code in R/C++ can be found [here](https://github.com/pennystack/Simulation-For-HIV-Dataset).

<br>

#### Import libraries

Import the libraries needed for the data simulation and estimation. **`hiv_smm`** and **`load_functions`** are custom libraries that I developed.

- **`hiv_smm`** implements optimized C++ versions of the core functions used in the patient dataset simulation, which speeds up the computations.
- **`load_functions`** handles the optimization of the likelihood function, using **JAX** for efficient gradient calculations.
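
As a rough illustration of the pattern described in the second bullet, the sketch below pairs `jax.grad` with `scipy.optimize.minimize`. The objective `neg_loglik` is a hypothetical stand-in (a Weibull negative log-likelihood on synthetic data), and the `fun_wrapper` and `jac_wrapper` defined here are illustrative re-implementations, not the contents of `load_functions`.

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.optimize import minimize

# Use 64-bit floats so SciPy and JAX agree on precision (set before creating arrays)
jax.config.update("jax_enable_x64", True)

# Synthetic sojourn times from a Weibull(shape=1.5, scale=2.0); illustrative data only
rng = np.random.default_rng(0)
x = jnp.asarray(2.0 * rng.weibull(1.5, size=2000))

def neg_loglik(params, x):
    """Hypothetical objective: Weibull negative log-likelihood, log-parameterized."""
    shape, scale = jnp.exp(params[0]), jnp.exp(params[1])
    z = x / scale
    loglik = jnp.log(shape) - jnp.log(scale) + (shape - 1.0) * jnp.log(z) - z**shape
    return -jnp.sum(loglik)

# Compile the objective and its gradient (with respect to params) once
value_fn = jax.jit(neg_loglik)
grad_fn = jax.jit(jax.grad(neg_loglik))

def fun_wrapper(params, x):
    # SciPy expects a plain Python float
    return float(value_fn(jnp.asarray(params), x))

def jac_wrapper(params, x):
    # SciPy expects a NumPy float64 array
    return np.asarray(grad_fn(jnp.asarray(params), x), dtype=np.float64)

result = minimize(fun_wrapper, x0=np.zeros(2), jac=jac_wrapper, args=(x,), method="BFGS")
shape_hat, scale_hat = np.exp(result.x)
print(shape_hat, scale_hat)
```

Supplying an exact gradient through `jac` spares BFGS from approximating it by finite differences, which is where most of the speed-up in this kind of setup tends to come from.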

In [7]:
import pyreadr
import os
import jax
import jax.numpy as jnp
from jax import jit, vmap
import numpy as np
from scipy.optimize import minimize
import pandas as pd
from tkinter import Tk, simpledialog
import logging
import time
import scipy.stats as st
from scipy.stats import weibull_min

# Custom C++ library (core simulation and log-likelihood functions)
import hiv_smm 

# My library for the optimization
import load_functions

<br>

#### Load the .RData files 

Load the parameter estimates obtained from the real dataset and prepare the $v_{ij}$, $s_{ij}$, $a_{ij}$, $b_{ij}$ matrices for use in the data simulation.

In [8]:
root = Tk()  # Hidden root window that owns the input dialog
root.withdraw()

# Where are the .RData files stored?
folder_path = simpledialog.askstring(
    "User Input", 
    "Please enter the path where you have stored the .RData files: ", 
    parent = root
)

aij_path = os.path.join(folder_path, 'aij.RData')
bij_path = os.path.join(folder_path, 'bij.RData')
vij_path = os.path.join(folder_path, 'vij.RData')
sij_path = os.path.join(folder_path, 'sij.RData')

# Load them with pyreadr
aij_rdata = pyreadr.read_r(aij_path)
bij_rdata = pyreadr.read_r(bij_path)
vij_rdata = pyreadr.read_r(vij_path)
sij_rdata = pyreadr.read_r(sij_path)

# Define the number of states - here we are examining the 4 state model
nstates = 4

# Convert them to a one-dimensional vector
aij = aij_rdata['aij'].to_numpy().T.flatten()
bij = bij_rdata['bij'].to_numpy().T.flatten()
vij = vij_rdata['vij'].to_numpy()
sij = sij_rdata['sij'].to_numpy()

# Scaling parameter to control the initial parameters for the estimation - always set to 1
parscale = 1

root.destroy()
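
R stores matrices in column-major order, whereas NumPy flattens row by row by default. The short sketch below uses a made-up matrix rather than the real estimates to show that `.T.flatten()` yields the column-major vector. The index formula in the last lines describes where element `(i, j)` lands in that vector; whether `hiv_smm` reads the vector this way is an assumption.

```python
import numpy as np

nstates = 4
# Toy matrix whose entries encode their own position: value = 10 * row + column
m = np.array([[10 * i + j for j in range(nstates)] for i in range(nstates)])

flat_col_major = m.T.flatten()       # what the cell above does for aij and bij
flat_alt = m.flatten(order="F")      # equivalent, and states the intent explicitly

# In a column-major vector, element (i, j) sits at position j * nstates + i
i, j = 2, 1
print(flat_col_major[j * nstates + i])  # 21, i.e. m[2, 1]
```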

<br>

#### Number of bootstrap simulations

Set the number of bootstrap data simulations and estimations to produce.  

- For statistically valid results, use 500+ samples.  
- For quick tests, you can use 10–20 samples (much faster).

In [9]:
root = Tk()
root.withdraw()

# How many samples of simulated data do you want to produce?
n_bootstrap_str = simpledialog.askstring(
    "User Input", 
    "Please enter the number of simulated data \n (it is advised to have a minimum number of 500 simulations):", 
    parent = root
)

# Validate the input: the dialog returns None on cancel, and int() fails on non-numeric text
try:
    n_bootstrap = int(n_bootstrap_str)
except (TypeError, ValueError):
    n_bootstrap = 0

if n_bootstrap <= 0:
    print("-" * 30)
    print("No valid number entered.")
    print("-" * 30)

else:
    print("-" * 30)
    print(f"You entered: {n_bootstrap}") 
    print("-" * 30)

root.destroy()

------------------------------
You entered: 10
------------------------------


<br>

#### Bootstrapping samples

Run the HIV dataset simulation and estimation, based on the number of bootstrap samples you specified above.
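
Before the full loop, it may help to see the simulation of a single patient in isolation. The sketch below is a simplified stand-in, not the notebook's exact logic: it replaces the time-dependent probabilities from `hiv_smm.cpp_p` with a fixed, hypothetical transition matrix `P`, uses constant made-up Weibull `shape` and `scale` matrices, and draws each sojourn time for the pair (current state, next state). The function name `simulate_patient` is introduced here for illustration only.

```python
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(42)
nstates = 4
C = 12.0  # follow-up horizon, as in the notebook

# Hypothetical inputs: uniform off-diagonal transition probabilities and
# constant Weibull parameters (the notebook uses estimated matrices instead)
P = np.ones((nstates, nstates))
np.fill_diagonal(P, 0.0)
P = P / P.sum(axis=1, keepdims=True)   # normalize each row to sum to 1
shape = np.full((nstates, nstates), 1.2)
scale = np.full((nstates, nstates), 3.0)
initial_dist = np.array([0.04, 0.02, 0.78, 0.16])

def simulate_patient(rng):
    """Simulate states and observation times until the total time reaches C."""
    states = [int(rng.choice(nstates, p=initial_dist))]
    obstimes = [0.0]
    sojourns = []
    while obstimes[-1] < C:
        current = states[-1]
        nxt = int(rng.choice(nstates, p=P[current]))
        soj = float(weibull_min.rvs(c=shape[current, nxt],
                                    scale=scale[current, nxt],
                                    random_state=rng))
        states.append(nxt)
        sojourns.append(soj)
        obstimes.append(obstimes[-1] + soj)
    return states, obstimes, sojourns

states, obstimes, sojourns = simulate_patient(rng)
print(len(states), round(obstimes[-1], 2))
```

Passing a seeded `numpy.random.Generator` through `random_state` keeps the state sampling and the Weibull draws reproducible from a single seed.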

In [10]:
# Create an empty list to store the estimated parameters from each bootstrap sample
optimized_params_list = []

# Calculate the initial distribution of the 4 states in the real data set for sampling the first patient.
# For confidentiality reasons, the initial state distribution is not calculated here. Instead, I will display only the final probabilities.
initial_dist = np.array([0.04, 0.02, 0.78, 0.16])

# Calculate the number of patients from the real data set and produce the same number of simulated patients. For confidentiality reasons,
# the number of patients from the real data set is not calculated here. Instead, I will display only the final number of total patients.
n_simulations = 5932


# Start the bootstrap simulation and estimation
n_estim = 1
while n_estim <= n_bootstrap:
    # Set up the logging configuration for the dataset simulation
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logging.info(f'Starting simulation {n_estim}')
    start_sim_time = time.time()
    
    # Create an empty list to collect the simulated results (converted to a data frame later)
    results_table = []
    
    for w in range(n_simulations):
        
        states_sample = np.arange(nstates)
    
        # Set the initial time and current state of the patient
        t = 0
        current_state = np.random.choice(a = states_sample, size = 1, p = initial_dist)[0].item()
    
        # Create the empty sojourn times vector to store the simulated deltaobstime
        sojourn_times = []
    
        # Create the empty observation times vector to store the simulated obstime
        observation_times = []
        observation_times.append(t)
    
        # Create the empty states vector to store the simulated states
        states = [current_state]
        
        patients = [w]
    
        C = 12
        iteration = 1
        
        # Start the loop for the data simulation
        while iteration == 1 or np.sum(sojourn_times) < C:
        
            patients.append(w)
            current_state = states[iteration - 1]
            t = observation_times[iteration - 1]
        
            # Calculate the probability matrix Pij based on the current state of the patient
            P = np.zeros((nstates, nstates))
            for i in range(nstates):
                for j in range(nstates):
                    if i != j:
                        P[i,j] = hiv_smm.cpp_p(i, j, aij, bij, t, nstates)
            
            for i in range(nstates):
                row_sum = P[i, :].sum()
                if row_sum > 0:
                    P[i, :] = P[i, :] / row_sum
                        
            
            # Sample the next_state from the probability matrix
            prob = np.nan_to_num(P[current_state, :], nan = 0.0).flatten()
            prob = prob / np.sum(prob)
    
            next_state = np.random.choice(a = states_sample, size = 1, p = prob).item()
            states.append(next_state)
        
            
            # Sample the sojourn time (x) from weibull distribution
            if iteration == 1:
                w_shape = vij[states[0], states[1]]
                w_scale = sij[states[0], states[1]]
            else:
                w_shape = vij[states[iteration - 2], states[iteration - 1]]
                w_scale = sij[states[iteration - 2], states[iteration - 1]]
            
            soj_time = weibull_min.rvs(c = w_shape, scale = w_scale, size = 1)[0].item()
            sojourn_times.append(soj_time)
        
            
            # Calculate the observation_times (t)
            obs_time = np.sum(sojourn_times[:iteration]).item()
            observation_times.append(obs_time)
            iteration += 1
            

        data = {
            'PATIENT': patients,
            'state': states,
            'obstime': observation_times,
            'deltaobstime': np.append(sojourn_times, np.nan)
        }
        
        results_table.append(data)
    
    
    results_table = pd.DataFrame(results_table)
    
    # Unnest the DataFrame
    results_table = results_table.explode(['PATIENT', 'state', 'obstime', 'deltaobstime'])
    
    # Ensure all columns have the correct data type
    results_table = results_table.astype({'PATIENT': int, 'state': int, 'obstime': float, 'deltaobstime': float})
    
    # Calculate elapsed simulation time
    end_sim_time = time.time()
    elapsed_sim_time = end_sim_time - start_sim_time
    sim_minutes = int(elapsed_sim_time // 60)
    sim_seconds = int(elapsed_sim_time % 60)
    
    logging.info(f'Finished! The simulation {n_estim} of the dataset lasted {sim_minutes} minutes and {sim_seconds} seconds.')
    
    
    # Prepare the results_table for the simulated data parameter estimation
    results_table['state_prev'] = results_table.groupby('PATIENT')['state'].shift(1)
    results_table['state_prev'] = results_table['state_prev'].astype('Int64')
    results_table['state_next'] = results_table.groupby('PATIENT')['state'].shift(-1)
    results_table['state_next'] = results_table['state_next'].astype('Int64')
    results_table['death'] = 0
    results_table = results_table.reset_index(drop=True)
    results_table['rowpos'] = results_table.index
    patient_counts = results_table.groupby('PATIENT')['PATIENT'].transform('size')
    results_table_c = results_table[patient_counts > 1]
    

    # --------------------------------------------------------------------------------------------------
    # -------------------------------------- Start the estimation --------------------------------------
    # --------------------------------------------------------------------------------------------------
    
    # Set up the logging configuration for the estimation
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logging.info(f'Starting estimation of the simulated dataset {n_estim}')
    start_est_time = time.time()
    
    # Define the initial parameters to start the estimation - They should be the estimated parameters from the real data set.
    # Load the RData files again
    aij_r = pyreadr.read_r(aij_path)
    bij_r = pyreadr.read_r(bij_path)
    vij_r = pyreadr.read_r(vij_path)
    sij_r = pyreadr.read_r(sij_path)
    
    aij_s = jnp.array(aij_r['aij'].to_numpy().T.flatten())
    bij_s = jnp.array(bij_r['bij'].to_numpy().T.flatten())
    vij_s = jnp.array(vij_r['vij'].to_numpy().T.flatten())
    sij_s = jnp.array(sij_r['sij'].to_numpy().T.flatten())
    
    # Put the parameters into one vector for the optimization
    params_s = np.concatenate((vij_s, sij_s, aij_s, bij_s))
    
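    # Enable double-precision (64-bit) floating point for numerical accuracy in the optimization.
    # Note: this affects only JAX arrays created afterwards, so it is best set once at the top of the notebook.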
    jax.config.update("jax_enable_x64", True)

    # Start the estimation of the parameters. Optimize the loglikelihood function
    result = minimize(
        fun=load_functions.fun_wrapper,
        x0=params_s,
        jac=load_functions.jac_wrapper,
        args=(results_table_c, nstates, parscale),
        method='BFGS',
        options={'maxiter': 20000}
    )

            
    # Get the optimized parameters:
    optimized_params = result.x
    print(f'Sample {n_estim} estimated parameters: \n, {optimized_params} \n')

    # The maximum log-likelihood value is the negative of the final fun value:
    max_log_likelihood = -result.fun
    print(f'Sample {n_estim} loglikelihood value: \n, {max_log_likelihood}, \n')

    # Append the results to the lists
    optimized_params_list.append(optimized_params)
    
    # Calculate elapsed time
    end_est_time = time.time()
    elapsed_est_time = end_est_time - start_est_time
    est_minutes = int(elapsed_est_time // 60)
    est_seconds = int(elapsed_est_time % 60)

    
    logging.info(f'Finished! The estimation of the simulated dataset {n_estim} lasted {est_minutes} minutes and {est_seconds} seconds.')

    # Start the next iteration
    n_estim += 1


2025-08-20 13:17:05,042 - INFO - Starting simulation 1
2025-08-20 13:17:11,045 - INFO - Finished! The simulation 1 of the dataset lasted 0 minutes and 6 seconds.
2025-08-20 13:17:11,064 - INFO - Starting estimation of the simulated dataset 1
2025-08-20 13:17:14,013 - INFO - Finished! The estimation of the simulated dataset 1 lasted 0 minutes and 2 seconds.
2025-08-20 13:17:14,015 - INFO - Starting simulation 2


Sample 1 estimated parameters: 
, [ 0.         -0.22699688 -0.550939    0.12931457 -0.36647759  0.
  0.15552793  0.17895834 -0.3996836  -0.11226751  0.          0.24436148
 -0.35006004 -0.25513462  0.14023941  0.          0.          2.18390697
  3.17925922  0.31103771  6.02196642  0.         -0.49995927  0.85072851
  6.59555232  3.33831967  0.          1.22602598  8.89358146  2.0708593
  0.01420107  0.          0.         -0.04374185  0.09274669 -0.42104274
 -0.13250788  0.         -0.42865104  0.18690134 -0.09563574  0.15060573
  0.          0.19        0.13       -0.01        0.06        0.
  0.          4.09528778  2.57990277 -1.89298802  5.6171439   0.
 -0.11416563  1.75605837  2.0435913  -4.34945879  0.          0.18
 -5.45       -2.62       -2.41        0.        ] 

Sample 1 loglikelihood value: 
, -60654.287768847345, 



2025-08-20 13:17:20,020 - INFO - Finished! The simulation 2 of the dataset lasted 0 minutes and 6 seconds.
2025-08-20 13:17:20,041 - INFO - Starting estimation of the simulated dataset 2
2025-08-20 13:17:23,027 - INFO - Finished! The estimation of the simulated dataset 2 lasted 0 minutes and 2 seconds.
2025-08-20 13:17:23,028 - INFO - Starting simulation 3


Sample 2 estimated parameters: 
, [ 0.         -0.20538851 -0.56926481  0.08857672 -0.37519111  0.
  0.31024982  0.13096517 -0.40199736 -0.16924939  0.          0.22791825
  0.03846323 -0.24919452  0.05472016  0.          0.          2.17608492
  2.93499685  0.15319079  5.97867658  0.         -0.45087752  0.90566209
  5.99568149  2.19469489  0.          1.43971528  6.05969996  2.08965865
 -0.06584458  0.          0.         -0.05321993  0.10933243 -0.41098062
 -0.16077634  0.         -0.42860773  0.16591099 -0.10649074  0.17864244
  0.          0.19        0.13       -0.01        0.06        0.
  0.          3.95759325  2.64174988 -1.82293905  5.60974947  0.
 -0.03209516  1.7743082   2.00981374 -4.02356573  0.          0.18
 -5.45       -2.62       -2.41        0.        ] 

Sample 2 loglikelihood value: 
, -60737.2268397445, 



2025-08-20 13:17:29,182 - INFO - Finished! The simulation 3 of the dataset lasted 0 minutes and 6 seconds.
2025-08-20 13:17:29,204 - INFO - Starting estimation of the simulated dataset 3
2025-08-20 13:17:32,606 - INFO - Finished! The estimation of the simulated dataset 3 lasted 0 minutes and 3 seconds.
2025-08-20 13:17:32,607 - INFO - Starting simulation 4


Sample 3 estimated parameters: 
, [ 0.00000000e+00 -2.41870199e-01 -5.58015822e-01  1.96799305e-01
 -3.63222374e-01  0.00000000e+00  3.18988592e-01  1.65378337e-01
 -4.00721973e-01 -1.84389861e-01  0.00000000e+00  2.00202050e-01
 -2.30004777e-01 -2.14254588e-01  1.35607109e-01  0.00000000e+00
  0.00000000e+00  2.16473145e+00  2.74821704e+00  3.63367546e-01
  5.93495660e+00  0.00000000e+00 -4.02773531e-01  1.03017999e+00
  5.46472288e+00  2.05656082e+00  0.00000000e+00  1.42435305e+00
  8.69291337e+00  2.07373805e+00 -1.36865427e-01  0.00000000e+00
  0.00000000e+00  4.16039029e-03  1.03143509e-01 -4.65084811e-01
 -8.97456856e-02  0.00000000e+00 -4.98291874e-01  1.99063143e-01
 -6.73339617e-02 -4.80751859e-03  0.00000000e+00  1.90000000e-01
  1.30000000e-01 -1.00000000e-02  6.00000000e-02  0.00000000e+00
  0.00000000e+00  3.80536196e+00  2.68189663e+00 -2.04108803e+00
  5.04182099e+00  0.00000000e+00  3.86336748e-02  1.81368210e+00
  1.47660866e+00 -3.60184149e+00  0.00000000e+00  1.8000

2025-08-20 13:17:38,600 - INFO - Finished! The simulation 4 of the dataset lasted 0 minutes and 5 seconds.
2025-08-20 13:17:38,622 - INFO - Starting estimation of the simulated dataset 4
2025-08-20 13:17:41,432 - INFO - Finished! The estimation of the simulated dataset 4 lasted 0 minutes and 2 seconds.
2025-08-20 13:17:41,433 - INFO - Starting simulation 5


Sample 4 estimated parameters: 
, [ 0.00000000e+00 -2.17716524e-01 -5.60948016e-01  6.56851137e-02
 -3.72328713e-01  0.00000000e+00  3.08153655e-01  1.89314807e-01
 -3.34956380e-01 -1.70388443e-01  0.00000000e+00  2.55830916e-01
 -4.66105313e-01 -1.95536694e-01  1.02888579e-01  0.00000000e+00
  0.00000000e+00  2.10538075e+00  3.05400843e+00  1.66791477e-01
  6.03726985e+00  0.00000000e+00 -4.80303938e-01  9.25707701e-01
  5.66428858e+00  2.30940871e+00  0.00000000e+00  1.42026802e+00
  9.27008711e+00  1.88797215e+00 -1.75754575e-02  0.00000000e+00
  0.00000000e+00 -3.59463515e-02  1.12525890e-01 -4.03740125e-01
 -1.36013019e-01  0.00000000e+00 -4.36786311e-01  1.86526639e-01
 -1.61700185e-01  1.27755446e-01  0.00000000e+00  1.90000000e-01
  1.30000000e-01 -1.00000000e-02  6.00000000e-02  0.00000000e+00
  0.00000000e+00  3.92526495e+00  2.62720895e+00 -2.17816282e+00
  5.43774626e+00  0.00000000e+00 -6.52879810e-03  1.88650456e+00
  2.09828445e+00 -4.04509962e+00  0.00000000e+00  1.8000

2025-08-20 13:17:47,339 - INFO - Finished! The simulation 5 of the dataset lasted 0 minutes and 5 seconds.
2025-08-20 13:17:47,359 - INFO - Starting estimation of the simulated dataset 5
2025-08-20 13:17:50,169 - INFO - Finished! The estimation of the simulated dataset 5 lasted 0 minutes and 2 seconds.
2025-08-20 13:17:50,170 - INFO - Starting simulation 6


Sample 5 estimated parameters: 
, [ 0.00000000e+00 -2.23439127e-01 -5.41527176e-01  3.44078661e-01
 -3.77836893e-01  0.00000000e+00  1.68599175e-01  1.24478739e-01
 -3.19917225e-01 -2.75408163e-01  0.00000000e+00  2.91441339e-01
 -2.02426158e-01 -2.14775952e-01  1.92463220e-01  0.00000000e+00
  0.00000000e+00  2.23485371e+00  3.13728321e+00  1.68839870e-01
  6.04030652e+00  0.00000000e+00 -3.45943573e-01  8.35846811e-01
  6.81602595e+00  2.07151337e+00  0.00000000e+00  1.56368635e+00
  8.14834544e+00  1.81555646e+00 -1.23159960e-01  0.00000000e+00
  0.00000000e+00 -4.26041375e-03  9.72297391e-02 -6.03870846e-01
 -1.22796816e-01  0.00000000e+00 -4.76001039e-01  2.65507047e-01
 -1.03127164e-01  2.67305329e-02  0.00000000e+00  1.90000000e-01
  1.30000000e-01 -1.00000000e-02  6.00000000e-02  0.00000000e+00
  0.00000000e+00  3.77165898e+00  2.68090116e+00 -1.97874930e+00
  5.36241355e+00  0.00000000e+00 -8.11871264e-02  1.76602407e+00
  1.93380082e+00 -3.60170975e+00  0.00000000e+00  1.8000

2025-08-20 13:17:56,099 - INFO - Finished! The simulation 6 of the dataset lasted 0 minutes and 5 seconds.
2025-08-20 13:17:56,120 - INFO - Starting estimation of the simulated dataset 6
2025-08-20 13:17:59,753 - INFO - Finished! The estimation of the simulated dataset 6 lasted 0 minutes and 3 seconds.
2025-08-20 13:17:59,754 - INFO - Starting simulation 7


Sample 6 estimated parameters: 
, [ 0.         -0.23279427 -0.5605001   0.14665075 -0.36403132  0.
  0.22467074  0.04719423 -0.28395428 -0.32168605  0.          0.13401498
 -0.1710135  -0.20802426  0.03636212  0.          0.          2.14565193
  3.07301524  0.28476302  6.19588399  0.         -0.44634056  0.85885583
  6.45424016  2.03740143  0.          1.37676178  5.57795674  1.52789826
 -0.12136033  0.          0.         -0.01445207  0.09463433 -0.43089682
 -0.10696848  0.         -0.39361515  0.19251435 -0.06426534  0.09121388
  0.          0.19        0.13       -0.01        0.06        0.
  0.          3.79396191  2.63357561 -2.34033633  5.36166704  0.
  0.03256661  1.97862901  1.71920357 -3.58676454  0.          0.18
 -5.45       -2.62       -2.41        0.        ] 

Sample 6 loglikelihood value: 
, -59872.55441356934, 



2025-08-20 13:18:05,883 - INFO - Finished! The simulation 7 of the dataset lasted 0 minutes and 6 seconds.
2025-08-20 13:18:05,903 - INFO - Starting estimation of the simulated dataset 7
2025-08-20 13:18:08,733 - INFO - Finished! The estimation of the simulated dataset 7 lasted 0 minutes and 2 seconds.
2025-08-20 13:18:08,734 - INFO - Starting simulation 8


Sample 7 estimated parameters: 
, [ 0.00000000e+00 -2.21668243e-01 -5.34573941e-01  2.19642502e-01
 -3.86908824e-01  0.00000000e+00  2.31999343e-01  4.40795729e-02
 -4.36303944e-01 -3.04973182e-01  0.00000000e+00  1.91991142e-01
 -4.30174210e-01 -2.19047783e-01  1.06712891e-01  0.00000000e+00
  0.00000000e+00  2.17673017e+00  2.87133556e+00  1.27279003e-01
  6.08063111e+00  0.00000000e+00 -5.34027677e-01  8.64465532e-01
  6.27740646e+00  2.00048860e+00  0.00000000e+00  1.36816943e+00
  4.81594544e+00  1.93167232e+00 -8.08665708e-04  0.00000000e+00
  0.00000000e+00 -2.02783150e-02  1.04976759e-01 -6.91258263e-01
 -9.90030149e-02  0.00000000e+00 -4.90153046e-01  2.94968515e-01
 -7.41598796e-02  5.79357118e-02  0.00000000e+00  1.90000000e-01
  1.30000000e-01 -1.00000000e-02  6.00000000e-02  0.00000000e+00
  0.00000000e+00  3.78280765e+00  2.58113477e+00 -2.02565674e+00
  5.31613494e+00  0.00000000e+00  1.44579243e-02  1.84507892e+00
  1.65917511e+00 -3.37096624e+00  0.00000000e+00  1.8000

2025-08-20 13:18:14,831 - INFO - Finished! The simulation 8 of the dataset lasted 0 minutes and 6 seconds.
2025-08-20 13:18:14,850 - INFO - Starting estimation of the simulated dataset 8
2025-08-20 13:18:17,831 - INFO - Finished! The estimation of the simulated dataset 8 lasted 0 minutes and 2 seconds.
2025-08-20 13:18:17,832 - INFO - Starting simulation 9


Sample 8 estimated parameters: 
, [ 0.00000000e+00 -2.17704729e-01 -5.57060255e-01  2.45925304e-01
 -3.64672777e-01  0.00000000e+00  2.75100952e-01  1.26496681e-01
 -2.90858639e-01 -7.48100986e-02  0.00000000e+00  1.29837678e-01
 -1.46179546e-01 -1.39538681e-01  1.51307038e-01  0.00000000e+00
  0.00000000e+00  2.15642977e+00  2.94016445e+00  3.09780893e-01
  6.02107905e+00  0.00000000e+00 -4.87124171e-01  7.96708150e-01
  6.27508447e+00  2.62664660e+00  0.00000000e+00  1.30067408e+00
  9.13800061e+00  2.60124705e+00 -8.79652030e-02  0.00000000e+00
  0.00000000e+00 -5.93040202e-02  1.14230913e-01 -5.10062292e-01
 -1.14106892e-01  0.00000000e+00 -4.58022513e-01  2.14545439e-01
 -6.47778145e-02  1.96064446e-01  0.00000000e+00  1.90000000e-01
  1.30000000e-01 -1.00000000e-02  6.00000000e-02  0.00000000e+00
  0.00000000e+00  4.11820699e+00  2.63874207e+00 -2.02731104e+00
  5.37514292e+00  0.00000000e+00 -6.21212411e-03  1.79594612e+00
  1.77144632e+00 -4.43236466e+00  0.00000000e+00  1.8000

2025-08-20 13:18:23,936 - INFO - Finished! The simulation 9 of the dataset lasted 0 minutes and 6 seconds.
2025-08-20 13:18:23,958 - INFO - Starting estimation of the simulated dataset 9
2025-08-20 13:18:27,168 - INFO - Finished! The estimation of the simulated dataset 9 lasted 0 minutes and 3 seconds.
2025-08-20 13:18:27,169 - INFO - Starting simulation 10


Sample 9 estimated parameters: 
, [ 0.00000000e+00 -1.97935528e-01 -5.21295678e-01  1.72779037e-01
 -3.79974169e-01  0.00000000e+00  2.66362612e-01  6.39505818e-02
 -2.63493643e-01 -3.47242081e-01  0.00000000e+00  2.70212255e-01
 -3.53009068e-01 -1.82102623e-01  3.69821559e-02  0.00000000e+00
  0.00000000e+00  2.26631199e+00  3.20449555e+00  1.98583784e-01
  5.93636913e+00  0.00000000e+00 -4.90204363e-01  8.54684771e-01
  6.55586655e+00  2.04022809e+00  0.00000000e+00  1.48952384e+00
  9.14214328e+00  2.28906377e+00 -1.56334174e-02  0.00000000e+00
  0.00000000e+00 -2.18350180e-02  1.01854485e-01 -1.17444024e+00
 -1.13754916e-01  0.00000000e+00 -4.91121669e-01  5.32476441e-01
 -8.25462818e-02  2.70421112e-02  0.00000000e+00  1.90000000e-01
  1.30000000e-01 -1.00000000e-02  6.00000000e-02  0.00000000e+00
  0.00000000e+00  3.71393647e+00  2.68558203e+00 -1.70144398e+00
  5.37154408e+00  0.00000000e+00  1.25585131e-03  1.68835696e+00
  1.77738446e+00 -3.08189704e+00  0.00000000e+00  1.8000

2025-08-20 13:18:33,265 - INFO - Finished! The simulation 10 of the dataset lasted 0 minutes and 6 seconds.
2025-08-20 13:18:33,285 - INFO - Starting estimation of the simulated dataset 10
2025-08-20 13:18:37,296 - INFO - Finished! The estimation of the simulated dataset 10 lasted 0 minutes and 4 seconds.


Sample 10 estimated parameters: 
, [ 0.         -0.2255907  -0.53023549 -0.03759049 -0.38251074  0.
  0.28187602  0.12142306 -0.30810934 -0.12065437  0.          0.22023985
 -0.40694304 -0.13464545  0.06830481  0.          0.          2.25834321
  2.9141372   0.48687807  5.7830414   0.         -0.48357999  0.77205559
  6.9325641   2.21800154  0.          1.32662504  6.0291172   2.41859962
  0.03864989  0.          0.         -0.03464031  0.11701633 -0.37175102
 -0.13691639  0.         -0.50251033  0.14576716 -0.10853963  0.13798135
  0.          0.19        0.13       -0.01        0.06        0.
  0.          3.74603223  2.59482273 -2.24230727  5.47206776  0.
 -0.15278314  1.91396099  1.94930729 -3.56365418  0.          0.18
 -5.45       -2.62       -2.41        0.        ] 

Sample 10 loglikelihood value: 
, -60667.765310810966, 
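
The reshaping applied to `results_table` in the cell above (unnesting the per-patient lists, then adding the previous and next state) can be seen more easily on a toy table. The two patients below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two made-up patients, one row each, with list-valued columns
nested = pd.DataFrame([
    {"PATIENT": [0, 0, 0], "state": [2, 1, 3], "obstime": [0.0, 4.5, 13.0],
     "deltaobstime": [4.5, 8.5, np.nan]},
    {"PATIENT": [1, 1], "state": [3, 0], "obstime": [0.0, 12.7],
     "deltaobstime": [12.7, np.nan]},
])

# One row per observation; exploding several columns at once needs pandas >= 1.3
long = nested.explode(["PATIENT", "state", "obstime", "deltaobstime"])
long = long.astype({"PATIENT": int, "state": int,
                    "obstime": float, "deltaobstime": float})
long = long.reset_index(drop=True)

# Previous and next state within each patient; missing at the patient boundaries
long["state_prev"] = long.groupby("PATIENT")["state"].shift(1).astype("Int64")
long["state_next"] = long.groupby("PATIENT")["state"].shift(-1).astype("Int64")
print(long)
```

The nullable `Int64` dtype keeps the state columns integer-valued while still allowing missing entries for the first and last row of each patient.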



<br>

#### Arrange the estimated parameters

Build the $v_{ij}$, $s_{ij}$, $a_{ij}$, $b_{ij}$ parameter lists from the simulated data. These will be used to calculate some descriptive statistics for each parameter.

In [11]:
vij_list = []
sij_list = []
aij_list = []
bij_list = []

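# Each result vector holds four blocks of nstates**2 values, in the order vij, sij, aij, bij
part_size = nstates**2
expected_length = 4 * part_size

# Iterate through each array in the input list
for params in optimized_params_list:

    # Guard against a result vector of unexpected length, which would be sliced incorrectly
    if len(params) != expected_length:
        raise ValueError(f"Expected {expected_length} parameters, got {len(params)}")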

    # vij
    vij = np.abs(params[0:part_size].reshape(nstates, nstates)) / parscale

    # sij
    sij = np.abs(params[part_size:2*part_size].reshape(nstates, nstates)) / parscale

    # aij
    aij = params[2*part_size:3*part_size].reshape(nstates, nstates)

    # bij
    bij = params[3*part_size:4*part_size].reshape(nstates, nstates)
    bij[nstates - 1, nstates - 2] = 1 - np.sum(bij[nstates - 1, 0:nstates - 2])
    bij[:, nstates - 1] = 1 - np.sum(bij[:, 0:nstates - 1], axis=1)

    # Append the resulting matrices to their respective lists
    vij_list.append(vij)
    sij_list.append(sij)
    aij_list.append(aij)
    bij_list.append(bij)
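
The slicing in the cell above relies on each parameter vector being laid out as four consecutive blocks of `nstates**2` values. A compact sketch of that split follows, using a hypothetical helper `split_params` and a dummy vector. It deliberately omits the absolute values and the `bij` row adjustments applied in the notebook.

```python
import numpy as np

def split_params(params, nstates):
    """Split a flat vector into vij, sij, aij, bij matrices (hypothetical helper)."""
    params = np.asarray(params, dtype=float)
    if params.size != 4 * nstates**2:
        raise ValueError(f"expected {4 * nstates**2} values, got {params.size}")
    # np.split returns views; copy so later edits do not touch the input vector
    vij, sij, aij, bij = (block.reshape(nstates, nstates).copy()
                          for block in np.split(params, 4))
    return vij, sij, aij, bij

dummy = np.arange(64, dtype=float)   # stand-in for one optimizer result
vij, sij, aij, bij = split_params(dummy, nstates=4)
print(vij.shape, sij[0, 0], aij[0, 0], bij[3, 3])  # (4, 4) 16.0 32.0 63.0
```

Copying each block matters when a matrix is modified afterwards, since a reshaped slice of a NumPy array is a view and in-place edits would otherwise write back into the stored optimizer result.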

<br>

### Descriptive statistics

Calculate the descriptive statistics for each parameter.

**Note:** *NaN values in the t-value or p-value columns correspond to the diagonal elements of the matrices, which are zero in every sample and are not relevant for our calculations.*
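
The source of `load_functions.analyze_parameters` is not shown in this notebook. The sketch below is one plausible way to obtain the same columns, assuming a t-based 95% confidence interval and a one-sample t-test against zero. The function name `summarize` and the toy input are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def summarize(matrix_list):
    """Per-parameter mean, SD, t-based 95% CI and one-sample t-test against 0."""
    samples = np.stack([np.asarray(m, dtype=float).ravel() for m in matrix_list])
    n = samples.shape[0]
    mean = samples.mean(axis=0)
    sd = samples.std(axis=0, ddof=1)
    half_width = stats.t.ppf(0.975, df=n - 1) * sd / np.sqrt(n)
    with np.errstate(divide="ignore", invalid="ignore"):
        t_value = mean / (sd / np.sqrt(n))   # NaN when mean and SD are both 0
    p_value = 2.0 * stats.t.sf(np.abs(t_value), df=n - 1)
    return pd.DataFrame({
        "Parameter": [f"Param {k + 1}" for k in range(samples.shape[1])],
        "Mean": mean, "SD": sd,
        "CI Lower (95%)": mean - half_width, "CI Upper (95%)": mean + half_width,
        "t-value": t_value, "p-value": p_value,
    }).round(4)

# Toy input: three 2x2 "bootstrap" matrices with a fixed zero diagonal
toy = [np.array([[0.0, 0.21], [0.55, 0.0]]),
       np.array([[0.0, 0.23], [0.54, 0.0]]),
       np.array([[0.0, 0.22], [0.56, 0.0]])]
print(summarize(toy))
```

Under these assumptions, a parameter that is identical in every sample has zero standard deviation, so its t-statistic is 0/0 and shows up as NaN, which matches the pattern on the diagonal entries.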

#### Statistics for $v_{ij}$ parameters:

In [12]:
print("-" * 100)
print(load_functions.analyze_parameters(vij_list))
print("-" * 100)

----------------------------------------------------------------------------------------------------
   Parameter    Mean      SD  CI Lower (95%)  CI Upper (95%)   t-value  \
0    Param 1  0.0000  0.0000          0.0000          0.0000       NaN   
1    Param 2  0.2211  0.0120          0.2125          0.2297   55.3679   
2    Param 3  0.5484  0.0149          0.5378          0.5591  110.5232   
3    Param 4  0.1647  0.0873          0.1022          0.2272    5.6596   
4    Param 5  0.3733  0.0081          0.3676          0.3791  139.1131   
5    Param 6  0.0000  0.0000          0.0000          0.0000       NaN   
6    Param 7  0.2542  0.0548          0.2150          0.2934   13.9137   
7    Param 8  0.1192  0.0497          0.0837          0.1548    7.1932   
8    Param 9  0.3440  0.0575          0.3029          0.3851   17.9478   
9   Param 10  0.2081  0.0918          0.1424          0.2738    6.8011   
10  Param 11  0.0000  0.0000          0.0000          0.0000       NaN   
11  Param 1

#### Statistics for $s_{ij}$ parameters:

In [13]:
print("-" * 100)
print(load_functions.analyze_parameters(sij_list))
print("-" * 100)

----------------------------------------------------------------------------------------------------
   Parameter    Mean      SD  CI Lower (95%)  CI Upper (95%)   t-value  \
0    Param 1  0.0000  0.0000          0.0000          0.0000       NaN   
1    Param 2  2.1868  0.0487          2.1520          2.2217  134.7442   
2    Param 3  3.0057  0.1399          2.9056          3.1058   64.4322   
3    Param 4  0.2571  0.1084          0.1795          0.3346    7.1123   
4    Param 5  6.0030  0.1022          5.9299          6.0762  176.1479   
5    Param 6  0.0000  0.0000          0.0000          0.0000       NaN   
6    Param 7  0.4621  0.0512          0.4255          0.4987   27.0834   
7    Param 8  0.8695  0.0685          0.8205          0.9185   38.0800   
8    Param 9  6.3031  0.4515          5.9802          6.6261   41.8817   
9   Param 10  2.2893  0.3923          2.0087          2.5700   17.5059   
10  Param 11  0.0000  0.0000          0.0000          0.0000       NaN   
11  Param 1

#### Statistics for $a_{ij}$ parameters:

In [14]:
print("-" * 100)
print(load_functions.analyze_parameters(aij_list))
print("-" * 100)

----------------------------------------------------------------------------------------------------
   Parameter    Mean      SD  CI Lower (95%)  CI Upper (95%)       t-value  \
0    Param 1  0.0000  0.0000          0.0000          0.0000           NaN   
1    Param 2 -0.0284  0.0196         -0.0424         -0.0143 -4.335500e+00   
2    Param 3  0.1048  0.0080          0.0991          0.1105  3.941840e+01   
3    Param 4 -0.5483  0.2291         -0.7122         -0.3844 -7.180000e+00   
4    Param 5 -0.1213  0.0199         -0.1355         -0.1070 -1.829560e+01   
5    Param 6  0.0000  0.0000          0.0000          0.0000           NaN   
6    Param 7 -0.4604  0.0351         -0.4855         -0.4353 -3.934260e+01   
7    Param 8  0.2384  0.1067          0.1621          0.3147  6.704400e+00   
8    Param 9 -0.0929  0.0283         -0.1131         -0.0726 -9.850500e+00   
9   Param 10  0.0989  0.0661          0.0516          0.1462  4.489100e+00   
10  Param 11  0.0000  0.0000          0.0



#### Statistics for $b_{ij}$ parameters:

In [15]:
print("-" * 100)
print(load_functions.analyze_parameters(bij_list))
print("-" * 100)

----------------------------------------------------------------------------------------------------
   Parameter    Mean      SD  CI Lower (95%)  CI Upper (95%)       t-value  \
0    Param 1  0.0000  0.0000          0.0000          0.0000           NaN   
1    Param 2  3.8710  0.1379          3.7724          3.9696  8.424200e+01   
2    Param 3  2.6346  0.0381          2.6073          2.6618  2.073000e+02   
3    Param 4 -5.5056  0.1306         -5.5990         -5.4121 -1.264444e+02   
4    Param 5  5.3965  0.1540          5.2864          5.5067  1.051007e+02   
5    Param 6  0.0000  0.0000          0.0000          0.0000           NaN   
6    Param 7 -0.0306  0.0612         -0.0744          0.0132 -1.499900e+00   
7    Param 8 -4.3659  0.1268         -4.4567         -4.2752 -1.032815e+02   
8    Param 9  1.8439  0.1859          1.7109          1.9768  2.975700e+01   
9   Param 10 -3.7657  0.4098         -4.0589         -3.4725 -2.756430e+01   
10  Param 11  0.0000  0.0000          0.0

